Add CountVectorizer #315
Merged
Changes from 3 commits
defmodule Scholar.FeatureExtraction.CountVectorizer do
  @moduledoc """
  A `CountVectorizer` converts an already indexed collection of text documents into a matrix of token counts.
  """
  import Nx.Defn

  @doc """
  Generates a count matrix where each row corresponds to a document in the input corpus
  and each column corresponds to a unique token in the vocabulary of the corpus.

  The input must be a 2D tensor where:

    * Each row represents a document.
    * Each document has integer values representing tokens.

  The same number represents the same token in the vocabulary. Tokens should start from 0
  and be consecutive. Negative values are ignored, making them suitable for padding.

  ## Examples

      iex> t = Nx.tensor([[0, 1, 2], [1, 3, 4]])
      iex> Scholar.FeatureExtraction.CountVectorizer.fit_transform(t)
      Nx.tensor([
        [1, 1, 1, 0, 0],
        [0, 1, 0, 1, 1]
      ])

  With padding:

      iex> t = Nx.tensor([[0, 1, -1], [1, 3, 4]])
      iex> Scholar.FeatureExtraction.CountVectorizer.fit_transform(t)
      Nx.tensor([
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]
      ])
  """
  deftransform fit_transform(tensor) do
    # The vocabulary size is derived from the largest token index in the input.
    max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
    opts = [max_index: max_index]

    fit_transform_n(tensor, opts)
  end

  defnp fit_transform_n(tensor, opts) do
    check_for_rank(tensor)
    counts = Nx.broadcast(0, {Nx.axis_size(tensor, 0), opts[:max_index]})

    # Walk over every document (i) and token position (j), incrementing the
    # count for each non-negative token index.
    {_, counts} =
      while {{i = 0, tensor}, counts}, Nx.less(i, Nx.axis_size(tensor, 0)) do
        {_, counts} =
          while {{j = 0, i, tensor}, counts}, Nx.less(j, Nx.axis_size(tensor, 1)) do
            index = tensor[i][j]

            counts =
              if Nx.any(Nx.less(index, 0)),
                do: counts,
                else: Nx.indexed_add(counts, Nx.stack([i, index]), 1)

            {{j + 1, i, tensor}, counts}
          end

        {{i + 1, tensor}, counts}
      end

    counts
  end

  defnp check_for_rank(tensor) do
    if Nx.rank(tensor) != 2 do
      raise ArgumentError,
            """
            expected tensor to have shape {num_documents, num_tokens}, \
            got tensor with shape: #{inspect(Nx.shape(tensor))}\
            """
    end
  end
end
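
The module expects documents that are already tokenized and indexed, so the caller maps tokens to integer ids and pads ragged documents with a negative value before building the tensor. A minimal usage sketch, assuming a hand-rolled word-to-index map (the tokenization below is illustrative and not part of this PR):

    # Illustrative only: index and pad a tiny corpus by hand, then count tokens.
    docs = [["hello", "world"], ["hello", "elixir", "world"]]

    # Assign a consecutive integer id, starting at 0, to every distinct token.
    vocab = docs |> List.flatten() |> Enum.uniq() |> Enum.with_index() |> Map.new()

    # Pad shorter documents with -1 so every row has the same length;
    # negative values are ignored by fit_transform/1.
    max_len = docs |> Enum.map(&length/1) |> Enum.max()

    indexed =
      Enum.map(docs, fn doc ->
        ids = Enum.map(doc, &vocab[&1])
        ids ++ List.duplicate(-1, max_len - length(ids))
      end)

    Scholar.FeatureExtraction.CountVectorizer.fit_transform(Nx.tensor(indexed))
    #=> counts with one row per document and one column per vocabulary entry
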
defmodule Scholar.FeatureExtraction.CountVectorizerTest do
  use Scholar.Case, async: true
  alias Scholar.FeatureExtraction.CountVectorizer
  doctest CountVectorizer

  describe "fit_transform" do
    test "fit_transform test" do
      counts = CountVectorizer.fit_transform(Nx.tensor([[2, 3, 0], [1, 4, 4]]))

      expected_counts = Nx.tensor([[1, 0, 1, 1, 0], [0, 1, 0, 0, 2]])

      assert counts == expected_counts
    end

    test "fit_transform test - tensor with padding" do
      counts = CountVectorizer.fit_transform(Nx.tensor([[2, 3, 0], [1, 4, -1]]))

      expected_counts = Nx.tensor([[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]])

      assert counts == expected_counts
    end
  end

  describe "errors" do
    test "wrong input rank" do
      assert_raise ArgumentError,
                   "expected tensor to have shape {num_documents, num_tokens}, got tensor with shape: {3}",
                   fn ->
                     CountVectorizer.fit_transform(Nx.tensor([1, 2, 3]))
                   end
    end
  end
end
Under `jit` mode, you cannot convert a tensor to a number. You can try adding this test:

    fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform(&1, indexed_tensor: true)
    Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))

The correct solution is to compute the default value for `max_index` inside `defn`.

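A minimal sketch of such a test, assuming the single-argument `fit_transform/1` from this diff (the `indexed_tensor: true` option in the snippet above is not part of the code shown here, and the test name is illustrative):

    test "fit_transform works under jit" do
      fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform/1

      # With the current Nx.to_number/1 call in deftransform, tracing this under
      # Nx.Defn.jit raises, which is the failure described above.
      counts = Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))

      assert counts == Nx.tensor([[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]])
    end
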
The problem is that it needs to be a number in order to create a tensor with this shape. Inside `defn`, I'm afraid it is not possible to dynamically obtain a number. We can make this a required option, so we do not need to use `Nx.to_number()`.

I see. So we should not use `deftransform`, because this certainly cannot be invoked inside a `defn`, as we don't yet support dynamic shapes. For now, it should be a regular `def`, and maybe it should be called something else.

I believe this topic appeared in the past and we maybe discussed a possible contract for keeping code that doesn't work inside `defn`, but I don't recall it right now. :) Maybe @msluszniak does?

Usually, if the shape was based on computations, we either dropped the options that force it (like the percent of variation preserved in PCA) or developed some heuristics that roughly assess an upper bound for the problematic shape, and then ran the computations on a bigger tensor with some "padding".

Or make an option that is required via NimbleOptions.

Maybe an option is the way to go then, and we could even have a helper function like `CountVectorizer.size(n)` that they could use to compute it.

I made this a required option and added a helper function 😄
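
For reference, a rough sketch of what a required option plus helper could look like, based only on this thread. The module, option, and function names here (`CountVectorizerSketch`, `:max_vocabulary_size`, `max_vocabulary_size/1`) and the schema are assumptions for illustration, not necessarily what the final commits use:

    defmodule CountVectorizerSketch do
      import Nx.Defn

      # Hypothetical schema: the vocabulary size becomes a required, validated option.
      @opts_schema NimbleOptions.new!(
                     max_vocabulary_size: [
                       type: :pos_integer,
                       required: true,
                       doc: "Number of columns of the count matrix (vocabulary size)."
                     ]
                   )

      # Helper the caller can run ahead of time (outside defn/jit) to compute the option.
      def max_vocabulary_size(tensor) do
        tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
      end

      deftransform fit_transform(tensor, opts) do
        fit_transform_n(tensor, NimbleOptions.validate!(opts, @opts_schema))
      end

      defnp fit_transform_n(tensor, opts) do
        # The output shape is known at compile time, so this also works under Nx.Defn.jit.
        counts = Nx.broadcast(0, {Nx.axis_size(tensor, 0), opts[:max_vocabulary_size]})

        # Same nested counting loops as in the diff above.
        {_, counts} =
          while {{i = 0, tensor}, counts}, Nx.less(i, Nx.axis_size(tensor, 0)) do
            {_, counts} =
              while {{j = 0, i, tensor}, counts}, Nx.less(j, Nx.axis_size(tensor, 1)) do
                index = tensor[i][j]

                counts =
                  if Nx.any(Nx.less(index, 0)),
                    do: counts,
                    else: Nx.indexed_add(counts, Nx.stack([i, index]), 1)

                {{j + 1, i, tensor}, counts}
              end

            {{i + 1, tensor}, counts}
          end

        counts
      end
    end

A caller would then write something like `CountVectorizerSketch.fit_transform(t, max_vocabulary_size: CountVectorizerSketch.max_vocabulary_size(t))`, keeping the tensor-to-number conversion outside of any jit-compiled code.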