Add CountVectorizer #315

Merged
merged 5 commits on Jan 17, 2025
Changes from 3 commits
72 changes: 72 additions & 0 deletions lib/scholar/feature_extraction/count_vectorizer.ex
@@ -0,0 +1,72 @@
defmodule Scholar.FeatureExtraction.CountVectorizer do
  @moduledoc """
  A `CountVectorizer` converts an already indexed collection of text documents into a matrix of token counts.
  """
  import Nx.Defn

  @doc """
  Generates a count matrix where each row corresponds to a document in the input corpus and each column corresponds to a unique token in the vocabulary of the corpus.

  The input must be a 2D tensor where:

    * each row represents a document,
    * each document contains integer values representing tokens.

  The same number represents the same token in the vocabulary. Tokens should start from 0 and be consecutive. Negative values are ignored, making them suitable for padding.

  ## Examples

      iex> t = Nx.tensor([[0, 1, 2], [1, 3, 4]])
      iex> Scholar.FeatureExtraction.CountVectorizer.fit_transform(t)
      Nx.tensor([
        [1, 1, 1, 0, 0],
        [0, 1, 0, 1, 1]
      ])

  With padding:

      iex> t = Nx.tensor([[0, 1, -1], [1, 3, 4]])
      iex> Scholar.FeatureExtraction.CountVectorizer.fit_transform(t)
      Nx.tensor([
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]
      ])
  """
  deftransform fit_transform(tensor) do
    max_index = tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
Contributor

Under jit mode, you cannot convert a tensor to a number. You can try adding this test:

fun = &Scholar.FeatureExtraction.CountVectorizer.fit_transform(&1, indexed_tensor: true)
Nx.Defn.jit(fun).(Nx.tensor([[0, 1, 2], [1, 3, 4]]))

The correct solution is to compute the default value for max_index inside defn.

Contributor Author

The problem is that it needs to be a number in order to create a tensor with this shape. Inside defn, I'm afraid it is not possible to dynamically obtain a number. We can make this a required option, so we do not need to use Nx.to_number().

Contributor

I see. So we should not use deftransform, because this certainly cannot be invoked inside a defn, as we don't yet support dynamic shapes. For now, it should be a regular def and maybe it should be called something else.

I believe this topic came up in the past and we may have discussed a possible contract for keeping code that doesn't work inside defn, but I don't recall it right now. :) Maybe @msluszniak does?

Contributor

Usually, if the shape was based on computations, we either dropped the options that force it (like the percent of variance preserved in PCA) or developed heuristics that roughly assess an upper bound of the problematic shape, and then made the computations on a bigger tensor with some "padding".
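
A minimal sketch of that idea for this particular case (not code from this PR; the module name is made up, and the counting is written as a one-hot comparison rather than the while loops above to keep it short): since tokens start at 0 and are consecutive, the vocabulary can never be larger than Nx.size(tensor), which is known at compile time, so it can serve as the upper bound and the unused trailing columns simply stay zero.

defmodule CountVectorizerUpperBoundSketch do
  import Nx.Defn

  defn fit_transform(tensor) do
    # Nx.size/1 is a compile-time quantity, so the count matrix shape stays static.
    upper_bound = Nx.size(tensor)

    # Compare every token against every candidate id 0..upper_bound - 1 and sum
    # over the token axis; negative padding values match nothing and are ignored.
    tensor
    |> Nx.new_axis(-1)
    |> Nx.equal(Nx.iota({1, 1, upper_bound}))
    |> Nx.sum(axes: [1])
  end
end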

Contributor

Or make an option that is required via NimbleOptions

Contributor

Maybe an option is the way to go then and we could even have a helper function like CountVectorizer.size(n) that they could use to compute it.

Contributor Author

I made this a required option and added a helper function 😄 (a sketch of what that API could look like follows the module below)

    opts = [max_index: max_index]

    fit_transform_n(tensor, opts)
  end

  defnp fit_transform_n(tensor, opts) do
    check_for_rank(tensor)
    counts = Nx.broadcast(0, {Nx.axis_size(tensor, 0), opts[:max_index]})

    {_, counts} =
      while {{i = 0, tensor}, counts}, Nx.less(i, Nx.axis_size(tensor, 0)) do
        {_, counts} =
          while {{j = 0, i, tensor}, counts}, Nx.less(j, Nx.axis_size(tensor, 1)) do
            index = tensor[i][j]

            counts =
              if Nx.any(Nx.less(index, 0)),
                do: counts,
                else: Nx.indexed_add(counts, Nx.stack([i, index]), 1)

            {{j + 1, i, tensor}, counts}
          end

        {{i + 1, tensor}, counts}
      end

    counts
  end

  defnp check_for_rank(tensor) do
    if Nx.rank(tensor) != 2 do
      raise ArgumentError,
            """
            expected tensor to have shape {num_documents, num_tokens}, \
            got tensor with shape: #{inspect(Nx.shape(tensor))}\
            """
    end
  end
end
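
For context, a sketch of the shape of API the review thread above converges on: the vocabulary size becomes a required option validated with NimbleOptions, plus a helper computed eagerly outside defn. The option and helper names here are illustrative, not necessarily the ones that were merged, and the counting is again written as a one-hot comparison for brevity.

defmodule CountVectorizerOptionSketch do
  import Nx.Defn

  @opts_schema NimbleOptions.new!(
                 max_vocabulary: [
                   type: :pos_integer,
                   required: true,
                   doc: "Number of distinct tokens, i.e. the number of columns of the count matrix."
                 ]
               )

  # Helper meant to run eagerly (outside defn/jit) to compute the option value.
  def max_vocabulary(tensor) do
    tensor |> Nx.reduce_max() |> Nx.add(1) |> Nx.to_number()
  end

  deftransform fit_transform(tensor, opts) do
    fit_transform_n(tensor, NimbleOptions.validate!(opts, @opts_schema))
  end

  defnp fit_transform_n(tensor, opts) do
    # The count matrix shape is fully static here, so this also compiles under Nx.Defn.jit/1.
    tensor
    |> Nx.new_axis(-1)
    |> Nx.equal(Nx.iota({1, 1, opts[:max_vocabulary]}))
    |> Nx.sum(axes: [1])
  end
end

Usage would then look roughly like:

t = Nx.tensor([[0, 1, 2], [1, 3, 4]])
CountVectorizerOptionSketch.fit_transform(t, max_vocabulary: CountVectorizerOptionSketch.max_vocabulary(t))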
33 changes: 33 additions & 0 deletions test/scholar/feature_extraction/count_vectorizer.ex
@@ -0,0 +1,33 @@
defmodule Scholar.FeatureExtraction.CountVectorizerTest do
  use Scholar.Case, async: true
  alias Scholar.FeatureExtraction.CountVectorizer
  doctest CountVectorizer

  describe "fit_transform" do
    test "fit_transform test" do
      counts = CountVectorizer.fit_transform(Nx.tensor([[2, 3, 0], [1, 4, 4]]))

      expected_counts = Nx.tensor([[1, 0, 1, 1, 0], [0, 1, 0, 0, 2]])

      assert counts == expected_counts
    end

    test "fit_transform test - tensor with padding" do
      counts = CountVectorizer.fit_transform(Nx.tensor([[2, 3, 0], [1, 4, -1]]))

      expected_counts = Nx.tensor([[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]])

      assert counts == expected_counts
    end
  end

  describe "errors" do
    test "wrong input rank" do
      assert_raise ArgumentError,
                   "expected tensor to have shape {num_documents, num_tokens}, got tensor with shape: {3}",
                   fn ->
                     CountVectorizer.fit_transform(Nx.tensor([1, 2, 3]))
                   end
    end
  end
end
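
A jit-mode regression test along the lines the reviewer suggested could then be added to this module as well. A sketch, assuming the required-option API discussed above (the :max_vocabulary option name is hypothetical):

test "works under Nx.Defn.jit" do
  t = Nx.tensor([[0, 1, 2], [1, 3, 4]])
  fun = &CountVectorizer.fit_transform(&1, max_vocabulary: 5)

  # Compiling through jit must give the same result as the eager call.
  assert Nx.Defn.jit(fun).(t) == CountVectorizer.fit_transform(t, max_vocabulary: 5)
end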