Add `tokenize_batch` method #14

michael-p · 2023-11-28T10:40:32Z

The Python bindings provide this method as a convenience function. I think this is useful for users of the Rust library as well, so I implemented it there. This means we have to take on a dependency on ndarray, but I think this is justified given how widely used this crate is in the ML ecosystem.

src/lib.rs

djc · 2023-11-28T13:11:55Z

Cargo.toml

 regex = "1.10.2"

 [dev-dependencies]
 criterion = "0.5.1"

 [[bench]]
 name = "bench"
-required-features = ["openai-vocabulary-file"]
+required-features = ["ndarray", "openai-vocabulary-file"]


We should avoid this and instead guard only the `tokenize_batch_small` benchmark behind the `ndarray` feature.

I moved the benchmark that requires the (now optional) `ndarray` feature to a second benchmark harness.

Why is this better than just guarding the single benchmark?

I'm not sure guarding a single benchmark is even possible in Criterion without considerable effort. I found nothing in the documentation and tried several combinations when defining `criterion_group!(...)` and `criterion_main!(...)`, but there seems to be no easy way to feature-gate individual benchmarks (except perhaps by duplicating both `criterion_group!(...)` and `criterion_main!(...)`).

It's also arguably a bit more discoverable. I tend to just run `cargo bench` on projects I'm not familiar with (instead of `cargo bench --all-features`) and would therefore miss those benchmarks. With two benchmark `.rs` files, I might realize that there are more benchmarks I can run.
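
For illustration, a two-harness layout along these lines might look roughly like the following in `Cargo.toml`. This is a sketch under assumptions, not the manifest from this PR: the second target name `bench_ndarray`, the `ndarray` version, and the `harness = false` lines (which Criterion benchmarks normally need) are not shown in the diff above.

```toml
[dependencies]
# Assumption: ndarray is declared optional, which implicitly defines
# a Cargo feature of the same name. The version is a placeholder.
ndarray = { version = "0.15", optional = true }

[dev-dependencies]
criterion = "0.5.1"

# Existing harness: only needs the vocabulary file feature.
[[bench]]
name = "bench"
harness = false
required-features = ["openai-vocabulary-file"]

# Second harness (hypothetical name): holds the benchmarks that
# depend on ndarray, such as tokenize_batch_small.
[[bench]]
name = "bench_ndarray"
harness = false
required-features = ["ndarray", "openai-vocabulary-file"]
```

With a layout like this, Cargo skips a bench target whose required features are not enabled, so `cargo bench --all-features` would build both harnesses, while a run without the `ndarray` feature would leave out only the second one.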

djc · 2023-11-28T14:22:00Z

Why is this better than just guarding the single benchmark?

michael-p requested a review from djc November 28, 2023 10:47

djc reviewed Nov 28, 2023

michael-p force-pushed the tokenize-batch branch from 1435509 to b055c73 Compare November 28, 2023 13:09

djc reviewed Nov 28, 2023

michael-p force-pushed the tokenize-batch branch from b055c73 to 6fb71dd Compare November 28, 2023 14:05

Add tokenize_batch method

ca501b2

michael-p force-pushed the tokenize-batch branch from 6fb71dd to ca501b2 Compare November 28, 2023 14:07

djc approved these changes Nov 28, 2023

michael-p merged commit e85644f into main Nov 28, 2023
7 checks passed

michael-p deleted the tokenize-batch branch December 6, 2023 13:05
