Add tokenize_batch method #14

Merged 1 commit on Nov 28, 2023
8 changes: 7 additions & 1 deletion Cargo.toml
@@ -15,12 +15,18 @@ required-features = ["openai-vocabulary-file"]

[dependencies]
ahash = "0.8.6"
ndarray = { version = "0.15.6", optional = true }
regex = "1.10.2"

[dev-dependencies]
criterion = "0.5.1"

[[bench]]
name = "bench"
name = "encode"
required-features = ["openai-vocabulary-file"]
harness = false

[[bench]]
name = "tokenize_batch"
required-features = ["ndarray", "openai-vocabulary-file"]
Contributor

We should avoid this and only guard the tokenize_batch_small benchmark on this feature.

Contributor Author

I moved the benchmark which requires the (now optional) ndarray feature to a second benchmark harness.

Contributor

Why is this better than just guarding the single benchmark?

Contributor Author (@michael-p), Nov 28, 2023

I'm not sure guarding a single benchmark is even possible in Criterion without much effort. I didn't find anything in the documentation, and I tried several combinations when defining criterion_group!(...) and criterion_main!(...), but it looks like there is no easy way to feature-gate individual benchmarks (except probably by duplicating both criterion_group!(...) and criterion_main!(...)).

In addition, a separate harness is maybe a bit more discoverable. I tend to just run cargo bench on projects I'm not familiar with (instead of cargo bench --all-features) and would hence miss those benchmarks. By having two benchmark .rs files I might realize that there are actually more benchmarks I can run.
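For illustration only (this is not code from the PR): one shape that duplication could take while keeping a single harness, assuming a hypothetical `tokenize_batch_small` benchmark function gated on the `ndarray` feature.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical benchmark functions, shown only to sketch the structure.
fn short(c: &mut Criterion) {
    c.bench_function("short", |b| b.iter(|| 2 + 2));
}

#[cfg(feature = "ndarray")]
fn tokenize_batch_small(c: &mut Criterion) {
    c.bench_function("tokenize_batch_small", |b| b.iter(|| 2 + 2));
}

// criterion_group! takes a fixed list of targets, so the group definition
// has to be written twice, once per feature configuration.
#[cfg(feature = "ndarray")]
criterion_group!(benches, short, tokenize_batch_small);
#[cfg(not(feature = "ndarray"))]
criterion_group!(benches, short);

criterion_main!(benches);
```

With the two-harness split used in this PR, the gated benchmarks instead live in their own `[[bench]]` target with `required-features`, and Cargo simply skips that target unless those features are enabled (for example when running plain cargo bench without --all-features).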

harness = false
4 changes: 2 additions & 2 deletions benches/bench.rs → benches/encode.rs
@@ -53,5 +53,5 @@ fn long_sentence(c: &mut Criterion) {
});
}

criterion_group!(benches, short, realistic, long_word, long_sentence);
criterion_main!(benches);
criterion_group!(encode, short, realistic, long_word, long_sentence);
criterion_main!(encode);
18 changes: 18 additions & 0 deletions benches/tokenize_batch.rs
@@ -0,0 +1,18 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};

use instant_clip_tokenizer::Tokenizer;

fn small(c: &mut Criterion) {
    let tokenizer = Tokenizer::new();
    c.bench_function("small", |b| {
        b.iter(|| {
            tokenizer.tokenize_batch(
                black_box(["Hi", "How are you?", "I'm fine, thanks!"]),
                black_box(6),
            )
        })
    });
}

criterion_group!(tokenize_batch, small);
criterion_main!(tokenize_batch);
104 changes: 94 additions & 10 deletions src/lib.rs
@@ -41,15 +41,21 @@
//!
//! # Crate features
//!
//! This crate provides one feature named `openai-vocabulary-file`. This feature
//! bundles the default vocabulary file used for OpenAI's CLIP model together
//! with this crate and allows users to construct a new tokenizer simply by
//! calling [`Tokenizer::new`].
//! This crate provides two features:
//!
//! This feature is enabled by default. To disable it use `default-features =
//! false` when specifying the dependency on this crate in your `Cargo.toml`.
//! You will need to supply your own vocabulary file then and construct the
//! tokenizer using [`Tokenizer::with_vocabulary`].
//! * **ndarray** - Enables the [`ndarray`](https://docs.rs/ndarray) dependency
//! and the `Tokenizer::tokenize_batch` method that can be used to tokenize
//! several input strings at once, returning a matrix suitable for directly
//! passing to the CLIP neural network.
//! * **openai-vocabulary-file** - This feature bundles the default vocabulary
//! file used for OpenAI's CLIP model together with this crate and allows
//! users to construct a new tokenizer simply by calling [`Tokenizer::new`].
//! When disabled, you will need to supply your own vocabulary file and
//! construct the tokenizer using [`Tokenizer::with_vocabulary`].
//!
//! The **openai-vocabulary-file** feature is enabled by default. To disable it
//! use `default-features = false` when specifying the dependency on this crate
//! in your `Cargo.toml`.
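To make the two features above concrete for readers of this diff, here is an illustrative downstream `Cargo.toml` entry (not part of this change; the package name `instant-clip-tokenizer` is assumed from the crate's import path and the version is a placeholder):

```toml
[dependencies]
# Enable `tokenize_batch` on top of the default bundled vocabulary file.
instant-clip-tokenizer = { version = "*", features = ["ndarray"] }

# Alternatively, drop the bundled vocabulary and construct the tokenizer with
# `Tokenizer::with_vocabulary` instead of `Tokenizer::new`:
# instant-clip-tokenizer = { version = "*", default-features = false, features = ["ndarray"] }
```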

use std::io::{self, BufRead};

@@ -189,15 +195,80 @@ impl Tokenizer {
})
}

/// Tokenize a batch of multiple input strings.
///
/// Each given input string is encoded using the [`encode`] method and the
/// numeric representation is written to a row of the resulting
/// two-dimensional matrix of shape `(texts.len(), context_length)`, with the
/// special `<start_of_text>` token prepended and `<end_of_text>` appended to
/// each text.
///
/// The individual input strings are lowercased before being tokenized, but
/// otherwise no pre-processing is performed.
///
/// `context_length` is the maximum number of tokens per text and should be
/// `77` for all current CLIP models. If tokenization results in fewer than
/// `context_length` tokens, the resulting row will be padded with trailing
/// zeros. If tokenizing an input text results in too many tokens, the token
/// sequence will be truncated to fit within the resulting row of length
/// `context_length`, always including the `<start_of_text>` and
/// `<end_of_text>` marker tokens.
///
/// The resulting matrix can be passed directly to the CLIP neural network.
///
/// [`encode`]: Tokenizer::encode
///
/// # Panics
///
/// Panics if `context_length < 3`.
///
/// # Examples
///
/// ```
/// # use ndarray::array;
/// # use instant_clip_tokenizer::{Token, Tokenizer};
/// let tokenizer = Tokenizer::new();
/// let encoded = tokenizer.tokenize_batch(["Hi", "How are you?"], 5);
/// assert_eq!(encoded, array![
/// [49406, 1883, 49407, 0, 0],
/// [49406, 829, 631, 592, 49407],
/// ]);
/// ```
#[cfg(feature = "ndarray")]
pub fn tokenize_batch<'a, I>(&self, texts: I, context_length: usize) -> ndarray::Array2<u16>
where
    I: IntoIterator<Item = &'a str>,
    I::IntoIter: std::iter::ExactSizeIterator,
{
    if context_length < 3 {
        panic!("context length must be at least 3");
    }
    let texts = texts.into_iter();
    let mut result = ndarray::Array2::zeros((texts.len(), context_length));
    let mut tokens = Vec::with_capacity(context_length);
    for (text, mut result_row) in texts.zip(result.rows_mut()) {
        tokens.clear();
        tokens.push(self.start_of_text());
        self.encode(text, &mut tokens);
        tokens.truncate(context_length - 1);
        tokens.push(self.end_of_text());
        for (token, result_element) in tokens.iter().zip(&mut result_row) {
            *result_element = token.to_u16();
        }
    }
    result
}

/// Encode a `text` input as a sequence of tokens.
///
/// The resulting tokens are appended to `out`. `text` is lowercased before
/// being tokenized, but otherwise no pre-processing is performed.
///
/// The encoded token sequence does not include the special
/// `<start_of_text>` and `<end_of_text>` marker tokens. When these are
/// needed they have to be added manually by the caller using the
/// [`start_of_text`] and [`end_of_text`] methods, as in the example below.
/// needed you can either use the `tokenize_batch` method instead, or add
/// them manually by using the [`start_of_text`] and [`end_of_text`]
/// methods, as in the example below.
///
/// [`start_of_text`]: Tokenizer::start_of_text
/// [`end_of_text`]: Tokenizer::end_of_text
@@ -361,6 +432,19 @@ impl Token {
mod tests {
use super::*;

#[cfg(feature = "ndarray")]
#[test]
fn tokenize_batch() {
    let tokenizer = Tokenizer::new();
    let encoded = tokenizer.tokenize_batch(["Hi", "How are you?", "I'm fine, thanks!"], 6);
    let expected = ndarray::array![
        [49406, 1883, 49407, 0, 0, 0],
        [49406, 829, 631, 592, 286, 49407],
        [49406, 328, 880, 3797, 267, 49407],
    ];
    assert_eq!(encoded, expected);
}

#[test]
fn encode_special_chars() {
let tokens = encode("hello world!!!");