Cache tokenization results for individual words #5

Open
michael-p opened this issue Nov 24, 2023 · 0 comments
Comments

@michael-p
Contributor

The original Python library uses an internal dictionary (SimpleTokenizer#cache) in which it stores the tokenization result for each word it has seen so far.
This obviously speeds things up for frequently occurring words; the drawback, however, is that the cache is unbounded in size. If the tokenizer is used on user-provided input (e.g. for a search API endpoint), a malicious user could therefore feed it random character sequences, eventually causing an out-of-memory condition.
For this reason the Rust implementation does not yet use an internal cache.

We have a few options here:

  • Option 1: Simply don't provide a cache. Tokenizing individual words is already so fast that a cache might actually slow things down, especially if the cache implementation is more complex than an unbounded AHashMap (e.g. something bounded like an LRU cache).
  • Option 2: Let the user decide. We could define a trait Cache and change the Tokenizer::encode method to fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C). We should probably also provide some simple implementations, like NoCache (does nothing) and UnboundedCache (just uses an AHashMap, equivalent to the Python version); see the sketch after this list. If users want something fancier, they can implement the trait themselves.
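
A rough sketch of what Option 2 could look like. The Cache trait, its method signatures, and the Token/BPE stand-ins below are assumptions for illustration, not this crate's actual API:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: the real crate defines its own `Token` type
// and BPE logic; these exist only to make the sketch self-contained.
type Token = u32;

fn bpe_tokenize(word: &str) -> Vec<Token> {
    word.bytes().map(Token::from).collect()
}

/// Hypothetical `Cache` trait: maps a word to its tokenization result.
trait Cache {
    fn get(&self, word: &str) -> Option<&[Token]>;
    fn insert(&mut self, word: String, tokens: Vec<Token>);
}

/// `NoCache` never stores anything, so every word is re-tokenized.
struct NoCache;

impl Cache for NoCache {
    fn get(&self, _word: &str) -> Option<&[Token]> {
        None
    }
    fn insert(&mut self, _word: String, _tokens: Vec<Token>) {}
}

/// `UnboundedCache` grows without limit, like the Python version
/// (std `HashMap` here instead of `AHashMap` to avoid a dependency).
#[derive(Default)]
struct UnboundedCache {
    map: HashMap<String, Vec<Token>>,
}

impl Cache for UnboundedCache {
    fn get(&self, word: &str) -> Option<&[Token]> {
        self.map.get(word).map(|v| v.as_slice())
    }
    fn insert(&mut self, word: String, tokens: Vec<Token>) {
        self.map.insert(word, tokens);
    }
}

/// How a per-word encode step could consult the cache.
fn encode_word<C: Cache>(word: &str, out: &mut Vec<Token>, cache: &mut C) {
    if let Some(tokens) = cache.get(word) {
        out.extend_from_slice(tokens);
        return;
    }
    let tokens = bpe_tokenize(word);
    out.extend_from_slice(&tokens);
    cache.insert(word.to_owned(), tokens);
}

fn main() {
    let mut cache = UnboundedCache::default();
    let mut out = Vec::new();
    encode_word("hello", &mut out, &mut cache); // computed and cached
    encode_word("hello", &mut out, &mut cache); // served from the cache
    println!("{out:?}");
}
```

One nice property of passing the cache as a generic parameter: NoCache should compile down to nothing after monomorphization (its get/insert bodies are empty), so Option 1 effectively becomes a special case of Option 2 with no runtime cost.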