The original Python library uses an internal dictionary (`SimpleTokenizer#cache`) in which it stores the tokenization result for each word it has tokenized so far.
This obviously speeds things up for frequently occurring words; the drawback, however, is that the cache is unbounded in size. So if the tokenizer is used on user-provided input (e.g. for a search API endpoint), a malicious user could feed it random sequences of characters, eventually resulting in an out-of-memory condition.
For this reason, the Rust implementation does not yet use an internal cache.
We have a few options here:
Option 1: Simply don't provide a cache. Tokenization of individual words is already so fast that a cache might actually slow things down, especially if the cache implementation is more complex than an unbounded `AHashMap` (e.g. something bounded like an LRU cache).
Option 2: Let the user decide. We could define a `trait Cache` and change the `Tokenizer::encode` method to `fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C)`. We should probably also provide some simple implementations such as `NoCache` (does nothing) and `UnboundedCache` (just wraps an `AHashMap`, equivalent to the Python version). If a user wants anything fancier than that, they can implement it themselves. A rough sketch of this design follows below.
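For concreteness, here is a minimal sketch of what Option 2 could look like. This is not the crate's actual API: the `Token` type, the whitespace word splitting, and the byte-level placeholder standing in for the real per-word BPE step are all assumptions made to keep the example self-contained, and a plain `HashMap` is used instead of `AHashMap` to avoid a dependency.

```rust
use std::collections::HashMap;

// Stand-in token type; the real crate presumably defines its own.
type Token = u32;

/// Cache for per-word tokenization results, keyed by the word itself.
pub trait Cache {
    fn get(&self, word: &str) -> Option<&[Token]>;
    fn insert(&mut self, word: String, tokens: Vec<Token>);
}

/// Does nothing; every word is re-tokenized (the current Rust behaviour).
pub struct NoCache;

impl Cache for NoCache {
    fn get(&self, _word: &str) -> Option<&[Token]> {
        None
    }
    fn insert(&mut self, _word: String, _tokens: Vec<Token>) {}
}

/// Unbounded map, equivalent to the Python `SimpleTokenizer#cache`.
#[derive(Default)]
pub struct UnboundedCache {
    map: HashMap<String, Vec<Token>>,
}

impl Cache for UnboundedCache {
    fn get(&self, word: &str) -> Option<&[Token]> {
        self.map.get(word).map(Vec::as_slice)
    }
    fn insert(&mut self, word: String, tokens: Vec<Token>) {
        self.map.insert(word, tokens);
    }
}

pub struct Tokenizer;

impl Tokenizer {
    /// The hypothetical signature from Option 2: the caller supplies the cache.
    pub fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C) {
        for word in text.split_whitespace() {
            if let Some(tokens) = cache.get(word) {
                // Cache hit: reuse the previously computed tokens.
                out.extend_from_slice(tokens);
                continue;
            }
            // Placeholder for the real per-word BPE tokenization.
            let tokens: Vec<Token> = word.bytes().map(|b| b as Token).collect();
            out.extend_from_slice(&tokens);
            cache.insert(word.to_owned(), tokens);
        }
    }
}

fn main() {
    let tokenizer = Tokenizer;
    let mut out = Vec::new();

    // Caller opts out of caching entirely...
    tokenizer.encode("a photo of a cat", &mut out, &mut NoCache);

    // ...or opts into the unbounded, Python-equivalent cache.
    let mut cache = UnboundedCache::default();
    tokenizer.encode("a photo of a cat", &mut out, &mut cache);
    println!("{} tokens", out.len());
}
```

With this shape, callers handling untrusted input can pass `NoCache` (or a bounded cache of their own), while callers with a closed vocabulary can keep the Python-style unbounded behaviour; the library itself never holds unbounded state.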