The original Python library uses an internal dictionary (`SimpleTokenizer#cache`) in which it stores the tokenization result for each word it has tokenized so far.
This obviously speeds things up for frequently occurring words; the drawback, however, is that the cache is unbounded in size. So if the tokenizer is used on user-provided input (e.g. for a search API endpoint), a malicious user could feed it random sequences of characters, eventually resulting in an out-of-memory condition.
For this reason, the Rust implementation does not yet use an internal cache.
We have a few options here:
Option 1: Simply don't provide a cache. Tokenization of individual words is already so fast that a cache might actually slow things down, especially if the cache implementation is more complex than an unbounded `AHashMap` (e.g. something bounded like an LRU cache).
Option 2: Let the user decide. We could define a `trait Cache` and change the `Tokenizer::encode` method to `fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C)`. We should probably also provide some simple implementations such as `NoCache` (does nothing) and `UnboundedCache` (just wraps an `AHashMap`, equivalent to the Python version). If a user wants anything fancier than that, they can implement it themselves. A rough sketch of this design follows below.
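For concreteness, here is a minimal sketch of what Option 2 could look like. This is not the crate's actual API: the `Token` type, the whitespace word splitting, and the byte-level placeholder standing in for the real per-word BPE step are all assumptions made to keep the example self-contained, and a plain `HashMap` is used instead of `AHashMap` to avoid a dependency.

```rust
use std::collections::HashMap;

// Stand-in token type; the real crate presumably defines its own.
type Token = u32;

/// Cache for per-word tokenization results, keyed by the word itself.
pub trait Cache {
    fn get(&self, word: &str) -> Option<&[Token]>;
    fn insert(&mut self, word: String, tokens: Vec<Token>);
}

/// Does nothing; every word is re-tokenized (the current Rust behaviour).
pub struct NoCache;

impl Cache for NoCache {
    fn get(&self, _word: &str) -> Option<&[Token]> {
        None
    }
    fn insert(&mut self, _word: String, _tokens: Vec<Token>) {}
}

/// Unbounded map, equivalent to the Python `SimpleTokenizer#cache`.
#[derive(Default)]
pub struct UnboundedCache {
    map: HashMap<String, Vec<Token>>,
}

impl Cache for UnboundedCache {
    fn get(&self, word: &str) -> Option<&[Token]> {
        self.map.get(word).map(Vec::as_slice)
    }
    fn insert(&mut self, word: String, tokens: Vec<Token>) {
        self.map.insert(word, tokens);
    }
}

pub struct Tokenizer;

impl Tokenizer {
    /// The hypothetical signature from Option 2: the caller supplies the cache.
    pub fn encode<C: Cache>(&self, text: &str, out: &mut Vec<Token>, cache: &mut C) {
        for word in text.split_whitespace() {
            if let Some(tokens) = cache.get(word) {
                // Cache hit: reuse the previously computed tokens.
                out.extend_from_slice(tokens);
                continue;
            }
            // Placeholder for the real per-word BPE tokenization.
            let tokens: Vec<Token> = word.bytes().map(|b| b as Token).collect();
            out.extend_from_slice(&tokens);
            cache.insert(word.to_owned(), tokens);
        }
    }
}

fn main() {
    let tokenizer = Tokenizer;
    let mut out = Vec::new();

    // Caller opts out of caching entirely...
    tokenizer.encode("a photo of a cat", &mut out, &mut NoCache);

    // ...or opts into the unbounded, Python-equivalent cache.
    let mut cache = UnboundedCache::default();
    tokenizer.encode("a photo of a cat", &mut out, &mut cache);
    println!("{} tokens", out.len());
}
```

With this shape, callers handling untrusted input can pass `NoCache` (or a bounded cache of their own), while callers with a closed vocabulary can keep the Python-style unbounded behaviour; the library itself never holds unbounded state.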