Fill the README with real content
michael-p committed Nov 29, 2023
1 parent 425da10 commit a97bb2c
Showing 2 changed files with 597 additions and 2 deletions.
83 changes: 81 additions & 2 deletions README.md
@@ -1,3 +1,82 @@
![Cover logo](./cover.svg)

# Instant CLIP Tokenizer: a fast tokenizer for the CLIP neural network, written in Rust

[![Documentation](https://docs.rs/instant-clip-tokenizer/badge.svg)](https://docs.rs/instant-clip-tokenizer/)
[![Crates.io](https://img.shields.io/crates/v/instant-clip-tokenizer.svg)](https://crates.io/crates/instant-clip-tokenizer)
[![PyPI](https://img.shields.io/pypi/v/instant-clip-tokenizer)](https://pypi.org/project/instant-clip-tokenizer/)
[![Build status](https://github.com/instant-labs/instant-clip-tokenizer/workflows/CI/badge.svg)](https://github.com/instant-labs/instant-clip-tokenizer/actions?query=workflow%3ACI)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE-MIT)

Instant CLIP Tokenizer is a fast pure-Rust text tokenizer for [OpenAI's CLIP model](https://github.com/openai/CLIP). It is intended to be a replacement for the original Python-based tokenizer included in the CLIP repository, aiming for 100% compatibility with the original implementation. It can also be used with [OpenCLIP](https://github.com/mlfoundations/open_clip) and other implementations using the same tokenizer.

In the microbenchmarks included in this repository, Instant CLIP Tokenizer is ~70x faster than the Python implementation (with preprocessing and caching disabled to ensure a fair comparison).
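
As a rough illustration of how such a comparison can be made, here is a minimal timing sketch. It is not the benchmark shipped in this repository: `time_encoder` and `speedup` are hypothetical helpers, and the two `encode` callables are placeholders for whichever tokenizers you want to compare.

```python
import time


def time_encoder(encode, texts, repeats=5):
    """Return the best wall-clock time in seconds for encoding all texts once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_encode, candidate_encode, texts, repeats=5):
    """Return how many times faster the candidate is than the baseline."""
    baseline = time_encoder(baseline_encode, texts, repeats)
    candidate = time_encoder(candidate_encode, texts, repeats)
    # Guard against a zero reading on very coarse clocks.
    return baseline / max(candidate, 1e-12)
```

Both callables should receive the same inputs and have any caching disabled, otherwise the ratio mostly measures cache hits.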

## Using the library

### Rust

```toml
[dependencies]
instant-clip-tokenizer = "0.1.0"
# To enable additional functionality that depends on the `ndarray` crate:
# instant-clip-tokenizer = { version = "0.1.0", features = ["ndarray"] }
```

### Python **(>= 3.9)**

```sh
pip install instant-clip-tokenizer
```

The library requires `numpy >= 1.16.0` to be installed in your Python environment (e.g., via `pip install numpy`).

### Examples

```rust
use instant_clip_tokenizer::{Token, Tokenizer};

let tokenizer = Tokenizer::new();

let mut tokens = Vec::new();
tokenizer.encode("A person riding a motorcycle", &mut tokens);
let tokens = tokens.into_iter().map(Token::to_u16).collect::<Vec<_>>();
println!("{:?}", tokens);

// -> [320, 2533, 6765, 320, 10297]
```

```python
import instant_clip_tokenizer

tokenizer = instant_clip_tokenizer.Tokenizer()

tokens = tokenizer.encode("A person riding a motorcycle")
print(tokens)

# -> [320, 2533, 6765, 320, 10297]

batch = tokenizer.tokenize_batch(["A person riding a motorcycle", "Hi there"], context_length=5)
print(batch)

# -> [[49406 320 2533 6765 49407]
# [49406 1883 997 49407 0]]
```
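
Judging from the output above, `tokenize_batch` appears to wrap each sequence in a start token (49406) and an end token (49407), truncate it to `context_length`, and pad the remainder with zeros. The pure-Python sketch below reproduces that layout from the token IDs shown in the example. `pad_to_context` is a hypothetical helper for illustration only, not part of the library's API, and the truncation rule is an assumption inferred from this single example.

```python
START_TOKEN = 49406  # start-of-text ID as it appears in the batch output above
END_TOKEN = 49407    # end-of-text ID as it appears in the batch output above
PAD_TOKEN = 0        # padding value as it appears in the batch output above


def pad_to_context(tokens, context_length):
    """Wrap, truncate, and zero-pad one token sequence to a fixed length."""
    if context_length < 2:
        raise ValueError("context_length must leave room for start and end tokens")
    # Keep at most context_length - 2 content tokens so both markers fit.
    body = list(tokens)[: context_length - 2]
    row = [START_TOKEN] + body + [END_TOKEN]
    # Fill the remaining positions with the padding value.
    return row + [PAD_TOKEN] * (context_length - len(row))


rows = [
    pad_to_context([320, 2533, 6765, 320, 10297], 5),  # "A person riding a motorcycle"
    pad_to_context([1883, 997], 5),                    # "Hi there"
]
print(rows)
```

With the token IDs from the example, this yields the same two rows as the batch output shown above.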

## Testing

Run the Rust tests with:

```sh
cargo test --all-features
```

You can also test the Python bindings with:

```sh
make test-python
```

## Acknowledgements

The vocabulary file and original Python tokenizer code included in this repository are Copyright (c) 2021 OpenAI ([MIT-License](https://github.com/openai/CLIP/blob/main/LICENSE)).