Update README.md
gbenson authored May 16, 2024
1 parent e4d4c12 commit 241a522
Showing 1 changed file with 17 additions and 2 deletions.
19 changes: 17 additions & 2 deletions README.md
@@ -9,12 +9,13 @@

# DOM tokenizers

-DOM-aware tokenizers for [🤗 Hugging Face](https://huggingface.co/)
+DOM-aware tokenizers for 🤗 [Hugging Face](https://huggingface.co/)
language models.

## Installation

### With PIP

```sh
pip install dom-tokenizers[train]
```
@@ -31,6 +32,20 @@ pip install -e .[dev,train]
```

## Train a tokenizer

### On the command line

Check everything's working using a small dataset of around 300 examples:

```sh
train-tokenizer gbenson/interesting-dom-snapshots
```

Train a tokenizer with a 10,000-token vocabulary using a dataset of
4,536 examples and upload it to the Hub:

```sh
-train-tokenizer gbenson/interesting-dom-snapshots -n 10000
+train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
```
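
The commands above could be chained into one script that runs the quick check first and only then the full training run and upload. The following is a minimal sketch, not part of the project: the `run` and `train_and_upload` helpers and the `DRY_RUN` variable are hypothetical names introduced here, and the `-n` and `-N` flags are copied verbatim from the commands above rather than taken from the tool's documentation. It also assumes `huggingface-cli login` has already been completed interactively.

```sh
#!/bin/sh
set -eu

# DRY_RUN=1 (the default in this sketch) prints each command instead of
# running it. DRY_RUN is a hypothetical variable, not part of either tool.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: quick check, full training run, then upload.
train_and_upload() {
    # Quick check on the small dataset first, so problems surface early.
    run train-tokenizer gbenson/interesting-dom-snapshots
    # Full run; the flags are copied verbatim from the README commands.
    run train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
    # Assumes `huggingface-cli login` was already done interactively.
    run huggingface-cli upload dom-tokenizer-10k
}

train_and_upload
```

Because of `set -e`, a failure in the quick check stops the script before the longer training run starts. Running with `DRY_RUN=0` executes the commands for real; leaving it unset only prints them.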
