egorsmkv/pdf-ner

Extract entities from a PDF file

Download a PDF file

wget -O "2501_01104.pdf" "https://arxiv.org/pdf/2501.01104"
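
The Run commands pass --pdf-files files.txt, but no step in this README creates that file. A minimal sketch, assuming files.txt is a plain-text list with one PDF path per line (the format is an assumption, not confirmed here; check the source if the tool rejects it):

```shell
# Assumption: files.txt lists the PDFs to process, one path per line.
# The path matches the file name used in the wget command above.
printf '%s\n' "2501_01104.pdf" > files.txt

# Show what was written.
cat files.txt
```

Additional PDFs would presumably go on further lines of the same file.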

Download the model

wget "https://huggingface.co/onnx-community/gliner_small-v2.1/resolve/main/tokenizer.json"
wget "https://huggingface.co/onnx-community/gliner_small-v2.1/resolve/main/onnx/model.onnx"
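
An interrupted download can leave an empty or missing file, which would only surface later as a load error. A small sanity check can catch that before running; check_nonempty is a hypothetical helper written for this sketch, not part of pdf-ner:

```shell
# Hypothetical helper (not part of pdf-ner): succeeds only if every
# given file exists and is non-empty; reports each offender on stderr.
check_nonempty() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Check the two files fetched above.
check_nonempty tokenizer.json model.onnx || echo "re-run the wget commands above" >&2
```

This only checks that the files are present and non-empty; it does not validate their contents.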

Run

Dev

RUST_LOG=debug cargo run -- \
  --pdf-files files.txt \
  --tokenizer-file tokenizer.json \
  --model-file model.onnx \
  --entities technology,organization

Prod

cargo build --release

cp target/release/pdf-ner .

./pdf-ner \
  --pdf-files files.txt \
  --tokenizer-file tokenizer.json \
  --model-file model.onnx \
  --entities conference,name

About

Named-entity recognition (NER) on PDF files using Rust
