Use ColBERT as a search engine for the ACL Anthology and OpenReview conferences, or any .bib file. Check out the live demo.
# (optional): create a fresh environment
conda create -y -n aclsearch python=3.10
conda activate aclsearch
git clone https://github.com/davidheineman/acl-search
cd acl-search
pip install -r requirements.txt
python src/server.py # (this will download a pre-built index!)
Common fixes:
# getting pip errors? (install sentencepiece deps)
sudo apt-get update
sudo apt-get install -y pkg-config libsentencepiece-dev
# running on CUDA? (fix broken package path)
INSTALL_PATH=PATH_TO_YOUR_PYTHON_INSTALL # e.g., /root/ai2/miniconda3/envs/acl_search/lib/python3.10
cp ./src/extras/segmented_maxsim.cpp $INSTALL_PATH/site-packages/colbert/modeling/segmented_maxsim.cpp
cp ./src/extras/decompress_residuals.cpp $INSTALL_PATH/site-packages/colbert/search/decompress_residuals.cpp
cp ./src/extras/filter_pids.cpp $INSTALL_PATH/site-packages/colbert/search/filter_pids.cpp
cp ./src/extras/segmented_lookup.cpp $INSTALL_PATH/site-packages/colbert/search/segmented_lookup.cpp
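If you are unsure of your install path, you can ask Python where the colbert package lives. This small sketch prints the directory the commands above copy into:

# prints the installed colbert package directory, i.e.
# $INSTALL_PATH/site-packages/colbert in the commands above
import os
import colbert

print(os.path.dirname(colbert.__file__))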
(Optional) Parse & Index the Anthology
This step builds the index manually. You can skip it, since the parsed and indexed anthology will be downloaded from huggingface.co/davidheineman/colbert-acl. You can also include your own papers by adding them to the anthology.bib file!
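As a sketch, a standard BibTeX entry appended to anthology.bib looks something like the following (the fields shown are illustrative; in particular, the abstract field is assumed to be what gets indexed for search):

@inproceedings{your-paper-2024,
    title = "Your Paper Title",
    author = "Lastname, Firstname",
    booktitle = "Proceedings of ...",
    year = "2024",
    url = "https://aclanthology.org/...",
    abstract = "A short abstract describing the paper.",
}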
# pull from openreview
echo -e "[email]\n[password]" > .openreview
python src/scrape/openrev.py
# pull from acl anthology
python src/scrape/acl.py
# create unified dataset
python src/parse.py
# index with ColBERT
# (note: indexing can fail silently if the C++ extensions are missing; see the CUDA fix above)
python src/index.py
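For reference, here are hedged sketches of what the two steps above do. The parse step is roughly bib-to-JSON; the field names and file paths here are illustrative, not the repo's actual schema:

# rough sketch of the parse step: read anthology.bib and dump a
# unified JSON list of records (illustrative field names)
import json
import bibtexparser  # pip install bibtexparser

with open("anthology.bib") as f:
    bib = bibtexparser.load(f)

records = [
    {"title": e.get("title", ""), "abstract": e.get("abstract", "")}
    for e in bib.entries
]

with open("dataset.json", "w") as f:
    json.dump(records, f, indent=2)

And ColBERTv2 indexing via the colbert-ai Python API looks roughly like this (the checkpoint, experiment, and index names are illustrative; src/index.py may configure things differently):

from colbert import Indexer
from colbert.infra import ColBERTConfig, Run, RunConfig

# illustrative collection: one passage (e.g., an abstract) per paper
abstracts = ["Paper abstract one ...", "Paper abstract two ..."]

with Run().context(RunConfig(nranks=1, experiment="acl")):
    config = ColBERTConfig(nbits=2)  # 2-bit residual compression
    indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
    indexer.index(name="acl.index", collection=abstracts, overwrite=True)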
Deploy Web Server
# Start an API endpoint
gunicorn -w 1 --threads 100 --worker-class gthread -b 0.0.0.0:8080 src.server:app
# Then visit:
# http://localhost:8080
# or use the API:
# http://localhost:8080/api/search?query=Information retrieval with BERT
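To hit the API programmatically, a minimal sketch (the response schema is whatever src/server.py returns, so the json() handling here is an assumption):

# query the running server via the endpoint documented above
import requests

resp = requests.get(
    "http://localhost:8080/api/search",
    params={"query": "Information retrieval with BERT"},
)
resp.raise_for_status()
print(resp.json())  # response schema depends on src/server.py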
Deploy as a Docker App
# Build and run locally
docker build . -t acl-search:main
docker run -p 8080:8080 acl-search:main
# Or pull the hosted container
docker pull ghcr.io/davidheineman/acl-search:main # on macOS, add: --platform linux/arm64
docker run -p 8080:8080 ghcr.io/davidheineman/acl-search:main
# Launch it as a web service on Fly.io!
brew install flyctl
fly launch
fly scale vm shared-cpu-2x # scale up cpu!
fly scale memory 4096 # scale up memory!
Update Index on HF
# Download a fresh set of papers, index them, and push to HF:
chmod +x src/scrape/beaker/index.sh
./src/scrape/beaker/index.sh
# Build and deploy container for auto-updating:
docker build -t acl-search -f src/scrape/beaker/Dockerfile .
docker run -it -e HF_TOKEN=$HF_TOKEN acl-search # (Optional) test it out!
# Run on beaker
beaker image delete davidh/acl-search
beaker image create --name acl-search acl-search
beaker experiment create src/scrape/beaker/beaker-conf.yml
Paper Table
# add OpenAI API key
echo -e "[OPENAI_API_KEY]" > .openai-api-key
# Run paper table as a local service
pm2 start paper-table.config.js
pm2 logs paper-table-backend --lines 10
pm2 startup # To have it run on startup
pm2 save
# To shut down the server + flush logs
pm2 stop paper-table-backend && pm2 flush paper-table-backend
# To restart
pm2 stop paper-table-backend && pm2 flush paper-table-backend && pm2 start paper-table.config.js
To see an example of search, visit: colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs
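Programmatic search with the ColBERT API looks roughly like the sketch below (the index and experiment names match the illustrative indexing sketch above, not necessarily the notebook's):

from colbert import Searcher
from colbert.infra import Run, RunConfig

with Run().context(RunConfig(nranks=1, experiment="acl")):
    searcher = Searcher(index="acl.index")
    pids, ranks, scores = searcher.search("information retrieval with BERT", k=5)
    for pid, rank, score in zip(pids, ranks, scores):
        print(f"[{rank}] ({score:.1f}) {searcher.collection[pid]}")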