ACL Search

Use ColBERT as a search engine for the ACL Anthology and OpenReview conferences, or any .bib file. Check out the live demo.

Quick Setup

# (optional): conda create -y -n aclsearch python=3.10
git clone https://github.com/davidheineman/acl-search
cd acl-search
pip install -r requirements.txt
python src/server.py # (this will download a pre-built index!)

Common fixes:

# getting pip errors? (install sentencepiece deps)
sudo apt-get update
sudo apt-get install -y pkg-config libsentencepiece-dev

# running on CUDA? (fix broken package path)
INSTALL_PATH=PATH_TO_YOUR_PYTHON_INSTALL # e.g., /root/ai2/miniconda3/envs/acl_search/lib/python3.10
cp ./src/extras/segmented_maxsim.cpp $INSTALL_PATH/site-packages/colbert/modeling/segmented_maxsim.cpp
cp ./src/extras/decompress_residuals.cpp $INSTALL_PATH/site-packages/colbert/search/decompress_residuals.cpp
cp ./src/extras/filter_pids.cpp $INSTALL_PATH/site-packages/colbert/search/filter_pids.cpp
cp ./src/extras/segmented_lookup.cpp $INSTALL_PATH/site-packages/colbert/search/segmented_lookup.cpp
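If you are unsure what to set `INSTALL_PATH` to, you can ask Python for its own `site-packages` directory (note this already includes the `site-packages` suffix, so drop it from the `cp` destinations accordingly):

```shell
# Print the site-packages directory of the active Python environment
python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])"
```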

More Features

(Optional) Parse & Index the Anthology

This step parses and indexes the anthology manually. It can be skipped, since a pre-parsed and pre-indexed anthology is downloaded automatically from huggingface.co/davidheineman/colbert-acl.

You can also include your own papers by adding them to the anthology.bib file!
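Entries in anthology.bib are standard BibTeX. As a quick sanity check on entries you add, a minimal entry and a rough title extractor might look like this (the entry fields are illustrative, and this regex is not a full BibTeX parser):

```python
import re

# A minimal BibTeX entry (fields are illustrative)
entry = """@inproceedings{doe2024example,
    title = {An Example Paper Title},
    author = {Doe, Jane},
    year = {2024},
}"""

def bib_titles(bib_text):
    """Crudely pull title fields from a BibTeX string (not a full parser)."""
    return re.findall(r"title\s*=\s*\{([^}]*)\}", bib_text)

print(bib_titles(entry))  # → ['An Example Paper Title']
```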

# pull from openreview
echo -e "[email]\n[password]" > .openreview
python src/scrape/openrev.py
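The `.openreview` file written above holds the email on the first line and the password on the second. A sketch of reading it back (the helper name is an assumption, not part of the repo):

```python
from pathlib import Path

def read_openreview_creds(path=".openreview"):
    """Parse the two-line credentials file: email on line 1, password on line 2."""
    email, password = Path(path).read_text().strip().splitlines()[:2]
    return email, password
```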

# pull from acl anthology
python src/scrape/acl.py

# create unified dataset
python src/parse.py

# index with ColBERT
# (note: indexing can fail silently if the CPP extensions are missing)
python src/index.py
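Since indexing can fail silently when the CPP extensions are missing, a quick pre-flight check might verify that the files copied in the "Common fixes" step exist under your installed colbert package (the helper and its expected paths are assumptions based on the `cp` commands above):

```python
from pathlib import Path

# The four extension sources patched in the "Common fixes" step
EXPECTED_EXTENSIONS = [
    "modeling/segmented_maxsim.cpp",
    "search/decompress_residuals.cpp",
    "search/filter_pids.cpp",
    "search/segmented_lookup.cpp",
]

def missing_extensions(colbert_root):
    """Return the .cpp extension sources absent under the colbert package dir."""
    root = Path(colbert_root)
    return [p for p in EXPECTED_EXTENSIONS if not (root / p).exists()]
```

To use it, point it at the installed package, e.g. `missing_extensions(Path(colbert.__file__).parent)`.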

Deploy Web Server

# Start an API endpoint
gunicorn -w 1 --threads 100 --worker-class gthread -b 0.0.0.0:8080 src.server:app

# Then visit:
# http://localhost:8080
# or use the API:
# http://localhost:8080/api/search?query=Information retrieval with BERT
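When calling the API programmatically, the query string should be percent-encoded (the raw URL above works in a browser, which encodes the spaces for you). A small sketch of building the request URL (the helper name is an assumption):

```python
from urllib.parse import urlencode

def search_url(query, host="http://localhost:8080"):
    """Build a percent-encoded URL for the /api/search endpoint."""
    return f"{host}/api/search?{urlencode({'query': query})}"

print(search_url("Information retrieval with BERT"))
# → http://localhost:8080/api/search?query=Information+retrieval+with+BERT
```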

Deploy as a Docker App

# Build and run locally
docker build . -t acl-search:main
docker run -p 8080:8080 acl-search:main

# Or pull the hosted container
docker pull ghcr.io/davidheineman/acl-search:main # on Apple Silicon macOS, add: --platform linux/arm64
docker run -p 8080:8080 ghcr.io/davidheineman/acl-search:main

# Launch it as a web service!
brew install flyctl
fly launch

fly scale vm shared-cpu-2x # scale up cpu!
fly scale memory 4096 # scale up memory!

Update Index on HF

# Download a fresh set of papers, index and push to hf:
chmod +x src/scrape/beaker/index.sh
./src/scrape/beaker/index.sh

# Build and deploy container for auto-updating:
docker build -t acl-search -f src/scrape/beaker/Dockerfile .
docker run -it -e HF_TOKEN=$HF_TOKEN acl-search # (Optional) test it out!

# Run on beaker
beaker image delete davidh/acl-search
beaker image create --name acl-search acl-search
beaker experiment create src/scrape/beaker/beaker-conf.yml

Paper Table

# add OpenAI API key
echo -e "[OPENAI_API_KEY]" > .openai-api-key

# Run paper table as a local service
pm2 start paper-table.config.js
pm2 logs paper-table-backend --lines 10

pm2 startup # To have it run on startup
pm2 save

# To shut down the server + flush logs
pm2 stop paper-table-backend && pm2 flush paper-table-backend

# To restart
pm2 stop paper-table-backend && pm2 flush paper-table-backend && pm2 start paper-table.config.js
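The pm2 commands above reference a paper-table.config.js ecosystem file. If you need to recreate it, a minimal sketch might look like this (the app name matches the `paper-table-backend` process used above; the script path and interpreter are assumptions):

```javascript
// Minimal pm2 ecosystem file (sketch; script path is an assumption)
module.exports = {
  apps: [
    {
      name: "paper-table-backend", // matches the pm2 commands above
      script: "src/server.py",     // assumed entry point
      interpreter: "python",
    },
  ],
};
```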

Example notebooks

To see an example of search, visit: colab.research.google.com/drive/1-b90_8YSAK17KQ6C7nqKRYbCWEXQ9FGs