Wikipedia Natural Questions & TriviaQA

1. Prepare data

wget https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-train.json.gz
gzip -d biencoder-nq-train.json.gz
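
Before converting, it can help to check that the download decompressed correctly. The sketch below assumes the file follows the published DPR training layout: a JSON list whose entries carry `question`, `answers`, `positive_ctxs`, `negative_ctxs`, and `hard_negative_ctxs`. The `summarize` helper is a hypothetical name, not part of this repo.

```python
import json
from pathlib import Path


def summarize(records):
    """Return simple counts for a list of DPR-style training records.

    Field names are assumed from the public DPR data release.
    """
    return {
        "questions": len(records),
        "positives": sum(len(r.get("positive_ctxs", [])) for r in records),
        "hard_negatives": sum(len(r.get("hard_negative_ctxs", [])) for r in records),
    }


if __name__ == "__main__":
    path = Path("biencoder-nq-train.json")
    if path.exists():
        # Note: json.load reads the whole file into memory, and this file is large.
        with path.open(encoding="utf-8") as f:
            records = json.load(f)
        print(summarize(records))
        print(records[0]["question"])
```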

2. Convert the training data format

python prepare_retrieve_data.py --input ./biencoder-nq-train.json --output ./nq-train-data
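
The output format written by `prepare_retrieve_data.py` is not described here. As a rough illustration only, the sketch below shows one plausible way a DPR-style record could be flattened into (query, positive, negative) rows for pairwise training. The `to_pairwise` function and the `query`/`pos`/`neg` keys are assumptions made for this example, not the script's actual schema.

```python
def to_pairwise(record, max_negatives=1):
    """Flatten one DPR-style record into pairwise training rows.

    Hypothetical helper: uses the first positive passage and up to
    `max_negatives` hard negatives. Records lacking either are skipped.
    """
    positives = record.get("positive_ctxs", [])
    negatives = record.get("hard_negative_ctxs", [])[:max_negatives]
    if not positives or not negatives:
        return []
    pos_text = positives[0]["text"]
    return [
        {"query": record["question"], "pos": pos_text, "neg": neg["text"]}
        for neg in negatives
    ]


example = {
    "question": "who wrote hamlet",
    "positive_ctxs": [{"title": "Hamlet", "text": "Hamlet is a tragedy by William Shakespeare."}],
    "hard_negative_ctxs": [{"title": "Macbeth", "text": "Macbeth is set in Scotland."}],
}
rows = to_pairwise(example)
print(rows)
```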

3. Embedding fine-tuning

sh embed_pairwise_train.sh

To run in the background with nohup:

nohup sh embed_pairwise_train.sh > output.log 2>&1 &

4. Retrieval

Download data

# queries
wget https://www.dropbox.com/s/x4abrhszjssq6gl/nq-test-queries.json
wget https://www.dropbox.com/s/b64e07jzlji8zhl/trivia-test-queries.json

# corpus
wget https://www.dropbox.com/s/8ocbt0qpykszgeu/wikipedia-corpus.tar.gz
tar -xvf wikipedia-corpus.tar.gz

Build corpus index

sh encode_corpus.sh

Build query index

sh encode_query.sh

Search

sh retrieve.sh
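
What `retrieve.sh` runs internally is not shown here. Conceptually, dense retrieval scores each corpus embedding against the query embedding and keeps the top k. The following is a minimal brute-force sketch in pure Python, assuming inner-product similarity. The `search` function is hypothetical, and a real run would normally use a vector index library rather than a Python loop.

```python
import heapq


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, corpus_vecs, k=3):
    """Return the k best (corpus_index, score) pairs, highest score first."""
    scored = ((dot(query_vec, vec), idx) for idx, vec in enumerate(corpus_vecs))
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


corpus = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
top = search([1.0, 0.0], corpus, k=2)
print(top)
```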

5. Evaluation
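
No evaluation command is given here. A common metric for Natural Questions and TriviaQA retrieval is top-k accuracy: a question counts as a hit if any of its top k retrieved passages contains one of the gold answers. The sketch below is a simplified version that uses case-insensitive substring matching. The `passages`/`answers` keys and both function names are assumptions for illustration, and published numbers typically rely on stricter, token-normalized matching.

```python
def hit_at_k(passages, answers, k):
    """True if any of the first k passages contains any answer string."""
    return any(
        answer.lower() in passage.lower()
        for passage in passages[:k]
        for answer in answers
    )


def top_k_accuracy(results, k):
    """Fraction of questions with at least one answer-bearing passage in the top k."""
    if not results:
        return 0.0
    hits = sum(hit_at_k(r["passages"], r["answers"], k) for r in results)
    return hits / len(results)


results = [
    {"passages": ["Paris is the capital of France.", "Berlin is in Germany."],
     "answers": ["paris"]},
    {"passages": ["Unrelated text.", "The answer is 42."],
     "answers": ["42"]},
]
print(top_k_accuracy(results, k=1), top_k_accuracy(results, k=2))
```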

Reference