In Entity Disambiguation (ED), we are given a text and mentions, and the task is to find the unique meaning (e.g. the Wikipedia entity) to which each mention refers.
We speak of Entity Linking (EL) if the input is raw text and a model has to both identify mentions and disambiguate them.
The training corpus is derived from the Kensho Derived Wikimedia Dataset (license CC BY-SA 3.0). We used the file "link_annotated_text.jsonl", which provides Wikipedia pages divided into sections. Each section consists of a name, a text, and Wikipedia hyperlinks specified by offset, length and the Wikipedia id of the referenced page.
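A minimal sketch of how such a page can be read. The exact field names used below (sections, text, link_offsets, link_lengths, target_page_ids) are assumptions based on the dataset description above; verify them against your copy of "link_annotated_text.jsonl".

```python
import json

# Sketch (not a repo script): each line of the jsonl file is one page whose
# sections carry parallel lists of link offsets, lengths and target page ids.
# The field names are assumptions -- check them against the actual file.
def iter_links(page):
    """Yield (surface_form, target_wikipedia_id) pairs of one page."""
    for section in page['sections']:
        text = section['text']
        for offset, length, target in zip(section['link_offsets'],
                                          section['link_lengths'],
                                          section['target_page_ids']):
            yield text[offset:offset + length], target

# Inline mini-example standing in for one line of the jsonl file:
line = json.dumps({'page_id': 1, 'sections': [{
    'name': 'Introduction', 'text': 'Tokyo is the capital of Japan.',
    'link_offsets': [0, 24], 'link_lengths': [5, 5],
    'target_page_ids': [30057, 15573]}]})
print(list(iter_links(json.loads(line))))
# [('Tokyo', 30057), ('Japan', 15573)]
```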
The test corpora are the test split of the AIDA CoNLL-YAGO dataset (AIDA-b), the Reddit EL corpus, the Tweeki EL corpus, the ShadowLink dataset and the WNED-WIKI/WNED-CWEB corpora processed by Le and Titov, 2018.
This repository is essentially a collection of Python scripts to obtain and process the data. First clone the repository and install the requirements. Note that you need at least Python 3.8 to handle the pickled objects. The intended use is as follows:
- The test data is ready to use in the test_data folder. Each split comes in jsonl and conll format.
The conll files are tab-separated: the first column contains a text token, the second column the Wikipedia id, and the third column the Wikipedia title. Annotations carry BIO tags, and tokens without an annotation are marked with 'O'. Moreover, individual documents are separated by '-DOCSTART-', and each document begins with a comment line (starting with '# ', i.e. a hash sign followed by a space) that serves as a unique identifier for the document. The form of this identifier depends on the respective dataset and usually carries no additional information, except for the two datasets cweb and wikipedia: there, the document identifier also contains the difficulty bracket (separated by a tab; for more information on the brackets see Guo and Barbosa, 2018). The following example shows the first lines of the aida-b_final.conll file.
-DOCSTART-
# 1163testb SOCCER
SOCCER O O
- O O
JAPAN B-993546 B-Japan national football team
GET O O
LUCKY O O
WIN O O
, O O
CHINA B-887850 B-China national football team
IN O O
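A sketch (not a script from this repository) of how the BIO-tagged conll files can be parsed: collect each 'B-'/'I-' span and report it together with its Wikipedia id and title, which yields a mention listing like the output below.

```python
# Hypothetical parser for the ZELDA conll format described above.
def read_conll_mentions(lines):
    """Yield (mention, wikipedia_title, wikipedia_id) triples."""
    tokens, title, wiki_id = [], None, None
    for line in lines:
        line = line.rstrip('\n')
        # skip document boundaries, identifier comments and empty lines
        if line.startswith('-DOCSTART-') or line.startswith('# ') or not line:
            continue
        token, id_col, title_col = line.split('\t')
        if id_col.startswith('B-'):    # a new mention starts
            if tokens:
                yield ' '.join(tokens), title, wiki_id
            tokens, wiki_id, title = [token], id_col[2:], title_col[2:]
        elif id_col.startswith('I-'):  # the current mention continues
            tokens.append(token)
        else:                          # 'O': outside any mention
            if tokens:
                yield ' '.join(tokens), title, wiki_id
            tokens = []
    if tokens:
        yield ' '.join(tokens), title, wiki_id

# Demo on the example lines above (tabs written out explicitly):
sample = ['-DOCSTART-', '# 1163testb SOCCER',
          'SOCCER\tO\tO', '-\tO\tO',
          'JAPAN\tB-993546\tB-Japan national football team',
          'GET\tO\tO', 'LUCKY\tO\tO', 'WIN\tO\tO', ',\tO\tO',
          'CHINA\tB-887850\tB-China national football team', 'IN\tO\tO']
for mention, title, wiki_id in read_conll_mentions(sample):
    print(f'Mention: {mention} --- Wikipedia title: {title} --- Wikipedia id: {wiki_id}')
```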
# Output
Mention: JAPAN --- Wikipedia title: Japan national football team --- Wikipedia id: 993546
Mention: CHINA --- Wikipedia title: China national football team --- Wikipedia id: 887850
Mention: AL-AIN --- Wikipedia title: Al Ain --- Wikipedia id: 212131
Mention: United Arab Emirates --- Wikipedia title: United Arab Emirates --- Wikipedia id: 69328
Mention: Japan --- Wikipedia title: Japan national football team --- Wikipedia id: 993546
Mention: Asian Cup --- Wikipedia title: 1996 AFC Asian Cup --- Wikipedia id: 1013464
Mention: Syria --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: China --- Wikipedia title: China national football team --- Wikipedia id: 887850
Mention: Uzbekistan --- Wikipedia title: Uzbekistan national football team --- Wikipedia id: 1032413
Mention: China --- Wikipedia title: China national football team --- Wikipedia id: 887850
Mention: Uzbek --- Wikipedia title: Uzbekistan national football team --- Wikipedia id: 1032413
Mention: Igor Shkvyrin --- Wikipedia title: Igor Shkvyrin --- Wikipedia id: 12394021
Mention: Chinese --- Wikipedia title: China --- Wikipedia id: 5405
Mention: Soviet --- Wikipedia title: Soviet Union --- Wikipedia id: 26779
Mention: Asian Cup --- Wikipedia title: AFC Asian Cup --- Wikipedia id: 250683
Mention: Asian Games --- Wikipedia title: 1994 Asian Games --- Wikipedia id: 3285394
Mention: Uzbekistan --- Wikipedia title: Uzbekistan national football team --- Wikipedia id: 1032413
Mention: Japan --- Wikipedia title: Japan national football team --- Wikipedia id: 993546
Mention: Syria --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Takuya Takagi --- Wikipedia title: Takuya Takagi --- Wikipedia id: 7612409
Mention: Hiroshige Yanagimoto --- Wikipedia title: Hiroshige Yanagimoto --- Wikipedia id: 8330373
Mention: Syrian --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Syria --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Hassan Abbas --- Wikipedia title: Hassan Abbas --- Wikipedia id: 21828137
Mention: Syria --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Japan --- Wikipedia title: Japan national football team --- Wikipedia id: 993546
Mention: Syrian --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Syrian --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Japan --- Wikipedia title: Japan national football team --- Wikipedia id: 993546
Mention: Shu Kamo --- Wikipedia title: Shu Kamo --- Wikipedia id: 9087957
Mention: Syrian --- Wikipedia title: Syria national football team --- Wikipedia id: 1131669
Mention: Japan --- Wikipedia title: Japan --- Wikipedia id: 15573
Mention: World Cup --- Wikipedia title: FIFA World Cup --- Wikipedia id: 11370
Mention: FIFA --- Wikipedia title: FIFA --- Wikipedia id: 11049
Mention: UAE --- Wikipedia title: United Arab Emirates national football team --- Wikipedia id: 1044396
Mention: Kuwait --- Wikipedia title: Kuwait national football team --- Wikipedia id: 1041857
Mention: South Korea --- Wikipedia title: South Korea national football team --- Wikipedia id: 1018627
Mention: Indonesia --- Wikipedia title: Indonesia national football team --- Wikipedia id: 1044538
Additionally, we provide the entity vocabulary of all test splits combined in test_data/ids_and_titles/wikiids_to_titles_test_splits.pickle.
import pickle
with open('test_data/ids_and_titles/wikiids_to_titles_test_splits.pickle', 'rb') as handle:
    ids_to_titles_test_sets = pickle.load(handle)
print(f'There are {len(ids_to_titles_test_sets)} entities in the test sets.')
wikipedia_id = list(ids_to_titles_test_sets.keys())[0]
print(f'Wikipedia id: {wikipedia_id} Wikipedia title: {ids_to_titles_test_sets[wikipedia_id]}')
# Output
There are 14206 entities in the test sets.
Wikipedia id: 993546 Wikipedia title: Japan national football team
- To create the train split, you need to download the Kensho Derived Wikimedia Dataset, more specifically the "link_annotated_text.jsonl" file. Moreover, for tokenization we use the 'en_core_web_sm' model from spaCy. Download it with the following command:
python -m spacy download en_core_web_sm
Then, to generate the data, you need to set two paths in the script 'repo/scripts/zelda.py'
...
# replace the path with the path to the file 'link_annotated_text.jsonl' on your system
PATH_TO_KENSHO_JSONL = ''
# replace with the path where you saved the repository on your system
PATH_TO_REPOSITORY = ''
...
You can also set two variables:
# If you want a conll version of ZELDA-train, set this to True
create_conll_version_of_zelda_train = True
# If you want to generate the entity descriptions, set this to True
create_entity_descriptions = True
Then, all you need to do is to execute 'zelda.py':
# go to the scripts folder and call
python zelda.py
Note that it may take a few hours to generate all objects. The generated data will be stored in 'repo/train_data' and contains the zelda-train split (in jsonl and conll format), the entity descriptions (in jsonl format), the candidate lists (as a pickled dictionary) and a dictionary containing all id-title pairs (of all train and test sets).
# the entity vocabulary can be handled just as the vocabulary of only the test sets
import pickle
with open('train_data/zelda_ids_to_titles.pickle', 'rb') as handle:
    zelda_ids_to_titles = pickle.load(handle)
print(f'There are {len(zelda_ids_to_titles)} entities in zelda.')
wikipedia_id = list(zelda_ids_to_titles.keys())[42]
print(f'Wikipedia id: {wikipedia_id} Wikipedia title: {zelda_ids_to_titles[wikipedia_id]}')
# Once created, zelda_mention_entities_counter maps each collected mention to a
# dictionary of entity:count pairs recording how often we saw the mention
# together with the respective entity.
with open('train_data/zelda_mention_entities_counter.pickle', 'rb') as handle:
    zelda_mention_entities_counter = pickle.load(handle)
mention = 'Ronaldo'
print(zelda_mention_entities_counter[mention])
# Output
There are 821559 entities in zelda.
Wikipedia id: 9663 Wikipedia title: Electronics
{'Cristiano Ronaldo': 3, 'Ronaldo (Brazilian footballer)': 2}
The script scripts/scripts_for_candidate_lists/demo_of_candidate_lists.py demonstrates how we used the candidate lists to achieve the numbers in our paper (add reference). Note that to use it you need to set the PATH_TO_REPOSITORY variable in the script. Executing it should output the following numbers.
| | AIDA-B | TWEEKI | REDDIT-P | REDDIT-C | CWEB | WIKI | S-TAIL | S-SHADOW | S-TOP |
|---|---|---|---|---|---|---|---|---|---|
| MFS | 0.634 | 0.723 | 0.832 | 0.809 | 0.611 | 0.651 | 0.991 | 0.149 | 0.41 |
| CL-Recall | 0.91 | 0.94 | 0.983 | 0.981 | 0.924 | 0.986 | 0.994 | 0.565 | 0.728 |
MFS ("most frequent sense") chooses, for each mention, the entity that we empirically counted most often for that mention (assuming the mention is contained in our lists).
CL-Recall (CL for "candidate list") indicates whether the gold entity is contained in the candidate list of the respective mention.
The numbers report accuracy, i.e. #mentions-linking-to-their-mfs/#mentions for MFS and #mentions-that-have-gold-entity-in-their-candidates/#mentions for CL-Recall.
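Both numbers can be computed in a few lines. The sketch below (not the repo's evaluation script) uses a toy mention counter in the same shape as zelda_mention_entities_counter; the gold pairs and counts are made up for illustration.

```python
# Compute MFS accuracy and CL-Recall for gold (mention, entity) pairs,
# given a mention -> {entity: count} dictionary.
def mfs_and_cl_recall(gold_pairs, mention_entities_counter):
    mfs_hits = recall_hits = 0
    for mention, gold_entity in gold_pairs:
        candidates = mention_entities_counter.get(mention, {})
        # MFS: pick the entity counted most often for this mention
        if candidates and max(candidates, key=candidates.get) == gold_entity:
            mfs_hits += 1
        # CL-Recall: the gold entity only needs to appear among the candidates
        if gold_entity in candidates:
            recall_hits += 1
    n = len(gold_pairs)
    return mfs_hits / n, recall_hits / n

# Toy example with made-up counts:
counter = {'Ronaldo': {'Cristiano Ronaldo': 3, 'Ronaldo (Brazilian footballer)': 2}}
gold = [('Ronaldo', 'Cristiano Ronaldo'),
        ('Ronaldo', 'Ronaldo (Brazilian footballer)')]
print(mfs_and_cl_recall(gold, counter))  # (0.5, 1.0)
```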
All other scripts in this repository (e.g. scripts_for_test_data, scripts_for_candidate_lists) are not needed to create the data; they are included for transparency, to show how we created ZELDA. The objects (id-title dictionaries, candidate lists, etc.) were created in October 2022. Executing these additional scripts at another time may yield different objects because Wikipedia continuously evolves.