Code for "Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition". For details, see the paper (arXiv:2210.10240).
You can create the environment as follows:
conda create --name GraphNER python=3.9.13
conda activate GraphNER
pip install -r requirements.txt
Alternatively, import the conda environment directly on Windows:
conda env create -f windows.yaml
or import the conda environment directly on Linux:
conda env create -f linux.yaml
Original sources of the datasets:
- GENIA: http://www.geniaproject.org/genia-corpus
- CoNLL03: https://data.deepai.org/conll2003.zip
- WeiboNER: https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/Weibo
You can download our processed datasets from here.
Data format:
{
  "tokens": ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation", "through", "CD28", "requires", "reactive", "oxygen", "production", "by", "5-lipoxygenase", "."],
  "entities": [
    {"start": 14, "end": 15, "type": "protein"},
    {"start": 4, "end": 6, "type": "protein"},
    {"start": 0, "end": 2, "type": "DNA"},
    {"start": 8, "end": 9, "type": "protein"}
  ],
  "relations": {},
  "org_id": "ge/train/0001",
  "pos": ["PROPN", "NOUN", "NOUN", "CCONJ", "PROPN", "PROPN", "NOUN", "ADP", "PROPN", "VERB", "ADJ", "NOUN", "NOUN", "ADP", "NUM", "."],
  "ltokens": [],
  "rtokens": []
}
The ltokens field contains the tokens from the previous sentence, and the rtokens field contains the tokens from the next sentence.
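Judging from the sample record, entity offsets index into tokens with an exclusive end: start 14 and end 15 cover the single token "5-lipoxygenase". The following is a minimal sketch of reading such a record and recovering the entity strings. The helper names entity_spans and with_context are hypothetical and not part of this repository, and the way with_context joins the neighbouring sentences is only an illustration, not necessarily how the model consumes them.

```python
import json

# Abridged copy of the sample record (only the fields used in this sketch).
record = json.loads("""
{
  "tokens": ["IL-2", "gene", "expression", "and", "NF-kappa", "B",
             "activation", "through", "CD28", "requires", "reactive",
             "oxygen", "production", "by", "5-lipoxygenase", "."],
  "entities": [
    {"start": 14, "end": 15, "type": "protein"},
    {"start": 4, "end": 6, "type": "protein"},
    {"start": 0, "end": 2, "type": "DNA"},
    {"start": 8, "end": 9, "type": "protein"}
  ],
  "ltokens": [],
  "rtokens": []
}
""")


def entity_spans(rec):
    """Return (text, type) pairs, assuming `end` is an exclusive token index."""
    tokens = rec["tokens"]
    return [
        (" ".join(tokens[e["start"]:e["end"]]), e["type"])
        for e in rec["entities"]
    ]


def with_context(rec):
    """Hypothetical helper: previous-sentence tokens + sentence + next-sentence tokens."""
    return rec["ltokens"] + rec["tokens"] + rec["rtokens"]


spans = entity_spans(record)
for text, label in spans:
    print(f"{label}: {text}")
# protein: 5-lipoxygenase
# protein: NF-kappa B
# DNA: IL-2 gene
# protein: CD28
```

Because ltokens and rtokens are empty in this sample, with_context(record) returns just the 16 sentence tokens.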
The word vectors used (Chinese word2vec, GloVe and Bio-word2vec) can be downloaded from here.
You can run the experiment on the GENIA dataset as follows:
python main.py --dataset_name=genia --evaluate=test --concat --pretrain_select=dmis-lab/biobert-base-cased-v1.2 --word2vec_select=bio --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50
You can run the experiment on the WeiboNER dataset as follows:
python main.py --dataset_name=weiboNER --evaluate=dev --evaluate=test --pretrain_select=bert-base-chinese --word2vec_select=chinese --batch_size=4 --epochs=5 --max_length=64
You can run the experiment on the CoNLL03 dataset as follows:
python main.py --dataset_name=conll2003 --evaluate=test --concat --pretrain_select=bert-base-cased --word2vec_select=glove --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50
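If you want to launch the three runs from a script, the commands above can be kept in one table. This is only a sketch: the RUNS dictionary and build_command are hypothetical helpers, not part of this repository, and the flags are copied verbatim from the commands above (including the repeated --evaluate flag in the WeiboNER command).

```python
import shlex

# Flags copied from the README commands; the layout is illustrative only.
RUNS = {
    "genia": [
        "--evaluate=test", "--concat",
        "--pretrain_select=dmis-lab/biobert-base-cased-v1.2",
        "--word2vec_select=bio", "--batch_size=4", "--epochs=5",
        "--max_length=128", "--pos_dim=50", "--char_dim=50",
    ],
    "weiboNER": [
        # Both --evaluate flags are kept exactly as in the README command.
        "--evaluate=dev", "--evaluate=test",
        "--pretrain_select=bert-base-chinese",
        "--word2vec_select=chinese", "--batch_size=4", "--epochs=5",
        "--max_length=64",
    ],
    "conll2003": [
        "--evaluate=test", "--concat",
        "--pretrain_select=bert-base-cased",
        "--word2vec_select=glove", "--batch_size=4", "--epochs=5",
        "--max_length=128", "--pos_dim=50", "--char_dim=50",
    ],
}


def build_command(dataset_name):
    """Return the argument list for one dataset (hypothetical helper)."""
    return ["python", "main.py", f"--dataset_name={dataset_name}", *RUNS[dataset_name]]


for name in RUNS:
    # Print each command as a single shell-ready line.
    print(shlex.join(build_command(name)))
```

To actually start a run from the repository root, the list can be passed to subprocess.run(build_command("genia"), check=True).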
If you have any questions about the code, the paper or the copyright, please email wenxr2119@mails.jlu.edu.cn.
We would appreciate it if you cite our paper as follows:
@article{wen2022graph,
title={Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition},
author={Xueru Wen and Changjiang Zhou and Haotian Tang and Luguang Liang and Yu Jiang and Hong Qi},
journal={arXiv preprint arXiv:2210.10240},
year={2022}
}