README

Code for "Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition". For details, please see the paper here.

Setup

Requirements

You can create the environment as follows:

conda create --name GraphNER python=3.9.13
conda activate GraphNER
pip install -r requirements.txt

Alternatively, import the conda environment directly on Windows:

conda env create -f windows.yaml

Or import the conda environment directly on Linux:

conda env create -f linux.yaml

Datasets

Original sources of the datasets:

GENIA: http://www.geniaproject.org/genia-corpus
CoNLL03: https://data.deepai.org/conll2003.zip
WeiboNER: https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/Weibo

You can download our processed datasets from here.

Data format:

{
  "tokens": [
    "IL-2",
    "gene",
    "expression",
    "and",
    "NF-kappa",
    "B",
    "activation",
    "through",
    "CD28",
    "requires",
    "reactive",
    "oxygen",
    "production",
    "by",
    "5-lipoxygenase",
    "."
  ],
  "entities": [
    {
      "start": 14,
      "end": 15,
      "type": "protein"
    },
    {
      "start": 4,
      "end": 6,
      "type": "protein"
    },
    {
      "start": 0,
      "end": 2,
      "type": "DNA"
    },
    {
      "start": 8,
      "end": 9,
      "type": "protein"
    }
  ],
  "relations": {},
  "org_id": "ge/train/0001",
  "pos": [
    "PROPN",
    "NOUN",
    "NOUN",
    "CCONJ",
    "PROPN",
    "PROPN",
    "NOUN",
    "ADP",
    "PROPN",
    "VERB",
    "ADJ",
    "NOUN",
    "NOUN",
    "ADP",
    "NUM",
    "."
  ],
  "ltokens": [],
  "rtokens": []
}

The ltokens field contains the tokens from the previous sentence, and the rtokens field contains the tokens from the next sentence.
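As a quick check of the format, the sketch below recovers the entity mentions from the sample record above. It assumes that "start" is inclusive and "end" is exclusive, which is inferred from the sample (for example, start 14 and end 15 select the single token "5-lipoxygenase") and is not stated explicitly in this README. The helper name entity_mentions is hypothetical and not part of the repository.

```python
import json

# Sample record copied from the data format example above
# (the "pos", "relations" and "org_id" fields are omitted for brevity).
sample_json = """
{
  "tokens": ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation",
             "through", "CD28", "requires", "reactive", "oxygen", "production",
             "by", "5-lipoxygenase", "."],
  "entities": [
    {"start": 14, "end": 15, "type": "protein"},
    {"start": 4, "end": 6, "type": "protein"},
    {"start": 0, "end": 2, "type": "DNA"},
    {"start": 8, "end": 9, "type": "protein"}
  ],
  "ltokens": [],
  "rtokens": []
}
"""


def entity_mentions(record):
    """Return (text, type) pairs sorted by start offset.

    Assumption: `end` is exclusive, as the sample record suggests.
    """
    tokens = record["tokens"]
    mentions = []
    for entity in sorted(record["entities"], key=lambda e: e["start"]):
        text = " ".join(tokens[entity["start"]:entity["end"]])
        mentions.append((text, entity["type"]))
    return mentions


record = json.loads(sample_json)
mentions = entity_mentions(record)
for text, label in mentions:
    print(f"{label}\t{text}")
```

Printing the mentions this way is a simple sanity check that the offsets in a processed file line up with the token list before training.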

Word vectors

The word vectors used (Chinese word2vec, GloVe, and Bio-word2vec) can be downloaded from here.

Run

You can run the experiment on the GENIA dataset as follows:

python main.py --dataset_name=genia --evaluate=test --concat --pretrain_select=dmis-lab/biobert-base-cased-v1.2 --word2vec_select=bio --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50

You can run the experiment on the WeiboNER dataset as follows:

python main.py --dataset_name=weiboNER --evaluate=dev --evaluate=test --pretrain_select=bert-base-chinese --word2vec_select=chinese --batch_size=4 --epochs=5 --max_length=64

You can run the experiment on the CoNLL03 dataset as follows:

python main.py --dataset_name=conll2003 --evaluate=test --concat --pretrain_select=bert-base-cased --word2vec_select=glove --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50

Reference

If you have any questions about the code, the paper, or the copyright, please email wenxr2119@mails.jlu.edu.cn. We would appreciate it if you cite our paper as follows:

@article{wen2022graph,
  title={Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition},
  author={Xueru Wen and Changjiang Zhou and Haotian Tang and Luguang Liang and Yu Jiang and Hong Qi},
  journal={arXiv preprint arXiv:2210.10240},
  year={2022}
}
