This project focuses on developing a Named Entity Recognition (NER) and Relation Extraction pipeline using the RadGraph dataset.
-
Clone the repository and navigate to the project directory:
git clone https://github.com/your-repo/radgraph-ner.git cd radgraph-ner
-
Install the dependencies:
pip install -r requirements.txt
-
Install the radgraph package in development mode:
python setup.py develop
Ensure raw JSON files are placed in the data/raw/
directory:
data/
├── raw/
│ ├── section_findings.json
│ └── section_impression.json
Then run the preprocessing script:
python scripts/preprocess.py
This script loads the raw JSON, creates a Hugging Face Dataset
object, and saves it to data/processed/
.
Using RadGraph-XL, a large-scale expert-annotated dataset for entity and relation extraction in radiology reports. The dataset covers multiple anatomy-modality pairs (chest CT, abdomen/pelvis CT, brain MR, and chest X-rays) and includes over 410k entities/relations labeled by board-certified radiologists.
- Download: Stanford Center for Artificial Intelligence in Medicine & Imaging (AIMI)
- Paper: RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports (ACL 2024)
This dataset was introduced by Delbrouck et al., 2024 and has been shown to surpass previous approaches by up to 52% in extracting clinical entities and relations from radiology reports.
- See
notebooks/eda.ipynb
for details on entity distribution, report length analysis, and example highlights.
Entity Type | Count |
---|---|
Observation::definitely present | 2,440,911 |
Anatomy::definitely present | 2,456,763 |
Observation::definitely absent | 396,044 |
Observation::uncertain | 265,709 |
Observation::measurement::definitely present | 6,393 |
Anatomy::measurement::definitely present | 14,307 |
Anatomy::uncertain | 1,018 |
Anatomy::definitely absent | 254 |
Observation::measurement::uncertain | 22 |
Observation::measurement::definitely absent | 1 |
- Handling Rare Classes:
- The rare class
Observation::measurement::definitely absent
(count: 1) was explicitly assigned to the test set to ensure inclusion in model evaluation.
- The rare class
- Total Tokens: 15,815,652
- Unique Tokens: 33,855
This step prepares train/val/test splits from the RadGraph dataset for model training. If stratification isn't feasible (e.g., due to rare entities), it falls back to a random split.
Command:
python scripts/train_test_split.py --input-file data/processed/radgraph.jsonl --output-dir data/splits --stratify
Output:
train.jsonl
: Training dataval.jsonl
: Validation datatest.jsonl
: Test data
A demonstration notebook notebooks/tokenization_demo.ipynb
showcases:
- Differences between BPE, and WordPiece tokenization, vs general and domain tokenizers, illustrating the potential improvement in capturing medical terms within specialized vocabularies.
- Examples of tokenization applied to medical text in RADGRAPH. For example, check out how ‘ACHALASIA’ is split across.
- Fine-tune pretrained models (e.g., Bio_ClinicalBERT, ClinicalBERT) on RadGraph for NER and relation extraction tasks.
- Measure how tokenizer splits affect real NER metrics on the RadGraph dataset.
- Consider exploring custom tokenization schemes for rare terms as a future experiment.
- Recommended Tools:
- Black: Automatically formats your Python code to PEP 8 style.
- Flake8 / Ruff: Linting for code issues (unused imports, undefined vars).
- isort: Sorts imports by sections (standard library, third-party, local).
- Pre-commit Hooks: You can set up a
.pre-commit-config.yaml
to auto-run these on each commit.
Initial Setup:
pip install black flake8 isort
Run the tools:
black .
isort .
flake8 .