RadGraph NER & Relation Extraction Project

This project focuses on developing a Named Entity Recognition (NER) and Relation Extraction pipeline using the RadGraph dataset.

Setup Instructions

Clone the repository and navigate to the project directory:

git clone https://github.com/your-repo/radgraph-ner.git
cd radgraph-ner

Install the dependencies:
```
pip install -r requirements.txt
```
Install the radgraph package in development mode:
```
python setup.py develop
```

Data Preprocessing

Ensure raw JSON files are placed in the data/raw/ directory:

data/
├── raw/
│   ├── section_findings.json
│   └── section_impression.json

Then run the preprocessing script:

python scripts/preprocess.py

This script loads the raw JSON, creates a Hugging Face Dataset object, and saves it to data/processed/.

Dataset: RadGraph-XL

Using RadGraph-XL, a large-scale expert-annotated dataset for entity and relation extraction in radiology reports. The dataset covers multiple anatomy-modality pairs (chest CT, abdomen/pelvis CT, brain MR, and chest X-rays) and includes over 410k entities/relations labeled by board-certified radiologists.

This dataset was introduced by Delbrouck et al., 2024 and has been shown to surpass previous approaches by up to 52% in extracting clinical entities and relations from radiology reports.

Exploratory Data Analysis (EDA)

See notebooks/eda.ipynb for details on entity distribution, report length analysis, and example highlights.

Key Dataset Insights

Entity Type Distribution

Entity Type	Count
Observation::definitely present	2,440,911
Anatomy::definitely present	2,456,763
Observation::definitely absent	396,044
Observation::uncertain	265,709
Observation::measurement::definitely present	6,393
Anatomy::measurement::definitely present	14,307
Anatomy::uncertain	1,018
Anatomy::definitely absent	254
Observation::measurement::uncertain	22
Observation::measurement::definitely absent	1

Handling Rare Classes:
- The rare class Observation::measurement::definitely absent (count: 1) was explicitly assigned to the test set to ensure inclusion in model evaluation.

Token Counts

Total Tokens: 15,815,652
Unique Tokens: 33,855

Train-Test Split Preparation

This step prepares train/val/test splits from the RadGraph dataset for model training. If stratification isn't feasible (e.g., due to rare entities), it falls back to a random split.

Command:

python scripts/train_test_split.py --input-file data/processed/radgraph.jsonl --output-dir data/splits --stratify

Output:

train.jsonl: Training data
val.jsonl: Validation data
test.jsonl: Test data

Tokenization Insights

A demonstration notebook notebooks/tokenization_demo.ipynb showcases:

Differences between BPE, and WordPiece tokenization, vs general and domain tokenizers, illustrating the potential improvement in capturing medical terms within specialized vocabularies.
Examples of tokenization applied to medical text in RADGRAPH. For example, check out how ‘ACHALASIA’ is split across.

Next Steps

Fine-tune pretrained models (e.g., Bio_ClinicalBERT, ClinicalBERT) on RadGraph for NER and relation extraction tasks.
Measure how tokenizer splits affect real NER metrics on the RadGraph dataset.
Consider exploring custom tokenization schemes for rare terms as a future experiment.

Style & Linting

Recommended Tools:
- Black: Automatically formats your Python code to PEP 8 style.
- Flake8 / Ruff: Linting for code issues (unused imports, undefined vars).
- isort: Sorts imports by sections (standard library, third-party, local).
- Pre-commit Hooks: You can set up a .pre-commit-config.yaml to auto-run these on each commit.

Initial Setup:

pip install black flake8 isort

Run the tools:

black .
isort .
flake8 .

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RadGraph NER & Relation Extraction Project

Setup Instructions

Data Preprocessing

Dataset: RadGraph-XL

Exploratory Data Analysis (EDA)

Key Dataset Insights

Entity Type Distribution

Token Counts

Train-Test Split Preparation

Tokenization Insights

Next Steps

Style & Linting

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
notebooks		notebooks
radgraph		radgraph
scripts		scripts
tests		tests
.gitignore		.gitignore
README.md		README.md
config.py		config.py
requirements.txt		requirements.txt
setup.py		setup.py

kulsoom-abdullah/radgraph-ner

Folders and files

Latest commit

History

Repository files navigation

RadGraph NER & Relation Extraction Project

Setup Instructions

Data Preprocessing

Dataset: RadGraph-XL

Exploratory Data Analysis (EDA)

Key Dataset Insights

Entity Type Distribution

Token Counts

Train-Test Split Preparation

Tokenization Insights

Next Steps

Style & Linting

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages