This repository contains the implementation of a custom Named Entity Recognition (NER) model that extracts disease-treatment pairs from tokenized sentences and their corresponding predicted labels. The project is part of an NLP-focused assignment on syntactic processing and demonstrates how meaningful relationships between entities can be identified in medical text.
The solution uses a structured approach to process sentences, identify diseases (`D`) and treatments (`T`), and handle noise and redundancies effectively. It incorporates techniques such as dependency parsing, entity validation, and deduplication to improve accuracy and ensure the extracted data is meaningful.
- Custom NER Model:
  - Processes tokenized sentences and labels (`O`, `D`, `T`) to identify multi-word diseases and their associated treatments.
  - Uses linguistic features and dependency parsing for enhanced entity extraction.
- Disease-Descriptor Combination:
  - Dynamically identifies descriptors (e.g., "hereditary," "acute") linked to diseases and combines them to form multi-word entities.
- Treatment Deduplication:
  - Ensures treatments are unique for each disease to avoid redundancy in the output.
- Entity Validation:
  - Filters out invalid or noisy entities using rules such as excluding terms with special characters, overly generic names, or procedural descriptions.
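
The validation and deduplication features can be illustrated with a minimal sketch. The helper names `is_valid_entity` and `dedupe_keep_order`, the `GENERIC_TERMS` stop list, and the exact character rule are assumptions made for illustration; the repository's own rules may differ.

```python
import re

# Hypothetical stop list of overly generic names (an assumption, not the repo's list).
GENERIC_TERMS = {"disease", "treatment", "therapy", "patients", "cases"}


def is_valid_entity(name):
    """Return True if `name` passes a few simple noise checks."""
    name = name.strip()
    # Very short strings are unlikely to be meaningful entities.
    if len(name) < 3:
        return False
    # Allow only letters, digits, spaces, and hyphens (no special characters).
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9 \-]*", name):
        return False
    # Reject names that are too generic to be useful on their own.
    return name.lower() not in GENERIC_TERMS


def dedupe_keep_order(items):
    """Drop case-insensitive duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Keeping first-seen order makes the output stable across runs, which helps when comparing dictionaries by eye.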
The project pipeline consists of the following key steps:
- Tokenized sentences and corresponding labels are processed to extract disease and treatment entities.
- Blank lines indicate sentence boundaries, ensuring proper alignment between sentences and labels.
- A rule-based custom NER approach is implemented, focusing on:
  - Descriptor Extraction: Identifies adjectives (`ADJ`) and compound nouns (`NOUN`) linked to diseases using spaCy's dependency parsing.
  - Entity Merging: Combines multi-word entities dynamically (e.g., "hereditary retinoblastoma").
  - Validation: Applies strict rules to validate entities, excluding noise and generic terms.
- Treatments are deduplicated and validated to ensure relevance.
- Disease names are refined to reduce noise and focus on meaningful terms.
- The output is a dictionary where:
  - Keys are diseases (with descriptors).
  - Values are lists of associated treatments.
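
The core of these steps (grouping labelled tokens into entities, then mapping diseases to treatments) can be sketched as follows. This is a simplified illustration, not the repository's implementation: `extract_spans` and `build_disease_treatment_dict` are hypothetical names, descriptor extraction with spaCy is left out, and pairing every disease with every treatment in the same sentence is an assumption.

```python
from collections import defaultdict


def extract_spans(tokens, labels, target):
    """Join consecutive tokens whose label equals `target` into phrases."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == target:
            current.append(token)
        elif current:
            # A different label closes the span collected so far.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def build_disease_treatment_dict(sentences, predictions):
    """Map each disease phrase to the treatments found in the same sentence."""
    mapping = defaultdict(list)
    for tokens, labels in zip(sentences, predictions):
        diseases = extract_spans(tokens, labels, "D")
        treatments = extract_spans(tokens, labels, "T")
        if not treatments:
            continue  # Skip sentences that mention no treatment.
        for disease in diseases:
            for treatment in treatments:
                # Keep each treatment only once per disease.
                if treatment not in mapping[disease]:
                    mapping[disease].append(treatment)
    return dict(mapping)
```

A real pipeline would additionally pass each phrase through validation before adding it to the dictionary.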
Below is an example of the Disease-Treatment Dictionary generated by the pipeline:
Disease | Treatments |
---|---|
diabetes gestational cases | control, good, glycemic |
hereditary retinoblastoma | radiotherapy |
myocardial infarction | aspirin, warfarin |
pulmonary primary hypertension | fenfluramine |
esophageal achalasia | injection, pneumatic, laparoscopic, dilation, myotomy, botulinum, toxin |
hepatitis B | vaccine |
- Clone the Repository:
  - Run `git clone https://github.com/mohiteamit/upGrad-syntactic-processing-assignment.git`, then `cd upGrad-syntactic-processing-assignment`.
- Prepare Data:
  - Place tokenized sentences and label files in the `data/` directory.
  - Ensure sentences and labels are aligned line by line.
- Run the Script:
  - Use the provided function `extract_diseases_and_treatments(sentences, predictions)` in your Python environment.
  - The function takes preprocessed sentences and labels as input and outputs the Disease-Treatment Dictionary.
- Evaluate Output:
  - Review the dictionary to ensure meaningful disease-treatment mappings.
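
For the Prepare Data step, a small loader along the following lines may help. It assumes one token (or one label) per line with blank lines between sentences; the helper names and the file names in the usage comment are hypothetical, so adjust them to the files actually placed in `data/`.

```python
def read_blank_line_delimited(lines):
    """Group one-item-per-line input into sentences split on blank lines."""
    sentences, current = [], []
    for line in lines:
        item = line.strip()
        if item:
            current.append(item)
        elif current:
            # A blank line marks the end of the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def check_alignment(sentences, labels):
    """Raise ValueError if sentences and label sequences do not line up."""
    if len(sentences) != len(labels):
        raise ValueError(
            f"{len(sentences)} sentences but {len(labels)} label sequences"
        )
    for index, (tokens, tags) in enumerate(zip(sentences, labels)):
        if len(tokens) != len(tags):
            raise ValueError(
                f"sentence {index}: {len(tokens)} tokens but {len(tags)} labels"
            )


# Example usage (file names are assumptions):
# with open("data/sentences.txt") as f:
#     sentences = read_blank_line_delimited(f)
# with open("data/labels.txt") as f:
#     predictions = read_blank_line_delimited(f)
# check_alignment(sentences, predictions)
```

Checking alignment before extraction catches mismatched files early, since a single missing line would otherwise shift every later label.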
Amit Mohite @mohiteamit
For any questions or collaboration opportunities, feel free to reach out via GitHub.
This project is for educational purposes and is part of the upGrad NLP assignment. Contributions are welcome!