upGrad Syntactic Processing Assignment

Overview

This repository contains a custom Named Entity Recognition (NER) implementation that extracts disease-treatment pairs from tokenized sentences and their predicted labels. The project is part of an NLP assignment on syntactic processing and demonstrates how relationships between entities in medical text can be identified.

The solution uses a structured approach to process sentences, identify diseases (D) and treatments (T), and handle noise and redundancies effectively. It incorporates techniques such as dependency parsing, entity validation, and deduplication to improve accuracy and ensure the extracted data is meaningful.

Features

Custom NER Model:
- Processes tokenized sentences and labels (O, D, T) to identify multi-word diseases and their associated treatments.
- Uses linguistic features and dependency parsing for enhanced entity extraction.
Disease-Descriptor Combination:
- Dynamically identifies descriptors (e.g., "hereditary," "acute") linked to diseases and combines them to form multi-word entities.
Treatment Deduplication:
- Ensures treatments are unique for each disease to avoid redundancy in the output.
Entity Validation:
- Filters out invalid or noisy entities using rules such as excluding terms with special characters, overly generic names, or procedural descriptions.

Approach

The project pipeline consists of the following key steps:

1. Data Preprocessing

Tokenized sentences and their corresponding labels are read and used to extract disease and treatment entities.
Blank lines mark sentence boundaries, which keeps each sentence aligned with its label sequence.
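
As an illustration of the blank-line convention, here is a minimal sketch of how such input could be grouped into sentences. It assumes one token (or one label) per line, which the description above does not state explicitly. The names read_sentences and check_alignment are hypothetical and are not taken from the notebook.

```python
from typing import Iterable, List


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group one-item-per-line input into sentences, splitting on blank lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        item = line.strip()
        if item:
            current.append(item)
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


def check_alignment(sentences: List[List[str]], labels: List[List[str]]) -> None:
    """Raise ValueError if a sentence and its label sequence differ in length."""
    if len(sentences) != len(labels):
        raise ValueError("sentence and label counts differ")
    for index, (tokens, tags) in enumerate(zip(sentences, labels)):
        if len(tokens) != len(tags):
            raise ValueError(f"length mismatch in sentence {index}")
```

The same reader can be applied to both the sentence file and the label file, since both would follow the same layout under this assumption. Running check_alignment before extraction catches a shifted or truncated label file early.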

2. NER Model

A rule-based custom NER approach is implemented, focusing on:
- Descriptor Extraction: Identifies adjectives (POS tag ADJ) and nouns in compounds (POS tag NOUN) linked to diseases, using spaCy's part-of-speech tags and dependency parsing.
- Entity Merging: Combines multi-word entities dynamically (e.g., "hereditary retinoblastoma").
- Validation: Applies strict rules to validate entities, excluding noise and generic terms.
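
The merging and validation steps can be sketched in plain Python as follows. This is a simplified illustration, not the notebook's code: it merges tokens by label adjacency only and does not use spaCy's dependency parse, and the GENERIC_TERMS stop list and the character rule are assumptions chosen for the example.

```python
import re
from typing import List, Tuple

# Hypothetical stop list; the project's actual list of generic terms is not given.
GENERIC_TERMS = {"disease", "treatment", "therapy", "patients"}


def merge_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of consecutive tokens sharing a D or T label into one entity."""
    entities: List[Tuple[str, str]] = []
    span: List[str] = []
    span_label = "O"
    for token, label in zip(tokens, labels):
        if label == span_label and label != "O":
            # Same entity type as the previous token: extend the current span.
            span.append(token)
            continue
        if span:
            entities.append((" ".join(span), span_label))
        span = [token] if label != "O" else []
        span_label = label
    if span:
        entities.append((" ".join(span), span_label))
    return entities


def is_valid_entity(text: str) -> bool:
    """Reject entities with special characters or that are only a generic term."""
    # Assumed rule: letters, digits, spaces, and hyphens are allowed.
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9 \-]*", text):
        return False
    return text.lower() not in GENERIC_TERMS
```

For example, the tokens "hereditary retinoblastoma treated with radiotherapy" with labels D D O O T yield the entities ("hereditary retinoblastoma", "D") and ("radiotherapy", "T").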

3. Model performance

4. Post-Processing

Treatments are deduplicated and validated to ensure relevance.
Disease names are refined to reduce noise and focus on meaningful terms.
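
One possible way to deduplicate treatments while keeping their original order is sketched below. The function name dedupe_treatments is hypothetical, and comparing treatments case-insensitively is an assumption rather than something the notebook is known to do.

```python
from typing import Iterable, List


def dedupe_treatments(treatments: Iterable[str]) -> List[str]:
    """Drop repeated treatments, keeping the first occurrence of each."""
    seen = set()
    unique: List[str] = []
    for treatment in treatments:
        cleaned = treatment.strip()
        key = cleaned.lower()  # assumed: "Aspirin" and "aspirin" are the same treatment
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

A list is used instead of a set for the result so that the output order stays stable between runs.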

5. Final Output

A dictionary where:
- Keys are diseases (with descriptors).
- Values are lists of associated treatments.
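
The shape of this dictionary can be illustrated with a simplified sketch. It assumes that every treatment found in a sentence is linked to every disease in the same sentence, which may differ from the pairing logic in the notebook. The name build_disease_treatment_dict is hypothetical and is not the project's extract_diseases_and_treatments function.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, List


def build_disease_treatment_dict(
    sentences: List[List[str]], labels: List[List[str]]
) -> Dict[str, List[str]]:
    """Map each disease to the unique treatments that share a sentence with it."""
    mapping: Dict[str, List[str]] = defaultdict(list)
    for tokens, tags in zip(sentences, labels):
        diseases: List[str] = []
        treatments: List[str] = []
        # groupby collapses runs of identical labels into one multi-word span.
        for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
            text = " ".join(token for token, _ in group)
            if tag == "D":
                diseases.append(text)
            elif tag == "T":
                treatments.append(text)
        # Assumed pairing rule: sentence-level co-occurrence.
        for disease in diseases:
            for treatment in treatments:
                if treatment not in mapping[disease]:
                    mapping[disease].append(treatment)
    return dict(mapping)
```

Under this rule, a disease that never shares a sentence with a treatment does not appear as a key.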

Example Output

Below is an example of the Disease-Treatment Dictionary generated by the pipeline:

Disease	Treatments
diabetes gestational cases	control, good, glycemic
hereditary retinoblastoma	radiotherapy
myocardial infarction	aspirin, warfarin
pulmonary primary hypertension	fenfluramine
esophageal achalasia	injection, pneumatic, laparoscopic, dilation, myotomy, botulinum, toxin
hepatitis B	vaccine

How to Use

Clone the Repository:

git clone https://github.com/mohiteamit/upGrad-syntactic-processing-assignment.git
cd upGrad-syntactic-processing-assignment

Prepare Data:
- Place tokenized sentences and label files in the data/ directory.
- Ensure sentences and labels are aligned line by line.
Run the Script:
- Use the provided function extract_diseases_and_treatments(sentences, predictions) in your Python environment.
- The function takes preprocessed sentences and labels as input and outputs the Disease-Treatment Dictionary.
Evaluate Output:
- Review the dictionary to ensure meaningful disease-treatment mappings.

Contact

Amit Mohite @mohiteamit

For any questions or collaboration opportunities, feel free to reach out via GitHub.

License

This project is for educational purposes and is part of the upGrad NLP assignment. Contributions are welcome!
