Syntactic Processing of medical data | NLP | UpGrad | IIITB

mohiteamit/syntactic-processing

upGrad Syntactic Processing Assignment

Overview

This repository contains a custom Named Entity Recognition (NER) model that extracts disease-treatment pairs from tokenized sentences and their predicted labels. The project is part of an NLP assignment on syntactic processing and shows how relationships between entities in medical text can be identified.

The solution processes each sentence, identifies diseases (D) and treatments (T), and filters out noise and duplicates. It combines dependency parsing, entity validation, and deduplication so that the extracted pairs are meaningful.


Features

  1. Custom NER Model:

    • Processes tokenized sentences and labels (O, D, T) to identify multi-word diseases and their associated treatments.
    • Uses linguistic features and dependency parsing for enhanced entity extraction.
  2. Disease-Descriptor Combination:

    • Dynamically identifies descriptors (e.g., "hereditary," "acute") linked to diseases and combines them to form multi-word entities.
  3. Treatment Deduplication:

    • Ensures treatments are unique for each disease to avoid redundancy in the output.
  4. Entity Validation:

    • Filters out invalid or noisy entities using rules such as excluding terms with special characters, overly generic names, or procedural descriptions.
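
The validation rules above can be illustrated with a minimal sketch. The function name `is_valid_entity`, the `GENERIC_TERMS` set, and the length threshold are assumptions made for this example and are not taken from the repository; the rule for procedural descriptions is not covered here.

```python
import re

# Hypothetical list of overly generic names; the repository's real list may differ.
GENERIC_TERMS = {"disease", "treatment", "therapy", "patients"}

# Accept only letters, digits, spaces and hyphens (assumed definition of
# "no special characters").
_ALLOWED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9 \-]*$")


def is_valid_entity(text: str, min_length: int = 3) -> bool:
    """Return True if the entity passes the simple noise filters."""
    text = text.strip()
    if len(text) < min_length:
        return False  # too short to be a meaningful entity
    if not _ALLOWED.match(text):
        return False  # contains special characters
    if text.lower() in GENERIC_TERMS:
        return False  # overly generic name
    return True


candidates = ["hereditary retinoblastoma", "p < 0.05", "disease", "hepatitis B"]
kept = [c for c in candidates if is_valid_entity(c)]
print(kept)
```

With these assumed rules, "p < 0.05" is rejected for its special characters and "disease" for being generic, while the two real disease names are kept.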

Approach

The project pipeline consists of the following key steps:

1. Data Preprocessing

  • Tokenized sentences and corresponding labels are processed to extract disease and treatment entities.
  • Blank lines indicate sentence boundaries, ensuring proper alignment between sentences and labels.
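
As a rough illustration of this step, the sketch below groups input lines into sentences and checks that sentences and labels line up. It assumes a one-token-per-line layout (which the blank-line rule suggests but the README does not state), and the helper names `read_sentences` and `check_alignment` are hypothetical.

```python
from typing import Iterable, List


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group one-item-per-line input into sentences, splitting on blank lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        item = line.strip()
        if item:
            current.append(item)
        elif current:  # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:  # the input may not end with a blank line
        sentences.append(current)
    return sentences


def check_alignment(sentences: List[List[str]], labels: List[List[str]]) -> None:
    """Raise ValueError if sentences and labels do not match in shape."""
    if len(sentences) != len(labels):
        raise ValueError("different number of sentences and label sequences")
    for i, (tokens, tags) in enumerate(zip(sentences, labels)):
        if len(tokens) != len(tags):
            raise ValueError(f"sentence {i}: {len(tokens)} tokens vs {len(tags)} labels")


sentences = read_sentences(["aspirin", "for", "infarction", "", "rest", ""])
labels = read_sentences(["T", "O", "D", "", "O"])
check_alignment(sentences, labels)
```

The same reader is used for both files here because tokens and labels are assumed to share the same layout.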

2. NER Model

  • A rule-based custom NER approach is implemented, focusing on:
    • Descriptor Extraction: Identifies adjectives (ADJ) and compound nouns (NOUN) linked to diseases using spaCy's dependency parsing.
    • Entity Merging: Combines multi-word entities dynamically (e.g., "hereditary retinoblastoma").
    • Validation: Applies strict rules to validate entities, excluding noise and generic terms.
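
The entity-merging idea can be sketched without spaCy by joining consecutive tokens that share a D or T label. This is a simplified stand-in, not the repository's implementation: it ignores the dependency-based descriptor lookup, and the function name `merge_entities` is an assumption.

```python
from itertools import groupby
from typing import List, Tuple


def merge_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join runs of consecutive tokens that share a D or T label.

    Returns (label, text) pairs in sentence order; tokens labelled O are dropped.
    """
    entities: List[Tuple[str, str]] = []
    # groupby yields one group per run of identical consecutive labels
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label in ("D", "T"):
            entities.append((label, " ".join(token for token, _ in group)))
    return entities


tokens = ["hereditary", "retinoblastoma", "treated", "with", "radiotherapy"]
labels = ["D", "D", "O", "O", "T"]
print(merge_entities(tokens, labels))
# [('D', 'hereditary retinoblastoma'), ('T', 'radiotherapy')]
```

Because only adjacent tokens are merged, a descriptor separated from its disease by other words would not be attached; that case is where a dependency parse becomes useful.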

3. Model Performance

[Figure: model metrics]

4. Post-Processing

  • Treatments are deduplicated and validated to ensure relevance.
  • Disease names are refined to reduce noise and focus on meaningful terms.
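
One simple way to deduplicate treatments while keeping their first-seen order is shown below. The case-insensitive comparison and the name `dedupe_treatments` are assumptions for illustration; the repository may normalise treatments differently.

```python
from typing import Iterable, List


def dedupe_treatments(treatments: Iterable[str]) -> List[str]:
    """Drop repeated treatments, comparing case-insensitively and keeping order."""
    seen = set()
    unique: List[str] = []
    for treatment in treatments:
        cleaned = treatment.strip()
        key = cleaned.lower()
        if key and key not in seen:  # skip empty strings and repeats
            seen.add(key)
            unique.append(cleaned)
    return unique


print(dedupe_treatments(["aspirin", "Aspirin", "warfarin", "aspirin "]))
# ['aspirin', 'warfarin']
```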

5. Final Output

  • A dictionary where:
    • Keys are diseases (with descriptors).
    • Values are lists of associated treatments.
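
A dictionary of this shape could be assembled as in the sketch below. It assumes that every disease is paired with every treatment found in the same sentence, which is a simplification chosen for this example; the README does not specify the pairing rule, and `build_disease_treatment_map` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_disease_treatment_map(
    sentence_entities: List[List[Tuple[str, str]]]
) -> Dict[str, List[str]]:
    """Map each disease to the treatments that co-occur with it in a sentence.

    Each inner list holds (label, text) pairs for one sentence, with label
    being "D" for a disease or "T" for a treatment.
    """
    mapping: Dict[str, List[str]] = defaultdict(list)
    for entities in sentence_entities:
        diseases = [text for label, text in entities if label == "D"]
        treatments = [text for label, text in entities if label == "T"]
        if not diseases or not treatments:
            continue  # a pair needs at least one disease and one treatment
        for disease in diseases:
            for treatment in treatments:
                if treatment not in mapping[disease]:
                    mapping[disease].append(treatment)
    return dict(mapping)


example = [
    [("D", "myocardial infarction"), ("T", "aspirin")],
    [("D", "myocardial infarction"), ("T", "warfarin"), ("T", "aspirin")],
    [("D", "hepatitis B")],  # no treatment in this sentence, so it is skipped
]
print(build_disease_treatment_map(example))
# {'myocardial infarction': ['aspirin', 'warfarin']}
```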

Example Output

Below is an example of the Disease-Treatment Dictionary generated by the pipeline:

Disease                        | Treatments
diabetes gestational cases     | control, good, glycemic
hereditary retinoblastoma      | radiotherapy
myocardial infarction          | aspirin, warfarin
pulmonary primary hypertension | fenfluramine
esophageal achalasia           | injection, pneumatic, laparoscopic, dilation, myotomy, botulinum, toxin
hepatitis B                    | vaccine

How to Use

  1. Clone the Repository:

    git clone https://github.com/mohiteamit/upGrad-syntactic-processing-assignment.git
    cd upGrad-syntactic-processing-assignment
  2. Prepare Data:

    • Place tokenized sentences and label files in the data/ directory.
    • Ensure sentences and labels are aligned line by line.
  3. Run the Script:

    • Use the provided function extract_diseases_and_treatments(sentences, predictions) in your Python environment.
    • The function takes preprocessed sentences and labels as input and outputs the Disease-Treatment Dictionary.
  4. Evaluate Output:

    • Review the dictionary to ensure meaningful disease-treatment mappings.

Contact

Amit Mohite @mohiteamit

For any questions or collaboration opportunities, feel free to reach out via GitHub.


License

This project is for educational purposes and is part of the upGrad NLP assignment. Contributions are welcome!