NotNimbleMiner by Christy Mednick, Zachary Graeber, Malav Ramanan, Anirudh Venkatachalam and Manasvini Narayanan for ECS170: Artificial Intelligence with Gabriel Simmons.
NotNimbleMiner is a project that aims to extract and classify sensitive information from clinical notes using natural language processing techniques. This repository contains the code for an end-to-end pipeline that tokenizes clinical notes from XML files, creates word embeddings, explores the vocabulary, assigns labels, and trains a multi-label classification model.
Key Features:
- Tokenization of Clinical Notes: Efficiently processes XML files to extract and tokenize clinical notes.
- Word Embedding Model: Creates word embeddings using Word2Vec to capture semantic relationships within the clinical text.
- Vocabulary Exploration: Allows users to interactively explore and select relevant terms for classification.
- Multi-Label Classification: Trains a Logistic Regression model to automatically classify clinical notes based on selected terms.
- Evaluation Metrics: Provides classification reports for model performance assessment.
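As a rough illustration of the tokenization feature, the sketch below reads one note from an XML string and splits it into tokens. It is not the project's actual code: the `<note>`/`<TEXT>` layout, the `SAMPLE_XML` content, and the `tokenize_note` helper are all assumptions made for this example, and a simple regular expression stands in for the spaCy tokenizer so the snippet runs with the standard library alone.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical note layout: the real corpus schema is not described in this README.
SAMPLE_XML = """<note>
  <TEXT>Patient John Smith was admitted on 03/14/2019 with chest pain.</TEXT>
</note>
"""


def tokenize_note(xml_string, text_tag="TEXT"):
    """Return lowercase tokens from the first <text_tag> element of a note."""
    root = ET.fromstring(xml_string)
    element = root.find(f".//{text_tag}")
    if element is None or element.text is None:
        return []
    # Regex stand-in for spaCy: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", element.text.lower())


tokens = tokenize_note(SAMPLE_XML)
print(tokens)
```

Token lists of this shape are what a Word2Vec model typically takes as training input (one list of tokens per note), which is why tokenization comes first in the pipeline.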
Usage:
- Tokenize Clinical Notes: Input your XML files containing clinical notes to start the tokenization process.
- Explore Vocabulary: Interactively select terms for classification based on similarity scores.
- Train Multi-Label Classifier: Train a Logistic Regression model on the selected terms and tokenized data.
- Evaluate Model: Assess model performance using classification reports.
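The training and evaluation steps can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions, not the repository's implementation: the notes and the `NAME`/`DATE` label sets are invented toy data, bag-of-words counts from `CountVectorizer` stand in for the Word2Vec-based features, and wrapping `LogisticRegression` in `OneVsRestClassifier` is one common way to get multi-label behavior (the project may do it differently).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy notes and label sets, invented for illustration only.
notes = [
    "patient john smith seen on march 3",
    "follow up scheduled for april 12",
    "dr jane doe reviewed the chart",
    "no acute distress noted today",
    "mary jones admitted on june 1",
    "vitals stable and diet advanced",
]
labels = [
    {"NAME", "DATE"},
    {"DATE"},
    {"NAME"},
    set(),
    {"NAME", "DATE"},
    set(),
]

# Turn each label set into a row of 0/1 indicators, one column per label.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# Bag-of-words features: a stand-in for embedding-based features.
X = CountVectorizer().fit_transform(notes)

# One independent logistic regression per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# Evaluated on the training data only to keep the sketch short;
# a real run would hold out a separate test set.
pred = clf.predict(X)
report = classification_report(
    Y, pred, target_names=list(binarizer.classes_), zero_division=0
)
print(report)
```

The printed report lists precision, recall, and F1 per label, which makes it easy to see whether one category of sensitive information is harder to detect than another.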
Requirements:
- Python 3.x
- Libraries: spaCy, numpy, gensim, scikit-learn
- Standard library: xml.etree.ElementTree (ships with Python, no separate install needed)