This repository explores the challenges of building a generalized mental health classifier using anonymous, user-generated text from various online platforms. Mental health classification is inherently complex due to overlapping symptoms, ambiguous language, and the low signal-to-noise ratio of textual descriptions. By leveraging these unstructured data sources, our work sheds light on:
- The nuanced ways patients describe symptoms.
- Limitations of LLM-based approaches for mental health classification.
- Insights into the complexities of mental health diagnosis.

This project contributes to advancing the understanding of mental health dynamics while improving accessibility to mental health resources.
In the data folder, you will find cleaned-output, the data we used for all downstream tasks. You will also find a folder containing all of the train-val-test splits for our experiments, as well as a .ipynb notebook for creating unique data splits. The scripts used to create cleaned-output are in the preprocessing folder, which contains the scripts for both cleaning the data and creating the mini-csvs.
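If you prefer to script a custom split rather than use the notebook, here is a minimal sketch of the idea (the file paths, column name, and split ratios are assumptions for illustration, not the repository's actual layout):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input file and label column; adjust to the actual files in data/.
df = pd.read_csv("data/cleaned-output.csv")

# 80/10/10 train/val/test split, stratified by the disorder label.
train_df, temp_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df["label"], random_state=42)

train_df.to_csv("train.csv", index=False)
val_df.to_csv("val.csv", index=False)
test_df.to_csv("test.csv", index=False)
```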
In the stats folder, you will find a few scripts and a directory for calculating interesting statistics about the data. The pos directory contains the following scripts and files:
- tagger.py, which tags the mini-csvs using the spaCy tagger (see the sketch after this list).
- pos-stats.sh, which generates POS tags for the CSVs.
- pos_count.txt, which contains the POS counts generated by pos-stats.sh.
- type-token.sh, which generates type-token ratios for each mini-csv.
- anygram directory, which contains all scripts for n-gram processing of the individual disorder files.
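To give a sense of what these scripts compute, here is a minimal sketch of POS counting and a type-token ratio using spaCy and pandas (the input path, column name, and spaCy model are assumptions; the actual scripts may differ):

```python
from collections import Counter

import pandas as pd
import spacy

# Hypothetical mini-csv with a "text" column; adjust the path and column name.
nlp = spacy.load("en_core_web_sm")
df = pd.read_csv("data/cleaned-output/anxiety.csv")

pos_counts = Counter()
tokens = []
for doc in nlp.pipe(df["text"].astype(str)):
    pos_counts.update(tok.pos_ for tok in doc if not tok.is_space)
    tokens.extend(tok.text.lower() for tok in doc if tok.is_alpha)

# Type-token ratio: number of unique word types divided by total tokens.
ttr = len(set(tokens)) / len(tokens) if tokens else 0.0

print(pos_counts.most_common(10))
print(f"type-token ratio: {ttr:.3f}")
```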
We make replicating our results relatively straightforward. To replicate a result for a particular model, simply run one of the training scripts in the trainers directory; your model will be saved to an output directory with the same name as the corresponding .py file.
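For example (the script name below is hypothetical; substitute any file actually present in trainers):

python trainers/distilbert_downsampled.py

This would save the fine-tuned model to a distilbert_downsampled output directory.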
We've uploaded our two downsampled models to Hugging Face. To run inference on your own symptomatic description, run predict.py from the command line as follows:
python predict.py --model "rachelhamelburg/downsampled_disorder_only" --text "Your symptoms"

or

python predict.py --model "rachelhamelburg/downsampled_model" --text "Your symptoms"
Keep in mind that the models were trained on lengthy symptomatic descriptions; short descriptions are unlikely to yield good results.
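If you would rather call the models directly from Python instead of going through predict.py, a minimal sketch using the transformers pipeline might look like this (predict.py may apply additional pre- or post-processing, so treat this as an approximation):

```python
from transformers import pipeline

# Load one of the uploaded models from the Hugging Face Hub.
clf = pipeline("text-classification", model="rachelhamelburg/downsampled_disorder_only")

description = (
    "For the past several months I have had trouble sleeping, constant worry "
    "about everyday situations, and difficulty concentrating at work."
)

# Long inputs may need truncating to the model's maximum sequence length.
print(clf(description, truncation=True))  # e.g. [{'label': ..., 'score': ...}]
```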
We also streamline the process of experimenting with different datasets, models, and a few hyperparameters. Simply cd into the experimental folder and pass the appropriate arguments on the command line:
python train_model.py --train_file <train_file> --val_file <val_file> --test_file <test_file> --model_name <model_name> --output_dir <output_dir> --num_epochs <num> --batch_size <num> --learning_rate <rate>
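For example (the file names and hyperparameter values below are placeholders; point the file arguments at your own splits):

python train_model.py --train_file train.csv --val_file val.csv --test_file test.csv --model_name distilbert-base-uncased --output_dir experiment_1 --num_epochs 3 --batch_size 16 --learning_rate 2e-5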