A text classification model built from scratch, covering data collection, model training, and deployment. The model can classify 258 tasks, listed here.
Data extraction retrieved information from Papers with Code in two steps. First, paper_url.ipynb collected paper URLs, producing the paper_urls dataset of paper titles and their corresponding links.
Then, url_details.ipynb visited each URL in that dataset to extract each paper's abstract and associated tasks, forming the primary dataset.
In total, 26778 paper details were scraped.
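To give a sense of what a single scrape looks like, here is a minimal sketch in the spirit of url_details.ipynb, using requests and BeautifulSoup; the CSS selectors are placeholders, not the actual ones used in the notebook.

```python
import requests
from bs4 import BeautifulSoup

def scrape_paper(url: str) -> dict:
    """Fetch one paper page and pull out its abstract and task tags."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    abstract = soup.select_one("div.paper-abstract p")  # placeholder selector
    tasks = soup.select("a.task-badge")                 # placeholder selector
    return {
        "url": url,
        "abstract": abstract.get_text(strip=True) if abstract else "",
        "tasks": [t.get_text(strip=True) for t in tasks],
    }
```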
At first, there were 2397 different tasks in the dataset. After a closer look, I removed 2139 of them, leaving 258 tasks. I then dropped the abstracts without any remaining task, which left 26628 samples.
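The filtering itself is just a couple of pandas operations; here is a small illustrative sketch (the column names and example labels are placeholders):

```python
import pandas as pd

# Placeholder data; in practice df is the scraped dataset and keep_tasks
# is the set of 258 labels retained after inspection.
keep_tasks = {"Image Classification", "Object Detection"}
df = pd.DataFrame({
    "abstract": ["first paper ...", "second paper ..."],
    "tasks": [["Image Classification", "Removed Label"], ["Removed Label"]],
})

# Keep only the retained task labels, then drop abstracts with no task left.
df["tasks"] = df["tasks"].apply(lambda ts: [t for t in ts if t in keep_tasks])
df = df[df["tasks"].map(len) > 0].reset_index(drop=True)
```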
I fine-tuned a distilroberta-base model from HuggingFace Transformers using Fastai and Blurr. You can check out the model training notebook here.
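For reference, a rough sketch of what fine-tuning through Blurr's high-level API looks like; the import path, argument names, and column names below are assumptions that vary across Blurr versions, so treat this as an outline rather than the notebook's exact code.

```python
import pandas as pd
# High-level Blurr API; module path and argument names depend on the Blurr
# version, so this is an assumed outline rather than the notebook's code.
from blurr.text.modeling.all import BlearnerForSequenceClassification

df = pd.read_csv("papers.csv")  # hypothetical file with abstract and task columns

learn = BlearnerForSequenceClassification.from_data(
    df,
    "distilroberta-base",     # pretrained checkpoint being fine-tuned
    text_attr="abstract",     # assumed text column name
    label_attr="task",        # assumed label column name
    dl_kwargs={"bs": 16},
)
learn.fit_one_cycle(3, lr_max=1e-3)  # standard fastai one-cycle training
```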
The trained model takes up over 314 MB. I compressed it with ONNX quantization, reducing its size to under 83 MB.
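The size reduction comes from post-training dynamic quantization with onnxruntime; a minimal sketch, assuming the fine-tuned model has already been exported to model.onnx (file names are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization; file names are placeholders.
quantize_dynamic(
    "model.onnx",                 # full-precision ONNX export of the fine-tuned model
    "model-quantized.onnx",       # int8-weight copy, a fraction of the original size
    weight_type=QuantType.QInt8,  # quantize weights to 8-bit integers
)
```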
The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment folder or here.
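The Spaces app essentially wraps a predict function in a Gradio Interface; a minimal skeleton is shown below, with the inference call stubbed out (see the deployment folder for the real code):

```python
import gradio as gr

def classify_abstract(abstract: str) -> dict:
    # Placeholder: the deployed app runs the quantized ONNX model here and
    # returns a {task_name: probability} mapping.
    return {"Image Classification": 0.92, "Object Detection": 0.35}

demo = gr.Interface(
    fn=classify_abstract,
    inputs=gr.Textbox(lines=10, label="Paper abstract"),
    outputs=gr.Label(num_top_classes=5, label="Predicted tasks"),
    title="Paper task classifier",
)

if __name__ == "__main__":
    demo.launch()
```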

A Flask app is also deployed, where users provide an abstract as input and get the predicted tasks as output. You can check the flask branch. The website is live here.
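A minimal sketch of such an endpoint, with the model call stubbed out (the real implementation is on the flask branch):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_tasks(abstract: str) -> list:
    # Placeholder for the real model call (the quantized ONNX model in practice).
    return ["Image Classification"]

@app.route("/predict", methods=["POST"])
def predict():
    abstract = request.get_json(force=True).get("abstract", "")
    return jsonify({"tasks": predict_tasks(abstract)})

if __name__ == "__main__":
    app.run(debug=True)
```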