A text classification model built from scratch, covering data collection, model training, and deployment. The model can classify 258 tasks, listed here.
Data extraction retrieved information from Papers with Code in two steps. First, paper_url.ipynb collected paper URLs, producing the paper_urls dataset of paper titles and their corresponding links.
Then, url_details.ipynb visited each URL in that dataset to extract each paper's abstract and associated tasks, forming the primary dataset.
In total, 26778 paper details were scraped.
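To give a sense of what a single scrape looks like, here is a minimal sketch in the spirit of url_details.ipynb, using requests and BeautifulSoup; the CSS selectors are placeholders, not the actual ones used in the notebook.

```python
import requests
from bs4 import BeautifulSoup

def scrape_paper(url: str) -> dict:
    """Fetch one paper page and pull out its abstract and task tags."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    abstract = soup.select_one("div.paper-abstract p")  # placeholder selector
    tasks = soup.select("a.task-badge")                 # placeholder selector
    return {
        "url": url,
        "abstract": abstract.get_text(strip=True) if abstract else "",
        "tasks": [t.get_text(strip=True) for t in tasks],
    }
```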
At first, there were 2397 different tasks in the dataset. After a closer look, I removed 2139 of them, leaving 258 tasks. I then dropped the abstracts without any remaining task, which left 26628 samples.
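The filtering itself is just a couple of pandas operations; here is a small illustrative sketch (the column names and example labels are placeholders):

```python
import pandas as pd

# Placeholder data; in practice df is the scraped dataset and keep_tasks
# is the set of 258 labels retained after inspection.
keep_tasks = {"Image Classification", "Object Detection"}
df = pd.DataFrame({
    "abstract": ["first paper ...", "second paper ..."],
    "tasks": [["Image Classification", "Removed Label"], ["Removed Label"]],
})

# Keep only the retained task labels, then drop abstracts with no task left.
df["tasks"] = df["tasks"].apply(lambda ts: [t for t in ts if t in keep_tasks])
df = df[df["tasks"].map(len) > 0].reset_index(drop=True)
```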
I fine-tuned a distilroberta-base model from HuggingFace Transformers using Fastai and Blurr. You can check out the model training notebook here.
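For reference, a rough sketch of what fine-tuning through Blurr's high-level API looks like; the import path, argument names, and column names below are assumptions that vary across Blurr versions, so treat this as an outline rather than the notebook's exact code.

```python
import pandas as pd
# High-level Blurr API; module path and argument names depend on the Blurr
# version, so this is an assumed outline rather than the notebook's code.
from blurr.text.modeling.all import BlearnerForSequenceClassification

df = pd.read_csv("papers.csv")  # hypothetical file with abstract and task columns

learn = BlearnerForSequenceClassification.from_data(
    df,
    "distilroberta-base",     # pretrained checkpoint being fine-tuned
    text_attr="abstract",     # assumed text column name
    label_attr="task",        # assumed label column name
    dl_kwargs={"bs": 16},
)
learn.fit_one_cycle(3, lr_max=1e-3)  # standard fastai one-cycle training
```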
The trained model takes up over 314 MB. I compressed it with ONNX quantization, reducing its size to under 83 MB.
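The size reduction comes from post-training dynamic quantization with onnxruntime; a minimal sketch, assuming the fine-tuned model has already been exported to model.onnx (file names are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Post-training dynamic quantization; file names are placeholders.
quantize_dynamic(
    "model.onnx",                 # full-precision ONNX export of the fine-tuned model
    "model-quantized.onnx",       # int8-weight copy, a fraction of the original size
    weight_type=QuantType.QInt8,  # quantize weights to 8-bit integers
)
```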
The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment folder or here.
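The Spaces app essentially wraps a predict function in a Gradio Interface; a minimal skeleton is shown below, with the inference call stubbed out (see the deployment folder for the real code):

```python
import gradio as gr

def classify_abstract(abstract: str) -> dict:
    # Placeholder: the deployed app runs the quantized ONNX model here and
    # returns a {task_name: probability} mapping.
    return {"Image Classification": 0.92, "Object Detection": 0.35}

demo = gr.Interface(
    fn=classify_abstract,
    inputs=gr.Textbox(lines=10, label="Paper abstract"),
    outputs=gr.Label(num_top_classes=5, label="Predicted tasks"),
    title="Paper task classifier",
)

if __name__ == "__main__":
    demo.launch()
```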

A Flask app is also deployed, where users provide an abstract as input and get the predicted tasks as output. You can check the flask branch. The website is live here.
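A minimal sketch of such an endpoint, with the model call stubbed out (the real implementation is on the flask branch):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_tasks(abstract: str) -> list:
    # Placeholder for the real model call (the quantized ONNX model in practice).
    return ["Image Classification"]

@app.route("/predict", methods=["POST"])
def predict():
    abstract = request.get_json(force=True).get("abstract", "")
    return jsonify({"tasks": predict_tasks(abstract)})

if __name__ == "__main__":
    app.run(debug=True)
```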