The following is a set of Jupyter notebook tutorials demonstrating how to use the text classification models supported by NeMo Curator. These classifiers help with data annotation, which is useful for data blending when training foundation models.

Each of these classifiers is available on Hugging Face and can be run independently with the Transformers library. When run with NeMo Curator, the classifiers are accelerated by CrossFit, a library that uses intelligent batching and RAPIDS to speed up offline inference on large datasets. Each Jupyter notebook in this directory demonstrates how to run a classifier on text data, and the workflows scale easily to large amounts of data.
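For example, the FineWebEduClassifier's underlying model can be run on its own with Transformers. The sketch below follows the standard sequence-classification pattern; it assumes network access to download the model, and the sample text is illustrative. Some of the other classifiers (such as the Aegis models) require additional setup described on their Hugging Face pages.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the FineWeb-Edu classifier directly from Hugging Face
model_name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# This model produces a single regression logit: an educational-quality score
score = outputs.logits.squeeze(-1).item()
print(f"Educational score: {score:.2f}")
```

Running a classifier this way processes one batch at a time; the NeMo Curator notebooks show how CrossFit parallelizes the same inference across a large dataset.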
Before running any of these notebooks, please see the NeMo Curator Getting Started page for instructions on how to install NeMo Curator.
| NeMo Curator Classifier | Hugging Face Page |
| --- | --- |
| AegisClassifier | nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 and nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0 |
| ContentTypeClassifier | nvidia/content-type-classifier-deberta |
| DomainClassifier | nvidia/domain-classifier |
| FineWebEduClassifier | HuggingFaceFW/fineweb-edu-classifier |
| InstructionDataGuardClassifier | nvidia/instruction-data-guard |
| MultilingualDomainClassifier | nvidia/multilingual-domain-classifier |
| PromptTaskComplexityClassifier | nvidia/prompt-task-and-complexity-classifier |
| PyTorchClassifier | Requires local .pth file(s) for any DeBERTa-based text classifier(s) |
| QualityClassifier | nvidia/quality-classifier-deberta |