CTMNeg is a neural topic model that extends CTM, a VAE-based contextualized topic model, with a negative sampling mechanism to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector.
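For intuition, here is a minimal sketch of that triplet objective in PyTorch. The function name, the decoder interface, and the use of a random topic-permutation as the perturbation are all illustrative placeholders; see the paper for the exact perturbation scheme.

```python
import torch
import torch.nn.functional as F

def negative_sampling_triplet_loss(theta, decoder, bow, margin=1.0):
    """Sketch of the triplet objective (names and perturbation are illustrative).

    theta:   document-topic vector from the VAE encoder, shape (batch, K)
    decoder: callable mapping a document-topic vector to a reconstructed
             bag-of-words distribution, shape (batch, V)
    bow:     bag-of-words representation of the input document, shape (batch, V)
    """
    # Perturb the document-topic vector; a random permutation of the topic
    # dimension stands in here for the paper's perturbation scheme.
    perm = torch.randperm(theta.size(1), device=theta.device)
    theta_neg = theta[:, perm]

    pos_recon = decoder(theta)      # reconstruction from the correct vector
    neg_recon = decoder(theta_neg)  # reconstruction from the perturbed vector

    # The input document should be closer to the positive reconstruction
    # than to the negative one by at least `margin`.
    anchor = F.normalize(bow, dim=-1)
    return F.triplet_margin_loss(anchor, pos_recon, neg_recon, margin=margin)
```

During training, a term like this would be added to the usual VAE objective (reconstruction loss plus KL divergence).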
Among the three datasets used in the paper, 20NewsGroup (20NG) and M10 are available in OCTIS. The third, GoogleNews (GN), is provided in the `preprocessed_datasets` directory.
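For reference, the datasets can be loaded through OCTIS roughly as follows (the `GN` folder name is an assumption; check `preprocessed_datasets` for the actual name):

```python
from octis.dataset.dataset import Dataset

# 20NewsGroup and M10 are bundled with OCTIS and can be fetched by name.
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")  # or "M10"

# GoogleNews is loaded from this repository's preprocessed_datasets folder.
gn = Dataset()
gn.load_custom_dataset_from_folder("preprocessed_datasets/GN")
```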
To optimize the hyperparameters of CTMNeg and to compare its performance with existing topic models, we use OCTIS, an integrated framework for topic modeling. Two notebooks are provided in the `examples` directory: the first runs the hyperparameter optimization for CTMNeg and generates the experimental results with the optimized hyperparameters, and the second runs several existing topic models for comparison (a minimal sketch of the optimization setup follows the table below).
Name | Link |
---|---|
Hyperparameter optimization and result generation for CTMNeg | |
Getting results of some existing topic models for comparison | |
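The optimization in the first notebook follows the standard OCTIS pattern, roughly as below. The `CTMNeg` import path and the search space are assumptions for illustration; see the notebook for the actual configuration.

```python
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Assumed import path for the model class in this repository.
from ctm_neg import CTMNeg

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

model = CTMNeg(num_topics=20)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")

# Illustrative search space, not the one used in the paper.
search_space = {"lr": Real(low=1e-4, high=1e-2, prior="log-uniform")}

optimizer = Optimizer()
result = optimizer.optimize(
    model, dataset, npmi, search_space,
    number_of_call=30,           # Bayesian-optimization iterations
    model_runs=5,                # runs of the topic model per configuration
    save_path="results/ctmneg/"  # where OCTIS stores intermediate results
)
result.save_to_csv("results_ctmneg.csv")
```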
This work has been accepted at ICON (International Conference on Natural Language Processing) 2022!
Read the paper: https://aclanthology.org/2022.icon-main.18
If you decide to use this resource, please cite:
```bibtex
@inproceedings{adhya-etal-2022-improving,
    title = "Improving Contextualized Topic Models with Negative Sampling",
    author = "Adhya, Suman and
      Lahiri, Avishek and
      Kumar Sanyal, Debarshi and
      Pratim Das, Partha",
    booktitle = "Proceedings of the 19th International Conference on Natural Language Processing (ICON)",
    month = dec,
    year = "2022",
    address = "New Delhi, India",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.icon-main.18",
    pages = "128--138",
    abstract = "Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments for different topic counts on three publicly available benchmark datasets show that in most cases, our approach leads to an increase in topic coherence over that of the baselines. Our model also achieves very high topic diversity.",
}
```
All experiments are conducted using OCTIS, an integrated framework for topic modeling.
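As a pointer, training and scoring a baseline with OCTIS (as the second notebook does) looks roughly like this; the baseline, topic count, and metric settings here are illustrative.

```python
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

dataset = Dataset()
dataset.fetch_dataset("M10")

# Train a baseline (here OCTIS's CTM) and score it with the metrics
# reported in the paper: NPMI coherence and topic diversity.
model = CTM(num_topics=20)
output = model.train_model(dataset)

npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)
print("NPMI:", npmi.score(output))
print("Diversity:", diversity.score(output))
```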
OCTIS: Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. OCTIS: Comparing and Optimizing Topic models is Simple!. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL). https://www.aclweb.org/anthology/2021.eacl-demos.31/