CTMNeg is a neural topic model that extends CTM, a VAE-based contextualized topic model, with a negative sampling mechanism to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector.
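For intuition, here is a minimal sketch of that triplet objective in PyTorch. The function name, the decoder interface, and the use of a random topic-permutation as the perturbation are all illustrative placeholders; see the paper for the exact perturbation scheme.

```python
import torch
import torch.nn.functional as F

def negative_sampling_triplet_loss(theta, decoder, bow, margin=1.0):
    """Sketch of the triplet objective (names and perturbation are illustrative).

    theta:   document-topic vector from the VAE encoder, shape (batch, K)
    decoder: callable mapping a document-topic vector to a reconstructed
             bag-of-words distribution, shape (batch, V)
    bow:     bag-of-words representation of the input document, shape (batch, V)
    """
    # Perturb the document-topic vector; a random permutation of the topic
    # dimension stands in here for the paper's perturbation scheme.
    perm = torch.randperm(theta.size(1), device=theta.device)
    theta_neg = theta[:, perm]

    pos_recon = decoder(theta)      # reconstruction from the correct vector
    neg_recon = decoder(theta_neg)  # reconstruction from the perturbed vector

    # The input document should be closer to the positive reconstruction
    # than to the negative one by at least `margin`.
    anchor = F.normalize(bow, dim=-1)
    return F.triplet_margin_loss(anchor, pos_recon, neg_recon, margin=margin)
```

During training, a term like this would be added to the usual VAE objective (reconstruction loss plus KL divergence).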
Among the three datasets used in the paper, 20NewsGroup (20NG) and M10 are available in OCTIS. The third, GoogleNews (GN), is provided in the `preprocessed_datasets` directory.
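For reference, the datasets can be loaded through OCTIS roughly as follows (the `GN` folder name is an assumption; check `preprocessed_datasets` for the actual name):

```python
from octis.dataset.dataset import Dataset

# 20NewsGroup and M10 are bundled with OCTIS and can be fetched by name.
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")  # or "M10"

# GoogleNews is loaded from this repository's preprocessed_datasets folder.
gn = Dataset()
gn.load_custom_dataset_from_folder("preprocessed_datasets/GN")
```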
To optimize the hyperparameters of CTMNeg and to compare its performance with existing topic models, we use OCTIS, an integrated framework for topic modeling. Two notebooks are provided in the `examples` directory: the first runs the hyperparameter optimization for CTMNeg and generates the experimental results with the optimized hyperparameters, and the second runs several existing topic models for comparison (a minimal sketch of the optimization setup follows the table below).
Name | Link |
---|---|
Hyperparameter optimization and result generation for CTMNeg | |
Getting results of some existing topic models for comparison | |
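The optimization in the first notebook follows the standard OCTIS pattern, roughly as below. The `CTMNeg` import path and the search space are assumptions for illustration; see the notebook for the actual configuration.

```python
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

# Assumed import path for the model class in this repository.
from ctm_neg import CTMNeg

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

model = CTMNeg(num_topics=20)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")

# Illustrative search space, not the one used in the paper.
search_space = {"lr": Real(low=1e-4, high=1e-2, prior="log-uniform")}

optimizer = Optimizer()
result = optimizer.optimize(
    model, dataset, npmi, search_space,
    number_of_call=30,           # Bayesian-optimization iterations
    model_runs=5,                # runs of the topic model per configuration
    save_path="results/ctmneg/"  # where OCTIS stores intermediate results
)
result.save_to_csv("results_ctmneg.csv")
```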
This work has been accepted at ICON (International Conference on Natural Language Processing) 2022!
Read the paper: https://aclanthology.org/2022.icon-main.18
If you decide to use this resource, please cite:
```bibtex
@inproceedings{adhya-etal-2022-improving,
    title = "Improving Contextualized Topic Models with Negative Sampling",
    author = "Adhya, Suman and
      Lahiri, Avishek and
      Kumar Sanyal, Debarshi and
      Pratim Das, Partha",
    booktitle = "Proceedings of the 19th International Conference on Natural Language Processing (ICON)",
    month = dec,
    year = "2022",
    address = "New Delhi, India",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.icon-main.18",
    pages = "128--138",
    abstract = "Topic modeling has emerged as a dominant method for exploring large document collections. Recent approaches to topic modeling use large contextualized language models and variational autoencoders. In this paper, we propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics. In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document and dissimilar to the document reconstructed from the perturbed vector. Experiments for different topic counts on three publicly available benchmark datasets show that in most cases, our approach leads to an increase in topic coherence over that of the baselines. Our model also achieves very high topic diversity.",
}
```
All experiments are conducted using OCTIS, an integrated framework for topic modeling.
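As a pointer, training and scoring a baseline with OCTIS (as the second notebook does) looks roughly like this; the baseline, topic count, and metric settings here are illustrative.

```python
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

dataset = Dataset()
dataset.fetch_dataset("M10")

# Train a baseline (here OCTIS's CTM) and score it with the metrics
# reported in the paper: NPMI coherence and topic diversity.
model = CTM(num_topics=20)
output = model.train_model(dataset)

npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)
print("NPMI:", npmi.score(output))
print("Diversity:", diversity.score(output))
```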
OCTIS: Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. OCTIS: Comparing and Optimizing Topic models is Simple!. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (EACL). https://www.aclweb.org/anthology/2021.eacl-demos.31/