Contextualized Topic Models: An Introduction
Overview
Contextualized Topic Models (CTM) are an innovative family of topic modeling approaches that leverage pre-trained language representations, such as BERT, to enhance topic modeling. These models bring a modern twist to traditional techniques by integrating contextual information, thereby improving the coherence and relevance of the topics discovered from texts.
Core Models
CTM includes two primary models:
- CombinedTM: This model merges contextual embeddings with the conventional bag-of-words (BoW) representation, and is designed to produce topics that are both contextually rich and linguistically coherent.
- ZeroShotTM: This model shines when test documents contain words that never appeared in the training data. It also supports cross-lingual applications, making it well suited to multilingual topic assignment. A minimal training sketch follows this list.
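As a concrete starting point, here is a minimal training sketch assembled from the package's documented API. The embedding model name, the toy documents, and the hyperparameter values (contextual_size, n_components) are illustrative assumptions, not requirements.

    from contextualized_topic_models.models.ctm import CombinedTM
    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

    # Raw text feeds the contextual embedder; preprocessed text builds the BoW.
    raw_docs = ["The cat sat on the mat.", "Stock markets rallied today."]  # toy data
    bow_docs = ["cat sat mat", "stock market rally today"]                  # toy data

    # Embed each document with a sentence-transformers model.
    qt = TopicModelDataPreparation("all-mpnet-base-v2")
    training_dataset = qt.fit(text_for_contextual=raw_docs, text_for_bow=bow_docs)

    # contextual_size must match the embedding dimension (768 for this model).
    ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)
    ctm.fit(training_dataset)

    print(ctm.get_topics())  # top words for each discovered topic

ZeroShotTM is constructed with the same arguments. Because it relies only on the contextual embeddings at inference time, unseen documents can be scored after embedding them with qt.transform(text_for_contextual=...), and with a multilingual embedding model those documents may even be written in another language.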
Advantages
One of the striking advantages of CTMs is their flexibility regarding embeddings. As new embedding methods emerge, they can be easily integrated into CTMs, paving the way for continuous improvement. Unlike traditional models, CTMs are not confined to word-count features alone.
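For instance, switching embedding backends amounts to passing a different sentence-transformers model name; the two names below are common examples, not the package's defaults.

    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

    # An English-focused embedding model:
    qt_en = TopicModelDataPreparation("all-mpnet-base-v2")

    # A multilingual model, useful for cross-lingual topic modeling:
    qt_multi = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")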
Kitty: A Useful Tool
The CTM package also includes a module named Kitty, which allows users to create a human-in-the-loop classifier. This classifier can streamline document classification and cluster creation, significantly aiding in tasks like document filtering, even across different languages.
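The workflow, roughly, is: cluster the documents, inspect each cluster's top words, and hand-assign labels to the clusters worth keeping. Below is a sketch following the Kitty API as documented by the package; the file name, topic count, and class labels are hypothetical, and exact signatures may differ between versions.

    from contextualized_topic_models.models.kitty_classifier import Kitty

    # Hypothetical input: one document per line.
    training_data = [line.strip() for line in open("train_data.txt")]

    kt = Kitty()
    kt.train(training_data, topics=5)  # discover five document clusters

    # Inspect the top words of each cluster...
    print(kt.pretty_print_word_classes())

    # ...then label the clusters you care about (the human-in-the-loop step).
    kt.assigned_classes = {0: "nature", 1: "finance"}

    # New documents are mapped to the assigned classes.
    print(kt.predict(["the fox runs through the forest"]))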
Tutorials and Resources
For those eager to dive deeper, CTM offers various resources:
- Medium Blog Post and Colab Tutorials: These give step-by-step guidance on using the models, making it easier for beginners to get started with contextualized topic modeling.
Usage Considerations
When using CTMs, consider the following:
- The BoW size should ideally be capped at 2000 terms for optimal model performance.
- The choice of embedding model matters. For instance, using a multilingual model on English texts might not yield the best results compared to a pure English-trained model.
- Preprocessing plays a crucial role. Generally, the preprocessed text is used to build the BoW, while the raw text is used to create the embeddings (see the sketch after this list).
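Putting the last two points together, here is a minimal preprocessing sketch using the package's whitespace preprocessor; the file name is hypothetical, and the helper's exact name and return values vary somewhat between versions.

    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

    documents = [line.strip() for line in open("documents.txt")]  # hypothetical file

    # vocabulary_size enforces the ~2000-term BoW cap suggested above.
    sp = WhiteSpacePreprocessing(documents, stopwords_language="english",
                                 vocabulary_size=2000)
    preprocessed_docs, unpreprocessed_docs, vocab = sp.preprocess()

    # Preprocessed text builds the BoW; the raw text gets embedded.
    qt = TopicModelDataPreparation("all-mpnet-base-v2")
    training_dataset = qt.fit(text_for_contextual=unpreprocessed_docs,
                              text_for_bow=preprocessed_docs)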
Installation
Installing the CTM package is straightforward via pip:
    pip install -U contextualized_topic_models
If you plan to use GPU support, make sure your PyTorch installation matches the CUDA version available on your machine.
Documentation and Support
Comprehensive documentation is available, covering everything from basic model usage to advanced features like language-specific settings and custom embeddings with Kitty.
In summary, Contextualized Topic Models enhance traditional topic modeling by incorporating pre-trained contextual embeddings, making them powerful tools for both monolingual and cross-lingual applications. Whether you're dealing with English texts or multilingual datasets, CTM offers a versatile and efficient solution for topic discovery and document classification.