Introduction to Tomotopy
What is Tomotopy?
Tomotopy is a Python extension built on the C++ library Tomoto (Topic Modeling Tool). It is designed for topic modeling via Gibbs sampling, a Markov chain Monte Carlo algorithm for drawing samples from a probability distribution. Tomotopy is optimized for speed by exploiting the vectorization (SIMD) capabilities of modern CPUs, and it supports a broad set of topic models.
Supported Topic Models
Tomotopy offers a wide range of topic models, which allows users to choose the one that best fits their data and analysis needs. These include:
- Latent Dirichlet Allocation (LDA): A model used for discovering abstract topics within a collection of texts.
- Labeled LDA: Enhances LDA by including pre-assigned categories for each document.
- Partially Labeled LDA: Extends Labeled LDA to documents whose labels may be incomplete or absent.
- Supervised LDA: Incorporates response variables into the LDA.
- Dirichlet Multinomial Regression (DMR): Useful for incorporating metadata to aid in topic modeling.
- Generalized DMR: Extends DMR to continuous metadata such as timestamps.
- Hierarchical Dirichlet Process (HDP): Allows the number of topics to grow with the data.
- Hierarchical LDA: Organizes topics hierarchically.
- Multi Grain LDA: Models both corpus-level and document-level topics.
- Pachinko Allocation Model (PAM): Captures correlations among topics.
- Hierarchical PAM: Extends PAM with hierarchical structures.
- Correlated Topic Model (CTM): Captures correlations between topics.
- Dynamic Topic Model (DTM): Models the evolution of topics over time.
- Pseudo-document based Topic Model: Groups documents into pseudo-documents for topic modeling.
Getting Started
Getting started with Tomotopy is straightforward: it can be installed via pip, Python's package manager, and runs on Linux, macOS, and Windows with Python 3.6 or newer. After importing it, you can check which SIMD (Single Instruction, Multiple Data) instruction set it selected, since these vectorized instructions are what drive its computation speed.
Basic Usage
The following is a simple example of how to use Tomotopy for LDA training with text data:
import tomotopy as tp

# Create a model with 20 topics
mdl = tp.LDAModel(k=20)

# Add example data into the model
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

# Train the model, 10 iterations at a time
for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# Display the top 10 words of each topic
for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# Summarize the model
mdl.summary()
Performance and Optimization
Tomotopy is built for rapid iteration and takes full advantage of modern multi-core processors. Because it uses Collapsed Gibbs Sampling, it may need more iterations to converge than variational-inference-based implementations, but each iteration is much cheaper to compute, especially on CPUs that support SIMD instructions.
Saving and Loading Models
Models in Tomotopy can be saved to disk and reloaded later, allowing users to store trained results and reuse them when needed. This is done through simple save and load methods on each model class.
Interactive Viewer
Introduced in version 0.13.0, Tomotopy includes an interactive viewer allowing users to visualize and interact with model results via a web interface, enhancing the model exploration experience.
Advanced Features
Tomotopy supports manipulating documents both inside and outside a model. A trained model can infer the topic distribution of new, unseen documents, providing flexibility for dynamic data environments, and raw data can be supplied through user-defined corpus objects.
Conclusion
Tomotopy is a flexible and efficient tool for anyone interested in topic modeling. Its design leverages the speed of C++ and the flexibility of Python, providing a convenient package for researchers and professionals. With continuous updates enhancing its capabilities, tomotopy is a valuable asset in text analysis and natural language processing tasks.