Introduction to the Small-Text Project
Small-Text is a Python library designed for efficient active learning in text classification. Active learning is a machine learning paradigm in which the algorithm interactively queries a human annotator to label selected data points. This project provides state-of-the-art tools to streamline that process, making text classification more accessible and adaptable to a wide range of applications.
Features of Small-Text
Small-Text is packed with features that make it an essential tool for anyone looking to optimize their text classification tasks:
- Unified Interfaces: The library provides a consistent interface for active learning across different machine learning frameworks such as scikit-learn, PyTorch, and Hugging Face's Transformers. This allows users to easily integrate different query strategies and classifiers into their workflows.
- Support for GPUs and Transformers: The library supports GPU-based models using PyTorch, which improves processing speed, and it integrates with Transformers, allowing advanced text classification models to be used alongside active learning strategies.
- Flexible Installation Options: For users with limited hardware resources, Small-Text can be installed with minimal dependencies, running effectively even without a GPU. More comprehensive installations are also available for users who wish to leverage the full power of the tool.
- Pre-Implemented Strategies: Small-Text comes with a suite of pre-implemented components, including query strategies, initialization strategies, and stopping criteria, all of which have been scientifically evaluated.
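To make the idea of a query strategy concrete, here is a toy, stdlib-only sketch of a breaking-ties strategy (this is illustrative code, not Small-Text's own implementation): it ranks unlabeled samples by the margin between their two highest predicted class probabilities and queries the samples with the smallest margins first.

```python
# Illustrative breaking-ties query strategy (not Small-Text's code):
# the smaller the gap between the two most likely classes, the more
# uncertain the model is, and the more valuable a human label becomes.

def breaking_ties(probabilities, n=2):
    """Return indices of the n samples with the smallest top-2 margin.

    probabilities: list of per-sample class probability lists.
    """
    margins = []
    for i, probs in enumerate(probabilities):
        top2 = sorted(probs, reverse=True)[:2]
        margins.append((top2[0] - top2[1], i))
    margins.sort()  # smallest margin = most uncertain
    return [i for _, i in margins[:n]]

probs = [
    [0.9, 0.05, 0.05],   # confident -> large margin, queried last
    [0.4, 0.39, 0.21],   # nearly tied -> queried first
    [0.5, 0.45, 0.05],   # uncertain -> queried second
]
print(breaking_ties(probs))  # -> [1, 2]
```

In practice, the probabilities would come from the current classifier's predictions over the unlabeled pool; strategies of this family are among those the library ships pre-implemented.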
Understanding Active Learning
Active learning is particularly valuable when labeled data is scarce. Rather than labeling examples at random, the model selects the examples whose labels it expects to be most informative, so each annotation contributes as much as possible to training the supervised model.
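The resulting query-annotate-retrain cycle can be sketched in a few lines of plain Python. This is a deliberately simplified toy (the "model" and uncertainty score are stand-ins, and none of this reflects the actual Small-Text API), but the shape of the loop is the same:

```python
# Toy pool-based active learning loop (illustrative only): repeatedly
# query the most uncertain unlabeled sample, ask an "oracle" (the human
# annotator) for its label, and retrain on the grown labeled set.

import random

def train(labeled):
    # Stand-in "model": a majority-class guesser over the labeled data.
    labels = [y for _, y in labeled]
    return max(set(labels), key=labels.count) if labels else None

def uncertainty(model, x):
    # Stand-in uncertainty score; a real strategy would use the model's
    # predicted class probabilities here (e.g. breaking ties).
    return random.random()

pool = [f"doc_{i}" for i in range(20)]           # unlabeled pool
oracle = {x: i % 2 for i, x in enumerate(pool)}  # annotator stand-in
labeled = []

for _ in range(5):                               # 5 query rounds
    model = train(labeled)
    query = max(pool, key=lambda x: uncertainty(model, x))
    labeled.append((query, oracle[query]))       # annotator labels it
    pool.remove(query)

print(len(labeled), len(pool))  # -> 5 15
```

Small-Text packages this loop behind unified interfaces, so the classifier, query strategy, and stopping criterion can each be swapped independently.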
Recent Updates
Small-Text is actively maintained and continuously improved. Some of the recent updates include:
- Version 1.4.1: Released in August 2024 as a bugfix release.
- Version 1.4.0: Released in June 2024, this version introduced a new query strategy, AnchorSubsampling (AnchorAL).
Additionally, the project was presented in a paper at EACL 2023, where it received the Best System Demonstration award.
Getting Started
Installing Small-Text is straightforward. You can install it via pip:
pip install small-text
To access additional features with transformers, you can use:
pip install "small-text[transformers]"
For conda users, a full installation includes:
conda install -c conda-forge "torch>=1.6.0" "torchtext>=0.7.0" transformers small-text
More detailed installation instructions and quick start guides are available in the project documentation.
Resources and Contributions
The project is open to contributions and welcomes community involvement. The comprehensive documentation provides in-depth guides and showcases, including tutorials on using Small-Text for various text classification scenarios.
For further exploration and to contribute to its development, the community can find guidelines in the CONTRIBUTING.md file.
Acknowledgments
Small-Text was developed by Christopher Schröder and the NLP group at Leipzig University, part of the Webis research network. The project has been supported by the Development Bank of Saxony.
Small-Text is licensed under the MIT License, and users are encouraged to cite its paper for academic and research purposes.
For detailed information and updates, users can refer to the documentation.