lhotse - Adaptive Python Library for Seamless Audio Data Handling with PyTorch

Introduction to Lhotse

Lhotse is a Python library designed to simplify and improve the preparation and handling of speech and audio data. It is part of the next-generation Kaldi ecosystem, alongside k2, and aims to make complex speech processing tasks more accessible to a broad audience.

Main Objectives

Lhotse was developed with several key goals in mind:

Python-Centric Design: It seeks to attract a wider community of developers by providing a Python-centric approach to speech processing.
Flexibility for Experts: Experienced users of Kaldi will find a familiar and expressive command-line interface.
Standardized Data Preparation: It offers standard preparation recipes for commonly used speech corpora.
Seamless Integration with PyTorch: Developers can leverage PyTorch Dataset classes tailored for speech and audio-related tasks.
Efficient Data Handling: The library emphasizes efficient use of I/O bandwidth and storage capacity.

Tutorials and Examples

The Lhotse library is accompanied by an array of tutorials and examples to guide users in its application:

Basic Workflow: Introductory tutorials walk through the complete Lhotse workflow, demonstrating data preparation and manipulation.
Advanced Features: Further tutorials cover concepts such as audio cuts, data transformations, and combining multiple datasets.
Real-World Applications: Users can explore real-world implementations through Icefall recipes and ESPnet+Lhotse examples.

Key Concepts

Lhotse introduces unique concepts and features to streamline the speech data preparation process:

Audio Cuts: Cuts let users mix, truncate, and pad audio on the fly. Because these operations are applied when the audio is loaded, derived audio does not need to be written to disk, which reduces storage needs.
Data Augmentation and Feature Extraction: It supports both pre-computed and on-the-fly data transformations, with highly compressed feature matrices for efficient storage.
Flexible Data Manipulation: Lhotse uses human-readable text manifests for data and metadata, accessible through intuitive Python classes.
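
To make the cut and manifest ideas more concrete, here is a minimal sketch in plain Python. It is not Lhotse's API or implementation: ToyCut, to_jsonl, and from_jsonl are hypothetical names invented for this illustration. It only shows the general idea that a cut can be a small metadata record pointing into a recording, so truncation and padding change numbers rather than audio files, and that such records serialize naturally to a human-readable, line-oriented text manifest.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class ToyCut:
    """Hypothetical stand-in for a cut: a pointer into a recording, no samples."""

    recording_id: str
    start: float          # offset into the recording, in seconds
    duration: float       # length of the referenced audio, in seconds
    padding: float = 0.0  # trailing silence to add at load time, in seconds

    @property
    def total_duration(self) -> float:
        return self.duration + self.padding

    def truncate(self, offset: float, duration: float) -> "ToyCut":
        """Return a shorter cut; only metadata changes, no audio is copied."""
        if offset < 0 or duration <= 0 or offset + duration > self.duration:
            raise ValueError("requested span falls outside the cut")
        return replace(
            self, start=self.start + offset, duration=duration, padding=0.0
        )

    def pad(self, duration: float) -> "ToyCut":
        """Return a cut whose total duration is at least `duration` seconds."""
        return replace(self, padding=max(0.0, duration - self.duration))


def to_jsonl(cuts: Iterable[ToyCut]) -> str:
    """Serialize cuts as one JSON object per line (a toy text manifest)."""
    return "\n".join(json.dumps(asdict(cut)) for cut in cuts)


def from_jsonl(text: str) -> List[ToyCut]:
    """Rebuild cuts from the toy manifest produced by `to_jsonl`."""
    return [ToyCut(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Usage: derive a padded segment from a 10-second cut, then round-trip
# both cuts through the text manifest.
cut = ToyCut(recording_id="rec-001", start=0.0, duration=10.0)
segment = cut.truncate(offset=2.5, duration=4.0).pad(duration=6.0)
manifest = to_jsonl([cut, segment])
restored = from_jsonl(manifest)
```

In this sketch the original cut is never modified, and the manifest stays small no matter how long the underlying recording is. The real library provides its own cut and manifest classes with a richer feature set; consult the Lhotse documentation for the actual names and signatures.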

Installation and Getting Started

Lhotse supports Python versions 3.7 and later. Users can quickly install the library using pip:

pip install lhotse

For the latest version directly from GitHub, use:

pip install git+https://github.com/lhotse-speech/lhotse

For development, clone the repository and install it in editable mode with the development dependencies:

git clone https://github.com/lhotse-speech/lhotse
cd lhotse
pip install -e '.[dev]'
pre-commit install

Customization and Optional Features

Lhotse's behavior can be customized through environment variables, for example to set audio processing preferences. Optional dependencies can also be installed to unlock additional features, such as Kaldi compatibility, faster manifest reading, and support for the WebDataset tarball format.

Concluding Remarks

Lhotse aims to make the preparation and use of speech and audio data for machine learning simpler and more efficient. By offering a user-friendly, efficient, and flexible framework, it opens speech processing to a broader community and simplifies otherwise complex data preparation tasks.