GigaSpeech - A Multi-Domain ASR Dataset with Extensive 10,000-Hour Transcripts

GigaSpeech: A Comprehensive Overview

GigaSpeech is an evolving, multi-domain English corpus for Automatic Speech Recognition (ASR). It provides 10,000 hours of transcribed audio for training and developing ASR systems. The dataset is documented in the Interspeech paper, and a preprint is available on arXiv.

Version and Download Information

The current version of GigaSpeech is 1.0.0, released on July 5, 2021. To access the dataset, users first fill out a Google form. They can then choose between two options: receiving the raw data through an email from SpeechColab, or using a pre-processed version hosted on HuggingFace.

Leaderboard and Tools

GigaSpeech maintains a leaderboard where contributors report benchmark results together with the toolkit and training recipe they used. Baseline models have been built with toolkits such as Athena, ESPnet, Kaldi, Pika, and Icefall, among others. These baselines cover several model types, including Transformer-AED, Conformer-CTC, and RNN-T.
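ASR benchmark results of this kind are usually reported as word error rate (WER). The following is a minimal sketch of that metric, assuming whitespace-tokenized, already-normalized text; it is not the leaderboard's official scoring script, and the function name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.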

Dataset Composition and Structure

GigaSpeech draws on a diverse collection of audio sources. It comprises over 33,000 hours of audio, of which 10,000 hours are transcribed for supervised learning. The remaining hours are suited to semi-supervised or unsupervised training. The data is sourced from audiobooks, podcasts, and YouTube, covering acoustic conditions that range from clean to noisy.

Training and Evaluation Subsets

The transcribed training data is organized into five subsets, XS, S, M, L, and XL, ranging from 10 hours to the full 10,000 hours. This supports both small-scale and industrial-scale experiments. In addition, the evaluation subsets "Dev" and "Test" are professionally annotated and serve as the benchmark for testing ASR systems.
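As an illustration of how the subset sizes might guide an experiment, the sketch below picks the smallest training subset that covers a given number of hours. Only the 10-hour and 10,000-hour endpoints come from the text above; the intermediate sizes (250, 1,000, and 2,500 hours) are assumptions based on the public GigaSpeech documentation and should be checked against the current release. The helper name is hypothetical.

```python
# Approximate hours of transcribed audio per training subset.
# The S, M, and L values are assumptions, not stated in this overview.
SUBSET_HOURS = {"XS": 10, "S": 250, "M": 1000, "L": 2500, "XL": 10000}


def smallest_subset_covering(min_hours: float) -> str:
    """Return the smallest subset that offers at least `min_hours` of audio."""
    for name, hours in sorted(SUBSET_HOURS.items(), key=lambda item: item[1]):
        if hours >= min_hours:
            return name
    raise ValueError(f"no subset provides {min_hours} hours")
```

For example, a quick pipeline smoke test could call `smallest_subset_covering(10)`, while a production-scale run would request the full 10,000 hours.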

Data Preparation and Processing

GigaSpeech provides data preparation scripts, maintained in the GigaSpeech repository. The scripts target several speech recognition toolkits, so they can be updated alongside the dataset as it evolves. Users are encouraged to resample audio to 16 kHz for training and testing.
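One common way to resample is to call ffmpeg. The sketch below only illustrates the idea and is not the project's own preparation script: it assumes ffmpeg is installed and on the PATH, the file names are hypothetical, and the down-mix to a single channel is an added assumption that the text above does not require.

```python
import subprocess
from pathlib import Path
from typing import List


def build_resample_cmd(src: str, dst: str, rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that rewrites `src` to `dst` at `rate` Hz."""
    return [
        "ffmpeg",
        "-y",               # overwrite the output file if it already exists
        "-i", src,          # input file
        "-ar", str(rate),   # output sample rate in Hz
        "-ac", "1",         # single channel (an assumption, see lead-in)
        dst,
    ]


def resample(src: str, dst: str, rate: int = 16000) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_resample_cmd(src, dst, rate), check=True)
```

Keeping the command construction separate from its execution makes it easy to inspect or log the exact command before running it over a large corpus.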

Text and Metadata Handling

Text preprocessing in GigaSpeech retains a limited set of punctuation marks so that the corpus can support research areas such as punctuation restoration. Metadata is stored in a single JSON file that records details such as URLs, paths, segments, and speaker annotations, and it is expected to expand with future updates.
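To make the metadata layout and the punctuation handling concrete, here is a small sketch that walks a metadata-like structure and optionally strips punctuation tags before training. Everything specific in it is an assumption made for illustration: the field names (`audios`, `segments`, `sid`, `text_tn`), the tag spellings (`<COMMA>`, `<PERIOD>`, and so on), and the sample path and URL. Check them against the actual metadata file before relying on them.

```python
import json
import re

# Hypothetical excerpt shaped like the metadata file; all names are assumed.
SAMPLE_METADATA = json.loads("""
{
  "audios": [
    {
      "path": "audio/podcast/example_0001.opus",
      "url": "https://example.com/episode-1",
      "segments": [
        {"sid": "example_0001_S0001", "begin_time": 0.0, "end_time": 4.2,
         "text_tn": "HELLO <COMMA> AND WELCOME BACK <PERIOD>"},
        {"sid": "example_0001_S0002", "begin_time": 4.2, "end_time": 6.0,
         "text_tn": "ARE YOU READY <QUESTIONMARK>"}
      ]
    }
  ]
}
""")

# Assumed spelling of the retained punctuation tags.
PUNCTUATION_TAGS = re.compile(r"<(COMMA|PERIOD|QUESTIONMARK|EXCLAMATIONPOINT)>")


def iter_segment_text(metadata, keep_punctuation=False):
    """Yield (audio path, segment id, text) for every segment."""
    for audio in metadata["audios"]:
        for segment in audio["segments"]:
            text = segment["text_tn"]
            if not keep_punctuation:
                text = PUNCTUATION_TAGS.sub("", text)
            # Collapse the extra spaces left behind by removed tags.
            yield audio["path"], segment["sid"], " ".join(text.split())
```

Passing `keep_punctuation=True` leaves the tags in place, which is the form a punctuation restoration experiment would want as its target text.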

Collaboration and Support

The GigaSpeech project is a collaborative effort with contributions from institutions such as Tsinghua University and Xiaomi Corporation. The team invites the community to help expand the project's scope, add support for more tasks, and contribute to the dataset's growth.

Future Enhancements

The team plans to keep extending the dataset by incorporating more diverse audio sources and adding support for further tasks such as speaker identification.

In sum, GigaSpeech offers a large, multi-domain corpus for ASR research and development, and the project invites community collaboration to keep extending it.