open-speech-corpora - Diverse Open Speech Datasets Supporting Both Research and Commercial Applications

Open Speech Corpora

The Open Speech Corpora project maintains a list of open speech datasets relevant to research and development in speech technology. It favors free or low-cost options and datasets publicly released under open licenses such as Creative Commons or the Community Data License Agreement. Not every dataset meets these criteria fully, but all of those listed are accessible and usable for research or commercial applications, depending on the license. The project also encourages the community to suggest new additions to the list.

CC-0 Licensed Datasets

CC-0 dedicates a work to the public domain, so datasets under this license can be used without restrictions, including without attribution. Key datasets include:

Common Voice: This multilingual set offers over 20,000 hours of data, collected from multiple speakers, making it invaluable for various speech technologies.
Yesno: A Hebrew dataset comprising 6 minutes of speech from a single male speaker.
LJ Speech Corpus: Provides about 24 hours of English speech from a single female speaker, well suited to training speech synthesis models.
NST Datasets: Available in Danish, Swedish, and Norwegian, these datasets cover ASR (Automatic Speech Recognition), Dictation, and Speech Synthesis tasks.

CC-BY Licensed Datasets

The CC-BY license allows broad usage, provided attribution is given. Some standout datasets include:

ARU Speech Corpus: Offers British English speech recorded from 12 speakers.
LibriSpeech: Known for its extensive range of around 1,000 hours of English audio, this dataset serves as a benchmark for ASR tasks.
NCHLT: A suite of datasets for various African languages, aiding linguistic diversity in speech research.

Other Licenses: CC-BY-SA, CC-BY-ND, CC-BY-NC, and CC-BY-NC-SA

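Some datasets come with additional restrictions compared to CC-0 or CC-BY: SA (share-alike) requires derivatives to carry the same license, ND (no derivatives) prohibits distributing modified versions, and NC (non-commercial) rules out commercial use. Notable examples are: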

Google Javanese, Nepali, Bengali, etc.: Under the CC-BY-SA license, these datasets support research in lesser-resourced languages.
IBM Recorded Debates: An English dataset under the CC-BY-ND license. Its debate recordings cover a range of speech patterns and styles.
TV3Parla and Russian Open STT Corpus: These CC-BY-NC datasets are excellent resources for working with Catalan and Russian speech data, respectively.
CHiME-Home: Available under the CC-BY-NC-SA license, this English dataset is tailored for studying speech in domestic environments.
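Because the license tag determines what a dataset may be used for, it can help to encode the tags and filter on them before committing to a corpus. The following is a minimal sketch, not part of the project itself: the `Dataset` class, its method names, and the `usable_commercially` helper are hypothetical, and the license tags are simply copied from the lists above. It relies only on the standard meaning of the Creative Commons elements (BY, SA, ND, NC). The actual terms shipped with each dataset should still be checked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record pairing a dataset name with its license tag."""
    name: str
    license: str  # e.g. "CC-BY-NC-SA", as written in the list above

    def elements(self) -> set:
        # "CC-BY-NC-SA" -> {"BY", "NC", "SA"}; "CC-0" -> {"0"}
        return set(self.license.upper().split("-")[1:])

    def requires_attribution(self) -> bool:
        return "BY" in self.elements()

    def allows_commercial_use(self) -> bool:
        return "NC" not in self.elements()

    def allows_derivatives(self) -> bool:
        return "ND" not in self.elements()


# A subset of the datasets named in this document, with the licenses stated here.
DATASETS = [
    Dataset("Common Voice", "CC-0"),
    Dataset("LJ Speech Corpus", "CC-0"),
    Dataset("LibriSpeech", "CC-BY"),
    Dataset("IBM Recorded Debates", "CC-BY-ND"),
    Dataset("TV3Parla", "CC-BY-NC"),
    Dataset("CHiME-Home", "CC-BY-NC-SA"),
]


def usable_commercially(datasets):
    """Return the names of datasets whose license tag has no NC element."""
    return [d.name for d in datasets if d.allows_commercial_use()]


if __name__ == "__main__":
    for name in usable_commercially(DATASETS):
        print(name)
```

Splitting the tag on hyphens keeps the sketch short, but it assumes tags are written exactly in the `CC-…` form used in this list. Version suffixes such as `4.0` or other license families (for example the Community Data License Agreement) would need separate handling.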

In short, Open Speech Corpora collects datasets useful to anyone working in speech technology, with an emphasis on open access and community-driven expansion. Its coverage of many languages and applications, from ASR to speech synthesis, makes it a practical starting point for finding training and evaluation data.