97:43:54 hours | 62.2 GB speech data | 1,916 speakers | 1,916 audio segments | 48 kHz | 16-bit WAV
The LDC-IL Multi-Lingual Raw Speech Corpus dataset is extracted from the raw speech corpora published by LDC-IL in various Indian languages. It is built to serve applications such as language-identification modules, which require samples from multiple languages; to support the study of cross-linguistic variation and diatopic comparison, in order to determine what generalizations are possible about the types of variable features; and to support the building of multilingual phoneme sets and models.
The Multi-Lingual speech dataset is sampled from the ‘Creative Text-T2’ content type, which is drawn mainly from literary sources. The creative text of the LDC-IL Speech dataset comprises essays or short stories. One of these essays or short stories, selected randomly from a data set, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.
The available Speech Corpus details:
Total Speakers: 1,916 (958 Female and 958 Male)
A detailed explanation of the Multi-Lingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.
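The listing states that the audio is 48 kHz, 16-bit WAV. A minimal sketch of how a user might check downloaded files against those figures is given below, using Python's standard `wave` module. The function names and the folder layout are assumptions made for illustration, not part of the LDC-IL distribution, and the sketch assumes the files carry plain PCM headers that `wave` can read.

```python
import wave
from pathlib import Path

# Values taken from the listing; the constant names are illustrative.
EXPECTED_RATE_HZ = 48_000      # "48 kHz"
EXPECTED_SAMPLE_WIDTH = 2      # "16 bit" means 2 bytes per sample


def describe_wav(path):
    """Return (sample_rate, sample_width_bytes, channels, duration_seconds)."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        width = wf.getsampwidth()
        channels = wf.getnchannels()
        duration = wf.getnframes() / rate
    return rate, width, channels, duration


def matches_listing(path):
    """True if the file has the sample rate and bit depth stated in the listing."""
    rate, width, _channels, _duration = describe_wav(path)
    return rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH


def total_hours(folder):
    """Sum the duration of every .wav file found under `folder`, in hours."""
    seconds = sum(describe_wav(p)[3] for p in Path(folder).rglob("*.wav"))
    return seconds / 3600
```

Calling `total_hours` on the extracted corpus folder gives a figure that can be compared with the stated 97:43:54 hours, and `matches_listing` flags any file whose header differs from the stated format.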
For research-based work, please use the following citations:
Choudhary, Narayan Kumar, Rajesha N. & Manasa G. 2021. Multilingual Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview
Tags: Multilingual, Raw Speech Corpus, Speech Corpus