Multilingual Raw Speech Corpus
  • Contributor: CIIL Mysore
  • Product Code: CIIL-MUL-RAW-Speech-139
Added on: 27 Aug 2021

Dataset Description

97:43:54 hours | 62.2 GB speech data | 1,916 speakers | 1,916 audio segments | 48 kHz | 16-bit WAV
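
A downloaded file can be checked against the stated 48 kHz, 16-bit WAV format with Python's standard `wave` module. The sketch below is illustrative only: the function name `check_wav` is hypothetical, and the demo writes its own one-second silent mono file rather than assuming anything about the corpus's file names or channel count, which this page does not state.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 48_000   # sample rate stated in the dataset description
EXPECTED_SAMPLE_BYTES = 2   # 16 bit = 2 bytes per sample


def check_wav(path: str) -> dict:
    """Read a WAV header and report whether it matches the stated format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()
        channels = wav.getnchannels()
    return {
        "sample_rate_hz": rate,
        "bits_per_sample": width * 8,
        "channels": channels,
        "duration_s": frames / rate,
        "matches_spec": rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_BYTES,
    }


# Demo on a synthetic file (NOT a corpus file): one second of silence, mono.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(EXPECTED_SAMPLE_BYTES)
    out.setframerate(EXPECTED_RATE_HZ)
    out.writeframes(b"\x00\x00" * EXPECTED_RATE_HZ)

info = check_wav(demo_path)
print(info)
```

To check a real file, pass its path to `check_wav` in place of `demo_path`.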


 

The LDC-IL Multilingual Raw Speech Corpus is extracted from the raw speech corpora that LDC-IL has published in various Indian languages. It is built for applications such as language identification modules, which require samples from multiple languages; for exploring cross-linguistic variation and diatopic comparison, to determine what generalizations are possible about the types of variable features; and for building multilingual phoneme sets and models.


The sample for the multilingual speech dataset is drawn from the content type ‘Creative Text-T2’, which is extracted mainly from literary sources. The creative text of the LDC-IL Speech dataset comprises essays and short stories. One of these, selected at random from the data set, is assigned to each speaker to read aloud. The same story may be read by multiple speakers.

Details of the available speech corpus:


Total speakers: 1,916 (958 female and 958 male)


Each row gives duration (h:mm:ss), number of speakers, and size (GB) for the two gender groups (958 speakers each; the column headers that identify which group is female and which is male were not preserved), followed by the combined totals.

Language     Gender group 1                 Gender group 2                 Total
             Duration  Speakers  Size (GB)  Duration  Speakers  Size (GB)  Duration  Speakers  Size (GB)
Assamese     2:33:40   68        1.64       2:34:33   64        1.65       5:08:13   132       3.30
Bengali      2:38:34   56        1.59       2:47:32   61        1.69       5:26:06   117       3.29
Bodo         2:30:39   42        1.61       2:41:04   40        1.72       5:11:43   82        3.34
Dogri        1:16:44   30        0.84       1:35:00   31        1.01       2:51:44   61        1.84
Gujarati     2:32:10   45        1.63       2:30:40   42        1.61       5:02:50   87        3.25
Hindi        2:37:28   44        1.66       2:30:18   44        1.57       5:07:46   88        3.23
Kannada      2:37:06   45        1.68       2:32:50   48        1.63       5:09:56   93        3.32
Kashmiri     2:32:26   30        1.63       2:39:46   29        1.71       5:12:12   59        3.34
Konkani      2:50:24   62        1.82       2:41:25   62        1.74       5:31:49   124       3.57
Maithili     2:46:28   54        1.71       2:53:31   50        2.00       5:39:59   104       3.48
Malayalam    2:38:16   68        1.69       2:28:17   61        1.59       5:06:33   129       3.29
Manipuri     2:15:42   29        1.45       2:44:43   32        1.76       5:00:25   61        3.22
Marathi      2:38:26   56        1.70       2:41:57   58        1.73       5:20:23   114       3.43
Nepali       2:51:09   44        1.83       2:58:41   52        1.91       5:49:50   96        3.75
Odia         2:38:24   63        1.70       2:32:10   60        1.63       5:10:34   123       3.33
Punjabi      2:41:13   67        1.72       2:35:40   62        1.66       5:16:53   129       3.40
Tamil        2:35:24   78        1.57       2:45:20   70        1.66       5:20:44   148       3.24
Telugu       2:06:18   24        1.33       3:00:40   38        1.93       5:06:58   62        3.27
Urdu         2:20:22   53        1.50       2:48:54   54        1.81       5:09:16   107       3.31
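
The combined figures at the end of each row can be cross-checked against the headline numbers (97:43:54 hours, 1,916 speakers, 62.2 GB). The following is a minimal sketch of that arithmetic; the values are copied from the last three figures of each row above, and the helper names `to_seconds` and `to_hms` are illustrative, not part of any LDC-IL tooling.

```python
# Per-language totals copied from the table:
# (language, total duration h:mm:ss, total speakers, total size in GB)
TOTALS = [
    ("Assamese", "5:08:13", 132, 3.30),
    ("Bengali", "5:26:06", 117, 3.29),
    ("Bodo", "5:11:43", 82, 3.34),
    ("Dogri", "2:51:44", 61, 1.84),
    ("Gujarati", "5:02:50", 87, 3.25),
    ("Hindi", "5:07:46", 88, 3.23),
    ("Kannada", "5:09:56", 93, 3.32),
    ("Kashmiri", "5:12:12", 59, 3.34),
    ("Konkani", "5:31:49", 124, 3.57),
    ("Maithili", "5:39:59", 104, 3.48),
    ("Malayalam", "5:06:33", 129, 3.29),
    ("Manipuri", "5:00:25", 61, 3.22),
    ("Marathi", "5:20:23", 114, 3.43),
    ("Nepali", "5:49:50", 96, 3.75),
    ("Odia", "5:10:34", 123, 3.33),
    ("Punjabi", "5:16:53", 129, 3.40),
    ("Tamil", "5:20:44", 148, 3.24),
    ("Telugu", "5:06:58", 62, 3.27),
    ("Urdu", "5:09:16", 107, 3.31),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds: int) -> str:
    """Convert a number of seconds back to h:mm:ss."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


total_seconds = sum(to_seconds(duration) for _, duration, _, _ in TOTALS)
total_speakers = sum(speakers for _, _, speakers, _ in TOTALS)
total_size_gb = round(sum(size for _, _, _, size in TOTALS), 2)

print(to_hms(total_seconds), total_speakers, total_size_gb)
# prints: 97:43:54 1916 62.2
```

The sums agree with the headline duration, speaker count, and data size given in the dataset description.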


A detailed explanation of the Multilingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.

 

For research citations, please cite the following:

Narayan Kumar Choudhary, Rajesha N., Manasa G. 2021. Multilingual Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview”

    Tags: Multilingual, Raw Speech Corpus, Speech Corpus
