Multilingual Raw Speech Corpus

Contributor: CIIL Mysore
Product Code: CIIL-MUL-RAW-Speech-139

Added on: 27 Aug 2021

Dataset Description

The LDC-IL Multilingual Raw Speech Corpus is extracted from the raw speech corpora that LDC-IL has published in various Indian languages. It is built for applications that require samples from multiple languages, such as language identification modules; for exploring cross-linguistic variation and diatopic comparison, to determine what generalizations are possible about the types of variable features; and for building multilingual phoneme sets and models.

The multilingual speech dataset is sampled from the ‘Creative Text-T2’ content type, which is extracted mainly from literary sources. The creative text of the LDC-IL speech dataset comprises essays and short stories. One of these essays or short stories, selected at random from a data set, is assigned to a speaker to read aloud. The same story may be read by multiple speakers.

The available speech corpus details:

Total speakers: 1916 (958 female and 958 male)

Each row gives the recording duration (h:mm:ss), the number of speakers, and a size figure (unit not stated) for each of the two gender groups, followed by the combined totals. The speaker counts of each group sum to 958.

Language    Group 1                     Group 2                     Total
            Duration  Speakers  Size    Duration  Speakers  Size    Duration  Speakers  Size
Assamese    2:33:40   68        1.64    2:34:33   64        1.65    5:08:13   132       3.30
Bengali     2:38:34   56        1.59    2:47:32   61        1.69    5:26:06   117       3.29
Bodo        2:30:39   42        1.61    2:41:04   40        1.72    5:11:43   82        3.34
Dogri       1:16:44   30        0.84    1:35:00   31        1.01    2:51:44   61        1.84
Gujarati    2:32:10   45        1.63    2:30:40   42        1.61    5:02:50   87        3.25
Hindi       2:37:28   44        1.66    2:30:18   44        1.57    5:07:46   88        3.23
Kannada     2:37:06   45        1.68    2:32:50   48        1.63    5:09:56   93        3.32
Kashmiri    2:32:26   30        1.63    2:39:46   29        1.71    5:12:12   59        3.34
Konkani     2:50:24   62        1.82    2:41:25   62        1.74    5:31:49   124       3.57
Maithili    2:46:28   54        1.71    2:53:31   50        2.00    5:39:59   104       3.48
Malayalam   2:38:16   68        1.69    2:28:17   61        1.59    5:06:33   129       3.29
Manipuri    2:15:42   29        1.45    2:44:43   32        1.76    5:00:25   61        3.22
Marathi     2:38:26   56        1.70    2:41:57   58        1.73    5:20:23   114       3.43
Nepali      2:51:09   44        1.83    2:58:41   52        1.91    5:49:50   96        3.75
Odia        2:38:24   63        1.70    2:32:10   60        1.63    5:10:34   123       3.33
Punjabi     2:41:13   67        1.72    2:35:40   62        1.66    5:16:53   129       3.40
Tamil       2:35:24   78        1.57    2:45:20   70        1.66    5:20:44   148       3.24
Telugu      2:06:18   24        1.33    3:00:40   38        1.93    5:06:58   62        3.27
Urdu        2:20:22   53        1.50    2:48:54   54        1.81    5:09:16   107       3.31
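
The duration and speaker figures in the rows above can be cross-checked with simple arithmetic. The following is a minimal Python sketch, assuming (as the numbers suggest) that the last duration and speaker count in each row are the totals of the two preceding groups. The helper names `to_seconds`, `to_hms`, and `check_row` are hypothetical and not part of any LDC-IL tooling, and only three rows are copied in for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string such as '2:33:40' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert seconds back to the h:mm:ss form used in the listing."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Three rows copied from the listing (assumed column order):
# language, group-1 duration, group-1 speakers,
# group-2 duration, group-2 speakers, total duration, total speakers
ROWS = [
    ("Assamese", "2:33:40", 68, "2:34:33", 64, "5:08:13", 132),
    ("Dogri", "1:16:44", 30, "1:35:00", 31, "2:51:44", 61),
    ("Telugu", "2:06:18", 24, "3:00:40", 38, "5:06:58", 62),
]


def check_row(row) -> bool:
    """Return True if the two group figures add up to the listed totals."""
    _, dur1, spk1, dur2, spk2, dur_total, spk_total = row
    durations_match = to_seconds(dur1) + to_seconds(dur2) == to_seconds(dur_total)
    speakers_match = spk1 + spk2 == spk_total
    return durations_match and speakers_match


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], "consistent" if check_row(row) else "inconsistent")
```

The same check can be extended to the remaining languages by adding their rows to `ROWS`. The third figure of each group is left out here because its unit is not stated in the listing.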

A detailed explanation of the Multilingual Raw Speech Corpus will be available in the Multilingual Raw Speech Documentation.

For research citations, please cite the following:

Choudhary, Narayan Kumar, Rajesha N. & Manasa G. 2021. Multilingual Raw Speech Corpus. Central Institute of Indian Languages, Mysore.

Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview”

Tags: Multilingual, Raw Speech Corpus, Speech Corpus
