Indian English Raw Speech Corpus - Bengali Variant

Contributor: CIIL Mysore
Product Code: CIIL-BEN-RAW-Speech-140

Added on: 27 Aug 2021

Dataset Description

English developed from Anglo-Saxon, the prominent language of Britain in the Middle Ages, and colonists carried it to every corner of the world. It is the most visible legacy of the British in India: India was under the British Raj for almost two centuries, and English is part of the education system here. Most Indian states use their own regional languages and share no common language, so English is used for inter-state communication.

LDC-IL has 25 hours of Indian English - Bengali Variant speech data. The LDC-IL Indian English speech dataset is made up of several types of material: word lists, sentences, texts and date formats. Approximately 15 minutes of speech per speaker was recorded from 27 female and 26 male speakers of different age groups whose mother tongue is Bengali. Each speaker recorded items selected at random from a master dataset.

The available Speech Corpus details:

Total Speakers: 53 (27 female and 26 male)

Domain	Audio Segments	Duration (h:mm:ss)
Contemporary Text (News)	52	6:03:15
Creative Text	52	2:41:17
Sentence	1300	1:29:35
Date Format	104	0:08:56
Command and Control Words	2882	3:09:13
Person Name	1040	0:33:56
Place Name	519	1:30:22
Most Frequent Word - Part	1442	1:22:38
Most Frequent Word - Full Set	5985	6:01:44
Phonetically Balanced	1782	1:52:21
Form and Function - Word	886	0:53:54
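
As a quick consistency check, the per-domain figures in the table above can be totalled. The following is a minimal sketch only: the rows are copied by hand from the table, and the helper names (to_seconds, to_hms) are made up for this illustration and are not part of any LDC-IL tooling.

```python
# Rows copied from the table above: (domain, audio segments, duration as h:mm:ss).
ROWS = [
    ("Contemporary Text (News)", 52, "6:03:15"),
    ("Creative Text", 52, "2:41:17"),
    ("Sentence", 1300, "1:29:35"),
    ("Date Format", 104, "0:08:56"),
    ("Command and Control Words", 2882, "3:09:13"),
    ("Person Name", 1040, "0:33:56"),
    ("Place Name", 519, "1:30:22"),
    ("Most Frequent Word - Part", 1442, "1:22:38"),
    ("Most Frequent Word - Full Set", 5985, "6:01:44"),
    ("Phonetically Balanced", 1782, "1:52:21"),
    ("Form and Function - Word", 886, "0:53:54"),
]


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds (hypothetical helper)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert a number of seconds back to an h:mm:ss string (hypothetical helper)."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in ROWS)
total_seconds = sum(to_seconds(duration) for _, _, duration in ROWS)

print("Audio segments:", total_segments)
print("Total duration:", to_hms(total_seconds))
```

The segment counts add up to 16,044, which matches the figure listed under Speech Data Attributes. The per-domain durations add up to 25:47:11, about three minutes less than the listed total of 25:50:17; this page does not say what accounts for the difference.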

A detailed explanation of the Indian English Raw Speech Corpus - Bengali Variant will be available in the Indian English Raw Speech Corpus - Bengali Variant Documentation.

For research citations, please cite the following:

Ramamoorthy L., Narayan Kumar Choudhary, Arundhati Sengupta, Rejitha KS, Rajesha N. & Manasa G. 2021. Indian English Raw Speech Corpus - Bengali Variant. Central Institute of Indian Languages, Mysore.
Choudhary, Narayan, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.

Speech Data Attributes
Annotation	Raw Speech Corpus
Language	Bengali
Duration	25:50:17
Speaker Type	Native
No. of Audio Segments	16,044
Speaker Gender	Male and Female

Tags: Indian English, Raw Speech Corpus, Bengali Variant, Speech Corpus
