- Contributor: CIIL Mysore
- Product Code: CIIL-TAM-RAW-Speech-138
Dataset Description
139:11:41 Hours | 86 GB speech data | 452 Speakers | 60,287 Audio segments | 48 kHz | 16 bit wav.
Tamil is one of the longest-surviving classical languages in the world and one of the prominent languages of the Dravidian language family. It is widely spoken in the state of Tamil Nadu and the Union Territory of Pondicherry, in Sri Lanka, in Southeast Asian countries such as Burma, Malaysia, Singapore, Indonesia and Indochina, and in Fiji, South Africa, British Guiana, and islands such as Mauritius and Madagascar. Tamil has official status in the Indian state of Tamil Nadu and the Indian Union Territory of Pondicherry, and it is an official language in countries such as Sri Lanka and Singapore. Tamil has its own script. The language is highly agglutinative in nature. Tamil has phonological simplicity, morphological parity and primitiveness. All affixes in Tamil are separable and significant, and the language lacks a nominative case termination and arbitrary words.
The LDC-IL speech data was collected from the regions of Kongu, Kumari, Madurai, Nellai, Salem and Thanjai, from speakers of both genders and different age groups. Each speaker recorded datasets that were randomly selected from a master dataset.
The available Speech Corpus details:
Total Speakers: 452 (214 Female and 219 Male)
Domain | Audio Segments | Duration (HH:MM:SS) |
Contemporary Text (News) | 433 | 57:53:48 |
Creative Text | 429 | 14:21:31 |
Sentence | 10,764 | 14:51:03 |
Date Format | 842 | 01:20:17 |
Command and Control Words | 12,882 | 12:57:06 |
Person Name | 8,755 | 03:57:29 |
Place Name | 4,002 | 10:34:38 |
Most Frequent Word - Part | 12,813 | 11:14:05 |
Most Frequent Word - Full Set | 2,000 | 02:26:05 |
Phonetically Balanced | 3,860 | 04:55:10 |
Form and Function - Word | 3,507 | 04:40:29 |
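As a quick consistency check, the per-domain figures listed above can be summed and compared with the headline totals (60,287 audio segments, 139:11:41 hours). The following is a minimal sketch, not part of the LDC-IL distribution; the names `DOMAINS`, `to_seconds` and `to_hms` are hypothetical helpers introduced only for this illustration.

```python
# Per-domain figures copied from the listing above:
# (domain, audio segments, duration as HH:MM:SS)
DOMAINS = [
    ("Contemporary Text (News)", 433, "57:53:48"),
    ("Creative Text", 429, "14:21:31"),
    ("Sentence", 10764, "14:51:03"),
    ("Date Format", 842, "01:20:17"),
    ("Command and Control Words", 12882, "12:57:06"),
    ("Person Name", 8755, "03:57:29"),
    ("Place Name", 4002, "10:34:38"),
    ("Most Frequent Word - Part", 12813, "11:14:05"),
    ("Most Frequent Word - Full Set", 2000, "02:26:05"),
    ("Phonetically Balanced", 3860, "04:55:10"),
    ("Form and Function - Word", 3507, "04:40:29"),
]


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to H:MM:SS (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_segments = sum(count for _, count, _ in DOMAINS)
total_duration = to_hms(sum(to_seconds(d) for _, _, d in DOMAINS))

print(total_segments)  # 60287
print(total_duration)  # 139:11:41
```

Both sums agree with the totals stated in the dataset description.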
A detailed explanation of the Tamil Speech Corpus will be available in the Tamil Speech Data Documentation.
For research-based citations, please cite the following:
- Ramamoorthy, L., Narayan Choudhary, Thennarasu S, Prem Kumar L R, Amudha R, Prabagaran R, Srikanth D. 2021. Tamil Raw Speech Corpus. Central Institute of Indian Languages, Mysore.
- Narayan Choudhary, Rajesha N., Manasa G. & L. Ramamoorthy. 2019. “LDC-IL Raw Speech Corpora: An Overview” in Linguistic Resources for AI/NLP in Indian Languages. Central Institute of Indian Languages, Mysore. pp. 160-174.
Speech Data Attributes | |
Annotation | Raw Speech Corpus |
Language | Tamil |
Duration | 139:11:41 |
Speaker Type | Native |
No. of Audio Segments | 60,287 |
Speaker Gender | Male and Female |
Tags: Tamil, Raw Speech Corpus, Speech Corpus