Call for Collaborative Participation in Collecting Speech Data in Indian Languages
Call for Participation - Collecting manually transcribed speech data in Indian Languages in Public-Private Partnership 2020
If interested, please respond by 20th July 2020 to
The Natural Language Translation Mission (NLTM) originated from the Prime Minister's Office and is part of the Government's plan to remove language barriers in India. It is managed by the Principal Scientific Adviser, and the associated ministry is the Ministry of Electronics and Information Technology (MeitY). NLTM, also referred to as Bahubhashak, will focus on Indian language technologies through its components: Automatic Speech Recognition (Umesh, IIT Madras), Text-to-Speech (Hema, IIT Madras), text-to-text Machine Translation (Dipti, IIIT-Hyd; Pushpak, IITB; and Ajai, CDAC), and Optical Character Recognition (A. G. Ramakrishnan, IISc). The main objective of Bahubhashak is to have Indian language technology systems and products deployed in the field with the help of start-ups. The co-ordinating institutes will provide technical and research support for the deployment of these technologies through start-ups.
Prof. S. Umesh is co-ordinating the Automatic Speech Recognition (ASR) part of Bahubhashak (BBASR). The main challenge in developing a good ASR system is the availability of an adequate amount of high-quality, manually transcribed speech data in Indian languages. The lack of such data is a huge barrier for new start-ups and for research by academic and research institutions. It is therefore of paramount importance to build this speech data infrastructure so that ASR technologies can be boosted. One of the goals of Bahubhashak is to create this national speech resource.
In the Pilot project of Bahubhashak, we plan to collect data in Hindi, Indian English and Tamil. The goal is to collect a mix of telephone and wide-band speech with an emphasis on conversational speech, with read speech making up no more than 20% of the data. The target is about 3000 hours of Hindi, 2000 hours of Indian English and 1000 hours of Tamil in the pilot project. Since speech resources in Indian languages are scarce, we encourage participation from companies that already have an established record of collecting data and building in-house ASR systems, so that the quality of the manually transcribed data is ensured. After participating organisations have had a lead time of 18 months for access to the data, the resource will be shared in the open domain to help new start-ups and research organisations. To keep the corpus growing, new start-ups that use the data will be requested to contribute equivalent data to the corpus over a period of 18 months.
We plan to operate in a public-private partnership mode. All participating organisations will collect data by bearing about 75% of the cost, while up to 25% of the funds will be provided from MeitY project funds through IIT-Madras. The data collected by the participating organisations will be shared with each other in a collaborative manner, under a multiparty agreement that protects the rights of all stakeholders. For example, if there are 5 participating organisations, they will share all the data collected with each other, so the data pool will be 5 times what is collected by one organisation. Since about 75% of the data collection cost comes from the organisations, they will have the first right of use of the data. During the first 18 months, therefore, only 25% of the data will be released (corresponding to the public money contribution), and the remaining data will be released into the public domain at regular intervals after 18 months.
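The pooling arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the function names and the figure of 1000 hours per organisation are hypothetical and not part of the call, and the 25% figure is the public share stated above.

```python
def pooled_hours(hours_per_participant):
    """Total hours every participant can access once all data is shared."""
    return sum(hours_per_participant)


def early_public_hours(total_hours, public_fraction=0.25):
    """Hours released publicly during the first 18 months (the publicly funded share)."""
    return total_hours * public_fraction


# Hypothetical scenario: 5 organisations, each collecting 1000 hours.
collected = [1000] * 5
pool = pooled_hours(collected)
print(pool)                      # 5000, i.e. 5 times what one organisation collects
print(early_public_hours(pool))  # 1250.0 hours released in the first 18 months
```

With equal contributions the pool is simply the number of participants times one organisation's hours; with unequal contributions the sum still applies, but the "5 times" shorthand no longer does.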
We encourage participation from organisations/companies that have:
(1) A proven record of building in-house ASR systems in Indian languages.
(2) The ability to collect at least 500 hours in Hindi, 250 hours in Indian English and 100 hours in Tamil for the Pilot Bahubhashak project.
(3) The ability to ensure that more than 80% of the collected data is conversational. The data may be a mix of telephone and wide-band speech, with wide-band speech preferred.
Data Collection Details and Funding:
(1) Each participant will collect at least 500 hours in Hindi, 250 hours in Indian English and 100 hours in Tamil for the Pilot Bahubhashak project.
(2) 75% of the cost will be borne by the participant, and IIT-Madras will provide up to 25% of the funding through the MeitY Pilot project (maximum of 10 lakhs per 1000 hours from the MeitY project).
(3) The total data collected as part of this agreement will be shared by all the participants with each other as well as with IIT-Madras. This data can be used internally to improve the systems developed by each company/organisation, but cannot be shared with or sold to parties outside the agreement for 2 years from the start of the agreement.
(4) During the first 18 months, IIT-Madras will release 25% of the data into the public domain. This will be done at regular intervals of 4-6 months.
(5) The remaining data will be released at quarterly intervals after 18 months of the agreement. During the first 18 months, only the participants and IIT-Madras will have access to all the data, and only 25% of the data (for which IITM provided funds) will be released into the public domain by IIT Madras.
(6) All speech data will be collected after obtaining proper permission from the subject/speaker clearly stating that it may be used for research or commercial purposes to improve speech and language technologies.
(7) Speech data will be appropriately anonymised to protect personal information or identity of the subject/speaker.
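As a rough illustration of the funding rule in item (2) above, the Python sketch below computes the split for one participant. It rests on two assumptions that the call does not state: that the cap of 10 lakhs per 1000 hours scales pro rata with the number of hours collected, and that any amount above the cap falls to the participant. The function name and the cost figures are hypothetical.

```python
LAKH = 100_000  # rupees


def funding_split(total_cost, hours, public_share=0.25, cap_per_1000_hours=10 * LAKH):
    """Return (public_funds, participant_funds) in rupees.

    Assumption: the cap is pro-rated by hours collected, and any
    shortfall above the cap is borne by the participant.
    """
    cap = cap_per_1000_hours * hours / 1000
    public_funds = min(public_share * total_cost, cap)
    return public_funds, total_cost - public_funds


# Minimum commitment from item (1): 500 + 250 + 100 = 850 hours.
hours = 500 + 250 + 100

# Hypothetical total cost of 20 lakhs: 25% (5 lakhs) is under the 8.5 lakh cap.
print(funding_split(20 * LAKH, hours))  # (500000.0, 1500000.0)

# Hypothetical total cost of 50 lakhs: 25% (12.5 lakhs) exceeds the cap.
print(funding_split(50 * LAKH, hours))  # (850000.0, 4150000.0)
```

In the second case the participant's effective share rises above 75%, which is why item (2) describes the public contribution as "up to" 25%.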
Steps to participate in this Collaborative Effort for Data Collection
If your company/organisation is interested in participating in this national effort to build speech data infrastructure, please send an email to
with the following documents:
(1) A brief write-up of the company/organisation with background in building ASR systems.
(2) The amount of data that the company/organisation is willing to collect and the type of data (telephone/wide-band as well as conversational/read speech). A minimum of 500 hours in Hindi, 250 hours in Indian English and 100 hours in Tamil is expected from each participant so that the Pilot can collect data quickly.
(3) An approximate budget and timeline for data collection as envisaged by the company/organisation.
(4) A cover letter stating a willingness to participate in the data collection in PPP mode, with about 75% of the cost being borne by the company/organisation and 25% coming from MeitY funds, and agreeing that the data will be shared with the other data collection participants for a period of 2 years before being released into the public domain.
If interested, please respond by 20th July 2020 to:
Prof. S. Umesh
Room No CSD-310
Dept. of Electrical Engineering
Dates and deadlines: 20th July 2020