Evaluation of Lip Syncing of Videos in Speech to Speech Translation from English to Indian Languages
(pranaw@cdac.in, bira@cdac.in, hema@cse.iitm.ac.in)
As part of the National Language Translation Mission (funded by MeitY, Govt. of India), IIT Madras and CDAC Mumbai are jointly organising a challenge on "Lip-Syncing in Speech-to-Speech Translation". The challenge aims to help and encourage the advancement of speech-to-speech translation of videos into Indian languages. The basic task is to take an input video in English, create an output video in Hindi or Tamil, and lip-sync the output video. Exact lip-syncing is not possible because the words change in translation; what is expected is a transcreation of the video in the Indian language in which the video and audio are in sync.
Speech technologies and Natural Language Processing (NLP) have attracted growing attention in recent times. Recent advances have shown that Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech Synthesis (TTS) systems, with some human intervention, can produce usable speech-to-speech translation. When a video is translated from one language to another, the duration of the speech created in the target language differs from the duration of the speech in the source language. As a result, the target-language speech does not fit the original video, and lip-syncing is required. Various approaches to lip-syncing are in use around the world. In India, the speech research community has grown significantly in recent years. It is therefore necessary to understand and compare the research techniques used for lip-syncing. The primary objective of this challenge is to understand and compare the different lip-syncing approaches while identifying capable groups working in this domain in the country.
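The duration mismatch described above can be made concrete with a small calculation. The sketch below is only an illustration and not a method prescribed by the organisers: it assumes the simplest possible fix, uniformly speeding up or slowing down the target-language speech so that it lasts as long as the source segment. The function name `tempo_factor` and the example durations are hypothetical.

```python
def tempo_factor(source_duration_s: float, target_duration_s: float) -> float:
    """Return the playback-speed factor that fits target speech into the source duration.

    A value above 1.0 means the target speech must be played faster;
    a value below 1.0 means it must be slowed down.
    """
    if source_duration_s <= 0 or target_duration_s <= 0:
        raise ValueError("durations must be positive")
    return target_duration_s / source_duration_s


# Hypothetical example: a 4.0 s English segment whose Hindi synthesis lasts 5.0 s.
factor = tempo_factor(4.0, 5.0)
print(f"{factor:.2f}")  # 1.25
```

A factor far from 1.0 usually makes uniformly stretched speech sound unnatural, which is one reason participants may prefer other approaches, such as adjusting the translation or modifying the video.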
There will be two types of tasks:
NB: Participants may take part in one or both tasks, and in one or both language pairs.
A video of around 20 minutes' duration, similar (with respect to domain, speaker, type, etc.) to the input video, along with the corresponding SRT, speech file in target language, and corresponding speech file, will be released to participants 4 weeks after the announcement of the challenge. Each participant will get a different video. Participants will have to do the lip-syncing, generate a video in the target language, and share it with us. A 1-minute clip from the source video and the corresponding clip from the target video will be taken for evaluation.
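Since this task ships an SRT file, a participant will typically need the start and end time of each subtitle cue in order to know how long each translated segment may last. The following is a minimal sketch, assuming the standard SRT timing line format `HH:MM:SS,mmm --> HH:MM:SS,mmm`; the helper name `cue_times` is hypothetical and the released files may need additional handling.

```python
import re
from typing import Tuple

# Standard SRT timing line, e.g. "00:01:02,500 --> 00:01:05,000"
_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def cue_times(timing_line: str) -> Tuple[float, float]:
    """Parse one SRT timing line and return (start, end) in seconds."""
    match = _TIMING.search(timing_line)
    if match is None:
        raise ValueError(f"not an SRT timing line: {timing_line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end


start, end = cue_times("00:01:02,500 --> 00:01:05,000")
print(start, end, end - start)  # 62.5 65.0 2.5
```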
A video of around 20 minutes' duration, similar (with respect to domain, speaker, type, etc.) to the input video, will be released to participants 4 weeks after the announcement of the challenge. Each participant will get a different video. Participants will have to do speech-to-speech translation and lip-syncing, generate a video in the target language, and share both videos (source and target) with us. A 1-minute clip from the source video and the corresponding clip from the target video will be taken for evaluation.
Along with the built system, each participant will have to submit a write-up (1-2 pages) covering the following:
The organisers will conduct a Mean Opinion Score (MOS) test to evaluate the submitted target-language video clips. Evaluators will watch the target-language video and rate its lip-syncing accuracy on a scale of 1 to 5 (the higher the score, the better the quality).
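A Mean Opinion Score is the arithmetic mean of the evaluators' ratings for a clip. The sketch below illustrates that calculation together with an approximate 95% confidence interval. The interval (a normal approximation with z = 1.96) is an assumption added for illustration; the organisers do not state how they will aggregate or report the scores, and the ratings shown are made up.

```python
from math import sqrt
from statistics import stdev
from typing import Sequence


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """Average the ratings for one clip; each rating must lie on the 1-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return sum(ratings) / len(ratings)


def confidence_half_width(ratings: Sequence[float], z: float = 1.96) -> float:
    """Half-width of an approximate confidence interval (normal approximation)."""
    if len(ratings) < 2:
        return 0.0  # the spread cannot be estimated from a single rating
    return z * stdev(ratings) / sqrt(len(ratings))


# Hypothetical ratings from five evaluators for one target-language clip.
ratings = [4, 5, 3, 4, 4]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f}")  # MOS = 4.00
print(f"+/- {confidence_half_width(ratings):.2f}")
```

With only a handful of evaluators the interval is wide, so small differences in MOS between two submissions should be read with caution.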
The organisers will conduct a Degradation Mean Opinion Score (DMOS) test to evaluate each submitted target-language video clip with reference to the original-language video clip. Evaluators will compare the target-language video with the source-language video clip and rate the quality of the output video on a scale of 1 to 5 (the higher the score, the better the quality). The rating covers audio quality, lip-syncing performance, and, above all, how well the information in the source video is preserved in the target video.
Those who perform well in this challenge may get the following opportunities:
Interested parties should register as soon as possible using the link below:
You need to provide the following information in the form available at the above link:
There is no registration fee.
Date / Month | Event |
---|---|
26th August 2021 | Announcement of challenge |
6th September 2021 | Last date of registration |
As soon as registration is confirmed | Data release |
6th October 2021 | Submission of system for evaluation (by midnight PDT) |
October 2021 | Evaluation of system |
November 2021 | Release of results |
The license for the released data will be shared with the participants. Data will be released to each participant once the appropriate license has been agreed to.
Development tools, useful scripts, and other resources helpful in developing lip-syncing systems are given below.
These may be helpful during development. They are for reference only; participants are free to use any tools or technologies to build their systems.
This is a challenge designed to answer scientific questions, not a competition. Therefore, we rely on your honesty in preparing your entry.
For further information please contact pranaw@cdac.in, amolb@cdac.in, cs19s032@smail.iitm.ac.in with cc to hema@cse.iitm.ac.in.
Pranaw Kumar (Mob) : +91-7303226768
Mano Ranjith (Mob) : +91-9025318193
Amol Bole (Mob) : +91-9422787940