Lip Sync Challenge

CHALLENGE OVERVIEW

Speech technologies and Natural Language Processing (NLP) have attracted growing attention in recent years. Recent advances show that Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech Synthesis (TTS) systems, with some human intervention, can produce usable speech-to-speech translation. When a video is translated from one language to another, however, the duration of the speech generated in the target language differs from the duration of the speech in the source language. As a result, the target-language speech does not fit the original video, and lip-syncing is required. Various approaches to lip-syncing are in use around the world. In India, the speech research community has grown significantly in recent years, and it is necessary to understand and compare the research techniques used for lip-syncing. The primary objective of this challenge is to understand and compare the different lip-syncing approaches while identifying the groups in the country that work effectively in this domain.
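To make the duration mismatch concrete, here is a minimal sketch of one simple way to quantify it: the playback-speed factor that would make a target-language utterance last exactly as long as the source segment. The function name and the clamping bounds (0.8 and 1.25) are illustrative assumptions, not part of the challenge rules, and time-stretching is only one of many possible ways to handle the mismatch.

```python
def tempo_factor(source_duration: float, target_duration: float,
                 min_factor: float = 0.8, max_factor: float = 1.25) -> float:
    """Speed factor to apply to the target audio so it fits the source segment.

    A factor above 1 means the target speech must be sped up, below 1 slowed
    down. The bounds are hypothetical limits chosen for this sketch.
    """
    if source_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    factor = target_duration / source_duration
    # Clamp so the stretched speech stays within the assumed limits.
    return max(min_factor, min(max_factor, factor))
```

For example, a 6-second Hindi utterance that must fit a 5-second English segment gives a factor of 1.2. When the raw ratio falls outside the bounds, the clamped value signals that stretching alone will not close the gap and another strategy is needed.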

Tasks

There will be two types of tasks:

Task 1: Lip-Syncing of source video with target speech (TTS) audio:

Language Pairs:

English to Hindi
English to Tamil

Input Data:

5-6 source videos in English, each 20 to 30 minutes long
Corresponding SRT files (transcriptions and timestamps generated using ASR)
Speech audio files in the target language (generated using TTS), with their transcriptions
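The SRT files pair each transcribed segment with a start and end time, which gives the time window that the target speech has to fit. The sketch below reads such a file into segments. It assumes the standard SubRip layout (an index line, a timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, one or more text lines, then a blank line); the released files may differ in detail, and the names `Segment` and `parse_srt` are introduced here only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Timing line of a standard SubRip block: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(content: str) -> List[Segment]:
    """Parse SRT text into a list of timed segments."""
    segments = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                groups = match.groups()
                # Everything after the timing line is the transcript text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                segments.append(
                    Segment(_to_seconds(*groups[:4]), _to_seconds(*groups[4:]), text)
                )
                break
    return segments
```

Calling `parse_srt(open("video.srt", encoding="utf-8").read())` (file name hypothetical) returns the segments in file order, and each segment's `duration` is the time available for the corresponding target-language speech.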

Target: Participants will have to develop/optimise their lip-syncing algorithm/method to work efficiently on the given videos.

This is a closed task: participants must use only the data given above.

Evaluation: Subjective evaluation of lip-syncing performance of the output video (refer below for more information)
Submission: Submit lip-synced videos in target language (refer below for more information)

Task 2: Speech-to-Speech Translation with Lip-Syncing:

Language Pairs:

English to Hindi
English to Tamil

Input Data: 5-6 videos in English, each 20 to 30 minutes long.
Target: Participants will have to develop/optimise an end-to-end system for speech-to-speech translation of videos with lip-syncing.

Participants may use any ASR, MT, and TTS systems of their choice.
Participants will have to develop/optimise their lip-syncing algorithm/method to work efficiently on the given videos.
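One common way to organise such a system is a cascade in which ASR output feeds MT, and MT output feeds TTS, with the source timestamps carried along for the later lip-syncing step. The sketch below shows only that wiring. The `Utterance` type and `run_cascade` function are hypothetical names, and the three stages are passed in as plain callables because the challenge leaves the choice of ASR, MT, and TTS systems open.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    start: float               # segment start in the source video, seconds
    end: float                 # segment end in the source video, seconds
    source_text: str           # English transcript produced by ASR
    target_text: str = ""      # translation produced by MT
    target_audio: bytes = b""  # synthesised speech produced by TTS


def run_cascade(
    source_audio: bytes,
    asr: Callable[[bytes], List[Utterance]],
    mt: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> List[Utterance]:
    """Run ASR, then MT and TTS per utterance, keeping the source timestamps."""
    utterances = asr(source_audio)
    for utt in utterances:
        utt.target_text = mt(utt.source_text)
        utt.target_audio = tts(utt.target_text)
    return utterances
```

The returned utterances hold both the original time window and the synthesised target speech, which is the information a lip-syncing stage needs in order to align audio and video.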

Evaluation: Subjective evaluation of the quality of the output video. This includes the audio quality, the lip-syncing performance, and above all, how well the information in the source video is preserved in the target video (refer below for more information).
Submission: Submit lip-synced videos in target language (refer below for more information)

NB: Participants may take part in one or both tasks, and in one or both language pairs.

Output Submission and Test Data:

Task 1:

A video of around 20 minutes' duration, similar to the input videos (with respect to domain, speaker, type, etc.), will be released to participants 4 weeks after the announcement of the challenge, together with the corresponding SRT file, the speech file in the target language, and its transcription. Each participant will get a different video. Participants will have to perform lip-syncing, generate a video in the target language, and share it with us. A 1-minute clip from the source video and the corresponding clip from the target video will be taken for evaluation.

Task 2:

A video of around 20 minutes' duration, similar to the input videos (with respect to domain, speaker, type, etc.), will be released to participants 4 weeks after the announcement of the challenge. Each participant will get a different video. Participants will have to perform speech-to-speech translation and lip-syncing, generate a video in the target language, and share both videos (source and target) with us. A 1-minute clip from the source video and the corresponding clip from the target video will be taken for evaluation.
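Since evaluation in both tasks uses a 1-minute clip and its counterpart from the other video, participants may want to preview matching clips before submitting. The sketch below builds `ffmpeg` command lines that cut the same time window from two videos. It assumes `ffmpeg` is installed; the function names and output file names are hypothetical, and how the organisers actually choose and extract the clips is not specified.

```python
from typing import List


def clip_command(src: str, dst: str, start_s: float,
                 duration_s: float = 60.0) -> List[str]:
    """Build an ffmpeg command that copies a window of `src` into `dst`."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start must be >= 0 and duration must be > 0")
    return [
        "ffmpeg",
        "-ss", f"{start_s:.3f}",    # seek to the clip start (seconds)
        "-i", src,
        "-t", f"{duration_s:.3f}",  # clip length (seconds)
        "-c", "copy",               # stream copy: fast, but cuts at keyframes
        dst,
    ]


def paired_clip_commands(source_video: str, target_video: str,
                         start_s: float,
                         duration_s: float = 60.0) -> List[List[str]]:
    """Commands for the same time window in the source and target videos."""
    return [
        clip_command(source_video, "source_clip.mp4", start_s, duration_s),
        clip_command(target_video, "target_clip.mp4", start_s, duration_s),
    ]
```

Each command can be run with `subprocess.run(cmd, check=True)`. Because stream copy cuts at keyframes, the clip boundaries may be slightly off; removing `-c copy` makes ffmpeg re-encode, which gives more accurate cuts at the cost of speed.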

Write Up

Along with the built system, each participant will have to submit a write-up (1-2 pages) covering the following:

Approach
Technology
Data
Challenges faced
Features
Observations
etc...

Listening Test

Task 1:

The organisers will conduct a Mean Opinion Score (MOS) test to evaluate the submitted target language video clips. Evaluators will watch the target language video and rate its lip-syncing accuracy on a scale of 1 to 5 (the higher the score, the better the quality).
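A MOS is the arithmetic mean of the evaluators' ratings for a clip. The sketch below computes it together with an approximate 95% confidence interval half-width. The interval (a normal approximation) is an addition made here for illustration; the organisers do not state how scores will be aggregated or reported, and the function name is hypothetical. The same averaging applies to DMOS ratings.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1 to 5 scale")
    mos = mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0  # no spread can be estimated from a single rating
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width
```

With only a handful of evaluators the interval is wide, so small MOS differences between two systems should be read with caution.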

Task 2:

The organisers will conduct a Degradation Mean Opinion Score (DMOS) test to evaluate each submitted target language video clip with reference to the original language video clip. Evaluators will compare the target language video clip with the source language video clip and rate the quality of the output video on a scale of 1 to 5 (the higher the score, the better the quality). This includes the audio quality, the lip-syncing performance, and above all, how well the information in the source video is preserved in the target video.

Benefits of Participation

Those who perform well in this challenge may get the following opportunities:

A leaderboard with the names and rankings of good performers will be published on the National Platform for Language Technology (https://nplt.in) website.
MHRD and MeitY are planning to engage some agencies to carry out speech-to-speech translation. Good performers would get priority in this process.
Opportunity to participate in the next phase of the project.

Registration

Interested parties should register as soon as possible using the link below:

You need to provide the following information in a form available at the above link:

Preferred team name - the organisers may adjust this so that all teams have meaningful, unique names
Affiliation - the name of your University and lab, or your Company
Contact details:

Main contact person's email address - should be an institutional email address
Backup email address(es)
Postal address
Phone number

You should only register for the challenge if you actually intend to submit an entry to the challenge and to comply with all the rules/guidelines mentioned.

Registration Fee

There is no registration fee.

Provisional Timelines

Date / Month	Event
26th August 2021	Announcement of challenge
6th September 2021	Last date of registration
As soon as registration is confirmed	Data release
6th October 2021	Submission of system for evaluation (by midnight PDT)
October 2021	Evaluation of system
November 2021	Release of results

Licenses

The license for the released data will be shared with the participants. Data will be released to each participant once the appropriate license has been agreed to.

Development tools and Other Resources

Development tools, useful scripts, and other resources helpful in developing lip-syncing systems are given below.

Lip-Syncing Systems

Lip-syncing code using Google API: https://github.com/google/making_with_ml/tree/master/ai_dubs

Automatic Speech Recognition (ASR)

IITM Speech API: https://apim.iiithcanvas.com/devportal/apis/10d3c4a5-eabb-4cf5-a94b-ed4bb8a4de51/overview

Machine Translation (MT)

IIITH MT API: https://apim.iiithcanvas.com/devportal/apis/ddaa6457-2fdc-4dfc-969b-d33d3b39f25b/overview

Text to speech (TTS)

IITM SMTLab API: https://apim.iiithcanvas.com/devportal/apis/82064458-7549-4d1b-87e9-ac4ef9ce7f3c/test

These may be helpful during development. They are listed for reference only; participants are free to use any tools or technologies to build their systems.

How are these rules/guidelines enforced?

This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.
