CALL FOR PARTICIPATION

Lip-Sync Challenge 2021

Evaluation of Lip Syncing of Videos in Speech to Speech Translation from English to Indian Languages

Pranaw Kumar, Bira Chandra Singh, Hema A Murthy

(pranaw@cdac.in, bira@cdac.in, hema@cse.iitm.ac.in)

As part of the National Language Translation Mission (funded by MeitY, Govt of India), IIT Madras and CDAC Mumbai are jointly organising a challenge on "Lip-Syncing in Speech-to-Speech Translation". The challenge aims to encourage progress in speech-to-speech translation of videos into Indian languages. The basic task is to take an input video in English, create an output video in Hindi or Tamil, and lip-sync the output video. Exact lip-syncing is not possible because the words change in translation; what is expected is a transcreation of the video in the Indian language in which the video and audio are in sync.

CHALLENGE OVERVIEW

Speech technologies and Natural Language Processing (NLP) have attracted growing attention in recent years. Recent advances have shown that Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech Synthesis (TTS) systems, with some human intervention, can produce usable speech-to-speech translation. When a video is translated from one language to another, the duration of the speech generated in the target language differs from the duration of the speech in the source language. As a result, the target-language speech does not fit the original video, and lip-syncing is required. Various approaches to lip-syncing are in use around the world. In India, the speech research community has grown significantly in recent years. It is therefore necessary to understand and compare the research techniques used for lip-syncing. The primary objective of this challenge is to understand and compare different lip-syncing approaches while identifying groups in the country that work effectively in this domain.
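The duration mismatch described above can be made concrete with a small sketch. It assumes one simple strategy, adjusting the tempo of the target speech so that each utterance fits the time slot of the source utterance. This is only one of many possible approaches and is not prescribed by the challenge. The names Segment, tempo_factor and residual_mismatch, the clamp limits, and the sample durations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned utterance; all durations are in seconds."""
    source_duration: float  # length of the English speech in the source video
    target_duration: float  # length of the synthesised target-language speech


def tempo_factor(seg: Segment, min_factor: float = 0.8, max_factor: float = 1.25) -> float:
    """Speed-up factor that would make the target speech fit the source slot.

    A factor above 1.0 means the target speech must be played faster.
    The result is clamped because extreme tempo changes sound unnatural;
    the clamp limits are illustrative assumptions.
    """
    if seg.source_duration <= 0:
        raise ValueError("source_duration must be positive")
    raw = seg.target_duration / seg.source_duration
    return max(min_factor, min(max_factor, raw))


def residual_mismatch(seg: Segment, factor: float) -> float:
    """Seconds by which the tempo-adjusted speech still overruns (+) or underruns (-)."""
    return seg.target_duration / factor - seg.source_duration


# Hypothetical durations, not taken from the challenge data.
segments = [Segment(4.0, 5.0), Segment(3.0, 2.7), Segment(2.0, 3.2)]
for seg in segments:
    factor = tempo_factor(seg)
    print(f"factor={factor:.2f} residual={residual_mismatch(seg, factor):+.2f}s")
```

When the residual is non-zero after clamping, tempo adjustment alone is not enough, and the remaining mismatch has to be absorbed elsewhere, for example by adjusting pauses or the video itself.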

Tasks

There will be two types of tasks:


  • Task 1: Lip-Syncing of source video with target speech (TTS) audio:
    • Language Pairs:
      1. English to Hindi
      2. English to Tamil
    • Input Data:
      • 5-6 source videos in English, each 20 to 30 minutes long
      • Corresponding SRT files (ASR-generated transcriptions with time stamps)
      • Speech audio file in the target language (generated using TTS) and its transcription
    • Target: Participants will have to develop/optimise their lip-syncing algorithm/method to work efficiently on the given videos.
      • This task is a closed challenge: participants must use only the data listed above.
    • Evaluation: Subjective evaluation of lip-syncing performance of the output video (refer below for more information)
    • Submission: Submit lip-synced videos in target language (refer below for more information)


  • Task 2: Speech-to-Speech Translation with Lip-Syncing:
    • Language Pairs:
      1. English to Hindi
      2. English to Tamil
    • Input Data: 5-6 videos in English, each 20 to 30 minutes long.
    • Target: Participants will have to develop/optimise an end-to-end system for speech-to-speech translation of videos with lip-syncing.
      • Participants may use any ASR, MT, and TTS systems of their choice.
      • Participants will have to develop/optimise their lip-syncing algorithm/method to work efficiently on the given videos.
    • Evaluation: Subjective evaluation of quality of the output video. This includes the audio quality, lip-syncing performance, and above all, how well the information in the source video is preserved in the target video. (refer below for more information)
    • Submission: Submit lip-synced videos in target language (Refer below for more information)

NB: Participants may take part in one or both tasks, and in one or both language pairs.


Output Submission and Test Data:

Task 1:

A video of around 20 minutes, similar to the input videos (with respect to domain, speaker, type, etc.), will be released to participants 4 weeks after the announcement of the challenge, together with its SRT file and the speech file in the target language. Each participant will get a different video. Participants will have to lip-sync it, generate a video in the target language, and share it with the organisers. A 1-minute clip from the source video and the corresponding clip from the target video will be used for evaluation.
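Since the Task 1 material includes SRT files with time stamps, a first step for many systems is likely to be reading those time stamps. The following is a minimal sketch of an SRT reader using only the Python standard library. It assumes the common SRT layout (an index line, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, then the text); the released files may differ in detail. The names Cue and parse_srt and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass

# SRT time stamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
_TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
_CUE_TIMES = re.compile(_TIME + r"\s*-->\s*" + _TIME)


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a valid timing line are skipped."""
    cues = []
    normalised = content.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = _CUE_TIMES.search(line)
            if match:
                parts = match.groups()
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(_to_seconds(*parts[:4]), _to_seconds(*parts[4:]), text))
                break
    return cues


# Hypothetical sample, not taken from the challenge data.
sample = """1
00:00:01,000 --> 00:00:03,500
Welcome to the lecture.

2
00:00:04,000 --> 00:00:06,250
Today we discuss sorting.
"""

cues = parse_srt(sample)
for cue in cues:
    print(f"{cue.start:7.3f} -> {cue.end:7.3f}  {cue.text}")
```

The difference between end and start for each cue gives the time slot that the corresponding target-language speech has to fit into.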


Task 2:

A video of around 20 minutes, similar to the input videos (with respect to domain, speaker, type, etc.), will be released to participants 4 weeks after the announcement of the challenge. Each participant will get a different video. Participants will have to perform speech-to-speech translation and lip-syncing, generate a video in the target language, and share both videos (source and target) with the organisers. A 1-minute clip from the source video and the corresponding clip from the target video will be used for evaluation.

Write Up

Along with the built system, participants must submit a write-up (1-2 pages) covering the following:

  • Approach
  • Technology
  • Data
  • Challenges faced
  • Features
  • Observations
  • etc...

Listening Test

Task 1:

The organisers will conduct a Mean Opinion Score (MOS) test to evaluate the submitted target-language video clips. Evaluators will watch the target-language video and rate its lip-syncing accuracy on a scale of 1 to 5 (a higher score indicates better quality).
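As an illustration of how ratings on a 1 to 5 scale are commonly summarised, the sketch below computes a mean opinion score with an approximate 95% confidence interval. The call does not state how the organisers will aggregate scores, so the confidence interval, the function name mean_opinion_score, and the sample ratings are assumptions made for illustration only.

```python
import math
import statistics


def mean_opinion_score(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1 to 5 scale")
    mos = float(statistics.mean(ratings))
    if len(ratings) < 2:
        # A spread cannot be estimated from a single rating.
        return mos, float("nan")
    # Normal approximation; with few evaluators a t-distribution gives a wider interval.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [4, 5, 3, 4, 4]  # hypothetical scores from five evaluators
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps show whether a difference between two submissions is larger than the disagreement among evaluators.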


Task 2:

The organisers will conduct a Degradation Mean Opinion Score (DMOS) test to evaluate each submitted target-language video clip against the original source-language clip. Evaluators will compare the target-language clip with the source-language clip and rate the quality of the output video on a scale of 1 to 5 (a higher score indicates better quality). The rating covers audio quality, lip-syncing performance, and above all, how well the information in the source video is preserved in the target video.

Benefits of Participation

Those who perform well in this challenge may get the following opportunities:

  • A leaderboard with the names and rankings of good performers will be published on the National Platform for Language Technology (https://nplt.in) website.
  • MHRD and MeitY are planning to engage some agencies to do speech-to-speech translation. Good performers would get priority in this process.
  • Opportunity to participate in the next phase of the project.

Registration

Interested parties should register as soon as possible using the link below:


REGISTER NOW

You need to provide the following information in a form available at the above link:

  • Preferred team name - the organisers may adjust this so that all teams have meaningful, unique names
  • Affiliation - the name of your University and lab, or your Company
  • Contact details:
    • Main contact person's email address - should be an institutional email address
    • Backup email address(es)
    • Postal address
    • Phone number
  • You should only register for the challenge if you actually intend to submit an entry to the challenge and to comply with all the rules/guidelines mentioned.

Registration Fee

There is no registration fee.

Provisional Timelines

Date / Month                            Event
26th August 2021                        Announcement of challenge
6th September 2021                      Last date of registration
As soon as registration is confirmed    Data release
6th October 2021                        Submission of system for evaluation (by midnight PDT)
October 2021                            Evaluation of system
November 2021                           Release of results

Licenses

The license for the released data will be shared with the participants. Data will be released to each participant once the appropriate license has been agreed to.

Development tools and Other Resources

Development tools, useful scripts, and other resources helpful in developing lip-syncing systems are given below.

These are provided for reference only; participants are free to use any tools or technologies to build their systems.

How are these rules/guidelines enforced?

This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.

Sample Video

Source Video Target Video

Contact us!

For further information please contact pranaw@cdac.in, amolb@cdac.in, cs19s032@smail.iitm.ac.in with cc to hema@cse.iitm.ac.in.

Pranaw Kumar (Mob) : +91-7303226768

Mano Ranjith (Mob) : +91-9025318193

Amol Bole (Mob) : +91-9422787940