Indian Language Technology Proliferation and Deployment Centre

Standardization

What is ASCII?

The American Standard Code for Information Interchange (ASCII) is a 7-bit character-encoding scheme based on the ordering of the English alphabet. Most modern character-encoding schemes, such as ISCII, are based on ASCII. The basic composition of ASCII has been accepted by the majority of the text-processing community worldwide.
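As a small illustration of the alphabetical ordering mentioned above, the following Python sketch prints the ASCII codes of a few letters. The helper name ascii_codes is only an illustrative choice.

```python
def ascii_codes(text):
    """Return the ASCII code of each character in text.

    Raises UnicodeEncodeError if text contains a non-ASCII character.
    """
    return list(text.encode("ascii"))


codes = ascii_codes("ABC")
print(codes)                          # [65, 66, 67]
print(all(c < 128 for c in codes))    # True: every ASCII code fits in 7 bits
```

Consecutive letters receive consecutive codes, which is what "based on the ordering of the English alphabet" means in practice.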

What is ISCII?

The Indian Script Code for Information Interchange (ISCII) code table is a superset of all the characters required in the ten Brahmi-based Indian scripts. The ISCII standard specifies a 7-bit code table that can be used in 7-bit or 8-bit ISO-compatible environments. It allows English and Indian script alphabets to be used simultaneously: in an 8-bit environment, the ISCII characters occupy the upper 128 code positions, while the lower 128 remain ASCII.
A common alphabet for all the Indian scripts is possible because they share a common origin in the ancient Brahmi script. The ISCII code contains only the basic characters required by the Indian scripts; all composite characters are formed by combining these basic characters.
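The idea that composite characters are built from basic ones can be seen in a short Python sketch. It uses Unicode code points as a stand-in for ISCII byte values (an assumption made for convenience, since Unicode follows the same principle for Indic scripts), and the helper name describe is illustrative.

```python
import unicodedata

# The Devanagari conjunct "ksha" is stored as three basic characters:
# consonant KA + virama (halant) + consonant SSA.
conjunct = "\u0915\u094D\u0937"


def describe(text):
    """List the code point and Unicode name of every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(conjunct):
    print(code, name)
# U+0915 DEVANAGARI LETTER KA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0937 DEVANAGARI LETTER SSA
```

The rendering engine and the font combine these three stored characters into a single visible shape.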

What is the INSCRIPT keyboard layout?

The INSCRIPT (INdian SCRIPT) layout uses the standard 101-key keyboard. The mapping of characters is common to all the Indian languages written left to right, because their basic character set is common. This design lets a user move seamlessly from the INSCRIPT keyboard of one language to that of another. A user who knows the character placement on the keyboard of one language can type in another script without knowing that script.
In the INSCRIPT layout, vowels and their corresponding vowel signs share the same key and are located on the left portion of the keyboard. The consonants are placed on the right portion, arranged so that the consonants of one varg are split over two keys.

What is the WCAG 2.0 standard?

The Web Content Accessibility Guidelines (WCAG) 2.0 documents explain how to make Web content accessible to people with disabilities. Web "content" generally refers to the information in a Web page or Web application, including text, images, forms, sounds, and such.
Please refer: http://www.w3.org/TR/WCAG20/

What is the role of W3C India Office?

The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding. The Indian W3C Office (W3C-India) is the national contact point for W3C activities in India.
Please refer: http://www.w3cindia.in/

Are ISO/IEC 10646 and Unicode the same thing?

No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.
Please Refer: http://www.unicode.org

What is the relation between ISO/IEC 10646 and Unicode?

Both ISO/IEC 10646 and Unicode specify the same character encoding. They contain the same characters at the same locations and remain fully synchronized even as they are extended to cover additional characters.
Please Refer: http://www.unicode.org

What is the basic difference between Unicode and ISCII?

The most basic difference is the code width: ISCII is an 8-bit code, whereas Unicode was originally designed as a 16-bit code (its code space has since been extended beyond 16 bits). ISCII maintains a single logical value for a character across all the Brahmi-based Indian scripts, whereas Unicode assigns a separate block of codes to each script, so the same character has a different value in each script. Unicode also encodes some additional characters.
Please Refer: http://www.unicode.org
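The difference can be made concrete with a short Python sketch. The Unicode blocks for the Indic scripts were laid out in parallel, following ISCII, so the same logical letter usually sits at the same offset inside each script's block. The table BLOCK_START and the helper letter_in_script are illustrative names, and only four scripts are listed.

```python
import unicodedata

# First code point of each script's 128-code-point block in Unicode.
BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gujarati": 0x0A80,
    "Tamil": 0x0B80,
}

KA_OFFSET = 0x15  # position of the letter KA inside each block


def letter_in_script(script, offset):
    """Return the character at the given offset in the script's block."""
    return chr(BLOCK_START[script] + offset)


for script in BLOCK_START:
    ch = letter_in_script(script, KA_OFFSET)
    print(script, f"U+{ord(ch):04X}", unicodedata.name(ch))
# Devanagari U+0915 DEVANAGARI LETTER KA
# Bengali U+0995 BENGALI LETTER KA
# Gujarati U+0A95 GUJARATI LETTER KA
# Tamil U+0B95 TAMIL LETTER KA
```

In ISCII the letter KA has one value and the script is chosen separately; in Unicode each script has its own KA. Not every offset is filled in every script, so an offset-based conversion like this is only a rough approximation of transliteration.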

What is the Unicode policy for character encoding?

The Unicode Standard regularly requires updating to expand its repertoire of characters. New characters are added to meet a variety of uses, ranging from technical symbols to letters for regional scripts or for archaic languages. Character properties are also expanded or revised to meet new implementation requirements. However, changes to the standard must be constrained by the requirements of backward compatibility between versions. To that end, the "Unicode Character Encoding Stability" Policy limits the ways in which the Unicode Standard and related Unicode specifications can change. The Unicode Technical Committee is responsible for the technical adherence of its standards and specifications to this policy.
Please Refer: http://www.unicode.org

Tools and Technologies

What is a font?

A font is the design for a set of characters. It combines a typeface with other qualities such as size, pitch, and spacing, and it displays a set of symbols through well-defined shapes. In the days of printing machines that used movable type, fonts were created by craftsmen and artists. Today, fonts are created by artists and designers who work with computer-based tools. Font creation is both labour-intensive and creative: both a science and an art. It is a myth that fonts are of one type only; in fact, there are as many kinds of fonts as there are requirements in the industry. The fonts used for word processing are very different from those used on mobile devices, and more different still from those used for display on TV screens. Fonts used for desktop publishing demand a visual beauty that keeps the eyes from getting fatigued while reading.

What are OpenType fonts?

OpenType is a font format that uses 16-bit glyph indices. It was developed by Microsoft and Adobe and works with Unicode-encoded text, with support for all Indic scripts and others. Some OpenType fonts contain around 10,000 glyph shapes so that characters can be properly recognised and identified. OpenType fonts are used by modern Linux and Windows applications.

How do I perform speech synthesis?

Concatenative synthesis and formant synthesis are the two methods of generating synthesized speech.
Concatenative Synthesis :- Concatenative synthesis is based on the concatenation of segments of recorded speech. This method produces the most natural-sounding speech. There are three main sub-types of concatenative speech synthesis:
1) Unit Selection Synthesis - Unit selection synthesis uses large databases of recorded speech. Each recorded utterance is segmented into individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. An index of the units in the speech database is then created based on the segmentation and acoustic parameters. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database. This method provides the greatest naturalness.
2) Diphone Synthesis - Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques. The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers.
3) Domain-Specific Synthesis - Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, such as transit schedule announcements or weather reports. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
Formant Synthesis :- Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using additive synthesis and an acoustic model. This type of synthesis generates artificial, robotic-sounding speech. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems.
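The contrast between the two methods can be sketched in a few lines of Python. This is a toy illustration only: the "recorded" units are short sine tones standing in for real recordings, the diphone labels and frequencies are made-up values, and the additive function is a crude stand-in for a real formant synthesizer, which would use a proper acoustic model.

```python
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)


def tone(freq_hz, duration_ms):
    """Stand-in for a recorded speech unit: a short sine tone."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit database: diphone label -> stored samples.
UNIT_DB = {
    "n-a": tone(220, 50),
    "a-m": tone(330, 50),
    "m-a": tone(440, 50),
}


def concatenate(labels):
    """Concatenative synthesis in its simplest form: join stored units in order."""
    samples = []
    for label in labels:
        samples.extend(UNIT_DB[label])
    return samples


def additive(freqs_hz, duration_ms):
    """Additive synthesis: sum sine components instead of using recordings."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [
        sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in freqs_hz) / len(freqs_hz)
        for i in range(n)
    ]


speech = concatenate(["n-a", "a-m", "m-a"])
vowel = additive([700, 1200, 2600], 50)
print(len(speech), len(vowel))  # 1200 400
```

The concatenative path needs the UNIT_DB table at runtime, while the additive path needs only a handful of numbers, which is why formant synthesizers tend to be much smaller programs.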

Application Showcase

What is "word by word" translation?

In "word-by-word" translation, each word or phrase is translated as the machine understands it. Grammar is not considered, so this approach is not very effective and can lead to the following problems:
Lexical ambiguity – a word can have multiple meanings (e.g. bank, saw, shot, run; 'strikes' can be either a verb meaning to hit or a noun meaning refusals to work).
Synonymy – different words can have the same basic meaning, e.g. "see, look" or "run, jog".
Transposition – word order differs between English and Indian languages.
Idioms / Phrases – word-by-word translation of idioms and phrases may not hold good.
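The transposition problem above can be reproduced with a very small Python sketch. The three-entry lexicon is hypothetical and exists only for this illustration; it is not how the translation systems described here work internally.

```python
# Hypothetical toy lexicon: English word -> Hindi equivalent.
LEXICON = {
    "i": "मैं",
    "eat": "खाता हूँ",
    "rice": "चावल",
}


def word_by_word(sentence):
    """Translate each word independently, keeping the source word order.

    Words missing from the lexicon are passed through unchanged.
    """
    return " ".join(LEXICON.get(word, word) for word in sentence.lower().split())


print(word_by_word("I eat rice"))
# Output keeps English order (subject, verb, object): मैं खाता हूँ चावल
# A fluent Hindi sentence places the verb last:      मैं चावल खाता हूँ
```

Every word is translated correctly, yet the sentence is wrong, because English places the verb before the object while Hindi places it after.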

How do these translation programs operate? Are they easy to use?

Both translation programs, English to Indian language machine translation and Indian language to Indian language machine translation, are easy to operate.
First, select whether you want English to Indian language or Indian language to Indian language machine translation.
For English to Indian language translation, enter the text you want to translate: you can either type (or copy and paste) the text or upload the text file that needs to be translated. Next, select the target language. Then click on the Translate button to get the translation in the selected target language. Where the software offers multiple translation options, you can choose the one that is correct or nearest to the meaning of the source text.
For Indian language to Indian language translation, enter the text you want to translate: you can either type (or copy and paste) the text or submit the URL of the web page that needs to be translated. Next, select the source and target language pair. Then click on the Translate button to get the translation in the selected target language. If the URL of a web page is submitted, the corresponding web page is translated on the fly.

Can I add words and phrases to the dictionary of translation programs?

No. You cannot add words and phrases to the dictionary of the translation programs.

Can I translate web pages and e-mail?

A web page in one Indian language can be translated into another Indian language. Provide the URL of the web page to be translated, select the target language, and then translate; the selected web page will be translated. However, this feature is available only for certain languages. Translation of e-mail is not available.

What quality of translation can I expect from translation software?

Compared to the rest of the world, machine translation systems in India are at a nascent stage. English to Indian language and Indian language to Indian language systems for the tourism domain have been made available for your feedback. A lot of development is under way to improve the accuracy further.

How can I translate letters and other paper documents with my computer?

In principle, a special OCR (optical character recognition) program could scan the letter or document so that the recognized text can then be translated. However, this facility is not available for English to Indian language machine translation or Indian language to Indian language machine translation.

What is machine translation?

Machine translation (MT) is an automatic translation process in which computer software translates a document or text from one natural language (L1, the source language) to another natural language (L2, the target language). It does so by considering the grammatical structure and nature of each language and by using rules to transfer the grammatical structure of the source language into the target language. The meaning of the text in the source language must be fully preserved in the translation.

Miscellaneous

How many scheduled languages exist?

There are 22 constitutionally recognized Indian languages. They are Assamese, Bangla, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Manipuri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu.

What is the difference between a script and a language?

A script is basically a way or mechanism of writing. One script may be used for writing different languages, and one language may be written using different scripts. For example, the Devnagari script is used for writing languages such as Hindi, Marathi, Konkani, Santali, Dogri, Sindhi and Kashmiri. At the same time, Konkani is written in both the Roman script and the Devnagari script, and Santali can be written in the Devnagari script as well as the Ol Chiki script.

Examples of one script used for multiple languages:

1. Hindi Language - आपका नाम क्या है (Devnagari Script)

2. Marathi Language - गव्हाचे पीठ शुभ्र होईपर्यंत अगदी बारीक करून घ्यावे (Devnagari Script)

3. Konkani Language - संयोजन करून दस्तावेज निर्माण आनी वांटणी करपाक (Devnagari Script)

Examples of one language written using multiple scripts:

1. Roman Script - Maheti tantrikotiche baxe tantrikota eka billiona pros chodd bahu-bhaxa bharatiyank eka meka lagim haddchyant nirnnayok patr khellta (Konkani Language)

2. Devnagari Script - माहिती तंत्रज्ञानाचे भास तंत्रगिन्यान लाखांनी भौ-भाशिक भारतियांक एकठाय हाडपान व्हडले योगदान करता. (Konkani Language)
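The distinction between script and language can also be checked programmatically. The Python sketch below guesses the script of a text from the first word of each character's Unicode name; this is a simple heuristic chosen for illustration, not an official script-detection method, and the helper name scripts_used is hypothetical. It cannot tell which language the text is in.

```python
import unicodedata


def scripts_used(text):
    """Guess the scripts in text from the first word of each character name.

    Only letters and combining marks are inspected; spaces, digits and
    punctuation are ignored.
    """
    found = set()
    for ch in text:
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


print(scripts_used("आपका नाम क्या है"))       # {'DEVANAGARI'}  (Hindi)
print(scripts_used("गव्हाचे पीठ"))            # {'DEVANAGARI'}  (Marathi)
print(scripts_used("Maheti tantrikotiche"))  # {'LATIN'}       (Konkani)
```

Hindi and Marathi text report the same script, while Konkani text can report either Devanagari or Latin depending on how it was written.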

Which operating systems support Indian languages?

Indian languages are supported by different operating systems, including various versions of Windows and Linux. Each of these operating systems supports a different number of languages, and the latest versions are likely to support more languages. For more information about script and font support, please refer: http://msdn.microsoft.com/en-us/goglobal/bb688099.aspx

How to enable Indian languages on Windows OS?

Enabling Indian languages will allow you to use different features of the Operating Systems. There are different ways of enabling Indian languages on different Operating Systems.

Enabling Indic support on Windows XP

1. Go to Start -> Control Panel -> Date, Time, Language and Regional Options.

2. Click on Regional and Language Options.

3. Click on the Languages tab.

4. Under the heading 'Supplemental language support', check the item 'Install files for complex script and right-to-left languages (including Thai)'.

5. Allow the OS to install the necessary files from the Windows XP disc, then reboot.

What to select as User Locale, Location and System Locale?

1. Click on the 'Regional Options' tab to set the User Locale and Location.

2. Under the heading 'Standards and formats', select Hindi or any other language as your User Locale from the drop-down box. This selection determines the settings for numbers, currencies, times and dates, as well as the sorting rules for the language.

3. Under the heading 'Location', select the country where you are physically located, such as India.

Setting up the Indian Language Keyboards or Input Locales

In the Regional and Language Options panel, click on the Languages tab.

1. Once the Languages tab is selected, click on the 'Details' button to install different input locales or keyboards.

2. Click on the 'Add' button to add a keyboard for a particular language.

3. In the drop-down box, select Hindi or Marathi as the Input Language.

4. A corresponding keyboard layout/IME will be selected automatically. Click OK to close the dialog boxes.

Like Windows 2000, Windows XP offers the following Devanagari keyboard layouts:

1. Hindi-Traditional

2. Hindi-Devanagari-Inscript

3. Marathi

4. Marathi-Devanagari-Inscript

5. Konkani-Devanagari-Inscript

6. Sanskrit-Devanagari-Inscript

The Hindi-Traditional and Marathi keyboards contain all the characters that are traditionally used in Hindi and Marathi and include English punctuation without the need to change to the English keyboard to get at the punctuation. It is the recommended keyboard for most users.

The Devanagari-Inscript keyboard contains an extended Devanagari character set that includes characters for transliterating into Devanagari from other Indian languages as well as some Sanskrit and ancient Vedic characters. This keyboard is recommended for special users.

Enabling Indic support on Windows 2000

Enable Indic Functionality in the OS

1. Go to Start -> Settings -> Regional Options -> General (Tab)

2. In the Language Settings for the System, enable Indic.

3. Copy necessary files from the Windows 2000 disc.

4. Reboot the computer after the files have been copied.

What to Select as User Locale and System Locale?

User Locale:

1. Once Indic language support has been enabled in the OS, we can select any one of the available Indic languages as User Locale.

2. For example, we can select Hindi or Tamil as the User Locale. The User Locale in turn determines the various settings and formats for numbers, currencies, dates and times. However, it is not necessary to select an Indic language as the User Locale; we can select English (US) instead, if the situation demands.

System Locale:

1. This Setting is invoked by clicking on the Set Default command button at the bottom of the Regional Options.

2. By design, no Indian language can be selected as the System Locale. English (US) or (UK) is the best choice in any situation.

Setting up the Indian Language Keyboards or Input Locales

If you selected an Indic language as the User Locale in step 2 above, Windows will automatically add a keyboard for that language.

Otherwise, to add a keyboard for a particular language:

1. Select the Input Locales tab.

2. Click on the Add button located under the Input Language display box.

3. This opens the Add Input Locale dialog box.

4. Select the desired language in the Input Locale drop-down box.

The Devanagari keyboard layouts available in Windows 2000 are:

1. Hindi-Traditional

2. Hindi-Devanagari-Inscript

3. Marathi

4. Marathi-Devanagari-Inscript

5. Konkani-Devanagari-Inscript

6. Sanskrit-Devanagari-Inscript

The Hindi-Traditional and Marathi keyboards contain all the characters that are traditionally used in Hindi and Marathi and include English punctuation without the need to change to the English keyboard to get at the punctuation. It is the recommended keyboard for most users. The Devanagari-Inscript keyboard contains an extended Devanagari character set that includes characters for transliterating into Devanagari from other Indian languages as well as some Sanskrit and ancient Vedic characters. This keyboard is recommended for special users.

For more details about setting up Indian languages on different versions of Windows, please visit https://www.microsoft.com/en-in/bhashaindia/

What are regional settings?

With Regional and Language Options Setting in Control Panel, you can change the format Windows uses to display dates, times, currency amounts, large numbers, and numbers with decimal fractions. You can also choose from a large number of input languages and text services, such as different keyboard layouts, Input Method Editors, and speech and handwriting recognition programs. When you switch to another input language, some programs offer special features, such as font characters or spelling checkers designed for different languages.

By default, products in the Windows OS family install the files for most input languages supported by Windows. However, if you want to enter or display text in the East Asian languages (Chinese, Japanese, or Korean) or the complex script and right-to-left languages (Arabic, Armenian, Georgian, Hebrew, the Indic languages, Thai, or Vietnamese), you can install the language files from the Windows CD-ROM.

Each language has a default keyboard layout, but many languages have alternate versions. Even if you do most of your work in one language, you might want to try other layouts. In English, for example, typing letters with accents might be simpler with the U.S.-International layout.
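As an example of what a regional setting changes, numbers in Indian locales are conventionally grouped as three digits followed by groups of two (lakh and crore), rather than in groups of three throughout. The Python sketch below implements that grouping by hand. It is only an illustration under that assumption: the function name group_indian is hypothetical, and real applications should rely on the locale support of the operating system or a localization library.

```python
def group_indian(n):
    """Group a non-negative integer as 3 digits, then 2s (lakh/crore style)."""
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    parts = []
    while len(head) > 2:
        parts.insert(0, head[-2:])  # peel off two digits at a time from the right
        head = head[:-2]
    parts.insert(0, head)
    return ",".join(parts + [tail])


print(group_indian(1234567))   # 12,34,567  (Indian grouping)
print(f"{1234567:,}")          # 1,234,567  (grouping in threes)
```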

Which tools and technologies are available for Indian languages? Segment-wise: word processing, data processing, desktop publishing, web publishing

1. Word Processing: Bharateeya Open office is a complete office suite, like other commercially available office suites. Its WRITER component is used for word processing.

2. Desktop Publishing: Free and open-source software has been localized into Indian languages. Tools such as SCRIBUS are very useful for desktop publishing work.

3. Web Publishing: To create multilingual websites, text or data, you need at least three things in place: first, a mechanism to type the text or data in that language; second, suitable fonts to render the typed text or data on the web; and third, storage of the text or data in a suitable format using a standard encoding such as Unicode. Once these are available, the rest becomes simple.
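The third requirement, storing text in a standard encoding, can be illustrated with a short Python sketch that stores a Devanagari word as UTF-8 bytes (one of the Unicode encoding forms commonly used on the web) and reads it back. The helper name round_trip is illustrative.

```python
def round_trip(text):
    """Encode text as UTF-8 bytes and decode it back again."""
    data = text.encode("utf-8")
    return data, data.decode("utf-8")


word = "\u0928\u092E\u0938\u094D\u0924\u0947"  # the Hindi greeting "namaste"
data, restored = round_trip(word)

print(len(word), len(data))   # 6 18: six code points, three bytes each
print(restored == word)       # True: nothing is lost in storage
```

Because the stored bytes follow a standard encoding, any browser or application that understands UTF-8 can display the text, provided a suitable font is available.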

What are the minimum components required for providing language support?

At a minimum, the operating system should be enabled for Indian languages. Once it has been enabled, you can start working in Indian languages with the tools provided by the OS, such as the IME (Input Method Editor), User Locale and System Locale.

Linguistic Resources & Tools

What is CLDR, and where is it useful?

The Common Locale Data Repository (CLDR) is a Unicode initiative for collecting and making available commonly used locale-specific information such as the days of the week, number and time formats, and units for measuring distance, weight, quantity, etc. This data is available in XML format.

The Unicode CLDR provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks as:

1. Formatting of dates, time and time zones.

2. Formatting numbers and currency values.

3. Sorting text.

4. Choosing languages or countries by name.

CLDR uses the XML format provided by UTS #35: Locale Data Markup Language (LDML). LDML is a format used not only for CLDR, but also for general interchange of locale data, such as in Microsoft's .NET.
Please Refer: http://cldr.unicode.org/
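To give a feel for how such XML locale data can be consumed, the Python sketch below parses a small hand-written fragment in the style of LDML and extracts day names. The fragment is a simplified illustration written for this example, not a copy of an actual CLDR file, and its element names should be checked against UTS #35 before being relied upon.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of LDML (not real CLDR data).
LDML_SAMPLE = """<ldml>
  <identity><language type="hi"/></identity>
  <dates><calendars><calendar type="gregorian"><days>
    <dayContext type="format"><dayWidth type="wide">
      <day type="sun">रविवार</day>
      <day type="mon">सोमवार</day>
    </dayWidth></dayContext>
  </days></calendar></calendars></dates>
</ldml>"""


def day_names(ldml_text):
    """Map each day key (sun, mon, ...) to its localized name."""
    root = ET.fromstring(ldml_text)
    return {day.get("type"): day.text for day in root.iter("day")}


names = day_names(LDML_SAMPLE)
print(names["sun"], names["mon"])
```

An application would typically load such data once per locale and use it when formatting dates for display.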

Language Technology Industry

What is Language Technology?

Language technology is the study of computer systems that understand and/or synthesize spoken and written human languages. This area includes speech processing (recognition, understanding, and synthesis), information extraction, handwriting recognition, machine translation, text summarization, and language generation. With its wide range of languages, India needs to focus on language technology to bridge the digital divide and make the fruits of Information Technology available to the masses.
