This paper presents an overview of corpus classification and development in electronic format for 16 language pairs, with Hindi as the source language. In a multilingual country like India, the major thrust in language technology lies in providing inter-communication services and direct information access in one's own language. As a result, language technology in India has seen major developments over the last decade in machine translation and speech synthesis systems. As research deepens, the need for high-quality standardised corpora is increasingly seen as a primary challenge. To address this need, the Government of India has initiated a mega project called the Indian Languages Corpora Initiative (ILCI) to collect parallel annotated corpora in 17 scheduled languages of the Indian constitution. The project is currently in its second phase, which aims to collect 8,50,000 parallel annotated sentences in 17 Indian languages in the domains of Entertainment and Agriculture. Together with the 6,00,000 parallel sentences collected in Phase 1 in the domains of Health and Tourism (Choudhary & Jha, 2011), the corpus being developed is one of the largest known parallel annotated corpora for any Indian language to date. This phase will ultimately also see the development of chunking standards for processing the annotated corpus.

Added on June 6, 2016


  More Details
  • Contributed by : Atul
  • Product Type : Research Paper
  • License Type : Freeware
  • System Requirement : Not Applicable
  • Author : Akanksha Bansal, Esha Banerjee and Girish Nath Jha