
Search Results | Total results found: 139

Search refined by: Text Corpora
  Catalogue
Under the Indian Languages Corpora Initiative (ILCI) project initiated by the MeitY, Govt. of India, Jawaharlal Nehru University, New Delhi collected a corpus with Hindi as the source language and translated it into Bodo as the target language. The corpus contains 70,000 sentences from the Health, Tourism, Agriculture and Entertainment domains. Each sentence has a unique sentence ID, and the corpus is UTF-8 encoded and in text file format. The translated sentences have been POS-tagged and chunked. The chunking guideline used in creating this corpus is provided in the supporting document.

Added on April 26, 2019

  More Details
  • Contributed by : ILCI Consortium, JNU
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable

Under the Indian Languages Corpora Initiative (ILCI) project initiated by the MeitY, Govt. of India, Jawaharlal Nehru University, New Delhi collected a corpus with Hindi as the source language and translated it into English as the target language. The corpus contains 70,000 sentences from the Health, Tourism, Agriculture and Entertainment domains. Each sentence has a unique sentence ID, and the corpus is UTF-8 encoded and in text file format. The translated sentences have been POS-tagged and chunked. The chunking guideline is provided in the supporting document.

Last updated on April 29, 2019

  More Details
  • Contributed by : ILCI Consortium, JNU
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable

Under the Indian Languages Corpora Initiative Phase II (ILCI Phase-II) project, initiated by the MeitY, Govt. of India, Jawaharlal Nehru University, New Delhi collected a Gujarati monolingual text corpus. The corpus contains approximately 30,000 general-domain sentences. These sentences have been POS-tagged and chunked. The chunking guideline is provided in the supporting document. The corpus has the following features: unique ID, UTF-8 encoding, and text file format.

Added on April 23, 2019

  More Details
  • Contributed by : ILCI Consortium, JNU
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable

Under the Indian Languages Corpora Initiative (ILCI) project initiated by the MeitY, Govt. of India, Jawaharlal Nehru University, New Delhi collected a corpus with Hindi as the source language and translated it into Gujarati as the target language. The corpus contains 70,000 sentences from the Health, Tourism, Agriculture and Entertainment domains. Each sentence has a unique ID. The translated sentences have been POS-tagged and chunked. The chunking guideline is provided in the supporting document.

Last updated on April 23, 2019

  More Details
  • Contributed by : ILCI Consortium, JNU
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable

This is an author-wise Marathi text corpus for research purposes. It targets data and text mining on Marathi text, including but not limited to author identification, author profiling, sentiment analysis and text summarization. There are two datasets, one per category: comedy articles, and mixed articles (comedy, novels, Lalit lekhan, etc.) by well-known Marathi authors.

Dataset–I is a collection of comedy articles by 5 different authors. One file is prepared per author, containing all articles by that author, so Dataset–I contains 5 files, each named after its author. The number of words per author ranges from 7,006 to 10,411.
Dataset–II is composed of mixed-category articles by 10 different authors, with the number of words per author ranging from 26,874 to 33,722. One file is prepared per author, containing all articles by that author and named after the author. These files are UTF-8 encoded.
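
As a rough illustration of how the per-author word counts above might be checked, the following Python sketch counts whitespace-separated words in each author file. The directory layout, the `.txt` extension, the function name `count_words_per_author`, and whitespace tokenization are all assumptions made for this example; the catalogue entry only states that there is one UTF-8 file per author, named after the author, and it does not say how the published word counts were computed.

```python
from pathlib import Path


def count_words_per_author(corpus_dir, pattern="*.txt"):
    """Return {author_name: word_count} for each file matching `pattern`.

    Assumptions (not stated in the catalogue entry): all author files sit
    in one directory, use a .txt extension, and words are delimited by
    whitespace.
    """
    counts = {}
    for path in sorted(Path(corpus_dir).glob(pattern)):
        # The entry describes the files as UTF-8 encoded and named
        # after the author, so the file stem is used as the author key.
        text = path.read_text(encoding="utf-8")
        counts[path.stem] = len(text.split())
    return counts
```

Calling `count_words_per_author("dataset1")` on a hypothetical `dataset1` folder would return one entry per author file; the minimum and maximum of the returned values can then be compared with the ranges quoted above, bearing in mind that a different tokenization would give different totals.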

This resource was contributed by Mr. Sunil Digamberrao Kale and Dr. Rajesh S. Prasad.

Disclaimer: Responsibility for the performance of this resource/tool lies with the contributor.

Last updated on August 1, 2018

  More Details
  • Contributed by : Kale Sunil Digamberrao, Dr. Rajesh S. Prasad
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable