  Catalogue
Item Name: Marathi 1T 2-gram Version 1
Author(s): Uma Gajendragadkar [umagadkar@gmail.com], Sarang Joshi
Release Date: November 17, 2015
Data Source(s): Web Collection
Application(s): Language Modeling
Language(s): Marathi
Language ID(s): Marathi
Citation: Uma Gajendragadkar (COEP, SPPU, Pune, India) and Sarang Joshi (PICT, SPPU, Pune, India). Marathi 1T 2-gram Version 1. Web Download.
Introduction
This data set, contributed by Uma Gajendragadkar and Sarang Joshi, contains Marathi word n-grams and their observed frequency counts. The n-grams range in length from unigrams (single words) to bigrams (two-word sequences). Three-, four-, and five-grams are available from the authors on request. The data can be used for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
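As an illustration of the language-modeling use mentioned above, the following is a minimal sketch of how unigram and bigram counts can be turned into maximum-likelihood bigram probabilities. The file layout is not described here, so the sketch assumes one n-gram per line in the form `words<TAB>count`; the helper names `parse_counts` and `bigram_probability` and the toy counts are hypothetical and are not taken from the data set.

```python
from collections import Counter


def parse_counts(lines):
    """Parse 'ngram<TAB>count' lines (assumed layout) into a Counter keyed by word tuples."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the n-gram text itself stays intact.
        ngram, count = line.rsplit("\t", 1)
        counts[tuple(ngram.split(" "))] += int(count)
    return counts


def bigram_probability(w1, w2, unigrams, bigrams):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    denom = unigrams.get((w1,), 0)
    if denom == 0:
        return 0.0  # w1 never observed; no estimate without smoothing
    return bigrams.get((w1, w2), 0) / denom


# Toy counts invented for illustration only (not values from the data set).
unigram_lines = ["मी\t10", "घरी\t4", "जातो\t5"]
bigram_lines = ["मी घरी\t4", "घरी जातो\t2"]

unigrams = parse_counts(unigram_lines)
bigrams = parse_counts(bigram_lines)
print(bigram_probability("मी", "घरी", unigrams, bigrams))
```

Unseen bigrams receive probability zero under this estimate, so a practical model would normally add smoothing or back-off on top of the raw counts.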
Source Data
The n-gram counts were generated from approximately 29 crore (290 million) word tokens of text from publicly accessible Web pages.
Character Encoding
The input encoding of each document was detected automatically, and all text was converted to UTF-8.
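The detection method used for this data set is not specified. As a rough sketch of the general idea only, the snippet below tries a short list of candidate encodings in order and re-encodes the first successful decode as UTF-8. The function name `to_utf8` and the candidate list are assumptions made for illustration; real Web text usually needs a more robust detector.

```python
def to_utf8(raw, candidates=("utf-8", "utf-16")):
    """Return (utf8_bytes, detected_encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
        return text.encode("utf-8"), enc
    raise ValueError("no candidate encoding matched")


sample = "मराठी"
converted, detected = to_utf8(sample.encode("utf-16"))
print(detected, converted.decode("utf-8") == sample)
```

Trying strict encodings first matters: a permissive single-byte encoding placed early in the list would accept almost any input and hide the real encoding.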
Data Sizes
File sizes: approx. 170 MB of text files
Number of tokens: 290,406,855
Number of sentences: 109,277,834
Number of raw unigrams: 765,589
Number of unigrams: 588,797
Number of bigrams: 3,470,365

Added on November 17, 2015


  More Details
  • Contributed by : Uma Gajendragadkar, Sarang Joshi
  • Product Type : Text Corpora
  • License Type : Research
  • System Requirement : Not Applicable