The BIS standard "IS 16333 (Part 3)" defines the requirements for mobile handsets: text input in English, Hindi and at least one additional official Indian language, along with message readability on the phone for all 22 official Indian languages. To help mobile manufacturers with internal verification and with checking the effectiveness of language support, TDIL along with CDAC-GIST have prepared robust test data covering the relevant language's Consonants (C), Vowels (V), Numerals (N), Matras (M), Halant (H), Diacritics (D), and combinations of C, V, N, M, H, D, along with a word list and sentences. The test data thus created can be used to test text input and display on mobile handsets.
For best viewing, download the SakalBharati font.
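As a rough illustration of what "combinations of C, M, H" means in practice, the sketch below builds consonant + matra and consonant + halant + consonant sequences for a tiny Devanagari subset. The character lists and function names are assumptions chosen for illustration; they are not taken from the TDIL/CDAC-GIST test data, which covers far more characters and languages.

```python
# Illustrative subsets only (assumption); the real test data is much larger.
CONSONANTS = ["\u0915", "\u0916", "\u0917"]  # क ख ग
MATRAS = ["\u093E", "\u093F", "\u0940"]      # ा ि ी
HALANT = "\u094D"                            # ्


def cm_combinations(consonants, matras):
    """Return every consonant + matra sequence, e.g. का."""
    return [c + m for c in consonants for m in matras]


def chc_combinations(consonants, halant=HALANT):
    """Return every consonant + halant + consonant sequence, e.g. क्ख."""
    return [a + halant + b for a in consonants for b in consonants]


if __name__ == "__main__":
    # Print the strings so they can be typed and rendered on a handset.
    for s in cm_combinations(CONSONANTS, MATRAS) + chc_combinations(CONSONANTS):
        print(s)
```

Strings generated this way can be compared against what a handset actually accepts as input and renders on screen.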
Authorship identification is the task of identifying who wrote a given piece of text from a given set of candidate authors (suspects). The increasingly large volume of text on the Internet makes authorship identification an urgent need. A large amount of work has already been done for the English language. Comparatively less research has been carried out for Indian regional languages such as Tamil, Telugu, Bengali and Punjabi, and no such experiment is available for Marathi.
Added on July 17, 2018
Contributed by : Individual
Product Type : Research Paper
License Type : Freeware
System Requirement :
Author : Kale Sunil Digamberrao, Dr. Rajesh S. Prasad
This is an author-wise Marathi text corpus for research purposes. Intended research areas are data and text mining on Marathi text, including but not limited to author identification, author profiling, sentiment analysis and text summarization. There are two datasets, based on two categories: one contains comedy articles, and the other contains mixed articles, i.e. comedy, novels, Lalit lekhan etc. by well-known Marathi authors.
Dataset–I is a collection of comedy articles by 5 different authors. One file is prepared for each author, containing all articles by that author, so Dataset–I contains 5 files, each named after its author. The number of words per author ranges from a minimum of 7,006 to a maximum of 10,411.
Dataset–II is composed of articles of the mixed category. It covers 10 different authors, with a minimum of 26,874 and a maximum of 33,722 words. One file is prepared for each author, containing all articles by that author and named after the author. These files are encoded in UTF-8.
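Since each author has one UTF-8 file named after the author, the per-author word counts can be checked with a short script. This is a minimal sketch: the function name, the directory argument and the whitespace-based definition of a "word" are assumptions, and the counts reported above may have been computed differently.

```python
from pathlib import Path


def word_counts(corpus_dir):
    """Map each author (file name without extension) to a word count.

    Assumes one UTF-8 file per author inside corpus_dir and counts
    whitespace-separated tokens as words.
    """
    counts = {}
    for path in sorted(Path(corpus_dir).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8")
            counts[path.stem] = len(text.split())
    return counts
```

Calling `word_counts` on the folder holding one of the datasets gives a dictionary that can be compared with the minimum and maximum figures listed for that dataset.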
This corpus contains 1077 Telugu audio files from 1073 speakers and a transcriptions folder containing the .lab transcription file for each audio file. The data was prepared for the Agricultural Commodity domain, and the size of the corpus is 5.4 GB.
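A quick consistency check for a corpus laid out this way is to pair every audio file with its .lab transcription. The sketch below assumes that audio files use a `.wav` extension and that an audio file and its transcription share the same base name; neither detail is stated in the corpus description, so adjust both to the actual layout.

```python
from pathlib import Path


def pair_audio_with_labels(audio_dir, label_dir, audio_ext=".wav"):
    """Return (pairs, missing).

    pairs   -- list of (audio path, transcription text) tuples
    missing -- names of audio files that have no matching .lab file
    Matching is by base file name (an assumption about the layout).
    """
    labels = {p.stem: p for p in Path(label_dir).glob("*.lab")}
    pairs, missing = [], []
    for audio in sorted(Path(audio_dir).glob(f"*{audio_ext}")):
        label = labels.get(audio.stem)
        if label is None:
            missing.append(audio.name)
        else:
            pairs.append((audio, label.read_text(encoding="utf-8").strip()))
    return pairs, missing
```

An empty `missing` list indicates that every audio file found has a transcription.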
This corpus contains more than 62,000 Tamil audio files from 1000 speakers, a .dic file listing each word with its corresponding phonetic representation, and a transcription text file listing the transcription for each audio file. The data was prepared for the Agricultural Commodity domain, and the size of the corpus is 5.7 GB.
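The .dic file can be loaded into a lookup table from word to phonetic representation. The exact file format is not described here, so the sketch below assumes one entry per line, with the word first and its phones after it, separated by whitespace; the function name and the sample entries are hypothetical.

```python
def parse_pronunciation_dict(lines):
    """Map each word to a list of pronunciations (each a list of phones).

    Assumes lines of the form "<word> <phone> <phone> ...". Blank or
    malformed lines are skipped, and repeated words keep all variants.
    """
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        word, phones = parts[0], parts[1:]
        entries.setdefault(word, []).append(phones)
    return entries


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from the corpus.
    sample = ["விலை v i l ai", "அரிசி a r i c i"]
    print(parse_pronunciation_dict(sample))
```

With a real file, pass an open file handle (opened with `encoding="utf-8"`) in place of the sample list.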