This module takes free text and produces tokens with sentence boundaries marked.
A token may be any of the following: a word, an abbreviation, a punctuation mark, a real number, a special symbol, etc. No token contains white space. Special symbols such as ‘|’ and ‘.’, as well as two consecutive newlines, are treated as end-of-sentence markers. A period, however, is first analyzed to decide whether it actually marks the end of a sentence. Abbreviations such as Mr. or Dr. are each considered a single token. When a period is found, a list of acronyms is consulted; based on this list and some rules, the module decides whether the period belongs to an abbreviation. By the end of processing, each sentence contains all the tokens that make it up.
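The behaviour described above can be illustrated with a minimal sketch. It is not the module's actual implementation: the function name `tokenize`, the contents of `ABBREVIATIONS`, and the two regular expressions are assumptions made for illustration, and the real module's acronym list and period rules are not given in the text.

```python
import re

# Hypothetical abbreviation list; the module's real list is not given in the text.
ABBREVIATIONS = {"Mr.", "Dr.", "Mrs.", "Prof."}

# Either a blank line (two newlines, possibly with spaces between them)
# or a run of non-space characters.
_CHUNK = re.compile(r"\n\s*\n|\S+")
# Inside a chunk: a real number, a word, or a single non-space symbol.
_PIECE = re.compile(r"\d+\.\d+|\w+|[^\w\s]")


def tokenize(text):
    """Return a list of sentences, each a list of whitespace-free tokens."""
    sentences, current = [], []

    def close_sentence():
        # Store the sentence collected so far, if it has any tokens.
        if current:
            sentences.append(current.copy())
            current.clear()

    for match in _CHUNK.finditer(text):
        chunk = match.group()
        if chunk.isspace():
            # Two consecutive newlines end the sentence.
            close_sentence()
            continue
        if chunk in ABBREVIATIONS:
            # A known abbreviation keeps its period and stays one token.
            current.append(chunk)
            continue
        for piece in _PIECE.findall(chunk):
            current.append(piece)
            if piece in {".", "|"}:
                # Any other period, or a '|', marks the end of the sentence.
                close_sentence()

    close_sentence()
    return sentences
```

In this sketch, the period in "Dr." is kept inside the token because the whole chunk appears in the assumed list, and the period in "3.5" is absorbed by the real-number pattern, so neither closes a sentence. A production tokenizer would need further rules, for example for abbreviations followed directly by other punctuation.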
Parsing is the process of assigning a grammatical label to each chunk/constituent in the sentence. Identifying the grammatical labels (karaka and non-karaka relations) for each word of the sentence helps many applications, such as word sense disambiguation (WSD) and named entity recognition (NER). There are a number of approaches to parsing, such as rule-based, statistical, and transformation-based. Here, a rule-based approach within the Paninian dependency framework is used.