Document image analysis, like any digital image analysis, requires the identification and extraction of suitable features. These are generally extracted from uncompressed images, although in practice images are usually made available in compressed form for reasons such as transmission and storage efficiency. A compressed image must therefore be decompressed before feature extraction, which demands additional computing resources. This limitation motivates research into extracting features directly from the compressed image. In this research, we propose to extract essential features for text document analysis, such as the projection profile, run-histogram and entropy, directly from run-length compressed text documents.
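As an illustration of working directly on run-length data, the following minimal Python sketch computes a horizontal projection profile and a foreground run-histogram without rebuilding the pixel grid. The row encoding (alternating run lengths that start with a background run) and all function names are assumptions made for this example, not the format or method used in the paper.

```python
from collections import Counter


def foreground_runs(runs):
    """Return the foreground runs of one row.

    Assumed encoding: alternating run lengths that start with a background
    run (possibly 0), so foreground runs sit at the odd indices.
    """
    return runs[1::2]


def horizontal_projection_profile(rle_rows):
    """Number of foreground pixels per row, summed straight from the runs."""
    return [sum(foreground_runs(row)) for row in rle_rows]


def run_histogram(rle_rows):
    """Count how often each foreground run length occurs in the page."""
    hist = Counter()
    for row in rle_rows:
        hist.update(n for n in foreground_runs(row) if n > 0)
    return hist


# Hypothetical 4-row, 8-pixel-wide page (1 = foreground):
rle_rows = [
    [8],           # 00000000
    [1, 3, 2, 2],  # 01110011
    [0, 2, 4, 2],  # 11000011
    [8],           # 00000000
]

profile = horizontal_projection_profile(rle_rows)
hist = run_histogram(rle_rows)
print(profile)
print(dict(hist))
```

Rows whose profile value is zero mark candidate gaps between text lines, which is the usual reason for computing this profile in text document analysis.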
Added on March 14, 2018
Contributed by : OCR Consortium
Product Type : Research Paper
License Type : Freeware
System Requirement :
Author : Mohammed Javed, P. Nagabhushan, B.B. Chaudhuri
In this paper we present how Bag-of-Features Hidden Markov Models can be applied to printed Bangla word spotting. These statistical models allow easy adaptation to different problem domains. This is possible due to the integration of automatically estimated visual appearance features and Hidden Markov Models for spatial sequential modeling. In our evaluation we report high retrieval scores on a new printed Bangla dataset. Furthermore, we outperform state-of-the-art results on the well-known George Washington word spotting benchmark. Both results have been achieved using an almost identical parametric method configuration.
Extraction and recognition of Bangla text from video frame images is challenging due to variation in font type and style, complex color backgrounds, low resolution, low contrast, etc. In this paper, we propose an algorithm for extraction and recognition of Bangla and Devanagari text from video frames with complex backgrounds. A two-step approach is proposed. After text localization, the text line is segmented into words using information based on line contours. First-order gradient values of the text blocks are used to find the word gaps. Next, an Adaptive SIS binarization technique is applied to each word. Finally, the binarized text block is sent to a state-of-the-art OCR system for recognition.
Skew correction of a scanned document page is an important preprocessing step in document image analysis. We propose here a fast and robust skew estimation algorithm based on rank analysis in the Farey sequence. Our target document class comprises two major Indian scripts with headlines, namely Devanagari and Bangla. First, our algorithm detects straight edge segments in the edge map of the document page using properties of digital straightness. Straight edges derived in this manner are binned by Farey ranks in correspondence with their slopes. The principal bin, identified from these bins using the strength of accumulated edge points, represents the principal direction along the direction of the headlines, from which the gross skew angle is estimated. A fast refinement algorithm is then applied with a finer tuning of Farey ranks to detect the skew up to the desired level of precision.
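The binning idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: edges are given as hypothetical `(dy, dx, length)` tuples with slopes in [0, 1], each slope is assigned to its nearest fraction in a Farey sequence of a chosen order, and the bin with the largest accumulated edge length gives the gross skew angle. Edge detection, negative slopes and the refinement step are left out.

```python
import math
from bisect import bisect_left
from collections import defaultdict
from fractions import Fraction


def farey_sequence(n):
    """Reduced fractions in [0, 1] with denominator <= n, in ascending order."""
    seq = [Fraction(0, 1)]
    a, b, c, d = 0, 1, 1, n
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        seq.append(Fraction(a, b))
    return seq


def nearest_rank(slope, seq):
    """Index (rank) of the Farey fraction closest to slope."""
    i = bisect_left(seq, slope)
    if i == 0:
        return 0
    if i == len(seq):
        return len(seq) - 1
    return i if seq[i] - slope < slope - seq[i - 1] else i - 1


def gross_skew_degrees(edges, order):
    """Bin edges by Farey rank and return the angle of the strongest bin."""
    seq = farey_sequence(order)
    strength = defaultdict(int)
    for dy, dx, length in edges:
        strength[nearest_rank(Fraction(dy, dx), seq)] += length
    best = max(strength, key=strength.get)
    return math.degrees(math.atan(float(seq[best])))


# Hypothetical straight edges: (rise, run, number of edge points)
edges = [(1, 10, 40), (2, 19, 35), (1, 2, 5)]
angle = gross_skew_degrees(edges, order=10)
print(round(angle, 2))
```

In this toy input the slopes 1/10 and 2/19 fall into the same bin, so their edge points accumulate there. A higher Farey order gives finer bins, which loosely mirrors the refinement idea described above.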
Malayalam tree bank data is in Shakti Standard Format (SSF), a common representation for data. SSF allows the information in a sentence to be represented as one or more trees, together with a set of attribute-value pairs associated with the nodes of the trees. The attribute-value pairs allow features or properties to be specified for every node. Sentence level SSF is used to store the analysis of a sentence. It occurs as part of text level SSF. The analysis of a sentence may mark any or all of the following kinds of information as appropriate: part of speech of the words in the sentence; morphological analysis of the words, including properties such as root, gender, number, person, tense, aspect and modality; phrase structure or dependency structure of the sentence; and properties of units such as chunks, phrases, local word groups, tags, etc. SSF is theory neutral and allows both phrase structure and dependency structure to be coded, and even mixed in well-defined ways. The SSF representation for a sentence consists of a sequence of trees. Each tree is made up of one or more related nodes. The total size of the Malayalam tree bank corpus is 9512 monolingual sentence IDs, 6010 parallel sentence IDs and approx. 251 verb frames. The following supporting documents are provided:
1. BIS Tag set
2. Chunk Guidelines
3. Dependency guidelines_Malayalam
5. morph guidelines final
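To make the tree-plus-attributes idea concrete, here is a small in-memory sketch in Python. It only illustrates the data model described above (a sentence as a sequence of trees whose nodes carry attribute-value pairs). It is not the SSF file syntax, and the class name, labels and attribute names are placeholders that are not taken from the BIS tag set or the guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node: a label plus free-form attribute-value pairs."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the leaf nodes (the words) under this node, left to right."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# A sentence is a sequence of trees; here two chunk-like nodes with one word each.
# English tokens and the attribute names are illustrative placeholders only.
sentence = [
    Node("NP", {"head": "children"},
         [Node("children", {"pos": "noun", "number": "plural"})]),
    Node("VP", {"head": "play"},
         [Node("play", {"pos": "verb", "tense": "present"})]),
]

tokens = [leaf.label for tree in sentence for leaf in tree.leaves()]
plural_words = [leaf.label for tree in sentence for leaf in tree.leaves()
                if leaf.attrs.get("number") == "plural"]
print(tokens, plural_words)
```

Because the attributes are an open dictionary, part-of-speech, morphological and dependency information can sit on the same node, which reflects the theory-neutral design described in the text.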
Tags: TreeBank, Malayalam treebank, Malayalam Treebank Corpus, Tree Bank Data
Added on February 27, 2018
Contributed by : IL dependency tree bank, IIIT Hyd