Most document pre-processing techniques are parameter dependent. In this paper, we present a novel framework that learns optimal parameters for binarization and text/graphics segmentation, depending on the nature of the document image content. The learning problem is formulated as an optimization problem and solved with the EM algorithm to adaptively learn the optimal parameters. Experimental results establish the effectiveness of our approach.
The prime source of test data for an Indian language search engine is the internet. The test data consists primarily of search keywords that are fed into the search engine. These keywords can be collected from various Indian language web pages and used to evaluate an Indian language search engine.
A total of 4735 search keywords were prepared for nine languages, namely Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Tamil and Telugu, across the following categories:
3. NER & Acronyms
4. Data Integrity (Normalization, Spelling Variation)
5. Grammatical Forms handling (Singular/Plural, Lemmatizer, Synonyms, Spell checker)
6. Ranking (Single Best Target)
Domain Covered: Tourism
Text segmentation and localization algorithms are proposed for the born-digital image dataset. Binarization and edge detection are carried out separately on the three colour planes of the image. Connected components (CCs) obtained from the binarized image are filtered by thresholding their area and aspect ratio, and only CCs that contain sufficient edge pixels are retained. A novel approach is presented in which the text components are represented as nodes of a graph, with each node corresponding to the centroid of an individual CC. Long edges of the minimum spanning tree of the graph are broken. The pairwise height ratio is also used to remove likely non-text components.
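The graph step described above can be illustrated with a minimal sketch. It assumes the CC centroids are already available as (x, y) pairs, builds a Euclidean minimum spanning tree with Prim's algorithm, and cuts every tree edge longer than a cut-off. The function name `group_components` and the parameter `max_edge_length` are hypothetical, and the abstract does not say how the cut-off is chosen, so this should not be read as the authors' implementation.

```python
import math


def group_components(centroids, max_edge_length):
    """Group CC centroids by cutting long edges of their Euclidean MST.

    `max_edge_length` is an assumed, user-supplied cut-off.
    Returns a sorted list of groups, each a list of centroid indices.
    """
    n = len(centroids)
    if n == 0:
        return []

    # Prim's algorithm on the complete graph over the centroids.
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_parent = [-1] * n
    best_dist[0] = 0.0
    mst_edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0:
            mst_edges.append((best_parent[u], u, best_dist[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(centroids[u], centroids[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u

    # Union-find: merge only the endpoints of edges that survive the cut.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, d in mst_edges:
        if d <= max_edge_length:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, five centroids on a line with one large gap split into two groups once the long tree edge is removed; each group can then be treated as a candidate text line.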
In this paper, we discuss the issues related to word recognition in born-digital word images. We introduce a novel method that applies a power-law transformation to the word image before binarization. We show the improvement in image binarization and the consequent increase in the recognition performance of the OCR engine on the word image. The optimal value of gamma for a word image is chosen automatically by our algorithm with a fixed stroke-width threshold. We have experimented extensively with our algorithm by varying the gamma and stroke-width threshold values. By varying the gamma value, we found that our algorithm performed better than the results reported in the literature.
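As an illustration of the power-law transformation mentioned above, the following sketch applies s = 255 (r / 255)^gamma to an 8-bit greyscale image stored as nested lists and then binarizes it with a fixed global threshold. The function names and the threshold value of 128 are assumptions made for this example. The automatic choice of gamma using the stroke-width threshold is not described in the abstract and is not reproduced here.

```python
def power_law_transform(gray, gamma):
    """Apply s = 255 * (r / 255) ** gamma to every 8-bit grey level."""
    # A 256-entry lookup table avoids recomputing the power per pixel.
    lut = [round(255 * (level / 255) ** gamma) for level in range(256)]
    return [[lut[p] for p in row] for row in gray]


def binarize(gray, threshold=128):
    """Fixed global threshold (an assumed, illustrative binarizer)."""
    return [[1 if p >= threshold else 0 for p in row] for row in gray]
```

With gamma below 1 the dark grey levels are stretched, and with gamma above 1 they are compressed, so the same fixed threshold can produce a different binary image for each gamma value.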
Scene word images undergo degradations due to motion blur, uneven illumination, shadows and defocusing, which make segmentation difficult. As a result, the recognition results reported on the scene word image datasets of ICDAR have been low. We introduce a novel technique in which we choose the middle row of the image as a sub-image and segment it first. The labels from this segmented sub-image are then propagated to the other pixels in the image. This approach, which is distinct from the existing methods, results in improved segmentation. Bayesian classification and Max-flow methods have been used independently for label propagation.
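The middle-row idea above can be sketched under strong simplifying assumptions: the image is greyscale, the middle row is segmented into two classes by a one-dimensional 2-means clustering (the abstract does not say how the sub-image is segmented), and labels are propagated by a Gaussian Bayes classifier with equal priors fitted to the labelled middle-row pixels. All function names are hypothetical, and the Max-flow variant is not shown.

```python
import math


def two_means(values, iterations=20):
    """Cluster grey levels into two classes (0 = darker, 1 = brighter)."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iterations):
        dark = [v for v in values if abs(v - lo) <= abs(v - hi)]
        bright = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not bright:
            break
        lo, hi = sum(dark) / len(dark), sum(bright) / len(bright)
    return [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]


def gaussian_stats(samples):
    """Mean and variance, with a variance floor to avoid division by zero."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1.0)


def log_likelihood(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def propagate_from_middle_row(gray):
    """Label every pixel using class models learnt from the middle row."""
    middle = gray[len(gray) // 2]
    seed_labels = two_means(middle)
    stats = []
    for label in (0, 1):
        samples = [v for v, l in zip(middle, seed_labels) if l == label]
        if not samples:
            # Middle row is uniform: no second class to propagate.
            return [[0] * len(row) for row in gray]
        stats.append(gaussian_stats(samples))
    return [
        [0 if log_likelihood(p, *stats[0]) >= log_likelihood(p, *stats[1])
         else 1 for p in row]
        for row in gray
    ]
```

Because the class models come only from the middle row, this sketch works best when that row crosses both text and background, which is the usual case for a cropped word image.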