Script Identification is one of the challenging step in the Optical Character Recognition system for multi-script documents. In Indian and Non-Indian context some results have been reported, but research in this field is still emerging. This paper presents a research work in the identification of Gurmukhi and English scripts at word level. It also identifies English Numerals from Gurmukhi text. Gabor feature extraction is one of most popular method for script recognition. This paper presents a zone based gabor feature extraction technique.
Digitization of newspaper article is important for registering historical events. Layout analysis of Indian newspaper is a challenging task due to the presence of different font size, font styles and random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text graphic separation in the presence of complex layouts. The learning problem has been formulated as an optimization problem using EM algorithm to learn optimal parameters depending on the nature of the document content.
Added on August 28, 2018
Contributed by : Consortium
Product Type : Research Paper
License Type : Freeware
System Requirement :
Author : Ritu Garg,Anukriti Bansal,Santanu Chaudhury,Sumantra Dutta Roy
Active learning and crowd sourcing are becoming increasingly popular in the machine learning community for fast and cost effective generation of labels for large volumes of data. However, such labels may be noisy. So, it becomes important to ignore the noisy labels for building of a good classifier. We propose a framework for finding the best possible augmentation of a classifier for the character recognition problem using minimum number of crowd labeled samples. The approach inherently rejects the noisy data and tries to accept a subset of correctly labeled data to maximize the classifier performance.
We propose here a technique for transforming the layout of a printed document image to a new user-conducive layout. Its objective is to effectuate better display in a low-resolution screen for providing comfort and convenience to a viewer while reading. The task of re-targeting starts with analyzing the document image in the spatial domain for identifying its paragraphs. Text lines, words, characters, and hyphenations are then recognized from each paragraph, and necessary word stitching is performed to reproduce the paragraph, as appropriate to the resolution of the display device. Test results and related subjective evaluation for different datasets, especially the pages scanned from some Bengali and English magazines, demonstrate the strength and effectiveness of the proposed technique.
Performance of an OCR system is badly affected due to presence of hand-drawn annotation lines in various forms, such as underlines, circular lines, and other text-surrounding curves. Such annotation lines are drawn by a reader usually in free hand in order to summarize some text or to mark the keywords within a document page. In this paper, we propose a generalized scheme for detection and removal of these hand-drawn annotations from a scanned document page. An underline drawn by hand is roughly horizontal or has a tolerable undulation, whereas for a hand-drawn curved line, the slope usually changes at a gradual pace.