Script identification in a multi-lingual document environment has numerous applications in document image analysis, such as indexing and retrieval, or as an initial step towards optical character recognition. In this paper, we propose a novel hierarchical framework for script identification in bi-lingual documents. The framework takes a top-down approach, performing page, block/paragraph and word level script identification in multiple stages. We utilize texture and shape based information embedded in the documents at different levels for feature extraction. The prediction task at each level of the hierarchy is performed by a Support Vector Machine (SVM) and a rejection based classifier defined using AdaBoost. Experimental evaluation of the proposed concept on document collections of Hindi/English and Bangla/English scripts has shown promising results.
This paper presents an integration and testing scheme for managing a large multilingual OCR project. The project is an attempt to implement an integrated platform for OCR of different Indian languages. Software engineering, workflow management and testing processes are discussed. The OCR has now been experimentally deployed for some specific applications and is currently being enhanced to handle space and time constraints, achieve higher recognition accuracies and add new functionality.
Added on September 8, 2017
Contributed by : OCR Consortium
Product Type : Research Paper
License Type : Freeware
System Requirement :
Author : Deepak Arya, C. V. Jawahar, Chakravorty Bhagvati, Santanu Chaudhury, Tushar Patnaik, B. B. Chaudhuri, G. S. Lehal, A. G. Ramakrishna
In this paper, we propose a novel framework for segmentation of documents with complex layouts. The document segmentation is performed by a combination of clustering and conditional random field (CRF) based modeling. The bottom-up approach for segmentation assigns each pixel to a cluster plane based on color intensity. A CRF based discriminative model is learned to extract the local neighborhood information in different cluster/color planes. The final category assignment is done by a top-level CRF based on the semantic correlation learned across clusters. The proposed framework has been extensively tested on multi-colored document images with text overlapping graphics/images.
In this paper we present an approach for correcting character recognition errors of an OCR system that recognises Indic scripts. A suffix tree is used to index the lexicon in lexicographical order to facilitate probabilistic search. To obtain the most probable match for a mis-recognised string, the string is compared with the sub-strings (edges of the suffix tree) using a weighted Levenshtein distance as the similarity measure, where confusion probabilities of characters (Unicode code points) are used as the substitution cost; the comparison stops once the cost exceeds the specified cost k. Retrieved candidates are sorted and selected on the basis of lowest edit cost. Exploiting this information, the system can correct non-word errors and achieves a maximum error rate reduction of 33% over a simple character recognition system.
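The weighted edit distance described in the abstract above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the substitution cost is set to one minus the confusion probability, insertions and deletions cost 1.0, and the lexicon is scanned linearly instead of being indexed by a suffix tree. The names `weighted_levenshtein`, `suggest` and `confusion` are hypothetical.

```python
from typing import Dict, List, Tuple


def weighted_levenshtein(observed: str, candidate: str,
                         confusion: Dict[Tuple[str, str], float],
                         indel_cost: float = 1.0) -> float:
    """Edit distance in which confusable character pairs are cheap to substitute.

    confusion[(a, b)] is an assumed probability that the OCR outputs `a`
    when the true character is `b`. Assumption: cost = 1 - probability.
    """
    rows, cols = len(observed) + 1, len(candidate) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = observed[i - 1], candidate[j - 1]
            sub = 0.0 if a == b else 1.0 - confusion.get((a, b), 0.0)
            dist[i][j] = min(dist[i - 1][j] + indel_cost,   # delete from observed
                             dist[i][j - 1] + indel_cost,   # insert into observed
                             dist[i - 1][j - 1] + sub)      # match or substitute
    return dist[-1][-1]


def suggest(observed: str, lexicon: List[str],
            confusion: Dict[Tuple[str, str], float],
            k: float) -> List[Tuple[float, str]]:
    """Return lexicon words whose cost does not exceed k, cheapest first."""
    scored = [(weighted_levenshtein(observed, w, confusion), w) for w in lexicon]
    return sorted((cost, w) for cost, w in scored if cost <= k)


if __name__ == "__main__":
    # Hypothetical confusion table: the digit "1" is often output for "l".
    table = {("1", "l"): 0.9}
    print(suggest("c1ear", ["clear", "cheer", "close"], table, k=2.0))
```

A linear scan costs one full distance computation per lexicon word; the suffix-tree index in the abstract exists to avoid that, so this sketch shows only the cost model and the cut-off at k.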
The paper presents the application of multiple features for word based document image indexing and retrieval. A novel framework to perform Multiple Kernel Learning (MKL) for indexing using kernel based Distance Based Hashing is proposed. A Genetic Algorithm based framework is used for optimization. Two different features representing the structural organization of word shape are defined. The optimal combination of the two features for indexing is learned by performing MKL. Retrieval results for a document collection in Devanagari script are presented.
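As a rough illustration of what combining two features through kernels can look like, the sketch below forms a convex combination of two RBF kernels, one per feature. This is an assumption-laden example, not the method of the paper: the kernel type, the `gamma` value, and the fixed `weight` are all hypothetical, and the Genetic Algorithm search for the weight and the hashing step are not shown.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two feature vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def combined_kernel(feat_a_x: Sequence[float], feat_a_y: Sequence[float],
                    feat_b_x: Sequence[float], feat_b_y: Sequence[float],
                    weight: float) -> float:
    """Convex combination of one kernel per feature.

    `weight` in [0, 1] is the share given to feature A; in an MKL setting
    it would be learned rather than fixed by hand.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (weight * rbf_kernel(feat_a_x, feat_a_y)
            + (1.0 - weight) * rbf_kernel(feat_b_x, feat_b_y))


if __name__ == "__main__":
    # Two hypothetical word images, each described by two shape features.
    word1_a, word1_b = [0.2, 0.4], [1.0, 0.0, 0.5]
    word2_a, word2_b = [0.1, 0.5], [0.9, 0.1, 0.4]
    print(combined_kernel(word1_a, word2_a, word1_b, word2_b, weight=0.6))
```

Keeping the weights non-negative and summing to one means the combined function is still a valid kernel whenever the individual kernels are.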