The authors organized a competition on detecting text in scene images. The motivation was to find script-independent algorithms that detect and extract text from scene images and that can be applied directly to an unknown script. The competition had four distinct tasks: (i) text localization and (ii) text segmentation from scene images containing one or more of Kannada, Tamil, Hindi, Chinese and English words, and (iii) English and (iv) Kannada word recognition from scene word images. There were four submissions in total for the text localization and segmentation tasks. For the other two tasks, we evaluated two algorithms that we had already published, namely nonlinear enhancement and selection of plane, and midline analysis and propagation of segmentation. The standing of each algorithm is discussed in full, and suggestions are provided to improve the algorithms. A graphical depiction of the f-score of individual images, in the form of benchmark values, is proposed to show the strength of an algorithm.
Added on December 13, 2017
Contributed by : Consortium
Product Type : Research Paper
License Type : Freeware
Author : Deepak Kumar, M. N. Anil Prasad, A. G. Ramakrishnan
A script-independent, font-size-independent scheme is proposed for detecting bold words in printed pages. In OCR applications such as making minor modifications to an existing printed form, it is desirable to reproduce the font size and characteristics such as bold and italics in the OCR-recognized document. In this morphological opening based detection of bold (MOBDoB) method, the binarized image is segmented into sub-images with uniform font sizes, using the word height information. A rough estimate of the stroke width of the characters in each sub-image is obtained from the density. Each sub-image is then opened with a square structuring element whose size is determined by the respective stroke width. The union of all the opened sub-images is used to determine the locations of the bold words. Extracting all such words from the binarized image gives the final image. At least 98% of the bold words were detected in a total of 65 Tamil, Kannada and English pages, with a false alarm rate of less than 0.4%.
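The opening step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the stroke width is passed in rather than estimated from the density, and the choice of a structuring element one pixel wider than the normal stroke width is an assumption made only for illustration.

```python
def open_square(img, k):
    """Binary morphological opening with a k x k square structuring element.

    img is a list of rows of 0/1 values. The result is the union of all
    k x k squares that fit entirely inside the foreground.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            # Erosion test: does the square with top-left corner (y, x) fit?
            if all(img[y + dy][x + dx] for dy in range(k) for dx in range(k)):
                # Dilation: paint the whole square back into the output.
                for dy in range(k):
                    for dx in range(k):
                        out[y + dy][x + dx] = 1
    return out


def bold_pixels(sub_image, stroke_width):
    """Keep only strokes thicker than the normal stroke width.

    Assumption: the element size is stroke_width + 1, so normal-weight
    strokes vanish under opening while bold strokes survive.
    """
    return open_square(sub_image, stroke_width + 1)


# Toy sub-image: a 1-pixel stroke in column 0, a 3-pixel stroke in columns 3-5.
toy = [[1, 0, 0, 1, 1, 1, 0, 0] for _ in range(5)]
survivors = bold_pixels(toy, stroke_width=1)
```

In the toy image the thin stroke is removed and only the thick stroke remains, which is the cue used to locate bold words. Taking the union over sub-images and mapping surviving pixels back to whole words are left out of this sketch.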
Conventional optical character recognition systems, designed to recognize linearly aligned text, perform poorly on document images that contain multi-oriented text lines. This paper describes a novel technique that can extract text lines of arbitrary curvature and align them horizontally. By invoking the spatial regularity properties of text, adjacent components are grouped together to obtain the text lines present in the image. To align each identified text line, we fit a B-spline curve to the centroids of the constituent characters and compute normal vectors all along the resulting curve. Each character is then individually rotated such that the corresponding normal vector is aligned with the vertical axis. The method has been tested on images that contain text laid out in various forms, namely arc, wave, triangular, and combinations of these with linearly skewed text lines. After alignment, it yields 97.3% recognition accuracy on text strings where state-of-the-art OCRs fail before alignment.
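The per-character rotation can be sketched as follows, under simplifying assumptions that are not taken from the paper. The centroids are used directly as control points of a uniform cubic B-spline, for which the tangent at the knot associated with control point i is parallel to P(i+1) - P(i-1). At the two ends of the line a one-sided difference is used instead. The function name is hypothetical, and coordinates are assumed to have the y axis pointing up.

```python
import math


def alignment_angles(centroids):
    """Return, for each character centroid, the rotation in degrees
    (counterclockwise positive) that makes the curve normal vertical.

    centroids: list of (x, y) tuples ordered along the text line,
    with at least two entries.
    """
    n = len(centroids)
    angles = []
    for i in range(n):
        # Neighbouring control points; clamped at the ends of the line.
        x0, y0 = centroids[max(i - 1, 0)]
        x1, y1 = centroids[min(i + 1, n - 1)]
        tx, ty = x1 - x0, y1 - y0  # tangent direction at character i
        # The normal is the tangent turned by 90 degrees, so rotating the
        # character by minus the tangent angle makes the normal vertical.
        angles.append(-math.degrees(math.atan2(ty, tx)))
    return angles


# Characters laid out along a line rising at 45 degrees.
rising = alignment_angles([(0, 0), (1, 1), (2, 2)])
# Characters already on a horizontal baseline need no rotation.
flat = alignment_angles([(0, 0), (5, 0), (10, 0)])
```

For image coordinates with the y axis pointing down, the sign of the angle would flip. A least-squares B-spline fit through noisy centroids would give smoother tangents than this control-point shortcut.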
Most document pre-processing techniques are parameter-dependent. In this paper, we present a novel framework that learns optimal parameters for binarization and text/graphics segmentation, depending on the nature of the document image content. The learning problem is formulated as an optimization problem and solved with the EM algorithm, which adaptively learns the optimal parameters. Experimental results establish the effectiveness of our approach.
The internet is the prime source of test data for an Indian-language search engine. The test data fed into the search engine consists primarily of search keywords, which can be collected from web pages in various Indian languages and then used to evaluate the search engine.
A total of 4735 search keywords were prepared for nine languages, namely Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Tamil and Telugu, across the following categories:
3. NER & Acronyms
4. Data Integrity (Normalization, Spelling Variation)
5. Grammatical Forms handling (Singular/Plural, Lemmatizer, Synonyms, Spell checker)
6. Ranking (Single Best Target)
Domain Covered: Tourism