A script independent, font-size independent scheme is proposed for detecting bold words in printed pages. In OCR applications such as minor modifications of an existing printed form, it is desirable to reproduce the font size and characteristics such as bold, and italics in the OCR recognized document. In this morphological opening based detection of bold (MOBDoB) method, the binarized image is segmented into sub-images with uniform font sizes, using the word height information. Rough estimation of the stroke widths of characters in each sub-image is obtained from the density. Each sub-image is then opened with a square structuring element of size determined by the respective stroke width. The union of all the opened sub-images is used to determine the locations of the bold words. Extracting all such words from the binarized image gives the final image. A minimum of 98 % of bold words were detected from a total of 65 Tamil, Kannada and English pages and the false alarm rate is less than 0.4 %.
Conventional optical character recognition systems, designed to recognize linearly aligned text, perform poorly on document images that contain multi-oriented text lines. This paper describes a novel technique that can extract text lines of arbitrary curvature and align them horizontally. By invoking the spatial regularity properties of text, adjacent components are grouped together to obtain the text lines present in the image. To align each identified text line, we fit a B-spline curve to the centroids of the constituent characters and normal vectors are computed all along the resulting curve. Each character is then individually rotated such that the corresponding normal vector is aligned with the vertical axis. The method has been tested on images that contain text laid out in various forms namely arc, wave, triangular and combination of these with linearly skewed text lines. It yields 97.3% recognition accuracy on text strings where state-of-the-art OCRs fail before alignment.
Most of the document pre-processing techniques are parameter dependent. In this paper, we present a novel framework that learns optimal parameters, depending on the nature of the document image content for binarization and text/graphics segmentation. The learning problem has been formulated as an optimization problem using EM algorithm to adaptively learn optimal parameters. Experimental results have established the effectiveness of our approach.
To collect the test data for Indian Language search engine, the prime source of information is from the internet. Primarily in search keywords are the test data that has to be fed into the search engine. Search keywords can be collected from various Indian language web pages. These search keywords can be used to evaluate Indian language search engine.
Total 4735 search keywords prepared for nine languages namely Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Tamil and Telugu across following categories-
3. NER & Acronyms
4. Data Integrity (Normalization, Spelling Variation)
5. Grammatical Forms handling (Singular/Plural, Lemmatizer, Synonyms, Spell checker)
6. Ranking (Single Best Target)
Domain Covered: Tourism
Text segmentation and localization algorithms are proposed for the born-digital image dataset. Binarization and edge detection are separately carried out on the three colour planes of the image. Connected components (CC's) obtained from the binarized image are thresholded based on their area and aspect ratio. CC's which contain sufficient edge pixels are retained. A novel approach is presented, where the text components are represented as nodes of a graph. Nodes correspond to the centroids of the individual CC's. Long edges are broken from the minimum spanning tree of the graph. Pair wise height ratio is also used to remove likely non-text components.