The paper proposes a novel multi-modal document image retrieval framework that exploits information from text and graphics regions. The framework applies a multiple kernel learning based hashing formulation to generate composite document indexes using different modalities. Existing multimedia management methods for imaged text documents have not addressed the requirements of old and degraded documents. In a subsequent contribution, we propose a novel multi-modal document indexing framework for retrieval of old and degraded text documents that combines OCR'ed text and image based representations using learning. The proposed concepts are evaluated on sampled magazine cover pages and documents in Devanagari script.
We propose a new technique for impulse noise filtering that can remove impulse noise from color as well as grayscale images. We operate on the HSI (Hue-Saturation-Intensity) color model. Our algorithm has three Phases. In the first Phase, we take a window W of size N×N (say, 3×3) and form two groups: a group of color pixels and a group of colorless pixels. We select the group that has the higher count of pixels in W. This allows us to remove the noise due to the colorless pixels from the color pixels and vice versa. In the second Phase, if the selected group is a collection of colorless pixels, then we find the median pixel based on increasing order of Intensity values and call it the candidate pixel.
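The first Phase and the colorless case of the second Phase can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how a pixel is judged colorless, so the saturation cutoff `SAT_THRESHOLD`, the tie rule, and the function names `split_window` and `colorless_candidate` are all hypothetical. The color case and the third Phase are not described in the text above and are left out.

```python
from typing import List, Tuple

# A pixel in the HSI model: (hue, saturation, intensity).
Pixel = Tuple[float, float, float]

# Hypothetical cutoff: a pixel with saturation below this counts as colorless.
SAT_THRESHOLD = 0.1


def split_window(window: List[Pixel],
                 sat_threshold: float = SAT_THRESHOLD) -> Tuple[str, List[Pixel]]:
    """Phase 1: form the color and colorless groups for the pixels of a
    window W and return the label and members of the larger group."""
    color = [p for p in window if p[1] >= sat_threshold]
    colorless = [p for p in window if p[1] < sat_threshold]
    # Assumption: ties go to the color group (the source gives no tie rule).
    if len(color) >= len(colorless):
        return "color", color
    return "colorless", colorless


def colorless_candidate(group: List[Pixel]) -> Pixel:
    """Phase 2, colorless case: the median pixel when the group is sorted
    by increasing intensity. For even sizes the lower middle is taken."""
    ordered = sorted(group, key=lambda p: p[2])
    return ordered[(len(ordered) - 1) // 2]
```

A 3×3 window is passed as a flat list of nine pixels; when `split_window` returns the label `"colorless"`, `colorless_candidate` applied to the returned group gives the candidate pixel.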
In this paper we propose an approach to separate the non-texts from the texts of a manuscript. The non-texts are mainly doodles and drawings made by some exceptional thinkers and writers. These have enormous historical value for the study of those writers’ subconscious as well as productive minds. We also propose a computational approach to recover the struck-out texts to reduce human effort. The proposed technique has a preprocessing stage, which removes noise using a median filter and segments the object region using fuzzy c-means clustering. Connected component analysis then finds the major portions of the non-texts, and window examination eliminates the partially attached texts. The struck-out texts are extracted by eliminating straight lines, measuring the degree of continuity, and applying some morphological operations.
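The connected component step can be illustrated with a small sketch. It assumes the preprocessing has already produced a binary mask (1 for object pixels, 0 for background) and that the "major portions" are simply the components above an area threshold; the abstract does not state the actual criterion, so `min_area`, the 8-connectivity, and the function names are hypothetical choices made for illustration.

```python
from collections import deque
from typing import List, Tuple

Component = List[Tuple[int, int]]  # list of (row, col) pixel coordinates


def connected_components(mask: List[List[int]]) -> List[Component]:
    """Label 8-connected components of nonzero pixels using breadth-first search."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                # Visit the eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def large_components(mask: List[List[int]], min_area: int) -> List[Component]:
    """Keep components with at least min_area pixels (assumed criterion)."""
    return [comp for comp in connected_components(mask) if len(comp) >= min_area]
```

In this sketch a large drawing would survive the area filter while small isolated marks would not; the window examination that removes partially attached texts is not modelled.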
This paper presents an implementation of an OCR system for the Meetei Mayek script. The script has been newly reintroduced and there is a growing set of documents currently available in this script. Our system accepts an image of the textual portion of a page and outputs the text in the Unicode format. It incorporates preprocessing, segmentation and classification stages. However, no post-processing is applied to the output. The system achieves an accuracy of about 96% on a moderately sized database.
Segmentation of a text document into lines, words and characters, which is considered to be the crucial preprocessing stage in Optical Character Recognition (OCR), is traditionally carried out on uncompressed documents, although most documents in real life are available in compressed form for reasons such as transmission and storage efficiency. This implies that the compressed image must first be decompressed, which demands additional computing resources. This limitation has motivated us to take up research in document image analysis using compressed documents. In this paper, we propose a new way to carry out segmentation at line, word and character level in run-length compressed printed text documents. We extract the horizontal projection profile curve from the compressed file and perform line segmentation using its local minima points. However, tracing the vertical information that leads to tracking words and characters in a run-length compressed file is not straightforward.
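A minimal sketch of the line segmentation idea follows. It assumes each image row is stored as alternating run lengths that start with a white run (white, black, white, ...), which may differ from the actual compressed format, and it splits lines only at zero-valued valleys of the profile, a simplified stand-in for the local-minima rule. The function names are hypothetical.

```python
from typing import List, Tuple


def horizontal_profile(rle_rows: List[List[int]]) -> List[int]:
    """Black-pixel count per row, read directly from the run lengths.

    Each row is assumed to alternate white, black, white, ... runs, so the
    black runs sit at odd indices and no decompression is needed.
    """
    return [sum(row[1::2]) for row in rle_rows]


def segment_lines(profile: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start_row, end_row) pairs for each maximal run of
    rows whose profile value is nonzero."""
    lines = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a text line begins
        elif value == 0 and start is not None:
            lines.append((start, i - 1))   # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines
```

For example, the rows `[[10], [2, 3, 5], [0, 4, 6], [10], [10], [1, 2, 3, 2, 2], [10]]` give the profile `[0, 3, 4, 0, 0, 4, 0]` and the line spans `[(1, 2), (5, 5)]`. Touching or skewed lines, whose valleys never reach zero, would need a genuine local-minima search.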
Added on March 14, 2018
Contributed by : OCR Consortium
Product Type : Research Paper
License Type : Freeware
System Requirement :
Author : Mohammed Javed, P. Nagabhushan, B.B. Chaudhuri