Identification of the script of the text in multi-script documents is one of the important steps in the design of an OCR system for the analysis and recognition of the page. Much work has already been reported in this area relating to Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, though some results have been reported, the task is still in its infancy. In the work presented in this paper, a successful attempt has been made to identify the scripts, at the word level, in a bilingual document containing Roman and Tamil scripts. Two different approaches have been proposed and thoroughly tested. In the first method, words are divided into three spatial zones. The spatial spread of a word in the upper and lower zones, together with the character density, is used to identify the script. The second technique analyses the directional energy distribution of a word. Words with various font styles and sizes have been used for the testing of the proposed algorithms, and the results are quite encouraging.
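The zone-based features of the first method can be illustrated with a rough sketch. It is not the authors' implementation: the function name `zone_features`, the `band_fraction` threshold used to locate the middle zone, and the use of ink density as a crude stand-in for character density are all assumptions made for illustration. The input is assumed to be a binarized word image given as rows of 0/1 values, with 1 marking ink.

```python
def zone_features(word, band_fraction=0.5):
    """Compute illustrative zone features for a binary word image.

    `word` is a list of equal-length rows of 0/1 values (1 = ink).
    `band_fraction` is a hypothetical threshold: rows whose ink count
    reaches this fraction of the densest row are taken as the middle zone.
    """
    height = len(word)
    width = len(word[0])

    # Horizontal projection profile: number of ink pixels in each row.
    row_ink = [sum(row) for row in word]
    peak = max(row_ink)
    if peak == 0:
        raise ValueError("word image contains no ink")

    # Assumed rule: the middle zone spans the first to the last dense row.
    dense_rows = [r for r, ink in enumerate(row_ink)
                  if ink >= band_fraction * peak]
    top, bottom = dense_rows[0], dense_rows[-1]

    def column_spread(rows):
        # Fraction of columns that contain ink somewhere in the given rows.
        if not rows:
            return 0.0
        covered = sum(1 for c in range(width)
                      if any(word[r][c] for r in rows))
        return covered / width

    return {
        "middle_zone": (top, bottom),
        "upper_spread": column_spread(range(0, top)),
        "lower_spread": column_spread(range(bottom + 1, height)),
        # Stand-in for character density (assumption): ink pixels per unit area.
        "ink_density": sum(row_ink) / (width * height),
    }
```

A classifier would then compare these features against values learned from labelled Roman and Tamil words; the decision rule itself is not given in the abstract, so it is left out here.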
Added on September 25, 2014
Product Type : Research Paper
License Type : Freeware
Author : Dhanya D, A G Ramakrishnan, Peeta Basa Pati