ERA “Ancient Script Recognition”
Abu Alrob, Ghadeer
Optical Character Recognition (OCR) is an image processing technique that takes an image, recognizes the text it contains, and extracts that text. It depends on image preprocessing, noise reduction, and machine learning, and it needs a large amount of training data to reach good accuracy. Tesseract is one of the open-source OCR engines: portable software that aims for the best character recognition results using machine learning and image processing techniques. Tesseract is based on an artificial neural network and needs a large number of samples for training, which is done with the tesseract command on any operating system. The ERA project builds on these techniques to support character recognition for ancient languages. The languages ERA works on were added to Unicode at the beginning of the 21st century, in versions 5.2 and 7. After the training phase, ERA provides .traineddata files for these old languages, which can be very useful to OCR projects around the world, with a particular focus on recognizing old Arabic languages. ERA supports five ancient languages. Three of them, Old South Arabian, Old North Arabic (Dadanitic, Safaitic, Taymanitic, and Hismaic), and Nabataean, are not yet supported by any other resource; the other two are Greek and Romanian. ERA takes any digital image as input, provides methods to make it clearer, runs it through the Tesseract engine using the .traineddata file of the specified language, and outputs the text content of the input. If the result is incorrect, ERA also lets the user help with character recognition by drawing on the image. The results of this project should be important and useful for people interested in archaeology.
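The recognition step described above (an image passed through Tesseract with a language-specific .traineddata file) can be sketched as a call to the Tesseract command-line tool. This is only an illustration, not ERA's actual code: the function names, the language code `sarb`, and the directory layout are hypothetical, and it assumes a `tesseract` executable is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang, tessdata_dir, psm=6):
    """Build the argument list for one Tesseract run that prints text to stdout.

    `lang` is the base name of a .traineddata file inside `tessdata_dir`
    (for example a hypothetical `sarb.traineddata` for Old South Arabian).
    """
    return [
        "tesseract",
        str(image_path),
        "stdout",                      # write recognized text to standard output
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
        "--psm", str(psm),             # 6 = assume a single uniform block of text
    ]


def recognize(image_path, lang, tessdata_dir, psm=6):
    """Run Tesseract on an image and return the recognized text."""
    traineddata = Path(tessdata_dir) / f"{lang}.traineddata"
    if not traineddata.is_file():
        raise FileNotFoundError(f"missing trained data file: {traineddata}")
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tessdata_dir, psm),
        capture_output=True,
        text=True,
        check=True,                    # raise if Tesseract exits with an error
    )
    return result.stdout
```

A call such as `recognize("inscription.png", "sarb", "tessdata")` would return the recognized text as a string, provided the trained data file exists. Preprocessing and the user-assisted correction mentioned in the abstract are outside this sketch.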