Challenge C. Isolated Character/Glyph Recognition for Southeast Asian Palm Leaf Manuscripts
C.1 Description and goals
In a document image analysis (DIA) system, word or text recognition methods generally fall into two categories: segmentation-based and segmentation-free. In segmentation-based methods, isolated character recognition is a very important step [1]. Proper feature extraction and a suitable choice of classifier can increase the recognition rate [2]. Although many methods for isolated character recognition have been developed and tested, especially for Latin-based scripts and alphabets, there is still a need for in-depth evaluation of how well those methods perform on other types of scripts. This includes isolated character recognition for many Southeast Asian scripts, and more specifically for the scripts written on ancient palm leaf manuscripts.
C.2 Dataset
The palm leaf manuscript datasets for the isolated character/glyph recognition task are presented in Table 1. For the Balinese character dataset, Balinese philologists manually annotated the connected-component segments that represent a correct character in Balinese script, starting from word-level binarized images that had themselves been manually annotated [3, 4, 5] using Aletheia [8] (Fig. 1). The Sundanese character dataset was annotated manually [7] (Fig. 3). For the Khmer character dataset, a tool was developed to annotate characters/glyphs on the document page. The polygon boundary of each character is traced manually by marking its vertices one by one. Once the boundary has been constructed, a label is assigned to the annotated character [6] (Fig. 2).
Manuscripts | Classes | Training | Test | Dataset
Balinese | 133 | 11 710 images | 7 673 images | Extracted from AMADI_LontarSet [3,4,5]
Khmer | 111 | 113 206 images | 90 669 images | Extracted from SleukRith Set [6]
Sundanese | 60 | 4 555 images | 2 816 images | Extracted from Sunda Dataset [7]
Table 1: Palm leaf manuscript datasets for the isolated character/glyph recognition task
Figure 1: Balinese character dataset
Figure 2: Khmer character dataset
Figure 3: Sundanese character dataset
C.3 Tracks
This challenge has three different tracks, isolated character recognition for:
- Track 1: palm leaf manuscripts from Bali,
- Track 2: palm leaf manuscripts from Cambodia,
- Track 3: palm leaf manuscripts from Sunda.
C.4 Protocol
- Participants must submit a description of their method:
i) a detailed description of at most one A4 page and
ii) a short summary of at most 200 words
- Participants must submit the isolated character recognition results for all images in the test set in a single text file, with one line per image in the following format:
test image filename;recognized character class
Example:
test1.jpg;A
test2.jpg;Na
test3.jpg;Ca
......
- Participants must also submit a small, simple, and complete executable package of their method implementation (including any libraries it uses), together with a clear user manual explaining how to run the isolated character recognition process on a given example image.
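The result file format described above can be produced along the following lines. This is only a minimal sketch: the function names `format_results` and `write_results` and the output filename are hypothetical, and the check that rejects a semicolon inside a filename or label is an assumption made here because the semicolon is the field separator, not a rule stated by the organizers.

```python
from pathlib import Path


def format_results(predictions):
    """Return one 'filename;label' line per (filename, label) pair."""
    lines = []
    for filename, label in predictions:
        # Assumption: ';' is reserved as the separator, so it must not
        # appear inside either field.
        if ";" in filename or ";" in label:
            raise ValueError(f"';' not allowed in fields: {filename!r}, {label!r}")
        lines.append(f"{filename};{label}")
    return lines


def write_results(predictions, path):
    """Write the result lines to a single text file, one line per test image."""
    text = "\n".join(format_results(predictions)) + "\n"
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical predictions, mirroring the example in the protocol.
    example = [("test1.jpg", "A"), ("test2.jpg", "Na"), ("test3.jpg", "Ca")]
    write_results(example, "results_track1.txt")
```

Keeping the formatting step separate from the file writing makes it easy to check the lines before submitting them.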
C.5 Evaluation
Following the evaluation method of the ICFHR competition [3], the recognition rate, i.e., the percentage of correctly classified samples among the test samples, is calculated as C/N, where C is the number of correctly recognized samples and N is the total number of test samples.
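As an illustration, the recognition rate could be computed as in the sketch below. It assumes, hypothetically, that the ground truth is available in the same `filename;label` format as the result file, and that a test image missing from the submission counts as an error. Neither assumption is stated by the organizers, and the names `read_labels` and `recognition_rate` are introduced here for illustration only.

```python
def read_labels(lines):
    """Parse 'filename;label' lines into a dict mapping filename to label."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        filename, label = line.split(";", 1)
        labels[filename] = label
    return labels


def recognition_rate(ground_truth, predictions):
    """Return C/N as a percentage.

    C is the number of test samples whose predicted label matches the
    ground truth; N is the total number of test samples.
    """
    n = len(ground_truth)
    if n == 0:
        raise ValueError("ground truth is empty")
    # Assumption: a sample with no prediction is counted as incorrect.
    c = sum(1 for name, label in ground_truth.items()
            if predictions.get(name) == label)
    return 100.0 * c / n


if __name__ == "__main__":
    truth = read_labels(["test1.jpg;A", "test2.jpg;Na", "test3.jpg;Ca", "test4.jpg;Ba"])
    preds = read_labels(["test1.jpg;A", "test2.jpg;Na", "test3.jpg;Ka"])
    print(f"Recognition rate: {recognition_rate(truth, preds):.2f}%")
```

In this made-up example, two of the four test samples are recognized correctly, so C = 2 and N = 4.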
References
[1] A. Aggarwal, K. Singh, K. Singh, Use of Gradient Technique for Extracting Features from Handwritten Gurmukhi Characters and Numerals, Procedia Comput. Sci. 46 (2015) 1716–1723. doi:10.1016/j.procs.2015.02.116.
[2] M. Zahid Hossain, M. Ashraful Amin, Hong Yan, Rapid Feature Extraction for Optical Character Recognition, CoRR. abs/1206.0238 (2012). http://arxiv.org/abs/1206.0238.
[3] J.-C. Burie, M. Coustaty, S. Hadi, M.W.A. Kesiman, J.-M. Ogier, E. Paulus, K. Sok, I.M.G. Sunarya, D. Valy, ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 596–601. doi:10.1109/ICFHR.2016.107.
[4] M.W.A. Kesiman, S. Prum, J.-C. Burie, J.-M. Ogier, Study on Feature Extraction Methods for Character Recognition of Balinese Script on Palm Leaf Manuscript Images, in: 23rd Int. Conf. Pattern Recognit., Cancun, Mexico, 2016.
[5] M.W.A. Kesiman, J.-C. Burie, J.-M. Ogier, G.N.M.A. Wibawantara, I.M.G. Sunarya, AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 168–172. doi:10.1109/ICFHR.2016.39.
[6] D. Valy, M. Verleysen, S. Chhun, J.-C. Burie, A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition – SleukRith Set, in: 4th Int. Workshop Hist. Doc. Imaging Process., Kyoto, Japan, 2017.
[7] M. Suryani, E. Paulus, S. Hadi, U.A. Darsa, J.-C. Burie, The Handwritten Sundanese Palm Leaf Manuscript Dataset From 15th Century, in: 14th IAPR Int. Conf. Doc. Anal. Recognit., Kyoto, Japan, 2017. doi:10.1109/ICDAR.2017.135.
[8] C. Clausner, S. Pletschacher, A. Antonacopoulos, Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments, in: Int. Conf. Doc. Anal. Recognit., IEEE, 2011: pp. 48–52. doi:10.1109/ICDAR.2011.19.