Challenge B. Text Line Segmentation for Southeast Asian Palm Leaf Manuscripts
B.1 Description and goals
Text line segmentation is a crucial pre-processing step in most document image analysis (DIA) pipelines. The task aims at extracting the text region and separating it into individual lines. Most line segmentation approaches in the literature require a binarized input image. However, historical documents such as palm leaf manuscripts are often degraded and noisy, so binarization does not produce good enough results. Line segmentation methods that do not depend on binarization usually work directly on color/greyscale images. The goal of this challenge is therefore to develop a binarization-free method to segment and extract text line regions from color/greyscale images of ancient palm leaf manuscripts.
B.2 Datasets
The palm leaf manuscript datasets for the text line segmentation task are presented in Table 1. The text line segmentation ground truth for the Balinese and Sundanese manuscripts was generated by hand from the binarized ground truth images [1]. For Khmer, each annotated character is associated with the ID of the line it belongs to, and the region of a text line is the union of the polygon boundary areas of all annotated characters composing it [2,4]. The line segmentation ground truth of the training set will be provided. The ground truth of each image is a raw image file in which each pixel stores a positive integer corresponding to the ID of the text line it belongs to. Pixels in the background or in an undefined region store zero.
Manuscripts | Train | Test | Dataset
Balinese | 47 pages | 49 pages | Extracted from AMADI_LontarSet [1]
Khmer | 50 pages | 200 pages | Extracted from SleukRith Set [2]
Sundanese | 31 pages | 30 pages | Extracted from Sunda Dataset [3]
Table 1: Palm leaf manuscript datasets for the text line segmentation task
The ground truth raw data format follows the evaluation tool of the ICDAR2013 competition.
To read the data, please refer to the protocol at:
http://users.iit.demokritos.gr/~nstam/ICDAR2013HandSegmCont/Protocol.html
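As a rough sketch of how such a ground truth file might be read in Python, the helper below assumes that the raw file has no header, stores one 32-bit little-endian unsigned integer per pixel in row-major order starting at the top-left corner, and that the width and height are taken from the original image. These details are assumptions and should be checked against the protocol page above. The function names `read_label_map` and `line_ids` are hypothetical.

```python
import struct


def read_label_map(path, width, height):
    """Read a raw label file into a list of rows of integer line IDs.

    Assumption: one 32-bit little-endian unsigned integer per pixel,
    row-major, no header. Width and height come from the original image.
    """
    with open(path, "rb") as f:
        data = f.read()
    expected = width * height * 4
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    flat = struct.unpack(f"<{width * height}I", data)
    # Split the flat pixel sequence into rows of `width` values.
    return [list(flat[r * width:(r + 1) * width]) for r in range(height)]


def line_ids(label_map):
    """Return the sorted text line IDs; 0 (background/undefined) is excluded."""
    return sorted({v for row in label_map for v in row if v != 0})
```

The size check is there because a mismatch between the file length and the image dimensions is the most likely sign that the assumed pixel type or image size is wrong.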
B.3 Track
This challenge has a single track: text line segmentation for a mixed collection of palm leaf manuscripts from Bali, Cambodia and Sunda.
B.4 Protocol
- Participants must submit a description of their method:
i) a detailed description of at most one A4 page and
ii) a short summary of at most 200 words
- Participants must submit the text line segmentation results for all images in the test set. For example, if the file name of the original image is ABCD01.jpg, then the text line segmentation result should be a raw image file in the same format as the ground truth file described above and should be named ABCD01_lineseg.dat. Participants can either fill the whole area between seams so that no pixel is left as background, or localize only the text line areas and leave the remaining pixels as background.
- Participants must also submit a small, simple and complete executable package of their implementation (including any libraries it uses), with a clear user manual explaining how to run the text line segmentation process on a given example image.
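The naming and output convention for result files can be sketched as follows. This is a minimal illustration, not an official tool: it assumes the same raw layout as the ground truth (taken here to be one 32-bit little-endian unsigned integer per pixel, row-major, no header, which should be verified against the ICDAR2013 protocol), and the helper names `result_path` and `write_label_map` are hypothetical.

```python
import os
import struct


def result_path(image_path, out_dir="."):
    """Map e.g. 'ABCD01.jpg' to '<out_dir>/ABCD01_lineseg.dat'."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(out_dir, stem + "_lineseg.dat")


def write_label_map(label_map, path):
    """Write a 2D list of line IDs (0 = background) as a raw file.

    Assumption: 32-bit little-endian unsigned integers, row by row.
    """
    width = len(label_map[0]) if label_map else 0
    with open(path, "wb") as f:
        for row in label_map:
            if len(row) != width:
                raise ValueError("all rows must have the same width")
            f.write(struct.pack(f"<{width}I", *row))
```

For instance, `write_label_map(labels, result_path("ABCD01.jpg", "results"))` would write the label map for ABCD01.jpg to ABCD01_lineseg.dat inside an existing `results` directory.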
B.5 Evaluation
Following previous work [5], we use the evaluation criteria and tool provided by the ICDAR2013 Handwriting Segmentation Contest [6].
First, a match score is computed for each pair of ground truth and result regions; a pair counts as a one-to-one (o2o) match when its score reaches the evaluator’s acceptance threshold. In our experiments, we used 90% as the acceptance threshold. Let N be the count of ground truth elements, M the count of result elements, and o2o the number of one-to-one matches. From these, three metrics are calculated: detection rate (DR), recognition accuracy (RA), and performance metric (FM).
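In the ICDAR2013 contest [6], the match score of a region pair is the size of the intersection divided by the size of the union, and the metrics are DR = o2o/N, RA = o2o/M and FM = 2·DR·RA/(DR + RA). The sketch below illustrates that computation on two label maps given as 2D lists of line IDs. It is not the official tool: it counts every labelled pixel, whereas the official evaluator may restrict the count to a particular set of image points, so reported results should come from the official tool. The function name `evaluate` is hypothetical.

```python
from collections import Counter


def evaluate(gt, result, threshold=0.9):
    """Return (DR, RA, FM) for two equally sized 2D lists of line IDs.

    A pixel value of 0 means background; any other value is a line ID.
    """
    gt_area, res_area, overlap = Counter(), Counter(), Counter()
    for gt_row, res_row in zip(gt, result):
        for g, r in zip(gt_row, res_row):
            if g:
                gt_area[g] += 1
            if r:
                res_area[r] += 1
            if g and r:
                overlap[(g, r)] += 1

    # Count pairs whose intersection-over-union reaches the threshold.
    # For thresholds above 0.5 a region can match at most one counterpart,
    # so each counted pair is a one-to-one match.
    o2o = 0
    for (g, r), inter in overlap.items():
        union = gt_area[g] + res_area[r] - inter
        if inter / union >= threshold:
            o2o += 1

    n, m = len(gt_area), len(res_area)  # N and M in the text
    dr = o2o / n if n else 0.0
    ra = o2o / m if m else 0.0
    fm = 2 * dr * ra / (dr + ra) if dr + ra else 0.0
    return dr, ra, fm
```

Because FM is the harmonic mean of DR and RA, over-segmentation (large M) and under-segmentation (few matches for a given N) both lower the final score.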
References
[1] M.W.A. Kesiman, J.-C. Burie, J.-M. Ogier, G.N.M.A. Wibawantara, I.M.G. Sunarya, AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 168–172. doi:10.1109/ICFHR.2016.39.
[2] D. Valy, M. Verleysen, S. Chhun, J.-C. Burie, A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition – SleukRith Set, in: 4th Int. Workshop Hist. Doc. Imaging Process., Kyoto, Japan, 2017.
[3] M. Suryani, E. Paulus, S. Hadi, U.A. Darsa, J.-C. Burie, The Handwritten Sundanese Palm Leaf Manuscript Dataset From 15th Century, in: 14th IAPR Int. Conf. Doc. Anal. Recognit., Kyoto, Japan, 2017. doi:10.1109/ICDAR.2017.135.
[4] D. Valy, M. Verleysen, K. Sok, Line Segmentation for Grayscale Text Images of Khmer Palm Leaf Manuscripts, in: 7th Int. Conf. Image Process. Theory Tools Appl. IPTA 2017, Montreal, Canada, 2017.
[5] M.W.A. Kesiman, D. Valy, J.-C. Burie, E. Paulus, I.M.G. Sunarya, S. Hadi, K.H. Sok, J.-M. Ogier, Southeast Asian palm leaf manuscript images: a review of handwritten text line segmentation methods and new challenges, J. Electron. Imaging. 26 (2016) 011011. doi:10.1117/1.JEI.26.1.011011.
[6] N. Stamatopoulos, B. Gatos, G. Louloudis, U. Pal, A. Alaei, ICDAR 2013 Handwriting Segmentation Contest, in: IEEE, 2013: pp. 1402–1406. doi:10.1109/ICDAR.2013.283.