Challenge D. Word Transliteration for Southeast Asian Palm Leaf Manuscripts
D.1 Description and goals
To make palm leaf manuscripts more accessible, readable and understandable to a wider audience, an optical character recognition (OCR) system should be developed. In many document image analysis (DIA) systems, word or text recognition is the final task in the processing pipeline. In Southeast Asian scripts, however, the speech sound of a syllable typically changes according to certain phonological rules, so an OCR system alone is not sufficient. A transliteration system should therefore also be developed to help transliterate the ancient scripts in these manuscripts. Transliteration is defined as the process of obtaining the phonetic translation of names across languages [1]; it involves rendering a language from one writing system to another. In [1], the problem is stated formally as a sequence labeling problem from the alphabet of one language to that of another. Such a system will help to index the content of the manuscripts and to access it quickly and efficiently. In this challenge, methods are evaluated on transliterating words from palm leaf manuscripts in three different scripts.
D.2 Dataset
The palm leaf manuscript datasets for the word transliteration task are presented in Table 1. For the Khmer dataset, all characters on a page have been annotated and grouped into words (Fig. 2). More than one label may be given to a word. The order in which the characters of a word were selected is also kept [4]. The Balinese (Fig. 1) and Sundanese (Fig. 3) word datasets were manually annotated using Aletheia [6].
Manuscripts | Training | Test | Text | Dataset
Balinese | 15 022 images | 10 475 images | Latin | AMADI_LontarSet [2,3]
Khmer | 16 333 images | 7 791 images | Latin and Khmer | SleukRith Set [4]
Sundanese | 1 426 images | 317 images | Latin | Sunda Dataset [5]
Table 1 : Palm leaf manuscript datasets for word transliteration task
Figure 1 : Balinese word dataset
Figure 2 : Khmer word dataset
Figure 3 : Sundanese word dataset
D.3 Tracks
This challenge has four tracks, each covering word transliteration for:
- Track 1: palm leaf manuscripts from Bali,
- Track 2: palm leaf manuscripts from Cambodia,
- Track 3: palm leaf manuscripts from Sunda,
- Track 4: a mixed collection of palm leaf manuscripts from Bali, Cambodia and Sunda.
D.4 Protocols
- Participants must submit a description of their method:
i) a detailed description of at most one A4 page and
ii) a short summary of at most 200 words
- Participants must submit the text results of word transliteration for all images in the test set as one text file following this format (filename;transliteration result):
test1.jpg;transliteration 1
test2.jpg;transliteration 2
test3.jpg;transliteration 3
test4.jpg;transliteration 4
test5.jpg;transliteration 5
......
- Participants must also submit a small, simple and complete executable package implementing their method (including any libraries it uses), with a clear user manual explaining how to run the word transliteration process on a given example manuscript image.
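The result file format described above (one `filename;transliteration result` line per test image) can be produced with a few lines of Python. The following is a minimal sketch, not an official tool: the function names `format_submission` and `parse_submission` are hypothetical, and the choice to reject filenames containing `;` is an assumption made so that each line can be split unambiguously.

```python
def format_submission(results):
    """Build the submission text from (filename, transliteration) pairs.

    Each pair becomes one 'filename;transliteration' line, as in the
    format shown above. Function name and checks are illustrative only.
    """
    lines = []
    for filename, text in results:
        # Assumption: the first ';' separates the two fields, so the
        # filename itself must not contain one, and no field may span lines.
        if ";" in filename or "\n" in filename or "\n" in text:
            raise ValueError(f"cannot encode entry for {filename!r}")
        lines.append(f"{filename};{text}")
    return "\n".join(lines) + "\n"


def parse_submission(content):
    """Read submission text back into (filename, transliteration) pairs."""
    pairs = []
    for line in content.splitlines():
        if not line.strip():
            continue  # skip blank lines
        filename, sep, text = line.partition(";")
        if not sep:
            raise ValueError(f"missing ';' separator in line: {line!r}")
        pairs.append((filename, text))
    return pairs


if __name__ == "__main__":
    demo = [("test1.jpg", "transliteration 1"),
            ("test2.jpg", "transliteration 2")]
    print(format_submission(demo), end="")
```

When writing the text to disk, an explicit encoding such as UTF-8 (an assumption; the protocol above does not name one) avoids platform-dependent defaults, which matters for Khmer-script output.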
D.5 Evaluation
The error rate is defined by the edit distance between the ground truth and the recognizer output, and is computed using the provided OCRopy script ocropus-errs (https://github.com/tmbdev/ocropy/blob/master/ocropus-errs).
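To make the metric concrete, the sketch below computes a character-level error rate as the total Levenshtein edit distance divided by the total number of ground-truth characters. It is an illustration under stated assumptions, not the ocropus-errs implementation: the function names are hypothetical, and any text normalization that ocropus-errs may apply before comparison is not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(pairs):
    """Total edit distance over total ground-truth length.

    pairs: iterable of (ground_truth, recognizer_output) strings.
    Assumption: errors are summed over all words before dividing.
    """
    pairs = list(pairs)
    errors = sum(edit_distance(gt, out) for gt, out in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    if total == 0:
        raise ValueError("ground truth is empty")
    return errors / total


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(error_rate([("abcd", "abed"), ("xy", "xy")]))
```

Python strings are compared code point by code point, so for scripts with combining marks the result can depend on the Unicode normalization form of the two strings.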
References
[1] P. Shishtla, V.S. Ganesh, S. Subramaniam, V. Varma, A language-independent transliteration schema using character aligned models at NEWS 2009, in: Association for Computational Linguistics, 2009: p. 40. doi:10.3115/1699705.1699715.
[2] J.-C. Burie, M. Coustaty, S. Hadi, M.W.A. Kesiman, J.-M. Ogier, E. Paulus, K. Sok, I.M.G. Sunarya, D. Valy, ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 596–601. doi:10.1109/ICFHR.2016.107.
[3] M.W.A. Kesiman, J.-C. Burie, J.-M. Ogier, G.N.M.A. Wibawantara, I.M.G. Sunarya, AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 168–172. doi:10.1109/ICFHR.2016.39.
[4] D. Valy, M. Verleysen, S. Chhun, J.-C. Burie, A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition – SleukRith Set, in: 4th Int. Workshop Hist. Doc. Imaging Process., Kyoto, Japan, 2017.
[5] M. Suryani, E. Paulus, S. Hadi, U.A. Darsa, J.-C. Burie, The Handwritten Sundanese Palm Leaf Manuscript Dataset From 15th Century, in: 14th IAPR Int. Conf. Doc. Anal. Recognit., Kyoto, Japan, 2017. doi:10.1109/ICDAR.2017.135.
[6] C. Clausner, S. Pletschacher, A. Antonacopoulos, Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments, in: IEEE, 2011: pp. 48–52. doi:10.1109/ICDAR.2011.19.