Challenge D. Word Transliteration for Southeast Asian Palm Leaf Manuscripts

D.1 Description and goals

To make palm leaf manuscripts more accessible, readable and understandable to a wider audience, an optical character recognition (OCR) system should be developed. In many document image analysis (DIA) systems, word or text recognition is the final task in the processing pipeline. In Southeast Asian scripts, however, the speech sound of a syllable usually changes according to certain phonological rules, so an OCR system alone is not sufficient. A transliteration system should therefore also be developed to transliterate the ancient scripts found in these manuscripts. Transliteration is defined as the process of obtaining the phonetic translation of names across languages [1]; it involves rendering a language from one writing system into another. In [1], the problem is stated formally as a sequence labeling problem from the alphabet of one language to that of another. Transliteration will help to index the content of the manuscripts and to access it quickly and efficiently. In this challenge, methods will be evaluated on the transliteration of words from three different palm leaf manuscript scripts.

D.2 Dataset

The palm leaf manuscript datasets for the word transliteration task are presented in Table 1. For the Khmer dataset, all characters on each page have been annotated and grouped together into words (Fig. 2). More than one label may be given to a word. The order in which the characters of a word were selected is also kept [4]. The Balinese (Fig. 1) and Sundanese (Fig. 3) word datasets were manually annotated using Aletheia [6].

Manuscripts	Training	Test	Text	Dataset
Balinese	15 022 images from 130 pages	10 475 images from 100 pages	Latin	AMADI_LontarSet [2,3]
Khmer	16 333 images part of 657 pages	7 791 images part of 657 pages	Latin and Khmer	SleukRith Set [4]
Sundanese	1 426 images from 20 pages	317 images from 10 pages	Latin	Sunda Dataset [5]

Table 1 : Palm leaf manuscript datasets for word transliteration task

Figure 1 : Balinese word dataset

Figure 2 : Khmer word dataset

Figure 3 : Sundanese word dataset

D.3 Tracks

This challenge has four tracks, each covering word transliteration for:
    - Track 1: palm leaf manuscripts from Bali,
    - Track 2: palm leaf manuscripts from Cambodia,
    - Track 3: palm leaf manuscripts from Sunda,
    - Track 4: a mixed collection of palm leaf manuscripts from Bali, Cambodia and Sunda.

D.4 Protocols

- Participants must submit a description of their method, consisting of:
          i) a detailed description of at most one A4 page, and
          ii) a short summary of at most 200 words.

- Participants must submit the word transliteration results for all images in the test set as one text file, with one line per image in the following format (filename;transliteration result):
          test1.jpg;transliteration 1
          test2.jpg;transliteration 2
          test3.jpg;transliteration 3
          test4.jpg;transliteration 4
          test5.jpg;transliteration 5
          ......

- Participants must also submit a small, simple and complete executable package implementing their method (including any libraries it uses), with a clear user manual explaining how to run the word transliteration process on a given example manuscript image.
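
As an illustration of the result file format (filename;transliteration result) described above, the following Python sketch writes and reads such a file. The function names `write_submission` and `read_submission` are hypothetical and not part of any provided tool. Splitting each line at the first semicolon only, and using UTF-8 for files on disk, are assumptions made here; the challenge text does not specify them.

```python
import io


def write_submission(results, stream):
    """Write (filename, transliteration) pairs as 'filename;transliteration' lines."""
    for filename, text in results:
        # A ';' in the filename or a line break in either field would make
        # the line ambiguous, so such entries are rejected (our assumption).
        if ";" in filename or "\n" in filename or "\n" in text:
            raise ValueError(f"cannot serialise entry for {filename!r}")
        stream.write(f"{filename};{text}\n")


def read_submission(stream):
    """Read the file back into a list of (filename, transliteration) pairs."""
    pairs = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split at the first ';' only, so the transliteration may contain ';'.
        filename, _, text = line.partition(";")
        pairs.append((filename, text))
    return pairs


# Small in-memory example; the entries are the placeholders from the format above.
buffer = io.StringIO()
write_submission(
    [("test1.jpg", "transliteration 1"), ("test2.jpg", "transliteration 2")],
    buffer,
)
print(buffer.getvalue(), end="")
```

To write a real file, the same function can be given a file object, for example one opened with `open("results.txt", "w", encoding="utf-8")`, where the file name is again only an example.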

D.5 Evaluation

The error rate is defined in terms of the edit distance between the ground truth and the recognizer output, and is computed using the provided OCRopy function ocropus-errs (https://github.com/tmbdev/ocropy/blob/master/ocropus-errs).
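
To give an idea of what an edit-distance-based error rate measures, here is a minimal Python sketch. It is not the ocropus-errs implementation: it assumes the rate is the summed character-level Levenshtein distance divided by the total number of ground-truth characters, and it applies no text normalization, so its figures may differ from those of the official script. The names `levenshtein` and `error_rate` are hypothetical, and counting a missing result as an empty string is also an assumption.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(ground_truth, predictions):
    """Total edit distance divided by total ground-truth length.

    Both arguments map image filenames to transliterations. An image
    missing from `predictions` is scored as an empty string (assumption).
    """
    total_errors = 0
    total_chars = 0
    for name, truth in ground_truth.items():
        predicted = predictions.get(name, "")
        total_errors += levenshtein(predicted, truth)
        total_chars += len(truth)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    return total_errors / total_chars


# One substitution in 4 characters, plus 2 missing characters: 3 / 6 = 0.5
print(error_rate({"a.jpg": "abcd", "b.jpg": "xy"}, {"a.jpg": "abed"}))
```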

References

[1] P. Shishtla, V.S. Ganesh, S. Subramaniam, V. Varma, A language-independent transliteration schema using character aligned models at NEWS 2009, in: Association for Computational Linguistics, 2009: p. 40. doi:10.3115/1699705.1699715.
[2] J.-C. Burie, M. Coustaty, S. Hadi, M.W.A. Kesiman, J.-M. Ogier, E. Paulus, K. Sok, I.M.G. Sunarya, D. Valy, ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 596–601. doi:10.1109/ICFHR.2016.107.
[3] M.W.A. Kesiman, J.-C. Burie, J.-M. Ogier, G.N.M.A. Wibawantara, I.M.G. Sunarya, AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 168–172. doi:10.1109/ICFHR.2016.39.
[4] D. Valy, M. Verleysen, S. Chhun, J.-C. Burie, A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition – SleukRith Set, in: 4th Int. Workshop Hist. Doc. Imaging Process., Kyoto, Japan, 2017.
[5] M. Suryani, E. Paulus, S. Hadi, U.A. Darsa, J.-C. Burie, The Handwritten Sundanese Palm Leaf Manuscript Dataset From 15th Century, in: 14th IAPR Int. Conf. Doc. Anal. Recognit., Kyoto, Japan, 2017. doi:10.1109/ICDAR.2017.135.
[6] C. Clausner, S. Pletschacher, A. Antonacopoulos, Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments, in: IEEE, 2011: pp. 48–52. doi:10.1109/ICDAR.2011.19.