Challenge D. Word Transliteration for Southeast Asian Palm Leaf Manuscripts

D.1 Description and goals

To make palm leaf manuscripts more accessible, readable and understandable to a wider audience, an optical character recognition (OCR) system should be developed. In many document image analysis (DIA) systems, word or text recognition is the final task in the processing pipeline. In Southeast Asian scripts, however, the speech sound of a syllable usually changes according to certain phonological rules, so an OCR system alone is not sufficient. A transliteration system should therefore also be developed to help transliterate the ancient scripts in these manuscripts. Transliteration is defined as the process of obtaining the phonetic translation of names across languages [1]; it involves rendering a language from one writing system into another. In [1], the problem is stated formally as a sequence labeling problem from the alphabet of one language to that of another. Transliteration will help to index the content of the manuscripts and to access it quickly and efficiently. In this challenge, methods will be evaluated on transliterating words from three different palm leaf manuscript scripts.

D.2 Dataset

The palm leaf manuscript datasets for the word transliteration task are presented in Table 1. For the Khmer dataset, all characters on each page have been annotated and grouped together into words (Fig. 2). More than one label may be given to a created word. The order in which the characters of a word were selected is also kept [4]. The Balinese (Fig. 1) and Sundanese (Fig. 3) word datasets were manually annotated using Aletheia [6].

Manuscripts   Training                           Test                              Text              Dataset
-----------   --------------------------------   -------------------------------   ---------------   ---------------------
Balinese      15,022 images from 130 pages       10,475 images from 100 pages      Latin             AMADI_LontarSet [2,3]
Khmer         16,333 images, part of 657 pages   7,791 images, part of 657 pages   Latin and Khmer   SleukRith Set [4]
Sundanese     1,426 images from 20 pages         317 images from 10 pages          Latin             Sunda Dataset [5]

Table 1: Palm leaf manuscript datasets for the word transliteration task

Figure 1: Balinese word dataset

Figure 2: Khmer word dataset

Figure 3: Sundanese word dataset

D.3 Tracks

This challenge has four tracks, each covering word transliteration for:
    - Track 1: palm leaf manuscripts from Bali,
    - Track 2: palm leaf manuscripts from Cambodia,
    - Track 3: palm leaf manuscripts from Sunda,
    - Track 4: a mixed collection of palm leaf manuscripts from Bali, Cambodia and Sunda.

D.4 Protocols

- Participants must submit a description of their method:
          i) a detailed description of at most one A4 page and
          ii) a short summary of at most 200 words

- Participants must submit the text results of word transliteration for all images in the test set as one text file following this format (filename;transliteration result):
          test1.jpg;transliteration 1
          test2.jpg;transliteration 2
          test3.jpg;transliteration 3
          test4.jpg;transliteration 4
          test5.jpg;transliteration 5
          ......

- Participants must also submit a small, simple and complete executable package of their method implementation (including any libraries it uses), with a clear user manual explaining how to run the word transliteration process on a given example manuscript image.
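
The result file format required above (one `filename;transliteration result` pair per line) can be produced with a few lines of code. The following is a minimal sketch in Python, not an official tool of the challenge. The function names `write_submission` and `read_submission` are hypothetical, and the UTF-8 encoding and the choice to split each line at the first semicolon are assumptions; the protocol does not specify either.

```python
def write_submission(results, out_path):
    """Write (filename, transliteration) pairs, one 'filename;text' per line.

    Assumption: the file is UTF-8 encoded (the protocol does not say).
    """
    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        for filename, text in results:
            # A semicolon in the filename or a line break in the text
            # would make the line ambiguous, so reject both.
            if ";" in filename or "\n" in filename or "\n" in text:
                raise ValueError(f"cannot serialize entry for {filename!r}")
            f.write(f"{filename};{text}\n")


def read_submission(path):
    """Read a result file back into a {filename: transliteration} dict.

    Assumption: each line is split at the FIRST semicolon only, so the
    transliteration itself may contain semicolons.
    """
    results = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            filename, _, text = line.partition(";")
            results[filename] = text
    return results
```

Reading the file back with `read_submission` before submitting is a cheap way to check that every test image has exactly one line.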

D.5 Evaluation

The error rate is defined by the edit distance between the ground truth and the recognizer output, and is computed using the provided OCRopy tool ocropus-errs (https://github.com/tmbdev/ocropy/blob/master/ocropus-errs).
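
To illustrate the idea of an edit-distance-based error rate, the sketch below computes the Levenshtein distance between each ground truth string and the corresponding output, then divides the summed distances by the total ground truth length. It is only an illustration under stated assumptions, not the ocropus-errs implementation: the function names are hypothetical, the normalization by ground truth length is an assumption, and any text normalization the official tool applies is not reproduced here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def error_rate(pairs):
    """Summed edit distance over summed ground truth length.

    pairs: iterable of (ground_truth, recognizer_output) strings.
    Assumption: lengths are counted in Unicode code points.
    """
    pairs = list(pairs)
    errors = sum(levenshtein(gt, out) for gt, out in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    return errors / total if total else 0.0
```

Because Python strings are sequences of code points, a Khmer syllable written with combining marks counts as several characters in this sketch; a different counting unit would give a different rate.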

References

[1]    P. Shishtla, V.S. Ganesh, S. Subramaniam, V. Varma, A language-independent transliteration schema using character aligned models at NEWS 2009, in: Association for Computational Linguistics, 2009: p. 40. doi:10.3115/1699705.1699715.
[2]    J.-C. Burie, M. Coustaty, S. Hadi, M.W.A. Kesiman, J.-M. Ogier, E. Paulus, K. Sok, I.M.G. Sunarya, D. Valy, ICFHR 2016 Competition on the Analysis of Handwritten Text in Images of Balinese Palm Leaf Manuscripts, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 596–601. doi:10.1109/ICFHR.2016.107.
[3]    M.W.A. Kesiman, J.-C. Burie, J.-M. Ogier, G.N.M.A. Wibawantara, I.M.G. Sunarya, AMADI_LontarSet: The First Handwritten Balinese Palm Leaf Manuscripts Dataset, in: 15th Int. Conf. Front. Handwrit. Recognit. 2016, Shenzhen, China, 2016: pp. 168–172. doi:10.1109/ICFHR.2016.39.
[4]    D. Valy, M. Verleysen, S. Chhun, J.-C. Burie, A New Khmer Palm Leaf Manuscript Dataset for Document Analysis and Recognition – SleukRith Set, in: 4th Int. Workshop Hist. Doc. Imaging Process., Kyoto, Japan, 2017.
[5]    M. Suryani, E. Paulus, S. Hadi, U.A. Darsa, J.-C. Burie, The Handwritten Sundanese Palm Leaf Manuscript Dataset From 15th Century, in: 14th IAPR Int. Conf. Doc. Anal. Recognit., Kyoto, Japan, 2017. doi:10.1109/ICDAR.2017.135.
[6]    C. Clausner, S. Pletschacher, A. Antonacopoulos, Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments, in: IEEE, 2011: pp. 48–52. doi:10.1109/ICDAR.2011.19.