Challenge 2. Query-by-Example Word Spotting on Palm Leaf Manuscript Images

2.1 Description and goals

Word spotting is one of the most needed capabilities for collections of palm leaf manuscript images. Such a system lets a user find word patch images across a whole collection of palm leaf manuscript images, using a single keyword patch image as the query. Many image features and descriptors have been proposed for the word spotting task. The characteristics of palm leaf manuscripts make them a suitable challenge for testing and evaluating the features and descriptors already proposed for word spotting methods. In Balinese script, there is no space between words in a text line, and some characters are written on the upper baseline or below the baseline of the text line.

2.2 Construction of Ground Truth Word-level Annotated Patch Images

To create the word-level annotated ground truth dataset of the manuscripts, we asked Balinese philologists, students in informatics, and students in Balinese literature to work together to segment and annotate the words in the manuscripts with ALETHEIA (1), an advanced document layout and text ground-truthing system [4].

2.3 Datasets

For this challenge, the dataset is partitioned into training and test subsets.
For the training subset, we provide:

1.    100 original images
2.    About 10,000 word-level annotated patch images from those 100 original images

For the testing subset, we provide:

1.    100 original images (different from the training subset)
2.    30 word-level annotated patch images as test queries

Full-size image: Challenge2.png

 

2.4 Protocols

Participants submit the word spotting results for all test queries in the testing subset in a single text file, with one spotted area per line in the following format:
filename_query;filename_image_of_spotted_area;column_top_left;row_top_left;column_bottom_right;row_bottom_right;
(A spotted area is a rectangle defined by its top-left and bottom-right points.)

Example:

query1.jpg;manuscript1.jpg;200;100;450;300;
query1.jpg;manuscript1.jpg;800;700;950;850;
query1.jpg;manuscript2.jpg;200;100;450;300;
query2.jpg;manuscript1.jpg;210;150;550;320;
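
As an illustration of the format above, here is a minimal Python sketch that writes and reads one result line. The function names `format_spotting` and `parse_spotting` are hypothetical and are not part of any official tool for the challenge. The sketch assumes that coordinates are integer pixel positions and that every line ends with a trailing semicolon, as in the example.

```python
def format_spotting(query, image, left, top, right, bottom):
    """Build one result line; every field is followed by ';'."""
    if right < left or bottom < top:
        raise ValueError("bottom-right point must not precede top-left point")
    return f"{query};{image};{left};{top};{right};{bottom};"


def parse_spotting(line):
    """Split one result line into (query, image, (left, top, right, bottom))."""
    fields = line.strip().split(";")
    # The trailing ';' yields an empty seventh field after splitting.
    if len(fields) != 7 or fields[6] != "":
        raise ValueError(f"malformed result line: {line!r}")
    left, top, right, bottom = (int(value) for value in fields[2:6])
    return fields[0], fields[1], (left, top, right, bottom)


if __name__ == "__main__":
    line = format_spotting("query1.jpg", "manuscript1.jpg", 200, 100, 450, 300)
    print(line)
    print(parse_spotting(line))
```

Checking each line with a parser like this before submission catches missing fields or a missing trailing semicolon.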

2.5 Evaluation

We calculate the recall and precision of the spotted areas against the ground truth word-level annotated patch images of the testing subset. A spotted area is considered relevant if it overlaps more than 50% of a ground truth word-level patch area containing the same query word, and if its width and height do not exceed twice those of the ground truth area.
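
The relevance criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not the organizers' evaluation code. Boxes are `(left, top, right, bottom)` tuples in pixels, the overlap is measured as a fraction of the ground truth area, and the size limit is read as "each dimension at most twice the ground truth dimension". The function names are hypothetical, and how multiple spottings of the same ground truth word are counted is left to the caller.

```python
def box_area(box):
    """Area of a (left, top, right, bottom) box; 0 for degenerate boxes."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def intersection_area(a, b):
    """Area of the rectangle shared by boxes a and b (0 if disjoint)."""
    shared = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(shared)


def is_relevant(spot, truth):
    """Assumed reading of the criterion: more than 50% of the ground truth
    area is covered, and the spotted box is at most twice as wide and
    twice as high as the ground truth box."""
    truth_area = box_area(truth)
    if truth_area == 0:
        return False
    covered = intersection_area(spot, truth) / truth_area
    size_ok = (spot[2] - spot[0] <= 2 * (truth[2] - truth[0])
               and spot[3] - spot[1] <= 2 * (truth[3] - truth[1]))
    return covered > 0.5 and size_ok


def precision_recall(relevant_spotted, total_spotted, total_ground_truth):
    """Precision = relevant / spotted; recall = relevant / ground truth."""
    precision = relevant_spotted / total_spotted if total_spotted else 0.0
    recall = relevant_spotted / total_ground_truth if total_ground_truth else 0.0
    return precision, recall
```

The size condition matters because a very large spotted rectangle would otherwise cover every ground truth word on a page and count as relevant for all of them.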

References

[4]    C. Clausner, S. Pletschacher, A. Antonacopoulos, Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments, in: 2011 International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2011: pp. 48–52. doi:10.1109/ICDAR.2011.19

 

(1) http://www.primaresearch.org/tools/Aletheia