Adaptive Text Recognition through Visual Matching
Supplementary Material

Chuhan Zhang1, Ankush Gupta2, and Andrew Zisserman1

1 Visual Geometry Group, Department of Engineering Science, University of Oxford
[email protected]
2 DeepMind, London
[email protected]

Contents
  Section 1  Evaluation Metrics
  Section 2  Ablation Study
  Section 3  Examples and Visualizations
  Section 4  Implementation of SOTA Models
  Section 5  Performance on Scene Text
  Section 6  Dataset Details

Code, data, and model checkpoints are available at:
http://www.robots.ox.ac.uk/~vgg/research/FontAdaptor20/

1 Evaluation Metrics

We measure the character error rate (CER) and word error rate (WER):

\[ \mathrm{CER} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{EditDist}\big(y_{gt}^{(i)},\, y_{pred}^{(i)}\big)}{\mathrm{Length}\big(y_{gt}^{(i)}\big)} \]

where y_gt^(i) and y_pred^(i) are the i-th ground-truth and predicted strings respectively in a dataset containing N strings, EditDist is the Levenshtein distance [10], and Length(y_gt^(i)) is the number of characters in y_gt^(i). WER is computed in the same way as CER, but with words (i.e. contiguous characters separated by whitespace) in place of characters as tokens. (A minimal Python sketch of this computation is given below, after Section 2.1.)

2 Ablation Study

2.1 Ablation on modules at a larger scale

In Section 5.4 of the main paper, we ablated each major component of the proposed model architecture to evaluate its relative contribution to the model's performance. However, the training set there was limited to a single FontSynth attribute (regular fonts). Here, in Table 1, we ablate the same model components while training on an increasing number of fonts from the FontSynth dataset (Section 5.2 of the main paper), still evaluating on the FontSynth test set.
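The following is a minimal Python sketch of the CER and WER metrics defined in Section 1. It is an illustration under our own naming (edit_distance, cer, wer), not the evaluation code released at the URL above, which may differ in details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences,
    computed with a single-row dynamic-programming table."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))  # distances against the empty prefix of ref
    for i in range(1, m + 1):
        prev, d[0] = d[0], i  # prev holds the diagonal entry d[i-1][j-1]
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(
                d[j] + 1,                           # deletion
                d[j - 1] + 1,                       # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
    return d[n]


def cer(gts, preds):
    """Mean per-string character error rate over a dataset,
    i.e. (1/N) * sum_i EditDist(gt_i, pred_i) / Length(gt_i)."""
    assert len(gts) == len(preds)
    return sum(edit_distance(g, p) / len(g)
               for g, p in zip(gts, preds)) / len(gts)


def wer(gts, preds):
    """Word error rate: same as CER, but with whitespace-separated
    words in place of characters as tokens."""
    return cer([g.split() for g in gts], [p.split() for p in preds])
```

For example, cer(["hello world"], ["helo world"]) gives 1/11 ≈ 0.091 (one deleted character over eleven), while wer on the same pair gives 0.5 (one of two words is wrong).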