Distinguishing Mathematics Notation from English Text using Computational Geometry Derek M. Drake Henry S. Baird Lehigh University, CSE Dept. Lehigh University, CSE Dept. Bethlehem, PA 18015 USA Bethlehem, PA 18015 USA
[email protected] [email protected] Abstract and browsers of the OCR results as collections of images uncorrupted by OCR errors, for example in “visual math A trainable method for distinguishing between mathe- indexes” to assist fast search. Also, since several experi- matics notation and natural language (here, English) in mental systems exist to recognize mathematics [10, 5, 3] images of textlines, using computational geometry meth- which run independently of commercial OCR machines, au- ods only with no assistance from symbol recognition, is de- tomatic isolation of math images could allow them to be ap- scribed. The input to our method is a “neighbor graph” plied selectively where they are most likely to succeed. The extracted from a bilevel image of an isolated textline by the Infty system [12] is an example of such an integration. method of Kise [8]: this is a pruned form of Delaunay tri- Mathematics notation is often largely—though seldom angulation of the set of locations of black connected com- entirely—independent of the natural language of the sur- ponents. Our method first attempts to classify each vertex rounding text, and so there may be important uses for a and, separately, each edge of the neighbor graph as belong- “language-free” mathematics recognition system applicable ing to math or English; then these results are combined to independently of any language-specific OCR system.