A Probabilistic Generative Model for Typographical Analysis of Early Modern Printing
Kartik Goyal1  Chris Dyer2  Christopher Warren3  Max G'Sell4  Taylor Berg-Kirkpatrick5
1Language Technologies Institute, Carnegie Mellon University  2DeepMind  3Department of English, Carnegie Mellon University  4Department of Statistics, Carnegie Mellon University  5Computer Science and Engineering, University of California, San Diego
{kartikgo,cnwarren,mgsell}@andrew.cmu.edu  [email protected]  [email protected]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2954–2960, July 5–10, 2020. © 2020 Association for Computational Linguistics.

Abstract

We propose a deep and interpretable probabilistic generative model to analyze glyph shapes in printed Early Modern documents. We focus on clustering extracted glyph images into underlying templates in the presence of multiple confounding sources of variance. Our approach introduces a neural editor model that first generates well-understood printing phenomena like spatial perturbations from template parameters via interpretable latent variables, and then modifies the result by generating a non-interpretable latent vector responsible for inking variations, jitter, noise from the archiving process, and other unforeseen phenomena associated with Early Modern printing. Critically, by introducing an inference network whose input is restricted to the visual residual between the observation and the interpretably-modified template, we are able to control and isolate what the vector-valued latent variable captures. We show that our approach outperforms rigid interpretable clustering baselines (Ocular) and overly-flexible deep generative models (VAE) alike on the task of completely unsupervised discovery of typefaces in mixed-font documents.

Figure 1: We desire a generative model that can be biased to cluster according to typeface characteristics (e.g. the length of the middle arm) rather than other more visually salient sources of variation like inking.

1 Introduction

Scholars interested in understanding details related to the production and provenance of historical documents rely on methods of analysis ranging from the study of orthographic differences and stylometrics, to visual analysis of layout, font, and printed characters. Recently developed tools like Ocular (Berg-Kirkpatrick et al., 2013) for OCR of historical documents have helped automate and scale some textual analysis methods for tasks like compositor attribution (Ryskina et al., 2017) and digitization of historical documents (Garrette et al., 2015). However, researchers often find the need to go beyond textual analysis for establishing provenance of historical documents. For example, Hinman (1963)'s study of typesetting in Shakespeare's First Folio relied on the discovery of pieces of damaged or distinctive type through manual inspection of every glyph in the document. More recently, Warren et al. (2020) examine pieces of distinctive types across several printers of the early modern period to posit the identity of clandestine printers of John Milton's Areopagitica (1644). In such work, researchers frequently aim to determine whether a book was produced by a single printer or by multiple printers (Weiss, 1992; Malcolm, 2014; Takano, 2016). Hence, in order to aid these visual methods of analysis, we propose here a novel probabilistic generative model for analyzing extracted images of individual printed characters in historical documents. We draw from work on both deep generative modeling and interpretable models of the printing press to develop an approach that is both flexible and controllable – the latter being a critical requirement for such analysis tools.

As depicted in Figure 1, we are interested in identifying clusters of subtly distinctive glyph shapes, as these correspond to distinct metal stamps in the type-cases used by printers.
However, other sources of variation (inking, for example, as depicted in Figure 1) are likely to dominate conventional clustering methods. For example, powerful models like the variational autoencoder (VAE) (Kingma and Welling, 2014) capture the more visually salient variance in inking rather than typeface, while more rigid models (e.g. the emission model of Ocular (Berg-Kirkpatrick et al., 2013)) fail to fit the data. The goal of our approach is to account for these confounding sources of variance while isolating the variables pertinent to clustering.

Hence, we propose a generative clustering model that introduces a neural editing process to add expressivity, but includes interpretable latent variables that model well-understood variance in the printing process: bi-axial translation, shear, and rotation of canonical type shapes. In order to make our model controllable and prevent deep latent variables from explaining all variance in the data, we introduce a restricted inference network. By only allowing the inference network to observe the visual residual of the observation after interpretable modifications have been applied, we bias the posterior approximation on the neural editor (and thus the model itself) to capture residual sources of variance in the editor – for example, inking levels, ink bleeds, and imaging noise. This approach is related to recently introduced neural editor models for text generation (Guu et al., 2018).

In experiments, we compare our model with rigid interpretable models (Ocular) and powerful generative models (VAE) on the task of unsupervised clustering of subtly distinct typefaces in scanned images of early modern documents sourced from Early English Books Online (EEBO).

2 Model

Our model reasons about the printed appearances of a symbol (say, majuscule F) in a document via a mixture model whose K components correspond to different metal stamps used by a printer for the document. During various stages of printing, random transformations result in varying printed manifestations of a metal cast on the paper. Figure 2 depicts our model. We denote an observed image of the extracted character by X. We denote the choice of typeface by the latent variable c (the mixture component) with prior π. We represent the shape of the k-th stamp by the template T_k, a square matrix of parameters. We denote the interpretable latent variables corresponding to spatial adjustment of the metal stamp by λ, and the editor latent variable responsible for residual sources of variation by z.

Figure 2: Proposed generative model for clustering images of a symbol by typeface. Each mixture component c corresponds to a learnable template T_k. The λ variables warp (spatially adjust) the original template T to T̃. This warped template is then further transformed via the z variables to T̂ via an expressive neural filter function parametrized by θ.

As illustrated in Figure 2, after a cluster component c = k is selected, the corresponding template T_k undergoes a transformation to yield T̂_k. This transformation occurs in two stages: first, the interpretable spatial adjustment variables λ produce an adjusted template (§2.1), T̃_k = warp(T_k; λ), and then the neural latent variable transforms the adjusted template (§2.2), T̂_k = filter(T̃_k; z). The marginal probability under our model is

p(X) = Σ_k π_k ∫∫ p(X | λ, z; T_k) p(λ) p(z) dz dλ,

where p(X | λ, z; T_k) refers to the distribution over the binary pixels of X in which each pixel follows a Bernoulli distribution parametrized by the value of the corresponding pixel-entry in T̂_k.

2.1 Interpretable spatial adjustment

Early typesetting was noisy, and the metal pieces were often arranged with slight variations, which resulted in the printed characters being positioned with small amounts of offset, rotation, and shear.
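The two-stage generative story above (choose a component c, spatially warp its template, apply a neural filter, then emit Bernoulli pixels) can be sketched via ancestral sampling. This is a toy stand-in, not the paper's implementation: here `warp` is a simple integer-pixel shift rather than the learned Gaussian-attention warp of §2.1, `filter_fn` reduces the neural filter to a scalar inking shift, and the template shapes and priors are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def warp(template, offset):
    # Hypothetical stand-in for warp(T_k; lambda): integer-pixel translation.
    return np.roll(template, offset, axis=(0, 1))

def filter_fn(warped, z):
    # Hypothetical stand-in for filter(T~_k; z): sigmoid of the warped
    # template logits plus a scalar "inking" shift derived from z.
    return 1.0 / (1.0 + np.exp(-(warped + z.sum())))

def generate(templates, pi):
    """Ancestral sampling: c -> lambda -> z -> X."""
    k = rng.choice(len(pi), p=pi)                  # mixture component c = k
    offset = tuple(rng.integers(-1, 2, size=2))    # spatial adjustment (lambda)
    z = 0.1 * rng.standard_normal(4)               # residual latent vector
    t_hat = filter_fn(warp(templates[k], offset), z)
    x = rng.random(t_hat.shape) < t_hat            # per-pixel Bernoulli draw
    return k, x

pi = np.array([0.5, 0.5])
templates = [rng.standard_normal((8, 8)) for _ in range(2)]
k, x = generate(templates, pi)
print(k, x.shape)
```

Averaging such samples over many draws of (c, λ, z) would approximate the marginal p(X) above; the model instead learns the templates and θ by (approximately) maximizing that marginal, as described in §2.3.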
These real-valued spatial adjustment variables are denoted by λ = (r, o, s, a), where r represents the rotation variable, o = (o_h, o_v) represents offsets along the horizontal and vertical axes, and s = (s_h, s_v) denotes shear along the two axes. A scale factor, ã = 1.0 + a, accounts for minor scale variations arising due to the archiving and extraction processes. All variables in λ are generated from a Gaussian prior with zero mean and fixed variance, as the transformations due to these variables tend to be subtle.

In order to incorporate these deterministic transformations in a differentiable manner, we map λ to a template-sized attention map H_ij for each output pixel position (i, j) in T̃, as depicted in Figure 3. The attention map for each output pixel is formed in order to attend to the correspondingly shifted (or scaled or sheared) portion of the input template, and is shaped according to a Gaussian distribution with mean determined by an affine transform. This approach allows for a strong inductive bias, which contrasts with related work on spatial-VAE (Bepler et al., 2019) that learns arbitrary transformations.

Figure 3: Translation operation: The mode of the attention map is shifted by the offset values for every output pixel in T̃. Similar operations account for shear, rotation, and scale.

2.2 Residual sources of variations

Apart from spatial perturbations, other major sources of deviation in early printing include random inking perturbations caused by inconsistent application of the stamps, unpredictable ink bleeds, and noise associated with digital archiving of the documents.

Figure 4: Inference network for z, which conditions on the mixture component and only the residual image left after subtracting the λ-transformed template from the observation: the posterior approximation q(z | X, c; φ) is parametrized as z = InferNet(R_c, c), where R_c = X − T̃_c. This encourages z to model variance due to sources other than spatial adjustments.

2.3 Learning and Inference

Our aim is to maximize the log likelihood of the observed data {X_d | d ∈ ℕ, d < n} of n images with respect to the model parameters:

LL(T_{1,…,K}, θ) = max_{T,θ} Σ_d log [ Σ_k π_k ∫∫ p(X_d | λ_d, z_d; T_k, θ) p(λ_d) p(z_d) dz_d dλ_d ]

During training, we maximize the likelihood with respect to λ instead of marginalizing over it, an approximation inspired by iterated conditional modes (Besag, 1986):

max_{T,θ} Σ_d log Σ_k max_{γ_{k,d}} π_k ∫ p(X_d | λ_d = γ_{k,d}, z_d; T_k, θ) p(λ_d = γ_{k,d}) p(z_d) dz_d
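The ICM-style inner maximization can be sketched under strong simplifying assumptions: the spatial variable γ_{k,d} is discretized to a small grid of integer pixel offsets and maximized by exhaustive search, the integral over z is dropped, and the likelihood is the pixelwise Bernoulli term from §2. The function names and toy templates below are hypothetical, not from the paper (which optimizes continuous λ); the sketch only illustrates replacing the integral over λ with a per-component maximization.

```python
import numpy as np

rng = np.random.default_rng(1)

def bernoulli_loglik(x, probs, eps=1e-6):
    """Pixelwise Bernoulli log-likelihood of a binary image x under probs."""
    p = np.clip(probs, eps, 1.0 - eps)
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def best_offset_loglik(x, template, max_shift=2):
    """ICM-style inner step: instead of marginalizing over the spatial
    variable, search a small grid of integer offsets (a discretized
    stand-in for gamma_{k,d}) and keep the best log-likelihood."""
    best = -np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(template, (dy, dx), axis=(0, 1))
            best = max(best, bernoulli_loglik(x, shifted))
    return best

def icm_objective(x, templates, pi):
    """Per-image term: log sum_k pi_k * max_gamma p(x | gamma; T_k)."""
    scores = np.array([np.log(pi[k]) + best_offset_loglik(x, templates[k])
                       for k in range(len(templates))])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())  # stable log-sum-exp

# Toy check: an image sampled from a shifted copy of template 0 should
# score higher under template 0 than under template 1.
templates = [np.full((8, 8), 0.1), np.full((8, 8), 0.9)]
templates[0][2:6, 2:6] = 0.9
x = (rng.random((8, 8)) < np.roll(templates[0], (1, 1), axis=(0, 1))).astype(float)
pi = np.array([0.5, 0.5])
print(icm_objective(x, templates, pi))
```

In the full model this maximization over γ_{k,d} would be interleaved with gradient updates to the templates and θ, with the remaining integral over z handled by the variational posterior q(z | X, c; φ).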