A Deep Factorization of Style and Structure in Fonts

Nikita Srivatsan¹, Jonathan T. Barron², Dan Klein³, Taylor Berg-Kirkpatrick⁴

¹Language Technologies Institute, Carnegie Mellon University, [email protected]
²Google Research, [email protected]
³Computer Science Division, University of California, Berkeley, [email protected]
⁴Computer Science and Engineering, University of California, San Diego, [email protected]

Figure 1: Example fonts from the Capitals64 dataset. The task of font reconstruction involves generating missing glyphs from partially observed novel fonts.

Abstract

We propose a deep factorization model for typographic analysis that disentangles content from style. Specifically, a variational inference procedure factors each training glyph into the combination of a character-specific content embedding and a latent font-specific style variable. The underlying generative model combines these factors through an asymmetric transpose convolutional process to generate the image of the glyph itself. When trained on corpora of fonts, our model learns a manifold over font styles that can be used to analyze or reconstruct new, unseen fonts. On the task of reconstructing missing glyphs from an unknown font given only a small number of observations, our model outperforms both a strong nearest neighbors baseline and a state-of-the-art discriminative model from prior work.

1 Introduction

One of the most visible attributes of digital language data is its typography. A font makes use of unique stylistic features in a visually consistent manner across a broad set of characters while preserving the structure of each underlying form in order to be human readable – as shown in Figure 1. Modeling these stylistic attributes and how they compose with underlying character structure could aid typographic analysis and even allow for automatic generation of novel fonts. Further, the variability of these stylistic features presents a challenge for optical character recognition systems, which typically presume a library of known fonts. In the case of historical document recognition, for example, this problem is more pronounced due to the wide range of lost, ancestral fonts present in such data (Berg-Kirkpatrick et al., 2013; Berg-Kirkpatrick and Klein, 2014). Models that capture this wide stylistic variation of glyph images may eventually be useful for improving optical character recognition on unknown fonts.

In this work we present a probabilistic latent variable model capable of disentangling stylistic features of fonts from the underlying structure of each character. Our model represents the style of each font as a vector-valued latent variable, and parameterizes the structure of each character as a learned embedding. Critically, each style latent variable is shared by all characters within a font, while character embeddings are shared by characters of the same type across all fonts. Thus, our approach is related to a long literature on using tensor factorization as a method for disentangling style and content (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2002; Tang et al., 2013) and to recent deep tensor factorization techniques (Xue et al., 2017).

Inspired by neural methods' ability to disentangle loosely coupled phenomena in other domains, including both language and vision (Hu et al., 2017; Yang et al., 2017; Gatys et al., 2016; Zhu et al., 2017), we parameterize the distribution that combines style and structure in order to generate glyph images as a transpose convolutional neural decoder (Dumoulin and Visin, 2016). Further, the decoder is fed character embeddings early on in the process, while the font latent variables directly parameterize the convolution filters. This architecture biases the model to capture the asymmetric process by which structure and style combine to produce an observed glyph.

We evaluate our learned representations on the task of font reconstruction. After being trained on a set of observed fonts, the system reconstructs missing glyphs in a set of previously unseen fonts, conditioned on a small observed subset of glyph images. Under our generative model, font reconstruction can be performed via posterior inference. Since the posterior is intractable, we demonstrate how a variational inference procedure can be used to perform both learning and accurate font reconstruction. In experiments, we find that our proposed latent variable model is able to substantially outperform both a strong nearest-neighbors baseline as well as a state-of-the-art discriminative system on a standard dataset for font reconstruction. Further, in qualitative analysis, we demonstrate how the learned latent space can be used to interpolate between fonts, hinting at the practicality of more creative applications.

2 Related Work

Discussion of computational style and content separation in fonts dates at least as far back as the writings of Hofstadter (1983, 1995). Some prior work has tackled this problem through the use of bilinear factorization models (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000), while others have used discriminative neural models (Zhang et al., 2018, 2020) and adversarial training techniques (Azadi et al., 2018). In contrast, we propose a deep probabilistic approach that combines aspects of both these lines of past work. Further, while some prior approaches to modeling fonts rely on stroke or topological representations of observed glyphs (Campbell and Kautz, 2014; Phan et al., 2015; Suveeranont and Igarashi, 2010), ours directly models pixel values in rasterized glyph representations and allows us to more easily generalize to fonts with variable glyph topologies.

Finally, while we focus our evaluation on font reconstruction, our approach has an important relationship with style transfer – a framing made explicit by Zhang et al. (2018, 2020) – as the goal of our analysis is to learn a smooth manifold of font styles that allows for stylistic inference given a small sample of glyphs. However, many other style transfer tasks in the language domain (Shen et al., 2017) suffer from ambiguity surrounding the underlying division between style and semantic content. By contrast, in this setting the distinction is clearly defined, with content (i.e. the character) observed as a categorical label denoting the coarse overall shape of a glyph, and style (i.e. the font) explaining lower-level visual features such as boldness, texture, and serifs. The modeling approach taken here might inform work on more complex domains where the division is less clear.

3 Font Reconstruction

We can view a collection of fonts as a matrix, X, where each column corresponds to a particular character type, and each row corresponds to a specific font. Each entry in the matrix, x_ij, is an image of the glyph for character i in the style of font j, which we observe as a 64 × 64 grayscale image as shown in Figure 1. In a real world setting, the equivalent matrix would naturally have missing entries wherever the encoding for a character type in a font is undefined. In general, not all fonts contain renderings of all possible character types; many will only support one particular language or alphabet and leave out uncommon symbols. Further, for many commercial applications, only the small subset of characters that appears in a specific advertisement or promotional message will have been designed by the artist – the majority of glyphs are missing. As a result, we may wish to have models that can infer these missing glyphs, a task referred to as font reconstruction.

Following recent prior work (Azadi et al., 2018), we define the task setup as follows: During training we have access to a large collection of observed fonts for the complete character set. At test time we are required to predict the missing glyph images in a collection of previously unseen fonts with the same character set. Each test font will contain observable glyph images for a small randomized subset of the character set. Based on the style of this subset, the model must reconstruct glyphs for the rest of the character set.
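To make the data layout concrete, the short sketch below (our own illustration, not code from the paper) lays out a font collection as an array of glyph images together with an observation mask, and samples the kind of small random observed subset a test font would expose; array names and sizes are hypothetical, with 26 character types and 64 × 64 glyphs as described above.

```python
import numpy as np

I, J = 26, 1000                       # character types x fonts (illustrative sizes)
glyphs = np.zeros((I, J, 64, 64))     # X: the glyph for character i rendered in font j
observed = np.ones((I, J), dtype=bool)

# At test time a previously unseen font exposes only a small random subset of
# its characters; the model must fill in the rest based on that subset's style.
rng = np.random.default_rng(0)
test_font = 17
shown = rng.choice(I, size=4, replace=False)    # e.g. 1, 2, 4, or 8 observed glyphs
observed[:, test_font] = False
observed[shown, test_font] = True

missing = np.flatnonzero(~observed[:, test_font])
print(f"observed: {sorted(shown.tolist())}, glyphs to reconstruct: {len(missing)}")
```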


Figure 2: Depiction of the generative process of our model. Each observed glyph image is generated conditioned on the latent variable of the corresponding font and the embedding parameter of the corresponding character type. For a more detailed description of the decoder architecture and hyperparameters, see Appendix A.

Font reconstruction can be thought of as a form of matrix completion; given various observations in both a particular row and column, we wish to reconstruct the element at their intersection. Alternatively we can view it as a few-shot style transfer task, in that we want to apply the characteristic attributes of a new font (e.g. serifs, italicization, drop-shadow) to a letter using a small number of examples to infer those attributes. Past work on font reconstruction has focused on discriminative techniques. For example Azadi et al. (2018) used an adversarial network to directly predict held out glyphs conditioned on observed glyphs. By contrast, we propose a generative approach using a deep latent variable model. Under our approach fonts are generated based on an unobserved style embedding, which we can perform inference over given any number of observations.

4 Model

Figure 2 depicts our model's generative process. Given a collection of images of glyphs consisting of I character types across J fonts, our model hypothesizes a separation of character-specific structural attributes and font-specific stylistic attributes into two different representations. Since all characters are observed in at least one font, each character type is represented as an embedding vector which is part of the model's parameterization. In contrast, only a subset of fonts is observed during training and our model will be expected to generalize to reconstructing unseen fonts at test time. Thus, our representation of each font is treated as a vector-valued latent variable rather than a deterministic embedding.

More specifically, for each font in the collection, a font embedding variable, z_j ∈ R^k, is sampled from a fixed multivariate Gaussian prior, p(z_j) = N(0, I_k). Next, each glyph image, x_ij, is generated independently, conditioned on the corresponding font variable, z_j, and a character-specific parameter vector, e_i ∈ R^k, which we refer to as a character embedding. Thus, glyphs of the same character type share a character embedding, while glyphs of the same font share a font variable. A corpus of I × J glyphs is modeled with only I character embeddings and J font variables, as seen in the left half of Figure 2. This modeling approach can be thought of as a form of deep matrix factorization, where the content at any given cell is purely a function of the vector representations of the corresponding row and column. We denote the full corpus of glyphs as a matrix X = ((x_11, ..., x_1J), ..., (x_I1, ..., x_IJ)) and denote the corresponding character embeddings as E = (e_1, ..., e_I) and font variables as Z = (z_1, ..., z_J).

Under our model, the probability distribution over each image, conditioned on z_j and e_i, is parameterized by a neural network, described in the next section and depicted in Figure 2. We denote this decoder distribution as p(x_ij | z_j; e_i, φ), and let φ represent parameters, shared by all glyphs, that govern how font variables and character embeddings combine to produce glyph images. In contrast, the character embedding parameters, e_i, which feed into the decoder, are only shared by glyphs of the same character type. The font variables, z_j, are unobserved during training and will be inferred at test time. The joint probability under our model is given by:

    p(X, Z; E, φ) = ∏_{i,j} p(x_ij | z_j; e_i, φ) p(z_j)

4.1 Decoder Architecture

One way to encourage the model to learn disentangled representations of style and content is by choosing an architecture that introduces helpful inductive bias. For this domain, we can think of the character type as specifying the overall shape of the image, and the font style as influencing the finer details; we formulate our decoder with this difference in mind. We hypothesize that a glyph can be modeled in terms of a low-resolution character representation to which a complex operator specific to that font has been applied.

The success of transpose convolutional layers at generating realistic images suggests a natural way to apply this intuition. A transpose convolution¹ is a convolution performed on an undecimated input (i.e. with zeros inserted in between pixels in alternating rows and columns), resulting in an upscaled output. Transpose convolutional architectures generally start with a low resolution input which is passed through several such layers, iteratively increasing the resolution and decreasing the number of channels until the final output dimensions are reached. We note that the asymmetry between the coarse input and the convolutional filters closely aligns with the desired inductive biases, and therefore use this framework as a starting point for our architecture.

¹ sometimes erroneously referred to as a "deconvolution"

Broadly speaking, our architecture represents the underlying shape that defines the specific character type (but not the font) as coarse-grained information that therefore enters the transpose convolutional process early on. In contrast, the stylistic content that specifies attributes of the specific font (such as serifs, drop shadow, texture) is represented as finer-grained information that enters into the decoder at a later stage, by parameterizing filters, as shown in the right half of Figure 2.

Specifically we form our decoder as follows: first the character embedding is projected to a low resolution matrix with a large number of channels. Following that, we apply several transpose convolutional layers which increase the resolution and reduce the number of channels. Critically, the convolutional filter at each step is not a learned parameter of the model, but rather the output of a small multilayer perceptron whose input is the font latent variable z. Between these transpose convolutions, we insert vanilla convolutional layers to fine-tune following the increase in resolution.

Overall, the decoder consists of four blocks, where each block contains a transpose convolution, which upscales the previous layer and reduces the number of channels by a factor of two, followed by two convolutional layers. Each (transpose) convolution is followed by an instance norm and a ReLU activation. The convolution filters all have a kernel size of 5 × 5. The character embedding is reshaped via a two-layer MLP into an 8 × 8 × 256 tensor before being fed into the decoder. The final 64 × 64 dimensional output layer is treated as a grid of parameters which defines the output distribution on pixels. We describe the specifics of this distribution in the next section.
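As an illustration of the filter parameterization described above, here is a minimal PyTorch sketch (ours, not the released implementation) of a single decoder block in which the transpose-convolution kernel and bias are generated by a small MLP from the font latent variable. Channel counts are scaled down from the 8 × 8 × 256 starting map of the real model, and all class and variable names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FontConditionedUpsample(nn.Module):
    """One decoder block: a transpose convolution whose kernel and bias are
    produced from the font latent z by a small MLP (rather than being learned
    directly), followed by instance norm and ReLU."""

    def __init__(self, z_dim, in_ch, out_ch, k=5):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        n_kernel = in_ch * out_ch * k * k
        self.hyper = nn.Sequential(               # MLP mapping z -> filter + bias
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, n_kernel + out_ch))
        self.norm = nn.InstanceNorm2d(out_ch)

    def forward(self, h, z):
        # h: (n_glyphs, in_ch, H, W) feature maps derived from character embeddings
        # z: (z_dim,) latent style vector shared by all glyphs of one font
        params = self.hyper(z)
        n_kernel = self.in_ch * self.out_ch * self.k * self.k
        weight = params[:n_kernel].view(self.in_ch, self.out_ch, self.k, self.k)
        bias = params[n_kernel:]
        h = F.conv_transpose2d(h, weight, bias, stride=2, padding=2, output_padding=1)
        return F.relu(self.norm(h))

# Toy usage: 26 coarse 8x8 character maps upsampled to 16x16 under one font style.
# (The paper's decoder starts from 8x8x256 and halves channels per block; the
# channel counts here are reduced just to keep the example light.)
char_maps = torch.randn(26, 64, 8, 8)
z = torch.randn(32)
block = FontConditionedUpsample(z_dim=32, in_ch=64, out_ch=32)
print(block(char_maps, z).shape)    # -> torch.Size([26, 32, 16, 16])
```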
4.2 Projected Loss

The conventional approach for computing loss on image observations is to use an independent output distribution, typically a Gaussian, on each pixel's intensity. However, deliberate analysis of the statistics of natural images has shown that images are not well-described in terms of statistically independent pixels, but are instead better modeled in terms of edges (Field, 1987; Huang and Mumford, 1999). It has also been demonstrated that images of text have similar statistical distributions as natural images (Melmer et al., 2013). Following this insight, as our reconstruction loss we use a heavy-tailed (leptokurtotic) distribution placed on a transformed representation of the image, similar to the approach of Barron (2019). Modeling the statistics of font glyphs in this fashion results in sharper samples, while modeling independent pixels with a Gaussian distribution results in blurry, oversmoothed results.


Figure 3: Depiction of the computation graph of the amortized variational lower bound (for simplicity, only one font is shown). The encoder approximates the generative model's true posterior over the font style latent variables given the observations. It remains insensitive to the number of observations by pooling high-level features across glyphs. For a more detailed description of the encoder architecture, see Appendix A.

More specifically, we adopt one of the strategies employed in Barron (2019), and transform image observations using the orthonormal variant of the 2-Dimensional Discrete Cosine Transform (2-D DCT-II) (Ahmed et al., 1974), which we denote as f : R^(64×64) → R^(64×64) for our 64 × 64 dimensional image observations. We transform both the observed glyph image and the corresponding grid of parameters produced by our decoder before computing the observation's likelihood.

This procedure projects observed images onto a grid of orthogonal bases comprised of shifted and scaled 2-dimensional cosine functions. Because the DCT-II is orthonormal, this transformation is volume-preserving, and so likelihoods computed in the projected space correspond to valid measurements in the original pixel domain.

Note that because the DCT-II is simply a rotation in our vector space, imposing a normal distribution in this transformed space should have little effect (ignoring the scaling induced by the diagonal of the covariance matrix of the Gaussian distribution) as Euclidean distance is preserved under rotations. For this reason we impose a heavy-tailed distribution in this transformed space, specifically a Cauchy distribution. This gives the following probability density function:

    g(x; x̂, γ) = 1 / ( πγ [ 1 + ( (f(x) − f(x̂)) / γ )² ] )

where x is an observed glyph, x̂ is the location parameter grid output by our decoder, and γ is a hyperparameter which we set to γ = 0.001.

The Cauchy distribution accurately captures the heavy-tailed structure of the edges in natural images. Intuitively, it models the fact that images tend to be mostly smooth, with a small amount of non-smooth variation in the form of edges. Computing this heavy-tailed loss over the frequency decomposition provided by the DCT-II instead of the raw pixel values encourages the decoder to generate sharper images without needing either an adversarial discriminator or a vectorized representation of the characters during training. Note that while our training loss is computed in DCT-II space, at test time we treat the raw grid of parameter outputs x̂ as the glyph reconstruction.

5 Learning and Inference

Note that in our training setting, the font variables Z are completely unobserved, and we must induce their manifold with learning. As our model is generative, we wish to optimize the marginal probability of just the observed X with respect to the model parameters E and φ:

    p(X; E, φ) = ∫ p(X, Z; E, φ) dZ

However, the integral over Z is computationally intractable, given that the complex relationship between Z and X does not permit a closed form solution. Related latent variable models such as Variational Autoencoders (VAE) (Kingma and Welling, 2014) with intractable marginals have successfully performed learning by optimizing a variational lower bound on the log marginal likelihood. This surrogate objective, called the evidence lower bound (ELBO), introduces a variational approximation, q(Z|X) = ∏_j q(z_j | x_1j, ..., x_Ij), to the model's true posterior, p(Z|X). Our model's ELBO is as follows:

    ELBO = Σ_j E_q[ log p(x_1j, ..., x_Ij | z_j) ] − KL( q(z_j | x_1j, ..., x_Ij) || p(z_j) )

where the approximation q is parameterized via a neural encoder network.
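The following numpy sketch (our own, for a single 64 × 64 glyph) spells out the projected loss just defined: an orthonormal 2-D DCT-II applied to both the observed glyph and the decoder's parameter grid, followed by the Cauchy negative log-likelihood with the γ value given above. Function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=64):
    """Orthonormal DCT-II basis: rows are cosines, so dct2(X) = C @ X @ C.T."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0, :] /= np.sqrt(2.0)          # first row rescaled so that C @ C.T = I
    return C

def cauchy_projected_nll(x, x_hat, gamma=0.001):
    """-log g(x; x_hat, gamma) summed over the DCT-II coefficients."""
    C = dct_matrix(x.shape[0])
    resid = C @ x @ C.T - C @ x_hat @ C.T        # f(x) - f(x_hat)
    return np.sum(np.log(np.pi * gamma) + np.log1p((resid / gamma) ** 2))

rng = np.random.default_rng(0)
glyph = rng.random((64, 64))                     # observed glyph, intensities in [0, 1]
recon = np.clip(glyph + 0.01 * rng.standard_normal((64, 64)), 0, 1)
print(cauchy_projected_nll(glyph, recon))        # closer reconstructions score lower
```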
This lower bound can be optimized by stochastic gradient ascent if q is a Gaussian, via the reparameterization trick described in (Kingma and Welling, 2014; Rezende et al., 2014) to sample the expectation under q while still permitting backpropagation.

Practically speaking, a key property which we desire is the ability to perform consistent inference over z given a variable number of observed glyphs in a font. We address this in two ways: through the architecture of our encoder, and through a special masking process in training; both of which are shown in Figure 3.

5.1 Posterior Approximation

Observation Subsampling: To get reconstructions from only partially observed fonts at test time, the encoder network must be able to infer z_j from any subset of (x_1j, ..., x_Ij). One approach for achieving robustness to the number of observations is through the training procedure. Specifically when computing the approximate posterior for a particular font in our training corpus, we mask out a randomly selected subset of the characters before passing them to the encoder. This incentivizes the encoder to produce reasonable estimates without becoming too reliant on the features extracted from any one particular character, which more closely matches the setup at test time.

Encoder Architecture: Another way to encourage this robustness is through inductive bias in the encoder architecture. Specifically we use a convolutional neural network which takes in a batch of characters from a single font, concatenated with their respective character type embedding. Following the final convolutional layer, we perform an elementwise max operation across the batch, reducing to a single vector representation for the entire font which we pass through further fully-connected layers to obtain the output parameters of q as shown in Figure 3. By including this accumulation across the elements of the batch, we combine the features obtained from each character in a manner that is largely invariant to the total number and types of characters observed. This provides an inductive bias that encourages the model to extract similar features from each character type, which should therefore represent stylistic as opposed to structural properties.

Overall, the encoder over each glyph consists of three blocks, where each block consists of a convolution followed by a max pool with a stride of two, an instance norm (Ulyanov et al., 2016), and a ReLU. The activations are then pooled across the characters via an elementwise max into a single vector, which is then passed through four fully-connected layers, before predicting the parameters of the Gaussian approximate posterior.

Reconstruction via Inference: At test time, we pass an observed subset of a new font to our encoder in order to estimate the posterior over z_j, and take the mean of that distribution as the inferred font representation. We then pass this encoding to the decoder along with the full set of character embeddings E in order to produce reconstructions of every glyph in the font.

6 Experiments

We now provide an overview of the specifics of the dataset and training procedure, and describe our experimental setup and baselines.

6.1 Data

We compare our model against baseline systems at font reconstruction on the Capitals64 dataset (Azadi et al., 2018), which contains the 26 capital letters of the English alphabet as grayscale 64 × 64 pixel images across 10,682 fonts. These are broken down into training, dev, and test splits of 7649, 1473, and 1560 fonts respectively.

Upon manual inspection of the dataset, it is apparent that several fonts have an almost visually indistinguishable nearest neighbor, making the reconstruction task trivial using a naive algorithm (or an overfit model with high capacity) for those particular font archetypes. Because these datapoints are less informative with respect to a model's ability to generalize to previously unseen styles, we additionally evaluate on a second test set designed to avoid this redundancy. Specifically, we choose the 10% of test fonts that have maximal L2 distance from their closest equivalent in the training set, which we call "Test Hard".

                Test Full                           Test Hard
Observations    1        2        4        8       1        2        4        8
NN              483.13   424.49   386.81   363.97  880.22   814.67   761.29   735.18
GlyphNet        669.36   533.97   455.23   416.65  935.01   813.50   718.02   653.57
Ours (FC)       353.63   316.47   293.67   281.89  596.57   556.21   527.50   513.25
Ours (Conv)     352.07   300.46   271.03   254.92  615.87   556.03   511.05   489.58

Table 1: Per-glyph L2 reconstruction error by number of observed characters. "Full" includes the entire test set while "Hard" is measured only over the 10% of test fonts with the highest L2 distance from the closest font in train.

6.2 Baselines

As stated previously, many fonts fall into visually similar archetypes. Based on this property, we use a nearest neighbors algorithm for our first baseline. Given a partially observed font at test time, this approach simply "reconstructs" by searching the training set for the font with the lowest L2 distance over the observed characters, and copying its glyphs verbatim for the missing characters.

For our second comparison, we use the GlyphNet model from Azadi et al. (2018). This approach is based on a generative adversarial network, which uses discriminators to encourage the model to generate outputs that are difficult to distinguish from those in the training set. We test from the publicly available epoch 400 checkpoint, with modifications to the evaluation script to match the setup described above.

We also perform an ablation using fully-connected instead of convolutional layers. For more architecture details see Appendix A.

6.3 Training Details

We train our model to maximize the expected log likelihood using the Adam optimization algorithm (Kingma and Ba, 2015) with a step size of 10^-5 (default settings otherwise), and perform early stopping based on the approximate log likelihood on a hard subset of dev selected by the process described earlier. To encourage robustness in the encoder, we randomly drop out glyphs during training with a probability of 0.7 (rejecting samples where all characters in a font are dropped). All experiments are run with a dimensionality of 32 for the character embeddings and font latent variables. Our implementation² is built in PyTorch (Paszke et al., 2017) version 1.1.0.

² https://bitbucket.org/NikitaSrivatsan/DeepFactorizationFontsEMNLP19

7 Results

We now present quantitative results from our experiments in both automated and human annotated metrics, and offer qualitative analysis of reconstructions and the learned font manifold.

7.1 Quantitative Evaluation

Automatic Evaluation: We show font reconstruction results for our system against nearest neighbors and GlyphNet in Table 1. Each model is given a random subsample of glyphs from each test font (we measure at 1, 2, 4, and 8 observed characters), with their character labels. We measure the average L2 distance between the image reconstructions for the unobserved characters and the ground truth, after scaling intensities to [0, 1].

Our system achieves the best performance on both the overall and hard subsets of test for all numbers of observed glyphs. Nearest neighbors provides a strong baseline on the full test set, even outperforming GlyphNet. However it performs much worse on the hard subset. This makes sense as we expect nearest neighbors to do extremely well on any test fonts that have a close equivalent in train, but suffer in fidelity on less traditional styles. GlyphNet similarly performs worse on test hard, which could reflect the missing modes problem of GANs failing to capture the full diversity of the data distribution (Che et al., 2016; Tolstikhin et al., 2017). The fully-connected ablation is also competitive, although we see that the convolutional architecture is better able to infer style from larger numbers of observations. On the hard test set, the fully-connected network even outperforms the convolutional system when only one observation is present, perhaps indicating that its lower-capacity architecture better generalizes from very limited data.
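For concreteness, the snippet below (ours, not the released evaluation script) shows one reading of the metric just described: the reconstruction error averaged over the unobserved glyphs of a font, with intensities scaled to [0, 1]. The exact normalization of the "L2 distance" (squared norm versus norm) is an assumption on our part, and all names are hypothetical.

```python
import numpy as np

def per_glyph_l2(recon, truth):
    # Squared L2 distance between one reconstructed glyph and the ground truth,
    # both in [0, 1]. (Using the squared norm here is our assumption; the paper
    # simply reports an "L2 distance" per glyph.)
    return float(((recon - truth) ** 2).sum())

def evaluate_font(reconstructions, ground_truth, observed_idx):
    # Average the metric only over the characters that were *not* observed.
    held_out = [i for i in range(len(ground_truth)) if i not in set(observed_idx)]
    return np.mean([per_glyph_l2(reconstructions[i], ground_truth[i]) for i in held_out])

rng = np.random.default_rng(0)
truth = rng.random((26, 64, 64))             # stand-in glyphs for one test font
recon = np.clip(truth + 0.1 * rng.standard_normal(truth.shape), 0, 1)
print(evaluate_font(recon, truth, observed_idx=[0, 1]))
```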
Human Evaluation: To measure how consistent these perceptual differences are, we also perform a human evaluation of our model's reconstructions against GlyphNet using Amazon Mechanical Turk (AMT). In our setup, turkers were asked to compare the output of our model against the output of GlyphNet for a single font given one observed character, which they were also shown. Turkers selected a ranking based on which reconstructed font best matched the style of the observed character, and a separate ranking based on which was more realistic. On the full test set (1560 fonts, with 5 turkers assigned to each font), humans preferred our system over GlyphNet 81.3% and 81.8% of the time for style and realism respectively. We found that on average 76% of turkers shown a given pair of reconstructions selected the same ranking as each other on both criteria, suggesting high annotator agreement.

Figure 4: Reconstructions of partially observed fonts in the hard subset from our model, GlyphNet, and nearest neighbors. Given images of glyphs for 'A' and 'B' in each font, we visualize reconstructions of the remaining characters. Fonts are chosen such that the L2 loss of our model on these examples closely matches the average loss over the full evaluation set.

7.2 Qualitative Analysis

In order to fully understand the comparative behavior of these systems, we also qualitatively compare their reconstruction output to analyze their various failure modes, showing examples in Figure 4. We generally find that our approach tends to produce reconstructions that, while occasionally blurry at the edges, are generally faithful at reproducing the principal stylistic features of the font. For example, we see that for font (1) in Figure 4, we match not only the overall shape of the letters, but also the drop shadow and to an extent the texture within the lettering, while GlyphNet does not produce fully enclosed letters or match the texture. The output of nearest neighbors, while well-formed, does not respect the style of the font as closely, as it fails to find a font in training that matches these stylistic properties. In font (2) the systems all produce a form of gothic lettering, but the output of GlyphNet is again lacking in certain details, and nearest neighbors makes subtle but noticeable changes to the shape of the letters. In the final example (3) we even see that our system appears to attempt to replicate the pixelated outline, while nearest neighbors ignores this subtlety. GlyphNet is in this case somewhat inconsistent, doing reasonably well on some letters, but much worse on others. Overall, nearest neighbors will necessarily output well-formed glyphs, but with lower fidelity to the style, particularly on more unique fonts. While GlyphNet does pick up on some subtle features, our model tends to produce the most coherent output on harder fonts.

Figure 5: t-SNE projection of latent font variables inferred for the full training set, colored by k-means clustering with k = 10. The glyph for the letter "A" for each centroid is shown in overlay.

7.3 Analysis of Learned Manifold

Since our model attempts to learn a smooth manifold over the latent style, we can also perform interpolation between the inferred font representations, something which is not directly possible using either of the baselines.
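As a sketch of the interpolation analysis detailed in this subsection (our own illustration, with stand-in encoder and decoder functions rather than the trained model), latent style vectors for two related fonts are linearly mixed, and each mixture is decoded into a full set of glyphs:

```python
import numpy as np

def interpolate_styles(encode_mean, decode, glyphs_a, glyphs_b, steps=7):
    # Encode each font's observed glyphs to a latent style vector, then decode
    # fonts lying along the straight line between the two latents.
    z_a, z_b = encode_mean(glyphs_a), encode_mean(glyphs_b)
    alphas = np.linspace(0.0, 1.0, steps)
    return [decode((1 - a) * z_a + a * z_b) for a in alphas]

# Toy stand-ins so the sketch runs end to end (a real run would use the model):
rng = np.random.default_rng(0)
fake_encode = lambda glyphs: rng.standard_normal(32)      # hypothetical encoder mean
fake_decode = lambda z: np.zeros((26, 64, 64))            # hypothetical decoder output
frames = interpolate_styles(fake_encode, fake_decode, None, None)
print(len(frames), frames[0].shape)                        # -> 7 (26, 64, 64)
```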
In this analysis, we Figure 6: Interpolation be- tween font variants from the same font family, show- ing smoothness of the latent manifold. Linear combina- tions of the embedded fonts correspond to outputs that lie intuitively “in between”.


Figure 7: t-SNE projection of latent font variables inferred on the Google Fonts dataset, colored by weight and category.

In this analysis, we take two fonts from the same font family, which differ along one property, and pass them through our encoder to obtain the latent font variable for each. We then interpolate between these values, passing the result at various steps into our decoder to produce new fonts that exist in between the observations. In Figure 6 we see how our model can apply serifs, italicization, and boldness gradually while leaving the font unchanged in other respects. This demonstrates that our manifold is smooth and interpretable, not just at the points corresponding to those in our dataset. This could be leveraged to modify existing fonts with respect to a particular attribute to generate novel fonts efficiently.

Beyond looking at the quality of reconstructions, we also wish to analyze properties of the latent space learned by our model. To do this, we use our encoder to infer latent font variables z for each of the fonts in our training data, and use a t-SNE projection (Van Der Maaten, 2013) to plot them in 2D, shown in Figure 5. Since the broader, long-distance groupings may not be preserved by this transformation, we perform a k-means clustering (Lloyd, 1982) with k = 10 on the high-dimensional representations to visualize these high-level groupings. Additionally, we display a sample glyph for the centroid of each cluster. We see that while the clusters are not completely separated, the centroids generally correspond to common font styles, and are largely adjacent to or overlapping with those with similar stylistic properties; for example script, handwritten, and gothic styles are very close together.

To analyze how well our latent embeddings correspond to human defined notions of font style, we also run our system on the Google Fonts dataset,³ which despite containing fewer fonts and less diversity in style, lists metadata including numerical weight and category (e.g. serif, handwriting, monospace). In Figure 7 we show t-SNE projections of latent font variables from our model trained on Google Fonts, colored accordingly. We see that the embeddings do generally cluster by weight as well as category, suggesting that our model is learning latent information consistent with how humans perceive font style.

³ https://github.com/google/fonts

8 Conclusion

We presented a latent variable model of glyphs which learns disentangled representations of the structural properties of underlying characters from stylistic features of the font. We evaluated our model on the task of font reconstruction and showed that it outperformed both a strong nearest neighbors baseline and prior work based on GANs, especially for fonts highly dissimilar to any instance in the training set. In future work, it may be worth extending this model to learn a latent manifold on content as well as style, which could allow for reconstruction of previously unseen character types, or generalization to other domains where the notion of content is higher dimensional.

Acknowledgments

This project is funded in part by the NSF under grant 1618044 and by the NEH under grant HAA-256044-17. Special thanks to Si Wu for help constructing the Google Fonts figures.

References

Nasir Ahmed, T Natarajan, and Kamisetty R Rao. 1974. Discrete cosine transform. IEEE Transactions on Computers.

Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content GAN for few-shot font style transfer. CVPR.

Jonathan T. Barron. 2019. A general and adaptive robust loss function. CVPR.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. ACL.

Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. ACL.

Neill DF Campbell and Jan Kautz. 2014. Learning a manifold of fonts. ACM TOG.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

David J. Field. 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A.

William T Freeman and Joshua B Tenenbaum. 1997. Learning bilinear models for two-factor problems in vision. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 554–560. IEEE.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. CVPR.

Douglas R Hofstadter. 1983. Metamagical themas. Scientific American, 248(5):16–E18.

Douglas R Hofstadter. 1995. Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought. Basic Books.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Jinggang Huang and David Mumford. 1999. Statistics of natural images and models. CVPR.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. ICLR.

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory.

Tamara Melmer, Seyed Ali Amirshahi, Michael Koch, Joachim Denzler, and Christoph Redies. 2013. From regular text to artistic writing and artworks: Fourier statistics of images with low and high aesthetic appeal. Frontiers in Human Neuroscience.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Huy Quoc Phan, Hongbo Fu, and Antoni B Chan. 2015. Flexyfont: Learning transferring rules for flexible typeface synthesis. In Computer Graphics Forum, volume 34, pages 245–256. Wiley Online Library.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. NIPS.

Rapee Suveeranont and Takeo Igarashi. 2010. Example-based automatic font generation. In International Symposium on Smart Graphics, pages 127–138. Springer.

Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Tensor analyzers. In International Conference on Machine Learning, pages 163–171.

Joshua B Tenenbaum and William T Freeman. 2000. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283.

Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. 2017. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

M Alex O Vasilescu and Demetri Terzopoulos. 2002. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. IJCAI.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2018. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2020. A unified framework for generalizable style transfer: Style and content separation. IEEE Transactions on Image Processing, 29:4085–4098.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.

A Appendix

A.1 Architecture Notation

We now provide further details on the specific architectures used to parameterize our model and inference network. The following abbreviations are used to represent various components:

• Fi : fully-connected layer with i hidden units
• R : ReLU activation
• M : batch max pool
• S : 2 × 2 spatial max pool
• Ci : convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of 1, dilation of 1
• I : instance normalization
• Ti,j,k : transpose convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of j, k pixel output padding, dilation of 1, where kernel and bias are the output of an MLP (described below)
• H : reshape to 26 × 256 × 8 × 8

A.2 Network Architecture

Our fully-connected encoder is:
F128-R-F128-R-F128-R-F1024-R-M-F128-R-F128-R-F128-R-F64

Our convolutional encoder is:
C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64

Our fully-connected decoder is:
F128-R-F128-R-F128-R-F128-R-F128-R-F4096

Our transpose convolutional decoder is:
F128-R-F16384-R-H-T256,2,1-R-C256-I-R-C256-I-R-T128,2,1-R-C128-I-R-C128-I-R-T64,2,1-I-R-C64-I-R-C64-I-R-T32,1,0-I-R-C32-I-R-C1

The MLP used to compute a transpose convolutional parameter of size j is:
F128-R-Fj
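For readers who prefer code to the shorthand above, the following PyTorch sketch (ours, not the released implementation) is one way to read the convolutional encoder string C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64: three conv/pool/instance-norm/ReLU blocks per glyph, a batch max pool (M) across the observed glyphs of a font, and fully-connected layers down to 64 outputs. Treating those 64 outputs as the mean and log-variance of the 32-dimensional font posterior, and using a single input channel rather than glyphs concatenated with character embeddings, are our assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Ci-S-I-R: 5x5 convolution, 2x2 spatial max pool, instance norm, ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.MaxPool2d(2),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(),
    )

class ConvEncoder(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(in_ch, 64), conv_block(64, 128), conv_block(128, 256))
        self.f1024 = nn.Linear(256 * 8 * 8, 1024)     # F1024 on the flattened 8x8x256 map
        self.head = nn.Sequential(                     # R-F128-R-F128-R-F128-R-F64
            nn.ReLU(), nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 128),
            nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

    def forward(self, glyphs):
        # glyphs: (n_observed, in_ch, 64, 64) images from a single font
        h = self.blocks(glyphs).flatten(1)             # -> (n_observed, 16384)
        h = self.f1024(h)
        h, _ = h.max(dim=0)                            # M: batch max pool across glyphs
        out = self.head(h)
        return out[:32], out[32:]                      # assumed split: mean, log-variance

enc = ConvEncoder()
mu, logvar = enc(torch.randn(8, 1, 64, 64))            # a font with 8 observed glyphs
print(mu.shape, logvar.shape)                          # torch.Size([32]) torch.Size([32])
```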