A Deep Factorization of Style and Structure in Fonts

Nikita Srivatsan¹, Jonathan T. Barron², Dan Klein³, Taylor Berg-Kirkpatrick⁴

¹Language Technologies Institute, Carnegie Mellon University, [email protected]
²Google Research, [email protected]
³Computer Science Division, University of California, Berkeley, [email protected]
⁴Computer Science and Engineering, University of California, San Diego, [email protected]

Figure 1: Example fonts from the Capitals64 dataset. The task of font reconstruction involves generating missing glyphs from partially observed novel fonts.

Abstract

We propose a deep factorization model for typographic analysis that disentangles content from style. Specifically, a variational inference procedure factors each training glyph into the combination of a character-specific content embedding and a latent font-specific style variable. The underlying generative model combines these factors through an asymmetric transpose convolutional process to generate the image of the glyph itself. When trained on corpora of fonts, our model learns a manifold over font styles that can be used to analyze or reconstruct new, unseen fonts. On the task of reconstructing missing glyphs from an unknown font given only a small number of observations, our model outperforms both a strong nearest neighbors baseline and a state-of-the-art discriminative model from prior work.

1 Introduction

One of the most visible attributes of digital language data is its typography. A font makes use of unique stylistic features in a visually consistent manner across a broad set of characters while preserving the structure of each underlying form in order to be human readable – as shown in Figure 1. Modeling these stylistic attributes and how they compose with underlying character structure could aid typographic analysis and even allow for automatic generation of novel fonts. Further, the variability of these stylistic features presents a challenge for optical character recognition systems, which typically presume a library of known fonts. In the case of historical document recognition, for example, this problem is more pronounced due to the wide range of lost, ancestral fonts present in such data (Berg-Kirkpatrick et al., 2013; Berg-Kirkpatrick and Klein, 2014). Models that capture this wide stylistic variation of glyph images may eventually be useful for improving optical character recognition on unknown fonts.

In this work we present a probabilistic latent variable model capable of disentangling stylistic features of fonts from the underlying structure of each character. Our model represents the style of each font as a vector-valued latent variable, and parameterizes the structure of each character as a learned embedding. Critically, each style latent variable is shared by all characters within a font, while character embeddings are shared by characters of the same type across all fonts. Thus, our approach is related to a long literature on using tensor factorization as a method for disentangling style and content (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2002; Tang et al., 2013) and to recent deep tensor factorization techniques (Xue et al., 2017).

Inspired by neural methods' ability to disentangle loosely coupled phenomena in other domains, including both language and vision (Hu et al., 2017; Yang et al., 2017; Gatys et al., 2016; Zhu et al., 2017), we parameterize the distribution that combines style and structure in order to generate glyph images as a transpose convolutional neural decoder (Dumoulin and Visin, 2016). Further, the decoder is fed character embeddings early on in the process, while the font latent variables directly parameterize the convolution filters. This architecture biases the model to capture the asymmetric process by which structure and style combine to produce an observed glyph.

We evaluate our learned representations on the task of font reconstruction. After being trained on a set of observed fonts, the system reconstructs missing glyphs in a set of previously unseen fonts, conditioned on a small observed subset of glyph images. Under our generative model, font reconstruction can be performed via posterior inference. Since the posterior is intractable, we demonstrate how a variational inference procedure can be used to perform both learning and accurate font reconstruction. In experiments, we find that our proposed latent variable model is able to substantially outperform both a strong nearest-neighbors baseline as well as a state-of-the-art discriminative system on a standard dataset for font reconstruction. Further, in qualitative analysis, we demonstrate how the learned latent space can be used to interpolate between fonts, hinting at the practicality of more creative applications.

2 Related Work

Discussion of computational style and content separation in fonts dates at least as far back as the writings of Hofstadter (1983, 1995). Some prior work has tackled this problem through the use of bilinear factorization models (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000), while others have used discriminative neural models (Zhang et al., 2018, 2020) and adversarial training techniques (Azadi et al., 2018). In contrast, we propose a deep probabilistic approach that combines aspects of both these lines of past work. Further, while some prior approaches to modeling fonts rely on stroke or topological representations of observed glyphs (Campbell and Kautz, 2014; Phan et al., 2015; Suveeranont and Igarashi, 2010), ours directly models pixel values in rasterized glyph representations and allows us to more easily generalize to fonts with variable glyph topologies.

Finally, while we focus our evaluation on font reconstruction, our approach has an important relationship with style transfer – a framing made explicit by Zhang et al. (2018, 2020) – as the goal of our analysis is to learn a smooth manifold of font styles that allows for stylistic inference given a small sample of glyphs. However, many other style transfer tasks in the language domain (Shen et al., 2017) suffer from ambiguity surrounding the underlying division between style and semantic content. By contrast, in this setting the distinction is clearly defined, with content (i.e. the character) observed as a categorical label denoting the coarse overall shape of a glyph, and style (i.e. the font) explaining lower-level visual features such as boldness, texture, and serifs. The modeling approach taken here might inform work on more complex domains where the division is less clear.

3 Font Reconstruction

We can view a collection of fonts as a matrix, X, where each column corresponds to a particular character type, and each row corresponds to a specific font. Each entry in the matrix, x_ij, is an image of the glyph for character i in the style of font j, which we observe as a 64 × 64 grayscale image as shown in Figure 1. In a real world setting, the equivalent matrix would naturally have missing entries wherever the encoding for a character type in a font is undefined. In general, not all fonts contain renderings of all possible character types; many will only support one particular language or alphabet and leave out uncommon symbols. Further, for many commercial applications, only the small subset of characters that appears in a specific advertisement or promotional message will have been designed by the artist – the majority of glyphs are missing. As a result, we may wish to have models that can infer these missing glyphs, a task referred to as font reconstruction.

Following recent prior work (Azadi et al., 2018), we define the task setup as follows: During training we have access to a large collection of observed fonts for the complete character set. At test time we are required to predict the missing glyph images in a collection of previously unseen fonts with the same character set. Each test font will contain observable glyph images for a small randomized subset of the character set. Based on the style of this subset, the model must reconstruct glyphs for the rest of the character set.
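To make the data layout concrete, the short sketch below (our own illustration, not code from the paper) lays out a font collection as an array of glyph images together with an observation mask, and samples the kind of small random observed subset a test font would expose; array names and sizes are hypothetical, with 26 character types and 64 × 64 glyphs as described above.

```python
import numpy as np

I, J = 26, 1000                       # character types x fonts (illustrative sizes)
glyphs = np.zeros((I, J, 64, 64))     # X: the glyph for character i rendered in font j
observed = np.ones((I, J), dtype=bool)

# At test time a previously unseen font exposes only a small random subset of
# its characters; the model must fill in the rest based on that subset's style.
rng = np.random.default_rng(0)
test_font = 17
shown = rng.choice(I, size=4, replace=False)    # e.g. 1, 2, 4, or 8 observed glyphs
observed[:, test_font] = False
observed[shown, test_font] = True

missing = np.flatnonzero(~observed[:, test_font])
print(f"observed: {sorted(shown.tolist())}, glyphs to reconstruct: {len(missing)}")
```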


Figure 2: Depiction of the generative process of our model. Each observed glyph image is generated conditioned on the latent variable of the corresponding font and the embedding parameter of the corresponding character type. For a more detailed description of the decoder architecture and hyperparameters, see Appendix A.

Font reconstruction can be thought of as a form of matrix completion; given various observations in both a particular row and column, we wish to reconstruct the element at their intersection. Alternatively we can view it as a few-shot style transfer task, in that we want to apply the characteristic attributes of a new font (e.g. serifs, italicization, drop-shadow) to a letter using a small number of examples to infer those attributes. Past work on font reconstruction has focused on discriminative techniques. For example Azadi et al. (2018) used an adversarial network to directly predict held out glyphs conditioned on observed glyphs. By contrast, we propose a generative approach using a deep latent variable model. Under our approach fonts are generated based on an unobserved style embedding, which we can perform inference over given any number of observations.

4 Model

Figure 2 depicts our model's generative process. Given a collection of images of glyphs consisting of I character types across J fonts, our model hypothesizes a separation of character-specific structural attributes and font-specific stylistic attributes into two different representations. Since all characters are observed in at least one font, each character type is represented as an embedding vector which is part of the model's parameterization. In contrast, only a subset of fonts is observed during training and our model will be expected to generalize to reconstructing unseen fonts at test time. Thus, our representation of each font is treated as a vector-valued latent variable rather than a deterministic embedding.

More specifically, for each font in the collection, a font embedding variable, z_j ∈ R^k, is sampled from a fixed multivariate Gaussian prior, p(z_j) = N(0, I_k). Next, each glyph image, x_ij, is generated independently, conditioned on the corresponding font variable, z_j, and a character-specific parameter vector, e_i ∈ R^k, which we refer to as a character embedding. Thus, glyphs of the same character type share a character embedding, while glyphs of the same font share a font variable. A corpus of I × J glyphs is modeled with only I character embeddings and J font variables, as seen in the left half of Figure 2. This modeling approach can be thought of as a form of deep matrix factorization, where the content at any given cell is purely a function of the vector representations of the corresponding row and column. We denote the full corpus of glyphs as a matrix X = ((x_11, ..., x_1J), ..., (x_I1, ..., x_IJ)) and denote the corresponding character embeddings as E = (e_1, ..., e_I) and font variables as Z = (z_1, ..., z_J).

Under our model, the probability distribution over each image, conditioned on z_j and e_i, is parameterized by a neural network, described in the next section and depicted in Figure 2. We denote this decoder distribution as p(x_ij | z_j; e_i, φ), and let φ represent parameters, shared by all glyphs, that govern how font variables and character embeddings combine to produce glyph images. In contrast, the character embedding parameters, e_i, which feed into the decoder, are only shared by glyphs of the same character type. The font variables, z_j, are unobserved during training and will be inferred at test time. The joint probability under our model is given by:

    p(X, Z; E, φ) = ∏_{i,j} p(x_ij | z_j; e_i, φ) p(z_j)

4.1 Decoder Architecture

One way to encourage the model to learn disentangled representations of style and content is by choosing an architecture that introduces helpful inductive bias. For this domain, we can think of the character type as specifying the overall shape of the image, and the font style as influencing the finer details; we formulate our decoder with this difference in mind. We hypothesize that a glyph can be modeled in terms of a low-resolution character representation to which a complex operator specific to that font has been applied.

The success of transpose convolutional layers at generating realistic images suggests a natural way to apply this intuition. A transpose convolution¹ is a convolution performed on an undecimated input (i.e. with zeros inserted in between pixels in alternating rows and columns), resulting in an upscaled output. Transpose convolutional architectures generally start with a low resolution input which is passed through several such layers, iteratively increasing the resolution and decreasing the number of channels until the final output dimensions are reached. We note that the asymmetry between the coarse input and the convolutional filters closely aligns with the desired inductive biases, and therefore use this framework as a starting point for our architecture.

¹ sometimes erroneously referred to as a "deconvolution"

Broadly speaking, our architecture represents the underlying shape that defines the specific character type (but not the font) as coarse-grained information that therefore enters the transpose convolutional process early on. In contrast, the stylistic content that specifies attributes of the specific font (such as serifs, drop shadow, texture) is represented as finer-grained information that enters into the decoder at a later stage, by parameterizing filters, as shown in the right half of Figure 2.

Specifically we form our decoder as follows: first the character embedding is projected to a low resolution matrix with a large number of channels. Following that, we apply several transpose convolutional layers which increase the resolution and reduce the number of channels. Critically, the convolutional filter at each step is not a learned parameter of the model, but rather the output of a small multilayer perceptron whose input is the font latent variable z. Between these transpose convolutions, we insert vanilla convolutional layers to fine-tune following the increase in resolution.

Overall, the decoder consists of four blocks, where each block contains a transpose convolution, which upscales the previous layer and reduces the number of channels by a factor of two, followed by two convolutional layers. Each (transpose) convolution is followed by an instance norm and a ReLU activation. The convolution filters all have a kernel size of 5 × 5. The character embedding is reshaped via a two-layer MLP into an 8 × 8 × 256 tensor before being fed into the decoder. The final 64 × 64 dimensional output layer is treated as a grid of parameters which defines the output distribution on pixels. We describe the specifics of this distribution in the next section.
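As an illustration of the filter parameterization described above, here is a minimal PyTorch sketch (ours, not the released implementation) of a single decoder block in which the transpose-convolution kernel and bias are generated by a small MLP from the font latent variable. Channel counts are scaled down from the 8 × 8 × 256 starting map of the real model, and all class and variable names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FontConditionedUpsample(nn.Module):
    """One decoder block: a transpose convolution whose kernel and bias are
    produced from the font latent z by a small MLP (rather than being learned
    directly), followed by instance norm and ReLU."""

    def __init__(self, z_dim, in_ch, out_ch, k=5):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        n_kernel = in_ch * out_ch * k * k
        self.hyper = nn.Sequential(               # MLP mapping z -> filter + bias
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, n_kernel + out_ch))
        self.norm = nn.InstanceNorm2d(out_ch)

    def forward(self, h, z):
        # h: (n_glyphs, in_ch, H, W) feature maps derived from character embeddings
        # z: (z_dim,) latent style vector shared by all glyphs of one font
        params = self.hyper(z)
        n_kernel = self.in_ch * self.out_ch * self.k * self.k
        weight = params[:n_kernel].view(self.in_ch, self.out_ch, self.k, self.k)
        bias = params[n_kernel:]
        h = F.conv_transpose2d(h, weight, bias, stride=2, padding=2, output_padding=1)
        return F.relu(self.norm(h))

# Toy usage: 26 coarse 8x8 character maps upsampled to 16x16 under one font style.
# (The paper's decoder starts from 8x8x256 and halves channels per block; the
# channel counts here are reduced just to keep the example light.)
char_maps = torch.randn(26, 64, 8, 8)
z = torch.randn(32)
block = FontConditionedUpsample(z_dim=32, in_ch=64, out_ch=32)
print(block(char_maps, z).shape)    # -> torch.Size([26, 32, 16, 16])
```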
4.2 Projected Loss

The conventional approach for computing loss on image observations is to use an independent output distribution, typically a Gaussian, on each pixel's intensity. However, deliberate analysis of the statistics of natural images has shown that images are not well-described in terms of statistically independent pixels, but are instead better modeled in terms of edges (Field, 1987; Huang and Mumford, 1999). It has also been demonstrated that images of text have similar statistical distributions as natural images (Melmer et al., 2013). Following this insight, as our reconstruction loss we use a heavy-tailed (leptokurtotic) distribution placed on a transformed representation of the image, similar to the approach of Barron (2019). Modeling the statistics of font glyphs in this fashion results in sharper samples, while modeling independent pixels with a Gaussian distribution results in blurry, oversmoothed results.


Figure 3: Depiction of the computation graph of the amortized variational lower bound (for simplicity, only one font is shown). The encoder approximates the generative model's true posterior over the font style latent variables given the observations. It remains insensitive to the number of observations by pooling high-level features across glyphs. For a more detailed description of the encoder architecture, see Appendix A.

More specifically, we adopt one of the strategies employed in Barron (2019), and transform image observations using the orthonormal variant of the 2-Dimensional Discrete Cosine Transform (2-D DCT-II) (Ahmed et al., 1974), which we denote as f : R^(64×64) → R^(64×64) for our 64 × 64 dimensional image observations. We transform both the observed glyph image and the corresponding grid of parameters produced by our decoder before computing the observation's likelihood.

This procedure projects observed images onto a grid of orthogonal bases comprised of shifted and scaled 2-dimensional cosine functions. Because the DCT-II is orthonormal, this transformation is volume-preserving, and so likelihoods computed in the projected space correspond to valid measurements in the original pixel domain.

Note that because the DCT-II is simply a rotation in our vector space, imposing a normal distribution in this transformed space should have little effect (ignoring the scaling induced by the diagonal of the covariance matrix of the Gaussian distribution) as Euclidean distance is preserved under rotations. For this reason we impose a heavy-tailed distribution in this transformed space, specifically a Cauchy distribution. This gives the following probability density function:

    g(x; x̂, γ) = 1 / ( πγ [ 1 + ( (f(x) − f(x̂)) / γ )² ] )

where x is an observed glyph, x̂ is the location parameter grid output by our decoder, and γ is a hyperparameter which we set to γ = 0.001.

The Cauchy distribution accurately captures the heavy-tailed structure of the edges in natural images. Intuitively, it models the fact that images tend to be mostly smooth, with a small amount of non-smooth variation in the form of edges. Computing this heavy-tailed loss over the frequency decomposition provided by the DCT-II instead of the raw pixel values encourages the decoder to generate sharper images without needing either an adversarial discriminator or a vectorized representation of the characters during training. Note that while our training loss is computed in DCT-II space, at test time we treat the raw grid of parameter outputs x̂ as the glyph reconstruction.

5 Learning and Inference

Note that in our training setting, the font variables Z are completely unobserved, and we must induce their manifold with learning. As our model is generative, we wish to optimize the marginal probability of just the observed X with respect to the model parameters E and φ:

    p(X; E, φ) = ∫ p(X, Z; E, φ) dZ

However, the integral over Z is computationally intractable, given that the complex relationship between Z and X does not permit a closed form solution. Related latent variable models such as Variational Autoencoders (VAE) (Kingma and Welling, 2014) with intractable marginals have successfully performed learning by optimizing a variational lower bound on the log marginal likelihood. This surrogate objective, called the evidence lower bound (ELBO), introduces a variational approximation, q(Z|X) = ∏_j q(z_j | x_1j, ..., x_Ij), to the model's true posterior, p(Z|X). Our model's ELBO is as follows:

    ELBO = Σ_j E_q[ log p(x_1j, ..., x_Ij | z_j) ] − KL( q(z_j | x_1j, ..., x_Ij) || p(z_j) )

where the approximation q is parameterized via a neural encoder network.
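The following numpy sketch (our own, for a single 64 × 64 glyph) spells out the projected loss just defined: an orthonormal 2-D DCT-II applied to both the observed glyph and the decoder's parameter grid, followed by the Cauchy negative log-likelihood with the γ value given above. Function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=64):
    """Orthonormal DCT-II basis: rows are cosines, so dct2(X) = C @ X @ C.T."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0, :] /= np.sqrt(2.0)          # first row rescaled so that C @ C.T = I
    return C

def cauchy_projected_nll(x, x_hat, gamma=0.001):
    """-log g(x; x_hat, gamma) summed over the DCT-II coefficients."""
    C = dct_matrix(x.shape[0])
    resid = C @ x @ C.T - C @ x_hat @ C.T        # f(x) - f(x_hat)
    return np.sum(np.log(np.pi * gamma) + np.log1p((resid / gamma) ** 2))

rng = np.random.default_rng(0)
glyph = rng.random((64, 64))                     # observed glyph, intensities in [0, 1]
recon = np.clip(glyph + 0.01 * rng.standard_normal((64, 64)), 0, 1)
print(cauchy_projected_nll(glyph, recon))        # closer reconstructions score lower
```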
This lower bound can be optimized by stochastic gradient ascent if q is a Gaussian, via the reparameterization trick described in (Kingma and Welling, 2014; Rezende et al., 2014) to sample the expectation under q while still permitting backpropagation.

Practically speaking, a key property which we desire is the ability to perform consistent inference over z given a variable number of observed glyphs in a font. We address this in two ways: through the architecture of our encoder, and through a special masking process in training; both of which are shown in Figure 3.

5.1 Posterior Approximation

Observation Subsampling: To get reconstructions from only partially observed fonts at test time, the encoder network must be able to infer z_j from any subset of (x_1j, ..., x_Ij). One approach for achieving robustness to the number of observations is through the training procedure. Specifically when computing the approximate posterior for a particular font in our training corpus, we mask out a randomly selected subset of the characters before passing them to the encoder. This incentivizes the encoder to produce reasonable estimates without becoming too reliant on the features extracted from any one particular character, which more closely matches the setup at test time.

Encoder Architecture: Another way to encourage this robustness is through inductive bias in the encoder architecture. Specifically we use a convolutional neural network which takes in a batch of characters from a single font, concatenated with their respective character type embedding. Following the final convolutional layer, we perform an elementwise max operation across the batch, reducing to a single vector representation for the entire font which we pass through further fully-connected layers to obtain the output parameters of q as shown in Figure 3. By including this accumulation across the elements of the batch, we combine the features obtained from each character in a manner that is largely invariant to the total number and types of characters observed. This provides an inductive bias that encourages the model to extract similar features from each character type, which should therefore represent stylistic as opposed to structural properties.

Overall, the encoder over each glyph consists of three blocks, where each block consists of a convolution followed by a max pool with a stride of two, an instance norm (Ulyanov et al., 2016), and a ReLU. The activations are then pooled across the characters via an elementwise max into a single vector, which is then passed through four fully-connected layers, before predicting the parameters of the Gaussian approximate posterior.

Reconstruction via Inference: At test time, we pass an observed subset of a new font to our encoder in order to estimate the posterior over z_j, and take the mean of that distribution as the inferred font representation. We then pass this encoding to the decoder along with the full set of character embeddings E in order to produce reconstructions of every glyph in the font.

6 Experiments

We now provide an overview of the specifics of the dataset and training procedure, and describe our experimental setup and baselines.

6.1 Data

We compare our model against baseline systems at font reconstruction on the Capitals64 dataset (Azadi et al., 2018), which contains the 26 capital letters of the English alphabet as grayscale 64 × 64 pixel images across 10,682 fonts. These are broken down into training, dev, and test splits of 7649, 1473, and 1560 fonts respectively.

Upon manual inspection of the dataset, it is apparent that several fonts have an almost visually indistinguishable nearest neighbor, making the reconstruction task trivial using a naive algorithm (or an overfit model with high capacity) for those particular font archetypes. Because these datapoints are less informative with respect to a model's ability to generalize to previously unseen styles, we additionally evaluate on a second test set designed to avoid this redundancy. Specifically, we choose the 10% of test fonts that have maximal L2 distance from their closest equivalent in the training set, which we call "Test Hard".

                Test Full                           Test Hard
Observations    1        2        4        8       1        2        4        8
NN              483.13   424.49   386.81   363.97  880.22   814.67   761.29   735.18
GlyphNet        669.36   533.97   455.23   416.65  935.01   813.50   718.02   653.57
Ours (FC)       353.63   316.47   293.67   281.89  596.57   556.21   527.50   513.25
Ours (Conv)     352.07   300.46   271.03   254.92  615.87   556.03   511.05   489.58

Table 1: Per-glyph L2 reconstruction error by number of observed characters. "Full" includes the entire test set while "Hard" is measured only over the 10% of test fonts with the highest L2 distance from the closest font in train.

6.2 Baselines

As stated previously, many fonts fall into visually similar archetypes. Based on this property, we use a nearest neighbors algorithm for our first baseline. Given a partially observed font at test time, this approach simply "reconstructs" by searching the training set for the font with the lowest L2 distance over the observed characters, and copying its glyphs verbatim for the missing characters.

For our second comparison, we use the GlyphNet model from Azadi et al. (2018). This approach is based on a generative adversarial network, which uses discriminators to encourage the model to generate outputs that are difficult to distinguish from those in the training set. We test from the publicly available epoch 400 checkpoint, with modifications to the evaluation script to match the setup described above.

We also perform an ablation using fully-connected instead of convolutional layers. For more architecture details see Appendix A.

6.3 Training Details

We train our model to maximize the expected log likelihood using the Adam optimization algorithm (Kingma and Ba, 2015) with a step size of 10^-5 (default settings otherwise), and perform early stopping based on the approximate log likelihood on a hard subset of dev selected by the process described earlier. To encourage robustness in the encoder, we randomly drop out glyphs during training with a probability of 0.7 (rejecting samples where all characters in a font are dropped). All experiments are run with a dimensionality of 32 for the character embeddings and font latent variables. Our implementation² is built in PyTorch (Paszke et al., 2017) version 1.1.0.

² https://bitbucket.org/NikitaSrivatsan/DeepFactorizationFontsEMNLP19

7 Results

We now present quantitative results from our experiments in both automated and human annotated metrics, and offer qualitative analysis of reconstructions and the learned font manifold.

7.1 Quantitative Evaluation

Automatic Evaluation: We show font reconstruction results for our system against nearest neighbors and GlyphNet in Table 1. Each model is given a random subsample of glyphs from each test font (we measure at 1, 2, 4, and 8 observed characters), with their character labels. We measure the average L2 distance between the image reconstructions for the unobserved characters and the ground truth, after scaling intensities to [0, 1].

Our system achieves the best performance on both the overall and hard subsets of test for all numbers of observed glyphs. Nearest neighbors provides a strong baseline on the full test set, even outperforming GlyphNet. However it performs much worse on the hard subset. This makes sense as we expect nearest neighbors to do extremely well on any test fonts that have a close equivalent in train, but suffer in fidelity on less traditional styles. GlyphNet similarly performs worse on test hard, which could reflect the missing modes problem of GANs failing to capture the full diversity of the data distribution (Che et al., 2016; Tolstikhin et al., 2017). The fully-connected ablation is also competitive, although we see that the convolutional architecture is better able to infer style from larger numbers of observations. On the hard test set, the fully-connected network even outperforms the convolutional system when only one observation is present, perhaps indicating that its lower-capacity architecture better generalizes from very limited data.
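For concreteness, the snippet below (ours, not the released evaluation script) shows one reading of the metric just described: the reconstruction error averaged over the unobserved glyphs of a font, with intensities scaled to [0, 1]. The exact normalization of the "L2 distance" (squared norm versus norm) is an assumption on our part, and all names are hypothetical.

```python
import numpy as np

def per_glyph_l2(recon, truth):
    # Squared L2 distance between one reconstructed glyph and the ground truth,
    # both in [0, 1]. (Using the squared norm here is our assumption; the paper
    # simply reports an "L2 distance" per glyph.)
    return float(((recon - truth) ** 2).sum())

def evaluate_font(reconstructions, ground_truth, observed_idx):
    # Average the metric only over the characters that were *not* observed.
    held_out = [i for i in range(len(ground_truth)) if i not in set(observed_idx)]
    return np.mean([per_glyph_l2(reconstructions[i], ground_truth[i]) for i in held_out])

rng = np.random.default_rng(0)
truth = rng.random((26, 64, 64))             # stand-in glyphs for one test font
recon = np.clip(truth + 0.1 * rng.standard_normal(truth.shape), 0, 1)
print(evaluate_font(recon, truth, observed_idx=[0, 1]))
```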
Human Evaluation: To measure how consistent these perceptual differences are, we also perform a human evaluation of our model's reconstructions against GlyphNet using Amazon Mechanical Turk (AMT). In our setup, turkers were asked to compare the output of our model against the output of GlyphNet for a single font given one observed character, which they were also shown. Turkers selected a ranking based on which reconstructed font best matched the style of the observed character, and a separate ranking based on which was more realistic. On the full test set (1560 fonts, with 5 turkers assigned to each font), humans preferred our system over GlyphNet 81.3% and 81.8% of the time for style and realism respectively. We found that on average 76% of turkers shown a given pair of reconstructions selected the same ranking as each other on both criteria, suggesting high annotator agreement.

Figure 4: Reconstructions of partially observed fonts in the hard subset from our model, GlyphNet, and nearest neighbors. Given images of glyphs for 'A' and 'B' in each font, we visualize reconstructions of the remaining characters. Fonts are chosen such that the L2 loss of our model on these examples closely matches the average loss over the full evaluation set.

7.2 Qualitative Analysis

In order to fully understand the comparative behavior of these systems, we also qualitatively compare their reconstruction output to analyze their various failure modes, showing examples in Figure 4. We generally find that our approach tends to produce reconstructions that, while occasionally blurry at the edges, are generally faithful at reproducing the principal stylistic features of the font. For example, we see that for font (1) in Figure 4, we match not only the overall shape of the letters, but also the drop shadow and to an extent the texture within the lettering, while GlyphNet does not produce fully enclosed letters or match the texture. The output of nearest neighbors, while well-formed, does not respect the style of the font as closely, as it fails to find a font in training that matches these stylistic properties. In font (2) the systems all produce a form of gothic lettering, but the output of GlyphNet is again lacking in certain details, and nearest neighbors makes subtle but noticeable changes to the shape of the letters. In the final example (3) we even see that our system appears to attempt to replicate the pixelated outline, while nearest neighbors ignores this subtlety. GlyphNet is in this case somewhat inconsistent, doing reasonably well on some letters, but much worse on others. Overall, nearest neighbors will necessarily output well-formed glyphs, but with lower fidelity to the style, particularly on more unique fonts. While GlyphNet does pick up on some subtle features, our model tends to produce the most coherent output on harder fonts.

Figure 5: t-SNE projection of latent font variables inferred for the full training set, colored by k-means clustering with k = 10. The glyph for the letter "A" for each centroid is shown in overlay.

7.3 Analysis of Learned Manifold

Since our model attempts to learn a smooth manifold over the latent style, we can also perform interpolation between the inferred font representations, something which is not directly possible using either of the baselines.
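As a sketch of the interpolation analysis detailed in this subsection (our own illustration, with stand-in encoder and decoder functions rather than the trained model), latent style vectors for two related fonts are linearly mixed, and each mixture is decoded into a full set of glyphs:

```python
import numpy as np

def interpolate_styles(encode_mean, decode, glyphs_a, glyphs_b, steps=7):
    # Encode each font's observed glyphs to a latent style vector, then decode
    # fonts lying along the straight line between the two latents.
    z_a, z_b = encode_mean(glyphs_a), encode_mean(glyphs_b)
    alphas = np.linspace(0.0, 1.0, steps)
    return [decode((1 - a) * z_a + a * z_b) for a in alphas]

# Toy stand-ins so the sketch runs end to end (a real run would use the model):
rng = np.random.default_rng(0)
fake_encode = lambda glyphs: rng.standard_normal(32)      # hypothetical encoder mean
fake_decode = lambda z: np.zeros((26, 64, 64))            # hypothetical decoder output
frames = interpolate_styles(fake_encode, fake_decode, None, None)
print(len(frames), frames[0].shape)                        # -> 7 (26, 64, 64)
```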
In this analysis, we Figure 6: Interpolation be- tween font variants from the same font family, show- ing smoothness of the latent manifold. Linear combina- tions of the embedded fonts correspond to outputs that lie intuitively “in between”.


Figure 7: t-SNE projection of latent font variables inferred on the Google Fonts dataset, colored by weight and category.

In this analysis, we take two fonts from the same font family, which differ along one property, and pass them through our encoder to obtain the latent font variable for each. We then interpolate between these values, passing the result at various steps into our decoder to produce new fonts that exist in between the observations. In Figure 6 we see how our model can apply serifs, italicization, and boldness gradually while leaving the font unchanged in other respects. This demonstrates that our manifold is smooth and interpretable, not just at the points corresponding to those in our dataset. This could be leveraged to modify existing fonts with respect to a particular attribute to generate novel fonts efficiently.

Beyond looking at the quality of reconstructions, we also wish to analyze properties of the latent space learned by our model. To do this, we use our encoder to infer latent font variables z for each of the fonts in our training data, and use a t-SNE projection (Van Der Maaten, 2013) to plot them in 2D, shown in Figure 5. Since the broader, long-distance groupings may not be preserved by this transformation, we perform a k-means clustering (Lloyd, 1982) with k = 10 on the high-dimensional representations to visualize these high-level groupings. Additionally, we display a sample glyph for the centroid of each cluster. We see that while the clusters are not completely separated, the centroids generally correspond to common font styles, and are largely adjacent to or overlapping with those with similar stylistic properties; for example script, handwritten, and gothic styles are very close together.

To analyze how well our latent embeddings correspond to human defined notions of font style, we also run our system on the Google Fonts dataset,³ which despite containing fewer fonts and less diversity in style, lists metadata including numerical weight and category (e.g. serif, handwriting, monospace). In Figure 7 we show t-SNE projections of latent font variables from our model trained on Google Fonts, colored accordingly. We see that the embeddings do generally cluster by weight as well as category, suggesting that our model is learning latent information consistent with how humans perceive font style.

³ https://github.com/google/fonts

8 Conclusion

We presented a latent variable model of glyphs which learns disentangled representations of the structural properties of underlying characters from stylistic features of the font. We evaluated our model on the task of font reconstruction and showed that it outperformed both a strong nearest neighbors baseline and prior work based on GANs, especially for fonts highly dissimilar to any instance in the training set. In future work, it may be worth extending this model to learn a latent manifold on content as well as style, which could allow for reconstruction of previously unseen character types, or generalization to other domains where the notion of content is higher dimensional.

Acknowledgments

This project is funded in part by the NSF under grant 1618044 and by the NEH under grant HAA-256044-17. Special thanks to Si Wu for help constructing the Google Fonts figures.

References

Nasir Ahmed, T Natarajan, and Kamisetty R Rao. 1974. Discrete cosine transform. IEEE Transactions on Computers.

Samaneh Azadi, Matthew Fisher, Vladimir G Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content GAN for few-shot font style transfer. CVPR.

Jonathan T. Barron. 2019. A general and adaptive robust loss function. CVPR.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. ACL.

Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. ACL.

Neill DF Campbell and Jan Kautz. 2014. Learning a manifold of fonts. ACM TOG.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

David J. Field. 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A.

William T Freeman and Joshua B Tenenbaum. 1997. Learning bilinear models for two-factor problems in vision. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 554–560. IEEE.

Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. CVPR.

Douglas R Hofstadter. 1983. Metamagical themas. Scientific American, 248(5):16–E18.

Douglas R Hofstadter. 1995. Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought. Basic Books.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Jinggang Huang and David Mumford. 1999. Statistics of natural images and models. CVPR.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. ICLR.

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory.

Tamara Melmer, Seyed Ali Amirshahi, Michael Koch, Joachim Denzler, and Christoph Redies. 2013. From regular text to artistic writing and artworks: Fourier statistics of images with low and high aesthetic appeal. Frontiers in Human Neuroscience.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Huy Quoc Phan, Hongbo Fu, and Antoni B Chan. 2015. Flexyfont: Learning transferring rules for flexible typeface synthesis. In Computer Graphics Forum, volume 34, pages 245–256. Wiley Online Library.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. NIPS.

Rapee Suveeranont and Takeo Igarashi. 2010. Example-based automatic font generation. In International Symposium on Smart Graphics, pages 127–138. Springer.

Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Tensor analyzers. In International Conference on Machine Learning, pages 163–171.

Joshua B Tenenbaum and William T Freeman. 2000. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283.

Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. 2017. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

M Alex O Vasilescu and Demetri Terzopoulos. 2002. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. IJCAI.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2018. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447–8455.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2020. A unified framework for generalizable style transfer: Style and content separation. IEEE Transactions on Image Processing, 29:4085–4098.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.

A Appendix

A.1 Architecture Notation

We now provide further details on the specific architectures used to parameterize our model and inference network. The following abbreviations are used to represent various components:

• Fi : fully-connected layer with i hidden units
• R : ReLU activation
• M : batch max pool
• S : 2 × 2 spatial max pool
• Ci : convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of 1, dilation of 1
• I : instance normalization
• Ti,j,k : transpose convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of j, k pixel output padding, dilation of 1, where kernel and bias are the output of an MLP (described below)
• H : reshape to 26 × 256 × 8 × 8

A.2 Network Architecture

Our fully-connected encoder is:
F128-R-F128-R-F128-R-F1024-R-M-F128-R-F128-R-F128-R-F64

Our convolutional encoder is:
C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64

Our fully-connected decoder is:
F128-R-F128-R-F128-R-F128-R-F128-R-F4096

Our transpose convolutional decoder is:
F128-R-F16384-R-H-T256,2,1-R-C256-I-R-C256-I-R-T128,2,1-R-C128-I-R-C128-I-R-T64,2,1-I-R-C64-I-R-C64-I-R-T32,1,0-I-R-C32-I-R-C1

The MLP used to compute a transpose convolutional parameter of size j is:
F128-R-Fj
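For readers who prefer code to the shorthand above, the following PyTorch sketch (ours, not the released implementation) is one way to read the convolutional encoder string C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64: three conv/pool/instance-norm/ReLU blocks per glyph, a batch max pool (M) across the observed glyphs of a font, and fully-connected layers down to 64 outputs. Treating those 64 outputs as the mean and log-variance of the 32-dimensional font posterior, and using a single input channel rather than glyphs concatenated with character embeddings, are our assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Ci-S-I-R: 5x5 convolution, 2x2 spatial max pool, instance norm, ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.MaxPool2d(2),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(),
    )

class ConvEncoder(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(in_ch, 64), conv_block(64, 128), conv_block(128, 256))
        self.f1024 = nn.Linear(256 * 8 * 8, 1024)     # F1024 on the flattened 8x8x256 map
        self.head = nn.Sequential(                     # R-F128-R-F128-R-F128-R-F64
            nn.ReLU(), nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 128),
            nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

    def forward(self, glyphs):
        # glyphs: (n_observed, in_ch, 64, 64) images from a single font
        h = self.blocks(glyphs).flatten(1)             # -> (n_observed, 16384)
        h = self.f1024(h)
        h, _ = h.max(dim=0)                            # M: batch max pool across glyphs
        out = self.head(h)
        return out[:32], out[32:]                      # assumed split: mean, log-variance

enc = ConvEncoder()
mu, logvar = enc(torch.randn(8, 1, 64, 64))            # a font with 8 observed glyphs
print(mu.shape, logvar.shape)                          # torch.Size([32]) torch.Size([32])
```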