A Learned Representation for Scalable Vector Graphics

Raphael Gontijo Lopes,∗ David Ha, Douglas Eck, Jonathon Shlens
Google Brain
{iraphael, hadavid, deck, shlens}@google.com

∗ Work done as a member of the Google AI Residency Program (g.co/airesidency).

arXiv:1904.02632v1 [cs.CV] 4 Apr 2019

Abstract

Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead from identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.

Figure 1: Learning fonts in a native command space. Unlike pixels, scalable vector graphics (SVG) [11] are scale-invariant representations whose parameterizations may be systematically adjusted to convey different styles. All vector images are samples from a generative model of the SVG specification. [Figure panels: a learned vector graphics representation — moveTo (15, 25), lineTo (-2, 0.3), cubicBezier (-7.4, 0.2) (-14.5, 11.7), (-12.1, 23.4), ... — contrasted with its pixel counterpart, and samples conveying different styles.]

1. Introduction

The last few years have witnessed dramatic advances in generative models of images that produce near photographic quality imagery of human faces, animals, and natural objects [4, 25, 27]. These models provide an exhaustive characterization of natural image statistics [51] and represent a significant advance in this domain. However, these advances in image synthesis ignore an important facet of how humans interpret raw visual information [47], namely that humans seem to exploit structured representations of visual concepts [33, 21]. Structured representations may be readily employed to aid generalization and efficient learning by identifying higher-level primitives for conveying visual information [32], or to provide building blocks for creative exploration [21, 20]. This may be best seen in human drawing, where techniques such as gesture drawing [44] emphasize parsimony for capturing higher-level semantics and actions with minimal graphical content [53].

Our goal is to train a drawing model by presenting it with a large set of example images [16, 13]. To succeed, the model needs both to learn the underlying structure in those images and to generate drawings based on the learned representation [2]. In computer vision this is referred to as an “inverse graphics” problem [38, 31, 22, 41]. In our case the output representation is not pixels but rather a sequence of discrete instructions for rendering a drawing on a graphics engine. This poses dual challenges in terms of learning discrete representations for latent variables [56, 23, 37] and performing optimization through a non-differentiable graphics engine (but see [36, 31]). Previous approaches focused on program synthesis [32, 10] or employed reinforcement and adversarial learning [13]. We instead focus on a subset of this domain where we think we can make progress, which improves the generality of the approach.

Font generation represents a 30-year-old problem posited as a constrained but diverse domain for understanding higher-level perception and creativity [21]. Early research attempted to heuristically systematize the creation of fonts for expressing the identity of characters (e.g. a, 2) as well as stylistic elements constituting the “spirit” of a font [20]. Although such work provides great inspiration, the results were limited by a reliance on heuristics and the lack of a learned, structured representation [46]. Subsequent work on learning representations for fonts focused on models with simple parameterizations [34], template matching [54], example-based hints [62], or, more recently, learning manifolds for detailed geometric annotations [5].

We instead focus the problem on generating fonts specified with Scalable Vector Graphics (SVG) – a common file format for fonts, human drawings, designs and illustrations [11]. SVGs are a compact, scale-invariant representation that may be rendered in most web browsers. SVGs specify an illustration as a sequence of higher-level commands paired with numerical arguments (Figure 1, top). We take inspiration from the literature on generative models of images in rasterized pixel space [15, 55]. Such models provide powerful auto-regressive formulations for discrete, sequential data [15, 55] and may be applied to rasterized renderings of drawings [16]. We extend these approaches to the generation of sequences of SVG commands for the inference of individual font characters. The goal of this work is to build a tool that learns a representation for font characters and style which may be extended to other artistic domains [7, 49, 16], or exploited as an intelligent assistant for font creation [6].

In this work, our main contributions are as follows:

• Build a generative model for scalable vector graphics (SVG) images and apply it to a large-scale dataset of 14 M font characters.

• Demonstrate that the generative model provides a perceptually smooth latent representation for font styles that captures a large amount of diversity and is consistent across individual characters.

• Exploit the latent representation from the model to infer complete SVG fontsets from a single (or multiple) characters of a font.

• Identify semantically meaningful directions in the latent representation to globally manipulate font style.

2. Related Work

2.1. Generative models of images

Generative models of images have generally followed two distinct directions. Generative adversarial networks [14] have demonstrated impressive advances [45, 14] over the last few years, resulting in models that generate high-resolution imagery nearly indistinguishable from real photographs [25, 4].

A second direction has pursued building probabilistic models largely focused on invertible representations [8, 27]. Such models are highly tractable and do not suffer from training instabilities largely attributable to saddle-point optimization [14]. Additionally, such models afford a true probabilistic model in which the quality of the model may be measured with well-characterized objectives such as log-likelihood.

2.2. Autoregressive generative models

One method for vastly improving the quality of generative models with unsupervised objectives is to break the problem of joint prediction into a conditional, sequential prediction task. Each step of the conditional prediction task may be expressed with a sequential model (e.g. [19]) trained in an autoregressive fashion. Such models are often trained with a teacher-forcing strategy, but more sophisticated methods may be employed [1].

Autoregressive models have demonstrated great success in speech synthesis [42] and in unsupervised learning tasks [56] across multiple domains. Variants of autoregressive models paired with more sophisticated density modeling [3] have been employed for sequentially generating handwriting [15].

2.3. Modeling higher level languages

The task of learning an algorithm from examples has been widely studied. Lines of work vary from directly modeling computation [24] to learning a hierarchical composition of given computational primitives [12]. Of particular relevance are efforts that learn to infer graphics programs from the visual features they render, often using constructs like variable bindings, loops, or simple conditionals [10].

The methods most comparable to this work yield impressive results on the unsupervised induction of programs usable by a given graphics engine [13]. As their setup is non-differentiable, they use the REINFORCE [59] algorithm to perform adversarial training [14]. This method achieves impressive results despite not relying on labelled paired data. However, it tends to draw over previously-generated drawings, especially later in the generation process. While this could be suitable for modelling the generation of a 32x32 rastered image, SVGs require a certain level of precision in order to scale with few perceptible issues.

2.4. Learning representations for fonts

Previous work has focused on enabling the propagation of style between classes by identifying the class and style from high-level features of a single character [46, 21, 20] or by [...]

Figure 2: Model architecture. Visual similarity between SVGs is learned by a class-conditioned, convolutional variational [...] [Figure panels: an image autoencoder (image encoder → “N” style vector → image decoder) feeding an SVG decoder with an MDN output and a temperature parameter.]

[...] lineTo, cubicBezier and EOS. SVG commands were normalized by starting from the top-most command and ordering in a clockwise direction. In preliminary experiments, we found it advantageous to specify command arguments using relative positioning information. See the Appendix for more details of the dataset collection and normalization.

The final dataset consists of a sequence of commands specified in tuples. Each item in the sequence consists of a discrete selection [...]
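The relative-positioning normalization described above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code: the `(name, points)` tuple layout and the `to_relative` helper are assumptions made for this example, and it simplifies real SVG semantics by advancing the reference point through every control point rather than only through each command's end point.

```python
def to_relative(commands):
    """Re-express absolute command arguments as offsets from the pen position.

    Each command is a (name, [(x, y), ...]) tuple with absolute coordinates;
    the result uses the same layout with relative coordinates.
    """
    relative = []
    pen = (0.0, 0.0)
    for name, points in commands:
        rel_points = []
        for x, y in points:
            rel_points.append((x - pen[0], y - pen[1]))
            pen = (x, y)  # simplification: every point advances the reference
        relative.append((name, rel_points))
    return relative

# Mirrors the command listing shown in Figure 1 (top).
absolute = [
    ("moveTo", [(15.0, 25.0)]),
    ("lineTo", [(13.0, 25.3)]),
]
print(to_relative(absolute))
```

With this convention the second command becomes `lineTo (-2, 0.3)`, matching the relative form of the commands shown in Figure 1; encoding offsets rather than absolute positions keeps argument magnitudes small and translation-invariant, which is what makes the representation easier for a sequential model to predict.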
