A Learned Representation for Scalable Vector Graphics
A Learned Representation for Scalable Vector Graphics

Raphael Gontijo Lopes,∗ David Ha, Douglas Eck, Jonathon Shlens
Google Brain
{iraphael, hadavid, deck, shlens}@google.com

Abstract

Dramatic advances in generative models have resulted in near photographic quality for artificially rendered faces, animals and other objects in the natural world. In spite of such advances, a higher level understanding of vision and imagery does not arise from exhaustively modeling an object, but instead identifying higher-level attributes that best summarize the aspects of an object. In this work we attempt to model the drawing process of fonts by building sequential generative models of vector graphics. This model has the benefit of providing a scale-invariant representation for imagery whose latent representation may be systematically manipulated and exploited to perform style propagation. We demonstrate these results on a large dataset of fonts and highlight how such a model captures the statistical dependencies and richness of this dataset. We envision that our model can find use as a tool for graphic designers to facilitate font design.

[Figure 1: panels "Learned Vector Graphics Representation" — showing the commands moveTo (15, 25); lineTo (-2, 0.3); cubicBezier (-7.4, 0.2) (-14.5, 11.7), (-12.1, 23.4); ... — alongside "Pixel Counterpart" and "Conveying Different Styles".] Figure 1: Learning fonts in a native command space. Unlike pixels, scalable vector graphics (SVG) [11] are scale-invariant representations whose parameterizations may be systematically adjusted to convey different styles. All vector images are samples from a generative model of the SVG specification.

1. Introduction

The last few years have witnessed dramatic advances in generative models of images that produce near photographic quality imagery of human faces, animals, and natural objects [4, 25, 27]. These models provide an exhaustive characterization of natural image statistics [52] and represent a significant advance in this domain.
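The drawing commands in Figure 1 map directly onto standard SVG path operators. As an illustrative sketch only (not the paper's model or dataset), the following serializes a sequence of (command, arguments) pairs — the kind of representation the model generates — into an SVG path string; the helper name `to_svg_path` and the coordinates are hypothetical:

```python
# Sketch: turning a command sequence like Figure 1's into SVG path syntax.
# M, l, and c are the standard SVG moveto / relative lineto / relative
# cubic-Bezier operators; the command names and coordinates are illustrative.

def to_svg_path(commands):
    """Serialize (command, list-of-(x, y)) pairs into an SVG path string."""
    ops = {"moveTo": "M", "lineTo": "l", "cubicBezier": "c"}
    parts = []
    for name, args in commands:
        coords = " ".join(f"{x},{y}" for x, y in args)
        parts.append(f"{ops[name]} {coords}")
    return " ".join(parts)

# The command sequence shown in Figure 1 (top).
glyph = [
    ("moveTo", [(15, 25)]),
    ("lineTo", [(-2, 0.3)]),
    ("cubicBezier", [(-7.4, 0.2), (-14.5, 11.7), (-12.1, 23.4)]),
]

path = to_svg_path(glyph)
svg = f'<svg xmlns="http://www.w3.org/2000/svg"><path d="{path}"/></svg>'
```

Because the drawing lives in command space rather than pixel space, the same `d` string renders sharply at any scale — the scale-invariance the caption refers to.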
∗Work done as a member of the Google AI Residency Program (g.co/airesidency)

However, these advances in image synthesis ignore an important facet of how humans interpret raw visual information [48], namely that humans seem to exploit structured representations of visual concepts [33, 21]. Structured representations may be readily employed to aid generalization and efficient learning by identifying higher level primitives for conveying visual information [32] or provide building blocks for creative exploration [21, 20]. This may be best seen in human drawing, where techniques such as gesture drawing [44] emphasize parsimony for capturing higher level semantics and actions with minimal graphical content [54].

Our goal is to train a drawing model by presenting it with a large set of example images [16, 13]. To succeed, the model needs to learn both the underlying structure in those images and to generate drawings based on the learned representation [2]. In computer vision this is referred to as an “inverse graphics” problem [38, 31, 22, 41]. In our case the output representation is not pixels but rather a sequence of discrete instructions for rendering a drawing on a graphics engine. This poses dual challenges in terms of learning discrete representations for latent variables [57, 23, 37] and performing optimization through a non-differentiable graphics engine (but see [36, 31]). Previous approaches focused on program synthesis [32, 10] or employed reinforcement and adversarial learning [13]. We instead focus on a subset of this domain where we think we can make progress and improve the generality of the approach.

Font generation represents a 30 year old problem posited as a constrained but diverse domain for understanding higher level perception and creativity [21]. Early research attempted to heuristically systematize the creation of fonts for expressing the identity of characters (e.g. a, 2) as well as stylistic elements constituting the “spirit” of a font [20]. Although such work provides great inspiration, the results were limited by a reliance on heuristics and a lack of a learned, structured representation [47]. Subsequent work on learning representations for fonts focused on models with simple parameterizations [34], template matching [55], example-based hints [63], or more recently, learning manifolds for detailed geometric annotations [5].

We instead focus the problem on generating fonts specified with Scalable Vector Graphics (SVG) – a common file format for fonts, human drawings, designs and illustrations [11]. SVGs are a compact, scale-invariant representation that may be rendered in most web browsers. SVGs specify an illustration as a sequence of higher-level commands paired with numerical arguments (Figure 1, top). We take inspiration from the literature on generative models of images in rasterized pixel space [15, 56]. Such models provide powerful autoregressive formulations for discrete, sequential data [15, 56] and may be applied to rasterized renderings of drawings [16]. We extend these approaches to the generation of sequences of SVG commands for the inference of individual font characters.

The goal of this work is to build a tool to learn a representation for font characters and style that may be extended to other artistic domains [7, 50, 16], or exploited as an intelligent assistant for font creation [6]. We aspire for our methods to apply generically, but we focus on font generation as our main inspiration in hopes that it opens opportunities to work on more complex illustrations [7]. To this end, our main contributions are as follows:

• Build a generative model for scalable vector graphics (SVG) images and apply it to a large-scale dataset of 14 M font characters.

• Demonstrate that the generative model provides a perceptually smooth latent representation for font styles that captures a large amount of diversity and is consistent across individual characters.

• Exploit the latent representation from the model to infer complete SVG fontsets from a single (or multiple) characters of a font.

• Identify semantically meaningful directions in the latent representation to globally manipulate font style.

2. Related Work

2.1. Generative models of images

Generative models of images have generally followed two distinct directions. Generative adversarial networks [14] have demonstrated impressive advances [46, 14] over the last few years, resulting in models that generate high resolution imagery nearly indistinguishable from real photographs [25, 4].

A second direction has pursued building probabilistic models largely focused on invertible representations [8, 27]. Such models are highly tractable and do not suffer from training instabilities largely attributable to saddle-point optimization [14]. Additionally, such models afford a true probabilistic model in which the quality of the model may be measured with well-characterized objectives such as log-likelihood.

2.2. Autoregressive generative models

One method for vastly improving the quality of generative models with unsupervised objectives is to break the problem of joint prediction into a conditional, sequential prediction task. Each step of the conditional prediction task may be expressed with a sequential model (e.g. [19]) trained in an autoregressive fashion. Such models are often trained with a teacher-forcing training strategy, but more sophisticated methods may be employed [1].

Autoregressive models have demonstrated great success in speech synthesis [42] and unsupervised learning tasks [57] across multiple domains. Variants of autoregressive models paired with more sophisticated density modeling [3] have been employed for sequentially generating handwriting [15].

2.3. Modeling higher level languages

The task of learning an algorithm from examples has been widely studied. Lines of work vary from directly modeling computation [24] to learning a hierarchical composition of given computational primitives [12]. Of particular relevance are efforts that learn to infer graphics programs from the visual features they render, often using constructs like variable bindings, loops, or simple conditionals [10].

The most comparable methods to this work yield impressive results on unsupervised induction of programs usable by a given graphics engine [13]. As their setup is non-differentiable, they use the REINFORCE [60] algorithm to perform adversarial training [14]. This method achieves impressive results despite not relying on labelled paired data. However, it tends to draw over previously-generated drawings, especially later in the generation process. While this could be suitable for modelling the generation of a 32x32 rastered image, SVGs require a certain level of precision in order to scale with few perceptible issues.

Modelling languages for image generation [13, 16], which our work tackles, can be construed as a probabilistic programming problem. What makes our problem unique, however, is that (a) models must be perceptually judged [13], and (b) by working in the SVG format, we open the opportunity to exploit a de-facto standard format for icons and graphics.

2.4. Learning representations for fonts

Previous work has focused on enabling propagation of style between classes by identifying the class and style from

[Figure 2: architecture diagram with panels "Image Autoencoder" (image encoder → vector image "N" → image decoder) and "SVG Decoder" (style → SVG decoder with MDN and temperature).] Figure 2: Model architecture. Visual similarity between SVGs is learned by a class-conditioned, convolutional variational