Learning a Manifold of Fonts
Supplemental Material

Neill D. F. Campbell¹   Jan Kautz¹,²
¹University College London   ²NVIDIA Research

A  Fonts Used

Andale Mono | Andalus | Angsana New
Aparajita | Arial | Browallia New
Calibri | Cambria | Candara
Consolas | Constantia | Corbel
Cordia New | DaunPenh | David
Ebrima | Euphemia | Franklin Gothic Book
Gautami | Georgia | Gill Sans MT
Kalinga | Lao Sangam MN | Lucida Sans Unicode
MS PMincho | Microsoft Himalaya | Microsoft Sans Serif
Microsoft Tai Le | Miriam | Mongolian Baiti
MoolBoran | Narkisim | Nyala
PMingLiU | Palatino Linotype | Perpetua
Plantagenet Cherokee | Shonar Bangla | SimSun-ExtB
Sylfaen | Tahoma | Times New Roman
Traditional Arabic | Trebuchet MS | Vani
Verdana

Table 1: The fonts used in our experiments.

We took the 'ttf' files from the fonts folder of a 'Windows 7' installation and removed any non-Latin or symbol-based fonts. We also removed a few fonts that were topologically inconsistent over many letters, for example handwriting-based fonts; if only a few letters were inconsistent, e.g. 'g', we omitted the inconsistent character and embedded the font with missing data so that the manifold fills in the unmatched character [Navaratnam et al. 2007]. Our training set includes examples of many different categories of typeface (e.g. Humanist, Garalde, Didone/Modern, Slab, Lineale/Grotesque, and Glyphic) for a total of 46 fonts, listed in Table 1.

B  Details on the GP-LVM

Here we provide a more in-depth introduction to Gaussian Processes, illustrating their use for regression. We then extend this technique to present the Gaussian Process Latent Variable Model (GP-LVM) that we use to construct probabilistic manifolds. Finally, we provide the details of how to learn the true dimensionality of a manifold from the data.

B.1  Gaussian Processes

Gaussian Processes (GPs) are a non-parametric model consisting of distributions over functions. They can be used as a powerful way to model regression, that is, to produce a smooth mapping function from an input space to an output space (both of which may be multi-dimensional vector spaces); essentially they act as a prior that encourages functions to be smooth. A full treatment is beyond the scope of this work and instead we provide a basic illustrative example; there are many excellent texts on GPs, including the definitive book by Rasmussen and Williams [2006]. We begin by considering the use of GPs for the problem of regression and then show how we may treat the problem of finding a manifold (or probabilistic dimensionality reduction) as a new take on the regression task, forming the GP-LVM.

Notation   For the illustration of the regression task we will use the following notation. Suppose that we have a set of training data pairs {x_j, y_j} that associate an input vector x_j ∈ ℝ^Q with its corresponding output vector y_j ∈ ℝ^D, with respective dimensions of Q and D. If we need to consider multiple sets of these vectors we stack them together row-wise to produce a matrix, for example X = [··· x_j ···]^T and Y = [··· y_j ···]^T.

GP Regression   In the task of regression, we wish to identify a function f(·) that maps the input to the output space, so that we can take a new input x* and find the corresponding output as y* = f(x*). Now, a GP has the property that any discrete set of samples taken from it will be normally distributed. Thus, if the GP is evaluated at a series of values X = [x_1 ··· x_J]^T from our input space, the corresponding output values Y = [y_1 ··· y_J]^T will be distributed as

P(Y \mid X) = \mathcal{N}\big( Y \mid M(X),\, C(X, X) \big)    (B-1)

where

\mathcal{N}(x \mid m, \Sigma) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (x - m)^T \Sigma^{-1} (x - m) \right)    (B-2)

is the standard multivariate Gaussian distribution with M(·) as a mean function and C(·,·) as a covariance function. The mean function generates a vector from the input X and the covariance function creates a (positive semi-definite) covariance matrix; these each have the same number of dimensions as there were examples given (in this case J). Thus the GP handles an infinite object (a function defined over a continuous space) by considering only a finite set of input values at a time (J values in our example) and returning the corresponding finite set of output values.
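To make (B-1) and (B-2) concrete, the following sketch draws sample functions from a zero-mean GP evaluated at a finite set of inputs. It is illustrative only: it assumes NumPy, uses the RBF covariance that is only introduced later in (B-7) with arbitrary hyperparameter values, and the helper name rbf_covariance and the small jitter term are ours rather than part of the paper.

```python
import numpy as np

def rbf_covariance(X1, X2, alpha=1.0, gamma=10.0):
    # RBF covariance of (B-7), evaluated between every pair of rows of X1 and X2.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * d2)

# Evaluate the GP at a finite set of J inputs (1-D inputs here for clarity).
X = np.linspace(0.0, 1.0, 50)[:, None]            # J x Q with J = 50, Q = 1
M = np.zeros(len(X))                              # zero mean function
C = rbf_covariance(X, X) + 1e-8 * np.eye(len(X))  # jitter for numerical stability

# Per (B-1), the function values at X are jointly Gaussian; draw three sample functions.
samples = np.random.default_rng(0).multivariate_normal(M, C, size=3)
```

Each row of `samples` is one plausible set of output values for the chosen inputs, which is exactly the finite view of an infinite function described above.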
Condition on Training Data   Now that we know the joint distribution between the input and output spaces of a GP, we can make it useful for regression by conditioning the GP on our training data. That is to say, we add the constraint that if any of the training inputs X_tr are presented to the GP, the corresponding training outputs Y_tr should be produced. Thus, if we consider a new test input x*, we can find y* = f(x*) as

\begin{bmatrix} Y_{tr} \\ y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} M(X_{tr}) \\ M(x^*) \end{bmatrix},\; \begin{bmatrix} C(X_{tr}, X_{tr}) & C(X_{tr}, x^*) \\ C(x^*, X_{tr}) & C(x^*, x^*) \end{bmatrix} \right).    (B-3)

We simplify the situation by subtracting the mean from our training input and output so that the mean function becomes a zero-vector. Standard manipulation can convert this to an output function as

f(x^*) \sim \mathcal{N}\left( C(x^*, X_{tr})\, C(X_{tr}, X_{tr})^{-1}\, Y_{tr},\; \Sigma^* \right)    (B-4)

with the output covariance ('error bars') as

\Sigma^* = C(x^*, x^*) - C(x^*, X_{tr})\, C(X_{tr}, X_{tr})^{-1}\, C(X_{tr}, x^*).    (B-5)

Therefore, the conditioned GP provides the regression function to predict output vectors for new input vectors, with the predicted value as the mean of (B-4) and the covariance Σ* providing a measure of confidence in the prediction; a low standard deviation would indicate a very precise estimate. Please see [Rasmussen and Williams 2006] for further details.

The Covariance Function   The final part of the GP regression model is to specify the covariance function C(·,·). This function must accept one or more samples from the input space as each argument and return a covariance value for each pairwise combination of samples. The function can depend on a small number of 'hyperparameters' that we shall denote by θ. The elements of a particular covariance matrix given by a set of inputs are indexed using the notation for representing a set of vectors described above. Thus

\left[ C\big( [\cdots x_i \cdots]^T, [\cdots x_j \cdots]^T \mid \theta \big) \right]_{i,j} = c(x_i, x_j \mid \theta)    (B-6)

where [·]_{i,j} denotes the element at the i-th row and j-th column of the matrix and c(x_i, x_j | θ) denotes the covariance between two input vectors.

Perhaps the most typical covariance function is the radial basis function (RBF)

c(x_i, x_j \mid \theta) = \alpha \exp\left( -\frac{\gamma}{2} \| x_i - x_j \|^2 \right)    (B-7)

where α and γ are the hyperparameters and therefore θ = [α, γ]. We can see that this covariance function will produce a smooth mapping from the input space to the output space, since nearby input vectors will have a high covariance and therefore the output vectors will be correlated. Similarly, if the input vectors are far apart (relative to the length-scale parameter γ) then there will be a low covariance and the corresponding output vectors will be independent of one another.

The training of a GP consists of learning the values of these hyperparameters by maximizing the log-likelihood (B-1) of the training data (X_tr, Y_tr) to give an optimal setting θ*. Since the covariance function can be a powerful, non-linear function, as in the case of the RBF, this training cannot be performed analytically and is instead performed using gradient based methods such as Conjugate Gradients.
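As a worked illustration of the conditioned GP in (B-4) and (B-5), the sketch below computes the predictive mean and 'error bars' on toy one-dimensional data. It assumes NumPy; the toy data, the hyperparameter values, the jitter term, and the rbf_covariance helper (repeated from the previous sketch) are our own choices rather than anything specified in the paper.

```python
import numpy as np

def rbf_covariance(X1, X2, alpha=1.0, gamma=10.0):
    # RBF covariance of (B-7), evaluated between every pair of rows of X1 and X2.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * d2)

# Toy, mean-subtracted training data (20 scalar inputs and outputs).
rng = np.random.default_rng(1)
X_tr = rng.uniform(0.0, 1.0, size=(20, 1))
Y_tr = np.sin(4.0 * X_tr)

X_star = np.linspace(0.0, 1.0, 100)[:, None]                   # test inputs x*

K_tr = rbf_covariance(X_tr, X_tr) + 1e-6 * np.eye(len(X_tr))   # C(X_tr, X_tr) + jitter
K_s  = rbf_covariance(X_star, X_tr)                            # C(x*, X_tr)
K_ss = rbf_covariance(X_star, X_star)                          # C(x*, x*)

mean = K_s @ np.linalg.solve(K_tr, Y_tr)                       # predictive mean, (B-4)
cov  = K_ss - K_s @ np.linalg.solve(K_tr, K_s.T)               # predictive covariance, (B-5)
std  = np.sqrt(np.diag(cov))                                   # per-point 'error bars'
```

Note that np.linalg.solve is used in place of an explicit matrix inverse, which is the usual numerically stable way to evaluate the C(X_tr, X_tr)^{-1} terms.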
B.2  The GP-LVM

We can think of the task of dimensionality reduction for a set of high dimensional vectors as identifying a set of low dimensional vectors that recreate the corresponding high dimensional vectors under a basic mapping. More powerfully, we could then go further and learn a smooth mapping from the low dimensional (latent) space to the high dimensional space, so that the latent space acts as a manifold from which new high dimensional vectors can be generated. In doing so we require a trade-off between producing a perfect reconstruction of the data and a smooth and useful manifold. With this in mind, a noise model is usually added to the covariance matrix during training to account for any slight noise in the high dimensional vectors. We therefore replace the basic covariance function of (B-7) with

c'(x_i, x_j \mid \theta) = \alpha \exp\left( -\frac{\gamma}{2} \| x_i - x_j \|^2 \right) + \sigma^2    (B-8)

where σ is the standard deviation of the noise. The noise term is only present during training and is not used afterwards when generating high dimensional vectors using (B-4).

In our experiments, the noise variation was found to be very small, suggesting that the font outlines do indeed lie on a low dimensional manifold.

Training   The training process for the GP-LVM considers the likelihood of the high dimensional data Y as

P(Y \mid X, \theta) = \prod_{m=1}^{M} \mathcal{N}\big( y_m \mid 0,\; C(X, X \mid \theta) + \sigma^2 I \big)    (B-9)

where there are M training examples (one for each font) and I is the identity matrix. Whereas for the GP model we simply maximized the log-likelihood over the hyperparameters θ, for GP-LVM training we must maximize jointly over the latent vectors X = [··· x_j ···]^T as well as θ, such that

X^*, \theta^* = \underset{X, \theta}{\operatorname{argmax}} \; \log P(Y \mid X, \theta).    (B-10)

Again, the covariance function is a complex, non-linear function, both in terms of its relationship to X and to θ; however, gradients may be found with respect to both, so training may be performed using Conjugate Gradients [Lawrence 2005]. Such methods require initial values for both X and θ.
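As a sketch of the joint optimization in (B-9) and (B-10), the code below minimizes the negative log-likelihood over the latent positions X and the hyperparameters θ = (α, γ, σ) with a conjugate-gradient optimizer. It is an assumption-laden illustration rather than the paper's implementation: it uses NumPy/SciPy, random toy data, a small random initialization for X, finite-difference gradients (the real training uses analytic gradients), and it evaluates the Gaussian in (B-9) over the D output dimensions of Y, which is the standard way the GP-LVM marginal likelihood is computed.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_covariance(X1, X2, alpha, gamma):
    # RBF covariance of (B-7) between every pair of rows of X1 and X2.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * d2)

def negative_log_likelihood(params, Y, M, Q):
    # Unpack the latent coordinates X and the log-hyperparameters (kept positive via exp).
    X = params[:M * Q].reshape(M, Q)
    log_alpha, log_gamma, log_sigma = params[M * Q:]
    K = rbf_covariance(X, X, np.exp(log_alpha), np.exp(log_gamma))
    K += np.exp(2.0 * log_sigma) * np.eye(M)           # C(X, X | theta) + sigma^2 I, cf. (B-9)
    _, logdet = np.linalg.slogdet(K)
    D = Y.shape[1]
    quad = np.trace(np.linalg.solve(K, Y @ Y.T))       # summed Gaussian quadratic terms
    return 0.5 * (D * logdet + quad + M * D * np.log(2.0 * np.pi))

# Toy high dimensional data: M examples of dimension D, with a Q-dimensional latent space.
rng = np.random.default_rng(2)
M, D, Q = 30, 10, 2
Y = rng.standard_normal((M, D))
Y -= Y.mean(axis=0)                                    # centre the data, as in Section B.1

x0 = np.concatenate([0.1 * rng.standard_normal(M * Q),  # random initial latent positions
                     np.zeros(3)])                      # log(alpha), log(gamma), log(sigma)
result = minimize(negative_log_likelihood, x0, args=(Y, M, Q), method="CG")
X_star = result.x[:M * Q].reshape(M, Q)                 # learned latent coordinates, cf. (B-10)
```

In practice one would supply analytic gradients and a more careful initialization for X and θ, but the objective being minimized here is the negative of the log-likelihood maximized in (B-10).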