FontCode: Embedding Information in Text Documents Using Glyph Perturbation

CHANG XIAO, CHENG ZHANG, and CHANGXI ZHENG, Columbia University

Fig. 1. Augmented poster. Among many applications enabled by our FontCode method, here we create a poster embedded with an unobtrusive optical barcode (left). The poster uses text fonts that look almost identical to the standard Times New Roman, and has no traditional black-and-white barcode pattern. But our smartphone application allows the user to take a snapshot (right) and decode the hidden message, in this case, a YouTube link (middle).

We introduce FontCode, an information embedding technique for text documents. Provided a text document with specific fonts, our method embeds user-specified information in the text by perturbing the glyphs of text characters while preserving the text content. We devise an algorithm to choose unobtrusive yet machine-recognizable glyph perturbations, leveraging a recently developed generative model that alters the glyphs of each character continuously on a font manifold. We then introduce an algorithm that embeds a user-provided message in the text document and produces an encoded document whose appearance is minimally perturbed from the original document. We also present a glyph recognition method that recovers the embedded information from an encoded document stored as a vector graphic or pixel image, or even on a printed paper. In addition, we introduce a new error-correction coding scheme that rectifies a certain number of recognition errors. Lastly, we demonstrate that our technique enables a wide array of applications, using it as a text document metadata holder, an unobtrusive optical barcode, a cryptographic message embedding scheme, and a text document signature.

CCS Concepts: • Computing methodologies → Image processing; • Applied computing → Text editing; Document metadata;

Additional Key Words and Phrases: Font manifold, glyph perturbation, error correction coding, text document signature

ACM Reference format:
Chang Xiao, Cheng Zhang, and Changxi Zheng. 2017. FontCode: Embedding Information in Text Documents using Glyph Perturbation. ACM Trans. Graph. 1, 1, Article 1 (December 2017), 16 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn

ACM acknowledges that this contribution was authored or co-authored by an employee, or contractor of the national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. Permission to make digital or hard copies for personal or classroom use is granted. Copies must bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. To copy otherwise, distribute, republish, or post, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2017 Association for Computing Machinery. 0730-0301/2017/12-ART1 $15.00

ACM Transactions on Graphics, Vol. 1, No. 1, Article 1. Publication date: December 2017.

1 INTRODUCTION

Information embedding, the technique of embedding a message into host data, has numerous applications: Digital photographs have metadata embedded to record such information as capture date, exposure time, focal length, and camera's GPS location. Watermarks embedded in images, videos, and audios have been one of the most important means in digital production to claim copyright against piracies [Bloom et al. 1999]. And indeed, the idea of embedding information in light signals has grown into an emerging field of visual light communication (e.g., see [Jo et al. 2016]).

In all these areas, information embedding techniques meet two desiderata: (i) the host medium is minimally perturbed, implying that the embedded message must be minimally intrusive; and (ii) the embedded message can be robustly recovered by the intended decoder even in the presence of some decoding errors.

What remains elusive is an information embedding technique for text documents, in both digital and physical form. While explored by many previous works on digital text steganography, information embedding for text documents is considered more challenging to meet the aforementioned desiderata than its counterparts for images, videos, and audios [Agarwal 2013]. This is because the "pixels" of a text document are individual letters, which, unlike an image pixel, cannot be changed into other letters without causing noticeable differences. Consequently, existing techniques have limited information capacity or work only for specific digital file formats (such as PDF or Microsoft Word).

We propose FontCode, a new information embedding technique for text documents. Instead of changing text letters into different ones, we alter the glyphs (i.e., the particular shape designs) of their fonts to encode information, leveraging the recently developed concept of font manifold [Campbell and Kautz 2014] in computer graphics. Thereby, the readability of the original document is fully retained. We carefully choose the glyph perturbation such that it has a minimal effect on the typeface appearance of the text document, while ensuring that glyph perturbation can be recognized through Convolutional Neural Networks (CNNs). To recover the embedded information, we develop a decoding algorithm that recovers the information from an input encoded document, whether it is represented as a vector graphics file (such as a PDF) or a rasterized pixel image (such as a photograph).

Exploiting the features specific to our message embedding and retrieval problems, we further devise an error-correction coding scheme that is able to fully recover embedded information up to a certain number of recognition errors, making a smartphone into a robust FontCode reader (see Fig. 1).

Applications. As a result, FontCode is not only an information embedding technique for text documents but also an unobtrusive tagging mechanism, finding a wide array of applications. We demonstrate four of them. (i) It serves as a metadata holder in a text document, which can be freely converted to different file formats or printed on paper without loss of the metadata: across various digital and physical forms, the metadata is always preserved. (ii) It enables embedding unobtrusive optical codes in a text, ones that can replace optical barcodes (such as QR codes) in artistic designs such as posters and flyers to minimize visual distraction caused by the barcodes. (iii) By construction, it offers a basic cryptographic scheme that not only embeds but also encrypts messages, without resorting to any additional cryptosystem. And (iv) it offers a new text signature mechanism, one that allows verifying document authentication and integrity, regardless of its digital format or physical form.

2 RELATED WORK

We begin by clarifying a few typographic terminologies [Campbell and Kautz 2014]: the typeface of a character refers to a set of fonts each composed of glyphs that represent the specific design and features of the character. With this terminology, our method embeds messages by perturbing the glyphs of the fonts of text letters.

Font manipulation. While our method perturbs glyphs using the generative model by Campbell and Kautz [2014], other methods create fonts and glyphs with either computer-aided tools [Ruggles 1983] or automatic generation. The early system by Knuth [1986] creates parametric fonts and was used to create most of the Computer Modern typeface family. Later, Shamir and Rappoport [1998] proposed a system that generates fonts using high-level parametric features and constraints to adjust glyphs. This idea was extended to parameterize glyph shape components [Hu and Hersch 2001]. Other approaches generate fonts by deriving from examples and templates [Lau 2009; Suveeranont and Igarashi 2010], similarity [Loviscach 2010] or crowdsourced attributes [O'Donovan et al. 2014]. Recently, Phan et al. [2015] utilize a machine learning method trained through a small set of glyphs in order to synthesize typefaces that have a consistent style.

Font recognition. Automatic font recognition from a photo or image has been studied [Avilés-Cruz et al. 2005; Jung et al. 1999; Ramanathan et al. 2009]. These methods identify fonts by extracting statistical and/or typographical features of the document. Recently, Chen et al. [2014] proposed a scalable solution leveraging supervised learning. Then, Wang et al. [2015] improved font recognition using Convolutional Neural Networks. Their algorithm can run without resorting to character segmentation and optical character recognition methods. In our work, we use existing algorithms to recognize text fonts of the input document, but further devise an algorithm to recognize glyph perturbation for recovering the embedded information. Unlike existing font recognition methods that identify fonts from a text of many letters, our algorithm aims to identify glyph perturbation for individual letters.

Text steganography. Our work is related to digital steganography (such as digital watermarks for copyright protection), which has been studied for decades, mostly focusing on videos, images, and audios; for example, we refer to [Cheddad et al. 2010] for a comprehensive overview of digital imaging steganography. However, digital text steganography is much more challenging [Agarwal 2013], and thus much less developed.
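
To make the core idea from the introduction concrete (each letter keeps its identity, and only the choice among several perturbed glyphs of that letter carries information), the following is a minimal Python sketch. It is not the authors' encoding scheme: the per-character codebook sizes in `CODEBOOK_SIZE` are hypothetical values chosen for illustration, the helper names `capacities`, `embed`, and `extract` are made up here, and the plain mixed-radix mapping has no error correction, which the paper handles with a separate coding scheme.

```python
# Hypothetical number of distinguishable perturbed glyphs per character.
# These sizes are illustrative assumptions, not values from the paper.
CODEBOOK_SIZE = {"a": 4, "e": 5, "o": 3, "t": 4}
DEFAULT_SIZE = 1  # characters without perturbed glyphs carry no information


def capacities(text):
    """Number of glyph choices available at each character of the text."""
    return [CODEBOOK_SIZE.get(ch, DEFAULT_SIZE) for ch in text]


def embed(text, message):
    """Map an integer message to one glyph index per character (mixed radix)."""
    caps = capacities(text)
    total = 1
    for c in caps:
        total *= c
    if not 0 <= message < total:
        raise ValueError(f"message must be in [0, {total})")
    indices = []
    for c in caps:
        # The least significant digit goes to the first character.
        message, digit = divmod(message, c)
        indices.append(digit)
    return indices


def extract(text, indices):
    """Recover the integer from the glyph indices recognized in the document."""
    message = 0
    for c, digit in zip(reversed(capacities(text)), reversed(indices)):
        message = message * c + digit
    return message


if __name__ == "__main__":
    host = "tea to"
    glyph_indices = embed(host, 777)
    print(glyph_indices, extract(host, glyph_indices))
```

In this sketch the text content never changes; a renderer would draw each character with the glyph selected by its index, and a recognizer would have to report the same indices back for `extract` to succeed. A single misrecognized index corrupts the recovered integer, which illustrates why a real system needs the error-correction coding mentioned above.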
