30
WebGL Models: End-to-End

Won Chun

30.1 Introduction

When we were making Body in 2010 (see Figure 30.1), WebGL was new technology. It wasn't entirely clear what we'd be able to accomplish; indeed, the uncertainty was one of the reasons for building it in the first place. Adopting a new technology like WebGL was a leap of faith. Even if the idea made perfect sense to us, so many things outside of our control needed to go right. We eventually open sourced Google Body,1 and unequivocally demonstrated that it is possible to build a rich, 3D application using WebGL. Google Body made the use of WebGL less an act of faith, and more an act of execution.

Figure 30.1. Google Body, a 3D anatomy browser for the web.

1. The content is now available at zygotebody.com.


When we started, the simple act of loading a 3D model was its own little leap of faith. The source 3D model data was incredibly detailed; we ended up shipping a reduced version with 1.4 million triangles and 740 thousand vertices for 1,800 anatomical entities. Our first prototype simply loaded everything using COLLADA, but that took almost a minute to load, even locally. That's a really long time to wait for a web page. COLLADA was inefficient in its use of bandwidth and computation, but it wasn't immediately apparent how we could do better. There is no DOM node or MIME-type for 3D models. WebGL added TypedArray support to JavaScript, but XMLHttpRequest wouldn't get it for months. Fortunately, we devised a way to load Google Body's 1.4 million triangles much more efficiently than we expected using a surprising combination of standard web technologies. I've shared WebGL Loader, http://code.google.com/p/webgl-loader/, as an open-source project so that other WebGL projects can benefit.

This chapter describes how WebGL Loader works. WebGL Loader can compress a 6.3 MB OBJ file to under 0.5 MB, better than a 13:1 compression ratio. GZIP alone shrinks the same file to about 1.8 MB and yields a model file that is slower to parse and render. Taken apart, the techniques and concepts are actually quite simple to understand. I also hope to give insights into the process of how we went about making those choices so that we can be equipped with more than just faith for the next time we are spurred to delve into the depths of technical uncertainty.

30.2 Life of a 3D Model

Loading 3D models was the apparent challenge, but it is always important to keep the big picture in mind. When I joined Google, as part of orientation, I attended internal lectures about how the key internal systems at Google work. They all had titles like “Life of a [blank].” For example, “Life of a Query” walks through, step by step, what happens when a user performs a search on google.com. This end-to-end perspective is the key to building computer systems because it motivates what we build, even when we focus on a single component. I spent two years optimizing a tiny corner of web search by a few percent, but I knew the effort was worthwhile because it was in the critical path for every web search by all of our users. If “Life of a Query” teaches us anything, it is that there are lots of queries.

It doesn't make sense to think about loading 3D models outside the context of all the other steps a 3D model takes through an application. Each step informs the design. For Google Body, we thought of the model as it related to four key stages. They are general enough that they should apply to most other WebGL projects as well:

• Pipeline. Where does the 3D model data originate?

• Serving. How does the 3D model data arrive in users' browsers?

• Encoding. What format should the 3D model data use?

• Rendering. How does the 3D model data get to the screen?

30.2.1 Stage 1: Pipeline

Unless the application is a demo where 3D models are mere stand-ins to showcase some new technology, it is important to understand their creation process, called the art pipeline. The art pipeline can include many creative steps, like concept drawing or 3D modeling, but the key technological steps to understand are the tools that get models from the care of the artists to the application.

Art creation tools like Maya, Blender, Photoshop, or GIMP have their own file formats, but applications usually don't use them directly. The flexibility that is a necessity during creation ends up as baggage after the art is created. Contrast a Photoshop PSD file with a plain old JPG image. The PSD file can contain lots of extra sophistication like layers or color transforms that are essential for lossless, nondestructive editing, but that will never get used by something that simply needs to present the final image.

A similar thing happens with 3D models. For example, 3D modeling software keeps track of positions, normals, texture coordinates, or other vertex attributes in separate lists. A vertex is represented as a tuple of indices, one for each vertex attribute. This is an excellent choice for creation and editing; if an artist changes an attribute like a position or texture coordinate, all the vertices that share that attribute are implicitly updated. GPUs don't support these multi-indexed vertex lists; they must first be converted to a single-indexed vertex list. WebGL additionally enforces a limit of 65,536 unique vertices in a single drawElements call for maximum portability. Larger meshes must be split into batches.

Somewhere along the line, the model needs to be converted from a multipurpose, creation-appropriate form to a streamlined, render-appropriate form. The best time and place to do this is in the art pipeline. From the perspective of the application, everything that happens in the art pipeline appears to happen magically ahead of time. Since more time can be spent in the art pipeline than during loading, it is a great opportunity to offload work from the stages ahead. This is where to optimize the mesh for rendering and compress it for serving.

30.2.2 Stage 2: Serving

We don't have much control over what happens between the server and client because it is all governed by Internet standards; all we can do is understand how things work and how to best use them. I learned about high-performance web serving during my first project at Google, optimizing infrastructure behind iGoogle.2 As it turns out, serving 3D models over HTTP is not that different from serving other kinds of static content, such as stylesheets or images. Almost everything in this section applies equally to images and to meshes since it is all about how to send bits to the client. Perhaps more accurately, this section is about how not to send bits to the client using that reliable system design workhorse: caching. The bits we never send are the fastest and cheapest of all.

2. iGoogle is the customizable homepage for Google: http://www.google.com/ig.

HTTP caching fundamentals. Before a browser makes an HTTP GET request to download the data for a URL, it first checks the cache to see if the data corresponding to that URL are already present. If so, the browser potentially has a cache hit. The problem with simply using these data right away is that they might be stale; the data in the cache might not reflect what is actually being served anymore. This is the trouble with caches; they seem like a cheap and simple way to improve performance, but then we have to deal with the extra complexity of validation.

There are several ways a browser validates its cache data. The basic approaches use a conditional GET. The browser remembers when it originally fetched the data, so it uses an If-Modified-Since header in a GET request, and the server responds with a new version only if necessary. Another approach is to use ETags, which can act as a content fingerprint so the server can detect when a URL has changed since last requested. These approaches work as we would expect, but both still require a round-trip communication with the server. These requests, while usually cheap, still consume connections, a limited resource on both the client and server, and can have unpredictable latencies depending on the state of the network.

It is better to completely avoid this parasitic latency with a simple tweak in the server. Instead of forcing the browser to ask, “Is this stale?” for each request, the server can proactively tell the browser how long the data will be fresh by using an Expires header. With an explicit expiration, the browser can use locally cached data immediately. A long Expires header (the longest supported is just under a year) improves caching performance, but makes updates less visible. My favorite approach to deal with this is to simply never update the data for a given URL by using unique URLs for each version of the data. One particularly simple way is to embed a content fingerprint in the URL—no need to maintain any extra state.

Fingerprinting works best if you control the references to your fingerprinted assets; otherwise, clients can end up with stale references. A hybrid approach is to have a small manifest file with a conventional URL and caching parameters referencing fingerprinted URLs. This way, we can have good caching performance for bulk data, amortizing validation, while maintaining freshness.
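To make the fingerprinting and Expires discussion concrete, here is a minimal sketch of a static server that applies these headers. It is written with Node.js purely for illustration; Google Body was not served this way, and the model.utf8 file name is hypothetical.

var fs = require('fs');
var http = require('http');
var crypto = require('crypto');

// Fingerprint the current version of the model so its URL changes
// whenever its content changes.
var model = fs.readFileSync('model.utf8');  // hypothetical model file
var fingerprint = crypto.createHash('md5').update(model).digest('hex');
var modelUrl = '/models/model-' + fingerprint + '.utf8';

http.createServer(function (req, res) {
  if (req.url === modelUrl) {
    // Fingerprinted content never changes, so every cache along the way
    // (browser, proxy, CDN) may keep it for as long as HTTP allows.
    res.setHeader('Cache-Control', 'public, max-age=31536000');
    res.end(model);
  } else if (req.url === '/manifest.json') {
    // The small manifest is revalidated often and points at the current
    // fingerprinted URL, so clients never see stale bulk data.
    res.setHeader('Cache-Control', 'public, max-age=300');
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify({ model: modelUrl }));
  } else {
    res.statusCode = 404;
    res.end();
  }
}).listen(8080);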

HTTP proxy caches. The most important cache is the browser cache, but in reality, there could be many between the user and the server. Companies use HTTP proxies as an intermediary between workstations and the Internet, and usually these include web caches to efficiently use available Internet bandwidth. ISPs, in the business of providing bandwidth, deploy many caching proxies to improve performance. Some web servers will have reverse proxies that are optimized for serving static content. It is an excellent idea to recruit these caches, especially because they are numerous and tend to be closer to users. Fortunately, all these caches understand the same HTTP conventions, so with some careful design, we can simultaneously optimize for all of them.

To enable proxy caching, the server should set the Cache-control: public header. Since proxy caches are shared, a few more criteria must be met before the data are actually effectively cached. Some proxy caches ignore URLs with a ?, so don't use query strings. Cookies don't work in shared caches, and they also make every HTTP request header larger. A good way to avoid performance problems with cookies is to use a separate, cookieless domain. Different domains need not be separate servers, but they could be, for example, if we used a content delivery network (CDN) service. To use multiple domains to serve content, we must also enable CORS, cross-origin resource sharing (http://enable-cors.org/), to add exceptions to the browser's same-origin security policy. CORS applies to resources fetched using XMLHttpRequest as well as HTML images destined to be WebGL texture data.

Compression. Caching is one way HTTP avoids sending bits, and compression is the other. Media files used on the web are already compressed. A raw true-color image uses 24 bits per pixel, but JPEG is typically able to compress that image using only 1 or 2 bits per pixel at high quality. Modern formats like WebP perform even better [Banarjee and Arora 11].

For text files, like HTML, CSS, or JavaScript, HTTP allows for on-the-fly compression using GZIP. Unlike JPEG, GZIP is an exact, lossless encoding tuned for text. At a high level, GZIP is similar to many other text-compression algorithms. It starts with a phrase matcher, LZ77, to eliminate repeated character sequences and follows that with Huffman coding [Gailly and Adler 93]. Huffman coding, which uses fewer bits to encode frequently occurring bytes, is the workhorse statistical coder; it is also used as the final step of JPEG and PNG compression.

It turns out that it is also possible to use Huffman encoding as the final step for compressing 3D models by piggybacking on top of HTTP's GZIP encoding. Instead of writing our own compressor, all we need is to encode 3D models in a GZIP-friendly way.

30.2.3 Stage 3: Loading

Even before JavaScript gets to see the model's data, the browser has already performed quite a bit of work downloading it. As efficient as JavaScript implementations have gotten recently, it is still a good idea to exploit as much native capability as possible. Fortunately, we can use GZIP to handle most of the compression and decompression work. The key to efficient loading is in finding a representation that is compact but easy to convert to WebGL vertex buffers from what we can receive and process using JavaScript.

XML or JSON? XMLHttpRequest is the principal mechanism for downloading bulk application data in JavaScript, and until recently, it didn't even really support binary data. Despite the name, we don't actually have to use XMLHttpRequest for XML data; we can grab the raw responseText string data and do pretty much what we want with it using JavaScript.

If not XML, could we use JSON? Working with JSON in JavaScript is simple because it is a tiny, data-only dialect of JavaScript. Every browser that supports WebGL has a fast and safe JSON.parse method. JSON may be a lighter and simpler alternative to XML, but it is still no panacea for 3D models. Sure, XML's tags are much larger than JSON's, but the tag sizes don't really matter if almost all of the bytes are consumed by giant float arrays. Besides, GZIP does a good job of compressing the redundancy in XML tags. We'll leave the XML versus JSON war alone.

Quantization. For models, the real problem isn't XML versus JSON at all. The key is avoiding the storage, transmission, and parsing of the vertex attributes and triangle indices as text. As a human-readable string, a 32-bit, 4-byte floating-point value requires up to seven decimal digits, not including the sign (another byte), decimal point (another byte), and possibly the scientific notation exponent (yet more bytes): -9.876543e+21, for example. Since floating-point numbers do not always have exact decimal representations, algorithms for parsing decimal values are exceedingly fussy, challenging, and slow. Quantizing the source floating-point format to a fixed-point format in the art pipeline would greatly help. Since 32-bit floating point is more than you need for most situations,3 we can save space and time by using only as much precision as necessary.

The fixed-point resolution depends on the attribute: 10 bits (1,024 values) per channel is generally considered adequate for normals, and it works well for texture coordinates as well. Google Body was very sensitive to positions, since our models included many small anatomical details. We settled on 14 bits (16,384 levels) per channel. For a six-foot human, this was nearly 1/250 inch resolution; for a two-meter human, this was better than 1/8 mm resolution. Google Body had very high quality requirements, and analyses later in this chapter use this fine level of quantization. Most models could manage with less, and compression rates will improve accordingly.

After quantizing, both the vertex attributes and triangle indices can be encoded within short, 16-bit integers. Small integers are faster on many JavaScript virtual machines; Chrome's V8 engine is particularly efficient with integers that fit within 31 bits [Wingo 11]. And, as it turns out, JavaScript has always had a fast, compact, and convenient way of representing arrays of 16-bit values: strings. With this key insight, XMLHttpRequest.responseText suddenly looks quite well suited for the job. JavaScript uses UTF-16 for its internal string representation, and we index a string by simply using the String.charCodeAt method. The art pipeline could encode all the attributes and indices as "characters" in a string, serve them up GZIP compressed, and then decode them into TypedArrays destined to become vertex and index buffer objects.

3. Sometimes, 32-bit floating point is not enough! See [Cozzi and Ring 11].
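As an illustration of the quantization step, here is a minimal sketch (the helper is mine, not WebGL Loader's API) that maps one floating-point attribute channel to 14-bit fixed point and returns the scale and offset needed to reconstruct it at load time or in the vertex shader:

function quantizeChannel(values, bits) {   // e.g., bits = 14 for positions
  var min = values[0], max = values[0];
  for (var i = 1; i < values.length; i++) {
    if (values[i] < min) min = values[i];
    if (values[i] > max) max = values[i];
  }
  var levels = (1 << bits) - 1;            // 16,383 steps for 14 bits
  var scale = (max - min) / levels || 1;   // avoid dividing by zero
  var quantized = new Uint16Array(values.length);
  for (var i = 0; i < values.length; i++) {
    quantized[i] = Math.round((values[i] - min) / scale);
  }
  // Reconstruct with: values[i] ~= quantized[i] * scale + min
  return { quantized: quantized, scale: scale, offset: min };
}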

Code Points           Byte 1      Byte 2      Byte 3
[0 ... 127]           0XXXXXXX
[128 ... 2,047]       110YYYXX    10XXXXXX
[2,048 ... 65,535]    1110YYYY    10YYYYXX    10XXXXXX

Table 30.1. Table of UTF-8 encodings for 16-bit Unicode code points. XXXXXXXX represents the bottom 8 bits of the code point, and YYYYYYYY represents the top 8 bits.

Unicode, UTF-16, and UTF-8. This works aside from one detail: surrogate pairs. As of Unicode 6.0, there are over one million code points4 defined; only the most common code points, the basic multilingual plane (BMP), can be encoded by a single 16-bit value. Code points outside the BMP are encoded using a pair of 16-bit values,5 both of which are in the 2,048-code surrogate-pair range from 0xD800 (55,296) to 0xDFFF (57,343). The surrogate-pair range is at the high end of 16-bit values, so it is reasonable to avoid it by ensuring that all the encoded values are small enough. In fact, we encoded many meshes in blissful ignorance before encountering one that hit the surrogate-pair range. In practice, this is not a problem (see Table 30.1).

Even though JavaScript uses UTF-16 for strings, UTF-8 is the strongly preferred encoding for HTTP data, and is transparently decoded by the browser. UTF-8 is a very well-designed character encoding. It uses a clever byte-oriented variable-length encoding, so it maintains backwards compatibility with ASCII, works well with GZIP, and is endian-independent. Like UTF-16, UTF-8 is designed to transmit smaller Unicode code points more efficiently than larger ones. Because most 16-bit Unicode code points require a 3-byte encoding, variable-length encoding seems like a disadvantage, but if most values are small, then it is a significant advantage.
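As a sketch of the decoding side (the function names are illustrative, not WebGL Loader's actual API), the client can turn the response string directly into a Uint16Array, provided the encoder kept every value below the 0xD800 surrogate range:

function stringToUint16Array(str) {
  var out = new Uint16Array(str.length);
  for (var i = 0; i < str.length; i++) {
    out[i] = str.charCodeAt(i);  // one UTF-16 code unit per encoded value
  }
  return out;
}

// Encoder-side check in the art pipeline: values at or above 0xD800 would
// be turned into surrogate pairs by UTF-16 and must be avoided or remapped.
function fitsInSingleCodeUnit(value) {
  return value >= 0 && value < 0xD800;
}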

30.2.4 Stage 4: Rendering

Used correctly, indexed triangle lists are one of the fastest ways to draw triangles in most GPU-accelerated 3D graphics libraries, including WebGL. Indexed triangle lists are intrinsically faster because they contain the most information about how vertices are shared between triangles. Triangle strips and fans also share vertices between triangles, but not as effectively. Strips are also inconvenient to draw in WebGL; to save on draw calls, triangle strips must be joined together by degenerate triangles.

Optimizing for the post-transform vertex cache. To render a triangle, the GPU must first transform its vertices. Vertex shading is programmable in WebGL, so these transforms have the potential to be expensive. Fortunately, the results are strictly determined by the inputs.

4. “Code point” is rigorous Unicode specification language for what we consider to be a “character” [Unicode 11].
5. Unicode makes a distinction between characters and encodings, so technically, values within the surrogate-pair range are “code units,” not “code points.”

If the GPU knows when a vertex is shared between triangles, it can also share the transformation for that vertex. To do this, the GPU stores recently transformed vertices in the post-transform vertex cache.

The post-transform vertex cache can be quite effective in reducing the number of vertex transforms. In a normal model, most vertices are shared by 6 triangles. In this idealized case, there are only 0.5 vertices per triangle, so the vertex shader would only execute every other triangle—an ACMR (average cache miss ratio) of 0.5. These are compulsory misses; each vertex needs to be transformed at least once, no matter how often it is shared. Idealized triangle strips only achieve a best-case ACMR of 1.0. It is rare to actually achieve these ideal values. In practice, good ACMRs for indexed triangle lists are in the 0.6 to 0.7 range,6 which is still considerably better than the theoretical ideal ACMR for triangle strips.

Unlike the caches used in HTTP, which operate on millisecond timescales, the post-transform vertex cache is a tight piece of hardware that operates on nanosecond timescales. In order for it to help, it must be very fast, and in order for it to be fast, it must be small. As a result, the post-transform vertex cache can only store a fixed number of vertices at a time, introducing the possibility of capacity misses. We can help avoid capacity misses by optimizing the triangle order so that vertex references are locally clustered.

There is a great deal of existing literature written on optimizing triangle orders to minimize ACMR. The key to all triangle optimization approaches is in how they model the post-transform vertex cache. The problem is that there are too many different kinds of post-transform vertex caches. GPUs with separate vertex shaders generally used FIFOs with 16 to 32 elements; modern unified-shader GPUs do something completely different. And who knows what might change in the future? Instead of trying to model all of these variants precisely, a better approach is to use a cache-oblivious model that does a reasonable job of approximating them all.

WebGL Loader uses my favorite approach, Tom Forsyth's "Linear Speed Vertex Optimization." It is fast, simple, and generally gets as close to ideal results as more specific or sophisticated approaches. More crucially, it never performs badly. Forsyth's algorithm is greedy and does no backtracking; it simply finds the best triangle at any given moment and adds that to the triangle list [Forsyth 06]. Determining the "best" triangle trades off between greedily maximizing cache use and ensuring that triangles are added promptly so that the last few triangles are not left scattered around the mesh. Forsyth's algorithm uses simple heuristics to address both factors. Vertex references are tracked using a 32-entry LRU list where more recently referenced vertices have higher scores. The algorithm also tracks the number of remaining triangles that share a vertex; vertices with fewer triangles remaining have higher scores. The best triangle is the triangle with the highest vertex score sum.

6. Unfortunately, perfect ACMR depends on the mesh, specifically the actual ratio of vertices to triangles. An alternative metric is ATVR, average transform to vertex ratio, which was first described by Ignacio Castaño. Naturally, a perfect ATVR is 1 [Castano 09].
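The following is a much-simplified, unoptimized sketch of this kind of greedy LRU-plus-valence scoring. The scoring constants are placeholders rather than the tuned tables from Forsyth's paper or WebGL Loader's implementation, and the O(n²) triangle search is only for clarity:

function vertexScore(lruPosition, trianglesLeft) {
  var score = 0;
  if (lruPosition >= 0 && lruPosition < 32) {
    score += 1.0 - lruPosition / 32;         // recently used vertices score higher
  }
  if (trianglesLeft > 0) {
    score += 2.0 / Math.sqrt(trianglesLeft); // nearly finished vertices score higher
  }
  return score;
}

function greedyTriangleOrder(indices) {      // flat indexed triangle list
  var triCount = indices.length / 3;
  var emitted = [];
  var trianglesLeft = {};
  for (var i = 0; i < indices.length; i++) {
    trianglesLeft[indices[i]] = (trianglesLeft[indices[i]] || 0) + 1;
  }
  var lru = [];                              // most recently used vertex first
  var output = [];
  for (var n = 0; n < triCount; n++) {
    var best = -1, bestScore = -Infinity;
    for (var t = 0; t < triCount; t++) {     // exhaustive search, for clarity only
      if (emitted[t]) continue;
      var score = 0;
      for (var k = 0; k < 3; k++) {
        var v = indices[3 * t + k];
        score += vertexScore(lru.indexOf(v), trianglesLeft[v]);
      }
      if (score > bestScore) { bestScore = score; best = t; }
    }
    emitted[best] = true;
    for (var k = 0; k < 3; k++) {
      var v = indices[3 * best + k];
      output.push(v);
      trianglesLeft[v]--;
      var pos = lru.indexOf(v);
      if (pos >= 0) lru.splice(pos, 1);
      lru.unshift(v);                        // move to the front of the LRU list
    }
    if (lru.length > 32) lru.length = 32;    // keep a 32-entry LRU window
  }
  return output;
}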


Figure 30.2. Post-transform vertex cache optimization on (a) hand 00 and (b) ben 00.

We quantify the improvements that post-transform vertex cache optimization makes by modeling FIFO caches of various common sizes on meshes before and after optimization. We use two models from the Utah 3D Animation Repository [SCI Institute 11]: hand 00, the first keyframe of a wiggling hand animation with 9,740 vertices and 17,135 triangles, an ideal ACMR of 0.568, and ben 00, the first keyframe of a running man animation with 44,915 vertices and 78,029 triangles, an ideal ACMR of 0.575 (see Figure 30.2).

For both models and all modeled FIFO cache lengths, Forsyth's algorithm improves ACMR. Larger caches have greater benefit, but these benefits are marginal beyond 16 or 24 entries, so the choice of a 32-entry LRU cache is well justified [Hoppe 99].
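A FIFO model like the one used for this analysis is easy to write. Here is a minimal sketch (WebGL Loader's own analysis tooling may differ) that counts transforms per triangle for a given cache length:

function simulateACMR(indices, cacheLength) {
  var fifo = [];      // oldest vertex at the front
  var misses = 0;
  for (var i = 0; i < indices.length; i++) {
    var v = indices[i];
    if (fifo.indexOf(v) === -1) {
      misses++;       // vertex must be transformed
      fifo.push(v);
      if (fifo.length > cacheLength) {
        fifo.shift(); // evict the oldest entry
      }
    }
  }
  return misses / (indices.length / 3);  // average cache misses per triangle
}

// Example: compare an index buffer before and after optimization, e.g.,
// simulateACMR(originalIndices, 16) vs. simulateACMR(optimizedIndices, 16).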

Optimizing for the pretransform vertex cache. Just as there is a cache at the output of the vertex shaders, there is also one buffering the input loads. To optimize for the pretransform vertex cache, we must try to make these loads occur as sequentially as possible. The index array is accessed sequentially no matter what, so nothing needs to be done there. For static geometry, it is slightly more efficient to interleave all the vertex attributes needed for rendering so that the vertex shader can load them together.

The remaining variable is the actual ordering of the vertex accesses, which are indirectly indexed. We don't want to actually change the order of the triangles established by the post-transform vertex optimization, but it is possible to change the order of the vertices without changing the order of the triangles. Simply order the vertices by their first reference in the index buffer, and update the index buffer to reference its new location.


Figure 30.3. Before (left) and after (right) vertex reordering.

Figure 30.3 is an illustration of an indexed triangle list before and after the pretransform vertex-cache optimization. On the left, the vertex indices have been optimized using the post-transform vertex cache but not the pretransform vertex cache. Attributes are shared between nearby triangles, but appear in an arbitrary order, causing random reads by the vertex shader. After the pretransform vertex-cache optimization (on the right), the attributes are ordered by first appearance, yielding mostly sequential reads by the vertex shader. The triangle order is preserved.
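A minimal sketch of this reordering follows (the function name and the interleaved-attribute layout are assumptions for illustration); it orders vertices by first reference and remaps the index buffer without disturbing the triangle order:

function reorderVertices(indices, attributes, valuesPerVertex) {
  var remap = {};                       // old vertex index -> new vertex index
  var newIndices = new Array(indices.length);
  var newAttributes = [];
  var nextIndex = 0;
  for (var i = 0; i < indices.length; i++) {
    var oldIndex = indices[i];
    if (remap[oldIndex] === undefined) {
      // First reference: assign the next slot and copy this vertex's attributes.
      remap[oldIndex] = nextIndex++;
      for (var j = 0; j < valuesPerVertex; j++) {
        newAttributes.push(attributes[oldIndex * valuesPerVertex + j]);
      }
    }
    newIndices[i] = remap[oldIndex];    // triangle order is unchanged
  }
  return { indices: newIndices, attributes: newAttributes };
}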

Analysis. We can visualize the effect of vertex cache optimization by plotting indices and attributes before and after. We examine the first 2,000 indices and x positions of the Happy Buddha, a large, greater than one-million-triangle model from the Stanford 3D Scanning Repository7 (see Figure 30.4).

In general, indices follow a linear trend upwards. The slope represents the average number of times a vertex is shared—in this case, roughly 5.5 times (there are 361 unique values in the first 2,000 indices). Before optimization, index values jump around both forwards and backwards by hundreds.


Figure 30.4. Effect of optimization on indices.

7. http://graphics.stanford.edu/data/3Dscanrep/


Figure 30.5. Effect of optimization on x position.

Around position 1,600, there is even an index that goes off the charts. After optimization, the indices form a much smoother line. Because of pretransform vertex-cache optimization, there are no large forward jumps, and because of post-transform vertex-cache optimization, backwards jumps are relatively small.

Vertex-cache optimization also affects attributes. There is a definite pattern to the original data; they were originally captured by laser, so the periodic scanning is apparent. The value units are in the original floating-point data, although the actual data have been quantized. After vertex-cache optimization, attributes are ordered by use. Instead of rapidly scanning back and forth, x positions are more correlated, forming a random-looking, but connected line (see Figure 30.5). We will see that while this line appears random, it has structure that can be exploited for compression.

30.3 A Coherent Whole

WebGL and HTTP are calling the shots in the life of a 3D model, and all we can do is find the best way to do as much work as possible ahead of time during the art pipeline without placing too large of a burden on browser JavaScript. Fortunately, even if WebGL and HTTP apply constraints on what we can do, they also provide the tools and guidance to make everything work together. Part of the problem is solved: we will encode attributes and indices as Unicode code points. Let us now revisit UTF-8 encoding. On one hand, as a byte-oriented encoding, UTF-8 interacts well with GZIP; on the other hand, only a small range of

UTF-8 is encoded compactly. It is a waste and missed opportunity to use GZIP to counteract this expansion, especially when it is avoidable.

Optimizing the mesh for vertex caches introduces an exploitable structure to the vertex indices and attributes. Because of post-transform vertex-cache optimizations, vertex index references are clustered. Because of pretransform vertex-cache optimizations, index values increase gradually and sort related vertex attributes together. Introducing coherence in the mesh not only makes it faster to render, it makes it more compressible too!

30.3.1 Delta Coding

GZIP is designed for text compression, so it needs a little help compressing signals like vertex attributes and indices. It is unreasonable to expect LZ77 phrase matching to help something that isn't structured like words, but Huffman coding should certainly help. An easy approach is to first transform the data using delta coding. Delta coding is a simple predictive filter that sends differences within the data instead of the data themselves. If the most recent value is a good predictor of the next value, then the difference will be cheaper to encode than the next value. Delta coding can be implemented like this in JavaScript:

var prev = 0;
for (var i = 0; i < original.length; i++) {
  delta[i] = original[i] - prev;
  prev = original[i];
}

Given an original array of [100, 110, 107, 106, 115], the delta array will be [100, 10, −3, −1, 9]. Delta decoding to reverse this is also really simple. This is good news because it is the innermost loop of the JavaScript decompressor:

var prev = 0;
for (var i = 0; i < delta.length; i++) {
  prev += delta[i];
  original[i] = prev;
}

From this example, we can see that even if the original data are in the hundreds range, the differences are much smaller in magnitude. This effect will help both UTF-8 and Huffman coding. Smaller values will use fewer bytes in UTF-8. Smaller values also skew the distribution of bytes so that Huffman coding can use fewer bits to encode them.

When encoding or decoding a vector quantity, such as a vertex attribute, you want to make sure to compute differences across like values, rather than adjacent interleaved values—x positions with x positions, and not x positions with y positions or even x normals. In general, this means you need a different prev for each scalar attribute value. For example, if your vertex format contains both a position and a normal, you will need to keep track of six values in the decoder (the encoder is left as an exercise to the reader):

var prev = [0, 0, 0, 0, 0, 0];  // one running value per scalar attribute
for (var i = 0; i < delta.length;) {
  for (var j = 0; j < 6; j++) {
    prev[j] += delta[i];
    original[i] = prev[j];
    ++i;
  }
}

30.3.2 Delta Coding Analysis

We can visualize the effects of delta coding using the Happy Buddha x position data. Compared to the original signal, the delta coded signal is centered on zero, and has smaller variation. These qualities make it more compressible (see Figure 30.6). Delta coding doesn’t always help. For example, the original data do not have coherence easily exploitable by delta coding, as shown in Figure 30.7. Even though delta coding manages to center values around zero, the difference signal varies more than the original, unoptimized data.


Figure 30.6. Delta coding optimized positions.


Figure 30.7. Delta coding original positions.

30.3.3 ZigZag Coding

Delta coding generates negative values roughly half the time. Unlike floating-point numbers, which have a dedicated sign bit, fixed-point values normally use a two's complement representation. In two's complement, small negative numbers map to large unsigned values, which spoils the skew distribution we were hoping to achieve with delta coding. Ideally, we'd like the bottom bit to act like a sign bit so that small negative numbers are interleaved with small positive numbers, as shown in Table 30.2.

At Google, we call this zigzag coding, although it is a much older idea that follows naturally from delta coding. It's defined in the open-source Protocol Buffer library [Proto 11], which uses a variable-length encoding for integer values.

Delta    Unsigned    ZigZag
−3       0xFFFD      0x0005
−2       0xFFFE      0x0003
−1       0xFFFF      0x0001
 0       0x0000      0x0000
 1       0x0001      0x0002
 2       0x0002      0x0004
 3       0x0003      0x0006

Table 30.2. Two’s complement and zigzag coding.

Since zigzag encoding and decoding are in the inner loop for the compressor and decompressor, it is worth using some bit-twiddling magic to make sure it is as fast as possible. Encoding a 16-bit signed value uses the following incantation:

((input << 1) ^ (input >> 15)) & 0xFFFF;

The 0xFFFF mask might be implicit if you know the result will be stored in a 16-bit value (e.g., a uint16_t in C/C++ or an element of a Uint16Array in JavaScript). In both cases, you want to make sure to use signed right shifts. Decoding is also funky:

(input >> 1) ^ (-(input & 1));
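Wrapped as functions for clarity (the names are mine; the inner loops simply inline these expressions), the pair round-trips the values from Table 30.2:

function zigzagEncode(value) {   // value is a signed 16-bit integer
  return ((value << 1) ^ (value >> 15)) & 0xFFFF;
}

function zigzagDecode(code) {    // code is an unsigned 16-bit integer
  return (code >> 1) ^ (-(code & 1));
}

// zigzagEncode(-3) === 0x0005, zigzagEncode(3) === 0x0006
// zigzagDecode(0x0001) === -1,  zigzagDecode(0x0002) === 1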

30.3.4 Delta + ZigZag Coding Analysis

We can visualize the effectiveness of delta + zigzag coding by examining the distribution of encoded values. Figure 30.8, a semi-log plot of the first 128 values, shows that delta values have a small magnitude bias. These are the values that will be encoded in a single UTF-8 byte and account for 92.8% of the total. Aside from some periodic spikes, the plot is linear with a negative slope, indicating an exponential decay. The spikes, which occur at multiples of 41, are artifacts of the discrete scanning resolution of the original model.

The trend continues for the first 2,048 values, the values that can be encoded in UTF-8 using up to 2 bytes (see Figure 30.9). The exponential decay and the periodic spikes continue.


Figure 30.8. UTF-8 single byte encoding.


Figure 30.9. UTF-8 single and double byte encoding.

This plot accounts for 99.98% of the total values. Values larger than 1,000 are very rare.

30.3.5 The Compression Pipeline

We finally have all the details we need to build a model compressor. Google Body followed exactly these steps (a small encoding sketch follows the list):

1. Quantize attributes. Encode attributes such that each value can be read as a single small integer value from JavaScript, avoiding expensive client-side parsing. We may reconstitute the original values at load time or delay that to the vertex shader; position scales and offsets are easy to fold into the model transformation.

2. Optimize models for vertex-cache performance. Forsyth’s “Linear Speed Vertex Optimizer” simultaneously improves rendering performance while making the model data more compressible by making the data more coherent.

3. Delta coding. Expose coherence to later coding stages.

4. ZigZag coding. Fix two's complement mapping so that small negative numbers are encoded using small values.

5. UTF-8 encoding. This is how HTTP likes to send text data. As a bonus, it uses fewer bytes to send smaller values, an excellent follow-up to delta and zigzag coding.

6. GZIP compression. Standard HTTP compression provides the final statistical coder in the form of Huffman coding.
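Here is a small sketch (illustrative only, not WebGL Loader's actual encoder) combining steps 3-5 for one already-quantized attribute channel; the web server's GZIP support then supplies step 6:

function encodeChannel(quantized) {   // array of small quantized integers
  var chars = [];
  var prev = 0;
  for (var i = 0; i < quantized.length; i++) {
    var delta = quantized[i] - prev;                       // step 3: delta
    prev = quantized[i];
    var zigzag = ((delta << 1) ^ (delta >> 15)) & 0xFFFF;  // step 4: zigzag
    chars.push(String.fromCharCode(zigzag));               // step 5: string data
  }
  // Serving the resulting string as UTF-8 with Content-Encoding: gzip
  // completes steps 5 and 6.
  return chars.join('');
}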

30.4 Key Improvements

After open-sourcing WebGL Loader, we discovered a few improvements over Google Body's original mesh-compression algorithm. The first was an ordering that improved attribute compression, and the second was a new technique for encoding vertex indices that improved on basic delta coding.

30.4.1 Interleaving vs. Transposing

WebGL prefers storing vertex attributes in interleaved form because that is how they are accessed by the vertex shader. This is not ideal for Huffman coding, which dynamically adapts to the input statistics. If the data is interleaved, then Huffman coding will see a mix of deltas for disparate attributes like positions, normals, and texture coordinates. Even different dimensions of the same attribute can have different behavior (see Figure 30.10). In this semi-log chart, we plot the value distributions of both x position and x normal. While both show a characteristic exponential decay, their lines have different slopes and thus different decay rates. Furthermore, normals don't have the periodic frequency spikes exhibited in the positions.


Figure 30.10. Attribute distributions.

If we use a transposed format for compression (xxx...yyy...zzz... instead of xyzxyzxyz...), then Huffman coding will see long runs of values that come from a single attribute dimension, allowing it to adapt and encode to their distribution more efficiently. Interleaving doesn't have a significant additional cost for decoding in JavaScript, because all the data need to be copied from the XMLHttpRequest.responseText to a TypedArray anyway. Compared to the interleaved representation, the transposed representation has slightly worse memory locality, but achieves approximately 5% better GZIP compression.
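The decode-time copy can do the interleaving for free. Here is a minimal sketch (names are illustrative) that gathers a transposed buffer into the interleaved layout WebGL prefers:

function interleave(transposed, vertexCount, channelCount) {
  var interleaved = new Uint16Array(vertexCount * channelCount);
  for (var c = 0; c < channelCount; c++) {
    var base = c * vertexCount;              // start of this channel's run
    for (var v = 0; v < vertexCount; v++) {
      interleaved[v * channelCount + c] = transposed[base + v];
    }
  }
  return interleaved;
}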

30.4.2 High-Water Mark Prediction

Another significant improvement was in the delta and zigzag coding for vertex indices. There is a better predictor for the next index than the previous index: the highest seen index to that point, or the high-water mark. When a model is optimized for the post-transform vertex cache, vertex indices tend to be clustered. When a model is optimized for the pretransform vertex cache, vertex indices slowly increase. Specifically, the only time a vertex index larger than the current high-water mark appears, it is exactly one larger. Contrast this with delta coding, where differences are positive or negative with roughly equal probability.

High-water mark prediction works by encoding the difference from the potential next high-water mark. Differences will never be negative and will be zero if and only if a new vertex index is referenced. This encoding is about as simple as delta coding:

var nextHighWaterMark = 0;
for (var i = 0; i < original.length; i++) {
  var index = original[i];
  delta[i] = nextHighWaterMark - index;
  if (index === nextHighWaterMark) {
    nextHighWaterMark++;
  }
}

Since differences are never negative, there is no need for zigzag coding. Decoding correspondingly looks like this:

var nextHighWaterMark = 0;
for (var i = 0; i < delta.length; i++) {
  var code = delta[i];
  original[i] = nextHighWaterMark - code;
  if (code === 0) {
    nextHighWaterMark++;
  }
}

We can visualize the effectiveness of delta coding with and without high-water mark prediction by examining the effect on the distribution of encoded values for the first 55,000 vertices of the Happy Buddha model.


Figure 30.11. Top 20 references.

First, consider a close-up of the 20 smallest, most frequently referenced values, as shown in Figure 30.11. The original index data are the yellow line along the bottom that plots the number of times the first 20 vertex indices are referenced. At this scale, it is difficult to see that each index is referenced about six times on average, as expected. High-water mark prediction yields the blue line, and delta + zigzag coding is the red line. They each plot the number of times an encoded value is sent. Both high-water mark and delta + zigzag strongly bias the distribution of values sent to be small; this effect allows Huffman coding to compress well.


Figure 30.12. UTF-8 single byte encoding. 450 V Transfers


Figure 30.13. UTF-8 single- and double-byte encoding.

However, high-water mark coding does a better job, yielding smaller values with a steeper distribution. Delta + zigzag coding reveals a slight sawtooth character; there are slight peaks at even values, indicating a slight skew towards positive deltas.

These trends continue as we examine the set of values encoded by a single byte in UTF-8, as shown in Figure 30.12. Using a semi-log plot, the original index data are actually visible. The gap between high-water mark and delta + zigzag coding persists. Over 95.7% of high-water mark–predicted values are encoded by one byte, compared to just over 90.9% for delta + zigzag, a nearly 5% improvement.

We can see that the tail is substantially longer for delta + zigzag by zooming out to the set of values encoded in up to two bytes in UTF-8, as shown in Figure 30.13. For delta + zigzag, 6.54% of the values require two bytes in UTF-8. High-water mark reduces this to 2.88%. Cumulatively, 98.6% of high-water mark–predicted values are encoded using 2 bytes or fewer, slightly better than 97.4% for delta + zigzag.

30.4.3 Performance

Most of the computation in this model-compression technique is actually done by UTF-8 and GZIP, so JavaScript doesn't have to do very much to the data. The time actually spent doing delta, zigzag, and high-water mark decoding is quite low. Consider the Happy Buddha model, which is 1,087,716 triangles and 543,652 vertices (position and normal) and compresses to 5,148,735 bytes (∼4.73 bytes per triangle). We must delta + zigzag decode 3,261,912 16-bit vertex attributes and 3,263,148 16-bit indices.

On my oldest working laptop, a 2.4 GHz Core 2 Duo running Mac OS 10.5.8, the time spent in JavaScript is a fraction of a second. The performance depends on the browser. Prior to Chrome 17, Firefox 8 is faster, but Chrome 17 makes some important performance improvements for handling strings from XMLHttpRequest:

Browser               Time
Chrome 16.0.912.41    383 ms
Firefox 8.01          202 ms
Chrome 17.0.949.0     114 ms

Churning through 5,148,735 bytes in 114 ms is a throughput better than 45 MB/sec, closer to local storage bandwidth than broadband Internet bandwidth. The current implementation in WebGL Loader uses progress events to incrementally decode, so transfer and compute latency can overlap. WebWorkers will soon support fast, zero-copy transfers of TypedArrays [Herman and Russell 2011], so multiple models could be decoded in parallel. Since decoding is pure JavaScript and doesn't actually require a WebGL context, it is possible to overlap parallel model downloads with operations like WebGL context and shader initialization.

30.4.4 Future Work

There are a few more improvements we have not yet implemented that should improve compression. One notable improvement is to use indexed triangle strips instead of indexed triangle lists. Indexed triangle strips are not very popular, because they aren't faster than indexed triangle lists on desktop GPUs (strips may be faster on mobile GPUs). They would help for compression in two ways: directly, by making index buffers smaller, and indirectly, by making parallelogram prediction simple.

Parallelogram prediction (Figure 30.14) is a technique that can be used to predict vertex attributes on triangles that share an edge, as adjacent triangles in a triangle strip implicitly do. In the figure, the shaded triangle is the current triangle in a triangle strip and so shares an edge with the next triangle. Parallelogram prediction assumes that the next triangle looks like the previous one, so it guesses that the next vertex can be computed by reflecting the unshared vertex (at the bottom-left) across the shared edge, yielding the dotted triangle. The guess is usually close to the actual triangle (solid, unshaded triangle), so only a small correction vector needs to be stored (dashed arrow).

Figure 30.14. Parallelogram prediction.
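For one attribute channel, the prediction itself is just a reflection. The sketch below is illustrative only; this is future work, not something WebGL Loader currently implements.

// v1 and v2 are the vertices of the shared edge; v0 is the unshared vertex
// of the previous triangle. Each is an array of attribute values.
function parallelogramPredict(v0, v1, v2) {
  var predicted = new Array(v0.length);
  for (var i = 0; i < v0.length; i++) {
    predicted[i] = v1[i] + v2[i] - v0[i];  // reflect v0 across the shared edge
  }
  return predicted;
}

// The encoder would store only the small correction (actual - predicted);
// the decoder adds it back to the prediction.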

30.5 Conclusion

Web browsers are evolving at a rapid pace. WebGL is a welcome addition, but it has brought its share of unsolved problems. Google Body's model-compression technique, described in this chapter, closes a key gap: loading high-quality meshes in WebGL without direct browser support. Thanks to existing web infrastructure and quantum leaps in JavaScript performance, meshes can be compressed compactly and efficiently without a severe impact on loading time while simultaneously optimizing for GPU rendering performance. We eagerly anticipate the pervasive deployment of 3D content on the Web using WebGL.

Acknowledgments. I would first like to thank the Zygote Media Group, who was an excellent partner during the Google Body endeavor. Preserving the quality of their artistically fashioned models of human anatomy was a significant motivation for the work in this chapter. I would like to thank the people at Google who helped make this happen: my 80% team, who was always patient even when this 20% activity became all consuming; Thatcher Ulrich, who tells me which of my ideas are any good; and Ken Russell, who was always generous with his knowledge of browser and WebGL performance. Finally, I reserve my greatest thanks for the Google Body team, especially Arthur Blume, David Kogan, Vangelis Kokkevis, Rachel Weinstein Petterson, Nico Weber, and Dr. Roni Zeiger.

Bibliography

[Banarjee and Arora 11] Somnath Banerjee and Vikas Arora. "WebP Compression Study." http://code.google.com/speed/webp/docs/webp_study.html, May 18, 2011.

[Castano 09] Ignacio Castaño. "ACMR." http://www.ludicon.com/castano/blog/2009/01/acmr-2/, January 29, 2009.

[Cozzi and Ring 11] Patrick Cozzi and Kevin Ring. 3D Engine Design for Virtual Globes. Natick, MA: A K Peters, 2011. http://www.virtualglobebook.com/.

[Forsyth 06] Tom Forsyth. "Linear-Speed Vertex Cache Optimization." http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html, September 28, 2006.

[Gailly and Adler 93] Jean-Loup Gailly and Mark Adler. "A Brief Description of the Algorithms Used by GZIP." http://www.gzip.org/algorithm.txt, August 9, 1993.

[Herman and Russell 2011] David Herman and Kenneth Russell. "Typed Array Specification." http://www.khronos.org/registry/typedarray/specs/latest/#9, December 8, 2011.

[Hoppe 99] Hugues Hoppe. "Optimization of Mesh Locality for Transparent Vertex Caching." In ACM SIGGRAPH Proceedings (1999), pp. 269–276.

[Proto 11] Proto. "Encoding." http://code.google.com/apis/protocolbuffers/docs/encoding.html#types, retrieved December 14, 2011.

[SCI Institute 11] SCI Institute. The Utah 3D Animation Repository. http://www.sci.utah.edu/~wald/animrep/.

[Unicode 11] Unicode. "Unicode 6.0.0." http://www.unicode.org/versions/Unicode6.0.0/, September 26, 2011.

[Wingo 11] Andy Wingo. "Value Representation in JavaScript Implementations." http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations, May 11, 2011.