Transformations a Quarterly Publication From

issue 04 // spring-summer 08

Transformations A quarterly publication from

Q&A with Mike Rayfield The Visual Computing Era

f e a3D t u r e s Imaging in Stereo Optimized PC Design NVISION 08!

Yesterday's PC was designed as a "one-size-fits-all" system. But today's consumer is using a a completely different set of applications and they are all about the visual experience. Applications w o r d like watching videos, editing photos, and playing games require optimized designs to deliver the best experience. This is visual computing and it’s leading to segmentation of the PC market f r o m as system builders develop machines that outperform baseline enterprise PCs—a system for lifestyle users, another for gamers, yet another for prosumers. NVIDIA has taken a leading m i k e role in developing technology that enables visual computing applications and powers the next phase of GPU growth. In this issue we will discuss the Optimized PC design movement, our new Tegra family of computers-on-a-chip, and the growing popularity of CUDA.

Michael W. Hara v i c e p r e s i d e n t i n v e s t o r r e l a t i o n s a n d communications

PS: We look forward to seeing you at NVISION 08, August 25-27, in San Jose, California.

i s s u e f o u r // s p r i n g -s u m m e r 08 // n v i d i a corporation // w w w .n v i d i a .c o m transformations // Q&A with Mike Rayfield

Q&A with Mike Rayfield General manager, mobile, NVIDIA Corporation

NVIDIA recently introduced Tegra, a new brand How have visual mobile devices of computers-on-a-chip. What are some key changed in the last few years? points you would like us to know about Tegra? MR: The introduction of the iPhone fundamentally MR: The Tegra™ products have been designed from changed the market in two ways. First, it raised the bar the ground up to meet the needs of the growing for what consumers expect from smartphones, from universe of mobile visual computers. Smartphones, the quality of the audio and video to the immersive and personal navigation devices, portable media players, intuitive user interface. Second, and more importantly, and mobile Internet devices are all morphing into it marked a significant architectural change in how computers. They require extreme battery life and these devices are designed. They have evolved from amazing multimedia performance and must fit in form being purely communications devices with limited factors ranging from palm size to laptops. NVIDIA’s multimedia capabilities to true multimedia computers 15 years of innovation in visual computing positions with communications capabilities. With this shift in us to deliver these capabilities to our customers. architecture it is easy to imagine how a phone, personal media player, personal navigation device, or mobile Keep in mind that this is really just the beginning. Tegra Internet device are all the same fundamental device will target a broad array of products, from mobile Internet with different peripheral and input/output capabilities. devices to other delightful and engaging devices that We’ve done extensive research into these and other emerging platforms and the Tegra products are tailored will drive this new era of very personal computing. to deliver these subtle variations in capability. This is a discontinuity in the market on a scale similar to the introduction of Windows 95, which marked the start of the PC becoming a true consumer device.

transformations // n v i d i a corporation // w w w .n v i d i a .c o m Tegra will target a broad array of products, from mobile Internet devices to other delightful and engaging devices that will drive this new era of very personal computing.

Why did NVIDIA focus on Microsoft Imagine a scenario where you are travelling to Europe Windows Mobile and Windows CE? for business with your APX 2500-based smartphone. You listen to music for the 12-hour flight. You check into your hotel and plug your phone into the TV via an MR: In order to build a real computer, you need a real operating system. The Microsoft operating systems HDMI cable and watch a two-hour HD movie. After all deliver the necessary technical capability and enjoy of this, you still have over half of your battery power widespread market adoption. We have worked with available for making and receiving phone calls, browsing Microsoft throughout the development of Tegra and the Web, or sending emails. That’s something that together we’re ensuring that the combination of Tegra everyone can relate to and there is a clear demand and Windows Mobile and CE will meet the challenging in the market for technology that can deliver it. multimedia demands of markets around the world. And this isn’t a scenario that is limited to the smartphone space by any means. Portable media players, mobile Your Tegra APX 2500 product targets a class Internet devices, and all sorts of handheld and embedded of devices referred to as Smartphone 2.0. entertainment devices are rapidly growing in capabilities What does NVIDIA mean by this term? beyond their core functions. Personal navigation devices, for example, now feature photo viewers, MP3 players, and MR: Essentially, Smartphone 2.0 is a true multimedia Web browsers on top of their core navigation capabilities. computer in your pocket. Of course it will still run the enterprise applications that epitomize today’s The NVIDIA Tegra products deliver the technology smartphone, but it will be so much more. Smartphone needed to enable high-performance and engaging 2.0 users will have, in the palm of their hand, an applications that merely sip at the battery without incredibly versatile device for creating and sharing sacrificing quality. It is this kind of usability that truly content, watching and recording HD videos, listening embodies the next generation of personal computers to music, navigating, and browsing the Web. and NVIDIA, with our proven expertise in visual computing, is uniquely positioned to deliver it.

transformations // n v i d i a corporation // w w w .n v i d i a .c o m transformations // The Visual Computing Era

146X 36X 19X 17X 100X

The Visual Computing Era Spotlight on CUDA and Scalable Parallel Programming By John Nickolls, Ian Buck, Kevin Skadron (on sabbatical from Univ. of Virginia) and Michael Garland, NVIDIA. Adapted from an article that appeared in ACM Queue Magazine, April 2008

The advent of multi-core According to conventional wisdom, Unified Graphics and Computing GPUs CPUs and many-core GPUs parallel programming is difficult. Driven by the insatiable market However, early experience with demand for real-time, high-definition means that mainstream the CUDA™ scalable parallel 3D graphics, the programmable GPU processor chips are programming model and C language has evolved into a highly-parallel, now parallel systems. shows that many sophisticated multithreaded, many-core processor. Furthermore, their parallelism programs can be readily expressed It is designed to efficiently support continues to scale with with a few easily understood the graphics shader programming abstractions. Since NVIDIA released model, in which a program for one Moore’s Law. The challenge CUDA technology in 2007, developers thread draws one vertex or shades is to develop mainstream have rapidly developed scalable one pixel fragment. The GPU application software that parallel programs for a wide range excels at fine-grained, data-parallel transparently scales its of applications, including financial workloads consisting of thousands modeling, computational chemistry, of independent threads executing parallelism to leverage and oil & gas exploration. These vertex, geometry, and pixel shader the increasing number of applications scale transparently to program threads concurrently. processor cores, much as hundreds of processor cores and 3D graphics applications thousands of concurrent threads. The tremendous raw performance of transparently scale their modern GPUs has led researchers NVIDIA GPUs with the new Tesla™ to explore mapping more general parallelism to many-core unified graphics and computing non-graphics computations onto the GPUs with widely varying architecture run CUDA C programs GPU processors. Such GPGPU—or numbers of cores. and are widely available in laptops, General-Purpose Computation on PCs, workstations, and servers. GPUs—systems have produced The CUDA model is also applicable some impressive results, but the to other shared-memory parallel limitations and difficulties of doing processing architectures, including this via graphics APIs are legend. multi-core CPUs. This broad-based desire to use the GPU as a more general parallel computing device motivated NVIDIA to develop a new unified graphics and computing GPU architecture and the CUDA programming model.

transformations // n v i d i a corporation // w w w .n v i d i a .c o m professional s o l u t i o n s

149X 47X 20X 24X 30X

GPU Computing Architecture CUDA Paradigm CUDA Application Experience Introduced by NVIDIA in November CUDA provides three key CUDA minimally extends the C 2006, the Tesla unified graphics and abstractions—a hierarchy of thread and C++ programming languages. computing architecture significantly groups, shared memories, and barrier The programmer writes a serial extends the GPU beyond graphics synchronization—that provide a clear program that calls parallel kernels, —its massively multithreaded parallel structure to conventional and writes serial C code for the processor array becomes a highly- C code for one thread of the kernels, which may be simple efficient unified platform for both hierarchy. Multiple levels of threads functions or full programs. A kernel graphics and general-purpose provide fine-grained parallelism and executes in parallel across a set of parallel-computing applications. By coarse-grained parallelism. The parallel threads. Programmers who scaling the number of processors abstractions guide the programmer are comfortable developing in C and memory partitions, the Tesla to partition the problem into coarse can begin writing CUDA programs architecture spans a wide market sub-problems that can be solved in a very short amount of time. range from the high-performance independently in parallel, and then enthusiast GeForce® GPUs and into finer pieces that can be solved In the relatively short period since its professional Quadro® and Tesla in parallel. The programming model introduction, a number of real-world computing products to a variety of scales transparently to large numbers parallel application codes have been inexpensive, mainstream GeForce of threads and processor cores: a developed in CUDA. These include GPUs. Its computing features enable compiled CUDA program executes financial modeling, FHD spiral MRI straightforward programming of on any number of processors, and reconstruction, molecular dynamics, the GPU processor cores in C with only the run time system needs to and n-body astrophysics simulation. CUDA. Wide availability in laptops, know the physical processor count. desktops, workstations, and servers, Continued on following page>> coupled with C programmability and CUDA software, make the Tesla architecture the first ubiquitous supercomputing platform.

transformations // n v i d i a corporation // w w w .n v i d i a .c o m transformations // The Visual Computing Era

146X 36X 19X 17X 100X

Interactive visualization Ionic placement for Transcoding HD video Simulation in Matlab Astrophysics N-body of volumetric white molecular dynamics stream to HI264 for using .mex file simulation5 matter connectivity1 simulation on GPU2 portable video3 CUDA function4

Examples of speedup results using CUDA compared to previous approaches.

Running on Tesla-architecture The University of Illinois has Although CUDA was released not GPUs, these applications were able successfully taught a semester- that long ago, it is already the target to achieve substantial speed-ups long parallel programming course of massive development activity— over alternative implementations using CUDA to a mix of Computer there are tens of thousands of running on serial CPUs: the financial Science and non-Computer CUDA developers. The combination modeling was 149 times faster, the Science students, with students of massive speedups, an intuitive MRI reconstruction was 263 times obtaining impressive speedups on programming environment, and faster, the molecular dynamics a wide variety of real applications, affordable, ubiquitous hardware code was 10–100 times faster, and including the previously-mentioned are unique in today’s market, the n-body simulation was 50–250 MRI reconstruction example. and represent the realization times faster. The large speedups of decades of hopes in parallel are due to the highly parallel The programming paradigm provided computing. In short, we believe nature of the Tesla architecture by CUDA has allowed developers to CUDA represents a democratization and its high memory bandwidth. harness the power of these scalable of parallel programming. parallel processors with relative ease, Conclusion enabling them to achieve speedups Links to the latest version of CUDA is a model for parallel of 100 times or more on a variety the CUDA development tools, programming that provides a few of sophisticated applications. The documentation, code samples, easily understood abstractions that CUDA abstractions, however, are and user discussion forums can be allow the programmer to focus on general and also provide an excellent found at: www.nvidia.com/CUDA. algorithmic efficiency and develop programming environment for multi- scalable parallel applications. In fact, core CPU chips. A prototype source- CUDA is an excellent programming to-source translation framework environment for teaching parallel developed at the University of programming. The University of Illinois compiles CUDA programs Virginia used it as a short, three- for multi-core CPUs by mapping a week module in an undergraduate parallel thread block to loops within computer architecture course, a single physical thread. CUDA and students were able to write kernels compiled in this way exhibit a correct k-means clustering excellent performance and scalability. program after just three lectures.

transformations // n v i d i a corporation // w w w .n v i d i a .c o m professional s o l u t i o n s

149X 47X 20X 24X 30X

Financial simulation of GLAME@lab: An M-script Ultrasound medical Highly optimized object Cmatch exact string LIBOR model with API for linear algebra imaging for cancer oriented molecular matching to find similar swaptions6 operations on GPU7 diagnostics8 dynamics9 proteins and gene sequences10

Stanford University Launches New Pervasive Parallelism Lab Last month, NVIDIA announced that it is a founding member of Stanford University’s new Pervasive Parallelism Lab (PPL). The PPL will develop new techniques, tools, and training materials to allow software engineers to harness the parallelism of the multiple processors that are already available in virtually every new computer. NVIDIA’s investment complements the company’s ongoing strategy to solve some of the world’s NVIDIA Tesla Architecture GPU with 112 Streaming Processor (SP) Cores, arranged as 14 Streaming Multiprocessors (SMs). most computationally-intensive problems with GPUs and world-class tools and software.

HEADLINE IMAGERY SOURCES: Until recently, computer installations delivering 1 “Interactive Visualization of Volumetric White Matter Connectivity in DT-MRI Using a Parallel-Hardware massive parallelism could be deployed only in Hamilton-Jacobi Solver”, by Won-Ki Jeong, P. Thomas Fletcher, Ran Tao, and Ross T. Whitaker, IEEE large-scale computer centers with hundreds Conference on Visualization, Oct. 2007. http://www.cs.utah.edu/%7Ewkjeong/publication/vis07.pdf. to thousands of separate computer systems. 2 “Accelerating molecular modeling applications with graphics processors”, by John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, Klaus Schulten. Journal of With the recent introduction of many-core Computational Chemistry, vol. 28, no. 16, 2007, pp 2618–2640, http://dx.doi.org/10.1002/jcc.20829. processors such as the GPU and the multi- 3 Video encoding test uses iTunes on CPU and Elemental RapiHD on GPU running under core CPU, most new computer systems come Windows XP on Intel Core 2 Duo 1.66GHz and Intel Core 2 Quad Extreme 3GHz. GPUs were equipped with multiple processors that require GeForce 8800M on Gateway P-Series FX notebook, and GeForce 8800 GTS 512MB on new software techniques to exploit parallelism. Asus P5K-V motherboard (Intel G33-based) with 2GB DDR2 system memory. Extrapolated from 1 min 50 sec 1280x720 HD movie clip. http://www.elementaltechnologies.com/ Without new software techniques, computer 4 "MATLAB plug-in for CUDA". http://developer.nvidia.com/object/matlab_cuda.html. scientists are concerned that rapid increases 5 "High-Performance Direct Gravitational N-body Simulations on Graphics Processing Units—II: in the speed of computing could stall. An implementation in CUDA", Robert G. Belleman, Jeroen Bedorf, Simon Portegies Zwart, New Astronomy, July 2007, arXiv:0707.0438v2 http://arxiv.org/abs/0707.0438v2 From fundamental hardware to new user-friendly 6 LIBOR Monte Carlo code and "Notes on using the NVIDIA 8800 GTX graphics card", by Mike programming languages that will allow developers Giles and Su Xiaoke. http://www2.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/libor/report.pdf to exploit parallelism automatically, the PPL will 7 “GLAME@lab: An M-script API for Linear Algebra Operations on Graphics Processors”, Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, Enrique S. allow programmers to implement their algorithms Quintana-Ort, Feb. 2008, http://www3.uji.es/~figual/files/Papers/para08.pdf. in accessible, “domain-specific” languages while 8 See http://www.techniscanmedicalsystems.com at deeper, more fundamental levels of software, 9 “General Purpose Molecular Dynamics Simulations Fully Implemented on Graphics the system would do all the work for them in Processing Units”, by Joshua A. Anderson, Chris D. Lorenz, and A. Travesset, optimizing the code for parallel processing. Journal of Computational Physics, vol 227, no. 10, May 2008, DOI: 10.1016/j. jcp.2008.01.047, http://www.external.ameslab.gov/hoomd/downloads/GPUMD.pdf 10 “Fast Exact String Matching on the GPU”, Michael C. Schatz and Cole Trapnell, NVIDIA is joined by AMD, HP, IBM, Intel, and Sun May 2008, http://www.cbcb.umd.edu/software/cmatch/Cmatch.pdf in this venture. transformations // 3D Imaging in Stereo

3D Imaging in Stereo

An intriguing area of imaging Naked-eye stereoscopy can be Since 2000, the university’s research technology is the field of implemented in a variety of ways; group has been developing a system the one being studied by Associate where in vivo cross-sections obtained research known as “naked-eye Professor Hongen Liao and graduate in real time by CT or MRI scans are stereoscopy,” which displays student Nicholas Herlambang in treated as volume textures, which 3D stereoscopic images Professor Dohi’s research group can not only be reconstituted as 3D without the need for special is called Integral Videography (IV). images through volume rendering, but eyeglasses. This technology This method uses a special display also displayed as stereoscopic video comprised of a micro-lens array with for use in an IV system. not only has applications convex lenses on a matrix which in entertainment, but is is bonded to a liquid crystal panel. This system could revolutionize real- being studied as a practical Directly beneath each micro-lens, there time, stereoscopic, in-vivo imaging. technology for a variety of are some 100 liquid-crystal elements. However, the amount of computation The convex lens projects the light from is huge; the volume rendering alone professional applications. each element in various directions. The creates a high processing load, then object to be represented in 3D space further processing is required for One especially promising is illuminated by light rays from several the stereoscopic imaging. For each application is in medical directions, forming a stereoscopic video frame, a vast number of angles imaging, where NVIDIA’s CUDA image which, to the user, seems to be must be displayed at the same time. floating in the air. Multiply this by the number of frames parallel computing platform is in the video and a staggering amount being studied by Professor Because this method projects a 3D of computation must be done with Takeyoshi Dohi and his image into space, it has advantages high precision in a short time. colleagues in the Department over the traditional stereoscopic of Mechano-Informatics at the method, where different images are In research in 2001, real-time displayed for the viewer’s left and volume rendering and stereoscopic University of Tokyo’s Graduate right eyes. Using IV, the 3D image reconstitution for images of 512 x School of Information Science can be observed from a wide area 512 resolution on a Pentium III 800 and Technology. in front of the display by several MHz PC took over ten seconds to viewers at once, without using special generate a single frame. To speed eyeglasses or viewpoint tracking. up processing, the group tried using the 60 CPUs of an UltraSPARC III 900 MHz machine, the latest high- performance computer available at the time. But the best result that could be obtained was five frames

transformations // n v i d i a corporation // w w w .n v i d i a .c o m professional s o l u t i o n s

The Advanced Therapeutic and Rehabilitation Engineering (ATRE) Laboratory About the ATRE Lab is a multi-disciplinary research center working on computer-aided surgery and rehabilitation engineering. Its focus is on the development of robotic devices and computer programs to assist physicians with minimally invasive surgery, as well robotically-enhanced assisting devices for the elderly and physically challenged. The laboratory is part of the Department of Mechano-Informatics in the Graduate School of Information Science and Technology at the University of Tokyo.

LEFT TO RIGHT Professor Takeyoshi Dohi, Associate Professor Hongen Liao, Graduate Student Nicholas Herlambang per second. This was simply not fast the current IV system for Tesla using each surgeon able to visualize the enough to be practical. CUDA. This is expected to boost operation in real time. performance even further. Both the volume rendering and It is difficult to bring a huge parallel subsequent conversion to IV format “We have evaluated various computer array into clinical settings, require data-parallel vector calculation. development environments for parallel but the powerful computing For this, the optimal computing computing,” says Herlambang, the capabilities of GPUs make it paradigm is the GPU. Accordingly, Liao researcher responsible for the CUDA possible to provide compact, parallel and Herlambang started to research implementation, “but we chose CUDA computing modules. GPU implementation using CUDA, a because it lets us do development general-purpose, C-language GPU in the C language syntax we are Kaori Nakamura and David Wilton of NVIDIA development environment from NVIDIA. used to. Also, we will be able to take contributed to this article. advantage of speed increases in First, the researchers developed future generations of GPUs without a prototype system based on the modifying the system we have NVIDIA architecture. When data sets developed. If an environment that from the 2001 study were run on makes it easy to debug large CUDA the GPU using CUDA, performance programs becomes available, CUDA improved to 13-14 frames per will become an even more powerful second. As the UltraSPARC system development environment for parallel had cost tens of millions of yen, computing and we expect it will find the researchers were amazed that more applications in medical image a GPU, costing a hundred times processing as well.” less, delivered nearly three times the performance. Moreover, according When images from CT and MRI are IV system using CUDA. to the group, NVIDIA’s GPU was at viewed stereoscopically in real time, least 70 times faster than the latest physicians can check the state of generation of multi-core CPUs. diseased tissues and make diagnoses In addition, tests showed that the without biopsies and surgery. GPU’s high performance was even Moreover, several physicians can more conspicuous for larger volume view the images at the same time and texture sizes. consult with one another. And it may eventually make it possible for several Currently, the group is working with physicians to perform arthroscopic the Tesla D870-based deskside surgery and other minimally invasive supercomputer and is optimizing surgical techniques together, with

transformations // n v i d i a corporation // w w w .n v i d i a .c o m transformationstransformations // // Optimized Lorem Ipsum PC Designest Dolor

Optimized PC Design The User Experience is Paramount

Consumers are demanding A PC should be designed and All over the world, we see the PCs that deliver richer optimized for how the user intends to movement toward usage-optimized use it. The more visual the experience, PC design taking hold and graphics and beautiful the more important the investment increasing in momentum. We call visuals. To respond, PC in the GPU becomes. And for large this the “Optimized PC” design manufacturers all over the segments of the marketplace like movement. This trend, combined world are no longer designing workstations, gaming PCs, video/ with growing numbers of visually- PCs based simply on the photo editing PCs, media centers, and rich applications and receptive lifestyle PCs, the visual experience consumers with high expectations, “speeds and feeds” of the is front-and-center and cannot be is driving consumption of GPUs. CPU, but with the total compromised. The GPU is no longer user experience in mind. a luxury or merely a “nice-to-have.”

An Optimized PC Improves the Visual Computing Experience Games Video Photos Operating Systems World in Conflict™. ©2008 Golden Gate Bridge PicLens ©2008 Microsoft® Windows Vista™ Massive Entertainment AB. PureVideo® HD technology Cooliris, Inc. ©2008 Microsoft

transformations // n v i d i a corporation // w w w .n v i d i a .c o m transformations // NVISION 08!

NVISION 08! The First-Ever Visual Computing Mega-Event please join us at nvision 08, august 25-27, 2008 in San jose, california

Featuring industry luminaries, in-depth technical sessions, cutting-edge technology previews, and live musical “NVISION 08 will be the defining event and visual entertainment, NVISION 08 will focus on the intersection of for visual computing. It brings the visual technology, science, and art that defines the world of visual computing. computing ecosystem under one roof

The event will be held August to explore, share, and discover.” 25-27, 2008 in San Jose, California, promising to turn Silicon Valley into a mecca of visual computing this -Jen-Hsun Huang, CEO, NVIDIA summer. Conference registration and sponsorship information is available at www.nvision2008.com.

For more information, contact Calisa Cole at [email protected].

Professional Visualization Demoscene Community

Digital Art Computing

transformations // n v i d i a corporation // w w w .n v i d i a .c o m Transformations is a quarterly publication of NVIDIA’s investor relations and communications group.

Please send feedback to [email protected].

NVIDIA Corporation | 2701 San Tomas Expressway | Santa Clara, CA 95050 | www.nvidia.com

Copyright © 2008 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, GeForce, Quadro, Tesla, NVIDIA Tegra, CUDA, PureVideo and NVISION are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.