TheThe DefinitiveDefinitive GuideGuidetmtm TToo
Scaling Out SQL Server 2005
Don Jones Chapter 1
Introduction to Realtimepublishers
by Sean Daily, Series Editor
The book you are about to enjoy represents an entirely new modality of publishing and a major first in the industry. The founding concept behind Realtimepublishers.com is the idea of providing readers with high-quality books about today’s most critical technology topics—at no cost to the reader. Although this feat may sound difficult to achieve, it is made possible through the vision and generosity of a corporate sponsor who agrees to bear the book’s production expenses and host the book on its Web site for the benefit of its Web site visitors. It should be pointed out that the free nature of these publications does not in any way diminish their quality. Without reservation, I can tell you that the book that you’re now reading is the equivalent of any similar printed book you might find at your local bookstore—with the notable exception that it won’t cost you $30 to $80. The Realtimepublishers publishing model also provides other significant benefits. For example, the electronic nature of this book makes activities such as chapter updates and additions or the release of a new edition possible in a far shorter timeframe than is the case with conventional printed books. Because we publish our titles in “real-time”—that is, as chapters are written or revised by the author—you benefit from receiving the information immediately rather than having to wait months or years to receive a complete product. Finally, I’d like to note that our books are by no means paid advertisements for the sponsor. Realtimepublishers is an independent publishing company and maintains, by written agreement with the sponsor, 100 percent editorial control over the content of our titles. It is my opinion that this system of content delivery not only is of immeasurable value to readers but also will hold a significant place in the future of publishing. As the founder of Realtimepublishers, my raison d’être is to create “dream team” projects—that is, to locate and work only with the industry’s leading authors and sponsors, and publish books that help readers do their everyday jobs. To that end, I encourage and welcome your feedback on this or any other book in the Realtimepublishers.com series. If you would like to submit a comment, question, or suggestion, please send an email to [email protected], leave feedback on our Web site at http://www.realtimepublishers.com, or call us at 800-509- 0532 ext. 110. Thanks for reading, and enjoy!
Sean Daily Founder & Series Editor Realtimepublishers.com, Inc.
i Chapter 1
Introduction to Realtimepublishers...... i Chapter 1: An Introduction to Scaling Out...... 1 What Is Scaling Out? ...... 1 Why Scale Out Databases? ...... 3 Microsoft SQL Server and Scaling Out...... 4 SQL Server 2005 Editions ...... 6 General Technologies ...... 6 General Scale-Out Strategies...... 7 SQL Server Farms...... 7 Distributed Partitioned Databases...... 9 Scale-Out Techniques ...... 10 Distributed Partitioned Views...... 11 Distribution Partitioned Databases and Replication ...... 13 Windows Clustering...... 13 High-Performance Storage...... 14 Hurdles to Scaling Out Database Solutions...... 14 Database Hurdles ...... 19 Manageability Hurdles...... 19 Server Hardware: Specialized Solutions...... 19 Comparing Server Hardware and Scale-Out Solutions ...... 21 Categorize the Choices ...... 21 Price/Performance Benchmarks...... 22 Identify the Scale-Out Solution ...... 23 Calculate a Total Solution Price ...... 23 Take Advantage of Evaluation Periods...... 23 Industry Benchmarks Overview ...... 24 Summary...... 25 Content Central ...... 25 Download Additional eBooks! ...... 25
ii Chapter 1
Copyright Statement © 2005 Realtimepublishers.com, Inc. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtimepublishers.com, Inc. (the “Materials”) and this site and any such Materials are protected by international copyright and trademark laws. THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtimepublishers.com, Inc or its web site sponsors. In no event shall Realtimepublishers.com, Inc. or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials. The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, non- commercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice. The Materials may contain trademarks, services marks and logos that are the property of third parties. You are not permitted to use these trademarks, services marks or logos without prior written consent of such third parties. Realtimepublishers.com and the Realtimepublishers logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners. If you have any questions about these terms, or if you would like information about licensing materials from Realtimepublishers.com, please contact us via e-mail at [email protected].
iii Chapter 1
[Editor’s Note: This eBook was downloaded from Content Central. To download other eBooks on this topic, please visit http://www.realtimepublishers.com/contentcentral/.]
Chapter 1: An Introduction to Scaling Out
Enterprise applications have become more complex and have taken on a greater burden for managing a company’s critical data. At the same time, the amount of data managed by those applications has swelled exponentially as companies begin to track an increasingly greater amount of information—data about customers, vendors, sales, and more. In addition, the advent and popularity of data warehousing has expanded the amount of stored data by a factor of a hundred or more. In short, we’re keeping track of more data than ever before, and our database servers are starting to show signs of strain. This strain has been the catalyst for interest in scaling out from a diverse audience that ranges from database administrators to CEOs. Even your organization’s CFO will care, as scaling up is often more expensive than scaling out, especially when scaling up requires the purchase of an expensive midrange or mainframe platform.
What Is Scaling Out? As applications grow to support tens and hundreds of thousands of users, scaling is becoming a mission-critical activity. Scaling up—improving efficiency by fine-tuning queries, indexes, and so forth, as well as moving to more powerful hardware or upgrading existing hardware—helps IT organizations do more with less. In the past, scaling up met the increasing IT needs of many organizations. After all, even an older Windows NT 3.51 server can address up to 4GB of physical RAM. Newer 64-bit machines can address terabytes of physical RAM, which, even by today’s standards, seems practically infinite. And yet, thanks to the incredible amount of data that organizations handle, all that hardware isn’t always sufficient. Scaling out is the process of making multiple servers perform the work of one logical server or of dividing an application across multiple servers. A Web farm provides an excellent example of scaling out (see Figure 1.1). Each server in the farm is completely independent and hosts an identical copy of the entire Web site. Users are load balanced across the servers, although the users rarely realize that more than one server exists. For example, when you visit http://www.microsoft.com, can you tell how many Web servers are actually behind the scenes handling your request? The number is greater than one or two, you can be sure.
1 Chapter 1
Figure 1.1: Web farms represent an excellent example of scaling out.
Why is scaling out even necessary in a Web farm? After all, most Web servers are fairly inexpensive machines that employ one or two processors and perhaps 2GB of RAM or so. Surely one really pumped-up server could handle the work of three or four smaller ones for about the same price? The reason is that individual servers can handle only a finite number of incoming connections. Once a server is dealing with a couple of thousand Web requests (more or less, depending on the server operating system—OS, hardware, and so forth), adding more RAM, network adapters, or processors doesn’t increase the server’s capacity enough to meet the demand. There is always a performance bottleneck, and it’s generally the server’s internal IO busses that move data between the disks, processors, and memory. That bottleneck is literally built-in to the motherboard, leaving you few options other than scaling out. Figure 1.2 illustrates the bottleneck, and how adding processors, disks, or memory—the features you can generally upgrade easily—can only address the problem up to a point.
2 Chapter 1
Figure 1.2: The motherboard bottleneck.
Why Scale Out Databases? Database servers don’t, at first glance, seem to share a lot of characteristics with Web servers. Web servers tend to deal with small amounts of data; database servers must handle thousands of times more information. Database servers are usually “big iron” machines with multiple processors, tons of RAM, fast disks, and so forth—they certainly don’t suffer from a shortage of raw computing power. In addition, database servers don’t always accept connections directly from clients; multi-tier applications provide one or more intermediate layers to help consolidate connections and conserve database servers’ capacity. Database servers also have vastly more complicated jobs than Web servers. Web servers read files from disk and stream the files out over the Internet; database servers do an incredible amount of work—even adding a single row to a small database table might require a half-dozen indexes to be updated, triggers to be executed, other connections to be notified of the new row, and so forth. The work of a database server eventually results in a cascade of additional work. Database servers tend to bottleneck at several levels. Although 4GB of RAM is a lot of memory, a database server might be dealing with 8GB worth of data changes. Four processors sounds like a lot—and it is—but a database server might be asking those processors to deal with cascading changes from thousands of database operations each minute. Eventually those changes are going to have to be dumped to disk, and the disks can only write data so fast. At some point, the hardware isn’t going to be able to keep up with the amount of work required of database servers. Thus, scaling out is becoming critical for companies that rely on truly large databases.
3 Chapter 1
What Is a Truly Large Database? For a view of a truly large database, take a look at Terraserver.com. This system contains satellite imagery down to 1 or 2 meters of detail of nearly the entire world. This amount of data is literally terabytes of information; so much so that simply backing up the information requires high-powered hardware. Terraserver.com does more than simply store and query this vast amount of data, which is primarily the photographs stored as binary data; it is Internet-accessible and has handled tens of thousands of user connections per hour. Making this size of database available to a large audience of users wouldn’t be possible without scaling out. In fact, Terraserver.com’s traffic has been high enough to take down Internet Service Providers (ISPs) whose customers were all hitting the Terraserver.com site—particularly when the site released photos of Area 51 in Nevada. Although the ISPs had problems, Terraserver.com never went down during this period. Other giant databases are easy to find. Microsoft’s Hotmail Web site, for example, deals with a good bit of data, as does Yahoo!’s email service. America Online has a pretty huge database that handles its membership records for tens of millions of users. Big databases are everywhere.
Microsoft SQL Server and Scaling Out Most organizations tackle database performance issues by scaling up rather than out. Scaling up is certainly easier, and Microsoft SQL Server lends itself very well to that approach; however, scaling up might not always be the best choice. Many database applications have inefficient designs or grow inefficient over time as their usage patterns change. Simply finding and improving the areas of inefficiency can result in major performance benefits. For example, I once worked with a customer whose database required a nine-table join to look up a single customer address. Selectively denormalizing the tables and applying appropriate indexes let SQL Server execute customer address queries much faster. As address lookups were a very common task, even a minor per-query improvement resulted in a major overall performance benefit. Inefficiency: It’s Nobody’s Fault Why are databases inefficient? Determining a single culprit is a difficult task as there are several reasons why a database’s performance might be less than stellar. Heading the list is poor design, which can result from a number of sources. First, many application developers do not excel at database design. Some, for example, have been taught to fully normalize the database at all costs, which can lead to significantly degraded performance. Sometimes project schedules do not permit enough design iterations before the database must be locked down and software development begins. In some cases, the application is not designed well, resulting in an incomplete database design that must be patched and expanded as the application is created. Change is another major factor. An application might be used in a way unintended by its designers, which reduces efficiency. The application may have expanded and begun suffering from scope creep—the growth of change of project requirements. In this case, redesigning the application from the beginning to meet current business needs might be the best solution to database inefficiency. Sheer growth can also be a problem. Databases are designed for a specific data volume; once that volume is exceeded, queries might not work as they were intended. Indexes might need to be redesigned or at least rebuilt. Queries that were intended to return a few dozen rows might now return thousands of rows, affecting the underlying design of the application and the way data is handled. These problems are difficult to address in a live, production application. Scaling up—trying to optimize performance in the current environment—tends to have a limited effect. Nobody will debate that the application’s design is inefficient, but companies are reluctant to destroy a serviceable application without serious consideration. Scaling out provides a less-drastic solution. Although scaling out requires much work on the server side, it might not require much more than minor revisions to client-side code, making the project approachable without completely re-architecting the application. Although scaling out might not be the most elegant or efficient way to improve performance, it helps alleviate many database and application design flaws, and can allow companies to grow their database applications without needing to redesign them from scratch.
4 Chapter 1
There is a limit to how much scaling up can improve an application’s performance and ability to support more users. To take a simplified example, suppose you have a database that has the sole function of performing a single, simple query. No joins, no real need for indexes, just a simple, straightforward query. A beefy SQL Server computer—say, a quad-processor with 4GB of RAM and many fast hard drives—could probably support tens of thousands of users who needed to simultaneously execute that one query. However, if the server needed to support a million users, it might not be able to manage the load. Scaling up wouldn’t improve the situation because the simple query is as finely tuned as possible—you’ve reached the ceiling of scaling up, and it’s time to turn to scaling out. Scaling out is a much more complicated process than scaling up, and essentially requires you to split a database into various pieces, then move the various pieces to independent SQL Server computers. Grocery stores provide a good analogy for comparing scale up and scale out. Suppose you drop by your local supermarket and load up a basket with picnic supplies for the coming holiday weekend. Naturally, everyone else in town had the same idea, so the store’s a bit crowded. Suppose, too, that the store has only a single checkout lane open. Very quickly, a line of unhappy customers and their shopping carts would stretch to the back of the store. One solution is to improve the efficiency of the checkout process: install faster barcode scanners, require everyone to use a credit card instead of writing a check, and hire a cashier with the fastest fingers in the world. These measures would doubtless improve conditions, but they wouldn’t solve the problem. Customers would move through the line at a faster rate, but there is still only the one line. A better solution is to scale out by opening additional checkout lanes. Customers could now be processed in parallel by completely independent lanes. In a true database scale-out scenario, you might have one lane for customers purchasing 15 items or less, because that lane could focus on the “low-hanging fruit” and process those customers quickly. Another lane might focus on produce, which often takes longer to process because it has to be weighed in addition to being scanned. An ideal, if unrealistic, solution might be to retain a single lane for each customer, but to divide each customer’s purchases into categories to be handled by specialists: produce, meat, boxed items, and so forth. Specialized cashiers could minimize their interactions with each other, keeping the process moving speedily along. Although unworkable in a real grocery store, this solution illustrates a real-world model for scaling out databases.