The Risks of Using Spreadsheets for Statistical Analysis: Why Spreadsheets Have Their Limits, and What You Can Do to Avoid Them

Introduction

Spreadsheets are widely used for statistical analysis, and while they are incredibly useful tools, they are useful only up to a point. When used for a task they’re not designed to perform, or for a task at or beyond the limit of their capabilities, spreadsheets can be somewhat risky.

This paper presents some points you should consider if you use, or plan to use, a spreadsheet to perform statistical analysis. It also describes an alternative that in many cases will be more suitable.

Why spreadsheets are popular

A spreadsheet is an attractive choice for performing calculations because it’s easy to use. Most of us know (or think we know) how to use one. Plus, spreadsheet programs come as a standard desktop computer resource, so they’re already available.

A spreadsheet is a wonderful invention and an excellent tool for certain jobs. All too often, however, spreadsheets are called upon to perform tasks that are beyond their capabilities.

Moreover, the perception that a spreadsheet is easy to use is, to some extent, an illusion. It is always easy to get an answer out of a spreadsheet, but it’s not necessarily easy to get the correct answer.

However, the decision to use something else (an unfamiliar technology or tool) is not always an easy one. When considering an alternative, two questions come to mind: How useful is this tool? And how hard is it to learn?

The answer to the first question depends on the scale and the complexity of your data analysis. A typical spreadsheet restricts the number of records it can handle, so if the scale of the job is large, a tool other than a spreadsheet may be very useful.

Spreadsheets can be useful for statistical analysis; but when used for tasks they’re not designed to perform, they have some limitations.

As for complexity, if you only need a superficial review of your data, a spreadsheet may be a suitable tool. But if you suspect that there is valuable information in your data that isn’t immediately obvious, or if you need to perform a detailed analysis or find hidden patterns, a spreadsheet will not give you the functionality you need.

Another factor to consider is the degree of accuracy required. Spreadsheet results can be unreliable when working with large datasets and/or performing complex calculations. If absolute accuracy is required, a spreadsheet may not suffice. Instead, a different, more reliably accurate tool should be considered.

The effort required to use a spreadsheet is also a consideration. An analyst using a spreadsheet to perform statistical analysis will have to spend more time preparing data because of the lack of specialized data preparation features.

Finally, if the task is simply to analyze a limited quantity of historical data, a spreadsheet will do. But if you want to make reliable forecasts or draw trends, especially if they involve large datasets, then there are much better tools.

This paper will return to the second question (how hard is it to learn?) when it looks at an alternative to spreadsheets for statistical calculations. Before continuing, however, it is worth noting that people use spreadsheets for tasks other than numerical calculation. For example, spreadsheets are frequently used as if they were databases, to create and manage lists. Again, the principles of scale and complexity apply. Beyond certain limits, a proper database, with built-in rules for structuring data, maintaining data integrity, developing audit trails and so on, is far more suitable.

Two things to remember about spreadsheets

Users need to keep two important considerations in mind when working with spreadsheets: spreadsheets can be complex to create, and they can be error prone.

Spreadsheets are really computer programs

When you design a spreadsheet layout, you are writing a computer program. Spreadsheet programs such as Microsoft Excel use what is known as a “nonprocedural programming language.” Although it is also possible to write procedural programs for Excel in Visual Basic, the everyday business of typing formulas into cells is an exercise in nonprocedural programming.

Creating a spreadsheet can be as complex as computer programming. Nonprocedural programming is just as full of decisions, complexity and chances for mistakes as all but the simplest procedural program.

With standard software development methodology, procedural computer programs are double- and triple-checked. By contrast, a spreadsheet, although it may be vitally important to the operation of a company, is usually the work of one person. It is almost never checked or tested in detail, and quite often goes into production with little or no verification. Yet important management decisions, such as revenue forecasts and plans for future investment, are based on the numbers that it produces.

Spreadsheets are prone to errors

A number of studies have examined the frequency of errors in spreadsheets. Based on these, it seems that 88 percent of all spreadsheets contain at least one error. The studies were carried out by making visual inspections of mission-critical spreadsheets, so it is possible that many other errors were not found. The studies also found that attempts to correct errors often introduced new ones.

Examples of costly spreadsheet errors from around the world include:

• “...a cut and paste error in a spreadsheet cost TransAlta $24M” [1]
• “...a simple spreadsheet formatting error made MI5, the British Intelligence Agency bug more than one thousand wrong phones” [2]
• “...Harvard University economists Carmen Reinhart and Kenneth Rogoff have acknowledged making a spreadsheet calculation mistake in a research paper, ‘Growth in a Time of Debt’ (PDF), which has been widely cited to justify budget-cutting” [3]
• “...London Olympics ended up selling 10,000 tickets for non-existent seats because of a spreadsheet error” [4]

Types of spreadsheet errors

Spreadsheet errors can be divided into three main types: functional errors, outlier errors and stealth errors, which range in severity from low to high.

The “friendliest” type is what you could call functional errors. These errors are the easiest to find because they simply stop the spreadsheet from working. Instead of giving you wrong numbers, they give error messages, or nothing at all.

Then there are outlier errors. With these, the spreadsheet appears to work, but the numbers aren’t right. Often, such errors are spotted by someone who has an idea of what the results should be and draws attention to the fact that the results don’t match expectations.

The most serious are what might be called stealth errors. These produce incorrect results, but nobody believes they’re incorrect. They pass inspection and are accepted as the truth. Stealth errors occur either because nobody knows what the correct result should be (which is often the case with statistical calculations), or because the numbers are only slightly different from expectations and seem reasonable. [5]

There are a number of stories about how spreadsheet errors have had embarrassing consequences.

One dates from 2013, when a graduate student uncovered a number of errors in the spreadsheets from a highly influential 2010 research paper used by many government policy makers to justify budget cutting. [6] While Harvard University economists Carmen Reinhart and Kenneth Rogoff have acknowledged their mistake, they maintain that their conclusions about the association between higher government debt and slower economic growth are still valid. [7]

Another story is about a $100 million “oops.” Companies are bought and sold all the time. And once the parties agree on a price, that’s usually it. However, in October 2014, the number of outstanding shares of Tibco Software was misstated because of one small spreadsheet error. As a result, the sale price of the company, a price all parties had agreed to, was overstated by $100 million. When the transaction was completed, the acquired company’s shareholders received $100 million less than they expected. The only folks smiling here were the litigators. [6]

Causes of spreadsheet errors

Accidentally inserting a number into a cell already containing a formula will overwrite the equation and turn the cell content into a constant.

Spreadsheet users should know what factors cause errors. Unfortunately, there are too many causes to list here, but the major ones are:

• Misuse of built-in functions: The wrong function can be used. For example, a user might choose AVERAGEA, which evaluates text and false entries to zero, instead of AVERAGE, which ignores them.
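To make the AVERAGEA versus AVERAGE mix-up described above concrete, here is a minimal Python sketch of how the two functions can disagree on the same column. The function names `average` and `averagea` and the sample column are hypothetical stand-ins written for this illustration; they only imitate the behavior the paper describes (text and FALSE counted as zero versus ignored), plus the assumption that TRUE counts as 1 and empty cells are skipped. They are not a spreadsheet API.

```python
def average(cells):
    """Imitates AVERAGE: only numeric cells count; text, booleans and blanks are ignored."""
    nums = [c for c in cells
            if isinstance(c, (int, float)) and not isinstance(c, bool)]
    return sum(nums) / len(nums)


def averagea(cells):
    """Imitates AVERAGEA: text and FALSE count as 0, TRUE as 1 (assumed); blanks are skipped."""
    values = []
    for c in cells:
        if c is None:              # empty cell: not counted at all
            continue
        if isinstance(c, bool):    # logical value
            values.append(1.0 if c else 0.0)
        elif isinstance(c, str):   # text entry counts as zero
            values.append(0.0)
        else:                      # ordinary number
            values.append(float(c))
    return sum(values) / len(values)


# Hypothetical column: three measurements, a text placeholder and a FALSE flag.
column = [10, 20, "n/a", False, 30]

print(average(column))   # 20.0  (60 / 3 numeric cells)
print(averagea(column))  # 12.0  (60 / 5 counted cells)
```

Both results look plausible on their own, which is why this kind of slip tends to pass inspection unnoticed: nothing fails, and the lower mean is only wrong if you know the placeholder and the flag should not have been counted.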
