The Bloor Group
THE BIG DATA INFORMATION ARCHITECTURE
An Analysis of the Consequences of the Big Data Trend

Robin Bloor, Ph.D. & Rebecca Jozwiak

RESEARCH REPORT


"We are drowning in information, and starving for knowledge."! ~ John Naisbitt

What's With All This "Big Data"?

The Babylonians who walked the earth in 3800 BC – nearly six thousand years ago – took a regular census. They didn't just count people; they also counted livestock and volumes of commodities such as wool. Clay tablets were their means of recording data, and their CPU was an abacus. No doubt at that time, a census was big data indeed.

"Big Data" is why computers exist. Whether we consider the U.S. census of 1890, which was processed by Herman Hollerith's famous tabulating machine, or the code-breaking computers of World War II, which leveraged parallel computation – "Big Processing" – computers have continually evolved to better manage data.

This frames the two main dimensions of large workloads: either they involve sifting through a great deal of data, or they involve doing a large amount of processing. In reality, Big Data is a poor description of this computing duality, but it is the one that has captured the headlines, so it is the one we have to use.

The IT industry generates and harvests more data every year. It has been that way from the beginning. Roughly speaking, data grows at about 55% per year. If you do the math, this means that data volumes grow by about 10x every six years or so. This increase is suspiciously in line with Moore's Law, which has delivered a 10x increase in computer power every six years since Gordon Moore made his wonderful and surprisingly accurate observation.

On one hand, the capacity of the technology increases; on the other, it gets used. We might thus conclude that what is now happening is just "same old, same old," but in fact this is not true at all.

To see why this is so, we need to take a broad look at what has happened in computing in the past, and what is happening now.
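As a quick check on that arithmetic, the short Python sketch below (our own illustration; the 55% growth rate is the figure quoted above) computes how long such growth takes to deliver a tenfold increase.

import math

annual_growth = 0.55  # 55% per year, the growth rate quoted above

# Years needed for volumes to grow by a factor of 10 at that compound rate.
years_to_10x = math.log(10) / math.log(1 + annual_growth)
print(f"Years to grow 10x at 55% per year: {years_to_10x:.1f}")
# Prints roughly 5.3 - in the region of the "10x every six years or so" quoted above.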

The Technology Curve and its Demise

The evidence suggests that since about 1960 the IT market has been expanding at a dramatic but nevertheless predictable rate. The expansion has been dramatic because it has been exponential rather than linear. As human beings we are comfortable with the idea of linear growth; we can represent it as an even upward slope that yields a predictable improvement every year. We are less comfortable with predictable exponential growth. Even though the improvement is regular, we tend to underestimate its impact.

It was in an effort to capture exponential technology improvement that we came up with a graphic representation of it, shown in Graph 1 on the following page. It is a fairly complex graph, which illustrates in a general way the response time of computer applications plotted against the IT workload they present.

The vertical axis is logarithmic, meaning that each unit (marked in black) represents 10 times the previous unit: i.e., 0.01 seconds, 0.1 seconds, 1 second, 10 seconds, 100 seconds and so on. We could have extended the graph (and hence the area labeled real time) below the 0.01 second line, but we have chosen to truncate it there.

The horizontal axis is not logarithmic. No specific units are shown for workload because there is no obvious way to metricate the workload of an application. Sometimes applications take a long time because the CPU is busy, sometimes because a great deal of data is being accessed, and sometimes because network latency adds to the delay. The use of resources varies.


Graph 1: The Dynamics of Technology Change (Source: The Bloor Group). The graph's labels include "Application Migration" and "The Area of As-Yet-Unrealized Applications."

To be clear about the terms we are using:

1. Response time: This is the time interval between a user initiating something and the computer responding and completing it. If you click the mouse on a button in a window, the computer's response time is the time taken to carry out the command executed by the button. The button might cause a query to run against a large database, or it might just make a menu drop down. Regardless, the definition of response time is the same. But having noted that, let's also point out that some computer applications are "conversational" (i.e., you do something, then the computer does something, then you do something, and so on) and some are fundamentally one-shot events (e.g., you tell the computer to print a photo album). This is an important distinction.

2. Workload: This is, as we have already indicated, a rough measure of the computer power required to run the application for the individual user and provide the response time indicated on the vertical axis. It is not the total amount of computer power a database uses to satisfy many users; it is the computer power required to satisfy just one query for one user on a particular volume of data with a given response time. It includes all the resources of the disks, the memory, the CPU and the network that would be required to deliver that response time.


Now look at the lines that are drawn for each decade since 1960, when the commercial computer business first took off. Naturally, each indicates that the higher the workload, the longer the response time. The areas beneath and to the right of each line indicate (for each line) the areas of application that were simply not possible at the time due to lack of computer power.

The colored areas of the graph represent ranges of response times relevant to computer users:

• Slow batch: Range 4 hours to 1+ days. Here the workload takes so long that the associated operational processes have to be planned in a very organized way. It is difficult to be productive with such very long response times. With very large data heaps, even nowadays a single query can take a day or more.

• Medium batch: Range 15 minutes to 4 hours. Here the workload is still ponderously long and impedes the business process. Typical applications in this area are scientific (compute intensive) and business intelligence (BI) related, such as some ETL processes that load large amounts of data into data warehouses. We use the term "batch" for these workloads because it is usually best to simply batch such workloads together and run them serially until complete – in the way that most mainframe jobs used to run until transaction processing became a reality.

• Fast batch: Range 15 seconds to 15 minutes. When we get to this level of response, it is easier for users to arrange their work around the computer response by organizing their manual activities to allow for the relatively short delay. Fast batch applications can be, and often are, under user control. Software developers through the 1990s grew accustomed to arranging their work in this manner, testing one program while another was compiling.

• Transactional: Range 1 second to 15 seconds. Transactional is where you enter information onto a screen and then press "Enter" to commit the transaction, such as when you buy an airline ticket on a web site. The comfortable range for transaction response is within 4 seconds. Nowadays a web site will put up a spinning wheel and a message to say "we are dealing with your transaction, please wait" if the delay is worse than that. In the early days of transaction processing it was generally acknowledged that a response time worse than 15 seconds was untenable, less than 4 seconds was desirable and less than a second, ideal.

• Interactive: Range 0.1 second to 1 second. Here we are dealing with applications where the interface itself is truly interactive (computer games are a good example). Normally, the fastest that human beings can respond to a stimulus is about one tenth of a second. That is about the time a baseball player at bat has to react when the pitcher throws the ball. So that speed of response feels "synchronous." It feels less so as response time increases. If responses go beyond one second then the added latency is obvious and, for some interactive applications, untenable. Interactivity began with the advent of the PC and became truly interactive with the advent of the GUI, where a sub-second response is absolutely necessary.

• Real time: Range less than 0.1 seconds. We are slightly abusing the meaning of real time here, since the term normally refers to computing activity that must meet a timed deadline. We take it to mean computing activity that goes beyond the best human response speed. Using that definition, this response time is for computer-to-computer applications, such as automated trading.
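For reference, the small Python sketch below simply encodes the band boundaries listed above as a classification function; it is an illustration of the taxonomy, nothing more.

def response_band(seconds: float) -> str:
    """Classify a response time into the bands described above."""
    if seconds < 0.1:
        return "real time"
    if seconds < 1:
        return "interactive"
    if seconds < 15:
        return "transactional"
    if seconds < 15 * 60:
        return "fast batch"
    if seconds < 4 * 60 * 60:
        return "medium batch"
    return "slow batch"

print(response_band(0.05))   # real time
print(response_band(3))      # transactional
print(response_band(3600))   # medium batch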


Finally, we need to explain the large arrows that indicate the migration of applications from one classification to another. They simply indicate that specific business applications have a tendency to migrate over time from slower classifications to faster ones.

In reality, this would be best illustrated as an animated graph. The first line (labeled 1960) should be envisaged as expanding forward in the direction of the dotted arrow toward the 1970 line and then toward the 1980 line and so on, continuing to expand outward. At the same time, behind that line the large white arrows should be continually moving down and to the left, indicating perennial migrations from one application classification band to another. In effect, that is what has happened since about 1960, and that is what we expected to continue to happen from 2004 onward, when we first drew this graph. And it did for a while.

And then something changed.

The Surprise of 2013

We had gradually come to expect that every now and then a new technology would emerge in a new application area because of this constant exponential increase in computer power. And for a while this is what we witnessed. The complex event processing (CEP) market suddenly emerged, based upon the intelligent use of technology to process information streams. The 10x increase in CPU and memory speed that Moore's Law delivered thrust it into existence. The same could be said of VMware's virtual machines: the growth of computer power made such a development possible. When you examine the technical performance, it is clear that the 10x power in 6 years that Moore's Law described made the technology possible.

But in 2013 we noticed a distinctly different pattern emerge, particularly in the field of Big Data. We encountered some technologies (ETL, large scale queries, analytics) that were capable of running more than 100 times faster, and even 1000 times faster, than what came before. This was in obvious violation of what we had come to expect in the technology market, and it required an explanation.

The Great Disruption

The acceleration we noticed was hardware based. The root cause was simply that by 2004 it was no longer possible to increase CPU speeds by ramping up the clock speed, so chip vendors chose instead to add more processor cores to the chip. This made the CPUs more powerful, but it also meant that if you wanted to use that extra power you would have to use parallel programming techniques. There were no good software solutions for this and it took a long time for such tools to arise, but eventually they did.

Thus the cause of the disruption that we began to observe was parallel processing. Software began to emerge that could spread its workload over tens, hundreds or even thousands of processors. And CPUs were still getting faster. But that wasn't the whole picture. The situation was even more disruptive because change was afoot throughout the hardware layer. It is worth briefly noting all such changes here, because each one of them has consequences:

The Cloud, as Infrastructure

Aside from the fact that there can be cost advantages in using cloud infrastructure, the cloud has collapsed the time taken to deploy commodity servers. It is possible to deploy such servers in a matter of minutes.

This makes it possible to prototype software of almost every kind more swiftly, even where it might not make sense to run operationally in the cloud.

Networking

The advent of virtual networking combined with very high network speeds (provided by Cisco, Juniper, etc.) has made networks far more flexible than they previously were. Historically, software architects have ignored network hardware, regarding networks as a fixed constraint. They are no longer such a constraint, and they are capable of carrying very high volumes of data if it makes sense to do that.

SSD and Disk

Disk is dying. It may die quickly in the coming years, depending on price factors and the difficulty of replacing it with solid state disk (SSD). Samsung, the world's largest provider of flash memory, declared last year that it expects flash to follow Moore's Law over the next 6 years. If it does, SSD, which is already about 10x the speed of spinning disk and is falling in price, will soon make spinning disk irrelevant. Another straw in this wind is that IBM has invested heavily in SSD technology and appears to be trying to evict the disk from the data center without damage to its own significant storage business. Other storage vendors will inevitably follow suit.

Memory

Memory prices have been falling fairly dramatically year over year since the 1970s. They suddenly halted their descent last year for several exceptional reasons: a fire at SK Hynix's fabrication plant in China, an earthquake in Taiwan and the decline of the PC market, which has moved some manufacturing away from commodity DRAM. We can expect memory prices to soon resume their decline. Historically, memory speed improvements never kept pace with CPU speed improvements, achieving only between 2% and 11% gains annually. Nevertheless, this was better than disk managed. Nowadays, the rule of thumb commonly quoted is that memory is about 100,000 times faster than disk for a random read. Effective caching – for example, by well-engineered database technology – reduces that advantage to about 3000:1, which is still very considerable. With the advance of SSD we may see the memory-to-disk speed ratio diminish, but it will always be large enough to consider how to exploit it.

The CPU

CPUs grow bigger in capacity (number of transistors on the chip) with miniaturization. This means that the chip makers can add more capability to the chip, either by adding more cache memory or by adding more processor cores. By 2012, Intel, which dominates the server CPU market, was producing x86 CPUs with up to 16 logical cores and considerable amounts of cache memory. On these chips there are three layers of cache:

1. L1 cache is very close to the processor core, with a capacity up to 32KB. It takes about 3 times as long to fetch data from here as when the data is already in the core, ready to be processed.

2. L2 cache is not so close to the core and has a capacity up to 256KB. It takes about 10 times as long to fetch data from here as when the data is already in the core.

3. L3 cache is shared between all processor cores on the chip. It is larger but slower still – the slowest of the three cache levels, though far faster than a fetch from main memory.
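The practical effect of that cache hierarchy can be sketched with a simple weighted-average model in Python. Only the 3x and 10x figures come from the list above; the hit rates and the L3 and main-memory costs below are illustrative assumptions of ours, not measured values.

# Relative cost of a fetch, in units of "data already in the core".
# L1 (3x) and L2 (10x) come from the list above; L3 and main memory are
# illustrative assumptions for this sketch only.
fetch_cost = {"core": 1, "L1": 3, "L2": 10, "L3": 40, "memory": 200}

# Assumed fraction of fetches satisfied at each level (sums to 1).
hit_rate = {"core": 0.50, "L1": 0.30, "L2": 0.12, "L3": 0.06, "memory": 0.02}

average_cost = sum(hit_rate[level] * fetch_cost[level] for level in fetch_cost)
print(f"Average fetch cost: {average_cost:.1f}x a fetch from the core")
# With these assumptions the average is about 9x - and it balloons if more
# fetches fall through to main memory, which is why keeping hot data (and the
# code that works on it) on-chip matters.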


Because of this hierarchy of speeds, there are some operations that are best handled in specific locations. Certain table handling operations are best performed on the chip because it is possible to use Intel vector instructions, which can process multiple values at the same time (in parallel). It may also be advantageous to carry out data compression and decompression on the chip where possible, because that will save memory. The point is that the CPU can be exploited in ways that were not previously possible, because it now has considerable capacity. And it will have more in time. It is estimated that the 14nm chip recently announced by Intel will be superseded by a 10nm chip in 2016 and a 7nm chip in 2018.

After that it may not be possible to miniaturize the CPU much further – it is hard to know for sure. But even if that is where it stops, there will probably be much more cache memory on the chip in 2018 than there is now.

System on a Chip Technology

There is growing enthusiasm for the possibilities of system on a chip (SoC) technology. The idea is simply to load a chip with every component of a system: one or more processor cores, memory, timers, external interfaces (for USB, FireWire, Ethernet, SPI, etc.), power management and so on. The point is that with current levels of miniaturization, there is now room on a chip for all of this. If you have a particular workload that can run very effectively in a parallel environment (Big Data workloads are obvious examples), then by threading together many such SoCs you may be able to build a scale-out environment that dwarfs the power of a network of commodity servers, because it directly targets the workload.

To make a thriving business of this, you need volume sales of the chip, but this seems possible or at least promising, partly because there are SoC designs, from ARM and others, that can be varied to purpose. Our current view is that if anything is likely to disrupt the market for commodity servers, it is commodity SoCs deployed in cloud configurations. It is possible that they will eventually provide more bang for the buck.

The Architectural Implications of Hardware Disruption

Technology vendors, often start-ups that detect a business opportunity, almost always respond to hardware disruption first. Sometimes they even anticipate change before it manifests.

Consequently, it is not difficult to identify vendors who foresaw some of these hardware disruptions and built products to exploit them. HP Vertica (founded in 2005) provides a clear example of this, designing and building a wholly new column-store database created to scale out across a grid of commodity servers rather than a big Unix cluster. Aerospike, another database company, but this time with an OLTP in-memory database, built its product both to leverage in-memory operation and to exploit SSD technology through parallel access to data. Both IBM and Actian provide examples of exploiting chip capabilities, since both Actian's Vector database and IBM's DB2 now exploit on-chip vector instructions for the sake of performance.

These examples are database products focused in one way or another on improving performance for particular workloads. However, the hardware disruption we are currently witnessing has a much broader impact than just the database industry. True, databases are not what they used to be; but more to the point, it is the data that is not what it used to be. The data has changed, and not just in terms of its volume, but in terms of its nature.
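Before turning to the data itself, the vector-processing point made earlier in this section is worth a small illustration. The Python sketch below uses NumPy, whose array operations hand whole columns to compiled, vectorized kernels rather than looping value by value; it is only a high-level analogy for the on-chip vector instructions that products such as Actian Vector and DB2 exploit, not the mechanism itself.

import numpy as np

# A column of a million prices, standing in for one column of a table.
prices = np.random.default_rng(0).uniform(1.0, 100.0, size=1_000_000)

# Scalar style: one value at a time.
total_scalar = 0.0
for p in prices:
    if p > 50.0:
        total_scalar += p

# Vectorized style: the filter and the sum apply to the whole column at once.
total_vector = prices[prices > 50.0].sum()

print(abs(total_scalar - total_vector) < 1e-6 * total_vector)  # same answer, far less per-value overhead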

The Event: The Atom of Processing?

Until it eventually became a cliché, Big Data marketing campaigns seemed to be driven by alliterations of the letter V: volume, velocity and variety, later to be joined by value and veracity. This was a pity because it was utterly misleading, but then the term Big Data was itself misleading.

A number of technology trends were clearly in play, which can be rationalized in the following way:

• The most far-reaching disruption at the hardware level was the advent of parallelism, which made it possible to process queries on much larger heaps of data than traditional relational databases catered for. One outcome of this was the scale-out column-store databases like HP Vertica and InfiniDB, and another was the genesis of Hadoop. So yes, processing much larger data volumes was entirely possible.

• Data-streaming (CEP) technology had become increasingly capable. Once focused almost completely on the financial markets, where such technology could sell for a high ticket price, it was now establishing a more general market, and yes indeed, the data might well arrive at speed, i.e., with significant velocity.

• Also, if you are accumulating a large heap of data, it may arrive in a data stream, but the problem then is to ingest the data quickly rather than to analyze it quickly, as it is with data streaming applications.

• The "variety" descriptor derived from the fact that some sources of data, particularly web data and social media data, were not conveniently structured in the rows and columns of a database, and yet they contained data that might be worth analyzing. Such data, generally referred to as "unstructured," does indeed have a structure, but not one that is convenient for analytical processing.

• Perhaps more troubling than any of the V-words was the E-word: "external." The point is that much (but not all) of the data that organizations now wished to process originated from outside the organization. This meant that most users of such data could have little influence over its structure or even its quality.

• That, by the way, is probably how the "veracity" word was dreamed up by some marketing team. But it isn't really veracity that is the issue; it is the quality of the data and its provenance that matters.

There is no mystery to the question of why organizations would gather any of this data. The value, if it exists, will be in the analysis of it and in the use of such intelligence to make better business decisions.

Looking at it in this way, we can easily imagine organizations gradually adding new sources of both external and internal data to the heaps of data they already analyze, and changing their systems to accommodate the extra data accordingly. Certainly many businesses can and will proceed in this manner. But we think doing so would be unwise, just as it was unwise to continue building centralized systems once the disruptive impact of the PC made client/server architectures possible. The same kind of radical architectural transformation is called for now, and it militates in favor of an "event-driven" architecture. Let us examine what such an architecture might be.


Figure 1: An Event-Driven Approach, in Overview

Figure 1 depicts an event-driven approach in a general way. Every executing process in which the organization has an interest is generating event data. Such a data stream is managed by a virtual traffic cop that directs it to the right location, possibly filtering or replicating the event data while doing so. Looking at it in this way, we can regard everything that is happening as generating or capturing events, and the whole corporate IT environment as an effort to manage and exploit the flow of data.

Transactions and Events

In the past we built systems that were transactional. A transaction was a change to an organization's data. Thus an order, or a delivery, or a payment constituted a change to data. Other transactions might be changes to customer addresses, or discount agreements, or credit limits. Our BI applications reported on the transactions of the organization. In terms of data, that was the level of granularity at which we worked.

It has gradually become necessary to think in terms of events. For example, when a customer makes a purchase on the web, each mouse click on the company web site is an event. Such data needs to be captured and analyzed:

• What provoked the visit to the web site (an email, a web advert, a Google search, etc.)?

• Which web pages did they visit on our site?

• What options did we present to them?

• How many times did they visit the site previously?

• What did they do on such visits?

So a customer purchase is no longer a simple transaction. In fact it is a series of individual events that led up to and included the transaction. And of course, we are interested in all events, whether or not they led to a transaction. We will compare this pattern with the pattern of events of other site visitors.

For every activity in which the business has an interest – hiring, marketing campaigns, procurement, customer service and so on – it has become possible to gather data on related events that link together through a timeline and lead to some outcome, good or bad.

There are several important things to note. Events always involve time information, and it must be captured. Events also have a context. They took place in a specific geographical or virtual location, and this dimension also needs to be captured. The event is the fundamental atom of a process, and we can analyze these atoms, discover patterns in them and exploit those patterns. The transaction wasn't an atom of processing at all. It was a molecule.

The view of data we had in the past was incomplete. We captured the transactions that happened, but had little knowledge of those transactions that never completed: the sales that nearly occurred and the customer complaints that were never made.

The Internet of Things and the Great Refinement

After the networking of the world by the Internet, two further steps would inevitably occur.

The first was the creation of Internet-enabled mobile devices, so that people could be connected while in motion, not just to each other but to the immense data and processing resources of the Internet. This added a geographical dimension to the Internet and hence to a good deal of data. Data coming from a mobile device has a location and time of origin which can be known and may be useful. The location also attaches a time and place to the person carrying the mobile device.

The Mobile Data Landscape

Unless you have seen the statistics, you probably do not have an accurate idea of the volume of mobile data. Here are some useful stats, courtesy of Cisco:

• Mobile data traffic worldwide grew 81% in 2013, from 820 petabytes per month in December 2012 to 1.5 exabytes per month in December 2013. One month's mobile data in that month was about 50% larger than all Internet traffic in 2000.

• Smart devices represented 21% of the total connected devices in 2013, but 88% of the traffic; hence, a smart device generates 29 times more traffic than a non-smart device. Mobile video traffic was 53% of the total by the end of 2013.

• About 526 million mobile devices and connections were added in 2013, bringing the total to about 7 billion, a rise of over 7%, with smartphones accounting for 77% of the growth.

Mobile devices, by the way, include laptops, tablets, smartphones, dumb-phones and the new and growing category of wearable devices, of which there were about 22 million by December 2013. The number of connected tablets (about 92 million) is poised to overtake the number of laptops (149 million), if it has not already done so, as the rate of tablet sales growth in 2013 was 220%, compared to laptop sales growth at less than 29%.

Not much of these many exabytes of data counts as Big Data, in the sense that the data cannot be profitably analyzed. First of all, most of the data traffic is transient: telephone calls that are never recorded, texts that are soon deleted, photos or videos sent and consumed but never saved, and so on.

Aside from this consumer data, there is a great deal of other data stored on mobile devices of which you may or may not be aware. This is mostly event data: data about your movements (every time you move the mobile device you generate an event), as well as the contacts you make and the interactions you have. This overlaps to some extent with the personal data that you deliberately collect (contact details, text messages, etc.), the extent of which you will quickly discover if you lose your phone and, for some reason or other, this data is not actually backed up.

But a great deal of this data is invisible, and it is gathered either by the phone itself or by the applications running on your phone. They gather such data because they can advantageously use it. This is the Big Data on your mobile device that actually does get analyzed and, right now, it is not measured in exabytes.

This is event data. Every time you do anything on the mobile device itself, or with an application on the mobile device, you generate events, and these events are recorded and eventually harvested for analysis. You could characterize such data as data about state and state changes, in respect of the device itself or an application running on that device.

Things: Dead or Alive

The much heralded Internet of Things (IoT) is going to generate exactly the same kind of data, and for the same reason. The mobile revolution was about devices people carry and use. The IoT will be about every other device, or even object, of which some organization or person might like to know the state.

So it includes mobile things like vehicles (skateboards, bicycles, cars, trucks, buses, ships, airplanes, etc.), machinery (dynamic machines and robots) and life forms (wildlife, pets or people). It also includes immobile things like infrastructure (buildings, roads, pipelines, factories, etc.), plant life (trees and crops), geography (land, rivers and seas) and meteorology (the atmosphere).

The likelihood is that, speaking in general terms, we will instrument everything it is economic to instrument, so that we can always be aware of its state and of any important changes of state that occur.

As regards data, the IoT will not generate much transient data. In most contexts, it will be clear whether it is worth trying to gather event data, and embedded sensors or chips will be installed accordingly in particular locations to generate useful data. The volume of data gathered will inevitably be "big." We already have the example of the four-engine jumbo jet, which currently creates around 640 terabytes of data on a single Atlantic crossing – and the IoT has only just begun.

The Refinement

Logic suggests the following scenario:

1. A specific thing (say a three-bedroom house) does not have any embedded sensors or chips in any of its components (walls, doors, floors, furniture, etc.) or utilities (electricity, telecommunications, air-conditioning/heating, water, etc.).

2. Sensors and embedded chips are added in all appropriate places to the benefit of the inhabitants of the house. The outcomes might include: a better controlled internal environment, lower cost services from the utilities, improved safety in every dimension from lower fire risk to less likelihood of burglary, improved health of those living in the house, and so on.


3. Analysis of the sensor and chip logs provides reliable data on exactly how every person in the house uses the rooms and facilities of the house. It becomes clear that the design of the house could be much improved in many of its aspects: the dimensions of rooms, the location of specific devices (stove, dishwasher, microwave, washing machine, etc.), air flow and heating, the disposition of electric sockets, the location of stairs, front door, back door, and so on.

4. Using the data gathered, builders can design much better houses, and they can be delivered already instrumented.

Step four is what we call the great refinement. Objective feedback is a very powerful thing. The use of embedded sensors and devices will, perhaps for the first time, provide objective feedback on how people actually use their various devices and the spaces they occupy. We have never had such objective feedback; we have depended upon the opinions of credible experts and thought leaders. They may have been exactly correct in the assumptions they made, and thus there may be little that can be improved, but this is unlikely.

It is far more likely that the collection of such unprecedented and accurate feedback will spark a significant amount of design rethinking, and it will change the world accordingly.

The Nature of Events

Aside from the fact that accumulating collections of events will inevitably lead to very high volumes of data, it is clear that the nature of event data is not the same as the data we have traditionally gathered, and it demands that we take a different perspective on it.

We first consider some fundamental aspects of an event:

1. Events always occur on a timeline. There is an immutable order to events based on time. One event, no matter what it relates to, always happens before, after, or at exactly the same time as another event.

2. Events always occur in a given place. It may be physical, such as within a car engine, or it may be virtual, such as a location on a web site. In the latter case, the virtual location is likely to be more important than the physical location.

3. Events always have a context, and an event may have multiple contexts. So when someone visits a web site, the event of clicking on a given link has a context in respect to that individual and his/her browsing activity, and it also has a context in respect of the web site and its activity. It will also have a context in respect to the connection between the browser and the web site, which will involve a series of networking events.

4. Events have attributes. It helps to think of events as specific records, like log file records, that contain the important dimensions of an event. A sensor in a pipe might, for example, simply report the rate of flow of the fluid through the pipe at a given time, or it might also report other dimensions such as the temperature of the fluid and the pressure in the pipe at that point.

An event has identifying information (time, location, context information) and also has attributes. You can choose to consider all data records as being either an event, an aggregation of events, or derived data.

In this, we need to distinguish between aggregations that are historical and aggregations of data derived in other ways. This is conceptual. If we choose to regard events as the atoms of processes, then we can define the following types of events.

Figure 2: The Customer: An Instantiation Event is a Historical Aggregation

1. Instantiation event. Let's say that a customer (person) comes into existence within the context of our organization's processes. That such an entity already existed (as a person) long before they interacted with our organization means that their instantiation event will be an aggregation of historical information. They have, for example, a date_of_birth and a gender which were conferred by events that are a long time in the past. They have an address, where they live, which they began to occupy at some point in the past. They have a credit_card_number from an account they obtained some time in the past, and so on. We illustrate this roughly in the above figure. We have categorized educational events, employment events, financial events and social events, and we indicate changes of location at specific points. Much of this information may not be useful and may not even be possible to obtain. When we register a new customer we will try to gather all the information that we consider useful, and each attribute we gather will refer to some event of some kind in the history of that individual.

2. A state report event. This is simply the data about the state of some entity at a given time. State events reported by sensors or by software in log files will be generated according to some simple or perhaps even sophisticated rule. A state report may be generated every second perhaps (even if no change of state has occurred), or may be generated when a state change occurs within a specific range (for example, temperature rises or falls by 0.1 degrees), or both (a report is produced at least once every minute but also every time the temperature changes by 0.1 degrees).

3. A trigger event. It is worth highlighting the idea of a trigger event, one that causes other events to occur. Thus, in respect to state, there may be a threshold where, if a given value rises above a particular value, an action is automatically generated, which means that other events will occur. An obvious example is a purchase on a web site. Once all the required customer and financial details are entered and the "confirm" button is clicked, various payment and dispatch events – perhaps a whole cascade of events – fire off. Trigger events are the events that give rise to actions or transactions.


4. A correction event. This is an important and, in respect to its ramifications, a potentially disruptive event. Consider the situation where, for whatever reason, it is discovered that one or more event records contain incorrect data (perhaps a sensor has been malfunctioning or a software error has been discovered) and a correction needs to be made. The correction event must not only register what the correct data is, but must also record the time at which previous event records are known to have become erroneous. The possibility is that inappropriate actions have occurred because of erroneous data. How such corrections are recorded depends on the circumstances.

We can think of all data as being either event data or derived data. Derived data is data that has been calculated or aggregated from a collection of events. Thinking in this manner can be helpful, because it allows us to consider the life cycle of data with a very fine level of granularity. For example, a customer record (derived data) is formed by a gradual accumulation of events that change the current values of some of the customer attributes to bring the record to its current state. All the events that created it form an audit trail of how the current state of the customer record was derived. This provides a lineage for the customer record.

With processing of events, data lineage can be traced, and thus the life cycle of any data entity or collection of data can be known and analyzed. This has ramifications for data analytics in respect to knowing the provenance and reliability of data, and it may also be useful in preventing or detecting fraud.

Hopefully we have provided enough information here to suggest that designing, modeling and building event-based systems is slightly different from building traditional transactional systems. This will be the case both in the initial processing of events and in the analysis of events.
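To make the idea concrete, here is a minimal Python sketch of an event record and of a customer record derived by folding events in time order. The field names and helper function are our own illustration, not a schema from this report; note that the event list itself serves as the lineage of the derived record.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    """An event record: identifying information plus arbitrary attributes."""
    occurred_at: datetime            # events always sit on a timeline
    location: str                    # physical or virtual place of origin
    context: dict[str, Any]          # e.g. session id, device id, sensor id
    attributes: dict[str, Any] = field(default_factory=dict)

def derive_customer_record(events: list[Event]) -> dict[str, Any]:
    """Fold a customer's events, in time order, into the current (derived) state.
    The ordered event list is the audit trail of how that state was reached."""
    state: dict[str, Any] = {}
    for e in sorted(events, key=lambda e: e.occurred_at):
        state.update(e.attributes)
    return state

events = [
    Event(datetime(2014, 1, 5, tzinfo=timezone.utc), "web", {"session": "a1"},
          {"email": "pat@example.com"}),
    Event(datetime(2014, 3, 9, tzinfo=timezone.utc), "call-center", {"agent": "42"},
          {"address": "12 Elm St"}),
]
print(derive_customer_record(events))  # current state; full lineage remains in `events`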

Big Data and Hadoop

The move to Big Data-oriented businesses began over a decade ago with Yahoo and Google. To support their fast-growing operations, both companies had to manage extremely large volumes of data. Neither company wasted time investigating the database technology of the day – it clearly couldn't handle their volumes, and it was clumsy or worse for unstructured data, which they needed to process effectively. Consequently, they "rolled their own" technology.

Among other technologies, Google created Google MapReduce, a software framework that implemented parallelism. The goal was to scale out dramatically by having software that would run reliably on a network of ten to over a thousand computers. Because Google chose to patent its framework, the open source software pioneer Doug Cutting decided to develop a similar framework. This attracted the attention of Yahoo, which hired Cutting. Not long after that, the Apache Hadoop open source project was born, with Yahoo engineers investing a good deal of their time in the project. By 2008, the open source Hadoop was available.

In our view, Hadoop is a fundamental component of Big Data architecture.

The Dynamics of Scale-Out

Divide up a software task effectively between 10 computers and it will run close to 10 times faster. On 100 computers it will run almost 100 times faster. In the 50 years of computing before the year 2000, very little effort was put into building software that could work in such a parallel manner. There were exceptions, of course. Some software for scientific computing (referred to as high performance computing) was written in parallel, and the GPUs that render visual images for display on PCs ran software in parallel. But normal commercial computing neither knew of nor cared about parallelism. Even commercial databases, which were engineered for speed and hence did tip their hats to parallelism, were written only to scale out on relatively small clusters of computers. They were not written for extensive scale-out.

Scalable technology began to blossom before Hadoop leapt into the picture. Purpose-built solutions such as Greenplum (2003), HP Vertica (2005) and Infobright (2005) emerged. They harvested customers from companies that were bumping up against the performance limitations of traditional relational databases. Quick on their heels, various NoSQL database products followed suit, including MongoDB and Apache Cassandra, both of which began as open source projects and morphed into commercial products from 10gen (now MongoDB) and DataStax. The NoSQL databases attacked a different weakness of traditional databases: their awkwardness in dealing with data that didn't fit conveniently into tables. And they were also built to scale.

These products gained traction because there was a genuine need for them. They were naturally sought out for applications with large data volumes – applications that were expensive in terms of time, resources and expertise.

At the same time, the x86 chip family started to add more processor cores to the CPU with each new generation of chips. Suddenly commercial hardware itself militated in favor of parallel processing. Commodity hardware was finally ready for it.
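The "close to 10 times faster" qualifier can be made concrete with Amdahl's Law – a standard result, not something this report derives: the serial fraction of a job caps the speedup no matter how many servers are added. A minimal Python sketch:

def amdahl_speedup(parallel_fraction: float, servers: int) -> float:
    """Speedup on `servers` machines when only `parallel_fraction` of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / servers)

for servers in (10, 100, 1000):
    print(servers, round(amdahl_speedup(0.95, servers), 1))
# With 95% of the work parallelizable: ~6.9x on 10 servers, ~16.8x on 100 and ~19.6x on 1000 -
# "close to" linear only while the serial 5% remains negligible.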


Parallel software technology, which had once been a niche capability that few developers cared much about, moved to center stage. Thus Hadoop, a scalable file system with a parallel capability that happened to be open source, began to capture attention.

Hadoop: Why and Why Not?

In its initial release, Hadoop comprised a fail-safe indexed file system (called HDFS) married to a parallel execution framework (called MapReduce). It was attractive for several reasons. As it was open source, it could be tried and prototyped at little cost. It could store and search through very large volumes of any kind of data reasonably quickly. It was a scale-out key-value store (the kind of file organization that was once known as ISAM), which meant that data could be stored in HDFS without the need to do any kind of data modeling. Any kind of data that had, or could be allocated, a key could be captured. It had redundancy built into its operation, so if almost any server failed, the job Hadoop was running would be immune to the failure. That was the upside.

However, there were disadvantages. Hadoop jobs ran in a serial batch manner, so one job had to finish before another could begin. Hadoop's only parallel development capability was via MapReduce, and it wasn't easy for programmers to become productive with it quickly. Most importantly, while Hadoop delivered a parallel capability, it was by no means an optimized parallel capability and, in fact, it performed quite poorly. Finally, Hadoop's "high availability" was designed for running on very large numbers of commodity servers, but each new generation of x86 chips had increasingly more cores on each chip.

Pretty soon it became obvious that few businesses would ever need to deploy thousands of servers for specific large workloads when x86 chips had 16 or more processor cores. Hundreds of servers perhaps, but even that would be unusual.

In essence, Hadoop was designed to scale out, but not to scale up.

The Juggernaut: The Hadoop Ecosystem

The most important aspect of Hadoop – important because, in our view, it overrides almost everything else – is that it has spawned a considerable software ecosystem. Software ecosystems are a force to be reckoned with. The seed at the heart of a thriving ecosystem can have many weaknesses and deficiencies, but because of a general belief in what that seed can bring to the table, an ecosystem emerges and products appear that remove the deficiencies and compensate for the weaknesses.

Examples of previous technology ecosystems include the IBM mainframe, DEC VAX, MS-DOS, Windows, Solaris and Linux. Each of these had direct competitors of one kind or another, even if in most instances they outdistanced them quickly. Hadoop is unusual in that there is no direct competition. There is no other scale-out key-value store that anyone in IT has any interest in.

Hadoop quickly became the foundation of a major open source initiative which gave rise to a whole series of complementary software components, giving it a vast array of additional functionality. These components include:

• Pig, an analytical language

• HBase, a scalable NoSQL database that runs on HDFS (with ZooKeeper for coordination)

• Cassandra, a NoSQL database


• HCatalog, a metadata capability

• Hive, a data warehouse capability

• Sqoop, a data import/export capability

• Oozie, a workflow capability

Naturally, all of these components scale out reasonably well, but until very recently all were limited to running in a single batch queue. This meant that only one job could run at a time – no matter whether it wanted to access a single record or query terabytes of data.

The open source initiatives around Hadoop are, of course, only half the story. Some vendors, notably Actian with its Vector product and Calpont with InfiniDB, have ported their databases to work directly on the Hadoop HDFS. The ETL vendors have accommodated Hadoop en masse, not surprisingly since ETL is currently one of its primary applications. RedPoint Global has delivered a data management capability to Hadoop. Datameer, Alpine Data Labs and Splunk are providing analytics on Hadoop. Teradata, IBM and HP all have Hadoop enablement strategies in place, as do Oracle and SAP. And there are many other vendors, too numerous to list, that we could also mention.

A Useful Parallel

There is a parallel between the ongoing evolution of Hadoop and the rise to prominence of Linux. They can be thought of as open source cousins of a kind. Linux was a Unix-derived operating system that was stable, practical and not proprietary. It generated an ecosystem of complementary open source software products. Its context was as an operating environment, like Windows, suited to running desktop or single-server applications. It achieved enterprise credibility primarily, in our view, because of investment by commercial IT vendors, particularly IBM. It was reliably supported and gradually proved itself in a variety of useful roles: running web servers, running domain name servers, as a file server, running email, as a PC OS, as a database server, as a middleware server and eventually, in the enterprise, as a server for the full gamut of business applications.

We see a similar pattern emerging in the evolution of Hadoop. At first, we saw enthusiasm from technical developers, followed by significant levels of adoption boosted considerably by the open source nature of the product. Then we saw the emergence of a commercial ecosystem and finally significant investment, this time highlighted by Intel's big $750 million investment in Cloudera.

Hadoop's march to prominence and dominance has been faster than that of Linux, partly because Linux defined the pattern by which open source products can succeed and partly because Hadoop has no direct competition. Just as Linux first established itself as a web server, Hadoop first established itself as a data collection system and, as a natural consequence, an ETL environment. We expect its areas of application to spread more broadly than that, and ultimately it may dominate the whole of the data layer within a corporate environment.

Hadoop Maturity: Hadoop 2.0 and its Consequences

With the advent of Hadoop 2.0 in August 2013, it was generally realized that the Hadoop environment had matured significantly. It was a major release. Aside from some useful enhancements, including the elimination of the NameNode as a single point of failure, Hadoop 2.0 decoupled MapReduce from HDFS and introduced two new components: YARN and Tez.
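For readers unfamiliar with the programming model under discussion, here is a deliberately tiny, pure-Python sketch of the map/reduce pattern (word counting, the canonical example). It illustrates the style of the model only; it is not Hadoop's actual Java API, and the shuffle/sort step that Hadoop performs between the phases is collapsed here into a simple in-memory grouping.

from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (key, value) pair for every word."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each key (grouping done in memory here)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big processing", "data about data"]
all_pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(all_pairs))  # {'big': 2, 'data': 3, 'processing': 1, 'about': 1}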

Eliminating the marriage of HDFS to MapReduce was a welcome development. MapReduce has its uses, but it is often an inappropriate processing model. This is particularly the case in respect to data queries. If MapReduce were an optimal algorithm for querying data, then most databases would have employed it long ago for managing query workloads and, as far as we are aware, none ever did.

YARN is a scheduling capability that allows multiple jobs (or workloads) to run on HDFS concurrently. YARN makes it possible for Hadoop to support multiple queries running against the same data at the same time, which is the typical pattern of database workloads and may become the typical pattern for analytical workloads.

Finally, Tez enhances MapReduce's capability, allowing it, for example, to run small queries against very large data volumes efficiently.

The Importance of YARN

YARN lays the groundwork for HDFS to become a true scale-out networked file system with few obvious limits. Its impact is transformative. Prior to YARN, Hadoop was a MapReduce platform. With YARN, MapReduce became a sideshow. The added functionality that YARN delivers provoked a number of vendors, including Actian and Teradata, to announce new Hadoop capabilities. In effect, it instantly paved the way for a far richer Hadoop ecosystem.

With the addition of YARN, we expect Hadoop's file system (HDFS) to gradually become the industry-standard file system for the data layer, both in data centers and in the cloud. Conceptually, the point is this: each server in the data layer will have a local OS (most likely Linux), but Hadoop, along with YARN, will be able to act as the software that coordinates workloads spanning grids of servers. It can, and we believe will, become the OS for the data layer.

As such, it will likely become a vital component of any Big Data Information Architecture.

Is Hadoop Enterprise-Ready?

The short answer to this is "No, but it is close." Hadoop lacks data security features, but there are products that can fix that. It lacks system management capability, but there are products that can fix that. It lacks a good SQL interface for data, but that can be fixed. It is not good at metadata management, but that can be fixed, too. It does not perform well, but it can be made to perform well.

The point is that pretty much every operational Hadoop weakness can now be covered if you choose the right complementary products and components. And of course, the pressure for Hadoop to be enterprise-ready is in full force. According to Hortonworks, Hadoop is now installed in almost every Fortune 500 company, and many others besides. Right now most of the usage is experimental (proof-of-concept and pilot projects), and credible estimates suggest that perhaps only 10-15% of organizations with Hadoop are using it in production.

In a recent survey of 158 corporate executives from a spectrum of industries, carried out by 1010data, Inc., the most common complaint about Hadoop, from roughly 70% of respondents, was in the area of security and reliability, with about 65% also complaining about the expense associated with implementation and maintenance. There is also a shortage of skilled Hadoop developers, and that may act as a brake on Hadoop usage for a while.

A Data Flow Architecture

From the outset of our research effort, we quickly concluded that a scale-out file system would inevitably become a fundamental component of the Big Data Information Architecture. As Hadoop has such remarkable momentum and is the only scale-out-file-system game in town, we believe it will dominate the IT landscape in the coming years. There is a very genuine need for the technology.

The Pattern of the Past

For many organizations, the multitude of business applications and the associated data management and BI capabilities carve out a familiar pattern. Transactional applications use traditional relational databases to store and manage transactional data. When data is shared between transactional systems, it is replicated in some way to keep source data consistent. The data from transactional systems is siphoned off by software that will extract, cleanse, transform and store it in a staging area.

From there it will be loaded into a data warehouse that was designed to accommodate a mix of query traffic to serve a variety of BI applications. Subsets of the data may subsequently be offloaded into data marts or, in some cases, desktop databases for various types of BI reporting or for data analysis. When its usefulness expires, data will either be archived or simply deleted.

This practice has become outmoded. It has been disrupted by the following factors:

• The universe of corporate data – the data that a corporation can exploit – has grown. It now includes many external sources of data: supply chain data, social media data and data from other private or public sources in the cloud.

• Scale-out technology has made it possible to store and report on corporate data that was previously ignored. This is particularly the case with the log files of network devices, operating systems, databases and applications, and particularly web logs.

• Data streams from many data sources (social media, news and weather, markets, etc.) are easy to access via web APIs, and data streaming services, free or otherwise, are proliferating. This poses insuperable problems for the old data warehouse arrangement, both in terms of the speed of data arrival and the format of the data, much of which is unstructured or semi-structured.

• The aging database technology that was used to build data warehouses was not engineered for scale-out parallel operation. As a consequence, the increase in data volumes alone challenges its suitability for the new universe of corporate data.

• Much of the data that organizations wish to use is unstructured, in the sense of not being conveniently described by traditional metadata. To add to this, there is a new generation of NoSQL databases which is more effective in processing some of this data (including graph databases and document databases).

The immediate importance of Hadoop is that it can be used conveniently as a staging area for capturing and storing some of the data that cannot, for various practical reasons, be included in the data warehouse. As Hadoop scales out indefinitely, it doesn't matter too much how big this staging area gets, so very large collections of data can be gathered if there is a need. If we think in those terms, then we can imagine an arrangement of the data layer which looks very much like the one illustrated in Figure 3.


Figure 3: Hadoop in the Data Layer

Figure 3 shows the typical data warehouse arrangement augmented by Hadoop.

Business (transactional) applications run on what we have labeled Legacy DBMS (traditional relational databases, mainly). Data is gathered from these databases using traditional ETL jobs to transfer appropriate data into the data warehouse. A number of local applications (labeled DW Apps) run query workloads against the data warehouse. Some will be BI apps with access to the data warehouse that maintain, for example, BI dashboards.

Our illustration shows, simplistically, the inner workings of the data warehouse. Unlike Hadoop, the data warehouse is a data query engine tuned to optimize the multiple concurrent queries that regularly request data from within it. This is its workload. We have illustrated the inner workings of the database as a shared-nothing scale-out environment, which shards the data it presides over onto multiple servers and distributes queries across those servers to gather the requested data.

If we add Hadoop into this environment, it clips into the data layer quite simply. Data flows into Hadoop and is refined so that it can be moved to the data warehouse. At the very least, some data cleansing and metadata definition will be done so that, via some Hadoop ETL job, data can be transferred into the data warehouse. There will also be local Hadoop apps.


A Change of Paradigm

Our illustration suggests that we might be able to accommodate Big Data, and all that it means, simply by adding Hadoop as a data dump that feeds the data warehouse. We do not believe that to be a credible course of action. In particular, it is our view that the extreme level of hardware disruption we are currently experiencing has such a potentially dramatic impact on response times, particularly for "slow" batch jobs, that we need to rethink the whole data layer.

We previously constructed the corporate data layer by identifying sensible places to locate collections of data and inserting appropriately configured databases in those locations. We thus tended to give specific OLTP apps or application suites an OLTP database, and we configured a large query engine, a data warehouse, to act as a concentration point for structured corporate data. This would be appropriately located so that it could be fed by pipelines of data from the OLTP apps. The BI apps would either feed directly from the data warehouse or, because there was a limit to the workloads it could support, they would feed from data marts that were extracted from the data warehouse.

Data marts were thus data depots for locating data that was offloaded from the data warehouse because the warehouse was incapable of supporting the full query workload.

It could also be the case, depending on circumstance, that companies would, for the sake of timeliness, choose to build what were called operational data stores (ODS): databases that integrated data drawn directly and swiftly from multiple sources, circumventing the data warehouse. The problem was that data could simply take too long to pass through the transformation and data cleansing routines to get into the data warehouse – and the ODS provided a quick and dirty shortcut.

In our view, a change of paradigm is now mandated. Its justification is simple. The speed of processing engendered by the combination of parallelism and the increasing power of computer hardware at all levels means that:

We should build systems to cater for data flow rather than data at rest.

Vendors like Teradata, for example, may argue that they have been oriented toward data flow for years, and this is indeed so in respect of their database architecture. However, here we are casting the net wider than the central database itself to cover the whole data layer.

There is an architectural point worth highlighting. The space occupied by executable software has never been particularly large, and the ratio of space occupied by data to space occupied by executable software has always been very high. Nevertheless, it continues to grow year after year as the volumes of data grow inexorably.

Indeed, even though network speeds and networking configurability have increased significantly in recent years, they have not kept pace with the growth in data. The volume of data has grown to a level where, in the vast majority of situations, the processing should always, if possible, be moved to execute close to the data it intends to process. Put simply:

Do not move the data unless you absolutely have to.
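To make that principle concrete, here is a deliberately simplified Python sketch (our own illustration, with invented class and function names): the aggregation is pushed to where the rows live, and only the small result crosses the network.

class DataNode:
    """A stand-in for a server that holds one shard of the data."""
    def __init__(self, rows):
        self.rows = rows

    def run(self, func):
        # Ship the function to the data: only the (small) result travels back.
        return func(self.rows)

nodes = [DataNode(list(range(i, 1_000_000, 3))) for i in range(3)]

# Anti-pattern: pull every row across the network, then aggregate centrally.
total_moving_data = sum(sum(node.rows) for node in nodes)

# Preferred: push the aggregation to each node and combine the partial results.
total_moving_code = sum(node.run(lambda rows: sum(rows)) for node in nodes)

assert total_moving_data == total_moving_code  # same answer, far less data in flight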


It might then seem that, if it is best not to move data, we should not be thinking of a data flow architecture at all. Nevertheless, we should, because the data is going to move anyway: not just from one storage place to another, but through memory and onto the CPU.

Data Flow: Acquisition, Refining, Processing, Shipping

Figure 4 illustrates in overview what we believe to be a rational Big Data Information Architecture or, if you prefer, a data flow architecture.

Figure 4: A Big Data Information Architecture in Overview

The first point to note is that it is an event-based architecture. The arriving data may not exclusively be events; it may include some fairly complex data items, and it may also include events that have been pre-aggregated. The data arrives from data streams, embedded processors (the Internet of Things), data sources in the cloud (social media, etc.), mobile devices, and desktops and servers within the corporate IT facility.

It arrives at what we have chosen to call a Corporate Data Hub. This is not, and probably cannot be, a single physical database, no matter how capable. It is most likely (for the foreseeable future) to be several data engines and/or storage capabilities.

There are two general activities that take place in the Corporate Data Hub:

1. Data Refining: As we shall explain later in more detail, this is a complex activity consisting of a variety of functions.

2. Local Workloads: These are workloads that run against the data stored in the Data Hub.

We could perhaps define a third activity: the management of the Data Hub itself. We will discuss this later, as it merits some elaboration. However, the main activities taking place on the Data Hub are those that prepare the data for its eventual use in various contexts, and the running of local workloads against the data that is available for processing.

An important concept to understand here is that we are (in theory at least) no longer constrained to move data around in the way that we once were. The assumption is that there are, and always will be from this point on, highly scalable engines that can scale up and scale out to provide sufficient computer power to process any workload.

The point is simply that we will not be exporting data to data marts, in the way we once did, for the purpose of providing adequate performance for specific workloads. The Data Hub and its associated software will provide the performance.

Let us first consider a "green field" situation. Imagine that we are able to start from scratch without any technology constraints imposed on us by existing systems. In such circumstances we would identify all sources of data that might be of use to our organization. We would, no doubt, implement various business software packages (on commodity servers) or possibly use known cloud capabilities for such business applications. Either way, we would capture all events from all such applications and funnel them into the Data Hub.

We would acquire (or even build) software to carry out various functions on the data as we imported it into the Data Hub. Conceptually, we would be refining data in order to be able to use it directly. Depending on the data source, this activity might be significant or it might not. (A minimal sketch of this ingest-and-refine flow follows below.)

All the BI and data analytics applications can run on the Data Hub. Additionally, all new transactional applications can also run here, and should if feasible. It is for that reason, incidentally, that the word "warehouse" is no longer appropriate for this large collection of data. The applications that will not be able to use the Data Hub are as follows:

• Packaged software that is not able to use the capabilities and API of the Data Hub.

• Software that has inappropriate operational characteristics; for example, software that has inadequate network bandwidth or security characteristics to use the Data Hub as a data server.

• A good deal of office and personal software apps. Such software also has inappropriate operational characteristics, but it is worth mentioning separately. Note, however, that it is entirely possible for the Data Hub to be used as a file server to back up and serve the files used by such applications.

If we now include the reality of existing legacy software of varied flexibility and age, including even an existing data warehouse, it may be that, for a variety of reasons (cost, platform, configuration, mode of operation, etc.), much of the legacy software is unable to use the Data Hub directly. If that is the case, it will be necessary to export data from the Hub for use by such applications, and to capture from such applications the data that the Data Hub itself requires in order to function optimally. This is illustrated under the Data Shipping section of Figure 4.
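The following is the minimal, purely illustrative sketch of the ingest-and-refine flow referred to above. The CorporateDataHub class, its event format and the refining functions are assumptions made for illustration; they do not represent any particular product.

    # Sketch: events from business applications are funnelled into the hub,
    # refined on import, and then served to local workloads.
    from typing import Callable

    class CorporateDataHub:
        def __init__(self, refiners: list[Callable[[dict], dict]]):
            self.refiners = refiners       # refining functions applied on ingest
            self.store: list[dict] = []    # stand-in for the hub's data engines

        def ingest(self, event: dict) -> None:
            for refine in self.refiners:
                event = refine(event)      # cleanse, tag metadata, etc.
            self.store.append(event)

        def local_workload(self, predicate: Callable[[dict], bool]) -> list[dict]:
            # A local workload runs against data held in the hub itself.
            return [e for e in self.store if predicate(e)]

    # Hypothetical refining functions.
    def normalize_keys(event: dict) -> dict:
        return {k.lower(): v for k, v in event.items()}

    def tag_source(event: dict) -> dict:
        return {**event, "source": event.get("source", "unknown")}

    hub = CorporateDataHub(refiners=[normalize_keys, tag_source])
    hub.ingest({"Type": "order", "Amount": 42.0})
    print(hub.local_workload(lambda e: e["type"] == "order"))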


The Two Data Flows

In Figure 5, we illustrate a Big Data Information Architecture in far greater detail. The main point that we wish to surface here is that once we enter an event-driven environment, the architecture naturally splits into two data flows.

The fact is simply that event data either needs to be processed at once or it does not. Event data may need to be processed immediately for one of two reasons:

1. The event stream needs to be analyzed immediately and continuously so that the application can respond to the data stream as it arrives. Apps that process events in this way (CEP and related apps) are labelled Streaming Apps in Figure 5.

2. Some events are simply triggers that provoke some action in some application. For this reason, we show events potentially being routed to any application in the Application Layer.

Figure 5 shows all events, whether they stream into the organization or emerge from any application or any device (mobile, desktop, server, networking device, etc.), passing through a Filtering, Replicating and Routing process.

Figure 5: The Data Refinery and Processing Hub in the Data Layer

This process is the traffic cop for all newly emerging data. It knows where to direct each event, whether to duplicate it (for example, an event could go to a Streaming App and also be stored within the Data Hub), or whether to simply delete it from the stream. The goal for this process is that it impose almost no latency on the movement of data. This is necessary because some real-time streaming applications may have service levels that demand very low latency. In other words, this process needs to be fault tolerant and extremely fast. (A minimal sketch of such a router appears below.)

In regard to the Application Layer, applications either use the Data Hub directly, sending data requests to the Hub, which become Local Workloads; or they use Data Marts (data extracts) created and refreshed by ETL or data virtualization software that accesses the Data Hub, as shown in Figure 5. The same ETL or data virtualization capability will also extract data from the Data Hub for export.

All data newly created by any app of any kind in the Application Layer is captured when created and passed to the Filtering, Replicating and Routing process.

Data Refining and Processing

The first point to note is that the Data Hub may have multiple engines. One useful way to think of this is that ingest into the Data Hub really is a refining process. First it is necessary to refine the data so that it is suitable to be processed. We discuss this process in depth a little later, but note here that when data arrives, it is not likely to be organized in the form best suited to its processing.

The most appropriate form will depend on what processing we expect to be applied to the data. For example, it may make sense to aggregate some events as a way of compressing data. It may make sense to calculate summaries. It may be best to store some data in a columnar manner. With certain data, it may be better to store it in rows or in a graph database. Some data may be best stored in multiple engines. The point is to organize the data optimally so that it is ready to use. (A sketch of this kind of format selection follows the router sketch below.)

And while we might be thinking in terms of data being stored on disk or SSD, some data may, because of high usage, be placed in memory immediately.

We might like to think of the Data Hub as a database, which it logically is, but physically it will almost certainly consist of multiple engines. In our view, Hadoop (HDFS) and YARN will constitute one of those engines. HDFS will almost inevitably be the place where data is refined. Other data engines (one or more) will be optimized for specific workloads. It is worth noting that some of these workloads will involve analytic calculations as well as requests for data.

Optimizing the performance of the Data Hub will thus be complex. It will involve optimization that is, in a general sense, unprecedented. We can think of it in the following way:

If data is available for processing but not stored optimally for its future use, and a query is received that touches that data, then the data will be read into memory.

However, it is very likely that the data will need to be read into memory in order to refine it anyway. It is thus the case that, with some data, the refining process is also the time at which the physical organization of the data should be determined. If it is not possible to do it then, it should be done the first time the data is touched.

In reality, the processing load in the Data Hub is fairly complex. This is illustrated in Figure 6, which has been created on the basis of what we regard as necessary functions. Note that some software products may be capable of carrying out several of these functions.
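The following is a minimal sketch of the filtering, replicating and routing behaviour described above. The rule format, destinations and event shape are hypothetical; a production implementation would be a fault-tolerant, very low-latency messaging component rather than plain Python.

    # Sketch: the "traffic cop" for newly arriving events.
    # A rule may drop an event, or fan it out to one or more destinations
    # (the Data Hub, a Streaming App, or any app in the Application Layer).
    from typing import Callable

    Destination = Callable[[dict], None]

    class EventRouter:
        def __init__(self) -> None:
            # Each rule: (predicate, list of destinations). No destinations = drop.
            self.rules: list[tuple[Callable[[dict], bool], list[Destination]]] = []

        def add_rule(self, predicate, destinations) -> None:
            self.rules.append((predicate, destinations))

        def route(self, event: dict) -> None:
            for predicate, destinations in self.rules:
                if predicate(event):
                    for deliver in destinations:   # replicate to every destination
                        deliver(event)
                    return
            # No rule matched: delete the event from the stream.

    data_hub: list[dict] = []
    streaming_app: list[dict] = []

    router = EventRouter()
    router.add_rule(lambda e: e["type"] == "sensor_reading",
                    [data_hub.append, streaming_app.append])   # duplicate
    router.add_rule(lambda e: e["type"] == "heartbeat", [])    # drop

    router.route({"type": "sensor_reading", "value": 21.5})
    router.route({"type": "heartbeat"})
    print(len(data_hub), len(streaming_app))   # 1 1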

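And as a companion sketch, here is one way to picture choosing the physical organization of data at refining time, as discussed above. The workload profile, thresholds and engine names are invented for illustration only.

    # Sketch: pick a storage organization for refined data based on how we
    # expect it to be processed. The decision rules here are illustrative only.
    def choose_engine(profile: dict) -> list[str]:
        """Return the engine(s) best suited to the expected workload."""
        engines = []
        if profile.get("analytic_scans"):        # wide scans, aggregations
            engines.append("columnar_store")
        if profile.get("point_lookups"):         # row-at-a-time access
            engines.append("row_store")
        if profile.get("relationship_queries"):  # traversals
            engines.append("graph_store")
        if profile.get("access_frequency", 0) > 1000:
            engines.append("in_memory_cache")    # hot data goes straight to memory
        return engines or ["hdfs_raw"]           # otherwise leave it in the refinery

    print(choose_engine({"analytic_scans": True, "access_frequency": 5000}))
    # ['columnar_store', 'in_memory_cache']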

Figure 6: Processes in the Corporate Data Hub

The various functions we have illustrated above are as follows:

• Data Security: All data security procedures, including encryption both while data is at rest and while it is in motion, and all roles and responsibilities in respect of access rights, need to be implemented from the moment that data enters the Corporate Data Hub.

• Data Cleansing: Naturally, the full gamut of data cleansing activity, from simple data correction through data deduplication and disambiguation, can and should be applied here.

• Metadata Discovery: The data entering the Corporate Data Hub needs, before it is available for use, to have its metadata defined (to a given standard). This may involve a variety of processing activities, from the application of standard patterns to known data sources, to the use of semantics to determine, as accurately as possible, the meaning of the data. Some instances may require human intervention. (A small schema-inference sketch appears later in this section.)

• Metadata Management: The Corporate Data Hub will naturally accumulate a metadata resource (repository) that will need to be managed at both the physical and the logical level. The activity here is primarily one of assembling metadata catalogs and/or taxonomies that can provide access to metadata both for software and for users.


• MDM & Business Glossary: Master data management (MDM) is a natural extension of metadata management. It is collaboration on the business meaning of data and business terminology, which may bring to light both terminology variances and data aliases. The goal is that data users (including software developers) can fully understand the data they have access to.

• Data Mapping: Ideally, it will be possible to assemble and maintain a full data map of all data that is of interest to the corporation: not just the data stored within the Corporate Data Hub, but also metadata maps of data sources, data exports and data in motion.

• Data Lineage: The provenance and lineage of all data need to be captured and maintained. This is of particular importance for analytics activities, since bad data can lead to wrong or inaccurate conclusions and actions.

• Data Life Cycle Management: Given the above set of information, it will become possible to proactively manage the life cycle of events and derived data, to the point of data being retired and, if justified, deleted.

• Performance Monitoring & Management: This can be thought of as the low-level management of the data engine(s) to optimize the performance of individual workloads and individual data engines.

• Service Level Management: This is traditional service level management applied to the Corporate Data Hub. It involves the scheduling of workloads against available resources in order to meet agreed and targeted service levels.

• System Management: This involves all other system management activities surrounding data flow, including fire-fighting, software management, IT asset management, network management and so on.

• ETL & Data Virtualization: This is the export of data from the Data Hub, both for apps within the corporate environment and for data customers elsewhere.

All but the last six of these functions concern data refinement. The critical ones are data security, data cleansing and metadata discovery, since these need to be applied before the data is usable. In some circumstances, when the metadata is easily discoverable, a schema-on-read approach may be viable (sketched below).

Metadata management, MDM and business glossary, data mapping and data lineage may be deferrable, and may form part of a corporate effort to manage the data resource and improve its usability.

It could be said that the Big Data world of metadata is like the old world of metadata, except that it has many more data sources, and, just as with data from packaged software, the business has no control over the data definitions of the new sources. We could even think of this as "big metadata." If Big Data means far more data than we had before, then big metadata means far more data sources.

In our view, data lineage is likely to become an increasingly important aspect of data management, and particularly of the management of the Data Hub, partly because of the increasing importance of data analytics and partly because we are beginning to witness significant growth in the market for data itself. It is going to become progressively more difficult to sell data at the best price without being able to declare its provenance and lineage.
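As a minimal illustration of the metadata discovery and schema-on-read idea mentioned above, the following sketch infers a crude schema from raw delimited records at read time. The inference rules and field handling are assumptions for illustration, far simpler than a real discovery tool.

    # Sketch: crude schema-on-read. The schema is inferred when the data is
    # read, rather than being defined before the data is loaded.
    import csv, io

    def infer_type(values: list[str]) -> str:
        """Guess a column type from sample values (illustrative rules only)."""
        try:
            for v in values:
                int(v)
            return "integer"
        except ValueError:
            pass
        try:
            for v in values:
                float(v)
            return "float"
        except ValueError:
            return "string"

    def discover_schema(raw_csv: str) -> dict[str, str]:
        rows = list(csv.DictReader(io.StringIO(raw_csv)))
        return {col: infer_type([r[col] for r in rows]) for col in rows[0]}

    sample = "order_id,amount,region\n1,19.99,west\n2,5.00,east\n"
    print(discover_schema(sample))
    # {'order_id': 'integer', 'amount': 'float', 'region': 'string'}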


This also explains, to some degree, the relevance of data life cycle management. When data becomes valuable, its inventory needs to be known and actively managed.

It may be obvious to the reader that performance management and service level management will be of critical importance across the Data Hub. This is no longer just about the individual performance of specific data engines, as it may be in a traditional data warehouse environment; it is about the service level of mission-critical systems.

The Data Hub will inevitably be mission critical in the full meaning of the term. As such, system management is as relevant here as it is to all other operational systems.

Finally, it is worth noting that ETL and data virtualization are about data flow. The data shipping done here may well be time critical and hence performance critical.
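To make the lineage and inventory point concrete, here is a minimal sketch of the kind of provenance record that might accompany each data set in the Hub. The fields and the derivation chain are hypothetical illustrations only.

    # Sketch: a simple lineage record attached to each data set in the Hub.
    # Each derived data set keeps a pointer to its sources and the step applied,
    # so provenance can be walked back to the original data.
    from dataclasses import dataclass, field

    @dataclass
    class LineageRecord:
        dataset: str
        produced_by: str                          # refining or processing step
        sources: list["LineageRecord"] = field(default_factory=list)

        def provenance(self) -> list[str]:
            """Walk back through the derivation chain to the raw sources."""
            if not self.sources:
                return [self.dataset]
            chain = []
            for src in self.sources:
                chain.extend(src.provenance())
            return chain

    raw = LineageRecord("clickstream_raw", "ingest")
    cleaned = LineageRecord("clickstream_clean", "cleansing", [raw])
    summary = LineageRecord("daily_click_summary", "aggregation", [cleaned])
    print(summary.provenance())   # ['clickstream_raw']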

The Big Data Architecture in Summary

We consider "Big Data" to be poor terminology. We prefer to think in terms of Big Processing rather than Big Data. For many decades, computer users profited from Moore's Law. Roughly speaking, it multiplied computer power by a factor of 10 every 6 years, and thus the IT revolution rolled forward at what seemed like a remarkable and perhaps unsustainable pace.

In fact, that pace proved sustainable, and with the advent of multicore CPUs, which in turn provoked the software industry to begin to embrace parallel processing, the pace quickened considerably. It quickened so much, in fact, that for scale-out server-based applications we believe that in the years 2013 to 2020 speed will increase by a factor of 1,000.

This, in our view, provides the foundation of a Big Data Information Architecture (BDIA) that is based on parallelism and will leverage that power.

In terms of hardware, we observe the following:

• Applications will be built to exploit hierarchical memory: on-chip CPU caches (L1, L2 and L3), DRAM and SSD (flash memory).

• Flash memory speeds are accelerating in line with Moore's Law.

• Spinning disk will soon disappear.

• It is possible that system-on-a-chip (SoC) technology will further disrupt the hardware layer.

In terms of data, we observe:

• The basic atom of processing (item of data) is the event. We will build applications and systems that are event-based. We identify four types of event: instantiation events, state report events, trigger events and correction events. Current data modeling methodologies will be enhanced to enable events to be included in data models. (A small sketch of these event types appears at the end of this summary.)

• We can view any processing entity (organization, business, individual, etc.) as processing a flow of events. Events will be aggregated to create more complex data entities, which can be thought of as derived data.

• Organizations now need to cater for external data. Some will need to cater a great deal for such data, from many data sources.

• In processing events and ingesting external data, there will need to be a focus on data refinement.

We felt it necessary, in researching Big Data, to spend significant time investigating Hadoop. Our conclusions are as follows:

• In its early releases (prior to 2.0), Hadoop may have been popular and excited the imagination of developers, but it had limited capability and applicability. This changed with the advent of Hadoop 2.0, which included the YARN scheduler and broke the link between HDFS and MapReduce.

• In our view, Hadoop's HDFS is already the default scale-out file system for IT. Its importance in this role is difficult to overestimate. Some software vendors are already building databases on top of it. There is a considerable need for this kind of capability.

• Hadoop itself has the possibility of becoming the operating system for the data layer if the sophistication of YARN gradually increases.


• In our view, Hadoop has a unique and important role to play in a BDIA. It will act as a complement to one or more purpose-built scale-out databases, which together will make up the Corporate Data Hub.

• It may evolve into this role from its use as a data staging area for a data warehouse.

In our view, the primary application for Big Data technology and architecture is data analytics, and it will remain so. In light of this, it is worth noting that data analytics is a process that mixes a mathematical workload with a data access workload. As yet, no databases, even very recently constructed ones, have been specifically built for this workload.

In our view, the BDIA is fundamentally a data flow architecture. Specifically:

• It needs to cater for data flows rather than be designed to process data at rest.

• Though it may seem paradoxical, it will be far better in most circumstances to move the processing to the data rather than move the data to the processing.

Our current model of a BDIA includes the following characteristics:

• The Corporate Data Hub, the heart of the BDIA, can be viewed as involving four activities: data acquisition, data refining, data processing and data shipping.

• It is an event-based architecture, which can be conceived of as founded on the receipt and processing of events.

• Events are initially handled by a filtering, replicating and routing process that directs them to the appropriate destination. All events are directed to the Data Hub; some may be replicated and directed elsewhere (to streaming apps or possibly other apps).

• Applications either use the Data Hub directly or use data extracts or data marts provided by the Data Hub. They may also have local data, but all relevant events that occur within the applications are directed to the Data Hub for storage.

• The data storage capabilities of the Data Hub include both a key/value store and database(s). The key/value store is where data refining occurs; the databases provide fast access to data. The Data Hub can be treated as a single logical data store even though it may be constituted from multiple data access engines.

The above conclusions were arrived at primarily by the researchers of this report, but also involved discussion and collaboration with IT users, IT vendors (both hardware and software vendors) and other analysts who work with The Bloor Group.
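As referenced in the summary above, the following is a minimal sketch of the four event types identified in this report, with a simple event record of the kind that could be included in a data model. The field names are illustrative assumptions.

    # Sketch: the four event types identified in this report, as a simple
    # enumeration, plus an event record that carries one of them.
    from dataclasses import dataclass
    from datetime import datetime, timezone
    from enum import Enum
    from typing import Any

    class EventType(Enum):
        INSTANTIATION = "instantiation"   # something new comes into existence
        STATE_REPORT = "state_report"     # a report of the state of something
        TRIGGER = "trigger"               # provokes an action in an application
        CORRECTION = "correction"         # corrects previously recorded data

    @dataclass
    class Event:
        event_type: EventType
        source: str                       # device, application or stream of origin
        payload: dict[str, Any]
        occurred_at: datetime

    e = Event(EventType.STATE_REPORT, "sensor-42",
              {"temperature_c": 21.5}, datetime.now(timezone.utc))
    print(e.event_type.value, e.payload)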

SPONSORS OF THIS REPORT INCLUDE:

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information. The Bloor Group is the sole copyright holder of this publication.

PO Box 200638 | Austin TX 78720 | 512-524-3689
