The Bloor Group
THE BIG DATA INFORMATION ARCHITECTURE
An Analysis of the Consequences of the Big Data Trend

Robin Bloor, Ph.D. & Rebecca Jozwiak

RESEARCH REPORT


"We are drowning in information, and starving for knowledge."! ~ John Naisbitt

What's With All This "Big Data"?

The Babylonians who walked the earth in 3800 BC – nearly six thousand years ago – took a regular census. They didn't just count people; they also counted livestock and volumes of commodities such as wool. Clay tablets were their means of recording data, and their CPU was an abacus. No doubt at that time, a census was big data indeed.

"Big Data" is why computers exist. Whether we consider the U.S. census of 1890, which was processed by Herman Hollerith's famous tabulating machine, or the code-breaking computers of World War II, which leveraged parallel computation – "Big Processing" – computers have continually evolved to better manage data.

This frames the two main dimensions of large workloads: either they involve sifting through a great deal of data, or they involve doing a large amount of processing. In reality, Big Data is a poor description of this computing duality, but it is the one that has captured the headlines, so it is the one we have to use.

The IT industry generates and harvests more data every year. It has been that way from the beginning. Roughly speaking, data grows at about 55% per year. If you do the math, this means that data volumes grow by about 10x every six years or so. This increase is suspiciously in line with Moore's Law, which has delivered a 10x increase in computer power every six years since Gordon Moore made his wonderful and surprisingly accurate observation.

On one hand, the capacity of the technology increases; on the other, it gets used. We might thus conclude that what is now happening is just "same old, same old," but in fact this is not true at all.

To see why this is so, we need to take a broad look at what has happened in computing in the past, and what is happening now.
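As a quick check on that arithmetic, the short Python sketch below (our own illustration; the 55% growth rate is the figure quoted above) computes how long such growth takes to deliver a tenfold increase.

import math

annual_growth = 0.55  # 55% per year, the growth rate quoted above

# Years needed for volumes to grow by a factor of 10 at that compound rate.
years_to_10x = math.log(10) / math.log(1 + annual_growth)
print(f"Years to grow 10x at 55% per year: {years_to_10x:.1f}")
# Prints roughly 5.3 - in the region of the "10x every six years or so" quoted above.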

The Technology Curve and its Demise

The evidence suggests that since about 1960 the IT market has been expanding at a dramatic but nevertheless predictable rate. The expansion has been dramatic because it has been exponential rather than linear. As human beings we are comfortable with the idea of linear growth; we can represent it as an even upward slope that yields a predictable improvement every year. We are less comfortable with predictable exponential growth. Even though the improvement is regular, we tend to underestimate its impact.

It was in an effort to capture exponential technology improvement that we came up with a graphic representation of it, shown in Graph 1 on the following page. It is a fairly complex graph, which illustrates in a general way the response time of computer applications plotted against the IT workload they present.

The vertical axis is logarithmic, meaning that each unit (marked in black) represents 10 times the previous unit: i.e., 0.01 seconds, 0.1 seconds, 1 second, 10 seconds, 100 seconds and so on. We could have extended the graph (and hence the area labeled real time) below the 0.01 second line, but we have chosen to truncate it there.

The horizontal axis is not logarithmic. No specific units are shown for workload because there is no obvious way to metricate the workload of an application. Sometimes applications take a long time because the CPU is busy, sometimes because a great deal of data is being accessed, and sometimes because network latency adds to the delay. The use of resources varies.


Graph 1: The Dynamics of Technology Change (Source: The Bloor Group). The graph's labels include "Application Migration" and "The Area of As-Yet-Unrealized Applications."

To be clear about the terms we are using:

1. Response time: This is the time interval between a user initiating something and the computer responding and completing it. If you click the mouse on a button in a window, the computer's response time is the time taken to carry out the command executed by the button. The button might cause a query to run against a large database, or it might just make a menu drop down. Regardless, the definition of response time is the same. But having noted that, let's also point out that some computer applications are "conversational" (i.e., you do something, then the computer does something, then you do something, and so on) and some are fundamentally one-shot events (e.g., you tell the computer to print a photo album). This is an important distinction.

2. Workload: This is, as we have already indicated, a rough measure of the computer power required to run the application for the individual user and provide the response time indicated on the vertical axis. It is not the total amount of computer power a database uses to satisfy many users; it is the computer power required to satisfy just one query for one user on a particular volume of data with a given response time. It includes all the resources of the disks, the memory, the CPU and the network that would be required to deliver that response time.


Now look at the lines that are drawn for each decade since 1960, when the commercial computer business first took off. Naturally, each indicates that the higher the workload, the longer the response time. The areas beneath and to the right of each line indicate (for each line) the areas of application that were simply not possible at the time due to lack of computer power.

The colored areas of the graph represent ranges of response times relevant to computer users:

• Slow batch: Range 4 hours to 1+ days. Here the workload takes so long that the associated operational processes have to be planned in a very organized way. It is difficult to be productive with such very long response times. With very large data heaps, even nowadays a single query can take a day or more.

• Medium batch: Range 15 minutes to 4 hours. Here the workload is still ponderously long and impedes the business process. Typical applications in this area are scientific (compute intensive) and business intelligence (BI) related, such as some ETL processes that load large amounts of data into data warehouses. We use the term "batch" for these workloads because it is usually best to simply batch such workloads together and run them serially until complete – in the way that most mainframe jobs used to run until transaction processing became a reality.

• Fast batch: Range 15 seconds to 15 minutes. When we get to this level of response, it is easier for users to arrange their work around the computer response by organizing their manual activities to allow for the relatively short delay. Fast batch applications can be, and often are, under user control. Software developers through the 1990s grew accustomed to arranging their work in this manner, testing one program while another was compiling.

• Transactional: Range 1 second to 15 seconds. Transactional is where you enter information onto a screen and then press "Enter" to commit the transaction, such as when you buy an airline ticket on a web site. The comfortable range for transaction response is within 4 seconds. Nowadays a web site will put up a spinning wheel and a message to say "we are dealing with your transaction, please wait" if the delay is worse than that. In the early days of transaction processing it was generally acknowledged that a response time worse than 15 seconds was untenable, less than 4 seconds was desirable and less than a second, ideal.

• Interactive: Range 0.1 second to 1 second. Here we are dealing with applications where the interface itself is truly interactive (computer games are a good example). Normally, the fastest that human beings can respond to a stimulus is about one tenth of a second. That is about the time a baseball player at bat has to react when the pitcher throws the ball. So that speed of response feels "synchronous." It feels less so as response time increases. If responses go beyond one second then the added latency is obvious and, for some interactive applications, untenable. Interactivity began with the advent of the PC and became truly interactive with the advent of the GUI, where a sub-second response is absolutely necessary.

• Real time: Range less than 0.1 seconds. We are slightly abusing the meaning of real time here, since the term normally refers to computing activity that must meet a timed deadline. We take it to mean computing activity that goes beyond the best human response speed. Using that definition, this response time is for computer-to-computer applications, such as automated trading.
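For reference, the small Python sketch below simply encodes the band boundaries listed above as a classification function; it is an illustration of the taxonomy, nothing more.

def response_band(seconds: float) -> str:
    """Classify a response time into the bands described above."""
    if seconds < 0.1:
        return "real time"
    if seconds < 1:
        return "interactive"
    if seconds < 15:
        return "transactional"
    if seconds < 15 * 60:
        return "fast batch"
    if seconds < 4 * 60 * 60:
        return "medium batch"
    return "slow batch"

print(response_band(0.05))   # real time
print(response_band(3))      # transactional
print(response_band(3600))   # medium batch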


Finally, we need to explain the large arrows that indicate the migration of applications from one classification to another. They simply indicate that specific business applications have a tendency to migrate over time from slower classifications to faster ones.

In reality, this would be best illustrated as an animated graph. The first line (labeled 1960) should be envisaged as expanding forward in the direction of the dotted arrow toward the 1970 line and then toward the 1980 line and so on, continuing to expand outward. At the same time, behind that line the large white arrows should be continually moving down and to the left, indicating perennial migrations from one application classification band to another. In effect, that is what has happened since about 1960, and that is what we expected to continue to happen from 2004 onward, when we first drew this graph. And it did for a while.

And then something changed.

The Surprise of 2013

We had gradually come to expect that every now and then a new technology would emerge in a new application area because of this constant exponential increase in computer power. And for a while this is what we witnessed. The complex event processing (CEP) market suddenly emerged, based upon the intelligent use of technology to process information streams. The 10x increase in CPU and memory speed that Moore's Law delivered thrust it into existence. The same could be said of VMware's virtual machines: the growth of computer power made such a development possible. When you examine the technical performance, it is clear that the 10x power in 6 years that Moore's Law described made the technology possible.

But in 2013 we noticed a distinctly different pattern emerge, particularly in the field of Big Data. We encountered some technologies (ETL, large scale queries, analytics) that were capable of running more than 100 times faster, and even 1000 times faster, than what came before. This was in obvious violation of what we had come to expect in the technology market, and it required an explanation.

The Great Disruption

The acceleration we noticed was hardware based. The root cause was simply that by 2004 it was no longer possible to increase CPU speeds by ramping up the clock speed, so chip vendors chose instead to add more processor cores to the chip. This made the CPUs more powerful, but it also meant that if you wanted to use that extra power you would have to use parallel programming techniques. There were no good software solutions for this and it took a long time for such tools to arise, but eventually they did.

Thus the cause of the disruption that we began to observe was parallel processing. Software began to emerge that could spread its workload over tens, hundreds or even thousands of processors. And CPUs were still getting faster. But that wasn't the whole picture. The situation was even more disruptive because change was afoot throughout the hardware layer. It is worth briefly noting all such changes here, because each one of them has consequences:

The Cloud, as Infrastructure

Aside from the fact that there can be cost advantages in using cloud infrastructure, the cloud has collapsed the time taken to deploy commodity servers. It is possible to deploy such servers in a matter of minutes.

This makes it possible to prototype software of almost every kind more swiftly, even where it might not make sense to run operationally in the cloud.

Networking

The advent of virtual networking combined with very high network speeds (provided by Cisco, Juniper, etc.) has made networks far more flexible than they previously were. Historically, software architects have ignored network hardware, regarding networks as a fixed constraint. They are no longer such a constraint, and they are capable of carrying very high volumes of data if it makes sense to do that.

SSD and Disk

Disk is dying. It may die quickly in the coming years, depending on price factors and the difficulty of replacing it with solid state disk (SSD). Samsung, the world's largest provider of flash memory, declared last year that it expects flash to follow Moore's Law over the next 6 years. If it does, SSD, which is already about 10x the speed of spinning disk and is falling in price, will soon make spinning disk irrelevant. Another straw in this wind is that IBM has invested heavily in SSD technology and appears to be trying to evict the disk from the data center without damage to its own significant storage business. Other storage vendors will inevitably follow suit.

Memory

Memory prices have been falling fairly dramatically year over year since the 1970s. They suddenly halted their descent last year for several exceptional reasons: a fire at SK Hynix's fabrication plant in China, an earthquake in Taiwan and the decline of the PC market, which has moved some manufacturing away from commodity DRAM. We can expect memory prices to soon resume their decline. Historically, memory speed improvements never kept pace with CPU speed improvements, achieving only between 2% and 11% gains annually. Nevertheless, this was better than disk managed. Nowadays, the rule of thumb commonly quoted is that memory is about 100,000 times faster than disk for a random read. Effective caching – for example, by well-engineered database technology – reduces that advantage to about 3000:1, which is still very considerable. With the advance of SSD we may see the memory-to-disk speed ratio diminish, but it will always be large enough to consider how to exploit it.

The CPU

CPUs grow bigger in capacity (number of transistors on the chip) with miniaturization. This means that the chip makers can add more capability to the chip, either by adding more cache memory or by adding more processor cores. By 2012, Intel, which dominates the server CPU market, was producing x86 CPUs with up to 16 logical cores and considerable amounts of cache memory. On these chips there are three layers of cache:

1. L1 cache is very close to the processor core, with a capacity up to 32KB. It takes about 3 times as long to fetch data from here as when the data is already in the core, ready to be processed.

2. L2 cache is not so close to the core and has a capacity up to 256KB. It takes about 10 times as long to fetch data from here as when the data is already in the core.

3. L3 cache is shared between all processor cores on the chip. It is larger but slower still – the slowest of the three cache levels, though far faster than a fetch from main memory.
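The practical effect of that cache hierarchy can be sketched with a simple weighted-average model in Python. Only the 3x and 10x figures come from the list above; the hit rates and the L3 and main-memory costs below are illustrative assumptions of ours, not measured values.

# Relative cost of a fetch, in units of "data already in the core".
# L1 (3x) and L2 (10x) come from the list above; L3 and main memory are
# illustrative assumptions for this sketch only.
fetch_cost = {"core": 1, "L1": 3, "L2": 10, "L3": 40, "memory": 200}

# Assumed fraction of fetches satisfied at each level (sums to 1).
hit_rate = {"core": 0.50, "L1": 0.30, "L2": 0.12, "L3": 0.06, "memory": 0.02}

average_cost = sum(hit_rate[level] * fetch_cost[level] for level in fetch_cost)
print(f"Average fetch cost: {average_cost:.1f}x a fetch from the core")
# With these assumptions the average is about 9x - and it balloons if more
# fetches fall through to main memory, which is why keeping hot data (and the
# code that works on it) on-chip matters.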


Because of this hierarchy of speeds, there are some operations that are best handled in specific locations. Certain table handling operations are best performed on the chip because it is possible to use Intel vector instructions, which can process multiple values at the same time (in parallel). It may also be advantageous to carry out data compression and decompression on the chip where possible, because that will save memory. The point is that the CPU can be exploited in ways that were not previously possible, because it now has considerable capacity. And it will have more in time. It is estimated that the 14nm chip recently announced by Intel will be superseded by a 10nm chip in 2016 and a 7nm chip in 2018.

After that it may not be possible to miniaturize the CPU much further – it is hard to know for sure. But even if that is where it stops, there will probably be much more cache memory on the chip in 2018 than there is now.

System on a Chip Technology

There is growing enthusiasm for the possibilities of system on a chip (SoC) technology. The idea is simply to load a chip with every component of a system: one or more processor cores, memory, timers, external interfaces (for USB, FireWire, Ethernet, SPI, etc.), power management and so on. The point is that with current levels of miniaturization, there is now room on a chip for all of this. If you have a particular workload that can run very effectively in a parallel environment (Big Data workloads are obvious examples), then by threading together many such SoCs you may be able to build a scale-out environment that dwarfs the power of a network of commodity servers, because it directly targets the workload.

To make a thriving business of this, you need volume sales of the chip, but this seems possible or at least promising, partly because there are SoC designs, from ARM and others, that can be varied to purpose. Our current view is that if anything is likely to disrupt the market for commodity servers, it is commodity SoCs deployed in cloud configurations. It is possible that they will eventually provide more bang for the buck.

The Architectural Implications of Hardware Disruption

Technology vendors, often start-ups that detect a business opportunity, almost always respond to hardware disruption first. Sometimes they even anticipate change before it manifests.

Consequently, it is not difficult to identify vendors who foresaw some of these hardware disruptions and built products to exploit them. HP Vertica (founded in 2005) provides a clear example of this, designing and building a wholly new column-store database created to scale out across a grid of commodity servers rather than a big Unix cluster. Aerospike, another database company, but this time with an OLTP in-memory database, built its product both to leverage in-memory operation and to exploit SSD technology through parallel access to data. Both IBM and Actian provide examples of exploiting chip capabilities, since both Actian's Vector database and IBM's DB2 now exploit on-chip vector instructions for the sake of performance.

These examples are database products focused in one way or another on improving performance for particular workloads. However, the hardware disruption we are currently witnessing has a much broader impact than just the database industry. True, databases are not what they used to be; but more to the point, it is the data that is not what it used to be. The data has changed, and not just in terms of its volume, but in terms of its nature.
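Before turning to the data itself, the vector-processing point made earlier in this section is worth a small illustration. The Python sketch below uses NumPy, whose array operations hand whole columns to compiled, vectorized kernels rather than looping value by value; it is only a high-level analogy for the on-chip vector instructions that products such as Actian Vector and DB2 exploit, not the mechanism itself.

import numpy as np

# A column of a million prices, standing in for one column of a table.
prices = np.random.default_rng(0).uniform(1.0, 100.0, size=1_000_000)

# Scalar style: one value at a time.
total_scalar = 0.0
for p in prices:
    if p > 50.0:
        total_scalar += p

# Vectorized style: the filter and the sum apply to the whole column at once.
total_vector = prices[prices > 50.0].sum()

print(abs(total_scalar - total_vector) < 1e-6 * total_vector)  # same answer, far less per-value overhead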

The Event: The Atom of Processing?

Until it eventually became a cliché, Big Data marketing campaigns seemed to be driven by alliterations of the letter V: volume, velocity and variety, later to be joined by value and veracity. This was a pity because it was utterly misleading, but then the term Big Data was itself misleading.

A number of technology trends were clearly in play, which can be rationalized in the following way:

• The most far-reaching disruption at the hardware level was the advent of parallelism, which made it possible to process queries on much larger heaps of data than traditional relational databases catered for. One outcome of this was the scale-out column-store databases like HP Vertica and InfiniDB, and another was the genesis of Hadoop. So yes, processing much larger data volumes was entirely possible.

• Data-streaming (CEP) technology had become increasingly capable. Once focused almost completely on the financial markets, where such technology could sell for a high ticket price, it was now establishing a more general market, and yes indeed, the data might well arrive at speed, i.e., with significant velocity.

• Also, if you are accumulating a large heap of data, it may arrive in a data stream, but the problem then is to ingest the data quickly rather than to analyze it quickly, as it is with data streaming applications.

• The "variety" descriptor derived from the fact that some sources of data, particularly web data and social media data, were not conveniently structured in the rows and columns of a database, and yet they contained data that might be worth analyzing. Such data, generally referred to as "unstructured," does indeed have a structure, but not one that is convenient for analytical processing.

• Perhaps more troubling than any of the V-words was the E-word: "external." The point is that much (but not all) of the data that organizations now wished to process originated from outside the organization. This meant that most users of such data could have little influence over its structure or even its quality.

• That, by the way, is probably how the "veracity" word was dreamed up by some marketing team. But it isn't really veracity that is the issue; it is the quality of the data and its provenance that matters.

There is no mystery to the question of why organizations would gather any of this data. The value, if it exists, will be in the analysis of it and in the use of such intelligence to make better business decisions.

Looking at it in this way, we can easily imagine organizations gradually adding new sources of both external and internal data to the heaps of data they already analyze, and changing their systems to accommodate the extra data accordingly. Certainly many businesses can and will proceed in this manner. But we think doing so would be unwise, just as it was unwise to continue building centralized systems once the disruptive impact of the PC made client/server architectures possible. The same kind of radical architectural transformation is called for now, and it militates in favor of an "event-driven" architecture. Let us examine what such an architecture might be.


Figure 1: An Event-Driven Approach, in Overview

Figure 1 depicts an event-driven approach in a general way. Every executing process in which the organization has an interest is generating event data. Such a data stream is managed by a virtual traffic cop that directs it to the right location, possibly filtering or replicating the event data while doing so. Looking at it in this way, we can regard everything that is happening as generating or capturing events, and the whole corporate IT environment as an effort to manage and exploit the flow of data.

Transactions and Events

In the past we built systems that were transactional. A transaction was a change to an organization's data. Thus an order, or a delivery, or a payment constituted a change to data. Other transactions might be changes to customer addresses, or discount agreements, or credit limits. Our BI applications reported on the transactions of the organization. In terms of data, that was the level of granularity at which we worked.

It has gradually become necessary to think in terms of events. For example, when a customer makes a purchase on the web, each mouse click on the company web site is an event. Such data needs to be captured and analyzed:

• What provoked the visit to the web site (an email, a web advert, a Google search, etc.)?

• Which web pages did they visit on our site?

• What options did we present to them?

• How many times did they visit the site previously?

• What did they do on such visits?

So a customer purchase is no longer a simple transaction. In fact it is a series of individual events that led up to and included the transaction. And of course, we are interested in all events, whether or not they led to a transaction. We will compare this pattern with the pattern of events of other site visitors.

For every activity in which the business has an interest – hiring, marketing campaigns, procurement, customer service and so on – it has become possible to gather data on related events that link together through a timeline and lead to some outcome, good or bad.

There are several important things to note. Events always involve time information, and it must be captured. Events also have a context. They took place in a specific geographical or virtual location, and this dimension also needs to be captured. The event is the fundamental atom of a process, and we can analyze these atoms, discover patterns in them and exploit those patterns. The transaction wasn't an atom of processing at all. It was a molecule.

The view of data we had in the past was incomplete. We captured the transactions that happened, but had little knowledge of those transactions that never completed: the sales that nearly occurred and the customer complaints that were never made.

The Internet of Things and the Great Refinement

After the networking of the world by the Internet, two further steps would inevitably occur.

The first was the creation of Internet-enabled mobile devices, so that people could be connected while in motion, not just to each other but to the immense data and processing resources of the Internet. This added a geographical dimension to the Internet and hence to a good deal of data. Data coming from a mobile device has a location and time of origin which can be known and may be useful. The location also attaches a time and place to the person carrying the mobile device.

The Mobile Data Landscape

Unless you have seen the statistics, you probably do not have an accurate idea of the volume of mobile data. Here are some useful stats, courtesy of Cisco:

• Mobile data traffic worldwide grew 81% in 2013, from 820 petabytes per month in December 2012 to 1.5 exabytes per month in December 2013. One month's mobile data in that month was about 50% larger than all Internet traffic in 2000.

• Smart devices represented 21% of the total connected devices in 2013, but 88% of the traffic; hence, a smart device generates 29 times more traffic than a non-smart device. Mobile video traffic was 53% of the total by the end of 2013.

• About 526 million mobile devices and connections were added in 2013, bringing the total to about 7 billion, a rise of over 7%, with smartphones accounting for 77% of the growth.

Mobile devices, by the way, include laptops, tablets, smartphones, dumb-phones and the new and growing category of wearable devices, of which there were about 22 million by December 2013. The number of connected tablets (about 92 million) is poised to overtake the number of laptops (149 million), if it has not already done so, as the rate of tablet sales growth in 2013 was 220%, compared to laptop sales growth at less than 29%.

Not much of these many exabytes of data counts as Big Data, in the sense that the data cannot be profitably analyzed. First of all, most of the data traffic is transient: telephone calls that are never recorded, texts that are soon deleted, photos or videos sent and consumed but never saved, and so on.

Aside from this consumer data, there is a great deal of other data stored on mobile devices of which you may or may not be aware. This is mostly event data: data about your movements (every time you move the mobile device you generate an event), as well as the contacts you make and the interactions you have. This overlaps to some extent with the personal data that you deliberately collect (contact details, text messages, etc.), the extent of which you will quickly discover if you lose your phone and, for some reason or other, this data is not actually backed up.

But a great deal of this data is invisible, and it is gathered either by the phone itself or by the applications running on your phone. They gather such data because they can advantageously use it. This is the Big Data on your mobile device that actually does get analyzed and, right now, it is not measured in exabytes.

This is event data. Every time you do anything on the mobile device itself, or with an application on the mobile device, you generate events, and these events are recorded and eventually harvested for analysis. You could characterize such data as data about state and state changes, in respect of the device itself or an application running on that device.

Things: Dead or Alive

The much heralded Internet of Things (IoT) is going to generate exactly the same kind of data, and for the same reason. The mobile revolution was about devices people carry and use. The IoT will be about every other device, or even object, of which some organization or person might like to know the state.

So it includes mobile things like vehicles (skateboards, bicycles, cars, trucks, buses, ships, airplanes, etc.), machinery (dynamic machines and robots) and life forms (wildlife, pets or people). It also includes immobile things like infrastructure (buildings, roads, pipelines, factories, etc.), plant life (trees and crops), geography (land, rivers and seas) and meteorology (the atmosphere).

The likelihood is that, speaking in general terms, we will instrument everything it is economic to instrument, so that we can always be aware of its state and of any important changes of state that occur.

As regards data, the IoT will not generate much transient data. In most contexts, it will be clear whether it is worth trying to gather event data, and embedded sensors or chips will be installed accordingly in particular locations to generate useful data. The volume of data gathered will inevitably be "big." We already have the example of the four-engine jumbo jet, which currently creates around 640 terabytes of data on a single Atlantic crossing – and the IoT has only just begun.

The Refinement

Logic suggests the following scenario:

1. A specific thing (say a three-bedroom house) does not have any embedded sensors or chips in any of its components (walls, doors, floors, furniture, etc.) or utilities (electricity, telecommunications, air-conditioning/heating, water, etc.).

2. Sensors and embedded chips are added in all appropriate places to the benefit of the inhabitants of the house. The outcomes might include: a better controlled internal environment, lower cost services from the utilities, improved safety in every dimension from lower fire risk to less likelihood of burglary, improved health of those living in the house, and so on.


3. Analysis of the sensor and chip logs provides reliable data on exactly how every person in the house uses the rooms and facilities of the house. It becomes clear that the design of the house could be much improved in many of its aspects: the dimensions of rooms, the location of specific devices (stove, dishwasher, microwave, washing machine, etc.), air flow and heating, the disposition of electric sockets, the location of stairs, front door, back door, and so on.

4. Using the data gathered, builders can design much better houses, and they can be delivered already instrumented.

Step four is what we call the great refinement. Objective feedback is a very powerful thing. The use of embedded sensors and devices will, perhaps for the first time, provide objective feedback on how people actually use their various devices and the spaces they occupy. We have never had such objective feedback; we have depended upon the opinions of credible experts and thought leaders. They may have been exactly correct in the assumptions they made, and thus there may be little that can be improved, but this is unlikely.

It is far more likely that the collection of such unprecedented and accurate feedback will spark a significant amount of design rethinking, and it will change the world accordingly.

The Nature of Events

Aside from the fact that accumulating collections of events will inevitably lead to very high volumes of data, it is clear that the nature of event data is not the same as the data we have traditionally gathered, and it demands that we take a different perspective on it.

We first consider some fundamental aspects of an event:

1. Events always occur on a timeline. There is an immutable order to events based on time. One event, no matter what it relates to, always happens before, after, or at exactly the same time as another event.

2. Events always occur in a given place. It may be physical, such as within a car engine, or it may be virtual, such as a location on a web site. In the latter case, the virtual location is likely to be more important than the physical location.

3. Events always have a context, and an event may have multiple contexts. So when someone visits a web site, the event of clicking on a given link has a context in respect to that individual and his/her browsing activity, and it also has a context in respect of the web site and its activity. It will also have a context in respect to the connection between the browser and the web site, which will involve a series of networking events.

4. Events have attributes. It helps to think of events as specific records, like log file records, that contain the important dimensions of an event. A sensor in a pipe might, for example, simply report the rate of flow of the fluid through the pipe at a given time, or it might also report other dimensions such as the temperature of the fluid and the pressure in the pipe at that point.

An event has identifying information (time, location, context information) and also has attributes. You can choose to consider all data records as being either an event, an aggregation of events, or derived data.

In this, we need to distinguish between aggregations that are historical and aggregations of data derived in other ways. This is conceptual. If we choose to regard events as the atoms of processes, then we can define the following types of events.

Figure 2: The Customer: An Instantiation Event is a Historical Aggregation

1. Instantiation event. Let's say that a customer (person) comes into existence within the context of our organization's processes. That such an entity already existed (as a person) long before they interacted with our organization means that their instantiation event will be an aggregation of historical information. They have, for example, a date_of_birth and a gender which were conferred by events that are a long time in the past. They have an address, where they live, which they began to occupy at some point in the past. They have a credit_card_number from an account they obtained some time in the past, and so on. We illustrate this roughly in the above figure. We have categorized educational events, employment events, financial events and social events, and we indicate changes of location at specific points. Much of this information may not be useful and may not even be possible to obtain. When we register a new customer we will try to gather all the information that we consider useful, and each attribute we gather will refer to some event of some kind in the history of that individual.

2. A state report event. This is simply the data about the state of some entity at a given time. State events reported by sensors or by software in log files will be generated according to some simple or perhaps even sophisticated rule. A state report may be generated every second perhaps (even if no change of state has occurred), or may be generated when a state change occurs within a specific range (for example, temperature rises or falls by 0.1 degrees), or both (a report is produced at least once every minute but also every time the temperature changes by 0.1 degrees).

3. A trigger event. It is worth highlighting the idea of a trigger event, one that causes other events to occur. Thus, in respect to state, there may be a threshold where, if a given value rises above a particular value, an action is automatically generated, which means that other events will occur. An obvious example is a purchase on a web site. Once all the required customer and financial details are entered and the "confirm" button is clicked, various payment and dispatch events – perhaps a whole cascade of events – fire off. Trigger events are the events that give rise to actions or transactions.


4. A correction event. This is an important and, in respect to its ramifications, a potentially disruptive event. Consider the situation where, for whatever reason, it is discovered that one or more event records contain incorrect data (perhaps a sensor has been malfunctioning or a software error has been discovered) and a correction needs to be made. The correction event must not only register what the correct data is, but must also record the time at which previous event records are known to have become erroneous. The possibility is that inappropriate actions have occurred because of erroneous data. How such corrections are recorded depends on the circumstances.

We can think of all data as being either event data or derived data. Derived data is data that has been calculated or aggregated from a collection of events. Thinking in this manner can be helpful, because it allows us to consider the life cycle of data with a very fine level of granularity. For example, a customer record (derived data) is formed by a gradual accumulation of events that change the current values of some of the customer attributes to bring the record to its current state. All the events that created it form an audit trail of how the current state of the customer record was derived. This provides a lineage for the customer record.

With processing of events, data lineage can be traced, and thus the life cycle of any data entity or collection of data can be known and analyzed. This has ramifications for data analytics in respect to knowing the provenance and reliability of data, and it may also be useful in preventing or detecting fraud.

Hopefully we have provided enough information here to suggest that designing, modeling and building event-based systems is slightly different from building traditional transactional systems. This will be the case both in the initial processing of events and in the analysis of events.
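To make the idea concrete, here is a minimal Python sketch of an event record and of a customer record derived by folding events in time order. The field names and helper function are our own illustration, not a schema from this report; note that the event list itself serves as the lineage of the derived record.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Event:
    """An event record: identifying information plus arbitrary attributes."""
    occurred_at: datetime            # events always sit on a timeline
    location: str                    # physical or virtual place of origin
    context: dict[str, Any]          # e.g. session id, device id, sensor id
    attributes: dict[str, Any] = field(default_factory=dict)

def derive_customer_record(events: list[Event]) -> dict[str, Any]:
    """Fold a customer's events, in time order, into the current (derived) state.
    The ordered event list is the audit trail of how that state was reached."""
    state: dict[str, Any] = {}
    for e in sorted(events, key=lambda e: e.occurred_at):
        state.update(e.attributes)
    return state

events = [
    Event(datetime(2014, 1, 5, tzinfo=timezone.utc), "web", {"session": "a1"},
          {"email": "pat@example.com"}),
    Event(datetime(2014, 3, 9, tzinfo=timezone.utc), "call-center", {"agent": "42"},
          {"address": "12 Elm St"}),
]
print(derive_customer_record(events))  # current state; full lineage remains in `events`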

Big Data and Hadoop

The move to Big Data-oriented businesses began over a decade ago with Yahoo and Google. To support their fast-growing operations, both companies had to manage extremely large volumes of data. Neither company wasted time investigating the database technology of the day – it clearly couldn't handle their volumes, and it was clumsy or worse for unstructured data, which they needed to process effectively. Consequently, they "rolled their own" technology.

Among other technologies, Google created Google MapReduce, a software framework that implemented parallelism. The goal was to scale out dramatically by having software that would run reliably on a network of ten to over a thousand computers. Because Google chose to patent its framework, the open source software pioneer Doug Cutting decided to develop a similar framework. This attracted the attention of Yahoo, which hired Cutting. Not long after that, the Apache Hadoop open source project was born, with Yahoo engineers investing a good deal of their time in the project. By 2008, the open source Hadoop was available.

In our view, Hadoop is a fundamental component of Big Data architecture.

The Dynamics of Scale-Out

Divide up a software task effectively between 10 computers and it will run close to 10 times faster. On 100 computers it will run almost 100 times faster. In the 50 years of computing before the year 2000, very little effort was put into building software that could work in such a parallel manner. There were exceptions, of course. Some software for scientific computing (referred to as high performance computing) was written in parallel, and the GPUs that render visual images for display on PCs ran software in parallel. But normal commercial computing neither knew of nor cared about parallelism. Even commercial databases, which were engineered for speed and hence did tip their hats to parallelism, were written only to scale out on relatively small clusters of computers. They were not written for extensive scale-out.

Scalable technology began to blossom before Hadoop leapt into the picture. Purpose-built solutions such as Greenplum (2003), HP Vertica (2005) and Infobright (2005) emerged. They harvested customers from companies that were bumping up against the performance limitations of traditional relational databases. Quick on their heels, various NoSQL database products followed suit, including MongoDB and Apache Cassandra, both of which began as open source projects and morphed into commercial products from 10gen (now MongoDB) and DataStax. The NoSQL databases attacked a different weakness of traditional databases: their awkwardness in dealing with data that didn't fit conveniently into tables. And they were also built to scale.

These products gained traction because there was a genuine need for them. They were naturally sought out for applications with large data volumes – applications that were expensive in terms of time, resources and expertise.

At the same time, the x86 chip family started to add more processor cores to the CPU with each new generation of chips. Suddenly commercial hardware itself militated in favor of parallel processing. Commodity hardware was finally ready for it.
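The "close to 10 times faster" qualifier can be made concrete with Amdahl's Law – a standard result, not something this report derives: the serial fraction of a job caps the speedup no matter how many servers are added. A minimal Python sketch:

def amdahl_speedup(parallel_fraction: float, servers: int) -> float:
    """Speedup on `servers` machines when only `parallel_fraction` of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / servers)

for servers in (10, 100, 1000):
    print(servers, round(amdahl_speedup(0.95, servers), 1))
# With 95% of the work parallelizable: ~6.9x on 10 servers, ~16.8x on 100 and ~19.6x on 1000 -
# "close to" linear only while the serial 5% remains negligible.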


Parallel software technology, which had once been a niche capability that few developers cared much about, moved to center stage. Thus Hadoop, a scalable file system with a parallel capability that happened to be open source, began to capture attention.

Hadoop: Why and Why Not?

In its initial release, Hadoop comprised a fail-safe indexed file system (called HDFS) married to a parallel execution framework (called MapReduce). It was attractive for several reasons. As it was open source, it could be tried and prototyped at little cost. It could store and search through very large volumes of any kind of data reasonably quickly. It was a scale-out key-value store (the kind of file organization that was once known as ISAM), which meant that data could be stored in HDFS without the need to do any kind of data modeling. Any kind of data that had, or could be allocated, a key could be captured. It had redundancy built into its operation, so if almost any server failed, the job Hadoop was running would be immune to the failure. That was the upside.

However, there were disadvantages. Hadoop jobs ran in a serial batch manner, so one job had to finish before another could begin. Hadoop's only parallel development capability was via MapReduce, and it wasn't easy for programmers to become productive with it quickly. Most importantly, while Hadoop delivered a parallel capability, it was by no means an optimized parallel capability and, in fact, it performed quite poorly. Finally, Hadoop's "high availability" was designed for running on very large numbers of commodity servers, but each new generation of x86 chips had increasingly more cores on each chip.

Pretty soon it became obvious that few businesses would ever need to deploy thousands of servers for specific large workloads when x86 chips had 16 or more processor cores. Hundreds of servers perhaps, but even that would be unusual.

In essence, Hadoop was designed to scale out, but not to scale up.

The Juggernaut: The Hadoop Ecosystem

The most important aspect of Hadoop – important because, in our view, it overrides almost everything else – is that it has spawned a considerable software ecosystem. Software ecosystems are a force to be reckoned with. The seed at the heart of a thriving ecosystem can have many weaknesses and deficiencies, but because of a general belief in what that seed can bring to the table, an ecosystem emerges and products appear that remove the deficiencies and compensate for the weaknesses.

Examples of previous technology ecosystems include the IBM mainframe, DEC VAX, MS-DOS, Windows, Solaris and Linux. Each of these had direct competitors of one kind or another, even if in most instances they outdistanced them quickly. Hadoop is unusual in that there is no direct competition. There is no other scale-out key-value store that anyone in IT has any interest in.

Hadoop quickly became the foundation of a major open source initiative which gave rise to a whole series of complementary software components, giving it a vast array of additional functionality. These components include:

• Pig, an analytical language

• HBase, a scalable NoSQL database that runs on HDFS (with ZooKeeper for coordination)

• Cassandra, a NoSQL database


• HCatalog, a metadata capability

• Hive, a data warehouse capability

• Sqoop, a data import/export capability

• Oozie, a workflow capability

Naturally, all of these components scale out reasonably well, but until very recently all were limited to running in a single batch queue. This meant that only one job could run at a time – no matter whether it wanted to access a single record or query terabytes of data.

The open source initiatives around Hadoop are, of course, only half the story. Some vendors, notably Actian with its Vector product and Calpont with InfiniDB, have ported their databases to work directly on the Hadoop HDFS. The ETL vendors have accommodated Hadoop en masse, not surprisingly since ETL is currently one of its primary applications. RedPoint Global has delivered a data management capability to Hadoop. Datameer, Alpine Data Labs and Splunk are providing analytics on Hadoop. Teradata, IBM and HP all have Hadoop enablement strategies in place, as do Oracle and SAP. And there are many other vendors, too numerous to list, that we could also mention.

A Useful Parallel

There is a parallel between the ongoing evolution of Hadoop and the rise to prominence of Linux. They can be thought of as open source cousins of a kind. Linux was a Unix-derived operating system that was stable, practical and not proprietary. It generated an ecosystem of complementary open source software products. Its context was as an operating environment, like Windows, suited to running desktop or single-server applications. It achieved enterprise credibility primarily, in our view, because of investment by commercial IT vendors, particularly IBM. It was reliably supported and gradually proved itself in a variety of useful roles: running web servers, running domain name servers, as a file server, running email, as a PC OS, as a database server, as a middleware server and eventually, in the enterprise, as a server for the full gamut of business applications.

We see a similar pattern emerging in the evolution of Hadoop. At first, we saw enthusiasm from technical developers, followed by significant levels of adoption boosted considerably by the open source nature of the product. Then we saw the emergence of a commercial ecosystem and finally significant investment, this time highlighted by Intel's big $750 million investment in Cloudera.

Hadoop's march to prominence and dominance has been faster than that of Linux, partly because Linux defined the pattern by which open source products can succeed and partly because Hadoop has no direct competition. Just as Linux first established itself as a web server, Hadoop first established itself as a data collection system and, as a natural consequence, an ETL environment. We expect its areas of application to spread more broadly than that, and ultimately it may dominate the whole of the data layer within a corporate environment.

Hadoop Maturity: Hadoop 2.0 and its Consequences

With the advent of Hadoop 2.0 in August 2013, it was generally realized that the Hadoop environment had matured significantly. It was a major release. Aside from some useful enhancements, including the elimination of the NameNode as a single point of failure, Hadoop 2.0 decoupled MapReduce from HDFS and introduced two new components: YARN and Tez.
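For readers unfamiliar with the programming model under discussion, here is a deliberately tiny, pure-Python sketch of the map/reduce pattern (word counting, the canonical example). It illustrates the style of the model only; it is not Hadoop's actual Java API, and the shuffle/sort step that Hadoop performs between the phases is collapsed here into a simple in-memory grouping.

from collections import defaultdict

def map_phase(document: str):
    """Map: emit a (key, value) pair for every word."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the values for each key (grouping done in memory here)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data big processing", "data about data"]
all_pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(all_pairs))  # {'big': 2, 'data': 3, 'processing': 1, 'about': 1}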

Eliminating the marriage of HDFS to MapReduce was a welcome development. MapReduce has its uses, but it is often an inappropriate processing model. This is particularly the case in respect to data queries. If MapReduce were an optimal algorithm for querying data, then most databases would have employed it long ago for managing query workloads and, as far as we are aware, none ever did.

YARN is a scheduling capability that allows multiple jobs (or workloads) to run on HDFS concurrently. YARN makes it possible for Hadoop to support multiple queries running against the same data at the same time, which is the typical pattern of database workloads and may become the typical pattern for analytical workloads.

Finally, Tez enhances MapReduce's capability, allowing it, for example, to run small queries against very large data volumes efficiently.

The Importance of YARN

YARN lays the groundwork for HDFS to become a true scale-out networked file system with few obvious limits. Its impact is transformative. Prior to YARN, Hadoop was a MapReduce platform. With YARN, MapReduce became a sideshow. The added functionality that YARN delivers provoked a number of vendors, including Actian and Teradata, to announce new Hadoop capabilities. In effect, it instantly paved the way for a far richer Hadoop ecosystem.

With the addition of YARN, we expect Hadoop's file system (HDFS) to gradually become the industry-standard file system for the data layer, both in data centers and in the cloud. Conceptually, the point is this: each server in the data layer will have a local OS (most likely Linux), but Hadoop, along with YARN, will be able to act as the software that coordinates workloads spanning grids of servers. It can, and we believe will, become the OS for the data layer.

As such, it will likely become a vital component of any Big Data Information Architecture.

Is Hadoop Enterprise-Ready?

The short answer to this is "No, but it is close." Hadoop lacks data security features, but there are products that can fix that. It lacks system management capability, but there are products that can fix that. It lacks a good SQL interface for data, but that can be fixed. It is not good at metadata management, but that can be fixed, too. It does not perform well, but it can be made to perform well.

The point is that pretty much every operational Hadoop weakness can now be covered if you choose the right complementary products and components. And of course, the pressure for Hadoop to be enterprise-ready is in full force. According to Hortonworks, Hadoop is now installed in almost every Fortune 500 company, and many others besides. Right now most of the usage is experimental (proof-of-concept and pilot projects), and credible estimates suggest that perhaps only 10-15% of organizations with Hadoop are using it in production.

In a recent survey of 158 corporate executives from a spectrum of industries, carried out by 1010data, Inc., the most common complaint about Hadoop, from roughly 70% of respondents, was in the area of security and reliability, with about 65% also complaining about the expense associated with implementation and maintenance. There is also a shortage of skilled Hadoop developers, and that may act as a brake on Hadoop usage for a while.

A Data Flow Architecture

From the outset of our research effort, we quickly concluded that a scale-out file system would inevitably become a fundamental component of the Big Data Information Architecture. As Hadoop has such remarkable momentum and is the only scale-out-file-system game in town, we believe it will dominate the IT landscape in the coming years. There is a very genuine need for the technology.

The Pattern of the Past

For many organizations, the multitude of business applications and the associated data management and BI capabilities carve out a familiar pattern. Transactional applications use traditional relational databases to store and manage transactional data. When data is shared between transactional systems, it is replicated in some way to keep source data consistent. The data from transactional systems is siphoned off by software that will extract, cleanse, transform and store it in a staging area.

From there it will be loaded into a data warehouse that was designed to accommodate a mix of query traffic to serve a variety of BI applications. Subsets of the data may subsequently be offloaded into data marts or, in some cases, desktop databases for various types of BI reporting or for data analysis. When its usefulness expires, data will either be archived or simply deleted.

This practice has become outmoded. It has been disrupted by the following factors:

• The universe of corporate data – the data that a corporation can exploit – has grown. It now includes many external sources of data: supply chain data, social media data and data from other private or public sources in the cloud.

• Scale-out technology has made it possible to store and report on corporate data that was previously ignored. This is particularly the case with the log files of network devices, operating systems, databases and applications, and particularly web logs.

• Data streams from many data sources (social media, news and weather, markets, etc.) are easy to access via web APIs, and data streaming services, free or otherwise, are proliferating. This poses insuperable problems for the old data warehouse arrangement, both in terms of the speed of data arrival and the format of the data, much of which is unstructured or semi-structured.

• The aging database technology that was used to build data warehouses was not engineered for scale-out parallel operation. As a consequence, the increase in data volumes alone challenges its suitability for the new universe of corporate data.

• Much of the data that organizations wish to use is unstructured, in the sense of not being conveniently described by traditional metadata. To add to this, there is a new generation of NoSQL databases which is more effective in processing some of this data (including graph databases and document databases).

The immediate importance of Hadoop is that it can be used conveniently as a staging area for capturing and storing some of the data that cannot, for various practical reasons, be included in the data warehouse. As Hadoop scales out indefinitely, it doesn't matter too much how big this staging area gets, so very large collections of data can be gathered if there is a need. If we think in those terms, then we can imagine an arrangement of the data layer which looks very much like the one illustrated in Figure 3.


Figure 3: Hadoop in the Data Layer

Figure 3 shows the typical data warehouse arrangement augmented by Hadoop.

Business (transactional) applications run on what we have labeled Legacy DBMS (traditional relational databases, mainly). Data is gathered from these databases using traditional ETL jobs to transfer appropriate data into the data warehouse. A number of local applications (labeled DW Apps) run query workloads against the data warehouse. Some will be BI apps with access to the data warehouse that maintain, for example, BI dashboards.

Our illustration shows, simplistically, the inner workings of the data warehouse. Unlike Hadoop, the data warehouse is a data query engine tuned to optimize the multiple concurrent queries that regularly request data from within it. This is its workload. We have illustrated the inner workings of the database as a shared-nothing scale-out environment, which shards the data it presides over onto multiple servers and distributes queries across those servers to gather the requested data.

If we add Hadoop into this environment, it clips into the data layer quite simply. Data flows into Hadoop and is refined so that it can be moved to the data warehouse. At the very least, some data cleansing and metadata definition will be done so that, via some Hadoop ETL job, data can be transferred into the data warehouse. There will also be local Hadoop apps.


A Change of Paradigm

Our illustration suggests that we might be able to accommodate Big Data, and all that it means, simply by adding Hadoop as a data dump that feeds the data warehouse. We do not believe that to be a credible course of action. In particular, it is our view that the extreme level of hardware disruption we are currently experiencing has such a potentially dramatic impact on response times, particularly for "slow" batch jobs, that we need to rethink the whole data layer.

We previously constructed the corporate data layer by identifying sensible places to locate collections of data and inserting appropriately configured databases in those locations. We thus tended to give specific OLTP apps or application suites an OLTP database, and we configured a large query engine, a data warehouse, to act as a concentration point for structured corporate data. This would be appropriately located so that it could be fed by pipelines of data from the OLTP apps. The BI apps would either feed directly from the data warehouse or, because there was a limit to the workloads it could support, they would feed from data marts that were extracted from the data warehouse.

Data marts were thus data depots for locating data that was offloaded from the data warehouse because the warehouse was incapable of supporting the full query workload.

It could also be the case, depending on circumstance, that companies would, for the sake of timeliness, choose to build what were called operational data stores (ODS): databases that integrated data drawn directly and swiftly from multiple sources, circumventing the data warehouse. The problem was that data could simply take too long to pass through the transformation and data cleansing routines to get into the data warehouse – and the ODS provided a quick and dirty shortcut.

In our view, a change of paradigm is now mandated. Its justification is simple. The speed of processing engendered by the combination of parallelism and the increasing power of computer hardware at all levels means that:

We should build systems to cater for data flow rather than data at rest.

Vendors like Teradata, for example, may argue that they have been oriented toward data flow for years, and this is indeed so in respect of their database architecture. However, here we are casting the net wider than the central database itself to cover the whole data layer.

There is an architectural point worth highlighting. The space occupied by executable software has never been particularly large, and the ratio of space occupied by data to space occupied by executable software has always been very high. Nevertheless, it continues to grow year after year as the volumes of data grow inexorably.

Indeed, even though network speeds and networking configurability have increased significantly in recent years, they have not kept pace with the growth in data. The volume of data has grown to a level where, in the vast majority of situations, the processing should always, if possible, be moved to execute close to the data it intends to process. Put simply:

Do not move the data unless you absolutely have to.
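To make that principle concrete, here is a deliberately simplified Python sketch (our own illustration, with invented class and function names): the aggregation is pushed to where the rows live, and only the small result crosses the network.

class DataNode:
    """A stand-in for a server that holds one shard of the data."""
    def __init__(self, rows):
        self.rows = rows

    def run(self, func):
        # Ship the function to the data: only the (small) result travels back.
        return func(self.rows)

nodes = [DataNode(list(range(i, 1_000_000, 3))) for i in range(3)]

# Anti-pattern: pull every row across the network, then aggregate centrally.
total_moving_data = sum(sum(node.rows) for node in nodes)

# Preferred: push the aggregation to each node and combine the partial results.
total_moving_code = sum(node.run(lambda rows: sum(rows)) for node in nodes)

assert total_moving_data == total_moving_code  # same answer, far less data in flight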


It might then seem that, if it is best not to move data, we should not be thinking of a data flow architecture at all. Nevertheless, we should, because the data is going to move anyway: not just from one storage place to another, but through memory and onto the CPU.

Data Flow: Acquisition, Refining, Processing, Shipping

Figure 4 illustrates in overview what we believe to be a rational Big Data Information Architecture or, if you prefer, a data flow architecture.

Figure 4: A Big Data Information Architecture in Overview

The first point to note is that it is an event-based architecture. The arriving data may not exclusively be events; it may include some fairly complex data items, and it may also include events that have been pre-aggregated. The data arrives from data streams, embedded processors (the Internet of Things), data sources in the cloud (social media, etc.), mobile devices, and desktops and servers within the corporate IT facility.

It arrives at what we have chosen to call a Corporate Data Hub. This is not, and probably cannot be, a single physical database, no matter how capable. It is most likely (for the foreseeable future) to be several data engines and/or storage capabilities.

There are two general activities that take place in the Corporate Data Hub:

1. Data Refining: As we shall explain later in more detail, this is a complex activity consisting of a variety of functions.

2. Local Workloads: These are workloads that run against the data stored in the Data Hub.

We could perhaps define a third activity: the management of the Data Hub itself. We will discuss this later, as it merits some elaboration. However, the main activities taking place on the Data Hub are those that prepare the data for its eventual use in various contexts, and the running of local workloads against the data that is available for processing.

An important concept to understand here is that we are (in theory at least) no longer constrained to move data around in the way that we once were. The assumption is that there are, and always will be from this point on, highly scalable engines that can scale up and scale out to provide sufficient computer power to process any workload.

The point is simply that we will not be exporting data to data marts, in the way we once did, for the purpose of providing adequate performance for specific workloads. The Data Hub and its associated software will provide the performance.

Let us first consider a "green field" situation. Imagine that we are able to start from scratch without any technology constraints imposed on us by existing systems. In such circumstances we would identify all sources of data that might be of use to our organization. We would, no doubt, implement various business software packages (on commodity servers) or possibly use known cloud capabilities for such business applications. Either way, we would capture all events from all such applications and funnel them into the Data Hub.

We would acquire (or even build) software to carry out various functions on the data as we imported it into the Data Hub. Conceptually, we would be refining data in order to be able to use it directly. Depending on the data source, this activity might be significant or it might not. (A minimal sketch of this ingest-and-refine flow follows below.)

All the BI and data analytics applications can run on the Data Hub. Additionally, all new transactional applications can also run here, and should if feasible. It is for that reason, incidentally, that the word "warehouse" is no longer appropriate for this large collection of data. The applications that will not be able to use the Data Hub are as follows:

• Packaged software that is not able to use the capabilities and API of the Data Hub.

• Software that has inappropriate operational characteristics; for example, software that has inadequate network bandwidth or security characteristics to use the Data Hub as a data server.

• A good deal of office and personal software apps. Such software also has inappropriate operational characteristics, but it is worth mentioning separately. Note, however, that it is entirely possible for the Data Hub to be used as a file server to back up and serve the files used by such applications.

If we now include the reality of existing legacy software of varied flexibility and age, including even an existing data warehouse, it may be that, for a variety of reasons (cost, platform, configuration, mode of operation, etc.), much of the legacy software is unable to use the Data Hub directly. If that is the case, it will be necessary to export data from the Hub for use by such applications, and to capture from such applications the data that the Data Hub itself requires in order to function optimally. This is illustrated under the Data Shipping section of Figure 4.
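The following is the minimal, purely illustrative sketch of the ingest-and-refine flow referred to above. The CorporateDataHub class, its event format and the refining functions are assumptions made for illustration; they do not represent any particular product.

    # Sketch: events from business applications are funnelled into the hub,
    # refined on import, and then served to local workloads.
    from typing import Callable

    class CorporateDataHub:
        def __init__(self, refiners: list[Callable[[dict], dict]]):
            self.refiners = refiners       # refining functions applied on ingest
            self.store: list[dict] = []    # stand-in for the hub's data engines

        def ingest(self, event: dict) -> None:
            for refine in self.refiners:
                event = refine(event)      # cleanse, tag metadata, etc.
            self.store.append(event)

        def local_workload(self, predicate: Callable[[dict], bool]) -> list[dict]:
            # A local workload runs against data held in the hub itself.
            return [e for e in self.store if predicate(e)]

    # Hypothetical refining functions.
    def normalize_keys(event: dict) -> dict:
        return {k.lower(): v for k, v in event.items()}

    def tag_source(event: dict) -> dict:
        return {**event, "source": event.get("source", "unknown")}

    hub = CorporateDataHub(refiners=[normalize_keys, tag_source])
    hub.ingest({"Type": "order", "Amount": 42.0})
    print(hub.local_workload(lambda e: e["type"] == "order"))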


The Two Data Flows

In Figure 5, we illustrate a Big Data Information Architecture in far greater detail. The main point that we wish to surface here is that once we enter an event-driven environment, the architecture naturally splits into two data flows.

The fact is simply that event data either needs to be processed at once or it does not. Event data may need to be processed immediately for one of two reasons:

1. The event stream needs to be analyzed immediately and continuously so that the application can respond to the data stream as it arrives. Apps that process events in this way (CEP and related apps) are labelled Streaming Apps in Figure 5.

2. Some events are simply triggers that provoke some action in some application. For this reason, we show events potentially being routed to any application in the Application Layer.

Figure 5 shows all events, whether they stream into the organization or emerge from any application or any device (mobile, desktop, server, networking device, etc.), passing through a Filtering, Replicating and Routing process.

Figure 5: The Data Refinery and Processing Hub in the Data Layer

This process is the traffic cop for all newly emerging data. It knows where to direct each event, whether to duplicate it (for example, an event could go to a Streaming App and also be stored within the Data Hub), or whether to simply delete it from the stream. The goal for this process is that it impose almost no latency on the movement of data. This is necessary because some real-time streaming applications may have service levels that demand very low latency. In other words, this process needs to be fault tolerant and extremely fast. (A minimal sketch of such a router appears below.)

In regard to the Application Layer, applications either use the Data Hub directly, sending data requests to the Hub, which become Local Workloads; or they use Data Marts (data extracts) created and refreshed by ETL or data virtualization software that accesses the Data Hub, as shown in Figure 5. The same ETL or data virtualization capability will also extract data from the Data Hub for export.

All data newly created by any app of any kind in the Application Layer is captured when created and passed to the Filtering, Replicating and Routing process.

Data Refining and Processing

The first point to note is that the Data Hub may have multiple engines. One useful way to think of this is that ingest into the Data Hub really is a refining process. First it is necessary to refine the data so that it is suitable to be processed. We discuss this process in depth a little later, but note here that when data arrives, it is not likely to be organized in the form best suited to its processing.

The most appropriate form will depend on what processing we expect to be applied to the data. For example, it may make sense to aggregate some events as a way of compressing data. It may make sense to calculate summaries. It may be best to store some data in a columnar manner. With certain data, it may be better to store it in rows or in a graph database. Some data may be best stored in multiple engines. The point is to organize the data optimally so that it is ready to use. (A sketch of this kind of format selection follows the router sketch below.)

And while we might be thinking in terms of data being stored on disk or SSD, some data may, because of high usage, be placed in memory immediately.

We might like to think of the Data Hub as a database, which it logically is, but physically it will almost certainly consist of multiple engines. In our view, Hadoop (HDFS) and YARN will constitute one of those engines. HDFS will almost inevitably be the place where data is refined. Other data engines (one or more) will be optimized for specific workloads. It is worth noting that some of these workloads will involve analytic calculations as well as requests for data.

Optimizing the performance of the Data Hub will thus be complex. It will involve optimization that is, in a general sense, unprecedented. We can think of it in the following way:

If data is available for processing but not stored optimally for its future use, and a query is received that touches that data, then the data will be read into memory.

However, it is very likely that the data will need to be read into memory in order to refine it anyway. It is thus the case that, with some data, the refining process is also the time at which the physical organization of the data should be determined. If it is not possible to do it then, it should be done the first time the data is touched.

In reality, the processing load in the Data Hub is fairly complex. This is illustrated in Figure 6, which has been created on the basis of what we regard as necessary functions. Note that some software products may be capable of carrying out several of these functions.
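The following is a minimal sketch of the filtering, replicating and routing behaviour described above. The rule format, destinations and event shape are hypothetical; a production implementation would be a fault-tolerant, very low-latency messaging component rather than plain Python.

    # Sketch: the "traffic cop" for newly arriving events.
    # A rule may drop an event, or fan it out to one or more destinations
    # (the Data Hub, a Streaming App, or any app in the Application Layer).
    from typing import Callable

    Destination = Callable[[dict], None]

    class EventRouter:
        def __init__(self) -> None:
            # Each rule: (predicate, list of destinations). No destinations = drop.
            self.rules: list[tuple[Callable[[dict], bool], list[Destination]]] = []

        def add_rule(self, predicate, destinations) -> None:
            self.rules.append((predicate, destinations))

        def route(self, event: dict) -> None:
            for predicate, destinations in self.rules:
                if predicate(event):
                    for deliver in destinations:   # replicate to every destination
                        deliver(event)
                    return
            # No rule matched: delete the event from the stream.

    data_hub: list[dict] = []
    streaming_app: list[dict] = []

    router = EventRouter()
    router.add_rule(lambda e: e["type"] == "sensor_reading",
                    [data_hub.append, streaming_app.append])   # duplicate
    router.add_rule(lambda e: e["type"] == "heartbeat", [])    # drop

    router.route({"type": "sensor_reading", "value": 21.5})
    router.route({"type": "heartbeat"})
    print(len(data_hub), len(streaming_app))   # 1 1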

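And as a companion sketch, here is one way to picture choosing the physical organization of data at refining time, as discussed above. The workload profile, thresholds and engine names are invented for illustration only.

    # Sketch: pick a storage organization for refined data based on how we
    # expect it to be processed. The decision rules here are illustrative only.
    def choose_engine(profile: dict) -> list[str]:
        """Return the engine(s) best suited to the expected workload."""
        engines = []
        if profile.get("analytic_scans"):        # wide scans, aggregations
            engines.append("columnar_store")
        if profile.get("point_lookups"):         # row-at-a-time access
            engines.append("row_store")
        if profile.get("relationship_queries"):  # traversals
            engines.append("graph_store")
        if profile.get("access_frequency", 0) > 1000:
            engines.append("in_memory_cache")    # hot data goes straight to memory
        return engines or ["hdfs_raw"]           # otherwise leave it in the refinery

    print(choose_engine({"analytic_scans": True, "access_frequency": 5000}))
    # ['columnar_store', 'in_memory_cache']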

Figure 6: Processes in the Corporate Data Hub

The various functions we have illustrated above are as follows:

• Data Security: All data security procedures, including encryption both while data is at rest and while it is in motion, and all roles and responsibilities in respect of access rights, need to be implemented from the moment that data enters the Corporate Data Hub.

• Data Cleansing: Naturally, the full gamut of data cleansing activity, from simple data correction through data deduplication and disambiguation, can and should be applied here.

• Metadata Discovery: The data entering the Corporate Data Hub needs, before it is available for use, to have its metadata defined (to a given standard). This may involve a variety of processing activities, from the application of standard patterns to known data sources, to the use of semantics to determine, as accurately as possible, the meaning of the data. Some instances may require human intervention. (A small schema-inference sketch appears later in this section.)

• Metadata Management: The Corporate Data Hub will naturally accumulate a metadata resource (repository) that will need to be managed at both the physical and the logical level. The activity here is primarily one of assembling metadata catalogs and/or taxonomies that can provide access to metadata both for software and for users.


• MDM & Business Glossary: Master data management (MDM) is a natural extension of metadata management. It is collaboration on the business meaning of data and business terminology, which may bring to light both terminology variances and data aliases. The goal is that data users (including software developers) can fully understand the data they have access to.

• Data Mapping: Ideally, it will be possible to assemble and maintain a full data map of all data that is of interest to the corporation: not just the data stored within the Corporate Data Hub, but also metadata maps of data sources, data exports and data in motion.

• Data Lineage: The provenance and lineage of all data need to be captured and maintained. This is of particular importance for analytics activities, since bad data can lead to wrong or inaccurate conclusions and actions.

• Data Life Cycle Management: Given the above set of information, it will become possible to proactively manage the life cycle of events and derived data, to the point of data being retired and, if justified, deleted.

• Performance Monitoring & Management: This can be thought of as the low-level management of the data engine(s) to optimize the performance of individual workloads and individual data engines.

• Service Level Management: This is traditional service level management applied to the Corporate Data Hub. It involves the scheduling of workloads against available resources in order to meet agreed and targeted service levels.

• System Management: This involves all other system management activities surrounding data flow, including fire-fighting, software management, IT asset management, network management and so on.

• ETL & Data Virtualization: This is the export of data from the Data Hub, both for apps within the corporate environment and for data customers elsewhere.

All but the last six of these functions concern data refinement. The critical ones are data security, data cleansing and metadata discovery, since these need to be applied before the data is usable. In some circumstances, when the metadata is easily discoverable, a schema-on-read approach may be viable (sketched below).

Metadata management, MDM and business glossary, data mapping and data lineage may be deferrable, and may form part of a corporate effort to manage the data resource and improve its usability.

It could be said that the Big Data world of metadata is like the old world of metadata, except that it has many more data sources, and, just as with data from packaged software, the business has no control over the data definitions of the new sources. We could even think of this as "big metadata." If Big Data means far more data than we had before, then big metadata means far more data sources.

In our view, data lineage is likely to become an increasingly important aspect of data management, and particularly of the management of the Data Hub, partly because of the increasing importance of data analytics and partly because we are beginning to witness significant growth in the market for data itself. It is going to become progressively more difficult to sell data at the best price without being able to declare its provenance and lineage.
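As a minimal illustration of the metadata discovery and schema-on-read idea mentioned above, the following sketch infers a crude schema from raw delimited records at read time. The inference rules and field handling are assumptions for illustration, far simpler than a real discovery tool.

    # Sketch: crude schema-on-read. The schema is inferred when the data is
    # read, rather than being defined before the data is loaded.
    import csv, io

    def infer_type(values: list[str]) -> str:
        """Guess a column type from sample values (illustrative rules only)."""
        try:
            for v in values:
                int(v)
            return "integer"
        except ValueError:
            pass
        try:
            for v in values:
                float(v)
            return "float"
        except ValueError:
            return "string"

    def discover_schema(raw_csv: str) -> dict[str, str]:
        rows = list(csv.DictReader(io.StringIO(raw_csv)))
        return {col: infer_type([r[col] for r in rows]) for col in rows[0]}

    sample = "order_id,amount,region\n1,19.99,west\n2,5.00,east\n"
    print(discover_schema(sample))
    # {'order_id': 'integer', 'amount': 'float', 'region': 'string'}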


This also explains, to some degree, the relevance of data life cycle management. When data becomes valuable, its inventory needs to be known and actively managed.

It may be obvious to the reader that performance management and service level management will be of critical importance across the Data Hub. This is no longer just about the individual performance of specific data engines, as it may be in a traditional data warehouse environment; it is about the service level of mission-critical systems.

The Data Hub will inevitably be mission critical in the full meaning of the term. As such, system management is as relevant here as it is to all other operational systems.

Finally, it is worth noting that ETL and data virtualization are about data flow. The data shipping done here may well be time critical and hence performance critical.
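To make the lineage and inventory point concrete, here is a minimal sketch of the kind of provenance record that might accompany each data set in the Hub. The fields and the derivation chain are hypothetical illustrations only.

    # Sketch: a simple lineage record attached to each data set in the Hub.
    # Each derived data set keeps a pointer to its sources and the step applied,
    # so provenance can be walked back to the original data.
    from dataclasses import dataclass, field

    @dataclass
    class LineageRecord:
        dataset: str
        produced_by: str                          # refining or processing step
        sources: list["LineageRecord"] = field(default_factory=list)

        def provenance(self) -> list[str]:
            """Walk back through the derivation chain to the raw sources."""
            if not self.sources:
                return [self.dataset]
            chain = []
            for src in self.sources:
                chain.extend(src.provenance())
            return chain

    raw = LineageRecord("clickstream_raw", "ingest")
    cleaned = LineageRecord("clickstream_clean", "cleansing", [raw])
    summary = LineageRecord("daily_click_summary", "aggregation", [cleaned])
    print(summary.provenance())   # ['clickstream_raw']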

The Big Data Architecture in Summary

We consider "Big Data" to be poor terminology. We prefer to think in terms of Big Processing rather than Big Data. For many decades, computer users profited from Moore's Law. Roughly speaking, it multiplied computer power by a factor of 10 every 6 years, and thus the IT revolution rolled forward at what seemed like a remarkable and perhaps unsustainable pace.

In fact, that pace proved sustainable, and with the advent of multicore CPUs, which in turn provoked the software industry to begin to embrace parallel processing, the pace quickened considerably. It quickened so much, in fact, that for scale-out server-based applications we believe that in the years 2013 to 2020 speed will increase by a factor of 1,000.

This, in our view, provides the foundation of a Big Data Information Architecture (BDIA) that is based on parallelism and will leverage that power.

In terms of hardware, we observe the following:

• Applications will be built to exploit hierarchical memory: on-chip CPU caches (L1, L2 and L3), DRAM and SSD (flash memory).

• Flash memory speeds are accelerating in line with Moore's Law.

• Spinning disk will soon disappear.

• It is possible that system-on-a-chip (SoC) technology will further disrupt the hardware layer.

In terms of data, we observe:

• The basic atom of processing (item of data) is the event. We will build applications and systems that are event-based. We identify four types of event: instantiation events, state report events, trigger events and correction events. Current data modeling methodologies will be enhanced to enable events to be included in data models. (A small sketch of these event types appears at the end of this summary.)

• We can view any processing entity (organization, business, individual, etc.) as processing a flow of events. Events will be aggregated to create more complex data entities, which can be thought of as derived data.

• Organizations now need to cater for external data. Some will need to cater a great deal for such data, from many data sources.

• In processing events and ingesting external data, there will need to be a focus on data refinement.

We felt it necessary, in researching Big Data, to spend significant time investigating Hadoop. Our conclusions are as follows:

• In its early releases (prior to 2.0), Hadoop may have been popular and excited the imagination of developers, but it had limited capability and applicability. This changed with the advent of Hadoop 2.0, which included the YARN scheduler and broke the link between HDFS and MapReduce.

• In our view, Hadoop's HDFS is already the default scale-out file system for IT. Its importance in this role is difficult to overestimate. Some software vendors are already building databases on top of it. There is a considerable need for this kind of capability.

• Hadoop itself has the possibility of becoming the operating system for the data layer if the sophistication of YARN gradually increases.


• In our view, Hadoop has a unique and important role to play in a BDIA. It will act as a complement to one or more purpose-built scale-out databases, which together will make up the Corporate Data Hub.

• It may evolve into this role from its use as a data staging area for a data warehouse.

In our view, the primary application for Big Data technology and architecture is data analytics, and it will remain so. In light of this, it is worth noting that data analytics is a process that mixes a mathematical workload with a data access workload. As yet, no databases, even very recently constructed ones, have been specifically built for this workload.

In our view, the BDIA is fundamentally a data flow architecture. Specifically:

• It needs to cater for data flows rather than be designed to process data at rest.

• Though it may seem paradoxical, it will be far better in most circumstances to move the processing to the data rather than move the data to the processing.

Our current model of a BDIA includes the following characteristics:

• The Corporate Data Hub, the heart of the BDIA, can be viewed as involving four activities: data acquisition, data refining, data processing and data shipping.

• It is an event-based architecture, which can be conceived of as founded on the receipt and processing of events.

• Events are initially handled by a filtering, replicating and routing process that directs them to the appropriate destination. All events are directed to the Data Hub; some may be replicated and directed elsewhere (to streaming apps or possibly other apps).

• Applications either use the Data Hub directly or use data extracts or data marts provided by the Data Hub. They may also have local data, but all relevant events that occur within the applications are directed to the Data Hub for storage.

• The data storage capabilities of the Data Hub include both a key/value store and database(s). The key/value store is where data refining occurs; the databases provide fast access to data. The Data Hub can be treated as a single logical data store even though it may be constituted from multiple data access engines.

The above conclusions were arrived at primarily by the researchers of this report, but also involved discussion and collaboration with IT users, IT vendors (both hardware and software vendors) and other analysts who work with The Bloor Group.
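As referenced in the summary above, the following is a minimal sketch of the four event types identified in this report, with a simple event record of the kind that could be included in a data model. The field names are illustrative assumptions.

    # Sketch: the four event types identified in this report, as a simple
    # enumeration, plus an event record that carries one of them.
    from dataclasses import dataclass
    from datetime import datetime, timezone
    from enum import Enum
    from typing import Any

    class EventType(Enum):
        INSTANTIATION = "instantiation"   # something new comes into existence
        STATE_REPORT = "state_report"     # a report of the state of something
        TRIGGER = "trigger"               # provokes an action in an application
        CORRECTION = "correction"         # corrects previously recorded data

    @dataclass
    class Event:
        event_type: EventType
        source: str                       # device, application or stream of origin
        payload: dict[str, Any]
        occurred_at: datetime

    e = Event(EventType.STATE_REPORT, "sensor-42",
              {"temperature_c": 21.5}, datetime.now(timezone.utc))
    print(e.event_type.value, e.payload)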

SPONSORS OF THIS REPORT INCLUDE:

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.InsideAnalysis.com for more information. The Bloor Group is the sole copyright holder of this publication.

PO Box 200638 | Austin TX 78720 | 512-524-3689
