Big Data and the Nature of Business Decisions
April, 2013
Mark Madsen www.ThirdNature.net @markmadsen
Our ideas about information and how it’s used are outdated.
How We Think of Users
The conventional design point is the passive consumer of information.
Proof: methodology
▪ IT role is requirements, design, build, deploy, administer
▪ User role is receive data
Self‐service is not like picking the right doughnut from a box.
How We Think of Users / How We Want Users to Think of Us
Our design point is the passive consumer of information.
Proof: methodology
▪ IT role is requirements, design, build, deploy, administer
▪ User role is run reports
Self‐serve BI is not like picking the right doughnut from a box.
How We Think of Users / What Users Really Think
Food supply chain: an analogy for data
Multiple contexts of use, differing quality levels
“What do you mean, “only doughnuts?””
“I never said the “E” in EDW meant “everything”…”
It’s going to get a lot bigger
E
Not E!
Everything is digital. It’s no longer just rows and columns, it’s bits.
The sensor data revolution
Sensor data doesn’t fit well with current methods of collection and storage, or with the technology to process and analyze it.
Copyright Third Nature, Inc. Unstructured is really unmodeled. We turn text into data, but we don’t model it by hand.
▪ Sentiment, tone, opinion
▪ Words & counts, keywords, tags
▪ Topics, genres, relationships, abstracts
▪ Categories, taxonomies
▪ Entities: people, places, things, events, IDs
Three kinds of measurement data we collect
The convenient data is transactional data.
▪ Goes in the DW and is used, even if it isn’t the right measurement.
The difficult and misleading data is declarative data.
▪ What people say and what they do require ground truth.
The inconvenient data is observational data.
▪ It’s not neat, clean, or designed into most systems of operation.
We need to make use of all three.
“Big data is unprecedented.” ‐ Anyone involved with big data in even the most barely perceptible way
We’ve been here before
Source: Bill Schmarzo, EMC
BI is now a commodity, a cost of doing business
Big Data, Big Hype
$876 Gajillion (analyst estimates of the big data market)
“Big” is the oldest, easiest problem to solve
Image courtesy of Teradata
“Big” is well supported by databases now
Source: Noumenal, Inc.
Commoditization is the fundamental driver
[Chart: calculations per second per $1000 (log scale), 1900 to 2000, across the mechanical, relay, vacuum tube, transistor, and integrated circuit eras; annotated “10,000 X improvement”. Data: Ray Kurzweil, 2001]
Storage costs have declined with computing costs
With big data systems, the cost of storing data is an order of magnitude lower than with databases today (but not the cost or ability to query it back out). Processing data at scale is at least an order of magnitude cheaper too. Source: Venturebeat
Parallel computing: the underlying technology
“In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.” Grace Hopper
Cloud Computing: A Big Data Enabler
What you see: seemingly infinite resource to apply to computing problems on short notice and at low cost
Key impacts of cloud computing model: utility computing
▪ Pay for the resources you use, when you use them
▪ Expense instead of capital
▪ Elastic: scale up and down, like a utility
▪ Speed to acquire and deploy resources
But: cloud is built for scalability, not DB response time
Two big data straw men used by vendors in our market
“It’s a poor man’s ETL!” “It’s a poor man’s database!”
Hadoop: a summary of the magic
1. Provides both storage and complex processing as part of the same platform
2. Makes parallel programming more accessible
3. Schemaless, therefore flexible
4. Inexpensive, reliable scale‐out
5. Potential for fast, scalable ingest
6. The Apache version is free
The bad stuff:
▪ Not for mutable data
▪ Simple file‐based sequential processing
▪ Zero data management
An important Hadoop + cloud computing benefit
Scalability is free – if your task requires 10 units of work, you can decide when you want results:
▪ 1 server, 10 units of time
▪ 10 servers, 1 unit of time
Cost is the same. Not true of the conventional IT model.
Quantitative differences can be qualitative
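The “scalability is free” arithmetic from the Hadoop + cloud slide can be checked with a minimal sketch. It assumes the work divides perfectly across servers with no coordination overhead and that you pay a flat rate per server per unit of time; the function name `run_plan` and the rate of 1.0 are illustrative, not from the talk.

```python
def run_plan(work_units: float, servers: int, rate: float = 1.0) -> tuple[float, float]:
    """Return (elapsed_time, total_cost) for perfectly parallel work.

    Assumptions (not from the slides): work splits evenly, no overhead,
    and billing is `rate` per server per unit of time.
    """
    if servers < 1:
        raise ValueError("need at least one server")
    elapsed = work_units / servers      # more servers, less waiting
    cost = servers * elapsed * rate     # pay only for server-time used
    return elapsed, cost


slow = run_plan(10, servers=1)    # 10 units of time on one server
fast = run_plan(10, servers=10)   # 1 unit of time on ten servers

print("1 server:  ", slow)
print("10 servers:", fast)
```

Both plans consume the same ten server-time units, so the bill is identical and only the wait changes. With purchased, fixed-capacity hardware the ten servers have to be bought up front, which is why the slide says this does not hold for the conventional IT model.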
Eadweard Muybridge, 1878
“Faster” is a qualitative difference (when there’s enough of it)
Big data enables this kind of “faster” for processing workloads, as well as deeper analytics and new analytics
Big data value: There’s a pony in there somewhere…
The myth that still drives big data
All we need is a fat pipe and pans working in parallel…
You change an org by acting with and through others, not alone.
What really happens to most great insights
If you don’t have a way to turn that insight into an action within the organization, then you are producing expensive trivia.
The Three Four Six Many V’s of Big Data
I got a fever, and the only prescription that can cure it is MORE V’s!
Common belief: the more V’s you have, the more budget you get.
Much of the big data value comes from analytics
BI is a retrieval problem, not a computational problem.
Five basic things you can do with analytics
▪ Prediction – what is most likely to happen?
▪ Estimation – what’s the future value of a variable?
▪ Description – what relationships exist in the data?
▪ Simulation – what could happen?
▪ Prescription – what should you do?
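To make the retrieval versus computation distinction concrete, here is a small sketch. The monthly sales figures are invented for illustration, and the straight-line fit is just one simple way to do estimation, not a method the talk prescribes.

```python
# Hypothetical monthly sales, keyed by month number (illustrative data only).
monthly_sales = {1: 100.0, 2: 110.0, 3: 120.0, 4: 130.0}


def retrieve(month: int) -> float:
    """BI-style question: what happened? A pure lookup of a stored fact."""
    return monthly_sales[month]


def fit_line(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate(month: int) -> float:
    """Analytics-style question: what is the future value? Requires a model."""
    slope, intercept = fit_line(sorted(monthly_sales.items()))
    return slope * month + intercept


print("Month 3 (retrieved):", retrieve(3))
print("Month 5 (estimated):", estimate(5))
```

The first call only finds a number that already exists. The second has to compute something that is stored nowhere, and its answer carries uncertainty that a lookup never has.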
Analytic Maturity: This is Nonsense
Organizations do all of these in different places at different times. “Why” is the hardest question to answer, not a factual question. This model isn’t built around what people do. It’s built around classes of technology.
[Chart: business value (low to high) against organizational maturity, with four stages]
▪ Reporting: “What happened?” Static & interactive reports
▪ Analysis: “Why did it happen?” Query, Excel, OLAP, visual discovery
▪ Operational BI / realtime BI: “What’s happening?” Dashboards, scorecards
▪ Predictive analytics: “What will happen?” Statistics, data mining, optimization
Two keys to making big data worthwhile
Value: goal → solution, not solution → goal
Actionability: simple “value” isn’t enough. Information has to be actionable, somehow.
We think of BI as publishing, an old metaphor.
Publishing has value, but may not be actionable.
Data is not the end of the line, it’s the departure point
We ignored the important tasks that deliver value.
Decisions are the starting point for most of the organization.
A decision is a choice between options in a situation involving uncertainty, with a risk that the outcome won’t meet a goal.
Planning data strategy means understanding the context of data use so we can build infrastructure
We need to focus on what people do with information as the primary task, not on the data or the technology.
[Diagram: Monitor → Analyze Exceptions → Analyze Causes → Decide → Act. Exits along the way: No problem, No idea, Do nothing.]

General model for organizational use of data
[Same diagram, with a path added: Act within the process. Usually real-time to daily.]

General model for organizational use of data
[Same diagram, with a path added: Collect new data, act on the process. Usually days/longer timeframe.]

You need to be able to support both paths
▪ Collect new data, act on the process: causal analysis, “data science”
▪ Act within the process: conventional BI, addition of EDM
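The monitor, analyze, decide, act flow in the diagrams above can be sketched as a single function. This is an illustrative reading, not the author's implementation: which stage each exit (“No problem”, “No idea”, “Do nothing”) leaves from is an assumption, and the stock-level example and all function names are hypothetical.

```python
from typing import Callable, Optional


def decision_loop(
    reading: float,
    is_exception: Callable[[float], bool],
    find_cause: Callable[[float], Optional[str]],
    choose_action: Callable[[str], Optional[str]],
) -> str:
    """Walk one reading through monitor -> analyze -> decide -> act."""
    # Monitor / analyze exceptions: most readings are unremarkable.
    if not is_exception(reading):
        return "no problem"
    # Analyze causes: an exception without an explanation stalls here.
    cause = find_cause(reading)
    if cause is None:
        return "no idea"
    # Decide: choosing not to act is still a possible outcome.
    action = choose_action(cause)
    if action is None:
        return "do nothing"
    # Act: the only branch that changes anything in the business.
    return f"act: {action}"


# Hypothetical example: a daily stock level for one product.
def low_stock(level: float) -> bool:
    return level < 20


def stock_cause(level: float) -> Optional[str]:
    return "supplier delay" if level < 10 else None


def stock_action(cause: str) -> Optional[str]:
    return "reorder from backup supplier" if cause == "supplier delay" else None


outcomes = [decision_loop(x, low_stock, stock_cause, stock_action) for x in (50, 15, 5)]
print(outcomes)
```

Only one of the three readings ends in an action. The other two stop at monitoring or at an unexplained exception, which is where most reports and dashboards leave people.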
Act: the part that creates the most problems
Decision Assumptions
Deliberation
▪ Actions are consciously chosen.
Rationality
▪ People make logical decisions. Sure they do.
Order
▪ Systems are understandable and the results of actions predictable.
What’s the reality in most organizations?
Irrationality, vanity, unreasonable behavior, politics, bureaucracy, doing the same things repeatedly.
Where data really comes from
A very abstract business intelligence model
Who are the people making decisions?
[Pyramid: Strategic, Tactical, Operational]
The process aspect of decisions connects people
Scope of control for people in most organizations aligns: in process, on process, over process
[Pyramid: Strategic, Tactical, Operational]
The exceptions not handled at one level due to rule / procedure / policy deficiency are escalated to the next.
What is the nature of their decisions?
Scope, time frame of decision, time scale of data, data volume, breadth of data, frequency, pattern vs fact‐based
▪ Strategic: months; pattern‐based; broad scope
▪ Tactical: days to weeks; fact‐based; moderate scope
▪ Operational: minutes to days; rule‐based; narrow scope
[Pyramid axis: analytic complexity]
How and where can you apply information?
▪ Strategic: high single value, less frequent, so improve the effectiveness of individual decisions.
▪ Tactical: fuzzy middle ground.
▪ Operational: low single value, frequent, can improve the efficiency or the effectiveness for large aggregate improvement.
[Pyramid axis: analytic complexity]
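The trade-off between a few high-value decisions and many low-value ones comes down to simple arithmetic. The sketch below uses entirely made-up numbers (decision values, volumes, and a 5% improvement) purely to show the shape of the calculation; none of them come from the talk.

```python
def annual_gain(value_per_decision: float, decisions_per_year: int, improvement: float) -> float:
    """Aggregate yearly gain from improving every decision by a fixed fraction."""
    return value_per_decision * decisions_per_year * improvement


# Hypothetical strategic decisions: rare, each worth a lot.
strategic = annual_gain(value_per_decision=500_000, decisions_per_year=4, improvement=0.05)

# Hypothetical operational decisions: tiny individually, made constantly.
operational = annual_gain(value_per_decision=2, decisions_per_year=2_000_000, improvement=0.05)

print(f"strategic:   {strategic:,.0f}")
print(f"operational: {operational:,.0f}")
```

Under these assumed figures the operational improvement is worth more in aggregate, even though no single operational decision matters much. That is the case for applying analytics to frequent, low-value decisions as well as to rare, high-value ones.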
Strategy to Execution
What kind of support do people have today?
▪ Strategic: dashboards, scorecards, but mainly other people
▪ Tactical: email, meetings, dashboards
▪ Operational: reports, dashboards (realm of traditional BI)
Reality of most reports and dashboards is that they provide basic monitoring at best.
Differing decision goals and needs create tension
Managing the business:
• Want change
• Seek adaptation
Operating the business:
• Want stability
• Seek consistency
[Pyramid: Strategic, Tactical, Operational]
There is a difference between operating a business and managing a business. Most BI / BA today supports operating.
Business management has changed due to information
Our simplistic notions of BI with stable models, ordered data and predictability are being replaced by concepts from decision support and complex adaptive systems (CAS).

Simple
▪ Assumption: Order
▪ Cause and effect is repeatable & predictable
▪ Known
▪ Standard processes, clear metrics, best practice
▪ Sense, categorize, respond
▪ Reporting, dashboards

Complicated
▪ Assumption: Unorder
▪ Cause and effect is separated in time & space, repeatable, learnable
▪ Knowable
▪ Analytical techniques to determine options, effects
▪ Sense, analyze, respond
▪ Ad‐hoc, OLAP, exploration

Complex
▪ Assumption: Disorder
▪ Cause and effect is coherent in retrospect only, modelable but changing
▪ Unpredictable
▪ Experiment to create possible options
▪ Test, sense, respond
▪ Data science, causal analysis
Situational context governs data use
Business intelligence support varies by decision context

Basic BI: handles this really well (most of the time).
▪ Assumption: Order
▪ Cause and effect is repeatable & predictable
▪ Known
▪ Standard processes, clear metrics, best practice
▪ Sense, categorize, respond
▪ Reporting, dashboards

Analysis: handles this sort of ok, sometimes.
▪ Assumption: Unorder
▪ Cause and effect is separated in time & space, repeatable, learnable
▪ Knowable
▪ Analytical techniques to determine options, effects
▪ Sense, analyze, respond
▪ Ad‐hoc, OLAP, data discovery

Data science, analytics: this, not so much.
▪ Assumption: Disorder
▪ Cause and effect is coherent in retrospect only, modelable but changing
▪ Unpredictable
▪ Experiment to create possible options, test hypotheses
▪ Test, sense, respond
▪ Causal analysis, simulation
The usage models for conventional BI
This is what we’ve been doing with BI so far: static reporting, dashboards, ad-hoc query, OLAP.
[Diagram: Monitor → Analyze Exceptions → Analyze Causes → Decide → Act. Exits: No problem, No idea, Do nothing. Paths: act within the process (usually real-time to daily); collect new data and act on the process (usually days/longer timeframe).]

The usage models for analytics and “big data”
Analytics and big data is focused on new use cases: deeper analysis, causes, prediction, optimizing decisions. This isn’t ad-hoc, reporting, or OLAP.
[Diagram: Monitor → Analyze Exceptions → Analyze Causes → Decide → Act. Exits: No problem, No idea, Do nothing. Paths: act within the process (usually real-time to daily); collect new data and act on the process (usually days/longer timeframe).]
Somewhere along the way, the BI community lost sight of the real goal
The M-OODA loop, Rousseau & Breton, 2004
Where does our current infrastructure have trouble?
▪ Cost of growth, storing data
▪ Cost of and ability to deliver analytics
▪ Using non‐tabular data, like text and documents
▪ Supporting use of information in real time
▪ Time to deliver information for new business projects
▪ Supporting people in analysis
Hadoop Adoption
Some people can’t resist getting the next new thing because it’s new. Many IT organizations are like this, promoting a solution and hunting for the problem that matches it.
Better to ask “What is the problem for which Hadoop is the answer?”
Business Intelligence vs Big Data / Analytics
Business Intelligence:
▪ focus is on retrieval and delivery of data
▪ monitoring and identifying exceptions
▪ little variability, ambiguity, uncertainty
▪ reporting, dashboards, scorecards, OLAP for bounded exploration and analysis
Business Analytics:
▪ focus is on generation of new data, insight/foresight
▪ exploring data, finding insights
▪ expect uncertainty and probability and pattern rather than specific data
▪ computational / probabilistic techniques
Both need to focus on action and goals to succeed.
There’s a shift in how we view and use analytics
P2M M2P or M2M
Big changes for data warehousing workloads
The results of analytic processing can, and often do, feed back into the system from which they originate.
Much of the data is being read, written and processed in real time.
Our design point was not changing tables and ephemeral patterns.
Four core capabilities big data adds
1. Unlimited scale of storage, processing
▪ Agility, faster turnaround for new data requests (but not a replacement for BI)
▪ Fewer staff to accomplish same goals
2. New data accessibility
▪ More data retained for longer period
▪ Access to data unused due to cost or processing limits
▪ Any digital information becomes usable data
3. Scalable realtime processing
▪ Brings ability to monitor and act on data as events occur
4. Arbitrary analytics
▪ Faster analysis
▪ Deeper analysis
▪ More broadly accessible analytics
Big Data Shift in a Nutshell
It’s an architectural reconfiguration, just like web 2.0
The old model for data
▪ Read only
▪ Integrate before use
▪ Record only important data
▪ Retrieval‐focused
▪ Single method of access
▪ Deterministic models & use
▪ Human‐level latency
▪ Centralized publishing
The new model for data
▪ Read‐write
▪ Integrate at time of use
▪ Record all the data
▪ Processing‐focused
▪ Multiple methods of access
▪ Stochastic models & use
▪ Machine‐level latency
▪ Community creation
As a technology moves from emerging to commodity the nature of acquiring, using and managing it changes
Innovation
▪ Generate options
▪ Innovation
▪ Novel practice
▪ Maximize value
Maturation
▪ Constrain choices
▪ Adaptation
▪ Good practice
▪ Optimize
Saturation
▪ Standardize / minimize choice
▪ Acquisition
▪ Best practice
▪ Minimize costs
Methods: Agile & open source* methods; 6 Sigma & process methods
Best of Breed or Integrated?
IT mega-vendors rarely offer value in an early market
Designing for data: monolithic vendor technology‐based classifications of the ecosystem won’t help
These types of eye charts provide a categorization of what’s available, not what you need. They ignore the contexts of use that are most important.
State of the market
It’s a supply‐side market. VC accelerated technology development beyond the ability of most organizations to adopt.
We are in an early stage. People are expensive, machines are cheap, so delivery will change.
We need to develop new skills and learn how to apply the new technology, data and techniques to business problems as in the ’80s.
“When a new technology rolls over you, you're either part of the steamroller or part of the road.” – Stewart Brand
Questions?
About the Presenter
Mark Madsen is president of Third Nature, a research and advisory firm focused on analytics, business intelligence and data management. Mark is an award‐winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker and a contributor at Forbes Online and Information Management. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
About Third Nature
Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, and performance management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place. Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors. We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.
CC Image Attributions
Thanks to the people who supplied the creative commons licensed images used in this presentation:
Outdated gumshoe.jpg – http://flickr.com/photos/olivander/372385317/
donuts_4_views.jpg ‐ http://www.flickr.com/photos/le_hibou/76718773/
wheat_field.jpg ‐ http://www.flickr.com/photos/ecstaticist/1120119742/
straw men.jpg ‐ http://www.flickr.com/photos/robinellis/6034919721/
ponies in field.jpg ‐ http://www.flickr.com/photos/bulle_de/352732514/
train_to_sea.jpg ‐ http://www.flickr.com/photos/innoxiuss/457069767/
chinatown little color gate.jpg ‐ http://www.flickr.com/photos/paullikespics/3248133830/
where data really comes from ‐ Blake Stacy
klein_bottle_red.jpg ‐ http://flickr.com/photos/sveinhal/2081201200/