The Parable of Google Flu: Traps in Big Data Analysis


Policy Forum: Big Data

Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.

David Lazer,1,2* Ryan Kennedy,1,3,4 Gary King,3 Alessandro Vespignani3,5,6

1 Lazer Laboratory, Northeastern University, Boston, MA 02115, USA. 2 Harvard Kennedy School, Harvard University, Cambridge, MA 02138, USA. 3 Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA. 4 University of Houston, Houston, TX 77204, USA. 5 Laboratory for the Modeling of Biological and Sociotechnical Systems, Northeastern University, Boston, MA 02115, USA. 6 Institute for Scientific Interchange Foundation, Turin, Italy. *Corresponding author. E-mail: [email protected].

In February 2013, Google Flu Trends (GFT) made headlines, but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5–7) and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories (8). We explore two issues that contributed to GFT's mistakes—big data hubris and algorithm dynamics—and offer lessons for moving forward in the big data age.

Big Data Hubris

"Big data hubris" is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Elsewhere, we have asserted that there are enormous scientific possibilities in big data (9–11). However, quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data (12). The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.

The initial version of GFT was a particularly problematic marriage of big and small data. Essentially, the methodology was to find the best matches among 50 million search terms to fit 1152 data points (13). The odds of finding search terms that match the propensity of the flu but are structurally unrelated, and so do not predict the future, were quite high. GFT developers, in fact, report weeding out seasonal search terms unrelated to the flu but strongly correlated to the CDC data, such as those regarding high school basketball (13). This should have been a warning that the big data were overfitting the small number of cases—a standard concern in data analysis. This ad hoc method of throwing out peculiar search terms failed when GFT completely missed the nonseasonal 2009 influenza A–H1N1 pandemic (2, 14). In short, the initial version of GFT was part flu detector, part winter detector. GFT engineers updated the algorithm in 2009, and this model has run ever since, with a few changes announced in October 2013 (10, 15).

Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time. GFT also missed by a very large margin in the 2011–2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011 (see the graph). These errors are not randomly distributed. For example, last week's errors predict this week's errors (temporal autocorrelation), and the direction and magnitude of error varies with the time of year (seasonality). These patterns mean that GFT overlooks considerable information that could be extracted by traditional statistical methods.

Even after GFT was updated in 2009, the comparative value of the algorithm as a stand-alone flu monitor is questionable. A study in 2010 demonstrated that GFT accuracy was not much better than a fairly simple projection forward using already available (typically on a 2-week lag) CDC data (4). The comparison has become even worse since that time, with lagged models significantly outperforming GFT (see the graph). Even 3-week-old CDC data do a better job of projecting current flu prevalence than GFT [see supplementary materials (SM)].

Considering the large number of approaches that provide inference on influenza activity (16–19), does this mean that the current version of GFT is not useful? No, greater value can be obtained by combining GFT with other near-real-time health data (2, 20). For example, by combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone (see the chart). This is no substitute for ongoing evaluation and improvement, but, by incorporating this information, GFT could have largely healed itself and would have likely remained out of the headlines.
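The overfitting risk described above, picking the best matches among 50 million search terms to fit only 1152 data points, can be illustrated with a small synthetic simulation. Everything below is a sketch on made-up random data: the candidate count is scaled down for speed, and no real GFT or CDC series are involved. The point is only that, with enough structurally unrelated candidates, some will correlate strongly with any short target series purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# A short seasonal "target" series standing in for the CDC data points
# (shortened to 104 weekly observations for speed).
target = np.sin(np.linspace(0, 8 * np.pi, 104)) + 0.3 * rng.standard_normal(104)

# Candidate predictors standing in for millions of unrelated search-term series.
n_candidates = 50_000
candidates = rng.standard_normal((n_candidates, target.size))

# Correlate every candidate with the target and keep the best in-sample match.
t = (target - target.mean()) / target.std()
c = (candidates - candidates.mean(axis=1, keepdims=True)) / candidates.std(axis=1, keepdims=True)
corrs = c @ t / target.size  # Pearson r for each candidate
best = np.abs(corrs).max()

# A typical candidate correlates near zero, but the best of 50,000 does not.
print(f"best |r| among {n_candidates} random series: {best:.2f}")
```

The winning candidate here has no predictive value at all; it is selected noise, which is exactly why the flu-unrelated but CDC-correlated terms (such as high school basketball) should have been read as a warning sign.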
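The combined model discussed in the closing paragraph of this section can be sketched with ordinary least squares. The series, lag, and coefficients below are invented stand-ins: the sketch assumes only the structure the authors describe (lagged CDC values, a current GFT-style estimate, and 52-week seasonality terms), not their actual specification or data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic weekly %ILI with annual seasonality, plus a biased "GFT-like"
# nowcast of it. Both are stand-ins for the real CDC and GFT series.
weeks = np.arange(260)
season = 52
ili = 2.0 + 1.5 * np.sin(2 * np.pi * weeks / season) + 0.2 * rng.standard_normal(weeks.size)
gft = ili + 0.8 * np.sin(2 * np.pi * weeks / season) + 0.3 * rng.standard_normal(weeks.size)

lag = 2  # CDC reports typically arrive on about a 2-week lag

# Design matrix: lagged CDC value, current GFT estimate, 52-week seasonality.
X = np.column_stack([
    ili[:-lag],                                # CDC data available at prediction time
    gft[lag:],                                 # current GFT-style estimate
    np.sin(2 * np.pi * weeks[lag:] / season),  # seasonality terms
    np.cos(2 * np.pi * weeks[lag:] / season),
    np.ones(weeks.size - lag),
])
y = ili[lag:]

train = slice(0, 200)
holdout = slice(200, None)
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
pred = X[holdout] @ beta

mae_combined = float(np.mean(np.abs(pred - y[holdout])))
mae_gft = float(np.mean(np.abs(gft[lag:][holdout] - y[holdout])))
print(f"MAE, GFT alone: {mae_gft:.3f};  combined model: {mae_combined:.3f}")
```

On the holdout weeks the recalibrated combination beats the raw biased nowcast, mirroring the paper's observation that lagged CDC data and seasonality can substantially improve on GFT alone.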
www.sciencemag.org SCIENCE, Vol. 343, 14 March 2014, p. 1203

Algorithm Dynamics

All empirical research stands on a foundation of measurement. Is the instrumentation actually capturing the theoretical construct of interest? Is measurement stable and comparable across cases and over time? Are measurement errors systematic? At a minimum, it is quite likely that GFT was an unstable reflection of the prevalence of the flu because of algorithm dynamics affecting Google's search algorithm. Algorithm dynamics are the changes made by engineers to improve the commercial service and by consumers in using that service. Several changes in Google's search algorithm and user behavior likely affected GFT's tracking. The most common explanation for GFT's error is a media-stoked panic last flu season (1, 15). Although this may have been a factor, it cannot explain why GFT has been missing high by wide margins for more than 2 years. The 2009 version of GFT has weathered other media panics related to the flu, including the 2005–2006 influenza A/H5N1 ("bird flu") outbreak and the 2009 A/H1N1 ("swine flu") pandemic. A more likely culprit is changes made by Google's search algorithm itself.

[Figure] GFT overestimation. GFT overestimated the prevalence of flu in the 2012–2013 season and overshot the actual level in 2011–2012 by more than 50%. From 21 August 2011 to 1 September 2013, GFT reported overly high flu prevalence 100 out of 108 weeks. (Top) Estimates of doctor visits for ILI (%ILI, July 2009 to July 2013). "Lagged CDC" incorporates 52-week seasonality variables with lagged CDC data. "Google Flu + CDC" combines GFT, lagged CDC estimates, lagged error of GFT estimates, and 52-week seasonality variables. (Bottom) Error, as a percentage of baseline: [(non-CDC estimate) - (CDC estimate)]/(CDC estimate). Both alternative models have much less error than GFT alone. Mean absolute error (MAE) during the out-of-sample period is 0.486 for GFT, 0.311 for lagged CDC, and 0.232 for combined GFT and CDC. All of these differences are statistically significant at P < 0.05. See SM.

The Google search algorithm is not a static entity—the company is constantly testing and improving search. For example, the official Google search blog reported 86 changes in June and July 2012 alone (SM). Search patterns are the result of thousands of decisions made by the company's programmers in various subunits and by millions of consumers worldwide.

Blue team dynamics arise when the algorithm producing the data has been modified by the service provider in accordance with their business model. Google reported in June 2011 that it had modified its search results to provide suggested additional search terms and reported again in February 2012 that it was now returning potential diagnoses for searches including physical symptoms like "fever" and "cough" (21, 22). The former recommends searching for treatments of the flu in response to general flu inquiries, and the latter may explain the increase in some searches to distinguish the flu from the common cold. Search behavior is not only exogenously determined by external events; it is also endogenously cultivated by the service provider.

Blue team issues are not limited to Google. Platforms such as Twitter and Facebook are always being re-engineered, and whether studies conducted even a year ago on data collected from these platforms can be replicated in later or earlier periods is an open question.

There are multiple challenges to replicating GFT's original algorithm. GFT has never documented the 45 search terms used, and the examples that have been released appear misleading (14) (SM). Google does provide a service, Google Correlate, which allows the user to identify search data that correlate with a given time series; however, it is limited by the service provider in accordance with its business model. Although it does not appear to be an issue in GFT, scholars should also be aware of the
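The error diagnostics the authors emphasize, persistent overestimation and temporally autocorrelated errors, are straightforward to compute once an estimate series and a benchmark series are in hand. The series below are synthetic stand-ins with a built-in upward bias and an AR(1) error process, chosen only to mirror the qualitative pattern reported in the figure; none of the numbers are the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 108  # same window length as the paper's 108 weeks from August 2011

# Synthetic stand-ins: a seasonal "CDC" benchmark and a biased "GFT" estimate
# whose multiplicative error follows an AR(1) process (persistent, not random).
weeks = np.arange(n)
cdc = 2.0 + 1.5 * np.abs(np.sin(2 * np.pi * weeks / 52)) + 0.1 * rng.standard_normal(n)
ar_err = np.zeros(n)
for t in range(1, n):
    ar_err[t] = 0.8 * ar_err[t - 1] + 0.1 * rng.standard_normal()
gft = cdc * (1.3 + ar_err)  # about 30% overestimation on average

# Error as a percentage of baseline, as defined in the figure caption:
# [(non-CDC estimate) - (CDC estimate)] / (CDC estimate)
pct_err = 100 * (gft - cdc) / cdc

high_weeks = int((pct_err > 0).sum())              # weeks estimated high
r1 = np.corrcoef(pct_err[:-1], pct_err[1:])[0, 1]  # lag-1 autocorrelation

print(f"weeks high: {high_weeks}/{n}, lag-1 error autocorrelation: {r1:.2f}")
```

With the real series, these same two statistics, the share of weeks estimated high and the lag-1 autocorrelation of errors, would quantify the paper's 100-of-108-weeks and last-week's-errors-predict-this-week's observations.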