Data Science and its role in Big Data analytics Stefano De Francisci
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Eurostat Outline
1. Data Science, basic concepts 2. A short history 3. A new concept of Science? 4. Big Data as the new frontier of Data Science 5. Data, information, knowledge
Eurostat WIKIPEDIA Extraction of knowledge from large volumes of data that are structured or ...[DS includes] The field of data science is unstructured, which is a continuation of mathematics, statistics, emerging at the intersection of the field data mining and predictive data engineering, the fields of social science and analytics, also known as knowledge pattern recognition and statistics, information and discovery and data mining (KDD). learning, advanced computer science, and design "Unstructured data" can include emails, computing, visualization, BERKELEY SCHOOL OF videos, photos, social media, and other uncertainty modeling, user-generated content. data warehousing, and INFORMATION high performance computing with the goal First, the raw material, the INTERDISCIPLINARY NEW KINDS of extracting meaning Data “data” part of Data Science, from data and creating OF DATA is increasingly data products Science heterogeneous and unstructured. Second, MOUT DATA AS computers interpret data PRODUCT automatically, making them NEW METHODS FOR active agents in the process of sense making. MAKING-SENSE TO DATA DHAR …merely using data isn’t really what we Data science is the study of mean by “data science.” A data application where information comes At its core, data science acquires its value from the data itself, and from, what it represents involves using automated creates more data as a result. It’s not just an and how it can be turned methods to analyze application with data; it’s a data product. into a valuable resource in massive amounts of data Data science enables the creation of data the creation of business and and to extract knowledge products from them. LOUKADIS (O’REILLY MEDIA) IT strategies NEW YORK ROUSE UNIVERSITY Eurostat Data Science landscape • Signal processing • Visualization • Probability models • Predictive analytics • Nanotechnologies • Machine learning • Uncertainty modeling • Physics • Statistical learning • Data warehousing • Robotics • Data mining • Data compression • Mathematics • Database • Computer programming • Statistics • Data engineering • High Performance Computing • Information theory • Pattern recognition • Information technology FIELDS • AI Data Science TECHINIQUES (WIKIPEDIA ) OBJECTS APPROACHES
The development of machine learning, a branch Methods that scale to Big Data are of of artificial intelligence used to uncover patterns particular interest in data science, in data from which predictive models can be although the discipline is not generally developed, has enhanced the growth and considered to be restricted to such data. importance of data science.
Eurostat A data scientist may or may not Who is a Data Scientist? have specialized industry knowledge to aid in modeling In addition to advanced analytic skills, this individual is also business problems and with proficient at integrating and preparing large, varied datasets, understanding and preparing data. architecting specialized database and computing environments, and communicating results. Creating value from data The data scientist has requires a range of talents: TASKS MISSION emerged as a new role, from data integration and distinct from — but Data Scientist preparation, to with similarities to — PROFILE TALENT architecting specialized those of business GARTNER computing/database intelligence (BI) analysts environments, to data and statisticians RESPONSIBILITY PECULIARITY mining and intelligent algorithms
An individual responsible for modeling complex business problems, discovering business insights Data scientists can be invaluable in generating and identifying opportunities through the use of insights, especially from "big data;" but their unique statistical, algorithmic, mining and visualization combination of technical and business skills, together techniques. with their heightened demand, makes them difficult to find or cultivate.
D. Laney, L. Kart Eurostat Unicorn
CONWAY Venn RALEIGH diagram MOUT ERICKSON
Eurostat 5 4 6 Venn diagram >7 7
Eurostat Eurostat Eurostat Is Data Science a maturity science?
Types of domain dealt by an intellectual Feature of a new discipline: enterprises: (a) To represent an autonomous field (unique (a) topics (facts, data, problems, topics) phenomena, observations, and the like) (b) To provide an innovative approach to both (b) methods (techniques, approaches, and traditional and new philosophical topics so on) (original methodologies); (c) theories (hypotheses, explanations, and (c) To stand beside other disciplines, offering the so forth) systematic treatment of its own conceptual foundations (new theories). If a discipline attempts to innovate in more than one of these domains simultaneously is premature, as detaches itself too abruptly from the normal and continuous thread of evolution of its general field (Stent 1972). crossroad of As everyone’s Transdisciplinary (like • technical matters concern is to be anyone’s cybernetics or semiotics) • theoretical issues nobody’s own area of or interdisciplinary (like • applied problems business specialisation biochemistry or cognitive • conceptual analyses science)?
L. Floridi Eurostat 191
201 http://www.emc.com/collateral/about/news/emc-data-science-study-wp.pdf
285 where second first value value value Size of effect shown 191 201 0,050 in graphic Size of effect in data 285 195 0,462
195 Lie factor 0,108
Eurostat Short History of Data Science (Loosely based on Gil Press version)
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
Eurostat Eurostat Eurostat Eurostat Eurostat Eurostat Eurostat Eurostat Eurostat Steps to a Metaphisics of Data Science • How does the Data Science in the context of the Knowledge Organization? • What are its relations with other fields of scientific knowledge? • Can DS be explained as part of the philosophy of science?
Data Information Knowledge
Scientific Data Information Knowledge context Science Science Science
Philosophy of Philosophical Philosophy of Philosophy of Knowledge context Data Information (Epistemology, Gnoseology)
Eurostat Beyond Data Science? Information Science sits at the intersection Information Science is the study of information of technology, people, and organizations. and how it is used by people within It is a distinct discipline and has a focus on organisations Information and Communication Technologies (ICT) used by people to manage information within organisations.
Data Information Knowledge
Ambito Information scientifico Science
Ambito filosofico http://infosci.otago.ac.nz/what-is-information-science/
Eurostat Beyond Data Science?
Informatio Data Knowledge n
Ambito Knowledge scientifico Science
Ambito filosofico
https://www.crcpress.com/Knowledge-Science-Modeling-the-Knowledge-Creation-Process/Nakamori/9781439838365
Eurostat Beyond Data Science? http://www.nytimes.com/2013/02/05/opinion/brooks-the-philosophy-of-data.html?_r=0 https://www.mendeley.com/groups/546011/philosophy-of-data/papers/
Informatio Data Knowledge n
Scientific context
Philosophical Philosophy of Data context?
Eurostat Beyond Data Science? Philosophy of information (PI) = def. the philosophical field concerned with (a) the critical investigation of the conceptual nature and basic principles of information, including its dynamics, utilization, and sciences, and (b) the elaboration and application of information theoretic and computational methodologies to philosophical problems. https://en.wikipedia.org/wiki/Philosophy_of_information http://socphilinfo.org/sites/default/files/i2pi_2013.pdf http://philosophyofinformation.net/publications/pdf/wipi.pdf Data Information Knowledge
Philosophical Philosophy of context Information
Eurostat New millennium frontiers
Starting from phenomena such data deluge, the existence of new and alternatives data sources like the Internet, sensors and images, the availability of data not ad-hoc collected but automatically generated, it is understood that relations between scientific fields could not be confined to a binary interdisciplinary relationships, but it needed a triangulation and a trans- disciplinary approach, and the identification of a data-driven scientific method
Computer scientists need new Computer approaches, methods and Science techniques to organize and extract knowledge
Domain experts are having to deal Statisticians and TRADITIONAL with data from mathematicians RESEARCH alternative sources are unable to Statistica e Aree develop their matematica tematiche own data
Eurostat New paradigms?
Extraction of knowledge from large volumes Science as computational approach, of data that are structured or unstructured, where like the two traditional which is a continuation of the field data scientific methods, all of the mining and predictive analytics, also known computational steps by which as knowledge discovery and data mining (KDD) scientists draw conclusions are DATA SCIENCE revealed e-SCIENCE DATA-DRIVEN LARGE-SCALE Providing tools that Science as simplify CITIZEN SCIENCE communication, process, SHARING SCIENTIFIC Systematic collection cooperation and processing and WORKFLOW and analysis of data; collaboration between calculation development of interested parties technology; testing COLLABORATION PARTICIPATION SCIENCE 2.0 of natural phenomena; and the Research environments that support advanced data acquisition, dissemination of data storage, data management, data integration, data mining, these activities by data visualization and other computing and information researchers on a processing services distributed over the Internet beyond the primarily avocational scope of a single institution basis CYBERINFRASTRUCTURE CROWD SCIENCE Eurostat New roads?
Eurostat On the concept of datum Rational Integers Numbers
Real Imaginary Numbers Mathematics Numbers
Nominal ATA List D Alphanumeric Ordinal Statistics IT Floating Point Interval Bit Ratio Integers Boolean
Data Science Eurostat On the concept of datum
Semiotics
Fact
DATA
Sign Signal
Eurostat
If the data were… a signal
theory
communication
Shannon [L. Floridi]
Eurostat Data as a sign
If the data were… a sign … become… elements of
Communication Meaning-making process system
Signal Entity Code Entity in-presence in-absence
Machines Human
(stimulus) (interpretation)
Eurostat Data as a symbol Information
Data Real-World phenomenon
Ogden, C.K. & Richards, I.A. The Meaning of Meaning: A study of the Influence of Language upon Thought and of the Science of Symbols, Harcourt Brace, New York,1956.
Eurostat Data as a fact ROWLEY HENRY …data "as being discrete, objective facts or observations, which Data as "merely are unorganized and unprocessed and therefore have no raw facts” meaning or value because of lack of context and interpretation”
NON-INTERPRETABILITY Data as "chunks of Data as «PIECE» OF FACT facts about the state RAW FACT a fact of the world" SUM OF FACTS GAMBLE MATERIAL FACT Data as “material VERACITY fact” BOIKO Information Data… as facts have as a fundamental as "the sum property that they are true, total of...facts have objective reality, or otherwise can and ideas" be verified, such definitions would preclude false, meaningless, CLEVELAND WIKIPEDIA and nonsensical data.
Eurostat Data vs. information
"Typically information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge" J. Rowley
Eurostat DIKW approach
Gene Bellinger et al. Data, Information, Knowledge, and Wisdom http://www.systems- thinking.org/dikw/dikw.htm
G. Bellinger et al. Data, Information, Knowledge, and Wisdom http://www.systems-thinking.org/dikw/dikw.htm 36 Eurostat DIKW approach
H. Cleveland "Information as Resource", The Futurist, December 1982 http://www.nwlink.com/~donclark/performance/understanding.html
Eurostat DIKW approach
S. Wurman http://www.jonkolko.com/projectFiles/scad/IACT370_04_InformationAnxiety.pdf
Eurostat An objective (mind WIENER FLORIDI independent) entity. In fact, what we mean by “…a name for the content It can be generated information – the elementary Intuitively, of what is exchanged with or carried by unit of information – is a “information” is the outer world as we messages or by difference which makes a often used to refer adjust to it, and make our other products of difference.” to non-mental, user- adjustment felt upon it.” cognizers BATESON independent, declarative, CAMBRIDGE DICTIONARY DIFFERENCE REGULATION semantic contents, OF PHILOSOPHY embedded in OBJECTIVITY CONTENT Organized data physical Information implementations SAINT-ONGE ORGANIZATION RELEVANCE like databases, INTERPRETATION encyclopaedias, web Interpreted data INTEGRITY sites, television VARIETY PROBST programmes etc. The word “information” has been given different meanings by various writers in the general field of information theory. It is Data endowed Knowledge likely that at least a number of these will prove sufficiently useful with relevance which can in certain applications to deserve further study and permanent and purpose be transmitted recognition. It is hardly to be expected that a single concept of without loss of DRUCKER information would satisfactorily account for the numerous integrity possible applications of this general field. SHANNON KOGUT AND ZANDER Eurostat From Data to Information PHILOSOPHY OF COMPUTING AND INFORMATION Querying (searching, Contextualizing and browsing) primary data to relating primary data find what we need DIKW
MEANING RELATION D2I
SHAPE
Activate inductive process (through formalization, models etc.) that best represents primary data STATISTICAL INFERENCE
Eurostat Schools of thought on Knowledge
Commodity • Positivism (mid- Knowledge as absolute and universal truth. 19th century) • Artefact that can be handled in discrete • It is still especially units and that people may (or may not) strong in the posses natural sciences • Thing for which we can gain evidence, and it is separated from the knower Community • Critique of the Knowledge as Constructivist approach established • Reality (and hence also knowledge) quantitative should be understood as socially approach to constructed. science • amongst social • It is impossible to define knowledge scientists (1960’s) universally; it can only be defined in practice, in the activities of and interactions between individuals.
Eurostat Information vs. Knowledge
Information Knowledge Information can be made tangible and Knowledge, on the other hand, is a much represented as objects outside of the more elusive entity – while some see it as human mind an object Relationship is asymmetrical, suggesting that data may be transformed into information, which, in turn, may be transformed into knowledge. However, it does not seem to be possible to go the other way. It connotes the appraisement that knowledge is more valuable than information, turn is superior to data. Information is more factual Knowledge is about beliefs and commitment Knowledge is always about action – the knowledge must be used to some end (Nonaka and Takeuchi, 1995, pp. 57-58) Tuomi (1999) argues that data emerges as a result of adding value to information, which in turn is knowledge that has been structured and verbalised.
Eurostat Information vs. Knowledge
Information Knowledge Processed data Actionable information Simply gives us facts Allows making predictions, casual associations, or predictive decisions
Clear, crisp, structured and simplistic Muddy, fuzzy, partly unstructured
Easily expressed in written form Intuitive, hard to communicate, and difficult to express in words and illustration
Obtained by condensing, correcting, Lies in connections, conversations contextualizing, and calculating data between people, experienced-based intuition, and people’s ability to compare situations, problems and solutions
Devoid of owner dependencies Depends on the owner
Abdullah et al. Eurostat Value chain of Knowledge
Data with Valuable Simple relevance and information from observations
purpose the human mind
Davenport
Tuomi
Knowledge, We can By examining the embedded in our instantiate some structure of this minds, is thus a of this knowledge information, we may prerequisite as information finally codify it into pure data
Eurostat Value life-cycle of Knowledge
Both data and information require knowledge in order to be interpretable
Data and Old knowledge is information are used to reflect useful tools for upon data and constructing new information knowledge.
When the data or information has been made sense of, a new state of knowledge is formed
Eurostat Impact of Big Data on DIKW chain
Eurostat