
Data Science and its role in Big Data analytics
Stefano De Francisci

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Eurostat Outline

1. Data Science, basic concepts
2. A short history
3. A new concept of Science?
4. [...] as the new frontier of Data Science
5. Data, information, knowledge

Data Science

WIKIPEDIA: Extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field of data mining and predictive analytics, also known as knowledge discovery and data mining (KDD). "Unstructured data" can include emails, videos, photos, social media, and other user-generated content.

BERKELEY SCHOOL OF INFORMATION: The field of data science is emerging at the intersection of the fields of social science and statistics, information and [...], and design.

MOUT: ...[DS includes] [...], [...], data engineering, pattern recognition and learning, advanced [...], visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.

DHAR: First, the raw material, the “data” part of Data Science, is increasingly heterogeneous and unstructured. Second, computers interpret data automatically, making them active agents in the process of sense making.

LOUKIDES (O’REILLY MEDIA): …merely using data isn’t really what we mean by “data science.” A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

ROUSE: Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies.

NEW YORK UNIVERSITY: At its core, data science involves using automated methods to analyze massive amounts of data and to extract knowledge from them.

Keywords: INTERDISCIPLINARY, NEW KINDS OF DATA, DATA AS PRODUCT, NEW METHODS FOR MAKING SENSE OF DATA

Data Science landscape

• Signal processing
• Visualization
• Probability models
• Predictive analytics
• Nanotechnologies
• Uncertainty modeling
• Physics
• Statistical learning
• Data warehousing
• Robotics
• Data mining
• Mathematics
• Computer programming
• Statistics
• Data engineering
• High Performance Computing
• Information theory
• Pattern recognition
• AI

Categories: FIELDS, TECHNIQUES (WIKIPEDIA), OBJECTS, APPROACHES

The development of machine learning, a branch of artificial intelligence used to uncover patterns in data from which predictive models can be developed, has enhanced the growth and importance of data science.

Methods that scale to Big Data are of particular interest in data science, although the discipline is not generally considered to be restricted to such data.

Who is a Data Scientist?

GARTNER

• A data scientist may or may not have specialized industry knowledge to aid in modeling business problems and with understanding and preparing data.
• In addition to advanced analytic skills, this individual is also proficient at integrating and preparing large, varied datasets, architecting specialized database and computing environments, and communicating results.
• Creating value from data requires a range of talents: from [...] and preparation, to architecting specialized computing/database environments, to data mining and intelligent algorithms.
• The data scientist has emerged as a new role, distinct from, but with similarities to, those of business intelligence (BI) analysts and [...].
• An individual responsible for modeling complex business problems, discovering business insights and identifying opportunities through the use of statistical, algorithmic, mining and visualization techniques.
• Data scientists can be invaluable in generating insights, especially from "big data"; but their unique combination of technical and business skills, together with their heightened demand, makes them difficult to find or cultivate.

Facets of the Data Scientist: TASKS, MISSION, PROFILE, TALENT, RESPONSIBILITY, PECULIARITY

D. Laney, L. Kart

Unicorn

Venn diagram

[Figure: Venn diagrams labelled CONWAY, RALEIGH, MOUT and ERICKSON]

Venn diagram

[Figure: Venn diagrams labelled 4, 5, 6, 7 and >7]

Is Data Science a mature science?

Types of domain dealt with by intellectual enterprises:
(a) topics (facts, data, problems, phenomena, observations, and the like)
(b) methods (techniques, approaches, and so on)
(c) theories (hypotheses, explanations, and so forth)

Features of a new discipline:
(a) To represent an autonomous field (unique topics)
(b) To provide an innovative approach to both traditional and new philosophical topics (original methodologies)
(c) To stand beside other disciplines, offering the systematic treatment of its own conceptual foundations (new theories)

A discipline that attempts to innovate in more than one of these domains simultaneously is premature, as it detaches itself too abruptly from the normal and continuous thread of evolution of its general field (Stent 1972).

• Crossroads of technical matters, theoretical issues, applied problems and conceptual analyses
• As everyone’s concern and anyone’s business, it is nobody’s own area of specialisation
• Transdisciplinary (like cybernetics or semiotics) or interdisciplinary (like biochemistry or cognitive science)?

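L. Floridi

Lie factor = (size of effect shown in graphic) / (size of effect in data), where size of effect = |second value - first value| / first value

                                | second value | first value | value
Size of effect shown in graphic | 191          | 201         | 0.050
Size of effect in data          | 285          | 195         | 0.462
Lie factor                      |              |             | 0.108

http://www.emc.com/collateral/about/news/emc-data-science-study-wp.pdf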
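
The lie-factor figures above can be checked with a few lines of Python. This is a minimal sketch that assumes Tufte's usual definition (size of effect = |second value - first value| / first value, and lie factor = effect shown in the graphic divided by effect in the data). The function names `effect_size` and `lie_factor` are illustrative and do not come from the slides.

```python
def effect_size(second, first):
    """Relative size of an effect: |second - first| / first (assumed definition)."""
    return abs(second - first) / first


def lie_factor(graphic, data):
    """Ratio of the effect shown in the graphic to the effect in the data.

    Both arguments are (second value, first value) pairs.
    """
    return effect_size(*graphic) / effect_size(*data)


# Values taken from the table: (second value, first value)
graphic = (191, 201)
data = (285, 195)

print(f"effect shown in graphic: {effect_size(*graphic):.3f}")  # 0.050
print(f"effect in data:          {effect_size(*data):.3f}")     # 0.462
print(f"lie factor:              {lie_factor(graphic, data):.3f}")  # 0.108
```

A lie factor close to 1 means the graphic represents the data faithfully; a value of about 0.108 means the graphic shows only about a tenth of the effect present in the data.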

Short History of Data Science (loosely based on Gil Press's version)

http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/

Steps to a Metaphysics of Data Science
• How does Data Science fit into the context of Knowledge Organization?
• What are its relations with other fields of scientific knowledge?
• Can DS be explained as part of the philosophy of science?

                      | Data               | Information               | Knowledge
Scientific context    | Data Science       | Information Science       | Knowledge Science
Philosophical context | Philosophy of Data | Philosophy of Information | Philosophy of Knowledge (Epistemology, Gnoseology)

Beyond Data Science?

Information Science sits at the intersection of technology, people, and organizations. It is a distinct discipline and has a focus on Information and Communication Technologies (ICT) used by people to manage information within organisations.

Information Science is the study of information and how it is used by people within organisations.

Data | Information | Knowledge

Scientific context: Information Science

Philosophical context

http://infosci.otago.ac.nz/what-is-information-science/

Beyond Data Science?

Data | Information | Knowledge

Scientific context: Knowledge Science

Philosophical context

https://www.crcpress.com/Knowledge-Science-Modeling-the-Knowledge-Creation-Process/Nakamori/9781439838365

Beyond Data Science?

http://www.nytimes.com/2013/02/05/opinion/brooks-the-philosophy-of-data.html?_r=0
https://www.mendeley.com/groups/546011/philosophy-of-data/papers/

Data | Information | Knowledge

Scientific context

Philosophical context: Philosophy of Data?

Beyond Data Science?

Philosophy of information (PI) = def. the philosophical field concerned with (a) the critical investigation of the conceptual nature and basic principles of information, including its dynamics, utilization, and sciences, and (b) the elaboration and application of information theoretic and computational methodologies to philosophical problems.

https://en.wikipedia.org/wiki/Philosophy_of_information
http://socphilinfo.org/sites/default/files/i2pi_2013.pdf
http://philosophyofinformation.net/publications/pdf/wipi.pdf

Data | Information | Knowledge

Philosophical context: Philosophy of Information

New millennium frontiers

Phenomena such as the data deluge, the existence of new and alternative data sources like the Internet, sensors and images, and the availability of data that are not collected ad hoc but generated automatically have made it clear that relations between scientific fields cannot be confined to binary interdisciplinary relationships. They require triangulation, a transdisciplinary approach, and the identification of a data-driven scientific method.

TRADITIONAL RESEARCH
• Computer Science: Computer scientists need new approaches, methods and techniques to organize and extract knowledge
• Subject-matter areas: Domain experts are having to deal with data from alternative sources
• Statistics and mathematics: Statisticians and mathematicians are unable to develop their own data

New paradigms?

• DATA SCIENCE: Extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field of data mining and predictive analytics, also known as knowledge discovery and data mining (KDD)
• e-SCIENCE: Science as computational approach, where, like the two traditional scientific methods, all of the computational steps by which scientists draw conclusions are revealed
• SCIENCE 2.0: Providing tools that simplify communication, cooperation and collaboration between interested parties
• SCIENTIFIC WORKFLOW: Science as process, processing and calculation
• CITIZEN SCIENCE, CROWD SCIENCE: Systematic collection and analysis of data; development of technology; testing of natural phenomena; and the dissemination of these activities by researchers on a primarily avocational basis
• CYBERINFRASTRUCTURE: Research environments that support advanced data acquisition, [...], [...], data integration, data mining, [...] and other computing and information processing services distributed over the Internet beyond the scope of a single institution

Keywords: DATA-DRIVEN, LARGE-SCALE, SHARING, COLLABORATION, PARTICIPATION

New roads?

On the concept of datum

DATA
• Mathematics: Rational Numbers, Integers, Real Numbers, Imaginary Numbers
• Statistics: Nominal, Ordinal, Interval, Ratio
• IT: List, Alphanumeric, Floating Point, Bit, Integers, Boolean

Data Science

On the concept of datum

Semiotics

DATA: Fact, Sign, Signal

If the data were… a signal

Communication theory (Shannon) [L. Floridi]

Data as a sign

If the data were… a sign … they become… elements of:
• Communication system: Machines (stimulus)
• Meaning-making process: Human (interpretation)

Signal, Entity in-presence, Code, Entity in-absence

Data as a symbol

[Figure: diagram relating Data, Information and Real-World phenomenon]

Ogden, C.K. & Richards, I.A., The Meaning of Meaning: A Study of the Influence of Language upon Thought and of the Science of Symbolism, Harcourt Brace, New York, 1956.

Data as a fact

• ROWLEY (NON-INTERPRETABILITY): …data "as being discrete, objective facts or observations, which are unorganized and unprocessed and therefore have no meaning or value because of lack of context and interpretation"
• HENRY (RAW FACT): Data as "merely raw facts"
• GAMBLE («PIECE» OF FACT): Data as "chunks of facts about the state of the world"
• BOIKO (MATERIAL FACT): Data as “material fact”
• CLEVELAND (SUM OF FACTS): Information as "the sum total of...facts and ideas"
• WIKIPEDIA (VERACITY): Since facts have as a fundamental property that they are true, have objective reality, or otherwise can be verified, such definitions would preclude false, meaningless, and nonsensical data.

Data vs. information

"Typically information is defined in terms of data, knowledge in terms of information, and wisdom in terms of knowledge" J. Rowley

DIKW approach

G. Bellinger et al., Data, Information, Knowledge, and Wisdom, http://www.systems-thinking.org/dikw/dikw.htm

DIKW approach

H. Cleveland "Information as Resource", The Futurist, December 1982 http://www.nwlink.com/~donclark/performance/understanding.html

DIKW approach

S. Wurman http://www.jonkolko.com/projectFiles/scad/IACT370_04_InformationAnxiety.pdf

Information

• WIENER: “…a name for the content of what is exchanged with the outer world as we adjust to it, and make our adjustment felt upon it.”
• BATESON: “In fact, what we mean by information – the elementary unit of information – is a difference which makes a difference.”
• CAMBRIDGE DICTIONARY OF PHILOSOPHY: An objective (mind independent) entity. It can be generated or carried by messages or by other products of cognizers
• FLORIDI: Intuitively, “information” is often used to refer to non-mental, user-independent, declarative, semantic contents, embedded in physical implementations like [...], encyclopaedias, web sites, television programmes etc.
• SAINT-ONGE: Organized data
• PROBST: Interpreted [...]
• DRUCKER: Data endowed with relevance and purpose
• KOGUT AND ZANDER: Knowledge which can be transmitted without loss of integrity
• SHANNON: The word “information” has been given different meanings by various writers in the general field of information theory. It is likely that at least a number of these will prove sufficiently useful in certain applications to deserve further study and permanent recognition. It is hardly to be expected that a single concept of information would satisfactorily account for the numerous possible applications of this general field.

Keywords: DIFFERENCE, REGULATION, OBJECTIVITY, CONTENT, ORGANIZATION, RELEVANCE, INTERPRETATION, VARIETY

From Data to Information

D2I
• MEANING (PHILOSOPHY OF COMPUTING AND INFORMATION): Querying (searching, browsing) primary data to find what we need
• RELATION (DIKW): Contextualizing and relating primary data
• SHAPE (STATISTICAL INFERENCE): Activate inductive process (through formalization, models etc.) that best represents primary data
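
The three routes from data to information named above (querying, contextualizing and relating, inductive inference) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy records, the lookup table `unit_names`, and the function names `query`, `contextualize` and `infer_mean` are hypothetical and are not taken from the slides. The standard error of the mean stands in for "statistical inference" in the simplest possible form.

```python
from math import sqrt
from statistics import mean, stdev

# Invented primary data: (unit code, period, measured value)
primary = [
    ("A", 1, 1.0), ("A", 2, 2.0),
    ("B", 1, 3.0), ("B", 2, 4.0),
    ("C", 1, 5.0), ("C", 2, 6.0),
]

# Invented context used to relate codes to readable names
unit_names = {"A": "Unit A", "B": "Unit B", "C": "Unit C"}


def query(records, period):
    """Querying: keep only the primary data we need."""
    return [r for r in records if r[1] == period]


def contextualize(records, names):
    """Contextualizing and relating: attach meaning to raw codes."""
    return [{"unit": names[code], "period": p, "value": v}
            for code, p, v in records]


def infer_mean(values):
    """Inductive step: estimate a mean and its standard error."""
    return mean(values), stdev(values) / sqrt(len(values))


selected = query(primary, period=2)
described = contextualize(selected, unit_names)
estimate, std_error = infer_mean([d["value"] for d in described])
print(described[0])
print(f"mean = {estimate:.2f}, standard error = {std_error:.2f}")
```

Each step adds something the raw tuples lack: selection, meaning and a model-based summary.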

Schools of thought on Knowledge

Commodity: Knowledge as absolute and universal truth.
• Artefact that can be handled in discrete units and that people may (or may not) possess
• Thing for which we can gain evidence, and which is separated from the knower
• Positivism (mid-19th century)
• It is still especially strong in the natural sciences

Community: Knowledge as Constructivist approach
• Reality (and hence also knowledge) should be understood as socially constructed.
• It is impossible to define knowledge universally; it can only be defined in practice, in the activities of and interactions between individuals.
• Critique of the established quantitative approach to science amongst social scientists (1960s)

Information vs. Knowledge

Information
• Information can be made tangible and represented as objects outside of the human mind
• Information is more factual

Knowledge
• Knowledge, on the other hand, is a much more elusive entity – while some see it as an object…
• Knowledge is about beliefs and commitment
• Knowledge is always about action – the knowledge must be used to some end (Nonaka and Takeuchi, 1995, pp. 57-58)

The relationship is asymmetrical, suggesting that data may be transformed into information, which, in turn, may be transformed into knowledge. However, it does not seem to be possible to go the other way. It implies the judgement that knowledge is more valuable than information, which in turn is superior to data.

Tuomi (1999) argues that data emerges as a result of adding value to information, which in turn is knowledge that has been structured and verbalised.

Information vs. Knowledge

Information | Knowledge
Processed data | Actionable information
Simply gives us facts | Allows making predictions, causal associations, or predictive decisions
Clear, crisp, structured and simplistic | Muddy, fuzzy, partly unstructured
Easily expressed in written form | Intuitive, hard to communicate, and difficult to express in words and illustration
Obtained by condensing, correcting, contextualizing, and calculating data | Lies in connections, conversations between people, experience-based intuition, and people’s ability to compare situations, problems and solutions
Devoid of owner dependencies | Depends on the owner

Abdullah et al.

Value chain of Knowledge

Davenport: Data (simple observations) → Information (data with relevance and purpose) → Knowledge (valuable information from the human mind)

Tuomi: Knowledge, embedded in our minds, is thus a prerequisite → We can instantiate some of this knowledge as information → By examining the structure of this information, we may finally codify it into pure data

Value life-cycle of Knowledge

• Both data and information require knowledge in order to be interpretable
• Data and information are useful tools for constructing new knowledge.
• Old knowledge is used to reflect upon data and information
• When the data or information has been made sense of, a new state of knowledge is formed

Impact of Big Data on DIKW chain