Les cahiers de veille de la Fondation Télécom
Managing data in a hyperconnected world

Contents

3 A deluge of data
3 Predicting the future with new amounts of data
5 Completely new horizons
6 Why is (big) data an opportunity now?
8 From Big Data to Small Data: giving the power back to the people
9 The data landscape
9 What is data?
9 What is generating data?
10 Cleaning and contextualizing data
11 Processing data
11 Different types of data (structured and non structured)
12 Managing the petascale computing
14 The data market
16 Unlocking data value
16 Machine learning
18 Realtime analytics
18 Data analysis for everyone
19 Data visualization: telling stories with data
20 Rise of the data jobs
21 Privacy and trust concerns
22 Challenges in the air
22 Scientific challenges: Towards a better understanding of the world? // Pour a maximum of data and solve big problems // Build and share data-oriented infrastructures // Learn to apply context to the numbers // Make the data qualitative and meaningful
24 Technical challenges: Will «delete» become a forbidden word? // Anonymize for good // Do not neglect big data risks // Master data correlation // Master the big data cycle // Make the mobile phone your data assistant // Enable dataviz on new devices // Invent the future of shopping
26 Societal challenges: Open new ways of thinking // Democratize data management // Teach the future datascientists
27 Working with the Institut Mines-Télécom
27 Glossary

Editorial, May 2013

The digital transformation is not a linear phenomenon. It proceeds by waves of innovation, each rich with its own issues and its own dynamics. After the wave of computing and the wave of the Internet, we are probably entering a third sequence, the data wave: changes that we know for the most part, but which deserve to be examined as a whole.

The data revolution is, of course, the explosion of data available in organizations or accessible over the Internet. It is the falling cost of data production systems, which allows anyone to deploy information systems they would not have dreamed of a decade ago. It is the flow of radical innovations that hide behind words like «big data» or «cloud computing». It is also political and social change: the relationship between citizens and their own digital identity revealed by the Quantified Self movement, the privacy concerns, the growing demand for transparency in government agencies and large organizations. It is new government practices, such as open data, leading to new forms of open and collaborative government. And new risks too, especially in the cyber security domain…

In our organizations, the data revolution means new kinds of jobs that modify the ways we work and represent the world: datascientists, data visualizers. It brings new concerns too, such as data governance, and new training needs that our schools must respond to.

It is, by definition, difficult to measure a wave of radical innovation. One always starts by comparing the new with the old, with what we already know: «Data is the new oil for our economies… A treasure to keep or distribute… A danger to privacy…» These analogies are not false, but they are incomplete and may cause us to miss the point.

The key point is that the flow of data exchanged on the networks now forms the backdrop of the economy, an important part of the social bond, a tool to forge new forms of public action, an essential component of our digital identities… and a treasure of risks and opportunities.

If I had to make an analogy, I would suggest rather: our economy is in the middle of the same revolution that the life sciences lived through when biochemistry was discovered. With the ability to work effectively on large distributions at very granular levels, we are in a position to rebuild all our old learnings, to open new ones, and to imagine radically new forms of intervention.

There are big challenges to come, indeed…

Henri Verdier
Director, Etalab

A deluge of data

Big data was designated trend of the year 2012 and continues to make the headlines in 2013: reports, conferences and announcements follow one another at a furious pace, making it more and more a part of the data landscape. To the deluge of data associated with big data is added a flood of information about the phenomenon itself, and it is more necessary than ever to see clearly. Over the past few years, big data has expanded to fast data and smart data, open data from public administrations, dark data that needs to be revealed, and small data that is more specific. It is now time to learn how to manage all these flavors of data.

We choose in this cahier de veille to explore data from a particular angle: managing the data from the traces we leave in both the real world and the connected world, traces related to the digitalization of our lives. Everyday objects (phone, tablet, car, bathroom scales) all now have a software component that becomes dominant in the design of these objects. These programs generate traces in large quantity and in real time, giving their location, their task, the identity of the user, their internal state, and even information about their environment, like the network status or the level of attacks to which they are subject. These traces can easily be exploited in the connected world to extract a user profile and best adapt the service provided by the object to the user's needs, but also to serve the interests of brands (with a balance between privacy and economy), to get a state of the object in order to detect the need for maintenance, and to perform better inventory management to improve product availability.

Big data is not for big business only. Data is no longer an isolated IT discussion. We are now quantifying every aspect of our lives, and the data we generate, own and publish must remain under control, whether you are a small organization or an individual.

Big Data is a slippery concept, and big dataset sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes. Big data is data that is too large, complex and dynamic for any tools, procedures and processes available at the moment to create, capture, store, manipulate, manage and analyze.

In this cahier de veille we write the term Big Data with capital letters sparingly. We use data when it refers to ordinary data, open data, fast data, big data… whereas Big Data with capital letters refers to the marketing phenomenon.

Predicting the future with new amounts of data

Managing large amounts of data is not new, and «big data» has been a concept known by IT managers for years. However, producing large amounts of data, accessing massive new data sources and having all of them at our fingertips marks a change in the way we work.

New insights come from sources that were not known before or were previously impossible to analyze. What is huge today is that businesses can leverage the entire Web as a data source, and at relatively low costs. One of the first uses of big data to hit the public attention was the Google Flu Trends service, launched in 2008. Google found at that time that certain search terms, among the millions of user queries around the world for health information, were good indicators of flu activity during flu season. These results were published in the prestigious journal Nature.

Google compared the flu-related query counts with traditional flu surveillance systems and found a pattern emerging between how many people search such queries and how many people actually have flu symptoms. The same close relationship could be obtained worldwide, and updated every hour or so, when traditional surveillance services could only update their results once a week, on a country basis. A new set of data (the Google queries, compared to counting people visiting a doctor) was delivering a totally new set of information that could be used as a complement to predict phenomena, and this opened the path to research on web-produced data.

While Google had somehow neglected to update its Google Trends service before 2007, it now updates information daily, and Hot Trends is updated hourly. It is worth noting that this is a publicly available service for everybody.

Data law #1: The faster you analyze your data, the greater its predictive value.
From David Feinleib's 8 laws of Big Data. See references page 8.

In 2008, Google was able to spot trends in the Swine Flu epidemic two weeks before the U.S. Center for Disease Control by analyzing the searches that people were making. Nowadays, Google Flu Trends (blue line) uses aggregated Google search data to estimate current flu activity around the world in near real-time. The figure presents French results, compared with data collected from the public Sentinelles health network (orange lines).

Source: http://goo.gl/SUqJd
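The mechanics behind such a comparison are simple to sketch in Python. The weekly figures below are fabricated for illustration, and the real service fits millions of candidate queries against official case counts, but the underlying principle (correlate, then use the fitted relationship as a near real-time estimator) is of this kind:

```python
import numpy as np

# Illustrative weekly series: flu-related query counts and reported
# flu cases over the same twelve weeks (made-up numbers).
queries = np.array([120, 135, 160, 210, 340, 520, 610, 580, 430, 300, 210, 150])
cases   = np.array([ 80,  90, 110, 150, 260, 410, 500, 470, 350, 240, 160, 110])

# Pearson correlation between the two series: a value close to 1
# indicates that query volume tracks actual flu activity.
r = np.corrcoef(queries, cases)[0, 1]
print(f"correlation: {r:.3f}")

# Fit a simple linear model cases ~ a * queries + b, then use it to
# turn this week's query count into a flu-activity estimate, available
# long before official surveillance figures are published.
a, b = np.polyfit(queries, cases, 1)
print(f"estimated cases for 450 queries: {a * 450 + b:.0f}")
```

The point of the design is the timing: the query series refreshes hourly, so the fitted relationship yields estimates days or weeks ahead of the weekly surveillance bulletins.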

However, flu prediction is mostly a way to make a monitoring process more accurate, and other sets of web-collected data can give better insights into the future. This is particularly the case with data in the financial market context. Indeed, a recent scientific study published in Nature in early 2013 showed that it may be possible to quantify trading behavior in financial markets using the Google Trends service.

As the crises in financial markets affect humans worldwide, being able to better understand, and predict, stock market moves will surely be of some help. These trading decisions reflect the complexity of human behavior. The authors of the study suggest that «massive new data sources resulting from human interaction with the Internet may offer a new perspective on the behavior of market participants in periods of large market movements.»

Their findings show that Google Trends data can reflect the current state of the stock markets but may also have been able to anticipate certain future trends. They found patterns that may be interpreted as early warning signs of stock market moves. This could be generalized to other complex systems that human beings face, the largest environmental questions for instance.

In fact, all is contained in that «may be interpreted». There is a short step from using data to interpret the world to using it to act on it, so what is going on if decisions are wrong? On April 23, the very same week the stock market article was published in Nature, a bogus message on a false report of explosions at the White House was published via the Associated Press account and caused financial markets to drop sharply for a short period. This was evidence for some that automatic algorithms were using social network data and could take decisions without waiting to substantiate things.

Predicting moves as a function of past behavior can also be done with web-collected data from one's own customers. Liligo.com, a real-time travel search engine, simplifies the user experience by providing information about past price evolution and forecasts based on historical data. As it constantly records the result pages, the information collected allows the visualization of the price time series for every ticket, and can be used to gain insight into the effect of the yield management policy conducted by the travel companies. Based on vast streams of heterogeneous historical data collected through the Internet from more than 250 travel sites, agencies and tour operators, researchers at Institut Mines-Télécom proposed approaches to forecasting travel price changes at a given horizon, taking as input variables a list of descriptive characteristics of the flight, together with possible features of the past evolution of the related price series.

When are we talking big?

Big Data is often defined first through the three «V's» – Volume, Velocity, and Variety – volume being the facet that is most discussed. In 2011, the volume of data found on the Internet was 1.8 zettabytes, and a volume of 2.9 zettabytes is expected for 2015. Nowadays, humanity produces in two days more data than throughout its history from its beginnings until 2003. In eight years, the volume of data will be fifty times greater than it is today. The big data market will reach €50 billion and 10 times more servers will be needed. Big data is less than ten years old, but the technology is already massively deployed by the likes of Twitter, Google…

Big data can come fast. The velocity is defined as the pace at which the data are to be captured and consumed (referring both to streams of data coming from the Internet of Things and to the speed of growth in large-scale data repositories). Just think of algorithmic trading engines that must detect trade buy/sell patterns in datastreams coming in at 5,000 orders per second. As a comparison, every minute on average 350,000 tweets and 15 million SMS messages are sent globally.

Data processed by large web companies is not structured and neatly formatted. The users require the ability to store, query, and integrate results across a variety of information types, including text, image, audio, video… This is the third V in the Big Data definition. And more V's are to come, as the reader will see later.

what is big ?

[Infographic, reading up the scale: 10^6 Megabyte – if 1 Megabyte represents the size of an ant…; 10^9 Gigabyte; 10^12 Terabyte – the length of the Niagara Falls; a Boeing jet generates 10 TB of data per engine every 30 minutes; 10^15 Petabyte – the inner distance across France; in May 2013, the Microsoft migration of Hotmail accounts was over 150 PB of user data; 10^18 Exabyte – the diameter of the Sun; 250 million DVDs; 10^21 Zettabyte – 1.3 ZB of network traffic by 2016; 10^24 Yottabyte – our digital universe today.]
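Returning to the travel price forecasting above: the setup can be sketched as supervised learning on flight descriptors plus features of the past price series. The feature names and the choice of a gradient-boosted model below are our illustrative assumptions, not the exact method of the Clémençon et al. study (see references page 8):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Each row: [days_before_departure, flight_duration_h, is_weekend,
#            last_price_eur, price_trend_last_7d_eur]
# Target: 1 if the price rose at the chosen horizon, 0 otherwise.
# All values are fabricated for illustration.
X = [
    [30, 2.0, 0, 120,  -5],
    [21, 2.0, 1, 130,   3],
    [14, 5.5, 0, 310,  12],
    [ 7, 5.5, 1, 350,  20],
    [ 3, 1.5, 0,  95,   8],
    [45, 8.0, 0, 540, -15],
]
y = [0, 1, 1, 1, 1, 0]

model = GradientBoostingClassifier().fit(X, y)

# Will a 2-hour weekend flight, 10 days out, at 140 EUR and rising,
# get more expensive? (A toy prediction on toy data.)
print(model.predict([[10, 2.0, 1, 140, 6]]))
```

The same framing ("buy now or wait?") is what such a service ultimately surfaces to the traveler.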

Completely new horizons

Data is generated and collected all around us every second, and this opens completely new horizons when managing our everyday life in a hyperconnected world. First of all, the mobile phone in our pocket can track every movement you make and every sound you hear. It «knows» when you are at home or away. Its accelerometer sensors even «know» that it is you wearing the phone, by how you walk. And with the advent of the Internet of Things (see our cahier #3), all sorts of sensors are monitoring you, in your car, your fridge, your scale and your toothbrush: billions of devices that can sense, communicate, compute and potentially actuate.

The data streams coming from our devices are challenging the traditional approaches to data management.

All the data streams coming from these devices are challenging the traditional approaches to data management, and can even suggest usages not thought of before. Here is another example of the new kind of data being captured. The London transportation agency has deployed a significant number of roadway sensors in order to monitor traffic in real time for optimum traffic management during the 2012 Olympics. This includes surveillance cameras in all the underground railway stations, parking areas, light rail stations and piers, at the bus stations and on the buses themselves. These cameras operate in real time, catch people speeding on the roads, and can record data in an analogue or digital format. They were deployed with the focus of collecting traffic information. However, other organizations would like to access this sensor data so they can use it from a different perspective, for traffic updates or weather information for instance. The London transportation agency would provide a so-called Sensor Network as a Service, and these third-party organizations would provide their customers with value-added services. Indeed, the real value of the London video surveillance system would come from understanding the meaning of the images themselves, not just from the metadata associated with them.

The next example demonstrates the application of well known web insights to real world premises, along with the analysis of meaningful images. What if you could get analytics on your sales in the same way you get logs from your website, such as the number of visits, the bouncing rate (website immediate exit), the length of time on each page, the visited pages, the previously visited sites, the transformation from visitor to customer? Clirisgroup, a company once hosted in the Institut Mines-Télécom incubators, measures every flow via video-analysis, inside, outside and at each key point of a retail outlet. These flow measurements are strengthened with the data of the store information system itself (staff time-schedules, realized sales, etc.) and are integrated with external data (weather, roadworks, marketing operations inside the retail outlet, etc.). This allows the calculation of the store's and its departments' attractiveness, measurement of the impact of communication campaigns, optimization of the advice and sales forces, and measurement of the impact of externalities and unforeseen circumstances... When the customers are digitally identified (via mobile payment, QR code flashing...) these online data could be correlated with real world data and give specific insights never seen before.

In some use cases, concerns about a big data Big Brother are arising. Beginning with companies, where the analysis of e-mails, instant messaging, phone calls, and mouse clicks (clickstreams) can now be employed in the quest for greater business efficiency. Indeed, the data produced by workers is becoming a valuable asset. Applying datascience to human resources is more and more appealing among academics and entrepreneurs. Gild, an 18-month-old start-up company, provides services to automatically discover talented programmers. In order to predict how well a programmer will perform in a job, Gild is tuning algorithms that evaluate the candidates through no fewer than 300 variables. At a time when recruiters do not have enough time to read all the resumes they receive (they spend an average of 6 seconds per CV), and where automated systems are struggling to sort the candidates, big data and machine learning systems could reduce human bias during the selection process and identify the best candidates.

Back to sensors. The Bank of America equipped 90 of its employees with badges to study their movements and interactions. Data collection told an interesting story, and the bank decided to promote the taking of breaks in groups rather than alone. This increased productivity by 10%!

what is big ?

«Big» is a relative concept. EMC has defined big data as «any attribute that challenges constraints of a system capability or business need». For example, a 40MB PowerPoint presentation is big data, as is a 1TB medical image.

«Big» relates to the data itself, the size of the dataset, the velocity, the number of data and the types, or any combination of these.
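The web-style metrics that Clirisgroup transposes to physical stores are straightforward to compute once flows are captured. A sketch on an invented page-view log, with a hypothetical «checkout» page standing in for the visitor-to-customer transformation:

```python
from collections import defaultdict

# One entry per page view: (visitor_id, page). A purchase is logged
# as a view of the "checkout" page. Fabricated sample data.
log = [
    ("v1", "home"), ("v1", "shoes"), ("v1", "checkout"),
    ("v2", "home"),
    ("v3", "home"), ("v3", "shoes"),
]

pages_by_visitor = defaultdict(list)
for visitor, page in log:
    pages_by_visitor[visitor].append(page)

visits = len(pages_by_visitor)
# Bounce rate: share of visitors who left after a single page.
bounces = sum(1 for p in pages_by_visitor.values() if len(p) == 1)
# Conversion: share of visitors who reached the checkout.
buyers = sum(1 for p in pages_by_visitor.values() if "checkout" in p)

print(f"visits: {visits}, bounce rate: {bounces / visits:.0%}, "
      f"visitor-to-customer: {buyers / visits:.0%}")
```

In the retail version, the «page views» are flows measured by video analysis at the entrance, the aisles and the tills, and the same three ratios apply unchanged.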

Why is (big) data an opportunity now?

The foundations for big data were laid down in February 2001 by Gartner analyst Doug Laney, following SGI Chief Scientist John Mashey's seminars in '97, which emphasized the volume challenges. In an analyst report, Laney anticipated both the growth and the impact that big data would have on technology, business and day-to-day life. He was the one who described big data as a 3-dimensional data challenge of increasing data volume, velocity and variety, and characterized big data as «data that's an order of magnitude greater than data you're accustomed to.»

Whereas big data has been around for a decade now, experts agree that 2013 is the year when it will find its way out of the data centers and the tech rooms. The chief financial officer, chief marketing officer and chief sales officer, who are all in charge of growing revenue, are now aware of what sort of opportunities big data is offering.

Four effects & five major trends.

There are four effects and five major trends causing organizations today to rethink the way they approach data management.

1. The Hardware effect: a significant drop in the cost of hardware as a whole and its commoditization (servers, storage, network...). This enables data to be distributed through large clusters of nodes, at relatively low costs.

2. The Tools / Framework / Software effect: the development of open source load balancing systems on distributed architectures for very large volumes of data, among them the Hadoop and MapReduce projects. Many of these tools are available free under open source licensing, helping to keep the cost of big data implementation and management under control.

3. The Data effect: the emergence of fast-growing data volumes much larger and more heterogeneous than before. A lot of unstructured content is produced every day, most of it ill-suited to traditional relational databases, but full of key information for businesses and administrations. The multiplication of traces on the Internet and the proliferation of data from the real world through a wider network of sensors generate large amounts of data that are more real-time oriented.

4. The Market effect: big data has a huge economic and scientific potential. The data accumulated by the services industry holds significant value for the knowledge of customers. This will open the path to a revolution in real personalized services that best meet the needs of users.

These effects will be of no use if not accompanied by the trends which are changing minds both within companies and among ordinary people.

1. Big data is no longer confined to the technical floors: like virtualization and cloud computing in the recent past, and in some ways as a continuation of them, it is becoming obvious that spending on big data is worth the penny, especially as the cost of analytics is minimal compared to the cost of a traditional data warehouse. Moreover, big data cuts data integration costs significantly, and gives tools for data exploration that have never been seen before, offering new ways to use and understand data.

2. Big Data is now mainstream: both the open source community (for instance Hadoop) and long-established IT companies (among them IBM, Microsoft, Oracle, SAP...) have joined forces to enable companies and administrations to invest with confidence in Big Data. This will not be a short hype cycle or the next bubble, but rather an underlying trend for the coming years, based on technical and social breakthrough innovations.

3. Skeptics can be overcome: big and small companies are beginning to see good ROI from the insights provided by the use of Big Data. This is especially true when they can better understand their businesses and the businesses of their competitors.

4. The Chief Information Officer and the Chief Marketing Officer can better work together: once these company executives experience the strengths of big data analytics and how they can better understand their customers, they will never go back to their previous practices.

5. The power of analytics is now available to the masses: following up on the XaaS (X as a service) family, BDaaS (Big Data as a service) providers are emerging with an entire stack of services (acquisition, storage, analytics, visualization...) easily usable by companies of all sizes, even start-ups with no funds and individuals with no skills. This gives everybody a chance to play with the Big Data ocean.

However, there have been some warnings about «Big Data» fatigue in the first quarter of 2013. People seem to get stuck on the 3 V's definition of Big Data, and other V's are proposed (we emphasize them in this cahier de veille) that are all pertinent. But enterprises do not want the everlasting promises made years ago about the benefits of big data and analytics; they want solutions, and those may rest on not-ever-bigger data. It is now time to go beyond the marketing campaign, and to use «Big Data» with parsimony in order not to be filtered out, if we do not want to completely miss the rise of the data-driven economy.

Dave Feinleib (see references page 8) adds 3 I's to the Big Data pot. [http://goo.gl/NTrkO]
• Immediate – in the sense that you need to do something about it now
• Intimidating – what if you don't?
• Ill-defined – what is it, anyway?

quantified-self

From Flu Trends to Quantified Self: how the healthcare sector can benefit from Big Data

Typical areas that produce large datasets are meteorology, space exploration, genomics, physics, simulations, biology, medical research and environmental science. Some of the potential application areas of big data analytics are smart homes and smart grids, energy management, (cyber)security and automation, law enforcement, terrestrial, air and sea traffic control, transportation, location-based systems, urbanism, telecommunications, search quality, manufacturing, retail, online marketing, customer service, billing, trade analysis, financial markets and services, fraud and risk management. Defense applications, business transactional systems and embedded systems are some examples of existing applications producing high velocity data.

Among all these, healthcare is flooded with a deluge of data, collected from multiple sources: from professionals – lab results, biometrics, medical claims, pharmacy claims, point-of-care – to individuals themselves through the internet and social media, as with Google Flu Trends.

At a larger scale – hospitals, regions, countries – the health information exchange initiatives can «be utilized for medical research, contributing greatly to evidence-based medicine, better assessment of incidence, prevalence and causative analysis on certain diseases», as reported by analysts at Wipro. $300 billion was the estimated potential induced by Big Data for the healthcare sector in the US in 2012.

The healthcare sector regularly invites developers to work on its use cases. One of last year's Health 2.0 hackathons in Boston rewarded a team that created the website «No Sleep Kills», through which people can see how poor sleeping patterns can lead to drowsy drivers and auto accidents. These data analytics are useful inputs for different users: the vehicle industry, insurance, and customers. [http://goo.gl/KomYh] The Health 2.0 SXSW Code-a-thon, held in March 2013, was sponsored by BodyMedia, a pioneer in the development of wearable body monitors that collect physiological data for use in improving health, wellness and fitness. Developers had to use the BodyMedia API. [http://goo.gl/Tz789]

Wipro analysts explain how it is possible to reshape the healthcare sector with new technologies: «As healthcare goes now far beyond hospitals, it is possible to have wearable, even internal, sensor-based devices to monitor vital signs and symptoms of patients. From clothes embedded with sensing devices to headsets that measure brainwaves, wearable devices can be seamlessly incorporated into the ensemble.»

According to IMS Research, the wearable technology market was worth $2 billion in 2011 and will reach $6 billion by 2016. The findings reveal that 14 million wearable devices were shipped in 2011, and that number is likely to reach 171 million in 2016.

The French Withings body scale connects to various Health 2.0 services such as Google Health and Microsoft HealthVault, as well as diet and exercise sites such as DailyBurn. Cityzen Sciences is another French company, created in 2008, that specializes in smart textile conception and development. Smart textiles are embedded with micro-sensors enabling them to monitor temperature, heart rate, speed and acceleration, as well as to geolocate. Beyond Cityzen Sciences' activities, the entity Cityzen Data, in the Telecom Bretagne incubator in Brest, will be in charge of managing the storage of the data produced and extracted by smart textiles, then by other sensor networks, in an anonymous and secure environment, and of developing a high-added-value service offer based on the collected data.

Monitoring oneself and voluntarily sharing the data on web services is called Quantified Self, a «movement to incorporate technology into data acquisition on aspects of a person's daily life in terms of inputs (e.g. food consumed, quality of surrounding air), states (e.g. mood, arousal, blood oxygen levels), and performance (mental and physical). Such self-monitoring and self-sensing, which combines wearable sensors (EEG, ECG, video, etc.) and wearable computing, is also known as lifelogging.»

From Big Data to Small Data: giving the power back to the people

And here is a possible miss. Big data is actually mostly customer data: data about customer behavior. From the data collected on the web, the marketing team first wants to quickly identify what seems to interest a customer, and then to organize the customer's navigation (search results, navigation, ads shown…) on the site, then on the rest of the web.

Organizations, equipped with much higher means than those of ordinary people, become able to handle ever-increasing volumes of increasingly heterogeneous information, to detect ever more subtle phenomena, to make increasingly relevant decisions – and ultimately to strengthen their position or to occupy new ones. At no time do they try to talk to or listen to the visitor; they prefer to guess the visitor's intention from past moves. Big data could thus be the latest means that companies have found to avoid talking to their customers.

Owners and users of the data can be individuals, not only large organizations.

This approach has drawbacks, though. Not everybody is equal before the predictive algorithms, because not everybody is on the social networks, or, even so, produces data at the same rate. And beyond these behavioral data on which predictive algorithms are applied, there is no room left for serendipity anymore, and everybody is gradually presented with the same set of results or choices.

Let's change our point of view towards a customer-centric use. In the near future the challenge is not to help the organization do things better – sell more and better – but to help people as individuals do things better. This is what we can call Small Data. Big Data deals with statistics and trends, not specifics and immediate utility; Small Data is quite the opposite. Individuals, more equipped and connected, become active agents able to benefit from the same technology as companies: to analyze the past, to draw conclusions that make sense for them, to plan for the future, to be helped in making decisions, and to implement these decisions towards their stakeholders. They share the right data about themselves and about what they want to do right now, and want practical answers, not answers inferred from a motive or an intention emerging from a crowd pattern. This shift from company-centric to user-centric is the shift from data-crunching to data-sharing.

References and further readings

• Google Flu Trends How-To: http://goo.gl/SUqJd
• «Foundations for big data», Gartner analyst Doug Laney, 2001 (PDF): http://goo.gl/C5iHn
• «Quantifying Trading Behavior in Financial Markets Using Google Trends», Scientific Reports (Nature), February 2013: http://goo.gl/J8f2K
• David Feinleib's slideshares; producer of The Big Data Landscape and of the bigdatalandscape.com website: http://goo.gl/r4TLN. He is the author of the 8 Laws of Big Data that we disseminate in this cahier as a reference.
• http://whatsthebigdata.com/
• http://humanfaceofbigdata.com/
• Gil Press, Forbes: «A Very Short History Of Big Data», May 2013: http://goo.gl/jXn3i
• «Big Data News Roundup: Correlation vs. Causation», April 2013: http://goo.gl/qmfDv
• Miller, H.E. (2013), «Big-data in cloud computing: a taxonomy of risks», Information Research, 18(1), paper 571: http://goo.gl/oJUWO
• «EMC And Big Data – A Fun Explanation», February 2013, 9' video: http://goo.gl/O9SSo
• «Big data, big dead end», January 2012: http://goo.gl/YGvyy ; «Big data, big illusion», April 2012: http://goo.gl/x1kBc ; «Big Data, Big Hype, Big Danger», April 2013: http://goo.gl/hACvn
• «Le recrutement et la productivité à l'heure des Big Data», May 2013: http://goo.gl/7RiHT
• Clémençon et al. (Telecom ParisTech / Liligo, 2012), «A Data-Mining Approach to Travel Price Forecasting»: http://goo.gl/xP44p

The data landscape

What is data?

Data is a brute fact, which has not yet been interpreted. Data is not information; it is a value assigned to a thing. To create information – and then knowledge – out of data, we need to interpret that data. «19°» is a data we can read on our thermostat. All the collected thermostat data of a building make a set of data. «Flats from this building seem to overheat» is an information, one that can be derived from the comparison of this set of data with past data, or with data collected from other buildings in the area. Data is thus typically the result of measurements.

In today's digitally hyperconnected world, data is all around us, in our phones, RFID-tagged clothes, cars, food... Examples of data include a table of numbers representing blood pressure over a month, the characters on this page, or the recording of the sounds made by a person complaining on the phone about overheating. Almost everything we touch or use belongs to and feeds back into a larger data set.

Raw data is not useful in itself. It is unprocessed data that must be refined and interpreted to gain more value. In the data pipeline (see sidebar below), processed data is often the raw data of the next stage.

Data is never neutral. New combinations of data can create new knowledge and insights that were not thought of at the beginning. For instance, monitoring the temperatures in an apartment building can give information about the habits of a resident, which can be correlated with, say, his tax return, and cast doubt on the actual composition of his family.

What is generating data?

75% of data is nowadays generated by individuals. This includes the digital footprints left by people living most of their lives online, the telemetry generated by their devices, and all sources of information about their behaviors. In 80% of cases, companies play a role in the life cycle of data: they store it, protect it, preserve its confidentiality or ensure its good distribution.

All these new datasets come mostly from:
• The web: website traffic logs, indexation, search queries, online transactions, friendships and social media relationships, document, image and video storage...
• Commercial data collected in the real world: transaction logs in a retail store
• Personal data: medical records…
• Public and open data…
• Machines and the Internet of Things – this human-generated data is just the beginning, as a lot more data is now generated by sensor networks, RFID, GPS, phone traffic logs...
• Field data, which is data collected in an uncontrolled in situ environment
• Experimental data and scientific investigation by observation and recording: genomics, astronomy, meteorology, environmental science…

Data law #5: Plan for exponential growth.

Uses of data (1/3): Describe
• Data elements: Patient A's blood pressure at 9 a.m. on Wednesday
• Aggregates: Histogram of current blood pressure readings for 45-54 year old females
• Clusters: Plot of average blood pressure readings by age group for males and females in the country
From Miller, H.E. (2013)

Miller, H.E. proposes first a taxonomy of types of data: atomic data elements, aggregates of data and clusters of data. Three uses of data are then listed for the healthcare sector: «describe» above, «analyze» and «act» on the next pages.
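To make the «19°» thermostat example above concrete, here is a minimal sketch of the interpretation step that turns a set of readings (data) into a statement (information); the temperatures and the threshold are invented:

```python
# Raw data: one temperature reading per flat in the building (degrees C).
building = {"flat_1": 19.0, "flat_2": 23.5, "flat_3": 24.0, "flat_4": 23.8}
# Reference data: average reading collected from nearby buildings.
area_average = 19.5

# Interpretation step: comparing the dataset against reference data
# is what turns isolated values into information.
mean_temp = sum(building.values()) / len(building)
if mean_temp > area_average + 2:
    print(f"Flats in this building seem to overheat "
          f"({mean_temp:.1f} vs {area_average} in the area)")
```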

The data pipeline

[Figure: the data pipeline, running through acquisition, cleaning, transformation, storage / warehousing, processing and presentation, with associated activities: collecting, extraction, refreshing, munging, wrangling, relevancy check, conversion, geocoding, serialization, indexation, adding description & metadata, anonymizing, contextualizing, aggregation, archiving, protecting, deleting, refining, interpretation, learning, analyzing, integration, ranking, visualization, reporting and sharing.]

What do we do with data? Whatever the different types of data, almost all processing can be expressed as a set of incremental stages through the data pipeline above. With small projects, not every one of these stages may be necessary. Ultimately, archived data can be reinserted into the pipeline for new insights.
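A minimal sketch of that incremental principle, with invented stages and values; note how each function consumes the previous stage's processed output as its own raw data:

```python
# Each stage is a plain function; the processed output of one stage
# becomes the raw input of the next.
def acquire():
    return ["  21.0 ", "19,5", "", "23.0"]          # raw sensor strings

def clean(raw):
    fixed = (s.strip().replace(",", ".") for s in raw)
    return [float(s) for s in fixed if s]            # drop invalid records

def contextualize(values):
    return {"building": "A", "readings": values}     # attach metadata

def analyze(dataset):
    r = dataset["readings"]
    return {"building": dataset["building"], "mean": sum(r) / len(r)}

pipeline = [acquire, clean, contextualize, analyze]
data = None
for stage in pipeline:
    data = stage(data) if data is not None else stage()
print(data)   # {'building': 'A', 'mean': 21.16...}
```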

Cleaning and contextualizing data

Data needs to be cleaned and transformed first, to remove invalid records and to obtain a sane set of values. Cleaning and contextualizing is a first reinterpretation process and allows a new look at the original dataset.

Cleaning means combining different datasets into a single table, removing duplicate entries and applying normalization processes. This can be the most time-intensive aspect of processing data, and it still needs human intervention. Badly formatted numbers can indeed be corrected automatically, but without ultimate human control this could lead to big errors, depending on the nature of the numbers. Inconsistencies in data and file corruptions are also corrected at this stage.

Leading organizations of the future will be distinguished by the quality of their predictive algorithms. This is the CIO challenge, and opportunity.

Data needs context too, in order to be really useful. Context turns disparate data points into a story. Data without context tells a misleading story, as a former professor of New York University reports in the NY Times. He wanted to determine with the help of sensors whether students used the elevators more than the stairs, and whether that changed throughout the day. The experiment went well and the data collected told a story: students seemed to use the elevators in the morning and switch to the stairs at night. It was then interpreted as follows: perhaps students were tired from staying up late, and became energized enough during the day to use the stairs. But this appeared to be an entirely different story when the professor had a discussion with the security guards, who revealed that one of the elevators broke down a few evenings during the week, and the lazy students had no choice but to use the stairs.

Back to the Google flu trend. According to an article in the journal Nature, Google's algorithms were wrong this year: their results were double the actual estimates by the Centers for Disease Control and Prevention. Indeed, Google Flu Trends is only one source in addition to the flu surveillance methods, but it nevertheless raises questions. So what went wrong? «Several researchers suggest that the problems may be due to widespread media coverage of this year's severe U.S. flu season», the authors wrote in Nature. Add social media on top of this, and you understand how the news of the flu spread quicker than the virus itself.

«In other words, Google's algorithm [was] looking only at the numbers, not at the context of the search results.»

open data

Open data: when freeing data benefits all people and organizations

True indeed, data is under the spotlights, whether big or small, acquired in real time or belonging to large archived sets. In this data storm, the open data phenomenon plays a particular role, because it opens the way to combining different public and private datasets together, and thereby to developing more and better products and services.

According to the opendefinition.org website, open data is «data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.» Saying «Open data» can emphasize three different meanings: the data that is open; doing the act of opening data; asking people to open their data.

Whereas big data focuses primarily on the benefits offered by exploiting ever-growing massive sets of data, the creation of value in the open data context depends more on the sharing and interoperability possibilities.

As anybody can now produce open data, with no assumptions on the «for what purpose?», it is worth knowing how these data need to be characterized:

• Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the Internet. Open data is not raw data and must also be available in a convenient and modifiable form.

• Reuse and Redistribution: the data must be provided under terms that permit reuse and redistribution, including intermixing with other datasets. Open data are technically, legally and economically open.

• Universal Participation: everyone must be able to use, reuse and redistribute – there should be no discrimination against fields of endeavor or against persons or groups. For example, 'non-commercial' restrictions that would prevent 'commercial' use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

When opening up data, it is important to focus on non-personal data, that is, data which does not contain information about specific individuals. Cleaning and anonymizing data can thus be a mandatory first step.
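A minimal sketch of such a cleaning pass with pandas; the column names, the duplicate entry and the badly formatted number are invented, and the automatic letter-O-for-zero fix is exactly the kind of correction that still deserves a human check:

```python
import pandas as pd

# Two datasets with a duplicate entry and a badly formatted number.
a = pd.DataFrame({"name": ["Smith", "Jones"], "pressure": ["90", "119"]})
b = pd.DataFrame({"name": ["Jones", "Martinez"], "pressure": ["119", "1O4"]})

df = pd.concat([a, b]).drop_duplicates()      # single table, no duplicates
df["pressure"] = (df["pressure"]
                  .str.replace("O", "0")       # automatic fix: letter O -> zero
                  .astype(float))               # normalize the type
print(df)
```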

Processing data

Different types of data (structured and non structured)

The two major categories of data are qualitative and quantitative data.

• Qualitative data refers to the quality of something: colors, shape, texture of an object…

• Quantitative or numerical data refers to numbers: temperature, size, price, number of items…

Beyond qualitative and quantitative, data falls into more categories:

• Discrete data, which is numerical data like a count. Shoe sizes are discrete data, even if they are not the same amongst different countries.

• Continuous data, which is numerical data within a continuous range. Feet sizes are natural and continuous data, with a minimum and a maximum.

• Data can also describe an item by the categories it belongs to: one's foot can be qualitatively described as big, or belong to the «big» category; one shoe can be «brand new», «old fashioned» or «broken».

Data also falls along a continuous spectrum from structured to unstructured.

• Structured data refers to data that is identifiable because of a high degree of organization, such that inclusion in a traditional relational database is seamless.

• Unstructured data is all sorts of documents in all formats (Word, PDF...), XML or JSON documents, emails, images, videos, slideshare presentations, and more recently social media posts and comments on social network walls, tweets, and all natural language logs such as customer service call and chat logs... Unstructured data is any data not in a relationship format, and not related to a predefined data model.

• There is always some sort of structure: unstructured data can indeed be structured at some level. This is the case with web logs, a commonly cited form of unstructured data. The URI, which forms part of the log, or the dates, are well structured data, but at a different level than the one at which analysis is to be performed. One would better describe web logs as mixed-type data. A text document is inherently semi- or poly-structured: some can look at it as a plain bag of words, while others will query the words via stemming and synonyms, and others will study the document's grammatical structure.

This is where a fourth V comes into the big data definition: variability, when the data format changes, or when one adds just a field to the initial data structure. Format can also change over time, for the user's purposes, or because similar data are added from another source that was different in structure and format.

Data law #3: Use more diverse data, not just more data.

Let's not confuse variability with variety. You meet variety when you go to the newsstand and have access to dozens of different newspapers. Now choose one and come back over the week to pick the same title: every day there are fresh news and stories, and new styles to tell the stories. This is variability.

Big data solutions allow data to be stored in its original form, either structured, unstructured or semi/poly-structured, in its entire variety and variability, and to be available for analysis when a user queries the data.

Unstructured data is often estimated to account for 70-85% of the data in existence. «Big data is about looking ahead, beyond what everybody else sees,» said Peter Sondergaard, senior vice president at Gartner and global head of Research. «You need to understand how to deal with hybrid data, meaning the combination of structured and unstructured data, and how you shine a light on 'dark data.' Dark data is the data being collected, but going unused despite its value. Leading organizations of the future will be distinguished by the quality of their predictive algorithms.»

Uses of data (2/3): Analyze
• Data elements: Time series of Patient A's blood pressure
• Aggregates: Correlation between 45-54 year old female blood pressure readings and daily calories consumed
• Clusters: Regression analysis of group health status vs. caloric consumption with sex and age as 'dummy' variables
From Miller, H.E. (2013)

Uses of data (3/3): Act
• Data elements: Prescribe medication and dosage level to treat hypertension
• Aggregates: Change budget for proper eating habits information campaign
• Clusters: Run simulation model to predict change in group health status using various budgets as health intervention scenarios
From Miller, H.E. (2013)
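The web log observation above (structure hiding inside «unstructured» data) is easy to demonstrate: a log line is unstructured as a whole, yet the date and URI inside it are perfectly structured. A sketch on a made-up, Apache-style line:

```python
import re

line = ('93.184.216.34 - - [12/May/2013:10:24:01 +0200] '
        '"GET /products/42 HTTP/1.1" 200 512')

# A simple pattern recovers well-structured fields from the "unstructured"
# line: when the request happened, which method, which URI.
match = re.search(r'\[(?P<date>[^\]]+)\] "(?P<method>\w+) (?P<uri>\S+)', line)
if match:
    print(match.group("date"))    # 12/May/2013:10:24:01 +0200
    print(match.group("uri"))     # /products/42
```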

Managing the petascale computing

Specialized software and hardware are needed for the data challenge, whether big, fast, open, secured or real-time: developments in parallel and distributed processing are necessary for working in a reasonable amount of time. Let's begin with two key enablers, not completely related to each other: the NoSQL database revolution, and the Hadoop computing stack and ecosystem.

The NoSQL revolution

NoSQL (mostly interpreted as «Not Only SQL») offerings are closely associated with Web application providers, becoming key foundations of any web-scale computing stack, whether for online Create, Read, Update, Delete (CRUD) applications or for off-line analytics. It is a broad class of database management systems identified by its non-adherence to the widely used relational database management system model and its ACID rule (see below the limits of traditional database management systems). NoSQL databases are not primarily built on tables and, as a result, generally do not use the widely used SQL language for data manipulation. Instead, they are accessed via plain get and put commands. Much of the structure of the data is eliminated in the queries, which are often reduced to straight key-value pairs. Fixed table schemas are not mandatory, and join operations are avoided.

NoSQL products like Cassandra, HBase (also a part of the Hadoop ecosystem, see below), CouchDB and MongoDB (to mention a few) eliminate relationships between data entities for the benefit of allowing greater horizontal scalability, partitioning across several machines, and replication. NoSQL also gives the developer a more flexible «on-read» schema model that has its benefits.

There are currently 4 types of NoSQL databases: Key-value (e.g. Memcached); Column oriented, or BigTable clones (e.g. Cassandra); Document oriented (e.g. CouchDB, MongoDB); Graph (e.g. Neo4j). See [http://goo.gl/Z3Rh] for a complete list of NoSQL solutions.

Big data is not new, but the tools are.

Data law #2: Maintain one copy of your data, not dozens.

The limits of the traditional database management systems

Relational database management systems (RDBMS) store data in relational tables (tables related to each other via one common element), structured in rows and columns, and use the Structured Query Language (SQL) for accessing and manipulating the data inside. Unstructured data cannot fit well in such tables, as its schema is not known in advance, and is typically stored in key-value pairs not accessible via SQL.

SQL and relational databases have failed over the past decade to keep up with the scaling demands and the performance or agility requirements coming from large social network datasets, and this paved the way to different new approaches and various database alternatives. In those earlier days, analytics were performed on frozen data, after a so-called ETL (Extract, Transform and Load) approach. Data was extracted from OLTP (On-Line Transactional Processing) systems, where it obeyed the ACID (Atomicity, Consistency, Isolation and Durability) rule, and loaded onto data warehouse systems able to handle large volumes of data. This is good for analyzing past performance, but fails to deal with real-time insight demands. Additionally, new data is coming at increasing speeds, 80% of it being unstructured and failing to fit in the traditional rows and columns.

ID  Lastname  Firstname  Blood pressure
1   Smith     Mary        90
2   Jones     Lena       119
3   Martinez  Sara       104
4   Smith     Clara      123
Data organized in a row-oriented database

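The get/put access model mentioned in «The NoSQL revolution» above is literally the whole query surface of the key-value family. A toy in-memory stand-in for illustration, not the actual API of any particular product such as Memcached:

```python
class KeyValueStore:
    """Toy stand-in for a key-value NoSQL store: no tables, no fixed
    schema, no joins - just keys mapped to opaque values."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
# The value can be anything - here, a schema-less user profile.
store.put("user:42", {"name": "Mary Smith", "blood_pressure": 90})
print(store.get("user:42"))
```

Because each key is independent, such a store partitions and replicates across machines trivially, which is the horizontal scalability traded against relationships and joins.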
The Hadoop stack and ecosystem

The technology solution that is most associated with big data is Hadoop, an open source framework that is now 10 years old. It can be seen today as a catalyst for Big Data. Hadoop and others have democratized data management, off-line batch computing and analytics, making them accessible – in both practicality and cost – to small companies and organizations.

One of the first big data challenges was to harvest, index and rank the billions of web pages out there, and make better search engines. In 2004 Google issued a famous paper on the «MapReduce algorithm» (a computational approach that involves breaking large volumes of data down into smaller batches, and processing them separately) and their Google File System. This inspired people at Yahoo!, who wrote an open source version in Java, as an Apache project: the Hadoop framework. It supports the running of applications on large clusters of commodity hardware (commodity servers that in turn use commodity disks), divided into many small fragments of work (the MapReduce idea), each of which may be executed or re-executed on any node in the cluster, meaning that clusters can grow as data volumes do. It is associated with its own distributed file system (HDFS) that stores data on the compute nodes, providing high bandwidth across the cluster. Nowadays, Hadoop refers to a larger ecosystem of software packages, including MapReduce, HDFS, many packages to support the import and export of data into and from HDFS, and other distributed file systems like Amazon S3, or newer, more efficient DFS.

Hadoop can be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive (with data that does not have to be structured), and that is able to work on pieces of data in parallel. However, Hadoop now goes far beyond mere MapReduce jobs. It can be used for other applications, many of which are under development at Apache: the HBase database, the Apache Mahout machine learning system, the Apache Hive Data Warehouse system (essentially providing a SQL abstraction over MapReduce), or Yarn, the nextgen Hadoop framework for job scheduling and cluster resource management, to name a few.

Massively Parallel Processing

Big data is not all about MapReduce. Massively Parallel Processing (MPP) is another approach to distributing query processing, more corporate oriented than the academic- or research-oriented MapReduce. In both approaches the processing of data is distributed across a cluster of compute nodes, each separate node processing data in parallel, with the final result being assembled from the node-level outputs. However, MapReduce and MPP are used in rather different scenarios, and use different hardware. MPP is used on expensive, specialized hardware tuned for CPU, storage and network performance. MPP products are thus bound by the cost of these assets and the software, and by their finite hardware.

MPP columnar analytic databases have become a common choice for real-time analytics on structured data. Whereas in traditional relational databases the data is organized in rows, MPP uses a column-oriented organization, in a compressed form, yielding faster queries. Moreover, the nature of MPP facilitates «scale out» by simply adding more commodity hardware.

Grid / Cache

The NoSQL solution still relies on data stored on disk, making it impracticable for real-time use due to all the reads and writes. Additionally, the delay in replication could result in datasets being out of date. Transactions that need speed and reliability are not good candidates for NoSQL. Big data grids can now store data across many in-memory nodes, rather than in data stores on disk. All the reads and writes from disk are eliminated, data can be queried up to 10x more rapidly, and the overall performance is more consistent. These databases are able to incorporate real time data with learned behavior, and react in real time.

This was made possible thanks to the decrease of memory costs in the past few years, and their easy commoditization in the cloud. Real-time in-memory databases simplify the internal optimization algorithms and make better use of the hardware.

NewSQL

NewSQL addresses the problem of datasets mixing structured and unstructured data. The term was first used by 451 Group analyst Matthew Aslett in a 2011 research paper. NewSQL is a set of various new scalable, high-performance SQL database vendors (or databases); it is not a new query language. NewSQL databases provide the same scalable performance as NoSQL systems for OLTP workloads while still maintaining the ACID guarantees of a traditional single-node database system. They are built on a scale-out, shared-nothing architecture, capable of running on a large number of nodes without suffering bottlenecks.

Three different approaches are adopted by vendors. First, new databases are proposed, designed from scratch to achieve scalability and performance. Some changes to the code may be required, and data migration is still needed from older databases. Performance is obtained via non-disk (memory) or new kinds of disk (flash/SSD) data stores. Some solutions are software-only (VoltDB, NuoDB and Drizzle). Secondly, vendors offer new MySQL storage engines: MySQL is used extensively in OLTP and web services. The third approach is to ensure the scalability of OLTP databases by providing a pluggable feature to cluster transparently.

Row-based systems are designed to efficiently return data (for instance: Smith Mary 90) for an entire row, or record, in as few operations as possible. In a column-oriented DBMS, data is stored as sections of columns, making it efficient to «find all the people with the last name Smith» in one operation.
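The MapReduce idea described above fits in a few lines. A single-process sketch of the canonical word count; real Hadoop distributes the map and reduce phases across cluster nodes and handles the intermediate shuffle for you, so this only shows the programming model:

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is not new", "the tools are new"]

# Map phase: each document independently emits (word, 1) pairs,
# so documents can be processed in parallel on different nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (here: sort, then groupby).
mapped.sort(key=itemgetter(0))

# Reduce phase: each key is reduced independently - again parallelizable.
counts = {word: sum(n for _, n in pairs)
          for word, pairs in groupby(mapped, key=itemgetter(0))}
print(counts)   # {'are': 1, 'big': 1, 'data': 1, 'is': 1, ...}
```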

OLAP (Online analytical processing) is an approach to answering multi-dimensional analytical queries. Databases stored in so-called OLAP Cubes use a multidimensional data model, and provide fast access to knowledge through techniques that include pre-aggregated, pre-built analytics in the cube. Big data needs ad-hoc data exploration and knowledge self-discovery, which is not possible in an OLAP cube built on prior requirements and assumptions.

Huge storage databases need huge cooling. Here is a glimpse at Google's data centres.

Data law #8: Big Data is transforming business the same way IT did.

The data market The data landscape This figure compiles several landscapes of the data market (see refs at the These application development platforms are often built on top of Hadoop and/or scale-out platforms. They provide additional analytical capabili- ties beyond what the underlying database bottom right). «» indicates the page can natively provide. Example includes Infochimps, Continuuity, number where details are provided. Acunu, Wibidata, Causata, LucidWorks, Cityzen Data…

Big data is creating a new layer in the econo- Oracle, SAP, IBM, Tableau, Palantir, my at full speed. All is about turning data into Hortonworks, MapR, Vertica, Cloudera, Microsoft, bime, chart. metaLayer, dataspora, information, knowledge and, ultimately, into Greenplum, EMC2, IBM, Kognitio, Datas- io, GoodData, Talend, Teradata, Metamarkets, revenue. This will accelerate growth in the glob- tax, Exasol, Actian ParAccel… Jaspersoft… Datameer, ClearStiry, al economy. Estimation from analysts in 2013 visual.ly, platfora, alteryx, suggest that the global Big data market will grow business to a staggering $16.9 billion by 2015. Big data intelligence panopticon, Cinequant, is not a business model, it is business. It is an analytics infrastructure Treerank, Visibrain, enabler for new ways of conducting business. Dataveyes, Focusmatic, Splunk>, Loggly  kwypesoft, The Metrics  New business need data-driven business Talend, Informatica, Pentaho, Flume, Sumo Logic… factory, Squid solutions… models Oqoop, Squid solutions… log data apps analytics and visualization Hardware, software and networks having been Some companies or solutions can be found in commoditized to the point that they are essen- several classes, and this can change over time. tially free, Data is the only business (model) that French companies are underlined. is left: either monetize data, or provide the infra- data integration structure to enable the monetization of data. Data sources include: environmental sensors, social & commercial web, ser- Big data enables companies to create new prod- Factual, GNIP, DataSift, Inrix, vice providers, infrastructure providers, ucts and services, enhance existing ones, and LexisNexis, Kaggle, Windows Azure Mar- telecom networks, open data , quan- invent or refine business models. Companies ketplace, Space Curve, Loqate… which don’t take the turn of (big) data, or which tified self data , geocoding services, do not take the time to analyze the dark data datasets and datasets directories… they have in hands, or how collecting data might data as a service change their business, will soon be wiped out sources

Traditional enterprise data is only growing at 20% year over year, while the amount of new data being stored is growing in the order of 50%. There will thus be two key shifts within the storage industry:

• a move towards more commodity-based storage that can potentially take the place of traditional storage
• a new set of high-scale storage architectures to store all this new data.

The large sensor clouds are still in their infancy. Open source initiatives like OpenIoT (Open Source blueprint for large scale self-organizing cloud environments for IoT applications) aim at developing an open source middleware platform to connect Internet objects to the cloud. http://openiot.eu/

[Figure, continued: further segments of the data landscape: no sql (CouchDB, HBase, MongoDB, Neo4J, InfoGrid, Infinite Graph, Cassandra, db4o, Oracle Coherence, ObjectStore, GemStone, Polar…), hadoop (Cloudera, Intel, Hortonworks, EMC2, IBM, MapR, Hadapt, Ubeeko…), new sql (AmazonRDS, SQLAzure, FathomDB, ScaleBase, Continuent, Drizzle, HandlerSocket, VoltDB…), sensor as a service (Withings…), infrastructure as a service (Windows Azure, Infochimps, Google BigQuery…), and hardware providers (Dell, Cisco, EMC2, NetApp, SGI, Fusion-io…).]

The amount of new data stored varies across geography: new data stored in petabytes by geography in 2010 (source: Wipro); open data initiatives by regions as of 2012 (source: Open Knowledge Foundation).

At the beginning of this century, data offers a new Industrial Revolution, one that was announced by the Internet and the mobile. Smart companies first think of data as an asset, upon which they experiment and build business models.

Data registers, APIs and mashups at the heart of new businesses

Two keys of understanding are needed to explain how this data ecosystem develops itself so quickly. First of all, everything is happening in the interfaces, what is called mashups and APIs. As data grew to big data, companies took to storing it in the cloud, and others developed tools in the cloud to process it. But data gains tremendous value when correlated with other data from various sources you don't own. Using the directories of public or paid datasets is the second key. In a matter of hours, during Startup Weekends, clever developers (Safetyline, for instance) can create new services by tapping a continual stream of information from internal and external sources, and by querying the APIs of big web services like Twitter, Amazon or eBay, or smaller APIs discovered on directories like http://www.programmableweb.com/. All these APIs are well documented, and some of the datasets they are connected to can even be queried in natural language. A billion-record dataset such as Versium enables specific individual queries that cross traditional marketing boundaries (social-graphic, demographic and psycho-graphic) in both the online and off-line worlds.

France's assets

France's entrepreneurs and scientists can play an essential role in the new Economy of data. First of all, the French School of Mathematics is renowned all around the world. In France, 27% of college students earn a degree in math, science, technology or engineering, compared with only 17% in the U.S.; of the 52 winners of the Fields Medal, 11 have been French. The second point is its Telecom industry, used to providing detailed and quick reporting across tera- or petabytes of data. Ultimately, the French big data startups and champions, the clusters dedicated to the digital creative industries, and the higher education institutes and research centers are all working together with the French government on major data projects.

A Do Tank for the French Big Data community

In France, Aproged and Cap Digital founded the Alliance Big Data in March 2013, and have been joined by ADBS, APEC, GFII, Institut Mines-Télécom, and others. http://www.alliancebigdata.com/

[Figure, continued: horizontal platforms (Forbes, IBM, Deloitte, ThinkBig, Teradata, Cetadata…), vertical platforms and ad/media apps (Rocketfuel, MediaScience, Bluefin, Data Collective, Recorded Future, Cityzen Sciences…), privacy and security (Interdividual, Gazzang, virtru, Greenbureau, Dataguise…), edw / sql (Teradata, Vertica, Netezza, Oracle…), grid / cache (GridGain, Terracotta, Infinispan, memcached…), applications for a specific industry vertical (Predictive Policing, Bloomreach, Myrrix, Explorys, Palantir, Kyruus, Splunk>, Sumo Logic…), cloud providers (Amazon Web Services, Eucalyptus, AppScale, GoGrid, Zinux, Intercloud…).]

References

The (big) data market is currently a very noisy market. This figure joins several landscapes made by the analysts and the industry:
• the big data landscape by Dave Feinleb [2012, http://goo.gl/MJu3M]
• the big data open source tools [jan. 2013, http://goo.gl/WUlTe]
• the Hadoop ecosystem by Datameer [jan. 2013, http://goo.gl/se8wH]
• the database map by the 451 Group [feb. 2013, http://goo.gl/QgRI5]
• the big data ecosystem by Sqrrl [mar. 2013, http://goo.gl/y7Nc8]

All these landscapes demonstrate the lack of agreement on what the different segments of the (big) data market are and what to call them (see «the big data landscape revisited» [april 2013, http://goo.gl/hEsVG]). For an evolution of these references over the years, read «Getting a grip on the rapidly changing DBMS landscape» [http://goo.gl/cgHwz].

Unlocking data value

Machine learning

The question is then: what is the meaning of all this data? Basic analytical methods used in Business Intelligence and Enterprise Reporting tools delivered simple sums, counts, averages and results from SQL queries. These analytics were specified by humans who knew what should be calculated and how to do it. This is not possible anymore with datasets too large for comprehensive analysis. It is not possible for an analyst to test all hypotheses and unlock the value buried in the multiple data sources. New algorithms are needed to deal with big data: existing statistical algorithms do not scale, and using sampling for prediction may miss important facts or phenomena.

Machine learning systems automate decision making on subsequent data points, automatically producing outputs like classifications, recommendations (as with Amazon's products) or groupings. Technologies include WEKA, Mahout, MOA, scikit-learn, SkyTree…

Value is the fifth V of Big Data

IT departments have had to make decisions about which data to keep and how long to keep it, at a time when the processing power required to perform analysis was far beyond their capacities. Machine learning now helps to unlock the Value in the datasets, as soon as the data is produced.

Machine learning classes

Machine learning algorithms generally fall into two classes, depending on whether the training data set includes the correct or desired answer or not. When the desired answer is known and drives the algorithm, this is called supervised learning. A training dataset is provided to the algorithm, with all the data attributes for each training instance, but also the correct class for the instance. At the opposite, unsupervised learning makes it possible to find insights on data with no clues. It is provided with unlabeled data inputs and has to discover by itself any associations or relationships between the data instances. This algorithm is an autodidact: it proceeds by seeking and grouping similar data. It is very interesting to use this technique when we do not know what we are looking for. For example, speech recognition, especially the isolation of the voice with respect to a noisy environment, uses an unsupervised neural network algorithm.

Reinforcement learning is a class of algorithms somewhere between supervised and unsupervised learning. From supervised learning it borrows the knowledge of a desired outcome. A sequence of steps is needed to arrive at this known outcome, but it is not known whether every step goes effectively towards the goal or not. The right answer is never given, and like unsupervised learning, reinforcement learning systems are trained with unlabeled data. However, some distance measure to the goal is computed, and the internal mechanics of the algorithm are rewarded or punished, according to their positive or poor progress towards the desirable outcome.

Scikit-learn integrates machine learning algorithms in the tightly-knit scientific Python world, building upon numpy, scipy, and matplotlib. As a machine-learning module, it provides versatile tools for data mining and analysis in any field of science and engineering. It strives to be simple and efficient, accessible to everybody, and reusable in various contexts. See also «contributions» page 27. [http://scikit-learn.org/] For a presentation of scikit-learn, read [PDF: http://goo.gl/kiZ2i].
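As a minimal illustration of these two classes with scikit-learn (a sketch on the library's bundled iris sample, not a production recipe):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the training set includes the correct class for each instance.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))      # predicted classes for the first instances

# Unsupervised: only unlabeled inputs; the algorithm groups similar data
# by itself, without ever seeing the correct answers.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:3])          # discovered groupings, no labels given
```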

A 2011 McKinsey report suggests suitable technologies for machine learning include A/B testing, association rule learning, classification, cluster analysis, crowd-sourcing, data fusion and integration, ensemble learning, genetic algorithms, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modeling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualization.

Multidimensional big data can also be represented as tensors, which can be handled more efficiently by tensor-based computation, such as multi-linear subspace learning. Additional technologies being applied to big data include massively parallel-processing databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.

Machine learning applications

According to SkyTree, a leader in advanced analytics services, machine learning can be applied wherever data is available to gain new insight and improve decision making:

• E-Tailing: product recommendation engines, cross-channel analytics, events/activity behavior segmentation
• Retail/Consumer: merchandising and market basket analysis, campaign management and optimization, supply-chain management and analytics, event/behavior-based targeting, market and consumer segmentations
• Financial services: next generation fraud detection/risk management, credit risk scoring decision making, high speed arbitrage trading, abnormal trading analysis/detection
• Telecommunications: pricing optimization, customer churn management, call detail record analysis, network performance optimization, mobile user behavior analysis
• Health and life sciences: health insurance fraud detection, campaign and sales program optimization, patient care quality and program analysis, drug discovery and development analysis
• Government: fraud detection, threat detection, cybersecurity, energy network management/optimization
• Energy industry: smart grid management (smart meter data driven), power generation management
• Web scale use cases: click-stream segmentation and analysis, ad targeting/selection, forecasting and optimization, click fraud detection/prevention, social graph analysis for marketing optimization, customer segmentation

Data law #4: Data has value far beyond what you originally anticipate. (Don't throw it away.)

Is there a model for predicting the success of a crowdfunding campaign? BigML, a machine learning online service, dug into 17,000 crowdfunding campaigns, and can predict that 80.5 percent of campaigns with more than 34 backers will succeed. The datasets are still available to play with in this article: http://goo.gl/QJ8cT.

Parametric and non-parametric modeling

Two of the main trends in machine learning are «parametric» and «non-parametric» modeling, terms which have been used in different ways in the statistics and machine learning literature. A parametric learning algorithm involves a fixed family of functions, indexed by a fixed number of parameters (independent of the number of training data): a linear regression or a multivariate Gaussian distribution are such algorithms. These models are very accurate only if the correct model class and parameters are chosen. A learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data increases. Such algorithms are agnostic to the model, and cannot achieve the same accuracy as the best parametric model.

Machine learning tasks

The seven most common machine learning tasks are listed below, along with the most common associated algorithms. The first criteria for choosing the best algorithm are its accuracy, its speed (in training or testing phase), its interpretability (how it made its predictions), and how simple it is to tune. In the following list, parametric algorithms are in italics, non-parametric ones are underlined.

• multivariate querying, used to find similar objects: nearest neighbors, range search, farthest neighbors
• density estimation, used to find likelihood of objects: mixture of Gaussians, density estimation tree, kernel density estimation, kernel conditional density estimation
• classification, used to predict a category: kernel logistic regression, decision tree, nearest-neighbor classifier, kernel discriminant analysis, neural network, support vector machine, random forests, boosted trees, deep learning
• regression, used to predict a number: linear regression with variable selection (LASSO), regression tree, kernel regression, Gaussian process regression, support vector regression
• dimension reduction, used to reduce the number of columns: principal component analysis, non-negative matrix factorization, independent component analysis, manifold learning / kernel PCA, maximum variance unfolding, Gaussian graphical models, discrete graphical models, compressed sensing
• clustering, used to find natural groups: k-means, hierarchical clustering, mean-shift, topic models
• testing and matching, used to compare datasets: minimum spanning tree, bipartite cross-matching, n-point correlation two-sample testing

Webinars like [http://goo.gl/8lpZH] by SkyTree are a valuable help to find one's way among these algorithms, available for everyone in Hadoop frameworks like Mahout [http://goo.gl/j7PEo].
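To make the parametric/non-parametric distinction concrete, here is a minimal scikit-learn sketch on synthetic data (both model choices are ours, for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80))[:, None]
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Parametric: two numbers (slope, intercept) summarize all it has learned,
# however many training points are provided.
lin = LinearRegression().fit(X, y)

# Non-parametric: predictions consult the stored training points themselves,
# so the learned function grows in complexity with the data.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

print(lin.predict([[2.5]]), knn.predict([[2.5]]))
```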

Realtime analytics

Once the data is acquired and cleaned, it is time to extract insights from it. The three biggest analytics areas are customer interaction (for further recommendation and personalization), network and sensor monitoring, and game and mobile application back-ends. Add to it algorithmic trading, anti-fraud, risk measurement, law enforcement/national security, healthcare and stakeholder-facing analytics.

Advanced analytics can also enable organizations to penetrate new markets, grow revenues, track competition, forecast demand, drive product and service differentiation, analyze detailed transactions to better understand customer patterns, provide advertisers with more granular targeted advertising, acquire and retain new customers, better predict customer churn and profitability, enhance visitor experiences, and respond to market dynamics and regulations.

«Nowadays, organizations want to perform "deep analytics" on the massive datasets. This ranges from statistics (averages, correlations, regressions) to more complex functions such as graph analysis and predictive analytics by using advanced machine learning.»

Real-time analytics enables users to get up-to-the-minute fast data by tapping directly into what is happening in their ecosystem, as it happens. Many companies have thus rethought traditional approaches to performing analytics. Instead of downloading data to local desktops or servers, they are running complex analytics in the database management system itself. In France, Squid Solutions offers analytics as a service on fast data, providing a low-latency high-availability worldwide server network to track their customers' interactions with all of their web properties, affiliates, display advertising, etc.

New software is needed to process real-time information from highly dynamic sources, and web companies such as Twitter have developed and released these new technologies in open source. Storm, from Nathan Marz (now with Twitter), operates on streaming data that is expected to be continuous. These complex event-driven systems allow to identify meaningful events from a flood of events: one of the most interesting examples provided by Twitter is the generation of trend information (a toy sketch of this idea follows the figure below).

Data analysis for everyone

A lot of companies still don't have the technological sophistication to understand the whole data science and cannot hire data scientists to dig into their data. The hardest work is the so-called data munging, and it gets harder with scale. Online services for advanced analytics on unstructured data, like Precog (see the example of web analytics below), offer tools targeting specific use cases. Sentiment analysis and natural language processing are highlighted in the social media one, while the web analytics one focuses on features such as behavioral clustering. And the visualization of the results has been particularly tuned.

Value of data over time: the value of a data element goes down with time, while the value of a cluster of data goes up with time. (Source: VoltDB)

[Figure, continued (source: VoltDB): workloads mapped by application complexity and age of data, from traditional RDBMS and New SQL on the fast, simple, small side to NoSQL, Hadoop and data warehouses on the slow, complex, large side:]

• Transactional, interactive (milliseconds): place trades, examine packets, approve transactions
• Transactional, real-time (10 milliseconds): calculate risks, serve ads, leaderboards, counts
• Transactional, record lookup (seconds): retrieve click streams, show orders
• Analytic, historical (minutes): backtest algorithms, business intelligence, aggregates, daily reports
• Analytic, exploratory (hours): algorithm discovery, log analysis, fraud pattern matching
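The trend-detection idea described above, surfacing meaningful events from a flood of events, can be sketched in plain Python (a toy sliding window, not Storm's actual API):

```python
from collections import Counter, deque

WINDOW = 1000                      # keep only the most recent 1000 events
window, counts = deque(), Counter()

def observe(term):
    """Feed one event from the (endless) stream and report current trends."""
    window.append(term)
    counts[term] += 1
    if len(window) > WINDOW:       # slide: forget the oldest event
        counts[window.popleft()] -= 1
    return counts.most_common(3)   # the currently "trending" terms

for event in ["flu", "cat", "flu", "vote", "flu"]:  # stand-in for a real feed
    trends = observe(event)
print(trends)                      # [('flu', 3), ('cat', 1), ('vote', 1)]
```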

Data visualization: telling stories with data

Data visualization tells a story on data, and this story could mislead the reader. The figure on the left shows the same data with two different presentations. Say it is the percentage of production of three goods, A, B and C, over a period of three years. A is 20% of the whole production the first year, and B and C 30% and 50% respectively. Depending on how you order the goods and the colors you choose, you will not give the same impression of the company's situation.

[Figure: two stacked-area charts of the same production shares for goods A, B and C over 2011-2013. Same data, different visualization: in which company would you like to invest?]

Data visualization helps to find new meanings in data that were misinterpreted before, and to share that meaning with others. Through her work as a nurse in the Crimean War, Florence Nightingale was a pioneer in applied statistics and data visualization, back in 1855. She gathered data relating death tolls in hospitals to cleanliness, and communicated her results through a new sort of pie chart, afterwards called roses, polar area charts or coxcombs. This particular visualization emphasized the real causes of death among soldiers, and showed that more soldiers were dying from preventable illnesses than from their wounds.

As these data could not have been explained with a simple pie chart, it is worth nowadays being aware of all the different visual representations we have in the toolbox. More than one hundred visualization methods exist; Ralph Lengler & Martin J. Eppler listed them as a periodic table in 2007, available as an interactive visualization online. This applies to data, information, concept, metaphor, strategy and compound visualization.

Data visualization can tell stories to better understand complex data and/or large amounts of multidimensional data. In a famous video lecture published on YouTube, Hans Rosling, the Swedish statistician and medical doctor, runs through 200 years' worth of augmented-reality data visualization, telling the story of economic development and health in 200 countries over 200 years using 120,000 numbers, in a mere four minutes. Plotting life expectancy against income for every country since 1810, Hans shows how the world we live in is radically different from the world most of us imagine. [http://goo.gl/Aat8Y]

Visualization tools are now available for the masses. Free data visualization tools can help create an interactive viz in minutes, embed it in a website or share it. Lots of visualization javascript libraries and frameworks are available to allow scientists, journalists and companies to make their data more visual and thus sharable. The major social networks provide tools to let their users view and understand the relationships between themselves and their connections, and how to grow and use them better. These tools can make sense out of the social network noise.

Beware a new deluge of data visualization! It is above all about giving attention to data that is too valuable to remain on the shelf, and communicating an idea that will drive action. Three requirements and three legitimate reasons must be satisfied to make data visualization valuable and worth the effort:

• Information must be interpretable. With so much unstructured data used today, interpretation must come from the associated metadata: what is the data, where, when and how it was collected...
• Information must be relevant to the people looking for insights, and to the targeted purposes.
• Information must be original or, as Nightingale proved in 1855, shed new light on a phenomenon.
• The dashboard reason: visualizations can help to check assumptions about how a system we are interested in operates, to see how it can deviate from a predefined model, and to make relevant decisions.
• The gameification reason: play with data, develop intuition and new insights on the behavior of a known system, replay long time series data in a shorter experimental time frame.
• The exploration reason: when data is too complex to be understandable with mere statistics, visualization tools can help to build a model that makes it possible to ask the right questions of the system.
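Returning to the goods A, B and C example, a minimal matplotlib sketch shows how stacking order alone changes the impression the same numbers give (the 2012 and 2013 percentages are invented for illustration):

```python
import matplotlib.pyplot as plt

years = [2011, 2012, 2013]
a, b, c = [20, 25, 30], [30, 35, 30], [50, 40, 40]   # shares of production (%)

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Stacking order 1: C at the bottom draws the eye to its decline.
left.stackplot(years, c, b, a, labels=["C", "B", "A"])
left.set_title("C, B, A")

# Stacking order 2: the very same data with A at the bottom reads differently.
right.stackplot(years, a, b, c, labels=["A", "B", "C"])
right.set_title("A, B, C")

plt.show()
```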

Rise of the data jobs

New skills are required to deal with data, and each of these skills has a unique function in big data analytics. Moreover, the data specialists operate at different levels:

• data architects put together disparate types of data in new ways to create fresh insights
• data engineers / operators develop the architecture that helps analyze and supply data
• data visualizers translate analytics into comprehensive and sharable information

Then, there are the data change agents who have an informal evangelist role, data stewards who ensure that data sources are properly accounted for and maintained, and data virtualization / cloud specialists who build and support the Database as a Service functions.

«But there is a challenge» said Peter Sondergaard, senior vice president at Gartner and global head of Research. «There is not enough talent in the industry. Our public and private education systems are failing us. Therefore, only one-third of the IT jobs will be filled. Data experts will be a scarce, valuable commodity. IT leaders will need immediate focus on how their organization develops and attracts the skills required. These jobs will be needed to grow your business. These jobs are the future of the new information economy.»

McKinsey projects that by 2018, the U.S. will need 140,000 to 190,000 people with expertise in statistical methods and data analysis, the «deep analytical talent», and 1.5 million more data-literate managers, people capable of analyzing data in ways that enable business decisions.

«By 2015, 4.4 million IT jobs globally will be created to support Big Data, generating 1.9 million IT jobs in the United States,» adds Peter Sondergaard. «In addition, every big data-related role in the U.S. will create employment for three people outside of IT, so over the next four years a total of 6 million jobs in the U.S. will be generated by the information economy.»

However, companies attempting to handle the data challenges with silo-ed statisticians, computer scientists or MBAs will certainly fail. What is needed are professionals with a convergence of skills, somewhat called «data scientists», with a strong background in artificial intelligence, natural language processing and data management.

This convergence-of-skills view of the data scientist is attributed to Drew Conway, in September 2010. He explains that the traditional researcher may have substantive expertise and learning, as well as statistical skills and the ability to use algorithms, but is lacking the training and experience to manipulate raw data in a clever and skillful way. To him, data plus math and statistics only gets you machine learning: such a specialist knows hacking and math/stat, but is substance-free. In other words, most of the analytics in machine learning are theoretical and model-free. Finally, Conway places a danger zone at the convergence of hacking skills and substantive experience. These are people who «know enough to be dangerous». They may have discovered an interesting fact, but as they do not master statistics, they cannot distinguish between a random event and a systematic pattern. «Either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created» says Conway.

Thus the data scientist is a rare bird indeed, and the three complementary skills (mathematics and statistics, substantive experience, hacking skills) combined in one person are highly valuable.

[Figure: the data science Venn diagram by Drew Conway [http://goo.gl/gKkJP]. Hacking skills, math & statistics knowledge and substantive expertise overlap in data science; hacking plus math/stats alone yields machine learning, math/stats plus substance is traditional research, and hacking plus substance is the danger zone.]

Data law #7: Put data and humans together to get the most insight.

In the search for data talented people? Here are more than 75 – and still growing – job interview questions maintained by Dr. Vincent Granville: http://goo.gl/Y48ZU. See also «7 new types of jobs created by Big Data», September 2012, http://goo.gl/oE9KM.

Privacy and trust concerns

Google's statement on privacy is highlighted on the Google Flu Trends page (op. cit.) as follows: «Google Flu Trends can never be used to identify individual users because we rely on anonymized, aggregated counts of how often certain search queries occur each week. We rely on millions of search queries issued to Google over time, and the patterns we observe in the data are only meaningful across large populations of Google search users.»

Only a few years ago, we were cautioned not to put our name or birth date online, but societal norms are shifting, and we now check in online when traveling, telling everyone we are away. But even if we stay cautious, aggregation of data from innocuous datasets can be easy, and relatively low cost computing can be done by governments, criminals or our neighbors to predict our buying behavior.

The data-gathering technology raises questions about the limits of people surveillance. «You don't know what data is being collected and how it is used» says Marc Rotenberg, executive director of the Electronic Privacy Information Center, about data collected in workplaces (see page 5). The possibility of a startup scavenging the social web to build complete profiles of people, with email, name, location, interests and such, raises the question of a better international regulation, balancing the rights to privacy vs security vs commodity.

And this is an urgent matter, when one considers the impact of drones, a question debated for two years in the U.S., and yet largely unknown by citizens in Europe. Surveillance drones are collecting amounts of data, which can then be combined with facial recognition technology and big datasets. Airspace will soon be open for private drones, and technology makes them stealth and small, recalls the Electronic Frontier Foundation.

«What is legal? What is ethical? What will the public find acceptable? In an ideal world, these three considerations would lead to the same result…» Bill Franks, Chief Analytics Officer for Teradata's global alliance program

One last question, and another V in our bag. Now that we have all this data, here is the pivotal question: can it be trusted? Data collected from a drone tells I have a terrorist behavior: can it be trusted? Data collected from flu trends launches a large production of vaccines: what if this causes a financial mess, or panic, or other unexpected phenomena?

This is the essence of Veracity, that is «conformity with truth or fact» or, in short, Accuracy or Certainty. It can be undermined by any of inconsistencies, model approximations, ambiguities, deception, fraud, duplication, spam and latency. Ensuring that data is full of veracity at any time of its life cycle might be one of the biggest challenges we will face.

References and further readings

• «When Google got flu wrong», article in Nature, February 2013, http://goo.gl/Cgbt9
• V. N. Vapnik, «The Nature of Statistical Learning Theory», Springer, 2000
• «The history of Hadoop: From 4 nodes to the future of data», 2013, http://goo.gl/mh3Of
• On open data, see a brief history in the ParisTech Review, 2013, http://goo.gl/EqzPK
• «Pioneering the dataviz: Nightingale's 'Coxcombs'», in the Understanding Uncertainty blog, 2008, http://goo.gl/OAhBo
• A thoughtfully curated selection of tools that will make your life easier when creating meaningful data visualizations: http://selection.datavisualization.ch/
• «Towards A Periodic Table of Visualization Methods for Management», PDF & interactive: http://goo.gl/YMCpJ & http://goo.gl/Abk8
• Data Visualization Tools by the visual.ly website: http://goo.gl/IBEHJ
• «Process real-time big data with Twitter Storm», April 2013, IBM, http://goo.gl/uSNKQ
• «Big Data, Big Value», Les Entretiens de Télécom ParisTech, experts' opinions and 10 French startup interviews, December 2012, http://goo.gl/hLmGb

Challenges in the air

We now propose a series of 17 scientific, technical and/or societal challenges which address major issues in data management. We arranged them so as to show their connections. Other challenges do exist, but this list provides a preliminary current view, ranging from solving the big problems of the world to everyday data management for every one of us. Some challenges might fit in several sections. We first classify in the 'scientific' section the challenges which may need further research, even if in some cases applications are already on the market. Then come technical challenges which need an appropriate answer, and societal challenges which involve us more deeply. For most of them we propose an opening statement, then clear objectives or still-open questions, and possibly issues and blocking points.

Scientific challenges

Towards a better understanding of the world?

#1 Data exploration is the last scientific exploration paradigm.

Thousand years ago, science was only empirical, describing natural phenomena. Then came the theoretical branch a few hundred years ago, using models and generalizations, with Kepler's Laws, Newton's Laws of Motion and so on. The computational branch appeared during the last century, simulating complex phenomena, when the theoretical models were too complicated to solve analytically and the computers were available to do the job. Nowadays science has changed, and here is how: scientists do not look anymore directly at their instruments. They are looking at data captured by instruments or generated by simulations, before being processed and analyzed.

«This data exploration is the fourth paradigm for scientific exploration», said Jim Gray, the eminent database researcher, in January 2007, calling this transformed scientific method, where «IT meets scientists», eScience. This data-intensive science consists of three basic activities: capture, curation, and analysis. Much of the vast volume of scientific data captured by instruments on a daily basis, along with information generated via computational models, then resides forever in a live and curated state for the purposes of continued analysis.

Some say it is sufficient to understand the world. Chris Anderson, editor of Wired magazine, hypothesizes the end of hypotheses: «We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.» Relying on data is already reshaping the explanation schemes: categories, that once were a key for describing the world, are no longer used (see sidebar below).

Data law #6: Solve a real pain point.

Question: a (somehow philosophical) question still remains. To what extent are we going to change our understanding of the world, in a way we could make more errors?

The quest for raw data through the open and big data movements: will categories disappear?

Journalists, sociologists and statisticians use different techniques and methodologies to make the social world visible from the surrounding data. The former study the data to reveal hidden realities, to show the public things to which it does not have access. For their part, sociologists make visible the effects of structures. Statisticians, and especially those from government agencies, reveal correlations between statistical categories. This is a «zoom out» function to understand the world. These categories propose a system of conventions for describing the social world.

However, two current crises are changing the situation: a general crisis of representation of the world, especially through the media and politics, and a crisis with regard to categorical representations related to statistical conventions that describe the social world. Indeed, said Dominique Cardon, sociologist at Orange Labs, the categories that allowed us to describe and structure the society are falling. They do not seem to explain the world anymore.

Currently, the world of big data and open data suggests predicting user behavior through learning algorithms on raw data, so that everyone can project their own interpretation, their own agenda, their own objectives. And these two phenomena will now produce new interpretations of our societies.

«From statistics to big data: what changes in our understanding of the world?» Article in French, December 2012: http://goo.gl/97rWa

Over-reliance on historical data, for instance, can lead to mischaracterizing data and solving the wrong problems. It also risks repeating the error of treating correlation as causation.

Further reading: «The Fourth Paradigm: Data-Intensive Scientific Discovery», October 2009, Microsoft Research, http://goo.gl/qqr8B

#2 Pour a maximum of data and solve big problems

Is this the new way for solving the big problems of the world?

Objective: find quickly innovative solutions to solve big problems.

Methodology: take a problem, let's say the environmental question, and check all the useful disciplines to see what data are necessary to tackle the problem. Break large problems up into smaller problems, and collect these data.

There are numerous good reasons to choose the ecological question for such a challenge. One of them is the overgrowing source and heterogeneity of data, from low-cost sensor networks to satellite observations, produced by organizations, government agencies and people. Internet connectivity then enables data sharing across organizations and disciplines. This «ecological science driven by data» presents new computing infrastructure needs and opportunities, which are the next challenge.

Further reading: «Redefining Ecological Science Using Data» in «The Fourth Paradigm: Data-Intensive Scientific Discovery», op. cit.

#3 Build and share data-oriented infrastructures

Build the Vannevar Bush's memex.

We need not only new hardware and software architecture, but also new infrastructures dedicated to multidisciplinary research on large datasets and new insights on eScience. While Google has just announced its collaboration with NASA to use the quantum computer recently acquired by the Universities Space Research Association from the Canadian-based company D-Wave, the race is on among organizations to solve the big problems with big data.

Objectives: continue the development of storage and computing dedicated to big data, both for the scientific teams and the industrials, including SMEs. This is already the case in France within the competitive clusters and/or the Instituts de Recherche Technologique, and must be done in a European perspective. For instance, the Institut Mines-Télécom coordinates the badap project (Big (a) Data Academic Platform), a big data as a service platform for researchers and SMEs. It is based on SQL-MPP solutions, and will cope with fast data thanks to the Storm open source framework (see page 18). A second objective is to unlock the scientific data buried in books or in small labs. «Long-term data provenance as well as community access to distributed data are just some of the challenges.»

Industrials must provide real use cases with large datasets. As in the U.S., the healthcare sector can be one of the first providers of such use cases: it must be educated to big data.

Further reading: «As We May Think.» Bush, Vannevar. The Atlantic. July 1945. Starting point: http://en.wikipedia.org/wiki/As_We_May_Think

#4 Learn to apply context to the numbers

From content to context: data without context tells a misleading story.

We showed on page 10 how the Google flu trend algorithm was looking at the numbers, not at the context of the search results. How can a computer sense the context of search results, and more generally of data? The equivalent of the five human senses for computers are: date and time, geographical location, physical environment (the weather is just a website away), and topics of interest inferred from the websites the user opens, emails or created contents.

Objective: develop proofing methods to ensure that the numbers/raw data are always analyzed in a meaningful context, and remain so when additional data are collected, especially via data correlation (see challenge #10). Make the data scientists able to ask the right questions by providing them a general set of fundamental questions.

Is datascience a misnomer?

Fundamentally, Science, in the hypothetico-deductive model, proceeds by formalizing a hypothesis given a set of observations and assumptions, designing an experiment around that hypothesis, testing it and analyzing the data generated through that process to either corroborate or falsify the hypothesis. Science also refers to a body of knowledge itself. The term «datascience» must be used with caution, as it lacks the needed theory if it over-relies on data. Without theory there is no real learning and no real questions to ask. If we process massive amounts of data without human intervention and underlying theory, we will create a data cycle of collect-analyze-act which may appear valid, but is in fact a short-term view.

#5 From data to sentiments

Sense the mood in the room.

Numbers and raw data are cold assets. Data is not information, information is not knowledge, and what does matter is subjective information: sentiments, emotions, intentions and opinions. A five-year-old child can say immediately her parents' mood, making sense out of heterogeneous contexts, but it is still hard for a computer program to figure that out. Computers can perform automatic sentiment analytics from digital texts, using machine learning algorithms such as semantic analysis or support vector machines. It is harder for face recognition systems. And it is still a challenge for mood trends to be known from our digital behaviors, split across several social networks and made of small pieces of content. The so-called augmented pervasive intelligence will provide contextual applications that learn from our daily behaviors through our mobile device, and propose help to facilitate our days.
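As a minimal illustration of the support vector machine approach to sentiment analytics mentioned in challenge #5 (with scikit-learn; the tiny training corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts  = ["I love this phone", "great battery life",
          "terrible screen", "worst purchase ever"]
labels = ["pos", "pos", "neg", "neg"]

# Texts become term-weight vectors; the SVM learns a separating boundary.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Words seen during training drive the prediction for new texts.
print(model.predict(["the battery is great", "terrible phone"]))
# expected: ['pos' 'neg']
```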

#6 Make the data qualitative and meaningful

Drop the DRIP syndrome.

The Data Rich/Information Poor syndrome refers to «the problem of an abundance of data that does nothing to inform practice because it is not presented in context through the use of relevant comparisons». It is still an important problem in a big data context. In a heterogeneous data context, it is easier to use data that we know the meaning of. Both scientists and organizations need new capabilities that rely on new semantic approaches, shared vocabularies and ontologies. This is a prerequisite to the ability to master data correlation (challenge #10) between disciplines or sources of data. And before being meaningful, data must be qualitative.

Technical challenges

#7 Will “delete” become a forbidden word?

From storing important data to keeping all data.

The cost of storage has dropped dramatically, so we end up keeping all the data; but can we keep everything forever? Beyond the technical and ecological questions, the main issue still remains that of privacy and of the 'right to be digitally forgotten'. This is a long-debated regulatory question in Europe: the explanatory memorandum concerning the European legislative proposal for a General Data Protection Regulation can be found at http://goo.gl/tUOr5. This may also be a cultural question between digital natives and other people, or even between countries. A technical answer and a growing trend is the so-called «short-term social networks», launched with the promise that the users set a time limit for how long their friends can view their contents, after which they will be hidden from the friend's device and deleted from the company's servers. However, it was reported recently that Snapchat, one of the more famous short-term networks, didn't actually make contents disappear, and this raises the question of trust.

If we do not keep all data, then we cannot keep open all options for further interesting analytics, when datascientists develop new theories and models, and go back in time to test these new models. But if we collectively choose to keep every piece of data, we must ensure never to create data aggregates that compromise privacy.

#8 Anonymize for good

Data can either be useful or perfectly anonymous, but never both. Really?

This is one of the most challenging problems, and a major blocking point for many collaborations between industrials and/or researchers. It is a scientific and societal challenge too, which still remains open.

Objective: ensure that whatever happens in the future, even when adding complementary data or crossing data not originally designed for it, data that should not be associated with a person never will be. (A toy illustration of one basic anonymity test follows challenge #9.)

Metaphor: how can we tell something to a child and be sure she will not repeat the words in another context, embarrassing the whole family?

Question: is there a Heisenberg principle asserting a fundamental limit to the certainty with which certain pairs of properties of data, such as usefulness and anonymisation, can be established simultaneously?

#9 Do not neglect big data risks

Mischaracterizing data resulting in privacy violations is one of the risks pointed out by Miller, H.E. (op. cit.). The failure to meet user requirements due to the inability to process data, or the uncertainty regarding who owns customer information, are other risks affecting the end user. Addressing the wrong problem, focusing on the near domain and ignoring the real problem, or mischaracterizing data resulting in poor decisions, are among the risks facing datascientists.
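One classic, limited test of anonymisation, k-anonymity over quasi-identifiers, can be sketched with pandas (records and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "zip":  ["75013", "75013", "75013", "92100"],
    "age":  [30, 30, 30, 41],
    "diag": ["flu", "cold", "flu", "asthma"],
})

# Quasi-identifiers: attributes that, combined, could single a person
# out even without a name.
quasi_identifiers = ["zip", "age"]
k = df.groupby(quasi_identifiers).size().min()

# k = 1 means at least one person is unique on (zip, age): re-identifiable.
print(f"dataset is {k}-anonymous")   # here: 1-anonymous, not safe to release
```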

Dimensions of data quality

• Relevance: Do the data address their user's needs?
• Accuracy: Do the data reflect the underlying reality? Is the level of precision consistent with the user's demands?
• Timeliness: Are the data current, relative to user demands?
• Completeness: Does the level of completeness correspond with user demands? Data can be incomplete or even too complete!
• Coherence: How well do the data hang together? Do irrelevant details, confusing measures or ambiguous formats make them incoherent?
• Format: How are the data presented to the user? Is the context appropriate?
• Accessibility: Can the data be obtained when needed?
• Security: Are the data physically and logically secure?
• Validity: Do the data satisfy appropriate standards related to other dimensions such as accuracy, timeliness, completeness and security?
• Compatibility: Are the data compatible in format and definition with other data with which they are being used?

Source: Miller, H. (1996). The multiple dimensions of information quality, Information Systems Management, 13 (2), 79-82
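A few of these dimensions lend themselves to automated checks; here is a minimal pandas sketch (the columns and thresholds are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 2, 3, 4],
    "temp_c":    [21.5, None, 19.0, 250.0],            # None = missing reading
    "ts":        pd.to_datetime(["2013-05-01", "2013-05-01",
                                 "2013-05-01", "2012-01-01"]),
})

completeness = 1 - df["temp_c"].isna().mean()           # share of present values
validity     = df["temp_c"].between(-50, 60).mean()     # plausible temperatures
timeliness   = (df["ts"] >= "2013-01-01").mean()        # recent enough?

print(completeness, validity, timeliness)               # 0.75 0.5 0.75
```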

Big Picture of Big Data

#10 Master data correlation

How to cross data without a priori? How to find relevant datasets?

How do I use this data type, which I have never seen, with the data I use every day? One could give many more examples of questions arising in this «mashing up of knowledge» that is being made possible. One possible answer to this challenge would be the deployment of a real worldwide network of open datasets and API directories, with an artificial intelligence facilitating the crossing of any sort of data by simply testing, and even proposing new correlations by itself.

#11 Master the big data cycle

The more you use data, the more you produce data.

From the ocean of data to the storage cloud, and then to the many actors processing data and producing new data that we saw in the data landscape on pages 14-15, data is continuously circulating, as water does in the water cycle. The more we consume data, the more data we find useful for our daily usages, and the more we try to produce even more data. With the risk of producing inaccurate or redundant data. With the risk of producing data which is useless, or which could backfire on us in the future.

Issue: produce only relevant and not yet known data, and be able to find relevant datasets before adding one's own data. Produce neither too much nor too little data, in order to get results from our queries without preventing serendipity, which is a central part of the quest to produce new knowledge.

[Figure above: diagram by the Acceleration Platform, Alcatel-Lucent Bell Labs, licensed under Creative Commons CC BY 3.0 (to reuse or remix: [email protected]). Data flows from sources (open data, environmental sensing, scientific data, personal data, the social web, company data, telecom networks…) through captation and storage (volume, variety, velocity; NoSQL databases, data warehouses, intelligent storage) to analysis (automatic and semi-automatic analysis, data mining, visual analytics) and exploitation (comprehension, optimisation, creation, prediction, and ultimately valorisation and monetization). In the Alcatel-Lucent Bell Labs research facilities, data scientists, mathematicians and computer scientists invent the algorithms for the future Big Data applications.]

#12 Make the mobile phone your data assistant

Use case: you are driving your car, in a hurry towards your lunch; you don't have money left to pay for it, but you don't know it. Your mobile does know, and it will route you, on its own initiative, to the nearest cash machine. It knows your agenda, your habits, the fact that you are a bit absent-minded these days, the fact that this restaurant only accepts notes, the fact that you spent all your coins yesterday, and the traffic conditions.

Mobile phones were once «personal data assistants». It is now time to make this real: a device that could even pass the Turing tests.

#13 Enable dataviz on new devices

People understand what they can see.

Information design pioneer Edward Tufte has one primary rule: show the data. A second rule is: show comparisons. To support and encourage new powerful ways of thinking, the data must also be manipulated in their environment.

Use case: at a dinner with friends, talking about energy saving. Look at the wall and access, in a comprehensive way, a public dataset of heat loss measures in your neighborhood.

Issues: data-oriented human-machine interfaces for mobile devices, for motion sensing input devices like the Kinect, for augmented reality devices like Google glasses, and for human body interfaces like Interaxon's Muse.

#14 Invent the future of shopping

From company-centric to user-centric data management.

With our big data oriented new devices, it is possible to establish the Small Data paradigm.

Scenario: forget all these hours spent on comparison engines to replace your espresso machine that has just broken down. Scan the QR code on the machine, and broadcast an «intentcast» to the marketplace, without revealing any personal information. You will get in return only relevant offers, without being polluted by irrelevant ads, or by the fear that your personal data is in the hands of third-party vendors.

Further reading: «The Customer as a God.», essay, July 2012, Wall Street Journal. http://goo.gl/XkGyO

Societal challenges

#15 Open new ways of thinking

Both industrials and individuals must change their minds, and accept to open the data they own or produce. For industrials, in our worldwide competition, it is an illusion to protect their data assets anymore. In order to unlock their own data value, it is necessary to start thinking about how to get more value from external interactions. Carefully tuned APIs can protect their data, while at the same time allowing their datasets to become an internationally-recognized reference. With regard to the citizens, the challenge is to accept to share more personal data in order to feed the humankind database, as long as privacy is respected. These new ways of thinking must be led by visionary entrepreneurs.

#16 Democratize data management

As we saw on pages 16-17, there are thousands of machine learning methods, but textbooks are written for researchers, not practitioners. In this newly arising economy of data, everyone should be confident with data and data manipulation.

Issue 1: make people understand what data is.

Inspiration: School of Data provides courses for everyone, from data-newbie to pro looking for new ideas. It works to empower civil society organizations, journalists and citizens with the skills they need to use data effectively. http://schoolofdata.org/

Issue 2: make people want to play with data, and able to make up their own minds from raw data.

Inspiration: in his «Ten Brighter Ideas» essay, Bret Victor proposes a prototype of a reactive document, full of data collected on the web. The reader can play with the premises and assumptions of various claims about energy saving, and see the consequences update immediately. http://goo.gl/TyM8p

«Before 1786, authors invariably presented quantitative data as tables of numbers, before the economist William Playfair published a book called The Commercial and Political Atlas, full of line graphs, bar graphs, and pie charts he created specifically. Today, people take these graphical forms for granted; they seem as obvious and fundamental as written language.» Bret Victor, designer, worrydream.com

#17 Teach the future datascientists

The sexiest job of the 21st century?

The leading training institutions in datascience are largely Anglo-Saxon. Without being exhaustive, we can cite the masters programs dedicated to «machine learning» proposed by Carnegie Mellon University, Berkeley, Stanford University and the Stanford Center for Professional Development («Data Mining and Applications Graduate Certificate» training, three years in partnership with Sony and Cisco), MIT, Chicago Northwestern University (Predictive Analytics), North Carolina State University (MSc in Analytics, in partnership with SAS) or UC San Diego (certificate program in data mining).

Next Fall, Telecom ParisTech shall propose a novel Professional Master program, fully dedicated to big data. It aims at teaching the concepts and techniques required to manage and exploit massive data, in a progressive and very complete manner. It includes technical courses related to the following topics: semi-structured databases, machine learning, web technologies, decision support systems, distributed computing, computer security. Beyond the acquisition of general knowledge related to the different fields involved in big data (computer science and applied mathematics especially), the goal of this training is to develop effective skills. The theoretical content of the lessons shall be illustrated by a variety of case studies and practical applications (e.g. design of recommending systems, design of search engines in information retrieval) arising from a variety of fields, ranging from e-commerce to finance through defense/security. Societal aspects of the Big Data phenomenon, legal (privacy) and economic, shall also be investigated at length during the training.

In order to keep the program in line with the needs of the industry, a number of companies (from start-ups to big companies) in a variety of sectors (defense, Internet, finance, high-tech) shall be involved in the training. Partners include in particular Thales, BNP Paribas, Safran group, EADS, Criteo, Liligo, SAS, Capgemini, and IBM.

«User-centric data management, data democratization, new ways of thinking, numbers in context and meaningful visualization…» Our future in the new Economy of Data is full of promises, if we can avoid a «Data Divide» between those who have access and the opportunity to make effective use of data and those who do not. This could be the biggest challenge we face.

Working with the Institut Mines-Télécom

This cahier de veille was written with the help of several contributors from the schools of the Institut Mines-Télécom.

Stephan Clémençon is a teacher-researcher at Telecom ParisTech, a member of the TSI department (Image and Signal Processing), and works in the LTCI lab (Communication and Information Theory). His main research contributions are in the fields of Markov processes, nonparametric statistics and statistical machine learning. He is responsible for the Industrial Chair «Machine Learning» at Telecom ParisTech.

Alexandre Gramfort is an assistant professor at Telecom ParisTech. His research interests are in mathematical modeling and the computational aspects of brain imaging. He is more generally interested in biomedical signal and image processing, with a taste for scientific computing, numerical methods, data mining and machine learning. He is one of the contributors to the scikit-learn machine learning framework (see some of his applications at http://martinos.org/mne).

Claire Levallois-Barth teaches legal and privacy aspects at Telecom ParisTech and Telecom SudParis. Among other responsibilities, she heads the Chair «Values and Policies of Personal Information», launched by the Institut Mines-Télécom in April 2013, and is a member of the Etalab expert network.

Cécile Bothorel is a researcher at Telecom Bretagne in the LUSSI department and a member of the Social Networks Chair for eMarketing at the Institut Mines-Télécom. She works on social network analysis, focusing more particularly on the detection of implicit communities of interest on the social web. She is also involved in recommendation problems based on social network analysis.

Claude Berrou's Neucod research program at Telecom Bretagne aims to identify and exploit the strong analogies observed between the structure and properties of the cerebral cortex and those of modern error-correcting decoders. Awarded a grant of 1.9 million euros by the European Research Council, this multidisciplinary project will bring fresh insights to machine learning.

Bruno Defude's team at Telecom SudParis specializes in the field of High Performance Computing.

Additional documents are available in the partner area of the Fondation Télécom website.

Glossary

[Ant icon by Jacob Eckert, from The Noun Project.]

anonymisation: the process of treating data such that it can no longer be used to identify individuals. (A cautionary sketch follows this glossary.)
API: Application Programming Interface. A way computer programs talk to one another; it can be understood in terms of how a programmer sends instructions between programs.
attribute & share alike: a Creative Commons license that requires attributing the original source of the licensed material and allows derivative works under the same or a similar license.
commodity hardware: computer hardware that is affordable and easy to obtain.
compound visualization: a visualization combining two or more spatially distinct data representations, each operating independently but usable together to correlate information in one representation with that in another.
dark data: data whose value is locked up in a way that keeps it from being readily available for analytics; also, data accumulated and still unused.
data crunching: a marketing term for the process of collecting and cleaning data.
data munging: the process of converting or mapping data from one raw form into another format that allows more convenient consumption of the data. (A short sketch follows this glossary.)
Data Science: a recent term (see sidebar page ) with multiple definitions. Drew Conway's Data Science Venn Diagram is a good starting point (see page ).
early warning signs: patterns one wishes to detect long before their impact on an observed phenomenon.
eScience: a recent term and a new research methodology: computationally intensive science carried out in highly distributed network environments.
Etalab: a service under the French Prime Minister, in charge of the French Open Data initiative. www.etalab.gouv.fr
Hadoop: an open source software project administered by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. See details page .
information: the value added by the process of collecting and organizing data. Information needs to be converted into knowledge before it can be used, by relating it to oneself, one's experiences, environment and other contextual information.
JSON: JavaScript Object Notation. A common format to exchange data.
knowledge: the value added by the process of organizing information, plus expert opinion, skills and experience, plus a far more complex mix of creativity, serendipity, and social and cultural bonds.
lifelogging: the practice of wearing dedicated devices to capture continuous physiological data.
MapReduce: a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. See page . (A minimal single-machine sketch follows this glossary.)
open data: data that can be used, reused and redistributed freely by anyone for any purpose. Details on page .
pattern: in machine learning, a non-null finite sequence of constant and variable symbols.
quantified self: a movement to incorporate technology into data acquisition on aspects of a person's daily life in terms of inputs, states, and performance; also: self-monitoring and self-sensing. See page .
raw data: primary data that has not been subjected to processing or any other manipulation.
schema: the structure that defines the organization of data in a database system.
sentiment analysis: the application of statistical functions to comments people make online and through social networks, in order to determine their mood.
serendipity: the ability to find something good or useful while not specifically searching for it.
Small Data: a recent term with multiple definitions; it emphasizes the need for decentralized, more localized and ultimately user-centric data management.
SQL: Structured Query Language. Initially developed at IBM in the early 1970s, it is a popular language designed for managing data held in a relational database management system (RDBMS). (A short example follows this glossary.)
transactional data: data that changes unpredictably.
Turing test: a test of a machine's ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human.
XML: a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

See also:
http://schoolofdata.org/handbook/appendix/glossary/
http://data-informed.com/glossary-of-big-data-terms/
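To make the MapReduce entry concrete, here is a minimal single-machine sketch in Python of the classic word-count example. It only simulates the map, shuffle and reduce phases that a framework such as Hadoop distributes across a cluster; the function names and the sample documents are illustrative, not part of any real API.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document.
        for word in document.lower().split():
            yield (word, 1)

    def shuffle_phase(pairs):
        # Shuffle: group values by key, as the framework does between nodes.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: aggregate the values of each key into a final count.
        return {word: sum(counts) for word, counts in groups.items()}

    # Illustrative corpus; on a real cluster each document would sit on another node.
    documents = ["big data needs big infrastructure", "data is the new oil"]
    mapped = (pair for doc in documents for pair in map_phase(doc))
    print(reduce_phase(shuffle_phase(mapped)))

The division of labour is the point of the model: map_phase and reduce_phase carry all the application logic, while distribution, shuffling and fault tolerance are left to the framework.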

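The anonymisation entry deserves a caveat that a sketch makes visible. Hashing an identifier with a secret key, as below, is strictly speaking pseudonymisation rather than anonymisation: whoever holds the key, or can correlate the remaining fields, may still re-identify individuals. The record layout and the key are invented for illustration.

    import hashlib
    import hmac

    SECRET_KEY = b"keep-this-key-outside-the-dataset"  # hypothetical key

    def pseudonymise(identifier):
        # Replace a direct identifier with a keyed hash (HMAC-SHA256).
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    record = {"email": "[email protected]", "age": 34}  # illustrative record
    record["email"] = pseudonymise(record["email"])
    print(record)  # the direct identifier is gone; "age" may still correlate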
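The data munging and JSON entries can be illustrated together: a raw, inconsistently formatted line is turned into a typed record and serialized as JSON. The input format and the field names are invented for the sketch.

    import json

    # A raw record as it might arrive from a legacy export (illustrative format).
    raw_line = "2013-05-17;PARIS;  23 ; sensor_A"

    def munge(line):
        # Convert one raw line into a typed, consistently named record.
        date, city, value, sensor = (field.strip() for field in line.split(";"))
        return {"date": date,
                "city": city.title(),  # normalize casing
                "value": int(value),   # cast to the expected type
                "sensor": sensor}

    print(json.dumps(munge(raw_line)))
    # {"date": "2013-05-17", "city": "Paris", "value": 23, "sensor": "sensor_A"}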

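Finally, the SQL entry. The sketch below uses Python's built-in sqlite3 module as a stand-in for a full RDBMS; the table and its contents are invented. The point is that the query is declarative: it states what is wanted and leaves the how to the database engine.

    import sqlite3

    # In-memory database, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (author TEXT, score INTEGER)")
    conn.executemany("INSERT INTO comments VALUES (?, ?)",
                     [("alice", 4), ("bob", -2), ("alice", 1)])

    # A declarative query: the RDBMS decides how to compute the answer.
    query = "SELECT author, SUM(score) FROM comments GROUP BY author ORDER BY author"
    for author, total in conn.execute(query):
        print(author, total)  # alice 5, then bob -2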
The cahiers de veille de la Fondation Télécom are the result of studies conducted jointly by Institut Mines-Télécom professors and industry experts. Each cahier deals with a specific topic and is entrusted to researchers at the Institut, who gather recognized experts around them. Both comprehensive and concise, the cahier de veille offers a state of the art of the technology and an analysis of the market as well as its economic and legal aspects, focusing on the most critical points. It concludes with perspectives that outline possible avenues of joint work between the partners of the Fondation Télécom and the Institut Mines-Télécom teams.

With the support of Alcatel-Lucent, BNP Paribas, Google, Orange and SFR, founding partners of the Fondation Télécom, and of Accenture, Astrium Services, Cassidian Cybersecurity, CDC, Sopra Group and Streamwide.

Fondation Télécom - 46, rue Barrault - 75634 Paris cedex 13 - France
Tel.: +33 (0)1 45 81 77 77 - Fax: +33 (0)1 45 81 74 42
[email protected] - www.fondation-telecom.org
Design: Nereÿs - 06 62 12 15 25 - RCS 499 438 505