
Advanced Databases

instructor: Peter Baumann email: [email protected] tel: -3178 office: Research 1, room 88

Where It All Started [Source: Wikipedia]

. 1890 census on 62,947,714 US population → “Big Data”

. Hollerith „tabulating machine and sorter“ • 2 years faster

. Tabulating Machine Company → International Business Machines Corporation

[Figures: Herman Hollerith in 1888; Hollerith card puncher, used by the Bureau]

[Wikipedia]

340151 Big Databases & Cloud Services (P. Baumann)

[Domo]

2012

[saubere-zaehne.de] [atlasformen.de]

Big Data

. Internet: the unprecedented information collector
• May 2012: 200m Web servers [Yahoo]
• estd 50+b static pages [Yahoo]
• 2012: 31b searches / month [Google]
• Wayback Machine: 240 billion web pages archived from 1996
• ...plus „Deep Web“

. Typical Big Data:
• Social networks - facebook, twitter, GPS, ...
• Business: Data Warehousing
• Geo: , data, ...
• Petrol industry: „more bytes than barrels“
• ...

http://www.sgi.com/go/twitter/#heatmaps

The 4th Paradigm

[Tony Hey, Stewart Tansley, Kristin Tolle (eds.)]

„Big Data“: The 4+ Vs

[Doug Laney / Gartner & IBM]

. „data too big to transport“, but also „too complex to process“

. Volume - ngEO planning: 10^12 images under ESA custody

. Velocity - NASA EOSDIS: 5 TB/d; LOFAR: 25 TB/h; phones: 1+ PB/d

. Variety - grids; point clouds; general meshes; vectors; text; graphs; ...

. Veracity - Quality, provenance, trust

. ...plus more in blogs: Value, Verisimilitude, Variability, Visualization, ...
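The velocity figures above are quoted per day and per hour, which makes them hard to compare at a glance. A minimal Python sketch, assuming decimal units (1 PB = 1,000 TB) and treating each quoted rate as a sustained average (both are assumptions, not stated on the slide), normalizes them to TB per day. The function and dictionary names are illustrative only.

```python
# Normalize the quoted data rates to TB/day for comparison.
# Assumptions (not from the slide): decimal units, rates sustained 24 h/day.
TB_PER_PB = 1_000
HOURS_PER_DAY = 24


def to_tb_per_day(amount_tb: float, per_hours: float) -> float:
    """Convert 'amount_tb every per_hours hours' into TB per day."""
    return amount_tb * HOURS_PER_DAY / per_hours


rates_tb_per_day = {
    "NASA EOSDIS": to_tb_per_day(5, 24),         # slide: 5 TB/d
    "LOFAR": to_tb_per_day(25, 1),               # slide: 25 TB/h
    "phones": to_tb_per_day(1 * TB_PER_PB, 24),  # slide: 1+ PB/d (lower bound)
}

# Print the sources from slowest to fastest.
for source, rate in sorted(rates_tb_per_day.items(), key=lambda kv: kv[1]):
    print(f"{source:12s} {rate:8.0f} TB/day")
```

Under these assumptions LOFAR's hourly rate is two orders of magnitude above the EOSDIS daily rate, which is the kind of spread the "Velocity" bullet points at.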

Data Engineering

. “Data Engineering fastest-growing tech job: 45% p.a.” [DICE report, 2020]

. Roles in Data Engineering [AnalyticsVidhya, modified]
• ETL Engineer: maintain veracity of data; ensure proper work of tools, permission, system pipelines
• Database Administrator: ensure continuous work of data generating & ingesting systems; extensive knowledge of traditional, NoSQL, Cloud databases
• Data Engineer: ingest, integrate, maintain all the data sources; working database & coding knowledge; understand business needs & long-term scalability needs
• Enterprise Data Architect: combination of DB Administrator & Data Engineer, responsible for “big picture”; knowledge of database tools, languages like Python, Scala, distributed systems, …

Big Data: a Kaleidoscope

Big Data in High Energy Physics

. CERN, Large Hadron Collider: 13 PB in 2010

[CERN]

→ Youtube

Big Data in the Sciences

. „Exaflood“: ~100s of Exabytes in 2020 • Spectral bands [ESA] • resolution: km → ~20cm

. CubeSats are coming!

[Planet.com]

Variety in Oceanography

[OGC Interoperability Experiment]

Big Data in the Life Sciences

. Neuro Sciences: Human Brain Project (EU, ~1b €), BRAIN (US) • Multi-scale models of the human brain (molecular - behavioral)

. Data aggregation & integration → cost saving, improved care • Personalised patient care • Real-time observation & agent adjustment

. Genome medicine • 23andme.com: personalised analysis of your DNA • “... is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, gender, sexual orientation, gender identity or expression, religion, national origin, marital status, age, disability, veteran status, genetic information, or any other protected status.”

Big Data in Business Intelligence [Wikipedia]

. Business data worldwide doubles every 1.2 years [estd]

. Walmart: 1+ million customer transactions / h • estd. 2.5+ PB databases = 167x US Library of Congress

. FICO Falcon CC Fraud Detection System • 2.1b CC accounts worldwide

[Figure headline: Possible Early Warning Sign for Market Crashes]

. Equifax: multi-million customers' key data compromised
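The estimate that business data worldwide doubles every 1.2 years can be turned into a growth factor for any time span. The following is a minimal Python sketch of that arithmetic, assuming the doubling period stays constant over time (an assumption, since the slide gives only the single estimate); the function name is illustrative.

```python
# Exponential growth implied by "doubles every 1.2 years".
# Assumption (not from the slide): the doubling period stays constant.
DOUBLING_PERIOD_YEARS = 1.2


def growth_factor(years: float, doubling_period: float = DOUBLING_PERIOD_YEARS) -> float:
    """Factor by which the data volume grows over the given number of years."""
    return 2 ** (years / doubling_period)


annual_factor = growth_factor(1)    # growth within one year
twelve_year_factor = growth_factor(12)  # ten doublings

print(f"per year:     x{annual_factor:.2f}")
print(f"per 12 years: x{twelve_year_factor:.0f}")
```

Read this way, the estimate corresponds to roughly 78% growth per year, or about three orders of magnitude over twelve years.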

[www.wired.com]

Big Data in Industry

. Industry 4.0: integration of production & IT [Kristen Nicole] • Optimising value chain & life cycle

. Automobiles • Typically, ~100m LoC (lines of code) • Networked with co-traffic, traffic lights, ... • 2.8 ZB in 2012, plus 2.5 PB / day [Computerwoche]

. Airplanes: • A380: 1b LoC [Airbus] • Per engine: 1 TB / 3 min • LHR → JFK = 640 TB
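The slide does not say how the 640 TB figure for a London Heathrow to New York JFK flight was derived. One reading that reproduces it, under assumptions that are not on the slide (four engines, as on the A380, and roughly 8 hours of flight time), is sketched below in Python; all names are illustrative.

```python
# Back-of-the-envelope check of "LHR -> JFK = 640 TB".
MINUTES_PER_TB_PER_ENGINE = 3  # slide: 1 TB per 3 min per engine
ENGINES = 4                    # assumption: four-engine aircraft (A380)
FLIGHT_MINUTES = 8 * 60        # assumption: about 8 hours flight time


def flight_data_tb(engines: int, minutes: float) -> float:
    """Total engine data in TB for one flight at the quoted per-engine rate."""
    return engines * minutes / MINUTES_PER_TB_PER_ENGINE


total_tb = flight_data_tb(ENGINES, FLIGHT_MINUTES)
print(f"{total_tb:.0f} TB")
```

With other engine counts or flight durations the total changes linearly, so the result should be read as an order-of-magnitude illustration rather than a measurement.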

Big Data in Social Networks

. Facebook [M. Rodriguez, Aurelius]

[Statista]

. MS Messenger: 30b chats, 240m participants [Leskovec, 2008]

. Global mobile phone traffic: 80,000 PB in 2016 [Gartner]

Big Data in Social Networks

. Social Network Analysis, Sentiment Analysis, Human Analytics: • How isolated / connected / central / important is a person? • How / where from / where to does information flow? opinions?

. Intelius.com: „live in the know“

[V. Krebs, orgnet.com]
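The question of how connected or central a person is can be made concrete with a simple graph measure. The Python sketch below computes degree centrality (the share of other people a person is directly linked to) on a tiny, made-up friendship graph. The names and edges are hypothetical, and degree centrality is only one of several measures used in social network analysis; the slide does not prescribe a particular one.

```python
from collections import defaultdict

# Hypothetical undirected friendship edges (illustrative data only).
edges = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("ann", "dan"),
    ("bob", "cat"),
    ("dan", "eve"),
]


def degree_centrality(edge_list):
    """Map each person to (number of direct contacts) / (number of other people)."""
    neighbours = defaultdict(set)
    for a, b in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {person: len(adj) / (n - 1) for person, adj in neighbours.items()}


centrality = degree_centrality(edges)
most_central = max(centrality, key=centrality.get)
least_central = min(centrality, key=centrality.get)

for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
```

In this toy graph the most connected person links to three of the four others, while the most isolated person has a single contact.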

Pizza ordering in the future

Internet of Things (IoT)

. Every thing is in the Internet • „the Internet“ knows state of physical

. Not really new • ABS, emergency stop via light sensor, RFID, ...

. New: comprehensiveness, data fusion, AI...in realtime [Shutterstock, Forbes] • T-Shirt, refrigerator, beer bottle, Fitbit, car, family, neighbours, boss, insurance, ...

. Data privacy? security? • Known issues, new dimensions

Data Scientist / Engineer

. “Data Scientist: The Sexiest Job of the 21st Century” • Harvard Business Review, 2012

. Data Scientist = Statistics + tool skills + domain expertise + communication

. Data Scientist ≠ Scientist!

[Diagram labels: Domain Scientist, Data Scientist, Computer Scientist]

Our Research: Array Databases

Spatio-Temporal Datacubes

[DKRZ]

[DKRZ]

rasdaman: Agile Datacube Analytics

= „raster data manager“: SQL + n-D arrays

. Mature, operational, on OSGeo Live www.rasdaman.org • 2.5+ PB databases, 1000x parallelization, federation

. OGC, ISO, INSPIRE datacube standards crafted by rasdaman team • Reference Implementation

A Brief History of Array Databases

Spatio-Temporal Datacubes on Virtual Globes

Agriculture

Daily Hydro Estimator

[rasdaman backend]

British Geological Survey

[BGS 2013]

PlanetServer

[rasdaman backend]

Demo

EarthServer: Datacubes At Your Fingertips

. Agile Analytics on x/y/t + x/y/z/t Earth & Planetary datacubes • C/S rigorously coverage standards • EU rasdaman + US NASA WorldWind

. 2.5+ PB; 1,000+ cloud parallelization

. Intercontinental initiative, 3+3 years: EU + US + AUS

. www.earthserver.eu, www.planetserver.eu

Parallel, Distributed Processing

1 query → 1,000+ cloud nodes [ACM SIGMOD DanaC 2014] [VLDB BOSS 2016]

    max( (A.nir - A.red) / (A.nir + A.red) )
  + avg(B.green)
  + max( (C.red + C.green + C.blue) / 3 )
  + max( (D.nir + D.red) / 2 )

Dataset A, Ex: ECMWF / UK
Dataset B
Dataset C, Ex: NCI / AUS
Dataset D, Ex: ESA OPS-SAT
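The query above combines four sub-expressions, each touching a different dataset. One way to picture how such a query could be spread over several nodes is to evaluate each sub-expression where its data lives and ship only the scalar partial results back for the final sum. The Python sketch below illustrates that idea with threads standing in for remote nodes and tiny made-up pixel lists standing in for the datasets. It is an illustration of the splitting idea under these assumptions, not a description of how rasdaman distributes queries.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical toy datasets: each is a list of pixels with band values.
A = [{"nir": 0.8, "red": 0.2}, {"nir": 0.6, "red": 0.4}]
B = [{"green": 0.3}, {"green": 0.5}]
C = [{"red": 0.3, "green": 0.6, "blue": 0.9}, {"red": 0.0, "green": 0.3, "blue": 0.0}]
D = [{"nir": 0.4, "red": 0.2}, {"nir": 1.0, "red": 0.5}]


def part_a():
    # max( (A.nir - A.red) / (A.nir + A.red) )
    return max((p["nir"] - p["red"]) / (p["nir"] + p["red"]) for p in A)


def part_b():
    # avg(B.green)
    return mean(p["green"] for p in B)


def part_c():
    # max( (C.red + C.green + C.blue) / 3 )
    return max((p["red"] + p["green"] + p["blue"]) / 3 for p in C)


def part_d():
    # max( (D.nir + D.red) / 2 )
    return max((p["nir"] + p["red"]) / 2 for p in D)


# Each sub-expression runs independently; only four scalars are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(part) for part in (part_a, part_b, part_c, part_d)]
    partials = [future.result() for future in futures]

result = sum(partials)
print(partials, result)
```

The design point is that the large arrays never leave their "node"; only one number per dataset crosses the network before the final addition.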

Standards: ISO Array SQL [SSDBM 2014]

    create table LandsatScenes(
        id: integer not null,
        acquired: date,
        scene: row( band1: integer, ..., band7: integer ) mdarray [ 0:4999, 0:4999 ]
    )

    select id,
           encode( (scene.band1 - scene.band2) / (scene.band1 + scene.band2), 'image/tiff' )
    from LandsatScenes
    where acquired between '1990-06-01' and '1990-06-30'
      and avg( (scene.band3 - scene.band4) / (scene.band3 + scene.band4) ) > 0
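To make the semantics of the select statement above concrete, the following Python sketch mimics it on tiny in-memory data: it keeps scenes acquired in June 1990 whose cell-wise normalized difference of band3 and band4 has a positive average, and returns the normalized difference of band1 and band2 for each. The toy rows, 2x2 arrays, and helper names are hypothetical, and the encoding step to image/tiff is left out. This is a plain-Python reading of the query, not the Array SQL engine's evaluation strategy.

```python
from datetime import date

# Hypothetical toy rows; real scenes would be 5000 x 5000 cells per band.
landsat_scenes = [
    {"id": 1, "acquired": date(1990, 6, 15),
     "scene": {"band1": [[4, 6], [8, 2]], "band2": [[2, 2], [4, 2]],
               "band3": [[5, 7], [9, 3]], "band4": [[1, 3], [3, 1]]}},
    {"id": 2, "acquired": date(1990, 6, 20),  # average of band3/band4 index is negative
     "scene": {"band1": [[1, 1], [1, 1]], "band2": [[1, 1], [1, 1]],
               "band3": [[1, 1], [1, 1]], "band4": [[3, 3], [3, 3]]}},
    {"id": 3, "acquired": date(1990, 7, 1),   # outside the date range
     "scene": {"band1": [[4, 4], [4, 4]], "band2": [[2, 2], [2, 2]],
               "band3": [[5, 5], [5, 5]], "band4": [[1, 1], [1, 1]]}},
]


def normalized_difference(x, y):
    """Cell-wise (x - y) / (x + y) over two equally shaped 2-D lists."""
    return [[(a - b) / (a + b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(x, y)]


def array_avg(arr):
    """Average over all cells of a 2-D list."""
    cells = [value for row in arr for value in row]
    return sum(cells) / len(cells)


def query(rows):
    """Return (id, band1/band2 index array) for rows passing both predicates."""
    result = []
    for row in rows:
        scene = row["scene"]
        if not (date(1990, 6, 1) <= row["acquired"] <= date(1990, 6, 30)):
            continue
        if array_avg(normalized_difference(scene["band3"], scene["band4"])) > 0:
            result.append((row["id"],
                           normalized_difference(scene["band1"], scene["band2"])))
    return result


selected = query(landsat_scenes)
print([scene_id for scene_id, _ in selected])
```

The point of the array extension is that the cell-wise arithmetic and the aggregation happen inside the database, so only the qualifying, already-derived images leave the server.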

Big Datacube Standards (By Us)

. Open Geospatial Consortium (OGC) : • Spatio-Temporal „Big Geo Data“ standards suite

. ISO: • TC211: Spatio-Temporal „Big Geo Data“ standards suite • SC32: SQL/MDA („Multi-Dimensional Arrays“)

. INSPIRE: • Co-shaping harmonized European Spatial Data Infrastructure

Back to the Course

Course Plot

. Databases • RDBMS recap & engine deep dive

. Database application development

. NoSQL, NewSQL, MapReduce

. OLAP

. Virtualization & Cloud

. Security

Prerequisites

. Motivation, Interest, Curiosity

. Some general CS / IT knowledge • Algorithms & data structures, object-oriented concepts, programming

. "reading without writing is daydreaming"

Resources

. "Database Systems: The Complete Book", Ullman & Garcia-Molina & Widom, Prentice Hall

. www.faculty.jacobs-university.de/pbaumann → teaching → Advanced Databases

. peer group

. mailing list course-bdcs

. TA + me

Grading

. Exam • written, @ end of semester

. Lab • Semester project: build your own Web service

Let's Roll!
