Advanced Databases
instructor: Peter Baumann
email: [email protected]
tel: -3178
office: Research 1, room 88
Advanced Databases (P. Baumann)
Where It All Started
Source: Wikipedia
. 1890 census on 62,947,714 US population: “Big Data”
. Hollerith „tabulating machine and sorter“ • 2 years faster
. Tabulating Machine Company → International Business Machines Corporation
Herman Hollerith in 1888
Hollerith card puncher, used by the United States Census Bureau
Hollerith punched card
[Wikipedia]
340151 Big Databases & Cloud Services (P. Baumann)
[Domo]
2012 [saubere-zaehne.de] [atlasformen.de]
Big Data
. Internet: the unprecedented information collector
  • May 2012: 200m Web servers [Yahoo]
  • estd 50+b static pages [Yahoo]
  • 2012: 31b searches / month [Google]
  • Wayback Machine: 240 billion web pages archived from 1996
  • ...plus “Deep Web”
. Typical Big Data:
  • Social networks - facebook, twitter, GPS, ...
  • Business: Data Warehousing
  • Geo: Satellite imagery, weather data, ...
  • Petrol industry: “more bytes than barrels”
http://www.sgi.com/go/twitter/#heatmaps
The 4th Paradigm
[Tony Hey, Stewart Tansley, Kristin Tolle (eds.)]
“Big Data”: The 4+ Vs
[Doug Laney / Gartner & IBM]
. “data too big to transport”, but also “too complex to process”
. Volume - ngEO planning: 10^12 images under ESA custody
. Velocity - NASA EOSDIS: 5 TB/d; LOFAR: 25 TB/h; phones: 1+ PB/d
. Variety - grids; point clouds; general meshes; vectors; text; graphs; ...
. Veracity - Quality, provenance, trust
. ...plus more in blogs: Value, Verisimilitude, Variability, Visualization, ...
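The velocity figures above are quoted in different units, which makes them hard to compare at a glance. The following minimal sketch converts them to TB per day. It assumes decimal prefixes (1 PB = 1,000 TB) and that LOFAR sustains its hourly rate for a full day; both are assumptions, not statements from the slide.

```python
# Convert the quoted velocity figures to a common unit (TB per day).
TB_PER_PB = 1_000      # assumption: decimal prefixes
HOURS_PER_DAY = 24

rates_tb_per_day = {
    "NASA EOSDIS": 5,               # 5 TB/d, as quoted
    "LOFAR": 25 * HOURS_PER_DAY,    # 25 TB/h, assumed sustained over 24 h
    "phones": 1 * TB_PER_PB,        # lower bound of "1+ PB/d"
}

# Print the sources ordered from slowest to fastest.
for source, rate in sorted(rates_tb_per_day.items(), key=lambda kv: kv[1]):
    print(f"{source:12s} {rate:>6,d} TB/d")
```

Under these assumptions the three sources span more than two orders of magnitude.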
Data Engineering
. “Data Engineering fastest-growing tech job: 45% p.a.” [DICE report, 2020]
. Roles in Data Engineering [AnalyticsVidhya, modified]
  • ETL Engineer
    - maintain veracity of data; ensure proper work of tools, permission, system pipelines
  • Database Administrator
    - ensure continuous work of data generating & ingesting systems
    - extensive knowledge of traditional, NoSQL, Cloud databases
  • Data Engineer
    - ingest, integrate, maintain all the data sources
    - working database & coding knowledge; understand business needs & long-term scalability needs
  • Enterprise Data Architect
    - combination of DB Administrator & Data Engineer, responsible for “big picture”
    - knowledge of database tools, languages like Python, Java, Scala, distributed systems, …
Big Data: a Kaleidoscope
Big Data in High Energy Physics
. CERN, Large Hadron Collider: 13 PB in 2010
[CERN]
[Youtube]
Big Data in the Earth Sciences
. „Exaflood“: ~100s of Exabytes in 2020 • Spectral bands [ESA] • resolution: km → ~20cm
. CubeSats are coming!
[Planet.com]
Variety in Oceanography
[OGC Ocean Interoperability Experiment]
Big Data in the Life Sciences
. Neuro Sciences: Human Brain Project (EU, ~1b €), BRAIN (US) • Multi-scale models of the human brain (molecular - behavioral)
. Data aggregation & integration: cost saving, improved care • Personalised patient care • Real-time observation & agent adjustment
. Genome medicine • 23andme.com: personalised analysis of your DNA
Big Data in Business Intelligence [Wikipedia]
. Business data worldwide 2x every 1.2 years [estd]
. Walmart: 1+ million customer transactions / h • estd. 2.5+ PB databases = 167x US Library of Congress
. FICO Falcon Credit Card Fraud Detection System • 2.1b credit card accounts worldwide
. Equifax: multi-million customers' key data compromised
Possible Early Warning Sign for Market Crashes
[www.wired.com]
Big Data in Industry
. Industry 4.0: integration of production & IT [Kristen Nicole] • Optimising value chain & life cycle
. Automobiles • Typically, ~100m lines of code • Networked with co-traffic, traffic lights, ... • 2.8 ZB in 2012, plus 2.5 PB / day [Computerwoche]
. Airplanes: • A380: 1b lines of code [Airbus] • Per engine: 1 TB / 3 min • LHR → JFK = 640 TB
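The 640 TB figure for a London Heathrow to New York JFK flight can be reproduced from the per-engine rate. The sketch below is a back-of-the-envelope check, not the source's own derivation: the A380 has four engines, and the 8-hour flight time is an assumption chosen here because it matches the quoted total.

```python
ENGINES = 4            # the A380 has four engines
TB_PER_INTERVAL = 1    # "1 TB / 3 min" per engine, as quoted
INTERVAL_MIN = 3
FLIGHT_HOURS = 8       # assumption: approximate LHR-JFK flight time

def flight_data_tb(hours, engines=ENGINES):
    """Total engine data (TB) produced during a flight of the given length."""
    intervals = hours * 60 / INTERVAL_MIN   # number of 3-minute intervals
    return engines * intervals * TB_PER_INTERVAL

print(flight_data_tb(FLIGHT_HOURS))  # 640.0
```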
Big Data in Social Networks
. Facebook [M. Rodriguez, Aurelius]
[Statista]
. MS Messenger: 30b chats, 240m participants [Leskovec, 2008]
. Global mobile phone traffic: 80,000 PB in 2016 [Gartner]
Big Data in Social Networks
. Social Network Analysis, Sentiment Analysis, Human Analytics: • How isolated / connected / central / important is a person? • How / where from / where to does information flow? opinions?
. Intelius.com: „live in the know“
[V. Krebs, orgnet.com]
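One common way to quantify "how connected / central is a person" is degree centrality: the share of other people a person is directly linked to. The slide does not name a specific measure, so this choice is an assumption, and the edge list and names below are made up for illustration. The sketch also assumes there are no self-loops.

```python
from collections import defaultdict

# Hypothetical friendship edges; names are illustrative only.
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
         ("bob", "cat"), ("eve", "dan")]

def degree_centrality(edges):
    """Map each person to (number of direct contacts) / (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    # Normalise by the maximum possible degree in a network of n people.
    return {person: len(adj) / (n - 1) for person, adj in neighbours.items()}

centrality = degree_centrality(edges)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Other measures (betweenness, closeness) address the "how does information flow" questions but need shortest-path computations, which are omitted here.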
Pizza ordering in the future
Internet of Things (IoT)
. Every thing is in the Internet • „the Internet“ knows state of physical world
. Not really new • ABS, emergency stop via light sensor, RFID, ...
. New: comprehensiveness, data fusion, AI...in realtime [Shutterstock, Forbes] • T-Shirt, refrigerator, beer bottle, Fitbit, car, family, neighbours, boss, insurance, ...
. Data privacy? security? • Known issues, new dimensions
Data Scientist / Engineer
. “Data Scientist: The Sexiest Job of the 21st Century” • Harvard Business Review, 2012
. Data Scientist = Statistics + tool skills + domain expertise + communication
. Data Scientist ≠ Computer Scientist!
[Figure labels: Domain Scientist, Data Scientist, Computer Scientist]
Our Research: Array Databases
Spatio-Temporal Datacubes
[DKRZ]
[DKRZ]
rasdaman: Agile Datacube Analytics
= „raster data manager“: SQL + n-D arrays
. Mature, operational, on OSGeo Live www.rasdaman.org • 2.5+ PB databases, 1000x parallelization, federation
. OGC, ISO, INSPIRE datacube standards crafted by rasdaman team • Reference Implementation
A Brief History of Array Databases
Spatio-Temporal Datacubes on Virtual Globes
Agriculture
Daily Hydro Estimator
[rasdaman backend]
British Geological Survey
[BGS 2013]
PlanetServer
[rasdaman backend]
Demo
EarthServer: Datacubes At Your Fingertips
. Agile Analytics on x/y/t + x/y/z/t Earth & Planetary datacubes • C/S APIs rigorously based on coverage standards • EU rasdaman + US NASA WorldWind
. 2.5+ PB; 1,000+ cloud parallelization
. Intercontinental initiative, 3+3 years: EU + US + AUS
. www.earthserver.eu, www.planetserver.eu
Parallel, Distributed Processing
  max( (A.nir - A.red) / (A.nir + A.red) )
+ avg( B.green )
+ max( (C.red + C.green + C.blue) / 3 )
+ max( (D.nir + D.red) / 2 )
. 1 query, 1,000+ cloud nodes [ACM SIGMOD DanaC 2014] [VLDB BOSS 2016]
. Dataset A, Ex: ECMWF / UK
. Dataset B
. Dataset C, Ex: NCI / AUS
. Dataset D
. Ex: ESA OPS-SAT
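The query above is a sum of four terms, each touching a single dataset, so every term can be evaluated where its data lives and only the four scalar results need to be combined. The sketch below illustrates that split-and-combine idea only; it is not how rasdaman distributes queries. Threads stand in for remote nodes, and the datasets are tiny made-up lists of pixel values.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Made-up datasets: each band is a flat list of pixel values.
A = {"nir": [6, 3], "red": [2, 3]}
B = {"green": [1, 3]}
C = {"red": [30, 60], "green": [60, 90], "blue": [90, 120]}
D = {"nir": [4, 8], "red": [2, 4]}

def part_a():
    # max( (A.nir - A.red) / (A.nir + A.red) )
    return max((n - r) / (n + r) for n, r in zip(A["nir"], A["red"]))

def part_b():
    # avg( B.green )
    return mean(B["green"])

def part_c():
    # max( (C.red + C.green + C.blue) / 3 )
    return max((r + g + b) / 3 for r, g, b in zip(C["red"], C["green"], C["blue"]))

def part_d():
    # max( (D.nir + D.red) / 2 )
    return max((n + r) / 2 for n, r in zip(D["nir"], D["red"]))

# Each subquery runs in its own worker; only scalars travel back.
subqueries = [part_a, part_b, part_c, part_d]
with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
    partials = list(pool.map(lambda f: f(), subqueries))

total = sum(partials)
print(partials, total)
```

The design point is that the aggregates (`max`, `avg`) collapse each array to one number before anything crosses the network, which is what keeps "data too big to transport" in place.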
Standards: ISO Array SQL [SSDBM 2014]
create table LandsatScenes(
    id: integer not null,
    acquired: date,
    scene: row( band1: integer, ..., band7: integer ) mdarray [ 0:4999, 0:4999 ]
)
select id,
       encode( (scene.band1 - scene.band2) / (scene.band1 + scene.band2), 'image/tiff' )
from LandsatScenes
where acquired between '1990-06-01' and '1990-06-30'
  and avg( (scene.band3 - scene.band4) / (scene.band3 + scene.band4) ) > 0
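To make the semantics of the query above concrete, here is a plain-Python sketch of what it computes: cell-wise arithmetic on the band arrays, an `avg` condensing an array to one number in the `where` clause, and one result array per qualifying row. The 2x2 scenes and their values are made up (the table declares 5000x5000 arrays), the TIFF `encode` step is left out, and band sums are assumed to be non-zero.

```python
from datetime import date

# Hypothetical miniature scenes; all band values are illustrative.
scenes = [
    {"id": 1, "acquired": date(1990, 6, 10),
     "band1": [[6, 8], [4, 6]], "band2": [[2, 2], [4, 2]],
     "band3": [[5, 7], [9, 6]], "band4": [[3, 3], [3, 2]]},
    {"id": 2, "acquired": date(1990, 6, 20),
     "band1": [[6, 8], [4, 6]], "band2": [[2, 2], [4, 2]],
     "band3": [[1, 1], [1, 1]], "band4": [[3, 3], [3, 3]]},
    {"id": 3, "acquired": date(1990, 7, 5),
     "band1": [[6, 8], [4, 6]], "band2": [[2, 2], [4, 2]],
     "band3": [[5, 7], [9, 6]], "band4": [[3, 3], [3, 2]]},
]

def normalized_difference(x, y):
    """Cell-wise (x - y) / (x + y) on two equally shaped 2-D arrays."""
    return [[(a - b) / (a + b) for a, b in zip(rx, ry)]
            for rx, ry in zip(x, y)]

def array_avg(cells):
    """Average over all cells of a 2-D array."""
    flat = [v for row in cells for v in row]
    return sum(flat) / len(flat)

def query(scenes):
    results = []
    for s in scenes:
        in_june = date(1990, 6, 1) <= s["acquired"] <= date(1990, 6, 30)
        # where-clause: date range AND array-valued condition
        if in_june and array_avg(normalized_difference(s["band3"], s["band4"])) > 0:
            # select-clause: id plus the derived array
            results.append((s["id"], normalized_difference(s["band1"], s["band2"])))
    return results

for scene_id, image in query(scenes):
    print(scene_id, image)
```

Only scene 1 qualifies: scene 2 fails the `avg(...) > 0` condition and scene 3 lies outside June 1990.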
Big Datacube Standards (By Us)
. Open Geospatial Consortium (OGC) : • Spatio-Temporal „Big Geo Data“ standards suite
. ISO: • TC211: Spatio-Temporal „Big Geo Data“ standards suite • SC32: SQL/MDA („Multi-Dimensional Arrays“)
. INSPIRE: • Co-shaping harmonized European Spatial Data Infrastructure
Back to the Course
Course Plot
. Databases • RDBMS recap & engine deep dive
. Database application development
. NoSQL, NewSQL, MapReduce
. OLAP
. Virtualization & Cloud
. Security
Prerequisites
. Motivation, Interest, Curiosity
. Some general CS / IT knowledge • Algorithms & data structures, object-oriented concepts, programming
. "reading without writing is daydreaming"
Resources
. "Database Systems: The Complete Book", Garcia-Molina & Ullman & Widom, Prentice Hall
. www.faculty.jacobs-university.de/pbaumann → teaching → Advanced Databases
. peer group
. mailing list course-bdcs
. TA + me
Grading
. Exam • written, @ end of semester
. Lab • Semester project: build your own Web service
Let's Roll!