Introduction to [Big] Data Analytics Frameworks
(Pilot) Data Sciences Training for the EC
Sébastien Varrette, PhD Parallel Computing and Optimization Group (PCOG), University of Luxembourg (UL), Luxembourg
Feb. 7th and Apr. 1st, 2019, Luxembourg
Sebastien Varrette (University of Luxembourg), Introduction to [Big] Data Analytics Frameworks, 1 / 126

Short CV
https://varrette.gforge.uni.lu
Experienced Research Scientist at University of Luxembourg (UL)
↪ 15 years of experience in management & deployment of HPC systems
↪ Ph.D in Computer Science with great distinction (2007, INPG/UL)
↪ Research interests:
  - Security, performance of parallel & distributed computing platforms (HPC/Cloud/IoT)
  - 590 citations, 4 books, 80 articles in peer-reviewed journals/conferences
Deputy Head, Uni.lu HPC for Research
Management committee member representing Luxembourg: ETP4HPC, EU COST NESUS, PRACE (acting Advisor)
National / EU HPC projects involvement:
↪ National HPC and Big Data competence center (MECO)
↪ NVidia Cooperation agreement on AI and HPC (SMC)
↪ EuroHPC JU / Hosting entities for Supercomputer program
Welcome!
(Pilot) Data Sciences Training for EC
Introduction to [Big] Data Analytics
starts with the data processing facilities enabling Analytics:
↪ HPC: High Performance Computing
↪ HTC: High Throughput Computing
... continues with daily data management:
↪ ... before speaking about Big Data management
↪ in particular: data transfer (over SSH), data versioning with Git
introduction to classical Big Data analytics frameworks:
↪ Distributed File Systems (DFS), MapReduce
↪ Main processing engines (batch, stream, hybrid) ... aka Hadoop, Storm, Samza, Flink, Spark
(very brief) review of other useful Data Analytics frameworks:
↪ Python, R etc.
↪ ... and their effective usage on HPC platforms
Disclaimer: Acknowledgements
Part of these slides were courtesy borrowed w. permission from:
↪ Prof. Martin Theobald (Big Data and Data Science Research Group), UL
Part of the slides material adapted from:
↪ Advanced Analytics with Spark, O'Reilly
↪ Data Analytics with HPC courses
  - CC Attribution-NonCommercial-ShareAlike 4.0
General/hands-on material adapted from:
↪ (of course) the Uni.lu HPC Tutorials (credits: UL HPC team)
  - S. Varrette, V. Plugaru, S. Peter, H. Cartiaux, C. Parisot
  - R section: A. Ginolhac
↪ similar Github projects:
  - Jonathan Dursi: hadoop-for-hpcers-tutorial
Introduction
Summary
1. Introduction
   - HPC & BD Trends
   - Reviewing the Main HPC and BD Components
2. [Big] Data Management in HPC Environment: Overview and Challenges
   - Performance Overview in Data transfer
   - Data transfer in practice
   - Sharing Data
3. Big Data Analytics with Hadoop, Spark etc.
   - Apache Hadoop
   - Batch vs Stream vs Hybrid Processing
   - Apache Spark
4. [Brief] Overview of other useful Data Analytics frameworks
   - Python Libraries
   - R – Statistical Computing
5. Conclusion
Prerequisites: Metrics
HPC: High Performance Computing BD: Big Data
Main HPC/BD Performance Metrics
Computing Capacity: often measured in flops (or flop/s)
↪ Floating point operations per second (often in Double Precision)
↪ GFlops = 10^9 | TFlops = 10^12 | PFlops = 10^15 | EFlops = 10^18
Storage Capacity: measured in multiples of bytes (1 byte = 8 bits)
↪ GB = 10^9 bytes | TB = 10^12 | PB = 10^15 | EB = 10^18
↪ GiB = 1024^3 bytes | TiB = 1024^4 | PiB = 1024^5 | EiB = 1024^6
Transfer rate on a medium: measured in Mb/s or MB/s
Other metrics: Sequential vs Random R/W speed, IOPS ...
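As a sanity check on the decimal vs binary units above, a quick sketch (any awk will do): a disk sold as "1 TB" holds only about 0.91 TiB.

```shell
# "1 TB" (decimal, 10^12 bytes) expressed in TiB (binary, 1024^4 bytes)
awk 'BEGIN { printf "%.2f\n", 1e12 / (1024 ^ 4) }'   # -> 0.91
# A 1 GB file expressed in GiB
awk 'BEGIN { printf "%.3f\n", 1e9 / (1024 ^ 3) }'    # -> 0.931
```

The gap grows with each prefix step: about 2.4 % at kilo, 9 % at tera.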
Why HPC and BD?
HPC: High Performance Computing | BD: Big Data
Andy Grant, Head of Big Data and HPC, Atos UK&I:
"To out-compete you must out-compute. Increasing competition, heightened customer expectations and shortening product development cycles are forcing the pace of acceleration across all industries."
Essential tools for Science, Society and Industry
↪ Data-driven economy context
↪ All scientific disciplines are becoming computational today
  - require very high computing power, handle huge volumes of data
↪ Industry and SMEs increasingly rely on HPC to invent innovative solutions while reducing cost & decreasing time to market ...
HPC = global race (strategic priority); the EU takes up the challenge:
↪ PRACE / EuroHPC / IPCEI on HPC and Big Data (BD) Applications
Different HPC Needs per Domain

[Radar charts comparing domain requirements along six axes: #Cores, Flops/Core, Network Bandwidth, Network Latency, Storage Capacity, I/O Performance]

Domains illustrated: Material Science & Engineering, Biomedical Industry / Life Sciences, Deep Learning / Cognitive Computing, IoT / FinTech.
New Trends in HPC
Continued scaling of scientific, industrial & financial applications
↪ well beyond Exascale ...
New trends changing the landscape for HPC:
↪ Emergence of Big Data analytics
↪ Emergence of (Hyperscale) Cloud Computing
↪ Data-intensive Internet of Things (IoT) applications
↪ Deep learning & cognitive computing paradigms
[Source: EuroLab-4-HPC Long-Term Vision on High-Performance Computing; Editors: Theo Ungerer, Paul Carpenter. Funded by the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2. - FET Proactive)]
[Source: IDC RIKEN report, 2016 - Special Study: Analysis of the Characteristics and Development Trends of the Next-Generation of Supercomputers in Foreign Countries; Earl C. Joseph, Ph.D., Robert Sorensen, Steve Conway, Kevin Monroe. This study was carried out for RIKEN.]
Toward Modular Computing
Aiming at scalable, flexible HPC infrastructures
↪ Primary processing on CPUs and accelerators: HPC & Extreme Scale Booster modules
↪ Specialized modules for:
  - HTC & I/O intensive workloads
  - [Big] Data Analytics & AI
[Source : "Towards Modular Supercomputing: The DEEP and DEEP-ER projects", 2016]
HPC Computing Hardware
CPU (Central Processing Unit): highest software flexibility
↪ High performance across all computational domains
↪ Ex: Intel Core i9-9900K (Q4'18), Rpeak ≃ 922 GFlops (DP)
  - 8 cores @ 3.6 GHz base (14nm, 95W, ≃ 3.5 billion transistors) + integrated graphics
Intel Coffee Lake die
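The peak figure quoted above follows from Rpeak = #cores × clock × flops/cycle; matching the slide's 922 GFlops requires assuming 32 DP flops per cycle per core (an assumption about the FMA/vector width, not stated on the slide):

```shell
# Theoretical peak: cores * base clock (Hz) * DP flops per cycle, in GFlops
awk 'BEGIN { printf "%.1f GFlops\n", 8 * 3.6e9 * 32 / 1e9 }'   # -> 921.6 GFlops
```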
GPU (Graphics Processing Unit): Ideal for ML/DL workloads
↪ Ex: Nvidia Tesla V100 SXM2 (Q2'17), Rpeak ≃ 7.8 TFlops (DP)
  - 5120 cores @ 1.3 GHz (12nm, 250W, 21 billion transistors)
Accelerators: Intel MIC (Many Integrated Core), ASIC (Application-Specific Integrated Circuits), FPGA (Field Programmable Gate Array)
↪ least software flexibility, highest performance for specialized problems
  - Ex: AI, Mining, Sequencing ...
⇒ toward hybrid platforms with DL-enabled accelerators
HPC Components: Local Memory
Larger, slower and cheaper
[Diagram: CPU registers and L1/L2/L3 caches, connected via the Memory Bus to Memory (DRAM) and via the I/O Bus to disk]

Level    | Technology | Size        | Speed
register | CPU        | ~500 bytes  | sub ns
L1-cache | SRAM       | 64 KB       | 1-2 cycles
L2-cache | SRAM       | ... to 8 MB | 10 cycles
L3-cache | DRAM       | ...         | 20 cycles
Memory   | DRAM       | 1 GB        | hundreds of cycles
Disk     | HDD/SSD    | 1 TB        | tens of thousands of cycles
SSD (SATA3): R/W 550 MB/s; 100,000 IOPS; ≃ 450 €/TB
HDD (SATA3 @ 7.2 krpm): R/W 227 MB/s; 85 IOPS; ≃ 54 €/TB
HPC Components: Interconnect
latency: time to send a minimal (0 byte) message from A to B bandwidth: max amount of data communicated per unit of time
Technology           | Effective Bandwidth  | Latency
Gigabit Ethernet     | 1 Gb/s = 125 MB/s    | 40µs to 300µs
10 Gigabit Ethernet  | 10 Gb/s = 1.25 GB/s  | 4µs to 5µs
Infiniband QDR       | 40 Gb/s = 5 GB/s     | 1.29µs to 2.6µs
Infiniband EDR       | 100 Gb/s = 12.5 GB/s | 0.61µs to 1.3µs
Infiniband HDR       | 200 Gb/s = 25 GB/s   | 0.5µs to 1.1µs
100 Gigabit Ethernet | 100 Gb/s = 12.5 GB/s | 30µs
Intel Omnipath       | 100 Gb/s = 12.5 GB/s | 0.9µs
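A useful first-order model for the table above is t = latency + size/bandwidth. As a sketch, the time to move a 1 MB message over Gigabit Ethernet vs Infiniband EDR (using the table's figures and the low end of each latency range):

```shell
# t = latency + size / bandwidth, for a 1 MB message
awk 'BEGIN { printf "GigE: %.2f ms\n", (40e-6 + 1e6 / 125e6) * 1e3 }'       # -> 8.04 ms
awk 'BEGIN { printf "IB EDR: %.1f us\n", (0.61e-6 + 1e6 / 12.5e9) * 1e6 }'  # -> 80.6 us
```

For large messages bandwidth dominates; for tiny messages the latency term dominates, which is why HPC interconnects optimize both.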
[Pie chart (Source: www.top500.org, Nov. 2017): share of interconnect families across the Top500: Infiniband, Omnipath, 10G, Gigabit Ethernet, Custom and Proprietary]
Network Topologies

Direct vs. Indirect interconnect:
↪ direct: each network node attaches to at least one compute node
↪ indirect: compute nodes attached at the edge of the network only
  - many routers only connect to other routers
Main HPC Topologies
CLOS Network / Fat-Trees [Indirect]
↪ can be fully non-blocking (1:1) or blocking (x:1)
↪ typically enables best performance
  - non-blocking bandwidth, lowest network latency
Mesh or 3D-torus [Direct]
↪ blocking network, cost-effective for systems at scale
↪ great performance for applications with locality
↪ simple expansion for future growth
HPC Components: Operating System
Exclusively Linux-based (really: 100%)
Reasons:
↪ stability
↪ development flexibility
[Source : www.top500.org, Nov 2017]
Linux 100 %
HPC Components: Software Stack
Remote connection to the platform: SSH
Identity Management / SSO: LDAP, Kerberos, IPA ...
Resource management: job/batch scheduler
↪ SLURM, OAR, PBS, MOAB/Torque ...
(Automatic) Node Deployment:
↪ FAI, Kickstart, Puppet, Chef, Ansible, Kadeploy ...
(Automatic) User Software Management:
↪ EasyBuild, Environment Modules, LMod
Platform Monitoring:
↪ Nagios, Icinga, Ganglia, Foreman, Cacti, Alerta ...
[Big]Data Management
Storage architectural classes & I/O layers
[Diagram: I/O layers, from the Application through the [Distributed] File System down to the storage interface. Three architectural classes: DAS (direct attach over SATA/SAS/Fiber Channel), SAN (block access over Fiber Channel or iSCSI/Ethernet), NAS (file access over Ethernet via NFS, CIFS, AFP ..., with the file system on the NAS side)]
[Big]Data Management: Disk Encl.
≃ 120 k€ per enclosure: 48-60 disks (4U)
↪ incl. redundant (i.e. 2) RAID controllers (master/slave)
[Big]Data Management: File Systems
File System (FS)
Logical manner to store, organize, manipulate & access data
(local) Disk FS: FAT32, NTFS, HFS+, ext{3,4}, {x,z,btr}fs ...
↪ manage data on permanent storage devices
↪ poor performance: read 100-400 MB/s | write 10-200 MB/s
Networked FS: NFS, CIFS/SMB, AFP
↪ disk access from remote nodes via network access
↪ poorer performance for HPC jobs, especially parallel I/O
  - read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
  - write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)
[Source: LISA'09, Ray Paden: How to Build a Petabyte Sized Storage System]
[scale-out] NAS ... aka Appliances
↪ OneFS ...
↪ Focus on CIFS, NFS
↪ Integrated HW/SW
↪ Ex: EMC (Isilon), IBM (SONAS), DDN ...
Basic Clustered FS: GPFS
↪ File access is parallel
↪ File system overhead operations distributed and done in parallel
  - no metadata servers
↪ File clients access file data through file servers via the LAN
Multi-Component Clustered FS: Lustre, Panasas
↪ File access is parallel
↪ File system overhead operations on dedicated components
  - metadata server (Lustre) or director blades (Panasas)
↪ Multi-component architecture
↪ File clients access file data through file servers via the LAN
[Big]Data Management: FS Summary
File System (FS): logical manner to store, organize & access data
↪ (local) Disk FS: FAT32, NTFS, HFS+, ext4, {x,z,btr}fs ...
↪ Networked FS: NFS, CIFS/SMB, AFP
↪ Parallel/Distributed FS: SpectrumScale/GPFS, Lustre
  - typical FS for HPC / HTC (High Throughput Computing)
Main Characteristic of Parallel/Distributed File Systems: Capacity and Performance increase with #servers
Name          | Type                    | Read* [GB/s] | Write* [GB/s]
ext4          | Disk FS                 | 0.426        | 0.212
nfs           | Networked FS            | 0.381        | 0.090
gpfs (iris)   | Parallel/Distributed FS | 11.25        | 9.46
lustre (iris) | Parallel/Distributed FS | 12.88        | 10.07
gpfs (gaia)   | Parallel/Distributed FS | 7.74         | 6.524
lustre (gaia) | Parallel/Distributed FS | 4.5          | 2.956

* maximum random read/write, per IOZone or IOR measures, using concurrent nodes for networked FS.
HPC Components: Data Center
Definition (Data Center)
Facility to house computer systems and associated components
↪ Basic storage component: rack (height: 42 RU)
Challenges: Power (UPS, battery), Cooling, Fire protection, Security
Power/Heat dissipation per rack:
↪ HPC computing racks: 30-120 kW
↪ Storage racks: 15 kW
↪ Interconnect racks: 5 kW
Power Usage Effectiveness: PUE = Total facility power / IT equipment power
Various Cooling Technology:
↪ Airflow
↪ Direct-Liquid Cooling, Immersion ...
Software/Modules Management
https://hpc.uni.lu/users/software/
Based on Environment Modules / LMod
↪ convenient way to dynamically change the user's environment ($PATH ...)
↪ permits to easily load software through the module command
Currently on UL HPC:
↪ > 200 software packages, in multiple versions, within 18 categories
↪ reworked software set for the iris cluster, now deployed everywhere
  - RESIF v2.0, allowing [real] semantic versioning of released builds
↪ hierarchical organization, Ex: toolchain/{foss,intel}

$> module avail                 # List available modules
$> module load <category>/<software>[/<version>]
Software/Modules Management
Key module variable: $MODULEPATH (where to look for modules)
↪ altered with: module use <path>

export EASYBUILD_PREFIX=$HOME/.local/easybuild
export LOCAL_MODULES=$EASYBUILD_PREFIX/modules/all
module use $LOCAL_MODULES
Main modules commands:
Command                 | Description
module avail            | Lists all the modules which are available to be loaded
module spider <pattern> | Searches all available modules matching <pattern> (LMod)
Software/Modules Management
http://hpcugent.github.io/easybuild/
EasyBuild: open-source framework to (automatically) build scientific software
Why? "Could you please install this software on the cluster?"
↪ Scientific software is often difficult to build
  - non-standard build tools / incomplete build procedures
  - hardcoded parameters and/or poor/outdated documentation
↪ EasyBuild helps to facilitate this task
  - consistent software build and installation framework
  - includes testing step that helps validate builds
  - automatically generates LMod modulefiles
$> module use $LOCAL_MODULES
$> module load tools/EasyBuild
$> eb -S Spark                                  # Search for recipes for a given software
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr    # Dry-run install
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r     # Install, resolving dependencies
[Big] Data Management in HPC Environment: Overview and Challenges
Data Intensive Computing
Data volumes increasing massively
↪ Clusters, storage capacity increasing massively
Disk speeds are not keeping pace; seek speeds are even worse than read/write speeds
Speed Expectation on Data Transfer
http://fasterdata.es.net/
How long to transfer 1 TB of data across various speed networks?

Network  | Time
10 Mbps  | 300 hrs (12.5 days)
100 Mbps | 30 hrs
1 Gbps   | 3 hrs
10 Gbps  | 20 minutes
(Again) small I/Os really kill performance
↪ Ex: transferring 80 TB for the backup of ecosystem_biology
↪ same rack, 10 Gb/s: 4 weeks → 63 TB transferred
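The table's figures can be cross-checked from first principles (1 TB = 8×10^12 bits); the theoretical lower bounds below are slightly optimistic, since real transfers add protocol overhead:

```shell
# Hours to move 1 TB at a given line rate (bits/s), ignoring protocol overhead
for rate in 1e7 1e8 1e9 1e10; do
    awk -v r=$rate 'BEGIN { printf "%g bps: %.1f hrs\n", r, 8e12 / r / 3600 }'
done
```

E.g. the 100 Mbps case yields 22.2 hrs as a lower bound, against the 30 hrs quoted above.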
Storage Performances: GPFS
[Figure: GPFS storage performance measurements]

Storage Performances: Lustre
[Figure: Lustre storage performance measurements]
Storage Performances
Based on IOR or IOZone, reference I/O benchmarks
↪ Read tests performed in 2013

[Plot: read I/O bandwidth (MiB/s, log scale) vs. number of threads, for SHM/Bigmem, Lustre/Gaia, NFS/Gaia, SSD/Gaia and Hard Disk/Chaos]
↪ Write tests performed in 2013

[Plot: write I/O bandwidth (MiB/s, log scale) vs. number of threads, for SHM/Bigmem, Lustre/Gaia, NFS/Gaia, SSD/Gaia and Hard Disk/Chaos]
Understanding Your Storage Options
Where can I store and manipulate my data?
Shared storage:
↪ NFS: not scalable; ≃ 1.5 GB/s (R); O(100 TB)
↪ GPFS/SpectrumScale: scalable; ≃ 10-500 GB/s (R); O(10 PB)
↪ Lustre: scalable; ≃ 10-500 GB/s (R); O(10 PB)
Local storage:
↪ local file system (/tmp): O(1 TB)
  - over HDD ≃ 100 MB/s, over SSD ≃ 400 MB/s
↪ RAM (/dev/shm): ≃ 30 GB/s (R); O(100 GB)
Distributed storage:
↪ HDFS, Ceph, GlusterFS, BeeGFS ...: scalable; ≃ 1 GB/s
⇒ In all cases: small I/Os really kill storage performances
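To see how badly small I/Os hurt, compare moving 1 GiB as a single sequential stream vs as 1 KiB files, using the HDD figures quoted earlier (227 MB/s, 85 IOPS; a rough model that ignores caching):

```shell
# Sequential: one 1 GiB stream at 227 MB/s
awk 'BEGIN { printf "sequential: %.1f s\n", (2 ^ 30 / 1e6) / 227 }'     # -> 4.7 s
# Random: 1 GiB split into 2^20 files of 1 KiB, bounded by 85 IOPS
awk 'BEGIN { printf "small files: %.0f s (~%.1f hrs)\n",
             2 ^ 20 / 85, 2 ^ 20 / 85 / 3600 }'                         # -> ~3.4 hrs
```

Three orders of magnitude slower for the same payload: the per-operation seek cost, not bandwidth, dominates.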
Data Transfer in Practice
Transfer from FTP/HTTP[S]:

$> wget [-O <output_file>] <url>
$> curl [-o <output_file>] <url>

wget or (better) curl can also serve to send HTTP POST requests and support HTTP cookies (useful for the JDK download).
Data Transfer in Practice
$> scp [-P <port>] <src> <user>@<host>:<path>
$> rsync -avzu [-e 'ssh -p <port>'] <src> <user>@<host>:<path>

[Secure] transfer between two remote machines over SSH: scp or (better) rsync, which transfers only what is required. Assumes you have understood and configured SSH appropriately!
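A ~/.ssh/config entry spares you from retyping the non-standard port and user on every scp/rsync invocation (the hostname, user and port below are hypothetical examples):

```
# ~/.ssh/config -- host alias (hostname, user and port are hypothetical)
Host hpc
    Hostname access.example-cluster.lu
    Port 8022
    User jdoe
```

Afterwards, `scp file hpc:~/` and `rsync -avzu dir/ hpc:dir/` pick up these settings automatically.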
SSH: Secure Shell
Ensures a secure connection to a remote (UL) server:
- establishes an encrypted tunnel using asymmetric keys
  - public key id_rsa.pub vs. private key id_rsa (without .pub)
  - typically on a non-standard port (ex: 8022), which limits script kiddies
  - basic rule: 1 machine = 1 key pair
- the private key is SECRET: never send it to anybody
  - it can be protected with a passphrase
SSH is used as a secure backbone channel by many tools:
- remote shell, i.e. remote command line
- file transfer: rsync, scp, sftp
- versioning synchronization (svn, git), GitHub, GitLab, etc.
Authentication:
- password (disable if possible)
- (better) public key authentication
SSH: Public Key Authentication
Client (local machine) side, in the local homedir ~/.ssh/:
- id_rsa: local private key
- id_rsa.pub: local public key
- known_hosts: logs known servers
Server (remote machine) side, in the remote homedir ~/.ssh/:
- authorized_keys: knows the granted (public) keys
SSH server configuration, in /etc/ssh/:
- sshd_config: server config file
- ssh_host_rsa_key: host private key
- ssh_host_rsa_key.pub: host public key
Connection protocol:
1. Client initiates the connection
2. Server creates a random challenge and "encrypts" it using the client's public key
3. Client solves the challenge using its private key and returns the response
4. Server allows the connection iff response == challenge
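The challenge/response steps above can be sketched with textbook RSA (a toy with tiny primes, NOT real cryptography; real SSH uses full-size keys with padded, signed exchanges):

```python
import random

# Toy textbook-RSA key pair (tiny primes, for illustration only)
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (Python 3.8+)

# 2. server creates a random challenge, "encrypts" it with the public key
challenge = random.randrange(2, n)
encrypted = pow(challenge, e, n)

# 3. client solves the challenge using the private key, returns the response
response = pow(encrypted, d, n)

# 4. server allows the connection iff response == challenge
print(response == challenge)   # True
```

The private exponent d never leaves the client, which is exactly why the private key file must never be sent to anybody.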
Restrict to public key authentication: /etc/ssh/sshd_config:
PermitRootLogin no
# Enable public key auth.
RSAAuthentication yes
PubkeyAuthentication yes
# Disable passwords
PasswordAuthentication no
ChallengeResponseAuthentication no
SSH Setup on Linux / Mac OS
OpenSSH is natively supported; configuration directory: ~/.ssh/
- package openssh-client (Debian-like) or ssh (Redhat-like)
SSH key pair (public vs. private) generation: ssh-keygen
- specify a strong passphrase
  - protects your private key from being stolen, i.e. impersonation
  - drawback: the passphrase must be typed to use your key
Use ssh-agent to cache the decrypted private key in memory, so the passphrase is typed only once per session.
DSA and RSA 1024-bit keys are deprecated now!
$> ssh-keygen -t rsa -b 4096 -o -a 100 # 4096 bits RSA
(better) $> ssh-keygen -t ed25519 -o -a 100 # new sexy Ed25519
Private (identity) key: ~/.ssh/id_{rsa,ed25519}
Public key: ~/.ssh/id_{rsa,ed25519}.pub
SSH Setup on Windows
Use MobaXterm! http://mobaxterm.mobatek.net/
- [tabbed] sessions management
- X11 server w. enhanced X extensions
- graphical SFTP browser
- SSH gateway / tunnel wizards
- [remote] text editor
- ...
Summary
1 Introduction HPC & BD Trends Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges Performance Overview in Data transfer Data transfer in practice Sharing Data
3 Big Data Analytics with Hadoop, Spark etc. Apache Hadoop Batch vs Stream vs Hybrid Processing Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks Python Libraries R – Statistical Computing
5 Conclusion
Sharing Code and Data
Before doing Big Data, manage and version correctly normal data
What kinds of systems are available?
- Good: NAS, Cloud (NextCloud, Dropbox, {Google,iCloud} Drive, Figshare...)
- Better: Version Control Systems (VCS): SVN, Git and Mercurial
- Best: Version Control Systems on the Public/Private Cloud: GitHub, Bitbucket, GitLab
Which one?
- depends on the level of privacy you expect
  - ...but you probably already know these tools
- few handle GB-sized files... or only via Git LFS (Large File Storage)
Centralized VCS - CVS, SVN
[Figure: a central VCS server holds the version database (versions 1 to 3); client computers check files out from it]
Distributed VCS - Git
[Figure: in a distributed VCS, the server and each client computer (A, B) hold a full copy of the version database (versions 1 to 3)]
Everybody has the full history of commits
Tracking changes (most VCS)

Most VCS use delta storage: the first checkin stores each file, and every later checkin stores only the changes (deltas) made since:

Checkins over Time    C1    C2    C3    C4    C5
file A                A     Δ1          Δ2
file B                B                 Δ1    Δ2
file C                C     Δ1    Δ2          Δ3
Tracking changes (Git)

Git instead uses snapshot (DAG) storage: every commit records a snapshot of all files, and a file that did not change between commits is simply referenced again (e.g. A1 at C3, C2 at C4):

Checkins over Time    C1    C2    C3    C4    C5
A                     A     A1    A1    A2    A2
B                     B     B     B     B1    B2
C                     C     C1    C2    C2    C3
VCS Taxonomy
VCS can be classified along two axes: storage model (delta vs. snapshot) and repository model (local, centralized, distributed):

delta storage
- local: rcs, Mac OS File Versions
- centralized: cvs, svn (Subversion)
- distributed: hg (Mercurial)

snapshot (DAG) storage
- local: cp -r, rsync, time machine, bontmia, backupninja, duplicity
- centralized: bitkeeper
- distributed: git, bzr (Bazaar)
Git at the heart of BD http://git-scm.org
So what makes Git so useful?
(almost) Everything is local
- everything is fast
- every clone is a backup
- you work mainly offline

Ultra Fast, Efficient & Robust
- snapshots, not patches (deltas)
- cheap branching and merging
  - strong support for thousands of parallel branches
- cryptographic integrity everywhere
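The combination of snapshots and cryptographic integrity boils down to content-addressed storage: every object is stored under the hash of its content, so identical files are stored once and any corruption changes the key. A conceptual sketch (not Git's actual object format):

```python
import hashlib

store = {}                          # content-addressed object store

def put(data: bytes) -> str:
    """Store data under the hash of its content, return the key."""
    key = hashlib.sha1(data).hexdigest()   # Git uses SHA-1 (moving to SHA-256)
    store[key] = data                      # identical content -> same key
    return key

# three "snapshots" of a file: unchanged content is deduplicated for free
k1 = put(b"version 1 of file A")
k2 = put(b"version 2 of file A")
k3 = put(b"version 1 of file A")    # same content as k1: nothing new stored
print(k1 == k3, len(store))         # True 2
```

Because commit IDs hash their whole content (including parent IDs), any tampering with history changes every downstream ID, which is the "cryptographic integrity everywhere" above.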
Other Git features
Git does not delete
- objects are immutable: Git generally only adds data
- if you mess up, you can usually recover your stuff
  - recovery can be tricky though
Git Tools / Extensions
- cf. Git submodules or subtrees
- git-flow: introduces a workflow with a strict branching model
  - offers the git commands to follow the workflow
$> git flow init $> git flow feature { start, publish, finish }
Git in practice
Basic Workflow
1. Pull latest changes: git pull
2. Edit files: vim / emacs / subl ...
3. Stage the changes: git add
4. Review your changes: git status
5. Commit the changes: git commit
For cheaters: A Basicerer Workflow
1. Pull latest changes: git pull
2. Edit files: vim / emacs / subl ...
3. Stage & commit all the changes: git commit -a
Git Summary
Advice: commit early, commit often!
- commits = save points
  - use descriptive commit messages
- do not get out of sync with your collaborators
- commit the sources, not the derived files
Not covered here (for lack of time)
- does not mean you should not dig into it!
- Resources:
  - https://git-scm.com/
  - tutorial: IT/Dev[op]s Army Knives Tools for the Researcher
  - tutorial: Reproducible Research at the Cloud Era
  - Using Git-crypt to Protect Sensitive Data
Summary
1 Introduction HPC & BD Trends Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges Performance Overview in Data transfer Data transfer in practice Sharing Data
3 Big Data Analytics with Hadoop, Spark etc. Apache Hadoop Batch vs Stream vs Hybrid Processing Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks Python Libraries R – Statistical Computing
5 Conclusion
What is a Distributed File System?
Straightforward idea: separate logical from physical storage
- not all files reside on a single physical disk
- or on the same physical server
- or in the same physical rack
- or even in the same geographical location...
Distributed file system (DFS): a virtual file system that enables clients to access files
- ...as if they were stored locally
Major DFS implementations:
- NFS: originally developed by Sun Microsystems, started in 1984
- AFS/CODA: originally prototypes at Carnegie Mellon University
- GFS: Google paper published in 2003, not available outside Google
- HDFS: designed after GFS, part of Apache Hadoop since 2006
Distributed File System Architecture?
Master-Slave Pattern
- single (or few) master nodes maintain state info about clients
- all client R&W requests go through the global master node
- Ex: GFS, HDFS
Peer-to-Peer Pattern
No global state information. Each node may both serve and process data.
Google File System (GFS) (2003)
Radically different architecture compared to NFS, AFS and CODA
- specifically tailored towards large-scale and long-running analytical processing tasks over thousands of storage nodes
Basic assumptions:
- client nodes (aka chunk servers) may fail at any time! (bugs or hardware failures)
  - special tools for monitoring, periodic checks
- large files (multiple GBs or even TBs) are split into 64 MB chunks
- data modifications are mostly append operations to files
- even the master node may fail at any time!
  - additional shadow master fallback with read-only data access
Two types of reads: large sequential reads & small random reads
[Figure: GFS architecture diagram]
GFS Consistency Model
Atomic file namespace mutations
- file creations/deletions centrally controlled by the master node
- clients typically create and write an entire file, then add the file name to the file namespace stored at the master
Atomic data mutations
- only 1 atomic modification of 1 replica (!) at a time is guaranteed
Stateful master
- master sends regular heartbeat messages to the chunk servers
- master keeps chunk locations of all files (+ replicas) in memory
  - locations are not stored persistently, but polled from the clients at startup
Session semantics
- weak consistency model for file replicas and client caches only
- multiple clients may read and/or write the same file concurrently
- the client that last writes to a file wins
Fault Tolerance & Fault Detection
Fast recovery
- master & chunk servers can restore their states and (re-)start in seconds, regardless of previous termination conditions
Master replication
- shadow master provides read-only access when the primary master is down
  - switches back to read/write mode when the primary master is back
- the master node does not keep persistent state info about its clients
  - rather polls clients for their states when started
Chunk replication & integrity checks
- each chunk is divided into 64 KB blocks, each with its own 32-bit checksum
  - verified at read and write times
- higher replication factors can be configured for more intensively requested chunks (hotspots)
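The per-block integrity check is easy to picture: split a chunk into 64 KB blocks and keep a 32-bit checksum per block, recomputed on every read. A sketch of the idea (CRC32 stands in for GFS's checksum; not GFS code):

```python
import os
import zlib

BLOCK = 64 * 1024                      # 64 KB blocks inside a chunk

def checksums(chunk: bytes) -> list:
    """One 32-bit CRC per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

chunk = os.urandom(256 * 1024)         # a (small) 256 KB "chunk"
crcs = checksums(chunk)                # stored alongside the chunk

# verify at read time: recompute and compare block by block
assert checksums(chunk) == crcs        # chunk is intact

# flip one byte -> only the affected block's checksum changes
corrupted = chunk[:100] + bytes([chunk[100] ^ 0xFF]) + chunk[101:]
bad = [i for i, (a, b) in enumerate(zip(checksums(corrupted), crcs)) if a != b]
print(bad)                             # [0]
```

Localizing the damage to one 64 KB block means only that block needs to be re-fetched from another replica, not the whole 64 MB chunk.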
Map-Reduce
Breaks the processing into two main phases:
1. the map phase
2. the reduce phase
Each phase has key-value pairs as input and output
- the types of which may be chosen by the programmer
- the programmer also specifies the map and reduce functions
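The canonical example is word count; a minimal in-memory Python sketch of the two phases (real Hadoop jobs implement the same signatures in Java, with the grouping done by the framework's shuffle):

```python
from collections import defaultdict

def map_phase(line):
    """map: (offset, line) -> stream of (word, 1) pairs."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """reduce: (word, [1, 1, ...]) -> (word, total)."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# shuffle: group all values with the same key
groups = defaultdict(list)
for line in lines:
    for k, v in map_phase(line):
        groups[k].append(v)

result = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(result["the"], result["fox"])   # 3 2
```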
Hadoop
Initially started as a student project at Yahoo! labs in 2006
- open-source Java implementation of the GFS and MapReduce frameworks
- switched to Apache in 2009
Now consists of three main modules:
1. HDFS: Hadoop distributed file system
2. YARN: Hadoop job scheduling and resource allocation
3. MapReduce: Hadoop adaptation of the MapReduce principle
Basis for many other open-source Apache toolkits:
- PIG/PigLatin: file-oriented data storage & script-based query language
- HIVE: distributed SQL-style data warehouse
- HBase: distributed key-value store
- Cassandra: fault-tolerant distributed database, etc.
HDFS still mostly follows the original GFS architecture.
Hadoop Ecosystem

[Figure: the Hadoop ecosystem of tools]
Scale-Out Design
HDD streaming speed ~ 50 MB/s
- 3 TB = 17.5 hrs
- 1 PB = 8 months
Scale-out (weak scaling)
- FS distributes data on ingest
Seeking is too slow
- ~10 ms for a seek: enough time to read half a megabyte
Batch processing
- go through the entire data set in one (or a small number of) passes
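These numbers follow directly from the streaming speed; a quick sanity check (50 MB/s and the data sizes are the slide's own figures, the slide rounds slightly differently):

```python
speed = 50e6                         # HDD streaming speed: 50 MB/s
tb, pb = 3e12, 1e15                  # 3 TB and 1 PB in bytes

hours = tb / speed / 3600
months = pb / speed / 3600 / 24 / 30
print(f"3 TB: {hours:.1f} h")        # ~16.7 h, i.e. the better part of a day
print(f"1 PB: {months:.1f} months")  # ~7.7 months, i.e. ~8

# 10 ms seek time vs. what could be streamed in that time
print(f"seek cost: {speed * 0.010 / 1e6:.1f} MB")  # 0.5 MB per seek
```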
Combining Results
Each node preprocesses its local data
- shuffles its data to a small number of other nodes
- final processing and output is done there
Fault Tolerance
Data is also replicated upon ingest
- the runtime watches for dead tasks and restarts them on live nodes
- data from failed nodes is re-replicated
Data Distribution: Disk
Hadoop & al. architectures handle the hardest part of parallelism for you
- aka data distribution
On disk:
- HDFS distributes and replicates data as it comes in
- keeps track of which computations are local to the data
Data Distribution: Network
On the network: MapReduce works in terms of key-value pairs
- the preprocessing (map) phase ingests data and emits (k, v) pairs
- the shuffle phase assigns reducers
  - gets all pairs with the same key onto that reducer
- the programmer does not have to design communication patterns
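The reducer assignment in the shuffle is typically just a hash partition on the key; a sketch (Hadoop's default uses the key's hash code mod the number of reducers; a stable CRC32 stands in here, and the reducer count is arbitrary):

```python
import zlib

N_REDUCERS = 4

def partition(key):
    """All pairs with the same key land on the same reducer."""
    return zlib.crc32(key.encode()) % N_REDUCERS

pairs = [("fox", 1), ("the", 1), ("fox", 1), ("dog", 1), ("the", 1)]
buckets = {r: [] for r in range(N_REDUCERS)}
for k, v in pairs:
    buckets[partition(k)].append((k, v))   # both "fox" pairs share a bucket

print(partition("fox") == partition("fox"))  # True: same key, same reducer
```

Since the partition function is deterministic, no coordination is needed: every mapper independently computes where each pair must go.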
Hadoop Success Reason
The hardest parts of parallel programming with HPC tools are:
- decomposing the problem, and
- getting the intermediate data where it needs to go
- Hadoop does that for you! automatically, for a wide range of problems
Built a reusable substrate
- HDFS and the MapReduce layer were very well architected
  - enables many higher-level tools: data analysis, machine learning, NoSQL DBs, ...
- extremely productive environment
  - ...Hadoop 2.x (YARN) is now much more than MapReduce
The Hadoop Filesystem
HDFS is a distributed parallel filesystem
- not a general-purpose file system
  - does not implement POSIX
  - cannot just mount it and view files
Access via hdfs dfs commands or programmatic APIs
Security slowly improving
$> hdfs dfs -[cmd]
cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail test text touchz
The Hadoop Filesystem
Required to be:
- able to deal with large files and large amounts of data
- scalable & reliable in the presence of failures
- fast at reading contiguous streams of data
- only needs to write to new files or append to files
- requires only commodity hardware
As a result:
- replication
- supports mainly high bandwidth, not especially low latency
- no caching
  - what is the point, if primarily for streaming reads?
- poor support for seeking around files
- poor support for zillions of files
- have to use a separate API to see the filesystem
- modelled after the Google File System (2004 MapReduce paper)
Hadoop vs HPC
HDFS is a block-based FS
- a file is broken into blocks
- these blocks are distributed across nodes
Blocks are large:
- 64 MB is the default
- many installations use 128 MB or larger
- with a large block size, the time to stream a block is much larger than the disk time to seek to the block
# Lists all blocks in all files: $> hdfs fsck / -files -blocks
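Splitting a file into fixed-size blocks and spreading them across datanodes is simple arithmetic; a sketch (round-robin placement is an illustration only: real HDFS placement also considers racks and replicas):

```python
BLOCK = 64 * 1024 * 1024                 # HDFS default block size: 64 MB

def split_blocks(size):
    """Number of blocks for a file of `size` bytes (last block may be partial)."""
    return (size + BLOCK - 1) // BLOCK

def place(n_blocks, datanodes):
    """Naive round-robin block placement (ignores racks and replication)."""
    return {b: datanodes[b % len(datanodes)] for b in range(n_blocks)}

n = split_blocks(200 * 1024 * 1024)      # a 200 MB file: 64 + 64 + 64 + 8 MB
print(n, place(n, ["dn1", "dn2", "dn3"]))
```

With the blocks spread out, all three datanodes can stream their share of the file in parallel, which is where the scale-out bandwidth comes from.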
Datanodes and Namenode
Two types of nodes in the filesystem:
1. Namenode
- stores all metadata / block locations in memory
- metadata updates are stored to a persistent journal
2. Datanodes
- store/retrieve blocks for clients / the namenode
Newer versions of Hadoop:
- federation: different namenodes for /user, /data . . .
- High Availability namenode pairs
Writing a file
Writing a file is a multiple-stage process:
- Create file
- Get nodes for blocks
- Start writing
- Data nodes coordinate replication
- Get acks back (while writing)
- Complete
Where to Replicate?
Trade-off when choosing replication locations:
- close: faster updates, less network bandwidth
- further: better failure tolerance
Default strategy:
1 first copy on the writing node,
2 second copy on a different rack (switch),
3 third copy on the same rack as the second, but on a different node.
Strategy is configurable.
- Need to configure the Hadoop file system to know the location (rack topology) of nodes
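The default placement rule above can be sketched as a toy function (an illustration only, not HDFS's actual BlockPlacementPolicy; the node and rack names are made up):

```python
# Toy sketch of the default 3-replica placement rule:
# 1st replica on the writer's node, 2nd on a different rack,
# 3rd on that second rack but on a different node.
def place_replicas(writer, topology):
    """topology: dict mapping rack name -> list of node names."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer
    # second replica: any node on a different rack than the writer
    other_rack = next(r for r in topology if r != rack_of[writer])
    second = topology[other_rack][0]
    # third replica: same rack as the second, different node
    third = next(n for n in topology[other_rack] if n != second)
    return [first, second, third]

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", topology))  # ['n1', 'n3', 'n4']
```

The first replica keeps writes fast and local; crossing a rack for the second survives a whole switch failure; keeping the third on the same remote rack avoids paying the inter-rack bandwidth a second time.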
Reading a file
Reading a file:
- Open call
- Get block locations
- Read from a replica
Hadoop / HDFS Deployment
Target: Hadoop multi-node cluster to exploit HPC nodes
Configuring Hadoop Cluster
Templates in ${HADOOP_HOME}/etc/hadoop
- core-site.xml: set the NameNode location (the HDFS master)
  X fs.defaultFS: NameNode location
- hdfs-site.xml: set paths for HDFS
  X dfs.namenode.name.dir: namespace/transaction logs
  X dfs.datanode.data.dir: local storage of blocks
  X dfs.replication: how many times data is replicated in the cluster
- mapred-site.xml: set YARN as job scheduler
  X mapred.job.tracker: JobTracker (MapReduce master)
- conf/masters (master only): secondary NameNodes
  X the primary NameNode is where bin/start-dfs.sh is executed
- conf/slaves (master only):
  X list of Hadoop slave daemons (DataNodes and TaskTrackers)
- yarn-site.xml: configure memory allocation
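Concretely, the two HDFS-related files could look as follows (a minimal sketch: the hostname, port and paths are placeholders to adapt to your cluster):

```xml
<!-- core-site.xml : NameNode location (hostname/port are placeholders) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node-master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml : storage paths and replication factor (paths are placeholders) -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```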
Running Hadoop Cluster
1 Format the HDFS filesystem via the NameNode
- simply initializes the directory specified by dfs.name.dir
$> hadoop namenode -format
2 Start the multi-node cluster (on the NameNode)
- Start the HDFS daemons: $HADOOP_HOME/bin/start-dfs.sh
  X monitor your HDFS cluster: hdfs dfsadmin -report
- Start the MapReduce daemons: $HADOOP_HOME/bin/start-mapred.sh
- Start YARN: $HADOOP_HOME/bin/start-yarn.sh
  X monitor YARN: yarn node -list
Using HDFS
Once the file system is up and running, you can copy files back and forth . . .
$> hadoop fs -{get|put|copyFromLocal|copyToLocal} [...]
Default directory is /user/${username}; there is nothing like a cd
$> hdfs dfs -mkdir /home/vagrant/hdfs-test
$> hdfs dfs -ls /home/vagrant
$> hdfs dfs -ls /home/vagrant/hdfs-test
$> hdfs dfs -put data.dat /home/vagrant/hdfs-test
$> hdfs dfs -ls /home/vagrant/hdfs-test
Using HDFS
In general, the data files you send to HDFS will be large
- or else why bother with Hadoop?
You do not want to be constantly copying back and forth
- view and append in place
Several APIs for accessing HDFS
- Java, C++, Python
Back to Map-Reduce
Map processes one element at a time
- emits results as (key, value) pairs
All results with the same key are gathered to the same reducer
Reducers process the list of values
- emit results as (key, value) pairs
Map
All coupling is done during the shuffle phase
An embarrassingly parallel task is all map:
- take input, map it to output, done.
Famous case: the NYT using Hadoop to convert 11 million image files to PDFs
  X almost a pure serial farm job
Reduce
Reducing provides the coupling
In the case of the NYT task, it was not quite embarrassingly parallel:
  X images from multi-page articles
  X convert a page at a time,
  X gather images with the same article id onto a node for conversion
Shuffle
The shuffle is part of the Hadoop magic
By default, keys are hashed:
- the hash space is partitioned between the reducers
On the reducer:
- (k,v) pairs gathered from the mappers are sorted by key,
- then merged together by key
- the reducer then runs on one (k, [v]) tuple at a time
You can supply your own partitioner:
- assign similar keys to the same node
- the reducer still only sees one (k, [v]) tuple at a time
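The default hash partitioning can be sketched in a few lines of Python (a toy illustration of the idea, not Hadoop's actual Partitioner code):

```python
# Toy sketch of the default shuffle partitioner: each key is hashed,
# and the hash space is split across the reducers.
def default_partition(key, n_reducers):
    return hash(key) % n_reducers

# Group mapper output by destination reducer.
def shuffle(pairs, n_reducers, partition=default_partition):
    buckets = [[] for _ in range(n_reducers)]
    for k, v in pairs:
        buckets[partition(k, n_reducers)].append((k, v))
    return buckets

pairs = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1)]
buckets = shuffle(pairs, n_reducers=2)
# All pairs sharing a key always land in the same bucket,
# so a reducer sees every value for the keys it owns.
```

Supplying your own `partition` function corresponds to the custom partitioner mentioned above: you change *which* reducer owns a key, but each reducer still sees one key's values at a time.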
Example: Wordcount
Used as an example in the original MapReduce paper
- now basically the “hello world” of MapReduce
Problem description: given a set of documents, count the occurrences of words within these documents
Example: Wordcount
How would you do this with a huge document?
Each time you see a word:
  X if the word is already on your list, add a tick mark beside it,
  X otherwise add the new word with a single tick . . .
But this is hard to parallelize: the problem is updating the shared list
Example: Wordcount
The MapReduce way: all the hard work is done automatically by the shuffle
- Map: just emit a 1 for each word you see
- Shuffle:
  X assigns keys (words) to each reducer,
  X sends the (k,v) pairs to the appropriate reducer
- Reduce: just has to sum up the ones
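The whole wordcount pipeline fits in a short single-process sketch (mimicking in one process what Hadoop distributes across nodes):

```python
from collections import defaultdict

# map: emit (word, 1) for every word in a document
def mapper(document):
    for word in document.split():
        yield (word.lower(), 1)

# shuffle: gather all values for the same key together
def shuffle(pairs):
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped.items()

# reduce: sum up the ones emitted for each word
def reducer(word, counts):
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = (p for doc in documents for p in mapper(doc))
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs))
print(counts)  # {'the': 3, 'quick': 1, ...}
```

In real Hadoop the mapper runs on the node holding each input split, the shuffle moves (k,v) pairs over the network, and the reducers run in parallel; the logic is exactly this.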
Summary
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion
Hadoop 0.1x
Original Hadoop was basically HDFS + infrastructure for MapReduce
- a very faithful implementation of the Google MapReduce paper
- job tracking and orchestration all very tied to the M/R model
Made it difficult to run other sorts of jobs
YARN and Hadoop 2
YARN: Yet Another Resource Negotiator
- looks a lot more like a cluster scheduler/resource manager
- allows arbitrary jobs
Allows for new compute/data tools.
Batch vs Stream vs Hybrid Processing
Batch processing: Hadoop
- operating over a large, static dataset
- returning the result when the computation is complete
Stream processing: Storm, Samza
- compute over streamed data (as it enters the system)
- can handle a nearly unlimited amount of data
  X . . . but only processes one or very few items at a time
  X . . . with minimal state being maintained in between records
Hybrid processing: Spark, Flink
Batch processing Model (Hadoop)
Typical workflow:
- reading the dataset from the HDFS filesystem
- dividing the dataset into chunks distributed among the available nodes
- (map) applying the computation on each node to its subset of the data
  X intermediate results are written back to HDFS
- redistributing the intermediate results to group them by key
- (reduce) summarizing and combining the per-key results calculated by the individual nodes
- writing the calculated final results back to HDFS
Advantages                         Drawbacks
Can handle BIG datasets            Slow
Can run on inexpensive hardware
Stream processing Model
Processing is event-based
- and does not “end” until explicitly stopped . . .
Results are immediately available . . . and updated as new data arrives
Near real-time processing
Tied to Kafka messaging
Hybrid processing Model
Can handle both batch and stream workloads
Spark: focuses primarily on speeding up batch processing workloads
- full in-memory computation and processing optimization
Flink: for complex stream processing; similar characteristics (in-memory engine . . . ) to Spark, yet:
  X real streaming processing (Spark treats streams as micro-batches)
  X better support for cyclical and iterative processing
  X lower latency and higher throughput
  X enhanced memory management (pages out to local disk if RAM is full)
Summary
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion
Apache Spark
Next-generation batch processing framework . . . with stream processing capabilities
Built using many of the principles of Hadoop’s MapReduce engine
- everything you can do in Hadoop, you can also do in Spark
In contrast to Hadoop
The Spark computation paradigm is not just the MapReduce job
Key feature: in-memory analyses
Based on a Directed Acyclic Graph (DAG) representation
- multi-stage, in-memory dataflow graph
  X based on Resilient Distributed Datasets (RDDs).
Apache Spark
Spark is implemented in Scala, running in a Java Virtual Machine.
Spark supports different languages for application development:
  X Java, Scala, Python, R, and SQL.
Originally developed in the AMPLab (UC Berkeley) from 2009,
- donated to the Apache Software Foundation in 2013,
- top-level project as of 2014.
Latest release: 2.4.0 (Nov 2018)
Key features
Impressive speed
Simpler directed acyclic graph model
Rich set of libraries:
- machine learning, graph processing, SQL . . .
Main Building Blocks
The Spark Core API provides the general execution layer on top of which all other functionality is built.
Four higher-level components (in the Spark ecosystem):
1 Spark SQL (formerly Shark)
2 Spark Streaming, to build scalable, fault-tolerant streaming applications
3 MLlib, for machine learning
4 GraphX, the API for graphs and graph-parallel computation
Resilient Distributed Dataset (RDD)
Default fault tolerance in MapReduce: output everything to disk, always.
- effective, but extremely costly . . .
How to maintain fault tolerance without sacrificing in-memory performance for truly large-scale analyses?
Solution:
- record the lineage of an RDD (think version control)
- if a container or node goes down, reconstruct the RDD from scratch
  X either from the beginning,
  X or from (occasional) checkpoints, which the user has some control over.
- the user can suggest caching the current state of an RDD in memory,
  X or persisting it to disk, or both.
- you can also save an RDD to disk, or replicate partitions across nodes, for other forms of fault tolerance.
Resilient Distributed Dataset (RDD)
Resilient Distributed Dataset (RDD)
- partitioned collections across nodes: lists, maps . . .
- set of operations on these RDDs
  X map, filter, join, reduce
Effective fault tolerance with RDDs:
- storing and reconstructing lineage
- replication (optional)
- persistence to disk (optional)
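The lineage idea can be sketched in plain Python (a toy analogy, not Spark's RDD implementation): each dataset remembers how it was derived, so a lost result can be recomputed from its parents instead of being replicated.

```python
# Toy RDD-like object: stores its parent and the transformation
# (its "lineage") rather than materialized data.
class ToyRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Nothing is computed until an action: walk the lineage back
        # to the source, then replay each transformation.
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

rdd = ToyRDD(source=range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64] -- recomputable from the lineage at any time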
Building Spark
Trivial on HPC systems with EasyBuild
# LOCAL_MODULES=$HOME/.local/easybuild/modules/all
$> module use $LOCAL_MODULES
$> module load tools/EasyBuild
# Search for recipes for a given software
$> eb -S Spark
# Select and install one of the recipes as part of your local modules
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr   # Dry-run install
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r
Running Spark - Interactive usage
Better to reserve exclusive node(s)
Several interactive interfaces:
- pyspark, the Scala Spark shell spark-shell, the R Spark shell
$> module load devel/Spark
$> module list
Currently Loaded Modules:
  1) lang/Java/1.8.0_162   2) devel/Spark/2.4.0-Hadoop-2.7-Java-1.8
$> pyspark
[...]
Welcome to
      [Spark ASCII-art banner]   version 2.4.0
Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>>
Running Spark - Cluster mode
Spark applications run as independent sets of processes on a cluster
- coordinated by the SparkContext object in your main program
- Spark is agnostic to the underlying cluster manager
Spark currently supports several cluster managers:
- Standalone: a simple cluster manager included with Spark
- Apache Mesos / Hadoop YARN
Running Spark - Standalone Cluster
1 Create a master and the workers
- (eventually) check the web interface of the master
2 Submit a Spark application to the cluster with spark-submit
3 Let the application run and collect the result
4 Stop the cluster at the end.
Script                  Description
sbin/start-master.sh    Starts a master instance on the machine the script is executed on.
sbin/start-slaves.sh    Starts a slave instance on each machine specified in the conf/slaves file.
sbin/start-slave.sh     Starts a slave instance on the machine the script is executed on.
sbin/start-all.sh       Starts both a master and a number of slaves as described above.
sbin/stop-master.sh     Stops the master that was started via the sbin/start-master.sh script.
sbin/stop-slaves.sh     Stops all slave instances on the machines specified in the conf/slaves file.
sbin/stop-all.sh        Stops both the master and the slaves as described above.
Running Spark - Standalone Cluster
see Uni.lu HPC Tutorial: Big Data - launcher.Spark.sh
# (eventually) interactive job
$> srun -N 3 --ntasks-per-node 1 -c 28 --exclusive --pty bash -i
$> ./runs/launcher.Spark.sh -i
SLURM_JOBID        = 256624
SLURM_JOB_NODELIST = iris-[094,096,098]
[...]
====== Spark Master ======
url:    spark://iris-094:7077
Web UI: http://iris-094:8082

====== 3 Spark Workers ======
export SPARK_HOME=$EBROOTSPARK
export MASTER_URL=spark://iris-094:7077
export SPARK_DAEMON_MEMORY=4096m
export SPARK_WORKER_CORES=28
export SPARK_WORKER_MEMORY=110592m
export SPARK_EXECUTOR_MEMORY=110592m
Running Spark - Standalone Cluster
# MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
$> spark-submit \
     --master spark://${MASTER}:7077 \
     --conf spark.driver.memory=${SPARK_DAEMON_MEMORY} \
     --conf spark.executor.memory=${SPARK_EXECUTOR_MEMORY} \
     --conf spark.python.worker.memory=${SPARK_WORKER_MEMORY} \
     $SPARK_HOME/examples/src/main/python/pi.py 1000
[...]
Pi is roughly 3.141368
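The pi.py example shipped with Spark estimates π by Monte-Carlo sampling; stripped of the Spark parallelism, the underlying computation is just (a minimal single-process sketch, not Spark's actual pi.py code):

```python
import random

# Monte-Carlo estimate of pi: sample points in the unit square and
# count those falling inside the quarter circle of radius 1.
def estimate_pi(n_samples, seed=42):
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # roughly 3.14
```

Spark's version simply splits the samples across executors (the map) and sums the per-executor counts (the reduce), which is why the job scales trivially with the number of workers.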
Summary
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion
Python
Effective for fast prototype coding
- simple (now used as a reference language in schools)
- easy creation of reproducible and isolated environments
  X pip: the Python package manager
  X virtualenv: create virtual environments
$> pip install --user <package>
$> pip freeze -l > requirements.txt   # Dump python environment
$> pip install -r requirements.txt    # Restore saved environment
Virtualenv / venv
Virtualenv allows you to create several environments
- each will contain its own list of Python packages
- built-in support in Python 3 for virtual environments with venv
Best practice: create one virtual environment per project
$> cd /path/to/project
# /!\ ADAPT the name of the virtual env below ('myproject')
# Python 2 (deprecated)
$> module load lang/Python/2.7.14-intel-2018a
$> pip install --user virtualenv
$> virtualenv myproject
$> source myproject/bin/activate
# Python 3
$> module load lang/Python/3.6.4-intel-2018a
# Hint: centralize all your venvs in ~/venv/
$> python3 -m venv ~/venv/myproject
$> source ~/venv/myproject/bin/activate
Useful Python libraries / Data Sciences
Library                      Description
numpy                        Fundamental package for scientific computing
jupyter (notebook)           Create and share documents that contain live code
keras, TensorFlow, PyTorch   Machine learning frameworks
Scoop                        Scalable COncurrent Operations in Python
Pythran / Cython             Python-to-C++ compilation, C bindings
...
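As a tiny taste of the first entry in the table, numpy replaces explicit Python loops with whole-array operations (the values below are purely illustrative):

```python
import numpy as np

# numpy: vectorized (whole-array) operations run in compiled code,
# with no explicit Python loop
samples = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(samples.mean())        # 3.0
print((samples ** 2).sum())  # 55.0
```

The same vectorized style is what makes the higher-level libraries (keras, PyTorch, etc.) fast: they push the per-element work down into compiled kernels.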
Summary
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion
R – Statistical Computing
R – shorthand for GNU R
- an interactive programming language derived from S
- appeared in 1993, created by R. Ihaka and R. Gentleman
- focus on data analysis and plotting
R – also shorthand for the ecosystem around this language
- book authors
- package developers
- ordinary useRs
Why use R?
It’s free and open-source
- easy to install / maintain
- multi-platform (Windows, macOS, GNU/Linux)
- can process big files and analyse huge amounts of data (DB tools)
- integrated data visualization tools, even dynamic
- fast, and even faster with C++ integration via Rcpp
Easy to get help
- huge R community on the web
- stackoverflow, with a lot of tags like r, ggplot2 etc.
- rbloggers
YET R is hard to learn
- base R is complex, with a long history and many contributors
- prefer to learn through the tidyverse
R Packages / Daily Work
CRAN: reliable
- packages are checked during the submission process
- MRAN for Windows users
GitHub: easy install via devtools
- could be a security issue
Daily development: RStudio
# CRAN install
install.packages("ggplot2")
# GitHub install -- assumes 'install.packages("devtools")'
devtools::install_github("tidyverse/readr")
R in HPC environment
$> srun [...] -c 4 --pty bash
$> module load lang/R
$> R
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
[...]
> sessionInfo()   # information about R version, loaded libraries
The R environment itself is not parallelized
Effective use of a multi-threaded environment:
- certain workloads (mostly linear algebra) can use multiple threads
  X typically through OpenMP / the Intel Math Kernel Library (MKL)
  X cf. OMP_NUM_THREADS
Better to rely on the future package
- Unified Parallel and Distributed Processing in R for Everyone
- run expressions asynchronously
  X future expressions are run according to a plan defined by the user
For distributed multi-node runs: use Rmpi
R in HPC environment / future Package
> install.packages(c("future", "furrr", "purrr", "dplyr", "dbscan", "tictoc", "Rtsne"),
                   repos = "https://cran.rstudio.com")
> library(furrr)
### Sequential run
> plan(sequential)
> tictoc::tic()
> nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x), .progress = TRUE)
> tictoc::toc()
13.614 sec elapsed
### Parallel run
> plan(multiprocess)
> tictoc::tic()
> nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x), .progress = TRUE)
> tictoc::toc()
4.14 sec elapsed
Summary
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion
Conclusion
Effective data analytics requires an appropriate processing facility
- HPC: High Performance Computing
- HTC: High Throughput Computing
Before speaking about Big Data management:
- understand the performance bottlenecks
- ensure best practices for daily data management
- effective code/data sharing (Git etc.)
Classical Big Data analytics frameworks
- batch vs. streaming vs. hybrid processing engines
- Hadoop, Storm, Samza, Flink, Spark
Other useful Data Science environments: Python, R, [Matlab] etc.
- require adaptation for effective usage on HPC platforms . . .
Thank you for your attention...
Questions? http://hpc.uni.lu
Dr. Sebastien Varrette University of Luxembourg, Belval Campus: Maison du Nombre, 4th floor 2, avenue de l’Université L-4365 Esch-sur-Alzette mail: [email protected]
High Performance Computing @ Uni.lu Prof. P. Bouvry, Dr. S. Varrette V. Plugaru, S. Peter, H. Cartiaux, C. Parisot
1 Introduction
   HPC & BD Trends
   Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
   Performance Overview in Data transfer
   Data transfer in practice
   Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
   Apache Hadoop
   Batch vs Stream vs Hybrid Processing
   Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
   Python Libraries
   R – Statistical Computing
5 Conclusion