NSF14-43054, start October 1, 2014
Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)

Overview by Gregor von Laszewski, April 6, 2016
http://spidal.org/
http://news.indiana.edu/releases/iu/2014/10/big-data-dibbs-grant.shtml
http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443054

Some Important Components of SPIDAL DIBBs
• NIST Big Data Application Analysis: features of data intensive apps
• HPC-ABDS: HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack
  – Reservoir of software subsystems, nearly all from outside the project and a mix of HPC and Big Data communities
  – Leads to Big Data - Simulation - HPC Convergence
• MIDAS: Integrating Middleware, from the project
• Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for
  – Domain specific data analytics libraries, mainly from the project
  – Add Core Machine learning Libraries, mainly from the community
  – Performance of Java and MIDAS Inter- and Intra-node
• Benchmarks: project adds to community; see WBDB 2015, Seventh Workshop on Big Data Benchmarking

64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications

[Figure: "Convergence Diamonds" facet diagram with four views. Facets are marked D (Data), M (Model), or both. Labels recoverable from the figure:]
• Data Source and Style View (Nearly all Data): 1D SQL/NoSQL/NewSQL; 2D Enterprise Data Model; 3D Files/Objects; 4D HDFS/Lustre/GPFS; 5D Archived/Batched/Streaming (S1, S2, S3, S4, S5); 6D Shared / Dedicated / Transient / Permanent; 7D Metadata/Provenance; 8D Internet of Things; 9 HPC Simulations; 10D Geospatial Information System
• Processing View (All Model), with separate Big Data Processing Diamonds and Simulation (Exascale) Processing Diamonds: Micro-benchmarks; Local and Global (Analytics/Informatics/Simulations); Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels/Many subclasses; Graph Algorithms; Visualization; Core Libraries; N-body Methods; Multiscale Method; Iterative PDE Solvers; Spectral Methods; Particles and Fields; Evolution of Discrete Systems; Nature of mesh if used
• Execution View (Mix of Data and Model): Performance Metrics; Flops per Byte/Memory IO/Flops per watt; Execution Environment; Core libraries; Volume; Velocity; Variety; Veracity; Size; Abstraction; Communication Structure; Dynamic / Static; Regular / Irregular; Iterative / Simple; Metric / Non-Metric
• Problem Architecture View (Nearly all Data+Model): 1 Pleasingly Parallel; 2 Classic MapReduce; 3 Map-Collective; 4 Map Point-to-Point; 5 Map Streaming; 7 Single Program Multiple Data; 8 Bulk Synchronous Parallel; 9 Fusion; 10 Dataflow; 11M Agents; 12 Workflow
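The difference between two of the Problem Architecture View facets named in the figure, Classic MapReduce and Map-Collective, can be illustrated with a small sketch. This is not SPIDAL or Harp code: the function names (`classic_mapreduce`, `kmeans_map`, `allreduce`, `map_collective_kmeans`) are hypothetical, the "collective" is a plain in-process sum standing in for an MPI-style allreduce, and one-dimensional k-means is chosen only because clustering is a typical iterative analytics kernel.

```python
from collections import Counter
from functools import reduce


def classic_mapreduce(documents):
    """Classic MapReduce: independent maps followed by a single reduce."""
    mapped = [Counter(doc.split()) for doc in documents]   # map phase
    return reduce(lambda a, b: a + b, mapped, Counter())   # reduce phase


def kmeans_map(partition, centers):
    """Map step: per-center [sum, count] statistics for one data partition."""
    stats = [[0.0, 0] for _ in centers]
    for x in partition:
        nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        stats[nearest][0] += x
        stats[nearest][1] += 1
    return stats


def allreduce(partials):
    """Toy collective (assumption): element-wise sum of all partial statistics."""
    combined = [[0.0, 0] for _ in partials[0]]
    for stats in partials:
        for k, (total, count) in enumerate(stats):
            combined[k][0] += total
            combined[k][1] += count
    return combined


def map_collective_kmeans(partitions, centers, iterations=5):
    """Map-Collective: every iteration maps over the partitions, then combines
    the partial results with a collective before the next iteration starts."""
    centers = list(centers)
    for _ in range(iterations):
        partials = [kmeans_map(part, centers) for part in partitions]
        combined = allreduce(partials)
        # Keep the old center when no point was assigned to it.
        centers = [s / n if n else c for (s, n), c in zip(combined, centers)]
    return centers


if __name__ == "__main__":
    print(classic_mapreduce(["a b a", "b c"]))
    print(map_collective_kmeans([[1.0, 2.0, 9.0], [10.0, 3.0], [11.0]], [0.0, 5.0]))
```

In the Classic MapReduce case the data is passed over once; in the Map-Collective case the shared model (here, the cluster centers) is updated after every collective step, which is why the cost of the communication layer matters for iterative analytics.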

Java MPI performs better than Threads: 128 24-core Haswell nodes on SPIDAL DA-MDS Code

[Figure: performance plot with two labeled curves:]
• Best MPI; inter and intra node
• Best Threads intra node, MPI inter node

Big Data and (Exascale) Simulation Convergence II
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

21 layers, over 350 Software Packages, January 29 2016

Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca

Layers
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Naiad, Oozie, Tez, FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: AppScale, Red Hat OpenShift, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM, Ninefold, Stackato, appfog, CloudBees, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere, Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter-process communication (collectives, point-to-point, publish-subscribe): MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective. Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame. Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS. Public Cloud: Azure Blob, Google
7) Interoperability: Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds. Networking: Google Cloud DNS

HPC-ABDS Mapping of Activities
• Level 17: Orchestration. Apache Beam (Google Cloud Dataflow) integrated with Cloudmesh
• Level 16: Applications. Datamining for molecular dynamics, image processing for remote sensing and pathology, graphs, streaming
• Level 16: Algorithms. MLlib, Mahout, SPIDAL
  – Release SPIDAL Clustering, Dimension Reduction (multi-dimensional scaling), Web point set visualization
• Level 14: Programming. Storm, Heron, Hadoop, Spark, Flink. Inter- and intra-node performance
• Level 13: Communication. Enhanced Storm and Hadoop using HPC runtime technologies
• Level 11: Data management. HBase and MongoDB integrated via use of Beam and other Apache tools
• Level 9: Cluster Management. Integrate Pilot Jobs with Yarn, Spark, Hadoop
• Level 6: DevOps. Python Cloudmesh virtual Cluster Interoperability
