Indiana University

NSF14-43054 started October 1, 2014 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science • Indiana University (Fox, Qiu, Crandall, von Laszewski) • Rutgers (Jha) • Virginia Tech (Marathe) • Kansas (Paden) • Stony Brook (Wang) • Arizona State (Beckstein) • Utah (Cheatham) Overview by Geoffrey Fox, May 16, 2016 http://spidal.org/ http://news.indiana.edu/releases/iu/2014/10/big-data-dibbs-grant.shtml http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443054 1 Some Important Components of SPIDAL Dibbs • NIST Big Data Application Analysis – features of data intensive Applications. • HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. – This is a reservoir of software subsystems – nearly all from outside the project and being a mix of HPC and Big Data communities. – Leads to Big Data – Simulation – HPC Convergence. • MIDAS: Integrating Middleware – from project. • Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. • SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for: – Domain specific data analytics libraries – mainly from project. – Add Core Machine learning libraries – mainly from community. – Performance of Java and MIDAS Inter- and Intra-node. • Benchmarks – project adds to communityWBDB2015 Benchmarking Workshop • Implementations: XSEDE and Blue Waters as well as clouds (OpenStack, Docker) 2 Big Data - Big Simulation (Exascale) Convergence • Discuss Data and Model together as built around problems which combine them, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence” • Big Data implies Data is large but Model varies – e.g. LDA with many topics or deep learning has large model – Clustering or Dimension reduction can be quite small • Simulations can also be considered as Data and Model – Model is solving particle dynamics or partial differential equations – Data could be small when just boundary conditions – Data large with data assimilation (weather forecasting) or when data visualizations are produced by simulation • Data often static between iterations (unless streaming); Model varies between iterations • Take 51 NIST and other use cases derive multiple specific features • Generalize and systematize with features termed “facets” • 50 Facets (Big Data) or 64 Facets (Big Simulation and Data) divided into 4 sets or views where each view has “similar” facets – Allows one to study coverage of benchmark sets and architectures 3 64 Features in 4 views for Unified Classification of Big Data and Simulation Applications Both 10D Geospatial Information System 9 HPC Simulations Data Source and Style View Internet of Things Simulations Analytics 8D (Nearly all Data) (Model for Data) 7D Metadata/Provenance 6D Shared / Dedicated / Transient / Permanent 5D Archived/Batched/Streaming – S1, S2, S3, S4, S5 4D HDFS/Lustre/GPFS 3D Files/Objects Kernels/Many subclasses 2D Enterprise Data Model 1D SQL/NoSQL/NewSQL benchmarks - (Analytics/Informatics/Simulations) Convergence body Methods D M D D M M D M D M M D M D M M - Multiscale Method Multiscale Iterative PDE Solvers Recommender Engine Statistics Base Data Local Micro N Classification Data Algebra Linear Particles and Fields and Particles Learning Graph Algorithms Nature of mesh ifused Spectral Methods Alignment Data Algorithms Data Streaming Data Search/Query/Index Core Libraries (Analytics/Informatics/Simulations) Global Evolution of DiscreteSystems Methodology Optimization Visualization Diamonds 1 2 3 4 4 5 6 6 7 8 9 9 10 10 11 12 12 13 13 14 Data 푂 Data Dynamic Dynamic Execution Model Performance Flops Data Model Data Data Model Veracity Communication Regular Regular Data 22 21 20 19 18 17 16 11 10 9 8 7 6 5 4 15 14 13 12 3 2 1 Views and Iterative 푁 2 Metric Metric M M M M M M M M M M M M M M M M M M M M M M Volume Velocity Variety Abstraction Facets per Abstraction Size Variety = = = / NN = = Byte/Memory Simulation (Exascale) Big Data Processing R R Environment 1 Simple Pleasingly Parallel D D / / = = / Processing Diamonds Diamonds / / Metrics 2 Irregular Irregular M Classic MapReduce M 푂 Static Static ( Structure / / Processing View Map-Collective 3 푁 Non Non Map Point-to-Point ) = = (All Model) 4 = - - S S = = N ; Metric Metric I I Core Map Streaming 5 IO/Flops Model Data Shared Memory 6 libraries = Single Program Multiple Data = N 7 N Bulk Synchronous Parallel per 8 watt Fusion 9 Execution View Dataflow 10 (Mix of Data and Model) Agents 11M Problem Architecture View Workflow 12 (Nearly all Data+Model) 4 6 Forms of MapReduce Cover “all” circumstances Describes - Problem (Model reflecting data) - Machine - Software Architecture 5 HPC-ABDS Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies Cross- 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Cutting Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Functions Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, 1) Message Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, and Data H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, Protocols: CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK Avro, Thrift, 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Protobuf Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, 2) Distributed Agave, Atmosphere Coordination 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, : Google Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird Chubby, 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Zookeeper, Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine Giraffe, 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, JGroups Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 3) Security & 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, Privacy: ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon InCommon, SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs Eduroam 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, OpenStack H-Store Keystone, 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC LDAP, Sentry, 12) Extraction Tools: UIMA, Tika Sqrrl, OpenID, SAML OAuth 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 4) Monitoring: 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Ambari, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Ganglia, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Nagios, Inca Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 21 layers 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Over 350 Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Software Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage Packages 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, January Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 29 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox,asdf OpenVZ, LXC, Linux-Vserver,

Indiana University

Integration of CUDA Processing Within the C++ Library for Parallelism and Concurrency (HPX)

Apache Flink™: Stream and Batch Processing in a Single Engine

Bench - Benchmarking the State-Of- The-Art Task Execution Frameworks of Many- Task Computing

HPX – a Task Based Programming Model in a Global Address Space

Communicate the Future PROCEEDINGS HYATT REGENCY ORLANDO 20–23 MAY | ORLANDO, FL and Summit.Stc.Org #STC18

Evaluation of SPARQL Queries on Apache Flink

Regeldokument

Efficient Support for Data-Intensive Scientific Workflows on Geo

Apache Taverna

Trifacta Data Preparation for Amazon Redshift and S3 Must Be Deployed Into an Existing Virtual Private Cloud (VPC)

Exascale Computing Project -- Software

Portable Stateful Big Data Processing in Apache Beam