CIF21 Dibbs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Total Page:16
File Type:pdf, Size:1020Kb
NSF14-43054 start October 1, 2014 Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science • Indiana University (Fox, Qiu, Crandall, von Laszewski), • Rutgers (Jha) • Virginia Tech (Marathe) • Kansas (Paden) • Stony Brook (Wang) • Arizona State(Beckstein) • Utah(Cheatham) Overview by Gregor von Laszewski April 6 2016 http://spidal.org/ http://news.indiana.edu/releases/iu/2014/10/big-data-dibbs-grant.shtml http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443054 04/6/2016 1 Some Important Components of SPIDAL Dibbs • NIST Big Data Application Analysis: features of data intensive apps • HPC-ABDS: Cloud-HPC interoperable software with performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. – Reservoir of software subsystems – nearly all from outside project and mix of HPC and Big Data communities – Leads to Big Data – Simulation - HPC Convergence • MIDAS: Integrating Middleware – from project • Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. • SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for – Domain specific data analytics libraries – mainly from project – Add Core Machine learning Libraries – mainly from community – Performance of Java and MIDAS Inter- and Intra-node • Benchmarks – project adds to community; See WBDB 2015 Seventh Workshop on Big Data Benchmarking 04/6/2016 2 64 Features in 4 views for Unified Classifica7on of Big Data and Simula7on Applica7ons Both 10D Geospatial Information System Analytics 9 HPC Simulations Data Source and Style View (Model for 8D Internet of Things Simulations Data) 7D Metadata/Provenance (Nearly all Data) 6D Shared / Dedicated / Transient / Permanent 5D Archived/Batched/Streaming – S1, S2, S3, S4, S5 4D HDFS/Lustre/GPFS 3D Files/Objects Kernels/Many subclasses 2D Enterprise Data Model 1D SQL/NoSQL/NewSQL benchmarks (Analytics/Informatics/Simulations) - Convergence D M D D M M D M D M M D M D M M body Methods - Multiscale Method Iterative PDE Solvers Engine Recommender Base Data Statistics Local Micro N Data Classification Linear Algebra Particles and Fields Learning Graph Algorithms Nature of mesh if used Spectral Methods Data Alignment Streaming Data Algorithms Data Search/Query/Index Core Libraries Global (Analytics/Informatics/Simulations) Evolution of Discrete Systems Optimization Methodology Visualization Diamonds 1 2 3 4 4 5 6 6 7 8 9 9 10 10 11 12 12 13 13 14 Data Data � Execution Dynamic Dynamic Model Performance Flops Data Model Data Data Model Veracity Communication Regular Regular Data 22 21 20 19 18 17 16 11 10 9 8 7 6 5 4 15 14 13 12 3 2 1 Views and Iterative � # Metric Metric M M M M M M M M M M M M M M M M M M M M M M Volume Velocity Variety Abstraction Facets per Abstraction Size Variety = = = / NN = = Byte/Memory Simulation (Exascale) Big Data Processing R R Environment Pleasingly Parallel 1 Simple D D / / = = Processing Diamonds Diamonds / / / 2 Metrics Irregular Irregular Classic MapReduce M M � Static Static ( Structure / / Processing View Map-Collective 3 � (All Model) Non Non Map Point-to-Point ) = = 4 = - - S S = = N ; Metric Metric I I Core Map Streaming 5 IO/Flops Data Model Shared Memory 6 libraries Single Program Multiple Data = = 7 N N Bulk Synchronous Parallel per 8 watt Fusion 9 Execution View Dataflow 10 Agents 11M (Mix of Data and Model) Problem Architecture View Workflow 12 (Nearly all Data+Model) 04/6/2016 3 Java MPI performs better than Threads 128 24 core Haswell nodes on SPIDAL DA-MDS Code Best MPI; inter and intra node Best Threads intra node MPI inter node 04/6/2016 4 Big DataKaleidoscope and of(Exascale) (Apache) Big Data Simulation Stack (ABDS) and HPC Convergence Technologies II Cross- 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Cutting Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Functions Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, 1) Message Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, and Data H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, Protocols: CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK Avro, Thrift, 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Protobuf Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, 2) Distributed Agave, Atmosphere Coordination 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, : Google Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird Chubby, 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Zookeeper, Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine Giraffe, 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, JGroups Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 3) Security & 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, Privacy: ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon InCommon, SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs Eduroam 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, OpenStack H-Store Keystone, 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC LDAP, Sentry, 12) Extraction Tools: UIMA, Tika Sqrrl, OpenID, SAML OAuth 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal 4) Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL Monitoring: 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Ambari, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Ganglia, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Nagios, Inca Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 21 layers 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Over 350 Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Software Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage Packages 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, January Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang,04/6/2016 Any2Api 29 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,5 Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds 2016 Networking: Google Cloud DNS, Amazon Route 53 HPC-ABDS Mapping of Activities • Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Cloudmesh • Level 16: Applications. Datamining for molecular dynamics, Image processing for remote sensing and pathology, graphs, streaming • Level 16: Algorithms. MLlib, Mahout, SPIDAL – Release SPIDAL Clustering, Dimension Reduction (multi-dimensional scaling), Web point set visualization • Level 14: Programming. Storm, Heron, Hadoop, Spark, Flink. Inter and Intra- node performance • Level 13: Communication. Enhanced Storm and Hadoop using HPC runtime technologies • Level 11: Data management. Hbase and MongoDB integrated via use of Beam and other Apache tools • Level 9: Cluster Management. Integrate Pilot Jobs with Yarn, Spark, Hadoop • Level 6: DevOps. Python Cloudmesh virtual Cluster Interoperability 04/6/2016 6 .