Digital Science Center
Geoffrey C. Fox, David Crandall, Judy Qiu, Gregor von Laszewski, Fugang Wang, Badi' Abdul-Wahid, Saliya Ekanayake, Supun Kamburugamuva, Jerome Mitchell, Bingjing Zhang
School of Informatics and Computing, Indiana University

Digital Science Center Research
Areas and applications, including ice sheet science:
• Digital Science Center Facilities
• RaPyDLI Deep Learning Environment
• SPIDAL Scalable Data Analytics Library
• MIDAS Big Data Software
• Big Data Ogres Classification and Benchmarks
• CloudIOT Internet of Things Environment
• Cloudmesh Cloud and Bare-metal Automation
• XSEDE TAS: monitoring citations and system metrics
• Data Science Education with MOOCs

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
21 layers, over 350 software packages (May 15, 2015). Green implies HPC integration.

Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca

Layers:
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Naiad, Oozie, Tez, FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Compose
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Kibana, Logstash, Graylog, Tableau, D3.js, three.js, Potree, DC.js
15B) Application Hosting Frameworks: AppScale, Red Hat OpenShift, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Stackato, appfog, CloudBees, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter-process communication (collectives, point-to-point, publish-subscribe): MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Azure Blob, Google Cloud Storage
7) Interoperability: Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public clouds; Networking: Google Cloud DNS

Scientific Impact Metrics
We developed a software framework to evaluate scientific impact for XSEDE. We calculate and track various scientific impact metrics for XSEDE, Blue Waters, and NCAR. In a recently conducted peer comparison study we showed how well XSEDE and its Extended Collaborative Support Services (ECSS) perform, judging by citations received. During this process we retrieved and processed millions of data entries from multiple sources in various formats to obtain the results.

           # Publications   Rank Average   Rank Median   Citation Average   Citation Median
XSEDE      2349             61             65            26                 11
Peers      168422           49             48            13                 5
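The exact metric definitions and data pipeline behind the table are not given here; as a rough, hypothetical illustration of the kind of citation-based peer comparison reported above, the sketch below computes a citation average, a median, and a simple percentile-style rank of citation counts against a peer distribution. The class and method names (ImpactMetrics, percentileRank) are invented for this example and are not part of the XSEDE TAS framework; reading the Rank columns as percentile-style ranks is likewise an assumption made only for the illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration of citation-based impact metrics (not the TAS implementation). */
public class ImpactMetrics {

    /** Arithmetic mean of citation counts. */
    static double average(int[] citations) {
        return Arrays.stream(citations).average().orElse(0.0);
    }

    /** Median of citation counts. */
    static double median(int[] citations) {
        int[] sorted = citations.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Percentile-style rank (0..100) of one citation count within a sorted peer distribution. */
    static double percentileRank(int count, int[] sortedPeers) {
        int below = 0;
        while (below < sortedPeers.length && sortedPeers[below] < count) {
            below++;
        }
        return 100.0 * below / sortedPeers.length;
    }

    public static void main(String[] args) {
        // Toy numbers standing in for citation counts harvested from external sources.
        int[] xsedePubs = {0, 2, 5, 11, 40, 120};
        int[] peerPubs  = {0, 0, 1, 2, 3, 5, 8, 13, 21, 55};

        int[] sortedPeers = peerPubs.clone();
        Arrays.sort(sortedPeers);

        double meanRank = Arrays.stream(xsedePubs)
                .mapToDouble(c -> percentileRank(c, sortedPeers))
                .average().orElse(0.0);

        System.out.printf("Citation average: %.1f, median: %.1f, mean percentile rank vs peers: %.1f%n",
                average(xsedePubs), median(xsedePubs), meanRank);
    }
}
```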

Ice Layer Detection Algorithm
The polar science community has built radars capable of surveying the polar ice sheets and, as a result, has collected terabytes of data; the repository grows each year as signal processing techniques improve and the cost of hard drives decreases, enabling a new generation of high-resolution ice thickness and accumulation maps. Manually extracting layers from an enormous corpus of ice thickness and accumulation data is time-consuming and relies on sparse hand selection, so developing image processing techniques that automatically aid in the discovery of knowledge is of high importance.

Data analytics for IoT devices in Cloud
We developed a framework to bring data from IoT devices to a cloud environment for real-time data analysis. The framework consists of data collection nodes near the devices and publish-subscribe brokers that bring data to the cloud, coupled with batch processing engines for data processing in the cloud. Our data pipeline is Robot → Gateway → Message Brokers → Apache Storm. Simultaneous Localization and Mapping (SLAM) is an example application built on top of our framework, where we apply parallel data processing to the expensive SLAM computation (a minimal topology sketch appears at the end of this section).
[Figure: robot-generated map of the environment]

Parallel Sparse LDA
• Original LDA (orange) compared to LDA exploiting sparseness (blue); a simplified sampling step is sketched at the end of this section
• Note: data analytics making use of Infiniband (i.e., limited by communication!)
• Java code running under Harp, the Hadoop-plus-HPC plugin
• Corpus: 3,775,554 Wikipedia documents; vocabulary: 1 million words; topics: 10k
• BR II is Big Red II, with a Cray Gemini interconnect
• Juliet is a Haswell cluster with Intel (switch) and Mellanox (node) Infiniband (not optimized)
[Figures: Harp LDA on BR II; Harp LDA on Juliet (36-core nodes)]

High Performance Data Analytics with Java + MPI on Multicore HPC Clusters
We find it challenging to achieve high performance in HPC clusters for big data problems. We approach this with Java and MPI, and improve further by using Java memory maps and off-heap data structures (sketched below). We achieve zero intra-node messaging, zero GC, and a minimal memory footprint. We present performance results from running on a recent Intel Haswell HPC cluster with 3456 cores in total.
[Figure: 200K speedup on 48 nodes; speedup vs. parallelism per node (# cores, 1 to 24) for three configurations: Java intra-node communication + MPI inter-node communication, Java threads (no intra-node communication) + MPI inter-node communication, and all-MPI communication]
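The poster does not show how the zero-GC, zero intra-node-messaging design is implemented; the sketch below only illustrates the general mechanism the Java + MPI panel alludes to: workers on the same node share a memory-mapped, off-heap buffer, so intra-node exchange needs no message passing and creates no garbage-collected objects. The file path, data layout, and class name are made up for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Sketch: off-heap, memory-mapped buffer shared by workers on the same node. */
public class OffHeapPartials {

    static final int RANKS_PER_NODE = 4;      // e.g. MPI ranks or worker threads on one node
    static final int DOUBLES_PER_RANK = 1024;

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("/dev/shm/partials.bin");  // hypothetical node-local shared file
        long bytes = (long) RANKS_PER_NODE * DOUBLES_PER_RANK * Double.BYTES;

        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // The mapping lives outside the Java heap, so filling it creates no garbage.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

            int rank = 0;                                         // this worker's slot on the node
            int base = rank * DOUBLES_PER_RANK * Double.BYTES;
            for (int i = 0; i < DOUBLES_PER_RANK; i++) {
                buf.putDouble(base + i * Double.BYTES, i * 0.5);  // write this rank's partial results
            }

            // Any other rank mapping the same file reads the values directly,
            // so intra-node exchange needs no explicit messaging.
            System.out.println("rank " + rank + " wrote, first value = " + buf.getDouble(base));
        }
    }
}
```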
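Harp's LDA code is not reproduced on the poster; the fragment below is a generic, simplified collapsed Gibbs sampling step for one token, intended only to show where the sparsity exploited in the Parallel Sparse LDA panel above enters. The comment marks where a sparsity-aware sampler (the "LDA exploiting sparseness" curve) would visit only topics with nonzero document or word counts instead of looping over all topics.

```java
import java.util.Random;

/** Simplified collapsed Gibbs sampling step for LDA; illustrative only, not Harp's implementation. */
public class LdaStep {

    /**
     * Resample the topic assignment of one token (word w in document d).
     * All counts are assumed to already exclude the token being resampled.
     * docTopic[k]   = tokens in document d currently assigned to topic k (sparse in practice)
     * wordTopic[k]  = occurrences of word w assigned to topic k across the corpus (sparse in practice)
     * topicTotal[k] = total tokens assigned to topic k
     */
    static int sampleTopic(int[] docTopic, int[] wordTopic, int[] topicTotal,
                           double alpha, double beta, int vocabSize, Random rng) {
        int numTopics = topicTotal.length;
        double[] cumulative = new double[numTopics];
        double sum = 0.0;
        // Dense loop over all topics. A sparse sampler (SparseLDA-style bucketing) would
        // instead visit only topics where docTopic[k] > 0 or wordTopic[k] > 0, which is
        // what makes the "LDA exploiting sparseness" variant faster.
        for (int k = 0; k < numTopics; k++) {
            double weight = (docTopic[k] + alpha)
                    * (wordTopic[k] + beta) / (topicTotal[k] + vocabSize * beta);
            sum += weight;
            cumulative[k] = sum;
        }
        double u = rng.nextDouble() * sum;   // draw a topic with probability proportional to its weight
        for (int k = 0; k < numTopics; k++) {
            if (u <= cumulative[k]) {
                return k;
            }
        }
        return numTopics - 1;
    }
}
```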
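The IoT panel above names the pipeline Robot → Gateway → Message Brokers → Apache Storm but gives no code; below is a minimal sketch of the Storm stage, assuming Storm 1.x-style package names (org.apache.storm). SensorSpout is a placeholder: in the real framework the spout would subscribe to the message broker, and SlamBolt stands in for the parallel SLAM-style computation. All class and field names are invented for the example.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

/** Sketch of the cloud-side stage of the Robot -> Gateway -> Message Brokers -> Storm pipeline. */
public class IotTopology {

    /** Placeholder spout; a real deployment would subscribe to the message broker instead. */
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);  // throttle the fake source
            // Emit one synthetic sensor reading; real data would arrive via the broker.
            collector.emit(new Values("robot-1", System.currentTimeMillis(), 1.23));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("robotId", "timestamp", "range"));
        }
    }

    /** Placeholder bolt standing in for the parallel SLAM-style computation. */
    public static class SlamBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            double range = input.getDoubleByField("range");
            // ... a real bolt would update the map estimate from this reading ...
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing to declare
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor", new SensorSpout(), 1);
        builder.setBolt("slam", new SlamBolt(), 4).shuffleGrouping("sensor");

        LocalCluster cluster = new LocalCluster();   // local test run
        cluster.submitTopology("iot-slam", new Config(), builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```

In this sketch the bolt's parallelism hint (4 here) is what would spread the expensive per-reading processing across workers, which is the role the panel assigns to parallel data processing for SLAM.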