Big Data Software


Spring 2017
Bloomington, Indiana

Editor: Gregor von Laszewski
Department of Intelligent Systems Engineering
Indiana University
[email protected]

Contents

1 S17-ER-1001 Globus Toolkit, Saber Sheybani, 6
2 S17-IO-3000 CDAP Cask Data Application Platform, Avadhoot Agasti, 9
3 S17-IO-3004 MySQL, Cory Coulter, 12
4 S17-IO-3005 RabbitMQ, Abhishek Gupta, 15
5 S17-IO-3008 Docker Container, Vishwanath Kodre, 18
6 S17-IO-3010 Couchbase Server: A Usable Overview, Matthew Lawson, 21
7 S17-IO-3011 Apache Airavata, Scott McClary, 25
8 S17-IO-3012 Google Bigtable, Mark McCombe, 28
9 S17-IO-3013 Apache Beam (Google Cloud Dataflow), Leonard Mwangi, 32
10 S17-IO-3014 Xen: A bare metal hypervisor, Piyush Rai, 35
11 S17-IO-3015 Apache Lucene, Sabyasachi Roy Choudhury, 37
12 S17-IO-3016 CoreOS, Ribka Rufael, 40
13 S17-IO-3017 MongoDB, Nandita Sathe, 43
14 S17-IO-3019 vCloud and vSphere, Michael Smith, 47
15 S17-IO-3020 Google Fusion Table, Milind Suryawanshi, 50
16 S17-IO-3021 CUBRID RDBMS, Abhijit Thakre, 53
17 S17-IO-3022 Netty vs ZeroMQ in Realtime Analytics, Sunanda Unni, 56
18 S17-IO-3023 H-Store, Karthick Venkatesan, 59
19 S17-IO-3024 Not Submitted, Ashok Vuppada, 63
20 S17-IR-2001 HTCondor: Distributed Workflow Management System, Niteesh Kumar Akurati, 66
21 S17-IR-2002 Google Dremel: SQL-Like Query for Big Data, Jimmy Ardiansyah, 69
22 S17-IR-2004 Apache Samza, Ajit Balaga, 72
23 S17-IR-2006 An Overview of Apache Spark, Snehal Chemburkar and Rahul Raghatate, 75
24 S17-IR-2008 An overview of HadoopDB and its Architecture, Karthik Anbazhagan, 80
25 S17-IR-2011 Ansible, Anurag Kumar Jain and Gregor von Laszewski, 83
26 S17-IR-2012 Lustre File System, Pratik Jain, 86
27 S17-IR-2013 An overview of Flume and its Applications in BigData, Sahiti Korrapati, 89
28 S17-IR-2014 An Overview of Apache Sqoop, Harshit Krishnakumar, 92
29 S17-IR-2016 Apache Spark's Machine Learning Library (MLlib), Anvesh Nayan Lingampalli, 95
30 S17-IR-2017 An Overview of OpenNebula Project and its Applications, Author Missing, 98
31 S17-IR-2018 Analysis of Pentaho, Bhavesh Reddy Merugureddy, 101
32 S17-IR-2019 Twister: A new approach to MapReduce Programming, Vasanth Methkupalli, 104
33 S17-IR-2021 Docker Machine and Swarm, Shree Govind Mishra, 107
34 S17-IR-2022 Triana, Abhishek Naik, 110
35 S17-IR-2024 LDAP, Ronak Parekh and Gregor von Laszewski, 113
36 S17-IR-2026 Ceph - Distributed Storage System, Rahul Raghatate and Snehal Chemburkar, 116
37 S17-IR-2027 Twitter Heron, Shahidhya Ramachandran, 120
38 S17-IR-2028 A Report on Kubernetes, Srikanth Ramanam, 124
39 S17-IR-2029 An overview of Azure Machine Learning and its Applications, Naveenkumar Ramaraju, 127
40 S17-IR-2030 Microsoft Azure Data Factory, Sowmya Ravi, 130
41 S17-IR-2031 Google Cloud storage: A journey towards Cloud storage, Kumar Satyam, 133
42 S17-IR-2034 Apache Drill, Yatin Sharma, 136
43 S17-IR-2035 Oracle PGX, Piyush Shinde, 139
44 S17-IR-2036 An Overview of the Java Message Service (JMS), Rahul Singh, 143
45 S17-IR-2037 File Transfer Protocol - An Overview, Sriram Sitharaman, 146
46 S17-IR-2038 Introduction to H2O, Sushmita Sivaprasad, 150
47 S17-IR-2041 Tajo: A Distributed Warehouse System for Large Datasets, Sagar Vora, 153
48 S17-IR-2044 Allegro Graph, Diksha Yadav, 156
49 S17-TS-0003 TBD, Tony Liu, Vibhatha Abeykoon, and Gregor von Laszewski, 158
50 S17-TS-0006 Not Submitted, Author Missing, 161

Review Article, Spring 2017 - I524

Globus Toolkit

SABER SHEYBANI 1,*

1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding author: [email protected]

paper-001, September 21, 2017

The Globus Toolkit is an open source software stack, developed to serve as the middleware for grid computing. It is organized as a collection of loosely coupled components consisting of services, programming libraries, and development tools designed for building grid-based applications. GT components fall into five broad domain areas: Security, Data Management, Execution Management, Information Services, and Common Runtime. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Globus Toolkit, grid computing, WSRF, OGSA, cluster resource management

https://github.com/ssheybani/sp17-i524/tree/master/paper1/S17-ER-1001/report.pdf

1. INTRODUCTION

Initially, work on Globus was motivated by the demands of "virtual organizations" in science. These organizations need access to resources and services that are not easily replicated locally, for example regular access to, and management of, equipment and large amounts of data located in various remote databases. Although each application may have different specific requirements, most applications share the demand for a few basic functions: "They often need to discover available resources, configure a computing resource to run an application, move data reliably from one site to another, monitor system components, control who can do what, and manage user credentials". These functions can be well addressed in a grid computing framework [1].

Fig. 1. Grid Architecture [2]

According to [2], "a grid is a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service". The major difference between grids and other distributed systems is that grids have a significantly larger extent of heterogeneity of resources. They also enable resource sharing among virtual organizations, whereas other distributed computing technologies only support a single organization [3]. The grid architecture consists of a number of layers that together serve the functions and purposes of the system (see Figure 1). EGEE: Enabling Grids for E-Science in Europe [4], NEESit: Network for Earthquake Engineering Simulation [5], and TeraGrid [6] are examples of grid systems.

A grid system often comprises various services, such as a Job Management Service and a Resource Discovery and Management Service, which constantly need to interact with each other. However, each of these services may have been implemented by different vendors, and the implementations are not necessarily compatible with one another. The Open Grid Services Architecture (OGSA) aims to standardize practically all the services that can commonly be found in a grid system (job management services, resource management services, security services, etc.) by specifying a set of standard interfaces for these services. In order to realize that goal, OGSA needs to choose some sort of distributed middleware on which to base the architecture. In other words, if OGSA (for example) defines that the JobSubmissionInterface has a submitJob operation, there has to be a common and standard way to invoke that operation. Web services were chosen as the underlying technology. Web services are platform-independent and language-independent, since they are message-oriented and rely on language-neutral XML dialects to send messages, to specify interfaces, etc. Hence, web services are well-suited for building loosely coupled systems such as grid systems.
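The JobSubmissionInterface and its submitJob operation are the paper's own illustrative names, not a real GT4 API. A minimal Java sketch, with all types hypothetical, shows what such a standardized service interface might look like:

// Hypothetical sketch of a standardized grid service interface as
// described in the text; the names follow the paper's example and do
// not come from an actual OGSA/GT4 API.
public interface JobSubmissionInterface {

    // Submit a job to the grid; returns an identifier that clients
    // can later use to query the job's status.
    String submitJob(JobDescription job);

    // Query the current state of a previously submitted job.
    JobState getJobState(String jobId);
}

// Minimal value type describing a job to run.
final class JobDescription {
    final String executable;   // path to the program to run
    final String[] arguments;  // command-line arguments

    JobDescription(String executable, String... arguments) {
        this.executable = executable;
        this.arguments = arguments;
    }
}

// Coarse job life-cycle states.
enum JobState { PENDING, ACTIVE, DONE, FAILED }

Because every vendor implements the same interface, a client written against JobSubmissionInterface can submit work to any compliant service, which is exactly the interoperability OGSA is after; in practice the invocation is carried over XML-based web service messages rather than local method calls.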
However, although web services can, in theory, be either stateless or stateful, they are usually stateless. This means that the web service cannot "remember" information, or keep state, from one invocation to another. OGSA, on the other hand, requires stateful services. The Web Services Resource Framework (WSRF) specifies how we can make web services stateful, along with adding other features. In other words, while OGSA is the architecture, WSRF is the infrastructure on which that architecture is built.

With this background, we can define the Globus Toolkit as an open source toolkit that provides an implementation of OGSA, WSRF, and a number of other standards for grid computing. Figure 2 displays the relationship between OGSA, web services, WSRF, and the Globus Toolkit.

Fig. 2. Relationship between OGSA, GT4, WSRF, and Web Services [2]

2. COMPONENTS

The libraries and services in the Globus Toolkit can be classified into five broad domain areas, namely Security, Data Management, Execution Management, Information Services, and Common Runtime.

Execution Management: These components include Grid Resource Allocation and Management (the heart of GT Execution Management, providing services to deploy and monitor jobs), the Community Scheduler Framework (provides a single interface to different resource schedulers), Workspace Management (allows users to dynamically create and manage workspaces on remote hosts), and the Grid Telecontrol Protocol (provides a WSRF-enabled service interface for control of remote instruments).

Information Services: Commonly referred to as the Monitoring and Discovery System (MDS), these services deal with the monitoring and discovery of resources in a virtual organization. They include the Index Service (for aggregation of resources of interest to a VO), the Trigger Service (same as the Index Service, but configured to perform certain actions based on the data collected from resources), and WebMDS (provides a web browser-based view of data collected by GT4 aggregator services).

Common Runtime: These components provide a set of fundamental libraries and tools for hosting existing services as well as developing new ones, in C, Java, and Python (C Runtime, Java Runtime, and Python Runtime).

3. APPLICATIONS

The Globus Toolkit has been used in many scientific and industrial applications. Examples of large-scale e-science projects relying on the Globus Toolkit include the Network for Earthquake Engineering and Simulation (NEES), FusionGrid, the Earth System Grid (ESG), the NSF Middleware Initiative and its GRIDS Center, and the National Virtual Observatory. In the design of the Large Hadron Collider at CERN, Globus-based technologies have been developed through the European Data Grid and U.S. efforts such as the Grid Physics Network (GriPhyN) and the Particle Physics Data Grid [7]. The Globus Toolkit is also helping to bridge the gap for commercial applications of Grid computing.
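The stateless-versus-stateful distinction that motivates WSRF in Section 1 can be made concrete with a small sketch. The following Java class is not the WSRF API, only a schematic of its WS-Resource pattern under invented names: the service itself keeps no conversational state, and each request carries a resource key telling the service which server-side resource to operate on.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Schematic of the WSRF "WS-Resource" pattern, with invented names:
// state is factored out of the service into resources that callers
// address by key on every request.
public class CounterService {

    // Server-side store of resource state, keyed by resource
    // identifier. In WSRF the key travels inside a WS-Addressing
    // endpoint reference rather than as a plain string.
    private final Map<String, Integer> resources = new ConcurrentHashMap<>();

    // Create a new stateful resource and hand its key to the client.
    public String createCounter() {
        String key = UUID.randomUUID().toString();
        resources.put(key, 0);
        return key;
    }

    // Each invocation names the resource it targets, so state survives
    // from one call to the next even though the operation itself is
    // stateless.
    public int increment(String resourceKey) {
        return resources.merge(resourceKey, 1, Integer::sum);
    }

    public static void main(String[] args) {
        CounterService service = new CounterService();
        String key = service.createCounter(); // client obtains a resource
        service.increment(key);
        System.out.println(service.increment(key)); // prints 2: state was kept
    }
}

A purely stateless web service would have no resources map: each increment call would start from scratch and the count would be lost between invocations, which is precisely the gap OGSA closes by layering WSRF on top of plain web services.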