How to Create Java Project with Apache Spark


Create new Java Project with Apache Spark

A new Java project can be created with Apache Spark support. For that, the jars/libraries present in the Apache Spark package are required, and their paths have to be included as dependencies of the Java project. In this tutorial, we shall look into how to create a Java project with all the required Apache Spark jars and libraries. We use Eclipse as the IDE for the demonstration in the following example; the process should be much the same with other IDEs such as IntelliJ IDEA and NetBeans. As a prerequisite, Java and Eclipse have to be set up on the machine.

Eclipse – Create Java Project with Apache Spark

1. Download Apache Spark

Download Apache Spark from https://spark.apache.org/downloads.html. The package is about 200 MB, so the download might take a few minutes.

2. Unzip and find jars

Unzip the downloaded package. Among its contents is a jars folder, which contains all the jars that need to be included in the build path of our project.

3. Create Java Project and copy jars

Create a Java project in Eclipse (named SparkMLlib22 in this example) and copy the jars folder from the Spark directory into the project.

4. Add Jars to Java Build Path

Right-click on the project (SparkMLlib22) -> Properties -> Java Build Path (3rd item in the left panel) -> Libraries (3rd tab) -> Add JARs (button on the right side panel) -> in the JAR selection dialog, select all the jars in the 'jars' folder -> Apply -> OK.
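Before moving on to the full MLlib example in the next step, you can verify the build path with a much smaller program. The following is a minimal sketch of such a check (the class name SparkSetupCheck is our own, not part of the original tutorial; local[2] simply runs Spark locally with two threads). If the jars were added correctly, it prints a count of 5 instead of failing with NoClassDefFoundError.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSetupCheck {
  public static void main(String[] args) {
    // Run Spark locally with two worker threads.
    SparkConf conf = new SparkConf()
        .setAppName("SparkSetupCheck")
        .setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Parallelize a small collection and count it. This exercises the Spark
    // core jars without needing any input data on disk.
    long count = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
    System.out.println("Count: " + count);

    jsc.stop();
  }
}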
5. Check the setup – Run an MLlib example

You may also copy the 'data' folder into the project, and add the jars from Spark's 'examples' directory, to get a quick glance at how to work with the different modules of Apache Spark. We shall run the following Java program, JavaRandomForestClassificationExample.java, to check whether the Apache Spark setup is successful with the Java project.

JavaRandomForestClassificationExample.java

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
// $example on$
import java.util.HashMap;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;
// $example off$

public class JavaRandomForestClassificationExample {
  public static void main(String[] args) {
    // $example on$
    SparkConf sparkConf = new SparkConf()
        .setAppName("JavaRandomForestClassificationExample")
        .setMaster("local[2]")
        .set("spark.executor.memory", "2g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    // Load and parse the data file.
    String datapath = "data/mllib/sample_multiclass_classification_data.txt";
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
    // Split the data into training and test sets (30% held out for testing)
    JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
    JavaRDD<LabeledPoint> trainingData = splits[0];
    JavaRDD<LabeledPoint> testData = splits[1];

    // Train a RandomForest model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    Integer numClasses = 3;
    HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
    Integer numTrees = 55; // Use more in practice.
    String featureSubsetStrategy = "auto"; // Let the algorithm choose.
    String impurity = "gini";
    Integer maxDepth = 5;
    Integer maxBins = 32;
    Integer seed = 12345;

    final RandomForestModel model = RandomForest.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
        seed);

    // Evaluate model on test instances and compute test error
    JavaPairRDD<Double, Double> predictionAndLabel =
        testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
          @Override
          public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<>(model.predict(p.features()), p.label());
          }
        });
    Double testErr =
        1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
          @Override
          public Boolean call(Tuple2<Double, Double> pl) {
            return !pl._1().equals(pl._2());
          }
        }).count() / testData.count();
    System.out.println("Test Error: " + testErr);
    System.out.println("Learned classification forest model:\n" + model.toDebugString());

    // Save and load model
    model.save(jsc.sc(), "target/tmp/myRandomForestClassificationModel");
    RandomForestModel sameModel = RandomForestModel.load(jsc.sc(),
        "target/tmp/myRandomForestClassificationModel");
    // $example off$

    jsc.stop();
  }
}

Output

17/07/23 09:46:09 INFO DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[20] at map at RandomForest.scala:553), which has no missing parents
17/07/23 09:46:09 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 8.1 KB, free 882.2 MB)
17/07/23 09:46:09 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 3.4 KB, free 882.2 MB)
17/07/23 09:46:09 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 192.168.1.100:34199 (size: 3.4 KB, free: 882.5 MB)
17/07/23 09:46:09 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
17/07/23 09:46:09 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (MapPartitionsRDD[20] at map at RandomForest.scala:553) (first 15 tasks are for partitions Vector(0, 1))
17/07/23 09:46:09 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks
17/07/23 09:46:09 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 11, localhost, executor driver, partition 0, ANY, 4621 bytes)
17/07/23 09:46:09 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 12, localhost, executor driver, partition 1, ANY, 4621 bytes)
17/07/23 09:46:09 INFO Executor: Running task 0.0 in stage 6.0 (TID 11)
17/07/23 09:46:09 INFO Executor: Running task 1.0 in stage 6.0 (TID 12)
...
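The last lines of the example save the trained forest and load it back within the same run. The saved model directory can also be reused from a separate program. Below is a minimal sketch of that (the class name LoadSavedForest is our own; it assumes the example above has already been run, so that target/tmp/myRandomForestClassificationModel and the 'data' folder exist in the project): it reloads the forest and predicts the label of a single point.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;

public class LoadSavedForest {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("LoadSavedForest")
        .setMaster("local[2]");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Load the model persisted by JavaRandomForestClassificationExample.
    RandomForestModel model = RandomForestModel.load(jsc.sc(),
        "target/tmp/myRandomForestClassificationModel");

    // Predict the class of the first point in the dataset and compare it
    // with the actual label.
    String datapath = "data/mllib/sample_multiclass_classification_data.txt";
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
    LabeledPoint first = data.first();
    System.out.println("Predicted: " + model.predict(first.features())
        + ", actual: " + first.label());

    jsc.stop();
  }
}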
Conclusion

In this Apache Spark tutorial, we have successfully learned to create a Java project with the Apache Spark libraries as dependencies, and to run a Spark MLlib example program.