APPENDIX A

Setting Up the Distributed Analytics Environment
This appendix is a step-by-step guide to setting up a single machine for stand-alone distributed analytics experimentation and development, using the Hadoop ecosystem and associated tools and libraries. In a production distributed environment, of course, a cluster of server resources might be available to support the Hadoop ecosystem. Databases, data sources and sinks, and messaging software might be spread across the hardware installation, especially those components that have a RESTful interface and may be accessed through URLs. Please see the references listed at the end of the appendix for a thorough explanation of how to configure Hadoop, Spark, Flink, and Phoenix, and be sure to refer to the appropriate info pages online for current information about these support components. Most of the instructions given here are hardware agnostic; they are, however, especially suited to a MacOS environment. A final note about running Hadoop-based programs on Windows: while this is possible and is sometimes discussed in the literature and online documentation, most components are recommended to run in a Linux or MacOS environment.

Overall Installation Plan

The example system contains a large number of software components built around a Java-centric Maven project: most of these are represented in the dependencies found in your Maven pom.xml files (a sketch of a typical dependency stanza follows Table A-1). Many other components, however, rely on additional infrastructure, languages, and libraries. How you install these other components, and even whether you use them at all, is somewhat optional, and your platform may vary. Throughout this book, as we have mentioned before, we have stuck closely to a MacOS installation. There are several reasons for this: a Mac platform is one of the easiest environments in which to build stand-alone Hadoop prototypes (in the opinion of the author), and the components used throughout the book have gone through multiple versions and debugging phases and are extremely solid. Let's review the table of components present in the example system, shown in Table A-1, before discussing the overall installation plan.

Table A-1. Components of the example system

No. | Component Name | Chapter(s) | URL | Description
1 | Apache Hadoop | all | hadoop.apache.org | map/reduce distributed framework
2 | Apache Spark | all | spark.apache.org | distributed streaming framework
3 | Apache Flink | 1 | flink.apache.org | distributed stream and batch framework
4 | Apache Kafka | 6, 9 | kafka.apache.org | distributed messaging framework
5 | Apache Samza | 9 | samza.apache.org | distributed stream processing framework
6 | Apache Gora | | gora.apache.org | in-memory data model and persistence
7 | Neo4J | 4 | neo4j.org | graph database
8 | Apache Giraph | 4 | giraph.apache.org | graph processing framework
9 | JBoss Drools | 8 | www.drools.org | rule framework
10 | Apache Oozie | | oozie.apache.org | scheduling component for Hadoop jobs
11 | Spring Framework | all | https://projects.spring.io/spring-framework/ | Inversion of Control (IoC) framework and glueware
12 | Spring Data | all | http://projects.spring.io/spring-data/ | Spring data processing (including Hadoop)
13 | Spring Integration | | https://projects.spring.io/spring-integration/ | support for enterprise integration pattern-oriented programming
14 | Spring XD | | http://projects.spring.io/spring-xd/ | "extreme data": integrates with other Spring components
15 | Spring Batch | | http://projects.spring.io/spring-batch/ | reusable batch function library
16 | Apache Cassandra | | cassandra.apache.org | NoSQL database
17 | Apache Lucene/Solr | 6 | lucene.apache.org, lucene.apache.org/solr | open source search engine
18 | Solandra | 6 | https://github.com/tjake/Solandra | Solr + Cassandra interfacing
19 | OpenIMAJ | 17 | openimaj.org | image processing with Hadoop
20 | Splunk | 9 | splunk.com | Java-centric logging framework
21 | ImageTerrier | 17 | www.imageterrier.org | image-oriented search framework with Hadoop
22 | Apache Camel | | camel.apache.org | general-purpose glueware in Java; implements EIP support
23 | Deeplearning4j | 12 | deeplearning4j.org | deep learning toolkit for Java, Hadoop, and Spark
24 | OpenCV / BoofCV | | opencv.org, boofcv.org | used for low-level image processing operations
25 | Apache Phoenix | | phoenix.apache.org | OLTP and operational analytics for Hadoop
26 | Apache Beam | | beam.incubator.apache.org | unified model for creating data pipelines
27 | NGDATA Lily | 6 | https://github.com/NGDATA/lilyproject | Solr and Hadoop integration
28 | Apache Katta | 6 | http://katta.sourceforge.net | distributed Lucene with Hadoop
29 | Apache Geode | | http://geode.apache.org | distributed in-memory database
30 | Apache Mahout | 12 | mahout.apache.org | machine learning library with support for Hadoop and Spark
31 | BlinkDB | | http://blinkdb.org | massively parallel, approximate query engine for running interactive SQL queries on large volumes of data
32 | OpenTSDB | | http://opentsdb.net | time series-oriented database; runs on Hadoop and HBase
33 | University of Virginia HIPI framework | 17 | http://hipi.cs.virginia.edu/gettingstarted.html | image processing interface with Hadoop
34 | Distributed R and Weka | | https://github.com/vertica/DistributedR | statistical analysis support libraries
35 | Java Advanced Imaging (JAI) | 17 | http://www.oracle.com/technetwork/java/download-1-0-2-140451.html | low-level image processing package
36 | Apache Kudu | | kudu.apache.org | fast analytics processing library for the Hadoop ecosystem
37 | Apache Tika | | tika.apache.org | content-analysis toolkit
38 | Apache Apex | | apex.apache.org | unified stream and batch processing framework
39 | Apache Malhar | | https://github.com/apache/apex-malhar | operator and codec library for use with Apache Apex
40 | MySQL | 4 | | relational database
42 | Maven, Brew, Gradle, Gulp | all | maven.apache.org | build, compile, and version control infrastructure components
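As noted above, most of the Java-centric components in Table A-1 enter the project as Maven dependencies in pom.xml. The following is a minimal sketch of such a stanza for the two central frameworks, Hadoop and Spark; the version numbers shown are illustrative only, and you should substitute whichever releases your project standardizes on:

    <!-- In pom.xml: pull in the Hadoop client libraries -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>
    </dependency>

    <!-- Spark core; the _2.11 suffix names the Scala version the artifact was built against -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.2</version>
    </dependency>

Declaring components this way lets Maven download and version-manage the libraries for you, which is why the infrastructure setup below begins with Java and Maven before any of the distributed frameworks themselves.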
Once the initial basic components, such as Java, Maven, and your favorite IDE, are installed, the other components may be gradually added to the system as you configure and test it, as discussed in the following sections.

Set Up the Infrastructure Components

If you develop code actively, you may already have some or all of these components set up in your development environment, particularly Java, Eclipse (or your favorite IDE, such as NetBeans or IntelliJ), the Ant and Maven build utilities, and some other infrastructure components. The basic infrastructure components we use in the example system are listed below for your reference.

Basic Example System Setup

Set up a basic development environment. We assume that you are starting with an empty machine. You will need Java, the Eclipse IDE, and Maven; these provide programming language support, an interactive development environment (IDE), and a software build and configuration tool, respectively.
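On MacOS, some of these prerequisites can also be installed from the command line with Homebrew, which Table A-1 lists among the build infrastructure components. A minimal sketch, assuming Homebrew itself is already installed (formula names follow current Homebrew conventions and may change over time):

    # Refresh Homebrew's formula index
    brew update

    # Install the Maven build tool; a JDK can be installed the same way
    # if you prefer not to download it from the Oracle site
    brew install maven

    # Confirm that both tools are on the PATH
    java -version
    mvn --version

The manual downloads described next accomplish the same thing and give finer control over the exact versions installed.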
First, download the appropriate Java version for development from the Oracle web site:

http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

The current version of Java to use is Java 8. Run java -version to validate that the installed Java version is correct; you should see something similar to Figure A-1.

Figure A-1. First step: validate that Java is in place and has the correct version

Next, download the Eclipse IDE from the Eclipse web site. Please note that we used the "Mars" version of the IDE for the development described in this book:

http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/marsr

Finally, download the compressed Maven distribution from the Maven web site at https://maven.apache.org/download.cgi. Validate a correct Maven installation with

    mvn --version

on the command line; you should see a result similar to the terminal output in Figure A-2.

Figure A-2. Successful Maven version check

Make sure you can log in to localhost without a passphrase:

    ssh localhost

If not, execute the following commands:

    ssh-keygen -t rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys

There are many online documents, as well as several of the standard Hadoop references, with complete instructions on using ssh with Hadoop appropriately.

Apache Hadoop Setup

Apache Hadoop, Apache Spark, and Apache Flink are key components of the example system infrastructure. In this appendix we discuss a simple installation process for each of these components. The most painless way to install them is to refer to the many excellent "how-to" books and online tutorials about setting up these basic components, such as Venner (2009). Appropriate Maven dependencies for Hadoop must be added to your pom.xml file, and there are also several parameter files to alter in order to configure the Hadoop system.

Add the appropriate properties to core-site.xml:

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

and also to hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.name.dir</name>
            <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
        </property>
        <property>
            <name>dfs.data.dir</name>
            <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
        </property>
    </configuration>
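With both configuration files in place, the single-node HDFS instance can be initialized and started. The following is a minimal sketch, assuming a standard Hadoop distribution whose bin and sbin directories are on your PATH; the exact daemon list reported may vary by Hadoop version:

    # Format the HDFS namenode (first run only; this erases any existing HDFS metadata)
    hdfs namenode -format

    # Start the HDFS daemons (namenode, datanode, and secondary namenode)
    start-dfs.sh

    # List running Java processes; NameNode and DataNode should appear
    jps

Once the daemons are running, HDFS is reachable at hdfs://localhost:9000, matching the fs.default.name value configured in core-site.xml above.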