Current Loom Installation Guide

Loom Installation

Valid for: Loom 2.0.0+

Contents

Conceptual Overview of Installation

● Download the Loom distribution

● Edit configuration files

● Start the Loom server

In these instructions:

● Edit the red text before executing commands.

● Blue text highlights content of interest.

Prerequisites

Consult your system administrator as needed for the following prerequisites.

1. A Hadoop cluster running on Linux machines.
   a. Loom 2.0+ has been tested on the following Hadoop distributions. Loom supports MRv2 (YARN) as well as MRv1.

      Distributor    Version
      Cloudera       CDH 5.1
      Hortonworks    HDP 2.1
      Teradata       TDH 2.1

   b. Operating Systems: Linux. Loom has been run on Ubuntu, CentOS, RHEL, and SLES.
   c. Browsers: Chrome and Firefox.
   d. JDK: Oracle JDK or OpenJDK, version 6 or 7.

2. Choosing an installation location for Loom
   a. On the cluster
      a.i. It is recommended that you install Loom on the NameNode, for simplicity in managing permissions. However, Loom can be run on any node in the cluster.
   b. Off the cluster
      b.i. Loom can also be run outside the cluster, on a machine that can communicate with the Hadoop APIs but is not itself running any Hadoop services (commonly known as an “edge” node).
      b.ii. Users on this machine do not need to be able to access HDFS from the command line, but the machine must have a copy of the same Hadoop distribution files as the cluster: in particular, the libraries for Hadoop, Hive, and HCatalog.

3. Local Username/Permissions
   a. On both the machine where you will be running Loom and on all nodes in the cluster, create a dedicated Linux username for Loom. The alphanumeric ID, numeric user ID (UID), and group ID (GID) for the user must be the same across machines.
      a.i. This user will be referred to as loomuser throughout this document, but it can have any name.
      a.ii. Depending on Loom security settings (see Advanced Configuration > Security), this will be the username interacting directly with Hadoop services.
   b. Grant loomuser sudo privileges.
      b.i. This is not strictly necessary, but if you choose not to do so, you will need access to another username with sudo privileges in order to change ownership of the directory where Loom is downloaded.
   c. On the machine where Loom will be running, grant loomuser ownership of the following local directory:

      file:/tmp/loomuser
      The default location for local temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.local.scratchdir” property of hive-site.xml.

   d. Set HIVE_HOME, HADOOP_HOME, and HCAT_HOME environment variables for loomuser. These variables should be set permanently for loomuser, or specified in loom-server.sh, but should not just be set for the current shell session.
      d.i. These variables should be set to the directories that contain the Hive, Hadoop, and HCatalog “lib” directories, respectively, and should NOT have a trailing slash.
         d.i.1. The exact values will vary depending on your Hadoop distribution. Examples are below, but you should confirm that the Hive and Hadoop “lib” files are actually located at the paths below.
            d.i.1.a. Typical example for Hortonworks

HIVE_HOME=/usr/lib/hive

HCAT_HOME=/usr/lib/hive-hcatalog

HADOOP_HOME=/usr/lib/hadoop

            d.i.1.b. Typical example for CDH4 as installed by Cloudera Manager

HIVE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive

HCAT_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hcatalog

HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop

            d.i.1.c. For TDH, the following additional environment variable is needed

PATH=$PATH:/opt/teradata/jvm64/jdk7/bin
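The requirements in step 3.d can be checked before Loom is started. The following is a minimal sketch and is not part of the Loom distribution; the function name check_home_var is hypothetical. It tests only what the text above states: the value is non-empty, has no trailing slash, and points at a directory that contains a “lib” subdirectory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Loom): validate one *_HOME value.
# Usage: check_home_var NAME VALUE
check_home_var() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    case $value in
        */) echo "$name must not end with a slash: $value"; return 1 ;;
    esac
    if [ ! -d "$value/lib" ]; then
        echo "$name has no lib directory: $value/lib"
        return 1
    fi
    echo "$name looks OK: $value"
}

# Example calls, to be run as loomuser in a fresh login shell:
#   check_home_var HIVE_HOME   "${HIVE_HOME:-}"
#   check_home_var HADOOP_HOME "${HADOOP_HOME:-}"
#   check_home_var HCAT_HOME   "${HCAT_HOME:-}"
```

Running the calls in a fresh login shell, rather than the session in which the variables were first exported, also confirms that the settings are permanent.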

4. Hadoop Username/Permissions
   a. Grant loomuser read and write access to the following HDFS directory:

      hdfs:/user/hive/warehouse
      The default location of the Hive warehouse. This may be overridden in the “hive.metastore.warehouse.dir” property of hive-site.xml.

   b. Create and grant loomuser ownership of the following HDFS directories:

      hdfs:/tmp/hive-loomuser
      The default location for temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.scratchdir” property of hive-site.xml.

      hdfs:/user/loomuser
      The home directory for loomuser on HDFS.

   c. Grant loomuser read and write access to any HDFS directories where the user will want to browse, query, or output new data.
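The HDFS directories in step 4 can be prepared with the Hadoop file system shell. The sketch below prints the commands instead of running them, so that they can be reviewed first. It assumes the default paths named above and that the printed commands will be run by an account with HDFS superuser rights; that account, and the function name print_hdfs_setup, are assumptions, not something this guide specifies.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the HDFS commands for step 4.
# Assumes the default paths named above; adjust them if hive-site.xml
# overrides the scratch or warehouse directories.
print_hdfs_setup() {
    user=$1
    for dir in "/tmp/hive-$user" "/user/$user"; do
        echo "hdfs dfs -mkdir -p $dir"
        echo "hdfs dfs -chown $user $dir"
    done
    # Show the owner and mode of the warehouse directory itself (-d), so you
    # can decide how to grant read and write access to it.
    echo "hdfs dfs -ls -d /user/hive/warehouse"
}

print_hdfs_setup loomuser
```

How read and write access to the warehouse directory is granted (group membership or a mode change) depends on your site policy, so the sketch only inspects that directory.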

5. Hive
   a. Install Hive with a multi-user metastore, such as MySQL or PostgreSQL.
      a.i. If Hive was installed as a demo, it is probably using the default Apache Derby metastore, which is single-user. Your Hadoop distributor should have instructions on switching Hive to use a non-Derby metastore.

6. Networking
   a. Ports: The port on which Loom will run (8080 by default, but you can specify any port at runtime) must be exposed so that intended users of Loom can reach it from their web browser.

7. Web Browser
   a. The latest versions of Firefox and Chrome are compatible with Loom. Internet Explorer is not supported.

First-time Installation

For a cluster where Loom has never been installed:

Download and Install Loom

1. Open an SSH session on the machine where you are going to install Loom.

2. Create a loom directory wherever you want Loom installed (e.g. /usr/local), transfer ownership to loomuser, and cd into it.

loomuser@node:~$ cd /usr/local

loomuser@node:/usr/local$ mkdir loom

loomuser@node:/usr/local$ sudo chown -R loomuser /usr/local/loom

loomuser@node:/usr/local$ cd loom

3. Download Loom x.y.z (for example, 1.2.7) and unzip it.

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-x.y.z-distribution.zip; unzip loom-x.y.z-distribution.zip

4. Run the bin/check-setup.sh script.
   a. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/check-setup.sh, in order to include certain native dependencies.

loom-x.y.z-distribution/bin/check-setup.sh

# MapR requires native dependencies

JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"

b. You only need to run this script once: before the first time you start Loom.

loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh

# Example output

loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh

Checking Loom configuration

Checking default loom port ...... port '8080' on host 'localhost' ... OK.

Checking availability of datomic transactor port ...... port '4334' on host 'localhost' ... OK.

Checking default Hadoop FileSystem ...... configured to use hdfs://localhost:8020 ... OK.

Checking default Hadoop JobTracker ...... configured to use JobTracker 'localhost' port '50030' ... OK.

Loom is ready to run.

   c. If the “default loom port” check fails:
      c.i. The default port for Loom Server is 8080, but Loom can easily be run on a different port. Instructions are included in the documentation below, starting with the phrase “To run this server on a different port...”
   d. If the “availability of datomic transactor” check fails:
      d.i. This means another application is running on port 4334, 4335, or 4336. If you cannot remove the application, it is possible to configure Loom to start the transactor on a different set of three contiguous ports. Open loom-x.y.z-distribution/lib/datomic/transactor.properties, and set ‘port’ to the first port in the sequence you want to use:

loom-x.y.z-distribution/lib/datomic/transactor.properties

########### free mode config ###############

protocol=free

host=localhost

#free mode will use 3 ports starting with this one:

port=

      d.ii. You may also see this error if you have started Loom on this machine before; as mentioned above, it is only necessary to run check-setup.sh before the first time you start Loom. Once you start Loom, the transactor runs as a background process on ports 4334-4336, and keeps running on these ports between restarts of the Loom server.
   e. If the “default Hadoop FileSystem” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or HDFS is not running.
   f. If the “default Hadoop JobTracker” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or JobTracker is not running.
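When choosing a different starting port for the transactor (step 4.d), it can help to list the three ports involved and see which of them are already in use. The following is a minimal sketch; the function names transactor_ports and busy_ports are hypothetical, and busy_ports assumes it receives `netstat -tnl` style output on standard input.

```shell
#!/bin/sh
# Hypothetical helper: list the three contiguous ports the transactor would
# use for a given starting port (the 'port' value in transactor.properties).
transactor_ports() {
    start=$1
    echo "$start $((start + 1)) $((start + 2))"
}

# Hypothetical helper: of the ports given as arguments, print those that
# appear as listening in `netstat -tnl` output read from standard input.
busy_ports() {
    listing=$(cat)
    for p in "$@"; do
        if echo "$listing" | grep -q ":$p .*LISTEN"; then
            echo "$p"
        fi
    done
}

transactor_ports 4334   # prints: 4334 4335 4336

# Demonstration with a canned netstat line; on a real machine use:
#   netstat -tnl | busy_ports $(transactor_ports 4334)
printf 'tcp6 0 0 :::4335 :::* LISTEN\n' | busy_ports $(transactor_ports 4334)   # prints: 4335
```

If busy_ports prints nothing for a candidate starting port, none of the three ports is listening and that value is a reasonable choice for ‘port’.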

5. Set Loom’s DistributedCache directory
   a. In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It will default to hdfs:/user/${user.name}/loom-dist-cache unless otherwise changed, where ${user.name} is the name of the user who starts the loom server.

# Sets the location in HDFS where Loom manages the distributed cache that it

# uses to configure MapReduce jobs that it submits. The Loom server process

# must have permission to write in this location.

loom.dist.cache=

   b. IMPORTANT: the value must BOTH be an absolute path (not a URI) for an HDFS folder AND end with a "/". For example:

/user/loom/ ACCEPTABLE

/user/loom NOT ACCEPTABLE

loom/ NOT ACCEPTABLE

hdfs://master:9000/user/loom/ NOT ACCEPTABLE
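The two rules in step 5.b can be expressed as a small check. This is a sketch only; the function name valid_dist_cache is hypothetical and Loom itself may validate the value differently.

```shell
#!/bin/sh
# Hypothetical helper: check a candidate loom.dist.cache value against the
# two rules above (absolute path, not a URI; ends with "/").
valid_dist_cache() {
    case $1 in
        *://*) return 1 ;;   # URI such as hdfs://master:9000/user/loom/
        /*/)   return 0 ;;   # absolute path ending with "/"
        *)     return 1 ;;   # relative path, or no trailing "/"
    esac
}

# Re-run the four examples listed above.
for path in /user/loom/ /user/loom loom/ hdfs://master:9000/user/loom/; do
    if valid_dist_cache "$path"; then
        echo "$path ACCEPTABLE"
    else
        echo "$path NOT ACCEPTABLE"
    fi
done
```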

6. At this point, if you want to take advantage of Loom’s advanced configuration options, see the “Advanced Configuration” section and complete the relevant steps before proceeding to the next step below.

Start Loom

1. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/loom-server.sh, in order to include certain native dependencies.

loom-x.y.z-distribution/bin/loom-server.sh

# MapR requires native dependencies

JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"

2. Start the Loom Server.
   a. IMPORTANT: always run the loom-server.sh script from the current distribution directory, e.g. /usr/local/loom/loom-x.y.z-distribution. Loom has certain dependencies that require it to be started from the distribution directory.
   b. These examples use ‘nohup’ plus ‘&’ to run Loom in the background. You can also run Loom from a ‘screen’ window, if you have the ‘screen’ package installed.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh &

[hit ENTER to regain command-line access]

   c. To run this server on a different port, include the port number after loom-server.sh when you start the Loom server.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh <port> &

# Example

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh 8081 &

   d. Check the contents of nohup.out. Once Loom is running, you will see the message “Loom Server started”.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ tail -f nohup.out

-h gives a list of usages/options

Starting Database...

HADOOP_CP=

HIVE_CP=

Starting Loom Server...

Starting Loom Server on port 8080

Loom Server started
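Instead of watching nohup.out by hand, a script can poll the file until the startup message appears. The following is a minimal sketch; the function name wait_for_loom and the 60-second default timeout are assumptions, and the message text is taken from the example output above.

```shell
#!/bin/sh
# Hypothetical helper: poll a log file until the startup message appears.
# Usage: wait_for_loom FILE [TIMEOUT_SECONDS]
wait_for_loom() {
    file=$1
    remaining=${2:-60}
    while [ "$remaining" -gt 0 ]; do
        if [ -f "$file" ] && grep -q "Loom Server started" "$file"; then
            echo "Loom is up"
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "Timed out waiting for Loom to start"
    return 1
}

# Example (run from the distribution directory after starting the server):
#   wait_for_loom nohup.out 120
```

Note that nohup appends to an existing nohup.out, so a message left over from an earlier run would satisfy the check; remove or rename the old file before restarting if that matters.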

Congratulations! You have now installed Loom.

Upgrade

For a cluster where Loom has already been installed:

Back Up Current Registry

1. To make a copy of your existing registry, run the backup.sh script from the distribution directory. By default, <host>=localhost, <port>=8080, and <file>=backup.json.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ./bin/backup.sh -h <host> -p <port> -o <file>

a. This will produce a backup.json file in the distribution directory

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ls

backup.json bin config data datomic.pid docs lib license logs plugins R README.txt registry schema

Download and Install Loom

1. Open an SSH session on the machine where you have installed Loom.

2. Cd into the loom directory.

loomuser@node:~$ cd /usr/local/loom

3. Load Loom a.b.c (for example, 1.0.1) onto the node and unzip it.

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-a.b.c-distribution.zip; unzip loom-a.b.c-distribution.zip

4. Set Loom’s DistributedCache directory.
   a. In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. It will default to hdfs:/user/${user.name}/loom-dist-cache unless otherwise changed, where ${user.name} is the name of the user who starts the loom server.

# Sets the location in HDFS where Loom manages the distributed cache that it

# uses to configure MapReduce jobs that it submits. The Loom server process

# must have permission to write in this location.

loom.dist.cache=

   b. IMPORTANT: the value must BOTH be an absolute path (not a URI) for an HDFS folder AND end with a "/". For example:

/user/loom/ ACCEPTABLE

/user/loom NOT ACCEPTABLE

loom/ NOT ACCEPTABLE

hdfs://master:9000/user/loom/ NOT ACCEPTABLE

5. See the “Advanced Configuration” section in this document for instructions on additional configuration options.

Stop and Start Loom

1. Find the PID of the currently running Loom server.
   a. If you have sudo permissions:

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp | grep <port>

# Example

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp | grep 8080

[sudo] password for loomuser:

tcp6 0 0 :::8080 :::* LISTEN 18139/java

b. If you do not have sudo permissions, you can use an alternative method; the Loom server process will be the first process returned

loomuser@node:/usr/local/loom$ ps aux | grep revelytix.servlet

# Example

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ps aux | grep revelytix.servlet

loomuser 18139 0.8 13.6 1051396 280792 ? Sl 10:00 1:20 java -XX:PermSize=128m -XX:MaxPermSize=256m -Dtransactor.props=loom-0.6.1-distribution/bin/../lib/datomic/transactor.properties -cp loom-0.6.1-distribution/bin/../config:loom-0.6.1-distribution/bin/../lib/*:loom-0.6.1-distribution/bin/../lib/ext/*:/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/etc/hadoop::/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/client-0.20/* revelytix.servlet

loomuser 19380 0.0 0.0 7624 932 pts/0 S+ 12:30 0:00 grep --color=auto revelytix.servlet
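The PID can also be extracted from the netstat output by script rather than read off by eye. The following is a minimal sketch; the function name pid_from_netstat is hypothetical, and it assumes the `netstat -tnlp` line format shown in the example above, where the last field is "PID/program".

```shell
#!/bin/sh
# Hypothetical helper: pull the PID out of netstat -tnlp lines such as
#   tcp6 0 0 :::8080 :::* LISTEN 18139/java
# The last field is "PID/program"; keep the part before the slash.
pid_from_netstat() {
    awk '$NF ~ /^[0-9]+\// { split($NF, parts, "/"); print parts[1] }'
}

echo "tcp6 0 0 :::8080 :::* LISTEN 18139/java" | pid_from_netstat   # prints: 18139

# On a real machine (8080 is the default Loom port):
#   sudo netstat -tnlp | grep 8080 | pid_from_netstat
```

For the ps method, writing the pattern as '[r]evelytix.servlet' keeps the grep process itself out of the results, so only the Loom server line is returned.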

2. Kill the currently running Loom server.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ kill <PID>

3. If you are upgrading Loom, you must stop the transactor processes. You can skip this step if you are simply restarting the Loom server, i.e. using the same distribution.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/stop-database.sh

Stopped Database

4. Start the new Loom server. IMPORTANT: always invoke the loom-server.sh script from the distribution directory, e.g. /usr/local/loom/loom-a.b.c-distribution. Loom has certain dependencies that require it to be started from the distribution directory.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ nohup ./bin/loom-server.sh &

[hit ENTER to regain command-line access]

   a. To run this server on a different port, simply specify the port when starting the Loom server.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/loom-server.sh <port>

# Example

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/loom-server.sh 8081

5. Do not log into the Lab Bench or attempt to view or register data before finishing the next section.

Restore Registry

1. From the new distribution directory, restore the registry, using the backup.json file you created with the previous distribution. By default, <host>=localhost and <port>=8080.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/restore.sh /usr/local/loom/loom-x.y.z-distribution/ -h <host> -p <port>

Congratulations! You have now updated Loom.

Advanced Configuration

Security, User Impersonation, and Authentication

1. See loom-x.y.z-distribution/docs/Loom_Security.txt for details. You will need to restart Loom after making any Loom configuration changes, and restart Hadoop services after making any Hadoop configuration changes.

ActiveScan: Potential Sources

1. One of Loom’s features is the ability to detect “Potential Sources”: Loom regularly and recursively scans a specified HDFS directory to detect new files, which it displays on the ‘Sources’ home page of the Loom Lab Bench (browser UI), as well as on the ‘Loom’ home page in the ‘Recent Sources’ column.

2. To turn on ActiveScan: Potential Sources, edit loom-x.y.z-distribution/config/loom.properties:

loom-x.y.z-distribution/config/loom.properties

# Enable active scanning of potential datasets in HDFS.

activeScan.dataset.enabled=true

# Set the top-level directory under which to scan for potential datasets

# in HDFS. May be specified as an absolute hdfs:// URL or a relative

# path that will be resolved against the Loom working directory.

# Defaults to the Loom working directory.

activeScan.dataset.baseDir=

a. Example configurations

activeScan.dataset.baseDir=hdfs://node:8020/home/loomuser/loomInput ACCEPTABLE

activeScan.dataset.baseDir=/home/loomuser/loomInput ACCEPTABLE

activeScan.dataset.baseDir=loomInput ACCEPTABLE, if loomuser has a configured working directory

3. By default, Loom is set to scan the specified directory every 60 minutes, but you can change this:

loom-x.y.z-distribution/config/loom.properties

activeScan.dataset.scanIntervalMinutes=

4. You can also determine the size of the sample Loom will scan from each file, in terms of either number of rows (activeScan.hdfs.parseLines) or number of bytes (activeScan.hdfs.maxBufferSize). Loom will stop scanning as soon as it reaches one of those limits.

# The number of records to parse from a file in HDFS to determine whether it's a potential source.

activeScan.hdfs.parseLines=50

# The maximum amount of data to read into memory from an HDFS file to determine whether it's a potential source.

activeScan.hdfs.maxBufferSize=8388608
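The value 8388608 shown for activeScan.hdfs.maxBufferSize is 8 MiB expressed in bytes (8 × 1024 × 1024). If you prefer to think in MiB, the conversion can be sketched as follows; the function name mib_to_bytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in MiB to the byte count used by
# activeScan.hdfs.maxBufferSize.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 8    # prints: 8388608
mib_to_bytes 16   # prints: 16777216
```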

5. Once configuration changes have been made, start or restart the Loom server. Changes will not take effect otherwise.

Custom Metadata Properties

IMPORTANT: Read this if you are restarting Loom and using custom metadata properties.

If you meet both of the following conditions:

1) you used the Custom Metadata feature of Loom, i.e. removed, edited, or added properties in the CSV(s) in loom-x.y.z-distribution/schema, and

2) you are planning to restore the registry which you previously backed up,

then you must copy the contents of loom-x.y.z-distribution/schema from the old Loom distribution directory into the new directory. Otherwise, Loom will not be able to restore your registry due to a mismatch in registry structure.

loomuser@node:~/loom/loom-a.b.c-distribution$ rm schema/*

loomuser@node:~/loom/loom-a.b.c-distribution$ cp ../loom-x.y.z-distribution/schema/* schema/
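Because the first command deletes the new distribution's schema files before the copy, a mistyped source path can leave the schema directory empty. The following is a minimal sketch of a guarded version; the function name copy_schema is hypothetical and it is not part of the Loom distribution.

```shell
#!/bin/sh
# Hypothetical helper: replace the schema CSVs in the new distribution with
# those from the old one. Refuses to run if either schema directory is
# missing, or if both arguments resolve to the same directory (which would
# delete the only copy).
# Usage: copy_schema OLD_DISTRIBUTION_DIR NEW_DISTRIBUTION_DIR
copy_schema() {
    old=$1/schema
    new=$2/schema
    if [ ! -d "$old" ] || [ ! -d "$new" ]; then
        echo "schema directory not found"
        return 1
    fi
    if [ "$(cd "$old" && pwd -P)" = "$(cd "$new" && pwd -P)" ]; then
        echo "old and new schema directories are the same"
        return 1
    fi
    rm -f "$new"/*
    cp "$old"/* "$new"/
}

# Example, run from the directory that contains both distributions:
#   copy_schema loom-x.y.z-distribution loom-a.b.c-distribution
```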

Upon startup, Loom looks for CSVs in the directory loom-x.y.z-distribution/schema, and reads the properties defined therein. In order to remove, edit, or create properties for a given class of entities, you will need to edit the CSVs in loom-x.y.z-distribution/schema directory.

● All CSVs must follow the naming format: meta-*.csv. For example: meta-user-extension.csv, meta-customproperties.csv.

● Each CSV must use the following schema:

Column Name: type
Description: The entity type that the property is associated with.
Examples: Must be one of: source/SourceExtension, dataset/DatasetExtension, process/ProcessExtension, job/JobExtension

Column Name: meta.attribute/name
Description: The unique name of the property.
Examples: geo/Location, finance/Rate

Column Name: meta.attribute/valueType
Description: The datatype for the property.
Examples: string, long, uri, uuid

Column Name: meta.attribute/cardinality
Description: Indicates whether the property refers to a single value or a list.
Examples: Must be 'one' or 'many.'

Column Name: meta.attribute.ref/type
Description: Only use this property if meta.attribute/valueType is set to ‘uuid,’ otherwise leave as null. meta.attribute.ref/type indicates the type of entity to which meta.attribute/valueType refers.
Examples: dataset/Dataset

Column Name: meta.attribute/unique
Description: ‘value’ or ‘identity’ indicates that this property uniquely identifies the entity; that is, no 2 entities can share the same value for this property.
Examples: Must be null, 'value', or 'identity.'

Column Name: meta.attribute/index
Description: Indicates that this property should be indexed for fast lookups.
Examples: Must be 'TRUE' or 'FALSE.'

Column Name: meta.attribute/fulltext
Description: Only use this property if meta.attribute/valueType is set to ‘string,’ otherwise leave null. This property indicates whether meta.attribute/valueType should be indexed for text searches. [Note: Support for this feature is not included in Loom 1.1.3.]
Examples: Must be 'TRUE' or 'FALSE.'

Column Name: meta.attribute/doc
Description: The label that will be displayed in the Lab Bench; a text string describing the property.
Examples: Owned By

An example of correctly formatted custom properties for a Source, Dataset, Process, and Job:

loom-x.y.z-distribution/schema/meta-user-extension.csv

#Column Headers
type,meta.attribute/name,meta.attribute/valueType,meta.attribute/cardinality,meta.attribute.ref/type,meta.attribute/unique,meta.attribute/index,meta.attribute/fulltext,meta.attribute/doc

#Custom Property for a Source
source/SourceExtension,source.extenstion/hasSink,boolean,one,,,FALSE,FALSE,Has Sink?

#Custom Property for a Dataset
dataset/DatasetExtension,poc.docs/externalDocumentation,uri,many,,,FALSE,FALSE,External Documentation

#Custom Property for a Process
process/ProcessExtension,process.extension/requestedBy,string,one,,,FALSE,FALSE,Requested By

#Custom Property for a Job
job/JobExtension,job.extension/optimized,boolean,one,,,FALSE,FALSE,Optimized
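A few of the rules in the schema above can be checked mechanically before Loom is started. The following is a minimal sketch, not Loom's own validation; the function name check_meta_csv is hypothetical. It assumes nine comma-separated fields per row with no commas inside a field, and it skips blank lines, lines starting with '#', and the header row. Whether Loom itself accepts '#' lines in these files is not stated in this guide.

```shell
#!/bin/sh
# Hypothetical helper: basic checks on a meta-*.csv file read from stdin.
# Skips blank lines, '#' lines, and the header row (first field "type").
# Assumes no field contains a comma.
check_meta_csv() {
    awk -F',' '
        NF == 0 || /^#/ || $1 == "type" { next }
        NF != 9 {
            printf "line %d: expected 9 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $4 != "one" && $4 != "many" {
            printf "line %d: cardinality must be one or many\n", NR
            bad = 1
        }
        $7 != "TRUE" && $7 != "FALSE" {
            printf "line %d: index must be TRUE or FALSE\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example with one row from the file above:
printf '%s\n' \
    'job/JobExtension,job.extension/optimized,boolean,one,,,FALSE,FALSE,Optimized' \
    | check_meta_csv && echo "CSV looks OK"

# On a real file:
#   check_meta_csv < schema/meta-user-extension.csv
```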

For additional reading, see the README file in the Loom distribution folder (loom-x.y.z-distribution/README.txt)