ORAAH 2.8.2 Change List Summary

Contents

1 ORAAH 2.8.2 Changes
    1.1 Non backward-compatible API changes
    1.2 Backward-compatible API changes

2 ORAAH 2.8.1 Changes
    2.1 ORCHcore
        2.1.1 Non backward-compatible API changes
        2.1.2 Backward-compatible API changes
        2.1.3 New API functions
        2.1.4 Other changes
    2.2 ORCHstats
        2.2.1 Non backward-compatible API changes
        2.2.2 Backward-compatible API changes
        2.2.3 New API functions
        2.2.4 Other changes
    2.3 OREbase
        2.3.1 Non backward-compatible API changes
        2.3.2 Backward-compatible API changes
        2.3.3 New API functions
        2.3.4 Other changes
    2.4 ORCHmpi
        2.4.1 Non backward-compatible API changes
        2.4.2 Backward-compatible API changes
        2.4.3 New API functions
        2.4.4 Other changes

3 Copyright Notice

The ORAAH 2.8.2 release marks another major step in bringing Spark-based high-performance analytics to the Oracle R Advanced Analytics platform. ORAAH now supports Apache Spark 2.3 and higher, with added support for Spark data frames as input to all Spark-based algorithms. ORAAH also works with Apache Impala as a data source; all supported operations on an Apache Impala table use the same syntax as those on an Apache Hive table. Along with Impala and Hive tables, you can ingest data into Spark analytics from other JDBC sources such as Oracle, MySQL, and PostgreSQL, provided the respective JDBC drivers are in your CLASSPATH and SPARK_DIST_CLASSPATH. ORAAH continues to support legacy Hadoop mapReduce-based analytics in the 2.x product releases, but we encourage customers to move their code base to the Spark versions of the same functionality where available.

1 ORAAH 2.8.2 Changes

The following components were upgraded:

• Intel MKL upgraded to 2019u5
• Apache Spark upgraded to 2.4.0
• Apache MLlib upgraded to 2.4.0
• Scala upgraded to 2.11.12
• mpich upgraded to 3.3.2
• Math upgraded to 3.6.1
• Apache upgraded to 2.13.0
• Antlr upgraded to 4.8
• rJava upgraded to 0.9-12
• rJDBC upgraded to 0.2-8

The following issues were fixed:

• Fixed a misleading error message in the install postcheck.
• spark.connect() always reported 2 cores/instances.

1.1 Non backward-compatible API changes:

• None.

1.2 Backward-compatible API changes:

• None.

2 ORAAH 2.8.1 Changes

Support for CDH version 6.0.0 and a number of bug fixes.

2.1 ORCHcore

ORCHcore includes a number of code improvements and bug fixes as listed below.

2.1.1 Non backward-compatible API changes:

• None.

2.1.2 Backward-compatible API changes:

• spark.connect()

ORAAH now resolves the active HDFS Namenode for the cluster if the user does not specify the dfs.namenode parameter. An INFO message is displayed which prints the Namenode being used. If auto-resolution of the Namenode fails for any reason, users can always specify the dfs.namenode value themselves.

spark.connect() has a new logical parameter enableHive. If set to TRUE, it enables Hive support in the new Spark session. Hive support enables usage of Hive tables directly as an input source for Spark-based algorithms in ORAAH. If there is an active Hive connection when spark.connect() is invoked with the default value (NULL), enableHive will be set to TRUE.

For Hive support to be enabled in the Spark session, your Hive configuration must be available on the client side in the form of a hive-site.xml file in your CLASSPATH. You can do so by adding your HIVE_CONF_DIR to the CLASSPATH before starting R and loading the ORCH library. Alternatively, you can specify the Hive metastore details as part of spark.connect() with parameters like: hive.metastore.uris="thrift://METASTORE_NAME:METASTORE_PORT". Also, if your Hive is kerberized, then you need to additionally specify other parameters like: spark.connect(..., enableHive=TRUE, hive.metastore.uris="thrift://METASTORE_NAME:METASTORE_PORT", hive.metastore.sasl.enabled="true", hive.metastore.kerberos.principal=HIVE_PRINCIPAL, hive.security.authorization.enabled="false", hive.metastore.execute.setugi="true", ...).

A new parameter logLevel was also added to spark.connect(). This value overwrites the logging category specified in Apache Spark's log4j.properties file.

spark.connect() now loads the default Spark configuration from the spark-defaults.conf file if it is found in the CLASSPATH. If such a file is found, an INFO message is displayed to the user that defaults were loaded. These configurations have the lowest priority during the creation of a new Spark session; properties specified within spark.connect() will overwrite the default values read from the file.

New defaults were added to spark.connect() that are used when not specified: spark.executor.instances, which specifies the number of parallel Spark worker instances to create, is set to 2 by default, and spark.executor.cores, which specifies the number of cores to use on each Spark executor, is also set to 2 by default. For more details check help(spark.connect) after loading library(ORCH).
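
A minimal connection sketch follows, assuming a YARN-managed cluster. The master string, memory size and host names are placeholders, and the memory parameter name is an assumption; adapt them to your environment and check help(spark.connect).

    library(ORCH)

    spark.connect("yarn",                                   # cluster manager (placeholder)
                  memory     = "4g",                        # executor memory (assumed parameter name)
                  enableHive = TRUE,                        # enable Hive support in the new session
                  logLevel   = "WARN",                      # override the log4j.properties category
                  hive.metastore.uris = "thrift://METASTORE_NAME:METASTORE_PORT")

    # dfs.namenode can be passed explicitly if auto-resolution of the active
    # HDFS Namenode fails, e.g. dfs.namenode = "active-nn.example.com".

    spark.disconnect()                                      # end the Spark session when done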

• orch.connect()

On encountering an error while connecting to an Oracle database, orch.connect() now prints the exact failure message instead of a generic failure message. For example: invalid username/password, network adapter could not connect to host, ORACLE not available, etc.

• hdfs.dim()

When new CSV data files are loaded into HDFS using hdfs.upload(), or a new HDFS directory is attached using hdfs.attach(), the dimensions of the data are unknown. Invoking hdfs.dim() on such an HDFS path starts a mapReduce program (on user prompt) to compute the dimensions. In ORAAH 2.8.1, if you are connected to Spark using spark.connect(), the dimensions are determined using a Spark task instead of a slower Hadoop mapReduce computation. This results in a faster response time for hdfs.dim().

• hdfs.fromHive() hdfs.fromHive() now works with both Apache Hive and Apache Impala tables. Support for Apache Impala connection has been newly included in this release.

• hdfs.toHive() hdfs.toHive() now works with both Apache Hive and Apache Impala tables. Support for Apache Impala connections has been newly included in this release. Also, the ore.frame object is now returned invisibly, which prevents hdfs.toHive() from printing the entire ore.frame when the return value is not assigned to another R object.

• hdfs.push()

Specifying the parameter driver="olh" with hdfs.push() caused failures in earlier versions of ORAAH. This has been fixed in this release.

• orch.create.parttab()

Previously, orch.create.parttab() would fail in a few cases: if the input was either an Apache Hive table whose first column is of string type, or an hdfs.id object returned by the hdfs.push() function. It would also fail if partcols specified a column whose string values contained spaces. All of these issues have been fixed in this release.

• hdfs.write()

There was an issue with hdfs.write(..., overwrite=TRUE) wherein, after overwriting an HDFS path, ORAAH's HDFS cache for that particular path would go out of sync in some cases, causing a failure when invoking hdfs.get() on that HDFS path. This problem has been resolved in this release.

• hdfs.upload()

If the CSV files being uploaded to HDFS using hdfs.upload() have a header line that provides the column names for the data, ORAAH now uses this information in the generated metadata instead of treating the data as having no column names (which caused ORAAH to use generated column names VAL1, VAL2, and so on). The user needs to specify the parameter header=TRUE in hdfs.upload() for correct header line handling.
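
A short illustrative sketch follows; only the header parameter is confirmed by this change list, while the positional local path and the dfs.name argument name are assumptions.

    ds <- hdfs.upload("/tmp/sales.csv",                 # local CSV file with a header line
                      dfs.name = "/user/oracle/sales",  # target HDFS path (assumed argument name)
                      header   = TRUE)                  # use the header line for column names
    hdfs.dim(ds)         # computed via a Spark task if spark.connect() was used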

• mapReduce Queue selection for ORAAH mapReduce jobs

A new field queue has been added to the mapred.config object, which is used to specify the mapReduce configuration for mapReduce jobs written in R using ORAAH. This new configuration option lets the user select the mapReduce queue to which the mapReduce task will be submitted. For more details, check help(mapred.config).
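
A hedged sketch of using the new field: it assumes mapred.config objects are created with new() and passed to hadoop.run() through the config argument, and the job.name field is likewise an assumption; only the queue field is confirmed here.

    cfg <- new("mapred.config",
               job.name = "orch-example",       # descriptive job name (assumed field)
               queue    = "analytics")          # new: target mapReduce/YARN queue

    # res <- hadoop.run(data, mapper = ..., reducer = ..., config = cfg)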

2.1.3 New API functions:

• orch.jdbc()

The Apache Spark based analytics in ORAAH 2.8.1 can read input data over a JDBC connection. orch.jdbc() facilitates the creation of such a JDBC connection identifier object. This object can be provided as data to any of the Spark based algorithms available from the ORCHstats and ORCHmpi packages. You must have the JDBC driver libraries for the specific database you want to read input data from in your CLASSPATH. Also, since Spark reads the JDBC input in a distributed manner, this JDBC driver jar file must be present on all of your cluster nodes at the same path; make sure to add this path to SPARK_DIST_CLASSPATH in the Renviron.site configuration file. See the install guide manual for more details on configuration. For more details and examples, check help(orch.jdbc).

• orch.jdbc.close() orch.jdbc.close() is used to close the JDBC connection started by orch.jdbc(). It is recommended to close the connections once the data has been read and the connection will no longer be used. For more details and examples, check help(orch.jdbc.close).
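
A hypothetical usage sketch: orch.jdbc() and orch.jdbc.close() are the documented entry points, but the argument names used below (url, user, password, table) are assumptions, so check help(orch.jdbc) for the actual signature.

    jdbc <- orch.jdbc(url      = "jdbc:oracle:thin:@//dbhost:1521/PDB1",  # placeholder JDBC URL
                      user     = "scott",
                      password = "tiger",
                      table    = "SALES")

    # The identifier can then be passed as the data argument of any Spark-based
    # algorithm, e.g. orch.lm2(TARGET ~ ., data = jdbc)  (illustrative only).

    orch.jdbc.close(jdbc)       # close the connection once the data has been read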

• orch.df.scale() orch.df.scale() is used to scale the numerical columns of a data frame. For a list of available scaling techniques and other details, check help(orch.df.scale).

• orch.df.summary()

Creates and returns a summary of the Spark data frame. For more details and examples, check help(orch.df.summary).

• orch.df.sql() orch.df.sql() executes a Spark SQL query. This function is used to run an Apache Spark SQL query on a Spark SQL view created using orch.df.createView(). The results of the SQL query can then be collected in the R session using orch.df.collect(). For more details and examples, check help(orch.df.sql).

• orch.df.createView() orch.df.createView() creates or replaces a temporary Spark SQL view. Creating a Spark SQL view is needed if you wish to run a Spark SQL query on an existing Spark data frame. This function registers a Spark data frame as a SQL view. The SQL queries can then be submitted using orch.df.sql(). For more details and examples, check help(orch.df.createView) and help(orch.df.sql).

• orch.df.collect()

Collects a Spark data frame to client’s memory, and returns an equivalent R data frame. For more details and examples, check help(orch.df.collect).

• orch.df.describe()

Computes and returns statistics for numeric columns of a Spark data frame. If no columns are specified, then orch.df.describe() computes statistics for all numerical columns present in the Spark data frame. For more details and examples, check help(orch.df.describe).

• orch.df.fromCSV()

Creates a Spark data frame from a comma-separated values (CSV) data source. It can read data from both the local file system and HDFS. If the HDFS file was previously attached in ORAAH, the existing metadata that was created during the earlier hdfs.attach() operation will also be used. For more details and examples, check help(orch.df.fromCSV).
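
The orch.df.* functions above combine into a simple workflow. The sketch below is hedged: the positional path argument of orch.df.fromCSV() and the view-name argument of orch.df.createView() are assumptions, so consult the respective help pages for the exact signatures.

    df <- orch.df.fromCSV("/user/oracle/iris.csv")     # Spark data frame from a CSV file in HDFS

    orch.df.createView(df, "iris")                     # register the data frame as a Spark SQL view
    res <- orch.df.sql("SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")
    orch.df.collect(res)                               # bring the query result into the R session

    orch.df.describe(df)                               # statistics for all numeric columns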

• orch.df.persist() orch.df.persist() persists a Spark data frame with the desired Spark storage level specified in the function call. This makes the Spark data frame independent of the source it was created from. For example, a Spark data frame created from CSV files in HDFS using orch.df.fromCSV() will still be usable after the HDFS file is deleted, provided it was persisted using orch.df.persist(). For more details and examples, check help(orch.df.persist).

• orch.df.unpersist() orch.df.unpersist() unpersists a Spark data frame that was persisted using orch.df.persist(). For more details and examples, check help(orch.df.unpersist).

• orch.reconf()

With this release, the loading time of library(ORCH) has been significantly reduced. The mandatory environment and Hadoop component checks on startup have been made optional. These checks are now a one-time activity, and the cached configuration is used every time ORAAH is loaded. If you wish to re-run the checks within the current R session, you can invoke orch.reconf() anytime after ORAAH has been loaded. This will remove the existing cached configuration and save the new configuration. It is recommended to run orch.reconf() every time you upgrade any of your Hadoop components, like Hadoop/Yarn, Hive, Sqoop, OLH, Spark, etc.

• orch.destroyConf() orch.destroyConf() destroys the currently cached ORAAH configuration that was saved when library(ORCH) was loaded for the first time after install, or the last time orch.reconf() was run. This will not cause the checks to run immediately in the current R session. In the next R session, when the library is loaded, the configuration and component checks will run again.

2.1.4 Other changes:

• New Java-based HDFS FileSystem interface

ORAAH now has a new client-side HDFS file system interface written in Java. This new feature significantly improves the performance of many hdfs.* functions like hdfs.get(), hdfs.put(), hdfs.upload(), hdfs.exists(), hdfs.meta(), hdfs.attach(), hdfs.rm(), hdfs.rmdir(), hdfs.tail() and hdfs.sample(). This change is transparent to users and no exported API has changed. If your Hadoop configuration directory HADOOP_CONF_DIR is present in the CLASSPATH/ORCH_CLASSPATH, this improvement will set itself up. If for any reason the Java client fails to initialize, then the older versions of the hdfs.* functions will work without any speed-ups.

• Disabled mandatory ORCH loading checks

With this release, the loading time of library(ORCH) has been significantly reduced. The mandatory environment and Hadoop component checks on startup have been made optional. These checks are now a one-time activity, and the cached configuration is used every time ORAAH is loaded. If you wish to re-run the checks within the current R session, you can invoke orch.reconf() anytime after ORAAH has been loaded. This will remove the existing cached configuration and save the new configuration. It is recommended to run orch.reconf() every time you upgrade any of your Hadoop components, like Hadoop/Yarn, Hive, Sqoop, OLH, Spark, etc. You can also use orch.destroyConf(), which destroys the currently cached ORAAH configuration that was saved when library(ORCH) was loaded for the first time after install, or the last time orch.reconf() was run. This will not cause the checks to run immediately in the current R session. In the next R session, when the library is loaded, the configuration and component checks will run again and the configuration will be cached.

• ORCH_JAVA_XMS environment variable

To improve the configuration options for the rJava JVM that runs alongside ORAAH in R, support for -Xms has been added. To change the default value of 256m, users can set the environment variable ORCH_JAVA_XMS. Earlier versions of ORAAH already had the option to set -Xmx, with a default of 1g, using the environment variable ORCH_JAVA_XMX. For more details and examples, check help(ORCH_JAVA_XMS).
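
The -Xms/-Xmx values are read when the rJava JVM is initialized, so the variables should be set before ORAAH is loaded, typically in Renviron.site; setting them from R just before library(ORCH), as sketched below, is assumed to work as well.

    Sys.setenv(ORCH_JAVA_XMS = "512m",   # initial JVM heap (-Xms), default 256m
               ORCH_JAVA_XMX = "2g")     # maximum JVM heap (-Xmx), default 1g
    library(ORCH)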

• Handling of JAVA_TOOL_OPTIONS

Most of the hadoop commands in earlier versions of ORAAH would fail if the user had set the JAVA_TOOL_OPTIONS environment variable. This has been fixed in this release.

• Handling of [Ctrl + C] interrupt

While running any Java application in R through the rJava package (for example, while running Spark based algorithms in ORAAH), pressing the Ctrl + C keyboard keys would interrupt the rJava JVM, which in turn crashed the R process. This issue has been fixed in this release.

• Removed Hive CLI usage from ORCHcore

Previously, ORAAH still used the Hive CLI interface in a few functions, which was not desired and caused significant slowdowns in those functions due to the Hive CLI startup time. In this release, Hive CLI usage was removed entirely. Users will notice significant performance gains with Hive related functions in ORAAH, like hdfs.toHive(), hdfs.fromHive() and orch.create.parttab().

• Improved install and uninstall scripts for client

ORAAH 2.8.1 has a new package ORCHmpi, which exposes MPI based algorithms. These rely on MPI libraries being present on the nodes to proceed with the computations. It is recommended that you build MPI on your nodes (at the same path) and make it available to ORAAH using the ORCH_MPI_LIBS and ORCH_MPI_MPIEXEC environment variables. The ORAAH installer comes with a pre-built version (for Oracle Linux 6) of these MPI libraries, and the installer places them alongside the other libraries installed with ORAAH. You can specify your Renviron.site configuration file path to the client installer if it is present at a non-default location; the client installer will add the above two settings to the same file. For more details check help(ORCH_MPI_LIBS), help(ORCH_MPI_MPIEXEC) and the install guide manual.

• Improved install and uninstall scripts for server

If the dcli tool needed for server installation (on a Big Data Appliance cluster, in case the file listing node names is not specified) is not present in PATH, a few known locations for the tool are now probed before giving up and raising an error during installation.

2.2 ORCHstats

With the ORAAH 2.6.0 release we introduced support for selected algorithms from Apache MLlib, complemented by our proprietary optimized data transformation and data storage algorithms. This interface is built on top of Apache Spark and designed to improve the performance and interoperability of MLlib and ORAAH machine learning functionality with R. Building on the initial success of the Apache MLlib integration, ORAAH 2.7.0 expanded coverage of exported MLlib algorithms and greatly improved support for the R formula engine.

Machine learning and statistical algorithms require a Distributed Model Matrix (DMM) for their training phase. For supervised learning algorithms, the DMM captures a target variable and explanatory terms; for unsupervised learning, the DMM captures explanatory terms only. Distributed Model Matrices have their own implementation of the R formula, which closely follows the R formula syntax but is much more efficient from a performance perspective. See the reference manual for detailed information about the features supported in the Oracle formula.

With ORAAH 2.8.1, we now support Apache Spark 2.3 and higher. All Spark based algorithms now work with a Spark data frame as input. The details on the various supported input types for these algorithms are present in the help of each such algorithm. ORCHstats includes a number of code improvements and bug fixes as listed below.

2.2.1 Non backward-compatible API changes:

• orch.model.matrix()

The parameter maxBlockSize has been removed. A parameter storageLevel has been added to control the storage level of the input Spark data frame, which can be generated from various sources like CSV data, Apache Hive tables, Apache Impala tables, or any other JDBC source. Also, the "labeledPoint" and "vector" types are no longer needed or supported. For more details and examples, check help(orch.model.matrix).

• orch.ml.logistic()

The parameter nClasses has been removed, since the number of target classes is determined automatically. Other parameters that were removed are convergenceTol, featureScaling, regParam and validateData. New parameters threshold, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.logistic).

• orch.ml.linear()

Parameters miniBatchFraction, featureScaling, stepSize and validateData have been removed. New parameters elasticNetParam, regParam, standardization, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.linear).

• orch.ml.ridge()

Parameters miniBatchFraction, featureScaling, stepSize and validateData have been removed. New parameters standardization, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.ridge).

• orch.ml.lasso()

Parameters miniBatchFraction, featureScaling, stepSize and validateData have been removed. New parameters standardization, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.lasso).

• orch.ml.svm()

Parameters miniBatchFraction, featureScaling, stepSize and validateData have been removed. New parameters threshold, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.svm).

• orch.ml.gmm()

Parameter convergenceTol has been removed. New parameters maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.gmm).

• orch.ml.kmeans()

Parameter nParallelRuns has been removed. New parameters maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.kmeans).

• orch.ml.dt()

The parameter nClasses has been removed, since the number of target classes is determined automatically. New parameters minInstancesPerNode, minInfoGain, maxCategories, threshold, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.dt).

• orch.ml.random.forest()

The parameter nClasses has been removed, since the number of target classes is determined automatically. The parameter seed has also been removed. New parameters threshold, maxCategories, minInstancesPerNode, minInfoGain, maxBlockRows and storageLevel have been added. For more details and examples, check help(orch.ml.random.forest).

• orch.ml.pca()

This function has been removed from this release.

2.2.2 Backward-compatible API changes:

• orch.lm2()

A parameter storageLevel has been added to control the storage level of the input Spark data frame, which can be generated from various sources like CSV data, Apache Hive tables, Apache Impala tables, or any other JDBC source.

• orch.glm2()

A parameter method has been added to switch between algorithms that solve the underlying optimization model. You can choose between "irls", iteratively reweighted least squares (recommended and the default), or "lbfgs" for limited-memory environments. A parameter storageLevel has been added to control the storage level of the input Spark data frame, which can be generated from various sources like CSV data, Apache Hive tables, Apache Impala tables, or any other JDBC source.
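
A minimal sketch of the new parameters: it assumes orch.glm2() keeps a formula/data interface and that storage levels are given by their Spark names; only the method and storageLevel parameters themselves are confirmed here.

    fit <- orch.glm2(CANCELLED ~ DISTANCE + DEPDELAY,    # illustrative formula
                     data         = df,                  # e.g. a Spark data frame (assumed argument)
                     method       = "irls",              # or "lbfgs" for limited-memory environments
                     storageLevel = "MEMORY_AND_DISK")   # assumed storage-level spelling
    summary(fit)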

• orch.neural2()

Support for a new activation function, softmax, has been added to orch.neural2() in this release. A parameter storageLevel has been added to control the storage level of the input Spark data frame, which can be generated from various sources like CSV data, Apache Hive tables, Apache Impala tables, or any other JDBC source.

2.2.3 New API functions:

• orch.ml.gbt()

This function is used to invoke MLlib Gradient-Boosted Trees (GBTs). MLlib GBTs are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. MLlib supports GBTs for binary classification and for regression, using both continuous and categorical features. For more details and examples, check help(orch.ml.gbt).

• orch.summary() orch.summary() improves performance over the summary() function for Apache Hive tables. This function calculates descriptive statistics and supports extensive analysis of columns in an ore.frame (when connected to Apache Hive), along with flexible row aggregations. It provides a relatively simple syntax compared to the SQL queries that produce the same result. orch.summary() returns an ore.frame in all cases except when the group.by argument is used. If the group.by argument is used, then orch.summary() returns a list of ore.frame objects, one object per stratum. For more details about the function arguments and examples, check help(orch.summary).

• orch.mdf()

MLlib machine learning and statistical algorithms require a special row-based distributed Spark data frame for training and scoring. orch.mdf() is used to generate such MLlib data frames, which can be provided to any of the MLlib algorithms supported in ORAAH. If not created by the user, the MLlib algorithms will generate these input data frames themselves. However, if you want to run multiple algorithms on the same data, it is recommended to create such an MLlib data frame upfront using orch.mdf() to avoid repeated computation and creation of these MLlib data frames by every MLlib algorithm. For more details and examples, check help(orch.mdf).
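
A hedged sketch of the reuse pattern described above: build the MLlib data frame once, then hand it to several algorithms. The argument layout of orch.mdf() and whether the formula must be repeated per algorithm are assumptions; see help(orch.mdf).

    mdf <- orch.mdf(CANCELLED ~ DISTANCE + DEPDELAY, data = df)   # assumed interface

    fit1 <- orch.ml.logistic(CANCELLED ~ DISTANCE + DEPDELAY, data = mdf)
    fit2 <- orch.ml.dt(CANCELLED ~ DISTANCE + DEPDELAY, data = mdf)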

2.2.4 Other changes:

• Target Factor Levels in MLlib models

MLlib model objects have target factor levels associated with them (if available) to help the user better understand the prediction results and labels.

2.3 OREbase

OREbase now supports connections to Apache Impala databases. Apache Impala is a distributed, lightning-fast SQL query engine for huge data sets stored in clusters. It is a massively parallel and distributed query engine that allows users to analyse, transform and combine data from a variety of data sources. It is used when low-latency results are needed. Unlike Apache Hive, it does not convert queries to mapReduce tasks.

MapReduce is a batch processing engine, so by design Apache Hive, which relies on MapReduce, is a heavyweight, high-latency execution framework. MapReduce jobs carry all kinds of overhead and are hence slow. Apache Impala, on the other hand, does not translate a SQL query into another processing framework like map/shuffle/reduce operations, so it does not suffer from these latency issues. Apache Impala is designed for SQL query execution rather than being a general purpose distributed processing system like MapReduce; therefore it is able to deliver much better performance for a query. Apache Impala, being a real-time query engine, is best suited for analytics and for data scientists performing analytics on data stored in HDFS. However, not all SQL queries are supported in Apache Impala. In short, Impala SQL is a subset of HiveQL with a few syntactical differences.

OREbase now also includes better support for Apache Hive than earlier releases. This release includes a number of code improvements and bug fixes as listed below.

2.3.1 Non backward-compatible API changes:

• None.

2.3.2 Backward-compatible API changes:

• ore.connect()

Support for a new connection type for Apache Impala has been added: ore.connect(..., type="IMPALA"). More details and examples on connecting to Apache Hive and Apache Impala on non-kerberized and kerberized (with or without SSL enabled) clusters can be found in help(ore.connect).
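
A hedged connection sketch: type = "IMPALA" is confirmed above, but the remaining argument names (host, port, all) and the default Impala port are assumptions; help(ore.connect) covers the kerberized and SSL variants.

    ore.connect(type = "IMPALA",
                host = "impalad.example.com",   # an impalad host (placeholder)
                port = 21050,                   # commonly used Impala daemon port (assumption)
                all  = TRUE)                    # attach all tables on connect (assumption)
    ore.ls()                                    # list the tables visible in this session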

• ore.create()

With this release, the NA values in an R data.frame are converted to NULL in Apache Hive tables uniformly (irrespective of the size of the R data.frame object). Previously, ore.create(view="") failed after assigning row.names to an R data.frame object. This issue has been fixed in this release. Also, creating an Apache Hive table from large R data.frame objects could cause an error of the type: SemanticException Line 1:23 Invalid path "tmp_file_path": No files matching path file:tmp_file_path. This issue has been resolved in this release.

• ore.sync()

In earlier releases, syncing a partitioned Hive table raised an error regarding unsupported data types, unless you had the property hive.display.partition.cols.separately set to false in hive-site.xml. Setting this property is no longer required, since it interferes with other Hadoop components like Hue.

• gsub() support in Apache Hive

The subexpression selection in the replacement string has been fixed in this release. gsub() now works well with ore.character types.

• Division operator

The division operator previously generated NA in Apache Hive for values causing division by zero, zero divided by zero, and other such corner cases. In this release, handling of these cases has been improved and Inf, -Inf and NaN are generated, which matches the behavior of R's division operator.

2.3.3 New API functions:

• ore.impalaOptions()

Similar to ore.hiveOptions() this function sets Apache Impala options, namely, field delimiters for the Apache Impala tables and the current database name. For more details and examples, check help(ore.impalaOptions).

• ore.showImpalaOptions()

Similar to ore.showHiveOptions() this function displays the current value of all the Apache Impala options, namely, field delimiters for the Apache Impala tables and the database name. For more details and examples, check help(ore.showImpalaOptions).

2.3.4 Other changes:

• Automated JDBC driver lookup

Since release 2.7.0, ORAAH checks for the Hive and Hadoop libraries at certain known locations for different types of Hive installations, such as RPM or Cloudera parcel installs. These locations are scanned and the relevant Java libraries are added to the CLASSPATH for the driver when initiating the RJDBC connection. If you have Hive running from a custom install location, you can specify the environment variables ORCH_HADOOP_HOME (or the default HADOOP_HOME) and ORCH_HIVE_HOME (or the default HIVE_HOME) and the required libraries will be picked up from there. Also, if any jar file is missing, an appropriate warning message is reported. This technique is also used for the newly supported Apache Impala connections. It prevents any additional libraries from being added to the CLASSPATH.

2.4 ORCHmpi

In ORAAH 2.8.1 we have introduced a new package, ORCHmpi. This new package provides distributed MPI-backed algorithms which run over the Apache Spark framework.

ORCHmpi needs MPI libraries made available to ORAAH either by making MPI available system-wide, or by setting the ORCH-related environment variables on the client node. For more information on setting up MPI, check help(ORCH_MPI_LIBS) and help(ORCH_MPI_MPIEXEC). If MPI is configured properly, the orch.mpiAvailable() and orch.scalapackAvailable() functions will return TRUE. The environment variable ORCH_MPI_MAX_GRID_SIZE is used to specify the maximum number of MPI workers (not counting the leader process) that an MPI computation may spawn on the cluster per submission. It is recommended to set this to an integer value, with the maximum being no more than 60 percent of the available cluster CPU cores. A configuration sketch follows; the list of new functions exported from ORCHmpi is given after it.
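
A short sketch of the recommended pre-flight checks; the library paths are placeholders, and setting the variables with Sys.setenv() just before library(ORCH), rather than in Renviron.site, is an assumption.

    Sys.setenv(ORCH_MPI_LIBS          = "/opt/mpich/lib",          # placeholder path
               ORCH_MPI_MPIEXEC       = "/opt/mpich/bin/mpiexec",  # placeholder path
               ORCH_MPI_MAX_GRID_SIZE = "16")                      # <= ~60% of cluster CPU cores
    library(ORCH)

    orch.mpiAvailable()          # TRUE if the MPI subsystem is usable
    orch.scalapackAvailable()    # TRUE if the Scalapack subsystem is usable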

2.4.1 Non backward-compatible API changes:

• None.

2.4.2 Backward-compatible API changes:

• None.

2.4.3 New API functions:

• orch.mpiAvailable()

This function checks whether a proper MPI subsystem is available. For more details and examples, check help(orch.mpiAvailable).

• orch.scalapackAvailable()

This function checks whether a proper Scalapack subsystem is available. For more details and examples, check help(orch.scalapackAvailable).

• orch.mpi.cleanup()

This function cleans up stuck MPI processes and shared memory segments on the cluster using Apache Spark tasks. This is only necessary as a last resort if a less-than-graceful crash occurred during MPI phase execution and the driver process (which otherwise automatically cleans up failed MPI jobs) has failed as well. For more details and examples, check help(orch.mpi.cleanup).

• orch.mpi.options()

This function is used to set MPI stage execution options, like the number of MPI computation retries in case of failures, the number of MPI cleanup tasks to start on orch.mpi.cleanup(), and the maximum number of MPI workers. For more details and examples, check help(orch.mpi.options).

• orch.elm()

This function is used to create an Extreme Learning Machine model based on L1 FISTA fitting. This algorithm can be used for multiclass classification and regression. The functions predict(), summary() and coef() can be used on orch.elm models to generate predictions on new data, display a summary of the model, and extract the different coefficients available in the model. For more details and examples, check help(orch.elm).
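
A hypothetical fitting sketch, assuming a formula/data interface for orch.elm(); only the model family and the predict()/summary()/coef() methods are confirmed by this change list.

    mod  <- orch.elm(CANCELLED ~ DISTANCE + DEPDELAY, data = df)   # assumed interface
    summary(mod)                          # model summary
    coef(mod)                             # fitted coefficients
    pred <- predict(mod, newdata = df)    # score new data (argument name assumed)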

• orch.elm.load()

This function is used to load a saved orch.elm model from HDFS. For more details and examples, check help(orch.elm.load).

• orch.helm()

This function is used to create an Extreme Learning Machine for Multilayer Perceptron model. This algorithm can be used for multiclass classification and regression. The functions predict(), summary() and coef() can be used on orch.helm models to generate predictions on new data, display a summary of the model, and extract the different coefficients available in the model. For more details and examples, check help(orch.helm).

• orch.helm.load()

This function is used to load a saved orch.helm model from HDFS. For more details and examples, check help(orch.helm.load).

• orch.dssvd()

This function computes a reduced, approximate rank-k SVD decomposition. The function coef() can be used on an orch.dssvd model to extract the coefficients from the model. For more details and examples, check help(orch.dssvd).

• orch.dssvd.load()

This function loads the orch.dssvd model from HDFS. For more details and examples, check help(orch.dssvd.load).

• orch.dspca()

This function runs a distributed MPI-backed Principal Component Analysis (PCA). The function coef() can be used to extract coefficients from the orch.dspca model. For more details and examples, check help(orch.dspca).

• foldInMx.orch.dspca()

The fold-in PCA operation computes an approximation of new data points folded into the PCA space. Note that the result of this routine is in unscaled PCA space. For more details and examples, check help(foldInMx.orch.dspca).

• foldIn.orch.dspca()

The fold-in PCA operation computes an approximation of new data points folded into the PCA space. Note that the result of this routine is in unscaled PCA space. The formula used for the approximation of new data points is different from the one used by foldInMx.orch.dspca. For more details and examples, check help(foldIn.orch.dspca).

2.4.4 Other changes:

• None.

3 Copyright Notice

Copyright © 2020, Oracle and/or its affiliates. All rights reserved.

This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.