Revolution R Enterprise™ 7 Microsoft HPC Administrator's Guide

The correct bibliographic citation for this manual is as follows: Revolution Analytics, Inc. 2015. Revolution R Enterprise 7 Microsoft HPC Administrator’s Guide. Revolution Analytics, Inc., Redmond, WA.

Revolution R Enterprise 7 Microsoft HPC Administrator’s Guide

Copyright © 2015 Revolution Analytics, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Revolution Analytics.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the Government is subject to restrictions as set forth in subdivision (c) (1) (ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013.

Revolution R, Revolution R Enterprise, RPE, RevoScaleR, RevoDeployR, RevoTreeView, and Revolution Analytics are trademarks of Revolution Analytics. Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective owners.

Revolution Analytics
One Microsoft Way
Redmond, WA 98052

Revised on August 11, 2015

We want our documentation to be useful, and we want it to address your needs. If you have comments on this or any Revolution document, write to [email protected].

Table of Contents

1 Introduction
  1.1 Terminology
  1.2 System Requirements
2 Quick Deployment
  2.1 Creating Firewall Exceptions
  2.2 Creating Shared and Working Directories
  2.3 Providing Access to Jobs and Large Data Directories
  2.4 Installing the Package on All Nodes
  2.5 Removing the Package from All Nodes
  2.6 Sharing Setup Information with your R Users
3 Managing Data
  3.1 Copying Data with ClusterCopy
4 Additional Install Options
  4.1 Performing a Full Install of Revolution R Enterprise
  4.2 Installing to a Non-Default Location

1 Introduction

Revolution R Enterprise for Windows is an enhanced, supported version of the open-source R language. It includes RevoScaleR, Revolution’s package for statistical analysis of large data sets. RevoScaleR turns R into a clustered high performance computing (HPC) application when run via HPC Server. In the usual configuration, users will access the HPC Server cluster while running Revolution R Enterprise from their client laptop or desktop workstation.
To support this configuration, Revolution Analytics has developed a lightweight node installer that allows you to quickly deploy Revolution R Enterprise to all the nodes of your HPC cluster. This manual provides detailed instructions for installing and uninstalling Revolution R Enterprise for Windows on your HPC cluster.

1.1 Terminology

In this manual, we use the following terminology for various computers (the first three terms are Microsoft’s; see Microsoft’s HPC Pack documentation for more information):

Head node: The HPC Server node that serves as the head node for the HPC cluster.

Compute node: An HPC Server node configured as a compute node on the HPC cluster.

Workstation node: A Windows 7 or 8 computer configured as a workstation node on the HPC cluster.

Client workstation: A Windows computer that is not part of the HPC cluster but is joined to the same Active Directory domain, is equipped with Revolution R Enterprise, and is capable of establishing a cluster connection to the HPC cluster.

1.2 System Requirements

Revolution R Enterprise runs on clusters running HPC Pack 2012 and those running HPC Server 2008R2, but HPC Server 2008R2 support is now deprecated.

To run Revolution R Enterprise for Windows on clusters running HPC Pack 2012, the cluster must meet the following minimum requirements:

Head Node
  o x64-architecture computer
  o x64 version of Windows Server 2012
  o HPC Pack 2012
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to an Active Directory domain

Compute Nodes
  o x64-architecture computer
  o x64 version of Windows Server 2012 or Windows Server 2008 R2
  o HPC Pack 2012
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to same Active Directory domain as head node

Workstation Nodes (optional)
  o x64-architecture computer
  o Windows 7 or Windows 8
  o HPC Pack 2012
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to same Active Directory domain as head node
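
As a rough illustration, the minimums shared by the HPC Pack 2012 node types listed above (x64 architecture, 4GB RAM, 200MB of installation disk space, membership in the head node's Active Directory domain) can be expressed as a simple check. This is only a sketch in Python: the NodeSpec fields and the unmet_requirements function are hypothetical names chosen for this example and are not part of Revolution R Enterprise or HPC Pack, and the sketch does not check the operating system or HPC Pack version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSpec:
    """Hypothetical description of one cluster node, for illustration only."""
    name: str
    architecture: str      # e.g. "x64"
    ram_gb: float          # installed RAM in gigabytes
    free_disk_mb: float    # free disk space available for installation
    domain: str            # Active Directory domain the node is joined to


def unmet_requirements(node: NodeSpec, head_domain: str) -> List[str]:
    """Return the Section 1.2 minimums (HPC Pack 2012 case) that a node fails."""
    problems = []
    if node.architecture.lower() != "x64":
        problems.append("x64-architecture computer required")
    if node.ram_gb < 4:
        problems.append("at least 4GB RAM required")
    if node.free_disk_mb < 200:
        problems.append("at least 200MB disk space required for installation")
    if node.domain.lower() != head_domain.lower():
        problems.append("must be joined to the same Active Directory domain as the head node")
    return problems


if __name__ == "__main__":
    # Example values are made up; a real inventory would come from your own tooling.
    node = NodeSpec("compute01", "x64", ram_gb=2, free_disk_mb=500, domain="corp.example.com")
    for problem in unmet_requirements(node, head_domain="corp.example.com"):
        print(f"{node.name}: {problem}")
```

Returning a list of unmet requirements, rather than a single pass/fail flag, makes it easier to report every problem on a node in one pass.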
To run Revolution R Enterprise for Windows on HPC Server 2008R2, the cluster must meet the following minimum requirements:

Head Node
  o x64-architecture computer
  o x64 version of Windows Server 2008 R2 Standard, Enterprise, Datacenter, or HPC edition
  o HPC Pack 2008 R2
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to an Active Directory domain

Compute Nodes
  o x64-architecture computer
  o x64 version of Windows Server 2008 or Windows Server 2008 R2 Standard, Enterprise, Datacenter, or HPC edition
  o HPC Pack 2008 R2
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to same Active Directory domain as head node

Workstation Nodes (optional)
  o x86- or x64-architecture computer
  o Windows 7 Professional or Enterprise
  o HPC Pack 2008 R2
  o 4GB RAM
  o 200MB disk space for installation
  o Must be joined to same Active Directory domain as head node

2 Quick Deployment

Installing Revolution R Enterprise on your cluster involves four main steps:

1. Ensuring that your cluster firewall allows the two main Revolution R Enterprise processes to communicate.
2. Creating a network share directory on the head node that all Revolution R Enterprise users can access, together with user-specific working directories on all compute nodes.
3. Ensuring that all Revolution R Enterprise users have the necessary permissions to run HPC Server jobs together with read-access to “large data” directories on all compute nodes.
4. Running a parametric sweep job to install the software on all the nodes.

2.1 Creating Firewall Exceptions

Revolution R Enterprise requires two distinct processes on each node of your cluster, and the results of these processes need to be communicated between nodes. You must therefore create firewall exceptions for these processes.
To do this, go to the Allowed Apps page in your Windows Firewall and add the following programs to your Allowed Apps list:

C:\Program Files\RRO\R-3.1.3\library\RevoScaleR\rxLibs\x64\BxlServer.exe
C:\Program Files\RRO\R-3.1.3\bin\x64\Rterm.exe

2.2 Creating Shared and Working Directories

Revolution R Enterprise uses a network share directory to store its job and task information. You can give this share any name, but it must be readable and writable by all Revolution R Enterprise users. The main share directory will contain folders for each user’s individual use. Thus, a typical structure might be to create a network share directory named \\AllShare, and then to populate that share with individual directories for each unique user: \\AllShare\fred, \\AllShare\gloria, etc. Your R users will refer to their particular subdirectory as their shareDir.

Each user also needs a private working directory on each node. Typically these will be standard user directories, e.g., C:\Users\fred, C:\Users\gloria, etc. Your R users will refer to this working directory as their workingDir.

2.3 Providing Access to Jobs and Large Data Directories

RevoScaleR is a package for large data analysis, and it is expected that within a given organization, numerous R users will want to analyze shared data sets. To do this, all users must have an account on the cluster that allows them to run distributed jobs, and each user must have access to the shared data sets. These files will normally be in the RevoScaleR .xdf format, and will either be complete copies or distributed copies (that is, each node might have just the portion of the data needed for its own computations). In either case, all Revolution R Enterprise users need to have read access to the data. Write access must be available to some individual (who may be an R user or a cluster administrator) to allow distribution of the data. These data directories must not be network shares; they must be local to the individual nodes.
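
The directory conventions in Sections 2.2 and 2.3 can be sketched as follows. This is a minimal Python illustration only: the helper names user_directories and is_local_path are hypothetical, the default roots are simply the example values used in the text, and the local-path test inspects only the form of the path.

```python
import re

# Matches drive-letter paths such as C:\data (UNC paths start with \\ instead).
_DRIVE_PATH = re.compile(r"^[A-Za-z]:\\")


def user_directories(user, share_root=r"\\AllShare", users_root=r"C:\Users"):
    # Returns the per-user shareDir and workingDir following the layout in
    # Section 2.2. Both roots are example values taken from the text.
    return {
        "shareDir": share_root + "\\" + user,
        "workingDir": users_root + "\\" + user,
    }


def is_local_path(path):
    # Section 2.3 requires data directories to be local to each node, not
    # network shares. This only inspects the path's form; a mapped network
    # drive letter would still pass, so treat it as a first sanity check.
    return bool(_DRIVE_PATH.match(path))


if __name__ == "__main__":
    for user in ("fred", "gloria"):
        print(user, user_directories(user))
    print(is_local_path(r"C:\data"))
    print(is_local_path(r"\\AllShare\data"))
```

Keeping the two roots as parameters lets the same helper describe a cluster that uses a differently named share or a non-standard user directory location.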
For simplicity, we suggest “C:\data” as the data
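
For administrators who would rather script the firewall exceptions from Section 2.1 than add them through the Allowed Apps page, one possible approach is to generate `netsh advfirewall firewall add rule` commands and run them on each node from an elevated prompt. The manual itself does not describe this approach, so treat the following Python sketch as an assumption-laden starting point: it only builds the command strings, the rule names are hypothetical, and the executable paths are the default locations listed in Section 2.1.

```python
# Executables named in Section 2.1; adjust if Revolution R Enterprise is
# installed to a non-default location.
PROGRAMS = {
    "RRE BxlServer": r"C:\Program Files\RRO\R-3.1.3\library\RevoScaleR\rxLibs\x64\BxlServer.exe",
    "RRE Rterm": r"C:\Program Files\RRO\R-3.1.3\bin\x64\Rterm.exe",
}


def firewall_commands(programs=PROGRAMS):
    # Build one inbound "allow" rule per program. The rule names are
    # illustrative; nothing is executed here.
    return [
        f'netsh advfirewall firewall add rule name="{name}" '
        f'dir=in action=allow program="{path}" enable=yes'
        for name, path in programs.items()
    ]


if __name__ == "__main__":
    for command in firewall_commands():
        print(command)
```

Printing the commands instead of executing them leaves room to review the rules, or to hand them to whatever mechanism you already use to run commands across the cluster.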