RevoScaleR User’s Guide

The correct bibliographic citation for this manual is as follows: Microsoft Corporation. 2016. RevoScaleR User’s Guide. Microsoft Corporation, Redmond, WA.

RevoScaleR User’s Guide
Copyright © 2016 Microsoft Corporation. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Microsoft Corporation.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013.

Revolution R, Revolution R Enterprise, RPE, RevoScaleR, DeployR, RevoPemaR, RevoTreeView, and Revolution Analytics are trademarks of Microsoft Corporation.

Revolution R Enterprise/Microsoft R Server includes the Intel® Math Kernel Library (https://software.intel.com/en-us/intel-mkl).

RevoScaleR includes Stat/Transfer software under license from Circle Systems, Inc. Stat/Transfer is a trademark of Circle Systems, Inc.

Other product names mentioned herein are used for identification purposes only and may be trademarks of their respective owners.

Microsoft Corporation
One Microsoft Way
Redmond, WA 98052 U.S.A.

Revised on November 9, 2015

We want our documentation to be useful, and we want it to address your needs. If you have comments on this or any Microsoft R Services document, send e-mail to [email protected]. We’d love to hear from you.

Contents

Chapter 1. Introduction
  1.1 Why RevoScaleR?
    1.1.1 Accessing External Data Sets
    1.1.2 Efficiently Storing and Retrieving Data
    1.1.3 Data Cleaning, Exploration, and Manipulation
    1.1.4 Statistical Analysis
    1.1.5 Writing Your Own Analyses for Large Data Sets
  1.2 Getting Started
    1.2.1 Accessing External Data Sets
    1.2.2 Data Cleaning, Exploration, and Transformations
    1.2.3 Statistical Analysis
    1.2.4 Writing Your Own Analyses for Large Data Sets
  1.3 Sample Data for Use with RevoScaleR
  1.4 Managing Threads
  1.5 Generating Random Numbers
  1.6 Using RevoScaleR with Rscript
  1.7 Getting Help

Chapter 2. Importing Data
  2.1 Data Compression in .xdf Files
  2.2 Importing Delimited Text Data
    2.2.1 Specifying a Missing Value String
  2.3 Importing Fixed-Format Data
  2.4 Importing SAS Data
  2.5 Importing SPSS Data
  2.6 Specifying Variable Data Types
  2.7 Specifying Additional Variable Information
  2.8 Appending to an Existing File
  2.9 Transforming Data on Import
  2.10 Converting Dates Stored As Character Strings
  2.11 Importing Wide Data
  2.12 Reading Data from an .xdf File into a Data Frame
  2.13 Splitting Data Files
  2.14 Importing Data as Composite Xdf Files
  2.15 Using Data from the Hadoop Distributed File System
    2.15.1 Note on Using RevoScaleR with rhdfs

Chapter 3. Data Sources
  3.1 Data Source Constructors
  3.2 Specifying Delimiters
  3.3 Compute Contexts and Data Sources
  3.4 Methods for Looking at Data Sources
  3.5 Using Data Sources
  3.6 Working with an Xdf Data Source
  3.7 Using an Xdf Data Source with biglm

Chapter 4. Transforming and Subsetting Data
  4.1 Creating a Subset of Rows and Columns
  4.2 Transforming Data with rxDataStep
    4.2.1 Creating and Transforming Variables
    4.2.2 Subsetting and Transforming Variables
  4.3 Using the Data Step to Create an .xdf File from a Data Frame
  4.4 Converting .xdf Files to Text
  4.5 Re-Blocking an .xdf File
  4.6 Modifying Variable Information
  4.7 Sorting Data
    4.7.1 Removing Duplicates While Sorting
    4.7.2 The rxQuantile Function and the Five-Number Summary
  4.8 Merging Data
    4.8.1 Inner Merge
    4.8.2 Outer Merge
    4.8.3 One-to-one Merge
    4.8.4 Union Merge
    4.8.5 Using rxMerge with .xdf files
  4.9 Creating and Recoding Factors
    4.9.1 Recoding Factors to Ensure Variable Compatibility

Chapter 5. Models in RevoScaleR
  5.1 External Memory Algorithms