AWS Marketplace Getting Started

Actian Vector

VAWS-51-GS-14

Copyright © 2018 Actian Corporation. All Rights Reserved.

This Documentation is for the end user’s informational purposes only and may be subject to change or withdrawal by Actian Corporation (“Actian”) at any time. This Documentation is the proprietary information of Actian and is protected by the copyright laws of the United States and international treaties. The software is furnished under a license agreement and may be used or copied only in accordance with the terms of that agreement. No part of this Documentation may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or for any purpose without the express written permission of Actian. To the extent permitted by applicable law, ACTIAN PROVIDES THIS DOCUMENTATION “AS IS” WITHOUT WARRANTY OF ANY KIND, AND ACTIAN DISCLAIMS ALL WARRANTIES AND CONDITIONS, WHETHER EXPRESS OR IMPLIED OR STATUTORY, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTY OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE, OR OF NON-INFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT WILL ACTIAN BE LIABLE TO THE END USER OR ANY THIRD PARTY FOR ANY LOSS OR DAMAGE, DIRECT OR INDIRECT, FROM THE USE OF THIS DOCUMENTATION, INCLUDING WITHOUT LIMITATION, LOST PROFITS, BUSINESS INTERRUPTION, GOODWILL, OR LOST DATA, EVEN IF ACTIAN IS EXPRESSLY ADVISED OF SUCH LOSS OR DAMAGE.

The manufacturer of this Documentation is Actian Corporation.

For government users, the Documentation is delivered with “Restricted Rights” as set forth in 48 C.F.R. Section 12.212, 48 C.F.R. Sections 52.227-19(c)(1) and (2) or DFARS Section 252.227-7013 or applicable successor provisions.

Actian, Actian DataCloud, Actian DataConnect, Actian X, Versant, PSQL, Actian Director, Actian Vector, Actian Vector in Hadoop, EDBC, Enterprise Access, and OpenROAD are trademarks or registered trademarks of Actian Corporation and its subsidiaries. All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies.

Contents

1. Introduction
    What Is Actian Vector?
    Vector Technology
    Vectorised Processing--Calculating Query Answers Fast
    Storage Innovations--Beating the Disk Bottleneck
    Vector Amazon Machine Image (AMI)

2. Deploying Vector from the AWS Marketplace
    Access the Vector AMI
    Launch and Deploy the Vector AMI

3. Running Queries, Creating Databases and Tables, and Loading Data
    Using Actian Director
    Download and Install Actian Director
    Start Director and Connect to the Vector EC2 Instance
    Run Queries with Actian Director
    Load Sample Data Using Vector CLI and Director
    Using the Vector Command Line Interface
    Start the Vector Command Line Interface
    Run Queries with Actian Vector CLI
    Load Sample Data Using the Vector CLI
    Remove the Sample Databases
    More About Actian Vector

A. More Sample Queries

B. Configuring Actian Vector Enterprise Edition on AWS Marketplace
    1. Choose an Instance Type
    2. Choose an Instance Size
    3. Select a Price Plan

C. Configuring Storage for Vector on AWS
    AWS EC2 Storage Concepts and Options
    AWS EC2 Root Device Volume
    Storage Options
    Tuning Volume Layout for Performance
    Instance Store Volumes
    EBS Volumes
    Configuring Vector to Use the Newly Set Up Disks

D. Migrating Vector Databases Between AMIs

1. Introduction

This section contains the following topics:

• What Is Actian Vector?
• Vector Technology
• Vector Amazon Machine Image (AMI)

What Is Actian Vector?

Actian Vector (hereafter Vector) is a next generation database management system from the Actian family of database products. Vector is targeted at analytical database applications—applications that need to process large volumes of data and perform complex operations on it to derive useful information. Typical examples include data warehousing, data mining, and reporting. Vector is optimized to work with both memory- and disk-resident datasets, allowing it to efficiently process large amounts of data (hundreds of gigabytes).

Note: Although it is fast for data analysis, Vector is not meant to be used for traditional transaction processing. For OLTP, you can use other products from Actian such as Ingres Database.

Vector Technology

Vector introduces a new way of storing data and a completely new mechanism for evaluating queries. Innovations such as vectorised processing, compression, and columnar data layout allow analytical queries to be run fast on a single server, even a laptop.

Vectorised Processing--Calculating Query Answers Fast

The most distinctive feature of Vector is the “vectorised” method it uses for evaluating queries. Rather than operating on single values from single table records at a time, Vector makes the CPU operate on “vectors,” which are arrays of values from many different records. Such vectorised execution brings out the best in modern CPU technology. It brings to the world of databases the high performance that modern computers exhibit for scientific calculation, gaming, and multimedia applications.

The technical basis for efficiency of vectorised processing is that modern chip technology (be it Intel, AMD, or IBM manufactured) now uses deeply pipelined CPU designs. Keeping all pipelines full, and thus efficiency near peak, is impossible for traditional database engines primarily due to code complexity. Similarly, crucial CPU features such as Intel's Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) are not used well by traditional database systems. Vectorised processing changes that. It provides efficiency that traditionally is only obtained by computer programs handwritten for one particular task.

Also, because of the high clock frequency of current CPUs, database systems now need to treat main memory access as a significant cost factor. Vector tackles this by ensuring that the vectors it operates on fit inside the CPU caches, avoiding unnecessary (and in multi-core systems, often contended) main memory access.

Vector takes advantage of multi-core systems by handling multiple queries concurrently or by running single queries in parallel. The improved overall computational efficiency of Vector over traditional commercial relational database technology is at least an order of magnitude for long running analytical queries.
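The difference between record-at-a-time and vector-at-a-time evaluation can be sketched in a few lines of Python. This is only an illustration of the control flow, not a description of how the Vector engine is implemented: the column name depdelay is borrowed from the sample ontime table, while VECTOR_SIZE and both function names are assumptions made for this example. Python gains no SIMD benefit from the second form; the point is that the inner work is applied to a small array of values from one column rather than to one record at a time.

```python
from array import array

# Assumed batch size, chosen only for illustration; a real engine sizes its
# vectors so that they fit in the CPU cache.
VECTOR_SIZE = 1024


def sum_record_at_a_time(rows):
    """Row-oriented evaluation: inspect one whole record per step."""
    total = 0
    for row in rows:
        if row["depdelay"] > 15:
            total += row["depdelay"]
    return total


def sum_vectorised(column, vector_size=VECTOR_SIZE):
    """Vector-oriented evaluation: process one column in fixed-size chunks."""
    total = 0
    for start in range(0, len(column), vector_size):
        vector = column[start:start + vector_size]  # values from many records
        total += sum(v for v in vector if v > 15)
    return total


delays = [5, 20, 16, 15, 40] * 1000
rows = [{"depdelay": d} for d in delays]
column = array("i", delays)
print(sum_record_at_a_time(rows), sum_vectorised(column))
```

Both functions compute the same answer; only the shape of the loop differs.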

Storage Innovations--Beating the Disk Bottleneck

Any database system with such a high computational speed runs the risk of becoming I/O bound. For this reason, the second major component of Vector consists of storage innovations designed for high I/O throughput. These innovations include:
• Columnar data layout
• Advanced compression
• Storage indexes

The Vector storage mechanism uses columnar data layout, which allows analytical queries to avoid disk access for columns not involved in a query. While you can generally think of Vector storage as a column store, Vector can mix columnar and row-based storage so that certain columns that are always accessed together get stored in the same disk block. Layout decisions are handled automatically by the system, but can also be controlled by the user.

To further avoid I/O becoming a performance bottleneck, Vector introduces a number of advanced compression schemes. These schemes are designed for fast decompression. Therefore, accessing compressed data in Vector means that less data needs to come from disk, yet queries do not slow down due to decompression.

Finally, Vector uses storage indexes. The storage indexes are small and store the minimum and maximum value per data block. The storage index, which is automatically created and maintained, enables the execution engine to rapidly identify candidate data blocks.
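The idea of a minimum/maximum index per data block can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not Vector's storage format: the block size, the function names, and the in-memory list standing in for a column are all hypothetical.

```python
def build_storage_index(values, block_size):
    """Record the (min, max) of every block of a column."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((min(block), max(block)))
    return index


def candidate_blocks(index, low, high):
    """Return the block numbers whose value range overlaps [low, high]."""
    return [
        block_no
        for block_no, (block_min, block_max) in enumerate(index)
        if block_max >= low and block_min <= high
    ]


# A toy "year" column stored in blocks of four values.
years = [1987] * 4 + [1988] * 4 + [2012] * 4
index = build_storage_index(years, block_size=4)
print(candidate_blocks(index, 2012, 2012))
```

A query that filters on year 2012 only needs to read the blocks returned by candidate_blocks; every other block can be skipped without being read from disk.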

Vector Amazon Machine Image (AMI)

An Amazon Machine Image (AMI) is a special type of virtual appliance used to create a virtual machine within the Amazon Elastic Compute Cloud (“EC2”). The AMI serves as the basic unit of deployment for services delivered using EC2. An AMI is a template that contains a software configuration (for example, an operating system, an application server, and applications). An AMI provides the information required to launch an instance, which is a virtual server in the cloud. You specify an AMI when you launch an instance, and you can launch as many instances from the AMI as you need. [1]

The Vector AMI is a Linux image that contains:
• Linux operating system: CentOS 7.4
• Vector Community Edition or Enterprise Edition
• Sample database table
• Sample data: real-world data from the U.S. Bureau of Transportation

It is deployed in the Amazon Web Services (AWS) cloud. For more information, see Deploying Vector from the AWS Marketplace on page 9.

1. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instances-and-amis.html

2. Deploying Vector from the AWS Marketplace

This section contains the following topics:
• Access the Vector AMI
• Launch and Deploy the Vector AMI

Access the Vector AMI

The Actian Vector AMI is available from the AWS Marketplace at the following locations:
• Community Edition: https://aws.amazon.com/marketplace/pp/B07FXYD6GX
  The Community Edition is limited in size to 100 GB.
• Enterprise Edition: https://aws.amazon.com/marketplace/pp/B07FMYGCJL
  Vector Enterprise Edition has no data limits and entitles you to sign up for Enterprise Support. For more information about sizing for Vector Enterprise Edition, see Configuring Storage for Vector on AWS on page 41.

Vector is delivered as an Amazon Machine Image (AMI), and you will need an AWS account to launch the AMI.

Note: IAM users must have the “AWSMarketplaceFullAccess” policy attached so that they can subscribe to the Marketplace Vector edition of their choice and create EC2 instances. For more information, see http://docs.aws.amazon.com/marketplace/latest/controlling-access/ControllingAccessToAWSMarketplaceSubscriptions.html.

Launch and Deploy the Vector AMI

To launch the Vector AMI

1. On the “Actian Vector Analytic Database” page, click Continue to Subscribe. If you are not logged in to your Marketplace account, you will be prompted to do so.
   After your subscription is confirmed, you may select an Annual License to reduce subscription costs for the Enterprise Edition. The Community Edition does not have a License cost.
2. Click Continue to Configuration.
3. Select the AMI, a Software Version, and a Region from the dropdown.
4. Click Continue to Launch.
5. From the Choose Action dropdown, select a launch point, and then perform the appropriate procedure below.

To Launch from Website

a. Depending on your edition of Vector, an EC2 Instance Type is selected by default:
   • Vector Community Edition: m5.2xlarge
   • Vector Enterprise Edition: m5.4xlarge
   Select the EC2 Instance Type.
   Important! For Vector Enterprise Edition sizing, see Configuring Actian Vector Enterprise Edition on AWS Marketplace on page 37.
b. Choose a VPC and Subnet.
   Note: Be sure to choose a VPC and Subnet that will enable your AMI to have network access to and from your location. A default VPC may have restricted network access or may not auto-assign a public IP address. Verify your VPC settings or create a new VPC using the AWS Console. For more information, see https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html.
c. Select a Security Group. We recommend that you Create a New Security Group Based on Seller Setting to ensure the correct ports are open to allow connectivity to the AMI and Database through JDBC and ODBC. Enter a Name and Description for the new security group. You can allow all IPs to have access or restrict access to a whitelisted group of IP addresses.
d. Select a Key Pair to use to access the Vector instance over SSH. If there are no Key Pairs, create one in the Amazon EC2 Console.
   Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair. [1] For more information, go to: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair
e. In the lower right corner, click Launch. The successfully deployed AMI confirmation page is displayed.
f. Click the link for the EC2 Console page and review the newly launched EC2 instance details.

To Launch through EC2

The Step 2: Choose an Instance Type page is displayed.

a. Choose an Instance Type with your desired configuration, considering these choices:
   • Type
   • vCPUs
   • Memory
   • Storage
   • Network Performance
   • IPv6 Support
   For Vector Community Edition, we recommend at minimum a general purpose m4.2xlarge instance. Vector scales automatically with more cores and memory, so the higher the better.
   Important! For Vector Enterprise Edition sizing, see Configuring Actian Vector Enterprise Edition on AWS Marketplace on page 37.
b. Click Next. The Step 3: Configure Instance Details page is displayed.
c. Modify any advanced configuration options and click Next. The Step 4: Add Storage page is displayed.
d. Specify a storage size of at least 150 GB (the default) and a volume type, then click Next. The Step 5: Add Tags page is displayed.
e. (Optional) Create a tag applicable to volumes or instances by assigning a key-value pair.
f. Click Next. The Step 6: Configure Security Group page is displayed.
g. Create your own security group using Director port 44223, configuring sources and ports for your system. A source of “0.0.0.0/0” allows access from any IP address.
h. Click Review and Launch in the lower right corner of the browser. The Step 7: Review Instance Launch page is displayed.
i. Review your settings. If you need to change anything, click the applicable step link near the top of the browser window.
j. Click Launch.
k. Select an existing key pair or create a new key pair.
   Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair. [1] For more information, go to: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair
l. Click Launch Instance.
m. Review launch status.
n. Click the link to view the status of the launched instance in the EC2 Console. The new instance appears in the list of instances in the EC2 console.
   Important! Wait until the instance status checks are complete, to ensure that the instance is healthy. This may take several minutes.

1. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html


You are now ready to interact with the Vector AMI using Actian Director or the Vector Command Line Interface. For more information, see Running Queries, Creating Databases and Tables, and Loading Data on page 15.

3. Running Queries, Creating Databases and Tables, and Loading Data

This section contains the following topics:
• Using Actian Director
• Using the Vector Command Line Interface
• More About Actian Vector

There are two ways you can interact with Vector: through the Actian Director desktop administration application or the Vector Command Line Interface (CLI). Consult one of the following sections for how to perform basic tasks using your chosen interface:
• Using Actian Director on page 15 (recommended if you are just getting started with Vector)
• Using the Vector Command Line Interface on page 23

Using Actian Director

Actian Director 2.1 is an easy-to-use desktop application that lets you interact with Actian Vector. Using Director, you can:
• Manage databases, tables, servers, and their components
• Administer security (users, groups, roles, and profiles)
• Create, store, and execute queries

Tasks in this section include:

1. Download and Install Actian Director on page 16.
2. Start Director and Connect to the Vector EC2 Instance on page 16.
3. Run Queries with Actian Director on page 18.
4. Load Sample Data Using Vector CLI and Director on page 20.
   • Create a database
   • Create a table
   • Load data into the table
   • Optimize the data by generating statistics
   • Run more queries

Download and Install Actian Director

You may download Actian Director for the following platforms:
• Microsoft Windows
• Apple Mac OS X (Director Community Edition)

For complete installation information, see the Actian Director Installation and Configuration Guide.

Start Director and Connect to the Vector EC2 Instance

To start Director and connect to your Vector EC2 instance

1. Start Director on your local desktop.
2. On the Start page, click Connect to an instance.

The Connect to Instance dialog opens.

3. From the Description tab of the AWS window, copy the Public DNS entry to the Clipboard.

4. Paste the Public DNS string into the Instance field and add :44223 to the end. For example:

   ec2-107-21-68-199.compute-1.amazonaws.com:44223

   44223 is the management port set up for the Vector AMI. The AMI Marketplace listing comes with a default security group that allows inbound traffic on this port. If you used this security group, there should be no connectivity issues.
5. Enter the following login credentials:
   Login: actian
   Password: copy and paste the Instance ID from the Description tab of the AWS window (see Note).
   Remember password: check
   Note: The Vector instance is set up using a DBMS password. The default password is the EC2 Instance ID, which you can find in the AWS EC2 Console on the Instances tab under Description with label “Instance ID.” For example:

   i-051eb4cdf77d4a29d
6. Click Connect. The Vector EC2 instance is added to the Instance Explorer on the left.

For more information about Director, see the Actian Director User Guide.
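If Director cannot connect, it can help to confirm that the management port is reachable from your desktop before checking anything else. The following Python sketch builds the Instance string described in step 4 and attempts a plain TCP connection to port 44223. The function names are hypothetical and are not part of Director or Vector; a successful TCP connection only shows that the security group and network path allow traffic, not that the login will succeed.

```python
import socket

DIRECTOR_PORT = 44223  # management port named in this guide


def director_instance_string(public_dns, port=DIRECTOR_PORT):
    """Build the value for Director's Instance field: <public DNS>:<port>."""
    return f"{public_dns.strip()}:{port}"


def port_is_reachable(host, port=DIRECTOR_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print(director_instance_string("ec2-107-21-68-199.compute-1.amazonaws.com"))
```

To test your own instance, call port_is_reachable with the Public DNS value copied from the EC2 console. A result of False usually points to the security group or VPC settings chosen at launch.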

Run Queries with Actian Director

The Vector AMI comes pre-installed with the ontimedb demonstration database, which is loaded with actual airline flight data from the U.S. Bureau of Transportation from 1987 to the present (175 million rows).

This first query returns the number of rows in the ontime table of ontimedb.

Important! The root volume of an Amazon AMI is based on an Elastic Block Store (EBS) snapshot, a standard practice when creating AMIs. These snapshots are stored on S3 and, as you access blocks on the volume, they are slowly loaded from S3 into EBS and served to the EC2 instance. This means that the first access to blocks of existing files after launching an EC2 instance from an AMI boot can be very slow. However, all subsequent accesses (even between instance reboots) are much faster.

This directly affects the sample queries against the ontimedb database that is included in the AMI, as the disk blocks for the ontimedb database files must be read from the S3 snapshot the first time the query accesses them. Although the sample queries that you will run in the following sections will be slow the first time, all subsequent queries will be fast and will continue to be fast through the EC2 instance restarts. For more information, see the AWS documentation at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html.
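One way to pay the first-access penalty up front is to read the affected data once before running timed queries. The Python sketch below reads a file sequentially and discards the contents, which forces its blocks to be fetched. The function name warm_file and the chunk size are assumptions for illustration, and this guide does not state where the Vector data files are located, so you must supply the path yourself. The AWS page referenced above describes volume-level tools that serve the same purpose.

```python
def warm_file(path, chunk_size=1024 * 1024):
    """Read a file from start to end, discarding the data.

    Returns the number of bytes read, so the caller can confirm that the
    whole file was touched.
    """
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Calling warm_file on each data file of interest (for example, in a loop over a directory listing) touches every block once; later reads of the same blocks no longer wait on the snapshot.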

To run the sample query

1. Start Director and connect to your Vector EC2 instance. (See Start Director and Connect to the Vector EC2 Instance on page 16.)
2. On the Director Start page, click New Query. A new query tab opens in the Multiple Document Interface.
3. Select ontimedb from the Select a Database dropdown.
4. Enter the following SQL command (copy and paste):

   SELECT count(*) FROM ontime\g
5. Click Execute. Results are displayed in the bottom half of the query tab.
6. Check the query execution time in the status bar at the bottom right of the query window.

You may run additional sample queries from More Sample Queries on page 35.


Load Sample Data Using Vector CLI and Director

Note: To create a database and load it with data in Vector Community Edition, which has a 100 GB size limit, you must first delete the provided ontimedb database. For more information, see Remove the Sample Databases on page 33.

In this exercise, you will create a database, create a table in that database, and then load it with data provided in the AMI. To perform the following procedures, you must connect to the Vector EC2 instance so that it is displayed in the Instance Explorer. See Start Director and Connect to the Vector EC2 Instance on page 16.

Create a Database

In this exercise, you will create a database named demodb.

To create a database

1. Select the Vector AMI instance in the Instance Explorer of Director.
2. On the Start page, click New Database or click DATABASE, Database, New Database. The New Database dialog box opens.
3. Enter the name demodb.
4. Click OK. The demodb database is created.
5. Click Close to close the New Database dialog.

The new database is added to the Databases folder of the Vector AMI instance.

Create a Table

In this exercise, you will create two tables in the demodb database: ontime and carriers. You will create the tables using copy-and-paste query commands.

To create the tables

1. In the Instance Explorer, navigate to and select the demodb database you created in the previous procedure.
2. On the Start page, click New Query or click QUERY, New.
3. A new query tab opens in the Multiple Document Interface.
4. Select the demodb database from the Select a Database dropdown.
5. Copy and paste the following SQL command in the Query Text window:

create table ontime(
    year integer not null,
    quarter i1 not null,
    month i1 not null,
    dayofmonth i1 not null,
    dayofweek i1 not null,
    flightdate ansidate not null,
    uniquecarrier char(7) not null,
    airlineid integer not null,
    carrier char(2) default NULL,
    tailnum varchar(50) default NULL,
    flightnum varchar(10) not null,
    originairportid integer default NULL,
    originairportseqid integer default NULL,
    origincitymarketid integer default NULL,
    origin char(5) default NULL,
    origincityname varchar(35) not null,
    originstate char(2) default NULL,
    originstatefips varchar(10) default NULL,
    originstatename varchar(46) default NULL,
    originwac integer default NULL,
    destairportid integer default NULL,
    destairportseqid integer default NULL,
    destcitymarketid integer default NULL,
    dest char(5) default NULL,
    destcityname varchar(35) not null,
    deststate char(2) default NULL,
    deststatefips varchar(10) default NULL,
    deststatename varchar(46) default NULL,
    destwac integer default NULL,
    crsdeptime integer default NULL,
    deptime integer default NULL,
    depdelay integer default NULL,
    depdelayminutes integer default NULL,
    depdel15 integer default NULL,
    departuredelaygroups integer default NULL,
    deptimeblk varchar(9) default NULL,
    taxiout integer default NULL,
    wheelsoff varchar(10) default NULL,
    wheelson varchar(10) default NULL,
    taxiin integer default NULL,
    crsarrtime integer default NULL,
    arrtime integer default NULL,
    arrdelay integer default NULL,
    arrdelayminutes integer default NULL,
    arrdel15 integer default NULL,
    arrivaldelaygroups integer default NULL,
    arrtimeblk varchar(9) default NULL,
    cancelled i1 default NULL,
    cancellationcode char(1) default NULL,
    diverted i1 default NULL,
    crselapsedtime integer default NULL,
    actualelapsedtime integer default NULL,
    airtime integer default NULL,
    flights integer default NULL,
    distance integer default NULL,
    distancegroup i1 default NULL,
    carrierdelay integer default NULL,
    weatherdelay integer default NULL,
    nasdelay integer default NULL,
    securitydelay integer default NULL,
    lateaircraftdelay integer default NULL,
    firstdeptime varchar(10) default NULL,
    totaladdgtime varchar(10) default NULL,
    longestaddgtime varchar(10) default NULL,
    divairportlandings varchar(10) default NULL,
    divreacheddest varchar(10) default NULL,
    divactualelapsedtime varchar(10) default NULL,
    divarrdelay varchar(10) default NULL,
    divdistance varchar(10) default NULL,
    div1airport varchar(10) default NULL,
    div1airportid integer default NULL,
    div1airportseqid integer default NULL,
    div1wheelson varchar(10) default NULL,
    div1totalgtime varchar(10) default NULL,
    div1longestgtime varchar(10) default NULL,
    div1wheelsoff varchar(10) default NULL,
    div1tailnum varchar(10) default NULL,
    div2airport varchar(10) default NULL,
    div2airportid integer default NULL,
    div2airportseqid integer default NULL,
    div2wheelson varchar(10) default NULL,
    div2totalgtime varchar(10) default NULL,
    div2longestgtime varchar(10) default NULL,
    div2wheelsoff varchar(10) default NULL,
    div2tailnum varchar(10) default NULL,
    div3airport varchar(10) default NULL,
    div3airportid integer default NULL,
    div3airportseqid integer default NULL,
    div3wheelson varchar(10) default NULL,
    div3totalgtime varchar(10) default NULL,
    div3longestgtime varchar(10) default NULL,
    div3wheelsoff varchar(10) default NULL,
    div3tailnum varchar(10) default NULL,
    div4airport varchar(10) default NULL,
    div4airportid integer default NULL,
    div4airportseqid integer default NULL,
    div4wheelson varchar(10) default NULL,
    div4totalgtime varchar(10) default NULL,
    div4longestgtime varchar(10) default NULL,
    div4wheelsoff varchar(10) default NULL,
    div4tailnum varchar(10) default NULL,
    div5airport varchar(10) default NULL,
    div5airportid integer default NULL,
    div5airportseqid integer default NULL,
    div5wheelson varchar(10) default NULL,
    div5totalgtime varchar(10) default NULL,
    div5longestgtime varchar(10) default NULL,
    div5wheelsoff varchar(10) default NULL,
    div5tailnum varchar(10) default NULL
) \g

6. Click Execute. Results are displayed in the bottom half of the query tab.
7. Create the carriers table and insert data into it:

create table carriers(
    ccode char(2) collate ucs_basic,
    carrier char(25) collate ucs_basic
)\g

INSERT INTO carriers VALUES
    ('AS','Alaska Airlines (AS)'),
    ('AA','American Airlines (AA)'),
    ('DL','Delta Air Lines (DL)'),
    ('EV','ExpressJet Airlines (EV)'),
    ('F9','Frontier Airlines (F9)'),
    ('HA','Hawaiian Airlines (HA)'),
    ('B6','JetBlue Airways (B6)'),
    ('OO','SkyWest Airlines (OO)'),
    ('WN','Southwest Airlines (WN)'),
    ('NK','Spirit Airlines (NK)'),
    ('UA','United Airlines (UA)'),
    ('VX','Virgin America (VX)')\g

The new tables are added to the Tables folder of the Vector AMI instance, demodb database. If you do not see the tables, right-click and select Refresh.

Load Data

To load data into the database and table you have just created, you must use the Vector Command Line Interface. See Start the Vector Command Line Interface on page 24, and then go to Load Data on page 30.

Using the Vector Command Line Interface

The Vector Command Line Interface (CLI) is installed with the Vector instance. You can use this text-based interface to interact with the Vector instance and execute SQL commands to run queries, create databases and tables, and load data into tables.

Tasks in this section include:

1. Start the Vector Command Line Interface on page 24.
2. Run Queries with Actian Vector CLI on page 25.
3. Load Sample Data Using the Vector CLI on page 27.
   • Create a database
   • Create a table
   • Load data into the table
   • Optimize the data by generating statistics
   • Run more queries

Start the Vector Command Line Interface

The Vector Command Line Interface (CLI) is installed with the Vector instance. It enables you to interact with Vector databases and tables using line-based text commands. To access the CLI in the Vector instance, you will need to use an SSH (Secure Shell) utility such as PuTTY on Windows or the native SSH client on Linux or Mac OS X. Follow the appropriate procedure below.

To connect to the Vector instance using PuTTY on Windows

If PuTTY is not already installed on your machine, download and install it first.
1. Start the PuTTY client on your machine.
2. From the Description tab of the Vector EC2 instance browser window, copy the Public DNS to the Clipboard.
3. On the PuTTY Session page, paste the DNS into the Host Name (or IP address) field.
4. On the Connection, SSH, Auth page, browse for and load your local private key file (.ppk) used for authorization.
   Note: The EC2 key pair provides a .pem file, which works with command line SSH on Linux or Mac OS X. However, it must be converted to .ppk format on Windows. For more information, go to: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html.
5. Click Open to start your terminal session. The “login as” prompt is displayed in the terminal session window. Use actian for the login.

To connect to the Vector instance using SSH on Linux or OS X [1]

1. In a command-line shell, change directories to the location of the private key file that you created when you launched the instance.
2. Ensure that the private key file (.pem) has the appropriate file permissions (400). If not, use the chmod command to set the permissions correctly:

   chmod 400 /path/my-key-pair.pem
3. Use the ssh command to connect to the instance. Specify the private key (.pem) file and actian@public_dns_name. For example:

   ssh -i my-key-pair.pem actian@ec2-198-51-100-1.compute-1.amazonaws.com

   You receive a response such as the following:

   The authenticity of host 'ec2-198-51-100-1.compute-1.amazonaws.com (10.254.142.33)' can't be established.
   RSA key fingerprint is 1f:51:ae:28:bf:89:e9:d8:1f:25:5d:37:2d:7d:b8:ca:9f:f5:f1:6f.
   Are you sure you want to continue connecting (yes/no)?
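The permission check and the ssh invocation from the steps above can also be scripted. The Python sketch below sets the key file to mode 400 if it is not already and assembles the same ssh command line as a list of arguments. The function names are hypothetical, the actian login is the one given in this guide, and the sketch only builds the command; it does not open a connection.

```python
import os
import stat


def ensure_private_key_mode(path):
    """Set the key file to mode 400 (owner read-only) if needed.

    Returns the resulting permission bits.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o400:
        os.chmod(path, 0o400)
    return stat.S_IMODE(os.stat(path).st_mode)


def build_ssh_command(key_path, public_dns, user="actian"):
    """Return the argument list for: ssh -i <key> <user>@<public_dns>."""
    return ["ssh", "-i", key_path, f"{user}@{public_dns}"]


print(" ".join(build_ssh_command(
    "my-key-pair.pem", "ec2-198-51-100-1.compute-1.amazonaws.com")))
```

The returned list can be passed to subprocess.run to start an interactive session from a script.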

1. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html

Run Queries with Actian Vector CLI

The Vector instance comes pre-installed with the ontimedb demonstration database, which is loaded with actual airline flight data from the U.S. Bureau of Transportation from 1987 to the present (175 million rows). This first query returns the number of rows in the ontime table of ontimedb.

Important! The root volume of an Amazon AMI is based on an Elastic Block Store (EBS) snapshot, a standard practice when creating AMIs. These snapshots are stored on S3 and, as you access blocks on the volume, they are slowly loaded from S3 into EBS and served to the EC2 instance. This means that the first access to blocks of existing files after launching an EC2 instance from an AMI boot can be very slow. However, all subsequent accesses (even between instance reboots) are much faster.

This directly affects the sample queries against the ontimedb database that is included in the AMI, as the disk blocks for the ontimedb database files must be read from the S3 snapshot the first time the query accesses them. Although the sample queries that you will run in the following sections will be slow the first time, all subsequent queries will be fast and will continue to be fast through the EC2 instance restarts. For more information, see the AWS documentation at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html.
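One way to avoid a slow first query is to read the affected files once beforehand, so that their blocks are fetched from the snapshot ahead of time. The sketch below is an illustration only: prewarm_dir is a hypothetical helper, and the directory you pass to it (wherever the ontimedb data files reside on your instance) is an assumption that this guide does not specify. The AWS page referenced above describes how to initialize a whole volume instead.

```shell
#!/bin/sh
# prewarm_dir DIR (hypothetical helper): read every regular file under DIR
# once so that its blocks are loaded before the first query touches them.
prewarm_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    # Discard the data; only the read itself matters.
    find "$dir" -type f -exec cat {} + > /dev/null
    # Report how many files were read.
    find "$dir" -type f | wc -l | tr -d ' '
}

# Example (the path is a placeholder, not a documented location):
#   prewarm_dir /path/to/ontimedb/data
```

The user running the function needs read access to the files; expect this pass to be as slow as the first query would have been.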

To run the sample query

1. Start the Vector CLI. (See Start the Vector Command Line Interface on page 24.)
2. Start the sql tool by entering the following command at the $ prompt:

sql ontimedb

3. Specify that the following query be timed. Enter the following command at the * prompt:

\rt

4. Enter the following SQL command (copy and paste):

SELECT count(*) FROM ontime\g

Results are displayed.

5. Check the query execution time at the bottom of the results display.
6. Enter the following query to find the cities with the most flights from 2012:

SELECT TOP 10 origincityname as City, count(*) as Num_of_flights FROM ontime WHERE Year = 2012 GROUP BY origincityname ORDER BY Num_of_flights DESC;\g


You may run more sample queries from More Sample Queries on page 35.

Load Sample Data Using the Vector CLI

Note: To create a database and load it with data in Vector Community Edition, which has a 100 GB size limit, you must first delete the provided ontimedb database. For more information, see Remove the Sample Databases on page 33.

So far, you have been able to query data that was made available to you in the ontimedb database. In this exercise, you will create a database, create tables in that database, load data into Vector, and generate statistics. At the end of the exercise, you will have loaded the same amount of data (approximately 175 million rows) that you ran your queries against earlier.

Create a Database

In this exercise, you will create a database named demodb.

To create a database

Enter the following command:

createdb demodb

The demodb database is created.

Create a Table

In this exercise, you will create two tables in the demodb database: ontime and carriers. You will create the tables using copy-and-paste query commands.

To create the tables

1. If it is not already started, start the sql tool by entering the following command at the $ prompt:

sql demodb

2. Copy and paste the following SQL command at the * prompt:

create table ontime(
year integer not null, quarter i1 not null, month i1 not null, dayofmonth i1 not null,
dayofweek i1 not null, flightdate ansidate not null,
uniquecarrier char(7) not null, airlineid integer not null, carrier char(2) default NULL,
tailnum varchar(50) default NULL, flightnum varchar(10) not null,
originairportid integer default NULL, originairportseqid integer default NULL, origincitymarketid integer default NULL,
origin char(5) default NULL, origincityname varchar(35) not null,
originstate char(2) default NULL, originstatefips varchar(10) default NULL, originstatename varchar(46) default NULL, originwac integer default NULL,
destairportid integer default NULL, destairportseqid integer default NULL, destcitymarketid integer default NULL,
dest char(5) default NULL, destcityname varchar(35) not null,
deststate char(2) default NULL, deststatefips varchar(10) default NULL, deststatename varchar(46) default NULL, destwac integer default NULL,
crsdeptime integer default NULL, deptime integer default NULL, depdelay integer default NULL, depdelayminutes integer default NULL,
depdel15 integer default NULL, departuredelaygroups integer default NULL, deptimeblk varchar(9) default NULL,
taxiout integer default NULL, wheelsoff varchar(10) default NULL, wheelson varchar(10) default NULL, taxiin integer default NULL,
crsarrtime integer default NULL, arrtime integer default NULL, arrdelay integer default NULL, arrdelayminutes integer default NULL,
arrdel15 integer default NULL, arrivaldelaygroups integer default NULL, arrtimeblk varchar(9) default NULL,
cancelled i1 default NULL, cancellationcode char(1) default NULL, diverted i1 default NULL,
crselapsedtime integer default NULL, actualelapsedtime integer default NULL, airtime integer default NULL,
flights integer default NULL, distance integer default NULL, distancegroup i1 default NULL,
carrierdelay integer default NULL, weatherdelay integer default NULL, nasdelay integer default NULL, securitydelay integer default NULL, lateaircraftdelay integer default NULL,
firstdeptime varchar(10) default NULL, totaladdgtime varchar(10) default NULL, longestaddgtime varchar(10) default NULL,
divairportlandings varchar(10) default NULL, divreacheddest varchar(10) default NULL, divactualelapsedtime varchar(10) default NULL, divarrdelay varchar(10) default NULL, divdistance varchar(10) default NULL,
div1airport varchar(10) default NULL, div1airportid integer default NULL, div1airportseqid integer default NULL, div1wheelson varchar(10) default NULL,
div1totalgtime varchar(10) default NULL, div1longestgtime varchar(10) default NULL, div1wheelsoff varchar(10) default NULL, div1tailnum varchar(10) default NULL,
div2airport varchar(10) default NULL, div2airportid integer default NULL, div2airportseqid integer default NULL, div2wheelson varchar(10) default NULL,
div2totalgtime varchar(10) default NULL, div2longestgtime varchar(10) default NULL, div2wheelsoff varchar(10) default NULL, div2tailnum varchar(10) default NULL,
div3airport varchar(10) default NULL, div3airportid integer default NULL, div3airportseqid integer default NULL, div3wheelson varchar(10) default NULL,
div3totalgtime varchar(10) default NULL, div3longestgtime varchar(10) default NULL, div3wheelsoff varchar(10) default NULL, div3tailnum varchar(10) default NULL,
div4airport varchar(10) default NULL, div4airportid integer default NULL, div4airportseqid integer default NULL, div4wheelson varchar(10) default NULL,
div4totalgtime varchar(10) default NULL, div4longestgtime varchar(10) default NULL, div4wheelsoff varchar(10) default NULL, div4tailnum varchar(10) default NULL,
div5airport varchar(10) default NULL, div5airportid integer default NULL, div5airportseqid integer default NULL, div5wheelson varchar(10) default NULL,
div5totalgtime varchar(10) default NULL, div5longestgtime varchar(10) default NULL, div5wheelsoff varchar(10) default NULL, div5tailnum varchar(10) default NULL
) \g

3. Press Enter. The table is created.

4. Create the carriers table and insert data into it:

create table carriers(ccode char(2) collate ucs_basic, carrier char(25) collate ucs_basic )\g

INSERT INTO carriers VALUES
('AS','Alaska Airlines (AS)'),
('AA','American Airlines (AA)'),
('DL','Delta Air Lines (DL)'),
('EV','ExpressJet Airlines (EV)'),
('F9','Frontier Airlines (F9)'),
('HA','Hawaiian Airlines (HA)'),
('B6','JetBlue Airways (B6)'),
('OO','SkyWest Airlines (OO)'),
('WN','Southwest Airlines (WN)'),
('NK','Spirit Airlines (NK)'),
('UA','United Airlines (UA)'),
('VX','Virgin America (VX)')\g

5. Quit the terminal monitor: type \q and press Enter. Your SQL statement(s) have been committed. You may now load data into the ontime table.

Load Data

In this exercise, you will load airline flight data into the empty ontime table you created in the previous exercise. The bulk loader command for Vector is called vwload. This command can load up to 500,000 rows per second, depending on the speed of your hard drive. This exercise showcases a couple of vwload options to load part of the sample data and then points you to a handy script that will load all the data for you using one command. The ontimefiles.zip file contains CSV files with airline flight data from 1987 to the present and must be unzipped before loading the data.

To unzip and load the data files

1. In the CLI window, change to the directory where ontimefiles.zip is located and unzip the file:

cd /Vector/sample_data
unzip ontimefiles.zip

Note: This could take a while; the uncompressed data is about 75 GB.

2. Load a single CSV file with vwload: Copy and paste the following command to load a subset of airline data for January 1988 into the ontime table of the demodb database:

vwload -H -f , -q '"' -I -t ontime demodb On_Time_On_Time_Performance_1988_1.csv

The command line uses the following options:

-H Indicates that the data files contain a header

-f Specifies the field separator (a comma)

-q Specifies the quote character, enclosed within single quotes

-I Specifies that one extra field in the data files should be ignored

-t Specifies the target table (ontime); the database name (demodb) and the data file names follow

The vwload command is executed, and the data in the file On_Time_On_Time_Performance_1988_1.csv is loaded into the ontime table of demodb. For more information about vwload options, see the vwload command.

3. Load multiple CSV files with vwload: To speed data loading, engage multiple cores on your machine using the -c option, which enables parallelization. To show how this works, load all data files for the year 1989. Copy and paste the following command to load the data files into the ontime table of the demodb database:

vwload -H -f , -q '"' -I -c -t ontime demodb On_Time_On_Time_Performance_1989_*.csv

The vwload command is executed, and the data in the 12 files for 1989 is loaded into the ontime table of demodb.

4. Delete the data loaded in previous steps to prepare for the rest of this exercise. Enter the following command:

echo "delete from ontime\g" | sql demodb

5. Verify that the ontime table is empty:

echo "select count(*) from ontime\g" | sql demodb

Now that you know how to load single and multiple files, you can use the parallel load option to load the complete 75-GB data set into Vector.

6. Load the complete data set: The loadall.sh script, which is included in ontimefiles.zip, loads all the years of data using the -c option. Run the load script:

./loadall.sh demodb

The entire data set of about 175 million rows is loaded into Vector.


Note: This can take up to 15 minutes, depending on your machine type. The more cores you have, the faster the data will load. Generally, when using the -c option, the number of input files to the vwload command should equal the number of physical cores (not virtual cores) on the machine.
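To compare the number of input files with the number of physical cores, you can count both before running vwload with -c. The following is a minimal sketch for Linux only: count_physical_cores and compare_files_to_cores are hypothetical helper names, and the sketch assumes that the lscpu utility is installed on the instance.

```shell
#!/bin/sh
# count_physical_cores (hypothetical helper): read `lscpu -p=Core,Socket`
# output on stdin and count the unique core/socket pairs. Hyperthreads that
# share a core produce identical pairs, so they are counted once.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# compare_files_to_cores FILES CORES (hypothetical helper): report how the
# number of input files relates to the number of physical cores.
compare_files_to_cores() {
    if [ "$1" -eq "$2" ]; then
        echo "match"
    elif [ "$1" -lt "$2" ]; then
        echo "fewer files than cores"
    else
        echo "more files than cores"
    fi
}

# Example:
#   cores=$(lscpu -p=Core,Socket | count_physical_cores)
#   files=$(ls On_Time_On_Time_Performance_1989_*.csv | wc -l | tr -d ' ')
#   compare_files_to_cores "$files" "$cores"
```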

To display the number of rows in the ontime table

1. At the $ prompt, enter:

echo "select count(*) from ontime\g" | sql demodb

2. Verify that the table contains over 175 million rows. You may now optimize the data.

Optimize the Data by Generating Statistics

Vector has a sophisticated cost-based optimizer that uses statistics to optimize queries. We recommend that you collect statistics upon initially loading data and after any subsequent changes to the data. There are several ways to collect statistics, but for this exercise you will use the command-line utility, optimizedb.

To generate statistics

Enter the following at the $ prompt:

optimizedb demodb

Using vector processing, statistics are generated for both the ontime and carriers tables. For more information about optimizedb, see the optimizedb command.

Run More Queries

After loading and optimizing data, run some of the queries against demodb using Director or the Vector CLI. See More Sample Queries on page 35. Check execution time to see how fast they run.

Note: Vector will automatically tune itself to the number of cores and memory on the machine and will use more resources where it can to scale. You may shut down the EC2 instance and resize it to a larger number of cores and check the query times again.


Remove the Sample Databases

To free up disk space and reduce the amount of data that contributes towards the allowed data quota for Vector Community Edition, you may remove the sample databases using the destroydb command:

destroydb ontimedb
destroydb demodb

More About Actian Vector

You can learn more about Actian Vector online at http://docs.actian.com/vector/5.1/index.html.

A. More Sample Queries

The following table contains additional queries you can run against the ontimedb or demodb databases. Copy the commands from the SQL column and paste them into the CLI or a Director query document. End each command with \g.

Description: Count of all the rows
SQL: SELECT count(*) FROM ontime\g

Description: Number of flights per year
SQL: SELECT year,count(*) as c1 from ontime group by YEAR;\g

Description: Percentage of delays for each carrier for 2016
SQL: SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2016 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2016 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;\g

Description: Percent of flights delayed more than 10 minutes per year
SQL: SELECT t.year, c1/c2 FROM (select year,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) t JOIN (select year,count(*) as c2 from ontime GROUP BY year) t2 ON (t.year=t2.year);\g

Description: Count of delays per airport for years 2010–2016
SQL: SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2010 AND 2016 GROUP BY Origin ORDER BY c DESC fetch first 10 rows only;\g

Description: Cities with the most flights in 2016
SQL: SELECT top 10 origincityname as City, count(*) as Num_of_flights FROM ontime WHERE Year = 2016 GROUP BY origincityname ORDER BY Num_of_flights DESC;\g

Description: Top 10 flights with biggest average delays (a flight is defined by the carrier, departure airport, and destination)
SQL: SELECT c.carrier, origin + ' (' + origincityname + ')' AS origin, dest + ' (' + destcityname + ')' AS dest, avg(arrdelayminutes) AS TotalArrivalDelay, avg(carrierdelay) as carrierdelay, avg(weatherdelay) AS weatherdelay, avg(nasdelay) AS nasdelay, avg(SecurityDelay) SecurityDelay, avg(LateAircraftDelay) LateAircraftDelay FROM ontime o LEFT JOIN carriers c ON o.carrier=c.ccode WHERE o.ArrDelay > 10 AND o.Cancelled <> 1 AND o.year = 2016 GROUP BY 1,2,3 ORDER BY 4 DESC FETCH FIRST 50 ROWS ONLY;\g

Description: Percentage of a flight time when the airplane left from the gate but is not in flight vs. total duration, per airline in 2016
SQL: SELECT o.carrier, c.carrier, 1- avg(airtime)/ avg(ActualElapsedTime) AS pct_time_not_in_flight FROM ontime o LEFT JOIN carriers c ON o.carrier=c.ccode where year=2016 group by 1,2 order by 3 DESC;\g

Description: What cities are served by the most other cities
SQL: SELECT top 10 destcityname, count(distinct origincityname) from ontime where year = 2016 group by 1 order by 2 desc;\g

Description: Top 10 markets with most destinations
SQL: SELECT ct.origincitymarketid , LISTAGG(DISTINCT c.origincityname, '; ') WITHIN GROUP (ORDER BY c.origincityname) as originarea,count(*) AS num_of_destinations FROM ( SELECT origincitymarketid, destcitymarketid FROM ontime WHERE year = 2016 GROUP BY 1,2) ct JOIN ( SELECT origincitymarketid, origincityname FROM ontime GROUP BY 1,2) c ON c.origincitymarketid = ct.origincitymarketid GROUP BY 1 ORDER BY 3 DESC FETCH FIRST 10 ROWS ONLY;\g

Description: Days of week with most total flights over several years
SQL: SELECT DayOfWeek, count(*) AS c FROM ontime WHERE Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;\g

Description: Days of week with most delays over several years
SQL: SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC;\g

Description: Carrier with most delays in one year
SQL: SELECT carrier, count(*) FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier ORDER BY 2 DESC;\g

Description: Carrier with the most percentage delays in one year
SQL: WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;\g

Description: Carrier with the most percentage delays in multiple years
SQL: WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year between 2000 and 2008 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year between 2000 and 2008 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;\g

Description: Running total of flights over the years by carrier
SQL: WITH t AS (SELECT carrier, year,count(*) cnt FROM ontime GROUP BY carrier, year) SELECT carrier, year, cnt, sum(cnt) OVER (PARTITION BY carrier ORDER BY carrier, year ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS "Running Total" FROM t FETCH FIRST 10 ROWS ONLY;\g

B. Configuring Actian Vector Enterprise Edition on AWS Marketplace

This section contains the following topics:
1. Choose an Instance Type
2. Choose an Instance Size
3. Select a Price Plan

You configure Actian Vector Enterprise Edition for the AWS Marketplace in three steps:

1. Choose an instance type.
2. Choose an instance size.
3. Select a price plan.

1. Choose an Instance Type

Decide on the instance type (a mix of computing power and storage) you want for your use case based on your workload.

General purpose instances

Based on Elastic Block Store (EBS) storage, these instances provide additional flexibility to start up and shut down instances or move to another instance type when you need different resource levels.
• Economy – Most economical general-purpose server configuration; use for basic query workloads and development environments.
• Economy Sport – For memory-optimized server configurations with more memory per core to accelerate queries against larger data sets.

Storage-optimized instances
• Enterprise – Adds hard disk storage locally to accelerate access to more frequently used data instead of waiting for EBS data transfers.
• Enterprise Capacity – Optimizes local storage and memory capacity for putting more of the database onto local hard disks to accelerate performance and maximize availability for workloads with more updates.
• Enterprise Sport – Faster local storage (SSDs) with big memory to provide maximum performance for heavy workloads (more users or more complex queries covering larger data sets).

For more information about how to configure Vector to use additional EBS or Instance Store volumes, see Configuring Storage for Vector on AWS on page 41.

2. Choose an Instance Size

Select an instance size based on the instance type you chose in step 1. Take into account:
• Number of users
• Complexity of queries
• Size of the database

As a guideline, allow 2 cores per simultaneous query for simple reports and development workloads, 4–8 cores per query for average workloads, and up to 32 cores per query for complex query workloads.

The following sizes are available across the Economy, Economy Sport, Enterprise, Enterprise Capacity, and Enterprise Sport instance types:

Small (2–5 active users*)
• M4.2xlarge: 4 cores, 32 GB, EBS
• R4.2xlarge: 4 cores, 61 GB RAM, EBS

Medium (4–10 users)
• M4.4xlarge: 8 cores, 64 GB, EBS
• R4.4xlarge: 8 cores, 122 GB, EBS
• H1.4xlarge: 8 cores, 64 GB, 2x2000 HDD
• D2.4xlarge: 8 cores, 122 GB, 12x2000 HDD

Large (8–20 users)
• R4.8xlarge: 16 cores, 244 GB, EBS
• H1.8xlarge: 16 cores, 128 GB, 4x2000 HDD
• D2.8xlarge: 18 cores, 244 GB, 24x2000 HDD
• I3.8xlarge: 16 cores, 244 GB, 4x1900 SSD

XLarge (16–40 users)
• M4.10xlarge: 20 cores, 160 GB, EBS
• R4.16xlarge: 32 cores, 488 GB, EBS
• H1.16xlarge: 32 cores, 256 GB, 8x2000 HDD
• I3.16xlarge: 32 cores, 488 GB, 8x1900 SSD
• X1.16xlarge: 32 cores, 976 GB, 1x1920 SSD

XXLarge (32–80 users)
• M4.16xlarge: 32 cores, 256 GB, EBS
• X1.32xlarge: 64 cores, 1952 GB, 2x1920 SSD

* The definition of users can vary widely, but the range here is intended to indicate connected users actively submitting queries against the database.
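The per-query core guidance in this step can be turned into a rough estimate. The sketch below is an illustration only: cores_estimate is a hypothetical helper, the workload class names simple, average, and complex are labels chosen here, and multiplying the per-query figures by the number of simultaneous queries is an assumption about how the guidance scales.

```shell
#!/bin/sh
# cores_estimate CLASS QUERIES (hypothetical helper): estimate the cores
# needed for QUERIES simultaneous queries of the given workload class.
#   simple  - 2 cores per query (simple reports, development)
#   average - 4 to 8 cores per query
#   complex - up to 32 cores per query
cores_estimate() {
    class="$1"
    queries="$2"
    case "$class" in
        simple)  echo "$(( queries * 2 ))" ;;
        average) echo "$(( queries * 4 ))-$(( queries * 8 ))" ;;
        complex) echo "up to $(( queries * 32 ))" ;;
        *)       echo "unknown workload class: $class" >&2; return 1 ;;
    esac
}

# Example: cores_estimate average 4
```

Treat the result as a starting point for picking a size, then adjust for database size and how many users actually query at the same time.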

Examples:

Example 1
• Users and needs: 4 users; basic queries; 1 TB database; cost sensitive
• Configuration: General purpose; 8 cores; EBS storage
• Instance type: Small Economy; M4.4xlarge instance

Example 2
• Users and needs: 12 users; ad hoc queries; performance sensitive; 5 TB database; always available
• Configuration: 16 cores; local HDD storage
• Instance type: Large Enterprise; H1.8xlarge instance

Example 3
• Users and needs: 20 users; mixed queries; performance sensitive; 10 TB database; always available
• Configuration: Large local HDD storage; 18 cores
• Instance type: Large Enterprise; D2.8xlarge instance

Example 4
• Users and needs: 40 users; complex queries; maximum performance; 10 TB database; 24x7 availability
• Configuration: Fast storage; 64 cores; big memory
• Instance type: XXL Enterprise Sport; X1.32xlarge instance

3. Select a Price Plan

After choosing the instance type and instance size, review and select the appropriate pricing plan. The AWS Marketplace Console provides a breakdown of software and hardware costs based on your selection. In addition, there are hourly and annual subscription plans. Choosing an annual plan gives you a discount. More information about annual subscriptions is at https://aws.amazon.com/marketplace/help/buyer-annual-subscription.

C. Configuring Storage for Vector on AWS

This section contains the following topics:
• AWS EC2 Storage Concepts and Options
• Tuning Volume Layout for Performance
• Configuring Vector to Use the Newly Set Up Disks

This section introduces some AWS concepts about block storage. It explains what you must do to configure Vector to use these storage options to get the right storage configuration for your use case.

AWS EC2 Storage Concepts and Options

The Vector AMI comes with a default 150-GB volume. This is enough space to get started running and loading data into the demo database. However, for larger datasets or for production workloads, you may want to customize your configuration, for example, by increasing the size of the root volume or configuring multiple volumes in a standard RAID configuration to maximize performance.

AWS EC2 Root Device Volume

When an EC2 instance is launched from an AMI, the root device volume contains the image used to boot the instance: mainly the operating system, all the configured services, and applications. This volume can be backed by either EBS or Instance Store (both are explained in the following sections) and can be configured when the AMI is created. The Vector AMI is backed by EBS. This means that when you launch an EC2 instance from the Vector AMI, you will have at least one EBS volume attached to the instance, which will be the root volume. It will contain the operating system, a configured and ready-to-use Vector installation, sample data, and the sample database. You will see this root volume when you launch the instance. It will have a size of 150 GB.

Storage Options

AWS provides a variety of storage options that can be used with EC2 (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Storage.html). Vector expects a block-level storage volume, so only the Amazon EC2 Instance Store (hereafter called Instance Store) and the Amazon Elastic Block Store (hereafter called EBS) volumes are discussed here.

Instance Store

Instance Store provides temporary block-level storage to an EC2 instance. It consists of one or more instance store volumes exposed as block devices. The data in an instance store persists only during the lifetime of the instance. For more information, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html. Instance stores are available only for certain EC2 instance types, such as storage-optimized instances (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html). If your workloads have very high I/O demands and you expect to run your instance 24x7 or bring up new instances on demand and reload data, then this may be an option for you.

EBS

EBS provides durable storage volumes that can be attached to an EC2 instance, either when the instance is launched from an AMI or while it is running. EBS volumes can be resized, and come in various types that differ in performance characteristics and pricing (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html). All EC2 instances can work with EBS volumes. If you want the flexibility of being able to stop, start, or resize EC2 instances without losing your data, then EBS volumes may be the choice for you. They provide lower I/O performance than instance store volumes but have additional flexibility and higher durability.

Tuning Volume Layout for Performance

After determining the right type of storage volume for your use case, the next step is to decide which volume type to use.

Instance Store Volumes

If you have decided to use instance store volumes, you may choose between SSDs and HDDs. Storage-optimized instances provide HDD- and SSD-based storage options, and Vector works well with both. HDDs provide denser storage per node and are slightly less costly per GB than equivalent SSD options.

To add instance store volumes to your instance, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/add-instance-store-volumes.html#adding-instance-storage-instance. After you add the volumes to your instance, they are available as devices to the operating system, but you must format and mount them. For more information about how to do this for EBS volumes, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html.

To get the best performance from instance store volumes, you may want to create a RAID array, which, depending on the RAID level, can provide redundancy as well as increased performance. For instructions on how to create a RAID array, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html#linux-raid. Although that documentation is meant for EBS volumes, it applies to instance store volumes as well. It explains how to create the array from the devices that represent the volumes, format them, and mount them.

After you set up and mount the array, it appears to Vector as a standard mounted device with a file system. You may proceed with Vector configuration.

EBS Volumes

EBS volumes are similar to instance stores but offer multiple options for HDD and SSD types. In addition, performance limits are tied to each type, and there is a maximum performance for each volume type that is related to the EC2 instance type. A good overview of the various EBS options is available at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html. Detailed limits for volumes based on instance type are at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html.

After you add the EBS volumes to your instance (something you can do at launch time or while the instance is running), you must format them and make them available before they can be used (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html). We have found that st1 volumes generally offer a good performance/cost tradeoff.

Note: There are limits to performance per volume. For example, a single st1 volume of size 2 TiB can deliver a maximum burst performance of 500 MiB/second. However, the EC2 instance itself can cap EBS performance (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html). For example, if you use a 2-TiB st1 volume on an m4.4xlarge instance, you will get only 250 MiB/second since the EC2 EBS performance overrides the EBS volume performance. However, if you choose an m4.10xlarge EC2 instance, you will get the full burst performance of 500 MiB/second from EBS since an m4.10xlarge has a cap of 500 MiB/second for EBS throughput. If you choose a larger instance, for example, an m4.16xlarge, which has an EBS maximum throughput of 1250 MiB/second, and would like to achieve that throughput, you could use 3 x 2 TiB EBS volumes for this instance in a RAID configuration (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html). This theoretically would give you a burst throughput of 1500 MiB/second (3 x 500), and the EC2 cap effectively would result in 1250 MiB/second.
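The arithmetic in this note reduces to taking the smaller of two numbers: the combined burst throughput of the volumes and the EBS cap of the instance. The following minimal sketch reproduces the three cases above; effective_mibps is a hypothetical helper, and it assumes that striped volumes add up linearly, which is the theoretical best case.

```shell
#!/bin/sh
# effective_mibps VOLUMES PER_VOLUME_MIBPS INSTANCE_CAP_MIBPS
# (hypothetical helper): print the lower of the combined volume throughput
# and the instance's EBS throughput cap, in MiB/second.
effective_mibps() {
    total=$(( $1 * $2 ))
    if [ "$total" -lt "$3" ]; then
        echo "$total"
    else
        echo "$3"
    fi
}

effective_mibps 1 500 250    # one 2-TiB st1 volume on m4.4xlarge
effective_mibps 1 500 500    # one 2-TiB st1 volume on m4.10xlarge
effective_mibps 3 500 1250   # three 2-TiB st1 volumes on m4.16xlarge
```

The three calls print 250, 500, and 1250, matching the figures in the note.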

Configuring Vector to Use the Newly Set Up Disks

After you have set up your storage and mounted the appropriate device (which could be a single volume or multiple volumes making up a RAID array), you can create a database that uses this storage by following this simple process:
1. Define a location in Vector that points to the new disk or directory representing the storage you have set up.
2. Create a database to use this defined location.

For example, let’s assume that we have mounted our device (backed by EBS or instance store volumes) on the /mnt/disk1 mount point, which we want to use for a new database, proddb. Create the location and the database by using one of the following procedures.

To use Terminal Monitor and the Command-Line tool to create the location and the new database

1. Connect to the system database iidbdb using the SQL terminal monitor and execute the following command:

CREATE LOCATION myloc1 WITH AREA = '/mnt/disk1/db_location', USAGE = (ALL)\go

This creates the directory db_location with an appropriate directory structure underneath.

Note: The user actian must have write privileges for /mnt/disk1; otherwise, this command will fail.

2. Create a database with the createdb command-line tool and use the myloc1 location (db_location) for that database:

createdb proddb -dmyloc1
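The two steps above can also be driven from a shell script by piping the statement into the terminal monitor, in the same way that earlier exercises pipe queries into sql. The following is a minimal sketch under stated assumptions: location_sql is a hypothetical helper that only builds the statement text, and the commented-out lines assume that the script runs as the actian user on the Vector instance.

```shell
#!/bin/sh
# location_sql NAME AREA (hypothetical helper): print the CREATE LOCATION
# statement for the given location name and directory, terminated with \g.
location_sql() {
    printf '%s\n' "CREATE LOCATION $1 WITH AREA = '$2', USAGE = (ALL)\\g"
}

# Show the statement that would be sent to the terminal monitor.
location_sql myloc1 /mnt/disk1/db_location

# To apply it on the instance, uncomment the following lines:
#   [ -w /mnt/disk1 ] || echo "actian cannot write to /mnt/disk1" >&2
#   location_sql myloc1 /mnt/disk1/db_location | sql iidbdb
#   createdb proddb -dmyloc1
```

Printing the statement first makes it easy to review the location name and path before anything is created.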


To use Actian Director to create the location and the new database

1. Connect to the instance and expand the tree underneath. Right-click on the Locations folder and select New Location….

The New Location dialog box opens.

2. Enter the Location Name myloc1 and the Location Area /mnt/disk1/db_location and check all the usage checkboxes.
3. Click OK to create the location.
4. Right-click on the Databases folder and select New Database….

The New Database dialog opens.

5. Enter the new database name proddb in the Name field.
6. Click the Locations link and select your new location myloc1 for each of the locations.
7. Click OK to close the dialog box. Your new database is created.

For more information about locations, see the Vector documentation at http://docs.actian.com/vector/5.0/index.html#page/User%2F11._Using_Alternate_Locations.htm%23.

D. Migrating Vector Databases Between AMIs

This section contains the following topics: Database Ownership...... 47 Objects NOT Copied ...... 47 Copying a Database Using the unloaddb Utility ...... 48

This section describes how to copy a database from one Vector AMI to another. You may want to do this if, for example, you upgrade to a new version of Vector in the AWS Cloud. The upgrade entails copying your databases from the previous version of the AMI to the new version.

Database Ownership

Unless you specifically destroy the original database, you will end up with two identical databases having the same name and owner, in two different Vector AMIs. If you want to change ownership of the databases at the same time, you can do so by editing the reload.ing script and changing the user name that follows the "-u" flag.
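As a rough sketch of the edit described above, the user name after the "-u" flag could be changed with sed rather than by hand. The helper name change_reload_owner is hypothetical. The sketch also assumes that the owner appears as "-u" immediately followed by the user name and then whitespace. Inspect your generated reload.ing first, because its exact layout is not shown in this guide and may differ.

```shell
#!/bin/sh
# Sketch only: change the user name that follows "-u" in a reload script.
# ASSUMPTION: the owner is written as "-u<name>" followed by whitespace.
# A backup copy (<script>.bak) is kept so the original can be restored.
#   $1 = current owner, $2 = new owner, $3 = path to the script
change_reload_owner() {
    old_owner=$1
    new_owner=$2
    script=$3
    cp "$script" "$script.bak" || return 1
    # Rewrite from the backup into the original file, which keeps the
    # original file's permissions.
    sed "s/-u${old_owner}\([[:space:]]\)/-u${new_owner}\1/g" \
        "$script.bak" > "$script"
}
```

After running, for example, change_reload_owner olduser newuser reload.ing, compare reload.ing with reload.ing.bak to confirm that only the intended occurrences changed.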

Objects NOT Copied

Objects that will NOT be copied (because they are not stored in your database) include:
• Object code: This generally is not a problem because applications can be recompiled on the target machine.
• Source code.
• Other compiled objects: This includes compiled forms or any objects referenced in OSL by the keywords "call subsystem".

This note will offer some suggestions for keeping track of OSL source files when moving databases. See your query language reference manuals for details of the copyapp "-s" flag for moving source files that are not stored in the database.

Copying a Database Using the unloaddb Utility

This procedure assumes that you are moving a user database, as opposed to the iidbdb (master database).

To copy a database using unloaddb

1. Log in as the owner of the database you want to move.

2. Ensure that your environment is set up to run the installation FROM which you want to copy the database.

3. Create a temporary working directory, since the next few steps will create a number of files. It is convenient to choose a directory path that you can duplicate on the target AMI because the “copy to” and “copy from” scripts created by the unloaddb utility contain explicit references to the directory path from which you execute unloaddb. Therefore, you might want to create a subdirectory under /tmp (since this directory will exist on the target AMI). Example:

$ mkdir /tmp/my_work_dir

$ cd /tmp/my_work_dir

4. Unload the database:

a. Execute unloaddb. It does not actually unload the database but creates two script files (unload.ing and reload.ing):

$ unloaddb -c dbname

b. Unload the database by executing the script file unload.ing:

$ chmod 777 unload.ing

$ ./unload.ing

Because unload.ing copies out all your tables into files, it can take some time to run, depending on the number and size of the tables in your database.

Note: The "-c" flag copies out files in portable ASCII format. You need this flag unless you are copying the database to another installation on the same machine or to a binary-compatible machine. If you are not sure, use the "-c" flag; however, using the "-c" flag affects the value of floating-point numbers. Precision of formatted character output of floats is controlled with the "-f" flag.

5. Copy the entire contents of the unloaddb directory and the local source code directories to the corresponding working directories on the new AMI.

If the target machine shares a network with the source machine, you can use FTP or the UNIX scp command to copy the directories to the target node. The following example uses commands common to many UNIX machines.

Example: Copy files between AMIs

The temporary working directory for the database unload files is /tmp/movedir on the first AMI. We will move the database to a machine called “ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com”. The remote copy command should look like this:

$ scp -r -i keyfile.pem /tmp/movedir username@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:/tmp

6. When the unloaded files have been successfully transferred, move to the target installation and change your environment so that you are properly set up to run the installation INTO which you would like to copy the database. Your unloaddb files are now in a temporary directory of the same name as the one from which you unloaded on the source machine (for example, /tmp/my_work_dir). Likewise, your source code files are now located on the target machine in a directory of your choice.

7. Create the database in the new installation:

$ createdb database_name

Note: You may create the database with another name, but if you do this, you must also change the database name in the reload.ing script.

8. Ensure that you are in the temporary directory containing the transferred unloaddb files (for example, /tmp/my_work_dir).

9. Reload the database. This is the most time-consuming step, since all tables will be loaded and modified. You may also want to sysmod the database after reloading it:

$ chmod 777 reload.ing

$ ./reload.ing

$ sysmod database_name

You now have two databases in two installations. Unless you created your new database with a new name, you now have two separate databases with the same name and owner. If it is not important to keep both databases, we recommend that you destroy the original. Remember that updates to the database in one installation will not be reflected in the other.

Your move is now complete.
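The source-side steps (3 and 4) and target-side steps (7 through 9) of this procedure can be collected into two shell functions, sketched below. The function names unload_database and reload_database are hypothetical. The sketch assumes that unloaddb, createdb, and sysmod are on the PATH and that the environment already points at the correct installation on each machine. Copying the files between AMIs (step 5) is left out and is still done separately, for example with scp.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not Vector tools.
# Each function body runs in a subshell, so the caller's working
# directory is left unchanged.

# Source AMI: create the working directory, generate the scripts, unload.
#   $1 = database name, $2 = working directory (same path on both AMIs)
unload_database() (
    db_name=$1
    work_dir=$2
    mkdir -p "$work_dir" || exit 1
    cd "$work_dir" || exit 1
    unloaddb -c "$db_name" || exit 1       # creates unload.ing and reload.ing
    [ -f unload.ing ] && [ -f reload.ing ] || exit 1
    chmod u+x unload.ing reload.ing        # execute permission for the owner is enough
    ./unload.ing                           # copies the tables out to files
)

# Target AMI: create the database, reload it, then run sysmod.
#   $1 = database name, $2 = working directory holding the copied files
reload_database() (
    db_name=$1
    work_dir=$2
    cd "$work_dir" || exit 1
    [ -f reload.ing ] || exit 1            # stop early if the copy is incomplete
    createdb "$db_name" || exit 1
    chmod u+x reload.ing
    ./reload.ing || exit 1                 # loads and modifies all tables
    sysmod "$db_name"
)
```

On the source AMI you would call, for example, unload_database proddb /tmp/my_work_dir, and on the target AMI reload_database proddb /tmp/my_work_dir once the directory has been copied across. Stopping at the first failed step avoids running sysmod against a partially reloaded database.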
