<<

SQLite and sqlite3

Statistical Computing & Programming

Shawn Santo

06-12-20

1 / 41 Supplementary materials

Companion videos

Review and sqlite3 overview Creating tables Join operations Finer details

Additional resources

SQL Tutorial Package nodbi vignette

2 / 41 Recall

3 / 41 Databases

A database is a collection of data typically stored in a computer system. It is controlled by a database management system (DBMS). There may be applications associated with them, such as an API.

Types of DBMS: MySQL, Microsoft Access, Microsoft SQL Server, FileMaker Pro, Oracle Database, and dBASE.

Types of databases: Relational, object-oriented, distributed, NoSQL, graph, and more.

4 / 41 management system

A system that governs a relational database, where data is identified and accessed in relation to other data in the database.

Relational databases generally organize data into tables comprised of fields and records.

Many RDBMS use SQL to access data.

5 / 41 SQL

SQL stands for Structured Query Language

It is an American National Standards Institute standard computer language for accessing and manipulating RDBMS.

There are different versions of SQL, but to be compliant with the American National Standards Institute the version must support the key query verbs.

6 / 41 Big picture

Source: https://www.w3resource.com/sql/tutorials.php 7 / 41 Common SQL query structure

Main verbs to get data:

SELECT columns or computations FROM WHERE condition GROUP BY columns HAVING condition ORDER BY [ASC | DESC] LIMIT offset, count

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Primary computations: MIN, MAX, COUNT, SUM, AVG.

We can perform these queries with dbGetQuery() and paste().

8 / 41 Verb connections

SQL dplyr SELECT () table data frame WHERE filter() pre-aggregation/calculation GROUP_BY group_by() HAVING filter() post-aggregation/calculation ORDER BY arrange() with possibly a desc() LIMIT slice()

9 / 41 SQL arithmetic and comparison operators

SQL supports the standard +, -, *, /, and % (modulo) arithmetic operators and the following comparison operators.

Operator Description = Equal to > Greater than < Less than >= Greater than or equal to <= Less than or equal to <> Not equal to

10 / 41 SQL logical operators

Operator Description ALL TRUE if all of the subquery values meet the condition AND TRUE if all the conditions separated by AND is TRUE ANY TRUE if any of the subquery values meet the condition BETWEEN TRUE if the operand is within the range of comparisons EXISTS TRUE if the subquery returns one or more records IN TRUE if the operand is equal to one of a list of expressions LIKE TRUE if the operand matches a pattern NOT Displays a record if the condition(s) is NOT TRUE OR TRUE if any of the conditions separated by OR is TRUE SOME TRUE if any of the subquery values meet the condition

11 / 41 sqlite3

12 / 41 SQLite and sqlite3

SQLite is a software that provides a relational database management system. The lite in SQLite means light weight in terms of setup, database administration, and required resource.

This is available on the DSS servers. In your terminal

[sms185@numeric1 ~]$ which sqlite3 /usr/bin/sqlite3

Check out

man sqlite3

From the summary:

sqlite3 is a terminal-based front-end to the SQLite library that can evaluate queries interactively and display the results in multiple formats. sqlite3 can also be used within shell scripts and other applications to provide batch processing features.

If you have a MAC, sqlite3 should be available on there as well. 13 / 41 Create a database

First, in RStudio, let's create an object called oil.

oil <- readr::read_table("http://users.stat.ufl.edu/~winner/data/oilimport.dat", col_names = ("year", "month", "month_series", "barrels_purchased", "total_value", "unit_price", "cpi") )

oil

#> # A tibble: 420 x 7 #> year month month_series barrels_purchased total_value unit_price cpi #> #> 1 1973 1 1 102750 282153 2.75 42.6 #> 2 1973 2 2 95276 259675 2.73 42.9 #> 3 1973 3 3 112053 316283 2.82 43.3 #> 4 1973 4 4 98841 279958 2.83 43.6 #> 5 1973 5 5 123102 357042 2.9 43.9 #> 6 1973 6 6 118152 360020 3.05 44.2 #> 7 1973 7 7 101752 320591 3.15 44.3 #> 8 1973 8 8 148219 483085 3.26 45.1 #> 9 1973 9 9 124966 421856 3.38 45.2 #> 10 1973 10 10 134741 476763 3.54 45.6 #> # … with 410 more rows

14 / 41 Next, create a database called mydb.sqlite and add a table oil.

library(DBI) # Database Interface mydb <- dbConnect(RSQLite::SQLite(), "mydb.sqlite") dbWriteTable(mydb, "oil", oil)

Check our database is there.

[sms185@numeric1 ]$ ls mydb.sqlite

Load it with command sqlite3 mydb.sqlite.

[sms185@numeric1 sql]$ sqlite3 mydb.sqlite SQLite version 3.26.0 2018-12-01 12:34:55 Enter ".help" for usage hints. sqlite>

15 / 41 Typing .help at the prompt (sqlite>)

sqlite> .help .archive ... Manage SQL archives .auth ON|OFF Show authorizer callbacks .backup ?DB? FILE Backup DB (default "main") to FILE .bail on|off Stop after hitting an error. Default OFF .binary on|off Turn binary output on or off. Default OFF .cd DIRECTORY Change the working directory to DIRECTORY .changes on|off Show number of rows changed by SQL .check Fail if output since .testcase does not match .clone NEWDB Clone data into NEWDB from the existing database

.... will reveal some of the help features.

16 / 41 Commands sqlite3

1. Query commands: sqlite3 just reads lines of input and passes them on to the SQLite library for execution. This will be the typical command you provide when you want to access, update, and data tables.

2. Dot commands: these are lines that begin with a dot (".") and are intercepted and interpreted by the sqlite3 program itself. These commands are typically used to change the output format of queries, or to execute certain prepackaged query statements.

17 / 41 Navigating sqlite3

To list all names and files of attached databases

sqlite> .databases main: /home/fac/sms185/sql/mydb.sqlite

To list all the tables

sqlite> .tables oil

View the current settings

sqlite> .show echo: off eqp: off explain: auto headers: off mode: list nullvalue: "" output: stdout colseparator: "|" rowseparator: "\n" stats: off width: filename: mydb.sqlite

18 / 41 Table details

To show the CREATE statements matching the specified table

sqlite> .schema oil CREATE TABLE `oil` ( `year` REAL, `month` REAL, `month_series` REAL, `barrels_purchased` REAL, `total_value` REAL, `unit_price` REAL, `cpi` REAL );

Note the ; at the end.

19 / 41 Query

Get the first 5 rows from oil.

sqlite> SELECT * FROM oil ...> LIMIT 5; 1973.0|1.0|1.0|102750.0|282153.0|2.75|42.6 1973.0|2.0|2.0|95276.0|259675.0|2.73|42.9 1973.0|3.0|3.0|112053.0|316283.0|2.82|43.3 1973.0|4.0|4.0|98841.0|279958.0|2.83|43.6 1973.0|5.0|5.0|123102.0|357042.0|2.9|43.9

How about a nicer output? Change the mode and headers settings.

sqlite> .mode column sqlite> .headers on sqlite> SELECT * FROM oil ...> LIMIT 5; year month month_series barrels_purchased total_value unit_price cpi ------1973.0 1.0 1.0 102750.0 282153.0 2.75 42.6 1973.0 2.0 2.0 95276.0 259675.0 2.73 42.9 1973.0 3.0 3.0 112053.0 316283.0 2.82 43.3 1973.0 4.0 4.0 98841.0 279958.0 2.83 43.6 1973.0 5.0 5.0 123102.0 357042.0 2.9 43.9

20 / 41 Examples

For each year, find the month that the maximum number of barrels of oil were purchased.

sqlite> SELECT year, month as MM, MAX(barrels_purchased) as Max_BBLs ...> FROM oil ...> GROUP BY year ...> ORDER BY Max_BBLs DESC ...> LIMIT 10; year MM Max_BBLs ------2004.0 6.0 344729.0 2006.0 8.0 336528.0 2003.0 7.0 335511.0 2005.0 8.0 329039.0 2007.0 3.0 324248.0 2002.0 10.0 315720.0 2000.0 8.0 312955.0 2001.0 10.0 311803.0 1998.0 8.0 300227.0 1999.0 5.0 289457.0

21 / 41 Let's make a similar query but for unit price and only return the relevant columns.

sqlite> SELECT year, MAX(unit_price) as max_unit_price ...> FROM oil ...> GROUP BY year ...> HAVING year BETWEEN 1975 AND 1990; year max_unit_price ------1975.0 12.11 1976.0 12.59 1977.0 13.62 1978.0 13.59 1979.0 24.57 1980.0 33.94 1981.0 36.92 1982.0 35.53 1983.0 32.5 1984.0 28.05 1985.0 27.38 1986.0 25.8 1987.0 18.07 1988.0 22.4 1989.0 17.97 1990.0 30.08

22 / 41 Exercise

Create a query that returns a table that has all the oil information where the average barrels purchased for a year exceeded 200,000.

23 / 41 Creating new tables from existing tables

Create with command CREATE TABLE

sqlite> CREATE TABLE oil_1973( ...> month INT, ...> barrels_purchased REAL, ...> total_value REAL, ...> unit_price REAL, ...> cpi REAL ...> );

We are specifying the table name, oil_1973, variables names, and their type.

Add data with command INSERT INTO

sqlite> INSERT INTO oil_1973 ...> SELECT month, barrels_purchased, total_value, unit_price, cpi ...> FROM oil ...> WHERE year = 1973;

24 / 41 Verify

sqlite> .tables oil oil_1973

sqlite> SELECT * FROM oil_1973 ...> LIMIT 5; month barrels_purchased total_value unit_price cpi ------1 102750.0 282153.0 2.75 42.6 2 95276.0 259675.0 2.73 42.9 3 112053.0 316283.0 2.82 43.3 4 98841.0 279958.0 2.83 43.6 5 123102.0 357042.0 2.9 43.9

25 / 41 Creating new tables from outside data

nukes <- readr::read_table("http://users.stat.ufl.edu/~winner/data/nuketest.dat", col_names = c("year", "month", "month_series", "tests"))

nukes

#> # A tibble: 576 x 4 #> year month month_series tests #> #> 1 1945 1 1 0 #> 2 1945 2 2 0 #> 3 1945 3 3 0 #> 4 1945 4 4 0 #> 5 1945 5 5 0 #> 6 1945 6 6 0 #> 7 1945 7 7 1 #> 8 1945 8 8 2 #> 9 1945 9 9 0 #> 10 1945 10 10 0 #> # … with 566 more rows

readr::write_csv(nukes, path = "nukes.csv")

26 / 41 To import a CSV file of data as a table

sqlite> .mode csv sqlite> .import nukes.csv nukes sqlite> SELECT * FROM nukes LIMIT 5; year,month,month_series,tests 1945,1,1,0 1945,2,2,0 1945,3,3,0 1945,4,4,0 1945,5,5,0

Adjust the mode for pretty output with .mode.

sqlite> .mode column sqlite> SELECT * FROM nukes LIMIT 5; year month month_series tests ------1945 1 1 0 1945 2 2 0 1945 3 3 0 1945 4 4 0 1945 5 5 0

27 / 41 More tables from tables

sqlite> CREATE TABLE nukes_1973( ...> year INT, ...> month INT, ...> month_series INT, ...> tests INT);

sqlite> INSERT INTO nukes_1973 ...> SELECT * FROM nukes WHERE year = 1973;

sqlite> SELECT * FROM nukes_1973; year month month_series tests ------1973 1 337 0 1973 2 338 1 1973 3 339 2 1973 4 340 4 1973 5 341 4 1973 6 342 5 1973 7 343 0 1973 8 344 0 1973 9 345 0 1973 10 346 3 1973 11 347 1 1973 12 348 4

28 / 41 Join tables

Our nuclear data from 1973... Our oil data from 1973...

sqlite> SELECT * FROM nukes_1973; sqlite> SELECT * FROM oil_1973; year month month_series tests month barrels_purchased total_value u ------1973 1 337 0 1 102750.0 282153.0 2 1973 2 338 1 2 95276.0 259675.0 2 1973 3 339 2 3 112053.0 316283.0 2 1973 4 340 4 4 98841.0 279958.0 2 1973 5 341 4 5 123102.0 357042.0 2 1973 6 342 5 6 118152.0 360020.0 3 1973 7 343 0 7 101752.0 320591.0 3 1973 8 344 0 8 148219.0 483085.0 3 1973 9 345 0 9 124966.0 421856.0 3 1973 10 346 3 10 134741.0 476763.0 3 1973 11 347 1 11 132168.0 503245.0 3 1973 12 348 4 12 100950.0 532182.0 5

How can we merge these two tables?

29 / 41 Some joins visualized

30 / 41 Default join

sqlite> SELECT * FROM oil_1973 JOIN nukes_1973; month barrels_purchased total_value unit_price cpi year month mon ------1 102750.0 282153.0 2.75 42.6 1973 1 337 1 102750.0 282153.0 2.75 42.6 1973 2 338 1 102750.0 282153.0 2.75 42.6 1973 3 339 1 102750.0 282153.0 2.75 42.6 1973 4 340 1 102750.0 282153.0 2.75 42.6 1973 5 341 1 102750.0 282153.0 2.75 42.6 1973 6 342 1 102750.0 282153.0 2.75 42.6 1973 7 343 1 102750.0 282153.0 2.75 42.6 1973 8 344 1 102750.0 282153.0 2.75 42.6 1973 9 345 1 102750.0 282153.0 2.75 42.6 1973 10 346 1 102750.0 282153.0 2.75 42.6 1973 11 347 1 102750.0 282153.0 2.75 42.6 1973 12 348

...

What happened with this join?

By default, a cross join was used, and it combines every row from the first table oil_1973 with every row from the second table nukes_1973 to form the resulting table.

31 / 41 Natural join

sqlite> SELECT * FROM oil_1973 NATURAL JOIN nukes_1973; month barrels_purchased total_value unit_price cpi year month_series t ------1 102750.0 282153.0 2.75 42.6 1973 337 0 2 95276.0 259675.0 2.73 42.9 1973 338 1 3 112053.0 316283.0 2.82 43.3 1973 339 2 4 98841.0 279958.0 2.83 43.6 1973 340 4 5 123102.0 357042.0 2.9 43.9 1973 341 4 6 118152.0 360020.0 3.05 44.2 1973 342 5 7 101752.0 320591.0 3.15 44.3 1973 343 0 8 148219.0 483085.0 3.26 45.1 1973 344 0 9 124966.0 421856.0 3.38 45.2 1973 345 0 10 134741.0 476763.0 3.54 45.6 1973 346 3 11 132168.0 503245.0 3.81 45.9 1973 347 1 12 100950.0 532182.0 5.27 46.2 1973 348 4

In the NATURAL JOIN, all the columns from both tables with the same name will be matched against each other. It automatically tests for equality between the values of every column that exists in both tables. This is why in the above example we only see month once.

32 / 41 Be explicit on how to join

sqlite> SELECT year, month, tests, unit_price ...> FROM nukes_1973 ...> LEFT JOIN oil_1973 ...> ON nukes_1973.month = oil_1973.month; Error: ambiguous column name: month

sqlite> SELECT year, oil_1973.month, tests, unit_price FROM nukes_1973 LEFT JOIN oil_1973 ON nukes_1973.month = oil_1973.month; year month tests unit_price ------1973 1 0 2.75 1973 2 1 2.73 1973 3 2 2.82 1973 4 4 2.83 1973 5 4 2.9 1973 6 5 3.05 1973 7 0 3.15 1973 8 0 3.26 1973 9 0 3.38 1973 10 3 3.54 1973 11 1 3.81 1973 12 4 5.27

33 / 41 Other joins

sqlite> SELECT * FROM oil_1973 NATURAL RIGHT JOIN nukes_1973; Error: RIGHT and FULL OUTER JOINs are not currently supported

34 / 41 Exercise

Use nukes and oil to perform one SQL query that returns the mean oil price and mean nuclear tests per year for all years where oil price and nuclear test data is available.

year avg_oil_price avg_nuke_tests ------1992 16.7033333333333 0.5 1991 17.5166666666667 0.583333333333 1990 20.3591666666667 0.666666666666 1989 16.4408333333333 0.916666666666 1988 14.3066666666667 1.25 1987 16.6766666666667 1.166666666666 1986 14.3925 1.166666666666 1985 26.22 1.416666666666 1984 27.6708333333333 1.5 1983 29.5833333333333 1.5 1982 33.3591666666667 1.5 1981 35.1066666666667 1.333333333333 1980 31.555 1.166666666666 1979 18.6758333333333 1.25 1978 13.4341666666667 1.583333333333 1977 13.3233333333333 1.666666666666 1976 12.42 1.666666666666 1975 11.5883333333333 1.833333333333 1974 11.0875 1.833333333333 1973 3.29083333333333 2.0

35 / 41 Finer details

36 / 41 SQL statement processing

What happens in the background when a SQL statement is sent to the RDBMS?

1. The SQL statement is parsed around key words.

2. The statement is validated. Do all tables and fields exist? Are names unambiguous?

3. The RDBMS optimizes and eventually generates an access plan - how best to retrieve the data, update the data, or delete the data.

4. Access plan is executed

37 / 41 Subqueries

A subquery may occur in a

SELECT clause FROM clause WHERE clause

The inner (sub) query executes first before its parent query so that the results of an inner query can be passed to the outer query.

sqlite> SELECT * FROM nukes_1973 ...> WHERE month IN ( ...> SELECT month FROM oil_1973 ...> WHERE unit_price BETWEEN 2.5 AND 3.5);

What are we trying to get with the above query?

38 / 41 Tips

On very large datasets use indices to speed up searches

Use CAST to change data types in a query

Attach databases to perform JOIN operations on tables across databases

See your query and improve performance with EXPLAIN QUERY PLAN followed by your query

39 / 41 Beyond SQL: NoSQL

Document Graph stores Key-value stores Wide-column stores

R package nodbi provides a single user interface for interacting with many NoSQL databases. This is similar to package DBI for interacting with relational databases that use SQL.

R package nodbi supports:

MongoDB Redis (server based) CouchDB Elasticsearch SQLite

Check out their vignette for more information: https://docs.ropensci.org/nodbi/

40 / 41 References

https://www.w3resource.com/sql/tutorials.php

http://www.sql-join.com/sql-join-types

https://www.mongodb.com/nosql-explained

41 / 41