SQLite and sqlite3
Statistical Computing & Programming
Shawn Santo
06-12-20
Supplementary materials
Companion videos
Review and sqlite3 overview
Creating tables
Join operations
Finer details
Additional resources
SQL Tutorial
Package nodbi vignette
Recall
Databases
A database is a collection of data typically stored in a computer system. It is controlled by a database management system (DBMS). There may be applications associated with it, such as an API.
Examples of DBMS: MySQL, Microsoft Access, Microsoft SQL Server, FileMaker Pro, Oracle Database, and dBASE.
Types of databases: Relational, object-oriented, distributed, NoSQL, graph, and more.
Relational database management system
A system that governs a relational database, where data is identified and accessed in relation to other data in the database.
Relational databases generally organize data into tables comprised of fields and records.
Many RDBMS use SQL to access data.
SQL
SQL stands for Structured Query Language.
It is an American National Standards Institute (ANSI) standard computer language for accessing and manipulating data in an RDBMS.
There are different dialects of SQL, but to be ANSI compliant a dialect must support the key query verbs.
Big picture
Source: https://www.w3resource.com/sql/tutorials.php
Common SQL query structure
Main verbs to get data:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional. Primary computations: MIN, MAX, COUNT, SUM, AVG.
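The clause order above can be sketched on a small example. This is a minimal illustration, assuming Python's built-in sqlite3 module and an in-memory table named toy_oil whose rows are made up for the demonstration (they are not the course data).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_oil (year INT, month INT, barrels INT)")
# Made-up toy rows, two months for each of four years.
con.executemany(
    "INSERT INTO toy_oil VALUES (?, ?, ?)",
    [(2001, 1, 10), (2001, 2, 30), (2002, 1, 50), (2002, 2, 70),
     (2003, 1, 20), (2003, 2, 40), (2004, 1, 90), (2004, 2, 110)],
)

query = """
SELECT year, AVG(barrels) AS avg_barrels   -- columns or computations
FROM toy_oil                               -- table
WHERE month <= 2                           -- row filter, applied before grouping
GROUP BY year                              -- one output row per year
HAVING AVG(barrels) > 25                   -- group filter, applied after aggregation
ORDER BY avg_barrels DESC                  -- sort the remaining groups
LIMIT 1, 2                                 -- skip 1 row, then return 2 rows
"""
rows = con.execute(query).fetchall()
print(rows)  # [(2002, 60.0), (2003, 30.0)]
```

The yearly averages are 20, 60, 30, and 100. HAVING drops 2001, the descending sort puts 2004 first, and LIMIT 1, 2 skips it.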
We can perform these queries with dbGetQuery() and paste().
Verb connections
SQL        dplyr
SELECT     select()
table      data frame
WHERE      filter() pre-aggregation/calculation
GROUP BY   group_by()
HAVING     filter() post-aggregation/calculation
ORDER BY   arrange() with possibly a desc()
LIMIT      slice()
SQL arithmetic and comparison operators
SQL supports the standard +, -, *, /, and % (modulo) arithmetic operators and the following comparison operators.
Operator  Description
=         Equal to
>         Greater than
<         Less than
>=        Greater than or equal to
<=        Less than or equal to
<>        Not equal to
SQL logical operators
Operator  Description
ALL       TRUE if all of the subquery values meet the condition
AND       TRUE if all the conditions separated by AND are TRUE
ANY       TRUE if any of the subquery values meet the condition
BETWEEN   TRUE if the operand is within the range of comparisons
EXISTS    TRUE if the subquery returns one or more records
IN        TRUE if the operand is equal to one of a list of expressions
LIKE      TRUE if the operand matches a pattern
NOT       Displays a record if the condition(s) is NOT TRUE
OR        TRUE if any of the conditions separated by OR is TRUE
SOME      TRUE if any of the subquery values meet the condition
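A few of these operators can be tried on a toy table. This is a sketch only, assuming Python's built-in sqlite3 module; the table toy, its rows, and the helper ids() are hypothetical names invented for the illustration. ALL, ANY, and SOME are left out because SQLite does not appear to implement them, even though they are part of standard SQL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy (id INT, name TEXT, score INT)")
# Made-up rows for illustration.
con.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [(1, "alpha", 5), (2, "beta", 12), (3, "gamma", 20), (4, "alto", 33)],
)

def ids(where):
    """Return the ids of rows matching a WHERE condition (trusted strings only)."""
    sql = f"SELECT id FROM toy WHERE {where} ORDER BY id"
    return [row[0] for row in con.execute(sql)]

between = ids("score BETWEEN 10 AND 20")      # inclusive on both ends
in_list = ids("id IN (1, 4)")
like = ids("name LIKE 'al%'")                 # % matches any run of characters
not_eq = ids("score <> 12 AND NOT (id = 1)")
modulo = ids("score % 2 = 0 OR id = 1")
exists = ids("EXISTS (SELECT 1 FROM toy WHERE score > 30)")

print(between, in_list, like, not_eq, modulo, exists)
```

EXISTS here does not depend on the outer row, so it keeps either every row or none.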
sqlite3
SQLite and sqlite3
SQLite is a software library that provides a relational database management system. The lite in SQLite means lightweight in terms of setup, database administration, and required resources.
This is available on the DSS servers. In your terminal
[sms185@numeric1 ~]$ which sqlite3
/usr/bin/sqlite3
Check out
man sqlite3
From the summary:
sqlite3 is a terminal-based front-end to the SQLite library that can evaluate queries interactively and display the results in multiple formats. sqlite3 can also be used within shell scripts and other applications to provide batch processing features.
If you have a Mac, sqlite3 should be available there as well.
Create a database
First, in RStudio, let's create an object called oil.
oil <- readr::read_table(
  "http://users.stat.ufl.edu/~winner/data/oilimport.dat",
  col_names = c("year", "month", "month_series", "barrels_purchased",
                "total_value", "unit_price", "cpi")
)
oil
#> # A tibble: 420 x 7
#>    year month month_series barrels_purchased total_value unit_price   cpi
Next, create a database called mydb.sqlite and add a table oil.
library(DBI) # R Database Interface
mydb <- dbConnect(RSQLite::SQLite(), "mydb.sqlite")
dbWriteTable(mydb, "oil", oil)
Check our database is there.
[sms185@numeric1 sql]$ ls
mydb.sqlite
Load it with command sqlite3 mydb.sqlite.
[sms185@numeric1 sql]$ sqlite3 mydb.sqlite
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite>
Typing .help at the prompt (sqlite>)
sqlite> .help
.archive ...             Manage SQL archives
.auth ON|OFF             Show authorizer callbacks
.backup ?DB? FILE        Backup DB (default "main") to FILE
.bail on|off             Stop after hitting an error. Default OFF
.binary on|off           Turn binary output on or off. Default OFF
.cd DIRECTORY            Change the working directory to DIRECTORY
.changes on|off          Show number of rows changed by SQL
.check GLOB              Fail if output since .testcase does not match
.clone NEWDB             Clone data into NEWDB from the existing database
....
will reveal some of the help features.
Commands sqlite3
1. Query commands: sqlite3 just reads lines of input and passes them on to the SQLite library for execution. This will be the typical command you provide when you want to access, update, and merge data tables.
2. Dot commands: these are lines that begin with a dot (".") and are intercepted and interpreted by the sqlite3 program itself. These commands are typically used to change the output format of queries, or to execute certain prepackaged query statements.
Navigating sqlite3
To list all names and files of attached databases
sqlite> .databases
main: /home/fac/sms185/sql/mydb.sqlite
To list all the tables
sqlite> .tables
oil
View the current settings
sqlite> .show
        echo: off
         eqp: off
     explain: auto
     headers: off
        mode: list
   nullvalue: ""
      output: stdout
colseparator: "|"
rowseparator: "\n"
       stats: off
       width:
    filename: mydb.sqlite
Table details
To show the CREATE statements matching the specified table
sqlite> .schema oil
CREATE TABLE `oil` (
  `year` REAL,
  `month` REAL,
  `month_series` REAL,
  `barrels_purchased` REAL,
  `total_value` REAL,
  `unit_price` REAL,
  `cpi` REAL
);
Note the ; at the end.
Query
Get the first 5 rows from oil.
sqlite> SELECT * FROM oil
   ...> LIMIT 5;
1973.0|1.0|1.0|102750.0|282153.0|2.75|42.6
1973.0|2.0|2.0|95276.0|259675.0|2.73|42.9
1973.0|3.0|3.0|112053.0|316283.0|2.82|43.3
1973.0|4.0|4.0|98841.0|279958.0|2.83|43.6
1973.0|5.0|5.0|123102.0|357042.0|2.9|43.9
How about a nicer output? Change the mode and headers settings.
sqlite> .mode column
sqlite> .headers on
sqlite> SELECT * FROM oil
   ...> LIMIT 5;
year    month  month_series  barrels_purchased  total_value  unit_price  cpi
------  -----  ------------  -----------------  -----------  ----------  ----
1973.0  1.0    1.0           102750.0           282153.0     2.75        42.6
1973.0  2.0    2.0           95276.0            259675.0     2.73        42.9
1973.0  3.0    3.0           112053.0           316283.0     2.82        43.3
1973.0  4.0    4.0           98841.0            279958.0     2.83        43.6
1973.0  5.0    5.0           123102.0           357042.0     2.9         43.9
Examples
For each year, find the month in which the maximum number of barrels of oil was purchased.
sqlite> SELECT year, month as MM, MAX(barrels_purchased) as Max_BBLs
   ...> FROM oil
   ...> GROUP BY year
   ...> ORDER BY Max_BBLs DESC
   ...> LIMIT 10;
year    MM    Max_BBLs
------  ----  --------
2004.0  6.0   344729.0
2006.0  8.0   336528.0
2003.0  7.0   335511.0
2005.0  8.0   329039.0
2007.0  3.0   324248.0
2002.0  10.0  315720.0
2000.0  8.0   312955.0
2001.0  10.0  311803.0
1998.0  8.0   300227.0
1999.0  5.0   289457.0
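The query above selects month without aggregating it or grouping by it. SQLite documents that, when a query uses a single MIN() or MAX(), such a bare column takes its value from the row holding the minimum or maximum; many other RDBMS reject the query instead. The sketch below, assuming Python's built-in sqlite3 module and a made-up table toy_oil, shows that behavior next to a more portable form that joins back to a grouped subquery.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_oil (year INT, month INT, barrels INT)")
# Made-up toy rows, three months for each of two years.
con.executemany(
    "INSERT INTO toy_oil VALUES (?, ?, ?)",
    [(2001, 1, 10), (2001, 2, 30), (2001, 3, 20),
     (2002, 1, 70), (2002, 2, 50), (2002, 3, 60)],
)

# SQLite-specific: the bare column month comes from the row with MAX(barrels).
sqlite_style = con.execute("""
    SELECT year, month AS MM, MAX(barrels) AS Max_BBLs
    FROM toy_oil
    GROUP BY year
    ORDER BY Max_BBLs DESC
""").fetchall()

# Portable alternative: compute the yearly maximum, then join back to find its month.
portable = con.execute("""
    SELECT t.year, t.month, t.barrels
    FROM toy_oil AS t
    JOIN (SELECT year, MAX(barrels) AS max_b FROM toy_oil GROUP BY year) AS m
      ON t.year = m.year AND t.barrels = m.max_b
    ORDER BY t.barrels DESC
""").fetchall()

print(sqlite_style)  # [(2002, 1, 70), (2001, 2, 30)]
print(portable)      # [(2002, 1, 70), (2001, 2, 30)]
```

If two months tie for the yearly maximum, the portable form returns both rows, while the SQLite form returns one of them.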
Let's make a similar query but for unit price and only return the relevant columns.
sqlite> SELECT year, MAX(unit_price) as max_unit_price
   ...> FROM oil
   ...> GROUP BY year
   ...> HAVING year BETWEEN 1975 AND 1990;
year    max_unit_price
------  --------------
1975.0  12.11
1976.0  12.59
1977.0  13.62
1978.0  13.59
1979.0  24.57
1980.0  33.94
1981.0  36.92
1982.0  35.53
1983.0  32.5
1984.0  28.05
1985.0  27.38
1986.0  25.8
1987.0  18.07
1988.0  22.4
1989.0  17.97
1990.0  30.08
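Because year is the grouping column, the HAVING condition above could also be written as a WHERE condition; HAVING becomes necessary only when the condition involves an aggregate. A minimal sketch, assuming Python's built-in sqlite3 module and a made-up table toy_price:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_price (year INT, unit_price REAL)")
# Made-up toy rows, two prices for each of three years.
con.executemany(
    "INSERT INTO toy_price VALUES (?, ?)",
    [(1974, 10.0), (1974, 12.0), (1975, 11.0),
     (1975, 13.0), (1976, 9.0), (1976, 15.0)],
)

# Filter on the grouping column after aggregation (as in the slide) ...
having_year = con.execute("""
    SELECT year, MAX(unit_price) FROM toy_price
    GROUP BY year HAVING year BETWEEN 1975 AND 1976 ORDER BY year
""").fetchall()

# ... or before aggregation; the result is the same here.
where_year = con.execute("""
    SELECT year, MAX(unit_price) FROM toy_price
    WHERE year BETWEEN 1975 AND 1976 GROUP BY year ORDER BY year
""").fetchall()

# A condition on an aggregate can only go in HAVING.
having_agg = con.execute("""
    SELECT year, MAX(unit_price) FROM toy_price
    GROUP BY year HAVING MAX(unit_price) > 14 ORDER BY year
""").fetchall()

print(having_year, where_year, having_agg)
```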
Exercise
Create a query that returns a table that has all the oil information where the average barrels purchased for a year exceeded 200,000.
Creating new tables from existing tables
Create with command CREATE TABLE
sqlite> CREATE TABLE oil_1973(
   ...> month INT,
   ...> barrels_purchased REAL,
   ...> total_value REAL,
   ...> unit_price REAL,
   ...> cpi REAL
   ...> );
We are specifying the table name, oil_1973, the variable names, and their types.
Add data with command INSERT INTO
sqlite> INSERT INTO oil_1973
   ...> SELECT month, barrels_purchased, total_value, unit_price, cpi
   ...> FROM oil
   ...> WHERE year = 1973;
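The same two steps can be sketched from a script. This assumes Python's built-in sqlite3 module and a cut-down oil table with only three columns; the three rows are values that appear elsewhere in these slides. Declaring month as INT also shows why the REAL value 1.0 is stored as the integer 1 in the new table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Cut-down stand-in for the oil table (three columns only, an assumption for brevity).
con.execute("CREATE TABLE oil (year REAL, month REAL, barrels_purchased REAL)")
con.executemany(
    "INSERT INTO oil VALUES (?, ?, ?)",
    [(1973.0, 1.0, 102750.0), (1973.0, 2.0, 95276.0), (2004.0, 6.0, 344729.0)],
)

# Step 1: declare the new table, its column names, and their types.
con.execute("CREATE TABLE oil_1973 (month INT, barrels_purchased REAL)")

# Step 2: fill it with the rows selected from the existing table.
con.execute("""
    INSERT INTO oil_1973
    SELECT month, barrels_purchased FROM oil WHERE year = 1973
""")

rows = con.execute("SELECT * FROM oil_1973 ORDER BY month").fetchall()
month_type = con.execute("SELECT typeof(month) FROM oil_1973 LIMIT 1").fetchone()[0]
print(rows)        # [(1, 102750.0), (2, 95276.0)]
print(month_type)  # integer
```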
Verify
sqlite> .tables
oil       oil_1973
sqlite> SELECT * FROM oil_1973
   ...> LIMIT 5;
month  barrels_purchased  total_value  unit_price  cpi
-----  -----------------  -----------  ----------  ----
1      102750.0           282153.0     2.75        42.6
2      95276.0            259675.0     2.73        42.9
3      112053.0           316283.0     2.82        43.3
4      98841.0            279958.0     2.83        43.6
5      123102.0           357042.0     2.9         43.9
Creating new tables from outside data
nukes <- readr::read_table(
  "http://users.stat.ufl.edu/~winner/data/nuketest.dat",
  col_names = c("year", "month", "month_series", "tests")
)
nukes
#> # A tibble: 576 x 4
#>    year month month_series tests
# The file name is passed by position: newer readr versions deprecate `path` in favor of `file`.
readr::write_csv(nukes, "nukes.csv")
To import a CSV file of data as a table
sqlite> .mode csv
sqlite> .import nukes.csv nukes
sqlite> SELECT * FROM nukes LIMIT 5;
year,month,month_series,tests
1945,1,1,0
1945,2,2,0
1945,3,3,0
1945,4,4,0
1945,5,5,0
Adjust the mode for pretty output with .mode.
sqlite> .mode column
sqlite> SELECT * FROM nukes LIMIT 5;
year  month  month_series  tests
----  -----  ------------  -----
1945  1      1             0
1945  2      2             0
1945  3      3             0
1945  4      4             0
1945  5      5             0
More tables from tables
sqlite> CREATE TABLE nukes_1973(
   ...> year INT,
   ...> month INT,
   ...> month_series INT,
   ...> tests INT);
sqlite> INSERT INTO nukes_1973
   ...> SELECT * FROM nukes WHERE year = 1973;
sqlite> SELECT * FROM nukes_1973;
year  month  month_series  tests
----  -----  ------------  -----
1973  1      337           0
1973  2      338           1
1973  3      339           2
1973  4      340           4
1973  5      341           4
1973  6      342           5
1973  7      343           0
1973  8      344           0
1973  9      345           0
1973  10     346           3
1973  11     347           1
1973  12     348           4
Join tables
Our nuclear data from 1973...
sqlite> SELECT * FROM nukes_1973;
year  month  month_series  tests
----  -----  ------------  -----
1973  1      337           0
1973  2      338           1
1973  3      339           2
1973  4      340           4
1973  5      341           4
1973  6      342           5
1973  7      343           0
1973  8      344           0
1973  9      345           0
1973  10     346           3
1973  11     347           1
1973  12     348           4
Our oil data from 1973...
sqlite> SELECT * FROM oil_1973;
month  barrels_purchased  total_value  u
-----  -----------------  -----------  -
1      102750.0           282153.0     2
2      95276.0            259675.0     2
3      112053.0           316283.0     2
4      98841.0            279958.0     2
5      123102.0           357042.0     2
6      118152.0           360020.0     3
7      101752.0           320591.0     3
8      148219.0           483085.0     3
9      124966.0           421856.0     3
10     134741.0           476763.0     3
11     132168.0           503245.0     3
12     100950.0           532182.0     5
How can we merge these two tables?
Some joins visualized
Default join
sqlite> SELECT * FROM oil_1973 JOIN nukes_1973;
month  barrels_purchased  total_value  unit_price  cpi   year  month  mon
-----  -----------------  -----------  ----------  ----  ----  -----  ---
1      102750.0           282153.0     2.75        42.6  1973  1      337
1      102750.0           282153.0     2.75        42.6  1973  2      338
1      102750.0           282153.0     2.75        42.6  1973  3      339
1      102750.0           282153.0     2.75        42.6  1973  4      340
1      102750.0           282153.0     2.75        42.6  1973  5      341
1      102750.0           282153.0     2.75        42.6  1973  6      342
1      102750.0           282153.0     2.75        42.6  1973  7      343
1      102750.0           282153.0     2.75        42.6  1973  8      344
1      102750.0           282153.0     2.75        42.6  1973  9      345
1      102750.0           282153.0     2.75        42.6  1973  10     346
1      102750.0           282153.0     2.75        42.6  1973  11     347
1      102750.0           282153.0     2.75        42.6  1973  12     348
...
What happened with this join?
By default, a cross join was used, and it combines every row from the first table oil_1973 with every row from the second table nukes_1973 to form the resulting table.
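The row count makes the cross join easy to recognize: the result has (rows in the first table) x (rows in the second table) rows. A small sketch, assuming Python's built-in sqlite3 module and two three-row tables, oil_toy and nukes_toy, which are hypothetical cut-down versions of oil_1973 and nukes_1973 filled with values from the slides:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil_toy (month INT, unit_price REAL)")
con.execute("CREATE TABLE nukes_toy (month INT, tests INT)")
con.executemany("INSERT INTO oil_toy VALUES (?, ?)", [(1, 2.75), (2, 2.73), (3, 2.82)])
con.executemany("INSERT INTO nukes_toy VALUES (?, ?)", [(2, 1), (3, 2), (4, 4)])

# JOIN with no ON, USING, or NATURAL pairs every row with every row.
default_join = con.execute("SELECT * FROM oil_toy JOIN nukes_toy").fetchall()

# Writing CROSS JOIN makes the same intent explicit.
cross_join = con.execute("SELECT * FROM oil_toy CROSS JOIN nukes_toy").fetchall()

print(len(default_join))  # 9, that is 3 x 3
```

With the full 12-row tables from the slides, the same query would return 144 rows.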
Natural join
sqlite> SELECT * FROM oil_1973 NATURAL JOIN nukes_1973;
month  barrels_purchased  total_value  unit_price  cpi   year  month_series  t
-----  -----------------  -----------  ----------  ----  ----  ------------  -
1      102750.0           282153.0     2.75        42.6  1973  337           0
2      95276.0            259675.0     2.73        42.9  1973  338           1
3      112053.0           316283.0     2.82        43.3  1973  339           2
4      98841.0            279958.0     2.83        43.6  1973  340           4
5      123102.0           357042.0     2.9         43.9  1973  341           4
6      118152.0           360020.0     3.05        44.2  1973  342           5
7      101752.0           320591.0     3.15        44.3  1973  343           0
8      148219.0           483085.0     3.26        45.1  1973  344           0
9      124966.0           421856.0     3.38        45.2  1973  345           0
10     134741.0           476763.0     3.54        45.6  1973  346           3
11     132168.0           503245.0     3.81        45.9  1973  347           1
12     100950.0           532182.0     5.27        46.2  1973  348           4
In the NATURAL JOIN, all the columns from both tables with the same name will be matched against each other. It automatically tests for equality between the values of every column that exists in both tables. This is why in the above example we only see month once.
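The sketch below, assuming Python's built-in sqlite3 module and the hypothetical tables oil_toy and nukes_toy (cut-down versions of the slide tables), shows the shared column appearing once and unmatched months being dropped. JOIN ... USING (month) gives the same result here while naming the matching column explicitly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil_toy (month INT, unit_price REAL)")
con.execute("CREATE TABLE nukes_toy (month INT, tests INT)")
# oil_toy covers months 1-3 and nukes_toy covers months 2-4, so only 2 and 3 match.
con.executemany("INSERT INTO oil_toy VALUES (?, ?)", [(1, 2.75), (2, 2.73), (3, 2.82)])
con.executemany("INSERT INTO nukes_toy VALUES (?, ?)", [(2, 1), (3, 2), (4, 4)])

cur = con.execute("SELECT * FROM oil_toy NATURAL JOIN nukes_toy ORDER BY month")
natural_cols = [d[0] for d in cur.description]   # column names of the result
natural_rows = cur.fetchall()

using_rows = con.execute(
    "SELECT * FROM oil_toy JOIN nukes_toy USING (month) ORDER BY month"
).fetchall()

print(natural_cols)  # ['month', 'unit_price', 'tests']
print(natural_rows)  # [(2, 2.73, 1), (3, 2.82, 2)]
```

A natural join silently changes if a second same-named column is ever added to both tables, which is one reason to prefer an explicit join condition.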
Be explicit on how to join
sqlite> SELECT year, month, tests, unit_price
   ...> FROM nukes_1973
   ...> LEFT JOIN oil_1973
   ...> ON nukes_1973.month = oil_1973.month;
Error: ambiguous column name: month
sqlite> SELECT year, oil_1973.month, tests, unit_price
   ...> FROM nukes_1973
   ...> LEFT JOIN oil_1973
   ...> ON nukes_1973.month = oil_1973.month;
year  month  tests  unit_price
----  -----  -----  ----------
1973  1      0      2.75
1973  2      1      2.73
1973  3      2      2.82
1973  4      4      2.83
1973  5      4      2.9
1973  6      5      3.05
1973  7      0      3.15
1973  8      0      3.26
1973  9      0      3.38
1973  10     3      3.54
1973  11     1      3.81
1973  12     4      5.27
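Both outcomes above can be reproduced on toy tables. This sketch assumes Python's built-in sqlite3 module and the hypothetical tables oil_toy and nukes_toy; the aliases n and o are a common shorthand for qualifying column names. It also shows what LEFT JOIN does when the right-hand table has no matching row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil_toy (month INT, unit_price REAL)")
con.execute("CREATE TABLE nukes_toy (month INT, tests INT)")
con.executemany("INSERT INTO oil_toy VALUES (?, ?)", [(1, 2.75), (2, 2.73), (3, 2.82)])
con.executemany("INSERT INTO nukes_toy VALUES (?, ?)", [(2, 1), (3, 2), (4, 4)])

# Unqualified month exists in both tables, so SQLite refuses to guess.
error_message = ""
try:
    con.execute("""
        SELECT month, tests, unit_price
        FROM nukes_toy LEFT JOIN oil_toy ON nukes_toy.month = oil_toy.month
    """)
except sqlite3.OperationalError as err:
    error_message = str(err)

# Qualifying every column (here through aliases) removes the ambiguity.
rows = con.execute("""
    SELECT n.month, n.tests, o.unit_price
    FROM nukes_toy AS n
    LEFT JOIN oil_toy AS o ON n.month = o.month
    ORDER BY n.month
""").fetchall()

print(error_message)
print(rows)  # [(2, 1, 2.73), (3, 2, 2.82), (4, 4, None)]
```

Month 4 has no row in oil_toy, so the LEFT JOIN keeps it and fills unit_price with NULL (None in Python).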
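Other joins
sqlite> SELECT * FROM oil_1973 NATURAL RIGHT JOIN nukes_1973;
Error: RIGHT and FULL OUTER JOINs are not currently supported
This error comes from the SQLite version used here (3.26.0). Support for RIGHT and FULL OUTER JOIN was added in SQLite 3.39.0.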
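On a SQLite version without RIGHT or FULL OUTER JOIN, both can be emulated with LEFT JOIN. A minimal sketch, assuming Python's built-in sqlite3 module and the hypothetical tables oil_toy and nukes_toy: a right join is a left join with the tables swapped, and a full outer join is the UNION of the two left joins.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil_toy (month INT, unit_price REAL)")
con.execute("CREATE TABLE nukes_toy (month INT, tests INT)")
con.executemany("INSERT INTO oil_toy VALUES (?, ?)", [(1, 2.75), (2, 2.73), (3, 2.82)])
con.executemany("INSERT INTO nukes_toy VALUES (?, ?)", [(2, 1), (3, 2), (4, 4)])

# "oil_toy RIGHT JOIN nukes_toy" keeps every nukes_toy row,
# which is what "nukes_toy LEFT JOIN oil_toy" does.
right_like = con.execute("""
    SELECT n.month, o.unit_price, n.tests
    FROM nukes_toy AS n LEFT JOIN oil_toy AS o ON o.month = n.month
    ORDER BY n.month
""").fetchall()

# Full outer join: rows kept from either side; UNION removes the duplicates.
full_like = con.execute("""
    SELECT o.month, o.unit_price, n.tests
    FROM oil_toy AS o LEFT JOIN nukes_toy AS n ON o.month = n.month
    UNION
    SELECT n.month, o.unit_price, n.tests
    FROM nukes_toy AS n LEFT JOIN oil_toy AS o ON o.month = n.month
    ORDER BY 1
""").fetchall()

print(right_like)  # [(2, 2.73, 1), (3, 2.82, 2), (4, None, 4)]
print(full_like)   # [(1, 2.75, None), (2, 2.73, 1), (3, 2.82, 2), (4, None, 4)]
```

UNION also collapses rows that were genuinely duplicated in the inputs, so this emulation is only a sketch for tables with a unique join key.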
Exercise
Use nukes and oil to perform one SQL query that returns the mean oil price and mean nuclear tests per year for all years where oil price and nuclear test data is available.
year  avg_oil_price     avg_nuke_tests
----  ----------------  --------------
1992  16.7033333333333  0.5
1991  17.5166666666667  0.583333333333
1990  20.3591666666667  0.666666666666
1989  16.4408333333333  0.916666666666
1988  14.3066666666667  1.25
1987  16.6766666666667  1.166666666666
1986  14.3925           1.166666666666
1985  26.22             1.416666666666
1984  27.6708333333333  1.5
1983  29.5833333333333  1.5
1982  33.3591666666667  1.5
1981  35.1066666666667  1.333333333333
1980  31.555            1.166666666666
1979  18.6758333333333  1.25
1978  13.4341666666667  1.583333333333
1977  13.3233333333333  1.666666666666
1976  12.42             1.666666666666
1975  11.5883333333333  1.833333333333
1974  11.0875           1.833333333333
1973  3.29083333333333  2.0
Finer details
SQL statement processing
What happens in the background when a SQL statement is sent to the RDBMS?
1. The SQL statement is parsed around key words.
2. The statement is validated. Do all tables and fields exist? Are names unambiguous?
3. The RDBMS optimizes and eventually generates an access plan - how best to retrieve the data, update the data, or delete the data.
4. The access plan is executed.
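The first three steps can be observed from the outside. This is a rough sketch, assuming Python's built-in sqlite3 module; the table layout, the misspelled statements, and the helper try_sql() are made up for the illustration. A statement that fails parsing or validation is rejected before any data is touched, and EXPLAIN QUERY PLAN reports the chosen access plan without running the query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil (year INT, unit_price REAL)")

def try_sql(sql):
    """Run a statement and return the error message, or None if it succeeds."""
    try:
        con.execute(sql)
        return None
    except sqlite3.Error as err:
        return str(err)

parse_error = try_sql("SELEC year FROM oil")        # step 1: cannot be parsed
missing_table = try_sql("SELECT year FROM petrol")  # step 2: table does not exist
missing_field = try_sql("SELECT price FROM oil")    # step 2: field does not exist
valid = try_sql("SELECT year FROM oil")             # passes every step

# Step 3: the last column of each EXPLAIN QUERY PLAN row describes the plan.
access_plan = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN SELECT year FROM oil")]

print(parse_error, missing_table, missing_field, valid, access_plan, sep="\n")
```

The exact wording of the plan text differs between SQLite versions, so it is best read by a person rather than matched exactly.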
Subqueries
A subquery may occur in a
SELECT clause
FROM clause
WHERE clause
The inner query (subquery) executes before its parent query so that its results can be passed to the outer query.
sqlite> SELECT * FROM nukes_1973
   ...> WHERE month IN (
   ...> SELECT month FROM oil_1973
   ...> WHERE unit_price BETWEEN 2.5 AND 3.5);
What are we trying to get with the above query?
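One way to explore the question is to run the two queries separately. The sketch below assumes Python's built-in sqlite3 module and rebuilds nukes_1973 and a two-column version of oil_1973 from the 1973 values shown in the slides.

```python
import sqlite3

# 1973 values as printed in the slides, months 1 through 12.
prices = [2.75, 2.73, 2.82, 2.83, 2.9, 3.05, 3.15, 3.26, 3.38, 3.54, 3.81, 5.27]
tests = [0, 1, 2, 4, 4, 5, 0, 0, 0, 3, 1, 4]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE oil_1973 (month INT, unit_price REAL)")
con.execute("CREATE TABLE nukes_1973 (year INT, month INT, month_series INT, tests INT)")
con.executemany("INSERT INTO oil_1973 VALUES (?, ?)", list(enumerate(prices, start=1)))
con.executemany(
    "INSERT INTO nukes_1973 VALUES (?, ?, ?, ?)",
    [(1973, m, 336 + m, t) for m, t in enumerate(tests, start=1)],
)

# The inner query alone: months whose unit price lies in [2.5, 3.5].
inner_months = sorted(row[0] for row in con.execute(
    "SELECT month FROM oil_1973 WHERE unit_price BETWEEN 2.5 AND 3.5"))

# The full query from the slide: nuclear test rows for exactly those months.
rows = con.execute("""
    SELECT * FROM nukes_1973
    WHERE month IN (
        SELECT month FROM oil_1973
        WHERE unit_price BETWEEN 2.5 AND 3.5)
""").fetchall()
outer_months = sorted(row[1] for row in rows)

print(inner_months)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(outer_months)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```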
Tips
On very large datasets use indices to speed up searches
Use CAST to change data types in a query
Attach databases to perform JOIN operations on tables across databases
See your query and improve performance with EXPLAIN QUERY PLAN followed by your query
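The four tips can be sketched together. This assumes Python's built-in sqlite3 module; the index name idx_oil_year, the schema name other, and the cut-down tables are hypothetical, with row values taken from the slides. The second database is attached in memory only to keep the sketch self-contained; in practice it would be a file path.

```python
import sqlite3

# isolation_level=None selects autocommit mode, so ATTACH is not blocked
# by a transaction left open after the INSERT statements.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE oil (year REAL, month REAL, unit_price REAL)")
con.executemany(
    "INSERT INTO oil VALUES (?, ?, ?)",
    [(1973.0, 1.0, 2.75), (1973.0, 2.0, 2.73), (1973.0, 12.0, 5.27)],
)

def plan(sql):
    """Return the description column of EXPLAIN QUERY PLAN for a query."""
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT * FROM oil WHERE year = 1973"
plan_before = plan(query)                       # a scan of the whole table
con.execute("CREATE INDEX idx_oil_year ON oil (year)")
plan_after = plan(query)                        # a search that uses the index

# CAST converts the REAL month (1.0) to an INTEGER (1) inside the query.
months = [row[0] for row in con.execute(
    "SELECT CAST(month AS INTEGER) FROM oil ORDER BY month")]

# ATTACH a second database and join across the two.
con.execute("ATTACH DATABASE ':memory:' AS other")
con.execute("CREATE TABLE other.nukes (month INT, tests INT)")
con.executemany("INSERT INTO other.nukes VALUES (?, ?)", [(1, 0), (2, 1), (12, 4)])
joined = con.execute("""
    SELECT o.month, o.unit_price, n.tests
    FROM main.oil AS o JOIN other.nukes AS n ON o.month = n.month
    ORDER BY o.month
""").fetchall()

print(plan_before, plan_after, months, joined, sep="\n")
```

On a three-row table the index brings no measurable speedup; the point is only that the reported plan changes from a scan to an index search.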
Beyond SQL: NoSQL
Document
Graph stores
Key-value stores
Wide-column stores
R package nodbi provides a single user interface for interacting with many NoSQL databases. This is similar to package DBI for interacting with relational databases that use SQL.
R package nodbi supports:
MongoDB
Redis (server based)
CouchDB
Elasticsearch
SQLite
Check out their vignette for more information: https://docs.ropensci.org/nodbi/
References
https://www.w3resource.com/sql/tutorials.php
http://www.sql-join.com/sql-join-types
https://www.mongodb.com/nosql-explained