How To Analyze JSON with SQL

Kent Graziano, Chief Technical Evangelist I @kentgraziano

© 2020 Snowflake Inc. All Rights Reserved My Bio

• Chief Technical Evangelist, Snowflake Inc • Oracle ACE Director, Alumni (DW/BI) • OakTable Network • Blogger – The Data Warrior • Certified Data Vault Master and DV 2.0 Practitioner • Former Member: Boulder BI Brain Trust (#BBBT) • Member: DAMA Houston & DAMA International • Data Architecture and Data Warehouse Specialist • 30+ years in IT • 25+ years of Oracle-related work • 25+ years of data warehousing experience • Author & Co-Author of a bunch of books () • Past-President of ODTUG and Rocky Mountain Oracle User Group

© 2020 Snowflake Inc. All Rights Reserved 3 years in stealth + 4 years GA

Founded 2012 by industry veterans First customers with over 120 2014, general patents availability 2015

Over $920M in venture funding from leading 1600+ employees investors Over 3000 customers today

Fun facts:

Queries processed in Largest single Largest number of Single customer Single customer Snowflake per day: table: tables single DB: most data: most users: 365 million 45 trillion rows 3.1 million! > 19.3PB > 11,000

© 2020 Snowflake Inc. All Rights Reserved 3 AGENDA

➢What is JSON? ➢Why do you care? ➢JSON support in Snowflake ➢Step-by-step examples ➢Conclusion

© 2020 Snowflake Inc. All Rights Reserved What is JSON?

• Java A minimal, readable format • Script for structuring data. • Object It is used primarily to • Notation transmit data between a and a , as an alternative to XML

© 2020 Snowflake Inc. All Rights Reserved Why worry about JSON?

• There is LOTS of it out there • JavaScript is popular • REST API’s for IoT & Mobile • Application and web logs – Social Media • Self-describing so very portable • Open datasets published in JSON • Data.gov • Datasf.org • Data.cityofNewYork.us • Opportunity for analysis!

© 2020 Snowflake Inc. All Rights Reserved Load Your Semi-Structured Data Directly Into a Relational Table Before After Semi Structured Data Semi Structured Data

Load into NoSQL Platform or Hadoop

Parse the Data Into an Load into Snowflake Understandable Schema

Load into a relational database Query with SQL and Join it to Other Structured Data Query with SQL © 2020 Snowflake Inc. All Rights Reserved 7 JSON in Snowflake

● With Snowflake, you can: ● Natively ingest semi-structured data ● Store it efficiently ● Access it quickly using simple extensions to standard SQL ● New datatype - VARIANT

© 2020 Snowflake Inc. All Rights Reserved 8 JSON Support with SQL Structured data Semi-structured data (e.g. JSON, Avro, XML)

{ "firstName": "John", "lastName": "Smith", "height_cm": 167.64, Apple 101.12 250 FIH-2316 "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" Pear 56.22 202 IHO-6912 }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "office", "number": "646 555-4567" } ] Orange 98.21 600 WHQ-6090 }

All Your Data! select v:lastName::string as last_name from json_demo;

© 2020 Snowflake Inc. All Rights Reserved How? • Snowflake invented a new data type called VARIANT • Allows semi-structured data to be loaded, as is, into a column in a relational table • Snowflake automatically discovers the attributes (“keys”) and structure that in the JSON document • Statistics are collected in Snowflake’s metadata repository which enables optimization • No Hadoop or NoSQL necessary • Query the JSON doc directly using SQL • Join the data to columns in other relational tables

© 2020 Snowflake Inc. All Rights Reserved 10 STEP BY STEP

PLEASE NOTE YOU WILL NEED A SNOWFLAKE ACCOUNT TO TRY ON YOUR OWN

© 2020 Snowflake Inc. All Rights Reserved Step 1: Create a Table

create or replace table json_demo (v variant);

Single column table with a column “v”

© 2020 Snowflake Inc. All Rights Reserved 12 Step 2: Load Some Data

➢ Load sample JSON document using an INSERT and Snowflake’s PARSE_JSON function. ➢ Stored in the VARIANT column “v” ➢ Snowflake “parses” the JSON into sub- columns based on the “keys” at load time ➢ Recorded with pointers in metadata ➢ Structural information is dynamically derived based on the schema definition embedded in the JSON string. (Schema-on-Read)

© 2020 Snowflake Inc. All Rights Reserved 13 Step 3: Start Pulling Data Out

➢ Get the data from the sub-column “fullName”

➢ A is used to reference the JSON sub-columns

Where:

v = the column name in the json_demo table

fullName = attribute in the JSON schema

v:fullName = notation to indicate which attribute in column “v” we want to select

© 2020 Snowflake Inc. All Rights Reserved 14 Step 4: Casting the Data

➢ The data is easily formatted ➢ Use :: to cast to a proper data type ➢ Add an alias with “as” (just like regular SQL)

➢ You can now query semi-structured data without learning a new

© 2020 Snowflake Inc. All Rights Reserved 15 Handling More Complexity

➢ The original sample document contains some nested data.

➢ phoneNumber is a sub-column ➢ Subsequently areaCode and subscriberNumber are sub-columns of the sub-column

➢ Access it using a very familiar table.column dot notation

➢ Now can easily pull apart nested objects

© 2020 Snowflake Inc. All Rights Reserved 16 What happens if the structure changes? ➢ The data provider can change the specifications ➢ Adding a new attribute like extensionNumber to phoneNumber ➢ Existing loads and queries keep working

➢ Easily adapt SQL - modify the query

➢ If the reverse happens and an attribute is dropped, the query will not fail. ➢ It returns a NULL value.

➢ Snowflake insulates all the code you write from these types of dynamic changes

© 2020 Snowflake Inc. All Rights Reserved 17 How to handle arrays of data

➢ Children is a sub-column with an array ➢ Note ‘[…]’ indicates an array

➢ 3 rows in the array and each row has 3 sub-columns (name, gender, and age). ➢ Each of those rows constitutes the value of that array entry

➢ The function array_size returns the # of rows. ➢ To pull the data for each row, use the dot notation, but with added specification for the row number of the array located inside the

© 2020 Snowflake Inc. All Rights Reserved 18 Flatten

➢ Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias ➢ Flatten takes an array and returns a row for each element in the array f = table alias for the flattened array ➢ Selects all the data in the array as though name = sub-column in the array it were rows in table value = return the value of the sub-column ➢ No need to figure out how many elements there are ➢ FLATTEN allows us to determine the structure and content of the array on the fly ➢ This makes the SQL resilient to changes in the JSON document.

© 2020 Snowflake Inc. All Rights Reserved 19 Flatten - Results

➢ Same result as the SELECT with the UNIONs.

➢ Resulting column header reflects the different syntax. ➢ Add a column alias and casting to make it look nice ➢ Now all the array sub-columns are formatted just like a relational table

➢ If another element is added to the array, such as a fourth child, we will not have to change the SQL

© 2020 Snowflake Inc. All Rights Reserved 20 Putting this all together

➢ You can write a query to get the parent’s name and the children’s names in one result set

➢ If you just want a quick count of children by parent, you do not need FLATTEN but instead refer back to the array_size

➢ Notice there is no “group by” clause needed because the nested structure of the JSON has naturally grouped the data for us

© 2020 Snowflake Inc. All Rights Reserved 21 How to Handle Multiple Arrays?

➢ You can pull from several arrays at once with no problem.

➢ What about an array within an array? ➢ Snowflake can handle that too!

➢ In the sample data yearsLived is an array nested inside the array described by citiesLived

© 2020 Snowflake Inc. All Rights Reserved 22 Nested Arrays

➢ Add a second FLATTEN clause that transforms yearsLived array within the FLATTENed citiesLived array

➢ In this case the 2nd FLATTEN transforms, or pivots, the yearsLived array for each value returned from the first FLATTEN of the citiesLived array.

➢ The resulting output shows Year Lived by City

➢ Syntax is [flattened array alias].value.[key name]

© 2020 Snowflake Inc. All Rights Reserved 23 Enhance the Results

➢ Adding to the output is simple

➢ Add the v:fullName to show who lived where in what year

© 2020 Snowflake Inc. All Rights Reserved 24 Aggregations

➢ You can execute standard SQL aggregations on the semi-structured data ➢ Just like ANSI SQL, you can do a count and a group by

➢ You can also create much more complex analyses using the library of standard SQL aggregation and windowing functions including LEAD, LAG, RANK, and STDDEV

© 2020 Snowflake Inc. All Rights Reserved 25 Filtering Your Data

➢ Like standard SQL, you can add a WHERE clause

➢ To make it easier to read the SQL, you can even reference the sub-column alias city_name in the predicate.

➢ Or you can also use the full, sub-column specification cl.value:cityName

© 2020 Snowflake Inc. All Rights Reserved 26 Schema-on-Read is a Reality

➢ Brand new, optimized data type, ➢ With Snowflake, you get the bonus of VARIANT on-demand resources scalability built for the cloud ➢ VARIANT offers native support for querying JSON without: ➢ Snowflake gives you a fast path to the ➢ The need to analyze the structure ahead enterprise end game: of time ➢ The true ability to quickly and easily ➢ Design appropriate database tables and load semi-structured data into a columns modern data warehouse and make it ➢ Parsing the data string into that available for immediate analysis predefined schema

➢ VARIANT provides the same performance as all the standard relational data types.

© 2020 Snowflake Inc. All Rights Reserved 27 Discover the performance, concurrency, and simplicity of Snowflake

As easy as 1-2-3! Sign up and receive 01 $400 worth of free Visit Snowflake.com usage for 30 days!

02 Click “Try for Free”

03 Sign up & register

Snowflake is the only data warehouse built for the cloud. You can automatically scale compute up, out, or down—independent of storage. Plus, you have the power of a complete SQL database, with zero management, that can grow with you to support all of your data and all of your users. With Snowflake On Demand™, pay only for what you use.

© 2020 Snowflake Inc. All Rights Reserved Questions?

© 2020 Snowflake Inc. All Rights Reserved Contact Info

Kent Graziano Snowflake Computing [email protected] On Twitter @KentGraziano

More info at http://snowflake.com

Visit my blog at http://kentgraziano.com

© 2020 Snowflake Inc. All Rights Reserved THANK YOU

© 2020 Snowflake Inc. All Rights Reserved