How to Analyze JSON with SQL

Kent Graziano, Chief Technical Evangelist | @kentgraziano
© 2020 Snowflake Inc. All Rights Reserved

My Bio
• Chief Technical Evangelist, Snowflake Inc
• Oracle ACE Director, Alumni (DW/BI)
• OakTable Network
• Blogger – The Data Warrior
• Certified Data Vault Master and DV 2.0 Practitioner
• Former Member: Boulder BI Brain Trust (#BBBT)
• Member: DAMA Houston & DAMA International
• Data Architecture and Data Warehouse Specialist
• 30+ years in IT
• 25+ years of Oracle-related work
• 25+ years of data warehousing experience
• Author & Co-Author of a bunch of books (Amazon)
• Past-President of ODTUG and Rocky Mountain Oracle User Group

• 3 years in stealth + 4 years GA
• Founded 2012 by industry veterans with over 120 database patents
• First customers 2014, general availability 2015
• Over $920M in venture funding from leading investors
• 1600+ employees
• Over 3000 customers today
Fun facts:
• Queries processed in Snowflake per day: 365 million
• Largest single table: 45 trillion rows
• Largest number of tables single DB: 3.1 million!
• Single customer most data: > 19.3PB
• Single customer most users: > 11,000

AGENDA
➢ What is JSON?
➢ Why do you care?
➢ JSON support in Snowflake
➢ Step-by-step examples
➢ Conclusion

What is JSON?
• JavaScript Object Notation
• A minimal, readable format for structuring data.
• It is used primarily to transmit data between a server and a web application, as an alternative to XML

Why worry about JSON?
• There is LOTS of it out there
• JavaScript is popular
• REST APIs for IoT & Mobile
• Application and web logs – Social Media
• Self-describing so very portable
• Open datasets published in JSON
  • Data.gov
  • Datasf.org
  • Data.cityofNewYork.us
• Opportunity for analysis!

Load Your Semi-Structured Data Directly Into a Relational Table
Before: Semi Structured Data → Load into NoSQL Platform or Hadoop → Parse the Data Into an Understandable Schema → Load into a relational database → Query with SQL
After: Semi Structured Data → Load into Snowflake → Query with SQL and Join it to Other Structured Data

JSON in Snowflake
● With Snowflake, you can:
  ● Natively ingest semi-structured data
  ● Store it efficiently
  ● Access it quickly using simple extensions to standard SQL
● New datatype - VARIANT

JSON Support with SQL
Structured data:
  Apple   101.12  250  FIH-2316
  Pear    56.22   202  IHO-6912
  Orange  98.21   600  WHQ-6090
Semi-structured data (e.g. JSON, Avro, XML):
  {
    "firstName": "John",
    "lastName": "Smith",
    "height_cm": 167.64,
    "address": {
      "streetAddress": "21 2nd Street",
      "city": "New York",
      "state": "NY",
      "postalCode": "10021-3100"
    },
    "phoneNumbers": [
      { "type": "home", "number": "212 555-1234" },
      { "type": "office", "number": "646 555-4567" }
    ]
  }
All Your Data!
  select v:lastName::string as last_name from json_demo;

How?
• Snowflake invented a new data type called VARIANT
• Allows semi-structured data to be loaded, as is, into a column in a relational table
• Snowflake automatically discovers the attributes (“keys”) and structure that exist in the JSON document
• Statistics are collected in Snowflake’s metadata repository which enables optimization
• No Hadoop or NoSQL necessary
• Query the JSON doc directly using SQL
• Join the data to columns in other relational tables

STEP BY STEP
PLEASE NOTE YOU WILL NEED A SNOWFLAKE ACCOUNT TO TRY ON YOUR OWN

Step 1: Create a Table
  create or replace table json_demo (v variant);
Single column table with a column “v”

Step 2: Load Some Data
➢ Load sample JSON document using an INSERT and Snowflake’s PARSE_JSON function.
➢ Stored in the VARIANT column “v”
➢ Snowflake “parses” the JSON into sub-columns based on the “keys” at load time
➢ Recorded with pointers in metadata
➢ Structural information is dynamically derived based on the schema definition embedded in the JSON string. (Schema-on-Read)

Step 3: Start Pulling Data Out
➢ Get the data from the sub-column “fullName”
➢ A colon is used to reference the JSON sub-columns
Where:
  v = the column name in the json_demo table
  fullName = attribute in the JSON schema
  v:fullName = notation to indicate which attribute in column “v” we want to select

Step 4: Casting the Data
➢ The data is easily formatted
➢ Use :: to cast to a proper data type
➢ Add an alias with “as” (just like regular SQL)
➢ You can now query semi-structured data without learning a new programming language

Handling More Complexity
➢ The original sample document contains some nested data.
➢ phoneNumber is a sub-column
➢ Subsequently areaCode and subscriberNumber are sub-columns of the sub-column
➢ Access it using a very familiar table.column dot notation
➢ Now can easily pull apart nested objects

What happens if the structure changes?
➢ The data provider can change the specifications
➢ Adding a new attribute like extensionNumber to phoneNumber
➢ Existing loads and queries keep working
➢ Easily adapt SQL - modify the query
➢ If the reverse happens and an attribute is dropped, the query will not fail.
➢ It returns a NULL value.
➢ Snowflake insulates all the code you write from these types of dynamic changes
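
The slides for Steps 2 through 4 and the nested-data discussion showed their SQL as screenshots, which did not survive in this text. The following is a minimal sketch of what those steps might look like. The attribute names (fullName, phoneNumber, areaCode, subscriberNumber, extensionNumber) come from the slide text; the document values, the age attribute, and the column aliases are illustrative assumptions, not the presenter's actual sample data.

```sql
-- Sketch only: the values below are hypothetical sample data.
create or replace table json_demo (v variant);

-- Step 2: load a JSON document with INSERT ... SELECT and PARSE_JSON.
insert into json_demo
select parse_json(
'{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234"
  }
}');

-- Step 3: a colon references a top-level attribute of the VARIANT column.
select v:fullName from json_demo;

-- Step 4: cast with :: and add an alias, as in regular SQL.
select v:fullName::string as full_name,
       v:age::int         as age
from json_demo;

-- Nested data: dot notation walks into the phoneNumber object.
select v:phoneNumber.areaCode::string         as area_code,
       v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

-- Structure change: extensionNumber is absent from the document above,
-- so this column comes back as NULL instead of raising an error.
select v:phoneNumber.areaCode::string        as area_code,
       v:phoneNumber.extensionNumber::string as extension_number
from json_demo;
```

If the provider later adds extensionNumber, the last query starts returning values without any change to the table definition.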

How to handle arrays of data
➢ Children is a sub-column with an array
➢ Note ‘[…]’ indicates an array
➢ 3 rows in the array and each row has 3 sub-columns (name, gender, and age).
➢ Each of those rows constitutes the value of that array entry
➢ The function array_size returns the # of rows.
➢ To pull the data for each row, use the dot notation, but with added specification for the row number of the array located inside the brackets

Flatten
➢ Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias
➢ Flatten takes an array and returns a row for each element in the array
➢ Selects all the data in the array as though it were rows in table
➢ No need to figure out how many elements there are
➢ FLATTEN allows us to determine the structure and content of the array on the fly
➢ This makes the SQL resilient to changes in the JSON document.
Where:
  f = table alias for the flattened array
  name = sub-column in the array
  value = return the value of the sub-column

Flatten - Results
➢ Same result as the SELECT with the UNIONs.
➢ Resulting column header reflects the different syntax.
➢ Add a column alias and casting to make it look nice
➢ Now all the array sub-columns are formatted just like a relational table
➢ If another element is added to the array, such as a fourth child, we will not have to change the SQL

Putting this all together
➢ You can write a query to get the parent’s name and the children’s names in one result set
➢ If you just want a quick count of children by parent, you do not need FLATTEN but instead refer back to the array_size
➢ Notice there is no “group by” clause needed because the nested structure of the JSON has naturally grouped the data for us

How to Handle Multiple Arrays?
➢ You can pull from several arrays at once with no problem.
➢ What about an array within an array?
➢ Snowflake can handle that too!
➢ In the sample data yearsLived is an array nested inside the array described by citiesLived

Nested Arrays
➢ Add a second FLATTEN clause that transforms yearsLived array within the FLATTENed citiesLived array
➢ In this case the 2nd FLATTEN transforms, or pivots, the yearsLived array for each value returned from the first FLATTEN of the citiesLived array.
➢ The resulting output shows Year Lived by City
➢ Syntax is [flattened array alias].value:[key name]

Enhance the Results
➢ Adding to the output is simple
➢ Add the v:fullName to show who lived where in what year

Aggregations
➢ You can execute standard SQL aggregations on the semi-structured data
➢ Just like ANSI SQL, you can do a count and a group by
➢ You can also create much more complex analyses using the library of standard SQL aggregation and windowing functions including LEAD, LAG, RANK, and STDDEV

Filtering Your Data
➢ Like standard SQL, you can add a WHERE clause
➢ To make it easier to read the SQL, you can even reference the sub-column alias city_name in the predicate.
➢ Or you can also use the full, sub-column specification cl.value:cityName
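
The array, FLATTEN, aggregation, and filtering slides also relied on screenshots that are missing here. The sketch below shows one way those queries might be written. The attribute names (children, name, age, citiesLived, cityName, yearsLived) and the aliases f, cl, and city_name appear in the slide text; the sample values, the yl alias, and the other column aliases are assumptions made for illustration.

```sql
-- Sketch only: hypothetical sample data, trimmed to the array attributes.
create or replace table json_demo (v variant);

insert into json_demo
select parse_json(
'{
  "fullName": "Johnny Appleseed",
  "children": [
    {"name": "Jayden",  "gender": "Male",   "age": 10},
    {"name": "Emma",    "gender": "Female", "age": 8},
    {"name": "Madelyn", "gender": "Female", "age": 6}
  ],
  "citiesLived": [
    {"cityName": "London",   "yearsLived": ["1989", "1993"]},
    {"cityName": "Portland", "yearsLived": ["1993", "1998", "2003"]}
  ]
}');

-- Array basics: element count, and one element by position (zero-based).
select array_size(v:children)      as child_count,
       v:children[0].name::string as first_child
from json_demo;

-- FLATTEN: one output row per array element, however many there are.
select f.value:name::string as child_name,
       f.value:age::int     as child_age
from json_demo,
     lateral flatten(input => v:children) f;

-- Parent and children in one result set.
select v:fullName::string   as parent_name,
       f.value:name::string as child_name
from json_demo,
     lateral flatten(input => v:children) f;

-- Nested arrays: the second FLATTEN expands yearsLived for each city.
select v:fullName::string       as full_name,
       cl.value:cityName::string as city_name,
       yl.value::string          as year_lived
from json_demo,
     lateral flatten(input => v:citiesLived) cl,
     lateral flatten(input => cl.value:yearsLived) yl;

-- Aggregation with a filter on the column alias.
select cl.value:cityName::string as city_name,
       count(*)                  as year_count
from json_demo,
     lateral flatten(input => v:citiesLived) cl,
     lateral flatten(input => cl.value:yearsLived) yl
where city_name = 'Portland'
group by 1;
```

The predicate in the last query could equally be written as cl.value:cityName = 'Portland'; the alias form is shorter to read.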

Schema-on-Read is a Reality
➢ Brand new, optimized data type, VARIANT
➢ VARIANT offers native support for querying JSON without:
  ➢ The need to analyze the structure ahead of time
  ➢ Designing appropriate database tables and columns
  ➢ Parsing the data string into that predefined schema
➢ VARIANT provides the same performance as all the standard relational data types.
➢ With Snowflake, you get the bonus of on-demand resource scalability built for the cloud
➢ Snowflake gives you a fast path to the enterprise end game:
  ➢ The true ability to quickly and easily load semi-structured data into a modern data warehouse and make it available for immediate analysis
