How to Analyze JSON with SQL
Total Page:16
File Type:pdf, Size:1020Kb
How To Analyze JSON with SQL Kent Graziano, Chief Technical Evangelist I @kentgraziano © 2020 Snowflake Inc. All Rights Reserved My Bio • Chief Technical Evangelist, Snowflake Inc • Oracle ACE Director, Alumni (DW/BI) • OakTable Network • Blogger – The Data Warrior • Certified Data Vault Master and DV 2.0 Practitioner • Former Member: Boulder BI Brain Trust (#BBBT) • Member: DAMA Houston & DAMA International • Data Architecture and Data Warehouse Specialist • 30+ years in IT • 25+ years of Oracle-related work • 25+ years of data warehousing experience • Author & Co-Author of a bunch of books (Amazon) • Past-President of ODTUG and Rocky Mountain Oracle User Group © 2020 Snowflake Inc. All Rights Reserved 3 years in stealth + 4 years GA Founded 2012 by industry veterans First customers with over 120 2014, general database patents availability 2015 Over $920M in venture funding from leading 1600+ employees investors Over 3000 customers today Fun facts: Queries processed in Largest single Largest number of Single customer Single customer Snowflake per day: table: tables single DB: most data: most users: 365 million 45 trillion rows 3.1 million! > 19.3PB > 11,000 © 2020 Snowflake Inc. All Rights Reserved 3 AGENDA ➢What is JSON? ➢Why do you care? ➢JSON support in Snowflake ➢Step-by-step examples ➢Conclusion © 2020 Snowflake Inc. All Rights Reserved What is JSON? • Java A minimal, readable format • Script for structuring data. • Object It is used primarily to • Notation transmit data between a server and a web application, as an alternative to XML © 2020 Snowflake Inc. All Rights Reserved Why worry about JSON? • There is LOTS of it out there • JavaScript is popular • REST API’s for IoT & Mobile • Application and web logs – Social Media • Self-describing so very portable • Open datasets published in JSON • Data.gov • Datasf.org • Data.cityofNewYork.us • Opportunity for analysis! © 2020 Snowflake Inc. All Rights Reserved Load Your Semi-Structured Data Directly Into a Relational Table Before After Semi Structured Data Semi Structured Data Load into NoSQL Platform or Hadoop Parse the Data Into an Load into Snowflake Understandable Schema Load into a relational database Query with SQL and Join it to Other Structured Data Query with SQL © 2020 Snowflake Inc. All Rights Reserved 7 JSON in Snowflake ● With Snowflake, you can: ● Natively ingest semi-structured data ● Store it efficiently ● Access it quickly using simple extensions to standard SQL ● New datatype - VARIANT © 2020 Snowflake Inc. All Rights Reserved 8 JSON Support with SQL Structured data Semi-structured data (e.g. JSON, Avro, XML) { "firstName": "John", "lastName": "Smith", "height_cm": 167.64, Apple 101.12 250 FIH-2316 "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": "10021-3100" Pear 56.22 202 IHO-6912 }, "phoneNumbers": [ { "type": "home", "number": "212 555-1234" }, { "type": "office", "number": "646 555-4567" } ] Orange 98.21 600 WHQ-6090 } All Your Data! select v:lastName::string as last_name from json_demo; © 2020 Snowflake Inc. All Rights Reserved How? • Snowflake invented a new data type called VARIANT • Allows semi-structured data to be loaded, as is, into a column in a relational table • Snowflake automatically discovers the attributes (“keys”) and structure that exist in the JSON document • Statistics are collected in Snowflake’s metadata repository which enables optimization • No Hadoop or NoSQL necessary • Query the JSON doc directly using SQL • Join the data to columns in other relational tables © 2020 Snowflake Inc. All Rights Reserved 10 STEP BY STEP PLEASE NOTE YOU WILL NEED A SNOWFLAKE ACCOUNT TO TRY ON YOUR OWN © 2020 Snowflake Inc. All Rights Reserved Step 1: Create a Table create or replace table json_demo (v variant); Single column table with a column “v” © 2020 Snowflake Inc. All Rights Reserved 12 Step 2: Load Some Data ➢ Load sample JSON document using an INSERT and Snowflake’s PARSE_JSON function. ➢ Stored in the VARIANT column “v” ➢ Snowflake “parses” the JSON into sub- columns based on the “keys” at load time ➢ Recorded with pointers in metadata ➢ Structural information is dynamically derived based on the schema definition embedded in the JSON string. (Schema-on-Read) © 2020 Snowflake Inc. All Rights Reserved 13 Step 3: Start Pulling Data Out ➢ Get the data from the sub-column “fullName” ➢ A colon is used to reference the JSON sub-columns Where: v = the column name in the json_demo table fullName = attribute in the JSON schema v:fullName = notation to indicate which attribute in column “v” we want to select © 2020 Snowflake Inc. All Rights Reserved 14 Step 4: Casting the Data ➢ The data is easily formatted ➢ Use :: to cast to a proper data type ➢ Add an alias with “as” (just like regular SQL) ➢ You can now query semi-structured data without learning a new programming language © 2020 Snowflake Inc. All Rights Reserved 15 Handling More Complexity ➢ The original sample document contains some nested data. ➢ phoneNumber is a sub-column ➢ Subsequently areaCode and subscriberNumber are sub-columns of the sub-column ➢ Access it using a very familiar table.column dot notation ➢ Now can easily pull apart nested objects © 2020 Snowflake Inc. All Rights Reserved 16 What happens if the structure changes? ➢ The data provider can change the specifications ➢ Adding a new attribute like extensionNumber to phoneNumber ➢ Existing loads and queries keep working ➢ Easily adapt SQL - modify the query ➢ If the reverse happens and an attribute is dropped, the query will not fail. ➢ It returns a NULL value. ➢ Snowflake insulates all the code you write from these types of dynamic changes © 2020 Snowflake Inc. All Rights Reserved 17 How to handle arrays of data ➢ Children is a sub-column with an array ➢ Note ‘[…]’ indicates an array ➢ 3 rows in the array and each row has 3 sub-columns (name, gender, and age). ➢ Each of those rows constitutes the value of that array entry ➢ The function array_size returns the # of rows. ➢ To pull the data for each row, use the dot notation, but with added specification for the row number of the array located inside the brackets © 2020 Snowflake Inc. All Rights Reserved 18 Flatten ➢ Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias ➢ Flatten takes an array and returns a row for each element in the array f = table alias for the flattened array ➢ Selects all the data in the array as though name = sub-column in the array it were rows in table value = return the value of the sub-column ➢ No need to figure out how many elements there are ➢ FLATTEN allows us to determine the structure and content of the array on the fly ➢ This makes the SQL resilient to changes in the JSON document. © 2020 Snowflake Inc. All Rights Reserved 19 Flatten - Results ➢ Same result as the SELECT with the UNIONs. ➢ Resulting column header reflects the different syntax. ➢ Add a column alias and casting to make it look nice ➢ Now all the array sub-columns are formatted just like a relational table ➢ If another element is added to the array, such as a fourth child, we will not have to change the SQL © 2020 Snowflake Inc. All Rights Reserved 20 Putting this all together ➢ You can write a query to get the parent’s name and the children’s names in one result set ➢ If you just want a quick count of children by parent, you do not need FLATTEN but instead refer back to the array_size ➢ Notice there is no “group by” clause needed because the nested structure of the JSON has naturally grouped the data for us © 2020 Snowflake Inc. All Rights Reserved 21 How to Handle Multiple Arrays? ➢ You can pull from several arrays at once with no problem. ➢ What about an array within an array? ➢ Snowflake can handle that too! ➢ In the sample data yearsLived is an array nested inside the array described by citiesLived © 2020 Snowflake Inc. All Rights Reserved 22 Nested Arrays ➢ Add a second FLATTEN clause that transforms yearsLived array within the FLATTENed citiesLived array ➢ In this case the 2nd FLATTEN transforms, or pivots, the yearsLived array for each value returned from the first FLATTEN of the citiesLived array. ➢ The resulting output shows Year Lived by City ➢ Syntax is [flattened array alias].value.[key name] © 2020 Snowflake Inc. All Rights Reserved 23 Enhance the Results ➢ Adding to the output is simple ➢ Add the v:fullName to show who lived where in what year © 2020 Snowflake Inc. All Rights Reserved 24 Aggregations ➢ You can execute standard SQL aggregations on the semi-structured data ➢ Just like ANSI SQL, you can do a count and a group by ➢ You can also create much more complex analyses using the library of standard SQL aggregation and windowing functions including LEAD, LAG, RANK, and STDDEV © 2020 Snowflake Inc. All Rights Reserved 25 Filtering Your Data ➢ Like standard SQL, you can add a WHERE clause ➢ To make it easier to read the SQL, you can even reference the sub-column alias city_name in the predicate. ➢ Or you can also use the full, sub-column specification cl.value:cityName © 2020 Snowflake Inc. All Rights Reserved 26 Schema-on-Read is a Reality ➢ Brand new, optimized data type, ➢ With Snowflake, you get the bonus of VARIANT on-demand resources scalability built for the cloud ➢ VARIANT offers native support for querying JSON without: ➢ Snowflake gives you a fast path to the ➢ The need to analyze the structure ahead enterprise end game: of time ➢ The true ability to quickly and easily ➢ Design appropriate database tables and load semi-structured data into a columns modern data warehouse and make it ➢ Parsing the data string into that available for immediate analysis predefined schema ➢ VARIANT provides the same performance as all the standard relational data types. © 2020 Snowflake Inc.