<<

Data Wrangling With MongoDB

Lesson 01 Notes

Data Extraction Fundamentals

Intro

Hi, I'm Shannon Bradshaw, director of education at MongoDB. MongoDB is the company behind the open source NoSQL database of the same name. Prior to joining MongoDB, I was a computer scientist working in academia and consulting on a number of different data science projects in the financial industry, social media, and other spaces. This is a class about data wrangling. Data scientists spend about 70% of their time data wrangling. So what is data wrangling? Well, it's the process of gathering, extracting, cleaning, and storing our data.

Only after that does it really make sense to do any analysis. So if you're a quant on Wall Street and you want to build models to automate trading, you first need to make sure you're basing your models on reliable data. Or if you're building a map app, you need to ensure your data's correct, or you can quickly find yourself in something of a public relations disaster. As another example, if you're working on a smaller scale, analyzing data for a research team, you want to ensure your team can make decisions based on the data you're providing. If you don't take the time to ensure your data is in good shape before doing any analysis, you run a big risk of wasting a lot of time later on, or worse, losing the faith of your colleagues who depend on the data you've prepared. It's a little like putting the cart before the horse. So while we're going to get to some analysis (that's why we're doing all this in the first place), this class is really about getting your data ready so that any analysis is built on a solid foundation of good data. Here you're going to get a chance to do some tinkering. We're going to develop your hacker muscles, or, should I say, wrangler muscles. We'll work with lots of different types of data: from music, energy, Wikipedia, and Twitter, to name a few sources. We'll also teach you how to work with data in most of the formats you're likely to see: JSON, XML, CSV, Excel, and HTML, and even some legacy text formats. In the last half of the course, we'll show you how to store your data in MongoDB and use it to support analysis. MongoDB is becoming increasingly important to data scientists around the world as a powerful and scalable tool for big data problems. And we'll wrap it up with a case study that allows you to put all the pieces together. We're happy to have you in the class. Let's get started.

Action Time

This class will be pretty interactive. You'll be getting your hands dirty all the time. Let's jump right into some action. The first question of the day is: which of the following solutions seems better to you? And maybe you have a funny story to share with your fellow students related to this. Get to the forums and share it.

Assessing the Quality of Data Pt. 1

Generally, we should not trust any data we get. Where does it come from? Either from a human typing it in, from a program written by a human, or some combination of the two. And everywhere that humans are involved, there's a potential for problems with our data.

Assessing the Quality of Data Pt. 2

Most data needs a closer look and at least a few fixes. For example, this is a picture of my house. Well, actually it's not my house; it's the house next door. But in its Street View, Google Maps believes this is my house. Here's another quick example. This is the Wikipedia page for the Volkswagen Beetle. Probably the most common nickname for the Beetle is Bug, and you can see here that the nickname is given as "bg". It's just a typo, but a great example of the kinds of errors you can get anytime humans are involved in producing data. Which is always.

Let's take a look at one more example before we move on. This is some financial data that I worked with several years ago, and there are a couple of interesting things to note here. One is that this line here is missing values for most of the fields. There are lots of reasons why that might be the case, and this is the type of thing that we've got to look for anytime we're dealing with data.

Another example here has to do with what assumptions we might have about the dates. If you look at these, we might assume that this actually means August 4th, 2008; certainly, if you're an American, that's the type of assumption you'd make. But then we get to dates like this, and there's another example here, where it becomes pretty clear that this is probably a date format in which the day comes first. So we've got to be careful about any assumptions we bring with us into a data wrangling task. So let's go ahead and spell this out. We need to assess our data in order to test assumptions about the values that are there, the data types for those values, and the shape of our data. We also need to identify errors or outliers in our data. We're going to look at several ways of testing our data to see whether there are errors present, or whether we have data that's really outside the range of expected values. And finally, we need to assess whether there are any missing values within our data. In addition to all of this, we also need to make sure that our data will support the types of queries we need to make. The idea here is to eliminate surprises leading to bad analysis later. We'll go into all of this in lesson three, but before we can clean any data, we need to make sure we know how to gather data. And that's what we're going to be looking at in the rest of this lesson and the lesson that follows.

Tabular Formats

Okay, let's talk about tabular data. You're probably familiar with spreadsheet applications like Microsoft Office's Excel, Calc from Apache OpenOffice, or even Google Spreadsheets. With any one of these applications, we can create tabular data in the form of a spreadsheet. Now, I'm a big Beatles fan, so indulge me as we begin a little data wrangling with some discography data. After all, I'm sure even space cowboys will be listening to the Beatles someday. Okay, so as data scientists, we're usually concerned with what items a dataset contains and what fields those items have. As I'm sure you're aware, with tabular data, each row represents a data item, and an individual item can have one or more fields. The columns each represent a different field, and in most tabular data, the very first row will actually label those fields in some way. Finally, we have individual cells, each of which contains a value for a particular field. In this case, the value in this cell tells us that the label for this particular album in Canada is Capitol Records. Okay, so just to look at that again in a more abstract sense: each row is an item in our dataset; each column makes up one of the fields for the data items; and finally, individual values for a field are stored in cells. In the videos that follow, we'll quickly get our hands dirty working with tabular data in the CSV and Excel data formats in the Python programming language.

CSV Format

A common way to distribute tabular data is in a format called CSV. Even if you're already familiar with CSV, stay with us, because in just a few minutes we'll look at using the csv module in Python to make it much easier to work with this type of data. As you can see in this example, this is the same discography data we looked at in a spreadsheet; the only difference here is the format. In this case, we're looking at it in CSV. Here you can see the first line of the file contains the labels for all of the fields. Let's take a look at an individual item. Here we have an individual item in this dataset: A Hard Day's Night, a single released on June 26th, 1964. And while it looks like this particular item appears on two lines of text, it's really just one line; my window simply isn't wide enough to show the entire line visually.
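To make that concrete, a discography CSV has roughly this shape, with a header row followed by one row per item (these particular lines are simplified and partly invented for illustration):

    Title,Released,Label,UK Chart Position,US Chart Position
    Please Please Me,22 March 1963,Parlophone,1,-
    A Hard Day's Night,26 June 1964,United Artists,-,1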

One of the principal reasons why CSV is so widely used is that it's very lightweight. Each line of text is a single row, and fields are separated by a delimiter, usually a comma, though there's also a variant of CSV, called TSV, where the delimiter is the tab character. CSV files really store just the data itself, or rather, just the data and the delimiters. The benefit of this is that the files are as small as they reasonably can be. Another nice feature is that we don't need any special-purpose software. We don't need, for example, Microsoft Excel in order to load CSV files. We can look at CSV files using a command-line editor, as we just saw in the example, or even the very simplest text editors. It's also the case that it's easy to write programs that read in CSV data in just about any programming language. Finally, though we don't need a spreadsheet application in order to read and write CSV, any spreadsheet app will be able to work with CSV files. Now what I'd like you to do is choose a tool of your choice, download the provided file (see the instructor's comments for the links), and then look at the CSV file you've downloaded in some text editor.

Programming Quiz: Parsing CSV Files

Okay, it's finally time to do a little data wrangling for ourselves. We're going to look at parsing CSV files in Python. In this case, we're going to be reading the CSV data into our program and creating one dictionary for each item in that file. So you might ask yourself, why would we do something like this? Why not just open it in a spreadsheet application? One reason is that if the file is big, say tens or even hundreds of megabytes, opening it in a spreadsheet application like Excel can be slow, inefficient, or maybe even impossible. Your app might do the software equivalent of this. Another reason we might want to programmatically process tabular data is that we might have a whole lot of files to process, so doing it manually in a spreadsheet application simply isn't an option.

Alright, let's take a look at the code provided. Here you can see we have a parse-file application. In this exercise, we're going to be working with the Beatles discography data one more time. You'll be working in the parse_file function in the provided code, and your assignment is to use the Python function split to parse each row into a dictionary. For each dictionary, the names of the fields will serve as the keys, and the values you find on a given row will serve as the values for those keys. You should produce an array of these dictionaries, one dictionary for each item, remember, and you should return that array from the parse_file function. Now, one final instruction here is that rather than processing the entire file, you should only parse the first ten lines of this file. If you go beyond that, you'll run into trouble with this particular dataset. Since this is the first exercise we're looking at in this course, let me talk a little bit about this test function here. We're providing this as a means for you to test your implementation of parse_file. It will run a little bit of code which calls the parse_file function and samples the result that it gets back, checking to see if it has the expected values. When you actually submit your program, we'll be running some different test code, possibly on a different dataset.

Answer:

Cool, let's take a look at a solution to the parsing CSV files exercise. We're going to begin here by opening the file. The next thing to note is that we're going to read the first line of that file and split it using the comma character as the delimiter. This will give us a list of values that we can use as keys for each one of the data items that we pull out of this file later on. Okay, we're then going to loop through the lines of the file. We'll break out of this loop once we've processed ten lines. For every line up to that, we're going to split the line, again using the comma as the delimiter, and then we're going to initialize this empty dictionary. Now, the entry each time through is going to be this data item that we'll construct using the keys that we got from the first line of the file and the individual field values that we just got here. We're going to loop through the fields, and using enumerate, we'll get an index value in addition to the value for each item in this fields list. That allows us to access the appropriate value in the header to use as a key for that particular field in our entry, or data item, and then we'll use the corresponding value as the value for that particular field. Now note that in both cases we're using the strip method here in order to pull off any extraneous whitespace around either the header value or the individual value for this line of the file. This is our first example of data cleaning, which is another major theme of this course. In Excel or CSV files, a lot of times you'll have garbage whitespace surrounding values. You don't really notice it or care about it in the file itself, but when you're processing the values in Python, it can make a big difference, especially if you're doing comparisons between values. So it's always a good idea to use strip. Okay, finally, we're going to append that one data item to our data array, so that it's included in the list of items that we return from this function, parse_file.
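Here's a minimal sketch of that solution. The exact starter code differs, so treat this as an illustration of the approach rather than the official answer:

    # Parse the first ten data rows of a comma-delimited file into
    # dictionaries, using the first line of the file as the keys.
    def parse_file(datafile):
        data = []
        with open(datafile, "r") as f:
            header = f.readline().split(",")
            for i, line in enumerate(f):
                if i == 10:              # only parse the first ten lines
                    break
                fields = line.split(",")
                entry = {}
                for j, value in enumerate(fields):
                    # strip extraneous whitespace from keys and values
                    entry[header[j].strip()] = value.strip()
                data.append(entry)
        return data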

Quiz: Problematic Line

All right, in this quiz I'd like you to identify which one of these lines would cause a problem for our parser if written in the same way I showed you in the solution. By that I mean using split to identify the individual fields in our CSV file.

Answer:

And the answer is line 15. The reason is that in this column we actually have a value that includes a comma as part of the data. What that means is that when we're processing this file, we would end up with one extra field for this particular row, because of this comma right here.
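You can see the failure in miniature with a couple of lines of Python (the row shown here is invented for illustration):

    # A comma inside a field fools naive splitting: we get four pieces
    # where the row really has three fields.
    line = 'Magical Mystery Tour,"Parlophone(UK), Capitol(US)",1967'
    print(line.split(","))
    # ['Magical Mystery Tour', '"Parlophone(UK)', ' Capitol(US)"', '1967']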

Using CSV Module

Now let's step back a minute and think a little bit more about CSV. We've talked about the fact that fields are delimited by commas. So what happens if we have a field that actually has a comma in it, like, for example, this one? This particular Beatles album was released on two different labels, one in New Zealand and one in the US, and the way this dataset has been set up, those two labels are simply separated by a comma here. Now, based on what we know so far about CSV, or I should say what we've discussed so far in the class about CSV, this would cause a problem for us, because our parser would interpret this comma as a field separator. The way that the CSV format actually handles this, or the way that most applications that deal with the CSV format handle this, is to do something like the following. So you can see here that this is the field in the actual CSV file. Over here what I've done is simply load it inside a Google sheet, but here's the raw CSV file. And you can see that the way this has been structured is that for this particular line, this field has been enclosed in quotes. Okay? What that does is indicate that you can ignore field delimiters from here to here.

Now, we've got some choices in terms of what we use for quotes: we could use double quotes, or we could use single quotes. But single quotes would cause problems in other ways. You can see here that we have this single quote character here, and there's also one here, which is actually used as an apostrophe in Sgt. Pepper's Lonely Hearts Club Band. So it would be extremely tedious if, in our Python programs, we had to deal with all of these different variations and exceptions. And the fact is that though we call it CSV, or comma-separated values, you can really use any delimiter you want, as long as that character is only ever used as a field delimiter in the rows of our dataset. So, as is so often the case in software development, this problem has been abstracted away and solved for us: all of the different variations, all of the tedious details that we might have to deal with in order to work with a format like CSV that has so many variations and asterisks, as my friend Will Cross is fond of saying. The solution is the Python csv module. This module deals with CSV formats in a pretty complete way. So let's look at how we use this module. Now, what I'm actually going to do here is use the DictReader class from this module. This assumes that what we want to do is read all of our data into dictionaries, which is what we've been doing all along, and what we'll continue to do throughout the rest of the course. But it has some other pretty cool features as well. For example, it assumes that the first row of whatever file we're going to read is actually a header row, and that those are the names we want to use for fields. So, going back to our CSV file, if I scroll up to the top, we can see that this first row contains all of the field labels that we would like for the columns, or fields, in this dataset. What this dictionary reader will do for us is, as it reads in rows, create a dictionary for each row. The field names will be whatever it found in that first row, and it remembers them as we read through the data file. The values, in turn, will be the associated values on each line of the file. And again, it also handles things like dealing with quote characters, dealing with quoted fields that may have commas inside of them, and so on. We don't have to worry about that at all using the csv module.

So let's take a look at the rest of this code. Essentially, we're just opening up the data file, instantiating a DictReader from the csv module, and then simply looping through. Each time through, this class is going to produce a line for us, and that line will actually be a dictionary composed of the appropriate fields for that particular line. Then, if we scroll down, what I'm going to do here is simply print out all of those values. So let's take a look at running this piece of code. And again, remember, we're using the csv module. Okay, so if we run this, the output we get (we'll just look at the second to last one here) is a dictionary composed of each of the labels that came from the first line of the file, together with the field value for each one of the fields on this particular row of the data file. And it seamlessly handles for us fields that may be quoted on a particular line, and other nuances that we might see in the CSV format, and conveniently stuffs everything into individual dictionaries for us. So, whenever you're working with CSV files in Python, it's best to use the csv module, because so many of the challenges of working with this type of data have already been solved for us.
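In code, that pattern is only a few lines. Here's a minimal sketch (the filename is an assumption for illustration):

    import csv

    def parse_csv(datafile):
        data = []
        with open(datafile, newline="") as f:   # Python 3; on Python 2,
            reader = csv.DictReader(f)          # open in "rb" mode instead
            for line in reader:
                # Each line is already a dictionary keyed by the header
                # row's field names, with quoting handled for us.
                data.append(line)
        return data

    for item in parse_csv("beatles-diskography.csv"):
        print(item)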

Intro to XLRD

Okay. Let's turn our attention now to working with Excel documents. What I'd like to use as an example here is data from the Electric Reliability Council of Texas (ERCOT). This is an organization that manages the flow of electricity to millions of Texas customers. They also provide a lot of publicly available datasets on things like load, or the amount of electricity that's used by their customers. This is one such example, for the calendar year 2013. A lot of organizations publish this type of data, and often they will publish it as Excel documents. What that means for us is that we have to have a way of programmatically working with Excel, because many times we'll want to process dozens or perhaps hundreds of files, all of which have been published as Excel.

What I'd like to introduce you to here is a Python module called XLRD. XLRD allows us to work with Excel files, whether in the old XLS format or the new XLSX format. This module will allow us to read in all of the data from an Excel workbook and work with it in any way that we need to in a Python program. There's also an XLWT module, which allows us to create Excel files, if we need to do that programmatically. Okay, so the first thing that I want to do, before we dive in and take a look at this example code, is actually go through and run it. What I've done with this program is produce output that illustrates a lot of the features that you might want to work with when using XLRD to process Excel data. So, a few of the things I'd like to illustrate here: how we can read our data from an Excel file entirely into a Python list and work with it; how we go about working with rows and columns and individual cells in an Excel file using the XLRD module; and then, finally, a little bit of information about dates. It's primarily because of the way that dates are represented in Excel that we need to talk about this. So let's go back and look at this code. Here we're simply specifying which file it is that we're going to load, and in this case, it's an XLS file, an old-format Microsoft Excel file. Then, the bulk of the work that we're doing here is actually done in the parse_file function. Alright, so this is the command we use to open a workbook, and note that we're using the same variable name here, okay? Then we need to specify which sheet we'd like to work with. So here we're selecting sheet zero, and you'll find that XLRD uses zero-based indexing throughout: the very first column is actually column zero, and the first row is actually row zero. Okay, and here's an example of a list comprehension; in this case, what we're doing is essentially looping through all of the rows and columns and reading all of that data into a Python list. What I'm doing here, then, is simply printing out the value at row three and column two of the list that we've just generated.

Now, if I scroll down a little bit, we can see that following the list comprehension piece, what we're doing here is looping through the entire sheet one row at a time, and then moving across the columns. And the way that I've set this up is that once we get to row 50, we're going to print out the values in that particular row, one column at a time. Again, this is a piece of code that does nothing more than illustrate the functions and methods that you'll want to work with when using XLRD. In the next piece of our example code, I'm illustrating how to work with rows and columns in XLRD. In this case, we're going to simply grab the number of rows for this particular sheet, and then we'll print that out. Next is an illustration of how to check the data type, or the value type, of a particular cell, using the cell_type method for objects of type sheet. Then cell_value: this actually gets the value that's stored in that cell as the appropriate Python value, whether it's a floating point value or something else. And then, finally, this is actually a pretty cool method: here, we can slice the values out of a particular column. What that means is that we could say, okay, I want three values from this particular column, and I want to start on row one, so actually the second row down. And I want to go through rows one, two, and three, up to row four but not including row four, and take those three values sliced out of that particular column. Okay. And then here we're doing something very similar to what we did before, which is checking the type of the value in a given cell, but in this case we have a date in that cell. So we're going to pull out the cell value. Now, it turns out that in the old Excel format, dates are represented simply as floating point numbers. So what we can do is use the XLRD xldate_as_tuple method to get us that time in a way that allows us to work with it as a date in Python.

Okay, let's go look at that output one more time. So here's our list comprehension: we're simply printing out the value at row 3, column 2 of the list into which we read all of the Excel data. Here we can see where I was looping through the rows and columns and printed out all of the values on row fifty. Okay, and then the number of rows in this sheet happens to be more than 7,200. Now note here that the type of the data in this cell is specified as 2. You can look up in the XLRD documentation exactly what all of these different type identifiers reference; in this case, it's a floating point number. We can see that the value we've got there is 1,036 and change, okay? And here's the output from having taken a slice out of column three. Okay, so here's that piece where we were working with dates. Now, XLRD does recognize that that particular cell holds a date, but here's the time, represented as a floating point number. What we can do is convert this to a Python datetime tuple using that XLRD method I showed you, and this gives us a much better representation of the data that we're interested in. So, in this case, what we've got is row one, column zero. So it's that very first cell at the top, okay? This one here, right? And note that what we're pulling out there as the value is, in fact, year 2013, January 1st, 1 AM.
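Putting the pieces of that walkthrough together, a condensed sketch of the parse_file function might look like this. The filename and the specific rows and columns are taken from the example, but treat them as assumptions of this sketch:

    import xlrd

    def parse_file(datafile):
        workbook = xlrd.open_workbook(datafile)
        sheet = workbook.sheet_by_index(0)      # sheet zero; xlrd is zero-based

        # Read the whole sheet into a Python list of lists.
        data = [[sheet.cell_value(r, col)
                 for col in range(sheet.ncols)]
                for r in range(sheet.nrows)]
        print(data[3][2])                       # value at row 3, column 2

        print(sheet.nrows)                      # number of rows in the sheet
        print(sheet.cell_type(3, 2))            # type code; 2 means a float
        print(sheet.cell_value(3, 2))           # the value, as a Python float

        # Slice three values out of column 3: rows 1, 2, and 3
        # (up to but not including row 4).
        print(sheet.col_values(3, start_rowx=1, end_rowx=4))

        # Dates are stored as floats in the old Excel format; convert
        # the cell at row 1, column 0 into a (y, m, d, h, min, s) tuple.
        exceltime = sheet.cell_value(1, 0)
        print(xlrd.xldate_as_tuple(exceltime, workbook.datemode))

        return data

    data = parse_file("2013_ERCOT_Hourly_Load_Data.xls")  # filename assumed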

Programming Quiz: Reading Excel Files

Okay, I hope you're enjoying the course so far. Here's an exercise that'll give you a chance to do some data wrangling using the XLRD module in Python. What I'd like you to do here is read through the ERCOT hourly load file, this file here, and report the timestamp, which is stored in this column, and the load for the min and max values in this column, as well as the average load, for the Coast region of Texas. Let's quickly take a look at the code provided. What I'd like you to look at first are the assertions here at the bottom. These give you an idea of the type of values you need to produce. What I'd like you to do here is find the max value in that column B, the Coast region, and then, for that max value, find the value in the column to the left, where the timestamp is stored, and report it as a tuple, just like we did when we were looking at our example of how to use XLRD. In order to complete this exercise, you'll be working in the parse_file function, and you need to pull out values that will allow you to complete this data dictionary right here, which is going to be returned.

Answer:

Great job making it this far into the course. Let's take a look at the solution to this XLRD exercise. So here's our parse_file function, and you can see that what we've done here is use that column-slicing trick: we've used the col_values method on sheet to pull out all of the values in column one, column one being the Coast column within the dataset. Then we're simply using max and min here on the list that we've pulled out, in order to get the max value and min value in that entire column. We're using the index method on lists to figure out where that max value is in the cv list. Now, because the data we have actually begins on row one of the spreadsheet rather than row zero, we've got to add a one here, so that within our spreadsheet we end up with the right position for that value, that is to say, the right row number. Okay, then what we're going to do here is, for the position, or the row, on which the max value appears, take the value in column zero. That will give us the max time. And then we're going to turn that time, which we'll get as a floating point number, into an actual time tuple. We'll do that same process with the min value. In generating the data dictionary that we'll end up returning from this function, we can simply plug in the real max time, max value, real min time, and min value, and then to get the average, we're simply going to calculate it right here and make it the value of this key, avgcoast. Okay, let's run this. I just want to point out that I'm using the pprint module here, so that we get our data printed out in a nicely structured way. And we can see that we get good values here. If you take a look at the spreadsheet itself and do something like sort it, you'll see that the maximum times and the maximum values are calculated correctly. One thing I want to note is that in our assertion here we're doing a rounding, so that we don't have any issues with floating point values being slightly off as we get further and further to the right of the decimal point.
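Here's a condensed sketch of that solution. The dictionary keys are assumptions of this sketch; the course's starter code defines its own:

    import xlrd

    def parse_file(datafile):
        workbook = xlrd.open_workbook(datafile)
        sheet = workbook.sheet_by_index(0)

        # All values in column 1, the Coast region, skipping the header row.
        cv = sheet.col_values(1, start_rowx=1, end_rowx=None)

        maxval = max(cv)
        minval = min(cv)

        # index() gives the position within cv; add 1 because the data
        # starts on row 1 of the spreadsheet, not row 0.
        maxpos = cv.index(maxval) + 1
        minpos = cv.index(minval) + 1

        # Timestamps live in column 0; convert the Excel floats
        # into Python time tuples.
        maxtime = xlrd.xldate_as_tuple(sheet.cell_value(maxpos, 0),
                                       workbook.datemode)
        mintime = xlrd.xldate_as_tuple(sheet.cell_value(minpos, 0),
                                       workbook.datemode)

        return {
            "maxtime": maxtime,
            "maxvalue": maxval,
            "mintime": mintime,
            "minvalue": minval,
            "avgcoast": sum(cv) / float(len(cv)),
        }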

Intro to JSON

As we just saw, we're limited in what we can represent in tabular data. In many situations, we have data items with fields that have sub-fields. As programmers, we're accustomed to this way of thinking: we have objects that have fields that reference other objects, and a lot of times those objects have several fields of their own. So here we're looking at The Beatles discography page on Wikipedia. This is the page we used to produce the data that we looked at earlier. Okay, so if we were to represent this data in tabular form, in a CSV file or Excel worksheet, we'd need to do some unwinding, given that we've got certifying authorities here and different types of certifications: gold record, platinum record, and so on. There's also this additional complicating factor that an individual record can be certified as some multiple of platinum. So this particular album is a multi-platinum record, four times platinum. Platinum means it sold a million copies; four times platinum means it sold more than four million copies. So if we want to represent this part of the dataset in tabular form, we would essentially need two columns for every certifying authority: one for the level of certification, and one for the multiplier. Aside from being tedious and error-prone, this is just a really unnatural way to represent this data. So, this is why the JSON standard has emerged for modeling data and as a means of transmitting data between systems. So, let's take a look at a way of representing this data in JSON. I'm just going to look at A Hard Day's Night. A Hard Day's Night is interesting because it was actually released on two different labels on two different dates. So let's take a look then at how we might do this in JSON.

All right. So what I've done is taken just the Hard Day's Night data and implemented it as a JSON object. You can see we have fields for title, artist, and releases. And as we scroll down through this data, you can see that releases is actually an array. In this case, it has two entries, because, if you remember, we have two different releases for this album: one on the United Artists label and one on the Parlophone label. What we've done here as well, because the chart positions are in reference to a specific release, is make them part of the appropriate release object within the releases array. So in this case, the peak chart position for the United Artists release of A Hard Day's Night in France was five, and in the UK it was one. We also need to associate the certifications with the release, because a different release will have different certifications. So here we've got the different certifying authorities, and for that RIAA certification, it has a multiplier of four, because it was actually a four-times-platinum album. And here is the release data for the other release, the one on Parlophone. So this is a way of representing the data in JSON. And it's important to note that JSON objects are just like dictionaries in Python and many other languages. There's a data type in most programming languages that's analogous to a JSON object; in Python, it happens to be the dictionary. Many other languages have dictionary- or map-like data types that are very similar to a JSON object. And most programming languages have arrays as well; in Python, they happen to be called lists. Okay, so let's just document some of what we have to remember about JSON here.
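First, though, here's a trimmed sketch of the kind of object we just walked through (the field names are abbreviated, and some values invented, for illustration):

    {
        "title": "A Hard Day's Night",
        "artist": "The Beatles",
        "releases": [
            {
                "label": "United Artists",
                "chart_positions": {"FR": 5, "UK": 1},
                "certifications": [
                    {"authority": "RIAA", "level": "Platinum", "multiplier": 4}
                ]
            },
            {
                "label": "Parlophone",
                "certifications": [
                    {"authority": "BPI", "level": "Gold"}
                ]
            }
        ]
    }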

Data Modeling in JSON

Let's talk a little bit about some of the specifics of data modeling in JSON. The first thing we should mention is that items may have different fields. So you'll notice that for the certifications that actually had a multiplier, we included that multiplier; in the case of A Hard Day's Night, it was four. But for those that don't actually have a multiplier, we simply left that field out. And as we just discussed, JSON objects may have nested objects; that is, fields may have values which are themselves JSON objects. We can also nest arrays. So a field may have an array as its value, and that array may have as its elements other JSON objects, individual values, or other arrays.

JSON Playground

One of the ways you're most likely to encounter JSON data is through the use of a web service. A web service is essentially a database that you can access using HTTP requests. With a web service, you formulate queries as URLs. So I'd like to take a look at a simple example using the musicbrainz.org site. The nice thing about this site is that we don't need an API key, so it'll be very easy for you to experiment with the code I'm going to give you. Okay, MusicBrainz is essentially a wiki, but one that provides a web service with access to its data. The type of data that MusicBrainz maintains is metadata on music, so we can ask questions about artists, labels, recordings, etcetera.

Now, the way that we query this site is by constructing a URL that has musicbrainz.org as a base but then specifies a particular entity, the type of data we're interested in getting back. And we can specify some additional parameters that allow us to be specific about exactly what features or what metadata we'd like to get back for a given entity, say, for an artist. Now, one thing that we're going to have to do in order to use this site, if we want to get specific information about an artist, is to know the artist's unique identifier, that is, the MusicBrainz unique identifier. And this theme of unique identifiers is something that's going to come up over and over again in this course in a variety of different ways. So the first thing that we're going to do is actually use their search interface, and we're just going to pass in a query that includes a search for the artist we're interested in finding some information about. We're going to process the results that come back from that query in order to get the id for that particular artist, and then we can request information about that specific artist using the id.

Now, lest you think I'm a one-trick pony, we're actually going to ask for data for a different artist, in this case Lucero. So, here's where we're doing our first query, and this is going to give us back a result set. We're going to process that result set in order to get back the id for Lucero, the MusicBrainz id for Lucero. Now, I'm not going to go into this in too much detail, but something I want to point out here is what's happening when we issue this query: the response we get back is JSON data. Using the json module in Python, we can read in that data, and it will simply be translated into the appropriate Python objects. A JSON object is equivalent to a Python dictionary. So, what we're accessing right here is that Python dictionary, and we're going to access the artists field of that dictionary. That field happens to have an array as its value, and I happen to know that the second object in that array is the one I'm actually interested in: the band Lucero from Memphis. Now we want the id for this particular artist. The value of an artist is another object. It was a JSON object in the results we got back from MusicBrainz, and in our case, because we're working in Python, the json module has translated it into another Python dictionary. So the id field contains the value we're actually interested in getting here. Okay, then what we're going to do is query this site again, this time specifically requesting information on this particular id, the id that we got back from our first query, or I should say, from processing our first query. Alright. So, then we're going to issue this second query, where we're actually going to ask for releases. And you can look above in the code here to see what's going on: essentially, what I've done is implement some convenience code that allows us to easily pass in the parameters that MusicBrainz is expecting in order to get the releases data back. And then the rest of this is simply printing out the results. So, let's take a look at what the results look like.

So, here's the result of having dived into the result set from that first query to get the artist we're interested in. Okay, here's the id we care about. And then, if we look down, there are a couple of things I want to point out. I've set up this code so that each time it makes a request out to this web service, it prints out the URL. Again, remember that the URL is how we specify our query to a web service. So, at musicbrainz.org, I'm specifying that I want to access the web service, version two of the web service; I believe, at least, that's what this 2 means. And then I'm saying the entity I'm interested in is artist, and here's the id for that artist. Now, what I want back is data formatted as JSON, and this particular parameter for MusicBrainz is where we specify exactly what metadata we'd like to receive in response. It's kind of a catch-all for everything; if we want more than one type of metadata, we simply concatenate them together here. In this case, I'm just interested in releases, okay? So then what I did in the code was process the response I got back from this query to pull out the very first release. And here is the object in Python, which is going to be a dictionary, that represents, or stores, the data for that first release. This happens to be a release entitled 2012-04-20 Webster Hall, New York. As it turns out, I was actually at this show. My wife had surprised me by flying in my best friend, who is also a big Lucero fan. Okay? So then what I did was use a list comprehension to process all of the releases and extract just the title for each one of them, and then I'm simply printing them all out here.
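Stripped down, the overall shape of that program might look something like this. This is a sketch using the requests library rather than the course's own helper code, and the User-Agent string is a placeholder; the endpoint and parameter names follow the MusicBrainz web service API:

    import requests

    BASE_URL = "http://musicbrainz.org/ws/2/"

    def query_site(path, params):
        # Ask the web service for JSON; requests decodes it into Python objects.
        params["fmt"] = "json"
        # MusicBrainz asks clients to identify themselves (placeholder value).
        headers = {"User-Agent": "wrangling-example/0.1"}
        r = requests.get(BASE_URL + path, params=params, headers=headers)
        r.raise_for_status()
        return r.json()

    # First query: search for the artist by name.
    results = query_site("artist/", {"query": "artist:Lucero"})

    # "artists" is an array; in the walkthrough, the second result is the
    # band from Memphis, and its "id" field holds the MusicBrainz id.
    artist_id = results["artists"][1]["id"]
    print("artist id:", artist_id)

    # Second query: look up that artist directly, asking for its releases.
    artist_data = query_site("artist/" + artist_id, {"inc": "releases"})
    print([release["title"] for release in artist_data["releases"]])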

So what I encourage you to do with this code is take a look at all of it, get a sense for how it works, and then experiment on your own with some queries. You can look for different artists, and you can use some of the conveniences that we've provided here to query for different types of metadata about a given artist. So just play with this part of the code, or work with it in the Python interpreter, and get a sense for the data that's here. What I really encourage you to do is print out the full result set and dive into exactly what type of data is coming back. It's all just a hierarchical data structure that has objects or arrays nested within other objects or arrays. Following this, we'll have a quiz where we actually ask you to answer some specific questions, so you will have to be able to query this web service correctly.

Quiz: Exploring JSON

Okay, it's time for a quiz. Play around with the programming exercise on JSON and answer the questions that we're about to look at. Now, you could certainly do this by looking at the data manually, or just by knowing things about these artists, or of course by doing a little Googling. But please try to do it by writing code, building on the code that you were given in the previous exercise. Okay, let's take a look at the questions for this quiz. The first question is: how many bands named "First Aid Kit" are there? Next, see if you can find the begin-area name for the band Queen. Then find the Spanish alias for The Beatles. Dive into the data and find the disambiguation for Nirvana. Now, the Nirvana I mean here is the band that Kurt Cobain served as frontman for. And finally, please answer the question: when was One Direction formed?

- How many bands named “First Aid Kit”? [2]
- Begin-area name for Queen? [London]
- Spanish alias for The Beatles? [Los Beatles]
- Nirvana disambiguation? [90s US grunge band]
- When was One Direction formed? [2010]

Answer:

Okay, let's take a look at the answers to these questions. For the “First Aid Kit” question, there are actually 2 bands named “First Aid Kit”, one of them being from Sweden; I forget where the other one is from. What's the begin-area name for Queen? This is actually a pretty easy one: it's London. The Spanish alias for The Beatles happens to be Los Beatles. There's actually a BBC recording where they talk about the Spanish alias for the Beatles; it's an interview with John, Paul, George, and Ringo. Okay, enough Beatles nerdery. Alright, the Nirvana disambiguation. There are actually a few bands named Nirvana; the Kurt Cobain Nirvana is disambiguated in MusicBrainz with the label "90s US grunge band". And when was One Direction formed? 2010.
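As an example of answering these with code rather than by eye, the first question can be checked with a search query along these lines (reusing the query_site sketch from earlier and filtering for the exact name):

    # Count the artists whose name is exactly "First Aid Kit".
    results = query_site("artist/", {"query": 'artist:"First Aid Kit"'})
    exact = [a for a in results["artists"] if a["name"] == "First Aid Kit"]
    print(len(exact))   # prints 2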

Check you out

In this lesson, we looked at extracting data from CSV files, Microsoft Excel, and JSON. We'll get into some more complex wrangling tasks in the next lesson, where we explore XML as a data format, and also screen scraping to get data out of HTML.