<<

Data Wrangling With MongoDB

Lesson 01 Notes

Data Extraction Fundamentals

Intro

Hi, I'm Shannon Bradshaw, director of education at MongoDB. MongoDB is the company behind the open source NoSQL database of the same name. Prior to joining MongoDB, I was a computer scientist working in academia and consulting on a number of different data science projects in the financial industry, social media, and other spaces. This is a class about data wrangling. Data scientists spend about 70% of their time data wrangling. So what is data wrangling? Well, it's the process of gathering, extracting, cleaning, and storing our data.

Only after that does it really make sense to do any analysis. So if you're a quant on Wall Street and you want to build models to automate trading, you first need to make sure you're basing your models on reliable data. Or if you're building a map app, you need to ensure your data's correct, or you can quickly find yourself in something of a public relations disaster. As another example, if you're working on a smaller scale, analyzing data for a research team, you want to ensure your team can make decisions based on the data you're providing. If you don't take the time to ensure your data is in good shape before doing any analysis, you run a big risk of wasting a lot of time later on, or worse, losing the faith of your colleagues who depend on the data you've prepared. It's a little like putting the cart before the horse. So while we're going to get to some analysis (that's why we're doing all this in the first place), this class is really about getting your data ready so that any analysis is built on a solid foundation of good data. Here you're going to get a chance to do some tinkering. We're going to develop your hacker muscles, or, should I say, wrangler muscles. We'll work with lots of different types of data: from music, energy, Wikipedia, and Twitter, to name a few sources. We'll also teach you how to work with data in most of the formats you're likely to see: JSON, XML, CSV, Excel, and HTML, and even some legacy text formats. In the last half of the course, we'll show you how to store your data in MongoDB and use it to support analysis. MongoDB is becoming increasingly important to data scientists around the world as a powerful and scalable tool for big data problems. And we'll wrap it up with a case study that allows you to put all the pieces together. We're happy to have you in the class. Let's get started.

Action Time

This class will be pretty interactive. You'll be getting your hands dirty all the time. Let's jump right into some action. The first question of the day is: which of the following solutions seems better to you? And maybe you have a funny story to share with your fellow students related to this. Get to the forums and share it.

Assessing the Quality of Data Pt. 1

Generally, we should not trust any data we get. Where does it come from? Either from a human typing it in, from a program written by a human, or some combination of the two. And everywhere that humans are involved, there's a potential for problems with our data.

Assessing the Quality of Data Pt. 2

Most data needs a closer look and at least a few fixes. For example, this is a picture of my house. Well, actually it's not my house; it's the house next door. But in its Street View, Google Maps believes this is my house. Here's another quick example. This is the Wikipedia page for the Volkswagen Beetle. Probably the most common nickname for the Beetle is Bug, and you can see here that the nickname is given as "bg". It's just a typo, but a great example of the kinds of errors you can get anytime humans are involved in producing data. Which is always.

Let's take a look at one more example before we move on. This is some financial data that I worked with several years ago, and there are a couple of interesting things to note here. One is that this line here is missing values for most of the fields. There are lots of reasons why that might be the case, and this is the type of thing that we've got to look for anytime we're dealing with data.

Another example here has to do with what assumptions we might have about the dates. If you look at these, we might assume that this actually means August 4th, 2008; certainly, if you're an American, that's the type of assumption you'd make. But then we get to dates like this, and there's another example here, where it becomes pretty clear that this is probably a date format in which the day comes first. So we've got to be careful about any assumptions we bring with us into a data wrangling task. So let's go ahead and spell this out. We need to assess our data in order to test assumptions about the values that are there, the data types for those values, and the shape of our data. We also need to identify errors or outliers in our data. We're going to look at several ways of testing our data to see whether there are errors present, or whether we have data that's really outside the range of expected values. And finally, we need to assess whether there are any missing values within our data. In addition to all of this, we also need to make sure that our data will support the types of queries we need to make. The idea here is to eliminate surprises leading to bad analysis later. We'll go into all of this in lesson three, but before we can clean any data, we need to make sure we know how to gather data. And that's what we're going to be looking at in the rest of this lesson and the lesson that follows.

Tabular Formats

Okay, let's talk about tabular data. You're probably familiar with spreadsheet applications like Microsoft Office's Excel, Calc from Apache OpenOffice, or even Google Spreadsheets. With any one of these applications, we can create tabular data in the form of a spreadsheet. Now, I'm a big Beatles fan, so indulge me as we begin a little data wrangling with some discography data. After all, I'm sure even space cowboys will be listening to the Beatles someday. Okay, so as data scientists, we're usually concerned with what items a dataset contains and what fields those items have. As I'm sure you're aware, with tabular data, each row represents a data item, and an individual item can have one or more fields. The columns each represent a different field, and in most tabular data, the very first row will actually label those fields in some way. Finally, we have individual cells, each of which contains a value for a particular field. In this case, the value in this cell tells us that the label for this particular album in Canada is Capitol Records. Okay, so just to look at that again in a more abstract sense: each row is an item in our dataset; each column makes up one of the fields for the data items; and finally, individual values for a field are stored in cells. In the videos that follow, we'll quickly get our hands dirty working with tabular data in the CSV and Excel data formats in the Python programming language.

CSV Format

A common way to distribute tabular data is in a format called CSV. Even if you're already familiar with CSV, stay with us, because in just a few minutes we'll look at using the csv module in Python to make it much easier to work with this type of data. As you can see in this example, this is the same discography data we looked at in a spreadsheet; the only difference here is the format. In this case, we're looking at it in CSV. Here you can see the first line of the file contains the labels for all of the fields. Let's take a look at an individual item. Here we have an individual item in this dataset: A Hard Day's Night, a single released on June 26th, 1964. And while it looks like this particular item appears on two lines of text, it's really just one line; my window simply isn't wide enough to show the entire line visually.
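To make that concrete, a discography CSV has roughly this shape, with a header row followed by one row per item (these particular lines are simplified and partly invented for illustration):

    Title,Released,Label,UK Chart Position,US Chart Position
    Please Please Me,22 March 1963,Parlophone,1,-
    A Hard Day's Night,26 June 1964,United Artists,-,1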

One of the principal reasons why CSV is so widely used is that it's very lightweight. Each line of text is a single row, and fields are separated by a delimiter, usually a comma, though there's also a variant of CSV, called TSV, where the delimiter is the tab character. CSV files really store just the data itself, or rather, just the data and the delimiters. The benefit of this is that the files are as small as they reasonably can be. Another nice feature is that we don't need any special-purpose software. We don't need, for example, Microsoft Excel in order to load CSV files. We can look at CSV files using a command-line editor, as we just saw in the example, or even the very simplest text editors. It's also the case that it's easy to write programs that read in CSV data in just about any programming language. Finally, though we don't need a spreadsheet application in order to read and write CSV, any spreadsheet app will be able to work with CSV files. Now what I'd like you to do is choose a tool of your choice, download the provided file (see the instructor's comments for the links), and then look at the CSV file you've downloaded in some text editor.

Programming Quiz: Parsing CSV Files

Okay, it's finally time to do a little data wrangling for ourselves. We're going to look at parsing CSV files in Python. In this case, we're going to be reading the CSV data into our program and creating one dictionary for each item in that file. So you might ask yourself, why would we do something like this? Why not just open it in a spreadsheet application? One reason is that if the file is big, say tens or even hundreds of megabytes, opening it in a spreadsheet application like Excel can be slow, inefficient, or maybe even impossible. Your app might do the software equivalent of this. Another reason we might want to programmatically process tabular data is that we might have a whole lot of files to process, so doing it manually in a spreadsheet application simply isn't an option.

Alright, let's take a look at the code provided. Here you can see we have a parse-file application. In this exercise, we're going to be working with the Beatles discography data one more time. You'll be working in the parse_file function in the provided code, and your assignment is to use the Python function split to parse each row into a dictionary. For each dictionary, the names of the fields will serve as the keys, and the values you find on a given row will serve as the values for those keys. You should produce an array of these dictionaries, one dictionary for each item, remember, and you should return that array from the parse_file function. Now, one final instruction here is that rather than processing the entire file, you should only parse the first ten lines of this file. If you go beyond that, you'll run into trouble with this particular dataset. Since this is the first exercise we're looking at in this course, let me talk a little bit about this test function here. We're providing this as a means for you to test your implementation of parse_file. It will run a little bit of code which calls the parse_file function and samples the result that it gets back, checking to see if it has the expected values. When you actually submit your program, we'll be running some different test code, possibly on a different dataset.

Answer:

Cool, let's take a look at a solution to the parsing CSV files exercise. We're going to begin here by opening the file. The next thing to note is that we're going to read the first line of that file and split it using the comma character as the delimiter. This will give us a list of values that we can use as keys for each one of the data items that we pull out of this file later on. Okay, we're then going to loop through the lines of the file. We'll break out of this loop once we've processed ten lines. For every line up to that, we're going to split the line, again using the comma as the delimiter, and then we're going to initialize this empty dictionary. Now, the entry each time through is going to be this data item that we'll construct using the keys that we got from the first line of the file and the individual field values that we just got here. We're going to loop through the fields, and using enumerate, we'll get an index value in addition to the value for each item in this fields list. That allows us to access the appropriate value in the header to use as a key for that particular field in our entry, or data item, and then we'll use the corresponding value as the value for that particular field. Now note that in both cases we're using the strip method here in order to pull off any extraneous whitespace around either the header value or the individual value for this line of the file. This is our first example of data cleaning, which is another major theme of this course. In Excel or CSV files, a lot of times you'll have garbage whitespace surrounding values. You don't really notice it or care about it in the file itself, but when you're processing the values in Python, it can make a big difference, especially if you're doing comparisons between values. So it's always a good idea to use strip. Okay, finally, we're going to append that one data item to our data array, so that it's included in the list of items that we return from this function, parse_file.
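Here's a minimal sketch of that solution. The exact starter code differs, so treat this as an illustration of the approach rather than the official answer:

    # Parse the first ten data rows of a comma-delimited file into
    # dictionaries, using the first line of the file as the keys.
    def parse_file(datafile):
        data = []
        with open(datafile, "r") as f:
            header = f.readline().split(",")
            for i, line in enumerate(f):
                if i == 10:              # only parse the first ten lines
                    break
                fields = line.split(",")
                entry = {}
                for j, value in enumerate(fields):
                    # strip extraneous whitespace from keys and values
                    entry[header[j].strip()] = value.strip()
                data.append(entry)
        return data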

Quiz: Problematic Line

All right, in this quiz I'd like you to identify which one of these lines would cause a problem for our parser if written in the same way I showed you in the solution. By that I mean using split to identify the individual fields in our CSV file.

Answer:

And the answer is line 15. The reason is that in this column we actually have a value that includes a comma as part of the data. What that means is that when we're processing this file, we would end up with one extra field for this particular row, because of this comma right here.
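You can see the failure in miniature with a couple of lines of Python (the row shown here is invented for illustration):

    # A comma inside a field fools naive splitting: we get four pieces
    # where the row really has three fields.
    line = 'Magical Mystery Tour,"Parlophone(UK), Capitol(US)",1967'
    print(line.split(","))
    # ['Magical Mystery Tour', '"Parlophone(UK)', ' Capitol(US)"', '1967']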

Using CSV Module

Now let's step back a minute and think a little bit more about CSV. We've talked about the fact that fields are delimited by commas. So what happens if we have a field that actually has a comma in it, like, for example, this one? This particular Beatles album was released on two different labels, one in New Zealand and one in the US, and the way this dataset has been set up, those two labels are simply separated by a comma here. Now, based on what we know so far about CSV, or I should say what we've discussed so far in the class about CSV, this would cause a problem for us, because our parser would interpret this comma as a field separator. The way that the CSV format actually handles this, or the way that most applications that deal with the CSV format handle this, is to do something like the following. So you can see here that this is the field in the actual CSV file. Over here what I've done is simply load it inside a Google sheet, but here's the raw CSV file. And you can see that the way this has been structured is that for this particular line, this field has been enclosed in quotes. Okay? What that does is indicate that you can ignore field delimiters from here to here.

Now, we've got some choices in terms of what we use for quotes: we could use double quotes, or we could use single quotes. But single quotes would cause problems in other ways. You can see here that we have this single quote character here, and there's also one here, which is actually used as an apostrophe in Sgt. Pepper's Lonely Hearts Club Band. So it would be extremely tedious if, in our Python programs, we had to deal with all of these different variations and exceptions. And the fact is that though we call it CSV, or comma-separated values, you can really use any delimiter you want, as long as that character is only ever used as a field delimiter in the rows of our dataset. So, as is so often the case in software development, this problem has been abstracted away and solved for us: all of the different variations, all of the tedious details that we might have to deal with in order to work with a format like CSV that has so many variations and asterisks, as my friend Will Cross is fond of saying. The solution is the Python csv module. This module deals with CSV formats in a pretty complete way. So let's look at how we use this module. Now, what I'm actually going to do here is use the DictReader class from this module. This assumes that what we want to do is read all of our data into dictionaries, which is what we've been doing all along, and what we'll continue to do throughout the rest of the course. But it has some other pretty cool features as well. For example, it assumes that the first row of whatever file we're going to read is actually a header row, and that those are the names we want to use for fields. So, going back to our CSV file, if I scroll up to the top, we can see that this first row contains all of the field labels that we would like for the columns, or fields, in this dataset. What this dictionary reader will do for us is, as it reads in rows, create a dictionary for each row. The field names will be whatever it found in that first row, and it remembers them as we read through the data file. The values, in turn, will be the associated values on each line of the file. And again, it also handles things like dealing with quote characters, dealing with quoted fields that may have commas inside of them, and so on. We don't have to worry about that at all using the csv module.

So let's take a look at the rest of this code. Essentially, we're just opening up the data file, instantiating a DictReader from the csv module, and then simply looping through. Each time through, this class is going to produce a line for us, and that line will actually be a dictionary composed of the appropriate fields for that particular line. Then, if we scroll down, what I'm going to do here is simply print out all of those values. So let's take a look at running this piece of code. And again, remember, we're using the csv module. Okay, so if we run this, the output we get (we'll just look at the second to last one here) is a dictionary composed of each of the labels that came from the first line of the file, together with the field value for each one of the fields on this particular row of the data file. And it seamlessly handles for us fields that may be quoted on a particular line, and other nuances that we might see in the CSV format, and conveniently stuffs everything into individual dictionaries for us. So, whenever you're working with CSV files in Python, it's best to use the csv module, because so many of the challenges of working with this type of data have already been solved for us.
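In code, that pattern is only a few lines. Here's a minimal sketch (the filename is an assumption for illustration):

    import csv

    def parse_csv(datafile):
        data = []
        with open(datafile, newline="") as f:   # Python 3; on Python 2,
            reader = csv.DictReader(f)          # open in "rb" mode instead
            for line in reader:
                # Each line is already a dictionary keyed by the header
                # row's field names, with quoting handled for us.
                data.append(line)
        return data

    for item in parse_csv("beatles-diskography.csv"):
        print(item)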

Intro to XLRD

Okay. Let's turn our attention now to working with Excel documents. What I'd like to use as an example here is data from the Electric Reliability Council of Texas (ERCOT). This is an organization that manages the flow of electricity to millions of Texas customers. They also provide a lot of publicly available datasets on things like load, or the amount of electricity that's used by their customers. This is one such example, for the calendar year 2013. A lot of organizations publish this type of data, and often they will publish it as Excel documents. What that means for us is that we have to have a way of programmatically working with Excel, because many times we'll want to process dozens or perhaps hundreds of files, all of which have been published as Excel.

What I'd like to introduce you to here is a Python module called XLRD. XLRD allows us to work with Excel files, whether in the old XLS format or the new XLSX format. This module will allow us to read in all of the data from an Excel workbook and work with it in any way that we need to in a Python program. There's also an XLWT module, which allows us to create Excel files, if we need to do that programmatically. Okay, so the first thing that I want to do, before we dive in and take a look at this example code, is actually go through and run it. What I've done with this program is produce output that illustrates a lot of the features that you might want to work with when using XLRD to process Excel data. So, a few of the things I'd like to illustrate here: how we can read our data from an Excel file entirely into a Python list and work with it; how we go about working with rows and columns and individual cells in an Excel file using the XLRD module; and then, finally, a little bit of information about dates. It's primarily because of the way that dates are represented in Excel that we need to talk about this. So let's go back and look at this code. Here we're simply specifying which file it is that we're going to load, and in this case, it's an XLS file, an old-format Microsoft Excel file. Then, the bulk of the work that we're doing here is actually done in the parse_file function. Alright, so this is the command we use to open a workbook, and note that we're using the same variable name here, okay? Then we need to specify which sheet we'd like to work with. So here we're selecting sheet zero, and you'll find that XLRD uses zero-based indexing throughout: the very first column is actually column zero, and the first row is actually row zero. Okay, and here's an example of a list comprehension; in this case, what we're doing is essentially looping through all of the rows and columns and reading all of that data into a Python list. What I'm doing here, then, is simply printing out the value at row three and column two of the list that we've just generated.

Now, if I scroll down a little bit, we can see that following the list comprehension piece, what we're doing here is looping through the entire sheet one row at a time, and then moving across the columns. And the way that I've set this up is that once we get to row 50, we're going to print out the values in that particular row, one column at a time. Again, this is a piece of code that does nothing more than illustrate the functions and methods that you'll want to work with when using XLRD. In the next piece of our example code, I'm illustrating how to work with rows and columns in XLRD. In this case, we're going to simply grab the number of rows for this particular sheet, and then we'll print that out. Next is an illustration of how to check the data type, or the value type, of a particular cell, using the cell_type method for objects of type sheet. Then cell_value: this actually gets the value that's stored in that cell as the appropriate Python value, whether it's a floating point value or something else. And then, finally, this is actually a pretty cool method: here, we can slice the values out of a particular column. What that means is that we could say, okay, I want three values from this particular column, and I want to start on row one, so actually the second row down. And I want to go through rows one, two, and three, up to row four but not including row four, and take those three values sliced out of that particular column. Okay. And then here we're doing something very similar to what we did before, which is checking the type of the value in a given cell, but in this case we have a date in that cell. So we're going to pull out the cell value. Now, it turns out that in the old Excel format, dates are represented simply as floating point numbers. So what we can do is use the XLRD xldate_as_tuple method to get us that time in a way that allows us to work with it as a date in Python.

Okay, let's go look at that output one more time. So here's our list comprehension: we're simply printing out the value at row 3, column 2 of the list into which we read all of the Excel data. Here we can see where I was looping through the rows and columns and printed out all of the values on row fifty. Okay, and then the number of rows in this sheet happens to be more than 7,200. Now note here that the type of the data in this cell is specified as 2. You can look up in the XLRD documentation exactly what all of these different type identifiers reference; in this case, it's a floating point number. We can see that the value we've got there is 1,036 and change, okay? And here's the output from having taken a slice out of column three. Okay, so here's that piece where we were working with dates. Now, XLRD does recognize that that particular cell holds a date, but here's the time, represented as a floating point number. What we can do is convert this to a Python datetime tuple using that XLRD method I showed you, and this gives us a much better representation of the data that we're interested in. So, in this case, what we've got is row one, column zero. So it's that very first cell at the top, okay? This one here, right? And note that what we're pulling out there as the value is, in fact, year 2013, January 1st, 1 AM.
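Putting the pieces of that walkthrough together, a condensed sketch of the parse_file function might look like this. The filename and the specific rows and columns are taken from the example, but treat them as assumptions of this sketch:

    import xlrd

    def parse_file(datafile):
        workbook = xlrd.open_workbook(datafile)
        sheet = workbook.sheet_by_index(0)      # sheet zero; xlrd is zero-based

        # Read the whole sheet into a Python list of lists.
        data = [[sheet.cell_value(r, col)
                 for col in range(sheet.ncols)]
                for r in range(sheet.nrows)]
        print(data[3][2])                       # value at row 3, column 2

        print(sheet.nrows)                      # number of rows in the sheet
        print(sheet.cell_type(3, 2))            # type code; 2 means a float
        print(sheet.cell_value(3, 2))           # the value, as a Python float

        # Slice three values out of column 3: rows 1, 2, and 3
        # (up to but not including row 4).
        print(sheet.col_values(3, start_rowx=1, end_rowx=4))

        # Dates are stored as floats in the old Excel format; convert
        # the cell at row 1, column 0 into a (y, m, d, h, min, s) tuple.
        exceltime = sheet.cell_value(1, 0)
        print(xlrd.xldate_as_tuple(exceltime, workbook.datemode))

        return data

    data = parse_file("2013_ERCOT_Hourly_Load_Data.xls")  # filename assumed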

Programming Quiz: Reading Excel Files

Okay, I hope you're enjoying the course so far. Here's an exercise that'll give you a chance to do some data wrangling using the XLRD module in Python. What I'd like you to do here is read through the ERCOT hourly load file, this file here, and report the timestamp, which is stored in this column, and the load for the min and max values in this column, as well as the average load, for the Coast region of Texas. Let's quickly take a look at the code provided. What I'd like you to look at first are the assertions here at the bottom. These give you an idea of the type of values you need to produce. What I'd like you to do here is find the max value in that column B, the Coast region, and then, for that max value, find the value in the column to the left, where the timestamp is stored, and report it as a tuple, just like we did when we were looking at our example of how to use XLRD. In order to complete this exercise, you'll be working in the parse_file function, and you need to pull out values that will allow you to complete this data dictionary right here, which is going to be returned.

Answer:

Great job making it this far into the course. Let's take a look at the solution to this XLRD exercise. So here's our parse_file function, and you can see that what we've done here is use that column-slicing trick: we've used the col_values method on sheet to pull out all of the values in column one, column one being the Coast column within the dataset. Then we're simply using max and min here on the list that we've pulled out, in order to get the max value and min value in that entire column. We're using the index method on lists to figure out where that max value is in the cv list. Now, because the data we have actually begins on row one of the spreadsheet rather than row zero, we've got to add a one here, so that within our spreadsheet we end up with the right position for that value, that is to say, the right row number. Okay, then what we're going to do here is, for the position, or the row, on which the max value appears, take the value in column zero. That will give us the max time. And then we're going to turn that time, which we'll get as a floating point number, into an actual time tuple. We'll do that same process with the min value. In generating the data dictionary that we'll end up returning from this function, we can simply plug in the real max time, max value, real min time, and min value, and then to get the average, we're simply going to calculate it right here and make it the value of this key, avgcoast. Okay, let's run this. I just want to point out that I'm using the pprint module here, so that we get our data printed out in a nicely structured way. And we can see that we get good values here. If you take a look at the spreadsheet itself and do something like sort it, you'll see that the maximum times and the maximum values are calculated correctly. One thing I want to note is that in our assertion here we're doing a rounding, so that we don't have any issues with floating point values being slightly off as we get further and further to the right of the decimal point.
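Here's a condensed sketch of that solution. The dictionary keys are assumptions of this sketch; the course's starter code defines its own:

    import xlrd

    def parse_file(datafile):
        workbook = xlrd.open_workbook(datafile)
        sheet = workbook.sheet_by_index(0)

        # All values in column 1, the Coast region, skipping the header row.
        cv = sheet.col_values(1, start_rowx=1, end_rowx=None)

        maxval = max(cv)
        minval = min(cv)

        # index() gives the position within cv; add 1 because the data
        # starts on row 1 of the spreadsheet, not row 0.
        maxpos = cv.index(maxval) + 1
        minpos = cv.index(minval) + 1

        # Timestamps live in column 0; convert the Excel floats
        # into Python time tuples.
        maxtime = xlrd.xldate_as_tuple(sheet.cell_value(maxpos, 0),
                                       workbook.datemode)
        mintime = xlrd.xldate_as_tuple(sheet.cell_value(minpos, 0),
                                       workbook.datemode)

        return {
            "maxtime": maxtime,
            "maxvalue": maxval,
            "mintime": mintime,
            "minvalue": minval,
            "avgcoast": sum(cv) / float(len(cv)),
        }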

Intro to JSON

As we just saw, we're limited in what we can represent in tabular data. In many situations, we have data items with fields that have sub-fields. As programmers, we're accustomed to this way of thinking: we have objects that have fields that reference other objects, and a lot of times those objects have several fields of their own. So here we're looking at The Beatles discography page on Wikipedia. This is the page we used to produce the data that we looked at earlier. Okay, so if we were to represent this data in tabular form, in a CSV file or Excel worksheet, we'd need to do some unwinding, given that we've got certifying authorities here and different types of certifications: gold record, platinum record, and so on. There's also this additional complicating factor that an individual record can be certified as some multiple of platinum. So this particular album is a multi-platinum record, four times platinum. Platinum means it sold a million copies; four times platinum means it sold more than four million copies. So if we want to represent this part of the dataset in tabular form, we would essentially need two columns for every certifying authority: one for the level of certification, and one for the multiplier. Aside from being tedious and error-prone, this is just a really unnatural way to represent this data. So, this is why the JSON standard has emerged for modeling data and as a means of transmitting data between systems. So, let's take a look at a way of representing this data in JSON. I'm just going to look at A Hard Day's Night. A Hard Day's Night is interesting because it was actually released on two different labels on two different dates. So let's take a look then at how we might do this in JSON.

All right. So what I've done is taken just the Hard Day's Night data and implemented it as a JSON object. You can see we have fields for title, artist, and releases. And as we scroll down through this data, you can see that releases is actually an array. In this case, it has two entries, because, if you remember, we have two different releases for this album: one on the United Artists label and one on the Parlophone label. What we've done here as well, because the chart positions are in reference to a specific release, is make them part of the appropriate release object within the releases array. So in this case, the peak chart position for the United Artists release of A Hard Day's Night in France was five, and in the UK it was one. We also need to associate the certifications with the release, because a different release will have different certifications. So here we've got the different certifying authorities, and for that RIAA certification, it has a multiplier of four, because it was actually a four-times-platinum album. And here is the release data for the other release, the one on Parlophone. So this is a way of representing the data in JSON. And it's important to note that JSON objects are just like dictionaries in Python and many other languages. There's a data type in most programming languages that's analogous to a JSON object; in Python, it happens to be the dictionary. Many other languages have dictionary- or map-like data types that are very similar to a JSON object. And most programming languages have arrays as well; in Python, they happen to be called lists. Okay, so let's just document some of what we have to remember about JSON here.
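First, though, here's a trimmed sketch of the kind of object we just walked through (the field names are abbreviated, and some values invented, for illustration):

    {
        "title": "A Hard Day's Night",
        "artist": "The Beatles",
        "releases": [
            {
                "label": "United Artists",
                "chart_positions": {"FR": 5, "UK": 1},
                "certifications": [
                    {"authority": "RIAA", "level": "Platinum", "multiplier": 4}
                ]
            },
            {
                "label": "Parlophone",
                "certifications": [
                    {"authority": "BPI", "level": "Gold"}
                ]
            }
        ]
    }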

Data Modeling in JSON

Let's talk a little bit about some of the specifics of data modeling in JSON. The first thing we should mention is that items may have different fields. So you'll notice that for the certifications that actually had a multiplier, we included that multiplier; in the case of A Hard Day's Night, it was four. But for those that don't actually have a multiplier, we simply left that field out. And as we just discussed, JSON objects may have nested objects; that is, fields may have values which are themselves JSON objects. We can also nest arrays. So a field may have an array as its value, and that array may have as its elements other JSON objects, individual values, or other arrays.

JSON Playground

One of the ways you're most likely to encounter JSON data is through the use of a web service. A web service is essentially a database that you can access using HTTP requests. With a web service, you formulate queries as URLs. So I'd like to take a look at a simple example using the musicbrainz.org site. The nice thing about this site is that we don't need an API key, so it'll be very easy for you to experiment with the code I'm going to give you. Okay, MusicBrainz is essentially a wiki, but one that provides a web service with access to its data. The type of data that MusicBrainz maintains is metadata on music, so we can ask questions about artists, labels, recordings, etcetera.

Now, the way that we query this site is by constructing a URL that has musicbrainz.org as a base but then specifies a particular entity, the type of data we're interested in getting back. And we can specify some additional parameters that allow us to be specific about exactly what features or what metadata we'd like to get back for a given entity, say, for an artist. Now, one thing that we're going to have to do in order to use this site, if we want to get specific information about an artist, is to know the artist's unique identifier, that is, the MusicBrainz unique identifier. And this theme of unique identifiers is something that's going to come up over and over again in this course in a variety of different ways. So the first thing that we're going to do is actually use their search interface, and we're just going to pass in a query that includes a search for the artist we're interested in finding some information about. We're going to process the results that come back from that query in order to get the id for that particular artist, and then we can request information about that specific artist using the id.

Now, lest you think I'm a one-trick pony, we're actually going to ask for data for a different artist, in this case Lucero. So, here's where we're doing our first query, and this is going to give us back a result set. We're going to process that result set in order to get back the id for Lucero, the MusicBrainz id for Lucero. Now, I'm not going to go into this in too much detail, but something I want to point out here is what's happening when we issue this query: the response we get back is JSON data. Using the json module in Python, we can read in that data, and it will simply be translated into the appropriate Python objects. A JSON object is equivalent to a Python dictionary. So, what we're accessing right here is that Python dictionary, and we're going to access the artists field of that dictionary. That field happens to have an array as its value, and I happen to know that the second object in that array is the one I'm actually interested in: the band Lucero from Memphis. Now we want the id for this particular artist. The value of an artist is another object. It was a JSON object in the results we got back from MusicBrainz, and in our case, because we're working in Python, the json module has translated it into another Python dictionary. So the id field contains the value we're actually interested in getting here. Okay, then what we're going to do is query this site again, this time specifically requesting information on this particular id, the id that we got back from our first query, or I should say, from processing our first query. Alright. So, then we're going to issue this second query, where we're actually going to ask for releases. And you can look above in the code here to see what's going on: essentially, what I've done is implement some convenience code that allows us to easily pass in the parameters that MusicBrainz is expecting in order to get the releases data back. And then the rest of this is simply printing out the results. So, let's take a look at what the results look like.

So, here's the result of having dived into the result set from that first query to get the artist we're interested in. Okay, here's the id we care about. And then, if we look down, there are a couple of things I want to point out. I've set up this code so that each time it makes a request out to this web service, it prints out the URL. Again, remember that the URL is how we specify our query to a web service. So, at musicbrainz.org, I'm specifying that I want to access the web service, version two of the web service; I believe, at least, that's what this 2 means. And then I'm saying the entity I'm interested in is artist, and here's the id for that artist. Now, what I want back is data formatted as JSON, and this particular parameter for MusicBrainz is where we specify exactly what metadata we'd like to receive in response. It's kind of a catch-all for everything; if we want more than one type of metadata, we simply concatenate them together here. In this case, I'm just interested in releases, okay? So then what I did in the code was process the response I got back from this query to pull out the very first release. And here is the object in Python, which is going to be a dictionary, that represents, or stores, the data for that first release. This happens to be a release entitled 2012-04-20 Webster Hall, New York. As it turns out, I was actually at this show. My wife had surprised me by flying in my best friend, who is also a big Lucero fan. Okay? So then what I did was use a list comprehension to process all of the releases and extract just the title for each one of them, and then I'm simply printing them all out here.
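Stripped down, the overall shape of that program might look something like this. This is a sketch using the requests library rather than the course's own helper code, and the User-Agent string is a placeholder; the endpoint and parameter names follow the MusicBrainz web service API:

    import requests

    BASE_URL = "http://musicbrainz.org/ws/2/"

    def query_site(path, params):
        # Ask the web service for JSON; requests decodes it into Python objects.
        params["fmt"] = "json"
        # MusicBrainz asks clients to identify themselves (placeholder value).
        headers = {"User-Agent": "wrangling-example/0.1"}
        r = requests.get(BASE_URL + path, params=params, headers=headers)
        r.raise_for_status()
        return r.json()

    # First query: search for the artist by name.
    results = query_site("artist/", {"query": "artist:Lucero"})

    # "artists" is an array; in the walkthrough, the second result is the
    # band from Memphis, and its "id" field holds the MusicBrainz id.
    artist_id = results["artists"][1]["id"]
    print("artist id:", artist_id)

    # Second query: look up that artist directly, asking for its releases.
    artist_data = query_site("artist/" + artist_id, {"inc": "releases"})
    print([release["title"] for release in artist_data["releases"]])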

So what I encourage you to do with this code is take a look at all of it, get a sense for how it works, and then experiment on your own with some queries. You can look for different artists, and you can use some of the conveniences that we've provided here to query for different types of metadata about a given artist. So just play with this part of the code, or work with it in the Python interpreter, and get a sense for the data that's here. What I really encourage you to do is print out the full result set and dive into exactly what type of data is coming back. It's all just a hierarchical data structure that has objects or arrays nested within other objects or arrays. Following this, we'll have a quiz where we actually ask you to answer some specific questions, so you will have to be able to query this web service correctly.

Quiz: Exploring JSON

Okay, it's time for a quiz. Play around with the programming exercise on JSON and answer the questions that we're about to look at. Now, you could certainly do this by looking at the data manually, or just by knowing things about these artists, or of course by doing a little Googling. But please try to do it by writing code, building on the code that you were given in the previous exercise. Okay, let's take a look at the questions for this quiz. The first question is: how many bands named "First Aid Kit" are there? Next, see if you can find the begin-area name for the band Queen. Then find the Spanish alias for The Beatles. Dive into the data and find the disambiguation for Nirvana. Now, the Nirvana I mean here is the band that Kurt Cobain served as frontman for. And finally, please answer the question: when was One Direction formed?

- How many bands named “First Aid Kit”? [2]
- Begin-area name for Queen? [London]
- Spanish alias for The Beatles? [Los Beatles]
- Nirvana disambiguation? [90s US grunge band]
- When was One Direction formed? [2010]

Answer:

Okay, let's take a look at the answers to these questions. For the “First Aid Kit” question, there are actually 2 bands named “First Aid Kit”, one of them being from Sweden; I forget where the other one is from. What's the begin-area name for Queen? This is actually a pretty easy one: it's London. The Spanish alias for The Beatles happens to be Los Beatles. There's actually a BBC recording where they talk about the Spanish alias for the Beatles; it's an interview with John, Paul, George, and Ringo. Okay, enough Beatles nerdery. Alright, the Nirvana disambiguation. There are actually a few bands named Nirvana; the Kurt Cobain Nirvana is disambiguated in MusicBrainz with the label "90s US grunge band". And when was One Direction formed? 2010.
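As an example of answering these with code rather than by eye, the first question can be checked with a search query along these lines (reusing the query_site sketch from earlier and filtering for the exact name):

    # Count the artists whose name is exactly "First Aid Kit".
    results = query_site("artist/", {"query": 'artist:"First Aid Kit"'})
    exact = [a for a in results["artists"] if a["name"] == "First Aid Kit"]
    print(len(exact))   # prints 2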

Check you out

In this lesson, we looked at extracting data from CSV files, Microsoft Excel, and JSON. We'll get into some more complex wrangling tasks in the next lesson, where we explore XML as a data format, and also screen scraping to get data out of HTML.