<<

Sphinx-Themes template Release 1

Jan 24, 2019

Section I:

1 Getting Started
2 Data, Science, and Design
3 Introduction to Data
4 Command Line and UNIX
5 Introduction to Python
6 Exploratory Data Analysis
7 Important Libraries
8 Mapping Data with Folium
9 Exploring Data
10 Investigating Stop and Frisk
11 Folium and Mapping
12 Lab 04: Scraping Reviews
13 Web Design
14 Flask App
15 FLASK App
16 Learning Application
17 Indices and tables

CHAPTER 1

Getting Started

Please be sure you have downloaded and installed the following programs/applications:
• the latest version of Anaconda from here
• if on Windows, Git for Windows from here (during installation, choose the option to add it to your PATH)
• a text editor such as Sublime Text or Atom
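Once these are installed, a quick check that the core pieces work (a minimal sketch; run it in a terminal Python session or a Jupyter cell):

# Confirm which Python is running and that the main libraries import cleanly.
import sys
print(sys.version)
import pandas
import matplotlib
print(pandas.__version__, matplotlib.__version__)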


1.1 Class Schedule

Class Number | Topic

MODULE I
1  Introduction and Overview
2  Pandas and EDA I
3  Pandas and EDA II
4  Plotting and Presentation Graphics

MODULE II
5  Linear Regression I: OLS
6  Linear Regression II: Ridge, Lasso, Elastic Net
7  Classification I: Logistic Regression and KNN
8  Classification II: Evaluation Metrics
9  Modeling Practice
10 Webscraping and NLTK I
11 Webscraping and NLTK II

MODULE III
12 Additional Models I
13 Additional Models II
14 Decision Trees
15 Forest Models
16 Deployment and Databases I
17 Deployment and Databases II

MODULE IV
18 Artificial Intelligence I: Perceptrons and Neural Networks
19 Artificial Intelligence II: Introduction to Deep Learning
20 Final Presentations

1.2 Assignments

Number | Topic | Due
1  Exploratory Data Analysis       6.27
2  Regression                      7.9
3  Classification                  7.16
4  Final Data Set and EDA          7.25
5  Final Data Set Initial Models   8.8
6  Final Data Set Models II        8.15
7  Final Presentation and Project  8.22


CHAPTER 2

Data, Science, and Design

INSTRUCTOR: Jacob Frias Koehler, PhD
EMAIL: [email protected]
LOCATION: University Center 300
TIME: Monday and Wednesday 12:00 - 1:15 pm
WEBSITE: https://github.com/jfkoehler/code-toolkit
SLACK CHANNEL: https://codeliberalarts.slack.com

This course aims to introduce students to the Python computing language. Students will write basic programs with Python, investigate the use of Python to perform data analysis (including machine learning and artificial intelligence), use Python to access and structure information from the web, and use Python to build and deploy web-based applications. The course will teach these skills through three projects: 1) an analysis of public policing data, 2) a sentiment analysis and topic modeling project using news articles and social media data, and 3) the design and deployment of a personal web application. No prior experience with the Python computing language is required; however, students are expected to own a laptop computer, or check one out from the University, and bring it with them to every class.

2.1 Course Requirements

• participate in class discussions on readings and computing work (20%)
• contribute to Slack each week (5%)
• complete weekly assignments (40%)
• complete three projects (35%)


2.1.1 Learning Outcomes

By successful completion of this course, students will be able to:
• Use Python to access, load, and analyze data in a variety of formats
• Use APIs to query databases and build datasets
• Use Python to implement standard machine learning algorithms in supervised and unsupervised learning problems
• Use Python to model with Artificial Neural Networks
• Use Python to build Web Applications
• Understand the Object Oriented Programming Style
• Discuss the historical context for important algorithms and methods in contemporary data science

More generally, we hope that students will be able to:
1. use computation as a tool to enhance their liberal arts education: to better analyze, communicate, create, and learn.
2. gain a broader understanding of the historical and social factors leading to the rise of coding.
3. work through the social and political implications of/embedded within computational technologies and develop an accompanying ethical framework.
4. think critically about the ways they and others interact with computation, including understanding its limits from philosophical, logical, mathematical, and public policy perspectives.
5. engage in project-based and collaborative learning that utilizes computational/algorithmic thinking.
6. understand the intrinsic relationship between the physical world, analog environments, and digital experiences.

2.2 Outline

This is a starting outline for the course, and is likely to change as we move through the semester.

2.3 Section I: Introduction to Python and Data

2.3.1 WEEK 1

• Introduction to Python: Foundations (data types: int, str, float, boolean)

2.3.2 Week 2

• Introduction to Python: Control Flow and Functions
• Introduction to Python: Collections

2.3.3 Week 3

• Introduction to Pandas I
• Introduction to Pandas II


2.3.4 Week 4

• Visualization with Bokeh I
• Visualization with Bokeh II

2.3.5 Week 5

Exploring Policing Data Project
• Mapping Workshop with Folium
• Presenting with Code

2.4 Section II: Prediction

2.4.1 Week 6

• Introduction to Machine Learning: Regression
• Introduction to Machine Learning: Classification

2.4.2 Week 7

• Machine Learning with Text I
• Machine Learning with Text II

2.4.3 Week 8

• Machine Learning with Images I
• Machine Learning with Images II

2.4.4 Week 9

• Introduction to Neural Networks I
• Introduction to Deep Learning I

2.4.5 Week 10

Predicting the Future
• Keras and PyTorch samples


2.5 Section III: Web Deployment

2.5.1 Week 11

• Intro to Flask I
• Intro to Flask II

2.5.2 Week 12

• Advanced Flask
• Advanced Flask

2.5.3 Week 13

FINAL PROJECT WORKSHOPS

2.5.4 Week 14

2.5.5 Week 15

PRESENTATIONS

2.6 Resources

The university provides many resources to help students achieve academic and artistic excellence. These resources include:
• University Libraries: http://library.newschool.edu
• University Learning Center: http://www.newschool.edu/learning-center
• University Disabilities Service: www.newschool.edu/student-disability-services/

In keeping with the university's policy of providing equal access for students with disabilities, any student with a disability who needs academic accommodations is welcome to meet with me privately. All conversations will be kept confidential. Students requesting any accommodations will also need to contact Student Disability Service (SDS). SDS will conduct an intake and, if appropriate, the Director will provide an academic accommodation notification letter for you to bring to me. At that point, I will review the letter with you and discuss these accommodations in relation to this course.

Student Ombuds: The Student Ombuds office provides students assistance in resolving conflicts, disputes, or complaints on an informal basis. This office is independent, neutral, and confidential.

2.6.1 Academic Honesty and Integrity

Compromising your academic integrity may lead to serious consequences, including (but not limited to) one or more of the following: failure of the assignment, failure of the course, academic warning, disciplinary probation, suspension from the university, or dismissal from the university. Students are responsible for understanding the University's policy on academic honesty and integrity and must make use of proper citations of sources for writing papers, creating, presenting, and performing their work, taking examinations, and doing research. It is the responsibility of students to learn the procedures specific to their discipline for correctly and appropriately differentiating their own work from that of others. The full text of the policy, including adjudication procedures, is found at http://www.newschool.edu/policies/

Resources regarding what plagiarism is and how to avoid it can be found on the Learning Center's website: http://www.newschool.edu/university-learning-center/avoiding-plagiarism.pdf

Intellectual Property Rights: http://www.newschool.edu/provost/accreditation-policies/
Grade Policies: http://www.newschool.edu/registrar/academic-policies/

2.6.2 Attendance

“Absences may justify some grade reduction and a total of four absences mandate a reduction of one letter grade for the course. More than four absences mandate a failing grade for the course, unless there are extenuating circumstances, such as the following: an extended illness requiring hospitalization or visit to a physician (with documentation); a family emergency, e.g. serious illness (with written explanation); observance of a religious holiday. The attendance and lateness policies are enforced as of the first day of classes for all registered students. If registered during the first week of the add/drop period, the student is responsible for any missed assignments and coursework. For significant lateness, the instructor may consider the tardiness as an absence for the day. Students failing a course due to attendance should consult with an academic advisor to discuss options. Divisional and/or departmental/program policies serve as minimal guidelines, but policies may contain additional elements determined by the faculty member.” Please note: Writing Faculty have a slightly different policy, please check with your chair.

Responsibility: Students are responsible for all assignments, even if they are absent. Late papers, failure to complete the readings assigned for class discussion, and lack of preparedness for in-class discussions and presentations will jeopardize your successful completion of this course.

Participation: Class participation is an essential part of class and includes: keeping up with reading, contributing meaningfully to class discussions, active participation in group work, and coming to class regularly and on time.

Student Course Ratings: During the last two weeks of the semester, students are asked to provide feedback for each of their courses through an online survey. They cannot view grades until providing feedback or officially declining to do so. Course evaluations are a vital space where students can speak about the learning experience. It is an important process which provides valuable data about the successful delivery and support of a course or topic to both the faculty and administrators. Instructors rely on course rating surveys for feedback on the course and teaching methods, so they can understand what aspects of the class are most successful in teaching students, and what aspects might be improved or changed in future. Without this information, it can be difficult for an instructor to reflect upon and improve teaching methods and course design. In addition, program/department chairs and other administrators review course surveys. Instructions are available online at http://www.newschool.edu/provost/course-evaluations-student-instructions.pdf.



CHAPTER 3

Introduction to Data

While many scientific investigations may begin with an abstract question, this is usually revised in terms of constraints on available information and resources. In this class, we will often be limited to available data in semi-structured or well-structured formats. Accordingly, we want to be good at asking two different kinds of questions: descriptive and inferential.

3.1 Example I: Ames Housing

This dataset represents information about houses in Ames, Iowa. Here is a link to the description of the dataset:
• Ames Housing Description
Your goal is to consider the different kinds of information available, and to generate at least two descriptive and two inferential questions about the dataset. Is there any data that you anticipate being problematic?
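A minimal starting sketch for this exercise, assuming the Ames data has been downloaded locally as a CSV (the filename below is hypothetical):

import pandas as pd

# Hypothetical local path; adjust to wherever you saved the file.
ames = pd.read_csv('data/ames_housing.csv')

# Descriptive starting points: what columns exist, how are they typed,
# and what do the numeric variables look like?
ames.info()
ames.describe()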

3.2 Example II: Donors Choose

A popular website for learning data science is kaggle.com. A recent competition involved a dataset from the organization Donors Choose, which supports grants to teachers. They framed the problem as follows:

Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
1. How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
2. How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
3. How to focus volunteer time on the applications that need the most assistance


The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

See the full data description here. Again, your goal is to identify at least two inferential and two descriptive questions around this dataset.

3.3 Example III: NYC Open Data

Many city and state agencies have begun providing access to data through APIs. One example is NYC Open Data, located here. For example, if we wanted to use a dataset with restaurant meal information, we can find the code necessary to import the data. Quickly, we have this up and running in a Jupyter notebook and are ready to explore.
• Load the libraries: pandas, matplotlib, and numpy

[1]: %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

• Load in the data as a DataFrame

[2]: nyc_311= pd.read_json('https://data.cityofnewyork.us/resource/xk2u-49gx.json')

[4]: nyc_311.head() [4]: address_type agency agency_name \ 0 NaN TLC Correspondence - Taxi and Limousine Commission 1 NaN DPR Department of Parks and Recreation 2 NaN DPR Department of Parks and Recreation 3 NaN DPR Department of Parks and Recreation 4 NaN DOT Department of Transportation

borough bridge_highway_direction city closed_date \ 0 Unspecified NaN NaN 2005-01-12T00:00:00.000 1 MANHATTAN NaN NEW YORK 2005-01-03T00:00:00.000 2 BROOKLYN NaN BROOKLYN 2005-01-07T00:00:00.000 3 BROOKLYN NaN BROOKLYN 2005-01-07T00:00:00.000 4 MANHATTAN NaN NaN 2005-01-04T00:00:00.000

community_board complaint_type created_date \ 0 0 Unspecified Taxi Compliment 2005-01-04T00:00:00.000 1 10 MANHATTAN Animal in a Park 2005-01-03T00:00:00.000 2 06 BROOKLYN Maintenance or Facility 2005-01-04T00:00:00.000 3 16 BROOKLYN Maintenance or Facility 2005-01-04T00:00:00.000 4 Unspecified MANHATTAN Highway Condition 2005-01-04T00:00:00.000

... school_number school_or_citywide_complaint school_phone_number \ 0 ... Unspecified NaN Unspecified 1 ... Unspecified NaN Unspecified (continues on next page)


(continued from previous page) 2 ... Unspecified NaN Unspecified 3 ... Unspecified NaN Unspecified 4 ... Unspecified NaN Unspecified

school_region school_state school_zip status street_name \ 0 Unspecified Unspecified Unspecified Closed NaN 1 Unspecified Unspecified Unspecified Closed NaN 2 Unspecified Unspecified Unspecified Closed NaN 3 Unspecified Unspecified Unspecified Closed NaN 4 Unspecified Unspecified Unspecified Closed NaN

taxi_pick_up_location unique_key 0 NaN 33827 1 NaN 33840 2 NaN 33845 3 NaN 33846 4 NaN 34152

[5 rows x 38 columns]

[6]: nyc_311.info() Int64Index: 1000 entries, 0 to 999 Data columns (total 38 columns): address_type 52 non-null object agency 1000 non-null object agency_name 1000 non-null object borough 1000 non-null object bridge_highway_direction 70 non-null object city 291 non-null object closed_date 998 non-null object community_board 1000 non-null object complaint_type 1000 non-null object created_date 1000 non-null object cross_street_1 136 non-null object cross_street_2 133 non-null object descriptor 1000 non-null object facility_type 990 non-null object ferry_direction 83 non-null object incident_address 224 non-null object incident_zip 220 non-null object intersection_street_1 224 non-null object intersection_street_2 224 non-null object landmark 10 non-null object location_type 621 non-null object park_borough 1000 non-null object park_facility_name 1000 non-null object resolution_action_updated_date 326 non-null object school_address 1000 non-null object school_city 1000 non-null object school_code 1000 non-null object school_name 1000 non-null object school_number 1000 non-null object school_or_citywide_complaint 26 non-null object school_phone_number 1000 non-null object school_region 1000 non-null object (continues on next page)


(continued from previous page) school_state 1000 non-null object school_zip 1000 non-null object status 1000 non-null object street_name 224 non-null object taxi_pick_up_location 125 non-null object unique_key 1000 non-null int64 dtypes: int64(1), object(37) memory usage: 304.7+ KB

[9]: #select a column nyc_311['complaint_type'] [9]: 0 Taxi Compliment 1 Animal in a Park 2 Maintenance or Facility 3 Maintenance or Facility 4 Highway Condition 5 Maintenance or Facility 6 Ferry Inquiry 7 DCA / DOH New License Application Request 8 Taxi Compliment 9 Street Sign - Damaged 10 Ferry Inquiry 11 Street Condition 12 Street Sign - Damaged 13 Taxi Compliment 14 Maintenance or Facility 15 DCA / DOH New License Application Request 16 Maintenance or Facility 17 Street Condition 18 Highway Condition 19 Highway Condition 20 Taxi Complaint 21 Animal in a Park 22 Taxi Complaint 23 Taxi Complaint 24 Highway Condition 25 Street Condition 26 Ferry Inquiry 27 Taxi Complaint 28 Highway Condition 29 Taxi Compliment ... 970 Street Condition 971 DOE Complaint or Compliment 972 Animal in a Park 973 Taxi Complaint 974 Street Condition 975 Maintenance or Facility 976 Traffic Signal Condition 977 Ferry Inquiry 978 Ferry Inquiry 979 Ferry Inquiry 980 Street Condition 981 Street Condition 982 Ferry Inquiry 983 Taxi Compliment (continues on next page)


(continued from previous page) 984 Animal in a Park 985 Taxi Complaint 986 DCA / DOH New License Application Request 987 Taxi Compliment 988 Street Light Condition 989 Taxi Compliment 990 DCA / DOH New License Application Request 991 Highway Condition 992 Street Condition 993 DCA / DOH New License Application Request 994 Taxi Compliment 995 Street Light Condition 996 DCA / DOH New License Application Request 997 Street Light Condition 998 Taxi Compliment 999 Street Light Condition Name: complaint_type, Length: 1000, dtype: object

[10]: #get value counts nyc_311['complaint_type'].value_counts() [10]: Street Condition 152 DCA / DOH New License Application Request 134 Taxi Complaint 122 Maintenance or Facility 109 Taxi Compliment 105 Ferry Inquiry 83 Animal in a Park 68 Street Sign - Damaged 65 Highway Condition 62 Violation of Park Rules 21 DOE Complaint or Compliment 21 Drinking 8 Food Establishment 7 Volunteer Placement Request 7 Street Light Condition 5 Homeless Encampment 5 Overgrown Tree/Branches 5 Found Property 3 Rodent 3 Discipline and Suspension 3 Sidewalk Condition 2 Traffic Signal Condition 2 Health and Safety 2 Public Payphone Complaint 1 Bus Stop Shelter Placement 1 TLC Medallion Request 1 Window Guard 1 Root/Sewer/Sidewalk Condition 1 Street Sign - Missing 1 Name: complaint_type, dtype: int64

[14]: nyc_311['borough'] =='BROOKLYN' [14]: 0 False 1 False 2 True (continues on next page)


(continued from previous page) 3 True 4 False 5 False 6 False 7 False 8 False 9 True 10 False 11 False 12 False 13 False 14 True 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 970 False 971 False 972 False 973 False 974 False 975 False 976 False 977 False 978 False 979 False 980 False 981 False 982 False 983 False 984 False 985 False 986 False 987 False 988 False 989 False 990 False 991 False 992 False 993 False 994 False 995 False 996 False 997 False 998 False (continues on next page)


(continued from previous page) 999 False Name: borough, Length: 1000, dtype: bool

[16]: bk= nyc_311[nyc_311['borough'] =='BROOKLYN']

[17]: bk['complaint_type'].value_counts() [17]: Maintenance or Facility 26 Street Sign - Damaged 19 Street Condition 17 Animal in a Park 13 Highway Condition 12 Taxi Complaint 4 Drinking 3 Violation of Park Rules 3 Overgrown Tree/Branches 2 Homeless Encampment 2 Sidewalk Condition 1 Food Establishment 1 Rodent 1 Window Guard 1 Name: complaint_type, dtype: int64

[18]: bkplot= bk['complaint_type'].value_counts()

[19]: plt.plot([1,3,2,5]) [19]: []

[20]: bkplot [20]: Maintenance or Facility 26 Street Sign - Damaged 19 Street Condition 17 Animal in a Park 13 Highway Condition 12 Taxi Complaint 4 Drinking 3 (continues on next page)


(continued from previous page) Violation of Park Rules 3 Overgrown Tree/Branches 2 Homeless Encampment 2 Sidewalk Condition 1 Food Establishment 1 Rodent 1 Window Guard 1 Name: complaint_type, dtype: int64

[23]: bkplot.shape [23]: (14,)

[25]: bkplot.index[:5] [25]: Index(['Maintenance or Facility', 'Street Sign - Damaged', 'Street Condition', 'Animal in a Park', 'Highway Condition'], dtype='object')

[26]: bkplot[:5] [26]: Maintenance or Facility 26 Street Sign - Damaged 19 Street Condition 17 Animal in a Park 13 Highway Condition 12 Name: complaint_type, dtype: int64

[29]: plt.bar(bkplot.index[:5], bkplot[:5])
plt.xticks(rotation=80)
[29]: ([0, 1, 2, 3, 4], )


[30]: plt.savefig('bkplainers.png')

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.cityofnewyork.us",
#                  MyAppToken,
#                  username="[email protected]",
#                  password="AFakePassword")

# First 31,000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("r3pg-q9c3", limit=31000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

[18]: results_df.info() RangeIndex: 31000 entries, 0 to 30999 Data columns (total 48 columns): calories 25196 non-null object calories_100g 12572 non-null object calories_text 119 non-null object carbohydrates 24572 non-null object carbohydrates_100g 12463 non-null object carbohydrates_text 316 non-null object cholesterol 23775 non-null object cholesterol_100g 12036 non-null object cholesterol_text 166 non-null object dietary_fiber 24081 non-null object dietary_fiber_100g 12329 non-null object dietary_fiber_text 424 non-null object food_category 31000 non-null object item_description 31000 non-null object item_name 31000 non-null object kids_meal 31000 non-null object limited_time_offer 31000 non-null object menu_item_id 31000 non-null object potassium 389 non-null object potassium_100g 350 non-null object protein 24558 non-null object protein_100g 12427 non-null object protein_text 242 non-null object regional 31000 non-null object restaurant 31000 non-null object (continues on next page)


(continued from previous page) restaurant_id 31000 non-null object restaurant_item_name 31000 non-null object saturated_fat 24258 non-null object saturated_fat_100g 12112 non-null object saturated_fat_text 44 non-null object serving_size 13199 non-null object serving_size_household 6604 non-null object serving_size_text 16 non-null object serving_size_unit 13214 non-null object shareable 31000 non-null object sodium 24949 non-null object sodium_100g 12554 non-null object sodium_text 71 non-null object sugar 23489 non-null object sugar_100g 12026 non-null object sugar_text 294 non-null object total_fat 24927 non-null object total_fat_100g 12511 non-null object total_fat_text 56 non-null object trans_fat 23251 non-null object trans_fat_100g 11695 non-null object trans_fat_text 3 non-null object year 31000 non-null object dtypes: object(48) memory usage: 11.4+ MB

[44]: results_df.groupby('restaurant')[['restaurant','item_description']].count().nlargest(10, 'item_description')['item_description']
[44]: restaurant
Starbucks 3849
Dunkin' Donuts 2850
Sheetz 1054
Sonic 982
Pizza Hut 972
Golden Corral 654
Jersey Mike's Subs 616
Dominos 557
Perkins 553
Jason's Deli 522
Name: item_description, dtype: int64

[22]: results_df.calories.describe(include='all') [22]: count 25196 unique 992 top 0 freq 924 Name: calories, dtype: object

[24]: results_df.calories.isnull().sum() [24]: 5804

[45]: import numpy as np
results_df[results_df.calories.isnull()][['restaurant','calories']].describe()


[45]: restaurant calories count 5804 0.0 unique 81 0.0 top Dunkin' Donuts NaN freq 1900 NaN

[48]: results_df[results_df.restaurant == "Dunkin' Donuts"][['item_description','calories']].head()
[48]: item_description calories
6304 Whole Wheat Bagel, Bagels, Bakery, Food 320
6340 Bismark, Donuts, Bakery 490
6461 Jelly Donut, Donuts, Bakery, Food 270
6480 Chocolate Frosted Cake Donut, Donuts, Bakery, ... 350
6530 Cinnamon Stick, Donuts, Bakery 380
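Note that the nutrition columns come back from the API as strings (dtype object), which is why describe() above reports counts, unique values, and top/freq rather than numeric summaries. A brief sketch of one way to get numeric statistics, if we are willing to coerce unparseable entries to missing values:

# Coerce the string column to numbers; anything unparseable becomes NaN.
calories = pd.to_numeric(results_df['calories'], errors='coerce')
calories.describe()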


CHAPTER 4

Command Line and UNIX

(Image: content/images/unix/UNIX-plate.jpg)

OBJECTIVES:
• Create, Delete, and Navigate files and directories in the Terminal
• Copy and Rename files in the Terminal
• Use Pipes and Filters to combine commands in the Terminal
• Locate Files, and things within Files, in the Terminal

Before the advent of the mouse, users could only interact with computers through typed text commands. There are many advantages to this way of interacting with your machine, and we will explore some of these today. I believe the terminal is important for the speed with which you can complete operations and for the control that you have over those operations.

4.1 Introduction to Shell

On a Mac, I type cmd + spacebar, type terminal, and hit enter. This will bring up a new terminal.
Navigating Files:
• cd: Change directory
• pwd: Print Working Directory
• ls: List Files
• ls -F: List Files with Flag


• ls -l: List Files with more info (long)
• ls -R: List Recursively
• ls -t: List by time modified
• ls -R -t: List recursively by time modified
• cd ..: Move one directory up
• cd ~: Move to your home directory

4.2 Files and Directories

• mkdir: Creates directory
• touch: Creates file
• nano: Simple text editor in terminal
• rm: Removes file (CAREFUL!)
• rm -r: Removes recursively
• rm -i thesis/quotations.txt: Removes responsibly
• rm -r -i thesis: Recursive responsibilities
• mv: Moves a file, also renames it
• cp: Copies a file

4.3 Investigating Files

• cat: Displays file contents
• head: Displays top rows of file
• tail: Displays bottom rows of file
• sort: Allows for different sorting
• wc: Word Count
• *: Wildcard meaning anything (*.pdf is all pdf files)
• ?: Match single element (?learning.pdf would match elearning.pdf, blearning.pdf, but not yearning.pdf)
• command > file: Directs output to file
• first | second: Pipe to combine and link commands
• grep: Get regular expression, used to find things in documents
• grep -w: restricts regular expression to matching word
• grep -n: adds line number to output of regular expression
• grep -n -w: combine numbering and word restriction
• grep -n -w -i: adds case insensitivity to the above example
• find .: Find is used to find files
• find . -type d: Find things that are directories


• find . -type f: Find things that are files
• find . -name '*.txt': Find things by name

4.4 Further Reading

• Software Carpentry: The UNIX Shell


CHAPTER 5

Introduction to Python

Objectives:
• Launch, Save, Reopen Jupyter notebook
• Use Markdown Cells
• Run Python Cells in Jupyter notebook
• Explain different data types with Python
• Convert between integers, floats, and strings
• Use built-in Python Functions
• Use help documentation for built-in functions
• Write basic functions with Python
• Use if statements in Python
• Use a for loop in Python

5.1 Markdown Cells

To change a cell to a markdown cell, you can type cmd + m + m. Also, you can use the toolbar in the Jupyter notebook to manually change the cell type. Markdown is designed to be a simple language for rendering text on the web. We will continue to use markdown and you will develop your vocabulary as we move. For now, here is a handy overview of the use of markdown by the creator: https://daringfireball.net/projects/markdown/syntax.

5.2 Python Cells

With code cells, we are able to write and execute Python code right out of the box. Below, we introduce an example with basic arithmetic. To execute the cell, we press shift + enter.


[1]: 3 + 5*7/4**2
[1]: 5.1875

If this were a value that we wanted to reuse, we can save it as a variable (a for example).

[2]:a=3+5 * 7/4 ** 2

[3]:a [3]: 5.1875

There are different kinds of things that we are working with in Python. Three important kinds, or types, are:
• int: Integer numbers
• float: Floating point decimal approximations
• str: Strings of symbols and words
Below I demonstrate each kind and verify this using the type() function. This will be an important consideration throughout our work: depending on the kind of thing we are working with, different operations are allowed. For example, we can add numbers to other numbers, but we cannot add strings to numbers. We can, however, coerce certain types; for example, if we wanted 3.14 to be considered as an integer, we would write int(b).

[4]: a = 4
b = 3.14
c = 'gucci'
d = int(b)

[5]: type(a) [5]: int

[6]: type(b) [6]: float

[7]: type(c) [7]: str

[8]: type(d) [8]: int

[9]:d [9]:3

[10]: type(a+ b) [10]: float

[11]: type(a+ a) [11]: int


[12]: type(a + c)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 type(a + c)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

5.3 Lists

Another important idea will be around collections of things. The first way we will examine storing collections of things with Python is the list object. We construct lists using the square brackets [] and separate elements with a comma.

[13]:a=[4,5,6,7,8]

[14]: type(a) [14]: list

We can get inside the list and select individual elements or collections of these elements. To do so, we will use the index or position of the element. It is important to note that Python begins indexing at 0, so the first element in the list corresponds with an index of 0.

[15]: a[0] [15]:4

To select a slice of the list, we use the [first_element_index: first_element_excluded_index] syntax. Again, note that the last index corresponds to the first element to not be included.

[16]: a[0:3] [16]: [4, 5, 6]

We can skip count using two colons.

[17]: a[::2] [17]: [4, 6, 8]

[18]: a[2::2] [18]: [6, 8]

5.4 Creating Lists and Loops

We have other ways for creating lists that automate the process. Rather than writing out each element, we can use a counter (the range() function) to step through indices, performing a specified operation each time through. A simple way to build such a list would be to start with an empty list, and loop through a set amount of times, adding an element to the list each time through.


The example below first demonstrates the appending of an element to an empty list using the .append() method. Next, we see a loop in action that prints out the value of the counter at each step. Finally, we use this to build a list of even integers.

[19]: # create a list with a loop
# and the range() function

# create empty list
a = []

[20]:a [20]:[]

[21]: #add to list with append a.append(0)

[22]:a [22]: [0]

[23]: # print the counter
for i in range(10):
    print(i)
0
1
2
3
4
5
6
7
8
9

[24]: # adjusting how we use the range
# function to start at 1 and count
# by two
for i in range(1, 20, 2):
    print(i)
1
3
5
7
9
11
13
15
17
19

[25]: # use a loop to build list of even integers
b = []
for i in range(10):
    b.append(2*i)


[26]:b [26]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

[27]: c = [3, 4, 2, 1, 'steve']
for i in c:
    print(i)
3
4
2
1
steve

[28]: # list comprehension
b = [2*i for i in range(10)]

[29]:b [29]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

[30]: # dynamic list that starts with 1,
# adds one to this to create the next element, and continues
# based on the counter
a = [1]
for i in range(10):
    next = a[i] + 1
    a.append(next)

[31]:a [31]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

5.5 Libraries

Most of the time, rather than writing things in base Python, we will use existing libraries that contain functions specific to the library's purpose. Two important libraries for us will be Pandas and Matplotlib. Pandas is a Python data structure library that works on arrays. Matplotlib is a plotting library that we will use for generating plots. To use the libraries, we need to import them. We will also abbreviate their names to lessen the amount of writing we need to do; the abbreviations are standard. In addition to these libraries, we will tell the Jupyter notebook to make our plots in the notebook itself, rather than in a separate window, using the %matplotlib inline magic command.

[32]: %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

In the example below, we create two lists using list comprehensions, make a DataFrame out of this, investigate the DataFrame, and finally plot the data using matplotlib. Note that when I want to do something using a pandas function, I preface it with pd and for matplotlib it’s plt.

[33]: l1 = [i for i in range(10)]
l2 = [(i/2)**i for i in range(10)]


[34]: # create a DataFrame from my first list
# provide my own name for the column
df = pd.DataFrame(l1, columns=['Column_1'])

[35]: df [35]: Column_1 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9

[36]: # add a second column to my DataFrame
# with the second list
df['Column_2'] = l2

[37]: # the head function shows us the
# top 5 rows of the dataframe
df.head()
[37]: Column_1 Column_2
0 0 1.000
1 1 0.500
2 2 1.000
3 3 3.375
4 4 16.000

[38]: df.head(2) [38]: Column_1 Column_2 0 0 1.0 1 1 0.5

[39]: df.tail(3) [39]: Column_1 Column_2 7 7 6433.929688 8 8 65536.000000 9 9 756680.642578

[40]: df.Column_1 [40]:0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 (continues on next page)


(continued from previous page) 9 9 Name: Column_1, dtype: int64

[41]: df[['Column_1','Column_2']].head() [41]: Column_1 Column_2 0 0 1.000 1 1 0.500 2 2 1.000 3 3 3.375 4 4 16.000

Now, if I want to plot this information, I can use the plt.plot() function. Here, I will provide the columns of my DataFrame as the x and y variables.

[42]: plt.plot(df['Column_1'], df['Column_2']) [42]: []

[43]: plt.plot(df.Column_1,'--^') [43]: []


[44]: plt.plot?

[45]: # pandas also allows plotting directly from
# the dataframe
df.plot()
[45]:

There are many different styles of plots available, here are just a few.

[46]: plt.scatter(df['Column_1'], df['Column_2']) [46]:

[47]: plt.bar(df['Column_1'], df['Column_2'])
plt.title('A Bar')
[47]: Text(0.5,1,'A Bar')


[48]: plt.boxplot(df['Column_2']);

[49]: df.plot() [49]:


[50]: df.plot(kind='density') [50]:

[51]: df.plot?

[52]: plt.savefig('plot.svg')

[53]: df.to_json('data/play.json')


CHAPTER 6

Exploratory Data Analysis

“Before approaching, with any degree of certainty, a science, or novels, or political speeches, or the œuvre of an author, or even a single book, the material with which one is dealing is, in its raw, neutral state, a population of events in the space of discourse in general. One is led therefore to the project of a pure description of discursive events as the horizon for the search for the unities that form within it.” - Michel Foucault, The Archaeology of Knowledge.

6.1 Digging in the Data

Objectives
• Ask Descriptive Questions of Data
• Use Pandas and Seaborn to Explore Dataset

For each example below, we want to understand something of an approximate solution to the following questions:
1. What is the Data about?
2. What are the Data Types?
3. Is there missing data (what do we do about it)?
4. How are the variables distributed?
5. How do the center and spread of variables compare?

[1]: %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

[2]: tips= sns.load_dataset('tips')


[3]: type(tips) [3]: pandas.core.frame.DataFrame

[5]: tips.info() RangeIndex: 244 entries, 0 to 243 Data columns (total 7 columns): total_bill 244 non-null float64 tip 244 non-null float64 sex 244 non-null category smoker 244 non-null category day 244 non-null category time 244 non-null category size 244 non-null int64 dtypes: category(4), float64(2), int64(1) memory usage: 7.2 KB

[6]: tips.to_csv('data/tips.csv')

[9]: df = pd.read_csv('data/tips.csv', index_col=0)

[11]: df.head() [11]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4


[7]: tips.head() [7]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4

[8]: tips.tail() [8]: total_bill tip sex smoker day time size 239 29.03 5.92 Male No Sat Dinner 3 240 27.18 2.00 Female Yes Sat Dinner 2 241 22.67 2.00 Male Yes Sat Dinner 2 242 17.82 1.75 Male No Sat Dinner 2 243 18.78 3.00 Female No Thur Dinner 2

[12]: tips.describe() [12]: total_bill tip size count 244.000000 244.000000 244.000000 mean 19.785943 2.998279 2.569672 (continues on next page)


(continued from previous page) std 8.902412 1.383638 0.951100 min 3.070000 1.000000 1.000000 25% 13.347500 2.000000 2.000000 50% 17.795000 2.900000 2.000000 75% 24.127500 3.562500 3.000000 max 50.810000 10.000000 6.000000

[12]: tips.describe(include='all') [12]: total_bill tip sex smoker day time size count 244.000000 244.000000 244 244 244 244 244.000000 unique NaN NaN 2 2 4 2 NaN top NaN NaN Male No Sat Dinner NaN freq NaN NaN 157 151 87 176 NaN mean 19.785943 2.998279 NaN NaN NaN NaN 2.569672 std 8.902412 1.383638 NaN NaN NaN NaN 0.951100 min 3.070000 1.000000 NaN NaN NaN NaN 1.000000 25% 13.347500 2.000000 NaN NaN NaN NaN 2.000000 50% 17.795000 2.900000 NaN NaN NaN NaN 2.000000 75% 24.127500 3.562500 NaN NaN NaN NaN 3.000000 max 50.810000 10.000000 NaN NaN NaN NaN 6.000000

[15]: tips.values [15]: array([[16.99, 1.01, 'Female', ..., 'Sun', 'Dinner', 2], [10.34, 1.66, 'Male', ..., 'Sun', 'Dinner', 3], [21.01, 3.5, 'Male', ..., 'Sun', 'Dinner', 3], ..., [22.67, 2.0, 'Male', ..., 'Sat', 'Dinner', 2], [17.82, 1.75, 'Male', ..., 'Sat', 'Dinner', 2], [18.78, 3.0, 'Female', ..., 'Thur', 'Dinner', 2]], dtype=object)

[16]: tips['sex'].value_counts() [16]: Male 157 Female 87 Name: sex, dtype: int64

[18]: tips[['total_bill','tip']].mean() [18]: total_bill 19.785943 tip 2.998279 dtype: float64

[14]: tips.groupby('smoker').mean() [14]: total_bill tip size smoker Yes 20.756344 3.008710 2.408602 No 19.188278 2.991854 2.668874

[21]: tips.groupby(['smoker','day']).mean() [21]: total_bill tip size smoker day Yes Thur 19.190588 3.030000 2.352941 Fri 16.813333 2.714000 2.066667 (continues on next page)


(continued from previous page) Sat 21.276667 2.875476 2.476190 Sun 24.120000 3.516842 2.578947 No Thur 17.113111 2.673778 2.488889 Fri 18.420000 2.812500 2.250000 Sat 19.661778 3.102889 2.555556 Sun 20.506667 3.167895 2.929825

Measures of Center: mean, median, mode
Measures of Spread: range, standard deviation

[22]: tips.total_bill.mean() [22]: 19.785942622950824

[23]: tips.total_bill.median() [23]: 17.795

[24]: tips.total_bill.max() [24]: 50.81

[25]: tips.total_bill.min() [25]: 3.07

[4]: def range_finder(df, col):
    r = df[col].max() - df[col].min()
    print("The range for the column", col, "is", r)

[5]: range_finder(tips,'total_bill') The range for the column total_bill is 47.74

[35]: tips.total_bill.std() [35]: 8.902411954856856
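The mode, listed above as a measure of center, is not computed in these cells; a quick sketch:

# Mode: the most frequently occurring value(s); pandas returns a Series
# because there can be ties.
tips['day'].mode()
tips['size'].mode()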

[46]: plt.hist(tips.total_bill) [46]: (array([ 7., 42., 68., 51., 31., 19., 12., 7., 3., 4.]), array([ 3.07 , 7.844, 12.618, 17.392, 22.166, 26.94 , 31.714, 36.488, 41.262, 46.036, 50.81 ]), )


[49]: tips.info() RangeIndex: 244 entries, 0 to 243 Data columns (total 7 columns): total_bill 244 non-null float64 tip 244 non-null float64 sex 244 non-null category smoker 244 non-null category day 244 non-null category time 244 non-null category size 244 non-null int64 dtypes: category(4), float64(2), int64(1) memory usage: 7.2 KB

PROBLEM: Create and describe distributions for the other variables that warrant a histogram.

[6]: plt.hist(tips.size) [6]: (array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]), array([1707.5, 1707.6, 1707.7, 1707.8, 1707.9, 1708. , 1708.1, 1708.2, 1708.3, 1708.4, 1708.5]), )
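Note that tips.size in the cell above refers to the DataFrame's size attribute (number of rows times number of columns, 1708), not the size column, which is why the histogram collapses into a single bin. A corrected sketch selects the column explicitly:

# Histogram of party size; bracket indexing avoids the .size attribute.
plt.hist(tips['size'])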


[11]: plt.bar(tips.sex, tips.total_bill) [11]:

[12]: plt.bar(tips.sex, tips.tip) [12]:


[8]: plt.hist(tips.tip) [8]: (array([41., 79., 66., 27., 19., 5., 4., 1., 1., 1.]), array([ 1. , 1.9, 2.8, 3.7, 4.6, 5.5, 6.4, 7.3, 8.2, 9.1, 10. ]), )

[9]: plt.hist(tips.day) [9]: (array([19., 0., 0., 87., 0., 0., 76., 0., 0., 62.]), array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ]), )


PROBLEM: Investigate the data set provided below. Can you ask some basic questions about the variables?

[13]: titanic= sns.load_dataset('titanic')

[14]: titanic.info() RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns): survived 891 non-null int64 pclass 891 non-null int64 sex 891 non-null object age 714 non-null float64 sibsp 891 non-null int64 parch 891 non-null int64 fare 891 non-null float64 embarked 889 non-null object class 891 non-null category who 891 non-null object adult_male 891 non-null bool deck 203 non-null category embark_town 889 non-null object alive 891 non-null object alone 891 non-null bool dtypes: bool(2), category(2), float64(2), int64(4), object(5) memory usage: 80.6+ KB

[16]: plt.bar(titanic['class'], titanic.survived) [16]:



6.2 Quantities and Categories

It is often helpful to understand how the data is distributed across categorical classes. For example, in our tips data, we can explore if the tip distribution varies by smoker, gender, or day of the week. To do so, we will use a variety of plots from seaborn. We want to make a statement about what these visualizations are telling us in each plot to gain insight into larger patterns in the dataset.

[51]: sns.stripplot('smoker','total_bill', data= tips) [51]:


[52]: sns.stripplot('day','total_bill', data= tips) [52]:

[53]: sns.stripplot('day','total_bill', data= tips, jitter= True) [53]:


[55]: #swarmplot avoids overlapping points sns.swarmplot('day','total_bill', hue='sex', data= tips) [55]:

[56]: sns.boxplot('day','total_bill', hue='time', data= tips) [56]:


[57]: sns.violinplot('total_bill','day', hue='time', data= tips) [57]:

[58]: sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True) [58]:


[60]: sns.barplot('day','total_bill', hue='smoker', data= tips) [60]:

[61]: sns.countplot('sex', data= tips) [61]:


PROBLEM Use the Titanic dataset and your earlier description to generate plots as an initial exploratory analysis of the data.
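One possible starting point for this problem (a sketch, not a full solution; the column names come from the titanic.info() output above):

# Distribution of a numeric variable
plt.figure()
plt.hist(titanic['age'].dropna())

# Counts of a categorical variable
plt.figure()
sns.countplot('class', data=titanic)

# A quantity broken out across categories
plt.figure()
sns.boxplot('class', 'fare', data=titanic)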


CHAPTER 7

Important Libraries

Objectives:
• Use pandas to load and explore data
• Use matplotlib to plot data
• Use help functions and documentation to better understand library functions

7.1 Pandas

(Image: content/images/pandas_logo.png)

Our main use of Pandas is for its data structure – the DataFrame. You can think of the DataFrame as similar to a spreadsheet program where there are rows and columns of information. Here, we will create and examine the DataFrame in a few ways.

[1]:% matplotlib inline import matplotlib.pyplot as plt import pandas as pd

[2]: #create DataFrame from list a=[1,2,3,4]

[3]: df= pd.DataFrame(a)

[5]: df.head(2)


[5]:    0
0  1
1  2

[6]: #create from dictionary a={'data':[1,2,3,4]}

[7]:a [7]: {'data': [1, 2, 3, 4]}

[8]: a['data'] [8]: [1, 2, 3, 4]

[9]: a['data'][0] [9]:1

[10]: df= pd.DataFrame(a)

[11]: df [11]: data 0 1 1 2 2 3 3 4

[12]: # from a local file
!ls data/gapminder_data/
gapminder_all.csv          gapminder_gdp_asia.csv
gapminder_gdp_africa.csv   gapminder_gdp_europe.csv
gapminder_gdp_americas.csv gapminder_gdp_oceania.csv

[13]: #from local file df= pd.read_csv('data/gapminder_data/gapminder_gdp_europe.csv')

[35]: #info df.info() RangeIndex: 30 entries, 0 to 29 Data columns (total 13 columns): country 30 non-null object gdpPercap_1952 30 non-null float64 gdpPercap_1957 30 non-null float64 gdpPercap_1962 30 non-null float64 gdpPercap_1967 30 non-null float64 gdpPercap_1972 30 non-null float64 gdpPercap_1977 30 non-null float64 gdpPercap_1982 30 non-null float64 gdpPercap_1987 30 non-null float64 gdpPercap_1992 30 non-null float64 gdpPercap_1997 30 non-null float64 (continues on next page)


(continued from previous page) gdpPercap_2002 30 non-null float64 gdpPercap_2007 30 non-null float64 dtypes: float64(12), object(1) memory usage: 3.1+ KB

[36]: #columns df.columns [36]: Index(['country', 'gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'], dtype='object')

[15]: df.columns[::2] [15]: Index(['country', 'gdpPercap_1957', 'gdpPercap_1967', 'gdpPercap_1977', 'gdpPercap_1987', 'gdpPercap_1997', 'gdpPercap_2007'], dtype='object')

[16]: #head and tail df.head(6) [16]: country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \ 0 Albania 1601.056136 1942.284244 2312.888958 1 Austria 6137.076492 8842.598030 10750.721110 2 Belgium 8343.105127 9714.960623 10991.206760 3 Bosnia and Herzegovina 973.533195 1353.989176 1709.683679 4 Bulgaria 2444.286648 3008.670727 4254.337839 5 Croatia 3119.236520 4338.231617 5477.890018

gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \ 0 2760.196931 3313.422188 3533.003910 3630.880722 1 12834.602400 16661.625600 19749.422300 21597.083620 2 13149.041190 16672.143560 19117.974480 20979.845890 3 2172.352423 2860.169750 3528.481305 4126.613157 4 5577.002800 6597.494398 7612.240438 8224.191647 5 6960.297861 9164.090127 11305.385170 13221.821840

gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \ 0 3738.932735 2497.437901 3193.054604 4604.211737 1 23687.826070 27042.018680 29095.920660 32417.607690 2 22525.563080 25575.570690 27561.196630 30485.883750 3 4314.114757 2546.781445 4766.355904 6018.975239 4 8239.854824 6302.623438 5970.388760 7696.777725 5 13822.583940 8447.794873 9875.604515 11628.388950

gdpPercap_2007 0 5937.029526 1 36126.492700 2 33692.605080 3 7446.298803 4 10680.792820 5 14619.222720

[17]: df.tail(3)


[17]: country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 \ 27 Switzerland 14734.232750 17909.489730 20431.092700 28 Turkey 1969.100980 2218.754257 2322.869908 29 United Kingdom 9979.508487 11283.177950 12477.177070

gdpPercap_1967 gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 \ 27 22966.144320 27195.11304 26982.290520 28397.715120 28 2826.356387 3450.69638 4269.122326 4241.356344 29 14142.850890 15895.11641 17428.748460 18232.424520

gdpPercap_1987 gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 \ 27 30281.704590 31871.530300 32135.323010 34480.957710 28 5089.043686 5678.348271 6601.429915 6508.085718 29 21664.787670 22705.092540 26074.531360 29478.999190

gdpPercap_2007 27 37506.419070 28 8458.276384 29 33203.261280

[39]: #describe df.describe() [39]: gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 \ count 30.000000 30.000000 30.000000 30.000000 mean 5661.057435 6963.012816 8365.486814 10143.823757 std 3114.060493 3677.950146 4199.193906 4724.983889 min 973.533195 1353.989176 1709.683679 2172.352423 25% 3241.132406 4394.874315 5373.536612 6657.939047 50% 5142.469716 6066.721495 7515.733738 9366.067033 75% 7236.794919 9597.220820 10931.085347 13277.182057 max 14734.232750 17909.489730 20431.092700 22966.144320

gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 \ count 30.000000 30.000000 30.000000 30.000000 mean 12479.575246 14283.979110 15617.896551 17214.310727 std 5509.691411 5874.464896 6453.234827 7482.957960 min 2860.169750 3528.481305 3630.880722 3738.932735 25% 9057.708095 10360.030300 11449.870115 12274.570680 50% 12326.379990 14225.754515 15322.824720 16215.485895 75% 16523.017127 19052.412163 20901.729730 23321.587723 max 27195.113040 26982.290520 28397.715120 31540.974800

gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007 count 30.000000 30.000000 30.000000 30.000000 mean 17061.568084 19076.781802 21711.732422 25054.481636 std 9109.804361 10065.457716 11197.355517 11800.339811 min 2497.437901 3193.054604 4604.211737 5937.029526 25% 8667.113214 9946.599306 11721.851483 14811.898210 50% 17550.155945 19596.498550 23674.863230 28054.065790 75% 25034.243045 27189.530312 30373.363307 33817.962533 max 33965.661150 41283.164330 44683.975250 49357.190170

[40]: #columns df['gdpPercap_1962'].head() [40]: 0 2312.888958 1 10750.721110 (continues on next page)


(continued from previous page) 2 10991.206760 3 1709.683679 4 4254.337839 Name: gdpPercap_1962, dtype: float64

[41]: #multiple columns df[['gdpPercap_1962','gdpPercap_1967']].mean() [41]: gdpPercap_1962 8365.486814 gdpPercap_1967 10143.823757 dtype: float64

7.2 Plotting from DataFrame

We can plot the data in our DataFrame in a similar manner. Maybe we are interested in comparing the distributions of GDP across the 1960s. We can do this by creating a histogram of each column. In creating these plots, we will work directly from the DataFrame. In practice, I tend to use whatever works fastest for generating plots and go back through with either Matplotlib or Seaborn afterwards to clean things up.

[54]: df[['gdpPercap_1962','gdpPercap_1967']].plot(kind='hist', alpha= 0.3) [54]:

[55]: df[['gdpPercap_1962','gdpPercap_1967']].plot(kind='density') [55]:


[73]: df.plot(legend= False, figsize=(10,5), title='GDPs in Europe') [73]:
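As mentioned above, a quick plot generated from the DataFrame can be cleaned up afterwards with matplotlib; a small sketch:

# df.plot returns a matplotlib Axes object that we can keep polishing.
ax = df.plot(legend=False, figsize=(10, 5), title='GDPs in Europe')
ax.set_xlabel('Country (row index)')
ax.set_ylabel('GDP per capita')
plt.savefig('gdp_europe.png')  # hypothetical output filename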

7.3 Further Reading

• pandas tutorial: http://pandas.pydata.org/pandas-docs/stable/10min.html#min • pandas cheatsheet: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf • matplotlib tutorials: https://matplotlib.org/tutorials/index.html

CHAPTER 8

Mapping Data with Folium

Below, we will go through a brief introduction to the Folium library. This is a nice way to build interactive visualizations. We will be executing these in Jupyter notebooks; however, they are easily output as .html files ready to be served. To begin, let's make sure we have folium installed.
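If folium is not already available, it can typically be installed from a notebook cell or the terminal (a sketch; the exact command depends on your setup):

# In a Jupyter cell (assumes pip is on your PATH); conda users can run
# `conda install -c conda-forge folium` instead.
!pip install folium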

[1]: import folium
import pandas as pd

We can make basic maps centered at any geolocation. For example, below we create a basic map of Portland, Oregon.

[2]:m= folium.Map(location=[45.5236,-122.6750])

[3]:m


[3]:

[18]:m.save('index_map.html')

We can add arguments that include changing the style of the map and the initial zoom level.

[4]: folium.Map(
    location=[45.5236, -122.6750],
    tiles='Stamen Toner',
    zoom_start=13
)
[4]:

We can use the popup argument to include information to be displayed at specified marker locations.


[5]: tooltip = 'Click me!'
m = folium.Map(
    location=[45.372, -121.6972],
    zoom_start=12,
    tiles='Stamen Terrain'
)

folium.Marker([45.3288, -121.6625], popup='Mt. Hood Meadows').add_to(m)
folium.Marker([45.3311, -121.7113], popup='Timberline Lodge').add_to(m)
m
[5]:

We can even include markdown syntax and icons.

[6]: m = folium.Map(
    location=[45.372, -121.6972],
    zoom_start=12,
    tiles='Stamen Terrain'
)

folium.Marker(
    location=[45.3288, -121.6625],
    popup='Mt. Hood Meadows',
    icon=folium.Icon(icon='cloud')
).add_to(m)

folium.Marker(
    location=[45.3311, -121.7113],
    popup='Timberline Lodge',
    icon=folium.Icon(color='green')
).add_to(m)

folium.Marker(
    location=[45.3300, -121.6823],
    popup='Some Other Location',
    icon=folium.Icon(color='red', icon='info-sign')
).add_to(m)
[6]:

[7]:m [7]:

We can manually control radii for markers of interest. Below, we plot two circles at specific locations.

[8]: m = folium.Map(
    location=[45.5236, -122.6750],
    tiles='Stamen Toner',
    zoom_start=13
)

folium.Circle(
    radius=100,
    location=[45.5244, -122.6699],
    popup='The Waterfront',
    color='crimson',
    fill=False,
).add_to(m)

folium.CircleMarker(
    location=[45.5215, -122.6261],
    radius=50,
    popup='Laurelhurst Park',
    color='#3186cc',
    fill=True,
    fill_color='#3186cc'
).add_to(m)

[8]:

[9]:m [9]:

8.1 Mapping Bike Data

Now, we will use a dataset from NYC’s citibike data. Our goal is to compare incoming and outgoing traffic at given stations depending on the time of day.

[10]: folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=13,
                        tiles="CartoDB dark_matter")
marker = folium.CircleMarker(location=[40.738, -73.98])
marker.add_to(folium_map)


[10]:

[11]: folium_map [11]:

[12]: bikes = pd.read_csv('../data/201306-citibike-tripdata.zip', compression='zip')
      ---------------------------------------------------------------------------
      ValueError: ('Multiple files found in compressed zip file %s',
      "['201306-citibike-tripdata.csv', '__MACOSX/', '__MACOSX/._201306-citibike-tripdata.csv']")

The read fails because the zip archive contains macOS metadata entries alongside the CSV, so pandas cannot tell which member of the archive to read.
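
One way around this is to open the specific CSV member of the archive with the standard-library zipfile module and hand the file object to pandas; a sketch using the member name reported in the error:

[ ]: import zipfile

     with zipfile.ZipFile('../data/201306-citibike-tripdata.zip') as z:
         with z.open('201306-citibike-tripdata.csv') as f:
             bikes = pd.read_csv(f)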

[ ]: bikes.info()

[ ]: bikes['starttime'] = pd.to_datetime(bikes['starttime'])
     bikes['stoptime'] = pd.to_datetime(bikes['stoptime'])
     bikes['hour'] = bikes['starttime'].map(lambda x: x.hour)
     bikes['ehour'] = bikes['stoptime'].map(lambda x: x.hour)

[13]: bikes.head()

[14]: locations = bikes.groupby('start station id').first()

[15]: locations = locations.loc[:, ["start station latitude", "start station longitude",
                                    "start station name"]]

[16]: subset = bikes[bikes["hour"] == 10]

[17]: dept_counts = subset.groupby("start station id").count()

(In the rendered notebook, each of these cells raised NameError: name 'bikes' is not defined, because the read above failed; once the zip is read correctly they run as intended.)

[ ]: dept_counts = dept_counts.iloc[:, [0]]

[ ]: dept_counts.columns = ["Departure Counts"]

8.2 Problem

Repeat the above for arrivals, in anticipation of joining the two for our map.

[ ]: bikes['hour'] = bikes['starttime'].map(lambda x: x.hour)
     locations = locations.loc[:, ["start station latitude", "start station longitude",
                                   "start station name"]]
     subset = bikes[bikes["hour"] == 18]
     dept_counts = subset.groupby("start station id").count()
     dept_counts = dept_counts.iloc[:, [0]]
     dept_counts.columns = ["Departure Counts"]

     locations2 = bikes.groupby('end station id').first()
     locations2 = locations2.loc[:, ["end station latitude", "end station longitude",
                                     "end station name"]]
     subset = bikes[bikes["ehour"] == 18]
     arr_counts = subset.groupby("end station id").count()
     arr_counts = arr_counts.iloc[:, [0]]
     arr_counts.columns = ["Arrival Counts"]

[ ]: trip_counts = dept_counts.join(locations).join(arr_counts)

[ ]: trip_counts.head()

[ ]: for index, row in trip_counts.iterrows():
         net_departures = (row["Departure Counts"] - row["Arrival Counts"])
         # note: radius is negative when arrivals exceed departures;
         # wrapping this in abs() may be needed for those stations to render
         radius = net_departures / 7

         if net_departures > 0:
             color = "#E37222"  # tangerine
         else:
             color = "#0A8A9F"  # teal

         folium.CircleMarker(location=(row["start station latitude"],
                                       row["start station longitude"]),
                             radius=radius,
                             color=color,
                             fill=True).add_to(folium_map)


[ ]: folium_map

[ ]: popup_text = """{}
total departures: {}
total arrivals: {}
net departures: {}"""

     # format the template (here using the last row from the loop above;
     # below we format it for each station inside the loop)
     popup_text = popup_text.format(row["start station name"],
                                    row["Departure Counts"],
                                    row["Arrival Counts"],
                                    net_departures)


[ ]: for index, row in trip_counts.iterrows():
         net_departures = (row["Departure Counts"] - row["Arrival Counts"])
         radius = net_departures / 7

         if net_departures > 0:
             color = "#E37222"  # tangerine
         else:
             color = "#0A8A9F"  # teal

         # build the popup text for this station
         popup_text = """{}
total departures: {}
total arrivals: {}
net departures: {}""".format(row["start station name"],
                             row["Departure Counts"],
                             row["Arrival Counts"],
                             net_departures)

         folium.CircleMarker(location=(row["start station latitude"],
                                       row["start station longitude"]),
                             radius=radius,
                             color=color,
                             fill=True,
                             popup=popup_text).add_to(folium_map)

[ ]: folium_map

8.3 PROBLEM

Compare this map to the one for when people are leaving work. Do you see what you expect? What does this tell you about movement in the city?


[ ]: locations2 = bikes.groupby('end station id').first()
     locations2 = locations2.loc[:, ["end station latitude", "end station longitude",
                                     "end station name"]]
     subset = bikes[bikes["hour"] == 10]
     arr_counts = subset.groupby("end station id").count()
     arr_counts = arr_counts.iloc[:, [0]]
     arr_counts.columns = ["Arrival Counts"]

CHAPTER 9

Exploring Data

GOAL:

• Load and describe the dataset, variables, and variable types with .info(), .head(), .columns
• Explore categorical and quantitative features using basic measures of center and spread with .describe()
• Filter values in a DataFrame based on logical conditions
• Split, Apply, Combine using the groupby method
• Use histograms, boxplots, barplots, and countplots to investigate distributions of quantitative variables

9.1 Load and Describe

Below, we have a dataset from NYC Open Data related to police stop-and-frisk incidents. First, we load the dataset using the pd.read_csv() function. Then, using the .info() method, we can examine the different features, their data types, and any missing values in the data.


[1]: %matplotlib inline
     import matplotlib.pyplot as plt
     import pandas as pd
     import numpy as np
     import seaborn as sns

[2]: # read in the data
     df = pd.read_csv('../data/sqf-2017.csv')

[3]: df.info() RangeIndex: 11629 entries, 0 to 11628 Data columns (total 86 columns): STOP_FRISK_ID 11629 non-null int64 STOP_FRISK_DATE 11629 non-null object STOP_FRISK_TIME 11624 non-null object YEAR2 11629 non-null int64 MONTH2 11629 non-null object DAY2 11629 non-null object STOP_WAS_INITIATED 11629 non-null object SPRINT_NUMBER 11202 non-null object RECORD_STATUS_CODE 11629 non-null object ISSUING_OFFICER_RANK 11629 non-null object ISSUING_OFFICER_COMMAND_CODE 11629 non-null int64 SUPERVISING_OFFICER_RANK 11629 non-null object SUPERVISING_OFFICER_COMMAND_CODE 11629 non-null int64 SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED 11629 non-null object LOCATION_IN_OUT_CODE 11629 non-null object JURISDICTION_CODE 11629 non-null object JURISDICTION_DESCRIPTION 11629 non-null object OBSERVED_DURATION_MINUTES 11629 non-null int64 SUSPECTED_CRIME_DESCRIPTION 11629 non-null object STOP_DURATION_MINUTES 11629 non-null int64 OFFICER_EXPLAINED_STOP_FLAG 11629 non-null object OFFICER_NOT_EXPLAINED_STOP_DESCRIPTION 11628 non-null object OTHER_PERSON_STOPPED_FLAG 11629 non-null object SUSPECT_ARRESTED_FLAG 11629 non-null object SUSPECT_ARREST_NUMBER 11629 non-null object SUSPECT_ARREST_OFFENSE 11629 non-null object SUMMONS_ISSUED_FLAG 11629 non-null object SUMMONS_NUMBER 11629 non-null object SUMMONS_OFFENSE_DESCRIPTION 11629 non-null object OFFICER_IN_UNIFORM_FLAG 11629 non-null object ID_CARD_IDENTIFIES_OFFICER_FLAG 11629 non-null object SHIELD_IDENTIFIES_OFFICER_FLAG 11629 non-null object VERBAL_IDENTIFIES_OFFICER_FLAG 11629 non-null object FRISKED_FLAG 11629 non-null object SEARCHED_FLAG 11629 non-null object OTHER_CONTRABAND_FLAG 11629 non-null object FIREARM_FLAG 11629 non-null object KNIFE_CUTTER_FLAG 11629 non-null object OTHER_WEAPON_FLAG 11629 non-null object WEAPON_FOUND_FLAG 11629 non-null object PHYSICAL_FORCE_CEW_FLAG 11629 non-null object PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG 11629 non-null object PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG 11629 non-null object (continues on next page)

64 Chapter 9. Exploring Data Sphinx-Themes template, Release 1

(continued from previous page) PHYSICAL_FORCE_OC_SPRAY_USED_FLAG 11629 non-null object PHYSICAL_FORCE_OTHER_FLAG 11629 non-null object PHYSICAL_FORCE_RESTRAINT_USED_FLAG 11629 non-null object PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG 11629 non-null object PHYSICAL_FORCE_WEAPON_IMPACT_FLAG 11629 non-null object BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG 11629 non-null object BACKROUND_CIRCUMSTANCES_SUSPECT_KNOWN_TO_CARRY_WEAPON_FLAG 11629 non-null object SUSPECTS_ACTIONS_CASING_FLAG 11629 non-null object SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG 11629 non-null object SUSPECTS_ACTIONS_DECRIPTION_FLAG 11629 non-null object SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG 11629 non-null object SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG 11629 non-null object SUSPECTS_ACTIONS_LOOKOUT_FLAG 11629 non-null object SUSPECTS_ACTIONS_OTHER_FLAG 11629 non-null object SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG 11629 non-null object SEARCH_BASIS_ADMISSION_FLAG 11629 non-null object SEARCH_BASIS_CONSENT_FLAG 11629 non-null object SEARCH_BASIS_HARD_OBJECT_FLAG 11629 non-null object SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG 11629 non-null object SEARCH_BASIS_OTHER_FLAG 11629 non-null object SEARCH_BASIS_OUTLINE_FLAG 11629 non-null object DEMEANOR_CODE 11629 non-null object DEMEANOR_OF_PERSON_STOPPED 11588 non-null object SUSPECT_REPORTED_AGE 11629 non-null object SUSPECT_SEX 11629 non-null object SUSPECT_RACE_DESCRIPTION 11629 non-null object SUSPECT_HEIGHT 11629 non-null object SUSPECT_WEIGHT 11629 non-null object SUSPECT_BODY_BUILD_TYPE 11629 non-null object SUSPECT_EYE_COLOR 11629 non-null object SUSPECT_HAIR_COLOR 11629 non-null object SUSPECT_OTHER_DESCRIPTION 10607 non-null object STOP_LOCATION_PRECINCT 11629 non-null object STOP_LOCATION_SECTOR_CODE 11629 non-null object STOP_LOCATION_APARTMENT 11604 non-null object STOP_LOCATION_FULL_ADDRESS 11629 non-null object STOP_LOCATION_PREMISES_NAME 11629 non-null object STOP_LOCATION_STREET_NAME 11629 non-null object STOP_LOCATION_X 11629 non-null object STOP_LOCATION_Y 11629 non-null object STOP_LOCATION_ZIP_CODE 11629 non-null object STOP_LOCATION_PATROL_BORO_NAME 11629 non-null object STOP_LOCATION_BORO_NAME 11629 non-null object dtypes: int64(6), object(80) memory usage: 7.6+ MB


[ ]: df.head()


9.1.1 Split, Apply, Combine

Using the .groupby() method is a great way to explore patterns within a categorical variable. Grouping the data by borough name splits it into groups; we can then apply some kind of summary (such as the mean or a count) to each group and combine the results. In our data, for example, if we wanted to know the number of stops in each borough, we could group by the borough name and count a relevant column.

[ ]: df.groupby('STOP_LOCATION_BORO_NAME')['FRISKED_FLAG'].count()

Usually, we can plot the results of a .groupby() directly by tacking a plot method onto the end. For example, it makes sense to look at a barplot of the counts above.

[ ]: df.groupby('STOP_LOCATION_BORO_NAME')['FRISKED_FLAG'].count().plot(
         kind='bar', color='blue', title='Count of Frisks by Borough', figsize=(10, 5))

[ ]: df.groupby('STOP_LOCATION_BORO_NAME')[['SEARCHED_FLAG', 'FRISKED_FLAG']].count().plot(
         kind='bar', color=['blue', 'red'], title='Count of Frisks by Borough', figsize=(10, 5))

9.2 Filtering Data

Now, we see that there are some values we may want to exclude in order to plot only the usual five boroughs. To do so, we filter the data so that the borough name column contains only BRONX, BROOKLYN, MANHATTAN, QUEENS, or STATEN IS.

[ ]: five_boroughs = df[(df['STOP_LOCATION_BORO_NAME'] == 'MANHATTAN') |
                        (df['STOP_LOCATION_BORO_NAME'] == 'BRONX') |
                        (df['STOP_LOCATION_BORO_NAME'] == 'BROOKLYN') |
                        (df['STOP_LOCATION_BORO_NAME'] == 'STATEN IS') |
                        (df['STOP_LOCATION_BORO_NAME'] == 'QUEENS')]
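
The same filter can be written more compactly with .isin(); a minimal sketch:

[ ]: boros = ['MANHATTAN', 'BRONX', 'BROOKLYN', 'QUEENS', 'STATEN IS']
     five_boroughs = df[df['STOP_LOCATION_BORO_NAME'].isin(boros)]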

[ ]: five_boroughs.STOP_LOCATION_BORO_NAME.value_counts()

[ ]: five_boroughs.groupby('STOP_LOCATION_BORO_NAME')['FRISKED_FLAG'].count().plot(
         kind='bar', color='blue', title='Count of Frisks by Borough', figsize=(10, 5))

We don’t have many quantitative columns, but we can examine the observed duration column and use the results to filter out abnormally long stop times.


[ ]: df['OBSERVED_DURATION_MINUTES'].describe()

It looks like 75% of the observed durations are two minutes or less, yet there is a stop that seems to have lasted 99999 minutes. We will drop any stop observed for more than two hours and see how many rows that eliminates. To do so, we check the shape of the original DataFrame and compare it to the shape after filtering.

[ ]: df.shape

[ ]: df[df['OBSERVED_DURATION_MINUTES'] <= 120].shape

[ ]: 11629 - 11521   # number of rows removed

[ ]: 108 / 11629     # fraction of the data removed (roughly 0.9%)

[ ]: trimmed_time = df[df['OBSERVED_DURATION_MINUTES'] <= 120]

[ ]: trimmed_time['OBSERVED_DURATION_MINUTES'].hist()

[ ]: trimmed_time['OBSERVED_DURATION_MINUTES'].plot(kind='box')

9.2.1 PROBLEMS

1. Explore the data based on the race feature. Determine how many individuals from each reported category were frisked citywide, and then per borough (a starter sketch follows this list).
2. Explore the STOP_DURATION_MINUTES feature. Remove any outliers, and describe the distribution of values, including a visualization.
3. Explore some variables by the gender feature. How many frisks were of non-male suspects versus male?
4. What percentage of stops were black males ages 18 - 35?
5. Describe the overall use of physical force in the stops. These are the columns:

PHYSICAL_FORCE_CEW_FLAG
PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG
PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG
PHYSICAL_FORCE_OC_SPRAY_USED_FLAG
PHYSICAL_FORCE_OTHER_FLAG
PHYSICAL_FORCE_RESTRAINT_USED_FLAG
PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG
PHYSICAL_FORCE_WEAPON_IMPACT_FLAG

6. What can you say about the SUSPECT_ARREST_OFFENSE column? Were most people stopped arrested? Were the rates similar across races?
7. Identify a publication that used this data to investigate a specific area of interest, or a full summary like the RAND report. Here is a list of information about relevant studies: https://www.icpsr.umich.edu/icpsrweb/NACJD/studies/21660/publications Describe the questions your report sought to answer. Do you think you can use the data to replicate any of the analysis? Try it!
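
As a possible starting point for the first problem, a sketch that counts frisked individuals by reported race, citywide and per borough; this assumes FRISKED_FLAG is coded as 'Y'/'N' strings, so check df['FRISKED_FLAG'].unique() first:

[ ]: frisked = df[df['FRISKED_FLAG'] == 'Y']          # assumes a 'Y'/'N' coding

     # citywide counts by reported race
     frisked['SUSPECT_RACE_DESCRIPTION'].value_counts()

     # counts by race within each borough
     frisked.groupby('STOP_LOCATION_BORO_NAME')['SUSPECT_RACE_DESCRIPTION'].value_counts()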


CHAPTER 10

Investigating Stop and Frisk

GOALS:

• Access datasets as external files
• Create a DataFrame from a file
• Select rows and columns from the DataFrame
• Filter values in a DataFrame based on logical conditions
• Split, Apply, Combine using the groupby method
• Use histograms, boxplots, barplots, and countplots to investigate distributions of quantitative variables
• Use .corr() to explore relationships between variables and visualize them with a heatmap
• Use scatterplots to examine relationships between quantitative variables

10.1 The Stop and Frisk Data

To start, we create a DataFrame from the .csv file we downloaded from the NYPD's shared data.



[1]: %matplotlib inline
     import matplotlib.pyplot as plt
     import pandas as pd
     import numpy as np
     import seaborn as sns

[2]: # read in the data
     df = pd.read_csv('../data/sqf-2017.csv')

[3]: #look at first five rows df.head() [3]: STOP_FRISK_ID STOP_FRISK_DATE STOP_FRISK_TIME YEAR2 MONTH2 DAY2 \ 0 1 1/16/2017 14:26:00 2017 January Monday 1 2 1/16/2017 14:26:00 2017 January Monday 2 3 2/8/2017 11:10:00 2017 February Wednesday 3 4 2/20/2017 11:35:00 2017 February Monday 4 5 2/21/2017 13:20:00 2017 February Tuesday

STOP_WAS_INITIATED SPRINT_NUMBER RECORD_STATUS_CODE \ 0 Based on Self Initiated 11617924 APP 1 Based on Self Initiated 11617924 APP 2 Based on C/W on Scene 17020808555 APP 3 Based on Self Initiated 9027 APP 4 Based on Radio Run 10439 APP

ISSUING_OFFICER_RANK ... STOP_LOCATION_SECTOR_CODE \ 0 SGT ... (null) 1 SGT ... (null) 2 POM ... C 3 POM ... H 4 POM ... H

STOP_LOCATION_APARTMENT STOP_LOCATION_FULL_ADDRESS \ 0 (null) 180 GREENWICH STREET 1 (null) 180 GREENWICH STREET 2 (null) WALL STREET && BROADWAY 3 (null) 75 GREENE STREET 4 2 429 WEST BROADWAY

STOP_LOCATION_PREMISES_NAME STOP_LOCATION_STREET_NAME STOP_LOCATION_X \ 0 (null) GREENWICH STREET 982381 1 (null) GREENWICH STREET 982381 2 (null) WALL STREET 981005 3 (null) GREENE STREET 984031 4 (null) WEST BROADWAY 983894

STOP_LOCATION_Y STOP_LOCATION_ZIP_CODE STOP_LOCATION_PATROL_BORO_NAME \ 0 201750 (null) PBMS 1 201750 (null) PBMS 2 197131 (null) PBMS 3 202796 (null) PBMS 4 203523 (null) PBMS

STOP_LOCATION_BORO_NAME 0 MANHATTAN 1 MANHATTAN (continues on next page)

10.1. The Stop and Frisk Data 71 Sphinx-Themes template, Release 1

(continued from previous page) 2 MANHATTAN 3 MANHATTAN 4 MANHATTAN

[5 rows x 86 columns]

[4]: #examine variables and variable types df.info() RangeIndex: 11629 entries, 0 to 11628 Data columns (total 86 columns): STOP_FRISK_ID 11629 non-null int64 STOP_FRISK_DATE 11629 non-null object STOP_FRISK_TIME 11624 non-null object YEAR2 11629 non-null int64 MONTH2 11629 non-null object DAY2 11629 non-null object STOP_WAS_INITIATED 11629 non-null object SPRINT_NUMBER 11202 non-null object RECORD_STATUS_CODE 11629 non-null object ISSUING_OFFICER_RANK 11629 non-null object ISSUING_OFFICER_COMMAND_CODE 11629 non-null int64 SUPERVISING_OFFICER_RANK 11629 non-null object SUPERVISING_OFFICER_COMMAND_CODE 11629 non-null int64 SUPERVISING_ACTION_CORRESPONDING_ACTIVITY_LOG_ENTRY_REVIEWED 11629 non-null object LOCATION_IN_OUT_CODE 11629 non-null object JURISDICTION_CODE 11629 non-null object JURISDICTION_DESCRIPTION 11629 non-null object OBSERVED_DURATION_MINUTES 11629 non-null int64 SUSPECTED_CRIME_DESCRIPTION 11629 non-null object STOP_DURATION_MINUTES 11629 non-null int64 OFFICER_EXPLAINED_STOP_FLAG 11629 non-null object OFFICER_NOT_EXPLAINED_STOP_DESCRIPTION 11628 non-null object OTHER_PERSON_STOPPED_FLAG 11629 non-null object SUSPECT_ARRESTED_FLAG 11629 non-null object SUSPECT_ARREST_NUMBER 11629 non-null object SUSPECT_ARREST_OFFENSE 11629 non-null object SUMMONS_ISSUED_FLAG 11629 non-null object SUMMONS_NUMBER 11629 non-null object SUMMONS_OFFENSE_DESCRIPTION 11629 non-null object OFFICER_IN_UNIFORM_FLAG 11629 non-null object ID_CARD_IDENTIFIES_OFFICER_FLAG 11629 non-null object SHIELD_IDENTIFIES_OFFICER_FLAG 11629 non-null object VERBAL_IDENTIFIES_OFFICER_FLAG 11629 non-null object FRISKED_FLAG 11629 non-null object SEARCHED_FLAG 11629 non-null object OTHER_CONTRABAND_FLAG 11629 non-null object FIREARM_FLAG 11629 non-null object KNIFE_CUTTER_FLAG 11629 non-null object OTHER_WEAPON_FLAG 11629 non-null object WEAPON_FOUND_FLAG 11629 non-null object PHYSICAL_FORCE_CEW_FLAG 11629 non-null object PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG 11629 non-null object PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG 11629 non-null object PHYSICAL_FORCE_OC_SPRAY_USED_FLAG 11629 non-null object PHYSICAL_FORCE_OTHER_FLAG 11629 non-null object (continues on next page)

72 Chapter 10. Investigating Stop and Frisk Sphinx-Themes template, Release 1

(continued from previous page) PHYSICAL_FORCE_RESTRAINT_USED_FLAG 11629 non-null object PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG 11629 non-null object PHYSICAL_FORCE_WEAPON_IMPACT_FLAG 11629 non-null object BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG 11629 non-null object BACKROUND_CIRCUMSTANCES_SUSPECT_KNOWN_TO_CARRY_WEAPON_FLAG 11629 non-null object SUSPECTS_ACTIONS_CASING_FLAG 11629 non-null object SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG 11629 non-null object SUSPECTS_ACTIONS_DECRIPTION_FLAG 11629 non-null object SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG 11629 non-null object SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG 11629 non-null object SUSPECTS_ACTIONS_LOOKOUT_FLAG 11629 non-null object SUSPECTS_ACTIONS_OTHER_FLAG 11629 non-null object SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG 11629 non-null object SEARCH_BASIS_ADMISSION_FLAG 11629 non-null object SEARCH_BASIS_CONSENT_FLAG 11629 non-null object SEARCH_BASIS_HARD_OBJECT_FLAG 11629 non-null object SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG 11629 non-null object SEARCH_BASIS_OTHER_FLAG 11629 non-null object SEARCH_BASIS_OUTLINE_FLAG 11629 non-null object DEMEANOR_CODE 11629 non-null object DEMEANOR_OF_PERSON_STOPPED 11588 non-null object SUSPECT_REPORTED_AGE 11629 non-null object SUSPECT_SEX 11629 non-null object SUSPECT_RACE_DESCRIPTION 11629 non-null object SUSPECT_HEIGHT 11629 non-null object SUSPECT_WEIGHT 11629 non-null object SUSPECT_BODY_BUILD_TYPE 11629 non-null object SUSPECT_EYE_COLOR 11629 non-null object SUSPECT_HAIR_COLOR 11629 non-null object SUSPECT_OTHER_DESCRIPTION 10607 non-null object STOP_LOCATION_PRECINCT 11629 non-null object STOP_LOCATION_SECTOR_CODE 11629 non-null object STOP_LOCATION_APARTMENT 11604 non-null object STOP_LOCATION_FULL_ADDRESS 11629 non-null object STOP_LOCATION_PREMISES_NAME 11629 non-null object STOP_LOCATION_STREET_NAME 11629 non-null object STOP_LOCATION_X 11629 non-null object STOP_LOCATION_Y 11629 non-null object STOP_LOCATION_ZIP_CODE 11629 non-null object STOP_LOCATION_PATROL_BORO_NAME 11629 non-null object STOP_LOCATION_BORO_NAME 11629 non-null object dtypes: int64(6), object(80) memory usage: 7.6+ MB

[5]: # examine the distribution
     df.OBSERVED_DURATION_MINUTES.hist()
[5]:


[6]: df.OBSERVED_DURATION_MINUTES.mean()
[6]: 28.94083756126924

[7]: df.OBSERVED_DURATION_MINUTES.describe()
[7]: count    11629.000000
     mean        28.940838
     std        975.822879
     min          0.000000
     25%          1.000000
     50%          1.000000
     75%          2.000000
     max      99999.000000
     Name: OBSERVED_DURATION_MINUTES, dtype: float64

[8]: df.SUSPECT_RACE_DESCRIPTION.value_counts()
[8]: BLACK             6595
     WHITE HISPANIC    2570
     BLACK HISPANIC     997
     WHITE              977
     (null)             268
     ASIAN/PAC.ISL      206
     AMER IND             9
     MALE                 7
     Name: SUSPECT_RACE_DESCRIPTION, dtype: int64
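
The goals above mention .corr() and a heatmap, but the flag columns are stored as text, so they need to be converted to numbers first. A minimal sketch, assuming the flags are coded 'Y'/'N':

[ ]: # convert a few Y/N flag columns to 0/1 and look at their correlations
     flags = ['FRISKED_FLAG', 'SEARCHED_FLAG', 'SUSPECT_ARRESTED_FLAG', 'WEAPON_FOUND_FLAG']
     numeric_flags = df[flags].apply(lambda col: (col == 'Y').astype(int))
     sns.heatmap(numeric_flags.corr(), annot=True, cmap='coolwarm')
     plt.show()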



CHAPTER 11

Folium and Mapping

Mapping Community Gardens
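
As a starting point, a sketch for plotting garden locations with folium; the file name and the latitude/longitude/name column names are assumptions about whatever export of the NYC community gardens data you download:

[ ]: import folium
     import pandas as pd

     # hypothetical file and column names -- adjust to the dataset you download
     gardens = pd.read_csv('../data/community_gardens.csv')

     m = folium.Map(location=[40.73, -73.95], zoom_start=11)
     for _, row in gardens.dropna(subset=['Latitude', 'Longitude']).iterrows():
         folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                             radius=3,
                             popup=row.get('Garden Name', ''),
                             fill=True).add_to(m)
     m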



CHAPTER 12

Lab 04: Scraping Reviews

GOALS:

• Scrape album reviews from Pitchfork
• Scrape album images from Pitchfork

12.1 LEVEL I

In the last example, *intro to webscraping*, we extracted basic information from the page listing all reviews on pitchfork.com. Now, your first task is to scrape the link to each review page. This is akin to clicking on a review and being taken to the page with the full review.



On each review page, your goal is to scrape the headline, the text of the review, the score as a number, the author, the genre, and the date. If you're feeling ambitious, grab the sample music files when they exist.

[49]: %matplotlib inline
      import matplotlib.pyplot as plt
      import requests
      from bs4 import BeautifulSoup
      import pandas as pd
      import numpy as np

[50]: page_num = [i for i in range(2, 400)]   # page numbers we will want for pagination later (unused in this cell)
      url = 'https://pitchfork.com/reviews/albums/'
      response = requests.get(url)
      soup = BeautifulSoup(response.text, 'html.parser')
      reviews = soup.find_all('div', {'class': 'review'})

[51]: artists = []
      albums = []
      genre = []
      author = []
      when = []
      links = []
      for review in reviews:
          t = review.find('li').text
          artists.append(t)
          s = review.find('h2').text
          albums.append(s)
          genre.append(review.find('li', {'class': 'genre-list__item'}).text)
          author.append(review.find('ul', {'class': 'authors'}).text)
          when.append(review.find('time').text)
          links.append(review.a['href'])

[52]: df_reviews = pd.DataFrame({'artist': artists, 'albums': albums, 'genre': genre,
                                 'author': author, 'when': when})

[53]: links = []
      for i in reviews:
          links.append(i.a['href'])

[54]: links
[54]: ['/reviews/albums/maxwell-maxwells-urban-hang-suite/',
      '/reviews/albums/pinegrove-skylight/',
      '/reviews/albums/joe-strummer-joe-strummer-001/',
      '/reviews/albums/lupe-fiasco-drogas-wave/',
      '/reviews/albums/father-awful-swim/',
      '/reviews/albums/tim-hecker-konoyo/',
      '/reviews/albums/mount-kimbie-dj-kicks/',
      '/reviews/albums/roc-marciano-behold-a-dark-horse/',
      '/reviews/albums/brandon-coleman-resistance/',
      '/reviews/albums/metric-art-of-doubt/',
      '/reviews/albums/lonnie-holley-mith/',
      '/reviews/albums/ryan-hemsworth-elsewhere/']

[55]: brief = []
      full = []
      for i in links:
          url = 'https://pitchfork.com/' + str(i)
          resp = requests.get(url)
          soup = BeautifulSoup(resp.text, 'html.parser')
          brief.append(soup.find('div', {'class': 'review-detail__abstract'}).text)
          full.append(soup.find('div', {'class': 'contents dropcap'}).text)
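
The cells above collect the abstract and full text of each review; the numeric score can be pulled from the same page. The selector below is an assumption about Pitchfork's markup at the time, so verify the class name against the page source:

[ ]: score_tag = soup.find('span', {'class': 'score'})   # assumed class name; verify on the page
     score = float(score_tag.text) if score_tag is not None else None
     score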

[56]: len(full)
[56]: 12

[57]: len(brief)
[57]: 12

[58]: len(artists)
[58]: 12

[59]: url = 'https://pitchfork.com/reviews/albums/' + '?page=2'
      response = requests.get(url)
      soup = BeautifulSoup(response.text, 'html.parser')
      reviews = soup.find_all('div', {'class': 'review'})
      for review in reviews:
          t = review.find('li').text
          artists.append(t)
          s = review.find('h2').text
          albums.append(s)
          genre.append(review.find('li', {'class': 'genre-list__item'}).text)
          author.append(review.find('ul', {'class': 'authors'}).text)
          when.append(review.find('time').text)
          links.append(review.a['href'])

[60]: len(artists)
[60]: 24

[61]: for i in links[12:]:
          url = 'https://pitchfork.com/' + str(i)
          resp = requests.get(url)
          soup2 = BeautifulSoup(resp.text, 'html.parser')
          brief.append(soup2.find('div', {'class': 'review-detail__abstract'}).text)
          full.append(soup2.find('div', {'class': 'contents dropcap'}).text)

[62]: len(full)
[62]: 24

[63]: len(artists)
[63]: 24

[64]: df_reviews = pd.DataFrame({'artist': artists, 'albums': albums, 'genre': genre,
                                 'author': author, 'when': when, 'brief': brief, 'full': full})


[65]: df_reviews.head() [65]: albums artist author \ 0 Maxwell’s Urban Hang Suite Maxwell by: Jason King 1 Skylight Pinegrove by: Quinn Moreland 2 Joe Strummer 001 Joe Strummer by: Stephen Thomas Erlewine 3 Drogas Wave Lupe Fiasco by: Brian Josephs 4 Awful Swim Father by: Briana Younger

brief \ 0 Each Sunday, Pitchfork takes an in-depth look ... 1 Completed in 2017 and shelved for almost a yea... 2 This 32-track collection combines remastered r... 3 On his seventh album, the conscious hip-hop fa... 4 Having relocated to L.A., the Atlanta rapper r...

full genre \ 0 In the late summer of 1996, Harlem was a loopy... Pop/R&B 1 A year ago, it looked like Pinegrove’s next mo... Rock 2 Timing never was Joe Strummer’s strong suit. N... Rock 3 Conscious hip-hop exists in a state of perpetu... Rap 4 Father’s music has always been flippant. That ... Rap

when 0 19 hrs ago 1 September 29 2018 2 September 29 2018 3 September 29 2018 4 September 29 2018

[69]: url = 'https://pitchfork.com/reviews/albums/?page='
      for i in range(1, 1714):
          response = requests.get(url + str(i))
          soup = BeautifulSoup(response.text, 'html.parser')
          reviews = soup.find_all('div', {'class': 'review'})

          for review in reviews:
              t = review.find('li').text
              artists.append(t)
              s = review.find('h2').text
              albums.append(s)
              # note: unlike the earlier cells, this appends the tag itself (no .text)
              genre.append(review.find('li', {'class': 'genre-list__item'}))
              author.append(review.find('ul', {'class': 'authors'}).text)
              when.append(review.find('time').text)
              links.append(review.a['href'])

[70]: len(artists)
[70]: 20656


[71]: len(albums)
[71]: 20656


[72]: len(genre)
[72]: 20654

[73]: len(links)
[73]: 20654
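
Notice that the lists are no longer the same length, so they cannot be combined directly into a DataFrame. One way to avoid this drift is to build one dictionary per review (inside the paging loop above) and skip any review with a missing piece; a sketch:

[ ]: records = []
     for review in reviews:
         try:
             records.append({
                 'artist': review.find('li').text,
                 'album': review.find('h2').text,
                 'genre': review.find('li', {'class': 'genre-list__item'}).text,
                 'author': review.find('ul', {'class': 'authors'}).text,
                 'when': review.find('time').text,
                 'link': review.a['href'],
             })
         except AttributeError:
             # a missing tag comes back as None; skip the whole review rather
             # than letting the separate lists drift out of sync
             continue

     df_reviews = pd.DataFrame(records)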

[76]: brief = []
      full = []
      for i in links:
          url = 'https://pitchfork.com/' + str(i)
          resp = requests.get(url)
          soup = BeautifulSoup(resp.text, 'html.parser')
          brief.append(soup.find('div', {'class': 'review-detail__abstract'}))
          full.append(soup.find('div', {'class': 'contents dropcap'}))

[77]: len(brief)
[77]: 20654

[79]: len(full)
[79]: 20654

[81]: full[10].text
[81]: 'Lonnie Holley starts MITH, his third album, with a simple, striking statement: “I’m
      a suspect in America,” he sings over a prismatic keyboard line, stretching the last
      syllable until he runs out of breath. As a black man who has lived in the cities of
      the South for much of his life, Holley has endured that reality for 68 years ...'

12.2 LEVEL II

Go back to the original page of reviews and scroll down. Notice that the URL simply appends a page number (?page=2, ?page=3, and so on) as you advance. This pattern will allow you to scrape multiple pages and gather reviews from earlier dates.

1. Add the next page number directly to the URL and use the pattern above to scrape those additional reviews.
2. Write a loop to go through the next ten pages of reviews and gather each piece (a sketch follows this list).
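
A sketch for the second part: loop over the next ten pages, reusing the pattern above and pausing briefly between requests to be polite to the server:

[ ]: import time

     for page in range(2, 12):
         url = 'https://pitchfork.com/reviews/albums/?page=' + str(page)
         response = requests.get(url)
         soup = BeautifulSoup(response.text, 'html.parser')
         for review in soup.find_all('div', {'class': 'review'}):
             artists.append(review.find('li').text)
             albums.append(review.find('h2').text)
             genre.append(review.find('li', {'class': 'genre-list__item'}).text)
             author.append(review.find('ul', {'class': 'authors'}).text)
             when.append(review.find('time').text)
             links.append(review.a['href'])
         time.sleep(1)   # short pause between pages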


12.3 LEVEL III

Write a loop to go through all available reviews. Save the results as a .csv file. If you were able to scrape the images, store them in a folder (a sketch follows).
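
A sketch for saving the results; the image portion assumes you collected album-art URLs into a list called image_urls, which is not built in the cells above:

[ ]: import os

     df_reviews.to_csv('pitchfork_reviews.csv', index=False)

     # hypothetical: image_urls is a list of album-art URLs collected during scraping
     os.makedirs('images', exist_ok=True)
     for i, img_url in enumerate(image_urls):
         resp = requests.get(img_url)
         with open(os.path.join('images', str(i) + '.jpg'), 'wb') as f:
             f.write(resp.content)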


12.4 LEVEL IV

It is easy to use the textblob library to add the polarity and subjectivity of each review to our DataFrame. We need to convert the text to a TextBlob object and then use its .polarity and .subjectivity attributes as new columns in the DataFrame. Use the example below as a starting place to add two new columns to your DataFrame containing the polarity and subjectivity scores for each review.


[1]: rev = ("Danielle Bregoli’s leap from meme to rapper continues with her debut mixtape "
            "that leans heavily on mimicry and trails dreadfully behind the current sound "
            "of hip-hop.")

[2]: from textblob import TextBlob

[3]: text = TextBlob(rev)

[5]: text.sentiment
[5]: Sentiment(polarity=-0.05000000000000002, subjectivity=0.5)

[6]: text.polarity
[6]: -0.05000000000000002

[8]: text.subjectivity
[8]: 0.5
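
Applying the same idea to every review might look like the sketch below, assuming the full column holds plain strings:

[ ]: df_reviews['polarity'] = df_reviews['full'].apply(lambda t: TextBlob(t).polarity)
     df_reviews['subjectivity'] = df_reviews['full'].apply(lambda t: TextBlob(t).subjectivity)
     df_reviews[['albums', 'polarity', 'subjectivity']].head()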


CHAPTER 13

Web Design

In this lab, you will use our starting Django application to build a polling site. The website will extend the polling application to ask specific questions about an area of interest to you. You should build in 10 questions and add an option for a write-in text response. Your work will be focused on two areas: theming the site and building the questions. In the next part of the lab, you will deploy your application and collect responses to the questions. Here, you will find a copy of the base site as a .zip file. You can unzip this, launch a virtual environment in the project, and install Django. Then, running python manage.py runserver should give you a site like the image below:


Your aim is to theme this site and add some better questions than the placeholders we have at present (see the Polls link).


13.1 Site Files

Here is a copy of the zipped Django website: mywebsite


You’ll have to follow the steps below to open this and begin working:

• Unzip the file on your Desktop.
• Navigate inside the unzipped directory. You should see the manage.py file here.
• Activate a virtual environment with pipenv shell and install Django.
• Create a superuser to access the admin.
• Launch the site!

CHAPTER 14

Flask App

In this example, we will build a Flask application that delivers some custom data to users through an API. Using our example from class, we aim to combine our webscraping work with Flask application development.

[1]: import requests
     from bs4 import BeautifulSoup

[17]: from IPython.display import YouTubeVideo
      YouTubeVideo('r8U4s_WAAg8')
[17]:

[2]: url = 'https://www.lyrics.com/artist/Dokken/4110'
     r = requests.get(url)
     soup = BeautifulSoup(r.text, 'html.parser')

[3]: albums = soup.find_all('h3', {'class': 'artist-album-label'})

[4]: for i in albums[:5]:
         print(i.text)
     Back in the Streets [1979]
     Breaking the Chains [1982]
     Tooth and Nail [1984]
      [1985]
      [1987]

[5]: songs = soup.find_all('td', {'class': 'tal qx'})

[6]: for i in songs[::2][:5]:
         print(i.text)
     Felony
     Prisoner [Live]
     Breaking the Chains
     I Can’t See You
     In the Middle

[7]: songs[0].a.attrs['href']
[7]: '/lyric/41422//Felony'

[8]: base = 'https://www.lyrics.com/'

[9]: lyr = base + songs[0].a.attrs['href']
     import pandas as pd

[10]: def lookup():
          artist = input("Who you lookin for duder?")
          if len(artist.split()) < 2:
              url = 'https://www.lyrics.com/artist/' + artist.split(' ')[0].lower()
          else:
              url = 'https://www.lyrics.com/artist/' + artist.split(' ')[0].lower()
              for i in range(1, len(artist.split())):
                  url = url + '%20' + artist.split(' ')[i].lower()
          r = requests.get(url)
          soup = BeautifulSoup(r.text, 'html.parser')
          alb = []
          s = []
          ti = []
          albums = soup.find_all('h3', {'class': 'artist-album-label'})
          for i in albums:
              alb.append(i.text)
          songs = soup.find_all('td', {'class': 'tal qx'})
          for i in songs[::2]:
              s.append(i.text)
          for i in songs[1::2]:
              ti.append(i.text)
          base = 'https://www.lyrics.com/'
          l = []
          for i in range(len(songs)):
              if songs[i].a is None:
                  pass
              else:
                  lyr = base + songs[i].a.attrs['href']
                  url = lyr
                  r = requests.get(url)
                  soup = BeautifulSoup(r.text, 'html.parser')
                  l.append(soup.find('pre', {'id': 'lyric-body-text'}).text)
          df = pd.DataFrame({'song': s, 'time': ti, 'lyrics': l})
          return df

[11]: lookup() Who you lookin for duder? Dokken [11]: song time \ 0 Felony 2:49 1 Prisoner [Live] 6:06 2 Breaking the Chains 3:51 (continues on next page)

92 Chapter 14. Flask App Sphinx-Themes template, Release 1

(continued from previous page) 3 I Can't See You 3:11 4 In the Middle 3:43 5 Live to Rock (Rock to Live) 3:39 6 Nightrider 3:13 7 Paris Is Burning 5:11 8 Seven Thunders 3:55 9 Stick to Your Guns 3:25 10 Young Girls 3:14 11 Bullets to Spare 3:35 12 Don't Close Your Eyes 4:10 13 Heartless Heart 3:30 14 Turn on the Action 4:43 15 Without Warning 1:34 16 Don't Lie to Me 3:37 17 The Hunter 4:07 18 Jaded Heart 4:16 19 Lightnin' Strikes Again 3:48 20 Slippin' Away 3:48 21 Will the Sun Rise? 4:10 22 4:46 23 Cry of the Gypsy 4:51 24 Dream Warriors 4:42 25 Heaven Sent 4:52 26 Kiss of Death 5:51 27 Lost Behind the Wall 4:19 28 Night by Night 5:23 29 Prisoner 4:20 ...... 62 Who Believes 4:23 63 Drown [DVD] 4:52 64 Sign of the Times 3:14 65 Paris Is Burning [Live] 5:09 66 Tooth and Nail [DVD] 67 Breaking the Chains [Live] 3:55 68 Goodbye My Friend 4:06 69 Heart Full of Soul 2:28 70 Little Girl 3:44 71 There Was a Time 3:53 72 You 3:47 73 In My Dreams [Live] 5:42 74 Tooth and Nail [Live] 4:31 75 [Live] 6:21 76 Alone Again [1995/ Live In Tokyo] 7:22 77 Alone Again 78 I Will Remember 79 Better Off Before 3:00 80 Can You See 4:09 81 Don't Bring Me Down 3:24 82 Escape 4:37 83 I Surrender 3:03 84 The Last Goodbye 4:37 85 Still I'm Sad 4:37 86 Til the Livin' End 4:02 87 Heart to Stone 3:56 88 Judgment Day 4:02 89 Standing on the Outside 3:52 90 This Fire 4:42 (continues on next page)

93 Sphinx-Themes template, Release 1

(continued from previous page) 91 Mirror Mirror 4:40

lyrics 0 Felony\r\nIn tight blue jeans\r\nI did not kno... 1 It wasn't just your innocence\r\nNo, it wasn't... 2 Sit there thinkin'\r\nIn your room\r\nYou feel... 3 You keep me waitin'\r\nWhy I don't know\r\nSee... 4 I talk with you\r\nI think of her\r\nIt always... 5 Run out of breath\r\nAnd I feel I'm moving too... 6 It's late at night\r\nI get that feeling runni... 7 This town I'm in can't take no more\r\nDecaden... 8 Demon eyes\r\nAnd the fate is coming down on y... 9 Well I've been trying this for so very long\r\... 10 Staying out all night dancing\r\nYou can't kee... 11 The shooting's over\r\nThe smoke is clear\r\nA... 12 Somebody's watching me, is it just a crazy dre... 13 Every night it's the same old thing\r\nBaby, I... 14 I'm like a wildcat at midnight\r\nI'm like a s... 15 [Instrumental]\n 16 You didn't see me, but I overheard\r\nA short ... 17 Clouds roll by as I look to the sky\r\nAnd the... 18 I didn't know you, but I could see it in your ... 19 The lights go down\r\nYou can feel it all arou... 20 Like a whisper, you came in the night\r\nGone ... 21 In the name of liberty, we set off to the sea\... 22 Never thought our love would last\r\nFor so lo... 23 You say I'm a restless soul \r\nI don't mind \... 24 I lie awake and dread the lonely nights\nI'm n... 25 There seems no justice when you fall in love \... 26 A brief encounter\r\nLike wind through the tre... 27 When the sun fades to black and the night has ... 28 It's bad enough being under control\r\nSame th... 29 It wasn't just your innocence\r\nNo, it wasn't...... 62 Tried to go to sleep last night\nBut too many ... 63 I been left in the crossfire / sacrificed to l... 64 Life is a mystery\nLeaves me no consolation\nL... 65 This town, I'm in can't take no more\nDecadenc... 66 Desperate living- driving me mad\r\nWritings o... 67 Sit there thinkin'\r\nIn your room\r\nYou feel... 68 I tried to say, didn't love you, or need you\n... 69 Sick at heart and lonely, deep in dark despair... 70 Mother never told you\r\nThat's all right\r\nD... 71 D. Dokken/K. Keeling\n\nIt was late like the s... 72 Life as dreams\r\nStreets of gold\r\nYou used ... 73 Toss and turn all night in the sheets\r\nI can... 74 Desperate living- driving me mad\r\nWritings o... 75 I've been lost in the middle\r\nAlways trying ... 76 I'd like to see you in the morning light\r\nI'... 77 I'd like to see you in the morning light\nI li... 78 I will remember you\r\nWill you remember me?\r... 79 But can I find my way to you?\nWe've got this ... 80 It's not hard to believe\nThat the world isn't... 81 I played the fool, it seems a thousand times b... 82 Turn the lights down low there's nobody home\r... 83 Leave me here alone\nLying in the burning sun\... 84 I walk these streets they built with progress\... (continues on next page)

94 Chapter 14. Flask App Sphinx-Themes template, Release 1

(continued from previous page) 85 You were the one on the trigger\nDeciding the ... 86 It's too late to tell me you're sorry\r\nIt's ... 87 I can't see your face anymore\nThrough our bro... 88 Say a prayer, for the day, that you first bega... 89 You knew there'd be consequences, \nWhen I gav... 90 On and on we go again\nThere seems no point of... 91 Look inside my eyes\r\nYou can see everywhere ...

[92 rows x 3 columns]

[12]: dokken = lookup()
      Who you lookin for duder? Dokken

[13]: type(dokken)
[13]: pandas.core.frame.DataFrame

[14]: dokken.head()
[14]:                   song  time \
      0               Felony  2:49
      1      Prisoner [Live]  6:06
      2  Breaking the Chains  3:51
      3      I Can't See You  3:11
      4        In the Middle  3:43

                                                    lyrics
      0  Felony\r\nIn tight blue jeans\r\nI did not kno...
      1  It wasn't just your innocence\r\nNo, it wasn't...
      2  Sit there thinkin'\r\nIn your room\r\nYou feel...
      3  You keep me waitin'\r\nWhy I don't know\r\nSee...
      4  I talk with you\r\nI think of her\r\nIt always...

[15]: beatnuts = lookup()
      Who you lookin for duder? the beatnuts

[16]: beatnuts.head()
[16]:               song  time                                             lyrics
      0  Reign of the Tec  3:21  [JuJu]\nIt's the hard little pistol-packing pu...
      1     Are You Ready  3:14  What we gonna do right here is\r\nYeah, yeah, ...
      2         Get Funky  3:37  Hook:\n"Get Funky" "That's right" (x4)\n\nJuJu...
      3             Intro  1:46  [CUTS, SCRATCHES & PROPS]\n\nIt's a , it's a \...
      4   Props over Here  4:00  Showing love, with the fucking bass in your fa...
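
To deliver this data through an API, the scraping logic in lookup() can sit behind a small Flask route. A minimal sketch, meant to live in its own script rather than the notebook; scrape_artist() is a hypothetical refactor of lookup() that takes the artist name as an argument instead of calling input():

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route('/lyrics/<artist>')
    def lyrics(artist):
        # scrape_artist() is a hypothetical refactor of lookup() that accepts
        # the artist name as an argument instead of prompting with input()
        df = scrape_artist(artist)
        return jsonify(songs=df.to_dict(orient='records'))

    if __name__ == '__main__':
        app.run(debug=True)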


CHAPTER 15

FLASK App

Build an API for Pitchfork.


CHAPTER 16

Learning Application

Build a machine learning app with Flask.
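
A minimal sketch of the idea: train and pickle a scikit-learn model ahead of time, then load it in a Flask route and return predictions. The file name, feature format, and route are assumptions, and the script would live outside the notebook:

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # hypothetical: a scikit-learn model trained and pickled ahead of time
    with open('model.pkl', 'rb') as f:
        model = pickle.load(f)

    @app.route('/predict', methods=['POST'])
    def predict():
        features = request.get_json()['features']   # e.g. [[5.1, 3.5, 1.4, 0.2]]
        prediction = model.predict(features)
        return jsonify(prediction=prediction.tolist())

    if __name__ == '__main__':
        app.run(debug=True)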


CHAPTER 17

Indices and tables

• genindex
• modindex
• search
