<<

Extracting YouTube videos using SAS

Dr Craig Hansen Craig Hansen has been using SAS for 15 years within the health research setting. He gained his doctorate in epidemiology at the University of the Sunshine Coast and since then has worked at

• the University of Queensland School of Medicine (Australia) – working in cardiovascular research

• the United States Environmental Protection Agency (USEPA) – working in air pollution research

• the Centers for Disease Control and Prevention (CDC, USA) – working in birth defects research

• Kaiser Permanente Center for Health Research (USA) – working on health studies using electronic medical

records.

He recently accepted a position at the South Australian Health and Medical Research Institute as Senior

Epidemiologist within the Wardliparingga Aboriginal Health Unit. Dr Craig Hansen Extracting YouTube Videos using SAS

By Craig Hansen, PhD South Australian Health & Medical Research Institute

Overview YouTube API SAS Proc HTTP SAS XML Mapper Example – medications during pregnancy Questions Have you ever wanted to “scrape” all the information from YouTube videos?

Length

Uploader Views

Date

Comments You can……using YouTube Data API and SAS!

YouTube Data API (Metadata for videos)

Proc HTTP

XML File

XML Mapper

SAS Datasets YouTube API – Video feed example

• https://gdata.youtube.com/feeds/api/videos?q=surfing

Google's APIs allows you to integrate YouTube videos and functionality into websites or applications. Examples: YouTube Analytics API YouTube Data API YouTube Player API

XML File of all the videos found with the search term “surfing”

Only allows 25 videos per XML file

Only allows 500 videos in total per search term XML

• “Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human- readable and machine-readable” • A structured (“tree-like”) text that can be used to store information in a hierarchical format

See the “structure” based on tags < > …..text …..

This is what SAS XML Mapper reads tag:youtube.com,2008:standardfeed:global:most_popular 2008-07-18T05:00:49.000-07:00 Most Popular

tag:youtube,2008:video:ZTUVgYoeN_b 2008-07-05T19:56:35.000-07:00 2008-07-18T07:21:59.000-07:00

Shopping for Coats (only a snippet)

GoogleDevelopers https://gdata.youtube.com/feeds/api/users/GoogleDevelopers _x5XG1OV2P6uZZ5FSM9Ttw< widescreen 2008-07-05T19:56:35.000-07:00 UC_x5XG1OV2P6uZZ5FSM9Ttw ZTUVgYoeN_b 2008-07-04 SAS PROC HTTP

• Issues Hypertext Transfer Protocol (HTTP) requests

• YouTube API search (https://developers.google.com/youtube/v3/)

FILENAME myxml "C:\Current Work\YT." encoding="UTF-8";

PROC HTTP out=myxml url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN; SAS XML Mapper

• We now need to parse all the text in the XML into a dataset

Perfect, we can use this to “map” all the structured text in the XML into a dataset – it does all the work for you!

SAS XML Mapper reads in the XML or Schema file and interprets the “structure” based on the tags <> ….text….

**Download SAS XML Mapper from SAS website SAS XML Mapper – Create a map

XML window XML Map window

Source window SAS XML Mapper – Create a map

Open XML XML file or Schema The datasets it will create

XML map generated – save this map SAS XML Mapper – Creating Datasets

See how the datasets are linked by ID variables Information available about Datasets

• Feed information • Duration • Title • Number of views • Description • Ratings • Author • Comments (links to another API) • Date published • Restrictions • Category • URL SASSAS Code Code

FILENAME test1 " C:\Current Work\Surf.xml " encoding="UTF-8";

PROC HTTP out=test1 url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN;

FILENAME YOUTUBET 'C:\Current Work\Surf.xml' ; FILENAME SXLEMAP ' C:\Current Work\YouTubeGetData.map'; LIBNAME YOUTUBET xmlv2 xmlmap=SXLEMAP ACCESS=READONLY;

This will create all the linked datasets generated from the map Merge the datasets based on the linked IDs My project – Medications during Pregnancy

• Assess the following: • How many videos have information on medications during pregnancy

• The source of these videos

• The popularity of these videos

• The information in the videos (manual review)

• Enter additional information in a database upon review Challenges

• 25 videos per XML • You can pull in videos based on start index = SAS macro • 500 per search term • Pull in the 500 videos using a SAS macro • Repeat using variations on each search term = 500 per each variation • Duplicates = that’s ok because you cast a wide net and clean these later

2023 search terms %MACRO YOUTUBE; %DO k = 1 %TO &MAXID. %BY 1; 2023 search terms PROC SQL NOPRINT; SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH FROM SEARCH_TERMS WHERE ID=&k.; QUIT; %DO i = 1 %TO 500 %BY 20; Increments of 20 videos for each xml %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&MEDSEARCH%nrstr(&max)-results=20%nrstr(&start)-index=&i";

[insert the proc http I showed before]

PROC SQL; CREATE TABLE VIDEOS&i. AS SELECT DISTINCT Outer Macro [insert some SQL joins] Inner Macro ; Loops through QUIT; Increments through the Search terms PROC SQL NOPRINT; start index getting 20 SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.; QUIT; videos until 500 are PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE; retrieved RUN;

PROC DATASETS LIB=WORK NOLIST; DELETE VIDEOS&i.; Append the videos from each QUIT; RUN; iteration of the inner/outer macros

%IF &NOBS.=0 %THEN %GOTO EXIT; %END; %EXIT: %END; %MEND YOUTUBE; %YOUTUBE;

Import into ACCESS database Results – What I learned?

Medication+Pregnancy • A lot of non-relevant videos Search Terms, n= 2023

YouTube API Data Feed Excluded • Duplicates within each iteration n=97,480 records Duplicate Records of videos extracted n = 3812 Distinct Videos n=41,438 • Same video but different title (93,668 records) Medication or Pregnancy Search Term not in • Title or Description Need to have a ‘selection criteria’ Distinct Videos n=30,976 videos to identify relevant videos (e.g. n=10,462 some additional data cleaning) Medication+Pregnancy Search Term in Title • YouTube ‘category’ variable is not n=651 videos Not Relevant Videos Total = 336 reliable No mention of med = 243 Manual Review Duplicate = 70 Not avail./no sound = 13 • Lag time between metadata and Not English = 5 Legal Process = 5 Final Analyses actual time (data on YouTube) n=315 videos Useful Resources

Jason Secosky: Executing a PROC from a DATA Step http://support.sas.com/resources/papers/proceedings12/227-2012.pdf George Zhu: Accessing and Extracting Data from the Using SAS http://support.sas.com/resources/papers/proceedings12/121-2012.pdf Eric Lewerenz: An Example of Website “Screen Scraping” http://www.mwsug.org/proceedings/2009/appdev/MWSUG-2009-A09.pdf

SAS PROC HTTP http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n197g47i7j66x9n15xi0gaha8o v6.htm

Google APIs https://developers.google.com/apis-explorer/#p/ YouTube APIs https://developers.google.com/youtube/ Thank you – I love SAS!

• Questions • Feedback • Thumbs up? • Thumbs down? WRAP UP

• Survey – Please complete & hand back for Lucky Draw

• IAPA Chapter Meeting

• SANZOC – SAS Australia & New Zealand Online Community www.communities.sas.com

• Questions / Suggestions [email protected]

• Thank you

• Lucky Draw Copyright © 2012, SAS Institute Inc. All rights reserved. ADELAIDE CHAPTER MEETING: ENSEMBLES OF 20,000 MODELS IN THE ATO

Guest Speaker: Dr Graham Williams, Head of Corporate Analytics, ATO

www.iapa.org.au/Events

Copyright © 2012, SAS Institute Inc. All rights reserved.