Extracting Youtube Videos Using SAS
Total Page:16
File Type:pdf, Size:1020Kb
Extracting YouTube videos using SAS Dr Craig Hansen Craig Hansen has been using SAS for 15 years within the health research setting. He gained his doctorate in epidemiology at the University of the Sunshine Coast and since then has worked at • the University of Queensland School of Medicine (Australia) – working in cardiovascular research • the United States Environmental Protection Agency (USEPA) – working in air pollution research • the Centers for Disease Control and Prevention (CDC, USA) – working in birth defects research • Kaiser Permanente Center for Health Research (USA) – working on health studies using electronic medical records. He recently accepted a position at the South Australian Health and Medical Research Institute as Senior Epidemiologist within the Wardliparingga Aboriginal Health Unit. Dr Craig Hansen Extracting YouTube Videos using SAS By Craig Hansen, PhD South Australian Health & Medical Research Institute Overview YouTube API SAS Proc HTTP SAS XML Mapper Example – medications during pregnancy Questions Have you ever wanted to “scrape” all the information from YouTube videos? Length Uploader Views Date Comments You can……using YouTube Data API and SAS! YouTube Data API (Metadata for videos) Proc HTTP XML File XML Mapper SAS Datasets YouTube API – Video feed example • https://gdata.youtube.com/feeds/api/videos?q=surfing Google's APIs allows you to integrate YouTube videos and functionality into websites or applications. Examples: YouTube Analytics API YouTube Data API YouTube Player API XML File of all the videos found with the search term “surfing” Only allows 25 videos per XML file Only allows 500 videos in total per search term XML • “Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human- readable and machine-readable” • A structured (“tree-like”) text that can be used to store information in a hierarchical format See the “structure” based on tags < > …..text …..</> This is what SAS XML Mapper reads <?xml version='1.0' encoding='UTF-8'?> <feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearch/1.1/' xmlns:gml='http://www.opengis.net/gml' xmlns:georss='http://www.georss.org/georss' xmlns:media='http://search.yahoo.com/mrss/' xmlns:batch='http://schemas.google.com/gdata/batch' xmlns:yt='http://gdata.youtube.com/schemas/2007' xmlns:gd='http://schemas.google.com/g/2005' gd:etag='W/"CE4EQH47eCp7ImA9WxRQGEQ."'> <id>tag:youtube.com,2008:standardfeed:global:most_popular</id> <updated>2008-07-18T05:00:49.000-07:00</updated> <category scheme='http://schemas.google.com/g/2005#kind' term='http://gdata.youtube.com/schemas/2007#video'/> <title>Most Popular</title> <id>tag:youtube,2008:video:ZTUVgYoeN_b</id> <published>2008-07-05T19:56:35.000-07:00</published> <updated>2008-07-18T07:21:59.000-07:00</updated> <category scheme='http://gdata.youtube.com/schemas/2007/categories.cat' YouTube XML SCHEMA term='People' label='People'/> <title>Shopping for Coats</title> (only a snippet) <content type='application/x-shockwave-flash' src='http://www.youtube.com/v/ZTUVgYoeN_b?f=gdata_standard...'/> <link rel='alternate' type='text/html' href='https://www.youtube.com/watch?v=ZTUVgYoeN_b'/> <author> <name>GoogleDevelopers</name> <uri>https://gdata.youtube.com/feeds/api/users/GoogleDevelopers</uri> <yt:userId>_x5XG1OV2P6uZZ5FSM9Ttw<</yt:userId> <yt:aspectRatio>widescreen</yt:aspectRatio> <yt:duration seconds='79'/> <yt:uploaded>2008-07-05T19:56:35.000-07:00</yt:uploaded> <yt:uploaderId>UC_x5XG1OV2P6uZZ5FSM9Ttw</yt:uploaderId> <yt:videoid>ZTUVgYoeN_b</yt:videoid> </media:group> <gd:rating min='1' max='5' numRaters='14763' average='4.93'/> <yt:recorded>2008-07-04</yt:recorded> <yt:statistics viewCount='383290' favoriteCount='7022'/> <yt:rating numDislikes='19257' numLikes='106000'/> </entry> </feed> SAS PROC HTTP • Issues Hypertext Transfer Protocol (HTTP) requests • YouTube API search (https://developers.google.com/youtube/v3/) FILENAME myxml "C:\Current Work\YT.xml" encoding="UTF-8"; PROC HTTP out=myxml url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN; SAS XML Mapper • We now need to parse all the text in the XML into a dataset Perfect, we can use this to “map” all the structured text in the XML into a dataset – it does all the work for you! SAS XML Mapper reads in the XML or Schema file and interprets the “structure” based on the tags <> ….text….</> **Download SAS XML Mapper from SAS website SAS XML Mapper – Create a map XML window XML Map window Source window SAS XML Mapper – Create a map Open XML XML file or Schema The datasets it will create XML map generated – save this map SAS XML Mapper – Creating Datasets See how the datasets are linked by ID variables Information available about Datasets • Feed information • Duration • Title • Number of views • Description • Ratings • Author • Comments (links to another API) • Date published • Restrictions • Category • URL SASSAS Code Code FILENAME test1 " C:\Current Work\Surf.xml " encoding="UTF-8"; PROC HTTP out=test1 url="https://gdata.youtube.com/feeds/api/videos?q=surfing" method="get"; RUN; FILENAME YOUTUBET 'C:\Current Work\Surf.xml' ; FILENAME SXLEMAP ' C:\Current Work\YouTubeGetData.map'; LIBNAME YOUTUBET xmlv2 xmlmap=SXLEMAP ACCESS=READONLY; This will create all the linked datasets generated from the map Merge the datasets based on the linked IDs My project – Medications during Pregnancy • Assess the following: • How many videos have information on medications during pregnancy • The source of these videos • The popularity of these videos • The information in the videos (manual review) • Enter additional information in a database upon review Challenges • 25 videos per XML • You can pull in videos based on start index = SAS macro • 500 per search term • Pull in the 500 videos using a SAS macro • Repeat using variations on each search term = 500 per each variation • Duplicates = that’s ok because you cast a wide net and clean these later 2023 search terms %MACRO YOUTUBE; %DO k = 1 %TO &MAXID. %BY 1; 2023 search terms PROC SQL NOPRINT; SELECT DISTINCT LEFT(TRIM(MEDSEARCH)) INTO :MEDSEARCH FROM SEARCH_TERMS WHERE ID=&k.; QUIT; %DO i = 1 %TO 500 %BY 20; Increments of 20 videos for each xml %LET URL = "https://gdata.youtube.com/feeds/api/videos?q=&MEDSEARCH%nrstr(&max)-results=20%nrstr(&start)-index=&i"; [insert the proc http I showed before] PROC SQL; CREATE TABLE VIDEOS&i. AS SELECT DISTINCT Outer Macro [insert some SQL joins] Inner Macro ; Loops through QUIT; Increments through the Search terms PROC SQL NOPRINT; start index getting 20 SELECT COUNT(*) INTO :NOBS FROM VIDEOS&i.; QUIT; videos until 500 are PROC APPEND DATA=VIDEOS&i. BASE=videos FORCE; retrieved RUN; PROC DATASETS LIB=WORK NOLIST; DELETE VIDEOS&i.; Append the videos from each QUIT; RUN; iteration of the inner/outer macros %IF &NOBS.=0 %THEN %GOTO EXIT; %END; %EXIT: %END; %MEND YOUTUBE; %YOUTUBE; Import into ACCESS database Results – What I learned? Medication+Pregnancy • A lot of non-relevant videos Search Terms, n= 2023 YouTube API Data Feed Excluded • Duplicates within each iteration n=97,480 records Duplicate Records of videos extracted n = 3812 Distinct Videos n=41,438 • Same video but different title (93,668 records) Medication or Pregnancy Search Term not in • Title or Description Need to have a ‘selection criteria’ Distinct Videos n=30,976 videos to identify relevant videos (e.g. n=10,462 some additional data cleaning) Medication+Pregnancy Search Term in Title • YouTube ‘category’ variable is not n=651 videos Not Relevant Videos Total = 336 reliable No mention of med = 243 Manual Review Duplicate = 70 Not avail./no sound = 13 • Lag time between metadata and Not English = 5 Legal Process = 5 Final Analyses actual time (data on YouTube) n=315 videos Useful Resources Jason Secosky: Executing a PROC from a DATA Step http://support.sas.com/resources/papers/proceedings12/227-2012.pdf George Zhu: Accessing and Extracting Data from the Internet Using SAS http://support.sas.com/resources/papers/proceedings12/121-2012.pdf Eric Lewerenz: An Example of Website “Screen Scraping” http://www.mwsug.org/proceedings/2009/appdev/MWSUG-2009-A09.pdf SAS PROC HTTP http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#n197g47i7j66x9n15xi0gaha8o v6.htm Google APIs https://developers.google.com/apis-explorer/#p/ YouTube APIs https://developers.google.com/youtube/ Thank you – I love SAS! • Questions • Feedback • Thumbs up? • Thumbs down? WRAP UP • Survey – Please complete & hand back for Lucky Draw • IAPA Chapter Meeting • SANZOC – SAS Australia & New Zealand Online Community www.communities.sas.com • Questions / Suggestions [email protected] • Thank you • Lucky Draw Copyright © 2012, SAS Institute Inc. All rights reserved. ADELAIDE CHAPTER MEETING: ENSEMBLES OF 20,000 MODELS IN THE ATO Guest Speaker: Dr Graham Williams, Head of Corporate Analytics, ATO www.iapa.org.au/Events Copyright © 2012, SAS Institute Inc. All rights reserved..