How to Download Dynamically Loaded Content Using Python
When you browse the web, you occasionally visit websites that show content such as video or audio files that are loaded dynamically. This is usually done with AJAX calls or sessions, where the URLs for these files are generated on the fly, so you cannot save them by normal means. For example, you visit a web page and a video is loaded through a video player like jwplayer after 1 or 2 seconds. You want to save it, but you cannot do so by right-clicking on it, because the player doesn't show that option. A command-line tool like wget or youtube-dl might handle it, but for some reason it just doesn't work. Another problem is that some JavaScript functions still need to be executed, and until they run, the video URL is not generated in the page.

Now the question is, how would you download such files to your computer? There are different ways to do it. One way is to use a browser plugin or extension like Grab Any Media or FlashGot, which can handle such requests and might allow you to download the file. Now let's say you want to download an entire set of videos that are loaded using different AJAX calls. In this case the plugins or extensions might work, but it takes a long time to download each file manually. A better method is to write a small script that automates the process. This tutorial aims to teach you guys how to use the selenium web driver for simple tasks like downloading dynamically loaded content from a website using Python.

Prerequisites.

For this tutorial, you need to have at least some knowledge of how to program in Python. If you don't know anything about it, then I suggest you check out this site: http://www.tutorialspoint.com/python/.
It is a great starting point for learning a new language, and you can quickly pick up the basics. Before we start, I would like to give a short introduction to the modules that I am going to use in my Python script. The system that I'm using is Ubuntu Studio 14.04. To install the modules you can use python-pip, and you might need administrative privileges. Here are the modules:

selenium: The selenium framework is a suite of tools that can be used to test web applications and to automate web browser tasks. Using its API, you can automate simple tasks such as administration work or maintenance for a website by sending commands to the browser. It is supported in various programming languages like Python, Java, JavaScript, PHP, C#, Perl, Ruby. To install this framework for Python, run pip install selenium in the terminal.

bs4: The bs4 module (BeautifulSoup) is an HTML/XML parser that does a great job at screen-scraping elements and getting information like tag names, attributes, and values. It also has a set of methods that let you do things like match certain instances of a text and retrieve all the elements that contain it. You can install this module with pip install beautifulsoup4.

wget: This module is a Python port of the wget command-line program. It's easy to set up, and you can quickly download videos or files to your system using its API. You can also follow the "traditional" method of downloading files, like using the standard urllib module or doing a subprocess call to the wget command-line program, but for this tutorial I will be using this module to get the job done. You can install it with pip install wget.

PhantomJS (optional): PhantomJS is a headless browser (it has no front-end GUI and everything works in the background) that is used for web page interaction. It is similar to the selenium web driver, but the difference is that it is headless.
For this tutorial, I will be using the basic Firefox web driver via selenium, but you can try PhantomJS if you do not want a browser to pop up every time the script runs. You can download the phantomjs executable from its homepage, but I advise you to use an older version, as the latest one might not be compatible with the selenium module.

The Concept.

The idea is basically:

Get the web page using the selenium web driver.
Parse and extract the video or audio URLs from the HTML page using BeautifulSoup.
Download the files to the system using wget.

Step 1.

The first step is to import the necessary modules in the Python script or shell. From the selenium module, we import the following things:

webdriver: This submodule has the functionality to initialize the various browsers like Chrome, Firefox, IE, etc.
Keys: This allows us to send key presses or inputs to the web driver.
WebDriverWait: This is related to the sleep function of the time module, but instead of always pausing for a fixed time, it tells the webdriver to wait up to "n" seconds for a condition to become true and continues as soon as it does.
expected_conditions: These are common conditions that the web driver can wait for, for example that an element appears in the page or that the title of the page is somename.

Here is a list of all the available conditions [1]:

title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable (the element is displayed and enabled)
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present

Step 2.

Now, according to the concept, for a single video URL that is loaded using an AJAX call, we need to get the web page using the selenium webdriver.
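As a minimal sketch of this step, the helper below starts a browser and loads a page. The function name open_page, the make_driver parameter, and any URL you pass in are illustrative assumptions and are not taken from the original script; webdriver.Firefox() and driver.get() are the standard selenium calls. The selenium import sits inside the function so that a different driver (Chrome, PhantomJS, or a stub for testing) can be passed in instead.

```python
def open_page(url, make_driver=None):
    """Start a browser, load `url`, and return the driver (still open).

    `make_driver` is a hypothetical hook: any zero-argument callable that
    returns a WebDriver-like object. It defaults to Firefox, the browser
    used in this tutorial.
    """
    if make_driver is None:
        # Third-party dependency: pip install selenium
        from selenium import webdriver
        make_driver = webdriver.Firefox

    driver = make_driver()  # with the default, a Firefox window pops up here
    driver.get(url)         # loads the page; AJAX content may still arrive later
    return driver           # the caller is responsible for closing it (Step 6)
```

Returning the open driver, rather than the page source, leaves room for the waiting logic of the next step, since the dynamically loaded element may not exist yet when get() returns.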
To do this, create the driver with webdriver.Firefox() and pass the page URL to its get() method. When you execute this in the Python shell or via the script (after you import the modules), you will see a Firefox browser pop up and load the page. If you want to use PhantomJS and stop the browser from popping up, just replace webdriver.Firefox() with webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true']).

Step 3.

Here is the tricky part: you need to extract the video URLs from the web page. As every website is designed differently, there is no single solution that works everywhere. You need to manually look for a pattern or for the video element that is dynamically loaded. This can be done by looking at the browser's developer console. In the scenario mentioned earlier, let's say the video is loaded using an AJAX call 1 second after you visit the website. Then you need to wait until the video is loaded and then get the element. For that, the script needs two lines: a WebDriverWait with a timeout of 50 seconds combined with an expected condition that locates the element by its id, and a read of the page's HTML source (driver.page_source). When you execute these two lines in the Python shell, the script waits up to 50 seconds for the element with the specific id to appear or become visible on the screen, and then gets the HTML source. If the element doesn't have an ID, there are other ways to get the specific element or tag. Let's say the element has a certain class; then you can just replace By.ID with By.CLASS_NAME. Here is the entire list of attributes of the By class [2]:

ID
XPATH
LINK_TEXT
PARTIAL_LINK_TEXT
NAME
TAG_NAME
CLASS_NAME
CSS_SELECTOR

Step 4.

Once you get the HTML source, you need to parse it and extract the video tag from it. This is done by creating a BeautifulSoup parser from the HTML source and calling parser.findAll() with the tag name and list_of_attributes. The list_of_attributes is a Python dictionary of (key, value) pairs that specify the tag's attributes. The parser.findAll() call searches the entire HTML source and gets the video tags with the specific attributes. It returns a list of the matching tags, which is stored in the tag variable.

Step 5.
The next step is to get the URL from the video tag and finally download it using wget. We do this by reading the URL from the n-th entry of tag and passing it to wget.download(). Depending on the number of videos loaded in the web page, you can specify which video you want to download. This can be done by changing the value of n.

Step 6.

Finally, once the job is done, we close the driver.

The script.

Putting all the pieces together, with some additional functionality to log in to a website, gives the complete script.

Conclusion.

What we did in this tutorial is create a small script that automates the process of downloading a file which is dynamically loaded. The script described above works for a single URL. If you want to download multiple files, you need to manually grab the tags and dynamic content information of each website and store them in a JSON or XML file. Then you read that file and pass its entries through a for loop. I created another small script that does this job. It's not foolproof, but it is a good starting point for you guys to get an idea of how to do it. It also accepts command-line arguments that let you download either a single video file or all the videos listed in a file of video URLs. You can get the script here: https://gitlab.com/snippets/8921. Please keep in mind that you will encounter some websites which are so secure that, no matter what you do, you just cannot download that video or file.
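The batch idea from the conclusion (store each site's details in a JSON file, then loop over the entries) could be sketched roughly as follows. This is not the script from the link above: the file layout, the field names (url, by, value, index) and the function names are assumptions made up for illustration, and download_one stands in for whatever function performs Steps 2 to 6 for a single page.

```python
import json


def load_jobs(path):
    """Read a JSON file that holds a list of job dictionaries.

    Assumed layout (hypothetical, one entry per page):
    [{"url": "http://example.com/a", "by": "id", "value": "player", "index": 0}]
    """
    with open(path, encoding="utf-8") as handle:
        jobs = json.load(handle)
    if not isinstance(jobs, list):
        raise ValueError("expected a JSON list of jobs")
    return jobs


def run_batch(jobs, download_one):
    """Call download_one(job) for every job and return the ones that failed.

    download_one is a hypothetical callable that handles a single page.
    A failure on one site is recorded and the loop carries on with the next.
    """
    failed = []
    for job in jobs:
        try:
            download_one(job)
        except Exception as error:  # sites differ, so some failures are expected
            failed.append((job, str(error)))
    return failed
```

Collecting failures instead of stopping at the first one fits the warning above: some sites will simply refuse, and the rest of the list should still be processed.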