Scalable Web Crawler

Adi Suresh, Oscar Mejia, Aniket Shah, Pavan Kancherlapalli, Ansuman Prusty, Roland Schiebel

Agenda
● Introduction
● Infrastructure
● System Design
● Demo
● Application Flow
● Design and Implementation Challenges
● Testing
● Team Dynamics
● Conclusion

Basics
◎ A web crawler is:
  ○ An application that recursively iterates through web pages and documents
◎ Our solution:
  ○ Crawler (spider)
  ○ Scraper (HTML and JS)
  ○ Cloud Repository
  ○ No Search Engine

Our Customer
◎ Kyle Gallatin
  ○ Data Scientist
◎ Team of technical and non-technical users
◎ “Restrictions”:
  ○ Avoid vendor lock-in
  ○ Use Docker / Kubernetes / Python

Requirements
◎ Provides a UI for non-technical users
◎ Scales per request
◎ Crawls an entire domain or a list of URLs
◎ Stores various document types (e.g. html, docx and pdf)
◎ Uses machine learning to filter crawled documents

System Design

Tools
◎ Python
◎ Flask
◎ Django
◎ MySQL
◎ Redis
◎ Docker
◎ Kubernetes
◎ Google Cloud

Demo
◎ GitLab (CI & CD)
◎ Google Cloud Console
◎ Web Crawler Example

Starting Crawl Request

Crawling and Scraping Pages

Completing Crawl Request

Overview of Private Cloud Environment

Machine Learning Model
Purpose: decide whether or not to store a crawled page
◎ User trains a classification model on a dataset
◎ Output of the model is an integer
◎ The mapping from the model’s output integers to categories is known to the user
◎ User uploads the model to the web crawler service
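
The store-or-skip decision on the Machine Learning Model slide can be sketched roughly as follows. This is a minimal illustration, not the team's implementation. It assumes the uploaded model exposes a scikit-learn-style `predict` method that returns one integer per input. The names `should_store`, `label_names`, `wanted` and the stand-in `KeywordModel` are hypothetical.

```python
from typing import Mapping, Protocol, Sequence, Set


class IntClassifier(Protocol):
    """Assumed interface: predict() returns one integer label per input text."""

    def predict(self, texts: Sequence[str]) -> Sequence[int]: ...


def should_store(
    model: IntClassifier,
    page_text: str,
    label_names: Mapping[int, str],
    wanted: Set[str],
) -> bool:
    """Return True if the page's predicted category is one the user wants to keep."""
    label = int(model.predict([page_text])[0])
    # The user supplies the integer-to-category mapping; unknown integers map to None.
    category = label_names.get(label)
    return category in wanted


class KeywordModel:
    """Hypothetical stand-in for a trained classifier, only to make the sketch runnable."""

    def predict(self, texts: Sequence[str]) -> Sequence[int]:
        return [1 if "invoice" in t.lower() else 0 for t in texts]
```

A crawl worker could call `should_store(model, text, {0: "other", 1: "finance"}, {"finance"})` after scraping each page and skip the upload to the repository when it returns `False`.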
Dynamic Content Crawling
◎ The standard Python ‘requests’ library fetches static HTML
◎ Selenium with a webdriver fetches dynamic (JavaScript-rendered) HTML
◎ Crawling with ‘Selenium’ is roughly twice as slow as with ‘requests’
◎ Users can choose to crawl with ‘Selenium’ or ‘requests’

Testing

Team Dynamics
◎ Conducted weekly sprints
◎ Held standup meetings three times a week
◎ Assigned tasks based on services

Conclusion
We built a unique web crawler with the following properties:
◎ Provides a GUI for users to interact with
◎ Scales horizontally with open source technologies (Docker and Kubernetes)
◎ Stores multiple file types
◎ Uses trained ML models to filter the documents the user wants to store

Thank You! Questions?
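
For reference, the ‘requests’ versus ‘Selenium’ option from the Dynamic Content Crawling slide might look something like the sketch below. It is an assumption-laden illustration rather than the project's code. The function names, the `FETCHERS` table, the `engine` parameter and the choice of headless Chrome are all hypothetical. Defaulting to ‘requests’ is also an assumption, motivated only by the slide's note that Selenium is slower.

```python
from typing import Callable, Dict


def fetch_static(url: str, timeout: float = 10.0) -> str:
    """Fetch raw HTML with requests; JavaScript on the page is not executed."""
    import requests  # imported lazily so the dispatcher works without it installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_dynamic(url: str) -> str:
    """Load the page in headless Chrome via Selenium and return the rendered DOM."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Hypothetical lookup table from the user's chosen option to a fetch function.
FETCHERS: Dict[str, Callable[[str], str]] = {
    "requests": fetch_static,
    "selenium": fetch_dynamic,
}


def fetch(url: str, engine: str = "requests") -> str:
    """Dispatch to the fetcher selected for this crawl request."""
    if engine not in FETCHERS:
        raise ValueError(f"unknown engine: {engine!r}")
    return FETCHERS[engine](url)
```

Keeping the two fetchers behind one `fetch` call lets the rest of the crawler stay unaware of which engine a given crawl request selected.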
