Scalable Web Crawler
Adi Suresh, Oscar Mejia, Aniket Shah, Pavan Kancherlapalli, Ansuman Prusty, Roland Schiebel

Agenda
● Introduction
● Infrastructure
● System Design
● Demo
● Application Flow
● Design and Implementation Challenges
● Testing
● Team Dynamics
● Conclusion
Basics
◎ A web crawler is:
○ An application that recursively iterates through web pages and documents
◎ Our solution:
○ Crawler (spider)
○ Scraper (HTML and JS)
○ Cloud repository
○ No search engine
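The recursive iteration above can be sketched as a breadth-first traversal over a link graph. This is a minimal illustration, not the project's actual crawler: the `fetch_links` callable stands in for a real HTTP fetch plus HTML link extraction, and the stubbed `site` graph is invented for the example.

```python
from collections import deque
from typing import Callable, Dict, List


def crawl(seed: str, fetch_links: Callable[[str], List[str]], limit: int = 100) -> List[str]:
    """Breadth-first crawl: visit each page at most once, following its links."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:          # de-duplicate before enqueueing
                seen.add(link)
                frontier.append(link)
    return visited


# Stubbed link graph standing in for real HTTP fetches.
site: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}
print(crawl("/", lambda u: site.get(u, [])))  # ['/', '/a', '/b', '/c']
```

The `seen` set is what keeps the recursion from looping forever on cyclic links such as `/c` → `/`.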
Our Customer
◎ Kyle Gallatin
○ Data Scientist at
◎ Team of technical and non-technical users
◎ “Restrictions”:
○ Avoid vendor lock-in
○ Use Docker / Kubernetes / Python
Requirements
◎ Provides a UI for non-technical users
◎ Scales per request
◎ Crawls an entire domain or a list of URLs
◎ Stores various document types (e.g. HTML, DOCX, and PDF)
◎ Uses machine learning to filter crawled documents
System Design
Tools
◎ Python
◎ Flask
◎ Django
◎ MySQL
◎ Redis
◎ Docker
◎ Kubernetes
◎ Google Cloud
Demo
◎ GitLab (CI & CD)
◎ Google Cloud Console
◎ Web Crawler Example
○ Starting Crawl Request
○ Crawling and Scraping Pages
○ Completing Crawl Request
◎ Overview of Private Cloud Environment

Machine Learning Model
Purpose: decide whether or not to store a crawled page
◎ User trains a classification model on a dataset
◎ The model's output is an integer
◎ A mapping from the model's output integers to categories known to the user
◎ User uploads the model to the web crawler service

Machine Learning Model (contd.)

Dynamic Content Crawling
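The integer-to-category filtering described under Machine Learning Model above could look roughly like the sketch below. Everything here is an assumed illustration: the `LABELS` mapping, the `should_store` helper, and the stub `predict` function are invented stand-ins for the user's uploaded model.

```python
from typing import Callable, Dict, Set

# Hypothetical mapping from the model's integer outputs to the user's
# category names; the real labels depend on the uploaded model.
LABELS: Dict[int, str] = {0: "spam", 1: "product_page", 2: "news"}


def should_store(document: str,
                 predict: Callable[[str], int],
                 keep: Set[str],
                 labels: Dict[int, str] = LABELS) -> bool:
    """Store the crawled document only if the model maps it to a kept category."""
    category = labels.get(predict(document), "unknown")
    return category in keep


# Stub classifier: a real deployment would load the user's trained model.
predict = lambda doc: 2 if "breaking" in doc else 0

print(should_store("breaking news today", predict, {"news"}))  # True
print(should_store("cheap pills!!!", predict, {"news"}))       # False
```

Keeping the integer→category mapping separate from the model lets non-technical users reason about the filter in their own vocabulary.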
◎ The standard Python ‘requests’ library fetches static HTML
◎ Selenium with a WebDriver renders dynamic HTML
◎ Crawling with ‘Selenium’ is roughly twice as slow as with ‘requests’
◎ Option provided to crawl with either ‘Selenium’ or ‘requests’

Testing

Team Dynamics
◎ Conducted weekly sprints
◎ Held standup meetings three times a week
◎ Assigned tasks based on services
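The per-request choice between ‘requests’ and ‘Selenium’ described under Dynamic Content Crawling can be sketched as a fetcher dispatch. This is an assumed shape, not the project's code: the real "requests" fetcher would call `requests.get(url).text` and the "selenium" one would render the page via a WebDriver and return `driver.page_source`; here both are stubbed.

```python
from typing import Callable, Dict


def fetch_page(url: str, engine: str,
               fetchers: Dict[str, Callable[[str], str]]) -> str:
    """Dispatch to the static ('requests') or dynamic ('selenium') fetcher."""
    if engine not in fetchers:
        raise ValueError(f"unknown engine: {engine!r}")
    return fetchers[engine](url)


# Stubs standing in for the real libraries.
fetchers = {
    "requests": lambda url: f"<html>static {url}</html>",
    "selenium": lambda url: f"<html>rendered {url}</html>",
}
print(fetch_page("http://example.com", "selenium", fetchers))
```

Because Selenium is about twice as slow, defaulting to "requests" and opting into "selenium" per crawl request keeps the common case fast.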
Conclusion
We built a unique web crawler with the following properties:
◎ Provides a GUI for users to interact with
◎ Scales horizontally with open-source technologies (Docker and Kubernetes)
◎ Stores multiple file types
◎ Uses trained ML models to filter the documents the user wants to store
Thank You! Questions?