Scalable
Adi Suresh, Oscar Mejia, Aniket Shah, Pavan Kancherlapalli, Ansuman Prusty, Roland Schiebel

Agenda
● Introduction
● Infrastructure
● System Design
● Demo
● Application Flow
● Design and Implementation Challenges
● Testing
● Team Dynamics
● Conclusion

Basics
◎ A web crawler is:
  ○ An application that can recursively iterate through web pages and documents
◎ Our solution:
  ○ Crawler (spider)
  ○ Scraper (HTML and JS)
  ○ Cloud Repository
  ○ No
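The "recursively iterate through web pages" idea on the Basics slide can be sketched as a breadth-first traversal of links. This is a minimal illustration and not the team's implementation: `LinkParser`, `crawl`, the `fetch` callable and `max_pages` are hypothetical names, and a queue stands in for literal recursion so that deep sites cannot exhaust the call stack.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    `fetch` is a hypothetical callable mapping a URL to an HTML string,
    or to None when the page cannot be retrieved.
    Returns the crawled URLs in visiting order.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the traversal independent of how a page is downloaded, so the same loop could be driven by any HTTP client or by an in-memory dictionary during testing.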

Our Customer
◎ Kyle Gallatin
  ○ Data Scientist at
◎ Team of technical and non-technical users
◎ “Restrictions”:
  ○ Avoid vendor lock-in
  ○ Use Docker / Kubernetes / Python

Requirements
◎ Provides a UI for non-technical users
◎ Scales per request
◎ Crawls entire domain or list of
◎ Stores various document types (e.g. , docx and )
◎ Uses trained ML models to filter crawled documents

System Design

Tools
◎ Python
◎ Flask
◎ Django
◎ MySQL
◎ Redis
◎ Docker
◎ Kubernetes
◎ Google Cloud

Demo
◎ GitLab (CI & CD)
◎ Google Cloud Console
◎ Web Crawler Example

Starting Crawl Request

Crawling and Scraping Pages

Completing Crawl Request

Overview of Private Cloud Environment

Machine Learning Model

Purpose: Decide whether or not to store a crawled page
◎ User trains a classification model on a dataset
◎ Output of the model is an integer
◎ Mapping from the model’s output integers to categories is known to the user
◎ User uploads the model to the web crawler service

Machine Learning Model (contd.)

Dynamic Content Crawling
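Returning to the Machine Learning Model slide: the store-or-skip decision can be sketched as a lookup from the model's integer output to a user-known category. Everything named here is an assumption made for illustration: `KeywordModel` is a stand-in for the uploaded classifier, the scikit-learn style `predict` method is a guess at its interface, and `LABELS`, `KEEP` and `should_store` are hypothetical names.

```python
class KeywordModel:
    """Stand-in for a user-trained classifier (hypothetical).

    Mimics a `predict` method that takes a list of texts and returns
    one integer per text.
    """

    def predict(self, texts):
        return [1 if "invoice" in text.lower() else 0 for text in texts]


# Hypothetical mapping from model output integers to user-known categories.
LABELS = {0: "irrelevant", 1: "finance"}

# Hypothetical set of categories the user wants to keep.
KEEP = {"finance"}


def should_store(model, page_text, labels=LABELS, keep=KEEP):
    """Return True if the page's predicted category is one to store."""
    label_id = int(model.predict([page_text])[0])
    category = labels.get(label_id)  # None for an unmapped integer
    return category in keep
```

An integer with no entry in the mapping resolves to `None` and the page is skipped, which is one conservative choice; a real service might prefer to log or reject such outputs.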

◎ Standard Python ‘requests’ library fetches static HTML
◎ Selenium with WebDriver fetches dynamic HTML
◎ Crawling with ‘Selenium’ is around two times slower than with ‘requests’
◎ Option provided to crawl with either ‘Selenium’ or ‘requests’

Testing

Team Dynamics
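For the Dynamic Content Crawling slide, the choice between ‘requests’ and ‘Selenium’ can be sketched as a small dispatch on a mode string. This is an illustration under assumptions, not the service's code: the function names, the `mode` parameter, the 10 second timeout and the use of headless Chrome are all hypothetical. The third-party imports sit inside the functions so that only the selected backend has to be installed.

```python
def fetch_static(url, timeout=10):
    """Fetch raw HTML with the `requests` library (no JavaScript executed)."""
    import requests  # third-party; imported lazily

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def fetch_dynamic(url):
    """Fetch HTML after JavaScript has run, using Selenium WebDriver.

    Headless Chrome is an assumption; the deck does not name a browser.
    """
    from selenium import webdriver  # third-party; imported lazily
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Hypothetical registry of crawl modes offered to the user.
FETCHERS = {"requests": fetch_static, "selenium": fetch_dynamic}


def fetch_page(url, mode="requests", fetchers=FETCHERS):
    """Fetch a page with the backend selected by `mode`."""
    try:
        fetcher = fetchers[mode]
    except KeyError:
        raise ValueError(f"unknown crawl mode: {mode!r}") from None
    return fetcher(url)
```

Defaulting to `"requests"` reflects the speed difference noted on the slide: the slower browser-based path is used only when the user asks for it.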

◎ Conducted weekly sprints
◎ Held standup meetings three times a week
◎ Assigned tasks based on services

Conclusion
We built a unique web crawler with the following properties:
◎ Provides a GUI for users to interact with
◎ Scales horizontally with open-source technologies (Docker and Kubernetes)
◎ Stores multiple file types
◎ Uses trained ML models to filter the documents the user wants to store


Thank You! Questions?