
Scalable Web Crawler
Adi Suresh, Oscar Mejia, Aniket Shah, Pavan Kancherlapalli, Ansuman Prusty, Roland Schiebel

Agenda
● Introduction
● Infrastructure
● System Design
● Demo
● Application Flow
● Design and Implementation Challenges
● Testing
● Team Dynamics
● Conclusion

Basics
◎ A web crawler is:
  ○ An application that can recursively iterate through web pages and documents
◎ Our solution:
  ○ Crawler (spider)
  ○ Scraper (HTML and JS)
  ○ Cloud Repository
  ○ No Search Engine

Our Customer
◎ Kyle Gallatin
  ○ Data Scientist at
◎ Team of technical and non-technical users
◎ “Restrictions”:
  ○ Avoid vendor lock-in
  ○ Use Docker / Kubernetes / Python

Requirements
◎ Provides a UI for non-technical users
◎ Scales per request
◎ Crawls an entire domain or a list of URLs
◎ Stores various document types (e.g. HTML, DOCX and PDF)
◎ Uses machine learning to filter crawled documents

System Design

Tools
◎ Python
◎ Flask
◎ Django
◎ MySQL
◎ Redis
◎ Docker
◎ Kubernetes
◎ Google Cloud

Demo
◎ GitLab (CI & CD)
◎ Google Cloud Console
◎ Web Crawler Example

Starting Crawl Request

Crawling and Scraping Pages

Completing Crawl Request

Overview of Private Cloud Environment

Machine Learning Model
Purpose: decide whether or not to store a crawled page
◎ User trains a classification model on a dataset
◎ Output of the model is an integer
◎ Mapping from the model’s output integers to categories is known to the user
◎ User uploads the model to the web crawler service

Machine Learning Model (contd.)
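The store-or-skip decision described on the Machine Learning Model slides could look roughly like the sketch below. It is only an illustration under stated assumptions: `should_store`, `toy_predict`, `LABEL_NAMES` and `WANTED` are hypothetical names, and the keyword rule stands in for whatever trained classifier the user would actually upload.

```python
from typing import Callable, Dict, Set


def should_store(
    page_text: str,
    predict: Callable[[str], int],
    label_names: Dict[int, str],
    wanted: Set[str],
) -> bool:
    """Return True if the model's category for this page is one the user wants."""
    label = predict(page_text)          # the model outputs an integer
    category = label_names.get(label)   # user-supplied integer -> category mapping
    return category in wanted           # unknown labels map to None and are skipped


# Hypothetical stand-in for an uploaded model: a trivial keyword rule.
def toy_predict(text: str) -> int:
    return 1 if "recipe" in text.lower() else 0


# Assumed mapping and selection; in practice the user provides both.
LABEL_NAMES = {0: "other", 1: "cooking"}
WANTED = {"cooking"}

if __name__ == "__main__":
    for page in ("A Recipe for sourdough bread", "Stock market news"):
        print(page, "->", should_store(page, toy_predict, LABEL_NAMES, WANTED))
```

Keeping the integer-to-category mapping outside the model means the crawler only needs a `predict`-style callable and never has to know what the categories mean.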

Dynamic Content Crawling
◎ The standard Python ‘requests’ library fetches static HTML
◎ Selenium with a webdriver fetches dynamic HTML
◎ Crawling with Selenium is roughly twice as slow as with ‘requests’
◎ The user can choose to crawl with either Selenium or ‘requests’

Testing

Team Dynamics
◎ Conducted weekly sprints
◎ Held standup meetings three times a week
◎ Assigned tasks based on services

Conclusion
We built a unique web crawler with the following properties:
◎ Provides a GUI for user interaction
◎ Scales horizontally with open source technologies (Docker and Kubernetes)
◎ Stores multiple file types
◎ Uses trained ML models to filter documents the user wants to store

Thank You! Questions?
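
As a supplement to the Dynamic Content Crawling slide, the choice between ‘requests’ and Selenium might be wired up along these lines. This is a minimal sketch, not the team's implementation: the function names, the `mode` parameter, the 10-second timeout and the headless Chrome setup are all assumptions. The third-party imports are deferred so that only the chosen backend needs to be installed.

```python
from typing import Callable, Dict, Optional


def fetch_static(url: str) -> str:
    """Fetch raw HTML with 'requests'; JavaScript is not executed."""
    import requests  # third-party, imported lazily

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def fetch_dynamic(url: str) -> str:
    """Load the page in headless Chrome via Selenium and return the rendered DOM."""
    from selenium import webdriver  # third-party, imported lazily

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


FETCHERS: Dict[str, Callable[[str], str]] = {
    "requests": fetch_static,
    "selenium": fetch_dynamic,
}


def fetch(
    url: str,
    mode: str = "requests",
    fetchers: Optional[Dict[str, Callable[[str], str]]] = None,
) -> str:
    """Dispatch to the fetcher selected by the user-facing option."""
    table = FETCHERS if fetchers is None else fetchers
    if mode not in table:
        raise ValueError(f"unknown fetch mode: {mode!r}")
    return table[mode](url)
```

Defaulting to ‘requests’ reflects the speed difference noted on the slide; a caller would pass `mode="selenium"` only for pages whose content is built by JavaScript.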