Scalable Web Crawler

Adi Suresh, Oscar Mejia, Aniket Shah, Pavan Kancherlapalli, Ansuman Prusty, Roland Schiebel

Agenda
● Introduction
● Infrastructure
● System Design
● Demo
● Application Flow
● Design and Implementation Challenges
● Testing
● Team Dynamics
● Conclusion

Basics
◎ A web crawler is:
  ○ An application that can recursively iterate through web pages and documents
◎ Our solution:
  ○ Crawler (spider)
  ○ Scraper (HTML and JS)
  ○ Cloud Repository
  ○ No Search Engine

Our Customer
◎ Kyle Gallatin
  ○ Data Scientist at
◎ Team of technical and non-technical users
◎ “Restrictions”:
  ○ Avoid vendor lock-in
  ○ Use Docker / Kubernetes / Python

Requirements
◎ Provides a UI for non-technical users
◎ Scales per request
◎ Crawls an entire domain or a list of URLs
◎ Stores various document types (e.g. html, docx and pdf)
◎ Uses machine learning to filter crawled documents

System Design

Tools
◎ Python
◎ Flask
◎ Django
◎ MySQL
◎ Redis
◎ Docker
◎ Kubernetes
◎ Google Cloud

Demo
◎ GitLab (CI & CD)
◎ Google Cloud Console
◎ Web Crawler Example

Starting Crawl Request

Crawling and Scraping Pages

Completing Crawl Request

Overview of Private Cloud Environment

Machine Learning Model
Purpose: decide whether or not to store a crawled page
◎ User trains a classification model on a dataset
◎ Output of the model is an integer
◎ Mapping from the model’s output integers to categories known to the user
◎ User uploads the model to the web crawler service

Machine Learning Model (contd.)
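The store-or-skip decision described on the Machine Learning Model slide can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the names `should_store`, `toy_predict`, and `LABEL_MAP` are hypothetical, and the keyword rule stands in for whatever trained classifier a user would actually upload.

```python
from typing import Callable, Dict, Set


def should_store(page_text: str,
                 predict: Callable[[str], int],
                 label_map: Dict[int, str],
                 wanted: Set[str]) -> bool:
    """Return True if the page's predicted category is one the user wants."""
    # The model returns an integer; the user-supplied mapping turns it
    # into a category name. Unknown integers map to None and are skipped.
    category = label_map.get(predict(page_text))
    return category in wanted


# Hypothetical stand-in for an uploaded model: a keyword rule, not a real classifier.
def toy_predict(page_text: str) -> int:
    return 1 if "crawler" in page_text.lower() else 0


# Hypothetical mapping from model output integers to category names.
LABEL_MAP = {0: "irrelevant", 1: "relevant"}

if __name__ == "__main__":
    for text in ("A scalable web crawler", "Cooking recipes"):
        print(text, "->", should_store(text, toy_predict, LABEL_MAP, {"relevant"}))
```

Keeping the integer-to-category mapping outside the model means the crawler service only needs a callable that returns an integer, which matches the slide's description of what the user uploads.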
Dynamic Content Crawling
◎ The standard Python ‘requests’ library fetches static HTML
◎ Selenium with WebDriver for dynamic HTML
◎ Crawling with ‘Selenium’ is around two times slower than with ‘requests’
◎ Option provided to crawl with either ‘Selenium’ or ‘requests’

Testing

Team Dynamics
◎ Conducted weekly sprints
◎ Held standup meetings three times a week
◎ Assigned tasks based on services

Conclusion
We built a unique web crawler with the following properties:
◎ Provides a GUI for the user to interact with
◎ Scales horizontally with open source technologies (Docker and Kubernetes)
◎ Stores multiple file types
◎ Uses trained ML models to filter documents the user wants to store

Thank You! Questions?
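As a supplementary note to the Dynamic Content Crawling slide, the choice between ‘requests’ and ‘Selenium’ could be exposed along these lines. This is a sketch under assumptions: the function names, the `mode` parameter, the 10-second timeout, and the use of headless Chrome are all illustrative and not taken from the project.

```python
from typing import Callable, Dict


def fetch_static(url: str) -> str:
    """Fetch raw HTML with requests (no JavaScript is executed)."""
    import requests  # imported lazily so the module loads without it installed

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def fetch_dynamic(url: str) -> str:
    """Fetch HTML after JavaScript has run, using Selenium with headless Chrome."""
    from selenium import webdriver  # imported lazily for the same reason

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


# Hypothetical registry mapping the user's choice to a fetch function.
FETCHERS: Dict[str, Callable[[str], str]] = {
    "requests": fetch_static,
    "selenium": fetch_dynamic,
}


def fetch(url: str, mode: str = "requests",
          fetchers: Dict[str, Callable[[str], str]] = FETCHERS) -> str:
    """Dispatch to the fetcher selected by the user."""
    try:
        fetcher = fetchers[mode]
    except KeyError:
        raise ValueError(f"unknown fetch mode: {mode!r}") from None
    return fetcher(url)
```

Defaulting to the static fetcher reflects the slide's observation that Selenium is roughly two times slower, so the browser is only started when the user asks for it.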
