Building a Search Engine for the Cuban Web
Jorge Luis Betancourt, Search/Crawl Engineer
November 16-18, 2016 • Seville, Spain

Who am I
Jorge Luis Betancourt González
• Search/Crawl Engineer
• Apache Nutch Committer & PMC
• Apache Solr/ES enthusiast

Agenda
• Introduction & motivation
• Technologies used
• Customizations
• Conclusions and future work

Introduction / Motivation
[Diagram: Cuba, Internet, Intranet]
Global search engines can’t access documents hosted on the Cuban Intranet.

Writing your own web search engine from scratch? or …

Common search engine features
1. Web search: HTML & documents (PDF, DOC)
   • highlighting
   • suggestions
   • filters (facets)
   • autocorrection
2. Image search (size, format, color, objects)
   • thumbnails
   • show metadata
   • filters (facets)
   • match text with images
3. News search (alerting, notifications)
   • near real time
   • email, push, SMS

How to fulfill these requirements?
At the core, a search engine:
• stores some information (store)
• retrieves this information when a question is received (query)

Open Source to the rescue …
1. crawler
2. index server
3. web interface

Apache Nutch
Nutch is a well-matured, production-ready Web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.
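As an aside on the store/query core described earlier in the talk, here is a minimal Python sketch of a toy inverted index. It is only an illustration of the idea, not how Nutch or Solr implement it. The class name TinyIndex, its methods and the sample documents are hypothetical and do not come from the slides.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-word characters; real analyzers do far more.
        return re.findall(r"\w+", text.lower())

    def store(self, doc_id, text):
        # "Store": remember the document and record it under every term it contains.
        self.docs[doc_id] = text
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def query(self, question):
        # "Query": AND semantics, a document must contain every term of the question.
        terms = self.tokenize(question)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)
```

For example, after storing "Universities in Havana" as doc1 and "Havana news today" as doc2, a query for "havana news" matches only doc2. A production index server adds ranking, analysis, highlighting and faceting on top of this basic structure.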

Apache Nutch
• Highly scalable
• Highly extensible
• Pluggable parsing, protocols, storage, indexing, scoring
• Active community
• Apache License

Apache Solr
Total downloads: 8M+
Monthly downloads: 250,000+
• Apache License
• Great community
• Highly modular
• Stability / Scalability
• Based on Lucene
• Battle tested

Back to the list of features
1. Web search: HTML & documents (PDF, DOC)
   • highlighting
   • suggestions
   • filters (facets)
   • autocorrection
2. Image search (size, format, color, objects)
   • thumbnails
   • show metadata
   • filters (facets)
   • match text with images
3. News search (alerting, notifications)
   • near real time
   • email, push, SMS

Image search and thumbnails
• Custom parser & indexer: store the image thumbnail
• Custom parser & indexer & scoring: identify and store the text related to an image
[Diagram: page fragment with h1, img and p elements]

How does it work?
[Diagram: numbered steps 1 to 3 over a page with h1, img and p elements]

News search (NRT & alerting)
Nutch is really not suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario.

Our topology
[Diagram labels: http://news-site.com, RSS, fetch, parse, index, flaxsearch/luwak (monitor)]
Parses the RSS feed and outputs the news links to be processed according to SC protocol.
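
The RSS step of the topology (parse the feed, output the news links) can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the code used in the talk: the function name extract_news_links and the sample feed are hypothetical, and it assumes a plain RSS 2.0 feed whose items each carry a link element.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        # Skip items without a usable link instead of failing the whole feed.
        if link and link.strip():
            links.append(link.strip())
    return links


# Hypothetical feed, reusing the placeholder host name from the topology slide.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example news site</title>
    <item><title>First story</title><link>http://news-site.com/a</link></item>
    <item><title>Second story</title><link>http://news-site.com/b</link></item>
  </channel>
</rss>"""
```

Calling extract_news_links(SAMPLE_FEED) yields the two story URLs, which a downstream fetch and parse stage could then process. Emitting links one feed at a time, rather than in large batches, is what makes this kind of pipeline a better fit for near-real-time news than a batch crawl.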
https://github.com/commoncrawl/news-crawl

Querying the data
1. Web search: HTML & documents (PDF, DOC)
   • highlighting
   • suggestions
   • filters (facets)
   • autocorrection
2. Image search (size, format, color, objects)
   • thumbnails
   • show metadata
   • filters (facets)
   • match text with images
3. News search (alerting, notifications)
   • near real time
   • email, push, SMS

Apache Solr
• Solr has full support for highlighting (3 implementations)
• Powerful faceting capabilities (even more on recent releases)
• Autocorrection support based on the index content
• Awesome scalability (SolrCloud, classic master-slave replication)

The features, once again
1. Web search: HTML & documents (PDF, DOC)
   • highlighting
   • suggestions
   • filters (facets)
   • autocorrection
2. Image search (size, format, color, objects)
   • thumbnails
   • show metadata
   • filters (facets)
   • match text with images
3. News search (alerting, notifications)
   • near real time
   • email, push, SMS

Other features - monitoring
We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …
• (and metrics) time series store
• (and logs) analytical tool (and facets)

Other features - monitoring
• (and logs) parsing & aggregation
• (and metrics) time series store
• (and logs) analytical tool (and facets)

Banana (Kibana port)
for visualizations

Infrastructure
[Diagram labels: WEB, Solr 2, Replicator, Solr Master 1, Crawlers (Nutch); connections labeled HTTP and JAVABIN]

Some usage stats
• less than 10 000 visits
• around 600 unique visitors

Future work
• Apply deep learning techniques to process the raw images and mix with the current approach
• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)

Thanks
Questions?
[email protected]
@jorgelbg