Building a Search Engine for the Cuban Web
Jorge Luis Betancourt, Search/Crawl Engineer
NOVEMBER 16-18, 2016 • SEVILLE, SPAIN
Who am I
Jorge Luis Betancourt González
• Search/Crawl Engineer
• Apache Nutch Committer & PMC
• Apache Solr/ES enthusiast
Agenda
• Introduction & motivation
• Technologies used
• Customizations
• Conclusions and future work
Introduction / Motivation
[Diagram: Cuba, Internet vs. Intranet]
Global search engines can’t access documents hosted on the Cuban Intranet
Writing your own web search engine from scratch?
or …
Common search engine features
1 Web search: HTML & documents (PDF, DOC)
• highlighting
• suggestions
• filters (facets)
• autocorrection
2 Image search (size, format, color, objects)
• thumbnails
• show metadata
• filters (facets)
• match text with images
3 News search (alerting, notifications)
• near real time
• email, push, SMS
How to fulfill these requirements?
At its core, a search engine stores some information and retrieves it when a question is received.
[Diagram: store, query]
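The store/retrieve idea on this slide can be sketched as a toy inverted index. This is only an illustration of the concept, not how Solr or Lucene are implemented; the class name `TinyIndex`, its methods, and the sample documents are all made up for the example.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}
        self.docs = {}                    # doc_id -> original text

    def store(self, doc_id, text):
        """Store a document and index every lowercased, whitespace-separated term."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of the documents that contain all query terms."""
        terms = question.lower().split()
        if not terms:
            return []
        # .get avoids creating empty postings entries for unknown terms.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


# Hypothetical sample documents.
index = TinyIndex()
index.store(1, "Cuban web search engine")
index.store(2, "image search with thumbnails")
index.store(3, "news search in near real time")
```

Calling `index.query("image search")` intersects the postings of both terms, so only documents containing every term are returned. A real engine adds analysis, ranking, and persistence on top of this.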
Open Source to the rescue …
1 Crawler
2 Index server
3 Web interface
Apache Nutch
“Nutch is a well matured, production ready Web crawler. It enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”
Apache Nutch
• Highly scalable
• Highly extensible
• Pluggable parsing, protocols, storage, indexing, scoring
• Active community
Apache Solr
TOTAL DOWNLOADS 8M+
MONTHLY DOWNLOADS 250,000+
• Apache License
• Great community
• Highly modular
• Stability / Scalability
• Based on Lucene
• Battle tested
Back to the list of features
1 Web search • 2 Image search • 3 News search
Image search and thumbnails
• Custom parser & indexer to store the image thumbnail
• Custom parser, indexer & scoring to identify and store the text related to an image
[Diagram: an img element with neighboring h1 and p elements]
How does it work?
[Diagram: steps 1-3 over a page fragment made of h1, img, and p elements]
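One plausible reading of the h1 / img / p diagram is that the text surrounding an image is collected and indexed together with it. The sketch below illustrates that idea with Python's standard `html.parser`. It is an assumption about the general approach, not the actual Nutch plugin code, and the names `ImageContextParser`, `image_documents`, and the `related_text` field are hypothetical.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collects, for every <img>, its alt text, the closest preceding <h1>
    and the first <p> that closes after it."""

    def __init__(self):
        super().__init__()
        self.images = []        # one dict per <img> found
        self._open_tag = None   # "h1" or "p" while inside such an element
        self._buffer = []       # text chunks of the element being read
        self._last_h1 = ""      # text of the most recent <h1>

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._open_tag = tag
            self._buffer = []
        elif tag == "img":
            attrs = dict(attrs)
            self.images.append({
                "src": attrs.get("src", ""),
                "alt": attrs.get("alt", ""),
                "heading": self._last_h1,
                "paragraph": None,  # filled in when the next <p> closes
            })

    def handle_data(self, data):
        if self._open_tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open_tag:
            return
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        if tag == "h1":
            self._last_h1 = text
        else:  # a <p> just closed: attach it to images still lacking one
            for image in self.images:
                if image["paragraph"] is None:
                    image["paragraph"] = text
        self._open_tag = None


def image_documents(html):
    """Build one indexable record per image from its alt, heading and paragraph."""
    parser = ImageContextParser()
    parser.feed(html)
    parser.close()
    docs = []
    for image in parser.images:
        parts = (image["alt"], image["heading"], image["paragraph"])
        docs.append({"url": image["src"],
                     "related_text": " ".join(p for p in parts if p)})
    return docs


# Hypothetical page fragment.
page = """
<h1>Old Havana</h1>
<img src="/img/capitolio.jpg" alt="El Capitolio">
<p>The Capitol building seen from Paseo del Prado.</p>
"""
docs = image_documents(page)
```

Each record could then be sent to the index so that a text query matches the image through its `related_text`. Weighting the heading, alt text, and paragraph differently would be a scoring decision that this sketch leaves out.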
News search (NRT & alerting)
Nutch is not well suited to this task: the batch nature of Hadoop jobs doesn’t fit a near-real-time scenario
Our topology
[Topology diagram: http://news-site.com, RSS, fetch, parse, index, flaxsearch/luwak monitor]
• RSS: parses the RSS feed and outputs the news links to be processed according to the SC protocol.
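The RSS step of the topology (parse the feed, output the news links) can be illustrated with a few lines of standard-library Python. This is a minimal sketch that assumes a plain RSS 2.0 feed. It is not the code used in the topology, and the function name `extract_news_links` and the sample feed are invented for the example.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


# Hypothetical feed for the placeholder site shown on the slide.
feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example news site</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/a</link></item>
    <item><title>Second story</title><link>http://news-site.com/b</link></item>
  </channel>
</rss>"""
links = extract_news_links(feed)
```

Only the links inside `<item>` elements are emitted, so the channel-level `<link>` is skipped. In a streaming topology each link would then be handed to the fetch step as soon as it is seen, instead of waiting for a batch job.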
https://github.com/commoncrawl/news-crawl
Querying the data
1 Web search • 2 Image search • 3 News search
Apache Solr
• Solr has full support for highlighting (3 implementations)
• powerful faceting capabilities (even more in recent releases)
• autocorrection support based on the index content
• awesome scalability (SolrCloud, classic master-slave replication)
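To make the features above concrete, here is a hedged sketch of a single Solr request that asks for highlighting, facets, and spellcheck suggestions at once. `hl`, `hl.fl`, `facet`, `facet.field`, `spellcheck`, and `spellcheck.collate` are standard Solr request parameters. The core name `web`, the field names `host`, `type`, and `content`, and the helper `build_search_url` are assumptions made for the example. Spellcheck also needs the spellcheck component configured on the request handler.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, user_query, facet_fields, highlight_field):
    """Build a Solr select URL with highlighting, faceting and spellcheck enabled."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),               # enable highlighting
        ("hl.fl", highlight_field),   # field(s) to build snippets from
        ("facet", "true"),            # enable faceting
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    params += [("spellcheck", "true"), ("spellcheck.collate", "true")]
    return base_url + "?" + urlencode(params)


url = build_search_url(
    "http://localhost:8983/solr/web/select",  # hypothetical core named "web"
    "universidad habana",
    facet_fields=["host", "type"],            # hypothetical field names
    highlight_field="content",                # hypothetical field name
)

# Round-trip the query string to inspect what would be sent.
sent = parse_qs(urlsplit(url).query)
```

The URL is only built, not requested, so the sketch runs without a Solr instance. Sending it with any HTTP client would return the results, snippets, facet counts, and suggestions in one JSON response.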
The features, once again
1 Web search • 2 Image search • 3 News search
Other features - monitoring
We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …
• (and logs) parsing & aggregation
• (and metrics) time series store
• (and logs) analytical tool (and facets)
Banana (Kibana port) for visualizations
Infrastructure
[Architecture diagram: Web, Solr 2 (replicator), Solr Master 1, Nutch crawlers; links labeled HTTP and JAVABIN]
Some usage stats
• fewer than 10 000 visits
• around 600 unique visitors
Future work
• Apply deep learning techniques to process the raw images and combine the results with the current approach
• Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)
Thanks
Questions?
[email protected] • @jorgelbg