Building a Search Engine for the Cuban Web

Jorge Luis Betancourt Search/Crawl Engineer

NOVEMBER 16-18, 2016 • SEVILLE, SPAIN

Who am I

Jorge Luis Betancourt González
Search/Crawl Engineer
Committer & PMC
/ES enthusiast

Agenda

• Introduction & motivation

• Technologies used

• Customizations

• Conclusions and future work

Introduction / Motivation

Cuba

Internet Intranet

Global search engines can’t access documents hosted on the Cuban Intranet

Writing your own web search engine from scratch?

or …

Common search engine features

1 Web search: HTML & documents (PDF, DOC)
• highlighting
• suggestions
• filters (facets)
• autocorrection

2 Image search (size, format, color, objects)
• thumbnails
• show metadata
• filters (facets)
• match text with images

3 News search (alerting, notifications)
• near real time
• email, push, SMS

How to fulfill these requirements?

At its core, a search engine stores some information (store) and retrieves this information when a question is received (query).
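The store/query cycle described here can be illustrated with a toy inverted index. This is a minimal Python sketch under stated assumptions: the class name `TinyIndex`, its methods and the sample documents are made up for illustration, and production engines such as Lucene add text analysis, ranking and on-disk structures on top of the same idea.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Store a document and index each lowercase, whitespace-separated term."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of the documents that contain every query term."""
        terms = question.lower().split()
        if not terms:
            return []
        hits = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            hits &= self.postings.get(term, set())
        return sorted(hits)


# Sample documents are invented for the example.
index = TinyIndex()
index.store("a", "Havana is the capital of Cuba")
index.store("b", "Seville is a city in Spain")
print(index.query("capital cuba"))  # ['a']
```

Requiring every term to match (AND semantics) keeps the sketch short; a real engine would score partial matches instead of discarding them.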

Open Source to the rescue …

1 Crawler
2 Index server
3 Web interface

Apache Nutch

Nutch is a well matured, production ready Web crawler. It enables fine grained configuration, relying on Apache Hadoop data structures, which are great for batch processing.

Apache Nutch

• Highly scalable

• Highly extensible

• Pluggable parsing, protocols, storage, indexing, scoring

• Active community

Apache Solr

TOTAL DOWNLOADS 8M+

MONTHLY DOWNLOADS 250,000+

• Apache License
• Great community
• Highly modular
• Stability / Scalability
• Based on Lucene
• Battle tested

Back to the list of features

1 Web search: HTML & documents (PDF, DOC)
• highlighting
• suggestions
• filters (facets)
• autocorrection

2 Image search (size, format, color, objects)
• thumbnails
• show metadata
• filters (facets)
• match text with images

3 News search (alerting, notifications)
• near real time
• email, push, SMS

Image search and thumbnails

• Custom parser & indexer to store the image thumbnail

• Custom parser, indexer & scoring to identify and store the text related to an image (for example the surrounding h1 and p elements)
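One way to picture "the text related with an image" is to attach the nearest heading and the following paragraph to each `img` element. The sketch below uses Python's standard `html.parser` and is only an illustration: the class `ImageContextParser`, the association rule (last `h1` seen, next `p` closed) and the sample page are assumptions, not the custom Nutch parser used in the talk.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Hypothetical parser: pairs each <img> with the last <h1> and the next <p>."""

    def __init__(self):
        super().__init__()
        self.images = []         # one dict per <img> found
        self._open = None        # tag whose text is being collected: "h1" or "p"
        self._buffer = []
        self._last_heading = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._open, self._buffer = tag, []
        elif tag == "img":
            attrs = dict(attrs)
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "heading": self._last_heading,
                "paragraph": None,  # filled in when the next <p> closes
            })

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        # Collapse runs of whitespace so the stored text is clean.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1":
            self._last_heading = text
        else:
            for image in self.images:
                if image["paragraph"] is None:
                    image["paragraph"] = text
        self._open = None


# Sample page invented for the example.
page = """
<h1>Old Havana</h1>
<img src="/img/plaza.jpg" alt="Plaza Vieja">
<p>The square was   restored in the 1990s.</p>
"""
parser = ImageContextParser()
parser.feed(page)
print(parser.images[0])
```

The collected `alt`, `heading` and `paragraph` values could then be indexed as searchable fields next to the image URL, which is what makes a text query able to return images.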

How does it work?

[Diagram: page layout with h1, p and img elements, annotated with steps 1 to 3]

News search (NRT & alerting)

Nutch is not really suited for this task: the batch nature of the Hadoop jobs doesn’t fit well in this scenario
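Alerting reverses the usual search flow: the queries are registered up front and every new article is matched against them as soon as it arrives, which is why a batch pipeline is a poor fit. The Python sketch below illustrates that stored-query idea only; `AlertRegistry`, its methods and the subscriber names are hypothetical and do not reflect the API of any particular library.

```python
import re


class AlertRegistry:
    """Hypothetical stored-query matcher: documents are checked against saved queries."""

    def __init__(self):
        self.queries = {}  # subscriber -> set of lowercase terms that must all appear

    def register(self, subscriber, query):
        """Save a query; all of its terms are required for a match."""
        self.queries[subscriber] = set(re.findall(r"\w+", query.lower()))

    def match(self, text):
        """Return the sorted subscribers whose saved terms all occur in the text."""
        words = set(re.findall(r"\w+", text.lower()))
        return sorted(
            name for name, terms in self.queries.items() if terms <= words
        )


# Subscribers and headlines are invented for the example.
registry = AlertRegistry()
registry.register("ana", "hurricane havana")
registry.register("luis", "baseball")
print(registry.match("Hurricane warning issued for Havana."))  # ['ana']
```

Each match would then trigger the notification channel of choice (email, push, SMS).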

Our topology

[Diagram: pipeline stages RSS, fetch, parse and index for http://news-site.com, with flaxsearch/luwak as the monitor]

The RSS stage parses the RSS feed and outputs the news links to be processed according to the SC protocol.
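The first stage, turning an RSS feed into a list of article links, can be sketched with the Python standard library. This is an illustration under assumptions: the function name `extract_links` and the sample feed are invented, the feed is assumed to be plain RSS 2.0, and the actual topology runs inside a streaming crawler rather than a standalone script.

```python
import xml.etree.ElementTree as ET


def extract_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    return [
        link.text.strip()
        for link in root.iterfind("./channel/item/link")
        if link.text
    ]


# Sample feed invented for the example; the host comes from the slide.
feed = """<rss version="2.0">
  <channel>
    <title>News site</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/first</link></item>
    <item><title>Second story</title><link> http://news-site.com/second </link></item>
  </channel>
</rss>"""

print(extract_links(feed))
```

Only item links are returned; the channel-level link points at the site itself and is skipped on purpose.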

https://github.com/commoncrawl/news-crawl

Querying the data

1 Web search: HTML & documents (PDF, DOC)
• highlighting
• suggestions
• filters (facets)
• autocorrection

2 Image search (size, format, color, objects)
• thumbnails
• show metadata
• filters (facets)
• match text with images

3 News search (alerting, notifications)
• near real time
• email, push, SMS

Apache Solr

• Solr has full support for highlighting (3 implementations)

• powerful faceting capabilities (even more in recent releases)

• autocorrection support based on the index content

• awesome scalability (SolrCloud, classic master-slave replication)
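Highlighting, faceting and spellchecking are all switched on through request parameters of a Solr search handler. The Python sketch below only builds such a request URL; the core name `web` and the field names `content` and `type` are assumptions for illustration, and the `spellcheck` parameter only has an effect when a spellcheck component is configured on the handler.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text):
    """Build a Solr /select URL asking for highlighting, facets and spellcheck."""
    params = [
        ("q", text),
        ("hl", "true"),            # enable highlighting
        ("hl.fl", "content"),      # field to highlight (assumed field name)
        ("facet", "true"),         # enable faceting
        ("facet.field", "type"),   # field to facet on (assumed field name)
        ("spellcheck", "true"),    # needs a configured spellcheck component
        ("wt", "json"),            # response format
    ]
    return f"{base_url}/select?{urlencode(params)}"


# The host and the core name "web" are hypothetical.
url = build_search_url("http://localhost:8983/solr/web", "la habana")
print(url)
```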

The features, once again

1 Web search: HTML & documents (PDF, DOC)
• highlighting
• suggestions
• filters (facets)
• autocorrection

2 Image search (size, format, color, objects)
• thumbnails
• show metadata
• filters (facets)
• match text with images

3 News search (alerting, notifications)
• near real time
• email, push, SMS

Other features - monitoring

We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so …

• (and metrics) time series store
• (and logs) analytical tool (and facets)

Other features - monitoring

• (and logs) parsing & aggregation
• (and metrics) time series store
• (and logs) analytical tool (and facets)
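The parsing and aggregation step can be pictured as turning raw log lines into per-minute counts (a small time series) plus per-level totals (facet-like counts), which are far cheaper to store than the raw logs. The Python sketch below assumes a made-up log format, `YYYY-MM-DD HH:MM:SS LEVEL message`; the pattern, the function name `aggregate` and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Assumed log format: "2016-11-16 10:15:02 INFO message ..."
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (\w+) ")


def aggregate(lines):
    """Count log events per (minute, level) and per level, skipping unparsable lines."""
    per_minute = Counter()
    per_level = Counter()
    for line in lines:
        found = LINE.match(line)
        if not found:
            continue
        minute, level = found.groups()
        per_minute[(minute, level)] += 1
        per_level[level] += 1
    return per_minute, per_level


# Sample lines invented for the example.
sample = [
    "2016-11-16 10:15:02 INFO fetching http://news-site.com/a",
    "2016-11-16 10:15:40 ERROR fetch failed",
    "2016-11-16 10:16:05 INFO parsing",
    "not a log line",
]
per_minute, per_level = aggregate(sample)
print(per_level)
```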

Banana (Kibana port) for visualizations

Infrastructure

[Diagram: components WEB, Solr replicator, Solr master and Nutch crawlers; links labeled HTTP and JAVABIN]

Some usage stats

• fewer than 10 000 visits
• around 600 unique visitors

Future work

• Apply deep learning techniques to process the raw images and combine the results with the current approach

• Increase the number of signals that we get from our crawlers (correlate even more crawl related events)

Thanks

Questions?

[email protected] • @jorgelbg