Decentralised Web Search

Web Software For Everyone: we can remove the dependency on large search engine providers

FOSS ASIA - Ho Chi Minh City, Vietnam 2010 · Michael Christen · Web Search Engine For Everyone · http://yacy.net

Topics

• Search Engine Technology: how large-scale search engines are made available to everyone using peer-to-peer technology

• Demonstration: what you can do in just five minutes (installation, crawling, searching, monitoring, scheduling)

• System Components and Development: details about search appliance components such as the scheduler, document parser, administration and visualization. Easy integration into a web page. APIs for external index queries and external index feeding components.

Search Engine Components

Retrieval, Indexing, Storage and Search Components

[Diagram: the crawler begins at a start URL (depth 0) and follows links to depth 1, depth 2 and so on. Each loaded document is parsed into links and words. Links pass a double-link check and go onto the URL crawl stack. Words pass a stop-word check and text analysis and are indexed into the reverse word index (word → URL references). The search interface provides filtering, ranking, verification and visualisation.]

YaCy has an integrated NoSQL database. The database stores a reverse word index, metadata and the source documents.
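The indexing path described above can be illustrated with a minimal sketch. It is not YaCy's implementation: the class name `ReverseWordIndex`, the tiny stop-word list and the tokenizer are assumptions made for illustration only. It shows a double check on URLs, a stop-word check on words, and a reverse word index that maps each word to the URLs containing it.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ReverseWordIndex:
    """Hypothetical in-memory reverse word index: word -> set of URLs."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.seen_urls = set()  # used for the double check on URLs

    def add_document(self, url, text):
        """Index a document once; return False if the URL was seen before."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        for word in tokenize(text):
            if word not in STOP_WORDS:
                self.postings[word].add(url)
        return True

    def search(self, query):
        """Return the URLs that contain every non-stop word of the query."""
        words = [w for w in tokenize(query) if w not in STOP_WORDS]
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

A query is answered by intersecting the URL sets of its words, which is why stop words are worth filtering out: they would appear in nearly every set.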

Large Search Cluster: Model

Efficient search engines are constructed using a matrix of many small search engines

[Diagram: Search Engine Cluster, a 3 × 5 matrix of search engines]

Vertical scaling: more queries per second
Horizontal scaling: more documents
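One way to read the matrix model is sketched below, under an assumption that is not stated on the slide: each column holds a different partition of the documents, and each row is a full replica that can answer queries on its own. The class `SearchMatrix` and the SHA-256 based partitioning are hypothetical and only illustrate why adding columns adds document capacity while adding rows adds query capacity.

```python
import hashlib


class SearchMatrix:
    """Hypothetical rows x columns matrix of small search engines."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        # cells[r][c] holds the document ids stored by one small engine.
        self.cells = [[set() for _ in range(columns)] for _ in range(rows)]
        self._next_row = 0

    def column_for(self, doc_id):
        """Pick the column (document partition) for a document id."""
        digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.columns

    def add_document(self, doc_id):
        """Store the document in one column, replicated across all rows."""
        column = self.column_for(doc_id)
        for row in range(self.rows):
            self.cells[row][column].add(doc_id)
        return column

    def pick_row(self):
        """Round-robin over rows so that queries are spread across replicas."""
        row = self._next_row
        self._next_row = (row + 1) % self.rows
        return row

    def query_all(self):
        """Ask every column of one row; return the row used and all documents."""
        row = self.pick_row()
        documents = set()
        for column in range(self.columns):
            documents |= self.cells[row][column]
        return row, documents
```

With three rows and five columns, each query touches five engines of a single row, and three queries can be served in parallel by different rows.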

Large Search Cluster in Data Center

Usually such search engine clusters are hosted by one organization in a data center

[Diagram: a 3 × 5 matrix of search engines, hosted inside one data center]

Large Search Cluster: Decentralised

Imagine you could take the software out of the data center and connect the peers in a decentralised way

[Diagram: the same 3 × 5 matrix of search engines, now outside the data center]

Decentralised Search with YaCy 1/3

YaCy is a search engine appliance that can be used either in a data center or as a decentralised network of private peers

[Diagram: a 3 × 5 matrix of peers]

How can a search matrix be distributed? The peers are ordered using an ordering on peer hashes. The hash-ordering is closed at the end and the resulting network can be drawn as a circle...
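The ring idea can be sketched in a few lines. This is an illustration under stated assumptions, not YaCy's code: SHA-256 stands in for the peer hash, and the names `ring_position` and `PeerRing` are hypothetical. Peers are sorted by hash, and the ordering is closed by wrapping from the last peer back to the first.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a peer name to a position on the ring (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PeerRing:
    """Hypothetical ring of peers ordered by their hash."""

    def __init__(self, peer_names):
        self.entries = sorted((ring_position(n), n) for n in peer_names)
        self.positions = [position for position, _ in self.entries]

    def successor(self, position):
        """First peer at or after the position, wrapping around at the end."""
        i = bisect.bisect_left(self.positions, position)
        return self.entries[i % len(self.entries)][1]

    def neighbours(self, name):
        """Return the (previous, next) peers of a peer on the closed ring."""
        i = self.positions.index(ring_position(name))
        n = len(self.entries)
        return self.entries[(i - 1) % n][1], self.entries[(i + 1) % n][1]
```

The modulo in `successor` and `neighbours` is what closes the ordering: the peer with the highest hash has the peer with the lowest hash as its next neighbour.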

Decentralised Search with YaCy 2/3

A 'Folded' Search Matrix

[Diagram: peers arranged in a circle, connected by DHT-Store and DHT-Read links]

DHT-Store: this peer (as an example) fetches some web pages and distributes the corresponding index fragments to other peers.

DHT-Read: a peer which searches for information can directly access the peers holding the index fragments.

YaCy peers store index fragments according to a 'folded' ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly to preserve it when some peers are not available. The redundancy also helps to increase search performance.
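Redundant storage in a DHT can be sketched as follows. The sketch makes several assumptions that are not taken from the slides: SHA-256 replaces the real word and peer hashes, the redundancy factor of 3 is arbitrary, and the 'folded' ordering is not modelled at all. The function names are hypothetical. Each word is stored on several consecutive peers of the ring, so a read still succeeds when some of them are offline.

```python
import bisect
import hashlib


def dht_position(key):
    """Position of a word or peer name on the ring (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def responsible_peers(word, peers, redundancy=3):
    """Return the peers that hold the index fragment for a word."""
    ring = sorted(peers, key=dht_position)
    positions = [dht_position(p) for p in ring]
    start = bisect.bisect_left(positions, dht_position(word))
    count = min(redundancy, len(ring))
    return [ring[(start + k) % len(ring)] for k in range(count)]


def dht_store(stores, word, url, peers, redundancy=3):
    """DHT-Store: write the word -> URL reference to every responsible peer."""
    for peer in responsible_peers(word, peers, redundancy):
        stores.setdefault(peer, {}).setdefault(word, set()).add(url)


def dht_read(stores, word, peers, online, redundancy=3):
    """DHT-Read: ask the responsible peers in order, skipping offline ones."""
    for peer in responsible_peers(word, peers, redundancy):
        if peer in online:
            return stores.get(peer, {}).get(word, set())
    return set()
```

In this sketch `stores` is a plain dictionary standing in for the peers' local databases; in a real network each store and read would be a remote call.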

Decentralised Search with YaCy 3/3

The 'default' YaCy Search Engine Network

Peer types:
• Junior: behind firewall or router
• Senior: has open server port
• Principal: publishes seed-lists

Connection types: DHT-Store, DHT-Read

YaCy Search Cluster in a Data Center

'Sciencenet' (http://sciencenet.fzk.de): search engine for scientific content at the Karlsruhe Institute of Technology
• 300 million documents
• 34 computers running YaCy in their own network

Decentralised Search for Everyone

Search Engine @Home > 1 Billion Documents

People run their own YaCy search peer at home and create independent search for everyone

Benefits

Impact of running your own search engine:

• become independent from large search engine operators
• keep company secrets: search tracks can reveal industrial research targets
• your personal relevance: you can create a ranking method for your personal needs
• same rights for all people: everyone can run a search engine
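The 'personal relevance' point can be made concrete with a small sketch of a ranking built from weighted attributes that each user can tune. The attribute names (`term_frequency`, `freshness`, `in_title`), the weights and the documents are all hypothetical; this is not YaCy's ranking code.

```python
def score(document, weights):
    """Weighted sum of a document's ranking attributes."""
    return sum(
        weights.get(name, 0.0) * value
        for name, value in document["attributes"].items()
    )


def rank(documents, weights):
    """Order documents by descending score under the given weights."""
    return sorted(documents, key=lambda d: score(d, weights), reverse=True)


# Two made-up documents with made-up attribute values in the range 0..1.
DOCUMENTS = [
    {"url": "http://example.org/manual",
     "attributes": {"term_frequency": 0.9, "freshness": 0.1, "in_title": 0.0}},
    {"url": "http://example.org/news",
     "attributes": {"term_frequency": 0.3, "freshness": 0.9, "in_title": 1.0}},
]

# Two personal ranking profiles: one favours term frequency, one favours news.
PROFILE_RESEARCH = {"term_frequency": 1.0}
PROFILE_NEWS = {"freshness": 1.0, "in_title": 0.5}
```

Calling `rank(DOCUMENTS, PROFILE_RESEARCH)` puts the manual first, while `rank(DOCUMENTS, PROFILE_NEWS)` puts the news page first: the same index yields different orderings for different needs.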

Demo: Users

linuxtag.org

-club.de

geoclub.de

fsfe.org

Use Cases

• Decentralised Peer-to-Peer Web Search: search engines for everyone

• High-Performance Search Clusters: generic search portals for any need

• Internet Search Portal for a project: combining wikis, blogs, forums and portal pages

• Alert-Service for News using RSS: create a news feed using recent search results for a specific topic

• Intranet Search Appliance: search in local web servers and file shares

Search Interface

APIs for search results: RSS (OpenSearch), JSON and SRU

Facets: Domains, Authors

Every link is verified before it is displayed: the content is loaded, parsed and used to generate a search snippet.
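The verification step can be sketched as follows. The functions `make_snippet` and `verify_results` are hypothetical, and the `load` callback is an assumption that stands in for the real loader and parser: it returns the plain text of a page, or `None` when the page cannot be loaded. A result is only kept if its content can be loaded and still contains the query.

```python
def make_snippet(text, query, width=40):
    """Return the text around the first occurrence of the query, or None."""
    position = text.lower().find(query.lower())
    if position < 0:
        return None
    start = max(0, position - width)
    end = min(len(text), position + len(query) + width)
    return text[start:end].strip()


def verify_results(urls, query, load):
    """Keep only results whose loaded content yields a snippet for the query."""
    verified = []
    for url in urls:
        text = load(url)  # assumed to return None if the page is unavailable
        if text is None:
            continue
        snippet = make_snippet(text, query)
        if snippet is not None:
            verified.append((url, snippet))
    return verified
```

In this sketch a dead link or a page that no longer mentions the query is dropped before display, at the cost of one page load per candidate result.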

• APIs and standards: OpenSearch (search results with RSS), JSON, AJAX tools
• Tools: search widget, ready-to-use code snippets to embed search everywhere
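Consuming search results delivered as RSS with OpenSearch elements could look like the sketch below. The sample feed is made up for illustration and is not real YaCy output, and `parse_results` is a hypothetical helper. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 response elements.
NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

# Made-up sample response (illustration only).
SAMPLE_RSS = """<rss version="2.0"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Search results for: yacy</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item>
      <title>YaCy</title>
      <link>http://yacy.net</link>
      <description>Web search engine for everyone</description>
    </item>
    <item>
      <title>Example page</title>
      <link>http://example.org/search</link>
      <description>Another result</description>
    </item>
  </channel>
</rss>"""


def parse_results(rss_text):
    """Return (total result count, list of result dicts) from an RSS feed."""
    channel = ET.fromstring(rss_text).find("channel")
    total = int(channel.findtext(
        "opensearch:totalResults", default="0", namespaces=NS))
    items = [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in channel.findall("item")
    ]
    return total, items
```

Because the results are plain RSS items, the same parsing also serves the news alert use case: a feed reader can subscribe to a search query.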

Document Retrieval and Parser

A search engine should support people in the search for documents in unstructured formats: this needs a kind of 'understanding' of content

Connection
• load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB-shares
• import from: Dublin Core / XML files, OAI-PMH, wikimedia dumps, SQL databases

Parsing
• read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files

Interpretation
• find metadata (headline, author, date, locations)
• find links of different kind (text, images, movies etc.)
• store statistical data for search suggestions
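For one of the listed formats, HTML, the parsing and interpretation steps can be sketched with the Python standard library. The class `PageInterpreter` and the function `interpret` are hypothetical names, and the sketch only extracts a headline (taken from the title element), text links and image links; it is not YaCy's parser.

```python
from html.parser import HTMLParser


class PageInterpreter(HTMLParser):
    """Collect a headline and links of different kinds from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def interpret(html_text):
    """Parse an HTML string and return its headline, links and images."""
    parser = PageInterpreter()
    parser.feed(html_text)
    parser.close()
    return {
        "headline": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }
```

The extracted links would feed the crawler, while the headline and other metadata would be stored alongside the index entry for the page.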

Search Result Ranking

A prototype discussion about ranking:

• "Do you use the same ranking as G**gle?"
• "No. PR is difficult and sometimes useless (e.g. in intranets)."
• "Then you cannot be better?"
• "We have many ranking criteria and users can mix them."
• "That's what Lucene has."
• "But is this better?"
• "What is 'better'? G**gle defines 'better' as: 'most people like it'."
• "I have an idea: ...."
• Suddenly people think about their personal relevance requirements.....
• "Then what is the best ranking?"
• "Similar to G**gle PR."
• "Do experiments!"
• "Every peer?"
• "When doing a remote search, the remote peer uses your own ranking too!"
• "In YaCy, you can combine many weighted attributes."

If you run your own search engine, then you may need your own ranking. Different contents may need different rankings.

Parts of a Search Appliance

• Search Engine: retrieval, indexing, storage and search components
• Data Visualisation: index creation process, system load, link structure, p2p net configuration
• Scheduler and Steering: automatic scheduled re-indexing and back-up of search appliance
• Database Administration: crawl queues, robots.txt, rss feeds, scheduler set-up data, p2p connections, network messages

Search Interface Integration

Code Snippet Example #1: a search window in an iframe How to integrate a