Decentralised Web Search
Web Search Engine Software For Everyone: we can remove the dependency on a large search engine provider
FOSS ASIA - Ho Chi Minh City, Vietnam 2010 Michael Christen Web Search Engine For Everyone http://yacy.net
Topics
• Search Engine Technology: how large-scale search engines are made available to everyone using peer-to-peer technology
• Demonstration: what you can do in just five minutes: installation, crawling, searching, monitoring, scheduling
• System Components and Development: details about search appliance components such as the scheduler, document parser, administration and visualisation; easy integration into a web page; APIs for external index queries and external index feeding
Search Engine Components
Retrieval, Indexing, Storage and Search Components
[Diagram: the Crawler begins at a Start-URL (Depth = 0) and follows links to Depth = 1 and Depth = 2. Links pass a Double Check and are placed on the Crawl Stack. Loaded documents go through parsing, Text Analysis (words, Stop words Check) and Indexing into the Reverse Word Index (Word URL References). The Search Interface covers filtering, ranking, verification and visualisation.]
YaCy has an integrated NoSQL database. The database stores a Reverse Word Index, metadata and the source documents.
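As a rough illustration of the Reverse Word Index named on this slide, the following Python sketch maps each word to the URLs that contain it and skips stop words. The stop-word list, the tokeniser and all function names are assumptions made for this example; it is not YaCy's implementation.

```python
import re
from collections import defaultdict

# Assumption: a tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "is"}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_reverse_word_index(documents):
    """Map each non-stop word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical example documents.
docs = {
    "http://example.org/a": "The search engine for everyone",
    "http://example.org/b": "A peer-to-peer search network",
}
index = build_reverse_word_index(docs)
print(sorted(search(index, "search engine")))
```

A lookup then only touches the entries for the query words, which is why the index is stored by word rather than by document.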
Large Search Cluster: Model
Efficient search engines are constructed using a matrix of many small search engines
[Diagram: Search Engine Cluster, a matrix of 3 x 5 search engines]
vertical scaling: more queries per second
horizontal scaling: more documents
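The two scaling directions can be pictured with a small Python sketch. It assumes that rows of the matrix are replicas of the whole index (more rows serve more queries per second) and that columns are partitions of the document set (more columns hold more documents). The class `SearchMatrix`, the hash-based column choice and the round-robin row choice are assumptions for illustration only.

```python
import zlib

class SearchMatrix:
    """Rows are replicas (query throughput); columns are partitions (document capacity)."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns
        # cells[r][c] maps a word to the set of URLs stored in that cell.
        self.cells = [[{} for _ in range(columns)] for _ in range(rows)]
        self._next_row = 0

    def column_for(self, url):
        # Stable hash, so a URL always lands in the same column.
        return zlib.crc32(url.encode("utf-8")) % self.columns

    def add_document(self, url, words):
        col = self.column_for(url)
        for row in range(self.rows):  # every replica row gets a copy
            for word in words:
                self.cells[row][col].setdefault(word, set()).add(url)

    def query(self, word):
        row = self._next_row  # round-robin over the replica rows
        self._next_row = (self._next_row + 1) % self.rows
        hits = set()
        for col in range(self.columns):  # ask every partition in that row
            hits |= self.cells[row][col].get(word, set())
        return row, hits

matrix = SearchMatrix(rows=3, columns=5)
matrix.add_document("http://example.org/a", ["search", "engine"])
matrix.add_document("http://example.org/b", ["search", "peer"])
print(matrix.query("search"))
```

Each query is answered by a single row, so consecutive queries can be served by different rows in parallel.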
Large Search Cluster in Data Center
Usually such search engine clusters are hosted by one organization in a data center
[Diagram: the matrix of search engines inside one data center]
Large Search Cluster: Decentralised
Imagine you could take the software outside the data center and connect the peers in a decentralised way
[Diagram: the same matrix of search engines, no longer enclosed in a data center]
Decentralised Search with YaCy 1/3
YaCy is a search engine appliance that can be used either in a data center or as a decentralised network of private peers
[Diagram: a matrix of 3 x 5 peers]
How can a search matrix be distributed? The peers are ordered using an ordering on peer hashes. The hash-ordering is closed at the end and the resulting network can be drawn as a circle...
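The idea of ordering peers by hash and closing the ordering into a circle can be sketched in a few lines of Python. The use of SHA-1 over a peer name and the function names are assumptions for this example; YaCy's actual peer hashes are computed differently.

```python
import hashlib
from bisect import bisect_left

def peer_hash(name):
    """Assumption: SHA-1 of the peer name stands in for a peer hash."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()

def build_ring(peer_names):
    """Sort the peers by hash; the sorted list is read as a closed circle."""
    return sorted((peer_hash(name), name) for name in peer_names)

def successor(ring, key_hash):
    """Return the first peer at or after key_hash, wrapping around at the end."""
    hashes = [h for h, _ in ring]
    pos = bisect_left(hashes, key_hash) % len(ring)
    return ring[pos][1]

ring = build_ring(["alice", "bob", "carol", "dave"])
for h, name in ring:
    print(h[:8], name)
# A key beyond the last peer hash wraps around to the first peer on the circle.
print(successor(ring, "f" * 40) == ring[0][1])
```

The modulo in `successor` is what "closes" the ordering: there is no last peer, only a next one.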
Decentralised Search with YaCy 2/3
A 'Folded' Search Matrix
[Diagram: peers arranged on a circle, with DHT-Store and DHT-Read connections between them]
This peer (as an example) fetches some Web pages and distributes index fragments to other peers.
A peer which searches information can directly access the peers holding the corresponding index.
YaCy peers store index fragments according to a 'folded' ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly so that it is preserved when some peers are not available. The redundancy also helps to increase search performance.
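A minimal Python sketch of redundant storage in a DHT follows. It assumes that each word's fragment is written to a fixed number of consecutive peers on the hash circle and that a read succeeds as long as one of them is online. The 'folded' ordering itself is not modelled, and `REDUNDANCY`, the SHA-1 hashing and the function names are assumptions for illustration.

```python
import hashlib
from bisect import bisect_left

REDUNDANCY = 3  # assumption: number of peers that hold each index fragment

def h(text):
    """Assumption: SHA-1 stands in for word hashes and peer hashes."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def target_peers(peers, word, redundancy=REDUNDANCY):
    """Pick `redundancy` consecutive peers on the circle, starting at the word hash."""
    ring = sorted(peers, key=h)
    hashes = [h(p) for p in ring]
    start = bisect_left(hashes, h(word))
    count = min(redundancy, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]

def dht_store(storage, peers, word, url):
    """Write the (word, url) reference to every target peer."""
    for peer in target_peers(peers, word):
        storage.setdefault(peer, {}).setdefault(word, set()).add(url)

def dht_read(storage, peers, word, online):
    """Read from the first target peer that is currently online."""
    for peer in target_peers(peers, word):
        if peer in online:
            return storage.get(peer, {}).get(word, set())
    return set()

peers = [f"peer{i}" for i in range(10)]
storage = {}
dht_store(storage, peers, "yacy", "http://yacy.net")
holders = target_peers(peers, "yacy")
# The fragment stays readable even when the first holder is offline.
print(dht_read(storage, peers, "yacy", online=set(peers) - {holders[0]}))
```

Because several peers hold the same fragment, a searching peer can also choose among them, which is one way redundancy can help search performance.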
Decentralised Search with YaCy 3/3
The 'default' YaCy Search Engine Network
Peer Types:
• Junior: behind firewall or router
• Senior: has open server port
• Principal: publishes seed-lists
[Diagram legend: DHT-Store, DHT-Read]
YaCy Search Cluster in a Data Center
'Sciencenet': search engine for scientific content at the Karlsruhe Institute of Technology: 34 computers running YaCy in its own network, 300 million documents
http://sciencenet.fzk.de
Decentralised Search for Everyone
Search Engine @Home > 1 Billion Documents
People run their own YaCy search peer at home and create an independent search for everyone
Benefits
Impact of running your own search engine:
• become independent from large search engine operators
• keep company secrets: search tracks can reveal industrial research targets
• your personal relevance: you can create a ranking method for your personal needs
• same rights for all people: everyone can run a search engine
Demo: Users
linuxtag.org
linux-club.de
geoclub.de
fsfe.org
Use Cases
• Decentralised Peer-to-Peer Web Search: search engines for everyone
• High-Performance Search Clusters: generic search portals for any need
• Internet Search Portal for a project: combining wikis, blogs, forums and portal pages
• Alert-Service for News using RSS: create a News-Feed using recent search results for a specific topic
• Intranet Search Appliance: search in local web servers and file shares
Search Interface
The API for search results is RSS (Opensearch) and JSON
Facets: Domains, Authors
Every link is verified before it is displayed: the content is loaded, parsed and used to generate a search snippet
Standards: Opensearch (search results with RSS), SRU
APIs: JSON, AJAX tools
Tools: search widget, ready-to-use code snippets to embed search everywhere
Document Retrieval and Parser
A search engine should support people in the search for documents in unstructured formats: this needs a kind of 'understanding' of content
Connection
• load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB-shares
• import from: Dublin Core / XML files, OAI-PMH, wikimedia dumps, SQL databases
Parsing
• read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files
Interpretation
• find metadata (headline, author, date, locations)
• find links of different kind (text, images, movies etc.)
• store statistical data for search suggestions
Search Result Ranking
A prototype discussion about ranking:
"Do you use the same ranking as G**gle?" "No. PR is difficult and sometimes useless (e.g. in intranets)."
"Then you cannot be better?" "We have many ranking criteria and users can mix them."
"That's what Lucene has. But is this better?" "What is 'better'? G**gle defines 'better' as: 'most people like it'."
"I have an idea: ...." Suddenly people think about their personal relevance requirements.....
"Then what is the best ranking? Similar to G**gle PR?" "Do experiments!"
"Every peer?" "When doing a remote search, the remote peer uses your own ranking too!"
In YaCy, you can combine many weighted attributes.
If you run your own search engine, then you may need your own ranking. Different contents may need different rankings.
Parts of a Search Appliance
• Search Engine: retrieval, indexing, storage and search components
• Data Visualisation: index creation process, system load, link structure, p2p net configuration
• Scheduler and Steering: automatic scheduled re-indexing and crawl scheduler
• Database Administration: queues, robots.txt, rss feeds, back-up of search appliance set-up data, p2p connections, network messages
Search Interface Integration
How to integrate a
Code Snippet Example #1: a search window in an iframe
Code Snippet Example #2: a search box (points to a new page)
[Code Snippet #2 is shown as an image on the slide]
Your YaCy peer provides help pages with code snippets for easy integration!
External Index Retrieval
> curl "http://localhost:8080/yacysearch.rss?query=foaf&maximumRecords=10"
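The same query can be issued from a program. The Python sketch below builds the URL from the curl example and extracts title and link from RSS items. The helper names and the sample RSS text are assumptions for this example (the sample is hand-written, not real YaCy output), and `fetch` is defined but not called so the sketch runs without a peer.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def build_search_url(query, maximum_records=10,
                     base="http://localhost:8080/yacysearch.rss"):
    """Build the search URL with properly encoded parameters."""
    params = urllib.parse.urlencode(
        {"query": query, "maximumRecords": maximum_records})
    return f"{base}?{params}"

def fetch(url):
    """Load the RSS text from a running peer (not called in this sketch)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def parse_rss_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

# Assumption: a minimal hand-written RSS document, for illustration only.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Search for foaf</title>
  <item><title>FOAF project</title><link>http://www.foaf-project.org/</link></item>
</channel></rss>"""

print(build_search_url("foaf"))
print(parse_rss_items(SAMPLE_RSS))
```

With a peer running locally, `parse_rss_items(fetch(build_search_url("foaf")))` would replace the sample text.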
External Index Feeding
Standards: YaCy can import standard Dublin Core Metadata
How to import Dublin Core files: just place the XML files into the hand-over directory at DATA/SURROGATES/in/
The Dublin Core XML File Standard: http://dublincore.org/documents/dc-xml-guidelines/
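A feeding component could write such files with a short script. The Python sketch below creates an XML file with Dublin Core elements in the `http://purl.org/dc/elements/1.1/` namespace. The container elements `metadata` and `record`, the file name and the helper names are assumptions for this example; the exact wrapper that YaCy expects should be checked against its documentation. The example writes to a temporary directory instead of DATA/SURROGATES/in/.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build one record whose children are dc: elements such as dc:title."""
    record = ET.Element("record")  # assumption: container element name
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

def write_records(records, directory, filename="example.xml"):
    """Write all records into one XML file and return its path."""
    root = ET.Element("metadata")  # assumption: root element name
    root.extend(records)
    path = os.path.join(directory, filename)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path

record = dublin_core_record({
    "title": "YaCy - Web Search Engine for everyone",
    "creator": "Michael Christen",
    "identifier": "http://yacy.net",
})

# In a real set-up the target would be the hand-over directory of the peer.
with tempfile.TemporaryDirectory() as hand_over_dir:
    path = write_records([record], hand_over_dir)
    with open(path, encoding="utf-8") as f:
        print(f.read())
```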
Installation
License: GPL (Free Software)
• Download from http://yacy.net
  YaCy for Windows, YaCy for Mac, YaCy for Debian, YaCy for Linux / generic (tar.gz)
• Just Extract the Package, then Start the Start-Script
  There are simple installers for Windows, Mac and a Debian release, but it is easy to just install the generic release because it contains everything that is needed.
• Administration using the Web Interface
  YaCy is a Web Application. The administration can be done completely using the built-in web interface with your web browser. Just open http://localhost:8080
  The main configuration is done when you select your use case (Distributed P2P Web Search, Portal Search, Intranet Search) after just two clicks.
• Support
  We have a web forum: http://forum.yacy.de
  Some information can be found at the wiki: http://wiki.yacy.de
  ...or contact me: mc@yacy.net
What you can do
• learn about search engine technology and teach other people
• create your own search portal
• be creative! -- we listen to your ideas
• help -- make a translation of the administration interface!