Web Search By The People For The People

a decentralised search engine: the missing link between free content and free users to build a free web

FSCONS 2010 Michael Christen Web Search By The People, For The People http://yacy.net Topics Vision Demo Tech Dev

•Vision: YaCy as the missing link between free content and the user; for privacy and independence, against censorship, and for better search results

•Demonstration: what you can do in just five minutes: installation, crawling, searching, monitoring, scheduling

•Technology: Search engine technology, crawling the web, understanding documents, ranking, peer-to-peer architecture of YaCy and privacy protection

•Development: APIs that you can use with existing tools, or easily with a little coding

A Non-Free World Wide Web

There is a missing link in the web between free content and the user because free content needs a decentralised free search technology

Data: free content under Creative Commons License; Open Access Repositories

Search: PROPRIETARY & CENTRALISED: it traces you, and data can be censored, blocked, removed, spammed

User: as it is today, the user needs proprietary & centralised software to discover free content. Is this what we want?

A Free World Wide Web 1/2

The World Wide Web should be a many-to-many medium: a receiver can be a sender and vice versa

[Diagram: two User/Data nodes, each a Producer and User running a Server with Free Software and Free Content.]

A Free World Wide Web 2/2

In a free world wide web the users must run search engines

[Diagram: two User/Data nodes, each a Producer and User running a Server with Free Software and Free Content, and each now also running Search. Free content is now linked with decentralised search.]

The Web: Sender & Receiver

The Missing Link 1/2

The missing link between free software, free content and the user is a network of YaCy Peers, or something similar

[Diagram: two User/Data nodes, each with a Peer. Each YaCy Peer Owner runs a Server with Free Software, Free Content and Free Search.]

The Web: Sender & Receiver

The Missing Link 2/2

A Free World Wide Web

Benefits:

•no global censoring, because it's decentralised
•you cannot be traced: you run the search portal
•same rights for everyone: everyone can contribute equally; this is the wiki principle for search engines
•get out of control of data evaluation and lawful interception
•your personal relevance: create a ranking method

[Diagram: many Peers and User/Data nodes, labelled "1 : 1000 (?)". Free Content, Free Search and Other Web User/Producer are all part of The Web: Sender & Receiver.]

Demo: Use Cases

•Decentralised Peer-to-Peer Web Search: free search for everyone

•Internet Search Portal for a project: combining wikis, blogs, forums and portal pages

•Alert-Service for News using RSS: create a News-Feed using recent search results for a specific topic

•Intranet Search Appliance: search in local web servers and file shares

Demo: Search Interface

The API for search results offers RSS (Opensearch), JSON and SRU

Facets: Domains, Authors

Every link is verified before it is displayed: the content is loaded, parsed and used to generate the search snippet

APIs and Standards: Opensearch (search results with RSS), JSON, AJAX tools

Tools: search widget, ready-to-use code snippets to embed search everywhere
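As an illustration of using the RSS search API from a script, here is a minimal Python sketch. The endpoint path `/yacysearch.rss`, the `query` parameter and the peer address are assumptions that should be checked against your own peer; the sample feed is invented so that the parser can be tried without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_search_url(peer, query, fmt="rss"):
    """Build a search URL for a peer.

    ASSUMPTION: the path "/yacysearch.<fmt>" and the parameter name "query"
    are illustrative; verify them against the peer you are using.
    """
    return f"{peer}/yacysearch.{fmt}?{urlencode({'query': query})}"


def parse_rss_items(rss_text):
    """Extract title and link from every <item> of an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


# Invented sample feed, standing in for a real search response.
SAMPLE_RSS = """<rss version="2.0"><channel><title>search</title>
<item><title>YaCy</title><link>http://yacy.net</link></item>
<item><title>FSFE</title><link>http://fsfe.org</link></item>
</channel></rss>"""

if __name__ == "__main__":
    print(build_search_url("http://localhost:8090", "free software"))
    for hit in parse_rss_items(SAMPLE_RSS):
        print(hit["title"], hit["link"])
```

Against a live peer, the text fetched from the built URL would be passed to `parse_rss_items` in place of `SAMPLE_RSS`.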

Demo: Users

linuxtag.org

-club.de

geoclub.de

fsfe.org

Search Engine Components

Retrieval, Indexing, Storage and Search Components

[Diagram: the Crawler begins at the Start-URL (Depth = 0) and follows links to Depth = 1 and Depth = 2. Documents pass through parsing, filtering, Text Analysis and Indexing. Links go through a Double Link Check onto the Crawl Stack (URL); words go through a Stop words Check into the Reverse Word Index (Word, URL References). The Search Interface provides ranking, verification and visualisation, and connects to the YaCy Peer-to-Peer Network.]

YaCy has an integrated NoSQL Database. The database stores a Reverse Word Index, Metadata and the source documents.
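The Reverse Word Index with its stop word check can be sketched in a few lines of Python. This is a toy in-memory version under stated assumptions: the stop word list, the tokenizer and all function names are made up for the example and do not describe YaCy's actual storage format.

```python
import re
from collections import defaultdict

# ASSUMPTION: a tiny illustrative stop word list.
STOP_WORDS = {"the", "a", "and", "of", "is", "for"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_reverse_index(documents):
    """Map every non-stop word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        "http://a.example/": "Free search for the people",
        "http://b.example/": "Free content and free software",
    }
    idx = build_reverse_index(docs)
    print(search(idx, "free search"))
```

A query is answered by intersecting the URL sets of its words, which is why the index is stored "in reverse", from word to documents.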

The Crawler

Crawler Administration; Document Tree

The start URL: a simple ‘Site Crawl’; there is also a detailed crawl start for ‘wide’ crawls

follow all links until:

•no more documents in domain
•crawl depth is reached
•maximum number of docs reached
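The three stop conditions can be sketched as a breadth-first crawl. In this Python sketch the web is replaced by a plain dictionary from URL to outgoing links, so nothing is downloaded; the function name and its parameters are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def site_crawl(start_url, links, max_depth, max_docs):
    """Breadth-first crawl restricted to the domain of start_url.

    `links` maps a URL to the list of URLs it links to and stands in
    for loading and parsing real documents (an assumption of this sketch).
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    # Stop when the queue runs dry (no more documents in the domain)
    # or when the maximum number of documents is reached.
    while queue and len(crawled) < max_docs:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # crawl depth reached: do not follow links further
        for target in links.get(url, []):
            if target in seen or urlparse(target).netloc != domain:
                continue
            seen.add(target)
            queue.append((target, depth + 1))
    return crawled


if __name__ == "__main__":
    web = {
        "http://a.example/": ["http://a.example/1", "http://b.example/x"],
        "http://a.example/1": ["http://a.example/2"],
        "http://a.example/2": ["http://a.example/3"],
    }
    print(site_crawl("http://a.example/", web, max_depth=1, max_docs=10))
```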

All crawl starts are placed into a scheduler. Use Target Host Balancing for several crawl starts or ‘wide’ crawls.

[Diagram: target hosts (domain names) are accessed round-robin, respecting robots.txt, latency and a minimum access time of 0.5s; the loader (FTP, SMB) hands documents to the parser.]
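Round-robin access with a minimum access time per host could look like the following Python sketch. The class and method names are hypothetical, time is passed in explicitly so that the behaviour is easy to follow, and robots.txt and latency handling are left out.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse

MIN_ACCESS_DELAY = 0.5  # seconds between two accesses to the same host


class HostBalancer:
    """Hands out queued URLs host by host, respecting a per-host delay."""

    def __init__(self, min_delay=MIN_ACCESS_DELAY):
        self.min_delay = min_delay
        self.queues = OrderedDict()  # host -> deque of waiting URLs
        self.last_access = {}        # host -> time of the last access

    def push(self, url):
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)

    def pop(self, now):
        """Return the next URL whose host may be accessed at `now`, else None."""
        for host in list(self.queues):
            last = self.last_access.get(host, float("-inf"))
            if now - last >= self.min_delay:
                queue = self.queues.pop(host)
                url = queue.popleft()
                self.last_access[host] = now
                if queue:
                    # Re-append the host at the end: round-robin order.
                    self.queues[host] = queue
                return url
        return None


if __name__ == "__main__":
    balancer = HostBalancer()
    for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
        balancer.push(u)
    for t in (0.0, 0.1, 0.2, 0.5):
        print(t, balancer.pop(t))
```

A real loader would call `pop(time.monotonic())` and sleep briefly whenever it returns `None`.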

Understanding Documents

A search engine should support people in the search for documents in unstructured formats: this needs a kind of ‘understanding’ of content

Connection

load and crawl from: HTTP, HTTPS, FTP, filesystem, SMB-shares

Import from: Dublin Core / XML files, OAI-PMH, wikimedia dumps, SQL databases

Parsing

read document formats: HTML, XHTML, RSS, RDF, XHTML+RDFa, FOAF, vCard, Flash, PDF, PS, Word, Excel, Visio, Powerpoint, OpenOffice, RTF, csv, gzip, zip, tar, rar, bzip2, 7zip, images (EXIF), torrent files

Interpretation

find metadata (headline, author, date, locations)

find links of different kind (text, images, movies etc.)

store statistical data for search suggestions

Search Result Ranking

a prototype discussion about ranking

•do you use the same ranking as G**gle?
•no. PR is difficult and sometimes useless (e.g. in intranets)
•then you cannot be better?
•we have many ranking criteria and users can mix them.
•that’s what lucene has
•but is this better? what is ‘better’?
•G**gle defines ‘better’ as: ‘most people like it’
•I have an idea: ....
•similar to G**gle PR
•suddenly people think about their personal relevance requirements.....
•then what is the best ranking?
•do experiments!
•If you run your own search engine, then you may need your own ranking. Different contents may need different rankings.
•every peer?
•when doing a remote search, the remote peer uses your own ranking too!
•in YaCy, you can combine many weighted attributes

Parts of a Search Appliance

Search Engine: retrieval, indexing, storage and search components

Data Visualisation: index creation process, system load, link structure, p2p net configuration

Scheduler and Steering: automatic scheduled re-indexing and back-up of search appliance set-up

Database Administration: crawl queues, robots.txt, rss feeds, scheduler data, p2p connections, network messages

The Peer-to-Peer Network 1/2

The YaCy Network: a distributed hash table

[Diagram: a ring of Peers.]

DHT-Store: This peer (as an example) fetches some Web pages and distributes index fragments to other peers.

DHT-Read: A peer which searches information can directly access the peers holding the corresponding index.

YaCy peers store index fragments according to a ‘folded’ ordering on word-hashes and url-hashes in a distributed hash table (DHT). The index is distributed redundantly to save the index when some peers are not available. The redundancy also helps to increase search performance.
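To make the idea of redundant DHT placement concrete, here is a Python sketch that uses plain consistent hashing on a ring. It is not YaCy's ‘folded’ ordering: the hash function, the redundancy of 3 and the peer names are assumptions chosen only for the example.

```python
import hashlib
from bisect import bisect_left


def ring_position(key):
    """Place a word or a peer name on a 64-bit ring (SHA-256 is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def responsible_peers(word, peers, redundancy=3):
    """Return the peers that should store the index fragment for `word`.

    The first peer is the one at or after the word's ring position; the
    following peers on the ring hold the redundant copies.
    """
    ring = sorted(peers, key=ring_position)
    positions = [ring_position(p) for p in ring]
    start = bisect_left(positions, ring_position(word)) % len(ring)
    count = min(redundancy, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    peers = ["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"]
    for word in ("free", "search"):
        print(word, responsible_peers(word, peers))
```

Storing and reading use the same function: the peer that indexed a page sends the fragment to the responsible peers, and a searching peer asks any of them, so one unavailable peer does not lose the word.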

The Peer-to-Peer Network 2/2

The default YaCy Network: „freeworld“

Peer Types:

•Junior: behind firewall or router
•Senior: has open server port
•Principal: publishes seed-lists

[Diagram labels: DHT-Store, DHT-Read]

Privacy

The Architecture of YaCy Ensures Privacy

Nobody can see what you added to the global search index
- YaCy does not store words in clear text but only as word-hashes
- your search index is mixed with indexes from other peers during DHT transfer
- the origin of DHT transfers is not stored into the search index

Nobody can see what you search
- if you use your own YaCy application, you are the only one who can track what you do
- tracking against misuse (DoS etc.) is built in, but
- remote search tracking cannot see the remote user's search words, only word hashes.
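The word-hash principle can be shown with a short Python sketch. The hash function (SHA-256), the 12-character hash length and all names are assumptions for the example and do not describe YaCy's real hash format.

```python
import base64
import hashlib


def word_hash(word):
    """Return a short, fixed-length hash of a word.

    ASSUMPTION: SHA-256 and a 12-character URL-safe prefix are
    illustrative choices only.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:12]


def index_document(index, url, words):
    """Store only word hashes, never the clear-text words."""
    for word in words:
        index.setdefault(word_hash(word), set()).add(url)


def remote_search(index, query_hash):
    """What a remote peer sees: a hash to look up, not the search word."""
    return index.get(query_hash, set())


if __name__ == "__main__":
    index = {}
    index_document(index, "http://yacy.net", ["decentralised", "search"])
    # The searching peer hashes the word locally and sends only the hash.
    print(remote_search(index, word_hash("Search")))
    print(sorted(index))  # the stored keys contain no clear-text words
```

A hash hides the word from casual inspection, although common words could still be guessed by hashing a dictionary.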

Search Interface Integration

Code Snippet Example #1: a search window in an iframe How to integrate a