YaCy Grid & THE YACY GRID CONCEPT

MICHAEL CHRISTEN [email protected] @0rb1t3r

YaCy Grid Concept by [email protected] @0rb1t3r OBSERVATIONS..


PEOPLE'S EXPECTATION ABOUT SOLVING A PROBLEM

Star Trek IV (1986): talk to a computer and it will solve your problem

AI: DIGITAL PERSONAL ASSISTANTS IN VARIOUS SHAPES

GOOGLE HOME (GOOGLE NOW), APPLE HOMEPOD (SIRI), AMAZON ECHO (ALEXA), HARMAN KARDON (CORTANA)

FROM YACY TO YACY GRID

YaCy Decentralization

Centralized model: many users connect to one central server (aka "cloud") that is run by a single administrator.

YaCy Decentralization

Decentralized model: every participant is both user and admin (user+admin). The peers form a peer-to-peer network, or an ad-hoc network in an intranet or darknet.

Information Retrieval: Large-Scale Search Engine

Search engine cluster: a grid of search engine nodes (3 rows of 4 on the slide).

- vertical scaling: more performance
- horizontal scaling: more documents

Information Retrieval: Large-Scale Search Engine

Re-sharding: the search engine cluster grows from 4 to 5 nodes per row.

- vertical scaling: more performance
- horizontal scaling: more documents
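The cost of re-sharding can be illustrated with a small sketch. It assumes the simplest possible placement rule, a stable hash of the document id modulo the number of shards; the slides do not say which rule a real cluster uses, and the function names and URLs are hypothetical.

```python
import hashlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document id to a shard with a stable hash (assumed placement rule)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_fraction(doc_ids, old_count: int, new_count: int) -> float:
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for d in doc_ids
        if shard_for(d, old_count) != shard_for(d, new_count)
    )
    return moved / len(doc_ids)


# Hypothetical document ids, only used to drive the example.
doc_ids = [f"http://example.org/page/{i}" for i in range(10_000)]
fraction = moved_fraction(doc_ids, 4, 5)
print(f"{fraction:.0%} of documents change shard when growing from 4 to 5 shards")
```

With modulo placement most documents have to move when one column of nodes is added, which is why re-sharding a large index is expensive.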

Information Retrieval: Scaling with YaCy

In YaCy, every search engine node of the cluster becomes a search engine peer.

Information Retrieval: Scaling with YaCy

Peers are arranged around a Distributed Hash Table:

- DHT-Store: crawl the web, create a web index, distribute the index
- DHT-Read: DHT search in a Distributed Hash Table

DHT: Distributed Hash Table
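As a rough illustration of DHT-Store and DHT-Read, here is a toy hash ring. It is a generic consistent-hashing sketch, not YaCy's actual distribution scheme; the class `ToyDHT`, the peer names and the URLs are hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Place a key (peer name or word) on the hash ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ToyDHT:
    """Toy word index spread over peers; each word lives on exactly one peer."""

    def __init__(self, peer_names):
        self._ring = sorted((ring_position(name), name) for name in peer_names)
        self._positions = [pos for pos, _ in self._ring]
        self._stores = {name: {} for name in peer_names}

    def responsible_peer(self, word: str) -> str:
        # First peer at or after the word's position, wrapping around the ring.
        i = bisect.bisect_left(self._positions, ring_position(word)) % len(self._ring)
        return self._ring[i][1]

    def store(self, word: str, url: str) -> str:
        """DHT-Store: send the index entry to the responsible peer."""
        peer = self.responsible_peer(word)
        self._stores[peer].setdefault(word, set()).add(url)
        return peer

    def read(self, word: str) -> set:
        """DHT-Read: ask only the responsible peer for the word."""
        peer = self.responsible_peer(word)
        return self._stores[peer].get(word, set())


peer_names = [f"peer-{i}" for i in range(8)]
dht = ToyDHT(peer_names)
dht.store("yacy", "http://yacy.net")
dht.store("yacy", "http://example.org/about-yacy")
print(dht.read("yacy"))
```

A search only has to contact the peer responsible for each query word, but the result depends on which peers are currently online.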

YaCy "freeworld" Network

Problems with YaCy:
• Search index is incomplete
• Too much redundancy
• No stability (because that's wanted)

Solution: YaCy Grid
• Complete index
• No redundancy (or not too much)
• Result stability (centralized index)
• Industry-strength scaling (distributed)

Information Retrieval: Components of a Search Engine

Search engine crawler: several jobs run concurrently, each from Start-URL to Termination. A job cycles through:
- URL Stack
- Loader
- Parser
- Indexing
- Stacker
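The crawler cycle named on this slide can be sketched as a single loop. This is a minimal illustration, not YaCy code: the "web" is a hypothetical in-memory dictionary so that the example runs without network access, and the link extraction is a deliberately naive regular expression.

```python
import re
from collections import deque

# Hypothetical three-page web, standing in for real HTTP loading.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/one">one</a> '
                         '<a href="http://a.example/two">two</a> start page',
    "http://a.example/one": '<a href="http://a.example/two">two</a> first page',
    "http://a.example/two": "second page",
}


def load(url: str) -> str:
    """Loader: fetch the raw document (here: from the fake web)."""
    return FAKE_WEB.get(url, "")


def parse(html: str):
    """Parser: return (plain text, outgoing links)."""
    links = re.findall(r'href="([^"]+)"', html)
    text = " ".join(re.sub(r"<[^>]+>", " ", html).split())
    return text, links


def crawl(start_url: str, max_depth: int) -> dict:
    index = {}
    url_stack = deque([(start_url, 0)])  # processed first-in, first-out
    seen = {start_url}
    while url_stack:                     # termination: nothing left to load
        url, depth = url_stack.popleft()
        text, links = parse(load(url))
        index[url] = text                # indexing
        if depth < max_depth:
            for link in links:           # stacker: enqueue unseen links
                if link not in seen:
                    seen.add(link)
                    url_stack.append((link, depth + 1))
    return index


index = crawl("http://a.example/", max_depth=3)
print(sorted(index))
```

In a real system each of these steps runs concurrently for several jobs.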

Information Retrieval: Components of a Search Engine

The search engine is assembled step by step:
- crawler
- parser
- Index Schema

Text fields from Index Schema

Information Retrieval: Components of a Search Engine

Search engine components: crawler, parser, search interface

Example query against the search interface:
http://localhost:8090/solr/select?q=text_t:ibm%20mainframe%20AND%20url_file_ext_s:pdf&fl=sku,author,publisher_t
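The query URL on this slide can be assembled from its parts, which makes the role of `q` (the query over schema fields) and `fl` (the list of fields to return) easier to see. The helper name `solr_select_url` is hypothetical; host, port and field names are taken from the slide.

```python
from urllib.parse import quote, urlencode


def solr_select_url(query: str, fields, base="http://localhost:8090/solr/select") -> str:
    """Build a Solr select URL; spaces become %20, ':' and ',' stay readable."""
    params = {"q": query, "fl": ",".join(fields)}
    return base + "?" + urlencode(params, quote_via=quote, safe=":,")


url = solr_select_url(
    "text_t:ibm mainframe AND url_file_ext_s:pdf",
    ["sku", "author", "publisher_t"],
)
print(url)
```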

Information Retrieval: Components of a Search Engine

Search engine components: crawler, parser, search interface, monitoring, administration

YaCy Grid: Basic Schema

Search engines consist of three simple parts:
- Content Harvester: Crawler + Scraper (first, all content must be loaded)
- Search Index Core: create an index, e.g. with Solr or Elasticsearch
- Search Portal: with appropriate presentation of the search results

If you look at the details, it's a bit more complex.

YaCy Grid: more Components

- Parser (json api, Schema): detect and extract the full-text of file formats for documents, images, audio, video and messages. I.e. you need a parser.
- Ranking / Semantic Enrichment (json api, Schema, data store): recognition, tagging, linguistics; understand the meaning of the content. Semantics can enhance ranking.
- Advertising / Add-On Content / Affiliate (json api, data store, Opensearch): commercial components. And a monetizing component!

These components can also be used outside of a search engine! They could be developed and used with a business partner.

YaCy Grid: Architecture

Large-scale search engine matrix; two implementations: enterprise & p2p. Most components expose a json api. The slide also annotates components with re-use estimates.

Row 1 (re-use annotations: 100%, 80%, 80%):
- Content Harvester: Crawler + Scraper; sources http, ftp, smb, file; data store
- Distribution / Collection / Archive: documents, images, audio, video and messages as WARC file batches; WARC sharing with bittorrent; closed-group WARC sharing network
- Parser: detect and extract the full-text of file formats (documents, images, audio, video, messages); WARC input; YaCy Schema

Row 2 (re-use annotations: 100%, 20%):
- Ranking / Semantic Enrichment: recognition, tagging, linguistics; understand the meaning of the content; data store; YaCy Schema
- Search Index Core: IR system abstraction over Solr and Elastic; http/json; file batch; YaCy Schema
- Metasearch / Aggregation: social media, meta-search, ranking; bundling of different search sources and post-ranking; Opensearch

Row 3 (re-use annotation: 100%):
- Feed-Back / Moderation & Classifier: moderation, profanity; provide user-specific moderated results; http/json
- Search Portal with Search Cache: identical to the YaCy search page, using the bootstrap client; http/json; Opensearch; data store
- Advertising / Add-On Content / Affiliate: commercial components; http/json; Opensearch; data store

Row 4 (clients to the json api; re-use annotations: 20%, 20%):
- System Administration: initialize, schedule, backup, updates, config, users
- Monitoring: visualize, alerts, debugging, resources

USE OF YACY GRID

YaCy Grid: Landesportal NRW

YaCy Grid: SUSPER

YaCy Grid: Standard YaCy Search Portal

YaCy Grid: YaCy and YaCy Grid on GitHub

More than 1.3K stargazers for YaCy! 700 stargazers for YaCy Grid!

"Legacy YaCy": still strong…

ARCHITECTURE OF YACY GRID

YACY GRID ARCHITECTURE

Legend: new YaCy components, (integrated) non-YaCy components, concepts.
ALL GRID COMPONENTS CAN RUN IN ...

Grid computation:
- YACY GRID CRAWLER (WARC file producer): https://github.com/yacy/yacy_grid_crawler
- YACY GRID PARSER (WARC-to-JSON): https://github.com/yacy/yacy_grid_parser
- YACY GRID LOADER (headless browser): https://github.com/yacy/yacy_grid_loader

Grid orchestration:
- YACY GRID MCP (Master Connect Program): https://github.com/yacy/yacy_grid_mcp
- The MCP contains: message broker (RabbitMQ mediator), asset broker (FTP mediator), Elasticsearch document push, Elasticsearch query client

Non-YaCy components:
- RABBIT MQ (middleware queues): https://www.rabbitmq.com/
- FTP SERVER (grid filesharing): http://mina.apache.org/ftpserver-project/
- ELASTICSEARCH (indexing, grid storage, search index): https://www.elastic.co/
- KIBANA (telemetry, grid monitoring): https://www.elastic.co/

Search:
- SEARCH FRONTEND yacy_webclient_bootstrap (search portal like legacy YaCy): https://github.com/yacy/yacy_webclient_bootstrap
- SUSPER (crawl start front end)
- DESKTOP SEARCH
- SEARCH API: YaCy/JSON
- SEARCH API: GSA/XML

YaCy Grid: Framework for Grid Orchestration

- YACY MCP (grid orchestration): https://github.com/yacy/yacy_grid_mcp
- RABBIT MQ (middleware queues): https://www.rabbitmq.com/
- APACHE FTP SERVER (grid file sharing): http://mina.apache.org/ftpserver-project/
- ELASTICSEARCH (search index)

Process:
- The MCP connects RabbitMQ, FTP and Elasticsearch, as well as child MCPs
- The MCP is embedded in all other Grid components
- The MCP also contains the indexer
- The search APIs are contained in the MCP

Onboarding:
- RabbitMQ: http://yacygrid.com:15672
- MCP: http://yacygrid.com:8100/yacy/grid/mcp/status.json
- Elasticsearch via GSA search: http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=*

YaCy Grid: High-Performance Computing with Queues

YaCy Grid: Message Architecture

- YACY GRID CRAWLER: WARC file producer
- YACY GRID PARSER: WARC-to-JSON
- YACY GRID LOADER: headless browser
- YACY GRID INDEXER: Elasticsearch push

Method:
- Messages include a crawl profile and actions
- Actions can run in parallel and in sequence
- The MCP component orchestrates the connections
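One way to read "in parallel and in sequence" is sketched below, under the assumption that nested actions run after their parent while sibling actions at the same level may run in parallel. That interpretation, the function `execution_stages` and the abbreviated action tree are illustrative; only the `type`, `queue` and `actions` keys appear in the slides.

```python
def execution_stages(actions, depth=0, plan=None):
    """Group (type, queue) pairs by nesting depth of the action tree."""
    plan = {} if plan is None else plan
    for action in actions:
        plan.setdefault(depth, []).append((action["type"], action["queue"]))
        # Nested actions are follow-ups of this action: one stage deeper.
        execution_stages(action.get("actions", []), depth + 1, plan)
    return plan


# Abbreviated action tree in the shape used by the job messages.
actions = [{
    "type": "loader", "queue": "webloader",
    "actions": [{
        "type": "parser", "queue": "yacyparser",
        "actions": [
            {"type": "indexer", "queue": "elasticsearch"},
            {"type": "crawler", "queue": "webcrawler"},
        ],
    }],
}]

plan = execution_stages(actions)
for depth in sorted(plan):
    print(depth, plan[depth])
```

Here the loader and parser form a sequence, while the indexer and the follow-up crawler sit at the same level and do not depend on each other.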

YaCy Grid: Crawler

YACY GRID CRAWLER: WARC file producer

Graphical Crawl Start Interface
- integrated in SUSPER (susper.com): http://susper.com/crawlstartexpert

YaCy Grid: Crawler and Loader + Headless Browser

YACY GRID CRAWLER: WARC file producer

Application Interface
- Parameters are identical to the YaCy/1 call
- Documented at http://www.yacy-websuche.de/wiki/index.php/Dev:APICrawler
- Example: http://localhost:8300/yacy/grid/crawler/crawlStart.json?crawlingURL=yacy.net

YaCy Grid: Crawler and Loader + Headless Browser

curl -X POST -F "[email protected]" -F "serviceName=crawler" -F "queueName=" http://yacygrid.com:8100/yacy/grid/mcp/messages/send.json

where job.json is:

{
  "metadata": {
    "process": "yacy_grid_loader",
    "count": 1
  },
  "data": [{
    "id": "201705042045000-xyz",
    "crawlingMode": "url",
    "crawlingURL": "http://yacy.net",
    "sitemapURL": "",
    "crawlingFile": "",
    "crawlingDepth": 3,
    "crawlingDepthExtension": "",
    "range": "domain",
    "mustmatch": ".*",
    "mustnotmatch": "",
    "ipMustmatch": ".*",
    "ipMustnotmatch": "",
    "indexmustmatch": ".*",
    "indexmustnotmatch": "",
    "deleteold": "off",
    "deleteIfOlderNumber": 0,
    "deleteIfOlderUnit": "day",
    "recrawl": "nodoubles",
    "reloadIfOlderNumber": 0,
    "reloadIfOlderUnit": "day",
    "crawlingDomMaxCheck": "off",
    "crawlingDomMaxPages": 1000,
    "crawlingQ": "off",
    "directDocByURL": "off",
    "storeHTCache": "off",
    "cachePolicy": "if fresh",
    "indexText": "on",
    "indexMedia": "off",
    "xsstopw": "off",
    "collection": "user",
    "agentName": "yacybot (yacy.net; crawler from yacygrid.com)",
    "user": "[email protected]",
    "client": "yacygrid.com"
  }],
  "actions": [{
    "type": "loader",
    "queue": "webloader",
    "urls": ["http://yacy.net"],
    "collection": "test",
    "targetasset": "test3/yacy.net.warc.gz",
    "actions": [{
      "type": "parser",
      "queue": "yacyparser",
      "sourceasset": "test3/yacy.net.warc.gz",
      "targetasset": "test3/yacy.net.jsonlist",
      "targetgraph": "test3/yacy.net.graph.json",
      "actions": [{
        "type": "indexer",
        "queue": "elasticsearch",
        "sourceasset": "test3/yacy.net.jsonlist"
      }, {
        "type": "crawler",
        "queue": "webcrawler",
        "sourceasset": "test3/yacy.net.graph.json"
      }]
    }]
  }]
}
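Hand-written job files are easy to get wrong (a missing comma or a typographic quote is enough to make the JSON invalid), so it can help to generate them. The sketch below builds the nested loader, parser, indexer and crawler actions from one asset prefix. The helper `build_job` is hypothetical, and the `data` entry is shortened to three crawl profile fields for brevity.

```python
import json


def build_job(start_url: str, collection: str, asset_prefix: str) -> dict:
    """Build a job message with chained actions (hypothetical helper)."""
    warc = f"{asset_prefix}.warc.gz"
    jsonlist = f"{asset_prefix}.jsonlist"
    graph = f"{asset_prefix}.graph.json"
    follow_ups = [
        {"type": "indexer", "queue": "elasticsearch", "sourceasset": jsonlist},
        {"type": "crawler", "queue": "webcrawler", "sourceasset": graph},
    ]
    parser = {
        "type": "parser", "queue": "yacyparser", "sourceasset": warc,
        "targetasset": jsonlist, "targetgraph": graph, "actions": follow_ups,
    }
    loader = {
        "type": "loader", "queue": "webloader", "urls": [start_url],
        "collection": collection, "targetasset": warc, "actions": [parser],
    }
    return {
        "metadata": {"process": "yacy_grid_loader", "count": 1},
        # Shortened crawl profile; a full job carries many more fields.
        "data": [{"crawlingMode": "url", "crawlingURL": start_url, "crawlingDepth": 3}],
        "actions": [loader],
    }


job = build_job("http://yacy.net", "test", "test3/yacy.net")
job_text = json.dumps(job, indent=2)  # serializing guarantees valid JSON
print(job_text)
```

Writing `job_text` to job.json gives a file that can be posted as described on the slide.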


YaCy Grid: Parser

- YACY GRID PARSER: WARC-to-JSON parser, produces JSON surrogates
- LEGACY YACY: YaCy/1 integration, reads JSON surrogates (https://github.com/yacy/yacy_search_server)

Process:
- The parser was extracted completely from YaCy/1 and moved into a microservice
- The parser is connected to the queueing and processes parser actions
- The results are Elasticsearch bulk index files, which YaCy/1 can now read as well
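For readers unfamiliar with the Elasticsearch bulk format: it is newline-delimited JSON in which each document line is preceded by an action line. The sketch below shows that generic layout; the exact files YaCy Grid writes are not shown on the slide, and the helper name, the use of `sku` as document id and the sample documents are assumptions.

```python
import json


def to_bulk_lines(documents, index_name: str) -> str:
    """Serialize documents as Elasticsearch bulk NDJSON (action line + source line)."""
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["sku"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical parsed documents using field names that appear in the slides.
documents = [
    {"sku": "http://yacy.net", "text_t": "YaCy decentralized web search"},
    {"sku": "http://yacy.net/grid", "text_t": "YaCy Grid"},
]
bulk = to_bulk_lines(documents, "web")
print(bulk, end="")
```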

YaCy Grid: Indexing Service and Search API

- YACY GRID INDEXER: Elasticsearch push into the ELASTICSEARCH search index (https://www.elastic.co/)
- Search frontends: YaCy/JSON and GSA/XML

Process:
- The MCP leads the connection to Elasticsearch, therefore:
- The indexer was built into the MCP as a queue processor
- The search frontend for GSA/XML and YaCy/JSON also lives in the MCP, as a servlet
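The idea of an indexer running as a queue processor can be sketched with the standard library alone. This stand-in uses an in-process `queue.Queue` and a dictionary instead of RabbitMQ and Elasticsearch; the message shape and all names are hypothetical.

```python
import queue
import threading


def indexer_worker(jobs: queue.Queue, index: dict) -> None:
    """Consume indexing jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            for doc in job["documents"]:
                index[doc["sku"]] = doc  # stand-in for a push to the search index
        finally:
            jobs.task_done()


jobs = queue.Queue()
index = {}
worker = threading.Thread(target=indexer_worker, args=(jobs, index))
worker.start()

jobs.put({"documents": [{"sku": "http://yacy.net", "text_t": "YaCy"}]})
jobs.put({"documents": [{"sku": "http://yacy.net/grid", "text_t": "YaCy Grid"}]})
jobs.put(None)  # tell the worker to stop

jobs.join()
worker.join()
print(sorted(index))
```

Because the worker only depends on the queue, more workers can be started to drain the same queue when indexing falls behind.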

YaCy Grid: M4 Trigger, Pipelining, Monitoring

GRID MONITORING

Status: concept exists
- No software dedicated to monitoring has been developed yet
- Pipelining is largely implemented through action objects in the queueing
- The status can be observed through RabbitMQ: http://yacygrid.com:15672

YaCy Grid: All URLs

End-to-end service URLs
- Crawl start: http://yacygrid.com:8300/yacy/grid/crawler/crawlStart.json?crawlingURL=yacy.net
- Monitoring via RabbitMQ: http://yacygrid.com:15672
- Search in Elasticsearch: http://localhost:9200/web/crawler/_search?q=text_t:*
- Search via MCP:
  http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=publicplan
  http://yacygrid.com:8100/yacy/grid/mcp/index/yacysearch.json?query=publicplan
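The end-to-end URLs can be generated by small helpers, which also makes one detail visible: the GSA endpoint takes the search term as `q`, the YaCy/JSON endpoint as `query`. Hosts, ports and paths are copied from the slide; the function names are hypothetical.

```python
from urllib.parse import urlencode

GRID_HOST = "yacygrid.com"  # host used throughout the slides


def crawl_start_url(crawling_url: str, host: str = GRID_HOST) -> str:
    query = urlencode({"crawlingURL": crawling_url})
    return f"http://{host}:8300/yacy/grid/crawler/crawlStart.json?{query}"


def gsa_search_url(term: str, host: str = GRID_HOST) -> str:
    # The GSA/XML API names the search term "q".
    return f"http://{host}:8100/yacy/grid/mcp/index/gsasearch.xml?{urlencode({'q': term})}"


def yacy_search_url(term: str, host: str = GRID_HOST) -> str:
    # The YaCy/JSON API names the search term "query".
    return f"http://{host}:8100/yacy/grid/mcp/index/yacysearch.json?{urlencode({'query': term})}"


print(crawl_start_url("yacy.net"))
print(gsa_search_url("publicplan"))
print(yacy_search_url("publicplan"))
```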

SUPERCOMPUTER SIMULATION

YaCy Grid: Supercomputer Simulation

72 CORES!

WHAT CAN YACY GRID DO BETTER?

YaCy Grid: Headless Browser

YACY GRID LOADER: headless browser

Method:
- Dynamic web content cannot be captured with simple crawlers
- The trend is towards dynamic content everywhere!
- In YaCy Grid, web pages are loaded as in a browser (with HtmlUnit)
- The process is very resource-intensive but easy to parallelize!
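Why page loading parallelizes easily can be shown with a thread pool: each URL is rendered independently of all others. In this sketch `render_page` is a hypothetical stand-in that only sleeps; it does not call HtmlUnit or any real browser.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def render_page(url: str) -> str:
    """Stand-in for an expensive headless-browser load (assumption: ~50 ms each)."""
    time.sleep(0.05)
    return f"<html><!-- rendered {url} --></html>"


def load_all(urls, workers: int = 8) -> dict:
    """Render all URLs concurrently and return {url: html}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(render_page, urls)))


urls = [f"http://example.org/page/{i}" for i in range(16)]
pages = load_all(urls)
print(len(pages), "pages rendered")
```

Since there is no shared state between page loads, the same pattern scales from threads to separate loader processes that consume from a queue.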

YaCy Grid: Drop-In Replacement for GSA! (has GSA API)
- http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=publicplan
- http://yacygrid.com:8100/yacy/grid/mcp/index/yacysearch.json?query=publicplan

YaCy Grid: What Can It Do Better?

YaCy Grid: Standards

The YaCy Grid architecture benefits from the use of standards:
- No monolith; a distributed architecture with exchangeable elements
- Essentially standard tools; the "gaps" are YaCy Grid elements
- Storage format for crawls: WARC files
- Search interface via Opensearch/RSS and GSA
- Metadata via standard vocabularies and Linked Open Data
- Index dumps in the Elasticsearch bulk format

Synergies:
- WARC tools (among others OpenWayback + WebRecorder)
- Parts are exchangeable (generate WARC with Heritrix or wget)
- Intermediate formats (WARC and index dumps) are tradable on a data e-commerce platform (licensing question?)

Prerequisite:
- YaCy Document Schema (very rich! more than tika delivers)

WHAT IS STILL MISSING?

YaCy Grid: Further Components

End-user web interface
- Account management:
  - Master data, basket, checkout
  - Like a data shopping portal
  - You buy search indexes
- Crawl start:
  - As in legacy YaCy, only more convenient
  - Creates a legend for the search corpus
- Monitoring & billing:
  - How much effort did my crawl cause?
- Search:
  - Plug-ins for your own website -> basket -> checkout

USE FOR OWI

YaCy Grid: Use for OWI

Usable components
- all of them!

THE FUTURE OF SEARCH

THE FUTURE OF SEARCH ENGINES + USER INTERFACES

Timeline, 1989 to 2030:
- Search engines: Yahoo, AltaVista, Google, WolframAlpha, Bing
- The Web: not so important any more. Replaced by personal assistants; they will answer all questions
- Social media (more information, fast): Twitter, Youtube ("Youtuber"), Facebook
- Mobile first: personal assistants require services based on schema.org
- User interfaces: web 1.0 (.com), web 2.0 (communities), web 2.5 (self-broadcast), web 3.0 (GGG, Conversational Web): answers to all questions!
- YaCy.net, SUSI.AI & YaCy Grid

DIGITAL PERSONAL ASSISTANTS IN VARIOUS SHAPES: SMART SPEAKER

GOOGLE, APPLE, AMAZON, MICROSOFT: WHERE IS A FOSS PERSONAL ASSISTANT?

GOOGLE HOME (GOOGLE NOW), APPLE HOMEPOD (SIRI), AMAZON ECHO (ALEXA), HARMAN KARDON (CORTANA)

THE ECOSPHERE FOR DIGITAL ASSISTANTS

YaCy Grid Search Engine: STATUS AND EXPERIENCES

MICHAEL CHRISTEN [email protected] @0rb1t3r