YaCy Grid Search Engine & THE YACY GRID CONCEPT
MICHAEL CHRISTEN [email protected] @0rb1t3r
YaCy Grid Concept by [email protected] @0rb1t3r OBSERVATIONS..
PEOPLE’S EXPECTATION ABOUT SOLVING A PROBLEM
Star Trek IV 1986 - talk to a computer, it will solve your problem
AI: DIGITAL PERSONAL ASSISTANTS
DIGITAL PERSONAL ASSISTANTS IN VARIOUS SHAPES
GOOGLE HOME, APPLE HOMEPOD, AMAZON ECHO, HARMAN KARDON, GOOGLE NOW, SIRI, ALEXA, CORTANA
FROM YACY TO YACY GRID
YaCy Decentralization
[Diagram: many users around one central server (aka 'cloud'), run by a single Administrator]
YaCy Decentralization
[Diagram: decentralized peer-to-peer network, or ad-hoc network in an intranet or darknet; every node is both user and admin]
Information Retrieval: Large-Scale Search Engine
[Diagram: Search Engine Cluster, a 3 x 4 grid of search engines]
vertical scaling: more performance
horizontal scaling: more documents
Information Retrieval: Large-Scale Search Engine
[Diagram: Search Engine Cluster after re-sharding, a 3 x 5 grid of search engines]
vertical scaling: more performance
horizontal scaling: more documents
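Why re-sharding is costly can be illustrated with a small sketch. It assumes the simplest possible placement rule (hash of the document id modulo the number of shards); the slide does not say which rule a real cluster uses, and the document ids are hypothetical.

```python
import hashlib


def shard_of(doc_id: str, shards: int) -> int:
    """Assign a document to a shard (assumed scheme: hash modulo shard count)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def moved_fraction(doc_ids, old_shards: int, new_shards: int) -> float:
    """Fraction of documents whose shard changes when the cluster is resized."""
    moved = sum(shard_of(d, old_shards) != shard_of(d, new_shards) for d in doc_ids)
    return moved / len(doc_ids)


# Hypothetical document ids; the cluster grows from 4 to 5 engines per row.
doc_ids = [f"http://example.org/page/{i}" for i in range(10_000)]
fraction = moved_fraction(doc_ids, 4, 5)
print(f"{fraction:.0%} of documents change shard")
```

Under this naive rule most documents have to move when a single engine is added, which is the effort the "re-sharding" label stands for.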
Information Retrieval: Scaling with YaCy
[Diagram: a 3 x 5 grid of search engines]
Information Retrieval: Scaling with YaCy
[Diagram: the same 3 x 5 grid, each search engine is now a peer]
Information Retrieval: Scaling with YaCy
[Diagram: peers arranged around a DHT (Distributed Hash Table)]
DHT-Store: crawl the web, create a web index, distribute the index
DHT-Read: search in a Distributed Hash Table
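The store/read split can be sketched in a few lines. This is an illustration of the general idea only, not YaCy's actual DHT: the peer names, the SHA-256 hash and the modulo assignment are assumptions made for the example.

```python
import hashlib
from collections import defaultdict

# Hypothetical peer names; a real network discovers its peers dynamically.
PEERS = ["peer-a", "peer-b", "peer-c", "peer-d"]


def peer_for(word: str) -> str:
    """Map a word to the peer responsible for it (assumed: hash modulo peer count)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return PEERS[int.from_bytes(digest[:8], "big") % len(PEERS)]


# One in-memory index per peer: word -> set of URLs.
stores = {p: defaultdict(set) for p in PEERS}


def dht_store(url: str, text: str) -> None:
    """DHT-Store: after crawling, send each word's posting to the responsible peer."""
    for word in set(text.lower().split()):
        stores[peer_for(word)][word].add(url)


def dht_read(word: str) -> set:
    """DHT-Read: ask only the peer responsible for the word."""
    return set(stores[peer_for(word)].get(word.lower(), set()))


dht_store("http://yacy.net", "decentralized web search")
dht_store("http://example.org", "web archive")
print(sorted(dht_read("web")))
```

A search for one word touches exactly one peer here; a multi-word query would read from several peers and intersect the results.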
Information Retrieval: Scaling with YaCy
DHT: Distributed Hash Table
Information Retrieval: Scaling with YaCy
YaCy ‘freeworld’ Network
Information Retrieval: Scaling with YaCy
Problems with YaCy:
• Search index is incomplete
• Too much redundancy
• No stability (because that’s wanted)
Solution: YaCy Grid
• Complete index
• No redundancy (or not too much..)
• Result stability (centralized index)
• Industry-strength scaling (distributed)
Information Retrieval: Components of a Search Engine
[Diagram: crawler inside the search engine: Start-URL, URL Stack, Loader, Parser, Indexing, Stacker, Termination; several jobs, concurrency; + Monitoring!]
Search Engine: crawler, parser, Index Schema
Text fields from Index Schema
Information Retrieval: Components of a Search Engine
Search Engine: crawler, parser, search interface
http://localhost:8090/solr/select?q=text_t:ibm%20mainframe%20AND%20url_file_ext_s:pdf&fl=sku,author,publisher_t
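The query URL from this slide can be assembled programmatically. The sketch below only builds the string and sends nothing; the endpoint and the field names (text_t, url_file_ext_s, sku, author, publisher_t) are copied from the slide, while the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the slide's example URL.
SOLR_SELECT = "http://localhost:8090/solr/select"


def solr_select_url(query: str, fields: list) -> str:
    """Build a Solr select URL; ':' and ',' stay unescaped to match the slide."""
    params = {"q": query, "fl": ",".join(fields)}
    return SOLR_SELECT + "?" + urlencode(params, quote_via=quote, safe=":,")


url = solr_select_url(
    "text_t:ibm mainframe AND url_file_ext_s:pdf",
    ["sku", "author", "publisher_t"],
)
print(url)
```

The `q` parameter carries the full-text condition and the file-extension filter, and `fl` restricts the response to the listed schema fields.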
Information Retrieval: Components of a Search Engine
Search Engine: crawler, parser, search interface, monitoring, administration
YaCy Grid: Basic Schema
Search engines consist of three simple parts:
• Content Harvester: Crawler + Scraper (first all content must be loaded)
• Search Index Core: create an index, e.g. with Solr or elasticsearch
• Search Portal: with appropriate presentation of the search results
If you look at the details it’s a bit more complex
YaCy Grid: more Components
• Parser (json api, Schema): detect and extract the full-text of file formats (Documents, Images, Audio, Video, Messages); i.e. you need a document parser
• Ranking / Semantic Enrichment (json api, Schema, data store): Recognition, Tagging, Linguistic; understand the meaning of the content; semantics can enhance ranking
• Advertising / Add-On / Affiliate Content (json api, data store, Opensearch): commercial components; and a monetizing component! They could be developed and used with a business partner
these components can also be used outside of a search engine!
YaCy Grid: Architecture
large-scale search engine matrix, two implementations: enterprise & p2p
[Diagram: matrix of components; most expose a json api for Documents, Images, Audio, Video and Messages, and each is annotated with a re-use figure between 20% and 100%]
• Content Harvester (Crawler + Scraper): http, ftp, smb, file; WARC, file batch
• Distribution / Collection Archive: WARC sharing with bittorrent; closed-group WARC sharing network; data store
• Parser: detect and extract the full-text of file formats; YaCy Schema
• Ranking / Semantic Enrichment: Recognition, Tagging, Linguistic; understand the meaning of the content
• Search Index Core: IR System abstraction; Solr, Elastic
• Metasearch Aggregation: Social Media Meta-Search; different search sources, bundling and post-ranking
• Search Portal: identical to YaCy search page, using bootstrap client
• Feed-Back Moderation: Search Cache & Moderation, Classifier, Profanity; provide user-specific moderated results
• Advertising / Add-On / Affiliate Content: commercial components
• System Administration: Initialize, Schedule, Backup, Updates, Resources, Config, Users; client to json api
• Monitoring: Visualize, Alerts, Debugging; client to json api
USAGE OF YACY GRID
YaCy Grid: Landesportal NRW
YaCy Grid: SUSPER
YaCy Grid: Standard YaCy Search Portal
YaCy Grid: YaCy and YaCy Grid on GitHub
> 1.3K STARGAZERS FOR YACY! 700 STARGAZERS FOR YACY GRID!
‘Legacy YaCy’: still strong…
ARCHITECTURE OF YACY GRID
YACY GRID ARCHITECTURE
LEGEND: NEW YACY COMPONENTS; NON-YACY COMPONENTS (INTEGRATED); CONCEPTS
ALL GRID COMPONENTS CAN RUN IN DOCKER
GRID COMPUTATION:
- YACY GRID CRAWLER (WARC FILE PRODUCER): https://github.com/yacy/yacy_grid_crawler
- YACY GRID PARSER (WARC-TO-JSON): https://github.com/yacy/yacy_grid_parser
- YACY GRID LOADER (HEADLESS BROWSER): https://github.com/yacy/yacy_grid_loader
GRID ORCHESTRATION:
- YACY GRID MCP (MASTER CONNECT PROGRAM): https://github.com/yacy/yacy_grid_mcp
- MCP parts: MESSAGE BROKER (RABBITMQ MEDIATOR), ASSET BROKER (FTP MEDIATOR), ELASTICSEARCH DOCUMENT PUSH, ELASTICSEARCH QUERY CLIENT
MIDDLEWARE QUEUES: RABBIT MQ, https://www.rabbitmq.com/
GRID FILESHARING: FTP SERVER, http://mina.apache.org/ftpserver-project/
SEARCH INDEX: ELASTICSEARCH, https://www.elastic.co/
SEARCH FRONTEND: yacy_webclient_bootstrap (SEARCH PORTAL LIKE LEGACY YACY), https://github.com/yacy/yacy_webclient_bootstrap
CRAWL START FRONT END: SUSPER
SEARCH API: YACY/JSON and GSA/XML
OTHER LABELS IN THE DIAGRAM: INDEXING TELEMETRY, GRID MONITORING, GRID STORAGE, KIBANA, DESKTOP SEARCH
YaCy Grid: Framework for Grid Orchestration
- YACY MCP: GRID ORCHESTRATION, https://github.com/yacy/yacy_grid_mcp
- RABBIT MQ: MIDDLEWARE QUEUES, https://www.rabbitmq.com/
- APACHE FTP SERVER: GRID FILE SHARING, http://mina.apache.org/ftpserver-project/
- ELASTICSEARCH: SEARCH INDEX
Process:
- The MCP connects to RabbitMQ, FTP and elasticsearch, as well as to child MCPs
- The MCP is embedded in all other Grid components
- The MCP also contains the indexer
- The search APIs are contained in the MCP
Onboarding:
- RabbitMQ: http://yacygrid.com:15672
- MCP: http://yacygrid.com:8100/yacy/grid/mcp/status.json
- elastic via GSA Search: http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=*
YaCy Grid: High-Performance Computing with Queues
YaCy Grid: Message Architecture
YACY GRID CRAWLER (WARC FILE PRODUCER), YACY GRID PARSER (WARC-TO-JSON), YACY GRID LOADER (HEADLESS BROWSER), YACY GRID INDEXER (ELASTICSEARCH PUSH)
Method
- Messages include a Crawl Profile and Actions
- Actions can run in parallel and in sequence
- The MCP component orchestrates the connections
YaCy Grid: Crawler
YACY GRID CRAWLER (WARC FILE PRODUCER)
Graphical Crawl Start Interface
- integrated in SUSPER (susper.com): http://susper.com/crawlstartexpert
YaCy Grid: Crawler and Loader + Headless Browser
YACY GRID CRAWLER (WARC FILE PRODUCER)
Application Interface
- Parameters are identical to the YaCy/1 call
- Documented at http://www.yacy-websuche.de/wiki/index.php/Dev:APICrawler
- Example: http://localhost:8300/yacy/grid/crawler/crawlStart.json?crawlingURL=yacy.net
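A crawl start URL like the example above can be put together with a small helper. The sketch only builds the string and does not contact a server. The endpoint comes from the slide; the helper name is hypothetical, and passing crawlingDepth and range as query parameters is an assumption based on the statement that the parameters match the YaCy/1 call.

```python
from urllib.parse import urlencode

# Host, port and path come from the slide's example; adjust for your own grid.
CRAWL_START = "http://localhost:8300/yacy/grid/crawler/crawlStart.json"


def crawl_start_url(crawling_url: str, **params) -> str:
    """Build a crawl start URL; extra keyword arguments become query parameters."""
    query = {"crawlingURL": crawling_url, **params}
    return CRAWL_START + "?" + urlencode(query)


simple = crawl_start_url("yacy.net")
detailed = crawl_start_url("yacy.net", crawlingDepth=3, range="domain")
print(simple)
print(detailed)
```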
YaCy Grid: Crawler and Loader + Headless Browser
curl -X POST -F "[email protected]" -F "serviceName=crawler" -F "queueName=webcrawler" http://yacygrid.com:8100/yacy/grid/mcp/messages/send.json
where job.json is:
{
  "metadata": {"process": "yacy_grid_loader", "count": 1},
  "data": [{
    "id": "201705042045000-xyz",
    "crawlingMode": "url", "crawlingURL": "http://yacy.net", "sitemapURL": "", "crawlingFile": "",
    "crawlingDepth": 3, "crawlingDepthExtension": "", "range": "domain",
    "mustmatch": ".*", "mustnotmatch": "", "ipMustmatch": ".*", "ipMustnotmatch": "",
    "indexmustmatch": ".*", "indexmustnotmatch": "",
    "deleteold": "off", "deleteIfOlderNumber": 0, "deleteIfOlderUnit": "day",
    "recrawl": "nodoubles", "reloadIfOlderNumber": 0, "reloadIfOlderUnit": "day",
    "crawlingDomMaxCheck": "off", "crawlingDomMaxPages": 1000, "crawlingQ": "off",
    "directDocByURL": "off", "storeHTCache": "off", "cachePolicy": "if fresh",
    "indexText": "on", "indexMedia": "off", "xsstopw": "off", "collection": "user",
    "agentName": "yacybot (yacy.net; crawler from yacygrid.com)",
    "user": "[email protected]", "client": "yacygrid.com"
  }],
  "actions": [{
    "type": "loader", "queue": "webloader", "urls": ["http://yacy.net"],
    "collection": "test", "targetasset": "test3/yacy.net.warc.gz",
    "actions": [{
      "type": "parser", "queue": "yacyparser",
      "sourceasset": "test3/yacy.net.warc.gz",
      "targetasset": "test3/yacy.net.jsonlist",
      "targetgraph": "test3/yacy.net.graph.json",
      "actions": [
        {"type": "indexer", "queue": "elasticsearch", "sourceasset": "test3/yacy.net.jsonlist"},
        {"type": "crawler", "queue": "webcrawler", "sourceasset": "test3/yacy.net.graph.json"}
      ]
    }]
  }]
}
YaCy Grid: Crawler and Loader + Headless Browser
"actions": [{
  "type": "loader", "queue": "webloader", "urls": ["http://yacy.net"],
  "collection": "test", "targetasset": "test3/yacy.net.warc.gz",
  "actions": [{
    "type": "parser", "queue": "yacyparser",
    "sourceasset": "test3/yacy.net.warc.gz",
    "targetasset": "test3/yacy.net.jsonlist",
    "targetgraph": "test3/yacy.net.graph.json",
    "actions": [
      {"type": "indexer", "queue": "elasticsearch", "sourceasset": "test3/yacy.net.jsonlist"},
      {"type": "crawler", "queue": "webcrawler", "sourceasset": "test3/yacy.net.graph.json"}
    ]
  }]
}]
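The nesting of the actions in the job message can be made visible by walking the tree. The sketch below uses a trimmed copy of the actions from the slide. Reading sibling actions as parallel work and nested actions as follow-up steps is an interpretation of the deck's statement that actions can run in parallel and in sequence, not documented MCP behaviour, and the function name is hypothetical.

```python
import json

# Trimmed copy of the "actions" part of the job message shown on the slide.
ACTIONS = json.loads("""
[{"type": "loader", "queue": "webloader",
  "targetasset": "test3/yacy.net.warc.gz",
  "actions": [{"type": "parser", "queue": "yacyparser",
    "sourceasset": "test3/yacy.net.warc.gz",
    "targetasset": "test3/yacy.net.jsonlist",
    "targetgraph": "test3/yacy.net.graph.json",
    "actions": [
      {"type": "indexer", "queue": "elasticsearch",
       "sourceasset": "test3/yacy.net.jsonlist"},
      {"type": "crawler", "queue": "webcrawler",
       "sourceasset": "test3/yacy.net.graph.json"}]}]}]
""")


def walk(actions, depth=0):
    """Yield (depth, type, queue) for every action, parents before children."""
    for action in actions:
        yield depth, action["type"], action["queue"]
        yield from walk(action.get("actions", []), depth + 1)


steps = list(walk(ACTIONS))
for depth, kind, queue in steps:
    print("  " * depth + f"{kind} -> queue {queue}")
```

The loader's target asset is the parser's source asset, and the parser's two outputs feed the indexer and the next crawler step, which is how one message describes a whole pipeline.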
YaCy Grid: Parser
YACY GRID PARSER (WARC-TO-JSON PARSER): JSON SURROGATE
LEGACY YACY (YACY/1 INTEGRATION): JSON SURROGATE, https://github.com/yacy/yacy_search_server
Process
- The parser was completely extracted from YaCy/1 and turned into a microservice
- The parser is connected to the queueing and processes parser actions
- The results are elasticsearch bulk index files, which YaCy/1 can now read as well
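For readers unfamiliar with the elasticsearch bulk format: it is newline-delimited JSON in which every document is preceded by an action line. The sketch below writes that generic format. The exact layout of the files produced by the YaCy Grid parser is not shown in the slides, so the index name "web" (taken from the Elasticsearch search URL later in the deck), the document id and the sample fields are assumptions.

```python
import json


def to_bulk(docs, index="web"):
    """Serialize documents as bulk lines: one action line, then one source line."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk format requires a trailing newline


# Hypothetical parsed document; field names borrowed from the deck's Solr example.
docs = [("doc-1", {"sku": "http://yacy.net/", "text_t": "YaCy decentralized web search"})]
bulk = to_bulk(docs)
print(bulk, end="")
```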
YaCy Grid: Indexing Service and Search API
YACY GRID INDEXER (ELASTICSEARCH PUSH); ELASTICSEARCH (SEARCH INDEX), https://www.elastic.co/; SEARCH FRONTEND: YACY/JSON; SEARCH FRONTEND: GSA/XML
Process:
- The MCP leads the connection to Elasticsearch, therefore:
- The indexer was built into the MCP as a queue processor.
- The search frontend for GSA/XML and YaCy/JSON also lives in the MCP, as a servlet
YaCy Grid: M4 Trigger, Pipelining, Monitoring
GRID MONITORING
Status: concept exists
- No software dedicated to monitoring has been developed yet
- Pipelining is largely realized through action objects in the queueing
- Status can be observed via RabbitMQ: http://yacygrid.com:15672
YaCy Grid: All URLs
End-to-End Service URLs
- Crawl Start: http://yacygrid.com:8300/yacy/grid/crawler/crawlStart.json?crawlingURL=yacy.net
- Monitoring via RabbitMQ: http://yacygrid.com:15672
- Search in Elasticsearch: http://localhost:9200/web/crawler/_search?q=text_t:*
- Search via MCP: http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=publicplan and http://yacygrid.com:8100/yacy/grid/mcp/index/yacysearch.json?query=publicplan
SUPERCOMPUTER SIMULATION
YaCy Grid: Supercomputer Simulation
72 CORES!
WHAT YACY GRID CAN DO BETTER
YaCy Grid: Headless Browser
YACY GRID LOADER (HEADLESS BROWSER)
Method
- Dynamic web content cannot be captured with simple crawlers
- The trend is towards dynamic content everywhere!
- In YaCy Grid, web pages are loaded as with a browser (using HtmlUnit)
- The process is very resource-intensive but easy to parallelize!
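That page loads are easy to parallelize follows from their independence: no load needs the result of another. The sketch below shows the pattern with a thread pool. The loader is a stand-in that renders nothing; the real YaCy Grid loader uses HtmlUnit, and the function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load_page(url: str) -> str:
    """Stand-in for a headless browser load; a real loader would render the page."""
    return f"<html><!-- rendered {url} --></html>"


def load_all(urls, workers=4):
    """Load pages concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(load_page, urls)))


pages = load_all(["http://yacy.net", "http://susper.com"])
print(len(pages))
```

In a grid the same idea is applied across machines: several loader processes take URLs from the same queue instead of threads sharing one pool.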
YaCy Grid: Drop-In Replacement for GSA! (has GSA API)
- http://yacygrid.com:8100/yacy/grid/mcp/index/gsasearch.xml?q=publicplan
- http://yacygrid.com:8100/yacy/grid/mcp/index/yacysearch.json?query=publicplan
YaCy Grid: What can it do better?
YaCy Grid: Standards
The YaCy Grid architecture benefits from the use of standards
- No monolith, but a distributed architecture with exchangeable elements
- Essentially standard tools; the “gaps” are YaCy Grid elements
- Storage format for crawls: WARC files
- Search interface via Opensearch/RSS and GSA
- Metadata via standard vocabularies and Linked Open Data
- Index dumps in elasticsearch bulk format
Synergies
- WARC tools (among others OpenWayback + WebRecorder)
- Parts are exchangeable (generate WARC with Heritrix or wget)
- Intermediate formats (WARC and index dumps) are tradable on a data e-commerce platform (licensing question?)
Prerequisite
- YaCy Document Schema (very rich! more than tika delivers)
WHAT IS STILL MISSING?
YaCy Grid: Further Components
End-User Web Interface
- Account Management:
  - Master data, Basket, Checkout
  - Like a data shopping portal
  - You buy search indexes
- Crawl Start:
  - Like in Legacy YaCy, only more convenient
  - Creates a legend for the search corpus
- Monitoring & Billing:
  - How much effort did my crawl cause?
- Search:
  - Plug-ins for your own website -> Basket -> Checkout
USE FOR OWI
YaCy Grid: Use for OWI
Usable components
- all of them!
THE FUTURE OF SEARCH
THE FUTURE OF SEARCH ENGINES + USER INTERFACES
[Diagram: TIMELINE from 1989 to 2030]
THE FUTURE OF SEARCH ENGINES: Yahoo, AltaVista, Google, Bing, WolframAlpha
The Web: not so important any more. Replaced by: Personal Assistants; they will answer all questions
THE FUTURE OF USER INTERFACES: web 1.0 (.com), web 2.0 (communities), web 2.5 (self-broadcast), web 3.0 (GGG, Conversational Web)
Social Media: more information, fast (Facebook, Youtube “Youtuber”, Twitter)
Mobile First: Personal Assistants require services based on schema.org
YaCy.net, SUSI.AI & YaCy Grid: Answers to all Questions!
DIGITAL PERSONAL ASSISTANTS
DIGITAL PERSONAL ASSISTANTS IN VARIOUS SHAPES
SMART SPEAKER: GOOGLE HOME, APPLE HOMEPOD, AMAZON ECHO, HARMAN KARDON
GOOGLE NOW, SIRI, ALEXA, CORTANA
GOOGLE, APPLE, AMAZON, MICROSOFT: WHERE IS A FOSS PERSONAL ASSISTANT?
THE ECOSPHERE FOR DIGITAL ASSISTANTS
YaCy Grid Search Engine & STATUS AND EXPERIENCES
MICHAEL CHRISTEN [email protected] @0rb1t3r