Scalable Framework for Semantic Web Crawling

Ivo Lašek1,2, Ondřej Klimpera1, Peter Vojtáš2

1 Czech Technical University in Prague, Faculty of Information Technology, Prague, Czech Republic
[email protected], [email protected]
2 Charles University in Prague, Faculty of Mathematics and Physics, Prague, Czech Republic
[email protected]

Abstract. In this paper, we introduce a new approach to semantic Web crawling. We propose a scalable framework for semantic data crawling that works as a service: users post their crawling requests via a RESTful API, the requests are processed on a scalable Hadoop cluster, and the aggregated results are returned in the desired format.

Keywords: Linked Data, Crawler, MapReduce

1 Introduction and Related Work

In this paper, we introduce our approach to semantic data crawling. We proposed and developed a semantic Web crawling framework that supports all major ways of obtaining data from the semantic Web: crawling RDF resources, extracting RDF embedded in Web pages, querying SPARQL endpoints and crawling semantic sitemaps. The framework is designed as a Web service. In order to provide the scalability needed for crawling Web resources, the whole service runs on a Hadoop computer cluster and uses the MapReduce model [2] to distribute the workload.

Probably among the first adopters of crawling technologies for the Semantic Web were the authors of semantic Web search engines. When Watson [10] and Swoogle [3] were developed, one of the biggest crawling problems was how to actually discover semantic data resources. The authors proposed several heuristics, including exploring well-known ontology repositories and issuing special types of queries. The crawling of the discovered content by Watson relies on the Heritrix crawler (http://crawler.archive.org).

A pipelined crawling architecture was proposed for MultiCrawler [6], employed in the SWSE semantic search engine [5, 4]. MultiCrawler also deals with performance scaling by distributing processed pages to individual computers based on a hashing function. However, the authors do not deal with a fail-over scenario, where some computers in the cluster might break down.

A multithreaded crawler was used to obtain data for the Falcons search engine [1]. A more sophisticated crawling infrastructure is employed in the Sindice project [9]. We adapted the semantic sitemap format that its authors propose. However, our project focuses on processing individual crawling requests rather than on indexing all semantic documents on the Web.

LDSpider [7] is a Java project that enables performing custom crawling tasks. The spider performs concurrent crawling by starting multiple threads. However, all the threads still share the CPU, memory and storage of a single machine. We aim at providing a distributed crawling framework that can run on multiple computers. Ordinary Web crawlers like Apache Nutch [8] do not deal with semantic Web data and are designed to follow and index only ordinary Web links.

2 Crawling Semantic Web Resources

When crawling Web resources in general (and RDF resources in our case), it is necessary to avoid overloading the crawled servers. The Robots Exclusion Protocol, published in robots.txt files, is therefore a valuable source of information for crawlers.

Semantic sitemaps are another valuable source of crawling information. Our crawler is able to extract three types of information from them: URL resources, sitemap index files and locations of semantic data dumps.

Yet another way to make semantic data accessible is to provide a SPARQL endpoint: the data source exposes an API consuming SPARQL queries. We designed the crawler so that it is able to send HTTP GET requests to given endpoints, which is the most common way of accessing them. The results are then retrieved and used for further processing, i.e. further crawling of the contained resources, if requested by the user according to the requested crawling depth.
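As an illustration, the following minimal sketch shows how such an HTTP GET request to a SPARQL endpoint might look in plain Java. It is not part of the framework itself; the endpoint URL, the example query, the requested result format and the class name SparqlGetExample are only illustrative assumptions.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SparqlGetExample {
    public static void main(String[] args) throws Exception {
        // Example endpoint and query; a real crawling request would supply these.
        String endpoint = "http://dbpedia.org/sparql";
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        // SPARQL endpoints are most commonly queried via HTTP GET with the
        // query passed as a URL parameter.
        String url = endpoint + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/sparql-results+xml") // desired result format
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response body would be handed over for further processing,
        // e.g. extracting resource URIs to crawl in the next depth level.
        System.out.println(response.body());
    }
}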

3 Parallel Processing with MapReduce

We have chosen the MapReduce programming model for the crawling. It is a proven concept for this type of activity, used by Google [2], Apache Nutch [8] and other crawlers. Our crawler uses the MapReduce implementation provided by the Apache Hadoop framework (http://hadoop.apache.org/). Crawling of Web resources is done by launching a series of dependent MapReduce jobs, where each one has a designated task to perform.

The crawling process starts with a user's request containing a list of resources of various types (URLs, sitemaps and SPARQL queries to selected SPARQL endpoints). The crawler creates two job feeders: SeedListJobFeeder, which is responsible for creating an initial seed list, and CrawlingJobFeeder, which deals with crawling Web resources to the requested depth.

To create the seed list, the crawler first checks whether robots.txt files and sitemap resources are available. If so, MapReduce jobs are created to crawl them and discover new locations, which are later added to the seed list. The disallowed patterns are stored, and sitemap locations are collected and processed. Crawling sitemaps is slightly more involved, as each sitemap can contain references to other sitemaps on the server. The crawler therefore has to crawl them in depth, and each level requires a new MapReduce job. The same applies to crawling Web resources to the requested depth: each level of crawling is handled by a new MapReduce job, as sketched below.
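The per-level chaining of jobs can be pictured with the following simplified sketch, which launches one Hadoop MapReduce job per crawling depth. This is only an illustration under our own assumptions, not the framework's actual code: the class names (DepthCrawlJobFeeder, FetchMapper, DedupReducer) and paths are hypothetical, and the mapper and reducer bodies are placeholders for the real fetching and extraction logic.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DepthCrawlJobFeeder {

    // Placeholder mapper: a real implementation would download the resource
    // behind each seed URL, extract RDF triples and emit newly discovered URLs.
    public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text seedUrl, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(seedUrl, new Text("fetched"));
        }
    }

    // Placeholder reducer: deduplicates URLs discovered at the current level.
    public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(url, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int maxDepth = Integer.parseInt(args[0]); // requested crawling depth
        Path input = new Path(args[1]);           // initial seed list
        String outputBase = args[2];

        // One MapReduce job per level; the output of level n feeds level n+1.
        for (int depth = 0; depth < maxDepth; depth++) {
            Path output = new Path(outputBase + "/depth-" + depth);
            Job job = Job.getInstance(conf, "crawl-depth-" + depth);
            job.setJarByClass(DepthCrawlJobFeeder.class);
            job.setMapperClass(FetchMapper.class);
            job.setReducerClass(DedupReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Crawling job failed at depth " + depth);
            }
            input = output;
        }
    }
}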

Before the final data merge is triggered, we extract the owl:sameAs statements found during crawling. The final MapReduce job reads the crawled triples from all depths and checks their subject, predicate and object nodes against the owl:sameAs mapping index. If a match is found, the corresponding part of the triple is replaced with its reference value (see the sketch at the end of this section). The results are written to the distributed file system and copied to the crawler's local storage, where they are accessible to the user via the REST API.

When a user submits a crawling request to the API, it is added to a queue of waiting jobs. The requests are processed in separate threads managed by a thread pool of fixed size. A MapReduceJobRunner creates the job feeders mentioned above and handles their submission to the cluster. This class also monitors job status and progress and provides this information to the user.

We tested our crawler on a small cluster with 7 nodes. Each node had its own local 20 GB storage. The first node had an additional NFS-mounted storage with 1 TB of available disk space, which was used to store job results, log files and the database file structure. We ran a test downloading semantic dump files from http://dbpedia.org and http://data.nytimes.com. The crawling results contained over 30 million triples in more than 4.2 GB of data. Figure 1 shows the percentage share of the individual processing steps involved in extracting triples from one URL. The average size of a resource was 270 kB, containing approximately 1600 triples. On average, about 10 triples per millisecond were extracted on one node.
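Returning to the owl:sameAs normalization, the following minimal sketch shows the idea behind replacing triple parts with their reference values. It assumes a simple in-memory map; the class name SameAsNormalizer and the example IRIs are hypothetical, and the real framework works with a mapping index produced by a preceding MapReduce job rather than a local HashMap.

import java.util.HashMap;
import java.util.Map;

public class SameAsNormalizer {

    // Maps an IRI to the reference value chosen for its owl:sameAs cluster.
    private final Map<String, String> sameAsIndex;

    public SameAsNormalizer(Map<String, String> sameAsIndex) {
        this.sameAsIndex = sameAsIndex;
    }

    // Replaces subject, predicate and object with their reference values, if any.
    public String[] normalize(String subject, String predicate, String object) {
        return new String[] {
            sameAsIndex.getOrDefault(subject, subject),
            sameAsIndex.getOrDefault(predicate, predicate),
            sameAsIndex.getOrDefault(object, object)
        };
    }

    public static void main(String[] args) {
        // Hypothetical owl:sameAs mapping extracted before the final merge.
        Map<String, String> index = new HashMap<>();
        index.put("http://data.nytimes.com/N1", "http://dbpedia.org/resource/Prague");

        SameAsNormalizer normalizer = new SameAsNormalizer(index);
        String[] triple = normalizer.normalize(
                "http://data.nytimes.com/N1",
                "http://www.w3.org/2000/01/rdf-schema#label",
                "\"Prague\"");
        System.out.println(String.join(" ", triple));
    }
}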

Fig. 1. Time consumption of URL processing tasks on small data

4 Conclusion

We proposed an architecture for distributed crawling of semantic data. The introduced framework enables consumption of data from various types of resources, including data dumps listed in semantic sitemaps, individual semantic Web documents, SPARQL endpoints and ordinary Web pages containing structured data mark-up. Currently, the project is in a beta version and can be downloaded from our Web page (http://research.i-lasek.cz/projects/semantic-web-crawler).

Acknowledgments. This work has been partially supported by the grant of the Czech Technical University in Prague (SGS12/093/OHK3/1T/18) and by the grant of the Czech Science Foundation (GAČR) P202/10/0761.

References

1. Gong Cheng and Yuzhong Qu. Searching linked objects with Falcons: Approach, implementation and evaluation. pages 49–70, 2009.
2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.
3. Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs. Swoogle: A search and metadata engine for the semantic web. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pages 652–659, New York, NY, USA, 2004. ACM.
4. Andreas Harth, Aidan Hogan, Renaud Delbru, Jürgen Umbrich, and Stefan Decker. SWSE: Answers before links. 2007.
5. Andreas Harth, Aidan Hogan, Jürgen Umbrich, and Stefan Decker. SWSE: Objects before documents!, 2007.
6. Andreas Harth, Jürgen Umbrich, and Stefan Decker. MultiCrawler: A pipelined architecture for crawling and indexing semantic web data. In Isabel Cruz, Stefan Decker, Dean Allemang, Chris Preist, Daniel Schwabe, Peter Mika, Mike Uschold, and Lora Aroyo, editors, The Semantic Web - ISWC 2006, volume 4273 of Lecture Notes in Computer Science, pages 258–271. Springer Berlin / Heidelberg, 2006.
7. Robert Isele, Andreas Harth, Jürgen Umbrich, and Christian Bizer. LDSpider: An open-source crawling framework for the web of linked data. In Poster at the International Semantic Web Conference (ISWC 2010), Shanghai, 2010.
8. Rohit Khare and Doug Cutting. Nutch: A flexible and scalable open-source web search engine. Technical report, 2004.
9. Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, and Giovanni Tummarello. Sindice.com: A document-oriented lookup index for open linked data. International Journal of Metadata, Semantics and Ontologies, 3, 2008.
10. Marta Sabou, Martin Dzbor, Claudio Baldassarre, Sofia Angeletou, and Enrico Motta. Watson: A gateway for the semantic web. In Poster session of the European Semantic Web Conference, ESWC, 2007.
