Fast and Scalable Pattern Mining for Media-Type Focused Crawling
Jürgen Umbrich, Marcel Karnstedt, Andreas Harth, "Fast and Scalable Pattern Mining for Media-Type Focused Crawling", KDML 2009: Knowledge Discovery, Data Mining, and Machine Learning, in conjunction with LWA 2009, 2009. Item record: http://hdl.handle.net/10379/1121

Fast and Scalable Pattern Mining for Media-Type Focused Crawling*
[experience paper]

Jürgen Umbrich, Marcel Karnstedt and Andreas Harth†
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway, Ireland

* This material is based upon works jointly supported by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and under Grant No. 08/SRC/I1407 (Clique: Graph & Network Analysis Cluster).
† Current affiliation: Institut AIFB, Universität Karlsruhe (TH).

Abstract

Search engines targeting content other than hypertext documents require a crawler that discovers resources identifying files of certain media types. Naïve crawling approaches do not guarantee a sufficient supply of new URIs (Uniform Resource Identifiers) to visit; effective and scalable mechanisms for discovering and crawling targeted resources are needed. One promising approach is to use data mining techniques to identify the media type of a resource without downloading its content. The idea is to apply a learning approach to features derived from patterns occurring in the resource identifier. We present a focused crawler as a use case for fast and scalable data mining and discuss classification and pattern mining techniques suited for selecting resources satisfying specified media types. We show that we can process an average of 17,000 URIs/second and still detect the media type of resources with a precision of more than 80% and a recall of over 65% for all media types.

1 Introduction

The number of search engines focused on specific topics has increased significantly over recent years. Besides search engines focused on (hyper)text documents, specialised search engines are available online which collect and integrate information from files of particular media types. Seeqpod [URL, j] and Blinkx [URL, g] offer search over audio and video files, Google Scholar [URL, b] and CiteSeer [URL, a] are digital libraries of printable documents, Technorati [URL, d] provides real-time access to news-feeds, and Seekda [URL, c] offers search capabilities for web services. A common issue for all vertical search engines is the challenge of discovering and downloading the targeted files on the Web. Specifically, the challenge of detecting documents of a certain media type without inspecting their content is still not solved [Lausen and Haselwanter, 2007]. For this task, a URI-only classifier is a good choice, because speed is crucial and content filtering should be enabled before an (objectionable) web page is downloaded. Basically, a focused crawler [Chakrabarti et al., 1999] wants to infer the topic of a target page before devoting bandwidth to downloading it. Further, a page's content may be hidden in images.

A crawler for media-type targeted search engines focuses on document formats (such as audio and video) instead of the topics covered by the documents. For a scalable media-type focused crawler it is essential to discover documents of the requested media type on the Web and to avoid expensive HTTP lookups of irrelevant files. Thus, the crawler needs to identify the media type of a document without establishing a connection and downloading the content. A common way to identify the format of a file is to use the file extension of the file name, or to detect characteristic byte patterns in the file content itself (the magic number approach). The latter approach does not scale, because it requires retrieving the data, which is expensive and time-consuming. Moreover, a study of 22M Web documents in 2008 [Umbrich et al., 2008] showed that the file extension is a suitable identifier for only some media types.

We propose to use classification or pattern mining techniques to discover Web documents of requested media types without analysing their content. For this, we utilise information available to a crawler during the crawling process to classify and filter URIs for pre-defined media types. Note that learning patterns without analysing the content of files can be applied in other scenarios as well, e.g., in genre classification. We present general data mining approaches suited for this task, discuss their strengths and weaknesses, and, based on first experiences, present a classifier-based solution. As this method still comes with several disadvantages, we further propose to apply frequent pattern mining approaches as an alternative. Here we focus on stream mining approaches, as they provide a fast and scalable solution that is perfectly suited for the long-running and input-intensive task of a Web crawler.

The remainder of the paper is organised as follows: In Section 2 we briefly present the basics of a crawler focusing on media types and discuss general data mining approaches suited for this task. Section 3 presents first experiences that we gained using implementations available in the WEKA toolkit. Based on these experiences, we propose a classifier approach in Section 4 and discuss an improvement based on pattern stream mining. Section 5 contains an evaluation of the classifier approach. Finally, Section 6 briefly presents related work and Section 7 concludes the paper.

2 Data Mining for Focused Crawling

The principle of a crawler is to start with a set of seed URIs and to recursively follow the discovered links. For this, crawlers follow, for instance, a breadth-first or depth-first strategy. This means that the crawler changes domains throughout the crawling job and may return to domains already visited before. A focused crawler tries to follow only links to pages and files of a specific type – in our case the media type of files.

Our basic idea is to utilise the information available to a crawler during the crawl loop to identify the media type of Web resources without inspecting the content itself. The accessible information sources for a crawler are [Umbrich et al., 2008]:

1. the URI
2. the link position of the URI in the HTML document
3. the information sent with the HTTP response header

However, in this work we focus exclusively on mining the features contained in the URI. URIs are also used for topic classification [Baykan et al., 2009].
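To make the crawl-loop setting concrete, the following is a minimal sketch (ours, not the authors' implementation) of a media-type focused crawl loop in which a URI-only predictor decides whether a resource is downloaded at all. The names predict_media_type, extract_links and fetch are illustrative placeholders; in this paper the predictor corresponds to the classification and pattern mining techniques discussed below.

```python
from collections import deque

def predict_media_type(uri):
    """Placeholder for the URI-only media-type predictor discussed in this
    paper; a real implementation would be learned from URI features."""
    return "application/pdf" if uri.lower().endswith(".pdf") else "text/html"

def extract_links(html):
    """Placeholder link extractor; a real crawler would parse the HTML."""
    return []

def focused_crawl(seeds, target_type, fetch, max_hits=1000):
    """Breadth-first crawl that only downloads URIs whose predicted media
    type is the target type or HTML (HTML pages are needed to find new links)."""
    frontier = deque(seeds)          # URIs still to visit
    seen = set(seeds)
    hits = []                        # URIs predicted to match the target type
    while frontier and len(hits) < max_hits:
        uri = frontier.popleft()
        predicted = predict_media_type(uri)
        if predicted == target_type:
            hits.append(uri)         # relevant resource found without fetching it
            continue
        if predicted != "text/html":
            continue                 # skip expensive HTTP lookups of irrelevant files
        for link in extract_links(fetch(uri)):   # follow links only on HTML pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return hits
```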
Media types are registered, together with their related RFC description, with IANA [URL, h]. The RFC description contains a general explanation of the purpose of the media type and also a recommendation for the file extension(s) to use when publishing a document. A media type consists of two parts, 1) the content-type and 2) the sub-type, separated by "/" (content-type/sub-type). To explain our approach we introduce an example URI referring to the W3C Video on the Web Workshop in 2007.

2.1 URI

Every network-retrievable document has a Uniform Resource Identifier (URI) as defined in RFC 2396 [URL, i], which specifies that a URI consists of five components:

[PROTOCOL]://[HOST]:[PORT]/[PATH][FILE]?[QUERY]

Please note that we focus exclusively on the Hypertext Transfer Protocol (HTTP) and omit other protocols, such as the File Transfer Protocol (FTP, RFC 959 [URL, ...]). Based on the file extension of its FILE component, the example URI maps to the media type video/x-msvideo; the mapped media type matches the real media type of our example. The QUERY component is a string of information to be interpreted by the resource and is omitted here. We use the following notation for the URI components in this paper:

• E for the file extension of the FILE component
• F for the filename
• T for the single tokens in the PATH component
• D for the HOST part
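As an illustration of this notation, the sketch below (our own example, not code from the paper) decomposes a URI into the D, T, F and E components using Python's standard urllib.parse; the tokenisation rules and the example URI are assumptions made for illustration.

```python
from urllib.parse import urlparse
import re

def uri_features(uri):
    """Split a URI into the D, T, F and E components used in this paper:
    host (D), path tokens (T), filename (F) and file extension (E)."""
    parsed = urlparse(uri)
    segments = [s for s in parsed.path.split("/") if s]
    filename = segments[-1] if segments else ""                   # F: last path segment
    path_tokens = re.split(r"[^A-Za-z0-9]+", "/".join(segments[:-1]))
    extension = filename.rsplit(".", 1)[1].lower() if "." in filename else ""  # E
    return {
        "D": parsed.netloc.lower(),                               # HOST part
        "T": [t.lower() for t in path_tokens if t],               # PATH tokens
        "F": filename,                                            # file name
        "E": extension,                                           # file extension
    }

# Example with a hypothetical URI:
# uri_features("http://www.example.org/2007/08/video/intro.avi")
# -> {'D': 'www.example.org', 'T': ['2007', '08', 'video'],
#     'F': 'intro.avi', 'E': 'avi'}
```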
2.2 Classification

We can map our media-type identification problem to a classification task. Based on a set of predefined media types, we want to classify a URI to its real media type. Therefore, we have to split the components of a URI into a feature set that serves as the input for the classification algorithm. Possible classification algorithms for this task include the following. The Bayesian classifier shows convincing results for classifying text; most spam detection applications use a Bayesian classifier to decide whether or not a text snippet or an email is spam [Sahami et al., 1998]. The support vector machine (SVM) algorithm has also been used for focused web crawling [Sizov et al., 2002] and achieved reasonably good results in classifying documents to topics. The C4.5 algorithm generates a pruned or unpruned decision tree [Quinlan, 1993; Quinlan, 1996]. A Bayesian network classification algorithm [Pearl, 1985] is another candidate. Notably, it has been shown that a classification algorithm based on decision trees and word tokens from anchor text can significantly improve focused crawling [Li et al., 2005]. However, we do not know how such an algorithm performs for a classification task based on a small set of features.

Some disadvantages come along with the classification approach. If we use a supervised algorithm, we need to generate a training set. Given that there exist over 200 different media types on the Web and that 80% of the documents are text/html documents, we have the problem of gathering a training set with representative candidates.
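To illustrate how such a URI classifier could look in principle, the sketch below implements a tiny multinomial Naive Bayes model over URI tokens. It is our own illustrative example (the paper's experiments rely on the WEKA implementations mentioned in Section 3); the training samples and the smoothing constant are made-up assumptions.

```python
import math
from collections import Counter, defaultdict

class UriNaiveBayes:
    """Minimal multinomial Naive Bayes over URI tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing constant
        self.class_counts = Counter()              # training URIs per media type
        self.token_counts = defaultdict(Counter)   # token frequencies per media type
        self.vocabulary = set()

    def fit(self, samples):
        """samples: iterable of (token_list, media_type) pairs."""
        for tokens, label in samples:
            self.class_counts[label] += 1
            for token in tokens:
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)       # log prior
            denom = sum(self.token_counts[label].values()) + self.alpha * vocab_size
            for token in tokens:
                num = self.token_counts[label][token] + self.alpha
                score += math.log(num / denom)             # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Illustrative usage with made-up training URIs (tokens as produced by uri_features):
training = [
    (["videos", "clip", "avi"], "video/x-msvideo"),
    (["papers", "report", "pdf"], "application/pdf"),
    (["index", "html"], "text/html"),
]
model = UriNaiveBayes()
model.fit(training)
print(model.predict(["media", "trailer", "avi"]))   # likely "video/x-msvideo"
```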