The Top 10 Alternative Search Engines (ASE) - Within Selected Categories Ranked by Webometric Indicators

Bernd Markscheffel
Bastian Eine

Within the scientific field of webometrics, many research objects have been analyzed and compared with the help of webometric indicators (e.g., the performance of a whole country, a research group or an individual). This paper presents a ranking of ASEs which is based on webometric indicators. As search engines have become an essential tool for searching for information on the web, many alternative search services have specialized in finding topic- or format-specific search results. By creating a ranking of these ASEs within selected categories, we present an overview of the ASEs which are currently available. Through webometric indicators the ASEs were compared and the most popular ASEs of the respective categories determined.

Keywords: Search engines, Webometric indicators

Bernd Markscheffel
Chair of Information and Knowledge Management
Technische Universität Ilmenau
P.O. Box 100565
98684 Ilmenau
Germany
bernd.markscheffel@tuilmenau.de

Bastian Eine
Consultant for Search Engine Optimization and Online Marketing
Bochumer Straße 37
38108 Braunschweig
Germany
[email protected]

Originally presented at the 7th International Conference on Webometrics, Informetrics and Scientometrics (WIS) and 12th COLLNET Meeting, September 20–23, 2011, Istanbul Bilgi University, Istanbul, Turkey.
Published Online First: 10 March 2012
http://www.tarupublications.com/journals/cjsim/cjsim.htm
© COLLNET JOURNAL OF SCIENTOMETRICS AND INFORMATION MANAGEMENT (Online First)

1. Introduction

While searching for information on the web, search engines can assist the user to find satisfying results. Although just a few dominate the search engine market, a large number of different web search services and tools exist (Maaß et al., [1]). Besides the well-known universal search engines like Google, Yahoo and Bing, several ASEs specialize in providing options to search for special document types, specific topics or time-sensitive information (Gelernter [2]; Lewandowski [3]). Because of the large number of ASEs and their rapidly changing range, we want to investigate the dynamic development of the ASE market. Therefore, a ranking can help to create a picture of the most popular ASEs currently available.

To compare research objects like web sites, web pages or parts of web pages by their popularity or external impact, selected webometric indicators can be used (Ingwersen [4]). Webometric indicators can be based on data related to the web page content, web link structure, web usage or web technology (Björneborn & Ingwersen [5]; Thelwall et al., [6]). We used three different webometric indicators to rank ASEs. Through the Web Impact Factor (WIF) (Ingwersen [4]) and the PageRank (Page et al., [7]) we analyzed data based on the web link structure; with the help of the Alexa Traffic Rank we analyzed data based on the web usage (Alexa Internet [8]). These three indicators were chosen because they evaluate a large amount of data and can be retrieved easily. By using three different indicators based on two types of data, an objective ranking of the ASEs can be expected.

2. Methodology

First, we predefined the categories of ASEs to analyse in our research. Then we determined a universal set of ASEs according to these categories. Subsequently, we retrieved the selected webometric indicators for the determined ASEs. Finally, we calculated a total value for each ASE based on the three indicators and created a ranking for each category.

2.1. Categorisation of ASEs

ASEs can be structured by several different approaches.
The following approaches can be used to differentiate between the multiple types of search engines:
• User behaviour (Broder [9])
• Universal & specialized search (Gelernter [2]; Lewandowski [3])
• Manual & automatic indexing (Baeza-Yates & Ribeiro-Neto [10])
• Invisible web search (Sherman & Price [11])
• Social search (Skusa & Maaß [12])
• Real time search (Lewandowski [13])
• Semantic search (Berners-Lee et al., [14]; Skusa & Maaß [12])
• Visual search (Weinhold [15]; Bekavac et al., [16])
• Personalized search (Griesbaum [17])
• Local search (Lewandowski [3])

These and further approaches are implemented in many different combinations by the ASEs. In addition, the market of ASEs and its development are highly dynamic. Hence, there is no universally valid categorisation of search engines. We selected the following categories of ASEs for our research:
• Image search engines
• Video search engines
• Audio search engines
• People search engines
• Question & answer services
• Social bookmarking services
• Blog search engines
• Twitter search engines
• News search engines
• Science search engines

2.2. Determination of the Universal Set

To obtain a complete, up-to-date and objective universal set for our ranking, we used two different methods for the determination of ASEs (a more detailed description of these methods can be found in Eine & Markscheffel [18]). On the one hand, we evaluated search engine lists established and maintained by experts. On the other hand, we specified search queries, submitted them to Google and evaluated the respective search results. At the start of our evaluation, these lists contained a total of 1695 search engines. We analyzed each of these search engines by the following criteria.
A search engine of our universal set has to
• Be available and functional
• Fit in one of the selected categories
• Use methods of its own to utilise its own or an external search index
• Be without a restriction regarding topic or country (except the restrictions given by the categories)
• Offer its service without a registration or charges for the user (except science search engines)

In addition, the following criteria for our categories were specified:
• Image and video search engines include online communities, portals and archives that offer a search function
• Audio search engines have to offer an option to download the results for free
• People search engines do not include social networks, white or yellow pages
• News search engines have to aggregate their search results from more resources than their own news articles.

Further ASEs were collected by specifying search queries and submitting them to Google. To this end, we determined keywords which the search queries should include in order to obtain a large number of ASEs in the result set. Keywords can be found on the websites of the search engines which we obtained from the search engine lists. In addition, we submitted the names of these search engines to Google to obtain each search engine's short description text, which can also be seen as a source of potential keywords. The collected keywords were assigned to our categories. After that, one search query for each category was submitted to Google (the keywords for one query were combined by the OR operator). The first 100 search results were analyzed with the help of the same criteria as the ones used for the search engine lists. Finally, the ASEs obtained by these two methods were merged and duplicates removed.
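The query construction and merging steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how duplicates were recognised, so comparing candidates by host name is an assumption, and the function names and the example.com style URLs are hypothetical.

```python
from urllib.parse import urlsplit


def build_or_query(keywords):
    """Combine the keywords of one category into a single OR query."""
    # Quote multi-word keywords so they are treated as phrases.
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return " OR ".join(terms)


def site_key(url):
    """Reduce a URL to a lowercase host name without a leading 'www.'.

    Assumption: two candidates are duplicates when their host names match.
    """
    # urlsplit only fills netloc when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def merge_candidates(*sources):
    """Merge several candidate lists, keeping the first URL seen per site."""
    merged = {}
    for source in sources:
        for url in source:
            merged.setdefault(site_key(url), url)
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical keywords and candidate lists, for illustration only.
    print(build_or_query(["image search", "pictures"]))
    from_lists = ["http://www.example.com/", "http://images.example.net"]
    from_google = ["https://example.com/search", "example.org"]
    print(merge_candidates(from_lists, from_google))
```

Keeping the first URL seen per host means that entries from the expert lists take precedence over Google results when both name the same site; a different precedence would only require changing the order of the arguments.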
The determined universal set of ASEs consists of
• 50 image search engines
• 45 video search engines
• 24 audio search engines
• 7 people search engines
• 13 question & answer services
• 21 social bookmarking services
• 17 blog search engines
• 27 Twitter search engines
• 25 news search engines
• 24 science search engines

2.3. Data Collection

To determine the most popular ASEs for each category, we retrieved the WIF, PageRank and Alexa Traffic Rank for each research object of our universal set.

The WIF is collected in a very simple form. In contrast to the WIF by Ingwersen [4], we took into account only the number of external inlinks for the ASEs. We did not include the search engines' total number of web pages because the popularity of the search function is the relevant subject of interest. We determined the number of external inlinks with the help of a universal search engine. Universal search engines provide commands or tools to retrieve the number of external inlinks to a web site or web page easily. The universal search engines thus report data gathered by their crawlers. These crawlers cannot reach all parts of the web due to the web structure (Broder et al., [9]) and the dynamics of the web. Hence, a universal search engine is not an ideal but an adequate tool for our purpose (Thelwall [19]). We used the Yahoo Site Explorer to collect the total number of external inlinks.

Through the PageRank, we analyzed the web link structure as well. The weight of a link reflects that a link from one web page to another can have a different impact depending on the reputation of the outlinking web page. A web page has a high reputation when it receives inlinks from many popular web pages or has the status of a web page with high quality content regarding a specific topic.
When calculating the PageRank, a web page's PageRank is passed on to the outlinked web pages by allocating it equally to them. The calculation of a web page's PageRank is therefore recursive: its value depends on the PageRank of the inlinking web pages and in turn influences the PageRank of the outlinked web pages. We retrieved the PageRank of the ASEs using the Google Toolbar. Google Toolbar is a browser add-on that provides different functions and information to assist the user while surfing the web.
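The recursive allocation described above can be illustrated with a small iterative computation. The following Python sketch is not how the values in this study were obtained (they were read from the Google Toolbar); it only shows the principle that each page passes its score on in equal shares to the pages it links to. The damping factor of 0.85, the fixed number of iterations, the handling of pages without outlinks and the three-page example graph are assumptions made for the illustration.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing scores.

    outlinks maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts the round with the same base score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # Pass the page's score on in equal shares to its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outlinks spreads its score
                # evenly over all pages so that no score is lost.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: a and b link to c, c links back to a.
    ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
    for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
        print(page, round(value, 4))
```

In this toy graph, page c receives links from both other pages and ends up with the highest score, while a benefits from being linked by the highly ranked c and b, which has no inlinks, keeps only the base score.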