
commentary

Accessibility of information on the web

Search engines do not index sites equally, may not index new pages for months, and no engine indexes more than about 16% of the web. As the web becomes a major communications medium, the data on it must be made more accessible.

Steve Lawrence and C. Lee Giles

The publicly indexable World-Wide Web now contains about 800 million pages, encompassing about 6 terabytes of text data on about 3 million servers. The web is increasingly being used in all aspects of society; for example, consumers use search engines to locate and buy goods, or to research many decisions (such as choosing a holiday destination, medical treatment or election vote). Scientists are increasingly using search engines to locate research of interest: some rarely use libraries, locating research articles primarily online; scientific editors use search engines to locate potential reviewers. Web users spend a lot of their time using search engines to locate material on the vast and unorganized web. About 85% of users use search engines to locate information1, and several search engines consistently rank among the top ten sites accessed on the web2.

The Internet and the web are transforming society, and the search engines are an important part of this process. Delayed indexing of scientific research might lead to the duplication of work, and the presence and ranking of online stores in search-engine listings can substantially affect economic viability (some are reportedly for sale primarily based on the fact that they are indexed by Yahoo).

We previously estimated3 that the publicly indexable web contained at least 320 million pages in December 1997 (the publicly indexable web excludes pages that are not normally considered for indexing by web search engines, such as pages with authorization requirements, pages excluded from indexing using the robots exclusion standard, and pages hidden behind search forms). We also reported that six major public search engines (AltaVista, Excite, HotBot, Infoseek, Lycos and Northern Light) collectively covered about 60% of the web. The largest coverage of a single engine was about one-third of the estimated total size of the web.

How much information?
We have now obtained and analysed a random sample of servers to investigate the amount and distribution of information on the web. During 2-28 February 1999, we chose random Internet Protocol (IP) addresses, and tested for a web server at the standard port. There are currently 256^4 (about 4.3 billion) possible IP addresses (IPv6, the next version of the IP protocol, which is under development, will increase this substantially); some of these are unavailable while some are known to be unassigned. We have tested random IP addresses (with replacement), and have estimated the total number of web servers using the fraction of tests that successfully locate a server. Many sites are temporarily unavailable because of Internet connectivity problems or web-server downtime, and to minimize this effect, we rechecked all IP addresses after a week.
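In outline, each test amounts to connecting to a random address on port 80 and checking for an HTTP response. A minimal sketch of a single probe (illustrative only; this is not the code used in the study, which also rechecked addresses after a week and filtered out non-indexable servers):

import random
import socket

# Illustrative sketch: probe one random IPv4 address for a web server on the
# standard port (80). Hypothetical helper, not the study's actual code.
def probe_random_ip(timeout=30):
    ip = ".".join(str(random.randrange(256)) for _ in range(4))
    try:
        with socket.create_connection((ip, 80), timeout=timeout) as sock:
            sock.sendall(b"HEAD / HTTP/1.0\r\nHost: " + ip.encode() + b"\r\n\r\n")
            return ip, sock.recv(128).startswith(b"HTTP/")
    except OSError:
        return ip, False

# The study made 3.6 million such probes; only a tiny sample is shown here.
hits = sum(found for _, found in (probe_random_ip() for _ in range(10)))
print(f"{hits} of 10 random addresses answered on port 80")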
Testing 3.6 million IP addresses (requests timed out after 30 seconds of inactivity), we found a web server for one in every 269 requests, leading to an estimate of 16.0 million web servers in total. For comparison, Netcraft found 4.3 million web servers in February 1999 based on testing known host names (aliases for the same site were considered as distinct hosts in the Netcraft survey (www.netcraft.com/survey/)). The estimate of 16.0 million servers is not very useful, because there are many web servers that would not normally be considered part of the publicly indexable web. These include servers with authorization requirements (including firewalls), servers that respond with a default page, those with no content (sites ‘coming soon’, for example), web-hosting companies that present their home page on many IP addresses, printers, routers, proxies, mail servers, CD-ROM servers and other hardware that provides a web interface. We built a database of regular expressions to identify most of these servers. For the results reported here, we manually classified all servers and removed servers that are not part of the publicly indexable web. Sites that serve the same content on multiple IP addresses were accounted for by considering only one address as part of the publicly indexable web. Our resulting estimate of the number of servers on the publicly indexable web as of February 1999 is 2.8 million. Note that it is possible for a server to host more than one site. All further analysis presented here uses only those servers considered part of the publicly indexable web.

To estimate the number of indexable web pages, we crawled all the pages on the first 2,500 random web servers. The mean number of pages per server was 289, leading to an estimate of the number of pages on the publicly indexable web of about 800 million. It is important to note that the distribution of pages on web servers is extremely skewed, following a universal power law4. Many sites have few pages, and a few sites have vast numbers of pages, which limits the accuracy of the estimate. The true value could be higher because of very rare sites that have millions of pages (for example, GeoCities reportedly has 34 million pages), or because some sites could not be crawled completely because of errors.

The mean size of a page was 18.7 kilobytes (kbytes; median 3.9 kbytes), or 7.3 kbytes (median 0.98 kbytes) after reducing the pages to only the textual content (removing HTML tags, comments and extra white space). This allows an estimate of the amount of data on the publicly indexable web: 15 terabytes (Tbytes) of pages, or 6 Tbytes of textual content. We also estimated 62.8 images per web server and a mean image size of 15.2 kbytes (median 5.5 kbytes), leading to an estimate of 180 million images on the publicly indexable web and a total amount of image data of about 3 Tbytes.

We manually classified the first 2,500 randomly found web servers into the classes shown in Fig. 1. About 6% of web servers have scientific/educational content (defined here as university, college and research lab servers). The web contains a diverse range of scientific material, including scientist, university and project home pages, preprints, technical reports, conference and journal papers, teaching resources and databases (for example, gene sequences, molecular structures and image libraries). Much of this material is not available in traditional databases. Research articles that do become available in traditional databases are often available earlier on the web5 (for example, scientists may post preprints on their home pages).

[Figure 1: Distribution of information on the publicly indexable web as of February 1999. About 83% of servers contain commercial content (for example, company home pages); the remaining classes shown include scientific/educational (about 6%), pornography, government, health, personal, community, religion and societies. Sites may have multiple classifications.]
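The headline figures follow from simple scaling of these sample statistics. A back-of-envelope sketch using only the numbers quoted above (the probing, crawling and manual classification that produced them are assumed):

# Back-of-envelope reproduction of the estimates quoted in the text.
# All inputs are the reported measurements; sizes use decimal kbytes/Tbytes.
TOTAL_IPV4 = 256 ** 4                 # about 4.3 billion possible addresses
hit_rate = 1 / 269                    # one web server per 269 random probes
servers_total = TOTAL_IPV4 * hit_rate         # ~16.0 million servers
servers_indexable = 2.8e6             # after removing non-indexable servers

pages_per_server = 289                # mean over the first 2,500 random servers
pages = servers_indexable * pages_per_server  # ~800 million pages

page_kb, text_kb = 18.7, 7.3          # mean page size: raw HTML and text only
images_per_server, image_kb = 62.8, 15.2

print(f"web servers (all):  {servers_total / 1e6:.1f} million")
print(f"indexable pages:    {pages / 1e6:.0f} million")        # ~800 million
print(f"raw page data:      {pages * page_kb / 1e9:.0f} Tbytes")
print(f"textual content:    {pages * text_kb / 1e9:.0f} Tbytes")
print(f"images:             {servers_indexable * images_per_server / 1e6:.0f} million")
print(f"image data:         {servers_indexable * images_per_server * image_kb / 1e9:.1f} Tbytes")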

The high value of the scientific information on the web, and the relatively small percentage of servers that contain the bulk of that information, suggest that an index of all scientific information on the web would be feasible and very valuable.

We also analysed metadata on the home pages of each server using the HTML META tag. Many different META tags are used, most of which encode details that do not identify the content of the page. We used the existence of one or more of the keywords or description tags as evidence that effort was put into describing the content of the site. We found that 34.2% of servers contain such metadata on their home page. The low usage of the simple HTML metadata standard suggests that acceptance and widespread use of more complex standards, such as XML or Dublin Core6, may be very slow (0.3% of sites contained metadata using the Dublin Core standard). We also noted great diversity in the HTML META tags found, with 123 distinct tags, suggesting a lack of standardization in usage.
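For illustration, a check of this kind can be sketched as follows (a minimal reconstruction using Python's standard HTML parser; not the code used in the study):

# Illustrative sketch: does a home page carry 'keywords' or 'description'
# META tags? Not the actual code used in the study.
from html.parser import HTMLParser

class MetaTagScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_names = set()

    def handle_starttag(self, tag, attrs):
        if tag.lower() == "meta":
            attributes = {k.lower(): (v or "").lower() for k, v in attrs}
            if "name" in attributes:
                self.meta_names.add(attributes["name"])

def has_descriptive_metadata(html_text):
    """True if the page uses the keywords or description META tags."""
    scanner = MetaTagScanner()
    scanner.feed(html_text)
    return bool(scanner.meta_names & {"keywords", "description"})

page = '<html><head><meta name="keywords" content="web, search"></head></html>'
print(has_descriptive_metadata(page))   # True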
Search-engine performance
We previously provided coverage results for six major web search engines in December 1997 (ref. 3). We provide updated results here for February 1999 using 11 major full-text search engines (in alphabetical order): AltaVista, EuroSeek, Excite, Google, HotBot, Infoseek, Lycos, Microsoft, Northern Light, Snap and Yahoo. We analysed the Yahoo database provided by Inktomi, not its hierarchical taxonomy (which is much smaller). HotBot, Microsoft, Snap and Yahoo all use Inktomi as a search provider, but the results differ between these engines (due to filtering and/or different underlying Inktomi databases). We did not include Northern Light's ‘special collection’ (documents that are not part of the publicly indexable web).

We performed our study by analysing the responses of the search engines to real-world queries, as we are interested in the relative coverage of the engines for real queries, which can be substantially different from the relative number of pages indexed by each engine. The engines differ in the specific details of indexing and retrieval; for example, there may be different maximum limits on the size of pages indexed by each engine, restrictions on which words are indexed, or limits on processing time used per query. This means that comparison has to be done on real queries; comparing the size of the engine databases (readily available from the engines) would lead to misleading results.

In our previous study3, we used 575 queries originally performed by the employees of NEC Research Institute (mostly scientists). We retrieved the entire list of documents for each query and engine, and then retrieved every individual document for analysis. Only documents containing all of the query terms are included, because the search engines can return documents that do not contain the query terms (for example, documents with morphological variants or related terms). These documents may be relevant to the query, but they prevent accurate comparison across search engines. In this study, we have increased the number of search engines analysed to 11, expanded the number of queries to 1,050, and transformed queries to the advanced syntax for AltaVista, which allows retrieval of more than 200 results.

We used automatic and manual consistency checks to ensure that responses from the search engines were being correctly processed. The relative coverage of each engine is shown in Table 1 and Fig. 2. Note that Lycos and HotBot returned only one page per site (for Infoseek we used the option not to group results by site). Northern Light has increased substantially in coverage since December 1997, and is now the engine with the greatest coverage (for our queries).

To express the coverage of the engines with respect to the size of the web, we need an absolute value for the number of pages indexed by one of the engines. We used 128 million pages for Northern Light, as reported by the engine itself at the time of our experiments. Our results show that the search engines are increasingly falling behind in their efforts to index the web (Fig. 2).

The overlap between the engines remains relatively low; combining the results of multiple engines greatly improves coverage of the web during searches. The estimated combined coverage of the engines used in the study is 335 million pages, or 42% of the estimated total number of pages. Hence, a substantial improvement in web coverage can be obtained using metasearch engines, such as MetaCrawler7, which combine the results of multiple engines.

When searching for ‘not x’, where ‘x’ is a term that does not appear in the index, AltaVista (Boolean advanced search with the ‘count matching pages’ option) and Northern Light report the total number of pages in their index. Figure 3 shows the reported number of pages indexed by these engines between 1 November 1998 and 1 April 1999.

[Figure 2: Relative coverage of the engines for the 1,050 queries used during the period 25-28 February 1999. These estimates are of relative coverage for real queries, which may differ from the number of pages actually indexed because of different indexing and retrieval techniques. a, Coverage with respect to the combined coverage of all engines; b, coverage with respect to the estimated size of the web.]

Table 1 Statistics for search-engine coverage and recency
(Coverage is shown with respect to the combined coverage of all engines and with respect to the estimated size of the web; ages are for new documents matching the tracked queries.)

Engine           Combined        95% CI    Web            Invalid      Mean age    Median age
                 coverage (%)              coverage (%)   links (%)    (days)      (days)
Northern Light   38.3            ±0.82     16.0           9.8          141         84
Snap             37.1            ±0.75     15.5           2.8          240         91
AltaVista        37.1            ±0.77     15.5           6.7          166         33
HotBot           27.1            ±0.64     11.3           2.2          192         51
Microsoft        20.3            ±0.51     8.5            2.6          194         57
Infoseek         19.2            ±0.82     8.0            5.5          148         60
Google           18.6            ±0.69     7.8            7.0          n/a         n/a
Yahoo            17.6            ±0.61     7.4            2.9          235         76
Excite           13.5            ±0.46     5.6            2.7          206         47
Lycos            5.9             ±0.30     2.5            14.0         174         174
EuroSeek         5.2             ±0.40     2.2            2.6          n/a         n/a
Average          n/a             n/a       n/a            5.3          186         57
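In simplified form, the relative-coverage accounting behind Table 1 can be sketched as follows (inputs are hypothetical, and the study's exact aggregation over the 1,050 queries may differ):

# Simplified sketch of relative-coverage accounting: for each query, keep only
# documents that contain all query terms, then compare each engine's documents
# with the union over all engines. Inputs here are hypothetical examples.
def relative_coverage(results):
    """results: dict engine -> dict query -> set of valid document URLs."""
    engines = list(results)
    queries = {q for per_query in results.values() for q in per_query}
    union = {q: set().union(*(results[e].get(q, set()) for e in engines)) for q in queries}
    total = sum(len(docs) for docs in union.values())
    return {e: sum(len(results[e].get(q, set())) for q in queries) / total
            for e in engines}

results = {
    "engine_a": {"query1": {"u1", "u2", "u3"}},
    "engine_b": {"query1": {"u2", "u4"}},
}
print(relative_coverage(results))   # engine_a: 0.75, engine_b: 0.5 of 4 documents

Because the engines' result sets overlap, these shares can sum to more than 100%, as they do in Table 1.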

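The absolute figures follow from Table 1 and Northern Light's self-reported index size; a short check of the arithmetic:

# How the absolute coverage figures follow from Table 1 (arithmetic only).
northern_light_pages = 128e6     # absolute index size reported by the engine
nl_share_of_combined = 0.383     # Table 1: coverage w.r.t. combined coverage
web_size = 800e6                 # estimated size of the publicly indexable web

combined = northern_light_pages / nl_share_of_combined
# prints ~334 million here; the text's 335 million uses the unrounded share
print(f"combined coverage: {combined / 1e6:.0f} million pages "
      f"({100 * combined / web_size:.0f}% of the estimated web)")
print(f"Northern Light:    {100 * northern_light_pages / web_size:.0f}% of the estimated web")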

[Figure 3: Number of pages indexed by AltaVista and Northern Light between 1 November 1998 and 1 April 1999 (sampled daily), as reported by the engines themselves (y axis: reported index size in millions of pages, 90-135).]

The number of reported pages indexed by AltaVista has decreased significantly at times, suggesting possible difficulty in maintaining an index of more than 100 million pages.

We next looked at statistics that indicate how up to date the search-engine databases are. Table 1 shows the results of two investigations for each search engine. The first investigation looks at the percentage of documents reported by each engine that are no longer valid (because the page has moved or no longer exists). Pages that timed out are not included. The indexing patterns of the search engines vary over time, and the engine with the most up-to-date database may not be the most comprehensive engine.

We have also looked at the age of new documents found that match given queries by tracking the results of queries using an internal search engine. Queries were repeated at the search engines daily, and new pages matching the queries were reported to the searchers. We monitored the queries being tracked, and when new documents were found, we logged the time since the documents were last modified (when available), which provides a lower bound for the time span between when documents matching a query are added or modified, and the time that they are indexed by one of the search engines (new queries are not included in the study for the first week). The mean age of a matching page when indexed by the first of the nine engines is 186 days, while the median age is 57 days (these averages are over all documents, not over search engines) (Table 1). Although our results apply only to pages matching the queries performed and not to general web pages, they do provide evidence that indexing of new or modified pages can take several months or longer.
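The age of a page is taken from the Last-Modified header returned by its server. A minimal sketch of that step (the URLs are assumed to come from the daily query tracking described above; pages without the header are simply skipped):

# Sketch: lower-bound the age of newly indexed pages from the Last-Modified
# HTTP header. Illustrative only; not the tracking system used in the study.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def document_age_days(url, timeout=30):
    """Days since the page was last modified, or None if unavailable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            last_modified = response.headers.get("Last-Modified")
    except OSError:
        return None
    if not last_modified:
        return None
    modified = parsedate_to_datetime(last_modified)
    return (datetime.now(timezone.utc) - modified).days

ages = [a for a in map(document_age_days, ["http://example.com/"]) if a is not None]
if ages:
    print(f"mean age: {sum(ages) / len(ages):.0f} days")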

Why do search engines index such a small fraction of the web? There may be a point beyond which it is not economical for them to improve their coverage or timeliness. The engines may be limited by the scalability of their indexing and retrieval technology, or by network bandwidth. Larger indexes mean greater hardware and maintenance expenses, and more processing time for at least some queries on retrieval. Therefore, larger indexes cost more to create, and slow the response time on average. There are diminishing returns to indexing all of the web, because most queries made to the search engines can be satisfied with a relatively small database. Search engines might generate more revenue by putting resources into other areas (for example, free e-mail). Also, it is not necessary for a page to be indexed for it to be accessible: a suitable query might locate another page that links to the desired page.

Not equal access
One of the great promises of the web is that of equalizing the accessibility of information. Do the search engines provide equal access? They typically have two main methods for locating pages: following links to find new pages; and user registration. Because the engines follow links to find new pages, they are more likely to find and index pages that have more links to them. The engines may also use various techniques for choosing which pages to index. Therefore, the engines typically index a biased sample of the web.

To analyse bias in indexing, we measured the probability of a site being indexed as a function of the number of links to the site. We used AltaVista to obtain an estimate of the number of links to the home pages of the 2,500 random web servers found earlier. We then tested the existence of the sites in all the other engines that support URL search. Figure 4 shows the probability of each engine indexing a random site versus the number of links to the site. A very strong trend can be seen across all engines, where sites with few links to them have a low probability of being indexed and sites with many links to them have a high probability of being indexed. Considering the relative coverage results, Infoseek has a higher probability of indexing random sites than might be expected. This indicates that Infoseek may index the web more broadly (that is, index more sites instead of more pages per site).

[Figure 4: Probability of each engine indexing a random site versus the ‘popularity’ or connectedness of the site (number of links to the site reported by AltaVista, binned as 0, 1, 2-10, 11-20 and 21-100). Sites with few links to them have a low probability of being indexed. Engines shown: Northern Light, Snap, HotBot, Microsoft, Infoseek, Yahoo and Lycos.]
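A sketch of the binning behind Fig. 4 (inputs are hypothetical: for each random site, the number of inbound links reported for its home page and whether a given engine returned the site for a URL search):

# Sketch of the analysis behind Fig. 4: bin random sites by the number of
# inbound links to their home page, then compute the fraction of each bin
# that a given engine indexes. Inputs are hypothetical examples.
from collections import defaultdict

BINS = [(0, 0), (1, 1), (2, 10), (11, 20), (21, 100)]   # link-count bins used in Fig. 4

def bin_label(links):
    for low, high in BINS:
        if low <= links <= high:
            return str(low) if low == high else f"{low}-{high}"
    return ">100"

def indexing_probability(sites):
    """sites: iterable of (links_to_home_page, indexed_by_engine) pairs."""
    counts = defaultdict(lambda: [0, 0])        # bin label -> [indexed, total]
    for links, indexed in sites:
        bucket = counts[bin_label(links)]
        bucket[0] += int(indexed)
        bucket[1] += 1
    return {label: indexed / total for label, (indexed, total) in counts.items()}

sample = [(0, False), (0, True), (3, True), (15, True), (40, True), (1, False)]
print(indexing_probability(sample))   # e.g. {'0': 0.5, '2-10': 1.0, ...}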

Not only are the engines indexing a biased sample of the web, but new search techniques are further biasing the accessibility of information on the web. For example, the search engines DirectHit and Google make extensive use of measures of ‘popularity’ to rank the relevance of pages (Google uses the number and source of incoming links, while DirectHit uses the number of times links have been selected in previous searches). The increasing use of features other than those directly describing the content of pages further biases the accessibility of information. For ranking based on popularity, we can see a trend where popular pages become more popular, while new, unlinked pages have an increasingly difficult time becoming visible in search-engine listings. This may delay or even prevent the widespread visibility of new high-quality information.
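To illustrate the feedback involved, a generic sketch of link-based popularity scoring (a PageRank-style iteration; illustrative only, not Google's or DirectHit's actual algorithm): pages gain score from the number and score of the pages linking to them, so pages with no inbound links stay near the floor.

# Minimal sketch of link-based 'popularity' scoring; illustrative only.
def popularity_scores(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
        score = new
    return score

web = {"a": ["b"], "b": ["c"], "c": ["b"], "new_page": ["b"]}   # nothing links to 'a' or 'new_page'
ranking = sorted(popularity_scores(web).items(), key=lambda item: -item[1])
print(ranking)   # 'b' and 'c' dominate; 'a' and 'new_page' stay at the baseline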
Steve Lawrence and C. Lee Giles are at NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540, USA.
e-mail: [email protected]

1. Graphic, Visualization, and Usability Center. GVU's 10th WWW User Survey (October 1998); http://www.gvu.gatech.edu/user_surveys/survey-1998-10/
2. Media Metrix (January 1999); http://www.mediametrix.com/PressRoom/Press_Releases/02_22_99
3. Lawrence, S. & Giles, C. L. Science 280, 98-100 (1998).
4. Huberman, B. A. & Adamic, L. A. Evolutionary Dynamics of the World-Wide Web (1999); http://www.parc.xerox.com/istl/groups/iea/www/growth.html
5. Lawrence, S., Giles, C. L. & Bollacker, K. IEEE Computer 32(6) (1999).
6. Dublin Core. The Dublin Core: A Simple Content Description Model for Electronic Resources (1999); http://purl.oclc.org/dc/
7. Selberg, E. & Etzioni, O. IEEE Expert, Jan.-Feb., 11-14 (1997).
