Only a few years ago, the phrase "Web search" did not exist. Then the term began to move rapidly into the awareness of information professionals, about as fast as a Japanese bullet train. Today much, though not all, of the work we do revolves in one way or another around the Web.

With so much to keep on top of, precious time becomes even more precious. A couple of years ago I wrote an article trying to figure out a way to make the day 26 or 27 hours long. Unfortunately, that idea never reached the implementation stage, though it remains an idea worth considering. Even within the narrow bounds of 24/7/365, we must all still try to keep up to date about what is happening with Web search engines. The fact that they seem to change on a weekly, if not daily, basis is no excuse.

We as professionals do not use every search engine or Web directory daily; nevertheless, we have to know how each works and what data each does and does not contain. I fully understand that this is easier said than done, but today information access is a topic that everyone is aware of and talking about. Pick up any newspaper. Turn on the television. Every day more and more articles and reports discuss searching the Web. Many of these articles and reports are written for and by non-information professionals. We have to stay ahead of our clients and patrons if we hope to help them. Excite or AllTheWeb may not be your search engines of choice, but I bet they are for someone you know. Our colleagues, co-workers, and friends come to us as the "search experts," and we must do our best to help. Our knowledge and understanding in this area are great ways to make our profession look good and to make our already valuable jobs even more valuable.

With this said, the following reviews the latest goings-on in the search world and tries to provide some suggestions and tools to make you more knowledgeable and save you some time.

Price's Priceless Tips

The Web search world changes on what sometimes seems like an hourly basis. What follows are a few selected tips and resources for some of the most well-known engines. This is just the tip of the iceberg. Resources like Search Engine Showdown and Search Engine Watch are essential for learning and keeping up with how these tools work and change over time.

Ten Things to Know About Google

1. The database that Google licenses to Yahoo! [http://google.yahoo.com] is not the same size: it's smaller than the Google.com database. It does not contain links to cached versions of pages. This database is also used to supply "fall-through" content (material not in Yahoo!'s own database). It is often found listed as "Web page" content.

2. Google utilizes the Open Directory Project database as its Web Directory [http://directory.google.com].

3. You can search stop words by placing a + in front of the word (e.g., "+To +Be +Or Not +To +Be").

4. At the present time the Google database is refreshed about once every month.

5. You can limit your search to only .pdf files by using the syntax filetype:pdf.

6. Google is the only major search engine to crawl Adobe Acrobat .pdf files.

7. If you are a frequent Google searcher, save time by using the Google Toolbar [http://toolbar.google.com] and Google Buttons [http://www.google.com/options/buttons.html].

8. A Boolean "OR" is available with Google. For it to function, capitalize the OR.

9. Google only crawls and makes searchable the first 110K of a page. Long documents may have substantial content invisible to Google.

10. Entering a U.S. street address into the query box will return a link to a map of that address location. Typing in a person or business name, city, and state will also run the query against the Google phone directory. Several other combinations will also query the phone directory service, including typing in the area code and number to run a reverse search [http://www.google.com/help/features.html#wp].
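Tips 3, 5, and 8 above are all plain query-string syntax, so they are easy to try directly. The short sketch below simply composes search URLs that carry that syntax; the base URL and "q" parameter mirror Google's public search form of the period and are my assumptions for illustration, not anything documented in this article.

    # Illustrative sketch only: compose Google query URLs using the syntax
    # from tips 3, 5, and 8 above (+stop words, filetype:pdf, capitalized OR).
    # The base URL and "q" parameter are assumptions based on the public
    # search form, not a documented API.
    from urllib.parse import urlencode

    BASE = "http://www.google.com/search"

    def google_url(query):
        """Return a search URL for a raw query string."""
        return BASE + "?" + urlencode({"q": query})

    # Tip 3: force stop words into the search by prefixing them with "+".
    print(google_url('"+to +be +or not +to +be"'))

    # Tip 5: limit results to Adobe Acrobat files with filetype:pdf.
    print(google_url("annual report filetype:pdf"))

    # Tip 8: Boolean OR must be capitalized to work.
    print(google_url('librarian OR "information professional"'))

Pasting the printed URLs into a browser is equivalent to typing the raw queries into the search box.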
Scope Notes

Before we begin, we need to get a definition straight — a definition that I think many of us have thought about. What does "Web search" mean to the information professional? In the early days of the Web, it meant exactly how it sounds — material found on the open Web.

However, as we move forward, the term "Web search" has taken on new meanings. Does a Web search involve tools like Google or AltaVista to reach "open access" material? Does it mean using the Web as a vehicle to log on to proprietary databases such as Factiva or Dialog? Not too long ago, logging into proprietary services required individual connections to each one. Today, any Web browser with an Internet connection can reach those services. Perhaps it means both. This lack of common understanding can confuse some, and trying to solve the issue is outside the scope of this article. This article will primarily focus on "traditional" Web search, i.e., search engines that assist in locating open Web content. The approach I have taken is to try to answer the questions I seem to get, in one form or another, at every conference, every workshop, and in every day's stack of e-mail messages.

The Never-Ending Amount to Learn, No Sign of Slowing Down

The single most difficult issue for the Web searcher to face is the sheer volume and speed of change on both the Web and the search engines that try to cope with it. The sense of doom most searchers feel in struggling to keep pace occurs not because of any lack of intelligence, nor any lack of interest in the subject — far from it. Most often the cause is the reality of having only 24 hours in a day and the fact that life exists away from the computer.

I monitor what's going on in the Web search world on a daily basis, and it's almost routine for something new to arrive or for something established to change each day. For example, at the time of writing this article, AllTheWeb had just undergone major changes, Google had released an image search tool, and WISEnut, a new general search tool, had come on the scene. When you couple the dynamic nature of Web searching (both individual pages and entire resources coming and going) with the need to stay up to date with traditional electronic tools (which undergo plenty of changes as well), print resources, and other issues of the day (can you say "copyright" or spell "Tasini"?), there is so much to do and so little time to do it.

A lack of knowledge and understanding about how a particular search tool works, e.g., a new way to narrow your search, or ignorance of a more useful tool, e.g., a new search engine going online, can waste time and produce poor results.

What Should the Searcher Do?

I realize this is easier said than done, but Web searchers MUST devote at least 1-2 hours a week to staying current. This informal "continuing education" is crucial. Often, the knowledge you gain from these sessions will pay off handsomely with time saved and better query results in the future. The best way to learn how a search engine works is by using it.
Conducting preemptive research on a favorite topic makes it easy to spot differences, both in content and in the way results are presented, and at the same time lets you gather new resources for your own bookmarks or intranet sites. For a list of suggested sources to keep you current, see the "Essential Reading" sidebar.

Is an "Open Web" Search Engine Always the Place to Begin? What Type of Information Can I Count on Finding There?

Lately, I have spent a great deal of time thinking about this issue. As someone who often gives presentations about Web searching, I have tried to provide session attendees with lists of what you can and can't find "on the Web" using a general-purpose Web search tool. Even in the most general sense, my attempts must fail. A few minutes after beginning, I inevitably realize that one can't boil down a dynamic universe of data like the Web into bullet points. Knowing, or better, understanding where to start in this world of information resources is perhaps the most important knowledge to have and share. There is no simple way of doing this. It takes time and commitment. I start learning about new resources by asking the most basic questions: What is this database or search engine? What kinds of questions would it help me answer?

Often the open Web may not be the place to begin. While it's nice to get quality material free, how long did it take to get it? Would standing up and walking to a bookshelf produce a useful answer in a much shorter period of time? Would a commercial full-text search service scan the decade-long archives of 50 or 100 newspapers in a matter of minutes? At issue are the time and money it takes to reach your answer.

Even if you choose the open Web as your target, would a specialized or targeted search engine find your answer more easily than one of the all-encompassing engines? Regardless, understanding how each search engine works and the many ways an engine allows you to limit and control searches will make general-purpose engines more productive and waste less of your time. We need to do this "learning" much the same way we have always "learned" traditional databases and print resources. Think about how much focus information vendors like Factiva and Dialog place on training. Unfortunately, Web engine companies do not offer this kind of training, but the learning process remains crucial. For me, the best part about being an information professional is the knowledge of where to find an answer. This is knowledge that non-professionals desire, and it makes our already important jobs even more valuable, especially with so many new databases and new online resources becoming available.

What Should the Searcher Do?

Consider the open Web more of a directory to answers and less of an all-knowing answer machine. Sometimes this directory WILL become an authoritative reference book and provide you with a timely and authoritative answer. Other times it will assist by providing you with background knowledge that can make using a fee-based service or a print collection more productive. Don't forget — shifting from one format to another can be a two-way street. What you learn from a print or commercial online source can produce an effective search strategy for the open Web. A Web search engine may also provide you with specific names of people to contact. Remember, the telephone and e-mail will always be very important reference resources.
The Quality of Information: The Biggest Challenge to Web Searching

For this Web searcher, information quality constitutes the greatest challenge faced as both a searcher and a teacher. We live in an age when anyone can become a publisher. All they need is a Web connection, server space, and something to say and/or share. Once the content goes onto a server and once a crawler finds it, the Web search engines will make it available to everyone. Within minutes or days, anyone with Web access can find that information. Amazing! And frightening!

Once they have found it, the major challenge for searchers is evaluating content. They must judge its quality, often very quickly, using the criteria that information professionals have always used to evaluate information. How does one do this? Well, this is the topic of other articles, books, and dissertations. The most important point is to take a step back, if only for a second, to ask yourself where this information is coming from and why it is being placed online. Since anyone can become a publisher with the Web as a publishing medium, the reputation and background of the site creator, their qualifications, etc., are crucial. I would strongly recommend taking a look at the resources our colleague Genie Tyburski makes available on her site for judging quality [http://www.virtualchase.com/quality/index.html].

Evaluating information quality, something that our profession has always done, offers another inroad for sharing our skills with the public. Many who search the Web take whatever they find to be accurate, current, and worthwhile. As information professionals, we must protect them, often from themselves.

One more thing. In my opinion, the challenges that information quality poses for the Web searcher prove how important it is for our profession to include Web resources as part of our collection development. We must try to make the Web a more effective tool for researchers. The Web is a living organism and, unlike an annual reference book, can change at a moment's notice. In an already busy workday, finding time to search out Web resources in an organized manner can be difficult. But all of us need to have an idea of what is available and where to turn before we actually need the resource to answer a query. Just knowing that a top-level site exists that may contain the answer will not suffice. We learn our print collections; let's learn our Web collections and bookmarks.

Easier said than done? Of course. Still, it remains a goal we should strive to attain.

The Domination of Google

Everyone, including me, loves Google. How could you not like it? In most cases, it delivers highly relevant results (though relevant does not always mean authoritative) in a short amount of time. When you add in features like the Google cache (a powerful way to find pages that might have just gone AWOL), you have a search engine that works and works well.

Google is simple to use at a basic search level, but still returns good results. This is why non-professional searchers love it so much. The clean, single-box home page is simple for unsophisticated searchers to understand. It doesn't even allow you to directly use all three Boolean operators, yet it works! Wow! More advanced searchers will be interested to know that Google uses AND as a default between search terms, permits the use of OR (it must be in all caps), and can remove a word or phrase if you use a minus (-) sign.
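Because the default AND, the capitalized OR, and the minus sign describe how the result set is built, a toy model can make the difference concrete. The sketch below filters a tiny invented document set under each behavior; it is purely illustrative of the semantics described above and is not how Google itself is implemented.

    # A toy model of the result-set semantics described above: terms are ANDed
    # by default, a capitalized OR creates a union, and a leading minus excludes
    # a term. The documents and matching logic are invented purely to illustrate
    # the behavior.
    docs = {
        "page1": "library web search tutorial",
        "page2": "web search engine news",
        "page3": "library catalog online",
    }

    def words(text):
        return set(text.split())

    def default_and(*terms):           # query: web search
        return [d for d, t in docs.items() if all(w in words(t) for w in terms)]

    def boolean_or(*terms):            # query: library OR news
        return [d for d, t in docs.items() if any(w in words(t) for w in terms)]

    def exclude(term, minus):          # query: web -news
        return [d for d, t in docs.items()
                if term in words(t) and minus not in words(t)]

    print(default_and("web", "search"))     # ['page1', 'page2']
    print(boolean_or("library", "news"))    # ['page1', 'page2', 'page3']
    print(exclude("web", "news"))           # ['page1']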
What I like most about Google is its quest to improve on what it already has. Google always seems to be introducing something new and innovative. In February 2001, it started crawling portable document format (.pdf) material. The general public may not put a high demand on some of this content, but PDF documents offer information professionals masses of authoritative content from respected sources. At the time of writing, Google was still the only general search engine to make PDF files searchable on a large scale.

What Should the Searcher Do?

The advanced searcher must get to know and make use of Google at a more than "put the words in the box" level. It's very easy. Begin by looking at the Google Advanced Search page [http://www.google.com/advanced_search.html], and at the same time learn the syntax that will allow you to apply the same limits directly without having to use this page. To learn more about Google, especially how it compares to other search engines, go to Greg Notess's Search Engine Showdown site [http://www.searchengineshowdown.com].

Here's hoping that Google continues to improve and add new, useful features. Here's also hoping that Google continues to properly separate advertising content from result sets. Yet with all of Google's wonderful abilities, good searchers know that they must never make any single Web search engine the only tool used. No single engine makes "everything" searchable.

Understanding the Limitations of General Web Search Tools

No single Web search tool is the end-all/be-all. In fact, most have limitations that need careful consideration if you plan to use them regularly or teach others to use them. What do I mean by limitations? Here are just a few of many possible examples:

• Search spiders or crawlers (the software that brings material back to a database so you can search it) do not crawl the Web in real time. A page made available on the Web on Thursday could wait weeks before a crawler reaches it. The major search services are improving turnaround on recrawling and adding pages, but in general, expect to wait many days before a keyword search will return a recent page.

• If a site or page is not linked to or submitted by someone (Webmaster, page author, etc.), it will not be accessible from a search engine. Engines primarily use these two methods of finding out about new sites and pages.

• Simply because one, 1,000, or even more pages from a site are available does not mean that the engine makes every page of the entire site searchable.

What Should the Searcher Do?

Understand from the outset that these limitations exist and can affect your search results. Rely on more than one search engine. Make use of specialty search tools that often go "deeper" into a site to collect more content. Take advantage of "Invisible Web" resources. Use Web directories like the Librarians' Index to the Internet to "mine" specific sites. When you find something of value, bookmark it.

Using Invisible/Hidden Web Resources

Over the last couple of years, the phrase "the Invisible Web" has come into use; others call it the hidden or deep Web. For the most part, all the terms are synonymous. Searchers need to know about the material in this section of the open Web. In many cases the material comes from well-known, authoritative sources and is available at low or no cost, but it is not accessible using a Web search engine.
Resources you interact with, sites where you fill in a set of variables and then have a "custom" page returned to you, are examples of Invisible Web pages. So is a site that contains data you can use for free, but only after you register. Why don't the search engines access this material? The search spider software seeking out material to bring back to the database finds nothing to retrieve in these examples. In the case of the custom page, the material is not accessible until the user calls for it and the system creates the page on the fly. In the other example, search spiders from general-purpose Web search engines do not fill out registration forms. So once the spider hits a page that requires registration, it stops and moves on. None of the material behind that registration interface is searchable from general engines. One other factor can block search engine access — the "no-robot" tag. Webmasters can indicate that they don't want to be spidered, and most of the good, responsible crawlers will respect that request, whether for all or any portion of the content on a Web site. Sometimes Webmasters — perhaps concerned about possible excessive usage — may block the spiders without fully considering how this decision can eliminate a substantial audience for the material they have taken the time, trouble, and expense to load.

Prime examples of Invisible Web databases include American FactFinder from the U.S. Census, most Web-accessible library catalogs, and many of the databases available via GPO Access.

What Should the Searcher Do?

Know what is available before you need it. Of course, this takes time and practice. We do much the same when becoming aware of the databases from LexisNexis or Dialog. What makes this an even larger challenge is that there are thousands of these databases available and, unlike Dialog, no common search syntax. Use compilations of Invisible Web databases such as the one Chris Sherman and I have created to support our book [http://www.invisible-web.net]. Conduct Invisible Web collection development. Develop and learn your own collection. Using the "open Web" to attempt to find something with the boss breathing down your back is both difficult and inefficient.

One Further Thought

A great deal of research and time is devoted to making the information inside these Invisible Web databases more easily accessible from general-purpose Web search tools and other resources. The challenge is that many of these Invisible Web databases offer "custom" interfaces and database tools specifically to enable interaction with the data. Although the ability to crawl all of this data is coming and, in some cases, available now, without the proper limiting tools to harness this information we could face even worse problems. We might make already massive, uncontrolled databases the size of Google's, Excite's, or AltaVista's even larger, without the proper mechanisms to get the data out in a precise manner. In librarian speak, this translates into increasing recall and lowering precision.

Specialized, Focused, and Site-Specific Search Tools: Important and Necessary

I often get a bit unsettled when people and companies refer to the Invisible Web. What many understand as the Invisible Web encompasses content actually visible to general-purpose engines like Google and AltaVista.
What many label as Invisible, deep, or hidden Web content is actually basic HTML material, easy for the general search engines to index and make accessible. Many of the databases often reported as Invisible Web are simply beyond the reach of general Web search engine policies and procedures. More aggressive, focused, or targeted Web crawlers may go where the general search engines have balked. For example, specialized search engines were the first to start handling .pdf formatted files.

To penetrate these resources, users should learn to turn to specialized or focused search engines, important and effective tools for getting to the best answer possible on the open Web. Well-known specialized Web search engines include Psychcrawler, PoliticalInformation.Com, and Inomics.Com, each of which focuses on a specific subject (psychology, political science, and economics, respectively). Site-specific engines are the search engines that many sites make available to cover their own material.

The general search tools can, and often do, crawl material that you can also find using a specialized, focused, or site-specific search engine. However, in some cases the general search engines may not cover this material as well as the specialized ones. For example, they may not crawl the key sites in a timely manner or at a deep enough level. Bottom line: coverage of this material by general search engines like Excite or AllTheWeb may be spottier than that of the specialized search tools.

Here are just a few of the reasons why this problem occurs:

• Time Lag. Unless paid for, spiders visit pages unannounced. Material changed or added since the spider last crawled the content — as much as a month, a quarter, or longer — remains, for all practical purposes, invisible. News material is a good illustration. A new page from the CNN site is technically crawlable by any general-purpose engine, but for some period after it appears it will not be searchable there.

• Depth of Crawl. Simply because a search engine makes one, 10, or 100,000 pages of a site accessible does not mean that it has crawled the entire site. Some engines only take a certain amount of material and then move on.

• Each Search Engine Database Is Unique. As the work of Greg Notess makes clear, each search engine database differs. What Google knows about, Excite may not have in its database. What AltaVista can find, AllTheWeb/Fast may not make accessible.

• Dead-End Pages. If a basic HTML page sits on your server and is not linked from any other page that a search tool already knows about, and you don't submit it, then it will most likely not be discovered and crawled. A site-specific engine can crawl every page sitting on an entire server and make each page searchable.

Why would you want to use one of these search engines? Several reasons. Smaller, more targeted databases make for greater precision, though lower recall. Think about a world with only one massive Dialog database. Just as you select the correct database for the specific task, it works the same way with specialized search engines.

Additionally, these resources often offer human interaction, with a knowledgeable editor telling the crawler where to go, how often to return, and how deep to crawl. I think this job of human database editor will become more and more important in the future. What a great new career for information professionals!
Finally, some of these specialized engines, the BBC News engine for example [http://newssearch.bbc.co.uk/ksenglish/query.htm], provide extra functionality, such as constant, even daily, updating and limiting options for search strategies.

What Should the Searcher Do?

Check out and use the good sources that identify and collect specialized and focused databases. I like Profusion [http://www.profusion.com], labeled there as "Invisible Web," and the always reliable and always wonderful Librarians' Index to the Internet [http://www.lii.org], which covers a large number of specialized and Invisible Web databases. Once you have found good tools in your areas of interest, use them and learn their features in depth.

Using Search Tools on Specific Sites and Possible Intranet Solutions

This is a simple idea that I think searchers often overlook. We all know that information professionals should take full advantage of the special searching features, such as limiting, and other resources Web search tools offer. However, the fact that many general-purpose engines (AltaVista, Google, Ultraseek/Inktomi) are also licensed and available to search specific sites often goes unnoticed and unused. It shouldn't.

The power searcher should identify when a specific "site-search" tool is actually the same software as that of a general-purpose engine. Then we should make use of the syntax, limiting functions, etc., still available just as if the engine were being used to search the entire Web.

Here are a few examples to illustrate my point: The Google engine and the syntax it offers are used by many sites, including FindLaw LawCrawler [http://www.lawcrawler.com], the Energy Information Administration [http://www.eia.doe.gov/], and IDG.net [http://www.idgnet.com]. Lycos provides the search technology available at USAToday.Com [http://www.usatoday.com]. AltaVista services Macworld.Com [http://www.macworld.com] and Western Michigan University [http://www.wmich.edu]. UltraSeek technology (now part of Inktomi) is used by CNN [http://www.cnn.com] and the University of Toronto [http://www.utoronto.ca].

Simply placing an interface to a well-known proprietary search product on the end user's desktop will not get them searching well. With so much attention placed on the power of search tools like Google, AltaVista, and HotBot, these products have become synonymous with searching for the general public. Perhaps the time has come for proprietary information vendors to begin adapting these widely known search tools for use in their own products. This could allow search trainers not only to share the intricacies of becoming a more effective Web searcher, but also to show how the same techniques apply to in-house proprietary databases. The lack of standards is a major issue that needs addressing.

The fact that many Web search tools are also available for licensing as intranet or extranet engines makes a great deal of sense. Greater standardization of search tools can reduce the confusion and frustration felt by end users — not to mention their trainers.

What Should the Searcher Do?

Learn more about the various search engines and their use as possible intranet search solutions. Start by visiting Avi Rappaport's very useful site [http://www.searchtools.com]. Not only will this resource teach you about the hundreds of different search tools available, the knowledge it offers will also make you a better searcher.
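One practical way to act on this advice, without waiting for anyone to license anything, is to apply a general engine's own site-limiting syntax yourself, so the engine you already know behaves like a site-search tool. The sketch below composes such a query; operator spellings vary by engine (site: and host: are the commonly documented forms for Google and AltaVista, respectively), so treat the exact strings as assumptions to verify against each engine's help pages.

    # Hedged sketch of the point above: when a site's own search box is powered
    # by the same engine you already know, the familiar limiting syntax still
    # applies, and you can also stay on the general engine and add its
    # site/host restriction yourself. Operator names are assumptions to verify.
    from urllib.parse import urlencode

    def site_limited_query(terms, site, operator="site"):
        """Compose a query that restricts results to a single site."""
        return f"{terms} {operator}:{site}"

    # Search only CNN's pages for a topic through a general-purpose engine.
    q = site_limited_query("election coverage", "www.cnn.com")
    print("http://www.google.com/search?" + urlencode({"q": q}))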
More Content Coming: The Ability to Search Audio and Video Material

When it comes to non-text formats, we already have many tools, and shortly will have even more, to ensure that we can provide our users with the best possible answer. The ability to search video (e.g., newscasts) and audio (e.g., radio programs) continues to expand. Material that we would have had to wait weeks for in the past, assuming it ever became available, is now searchable shortly after the words are spoken. This material can serve many types of users, including those in international relations and competitive intelligence. Of course, archives of this material are also available. In many cases these keyword databases are created either by using voice-recognition technology or by capturing the text from the closed captions associated with the broadcast.

Work also continues on search tools that provide access to video and audio material using a non-text mechanism. For example, you could search for a specific color or type of background. An article in Technology Review provides a good orientation to the topic [http://www.techreview.com/magazine/jul01/upstream.asp]. Much of this research will also be applicable to still-image search tools. Currently, such tools, including those from Google, Fast, and AltaVista, use the text surrounding the image, i.e., image captions, and additional factors to determine what a still image is about.

What Should the Searcher Do?

Become aware of and familiar with some of the major players in this space.

Virage [http://www.virage.com] is a leader in the video search arena. In fact, you can keyword search many of the reports from The NewsHour with Jim Lehrer using Virage technology at [http://www.pbs.org/newshour/video/index.html]. Other companies of interest include TVEyes [http://www.tveyes.com], ShadowTV [http://www.shadowtv.com], and WordWave [http://www.wordwave.com]. Finally, take a test drive of SpeechBot [http://www.speechbot.com], a keyword search engine demo from Compaq that uses speech-recognition technology to create a real-time transcript.

As for image searches, try these two resources. Webseek allows you to search or browse using criteria in the image itself [http://www.ctr.columbia.edu/webseek/]. Visoo uses software that looks for words embedded "inside the image" [http://www.visoo.com].

The Commercialization of Search Results

This issue has received a great deal of well-deserved attention lately. It seems to me that the searcher/researcher and the many other interested groups (the engines themselves, the search optimization community, the advertising community) have different ideas about what the bottom line is when it comes to Web searches. Don't misunderstand me — the engines are profit-making businesses, or try to be, so making money is goal number one. I understand this fact. However, those of us who use the "open Web" as a research tool want timely and authoritative answers without advertising or undue influence getting in the way of the best possible answer available.

Can the wants and needs of the two groups co-exist? Absolutely, but it will take knowledge and continuing education for both information professionals and end users to continue to use general-purpose Web search tools as effective resources. The bottom line here is knowledge of the issues for all parties. Using the Web effectively without general-purpose search engines would be difficult, time-consuming, and in many cases impossible.
This is particularly true for the professional researcher.

Pay-per-placement (pay-per-click) programs allow a person or company to buy a keyword or keywords and have their results appear at the top of the results list when that word or words are searched. GoTo.Com is just one of many examples of this type of search engine. The extra challenge with GoTo and others is that in addition to offering searching at GoTo.Com, they also sell their database to other engines to brand as their own. For example, GoTo.Com "powers" NBCi and Go.Com (formerly Infoseek). So, if users tell you that NBCi is their engine of choice, they are actually searching GoTo.Com material. Various "flavors" of this type of branding exist in the Web search world. To get an idea of how many of these engines are online, check http://www.payperclicksearchengines.com.

Paid-inclusion programs, available from many of the leading engines, allow a person or company to pay a fee to make sure that their site is crawled and included in that particular database. Additionally, the fee ensures that the site is recrawled on a regular basis, sometimes every week or so. This can mean that searchers may assume a currency of results, based on retrieval from the paid-inclusion sites, that does not hold for non-paying sites. Search optimization consultants reverse-engineer search engines and relevancy-ranking algorithms and then use this knowledge to move a client's Web pages higher in a search result list.

Danny Sullivan, the editor of Search Engine Watch [http://www.searchenginewatch.com], covers this and most other parts of the search world on a regular basis and at great depth. Also, to learn more about search engine optimization, take a look at Rank Write Roundtable [http://www.rankwrite.com]. By the way, keeping current with the search engine optimization discussion can often provide searchers with deep background about how the engines work. Again, this makes for a better searcher.

What Should the Searcher Do?

Understand the differences among search engines, become familiar with the terminology, and share this knowledge with others.

In the case of more "traditional" engines, be aware of how commercial material is labeled and where it is placed. For example, AltaVista offers "partner listings" at the top and bottom of a results list. Excite uses the term "sponsored link." HotBot places "products and services" at the top of the results list.

At the time of writing, Google does not offer a paid-inclusion program. However, Google will allow the purchase of keyword(s) and a link to a corresponding URL, which appears away from the ranked results list, labeled as a sponsored link inside a colored box.

Meta-Search Tools: Problems and Challenges

I have never been a fan of meta-search engines. These tools simultaneously send your search request to many engines. Why don't I like them? Several reasons. One, meta-search engines often do not allow you to use the underlying engines in more than a basic mode, leading to high recall but very poor precision. Equally important, especially in the last couple of years, is the fact that many of the most well-known meta-engines send a query to several entirely "pay for placement" engines. A May 2001 Danny Sullivan report [http://searchenginewatch.com/sereport/01/05-metasearch.html] provides a clear view of this issue.
For example, the popular Dogpile meta-search engine sends a query to 15 engines, six of them entirely pay for placement. I think most researchers using the Web would be disappointed by the results they receive and the time they have wasted.

What Should the Searcher Do?

First, inform other searchers, especially end users who think they are "getting it all" by using a meta-search engine. Information professionals should take advantage of the "power" or "advanced" mode most general engines offer, such as limiting to a specific domain or to a word in the URL.

One More Thing

In the spirit of something for everyone, phone the neighbors and wake the children: I will mention one meta-engine that I do like and use. Hello, Vivisimo! [http://www.vivisimo.com]. So why do I like it? A few reasons.

• It does not send your query to any 100 percent pay-for-placement engines.
• It does a reasonable job of allowing you to use some advanced syntax.
• The "advanced interface" allows for several customization features.
• It has some duplicate-removal capabilities.
• Vivisimo effectively clusters results into hierarchical sets of categories on the fly.
• Users have the option of previewing a page directly from a result list.
• Vivisimo searches several news databases and other search sites (e.g., Medline, USPTO, FirstGov.Gov) and still takes advantage of its clustering process. This can be particularly useful for basic searchers who only enter a few keywords and do not search with limits. Using Vivisimo, they can at least take advantage of the categories, hopefully helping them reach the answer they want quickly.

Where Have All the Pages Gone?

Searching for older material is a challenge, often an impossible one. The issue is as old as Web searching and occurs not only in the Web search world but in many other areas of digital data. Currently, when most Web pages are removed from a site, they are gone for good unless you can personally contact a Webmaster who can send you a copy. Luckily, many people are thinking about and working on this problem. One example is the work done by OCLC and RLG (Research Libraries Group) to develop standards and methods for archiving older material. The National Archives and other government agencies are doing similar work. NARA's Clinton Presidential Materials Archive [http://www.clinton.nara.gov/index.html] is an early effort to store Web resources from a presidential administration.

Alexa Research [http://www.alexa.com] offers one of the earliest and most unique archiving efforts, the Alexa Archive of the Web. Brewster Kahle's project makes snapshots of the Web, archiving everything in sight. Alexa Research carries over 18 terabytes of data covering some 5 million Web sites and some 1.9 billion pages. If the site has preserved an archived copy of a page, the link appears in blue and you can click to view it. If the site records a page but has no archive for it, the link appears greyed out with the tag "Page not in Archive." One subset of the Alexa archiving covers some 87 million pages of material from the Election 2000 presidential campaign [http://archive.alexa.com/].

What Should the Searcher Do?

Long term? Become aware of the research and projects going on in this area. Offer comments and suggestions on how to make this material more accessible and searchable. A great archive of quality content without the proper mechanism to access it is not great.

Short term?
Take advantage of the Google cache feature — another "Google only" resource. Each time the Google crawler comes around to crawl a Web page, it makes a copy (unless told not to by the Web site owners) and places it on the Google server. Therefore, if you search for a page using Google, click through, and find the page has been removed, return to the search results page and look for the link, next to the URL, that says "cached." Caveat: the cache is a dynamic entity. A page does not stay in the Google cache in perpetuity. It is only available from the cache until the next time the crawler visits the page and identifies that it has gone. For more about the Google cache, go to http://www.google.com/help/features.html#cached.

Of course, another option is to print out or save a copy of a page. This can be both time-consuming and a waste of paper or hard drive space. I use the SaveThis [http://www.savethis.com] service, which allows you to copy any Web page, save it on the server, and access it from any Web browser. This free resource is well worth a look.

I Still Can't Find...

General, Invisible Web, and specialized search tools still leave plenty of material unavailable. So many types of resources to explain, so many places to search! Your boss says that last night he or she was at home "searching" the Web for an article from Newsweek. He or she went to AltaVista, Google, and Yahoo! and came up empty.

"These search engines don't contain 'everything,'" you tell your boss. However, by searching other databases, you can often access and purchase the articles you need. You explain that resources like Northern Light's Special Collection, Electric Library, or dowjones.com (a free site that offers access to and purchase of individual articles from Factiva's Publication Library) are all possibilities. You go on to tell him or her that your library also makes numerous databases available through subscription licenses, databases they can access from home.

The boss says, "Wow, I had no idea all of this material was available." On a roll, you also suggest that the boss check with the local public library, which you happen to know also offers access to many fee-based services. "Your tax dollars at work," you say.

Finally, you tell your boss that much of where you search is determined by what you need. In some cases what you need can be found — for free — using Google or Excite, but if you don't find it there, you should know where to turn next. In some cases, starting with Google or Excite might not be the best idea. There is still plenty of content not digitized, content that may require a trip to a library with a print or microfilm collection containing the document you need.

What Should the Searcher Do?

You tell them.
