
POINT OF VIEW

In-Depth Understanding: Teaching Search Engines to Interpret Meaning

By C. DAVID SEUSS
[...], LLC, Cambridge, MA 02141 USA

Digital Object Identifier: 10.1109/JPROC.2011.2105531

If a person from 1994 jumped forward into 2011, that person would be wowed by many of our information technology advances. Smart mobile phones, the ubiquitous Web, broadband everywhere, wireless networking, cloud computing, digital music, streaming media, software as a service: the list of radical innovations goes on and on. But if that person from 1994 were to use a 2011 search engine, he or she would say, "Finally, something that hasn't changed!" The 1994 user would put a search query into a search box on the 2011 search engine, and would receive back long lists of very briefly summarized documents as a search result to consider, just like he or she would have in 1994.

Speaking as a member of the search engine industry, I must observe that the community appears to be perversely enamored of the last good idea we had way back at the dawn of the Web. Sure, there have been tweaks to relevance ranking and substantial gains in [...] to keep up with the growth of the Web and electronic publication repositories. But overall, there has been a stupendous lack of radical innovation in search.

Let us consider an example in the use of search applications by professional users. Imagine a company is considering a move into the telephony business. For competitive analysis purposes, a member of the market intelligence staff at the company decides to analyze the strategy of Cisco Systems in "voice over Internet protocol" (VOIP). Assume there is a content repository available to the user of a few hundred thousand market research reports from scores of authoritative information technology market analysis firms like Gartner, Forrester, and IDC.

When this user searches for "Cisco and VOIP" in the content repository of IT analyst reports using a search engine of the current generation, the search result will list thousands of reports. Having produced this lengthy search result, the search engine then washes its hands of the situation, metaphorically dumping the pile of documents on the user's desk and saying, "Cya!" as the search engine bolts out the office door. The user is left to sort through the pile, and find some documents the user thinks might be interesting to read. The search result itself provides precious little guidance in this process. For example, the search result will be sorted by some secret formula that

0018-9219/$26.00 2011 IEEE Vol. 99, No. 4, April 2011 | Proceedings of the IEEE 531 Point of View

will attempt to put documents estimated to be more relevant nearer to the top of the list. And there will be a little summary provided of each document; perhaps a sentence or two of text that the user can review. Acting on these scant hints, the user selects a few reports to read. Some are helpful, some are not, and the user perseveres for as long as he or she has time, or for as long as he or she can tolerate this hit or miss process.

Because one cannot know what one did not find, there is no objective way for the user to assess whether the documents that he or she actually took the time to read comprehensively represent the body of knowledge contained in the thousands of returned documents on the search result. What the user is actually doing is desperately wishing that the few documents he or she selects to read contain all the important findings, analysis, and perspective available on the topic. As even the most determined researcher will read only a very small percentage, typically a small fraction of one percent, of the reports or journal articles on any given search result, this research strategy can best be characterized as hope for amazing good luck.

Hope for amazing good luck as a strategy for dealing with search results is not, of course, the fault of the user. It is the fault of a search engine industry that believes that a list of documents is the right response to a user whose business purpose for doing the search is to gain intellectual command of a body of knowledge, to discover new knowledge, to answer a profound question, to explore the meaning of events and trends, or, in our example, to analyze the business strategy of a leading company that can drive the [...] of a new technology and its market.

I. INTRODUCING MEANING EXTRACTION

So how might search work better? One way is by applying meaning extraction, an emerging technology that identifies concepts contained within documents and document repositories, and surfaces combinations of these concepts that imply meaning in the context of the business, professional, or technical purpose of the search process. Today, meaning extraction is beginning to be applied by companies to search electronic document repositories and various online resources to dramatically improve and accelerate a researcher's ability to gain insight into a topic and answer specific research questions.

Meaning extraction works as follows.

1) Extract references to important concepts from every document in the research repository, particularly concepts that imply meaning for the business or professional purpose of the search.
2) Record the location of each concept in each document in the research repository.
3) Identify patterns of proximity-related concepts that imply meaning to a knowledgeable practitioner.
4) Analyze the documents responsive to a search query to identify those patterns and highlight them to the user.

II. MEANING-LOADED ENTITIES

Entity extraction itself has been around the text analytics world for almost two decades. Traditionally, the entities being extracted are proper nouns, specifically: people, places, and organizations. For example, text analytics could tell you that Cisco is in a news story, or in ten thousand news stories in the news article repository.

The first useful extension to entity extraction is to include a relevant taxonomy of context-specific entities that go beyond the proper nouns used in traditional text analytics. In an information technology setting, these context-specific entities might be technologies, for example VOIP, cloud computing, or software as a service. In different settings the relevant set of context-specific entities would be different. For example, in a pharmaceutical setting one might include context-specific entities like diabetes, Lipitor, or monoclonal antibodies.

The second and more profound extension to traditional entities is the idea of meaning-loaded entities. Meaning-loaded entities have depth and purpose-driven relevance. Meaning-loaded entities are events, conditions, situations, outcomes, actions, relationships, and trends that imply significance for the professional purpose of the search. For example, in a market intelligence search application, meaning-loaded entities might be price cut, change in market share, or strategic partnership. In a pharmaceutical setting, meaning-loaded entities might be clinical trial, patent lost, or generic drug.

A meaning-extraction-enabled search application can index and record the locations in all documents of all references to traditional entities (e.g., Cisco), context-specific entities (e.g., VOIP), and meaning-loaded entities (e.g., strategic partnership). Just for shorthand, let us refer to these three entity types collectively as concepts.

An effective meaning extraction application requires tens of thousands of concepts to facilitate the meaning discovery search process. As the concepts are identified they are organized into a taxonomy, a meaning taxonomy, if you will. (Practically speaking, the meaning taxonomy usually precedes the concept identification.) The meaning taxonomy is designed using a hierarchy that is specifically relevant to the context. For example, VOIP would be placed into the IT Technologies node of the meaning taxonomy while strategic partnership would be placed in the Corporate Strategies node. For the pharmaceutical example, diabetes is placed in the Diseases node, Lipitor in the Drugs node, and monoclonal antibodies in the Proteins node.

Meaning extraction exposes the concepts found in documents responsive to a search query to the user at both the document level and the search results level. At the document level, the meaning-extraction-enabled


search engine presents the concepts found in a document as an enhancement beyond the all-too-brief document summary of the style supplied by traditional search engines. This assists a user in gaining an at-a-glance understanding of what is really in the document so the user can make a more informed decision about whether this report or journal article should be downloaded and read. This facility helps a user find those reports and articles that are most likely to be of the most value, which is crucial considering that only a few documents from a long list of search results are actually going to be read.

At the search result level, identifying the concepts found in all the documents on the search result represents an overview of the knowledge that is contained in those documents. Such a summary overview provides an opportunity for knowledge discovery that can surprise the user with insights otherwise unavailable, or at least unlikely to be discovered with the hope for amazing good luck search strategy.

III. AUTOMATIC IDENTIFICATION OF SCENARIOS

After the concepts are identified and organized into the meaning taxonomy, the next step is to interpret combinations of concepts as potentially significant. A human practitioner in the relevant knowledge domain specifies patterns of concepts that, when found in proximate relationship to one another, imply meaning to the professional researchers using the search engine. Let us call the relationships among concepts that fit the specified patterns scenarios. Because the scenarios can be specified at the taxonomy-node level, a pattern efficiently expressed in simple terms by the human expert can expand at runtime into a search for many, many specific relationships between individual concepts.

The meaning-extraction-enabled search engine analyzes all the documents on a given search result and identifies the scenarios, flagging those found for the user to review. This automated analysis to identify scenarios represents the power of the machine to lever human intellect. For example, in one deployed pharmaceutical application, the meaning-extraction-enabled search engine looks for 1.9 trillion potential scenarios on every search against a repository of 25 million journal articles, returning those scenarios found in a given user query in less than ten seconds.

The efficiency of using meaning taxonomies to create scenarios is illustrated by the fact that the text specifying the above scenario pattern consists of only 38 words that instruct the meaning-extraction-enabled search engine to look for relationships between all of the entries in the taxonomy nodes for drugs, diseases, cells, cell receptors, medical devices, proteins, enzymes, genes, and therapeutic strategies. Then for each document in the 25 million journal article repository that is returned on a search result, the analytical process in the meaning extraction step examines all combinations of concepts in all specified taxonomy nodes to find those concepts that are in proximate relationship to one another according to the pattern of the scenario specification.

The relationships found by the meaning-extraction-enabled search engine should be considered scenarios, rather than conclusions or findings, since it is impossible at present to automatically determine if the identified relationships are obvious, spurious, or significant. That question is left for the human intellect of the user to ponder. The meaning extraction application can only determine that the scenarios are present, and it can measure the number of documents each scenario is found in, which is the single most helpful indicator of weight.

Once the scenarios have been identified from all the tens of thousands of reports or journal articles on a search result, they are presented to the user to consider. In the example we started with, a search of IT analyst research reports discussing the VOIP market produces these actual search results.

• Cisco is using a corporate strategy of acquisitions.
• Cisco is using a corporate strategy of strategic partnerships.
• Cisco is using a product marketing strategy of market segmentation.
• Cisco is using a product marketing strategy of target market.
• Cisco is using a product marketing strategy of professional services.
• Cisco is using a product marketing strategy of service and support.

It is immediately obvious that these scenarios are no ordinary search results; Cisco's strategy in the VOIP market jumps right off the page without reading a single document. The search engine is suggesting that Cisco is targeting specific market segments in the VOIP market and using a combination of high levels of professional services and support and partnerships/acquisitions, presumably to penetrate the market quickly. Each of the search results listed above is linked to a list of reports that discuss that scenario, sorted by the number of times the scenario is in the report so users can rapidly drill into the documents that best elaborate on Cisco's strategy. So, for example, the user could drill down into the subset of reports that present target markets and market segments to learn what those target market segments are.

Also, consider that had the user in the example above used a current generation search engine that only returned a list of documents, that user would have been in danger of missing the point entirely. He or she would have read two, three, maybe five reports from the search result of thousands. Since the user did not include "strategic partnership" or "professional services" in the search query, the relevance ranking formula probably would not have placed documents rich in those concepts at the top of the search


result. Current generation search engines have the flaw of giving the user what was asked for, not what the user should have asked for had the user already understood the topic.

IV. IMPLICATIONS FOR RESEARCHERS

Meaning extraction supports knowledge discovery by presenting the user with the concepts and scenarios present in the search result as a whole, without the blind spots inherent in the hope for amazing good luck search strategy. And by presenting these concepts and scenarios, with easy drill-down into the most interesting ideas, meaning extraction reduces the time to insight. Users actually end up reading more reports and journal articles with a meaning-extraction-enabled search engine, despite the helpful automated discovery of meaning, because they find more on-target, better, and often surprising material.

Certainly, the value of meaning-extraction-enabled search extends far beyond the business domain; there are clear applications in scientific research, technology development, and many other areas of endeavor. Scientific research stands the most to gain from meaning extraction. Imagine the search engine could read all the papers published in a technical field and identify the new concepts and scenarios for you. Everything required to produce such an implementation of meaning extraction is well understood today. It is inevitable that there will be breakthroughs produced by the machines, or more precisely, produced by the human researchers that skillfully guide them.

An interesting question is whether such meaning-extraction-enabled solutions will be able to be made available to individual researchers as opposed to being only available to those researchers working with corporate-sponsored solutions of the type I have described. Individual users may need different concepts and scenarios to support diverse research questions. There is an investment required in preprocessing the document repositories to extract and locate the concepts contained in the repository. This investment involves building the meaning taxonomy, identifying the concepts, identifying the myriad of ways a concept may be expressed in text, processing the repository with the right search tools, examining test cases of results, and then, of course, iterating. Accomplishing this at the level of one user as opposed to one organization requires no new science to achieve. Rather, the only tasks are of user interface design, software engineering, and network operations organization to grant personalized design and processing control of such a system to an individual user.

While these tasks are not trivial in implementation, and the first such individually controllable meaning-extraction-enabled search solution will be much more expensive to develop and operate than a system in which a common set of concepts will be used by many users, there are already such systems being contemplated by organizations that want to give their users individual control over them. It is only a matter of time before an organization, company, or publisher makes the required investment and combines it with a business model that permits individual users who are not employees of the sponsoring organization to have individualized control over meaning-extraction-enabled search applications.

V. IMPLICATIONS FOR PUBLISHERS

And of course, there are implications for publishers. The most interesting implication is that meaning extraction levers the value of publishers' repositories. One of the consequences of Web search engines and massive online content is that the value of having unique coverage of a topic has declined. For example, search on a news topic on a Web news search engine and you will have competent coverage presenting the essential elements returned from thousands of individual news sites. No one site can claim any significant coverage advantage over the collection of all sites assembled on the fly by the search engine. And Web news search engines will instantly and seamlessly switch out the publishers contributing to any particular user query, further reducing the value of being a publisher. The Web news search engine effectively eliminates the value of an individual content repository by creating a virtual repository for each query from all indexed sources. (And it keeps that value for itself as a business model.)

A similar process is rolling ahead in search of scientific and technical literature. As more and more content addressing any scientific question is published by multiple sources and made findable by generalized Web search engines and [...] engines of the type often found in [...], the value of any one piece of content or any one collection of content declines.

Now apply meaning extraction to the equation and the situation changes. Because the first two steps of meaning extraction as discussed above (extracting concepts and recording their locations within documents) have to be done as a preprocessing step and not at runtime, a search engine returning results from multiple uncoordinated content sources will not be able to implement meaning extraction. The publisher with a deep and broad repository can implement meaning extraction against the publisher's own repository. The bigger the repository, the better meaning extraction will perform because the application will find more scenarios and will calculate their weight more accurately. Reversing the game, value is returned to the publisher's repository, and hence to the publisher.

VI. CONCLUSION

Meaning extraction represents a powerful tool to help researchers analyze, comprehend, and apply the flood of information that the modern era of pervasive technology has unleashed, so


that in the future, long lists of documents on search results, search circa 1994, will truly be a thing of the past. Search engines must evolve to have an in-depth understanding of the searched material and the associated ways of knowing in the user's domain of knowledge. It is necessary that search engines grasp the professional purpose for a given search and that search goes beyond presenting document lists to users. Search engines must interpret and analyze the search results and then present findings that would be considered most significant by the users if they were able to read all of the documents retrieved in the search process. Meaning extraction is the future of search.
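
For readers who want to see the mechanics, the four steps listed in Section I and the document-count weighting described in Section III can be illustrated with a very small Python sketch. It is only an illustration under stated assumptions, not the deployed system: the tiny taxonomy, the "Companies" node name, the 12-token proximity window, the exact-match concept lookup, and all function names are hypothetical choices made for this example.

```python
import re
from collections import defaultdict
from itertools import product

# Hypothetical miniature meaning taxonomy: node -> concepts.
# A real application would hold tens of thousands of concepts.
TAXONOMY = {
    "Companies": ["Cisco"],  # node name assumed for illustration
    "IT Technologies": ["VOIP", "cloud computing"],
    "Corporate Strategies": ["strategic partnership", "acquisition"],
}

# A scenario pattern stated at the taxonomy-node level (assumed form):
# any concept of the first node found near any concept of the second.
PATTERN = ("Companies", "Corporate Strategies")
WINDOW = 12  # proximity threshold in tokens; an assumption


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(text):
    """Steps 1 and 2: find concepts and record their token positions."""
    tokens = tokenize(text)
    locations = defaultdict(list)
    for node, concepts in TAXONOMY.items():
        for concept in concepts:
            words = tokenize(concept)
            n = len(words)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == words:  # exact match only
                    locations[(node, concept)].append(i)
    return locations


def find_scenarios(locations, pattern=PATTERN, window=WINDOW):
    """Step 3: concept pairs from the pattern nodes in close proximity."""
    left = [(c, p) for (node, c), p in locations.items() if node == pattern[0]]
    right = [(c, p) for (node, c), p in locations.items() if node == pattern[1]]
    found = set()
    for (a, pos_a), (b, pos_b) in product(left, right):
        if any(abs(i - j) <= window for i in pos_a for j in pos_b):
            found.add((a, b))
    return found


def scenario_weights(documents):
    """Step 4: count the documents in which each scenario is found."""
    weights = defaultdict(int)
    for text in documents:
        for scenario in find_scenarios(index_document(text)):
            weights[scenario] += 1
    return dict(weights)


# Invented sample documents standing in for a search result.
DOCS = [
    "Cisco announced a strategic partnership to expand its VOIP portfolio.",
    "Analysts expect Cisco to pursue another acquisition in the VOIP market.",
    "Cloud computing adoption continues to grow.",
    "A strategic partnership helped Cisco enter new VOIP segments.",
]

if __name__ == "__main__":
    ranked = sorted(scenario_weights(DOCS).items(), key=lambda kv: -kv[1])
    for (company, strategy), count in ranked:
        print(f"{company} / {strategy}: found in {count} document(s)")
```

The pattern names only two taxonomy nodes, yet it expands into a check of every concept pair drawn from those nodes, which mirrors how a short node-level specification can fan out into many specific relationships. The exact-match lookup is the main simplification: handling the many ways a concept may be expressed in text is precisely the investment discussed in Section IV.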
