Schilit Key Ideas Dl
Total Page:16
File Type:pdf, Size:1020Kb
Exploring a Digital Library through Key Ideas Bill N. Schilit and Okan Kolak Google Research 1600 Amphitheatre Parkway, Mountain View, CA 95054 {schilit,okan}@google.com ABSTRACT "I had thought each book spoke of the things, Key Ideas is a technique for exploring digital libraries by human or divine, that lie outside books. Now navigating passages that repeat across multiple books. From I realized that not infrequently books speak of these popular passages emerge quotations that authors have books: it is as if they spoke among themselves." copied from book to book because they capture an idea See also: intertextuality, Umberto Eco, Adso particularly well: Jefferson on liberty; Stanton on women's Appears in 18 Books from 1989 to 2005 rights; and Gibson on cyberpunk. We augment Popular Pas- sages by extracting key terms from the surrounding context Figure 1: One of 700-plus popularly quoted pas- and computing sets of related key terms. We then create an sages appearing under the key term \intertextual- interaction model where readers fluidly explore the library ity." Tens of millions of quotations emerge from a by viewing popular quotations on a particular key term, and clean-slate data mining algorithm that takes a dig- follow links to quotations on related key terms. In this paper ital library of scanned books as input. The reader we describe our vision and motivation for Key Ideas, present follows \See also" links to explore related quotations an implementation running over a massive, real-world dig- or \Appears in n Books..." links to read the quote ital library consisting of over a million scanned books, and in context in primary and secondary source books describe some of the technical and design challenges. The (example modified slightly for readability). principal contribution of this paper is the interaction model and prototype system for browsing digital libraries of books using key terms extracted from the aggregate context of pop- ularly quoted passages. online encyclopedia? Information technology's progress to- wards \an increasing democratization or dissemination of Categories and Subject Descriptors power" (Landow [10]) requires a ready supply of authorita- tive voices. Expanding access to these knowledge sources H.5.4 [Information Interfaces and Presentation]: Hy- is extremely important in a world where technology broad- pertext/Hypermedia; J.5 [Arts and Humanities]: Litera- casts one voice to many, authors are anonymous, credibility ture; H.4.3 [Information Systems Applications]: Com- is ambiguous, views are potentially biased, and where \we munications Applications|Information browsers have people who are trying to repeatedly abuse our sites" (Wales [13]). General Terms Our larger aim is to expose the general population to the Algorithms, Design, Human Factors great ideas and the connections between ideas that lie within books. Mining and linking ideas are both significant chal- lenges. Mining requires that we distinguish words about Keywords \ideas" from other words that appear in books, and, at the digital libraries; quotations; humanities research; data min- surface, all words look alike. Recognizing relations between ing; key phrases; hypertext; great ideas. ideas is difficult because there is no common classification system. Author's use the same words to name different 1. INTRODUCTION ideas, and different words to name the same idea. Acknowl- In our research we pose the question: How can we help edging the large problem scope and extensive efforts that people engage online books in a style similar to browsing preceded ours, the research presented here should be taken as a report on a new path we have been exploring rather than a complete solution. Umberto Eco writes that \books speak of books: it is as if Permission to make digital or hard copies of all or part of this work for they spoke among themselves" (see Figure 1). With apolo- personal or classroom use is granted without fee provided that copies are gies to Postermodernists, this sums up our approach. We not made or distributed for profit or commercial advantage and that copies developed a language-independent data mining capability bear this notice and the full citation on the first page. To copy otherwise, to to extract quotations1 from books. Data mining provides republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 1 JCDL’08, June 16–20, 2008, Pittsburgh, Pennsylvania, USA. A quotation is a passage, of a certain length, repeated Copyright 2008 ACM 978-1-59593-998-2/08/06 ...$5.00. across books. A passage that is short is more likely a col- 177 the text of each quotation, all the books where the quota- experience. We then present algorithms for extracting key tion appears, as well as the context (surrounding text) of terms from quotation context and computing related key each use. We use quotations to create links between book terms. We conclude with a broad discussion on what we passages thereby giving readers the ability to jump between learned building the prototype, what works, and where we pages where quotations occur. see opportunities for further work. Linking book passages is only one benefit of mining quota- tions. In general when one author employs another's words 2. MOTIVATION & RELATED WORK we can say the following: (1) the passage reflects an idea This work falls within the research area of information or describes an event particularly well (or, by and large, seeking and, like much of the field, follows from Vannevar better than passages not-quoted); (2) the text introducing Bush's vision of Memex. Bush popularized the idea that and following the quote{the context{likely describes some people can collaboratively organize and share knowledge: facet of the quotation. The first observation says that there \Wholly new forms of encyclopedias will appear, ready-made might be ways to distinguish ideas from other words; the sec- with a mesh of associative trails running through them..." ond that commonly used descriptions may emerge without [5]. In many ways the vision has come to pass with large a commonly used classification system. amounts of hyper-linked information in the World Wide Furthermore, when one author quotes a passage from an- Web, digital libraries, and online encyclopedias. However, other they are crediting importance to the passage. In the researchers continue to point out that people, especially aggregate this behavior can be seen as a \wisdom of crowds" those lacking domain expertise, struggle to navigate infor- [15] effect where authors' repeated re-use of one another's mation spaces. words call out popular and seminal passages. In other words, There are numerous reasons why navigating information we found that mining surface similarity exposes deep seman- is still a challenge. In our own encounters with a digital li- tics, i.e., the ideas, in books. The Key Ideas technique is brary of books we saw that the size of information chunks based on these observations. is fundamental. Following the \mesh of associative trails" One major aspect of the emergence of popular and sem- and landing on a 400 page book slows down exploration. inal ideas based on usage of quotations across books is the People like to see concise bits of information and then de- size and nature of the collection. The digital library in our cide whether they want to learn more or navigate to other inquiry is over a millions books that comprise Google Book information. Search2. This collection contains books from 28 libraries in the United States, Europe and Asia, content from over 2.1 Popular Passages 10,000 publisher partners, and books in many dozens of lan- Our experience with Book Search led us to look for ways guages. to connect readers to smaller, more interesting bits within books. We developed Popular Passages to help readers dis- 1.1 Existing Quotation Collections cover passages shared across books, and provide a way to Quotation collections are fairly popular in print and on the explore the corpus by pivoting on these passages. The pas- Web. Two well known examples are Familiar Quotations by sages that authors decided worth repeating were mostly at- John Bartlett [4], and the Columbia World of Quotations [2] tributed quotations between 1-8 sentences in length, and oc- which contain 25,000 and 65,000 quotations respectively and curing in tens, hundreds, or sometimes thousands of books. 3 are also available online . Both these and other collections Mining quotations is related to but different from plagiarism are generally well organized and excellent source books. and duplicate detection. For technical details on quotation In developing our prototype we chose not to start from mining see [9]. manually prepared quotation collections for a number of rea- Although our system mines all types of repeated passages, sons. First, we were interested in quotations based on their our focus is not on shallow bon mots but rather on passages \real world" use by authors and this is not the criteria used that authors re-use because they are relevant to a point be- in edited collections. We also wanted as inclusive a collection ing made and describe an idea well. These passages tend as possible across many languages and disciplines. Finally, to be longer than a phrase but shorter than a page. For we wanted not just the quotation text, but the collection of example, we find this passage frequently used in the field of books in which they are used as well as the context of use, environmental politics: something not available in existing online collections. Further, searching from a large database of quotations The concept of sustainable development does im- is not a solution given the variations of use and numerous ply limits - not absolute limits but limitations im- scan errors in our corpus.