
DOCUMENT RESUME

ED 459 812 IR 058 348

TITLE: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (1st, Roanoke, Virginia, June 24-28, 2001).
INSTITUTION: Association for Computing Machinery, New York, NY; Institute of Electrical and Electronics Engineers, Inc., Washington, DC. Computer Society.
ISBN: ISBN-1-58113-345-6
PUB DATE: 2001-06-00
NOTE: 578p.; For individual conference papers, see IR 058 349-386. Some abstracts and papers are copyrighted by the authors and are not available from ERIC. Some figures may not reproduce well. Also sponsored by ACM Special Interest Group on Information Retrieval and Special Interest Group on Hypertext, Hypermedia, and the Web; IEEE-CS Technical Committee on Digital Libraries; Coalition for Networked Information; Lucent Technologies; Sun Microsystems; Virginia Tech; and VTLS.
AVAILABLE FROM: Association for Computing Machinery, 1515 Broadway, New York, NY 10036. Tel: 800-342-6626 (Toll Free); Tel: 212-626-0500; e-mail: [email protected]. For full text: http://www1.acm.org/pubs/contents/proceedings/dl/379437/.
PUB TYPE: Collected Works - Proceedings (021)
EDRS PRICE: MF03/PC24 Plus Postage.
DESCRIPTORS: Access to Information; Archives; Computer Oriented Programs; Computer System Design; *Electronic Libraries; Information Services; *Information Technology; Library Collection Development; *Library Services; Online Systems; Technological Advancement

ABSTRACT: Papers in this Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (Roanoke, Virginia, June 24-28, 2001) discuss: automatic genre analysis; text categorization; automated name authority control; automatic event generation; linked active content; designing e-books for legal research; metadata harvesting; mapping the interoperability landscape for networked information retrieval; distributed resource discovery; enforcing interoperability with the open archives initiative (OAI) repository explorer; an OAI service provider for cross-archive searching; managing change on the Web; reputation of Web sites; personalized spiders; global digital library development; Web-based scholarship; multi-view intelligent editor for digital video libraries; digital libraries environments for scientific thinking; open virtual learning environment; automatic identification and organization of index terms; digital library collaborations; public use of digital community information systems; collaborative design with use case scenarios; automatic keyphrasing; legal deposit of digital publications; technology and values; digital strategy for the Library of Congress; multiple digital libraries; technical support workers; digital library recommendation services; digital libraries and data scholarship; metasearching; President's Information Technology Advisory Committee's February 2001 digital library report; searchable collections of enterprise speech data; transcript-free search of audio archives; audio watermarking techniques; music-notation searching and digital libraries; automatic classification of musical instrument sounds; adding content-based searching to a traditional music library catalog server; locating question difficulty through explorations in question space; browsing by phrases; improving video indexing in a digital library; access to a digital video library; digital library design; bucket architecture for the open video project; personalized search and summarization over multimedia healthcare information; automation and human mediation in libraries; preservation of digital information; trading networks of digital archives; cost-driven design for archival repositories; notification service for digital libraries; automated rating of reviewers; digital libraries supporting digital government; digital libraries for young children; preserving, restoring and analyzing damaged manuscripts; digital music libraries; content management for digital museum exhibitions; hierarchical document clustering of digital library retrieval results; interactive visualization of video metadata; categorizing hidden-Web resources; digital facsimile editions and online editing; and image semantics for picture libraries. (AEF)

Reproductions supplied by EDRS are the best that can be made from the original document.

Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (1st, Roanoke, Virginia, June 24-28, 2001)

U.S. DEPARTMENT OF EDUCATION, Office of Educational Research and Improvement, EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC): This document has been reproduced as received from the person or organization originating it. Minor changes have been made to improve reproduction quality. Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

PERMISSION TO REPRODUCE AND DISSEMINATE THIS MATERIAL HAS BEEN GRANTED BY D. Cotton TO THE EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC).

JCDL 2001

following in the tradition of the ACM Digital Libraries & IEEE CS Advances in Digital Libraries Conferences

Sponsored by
The Association for Computing Machinery (ACM)
  Special Interest Group on Information Retrieval (ACM SIGIR)
  Special Interest Group on Hypertext, Hypermedia, and the Web (ACM SIGWEB)
The Institute of Electrical and Electronics Engineers Computer Society (IEEE Computer Society)
  Technical Committee on Digital Libraries (TCDL)
and
The Coalition for Networked Information
Lucent Technologies
Sun Microsystems
Virginia Tech
VTLS

The Joint Conference on Digital Libraries is a major international forum focusing on digital libraries and associated technical, practical, and social issues. JCDL 2001 enhances the tradition of conference excellence already established by the ACM and IEEE-CS by combining the annual events that these professional societies have sponsored, the ACM Digital Libraries Conferences and the IEEE-CS Advances in Digital Libraries Conferences. Also, following JCDL will be an NSF PI in-place meeting for the US Digital Libraries Initiative. JCDL encompasses the many meanings of the term "digital libraries", including (but not limited to) new forms of information institutions; operational information systems with all manner of digital content; new means of selecting, collecting, organizing, and distributing digital content; and theoretical models of information media, including document genres and electronic publishing.

Digital libraries are distinguished from information retrieval systems because they include more types of media, provide additional functionality and services, and include other stages of the information life cycle, from creation through use. Digital libraries also can be viewed as a new form of information institution or as an extension of the services libraries currently provide.

The intended community for this conference includes those interested in such aspects of digital libraries as infrastructure; institutions; metadata; content; services; digital preservation; system design; implementation; interface design; human-computer interaction; evaluation of performance; evaluation of usability; collection development; intellectual property; privacy; electronic publishing; document genres; multimedia; social, institutional, and policy issues; user communities; and associated theoretical topics.

Participation is sought from all parts of the world and from the full range of disciplines and professions involved in digital library research and practice, including computer science, information science, librarianship, archival science and practice, museum studies and practice, technology, medicine, social sciences, and humanities. All domains - academe, government, industry, and others - are encouraged to participate as presenters or attendees.

WIRELESS NETWORK ACCESS FOR FREE AT JCDL!

REGISTRATION DATES
May 15: Last day to register at reduced rate
June 15: Last day to register on-line
June 24: Conference starts; on-site registration

The First ACM/IEEE-CS Joint Conference on Digital Libraries

Final Program

General Information

Registration: Roanoke Foyer / North Entry
Sunday 7:30-21:00
Monday 7:30-20:00
Tuesday 7:30-17:30
Wednesday 7:30-17:30
Thursday 7:30-12:00

Conference Office: Bent Mountain Room, Sunday-Thursday, 8:00-18:00
Email / Net Access: Mill Mountain Room, Sunday-Thursday, 8:00-18:00

Help Desk: Roanoke Foyer / North Entry, Sunday-Thursday, 8:00-18:00

Complimentary lunches for all registered participants will be served daily in Roanoke Ballroom CDH. Complimentary continental breakfast and morning and afternoon coffee breaks will be served daily in the Crystal Foyer.

Program Note: The JCDL 2001 program includes both long and short papers. Long papers are assigned a 30 minute period; short papers (indicated by an asterisk preceding the title) a 15 minute period.

Sunday, June 24, 2001: Tutorials and Side Trips

7:30-9:00 Complimentary Continental Breakfast for Registered Tutorial Participants in Crystal Foyer

8:00-17:00: Tutorials

Practical Digital Libraries Overview: Part 1 & 2 (Buck Mountain Room)
Edward A. Fox (Department of Computer Science, Virginia Tech)

Evaluating, Using, and Publishing eBooks: Part 1 & 2 (Harrison/Tyler Room)
Gene Golovchinsky (FX Palo Alto Laboratory)
Cathy Marshall (Microsoft)
Elli Mylonas (Scholarly Technology Group, Brown University)

Thesauri and Ontologies in Digital Libraries: Part 1 & 2 (Monroe Room)
Dagobert Soergel (College of Information Studies, University of Maryland, College Park)

How to Build a Digital Library Using Open-Source Software (Wilson Room, morning)
Ian H. Witten (Department of Computer Science, University of Waikato)

Hands-On Workshop: Build Your Own Digital Library (Wilson Room, afternoon)
Ian H. Witten (Department of Computer Science, University of Waikato)
David Bainbridge (Department of Computer Science, University of Waikato)

Building Interoperable Digital Libraries: A Practical Guide to Creating Open Archives (Crystal Ballroom E)
Hussein Suleman (Department of Computer Science, Virginia Tech)

12:00-13:00 Complimentary Lunch for Registered Tutorial Participants in Roanoke GH

Side Trips
Virginia Wine Tour, 9:30-16:00: Bus leaves from Conference Center North Entrance
Bottom Creek Gorge Hike, 10:00-16:00: Buses leave from Conference Center North Entrance

Sunday Evening, June 24, 2001: Main Conference Opening

18:00-20:00 Opening Reception (Roanoke Ballroom AB): Virginia ham and assorted hors d'oeuvres; music by No Strings Attached. Welcome by Robert C. Bates, Dean of Arts and Science, Virginia Tech.

Monday, June 25, 2001: Main Conference

7:30-9:00 Complimentary Continental Breakfast in Crystal Foyer
Newcomers' Breakfast in Roanoke CD: Help yourselves to continental breakfast in the foyer and get together with other newcomers and friends.

Sessions 1 and 2: 9:00-10:30 (Crystal Ballroom ABC / Roanoke Ballroom AB / Roanoke Ballroom EFG)
Conference Opening (1): Welcome by Edward A. Fox, Conference Chair, and Christine Borgman, Program Chair
Keynote Address (2): Public Access to Digital Materials
Brewster Kahle (President, Alexa Internet; Director, Internet Archive)
Introduction by Edward A. Fox

10:30-11:00 Break. Refreshments in Crystal Foyer.

Session 3: 11:00-12:30

Papers (3A): Methods for Classifying and Organizing Content in Digital Libraries (Roanoke Ballroom EFG)
Moderator: Jose Luis Borbinha (Biblioteca Nacional, Portugal)
Integrating Automatic Genre Analysis into Digital Libraries
  Andreas Rauber, Alexander Mueller-Koegler (Vienna University of Technology)
Text Categorization for Multi-page Documents: A Hybrid Naive Bayes HMM Approach
  Paolo Frasconi, Giovanni Soda, Alessandro Vullo (University of Florence)
*Automated Name Authority Control
  James W. Warner, Elizabeth W. Brown (Johns Hopkins University)
*Automatic Event Generation From Multi-lingual News Stories
  Kin Hui, Wai Lam, Helen M. Meng (The Chinese University of Hong Kong)

Papers (3B): Digital Libraries for Education: Technology, Services, and User Studies (Crystal Ballroom ABC)
Moderator: Dale Flecker (Harvard University)
Linked Active Content: A Service for Digital Libraries for Education
  David Yaron, D. Jeff Milton, Rebecca Freeland (Carnegie Mellon University)
A Component Repository for Learning Objects: A Progress Report
  Jean R. Laleuf, Anne Morgan Spalter (Brown University)
Designing e-Books for Legal Research (Honorable Mention, Vannevar Bush Award)
  Catherine C. Marshall, Morgan N. Price, Gene Golovchinsky, Bill N. Schilit (FX Palo Alto Laboratory)

Panel (3C): The Open Archives Initiative: Perspectives on Metadata Harvesting (Roanoke Ballroom AB)
Moderator: James B. Lloyd (University of Tennessee)
Panelists: Tim Cole (Chair, Library Information Technology Committee, University of Illinois, Urbana-Champaign); Caroline Arms (Library of Congress); Donald Waters (Mellon Foundation); Simeon Warner (Los Alamos National Laboratory); Jeffrey Young (OCLC)

12:30-13:30 Complimentary Lunch for all Registered Attendees in Roanoke CDH.

Monday, Session 4: 13:30-15:00

Papers (4A): Approaches to Interoperability Among Digital Libraries (Crystal Ballroom ABC)
Moderator: James Frew (University of California, Santa Barbara)
*Mapping the Interoperability Landscape for Networked Information Retrieval
  William E. Moen (University of North Texas)
*Distributed Resource Discovery: Using Z39.50 to Build Cross-Domain Information Servers
  Ray R. Larson (University of California, Berkeley)
The Open Archives Initiative
  Carl Lagoze, Herbert Van de Sompel (Cornell University)
*Enforcing Interoperability with the Open Archives Initiative Repository Explorer
  Hussein Suleman (Virginia Tech)
*Arc - An OAI Service Provider for Cross-Archive Searching
  Xiaoming Liu, Kurt Maly, Mohammad Zubair, Michael L. Nelson (Old Dominion University)

Papers (4B): Digital Libraries and the Web: Technology and Trust (Roanoke Ballroom EFG)
Moderator: Jonathan Furner (University of California, Los Angeles)
Managing Change on the Web
  Luis Francisco-Revilla, Frank Shipman, Richard Furuta, Unmil Karadkar, Avital Arora (Texas A&M University)
*Measuring the Reputation of Web Sites: A Preliminary Exploration
  Greg Keast, Joan Cherry, Elaine G. Toms (University of Toronto)
Personalized Spiders for Web Search and Analysis
  Michael Chau, Daniel Zeng, Hsinchun Chen (University of Arizona, Tucson)
*Salticus: Guided Crawling for Personal Digital Libraries
  Robin Burke (California State University, Fullerton)

Panel (4C): Different Cultures Meet: Lessons Learned in Global Digital Library Development (Roanoke Ballroom AB)
Moderator: Ching-chih Chen (Professor, Graduate School of Library and Information Science, Simmons College)
Panelists: Hsueh-hua Chen (Chair, Department of Library and Information Science, National Taiwan University); Wen Gao (Deputy President, Graduate Schools, Chinese Academy of Sciences, Beijing); Von-Wun Soo (Professor of Computer Science, National Tsinghua University, Taipei); Li-Zhu Zhou (Chair, Department of Computer Science, Tsinghua University, Beijing)

15:00-15:30 Break. Refreshments in Crystal Foyer.

Session 5: 15:30-17:00

Papers (5A): Tools for Constructing and Using Digital Libraries (Crystal Ballroom ABC)
Moderator: Hsinchun Chen (University of Arizona, Tucson)
Power to the People: End-User Building of Digital Library Collections
  Ian H. Witten, David Bainbridge, Stefan J. Boddie (University of Waikato)
*Web-Based Scholarship: Annotating the Digital Library
  Bruce Rosenstock, Michael Gertz (University of California, Davis)
A Multi-View Intelligent Editor for Digital Video Libraries
  Brad A. Myers, Juan P. Casares, Scott Stevens, Laura Dabbish, Dan Yocum, Albert Corbett (Carnegie Mellon University)
*VideoGraph: A New Tool for Video Mining and Classification
  Jia-Yu Pan, Christos Faloutsos (Carnegie Mellon University)

Papers (5B): Systems Design and Evaluation for Undergraduate Learning Environments (Roanoke Ballroom EFG)
Moderator: Traugott Koch (Lund University & Technical Knowledge Center of Denmark)
*The Alexandria Digital Earth Prototype System
  Terence R. Smith, Greg Janee, James Frew, Anita Coleman (University of California, Santa Barbara)
*Iscapes: Digital Libraries Environments for the Promotion of Scientific Thinking by Undergraduates in Geography
  Anne J. Gilliland-Swetland, Gregory H. Leazer (University of California, Los Angeles)
*Project ANGEL: An Open Virtual Learning Environment with Sophisticated Access Management
  John MacColl (University of Edinburgh)
*NBDL: A CIS Framework for NSDL
  Joe Futrelle, C. Kevin Chang (University of Illinois, Urbana-Champaign), Su-Shing Chen (University of Missouri, Columbia)
Automatic Identification and Organization of Index Terms for Interactive Browsing
  Nina Wacholder, David K. Evans, Judith L. Klavans (Columbia University)

Panel (5C): Digital Library Collaborations in a World Community (Roanoke Ballroom AB)
Moderator: David Fulker (Unidata Program Center)
Panelists: Sharon Dawes (SUNY Albany); Leonid Kalinichenko (Institute of Informatics Problems, Russian Academy of Science, Moscow State University); Tamara Sumner (Center for LifeLong Learning and Design, Dept. of Computer Science and the Institute of Cognitive Science, University of Colorado); Constantino Thanos (DELOS Director, Consiglio Nazionale delle Ricerche, Istituto di Elaborazione della Informazione); Alex Ushakov (ChemQuest Project, Department of Chemistry and Biochemistry, University of Northern Colorado)

Monday 19:00-21:00 Poster and Demo Session and Reception (Conference Center Foyer): Pasta bar and assorted hors d'oeuvres.

Demo: Content Management for Digital Museum Exhibitions
  Jen-Shin Hong, Bai-Hsuen Chen, Jieh Hsiang (National Chi-Nan University), Tien-Yu Shu (National Museum of Natural Science)

Demo: Demonstration of Hierarchical Document Clustering of Digital Library Retrieval Results
  Christopher R. Palmer, Jerome Pesenti, Raul E. Valdes-Perez, Michael G. Christel, Alex G. Hauptmann, Dorbin Ng, and Howard D. Wactlar (Carnegie Mellon University)

Poster: An Atmospheric Visualization Collection for the NSDL
  Keith Andrew (Eastern Illinois University), Christopher Klaus (Argonne National Laboratory), Gerald Mace (University of Utah)

Poster: Breaking the Metadata Generation Bottleneck: Preliminary Findings
  Elizabeth D. Liddy (School of Information Studies, Syracuse University), Stuart Sutton (School of Library & Information Science, University of Washington), Woojin Paik (solutions-united.com, Syracuse, NY), Eileen Allen (School of Information Studies, Syracuse University), Sarah Harwell (solutions-united.com, Syracuse, NY), Michelle Monsour (School of Information Studies, Syracuse University), Anne Turner (School of Library & Information Science, University of Washington), Jennifer Liddy (School of Information Studies, Syracuse University)

Demo: Indiana University Digital Music Library Project
  Jon W. Dunn, Eric J. Isaacson (Indiana University, Bloomington)

Demo: Interactive Visualization of Video Metadata
  Mark Derthick (Carnegie Mellon University)

Demo: PERSIVAL: Categorizing Hidden-Web Resources
  Panagiotis G. Ipeirotis, Luis Gravano (Columbia University), Mehran Sahami (Epiphany, Inc.)

Poster: Building the Physical Sciences Information Infrastructure, A Phased Approach
  Judy C. Gilmore, Valerie S. Allen (U.S. Department of Energy)

Poster: Development of an Earth Environmental Digital Library System for Soil and Land-Atmospheric Data
  Eiji Ikoma, Taikan Oki, Masaru Kitsuregawa (Institute of Industrial Science, Univ. of Tokyo)

Poster: Digital Facsimile Editions and On-Line Editing
  Harry Plantinga (Computer Science, Calvin College)

Poster: Dspace at MIT: Meeting the Challenges
  Michael Bass (Hewlett-Packard), Margret Branschofsky (Faculty Liaison, MIT)

Demo: PERSIVAL: Personalized Search and Summarization over Multimedia Healthcare Information
  Noemie Elhadad, Min-Yen Kan, Simon Lok and Smaranda Muresan (Columbia University)

Poster: Exploiting Image Semantics for Picture Libraries
  Kobus Barnard, David Forsyth (University of California at Berkeley)

Poster: Feature Extraction for Content-Based Image Retrieval in DARWIN (Digital Analysis and Recognition of Whale Images on a Network)
  Kelly R. Debure, Adam S. Russell (Eckerd College)

Demo: PERSIVAL: View Segmentation and Static/Dynamic Summary Generation for Echocardiogram Videos
  Shahram Ebadollahi, Shih-Fu Chang (Columbia University)

Poster: Guided Linking: Efficiently Making Image-to-Transcript Correspondence
  Cheng Jiun Yuan, W. Brent Seales (University of Kentucky)

Poster: Integrating Digital Libraries by CORBA, XML and Servlet
  Wing Hang Cheung, Michael R. Lyu, Kam Wing Ng (The Chinese University of Hong Kong)

Demo: Stanford Encyclopedia of Philosophy: A Dynamic Reference Work
  Edward N. Zalta, Uri Nodelman (Stanford University), Colin Allen (Texas A&M University)

Demo: A System for Adding Content-Based Searching to a Traditional Music Library Catalogue Server
  Matthew J. Dovey (Kings College, London)

Poster: A National Digital Library for Undergraduate Mathematics and Science Teacher Preparation and Professional Development
  Kimberly S. Roempler (Eisenhower National Clearinghouse, The Ohio State University)

Poster: Print to Electronic: Measuring the Operational and Economic Implications of an Electronic Journal Collection
  Carol Montgomery, Linda Marion (Hagerty Library, Drexel University)

Poster: Turbo Recognition: Decoding Page Layout
  Taku A. Tokuyasu (University of California at Berkeley)

Demo: Using the Repository Explorer to Achieve OAI Protocol Compliance
  Hussein Suleman (Virginia Tech)

Poster: Using Markov Models and Innovation-Diffusion as a Tool for Predicting Digital Library Access and Distribution Rates
  Bruce R. Barkstrom (NASA Langley Research Center)

Demo: A Versatile Facsimile and Transcription Service for Manuscripts and Rare Old Books at the Miguel de Cervantes Digital Library
  Alejandro Bia (Miguel de Cervantes Digital Library, University of Alicante, Spain)

Demo: The Virtual Naval Hospital: The Digital Library as Knowledge Management Tool for Nomadic Patrons
  Michael P. D'Alessandro, Donna M. D'Alessandro, Mary J. C. Hendrix (University of Iowa), Richard S. Bakalar (Naval Medical Information Management Center), Denis E. Ashley (US Navy Bureau of Medicine and Surgery)

Tuesday, June 26, 2001: Main Conference

7:30-9:00 Complimentary Continental Breakfast in Crystal Foyer

Session 6: 9:00-10:30

Papers (6A): Studying the Users of Digital Libraries: Formative and Summative Evaluations (Roanoke Ballroom AB)
Moderator: John Ober (California Digital Library)
Public Use of Digital Community Information Systems: Findings from a Recent Study with Implications for System Design
  Karen E. Pettigrew (University of Washington, Seattle), Joan C. Durrance (University of Michigan, Ann Arbor)
*Evaluating the Distributed National Electronic Resource
  Peter Brophy, Shelagh Fisher (The Manchester Metropolitan University)
*Collaborative Design with Use Case Scenarios
  Lynne Davis (University Corporation for Atmospheric Research), Melissa Dawe (University of Colorado, Boulder)
Human Evaluation of Kea, an Automatic Keyphrasing System
  Steve Jones, Gordon W. Paynter (University of Waikato)

Papers (6B): Digital Library Collections: Policies and Practices (Roanoke Ballroom EFG)
Moderator: William Arms (Cornell University)
Community Design of DLESE's Collections Review Policy: A Technological Frames Analysis
  Michael Khoo (University of Colorado, Boulder)
Legal Deposit of Digital Publications: A Review of Research and Development Activity
  Adrienne Muir (Loughborough University)
*Comprehensive Access to Printed Materials (CAPM)
  G. Sayeed Choudhury, Mark Lorie, Erin Fitzpatrick, Ben Hobbs, Greg Chirikjian, Allison Okamura (Johns Hopkins University), Nick Flores (University of Colorado, Boulder)
*Technology and Values: Lessons from Central and Eastern Europe
  Nadia Caidi (University of Toronto)

Panel (6C): A Digital Strategy for the Library of Congress: Discussion of the LC21 Report and the Role of the Digital Library Community (Crystal Ballroom ABC)
Moderator: Alan Inouye (Computer Science and Telecommunications Board, National Academy of Sciences)
Panelists: Dale Flecker (Associate Director of Planning and Systems, Harvard University Library); Margaret Hedstrom (Associate Professor, School of Information, University of Michigan); David Levy (Professor, Information School, University of Washington)

10:30-11:00 Break. Refreshments in Crystal Foyer.

Session 7: 11:00-12:30

Papers (7A): Studying the Users of Digital Libraries: Qualitative Approaches (Roanoke Ballroom AB)
Moderator: Cliff McKnight (Loughborough University, United Kingdom)
Use of Multiple Digital Libraries: A Case Study
  Ann Blandford, Hanna Stelmaszewska (Middlesex University), Nick Bryan-Kinns (Icon MediaLab London)
An Ethnographic Study of Technical Support Workers: Why We Didn't Build a Tech Support Digital Library
  Sally Jo Cunningham, Chris Knowles (University of Waikato), Nina Reeves (Cheltenham and Gloucestershire College of Higher Education)
[continued below]

Papers (7B): Techniques for Managing Distributed Collections (Roanoke Ballroom EFG)
Moderator: Ray Larson (University of California, Berkeley)
*Overview of the Virtual Data Center Project and Software
  Micah Altman, L. Andreev, M. Diggory, G. King, E. Kolster, A. Sone, S. Verba (Harvard University), D.L. Kiskis, M. Krott (University of Michigan, Ann Arbor)
*Digital Libraries and Data Scholarship
  Bruce R. Barkstrom (NASA Langley Research Center)
[continued below]

Panel (7C): The President's Information Technology Advisory Committee's February 2001 Digital Libraries Report and Its Impact (Crystal Ballroom ABC)
Moderator: Sally E. Howe (National Coordination Office for Information Technology Research & Development)
Panelists: David Nagel (President, AT&T Labs; Chair, PITAC Digital Library Panel); Ching-chih Chen (Professor, Graduate School of Library and Information Science, Simmons College, and PITAC); Stephen M. Griffin (NSF, Digital Libraries Initiative)
[continued below]

Tuesday, Session 7 (continued): 11:00-12:30


Papers (7A), continued:
*Developing Recommendation Services for a Digital Library with Uncertain and Changing Data
  Gary Geisler (University of North Carolina, Chapel Hill), David McArthur, Sarah Giersch (Eduprise)
*Evaluation of DEFINDER: A System to Mine Definitions from Consumer-Oriented Medical Text
  Judith L. Klavans, Smaranda Muresan (Columbia University)

Papers (7B), continued:
SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
  Noah Green, Panagiotis G. Ipeirotis, Luis Gravano (Columbia University)
Database Selection for Processing k Nearest Neighbors Queries in Distributed Environments
  Clement Yu, Prasoon Sharma, Yan Qin (University of Illinois, Chicago), Weiyi Meng (State University of New York, Binghamton)

Panel (7C), continued - Panelists: James H. Lightbourne (National SMETE Digital Library (NSDL) Program, National Science Foundation); Walter L. Warnick (Department of Energy, Director, Office of Scientific and Technical Information)

12:30-13:30 Complimentary Lunch for all Registered Attendees in Roanoke CDH.

Session 8: 13:30-15:30

Papers (8A): The Sound of Digital Libraries: Audio, Music, and Speech (Roanoke Ballroom AB)
Moderator: Edie Rasmussen (University of Pittsburgh)
Building Searchable Collections of Enterprise Speech Data
  James W. Cooper (IBM T. J. Watson Research Center), Donna Byron (University of Rochester), Margaret Chan (Columbia University)
*Transcript-Free Search of Audio Archives at the National Gallery of the Spoken Word
  J.H.L. Hansen (University of Colorado, Boulder), J.R. Deller, Jr., M.S. Seadle (Michigan State University, East Lansing)
*Audio Watermarking Techniques for the National Gallery of the Spoken Word
  J.R. Deller, Jr., A. Gurijala, M.S. Seadle (Michigan State University, East Lansing)
Music-Notation Searching and Digital Libraries
  Donald Byrd (University of Massachusetts, Amherst)
*Feature Selection for Automatic Classification of Musical Instrument Sounds
  Mingchun Liu, Chunru Wan (Nanyang Technological University)
*Adding Content-Based Searching to a Traditional Music Library Catalogue Server
  Matthew J. Dovey (Kings College, London)

Papers (8B): Information Search and Retrieval in Digital Libraries (Roanoke Ballroom EFG)
Moderator: Nicholas Belkin (Rutgers University)
*Locating Question Difficulty through Explorations in Question Space
  Terry Sullivan (University of North Texas)
*Browsing by Phrases: Terminological Information in Interactive Multilingual Text Retrieval
  Anselmo Peñas, Julio Gonzalo, Felisa Verdejo (Universidad Nacional de Educación a Distancia)
*Approximate Ad-hoc Query Engine for Simulation Data
  Ghaleb Abdulla, Chuck Baldwin, Terence Critchlow, Roy Kamimura, Ida Lozares, Ron Musick, Nu Ai Tang (Lawrence Livermore National Laboratory), Byung S. Lee, Robert Snapp (University of Vermont, Burlington)
*Extracting Taxonomic Relationships from On-Line Definitional Sources Using LEXING
  Judith Klavans, Brian Whitman (Columbia University)
Hierarchical Indexing and Document Matching in BoW
  Maayan Geffet, Dror G. Feitelson (The Hebrew University)
Scalable Integrated Region-based Image Retrieval using IRM and Statistical Clustering
  James Z. Wang (Pennsylvania State University and Stanford University), Yanping Du (Pennsylvania State University)

Panel (8C): The National SMETE Digital Library Program (Crystal Ballroom ABC)
Moderator: Brandon Muramatsu (SMETE.ORG Project Director, University of California at Berkeley)
Panelists: James H. Lightbourne (National SMETE Digital Library (NSDL) Program, National Science Foundation); Cathryn A. Manduca (University Corporation for Atmospheric Research); Marcia Mardis (Merit Networks, TeacherLib NSDL Project); Flora P. McMartin (University of California at Berkeley)

Tuesday 15:30-16:00 Break. Refreshments in Crystal Foyer.

Session 9: 16:00-17:00 (Crystal Ballroom ABC / Roanoke Ballroom AB / Roanoke Ballroom EFG)
Keynote Address: Digital Rights Management: What Does This Mean For Libraries?
Pamela Samuelson (Professor of Information Management and of Law, University of California at Berkeley)
Introduction by Christine L. Borgman (UCLA)

18:00-22:00 Banquet and Reception at Virginia Museum of Transportation: Buffet dinner with Virginia wine bar; music by The Celtibillies. Presentation of Vannevar Bush Award around 21:00. Buses leave from Hotel Roanoke main entrance beginning at 18:00, and will be available for transport and return throughout the evening.

Directions for walkers: VMT is about five blocks from HRCC. Take the pedestrian bridge over the railroad tracks, then face back across the tracks toward the HRCC. Between you and the tracks is the beginning of Roanoke's interpretive Railwalk. Follow the Railwalk west (left when facing the tracks) to Warehouse Row. Pass warehouses on their left side (away from tracks), continue along Warehouse Row through the parking lot, and pass under the 2nd Street bridge at the Mercury / Redstone rocket. The Virginia Museum of Transportation is the long low brick building directly in front of you. Pass left of the building; the main entrance is about half way down the building.

Wednesday, June 27, 2001: Main Conference

7:30-9:00 Complimentary Continental Breakfast in Crystal Foyer

Session 10: 9:00-10:00 (Crystal Ballroom ABC / Roanoke Ballroom AB / Roanoke Ballroom EFG)
Keynote Address: Interoperability: The Still-Unfulfilled Promise of Networked Information
Clifford Lynch (Executive Director, Coalition for Networked Information)
Introduction by Erich J. Neuhold (GMD-IPSI)

10:00-10:30 Break. Refreshments in Crystal Foyer.

Session 11: 10:30-12:30

Papers (11A): Digital Video Libraries: Design and Access (Roanoke Ballroom AB)
Moderator: Neil Rowe (Naval Postgraduate School)
Cumulating and Sharing End Users Knowledge to Improve Video Indexing in a Video Digital Library
  Marc Nanard, Jocelyne Nanard (Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier)
XSLT for Tailored Access to a Digital Video Library
  Michael Christel, Bryan Maher, Andrew Begun (Carnegie Mellon University)
Design of a Digital Library for Human Movement
  Jezekiel Ben-Arie, Purvin Pandit, ShyamSundar Rajaram (University of Illinois, Chicago)
*A Bucket Architecture for the Open Video Project
  Michael L. Nelson (NASA Langley Research Center), Gary Marchionini, Gary Geisler, Meng Yang (University of North Carolina, Chapel Hill)
*The Fischlár Digital Video System: A Digital Library of Broadcast TV Programmes
  A.F. Smeaton, N. Murphy, N.E. O'Connor, S. Marlow, H. Lee, K. McDonald, P. Browne, J. Ye (Dublin City University)

Papers (11B): Systems Design and Architecture for Digital Libraries (Roanoke Ballroom EFG)
Moderator: Robin Williams (IBM Almaden Research Center)
Design Principles for the Information Architecture of a SMET Education Digital Library
  Andy Dong, Alice Agogino (University of California, Berkeley)
Toward A Model of Self-Administering Data
  B. Hoon Kang, Robert Wilensky (University of California, Berkeley)
PERSIVAL, a System for Personalized Search and Summarization over Multimedia Healthcare Information
  Kathleen R. McKeown, Shih-Fu Chang, James Cimino, Steven K. Feiner, Carol Friedman, Luis Gravano, Vasileios Hatzivassiloglou, Steven Johnson, Desmond A. Jordan, Judith L. Klavans, Vimla Patel, Simone Teufel (Columbia University), Andre Kushniruk (York University)
*An Approach to Search for Digital Libraries
  Elaine G. Toms, Joan C. Bartlett (University of Toronto)
*TilePic: A File Format for Tiled Hierarchical Data
  J. Anderson-Lee, R. Wilensky (University of California, Berkeley)

Panel (11C): High Tech or High Touch: Automation and Human Mediation in Libraries (Crystal Ballroom ABC)
Moderator: David Levy (Professor, Information School, University of Washington)
Panelists: William Arms (Cornell University); Oren Etzioni (Associate Professor, Department of Computer Science and Engineering, University of Washington); Diane Nester Kresh (Director, Public Services Collections, Library of Congress); Barbara Tillett (Library of Congress)

Wednesday 12:30-13:30 Complimentary Lunch for all Registered Attendees in Roanoke CDH.

Session 12: 13:30-15:00

Papers (12A): Digital Preservation: Technology, Economics, and Policy (Roanoke Ballroom AB)
Moderator: Michael Seadle (Michigan State University, East Lansing)
Long Term Preservation of Digital Information
  Raymond A. Lorie (IBM Almaden Research Center)
Creating Trading Networks of Digital Archives
  Brian Cooper, Hector Garcia-Molina (Stanford University)
Cost-driven Design for Archival Repositories
  Arturo Crespo, Hector Garcia-Molina (Stanford University)

Papers (12B): Scholarly Communication and Digital Libraries (Roanoke Ballroom EFG)
Moderator: Marianne Afifi (University of Southern California)
Hermes: A Notification Service for Digital Libraries
  Daniel Faensen, Lukas Faulstich, Heinz Schweppe, Annika Hinze, Alexander Steidinger (Freie Universität Berlin)
An Algorithm for Automated Rating of Reviewers
  Tracy Riggs, Robert Wilensky (University of California, Berkeley)
HeinOnline: An Online Archive of Law Journals
  Richard J. Marisa (Cornell University)

Panel (12C): Digital Libraries Supporting Digital Government (Crystal Ballroom ABC)
Moderator: Gary Marchionini (Cary C. Boshamer Professor, School of Library and Information Science, University of North Carolina at Chapel Hill)
Panelists: Lawrence E. Brandt (Program Manager for Digital Government Research, National Science Foundation); Hsinchun Chen (University of Arizona); Anne Craig (Illinois State Library); Judith Klavans (Columbia University)

Wednesday 15:00-15:30 Break. Refreshments in Crystal Foyer.

Session 13: 15:30-17:00

Papers (13A): Designing Digital Libraries for Education: Technology, Services and User Studies (Roanoke Ballroom EFG)
Moderator: Joyce Ray (Institute of Museum and Library Services)
Designing a Digital Library for Young Children: An Intergenerational Partnership
  Allison Druin, Ben Bederson, Juan Pablo Hourcade, Lisa Sherman, Glenda Revelle, Michele Platner, Stacy Weng (University of Maryland, College Park)
Dynamic Digital Libraries for Children
  Yin Leng Theng, Norliza Mohd-Nasir, George Buchanan, Bob Fields, Harold Thimbleby (Middlesex University), Noel Cassidy (St. Albans School)
Looking at Digital Library Usability from a Reuse Perspective
  Tamara Sumner, Melissa Dawe (University of Colorado, Boulder)

Papers (13B): Applications of Digital Libraries in the Humanities (Roanoke Ballroom AB)
Moderator: Sally Howe (National Coordination Office for IT Research & Development)
Building a Hypertextual Digital Library in the Humanities: A Case Study on London (Winner, Vannevar Bush Award)
  Gregory Crane, David A. Smith, Clifford E. Wulfman (Tufts University)
*Document Quality Indicators and Corpus Editions
  Jeffrey A. Rydberg-Cox (University of Missouri, Kansas City), Anne Mahoney, Gregory R. Crane (Tufts University)
Digital Atheneum: New Approaches for Preserving, Restoring, and Analyzing Damaged Manuscripts
  Michael S. Brown, W. Brent Seales (University of Kentucky, Lexington)
*Towards an Electronic Variorum Edition of Don Quixote
  Richard Furuta, Shueh-Cheng Hu, Siddarth Kalasapur, Rajiv Kochumman, Eduardo Urbina, Ricardo Vivancos-Perez (Texas A&M University)

Panel (13C): Digital Music Libraries Research and Development (Crystal Ballroom ABC)
Moderator: Christine Brancolini (Director, Indiana University Digital Library Program)
Panelists: David Bainbridge (University of Waikato); Mary Wallace Davidson (William and Gayle Cook Music Library, Indiana University); Andrew P. Dillon (School of Library and Information Science, Indiana University); Matthew Dovey (Libraries Automation Service, University of Oxford); Jon W. Dunn (Digital Library Program, Indiana University); Michael Fingerhut (Centre Pompidou); Ichiro Fujinaga (Peabody Conservatory of Music, Johns Hopkins University); Eric J. Isaacson (School of Music, Indiana University)

Thursday, June 28, 2001: Workshops

7:30-9:00 Complimentary Continental Breakfast for Registered Workshop Participants in Crystal Foyer

8:00-17:00: Workshops

Visual Interfaces to Digital Libraries - Their Past, Present, and Future - Roanoke Ballroom B
Katy Börner (Assistant Professor, Information Science & Cognitive Science, Indiana University, SLIS)
Chaomei Chen (Reader, Department of Information Systems and Computing; Director, The VIVID Research Centre, Brunel University)

The Technology of Browsing Applications - Roanoke Ballroom GH
Nina Wacholder (Center for Research on Information Access, Columbia University)
Craig Nevill-Manning (Computer Science, Rutgers University)

Classification Crosswalks: Bringing Communities Together - Roanoke Ballroom EF
Gail Hodge (Information International Associates, Inc./National Biological Information Infrastructure)
Paul Thompson (West Group)
Diane Vizine-Goetz (Senior Research Scientist, OCLC Office of Research)
Marcia Lei Zeng (Associate Professor, School of Library and Information Science, Kent State University)

12:00-13:00 Complimentary Lunch for Registered Workshop Participants in Roanoke D. Box Lunches for "The Technology of Browsing Applications" will be delivered to Roanoke GH.

Participants in the IMLS and NSF DLI2 Workshops, please pick up separate schedules when you register.

Emergency Contact Information

Most conference personnel can be found at the JCDL Conference Office, Bent Mountain Room, HRCC.

General Chair: Edward Fox, Dept. of Computer Science, Virginia Tech, Blacksburg, VA, USA (fox@vt.edu)
Program Chair: Christine Borgman, Dept. of Information Studies, UCLA, Los Angeles, CA, USA (cborgman@ucla.edu)
Treasurer: Neil Rowe, Computer Science Dept., Naval Postgraduate School, Monterey, California, USA (rowe@cs.nps.navy.mil)
Panels Chair: Gene Golovchinsky, FX Palo Alto Laboratory, Inc., Palo Alto, CA, USA (gene@pal.xerox.com)
Demos Chair: Ray Larson, University of California at Berkeley, Berkeley, California, USA (ray@sims.berkeley.edu)
Tutorials Chair: Jonathan Furner, Information Studies, UCLA, Los Angeles, CA, USA (jfurner@ucla.edu)
Posters Chair: Craig Nevill-Manning, Dept. of Computer Science, Rutgers University, Piscataway, NJ, USA (nevill@cs.rutgers.edu)
Workshops Chair: Marianne Afifi, Center for Scholarly Technology, Information Services Division, University of Southern California, Los Angeles, CA, USA (afifi@usc.edu)
Student Volunteers Chair: Ghaleb Abdulla, Lawrence Livermore National Lab, Livermore, CA, USA (abdulla1@llnl.gov)
Networking Coordinator: Unmil Karadkar, Texas A&M University, College Station, Texas, USA (unmil@csdl.tamu.edu)
A/V Coordinator: Luis Francisco-Revilla, Texas A&M University, College Station, Texas, USA
New Attendees, Doctoral Students: Allison L. Powell, Dept. of Computer Science, University of Virginia, Charlottesville, Virginia, USA ([email protected])
Registration: Jim French, Dept. of Computer Science, University of Virginia, Charlottesville, VA, USA (french@cs.virginia.edu)
Local Arrangements: Robert France, Digital Library Research Laboratory, Virginia Tech, Blacksburg, VA, USA (france@vt.edu)
NSDL Support: Brandon Muramatsu, University of California at Berkeley, Berkeley, California, USA ([email protected])
Webmaster: Fernando Das-Neves, Dept. of Computer Science, Virginia Tech, Blacksburg, VA, USA (fdasneve@vt.edu)
Sponsoring and Exhibiting: Michael L. Nelson, School of Information and Library Science, University of North Carolina, Chapel Hill, NC, USA (mln@ils.unc.edu)
Publicity: Edie Rasmussen, School of Information Sciences, University of Pittsburgh, Pittsburgh, PA, USA (erasmus@mail.sis.pitt.edu)
European Liaison: Constantino Thanos, Istituto di Elaborazione della Informazione, Consiglio Nazionale delle Ricerche, Italy ([email protected])

JCDL 2001 Papers (http://www.acm.org/jcdl/jcdl01/printer/papers.html)

Papers

Note: Long papers are assigned a 30 minute period, while short papers (shown by an asterisk * preceding the title) have a 15 minute period.

Paper Presentation Schedule (this is not the complete conference schedule)

Time         | Monday June 25 | Tuesday June 26    | Wednesday June 27
9:00-10:30   |                | 6A, 6B             |
10:30-11:00  | Break          | Break              | 11A, 11B (10:30-12:30)
11:00-12:30  | 3A, 3B         | 7A, 7B             |
12:30-1:30   | Lunch          | Lunch              | Lunch
1:30-3:00    | 4A, 4B         | 8A, 8B (1:30-3:30) | 12A, 12B
3:30-5:00    | 5A, 5B         |                    | 13A, 13B

MONDAY, June 25

Paper Session 3A (90min): Methods for Classifying and Organizing Content in Digital Libraries (11:00am-12:30pm)

Integrating Automatic Genre Analysis into Digital Libraries
Andreas Rauber, Alexander Mueller-Koegler (Vienna University of Technology)

Text Categorization for Multi-page Documents: A Hybrid Naive Bayes HMM Approach
Paolo Frasconi, Giovanni Soda, Alessandro Vullo (University of Florence)

*Automated Name Authority Control
James W. Warner, Elizabeth W. Brown (Johns Hopkins University)

*Automatic Event Generation From Multi-lingual News Stories
Kin Hui, Wai Lam, Helen M. Meng (The Chinese University of Hong Kong)

Paper Session 3B (90min): Digital Libraries for Education: Technology, Services, and User Studies (11:00am-12:30pm)

Linked Active Content: A Service for Digital Libraries for Education
David Yaron, D. Jeff Milton, Rebecca Freeland (Carnegie Mellon University)

A Component Repository for Learning Objects: A Progress Report
Jean R. Laleuf, Anne Morgan Spalter (Brown University)

Designing e-Books for Legal Research
Catherine C. Marshall, Morgan N. Price, Gene Golovchinsky, Bill N. Schilit (FX Palo Alto Laboratory)

Paper Session 4A (90min): Approaches to Interoperability Among Digital Libraries (1:30pm-3:00)

*Mapping the Interoperability Landscape for Networked Information Retrieval
William E. Moen, Teresa Lepchenske (University of North Texas)

*Distributed Resource Discovery: Using Z39.50 to Build Cross-Domain Information Servers
Ray R. Larson (University of California, Berkeley)

The Open Archives Initiative
Carl Lagoze, Herbert Van de Sompel (Cornell University)

*Enforcing Interoperability with the Open Archives Initiative Repository Explorer
Hussein Suleman (Virginia Tech)

*Arc - An OAI Service Provider for Cross-Archive Searching


Xiaoming Liu, Kurt Maly, Mohammad Zubair, Michael L. Nelson (Old Dominion University)

Paper Session 4B (90min): Digital Libraries and the Web: Technology and Trust (1:30pm-3:00)

Managing Change on the Web
Luis Francisco-Revilla, Frank Shipman, Richard Furuta, Unmil Karadkar, Avital Arora (Texas A&M University)

*Measuring the Reputation of Web Sites: A Preliminary Exploration
Greg Keast, Joan Cherry, Elaine G. Toms (University of Toronto)

Personalized Spiders for Web Search and Analysis
Michael Chau, Daniel Zeng, Hsinchun Chen (University of Arizona, Tucson)

*Salticus: Guided Crawling for Personal Digital Libraries
Robin Burke (California State University, Fullerton)

Paper Session 5A (90min): Tools for Constructing and Using Digital Libraries (3:30pm-5:00pm)

Power to the People: End-User Building of Digital Library Collections
Ian H. Witten, David Bainbridge, Stefan J. Boddie (University of Waikato)

*Web-Based Scholarship: Annotating the Digital Library
Bruce Rosenstock, Michael Gertz (University of California, Davis)

A Multi-View Intelligent Editor for Digital Video Libraries
Brad A. Myers, Juan P. Casares, Scott Stevens, Laura Dabbish, Dan Yocum, Albert Corbett (Carnegie Mellon University)

*VideoGraph: A New Tool for Video Mining and Classification
Jia-Yu Pan, Christos Faloutsos (Carnegie Mellon University)

Paper Session 5B (90min): Systems Design and Evaluation for Undergraduate Learning Environments (3:30pm-5:00pm)

*The Alexandria Digital Earth Prototype System
T. R. Smith, G. Janee, J. Frew, A. Coleman, N. Faust (University of California, Santa Barbara)

*Iscapes: Digital Library Environments to Promote Scientific Thinking by Undergraduates in Geography
Anne J. Gilliland-Swetland, Gregory H. Leazer (University of California, Los Angeles)

*Project ANGEL: An Open Virtual Learning Environment with Sophisticated Access Management
John MacColl (University of Edinburgh)

*NBDL: A CIS Framework for NSDL
Joe Futrelle, C. Kevin Chang (University of Illinois, Urbana-Champaign), Su-Shing Chen (University of Missouri, Columbia)

Automatic Identification and Organization of Index Terms for Interactive Browsing
Nina Wacholder, David K. Evans, Judith L. Klavans (Columbia University)

TUESDAY, June 26

Paper Session 6A (90min): Studying the Users of Digital Libraries: Formative and Summative Evaluations (9:00am-10:30)

Public Use of Digital Community Information Systems: Findings from a Recent Study with Implications for System Design
Karen E. Pettigrew (University of Washington, Seattle), Joan C. Durrance (University of Michigan, Ann Arbor)

*Evaluating the Distributed National Electronic Resource
Peter Brophy, Shelagh Fisher (The Manchester Metropolitan University)

*Collaborative Design with Use Case Scenarios
Lynne Davis (University Corporation for Atmospheric Research), Melissa Dawe (University of Colorado, Boulder)

Human Evaluation of Kea, an Automatic Keyphrasing System


Steve Jones, Gordon W. Paynter (University of Waikato)

Paper Session 6B (90min): Digital Library Collections: Policies and Practices (9:00am-10:30)

Community Design of DLESE's Collections Review Policy: A Technological Frames Analysis
Michael Khoo (University of Colorado, Boulder)

Legal Deposit of Digital Publications: A Review of Research and Development Activity
Adrienne Muir (Loughborough University)

*Comprehensive Access to Printed Materials (CAPM)
G. Sayeed Choudhury, Mark Lorie, Erin Fitzpatrick, Ben Hobbs, Greg Chirikjian, Allison Okamura (Johns Hopkins University), Nick Flores (University of Colorado, Boulder)

*Technology and Values: Lessons from Central and Eastern Europe
Nadia Caidi (University of Toronto)

Paper Session 7A (90min): Studying the Users of Digital Libraries: Qualitative Approaches (11:00am-12:30pm)

Use of Multiple Digital Libraries: A Case Study
Ann Blandford, Hanna Stelmaszewska (Middlesex University), Nick Bryan-Kinns (Icon MediaLab London)

An Ethnographic Study of Technical Support Workers: Why We Didn't Build a Tech Support Digital Library
Sally Jo Cunningham, Chris Knowles (University of Waikato), Nina Reeves (Cheltenham and Gloucestershire College of Higher Education)

*Developing Recommendation Services for a Digital Library with Uncertain and Changing Data
Gary Geisler (University of North Carolina, Chapel Hill), David McArthur, Sarah Giersch (Eduprise)

*Evaluation of DEFINDER: A System to Mine Definitions from Consumer-Oriented Medical Text
Judith L. Klavans, Smaranda Muresan (Columbia University)

Paper Session 7B (90min): Techniques for Managing Distributed Collections (11:00am-12:30pm)

*An Overview of the Virtual Data Center: A Complete, Open-Source, Digital Library for Distributed Collections of Quantitative Data
Micah Altman, L. Andreev, G. King, E. Kolster, A. Sone, S. Verba (Harvard University), D.L. Kiskis, M. Krott (University of Michigan, Ann Arbor)

*Digital Libraries and Data Scholarship
Bruce R. Barkstrom (NASA Langley Research Center)

SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
Noah Green, Panagiotis G. Ipeirotis, Luis Gravano (Columbia University)

Database Selection for Processing k Nearest Neighbors Queries in a Distributed Environment
Clement Yu, Prasoon Sharma, Yan Qin (University of Illinois, Chicago), Weiyi Meng (State University of New York, Binghamton)

Paper Session 8A (2hr): The Sound of Digital Libraries: Audio, Music, and Speech (1:30pm-3:30)

Building Searchable Collections of Enterprise Speech Data
James W. Cooper (IBM T. J. Watson Research Center), Donna Byron (University of Rochester), Margaret Chan (Columbia University)

*Transcript-Free Search of Audio Archives at the National Gallery of the Spoken Word
J.H.L. Hansen (University of Colorado, Boulder), J.R. Deller, Jr., M.S. Seadle (Michigan State University, East Lansing)

*A Watermarking Strategy for Audio Archives at the National Gallery of the Spoken Word
J.R. Deller, Jr., A. Gurijala, M.S. Seadle (Michigan State University, East Lansing)

Music-Notation Searching and Digital Libraries
Donald Byrd (University of Massachusetts, Amherst)

*Feature Selection for Automatic Classification of Musical Instrument Sounds
Mingchun Liu, Chunru Wan (Nanyang Technological University)

*Adding Content-Based Searching to a Traditional Music Library Catalogue Server
Matthew J. Dovey (Kings College, London)


Paper Session 8B (2hr): Information Search and Retrieval in Digital Libraries (1:30pm-3:30)

*Locating Question Difficulty through Explorations in Question Space
Terry Sullivan (University of North Texas)

*Browsing by Phrases: Terminological Information in Interactive Multilingual Text Retrieval
Anselmo Peñas, Julio Gonzalo, Felisa Verdejo (Universidad Nacional de Educación a Distancia)

*Approximate Ad-hoc Query Engine for Simulation Data
Ghaleb Abdulla, Chuck Baldwin, Terence Critchlow, Roy Kamimura, Ida Lozares, Ron Musick, Nu Ai Tang (Lawrence Livermore National Laboratory), Byung S. Lee, Robert Snapp (University of Vermont, Burlington)

*Extracting Taxonomic Relationships from On-Line Definitional Sources Using LEXING
Judith Klavans, Brian Whitman (Columbia University)

Hierarchical Indexing and Document Matching in BoW
Maayan Geffet, Dror G. Feitelson (The Hebrew University)

Scalable Integrated Region-based Image Retrieval using IRM and Statistical Clustering
James Z. Wang, Yanping Du (Pennsylvania State University)

WEDNESDAY, June 27

Paper Session 11A (2hr): Digital Video Libraries: Design and Access (10:30am-12:30pm)

Cumulating and Sharing End Users Knowledge to Improve Video Indexing in a Video Digital Library
Marc Nanard, Jocelyne Nanard (Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier)

XSLT for Tailored Access to a Digital Video Library
Michael Christel, Bryan Maher, Andrew Begun (Carnegie Mellon University)

Design of a Digital Library for Human Movement
Jezekiel Ben-Arie, Purvin Pandit, ShyamSundar Rajaram (University of Illinois, Chicago)

*A Bucket Architecture for the Open Video Project
Michael L. Nelson (NASA Langley Research Center), Gary Marchionini, Gary Geisler, Meng Yang (University of North Carolina, Chapel Hill)

*The Fischlár Digital Video System: A Digital Library of Broadcast TV Programmes
A.F. Smeaton, N. Murphy, N.E. O'Connor, S. Marlow, H. Lee, K. McDonald, P. Browne, J. Ye (Dublin City University)

Paper Session 11B (2hr): Systems Design and Architecture for Digital Libraries (10:30am-12:30pm)

Design Principles for the Information Architecture of a SMET Education Digital Library
Andy Dong, Alice Agogino (University of California, Berkeley)

Toward A Model of Self-administering Data
B. Hoon Kang, Robert Wilensky (University of California, Berkeley)

PERSIVAL, a System for Personalized Search and Summarization over Multimedia Healthcare Information
Kathleen R. McKeown, Shih-Fu Chang, James Cimino, Steven K. Feiner, Carol Friedman, Luis Gravano, Vasileios Hatzivassiloglou, Steven Johnson, Desmond A. Jordan, Judith L. Klavans, Andre Kushniruk, Vimla Patel, Simone Teufel (Columbia University)

*An Approach to Search for Digital Libraries
Elaine G. Toms, Joan C. Bartlett (University of Toronto)

*TilePic: A File Format for Tiled Hierarchical Data
J. Anderson-Lee, R. Wilensky (University of California, Berkeley)

Paper Session 12A (90min): Digital Preservation: Technology, Economics, and Policy (1:30pm-3:00)

Long Term Preservation of Digital Information
Raymond A. Lorie (IBM Almaden Research Center)

Creating Trading Networks of Digital Archives
Brian Cooper, Hector Garcia-Molina (Stanford University)


Cost-driven Design for Archival Repositories
Arturo Crespo, Hector Garcia-Molina (Stanford University)

Paper Session 12B (90min): Scholarly Communication and Digital Libraries (1:30pm-3:00)

Hermes: A Notification Service for Digital Libraries
Daniel Faensen, Lukas Faulstich, Heinz Schweppe, Annika Hinze, Alexander Steidinger (Freie Universität Berlin)

An Algorithm for Automated Rating of Reviewers
Tracy Riggs, Robert Wilensky (University of California, Berkeley)

HeinOnline: An Online Archive of Law Journals
Richard J. Marisa (Cornell University)

Paper Session 13A (90min): Designing Digital Libraries for Education: Technology, Services and User Studies (3:30pm-5:00)

Designing a Digital Library for Young Children: An Intergenerational Partnership
Allison Druin, Ben Bederson, Juan Pablo Hourcade, Lisa Sherman, Glenda Revelle, Michele Platner, Stacy Weng (University of Maryland, College Park)

Dynamic Digital Libraries For Children
Yin-Leng Theng, Norliza Mohd-Nasir, George Buchanan, Bob Fields, Harold Thimbleby (Middlesex University)

Looking at Digital Library Usability from a Reuse Perspective
Tamara Sumner, Melissa Dawe (University of Colorado, Boulder)

Paper Session 13B (90min): Applications of Digital Libraries in the Humanities (3:30pm-5:00)

Building a Hypertextual Digital Library in the Humanities: A Case Study on London (Winner of the JCDL 2001 Vannevar Bush Paper Award)
Gregory Crane, David A. Smith (Tufts University)

*Document Quality Indicators and Corpus Editions
Jeffrey A. Rydberg-Cox (University of Missouri, Kansas City), Anne Mahoney, Gregory R. Crane (Tufts University)

Digital Atheneum: New Approaches for Preserving, Restoring, and Analyzing Damaged Manuscripts
Michael S. Brown, W. Brent Seales (University of Kentucky, Lexington)

*Towards an Electronic Variorum Edition of Don Quixote
Richard Furuta, Shueh-Cheng Hu, Siddarth Kalasapur, Rajiv Kochumman, Eduardo Urbina, Ricardo Vivancos-Perez (Texas A&M University)

ACM Digital Library: Proceedings of the first ACM/IEEE-CS joint conference on Digital libraries


ACM/IEEE-CS Joint Conference on Digital Libraries
Proceedings of the first ACM/IEEE-CS joint conference on Digital libraries
June 24-28, 2001, Roanoke, VA USA

Table of Contents

For full text in PDF, use Adobe Acrobat Reader.

Integrating automatic genre analysis into digital libraries
Andreas Rauber and Alexander Mueller-Koegler
Pages 1-10 (full text: PDF, 657 KB)

Text categorization for multi-page documents: a hybrid naive Bayes HMM approach
Paolo Frasconi, Giovanni Soda and Alessandro Vullo
Pages 11-20 (full text: PDF, 273 KB)

Automated name authority control
James W. Warner and Elizabeth W. Brown
Pages 21-22

Automatic event generation from multi-lingual news stories
Kin Hui, Wai Lam and Helen M. Meng
Pages 23-24 (full text: PDF, 138 KB)

[ Discuss this Article I Find Related ArticlesI Add to Binder ] 20 BEST COPYAVAILABLE http://www.acm.org/pubs/contents/proceedings/d1/379437/ (1 of 20) [10/9/01 12:43:02 PM] ACM Digital Library: Proceedings of the first ACM/IEEE-CS joint conference on Digital libraries

Linked active content: a service for digital libraries for education David Yaron, D. Jeff Milton and Rebecca Freeland Pages 2532 metadata: abstract -0: index ter full text:ig PDF 1474 KB

[ Discuss this Article I Find Related ArticlesI Add to Binder ] A component repository for learning objects: a progress report Jean R. Laleuf and Anne Morgan Spalter Pages 3340 metadata: V abstract full text:°° PDF 684 KB

[ Discuss this Article I Find Related Articles I Add to Binder ] Designing e-books for legal research Catherine C. Marshall, Morgan N. Price, Gene Golovchinsky and Bill N. Schilit Pages 41 48 metadata: a bstra ct full text: E PDF 350 KB

[ Discuss this Article I Find Related ArticlesI Add to Binder ] The open archives initiative (panel session): perspectives on metadata harvesting James B. Lloyd, Tim Cole, Donald Waters, Caroline Arms, Simeon Warner and Jeffrey Young Page 49 metadata:Ea bstra ct full text: PDF 63 KB

[ Discuss this Article I Find Related ArticlesI Add to Binder ] Mapping the interoperability landscape for networked information retrieval William E. Moen Pages 50 51 metadata:FT:1 abstract RAindex terms full text: PDF 120 KB

[ Discuss this Article I Find Related ArticlesI Add to Binder ] Distributed resource discovery: using z39.50 to build cross-domain information servers Ray R. Larson Pages 5253 metadata: 7- a bstra ct full text:E.PDF 106 KB

[ Discuss this Article I Find Related ArticlesI Add to Binder ]

The open archives initiative: building a low-barrier interoperability framework. Carl Lagoze and Herbert Van de Sompel. Pages 54-62.
Enforcing interoperability with the open archives initiative repository explorer. Hussein Suleman. Pages 63-64.
Arc: an OAI service provider for cross-archive searching. Xiaoming Liu, Kurt Maly, Mohammad Zubair and Michael L. Nelson. Pages 65-66.
Managing change on the web. Luis Francisco-Revilla, Frank Shipman, Richard Furuta, Unmil Karadkar and Avital Arora. Pages 67-76.
Measuring the reputation of web sites: a preliminary exploration. Greg Keast, Elaine G. Toms and Joan Cherry. Pages 77-78.
Personalized spiders for web search and analysis. Michael Chau, Daniel Zeng and Hsinchun Chen. Pages 79-87.
Salticus: guided crawling for personal digital libraries. Robin Burke. Pages 88-89.
Different cultures meet (panel session): lessons learned in global digital library development. Ching-chih Chen, Wen Gao, Hsueh-hua Chen, Li-Zhu Zhou and Von-Wun Soo. Pages 90-93.
Power to the people: end-user building of digital library collections. Ian H. Witten, David Bainbridge and Stefan J. Boddie. Pages 94-103.
Web-based scholarship: annotating the digital library. Bruce Rosenstock and Michael Gertz. Pages 104-105.
A multi-view intelligent editor for digital video libraries. Brad A. Myers, Juan P. Casares, Scott Stevens, Laura Dabbish, Dan Yocum and Albert Corbett. Pages 106-115.
The Alexandria digital earth prototype. Terence R. Smith, Greg Janee, James Frew and Anita Coleman. Pages 118-119.
Iscapes: digital libraries environments for the promotion of scientific thinking by undergraduates in geography. Anne J. Gilliland-Swetland and Gregory L. Leazer. Pages 120-121.
Project ANGEL: an open virtual learning environment with sophisticated access management. John MacColl. Pages 122-123.
NBDL: a CIS framework for NSDL. Joe Futrelle, Su-Shing Chen and Kevin C. Chang. Pages 124-125.
Automatic identification and organization of index terms for interactive browsing. Nina Wacholder, David K. Evans and Judith L. Klavans. Pages 126-134.
Digital library collaborations in a world community. David Fulker, Sharon Dawes, Leonid Kalinichenko, Tamara Sumner, Constantino Thanos and Alex Ushakov. Page 135.
Public use of digital community information systems: findings from a recent study with implications for system design. Karen E. Pettigrew and Joan C. Durrance. Pages 136-143.

Evaluating the distributed national electronic resource. Peter Brophy and Shelagh Fisher. Pages 144-145.
Collaborative design with use case scenarios. Lynne Davis and Melissa Dawe. Pages 146-147.
Human evaluation of Kea, an automatic keyphrasing system. Steve Jones and Gordon W. Paynter. Pages 148-156.
Community design of DLESE's collections review policy: a technological frames analysis. Michael Khoo. Pages 157-164.
Legal deposit of digital publications: a review of research and development activity. Adrienne Muir. Pages 165-173.
Comprehensive access to printed materials (CAPM). G. Sayeed Choudhury, Mark Lorie, Erin Fitzpatrick, Ben Hobbs, Greg Chirikjian, Allison Okamura and Nicholas E. Flores. Pages 174-175.
Technology and values: lessons from central and eastern Europe. Nadia Caidi. Pages 176-177.
A digital strategy for the Library of Congress. Alan Inouye, Margaret Hedstrom, Dale Flecker and David Levy. Page 178.
Use of multiple digital libraries: a case study. Ann Blandford, Hanna Stelmaszewska and Nick Bryan-Kinns. Pages 179-188.
An ethnographic study of technical support workers: why we didn't build a tech support digital library. Sally Jo Cunningham, Chris Knowles and Nina Reeves. Pages 189-198.
Developing recommendation services for a digital library with uncertain and changing data. Gary Geisler, David McArthur and Sarah Giersch. Pages 199-200.
Evaluation of DEFINDER: a system to mine definitions from consumer-oriented medical text. Judith L. Klavans and Smaranda Muresan. Pages 201-202.
Overview of the virtual data center project and software. Micah Altman, L. Andreev, M. Diggory, G. King, E. Kolster, A. Sone, S. Verba, Daniel Kiskis and M. Krot. Pages 203-204.
Digital libraries and data scholarship. Bruce R. Barkstrom. Pages 205-206.
SDLIP + STARTS = SDARTS: a protocol and toolkit for metasearching. Noah Green, Panagiotis G. Ipeirotis and Luis Gravano. Pages 207-214.
Database selection for processing k nearest neighbors queries in distributed environments. Clement Yu, Prasoon Sharma, Weiyi Meng and Yan Qin. Pages 215-222.
The president's information technology advisory committee's February 2001 digital library report and its impact. Sally E. Howe, David C. Nagel, Ching-chih Chen, Stephen M. Griffin, James Lightbourne and Walter L. Warnick. Pages 223-225.
Building searchable collections of enterprise speech data. James W. Cooper, Mahesh Viswanathan, Donna Byron and Margaret Chan. Pages 226-234.

Transcript-free search of audio archives for the national gallery of the spoken word. John H. L. Hansen, J. R. Deller and Michael S. Seadle. Pages 235-236.
Audio watermarking techniques for the national gallery of the spoken word. J. R. Deller, Aparna Gurijala and Michael S. Seadle. Pages 237-238.
Music-notation searching and digital libraries. Donald Byrd. Pages 239-246.
Feature selection for automatic classification of musical instrument sounds. Mingchun Liu and Chunru Wan. Pages 247-248.
Adding content-based searching to a traditional music library catalogue server. Matthew J. Dovey. Pages 249-250.
Locating question difficulty through explorations in question space. Terry Sullivan. Pages 251-252.
Browsing by phrases: terminological information in interactive multilingual text retrieval. Anselmo Peñas, Julio Gonzalo and Felisa Verdejo. Pages 253-254.
ApproXimate ad-hoc query engine for simulation data. Ghaleb Abdulla, Chuck Baldwin, Terence Critchlow, Roy Kamimura, Ida Lozares, Ron Musick, Nu Ai Tang, Byung S. Lee and Robert Snapp. Pages 255-256.
Extracting taxonomic relationships from on-line definitional sources using LEXING. Judith Klavans and Brian Whitman. Pages 257-258.
Hierarchical indexing and document matching in BoW. Maayan Geffet and Dror G. Feitelson. Pages 259-267.
Scalable integrated region-based image retrieval using IRM and statistical clustering. James Z. Wang and Yanping Du. Pages 268-277.
The national SMETE digital library program (panel session). Brandon Muramatsu, Cathryn A. Manduca, Marcia Mardis, James H. Lightbourne and Flora P. McMartin. Pages 278-281.
Cumulating and sharing end users knowledge to improve video indexing in a video digital library. Marc Nanard and Jocelyne Nanard. Pages 282-289.
XSLT for tailored access to a digital video library. Michael G. Christel, Bryan Maher and Andrew Begun. Pages 290-299.
Design of a digital library for human movement. Jezekiel Ben-Arie, Purvin Pandit and ShyamSundar Rajaram. Pages 300-309.
A bucket architecture for the open video project. Michael L. Nelson, Gary Marchionini, Gary Geisler and Meng Yang. Pages 310-311.
The Fischlár digital video system: a digital library of broadcast TV programmes. A. F. Smeaton, N. Murphy, N. E. O'Connor, S. Marlow, H. Lee, K. McDonald, P. Browne and J. Ye. Pages 312-313.
Design principles for the information architecture of a SMET education digital library. Andy Dong and Alice M. Agogino. Pages 314-321.

Toward a model of self-administering data. ByungHoon Kang and Robert Wilensky. Pages 322-330.
PERSIVAL, a system for personalized search and summarization over multimedia healthcare information. Kathleen R. McKeown, Shih-Fu Chang, James Cimino, Steven Feiner, Carol Friedman, Luis Gravano, Vasileios Hatzivassiloglou, Steven Johnson, Desmond A. Jordan, Judith L. Klavans, André Kushniruk, Vimla Patel and Simone Teufel. Pages 331-340.
An approach to search for the digital library. Elaine G. Toms and Joan C. Bartlett. Pages 341-342.
TilePic: a file format for tiled hierarchical data. Jeff Anderson-Lee and Robert Wilensky. Pages 343-344.
High tech or high touch (panel session): automation and human mediation in libraries. David Levy, William Arms, Oren Etzioni, Diane Nester and Barbara Tillett. Page 345.
Long term preservation of digital information. Raymond A. Lorie. Pages 346-352.
Creating trading networks of digital archives. Brian Cooper and Hector Garcia-Molina. Pages 353-362.
Cost-driven design for archival repositories. Arturo Crespo and Hector Garcia-Molina. Pages 363-372.
Hermes: a notification service for digital libraries. D. Faensen, L. Faulstich, H. Schweppe, A. Hinze and A. Steidinger. Pages 373-380.
An algorithm for automated rating of reviewers. Tracy Riggs and Robert Wilensky. Pages 381-387.
HeinOnline. Richard J. Marisa. Pages 388-394.
Digital libraries supporting digital government. Gary Marchionini, Anne Craig, Larry Brandt, Judith Klavans and Hsinchun Chen. Pages 395-397.

Designing a digital library for young children. Allison Druin, Benjamin B. Bederson, Juan Pablo Hourcade, Lisa Sherman, Glenda Revelle, Michele Platner and Stacy Weng. Pages 398-405.
Dynamic digital libraries for children. Yin Leng Theng, Norliza Mohd-Nasir, George Buchanan, Bob Fields, Harold Thimbleby and Noel Cassidy. Pages 406-415.
Looking at digital library usability from a reuse perspective. Tamara Sumner and Melissa Dawe. Pages 416-425.
Building a hypertextual digital library in the humanities: a case study on London. Gregory Crane, David A. Smith and Clifford E. Wulfman. Pages 426-434.
Document quality indicators and corpus editions. Jeffrey A. Rydberg-Cox, Anne Mahoney and Gregory R. Crane. Pages 435-436.
The digital atheneum: new approaches for preserving, restoring and analyzing damaged manuscripts. Michael S. Brown and W. Brent Seales. Pages 437-443.
Towards an electronic variorum edition of Don Quixote. Richard Furuta, Shueh-Cheng Hu, Siddarth Kalasapur, Rajiv Kochumman, Eduardo Urbina and Ricardo Vivancos. Pages 444-445.
Digital music libraries - research and development. David Bainbridge, Gerry Bernbom, Mary Wallace, Andrew P. Dillon, Matthew Dovey, Jon W. Dunn, Michael Fingerhut, Ichiro Fujinaga and Eric J. Isaacson. Pages 446-448.
Content management for digital museum exhibitions. Jen-Shin Hong, Bai-Hsuen Chen, Jieh Hsiang and Tien-Yu Hsu. Page 450.
Demonstration of hierarchical document clustering of digital library retrieval results. C. R. Palmer, J. Pesenti, R. E. Valdes-Perez, M. G. Christel, A. G. Hauptmann, D. Ng and H. D. Wactlar. Page 451.
Indiana university digital music library project. Jon W. Dunn and Eric J. Isaacson. Page 452.
Interactive visualization of video metadata. Mark Derthick. Page 453.
PERSIVAL demo: categorizing hidden-web resources. Panagiotis G. Ipeirotis, Luis Gravano and Mehran Sahami. Page 454.
PERSIVAL: personalized summarization over multimedia health-care information. Noemie Elhadad, Min-Yen Kan, Simon Lok and Smaranda Muresan. Page 455.
View segmentation and static/dynamic summary generation for echocardiogram videos. Shahram Ebadollahi and Shih-Fu Chang. Page 456.
Stanford encyclopedia of philosophy: a dynamic reference work. Edward N. Zalta, Colin Allen and Uri Nodelman. Page 457.
A system for adding content-based searching to a traditional music library catalogue server. Matthew J. Dovey. Page 458.
Using the repository explorer to achieve OAI protocol compliance. Hussein Suleman. Page 459.

An atmospheric visualization collection for the NSDL. Christopher Klaus and Keith Andrew. Page 463.
Breaking the metadata generation bottleneck: preliminary findings. Elizabeth D. Liddy, Stuart Sutton, Woojin Paik, Eileen Allen, Sarah Harwell, Michelle Monsour, Anne Turner and Jennifer Liddy. Page 464.
Building the physical sciences information infrastructure, a phased approach. Judy C. Gilmore and Valerie S. Allen. Page 465.
Development of an earth environmental digital library system for soil and land-atmospheric data. Eiji Ikoma, Taikan Oki and Masaru Kitsuregawa. Page 466.
Digital facsimile editions and on-line editing. Harry Plantinga. Page 467.
DSpace at MIT: meeting the challenges. Michael J. Bass and Margret Branschofsky. Page 468.
Exploiting image semantics for picture libraries. Kobus Barnard and David Forsyth. Page 469.
Feature extraction for content-based image retrieval in DARWIN. K. R. Debure and A. S. Russell. Page 470.
Guided linking: efficiently making image-to-transcript correspondence. Cheng Jiun Yuan and W. Brent Seales. Page 471.
Integrating digital libraries by CORBA, XML and Servlet. Wing Hang Cheung, Michael R. Lyu and Kam Wing Ng. Page 472.
A national digital library for undergraduate mathematics and science teacher preparation and professional development. Kimberly S. Roempler. Page 473.
Print to electronic: measuring the operational and economic implications of an electronic journal collection. Carol Hansen and Linda S. Marion. Page 474.
Turbo recognition: decoding page layout. Taku A. Tokuyasu. Page 475.
Using Markov models and innovation-diffusion as a tool for predicting digital library access and distribution. Bruce R. Barkstrom. Page 476.
A versatile facsimile and transcription service for manuscripts and rare old books at the Miguel de Cervantes digital library. Alejandro Bia. Page 477.
The virtual naval hospital: the digital library as knowledge management tool for nomadic patrons. Michael P. D'Alessandro, Richard S. Bakalar, Donna M. D'Alessandro, Denis E. Ashley and Mary J. C. Hendrix. Page 478.
Workshop 1: visual interfaces to digital libraries - its past, present, and future. Katy Börner and Chaomei Chen. Page 482.
Workshop 2: the technology of browsing applications. Nina Wacholder and Craig Nevill-Manning. Page 483.
Workshop 3: classification crosswalks. Paul Thompson, Traugott Koch, John Carter, Heike Neuroth, Ed O'Neill and Dagobert Soergel. Page 484.
Workshop 4: digital libraries in Asian languages. Su-Shing Chen and Ching-chih Chen. Page 485.
Workshop 5: information visualization for digital libraries: defining a research agenda for heterogeneous multimedia collections. Lucy Nowell and Elizabeth Hetzler. Page 486.
VideoGraph: a new tool for video mining and classification. Jia-Yu Pan and Christos Faloutsos.

For full text in PDF, use Adobe Acrobat Reader.

The Digital Library is published by the Association for Computing Machinery. Copyright © 1999, 2000 ACM, Inc. This page was last updated Mon, 30 Jul 2001 14:24 -0500.



Integrating Automatic Genre Analysis into Digital Libraries

Andreas Rauber, Alexander Müller-Kögler
Department of Software Technology, Vienna University of Technology
Favoritenstr. 9-11/188
www.ifs.tuwien.ac.at/ifs

ABSTRACT

With the number and types of documents in digital library systems increasing, tools for automatically organizing and presenting the content have to be found. While many approaches focus on topic-based organization and structuring, hardly any system incorporates automatic structural analysis and representation. Yet, genre information (unconsciously) forms one of the most distinguishing features in conventional libraries and in information searches. In this paper we present an approach to automatically analyze the structure of documents and to integrate this information into an automatically created content-based organization. In the resulting visualization, documents on similar topics, yet representing different genres, are depicted as books in differing colors. This representation supports users intuitively in locating relevant information presented in a relevant form.

Keywords

Genre Analysis, Self-Organizing Map (SOM), SOMLib, Document Clustering, Visualization, Metaphor Graphics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. JCDL'01, June 24-28, 2001, Roanoke, Virginia, USA. Copyright 2001 ACM 1-58113-345-6/01/0006 ...$5.00.

1. INTRODUCTION

While the question of what a document is about has been recognized as being crucial for presenting relevant information to a user, the question of how a given piece of information is presented is largely neglected by most present electronic information systems. Yet, this type of information is mostly unconsciously used in almost any contact with information in everyday life. Personal letters are treated differently than mass-mailings, a short story is read on different occasions than long novels, popular science literature addresses a different readership than dissertations or scientific papers, both of which themselves will provide highly similar information at differing levels of detail for different audiences. Specific sub-genres, such as, for example, executive summaries or technical reports, were even specifically designed to satisfy the same information need, i.e. to provide information about a given topic, in different ways. Whenever looking for information, these issues are taken into account, and they form one of the most important distinguishing features in conventional libraries, together with other non-content based information such as the age of a document, the fact whether it looks like it is being used frequently or remains untouched for long periods of time, and many others.

As long as a digital library can be cared for in a way similar to how conventional libraries are organized, this type of information is carefully captured in the form of metadata descriptions and provided to the user, albeit mostly in rather inconvenient, not intuitive textual form. Yet, with this information available, ways for more intuitive representations can be devised. A different situation is encountered in many less-controlled digital library settings, where pieces of information from different sources are integrated, or the mere amount of information added to a repository effectively prevents it from being manually indexed and described. For these settings an automatic analysis of the structure of a given piece of information is essential to allow the user to quickly find the correct document, not only in terms of the content provided, but also with respect to the way this content is presented.

In this paper we present a way to provide automatic analysis of the structure of text documents. This analysis is based on a combination of various surface level features of texts, such as word statistics, punctuation information, the occurrences of special characters and keywords, as well as mark-up tags capturing image, equation, hyperlink and similar information. Based on these structural descriptions of documents, the self-organizing map (SOM) [12], a popular unsupervised neural network, is used to cluster documents according to their structural similarities. This information is incorporated into the SOMLib digital library system [17], which provides an automatic, topic-based organization of documents, using again the self-organizing map to group documents according to their content. The libViewer, a metaphor-graphical interface to the SOMLib system, depicts the documents in a digital library as hardcover or paperback books, binders, or papers, sorted by content into various bookshelves, labeled by automatically extracted content descriptors using the LabelSOM technique. Integrating the results of the structural analysis of documents allows us to color the documents, which are sorted by subject into the various shelves, according to their structural similarities, making e.g. complex descriptions stand apart from summaries or legal explanations on the same subject. Similarly, interviews on a given topic are depicted differently from reports, as are numerical tables or result listings. We demonstrate the benefits of an automatic structural analysis of documents in combination with content-based classification using a collection of news articles from several Austrian daily, weekly and monthly news magazines.

The remainder of this paper is structured as follows. Section 2 provides a brief introduction into the principles of genre analysis and presents a review of related work in this area. We proceed by presenting the architecture and training procedure of the self-organizing map in Section 3. The application of the SOM to content-based document classification is presented in Section 4, including an overview of the various modules of the SOMLib digital library system with a special focus on the metaphor-graphics based libViewer interface. We then present our approach to structural classification of documents and its integration with the content-based representation provided by the SOMLib system in Section 5. Section 6 presents our experimental results using a collection of newspaper articles, reporting on content-based organization, structural classification, and their integration. Some conclusions as well as future work are listed in Section 7.
2. GENRE ANALYSIS

Genre analysis has a long history in linguistic literature. Conventionally, genre is associated with terms such as short stories, science fiction, novels of the 17th or 18th century, fiction, reports, satire, and many others. Still, the definition of genre is somewhat vague. According to Webster's Dictionary of English Language, genre is defined as a category of artistic, musical, or literary composition characterized by a particular style, form, or content. Although differing definitions may be found, the main goal of genre analysis is to identify certain subgroups within a set of given objects that share a common form of transmission, purpose, and discourse properties. Basically, the term genre can be applied to most forms of communications, although it is frequently restricted to non-interactive and, for the scope of this paper, literary information, excluding music or film genres. While the common interpretation of genre refers to literary styles, such as fiction, novel, letter, manuals, etc., automatic analysis of genres takes a slightly different approach, focusing on structural analysis using surface level cues as the main structural similarity between documents, from which genre-style information is deduced.

Several approaches have been taken to evaluate the structure or readability of text documents, resulting in numerous different measures for grading texts automatically based on surface features. Many of these features are readily available in various implementations of the Unix STYLE command [6]. Among the measures included in this package is the Kincaid Formula, which is targeted towards technical material, having been developed for Navy training manuals. The Flesch Reading Ease formula stems from 1948 and is based on English school texts covering grades 3 to 12. A similar measure is the SMOG-Grading, or the WSTF Index, which has been developed specifically for German texts. All these measures basically compute their score by combining information about the number of words and syllables per sentence as well as characters per word statistics, weighted by various constants, to obtain the according grades.

More complex stylistic analyses can be found in the seminal work of Biber [1, 2]. He uses metrics such as pronoun counts and general text statistics to cluster texts in order to find underlying dimensions of variation and to detect general properties of genres.

More recently, classification of text documents by genre has been analyzed by Karlgren et al. [10]. Again, a number of different features are used to describe the structural characteristics of documents. However, in addition to the standard surface cues, features requiring syntactic parsing and tagged texts, such as required for noun counts, present participle counts etc., were included. Discriminant analysis is used to obtain a set of discriminant functions based on a pre-categorized training set. This line of research is continued in [8], reporting in detail on the various features used for stylistic analysis. The stylistic variations of documents are further visualized as scatter plots based on combinations of two features. Specific areas are then (manually) assigned special genre-type descriptors to help users with analyzing the clusters of documents found in the scatter plot.

Recognizing the importance of integrating genre analysis into a content-based information retrieval process, the DropJaw interface [3, 9] incorporates genre-based classification using C4.5-based decision trees into content-based clustering using a hierarchical agglomerative group-average clustering algorithm. Documents are then represented in a two-dimensional matrix, with the rows representing the topical clusters found in the document set, whereas the columns organize these documents according to a number of genres the decision tree was trained to recognize.

A different approach, describing documents by a number of facets rather than directly assigning a genre, is reported in [11]. A facet is a property which distinguishes a class of texts that answers to certain practical interests, and which is associated with a characteristic set of computable structural or linguistic properties. Three principal categorical facets are analyzed. Brow characterizes a text with respect to the intellectual background required to understand it, subdivided into popular, middle, upper-middle, and high. A binary narrative facet decides whether a text is written in a narrative style, and the third facet, genre, classifies a text either as reportage, editorial, scitech, legal, nonfiction, or fiction. A set of 55 lexical, character-level and derivative cues are used to describe the documents, and logistic regression is used to create a classifier based on a training set of 402 manually classified texts.

In [21], Ries applies genre classification to spontaneous spoken conversations, including features such as pauses in the conversation as well as histograms of a number of keywords, using a backpropagation-type neural network for the subsequent analysis.

It is interesting to note that, although unsupervised methods are frequently used for content-based analysis of information, most current research work turns to supervised models when it comes to the analysis of genre. This might be due to the fact that people tend to think in terms of well-defined genres, rather than in terms of structurally similar documents. Still, we find documents to frequently exhibit characteristics of several different genres to differing degrees. This is the more so as for hardly any genre there is a strict and well-defined, non-overlapping set of criteria by which it can be described, making strict classification as impossible a task as strict content-based classification. Similar as for content-based document organization, unsupervised cluster analysis of genre-oriented document descriptions should be able to capture the structural similarities accordingly.
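For illustration, the two best-known of the readability grades mentioned above can be computed from nothing more than word, sentence and syllable statistics. The following minimal sketch (Python) implements the Flesch Reading Ease score and the Kincaid grade with their standard constants; the vowel-group syllable counter is a rough assumption for illustration, not the Unix STYLE implementation.

    import re

    def count_syllables(word):
        # Rough heuristic: each run of consecutive vowels counts as one syllable.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        wps = len(words) / max(1, len(sentences))   # words per sentence
        spw = syllables / max(1, len(words))        # syllables per word
        return {
            # Flesch Reading Ease: higher scores indicate easier texts
            "flesch": 206.835 - 1.015 * wps - 84.6 * spw,
            # (Flesch-)Kincaid grade level, developed for Navy training manuals
            "kincaid": 0.39 * wps + 11.8 * spw - 15.59,
        }

    print(readability("The cat sat on the mat. It was a sunny day."))

Both grades are linear combinations of the same two statistics, which is exactly why they can be treated as condensed variants of the basic surface measures.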


3. THE SELF-ORGANIZING MAP

The self-organizing map [13] provides cluster analysis by producing a mapping of high-dimensional input data x, x ∈ R^n, onto a usually 2-dimensional output space while preserving the topological relationships between the input data items as faithfully as possible. This model consists of a set of units, which are arranged in some topology where the most common choice is a two-dimensional grid. Each of the units i is assigned a weight vector m_i of the same dimension as the input data, m_i ∈ R^n, initialized with random values. During each learning step, the unit c with the highest activity level, i.e. the winner c with respect to a randomly selected input pattern x, is adapted in a way that it will exhibit an even higher activity level at future presentations of that specific input pattern. Commonly, the activity level of a unit is based on the Euclidean distance between the input pattern and that unit's weight vector: the unit showing the lowest Euclidean distance between its weight vector and the presented input vector is selected as the winner. Hence, the selection of the winner c may be written as given in Expression (1):

    c : ||x - m_c|| = min_i ||x - m_i||    (1)

Adaptation takes place at each learning iteration and is performed as a gradual reduction of the difference between the respective components of the input vector and the weight vector. The amount of adaptation is guided by a learning rate α that is gradually decreasing in the course of time. This decreasing nature of adaptation strength ensures large adaptation steps in the beginning of the learning process, where the weight vectors have to be tuned from their random initialization towards the actual requirements of the input space. The ever smaller adaptation steps towards the end of the learning process enable a fine-tuned input space representation.

As an extension to standard competitive learning, units in a time-varying and gradually decreasing neighborhood around the winner are adapted, too. Pragmatically speaking, during the learning steps of the self-organizing map a set of units around the winner is tuned towards the currently presented input pattern, enabling a spatial arrangement of the input patterns such that alike inputs are mapped onto regions close to each other in the grid of output units. Thus, the training process of the self-organizing map results in a topological ordering of the input patterns.

The neighborhood of units around the winner may be described implicitly by means of a (Gaussian) neighborhood-kernel h_ci taking into account the distance, in terms of the output space, between unit i under consideration and unit c, the winner of the current learning iteration. This neighborhood-kernel assigns scalars in the range of [0, 1] that are used to determine the amount of adaptation, ensuring that nearby units are adapted more strongly than units farther away from the winner.

It is common practice that in the beginning of the learning process the neighborhood-kernel is selected large enough to cover a wide area of the output space. The spatial width of the neighborhood-kernel is reduced gradually during the learning process, such that towards the end of the learning process just the winner itself is adapted. This strategy enables the formation of large clusters in the beginning and fine-grained input discrimination towards the end of the learning process.

Combining these principles of self-organizing map training, we may write the learning rule as given in Expression (2). Note that we make use of a discrete time notation, with t denoting the current learning iteration; α represents the time-varying learning rate, h_ci the time-varying neighborhood-kernel, x the currently presented input pattern, and m_i the weight vector assigned to unit i:

    m_i(t+1) = m_i(t) + α(t) · h_ci(t) · [x(t) - m_i(t)]    (2)

A simple graphical representation of a self-organizing map's architecture and its learning process is provided in Figure 1. In this figure the output space consists of a square of 25 units, depicted as circles. One input vector x(t) is randomly chosen and mapped onto the grid of output units. In the next step of the learning process, the winner c showing the highest activation is selected. Consider the winner being the unit depicted as the black unit in the figure. The weight vector of the winner, m_c(t), is now moved towards the current input vector; this movement is symbolized in the input space in Figure 1. As a consequence of the adaptation, unit c will produce an even higher activation with respect to input pattern x at the next learning iteration, t+1, because the unit's weight vector, m_c(t+1), is now nearer to the input pattern x in terms of the input space. Apart from the winner, adaptation is performed with neighboring units, too. Units that are subject to adaptation are depicted as shaded units in the figure. The shading of the various units corresponds to the amount of adaptation and thus to the spatial width of the neighborhood-kernel. Generally, units in close vicinity of the winner are adapted more strongly, and consequently they are depicted with a darker shade in the figure.

[Figure 1: SOM architecture and training process, showing the input space and the output space]
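To make the training procedure concrete, the following minimal sketch implements Expressions (1) and (2) for a square grid with a Gaussian neighborhood-kernel; the linear decay schedules and all parameter values are illustrative assumptions, not those used by the SOMLib system.

    import numpy as np

    def train_som(data, grid=5, iters=1000, a0=0.5, s0=2.0, seed=0):
        # data: array of shape (n_samples, dim); returns the unit weight vectors m_i
        rng = np.random.default_rng(seed)
        weights = rng.random((grid * grid, data.shape[1]))   # random initialization
        # 2-D grid coordinates of each unit, used for output-space distances
        coords = np.array([(r, c) for r in range(grid) for c in range(grid)])
        for t in range(iters):
            x = data[rng.integers(len(data))]                # random input x(t)
            # Expression (1): the winner c minimizes the Euclidean distance to x
            c = np.argmin(np.linalg.norm(weights - x, axis=1))
            alpha = a0 * (1 - t / iters)                     # decaying learning rate
            sigma = s0 * (1 - t / iters) + 0.1               # shrinking neighborhood width
            d2 = ((coords - coords[c]) ** 2).sum(axis=1)     # grid distances to winner
            h = np.exp(-d2 / (2 * sigma ** 2))               # Gaussian kernel h_ci(t)
            # Expression (2): m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * [x(t) - m_i(t)]
            weights += alpha * h[:, None] * (x - weights)
        return weights

    # Example: map 200 random 10-dimensional vectors onto a 5 x 5 grid
    som = train_som(np.random.default_rng(1).random((200, 10)))

Because the kernel h shrinks over time, early iterations move whole regions of the map while late iterations fine-tune individual units, mirroring the coarse-to-fine behavior described above.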


4. THE SOMLIB SYSTEM

4.1 Feature extraction

In order to utilize the SOM for organizing documents by their topic, a vector-based description of the content of the documents needs to be created. While manually or semi-automatically extracted content descriptors may be used, research results have shown that a rather simple word frequency based description is sufficient to provide the necessary information in a very stable way [4, 14, 15, 19]. For this word frequency based representation a vector structure is created consisting of all words appearing in the document collection. This list of words is usually cleaned from so-called stop-words, i.e. words that do not contribute to content representation and topic discrimination between documents. Again, while manually crafted stop-word lists may be used, simple statistics allow the removal of most stop-words in a very convenient and language- or subject-independent way. On the one hand, words appearing in too many documents, e.g. in more than half of all documents, can be removed without the risk of losing content information, as the content conveyed by these words is too general. On the other hand, words appearing in less than a minimum number of, say, 5 to 10 documents can be omitted for content-based classification, as the resulting sub-topic granularity would be too small to form a topic cluster of its own. Note that the situation is different in the information retrieval domain, where rather specific terms need to be indexed to facilitate retrieval of a very specific subset of documents. In this respect, content-based organization and browsing of documents constitutes a conceptually different approach to accessing and interacting with document archives by browsing topical hierarchies. This obviously has to be supplemented by various searching facilities, including information retrieval capabilities as they are currently realized in many systems.

The documents are described by the words they are made up of, within the resulting feature space of commonly between 5,000 and 15,000 dimensions, i.e. distinct terms. While a basic binary indexing may be used to describe the content of a document by simply stating whether a word appears in the document or not, more sophisticated schemes, such as tf × idf, i.e. term frequency times inverse document frequency [22], provide a better content representation. This weighting scheme assigns higher values to terms that appear frequently within a document, i.e. have a high term frequency, yet rarely within the complete collection, i.e. have a low document frequency. Usually, the document vectors are normalized to unit length to make up for length differences of the various documents. A sketch of this feature extraction follows after the next subsection.

4.2 Topic-based organization

The resulting vector representations are fed into a standard self-organizing map for cluster analysis. As a result, documents on similar topics are located on neighboring units in the two-dimensional map display. In the simplest form, a document collection may then be represented as a rectangular table with similar documents being mapped onto the same cells. Using this model, users find a document collection to be automatically structured by content, in a way similar to how documents are organized into shelves in conventional libraries. Due to its capabilities of automatically structuring a document collection by subject, we have chosen the SOM as the basic building block of our SOMLib digital library system [19]. Enhanced models of the SOM, such as the growing hierarchical self-organizing map (GHSOM), further allow the automatic detection of topical hierarchies by creating a layered structure of independent SOMs that adapt their size accordingly [20].
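The feature extraction of Section 4.1 can be sketched as follows. The whitespace tokenizer and the return format are illustrative assumptions, while the pruning thresholds (terms in more than half of all documents, or in fewer than 5) follow the description above.

    import math
    from collections import Counter

    def tfidf_vectors(docs, min_df=5, max_df_ratio=0.5):
        # docs: list of document strings; returns (vocabulary, unit-length vectors)
        n = len(docs)
        tokenized = [doc.lower().split() for doc in docs]
        df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
        # Statistical stop-word removal: drop over- and under-frequent terms
        vocab = sorted(t for t, f in df.items() if min_df <= f <= max_df_ratio * n)
        vectors = []
        for toks in tokenized:
            tf = Counter(toks)
            # tf x idf: frequent in the document, rare in the collection
            vec = [tf[t] * math.log(n / df[t]) for t in vocab]
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])  # normalize to unit length
        return vocab, vectors

The resulting unit-length vectors are exactly what is fed into the SOM for the topic-based organization of Section 4.2.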
4.3 Labeling

While the SOM found wide appreciation in the field of text classification, its application had been limited by the fact that the topics of the various clusters were not evident from the resulting mapping. In order to find out which topics are covered in certain areas of the map, the actual articles had to be read to find descriptive keywords for a cluster. To counter this problem, we developed the LabelSOM method, which analyses the trained SOM to automatically extract a set of attributes, i.e. keywords, that are most descriptive for a unit [16]. Basically, the attributes showing a low quantization error value and a high weight vector value, comparable to a low variance and a high mean among all input vectors mapped onto a specific unit, are selected as labels. Thus, the various units are characterized by keywords describing the topics of the documents mapped onto them. A simplified sketch of this selection is given at the end of this section.

4.4 Visualization

Last, but not least, while the spatial organization of documents on the 2-dimensional map in combination with the automatically extracted concept labels supports orientation in and understanding of an unknown document repository, much information on the documents cannot be told from the resulting representation. Information like the size of the underlying document, its type, the date it was created, when it was accessed for the last time and how often it has been accessed at all, its language etc. is not provided in an intuitively interpretable way. Rather, users are required to read and abstract from textual descriptions, inferring the amount or recent-ness of information provided by a given document by comparing size and date information.

We thus developed the libViewer, a metaphor-graphics based interface to a digital library [18]. Documents are no longer represented as textual listings, but as graphical objects of different representation types such as binders, papers, hardcover books, paperbacks etc., with further metadata information being conveyed by additional metaphors such as spine width, logos, well-thumbed spines, different degrees of dustiness, highlighting glares, position in the shelf and others. Based on these metaphors we can define a set of mappings of metadata attributes to be visualized, allowing the easy understanding of documents, similar to the usage of Chernoff faces for multidimensional space representation [5]. Figure 2 provides an example of the visualization of documents in a digital library using the libViewer interface. Documents are depicted using different document type representations, with additional metadata being conveyed by their position, color, spine width, the logos and textual information depicted on the spine, dust or highlighting glares, well-thumbed bindings and others. Metadata information about documents thus can be easily interpreted and compared across a collection, with a larger amount of information being represented as compared to standard textual descriptions.

[Figure 2: libViewer visualization of a digital library]
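The LabelSOM selection of Section 4.3 can be approximated as in the sketch below, assuming the input vectors mapped onto each unit are already known. How the two criteria (low quantization error, high weight-vector value) are combined into a single ranking here is an illustrative choice, not necessarily the published method.

    import numpy as np

    def label_unit(weight, mapped, vocab, n_labels=5):
        # weight: weight vector m_i of the unit; mapped: non-empty array of the
        # input vectors mapped onto this unit; vocab: attribute (term) names
        mapped = np.asarray(mapped)
        # Per-attribute quantization error: mean deviation from the weight vector
        qerr = np.abs(mapped - weight).mean(axis=0)
        # Prefer attributes with a high weight value and a low quantization error
        score = weight - qerr
        top = np.argsort(score)[::-1][:n_labels]
        return [vocab[i] for i in top]

Attributes scoring highly under both criteria correspond to terms that are consistently present, with similar weight, in all documents of the unit, which is what makes them suitable labels.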


5.1.1 Surface level cues

For the set of features we have to restrict ourselves to those features that can be automatically extracted from documents at acceptable computational cost. Furthermore, similar to content-based classification, we want the process to be as language- and domain-independent as possible to allow flexible application of the system in different settings. Basically, four distinct types of features can be distinguished, which are (1) text complexity information and text statistics, (2) special character and punctuation counts, (3) characteristic keywords, and (4) format-specific mark-ups.

5.1.2 Text complexity measures

Text complexity measures are based on word statistics such as the average number of words per sentence, the average word length, the number of sentences and paragraphs, and similar formatting information. While these are rather simple metrics, the complexity of certain constructs turns out to be captured quite well by these measures, especially if combined with other characteristics such as punctuation marks. More extensive measures analyzing the nesting depth of sentences may be considered, although we did not integrate them into the experiments presented below. Furthermore, instead of using the basic measures, derived measures like any of the existing readability grade methods, such as Kincaid, Coleman-Liau, the Wheeler-Smith Index, or the Flesch Index, may be used as more condensed representations. Still, these formulas are all based on the above-mentioned basic measures anyway, using various transformations to obtain graded representations according to stylistic evaluation parameters. We thus refrained from using those in our initial experiments, although they may be considered for follow-up evaluations.

5.1.3 Special character and punctuation counts

A wealth of stylistic information can be obtained from specific character counts. As most prominent among these we consider punctuation marks, as some of them are rather characteristic for certain text genres. For example, the presence of exclamation marks (!) is highly indicative of more emotional texts as opposed to pure fact reports in a news magazine setting. Interviews exhibit a rather high count of question marks and colons, whereas high counts for semicolons or commas are indicative of more complex sentence structures, especially if co-occurring with rather high sentence lengths. Otherwise, if co-occurring with rather short sentences, they might rather be attributed to listings and enumerations of information items. Similarly, we have to analyze the occurrence of quotation marks, hyphens, periods, apostrophes, slash marks, various brackets, and others. More complex punctuation information might be worth considering if it can be feasibly extracted from the given texts, such as ellipsis points, differences between single and double quotation marks, or the usage of dashes versus hyphens, especially if the semantic context can be deduced. However, as these more complex features could not be extracted from the given text base integrating different sources, we had to refrain from representing this level of structural information for the given experiments.

Other characters worth incorporating into stylistic analysis are financial symbols like $ or the euro sign. Furthermore, numbers as well as mathematical symbols, copyright and paragraph signs are worth including as they do hint at special categories of information, be it, for example, technical discussions, price listings, legal information, or simply examples to clarify and expand on a given topic.
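A rough sketch of how such surface-level statistics (Sections 5.1.2 and 5.1.3) can be extracted is given below; the chosen character set and the naive sentence splitter are simplifying assumptions, not the feature parser actually used.

```python
# Sketch of surface-level feature extraction: text complexity measures
# plus special character and punctuation counts.
import re

PUNCT = list(".,;:!?-'\"/()[]{}") + ["$", "@", "%", "&"]

def surface_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    feats = {
        "n_sentences": len(sentences),
        "n_words": len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
    }
    for ch in PUNCT:                       # raw counts per special character
        feats["count_" + ch] = text.count(ch)
    return feats

print(surface_features('He said: "We won!" Final score: 3:1.')["count_:"])
# -> 3; a high colon and quote count is typical for interviews and listings
```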

5.1.4 Stopwords and keywords

Contrary to the principles of content representation, a lot of genre information is conveyed by stop-words such as pronouns or adverbs. Thus, a list of characteristic words is added to the list of features, containing words such as I, you, we, me, us, mine, yours or much, little, large, very, highly, probably, mostly, certainly, which, that, where, and others. Please note that this list is language dependent, yet may be easily adapted for most languages. (These adaptations may be rather complex for specific languages using word inflections rather than specific keywords for some characteristic sentence structures.) Depending on the specific document collection to be considered and the desired focus of genre analysis, additional keywords may be added to that list to facilitate, for example, the recognition of fact-reporting versus opinion-modifying articles, or the separation of speculative articles.

5.1.5 Mark-up tags

The fourth group of features is formed by specific mark-up tags that are used to extract information about the document. Using such mark-up tags, information such as the number of images present in a given document, the number of tables and equations, references, etc. can be extracted and included in the genre analysis. These obviously have to be adapted to the actual formatting of the source documents in such a way that they are consistently mapped, such as, e.g., mapping the \begin{figure} and the IMG SRC tags onto the image count feature for LaTeX and HTML documents, respectively.

Overall, the parser currently recognizes almost 200 different features, some of which are specifically geared towards a specific file format, such as HTML documents, whereas others are generally applicable. Depending on the goal of the structure analysis, only a subset of the available attributes may be selected for further analysis.
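The following sketch illustrates how format-specific mark-up (Section 5.1.5) can be mapped consistently onto abstract features such as an image count; the tag lists are illustrative and stand in for the roughly 200 features of the actual parser.

```python
# Sketch: map format-specific tags onto shared abstract features.
import re

MARKUP_FEATURES = {
    "n_images": {"latex": [r"\\begin\{figure\}"], "html": [r"<img\b"]},
    "n_tables": {"latex": [r"\\begin\{table\}"],  "html": [r"<table\b"]},
}

def markup_features(text, fmt):
    feats = {}
    for feature, patterns in MARKUP_FEATURES.items():
        feats[feature] = sum(
            len(re.findall(p, text, flags=re.IGNORECASE))
            for p in patterns.get(fmt, []))
    return feats

print(markup_features("<p>hi</p><img src='a.png'><IMG src='b.png'>", "html"))
# -> {'n_images': 2, 'n_tables': 0}
```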
The resulting document descriptions are further fed into a self-organizing map for training. As a result of the training process, documents with similar stylistic characteristics are located close to each other on neighboring units of the SOM. We may thus find longer documents with rather complex sentences in one area of the map, whereas in another area interviews, characterized by shorter sentences with a high count of colons and quotation marks, might be located. While this map may now serve as a kind of genre-based access to the document archive, it needs to be integrated into the content-based library representation to support users in their information finding process.

5.2 Integrating content and structure based classification

Two possibilities offer themselves for the integrated representation of content- and structure-based organization within the SOMLib system. We can either choose a semi-automatic approach by assigning a specific document representation type to a certain region on the genre map, such as, for example, assigning all documents in the upper right area of this map the representation type binder, whereas all documents in the area on the lower left part of the map may be depicted as hardcover documents. This approach allows a very intuitive representation of documents in the content-based representation if a sensible assignment of the genres identified by the map to the available representation types can be defined. However, this approach has shown to have several deficiencies when applied to large and unknown document collections. Firstly, the number of available representation types is rather limited as opposed to the number of different document types present in any collection. While the total number of 4 representation types available in the libViewer system so far may be supplemented with additional object types showing, for example, additional bindings, the total possible number still will be rather limited. Furthermore, as the available representation types are strictly distinct from each other, no gradual shift from one type of document to the other can be conveyed, thus actually forcing the documents to be definitely of one genre or the other. Providing more subtle information about a document's general structure, which might be well in between two specific genres, would be more appropriate and highly preferable.

Secondly, a rather high manual effort is required to analyze the actual genres identified by the SOM to provide a sensible mapping of genre areas on the map onto the respective graphical metaphors. Yet, the precise mapping is crucial for the usability of the classification result, as users will associate a specific type of information with a certain representation metaphor. The assignment of a wrong representation template may thus turn out to be highly counterproductive for information location.

We thus favour an automatic approach for integrating the information provided by the genre map into the content-based visualization using the color metaphor. Documents of similar structure shall be assigned a similar color to allow intuitive recognition and interpretation of structural similarities. This metaphor turns out to be almost perfectly suited for conveying the desired information as it does not transport any specific meaning in the given setting by itself, as opposed to the realistic document representation types, which are intuitively associated with a certain kind of information.

A rather straight-forward mapping of the position on the genre map onto a specific color is realized by mapping the rectangular map area onto a plane of the RGB color cube, similar to the color-coding technique for cluster identification in SOMs [7]. Thus, documents mapped onto neighboring units on the genre map will be depicted in similar colors, allowing easy recognition of mutual similarity in style as well as depicting even gradual transitions between the various structural clusters. On the other hand, documents in different regions on the genre map, exhibiting a clearly distinct structure, are thus depicted in different colors on the content-based libViewer visualization.
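A minimal sketch of such a position-to-color mapping via bilinear interpolation between four corner colors follows; the concrete corner assignment (black, green, red, yellow) anticipates the example given in Section 6.3.2, and it is an assumption that this simple scheme matches the color-coding technique of [7].

```python
# Sketch: map a genre-map unit position onto a plane of the RGB color cube
# by bilinear interpolation between four assumed corner colors.
def position_to_rgb(row, col, rows, cols):
    u = col / (cols - 1)                # 0 .. 1, left to right
    v = row / (rows - 1)                # 0 .. 1, top to bottom
    corners = {                         # (v, u) corner: RGB
        (0, 0): (0, 0, 0),              # upper left:  black
        (0, 1): (0, 255, 0),            # upper right: green
        (1, 0): (255, 0, 0),            # lower left:  red
        (1, 1): (255, 255, 0),          # lower right: yellow
    }
    rgb = []
    for ch in range(3):
        top = (1 - u) * corners[(0, 0)][ch] + u * corners[(0, 1)][ch]
        bot = (1 - u) * corners[(1, 0)][ch] + u * corners[(1, 1)][ch]
        rgb.append(round((1 - v) * top + v * bot))
    return tuple(rgb)

print(position_to_rgb(0, 0, 5, 10))     # (0, 0, 0): the upper left unit
```

Neighboring units thus receive nearly identical colors, which is exactly what makes gradual stylistic transitions visible on the shelves.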

6. EXPERIMENTS

6.1 Data set

Various series of experiments have been performed in different settings, including technical documents and web site analysis. For the experiments presented below we created a collection of news reports by downloading the web editions of 14 daily, weekly, or monthly Austrian newspapers and print magazines. This setting exhibits several characteristics typical of digital libraries that cannot be tended to manually as carefully as necessary. Information from different sources having different internal classification schemata is integrated. As the majority of documents stems from daily newspapers, the number of articles to be organized is too large to allow manual classification. Furthermore, while the topics covered by the various sources overlap, the perspectives from which these issues are presented differ. To a large degree this can be attributed to the general genre of a source, such as newspapers and magazines specializing in economic issues, but also, to a large degree, to the style of report chosen. In many situations we will find the same topic to be covered by a news report as well as by an interview or a column, or we find the same issue covered both in the general news as well as in, say, the economic section of a paper.

A cleansing procedure was implemented for each data source to automatically remove characteristic formatting structures of the various sources such as banners, footers, or navigation bars, as these would unduly interfere with the stylistic analysis. Furthermore, different HTML encodings for special characters were converted to a uniform representation. The results reported below are based on a subset of the entire collection consisting of 1,000 articles from March 2000. To keep the system as flexible and generally applicable as possible, no language- or domain-specific optimizations, such as stemming or the use of specific stop-word lists, were performed. The articles were parsed to create both a content-based and a structure-based description of the documents, which were further fed into two separate self-organizing maps for cluster analysis and representation. Due to space restrictions we cannot provide detailed representations of the according maps. Rather, we have selected representative clusters for detailed discussion.

6.2 Content-based organization

A 5 x 10 SOM was used for topical organization of the articles based on a 1,975-dimensional feature vector representation. The main topical clusters identified in the collection are, on the one hand, economic articles, which consist of several subclusters, such as a rather dominant group of articles relating to the telecom business, or the privatization of Austria's state-owned enterprises. This cluster is located in the lower left corner of the resulting SOM. On the opposite, upper right corner we find mostly articles covering political issues, such as the discussions concerning the formation of the new government following the 1999 elections. This political cluster basically covers the whole left area of the content-based SOM, moving from the initial elections-based discussions to the various political topics. Another prominent cluster is formed by sports reports covering soccer, Formula 1, and horse races, to name a few. Other, smaller clusters address different areas of science, with two of the more prominent sub-clusters among these being devoted to medicine and internet technologies.

Using the LabelSOM method, appropriate labels were automatically extracted, describing the various topical clusters. (The keywords have been translated into English for discussion in the following sections.) We find, for example, one of the clusters representing articles on Austria's Freedom Party to be labelled with fp, joerg, haider, haiders, listing the party's abbreviation as well as the name of its political leader. The labels for this unit also demonstrate one of the weaknesses of the crude indexing approach chosen. As we do not apply any language-specific stemming techniques, the trailing genitive-s causes the term haider to appear in two forms. Yet, this impreciseness does not cause distortions to the resulting content representation and organization, although language-specific adaptations would further improve the resulting classification, albeit sacrificing the language and domain independence.

This unit is located next to another unit labelled minister of defence, fpoe, fp, westenthaler, klestil in the bottom right corner, listing again the Freedom Party, another one of its leading politicians, Peter Westenthaler, as well as the name of Austria's president, Thomas Klestil. This co-location of similar, yet not identical topics is one of the most important characteristics of SOMs, making them particularly suitable for the organization of document collections for interactive browsing.

Shifting to another topic, we find units from the economic cluster in the lower right corner to be labelled with, for example, austria, stock-exchange, fonds manager, telecom for the previously mentioned telecom cluster, or enterprise, state, va-tech, oetiag, grasser, leitl for the cluster on the privatization of Austria's state-owned steel enterprise VA-Tech, and two of the leading politicians involved in the privatization process, Karl-Heinz Grasser and Christoph Leitl. Above this unit we find a similar topic, namely the privatization of the Austrian postal services, labelled with privatization, psk, contracts, nicely showing the topological ordering in the map.

As an example for labels from the sports cluster we might mention the cluster on Formula 1 with labels races, bmw, williams, jaguar, wm. Since we do not specify a manually designed stop-word list, some stop-words remain in the list of index terms and actually show up as labels as they form a prominent common feature of articles in a cluster. We thus also find labels such as friday and both as labels for the sports cluster. Again, a handcrafted or semi-automatic approach may provide a better removal of stop-words, yet sacrificing domain and language independence to some degree, whereas the current approach can be applied to any given document collection. The cluster representing documents on soccer is labelled with goal, real, madrid, muenchen, bavaria, rome, group, listing important soccer clubs playing against each other in a given group of the tournament. Neighboring this unit we find another unit on soccer, this time labelled champions league, cup, barcelona, madrid, porto.

This map serves as a content-based index to the digital library, allowing users to find, by reading the labels, which topics are covered in which section of the library. The documents can be represented as located in an HTML table with the labels provided as text in the table cells. They may also be transformed into a graphical libViewer representation, with the according source of the document, i.e. the magazine's title etc., being provided as a logo on the spine. Yet, we do not have any information from the resulting representation whether a given document represents, say, an interview, a result listing etc., as this information is not provided as metatags within the articles.

6.3 Incorporating genre information

6.3.1 A genre map

While the content-based SOM provides an organization of articles by their subject, the genre SOM analyzes the structural features of the documents and groups the documents accordingly.

We find, for example, a rather dominant cluster representing various forms of interviews, moving from reports with several quotations in them to long interviews on a given subject. The labels extracted by the LabelSOM technique help us to identify the most distinctive characteristics of a given cluster. For the interviews we find the characteristic attributes to be the number of opening and closing quotes as well as the colon. Further distinctions can be made by the average length of sentences as well as frequent line breaks, setting interviews apart from articles with longer citations.

Another cluster of documents having a high colon count, yet a completely different sentence length structure plus several other special characters occurring frequently in the text, such as opening and closing brackets or slashes, is formed by sports articles providing only result listings. These documents obviously also exhibit an unproportionally high count of numbers. While short reports separate themselves from longer articles due to a number of text length parameters, we find a further distinction within the short reports cluster having a higher count of numbers, yet less than for sports results. These articles are mostly reports or announcements for radio or TV shows, which may be either special documentaries or sports transmissions. Another large cluster of documents consists of legal documents, which set themselves apart by the frequent usage of the paragraph character (§). Internet articles are characterized by the at-sign (@).

6.3.2 Integrating genre information into the topical organization

While the labels extracted by the LabelSOM technique help the user interpret and understand what the SOM has learned, they are not sufficient to allow the user to intuitively tell which cluster of documents corresponds to which specific genre. This is because the extracted low-level features do not correspond to what the casual user will attribute as characteristic for, say, an interview or sports result listings. Instead of assigning every unit to a specific genre, we map the genre SOM into an RGB color space, such that documents located in the upper left corner are colored black, whereas the upper right corner is assigned green, the lower left corner red, and the lower right corner yellow. The units in between are automatically assigned intermediate colors. According to its structure, each document is now assigned a color based on its location on the genre SOM. This color is used for representing the document in the content-based libViewer representation. We thus may now expect to find documents on the same topic, which are located on the same shelf on the content-based SOM, to be coloured differently if they exhibit a different structure.

[Figure 3: libViewer: sports and economy sections]

Figure 3 shows one shelf from the upper middle area of the SOM representing articles on soccer from the sports section. The general topic of this unit is given as shelf labels. As can be told from the logos, the documents on this unit stem mostly from the daily newspaper Die Presse, with only one article being from Kurier, another daily newspaper. Also, the average length of all articles on this unit is rather homogeneous, with all documents being short articles, thus depicted with only the minimum spine width. Still, we can see from the differing coloring that several distinct types of documents are located on this unit. The first and the last article on this shelf, colored orange, both represent short result reports, listing only the outcomes of the various matches. The second document, colored in bright yellow, contains a rather emotional report, looking more like a transcript from a live broadcast, but not reporting explicitly on results, next to a green document providing a rather factual report on the same match in a rather complicated style. The one-but-last document, colored dark red, contains a somewhat longer report on several matches, listing not only results, but also short descriptions of various sections of the matches. More important, however, is the fact that it also contains a report on the financial situation of one of the soccer clubs, thus being colored entirely differently than the other soccer result reports. This difference in structure, and its partial membership in the economic articles genre, can be attributed to the frequent occurrence of the euro currency symbol.

In the shelf beneath the soccer reports we find documents reporting on financial issues, or more precisely, on interest rates, representing articles from 4 different publications, one of which is a daily newspaper, 1 weekly, and 2 monthly magazines. Except for the weekly magazine Format, which is a general news magazine, all publications on this shelf have a strong focus on economic issues. The according labels are given as banks, interest rates. Two distinct types of articles can be distinguished on this unit, colored dark brown (the first three documents) and various shadings of green, respectively. When taking a look at the according documents, we find the dark brown ones to be rather complicated, extensive reports on interest rate issues, whereas the green documents are written in a rather informal style. These are mainly made up of short sentences and rhetorical questions, and list the most important issues in a tabular form rather than by complex explanations. Please note that the more complex articles also are longer, as can be seen from the wider spine width.

Figure 4 depicts the next two shelves down the row, continuing in the economic section of the map. Here we find a number of articles on the stock exchange in the upper shelf, whereas the lower shelf contains reports on economic data from the print magazine Die Wirtschaftswoche. The first two documents on the lower shelf are colored in black, and they provide detailed percentual listings of the magazine's subscriber structure, their average income, etc. The third, green document reports on the same issue, yet rather represents a short overview article describing the general results from the market study in simple terms. It thus is very similar in structure to the green documents discussed previously.

[Figure 4: libViewer: economy section (cont.)]

We also find a number of green documents in the upper shelf, providing short descriptions and buying recommendations for fonds and insurance policies based on fonds. The third, dark-brown document again describes a series of issues related to stock market transactions in a more complex structure, as can be expected from documents exhibiting this color. The first two documents, apart from being rather long, and thus depicted with rather broad spines, differ from the other reports by describing several companies and their performance by citing experts. While being rather detailed, we find many short sentences, enclosed by quotes, as the characteristic features of these documents, thus separating them from the neighboring darker, lengthy descriptions.

6.4 Evaluation

Unfortunately, the actual genre of a document cannot be intuitively told from the color it is represented in, nor could we find a way of automatically translating the different structural characteristics of documents into a small set of distinct genres, which could then be represented by more intuitive metaphors (such as papers for interviews, hardcover books for lengthy reports, and paperbacks for shorter, simpler depictions). Although such a mapping would be possible in principle in a semi-automatic way by assigning different representation metaphors to different areas of the genre SOM, we prefer the automatic mapping of the structural position of a document on the genre SOM into a simple color space. While this metaphor needs to be learned to be interpreted correctly, the actual effort required to understand the structure intuitively, rather than explicitly, has shown to be rather small and straight-forward. Furthermore, the chosen approach allows gradual changes between various genres.

No large-scale usability study has yet been performed, although first tests with a small set of users, mostly students, have turned out encouraging results. After visiting a few documents in the respective areas of interest, most people had a feeling of what to expect from a document in a specific color, although they obviously were not able to describe it in terms of the low-level features used for classification by the map. Still, users know what to expect from, say, yellow to ochre documents (interviews), from black (numerical listings), greenish (short, simple articles), or others.

Obviously, the proposed approach to structural analysis will not be able to provide a full-scale genre analysis, capturing the fine differences between certain types of information representation, especially if they involve high-level linguistic analysis. To provide a rather far-fetched example, the presented system will definitely fail to separate a satire from factual information. It thus does not perform genre analysis in the strict sense. Yet, by capturing structural characteristics and similarities, it clearly is able to uncover specific genre information in given settings. While this might lead to misunderstandings in some situations, such as the impossibility of telling factual from fictional information (genre-wise), it should provide considerable support to the user trying to satisfy an information need. This is especially true as the utilization of the self-organizing map to produce a topology-preserving mapping allows capturing gradual differences between various structural concepts in a straight-forward manner.

7. CONCLUSIONS AND FUTURE WORK

Providing structural information about documents is essential to help users decide about the relevance of documents available in a digital library. Most document collections thus try to convey this information by using carefully designed metadata describing the genre of a resource. However, in many cases this uniform description of documents cannot be provided manually. This is especially true for digital libraries integrating documents from different sources, or where the number of documents to be described effectively prohibits manual classification.

In this paper we presented an automated approach to the structural analysis of text documents. Characteristic features such as the average length of a sentence, counts of punctuation marks and other special characters, as well as specific words such as pronouns etc. can be used to describe the structural characteristics of a document. The self-organizing map, a popular unsupervised neural network, is used to cluster the documents according to their similarity. Documents are then colored according to their location on the resulting two-dimensional map, such that structurally similar documents are colored similarly.

The result of the structural analysis is further incorporated into the content-based organization and representation of a digital library provided by the SOMLib system. Documents on the same topic, yet providing a different perspective on the same subject, such as reports and interviews, complex analyses, or short descriptions, are thus shown as books of different colors in the resulting graphical representation provided by the libViewer interface.

Initial experiments have shown encouraging results. In the next steps we would like to refine the mapping from the position on the structural SOM onto an appropriate color in such a way that the mutual similarity or distance of two units is reflected in the perceived distance of the colors assigned to the according documents. This would allow the structural similarities between, say, different types of interviews, to be more evident by assigning somewhat more similar colors to documents that are part of a larger cluster consisting of several sub-clusters. Such an improved mapping can be achieved by using distance information provided by the weight vectors of the units in the self-organizing map. Furthermore, a more suitable color space in terms of human perception may be chosen to further increase the perceived similarities and dissimilarities.

Secondly, we will take a closer look at additional features that offer themselves for genre analysis. First experiments indicate that, for example, a distinction between fact-reporting articles versus opinion-forming articles is possible by including additional keywords in the list of features.

8. REFERENCES

[1] D. Biber. Variations across Speech and Writing. Cambridge University Press, UK, 1988.
[2] D. Biber. A typology of English texts. Linguistics, 27:3-43, 1989.
[3] I. Bretan, J. Dewe, A. Hallberg, N. Wolkert, and J. Karlgren. Web-specific genre visualization. In Proc of WebNet'98, Orlando, FL, November 1998. http://www.stacken.kth.se/~dewe/.
[4] H. Chen, C. Schuffels, and R. Orwig. Internet categorization and search: A self-organizing approach. Journal of Visual Communication and Image Representation, 7(1):88-102, 1996. http://ai.BPA.arizona.edu/papers/.
[5] H. Chernoff. The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68:361-368, 1973.
[6] L. Cherry and W. Vesterman. Writing tools: The STYLE and DICTION programs. Technical Report 91, Bell Laboratories, Murray Hill, NJ, 1981. Republished as part of 4.4BSD User's Supplementary Documents by O'Reilly.
[7] J. Himberg. A SOM based cluster visualization and its application for false coloring. In Proc Int'l Joint Conf on Neural Networks (IJCNN 2000), Como, Italy, July 24-27, 2000. IEEE Computer Society.
[8] J. Karlgren. Stylistic experiments in information retrieval. In T. Strzalkowski, editor, Natural Language Information Retrieval. Kluwer, 1999. http://www.sics.se/~jussi/Artiklar/.
[9] J. Karlgren, I. Bretan, J. Dewe, A. Hallberg, and N. Wolkert. Iterative information retrieval using fast clustering and usage-specific genres. In Proc Eighth DELOS Workshop on User Interfaces in Digital Libraries, pages 85-92, Stockholm, Sweden, October 1998. http://www.stacken.kth.se/~dewe/.
[10] J. Karlgren and D. Cutting. Recognizing text genres with simple metrics using discriminant analysis. In Proc 15th Int'l Conf on Computational Linguistics (COLING'94), Kyoto, Japan, 1994. http://www.sics.se/~jussi/Artiklar/.
[11] B. Kessler, G. Nunberg, and H. Schütze. Automatic detection of text genre. In Proc 8th Conf Europ. Chapter of the Association for Computational Linguistics (ACL/EACL97), pages 32-38, Madrid, Spain, 1997. http://spell.psychology.wayne.edu/~bkessler/.
[12] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.
[13] T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.
[14] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574-585, May 2000. http://ieeexplore.ieee.org/.
[15] D. Merkl and A. Rauber. Document classification with unsupervised neural networks. In F. Crestani and G. Pasi, editors, Soft Computing in Information Retrieval, pages 102-121. Physica Verlag, 2000. http://www.ifs.tuwien.ac.at/~andi/LoP.html.
[16] A. Rauber. LabelSOM: On the labeling of self-organizing maps. In Proc Int'l Joint Conf on Neural Networks (IJCNN'99), Washington, DC, July 10-16, 1999. http://www.ifs.tuwien.ac.at/~andi/LoP.html.
[17] A. Rauber. SOMLib: A digital library system based on neural networks. In E. Fox and N. Rowe, editors, Proc ACM Conf on Digital Libraries (ACMDL'99), pages 240-241, Berkeley, CA, August 11-14, 1999. ACM. http://www.acm.org/dl.
[18] A. Rauber and H. Bina. Visualizing electronic document repositories: Drawing books and papers in a digital library. In Advances in Visual Database Systems: Proc IFIP TC2 Working Conf on Visual Database Systems, pages 95-114, Fukuoka, Japan, May 10-12, 2000. Kluwer Academic Publishers. http://www.ifs.tuwien.ac.at/~andi/LoP.html.
[19] A. Rauber and D. Merkl. The SOMLib Digital Library System. In Proc 3rd Europ. Conf on Research and Advanced Technology for Digital Libraries (ECDL99), LNCS 1696, pages 323-342, Paris, France, September 22-24, 1999. Springer. http://www.ifs.tuwien.ac.at/~andi/LoP.html.
[20] A. Rauber, M. Dittenbach, and D. Merkl. Automatically detecting and organizing documents into topic hierarchies: A neural-network based approach to bookshelf creation and arrangement. In Proc 4th Europ. Conf on Research and Advanced Technologies for Digital Libraries (ECDL2000), LNCS 1923, pages 348-351, Lisboa, Portugal, September 18-20, 2000. Springer. http://www.ifs.tuwien.ac.at/~andi/LoP.html.
[21] K. Ries. Towards the detection and description of textual meaning indicators in spontaneous conversations. In Proc Europ. Conf on Speech Communication and Technology (EUROSPEECH99), Budapest, Hungary, September 5-9, 1999.
[22] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

Text Categorization for Multi-page Documents: A Hybrid Naive Bayes HMM Approach

Paolo Frasconi, Giovanni Soda, Alessandro Vullo
Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
[email protected]

ABSTRACT

Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. In this paper we are interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. We describe a method for classifying pages of sequential OCR text documents into one of several assigned categories and suggest that taking into account contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bag-of-words according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Our results on a collection of scanned journals from the Making of America project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate (as obtained by running a plain Naive Bayes classifier on isolated pages) can be roughly reduced by half if contextual information is incorporated.

Categories and Subject Descriptors

I.2.6 [Computing Methodologies]: Artificial Intelligence - Learning; H.3.7 [Information Systems]: Information Storage and Retrieval - Digital Libraries; I.7.m [Computing Methodologies]: Document and Text Processing

General Terms

Algorithms, Performance

Keywords

Text categorization, Hidden Markov Models, Naive Bayes classifier, Multi-page documents

1. INTRODUCTION

Text categorization is the problem of grouping textual documents into different fixed classes or categories. The task is related to the ability of an intelligent system to automatically perform tasks such as personalized e-mail or news filtering, document indexing, and metadata extraction. These problems are of great and increasing importance, mainly because of the recent explosive increase of online textual information. Text categorization is generally formulated in the machine learning framework. In this setting, a learning algorithm takes as input a set of labeled examples (where the label indicates which category the example document belongs to) and attempts to infer a function that will map new documents into their categories. Several algorithms have been proposed within this framework, including regression models [29], inductive logic programming [6], probabilistic classifiers [17, 21, 16], decision trees [18], neural networks [22], and more recently support vector machines [12].

Research on text categorization has been mainly focused on non-structured documents. In the typical approach, inherited from information retrieval, each document is represented by a sequence of words, and the sequence itself is normally flattened down to a simplified representation called bag of words (BOW). This is like representing each document as a feature vector, where features are words in the vocabulary and components of the feature vector are statistics such as word counts in the document. Although such a simplified representation is appropriate for relatively flat documents (such as email and news messages), other types of documents are internally structured, and this structure should be exploited in the representation to better inform the learner.

In this paper we are interested in the domain of digital libraries and, in particular, collections of digitized books or magazines, with text extracted by an Optical Character Recognition (OCR) system. One important challenge for digital conversion projects is the management of structural and descriptive metadata. Currently, metadata management involves a large amount of keying work carried out by human operators. Automating the extraction of metadata from digitized documents could greatly improve efficiency and productivity [1]. This automation, however, is not a trivial task and involves recognition of the ordering of text divisions, such as chapters and sub-chapters, the identification of layout elements, such as headlines, footnotes, graphs, and captions, and the linking of articles within a periodical.

Automatic recognition of these elements can be a hard problem, especially without any prior knowledge about the type of elements that are expected to be present within a given document page. Hence, page classification can represent a useful preliminary step to guide the subsequent extraction process. Moreover, extracting metadata related to the semantic contents of document parts (such as chapters or articles) can require the ability of recognizing the topic or the category of these parts. The solution to these problems can be helped by a classifier that assigns a category to each page of the document.

Unlike email or news articles, books and periodicals are multi-page documents, and the simplest level of structure that can be exploited is the serial order relation defined among single pages. The task we consider is the automatic categorization of each page according to its (semantic) contents. (A related formulation would consist of assigning a global category to a whole multi-page document, but this formulation is not considered in this paper.) Exploiting the serial order relation among pages within a single document can be expected to improve classification accuracy when compared to a strategy that simply classifies each page separately. This is because the sequence of pages in documents such as books or magazines often follows regularities such as those implied by typographical and editorial conventions. Consider for example the domain of books and suppose categories of interest include title-page, dedication-page, preface-page, index-page, table-of-contents, regular-page, and so on. Even in this very simplified case we can expect constraints on the valid sequences of page categories in a book. For example, title-page is very unlikely to follow index-page and, similarly, dedication-page is more likely to follow title-page than preface-page. Constraints of this type can be captured and modeled using a stochastic grammar. Thus, information about the category of a given page can be gathered not only by examining the contents of that page, but also by examining the contents of other pages in the sequence. Since contextual information can significantly help to disambiguate between page categories, we expect that classification accuracy will improve if the learner has access to whole sequences instead of single-page documents.

In this paper we combine several algorithmic ideas to solve the problem of text categorization in the domain of multi-page documents. First, we use an algorithm similar to those described in [28] and [20] for inducing a stochastic regular grammar over sequences of page categories. Second, we introduce a hidden Markov model (HMM) that can deal with sequences of BOWs. Each state in the HMM is associated with a unique page category. Emissions are modeled by a multinomial distribution over word events, like in the generative component of the Naive Bayes classifier. The HMM is trained from (partially) labeled page sequences, i.e. state variables are partially observed in the training set. Unobserved states (which is the common setting in most classic applications of HMMs) arise here when document pages are partially unlabeled, like in the framework described in [23] and [13]. Finally, we solve the categorization problem by running the Viterbi algorithm on the trained HMM, yielding a sequence of page categories associated with new (unseen) documents. This is somewhat related to recent applications of HMMs to information extraction [9, 20], but the output labeling in our case is associated with the entire stream of text contained in a page, while in [9, 20] the HMM is used to attach labels to single words of shorter portions of text.

Our approach is validated on a real dataset consisting of 95 issues of the journal American Missionary, which is part of the "Making of America" collection [26]. In spite of text noise due to optical recognition, our system achieves about 85% page classification accuracy when training on 10 issues (year 1884) and testing on issues from 1885 to 1893. More importantly, we show that incorporating contextual information significantly reduces classification error, both in the case of completely labeled example documents and when unlabeled documents are included in the training set.

2. BACKGROUND

Let d be a generic multi-page document, and let d_t denote the t-th page within the document. The categorization task consists of learning from examples a function f that maps each page d_t into one out of K classes {c_1, ..., c_K}.

2.1 The Naive Bayes classifier

The above task can also be reformulated in probabilistic terms as the estimation of the conditional probability P(C_t = c_k | d_t), C_t being a multinomial class variable with realizations in {c_1, ..., c_K}. In so doing, f can be computed using Bayes' decision rule, i.e. f(d_t) is the class with the highest posterior probability. The Naive Bayes classifier computes this probability as

    P(C_t = c_k | d_t) ∝ P(d_t | C_t = c_k) P(C_t = c_k).    (1)

What characterizes the model is the so-called Naive Bayes assumption, prescribing that word events (each occurrence of a given word in the page corresponds to one event) are conditionally independent given the page category. As a result, the class conditional probabilities can be factorized as

    P(d_t | C_t = c_k) = ∏_{i=1}^{|d_t|} P(w_{it} | C_t = c_k)    (2)

where |d_t| denotes the length of page d_t and w_{it} is the i-th word in the page. This conditional independence assumption is graphically represented by the Bayesian network shown in Figure 1. (A Bayesian network is an annotated graph in which nodes represent random variables and missing edges encode conditional independence statements amongst these variables. Given a particular state of knowledge, the semantics of Bayesian networks determine whether collecting evidence about a set of variables does modify one's belief about some other set of variables [24, 11].)

Although the basic assumption is clearly false in the real world, the model works well in practice since classification requires finding a good separation surface, not necessarily a very accurate model of the involved probability distributions. Training consists of estimating the model's parameters from a dataset of labeled documents (see, e.g. [21]).
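For illustration, a minimal multinomial Naive Bayes page classifier implementing Equations (1) and (2) in log space is sketched below; the Laplace smoothing anticipates Equation (7) further on, and the toy pages are assumptions.

```python
# Sketch of a multinomial Naive Bayes page classifier (Eqs. 1-2).
import math
from collections import Counter

class NaiveBayes:
    def fit(self, pages, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for words, c in zip(pages, labels):
            self.counts[c].update(words)
        self.vocab = set(w for c in self.classes for w in self.counts[c])
        return self

    def log_posterior(self, words, c):
        total = sum(self.counts[c].values())
        lp = math.log(self.prior[c])
        for w in words:                       # Eq. (2), Laplace-smoothed
            lp += math.log((1 + self.counts[c][w]) /
                           (len(self.vocab) + total))
        return lp

    def predict(self, words):                 # Bayes' decision rule, Eq. (1)
        return max(self.classes, key=lambda c: self.log_posterior(words, c))

nb = NaiveBayes().fit([["index", "of"], ["chapter", "one"]],
                      ["index-page", "regular-page"])
print(nb.predict(["index"]))                  # -> index-page
```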

2.2 Hidden Markov models

HMMs were introduced several years ago as a tool for probabilistic sequence modeling. The interest in this area developed particularly in the Seventies, within the speech recognition research community [25]. During the last years a large number of variants and improvements over the standard HMM have been proposed and applied. Undoubtedly, Markovian modeling is now regarded as one of the most significant state-of-the-art approaches for sequence learning. Besides several applications in pattern recognition and molecular biology, HMMs have also been applied to several text related tasks, including natural language modeling [5] and, more recently, information retrieval and extraction [9, 20]. The recent view of the HMM as a particular case of Bayesian networks [2, 19, 27] has helped the theoretical understanding and the ability to conceive extensions to the standard model in a sound and formally elegant framework.

[Figure 1: Bayesian network for the Naive Bayes classifier.]

An HMM describes two related discrete-time stochastic processes. The first process pertains to hidden discrete state variables, denoted X_t, forming a first-order Markov chain and taking realizations on a finite alphabet {x^1, ..., x^N}. The second process pertains to observed variables or emissions, denoted D_t. Starting from a given state at time 0 (or given an initial state distribution) the model probabilistically transitions to a new state X_1 and correspondingly emits observation D_1. The process is repeated recursively until an end state is reached. Note that, as this form of computation may suggest, HMMs are closely related to stochastic regular grammars [5]. The Markov property prescribes that X_{t+1} is conditionally independent of X_1, ..., X_{t-1} given X_t. Furthermore, it is assumed that D_t is independent of the rest given X_t. These two conditional independence assumptions are graphically depicted using the Bayesian network of Figure 2. As a result, an HMM is fully specified by the following conditional probability distributions (we adopt the standard convention of denoting variables by uppercase letters and realizations by the corresponding lowercase letters, and use the table notation for probabilities as in [11]):

    P(X_t | X_{t-1})    (transition distribution)
    P(D_t | X_t)        (emission distribution)    (3)

[Figure 2: Bayesian networks for standard HMMs.]

Since the process is stationary, the transition distribution can be represented as a square probability matrix whose entries are the transition probabilities P(X_t = x^i | X_{t-1} = x^j), abbreviated as P(x^i | x^j) in the following. In the classic literature, emissions are restricted to symbols in a finite alphabet or multivariate continuous variables [25]. As explained in the next section, our model allows emissions to be bags-of-words.

3. THE MULTI-PAGE CLASSIFIER

We now turn to the description of our classifier for multi-page documents. This section presents the architecture and the algorithms for grammar extraction, training, and classification.

3.1 Architecture

In our case, HMM emissions are associated with entire pages of the document. Thus the realizations of the observation D_t are bags-of-words representing the text in the t-th page of the document. Within our framework, states are related to page categories by a deterministic function φ that maps state realizations into page categories. We assume that φ is a surjection but not a bijection, i.e. that there are more state realizations than categories. This enriches the expressive power of the model, allowing different transition behaviors for pages of the same class, depending on where the page is actually encountered within the sequence. However, if the page contents depend on the category but not on the context of the category within the sequence (of course this does not mean that the category is independent of the context), the use of multiple states per category may introduce too many free parameters, and it may be convenient to assume that

    P(D_t | x^i) = P(D_t | x^j) = P(D_t | c_k)  if φ(x^i) = φ(x^j) = c_k.    (4)

This assumption constrains emission parameters to be the same for all the HMM states labeled by the same page category, a form of parameter sharing that may help to reduce overfitting. The emission distribution is then defined as for the Naive Bayes classifier, i.e. for every observed page d_t

    P(d_t | c_k) = ∏_{i=1}^{|d_t|} P(w_{it} | c_k).    (5)

Therefore, the architecture can be graphically described as the merging of the Bayesian networks for HMMs and Naive Bayes, as shown in Figure 3. We remark that the state (and hence the category) at page t depends not only on the contents of the page, but also on the contents of other pages in the document. This probabilistic dependency implements the mechanism for taking contextual information into account.
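The parameter sharing of Equations (4) and (5) can be sketched as follows: several HMM states look up the emission table of one page category through the mapping φ (phi). The states, categories, and probabilities here are illustrative assumptions, not the trained model.

```python
# Sketch of emission tying (Eqs. 4-5): states sharing a category share
# one emission table; all names and probabilities are illustrative.
import math

phi = {0: "title", 1: "toc", 2: "regular", 3: "regular"}  # state -> category
log_p_word = {                  # log P(w | c_k), assumed already estimated
    "title":   {"annual": math.log(0.5), "report": math.log(0.5)},
    "toc":     {"contents": math.log(0.9), "report": math.log(0.1)},
    "regular": {"report": math.log(0.2), "missionary": math.log(0.8)},
}

def state_log_emission(page_words, state):
    """Log P(d_t | x^i) under Eq. (5), shared across states with equal phi."""
    table = log_p_word[phi[state]]
    return sum(table.get(w, math.log(1e-9)) for w in page_words)

print(state_log_emission(["report", "missionary"], 2) ==
      state_log_emission(["report", "missionary"], 3))   # True: tied states
```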

The algorithms used in this paper are derived from the literature on Markov models [25], inference and learning in Bayesian networks [24, 11, 10], and classification with Naive Bayes [17, 15]. In the following we sketch the main issues related to the integration of all these methods.

[Figure 3: Bayesian network for the hybrid HMM Naive Bayes architecture.]

3.2 Induction of HMM topology

The structure or topology of an HMM is a representation of the allowable transitions between hidden states. More precisely, the topology is described by a directed graph whose vertices are state realizations {x^1, ..., x^N}, and whose edges are the pairs (x^i, x^j) such that P(x^j | x^i) ≠ 0. An HMM is said to be ergodic if its transition graph is fully connected. However, in almost all interesting application domains, less connected structures are better suited for capturing the observed properties of the sequences being modeled, since they convey domain prior knowledge. Thus, starting from the right structure is an important problem in practical hidden Markov modeling. As an example, consider Figure 4, showing a (very simplified) graph that describes transitions between the parts of a hypothetical set of books. Possible state realizations are {start, title, dedication, preface, toc, regular, index, end} (note that in this simplified example φ is a one-to-one mapping). The structure indicates, among other things, that only dedication, preface, or table of contents can follow the title page. Self-loops indicate that a given category can be repeated for several consecutive pages. While a structure of this kind could be hand-crafted by a domain expert, it may be more advantageous to learn it automatically from data.

[Figure 4: Example of HMM transition graph.]

We now briefly describe the solution adopted to automatically infer HMM transition graphs from sample multi-page documents. Let us assume that all the pages of the available training documents are labeled with the class they belong to. One can then imagine to take advantage of the observable distribution of data to search for an effective structure in the space of HMM topologies. Our approach is based on the application of an algorithm for data-driven model induction adapted from previous works on Bayesian HMM induction [28] and on the construction of HMMs of text phrases for information extraction [20]. The algorithm starts by building a structure that is capable only of "explaining" the available training sequences (a maximally specific model). The initial structure includes as many paths (from the initial state to the final one) as there are training sequences. Every path is associated with one sequence of pages, i.e. a distinct state is created for every page in the training set. Each state x is labeled by φ(x), the category of the corresponding page in the document. Note that, unlike the example shown in Figure 4, several states are generated for the same category. The algorithm then iteratively applies merging heuristics that collapse states so as to augment generalization capabilities over unseen sequences. The first heuristic, called neighbor-merging, collapses two states x and x' if they are neighbors in the graph and φ(x) = φ(x'). The second heuristic, called V-merging, collapses two states x and x' if φ(x) = φ(x') and they share a transition from or to a common state, thus reducing the branching factor of the structure.
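A simplified sketch of the merging step on a transition graph stored as a set of edges follows; only neighbor-merging is spelled out (V-merging rewires states analogously), and the toy states are assumptions rather than the authors' implementation.

```python
# Sketch of the neighbor-merging heuristic of Section 3.2.
def merge_states(edges, labels, a, b):
    """Collapse state b into state a, rewiring all edges."""
    ren = lambda s: a if s == b else s
    new_edges = {(ren(u), ren(v)) for (u, v) in edges}
    labels = {s: c for s, c in labels.items() if s != b}
    return new_edges, labels

def neighbor_merge(edges, labels):
    """Merge x, x' if they are graph neighbors and phi(x) == phi(x')."""
    changed = True
    while changed:
        changed = False
        for (u, v) in list(edges):
            if u != v and labels[u] == labels[v]:
                edges, labels = merge_states(edges, labels, u, v)
                changed = True
                break
    return edges, labels

edges = {("s0", "t1"), ("t1", "t2"), ("t2", "e")}   # two adjacent title pages
labels = {"s0": "start", "t1": "title", "t2": "title", "e": "end"}
print(neighbor_merge(edges, labels)[0])
# the two title states collapse into one, leaving a self-loop on it
```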
3.3 Inference and learning

Given the HMM topology extracted by the algorithm described above, the learning problem consists of determining transition and emission parameters. One important distinction that needs to be made when training Bayesian networks is whether or not all the variables are observed. Assuming complete data (all variables observed), maximum likelihood estimation of the parameters could be solved using a one-step algorithm that collects sufficient statistics for each parameter [10]. In our case, data are complete if and only if the following two conditions are met:

1. there is a one-to-one mapping between HMM states and page categories (i.e. N = K and φ(x^k) = c_k for k = 1, ..., N), and

2. the category is known for each page in the training documents, i.e. the dataset consists of sequences of pairs ({d_1, c_1*}, ..., {d_T, c_T*}), c_t* being the (known) category of page t and T being the number of pages in the document.

Under these assumptions, the estimation of transition parameters is straightforward and can be accomplished as follows:

    P(c^j | c^i) = N(c^i, c^j) / Σ_{l=1}^{K} N(c^i, c^l)    (6)

where N(c^i, c^j) is the number of times a page of class c^j follows a page of class c^i in the training set. Similarly, the estimation of emission parameters in this case would be accomplished exactly as in the case of the Naive Bayes classifier (see, e.g. [21]):

    P(w^l | c_k) = (1 + N(w^l, c_k)) / (|V| + Σ_{j=1}^{|V|} N(w^j, c_k))    (7)
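Equations (6) and (7) translate directly into counting code for the fully observed case, as in the following sketch; the toy sequence and vocabulary are assumptions.

```python
# Sketch: maximum likelihood estimation of Eqs. (6) and (7) from counts.
from collections import Counter

def estimate(label_seqs, pages, labels, vocab):
    # Eq. (6): P(c_j | c_i) = N(c_i, c_j) / sum_l N(c_i, c_l)
    bigrams = Counter()
    for seq in label_seqs:
        bigrams.update(zip(seq, seq[1:]))
    total_from = Counter()
    for (ci, cj), n in bigrams.items():
        total_from[ci] += n
    trans = {(ci, cj): n / total_from[ci] for (ci, cj), n in bigrams.items()}

    # Eq. (7): Laplace-smoothed word probabilities per category
    word_counts = {}
    for words, c in zip(pages, labels):
        word_counts.setdefault(c, Counter()).update(words)
    emit = {c: {w: (1 + wc[w]) / (len(vocab) + sum(wc.values()))
                for w in vocab}
            for c, wc in word_counts.items()}
    return trans, emit

trans, emit = estimate([["title", "toc", "regular", "regular"]],
                       [["annual"], ["contents"], ["report"], ["report"]],
                       ["title", "toc", "regular", "regular"],
                       {"annual", "contents", "report"})
print(trans[("regular", "regular")], emit["regular"]["report"])  # 1.0 0.6
```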

It is interesting to point out a related application of the EM algorithm for learning from labeled and unlabeled documents [23]. In that paper the only concern was to allow the learner to take advantage of unlabeled documents in the training set. As a major difference, the method in [23] assumes flat single-page documents and, if applied to multi-page documents, would be equivalent to a zero-order Markov model that cannot take contextual information into account.

Conditions 1 and 2 above, however, are normally not satisfied. First, in order to model more accurately the different contexts in which a category may occur, it may be convenient to have multiple distinct HMM states for the same page category. Second, labeling pages in the training set is a time-consuming process that needs to be performed by hand, and it may be important to also use unlabeled documents for training [13, 23]. This means that the label c_t may not be available for some t. If assumption 2 is satisfied but assumption 1 is not, we can derive the following approximate estimation formula for the transition parameters:

P(x_j \mid x_i) = \frac{N(x_i, x_j)}{\sum_{l} N(x_i, x_l)}    (8)

where N(x_i, x_j) counts how many times state x_j follows x_i during the state merging procedure described in Section 3.2.

However, in general, the presence of hidden variables requires an iterative maximum likelihood estimation algorithm, such as gradient ascent or expectation-maximization (EM). Our implementation uses the EM algorithm, originally formulated in [7] and usable for any Bayesian network with local conditional probability distributions belonging to the exponential family [10]. Here the EM algorithm essentially reduces to the Baum-Welch form [25], with the only modification that some evidence is entered into the state variables. State evidence is taken into account in the E-step by changing forward propagation as follows:

\tilde{\alpha}_t(j) = \begin{cases} 0 & \text{if } \phi(x_j) \neq c_t \\ \sum_{i=1}^{N} \tilde{\alpha}_{t-1}(i)\, P(x_j \mid x_i)\, P(d_t \mid x_j) & \text{otherwise} \end{cases}    (9)

where \alpha_t(i) = P(d_1 d_2 \cdots d_t, X_t = x_i) is the forward variable in the Baum-Welch algorithm.

The M-step is performed in the standard way for the transition parameters, by replacing the counts in Equation 6 with their expectations given all the observed variables. Emission probabilities are also estimated using expected word counts. If parameters are shared as indicated in Equation 4, these counts should be summed over states having the same label. Thus, in the case of incomplete data, Equation 7 is replaced by

P(w^l \mid c_k) = \frac{S + \sum_{p}\sum_{t} N(w^l, d_t^p) \sum_{i:\phi(x_i)=c_k} P(x_i \mid d^p)}{S|V| + \sum_{j=1}^{|V|}\sum_{p}\sum_{t} N(w^j, d_t^p) \sum_{i:\phi(x_i)=c_k} P(x_i \mid d^p)}

where S is the number of training sequences, N(w^l, d_t^p) is the number of occurrences of word w^l in the t-th page of the p-th training document, and P(x_i \mid d^p) is computed by the Baum-Welch procedure during the E-step. The sum over p extends over the training sequences, while the sum over t extends over the pages of the p-th document. The E- and M-steps are iterated until a local maximum of the (incomplete) data likelihood is reached.
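The modified forward recursion of Equation 9 is just the standard forward pass with labeled pages clamped to their categories. A compact sketch follows; all model containers (trans, emit, init, phi) are toy inputs of our own invention.

```python
# Forward pass with evidence: states whose category disagrees with a known
# page label receive zero probability, as in Equation 9.
def forward_with_evidence(pages, labels, states, phi, trans, emit, init):
    """pages[t]: observed page; labels[t]: known category or None.
    phi[x]: category of state x; trans[(i, j)], emit(x, page), init[x]: model."""
    alpha = [{} for _ in pages]
    for x in states:                              # t = 0
        blocked = labels[0] is not None and phi[x] != labels[0]
        alpha[0][x] = 0.0 if blocked else init.get(x, 0.0) * emit(x, pages[0])
    for t in range(1, len(pages)):
        for x in states:
            if labels[t] is not None and phi[x] != labels[t]:
                alpha[t][x] = 0.0                 # evidence clamps the state variable
                continue
            s = sum(alpha[t - 1][xi] * trans.get((xi, x), 0.0) for xi in states)
            alpha[t][x] = s * emit(x, pages[t])
    return alpha

# Toy usage: two states, with emissions stubbed by a dictionary lookup.
states = ["s_edit", "s_adv"]
phi = {"s_edit": "Editorial", "s_adv": "Advertisements"}
trans = {("s_edit", "s_edit"): 0.9, ("s_edit", "s_adv"): 0.1, ("s_adv", "s_adv"): 1.0}
emit_table = {("s_edit", "text"): 0.8, ("s_adv", "text"): 0.2,
              ("s_edit", "ads"): 0.1, ("s_adv", "ads"): 0.9}
emit = lambda x, page: emit_table[(x, page)]
print(forward_with_evidence(["text", "ads"], ["Editorial", None],
                            states, phi, trans, emit, {"s_edit": 1.0})[-1])
```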
3.4 Page classification

Given a document of T pages, classification is performed by first computing the sequence of states \hat{x}_1, ..., \hat{x}_T that was most likely to have generated the observed sequence of pages, and then mapping each state to the corresponding category phi(\hat{x}_t). The most likely state sequence can be obtained by running an adapted version of Viterbi's algorithm, whose more general form is the max-propagation algorithm for Bayesian networks described in [11].

3.5 Feature selection

Text pages should first be preprocessed with common information retrieval techniques, including stemming and stop word removal. Still, the bag-of-words representation of pages can lead to a very high-dimensional feature space corresponding to the vocabulary extracted from the training documents. A high-dimensional feature space, especially in this case where features are noisy because of OCR errors, may lead to the overfitting phenomenon: the learner has very high accuracy on the training set but generalization to new examples is poor. Feature selection is a technique for limiting overfitting by removing non-informative words from documents. In our experiments we performed feature selection using information gain [30]. This criterion is often employed in different machine learning contexts. It measures the average number of bits of information about the category that are gained by including a word in a document. For each dictionary term w, the gain is defined as

G(w) = -\sum_{k=1}^{K} P(c_k)\log_2 P(c_k) + P(w)\sum_{k=1}^{K} P(c_k \mid w)\log_2 P(c_k \mid w) + P(\bar{w})\sum_{k=1}^{K} P(c_k \mid \bar{w})\log_2 P(c_k \mid \bar{w})

where \bar{w} denotes the absence of word w. Feature selection is performed by retaining only the words having the highest average mutual information with the class variable. OCR errors, however, can produce very noisy features which may be responsible for poor performance even if feature selection is performed. For this reason, it may be convenient to prune from the dictionary (before applying the information gain criterion) all the words occurring in the training set with a frequency below a given threshold h.
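As an illustration of the information gain criterion under the usual binary word-occurrence model, the following hypothetical computation returns G(w) in bits for one dictionary term; the input layout is our own assumption.

```python
# G(w) = H(C) - [P(w) H(C | w) + P(~w) H(C | ~w)], all logs base 2.
import math

def information_gain(docs, w):
    """docs: list of (set_of_words, category) pairs. Returns G(w) in bits."""
    n = len(docs)
    with_w = [c for words, c in docs if w in words]
    without_w = [c for words, c in docs if w not in words]

    def entropy(cats):
        h, m = 0.0, len(cats)
        for c in set(cats):
            p = cats.count(c) / m
            h -= p * math.log2(p)
        return h

    h_prior = entropy([c for _, c in docs])
    h_cond = 0.0
    for part in (with_w, without_w):          # P(w) H(C|w) + P(~w) H(C|~w)
        if part:
            h_cond += (len(part) / n) * entropy(part)
    return h_prior - h_cond                   # bits gained by observing w

docs = [({"church", "school"}, "Editorial"), ({"sale", "price"}, "Advertisements"),
        ({"church", "mission"}, "Editorial"), ({"price"}, "Advertisements")]
print(round(information_gain(docs, "price"), 3))  # 1.0: perfectly informative here
```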

3.6 Learning with labeled and unlabeled pages

Creating a training set for text categorization involves hand labeling in order to assign a category to each document. Since this is an expensive human activity, it is interesting to evaluate a classification system when only a fraction of the training document pages are labeled, while the other documents are used without a category label. Clearly, unlabeled documents are available at very low cost. In the case of isolated page classification, previous research has demonstrated that learners such as Naive Bayes and support vector machines can take advantage of the inclusion in the training set of documents whose class is unknown [13, 23]. In particular, the method presented in [23] uses EM to deal with unobserved labels.

In the case of multi-page documents, the presence of missing labels means that some pages of the training sequences have no assigned category. The architecture introduced in this paper (see Figure 3) can easily handle the presence of unlabeled pages in the training set. Basically, evidence is entered into the states of the HMM chain only for those pages for which a label is known, while the other state variables are left unobserved. The belief propagation algorithm is in charge of computing probabilities for these hidden variables.

However, the structure learning algorithm presented in Section 3.2 cannot be applied in the case of partially labeled documents. Instead, it is possible to use ergodic (fully connected) HMMs and to derive a transition structure by pruning, after the learning phase, those transitions having small probabilities with respect to an assigned threshold. In this way, we let EM derive a specific structure for the model (note that the only alternative in the case of partially labeled documents would be to obtain a transition graph from a domain expert).

4. EXPERIMENTAL RESULTS

A preliminary evaluation of our system has been conducted in a digital library domain where data are naturally organized in the form of page sequences. The main purpose of our experiments was to make a comparison between our multi-page classification approach and a traditional isolated page classification system.

4.1 Data Set

We have chosen to evaluate the model over a subset of the Making of America (MOA) collection, a joint project between the University of Michigan and Cornell University (see moa.umdl.umich.edu/about.html and [26]) for collecting and making available digitized books and periodicals about the history and evolution of American society between the XIX and XX centuries. Presently, the whole archive contains electronic versions of important magazines of the XIX century. In our experiments, we selected a subset of the journal American Missionary (AMis), a sociological magazine with strong Christian guidelines. The task consists of correctly classifying the pages of previously unseen documents into one of the ten categories described in Table 1. Most of these categories are related to the topic of the articles, but some are related to the parts of the journal (i.e. Contents, Receipts, and Advertisements). The dataset we selected contains 95 issues from 1884 to 1893, for a total of 3222 OCR text pages. Special issues and final report issues (typically November and December issues) have been removed from the dataset, as they contain categories not found in the rest. The first year was selected as the training set (10 training sequences, 342 pages). The remaining documents (from 1885 to 1893, for a total of 2880 pages) were used as a test set. The ten categories are temporally stable over the 1883-1893 time period.

Name                        Description
1.  Contents                Cover and index of surveys
2.  Editorial               Editorial articles
3.  The South               Afro-Americans' survey
4.  The Indians             American Indians' survey
5.  The Chinese             Reports from China missions
6.  Bureau of Women's Work  Female conditions
7.  Children's Page         Education and childhood
8.  Communications          Magazine information
9.  Receipts                Lists of founders
10. Advertisements          Content is mostly graphic

Table 1: Categories in the American Missionary domain.

Category labels were obtained semi-automatically, starting from the MOA XML files supplied with the document collection. The assigned category was then manually checked. In the case of pages containing the end and the beginning of two articles belonging to different categories, the page was assigned the category of the ending article.

Each page within a document is represented as a bag-of-words, counting the number of word occurrences within the page. It is worth remarking that in this application the instances are text documents obtained by an OCR process. Imperfections of the recognition algorithm and the presence of images in some pages yield noisy text, containing misspelled or nonexistent words and trash characters (see [3] for a report on OCR accuracy in the MOA digital library). Although these errors may negatively affect the learning process and the subsequent results in the evaluation phase, we made no attempt to correct or filter out misspelled words, except for the feature selection process described above. However, since OCR-extracted documents preserve the text layout found in the original image, it was necessary to rejoin words that had been hyphenated due to line breaking.

4.2 Feature selection and isolated page classification

The purpose of the experiments in this section is to investigate the effects of feature selection and to assess the baseline prediction accuracy that can be attained using the Naive Bayes classifier on isolated pages. In a set of preliminary evaluations we found that the best performance is achieved by pruning words with less than h = 10 occurrences and then selecting an optimal set of informative words. We performed several tests by changing the information gain threshold that determines if a word is sufficiently informative (see Section 3.5), resulting in different vocabulary sizes with different prediction accuracies. For each reduced vocabulary size we ran the Naive Bayes classifier on isolated pages. Results are shown in Figure 5. The vocabulary size ranges from 15635 words (no feature selection), yielding 65.07% classification accuracy, to 25 words, yielding 53.16% accuracy. The optimal vocabulary size is 297 words, obtained with a gain threshold of 0.089, yielding the best test-set accuracy of 72.57%. This result (72.57%) was considered as the base measure for the performance comparison between our model and the Naive Bayes classifier.

4.3 Sequential page classification

Using the hybrid model presented in Section 3, documents can be organized into ordered sequences of pages.
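Sequential classification then amounts to Viterbi decoding of a page sequence followed by the state-to-category mapping of Section 3.4. The following minimal decoder reuses the toy model containers from the earlier sketches and is illustrative only.

```python
# Most likely state sequence (Viterbi), then map states to page categories.
def viterbi(pages, states, phi, trans, emit, init):
    delta = [{x: init.get(x, 0.0) * emit(x, pages[0]) for x in states}]
    back = []
    for t in range(1, len(pages)):
        delta.append({})
        back.append({})
        for x in states:
            best = max(states, key=lambda xi: delta[t - 1][xi] * trans.get((xi, x), 0.0))
            delta[t][x] = delta[t - 1][best] * trans.get((best, x), 0.0) * emit(x, pages[t])
            back[t - 1][x] = best
    last = max(states, key=lambda x: delta[-1][x])
    path = [last]
    for b in reversed(back):                  # follow back-pointers to t = 0
        path.append(b[path[-1]])
    path.reverse()
    return [phi[x] for x in path]             # state sequence -> page categories

states = ["s_edit", "s_adv"]
phi = {"s_edit": "Editorial", "s_adv": "Advertisements"}
trans = {("s_edit", "s_edit"): 0.9, ("s_edit", "s_adv"): 0.1, ("s_adv", "s_adv"): 1.0}
emit_table = {("s_edit", "text"): 0.8, ("s_adv", "text"): 0.2,
              ("s_edit", "ads"): 0.05, ("s_adv", "ads"): 0.95}
emit = lambda x, p: emit_table[(x, p)]
print(viterbi(["text", "text", "ads"], states, phi, trans, emit, {"s_edit": 1.0}))
# ['Editorial', 'Editorial', 'Advertisements']
```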


Figure 5: Naive Bayes accuracy as a function of vocabulary size (information gain criterion). Optimal vocabulary size is 297 words.

Category          Sequential  Isolated  Error reduction
Contents          100         100       0%
Editorial         80.9        63.11     48.2%
The South         90.81       71.84     67.4%
The Indians       61.07       44.3      30.1%
The Chinese       69.93       60.78     23.3%
Bureau W.W.       74.73       66.3      25.0%
Children's Page   78.26       45.65     60.0%
Communications    93.55       92.47     14.3%
Receipts          98.31       98.31     0%
Advertisements    90.7        62.79     75.0%
Total Accuracy    85.28       72.57     46.3%

Table 2: Isolated classification (using the best Naive Bayes) vs. sequential classification (using the hybrid HMM with model merging).

The training set contains 10 sequences (monthly issues) of the same 342 documents for year 1884, while the test set is organized into 85 sequences, for a total of 2880 documents from year 1885 to 1893. The bag-of-words representation of pages fed into the HMM classifier was identical to that previously used with Naive Bayes (including preprocessing and feature selection with a vocabulary of 297 words). We have considered two settings for validating the system. In the first setting, it is assumed that category labels c_t^p are available for all the pages in the training set. In the second setting, some category labels are held out and training uses labeled and unlabeled pages.

4.3.1 Completely labeled documents

In the case of completely labeled documents it is possible to run the structure learning algorithm presented in Section 3.2. Figure 6 reports the structure learned from the 10 training issues. Each vertex in the transition graph is associated with one HMM state and is labeled with the corresponding category (see Table 1). Edges are labeled with the transition probability from source to target state, computed by counting state transitions during the state merging procedure (see Equation 8).

Figure 6: Data-induced HMM topology for American Missionary, year 1884.


The associated stochastic grammar implies that valid AMis sequences ought to start with the index page (class "Contents"), followed by a page of general communications. The next state is associated with a page of an editorial article. The self-transition here has a value of 0.91, meaning that with high probability the next page will belong to the editorial too. With lower probability (0.07) the next page is one of the "The South" survey, or (prob. 0.008) "The Indians" or "Bureau of Women's Work". Continuing this way we can associate a probability to each string of page categories. Since our purpose is to predict the correct string of categories, a good grammar helps filtering out classification hypotheses which generate low (or zero) probability strings. Note that under the parameter sharing assumption (see Equation 4), once the HMM structure is given, an estimate of the emission probabilities can be obtained using Equation 7. These values can be plugged in as initial emission parameters for the EM algorithm. Classification is finally performed by computing the most likely state sequence.

Table 2 summarizes classification results on the test set document sequences, after a training phase applied both to Naive Bayes and to our hybrid model. We report prediction accuracy on single classes and average accuracy over the total of test documents. Comparison is made with respect to the best isolated-page classifier. The hybrid HMM classifier (performing sequential classification) achieves 85.28% accuracy and consistently outperforms the plain Naive Bayes classifier working on isolated pages. The relative error reduction is about 46%, i.e. roughly half of the errors are recovered thanks to contextual information. In particular, it is interesting to note the large error reduction for the category "Advertisements." Pages in this category typically contain several images and few words of text. The isolated page classifier is subject to prediction errors in this case, since parameter estimation for rarely occurring words can be poor. On the other hand, the constraints imposed by the grammar allow the system to recover many prediction errors, since advertisements normally occur near the end of each issue.

In Figure 7 we report the classification performance of the hybrid model on single issues of the journal. The graph is to be interpreted as the classifier's temporal trend from 1885 to 1894. Negative accuracy peaks correspond to test issues with more than 70 pages, a significant deviation from the average number of pages per issue (about 32). Values range from a minimum of 50% to a maximum of 97.09%, with a standard deviation of 10.41. To visualize a smoother trend, we calculated a running average over a temporal window of 10 months, showing a clearly superior trend over standard Naive Bayes.

Figure 7: Performance of the hybrid model on single sequences (merging algorithm), "American Missionary" (1885-1893); average accuracy 85.28% vs. 72.57% for Naive Bayes.

4.3.2 Partially labeled documents

We have performed six different experiments, for different percentages of labeled documents. In this case the structure learning algorithm cannot be applied, and we used ergodic HMMs with ten states (one per class). After training, transitions with probabilities < 10^-3 were pruned. In one of the six experiments we used all the available page labels with an ergodic HMM. This experiment is useful to provide a basis for evaluating the benefits of the structure learning algorithm presented in Section 3.2.

                         % of labeled documents
Category        0      30     50     70     90     100
Contents        0      100    100    100    100    100
Editorial       20.76  21.12  59.67  58.6   67.62  71.41
South           1.51   83.58  69.73  84.94  84.34  84.19
Indians         10.07  0      55.03  51.68  50.34  58.39
Chinese         0      27.45  83.66  76.47  75.82  75.16
Bur. W.W.       0      43.22  63.74  63     64.84  65.93
Child. P.       4.35   78.26  73.91  58.7   78.27  76.09
Commun.         0      91.4   91.4   93.55  93.55  93.55
Receipts        0      89.27  98.68  97.36  98.31  98.31
Advert.         81.4   69.77  93.02  90.7   90.7   90.7
Total Accuracy  8.23   55.66  73.54  75.66  78.7   80.24

Table 3: Results achieved by the model trained by expectation-maximization, varying the percentage of labeled documents.

Table 3 shows detailed results of the experiments. Classification accuracy is shown for single classes and for the entire test set. As we can see, when EM is completely uninformed (0% evidence) performance is worse than random guessing (8.23% accuracy). With 50% of labeled documents, the model outperforms Naive Bayes (73.54% against 72.57%). This is a positive result, because the Naive Bayes training phase (in the standard formulation) needs knowledge of all document labels, while in this setting we simulate knowledge of only half of them. With greater percentages of labeled documents, performance begins to saturate, reaching a maximum of 80.24% when all the labels are known. This result is worse compared to the 85.28% obtained with the first strategy (see Section 4.3.1). The main difference is that in this case we started training from an ergodic model and we used one state per class. This confirms that in the case of completely labeled documents it is advantageous to use more states per class and to use the data-driven algorithm for structure selection.

5. CONCLUSIONS

We have presented a text categorization system for multi-page documents which is capable of effectively taking contextual information into account to improve accuracy with respect to traditional isolated page classifiers. Our method can smoothly deal with unlabeled pages within a document,

although we have found that learning the HMM structure further improves performance compared to starting from an ergodic structure. The system uses OCR-extracted words as features. Clearly, richer page descriptions could be integrated in order to further improve performance. For example, optical recognizers output information about the font, size, and position of text that may be important to help discriminate between classes. Moreover, OCR text is noisy, and another direction for improvement is to include more sophisticated feature selection methods, like morphological analysis or the use of n-grams [4, 14].

Another aspect is the granularity of the document structure being exploited. Working at the level of pages is straightforward, since page boundaries are readily available. However, actual category boundaries may not coincide with page boundaries, and some pages contain portions of text related to different categories. Although this is not very critical for single-column journals such as the American Missionary, the case of documents typeset in two or three columns certainly deserves attention. A further direction of investigation is therefore related to the development of algorithms capable of performing automatic segmentation of a continuous stream of text, without necessarily relying on page boundaries.

The categorization method presented in this paper is targeted to textual information. However, the same hybrid HMM methodology could be applied to the classification of pages based on layout information, provided an adequate emission model is available. A suitable generative model for document layout is presented in [8].

Finally, categorization algorithms that include contextual information may be very useful for other types of documents natively available in electronic form. For example, the categorization of web pages may take advantage of the contents of neighbor pages (as defined by the hyperlink structure of the web).

6. ACKNOWLEDGMENTS

We thank Oya Rieger and the Cornell University Library for providing the data collected within the Making of America project. This research was partially supported by EC grant # IST-1999-20021 under the METAe project.

7. REFERENCES

[1] The metadata engine project. http://meta-e.uibk.ac.at, 2001.
[2] Y. Bengio and P. Frasconi. An input output HMM architecture. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 427-434. MIT Press, 1995.
[3] D. A. Bicknese. Measuring the accuracy of the OCR in the Making of America. Report available at moa.umdl.umich.edu/moaocr.html, 1998.
[4] W. Cavnar and J. Trenkle. N-gram based text categorization. In Proc. of the 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-175, Las Vegas, NV, 1994.
[5] E. Charniak. Statistical Language Learning. MIT Press, 1993.
[6] W. W. Cohen. Text categorization and relational learning. In Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, California, 1995.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.
[8] M. Diligenti, P. Frasconi, and M. Gori. Image document categorization using hidden tree-Markov models and structured representations. In S. Singh, N. Murshed, and W. Kropatsch, editors, Second Int. Conf. on Advances in Pattern Recognition, volume 2013 of LNCS. Springer, 2001.
[9] D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proc. 12th AAAI Conference, Austin, TX, 2000.
[10] D. Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1):79-120, 1997.
[11] F. Jensen. An Introduction to Bayesian Networks. Springer Verlag, New York, 1996.
[12] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning. Springer, 1998.
[13] T. Joachims. Transductive inference for text classification using support vector machines. In Int. Conf. on Machine Learning, 1999.
[14] M. Junker and R. Hoch. Evaluating OCR and non-OCR text representations for learning document classifiers. In Proc. ICDAR '97, 1997.
[15] T. Kalt. A new probabilistic model of text classification and retrieval. CIIR TR98-18, University of Massachusetts, 1996. URL: ciir.cs.umass.edu/publications/.
[16] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. Fourteenth Int. Conf. on Machine Learning, 1997.
[17] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In SIGIR-94, 1994.
[18] D. Lewis and M. Ringuette. Comparison of two learning algorithms for text categorization. In Proc. 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.
[19] H. Lucke. Bayesian belief networks as a tool for stochastic parsing. Speech Communication, 16:89-118, 1995.
[20] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3:127-163, 2000.
[21] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[22] H. Ng, W. Goh, and K. Low. Feature selection, perceptron learning, and a usability case study for text categorization. In Proc. of the 20th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 67-73, 1997.
[23] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000.
[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[25] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[26] E. Shaw and S. Blumson. Online searching and page presentation at the University of Michigan. D-Lib Magazine, July/August 1997. URL: www.dlib.org/dlib/july97/america/07shaw.html.
[27] P. Smyth, D. Heckerman, and M. I. Jordan. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2):227-269, 1997.
[28] A. Stolcke and S. Omohundro. Hidden Markov Model induction by Bayesian model merging. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5, pages 11-18. Morgan Kaufmann, San Mateo, CA, 1993.
[29] Y. Yang and C. Chute. An example-based mapping method for text classification and retrieval. ACM Transactions on Information Systems, 12(3):252-277, 1994.
[30] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, 1997.

Automated Name Authority Control

James W. Warner
Digital Knowledge Center, MSE Library
Johns Hopkins University
3400 N. Charles St.
Baltimore, MD 21218
[email protected]

Elizabeth W. Brown
Cataloging Dept., MSE Library
Johns Hopkins University
3400 N. Charles St.
Baltimore, MD 21218
[email protected]

ABSTRACT

This paper describes a system for the automated assignment of authorized names. A collaboration between a computer scientist and a librarian, the system provides for enhanced end-user searching of digital libraries without drastically increasing the cost and effort of creating a digital library. It is a part of the workflow management system of the Levy Sheet Music Project.

Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval - Digital Libraries

Keywords
Name Authority Control, automation, indexing, metadata, workflow management

1. INTRODUCTION

The Levy Sheet Music Collection [1] includes a rich online index created during digitization. This index, designed specifically to meet the needs of Levy users and based in part on Music Library Association guidelines, exists outside the context of the traditional library cataloging "MARC record" environment. Although it includes transcribed information about the composers of music and the artists responsible for covers, the index does not provide for traditional library "authority control": disambiguating and establishing the "authoritative" version of a person's name and providing the ability to search on "variants" or "cross-references."

Thus, users searching the collection are not able to easily retrieve the varying forms of the names of persons; for instance, a search for "Stephen Foster" will not retrieve all the instances of Stephen Foster as "Stephen Collins Foster." More importantly, a user does not necessarily know he or she has not found all that the Levy Collection metadata has to offer on a particular artist or composer. An extreme case found in Levy is the search for "Alice Hawthorne," which does not retrieve the occurrences of "Septimus Winner." The Library of Congress name authority files, however, document these names, along with "Sep Winner," as names for the same person.

2. BACKGROUND

With an expanding audience and the upcoming addition of sound and full-text lyric searching, enhancing the searching capabilities with some kind of authority control is desirable. Bringing "retrospective authority control" to a collection of over 29,000 titles can be quite a daunting task. In the Levy II framework of developing a workflow management system, however, such a project poses interesting possibilities for devising authority control tools for digitized sheet music or text collections in general.

Levy II does not involve hand-transcribing the names from the statements of responsibility or manually searching a local or national authority file, but establishing a way to automate these tasks. The project seeks to apply this "retrospective" process to creating a truly useful tool to provide for authority control in the "metadata" or "digital library environment" outside the context of traditional cataloging tools. Thus, indexers will be able to use the tool as part of a digitization project workflow. With OCLC's CORC emerging and developing during the Levy II process, we can assess how the Levy authority control tool compares to tools created in other metadata and cataloging initiatives.

The task involves first extracting the names of the composers, lyricists and arrangers, as well as the engravers, lithographers and artists, from the transcribed statements in the metadata, then inventing an efficient method for creating authoritative forms of names. Adhering to the Library of Congress name authority file, a standard authority file, eliminates "reinventing the wheel" to a certain extent and provides for interoperability with other sheet music catalogs or indexes.

3. METHODOLOGY

In order to test the feasibility of the project, we have restricted our work so far to composers, lyricists, and arrangers. Since the names and their roles are not explicitly entered into the Levy metadata, we have developed an automated system to extract this information. The task is not difficult, since the names are entered directly from the title page of the piece. For example, a piece might say "Composed by John Smith. Arranged for the piano by Bill Jones." Using a dictionary as well as a list of first and last names, we have extracted the names with simple pattern matching techniques.
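A toy version of such a pattern matcher might look as follows; the role keywords, word lists and regular expressions here are invented stand-ins for the project's actual resources.

```python
# Anchor on role keywords, then scan the clause for capitalized tokens that
# appear in first/last name lists.
import re

FIRST = {"john", "bill", "stephen"}
LAST = {"smith", "jones", "foster"}
ROLES = {"composed": "composer", "arranged": "arranger", "words": "lyricist"}

def extract_names(statement):
    names = []
    for m in re.finditer(r"(Composed|Arranged|Words)\b[^.]*", statement, re.I):
        clause = m.group(0)
        role = ROLES[m.group(1).lower()]
        tokens = re.findall(r"[A-Z][a-z]+", clause)
        for first, last in zip(tokens, tokens[1:]):
            if first.lower() in FIRST and last.lower() in LAST:
                names.append((f"{first} {last}", role))
    return names

print(extract_names("Composed by John Smith. Arranged for the piano by Bill Jones."))
# [('John Smith', 'composer'), ('Bill Jones', 'arranger')]
```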

We achieved 97.2% precision and 98.7% recall. Recall was favored, since the disambiguation system itself should help recognize invalid names.

In order to retrieve authority records efficiently, we have loaded the personal names subset of the Library of Congress Name Authority File into a MySQL database. The database allows us to query for all records that match a given name. It also allows us to retrieve context from notes fields in the authority records and from author-title authority records. We will link names from the Levy Collection to their records in the authority file through the unique Library of Congress Control Numbers.

We started the process of name disambiguation by breaking up the names extracted from the Levy collection into four randomly selected groups. We have a seed group, two training groups, and a held-out test group. Each group is then clustered, putting similar names together. This clustering can be done strictly, loosely, or not at all, depending on the demands of the collection. Clustering names results in having to disambiguate fewer names, but if names are often repeated within the collection, results could be adversely affected. Clustering, and thus gathering varying forms of names for the same persons, however, prepares for bringing the names not found in the LC authority file under authority control. We disambiguate the seed group manually to train the automated disambiguator. From this seed, we can gather statistics about how names in the Levy collection match their records in the LC Authority File.

Since no automated system can be 100% accurate, the system must be able to express its level of confidence. We intend to use a Bayesian approach to achieve this goal. For example, if N represents a name in the Levy collection, R is an authority record in the authority file, and FN(X) represents the first name associated with X, we say

P[N = R \mid FN(N) = FN(R)] = \frac{P[FN(N) = FN(R) \mid N = R] \cdot P[N = R]}{P[FN(N) = FN(R)]}

P[FN(N) = FN(R) | N = R] can be calculated from the seed data, P[FN(N) = FN(R)] can be calculated using first name frequencies, and P[N = R] is the prior probability. Each piece of evidence can be handled in this Bayesian manner, with the reasonable assumption that each piece of evidence is conditionally independent.

Many types of evidence can be used to disambiguate the names. One such piece of evidence, as shown above, is how common a name is. For example, finding the name "John Smith" in the LC database is much less significant than finding the name "Irving Berlin." Additionally, if the notes fields of a record discuss music in some way, the likelihood of a match increases. Thus, we use vector similarities with the seed notes to determine if a given record seems to be that of a musician. Another very important piece of evidence is publication date. A person could not have authored a document that was published before his or her birth, and, within the Levy Collection, most pieces were published during the author's lifetime. The probabilistic method allows a flexible weighting of all of these pieces of evidence. For example, if a name in the Levy collection does not match well with any of the name variants in the Authority File, the system could still make the connection if the contextual evidence is very strong.

Using the seed data, the system disambiguates the first training set. Unsupervised learning techniques can be employed to improve results [2]. A human can also intervene when confidence is low to help the program train. This whole process can then be repeated with the second training set. Finally, the accuracy can be measured on the held-out test set. This iterative process helps the scalability of the system by requiring only a small number of records to be disambiguated manually initially.
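A sketch of this naive-Bayes style evidence combination follows; the probability values are invented placeholders, whereas in the real system they would come from the manually disambiguated seed set.

```python
# Each piece of evidence e contributes a Bayes factor P(e | N = R) / P(e),
# multiplied into the prior under the conditional-independence assumption.
def match_posterior(prior, evidence):
    """evidence: list of (P(e | N = R), P(e)) pairs. Returns P(N = R | evidence),
    capped at 1.0 for readability."""
    posterior = prior
    for p_given_match, p_marginal in evidence:
        posterior *= p_given_match / p_marginal
    return min(posterior, 1.0)

# First name agrees (a common name), the record's notes look musical, and the
# publication date falls inside the candidate's lifetime.
evidence = [(0.95, 0.20),   # P[FN(N)=FN(R) | N=R] vs. P[FN(N)=FN(R)]
            (0.70, 0.10),   # notes-field similarity to the musical seed notes
            (0.99, 0.60)]   # date consistent with birth/death years
print(round(match_posterior(0.01, evidence), 3))   # 0.549
```

Note how a weak prior can still yield a confident match when several independent pieces of contextual evidence agree, which is exactly the behavior described above.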
4. FUTURE WORK

Based on promising initial results, we are applying the methodology to the Levy Collection. We plan to quantify the results fully and compare different methods and parameters. Such parameters include the size of the manually assigned set and the prior probabilities for each piece of evidence. We will then test the system further by bringing the engravers, lithographers, and artists under authority control.

We believe the system could be extended beyond the domain of sheet music. Any large collection of digital documents could potentially benefit from the system, since doing the work manually is costly and time consuming. The system is also advantageous because its iterative nature requires that only a very small percentage of the documents be processed manually.

5. CONCLUSIONS

We believe that name authority control is beneficial to users of digital libraries. However, understanding that the manual process is expensive and time consuming, we have developed a system to automate authority control. We believe that the system can be applied generally to a variety of digital libraries.

6. ACKNOWLEDGMENTS

We would like to thank Sayeed Choudhury, Brian Harrington, and Tim DiLauro of the Digital Knowledge Center. The Levy Project is funded through NSF's DLI-2 initiative (Award #9817430), an IMLS National Leadership Grant, and support from the Levy Family.

7. REFERENCES

[1] G. S. Choudhury, C. Requardt, I. Fujinaga, T. DiLauro, E. W. Brown, J. W. Warner, and B. Harrington. Digital Workflow Management: The Lester S. Levy Digitized Collection of Sheet Music. First Monday, 5(6), June 2000.
[2] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, 1995.

Automatic Event Generation from Multi-lingual News Stories

Kin Hui
The Chinese University of Hong Kong
Shatin, Hong Kong, PRC
[email protected]

Wai Lam
The Chinese University of Hong Kong
Shatin, Hong Kong, PRC
[email protected]

Helen M. Meng
The Chinese University of Hong Kong
Shatin, Hong Kong, PRC
[email protected]

ABSTRACT

We propose a novel approach for the automatic generation of topically-related events from multi-lingual news sources. Named entity terms are extracted automatically from the news content. Together with the content terms, they constitute the basis for representing a story. We employ a transformation-based linguistic tagging approach for named entity extraction. Two methods of gross translation of the Chinese story representation into English have been implemented. The first approach uses only a bilingual dictionary. The second method makes use of a parallel corpus as an additional resource. Unsupervised learning is employed to discover the events.

Keywords
Event Discovery, Event Detection, Multi-lingual Text Processing

1. INTRODUCTION

Interest in event generation has grown rapidly in recent years. While most existing studies have concentrated on handling specific queries, there is a need to explore techniques for classifying relevant issues from a continuous stream of data. Moreover, the rapid growth of the Internet and the wide availability of electronic media allow the use of electronic means to access newswire stories from diverse sources. Thus, the capability of identifying whether multi-lingual news stories are discussing the same event is one of the challenges of the event discovery task. In this paper, we particularly focus on dealing with English and Chinese news stories.

An event is defined as a particular activity or incident happening in a certain place at a certain time, as well as any follow-up progress. In the event generation problem, news stories arrive from different sources in chronological order. Some stories are in Chinese and some are in English. Event generation aims at identifying whether an incoming story belongs to a new event or to an existing event detected in previous stories. The detection can contribute to the construction of structured guidelines for story navigation of the whole news collection.

Traditionally, stories are represented solely by the raw terms appearing in their content. In our approach, we augment the representation with named entity components, namely person names, geographical location names and organization names. This story representation is able to capture terms conveying useful context for the event generation purpose. We look into two named entity extraction approaches and investigate their impact on event generation. The first approach utilizes the named entity extraction module provided in a commercial text mining product. The second method makes use of a transformation-based linguistic tagger and a collection of transformation rules to extract named entity terms. The transformation rules are learned from a training corpus. Another feature of our approach lies in the online gross translation. We have developed two methods to translate Chinese terms into English. The first one is the basic translation, which involves looking up each term in a bilingual dictionary and replacing it with its possible term translations. The other approach utilizes easily available resources, like a parallel corpus, to perform the translation. Discovery of new events is tackled by unsupervised learning, based on a modified agglomerative clustering algorithm.

2. STORY REPRESENTATION

As raw stories contain a sequence of words, we need to identify sentence boundaries for subsequent processing. Lexical clues, such as punctuation marks, are used to locate the boundaries. To address the absence of word boundaries in Chinese stories, we employ a dynamic programming technique based on the tool provided by the Linguistic Data Consortium (LDC) [1].

We augment the story representation with a two-dimensional semantic expression, comprising a named entity feature representation and a content term representation. The named entity feature representation is constituted by a person name component, a geographical location name component and an organization name component. Each component is represented by a set of terms with corresponding weights, expressed as a vector.
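One possible data layout for this two-dimensional representation, with a per-component weighted cosine similarity, is sketched below; the class, the field names and the component weights are our own assumptions, not the system's actual design.

```python
# Three named-entity vectors plus a content-term vector, each a sparse
# {term: weight} map.
from dataclasses import dataclass, field

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class StoryRepresentation:
    persons: dict = field(default_factory=dict)         # person-name terms
    locations: dict = field(default_factory=dict)       # geographical names
    organizations: dict = field(default_factory=dict)   # organization names
    content: dict = field(default_factory=dict)         # remaining content terms

    def similarity(self, other, weights=(0.3, 0.2, 0.2, 0.3)):
        """Weighted sum of per-component cosine similarities."""
        pairs = [(self.persons, other.persons), (self.locations, other.locations),
                 (self.organizations, other.organizations),
                 (self.content, other.content)]
        return sum(w * cosine(a, b) for w, (a, b) in zip(weights, pairs))

s1 = StoryRepresentation(persons={"clinton": 1.0}, content={"summit": 0.8, "talks": 0.5})
s2 = StoryRepresentation(persons={"clinton": 1.0}, content={"summit": 0.6})
print(round(s1.similarity(s2), 3))   # 0.554
```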
3. NAMED ENTITY EXTRACTION

We look into two approaches for extracting named entity terms. The first method is taken from the Feature Extraction Tool in the Intelligent Miner for Text Analysis package from IBM. Presently, this tool can only be applied to processing English newswire stories. The second method is based on the transformation-based linguistic tagging (TBL) approach derived from Brill [2]. This approach can deal with English and Chinese, as well as with newswire and broadcast stories.

For Chinese stories, unknown words may cause incorrect segmentations of named entities, which degrade the results of tagging. Therefore, we apply unknown word identification [3] to cope with this problem.

Together with the word candidates produced from the unknown word identifier and the named entity word tokens tagged by the tagger [4], we associate a weight to each term to represent the story for the event discovery task.

4. GROSS TRANSLATION

In order to determine whether multi-lingual news stories are topically-related, we translate the story representation of Chinese stories to English, so that we can perform unsupervised learning on a uniform data representation. Two approaches to translation have been developed, namely the basic translation and the enhanced translation.

4.1 Basic Translation

We make use of a bilingual dictionary and a pin-yin file provided by the Linguistic Data Consortium (LDC) in finding the English translation terms. First, we look up the Chinese term in the bilingual dictionary and retrieve the corresponding set of English terms. If there is no dictionary entry for a term in the person name component or the geographical location name component, we obtain the translation by means of the pin-yin of the Chinese term. The English translation terms then replace the original Chinese term to represent the story, with weights assigned based on the original Chinese term weight.

4.2 Enhanced Translation

In addition to the bilingual dictionary, we utilize a parallel corpus collection of Chinese and English news items as an additional resource to improve the accuracy of translation. An automated procedure is developed to transform the parallel corpus into a passage-aligned setting. For each Chinese story term, we use a similarity-based retrieval engine to retrieve the top-ranked Chinese passages in the parallel corpus that are relevant to the Chinese term, by posing the Chinese term as a query. Then we retrieve the corresponding English passages of those Chinese passages. Finally, we adjust the weight of the translated terms by assigning a higher weight to terms with better translation quality, based on the statistics obtained from the retrieved English passages.
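A toy sketch of the basic translation step of Section 4.1 (dictionary lookup with a pin-yin fallback for person and location names) is given below. Both tables are small stand-ins for the LDC resources, and splitting the weight evenly across multiple translations is our assumption.

```python
# Dictionary-based term translation with a pin-yin fallback for names.
BILINGUAL = {"香港": ["Hong Kong"], "政府": ["government", "administration"]}
PINYIN = {"陳": "chen", "王": "wang"}

def translate_term(term, component, weight):
    """Returns a list of (english_term, weight) pairs for one Chinese term."""
    if term in BILINGUAL:
        translations = BILINGUAL[term]
        return [(t, weight / len(translations)) for t in translations]
    if component in ("person", "location"):       # fall back to pin-yin
        romanized = "".join(PINYIN.get(ch, ch) for ch in term)
        return [(romanized, weight)]
    return []                                     # untranslatable content term

print(translate_term("政府", "content", 1.0))     # two weighted candidates
print(translate_term("陳王", "person", 1.0))      # [('chenwang', 1.0)]
```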
5. EXPERIMENTS AND RESULTS

In order to examine the performance of our event generation system, we evaluate our method on the corpus used in the latest Topic Detection and Tracking 2000 (TDT2000) project organized by DARPA [5]. The corpus consists of text and transcribed speech news data, distributed by the National Institute of Standards and Technology, in Chinese and English, spanning from January 1, 1998 to June 30, 1998. We followed the topic detection evaluation methodology designed in the TDT2000 project. The detection performance is measured by the metric (C_DET)_Norm; smaller values correspond to better performance.

We have conducted two sets of experiments to investigate the performance of our event generation approach. In the first set of experiments, we processed the Chinese stories using the original Chinese terms; both the basic and the enhanced translation were conducted on the stories under various settings to compare the effects of translation. Table 5.1 shows the detection performance, in terms of (C_DET)_Norm, under various similarity thresholds, q, from 0.05 to 0.15.

Table 5.1: Performance on translated Chinese stories detection under different similarity thresholds

Similarity Threshold     0.05    0.08    0.1     0.13    0.15
Original Chinese Terms   0.4950  0.4820  0.4725  0.4876  0.5116
Enhanced Translation     0.6362  0.5961  0.5036  0.5214  0.4705
Basic Translation        0.6307  0.5224  0.5198  0.5184  0.4952

The results show that the detection performance of the enhanced translation method is more encouraging than that of the basic translation method. We also observe that the best result achieves a detection cost of 0.4705. If the number of content terms is increased to 25, it can even achieve 0.4470.

Table 5.2 summarizes the detection performance under different similarity thresholds, q, varying from 0.05 to 0.13, for event generation of all English stories and of the multi-lingual stories.

Table 5.2: Performance on English stories and multi-lingual stories detection under different similarity thresholds

Stream     0.05    0.06    0.07    0.08    0.09    0.1     0.11    0.12    0.13
ENG IMT    0.4308  0.4279  0.4379  0.4365  0.4346  0.4276  0.4349  0.4181  0.4376
ENG TAG    0.4132  0.3801  0.3864  0.3986  0.3854  0.3809  0.3802  0.3897  0.3881
MUL IMT    0.4545  0.4387  0.4223  0.5057  0.4127  0.4288  0.4291  0.4449  0.4252
MUL TAG    0.4082  0.3861  0.3882  0.3777  0.3913  0.3924  0.3766  0.3777  0.3744

From the results, the detection cost of the stories with the named entity terms extracted by the transformation-based linguistic approach (TAG) is better than that of those extracted by the commercial product (IMT), for both the multi-lingual (MUL) and the English (ENG) stories. The best results obtained from the TAG approach on the two data streams are 0.3744 and 0.3801 respectively, compared with 0.4127 and 0.4181, the best costs produced by the IMT method.

6. FUTURE WORK

We will conduct more experiments to optimize the parameter settings. Moreover, we will focus on investigating the unsupervised learning module and explore other clustering techniques to further improve the performance.

7. REFERENCES

[1] Linguistic Data Consortium. http://morph.ldc.upenn.edu/TDT
[2] E. Brill. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21(4):543-565, 1995.
[3] C. W. Ip. Transformational Tagging for Topic Tracking in Natural Language. Master's thesis, Dept. of Systems Engineering & Engineering Management, CUHK, June 2000.
[4] H. Meng and C. W. Ip. An Analytical Study of Transformational Tagging for Chinese Text. In Proceedings of the Research on Computational Linguistics Conference (ROCLING XII), Taipei, Taiwan, ROC, pages 101-122, 1999.
[5] The Year 2000 Topic Detection and Tracking (TDT2000) Task Definition and Evaluation Plan.

Linked Active Content: A Service for Digital Libraries for Education

David Yaron
Department of Chemistry
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-1351
[email protected]

D. Jeff Milton
Department of Chemistry
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-1065
[email protected]

Rebecca Freeland
Department of Chemistry
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7981
[email protected]

ABSTRACT

A service is described to help enable digital libraries for education, such as the NSDL, to serve as collaboration spaces for the creation, modification and use of active learning experiences. The goal is to redefine the line between those activities that fall within the domain of computer programming and those that fall within the domain of content authoring. The current location of this line, as defined by web technologies, is such that far too much of the design and development process is in the domain of software creation. This paper explores the definition and use of "linked active content", which builds on the hypertext paradigm by extending it to support active content. This concept has community development advantages, since it provides an authoring paradigm that supports contributions from a more diverse audience, including especially those who have substantial classroom and pedagogical expertise but lack programming expertise. It also promotes the extraction of content from software, so that collections may be better organized and more easily repurposed to meet the needs of a diverse audience of educators and students.

Categories and Subject Descriptors
K.3.1 [Computers in Education]: Computer Uses in Education - computer assisted instruction, computer managed instruction.

General Terms
Human Factors, Experimentation

Keywords
Education, Active learning, Web authoring

1. INTRODUCTION

A large body of educational research has shown the benefits of active learning, whereby students do not sit passively while they are told information, but rather participate actively in the learning process [13]. Digital libraries for education, such as the NSDL, can help catalyze the shift towards more active, exploratory, inquiry-based learning activities. The digital library's potential lies not only in the ability to bring active learning experiences to an unprecedentedly large audience of students, but also in the potential to serve as a collaboration space to catalyze the creation, modification and assessment of student activities. First, a grass-roots development approach, which remains in intimate touch with the needs of teachers and students, may reduce concerns that can prevent the adoption of useful teaching innovations. Second, and perhaps more difficult to achieve, if the library is to be populated with a large amount of high-quality material, it will be important to engage a large community of developers and early adopters, with a wide range of interests, expertise and approaches.

The goal of our research is to provide a technical infrastructure that will help digital libraries, such as the NSDL, overcome some of the main challenges associated with the collaborative creation of engaging learning activities. From a community building perspective, the creation of these activities requires both technical and pedagogical expertise, a combination of expertise that few members of the NSDL community can be expected to possess. From a collections perspective, the level of interactivity required for engaging activities typically leads to monolithic chunks of software that are difficult to subdivide into components that promote adaptation and reuse. Our premise is that these challenges are intimately coupled, since they both relate to where one draws the line between software creation and content authoring. The current location of this line, as defined by available web technologies, is such that far too much of the design and development process is in the domain of software creation.
By pushing this line to allow for more powerful and flexible content authoring, we can better utilize the expertise of those members of the community with curriculum development and classroom experience. Redefining the line between programmer and author also promotes the extraction of content from software in a manner that leads to better organized collections that may be repurposed to meet the needs of a diverse audience of educators and students.

2. RELATED WORK

Over the last 40 years, there have been thousands of individual efforts to create science and engineering teaching and learning materials in digital formats, ranging from case study depositories to instructional software. Some have been quite effective, such as the Physics Academic Software (PAS) library [4]. It is now widely recognized, however, that developing configurable learning objects that will be reused is essential to vitalizing online instruction. The efforts of groups including the Instructional Management Systems group started by EDUCAUSE [7], the IEEE Learning Technology Standards Committee [11], the ARIADNE project [3], the Advanced Distributed Learning Initiative [5], and others to create standards for learning objects are partly motivated by the reality that reuse is rare. Most of the standards currently being developed revolve around textual content (metadata descriptions for content, standards for questions and answers, and curriculum structure standards), and not around software development and reuse.

It has long been hoped that instructional software developers would contribute to libraries of instructional software that would give less technically oriented educators access to the benefits of software simulations and interactive exercises. There are a number of significant new efforts at creating architectures for developing and sharing educational software components [2, 6, 8, 9, 10, 12], most of them based on the Java language. The ESCOT [9] project is a testbed that seeks to encourage development and re-use of learning objects, with a current emphasis on middle-school mathematics. The NEEDS [14] project to create a digital library for engineering education includes educational software on their site. The National Science Foundation Computer Courseware Repository [15] promises inclusion of instructional software but posts little to date.

Current tools to invite the participation of instructors with little programming expertise include general authoring environments, such as Macromedia Director and Authorware or Macromedia's Web-based Course Builder. While these are powerful and useful tools, they do not have a smooth integration path for coupling to repositories of Java objects, and so are not sufficient to meet the goals addressed here. They also present users with a steep learning curve. Another approach to the construction of tools for this audience starts with general component assembly models, such as the Java Beans model, and simplifies and/or provides support to make this approach accessible to a broader audience. Development of such tools is an active research area, for instance at the ESCOT project. As discussed further below, our approach is related to this one but, rather than starting with component assembly and reducing its complexity, we start with Web content creation and extend its abilities. This leads to tools that complement those of other projects, and invite a much broader authoring audience.

3. LINKED ACTIVE CONTENT

One approach to redefining the line between educational software creation and content authoring is to view the creation of curricular materials as primarily programming, and to simplify this activity so that it is accessible to more instructors. Our approach starts from a very different perspective, that of curriculum creation as Web authoring. Content authors are already familiar with the hypertext link and image map functionalities of Web authoring. Grounded in this model, we extend it to include links to and between simulations and other active learning objects such as tutorials or animations. This builds on the familiar paradigm of hypertext linking: it is intuitively simple to go from creating a hypertext link to sending a message to a learning object. While simple, this linking ability significantly increases the ability of a non-programmer to create interesting active learning experiences.

In addition to the community building advantages discussed above, linked active content also has advantages for the structure of a digital library for education. Consider the current manner in which active content is supported on the web. Currently, the limitations of HTML are overcome with plug-in technologies, such as Flash or Java, that essentially provide embedded browsers for content stored in a format that has little or no relation to the overall HTML in which the plug-in is embedded. Without good communication between the plug-in content and the overall HTML, even the text and image portions of the software, which could be handled with HTML, must instead be embedded in the plug-in. This embedding leads to large chunks of content that are difficult to adapt and reuse. Although this active content often consists of sub-objects with relationships, similar to a hypertext web site, the limits of the plug-in technology prevent this substructure from being made explicit in a manner that can lead to better organized digital library collections.

To catalyze the creation of good active learning experiences, the infrastructure should not only support the creation of domain-specific software components, but actively promote the coupling of new components to existing ones. For many of the activities that drive the web, such as product advertising and online sales, the functionality provided by HTML and plug-ins is sufficient. However, the creation of active learning experiences involves a mix of cross-domain components, such as those that handle text and images, along with domain-specific components, such as those in scientific simulations.

Consider a Virtual Lab for chemical education that provides a flexible simulation in which students may perform a large variety of experimental procedures in a manner that mimics that of a real laboratory [18]. While this flexibility supports a wide variety of approaches to chemical education, many approaches call for providing students with guidance so that they interact with the simulation in a meaningful way [19]. This guidance is typically text and image based: what to do at a certain stage, explanations of concepts, questions to be answered, etc. While much of the functionality needed to provide this guidance is domain independent, educational software presents a rather unique need for both general and domain-specific components. A flexible means to present text and images, such as that in HTML, can be made much more powerful by allowing the text and images to link to simulations. In the Virtual Lab example, it should be possible to provide text links and buttons that cause chemical solutions to appear in the virtual lab, or that check the current state of the lab to see if the student has achieved a certain goal. This design promotes the reuse of components, since each domain and each software project do not need to reinvent the general tools, but need only provide domain-specific software components.
The NEEDS [14] project to create a digital infrastructure should not only support the creation of domain- library for engineering education includes educational software on specific software components, but actively promote coupling of theirsite.TheNationalScienceFoundationComputer new components to existing ones. For many of the activities that Courseware Repository [15] promises inclusion of instructional drive the web, such as product advertising and online sales, the software but posts little to date functionality provided by HTML and plug-insissufficient. Current tools to invite participation of instructors with little However, the creation of active learning experiences involves a programming expertise include general authoring environments, mix of cross-domain components, such as those that handle text such as Macromedia Director and Authorware or Macromedia's and images, along with domain-specific components, such as that Web-based Course Builder. While these are powerful and useful in scientific simulations. tools, they do not have a smooth integration path for coupling to Consider a Virtual Lab for chemical education that provides a repositories of Jaya objects, and so are not sufficient to meet the flexible simulation in which students may perform a large variety goals addressed here. They also present users with a steep of experimental procedures in a manner that mimics that of a real learning curve. Another approach to the construction of tools for laboratory [18]. While this flexibility supports a wide variety of this audience starts with general component assembly models approaches to chemical education, many approaches call for such as the Java Beans model and simplifies and/or provides providing students with guidance so that they interact with the support to make this approach accessible to a broader audience. simulation in a meaningful way [19]. This guidance is typically Development of such tools is an active research area, for instance text and image based: what to do at a certain stage, explanations at the ESCOT project. As discussed further below, our approach of concepts, questions to be answered, etc. While much of the is related to this one but, rather than starting with component functionalityneededtoprovidethisguidanceisdomain assembly and reducing its complexity, we start with Web content independent, educational software presents a rather unique need creation and extenditsabilities.Thisleadstotoolsthat for both general and domain-specific components. A flexible complement those of other projects, and invite a much broader means to present text and images, such as that in HTML, can be authoring audience. made much more powerful by allowing the text and images to link to simulations. In the Virtual lab example, it should be possible to 3. LINKED ACTIVE CONTENT provide text links and buttons that cause chemical solutions to One approach to redefining the line between educational software appear in the virtual lab, or that check the current state of the lab creation and content authoring is to view the creation of curricular to see if the student has achieved a certain goal. This design materials as primarily programming and simplify this activity so promotes reuse of components, since each domain and each that it is accessible to more instructors. Our approach starts from a software project do not need to reinvent the general tools, but very different perspective, that of curriculum creation as Web need only provide domain-specific software components. authoring. 
We believe that this approach has advantages in learnability, organizing and reusing content, and supporting domain-specificsoftwarecomponents. Mostinstructorsare

4. PROGRAMMER AND AUTHOR ROLES

Producing a shift in the line between software creation and content authoring requires not just a change in technical infrastructure, but also a change in how programmers and educators view their roles. Although a programmer must create learning objects, a major goal of our work is to promote a change in the mind-set of the programmer, away from creating finished pieces of educational software and towards creation of learning objects that serve as viewers and manipulators of content.

The issues we are addressing may then be viewed as a specific instance of the more general objective of creating useful components for educational software. Our project is a collaboration with Andries van Dam and Anne Spalter of Brown University, who are considering the creation of components for use by programmers. Our emphasis here is on the creation of components for use by instructors. These instructor components are to be created by programmers, ideally using the software components developed by our Brown collaborators.

5. LIBRARY SERVICES

The goal of our research is to explore the requirements and benefits of allowing links between active content. Active content is content contained in scientific simulations, tutorials, and other materials that, due to their interactive nature, go beyond the abilities of HTML and so are constructed using Java or a web plug-in technology. We will refer to the software components that present this content as learning objects. By "linked active content", we mean information passing between learning objects through conduits that are created during the authoring process rather than being hard-coded into the objects themselves. Such links can pass a message to a learning object, or query an object for data. (More complex means of transferring information between objects are discussed in Section 8.4.)

A principal motivation of this approach is to give an intuitive yet powerful ability to curriculum authors and instructors. It is not a big leap to go from creating a hypertext link to sending a message to a learning object, especially if the link creation is done through dialog boxes as described below. For example, in authoring curriculum around a combustion engine simulation, the author could insert the text "click here to start the engine", with the word "here" being a link that sends the "start" message to the engine. The extended link is similar to a method call in an object-oriented programming language, and has the logical structure target:action:data, stored via XML.

The output of a completed authoring process is an XML file describing the initial arrangement and state of the learning objects, and the links between these objects that will drive the simulation and react to the student's interactions with it. A viewer, implemented in Java, is used to create the learning environment specified by this file. Note that the XML file is not a script file, but rather a set of linked objects with an initial configuration, analogous to a web site containing a set of linked documents and an initial entry page. This outcome reflects the shift from a simplified-programming to an extended-web-authoring paradigm. Unlike current component assembly tools, where errors are typically reported by a compiler in a language that is often obtuse even to experienced programmers, the only types of errors that can arise here are broken links, and even these can be avoided via automated link-generation facilities.
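To make this concrete, the following is a minimal sketch of what such an XML output file might look like. Only the target:action:data link structure and the "start the engine" example come from the text above; the element and attribute names, and the class names, are hypothetical inventions for illustration.

    <!-- Hypothetical sketch of an authoring-environment output file.
         Element and attribute names are illustrative assumptions; the paper
         specifies only the target:action:data structure of extended links. -->
    <learning-environment>
      <frame id="simulation_frame">
        <object id="engine" class="simulations.CombustionEngine"/>
      </frame>
      <frame id="text_frame">
        <object id="guide" class="core.HypertextViewer" src="guide.htm">
          <!-- The word "here" in guide.htm becomes an extended link that
               sends the "start" message to the engine object. -->
          <link anchor="here" target="engine" action="start"/>
        </object>
      </frame>
    </learning-environment>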
The viewer is a browser that supports dynamic loading of Java objects, and that supports our extended linking structure between these objects. The responsibilities of the viewer are intentionally minimized to make the environment maximally extensible. The functionality resides in the learning objects (implemented as Java classes), and in the links between these objects, not in the viewer itself. We provide implementations of core objects, discussed below, that bring extended linking to web functionalities such as hypertext and imagemaps. However, these learning objects can be swapped out for other implementations. The viewer handles only the extended linking of these objects. Even the placement of objects on the screen is done using a replaceable learning object, analogous to the layout managers of Java. Our initial implementation of the layout manager mimics HTML frames in order to capitalize on potential instructor familiarity.

The environment consists of:

- Core learning objects, such as a hypertext viewer and image-map viewer, that build on their web counterparts by supporting extended linking. These learning objects consist of cross-domain objects for viewing text and images, setting up navigation structures, assessment and grading tools, etc.

- An authoring environment that provides a simple graphical means for arranging the objects and creating links, and that helps the curriculum author organize the potentially large number of text snippets and image maps that make up a complete tutorial or exercise. The organizational aspects are handled by a file-explorer interface that presents the author with a hierarchical list of the corresponding text and image maps. This environment gives the author access to the library of learning objects that can be used to construct simulations. As objects are added to the library, the range of simulations that can be created grows.

- A viewer, implemented in Java, that creates the learning environment described in the XML file output by the authoring environment.

We are also developing domain-specific learning objects for chemical education, such as a virtual chemistry lab and other simulations. All of the software components use the Java Beans API to expose the methods that will accept messages via the linking mechanism.

6. EXAMPLES

A major goal of our research is to explore the authoring flexibility that may result from allowing linking between active content. Just as hypertext brings more power to text authoring than may have been expected given its relative simplicity, linking of active content leads to considerably more power and flexibility in the construction of student activities than we anticipated at the start of this project. We will attempt to illustrate this power through the following two examples.

6.1 Mission to Mars

In creating a student activity that utilizes a scientific simulation, the role of the programmer is to develop the simulation while that of the curriculum author is to guide student interaction with the simulation. Under current web technologies, a programmer may provide the simulation as a Java applet, which can then be placed on a web page along with text and image maps that guide the student interaction.

The text, especially if placed in a frame below the applet, can take full advantage of the hypertext abilities of HTML to present the material in an appropriate order and even provide nonlinear paths through the material. However, this text remains disconnected from the simulation. One use of the extended linking mechanism we are developing is to allow the text to pass messages to the simulation.

[Figure 1 screenshot: the "Mission Critical Chemistry: A Mexico and United States Collaborative Mission To Mars" exercise, with bubble help reading "Launch the Simulation using the current parameters".]

Figure 1. The Mars Simulation exercise with simulation frame (top left), control frame (top right) and parameter frame (bottom).

Figure 1 shows a simulation of a rocket trajectory from Earth to Mars, built using our authoring system. A programmer created the trajectory simulation as a Java component with methods that set the fuel characteristics and other simulation parameters. The author may then use the core learning objects described in Section 5 to guide student interaction with the simulation. Here, the author has created three frames: the simulation_frame, which contains the trajectory simulator object, and a control_frame and parameter_frame, each of which contains an image-map viewer.

The image-map viewer is a core learning object that builds on the paradigm of the standard web image map. A region (ellipse or polygon) of an image can be made into a "hot spot" that responds when the pointer passes over it, or is clicked/double-clicked. The authoring environment allows the author to create some simple but useful responses to activating a hot spot. For instance, in the control_frame of Figure 1, the pointer has activated a hot spot over "Launch" that brings up "bubble help" describing the action of this menu item. (Various filters are provided to allow the look of the bubble help to be customized.) Hot spots may also pop up additional images or a menu of links. In this case, clicking on the Launch hot spot sends a launch message to the trajectory simulator via the extended link trajectorySimulator:launch.

The image map also supports editable hot-spots, which serve as entry boxes for user input. For instance, in the parameter_frame of Figure 1, the student may enter and edit text. This text is then passed as data to the method associated with the hot spot. For instance, the uppermost text box sends the message trajectorySimulator:setHeat:8.9e5. In this manner, the author is allowed to create simple control panels for the simulation. The construction of the control panel is quite different from the approach of Java Bean assembly environments. Here, the author first creates an image of the control panel using a tool such as Photoshop. The author then uses our imagemap editor tool to make the relevant portions of the image hotspots that link to the simulation.

In our current implementation, the author must type in the link using appropriate syntax. However, we are currently developing dialog-driven link generation. In this manner, when the author chooses a hot spot for a link, the authoring environment presents a dialog box showing possible targets for the link, such as a list of frames into which a new learning object might appear in response to the student clicking on the hot spot. For example, the author could select simulation_frame as the location where the object would appear. This would then prompt for the desired learning object from the library. Once a learning object is selected, the author is prompted with a list of instructions that can be sent to the chosen learning object. For example, a petri dish simulation for growing bacteria could have methods setGrowthRate, startGrowth, and stopGrowth. Finally, if the chosen method requires data, as in setGrowthRate, the curriculum author would be prompted for the relevant data. Thus, the author stipulates how a simulation will behave in response to student actions by doing little more than making hyperlink connections. Note also that more than one link can be attached to a single hotspot or link.
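The hyperlink-like messages above suggest a straightforward dispatch mechanism. The sketch below shows, in hypothetical Java, how a viewer might deliver a target:action:data link such as trajectorySimulator:setHeat:8.9e5 to a learning object. The paper states only that components expose their message-accepting methods through the Java Beans API; the class below is an illustrative assumption, not the authors' implementation (a production version would presumably use java.beans.Introspector rather than raw reflection). The same mechanism would also cover component-to-component messages, such as the loadImage message in the Pathogen exercise described below.

    // Hypothetical sketch: dispatching an extended link of the form
    // target:action:data to a learning object. Illustrative only; the paper
    // specifies the link structure but not this implementation.
    import java.lang.reflect.Method;
    import java.util.Map;

    public class LinkDispatcher {
        private final Map<String, Object> objectsById; // learning objects keyed by id

        public LinkDispatcher(Map<String, Object> objectsById) {
            this.objectsById = objectsById;
        }

        /** Sends a message, e.g. "trajectorySimulator:setHeat:8.9e5". */
        public void dispatch(String link) throws Exception {
            String[] parts = link.split(":", 3); // target, action, optional data
            Object target = lookup(parts[0]);
            if (parts.length == 2) {             // no data, e.g. "trajectorySimulator:launch"
                target.getClass().getMethod(parts[1]).invoke(target);
            } else {                             // data passed as a string for simplicity
                target.getClass().getMethod(parts[1], String.class)
                      .invoke(target, parts[2]);
            }
        }

        /** Queries an object for data, e.g. "engine:get_power". */
        public String query(String link) throws Exception {
            String[] parts = link.split(":", 2);
            Object target = lookup(parts[0]);
            return String.valueOf(target.getClass().getMethod(parts[1]).invoke(target));
        }

        private Object lookup(String id) {
            Object target = objectsById.get(id);
            if (target == null) {
                // A broken link is the only kind of error the design permits.
                throw new IllegalArgumentException("Broken link: no object named " + id);
            }
            return target;
        }
    }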
6.2 Pathogen

While conceptually simple, the functionality described above is quite powerful. At the simplest level, providing one frame containing text next to a frame containing a simulation allows authoring of a tutorial that guides a student through the simulation. It is also possible to create the illusion of moving through a virtual world by placing links on imagemaps that load other imagemaps, as is common in computer games such as King's Quest or Myst. Clicking on a door loads an image in which the door is open, and clicking again loads an image of the next room. Such images can be easily obtained with a digital camera. In addition, components may be designed that fit into the environment and provide the author with considerably more power, as illustrated by our Pathogen exercise.

The Pathogen exercise addresses topics typically covered in the first few weeks of an introductory college or high school chemistry course. These topics are put in the context of drug discovery, and in Figure 2, students lead a team of researchers on an island in search of plants with medicinal activity. (They will later bring these plants back to a pharmaceutical laboratory and help determine the active ingredients.)

The upper left panel in Figure 2 contains a profile viewer component that displays information about the currently active team member. This component also has an associated authoring tool that allows an instructor to add his or her own team members.


Figure 2: The Pathogen student exercise.

Occasionally, a team member becomes infected and must be given an appropriate drug to be cured. In determining the type and amount of drug, the student must solve a problem involving chemical concepts. If the drug is not appropriate, the team member is airlifted off the island to a local hospital. The number of initial team members then sets the number of problems the student may get wrong before needing to get a new team and start again.

To support the navigation required by this application, a programmer created a map navigation component, which is loaded into the lower frame of Figure 2. The component shows a map of the island and the current location of team members, and allows the student to move these members around the island. When a team member enters a new location on the map, an image of that location is loaded into the upper-right workspace frame of Figure 2. This is an example of a component passing a message to another component. In this case, the map component passes a loadImage message to the image-map component in the workspace frame.

The programmer also created an authoring tool, shown in Figure 3, to allow authors to start with any image and place nodes at various locations to construct a map. This simple yet powerful navigation component is reused to allow the student to navigate through the pharmaceutical lab in the second part of this exercise. We are currently extending this component to support QuickTime VR images [17]. One can envision other types of navigation components that utilize, for instance, 3-D graphics.

Figure 3. Authoring tool for creating and editing new maps.

Since the image in the workspace frame is displayed through the image-map viewer discussed above, it supports all of the extensions discussed there. For instance, clicking on a hot spot on the image can load a protein viewer showing the molecular structure of a protein relevant to the current exercise.

Figure 4. The Pathogen student exercise displaying the written problem (upper right).

When a team member becomes infected with a pathogen, a drug-bottle component appears in the workspace frame, as shown in Figure 4. The instructions on the drug bottle pose a chemical problem to be solved by the students. Again, an authoring tool, shown in Figure 5, is provided to allow instructors to add their own problems to the application. Note that this problem queries the profile viewer for information such as the body weight of the team member, which may be of relevance to the appropriate cure. This communication between viewers illustrates the use of extended linking to do simple queries, in addition to the message passing discussed above.

[Figure 5 screenshot: the instructor authoring tool, with a text window for editing the problem description ("Suggested dosage: 12 mg / 5 mg of body weight. If the dosage exceeds this amount then the patient is in danger of toxic side effects. Too little and the pathogen is unaffected."), an input interface, and a list of available problem domains.]

Figure 5. The Pathogen instructor authoring tool showing the creation (or editing) of a specific drug problem.

Note that in adding a powerful component such as those created for Pathogen, the programmer creates three items: a viewer, a format for the content displayed and/or manipulated by this viewer, and an authoring tool to allow non-programmers to create or modify this content.

Note also that the approach provides multiple entry points for instructors. For instance, an instructor may first simply use the drug authoring tool of Figure 5 to add their own chemistry exercises to Pathogen. Or they may use the profile authoring tool to replace the team members with students in their class. As they become more familiar with the approach, they may attempt more complex modifications and eventually assemble their own exercises.

7. ADVANTAGES

7.1 Instructor Versus Student Interfaces

The above examples illustrate a fundamental shift in the development of interactive software for education, whereby the programmer's primary focus is on designing an "instructor interface" so that the instructor can design a "student interface." Consider the current situation, where a simulation of a combustion engine would typically be constructed as a complete, stand-alone application. The programmer would need to attach a student interface to the simulation, and in so doing, would make curriculum decisions that severely limit potential reuse. If the application were meant to illustrate a college-level concept such as thermodynamic cycles, the resulting user interface would exclude use of the application in high schools to illustrate the ideal gas law. A potential alternative would be to create a simulation with a very flexible interface, but this could easily lead to a complex application that confuses students. With the above assembly environment, the programmer is able to design an "instructor interface," with the goal of providing sufficient flexibility that the curriculum author can design useful "student interfaces." This separation of responsibilities between the programmer and the curriculum developer should lead to much more effective collaboration and reuse than is currently possible.

7.2 Browsing through Active Content

In creating a student exercise or tutorial, the curriculum author is essentially creating a means for a student to browse through active content. This browsing ability is supported by the ability to create links that load learning objects into frames. For instance, in creating an exercise about hemoglobin, an author can create two frames, one of which initially contains text describing the role of hemoglobin in the body and the other containing a protein viewer object displaying the 3D molecular structure of hemoglobin. In response to student clicks on text links or images, the instructor can load the virtual lab containing solutions in which hemoglobin has bound up various numbers of oxygen molecules, or an animation showing how hemoglobin sequentially binds up to four oxygen molecules.

The curriculum author is thereby allowed to focus on content, independent of the particular software viewer (text viewer, molecular viewer, virtual lab, QuickTime player) needed to display this content. The content is also well extracted from the viewer, in a web with relationships and interconnections that reflect the curriculum content, rather than integrated with the technology needed to display this content.

8. FUTURE DIRECTIONS

8.1 More Flexible Control Panels

The Mars example of Figure 1 showed the use of an editable hot spot to serve as a control to a simulation. This control may be extended to allow sliders, dials, etc. to be attached to certain variables of a simulation. This is an area where graphical assembly of components is known to work well. For instance, users with little programming expertise are able to construct simple yet powerful front panels in LabVIEW [16] that control real instruments. Also, attaching text boxes, sliders, etc. to JavaBeans is one area where even simple Bean assembly tools work well. We may adopt this paradigm by allowing users to drag controls on top of an image map to construct a control panel for a simulation. Attaching these controls to a simulation uses the extended linking mechanism; for instance, the author could arrange for a slider to act as a throttle for the engine by first setting the minimum and maximum values, and then linking it to the engine, choosing the "engine:setSpeed:value" link to the engine object. Such controls are easily built into the hyperlink-like scheme for message passing, since they need to send the message only when altered. This model mimics the construction capabilities of a commercial JavaBeans beanbox assembly environment but delivers it in a non-programming, education-specific service that opens construction to authors. Simultaneously, it provides a service to which educational programmers can contribute their industry-standard Java classes as learning objects. Of course, those programmers will need to take into consideration design specifications that emerge from the research being done on object design both by other groups and as a result of our own research.
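As a sketch of how such a control might plug into the linking mechanism, the hypothetical class below wires a Swing slider to the engine:setSpeed:value link and sends the message only when the control is altered. It reuses the LinkDispatcher sketched earlier; the class and identifiers are illustrative assumptions, not the authors' code.

    // Hypothetical sketch: a slider acting as an engine throttle through the
    // extended-linking mechanism. Illustrative only.
    import javax.swing.JSlider;

    public class ThrottleSlider extends JSlider {
        public ThrottleSlider(LinkDispatcher dispatcher, int min, int max) {
            super(min, max);
            addChangeListener(e -> {
                if (!getValueIsAdjusting()) {  // send the message only when altered
                    try {
                        dispatcher.dispatch("engine:setSpeed:" + getValue());
                    } catch (Exception ex) {
                        ex.printStackTrace();  // a broken link is the only expected error
                    }
                }
            });
        }
    }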

8.2 Branching

Another issue for continued research is the means by which the environment responds to student input. In the Pathogen application discussed above, the student's response to the chemical problem posed on the drug bottle is checked through a special-purpose mechanism provided by the programmer. We are currently working on branching objects that can provide authors with a simple means to add conditional behavior, or if-then statements, to the environment. These form the basis for individualizing responses and feedback to the choices that a student makes in navigating the learning environment. We view complex branching behaviors as the domain of the programmer, and therefore they are meant to be programmed into learning objects. For instance, a simulation such as our Virtual Lab [20] provides the student with varied choices and feedback on the choices they make and actions they take.

One way to give authors branching capabilities is by allowing branching controls to be inserted into image maps or control panels. The author could, for example, use a multiple-choice panel component to allow a student to follow different links depending on his or her answer. Alternately, an integer response control and a real-number response control would accept a value from the student, and follow various links depending on the value. Allowing the input to such controls to come from simulations (via extended linking) leads to a flexible simulation environment. For instance, the author can create text in a text viewer that asks the student to adjust the air/fuel ratio of the combustion engine to achieve a certain power level. They can then add the text "click here when done", and link the word "here" to a branching control. The branching control is of the real-number type, linked to engine:get_power, and configured to link to text_frame:load.power_correct.htm if the value is in the correct range and to text_frame:load.power_incorrect.htm if the value is out of the correct range. This model becomes more powerful since multiple links can be attached to the same hypertext or image-map region.
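A branching control of the real-number type described above might look roughly as follows. The sketch again leans on the LinkDispatcher from earlier and writes the follow-up links in plain target:action:data form; the structure is an assumption for illustration, not the authors' implementation.

    // Hypothetical sketch: a real-number branching control. When activated,
    // it queries a simulation for a value and follows one of two links
    // depending on whether the value lies in the configured range.
    public class RealNumberBranch {
        private final LinkDispatcher dispatcher;
        private final double min, max;
        private final String queryLink;      // e.g. "engine:get_power"
        private final String inRangeLink;    // e.g. "text_frame:load:power_correct.htm"
        private final String outOfRangeLink; // e.g. "text_frame:load:power_incorrect.htm"

        public RealNumberBranch(LinkDispatcher dispatcher, double min, double max,
                                String queryLink, String inRangeLink,
                                String outOfRangeLink) {
            this.dispatcher = dispatcher;
            this.min = min;
            this.max = max;
            this.queryLink = queryLink;
            this.inRangeLink = inRangeLink;
            this.outOfRangeLink = outOfRangeLink;
        }

        /** Invoked when the student clicks the link attached to this control. */
        public void activate() throws Exception {
            double value = Double.parseDouble(dispatcher.query(queryLink));
            boolean inRange = value >= min && value <= max;
            dispatcher.dispatch(inRange ? inRangeLink : outOfRangeLink);
        }
    }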
8.3 Assessment Components

Student performance assessment is an essential functionality. Our environment allows assessment/grading components to be easily integrated into exercises with the simulations. The functionality will include the ability for students to log in to the assignment, and then the collection of milestone data. The author can create links to the assessment objects as the simulation is created. For instance, in the above Pathogen exercise, the link to text_frame:load.power_incorrect.htm could be coupled with the link grading_object:milestone:("power question", incorrect), and similarly for the correct response. (This illustrates the importance of multiple links from a single point.) The assessment component would then keep track of the number of attempts before the correct response was obtained. Grading may be viewed as a special case of this assessment model, where the grade is based on achieving certain milestones. We are currently developing methods to store milestone data in a network database, for which we will provide a simple implementation using an open source database such as MySQL. Extensions could include learning objects that interface to standard course management systems such as those offered by Blackboard and WebCT, or with assessment tools developed for distance learning [1].

8.4 Interobject Communication

Another possible extension is to add an additional mode of inter-object communication to the viewer. For instance, objects may be allowed to publish or monitor messages on a bus, as in the InfoBus standard (a standard for data communication between JavaBeans components). While this would allow programmers to provide authors with more powerful control panel objects than that described above, it is important to assess the effects of this added complexity on the author. In particular, will issues of synchronization lead to new types of errors, beyond the broken links of the current design, and potentially frustrate the curriculum author?

9. DIGITAL LIBRARY SERVICES

We stress that our approach is to start from a simple design and add in only those functionalities that retain a high level of ease-of-use. This assembly environment is not meant to provide a single all-purpose solution to component assembly. Rather, it is meant to complement assembly environments currently available that are based on the simplified-programming model, such as ESCOT. Design of assembly environments involves decisions of flexibility vs. ease-of-use. The emphasis in this tool is on ease-of-use via an extended-web-authoring approach that should be intuitive to instructors, and on the inclusion of features that help organize the potentially large number of texts and images that make up a learning environment. To the extent that this environment does not support a more complex mode of component assembly, one of the tools based on the simplified-programming model may be used to assemble components into an object for inclusion in this environment. Thus the two classes of tools actually complement one another. Our tool fills an important need, since sole reliance on more complex tools would exclude potentially valuable contributors.

Our research then consists of the development of two services in support of digital libraries for education. First is providing an authoring environment that illustrates the "learning object" approach to creating on-line learning activities. This set of core learning objects, such as a hypertext viewer and image-map viewer, extends their web counterparts by supporting links to active content. Second is exploring the extent to which the piecing together of learning objects can be moved from the domain of the programmer into the domain of the curriculum developer. We anticipate that this shift will lead to both a larger community creating, adapting and using materials for the digital library and a better organized collection.

10. ACKNOWLEDGMENTS

This work was funded by the National Science Foundation as part of the NSDL (National Science Mathematics Engineering and Technology Education Digital Library) program.

11. REFERENCES

[1] Project ADEPT. http://www.users.csbsju.edu/tcreed/adept/index.html
[2] AGENTSHEETS. http://www.agentsheets.com
[3] ARIADNE Collaborative browsing project on digital libraries. http://www.comp.lancs.ac.uk/computing/research/cseg/projects/ariadne/
[4] American Physics Society, Physics Academic Software. http://webassign.net/pasnew/
[5] Advanced Distributed Learning Initiative network, DOD. http://www.adlnet.org/
[6] BELVEDERE. http://www.ics.hawaii.edu
[7] EDUCAUSE. http://www.educause.edu/
[8] EOE. http://www.eoe.org
[9] ESCOT. http://www.sri.com/policy/ctl/html/escot.html
[10] ESLATE. http://e-slate.cti.gr/
[11] IEEE Learning Technology Standards Committee (LTSC). http://www.manta.ieee.org/groups/ltsc/
[12] JAVASKETCH. http://www.keypress.com/sketchpad/java_gsp
[13] Meyers, B.J. & Jones, T.B. Promoting Active Learning: Strategies for the College Classroom. Jossey-Bass, San Francisco, 1993.
[14] NEEDS. http://www.needs.org
[15] NSFCCR. http://www.education.siggraph.org/nsfcscr/nsfcscr.home.html
[16] National Instruments Corp., LabVIEW. http://www.ni.com. 11500 N Mopac Expwy, Austin, TX 78759.
[17] QuickTime VR. Apple Computer Inc. http://www.apple.com/quicktime
[18] Smith, J. M. What Steve Jobs Did Right: The Educational Potential of the Ideas Behind NeXTSTEP. Educom Review, 1994.
[19] Squires, D. Educational Software for Constructivist Learning Environments: Subversive Use and Volatile Design. Educational Technology, May-June 1999, 48-54.
[20] Virtual Laboratory, The IrYdium Project, Carnegie Mellon University, Chemistry Dept.

A Component Repository for Learning Objects: A Progress Report

Jean R. Laleuf
Brown University
Department of Computer Science, Box 1910
(401) 863-7658
[email protected]

Anne Morgan Spalter
Brown University
Department of Computer Science, Box 1910
(401) 863-7615
[email protected]

ABSTRACT

We believe that an important category of SMET digital library content will be highly interactive, explorable microworlds for teaching science, mathematics, and engineering concepts. Such environments have proved extraordinarily time-consuming and difficult to produce, however, threatening the goals of widespread creation and use. One proposed solution for accelerating production has been the creation of repositories of reusable software components or learning objects. Programmers would use such components to rapidly assemble larger-scale environments. Although many agree on the value of this approach, few repositories of such components have been successfully created. We suggest some reasons for the lack of expected results and propose two strategies for developing such repositories. We report on a case study that provides a proof of concept of these strategies.

Keywords

Components, design, digital library, education, learning objects, NSDL, reuse, software engineering, standards.

Figure 1: Professor van Dam uses a Tinkertoy house and a cardboard perspective viewing volume to demonstrate camera viewing transformations.

1. INTRODUCTION

Our vision for digital library content goes beyond scanned literature or searchable curriculum materials to include richly interactive explorable microworlds that take full advantage of the ever-increasing power of computers, software, and networks. These learning environments combine the best qualities of live demonstration (see Fig. 1) with interaction only possible on the computer. Unfortunately, our research has led us to the conclusion that it is an unexpectedly huge effort to create a complete collection of interactive learning experiences for even a single introductory course in a given discipline.

We have been trying to accelerate the development process by creating reusable components or learning objects that can be recombined in different ways to produce sets of learning environments. By components or learning objects we mean standardized pieces of code, usually class files or Java beans, which programmers can easily reuse in different programs.

It may seem at first that creating a few dozen components would suffice for many courses. For example, an introductory calculus course would have a function-graphing component, some tools for finding derivatives and integrals, a series expander, expression editors and a few more objects, but the reality is far more complicated. The situation is similar to that faced by GUI designers. If one looks at an interface, it seems to be made up of a few basic elements, such as field boxes, sliders, buttons, and panes, but GUI libraries are huge and take enormous effort to develop.

Based on the GUI library analogy, we believe that half a dozen programmers and researchers could never construct a major reusable educational components library in a reasonable amount of time. Only a concerted collaborative effort of tens of person-years can build the necessary content for any given field. In particular, although some underlying components, such as math libraries for learning objects, may be applicable across many science education domains, more domain-specific components must be created separately for each field.

The GUI library analogy assumes that we know exactly how to make these components, but in the field of educational software components there are many open research issues. For example, how does one analyze current simulations for decomposition into reusable components? How can one design components to be useful for educators (as well as programmers)? And how does one choose a proper level of granularity for a component?


We have been exploring two main strategies for repository creation. The first is the use of a categorization scheme to help programmers analyze and characterize types of components as they relate to educational purposes. The second is a method for addressing issues of object granularity and determining what levels of object complexity are appropriate. We have applied these strategies in a proof-of-concept case study.

The work described here is part of an NSF NSDL grant, the CREATE project (A Component Repository and Environment for Assembly of Teaching Environments).

2. PREVIOUS WORK

It has long been hoped that instructional software developers would contribute to libraries of instructional software that would give others access to the benefits of software simulations and interactive exercises. There are a number of significant efforts to develop and share educational software components, most of them based on the Java language. The ESCOT project is a testbed that seeks to encourage development and reuse of learning objects, with a current emphasis on middle-school mathematics, but their components are not available for general public use [6]. The E-Slate company sells educational components (as opposed to finished applications) and describes about two dozen of them on their Web site [7]. The Educational Object Economy project compiles interactive educational tools in the form of complete applets [5]. As far as we can tell, none of these undertakings provides complete sets of components for specific courses, and even if they become quite successful, they are aimed chiefly at educators with little or no programming experience. There is still a need for the digital library to house lower-level components, to be used by programmers to create fully customized educational environments. Further discussion of educational component use can be found in the IEEE Computer September 1999 Special Issue on Web-based learning and collaboration [13].

In addition to work specific to educational software, the technical, social, economic, and administrative problems with general computer code reuse are now better understood [4, 11]. Problems include failure to organize and index reusable objects, failure to mandate that code be designed with reuse in mind, lack of an organizational structure dedicated to supporting reuse, and failure to recognize the domain dependency of reuse strategies.

The Exploratories project at Brown University, on which this paper's work is based, has worked for over five years to create learning objects with high levels of interactivity [8]. Our chief content area has been introductory computer graphics, including introductory linear algebra [2, 17]. We think of exploratories as combinations of "exploratoriums" [9] and laboratories, realized as two- and three-dimensional explorable worlds which are currently implemented as Java applets. Our applets are embedded in a hypertext environment and are used by a number of high school- and college-level courses around the world. They are freely available at our Web site [8].

When we began trying to reuse frequently occurring program elements, we quickly found that simply copying and pasting code was not a good strategy. Most classes worked well only in the program for which they were initially designed. The problems for reuse ranged from a lack of software interface standards (such as those imposed by the Java beans spec [16]) to difficulty in arriving at the right level of feature complexity. In the end, we found that everyone wound up rewriting the "reusable" elements. The strategies and project discussed in this paper were inspired by this situation.

The Exploratories project has tried to promote other aspects of reusability by creating reusable hypertext structures for Web-based curricula [19], describing methods for integrating learning objects into traditional curricula [20], categorizing pedagogical approaches and teaching techniques that can be used for interactive learning environments [18], and creating a Web-based repository structure for JavaBeans [3].

2.1 A Component Categorization Strategy

When we began making components, we found that they fell naturally into three different categories. We characterize all reusable educational components as either 1. Core Technologies, which have a high degree of domain independence and are typically quite fine-grained (e.g., Java GUI classes); 2. Support Technologies, which usually have some domain dependence and are typically medium-grained objects (e.g., BEA Systems JavaBeans designed for e-commerce, including shopping cart beans, order tracking beans, etc.); or 3. Application Technologies, which are almost always highly domain dependent and coarse-grained (e.g., a Java applet that teaches about a particular chemical reaction). These categories are described in more detail below.

2.1.1 Core Technologies

Characteristics
- Domain independence
- High levels of reusability
- Self-contained functionality
- Adherence to high standards of design and reliability

Audience
- Chiefly programmers but also some content developers (with little or no programming ability) using assembly tools

Granularity
- Typically fine-grained (e.g., sliders, timers and buttons) but can also include coarse-grained objects (e.g., spreadsheets or even an entire pedagogical framework into which one can plug one's materials [19])

Examples
- Java GUI classes
- Various math function libraries
- IBM AlphaBeans [1]
- 3D interaction and visualization widgets

Notes
- Many commercial offerings fall into this category, but such components may also be built in-house.
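To make the Core Technologies category concrete, here is a minimal, purely illustrative sketch of such a fine-grained component written to the JavaBeans conventions (no-argument constructor, get/set property pairs, bound-property change events) that let visual assembly tools discover and wire it. It is an assumption for illustration, not a component from any existing repository.

    // Hypothetical sketch of a fine-grained core-technology JavaBean:
    // a timer bean exposing one bound property. Illustrative only.
    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;
    import java.io.Serializable;

    public class TimerBean implements Serializable {
        private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
        private int intervalMillis = 1000;

        public TimerBean() {} // no-arg constructor required by the Beans spec

        public int getIntervalMillis() { return intervalMillis; }

        public void setIntervalMillis(int intervalMillis) {
            int old = this.intervalMillis;
            this.intervalMillis = intervalMillis;
            // Bound property: assembly tools can react to this change event.
            changes.firePropertyChange("intervalMillis", old, intervalMillis);
        }

        public void addPropertyChangeListener(PropertyChangeListener l) {
            changes.addPropertyChangeListener(l);
        }

        public void removePropertyChangeListener(PropertyChangeListener l) {
            changes.removePropertyChangeListener(l);
        }
    }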

2.1.2 Support Technologies

Characteristics
- Domain dependence
- Moderate levels of reusability
- Self-contained functionality

Audience
- Chiefly programmers but also some content developers (with little or no programming ability) using assembly tools

Granularity
- Typically medium-grained objects, often aggregating finer-grained components (e.g., an image filtering widget that combines slider and windowing components)

Examples
- BEA Systems JavaBeans designed for e-commerce (e.g., shopping cart beans, order tracking beans, etc.)

Note that in comparison to other fields, especially e-commerce, educational technology is particularly deficient in the support technologies category. We know of no substantial repositories for such domain-specific components and therefore observe educational software developers repeatedly building these types of components. We believe that one of the chief problems facing current efforts in educational component technology is a lack of distinction between support and core technologies. This is compounded further by a deficiency in understanding the design features necessary to promote reuse of support technologies.

2.1.3 Application Technologies

Characteristics
- Domain dependence
- Minimal levels of reusability
- Each object is a self-contained, fully-functional application or applet that attempts to achieve an educational goal

Audience
- End users without programming experience. Application Technology components can be combined by both programmers and non-programmers (using a visual environment) and can be part of a larger pedagogical structure (e.g., a game or lab) or a full-blown curriculum.

Granularity
- Typically coarse-grained, although useful distinctions can be made about the level of concept granularity of these end products (e.g., fine-grained applets teaching a single concept vs. coarse-grained applets teaching a number of concepts in one module)

Examples
- An applet that teaches characteristics of a particular chemical reaction
- An applet that teaches how to make a scenegraph in 3D graphics

Notes
- These objects may also be part of a larger group of similar objects (e.g., a series of applets to teach increasingly complex topics in a subject). Because these objects attempt to achieve an educational goal and are targeted primarily at end-users, pedagogical, user interface, and information structure concerns are highly important in the creation of objects for this category.

2.2 A Granularity Strategy

Although the categories described above provide guidance as to what levels of granularity are appropriate for specific types of components, individual components may in fact be of any granularity regardless of the category in which they fall. In our experience of creating components, we found that creating them solely at any single level of granularity either meant having too many features and not enough flexibility, or having so fine-grained a set of components that a great deal of work still needed to be done to create the final product.

Our strategy for dealing with granularity and organizing the results is, for a given component, to produce a complete set of sub-components, thus providing objects at all levels of granularity, from application technology components (e.g., self-contained interactive applets) down to core technology code (e.g., a coordinate system package which includes a complete set of interfaces and behaviors for coordinate systems). This decomposition is important to complete even if there is no current need for some of the components it generates.

When this strategy is employed, programmers given a component from any category will be able to customize that component and also to retrieve, customize, reuse and reconfigure any sub-components it might aggregate. Teachers working with programmers would have virtually no constraints on how they could reconfigure what they see on the screen to meet their own needs. By including all levels of component granularity in the development process, we hope to create component repositories that will be broadly applicable to diverse areas of science education. In addition, completely decomposed components describe a hierarchy that is helpful for documentation purposes. The downside of this strategy is that it requires a large upfront investment, as discussed in the Conclusion.

3. Case Study: The Camera Viewing Transformation

Our case study is based on a set of applets that teach students in an introductory graphics course about 3D camera transformations. One of the basic tasks in computer graphics is generating a representation of a three-dimensional scene and displaying it on the two-dimensional computer screen. This is performed in a manner very similar to positioning a camera in front of a real-world scene, taking a photograph, and looking at the resulting image.

For the sake of mathematical simplicity and efficiency, a series of manipulations must be performed on the scene and its geometry before its two-dimensional representation can be drawn on the screen. Pedagogically, these are best described as changes in position and shape of the objects concerned. The continuum of these changes is of particular interest to students.

3.1 History

3.1.1 Text Illustrations

Despite the inherently continuous nature of the camera transformations material, the canonical reference text in computer graphics, Computer Graphics: Principles and Practice [10], provides only five pairs of discrete snapshots of these continuous manipulations (see Fig. 2). Unfortunately, these images are at best hard to decipher and are typically found confusing and unintuitive. The essential difficulty of these images is that they are preset, providing no opportunity for exploration or discovery through manipulation of the scene and the operations it undergoes.

Figure 2: One of five pairs of diagrams used to illustrate the camera viewing transformations.

3.1.2 Models used in class

In his class on introductory computer graphics, Professor Andries van Dam does a lot to impart the continuous nature of the scene transformations through the use of Tinkertoy props, as shown in Fig. 1. This resolves many of the issues presented by the book's illustrations, as he is able to move the props around and explain how they move and change orientation in time. Students see the continuity of motion and more fully grasp the concepts presented to them. It is difficult, however, to show more than one model changing at a time, and Tinkertoys lack the flexibility necessary to accurately display the concluding mathematical transformations. These include scaling and other non-rigid-body, distorting transformations, requiring much hand-waving by the professor and implicit visualization by the students.

3.1.3 Customized software

One of the first attempts at using computer technology to resolve the problems above was a program written at Brown for an Evans and Sutherland vector display in the late '70s [12]. The resulting program allowed one to manipulate the parameters of the computer camera and then have the program animate the succession of transformations to produce a 2D image. Students observed the continuity of the changes and saw how one shape smoothly turned into another over time. In the early '90s the program was reimplemented for a raster display, again using an experimental software system.

This second version was more powerful and compelling than the first. It was heavily used for many years but had several limitations, such as the fact that it only ran in our research lab. In addition, because of the nature of the program design, it was impossible to easily modify it, and when new viewing models were introduced in the class, the demo became somewhat confusing. As with all software, it also suffered from code rot. Again, the custom nature of the software made this difficult to address. When the underlying software system was abandoned by the research group, the program eventually ceased to function. A video was made, but the lack of interaction greatly diminished its usefulness.

3.2 Current incarnation

The idea of an interactive, computer-based educational tool describing the camera transformations was revisited in early fall of 1999 when one of the course's teaching assistants offered to throw together a rough prototype replicating a subset of the FLESH program's capabilities. Working off this successful prototype, we engaged in a more formal design process, aiming to better think out the various pedagogical considerations and also use the new applet as a proving ground for our ideas on component design and organization. We thought through the potential components in terms of our three categories and began to decompose each one according to our theory of granularity completeness.

3.2.1 Pedagogical considerations

We sought primarily to provide a means by which a user could visualize the various manipulations undergone by a synthetic scene when its two-dimensional representation is generated. It needed to be useful for the professor demonstrating the concepts to his class during a lecture as well as accessible to students who wanted to revisit the material and explore and experiment on their own time.

To these ends we sought to have our new versions encompass all the capabilities of other methods of teaching the material (such as the diagrams and models) while enabling students to freely explore and answer any question that might occur to them. We chose to provide an interactive, three-dimensional computer rendering to allow one to view the scene from multiple angles. We also gave the users complete control over the passage of time in the simulation, allowing them to advance or backtrack as needed to better understand a particular concept. Furthermore, we provided full, interactive and graphical control over the various camera parameters (position, orientation, field of view, etc.) at all stages of the simulation in order to offer students as many possibilities for exploration as possible.

3.2.2 Component structure

In Fig. 3, we see a sample of the components used in the current incarnation of the camera viewing applet. The transparent gray truncated pyramid represents the computer's perspective view volume, and the scene being viewed consists entirely of the simple house. A representative component structure, the Vector Solid View, has been selected to show how a coarse-grained, high-level component can be broken down into fine-grained, customizable components.

[Figure 3 diagram. Arrows indicate composition/depends-on relationships and show how one coarse-grained component is made up of multiple other fine-grained components: Camera (Camera Model, Camera View); Ground Plane (Plane Model, Plane Grid View); Vector Solid View (Cylinder Primitive, Cone Primitive, Highlight Behavior, Manipulator, the latter comprising Behavior and Sphere Primitive); Vector Math View (Numeric Text Field x3, Validation Engine, Text Field); Labeled Axes (Orientation Behavior, Labeled Axis x3, Line Primitive, Axis Label). Regular Swing controls and widgets provide a way to manipulate the Camera Model parameters. Names in italics indicate that those components are non-visual, i.e., they have no corresponding on-screen representation. The diagram only includes a few representative components out of the 34 that are used for this applet; components that have been omitted either serve a purely supportive role or very closely mirror a structure already being presented.]

Figure 3: Decomposition of a sample set of the components used in the camera viewing transformation applet.

Vector Solid View. The Vector Solid View aggregates the following components:

Cylinder primitive. The cylinder primitive provides a visual representation of the stem of the vector's arrow representation. It would be relatively simple to replace it with another primitive such as a rectangular solid.

Cone primitive. Serving a function similar to the cylinder primitive component, the cone primitive provides a visual representation of the vector arrow's head. As with the cylinder, changing the type of primitive used would be trivial.

Highlight Behavior. This non-visual component provides user feedback functionality by highlighting the head and stem of the vector arrow when the user's mouse passes over them. This allows users to intuitively understand that something will happen if they click on the different parts of the vector.

Manipulator. This component allows users to visually interact with the vector, dragging it around to change its orientation. It is composed of three different components, one of which has been omitted from the diagram for simplicity's sake but which is described below:

- Behavior. This component plugs into the Java3D interaction framework to allow users to interact with the vector by dragging it to the desired position.

- Sphere Primitive. This object provides visual feedback to users while they drag the vector arrow to a new orientation.

- Spherical Geometry Intersection Utility Class. Although omitted from the diagram, this non-visual component serves a critical role by performing the calculations necessary for the Behavior component to function correctly. Encapsulating this functionality in a single component facilitates long-term maintenance, allowing it to be replaced at a later date with a more efficient implementation should one be developed.
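The following hypothetical sketch shows how such a decomposition might be expressed in code: the coarse-grained view aggregates the sub-components named in Fig. 3 and exposes each one, so that a programmer can retrieve, customize, or replace any piece (for example, swapping the cylinder for a rectangular solid) without rewriting the view. The stub types and constructors are invented for illustration; they are not the authors' classes.

    // Hypothetical sketch of granularity completeness: the coarse-grained
    // Vector Solid View aggregates, and exposes, its fine-grained parts.
    // All types here are illustrative stubs, not the authors' code.
    class CylinderPrimitive {}                       // vector arrow stem
    class ConePrimitive {}                           // vector arrow head
    class HighlightBehavior {
        HighlightBehavior(CylinderPrimitive stem, ConePrimitive head) {}
    }
    class Manipulator {}                             // drag-to-reorient behavior

    public class VectorSolidView {
        private final CylinderPrimitive stem = new CylinderPrimitive();
        private final ConePrimitive head = new ConePrimitive();
        private final HighlightBehavior highlight = new HighlightBehavior(stem, head);
        private final Manipulator manipulator = new Manipulator();

        // Each sub-component is reachable, so it can be customized or
        // replaced independently of the aggregate view.
        public CylinderPrimitive getStem() { return stem; }
        public ConePrimitive getHead() { return head; }
        public HighlightBehavior getHighlightBehavior() { return highlight; }
        public Manipulator getManipulator() { return manipulator; }
    }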


Figure 4: The component reuse graph for the four camera viewing transformation applets.


Figure 5: The component reuse graph for the radiosity applet.


Figure 6: The component reuse graph for the shading applet.

In Figure 5, we see an applet that reuses many of the components used in the camera viewing applets. This applet helps students understand the rendering process called radiosity, in which energy transport is simulated to calculate diffuse light reflections in a scene. We see reuse of the Vector, Point, and Plane core technology components as well as the standard Swing components. The Axes component has not been reused because the Point Plotter component is being used instead. Notice also that although the components in the Camera Support package are not being used, those in the Radiosity Support package are. This is in keeping with the fact that support technology components are typically domain-specific and, although reusable for similar applets, are generally not reusable by applets from different domains.

Figure 6 demonstrates that components can, in some cases, be classified into two different categories depending on how they are used by the application component. In this case, the Appearance Editor component is being used as a support technology component because it plays a central supportive role for the shading applet. Contrast this to Figure 4, in which this very same component acts as a core technology component with respect to the camera viewing applets. This ability of components to migrate from role to role depending on how they are used by other components adds more flexibility to the classification schema laid out above and thereby enhances its power.

3.3.1 Standards
All of our components adhere to the JavaBeans specification because doing so facilitates integration of components into commercial software design packages. A properly-designed Java Bean, for instance, can be used without any problems in Sun's Bean Box, Inprise's JBuilder, Forte's NetBeans IDE, or IBM's VisualAge. It can be used with equal success in the educational authoring environments being produced by our colleagues at ESCOT and E-Slate. The universal acceptance of the JavaBeans standard therefore makes it a valuable requirement for all our component designs and will make our components that much more powerful down the road.
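As a concrete illustration of what adherence to the specification entails, the following sketch shows the minimal conventions a builder-friendly bean follows: a public no-argument constructor, get/set accessor pairs, serializability, and bound properties announced through PropertyChangeSupport. The class and its property are hypothetical examples chosen for illustration, not components from the Exploratories libraries.

    import java.awt.Color;
    import java.beans.PropertyChangeListener;
    import java.beans.PropertyChangeSupport;
    import java.io.Serializable;

    // Hypothetical example of a JavaBeans-conformant component with one
    // bound property, "highlightColor".
    public class HighlightBean implements Serializable {

        private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
        private Color highlightColor = Color.YELLOW;

        // A public no-argument constructor lets visual builders instantiate the bean.
        public HighlightBean() { }

        public Color getHighlightColor() {
            return highlightColor;
        }

        public void setHighlightColor(Color newColor) {
            Color oldColor = highlightColor;
            highlightColor = newColor;
            // Firing a PropertyChangeEvent lets builder tools and other beans
            // react when the property is edited.
            changes.firePropertyChange("highlightColor", oldColor, newColor);
        }

        public void addPropertyChangeListener(PropertyChangeListener l) {
            changes.addPropertyChangeListener(l);
        }

        public void removePropertyChangeListener(PropertyChangeListener l) {
            changes.removePropertyChangeListener(l);
        }
    }

Because every design environment named above can introspect exactly these conventions, following them once makes a component usable everywhere without adaptation.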
3.4 Future Work
Our digital library grant is a collaborative one with chemistry professor Dave Yaron at Carnegie Mellon. Professor Yaron and other members of the grant team at Carnegie Mellon have been creating environments in which non-programming educators can customize interactive experiences and embed them in their own pedagogical materials. We have begun to join our different approaches to the problem of reuse and customization in a collaborative digital library project to aid chemistry education.

3.4.1 Molecular Visualization Applet
We are working on a molecular visualization applet to increase the reuse of existing components as well as to introduce new components developed in-house and by third parties.

A current prototype aims to reuse the plane model and view components as well as some of the core 3D object manipulation components that allow users to intuitively move objects in a 3D space.

This prototype expands the core technology space by using a VRML file loader imported from a third-party source and using it to import VRML models retrieved from the Protein Data Bank, thereby leveraging content from an existing digital library. We are also planning on introducing a precise collision detection package to the core packages, as well as more basic 3D interaction widgets. Finally, we will provide a protein docking engine component that might find significant reuse in other software dealing with protein interactions.

The resulting set of components will be glued together using our Carnegie Mellon collaborators' non-programming environment, thereby providing the power of various in-house and third-party software components in an easy-to-use yet powerful building tool [15].

3.4.2 Increased Compatibility Effort
Recognizing that it would be a grave mistake to isolate our work from complementary research and development performed by other research groups, we will also be maximizing compatibility between the components we develop and development environments and testbeds such as those produced by the E-Slate and ESCOT projects. Recent developments have made these environments more compatible with the JavaBeans standard, and we anticipate increased productivity by leveraging the tools they provide.

3.4.3 Metadata Standards for Harvesting
We have started tagging the applets we produce with standard IMS [14] metadata tags in order to make our efforts easily harvestable by collectors of digital library material. In the near future, we intend to extend this effort by tagging individual components, thereby allowing harvesters into our component libraries as well as our applet catalogs.

4. DISCUSSION AND CONCLUSION
When we proposed our framework and granularity strategy as part of our grant application, we had limited experience using either one for real-life applications. By creating the camera viewing transformation applet set, we were able to test these theories in practice. We feel that the results offer a proof of concept of these strategies.

We have found, however, that there are also disadvantages to using a structured, component-centric approach with an emphasis on reusability. Although the up-front design results in components that are well designed, this design methodology greatly increases the length of the development cycle. For example, the viewing techniques applets used in our case study took roughly four months to design, write, and test. When compared with the four days it took to write the applet's initial buggy prototype, this may not seem to have been worth the time or effort. In addition, it takes more skilled programmers to develop truly reusable components, and taking this approach meant that we could no longer rely solely on undergraduates, our previous source of programmers.

If a digital library of reusable learning objects is ever to become a reality, however, we must continue to invest the up-front time. In our case, although we took four months to design, implement, and test a single applet, we also produced several dozen components that we consider to be stable and generally reusable. Indeed, they were easily reused by the undergraduate who programmed the radiosity applet. For us, the long-term benefits of this far outweigh the short-term gains of the hacked-together prototype, whose code was not reusable.

We believe the complete decomposition approach we have adopted for component creation allows us to offer full customization of components by content authors, thereby increasing the degree to which individual components are reused. We are also now convinced that the effort and time we must devote to designing and developing not only good components but also a large amount of underlying infrastructure will enable us to reap huge rewards down the line and significantly enhance the field of educational software component technology.

Although starting a good repository of reusable learning objects is a time-consuming and expensive task, ultimately doing so should dramatically reduce the time needed to create interactive learning environments. Not only will mature programmers' time be reduced, but inexperienced developers will be able to make sophisticated learning environments by reusing learning objects that encapsulate advanced functionality they would not be able to easily program on their own.

Developers' new creations, and the new components that are a part of them, can in turn be contributed to the library, creating a snowball effect of additional content, both in complete instructional applications and in new building blocks from which future applications can be built.

5. ACKNOWLEDGMENTS
We would like to acknowledge the helpful comments of Andy van Dam and the contributions to the Exploratories project of Rosemary Michelle Simpson. We would also like to thank our sponsors, the National Science Foundation, through our NSDL and STC grants, and Sun Microsystems.

6. REFERENCES
[1] IBM. alphaBeans: Java Beans by IBM, http://alphaworks.ibm.com/alphabeans/
[2] Jeff E. Beall, Adam M. Doppelt, and John F. Hughes. "Developing an Interactive Illustration: Using Java and the Web to Make It Worthwhile," in Proceedings of 3D and Multimedia on the Internet, WWW and Networks, 16-18 April 1996, Pictureville, National Museum of Photography, Film & Television, Bradford, UK, 1996.
[3] BeanHaus. Java Bean Repository, http://www.beanhaus.org
[4] Stephanie Doublait. "Standard Reuse Practices: Many Myths vs. a Reality," in StandardView, Vol. 5, No. 2, June 1997.
[5] EOE Foundation. Educational Objects Economy: Building Communities that Build Knowledge, http://www.eoe.org
[6] ESCOT project. Educational Software Components of Tomorrow, http://www.sri.com/policy/ctl/html/escot.html
[7] E-Slate project. An exploratory learning environment, http://e-slate.cti.gr/
[8] Exploratories project. Web-based educational software, http://www.cs.brown.edu/exploratory/
[9] Exploratorium. The San Francisco Exploratorium: museum of science, art, and human perception, http://www.exploratorium.edu/
[10] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. "Computer Graphics: Principles and Practice," Addison-Wesley, 1996, ISBN 0-201-84840-6.
[11] Richard P. Gabriel. "Patterns of Software: Tales from the Software Community," Oxford University Press, August 1996.
[12] R. F. Gurwitz, R. W. Thorne, A. van Dam, and I. B. Carlbom. "BUMPS: A Program for Animating Projections," in Proceedings of ACM SIGGRAPH '80, pp. 231-237, 1980.
[13] IEEE. "Web-Based Learning and Collaboration" special issue, Computer, Vol. 32, No. 9, September 1999.
[14] IMS Global Learning Consortium, Inc., http://www.imsproject.org/
[15] IrYdium Project. Java Enhanced Chemical Education, http://ir.chem.cmu.edu/irProject/
[16] JavaBeans. "Specification for the Java 2 Platform," http://java.sun.com/products/javabeans/glasgow/
[17] Rosemary Michelle Simpson, Anne Morgan Spalter, and Andries van Dam. "Exploratories: An Educational Strategy for the 21st Century," in ACM SIGGRAPH '99 Conference Abstracts and Applications, 1999.
[18] Anne Morgan Spalter, Michael LeGrand, Saori Taichi, and Rosemary Michelle Simpson. "Considering a Full Range of Teaching Techniques for Use in Interactive Educational Software: A Practical Guide and Brainstorming Session," in Proceedings of IEEE FIE 2000 (Frontiers in Education), October 2000.
[19] Anne Morgan Spalter and Rosemary Michelle Simpson. "Reusable Hypertext Structures for Distance and JIT Learning," in Proceedings of ACM Hypertext 2000, June 2000.
[20] Anne Morgan Spalter and Rosemary Michelle Simpson. "Integrating Interactive Computer-Based Learning Experiences Into Established Curricula," in Proceedings of ACM ITiCSE 2000 (Innovation and Technology in Computer Science Education), July 2000.

Designing e-Books for Legal Research

Catherine C. Marshall,* Morgan N. Price, Gene Golovchinsky, Bill N. Schilit
FX Palo Alto Laboratory, Inc.
3400 Hillview Ave., Bldg. 4
Palo Alto, CA 94304 USA
[email protected], [email protected], {gene, schilit}@pal.xerox.com

* Author's current address: Microsoft Corporation, One Microsoft Way - 32/1079, Redmond, WA 98052, [email protected]

ABSTRACT
In this paper we report the findings from a field study of legal research in a first-tier law school and the resulting redesign of XLibris, a next-generation e-book. We first characterize a work setting in which we expected an e-book to be a useful interface for reading and otherwise using a mix of physical and digital library materials, and explore what kinds of reading-related functionality would bring value to this setting. We do this by describing important aspects of legal research in a heterogeneous information environment, including mobility, reading, annotation, link following, and writing practices, and their general implications for design. We then discuss how our work with a user community and an evolving e-book prototype allowed us to examine tandem issues of usability and utility, and to redesign an existing e-book user interface to suit the needs of law students. The study caused us to move away from the notion of a stand-alone reading device and toward the concept of a document laptop, a platform that would provide wireless access to information resources, as well as support a fuller spectrum of reading-related activities.

Keywords
e-books, information appliances, field study, physical and digital information resources, legal education, legal research, digital libraries.

1 INTRODUCTION
Dynabook, Alan Kay's imagined dynamic electronic medium [9], is often cited as an inspiration for current work on electronic books. Now we have real products (e.g., RCA's REB1100): tablet computers for reading. We wanted to explore the future of such devices as interfaces to today's heterogeneous libraries, and to understand how they could support research activities typical among knowledge workers who use such information resources. We used XLibris [15], analytic reading software running on a pen tablet computer, as an example of an e-book, and legal research as a discipline in which analytic reading software could bring value.

Figure 1. Our prototype e-book at the outset of the study: the XLibris analytic reading software running on a Fujitsu pen tablet computer.

When we began our study, we conceived of ideal e-books as stand-alone reading devices that would be based on a paper document metaphor, would use pen interaction and freeform digital ink, and would support research activities by using readers' annotations as indications of their interests [16]. Indeed, XLibris (shown in Figure 1) was just such a working prototype, a good foil for our investigation of the kinds of "beyond paper" functionality an e-book could bring to bear on legal research.

We chose legal education for our study for several reasons. Early discussions with our research colleagues suggested that attorneys read and mark up documents from diverse sources, from physical as well as digital collections [4]. Furthermore, attorneys' reading is purposeful: they use such documents in subsequent work activities, including writing and collaboration [1]. These practices have their roots in legal education.

By studying legal education, we wanted to do more than characterize current practice; we also wanted to evaluate existing designs and to generate new design insights about both the usability and utility of XLibris. What would an e-book for legal research actually do for its users? We started by observing how paper and online resources are used in legal research; how law students read, annotate, organize, and use their materials; and the role of legal research and reading in a larger scope of activities like writing and collaboration. We then worked with this potential user community to evaluate the effectiveness of XLibris. Our hope is that the insights we gathered from our study of legal research, reading, writing, and collaboration will be more generally applicable in other educational and research settings, as well as in legal work.

In the following sections, we describe the study, our observations, and the broad implications for design. We conclude with the details of the redesign that emerged from our observations.

2 STUDY DESCRIPTION
Our field study focused on an annual Moot Court competition at a first-tier law school. Moot Court is the venue in which students practice advocacy as they argue hypothetical cases. Controversial issues (cases heard in appellate-level courts that suggest unresolved points of law) typically form the basis for Moot Court problems. We chose Moot Court for our field study because it is characterized by one of the major legal digital library providers as the closest experience a law student has to preparing for and engaging in real courtroom advocacy.

The students start their research from a Transcript of Record that lays out the facts of the case and cites relevant prior cases. From their research, they retrieve, print, read, and annotate cases, and consult secondary materials such as law journals. The materials they collect are organized and used to produce a brief, a document of constrained length that presents a position on the legal issues; these materials form the basis for the oral arguments as well.

Our study took place over three months, from the distribution of the Transcript of Record to the final competition. We interviewed and observed the two faculty members who organized the competition, and nine second- and third-year students who participated in the Moot Court competition. The interviews were open-ended and semi-structured; we observed the students and faculty interacting with online resources, meeting to coordinate writing and research tasks, and attending classes. Interviews and observations took place where the participants normally worked, in settings like the law library, shared on-campus offices, and dorm rooms. We audio-recorded and transcribed interviews, photographed salient aspects of the work settings, and took field notes of our observations.

After the competition ended, we collected the students' and faculty members' documents: source materials they had drawn on for their research and the briefs they wrote for the competition. We analyzed these documents to understand patterns of annotation.

To assist us in the design process, interviews concluded with a demonstration of the evolving XLibris prototype. We used documents drawn from the Moot Court research in the demonstration; these familiar documents made the system more transparent and more compelling to the law students. The students used the device briefly, commented on specific design elements and functionality, and reflected more generally on how it might be useful in their work.

3 RESULTS
We divide our findings into two parts: a characterization of current practice and its implications for the design of an information appliance for obtaining, reading, annotating, and organizing digital materials in the legal domain; and design insights about the existing technology, XLibris. This way, the characterization and insights are not constrained by limitations of existing technology, yet the technology can grow and be shaped by the field study.

To set the stage, we first discuss work settings and mobility. We then discuss legal research, the traditional province of online legal information services and law publishers. Reading, organizing, and annotating documents form a core set of topics when we talk about working with legal documents. Finally, we examine reading-related practices, writing, and collaboration.

Our XLibris-related findings concern the usability and utility of the device itself, in addition to more general reactions that the demonstration provoked. We then use these findings about legal work and the prototype as a basis for the redesign.

3.1 Work settings and mobility
The law students worked in a variety of settings, each of which offered them access to unique local resources. Settings included the law library, especially work areas such as the computer/printer room and carrels; shared offices, mostly associated with the law reviews published at the law school; dorm rooms and homes; classrooms; other law libraries and other on-campus libraries; and ad hoc, unpredictable work sites, especially those used while traveling.

Important localized physical resources at these settings included computers (shared and personal) that provided access to services and applications, network connections, printers (in particular, the free printers in the law library; these are supplied by the Lexis and Westlaw legal information services), paper books and journals (in personal collections or at the library), knowledgeable people (e.g., faculty, Lexis and Westlaw representatives, librarians, peers), comfortable, quiet places to read and write, places to store materials across uses, and places to spread out materials during a work session.

Distributed local resources give rise to an increased need for mobile work habits. Thus, like other students and an increasing number of professionals, law students may work in many places on any given day, carrying materials with them in heavy backpacks. For example, even if the students do legal research from their home computers, they frequently re-retrieve materials at the law library so they may be printed without cost.

This resource-centered mobility also causes the students' documents to be decentralized. When the documents are electronic, they are not necessarily stored on a server, but rather stored on computers' local disks or transported on floppy disks (which affects the way the students share files as well). Paper documents may be kept at various locations (for example, on-campus lockers, dorm rooms, library carrels) or taken along in the students' backpacks.

From settings and mobility to design
What are the design consequences of the students' current patterns of resource-centered mobility? Mobility may mean many things: a student may move from a desk to a nearby comfortable chair to read a case; a student consulting a legal treatise in the library may go to a different floor to re-retrieve a case and print it; or a student may be traveling, and use another university's law library. For these mobile work situations, a portable reading device can form the bridge between paper resources and electronic ones; the hardware may be brought to where associated resources are available. This physical/digital bridge function, coupled with the variability of network access in the various work settings, underscores the importance of wireless access and the need to consider how materials get on and off the device, a finding that confirms Jones et al.'s study [8].

Our observations also echo those of Elliott [5], who interviewed judges about their use of online legal resources. She reports that having Lexis terminals in chambers was not convenient, as it forced the judge "to excuse himself in the middle of a trial, go to his chambers, dial into Lexis/Nexis, print out a citation, and take it back to the bench to read."

The variability of settings also suggests that the ability to keep materials in place across sessions cannot be taken for granted; for example, shared tables or workspaces in the law library (besides personal carrels) must be cleared off. Thus our design should take advantage of the computer's ability to readily support casual, persistent layouts of many documents in a workspace.

3.2 Legal research
Legal research is the process of gathering materials together to meet the needs of the task at hand. As Sutton [17] points out, the criteria of whether a given item is relevant are based on use, not on abstract notions of topicality or on a set of rules governing relevance judgment. Legal research might involve consulting books or law review articles in the law library, grabbing last semester's textbook off a bookshelf, searching an online service, pursuing specific case citations, or other strategies for amassing relevant background material for brief writing and oral advocacy.

Students began their Moot Court research by identifying a key case or cases from the Transcript of Record. Is this artificial? Not really; subsequent conversations with attorneys and the reflections of the students themselves showed that much legal research starts with some knowledge of an important case or cases in an area. If this information is not available, research is often initiated by consulting a treatise (an encyclopedic reference that summarizes issues and case law in a specific area) or a law review article. Students used expressions like finding a "launching pad," "raid[ing] the cases," or "looking for a thread to pull." One student described her experience doing research in the books as a summer associate:

"The first firm I worked at was very pro-books... I was pretty much taught to look in Witkin first... It's a California law treatise. ...give you case citations, and then you can narrow your search that way. Once you have a case on point, or a case kind of on point, you can Key Cite it or something."

Important resources for legal research continue to be a mix of physical and digital materials; paper books still play an important role in legal research, but they are used in concert with online legal services. These materials may be maintained institutionally (such as treatises, law books, and journals available through the law library) or they may be part of a student's personal collection of books and files.

The scope of these resources creates a rhythm of paper and electronic research that is seamless to the students. Though the alternation between print and electronic forms may seem inefficient, it is a very fast and effective way for the students to pull together the collection of materials that they actually want to read. Thus a treatise, a paper book, may lead to a specific case citation; this case may then be retrieved directly from an online service, printed, read, and marked up. Students may type in case citations from this printout to pursue them further. Later, they may retype portions of the case to use as quotes in their written briefs or as notes that will contribute to their writing. Alternatively, once an electronic version of a case is located, the student may choose to Shepardize the case or follow some embedded electronic links to precedent cases.

The students print not only to read, but also to perform triage [13]: to sort through the cases themselves, or through long lists of potentially interesting cases that they have generated by performing a search or by Shepardizing a case. In fact, several of the students saved Shepards or Key Cite lists with their case printouts, and some of these lists were annotated as part of their triage.

Once the students got started on their research, they continued to use citations as points of departure. They used citations in two different ways. First, citations are obvious links to precedent. If a student sees the same citation over and over again, referenced from multiple cases, it may well be valuable for current work. Students kept lists of cases to look for next or annotated case printouts with proposed follow-up citations. As we saw in an earlier study [12], these potentially interesting references may not be pursued, given limitations of time and attention.

Second, they evaluated citations, not just by looking at them, but by investigating whether they are still "good law" (whether, for example, they have been overturned) and whether they are sufficiently authoritative. In US law the authority of a precedent is intimately tied to its currency, to the court in which it was decided, and to whether or not it is a good fact match to the current case. Reverse citation facilities such as Lexis's Shepards or Westlaw's Key Cite are typically used to determine whether a case is still good law. Both services provide annotated "back links" to, and metadata for, the cases that have cited the case in question. Thus this activity also amounted to link following.

Does this tendency to follow and evaluate explicit citations mean that the students do not perform full-text searches? They describe searching in a few situations, although it seems to be of secondary importance if a starting point (a key case, a treatise, or even a comprehensive law review article) is at hand. Full-text search is used to identify a key case at the outset if none is available, to check breadth and coverage if there is time, and to look for very recent cases, as Shepards links may lag court decisions by as much as six months [2].

From legal research to design
We observed four trends in the students' legal research: the continued importance and authority of books; research strategies that are link-based rather than search-based; the advantage of electronic resources for case evaluation; and alternating use of print and electronic resources.

These trends suggest a set of design consequences for a legal e-book. First, there is a need to support hypertext links. Much legal research involves pursuing explicit citations. Furthermore, citations accrete influence; citations that are seen by the researcher many times are likely to be pursued. Second, we must consider the role of paper in the use of such a device; paper resources and paper practices will persist even given the availability of e-books and electronic services. Finally, there is significant potential value to "waving a wand over a case citation," quickly consulting reverse citations to check the validity of the case being read. This last design consequence highlights the importance of good metadata, and the associated benefits of making the metadata readily available to guide on-the-spot research decisions (for example, "Should I follow this link?" or "Is this case still good law?").

These trends also suggest that digital libraries co-exist with more traditional resources, and must accommodate work in this hybrid environment. Ignoring this reality in the design of interfaces to digital libraries may reduce their usefulness in the real world.

3.3 Reading and annotation
To frame a discussion of reading and annotation, it is important to examine first the form of the materials; the form necessarily shapes and constrains any subsequent activities like the ability to mark on documents or carry them around.

For the most part, the participants in our study read printed documents, with the notable exception of cases they retrieved and skimmed while they were writing. Working with documents on paper allowed the students to read opportunistically, carrying materials around and finding places to read that were relatively free from distractions. We noted that the students and faculty members often printed more than they read.

Do readers then read once, and move on? Not necessarily. Legal practice demands re-reading. A first read may be a scan or a quick skim to see if the material is even relevant, or to get a general idea of what is covered. Subsequent readings may be careful: students reported reading documents they were using in Moot Court front-to-back. Or they may involve skipping to the relevant sections of the document: students reported skipping a case's dissent, or using the headnotes (human-authored indices to the specific points of law covered in a case) to navigate into the body of a case. Re-reading during writing may be very quick, just to remind the student of what is in the materials, or to find a particular passage of interest.

When readers read for a specific activity like writing a legal brief, they are likely to mark on the documents they are reading. Annotation is a prevailing practice, although some readers annotated far more than others and one did not annotate at all.

Re-annotation is also common, concomitant with the kind of re-reading we describe above. If a student is apt to make long, extensive annotations on the first round, he or she may cull them during subsequent readings, either by marking them again, or by using emphasis marks like asterisks to set them off from the original markings (e.g., Figure 2). For example, one student said:

"You're supposed to use the highlighting to tell you to go back and read it. But sometimes I highlight as I read, and so I have to go back and mark things so that I remember to definitely go back to that. So that's two iterations I guess."

Figure 2. A reader's asterisk. The reader plans to revisit material associated with this mark.

It is notable that the students can articulate their own marking strategies. Many annotators are unaccustomed (and sometimes unable) to explain their annotation practices [12]. However, the law students had reflected on their own annotation practices and those of their peers. For example, one student said of her own marking strategy:

"Usually with the cases, I try to write 'facts,' 'issue'... I'll write 'issue' next to the issue. It replaces briefing. Book briefing [an outlining technique] is just kind of just writing the issue, then you've got your facts and your holding... Some people do the holding in blue and they'll do the issue in pink. I don't do that."

This reflection helps demonstrate the importance of the practice to many of the students. As we have seen in other settings, the students' annotation strategies may vary in ways that are related to the form of the materials. For example, books sometimes receive different treatment than printouts because they are regarded as long-term references.

Annotations may also reflect disciplinary practices [11]. As we demonstrated in the quote above, for the law students annotating carries over from the case analysis techniques they learn in class. These techniques give students a way of looking at legal decisions in terms of, for example, issues (what points of law are addressed) and holdings (what the decision's import is). Students sometimes use annotations to identify such aspects of a case, and will even revise them as they continue to read the materials. Such structured interpretation leads to a greater use of annotation tactics like color-coding than may be found in other disciplines.

In spite of the well-developed disciplinary marking strategies that the students exhibit and are able to discuss, these strategies are neither fixed over time nor consistent. They change throughout and beyond schooling as the reader becomes more efficient and comfortable with legal work and unnecessary or unworkable complexities (multi-color coding schemes) are discarded. Sometimes the exigencies of the situation dictate a change in strategy (for example, a favorite pen is left at home). Finally, annotations are crucially tied to situational factors. For example, a lawyer reported that she annotates the same case differently for different uses. The students confirm this. Annotations they make in class that capture what the professor is saying are considered more important than (and are readily distinguishable from) marks the students have made in their own readings of the material. As Wolfe pointed out in her study of how students value the annotations of experts, the source of the interpretation is very important [18].

From reading and annotation to design
Reading is a difficult activity to support; it is hard to improve on paper and pen. We began our study with the assumption that freeform ink annotations are important to analytic readers. This assumption held as we worked with law students. What are further design implications of these reading and annotation practices?

First, re-reading seems like a good target for computational support; readers are already inventing strategies to help themselves read. They skip, scan, or skim through the documents using their own marks or the properties of the documents themselves.

Furthermore, annotations vary in importance and usefulness; the marks readers make on documents have different functions and different degrees of value. Yet annotations are a fundamental technique for signaling what is important in a document. They help readers re-read the material (focusing on the most pertinent and useful portions of a longer document), and they guide readers' future use of source materials in associated activities like writing. Techniques may be applied to find particular kinds of annotations (e.g., see Figure 2) or to use collective marks across different annotations [12].

3.4 Organizing
The documents that students gathered to use in the Moot Court competition are organized in different ways, particular to how they will be used. When research begins, documents may be organized by the court that heard the case, by the date of the case, or simply in a stack.

Once the students began writing their briefs, they tended to move more toward a writing-based organization, creating categories like "pro" cases (cases that support the student's side of the argument), "con" cases (cases that present counterarguments), and cases with matching facts. Cases with a close fact match are particularly interesting, in that they must in some way be addressed. Upon

encountering a "con" case very similar to the Moot Court problem, one student wrote, "Deal with this!" on top of the document. These writing-based organizations were frequently implemented as piles on the floor and desktop: organizations that could be rapidly accessed and changed. Naturally, if the student worked in multiple venues, this organization could not be preserved across sessions; the student needed to recreate it each time, in each place.

One student described the shift from research to writing this way:

"I find for me I like to print them out. It's sort of like the old way of using index cards. If I print them out, I can staple it and then I can throw them in different piles, and then the piles can change. And then when I'm writing, I look at the pile. I bring it up to here, and then I start writing based on those cases. And that I know in my mind: this is argument 1; this is argument 2; this is argument 3."

This tendency toward activity-based organizational strategies suggests that there is no canonical way of organizing materials. Documents are organized and re-organized to meet the needs of the task at hand and to reflect the student's understanding. It was difficult to ascertain whether the collection's structure would become more uniform after the task ended, as the students did not keep Moot Court documents that they felt they could re-retrieve. Most acknowledged this would change when they became practitioners, and indeed subsequent conversations with practicing attorneys revealed that files within a firm or office may have a standard structure to facilitate sharing.

From organization to design
The design implications of these organizational strategies are threefold. First, it would be advantageous to provide readers with a way of organizing materials across sessions. Many work settings demand that loosely organized documents (e.g., piles on the floor) be picked up and put away, disrupting the organization. Second, as Mander et al. also observed, the difference between organizing materials for research and writing, as well as the difference between transient and archival structures, points to the need for multiple ways to organize documents [10]. Finally, the notebooks offer evidence that there is a perceived advantage to keeping documents "in one place." Moving from physical systems of organizing documents to electronic tools may be perceived as a disadvantage, however, since computer screens offer far less space and flexibility for spreading out and manipulating working papers. On the other hand, the capability to switch among multiple ways of organizing information on the computer should accommodate not only different working situations, but different cognitive styles as well.

3.5 Writing
The written brief is a key element of the Moot Court competition and indeed of certain kinds of legal work. One prevailing strategy for brief-writing was to outline the important parts of the argument and find the right quotation (a passage or key statement of a rule of law that has come out of a precedent) to illustrate or support the argument. This strategy entails finding the quotation, either from the student's annotations, or simply from memory; transcribing the relevant quote; and correctly citing the case, conformant with the prescribed citation form. One student said:

"I looked at the cases, and looked at the different modes of analysis that the opinions used and there seemed to be two types of tests. One, the Lee v. Weisman analysis, and the other one is the Lemon analysis, which is a three step test, a three prong test. So I kind of used that to structure my outline, and then tried to plug in cases and quotes that I could use for each part of the test."

The students often reported that they would like to perform a word search to re-locate the quotes that they have already read (and potentially annotated) when they are writing. One student gave the following account:

"after having read through all these various cases, I remember a citation to Brown vs. Board of Education, in which they used a quotation about education being the most important function of government. I couldn't off the top of my head remember which case it was in, or where in that particular case it was located. So essentially what that involved is me going through every single case looking through all my annotations to find this one quotation that I remember having read."

The mechanics of legal writing also involve creating the citation form that conforms to legal practice; each authority that is cited must be in "blue book" form, "that bringer of much grief, the Uniform System of Citation" [2].

Additional research is also provoked by writing: the students find they need to fill in holes, or to check the authority of a particular case. This interleaved research is different from the student's initial reading. When research is spurred by writing, the student may check the new materials quickly, without printing them. One student said:

"In the course of writing my brief or paper or whatnot, if there's something I need to look up, a citation or a particular case that's referenced, and I don't want to go to the library to do it or print it out, then I'll do it here, and just kind of flip back and forth between my word processor and Westlaw. Just to transcribe what I need."

As we noted earlier, organizing materials to support the writing process is different than organizing them during the earlier phases of research. The students cite the importance of having several key cases to hand, and arraying them on the floor or tabletop in piles for ready access as they are writing their briefs.

From writing to design
What is the key design insight we can take away from the law students and their writing? What seems clear is that they switch back and forth between activities; these switches may be frequent and fluid. They look for new sources to fill in holes in their arguments. They search for quotes in familiar materials as well. Our initial concept of a dedicated reading appliance may not be the best way of addressing this fluid shift in activities. In other words, a document laptop may be more appropriate than a dedicated reader. We discuss this notion further in our accounts of the students' reactions to the XLibris prototype.

4 XLIBRIS REDESIGN
Our interviews and observations of the law students suggested several directions of redesign for XLibris. We redesigned the XLibris interface in terms of functionality and also in terms of appearance; the latter flowed from the need to convey the former. Broadly speaking, our redesign focused on navigation, link-following, and re-visiting previously-read documents; on retrieval; on annotation; and on managing, organizing, and categorizing documents.


4.1 Navigation
Our observations of the students' link following and navigation among gathered documents caused us to re-examine navigation controls. The initial design was based loosely on the Web model of navigation: a "previous" button and a "next" button moved the reader back and forward through the document views. Unlike the Web model, backtracking did not prune branches, but kept all visited documents in the same queue. The intent was to compensate for known deficiencies of the Web browsing model (see [3] for a discussion of alternatives), but the result was equally confusing. The crux of the problem was that the same controls were used for different purposes: short-term exploratory backtracking ("Where does this link go? Oh, no, that wasn't it.") and managing or reading several documents simultaneously. We saw both kinds of activity in our observations: quick skimming of documents (perhaps the destination of a citation link), and working with multiple documents.

Figure 3. A fragment of a page: a link target is identified by the placement of the "Back" button.

In the redesign, we split the two: backtracking from a link traversal was accomplished by a dynamically-added "Back" button, positioned near the target of the link (Figure 3); multi-document use was accommodated by providing an overlaid semi-transparent menu of recent views, from which the reader could select at random (Figure 4). Each item in the menu corresponds to a different document or a specific view (e.g., workspace, clippings, etc.). Changes to the views and documents are reflected in these thumbnail representations. Thus readers can switch easily among the recently-used views and documents without resorting to more elaborate navigation. Additional evaluation is required to know whether this menu should contain only document views, or other organizational views (see below) as well.

4.2 Retrieval
The original XLibris design included a way of launching queries based on freeform digital ink annotations of the underlying text [6]. Because the students tended to follow citations and use Shepards rather than running queries to find useful documents, search was more typically used to find documents that had already been read. Thus we replaced the experimental feature with a keyword search dialog that worked over the documents already loaded into XLibris. A simple ranking algorithm that preferred passages with many different keywords over those with just a few was used to order matching passages.

To handle the kind of reference-following the students preferred, we added the ability to traverse standard http references to Web-based materials. This capacity to load documents into XLibris incrementally changed the flavor of the interaction from purely reading-oriented to a hybrid of reading and browsing. Again the effect was to reduce the cost, cognitive and temporal, of transitions between the various activities that make up document-centered research.
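The ranking rule just described is simple enough to state in a few lines. The sketch below is one plausible reading of it (the class and method names are hypothetical illustrations, not XLibris source code): passages matching more distinct query keywords sort ahead of passages matching fewer, and repeated occurrences of a keyword do not raise a passage's score.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch: order passages by the number of *distinct*
    // query keywords they contain, highest first.
    class PassageRanker {

        static List<String> rank(List<String> passages, Set<String> keywords) {
            List<String> ranked = new ArrayList<>(passages);
            ranked.sort(Comparator.comparingInt(
                    (String p) -> distinctMatches(p, keywords)).reversed());
            return ranked;
        }

        private static int distinctMatches(String passage, Set<String> keywords) {
            // Tokenize the passage once; membership tests are then constant time.
            Set<String> tokens = new HashSet<>(
                    Arrays.asList(passage.toLowerCase().split("\\W+")));
            int count = 0;
            for (String kw : keywords) {
                if (tokens.contains(kw.toLowerCase())) {
                    count++;  // each keyword counts once, however often it occurs
                }
            }
            return count;
        }
    }

Under this rule a passage mentioning three different query terms once each outranks a passage repeating a single term many times, which matches the stated preference for breadth of keyword coverage over raw frequency.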
4.3 Annotation
The original XLibris "Clippings" view showed the reader only the annotated portions of documents, thereby allowing her to revisit those passages easily. Furthermore, the list could be filtered by color. While this design supported some forms of re-retrieval, it could not accommodate some of the students' practice.

We saw many examples of reviewing and re-annotating previously-read and marked passages (see Figure 2 and the accompanying discussion). A student would read a document and mark it up; subsequently she would review the annotations, and identify the more important ones with additional marks. In XLibris, this required many steps: to place a second mark on a passage shown in the Clippings view, a reader had to navigate to the page containing that passage, mark on it, and then move back to the Clippings view, an awkward and distracting operation. We redesigned this interface by allowing readers to mark on the clippings directly, without moving to the containing page; a separate button was provided for moving to the document. The Clippings view and the document view were coordinated: marks made on one view were available in the other. Thus clippings became miniature windows onto parts of documents of particular interest to the reader.

Another shortcoming we identified in the Clippings view was its automatic nature: the view would update when a new mark was added or an old one erased. This made it difficult to collect ideas and references in a persistent manner. We therefore added a new view that was modeled after a yellow legal note pad. Readers could clip passages from the Clippings view to the notebook; once there, they could position and resize the views as desired. Passages pasted into the notebook behaved similarly to the Clippings view: marks made on them affected the document, and vice versa.

This interface (Figure 5) was designed to accommodate the transitions from retrieval to organization to writing: passages found while reading could be clipped and organized thematically in the notebook view, and then could be copied through the system clipboard to the word processing program of choice. Both the text and the image could be pasted in; the canonical "blue book" citation for the source would be included automatically.

Figure 5. A notebook page with three clipped passages.
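The coordination between the document view, the Clippings view, and the notebook described above is essentially a shared-model (observer) arrangement: every view renders from, and writes to, one annotation store, so a mark made anywhere appears everywhere. The sketch below illustrates that idea under that assumption; the interface and class names are hypothetical, not taken from the XLibris implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of view coordination via a shared annotation model.
    interface AnnotationListener {
        void markAdded(String passageId, String mark);
    }

    class AnnotationModel {
        private final List<AnnotationListener> views = new ArrayList<>();

        void addListener(AnnotationListener view) {
            views.add(view);
        }

        // Called by whichever view the reader marked on; every registered
        // view (document page, clipping, notebook) redraws from the same
        // shared state, so the mark is immediately visible in all of them.
        void addMark(String passageId, String mark) {
            for (AnnotationListener view : views) {
                view.markAdded(passageId, mark);
            }
        }
    }

Registering the document view, the Clippings view, and each notebook passage as listeners on one such model would yield exactly the behavior reported: marks made on a clipping affect the document, and vice versa.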

4.4 Organization
We observed various approaches to organizing documents, reflecting differences in cognitive style and activity. The original design included an overview of document thumbnails which the user could drag to form rudimentary piles. If more documents were added to the workspace, a new page of thumbnails was added to the overview. The reader could not add new pages manually, and it was not possible to group documents thematically by moving them to different pages, similar to student notebooks.

Our redesign accommodated these observations: we provided a way to add blank pages to the workspace and to drag documents between those pages. This allowed readers to organize documents thematically: one worksheet for pro cases, another for con, for example. This ability to move objects between worksheets also applied to the notebook view: the reader could now move relevant clippings to appropriate pages, further increasing XLibris's capability to organize information. We also allowed readers to mark directly on the workspace pages to label them. Finally, we added a sortable metadata list view that allows readers to group documents by such aspects as court and date, a feature useful in the initial triage stage. Of course other metadata could be used for other kinds of documents.
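At its core, a sortable metadata list of this kind reduces to re-ordering the document list by metadata fields. The sketch below illustrates grouping by court and then by date; the types and field names are hypothetical stand-ins for whatever metadata a given collection carries, not part of the XLibris codebase.

    import java.time.LocalDate;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical sketch: case documents carrying the two metadata
    // fields mentioned above, court and decision date.
    class CaseDocument {
        final String title;
        final String court;
        final LocalDate decided;

        CaseDocument(String title, String court, LocalDate decided) {
            this.title = title;
            this.court = court;
            this.decided = decided;
        }
    }

    class MetadataListView {
        // Re-sorting the backing list is all that "grouping by court,
        // then by date within each court" requires of the view.
        static void sortByCourtThenDate(List<CaseDocument> docs) {
            docs.sort(Comparator.comparing((CaseDocument d) -> d.court)
                                .thenComparing(d -> d.decided));
        }
    }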
5 REACTIONS TO THE XLIBRIS PROTOTYPE
At the close of each semi-structured interview we demonstrated XLibris to gauge general reactions to reading on an e-book, and to better match the design to legal research. Early demonstrations were of the original system; later demonstrations showed the redesigned XLibris. The demonstrations were hands-on; the students and faculty members held the device, marking on documents, turning pages, and trying out other features. Because XLibris had Moot Court documents, the students were able to see familiar content in the system. They readily engaged with the prototype, reading, turning pages, and marking. We explained why we were showing them the prototype, and elicited as much discussion from them about it as possible.

Reactions were generally positive: students would start reading and annotating right away, using the paper document metaphor in the way we expected. They confirmed the desirability of a mobile device. They did, however, have some important questions about the relationship between XLibris (and e-books) and the technologies they currently use for reading and research.

First, they wanted to understand the relationship between XLibris, paper, and books. Many of them asked if they could print the documents displayed on XLibris. They were not always sure why they would want to print them, but they were certain that this capability was necessary. On the flip side, many of them told us that they still used "the books" for some of their research, and expressed skepticism about the ultimate utility of XLibris in book-oriented research. After demonstrating to us how difficult he found online statutes to use, one student explained that:

"I have a difficult time one just coming up with [keywords] ... Then when you do get something, it's not what you're looking for. I think it's because ... there's a geography to statutes that you don't have with cases ... that lend themselves to having hardcopy, simply because of the way it's broken down into various titles, and you can easily go to where you need to go."

Second, they asked about how XLibris would interact with their PCs. The questions centered on how other activities would interleave with reading on the device. How would the results of their online research get onto the device? Would they be able to cut and paste quotes from the documents in XLibris to the one they were writing in Microsoft Word? In short, they were acutely aware of the overhead an e-book might add to their current work.

Figure 6. Example configuration of a document laptop, a Fujitsu Lifebook B-Series pen mini-notebook, rotated to display a document in portrait mode.

Finally, they wondered about the relationship between XLibris and their laptops. Many of them had already expressed frustration with their laptops, complaining about their weight, bulk, and durability. Now we were introducing a second computer-like device. One student confided that she would not want to carry both. Others asked if they could perform normal computer work on the pen tablet computer, for example, getting their email, or using a word processor with a keyboard.

The implications of these questions are far reaching. Again, the concept of a document laptop (Figure 6) seems to win out over that of a large dedicated e-book. Reading is interleaved with other activities. This observation leads us to emphasize the ability to read more effectively on existing hardware, and to take advantage of its form factor. Transitions into and out of XLibris (e.g., pasting quotes from sources into a word processor) are easier if the reading and writing applications are on the same computer.

Figure 4. Semi-transparent overlay showing several recent views (1, 3, 5) and documents (2, 4).

Recent trends toward lighter hardware and wireless peripherals may make it easier to combine the advantages of the tablet and laptop form factors in the same device.

6 CONCLUSION
Surely the shift from e-book to document laptop represents the greatest sea-change in our thinking about legal work. When we began this study, we assumed we would be introducing a dedicated reading device. Now we believe that the advantages afforded by such a device are offset by the need to interleave other activities with reading.

We are less likely to think of reading devices as peripherals tethered to a stationary PC. Reading is so opportunistic, and paper is such a flexible medium, that it seems inappropriate to tie legal work to a place and time. Wireless access to materials may be just the "in" that makes a document laptop a useful and desirable piece of technology.

Our observations of legal research left us with two important insights. First, we cannot underestimate the importance of the notion of a starting place, one that might easily be a paper treatise. Second, we saw that link following is at least as important as the ability to perform broad queries. A document representation that includes links, and functionality that implements link traversal, now seems essential.

Through our observations, we came away with three compelling scenarios for using a document laptop to perform legal research: (1) immediate access to current legal materials; (2) the ability to re-retrieve familiar materials; and (3) the ability to suspend and resume interrupted work that involves many documents.

Reading and annotation were the original terms of engagement for XLibris. It would seem like there is little more to be said about these two areas. Yet we have seen new styles of working with annotations (e.g., outlining styles that use short points amplified by extracted quotations), and are investigating different reading phenomena (e.g., re-reading).

We also found opportunities to revisit issues of navigation within and among documents, and to explore additional ways of managing and organizing documents and passages.

In short, the field study produced exciting insights and possibilities for e-books. It also introduced new questions and issues about how a document appliance-turned-laptop will function in legal work, and provided additional evidence for the hybrid nature of document collections. Future designs of information appliances, e-books, and other interfaces to digital libraries must consider the simultaneous use of paper and digital documents and the fluid transitions between them.

7 ACKNOWLEDGMENTS
We thank Elizabeth Churchill for her help with some of the field work, Peter Hodgson for his graphic design and Photoshop wizardry, and Kei Tanaka for debugging. We thank Jim Baker for supporting this research.

8 REFERENCES
1. Adler, A., Gujar, A., Harrison, B., O'Hara, K. and Sellen, A. A diary study of work-related reading: design implications for digital reading devices, in Proceedings of CHI 98 (Los Angeles, CA, April 1998), ACM Press, 241-248.
2. Berring, R.C. Legal Information and the Search for Cognitive Authority. UC Berkeley School of Law, Public Law and Legal Theory Working Paper No. 99-1, September 1999. Available at http://papers.ssrn.com/paper.taf?abstract_id=184050
3. Bieber, M. and Wan, J. Backtracking in a multiple-window hypertext environment, in Proceedings of ECHT '94 (Edinburgh, UK, September 1994), ACM Press, 158-166.
4. Blomberg, J., Suchman, L., and Trigg, R. Reflections on a Work-Oriented Design Project. Human-Computer Interaction 11, 3 (1996), 237-265.
5. Elliott, M. Digital Library Design for Organizational Usability in the Courts. Available on the web at http://edfu.lis.uiuc.edu/allerton/95/s3/elliott.html. Retrieved on Jan 9, 2001.
6. Golovchinsky, G., Price, M.N., and Schilit, B.N. From reading to retrieval: freeform ink annotations as queries, in Proceedings of SIGIR '99 (Berkeley, CA, August 1999), ACM Press, 19-25.
7. Hill, W.C. and Hollan, J.D. Edit Wear and Read Wear, in Proceedings of CHI 92 (Monterey, CA, April 1992), ACM Press, 3-9.
8. Jones, M., Rieger, R., Treadwell, P. and Gay, G. Live from the stacks: user feedback on mobile computers and wireless tools for library patrons, in Proceedings of ACM Digital Libraries 2000 (San Antonio, TX, June 2000), ACM Press, 95-102.
9. Kay, A. and Goldberg, A. Personal Dynamic Media, Computer 10, 3 (March 1977), 31-41.
10. Mander, R., Salomon, G., and Wong, Y.Y. A pile metaphor for supporting casual organization of information, in Proceedings of CHI 92 (Monterey, CA, April 1992), ACM Press, 627-634.
11. Marshall, C. Toward an ecology of hypertext annotation, in Proceedings of ACM Hypertext '98 (Pittsburgh, PA, June 1998), ACM Press, 40-49.
12. Marshall, C.C., Price, M.N., Golovchinsky, G., and Schilit, B.N. Introducing a digital library reading appliance into a reading group, in Proceedings of ACM Digital Libraries 99 (Berkeley, CA, August 1999), ACM Press, 77-84.
13. Marshall, C.C. and Shipman, F.M. III. Spatial hypertext and the practice of information triage, in Proceedings of Hypertext 97 (Southampton, UK, April 1997), ACM Press, 124-133.
14. Poon, A., Weber, K., and Cass, T. Scribbler: A Tool for Searching Digital Ink, in CHI 95 Conference Companion (Denver, CO, May 1995), ACM Press, 252-253.
15. Schilit, B.N., Golovchinsky, G., and Price, M.N. Beyond paper: supporting active reading with free form digital ink annotations, in Proceedings of CHI 98 (Los Angeles, CA, April 1998), ACM Press, 249-256.
16. Schilit, B.N., Price, M.N., Golovchinsky, G., Tanaka, K., and Marshall, C.C. As we may read: the reading appliance revolution. IEEE Computer 32, 1 (January 1999), 65-73.
17. Sutton, S.A. The role of attorney mental models of law in case relevance determinations: an exploratory analysis. JASIS 45, 3 (1994), 186-200.
18. Wolfe, J. Effects of annotations on student readers and writers, in Proceedings of ACM Digital Libraries 2000 (San Antonio, TX, June 2000), ACM Press, 19-26.

Mapping the Interoperability Landscape for Networked Information Retrieval
William E. Moen
School of Library and Information Sciences, Texas Center for Digital Knowledge, University of North Texas, P.O. Box 311068, Denton, Texas 76203
940-565-3563
[email protected]

ABSTRACT
Interoperability is a fundamental challenge for networked information discovery and retrieval. Often treated monolithically in the literature, interoperability is multifaceted and can be analyzed into different types and levels. This paper discusses an approach to mapping the interoperability landscape for networked information retrieval as part of an interoperability assessment research project.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries - standards, systems issues, user issues.

General Terms
Standardization

Keywords
Interoperability; networked information discovery and retrieval; Z39.50; testbeds.

1. INTRODUCTION
The ability of two information systems to communicate, execute instructions, share data, or otherwise interact is a fundamental requirement in the networked environment. Typically these interactions are subsumed under the term interoperability. In the traditional library automation environment, and more recently in the digital library context, we have recognized the complexity of linking information retrieval systems. Yet there are increased expectations for seamless, transparent, and reliable access to and sharing of information in and between digital libraries. Increasingly, the literature identifies interoperability as one of the fundamental problems facing networked information discovery and retrieval (NIDR) [3].

This paper outlines a preliminary and evolving framework for addressing interoperability. Mapping the NIDR interoperability landscape is part of a broader research and demonstration project for assessing interoperability. The U.S. federal Institute of Museum and Library Services awarded a National Leadership Grant to the School of Library and Information Sciences and Texas Center for Digital Knowledge at the University of North Texas to establish a research and demonstration Z39.50 interoperability testbed [5].

A map or conceptual framework of interoperability allows us to situate our focal activities, since not all types of interoperability will be assessed through the initial testbed. By identifying the multiple factors that threaten interoperability, we can control some of those factors in the testbed while acknowledging that subsequent phases of the research can address other factors. For example, the initial Z39.50 interoperability testbed focuses on three types or levels of interoperability: protocol syntax level, protocol service level, and semantic level. Each of these levels, and particularly the semantic level, is multifaceted. Lynch and Garcia-Molina describe semantic interoperability as a "grand challenge" research problem [3]. The testbed focuses on some aspects of semantic interoperability.

Similar to Paepcke et al. [1], we suggest that multiple levels and categories of interoperability can be identified, and these may differ depending on the application. Too often the literature treats interoperability monolithically or simply from a system perspective (i.e., the level of two systems interacting). Mapping the interoperability landscape can help focus attention on specific interoperability problems and assessment methodologies. We suggest further that the degree of interoperability between information systems may depend on the distance between the communities whose information systems attempt to interact. Further, it is ultimately the user who benefits when systems interoperate, and we propose that user assessments of interoperability should be factored into the methodology.

2. DEFINITIONS
In the context of the networked environment, a number of definitions of interoperability surface [2, 6]. Miller goes beyond a system orientation and describes a more expansive perspective on interoperability. He suggests multiple aspects of the concept, including technical, semantic, inter-community, and legal interoperability [4]. Within the context of digital libraries and networked information retrieval, we assume the following operating principle: systems will interoperate. Miller's approach, however, underscores the principle that not only systems but also organizations will need to interoperate. This brings attention to various environmental factors that can affect interoperability.

3. INTEROPERABILITY FACTORS
A number of diverse factors challenge interoperability:
- Multiple and disparate operating and IR systems
- Multiple protocols
- Multiple metadata schemes
- Multiple data formats
- Multiple languages and character sets
- Multiple vocabularies, ontologies, and disciplines.
Our map of interoperability will account for this variety of factors.

4. COMMUNITY AND DOMAIN CONTEXTS FOR INTEROPERABILITY
The context of "information communities" provides a way to frame the challenges of achieving interoperability. The diversity of factors above may be reduced within a particular community. For cross-catalog information retrieval in traditional libraries, the diversity listed above is radically reduced: data in the catalogs are relatively homogeneous, and there is commonality in the metadata scheme for structuring the records. The challenges to achieving interoperability in a virtual catalog application may be less than in cases where one crosses community boundaries. For example, if a user queries repositories of library bibliographic records and a museum's object records concurrently through Z39.50, the diversity of metadata schemes and vocabularies increases.

Our preliminary map proposes the following NIDR communities:
- Focal community NIDR: factors affecting interoperability are minimized (e.g., libraries)
- Extended community NIDR: increased diversity (e.g., cultural heritage information held by libraries and museums)
- Extra community NIDR: factors affecting interoperability are maximized (e.g., libraries interacting with geospatial repositories).

Challenges to NIDR interoperability increase (and likely the costs of achieving interoperability increase) as one moves further outside a focal community. Figure 1 illustrates potential interaction among communities.

Another useful perspective for the interoperability conceptual map is in terms of domains. There may be some overlap here with the perspective of communities, but addressing differences in domains can highlight factors such as vocabularies and ontologies. Our mapping identifies the following two categories:
- Intra-domain/discipline
- Extra-domain/discipline.

As in the case of communities, the challenges to interoperability within a domain may be less than between domains. Semantic differences may present major barriers to interoperability.

Within a community or domain, relative homogeneity reduces interoperability challenges. Heterogeneity increases as one moves outside a focal community/domain, and interoperability is likely more costly and more difficult to achieve.

[Figure 1. Information Retrieval Among/Across Communities: overlapping focal communities (e.g., libraries, archives, museums, natural history museums, geospatial repositories) within extended communities (e.g., cultural heritage, geospatial) and an extra-community region.]

5. SUMMARY
The work on mapping the interoperability landscape is in its initial phase. The resulting map situates our interoperability assessment research in that landscape and provides points of reference to other researchers for assessing and overcoming interoperability problems. Ultimately, this map may assist in realistically assessing the challenges and costs of making real the promise of seamless and transparent access (i.e., interoperability) within the networked environment.

6. ACKNOWLEDGMENTS
We thank the U.S. federal Institute of Museum and Library Services for support of the interoperability testbed and the papers resulting from our research.

7. REFERENCES
[1] Paepcke, A., et al. Interoperability for digital libraries worldwide. Communications of the ACM 41 (April 1998), 33-43.
[2] Lynch, C. Interoperability: The standards challenge for the 1990s. Wilson Library Bulletin 67 (March 1993), 38-42.
[3] Lynch, C. and Garcia-Molina, H. Interoperability, scaling, and the digital libraries research agenda: A report on the May 18-19, 1995 IITA digital libraries workshop. (1995). Available URL: http://www-diglib.stanford.edu/diglib/pub/reports/iita-dlw/
[4] Miller, P. Interoperability: What it is and why should I want it. Ariadne, 24. Available URL: http://www.ariadne.ac.uk/issue24/interoperability/intro.html
[5] Moen, W.E. Realizing the vision of networked access to library resources: An applied research and demonstration project to establish and operate a Z39.50 interoperability testbed. (February 2000). Available URL: http://www.unt.edu/zinterop
[6] Preston, C.M. and Lynch, C.A. Interoperability and conformance issues in the development and implementation of the government information locator service (GILS). In W.E. Moen and C.R. McClure, The Government Information Locator Service (GILS). Syracuse, NY: School of Information Studies, Syracuse University (1994).

Distributed Resource Discovery: Using Z39.50 to Build Cross-Domain Information Servers
Ray R. Larson
School of Information Management and Systems, University of California, Berkeley
(510) 642-6046
[email protected]

ABSTRACT
This short paper describes the construction and application of cross-domain information servers using features of the standard Z39.50 information retrieval protocol [10]. We use the Z39.50 Explain database to determine the databases and indexes of a given server, then use the SCAN facility to extract the contents of the indexes. This information is used to build "collection documents" that can be retrieved using probabilistic retrieval algorithms.

Keywords
Distributed Information Retrieval, Cross-Domain Resource Discovery, Distributed Search

1. INTRODUCTION
Information seekers must be able to identify information resources that are pertinent to their needs. Today they are also required to have the knowledge and skills to navigate those resources, once identified, and extract relevant information. The widespread distribution of recorded knowledge across the network landscape of the World Wide Web is only the beginning of the problem. The reality is that the repositories of recorded knowledge on the Web are only a small part of an environment with a bewildering variety of search engines, metadata, and protocols of very different kinds and of varying degrees of completeness and incompatibility. The challenge is not only to decide how to mix, match, and combine one or more search engines with one or more knowledge repositories for any given inquiry, but also to have a detailed understanding of the endless complexities of largely incompatible metadata, transfer protocols, and so on.

As Buckland and Plaunt [1] have pointed out, searching for recorded knowledge in a digital library environment involves three types of selection:
1. Selecting which library (repository) to look in;
2. Selecting which document(s) within a library to look at; and
3. Selecting fragments of data (text, numeric data, images) from within a document.

In the following discussion we focus on the first type of selection, that is, discovering which digital libraries are the best places for the user to begin a search. Our approach to the second selection problem has been discussed elsewhere [6, 7].

Distributed information retrieval has been an area of active research interest for many years. Distributed IR presents three central research problems that echo the selection problems noted by Buckland and Plaunt. These are:
1. How to select appropriate databases or collections for search from a large number of distributed databases;
2. How to perform parallel or sequential distributed search over the selected databases, possibly using different query structures or search formulations, in a networked environment where not all resources are always available; and
3. How to merge results from the different search engines and collections, with differing record contents and structures (sometimes referred to as the collection fusion problem).

Each of these research problems presents a number of challenges that must be addressed to provide effective and efficient solutions to the overall problem of distributed information retrieval. These problems have been approached in a variety of ways by different researchers focusing on different aspects of retrieval effectiveness, and on construction of the index resources required to select and search distributed collections [2, 3, 4, 5, 8, 9].

In this paper we present a method for building resource discovery indexes that is based on the Z39.50 protocol standard instead of requiring adoption of a new protocol. It does not require that remote servers perform any functions other than those already established in the Z39.50 standard.

2. Z39.50 FACILITIES
The Z39.50 information retrieval protocol is made up of a number of "facilities", such as Initialization, Search, and Retrieval. Each facility is a logical group of services (or a single service) to perform various functions in the interaction between a client (origin) and a server (target). The facilities that we will be concerned with in this paper are the Explain facility and the Browse facility.

2.1 The Explain Facility
The Z39.50 Explain facility permits the client to obtain information about the server implementation, including information on databases supported, attribute sets used (an attribute set specifies the allowable search fields and semantics for a database), diagnostic or error information, record syntaxes, and information on defined subsets of record elements that may be requested from the server (called element sets). The server (optionally) maintains a database of Explain information about itself and may maintain Explain databases for other servers. The Explain database appears to the client as any other database, and uses the Z39.50 Search and Retrieval facilities to query and retrieve information from it. There are specific attributes, search terms, and record syntaxes defined in the standard for the Explain database to facilitate interoperability among different server implementations.

2.2 Z39.50 Browse Facility
As the name of this facility implies, it was originally intended to support browsing of the server contents, specifically the items extracted for indexing the databases. The single service in the Browse facility is the Scan service. It is used to scan an ordered list of terms (subject headings, titles, keywords, text terms, etc.) drawn from the database. Most implementations of the Scan service directly access the contents of the indexes on the server and return requested portions of those indexes as an ordered list of terms along with the document frequency for each term.

3. IMPLEMENTATION
Our implementation relies on these two Z39.50 facilities to derive information from Z39.50 servers (including library catalogs, full-text search systems, and digital library systems) in order to build a GlOSS-like [5] index for distributed resources. The procedure followed is:
1. Search the Explain database to derive the server information about each database maintained by the server and the attributes available for searching that server.
2. For each database, we determine whether Dublin Core attributes are available, and if not, we select from the available attributes those most commonly associated with Dublin Core information.
3. For each of the attributes discovered, we send a sequence of Scan requests to the server and collect the resulting lists of index terms. As the lists are collected they are verified for uniqueness (since a server may allow multiple search attributes to be processed by the same index) so that duplication is avoided.
4. For each database an XML collection document is constructed to act as a surrogate for the database, using the information obtained from the server Explain database and the Scans of the various indexes.
5. A database of collection documents is created and indexed using all of the terms and frequency information derived above.

Probabilistic ranking methods [6] are used to retrieve and rank these collection documents for presentation to the user for selection, or for automatic distributed search of the most highly ranked databases using methods similar to [2] and [9]. Although the Z39.50 protocol has been used previously for resource discovery databases [8], in that work random samples of the records in the collection were used to build the indexes. The method described here gives the ability to use the server's own processing of its records in extracting the terms to be matched in the resource discovery index.
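To make the procedure concrete, the following Python sketch shows one way steps 1 and 3-5 could fit together. This is not the authors' implementation: the functions explain_databases and scan_index are hypothetical stubs standing in for real Z39.50 Explain and Scan transactions (here they return canned sample data so the sketch runs), and the collection document is a simple XML surrogate built with the standard library.

# Sketch of the collection-document procedure described above, under the
# stated assumptions. explain_databases() and scan_index() are hypothetical
# stand-ins for Z39.50 Explain and Scan transactions.
import xml.etree.ElementTree as ET

def explain_databases(server):
    # Step 1: would query the server's Explain database; stubbed sample data.
    return {"catalog": ["title", "creator", "subject"]}

def scan_index(server, database, attribute):
    # Step 3: would issue Scan requests; returns (term, doc_frequency) pairs.
    return [("digital libraries", 42), ("metadata", 17)]

def build_collection_documents(server):
    for database, attributes in explain_databases(server).items():
        # Step 4: an XML collection document acts as the database surrogate.
        doc = ET.Element("collection", {"server": server, "database": database})
        seen = set()
        for attribute in attributes:
            for term, freq in scan_index(server, database, attribute):
                if term in seen:
                    # Step 3: several attributes may map onto the same index,
                    # so duplicate terms are skipped.
                    continue
                seen.add(term)
                ET.SubElement(doc, "term",
                              {"attribute": attribute, "freq": str(freq)}).text = term
        yield ET.tostring(doc, encoding="unicode")

for xml_doc in build_collection_documents("z3950.example.org:210"):
    print(xml_doc)  # Step 5 would index these surrogates for probabilistic ranking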
As the lists are collected Digital Library: Cheshire II and the Berkeley Environmental they are verified for uniqueness (since a server may Digital Library. In Proceedings ASIS '99 (Washington, DC, allow multiple search attributes to be processed by the 1999), Information Today, 515-535. same index) so that duplication is avoided. [7]Larson, R. R. TREC Interactive with Cheshire II. Information Processing and Management, 37 (2001), 485- 4. For each database an XML collection document is constructed to act as a surrogate for the database using 505 theinformationobtained from theserver Explain [8]Lin, Y., et al. Zbroker: A Query Routing Broker for Z39.50 database and the Scans of the various indexes. Databases. In: CIKIv1 '99: (Kansas City, MO, 1999), ACM Press, 202-209. 5. A database of collection documents is created and indexedusingallofthetermsandfrequency [9]Xu, J. and Callan, J. (1998) Effective Retrieval with information derived above. Distributed Collections. In SIGIR '98, (Melbourne, Australia, 1998), ACM Press, 112-120. Probabilistic ranking methods[6] are used to retrieve and rank these collection documents for presentationtothe user for [10] Z39.50 Maintenance Agency. Information Retrieval selection, or for automatic distributed search of the most highly (Z39.50): Application Service Definition and Protocol ranked databases using method similar to [2] and [9]. Although Specification (ANSYNISO Z39.50-1995), Washington: the Z39.50 protocol has been used previously for resource Library of Congress, 1995

The Open Archives Initiative: Building a Low-Barrier Interoperability Framework
Carl Lagoze, Digital Library Research Group, Cornell University, Ithaca, NY, +1-607-255-6046, [email protected]
Herbert Van de Sompel, Digital Library Research Group, Cornell University, Ithaca, NY, +1-607-255-3085, [email protected]

ABSTRACT
The Open Archives Initiative (OAI) develops and promotes interoperability solutions that aim to facilitate the efficient dissemination of content. The roots of the OAI lie in the E-Print community. Over the last year its focus has been extended to include all content providers. This paper describes the recent history of the OAI: its origins in promoting E-Prints, the broadening of its focus, the details of its technical standard for metadata harvesting, the applications of this standard, and future plans.

Categories and Subject Descriptors
D.2.12 [Software Engineering]: Interoperability - interface definition languages.

General Terms
Experimentation, Standardization.

Keywords
Metadata, Interoperability, Digital Libraries, Protocols.

1. INTRODUCTION
In October 1999, a meeting was held in Santa Fe to discuss mechanisms to encourage the development of E-Print solutions. The group at this meeting was united in the belief that the ubiquitous interconnectivity of the Web provides new opportunities for the timely dissemination of scholarly information. The well-known physics archive run by Paul Ginsparg at Los Alamos National Laboratory has already radically changed the publishing paradigm in its respective field. Similar efforts, planned or already underway, promise to extend these striking changes to other domains.

The result of this meeting was the formation of the Open Archives Initiative (OAI) and the beginning of work on a framework facilitating the federation of content providers on the Web. Since that first meeting, the OAI has undergone a period of intensive development, both organizationally and technically. The original focus on E-Prints has broadened to encompass content providers from many domains (with an emphasis on what could be classified as "scholarly" publishing), a refined and extensively tested technical framework has been developed, and an organizational structure to support the Initiative has been established.

The name Open Archives Initiative reflects the origins of the OAI in the E-Prints community, where the term archive is generally accepted as a synonym for a repository of scholarly papers. Members of the archiving profession have justifiably noted the strict definition of an "archive" within their domain, with implications for preservation of long-term value, statutory authorization, and institutional policy. The OAI uses the term "archive" in a broader sense: as a repository for stored information. Language and terms are never unambiguous and uncontroversial, and the OAI respectfully requests the indulgence of the archiving community with this less constrained use of "archive".

Some explanation of the use of the term "Open" in OAI is also due. Our intention is "open" from the architectural perspective: defining and promoting machine interfaces that facilitate the availability of content from a variety of providers. Openness does not mean "free" or "unlimited" access to the information repositories that conform to the OAI technical framework. Such terms are often used too casually and ignore the fact that monetary cost is not the only type of restriction on use of information; any advocate of "free" information recognizes that it is eminently reasonable to restrict denial-of-service attacks or defamatory misuse of information.

This paper documents the development of the Open Archives Initiative and describes the plans for the OAI for the near future. At the time of completion of this paper (May 2001), the OAI has released the technical specifications of its metadata harvesting protocol. The substantial interest in the OAI heretofore indicates that the approach advocated by the OAI, establishing a low-entry and well-defined interoperability framework applicable across domains, may be the appropriate catalyst for the federation of a broad cross-section of content providers. The coming year will indicate whether this is true and whether the technical framework defined by the metadata harvesting protocol is a sufficient underpinning for the development of usable digital library services.

2. E-PRINT ORIGINS
The initial meeting and developments of the Open Archives Initiative are described in detail in an earlier paper [1]. This section summarizes that material from the perspective of current developments and events.

The origins of the OAI lie in increasing interest in alternatives to the traditional scholarly publishing paradigm. While there may be disagreements about the nature of the changes that need to take place, there is widespread consensus that change, perhaps radical change, is inevitable. There are numerous motivating factors for this change. An increasing number of scholarly disciplines, especially those in the so-called "hard sciences" (e.g., physics, computer science, life sciences), are producing results at an increasingly rapid pace. This velocity of change demands mechanisms for reporting results with lower latency times than the ones experienced in the established journal system. The ubiquity of high-speed networks and personal computing has created further consumer demand for use of the Web for delivery of research results. Finally, the economic model of scholarly publishing has been severely strained by rapidly rising subscription prices and relatively stagnant research library budgets.

In some scholarly fields, the development of alternative models for the communication of scholarly results, many in the form of online repositories of E-Prints, has demonstrated a viable alternative to traditional journal publication. Perhaps the best known of these is the physics archive (http://www.arxiv.org) run by Paul Ginsparg [2] at Los Alamos National Laboratory. There are, however, a number of other established efforts (CogPrints, http://cogprints.soton.ac.uk; NCSTRL, http://www.ncstrl.org; RePEc, http://netec.mcc.ac.uk/RePEc/), which collectively demonstrate the growing interest of scholars in using the Internet and the Web as vehicles for immediate dissemination of research findings. Stevan Harnad, among the most outspoken advocates of change, views such solutions as the first step in a radical transformation of scholarly publishing whereby authors reclaim control over their intellectual property and the publishing process [3].

The October 1999 meeting in Santa Fe of what was then called the UPS (Universal Preprint Service) was organized on the belief that interoperability among these E-Print archives was key to increasing their impact. (The Santa Fe meeting was sponsored by the Council on Library and Information Resources (CLIR), the Digital Library Federation (DLF), the Scholarly Publishing & Academic Resources Coalition (SPARC), the Association of Research Libraries (ARL), and the Los Alamos National Laboratory (LANL).) Interoperability would make it possible to bridge across, or federate, a number of archives. Issues related to interoperability are well described elsewhere [4]. It is sufficient to say here that establishing such a framework requires both technical and organizational agreements.

There are many benefits to federation of E-Print repositories. Scholarly endeavors are increasingly multi-disciplinary, and scholars should be able to move fluidly among the research results of various disciplines. Federation and interoperability also encourage the construction of innovative services. Such services might use information from various repositories and process that information to link citations, create cross-repository query interfaces, or maintain current-awareness services. The benefits of federation were demonstrated by earlier work joining the Los Alamos archive with the NCSTRL system [5], as well as in the UPS prototype [6] that was prepared for the Santa Fe meeting.

Interoperability has numerous facets, including uniform naming, metadata formats, document models, and access protocols. The participants at the Santa Fe meeting decided that a low-barrier solution was critical to widespread adoption among E-Print providers. The meeting therefore adopted an interoperability solution known as metadata harvesting. This solution allows E-Print (content) providers to expose their metadata via an open interface, with the intent that this metadata be used as the basis for value-added service development. More details on metadata harvesting and the OAI technical agreements are provided in Section 4.

The result of the meeting was a set of technical and organizational agreements known as the Santa Fe Convention. The technical aspects included the agreement on a protocol for metadata harvesting based on the broader Dienst protocol [7], a common metadata standard for E-Prints (the Open Archives Metadata Set), and a uniform identifier scheme. The organizational agreements coming out of the meeting were informal and involved the establishment of email lists for communication amongst participants, a rudimentary registration procedure, and the definition of an acceptable use policy for consumers of harvested metadata.

The Santa Fe meeting closed with enthusiasm among the participants to refine the agreements and pursue implementation and experimentation. Within a relatively short period the technical specifications were completed and posted on a publicly accessible web site (http://www.openarchives.org) along with other results of the Santa Fe meeting. A number of the participants quickly implemented the technical agreements and others experimented with a number of prototype services.

3. BEYOND E-PRINTS
Soon after the dissemination of the Santa Fe Convention in February 2000 it became clear that there was interest beyond the E-Print community. A number of other communities were intrigued by a low-barrier interoperability solution and viewed metadata harvesting as a means to this end.

In particular, strong interest came from the research library community in the US. Key members of this community met at the so-called Cambridge Meetings, sponsored by the Digital Library Federation and the Andrew W. Mellon Foundation, at Harvard University in the first half of 2000. The goal of the meetings was to explore the ways that research libraries could expose aspects of their collections to Web search engines. The participants, who included not only representatives from research libraries but also from the museum community, agreed that exposing metadata in a uniform fashion was a key step towards achieving their goal [8].

Additional evidence of the broad-based interest in the Santa Fe Convention came in the form of well-attended Open Archives Initiative workshops held at the ACM Digital Library 2000 Conference in San Antonio [9] and the European Digital Library Conference in Lisbon [10]. Participants at both of these workshops included publishers, librarians, metadata and digital library experts, and scholars interested in E-Prints.

Responding to this wider interest required a reconsideration of a number of decisions made by members of the Open Archives Initiative at the Santa Fe meeting and in the months following:
- The original mission of the OAI was focused on E-Print solutions and interoperability as a means of achieving their global acceptance. While this goal was still shared by the majority of participants, it was deemed too restrictive and possibly alienating to communities not directly engaged or interested in E-Prints.
- A number of aspects of the technical specifications were specific to the original E-Print focused mission and needed to be generalized for applicability to a broad range of communities.
- The credibility of the effort was uncertain due to the lack of organizational infrastructure. Communities such as the research library community are hesitant to adopt so-called "standards" when the stability of the organization responsible for promotion and maintenance of the standard is questionable.

The issue of organizational stability was addressed first. In August 2000, the DLF (Digital Library Federation) and CNI (Coalition for Networked Information) announced organizational support and resources for the ongoing OAI effort. This support was announced in a press release (http://www.openarchives.org/OAISC/oaiscpress000825.htm) that also contained the formation of an OAI steering committee with membership from a cross-section of communities and a level of international participation (the membership of the Steering Committee is listed in the Appendix in Section 7).

The OAI steering committee immediately addressed the task of compiling a new mission statement that reflected the broader scope. This mission statement is as follows:

The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program. The fundamental technological framework and standards that are developing to support this work are, however, independent of both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials. As a result, the Open Archives Initiative is currently an organization and an effort explicitly in transition, and is committed to exploring and enabling this new and broader range of applications. As we gain greater knowledge of the scope of applicability of the underlying technology and standards being developed, and begin to understand the structure and culture of the various adopter communities, we expect that we will have to make continued evolutionary changes to both the mission and organization of the Open Archives Initiative.

[Figure 1. A framework for multiple communities: a technical umbrella for practical interoperability, with metadata harvesting shown as the shared mechanism beneath the participating communities.]

This mission is illustrated in Figure 1, where the technical framework is designed as an umbrella that can be exploited by a variety of communities.

A key element of this mission statement is the formulation of the OAI as an experiment: "an organization and an effort explicitly in transition". The organization is well aware that the technical infrastructure it proposes, metadata harvesting, has yet to be proven as an effective means of facilitating interoperability, and that it is not yet clear what that interoperability will achieve. The organizational structure and strategy reflect a belief among the steering committee that "proving the concept" will require a delicate balance between stability and flexibility. Furthermore, there is strong consensus that the goals and scope of the OAI should be controlled: while interoperability is a wide-open area with many potential areas of investigation, the OAI should resist expanding its scope until its current technical goals are met and justified.

The Steering Committee also took steps to address the E-Print focus of the Santa Fe technical agreements and to fix other problems that were revealed in testing of those agreements. A technical committee was formed and a meeting organized at Cornell University in September 2000. The results of that meeting are reported in the next section.

4. TECHNICAL FRAMEWORK
The technical framework of the Open Archives Initiative is intended to provide a low-barrier approach to interoperability. The membership of the OAI recognizes that there are functional limitations to such a low-barrier framework and that other interoperability standards, for example Z39.50, address a number of issues in a more complete manner. However, as noted by Bill Arms [11], interoperability strategies generally increase in cost (difficulty of implementation) with an increase in functionality. The OAI technical framework is not intended to replace other approaches but to provide an easy-to-implement and easy-to-deploy alternative for constituencies or purposes different from those addressed by existing interoperability solutions.
As noted earlier, experimentation will prove whether such low-barrier interoperability is realistic and functional.

At the root of the technical agreement lies a distinction between two classes of participants:
- Data Providers adopt the OAI technical framework as a means of exposing metadata about their content.
- Service Providers harvest metadata from data providers using the OAI protocol and use the metadata as the basis for value-added services.

The remainder of this section describes the components of this technical framework. A theme carried through the framework and the section is the attempt to define a common denominator for interoperability among multiple communities while providing enough hooks for individual communities to address individual needs (without interfering with interoperability). More details on the technical framework are available in the Open Archives Metadata Protocol specification available at the OAI web site (http://www.openarchives.org/OAI/openarchivesprotocol.htm).

4.1 Metadata
The OAI technical framework addresses two well-known metadata requirements: interoperability and extensibility (or community specificity). These issues have been a subject of considerable discussion in the metadata community [12, 13]; the OAI attempts to answer them in a simple and deployable manner.

The requirement for metadata interoperability is addressed by requiring that all OAI data providers supply metadata in a common format: the Dublin Core Metadata Element Set [14]. The decision to mandate a common element set and to use the Dublin Core was the subject of considerable discussion in the OAI. One approach, which has been investigated in the research literature [15], is to place the burden on the consumer of metadata rather than the provider, tolerating export of heterogeneous metadata and relying on services to map amongst the representations. The OAI, however, is purposely outside the domain of strict research, and in the interest of easy deployment and usability it was decided that a common metadata format was the prudent decision.

The decision to use the Dublin Core was also the result of some deliberation. The original Santa Fe Convention took a different course, defining a metadata set, the Open Archives Metadata Set, with some functionality tailored for the E-Print community. The broadening of the focus of the OAI, however, forced reconsideration of this decision, and the alternative of leveraging the well-known and active work of the DC community in formulating a cross-domain set for resource discovery was chosen.

Those familiar with the Dublin Core will note that all fields in DC are optional. The OAI discussed requiring a number of DC elements in OAI records. While such a requirement might be preferable from the perspective of interoperability, the spirit of experimentation in the OAI persuaded the committee to keep all elements optional. The committee agreed that it would be desirable at this early stage to encourage metadata suppliers to expose DC metadata according to their own needs and thereby reveal a market of community-developed metadata practices.

It should be noted that the specific decision was to use unqualified Dublin Core as the common metadata set. This decision was made based on the belief that the common metadata set in OAI is explicitly purposed for coarse-granularity resource discovery. As discussed elsewhere [16], qualification of DC for the purpose of more detailed description (rather than simple discovery) is still an area of some contention and threatens to interfere with the goal of simple resource discovery. The OAI takes the approach of strictly separating simple discovery from community-specific description.

Community-specific description, or metadata specificity, is addressed in the technical framework by support for parallel metadata sets. The technical framework places no limitations on the nature of such parallel sets, other than that the metadata records be structured as XML documents, which have a corresponding XML schema for validation (as described in Section 4.2). At the time of completion of this paper (January 2001), initial steps have been taken to encourage the development of community-specific harvestable metadata sets. Representatives of the E-Print community have been working on a metadata set targeted at the E-Print community under the name EPMS. Representatives of the research library community have proposed a similar effort and there are calls for proposals from other communities (e.g., the museum community, Open Language Archives).
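As a concrete illustration of the mandated common format, the following sketch emits an unqualified Dublin Core metadata part for a record, with all elements optional and repeatable. This is an illustrative assumption rather than the OAI's normative encoding: only the Dublin Core element namespace (http://purl.org/dc/elements/1.1/) is a standard identifier here, and the enclosing container and schema wiring are simplified.

# Minimal sketch: an unqualified Dublin Core metadata part, built with the
# standard library. The container element is a simplification; only the DC
# element namespace is standard. Field values are taken from the sample
# record in Figure 2.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def dc_metadata(fields):
    # Providers include only the elements they have; all DC fields are
    # optional, and elements such as creator may repeat.
    container = ET.Element("metadata")
    for element, value in fields:
        ET.SubElement(container, f"{{{DC}}}{element}").text = value
    return ET.tostring(container, encoding="unicode")

print(dc_metadata([
    ("title", "Quantum slow motion"),
    ("creator", "Hug, M."),
    ("creator", "Milburn, G. J."),
    ("date", "1999-01-01"),
    ("type", "e-print"),
    ("identifier", "http://arXiv.org/abs/9901001"),
]))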
4.2 Records, Repositories, and Identifiers
The OAI technical framework defines a record, which is an XML-encoded byte stream that serves as a packaging mechanism for harvested metadata. A record has three parts:
- header: containing information that is common to all OAI records (it is independent of the metadata format disseminated in the record) and that is necessary for the harvesting process. The information defined in the header is the unique identifier for the record (described below) and a datestamp indicating the date of creation, deletion, or latest date of change in the metadata in the record.
- metadata: containing metadata in a single format. As noted in Section 4.1, all OAI data providers must be capable of emitting records containing unqualified DC metadata. Other metadata formats are optional.
- about: an optional container to hold data about the metadata part of the record. Typically, the "about" container could be used to hold rights information about the metadata, terms and conditions for usage of the metadata, etc. The internal structure of the "about" container is not defined by the protocol. It is left to individual communities to decide on its syntax and semantics through the definition of a schema.

A sample OAI record is shown in Figure 2.

[Figure 2. Sample OAI Record: an XML record with a header (identifier oai:arXiv:9901001, datestamp 1999-01-01), a Dublin Core metadata part (title "Quantum slow motion"; creators Hug, M. and Milburn, G. J.; date 1999-01-01; type e-print; identifier http://arXiv.org/abs/9901001), and an about part stating "Metadata may be used without restrictions".]

Metadata records are disseminated from Repositories, which are network-accessible servers of data providers. An OAI-conformant repository supports the set of OAI protocol requests defined in Section 4.4. Abstractly, repositories contain items, and each metadata record harvested from a repository corresponds to an item. (There is a many-to-one relationship of records to items, since metadata can be expressed in multiple formats.) The nature of an item (for example, what type of metadata is actually stored in the item, what type is derived on the fly, and whether the item includes the "full content" described by the metadata) is outside the scope of the OAI protocol. This admittedly indistinct nature of an item is intentional. The OAI harvesting protocol is meant to be agnostic as to the nature of a data provider: it supports those that have content with fixed metadata records, those that computationally derive metadata in various formats from some intermediate form or from the content itself, and those that are metadata stores or metadata intermediaries for external content providers.

As illustrated in Figure 2, each record has an identifier. The nature of this identifier deserves some discussion. Concretely, the record identifier serves as a key for extracting metadata from an item in a repository. This key, parameterized by a metadata format identifier, produces an OAI record. Since the identifier acts in this manner as a key, it must be unique within the repository; each key corresponds to metadata derived from one item. The protocol itself does not address the issues of inter-repository naming or globally unique identifiers. Such issues are addressed at the level of registration, which is described in Section 4.5.

The record identifier is expressly not the identifier of the item; the issue of identifiers for the contents of repositories is intentionally outside the scope of the OAI protocol. Undoubtedly, many clients of the OAI protocol will want access to the full content described by a metadata record. The protocol recommends that repositories use an element in metadata records to establish a linkage between the record and the identifier (URL, URN, DOI, etc.) of the associated item. The mandatory Dublin Core format provides the identifier element that can be used for this purpose.

4.3 Selective Harvesting
A protocol that only enabled consumers of metadata to gather all metadata from a data provider would be cumbersome. Imagine the transactions with large research libraries that expose the metadata of their entire catalog through such a protocol! Thus, some provision for selective harvesting, which makes it possible in the protocol to specify a subset of records to be harvested, is desirable. Selection, however, has a broad range of functionality. More expressive protocols include provisions for the specification of reasonably complete predicates (in the manner of database requests) on the information requested. The OAI decided that such high functionality was not appropriate for a low-barrier protocol and instead opted for two relatively simple criteria for selective harvesting.
- Date-based. As noted in Section 4.2, every record contains a datestamp, defined as "the date of creation, deletion, or latest date of modification of an item, the effect of which is a change in the metadata of a record disseminated from that item". Harvesting requests may correspondingly contain a date range for harvesting, which may be total (between two dates) or partial (either only a lower bound or an upper bound). This date-based harvesting provides the means for incremental harvesting. For example, a client may have a weekly schedule for harvesting records from a repository, and use the date-based selectivity to harvest only records added or modified since the last harvesting time.
- Set-based. The protocol defines a set as "an optional construct for grouping items in a repository for the purpose of selective harvesting of records". Sets may be used in harvesting requests to specify that only records within a specific grouping should be returned from the harvesting request (note that each item in a repository may be organized in one set, several sets, or no sets at all). Each repository may define a hierarchical organization of sets that can have several top-level nodes, each of which is a set. Figure 3 illustrates a sample set hierarchy for a fictional E-Print repository. The actual meaning of the sets is not defined within the protocol. Instead, it is expected that communities that use the OAI protocol may formulate well-defined set configurations, with perhaps a controlled vocabulary for set names, and may even develop mechanisms for exposing these to service providers. As experiments with the OAI protocol proceed, it will be interesting to see how communities exploit the set mechanism and whether it provides sufficient functionality.

[Figure 3. Sample Set Hierarchy: Institutions (Cornell University, Virginia Tech); Subjects (Computer Science, High Energy Physics).]
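Because the header is format-independent and carries exactly what a harvester needs for incremental, date-based harvesting, extracting it is straightforward. The following sketch assumes the simplified, namespace-free element names of the sample in Figure 2; real protocol responses qualify these names with XML namespaces.

# Sketch: pulling the harvesting-relevant fields out of an OAI record.
# Element names follow the sample record in Figure 2; namespace handling
# is omitted for clarity.
import xml.etree.ElementTree as ET

SAMPLE = """\
<record>
  <header>
    <identifier>oai:arXiv:9901001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata><title>Quantum slow motion</title></metadata>
</record>"""

def record_key(record_xml):
    # The identifier is the unique key within the repository; the datestamp
    # is what date-based selective harvesting compares against.
    header = ET.fromstring(record_xml).find("header")
    return header.findtext("identifier"), header.findtext("datestamp")

print(record_key(SAMPLE))  # -> ('oai:arXiv:9901001', '1999-01-01')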

Even with the provisions for selective harvesting, it is possible that clients will make harvesting requests of repositories that are large and burdensome to fulfill in a single response. Some other protocols make provision for such cases with the notions of state and result sets: a client explicitly opens a session, conducts transactions within that session, and then closes the session. Yet session maintenance is notably complex and ill-suited for protocols such as HTTP, which is intended as the carrier protocol for OAI requests and responses. Instead the OAI uses a relatively simple flow control mechanism that makes it possible to partition large transactions among several requests and responses. This flow control mechanism employs a resumption token, which is returned by a repository when the response to a harvesting request is larger than the repository may wish to return at one time. The client can then use the resumption token to make subsequent requests until the transaction is complete.

4.4 Open Archives Metadata Harvesting Protocol
The initial protocol that came out of the Santa Fe meeting was a subset of the Dienst protocol. While that subset protocol was functionally useful for metadata harvesting, aspects of its legacy context presented barriers to simple implementation. The current technical framework is built around a more focused and easier-to-implement protocol: the Open Archives Metadata Harvesting Protocol.

The Open Archives Metadata Harvesting Protocol consists of six requests, or verbs. The protocol is carried within HTTP POST or GET methods. The intention is to make it simple for data providers to configure OAI-conformant repositories by using readily available Web tools such as libwww-perl (http://www.ics.uci.edu/pub/websoft/libwww-perl/). OAI requests all have the following structure:
- base-url: the Internet host and port of the HTTP server acting as a repository, with an optional path specified by the respective HTTP server as the handler for OAI protocol requests.
- keyword arguments: a list of key=value pairs. At a minimum, each OAI protocol request has one key=value pair that specifies the name of the OAI protocol request.

Figure 4 shows the encoding of a sample OAI protocol request using both HTTP GET and POST methods. The request is the GetRecord verb, and the specific example requests the return of the record with identifier oai:arXiv:hep-th01 in dc (Dublin Core) format.

[Figure 4. Sample OAI Request Encoding.
GET request:
http://an.oa.org/OAI-script?verb=GetRecord&identifier=oai:arXiv:hep-th01&metadataPrefix=dc
POST request:
POST http://an.oa.org/OAI-script
Content-Length: 62
Content-Type: application/x-www-form-urlencoded
verb=GetRecord&identifier=oai:arXiv:hep-th01&metadataPrefix=dc]

The response to all OAI protocol requests is encoded in XML. Each response includes the protocol request that generated the response, facilitating machine batch processing of the responses. Furthermore, the XML for each response is defined via an XML schema [17-19]. The goal is to make conformance to the technical specifications as machine-verifiable as possible: a test program should be able to visit an OAI repository, issue each protocol request with various arguments, and test that each response conforms to the schema defined in the protocol for the response.

The remainder of this section summarizes each of the protocol requests.

4.4.1 GetRecord
This verb is used to retrieve an individual record (metadata) from an item in a repository. Required arguments specify the identifier, or key, of the requested record and the format of the metadata that should be included in the record.

4.4.2 Identify
This verb is used to retrieve information about a repository. The response schema specifies that the following information should be returned by the Identify verb:
- A human-readable name for the repository.
- The base URL of the repository.
- The version of the OAI protocol supported by the repository.
- The e-mail address of the administrator of the repository.
In addition to this fixed information, the protocol provides a mechanism for individual communities to extend the functionality of this verb. The response may contain a list of description containers, for which a community may define an XML schema that specifies semantics for additional description of the repository.

4.4.3 ListIdentifiers
This verb is used to retrieve the identifiers of records that can be harvested from a repository. Optional arguments permit selectivity of the identifiers, based on their membership in a specific set in the repository or on their modification, creation, or deletion within a specific date range.

4.4.4 ListMetadataFormats
This verb is used to retrieve the metadata formats available from a repository. An optional argument restricts the request to the formats available for a specific record.

4.4.5 ListRecords
This verb is used to harvest records from a repository. Optional arguments permit selectivity of the harvesting, based on the membership of records in a specific set in the repository or based on their modification, creation, or deletion within a specific date range.
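Taken together, the verbs and the resumption token suggest the shape of a harvester's main loop. The sketch below is a hypothetical client, not a normative one: the base URL is the fictional repository from Figure 4, and the XML response parsing is reduced to a stub.

# Hypothetical harvester loop: issue ListRecords, follow the resumptionToken.
# BASE_URL is the fictional repository from Figure 4; parse_records() is a
# stub where a real client would extract records and any resumption token.
import urllib.parse
import urllib.request

BASE_URL = "http://an.oa.org/OAI-script"

def oai_request(arguments):
    # Encode an OAI verb request as an HTTP GET, as in Figure 4.
    url = BASE_URL + "?" + urllib.parse.urlencode(arguments)
    with urllib.request.urlopen(url) as response:
        return response.read()

def parse_records(xml_bytes):
    # Stub: would return (records, resumption_token) from the XML payload.
    return [], None

def harvest(metadata_prefix="dc", from_date=None):
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Date-based selective harvesting (Section 4.3): only records
        # created, deleted, or modified since from_date.
        arguments["from"] = from_date
    while True:
        records, token = parse_records(oai_request(arguments))
        yield from records
        if not token:
            break  # no resumptionToken: the transaction is complete
        # Continue the partitioned transaction using the resumption token.
        arguments = {"verb": "ListRecords", "resumptionToken": token}

# Usage (requires a real OAI repository in place of the fictional BASE_URL):
for record in harvest(from_date="2001-01-01"):
    print(record)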
59 97 on their modification, creation, or deletion within a specific date attendees at the DLF sponsored Cambridge meetings. These range. solicitations led to a quite comprehensive and diverse testing community, organized around an alpha testers email list. The 4.4.6 ListSets complete list of alpha testers is shown in the Appendix in section This verb is used to retrieve the set structure in a repository. 8.Itincludesrepresentatives from the E-Print community, museums, research libraries, repositories of publisher metadata, 4.5 Data Provider Conformance and and collectors of web site metadata. In addition, the alpha test Registration included two rudimentary service providers who constructed The OAI expects that data providers will fall into three layers of search interfaces based on metadata harvested from the OAI- participation, each higher layer implying the preceding layer(s); conformant alpha testers. I)OAI-conformantThese are data providers who support the Three alpha-testers deserve special mention. Hussein Suleman at protocol definition. As stated earlier, conformance is testable Virginia Tech created and continuously updated a repository since there are XlvIL-schemas to validate all responses. No explorer that allowed alpha-testers to examine the compliance of doubt, the OAI will not be able to track every provider using their repositories to the most recent version of the protocol the protocol since use of it does not require any licensing or document. Simeon Warner (Los Alamos National Laboratory and registration procedure. arXiv.org) and Michael Nelson (University of Northern Carolina 2)0AI-registeredThese are data providers who register in an and NASA) did extensive proofreading of new versions of the OAI-maintained database, which will be available through protocol document before release to the alpha-group. the OAI web site. Registration will entail that the data The results of these tests are quite encouraging. First, the protocol provider gives a BASE-URL, which the registration software specification has passed through a number of revisions and has will then use to test compliance. been vetted extensively for errors and ambiguities. Second, 3) OAI-namespace-registeredThese are data providers who virtually all the testers remarked at the ease of implementation and choose to name their records in conformance with an OAI conceptual simplicity of the protocol. naming convention for identifiers. Names that follow this 6. THE ROAD AHEAD convention have the following three components: At the beginning of 2001 the Open Archives Initiative began the a) oai A fixed string indicating that the name is in the next phase of its work, public deployment and experimentation OAI namespace. with the technical framework. To initiate this process two public b) - An identifier for the repository that is meetings were scheduled. A US meeting was be heldin unique within the OAI namespace. Washington DC on January 23. Registration for this meeting was closed when the maximum of 140 participants was reached. The c) - An identifier unique within the respective participantsrepresented a wide variety of communities. A repository. European meeting was scheduled for February 26 in Berlin. An example of a name that uses this naming scheme is: Participants at these meetings heard a complete overview of the oai:arXiv:hep-th01 goals of the OAI and the particulars of the technical framework. 
5. TESTING AND REFINEMENT
Participants of the September 2000 technical meeting at Cornell developed a rough outline of the technical framework. However, the task of normalizing and putting the framework into the form of a specification was undertaken at Cornell University, where Open Archives Initiative activities are coordinated. Continuous feedback from the alpha-test group that implemented consecutive versions of the protocol played an important role in this activity. Participants in the alpha-test group were solicited from both the original Santa Fe Convention E-Print community and from the attendees at the DLF-sponsored Cambridge meetings. These solicitations led to a quite comprehensive and diverse testing community, organized around an alpha-testers email list. The complete list of alpha testers is shown in the Appendix in section 8. It includes representatives from the E-Print community, museums, research libraries, repositories of publisher metadata, and collectors of web site metadata. In addition, the alpha test included two rudimentary service providers who constructed search interfaces based on metadata harvested from the OAI-conformant alpha testers.

Three alpha testers deserve special mention. Hussein Suleman at Virginia Tech created and continuously updated a repository explorer that allowed alpha testers to examine the compliance of their repositories with the most recent version of the protocol document. Simeon Warner (Los Alamos National Laboratory and arXiv.org) and Michael Nelson (University of North Carolina and NASA) did extensive proofreading of new versions of the protocol document before release to the alpha group.

The results of these tests are quite encouraging. First, the protocol specification has passed through a number of revisions and has been vetted extensively for errors and ambiguities. Second, virtually all the testers remarked on the ease of implementation and conceptual simplicity of the protocol.

6. THE ROAD AHEAD
At the beginning of 2001 the Open Archives Initiative began the next phase of its work: public deployment and experimentation with the technical framework. To initiate this process two public meetings were scheduled. A US meeting was held in Washington, DC on January 23. Registration for this meeting was closed when the maximum of 140 participants was reached. The participants represented a wide variety of communities. A European meeting was scheduled for February 26 in Berlin. Participants at these meetings heard a complete overview of the goals of the OAI and the particulars of the technical framework.

The meetings also provided an opportunity for the development of communities within the Open Archives framework. Communities may take the form of groups of data providers that exploit the extensibility of the Open Archives harvesting protocol to expose purpose-specific metadata, or the development of targeted services. These meetings were meant to be a "kick off" for an extended period of experimentation (at least one year) with the harvesting protocol. The OAI intends during this period to keep the protocol as stable as possible. This experimentation phase is motivated by the belief that the community needs to fully understand the functionality and limits of the interoperability framework before considering major changes or expansion of functionality.

During this experimentation period, three large research and implementation projects, in the U.S. and in Europe, plan to experiment with the functionality of the OAI technical framework:
1. National Science Digital Library12. NSDL is a multi-participant project in the US funded by the National Science Foundation with the goal of creating an online network of learning environments and resources for science, mathematics, engineering, and technology education. Our group at Cornell is funded under the core infrastructure portion of the NSDL.

10 http://www.sfxit.com/OpenURL
11 http://sfxserv.rug.ac.be:8888/public/xref/
12 http://www.ehr.nsf.gov/ehr/due/programs/nsdl/

In the context of the OAI alpha testing we experimented with a harvesting service that will later form the basis for other services in our NSDL infrastructure. Future plans include working with our partners at the San Diego Supercomputer Center to harvest metadata via OAI, post-process and normalize it for storage in a metadata repository, and then make that metadata searchable using the SDLIP [23] protocol.
2. Cyclades13. This is a project funded by the European Commission with partners in Italy, Germany, and the U.K. The main objective of Cyclades is to develop advanced Internet-accessible mediator services to support scholars both individually and as members of communities when interacting with large interdisciplinary electronic archives. Cyclades plans to investigate the construction of these services on the Open Archives foundation.
3. Digital Library Federation Testbed. As a follow-up to the Cambridge meetings (described earlier in this paper) a meeting14 of interested project participants was convened in October 2000 by The Andrew W. Mellon Foundation to explore technical, organizational, and resource issues for broad-based metadata harvesting and to identify possible next steps. The result of this meeting was the commitment by a number of institutions (research libraries and other information-based organizations) to expose metadata from a number of collections through the Open Archives technical infrastructure, and to experiment with services that use this metadata.

These projects promise to expose what is possible using the OAI framework and how it might be changed or expanded.

We are intrigued by the period of experimentation that lies ahead and encouraged by the widespread interest in the Open Archives Initiative. In the context of this enthusiasm, we are aware of the need to be circumspect in our approach towards the OAI and its goals. As Don Waters, a member of the OAI Steering Committee, pointed out, the technical proposals of the OAI include a number of assumptions about issues not yet fully understood:
- What is the value of a common metadata set?
- What are the interactions of a native metadata set with the minimal, conventional set?
- What are the incentives and rewards for institutions and organizations for participating in such a framework?
- What are the intellectual property issues vis-à-vis harvestable metadata?
- Will this technical framework encourage new models of scholarly communication?

These, and many other questions, are all in need of thorough examination. Too often members of the digital library community have made casual statements that "interoperability is good", "metadata is important", and that "scholarly publishing is changing". At the minimum, we hope that the OAI will create a framework for serious investigation of these issues and lay the foundation for more informed statements about the issues critical to the success of our field.

7. Appendix A: OAI STEERING COMMITTEE
Names are followed by affiliations:
Caroline Arms (Library of Congress)
Lorcan Dempsey (Joint Information Systems Committee, UK)
Dale Flecker (Harvard University)
Ed Fox (Virginia Tech)
Paul Ginsparg (Los Alamos National Laboratory)
Daniel Greenstein (DLF)
Carl Lagoze (Cornell University)
Clifford Lynch (CNI)
John Ober (California Digital Library)
Diann Rusch-Feja (Max Planck Institute for Human Development)
Herbert Van de Sompel (Cornell University)
Don Waters (The Andrew W. Mellon Foundation)

8. Appendix B: ALPHA TEST SITES
The following institutions participated in the alpha testing of the technical specifications:
CIMI Museum Consortium
Cornell University
Ex Libris
Los Alamos National Laboratory
NASA
OCLC
Old Dominion University
UKOLN Resource Discovery Network
University of Illinois Urbana-Champaign
University of North Carolina
University of Pennsylvania
University of Southampton
University of Tennessee
Virginia Tech
9. ACKNOWLEDGMENTS
Our thanks to the many members of the Open Archives Initiative community (steering committee, technical committee, alpha testers, and workshop attendees) who have participated in the development of the OAI over the last year. Thanks to Naomi Dushay for her careful reading and editing. Support for work on the Open Archives Initiative comes from the Digital Library Federation, the Coalition for Networked Information, the National Science Foundation (through grant no. IIS-9817416) and DARPA (through contract no. N66001-98-1-8908).

13 http://cinzica.iei.pi.cnr.it/cyclades/
14 http://www.clir.org/diglib/architectures/testbed.htm

10. REFERENCES
[1] Van de Sompel, H. and C. Lagoze. The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine, 2000. 6(2). http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html.
[2] Ginsparg, P. Winners and Losers in the Global Research Village. The Serials Librarian, 1997. 30(3/4): p. 83-95.
[3] Harnad, S. Free at Last: The Future of Peer-Reviewed Journals. D-Lib Magazine, 1999. 5(12). http://www.dlib.org/dlib/december99/12harnad.html.
[4] Paepcke, A., et al. Interoperability for Digital Libraries Worldwide. Communications of the ACM, 1998. 41(4): 33-42.
[5] Halpern, J.Y. and C. Lagoze. The Computing Research Repository: Promoting the Rapid Dissemination and Archiving of Computer Science Research. In Digital Libraries '99, The Fourth ACM Conference on Digital Libraries. 1999. Berkeley, CA.
[6] Van de Sompel, H., T. Krichel, and M.L. Nelson. The UPS Prototype: An Experimental End-User Service across E-Print Archives. D-Lib Magazine, 2000.
[7] Lagoze, C. and J.R. Davis. Dienst: An Architecture for Distributed Document Libraries. Communications of the ACM, 1995. 38(4): 47.
[8] A New Approach to Finding Research Materials on the Web. 2000, Digital Library Federation. http://www.clir.org/diglib/architectures/vision.htm.
[9] Anderson, K.M., et al. ACM 2000 Digital Libraries: Proceedings of the Fifth ACM Conference on Digital Libraries, June 2-7, 2000, San Antonio, Texas. 2000, New York: Association for Computing Machinery. xiii, 293.
[10] Borbinha, J. and T. Baker. Research and Advanced Technology for Digital Libraries: 4th European Conference, ECDL 2000, Lisbon, Portugal, September 18-20, 2000: Proceedings. Lecture Notes in Computer Science 1923. 2000, Berlin; New York: Springer. xvii, 513.
[11] Arms, W.Y. Digital Libraries. Digital Libraries and Electronic Publishing. 2000, Cambridge, MA: MIT Press.
[12] Lagoze, C., C.A. Lynch, and R. Daniel Jr. The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata. 1996, Cornell University Computer Science. http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell/TR96-1593.
[13] Lagoze, C. Accommodating Simplicity and Complexity in Metadata: Lessons from the Dublin Core Experience. In Seminar on Metadata. 2000. Archiefschool, Netherlands Institute for Archival Education and Research, The Hague.
[14] Weibel, S. The Dublin Core: A Simple Content Description Format for Electronic Resources. NFAIS Newsletter, 1998. 40(7): p. 117-119.
[15] Chang, C.-C.K. and H. Garcia-Molina. Mind Your Vocabulary: Query Mapping across Heterogeneous Information Sources. In International Conference on Management of Data and Symposium on Principles of Database Systems. 1999. Philadelphia: ACM: 335-346.
[16] Lagoze, C. and T. Baker. Keeping Dublin Core Simple. D-Lib Magazine, 2001. 7(1).
[17] Fallside, D.C. (ed.). XML Schema Part 0: Primer. 2000, World Wide Web Consortium. http://www.w3.org/TR/xmlschema-0/.
[18] Thompson, H.S., et al. XML Schema Part 1: Structures. 2000, World Wide Web Consortium. http://www.w3.org/TR/xmlschema-1/.
[19] Biron, P.V. and A. Malhotra. XML Schema Part 2: Datatypes. 2000, World Wide Web Consortium. http://www.w3.org/TR/xmlschema-2/.
[20] Van de Sompel, H. and P. Hochstenbach. Reference Linking in a Hybrid Library Environment, Part 3: Generalizing the SFX Solution in the "SFX@Ghent & SFX@LANL" Experiment. D-Lib Magazine, 1999. 5(10). http://www.dlib.org/dlib/october99/van_de_sompel/10van_de_sompel.html.
[21] Van de Sompel, H. and P. Hochstenbach. Reference Linking in a Hybrid Library Environment, Part 1: Frameworks for Linking. D-Lib Magazine, 1999. 5(4). http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt1.html.
[22] Van de Sompel, H. and P. Hochstenbach. Reference Linking in a Hybrid Library Environment, Part 2: SFX, a Generic Linking Solution. D-Lib Magazine, 1999.
[23] Paepcke, A., et al. Search Middleware and the Simple Digital Library Interoperability Protocol. D-Lib Magazine, 2000. 5(3). http://www.dlib.org/dlib/march00/paepcke/03paepcke.html.

Enforcing Interoperability with the Open Archives Initiative Repository Explorer

Hussein Suleman
Department of Computer Science, Virginia Tech
Blacksburg, VA, USA
+1 540 231-3615
[email protected]

ABSTRACT
The Open Archives Initiative (OAI) is an organization dedicated to solving problems of digital library interoperability by defining simple protocols, most recently for the exchange of metadata. The success of such an activity requires vigilance in specification of the protocol as well as standardization of implementation. The lack of standardized implementation is a substantial barrier to interoperability in many existing client/server protocols. To avoid this pitfall we developed the Repository Explorer, a tool that supports manual and automated protocol testing. This tool has had a significant impact on simplifying development of interoperability interfaces and increasing the level of confidence of early adopters of the technology, thus exemplifying the positive impact of exhaustive testing and quality assurance on interoperability ventures.

Categories and Subject Descriptors
D.2.12 [Software Engineering]: Interoperability - Interface definition languages.
C.2.2 [Computer-Communication Networks]: Network Protocols - Protocol verification.

General Terms
Reliability, Experimentation, Standardization, Verification.

Keywords
Interoperability, Protocol, Testing, Validation.

1. CONTEXT
The work of the Open Archives Initiative (OAI) was initiated in connection with a meeting of representatives of various electronic pre-print and related archives (e.g. NDLTD, arXiv, NCSTRL) in Santa Fe, USA in October 1999 [2]. From this meeting emanated an agreement among the archivists to support a common set of principles and a technical framework to achieve interoperability. This started the process of defining standards, broadened to include digital libraries other than pre-print archives. A technical working group provided input into the process of creating and testing those standards under widely differing conditions. This process culminated in the announcement of a new protocol for interoperability, the Open Archives Initiative Protocol for Metadata Harvesting [3], in January 2001. These standards are now being disseminated to all interested parties who wish to adopt a low-cost approach to interoperability, with support from a growing number of members of the Open Archives community.

2. MOTIVATION
After the inaugural meeting of the OAI in 1999, a handful of archivists began to implement the agreed-upon interoperability protocol at their distributed sites. This effort was immediately hampered by varying interpretations of the protocol specification. This was largely due to the difficulty of precisely specifying a protocol that would both be general and applicable to multiple domains. The client/server architecture chosen by the OAI led to a classic "chicken-and-egg" problem: since client implementations would need to interface correctly with server implementations, there was a low degree of confidence in the correctness of early implementations in each category. Coupled with this, even when clients and servers subscribed to the same interpretation, there was no high confidence that other client/server pairs would interoperate successfully.

As one approach to address these concerns, we developed a protocol tester that would allow a user to perform low-level protocol tests on a server implementation without the need for a corresponding client implementation. This now widely used software, the Repository Explorer, aids in standardizing the protocol understood by the various different archives subscribing to the OAI model of interoperability.

3. DESIGN OF REPOSITORY EXPLORER
The Repository Explorer is implemented as a web-based application (see Figure 1) to take advantage of the ubiquitous nature of WWW clients, and to alleviate the need to install multiple components on client machines to support all the software components used during testing.
The software supports both manual and automatic testing, but with an emphasis on the former. In automatic testing mode, a series of protocol requests, with legal and illegal combinations of parameters, is issued to the archive being tested, and the responses are checked for compliance with the expected range of responses. In manual mode, the software allows a user to browse through the contents of the archive using only the well-defined interface provided by the protocol; in this instance the user has full control over all parameters and can test individual features of the protocol.

[Figure 1. Repository Explorer basic interface]

4. VALIDATION PROCEDURE
The OAI protocol is a request/response protocol that works as a layer over HTTP, with responses in XML. The Repository Explorer performs validation at multiple levels in an attempt to detect the widest range of possible errors. Figure 2 depicts the flow of response data received from a server during the validation process.

[Figure 2. Outline of validation/testing process: HTTP response, XML parser, XML Schema processor, protocol checker]

Each step of this validation procedure performs incremental checking as described below:
1. When submitting an HTTP request, HTTP errors need to be detected and handled. The OAI protocol requires that explicitly illegal requests generate errors as HTTP status codes, and these are checked for.
2. Once the request is issued and the response is successful, the returned XML needs to be checked for validity. Since all XML responses are specified in the protocol specification using the XML Schema language [1], this part of the validation is accomplished using an XML Schema processor, which attempts to match the schema with the generated XML data stream. Many structural and encoding errors in the XML are detected in this phase of testing.
3. Unfortunately, schema processing does not always work flawlessly because of its many external dependencies; e.g., schemata are downloaded from the WWW as required. As a redundant mechanism, XML errors are also detected during the parsing and tree generation phase.
4. Lastly, the tree representation is checked for semantic correctness. For example, where controlled vocabularies are used but not encoded into the XML Schema, these are checked at this stage (e.g., the standard list of metadata formats used by the OAI).

Once checks are performed, if the software is being used in manual mode, a new interface is generated to display the data received from the server and present the user with additional options to perform further operations.
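As a rough illustration of these layered checks (a sketch, not the Repository Explorer's actual code), the following uses the lxml library; the local schema file name is an assumption.

```python
# A minimal sketch of the layered validation described above.
import urllib.error
import urllib.request
from lxml import etree

# Assumed: a local copy of the protocol's XML Schema for the tested response.
OAI_SCHEMA = etree.XMLSchema(etree.parse("oai_response.xsd"))

def check_response(url: str) -> str:
    # Step 1: explicitly illegal requests must surface as HTTP error codes.
    try:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
    except urllib.error.HTTPError as e:
        return f"HTTP error {e.code}"
    # Step 3 (redundant with step 2): parsing catches structural and
    # encoding errors even when schema processing is unavailable.
    try:
        tree = etree.fromstring(data)
    except etree.XMLSyntaxError as e:
        return f"XML parse error: {e}"
    # Step 2: validate the parsed tree against the protocol's schema.
    if not OAI_SCHEMA.validate(tree):
        return f"schema violation: {OAI_SCHEMA.error_log.last_error}"
    # Step 4: semantic checks (e.g. controlled vocabulary values) go here.
    return "ok"
```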

5. CONCLUSIONS
Over the course of implementing the original and revised protocol, most if not all implementers used the Repository Explorer to test their implementations. Their positive feedback supported the original motivation that a compliance test would greatly ease the process of implementation.

Many lessons were learnt during the process of designing the test software, the most significant being that testing is a decidedly non-trivial problem if the target (namely, the protocol specification) is not stationary. However, designing a new version of the protocol in tandem with updating the test software ensured that the new specifications contain mechanisms (like self-identification) that can be exploited to perform more exhaustive automatic tests.

6. FUTURE WORK
Further standardization of the software libraries used in development can lead to a tool suite that will not only be adaptable to future versions of the OAI protocol, but also possibly to other client/server protocols. While XML tools are still not widely deployed and are notorious for high degrees of complexity, future combinations of XSD (Schema), XSLT (Transformation) and XPath (Path Specifications) could make protocol testing a fully automated process for emerging client/server protocols using XML as the underlying technology. The test suite is under constant development as best practices emerge among users of the community, thus ensuring that the Repository Explorer maintains its status as the compliance test for which it was intended.

7. ACKNOWLEDGMENTS
Many thanks go to the group of OAI protocol authors and testers who used the Repository Explorer and provided useful feedback.

8. REFERENCES
[1] Fallside, David C. (editor). XML Schema. W3C, October 2000. http://www.w3.org/XML/Schema
[2] Van de Sompel, Herbert and Carl Lagoze. The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine, Volume 6, Number 2, February 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
[3] Van de Sompel, Herbert and Carl Lagoze. The Open Archives Initiative Protocol for Metadata Harvesting. Open Archives Initiative, January 2001. http://www.openarchives.org/OAI/openarchivesprotocol.htm

Arc: An OAI Service Provider for Cross-Archive Searching

Xiaoming Liu, Kurt Maly, Mohammad Zubair, and Michael L. Nelson
Old Dominion University, Department of Computer Science
Norfolk, VA 23529 USA
+1 757 683 4017
{liu_x, maly, zubair, nelso_m}@cs.odu.edu

ABSTRACT
The usefulness of the many on-line journals and scientific digital libraries that exist today is limited by the lack of a service that can federate them through a unified interface. The Open Archives Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives. In this paper, we describe our experience and lessons learned in building Arc, the first federated searching service based on the OAI protocol. Arc harvests metadata from several OAI-compliant archives, normalizes them, and stores them in a search service based on a relational database (MySQL or Oracle). At present we have over 165K metadata records from 16 data providers in various domains.

Categories and Subject Descriptors
H.3.7 [Digital Libraries]: collection, dissemination, standards.

General Terms
Design, Experimentation, Standardization, Languages

Keywords
Digital Library, Open Archives Initiative

1. INTRODUCTION
A number of free on-line journals and scientific digital libraries exist today; however, there is a lack of a federated service that provides a unified interface to all these libraries. The Open Archives Initiative (OAI) [1] is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives. The OAI framework supports data providers (archives) and service providers. The service provider develops value-added services that are based on the information collected from data providers. These value-added services could take the form of cross-archive search engines, linking systems, and peer-review systems. OAI is becoming widely accepted and there are many archives currently or soon-to-be OAI compliant. Arc (http://arc.cs.odu.edu) is the first federated search service based on the OAI protocol, and its concept originates from the Universal Preprint Service (UPS) prototype [2]. We encountered a number of problems in developing Arc. Different archives have different format/naming conventions for specific metadata fields, which necessitates data normalization. Arbitrary harvesting can overload the data provider, making it unusable for normal purposes. Initial harvesting when a data provider joins a service provider requires a different technical approach than the periodic harvesting that keeps the data current.

2. Architecture
The Arc architecture is based on the Java servlets-based search service that was developed for the Joint Training, Analysis and Simulation Center (JTASC) [3]. This architecture is platform independent and can work with any web server. Moreover, the changes required to work with different databases are minimal. Our current implementation supports two relational databases, one in the commercial domain (Oracle) and the other in the public domain (MySQL). The architecture improves performance over the original UPS architecture by employing a three-level caching scheme [3]. Figure 1 outlines the major components; however, for brevity we discuss only the components relevant to this paper.

[Figure 1. Arc Architecture: a user interface and search engine servlet over a JDBC-connected data cache and relational store (Oracle or MySQL), fed by history and daily harvesters with data normalization, drawing on OAI data providers.]

2.1 Harvester
Data providers differ in data volume, partition definition, service implementation quality, and network connection quality. All these factors influence the harvesting procedure. Historical and newly-published data harvesting have different requirements. When a data provider joins a service provider for the first time, all past data (historical data) needs to be harvested, followed by periodic harvesting to keep the data current. Historical data are high-volume and more stable; the harvesting process generally runs once, and a chunk-based harvest is preferred to reduce access time and data provider overhead. To harvest newly published data, data size is not the major problem, but the scheduler must be able to harvest new data as soon as possible and guarantee completeness, even if data providers provide incomplete data for the current date. The OAI protocol provides flexibility in choosing the harvesting strategy; theoretically, one data provider can be harvested in one simple transaction, or harvested as many times as the number of records in its collection. But in reality only a subset of this range is possible, and choosing an appropriate harvesting method has not yet been made into a formal process. We defined four harvesting types in Arc: (a) bulk-harvest of historical data, (b) bulk-harvest of new data, (c) one-by-one-harvest of historical data, and (d) one-by-one-harvest of new data. From our tests, these four strategies in combination can fulfill the various requirements of a particular collection.
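As an illustration only (hypothetical code, not Arc's implementation), the two bulk modes might look like the following, built on the protocol's ListRecords verb with a date range.

```python
# A minimal sketch of bulk harvesting via OAI ListRecords; the base URL and
# helper are assumptions for illustration.
import urllib.parse
import urllib.request

def list_records(base_url: str, **args) -> bytes:
    """Issue one ListRecords request and return the raw XML response."""
    query = urllib.parse.urlencode({"verb": "ListRecords", **args})
    with urllib.request.urlopen(f"{base_url}?{query}") as resp:
        return resp.read()

# Bulk-harvest of historical data: one large pass when a provider first joins.
# xml = list_records("http://example.org/oai", metadataPrefix="oai_dc",
#                    until="2001-02-07")

# Bulk-harvest of new data: a periodic job asking only for recent changes.
# ("from" is a Python keyword, hence the dict unpacking.)
# xml = list_records("http://example.org/oai", metadataPrefix="oai_dc",
#                    **{"from": "2001-02-06"})
```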


2.2 Database Schema
OAI uses Dublin Core (DC) as the default metadata set. All DC attributes are saved in the database as separate fields. The archive name and partition are also treated as separate fields in the database, to support search and browse functionality. In order to improve system efficiency, most fields are indexed using the full-text properties of the database, which makes high-performance queries over a large dataset possible. The search engine communicates with the database using JDBC and a connection pool.

2.3 Search Interface
The search interface supports both simple and advanced search, as well as result sorting by date stamp, relevance ranking and archive. Simple search allows users to search free text across archives. Advanced search allows users to search in specific metadata fields. Users can also search or browse specific archives and/or archive partitions in case they are familiar with a specific data provider. Author, title and abstract search are based on user input; the input can use boolean operators (AND, OR, NOT). Archive, set, type, language and subject use controlled vocabularies and are accumulated from the participating archives' source data.

3. Results
We maintain two test sites, the public Arc cross-archive service and another site for the OAI alpha test only. So far, in the public service, we have seven archives available for search (Table 1). In the OAI alpha test site, we have nine archives across different domains, including documents from Heinonline.org, Open Video Project, Language Resource Association, Library of Congress and other organizations. The alpha test data is not open to the public.

Table 1. Collections Harvested by Arc (by Feb 7, 2001)
Archive Name | URL | Records
arXiv.org e-Print Archive | arxiv.org | 151650
Cognitive Science Preprints | cogprints.soton.ac.uk | 999
NACA | naca.larc.nasa.gov | 6352
Networked Digital Library of Theses and Dissertations | www.ndltd.org | 2401
Web Characterization Repository | repository.cs.vt.edu | 131
NCSTRL in Cornell | www.ncstrl.org | 2080
NASA Langley Technical Report Server | techreports.larc.nasa.gov/ltrs | 2323
Nine other collections for the OAI alpha test | arc.cs.odu.edu/help/archives.htm | 9467

4. Lessons learned
Little is known about the long-term implications of a harvest-based digital library, but we have had the following initial experiences. The effort of maintaining a quality federation service is highly dependent on the quality of the data providers. Some are meticulous in maintaining exacting metadata records that need no corrective actions. Other data providers have problems maintaining even a minimum set of metadata, and the records harvested are useless. We have not yet fully addressed the issue of metadata normalization. Some normalization was necessary to achieve a minimum presentation of query results. However, we did so on an ad hoc basis, with no formal definition of the relationship mappings. A controlled vocabulary will be of great help for a cross-archive search service to define such metadata fields as 'subject' or even 'organization'. XML syntax errors and character-encoding problems were surprisingly common and could invalidate entire large data sets. We also faced a trade-off in the frequency of harvests: too many harvests could overburden both the service and data providers, while too few harvests allow the data in the service provider to potentially become stale.
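As a small illustration of this kind of ad hoc normalization (the mapping values are invented examples, not Arc's actual rules):

```python
# A minimal sketch of ad hoc metadata normalization: mapping a few observed
# provider-specific forms onto one convention. The mappings are assumptions.
LANGUAGE_MAP = {"en": "english", "eng": "english", "en-us": "english",
                "de": "german", "ger": "german"}

def normalize_language(value: str) -> str:
    return LANGUAGE_MAP.get(value.strip().lower(), value)

def normalize_date(value: str) -> str:
    # Collapse "2001/02/07" and "2001.02.07" style stamps to "2001-02-07".
    return value.strip().replace("/", "-").replace(".", "-")
```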
5. REFERENCES
[1] Van de Sompel, H. and Lagoze, C. The Santa Fe Convention of the Open Archives Initiative. D-Lib Magazine, 6(2), February 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
[2] Van de Sompel, H., Krichel, T., Nelson, M. L., Hochstenbach, P., Lyapunov, V. M., Maly, K., Zubair, M., Kholief, M., Liu, X. and O'Connell, H. The UPS Prototype: An Experimental End-User Service across E-Print Archives. D-Lib Magazine, 6(2), February 2000. http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
[3] Maly, K., Zubair, M., Anan, H., Tan, D. and Zhang, Y. Scalable Digital Libraries based on NCSTRL/Dienst. In Proceedings of the 4th European Conference on Digital Libraries, ECDL 2000 (Lisbon, Portugal, September 2000), pp. 169-179.

Managing Change on the Web

Luis Francisco-Revilla, Frank Shipman, Richard Furuta, Unmil Karadkar, and Avital Arora

Center for the Study of Digital Libraries and Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, USA

{l0f0954, shipman, furuta, unmil, avital}@csdl.tamu.edu

ABSTRACT
Increasingly, digital libraries are being defined that collect pointers to World-Wide Web based resources rather than hold the resources themselves. Maintaining these collections is challenging due to distributed document ownership and high fluidity. Typically a collection's maintainer has to assess the relevance of changes with little system aid. In this paper, we describe the Walden's Paths Path Manager, which assists a maintainer in discovering when relevant changes occur to linked resources. The approach and system design were informed by a study of how humans perceive changes to Web pages. The study indicated that structural changes are key in determining the overall change and that presentation changes are considered irrelevant.

Categories and Subject Descriptors
H.3.7 [Digital Libraries]: User issues;
H.5.4 [Hypertext/Hypermedia]: Other (maintenance)

General Terms
Algorithms, Management, Design, Reliability, Experimentation, Human Factors, Verification.

Keywords
Walden's Paths, Path Maintenance.

1. INTRODUCTION
The work in building a library is not only in the collection of materials; it is also in their organization and maintenance. Books in a traditional library are actively managed: when acquired they must be organized, indexed and catalogued. When considered obsolete, they must be removed. Librarians go to great lengths in order to keep the collection up to date. New editions replace old ones, while old versions may be moved to archival sections or be discarded.

Digital libraries present an electronic counterpart of traditional libraries. Digital librarians are necessary to perform similar functions in order to maintain collections of electronic documents or electronic pointers to documents.

It is not rare to hear people refer to the Web as a new kind of library, or at least say that it should be more like a library. However, on the Web not all electronic collections are well maintained; the degree of management of electronic collections varies considerably. Moreover, the characteristics of the Web raise many challenges for the maintainer. Collections on the Web can be extremely distributed: not only the location of the documents is distributed, but ownership is also distributed.

At the moment, the Web is less like a library and more like a huge bookshelf, containing millions of documents and links to documents. This mega shelf is subject to the activities of an army of people (and computer-based agents) that keep acting upon it in ways such as adding, removing, indexing, cataloguing, rearranging, modifying, and copying documents. Additionally, all these activities tend to occur with little or no coordination, worsening the situation.

Better coordination, social protocols, and institutionalized management of collections can alleviate the situation in some cases, particularly when the owner of the collection and the owner(s) of the documents agree to coordinate their activities. The emerging digital topical libraries, while homed on the Web, are better maintained than the Web as a whole. They present a more cohesive and organized structure in order to support their patrons. Different automatic mechanisms have been devised in order to help manage the documents. Indexing and cataloguing are considerably easier to perform than in the pre-electronic days. Nevertheless, considerable human effort is still required to monitor the changes in documents and collections.

In addition to the digital library that is provided by the "owner" of the documents (e.g., the ACM Digital Library) there are a growing
number of specialized libraries with distributed ownership.1 In these cases librarians often are limited to working with the pointers or links to the documents.

Digital documents are typically very fluid, i.e., changes are a common occurrence. Although detecting changes is easily automated, assessing the relevance of changes is not.

1 One example is the National Science Foundation's NSDL effort, described at http://www.ehr.nsf.gov/due/programs/nsdl


Typically humans are needed when facing the task of relevance assessment. In fact, "relevance" is a highly abstract, contextual, and subjective concept. Different individuals can disagree in their relevance judgment about any given change. Nevertheless, as difficult as it may be, assessing the relevance of a change is still necessary.

This paper presents an approach to aid humans in managing fluid collections of documents, in particular Web pages that have been authored and are owned by third parties. The approach provides a way of automatically assessing the relevance of changes in the Web pages. This method is explored in the Path Manager, a system designed to infer the relevance of changes to Web pages and to communicate this information to the collection's maintainer.

The paper is divided into seven sections. Section two presents background about this work and its motivation. Sections three and four describe the approach used and its implementation as a prototype system. Section five gives an overview of a study to better understand how humans perceive changes, which we conducted to better understand how to refine the Path Manager. Sections six and seven conclude the paper.

2. MOTIVATION
Web readers access a rich and vast collection of information resources authored and published by a multitude of sources. In this collection it is possible to find information on virtually every topic, varying in aspects such as point of view, validity, semantics, rhetoric, and contextual situation. In order to better exploit this collection, different systems have been developed with the goal of aiding readers by providing an interpretation or contextualization of Web pages.

Walden's Paths is an application that allows teachers to construct trails [2] or paths using Web pages authored by others [7], [19]. The paths represent a higher order construct, or meta-document, that organizes and adds contextual information to pages authored by others. However, meta-documents that are to have a lasting value need to adapt to changes in their components. Thus, paths must address the issue of the unpredictable changing of their fundamental building blocks (Web pages). This is particularly important given the high fluidity of Web pages [3], where authors are prone to change their pages just to maintain a "fresh" look and feel.

Before proceeding, and in order to avoid possible confusion, it is important to define the different people and components involved in the problem. While the issues addressed in the present work relate to any meta-document application, our interest originated from research conducted in the Walden's Paths project. Therefore meta-documents are defined as "paths" even though this is not the only possible meta-document. Similarly, construction blocks are referred to as "Web pages" or "pages". The first group involved is the people that originally create and publish the Web pages. They are referred to as "page creators". Another group is the people that create the meta-documents or paths. This group is denoted in this document as "path authors" or just "authors". Additionally there is the group of people accessing or using the paths, which is designated as "readers". Finally there is the group of people who have to manage the collection of paths, referred to as "administrators". It is important to consider that any given person might act in any or multiple of these roles at any given time.

2.1 Assessing Change Relevance
Early in the Walden's Paths project issues related to the fluidity of Web pages rapidly arose. We observed that it is a common occurrence for paths to include Web pages that have moved, changed or are no longer available [18]. As a result the collection of paths needed to be constantly revised and updated.

To date in the project, we have implemented a number of approaches that can help to reduce the effects of fluidity. A simple approach is to cache the pages. This approach is one way to address the fluidity issue and also can expedite delivery of the information, particularly in schools that have slow Internet connections. However, it became evident to us that a single caching approach could not solve the problem, since not all Web page changes are undesirable (some paths include pages that by nature change, such as the current weather or a newspaper Web page). Our approach, implemented through the Walden's Paths Authoring Tool, allows authors to specify the caching strategy for individual pages [12]. We note, however, that issues about versioning and intellectual property rights remain, especially when caching Web pages created by third parties.

A second approach might be to increase the fluidity of the paths themselves, perhaps showing only pages that have not moved or changed. Walden's Paths provides an implementation mechanism for this through ephemeral paths [6]. Ephemeral paths are paths that exist for only a short period of time as the result of some computation. Unfortunately, the effectiveness of the technique is limited here since many path authors design their paths to have a rhetorical coherence based on the linearity imposed by the path mechanism. Hence, when some pages of the path are removed, the rhetorical structure can be broken, rendering the path ineffective.

More importantly, caching and ephemeral paths only provide mechanisms for implementing ways of reacting to change. Humans must determine when significant change has occurred. Making this determination requires that the nature of the change be considered within the context of the maintainer's goals (i.e., that its relevance be considered). For instance, even though a page may change in appearance or wording, it might remain conceptually the same. For many applications, this would be an insignificant change.

We are pursuing approaches to assist path authors and administrators in managing continuously occurring changes in their selected collection of Web pages. While our emphasis is on paths, our approach will be equally applicable to other Web-based domains that need to assess the relevance of changes to Web pages authored by third parties. We would expect, for example, that simple modifications to the Path Manager would enable the maintenance of personal lists or other collections of resource links.

Simply detecting changes in a page is easy. The simplest implementation would track the "last modified" date returned by the HTTP server. A more accurate mechanism would retain cached copies of Web pages. It is easy to compare a document with a previously cached version and determine if there has been any change. In a similar way, it would be possible to check for changes within specific sections of the document. These cases would return a Boolean "yes/no" indication of change.

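A minimal sketch of these two boolean checks follows; the URL handling and cache representation are illustrative assumptions.

```python
# A minimal sketch of simple change detection: tracking the HTTP
# "Last-Modified" header and comparing a checksum of a cached copy.
import hashlib
import urllib.request

def last_modified(url: str) -> str:
    """Return the server-reported modification date, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as resp:
        return resp.headers.get("Last-Modified", "")

def has_changed(url: str, cached_digest: str) -> bool:
    """Yes/no change detection against a previously stored digest."""
    with urllib.request.urlopen(url) as resp:
        current = hashlib.md5(resp.read()).hexdigest()
    return current != cached_digest
```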
As mentioned before, a difficulty arises when attempting to determine the relevance of changes. The obvious approach would be to assess the magnitude or amount of change. Unfortunately, there is not an easily computed metric that provides a direct correlation between syntactic and semantic changes in a Web page.

For instance, there is no clear relationship between the number of bytes changed and the relevance of the change to the reader. A large number of bytes changed might result from a page creator who restructures the spacing of a page's source encoding while maintaining the same content from a semantic and rhetorical point of view. Similarly, a small number of bytes changed might result from the insertion of a few negations in the text that causes a complete reversal in meaning. On the other hand, we can think of other situations where a large number of byte changes corresponds to the creation of a completely different page, and situations where a few bytes are changed when the page creator simply corrects minor spelling and grammatical errors.

The goal would be to efficiently obtain a measure of the semantic distance between two versions of a document. However, at the moment the best we could attempt is to infer the semantic distance based on the syntactic characteristics of the document. In order to accomplish this, some heuristic actions are reasonable. For example the document can be partitioned based on heuristics such as:
- Paragraphs tend to encapsulate concepts
- Different paragraphs tend to encapsulate different concepts

Another way to partition the document is to analyze the document encoding. Markup languages such as SGML and XML already specify structure rather than presentation. Even though Web pages are encoded in HTML, analyzing the changes is not an easily automated task. While HTML provides some information about the structure of the document, it is commonly used for specifying presentation rather than structure. Adding to that, page creators use HTML in extremely different ways. Paragraphs, headings and other tags are used quite diversely. What conceptually constitutes a paragraph for one page creator might constitute several pages for another. This does not mean that HTML tags are useless for analyzing change. On the contrary, HTML tags and other features such as keywords can be used in order to infer the relevance of changes.

2.2 Related Work
Measuring the magnitude or relevance of changes in an automated fashion is not an easy task. Researchers have encountered this problem in a variety of contexts and their approaches have informed our work.

In his doctoral dissertation David Johnson created a system for authoring and delivering tutorials over the Web [10], [11]. Johnson designed a signature-based approach when he also faced the issue of ever-changing Web pages. Johnson's approach computed a "distance" between two versions of a document by employing weighted values for additions and deletions of paragraphs, headings and keywords. Based on this measurement, Johnson created a mechanism that notifies readers and even blocks Web pages from showing when the distance between the two versions reaches predetermined trigger levels. While his testing was satisfactory with a particular collection of Web pages, there are a couple of issues that hinder exporting the approach to the World Wide Web. The first issue is the dependence of the distance measure on the arbitrary determination of the weights assigned to the additions and deletions. The use of these weights might not work for a different collection of Web pages. The second issue is the asymmetric nature of the distance measurement and its lack of normalization. That is, the distance from document A to document B might be different from the distance from document B to document A. In addition there are no normalized values for the distance. Therefore the distance between two documents could be a number like 0.5 or 20. This makes setting trigger levels an arbitrary matter.

There are other approaches that monitor features of Web pages at a fine-grained level. Currently it is possible to find Web-based systems that provide fine-grained monitoring of changes, such as AIDE [4], URL-Minder [22] and WatzNew [23]. Systems like these allow monitoring text, links, images and keywords of a given Web page. However, a typical critique is that cosmetic changes are reported to be as relevant as substantive content changes. In contrast, the Path Manager evaluates the relevance of the change with regard to the whole page.

The goal of the Path Manager is to identify "interesting" or relevant changes to Web pages. As such, its design has been informed by other research work dealing with identifying "interesting" Web pages. In particular, research in page relevance has helped point out possible features of a page that should be monitored more attentively.

Researchers at the University of California at Irvine have investigated the issues of identifying readers' interests and dealing with changing Web pages. They have developed systems such as Syskill and Webert [17] and the Do-I-Care-Agent (DICA) [20], [21]. Syskill and Webert is an agent designed to discover interesting pages on the Web. The approach in this system is to aid readers with long-term information seeking goals. DICA is an agent designed to monitor reader-specified Web pages and notify the reader of important changes to these pages. Both systems rely on user profiles intended to model their users' interests. By interacting with the readers, the agents learn more about the reader's interests. However, Walden's Paths is designed for a different environment, more specifically an educational environment. In this case the paths are used as artifacts that provide guidance and direction to the readers. Thus, Web pages must maintain consistency with regard to the semantic and rhetoric composition of the path.

There are two other systems that, although slightly tangential to the present work, have influenced the design of the Path Manager. These systems are WebWatcher [1], [9] and Letizia [14]. Like URL-Minder and WatzNew, WebWatcher notifies the user whenever specified pages change. In addition, WebWatcher attempts to evaluate "how interesting" a given Web page would be for a given reader and provides navigation suggestions in real time, by annotating or adding links to the page. This approach explores the use of the knowledge embedded in the links and the text around the links in order to infer relevance. While Johnson rejected links for page distance analysis, the arguments forwarded by WebWatcher prompted the consideration of them as a metric in the Path Manager.

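As a rough sketch of such a weighted, asymmetric and unnormalized distance (the weights below are arbitrary placeholders, not Johnson's actual values, which is precisely the first criticism raised above):

```python
# A sketch of a Johnson-style distance: weighted counts of added and deleted
# features (paragraphs, headings, keywords) between two versions.
WEIGHTS = {"added": 1.0, "deleted": 2.0}  # assumed, illustrative only

def distance(old: set, new: set) -> float:
    added = len(new - old)      # features present only in the new version
    deleted = len(old - new)    # features present only in the old version
    return WEIGHTS["added"] * added + WEIGHTS["deleted"] * deleted

# Summing such per-feature distances yields a single unbounded score
# (it could be 0.5 or 20), which makes trigger levels arbitrary to set.
```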
Letizia is an autonomous interface agent [15] that aids a reader in browsing the Web. Letizia uses a knowledge-based approach to respond to personal interests. Like WebWatcher, it also attempts to evaluate how interesting a given Web page would be for a given reader. While the reader is reading a Web page in a browser, Letizia traverses the links in the page, retrieves the pages and analyzes them. Then it ranks the links based on how interesting the destination Web pages might be for the reader. Letizia's inference takes place on the client, as opposed to WebWatcher, where the process is conducted on the server.

3. APPROACH
As described, determining the relevance of changes to a document can only be done in the context of how that document is to be used. As this knowledge will not be available to the system supporting relevance assessment, our goal is to provide a variety of information about changes in a relatively concise interface. To do so, we use a variety of document signatures to recognize different types of change.

3.1 Kinds of Change
In order to infer change relevance, it is important to classify the nature of the change. There are four categories of change that we distinguish between: content or semantic, presentation, structural, and behavioral.

Content changes refer to modifications of the page contents from the reader's point of view. For example, a page created for a soccer tournament might be continuously updated as the tournament progresses. After the tournament has ended, the page might change to a presentation about the tournament results and sports injuries.

Presentation changes are changes related to the document representation that do not reflect changes in the topic presented in the document. For instance, changes to HTML tags can modify the appearance of a Web page while it otherwise remains the same.

Structural changes refer to the underlying connection of the document to other documents. As an example, consider changes in the link destinations of a "Weekly Hot Links" Web page. While this page might be conceptually the same, the fact that the destinations of the links have changed might be relevant, even if the text of the links has not. Structural changes are also important to detect, as they often might not be visually perceptible.

Behavioral changes refer to modifications to the active components of a document. For Web pages this includes scripts, plug-ins and applets. The consequences of these changes are harder to predict, especially since many pages hide the script code in other files.

3.2 Document Signatures
To represent the different characteristics of a Web-based document, we use a set of document signatures to infer and compute similarities between two Web pages. In the context of this paper, the term "document signatures" is not equivalent to "signature files" (as in Witten, et al. [24]), which generally refers to strings of bits created by hashing all the terms in the document. In the present work, the signature approach relies on identifying page features and characteristics that not only identify the Web page, but also allow quantifying the change magnitude based on a comparison with a previously computed signature. As of now, the approach considers four Web page features:

Paragraph Checksums. This metric is used to determine content changes. By recording a checksum for each paragraph, this approach has a finer granularity than is possible with a page checksum. While the checksums provide an idea about the degree of change to the page content, they also provide an idea of which pieces of text changed.

Headings. This metric is used to determine content and presentation changes. Headings typically highlight important text and titles. Changes to headings may indicate changes to the focus or perspective of a page. However, they may also reflect how the document is divided and the information grouped and presented to the reader. Thus they provide clues about presentation changes.

Links. This metric is used to determine structural changes of the Web page. Since the value of hypertext documents depends not only on the document's contents, but also on the navigation provided by its links, it is important to analyze this connectivity. There are two components to links. On one hand there is an invisible component, namely the documents accessible through the links. On the other hand, the visual text or image of the link anchors provides information about the contents of the destination pages. While the page might appear visually the same, the links might have changed to point to different places, rendering the page inappropriate.

Keywords. We use reader-provided keywords in order to determine content changes of the Web page. In the current version, keyword presence is used as the feature to be identified. In future versions the effectiveness of more complex techniques such as TFIDF (Term Frequency Inverse Document Frequency) could be explored to determine the degree of change. Additionally, we are considering (but have not yet implemented) evaluating different algorithms for automatic keyword generation. Keywords provided by users are good at distinguishing relevant pages from non-relevant pages [16]. The question remains as to whether users will provide them.

The approach also records a global checksum for the whole page in order to diminish the possibility of false negatives. There are some changes to Web pages that do not affect any of the previous metrics. For example, a change to an image source would not be reflected in any of the current metrics, as it is not part of the text, headings, links or keywords of the document. In this case the global checksum provides a last resort to point out changes.

Combining the results of the various document signatures into a single metric of change is difficult. As previously mentioned, some Web page changes are easily perceived while others are not. Some changes might alter the visual appearance of a page while maintaining the same structure and content. Other changes might change the links while maintaining exactly the same appearance. As already acknowledged, the relevance of change is situation dependent and no single metric will match all situations. We will return to the particular algorithms explored to compute this overall notion of change after a description of the system and interface being developed.

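A minimal sketch of computing these four features plus the global checksum for a page follows; the paragraph-splitting heuristic and parsing below are simplifications for illustration, not the Path Manager's actual rules.

```python
# A sketch of building a document signature from an HTML page.
import hashlib
import re
from html.parser import HTMLParser

class SignatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings, self.links = [], []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._in_heading = True
        elif tag == "a":
            self.links.append(dict(attrs).get("href", ""))

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings.append(data.strip())

def signature(html: str, keywords: list) -> dict:
    parser = SignatureParser()
    parser.feed(html)
    # Heuristic: treat <p> tags as paragraph boundaries (a simplification).
    paragraphs = [p for p in re.split(r"<p[^>]*>", html, flags=re.I) if p.strip()]
    return {
        "paragraphs": [hashlib.md5(p.encode()).hexdigest() for p in paragraphs],
        "headings": parser.headings,
        "links": parser.links,
        "keywords": [k for k in keywords if k.lower() in html.lower()],
        "global": hashlib.md5(html.encode()).hexdigest(),  # false-negative guard
    }
```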
70 108 4. THE SYSTEM i%A" Path Mniingrir PIO El The Walden's Paths Path Manager is a system implemented in FieSettingsDisplay :i:j.Writirg:r1 L:Tij Algorithm Johnson Compere lo I.:wising signature Java capable of checking a list of Web pages for relevant changes. It takes a path file as input and checks all the pages specified in Path Fie H ll0f0g541Prnanagesluishomepagepalhi

the path. Alternatively, an HTML file can be specified as input, Renutts whether it is located locally (like a bookmark list) or remotely ii (specified by a URL). The Path Manager interprets links in the t4trewww.csdLtainu-odul-10f0954.4ndex.html HTML file as a list of URLs to check. In order to detect and assess the possible changes, the system retrieves all the pages 42- Front Page from the Web, then parses their contents and creates a new Index with Irojers signature for each page. The page signatures obtained are 44.. Standard Well compared with the previously computed signatures. Finally, based on the comparison results, the system presents users with an 45...Mars Medical Assistant assessment of the relevance of the overall change of each Web Crininier Supported Cooperating Work page. The user (the path's administrator) can then review each 117.- Curriculum page individually and if there are no relevant changes, the user can validate the current state of the pages. 0, About me 9, Peru Cuzco In order to compute the magnitude of the change, or the distance ii between two documents, a comparison is made between the I4 10, Pm - Expedition I

Currently the Path Manager records three versions of the signature: the original, the last valid, and the signature from the latest time the path was checked. The original signature is the first signature obtained. It is used to give a sense of the total amount of change since the page was selected. The last valid signature corresponds to the last time that the user reviewed the changes and validated the pages. As time passes the user might update the valid signature as the pages change. This signature represents the functional state of the page. Finally, the latest signature corresponds to the last time that the Path Manager retrieved and analyzed the pages, regardless of whether the user validated their state or not. Using these three signatures the Path Manager can indicate the amount of recent and long-term change for a page.

4.1 Scenario of Use

Figure 1 shows the initial state of the interface just after selecting a path to check.

Figure 1. Initial State of the Interface.

The Path Manager extracts the pages from the path file and presents them to the user. Pages are identified by either their title or, when the page title is not available, their URL. Colors are used to represent the magnitude of the change. In this case pages are shown in blue, meaning that the current state of the page is unknown and needs to be checked. The flag at the left is used to indicate whether any change has occurred. Since at this point that is still unknown, the flag is shown as hanging.

At this point the user can specify what algorithm and signature to use in order to check the pages in the list. There are two algorithms implemented, Johnson's algorithm and the proportional algorithm. As for the signatures, the Path Manager maintains three signatures: the original signature, the valid signature, and the latest signature.

Figure 2 shows the Path Manager relevance assessment of changes in the pages using the proportional algorithm and comparing the signature with the last valid signature. For each page, a red flag is shown fluttering whenever the global checksum has changed. If the global checksum is the same, then a blue hanging flag is shown next to the page title. The flags work as a boolean detection of even the slightest changes. In turn, the degree of relevance of the change is encoded by the color of the page title or URL. In particular, black means that there is no relevant change based on the algorithm chosen. Green, yellow, and red text indicate low, medium, and high degrees of changes considered relevant by the algorithm. The coloring scheme is based on user-defined trigger levels, which in turn could be used for other purposes, such as not showing in the path a page that has changed more than the medium level.

At this point the user might choose to validate the state of the pages by clicking on the red "Valid" button. Alternatively, the user might wonder about what kinds of changes prompted the system's assessment of the overall relevance of the changes. In order to support the more inquisitive user, the Path Manager can display the amount of change to the particular document signatures used in the relevance assessment. Figure 3 shows the view of the different metrics used in the relevance assessment.

The specific change metrics are presented to the right of the page identification. In addition, the user can get more information on each Web page by selecting the page identification. Figure 4 shows the detailed view of the change metrics for a particular page. In this case the bottom page from Figure 3 was chosen (page 11 in the figure).

This view provides the assessment of the change to the page at the bottom and information about the use of the page above. For the user to be able to assess the relevance of the change, they may need to see both the content of the page and the context of its use. The top of the detailed view for a page presents properties of the page within the path, such as the cache strategy, annotations, and visibility of the page.

The system will also display the page itself at the user's request. At this point the user, who might not be the same person who authored the path, might judge the page inappropriate and choose to hide the page from the viewers. While this action prevents the material from being shown to readers of the path, it does not delete the page from the path, leaving that decision to the path author. In case there have been connection or retrieval problems, the manager can prompt the system to check this page again.

Figure 2. Overall Change Relevance Assessments.

Figure 4. Detailed View of Page Metrics (the Page Modifications Report).

In addition, the user asks for metadata about the path as a whole by selecting the information button, located to the right of the path file in the main interface (Figures 1-3). Figure 5 shows the path metadata presented to the user.

Figure 5. Path information.

Figure 3. View of Change Metrics.

Given this information, the user of the Path Manager can contact the path author in order to inform him/her of the changes, or to perform and coordinate corrective measures in order to re-establish the desired function of the path.

4.2 Web Page Retrieval and Connectivity

An issue for the Path Manager is the connectivity required to retrieve a set of Web pages. Connection times are never constant, as they depend on many variables outside the system's or users' control. Some pages are returned rather quickly, while others take longer to return. Varying connection times also mean that, sometimes, when a Web page seems to take too long to load, canceling the retrieval and immediately restarting the process results in the Web page being loaded faster.

Were the Web pages in a path to be checked sequentially, retrieval problems with a single page would block the retrieval of all subsequent pages. Therefore the Path Manager architecture has been designed as a multi-threaded process in order to avoid possible blocks and expedite retrieval. In this scheme, each page is retrieved and analyzed in an independent thread. The user controls the maximum number of simultaneous threads.

Even using independent threads, there are many possible problems when retrieving pages from the Web, such as no response, slow connections, and very long pages. In order to deal with these situations the system recognizes three states within each individual retrieval thread:

1. Connection state: the system is attempting to contact the Web server hosting the specified Web page.
2. Retrieval state: the Web page has been located and its contents are being downloaded.
3. Analysis state: all contents have been retrieved and the system is parsing and analyzing them.

Figure 6. Selecting the timeouts.

The user can set different timeouts for the connection and retrieval states, within which every thread must finish before they expire. The whole path checking process can also be interrupted by a general timeout or by the user clicking the stop button. Figure 6 shows the selection of the timeouts.

Each thread evaluates one page at a time. Once it either successfully or unsuccessfully evaluates the change to that page, the thread will pick another from the set to be checked, until all pages have been successfully checked or a general timeout has occurred.

Whenever the Path Manager cannot assess the relevance of the changes, the page titles are shown in blue. Four different shades of blue are used to denote different reasons that the relevance assessment was not successful:

- The page is yet to be checked, i.e., no connection has been attempted.
- There was a network problem, typically a timeout during the connection or retrieval states.
- The general timeout expired before the analysis was completed.
- There were problems that could not be identified.

Because long paths encounter temporary connectivity problems that are resolved quickly, the user can also tell the Path Manager to check only those pages where the assessment of change was not successful.

4.3 Algorithms

We are investigating two different algorithms for categorizing and combining the changes found by the document signatures: one based on Johnson's work [10], and another we call the proportional algorithm.

4.3.1 Johnson's Algorithm

We first implemented a variation on Johnson's algorithm [10], [11] to compute the distance between two documents. As previously mentioned, Johnson only used paragraphs, headings and keywords. In order to take into account structural changes, our implementation of his approach includes links as an additional metric. For each signature, additions, deletions and modifications are identified. The distance metric is computed as follows:

D = PD + HD + LD + KD

Where:
  D    Distance
  PD   Paragraph Distance
  HD   Headings Distance
  LD   Links Distance
  KD   Keywords Distance

In turn:

PD = PW * [(#Pmod * PWmod) + (#Padd * PWadd) + (#Pdelete * PWdelete)] / #P

Where:
  #P        Number of Paragraphs
  #Pmod     Number of Paragraph Modifications
  #Padd     Number of Paragraph Additions
  #Pdelete  Number of Paragraph Deletions
  PW        Paragraphs Weight
  PWmod     Paragraph Modifications Weight
  PWadd     Paragraph Additions Weight
  PWdelete  Paragraph Deletions Weight

Similarly:

HD = HW * [(#Hmod * HWmod) + (#Hadd * HWadd) + (#Hdelete * HWdelete)] / #H
LD = LW * [(#Lmod * LWmod) + (#Ladd * LWadd) + (#Ldelete * LWdelete)] / #L
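Under this reconstruction, every component distance has the same shape: a weighted sum of modification, addition, and deletion counts, normalized by the element count. The Python fragment below is a minimal sketch of that computation, not the Path Manager's own code; the count and weight values are assumed inputs obtained by comparing two signatures, and keywords (whose simpler computation is described just below) count deletions only.

    # Sketch of a Johnson-style distance. For paragraphs, headings and
    # links, counts and weights are 4-tuples (n, n_mod, n_add, n_del)
    # and (w, w_mod, w_add, w_del); keywords use (n, n_del) and (w, w_del).
    def component_distance(n, n_mod, n_add, n_del, w, w_mod, w_add, w_del):
        if n == 0:
            return 0.0
        return w * (n_mod * w_mod + n_add * w_add + n_del * w_del) / n

    def johnson_distance(counts, weights):
        # D = PD + HD + LD + KD
        d = 0.0
        for t in ("P", "H", "L"):
            d += component_distance(*counts[t], *weights[t])
        n_k, n_kdel = counts["K"]
        w_k, w_kdel = weights["K"]
        d += w_k * (n_kdel * w_kdel) / n_k if n_k else 0.0
        return d

Because the weighted sums are not normalized to a fixed range, choosing display cutoffs from D is awkward, which is the drawback discussed below.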

The keyword computation is slightly simpler, since it is only important to detect if a keyword is missing:

KD = KW * (#Kdelete * KWdelete) / #K

Johnson developed his algorithm to support Web-based tutorials. In his application, the results of the algorithm were used by the system in deciding whether a page should be displayed to the student, i.e., whether it continued to resemble the page selected by the tutorial's author. Johnson's algorithm includes the ability to weight modifications, additions, and deletions independently. However, since it supports a computer system and not a human user, it does not provide normalized results; this makes the results more difficult for a user to interpret. In particular, this makes the selection of cutoff values and the visualization of change somewhat unintuitive.

4.3.2 Proportional Algorithm

This signature-based approach provides a simpler computation of the distance. The change to each individual signature is computed as follows:

PD = (#Pchanges / #P) * 100, where #Pchanges = #Pmods + #Padds + #Pdeletes
HD = (#Hchanges / #H) * 100, where #Hchanges = #Hmods + #Hadds + #Hdeletes
LD = (#Lchanges / #L) * 100, where #Lchanges = #Lmods + #Ladds + #Ldeletes
KD = (#Kchanges / #K) * 100, where #Kchanges = #Kmods + #Kadds + #Kdeletes

The overall page distance is:

D = (#Tchanges / #T) * 100, where
  #T = #P + #H + #L + #K
  #Tchanges = #Pchanges + #Hchanges + #Lchanges + #Kchanges

Throughout the proportional algorithm, the number of paragraphs, headings, links, or keywords is taken to be the maximum of the two signatures being compared. Thus, whether a page goes from one to four paragraphs or from four paragraphs to one, #P = 4.

The proportional algorithm provides a normalized and symmetric distance that is easier to use for different sets of Web pages. The normalized and symmetric properties of the distance measurement facilitate providing the user with a visualization or listing of the different changes. This, in turn, allows the user to evaluate the page changes more effectively without having to actually review all the pages in the path.
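The proportional computation is easy to express directly. The sketch below follows the formulas above, with each total taken as the larger of the two signatures' counts; as before, this is an illustration rather than the deployed code.

    # Sketch of the proportional distance: inputs map each signature type
    # ('P', 'H', 'L', 'K') to its change count and element count.
    def proportional_distance(changes, totals):
        per_type = {
            t: 100.0 * changes[t] / totals[t] if totals[t] else 0.0
            for t in ("P", "H", "L", "K")
        }
        total = sum(totals.values())
        overall = 100.0 * sum(changes.values()) / total if total else 0.0
        return per_type, overall

    # Example: totals {P: 4, H: 0, L: 10, K: 2} and changes
    # {P: 1, H: 0, L: 2, K: 0} give PD = 25.0, LD = 20.0, and an
    # overall D = 3/16 * 100 = 18.75.

Because every value is a percentage, the same trigger levels can be reused across pages of very different sizes, which is what makes the color-coded visualization practical.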

5. THE PERCEPTION OF CHANGE

In order to inform and evaluate the approach and design of systems that aid in the automatic assessment of change relevance, we conducted a study to observe how humans perceive changes to Web pages. (We give a brief overview of the study here; for details consult the companion paper [5].)

This study covered three kinds of change: content, structure and presentation. In particular, content changes included modifications to the text presented, structure changes referred to modifications of URLs and the anchor text for links, and presentation changes included modifications to colors, fonts, backgrounds, spatial positioning of information, or combinations of these.

Web pages were selected from paths previously created by teachers and from personal bookmarks. Each page was modified to reflect a single type of change, and was classified by the magnitude and the type of change.

5.1 Methodology

The experiment consisted of presenting the participant with two versions of a Web page: the original version and a possibly modified version (some were unmodified copies of the original). The person was then asked to evaluate the magnitude of change.

The goal of the study was to address the following three questions:

1. Do people view the same changes in a different way when given different amounts of time to analyze the pages?
2. What kinds of changes are easily perceived?
3. Of what kinds of changes do users want to be notified?

Before the evaluation we familiarized the participants with the testing software and the kinds of change. Additionally, we provided a context for evaluating the changes in the Web pages by providing a scenario. The participant was asked to act the role of the Information Facilitator in a K-12 school (Kindergarten to High School): teachers have chosen pages from the Web to teach their classes, and it is the participant's responsibility to check for changes.

The first task in the study was to fill in a pre-evaluation questionnaire collecting demographic data about the participants and their computer literacy. The study was then divided into three phases:

1. In phase I, the person was given 60 seconds to view each pair of Web pages. In this phase the system presented the participant with 8 cases, which included examples of all the different kinds of change. The system also identified the changes to the participant. The objective of this phase was to provide training, and the answers allowed us to address the third question.
2. In phase II, the person was only allowed to view the page pairs for 15 seconds before evaluating them. There were 31 pairs in this phase. This phase addressed the first and second questions.
3. In phase III, the person was given 60 seconds to view the Web pages. There were 31 pairs in this phase, which also addressed the first and second questions.

In each phase, once the person had viewed a Web page pair, s/he was presented with five questions:

1. From the Content perspective, the degree of change is:
2. From the Structure perspective, the degree of change is:
3. From the Presentation perspective, the degree of change is:
4. Overall, how significant are the changes?
5. If this page were in my bookmark list, I would like to be notified when changes like these occur.

Questions 1-4 were answered on a 7-step scale ranging from "none" to "moderate" to "drastic". Question 5 was answered on a 7-step scale ranging from strongly disagree to strongly agree.

The final task in the study was to fill in a post-evaluation questionnaire that gathered the participants' general comments about how easy it was to assess the magnitude of changes in Web pages, and about the evaluation software.

In all phases the testing software controlled the presentation of the Web pages directly. The time required to record the participant's answer was not controlled. The total time for the study varied from 1 to 2 hours per person, depending on the time the participants took in filling in the answers, and whether they decided to rest between phases.

5.2 Target Population

In order to conduct the study, we recruited adults, specifically students at Texas A&M University. Subjects were divided into two groups. Pages in Phase II for one group were used as the pages in Phase III for the other group. This allows comparing the results for the same page when given 15 or 60 seconds for observation.

5.3 Study Results

The results of Phase I, when the subjects were provided textual descriptions of what had changed, give the clearest indication of what people consider change and of what they would like to be notified. For content changes, as the percentage of paragraphs changed increased, the perception of overall change also increased, as did the subjects' desire to be notified of the change. Interestingly, as the degree of structural changes increased, the perception of the overall change did not increase, but the desire to be notified of the change did. For similar percentages of content and structure change, subjects rated the content changes higher in overall change but lower in desire to be notified.

In Phases II and III, time did not alter the perception of content change but was seen to play a large role in the identification of structural change. This is likely due to the visibility of content changes and the invisibility of structure changes; subjects commented in the exit survey about how difficult and time-consuming it was to detect structure change. As with the results of Phase I, subjects' desire to be notified of perceived changes in structure was higher than their desire to be notified of similarly perceived changes in content.

In Phases II and III, subjects rated low and medium changes in content similarly, but rated those pages that had drastically changed quite a bit higher on all questions. Structural changes saw a similar but less extreme jump in ratings when almost all the links had been changed.

Presentation changes were judged not by amount of change but by type. Subjects did not seem to notice changes to fonts, while changes to background color and navigational images were noticed but rated low as contributing to overall change or desire to be notified. Changes in the arrangement of material on a page were noticed by subjects but interpreted differently depending on the amount of time they had to look at the page. With only 15 seconds, the rearrangement was considered a content change and there was a large desire to be notified, while with 60 seconds the subjects recognized that the material was the same but moved, and they had a lower desire to be notified. Finally, when many presentation features were changed at once, subjects rated the change as high on all metrics and wanted to be notified.

5.4 Implications for the Path Manager

The results of the study indicate that including links in metrics for overall change will better match our subjects' evaluation of change and desire to be notified. The results also show that presentation changes were viewed as largely unrelated to the function of the page, except in extreme cases. Finally, a more detailed analysis of the results may provide weightings for combining the results of the four signatures into an overall view of change.

6. CHALLENGES AND LIMITATIONS

An issue with using document signatures based on HTML tags is that not every page creator or Web page authoring tool uses HTML tags consistently. In the case of headings, page creators often choose to modify the visual appearance of the text by using tags such as FONT or by resorting to images. This imposes the requirement of implementing smarter parsers. We are currently attempting to address some of these issues by augmenting the system to infer what is conceptually a heading and where paragraphs begin and end, and to identify different types of changes to links.

A limitation of the current Path Manager is that no indirection is supported. When faced with Web pages containing frames, the Path Manager does not check the pages contained in the frames. The same is true for other tags such as images or links: there is no retrieval of the pages specified in the SRC or HREF fields unless they are also included explicitly in the path.

Another limitation of the Path Manager is that it does not monitor any JavaScript or other page behaviors. This is in part due to the complexity of the required parser and to the fact that this remains a moving target, where specifications and support vary constantly over time and browser type.

Finally, while the Path Manager can parse dynamically generated pages returned by CGIs, it does not recognize that they are dynamic and therefore variable by nature. Augmenting the system to recognize these could enable separate treatment for such pages.

7. CONCLUSIONS

Maintenance of distributed collections of documents remains a challenging and time-consuming task. People must monitor the documents for change and then interpret changes to the documents in the context of the collection's goals.

The Walden's Paths Path Manager supports the maintenance of collections of Web pages by recognizing, evaluating, and informing the user of changes. The evaluation of change is based on document signatures of the paragraphs, headings, links, and keywords. The Path Manager keeps track of original, last valid, and last collected signatures so users can determine both long-term and short-term change depending on their particular concern.

The Path Manager has been designed to work in a distributed environment where connectivity to documents is unpredictable. Its architecture and instantiation provide users with control over the system resources consumed. Also, the system provides feedback about documents that are not successfully evaluated.

Particular to uses for Walden's Paths, the Path Manager provides access to information about the use of the documents in a path and to metadata about the path that may be important to determining the relevance of particular changes. Users may also mark pages so that they remain in the path but the Walden's Paths server will not display them to readers of the path.

A study of the perception of changes to Web pages indicated the desire for structural changes to be included in the determination of overall change. The study also showed that presentation changes were largely considered irrelevant. Current work on the Path Manager aims to overcome difficulties with the inconsistencies and indirection found in Web documents.

8. ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant Numbers IIS-9812040 and DUE-0085798.

9. REFERENCES

[1] Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. WebWatcher: A Learning Apprentice for the World Wide Web, in Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous Distributed Environments (Stanford University CA, March 1995). AAAI Press, 6-12.
[2] Bush, V. As We May Think. The Atlantic Monthly (August 1945), 101-108.
[3] Brewington, B. & Cybenko, G. How Dynamic is the Web, in Proc. of WWW9, the 9th International World Wide Web Conference, IW3C2 (2000), 264-296.
[4] Douglis, F., Ball, T., Chen, Y., & Koutsofios, E. The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web. World Wide Web, 1(1) (January 1998), 27-44.
[5] Francisco-Revilla, L., Shipman, F.M., Karadkar, U., Furuta, R., & Arora, A. Changes to Web Pages: Perception and Evaluation. To appear in Proc. of Hypertext 2001 (Aarhus, Denmark, August 2001).
[6] Furuta, R., Shipman, F., Francisco-Revilla, L., Karadkar, U., & Hu, S. Ephemeral Paths on the WWW: The Walden's Paths Lightweight Path Mechanism. WebNet (1999), 409-414.
[7] Furuta, R., Shipman, F., Marshall, C., Brenner, D., & Hsieh, H. Hypertext Paths and the World-Wide Web: Experiences with Walden's Paths, in Proc. of Hypertext'97 (Southampton, U.K., April 1997). ACM Press, 167-176.
[8] Giles, L., Bollacker, K., & Lawrence, S. CiteSeer: An Automatic Citation Indexing System, in Proc. of DL'98 (Pittsburgh PA, June 1998). ACM Press, 89-98.
[9] Joachims, T., Mitchell, T., Freitag, D., & Armstrong, R. WebWatcher: Machine Learning and Hypertext. Fachgruppentreffen Maschinelles Lernen (Dortmund, Germany, August 1995).
[10] Johnson, D.B. Enabling the Reuse of World Wide Web Documents in Tutorials. Ph.D. Dissertation, Dept. of Computer Science and Engineering, University of Washington, Seattle, WA, 1997.
[11] Johnson, D.B., & Tanimoto, S.L. Reusing Web Documents in Tutorials with the Current-Documents Assumption: Automatic Validation of Updates, in Proc. of ED-MEDIA'99 (Seattle WA, June 1999). AACE, 74-79.
[12] Karadkar, U., Francisco-Revilla, L., Furuta, R., Hsieh, H., & Shipman, F. Evolution of the Walden's Paths Authoring Tools, in Proc. of WebNet 2000, the World Conference on the WWW and Internet (San Antonio, TX, October 30 - November 4, 2000). AACE, 299-304.
[13] Levy, D.M. Fixed or Fluid? Document Stability and New Media, in Proc. of the European Conference on Hypertext Technology '94 (Edinburgh, Scotland, September 1994). ACM Press, 24-31.
[14] Lieberman, H. Letizia: An Agent That Assists Web Browsing, in Proc. of the International Joint Conference on Artificial Intelligence (Montreal, Canada, August 1995). Morgan Kaufmann, 924-929.
[15] Lieberman, H. Autonomous Interface Agents, in Proc. of CHI'97 (Atlanta GA, March 1997). ACM Press, 67-74.
[16] Pazzani, M., & Billsus, D. Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning 27 (1997), Kluwer Academic Publishers, 313-331.
[17] Pazzani, M., Muramatsu, J., & Billsus, D. Syskill and Webert: Identifying Interesting Web Sites, in Proc. of AAAI'96 (Portland, Oregon, August 1996). American Association for Artificial Intelligence, 54-59.
[18] Shipman, F., Furuta, R., Brenner, D., Chung, C., & Hsieh, H. Using Paths in the Classroom: Experiences and Adaptations, in Proc. of Hypertext'98 (Pittsburgh PA, June 1998). ACM Press, 167-176.
[19] Shipman, F., Marshall, C., Furuta, R., Brenner, D., Hsieh, H., & Kumar, V. Creating Educational Guided Paths over the World-Wide Web, in Proc. of ED-TELECOM'96 (Boston MA, June 1996). AACE, 326-331.
[20] Starr, B., Ackerman, M.S., & Pazzani, M. Do-I-Care: A Collaborative Web Agent, in Proc. of CHI'96 (Vancouver, Canada, April 1996). ACM Press, 273-274.
[21] Starr, B., Ackerman, M.S., & Pazzani, M. Do-I-Care: Tell Me What's Changed on the Web, in Proc. of the AAAI Spring Symposium on Machine Learning in Information Access (Stanford CA, March 1996).
[22] URL-Minder. Available at http://www.netmind.com/html/url-minder.html
[23] WatzNew. Available at http://www.watznew.com
[24] Witten, I.H., Moffat, A., & Bell, T.C. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd Edition. Morgan Kaufmann, San Francisco, CA, 1999.

Measuring the Reputation of Web Sites: A Preliminary Exploration

Greg Keast, Elaine G. Toms, Joan Cherry
Faculty of Information Studies, University of Toronto
Toronto, Ontario, Canada
[email protected], {toms, cherry}@fis.utoronto.ca

ABSTRACT

We describe the preliminary results from a pilot study, which assessed the perceived reputation (authority and trustworthiness) of the output from five WWW indexing/ranking tools. The tools are based on three techniques: external link structures, internal content, or human selection/indexing. Twenty-two participants reviewed the output from each tool and assessed the reputation of the retrieved sites.

Categories and Subject Descriptors
H.3.3 Information Search and Retrieval.

General Terms
Measurement, Performance, Reliability, Experimentation

Keywords
Web sites, Reputation, Authority, Evaluation, TOPIC, Google, Alta Vista, Lycos, Yahoo

1. INTRODUCTION

Multiple techniques have been developed to retrieve and rank web sites on the World Wide Web. To date, the evaluation of results has been based primarily on the relevance and 'aboutness' of a site to the query. Equally valuable to the user is the perceived reputation or trustworthiness of the content. TOPIC, developed by Rafiei and Mendelzon [4], identifies the topics for which a web page has a good reputation by mining the link structure (see details in [4]). In this pilot study we compare the output from TOPIC with that of several tools to ascertain if certain types of tools are more conducive to providing 'reputable' web sites.

The five tools tested were created from three techniques: those that use the external link structure to make sense of the site; those that use the content of a site as the input to the indexing and ranking; and those that are human-selected and indexed. In this pilot study, we compare how users perceive the output from each tool. Do certain types of tools yield sites that are perceived to be more reputable (authoritative and trustworthy) than others?

2. BACKGROUND

Reputation is admittedly an elusive construct. The Oxford dictionary defines it as "the general estimation in which a person is held." In essence, it is an external perception about an object (deserved or otherwise). An analogy may be made to web sites: if a site links to another site, the link serves as a recommendation or endorsement of that site. However, the mere fact that a link exists is insufficient to claim an endorsement and thus a reputation; links to sites, like citations to journals, can be created for many reasons. But when a site is not only 'well-linked' but its link structure can be analyzed to such an extent that the evidence is overwhelming, then one could conclude that a reputation has been made. TOPIC is built on this foundation.

Source authority is generally considered a key criterion for judging the perceived quality of information and for filtering information [7]. Rieh and Belkin [5] showed that "people depend upon such judgments of source authority and credibility more in the Web environment than in the print environment". The importance of reputation to a company or organization is amply illustrated by the proliferation of consulting firms that provide reputation management services. These distinctive activities illustrate the importance of reputation as an influential factor for clients when choosing institutions, services or products [6].

Kleinberg [3], Chakrabarti et al. [2] and Brin & Page [1] have exploited the link structure to confer authority on web sites. TOPIC augments this work by identifying the categories in which a page has a good reputation.
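The internals of the cited link-analysis systems are not reproduced in this paper, but their core intuition (a page inherits standing from the pages that point to it) can be suggested with a toy power-iteration score in the style of Brin & Page [1]. The Python sketch below is our own illustration of that family of techniques, not TOPIC's topic-specific computation [4].

    # Toy PageRank-style scoring: `graph` maps each page to the pages it
    # links to (all link targets are assumed to also appear as keys).
    def link_scores(graph, damping=0.85, iterations=50):
        n = len(graph)
        score = {p: 1.0 / n for p in graph}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / n for p in graph}
            for page, outlinks in graph.items():
                if not outlinks:
                    continue  # dangling page: its share is simply dropped
                share = damping * score[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
            score = new
        return score

    # A page pointed to by many well-regarded pages accumulates a high
    # score, which is the sense in which a link acts as an endorsement.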
3. METHODS

3.1 Participants

Twenty-two participants (13 female, 8 male and 1 unknown) took part in the study. Twenty per cent were under 26 and 30% were over 35. They were a well-educated group: 95% had at least one undergraduate degree. Forty per cent browse the web more than 15 times a week, with the remainder browsing the web fewer times. Due to the nature of the task that participants were to be assigned, we also noted that 65% watch one to four movies a month; 10% watched more than 15. All were recruited from within the University of Toronto and were paid $10.00 for their participation.

3.2 Task

Initially, we assessed the results to a query from five different tools, which are based on three types of techniques: 1) Google and TOPIC, which use intra-site link analysis, 2) Alta Vista and Lycos, which use automatic indexing, and 3) Yahoo, which uses human-assigned categories and assessments.

The five search tools were accessed by the first author during the same two-hour period to search for "movie review" sites. We selected movie reviews because the topic has wide appeal and we knew that we would find likely participants who had some experience in examining web sites on that topic and were thus sufficiently knowledgeable to comment on a site's reputation. The first four sites retrieved by each tool were chosen and a final list of 17 unique sites on movie reviews was compiled. Duplicates were removed, but the source of each duplicate was included in the subsequent analysis.

3.3 Procedures

Testing took place primarily in a lab setting over a two-week period. Each participant was assigned a randomized list of the 17 sites. They visited each in turn and assessed, on a scale of one to five, the trustworthiness, authoritativeness and aboutness of the site and also of the links leading from the site. In addition, they indicated if they would return to the site and/or recommend the site as a good source of movie reviews to a friend or colleague. Finally, they indicated which of the 17 sites they considered the best source for movie reviews. The entire process took about one hour per participant.

4. RESULTS

Some sites were not available at the time of testing, and thus the number of cases in each analysis varied from 17 to 22. All data were analyzed using SPSS's GLM repeated measures. Perception measures were averaged by participant (see Table 1). Selected data are presented here.

The sites found by the five tools differed significantly in terms of perceived Trustworthiness (F(4,72)=3.314, p=.039), Authoritativeness (F(4,72)=4.165, p=.02) and Aboutness (F(4,72)=11.353, p<.001). In post hoc tests, the sites found by TOPIC were considered more trustworthy and authoritative than those of Google and Lycos, while Google was less authoritative than Alta Vista. All five differed significantly in Aboutness: Yahoo sites were considered to be more on topic than those of Alta Vista and TOPIC.

Table 1. Average Rating for Sites by Tool

  Tools        Trust   Authority   Aboutness
  Alta Vista   3.7     3.7         4.2
  Google       3.3     3.3         3.4
  Lycos        3.2     3.1         3.7
  Topic        3.7     3.8         4.0
  Yahoo        3.6     3.6         4.5

Outgoing links from each of the sites were perceived similarly, but there were notable exceptions. There were no significant differences among the five tools in the Trustworthiness and Authoritativeness of outgoing links. Sites retrieved by Google performed the most poorly on Aboutness, while those retrieved by Alta Vista, TOPIC and Yahoo were rated the highest.

5. DISCUSSION & CONCLUSIONS

This reports on results from one test with a single topic, and thus results are tentative, although noteworthy. (Because of the human effort required in the conduct of this study, only one topic, i.e., subject matter of site, was selected for the pilot study.) TOPIC performed on average as well as Alta Vista and outperformed Google. Surprisingly, it matched or outperformed the human-selected-and-indexed Yahoo. In general the two non-human techniques performed on par with the human-generated tool.

Overall the scores were relatively low, with 3.5 being the average rating assigned on almost all variables. Perhaps this is indicative of the status of movie review sites. Of the 17 sites that were retrieved from these five tools, less than a third were considered worth recommending or returning to, or identified as the best sources of movie reviews.

Predicting the reputation of a web site is a difficult problem. TOPIC's approach to measuring reputation is novel. Findings from this work are promising, justifying the need for further assessment using multiple topics from diverse domains with different users.

6. ACKNOWLEDGMENTS

Our thanks to the Faculty of Information Studies for the use of its facilities, to the 22 participants, and to Alberto Mendelzon.

7. REFERENCES

[1] Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
[2] Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: a new approach to topic-specific resource discovery. http://www.cs.berkeley.edu/~soumen/doc/www1999f/html/
[3] Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
[4] Rafiei, D. & Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. Toronto, Ont.: University of Toronto. Retrieved September 6, 2000, from the World Wide Web: ftp://ftp.db.toronto.edu/pub/papers/www9.ps.gz
[5] Rieh, S. Y. & Belkin, N. J. (1998). Understanding judgment of information quality and cognitive authority in the WWW. Proceedings of the ASIS Annual Meeting, 35, 279-289.
[6] Tapp, L. (2000). Reputation is key in picking the best business school. The Globe and Mail, Sept. 6, 2000.
[7] Wilson, P. (1983). Second-Hand Knowledge: An Inquiry into Cognitive Authority. Westport, CT: Greenwood Press.

Personalized Spiders for Web Search and Analysis

Michael Chau, Daniel Zeng, Hsinchun Chen
Department of Management Information Systems, The University of Arizona
Tucson, Arizona 85721, USA
1-520-626-9239 / 1-520-621-4614 / 1-520-621-2748
[email protected], [email protected], [email protected]

ABSTRACT

Searching for useful information on the World Wide Web has become increasingly difficult. While Internet search engines have been helping people to search on the web, low recall rates and outdated indexes have become more and more problematic as the web grows. In addition, search tools usually present to the user only a list of search results, failing to provide further personalized analysis which could help users identify useful information and comprehend these results. To alleviate these problems, we propose a client-based architecture that incorporates noun phrasing and self-organizing map techniques. Two systems, namely CI Spider and Meta Spider, have been built based on this architecture. User evaluation studies have been conducted and the findings suggest that the proposed architecture can effectively facilitate web search and analysis.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - clustering, information filtering, search process.

General Terms
Design, Experimentation.

Keywords
Information retrieval, Internet spider, Internet searching and browsing, noun-phrasing, self-organizing map, personalization, user evaluation.

1. INTRODUCTION

The World Wide Web has become the biggest digital library available, with more than 1 billion unique indexable web pages [9]. However, it has become increasingly difficult to search for useful information on it, due to its dynamic, unstructured nature and its fast growth rate. Although the development of web search and analysis tools such as search engines has alleviated the problem to a great extent, the exponential growth of the web is making it impossible to collect and index all the web pages and refresh the index frequently enough to keep it up-to-date. Most search engines present search results to users that are incomplete and outdated, usually leaving users confused and frustrated.

A second problem that Internet users encounter is the difficulty of searching for information on a particular website, e.g., looking for information related to a certain topic on the website www.phoenix.com. Among the popular commercial search engines, only a few offer the option to limit a search session to a specified website. Because most search engines only index a certain portion of each website, the recall rate of these searches is very low, and sometimes no documents are returned at all. Although most large websites nowadays have their own built-in internal search engines, these engines index the information based on different schemes and policies, and users may have difficulty in uncovering useful information. In addition, most of the websites on the Internet are small sites that do not have an internal search feature.

A third problem is the poor retrieval rate when only a single search engine is used. It has been estimated that none of the search engines available indexes more than 16% of the total web that could be indexed [12]. Even worse, each search engine maintains its own searching and ranking algorithm as well as query formation and freshness standard. Unless the different features of each search engine are known, searches will be inefficient and ineffective. From the user's point of view, dealing with an array of different interfaces and understanding the idiosyncrasies of each search engine is too burdensome. The development of meta-search engines has alleviated this problem. However, how the different results are combined and presented to the user greatly affects the effectiveness of these tools.

In addition, given the huge number of daily hits, most search engines are not able to provide enough computational power to satisfy each user's information need. Analyses of search results, such as verifying that the web pages retrieved still exist or clustering web pages into different categories, are not available in most search engines. Search results are usually presented in a ranked-list fashion; users cannot get a whole picture of what the web pages are about until they click on every page and read the contents. This can be time-consuming and frustrating in a dynamic, fast-changing electronic information environment.

In order to alleviate the above problems, we propose a personalized and integrated approach to web search. In this paper, we present a client-side web search tool that applies various artificial intelligence techniques. We believe that a search tool that is more customizable would help users locate useful information on the web more effectively.

The client-based architecture also allows for greater computation power and resources to provide better searching and analysis performance. We have conducted two experiments to evaluate the performance of different prototypes built according to this architecture.

2. RELATED WORK

In order to address the information overload problem on the web, research has been conducted on developing techniques and tools to analyze, categorize and visualize large collections of web pages, among other text documents. A variety of tools have been developed to assist in searching, gathering, monitoring and analyzing information on the Internet.

2.1 Web Search Engines and Spiders

Many different search engines are available on the Internet. Each has its own characteristics and employs its preferred algorithm for indexing, ranking and visualizing web documents. For example, AltaVista (www.altavista.com) and Google (www.google.com) allow users to submit queries and present web pages in a ranked order, while Yahoo! (www.yahoo.com) groups websites into categories, creating a hierarchical directory of a subset of the Internet.

Another type of search engine is comprised of meta-search engines, such as MetaCrawler (www.metacrawler.com) and Dogpile (www.dogpile.com). These search engines connect to multiple search engines and integrate the results returned. As each search engine covers a different portion of the Internet, meta-search engines are useful when the user needs to cover as much of the Internet as possible. There are also special-purpose, topic-specific search engines [4]. For example, BuildingOnline (www.buildingonline.com) specializes in searching the building industry domain on the web, and LawCrawler (www.lawcrawler.com) specializes in searching for legal information on the Internet.

Internet spiders (a.k.a. crawlers) have been used as the main program in the backend of most search engines. These are programs that collect Internet pages and explore the outgoing links in each page to continue the process. Examples include the World Wide Web Worm [16], the Harvest Information Discovery and Access System [1], and the PageRank-based Crawler [5].

In recent years, many client-side web spiders have been developed. Because the software runs on the client machine, more CPU time and memory can be allocated to the search process and more functionalities are possible. Also, these tools allow users to have more control and personalization options during the search process. For example, Blue Squirrel's WebSeeker (www.bluesquirrel.com) and Copernic 2000 (www.copernic.com) connect with different search engines, monitor web pages for any changes, and schedule automatic searches. Focused Crawler [2] locates web pages relevant to a pre-defined set of topics based on example pages provided by the user. In addition, it also analyzes the link structures among the web pages collected.

2.2 Monitoring and Filtering

Because of the fast-changing nature of the Internet, different tools have been developed to monitor websites for changes and filter out unwanted information. Push Technology is one of the emerging technologies in this area. The user first needs to specify some areas of interest. The tool will then automatically push related information to the user. Ewatch (www.ewatch.com) is one such example. It monitors information not only from web pages but also from Internet Usenet groups, electronic mailing lists, discussion areas and bulletin boards to look for changes and alert the user.

Another popular technique used for monitoring and filtering employs a software agent, or intelligent agent [15]. Personalized agents can monitor websites and filter information according to particular user needs. Machine learning algorithms, such as an artificial neural network, are usually implemented in agents to learn the user's preferences.

2.3 Indexing and Categorization

There have been many studies of textual information analysis in information retrieval and natural language processing. In order to retrieve documents based on given concepts, the documents have to be indexed. Automatic indexing algorithms have been used widely to extract key concepts from textual data. Since it has been shown that automatic indexing is as effective as human indexing [18], many proven techniques have been developed. Linguistic approaches such as noun phrasing also have been applied to perform indexing of phrases rather than just words [21]. These techniques are useful in extracting meaningful terms from text documents, not only for document retrieval but also for further analysis.

Another type of analysis tool is categorization. These tools allow a user to classify documents into different categories. Some categorization tools facilitate the human categorization process by simply providing a user-friendly interface. Tools that are more powerful categorize documents automatically, allowing users to quickly identify the key topics involved in a large collection of documents [e.g., 8, 17, 23].

In document clustering, there are in general two approaches. In the first approach, documents are categorized based on individual document attributes. An attribute might be the query term's frequency in each document [7, 22]. NorthernLight, a commercial search engine, is another example of this approach. The retrieved documents are organized based on the size, source, topic or author of each document. Other examples include Envision [6] and GRIDL [19].

In the second approach, documents are classified based on inter-document similarities. This approach usually includes some kind of machine learning algorithm. For example, the Self-Organizing Map (SOM) approach classifies documents into different categories, which are defined during the process, using a neural network algorithm [10]. Based on this algorithm, the SOM technique automatically categorizes documents into different regions based on the similarity of the documents. It produces a data map consisting of different regions, where each region contains similar documents. Regions that are similar are located close to each other. Several systems utilizing this technique have been built [3, 11, 14].

Figure 1. Example of a User Session with CI Spider. (1) The user inputs the starting URL and search phrase into CI Spider; search options are also specified. (2) Search results are displayed dynamically; the Good URL List shows all the web pages containing the search phrase. (3) Noun phrases are extracted from the web pages and the user can select preferred phrases for categorization. (4) The SOM is generated based on the phrases selected; steps 3 and 4 can be done iteratively to refine the results.

Figure 2. Example of a User Session with Meta Spider. (1) The user chooses the search engines to be included and enters the search query into Meta Spider; search options are also specified. (2) Search results from the chosen search engines are displayed dynamically; web pages are fetched from the Internet for further analysis. (3) Noun phrases are extracted from the web pages and the user can select preferred phrases for categorization. (4) The SOM is generated based on the phrases selected; steps 3 and 4 can be done iteratively to refine the results.

3. SYSTEM DESIGN

Two different prototypes based on the proposed architecture have been built. Competitive Intelligence Spider, or CI Spider, collects web pages on a real-time basis from websites specified by the user and performs indexing and categorization analysis on them, to provide the user with a comprehensive view of the websites of interest. A sample user session with CI Spider is shown in Figure 1. The second tool, Meta Spider, has functionalities similar to those of CI Spider but, instead of performing breadth-first search on a particular website, connects to different search engines on the Internet and integrates the results. A sample user session with Meta Spider is shown in Figure 2.

The architecture of CI Spider and Meta Spider is shown in Figure 3. There are four main components, namely (1) User Interface, (2) Internet Spiders, (3) Noun Phraser, and (4) Self-Organizing Map (SOM). These components work together as a unit to perform web search and analysis.

Figure 3. System Architecture. (The User Interface passes the search query to the Internet Spiders; search results flow to the Noun Phraser, which draws on a lexicon to extract noun phrases; user-selected phrases go to the SOM, which returns the categorization map to the User Interface.)

3.1 Internet Spiders

In CI Spider, the Internet Spiders are Java spiders that start from the URLs specified by the user and follow the outgoing links to search for the given keywords, until the number of web pages collected reaches a user-specified target. The spiders run multi-threaded so that the fetching process is not affected by slow server response times. The robots exclusion protocol is also implemented, so that the spiders will not access sites where the web master has placed a text file on the host or a meta-tag in a web page indicating that robots are not welcome.

In the case of Meta Spider, the Internet Spiders first send the search queries to the chosen search engines. After the results are obtained, the Internet Spiders attempt to fetch every result page. Dead links and pages which do not contain the search keyword are discarded.

Whenever a page is collected during the search, the link to that page is displayed dynamically. The user can click on any link displayed and read its full content without having to wait for the whole search to be completed. The user can also switch to the Good URL List to browse only the pages that contain the search keyword. When the number of web pages collected meets the amount specified by the user, the spiders stop and the results are sent to the Noun Phraser for analysis.
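As a rough illustration of this multi-threaded fetching scheme (the actual Internet Spiders are written in Java), the Python sketch below fetches a set of candidate URLs in parallel, honors robots.txt, and keeps pages containing the keyword. Link extraction and the breadth-first frontier are omitted for brevity, and the function and parameter names are our own.

    # Illustrative multi-threaded fetch loop; not the deployed Java code.
    import queue
    import threading
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urlparse

    def fetch_matching(urls, keyword, target, num_threads=4):
        to_visit = queue.Queue()
        for url in urls:
            to_visit.put(url)
        results, lock = [], threading.Lock()

        def allowed(url):
            rp = urllib.robotparser.RobotFileParser()
            parts = urlparse(url)
            rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
            try:
                rp.read()
            except OSError:
                return True  # no readable robots.txt; assume permitted
            return rp.can_fetch("*", url)

        def worker():
            while len(results) < target:
                try:
                    url = to_visit.get_nowait()
                except queue.Empty:
                    return
                if not allowed(url):
                    continue
                try:
                    page = urllib.request.urlopen(url, timeout=10).read()
                except OSError:
                    continue  # dead link or slow server: skip the page
                if keyword.lower() in page.decode("utf-8", "replace").lower():
                    with lock:
                        results.append(url)

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results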

3.2 Noun Phraser

The Arizona Noun Phraser, developed at the University of Arizona, is the indexing tool used to index the key phrases that appear in each document collected from the Internet by the Internet Spiders. It extracts all the noun phrases from each document based on part-of-speech tagging and linguistic rules [21]. The Arizona Noun Phraser has three components. The tokenizer takes web pages as text input and creates output that conforms to the UPenn Treebank word tokenization rules, separating all punctuation and symbols from the text without interfering with textual content. The tagger module assigns a part-of-speech tag to every word in the document. The last module, the phrase generation module, converts the words and associated part-of-speech tags into noun phrases by matching tag patterns to a noun phrase pattern given by linguistic rules. Readers are referred to [21] for a more detailed discussion. The frequency of every phrase is recorded and sent to the User Interface. The user can view the document frequency of each phrase and link to the documents containing that phrase. After all documents are indexed, the data are aggregated and sent to the Self-Organizing Map for categorization.

3.3 Self-Organizing Map (SOM)

In order to give users an overview of the set of documents collected, the Kohonen SOM employs an artificial neural network algorithm to automatically cluster the web pages collected into different regions on a 2-D map [10]. Each document is represented as an input vector of keywords, and a two-dimensional grid of output nodes is created. After the network is trained, the documents are submitted to the network and clustered into different regions. Each region is labeled by the phrase which is the key concept that most accurately represents the cluster of documents in that region. More important concepts occupy larger regions, and similar concepts are grouped in a neighborhood [13]. The map is displayed through the User Interface, and the user can view the documents in each region by clicking on it.
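The SOM step can be pictured with a bare-bones Kohonen training loop. The numpy sketch below is our own toy illustration under simplifying assumptions (the deployed system uses its own implementation and much larger maps): each document arrives as a keyword vector and is repeatedly pulled toward its best-matching grid node.

    # Toy Kohonen SOM: cluster document keyword vectors onto a 2-D grid.
    import numpy as np

    def train_som(doc_vectors, grid=(5, 5), epochs=100, lr=0.5,
                  radius=2.0, seed=0):
        rng = np.random.default_rng(seed)
        rows, cols = grid
        dim = doc_vectors.shape[1]
        weights = rng.random((rows, cols, dim))
        coords = np.indices((rows, cols)).transpose(1, 2, 0).astype(float)
        for epoch in range(epochs):
            decay = 1.0 - epoch / epochs
            for vec in doc_vectors:
                # Best-matching unit: the node whose weights are nearest.
                dist = np.linalg.norm(weights - vec, axis=2)
                bmu = np.array(np.unravel_index(dist.argmin(), dist.shape),
                               dtype=float)
                # Pull the BMU's neighborhood toward the document vector.
                grid_dist = np.linalg.norm(coords - bmu, axis=2)
                influence = np.exp(-(grid_dist / (radius * decay + 1e-9)) ** 2)
                weights += lr * decay * influence[..., None] * (vec - weights)
        return weights

    # Each document is then assigned to its best-matching node, and a
    # region is labeled with the phrase weighted most heavily there.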

3.4 Personalization Features
Because both CI Spider and Meta Spider have been designed for personalized web search and analysis, the user is given more control over the search process.

In the Options Panel, the user can specify how the search is to be performed. This is similar to the "Advanced Search" feature of some commercial search engines. The user can specify the number of web pages to be retrieved, the domains (e.g. .gov, .edu or .com) to be included in the search results, the number of Internet Spiders to be used, and so on. In CI Spider, the user can also choose either Breadth-First Search or Best-First Search as the algorithm used by the Internet Spiders.

The SOM is also highly customizable in the sense that the user can select and deselect phrases for inclusion in the analysis and produce a new map at any time. If the user is not satisfied with the map produced, he can always go back to the previous step, discard phrases that are irrelevant or too general, and generate a new map within seconds. The systems also let each user store a personalized "dictionary" of terms that the user does not want included in the results of the Arizona Noun Phraser and the SOM.

Another important function incorporated in the system is the Save function. The user can save a completed search session and open it at a later time. This allows the user to perform a web search and review it in the future, and it helps users who want to monitor web pages on a particular topic or website.

4. EVALUATION METHODOLOGIES
Two separate experiments have been conducted to evaluate CI Spider and Meta Spider. Because we designed the two spider systems to facilitate both document retrieval and document categorization tasks, traditional evaluation methodologies, which treat document retrieval and document categorization separately, would not have been appropriate. In our experiments, the experimental task was therefore designed to permit evaluation of the performance of a combination of the two functionalities in identifying the major themes related to the topic being searched.

4.1 Evaluation of CI Spider
In our experiment, CI Spider was compared with the usual methods that Internet users use to search for information. General users usually use popular commercial search engines to collect data on the Internet, or they simply explore the Internet manually. Therefore, these two search methods were compared with CI Spider. The first method evaluated was Lycos, chosen because it is one of the few popular search engines that offer the ability to search for a certain keyword in a given web domain. The second method was "within-site" browsing and searching, in which the subject was allowed to freely explore the contents of the given website using an Internet browser. When using CI Spider, the subject was allowed to use all the components, including the Noun Phraser and the SOM.

Each subject first tried to locate the pages containing the given topic within the given web host using the different search methods described above. The subject was required to comprehend the contents of all the web pages relevant to that keyword, and to summarize the findings as a number of themes. In our experiment, a theme was defined as "a short phrase which describes a certain topic." Phrases like "success of the 9840 tape drive in the market" and "business transformation services" are examples of themes in our experiment. By examining the themes that the subjects came up with using the different search methods, we were able to evaluate how effectively and efficiently each method helped a user locate a collection of documents and gain a general understanding of the response to a given search query on a certain website. Websites of different sizes, ranging from small sites such as www.eye2eye.com to large sites such as www.ibm.com, were chosen for the experiments.

Six search queries were designed for the experiment, based on suggestions given by professionals working in the field of competitive intelligence. For example, one of our search tasks was to locate and summarize the information related to "merger" on the website of a company called Phoenix Technologies (www.phoenix.com). Two pilot studies were conducted in order to refine the search tasks and experiment design. During the real experiment, thirty subjects, mostly information systems management students, were recruited, and each subject was required to perform three of the six different searches using the three different search methods. At the beginning of each experiment session, the subject was trained in using these search methods. Each subject performed at least one complete search session with each of the three search methods until he felt comfortable with each method. Rotation was applied so that the order of search methods and search tasks tested would not bias our results.

4.2 Evaluation of Meta Spider
Meta Spider was compared with MetaCrawler and NorthernLight. MetaCrawler (www.metacrawler.com) is a renowned, popular meta-search engine and has been recognized for its adaptability, portability and scalability [20]. NorthernLight (www.northernlight.com), one of the largest search engines on the web, provides clustering functionality to classify search results into different categories. When using Meta Spider, the subject was allowed to use all the components, including the Noun Phraser and the SOM.

Each subject was required to use the different search tools to collect information related to the given topic. As in the CI Spider experiment, each subject was required to summarize the web pages collected as a number of themes. The search topics were chosen from TREC 6 topics. Because the TREC topics were not especially designed for web document retrieval, care was taken to make sure each search topic was valid and retrievable on the Internet. Thirty undergraduate students from an MIS class at The University of Arizona were recruited to undertake the experiment. Training and rotation similar to those used in the CI Spider experiment were applied.

5. EXPERIMENT RESULTS AND DISCUSSION
Two graduate students majoring in library science were recruited as expert judges for each experiment. They employed the different search methods and tools being evaluated and came up with a comprehensive set of themes for each search task. Their results were then aggregated to form the basis for evaluation. Precision and recall rates for themes were used to measure the effectiveness of each search method.

The time spent on each experiment, including the system response time and the user browsing time, was recorded in order to evaluate the efficiency of the three search methods in each experiment. During the studies, we encouraged our subjects to tell us about the search method used, and their comments were recorded. Finally, each subject filled out a questionnaire to record further comments about the three different methods.

5.1 Experiment Results of CI Spider
The quantitative results of the CI Spider experiment are summarized in Table 1. Four main variables have been computed for each subject for comparison: precision, recall, time, and ease of use. Precision rate and recall rate were calculated as follows:

precision = (number of correct themes identified by the subject) / (number of all themes identified by the subject)

recall = (number of correct themes identified by the subject) / (number of all themes identified by the expert judges)
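Stated as code, the two rates could be computed as below. This is a hypothetical helper with invented names; in the actual experiments, the correctness of a theme was judged by the experts rather than by exact string matching:

import java.util.HashSet;
import java.util.Set;

/** Theme-based precision and recall, following the definitions above. */
public class ThemeScore {
    /** Correct themes: subject themes that also appear in the expert set. */
    static long correct(Set<String> subject, Set<String> expert) {
        Set<String> hits = new HashSet<>(subject);
        hits.retainAll(expert);
        return hits.size();
    }

    static double precision(Set<String> subject, Set<String> expert) {
        return subject.isEmpty() ? 0.0 : (double) correct(subject, expert) / subject.size();
    }

    static double recall(Set<String> subject, Set<String> expert) {
        return expert.isEmpty() ? 0.0 : (double) correct(subject, expert) / expert.size();
    }

    public static void main(String[] args) {
        Set<String> subject = Set.of("merger talks", "new BIOS product");
        Set<String> expert = Set.of("merger talks", "new BIOS product", "patent dispute");
        System.out.printf("precision=%.2f recall=%.2f%n",
                precision(subject, expert), recall(subject, expert));  // 1.00 and 0.67
    }
}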
The time recorded was the total duration of the search task, including both the response time of the system and the browsing time of the subject. Usability was measured by subjects' responses to the question "How easy/difficult is it to locate useful information using [that search method]?" Subjects were required to choose a level on a scale of 1 to 5, with 1 being the most difficult and 5 being the easiest.

In order to see whether the differences between the values were statistically significant, t-tests were performed on the experimental data. The results are summarized in Table 2. As can be seen, the precision and recall rates for CI Spider were both significantly higher than those of Lycos at the 5% significance level. CI Spider also received a statistically higher usability rating than Lycos and within-site browsing and searching.

Table 1: Experiment results of CI Spider

                         CI Spider   Lycos    Within-Site Browsing/Searching
  Precision:  Mean        0.708      0.477    0.576
              Variance    0.120      0.197    0.150
  Recall:     Mean        0.273      0.163    0.239
              Variance    0.027      0.026    0.033
  Time (min): Mean       10.02       9.23     8.60
              Variance   11.86      44.82    36.94
  Usability*: Mean        3.97       3.33     3.23
              Variance    1.34       1.13     1.29

  * Based on a scale of 1 to 5, where 1 is the most difficult to use and 5 the easiest.

Table 2: t-test results of CI Spider experiment

                CI Spider    CI Spider          Lycos vs
                vs Lycos     vs Within-Site     Within-Site B/S
                             B/S
  Precision     *0.029        0.169             0.365
  Recall        *0.012        0.459             0.087
  Time           0.563        0.255             0.688
  Usability     *0.031       *0.016             0.126

  * The mean difference is significant at the 0.05 level.

5.2 Experiment Results of Meta Spider
Three variables, namely precision, recall, and time, were computed for comparison in the Meta Spider experiment; the results are summarized in Table 3 and the t-test results in Table 4. In terms of precision, Meta Spider performed better than MetaCrawler and NorthernLight, and the difference with NorthernLight was statistically significant. For recall, Meta Spider was comparable to MetaCrawler and better than NorthernLight.

Table 3: Experiment results of Meta Spider

                         Meta Spider   MetaCrawler   NorthernLight
  Precision:  Mean        0.815         0.697         0.561
              Variance    0.281         0.315         0.402
  Recall:     Mean        0.308         0.331         0.203
              Variance    0.331         0.291         0.181
  Time (min): Mean       10.93         11.13         11.00
              Variance    4.04          4.72          5.23

Table 4: t-test results of Meta Spider experiment

                Meta Spider       Meta Spider        MetaCrawler vs
                vs MetaCrawler    vs NorthernLight   NorthernLight
  Precision      0.540            *0.013              0.360
  Recall         1.000             0.304              0.139
  Time           1.000             1.000              1.000

  * The mean difference is significant at the 0.05 level.

5.3 Strength and Weakness Analysis
5.3.1 Precision and Recall
The t-test results show that CI Spider performed statistically better in both precision and recall than Lycos, and that Meta Spider performed better than NorthernLight in precision. We suggest that the main reason for the high precision of CI Spider and Meta Spider is their ability to fetch and verify the content of each web page in real time: the Spiders can ensure that every page shown to the user contains the keyword being searched. On the other hand, we found that the indexes of Lycos and NorthernLight, like those of most other search engines, were often outdated. A number of URLs returned by these two search engines were irrelevant or dead links, resulting in low precision. Subjects also reported that in some cases two or more URLs returned by Lycos pointed to the same page, which led to wasted time verifying the validity of each page.

The high recall rate of CI Spider is mainly attributable to the exhaustive searching nature of the spiders. Lycos has the lowest recall rate because, like most other commercial search engines, it samples only a number of web pages in each website, thereby missing other pages that contain the keyword. For within-site browsing and searching, a user is more likely to miss some important pages because the process is mentally exhausting.

5.3.2 Display and Analysis of Web Pages
In the CI Spider study, subjects believed it was easier to find useful information using CI Spider (with a score of 3.97/5.00) than using Lycos domain search (3.33) or manual within-site browsing and searching (3.23). Three main reasons may account for this. The first is the high precision and recall discussed above: the high quality of data saved users considerable time and mental effort. Second, the intuitive and useful interface design helped subjects locate the information they needed more easily. Third, the analysis tools helped subjects form an overview of all the relevant web pages collected. The Arizona Noun Phraser allowed subjects to narrow and refine their searches, as well as providing a list of key phrases that represented the collection. The Self-Organizing Map generated a 2-D map display on which subjects could click to view the documents related to a particular theme of the collection.

In our post-test questionnaires in the CI Spider experiment, we found that 77% of subjects found the Good URL List useful for their analyses, while 40% of subjects found either the Noun Phraser or the SOM useful. This suggests that while many subjects preferred a traditional search result list, a significant portion of subjects were able to gain from the use of advanced analysis tools. Similar results were obtained in the Meta Spider experiment, in which 77% of subjects found the list display useful and 45% found either the Noun Phraser or the SOM useful.

5.3.3 Speed
The t-test results demonstrated that the three search methods in each experiment did not differ significantly in time requirements. As discussed in the previous section, the time used for comparison is the total searching time plus browsing time. Real-time indexing and fetching time, which usually takes more than 3 minutes, was also included in the total time for CI Spider and Meta Spider. Therefore, we anticipate that the two Spiders can let users spend less time and effort in the whole search process, because the users only need to browse the verified results.

6. CONCLUSION
The results of the two studies are encouraging. They indicate that the use of CI Spider and Meta Spider can potentially facilitate the web searching process for Internet users with different needs by using a personalized approach. The results also demonstrated that powerful AI techniques such as noun phrasing and SOM can be run on the user's personal computer to perform further analysis on web search results, which allows the user to understand the search topic more correctly and more completely. We believe that many other powerful techniques can be implemented on client-side search tools to improve efficiency and effectiveness in web search as well as other information retrieval applications.

7. ACKNOWLEDGMENTS
We would like to express our gratitude to the following agencies for supporting this project:

- NSF Digital Library Initiative-2, "High-performance Digital Library Systems: From Information Retrieval to Knowledge Management", IIS-9817473, April 1999-March 2002.
- NSF/CISE/CSS, "An Intelligent CSCW Workbench: Personalized Analysis and Visualization", IIS-9800696, June 1998-June 2001.

We would also like to thank all the members of the Artificial Intelligence Lab at the University of Arizona who have contributed to developing the two search tools, in particular Wojciech Wyzga, Harry Li, Andy Clements, Ye Fang, David Hendriawan, Hadi Bunnalim, Ming Yin, Haiyan Fan, Bill Oliver and Esther Chou.

8. REFERENCES
[1] Bowman, C., Danzig, P., Manber, U. and Schwartz, F. Scalable Internet Resource Discovery: Research Problems and Approaches. Communications of the ACM, 37(8): 98-107 (1994).
[2] Chakrabarti, S., van der Berg, M. and Dom, B. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, in Proceedings of the 8th International World Wide Web Conference (Toronto, Canada, 1999).
[3] Chen, H., Schufels, C. and Orwig, R. Internet Categorization and Search: A Self-Organizing Approach. Journal of Visual Communication and Image Representation, 7(1): 88-102 (1996).
[4] Chignell, M. H., Gwizdka, J. and Bodner, R. C. Discriminating Meta-Search: A Framework for Evaluation. Information Processing and Management, 35 (1999).
[5] Cho, J., Garcia-Molina, H. and Page, L. Efficient Crawling Through URL Ordering, in Proceedings of the 7th World Wide Web Conference (Brisbane, Australia, Apr 1998).
[6] Fox, E., Hix, D., Nowell, L. T., Brueni, D. J., Wake, W. C., Lenwood, S. H. and Rao, D. Users, User Interfaces, and Objects: Envision, A Digital Library. Journal of the American Society for Information Science, 44(8): 480-491 (1993).
[7] Hearst, M. TileBars: Visualization of Term Distribution Information in Full Text Information Access, in Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI'95), 59-66 (1995).
[8] Hearst, M. and Pedersen, J. Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 76-84 (1996).
[9] Inktomi WebMap, http://www.inktomi.com/webmap/
[10] Kohonen, T. Self-Organizing Maps. Springer-Verlag, Berlin, 1995.
[11] Kohonen, T. Exploration of Very Large Databases by Self-Organizing Maps, in Proceedings of the IEEE International Conference on Neural Networks, 1:1-6 (1997).
[12] Lawrence, S. and Giles, C. L. Accessibility of Information on the Web. Nature, 400: 107-109 (1999).
[13] Lin, C., Chen, H. and Nunamaker, J. Verifying the Proximity and Size Hypothesis for Self-Organizing Maps. Journal of Management Information Systems, 16(3): 61-73 (1999-2000).
[14] Lin, X., Soergel, D. and Marchionini, G. A Self-organizing Semantic Map for Information Retrieval, in Proceedings of the 14th International ACM SIGIR Conference on Research and Development in Information Retrieval, 262-269 (1991).
[15] Maes, P. Agents that Reduce Work and Information Overload. Communications of the ACM, 37(7): 31-40 (July 1994).

[16] McBryan, O. GENVL and WWW: Tools for Taming the Web, in Proceedings of the 1st International World Wide Web Conference (Geneva, Switzerland, Mar 1994).
[17] Rasmussen, E. Clustering Algorithms. In W. B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, NJ, 1992.
[18] Salton, G. Another look at automatic text-retrieval systems. Communications of the ACM, 29(7): 648-656 (1986).
[19] Shneiderman, B., Feldman, D., Rose, A. and Grau, X. F. Visualizing Digital Library Search Results with Categorical and Hierarchical Axes, in Proceedings of the 5th ACM Conference on Digital Libraries (San Antonio, Texas, USA, 2000).
[20] Selberg, E. and Etzioni, O. The MetaCrawler architecture for resource aggregation on the Web. IEEE Expert (1997).
[21] Tolle, K. M. and Chen, H. Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools. Journal of the American Society for Information Science, 51(4): 352-370 (2000).
[22] Veerasamy, A. and Belkin, N. J. Evaluation of a Tool for Visualization of Information Retrieval Results, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'96), 85-92 (1996).
[23] Zamir, O. and Etzioni, O. Grouper: A Dynamic Clustering Interface to Web Search Results, in Proceedings of the 8th International World Wide Web Conference (Toronto, Canada, May 1999).

Salticus: Guided Crawling for Personal Digital Libraries

Robin Burke
California State University, Fullerton
Dept. of Information Systems and Decision Sciences
Fullerton, CA 92834
[email protected]

ABSTRACT
In this paper, we describe Salticus, a web crawler that learns from users' web browsing activity. Salticus enables users to build a personal digital library by collecting documents and generalizing over the user's choices.

Keywords
Personal digital library, business intelligence, web crawling, document acquisition.

1. INTRODUCTION
One solution to the problem of information stability on the World-Wide Web is the creation of a "personal digital library" [1]: a personal collection of documents from on-line sources, stored in a local cache and organized for easy access.

The Document Organization aNd Navigation Agent (DONNA) is a project that seeks to build personal digital libraries to support competitive business intelligence. A personal library, by its nature, does not have the problems of scale presented by public libraries. On the other hand, all aspects of a document's life-cycle must be managed by a single individual.

This paper describes our work in the area of document acquisition. The business intelligence professionals with whom we work use the Web for a large proportion of their information gathering activity, and have a set of strategies for seeking out information of different types. Many of these strategies are not accessible to standard link-based web crawlers; for example, the analysts we studied made heavy use of site-based search engines.

Salticus is a document collection agent that observes a user's browsing and document collection behavior and makes predictions about other useful documents. The user can use these predictions to broaden the collection process and to automate it for future visits to the same pages.

2. WEB CRAWLING
The standard approach to document collection and indexing on the web is the use of a web crawler [2]. If the Web is viewed as a graph with documents as nodes and hyperlinks as edges, a crawler typically performs some type of best-first search through the graph, indexing or collecting all of the pages it finds.

This approach is suitable for building a comprehensive index, as found in search engines such as Google or Alta Vista. For a personal digital library, we must be more selective. One approach is to focus the crawler on a particular site and mirror its complete contents. Several commercial tools have this capability, for example Web Reaper.

This too is unsatisfactory, for several reasons. First, many of the documents returned in a complete mirror are irrelevant to the analyst's work. Second, many documents available to a human user are not accessible to link-based crawling at all, because they are reachable only through login pages or other form-based access controls. It should be noted that such controls are often put in place precisely so that documents can be made available to human users and not to web crawlers.
3. SALTICUS
3.1 Document collection
Salticus is a part of the DONNA system that acts as a proxy between the user and the web. It observes all of a user's browsing behavior, including form submission, authentication interactions and cookie creation. To record browsing behavior, the user initiates an "excursion" onto the web with Salticus observing. The system builds a list of the interactions that occur between the user and the various web servers visited, and caches the documents that are accessed. When the user decides to collect a document, it is transferred to the appropriate collection within DONNA.

With this capability, the system operates much like document collecting systems such as iHarvest that also build personal collections of web documents. Where Salticus differs is in its ability to generalize over document collection actions and to predict other documents to gather.

(The name Salticus stands for Search Augmented with Learning Techniques In a Customizable User Spider; it is also the name of a genus of spiders, known for its jumping abilities and advanced eyes.)
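The interaction log described here, combined with the replay behavior described in Section 3.4 below, suggests a simple data model: an ordered list of steps, each either a URL to issue or a recorded click location to resolve against the current version of a page. A minimal sketch, with hypothetical names throughout (this is not the DONNA/Salticus code):

import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a recorded excursion: the first step is a URL,
 *  later steps are click locations expressed as structural paths. */
public class Excursion {
    /** One user interaction observed by the proxy. */
    record Step(String kind, String value) {}   // kind: "url" or "click"

    private final List<Step> steps = new ArrayList<>();

    public void recordUrl(String url)    { steps.add(new Step("url", url)); }
    public void recordClick(String path) { steps.add(new Step("click", path)); }

    /** Replay: reissue the first URL, then follow each recorded path on the
     *  current version of each page, as described in Section 3.4. */
    public void replay(PageFetcher fetcher) {
        String page = null;
        for (Step s : steps) {
            page = s.kind().equals("url")
                 ? fetcher.get(s.value())             // issue the starting URL
                 : fetcher.follow(page, s.value());   // resolve the path on the page
        }
    }

    /** Abstraction over HTTP fetching and path resolution (hypothetical). */
    interface PageFetcher {
        String get(String url);
        String follow(String currentPage, String path);
    }
}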

3.2 Structural generalization
There are several ways that one might generalize about a user's document gathering activity. One possibility is to generalize over the content of the links or documents downloaded, building a representation of the user's interests [3, 4]. For this approach to work, the crawler must either rely entirely on the text of the hyperlink to the document as evidence of its content, or it must download every document and analyze it to determine its value.

With the text associated with links not typically very informative, and the prospect of processing every document not practical in an interactive system, we chose a third way, which is to focus on the structure of documents. Salticus has a "Collect" mode, which enables the user to select items to collect on a page without navigating to them. As the user selects items to collect, Salticus builds an XPath [5] representation of each selected link, and it generalizes over these structural descriptions to predict what else on the page might be of interest.

We have found that very simple prefix/postfix generalization is sufficient for the tasks we are supporting. For example, suppose the user has selected three links with the following XPath representations:

HTML:BODY:TABLE[1]:TR[1]:TD[1]:A[1]
HTML:BODY:TABLE[1]:TR[2]:TD[1]:A[1]
HTML:BODY:TABLE[1]:TR[3]:TD[1]:A[1]

Salticus predicts that the links the user is interested in will be those that exhibit the same variation as the user's choices. It selects the largest common prefix of these paths and the largest common postfix, and replaces the varying part with a "don't care" pattern. In this case, the pattern becomes

HTML:BODY:TABLE[1]:TR[*]:TD[1]:A[1]

which is the first link in the first cell of every row of table 1. If the user instead had selected links in table 1 and in table 2, then the generalization would include both the table and the row.
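This generalization step is only a few lines of code. The sketch below implements a per-step variant of the prefix/postfix idea on the colon-delimited path notation used above; the names are invented, and it assumes all selected paths have the same number of steps:

import java.util.Arrays;
import java.util.List;

/** Prefix/postfix generalization over path steps (illustrative sketch). */
public class PathGeneralizer {
    /** Steps shared by all paths are kept; a varying step's index becomes a wildcard. */
    public static String generalize(List<String> paths) {
        List<String[]> steps = paths.stream().map(p -> p.split(":")).toList();
        String[] pattern = steps.get(0).clone();
        for (int i = 0; i < pattern.length; i++) {
            final int pos = i;
            boolean same = steps.stream().allMatch(s -> s[pos].equals(pattern[pos]));
            if (!same) {
                // Replace the varying step's index with a "don't care" wildcard.
                pattern[i] = pattern[i].replaceAll("\\[\\d+\\]", "[*]");
            }
        }
        return String.join(":", pattern);
    }

    public static void main(String[] args) {
        System.out.println(generalize(Arrays.asList(
            "HTML:BODY:TABLE[1]:TR[1]:TD[1]:A[1]",
            "HTML:BODY:TABLE[1]:TR[2]:TD[1]:A[1]",
            "HTML:BODY:TABLE[1]:TR[3]:TD[1]:A[1]")));
        // prints HTML:BODY:TABLE[1]:TR[*]:TD[1]:A[1]
    }
}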
3.3 Example
Suppose a user initiates a Salticus excursion and enters the address of KMWorld, an on-line publication in the area of knowledge management. Once at the site, the user enters the term "workflow" in the site's search box. The next page shows a list of 25 documents containing this term. The user selects "Collect" mode in Salticus and clicks on the first several links. At each click, the associated document is retrieved and collected by the proxy, but the original page is still displayed to the user. When the user then clicks on Salticus's "Predict" function, the system predicts that all of the documents should be collected; the user accepts the predictions, copying all of the files into the collection.

3.4 Automated operation
Once the user has performed this excursion, Salticus has a record of the steps required to return to the same "place" in the future. Note that with the widespread adoption of database-backed web sites and session-based technologies such as ASP and ColdFusion, the URL has ceased to be a useful identifier for internal pages within a site. A user trying to reissue a URL from a previous session will typically get redirected to the outermost part of the site to log in and walk through the application once more.

It is therefore not sufficient to record and replay the URLs that a user has visited. Instead, Salticus makes use of the XPaths recorded for each interaction. When replaying an excursion autonomously, Salticus starts by issuing a URL for the first interaction, but for every subsequent interaction it follows the XPath to the same location on the page where the user clicked in the original excursion. It is therefore possible for the user to send Salticus to collect "workflow"-related documents from KMWorld in the future, as long as the site is not redesigned.

4. FUTURE WORK
Salticus's pattern-based prediction of useful documents has worked well in our informal evaluation of the collecting patterns of intelligence analysts. However, it has some significant limitations. It cannot capture more complex patterns of collection, such as "every third link" or "all rows of only tables 2 and 4". We are investigating where our current model fails and how more complex predictions might be made.

The problem of automated revisitation brings to the fore the problem of identity: what constitutes a new page? Or a new version of an already-collected page? In the world of session-based URLs, no page will have the same URL that it did when previously visited, even if its contents are the same. We believe that we will be able to use our path-based representation as additional evidence in determining the novelty of documents.

We are also seeking to identify high-level search behaviors, such as the site search engine strategy in the workflow example above, as higher-level collection patterns for Salticus to recognize. This would enable the system to make predictions that go across sites and pages, and provide more possibilities for automation.

5. CONCLUSION
Salticus is a document acquisition agent that assists a user in building a personal digital library. Salticus tracks user browsing and document collection and generalizes over the user's collection actions. Salticus's path-based representation enables it to avoid the problems associated with URLs as identifiers.

6. ACKNOWLEDGMENTS
The DONNA project is sponsored by the FileNet Corporation, a leading provider of document management software, and by the University of California's DiMI program. Christie Delzeit and Melissa Hanson implemented Salticus's user interface.

7. REFERENCES
[1] Rasmusson, A., Olsson, T. and Hansen, P. 1998. A Virtual Community Library: SICS Digital Library Infrastructure Project. Research and Advanced Technology for Digital Libraries, ECDL'98. Lecture Notes in Computer Science, Vol. 1513, pp. 677-678. Springer Verlag.
[2] Heydon, A. and Najork, M. 1999. A scalable, extensible web crawler. World Wide Web, 2(4): 219-229, December 1999.
[3] Chakrabarti, S., van der Berg, M. and Dom, B. 1999. Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of WWW8.
[4] Miller, R. C. and Bharat, K. 1998. SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. Proceedings of WWW7, pp. 119-130, Brisbane, Australia, April 1998.
[5] World-Wide Web Consortium, 1999. XML Path Language (XPath) Version 1.0.

Power to the People: End-user Building of Digital Library Collections

Ian H. Witten, David Bainbridge, Stefan J. Boddie
Dept of Computer Science, The University of Waikato, Hamilton, New Zealand
+64 7 838 4246 / +64 7 838 4407 / +64 7 838 6038
[email protected], [email protected], [email protected]

ABSTRACT
Naturally, digital library systems focus principally on the reader: the consumer of the material that constitutes the library. In contrast, this paper describes an interface that makes it easy for people to build their own library collections. Collections may be built and served locally from the user's own web server, or (given appropriate permissions) remotely on a shared digital library host. End users can easily build new collections styled after existing ones from material on the Web or from their local files, or both, and collections can be updated and new ones brought on-line at any time. The interface, which is intended for non-professional end users, is modeled after widely used commercial software installation packages. Lest one quail at the prospect of end users building their own collections on a shared system, we also describe an interface for the administrative user who is responsible for maintaining a digital library installation.

1. INTRODUCTION
The Greenstone Digital Library Software from the New Zealand Digital Library (NZDL) project provides a new way of organizing information and making it available over the Internet. A collection of information typically comprises several thousand or several million documents, and a uniform interface is provided to all documents in a collection. A library may include many different collections, each organized differently, though there is a strong family resemblance in how they are presented.

Greenstone collections are widely used, with many of them publicly available on the Web. Some have also been written, in precisely the same form, to CD-ROMs that are widely distributed within developing countries (50,000 copies/year). Created on behalf of organizations such as UNESCO, the Pan-American Health Organization, the World Health Organization, and the United Nations University, they cover topics ranging from basic humanitarian needs through environmental concerns to disaster relief. Titles include the Food and Nutrition Library, World Environmental Library, Humanity Development Library, Medical and Health Library, Virtual Disaster Library, and the Collection on Critical Global Issues. Further details can be obtained from www.nzdl.org.

A recent enhancement to Greenstone is a facility for what we call "end-user collection building." It was provoked by our work on digital libraries in developing countries, and in particular by the observation that effective human development blossoms from empowerment rather than gifting. Disseminating information originating in the developed world, as the above-mentioned collections do, is very useful for developing countries. But a more effective strategy for sustained long-term development is to disseminate the capability to create information collections rather than the collections themselves [7]. This allows developing countries to participate actively in our information society rather than observing it from outside. It will stimulate the creation of new industry. And it will help ensure that intellectual property remains where it belongs: in the hands of those who produce it.

The end-user collection building facility, which we call the "Collector," is modeled after popular end-user installation software such as InstallShield (www.installshield.com). Frequently called a software "wizard" (a term we deprecate because of its appeal to mysticism and connotations of utter inexplicability), this interaction style suits novice users because it simplifies the choices and presents them very clearly.

The core Greenstone software addresses the needs of the reader; the Collector addresses the needs of people who want to build and distribute their own collections. A third class of user, vital in any multi-user Greenstone installation, is the system administrator, who is responsible for configuring the software to suit the local environment, enabling different classes of Greenstone user, setting appropriate file permissions, and so on. Greenstone includes an interface, not described in previous papers, through which the administrative user can check the status of the system, and alter it, interactively. Sensitive and flexible administrative support becomes essential when many end users are building collections.

This paper begins with a brief synopsis of the features of Greenstone. To some extent this section overlaps material presented at DL'00 [8], but many features are new and others extend what was previously reported. The remainder of the paper is completely new. We examine the new interactive interface for collection building, which will extend Greenstone's domain of application by encouraging end users to create their own digital library collections. The structure of a collection is determined by its "collection configuration file," and we briefly examine what can be specified in this file. Next we turn to the administrator's interface and describe the facilities it provides. Finally we discuss the design process and usability evaluation of the system.

94 128 its "collection configuration file," and we briefly examine what analogous scheme of "classifiers" is used: classifiers (also written can be specified in this file. Next we turn to the administrator's in PERL) create browsing indexes of various kinds based on interface and describe the facilities it provides. Finally we discuss metadata. the design process and usability evaluation of the system. Multiple-language documents. Unicode is used throughout the software, allowing any language to be processed in a consistent 2. THE GREENSTONE SOFTWARE manner, and searched properly. To date, collections have been To convey the breadth of coverage provided by Greenstone, we built containing French, Spanish, Maori, Chinese, Arabic and start with a brief overview of its facilities. More detail appears in English. On-the-fly conversion is used to convert from Unicode to [8]. an alphabet supported by the user's Web browser. A "language identification" plugin allows automatic identification of languages Accessible via Web browsers. Collections are accessed through a in multilingual collections, so that separate indexes can be built. standard web browser, such as Netscape or . The browser is used for both local and remote accesswhether Multiple-language user interface. The interface can be presented Greenstone is running on your own personal computer or on a in multiple languages. Currently it is available in French, German, remote central library server. Spanish, Portuguese, Maori, Chinese, Arabic and English. New languages can be added easily. Runs on Windows and Unix. Collections can be served on either Windows (3.1/3.11, 95/98, NT) or Unix (Linux and SunOS). Any Multimedia collections. Greenstone collections can contain text, of these systems serve Greenstone collections over the Internet pictures, audio and even video clips. Most non-textual material is using either an integrated built-in Web server (the "local library" either linked in to textual documents or accompanied by textual version of Greenstone) or an external servertypically Apache descriptions(rangingfromfigurecaptionstodescriptive (the "web library" version). paragraphs) to allow full-text searching and browsing. However, the architecture is general enough to permit implementation of Full-text and fielded search. Users can search the full text of the plugins and classifiers for non-textual data. documents, or choose between indexes built from different parts of the documents. Some collections have an index of full Classifiers allow hierarchical browsing. Hierarchical phrase and documents, an index of sections, an index of paragraphs, an index keyphrase indexes of text, or indeed any metadata, can be created of titles, and an index of authors, each of which can be searched using standard classifiers.Such interfaces are described by for particular words or phrases. Queries can be ranked or Boolean; Gutwin et al. [3] and Paynter et al. [5]. terms can be stemmed or unstemmed, case-folded or not. Designed for multi-gigabyte collections. Collections can contain Flexible browsing facilities. The user can browse lists of authors, millions of documents, making the Greenstone system suitable for lists of titles, lists of dates, hierarchical classification structures, collections up to several gigabytes. Compression is used to reduce and so on. Different collections offer different browsing facilities, the size of the indexes and text [6]. 
Small indexes have the added and even within a collection a broad variety of browsing bonus of faster retrieval. interfaces are available. Browsing and searching interfaces are constructed during the building process according to collection New collections appear dynamically. Collections can be updated configuration information. and new ones brought on-line at any time, without bringing the system down; the process responsible for the user interface will Creates access structures automatically. All collections are easy notice (through periodic polling) when new collections appear and to maintain. Searching and browsing structures are built directly add them to the list presented to the user. from the documents themselves: no links are inserted by hand. This means that if new documents in the same format become Collections can be published on CD-ROM. Greenstone collections available, they can be merged into the collection automatically. can be published, in precisely the same form, on a self-installing However, existing hypertext links in the original documents, CD-ROM. The interaction is identical to accessing the collection on leading both within and outside the collection, are preserved. the Web (Netscapeis provided on each disk)except that response times are faster and more predictable. For collections Makes use of available metadata. Metadata forms the raw larger than one CD-ROM, a multi CD-ROM solution has been material for browsing indexes: it may be associated with each implemented. document or with individual sections within documents. Metadata Distributed collections are supported. A flexible process structure mustbeprovidedexplicitly(ofteninanaccompanying spreadsheet)orderivableautomaticallyfromthesource allows different collections to be served by different computers, documents. The Dublin Core scheme is used, however, provision yet be presented to the user in the same way, on the same Web is made for extensions, and other schemes. page, as part of the same digital library. The Z39.50 protocol is also supported, both for accessing external servers and (under Plugins and classifiers extend the system's capabilities. "Plugins" development) for presenting Greenstone collections to external (small modules of PERL code) can be written to accommodate clients. new documenttypes.Existingpluginsprocessplaintext documents, HTML documents, Microsoft WORD, PDF, PostScript, What you see is what you get. The Greenstone Digital Library is and some proprietary formats. So collections can include different open-source software, available from the New Zealand Digital source document types, a pipeline of plugins is formed and each Library (nzdl.org) under the terms of the GNU General Public document passed down it;the first plugin to recognize the License. The software includes everything described above: Web document format processes it. Plugins are also used for generic serving, CD-ROMcreation,collectionbuilding,multi-lingual tasks such as recursively traversing directory structures containing capability, plugins and classifiers for a variety of different source documents. In order to build browsing indexes from metadata, an
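Greenstone's real plugins are Perl modules; the dispatch pattern itself, though, is language-neutral. Here is a minimal sketch of the idea, in Java for consistency with the other sketches in this document, with invented class names (it is not Greenstone code):

import java.util.List;

/** Sketch of a plugin pipeline: each document is offered to the plugins in
 *  order, and the first one that recognizes the format processes it. */
interface Plugin {
    boolean recognizes(String filename);
    String toXml(String filename);   // convert the document to the internal XML form
}

class HtmlPlugin implements Plugin {
    public boolean recognizes(String f) { return f.endsWith(".html") || f.endsWith(".htm"); }
    public String toXml(String f) { return "<doc source=\"" + f + "\" type=\"html\"/>"; }
}

class TextPlugin implements Plugin {
    public boolean recognizes(String f) { return f.endsWith(".txt"); }
    public String toXml(String f) { return "<doc source=\"" + f + "\" type=\"text\"/>"; }
}

public class Pipeline {
    private final List<Plugin> plugins = List.of(new HtmlPlugin(), new TextPlugin());

    /** Returns the XML form of the document, or null if no plugin recognizes it. */
    public String process(String filename) {
        for (Plugin p : plugins)
            if (p.recognizes(filename))
                return p.toXml(filename);
        System.err.println("Warning: no plugin for " + filename + "; omitted from collection");
        return null;
    }

    public static void main(String[] args) {
        Pipeline pipe = new Pipeline();
        System.out.println(pipe.process("report.html"));
        System.out.println(pipe.process("notes.txt"));
        pipe.process("image.png");   // warning: omitted, as described in the text
    }
}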

[Figure 1. Using the Collector to build a new collection (continued on the next pages): (a) the Collector's front page, with login; (b) the page outlining the sequence of steps.]

3. THE COLLECTOR
Our conception of digital libraries is captured by the following brief characterization [1]:

    A collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization and maintenance of the collection.

It is the last point, selection, organization and maintenance of the collection, that we address in this paper. Our view is that just as new books acquired by physical libraries are integrated with the existing catalog on the basis of their metadata, so one should easily be able to add to a digital library without having to edit its content in any way. Furthermore, we strive to do this without manual intervention. Once added, such material should immediately become a first-class component of the library. We accomplish this with a build/rebuild process that imports new documents into a library collection using XML to standardize representation, and uses explicitly stated metadata to update searching and browsing structures.

In Greenstone, the structure of a particular collection is determined when the collection is set up. This includes such things as the format, or formats, of the source documents, how they should be displayed on the screen, the source of metadata, what browsing facilities should be provided, what full-text search indexes should be provided, and how the search results should be displayed. Once the collection is in place, it is easy to add new documents to it, so long as they have the same format as the existing documents and the same metadata is provided in exactly the same way.

The scheme for collection building has the following basic functions:

- create a new collection with the same structure as an existing one;
- create a new collection with a different structure from existing ones;
- add new material to an existing collection;
- modify the structure of an existing collection;
- delete a collection; and
- write an existing collection to a self-contained, self-installing CD-ROM.

Figure 1 shows the Greenstone Collector being used to create a new collection, in this case from a set of HTML files stored locally. In Figure 1a, the user must first decide whether to work with an existing collection or build a new one. The former case covers the first two options above; the latter covers the remainder. While the example shows a collection being built from existing files, we emphasize that the Collector supports the creation of completely new collections formed from completely new information.

3.1 Logging in
Either way, it is necessary to log in before proceeding. Note that in general, users access the collection-building facility remotely and build the collection on the Greenstone server. Of course, we cannot allow arbitrary people to build collections (for reasons of propriety if nothing else), so Greenstone contains a security system which forces people who want to build collections to log in first. This allows a central system to offer a service to those wishing to build information collections and to use that server to make them available to others. Alternatively, a user who is running Greenstone on his or her own computer may build collections locally, but it is still necessary to log in, because other people who view these Web pages should not be allowed to build collections.

96 BEST COPY AVAILABLE TiO /1.raiLle EistdaYalm help fle Ed*late,,/go Veda+ Hole OO: tee HOMET HOME The collector',Triiiiii collector]

Collection information Source data

rvhon,meting a new collection you need to enter some pelioinevy information about the source ! You cm eitio. come t completely new collection, or . clone. on misting onetha is, Mg. leo data This process is elms. ed as a seri. of Web pages, overseen by The Cathedra. The bra at the, structure etymon...collodion on ono that mint already. bottom of dm pep shores you the sequence of pages to ba completed. I aCrests mem collodion Tilla for rellectim: ,:leat6 1 4 01,4141.4iy atit :till ,t4er 01, :elk ellen ao.y ,011.4411h..4.1docure...nos illgs.111,00,. pleet ler docenunto (.tu.. Amp. oo de.: elcouner.00 Lioael). Pollen 11.1.tory EXCerpt . . , rid oolladinn rig., it 4 aatt ;duo, .J0 o A thometiolii Ibe tt g. ii rClone 001edeg collection cdretIon lizneule ego 0 butude '..700 put er Edens* thelth:LC el r.rpcdo. n r4 .1.f' ;IV ''' of l I OCIOde MD Doveloprotot 1.1b1.7.` LI This iaitl..1,4an tam, opeceil3nt.t. [011,AI,, you war,t to dens lhe Mos bi.,nunr.4.. Centset maul adirms: rmUrrlint) rm. Im WI dly llig 34.nte lypt 43 those used to belhl tig 4,4rtn-as; .1r. Inp. =twice: jormet tehle ha. vaikato. sec . no Ti.c"-emme acnIzt 3 i epee-dies the fersi.pe.;;.;:e.....ts'earth7collIChen. lithe Onc.;;;JO.ibiNge I fll0n72: litioae/gedl/collectiothist/dnunload/onee.hpl . govt. cm I 001:d0 0 ppologr. 0 daptostec sepal is sent ts geo .03dre". Entre di mail odds: is inns hell fours e noeseEd=oin. rht4-737".T.FP!'l Alm Ws cellecdon:

This collection is an mtherpt ros daaOnSttaOn purposes , based on the Those olo eogy wt., re tho :oleo< dole is lac:cid ndie aro loot kill: orto,6e. Ponds, thistoxy P[Inarp Seethe*, collection.It consists or primary e e setae,. and lia0Clated Info:.0100 On 00nen'a history gathered Prom detotog nes, ... yrcur c =putt: Oyoltineotpeuto4 with .Iiir.,.° o. Leh sites around the aorta.The collection contains _about:I/undoes_ " e 0 .in edds..do b Ozieniez with "leop.i.o. for Giro to tL ..legogeoled g.nethe Web. documents. 0 en radical tiginnieduilh "Apr,. foo EA s to bo 000,14.00d:A uoing I'Ll gib: Mean, 1 raotoogi. . . 1 'his a asteroid ch,thilempthe pins:pits govenemg whit is included :nth. Pelee' dime! Ippon,oril she ben pap exhm telt coilledion is proiontoit In i1 . ono, the collodion wel millet, .II Mid .d'ilie lwee. type Lu Ulu opereit a rtiorddy. e, uny Litodorito tl coigne:fin nny dUncintio a they coulee; net

, 1nni roe opooify up to foul digdoin input pow'''. gy n. up, ci4 r filnneee., Ind direrledit, Your position in the e e guano. is indicated by en meow undenuathin this cesa thecollection ueilieed. infonnatird. stage. To proceed. click the peen 'source date button.

ado one cellos mon buttons. If you aro an advent ed us. you mey wash to adjust tho collection I> i rfarl aro n 0. SOL"." 1> ..:`,1,,n1.':::!._,,.%1 a..0.. Ir'7,'C4:.3111, tkaarL. cenEparation. Ahotnedively. go straight to th. buildeng stage. Remembet you ran Marlys revisit an A ocher stop by slicking its yellow button.

I> I ctr.170 rtP''. soduartc: o' ccooLngor: cobLtudionL).,','"',.,,, (c) .,-,,,

Dorm (6) . . . . Figure 1 (continued)

3.2 Dialog structure
Upon completion of login, the page in Figure 1b appears. This shows the sequence of steps involved in collection building. They are:

1. Collection information
2. Source data
3. Configuring the collection
4. Building the collection
5. Viewing the collection.

The first step is to specify the collection's name and associated information. The second is to say where the source data is to come from. The third is to adjust the configuration options, which requires considerable understanding of what is going on; it is really for advanced users only. The fourth step is where all the (computer's) work is done. During the "building" process the system makes all the indexes and gathers together any other information that is required to make the collection operate. The fifth step is to check out the collection that has been created.

These five steps are displayed as a linear sequence of gray buttons at the bottom of the screen in Figure 1b, and at the bottom of all other pages generated by the Collector. This display helps users keep track of where they are in the process. The button that should be clicked to continue the sequence is shown in green (collection information in Figure 1b); the gray buttons (all the others in Figure 1b) are inactive. The buttons change to yellow as the user proceeds through the sequence, and the user can return to an earlier step by clicking the corresponding yellow button in the diagram. This display is modeled after the "wizards" that are widely used in commercial software to guide users through the steps involved in installing new software.
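The button behavior described above amounts to a small state machine over the five steps. A hypothetical sketch (the real Collector is a web application; these names are invented):

/** Sketch of the Collector's five-step dialog state, per Section 3.2. */
public class CollectorDialog {
    enum Step { COLLECTION_INFORMATION, SOURCE_DATA, CONFIGURE_COLLECTION,
                BUILD_COLLECTION, VIEW_COLLECTION }
    enum Color { GREEN, YELLOW, GRAY }   // next step, completed step, not yet reachable

    private int current = 0;             // index into Step.values()

    /** Button color for each step: completed steps stay clickable (yellow). */
    Color colorOf(Step s) {
        int i = s.ordinal();
        if (i < current)  return Color.YELLOW;
        if (i == current) return Color.GREEN;
        return Color.GRAY;
    }

    void advance() { if (current < Step.values().length - 1) current++; }

    /** Returning to an earlier step is allowed only for completed (yellow) steps. */
    void revisit(Step s) { if (colorOf(s) == Color.YELLOW) current = s.ordinal(); }
}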

[Figure 1 (continued): (e) the collection configuration page; (f) the building stage in progress.]

3.3 Collection information
The next step in the sequence, collection information, is shown in Figure 1c. When creating a new collection, it is necessary to enter some information about it:

- title,
- contact email address, and
- brief description.

The collection title is a short phrase used throughout the digital library to identify the content of the collection: we have already mentioned such titles as Food and Nutrition Library, World Environmental Library, and so on. The email address specifies the first point of contact for any problems encountered with the collection. If the Greenstone software detects a problem, a diagnostic report is sent to this address. Finally, the brief description is a statement of the principles that govern what is included in the collection. It appears under the heading About this collection on the first page when the collection is presented.

Lesk [4] recommends that digital libraries articulate both the principles governing what is included and how the collection is organized. About this collection is designed to address the first point. The second is taken care of by the help text, which includes a list of access mechanisms that is automatically generated by the system based on the searching and browsing facilities included in the collection.

The user's current position in the collection-building sequence is indicated by an arrow that appears in the display at the bottom of each screen: in this case, as Figure 1c shows, the collection information stage. The user proceeds to Figure 1d by clicking the green source data button.

3.4 Source data
Figure 1d is the point where the user specifies the source text that comprises the collection. Either a new collection is created, or an existing one is "cloned." Creating a totally novel collection with a completely different structure from existing ones is a major undertaking, and is not what the interactive Collector interface is designed for. The most effective way to create a new collection is to base its structure on an existing one, that is, to clone it.

When cloning, the choice of current collections is displayed on a pull-down menu. Since there are usually many different collections (they can be downloaded from nzdl.org), there is a good chance that a suitable structure exists. It is preferable that the document file types in the new collection be among those catered for by the old one, that the same metadata be available, and that the metadata be specified in the same way; however, Greenstone is equipped with sensible defaults. For instance, if document files with an unexpected format are encountered, they will simply be omitted from the collection (with a warning message for each one). If the metadata needed for a particular browsing structure is unavailable for a particular document, that document will simply be omitted from the structure.

The alternative to cloning an existing collection is to create a completely new one. A bland collection configuration file is provided that accepts a wide range of different document types and generates a searchable index of the full text and an alphabetic title browser. Title metadata is available for many document types, such as HTML, email, and Microsoft WORD; note, however, that in the latter case it emanates from the system's "Summary" information for the file, which is frequently incorrect because many users ignore this Microsoft feature.

creator [email protected]
maintainer [email protected]
public true
beta true

indexes document:text
defaultindex document:text

plugin ZIPPlug
plugin GMLPlug
plugin TEXTPlug
plugin HTMLPlug file_is_url
plugin EMAILPlug
plugin ArcPlug
plugin RecPlug

classify AZList metadata=Title

collectionmeta collectionname "Women's History Excerpt"
collectionmeta collectionextra "This collection is an excerpt for demonstration purposes, based on the\ Women's History Primary Sources collection. It consists of primary\ sources and associated information on women's history gathered from\ Web sites around the world. The collection contains about _about:numdocs_\ documents"
collectionmeta .document:text "documents"

Figure 2. Configuration file for collection generated in Figure 1

a directory name on the Greenstone server system (beginning with "file://")

an address beginning with "http://" for files to be downloaded from the Web

an address beginning with "ftp://" for files to be downloaded using FTP.

In each case of "file://" or "ftp://" the collection will include all files in the specified directory, any directories it contains, any files and directories they contain, and so on. If instead of a directory a filename is specified, that file alone will be included. For "http://" the collection will mirror the specified Web site.

In the given example (Figure 1d) the new collection will contain documents taken from a local file system as well as a remote Web site, which will be mirrored during the building process, thus forming a new resource that is the composite of the two.

3.5 Configuring the collection

Figure 1e shows the next stage. The construction and presentation of all collections is controlled by specifications in a special collection configuration file (see below). Advanced users may use this page to alter the configuration settings. Most, however, will proceed directly to the final stage.

In the given example the user has made a small modification to the default configuration file by including the file_is_url flag with the HTML plugin. This flag causes URL metadata to be inserted in each document, based on the filename convention that is adopted by the mirroring package. This metadata is used in the collection to allow readers to refer to the original source material, rather than to a local copy.

3.6 Building the collection

Figure 1f shows the "building" stage. Up until now, the responses to the dialog have merely been recorded in a temporary file. The building stage is where the action takes place.

First, an internal name is chosen for the new collection, based on the title that has been supplied (and avoiding name clashes with existing collections). Then a directory structure is created for it that includes the necessary files to retrieve, index and present the source documents. To retrieve source documents already on the file system, a recursive file system copy command is issued; to retrieve offsite files a web mirroring package (we use wget3) is used to recursively copy the specified site along with any related image files.

Next, the documents are converted into XML. Appropriate plugins to perform this operation must be specified in the collection configuration file. This done, the copied files are deleted: the collection can always be rebuilt, or augmented and rebuilt, from the information stored in the XML files.

Then the full-text searching indexes, and the browsing structures, specified in the collection configuration file are created. Finally, assuming that the operation has been successful, the contents of the building process are moved to the area for active collections. This precaution ensures that if a version of this collection already exists, it continues to be served right up until the new one is ready. Use of global, persistent document identifiers ensures the changeover is almost always invisible to users.

The building stage is potentially very time-consuming. Small collections take a minute or so but large ones can take a day or more. The Web is not a supportive environment for this lengthy kind of activity. While the user can stop the building process immediately using the button in Figure 1f, there is no reliable way to prevent users from leaving the building page, and no way to detect if they do. In this case the Collector continues building the collection regardless and installs it when building terminates.
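The overall sequence is easy to picture as a small driver script. The following Python sketch is illustrative only: the helper names import_documents and build_indexes are hypothetical stand-ins for the plugin and indexing passes, and the wget flags shown are simply one plausible way to mirror a site as described above.

import shutil, subprocess, tempfile

def build_collection(name, sources, active_area):
    # Build in a temporary area so any existing version of the
    # collection keeps being served until the new one is ready.
    staging = tempfile.mkdtemp(prefix=name)
    for src in sources:
        if src.startswith("http://"):
            # Mirror the remote site along with related image files
            # (-r recurses, -p fetches page requisites such as images).
            subprocess.run(["wget", "-r", "-p", "-P", staging, src],
                           check=True)
        else:
            # Recursive file-system copy for local sources.
            shutil.copytree(src.removeprefix("file://"), staging,
                            dirs_exist_ok=True)
    import_documents(staging)  # hypothetical: plugins convert sources to XML
    build_indexes(staging)     # hypothetical: indexes and browsing structures
    # Install the finished build into the active area in one step.
    shutil.move(staging, active_area + "/" + name)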

3 See www.gnu.org

[Figure 3. The Greenstone Administration facility: (a) the entry page; (b) information about a collection]

Progress is displayed in the status area at the bottom part of Figure 1f, updated every five seconds. The message visible in Figure 1f indicates that when the snapshot was taken, Title metadata was being extracted from an input file. Warnings are written if input files or URLs are requested that do not exist, or exist but there is no plugin that can process them, or the plugin cannot find an associated file, such as an image file embedded in an HTML document. The intention is that the user will monitor progress by keeping this window open in their browser. If any errors cause the process to terminate, they are recorded in this status area.

3.7 Viewing the collection

When the collection is built and installed, the sequence of buttons visible at the bottom of Figures 1a-e appears at the bottom of Figure 1f, with the View collection button active. This takes the user directly to the newly built collection.

Finally, email is sent to the collection's contact email address, and to the system's administrator, whenever a collection is created (or modified). This allows those responsible to check when changes occur, and monitor what is happening on the system.

3.8 Working with existing collections

Four further facilities are provided when working with an existing collection: add new material, modify its structure, delete it, and write it to a self-contained, self-installing CD-ROM.

To add new material to an existing collection, the dialog structure illustrated in Figure 1 is used: entry is at the "source data" stage, Figure 1d. The new data that is specified is copied as before and converted to GML, joining any existing imported material. Revisions of old documents should perhaps replace them rather than being treated as entirely new. However, this is so difficult to determine that all new documents are added to the collection unless they are textually identical to existing ones. While an imperfect process, in practice the browsing structures are sufficiently clear to make it straightforward to ignore near-duplicates. Recall, the aim of the Collector is to support the most common tasks in a straightforward manner; more careful updating is possible through the command line.

To modify the structure of an existing collection essentially means to edit its configuration file. If this option is chosen, the dialog is entered at the "configuring the collection" stage in Figure 1e.

Deleting a collection simply requires a collection to be selected from a list, and its deletion confirmed. This is not as foolhardy as it might seem, for only collections that were built by the Collector can actually be removed; other collections (typically built by advanced users working from the command line) are not included in the selection list. It would be nice to be able to selectively delete material from a collection through the Collector, but this functionality does not yet exist. At present this must be done from the command line by inspecting the file system.

Finally, in order to write an existing collection to a self-contained, self-installing CD-ROM, the collection's name is specified and the necessary files are automatically massaged into a disk image in a standard directory.

4. THE COLLECTION CONFIGURATION FILE

Part of the collection configuration file for the collection built in Figure 1 is shown in Figure 1e; it appears in full in Figure 2. Since we are in the process of updating the collection configuration file format to support a wider variety of services, we will not embark on a detailed explanation of what each line means.

Some of the information in the file (e.g. the email address at the top, the collection name and description near the bottom) was gathered from the user during the Collector dialog. In essence this is "collection-level metadata" and we are studying existing standards for expressing such information. The indexes line builds a single index comprising the text of all the documents. The classify line builds an alphabetic classifier of the title metadata.

The list of plugins is designed to be reasonably permissive. For example, ZIPPlug will uncompress any Zipped files; because plugins operate in a pipeline the output of this decompression will be available to the other plugins. GMLPlug ensures that any documents previously imported into the collection, stored in an XML format, will be processed properly when the collection is rebuilt. TEXTPlug, HTMLPlug and EMAILPlug process documents of the appropriate types, identified by their file extension. RecPlug (for "recursive") expands subdirectories and pours their contents into the pipeline, ensuring that arbitrary directory hierarchies are traversed.

More indicative of Greenstone's power than the generic structure in Figure 2 is the ease with which other facilities can be added. To choose just ten examples (a configuration sketch follows this list):

- A full-text-searchable index of titles could be added with one addition to the indexes line.

- If authors' names were encoded in the Web pages using the HTML meta-name construct, a corresponding index of authors could also be added by augmenting the indexes line.

- With author metadata, an alphabetic author browser would require one additional classify line.

- WORD and/or PDF documents could be included by specifying the appropriate plugins.

- Language metadata could be inferred by specifying an "extract-language" option to each plugin.

- With language metadata present, a separate index could be built for document text in each language.

- Acronyms could be extracted from the text automatically [9] and a list of acronyms added.

- Keyphrases could be extracted from each document [2] and a keyphrase browser added.

- A phrase hierarchy could be extracted from the full text of the documents and made available for browsing [5].

- The format of any of these browsers, or of the documents themselves when they were displayed, or of the search results list, could all be altered by appropriate "format" statements.

Skilled users could add any of these features to the collection by making a small change to the panel in Figure 1e. However, we do not anticipate that casual users will operate at this level, and provision is made in the Collector to by-pass this editing step. More likely, someone who wants to build new collections of a certain type will arrange for an expert to construct a prototype collection with the desired structure, and proceed to clone that into further collections with the same structure but different material.
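Taking the configuration file of Figure 2 as a base, the first few of these additions might look as follows. This is a sketch only: the Title and Author index names and the WordPlug and PDFPlug plugin names follow Greenstone's conventions but are not taken from the figure.

indexes document:text document:Title document:Author
defaultindex document:text
classify AZList metadata=Title
classify AZList metadata=Author
plugin WordPlug
plugin PDFPlug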
5. SUPPORT FOR THE SYSTEM ADMINISTRATOR

An "administrative" facility is included with every Greenstone installation. The entry page, shown in Figure 3a, gives information about each of the collections offered by the system. Note that all collections are included, for there may be "private" ones that do not appear on the Greenstone home page. With each is given its short name, full name, whether it is publicly displayed, and whether or not it is running. Clicking a particular collection's abbreviation (the first column of links in Figure 3a) brings up information about that collection, gathered from its collection configuration file and from other internal structures created for that collection. If the collection is both public and running, clicking the collection's full name (the second link) takes you to the collection itself.

The collection we have just built has been named wohiex, for Women's History Excerpt, and is visible near the bottom of Figure 3a. Figure 3b shows the information that is displayed when this link is clicked. The first section gives some information from the configuration file, and the size of the collection (1000 documents, a million words, over 6 Mb). The next sections contain internal information related to the communication protocol through which collections are accessed. For example, the filter options for "QueryFilter" show the options and possible values that can be used when querying the collection.

The administrative facility also presents configuration information about the installation and allows it to be modified. It facilitates examination of the error logs that record internal errors, and the user logs that record usage. It enables a specified user (or users) to authorize others to build collections and add new material to existing ones. All these facilities are accessed interactively from the menu items at the left-hand side of Figure 3a.

5.1 Configuration files

There are two configuration files that control Greenstone's overall operation: the site configuration file gsdlsite.cfg, and the main configuration file main.cfg. The former is used to configure the Greenstone software for the site where it is installed. It is designed for keeping configuration options that are particular to a given site. Examples include the name of the directory where the Greenstone software is kept, the HTTP address of the Greenstone system, and whether the fastcgi facility is being used. The latter contains information that is common to the interface of all collections served by a Greenstone site. It includes the email address of the system maintainer, whether the status and collector pages are enabled, and whether cookies are used to identify users.

5.2 Logs

Three kinds of logs can be examined: usage logs, error logs and initialization logs. The last two are only really of interest to people maintaining the software. All user activity, every page that each user visits, can be recorded by the Greenstone software, though no personal names are included in the logs. Logging, disabled by default, is enabled by including an appropriate instruction in the main system configuration file.

Each line in the user log records a page visited, even the pages generated to inspect the log files! It contains (a) the IP address of the user's computer, (b) a timestamp in square brackets, (c) the CGI arguments in parentheses, and (d) the name of the user's browser (Netscape is called "Mozilla"). Here is a sample line, split and annotated for ease of reading:

/fast-cgi-bin/niupepalibrary
(a) its-www1.massey.ac.nz
(b) [950647983]
(c) (a=p, b=0, bcp=, beu=, c=niupepa, cc=, ccp=0, ccs=0, cl=, cm=, cq2=, d=, e=, er=, f=0, fc=1, gc=0, gg=text, gt=0, h=, h2=, hl=1, hp=, il=1, j=, j2=, k=1, ky=, l=en, m=50, n=, n2=, o=20, p=home, pw=, q=, q2=, r=1, s=0, sp=frameset, t=1, ua=, uan=, ug=, uma=listusers, umc=, umnpw1=, umnpw2=, umpw=, umug=, umun=, umus=, un=, us=invalid, v=0, w=w, x=0, z=130.123.128.4-950647871)
(d) "Mozilla/4.08 [en] (Win95; I; Nav)"

The last CGI argument, "z", is an identification code or "cookie" generated by the user's browser: it comprises the user's IP number followed by the timestamp when they first accessed the digital library.
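A few lines of code suffice to pull these records apart. The sketch below is ours, not an official Greenstone parser; it assumes the fields appear on one line in the order just described.

import re

# script host [timestamp] (cgi args) "browser"  -- one record per line
LOG_LINE = re.compile(r'^(\S+) (\S+) \[(\d+)\] \((.*)\) "(.*)"$')

def parse_log_line(line):
    script, host, stamp, args, browser = LOG_LINE.match(line).groups()
    # CGI arguments are comma-separated key=value pairs.
    cgi = dict(p.split("=", 1) for p in args.split(", ") if "=" in p)
    return {"script": script, "host": host, "time": int(stamp),
            "cgi": cgi, "browser": browser}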
5.3 User authentication

Greenstone incorporates an authentication scheme that can be employed to control access to certain facilities. It is used, for example, to restrict the people who are allowed to enter the Collector and certain administration functions. It also allows documents to be protected on an individual basis so that they can only be accessed by registered users on presentation of a password; however this is currently cumbersome to use and needs to be developed further. Authentication is done by requesting a user name and password as illustrated in Figure 1a.

From the administration page users can be listed, new ones added, and old ones deleted. The ability to do this is of course also protected: only users who have administrative privileges can add new users. It is also possible for each user to belong to different "groups". At present, the only extant groups are "administrator" and "colbuilder". Members of the first group can add and remove users, and change their groups. Members of the second can access the facilities described above to build new collections and alter (and delete) existing ones.

When Greenstone is installed, there is one user called admin who belongs to both groups. The password for this user is set during the installation process. This user can create new names and passwords for users who belong just to the colbuilder group, which is the recommended way of giving other users the ability to build collections.

User information is recorded in two databases that are placed in the Greenstone file structure. One contains all information relating to users. The other contains temporary keys that are created for each page access, which expire after half an hour. Thus inactive users must reauthenticate themselves.

5.4 Technical information

The links under the Technical information heading give access to more technical information on the installation, including the directories where things are stored.
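The key database behaves like an ordinary session table. A minimal sketch of the idea, assuming hypothetical names (issue_key, check_key) rather than Greenstone's own:

import secrets, time

KEY_LIFETIME = 30 * 60   # temporary keys expire after half an hour

keys = {}                # key -> (user name, time of issue)

def issue_key(user):
    # A fresh key is created for each page access.
    key = secrets.token_hex(16)
    keys[key] = (user, time.time())
    return key

def check_key(key):
    # Expired or unknown keys force the user to reauthenticate.
    entry = keys.pop(key, None)
    if entry is None or time.time() - entry[1] > KEY_LIFETIME:
        return None
    user, _ = entry
    return user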


6. USER EVALUATION

The Collector and administration pages have been produced and refined through a long period of iterative design and informal testing. The design underwent many revisions before reaching the version presented in this paper. The details were thrashed out over several meetings of our digital library group, some 20 or so individuals from a variety of disciplines including library science, the humanities, and, notably within computer science, the field of human computer interaction. It was here, for example, that the idea for the progress bar at the bottom of the page was formulated, and the very name of the tool, the Collector, was conceived. Once satisfied with its development, the tool was added into the public release of the Greenstone software.

Further feedback was obtained through the Greenstone mailing list, a general purpose listserver for Greenstone. In this both current issues and future developments are discussed, and users can gain technical assistance for particular problems. Filtering this source specifically for remarks about the Collector and administration pages revealed only technical questions, normally connected, despite extensive prior testing, with scripts performing incorrectly on a given version of a particular operating system. None of the questions fell into the category "how do I do this?" with the Collector. The technical questions indicate that the tools are being used, and the dearth of "how to" questions suggests that it is performing adequately.

Members of our group are presently conducting a usability study designed to establish more clearly how suitable the interface is for performing its intended tasks. Volunteers from a final year computer science class are observed performing set tasks, using both the tools described here and the original set of command line instructions. On completion of the tasks, which take about an hour, the users then fill out a questionnaire. The results from this work are not yet available.

The main difficulty of formal usability testing is that it is inevitably artificial. The tasks are artificial, the work environment is artificial, even the motivation of the users is artificial. Furthermore, the parameters of operation are far more tightly prescribed than in the real world. The question posed is how well the tool lets users perform the tasks it was designed for. But what if the user wants to step outside of the design parameters? On the one hand, this is an unfair question: the Collector is designed to simplify commonly executed tasks. But as users become more familiar with a tool they naturally expect more from it.

These questions are open ended, but vitally important. For their resolution we look to the open source nature of the Greenstone project. We will continue iteratively developing these tools based on feedback from field trials and user comments. We will also incorporate features added by others who actively develop the Greenstone code.

7. CONCLUSIONS

This paper has described the Collector, a tool integrated into the Greenstone environment that guides an end user through building and maintaining a collection, step by step. It also details the support included for the overall administration and maintenance of a Greenstone site by the system administrator, which is likewise integrated into the runtime system and gives the ability to view logs, create new users and control the access they have to, for example, the collection building tool.

Different users of digital libraries, naturally, have different needs. While access and retrieval is an obvious requirement, and dominates digital library research, we believe that end-user collection creation is another important element that deserves careful attention and further development. Including this capability in digital library systems will help them move away from the large and mostly static entities currently seen, and evolve into more dynamic and responsive environments.

8. REFERENCES

[1] Akscyn, R.M. and Witten, I.H. (1998) "Report on First Summit on International Cooperation on Digital Libraries." ks.com/idla-wp-oct98.

[2] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and Nevill-Manning, C. (1999) "Domain-specific keyphrase extraction." Proc Int Joint Conference on Artificial Intelligence, Stockholm, Sweden. San Francisco, CA: Morgan Kaufmann, pp. 668-673.

[3] Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C. and Frank, E. (1999) "Improving browsing in digital libraries with keyphrase indexes." Decision Support Systems 27(1/2): 81-104; November.

[4] Lesk, M. (1997) Practical digital libraries. San Francisco, CA: Morgan Kaufmann.

[5] Paynter, G.W., Witten, I.H., Cunningham, S.J. and Buchanan, G. (2000) "Scalable browsing for large collections: a case study." Proc Fifth ACM Conference on Digital Libraries, San Antonio, TX, pp. 215-223; June.

[6] Witten, I.H., Moffat, A. and Bell, T.C. (1999) Managing gigabytes: Compressing and indexing documents and images, second edition. San Francisco, CA: Morgan Kaufmann.

[7] Witten, I.H., Loots, M., Trujillo, M.F. and Bainbridge, D. (2000) "The promise of digital libraries in developing countries."

[8] Witten, I.H., McNab, R.J., Boddie, S.J. and Bainbridge, D. (2000) "Greenstone: A comprehensive open-source digital library software system." Proc Digital Libraries 2000, San Antonio, Texas, pp. 113-121.

[9] Yeates, S., Bainbridge, D. and Witten, I.H. (2000) "Using compression to identify acronyms in text." Proc Data Compression Conference, edited by J.A. Storer and M. Cohn. IEEE Press, Los Alamitos, CA, p. 582. (Full version available as Working Paper 00/1, Department of Computer Science, University of Waikato.)

Web-based Scholarship: Annotating the Digital Library

Bruce Rosenstock
Religious Studies
University of California
Davis, CA 95616, U.S.A.
[email protected]

Michael Gertz
Department of Computer Science
University of California
Davis, CA 95616, U.S.A.
[email protected]

ABSTRACT

The DL offers the possibility of collaborative scholarship, but the appropriate tools must be integrated within the DL to serve this purpose. We propose a Web-based tool to guide controlled data annotations that link items in the DL to a domain-specific ontology and which provide an effective means to query a data collection in an abstract and uniform fashion.

Categories and Subject Descriptors

H.5.3 [Group and Organization Interfaces]: Computer-supported cooperative work

Keywords

Data Annotations, Folk Literature DL

1. THE FOLK LITERATURE DL

The "Folk Literature of the Sephardic Jews" [1] digital library (DL) created at the University of California at Davis consists at present of about 1,000 (half the final total) XML files, conformant to the TEI (Text Encoding Initiative) DTD, and 500 audio files in MPEG3 format, averaging 20 MB each, representing 250 hours of various types of Judeo-Spanish oral literature: ballads, proverbs, wedding songs, memorates, and other genres, collected from dozens of informants over the past forty years. The XML files contain transcriptions of the audio files, with tagging elements dividing and structuring the spoken and sung portions of the material. The informants were Sephardic Jews (descendants of the Jews expelled from the Iberian Peninsula in 1492) who preserved their oral folk traditions over hundreds of years. The transmission of this traditional oral literature has, under the pressure of modern technological civilization, come to an end in this generation, and therefore this digital archive will provide a unique repository of an invaluable cultural treasure within the Jewish and the broader Hispanic world.

In the current project, access to the digital library will be Web-based and multi-tiered, with entry points for the general public and for specialists (medievalists, Romance philologists, folklorists, ethnomusicologists, and Jewish studies scholars). Users will be able to browse the archive for ballad types, informants, provenance and/or date of recording, and they will also be able to retrieve ballads through a search over the entire archive for specified words and phrases. Retrieved XML files are presented to the Web browser in HTML through server-side on-the-fly translation. Although much of the material has been studied and cataloged according to ballad types and motifs, there is a great deal of work remaining to be done. In order to permit not just access to the data set using standard keyword-based searches but the continued cataloging and enrichment of the data, we are designing an annotation tool to facilitate the creation of expert knowledge as metadata integrated within the logical structure of the digital library. This tool will be Web-based, allowing multiple users the selection of an HTML page or any node of its DOM structure as the target of an annotation.

2. CONTROLLED DATA ANNOTATIONS

Web-based annotation systems have been discussed and some implementations have been tested over the past six years since the 1995 W3C Workshop on the World Wide Web and Collaboration [6, 5]. These systems have understood an annotation to consist of unstructured and uncontrolled text which comments on the HTML page. We believe that unstructured textual annotation does not represent the appropriate method for creating new knowledge as an integrated component of the digital library, but that such annotations only add another layer of text on top of the existing data. Instead, we have adopted an approach to annotation that guides the user to employ an abstract conceptualization of the application domain (ontology, controlled vocabulary). In such a domain conceptualization, concepts specific to the application domain (here folk literature of the Sephardic Jews) are defined, including concept properties and relationships. Concepts can be understood as class definitions and thus provide templates of semantically rich metadata schemes that can be associated with the data in the library. For example, the concept "ballad" has certain properties (ballad name, metre, provenance) and certain well-defined relationships to other concepts, e.g., "motif", having motif name, actor-role, etc. as properties. Data in Web documents represent instances of concepts (as perceived by the user) and are assigned specific values for the properties.

There has been some previous work on ontology-based data annotation in the AI community (e.g., Ontobroker [3] or WebKB [4]).

In these works, the major focus is on using an ontology to reason about the data in order to improve data query schemes, and not on how to efficiently populate the ontology with data. More importantly, data annotation occurs through embedding links to concepts in Web documents. That is, document authors have to include annotations in their documents. Naturally, such an author-centric approach to annotation prevents the enrichment of documents by multiple experts having different views and interpretations of the data.

In contrast, our primary focus is on how an ontology can be used to guide controlled data annotations that link portions of a Web page to concepts in an ontology. A domain specific ontology will be built in a collaborative fashion, providing users with the means to embed their domain knowledge in form of concepts and concept relationships. Our approach is unique in that data annotations are external to Web documents [2]. Multiple users can annotate the same data with different concepts, depending on their specific view and interpretation of the data. Annotations are stored separately from documents and can be queried in combination with the ontology to retrieve data that has been annotated using concepts from the ontology. In what follows, we will outline the architecture and functionality of the annotation tool in context of the folk literature digital library.

3. COMPONENTS OF THE DL

Figure 1 shows the basic components of our digital library.

[Figure 1: Components of the DL. A Web browser hosting the data annotation and DL query tool communicates through a Web proxy with the ontology, the annotation database, and the XML files.]

At the backend, the XML files contain the encoded transcriptions of the informant's performance, as outlined in Section 1. Upon request from a client's Web browser, respective files are translated into HTML files on the fly. A simple keyword-based query interface allows retrieving documents that contain the specified keywords. In that respect, these components resemble standard components of a DL.

The ontology component contains the domain conceptualization. A Web-based interface allows registered users to add and modify concepts and concept relationships. In the current prototype, we use RDF schema as a means to represent respective information about concepts, concept properties and concept relationships. A simple query interface allows a user to search for specific concepts (based on terms) and to traverse through the ontology.

In order to allow users to annotate data, it is necessary to use a Web proxy to load XML (HTML) documents into a browser. Upon activating a link that is added to a Web page by the proxy, a data annotation tool appears at the client site. The user then can highlight a region (paragraph, sentence, or whole document) of the retrieved document and choose a concept from the ontology, indicating that the portion of the document is an instance of the selected concept. In the annotation tool, the user also specifies the properties of the selected concept instance. The properties that can be specified depend on the respective concept template described in the ontology. Internally, an annotation consists of a pointer (URI) to a concept in the ontology, a pointer (XPath expression) to the selected data in the (XML) document, and concept instance properties. Each annotation is recorded in the annotation database. Keeping annotations external to documents allows multiple users to annotate the same (portions of) documents using different concepts.
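Concretely, an annotation record of this kind might look as follows. This is a sketch of the description above, not the system's actual storage format; all names and values are illustrative.

# One annotation: a URI naming the ontology concept, an XPath locating
# the annotated region of the XML document, and the concept instance
# properties filled in by the annotator. Field names are ours.
annotation = {
    "concept": "http://example.org/folklit#Ballad",
    "document": "ballads/informant042.xml",
    "xpath": "/TEI.2/text/body/div[3]/lg[2]",
    "properties": {
        "ballad name": "La vuelta del marido",
        "metre": "polymetric",
        "provenance": "Tetuan, Morocco",
    },
    "annotator": "rosenstock",      # annotations can be filtered by user
    "created": "2001-03-15",        # ... and by creation date
}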
Given a sufficient amount of annotations as semantically rich metadata associated with documents, the ontology together with the annotation database can be used to retrieve documents that have been annotated with user-specified concepts. Respective documents are loaded into a user's Web browser through the Web proxy which embeds "anchors" to the selected concept(s). Users can view annotated documents and can traverse from annotations in documents to the ontology. View mechanisms can be specified on annotations, providing users only with annotations that have been made by, e.g., specific users, or that contain certain user-specified features, e.g., creation date or concept instance properties. Users can also browse the ontology and request documents that have been annotated with the selected concept. In this way, annotations provide a semantically rich layer of metadata on top of the original documents and thus facilitate better document retrieval schemes.

A prototypical data annotation tool utilizing a small (in terms of concepts) ontology has already been implemented. We are currently working on extending the tools to allow users to better manage and utilize the ontology. Other extensions include a registry in which users can register their own Web sites and pages and thus will be able to specify links among annotations between our digital library and their scholarly work.

4. REFERENCES

[1] S. G. Armistead, J. H. Silverman, and I. J. Katz. Folk Literature of the Sephardic Jews, 2: Judeo-Spanish Ballads from Oral Tradition, 1: Ballads. University of California Press, 1986.

[2] J.M. Bremer and M. Gertz. Web data indexing through external semantic-carrying annotations. In 11th IEEE Intl. Workshop on Research Issues in Data Engineering, IEEE Computer Society, 69-76, 2001.

[3] S. Decker, M. Erdmann, D. Fensel, and R. Studer. Ontobroker: Ontology based access to distributed and semi-structured information. In Semantic Issues in Multimedia Systems (DS-8), 351-369, Kluwer, 1999.

[4] J. Heflin and J. Hendler. Semantic interoperability on the Web. In Proc. of Extreme Markup Languages 2000, 111-120. Graphic Communications Association, 2000.

[5] M. Röscheisen, C. Mogensen, and T. Winograd. Beyond browsing: Shared comments, SOAPs, trails, and on-line communities. Technical Report STAN-CS-TR-97-1582, Stanford University, Computer Science Department, 1995.

[6] R. Wilensky and T. A. Phelps. Multivalent documents: A new model for digital documents. Technical Report CSD-98-999, University of California, Computer Science Department, 1998.

A Multi-View Intelligent Editor for Digital Video Libraries

Brad A. Myers, Juan P. Casares, Scott Stevens, Laura Dabbish, Dan Yocum, Albert Corbett
Human Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]
http://www.cs.cmu.edu/silver

ABSTRACT

Silver is an authoring tool that aims to allow novice users to edit digital video. The goal is to make editing of digital video as easy as text editing. Silver provides multiple coordinated views, including project, source, outline, subject, storyboard, textual transcript and timeline views. Selections and edits in any view are synchronized with all other views. A variety of recognition algorithms are applied to the video and audio content and then are used to aid in the editing tasks. The Informedia Digital Library supplies the recognition algorithms and metadata used to support intelligent editing, and Informedia also provides search and a repository. The metadata includes shot boundaries and a time-synchronized transcript, which are used to support intelligent selection and intelligent cut/copy/paste.

Keywords

Digital video editing, multimedia authoring, video library, Silver, Informedia.

1. INTRODUCTION

Digital video is becoming increasingly ubiquitous. Most camcorders today are digital, and computers are being advertised based on their video editing capabilities. For example, Apple claims that you can "turn your DV iMac into a personal movie studio" [1]. There is an increasing amount of video material available on the World-Wide-Web and in digital libraries. Many exciting research projects are investigating how to search, visualize, and summarize digital video, but there is little work on new ways to support the use of the video beyond just playing it. In fact, editing video is significantly harder than editing textual material.
To construct a report or a new composition using video found in a digital library or using newly shot video requires considerably more effort and time than creating a similar report or composition using quoted or newly authored text.

In the Silver project, we are working to address this problem by bringing to video editing many of the capabilities long available in textual editors such as Microsoft Word. We are also trying to alleviate some of the special problems of video editing. In particular, the Silver video editor provides multiple views, it uses the familiar interaction techniques from text editing, and it provides intelligent techniques to make selection and editing easier.

The Silver editor is designed to support all phases of the video post-production process. The storyboard and script views support brainstorming and planning for the video. The project, source, subject and outline views support the collection and organization of the source material. The timeline and script views support the detailed editing, and the preview view can be used to show the result.

Silver is an acronym and stands for Simplifying Interactive Layout and Video Editing and Reuse. The key innovations in the Silver editor include: providing a transcript view for the actual audio; multiple views with coordinated selections, including the ability to show when one view only contains part of the selection; intelligent context-dependent expansion of the selection for double-clicking; and intelligent cut/copy/paste across the video and audio. These are discussed in this article.

2. STATE OF THE ART

Most tools for editing video still resemble analog professional video editing consoles. Although they support the creation of high quality material, they are not easy for the casual user, especially when compared with applications such as text editors. Current video editing software only operates at a low syntactic level, manipulating video as a sequence of frames and streams of uninterpreted audio. It does not take advantage of the content or structure of the video or audio to assist in the editing. Instead, users are required to pinpoint specific frames, which may involve zooming and numerous repetitions of fast-forward and rewind operations. In text editing, the user can search and select the content by letters, words, lines, sentences, paragraphs, sections, etc. In today's video and audio editing, the only units are low-level frames or fractions of seconds.

As another example, three-point editing is one option in professional editors such as Adobe Premiere. In this kind of editing, the user can locate an "in point" and an "out point" in the video source and a third point in the target to perform a copy operation. The fourth point for the edit is computed based on the length of the in-out pair. This method can be traced to the use of physical videotape, where the length of the output and input segments must be the same. However, three-point editing is not required for digital video and is very different from the conventional cut/copy/paste/delete technique that is used in other editing tools on computers.
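The arithmetic is trivial: the unspecified fourth point is implied by the other three. A sketch in frame-number terms (the function name is ours):

def fourth_point(source_in, source_out, target_in):
    # The target "out" point is implied by the other three points:
    # the copied segment keeps the length of the source in-out pair,
    # a holdover from physical tape where both lengths had to match.
    return target_in + (source_out - source_in)

# Copying source frames 120..300 to target frame 450 ends at frame 630.
assert fourth_point(120, 300, 450) == 630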

[Figure 1. An overview of all of the Silver windows.]

3. RELATED WORK

There is a large body of work on the extraction and visualization of information from digital video (e.g., [29][12]). However, most of this work has focused on automatic content extraction and summarization during library creation and searching, and on information presentation during library exploration. In the Silver project, we focus on authoring with the content once it is found.

After examining the role of digital video in interactive multimedia applications, Mackay and Davenport realized that video could be an information stream that can be tagged, edited, analyzed and annotated [19]. Davenport et al. proposed using metadata for home-movie editing assistance [10]. However, they assumed this data would be obtained through manual logging or with a "data camera" during filming, unlike the automatic techniques used in Silver.

The Zodiac system [5] employs a branching edit history to organize and navigate design alternatives. It also uses this abstraction to automatically detect shot and scene boundaries and to support the annotation of moving objects in the video. IMPACT [33] uses automatic cut detection and camera motion classification to create a high level description of the structure of the video, and then visualizes and edits the structure using timeline and tree structure views [32]. IMPACT also detects object boundaries and can recognize identical objects in different shots.

The Hierarchical Video Magnifier [24] allows users to work with a video source at fine levels of detail while maintaining an awareness of the context. It provides a timeline to represent the total duration of the video source, and supplies the user with a series of low-resolution frame samples. There is also a tool that can be used to expand or reduce the effective temporal resolution of any portion of the timelines. Successive applications of the temporal magnifier create an explicit spatial hierarchical structure of the video source. The Swim Hierarchical Browser [35] improves on this idea by using automatically detected shots in the higher level layers. These tools were only used for top-down navigation, and not for editing. We use a similar approach in our Timeline view.


The Video Retrieval and Sequencing System [8] semiautomatically detects and annotates shots for later retrieval. Then, a cinematic rule-based editing tool sequences the retrieved shots for presentation within a specified time constraint. For example, the parallel rule alternates two different sets of shots and the rhythm rule selects longer shots for a slow rhythm and shorter shots for a fast one.

Most video segmentation algorithms work bottom-up, from the pixels in individual frames. Hampapur [15] proposes a top-down approach, modeling video-editing techniques mathematically to detect cuts, fades and translation effects between shots.

Video Mosaic [20] is an augmented reality system that allows video producers to use paper storyboards as a means of controlling and editing digital video. Silver's storyboards are more powerful since they can also be used for interactive videos. CVEPS [23] automatically extracts key visual features and uses them for browsing, searching and editing. An important contribution of this system is that it works in the compressed domain (MPEG), which has advantages in terms of storage, speed and noiseless editing.

VideoScheme [22] is a direct manipulation video editing system that provides the user with programming capabilities. This increases the flexibility and expressiveness of the system, for example supporting repetitive or conditional operations. VideoMAP [31] indexes video through a variety of image processing techniques, including histograms and "x-rays" (edge pixel counts). The resulting indices can be used to detect cuts and camera operations and to create visualizations of the video. For example, VideoMAP renders the indices over time, and VideoSpaceIcon represents the temporal and spatial characteristics of a shot as an icon.

The Hitchcock system [14] automatically determines the "suitability" of the different segments in raw video, based on camera motion, brightness and duration. Similar clips are grouped into "piles." To create a custom video, the user drags segments into a storyboard, specifies a total desired duration, and Hitchcock automatically selects the start and end points of each clip based on shot quality and total duration. Clips in the storyboard are represented with frames that can be arranged in different layouts, such as a "comic book" style layout [2]. We plan to incorporate similar techniques into Silver.

4. INFORMEDIA

We obtain our source video and metadata through CMU's Informedia Digital Video Library [34]. The Informedia project is building a searchable multimedia library that currently has over 2,000 hours of material, including documentaries and news broadcasts. Informedia adds about two hours of additional news material every day. For all of its video content, Informedia creates a textual transcript of the audio track using closed-captioning information and speech recognition [7]. The transcript is time-aligned with the video using CMU's Sphinx speech recognition system [27]. Informedia also performs image analysis to detect shot boundaries, extracting representative thumbnail images from each shot [6], and detects and identifies faces in the video frame. A video OCR system identifies and recognizes captions in the image [28]. Certain kinds of camera movements such as pans and fades can also be identified. All of this metadata about the video is stored in a database. This metadata is used by Informedia to automatically create titles, representative frames and summaries for video clips, and to provide searching for query terms and visualization of the results. Silver takes advantage of the metadata in the database to enhance its editing capabilities.
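The derived metadata can be pictured as a few simple record types keyed by video and time. The layout below is ours for illustration only; the paper does not describe Informedia's actual database schema.

# Illustrative records for three of the metadata streams listed above.
shot = {"video": "news-19990728.mpg",
        "start_frame": 1204, "end_frame": 1577,
        "thumbnail": "news-19990728-shot41.jpg"}

word = {"video": "news-19990728.mpg",        # time-aligned transcript entry
        "word": "Kosovo",
        "start_ms": 83120, "end_ms": 83655}

caption = {"video": "news-19990728.mpg",     # video-OCR result
           "frame": 1450, "text": "Pentagon Officials"}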
5. TYPES OF PRODUCTIONS

The Silver video editor is designed to support different kinds of productions. Our primary goal is to make it easier for middle and high school children (ages 10-18) to create multimedia reports on a particular topic. For example, a social-studies teacher might have students create a report on Kosovo. We want to make it as easy for students to create such a video report using material in the Informedia library as it would be to use textual and static graphical material from newspaper and magazine articles.

We also want Silver to support original compositions of two types. First, people might just shoot some video with a camcorder, and then later want to edit it into a production. In the second type, there might first be a script and even a set of storyboards, and then video is shot to match the script. In these two cases where new material is shot, we anticipate that the material will be processed by Informedia to supply the metadata that Silver needs. (Unfortunately, Informedia does not yet give users the capability to process their own video, but Silver is being designed so that we are ready when it does.)

Finally, in the future, we plan for Silver to support interactive video and other multi-media productions. With an interactive video, the user can determine what happens next, rather than just playing it from beginning to end. For example, clicking on hot spots in "living books" type stories may choose which video is played next. Another type of production is exemplified by the DVD video "What could you do?" [11]. Here, a video segment is played to set up a situation, then the user is asked a question and depending on the user's answer, a different video piece is selected to be played next.

6. MULTIPLE VIEWS

In order to support many different kinds and styles of editing, it is useful to have different views of the material. For example, the Microsoft Word text editor supplies an outline view, a "normal" view that is good for editing, a "print layout" view that more closely shows what the composition will look like, and a "print preview" view that tries to be exactly like the printed output.


Similarly, PowerPoint provides outline, normal, notes, slide sorter, and slide show views. However, video editors have many fewer options. Adobe Premiere's principal view is a frame representation, with a thumbnail image representing one or more video frames. Premiere also provides a Project window, a Timeline window, and a Monitor window (to playback video). MGI's VideoWave III editor has a Library window (similar to the project view), a StoryLine window (a simplified timeline), and a View Screen window for playback. Apple's Final Cut Pro provides a Browser window (like a project view that allows clip organization), Timeline view, Canvas (editing palette), and Viewer window for playback.

Silver currently provides nine different views: the Informedia search and search results, project, source, subject, outline, storyboard, transcript, timeline and preview views. These are described below. Each appears in its own window, which the user or system can arrange or hide if not needed (Figure 1 shows an overview of the full screen).

6.1 Search Results View

When starting from a search using Informedia, the search results will appear in an Informedia search results window (see Figure 2). Informedia identifies clips of video that are relevant to the search term, and shows each clip with a representative frame. Each clip will typically contain many different scenes, and the length of a clip varies from around 40 seconds to four minutes. The user can get longer segments of video around a clip by moving up a level to the full video. If the user makes a new query, then the search results window will be erased and the new results will appear instead. Clicking on the thumbnail image will show the clip using the Windows Media Player window. Clicking on the filmstrip icon displays thumbnail images from each shot in the scene, giving a static, but holistic view of the entire clip.

[Figure 2. Search and search results views from Informedia.]

6.2 Source and Project Views
When the user finds appropriate video by searching in Informedia, the clips can be dragged and dropped into Silver's Project View (actually, a clip can be dragged directly to any other view, and it will be automatically added to the project view). The Project View (see Figure 3) will also allow other video, audio, and still pictures from the disk or the World Wide Web to be loaded and made easily available for use in the production. As in other video editors such as Premiere, the project view is a "staging area" where the video to be used in a production can be kept. However, in the Silver project view, video clips that are in the composition are separated by a line from clips that are not currently in use. Dragging a clip across this line adds or removes it from the composition.

The source view is like a simplified project view and allows easy access to the original sources of video. Clips in this view represent the video as brought in from Informedia, from a file, or from the web. They cannot be edited, but they may be dragged to other windows to add a copy of the source to the project.

Figure 3. Silver's project view collects the source material to be used in the production. The first clip is shown on the screen outlined in light yellow to indicate it is partially selected.

6.3 Transcript View
An important innovation in the Silver video editor, enabled by Informedia, is the provision of a textual transcript of the video. This is displayed in a conventional text-editor-like window (see Figure 4). Informedia generates the transcript from various sources [16]. If the video provides closed captioning, then this is used. Speech recognition is used to recognize other speech, and also to align the transcript with the place in the audio track where each word appears [7]. Because the recognition can contain mistakes, Silver inserts green "*"s where there appear to be gaps, misalignments, or silence. In the future, we plan to allow the user to correct the transcript by typing the correct words, and then use the speech recognizer to match the timing of the words to the audio track.

The transcript view and the timeline view (Section 6.4) are the main ways to specify the actual video segments that go into the composition. In the transcript view, the boundary between segments is shown as a blue double bar. The transcript and timeline views are used to find the desired portion of each clip. Transcripts will also be useful in supporting an easy way to search the video for specific content words.

We also plan to use the transcript view to support the authoring of new productions from scripts. The user could type or import a new script, and then later the system would automatically match the script to the audio as it is shot.

Figure 4. The Transcript view shows the text of the audio. The "*"s represent unrecognized portions of the audio, and the "||"s represent breaks. The reverse-video part at the top is selected. Words in italics are partially selected (here "splashing"), and words in gray are partially cut out in the current production.
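A minimal sketch of the gap-marking step follows: given per-word timings from the recognizer, a "*" is emitted wherever adjacent words are separated by more than a threshold. The triple representation and the 0.75-second threshold are assumptions for illustration, not Silver's actual values.

    def mark_transcript_gaps(words, gap_threshold=0.75):
        # words: (word, start_sec, end_sec) triples from the recognizer
        tokens, prev_end = [], None
        for word, start, end in words:
            if prev_end is not None and start - prev_end > gap_threshold:
                tokens.append("*")    # gap, misalignment, or silence
            tokens.append(word)
            prev_end = end
        return " ".join(tokens)

    # mark_transcript_gaps([("Keep", 0.0, 0.4), ("cool", 0.5, 0.9),
    #                       ("Water", 3.1, 3.6)])  ->  'Keep cool * Water'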

6.4 Timeline View
In Silver, like many other video editors such as Premiere, the Timeline view is the main view used for detailed editing. In order to make the Timeline view more useful and easier to use, we are investigating some novel formats. As shown in Figure 5, we are currently providing a three-level view. This allows the user to work at a high level of detail without losing the context within the composition.

The top level always represents the entire video. The topmost row displays the clips in the composition and their boundaries. For each clip, it shows a representative frame and, if there is enough space, the end frame, title, and duration. Below the top row are the time codes. At the bottom of the top level is an indicator showing what portions are being viewed in the bottom two levels. The purple portion is visible in the middle level, and the cyan portion is visible in the bottom level.

The middle level displays the individual shots, as detected by Informedia. Shot boundaries are detected by a change in the video [17]. Each shot is visualized using the representative frame for the shot as chosen by Informedia. The size of the frame is proportional to the duration of the shot.

The bottom level can display the individual frames of the video, so the user can quickly get to particular cut points. The middle row of the bottom level represents the transcript. The bottom level also provides the ability to add annotations or comments to the video (see Figure 6).

A key feature of the Silver timeline is that it allows different representations to be shown together, allowing the user to see the clip boundaries, the time, samples of frames, the cuts, the transcript, annotations, etc. Later, using facilities already provided by Informedia, we can add labels for recognized faces and the waveform for the audio to the timeline. Snapping and double-click selection will continue to be specific to each type of content.

Figure 5. The Timeline view showing the three levels. The portion from about 2:2 to about 2:13 is selected, and appears on the screen in yellow in all three levels (and appears as light gray when printed in black-and-white).

The user can pick what portion is shown in the middle level by dragging the indicator at the bottom of the top level. Alternatively, the scroll buttons at the edges of the middle level cause the viewed section to shift left and right. The scale of the video that is shown in the middle level can be adjusted by changing the size of the indicator at the bottom of the top level, by dragging on the edge of the indicator. This will zoom the middle level in and out. Similarly, the bottom level can be scrolled and zoomed using the indicator at the bottom of the middle level or the bottom's scroll arrows.

Figure 6. The bottom row of the Timeline view can show the user's annotations (shown in red).
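The indicator-based scrolling and zooming can be expressed as a simple mapping from the indicator's position and width (as fractions of the full timeline) to the time range shown in the level below. This is a sketch of the mapping only, not Silver's code; the function name and fractional representation are assumptions.

    def visible_range(total_duration, indicator_pos, indicator_width):
        """indicator_pos and indicator_width are fractions of the full
        timeline width; dragging the indicator scrolls, resizing zooms."""
        start = indicator_pos * total_duration
        return start, min(start + indicator_width * total_duration,
                          total_duration)

    # A half-width indicator starting 25% of the way into a 4-minute
    # video: visible_range(240, 0.25, 0.5) -> (60.0, 180.0)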

The toolbar buttons at the top of the Timeline view window perform editing and playback operations (from left to right in Figure 5): cut, copy, paste, delete, crop (deletes everything but the selection), split (splices the clip at the selection edges), add annotation, play selection, and play the entire video.

Lee, Smeaton, et al. [18] propose a taxonomy of video browsers based on three dimensions: the number of "layers" of abstraction and how they are related; the provision or omission of temporal information (varying from full timestamp information to nothing at all); and the visualization of spatial versus temporal aspects of the video (a slideshow is highly temporal, a timeline highly spatial). They recommend using many linked layers, providing temporal and absolute temporal information, and a spatial visualization. Our timeline follows their recommendations.

The Hierarchical Video Magnifier [24] and Swim [35] also provide multi-level views. These systems are designed to browse video, and navigation is achieved by drilling to higher levels of detail. The goal in Silver is to edit video, and the basic interaction for navigating in the timeline is scrolling. Also, Silver is different in using multiple representations of the video within each level.

6.5 Preview View
As the user moves the cursor through the timeline, Silver displays a dotted line (visible to the left of 2:5 in all three levels of Figure 5). The frame at this point is shown in the preview view (Figure 7). If the user moves the cursor too fast, the preview view will catch up when the user stops or slows down sufficiently. The play arrows at the top of the timeline view cause the current video to be played in the preview view. If the black arrow is selected, the video is played in its entirety (but the user can always stop the playback). If the yellow play button is picked, only the selected portion of the video is played. The preview window is implemented using the Windows Media Player.

Figure 7. The Preview view shows the frame at the cursor (the dotted lines in Figure 5), and is where the video is played.

6.6 Subject View
When creating a composition, different people have different ways of organizing their material. Some might like to group the material by topic. Silver's Subject View facilitates this type of organization. It provides a tabbed dialog box into which the material can be dragged and dropped from the project view. The user is free to label the tabs in any way that is useful, for example by the content of the clip, the type of shot, the date, etc. The subject view (Figure 8) will allow the same clip to be entered multiple times, which will help users to more easily find material, since it might be classified in multiple ways.

Figure 8. Silver's Subject View allows users to organize their material by topic, type, date, etc.

6.7 Outline View
When creating a composition, one good way to organize the material is in an outline. Whereas textual editing programs, such as Microsoft Word, have had outlining capabilities for years, none of the video editors have an outline view. Silver's outline view (shown in Figure 9) uses a conventional Windows tree control, which allows the hierarchy to be easily edited using familiar interaction techniques such as drag-and-drop. Note that for the subject view and the outline view, the subjects (or folders) can be added before there are any clips to put in them, to help guide the process and serve as reminders of what is left to do.

Figure 9. Silver's Outline View organizes the material using a Windows tree control.

6.8 Storyboard View
Many video and cinema projects start with a storyboard drawing of the composition, often drawn on paper. Typically, a picture in the storyboard represents each of the major scenes or sequences of the composition. Some video editors, notably MGI's VideoWave III, use a storyboard-like view as the main representation of the composition. Silver's storyboard view (see Figure 10) differs from VideoWave in that it can be used before the clips are found, as a representation of the desired video. Stills or even hand-drawn pictures can be used as placeholders in the storyboard for video to be shot or found later. The frames in the storyboard can be hand-placed in two dimensions by the user (and commands will help to visually organize them), which supports organizations that are meaningful to the user. For example, some productions are told using "parallel editing" [3], cutting between two different stories occurring at the same time (for example, most Star Wars movies cut repeatedly between the story on a planet and the story in space). These might be represented in the storyboard by two parallel tracks.

Another important use for storyboards will be interactive video compositions (which Silver is planned to support in the future). Some multimedia productions allow the user to interact with the story using various methods to pick which video segment comes next. For example, a question might be asked, or the user might click on various hot spots. Our storyboard view allows multiple arrows out of a clip, and we plan to support a "natural" scripting language [26] and demonstrational techniques [25] that will make it easy to specify how to choose which segment to play next based on the end user's input.

Figure 10. The Storyboard view has segments placed in 2-D.

6.9 Other Views
In the future, we plan to add support for many other views, all inter-linked. For example, if the transcript window is used to hold an authored script, then it will be important to include "director's notes" and other annotations. These might be linked to views that help manage lists of locations, people, scenery, and to-do items. The ability to add notes, comments, annotations, and WWW links in all other views might also be useful. Other facilities from text documents might also be brought into the Silver editor, such as the ability to compare versions and keep track of changes (as a revision history).

6.10 Selection Across Multiple Views
When the user selects a portion of the video in one view in Silver, the equivalent portion is highlighted in all other views. This brings up a number of interesting user interface design challenges.

The first problem is what the "equivalent" portion is. The different views show different levels of granularity, so it may not be possible to represent the selection accurately in some views. For example, if a few frames are selected in the timeline view, what should be shown in the project view, since it only shows complete clips? Silver's design is to use dark yellow to highlight the selection, but to use light yellow to highlight an item that is only partially selected. In Figure 3, the first clip has only part of its contents selected, so it is shown in light yellow. If the user selects a clip in the project view, then all video that is derived from that clip is selected in all other views (which may result in discontinuous selections).

A similar problem arises between the timeline and transcript views. A particular word in the audio may span multiple frames, so selecting a word in the transcript will select all the corresponding frames. But selecting only one of those frames in the video may correspond to only part of a word, so the highlight in the transcript shows this by making that word italic. This is the case for the words "one" and "splashing" shown at the edges of the selected text in Figure 4. (We would prefer the selection to be yellow and the partially selected words light yellow, to be consistent with other views, but the Visual Basic text component does not support this.) If the selected video is moved, this will cut the word in two pieces. Silver represents this by repeating the word in both places, but showing it in a different font color. The video in Figure 4 was split somewhere during the word "Weather"; thus, this word is shown twice, separated by the clip boundary.
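One way to compute the transcript highlight for a frame-range selection is sketched below. The timed-word representation and function name are hypothetical, but the full/partial/none distinction mirrors the highlight/italic treatment just described.

    def word_selection_states(words, sel_start, sel_end):
        # words: (word, start_sec, end_sec); selection bounds in seconds
        states = []
        for word, start, end in words:
            if start >= sel_start and end <= sel_end:
                states.append((word, "full"))     # normal highlight
            elif end <= sel_start or start >= sel_end:
                states.append((word, "none"))
            else:
                states.append((word, "partial"))  # shown in italics
        return states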

7. INTELLIGENT EDITING
One reason that video editing is so much more tedious than text editing is that in video, the user must select and operate on a frame-by-frame basis. Simply moving a section of video may require many minutes while the beginning and end points are laboriously located. Often a segment in the video does not exactly match the corresponding audio. For example, the voice can start before the talking head is shown. This gives the video a continuous, seamless feel, but makes extracting pieces much harder, because the video and audio portions must often be separately adjusted, with much fiddling to remove extraneous video or audio portions.

We did an informal study to check on the prevalence of these issues, and found them to indeed be significant, at least in recorded news. In looking at 238 transitions in 72 minutes of video clips recorded from CNN by Informedia, 61 (26%) were "L-cuts," where the audio and video come in or stop at different times. Of the 61 L-cuts, 44 (72%) were cases where the audio of the speaker came in before the video but ended at the same time (see Figure 11, Shot A). Most of these were interviews where the voice of the person being interviewed would come in first, and then the video of the person after a few seconds. Most of the other ways to overlap the video and audio were also represented. Sometimes the audio and the video of the speaker came in at the same time, but the audio continued past the video (Figure 11, Shot B), or the video ended after the audio (Shot D). When an interviewee was being introduced while he appeared on screen, the video might come in before the audio, but both end at the same time (Shot C). To copy or delete any of these would require many steps in other editors.

Figure 11. It is very common for the video and audio of a shot not to start and/or end at the same time. For example, Shot B represents a situation where the audio for the shot continues for a little while past when the video has already switched to the next scene. Similarly, for Shot D, the video continues a bit past when the audio has already switched to the next scene.

The Silver editor aims to remove much of the tedium associated with editing such video by automatically adjusting the portions of the video and audio used for selection, cut, copy, and paste, in the same way that text editors such as Microsoft Word adjust whether the spaces before and after words are selected. These capabilities are discussed in the following sections.
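A shot's audio and video extents can be compared to classify the L-cut patterns of Figure 11. The representation and the 0.1-second tolerance below are assumptions for illustration, not values from the paper.

    from dataclasses import dataclass

    @dataclass
    class Shot:
        video_in: float
        video_out: float
        audio_in: float
        audio_out: float

    def classify_l_cut(shot, tol=0.1):
        same_in = abs(shot.audio_in - shot.video_in) <= tol
        same_out = abs(shot.audio_out - shot.video_out) <= tol
        if same_in and same_out:
            return "straight cut"
        if shot.audio_in < shot.video_in and same_out:
            return "Shot A: audio leads video"
        if same_in and shot.audio_out > shot.video_out:
            return "Shot B: audio trails video"
        if shot.video_in < shot.audio_in and same_out:
            return "Shot C: video leads audio"
        if same_in and shot.video_out > shot.audio_out:
            return "Shot D: video trails audio"
        return "other overlap"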

7.1 Intelligent Selection
When the user double-clicks in Microsoft Word and other text editors, the entire word is selected. Triple-clicking will get the entire paragraph, sentence, or line (depending on the editor). Silver provides a similar feature for video, and the unit selected on multiple clicks depends on which view is active. If the user double-clicks in the text view, the surrounding word or phrase that Informedia recognized (the minimal unit that can be matched with the audio) will be selected. In the timeline view, however, the effect of double-clicking depends on the specific timeline within the hierarchy. It can mean, for example, a shot (where shot boundaries are located automatically by Informedia) or an entire clip.

7.2 Intelligent Cut, Delete and Copy
When an operation is performed on the selected portion of the video, Silver uses heuristics to try to adjust the start and end points so they are appropriate for both the video and audio.

Using the information provided by Informedia, Silver will try to detect the L-cut situations shown in Figure 11. Silver will then try to adjust the selection accordingly. For example, when the user selects a sentence in the transcript and performs a copy operation, Silver will look at the corresponding video. If the shot boundary is not aligned with the selection from the audio, then Silver will try to determine if there is an L-cut as in Figure 11. If so, Silver will try to adjust the selected portion appropriately. However, editing operations using this selection will require special considerations, as discussed next.

7.3 Intelligent Paste and Reattach
When a segment with an L-cut is deleted or pasted in a new place, Silver will need to determine how to deal with the uneven end(s). If the audio is shorter than the video (e.g., if Shot C or Shot D from Figure 11 is pasted), then Silver can fill in with silence, since this is not generally disruptive. However, it is not acceptable to fill in with black or blank video. For example, if Shot A is pasted, Silver will look at the overlapped area of the audio to see if it is silent or preceded by an L-cut (Figure 12a). If so, then the video can abut and the audio can be overlapped automatically (shown in Figure 12b). If the audio in the overlap area is not silent, however, then Silver will suggest options to the user and ask which one is preferred. The choices include replacing the audio in the destination; mixing the audio (which may be reasonable when one audio track is music or other background sounds); or using some video to fill in the overlap area (as in Figure 12c). The video to be filled in may come from the source of the copy (e.g., by expanding the video of Shot A), or else it may be some other material or a "special effect" like a dissolve.

Figure 12. When an L-shot such as Shot A is inserted at a point in the video (a), Silver will check the area that might be overlapped. If the audio is silent in that area, Silver will overlap the audio automatically (b). In some cases (c), the user will have the option to overlay a separate piece of video if the audio cannot be overlapped.

Although there are clearly some cases where the user will need to get involved in tweaking the edits, we feel that in the majority of cases the system will be able to handle the edits automatically. It will await field trials, however, to measure how successful our heuristics are.
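The paste decision just described can be summarized as a small rule, sketched below under assumed predicate names; in practice the silence and L-cut checks would come from Informedia-derived metadata.

    def resolve_paste_overlap(overlap_is_silent, preceded_by_l_cut, ask_user):
        """Attach a pasted Shot-A-style segment (audio leading video),
        following the options of Figure 12."""
        if overlap_is_silent or preceded_by_l_cut:
            return "overlap-audio"                    # automatic (12b)
        # Otherwise the user chooses among the suggested repairs:
        return ask_user(["replace-destination-audio",
                         "mix-audio",                 # e.g. a music bed
                         "fill-overlap-with-video"])  # (12c)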

8. INTELLIGENT CRITICS: FUTURE WORK
In school, children spend enormous amounts of time learning and practicing how to write. This includes learning the rules for organizing material, constructing sentences and paragraphs, and generally making a logical and understandable composition. However, few people will learn the corresponding rules for creating high-quality video and multimedia compositions. These are generally only taught in specialized, elective courses on film or video production. Therefore, in order to help people create higher-quality productions, we plan to provide automatic critics that help evaluate and guide the quality of the production. Some of the techniques discussed in Section 7 above will actually help improve the quality. We intend to go beyond this to provide many other heuristics that will watch the user's production and provide pop-up suggestions such as "avoid shaky footage" (as in [14]) and "avoid cutting in the middle of a camera pan."

Video editing is a highly subjective part of filmmaking, which can greatly affect the look of the finished product. Therefore, though some of the intelligent editing can be automated to prevent the user from making obvious errors, in some cases it is best to simply inform the user of the rule rather than make the artistic decision for them. For this reason, providing help as an Intelligent Critic is likely to be appropriate in this application.

A century of filmmaking has generated well-grounded theories and rules for film production and editing which can be used by our critic (e.g., [9], [21], [3]). For example, the effect of camera angle on comprehension and emotional impact [4], the effect of shot length and ordering on learning [13], and the effect of lighting on subjects' understanding of scenes [30] are just a small sample of filmmaking heuristics. As automatically generated metadata improves, it will be possible for Silver to give users more sophisticated assistance. For example, when Informedia vision systems are able to recognize similar scenes through an understanding of their semantic content, a future version of Silver could suggest that the first use of a scene be presented for a longer period than subsequent presentations. This is desirable to keep users' interest, keeping the user from becoming bored with the same visual material. Such capabilities in Silver will still not make Hollywood directors and editors of school children. However, they will provide a level of assistance that should enable naïve users to create much more pleasing productions from video archives.

9. CONCLUSIONS
We are implementing the Silver editor in Visual Basic, and although most functions are implemented, there is much more to be done. One important goal of the Silver project is to distribute our video editor so people can use it. Unfortunately, it is not yet robust enough to be put in front of users. It is also an important part of our plans to do informal and formal user tests, to evaluate and improve our designs.

As more and more video and film work is performed digitally, and as more and more homes and classrooms use computers, there will clearly be an increased demand for digital video editing. Digital libraries contain increasing amounts of multimedia material, including video. Furthermore, people will increasingly want easy ways to put their own edited video on their personal web pages so it can be shared. Unfortunately, today's video editing tools make the editing of video significantly harder than the editing of textual material or still images. The Silver project is investigating some exciting ways to address this problem, and hopefully will point the way so that the next generation of video editors will be significantly easier to use.

ACKNOWLEDGEMENTS
The Silver Project is funded in part by the National Science Foundation under Grant No. IIS-9817527, as part of the Digital Library Initiative-2. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation. Thanks to Bernita Myers for help with this paper.

REFERENCES
[1] Apple Computer. iMac. 2000. http://www.apple.com/imac/.
[2] Boreczky, J., et al. "An Interactive Comic Book Presentation for Exploring Video," in CHI 2000 Conference Proceedings. ACM Press, 2000, pp. 185-192.
[3] Cantine, J., Howard, S., and Lewis, B. Shot by Shot: A Practical Guide to Filmmaking, Second Edition. Pittsburgh Filmmakers, 1995.
[4] Carroll, J.M. and Bever, T.G. "Segmentation in Cinema Perception." Science, 1976. 191: pp. 1053-1055.
[5] Chiueh, T., et al. "Zodiac: A History-Based Interactive Video Authoring System," in Proceedings of ACM Multimedia '98. Bristol, England, 1998.
[6] Christel, M., et al. "Techniques for the Creation and Exploration of Digital Video Libraries," in Multimedia Tools and Applications, B. Furht, Editor. Kluwer Academic Publishers, Boston, MA, 1996.
[7] Christel, M., Winkler, D., and Taylor, C. "Multimedia Abstractions for a Digital Video Library," in Proceedings of the 2nd ACM International Conference on Digital Libraries. Philadelphia, PA, 1997, pp. 21-29.
[8] Chua, T. and Ruan, L. "A video retrieval and sequencing system." ACM Transactions on Information Systems, 1995. 13(4): pp. 373-407.
[9] Dancyger, K. The Technique of Film and Video Editing: Theory and Practice, Second Edition. Focal Press, Boston, MA, 1997.
[10] Davenport, G., Smith, T.A., and Pincever, N. "Cinematic Primitives for Multimedia." IEEE Computer Graphics & Applications, 1991. 11(4): pp. 67-74.
[11] Fischhoff, B., et al. What Could You Do? Interactive Video DVD. Carnegie Mellon University, 1998. http://www.cmu.edu/telab/telfair_program/Fischhoff.html.
[12] Gauch, S., Li, W., and Gauch, J. "The VISION Digital Video Library." Information Processing & Management, 1997. 33(4): pp. 413-426.
[13] Gavriel, S. "Internalization of Filmic Schematic Operations in Interaction with Learners' Aptitudes." Journal of Educational Psychology, 1974. 66(4): pp. 499-511.
[14] Girgensohn, A., et al. "A semi-automatic approach to home video editing," in Proceedings of UIST 2000: The 13th Annual ACM Symposium on User Interface Software and Technology. San Diego, CA, 2000. ACM, pp. 81-89.
[15] Hampapur, A., Jain, R., and Weymouth, T. "Digital Video Segmentation," in Proceedings of the Second ACM International Conference on Multimedia. San Francisco, 1994, pp. 357-364.
[16] Hauptmann, A. and Smith, M. "Text, Speech, and Vision for Video Segmentation: The Informedia Project," in AAAI Symposium on Computational Models for Integrating Language and Vision. 1995.
[17] Hauptmann, A.G. and Smith, M. "Video Segmentation in the Informedia Project," in IJCAI-95 Workshop on Intelligent Multimedia Information Retrieval. Montreal, Quebec, Canada, 1995.
[18] Lee, H., et al. "Implementation and Analysis of Several Keyframe-Based Browsing Interfaces to Digital Video," in Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2000). Lisbon, Portugal, 2000. http://lorca.compapp.dcu.ie/Video/Papers/ECDL2000.pdf.
[19] Mackay, W.E. and Davenport, G. "Virtual Video Editing in Interactive Multimedia Applications." CACM, 1989. 32(7): pp. 832-843.
[20] Mackay, W.E. and Pagani, D. "Video Mosaic: Laying out time in a physical space," in Proceedings of the Second ACM International Conference on Multimedia. San Francisco, CA, 1994. ACM, pp. 165-172.
[21] Mascelli, J. The Five C's of Cinematography: Motion Picture Filming Techniques. Silman-James Press, Los Angeles, CA, 1965.
[22] Matthews, J., Gloor, P., and Makedon, F. "VideoScheme: A Programmable Video Editing System for Automation and Media Recognition," in ACM Multimedia '93 Proceedings. 1993, pp. 419-426.
[23] Meng, J. and Chang, S. "CVEPS: A Compressed Video Editing and Parsing System," in Proceedings of the Fourth ACM International Conference on Multimedia. Boston, MA, 1996, pp. 43-53.
[24] Mills, M., Cohen, J., and Wong, Y. "A Magnifier Tool for Video Data," in SIGCHI '92 Conference Proceedings on Human Factors in Computing Systems. Monterey, CA, 1992. ACM, pp. 93-98.
[25] Myers, B.A. "Demonstrational Interfaces: A Step Beyond Direct Manipulation." IEEE Computer, 1992. 25(8): pp. 61-73.
[26] Myers, B.A. Natural Programming: Project Overview and Proposal. Technical Report CMU-CS-98-101 / CMU-HCII-98-100, Carnegie Mellon University School of Computer Science, Pittsburgh, January 1998.
[27] Placeway, P., et al. "The 1996 Hub-4 Sphinx-3 System," in DARPA Spoken Systems Technology Workshop. 1997.
[28] Sato, T., et al. "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Caption." ACM Multimedia Systems: Special Issue on Video Libraries, 1999. 7(5): pp. 385-395.
[29] Stevens, S.M., Christel, M.G., and Wactlar, H.D. "Informedia: Improving Access to Digital Video." interactions: New Visions of Human-Computer Interaction, 1994. 1(4): pp. 67-71.
[30] Tannenbaum, P.H. and Fosdick, J.A. "The Effect of Lighting Angle on the Judgment of Photographed Subjects." AV Communication Review, 1960, pp. 253-262.
[31] Tonomura, Y., et al. "VideoMAP and VideoSpaceIcon: Tools for Anatomizing Video Content," in Proceedings of INTERCHI '93: Human Factors in Computing Systems. 1993. ACM, pp. 131-136.
[32] Ueda, H. and Miyatake, T. "Automatic Scene Separation and Tree Structure GUI for Video Editing," in Proceedings of ACM Multimedia '96. Boston, 1996.
[33] Ueda, H., Miyatake, T., and Yoshizawa, S. "Impact: An Interactive Natural-Motion-Picture Dedicated Multimedia Authoring System," in Proceedings of INTERCHI '93: Conference on Human Factors in Computing Systems. Amsterdam, 1993. ACM, pp. 137-141.
[34] Wactlar, H.D., et al. "Lessons learned from building a terabyte digital video library." IEEE Computer, 1999. 32(2): pp. 66-73.
[35] Zhang, H., et al. "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution," in ACM Multimedia '95: Proceedings of the Third ACM International Conference on Multimedia. San Francisco, CA, 1995, pp. 15-24.

VideoGraph: A New Tool for Video Mining and Classification

Jia-Yu Pan, Christos Faloutsos
Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
{jypan, christos}@cs.cmu.edu

ABSTRACT
This paper introduces VideoGraph, a new tool for video mining and for visualizing the structure of the plot of a video sequence. The main idea is to "stitch" together similar scenes that are apart in time. We give a fast algorithm to do the stitching, and we show case studies where our approach (a) gives good features for classification (91% accuracy), and (b) results in VideoGraphs that reveal the logical structure of the plot of the video clips.

1. INTRODUCTION
In this paper, we focus on automatically determining and visualizing the "story plot" of a video clip. This helps us distinguish between different video types, e.g., news stories versus commercials. The problem is: given a video clip, determine the evolution of its story plot. For example, do similar scenes alternate, as is the case in a dialogue? (This material is based on work supported by the National Science Foundation under Cooperative Agreement No. IRI-9817496.)

2. PROPOSED METHOD
The main idea behind our approach is to spot shots that tend to occur repeatedly. For example, in an interview, shots of the interviewer and the interviewee will occur often, and intermixed. An important concept we introduce is the shot-group. Recall that a shot is a set of consecutive, similar frames.

Definition 1. A shot-group is a set of shots that are similar according to our similarity function.

Notice that the shots of a shot-group may or may not be consecutive in time. Our experience showed that we should mainly consider shot-groups that consist of many shots. Specifically, we make the following distinctions: a shot-group is called persistent if it contains multiple, consecutive shots, while a shot-group that contains multiple, non-consecutive shots is called recurrent. A shot-group that is both persistent and recurrent is called a basic shot-group; it is exactly the type of shot-group that we use to construct a VideoGraph. Intuitively, basic shot-groups are the ones that contain shots that occur often, and therefore should be helpful in revealing the plot evolution. This is achieved through the VideoGraph:

Definition 2. The VideoGraph of a video clip is a directed graph where every node corresponds to a shot-group and where edges indicate temporal succession. (Dashed edges indicate that some non-basic shot-groups have been deleted along the way.)

Figure 1: The VideoGraph of a news story. An alternating star-like structure is shown, which corresponds to an interview in the latter part of the story.

Figure 1 gives a news clip and its VideoGraph. Notice the pronounced star-like pattern, which, it turns out, corresponds to an interview. The graph contains only the basic shot-groups, namely G1, G7, G9, G11, G23, G18, G20, and G21. Notice the heavy traffic between shot-groups G23, G18, G20, and G21, with G23 being the center of the "star" shape. In the same Figure 1 we also show the key frames from each shot-group; the key frame of a shot-group was randomly selected among the frames of the participating shots. Notice again that the "center" shot-group, G23, is indeed the recurring shot of the interview: it contains the shots where the reporter posed questions to the invited speakers. The shot-groups that interact with G23 heavily, i.e., G18, G20, and G21, contain exactly the shots of the invited speakers replying to the reporter's questions.
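In code, the persistent/recurrent/basic distinction reduces to checking for consecutive and non-consecutive members within each group. The list-of-sorted-shot-indices representation below is an assumption for illustration, not the paper's data structure.

    def is_persistent(group):   # group: sorted shot indices
        return any(b == a + 1 for a, b in zip(group, group[1:]))

    def is_recurrent(group):
        return any(b > a + 1 for a, b in zip(group, group[1:]))

    def basic_shot_groups(groups):
        return [g for g in groups if is_persistent(g) and is_recurrent(g)]

    # e.g. an interview pattern [5, 6, 9, 11] is persistent (5, 6) and
    # recurrent (6 -> 9), so it would become a VideoGraph node.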

Next we describe the algorithm to generate the VideoGraph for a video clip. First, use any off-the-shelf shot detection technique to break the video into shots. Next, use our "stitching" algorithm (described below) to group similar shots into shot-groups; each shot-group is assigned a label. Then, collect statistics (i.e., the count of persistent shot-groups, the count of recurrent ones, etc.). Finally, keep only the shot-groups that are both persistent and recurrent; consider the edges among them, indicating temporal succession; and use an off-the-shelf graph-layout tool (e.g., dot, available at http://www.research.att.com/sw/tools/graphviz/) to draw the resulting graph. This is exactly the VideoGraph for this video clip.

There are two issues left to discuss: (a) which statistics to keep, and (b) how the "stitching" algorithm works. For the first question, we propose the following statistics: the number of I-frames, the number of shots, and the number of shot-groups in the video clip; the number of persistent, recurrent, and basic shot-groups; the count of shots in each of the three classes of shot-groups; the average number of shots per shot-group; and the percentage of shots that are within persistent/recurrent shot-groups.

The second issue to discuss is our video stitching algorithm ("VS"), and specifically, how it decides whether two shots should be grouped together or not. Due to lack of space, we just give the main idea: for each I-frame, we used the DCT coefficients of the macroblocks (MB), and, specifically, the first m = 20 attributes, after using a dimensionality reduction method (FastMap [1]). Thus, each shot is a "cloud" of m-dimensional points. We group together two such "clouds" if the density of the union is comparable to the original, individual densities.

3. EXPERIMENTS - RESULTS
We used MPEG-1 video clips segmented from CNN news reports. Each clip is either a news story or a continuous series of commercials. We chose the technique from [2] to do scene shot detection.

Figure 2: VideoGraphs of news stories (upper part) and commercials (lower part). Note the more complicated structure of the news stories.

Story-plot visualization. Figure 2 shows the VideoGraphs for several news stories (upper part) and commercials (lower part). It is clear that most of the commercial clips have fewer persistent and recurrent shot-groups, and many fewer basic ones, than news stories do. Consequently, VideoGraphs of commercial clips are more likely to be an empty graph or a graph of one or two nodes (Figure 2). This is due to the fact that commercials rarely have a "plot," being, instead, a collection of shots that are not revisited.

Classification. Here we illustrate that the features we extracted capture the logical structure of a video clip quite well and therefore are promising for video classification. We conducted a classification experiment based on 68 news stories and 33 commercials, whose total size is about 2 gigabytes. Each clip is represented by 20 features (count of recurrent shot-groups, count of persistent shot-groups, etc.). We conducted 5-fold cross-validation using an ID3 classifier and achieved a classification accuracy of 91% (plus or minus 0.16 percentage points for the 95% confidence interval). We used an off-the-shelf package, MLC++ [3], for the classification.
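Several of the clip-level features fall directly out of the shot-group statistics. The sketch below shows a subset of the 20 features; the exact feature definitions are assumptions, not the paper's list.

    def clip_features(n_shots, groups, persistent, recurrent, basic):
        # groups, persistent, recurrent, basic: lists of shot-groups
        return {
            "n_shots": n_shots,
            "n_groups": len(groups),
            "n_persistent": len(persistent),
            "n_recurrent": len(recurrent),
            "n_basic": len(basic),
            "avg_shots_per_group": n_shots / max(len(groups), 1),
            "pct_shots_recurrent":
                sum(map(len, recurrent)) / max(n_shots, 1),
        }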
4. CONCLUSIONS
We presented the VideoGraph, a new tool for video mining and story-plot visualization. The heart of our approach is to group similar shots together, even if they are not consecutive. We give algorithms that "stitch" similar shots into shot-groups and automatically derive VideoGraphs. We also show how to derive features (e.g., the number of recurrent shot-groups) for video mining and classification. In a case study with news and commercials, our proposed features achieved 91% classification accuracy, without even using the audio information. VideoGraphs can also be used as video representatives for efficient browsing in digital video libraries such as Informedia [4], and are useful for keyframe selection.

5. REFERENCES
[1] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the ACM SIGMOD Conference, pages 163-174, 1995.
[2] V. Kobla, D. Doermann, and C. Faloutsos. VideoTrails: Representing and visualizing structure in video sequences. Proceedings of the Fifth ACM International Multimedia Conference, pages 335-346, November 1997.
[3] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. Tools with Artificial Intelligence, http://www.sgi.com/Technology/mlc.
[4] H. Wactlar, M. Christel, Y. Gong, and A. Hauptmann. Lessons learned from the creation and deployment of a terabyte digital video library. IEEE Computer, 32(2):66-73, February 1999.

The Alexandria Digital Earth Prototype System

Terence R. Smith, Greg Janée, James Frew, Anita Coleman
Alexandria Project, University of California at Santa Barbara, Santa Barbara, CA 93106
[email protected], [email protected], [email protected], [email protected]

1. INTRODUCTION
This note summarizes the system development activities of the Alexandria Digital Earth Prototype (ADEPT) Project. (The research reported in this paper was partially supported under NSF IIS-9817432.) ADEPT and the Alexandria Digital Library (ADL) are, respectively, the research and operational components of the Alexandria Digital Library Project. The goal of ADEPT is to build a distributed digital library (DL) of personalized collections of geospatially referenced information. This DL is characterized by: (1) services for building, searching, and using personalized collections; (2) collections of georeferenced multimedia information, including dynamic simulation models of spatially distributed processes; and (3) user interfaces employing the concept of a "Digital Earth." Important near-term objectives for ADEPT are to build prototype collections that support undergraduate learning in physical, human, and cultural geography and related disciplines, and then to evaluate whether using such resources helps students learn to reason scientifically. Collections and services developed by ADEPT researchers will migrate to ADL as they mature.

2. ARCHITECTURE
The ADEPT architecture (Figure 1) extends the ADL architecture while diverging from the latter's traditional client-server roots. ADEPT is a set of distributed peers, each supporting one or more collections of objects, subsets of which may be "published" (made visible) to other peers. The "library" is the sum of all such collections.

Collections are the primary organizational metaphor in ADEPT. A collection is a set of independent, typed objects (more precisely, a set of references to objects) together with certain contextual and structural collection-level metadata. Objects are largely undefined, save for their ability to provide object-level metadata and to be uniquely identified. While the contents of the library may thus be utterly heterogeneous, three features integrate the library into a uniform whole: (1) common collection-level metadata; (2) the ADEPT "bucket" system (a common model for object-level metadata); and (3) a central collection discovery service.

ADEPT buckets differ notably from other metadata schemes such as detailed content standards (e.g., FGDC [2]) and high-level definitions (e.g., Dublin Core [3]). The ADEPT bucket system is a transparent metadata aggregation system in which semantically similar, quasi-strongly-typed metadata fields may be formally grouped to provide higher-level search and description capabilities. Objects map their relevant native metadata to buckets; significantly, both field names and values are recorded. Collections index, aggregate, and summarize this information. They allow clients to search by entire buckets and to retrieve bucket-level descriptions of objects, and also to "drill down" into buckets (i.e., to search by specific metadata fields that have been mapped to the buckets). Buckets are strongly typed, being either spatial, temporal, hierarchical, textual, qualified textual, or numeric. Associated with each bucket type are a standard representation (e.g., polygons and boxes for spatial buckets) and a set of standard operators (e.g., set algebra). This combination of a formal aggregation system with strong typing yields a metadata model that is flexible enough to accommodate a wide range of metadata, yet expressive enough to support powerful, targeted queries.

ADEPT collection metadata includes contextual information (scope, purpose, etc.) and automatically derived structural information such as the spatiotemporal distribution of objects, the types and counts of objects, and bucket metadata mappings. Metadata mappings are qualified with statistics to give clients an indication of how well represented metadata fields are within buckets.
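The bucket model can be pictured as follows. The class and method names are illustrative rather than ADEPT's actual API, but the box representation and overlap operator correspond to the standard spatial-bucket representation and operators described above.

    class SpatialBucket:
        """Strongly typed bucket; mapped fields contribute bounding
        boxes (minx, miny, maxx, maxy), and field names are recorded."""
        def __init__(self):
            self.fields = {}               # field name -> list of boxes

        def map_field(self, name, box):
            self.fields.setdefault(name, []).append(box)

        def search(self, query_box, field_name=None):
            """Search the whole bucket, or 'drill down' to one field."""
            def overlaps(a, b):
                return not (a[2] < b[0] or b[2] < a[0] or
                            a[3] < b[1] or b[3] < a[1])
            names = [field_name] if field_name else list(self.fields)
            return [(n, b) for n in names
                    for b in self.fields.get(n, [])
                    if overlaps(b, query_box)]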
New ADEPT services layered atop the existing ADL infrastructure support: (1) creating and managing local collections; (2) creating new objects from client-supplied metadata; (3) importing objects from remote collections into local collections; and (4) categorizing objects according to client-supplied thesauri. In creating new objects, clients are free to provide any metadata, the only requirements being that the metadata be represented in XML and that the client provide an XSLT script mapping the object metadata to a standard bucket-level description. Remote objects are currently imported by simply copying them, but a future version of ADEPT will also support proxy objects and explicit linking. The ability to create new collections and categorize them with user-supplied thesauri effectively supports building personalized collections.

To facilitate including legacy content in the library, especially content managed with relational database technology, we have developed a generic collection driver for relational databases. The driver assumes only that it is connected to the collection via JDBC. The collection's schema, metadata bucket mappings, thesauri, etc. are all described in configuration files. Metadata reports are described by templates, which are interpreted by custom, query-driven report generation software. Query translation is controlled by a Python-based scripting system that supports user-defined functions and customized query formation and optimization.

Figure 1. ADEPT architecture and implementation. (The architecture diagram shows web browsers, SDLIP proxies, and other clients connecting over HTTP and RMI transports to client-side Java services providing access control, query fan-out and results merging, ranking, and result-set caching; a server-side Java interface; and query translators over proxy, Z39.50, and generic JDBC database drivers configured by files and Python scripts.)

The ADEPT collection discovery service manages a registry of known collections and, more significantly, supports discovering collections probabilistically according to their relevance to a given query. Using the notion that relevance is determined by the numbers of objects satisfying a query, we are: (1) investigating new indexing techniques that support joint multidimensional indexing of collection contents using sparse, compact data structures; and (2) using the simple histogram and count information contained in collection-level metadata to derive relevance from typed spatiotemporal information density.
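Under this notion of relevance, collection discovery might rank collections as sketched below, where estimate_matches derives an expected hit count from a collection's histograms and counts. The framing and names are assumptions for illustration, not ADEPT's implementation.

    def rank_collections(collections, estimate_matches):
        scored = [(estimate_matches(c), c) for c in collections]
        total = sum(score for score, _ in scored) or 1
        ranked = sorted(scored, key=lambda sc: sc[0], reverse=True)
        # probability mass assigned to each collection for this query
        return [(score / total, coll) for score, coll in ranked]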
Future directions for the ADEPT architecture include supporting a richer information model, facilitating interoperability, and supporting dynamic content such as models of Earth processes and other executable programs. ADEPT's current information model (collections of independent objects) does not capture key relationships of benefit to users. For example, objects that are members of a series share a relationship with each other and with the series as a whole; we plan to make such relationships explicit. With regard to interoperability, we plan to facilitate searching over other types of collections by supporting connections to Z39.50 servers and by extending the SDLIP protocol. With regard to dynamic content, our focus is on defining an architectural layer that delivers executable content to clients in a platform-neutral fashion, and that allows programs to interact with library contents and to populate the library with new objects. We are designing a suite of additional services that support building personalized collections by, for example, allowing users to: (1) build collections according to personal preferences; (2) build models representing concepts in different ways; and (3) build concept maps.

3. CATALOG AND COLLECTIONS
Three prototypical ADEPT collections are being built to aid in designing future collections and services, and to serve as a basis for evaluating the educational impacts of ADEPT services. The collections focus on processes and forms in (1) physical geography (e.g., landscapes and tectonic phenomena); (2) human geography (e.g., diffusion phenomena); and (3) cultural geography (e.g., religious sites). Initial collections for physical and human geography have already been tested in electronic classroom presentations. The collection-building process involves (1) acquisition; (2) cataloging; (3) ingesting; (4) evaluating with respect to user scenarios; and (5) building ISO-standard topic maps. The collections may be searched by concept and reorganized with different levels of granularity into personal collections for various pedagogic purposes (e.g., lectures, self-paced labs, and individual and collaborative learning sessions). Catalog descriptions of objects employ the ADEPT Learning Objects Metadata Model (ADEPT LOMM) standard, which is based on the IMS [4], DLESE [5], and IEEE LOM [6] metadata standards, and an ADEPT-developed standard for model metadata [7]. The model's key educational and pedagogical elements include (1) type of learning resource; (2) learning context; (3) interactivity level; and (4) description.

4. REFERENCES
[1] The Alexandria Digital Earth Modeling System (ADEPT). http://www.alexandria.ucsb.edu/adept/proposal.pdf.
[2] Federal Geographic Data Committee. Content Standard for Digital Geospatial Metadata. http://www.fgdc.gov/metadata/csdgm/.
[3] Dublin Core Metadata Initiative. Dublin Core Metadata Element Set. http://dublincore.org/documents/dces/.
[4] IMS Global Learning Consortium. IMS Learning Resource Meta-Data Information Model. http://www.imsproject.org/metadata/mdinfov1p1.html.
[5] DLESE. http://www.dlese.org/.
[6] IEEE Learning Technology Standards Committee. LOM: Base Scheme v3.5. http://ltsc.ieee.org/doc/wg12/scheme.html.
[7] Crosier, S. Content Standard for Computer Model Metadata. http://www.geog.ucsb.edu/scott/metadata/standard/.

Iscapes: Digital Libraries Environments for the Promotion of Scientific Thinking by Undergraduates in Geography

Anne J. Gilliland-Swetland and Gregory L. Leazer
Department of Information Studies, University of California, Los Angeles
212 and 226 GSE&IS Building, Los Angeles, CA 90095-1520
(+001) 310-206-4687 / (+001) 310-206-8135
[email protected], [email protected]

ABSTRACT
This paper reviews considerations associated with implementing the Alexandria Digital Earth Prototype (ADEPT) in undergraduate geography education by means of Iscapes (or Information landscapes). In particular, we are interested in how Iscapes might be used to promote scientific thinking by undergraduate students. Based upon an ongoing educational needs assessment, we present a set of conceptual principles that might selectively be implemented in the design of educational digital library environments.

Keywords
Digital libraries. Geography. Undergraduate education. Scientific thinking.

1. INTRODUCTION
The Alexandria Digital Earth ProtoType (ADEPT) is a 5-year project (1999-2004), funded by the U.S. Digital Libraries Initiative, Phase 2, that is developing a "geolibrary" of georeferenced and geospatial resources. A key aim of ADEPT is to integrate it into undergraduate education in disciplines that might make use of such resources. Iscapes (or Information landscapes) are a construct that specifically addresses the requirements for integrating ADEPT into instructional and learning activities. The Iscape construct, at its simplest, denotes a personalized use environment, both in terms of information resources and functionality. Information resources used in an Iscape might be drawn from many distributed sites. They might include material contributed by the user, derivative materials created by manipulation of ADEPT collections, and ADEPT's own extensive collections. A key aim for Iscapes is to provide instructors and students with the means to discover, model, manipulate, and thus explain dynamic geographical processes and learn the underlying scientific reasoning.

Scientific thinking in geography requires students to master at least five skills: asking geographic questions, acquiring geographic information, organizing geographic information, analyzing geographic information, and answering geographic questions [2]. Digital libraries deployed in education should also allow for "inquiry into authentic questions generated from student experiences" [3]. If a digital library is going to be integrated directly into instructional and learning activities and be instrumental in the inculcation of students' scientific thinking, then it is likely that it will need to make available significantly augmented functionality beyond that of a digital library that is intended primarily as a supporting information resource.

A team of ADEPT researchers at the University of California, Los Angeles and Santa Barbara has been examining the educational implementation and evaluation of ADEPT in undergraduate geography courses on both campuses [1]. As part of an educational user needs assessment, we have been analyzing instructors' pedagogical goals and identifying the concepts that instructors use in teaching core geographic topics, to understand how these might be facilitated by ADEPT. We have been particularly interested in how the use of digital libraries can promote not only the learning of core geographic concepts, but also the thinking processes associated with the relevant disciplinary domains.

2. DESIGNING ISCAPES
2.1 Considerations
In addressing the design of Iscapes to support the acquisition of scientific thinking skills and knowledge by undergraduate students, it is necessary first to identify what needs to be taught and how. In university environments, there are few "objective" frameworks that we can use to understand the scope of learning expectations within a discipline. In the K-12 environment, standardized frameworks provide some guideposts, but in universities, course content and learning objectives can be highly specific to individual instructors, even across instructors teaching different sections of the same course. Commonly used textbooks provide some assistance in identifying core content and perspectives, but there is also wide variability in the texts and editions.
expectations within a discipline.In the K-12 environment, standardizedframeworks provide some guideposts,butin universities, course content and learning objectives can be highly Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are specific to individual instructors, even across instructors teaching not made or distributed for profit or commercial advantage and that different sections of the same course. Commonly used textbooks copies bear this notice and the full citation on the first page. To copy providesomeassistanceinidentifyingcorecontentand otherwise, or republish, to post on servers or to redistribute to lists, perspectives, but there is also wide variability in the texts and requires prior specific permission and/or a fee. editions. As a result, we face several issues, not only with the JCDL '01, June 24-28, 2001, Roanoke, Virginia, USA. development of relevant collections, but also in determining how Copyright 2001 ACM 1-58113-345-6/01/0006...$5.00.

Some examples of issues that arise include the following: How do we analyze instructors' learning goals for their students? To what extent do instructors' own research interests, and the way in which they approach their discipline, influence the way they teach core concepts? In what other ways can we identify core collections, core topics and concepts, and best practices for the discipline? How do instructors underscore or reinforce important points? What are the measures and permitted transformations of concepts and information resources that need to be supported? How can a student faced with a variety of ways to solve a problem, such as finding an answer to an assigned question, know that he or she has taken the "best" approach, i.e., demonstrated the appropriate scientific reasoning techniques? Finally, and key to Iscape development, is it possible to build an educational digital library environment that supports widely varying instructional approaches while supporting and modeling best practices and core concepts?

2.2 Conceptual Requirements
As was mentioned above, one way we have addressed these questions is by determining the core geographical concepts. Based upon analysis of interviews with instructors about their instructional goals and pedagogical approaches, observation of instructors teaching classes, as well as class syllabi, assignments, and required texts, we have been developing concept maps for major topics within courses.
From our work to date, we have formulated a set of conceptual principles that might selectively be implemented in the design of educational digital library environments such as Iscapes:
Iscapes should be constructed to support a range of domain knowledge, work practices, and discipline-specific reasoning using geo-spatial resources.
Diversity and extensibility of metadata: Creators of Iscapes should be able to integrate new resources and associated metadata, and to create metadata for extant resources specific to their instructional or learning activities.
Authentic science topics: Iscapes should present authentic scientific problems and data.
Expert knowledge and reasoning: Iscapes should reflect best practices and the most authoritative current knowledge in a given domain.
Support of disciplinary conventions: Iscapes should adhere to disciplinary practices and conventions for concept representation, such as labeling, use of color, and visual perspectives.
Appropriate use of technology: Iscapes should be developed where they will be more effective than existing classroom methods, such as the modeling of dynamic processes, presenting alternate visualizations of data or images, or prediction.
Scaffolding: Iscapes should make available tools that allow instructors to collate and annotate specific resources, as well as to predetermine how students may move among and manipulate those resources.
Collaboration: Iscapes should facilitate group work by students through the use of collaboration tools.
Annotation: Instructors and students should be able to annotate resources and otherwise embed commentary on their interactions with the Iscapes for others to see. Instructors and students should also be able to embed questions for each other.
Transparency: The technology to support Iscape development and use should not intrude upon developing and learning content, concepts, and processes.
Progressive skill-building: Users who master basic skills and concepts should be able to build upon these as they progress on to more sophisticated Iscapes.
Extensibility: Iscapes should be augmentable through the inclusion of additional content and layers of functionality permitting more sophisticated views of content.
Parameter variation: Iscapes should support interactive visual and data models of dynamic processes and hypothetical scenarios. These models and scenarios will permit users to vary a range of relevant parameters and will display the results to users.
Granularity: The level of detail at which Iscape content is displayed and manipulated can be varied to facilitate different levels of expertise as well as different disciplinary needs.
Computational capabilities: Iscapes should incorporate data manipulation tools to support longitudinal and stochastic analyses.
Linkage: Citation-linking should assist in the identification of additional resources on a topic, and of seminal and influential works. It also aids users in tracing the development of key findings and ideas.
Hybrid collections: Iscapes should support references to related non-digital resources, such as maps contained in university library collections.
Vocabulary: Iscapes should include a spell-checker and a pull-down glossary of technical terms. Instructors should be able to develop customized glossaries for their Iscapes.
Further information about the ADEPT project is available at the ADEPT web sites at UCLA (http://is.gseis.ucla.edu/adept/) and UCSB (http://www.alexandria.ucsb.edu/dept/).

3. ACKNOWLEDGMENTS
Our thanks to Christine Borgman for her comments on drafts of this paper. Additional thanks to Richard Mayer. This article is based upon work supported by the National Science Foundation under grant no. IIS-9817432.

4. REFERENCES
[1] Borgman, C.; Gilliland-Swetland, A.; Leazer, G.; Mayer, R.; Gwynn, D.; Gazan, R.; Mautone, P. Evaluating the use of digital libraries in undergraduate education: a case study of the Alexandria Digital Earth Prototype. Library Trends, 49(2): 228-250.
[2] Geography Education Standards Project. Geography for life: national geography standards. Washington, DC: National Geographic Society (1994).
[3] National Research Council (1996). National science education standards. Washington, D.C.: National Academy Press.

Project ANGEL: An Open Virtual Learning Environment with Sophisticated Access Management

John Mac Coll
Director of Science & Engineering Library, Learning & Information Centre (SELLIC) Project / Sub-Librarian, Online Services
University of Edinburgh, Darwin Library, Edinburgh, Scotland, UK
+44 131 650 7275
[email protected]

ABSTRACT
This paper describes a new project funded in the UK by the Joint Information Systems Committee to develop a virtual learning environment which combines a new awareness of internet sources, such as bibliographic databases and full-text electronic journals, with a sophisticated access management component which permits single sign-on authentication.

Categories and Subject Descriptors
[Authentication; Personalization]

General Terms
Design, Standardization

Keywords
Authentication; Access Management; Virtual Learning Environments.

1. INTRODUCTION
Under its programme for the Development of the DNER [1] (Distributed National Electronic Resource), the UK's Joint Information Systems Committee [2] (JISC) has funded a consortium project known as ANGEL [3] (Authenticated Networked Guided Environment for Learning), which is led by the London School of Economics, with partners at the University of Edinburgh, De Montfort University and South Bank University, and associate partners EDINA (Edinburgh Data and Information Access) and Sheffield Hallam University. The ANGEL consortium was formed in the autumn of 2000. System development began early in 2001, with a two-year timescale.

2. A GUIDED ENVIRONMENT FOR LEARNING
ANGEL is developing an environment for learning which is 'guided' in several ways. Student users will be guided in the sense of entering a tailored, customised environment, an individual portal which collocates administrative and academic information which relates to them from registry, library, finance, school and department. What ANGEL will offer, in addition to the features now common to personalised environments, is access to relevant databases of content, primary and secondary, as well as to 'pre-coordinated' resources selected by the tutor. The system will not insist on any particular interface (thus it will work with existing customised or institutionally-preferred portals), though it will provide one for institutions which wish it.
At the heart of the ANGEL system is the tutor interface, which is the means by which the 'guiding' of the student user is achieved. ANGEL is building tools to make it simple for tutors, as they construct online courses, to select resources which should be added to the environments of all students on their course. They will be prompted to select from a range of resources both local and external, including DNER resources such as the databases and datasets available from the UK's JISC-funded data centres (such as EDINA), as well as useful locally-produced learning resources at various levels. These may range from aggregated resources such as online courses (all users should normally see several of these) to 'atomic' resources, or course 'granules', which may be located in several different course sites, or within the library system.

3. ACCESS MANAGEMENT
ANGEL aims to extend the compass of a learning environment beyond the normal 'closed' environment of particular resources selected by a tutor, to the more open environment of bibliographic databases, full-text electronic publications and datasets available within the DNER and beyond. Essential to this approach is the development of an authentication and authorisation module which permits, through a 'single sign-on' to ANGEL, access to the range of secured resources to which a user has rights.
This requires an approach based upon 'authentication brokering', already developed by the ANGEL lead partner site, the London School of Economics, in an earlier JISC-funded project known as HEADLINE [4]. In this aspect of its development, ANGEL is tracking the work of JISC's Committee on Authentication and Security (JCAS), which is steering the development of a UK-wide access management system known as Sparta.


Sparta will replace the Athens system which has been used successfully in the UK for several years, but which is now perceived to be limited in the degree of security it can provide, and in the flexibility which can be offered to institutional management to limit access to specific groups and draw detailed usage information from the system. Sparta is being developed alongside work going on in the US under the Internet2 'Middleware Initiative' (in particular the work on the 'Shibboleth' system), and in continental Europe, through TERENA (Trans-European Research and Education Network Association) [5]. ANGEL will deploy Sparta technology in prototype.

4. AUTHENTICATION AND AUTHORISATION PRINCIPLES
Development of the authentication and authorisation module is based upon the following principles:
Responsibility for authentication rests with the institution.
Authentication is based upon centrally-held authoritative institutional user information.
Authentication and authorisation are functionally separate procedures.
Authorisation involves the identification of roles of authenticated users.
Authorisation identifies people, not computers.
Granularity is provided to the level of individual users without any compromise to security.
Migration to the ANGEL system is seamlessly provided to Athens users, and parallel working is available for a reasonable period of time.
Credential subsidiarity operates, whereby credentials are provided only to the level which is appropriate. Thus, in the case of resources which are licensed for use by all members of an institution, with deeper-level credentials available also to identified groups or individuals, the credential used for access is determined by the user's role.
This means that institutions are free to use a variety of approaches. The Project will prioritise the deployment of X.509 digital certificates for authentication in the context of an institution acting as a Certificate Authority. Authentication will draw upon existing student databases held in each partner university, and will not require new databases to be established. The Project will assess the difficulties encountered by this requirement.
The directory structure used includes the use of the Lightweight Directory Access Protocol (LDAP) for the storage of user credentials and roles information for authorisation. The roles definitions used by the institution, and those applied by the service provider, will deploy separate namespaces. Work towards recommendations for the definition of a single roles namespace for UK HE will be carried out by the Project, allowing individual institutions wide scope for variations. This work will be developed to include the concept of individual user profiling, which is important for the GEL component of the Project.
Throughout the development of the access management module, the Project will monitor Sparta developments alongside other work going on internationally and will adopt solutions where appropriate. The Extensible Markup Language (XML), now a standard for web-based description of a range of data types, including directory information, will be adopted in ANGEL.
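The authorisation principles above can be made concrete with a small example. The following Python sketch is illustrative only: the directory entries, role names and licence table are invented stand-ins for an institutional LDAP directory and resource licences, which this paper does not specify.

    # A minimal sketch of the role-based authorisation principles described
    # above (roles not computers, credential subsidiarity). All names are
    # hypothetical; ANGEL's actual interfaces are not given in this paper.

    # Stand-in for attributes held in an institutional LDAP directory.
    DIRECTORY = {
        "jsmith": {"roles": ["member", "student", "law-school"]},
        "akhan":  {"roles": ["member", "staff", "library"]},
    }

    # Each licensed resource names the least-specific role that suffices.
    RESOURCE_LICENCES = {
        "bibliographic-db": "member",      # site licence: any institutional member
        "law-reports":      "law-school",  # deeper-level credential required
    }

    def authorise(user_id, resource):
        """Authorisation identifies the roles of an authenticated user,
        not the computer the request came from."""
        roles = DIRECTORY.get(user_id, {}).get("roles", [])
        required = RESOURCE_LICENCES.get(resource)
        return required in roles

    def credential_for(user_id, resource):
        """Credential subsidiarity: release only the role the provider
        needs, never the user's full role list."""
        if authorise(user_id, resource):
            return RESOURCE_LICENCES[resource]   # e.g. "member"
        return None   # resource stays invisible to this user

    assert authorise("jsmith", "law-reports")
    assert not authorise("akhan", "law-reports")

Under this reading of credential subsidiarity, a site-licensed resource would only ever see the matched role (here "member"), while resources the user has no rights to simply remain invisible.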
Users will thus be guided to an environment in which all networked resources to which they have rights are available via a single sign-on, which means that they will not face multiple ID and password challenges as they take their search from resource to resource, or employ cross-searching of multiple resources. The services to which they do not have rights (perhaps because they are students on a franchised course, for example, or lifelong learners, and resource providers do not recognise them within site licence provisions) will not be visible to them.

5. PROJECT OUTCOMES
The main proposed deliverable is a system that brings together hybrid library technology with learning and teaching resources in a way that:
Is adaptive to user requirements.
Includes all of the best features from state-of-the-art systems for resource discovery and Instructional Management.
Employs an authentication and authorisation system which is based upon the principles of single sign-on, granular allocation of resources, a generic account for DNER resources, and integration of subscription-based and local secured services.
Enables the resources offered to students to be directly selected and maintained by the academic teaching staff responsible for providing, directing and monitoring courses of study.
Provides monitoring and guidance for the (student) user in a way that shows some 'understanding' of what the user is trying to do when they move from the closed, structured teaching environment to the open environment of resource discovery. The system will learn about the task and the user and be able to remember the user's actions.
Could be used as an early identifier of potential student study difficulty and hence an early warning of some forms of dropout risk.
The ANGEL system will provide an intermediate layer of 'intelligent' middleware (between users and resources, controlled by metadata about users, and itself learning from the current and past behaviour of each user) so that access to the 'open' resources of the hybrid library can be guided and monitored in similar ways to the guidance and monitoring already available within the best of the 'closed' virtual learning environments.

6. CONCLUSION
We are hopeful that ANGEL will not only provide a useful environment for students, but will also help to achieve coordinated planning between the administrative and academic domains in universities, with the library and information domain important to both. Effective intra-institutional collaboration will be essential to its success.

7. REFERENCES
[1] http://www.dner.ac.uk
[2] http://www.jisc.ac.uk
[3] http://www.angel.ac.uk
[4] http://www.headline.ac.uk
[5] http://www.terena.n

NBDL: A CIS Framework for NSDL

Joe Futrelle, NCSA, University of Illinois, Urbana-Champaign, Illinois, [email protected]
Su-Shing Chen, University of Missouri, Columbia, Missouri, [email protected]
Kevin C. Chang, University of Illinois, Urbana-Champaign, Illinois, [email protected]

ABSTRACT
In this paper, we describe the NBDL (National Biology Digital Library) project, one of the six CIS (Core Integration System) projects of the NSF NSDL (National SMETE Digital Library) Program.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries - standards, user issues, dissemination.

General Terms
Algorithms, Design, Standardization.

Keywords
Digital library, SMET education, Federated search.

1. INTRODUCTION
This NSDL project, NBDL, consists of the participating institutions University of Missouri-Columbia (MU), NCSA, University of Illinois-Urbana/Champaign (UIUC), and Missouri Botanical Garden (MOBOT) [6]. The project focuses on building an interoperable, reliable, and scalable Core Integration System (CIS) framework for coordinating, integrating, and supporting learning environments and data resources provided by NSDL collections and services. We also emphasize integration issues of collections and services by building a testbed with a large biological collection, Tropicos, of MOBOT (Missouri Botanical Garden) and its educational services. This testbed can be extended to other disciplines and environments.

2. BIOLOGICAL INFORMATION
Biology has one of the most complex information structures (concepts, data types and algorithms) among scientific disciplines. Its richness in organisms, species, cells, genes and their pathways provides many challenging issues for biological sciences, computational sciences, and information technology. The advances in biological science and technology urgently need the development of very large biological digital libraries for analyzing and managing biological information: sequences, structures, and functions, arising from DNAs, RNAs, genes and proteins, and taxonomies. At present, biological databases are among the best archived, managed, and preserved. The earliest phase of bioinformatics is perhaps the naming and classification of organisms invented by the Swedish biologist Carolus Linnaeus (1707-1778). His "binomial system" provides a taxonomic hierarchy of types, species, families, orders, classes, and phyla (divisions) for biology, still in use today. Unfortunately, the biology community has never developed an integrated data resource of genomic, morphological, and taxonomic information. That is, users cannot search and explore all such information objects serendipitously. The NBDL project will address this important scientific issue.

3. THE EMERGE ARCHITECTURE
The technical infrastructure supporting NBDL is the Emerge distributed IR toolkit [3]. Emerge consists of a set of components that enable federation of distributed, heterogeneous data collections by means of query and data translation. Data sources are proxied with a protocol-translation component called Gazelle and integrated with a broker component called Gazebo, which translates client queries into the variety of query formats required by the distributed data sources. Thus, data can be retrieved and integrated independently of its location, access protocol, and query syntax. Figure 1 depicts the Emerge architecture.

[Figure 1. Emerge Architecture: front-end services and client tools connect through the Gazebo broker, which dispatches translated queries to Gazelle components proxying the raw data sources and an indexer.]
Emerge components access data through a simple interface called the Gazelle Target API, which must be implemented for each data source. In some cases, classes of data sources can be supported by a single implementation. In the case of NBDL, we have developed an implementation of the Gazelle Target API for the TROPICOS database, which allows arbitrary queries to be executed against it [7].
Client queries are translated by Gazebo with a general-purpose query translation engine capable of generating queries in a wide variety of syntaxes. We have developed a configuration for the Gazebo query translator which generates TROPICOS queries. Semantic equivalences between query syntaxes are expressed in Gazebo using meta-attributes, which are similar to the Alexandria Digital Library's search buckets [4]. For TROPICOS, we have developed a set of meta-attributes which can be translated into TROPICOS as well as ZBIG's Darwin Core query syntax, allowing a single client query to be targeted at either TROPICOS or any ZBIG service. This will allow tools to be built that can transparently retrieve biological information regardless of whether it originates in TROPICOS or a ZBIG resource.
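As a rough illustration of how a broker can rewrite one client query for several targets, consider the following Python sketch. The meta-attribute names and the two field mappings are invented for illustration; the actual Gazebo configuration language and the real TROPICOS and Darwin Core field names are not given in this paper.

    # Hedged sketch of broker-side query translation via meta-attributes.
    # A client query is expressed over neutral meta-attributes and
    # rewritten into each target's native field names and syntax.

    client_query = {"scientific_name": "Quercus alba", "geography": "Missouri"}

    # Per-target mappings from meta-attributes to (hypothetical) native fields.
    TARGET_FIELDS = {
        "tropicos":    {"scientific_name": "NameLatin",      "geography": "Region"},
        "darwin_core": {"scientific_name": "ScientificName", "geography": "StateProvince"},
    }

    def translate(query, target):
        """Rewrite a meta-attribute query into one target's query syntax."""
        fields = TARGET_FIELDS[target]
        clauses = [f'{fields[attr]}="{value}"' for attr, value in query.items()]
        return " AND ".join(clauses)

    print(translate(client_query, "tropicos"))
    # NameLatin="Quercus alba" AND Region="Missouri"
    print(translate(client_query, "darwin_core"))
    # ScientificName="Quercus alba" AND StateProvince="Missouri"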

The NBDL user interface is shown in Figure 2.

[Figure 2. User Interface of NBDL: a search screen over federated collections with facets for subject, author, and title; the screenshot did not reproduce.]

4. SEMANTIC MAPPINGS
A key requirement of the CIS architecture is the development of semantics (or ontology) of information, which are then captured in semantic mappings for intelligent search engines such as Emerge. Semantic mapping extends the existing full-text, hypertext, and database indexing and mapping schemes to include semantics, or meanings, of information content. Furthermore, semantic mappings will reduce the complexity of biological taxonomy and classification, and correlate semantically the genotypes and phenotypes at various levels of biological information in NBDL.
An example of semantic mappings is common/scientific name mapping. A significant barrier for the educational use of resources such as TROPICOS is the lack of common-name indexing: species are referred to by scientific names, which are unfamiliar to most educational users. To address this problem, we are extending Gazebo's query translation engine using an approximate query translation algorithm [1]. Common names will be translated into scientific names by querying a name directory service such as ITIS during the query translation process [5]. Although the resulting query will not have the exact semantics of the original query, it will be a close approximation and will allow users to discover relevant information in one integrated step rather than requiring them to use two separate applications. This is an important innovation towards disseminating biological information into the educational community.
Under various current standards (e.g., the IEEE LOM and IMS standards), metadata structures are defined and collected for some specific educational domains. Since metadata structures will always be changing, we are developing "evolving metadata structures," which have adaptive semantic properties [2]. Metadata will evolve and grow throughout time and utilization. Semantic mappings are extended to schema mappings of these evolving metadata structures about learning objects and applets in our resources and services, correlating different nomenclatures, syntaxes and semantics. In the LOVE (Learning Object Virtual Exchange) of NBDL [6], we are developing this novel feature so that interoperability, reliability, reusability, and scalability will be maintained even under changes of networked resources and services.
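A minimal sketch of the common/scientific name rewriting step described above might look as follows. The hard-coded dictionary stands in for a live directory service such as ITIS [5], and the query shape is hypothetical.

    # Hedged sketch of approximate query translation: a common-name clause
    # is rewritten into scientific-name clauses. The mapping table is a
    # stand-in for an ITIS-style name directory lookup.

    COMMON_TO_SCIENTIFIC = {
        "white oak":   ["Quercus alba"],
        "sugar maple": ["Acer saccharum"],
    }

    def approximate_translate(query):
        """Rewrite a common-name clause into scientific-name clauses.
        The result approximates, rather than preserves, the original
        query's semantics."""
        out = dict(query)
        common = out.pop("common_name", None)
        if common is not None:
            out["scientific_name"] = COMMON_TO_SCIENTIFIC.get(common.lower(), [])
        return out

    print(approximate_translate({"common_name": "White Oak"}))
    # {'scientific_name': ['Quercus alba']}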

5. RICH NBDL COLLECTIONS
The TROPICOS botanical database at the Missouri Botanical Garden (MOBOT) contains 851,000 name records for plants and associated information on bibliography, types, nomenclature, usage, distribution, and morphology. The database currently contains over 1.5 million specimen records, mostly new collections gathered over the last 20 years with full locality data, coordinates, and elevation information. A literature file covers over 80,000 publications used for vouchering distribution and usage, along with authority files of people, books and journals, and geographical place names. Most recently, images from a variety of sources have been included to add visual impact to the wealth of textual data. At this time thousands of images of plant habitats, structure, type specimens and prologues are available for selected taxa. The web site provides a full range of on-demand and interactive HTML pages providing a scientific overview of the information accumulated around any of the scientific names in the production database TROPICOS.

6. ACKNOWLEDGMENTS
Our thanks to the other NBDL project members: Chip Bruce, Ann Bishop, and Bryan Heidorn. Their contributions to biology and education materials are essential to this project.

7. REFERENCES
[1] C. K. Chang and H. Garcia-Molina. Approximate query translation across heterogeneous information sources. Proc. of the 26th VLDB Conference, Sept. 2000.
[2] S. Chen. Digital Libraries: The Life Cycle of Information. Better Earth Publisher, 1998. http://www.amazon.com.
[3] "Emerge home page," http://emerge.ncsa.uiuc.edu.
[4] J. Frew, M. Freeston, L. Hill, G. Janée, M. Larsgaard, Q. Zheng. Generic Query Metadata for Geospatial Digital Libraries. IEEE Metadata 99.
[5] "Integrated Taxonomic Information System," http://www.itis.usda.gov/plantproj/itis/.
[6] NBDL. http://cecssrvl.cecs.missouri.edu/NSDLProjectl.
[7] MW3TROPICOS. http://mobot.mobot.org/Pick/Search/pick.html.

Automatic Identification and Organization of Index Terms for Interactive Browsing

Nina Wacholder, Columbia University, New York, NY, [email protected]
David K. Evans, Columbia University, New York, NY, [email protected]
Judith L. Klavans, Columbia University, New York, NY, [email protected]

ABSTRACT
The potential of automatically generated indexes for information access has been recognized for several decades (e.g., Bush 1945 [2], Edmundson and Wyllys 1961 [4]), but the quantity of text and the ambiguity of natural language processing have made progress at this task more difficult than was originally foreseen. Recently, a body of work on the development of interactive systems to support phrase browsing has begun to emerge (e.g., Anick and Vaithyanathan 1997 [1], Gutwin et al. [10], Nevill-Manning et al. 1997 [17], Godby and Reighart 1998 [9]). In this paper, we consider two issues related to the use of automatically identified phrases as index terms in a dynamic text browser (DTB), a user-centered system for navigating and browsing index terms: 1) What criteria are useful for assessing the usefulness of automatically identified index terms? and 2) Is the quality of the terms identified by automatic indexing such that they provide useful access to document content?
The terms that we focus on have been identified by LinkIT, a software tool for identifying significant topics in text [7]. Over 90% of the terms identified by LinkIT are coherent and therefore merit inclusion in the dynamic text browser. Terms identified by LinkIT are input to Intell-Index, a prototype DTB that supports interactive navigation of index terms. The distinction between phrasal heads (the most important words in a coherent term) and modifiers serves as the basis for a hierarchical organization of terms. This linguistically motivated structure helps users to efficiently browse and disambiguate terms. We conclude that the approach to information access discussed in this paper is very promising, and also that there is much room for further research. In the meantime, this research is a contribution to the establishment of a solid foundation for assessing the usability of terms in phrase browsing applications.

Keywords
Indexing, phrases, natural language processing, browsing, genre.

1. OVERVIEW
Indexes are useful for information seekers because they:
support browsing, a basic mode of human information seeking [17];
provide information seekers with a valid list of terms, instead of requiring users to invent the terms on their own (identifying index terms has been shown to be one of the hardest parts of the search process, e.g., [8]); and
are organized in ways that bring related information together [16].
But indexes are not generally available for digital libraries. The manual creation of an index is a time-consuming task that requires a considerable investment of human intelligence [16]. Individuals and institutions simply do not have the resources to create expert indexes for digital resources.
However, automatically generated indexes have been legitimately criticized by information professionals such as Mulvany 1994 [16]. Indexes created by computer systems are different than those compiled by human beings. A certain number of automatically identified index terms inevitably contain errors that look downright foolish to human eyes. Indexes consisting of automatically identified terms have been criticized on the grounds that they constitute indiscriminate lists, rather than a synthesized and structured representation of content. And because computer systems do not understand the terms they extract, they cannot record terms with the consistency expected of indexes created by human beings.
Nevertheless, the research approach that we take in this paper emphasizes fully automatic identification and organization of index terms that actually occur in the text. We have adopted this approach for several reasons:
1. Human indexers simply cannot keep up with the volume of new text being produced. This is a particularly pressing problem for publications such as daily newspapers, which are under particular pressure to rapidly create useful indexes for large amounts of text.
2. New names and terms are constantly being invented and/or published. For example, new companies are formed (e.g., Verizon Communications Inc.); people's names appear in the news for the first time (e.g., it is unlikely that Elian Gonzalez' name was in a newspaper before November 25, 1999); and new product names are constantly being invented (e.g., Handspring's Visor PDA). These terms frequently appear in print some time before they appear in an authoritative reference source.

3. Manually created external resources are not available for every corpus. Systems that fundamentally depend on manually created resources such as controlled vocabularies, semantic ontologies, or manually annotated text usually cannot be readily adapted to corpora for which these resources do not exist.
4. Differing indexing standards across agencies and organizations make reconciliation of indexes a difficult and time-consuming task. The difficulty of reconciliation is exacerbated when indexes are prepared by different organizations for different user groups, corpora and domains (Hert et al. 2000 [11]). Under some circumstances, it may be preferable to have one large automatically generated index than none at all.
5. Automatically identified index terms are useful in other digital library applications. Index term lists are essential for browsing, but can also form data as input to other applications such as information retrieval, summarization and classification [25], [1].
Given these information needs, our goal is to develop techniques for identifying and organizing index terms that reduce the number of terms that users need to browse and simultaneously maximize the informativeness of each term.
In this paper, we describe a method for identifying index terms for use in a dynamic text browser (DTB). We have implemented a prototype DTB called Intell-Index which supports interactive navigation of index terms, with hyperlinks to views of phrases in context and to full-text documents. The input to Intell-Index consists of noun phrases identified by LinkIT, a software tool for identifying significant topics in domain-independent text. (We describe this software in more detail in Section 2.)
However, little work has been done on the question of what constitutes useful index terms (Milstead 1994 [15], Hert et al. 2000 [11]). In order to move toward our goal, we have therefore found it necessary to identify properties of index terms that affect their usefulness in an electronic browsing environment. To assess the quality of the index terms, we consider three criteria especially pertinent to automatically identified index terms: coherence, thoroughness of coverage of document content, and a combined metric of quality and coverage that we call usefulness.
Coherence: Because computer systems are unable to identify terms with human reliability or consistency, they inevitably generate some number of junk terms that humans readily recognize as incoherent. We consider a very basic question: are automatically identified terms sufficiently coherent to be useful as access points to document content? To answer this question for the LinkIT output, we randomly selected .025% of the terms identified in a 250MB corpus and evaluated them with respect to their coherence. Our study showed that over 90% of the terms are coherent. Cowie and Lehnert 1996 [3] observe that 90% precision in information extraction is probably satisfactory for everyday use of results; this assessment is relevant here because the terms are processed by people, who can fairly readily ignore the junk if they expect to encounter it.
Thoroughness of coverage of document content: Because computer systems are more thorough and less discriminating, they typically identify many more terms than a human indexer would for the same amount of material. For example, LinkIT identifies about 500,000 non-unique terms for 12.27 MB of text. We address the issue of quantity by considering the number of terms that LinkIT identifies relative to the size of the original text from which they were extracted. This provides a basis for future comparison of the number of terms identified in different corpora and by different techniques.
Usefulness of index terms: Techniques for automatically identifying index terms abound. In order to make a preliminary assessment of the usefulness of index terms identified by LinkIT, we performed an experiment to measure users' perceptions of the usefulness of index terms. We presented users with lists of index terms identified by three domain-independent techniques and with the newspaper articles from which the terms had been extracted (Wacholder et al. 2000 [22]). In terms of quality alone, our results show that the technical term extraction method of Justeson and Katz 1995 [14] receives the highest rating. However, in terms of a combined quality and coverage metric, the Head Sorting method, described in Wacholder 1998 [21] and used by LinkIT, outperforms both other techniques.
Although the terms identified by LinkIT are the primary focus of our analysis, these criteria can be applied systematically to terms identified by other techniques. Techniques for identifying NPs abound and they are difficult to compare (Evans et al. 2000 [7]). With this work, we seek to establish a foundation for further in-depth analysis of the usability of automatically identified terms, which will include, inter alia, the observation of subjects performing information seeking tasks using index terms generated by different techniques.
We discuss term coherence, usefulness and thoroughness of coverage of document content in Section 3. But before we turn to this discussion, we describe the technology that we have developed to identify, display, and structure terms.
2. Technology
2.1 Towards a DTB
We have designed a domain-independent system called LinkIT, which uses the head sorting method to identify candidate index terms in full-text documents (Wacholder 1998 [21]; Evans 1998 [6]). Using a finite state machine compiled from a regular expression grammar, LinkIT parses text which has been automatically tagged with grammatical parts of speech. LinkIT can process approximately 4.11 MB of tagged text per second [7], [6].
The expressions identified by LinkIT are noun phrases (NPs), coherent units whose head (the most important word syntactically and semantically) is a noun. For example, filter is the head of the NPs coffee filter, oil filter, smut filter, and water filters at warehouse prices.
At the present time LinkIT identifies a subset of the NPs that occur in a document, simplex NPs. A complex NP such as a form of cancer-causing asbestos actually includes two simplex NPs, a form and cancer-causing asbestos. A system that lists only complex NPs would list only one term, a system that lists both simplex and complex NPs would list all three phrases, and a system that identifies only simplex NPs would list two. LinkIT identifies simplex NPs rather than complex ones for a practical reason: simplex NPs can be identified more reliably because they are structurally simpler. Compared to simplex NPs, the boundaries of complex NPs (e.g., symptoms that crop up decades later) are difficult to identify. The head is also more difficult to identify: for example, there are cases, such as group of children, where the syntactic head (group) is distinct from the semantic head (children). Complex NPs can be difficult for people to interpret, especially out of context. For example, the expression information about medicine for infants is ambiguous: in [[information about medicine] [for infants]], the information is for infants; in [information about [medicine for infants]], the medicine is for infants. The decision to include only simplex NPs in the DTB has important implications for the number of index terms included in the DTB, as discussed below.
Finally, LinkIT sorts the NPs by head and ranks them in terms of their significance based on head frequency. The intuitive justification for sorting SNPs by head is based on the fundamental linguistic distinction between head and modifier: in general, a head makes a greater contribution to the syntax and semantics of a phrase than does a modifier. This linguistic insight can be extended to the document level. If, as a practical matter, it is necessary to rank the contribution to a whole document made by the sequence of words constituting a phrase, the head should be ranked more highly than the other words in the phrase. This distinction is important in linguistic theory; for example, Jackendoff 1977 [13] discusses the relationship of heads and modifiers in phrase structure. It is also important in NLP, where, for example, Strzalkowski 1997 [19] and Evans and Zhai 1996 [5] have used the distinction between heads and modifiers to add query terms to information retrieval systems.
Powerful corpus processing techniques have been developed to measure deviance from an average occurrence or co-occurrence in the corpus. In this paper we chose to evaluate methods that depend only on document-internal data, independent of corpus, domain or genre. We therefore did not use, for example, tf*idf, the purely statistical technique that is used by most information retrieval systems, or Smadja 1993 [17], a hybrid statistical and symbolic technique for identifying collocations.
We have incorporated LinkIT output into a prototype DTB called Intell-Index (http://www.cs.columbia.edu/~nina/IntellIndex/indexer.cgi). Figure 1 shows the Intell-Index opening screen. The user selects the collection to be browsed and then may browse the entire set of index terms identified by LinkIT. (Note that this "collection" could also arise from the results of a search.) Alternatively, the user may enter a term and specify criteria to select a subset of terms that will be returned (e.g., heads only, or modifiers and heads). This gives the user better control over the results, so that browsing is more effective.
Figure 2 shows the beginning of the alphabetized browsing results for the specified corpus. As the user browses the terms returned by Intell-Index, she may choose to view a list of the contexts in which the terms are used; these contexts are sorted by document and ranked by normalized frequency in the document. This view is called index term in context (ITIC), based on its relationship to a simpler version, i.e., keyword in context (KWIC). If the information seeker decides that the list of ITICs is promising, she may view the entire document, or browse another view of the data.
At the present time, the DTB uses only simplex NPs. However, LinkIT collects information for conflating simplex NPs into complex ones; this will be added to Intell-Index at a later date.
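The following Python sketch conveys the flavor of finite-state simplex NP chunking over part-of-speech tags; it is not LinkIT itself, and the tagset and grammar are deliberately simplified (no proper-noun or gerund handling, for example).

    # A rough sketch in the spirit of the finite-state chunking described
    # above. A simplex NP is approximated here as a maximal run of
    # determiner/adjective/noun tags that ends in a noun; the last word
    # is taken as the head.

    NP_TAGS = {"DT", "JJ", "NN", "NNS"}
    NOUN_TAGS = {"NN", "NNS"}

    def simplex_nps(tagged):
        """Yield simplex NPs from a list of (word, tag) pairs."""
        chunk = []
        for word, tag in tagged + [("", "END")]:   # sentinel flushes last chunk
            if tag in NP_TAGS:
                chunk.append((word, tag))
            else:
                if chunk and chunk[-1][1] in NOUN_TAGS:
                    yield [w for w, _ in chunk]
                chunk = []

    tagged = [("the", "DT"), ("cigarette", "NN"), ("filters", "NNS"),
              ("of", "IN"), ("asbestos", "NN"), ("workers", "NNS")]
    for np in simplex_nps(tagged):
        print(np, "-> head:", np[-1])
    # ['the', 'cigarette', 'filters'] -> head: filters
    # ['asbestos', 'workers'] -> head: workers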

[Figure 1: Intell-Index opening screen. The screenshot shows a web form with controls to browse the entire index for a document collection (e.g., "Columbia Intl Affairs Online Sample"), to sort index terms by head (last word of phrase), and to search for a word or character string in the index, with options for string match, case match, and where in the phrase to look.]
[Figure 2: Browse term results. The screenshot shows 6675 terms matching a query, listed alphabetically by head with their expansions, e.g., ability: political ability; acceptance: widespread acceptance, broad acceptance; accord: bilateral accord, subsequent post-Soviet accord, bilateral nuclear cooperation accord.]

2.2 The head sorting technique
A key advantage of using simplex NPs rather than complex ones is that, at least in English, the last word of the NP is reliably the head (Wacholder 1998 [21]). Repetition of heads of phrases in a document indicates that the head represents an important concept in the document. As a result, no additional information other than that extracted from the document is required to sort the NPs by head.
Information about the frequency with which nouns are used as heads in documents can be used to provide users with useful views of the content of a single document or a collection of documents. Table 1 shows the topics identified as most important in a single article using the head sorting technique (Wall Street Journal 1988 [23]).

Table 1: Most significant terms in a document
  asbestos workers
  cancer-causing asbestos
  cigarette filters
  researcher(s)
  asbestos fiber
  crocidolite
  paper factory

For example, the list of phrases (which includes heads that occur above a frequency cutoff of 3 in this document, with content-bearing modifiers, if any) is a list of important concepts representative of the entire document.
Another view of the phrases enabled by head sorting is obtained by linking NPs in a document with the same head. A single-word NP can be quite ambiguous, especially if it is a frequently occurring noun like worker, state, or act. NPs grouped by head are likely to refer to the same concept, if not always to the same entity (Yarowsky 1993 [24]), and therefore convey the primary sense of the head as used in the text. For example, in the sentence "Those workers got a pay raise but the other workers did not", the same sense of worker is used in both NPs even though two different sets of workers are referred to. Table 2 shows how the word workers is used as the head of an NP in four different Wall Street Journal articles from the Penn Treebank; determiners such as a and some have been removed.

Table 2: Comparison of uses of worker as head of NPs across articles
  wsj_0003: workers ... asbestos workers
  wsj_0319: workers ... private sector workers ... private sector hospital workers ... nonunion workers ... private sector union workers
  wsj_0592: workers ... private sector workers ... United Steelworkers
  wsj_0492: workers ... United Auto Workers ... hourly production and maintenance workers

This view distinguishes the type of worker referred to in different articles, thereby providing information that helps rule in certain articles as possibilities and eliminate others. This is because the list of complete uses of the head worker provides explicit positive and implicit negative evidence about the kinds of workers discussed in the article. For example, since the list for wsj_0003 includes only workers and asbestos workers, the user can infer that hospital workers or union workers are probably not referred to in this document.
Term context can also be useful if terms are presented in document order. For example, the index terms in Table 3 were extracted automatically by the LinkIT system as part of the process of identification of all NPs in a document (Evans 1998 [6]; Evans et al. 2000 [7]).

Table 3: Topics, in document order, extracted from the first sentence of a Wall Street Journal article
  A form
  asbestos
  Kent cigarette filters
  a high percentage
  cancer deaths
  a group
  workers
  30 years
  researchers

For most people, it is not difficult to guess that this list of terms has been extracted from a discussion about deaths from cancer in workers exposed to asbestos. The information seeker is able to apply common sense and general knowledge of the world to interpret the terms and their possible relation to each other. At least for a short document, a complete list of terms extracted from a document in order can relatively easily be browsed in order to get a sense of the topics discussed in a single document.
In the next section we assess, qualitatively and quantitatively, the usability of automatically identified terms produced by LinkIT. The focus of this discussion is simplex NPs; in future work, we will discuss techniques for extending the informativeness of simplex NPs.
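The core of head sorting can be sketched in a few lines of Python. This shows only the grouping and frequency-ranking idea described above; LinkIT's actual ranking and conflation machinery is richer.

    # A minimal sketch of head sorting: group simplex NPs by their head
    # (last word) and rank heads in decreasing order of frequency.
    from collections import defaultdict

    nps = [["asbestos", "workers"], ["workers"], ["union", "workers"],
           ["cancer", "deaths"], ["asbestos", "fiber"], ["lung", "cancer"]]

    by_head = defaultdict(list)
    for np in nps:
        by_head[np[-1]].append(" ".join(np))

    # Heads in decreasing order of frequency; each head lists its expansions.
    for head in sorted(by_head, key=lambda h: len(by_head[h]), reverse=True):
        print(head, by_head[head])
    # workers ['asbestos workers', 'workers', 'union workers']
    # deaths ['cancer deaths']
    # fiber ['asbestos fiber']
    # cancer ['lung cancer']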

130 164 Head sorted NPs are identified by a method in which simplex Number of terms ranked at or noun phrases (as defined below) are sorted by head and then better than ranked in decreasing order of frequency (Wacholder 1998 [21]). Method 2 3 4 5 Table 5 shows examples of the index terms identified by the 27 75 124 166 different techniques. All technical terms are included; a sample of Keyword the terms identified by the other two techniques are included. Head sorted 41 96 132 160 NPs Keywords Head sorted NPs Technical terms 15 21 21 21 asbestos/asbestosis workers cancer deaths Technical Terms worker/workers asbestos workers lung cancer Table 7: Running total of terms identified at or below a specified /worked kent cigarette 160 workers rank cancer cancer dr. talcott death lung cancer cigarette filter This result is consistent with our observation that the technical term make method identifies the highest quality terms, but there are very few of asbestos u.s. them: an average of 7 per 500 words compared to over 50 for head lorillard cancer causing sorted NPs and for keywords. Therefore there is a need for fiber asbestos additional high quality terms. The list of head sorted NPs received a higher average rating than did the list of keywords, as shown in lung cancer deaths dr. Figure 2. This confirms our expectation that phrases containing ...... more content-bearing modifiers would be perceived as more useful index terms than would single word phrases consisting only of Table 5: Examples of terms, by technique for one document heads. 3.3 Thoroughness of coverage of document To compare the index terms, we presented subjects with an article from the Wall Street Journal [23] and a list of terms and asked them content to answer the following general question: "Would this term be Thoroughness of coverage of document content is a standard useful in an electronic index for this article?" Terms were rated on a criterion for evaluation of traditional indexes [11]. In order to scale of 1to 5, where 1 indicates a high quality term that should establish an initial measure of thoroughness, we evaluate number of definitely be included in the index and 5 indicates a junk term that terms identified relative to the size of the text. definitely should not be included. For example, the phrase court- Table 8 shows the relationship between document size in words and approved affirmative action plans received an average rating of 1 number of NPs per document. For example, for the AP corpus, an meaning thatit was ranked as definitely useful; the keyword average document of 476 words typically has about 127 non-unique affirmative received an average rating of 3.75, meaning that it was NPs associated with it. In other words, a user who wanted to view less useful; and the keyword action.received an average ranking of the context in which each NP occurred would have to look at 127 4.5, meaning that it was not useful. Table 6 shows the results, contexts. (To allow for differences across corpora, we report on averaged over three articles. overall statistics and per corpus statistics as appropriate.)

Keywords Head Sorted Technical Corpus Avg. Doc Size Avg. number of NPs Terms NPs/doc Average 3.27 2.89 1.79 AP 2.99K 127 ranking (476 words) Table 6 : Average rating of types of index terms FR 7.70K 338 (1175 words) Of the three lists of index terms, technical termss received the highest ratings for all three documentsan average of 1.79 on the WSJ 3.23K 132 scale of 1 to 5, with 1 being the best rating. The head sorted NPs (487 words) came in second, with an average of 2.89, and keywords came in last ZIFF 2.96K 129 with an average of 3.27. However, it should be noted that the average ranking of terms (461 words) conceals the fact that the number of technical terms is much lower Table 8: NPs per document than the other two types of terms. In contrast, Table 7, which shows the total number of terms rated at or below specified The numbers in Table 8 are important because they vary depending on the technique used to identify NPs. A human indexer readily rankings, allows us to measure quality and coverage. (1is the chooses whichever type of phrase is appropriate for the content, but highest rating; 5 is the lowest.) natural language processing systems cannot do this reliably. Because of the ambiguity of natural language, it is much easier to identify the

131

The option of including both complex and simple forms was adopted by Tolle and Chen (2000) [20]. They identify approximately 140 unique NPs per abstract for 10 medical abstracts. They do not report the average length in words of the abstracts, but a reasonable guess is about 250 words per abstract. On this calculation, the ratio between the number of NPs and the number of words in the text is .56. In contrast, LinkIT identifies about 130 NPs for documents of approximately 475 words, for a ratio of .27. The index terms represent the content of different units: the 140 index terms represent the abstract, which is itself only an abbreviated representation of the document. The 130 terms identified by LinkIT represent the entire text, and our intuition is that it is better to provide coverage of full documents than of abstracts. Experiments to determine which technique is more useful for information seekers are needed.

For four large full-text corpora, we extracted all occurrences of all NPs (duplicates not removed) in each corpus; Table 9 lists the size of the index once duplicate NPs have been removed. The numbers in parentheses are the number of words per corpus for the full-text column, and the percentage of the full-text size for the non-unique and unique NP columns.

Corpus   Full Text                        Non-Unique NPs   Unique NPs
AP       12.27 MB (2.0 million words)     7.4 MB (60%)     2.9 MB (23%)
FR       33.88 MB (5.3 million words)     20.7 MB (61%)    5.7 MB (17%)
WSJ      45.59 MB (7.0 million words)     27.3 MB (60%)    10.0 MB (22%)
ZIFF     165.41 MB (26.3 million words)   108.8 MB (66%)   38.7 MB (24%)

Table 9: Corpus Size

In general, the number of unique NPs increases much faster than the number of unique heads; this can be seen in the fall of the ratio of unique heads to NPs as the corpus size increases.

Corpus   Size in MBs   Unique NPs   Unique Heads   Ratio of Unique Heads to NPs
AP       12            156798       38232          24%
FR       34            281931       56555          20%
WSJ      45            510194       77168          15%
ZIFF     165           1731940      176639        10%
Total    256           2490958      254724        10%

Table 10: Number of unique NPs and heads

Table 10 is interesting for a number of reasons:

1) the variation in the ratio of heads to NPs per corpus; this may well reflect the diversity of AP and the FR relative to the WSJ and especially Ziff.

2) the ratio of unique heads to NPs decreases monotonically as the size of the corpus increases. This is because the heads are nouns. No dictionary can list all nouns; the list of nouns is constantly growing, but at a slower rate than the possible number of NPs.

In general, the vast majority of heads have two or fewer different possible expansions. There is a small number of heads, however, that have a large number of expansions. For these heads, we could create a hierarchical index that is displayed only when the user requests further information on the particular head (a small sketch of such a grouping follows Table 11). In the data that we examined, the heads had on average about 6.5 expansions, with a standard deviation of 47.3.

Corpus   Max     % <= 2   % >2 and <50   % >= 50   Avg    Std. Dev.
AP       557     72.2%    26.6%          1.2%      4.3    13.63
FR       1303    76.9%    21.3%          1.8%      5.5    26.95
WSJ      5343    69.9%    27.8%          2.3%      7.0    46.65
ZIFF     15877   75.9%    21.6%          2.5%      10.5   102.38

Table 11: Average number of head expansions per corpus
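To make the head-sorted, hierarchical index concrete, the following small Python sketch (our illustration, not code from the LinkIT system; the sample phrases are invented) groups simplex NPs by their final word and reports per-head expansion statistics of the kind summarized in Table 11:

    from collections import defaultdict
    from statistics import mean, stdev

    # Hypothetical simplex NPs; in the study these came from LinkIT output.
    simplex_nps = [
        "operating system", "file system", "expert system",
        "computer network", "neural network",
        "legacy database", "relational database",
    ]

    def head_of(np):
        # For an English simplex NP, the head is normally the final noun.
        return np.split()[-1]

    # Head-sorted index: each head maps to the set of NPs that expand it.
    index = defaultdict(set)
    for np in simplex_nps:
        index[head_of(np)].add(np)

    # Hierarchical display: list the heads; show expansions only on request.
    for head in sorted(index):
        print(head, "->", sorted(index[head]))

    # Per-head expansion counts, analogous to the statistics in Table 11.
    counts = [len(nps) for nps in index.values()]
    print("max:", max(counts), "avg:", round(mean(counts), 2),
          "std dev:", round(stdev(counts), 2))

Displaying only the sorted heads, and expanding a head into its full list of NPs on request, is one way to realize the hierarchical index described above.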

The number of NPs in Table 9 reflects the number of occurrences (tokens) of NPs. Interestingly, the percentages are relatively consistent across corpora. From the point of view of the index, however, the first column in Table 9 represents only a first-level reduction in the number of candidate index terms: for browsing and indexing, each term need be listed only once. After duplicates have been removed, approximately 1% of the full text remains for heads, and 22% for NPs. The implications of this are explored in Section 4.

4. Reducing documents to NPs

In this section, we consider how information about NPs and their heads can help facilitate effective browsing by reducing the number of terms that an information seeker needs to look at. The most frequent head in the Ziff corpus, a computer publication, is "system". Additionally, these terms have not been filtered; we may be able to greatly narrow the search space if the user can provide us with further information about the type of terms they are interested in. For example, using simple regular expressions, we are able to roughly categorize the terms that we have found into four categories: NPs, SNPs that look like proper nouns, SNPs that look like acronyms, and SNPs that start with non-alphabetic characters. It is possible to narrow the index to one of these categories, or to exclude some of them from the index; a sketch of such a filter follows.
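The paper does not reproduce the actual regular expressions, so the following Python sketch is only a rough illustration of such a four-way triage; the patterns and the sample terms are our own assumptions:

    import re

    def categorize(term):
        # Roughly sort a term into one of the four categories of Table 12.
        words = term.split()
        if not term[0].isalpha():
            return "non-alphabetic"      # e.g. "3Com"
        if len(words) == 1 and re.fullmatch(r"[A-Z]{2,}", term):
            return "acronym"             # e.g. "NASA"
        if all(w[0].isupper() for w in words):
            return "proper noun"         # e.g. "New York"
        return "ordinary NP"             # e.g. "operating system"

    for t in ["NASA", "New York", "3Com", "operating system"]:
        print(t, "->", categorize(t))

A user searching for a person could then be shown only terms in the "proper noun" category, while terms in the "non-alphabetic" category could be excluded from a general-purpose index.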

Corpus   # of SNPs   # of Proper Nouns   # of Acronyms   # of Non-alphabetic Elements
AP       156798      20787 (13.2%)       2526 (1.61%)    12238 (7.8%)
FR       281931      22194 (7.8%)        5082 (1.80%)    44992 (15.95%)
WSJ      510194      44035 (8.6%)        6295 (1.23%)    63686 (12.48%)
ZIFF     1731940     102615 (5.9%)       38460 (2.22%)   193340 (11.16%)
Total    2490958     189631 (7.6%)       45966 (1.84%)   300373 (12.06%)

Table 12: Number of SNPs by category

For example, over all of the corpora, about 10% of the SNPs start with a non-alphabetic character; these can be excluded if the user is searching for a general term. If we know that the user is searching specifically for a person, then we can use the list of proper nouns as index terms, further narrowing the search space to approximately 10% of the possible terms. We regard this technique as an important first step toward reducing the number of terms that users must browse in a DTB.

5. CONCLUSION

When we began working on this paper, our goal was simply to assess the quality of the terms automatically identified by LinkIT for use in electronic browsing applications. Through an evaluation of the results of an automatic index term extraction system, we have shown that automatically generated indexes can be useful in a dynamic text-browsing environment such as Intell-Index for enabling access to digital libraries.

We found that natural language processing techniques have reached the point of being able to reliably identify terms that are coherent enough to merit inclusion in a dynamic text browser: over 93% of the index terms extracted for use in the Intell-Index system were shown to be useful index terms in our study. This number is a baseline; the goal for us and others should be to improve on it.

We have also demonstrated how sorting index terms by head makes them easier to browse. The possibilities for additional sorting and filtering of index terms are many, and our work suggests that these possibilities are worthy of exploration. Our results have implications for our own work and also for the research on phrase browsers referred to in Section 1.

As we conducted this work, we discovered that there are many unanswered questions about the usability of index terms. In spite of a long history of indexes as an information access tool, there has been little research on questions such as the following:

3. How is the usefulness of index terms affected by the browsing environment, the domain of the document, and the user's expertise?

4. From the point of view of representation of document content, what is the optimal relationship between the number of index terms and document size?

5. What number of terms can information seekers readily browse? Do these numbers vary depending on the skill and domain knowledge of the user?

Because of the need to develop new methods to improve access to digital libraries, answering questions about index usability is a research priority in the digital library field. This paper makes two contributions: a description of a linguistically motivated method for identifying and browsing index terms, and the establishment of fundamental criteria for measuring the usability of terms in phrase browsing applications.

6. ACKNOWLEDGMENTS

This work has been supported under NSF Grant IRI-97-12069, "Automatic Identification of Significant Topics in Domain Independent Full Text Analysis", PIs: Judith L. Klavans and Nina Wacholder, and NSF Grant CDA-97-53054, "Computationally Tractable Methods for Document Analysis", PI: Nina Wacholder.

7. References

[1] Anick, Peter and Shivakumar Vaithyanathan (1997) "Exploiting clustering and phrases for context-based information retrieval", Proc. of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), pp. 314-323.

[2] Bush, Vannevar (1945) "As we may think", Atlantic Monthly. Available from http://www.theatlantic.com/unbound/flashbks/computer/bushf.htm

[3] Cowie, Jim and Wendy Lehnert (1996) "Information extraction", Communications of the ACM, 39(1): 80-91.

[4] Edmundson, H. P. and Wyllys, W. (1961) "Automatic abstracting and indexing: survey and recommendations", Communications of the ACM, 4: 226-234.

[5] Evans, David A. and Chengxiang Zhai (1996) "Noun-phrase analysis in unrestricted text for information retrieval", Proc. of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 17-24. 24-27 June 1996, University of California, Santa Cruz, California, Morgan Kaufmann Publishers.

[6] Evans, David K. (1998) LinkIT Documentation, Columbia University Department of Computer Science Report. Available at

[9] Godby, Carol Jean and Ray Reighart (1998) "Using machine-readable text as a source of novel vocabulary to update the Dewey Decimal Classification", presented at the SIG-CR Workshop, ASIS, <http://orc.rsch.oclc.org:5061/papers/sigcr98.html>.

[10] Gutwin, Carl, Gordon Paynter, Ian Witten, Craig Nevill-Manning and Eibe Frank (1999) "Improving browsing in digital libraries with keyphrase indexes", Decision Support Systems 27(1-2): 81-104.

[11] Hert, Carol A., Elin K. Jacob and Patrick Dawson (2000) "A usability assessment of online indexing structures in the networked environment", Journal of the American Society for Information Science 51(11): 971-988.

[12] Hodges, Julia, Shiyun Yie, Ray Reighart and Lois Boggess (1996) "An automated system that assists in the generation of document indexes", Natural Language Engineering 2(2): 137-160.

[13] Jackendoff, Ray (1977) X-Bar Syntax: A Study of Phrase Structure, MIT Press, Cambridge, MA.

[14] Justeson, John S. and Slava M. Katz (1995) "Technical terminology: some linguistic properties and an algorithm for identification in text", Natural Language Engineering 1(1): 9-27.

[15] Milstead, Jessica L. (1994) "Needs for research in indexing", Journal of the American Society for Information Science.

[16] Mulvany, Nancy (1993) Indexing Books, University of Chicago Press, Chicago, IL.

[17] Nevill-Manning, Craig G., Ian H. Witten and Gordon W. Paynter (1997) "Browsing in digital libraries: a phrase-based approach", Proc. of DL97, Association for Computing Machinery Digital Libraries Conference, pp. 230-236.

[18] Smadja, Frank, Kathy McKeown and Vasileios Hatzivassiloglou (1996) "Translating collocations for bilingual lexicons: a statistical approach", Computational Linguistics 22(1): 1-38.

[19] Strzalkowski, Tomek (1997) "Building effective queries in natural language information retrieval", Proc. of the 5th Applied Natural Language Conference (ANLP-97), Washington, DC, pp. 299-306.

[20] Tolle, Kristin M. and Hsinchun Chen (2000) "Comparing noun phrasing techniques for use with medical digital library tools", Journal of the American Society for Information Science 51(4): 352-370.

[21] Wacholder, Nina (1998) "Simplex noun phrases clustered by head: a method for identifying significant topics in a document", Proc. of the Workshop on the Computational Treatment of Nominals, edited by Federica Busa, Inderjeet Mani and Patrick Saint-Dizier, pp. 70-79. COLING-ACL, October 16, 1998, Montreal.

[22] Wacholder, Nina, David Kirk Evans and Judith L. Klavans (2000) "Evaluation of automatically identified index terms for browsing electronic documents", Proc. of the Applied Natural Language Processing and North American Chapter of the Association for Computational Linguistics (ANLP-NAACL) 2000, Seattle, Washington, pp. 302-307.

[23] Wall Street Journal (1988) Available from the Penn Treebank, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

[24] Yarowsky, David (1993) "One sense per collocation", Proc. of the ARPA Human Language Technology Workshop, Princeton, NJ, pp. 266-271.

[25] Zhou, Joe (1999) "Phrasal terms in real-world applications". In Natural Language Information Retrieval, edited by Tomek Strzalkowski, Kluwer Academic Publishers, Boston, pp. 215-259.

Public Use of Digital Community Information Systems: Findings from a Recent Study with Implications for System Design

Karen E. Pettigrew
Assistant Professor, The Information School
University of Washington
Box 352930
Seattle, Washington, 98195-2930, USA
Voice: 206-543-6238; Fax: 206-616-3152
[email protected]

Joan C. Durrance
Professor, School of Information
University of Michigan
304 West Hall, 550 East University Ave
Ann Arbor, MI 48109-1092
Voice: 734-763-1569; Fax: 734-764-2475
[email protected]

ABSTRACT

The Internet has considerably empowered libraries and changed common perceptions of what they entail. Public libraries, in particular, are using technological advancements to expand their range of services and enhance their civic roles. Providing community information (CI) in innovative, digital forms via community networks is one way in which public libraries are facilitating everyday information needs. These networks have been lauded for their potential to strengthen physical communities by increasing information flow about local services and events, and by facilitating civic interaction. However, little is known about how the public uses such digital services and what barriers they encounter. This paper presents findings about how digital CI systems benefit physical communities, based on extensive case studies in three states. At each site, rich data were collected using online surveys, field observation, in-depth interviews and focus groups with Internet users, human service providers and library staff. Both the online survey and the follow-up interviews with respondents were based on sense-making theory. In our paper we discuss our findings regarding: (1) how the public is using digital CI systems for daily problem solving, and (2) the types of barriers they encounter. Suggestions for improving digital CI systems are provided.

Categories and Subject Descriptors: DL impact, user studies, case studies, evaluation methods, communities of use/practice, social informatics

General Terms: measurement, performance, human factors, theory

Keywords: community information, community networks, information behavior, barriers, sense-making, qualitative methods

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL '01, June 24-28, 2001, Roanoke, Virginia, USA.
Copyright 2001 ACM 1-58113-345-6/01/0006...$5.00.

1. BACKGROUND

Every day, citizens require equitable and easy access to local resources that can help them deal with the myriad of situations that arise through daily living. Yet all people, regardless of their occupation, education, financial status, or social ties, encounter situations in which they experience great difficulty in recognizing, expressing and meeting their needs for such community information (Chatman, 1996, in press; Chen and Hernon, 1982; Dervin et al., 1976; Durrance, 1984a; Harris and Dewdney, 1994; Pettigrew, 2000; Pettigrew et al., 1999). Financial, physical, geographic and cultural barriers also prevent individuals from successfully seeking information. As a result, many people cannot obtain important information, access needed services, or participate fully in their community's daily life. While information technologies hold significant promise for linking individuals with information and one another, they are overshadowed by the potential for a deeper digital divide between the information rich and the information poor.

Public libraries have long recognized the importance of community information (CI) for creating and sustaining healthy communities. Comprising three elements, survival or human services information, local information, and citizen action information (Durrance, 1984b), CI can be broadly defined as:

    any information that helps citizens with their day-to-day problems and enables them to participate [in their] community. It is all information pertaining to the availability of human services, such as healthcare, financial assistance, housing, transportation, education, and childcare services; as well as information on recreation programs, clubs, community events, and information about all levels of government (Pettigrew, 1996, p. 351).

Since the 1970s public libraries have facilitated citizens' access to CI by providing information and referral (I&R) services, and by organizing and supporting community-wide information initiatives with local service providers (Baker and Ruey, 1988; Childers, 1984). The Internet, along with high-speed personal computers, modems, and graphical interfaces, has suggested new ways for libraries to facilitate citizens' information needs through digital CI systems. One such digital collaboration in which libraries have taken a leading role, and which is flourishing throughout the world, is community networking.

Since the late 1980s libraries have played pivotal roles in developing community networks (community-wide electronic consortia) that provide citizens with equitable access to the Internet for obtaining CI and communicating with others (Cisler, 1996; Durrance, 1993, 1994; Durrance and Pettigrew, 2000; Durrance and Schneider, 1996; Gurstein, 2000). Often organized and designed by librarians, these digital networks provide citizens with one-stop shopping using community-oriented discussions, question-and-answer forums, access to governmental, social services, and local information, email, and Internet access (Schuler, 1994; 1996).

While individuals may interact with other users by posting queries, monitoring discussions, and so on, CI is often a central network feature that appears in many forms: libraries, for example, may mount their databases on the Internet, while individual service providers may post information about their programs and services. Thus, the architecture of the Internet makes digital CI possible by linking information files created not only by single organizations such as libraries, but by agencies, organizations, and individuals throughout the community (and, of course, the world). This is a major departure from traditional I&R services, where librarians and other CI agency staff work with files about the community that are created on an internal library system. As a result of digital CI systems delivered via community networks, people can access CI through public library terminals while seeking help with related search problems from librarians. In short, digital systems mean that citizens can access CI at any time, in any place.

Despite the lauding of community networks' potential for strengthening physical communities through increased digital CI flow and civic interaction, findings from recent studies (e.g., Kraut et al., 1999; Nie and Erbring, 2000) suggest that Internet use has the reverse effect, isolating individuals and decreasing interpersonal interaction; these findings gain greater importance given Putnam's (1995, 2000) observation regarding the decline of social capital in physical communities. Thus, life in an electronic world poses several fundamental problems for research. Two questions that are only beginning to be addressed are:

(1) How do individuals use digital CI systems when seeking help for daily situations? and

(2) How do public library-community network initiatives strengthen communities?

To date, little is known about how access to digital CI systems helps (or does not help) citizens with daily living, how CI affects their information behavior, and how it may or may not benefit communities. In a recent literature review (Pettigrew, Durrance, and Vakkari, 1999), we observed that research interest in citizens' use of networked CI is increasing. However, the majority of papers were applied and descriptive in nature and were based on questionnaires or analyzed transaction log data that revealed user socio-demographics and system or page use frequency (e.g., Geffert, 1993; Harsh, 1995; Harvey and Horne, 1995; Patrick, 1996, 1997; Patrick and Black, 1996a&b; Patrick et al., 1995; Schalken and Tops, 1994), which confirms Savolainen (1998). Most studies were from the professional literature and reported conflicting user and use statistics, especially regarding user socio-demographics. In this sense, the digital CI system literature has been akin to the general public library literature that Zweizig and Dervin (1977) criticized as providing little insight into the uses that people make of information and information systems. One study of particular note, however, is Bishop et al. (1999). Through interviews and focus groups in low-income neighborhoods with users and potential users of the Prairienet community network, they identified the following categories of digital CI need: community services and activities, resources for children, healthcare, education, employment, crime and safety, and general reference tools. They recommended that libraries might provide more effective digital information services if they focus on ways that complement citizens' lifestyles, constraints and information-seeking patterns.

2. CURRENT STUDY

Our research questions addressed the situations that prompt citizens to use (or not use) digital CI systems for everyday help, the specific types of CI that they seek, how they deal with the different barriers that they encounter, and how they are helped by the CI that they obtain. Our study also focused on how public libraries and community service providers perceive that digital CI systems help their clients, their own organizations, and the community at large. We were particularly interested in how the public's perceptions of digital CI systems related to those of service providers and librarians.

Since our study was exploratory and aimed at yielding rich data, we used multiple methods over several stages. Stage 1 comprised a national survey of 500 medium and large-sized public libraries regarding their involvement with digital CI systems. For Stage 2, we used a standard design to conduct intensive case studies in three communities (Table 1) that had received national recognition for their respective community networks and in which the local public library system played a leading role.

Data collection methods at each site included (a) an online survey of, and follow-up telephone interviews with, adult community network users who access "tagged" CI web pages, along with (b) in-depth interviews, field observation and focus groups with public library-community network staff, local human service providers, and members of the public. The survey was posted (during different time periods) on the main CI page of each network. The steps we took to address methodological considerations when conducting online surveys (as discussed by Witte et al. (2000) and Zhang (2000)) are discussed in an earlier paper (Pettigrew and Durrance, 2000). The number of days each survey ran and the total number of responses for each network are summarized in Table 2.

Table 1. Overview of Data Collection Sites

Site                       Counties/Areas Served                       Public Library System            Community Network Name (URL)          Est.
Northeastern Illinois      Cook, Du Page, Kane, Lake, McHenry & Will   Suburban Library System          NorthStarNet (nsn.nslsilus.org)       1995
Pittsburgh, Pennsylvania   Southwestern Pennsylvania                   Carnegie Library of Pittsburgh   Three Rivers Free-Net (trfn.pgh.pa)   1995
Portland, Oregon           Multnomah County                            Multnomah County Library         CascadeLink (www.cascadelink.org)     1996

Table 2. Overview of Online Survey Responses

Community Network / Area Served            # Days Posted   # Responses   Gender (M/F/NA)   Age (18-25/25-35/36-45/46-55/56-65/66+/NA)
NorthStarNet / Northeastern Illinois       60 days         34            10/20/4           6/9/9/5/2/1/2
Three Rivers Free-Net / Pittsburgh, PA     90 days         123           57/61/5           10/30/22/30/15/9/7
CascadeLink / Multnomah County, Portland   70 days         40            17/20/3           5/7/9/11/3/2/3
Total                                      220 days        197           84/101/12         21/46/40/46/20/12/12

Both the user survey and follow-up interviews were based on Dervin's sense-making theory (cf. Dervin, 1992; Savolainen, 1993), which comprises a set of user-centered assumptions and methods for studying the uses individuals make of information systems. It asserts that throughout daily life, people encounter gaps in their knowledge that they can only fill or bridge (in Dervin's terms) by making new sense of their situations through seeking information. Thus they use varied strategies to seek and construct information from different resources or ideas as they cope with different barriers. Sense-making facilitates the study of different aspects of information behavior. Our research included two such aspects: (1) users' assessments of the helpfulness of digital CI, and (2) users' and service providers' constructions or images of these systems. Both were investigated using the micro-moment time-line technique, in which respondents are asked "to reconstruct a situation in terms of what happened (time-line steps) [and then] to describe each step in detail" (p. 70); this enabled us to gather and compare the perceptions of different players regarding how CI is constructed and used through electronic communication. The framework's social constructionist orientation suggested it would be viable for studying citizens' online information behavior. In addition to the sense-making propositions, we examined our qualitative data for themes such as indicators of social capital, and analyzed our quantitative data for patterns such as the relationship between users' perceptions of how they were helped by the digital CI and their willingness to access it again for help in similar situations.

In the remainder of this paper, we share our findings regarding: (1) how the public is using digital CI systems (i.e., their information needs), and (2) the barriers they encounter in the process. Suggestions for improving the design of digital CI systems are also discussed. In future publications we address how users are helped by digital CI systems and how these systems contribute to building social capital at the individual and community levels.

3. How the Public is Using Digital CI Systems

The respondents' age groups followed a normal distribution, with most respondents (71.4%) falling between the ages of 25 and 55, while slightly more women (54.6%) responded than men. Thus our findings suggest that a typical user is non-existent, socio-demographically speaking: users equally represent both genders, a distributed range of age groups, and a diverse range of occupations, from students to blue-collar workers to white-collar professionals. Moreover, our respondents comprised both first-time or novice users and very experienced searchers.

Our respondents reported that they use digital CI systems for many different types of situations, including those of a personal nature and those regarding the workplace. This confirms a tenet of information behavior, namely that all individuals require community information at one point or another and that it is the individual's situation that reveals most insight into information seeking and use (Harris and Dewdney, 1994). We found that users seek the following types of digital CI (in alphabetical order):

Business
Computer and Technical Information
Education
Employment Opportunities
Financial Support
Governmental and Civic
Health
Housing
Library Operations and Services
Local Events
Local History and Genealogy
Local Information (local accommodations, community features)
Local News (weather, traffic, school closures)
Organizations and Groups
Other People (both local and beyond the community)
Parenting
Recreation and Hobbies
Sale, Exchange, or Donation of Goods
Social Services
Volunteerism

These categories are markedly different from those traditionally used to classify CI needs. Moreover, they also broaden the findings reported by Bishop et al. (1999). Notable differences between our categories and those reported in CI studies conducted prior to the Internet are: (1) a strong emphasis on employment opportunities, volunteerism, and social service availability; and (2) the inclusion of such new categories as sale, exchange and donation of goods; local history and genealogy; local news; computer and technical information; and other people (residing both within and beyond the community).

What is the reason for this emphasis on employment information, and for the emergence of novel categories? Our analysis indicates that the Internet is responsible. Increased computer capabilities and online connectivity have enabled many different types of service providers to make information available about themselves that was previously unavailable, or available only in limited amounts, via a public library's CI record. In other words, service providers are now able to share information about themselves first-hand.


Prior to the Internet, such information was largely available only on paper and had to be searched manually, often through intermediaries (although many public libraries maintained electronic in-house databases, these were seldom available to the public for direct end-user searching). The breadth of CI available, along with new search engine and software capabilities, has contributed to extending the notion of what CI comprises. Just as the Internet is broadening our concept of community, so it is changing the scope of community information. Due to digital CI systems, people can search for other people online, sell and trade goods, research their family history, and exchange neighborhood information, all at a faster, more immediate pace. Increased access to the Internet, and hence to community networks, especially via public libraries, has led to an increased public awareness of what's available, what's going on, and what might be found in a community. This enhanced access is undoubtedly facilitating CI flow. Whereas people once relied on conversations over backyard fences, postings on notice boards at supermarkets, and local newspapers, they are now drawing upon the capabilities of the Internet, as fueled by public library efforts, to seek and share information about their communities.

While the above categories are useful for understanding the types of digital CI that users seek, further insights are gained when one considers the actual situations. The following are just a few examples:

Teenagers used the network to find summer employment because it has all the local job information in one place and is trusted as a reliable, current source;

A senior used the network to find out about an important upcoming town council meeting;

A man looking for a local directory of gay and lesbian organizations searched the Web, but only came across national resources. The network directed him to the exact local organization he needed;

A homebound person used the network to research his family's genealogy because of its comprehensive organization of local resources, including public library, county agency and local historical association materials;

A former resident organized a family reunion from across the country, using the network to arrange everything from activities to hotels;

A woman used the network to learn about local government information, such as current ordinances pertaining to matters ranging from trash pick-up to flood damage prevention, and to identify sources of funding for a community service project intended to help a nearby low-income community;

A man, who sometimes uses the network to find miscellaneous information, said he uses it "mostly for help with lung cancer and possible cures or ways of living longer whether it be conventional or alternative medicine."

According to sense-making theory, information needs cannot be considered in isolation from the situations that create them, since any situation is likely to yield multiple information needs; i.e., information found for one aspect of a query frequently opens another, related information need. As we found, the situations for which users sought digital CI were complex and usually required multiple pieces of information. In this sense, our users described how their searches were ongoing and how they anticipated having to pose several different queries or consult multiple sources. This notion of the ongoing search is similar to Bates' (1989) "berrypicking" concept, in which users search for information "a bit at a time" and alter their search strategies according to what they find and what barriers they encounter.

Beyond analyzing the CI that users sought by need category and situation type, we also focused on the information's enabling aspects, i.e., the attributes of the information that would aid users in whatever it was they were trying to accomplish. This approach builds on Dervin's notion of "verbings." We derived the following "information enabling" categories for classifying the types of CI requested:

o Comparing (similar to verifying but may come earlier in the cognitive process)
o Connecting (how to find people with related interests, etc.)
o Describing (services offered, cost, eligibility, etc.)
o Directing (information about where something is located or how to get somewhere)
o Explaining (in-depth, content-oriented information that explains how something works)
o Problem solving (information that will help bridge a gap or solve a problem)
o Promoting (want others to know about them, e.g., that they're available for employment, that they've started a new club, etc.)
o Relating (information that is relevant to the individual's needs and situational constructs as perceived by the individual)
o Trusting (information that individuals perceive as coming from a trusted source. This is similar to high-quality CI, i.e., CI that is accurate and current, which people said they wanted)
o Verifying (a form of corporate intelligence; people want to keep up with what their competition is doing, be aware of new trends, etc.)

These "enabling" attributes provide a novel way of viewing information needs because they focus on what users are trying to accomplish in a particular situation. When considered in conjunction with (a) the user's initial need (as presented to the digital CI system either by point-and-click or by typing a search phrase), (b) the situation that prompted that need, and (c) what is known about the barriers that users encounter (discussed later), these enabling categories reveal several implications for the design of digital CI systems.

4. Other Findings Regarding the Public's Online Information Behavior

Several other novel themes emerged regarding citizens' online information behavior that contribute to the literature and may aid in digital CI system design. For example, respondents indicated that they often tried other sources (e.g., friends, newspapers, telephone directories) for help with their questions before turning to the system. Such was the case for a user from Pittsburgh, who accessed the Three Rivers Free-Net after friends and coworkers told him that it contained job listings and after other sources, such as local newspapers, had proven unsuccessful. Since the 1960s, information science research has indicated that social ties and face-to-face communication are primary sources of information, regardless of the setting (home, workplace, school, etc.).

Our findings suggest that this remains the case: the Internet has not replaced the role of social ties in citizens' information behavior. During our interviews, several respondents described how they spoke about their information need or situation with a social tie before searching online. Thus, we found that the Internet is supplementing other information-seeking behaviors in addition to creating new pathways for obtaining information: the public is using digital CI systems as an additional source. Moreover, we learned that people want their community networks to promote social interaction by bringing people together. This notion was expressed by a user who said: "a bulletin board or some way to facilitate people meeting each other and getting around would be very helpful. I've recently moved to town and am looking for ways to meet people. Maybe a place where people could find others who are interested in a supper club, or playing cards, or informal sporting groups, etc."

Users also tended to be highly confident that they could find what they needed through the community network. Despite the difficulties with using the Internet noted in previous studies, such as lack of content, low retrieval rates with search engines, and inaccurate information, our respondents tended to perceive their community network as a ubiquitous source and gateway to all knowledge. In this sense we identified a mismatch between what users think they can obtain via the Internet and the likelihood that that information exists and can be easily located. This finding expands on a principle of everyday information behavior: that a mismatch exists between what users believe service providers offer and what they actually do (Harris and Dewdney, 1994). Another plausible explanation is that users are transferring their mental model of what public libraries contain and how they function to the Internet in general. In other words, users hold the same "information" expectations of community networks and the Internet that they associate with public libraries. The difficulty here, of course, is that public libraries and the Internet are not the same thing: they provide different sorts of information in vastly different ways, with the roles played by professional librarians making a critical difference. An interesting and representative example of users' perceptions of the Internet and community networks came from a young man who asserted that the Internet and community network provided non-biased information, something he associated with public libraries. Later, acknowledging that sometimes information is "sensationalized," he added that he tries to balance information retrieved from the Internet with that gleaned from other sources before making a final decision.

On a different theme, it was interesting how some respondents revealed that they were searching for CI on behalf of another person (e.g., a relative or friend), and not always at that person's behest. This notion of proxy searching, of gathering requested and unrequested CI for others, supports recent findings regarding the Web by Erdelez and Rioux (in press), who describe it as information encountering, and by Gross (in press), who describes how users present "imposed queries" at reference desks in public and school libraries. On many levels, it seems that the Internet has made it easier for researchers to label and identify a particular social type, one that might be best described as "information gatherers" or "monitors," to borrow from Baker and Pettigrew (1999). In our study, these active CI seekers, who may be considered somewhat akin to information gatekeepers, appeared to relish time spent browsing and poking about the community network and the Internet. But the greatest satisfaction they described was when they found something that they believed might be of interest to someone else, which they would quickly pass on, either by email or in person. Hence, a distinguishing feature of these CI gatherers is that they are socially connected or active, and, perhaps more importantly, are aware of the potential CI needs or interests of the people with whom they interact. These CI gatherers do not wait for someone to say "I need to know about X"; instead, they take mental notes of what's going on in the lives of the people around them, their interests and situations, and then keep an eye out for CI that might be of interest or helpful, without initiating an actual, purposive search. In this sense, CI monitors are able to recognize the potential CI needs of the people around them. Another defining element of this social type is that they do not really care whether the CI they pass on is actually used, and they exhibit an understanding that sometimes information is used and proves helpful at a later point in time. For systems design, this information-gatherer social type has important implications. In communities that are considered information poor, for example, individuals who represent this social type could be identified and given advance training in Internet searching, as well as in how to identify information needs and how to provide information in ways that best facilitate those needs.

We also found support for Wellman's (in press; Hampton and Wellman, 2000) notion that the Internet has created "glocalization," whereby it is used by individuals for both local and long-distance interaction. In our study, respondents used the community network as a personal gateway to websites located throughout the world, while people far beyond the network's physical home were using it to obtain local information. A woman in Florida, for example, used the Three Rivers Free-Net to locate information about seniors' housing for her elderly father, who was moving to the Pittsburgh area. A different user, who was accessing the network from another region, remarked on how it helps her connect with her family: "although I haven't lived there in years, I can keep up with the events and what is going on." Respondents also expressed interest in having a strong regional and neighborhood emphasis in their networks' content.

5. Barriers to Using Digital CI Systems

The notion of barriers, which is central to the sense-making framework, represents the ways in which people are prevented or blocked from seeking information successfully. By identifying barriers, one can devise ways of improving the design of digital CI systems so that they facilitate users' information behavior. Our respondents were asked several open-ended questions that address types of barriers. Specifically, we asked them to explain what, if anything, would make it easier for them to find what they're looking for, and to describe any past actions they might have taken regarding their search topic.

Our analysis revealed that users encounter several types of barriers when using community networks and the Internet in general. We labeled the main type "Information-Related." Barriers that fell under this broad category included:

Low Retrieval Rates: Due to poor search engines and site indexing, users frequently complained that they retrieved too much CI, that search engines did not provide enough specificity (e.g., for retrieving information at the neighborhood level), and that they were challenged to discern what was relevant to their search. Regarding the Internet in general, one user said he didn't "like it a lot" because most sites and search engines gave him 10 billion leads that get him sidetracked until he's forgotten what it was he was looking for;

Information Overload: Users were often daunted by a site's layout (e.g., it appeared too busy, had too many bells and whistles, or used poor font and color choices, especially for those who are color-blind) and by the amount of text displayed on a single screen;

Poorly Organized (Classified): Users complained that they often did not find CI where they expected to find it, and that there was little cross-referencing. As one female user explained: "I have a difficult time finding this information. None of the [system] categories apply to this even though I know the entity exists. Search engines didn't help either";

Out-of-date and Inaccurate Information: Users found CI that was either out of date or offered no way of discerning when a page was created or last updated. Inaccuracies in content were also noted;

Authority: Without proper identifiers and author credentials or association endorsements, users said it was difficult to gauge the "quality" of the CI source, i.e., whether they should trust the CI (and its source) or not;

Missing Information: Users sometimes commented that information was missing although it was described as existing at the beginning of a page or document;

Dead Links: Users were frustrated when they found a link to a page or site that they believed would be highly relevant to their information need, only to find that the link was inactive or otherwise unavailable;

Language Used: Beyond most information appearing in English only, users also commented on how some sites contained information that was written in jargon or at a level that was too high for many to understand;

Security: Users want strong evidence that the information they submit and retrieve is confidential: "reassured security," as one user phrased it;

Specificity: Users want to be able to search for information at the neighborhood level. As one user explained, "what's the use of providing information concerning neighborhoods if you then don't make it easy for someone to determine exactly which neighborhood they're in or belong to?";

Non-anticipatory Systems: Although users were unable to articulate this barrier themselves, their responses in the surveys and interviews indicated that their information behavior would be greatly facilitated if digital CI systems were "smart enough" to anticipate either their next information need (based on the need posed to the system by typed query or by point-and-click) or a related information need. All too often users described how the site they found was not quite what they were looking for, but they did not know where to go next.

These information-related barriers point to problems as well as potential solutions for improving the usability and helpfulness of digital CI systems. However, other barriers that users encounter also emerged from our analysis. These included:

Technological Barriers: Computer connection speeds were very slow, and software worked slowly or was unavailable or incompatible with connecting systems;

Economic Barriers: Users who could not afford their own computing equipment or online access felt they were at a disadvantage unless they were able to access equipment at a public library or other public computing site, which even at the best of times was still not as convenient as having a home system;

Geographic Barriers: People were hindered in accessing computers because they lived far away from a public library or other public access site, or because high-speed connectivity was unavailable in their area;

Search Skill Barriers: Community network users did not know how to search the system (or the Internet in general) or how to use advanced methods. This was reflected by several respondents, one of whom commented: "I have a hard time finding information even though I think I'm a pretty savvy web surfer";

Cognitive Barriers: Users did not understand how the Internet works in terms of how it is indexed, how search engines work, how links are created, who creates and manages the information, how sites are updated, and so on. As one user explained, "I am not Internet savvy enough to know what would make it easier; I just muddle through," while another remarked: "there is probably more to the website than I know about";

Psychological Barriers: Users frequently expressed a lack of confidence in their own ability to find needed information. In other words, they internalized their search failures: instead of attributing them to the Internet or to a plain lack of availability, they believed the reason they could not find something was that they were unable to carry out the search successfully.

These barriers are highly significant because they represent the impediments that users encounter when seeking information. People who are job seeking, for example, feel that they cannot get ahead unless they have access to a computer, not only so they can become more computer literate, but also because that is how they perceive people learn about job opportunities these days. For any one situation or information need, a user might be confronted by several barriers which, collectively, can overwhelm the user and prevent him or her from locating needed information.

6. Discussion

Our analysis of users' online information behavior reveals a rich portrait of how individuals are getting faster access to more detailed information, in ways that were not possible even a decade ago, due to digital CI system initiatives. These systems are valued and used by all segments of the adult population; they enable individuals, from near and far, to find information about local services and events, and they facilitate different types of information seeking. Our analysis of the situations that create users' needs for CI revealed a plethora of rich findings that expand on previous reports and, more importantly, signify several novel ways in which people are seeking CI at the turn of the century by drawing upon new technologies supported by public libraries. However, our results also indicated that users' mental models of what information exists, is retrievable, and is accurate on the Internet are overly optimistic. Although many barriers are associated with digital CI system access, these same barriers can reveal solutions that will assist in creating even stronger and more information-literate communities. Our findings suggest the following ways in which digital CI systems might be improved:

1. Provide users with greater specificity in their searches by improving the capability of search engines and searchable fields. Users, for example, want to be able to search for CI by neighborhood and zipcode, which reflects their notions of community.

2. Incorporate anticipatory search features that offer users suggestions or "next steps" toward other types of information related to the information currently retrieved. For example, if a user is searching for genealogical information, the system could suggest other sources of genealogical information as well as genealogical software for family-tree building. This heuristic approach might be developed by querying the user about the context of his or her search, and by linking categories of CI based on users' perceptions of CI and how these categories are used or connected in real-life situations.

3. Query the user automatically regarding the enabling aspect(s) of the information that they are seeking, and then use this data to provide information holistically. For example, if a user is seeking "directing" information, the system might also bring up local bus schedules and routes, directions, etc., through a geographic information system.

4. Use a community information taxonomy, such as Sales (1994), for organizing and indexing CI records, and make the taxonomy available online as part of the digital CI system.

5. Follow established interface design principles, such as those proposed by Head (1999) and Raskin (2000), to reduce incidents of information overload. Incorporating easy-to-use search engines that have different levels of search sophistication, and following solid design standards, can contribute greatly to reducing users' frustrations with pages that appear "too busy" and list too much text.

6. Indicate when the CI displayed on a page was last updated.

7. Indicate the CI source and that person's credentials.

8. Ensure that pages contain the information indicated on higher-level screens, i.e., ensure that pages actually contain the contents described on introductory screens.

9. Remove dead links regularly by implementing periodic checking and updating practices (a simple checking routine is sketched after this list).

10. Use appropriate language when providing CI, language that is understandable to users.

11. Provide help mechanisms that explain the very basics, i.e., how the digital CI system and Internet are organized and function, how search engines work, etc., and explain that sometimes information is unavailable through no fault of the user.

12. Provide users with contact information (email and phone number) for someone who can assist with matching their information needs to the system and with general system use.

13. Incorporate mid-way features that allow the systems to be used by people with slower machines, etc.

14. Incorporate more ways of linking people together to facilitate social interaction, via bulletin boards, etc.
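As a concrete illustration of recommendation 9, the Python sketch below checks a list of CI links and flags any that no longer respond; it uses only the standard library, and the URLs are placeholders rather than real CI pages:

    from urllib.request import Request, urlopen
    from urllib.error import URLError

    def is_dead(url, timeout=10.0):
        # Treat any connection failure or HTTP error status as a dead link.
        try:
            req = Request(url, method="HEAD")  # HEAD avoids fetching the body
            with urlopen(req, timeout=timeout) as resp:
                return resp.status >= 400
        except (URLError, ValueError):
            return True

    # Placeholder links; a real system would read them from its CI database.
    ci_links = ["http://example.org/services", "http://example.org/events"]
    for url in ci_links:
        if is_dead(url):
            print("flag for review or removal:", url)

Run periodically (for example, from a weekly scheduled job), a routine of this kind would surface dead links before users encounter them.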
By carefully considering the information needs and seeking behavior of users when designing digital information systems, many of the barriers noted earlier can be avoided or greatly reduced. Systems that anticipate related information needs and the actual activities or functions that users are trying to accomplish can go even further in facilitating users' online information behavior.

7. References

[1] Baker, L. M., and Pettigrew, K. E. Theories for practitioners: Two frameworks for studying consumer health information-seeking behavior. Bulletin of the Medical Library Association, 87, 4 (1999), 444-50.

[2] Baker, S. L., and Ruey, E. D. Information and referral services attitudes and barriers: A survey of North Carolina public libraries. Reference Quarterly, 28, 3 (1988), 243-52.

[3] Bates, M. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13, 5 (1989), 407-424.

[4] Bishop, A. P., Tidline, T., Shoemaker, S., and Salela, P. Public libraries and networked information services in low-income communities. Library and Information Science Research, 21, 3 (1999), 361-90.

[5] Chatman, E. A. Framing social life in theory and research. In L. Hoglund (Ed.), ISIC 2000: The Third International Conference on Information Seeking in Context. Gothenburg, Sweden, 16-18 August 2000. London, England: Taylor Graham.

[6] Chatman, E. A. The impoverished life-world of outsiders. Journal of the American Society for Information Science, 47 (1996), 193-206.

[7] Chen, C., and Hernon, P. Information Seeking: Assessing and Anticipating User Needs. Neal-Schuman, New York, 1982.

[8] Childers, T. Information and Referral: Public Libraries. Ablex, Norwood, NJ, 1984.

[9] Cisler, S. Weatherproofing a great, good place. American Libraries, 27, 9 (1996), 42-46.

[10] Dervin, B. From the mind's eye of the user: The sense-making qualitative-quantitative methodology. In J. D. Glazier and R. R. Powell (Eds.), Qualitative Research in Information Management (pp. 61-84). Libraries Unlimited, Englewood, CO, 1992.

[11] Dervin, B., et al. The development of strategies for dealing with the information needs of urban residents: Phase I: The citizen study. Final report of Project L0035J to the U.S. Office of Education. Seattle, WA: University of Washington, School of Communications. ERIC: ED 125640, 1976.

[12] Durrance, J. C. Meeting Community Needs through Job and Career Centers. Neal-Schuman, New York, 1994.

[13] Durrance, J. C. Armed for Action: Library Response to Citizen Information Needs. Neal-Schuman, New York, 1984a.

[14] Durrance, J. C. Community information services: An innovation at the beginning of its second decade. Advances in Librarianship, 13 (1984b), 99-128.

[15] Durrance, J. C., et al. Serving Job Seekers and Career Changers: A Planning Manual for Public Libraries. American Library Association, Chicago, IL, 1993.

[16] Durrance, J. C., and Pettigrew, K. E. Community information: The technological touch. Library Journal, 125, 2 (2000), 44-46.

[17] Durrance, J. C., and Schneider, K. G. Public library community information activities: Precursors of community networking partnerships. Ann Arbor, MI: School of Information, University of Michigan, 1996: http://www.si.umich.edu/Community/taospaper.html.

[18] Erdelez, S., and Rioux, K. Sharing information encountered for others on the web. In L. Hoglund (Ed.), ISIC 2000: The Third International Conference on Information Seeking in Context. Gothenburg, Sweden, 16-18 August 2000. London, England: Taylor Graham.

[19] Geffert, B. Community networks in libraries: A case study of the Freenet P.A.T.H. Public Libraries, 32 (1993), 91-99.

[20] Gross, M. Imposed information seeking in school library media centers and public libraries: A common behavior? In L. Hoglund (Ed.), ISIC 2000: The Third International Conference on Information Seeking in Context. Gothenburg, Sweden, 16-18 August 2000. London, England: Taylor Graham.

[21] Gurstein, M. Community Informatics: Enabling Communities with Information and Communications Technologies. Idea Group Publishing, London, 2000.

[22] Hampton, K., and Wellman, B. Examining community in the digital neighbourhood: Early results from Canada's wired suburb. In T. Ishida and K. Isbister (Eds.), Digital Cities: Technologies, Experiences, and Future Perspectives (pp. 475-492). Springer-Verlag, Berlin, 2000.

[23] Harris, R. M., and Dewdney, P. Barriers to Information: How Formal Help Systems Fail Battered Women. Greenwood Press, Westport, CT, 1994.

[24] Harsh, S. An analysis of Boulder Community Network usage. 1995: http://bcn.boulder.co.us/community/resources/harsh/harshproject.html.

[25] Harvey, K., and Horne, T. Surfing in Seattle: What cyber-patrons want. American Libraries, 26 (1995), 1028-1030.

[26] Head, A. J. Design Wise: A Guide for Evaluating the Interface Design of Information Resources. CyberAge Books, Medford, NJ, 1999.

[27] Kraut, R., Lundmark, V., Patterson, M., Kiesler, S., Mukopadhyay, T., and Scherlis, W. Internet paradox: A social technology that reduces social involvement and psychological well-being? American Psychologist, 53, 9 (1999), 1017-1031.

[28] Nie, N. H., and Erbring, L. Internet and Society: A Preliminary Report. Stanford Institute for the Quantitative Study of Society, Stanford, CA, 2000.

[29] Patrick, A. S. Media lessons from the National Capital FreeNet. Communications of the ACM, 40, 7 (1997), 74-80.

[30] Patrick, A. Services on the information highway: Subjective measures of use and importance from the National Capital FreeNet. 1996: http://debra.dgbt.doc.ca/services-research/survey/services/.

[31] Patrick, A. S., and Black, A. Implications of access methods and frequency of use for the National Capital FreeNet. 1996a: http://debra.dgbt.doc.ca/services-research/survey/connections/.

[32] Patrick, A. S., and Black, A. Losing sleep and watching less TV but socializing more: Personal and social impacts of using the NCF. 1996b: http://debra.dgbt.doc.ca/services-research.

[33] Patrick, A. S., Black, A., and Whalen, T. E. Rich, young, male, dissatisfied computer geeks? Demographics and satisfaction with the NCF. In D. Godfrey and M. Levy (Eds.), Proceedings of Telecommunities 95: The International Community Networking Conference (pp. 83-107). Victoria, British Columbia: Telecommunities Canada, 1995. http://debra.dgbt.doc.ca/services-research/survey/demographics/vic.html.

[34] Pettigrew, K. E. Lay information provision in community settings: How community health nurses disseminate human services information to the elderly. The Library Quarterly, 70, 1 (2000), 47-85.

[35] Pettigrew, K. E. Nurses' perceptions of their needs for community information: Results of an exploratory study in southwestern Ontario. Journal of Education for Library and Information Science, 37 (1996), 351-60.

[36] Pettigrew, K. E., and Durrance, J. C. Community building using the 'Net: Perceptions of organizers, information providers and Internet users. Internet Research 1.0: The State of the Interdiscipline, First Annual Conference of the Association of Internet Researchers, University of Kansas, Lawrence, Kansas, September 14-17, 2000. [url: www.cddc.vt.edu/aoir/]

[37] Pettigrew, K. E., Durrance, J. C., and Vakkari, P. Approaches to studying public library Internet initiatives: A review of the literature and overview of a current study. Library and Information Science Research, 21, 3 (1999), 327-60.

[38] Putnam, R. D. Bowling Alone: The Collapse and Revival of American Community. Simon and Schuster, New York, 2000.

[39] Putnam, R. D. Bowling alone: America's declining social capital. Journal of Democracy, 6 (1995), 65-78.

[40] Raskin, J. The Humane Interface: New Directions for Designing Interactive Systems. Addison Wesley, Reading, MA, 2000.

[41] Sales, G. A Taxonomy of Human Services: A Conceptual Framework with Standardized Terminology and Definitions for the Field. Alliance of Information and Referral Systems and INFO LINE of Los Angeles, 1994.

[42] Savolainen, R. Use studies of electronic networks: A review of the literature of empirical research approaches and challenges for their development. Journal of Documentation, 54 (1998), 332-351.

[43] Savolainen, R. The sense-making theory: Reviewing the interests of a user-centered approach to information seeking and use. Information Processing and Management, 29 (1993), 13-28.

[44] Schalken, K., and Tops, P. The digital city: A study into the backgrounds and opinions of its residents. Paper presented at the Canadian Community Networks Conference, August 15-17, 1994, Carleton University, Ottawa. http://cwis.kub.nl/frw/people/schalken/schalken.htm.

[45] Schuler, D. New Community Networks: Wired for Change. Addison-Wesley, New York, 1996.

[46] Schuler, D. Community networks: Building a new participatory medium. Communications of the ACM, 37 (1994), 39-51.

[47] Wellman, B. Physical place and CyberPlace: The rise of networked individualism. Journal of Urban and Regional Research, 25 (in press).

[48] Witte, J. C., Amoroso, L. M., and Pen, H. Research methodology: Method and representation in Internet-based survey tools. Mobility, community, and cultural identity in Survey 2000. Social Science Computer Review, 18, 2 (2000), 179-195.

[49] Zhang, Y. Using the Internet for survey research: A case study. Journal of the American Society for Information Science, 51, 1 (2000), 57-68.

[50] Zweizig, D., and Dervin, B. Public library use, users, uses: Advances in knowledge of the characteristics and needs of the adult clientele of American public libraries. Advances in Librarianship, 7 (1977), 231-55.

Evaluating the Distributed National Electronic Resource

Peter Brophy
Director, Centre for Research in Library & Information Management (CERLIM)
The Manchester Metropolitan University
Manchester M15 6LL, United Kingdom
+44-161-247-6718
[email protected]

Shelagh Fisher
Reader, Centre for Research in Library & Information Management (CERLIM)
The Manchester Metropolitan University
Manchester M15 6LL, United Kingdom
+44-161-247-6153
[email protected]

ABSTRACT
The UK's development of a Distributed National Electronic Resource (DNER) is being subjected to intensive formative evaluation by a multi-disciplinary team. In this paper the Project Director reports on initial actions designed to characterise the DNER from multi-stakeholder perspectives.

Categories and Subject Descriptors
C.2.1 Network Architecture and Design; C.2.4 Distributed Systems.

General Terms
Management, Measurement, Performance, Design, Economics, Reliability, Human Factors, Verification.

Keywords
Distributed collections; information environments; evaluation.

1. INTRODUCTION
The United Kingdom has, over the last decade, invested heavily in both infrastructure and information content to support higher education's research and teaching functions. The JANET and SuperJANET networks provide very high bandwidth (gigabit) connections. Content is supplied through a series of national-level 'deals' with public and private sector suppliers, with services being delivered from three datacentres (in Bath, Edinburgh & Manchester): a guiding principle is that such content should be 'free at the point of use'. This has helped ensure massive take-up. Considerable experimentation has taken place with service delivery, not least through the Electronic Libraries Programme (eLib) [1], which as well as investigating specific service developments (such as e-journals, subject gateways, electronic document delivery) has explored generic issues, both through its final phase concentration on 'hybrid libraries' and 'clumps' and through a series of supporting studies, of which the UKOLN-led MODELS workshops are probably the best known and most influential [2].

In 1999, agreement was reached on the need to engineer a step-change in these services by moving towards an integrated information environment, in which access to any desired resources could be managed coherently. Issues such as resource description, location, request and delivery, alongside authentication and authorisation, need to be considered within a national and international framework if interoperability, coherence, sustainability and scalability are to be secured. The new environment was to be known as the Distributed National Electronic Resource (DNER).

In late 1999, a Call for Proposals (known as JISC 5/99 [3]) was issued to the higher education community in the UK. Subsequently over 40 development projects were selected for funding. In addition, a major formative evaluation project was funded with a primary aim of learning as much as possible about the impacts of the developing DNER on end-users, and feeding this learning back into the development process. The formative evaluation, known as EDNER, is led by the Centre for Research in Library & Information Management (CERLIM) at the Manchester Metropolitan University; the Centre for Studies in Advanced Learning Technology (CSALT) at Lancaster University is a partner. The project will extend over 3 years, and is funded at approx. $1,000,000 over that period [4]. This paper reports on work in progress and initial observations of the evaluation team.

2. DEFINITIONS
The first issue has been to determine exactly what the DNER is and what it is intended to become. To arrive at a clear understanding of what is intended and of its appropriateness, the evaluation team has used a dual approach, analysing the DNER's objectives, content and use and then comparing it with a variety of models of real-world services and environments. In essence the first approach is inductive, the second deductive.

An initial project workpackage, which involved garnering views from programme participants, potential users and other stakeholders, suggested that there was a wide range of perspectives, from those who regard the DNER as an e-university to those who see it as a large library or even museum. This variety of view was confirmed by documentary analysis of stated DNER project objectives.

The JISC itself has used a variety of definitional statements about the DNER. The 'official' definition, as given in the Call for Proposals for the DNER's major development programme, is 'a managed environment for accessing quality assured information resources on the Internet' [3]. In other documents, however, JISC refers to the DNER as 'a comprehensive collection of electronic resources' [5] and 'the main academic apparatus required for research and teaching in the full range of main subject areas' [6].

3. MODELS
An alternative approach to understanding the DNER, used in parallel with that reported above, is to make deductions by mapping it to models of other services and environments. This suggests that the DNER has features which show significant commonality with ten other identifiable models, as described very briefly below.

3.1 Publisher
As with any other service in the scholarly communication chain, the DNER has features which suggest parallels with both traditional and emerging models of publishing. For example, it needs to address the quality assurance of content and to provide facilities to enable that content to be distributed, often with payment mechanisms (pace the Open Archives initiative!).

3.2 Traditional Library
There are many models of traditional libraries available, and the DNER appears to be replicating or replacing some of their functionality, such as the organisation of content and its archiving (preservation), the provision of enquiry services and the provision of (in this case, virtual) study 'spaces'.

3.3 Museum
Some DNER development projects are explicitly designed to digitise museum content. Also, the traditional museum functions of organising materials coherently for themed display and, again, of preservation find direct parallels.

3.4 Digital Library
Perhaps most obviously, although definitionally it is complex, the DNER has features of a digital library: 'an organized collection of multimedia data with information management methods that represent the data as information and knowledge' [7].

3.5 Hybrid Library
The idea of the 'hybrid library' emerged during the eLib programme [8]. A model of this 'library in the twenty-first century' suggests that its key role is as an 'invisible intermediary', dynamically linking each user with exactly the information they need. To achieve this, it is highly sophisticated both in the intelligence it has about its users and in its knowledge of potential information sources.

3.6 Gateway
In DNER terms, gateways are effectively ordered lists of Internet resources; the eLib gateways, now part of the Resource Discovery Network (RDN), form a key component of the DNER. A variety of other gateways are on view.

3.7 Portal
Using definitions suggested within the DNER, a portal differs from a gateway in that the user is not directed to another site in response to a query (as, for example, when a URL displayed by a gateway is clicked). Rather the portal accepts the query, itself interrogates a series of resources, intelligently interprets the results (e.g. deduplicating) and then presents a result to the user. To date portals, on this definition, remain experimental.
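On this description a portal is essentially a thin federation layer over other services. As a loose illustration only (the resource interface and helper names below are our own assumptions, not part of any DNER specification), a minimal sketch of the behaviour might look like this:

    class StaticResource:
        """Stand-in for a remote target such as a subject gateway or datacentre."""
        def __init__(self, records):
            self.records = records  # list of (title, url) pairs

        def search(self, query):
            # naive matching; a real target would run its own retrieval engine
            return [(t, u) for (t, u) in self.records if query.lower() in t.lower()]

    def portal_search(query, resources):
        """Fan the query out to every resource, then merge the results."""
        seen, merged = set(), []
        for resource in resources:
            for title, url in resource.search(query):
                if url not in seen:  # 'intelligent interpretation', reduced
                    seen.add(url)    # here to simple de-duplication by URL
                    merged.append((title, url))
        return merged

The user sees one integrated result list rather than a set of links to follow, which is precisely what distinguishes the portal from the gateway described above.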
3.8 Managed Learning Environment
MLEs should provide not just the communications structures, materials and online tuition needed for virtual learning, but all the support infrastructure that goes with them. It is difficult not to conclude that, unless the DNER can demonstrate interoperability with institutionally-based MLEs, it will not succeed.

3.9 e-university
Going beyond the MLE, an e-university needs a variety of services which enable it to offer and deliver its products across electronic networks. These would include, at a minimum, a brand, quality assurance, awards and financial systems. It could be argued, again, that the DNER will need to interoperate with, and possibly to be integrated with, many of these functions.

3.10 dot.com
Again, the dot.com sector provides some lessons for characterisation of the DNER. In brief: dot.coms need both a high profile brand and a high quality product; excellent marketing; robust yet simple payment mechanisms; and reliable and rapid delivery mechanisms. The DNER needs all of these.

4. Interim Conclusions
As indicated earlier the EDNER project is in its early stages, and what is reported here is very much work in progress. Nevertheless, it is already apparent that the task of building and evaluating national-level services is complicated by very different perspectives among key stakeholders, and by the lack of any single, clear model on which to base development and evaluative judgements. It is likely that consensus will emerge gradually both through debate on competing models and through assessment of the impact of achievements in service delivery.

5. REFERENCES
[1] http://www.ukoln.ac.uk/services/elib/
[2] http://www.ukoln.ac.uk/dlis/models/
[3] JISC 5/99. http://www.jisc.ac.uk:8080/pub99/c05_99.html
[4] http://www.cerlim.ac.uk/projects/edner.htm
[5] Joint Information Systems Committee. JISC Strategy 2001-05 (October 2000 draft). http://www.jisc.ac.uk/curriss/general/strat_01_05/draft_strat.html
[6] http://www.jisc.ac.uk:8080/dner/background/dner_intro1.pdf
[7] http://www.ccic.gov/pubs/iita-dlw/
[8] Brophy, P. & Fisher, S. The hybrid library. The New Review of Information & Library Research, 4, 1998, pp. 3-15.

Collaborative Design with Use Case Scenarios

Lynne Davis
DLESE Program Center
University Corporation for Atmospheric Research
P.O. Box 3000, Boulder, CO 80307-3000
303 497-8313
[email protected]

Melissa Dawe
Center for LifeLong Learning and Design
Dept. of Computer Science, University of Colorado at Boulder
303 492-4932
[email protected]

1. ABSTRACT
Digital libraries, particularly those with a community-based governance structure, are best designed in a collaborative setting. In this paper, we compare our experience using two design methods: a Task-centered method that draws upon a group's strength for eliciting and formulating tasks, and a Use Case method that tends to require a focus on defining an explicit process for tasks. We discuss how these methods did and did not work well in a collaborative setting.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries - user issues.

General Terms
Design, Experimentation, Human Factors.

Keywords
Use case, task-centered, design, collaboration, methodology.

2. INTRODUCTION
Designing a digital library system is a challenging effort; doing so collaboratively through community participation is even more so. Nevertheless, for the last twelve months we have actively engaged the community in the design of the Digital Library for Earth System Education (DLESE). DLESE serves a broad base of educators and students interested in any scientific aspect of the Earth and as such requires a systematic design approach to define requirements. Because DLESE is envisioned, designed and sustained by its community of users, such an approach for defining requirements needs likewise to be a highly collaborative effort. As we approached this problem, two design models emerged as likely candidates: Task-centered design [3] and Use Case development [1]. Because both models use natural language to define requirements, users can read and understand them more easily than traditional software requirements documents. Also, both assume an iterative development model for rapid prototyping and flexibility.

But, while both profess to consider the needs of the users to be paramount in a successful design, they differ considerably in their methods. In July of 2000, at the DLESE Leadership Workshop at Montana State University in Bozeman, MT, our informed efforts to use the Use Case development method failed to result in the expected outcomes. We explain those methods and why we conclude that the Use Case methods were not appropriate for use in a highly collaborative design setting.

3. METHODOLOGIES
3.1 Task-centered Design
Task-centered design methodology [3], well known among the human-computer interaction (HCI) community and interface designers, provides techniques to elicit tasks from users that capture the user's goals and work setting. It centers on developing a small suite of tasks that represent, in essence, who will use the system and what they want to do. We used it prior to the workshop to elicit user tasks. Here is an excerpt from one:

    Alan teaches an introductory undergraduate geology class. He wants
    to find a few photographs taken by the Sojourner of the surface of
    Mars and compare them to similar geographical areas on earth for
    his class.

Tasks are then used to guide design and testing of the system. The process of creating a user interface based on these tasks occurs through an iterative process of design and user testing. The precise steps a user would take to accomplish a task are determined while a tangible user interface develops. This iterative design method encourages consideration of alternatives.

3.2 Use Case Design
The Use Case design methodology as defined by Alistair Cockburn [1] is popular among software engineers to define requirements in terms of a specific sequence of steps. It begins as a narrative that is much like the tasks in the Task-centered design methodology. A major difference is that the narratives are then broken down into a sequence of execution steps, called a scenario. For each narrative, many scenarios describe the different outcomes a user might experience when executing a task. Thus, Use Cases try to provide more detail about the system. However, the model does not include contextual or environmental factors that can influence use, and it can lead to premature commitment to a design [2], before design alternatives can be considered.
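For illustration only, the fragment below sketches how the 'Alan' narrative above might be broken down in the spirit of Cockburn's semi-formal format; the field names and steps are our own invention, not output from the DLESE workshop:

    alan_use_case = {
        "goal": "Compare Sojourner photographs of Mars with similar areas on Earth",
        "preconditions": ["Alan has reached the library's search page"],
        "main_success_scenario": [  # one specific sequence of execution steps
            "1. Alan enters the query 'Sojourner Mars surface photographs'",
            "2. The system returns a ranked list of matching resources",
            "3. Alan selects a photograph and inspects its description",
            "4. Alan searches for geologically similar Earth imagery",
            "5. Alan saves both images to a resource list for his class",
        ],
        "extensions": [  # each alternative outcome spawns a further scenario
            "2a. No results: the system suggests broader search terms",
        ],
    }

Writing even this small example requires fixing pre-conditions and excluding interrelated steps, which is exactly the commitment discussed in the sections that follow.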

In preparing for the DLESE Leadership Workshop, the Task-centered design methodology appealed to the user interface designers, while the Use Case methodology appealed to software engineers, who wanted the system requirements to be in a form more readily usable by them. So, our design process involved elements of both.

4. SUITABILITY FOR COLLABORATION
Where these approaches really differ becomes apparent in our collaborative design setting. The Use Case approach assumes that the narrative for a goal is already written and is ready to be broken down into its various scenarios. The process of creating Use Cases that conform to Cockburn's model requires skill to break down a Use Case into a formatted sequence of steps and assumes a readiness on the part of the collaborators to commit to those steps. At the Leadership Workshop, we started with representative tasks that were expressed by the user community previously. The participants were asked to form groups of 4-6 and, over a one-hour period, begin to articulate step-by-step scenarios to accomplish these tasks.

This approach was problematic in our collaborative setting. Few participants understood the process. To list a single sequence of steps the group had to exclude other interrelated steps, at least temporarily. The group had to accept assumptions about pre- and post-conditions to carry out the goal. Also, by starting with a given narrative, it disregarded a key need for the user community to contribute to the understanding of their tasks and work setting, and hence, for the designers to situate the system most appropriately. Finally, the collaborators were not ready to commit to a defined sequence of steps, which made the group feel uneasy and hampered successful collaboration.

The Task-centered design process, on the other hand, begins by trying to understand the user and the user's situation. In the very early stages of the DLESE design, we asked our community to give us stories about how they envision the library might support them in their work. What we received were highly visionary, detailed descriptions. This approach facilitated group participation that was useful to the designers for describing user settings and system context.

The user interface developers transformed task descriptions into tangible mockup designs. Participants then used the mockups to test whether they could accomplish a task. They had something tangible about which they might suggest improvements.

5. RESULTS AND DISCUSSION
Accepting the Use Case narratives as written and having to develop the scenarios proved to be very incongruous with the needs, expectations, skills and motivation of our workshop participants. While it did elicit valuable information, participants did not develop scenarios. Instead, they reformulated user tasks. We feel there are two main reasons why this exercise didn't work.

5.1 Preparation of Participants
Participants did not understand the processes and constraints of developing Use Case scenarios. The semi-formal process was foreign to them, and despite minimal training and guidance from the facilitators, they essentially ignored it.

There was a mismatch of purpose. The cognitive characteristics that are well suited to developing Use Case scenarios were incongruous with the motivation and expectations of the group. Participants wanted to describe ways that the system can support them in getting their job done, and they were not ready to exclude possibilities. They expected to discuss issues in an open forum. They were motivated to contribute at a high level and had neither the skills, nor the inclination, to develop structured scenarios.

5.2 Readiness to Commit
Participants were asked to accept the Use Case narratives in order to proceed with developing the scenarios. The narratives originated from the community but participants had no direct ownership of, or "buy-in" to, them. They felt they were too narrow, and did not represent all possibilities. They wanted to discuss the narratives in terms of their experience. While this dialog was useful and valuable for expanding the existing Use Cases, it indicated a lack of readiness to commit to specific scenarios.

6. CONCLUSIONS
Defining requirements for a digital library can be done collaboratively in conjunction with community members. In our experiment, the community was better suited to participate successfully in task creation rather than scenario development. Task-centered design leads quickly to a tangible, user-testable object that can serve as a picture to see how Use Case scenarios might develop. Task-centered design appears to be better suited for design collaboration in the case where the participants are not skilled in the semi-formal process of developing Use Cases, and where they are neither motivated nor committed to do so.

Cockburn argues that user interface design should come after the development of Use Cases. Our experience in collaborative design suggests that user interface design should come first, through the Task-centered design process, where design alternatives can be included rather than excluded. Then, when the mockups and low-level prototypes begin to pass user testing and the users are ready to commit to a design, the scenarios could be formulated by software engineers or by those who are trained in this process to carry the design further.

7. ACKNOWLEDGMENTS
The work described here is funded under Cooperative Agreement #ATM-9732665 between the National Science Foundation (NSF) and the University Corporation for Atmospheric Research (UCAR), and NSF Grants #9978338 and #0085600, and by supplemental funding from the National Aeronautics and Space Administration (NASA). The views expressed herein are those of the author(s) and do not necessarily reflect the views of NSF, its subagencies, or UCAR.

8. REFERENCES
[1] Cockburn, A., 1997. Goals and use cases. Journal of Object-Oriented Programming, (Sep 1997), pp 35-40, and (Nov-Dec 1997), pp 56-62.
[2] Green, T. R. G. and Petre, M., 1996. Usability analysis of visual programming environments. J. Visual Languages and Computing, 7, pp 131-174.
[3] Lewis, C. & Rieman, J., 1993. Task-centered user interface design. Available via ftp.cs.colorado.edu/pub/cs/distribs/clewis/HCI-Design-Books.

Human Evaluation of Kea, an Automatic Keyphrasing System

Steve Jones and Gordon W. Paynter
Department of Computer Science, University of Waikato
Private Bag 3105, Hamilton, New Zealand
+64 7 838 4021
{stevej, paynter}@cs.waikato.ac.nz

ABSTRACT
This paper describes an evaluation of the Kea automatic keyphrase extraction algorithm. Tools that automatically identify keyphrases are desirable because document keyphrases have numerous applications in digital library systems, but are costly and time-consuming to manually assign. Keyphrase extraction algorithms are usually evaluated by comparison to author-specified keywords, but this methodology has several well-known shortcomings. The results presented in this paper are based on subjective evaluations of the quality and appropriateness of keyphrases by human assessors, and make a number of contributions. First, they validate previous evaluations of Kea that rely on author keywords. Second, they show Kea's performance is comparable to that of similar systems that have been evaluated by human assessors. Finally, they justify the use of author keyphrases as a performance metric by showing that authors generally choose good keywords.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries - user issues. I.2.7 [Artificial Intelligence]: Natural Language Processing - text analysis.

General Terms
Algorithms, Performance, Experimentation.

Keywords
Keyphrase extraction, author keyphrases, digital libraries, subjective evaluation, user interface.

1. INTRODUCTION
Some types of document (such as this one) contain a list of key words specified by the author. These keywords and keyphrases (we use the latter term to subsume the former) are a particularly useful type of summary information. They condense documents, offering a brief and precise description of their content. They have many further applications, including the classification or clustering of documents [12, 28], search and browsing interfaces [10, 11, 13], retrieval engines [3, 6, 14] and thesaurus construction [15, 18].

Keyphrases are often chosen manually, usually by the author of a document, and sometimes by professional indexers. Unfortunately not all documents contain author- or indexer-assigned keyphrases. Even in collections of scientific papers those with keyphrases are in the minority [13]. Manual keyphrase identification is tedious and time-consuming, requires expertise, and can give inconsistent results, so automatic methods benefit both the developers and the users of large document collections.

In this paper we describe a human evaluation of Kea [9, 27], an automatic keyphrase extraction algorithm developed by members of the New Zealand Digital Library Project [26]. Kea uses machine learning techniques to build a model that characterises document keyphrases, and later uses the model to identify likely keyphrases in new documents.

Previous evaluations show Kea's performance is state-of-the-art, but are weakened by their assumption that a document's author-specified keyphrases are its best possible set of keywords. In practice, the author keyphrases may not be exhaustive, and may not even be particularly appropriate: they can be chosen for purposes other than summarisation, to associate a document with a particular discipline, for example.

Our evaluation makes a number of contributions in respect of automated keyphrase extraction. First, it augments and tests the validity of the previous evaluations of Kea using a different evaluation technique: a subjective evaluation involving human assessment of the quality and appropriateness of keyphrases. Second, it compares Kea's performance as determined by human assessors against the results of similar evaluations of other systems. Finally, it investigates whether comparison against author keyphrases is a good measure of the results of keyphrase extraction systems.

In the next section of this paper we present a range of keyphrase-based interfaces developed by ourselves and others. We then describe two approaches to associating keyphrases with documents, along with techniques for keyphrase extraction, and the Kea algorithm. We discuss issues and techniques in evaluating keyphrases, providing a summary of previous research results, before proceeding to describe an experiment in which human assessors judged the quality of keyphrases generated by Kea and gathered by other means. Finally, we discuss our findings and the conclusions that we draw from the experimental results.

[Figure 1: The Phind user interface (screenshot not reproduced)]

[Figure 2: The Phrasier user interface (screenshot not reproduced)]

2. KEYPHRASE-BASED INTERFACES
Our evaluation is motivated by our use of keyphrases in user interfaces for searching and browsing. We have built a number of novel systems that use keyphrases to support new styles of interaction with digital libraries.

Phind [19] adds a browsable topic-oriented structure to collections of documents where no structure existed before: a structure that cannot be uncovered through conventional keyword queries. Users interact with a phrase hierarchy that has been automatically extracted from the documents. The phrase hierarchy resembles a paper-based subject index or thesaurus, and is presented to the user via a World Wide Web page.

The user begins by entering an initial query term, and a list of phrases that contain the term is displayed (Figure 1, top pane). When the user clicks on a phrase of interest, a further panel appears, listing longer phrases that contain the phrase, and the documents where it occurs. The user can continue to descend through the phrase hierarchy, viewing increasingly specific phrases. At each stage documents containing the phrase can be selected for display.

In Phind, users must move back and forth between result lists and document content. Another system, called Kniles, eliminates this extraneous navigation by embedding the browsing interface directly into documents as they are viewed [13].

Kniles uses keyphrases to automatically construct browsable hypertexts from plain text documents that are displayed in a conventional Web browser. Link anchors are inserted into the text wherever a phrase occurs that is a keyphrase in another document or documents. A second frame of the Web page provides a summary of the keyphrase anchors that have been inserted into the document. When a user clicks on a phrase a new web page is generated that lists the documents for which the phrase is a keyphrase. Selecting a document from the list loads it, with hyperlinks inserted, into the web browser.

Kniles is a simplified, Web-based version of Phrasier, a program that supports authors and readers who work within a digital library [11, 13]. In Phrasier, browsing and querying activities are seamlessly integrated with document authoring and reading tasks (see Figure 2).

Keyphrases are used to dynamically insert hypertext link anchors into the text of a retrieved document. Each anchor has two levels of gloss (preview information about the link destination), allowing users to navigate directly to desirable documents. Phrasier uses variable highlighting of the phrases to help users to skim the document and find sections of interest. Keyphrases are displayed more prominently than the rest of the text. Multiple keyphrases can be selected, retrieving a ranked list of documents related to the combination of selected topics.

Because links are introduced dynamically when the document is viewed, users can load a document from their own filestore into Phrasier, and it will behave in the same way as documents from an established collection. In fact, the user can create a document by typing it directly into Phrasier. As the user enters text, keyphrases are identified in real time, highlighted and turned into link anchors with associated destination documents, providing immediate access to related material.

A number of other systems exploit phrases to enhance user interaction. The Journal of Artificial Intelligence Research (http://extractor.iit.nrc.ca/jair/keyphrases) can be accessed through an interface based on phrases produced by Extractor [25]. Larkey [17] describes a system for searching a database of patent information.
Within the system phrases are used to suggest query expansions to users based on the search terms that have been specified. Similarly, Pedersen et al [20] use phrases to support query reformulation in their Snippet Search system. Krulwich and Burkey [16] exploit heuristically extracted phrases to inform InfoFinder, an 'intelligent' agent that learns user interests during access to on-line documents.

The utility of each of these systems depends upon the availability of accurate, reliable keyphrases. The remainder of this paper shows that Kea can supply appropriate candidates.

3. ASSOCIATING KEYPHRASES WITH DOCUMENTS
There are two dominant approaches to associating keyphrases with documents: keyphrase assignment and keyphrase extraction. In keyphrase assignment (also known as text categorization) an analysis of a document leads to selection of keyphrases for that document from a controlled vocabulary [8]. It has two main advantages: the controlled vocabulary ensures that similar documents are classified consistently, and documents can be associated with concepts that are not explicitly mentioned in their text. However, there are also disadvantages: potentially useful keyphrases are ignored if they are not in the vocabulary; and controlled vocabularies require expertise and time to build and maintain, so are not always available.

In the second approach, keyphrase extraction, the text of a document is analysed and the most appropriate words and phrases that it contains are identified and associated with the document. Every phrase that occurs in the document is a potential keyphrase of the document. This approach does not require a predefined vocabulary, and is not restricted to the concepts in such a vocabulary. However, the keyphrases assigned to each document are less consistent, and it is not easy to identify the "most appropriate" words and phrases.

A wide range of techniques has been applied to the problem of phrase extraction. Turney [24, 25] uses a set of heuristics that are fine tuned using a genetic algorithm. Chen [5] uses statistical measures exploiting importance, frequency, co-occurrence and distance attributes of word pairs. Larkey [17] builds a phrase dictionary by tagging word sequences as parts of speech and retaining noun phrases. Krulwich and Burkey [16] exploit markup, such as capitalisation, emphasis, and section headings, to select possibly significant phrases from documents. Anick and Vaithyanathan [2] carry out part of speech tagging and identify noun compounds: word sequences of two or more adjectives and nouns terminating in a head noun. Smeaton and Kelledy [22] identify 2 or 3 word candidate phrases from text by using stopword delimiters, and then consider phrases to be meaningful if they occur in the document collection more than some fixed number of times. Barker and Cornacchia [4] identify noun phrases using dictionary lookup, and then consider the frequency of a given noun as a phrase head within a document, discarding those that fall below a given threshold. Tolle and Chen [23] use pattern matching rules to select phrases from texts that have been tokenized and tagged by a part-of-speech tagger.

Of these approaches, Turney and Barker and Cornacchia explicitly attempt to simulate the author's choice of keywords and evaluate their methods by comparing the algorithm's choices against the author's.

4. KEA
Kea is a keyphrase extraction algorithm developed by members of the New Zealand Digital Library Project. The algorithm is substantially simpler, and therefore less computationally intensive, than many previous approaches.

Kea has been described in detail elsewhere [9, 27], and its operation is summarised below. Kea uses a model to identify the phrases in a document that are most likely to be good keyphrases. This model must be learned from a set of training documents with exemplar keyphrases. The exemplar phrases are usually supplied by authors, though it is also acceptable to provide exemplar keyphrases manually.

To learn a model, Kea extracts every phrase from each of the training documents in turn. Many phrases are discarded at this stage, including duplicates, those that begin or end with a stopword, those which consist only of a proper noun, those that do not match predefined phrase length constraints, and those that occur only once within a document. Three attributes of each remaining phrase are calculated: whether or not it is an author-specified keyphrase of the document, the distance into a document at which it first occurs, and how specific it is to the document (its TFIDF value). The attribute values of every phrase in every training document are used to construct a Naive Bayes classifier [7] that predicts whether or not a phrase is an author keyphrase based on its other attributes.

A range of options allows control over the model building process, and consequently the characteristics of the keyphrases that will eventually be extracted. These include maximum and minimum acceptable phrase length (in words), and an extension to the model that incorporates the number of times that a phrase occurs as an author-specified keyphrase in a corpus of related documents.

Once a model for identifying keyphrases is learned from the training documents, it can be used to extract keyphrases from other documents. Each document is converted to text form and all its candidate phrases are extracted and converted to their canonical form. Many are immediately discarded, using the same criteria as described for the training process. The distance and TFIDF attributes are computed for the remaining phrases. The Naive Bayes model uses these attributes to calculate the probability that each candidate phrase is a keyphrase. The most probable candidates are output in ranked order; these are the keyphrases that Kea associates with the document.

The number of phrases extracted from each document can be controlled, and is typically around 10. The length of the phrases, expressed as the minimum and maximum number of words a phrase contains, can also be controlled.

Several predefined models are distributed with Kea, including models based on generic World Wide Web pages and computer science technical reports. Previous work shows that models built for specific collections are more likely to account for the idiosyncrasies of that collection's keyphrases [9].
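As a rough sketch of the computation just described (Kea itself is a trained system distributed by the New Zealand Digital Library Project; the stopword list, thresholds and probability tables below are invented for illustration), the candidate filtering, the two attributes and the Naive Bayes scoring might be assembled as follows:

    import math

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for"}  # abridged

    def candidate_phrases(words, max_len=3):
        """Word sequences of 1..max_len words that neither begin nor end
        with a stopword (two of the filters described above)."""
        found = set()
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                phrase = words[i:j]
                if phrase[0] not in STOPWORDS and phrase[-1] not in STOPWORDS:
                    found.add(" ".join(phrase))
        return found

    def attributes(phrase, words, doc_freq, n_docs):
        """The two core attributes: TFxIDF and first-occurrence distance."""
        text = " ".join(words)
        tf = text.count(phrase) / len(words)               # crude term frequency
        idf = math.log(n_docs / (1 + doc_freq.get(phrase, 0)))
        distance = text.index(phrase) / len(text)          # how far in it first occurs
        return tf * idf, distance

    def keyphrase_probability(tfidf, distance, model):
        """Naive Bayes over the attributes, discretised into toy bins."""
        t = "high" if tfidf > model["tfidf_cut"] else "low"
        d = "early" if distance < model["dist_cut"] else "late"
        p_yes = model["prior"] * model["p_tfidf"]["key"][t] * model["p_dist"]["key"][d]
        p_no = (1 - model["prior"]) * model["p_tfidf"]["non"][t] * model["p_dist"]["non"][d]
        return p_yes / (p_yes + p_no)

Ranking the candidates by this probability and keeping the top ten or so mirrors the extraction step described above.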
5. EVALUATING KEYPHRASES
There are two basic approaches to evaluating automatically generated keyphrases. The first adopts the standard Information Retrieval metrics of precision and recall to reflect how well generated phrases match phrases which are considered to be 'relevant.' Author phrases are usually used as the set of relevant phrases, or the 'Gold Standard.' This approach was adopted in previous evaluations of Kea [9, 27].


Table 1: Profile of phrases associated with each document

                     Number of keyphrases
    Paper    Author    Kea    Food    Merged list
      1         5       49      6         58
      2         6       47      6         58
      3        10       51      6         66
      4         8       54      6         68
      5         5       51      6         57
      6         7       55      6         67

There are several problems with evaluations based purely on author-chosen keyphrases. Barker and Cornacchia identify four [4]. First, author keyphrases do not always appear in the text of the document to which they belong. Second, authors choose keyphrases for purposes other than document description: to increase the likelihood of publication, for example. Third, authors rarely provide more than a few keyphrases, far fewer than may be extracted automatically. Fourth, author keyphrases are available for a limited number and type of documents.

A second approach is to gather subjective keyphrase assessments from human readers. Previous studies involving human phrase assessment [4, 5, 23, 25] follow essentially the same methodology. Subjects are provided with a document and a phrase list and asked to assess in some way the relevance of the individual phrases (or of sets of phrases) to the given document.

The study reported here adopts the second approach, and represents the first direct human evaluation of the keyphrases generated by Kea. It incorporates a human evaluation of author keyphrases, to better inform the first type of evaluation.

The evaluation had three aims. First, we wished to evaluate the keyphrases produced by Kea with a variety of models and settings. Second, we wished to compare a subjective evaluation of Kea to the results of evaluations based on the author keyphrases. Finally, we wished to determine if the author's keyphrases are a good standard against which to measure performance: do readers think the author keywords are good keyphrases?

5.1 Experimental Texts
A set of six English language papers from the Proceedings of the ACM Conference on Human Factors 1997 (CHI 97; [1]) was used for the test documents. They were suitable for our purposes because they contain author-specified keywords and phrases, and provide a good fit with the background and experience of our subjects. Each paper was eight pages long.

The author's keyphrases were removed from each paper so that they would not influence extraction and assessment, and so that the papers would better represent the bulk of technical reports that do not have author keyphrases.

5.2 Subjects
Subjects were recruited from a final year course on Human Computer Interaction taken as part of an undergraduate degree programme in Computer Science. 28 subjects were recruited, of which 23 were male and five female. All had completed at least three years of undergraduate education in computer science or a related discipline and were nearing completion of a fifteen week course on human-computer interaction. The first language of 15 of the subjects was English. The youngest subject was 21, the oldest 38, and the mean age was 25.

5.3 Allocation
Two papers were allocated to each of the subjects. Papers were allocated randomly to the subjects, though presentation order, number of viewings of each paper, and subjects' first language were controlled. Two subjects chose to read only one paper during the experimental session. All other subjects were able to complete both tasks, and did so within two hours.

5.4 Instructions
The subjects were instructed to first read the paper fully. They were then told to reveal a list of phrases for the paper and asked: "How well does each of the following phrases represent what the document is either wholly or partly about?" The list of phrases was presented in the following form:

    hypertext
    Not at all   0  1  2  3  4  5  6  7  8  9  10   Perfectly

    co-citation analysis
    Not at all   0  1  2  3  4  5  6  7  8  9  10   Perfectly

Subjects indicated their rating by drawing a circle around the appropriate value. Subjects could refer back to the paper and reread it as often as required.

5.5 Candidate Phrase Lists
Each phrase list contained phrases from a variety of sources: Kea keyphrases extracted from the paper, author keyphrases specified in the paper, and unrelated control phrases.

Three Kea models were used to extract keyphrases. The first, aliweb, was trained on a set of typical web pages found by Turney [24, 25]. The second, cstr, is derived from a collection of computer science technical reports as described by Frank et al. [9]. The third, cstr-kf, was trained on the same documents as cstr, but uses a further attribute which reflects how frequently a phrase occurs as a specified keyphrase in a set of training documents. Experiments using information retrieval measures show that, averaged over hundreds of computer science documents, the cstr model extracts better phrases than the aliweb model, and that the cstr-kf model extracts better phrases than either [9].

The minimum phrase length was varied for each model. Two phrase sets were produced with each model, corresponding to phrases of 1-3 words and 2-3 words. The first variation reflects the way that Kea is typically used to approximate author keyphrases, and 15 phrases were extracted. The latter reflects Kea's use in Phind and Phrasier, which ignore phrases that consist of a single word.

Table 2: An example of sets of 15 keyphrases associated with paper 5 (the bold marking of author-equivalent phrases and the shading showing default-setting extractions are not reproduced; blank cells were illegible in the source)

         Author keywords      Keywords extracted by Kea (length 1-3)                     Food
                              aliweb model         cstr model           cstr-kf model
     1   History mechanisms   revisit              revisit              navigation       onion
     2   WWW                  URL                  web browsers                          garlic
     3   web history                               navigation           World Wide Web   milk
     4   hypertext            user                 URL                  browsing         ham and eggs
     5   navigation           history mechanisms   history              patterns         pumpkin pie
     6                        navigation           history mechanisms   web browsers     vegetable soup
     7                        pages                pages                predict
     8                        patterns             web pages            WWW
     9                        web browsers         empirical
    10                        web pages            user                 hypertext
    11                        stack                Tauscher             accessed
    12                        visited              World Wide Web       methods
    13                        recency              visited              list
    14                        predict              browsing             recurrence
    15                        frequency            stack                actions

Six unrelated phrases were introduced into each phrase list to enable coarse measurement of how carefully the subjects considered the task. This set consists of the names of food products.

In total there were 8 phrase sets for each paper: two phrase length variations for each of three Kea models, the author keyphrases, and the food set. The 8 sets describing each document were merged into a single master list for each paper and exact duplicates were removed. The number of phrases from each source and the total number of phrases in the list for each paper are shown in Table 1.

For every paper, there is overlap between the Kea phrase lists, and between the Kea lists and the author keyphrases. In only one paper (paper 5) was the full set of author keyphrases extracted by Kea. Table 2 shows some of the phrase sets extracted from this paper. Phrases in bold are those that Kea extracted that are equivalent to author keyphrases (after case-folding and stemming). The shaded areas indicate the keyphrases that would be extracted using the default settings of each model. No single model found all five author phrases in the first fifteen extracted phrases.

6. RESULTS
6.1 Inter-Subject Agreement
We have measured the level of inter-subject agreement using two statistical techniques: the Kappa Statistic K and the Kendall Coefficient of Concordance W [21]. If we find significant agreement between the subjects we can rule out the hypothesis that any effects we observe occur merely by chance.

The Kappa Statistic is based on the assumption that the scores given by the assessors are (unordered) categories to which phrases are assigned. Agreement is represented by the Kappa score (K), a number that ranges from 0, which means there is no more agreement than might be expected by chance, to 1, which means the assessors are in complete agreement.

Table 3 illustrates the agreement between the subjects using the Kappa score. Three different levels of granularity are considered. First, the categories are the scores marked by the user on the 11 point scale. Second, we translate subjects' 11 point responses to three categories, simulating responses of bad, average and good; the three categories are formed from the ranges 0-3, 4-6 and 7-10. Third, the 11 point responses are translated into two categories, effectively a bad/good judgement; the two categories are formed from the ranges 0-5 and 6-10. Two statistics are shown for each paper: the Kappa score K, and the z score, a test of the significance of K. The number of phrases considered by each assessor for each paper is large. Therefore, across all subjects and phrases for a given paper, the Kappa values are low. We have looked beyond the absolute values and tested the significance of K (as described by Siegel and Castellan [21]), producing z scores.

As expected, the level of agreement increases as the number of categories decreases from 11 to 3 to 2. Although the values of K are small, a test of their significance shows that the inter-assessor agreement was significant at the 0.01 level in all cases.

We found substantially greater agreement between subjects than Barker and Cornacchia [4] observed in a study of keyphrases produced by Extractor and their system, B&C. They reported that "on average, the judges agree only about as much as can be expected by chance". In all cases, our subjects agreed more than one would expect by chance.
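As a sketch of this style of computation (a multi-rater kappa in the spirit of Siegel and Castellan [21]; the exact formulation used in the study may differ, and the band boundaries below are those quoted above), the ratings can be coarsened and scored like this:

    def coarsen(score, bands=3):
        """Map a 0-10 rating to 3 bands (0-3, 4-6, 7-10) or 2 bands (0-5, 6-10)."""
        if bands == 3:
            return 0 if score <= 3 else (1 if score <= 6 else 2)
        return 0 if score <= 5 else 1

    def kappa(ratings, n_categories):
        """Fleiss-style kappa: `ratings` holds, for each phrase, the category
        assigned by each of the m assessors of one paper."""
        m, n = len(ratings[0]), len(ratings)
        counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
        p_cat = [sum(row[c] for row in counts) / (n * m) for c in range(n_categories)]
        p_chance = sum(p * p for p in p_cat)              # agreement expected by chance
        p_observed = sum((sum(x * x for x in row) - m) / (m * (m - 1))
                         for row in counts) / n           # mean per-phrase agreement
        return (p_observed - p_chance) / (1 - p_chance)

    # e.g. kappa([[coarsen(s) for s in scores] for scores in raw_scores], 3)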
A drawback of the Kappa statistic is that it considers agreement on unordered categories. As we are interested in whether one phrase is better or worse than another, not in the specific scores for each phrase, it is useful to consider agreement between the subjects' relative ranking of the phrases.

The Kendall Coefficient of Concordance (W) is a measure of agreement between rankings. As with K, it has a value between 0 (agreement as expected by chance) and 1 (complete agreement). Table 4 shows the result using the full 11 point scale. In each case, W is non-zero, indicating that there is inter-subject agreement. The χ² score and degrees of freedom (df) can be used to determine the level of significance of the W value. The level of agreement is significant to at least the 0.01 level for all papers.
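The corresponding computation, in its standard tie-free form (the study may have used a tie-corrected variant), is a few lines:

    def kendall_w(rankings):
        """Kendall's coefficient of concordance for m subjects who each rank
        the same n phrases (rank 1 = best; ties ignored in this sketch).
        Returns W, the chi-square statistic m(n - 1)W, and df = n - 1."""
        m, n = len(rankings), len(rankings[0])
        rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
        mean = m * (n + 1) / 2
        spread = sum((t - mean) ** 2 for t in rank_sums)
        w = 12 * spread / (m ** 2 * (n ** 3 - n))
        return w, m * (n - 1) * w, n - 1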

Table 3: Inter-assessor agreement measured by Kappa

                    Number of points in scale
                 11              3               2
    Paper     K      z       K      z       K      z
      1      0.13   14.99   0.26   16.06   0.32   14.14
      2      0.14   15.74   0.32   18.14   0.39   18.37
      3      0.15   17.44   0.28   15.22   0.29   10.58
      4      0.08   12.48   0.14    9.85   0.16    8.86
      5      0.13   15.29   0.22   13.11   0.27   11.92
      6      0.16    5.88   0.29    6.50   0.37    6.69

Table 4: Inter-assessor agreement measured by the Kendall Coefficient of Concordance

    Paper      W       χ²      df
      1       0.63   321.03    58
      2       0.70   400.67    58
      3       0.63   368.90    66
      4       0.32   236.11    68
      5       0.38   215.65    57
      6       0.72   237.81    67

The Kendall Coefficient demonstrates that there are significant (and sometimes strong) levels of agreement between the subjects when they assess the keyphrases. We conclude that subjects agree sufficiently to justify further investigation into the relative quality of the different keyphrase extraction methods.

6.2 Human Assessments
Our objective is to compare the quality of the Kea and author-specified phrases based on assessments by subjects. We do this by averaging the scores that subjects assigned to individual keyphrases derived from each source.

Figures 3 and 4 show the scores allocated by the subjects to the authors' keyphrases, various sets of Kea phrases, and the unrelated food phrases. The Y axes are the average score (across all subjects and all documents) assigned to phrases in the set from each source. The X axes are the number of phrases considered from each set. The leftmost point of each curve is the average score when we consider only the first phrase in a set. The rightmost point is the average score when we consider all of the phrases in a set. Intermediate points represent the average score when the first N phrases are considered. The curves for the author and food sets are shorter because those sets contain fewer keyphrases than those produced by Kea.

Figure 3 shows the scores for Kea sets containing keyphrases of 1-3 words. Figure 4 shows the scores for Kea sets in which the length of keyphrases is 2-3 words. The experiment also considered phrases of length 1-4 and 2-4, but these results are not reported as they are very similar to their counterparts of length 1-3 and 2-3 respectively.

[Figure 3: Average scores for phrase sets, with Kea phrases of length 1-3 (graph not reproduced)]

[Figure 4: Average scores for phrase sets, with Kea phrases of length 2-3 (graph not reproduced)]

Several interesting results are revealed in the graphs. First, the phrases which are unrelated food products are rated very lowly. Overall, only nine food phrases received a non-zero score, and no food phrase was assigned a non-zero score by a subject whose first language is English.

Second, the curves for Kea sets are downward sloping for all models. The author keyphrases also follow this trend.

Third, the author keyphrases initially receive higher scores than the automatically extracted phrases. However, the scores of author keyphrases decrease more sharply than those of the Kea keyphrases, and it is only over the first two phrases that the disparity between author phrases and the best Kea phrase set is strongly apparent.

Fourth, cstr-kf phrases were not rated as highly as those produced by the other models. The score curves for cstr and aliweb are very similar, and are almost identical when single word keyphrases are allowed (Figure 3).

Finally, we note that almost all the curves are above the mid-point (5) of the 0-10 scale used by the subjects. Subjects consistently rated the phrases positively. The exceptions to this are the food set, and the end of the curve produced by the cstr-kf model when single word keyphrases are allowed (Figure 3).
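Each curve point described above is simply a running mean over a ranked phrase list; a minimal sketch, assuming the scores are already averaged across subjects and documents:

    def curve_points(scores):
        """Average score over the first N phrases, for N = 1..len(scores);
        `scores` lists one source's phrase scores in rank order."""
        points, total = [], 0.0
        for n, score in enumerate(scores, start=1):
            total += score
            points.append(total / n)
        return points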
7. DISCUSSION
7.1 Integrity of Subjects' Assessments
A potential risk with such subjective and repetitive tasks is that the assessors fail to maintain a high level of discrimination throughout the process. For this reason we randomly included 'noise' phrases (the food set) in the phrase lists. The fact that almost all of these phrases received zero ratings, in conjunction with the agreement measures, leads us to believe that subjects gave appropriate consideration to their responses throughout the tasks. Clearly, these noise phrases are highly distinct from the topic domain of the documents. We wished to ensure that, at a coarse level, subjects had not allocated random or identical ratings to the large number of phrases they were asked to consider. The food phrases were unambiguously 'noisy' and served this purpose.

A second risk in this type of evaluation is that assessor agreement is so low that little can be determined from the data. For example, both Chen [5] ("Inter-indexer inconsistency is obvious in our experiment") and Barker and Cornacchia [4] ("Kappa values are spectacularly low") experienced this difficulty. However, we have established that the subjects in our experiment achieved a significant level of agreement. We attribute this to differences between experimental methodologies. Our study used documents from a restricted topic domain, with a sample population of human assessors who had a degree of knowledge about the topic, similar educational backgrounds and comparable baseline skills in the language of the evaluation documents. Studies that report lower inter-subject agreement are characterised by more diverse documents and sample populations.

7.2 Author Keyphrases as a 'Gold-Standard'
One of the aims of the evaluation was to determine whether or not comparison against author keyphrases is a good measure of automatically produced keyphrases. The results indicate that author keyphrases are consistently viewed as good representations of the subject of a document. Consequently we believe the precision and recall measures described in previous work can serve as useful indicators of the quality of automatically produced keyphrases. Of course, these methods should be adopted with an awareness of the potential problems described earlier.

Each set of author keyphrases was sorted in the order they appeared in the paper, and the resulting curves were downward sloping. We can infer that authors attempt to put the most important keyphrases first (as we might expect), and that their judgement generally matched that of the subjects in the evaluation.

The author's apparent ranking of keyphrases suggests that the basic notion of relevance in information retrieval-based evaluations (an extracted keyphrase matches an author keyphrase) may be too simplistic in some cases. Such measures might take into account the fact that there is an implicit ranking within author keyphrase lists, and consider not only how many author keyphrases are identified, but also the rank of those keyphrases.

7.3 Quality of Kea Keyphrases
Kea outputs keyphrases for a document in ranked order, and the human assessments provide some insight into the efficacy of that ordering. The downward sloping curves of the mean scores for Kea keyphrases are encouraging. The mean keyphrase score decreases as lower ranked Kea phrases are added, indicating that Kea phrase lists are ranked effectively, and that the phrases Kea chooses first are usually the best candidates in the phrase lists.

An aim of the evaluation was to determine the effect of various Kea settings on the quality of extracted keyphrases, including the phrase length, the model employed and the use of keyphrase frequency data.
7.4 Cross-language Suitability of Keyphrases
A secondary aspect of our study allows us to compare the perceived quality of keyphrases for users who do and do not have English as their first language. In fact, when we split the data based on a subject's first language we observe little difference. This is most likely a characteristic of the subject population (final year students undertaking university study in the English language), where adequate English language skills are a necessity.

7.5 Related Human Evaluations
Turney carried out a simple Web-based subjective evaluation [25] of the keyphrases produced by Extractor. Self-selecting subjects were requested to gauge keyphrases as good or bad with respect to a document that they themselves submitted, and provide feedback via a Web page form. Subjects could choose between a 'Good' or 'Bad' rating for a keyphrase. Turney reports that 82% of all keyphrases were acceptable to the subjects when seven phrases were extracted for each document. This result counts phrases with no response as acceptable; 62% of the total phrases were judged 'Good'.

We can simulate a similar measure of acceptable phrases in our study by calculating the proportion of ratings greater than 5 that were assigned to the top seven phrases for each model. For phrase sets based on cstr and aliweb models, between 71% and 80% of the phrases were acceptable, compared to around 60% for cstr-kf. Of the author phrases, 79% were acceptable. This is less than the 80% achieved by cstr phrases of length 2-3, and is reflected in Figure 4, where the cstr curve is higher than the author curve when seven phrases are extracted.
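The simulated acceptability measure is simple to state in code. The sketch below assumes ratings on the paper's scale with anything above 5 counted as acceptable; the document ids and ratings are invented placeholders, not the study's data.

    # Proportion of 'acceptable' ratings (rating > 5) among the top seven
    # phrases of each document, mirroring the comparison with Turney's figure.

    def acceptability(ratings_by_document, top_n=7, threshold=5):
        accepted = total = 0
        for ranked_ratings in ratings_by_document.values():
            for rating in ranked_ratings[:top_n]:
                total += 1
                if rating > threshold:
                    accepted += 1
        return accepted / total if total else 0.0

    ratings = {  # hypothetical: document id -> ratings of its ranked phrases
        "doc1": [8, 7, 6, 6, 5, 7, 4, 3],
        "doc2": [9, 6, 5, 7, 6, 4, 6, 2],
    }
    print(f"{acceptability(ratings):.0%} acceptable")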

In Barker and Cornacchia's study [4], twelve subjects rated phrases produced by their B&C system and Turney's Extractor. They used 13 documents, nine from Turney's corpora, and four of their own choosing. Phrases were judged on a 'bad', 'so-so' or 'good' scale that was mapped to values 0, 1 and 2 respectively for analysis. The average score of an Extractor keyphrase was 0.56 (s.d. = 0.11) and of a B&C phrase was 0.47 (s.d. = 0.1). Subjects were more negative than indifferent about the phrases produced by both systems. By this measure, Kea compares favourably to either system, with phrases, on average, receiving positive judgements.
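This comparison rests on mapping categorical judgements onto numbers and summarising them. A small sketch of that calculation, using invented judgement lists rather than Barker and Cornacchia's raw data:

    # Map 'bad'/'so-so'/'good' to 0/1/2 and compute mean and standard
    # deviation, as in Barker and Cornacchia's analysis. Counts are invented.
    from statistics import mean, pstdev

    SCALE = {"bad": 0, "so-so": 1, "good": 2}

    def summarise(judgements):
        values = [SCALE[j] for j in judgements]
        return mean(values), pstdev(values)

    judgements = ["good", "so-so", "bad", "so-so", "good", "bad", "so-so"]
    m, sd = summarise(judgements)
    print(f"mean = {m:.2f}, s.d. = {sd:.2f}")  # a mean below 1 leans negative

On this scale, 1 represents indifference ('so-so'), so the reported means of 0.56 and 0.47 indicate judgements that lean negative overall.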
Chen's study [5] required assessors to choose appropriate 'subjects' to represent a document: effectively a keyphrase assignment task in the Chinese language. Description of the captured data is limited, but it is clear that for individual subjects, intersection with the automatically extracted set ranged from 1 to 13 keyphrases, with a mean of 5.25 (s.d. = 4.18). Just over half (42 of 80) of the automatically extracted keyphrases were also selected by 8 subjects across 10 documents. This appears slightly worse than the 62% achieved by Turney, and worse than the results achieved by Kea, although direct comparison is difficult as the experimental texts are very different.

7.6 Limitations
The results reflect positively on the performance of Kea, both independently and relative to other systems. However, there are some limitations to our study. First, due to resource limitations common to evaluations of this type, the number of subjects (28) and papers (6) is limited. This is comparable with similar studies. Tolle and Chen had 19 subjects view 10 abstracts and phrase lists [23]. Barker and Cornacchia had 12 judges view 13 documents [4]. Chen used eight subjects and 10 texts [5]. In Turney's study [25], 205 users assessed keyphrases for a total of 267 documents.

We chose to maximise the number of subjects and the number of assessments of each phrase list to minimise the effect of assessor subjectivity. In this respect our study is more robust than those described above. Although Turney reports large numbers of assessors and documents, the Web-based mechanism by which this was achieved necessarily relinquished control over subject and document selection. Such control was important for our study because of the domain-specific nature of Kea's extraction algorithm.

Due to resource and time constraints, the number of documents considered was smaller than we would have ideally chosen. The documents that we used are from a particular domain (computer-human interaction) and of a particular style (conference research paper). It is clearly difficult to assert that the results that we have observed for Kea can be generalised beyond such papers. However, this is actually a moot point, because Kea is a domain-specific system, trained on collections of documents that are similar to those from which keyphrases are to be extracted.

A second limitation of the study is the narrow profile of the subjects. To ensure accurate assessment of keyphrases, subjects must be conversant with the domain of the documents under consideration. We have attempted to ensure this in our study to improve the integrity of the assessments. Tolle and Chen also adopted this approach, using strongly matched subjects and documents in a restricted domain [23]. The assessors in Turney's Web-based study were anonymous but submitted their own document for processing, and neither Chen nor Barker and Cornacchia describe their subject populations. Kea is intended for use on restricted-domain document collections, and consequently its users will likely be conversant with that domain. This is the scenario that has been modeled by our study.

This evaluation measured the quality of individual keyphrases. We have also compared sets of keyphrases by combining the scores of individual phrases. However, in some of the uses of keyphrases that we described earlier in this paper, keyphrase groups are presented to users. Barker and Cornacchia [4] captured assessments of groups of keyphrases produced by both B&C and Extractor. They found that B&C groups were preferred more often than Extractor groups (47% versus 39% of preferences). This is at odds with judgements of individual keyphrases, which reflected a preference for those produced by Extractor. This suggests that evaluations of individual keyphrase quality do not generalize to keyphrase sets.

8. Conclusions
This study has shown that Kea extracts good keyphrases, as measured by human subjects. Their assessments were uniformly positive, and with the exception of very short keyphrase lists, Kea keyphrases were almost as good as those specified by authors. These results corroborate evaluations of Kea based on author keyphrases, and suggest that Kea ranks keyphrases in a sensible way. We are confident that Kea keyphrases are suitable for use in the interfaces described in this paper.

Previous studies have used author keyphrases as a gold-standard against which other keyphrases are compared, but offered no evidence that author keyphrases are good keyphrases. Our results show that authors do provide good quality keyphrases, at least for the style of documents in our study. They also indicate that author keyphrases are listed with the best keyphrases first, which may have implications when author keyphrases are used to measure keyphrase quality.

9. References
[1] Proceedings of CHI'97: Human Factors in Computing Systems, ACM Press, 1997.
[2] Anick, P. and Vaithyanathan, S. Exploiting Clustering and Phrases for Context-Based Information Retrieval. In Proceedings of SIGIR'97: the 20th International Conference on Research and Development in Information Retrieval, (Philadelphia, 1997), ACM Press, 314-322.
[3] Arampatzis, A.T., Tsoris, T., Koster, C.H.A. and Van der Weide, T.P. Phrase-based information retrieval. Information Processing & Management 34, 6 (1998); 693-707.
[4] Barker, K. and Cornacchia, N. Using Noun Phrase Heads to Extract Document Keyphrases. In Proceedings of the Thirteenth Canadian Conference on Artificial Intelligence (LNAI 1822), (Montreal, Canada, 2000), 40-52.
[5] Chen, K.-H. Automatic Identification of Subjects for Textual Documents in Digital Libraries. Los Alamos National Laboratory, Los Alamos, NM, USA, 1999.
[6] Croft, B., Turtle, H. and Lewis, D. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings of SIGIR'91, (1991), ACM Press, 32-45.

[7] Domingos, P. and Pazzani, M. On the Optimality of the Simple Bayesian Classifier Under Zero-One Loss. Machine Learning 29, 2/3 (1997); 103-130.
[8] Dumais, S.T., Platt, J., Heckerman, D. and Sahami, M. Inductive Learning Algorithms and Representations for Text Categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, (1998), ACM Press, 148-155.
[9] Frank, E., Paynter, G., Witten, I., Gutwin, C. and Nevill-Manning, C. Domain-specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, (1999), Morgan-Kaufmann, 668-673.
[10] Gutwin, C., Paynter, G.W., Witten, I.H., Nevill-Manning, C. and Frank, E. Improving Browsing in Digital Libraries with Keyphrase Indexes. Journal of Decision Support Systems 27, 1-2 (1999); 81-104.
[11] Jones, S. Design and Evaluation of Phrasier, an Interactive System for Linking Documents Using Keyphrases. In Proceedings of Human-Computer Interaction: INTERACT'99, (Edinburgh, UK, 1999), IOS Press, 483-490.
[12] Jones, S. and Mahoui, M. Hierarchical Document Clustering Using Automatically Extracted Keyphrases. In Proceedings of the Third International Asian Conference on Digital Libraries, (Seoul, Korea, 2000), 113-120.
[13] Jones, S. and Paynter, G. Topic-based Browsing Within a Digital Library Using Keyphrases. In Proceedings of Digital Libraries'99: The Fourth ACM Conference on Digital Libraries, (Berkeley, CA, 1999), ACM Press, 114-121.
[14] Jones, S. and Staveley, M. Phrasier: a System for Interactive Document Retrieval Using Keyphrases. In Proceedings of SIGIR'99: the 22nd International Conference on Research and Development in Information Retrieval, (Berkeley, CA, 1999), ACM Press, 160-167.
[15] Kosovac, B., Vanier, D.J. and Froese, T.M. Use of Keyphrase Extraction Software for Creation of an AEC/FM Thesaurus. Electronic Journal of Information Technology in Construction 5 (2000); 25-36.
[16] Krulwich, B. and Burkey, C. The Infofinder Agent: Learning User Interests Through Heuristic Phrase Extraction. IEEE Intelligent Systems & Their Applications 12, 5 (1997); 22-27.
[17] Larkey, L.S. A Patent Search and Classification System. In Proceedings of Digital Libraries'99: The Fourth ACM Conference on Digital Libraries, (Berkeley, CA, 1999), ACM Press, 179-187.
[18] Paynter, G.W., Witten, I.H. and Cunningham, S.J. Evaluating Extracted Phrases and Extending Thesauri. In Proceedings of the Third International Conference on Asian Digital Libraries, (Seoul, Korea, 2000), 131-138.
[19] Paynter, G.W., Witten, I.H., Cunningham, S.J. and Buchanan, G. Scalable Browsing for Large Collections: a Case Study. In Proceedings of Digital Libraries'00: The Fifth ACM Conference on Digital Libraries, (San Antonio, TX, USA, 2000), ACM Press, 215-223.
[20] Pedersen, J., Cutting, D. and Tukey, J. Snippet Search: a Single Phrase Approach to Text Access. In Proceedings of the 1991 Joint Statistical Meetings, (1991), American Statistical Association.
[21] Siegel, S. and Castellan, N.J. Nonparametric Statistics for the Behavioral Sciences (2nd edition), McGraw Hill College Div, 1988.
[22] Smeaton, A. and Kelledy, F. User-Chosen Phrases in Interactive Query Formulation for Information Retrieval. In Proceedings of the 20th BCS IRSG Colloquium, (Grenoble, France, 1998).
[23] Tolle, K.M. and Chen, H. Comparing Noun Phrasing Techniques for Use with Medical Digital Library Tools. Journal of the American Society for Information Science 51, 4 (2000); 352-370.
[24] Turney, P.D. Learning to Extract Keyphrases from Text. Technical Report ERB-1057 (NRC #41622). Canadian National Research Council, Institute for Information Technology, 1999.
[25] Turney, P.D. Learning Algorithms for Keyphrase Extraction. Information Retrieval 2, 4 (2000); 303-336.
[26] Witten, I.H., McNab, R.J., Jones, S., Apperley, M., Bainbridge, D. and Cunningham, S.J. Managing Complexity in a Distributed Digital Library. IEEE Computer 32, 2 (1999); 74-79.
[27] Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G. KEA: Practical Automatic Keyphrase Extraction. In Proceedings of Digital Libraries'99: The Fourth ACM Conference on Digital Libraries, (Berkeley, CA, 1999), ACM Press, 254-255.
[28] Zamir, O. and Etzioni, O. Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks and ISDN Systems 31, 11-16 (1999); 1361-1374.

Community Design of DLESE's Collections Review Policy: A Technological Frames Analysis

Michael Khoo
Department of Anthropology, CB 233
University of Colorado
Boulder, CO 80309-0233, U.S.A.
E-mail: [email protected]

ABSTRACT
In this paper, I describe the design of a collection review policy for the Digital Library for Earth System Education (DLESE). A distinctive feature of DLESE as a digital library is the 'DLESE community,' composed of voluntary members who contribute metadata and resource reviews to DLESE. As the DLESE community is open, the question of how to evaluate community contributions is a crucial part of the review policy design process. In this paper, technological frames theory is used to analyse this design process by looking at how the designers work with two differing definitions of the 'peer reviewer': (a) peer reviewer as arbiter or editor, and (b) peer reviewer as colleague. Content analysis of DLESE documents shows that these frames can in turn be related to two definitions that DLESE offers of itself: DLESE as a library, and DLESE as a digital artifact. The implications of the presence of divergent technological frames for the design process are summarised, and some suggestions for future research are outlined.

Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Libraries: collection, standards, user issues.

General Terms
Design, Human Factors.

Keywords
Content Analysis, Decision Making, Design, Digital Library, Ethnography, Peer Review, Technological Frames.

1 INTRODUCTION
The Digital Library for Earth System Education (DLESE: www.dlese.org) is a National Science Foundation funded project whose mission is to provide searchable access to online earth science resources for everybody from elementary school children to university professors. DLESE is structured around a distributed 'participatory framework' consisting of three main 'primary functions' (Policy, Operations, and Community) located in different parts of the United States. The DLESE Program Center, part of the 'Operations' function, is headquartered in the offices of the University Corporation for Atmospheric Research, Boulder, Colorado. The distributed structure of DLESE means that, apart from scheduled meetings that require members to travel across the country, the great majority of communication between DLESE's various project functions is carried out via various communication technologies.

One such project function involves designing a review policy that sets out criteria for accessioning items to DLESE's catalogue. In this paper I examine the ongoing development of this policy, through an analysis of messages posted to the DLESE Collections Working Group discussion forum. I will suggest that the discussion surrounding the review policy is situated within a wider set of negotiations concerning exactly what a 'digital library' is. In these discussions DLESE, as a digital library, appears to be defined in at least two ways: firstly, as a library that is being digitised; and secondly, as a digital object that has library-like functions. As DLESE aims to be a heterogeneous community, the presence of diverse definitions of technology should come as no surprise, and could be seen as a desirable characteristic. However, for diversity to be used constructively, it has to be recognised as such; if left unrecognised, such differences have the potential to impact DLESE in unforeseen ways. Using technological frames theory (TFT) I will explore how these definitions are present in DLESE documents, mission statements, and discussions of the review policy, and what impact they might be having on policy formulation.

2 TECHNOLOGICAL FRAMES THEORY
Technological frames theory (TFT) studies the shared frames of reference underlying individual and collective perceptions of technology.
I will focus on the work of Orlikowski and colleagues [35, 36, 37, 38], who use a form of technological frames theory derived partly from Goffman's theory of frames [24] and partly from Giddens' theory of structuration [22]. Orlikowski and Gash [37] argue that

an understanding of people's interpretations of technology is critical to understanding their interaction with it. To interact with technology, people have to make sense of it; and in this sense-making process, they develop particular assumptions, expectations, and knowledge of the technology, which then serve to shape subsequent actions toward it. While these interpretations become taken-for-granted and are rarely brought to the surface and reflected on, they nevertheless remain significant in influencing how actors in organizations think about and act toward technology. (175)

[Figure 1: schematic of frame pairs A-F, showing congruent, incongruent, and incommensurate regions of overlap between technological frames]

Orlikowski and Gash refer to these located 'sense-making' interpretations as 'technological frames of reference,' "built-up repertoire[s] of tacit knowledge that [are] used to impose structure upon, and impart meaning to, otherwise ambiguous social and situational information to facilitate understanding" (176). They note parallels between their concept of technological frames, and other theories of shared cognitive structures, including cognitive maps [11, 19], frames [24], interpretive frames [6], interpretive schemes [22], mental models [3, 39], paradigms [27, 40], scripts [1, 23], and thought worlds [17, 18].

There are three aspects of Orlikowski and Gash's conceptualisation of technological frames that I would like to emphasise here: their emergence, their identification with particular groups, and their congruence. First, Orlikowski and Gash stress that technological frames are not set, but emergent, contingent and evolving: "Frames are flexible in structure and content, having variable dimensions that shift in salience and content by context and over time. They are structured more as webs of meanings than as linear, ordered graphs" (176). These emergent properties have important implications for practice, making possible ongoing observation, analysis, and potential intervention in technology implementation processes.

Second, Orlikowski and Gash state that "technological frames are shared by members of a group having a particular interaction with some technology" (203); the implication is therefore that frames can be associated with particular stakeholder groups, communities of practice, etc.

Third, Orlikowski and Gash use the notions of 'congruence' and 'incongruence' to describe the nature and extent of differences among frames. While congruent frames are not identical, they are related in ways that imply similar expectations of a technology. Incongruent frames, on the other hand, imply "important differences in expectations, assumptions, or knowledge about some key aspects of technology... We expect that where incongruent technological frames exist, organizations are likely to experience difficulties and conflicts around developing... and using technologies" (180).

The upper part of Figure 1 shows two sets of incongruent frames (A and B, and C and D) that overlap to greater and lesser degrees. Orlikowski and Gash imply that these two sets of frames can be mapped onto a continuum leading from congruence to incongruence (Figure 2):

[Figure 2: frames A & B and frames C & D mapped onto a continuum from congruent to incongruent]

Note that this assessment is not quantifiable: we cannot say that frames A and B are 'three times more congruent' than frames C and D.

I want to suggest instead that we see technological frames as composed of disparate knowledges and practices and therefore as being constituted in heterogeneous rather than homogeneous ways. My suggestion is outlined in the lower part of Figure 1. Consider frame E as a pedagogical frame. It might contain pedagogists who are also familiar with the technological model being implemented (in the case of DLESE, they might be informed about system privacy and security, the differences between search and browse strategies and architectures, GUI design, etc.). Frame E would also contain, however, people who know something about search and browse strategies, but who do not understand the technical implications of the distinction; and others who think that computers will just somehow make their pedagogical experience more efficient and more fun. Similarly, frame F considered as a designer's frame might contain a proportion of people who have experience with teaching and designing K-12 class projects. It would also have members who might be involved in educational activities but who are not really knowledgeable pedagogists, as well as others who think that the system architectures of the tools they provide will be transparently clear to teachers, and will automatically make the latter's lives easier.

In addition to congruent and incongruent frames, therefore, I propose a third level of technological frame interaction termed, after Kuhn [27], 'incommensurate.' This incommensurate level of interaction is one in which the concepts of one frame cannot be understood in terms of the concepts of the other frame. Although Orlikowski and Gash equate 'technological frames' with Kuhnian paradigms, and therefore draw an analogy between incongruence and incommensurability, they do not describe incongruence as precisely equivalent to incommensurability. Orlikowski and Gash's notion of incongruence allows for a certain degree of commensurability, in that it allows for epistemological and ontological continuities between diverse frames. I argue, however, that while incongruence allows for the same data to be interpreted in different ways, incommensurability stresses that the data are in themselves differently constituted. In Figure 2 there is a greater conceptual distance to cover in going from incommensurate understanding to congruent understanding, than there is in going from incongruent understanding to congruent understanding.

The difference is subtle but I believe important, especially with regard to communicative efficiency. While only a small proportion of a community might espouse a technological frame that is incommensurate with the frame of the group that is being interacted with, the impact of that small group on the overall communicative efficiency might be out of proportion to their size within the group. While there is room for both agreement and disagreement in the interactions between two technological frames, it might pay therefore to examine situations where incommensurabilities might be occurring.

2.1 TFT Methodology
TFT methodology examines frame interactions through the study of the social interactions, social behaviours, and texts in and through which such interactions are materialised. These objects of study can include: interviews with participants; ethnographic observation of group activities (workplace practices, meetings, etc.); content analysis of paper and CMC documents (reports, memos, agendas, web sites, e-mail logs, etc.); and metrics (the click-trails of users of a web site, or the pattern and type of calls or e-mails to a help desk). The methodology assumes, time and research circumstances permitting, an iterative relationship between induction and deduction. Empirical data are collected, reviewed, and analysed in order to identify technological frames. At the same time, archive research is used to try and identify the broader contexts within which such frames might be set. Primary and secondary sources are then used to inform analyses, analyses that can then be reapplied to existing data, or used to inform the gathering of fresh data.

3 TECHNOLOGICAL FRAMES IN DLESE
In its documents, DLESE represents itself as many things. The two conceptions I am concerned with here are (a) DLESE as a library that is digital, and (b) DLESE as a digital object that performs certain library-like functions. Unlike Orlikowski and Gash, I do not suggest that these definitions are mappable onto distinct constituencies within DLESE, although this is a situation that may change in the future as DLESE develops. I suggest instead that they are emergent properties of DLESE as a whole, which may at some point stabilise into various communities of perception.

3.1 DLESE as Library
The first definition concerns that which DLESE appears as a digital version of: the 'traditional' library. This definition has its origins partly in ideas of what a library should be, and how it orders its contents.

It is always difficult to discern an artifact's point of emergence, but places to look for the roots of modern libraries might include: the development of alphabetisation, the decimal system, and indexing in Europe from the thirteenth century on [15]; the technologies that culminated in Gutenberg's development of the movable type printing press [20]; the development of a scientific community that recorded its results in peer reviewed publications [43]; the growth of a secular public sphere both stimulated by and embodied in public institutions [2]; and the development of systems of classification [16], including the Dewey Decimal System [32, 33].

The DLESE material I have examined presents DLESE more as a digital than a traditional library. Nevertheless, a number of features of "traditional" libraries are referred to (for instance, catalogued materials that are high quality, reliable, and peer-reviewed), and the breadth, depth and quality of DLESE's collection is particularly emphasised. According to the DLESE Community Report [31]:

A primary goal is to collect and make available the large body of quality educational materials developed with federal funding ... One of the central functions envisaged for DLESE is the development of a comprehensive system for reviewing and evaluating the diverse materials in its collections. This review system will provide a mechanism for recognizing high-quality work analogous to the peer review process traditionally used for journal articles. (18)

As was the case with the development of the first peer-reviewed scientific journals, "the review process should provide users with a high degree of confidence that they will be able to find the high-quality instructional materials that they need, and creators of new materials will benefit from incentives that will result in national recognition for their contribution" (18-19) [c.f. 43].

DLESE materials will be 'rigidly' catalogued along 'a number of axes,' using both traditional classification categories, as well as the new sets of categories emphasising pedagogical efficacy and the earth systems thinking expected to emerge from the DLESE community's requirements. Extensive cataloguing is seen as one way in which the vast amounts of earth science data that already exist can be made more meaningful. Where further assistance is required, a "human-mediated" help desk "will provide user-assistance in finding resources" (32).

The role of the librarian as community delegated curator is emphasised: "Guaranteeing the integrity and stability of DLESE materials will provide one of its key distinctions from the current World Wide Web. Integrity includes accessibility, currency, correctness; stability includes operability, periodic upgrades, and monitoring of product usage. This is a fundamental and non-negotiable function of the library" (32).

3.2 DLESE as Digital Artifact
DLESE documents are particularly rich in discourses concerning the possibilities of technology. The following examples are taken from two main sources: the online proceedings of a DLESE seminar held in 1999, entitled "Portal to the Future" [30], and DLESE's current public mission statement, the DLESE Community Plan [31]. In these documents, DLESE describes itself in a number of ways; for instance as a 'Memex,' a 'portal,' a 'search engine,' a 'network,' and as a holder of 'digital artifacts.'

First, DLESE explicitly references Vannevar Bush's 1945 vision of the Memex as a technology that promised instant control of large amounts of data [13]. According to the DLESE Community Report, "The idea of a system that organizes information and integrates services is not new, dating back at least as far as Vannevar Bush's vision of the Memex ... Nor is it a large step from this idea to its application to science education" (13). The theme is reflected in a number of places, for instance in a vignette that describes two professors producing class projects, who turn to DLESE and rapidly find precisely the information that they are looking for (2). The ease with which they do this recalls Bush's imaginary user manipulating his Memex (2).

Second, DLESE emphasises its role as a 'portal.' The portal metaphor has been central to DLESE design for a while: a 1999 report [30] describes DLESE as a "Portal to the Future." Within the report DLESE is described as "enhancing the collection, distribution, and service functions of a traditional library" (1), by providing a portal to earth data (2), as well as being "a dynamic identity establishing new communication links between new users and experienced users of library materials, between the creators and users of materials, and between learners, educators and scientists" (2). The report contains a quote by Goodchild (a participant) that reads: "The Moonshot: By 2005, to design and implement a digital library for earth science education that serves educators and students by facilitating a new era of sharing materials, and by being a reservoir of choice for those materials" (3).

By the time this quote was incorporated into DLESE's Community Plan it read: "The Challenge: By 2005, to design and implement the Portal, a digital library for earth science education..." (7; emphasis added).

Third, DLESE presents itself as a search engine: "The DL should provide easy and individualized access to all through a search mechanism and profile of user.... This should be done through computer search, with an option for a human help desk" [12]. In an "Example User Scenario" [25], the two professors mentioned above "[go] back to the search engine [and] ... ask for a global warming theme...This search led to several hundred results, but luckily the search engine sorted them into three categories..."

Fourth, DLESE will deal in distributed digital artifacts: "The Library's 'holdings' are a federated collection of digital artifacts from many sources, and only a fraction are acquired and maintained in central repositories" [21]. The placing of holdings in quotation marks indicates that DLESE's holdings are not those of the traditional library, but rather provided through connection to high-speed electronic communication networks. Thus, fifth, "The DL will provide seamless links to other relevant libraries" [12]; "[N]ew information technologies provide direct linkages between people (e.g., educators, researchers, and the community at large) and information about the earth (e.g., instructional materials, data sets, images) that make new ways of learning possible" [31]; and "The work of [DLESE] advisory groups should be complemented with broad-based, electronically facilitated input directly from the community of Library users/creators" [21]. Our two imagined users described above, for instance, before performing their search, have sat down "at a nearby PC, connected through high-speed Internet-2 connections [and] quickly moved to the DL's search interface" [24].

These and other descriptions of the potential of DLESE's digital technology are widely distributed throughout DLESE publicity material; the general tone may be summed up as follows: "Taking advantage of new information technologies, DLESE can enhance the collection, distribution, and service functions of a traditional library. Serving as a community center for Earth system education at all levels, DLESE creates the possibility of a seamless, lifelong learning experience" [31].

The characteristics of these two emergent framings of DLESE (library, and digital artifact) are summarised in Table 1 (below). It should be emphasised that these framings are general and are posited heuristically. I do not want to propose (as Orlikowski and Gash do) that they are held by separate, identifiable communities, either congruently or incongruently. Rather, I suggest that their articulation in differing circumstances could both drive and hinder DLESE. To examine ways in which this might occur, I have analysed postings from a publicly viewable DLESE online discussion forum.

Table 1
DLESE AS LIBRARY | DLESE AS DIGITAL ARTIFACT
A library that is digitised | A digital artifact that acts like a library
Underlying model: bricks and mortar library | Underlying model: the technology enabled community
Related spatial model: centralised and enclosed building | Related spatial model: decentralised and open network
Access to local holdings | Network access to distributed resources
Repository | Portal
Catalogue | Search engines
Stable documents | Fungible documents

4 ANALYSIS
The forum of the DLESE Collections Mailing List (CML) was chosen as this was the forum with the most postings (90 messages between 02.03.2000 and 11.09.2000). Compared to other DLESE forums, there were longer threads, and clusters in time that suggested ongoing dialogues. The 90 messages comprised 35 single postings, 8 threads of 2 messages, 5 threads of 3, 1 thread of 5, 2 threads of 6, and 1 thread of 7. 23 messages were posted between 02/03/00 and 02/23/00, and 21 messages between 11/01/00 and 11/16/00, and these clusters usually contained the longer threads. Postings have been edited as follows: the original headers have been redacted to impart a sense of narrative continuity; I have only reproduced relevant quotes from these messages and not the whole messages themselves, as indicated by the lead ellipses in each message; and I have quoted messages in the order in which they appeared as threads. Pseudonyms have been used.
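The thread-size distribution accounts for every post in the corpus. A throwaway tally (a check written for this rewrite, not part of the study) confirms the counts sum to the 90 messages analysed:

    # Thread-size distribution reported for the CML corpus:
    # {thread size: number of threads}. 35 singles + 8*2 + 5*3 + 1*5 + 2*6 + 1*7.
    threads = {1: 35, 2: 8, 3: 5, 5: 1, 6: 2, 7: 1}
    total_messages = sum(size * count for size, count in threads.items())
    print(total_messages)  # -> 90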
An emergent theme in the CML concerns the issue of what constitutes peer review for the DLESE collection. On the one hand, participants can see the advantages of a centralised and closed peer review process. Placing the review process in the hands of those thought to be reliable and trustworthy representatives of the geoscience community is seen as a guarantee that the DLESE resources being reviewed are similarly reliable and trustworthy.

From: Jackie
Date: 02.03.2000
... [Chris's report states that] It is the responsibility of the Academic Recognition Task Force to insure that review processes and selection criteria are academically sound [and] defensible ...

On the other hand, there is a recognition that DLESE community members may also have experience, for instance pedagogic experience, of using a resource. This experience might also be of value to DLESE. Further, given the fact that it is planned to place a significant part of DLESE's metadata generation in the hands of the community, it makes sense to establish a relationship of trust between the generators and holders of metadata early on. CML members therefore recognised that while peer review in the traditional sense was important, so was a form of CMC-based decentralised peer review. The concept of the 'recommendation engine,' specifically linked to the model used by Amazon.com, was thus introduced:

From: Jackie
Date: 02.07.2000
... This idea differs from the familiar peer-review system in that the reviewers self-select themselves as users of the material, rather than being selected by an editor. One advantage is that many reviews could be accumulated, rather than just one or two. Another advantage is that the reviewers could be people who actually used the material in the classroom, rather than people who just looked the material over for the purpose of writing a review. The present plan calls for a review for scientific accuracy by a scientist selected by an editor/gatekeeper (as one of the last steps, following the community review).

Two different concepts of 'peer review' are thus being discussed. While they are seen as complementary:

From: Gerry
Date: 02.09.2000
... I think that user reviews and community feedback are an essential service of the library, not instead of the peer review but in addition to the peer review.

there also seems to be little recognition that they might be based upon two different definitions of 'peer' (the first being the specific use associated with the journal and conference peer review processes, and the second being the more general sense of 'colleague'). Instead, a continuum of peer status is developed:

From: Chris (Philip)
Date: 02.10.2000
... Just thinking ... I imagine several levels of review ... a la Gerry's comments the other day. First would be the review process (yet to be defined) that leads to acceptance of the product for the library. Second would be reviewers/users. These could be given the opportunity to identify themselves to both other users and to the producer - much as seems to be common practice with journals.

Gerry contributes a JPEG image that shows a target set in a field that represents 'the entire Internet: the good, the bad, the ugly.' The target's series of concentric rings correspond to ever more refined levels of review of the resource, with the bull's-eye representing the most rigorous form of peer review. Jackie responds enthusiastically to the idea, but at the same time emphasises the different nature of 'traditional' and 'distributed' peer review processes, and retains the notion of traditional peer review as the ultimate arbiter of a resource's reputation:

From: Jackie
Date: 02.16.2000
... This discussion is all with the understanding that the community-review system would be used to judge "pedagogical effectiveness," "ease of use by faculty and students," and "motivational/inspirational to students." I am not ready to consider dispensing with the recruited review by a scientist evaluating "scientific accuracy."

Following a DLESE meeting in July, a summary of a focus group organised by the DLESE Collections Working Group (CWG) was posted to the CML. The plan envisaged community collection review being open only to a sub-set of the DLESE community, "members who were registered with DLESE as educators." These members would have privileged access to a section of the DLESE site where a series of online submission forms would gather their feedback about resources. Having previously posited a distinction between "scientific peer review" and "community peer review," the report thus now proposed a further distinction, between community members and 'registered' community members. A select number of these 'registered' community members were in turn to be appointed to an 'educational review board,' with the expectation that these selected members would bring the same expertise and objectivity to the DLESE peer review process that scientists normally bring to journals:

From: Jackie
Date: 07.06.00
... Under the proposed plan, only members who were registered with DLESE as educators would be able to access the evaluation area. However, within the group of people in the DLESE membership database as "educators," reviewers/testers would step forward rather than being recruited by an editor. [...] The focus group favors having an education review board to parallel the science review board, with a review by a selected education specialist paralleling the review by the selected science specialist.

The community review subsection of the DLESE review plan was now beginning to resemble the 'target model' earlier advanced as a model for overall peer review, with an inverse correlation posited between peer review proficiency and the number of community members expected to display such proficiency.

The multiplication of components in any process inevitably complicates the way that process is mapped out. A contributor (Philip) whose message does not appear in the list, but whose content is reported to the list by a CML member, notes that "Throughout this scenario there is reference to various *evaluation levels.* This is, in my mind, another key that could simplify or complicate things depending on how it is handled." In a follow-up comment to the CML, it was noted:

From: Jackie
Date: 07.14.2000
... Hi Phil, I agree with your suggestion that it needs to be really clear to everyone (contributors, potential users, and library staff) at what review step each resource sits. It seems certain that there will be a multi-step review process, but what those steps will be and who will implement them remain undetermined (our Collections proposal suggests some steps, but it is only "pending"). We need labels for those steps, and a clearly-articulated, broadly-disseminated definition of what each step means, philosophically and operationally.

Given that DLESE is committed to an implementation time-line, such generality in the review process specifications worried one developer (caps. as in original):

From: Gail
Date: 07.14.2000
... Since I am working on the DLESE metadata framework and developing templates for it, it would be VERY helpful to come to a tentative agreement on evaluation levels VERY soon.

Soon, tentative guidelines for a two-tier acquisition process were posted. These partly involved renaming DLESE's 'unreviewed' collection as an 'open' collection, and establishing some basic criteria of 'fitness for geoscience pedagogy' for the latter. It is not clear however to what extent establishing this distinction significantly operationalised DLESE's peer review process, given (a) that DLESE espouses a more rigorous 'scientific peer review' model, (b) that DLESE lacks resources to widely implement this model, and therefore (c) that at any point there will always be 'reviewed' and 'unreviewed' materials in DLESE's holdings anyway. One developer subsequently posted the comment:

From: Henry
Date: 09.30.2000
... Aside from the Reviewed and Open Collections, then, is there a plan to allow shall we say the Great Unwashed into the library? i.e., resources that undergo even less scrutiny -- some would suggest none at all. I'm not necessarily advocating such a policy; however, I'd be interested in your views on this, so I can respond intelligently when this question comes up (which it does fairly often).

Henry's point, although humorously framed, was pertinent. By November 1, following the presentation of a draft review policy at another CWG meeting (a presentation that included a sophisticated flow chart of the route that any resource would have to take in order to be accepted into DLESE), the 'reviewed' and 'open' collections became the 'broad' collection. It was proposed that access to the broad collection be guarded by three filtering categories, one of which is subdivided into a further three sub-categories. The question still remained, though, of operationalisation:

From: Jackie
Date: 11.01.2000
... The question for discussion today is how shall the filters at the gateway to the Broad Collection be applied? This was a topic of heated discussion at the metadata workshop last week, and was left unresolved.

The 'broad collection' concept has subsequently received some debate that focused on the practicalities of the filtering criteria. Contrary to some of the originally stated aims of DLESE, regarding the desirability and necessity of involving the DLESE community in reviewing DLESE holdings, the conversation since November has tended to focus on the desirability of having what one poster called 'DLESE aligned' reviewers at the gate of the broad collection, at least initially. At the time of writing (December 2000), no new policy has been advanced, although one of the last messages on the subject is, appropriately, a plea to begin implementation; once in the broad collection, it is argued, the best resources can be exhaustively catalogued for the core collection, while inappropriate resources can be reviewed and on occasion removed.

5 DISCUSSION
What does this discussion of peer review tell us about the digital library design process? I suggested above that DLESE's construction of itself as a digital library is based on at least two premises: DLESE as a library, and DLESE as a digital object. The two concepts often appear undifferentiated within DLESE literature; this is not surprising, as it is DLESE's stated mission to integrate the two. However, if the two concepts appear within the same discourse without reflexive differentiation, there is potential for some confusion.

I wish therefore to draw attention to two emerging synecdoches in the CML. A synecdoche is a figure of speech that conflates a subset with the overall set, or vice versa (for instance reporting that "The United States won ten medals in today's Olympics," rather than "United States athletes won ten medals...").

First, it is interesting to note that the topic under current discussion, that of how to review possible accessions to the broad collection, is only a sub-set of the overall peer review scheme. If we consider the concentric target model that was present in early exchanges, this part of the peer review scheme would be represented by just the outer ring of the target; yet somehow it seems to be starting to stand in for the target as a whole.

Second, the discussion concerning the development of a process for peer reviewing accessions to the DLESE collection is in some senses becoming displaced by a discussion concerning the development of a process for peer reviewing the peers who would then carry out the peer reviewing process. This displacement is exemplified in the exchanges regarding who might be fit to be a gatekeeper to the DLESE collection: community members, 'registered' community members, etc.

In the collection design process, these two synecdoches (that of the broad collection, and that of the peer reviewer selection process) seem to have served as rhetorical devices around which the discussion in the CML has focused. Why might this displacement be occurring? A technological frames theory analysis suggests that in discussing peer review within the context of an institution considered both as a library and as a digital artifact, the CML is in effect working with two alternate frames for peer review. I have tabulated some of these distinctions in Table 2:

Table 2
DLESE AS LIBRARY | DLESE AS DIGITAL ARTIFACT
Library Models:
A library that is digitised | A digital artifact that acts like a library
Peer Review Models:
Peer reviewers as 'registered' members | Peer reviewers as community members
Peers as gatekeepers | Peers as colleagues
Peer review as contribution | Peer review as potential hindrance
Peer review as objective fact | Peer review as subjective opinion (recommendation engine)
Reviewed collection | Unreviewed/open/broad collection

In the presence of these two alternate frames, the synecdoches outlined above appear to function as 'boundary objects,' conceptual objects that lie on the boundaries of various communities of perception, and that serve as orienting points for discussion between those communities [11]. Although boundary objects are often considered in the context of objects that enable and guide the translation of concepts between different groups, it is possible that they can also emerge unguided. I suggest that the synecdoches identified above constitute unguided boundary objects, that serve as informal conceptual tools by which CML members attempt to reconcile two alternate enframings of the concept of 'peer review.' In such situations there is potential for discussants to assume that an issue has been discussed when in fact it hasn't, or rather, that it has been reduced to a synecdoche that represents a solution to both sides. This is a possible explanation for the fact that in devising a review policy for the broad collection (a category of DLESE holdings that did not exist at the beginning of the CML discussion) some CML members seem to be thinking that they are devising a review policy for the DLESE collection as a whole.

6 SUMMARY
The presence of diverse technological frames within a community is not in itself counterproductive. The existence of diverse technological frames should however be acknowledged and accounted for in the design process. Identification of mediator roles and of 'guided' boundary objects, for instance, relies in part on an understanding of the technological frames present in a given situation.

In attempting such an accounting, this study has borrowed from two different strands of research. The first concerns the emergence of technological frames in the context of technology implementation. Studies have included Orlikowski's work on the introduction of new groupware, Lotus Notes, across a large organisation [36], and Tracy's account of conflicting frames of reference between callmakers and calltakers at a 911 dispatch center [42]. The second concerns the implications of digitising 'traditional' libraries [7, 11, 14, 26, 28, 29, 32, 33, 34, 41]. These researchers and others have problematised the assumption that building a digital library is just a case of translating the components of traditional libraries into their digital equivalents. The differences between catalogues and search engines, solo research and CMC mediated collaborative projects, paper and digital media, and so on, are not just issues of design, but also of frame translation. Digital library construction is not just a case of converting traditional library structures and functions into digital form; neither is it just a question of building digital tools to perform library-like functions.

Operationalising these two perspectives in the context of the DLESE community-led design process has been achieved through the use of (a) ethnographic observation and (b) fine-grained interaction and content analysis. The longitudinal and iterative methodology used makes not only for nuanced research, but allows research results to be folded back into the project. Research can make important recommendations in a number of areas, including training opportunities for DLESE staff and community members, documentation and FAQs, and so on.

In terms of design principles, it makes economic and social sense to build good practice into collaborative architecture at the beginning, rather than trying to achieve it after the fact. More recent research is looking therefore at the nature of 'guided' boundary objects, and how these may be supported in the future. Research on the user community of the Water in the Earth System (WES) collection is looking at how experienced WES members explain to less experienced community members how WES works. Aware both of where WES community members are coming from, and where WES itself is going, experienced members can mediate and translate between the two technological frames outlined above. In effect they act as 'guided' boundary objects. New research is therefore focusing on (a) the emergent behaviours of experienced WES community members who take on the role of explaining WES as an institution to novice WES community members, and (b) the ways these roles interact with the community use of tools such as brain-storming sessions and concept maps.

7 CONCLUSION
It is often said that changing people will be the hardest part of the information technology revolution. When we ask what a digital library is (and isn't), we are in many senses also asking who belongs to the digital library community now, who will join in the future, and why they will join. In engaging with these questions, this study has examined the emergent perceptions and roles of the members of a large, diverse and complex digital library community. Although the research is based upon observation of one case, ongoing research will implement a parallel longitudinal methodological framework to continue the examination of the parts that technological frames, guided and unguided boundary objects, and mediators, play in the community-led digital library design process.

8 ACKNOWLEDGMENTS
I would like to thank the members of the DLESE Project Center, in particular Tammy Sumner (PI) and Mary Marlino (Director), for encouraging and assisting me in this project.

9 REFERENCES
[1] Abelson, R.P. Psychological Status of the Script Concept. American Psychologist 36(1981:7), 715-729.
[2] Anderson, B. Imagined Communities. Verso, London, 1991.
[3] Argyris, C., and Schon, D. Organizational Learning. Prentice-Hall, Englewood Cliffs NJ, 1978.
[4] Barley, S. Technology as an Occasion for Structuring. Administrative Science Quarterly 31(1986), 78-108.
[5] Barley, S. Images of Imaging: Notes on Doing Longitudinal Fieldwork. Organization Science 1(1990:3), 220-247.
[6] Bartunek, J., and Moch, M. First Order, Second Order, and Third Order Change and Organization Development Interventions: A Cognitive Approach. Journal of Applied Behavioral Science 23(1987:4), 483-500.
[7] Bishop, A. (ed.). How We Do User-Centered Design and Evaluation of Digital Libraries: A Methodological Forum. 37th Allerton Institute, 1995. Graduate School of Library and Information Service, University of Illinois at Urbana-Champaign. http://edfu/lis.uiuc.edu/allerton
[8] Bloomberg, J.L. The Variable Impact of Computer Technologies on the Organization of Work Activities. Proceedings of the Conference on CSCW 1986. ACM Press, 35-42.
[9] Boorstin, D. The Discoverers. Random House, New York, 1983.
[10] Bougon, M., Weick, K., and Binkhorst, D. Cognition in Organizations: An Analysis of the Utrecht Jazz Orchestra. Administrative Science Quarterly 22(1977:4), 606-639.
[11] Bowker, G., and Star, S. (eds.) How Classifications Work: Problems and Challenges in an Electronic Age. Library Trends Special Issue 47(2).
[12] Burrows, Howard, Shelley Canright, Tim Foresman, Don Johnson, Heather Macdonald, Sudha Ram, and Mike Smith. Panel 1: What are the Appropriate Scope and Focus for an Earth System Education Digital Library? Portal to the Future: A Digital Library for Earth System Education. Coolfont Resort, Berkeley Springs, West Virginia, August 8-11, 1999. Ms. http://www.dlese.org/panel/reports/panel_1.html
[13] Bush, V. As We May Think. The Atlantic Monthly 176(1945):101-108.
[14] Crabtree, Andy, Michael B. Twidale, Jon O'Brien, and David M. Nichols. Talking in the Library: Implications for the Design of Digital Libraries. ACM Digital Libraries 1997:221-227.
[15] Crosby, Alfred W. The Measure of Reality. Quantification and Western Society, 1250-1600. Cambridge University Press, Cambridge, 1997.
[16] Dolby, R.G.A. Classification of the Sciences: The Nineteenth Century Tradition. In: Roy F. Ellen and David Reason (Eds.), Classifications in Their Social Context. Academic Press, London, 1979.
[17] Dougherty, D. Interpretive Barriers to Successful Product Innovation in Firms. Organizational Science 3(1992:2), 179-202.
[18] Douglas, M. How Institutions Think. Routledge and Kegan Paul, London, 1987.
[19] Eden, C. On the Nature of Cognitive Maps. Journal of Management Studies 29(1992:3), 261-265.
[20] Eisenstein, E. The Printing Press as an Agent of Change. Cambridge University Press, Cambridge, 1979.
[21] Fulker, D. Panel 2: How Should the Library be Governed, Managed, and Funded? Portal to the Future: A Digital Library for Earth System Education. Coolfont Resort, Berkeley Springs, West Virginia, August 8-11, 1999. Ms. http://www.dlese.org/panel/reports/panel_2.html
[22] Giddens, A. The Constitution of Society: Outline of the Theory of Structuration. University of California Press, Berkeley CA, 1984.
[23] Gioia, D.A. Symbols, Scripts, and Sensemaking: Creating Meaning in the Organizational Experience. In: The Thinking Organization. Jossey-Bass, San Francisco CA, 1986.
[24] Goffman, E. Frame Analysis. Harper and Row, New York, 1974.
[25] Kastens, K., Ciccione, K., Duff, B., Goodchild, M., Gordin, D., and Ruzek, M. Panel 3: Collections. Portal to the Future: A Digital Library for Earth System Education. Coolfont Resort, Berkeley Springs, West Virginia, August 8-11, 1999. Ms. http://www.dlese.org/panel/reports/panel_3.html
[26] Kilker, J., and Gay, G. The Social Construction of a Digital Library: A Case Study Examining Implications for Evaluation. Information Technology and Libraries 18(1998:2), 60-70.
[27] Kuhn, T. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1970.
[28] Levy, D. I Read the News Today, Oh Boy: Reading and Attention in Digital Libraries. Proceedings of ACM Digital Libraries 1997:202-210.
[29] Levy, D., and Marshall, C.M. Going Digital: A Look at Assumptions Underlying Digital Libraries. Communications of the ACM 38(1995:4), 77-84.
[30] Manduca, C.A., and Mogk, D.W. Portal to the Future: A Digital Library for Earth System Education. Coolfont Resort, Berkeley Springs, West Virginia, August 8-11, 1999. Preliminary Report. Ms. http://gdl.ou.edu/report/reportnew.html
[31] Manduca, C.A., and Mogk, D.W. DLESE: A Community Plan. DLESE Program Center, Boulder CO, 2000.
[32] Miksa, F. The Development of Classification at the Library of Congress. University of Illinois Graduate School of Library and Information Science Occasional Papers, #164, 1984.
[33] Miksa, F. The DDC, the Universe of Knowledge, and the Post-Modern Library. Forest Press/OCLC Online Computer Library Center, Albany NY, 1998.
[34] Neves, F., and Fox, E. A Study of User Behavior in an Immersive Virtual Environment for Digital Libraries. Proceedings of ACM Digital Libraries 2000:103-111.
[35] Orlikowski, W. The Duality of Technology: Rethinking the Concept of Technology in Organizations. Organization Science 3(3), 398-427. (1992a)
[36] Orlikowski, W. Learning From Notes: Organizational Issues in Groupware Implementation. MIT Sloan School Working Paper #3428-92. (1992b) http://ccs.mit.edu/papers/CCSWP134.html
[37] Orlikowski, W., and Gash, C. Technological Frames: Making Sense of Information Technology in Organizations. ACM Trans. Inf. Systems, 12 (April 1994), 174-207.
[38] Orlikowski, W., Yates, J., Okamura, K., and Fujimoto, M. Shaping Electronic Communication: The Metastructuring of Technology in Use. MIT Sloan School Working Paper #3611-93, 1996. http://ccs.mit.edu/papers/CCSWP167.html
[39] Schutz, A. On Phenomenology and Social Relations. University of Chicago Press, Chicago, 1970.
[40] Sheldon, A. Organization Paradigms: A Theory of Organizational Change. Organizational Dynamics 8(3, Winter 1980), 61-80.
[41] Theng, Y., Mohd-Nasir, N., and Thimbleby, H. Purpose and Usability of Digital Libraries. ACM Digital Libraries 2000:238-239.
[42] Tracy, K. Interactional Trouble in Emergency Service Requests: A Problem of Frames. Research on Language and Social Interaction 30(1997:4), 315-343.
[43] Zuckerman, H., and Merton, R. Patterns of Evaluation in Science: Institutionalisation, Structure and Functions of the Referee System. Minerva IX(1971:1), 66-100.


Legal Deposit of Digital Publications: A Review of Research and Development Activity

Adrienne Muir
Department of Information Science
Loughborough University
Loughborough LE11 3TU, UK
+44 (0)1509 223065
[email protected]

ABSTRACT

There is a global trend towards extending legal deposit to include digital publications in order to maintain comprehensive national archives. However, including digital publications in legal deposit regulation is not enough to ensure the long-term preservation of these publications. Concepts, principles and practices accepted and understood in the print environment may have new meanings, or may no longer be appropriate, in a networked environment. Mechanisms for identifying, selecting and depositing digital material either do not exist, or are inappropriate, for some kinds of digital publication. Work on developing digital preservation strategies is at an early stage. National and other deposit libraries are at the forefront of research and development in this area, often working in partnership with other libraries, publishers and technology vendors. Most work is of a technical nature. There is some work on developing policies and strategies for managing digital resources. However, not all management issues or user needs are being addressed.

Categories and Subject Descriptors
K.5.0 [Legal aspects of computing]

General Terms
Management, Legal Aspects

Keywords
Legal deposit, digital publications, digital preservation

1. INTRODUCTION

The concept and practice of legal deposit is under threat in the digital environment. The main, though not the original, aim of legal deposit is to ensure the preservation of a nation's intellectual and cultural heritage over time. Many countries are extending legal deposit regulations to cover digital publications in order to maintain comprehensive national archives. However, even countries that have been dealing with the legal deposit of digital publications for some time are still grappling with how to collect and manage this material effectively in the long term. Existing collection management principles and practice were not designed with digital information in mind. Online and networked publications pose particularly complex challenges.

Publishers typically have a legal obligation to deliver one or more copies of their publications to deposit libraries. The depositor is usually responsible for the cost of deposit. Legal depositories include national libraries, parliamentary libraries, university libraries and national archives (for non-print material). There is great variety in the types of material collected through legal deposit. The requirement is usually material available to the public, whether for sale, for hire or for free. Printed publications are really the only common factor. Other types of material collected include sound recordings, audiovisual material and software. For a recent summary of the current status of legal deposit around the world, see [1].

There is currently a great deal of research and development activity in this area. Early work focused on identifying issues and problems and on gathering information for making the case for the extension of legal deposit. Currently, deposit libraries are carrying out work, often in collaboration with other deposit libraries, publishers and technology vendors. Much of this work is of a technical nature and focuses on building the basic infrastructure, setting up digital depositories and collecting digital publications. Researchers are also working on metadata and digital preservation issues.

This review of research and development work will focus on activities specifically related to digital legal deposit. However, it will also touch on more generic work that is especially relevant. The review starts with a discussion of the issues identified through research, to provide some background and context for the rest of the review. Research activities are grouped into categories for discussion. These categories are: building the infrastructure; pilot projects; and digital preservation. The issues that are not currently being addressed are identified and conclusions are drawn.
2. THE ISSUES

The deposit of digital publications raises legal, economic, technical and managerial/organisational issues at all stages of the legal deposit process. For the purposes of this review, these stages are summarised as: identification of publications; selection; acquisition; accession and processing, including storage; preservation; and access. There are a number of fundamental factors that facilitate the legal deposit of digital publications: definitions; metadata; and standards.


There are also political issues associated with legal deposit. These arise because there are a number of different actors involved in the legal deposit system, and each of these actors has their own interests. The interests of one group do not necessarily coincide with those of another. For example, the commercial interests of publishers and the legal requirement to give up several copies of their product combine to cause tension between publishers, deposit libraries and legislators.

An important point that arises from the literature is that decisions taken at one stage affect decisions taken at other stages. Selection policies may need to take into account the ability of the depository to capture and preserve particular publications. Technical, legal, economic and organisational issues may influence preservation choices. Alternatively, different preservation strategies have different economic and management implications.

For the purposes of this review, the issues identified through research are discussed within the framework of the legal deposit process described above. Metadata and standards are also discussed in this way. The issue of definitions pervades the whole process: many well-established concepts either do not apply in the digital environment or need redefinition. This issue is so fundamental to legal deposit in the digital age that it is discussed separately.

3. DEFINITIONS

A common theme in the literature is a lack of agreed definitions for various concepts in the digital environment. Terms that are well understood in the print environment are irrelevant or have new meanings in the digital world. Examples include terms relating to documents or publishing such as "publication," "place of publication," "publication date," "publisher," "edition" and "authenticity." Another problem is that the same words have different meanings for different communities. For example, "archives" and "metadata" have different meanings for different professional communities [2]. Researchers working in this area have developed glossaries [3, 4]. Unfortunately, these are of limited use because they have been developed specifically for the use of project participants.

3.1 Documents, publications and publishing

Many of the problems of definition associated with legal deposit stem from the fact that the concept was originally based on mainly textual information first made available in individual nations via a physical carrier, usually a book. Some of the traditional concepts still apply to digital information on physical media. However, telecommunications and global networking have radically changed the nature of information dissemination. Many online publications are frequently, if not continuously, updated, and they are globally available. New types of communication have emerged, such as email, mailing lists, chat rooms, personal World Wide Web home pages and dynamic Web pages that are generated 'on the fly' from databases. How much of this information can be called a "publication" in the traditional sense is open to conjecture.

The British Library commissioned a study on the definition of terms. Existing sources of definitions are given in appendices to the study report. The study found that, at the time, existing definitions were not helpful because they did not deal well with new types of material, including digital material [5]. One point made was that definitions should be format or medium-independent to make them "future proof" [5]. As far as the concepts of documents and publishing were concerned, the report did not suggest definitions, but provided "an overall framework of analysis within which to work" [5]. Martin defines a document as "(a) a combination of a work or compilation of works, the medium on which the work or compilation is stored and any access technology which is specific to the document or (b) any one of a number of copies of such a combination."

This is a more complex definition than that of Mackenzie Owen and van der Walle, who define electronic publications as "published documents which are produced, distributed, stored and used in electronic form" [6].

Martin also defines "published within the United Kingdom": "the public to which it is offered or broadcast or made available or before which it is performed includes a part of the United Kingdom ... and the publisher or an importer or distributor or an agent of any of the aforementioned is domiciled in the United Kingdom" [5]. Martin admits that this definition creates a potential loophole for publishers whose entire operation is outside of the UK, but whose offering is directed at a UK audience. He also points out that it would be difficult to enforce UK law in this situation. This problem would apply to any country.

Another British Library sponsored study [7] spells out the potential problems associated with depositing online material. The report provides definitions for different types of database. However, a major point made is that the traditional concept of publishing is not applicable in the digital environment. The "publication" process is not the same in the print and online environments, and different entities are involved. In the online world, no single entity has overall control of the process, and intellectual property rights are created at several points. The entity that owns the rights to the data may be different from the entity hosting it. The entity owning the rights to the retrieval software may be different from the data owner and/or the host entity. Who is responsible for deposit?
3.2 Preservation

There is also confusion in the terminology used for the preservation of digital information. For example, there is a difference between digital preservation and preservation digitisation [2]. Digital preservation is "the storage, maintenance, and accessibility of a digital object over time." Preservation digitisation, in contrast, involves digitising a fragile object to preserve its intellectual content: it produces a surrogate for the original object, and this surrogate will then itself need to be preserved over time.

4. IDENTIFICATION, SELECTION, ACQUISITION

4.1 Identification

The 1996 ELDEP study reported that the amount of material published only in digital form was quite small compared to the volume of traditional publishing output [8]. This view was repeated in 1999 in another study [9]. However, both of these studies stated that the proportion of published output released only in digital form was likely to increase over time.

In order to acquire information, depositories have to identify it. One suggestion here is that legal deposit could require all publishers to register their publications [10]. The existence of publications would then be known, even if they were not all collected. This may be impossible to enforce in practice because of the large numbers involved, the ignorance of many Internet "publishers" of the traditional systems, or simply their unwillingness to comply.

Bibliographic control is well developed in the print environment, less well developed for non-print formats and, it seems, virtually non-existent for digital publications. One particular area of concern is that of unique identification of digital publications. While identification of offline digital material such as CD-ROMs may be reasonably straightforward using existing identifiers, online material presents problems [9]. There may be different manifestations of the same content. The question is whether each manifestation should have a different identifier, or whether there should be one identifier for the underlying work. This raises the further question of how to identify each manifestation and relate it to the underlying work.

New types of identifier being developed include the Digital Object Identifier (DOI) and Uniform Resource Names. The DOI is being developed by the International DOI Foundation to help in the management and exploitation of digital information [11]. Uniform Resource Names are persistent identifiers for online information [12].
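To make the manifestation-versus-work distinction concrete, the sketch below mints persistent identifiers in the URN national bibliography number (NBN) namespace and derives a separate key for the underlying work. The NBN namespace itself is real; the minting scheme, the work-key derivation and all the values are invented for illustration and are not any library's actual practice.

    import hashlib

    def mint_urn(country_code, serial):
        # Mint a URN in the (real) NBN namespace; the serial scheme is invented.
        return "urn:nbn:%s:%08d" % (country_code, serial)

    def work_key(creator, title):
        # Derive one stable key for the underlying work, so that several
        # manifestation URNs can be related back to it.
        digest = hashlib.sha256(("%s|%s" % (creator, title)).lower().encode("utf-8"))
        return digest.hexdigest()[:16]

    # One work, two manifestations (e.g. HTML and PDF of the same report).
    work = work_key("Muir, A.", "Legal deposit of digital material in the UK")
    manifestations = {mint_urn("uk", 1): "text/html",
                      mint_urn("uk", 2): "application/pdf"}
    for urn, media_type in manifestations.items():
        print(work, urn, media_type)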

4.2 Selection

Mackenzie Owen and van der Walle point out that legal deposit laws are often selective in their coverage [6]. Some types of material are included and some are not. They recommend that, with some exceptions, all digital publications should be collected, including those published in parallel with print equivalents. An important point made is that deposit libraries will have to accept that they may never be able to collect all digital publications. There will be too many publications, too many publishers, and the rate of technological change is too fast.

There are some attempts at comprehensive collection of material. In Sweden, the National Library is attempting to capture the Swedish portion of the World Wide Web [13]. The aim of the Internet Archive is to archive the entire Internet [14]. The comprehensive approach may be feasible for countries with a relatively small digital publishing output, but it may turn out to be impossible for countries with bigger outputs.

Different depositories have different collecting policies, but selection often involves quality judgements: the importance of a particular publication, or its future research value. Technical issues can potentially distort digital selection policies [10]. It may well be that for crucially important material that is, for technical reasons, difficult but not impossible to acquire, access and preserve, expense is not a factor for consideration. There is the problem of moderately important "difficult" publications, or crucially important publications that are impossible to deal with. These documents may end up being lost.

The acquisition of dynamic digital information is particularly problematic. Many writers comment on the impracticality or even impossibility of capturing every version of databases that are amended very frequently or in real time [15]. The acquisition method here would be samples or snapshots. However, there is no commonly accepted practice and little practical experience of sampling techniques.

Hyperlinked documents present problems of deciding where the boundaries of the documents are. For example, there are questions about which is the appropriate level for archiving: a Web page, or an entire Web site as just one big document? There is a question as to which linked sources should also be selected. Should only internal links within a document be maintained, or should links to other documents be archived along with the original document?

With traditional publications, deposit usually means that some responsible entity sends physical objects to depositories. The situation is more complex in the digital environment. Online publications are not available in physical form, so they cannot just be sent through the post. At present, there are three main options for acquiring online information. Publishers can transfer the information onto a physical medium and send that to the depositories. Publishers can arrange to transfer, or "push," information to depositories via networks. Alternatively, libraries can "pull" from publishers' sites themselves. A variant of this activity is "harvesting." This is usually done for Internet information, where the depositories use software to identify and pull in information from sites.

Several depositories are working with harvesting software to acquire publications, including the National Libraries of Australia [16], Finland [17] and Sweden [18]. Harvesting information is problematic in that existing tools do not entirely meet the needs of the depositories. Legal issues also arise, in that the depositories often need to negotiate permission from the publishers to copy their material. The push option may not work well on the Internet because it is populated with a huge number of publishers, some of whom are small organisations or even individuals. It would be impossible to set up relationships with all of them.
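As a concrete illustration of the "pull" option, the sketch below is a minimal domain-restricted harvester: it fetches pages from one seed site, follows internal links only, and stores each page as a time-stamped snapshot. It is a toy under those assumptions; the production harvesters used by the national libraries cited above add robots.txt handling, scheduling, deduplication and much else, and the file-naming scheme here is invented.

    import hashlib
    import pathlib
    from datetime import datetime, timezone
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        # Collect the href targets of <a> tags on one page.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def harvest(seed, out_dir="snapshots", limit=10):
        # Breadth-first fetch of pages on the seed's host, each stored as
        # a time-stamped snapshot file.
        host = urlparse(seed).netloc
        out = pathlib.Path(out_dir)
        out.mkdir(exist_ok=True)
        queue, seen = [seed], set()
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen or urlparse(url).netloc != host:
                continue  # stay inside the collection scope
            seen.add(url)
            try:
                body = urlopen(url, timeout=10).read()
            except OSError:
                continue  # unreachable documents are skipped, not fatal
            stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
            name = stamp + "_" + hashlib.sha1(url.encode("utf-8")).hexdigest()[:12] + ".html"
            (out / name).write_bytes(body)
            parser = LinkParser()
            parser.feed(body.decode("utf-8", errors="replace"))
            queue.extend(urljoin(url, link) for link in parser.links)

    # harvest("http://www.example.org/")  # snapshot up to ten internal pages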
5. ACCESSION AND PROCESSING

Mackenzie Owen and van der Walle [8] recommend that quality checks and functional tests should be carried out for all items received. The purpose of such procedures is to check that the item is:

- the correct version;
- in the required medium and format;
- complete;
- undamaged;
- error free and fully functional;
- not copy protected.

Ensuring the authenticity of digital documents, which are fluid by nature and capable of being changed very easily, becomes a headache in the digital environment [15]. Some techniques for checking authenticity include time stamping and digital signatures.

The amount and types of information gathered at the accession stage will affect preservation of, and long-term access to, digital material. Deposit libraries will need more and better information, or metadata, from publishers at the point of accession than is necessary for printed information. There is also a need for some standardisation in this metadata.

The European Commission ELDEP study particularly focused on the bibliographic control of digital legal deposit collections [8]. There are questions as to whether current cataloguing rules can deal adequately with offline publications. The view emerging from the literature is that current rules may not be able to deal with online material at all.
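The sketch below shows the simplest of the authenticity techniques just mentioned: a digest and a UTC time stamp recorded at accession, against which the stored item can be re-verified later. The record layout is invented for illustration; a real deposit system would at least also sign the record.

    import hashlib
    import json
    import pathlib
    from datetime import datetime, timezone

    def accession_record(path):
        # Record what was received and when, with a digest fixing its bytes.
        data = pathlib.Path(path).read_bytes()
        return {"item": path,
                "sha256": hashlib.sha256(data).hexdigest(),
                "accessioned": datetime.now(timezone.utc).isoformat()}

    def still_authentic(record):
        # Later: does the stored item still match its accession digest?
        data = pathlib.Path(record["item"]).read_bytes()
        return hashlib.sha256(data).hexdigest() == record["sha256"]

    pathlib.Path("deposit.bin").write_bytes(b"stand-in for a deposited publication")
    record = accession_record("deposit.bin")
    pathlib.Path("deposit.json").write_text(json.dumps(record, indent=2))
    print(still_authentic(record))   # True until the item's bytes change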

6. PRESERVATION

Items in legal deposit collections are usually kept forever, therefore preservation is a central issue. If legal deposit collections are to include digital publications, solutions have to be found to the problems of digital preservation.

6.1 Media stability

Early preoccupations in this area were with the longevity of digital media. Estimates of the likely life expectancy of various storage media vary from around 1 to 100 years. Rothenberg gave some low estimates, including as little as two years for magnetic tape in some circumstances [19]. Unfortunately he gave no explanation for the low estimate and did not source his figures. The US National Media Laboratory contested Rothenberg's estimates [20]; it cites a 10-30 year life expectancy for magnetic tape. Even so, this projection does not compare well with established archival media such as permanent paper or preservation microfilm. For these carriers, life expectancy is hundreds of years under optimal conditions.

As well as having inherent instabilities, the physical carriers used for digital information also react to environmental factors. These factors include both extremes of, and fluctuations in, temperature and relative humidity. Physical media also suffer from wear and tear and incorrect handling. Van Bogart produced a report on the storage and handling of magnetic tape, which is widely cited in the literature [21].

6.2 Technological obsolescence

Media instability is not the main problem as far as the preservation of digital information is concerned. The main problem is that viewing and using digital information requires the aid of equipment. The biggest threat to long-term survival is that of technological obsolescence of the hardware and software used to create and use digital information. Technological obsolescence is not a new problem for the preservation of information; earlier examples include the Sony Betamax video recording format and Readex Microcards. Recognition of technological obsolescence as the main threat to the long-term survival of digital information becomes prominent in the library and information science literature from the mid-1990s. Lehmann sets out some of the aspects of technological change [22], including changes in coding and formats, software, operating systems and hardware. These changes can render digital material unreadable.

6.3 Preservation strategies

There are a number of possible strategies for digital preservation. A key question in deciding which strategy to use is what is to be preserved. Saving artefacts will not necessarily mean that the information itself is also preserved. Merely refreshing media will not overcome technological obsolescence. There are also problems associated with deciding exactly what the information or intellectual content is. This is especially problematic for multimedia or highly interactive information. Text, sound and pictures may be integrated; the software associated with the information may allow interaction between the user and the information. What has to be determined is whether it is the look and feel and functionality of the information product that is to be preserved, or just the raw information [2]. There are a number of potential preservation strategies that address different preservation requirements and timeframes. The main preservation strategies are technology preservation, migration and emulation.

6.3.1 Technology preservation

Technology preservation is really a short-term strategy. It involves preserving the information in its original form and also the original software and hardware used to create and run the information. The strategy is likely to also involve media refreshment, especially for information stored on media with very short lifetimes [23]. However, hardware can only be maintained in working order for a finite period.

6.3.2 Migration

The Task Force on Archiving of Digital Information favoured the migration approach. The Task Force report defines migration as "the periodic transfer of digital material from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation" [24]. There are several migration strategies [25]. Adopting migration strategies means making long-term commitments to unknown future activities and unpredictable costs. Webb makes the point that most successful migration work has been carried out with large amounts of homogenous data [10]. This is certainly not the situation for deposit libraries. While there have been some small-scale experiments with migrating publications from floppy disks [26], it is not known how well complex material stored on optical discs will migrate, if at all.
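A toy migration in the sense defined above is sketched below: content is carried from one "configuration" (a Latin-1 encoded file) to another (UTF-8), and the event is logged with before-and-after digests, since every migration changes the bytes and therefore has to be documented. The file names and the provenance format are invented for illustration.

    import hashlib
    import json
    import pathlib
    from datetime import datetime, timezone

    def sha256(data):
        return hashlib.sha256(data).hexdigest()

    def migrate_latin1_to_utf8(src, dst, log="provenance.json"):
        # The transfer itself: re-encode the content for the new configuration.
        old = pathlib.Path(src).read_bytes()
        new = old.decode("latin-1").encode("utf-8")
        pathlib.Path(dst).write_bytes(new)
        # Document the change: migration always alters the bytes.
        event = {"action": "migration",
                 "date": datetime.now(timezone.utc).isoformat(),
                 "source": {"file": src, "encoding": "latin-1", "sha256": sha256(old)},
                 "result": {"file": dst, "encoding": "utf-8", "sha256": sha256(new)}}
        log_path = pathlib.Path(log)
        history = json.loads(log_path.read_text()) if log_path.exists() else []
        history.append(event)
        log_path.write_text(json.dumps(history, indent=2))

    pathlib.Path("item.txt").write_bytes("d\u00e9pos\u00e9 1996".encode("latin-1"))
    migrate_latin1_to_utf8("item.txt", "item-utf8.txt")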
6.3.3 Emulation

The aim of the emulation concept is to allow long-term preservation of digital material while retaining the functionality and look and feel of the material. The main proponent of emulation as a preservation strategy for digital information is Jeff Rothenberg [27]. The idea of emulation is to view a digital document by using the software that created it. This does not necessarily mean that the software has to be run: the behaviour of the software could be described and the description saved so that its behaviour can be re-created in the future. The requirements for this approach would be to save the digital documents, the programs that were used to create the documents and all software required to run the documents. Software is dependent on the hardware it is created for, so the behaviour of an obsolete hardware platform would have to be emulated too. This would need the development of emulators, or software programs to mimic this behaviour [19].

Hardware emulation is potentially a simpler proposition than software emulation. The reason for this is that there are fewer hardware platforms than operating systems and application software, so fewer emulators would have to be specified. Secondly, writing specifications for hardware is a better-developed practice than writing them for software, so it would be easier to do [28].

In a paper for the Council on Library and Information Resources, Rothenberg set out the requirements for implementing emulation of hardware [28]. These include: techniques for specifying emulators; techniques for saving the necessary metadata (for finding, accessing and recreating documents) in human-readable form; and techniques for encapsulating documents, attendant metadata, software, and emulator specifications in a coherent and incorruptible way.

6.4 Disasters and rescue of digital information

Ross and Gow investigated approaches taken to access digital information when media are damaged or software and hardware are unavailable or unknown [29]. Things that can go wrong with digital information are:

- media degradation: unfavourable environmental conditions in storage, disaster, and manufacturer defects;
- loss of functionality of access devices: technological obsolescence, wear and tear of mechanical parts, lack of support for device drivers in newer software;
- loss of manipulation capabilities: due to changes in hardware and operating systems;
- loss of presentation capabilities: changes in video display technologies; particular application packages may not run in newer environments;
- weak links in the creation, storage and documentation chain: data recovered but unreadable because the encoding strategy cannot be identified, loss of encryption documentation, use of unusual compression algorithms.

Possible techniques for data recovery include heat and chemical treatments for soiled or damaged media, searching binary structures to identify recurring patterns, and reverse engineering of content. One of the findings of the study is that there is a distinction between data recovery and data intelligibility. While it may be possible to recover data through searching binary structures, technological developments mean it will become harder to read the recovered data. A future possibility is the use of magnetic force microscopy to read damaged media. Another is cryptography to help in the interpretation of recovered data.

Ross and Gow also suggest an alternative to migration and emulation strategies for preservation. Retargetable binary translation involves "translating a binary executable programme from one machine ... running a particular operating system ... and using a particular file format ... to another platform ... running a different operation [sic] system ... and using a different file format" [29].
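A crude illustration of "searching binary structures" is sketched below: it pulls runs of printable text out of an otherwise unreadable byte stream, in the spirit of the Unix strings tool. Real digital archaeology of the kind the study describes goes far beyond this, and the sample bytes are invented.

    import re

    def printable_runs(image, min_len=6):
        # Return runs of printable ASCII at least min_len bytes long.
        pattern = rb"[\x20-\x7e]{%d,}" % min_len
        return re.findall(pattern, image)

    # A stand-in "damaged" image: recognisable text embedded in noise.
    blob = b"\x00\x13\x8fDeposited: Annual Report 1998\xff\x00legal deposit copy\x00"
    for run in printable_runs(blob):
        print(run.decode("ascii"))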
6.5 Authenticity

Migration strategies potentially pose authenticity problems because they cause changes in the publications being migrated. Authenticity means that an object is the same as that expected based on a prior reference, or that it is what it purports to be. There is a range of strategies for asserting the authenticity of digital resources. The strategy used depends on the purpose for which authenticity is needed. These include unique document identifiers, the use of metadata to document changes, hashing, digital stamping, encapsulation techniques, digital watermarks and digital signatures.

During 1998, the CERBERUS project investigated the authenticity and integrity of electronic documents in digital libraries with a deposit task. The project partners were the Dutch Koninklijke Bibliotheek, the Technical Universities of Eindhoven and Delft, and the University of Amsterdam. The project was co-funded by Innovation of Scientific Information Supply. There is a brief description on the KB Web site [30], but there is little written in English on the findings of this project.

6.6 Management of digital preservation: life cycles, stakeholders and rights issues

The concept of the life cycle of digital resources has been put forward as a tool for looking at the challenges of digital preservation. This was developed in the context of a study to develop a strategic policy framework for the creation, management and preservation of digital resources. Hendley added to the original model developed by the UK Arts and Humanities Data Service. The life cycle breaks down into several stages: resource creation; selection and evaluation; management; disclosure; use; preservation; and rights management.

Data archives can often dictate requirements at the creation stage to a great extent. This is not the case for legal depositories. Deposit material and depositors will be diverse. In many cases, the main priority of depositors is commercial gain, and this can conflict with preservation interests. The creators of the framework acknowledge this. Rights issues are important because preservation strategies involve copying, and possibly changing, the original information in some way. The life cycle framework illustrates how the stages are interrelated and how decisions taken at one stage impact on other stages. The aim is to help in the policy and decision making process and to help identify where collaboration efforts would help preservation. The framework is supported by a number of case studies. One case study is legal deposit libraries.

Haynes et al. examine the attitudes of "originators and rights holders" towards the issue of their responsibility for digital preservation [31]. The study report lists various stakeholder groups: libraries; publishers; archive centres; distributors; IT suppliers; legal depositories; consortia; authors; and networked information service providers.

The consultants made a number of recommendations in their report. One was that a body should be established to co-ordinate digital archiving activities - a National Office of Digital Archiving. This suggestion is similar to that of a national digital preservation officer from Matthews, Poulter and Blagg [32]. This idea has since been taken forward in the UK. The Joint Information Systems Committee of the Higher and Further Education Funding Councils has appointed an officer to develop digital preservation strategies and work with other bodies to establish a so-called Digital Preservation Coalition [33].

Another suggestion was for a distributed approach to digital preservation. Distribution could have a number of bases, including regional, format and ownership. The suggested National Office of Digital Archiving would coordinate the development of standards and guidelines in cooperation with other agencies. The consultants suggest that legal deposit should be used as a mechanism for acquiring material, but that publishers should only have to contribute one copy of each publication. Users should not be charged for access, but costs should be shared between research funders, the public (through government funding), and research communities. There is no mention of any responsibility falling on publishers for maintaining archives.

More recently, researchers from the AHDS have carried out a study on the preservation management of digital materials and have produced a draft workbook for preservation managers. The workbook brings together research findings and available guidelines and augments this through some original research with some case study organisations [34].

NORDINFO supported a study on the copyright questions related to the legal deposit of online material. This study concentrated on the European and Nordic legal environments. The study report concludes that there is a gap between copyright provisions and legal deposit objectives. Digital preservation requires copying, and copyright exceptions should allow this. Another point that comes up is that there may be moral rights issues arising from migration activities if they result in changes to the migrated material. The report suggests that technology, including electronic copyright management systems, can contribute to solving problems. There also needs to be some investigation into how depositories can cooperate with publishers to solve problems [35].

6.7 Costs

There is a serious problem with identifying the costs associated with digital preservation. Until full-scale operational systems have been running for some time, the nature and extent of costs cannot be known for certain. Some organisations have estimated the cost of caring for digital information. The British Library included some alternative costings in its proposal for the extension of legal deposit in the UK [36]. Hendley's study of different preservation methods and associated costs took into account diversity in digital materials [23]. The study drew on the work of a related set of studies, in particular the work carried out by the AHDS. However, it also reviewed other cost models and visited digital libraries and archives.

7. ACCESS

Traditionally, deposit libraries provide access to deposit publications without charge. It is clear that this situation may not be accepted by rightsholders in the digital environment [37]. It is likely that access to deposited digital publications will be governed by licence agreements. Williamson gives a flavour of the potential complexity of providing access to digital information [38].

There is nothing in the literature that directly reports the views of users, so it is not clear what kind of access they would require. Access to printed legal deposit publications is often on a reference-only basis. It is not clear whether users would be happy with the digital equivalent of this, or if they would want remote access. Neither is there any evidence on whether users would object to limitations on access rights, say limited or no printing or downloading of material.

Another legal issue is that of security. Authentication of users and the setting up of access rights are security measures that have implications for users and libraries. This issue is applicable to all types of digital library. New technology provides the means of closely monitoring information use. The purpose of monitoring use may be to police user behaviour to ensure access agreements with publishers are not breached. Williamson points out that logging use may be burdensome for deposit libraries [38]. There is also the potential for other uses of the data, for example providing feedback to publishers, which may have data protection implications.

While stating that libraries accept that "the legitimate interests of publishers require that access is limited and controlled," Mackenzie Owen and van der Walle suggest that access should be encouraged in order to facilitate the preservation of digital publications [8]. Their reason is that if electronic publications are not used for some time, they may be found to no longer work. On the other hand, a high level of access will "check the operability of electronic publications and ... identify and remedy any access problems that may occur."

Before using digital information, users have to be able to find it. In the digital environment, bibliographic records can have direct pointers to the material. However, as Mackenzie Owen and van der Walle point out, there can be several pointers, including to the original storage location and the archival storage location [8]. Both of these may be physically located in the library. In the case of online publications, the original location may be a network address.
8. BUILDING THE INFRASTRUCTURE

8.1 Open Archival Information System reference model

A major initiative relevant to digital legal deposit collections is the development of the Open Archival Information System (OAIS) Reference Model [39]. The Consultative Committee for Space Data Systems is drafting this standard for the International Standards Organisation. An OAIS archive preserves information for access and use by a so-called Designated Community. This model is not just applicable to an organisation that stores digital records; it can be applied to any type of digital material. The OAIS models the functions involved in the long-term storage of, and access to, digital information. These functions include acquisition and processing (ingest), archival storage, access, data management and administration of the archive.

The development of the OAIS has influenced other work being carried out in exploring the development of digital deposit collections. The CEDARS (CURL Exemplars for Digital Archives) project in the UK has used OAIS in developing its preservation metadata specification. The European NEDLIB (Networked European Deposit Library) project is working on developing an infrastructure for a European digital deposit collection [40]. The British Library and the Koninklijke Bibliotheek in the Netherlands have used OAIS as the basis of their recent tenders for systems to manage their digital collections.
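The sketch below gives a minimal, concrete reading of the OAIS ingest function: a submitted file is accepted, copied into archival storage, and recorded with fixity and descriptive metadata. The directory layout, identifier scheme and metadata fields are invented; the OAIS model prescribes functions and information packages, not file formats.

    import hashlib
    import json
    import pathlib
    import shutil
    from datetime import datetime, timezone

    def ingest(sip_path, store, description):
        # Accept one submitted file, derive an identifier, copy it into
        # archival storage, and record fixity plus descriptive metadata.
        src = pathlib.Path(sip_path)
        payload = src.read_bytes()
        checksum = hashlib.sha256(payload).hexdigest()
        aip_dir = pathlib.Path(store) / checksum[:16]
        aip_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, aip_dir / src.name)                 # archival storage
        metadata = {"file": src.name,
                    "sha256": checksum,                       # fixity
                    "ingested": datetime.now(timezone.utc).isoformat(),
                    "description": description}               # data management
        (aip_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
        return aip_dir

    pathlib.Path("report.pdf").write_bytes(b"%PDF-1.3 stand-in submission")
    ingest("report.pdf", "repository",
           {"title": "Annual report", "publisher": "Example Press"})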

8.2 The NEDLIB Project

The NEDLIB project started at the beginning of 1998 and finishes at the end of 2000. Funding comes from the European Commission, and the project leader is the Koninklijke Bibliotheek in the Netherlands. The project partners are eight European national libraries, one national archive, three publishers and two information and communication technology companies. This project should be very influential in helping depositories cope with digital material. The stated project aim [40] is to "develop a common architectural framework and basic tools for building deposit systems for electronic publications." The project deals with the technical issues involved in extending legal deposit to digital material. A great deal of detailed project material is available on the project Web site [41].

The project consortium adopted the OAIS model, but the NEDLIB Deposit System for Electronic Publications (DSEP) will be narrower in scope than the OAIS model. This is because some of the OAIS functions, such as Data Management and Access, are part of the general digital library environment and not specific to the digital depository. The aim is to link the functions of the deposit system and the digital library environment through interfaces.

NEDLIB is working with Jeff Rothenberg on an emulation experiment. The plan is that the first stage will result in a design for the whole experiment, a plan for testing and comparing the results of the emulations with the original works, and a framework of preservation criteria and authenticity characteristics. The second stage involves modelling the emulation process and identifying metadata and functionality requirements. The last stage of the emulation experiment will be the implementation and evaluation of the emulation process in the testbed developed by the NEDLIB project.

The first stage of the emulation experiment is now complete. Rothenberg concludes that "The results of this study suggest that using software emulation to reproduce the behaviour of obsolete computing platforms on newer platforms offers a way of running a digital document's original software in the far future, thereby recreating the content, behaviour, and 'look-and-feel' of the original document" [42]. This claim seems somewhat inflated, since the actual experiment involved running Windows 95 publications on an Apple Mac using Connectix VirtualPC software as the emulator. The most Rothenberg can claim is that this particular software does what it says it does.

8.3 The BIBLINK Project

BIBLINK started in 1996 with support from the European Commission. Although the project came to an end in 1999, work on the initiative is ongoing under the aegis of the Conference of Directors of National Libraries. The original aim of BIBLINK was to help develop and improve national bibliographic services, focusing on digital publications, especially online publications. Potentially all libraries would be beneficiaries of the project. However, the main perceived benefit was that national libraries would not miss the publication of significant publications [43].

The BIBLINK project developed a prototype demonstration system, called the BIBLINK Workspace. The demonstrator provides a virtual workspace or "computer mediated work environment" for participating parties. It allows publishers to create records and allows participants to access the system to retrieve, update and delete records in the workspace. BIBLINK is developing an Exploitation Plan [43]. This will provide a framework for library partners to assess the possibility of incorporating the system into operational procedures.
8.4 CURL Exemplars for Digital Archives (CEDARS) Project

The CEDARS project in the UK is funded by the Joint Information Systems Committee of the Higher and Further Education Funding Councils through the eLib Programme. CEDARS began in April 1998 and is due to finish in March 2001. The aim of CEDARS is to "address strategic, methodological and practical issues and provide guidance in best practice for digital preservation" [44]. The CEDARS team is a partner in a new project on emulation for preservation funded through the JISC/NSF (US National Science Foundation) International Digital Libraries Programme [45]. The other project partners are based at the University of Michigan. The project will "develop a small suite of emulation tools, evaluate the costs and benefits of emulation as a preservation strategy for complex multi-media documents and objects, and develop models for collection management decisions that would assist people in making 'real life' decisions."

9. PILOT DEPOSITORIES

9.1 National Library of Canada

The National Library of Canada ran such a project between 1994 and 1995. The purpose of the Electronic Publications Pilot Project (EPPP) was to pilot the acquisition, cataloguing, preservation and provision of access to a few Canadian electronic journals and other publications available via the Internet [46]. The National Library of Canada is now building a full-scale electronic collection.

9.2 Koninklijke Bibliotheek, Netherlands

The Koninklijke Bibliotheek in the Netherlands took the decision to collect digital publications in 1994 [47, 48]. Offline publications, such as CD-ROMs, were stored on the stacks with the books. From 1995, the KB experimented in handling online publications on a small scale. Three publishers cooperated with the KB by agreeing to deposit some of their electronic publications with the KB. In 1996, the KB reached a provisional agreement with publishers to widen deposit. This small-scale deposit system was based on the IBM Digital Library system and became operational in 1998 [48]. The KB is now setting up a full-scale system [49].

9.3 National Library of Australia

The National Library of Australia set up the PANDORA (Preserving and Accessing Networked Documentary Resources of Australia) project in 1996. The aim of the project was to "develop policies and procedures for the selection, capture, archiving and provision of long-term access to Australian electronic publications." The Library developed a proof-of-concept archive of Australian Internet material, which has been used to develop policies and procedures for the long-term preservation of, and access to, digital publications [50]. The National Library of Australia realised that it needed integrated systems for managing all parts of its collections, including digital material. The NLA is taking this forward with its Digital Services Project. This project will provide storage for its digital material, but it will also provide management systems for most of the Library's collections.

9.4 Helsinki University Library (National Library of Finland)

EVA was originally an eighteen-month Finnish project that started in June 1997. The aim of the project was "to test methods of capturing, registration, preserving and providing access to ... online documents ..." [51]. EVA was used to test tools being developed by various Nordic projects, including a Dublin Core metadata template and converter, a URN generator, and a harvesting and indexing application.
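The sketch below illustrates the kind of record a Dublin Core metadata template produces for a harvested document. The element names and namespace are standard Dublin Core; the field values and the identifier are invented.

    import xml.etree.ElementTree as ET

    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC)

    def dublin_core_record(fields):
        # Wrap each field as a namespaced Dublin Core element.
        record = ET.Element("record")
        for name, value in fields.items():
            ET.SubElement(record, "{%s}%s" % (DC, name)).text = value
        return ET.tostring(record, encoding="unicode")

    print(dublin_core_record({
        "title": "Example harvested document",
        "creator": "Example Publisher",
        "date": "1998-05-12",
        "identifier": "urn:nbn:fi:example-001",   # invented identifier
        "type": "Text",
    }))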

Documents were harvested from the World Wide Web using the harvester. Once captured, the documents were analysed, indexed and then archived. The EVA II and EVA III projects have been building on this work [52].

9.5 Kungliga Biblioteket, Sweden

The Kungliga Biblioteket in Sweden is currently running the Kulturarw3 project. The aim of this project is "to test methods of collecting, preserving and providing access to Swedish electronic documents which are accessible on line in such a way that they can be regarded as published" [18]. The project aims to collect Swedish material available on the Internet according to specified selection criteria and to automate the collection through the use of robots.

10. CONCLUSIONS

There is a great deal of activity in this area worldwide. While exploratory work has identified many problems arising from the legal deposit of digital publications, a great deal more work is needed to solve these problems. Much of the work has been small-scale or national, yet the problems transcend national boundaries. Therefore, initiatives such as NEDLIB are to be welcomed.

The OAIS reference model provides a conceptual outline of the processes involved in a digital depository. However, by its nature it does not consider how these processes will be carried out. The NEDLIB project is attempting to develop a generic technical infrastructure for digital depositories. However, the project is limited in its scope and does not deal with the interface between the depository and the depositors.

Much of the activity in this area is concerned with technical issues. There is little evidence of any work taking a wide view of managerial or organisational issues. These issues include the management of workflows in depositories, staffing and skills requirements. It is also likely that there will have to be a lot more cooperation between publishers and depositories to facilitate deposit, preservation and access. Access especially may require negotiation. While there is an increasing interest in user needs in digital library research, there is little evidence of this in the legal deposit context.

Current work assumes that deposit systems will be organised in a similar way to current systems, in that material will be physically deposited in deposit libraries. Physical deposit may be necessary to ensure long-term preservation, but alternatives to the current system could be considered. While the concept of a comprehensive archive of the national intellectual output may remain an ideal in an increasingly knowledge-intensive world, it is not yet clear whether this is technically and organisationally feasible, or affordable.

11. ACKNOWLEDGEMENTS

I wish to acknowledge part-funding of this research by the Loughborough University Faculty Studentship scheme.
12. REFERENCES

[1] Muir, A. Legal deposit of digital material in the UK: recent developments and the international context. Alexandria, 12(3), (2000), 151-165.
[2] Russell, K. Digital preservation: ensuring access to digital materials into the future. http://www.leeds.ac.uk/cedars/Chapter.html.
[3] Working definitions of commonly used terms (for the purposes of the Cedars project). http://www.leeds.ac.uk/cedars/documents/PSW01.htm.
[4] Borbinha, J., Cardoso, F., and Freire, N. NEDLIB glossary. http://www.kb.nl/coop/nedlib/glossary.pdf.
[5] Martin, D. Definitions of publications and associated terms in electronic publications. British Library Research & Development Department, 1996.
[6] MacKenzie, G. Searching for solutions: electronic records problems worldwide. Managing Information, (July/August 2000), 59-65.
[7] EPS Ltd. The legal deposit of online databases. British Library Research & Development Department, 1996.
[8] Mackenzie Owen, J.S. and Walle, J.v.d. Deposit collections of electronic publications. Office for Official Publications of the European Communities, 1996.
[9] Bide, M., Potter, E.J., and Watkinson, A. Digital preservation: an introduction to the standards issues surrounding the archiving of non-print material. Book Industry Communication, 1999.
[10] Webb, C. Long-term management and preservation of publications on CD-ROMs and floppy disks: technical issues. http://www.nla.gov.au/niac/meetings/tech.html.
[11] The Digital Object Identifier System. http://www.doi.org/.
[12] Uniform Resource Names (urn). http://www.ietf.org/html.charters/urn-charter.html.
[13] Arvidson, A. and Lettenstrom, F. The Kulturarw3 Project - the Swedish Royal Web Archive. The Electronic Library, 16(2), (April 1998), 105-108.
[14] Kahle, B. Archiving the Internet. http://www.archive.org/sciam_article.html.
[15] Mandel, C.A. Enduring access to digital information: understanding the challenge. LIBER Quarterly, 6 (1996), 453-464.
[16] Digital Services Project. http://www.nla.gov.au/dsp/.
[17] Lounamaa, K. and Salonharju, I. EVA - the acquisition and archiving of electronic network publications in Finland. Tietolinja News, 1 (1999). http://www.lib.helsinki.fi/tietolinja/0199/evaart.html.
[18] Kulturarw3 heritage project. http://kulturarw3.kb.se/html/projectdescription.html.
[19] Rothenberg, J. Ensuring the longevity of digital information. http://www.clir.org/programs/otheractiv/ensuring.pdf.
[20] Van Bogart, J.W.C. Mag tape life expectancy 10-30 years. http://palimpsest.stanford.edu/bytopic/electronic-records/electronic-storage-media/bogart.html.
[21] Van Bogart, J.W.C. Magnetic tape storage and handling. http://www.clir.org/pubs/reports/pub54/.
[22] Lehmann, K.-D. Making the transitory permanent: the intellectual heritage in a digitized world of knowledge. In Books, Bricks & Bytes. American Academy of Arts and Sciences, Cambridge, Mass., 1996, 307-329.
[23] Hendley, T. Comparison of methods & costs of digital preservation. British Library Research and Innovation Centre, 1998.
[24] Task Force on Archiving of Digital Information. Preserving digital information: report of the Task Force on Archiving of Digital Information. RLG, CPA, 1996.
[25] Feeney, M. Towards a national strategy for archiving digital materials. Alexandria, 11(2), (1999), 107-121.
[26] Woodyard, D. Farewell my floppy: a strategy for migration of digital information. http://www.nla.gov.au/nla/staffpaper/valadw.html.
[27] Rothenberg, J. Ensuring the longevity of digital documents. Scientific American, 272(1), (1995), 42-47.
[28] Rothenberg, J. Avoiding technological quicksand: finding a viable technical foundation for digital preservation. Council on Library and Information Resources, 1999.
[29] Ross, S. and Gow, A. Digital archaeology: the recovery of digital materials at risk. British Library Research and Innovation Centre, 1999.
[30] CERBERUS. http://www.kb.nl/kb/sbo/dnep/cerberus-en.html.
[31] Haynes, D., et al. Responsibility for digital archiving and long term access to digital data. Library Information Technology Centre, 1997.
[32] Matthews, G., Poulter, A., and Blagg, E. Preservation of digital materials: policy and strategy issues for the UK. British Library Research and Innovation Centre, 1996.
[33] Digital Preservation. http://www.jisc.ac.uk/dner/preservation/.
[34] Jones, M. and Beagrie, N. Preservation management of digital materials workbook: pre-publication draft, 2000.
[35] Mauritzen, I. and Solbakk, S.A. A study on copyright and legal deposit of online documents. NORDINFO, 2000.
[36] British Library. Proposal for the legal deposit of non-print publications to the Department of National Heritage from the British Library. British Library, 1996.
[37] Wille, N.E. Legal deposit of electronic publications: questions of scope and criteria for selection. In Legal deposit with special reference to the archiving of electronic publications: proceedings of a seminar organised by NORDINFO and the British Library (Research and Development Department) (Windsor, England, 27-29 October 1994). NORDINFO, 1994.
[38] Williamson, R. Access and security issues from a publisher's perspective. In Legal deposit with special reference to the archiving of electronic publications: proceedings of a seminar organised by NORDINFO and the British Library (Research and Development Department) (Windsor, England, 27-29 October 1994). NORDINFO, 1994.
[39] Consultative Committee for Space Data Systems. Reference model for an Open Archival Information System (OAIS). http://ftp.ccsds.org/ccsds/documents/pdf/CCSDS-650.0-R-1.pdf.
[40] Werf-Davelaar, T.v.d. NEDLIB: Networked European Deposit Library. Exploit Interactive, 4 (2000). http://www.exploit-lib.org/issue4/nedlib/.
[41] NEDLIB. http://www.kb.nl/coop/nedlib/homeflash.html.
[42] Rothenberg, J. An experiment in using emulation to preserve digital publications. http://www.kb.nl/coop/nedlib/results/emulationpreservationreport.pdf.
[43] Noordermeer, T.C. A bibliographic link between publishers of electronic resources and national bibliographic agencies: Project BIBLINK. Exploit Interactive, 4 (2000). http://www.exploit-lib.org/issue4/biblink/.
[44] Cedars project summary. http://www.leeds.ac.uk/cedars/documents/MGA04.htm.
[45] JISC/NSF International Digital Libraries Programme: Emulation for Preservation: project summary. http://www.leeds.ac.uk/cedars/JISCNSF/summary.htm.
[46] Electronic Publications Pilot Project Team. Electronic Publications Pilot Project (EPPP): final report. http://www.nlc-bnc.ca/pubs/abs/eppp/ereport.htm.
[47] Noordermeer, T.C. Deposit for Dutch electronic publications: research and practice in the Netherlands. In Research and advanced technology for digital libraries: first European Conference, ECDL'97 (Pisa, Italy, September 1-3, 1997). Springer, 1997.
[48] Steenbakkers, J. Developing the Depository of Netherlands Electronic Publications. Alexandria, 11(2), (1999), 93-105.
[49] Werf-Davelaar, T.v.d. DNEP 1995-2000: the Dutch Deposit of Electronic Publications. http://www.bic.org.uk/Titia.ppt.
[50] Relf, F.A. PANDORA - towards a national collection of Australian electronic publications. http://www.nla.gov.au/nla/staffpaper/ashrelf1.html#abst.
[51] Eva - the acquisition and archiving of electronic network publications. http://www.lib.helsinki.fi/eva/english.html.
[52] Eva - elektronisen verkkoaineiston hankinta ja arkistointi [EVA - the acquisition and archiving of electronic network publications]. http://www.lib.helsinki.fi/eva/index.html.

Comprehensive Access to Printed Materials (CAPM)

G. Sayeed Choudhury, Mark Lorie, Erin Fitzpatrick, Ben Hobbs, Greg Chirikjian, Allison Okamura
Johns Hopkins University
[email protected], {mlorie,bhobbs}@jhu.edu, {gregc,aokamura}@jhu.edu

Nicholas E. Flores
Department of Economics, University of Colorado at Boulder
[email protected]

ABSTRACT

The CAPM Project features the development and evaluation of an automated, robotic on-demand scanning system for materials at remote locations. To date, we have developed a book retrieval robot and a valuation analysis framework for evaluating CAPM. We intend to augment CAPM by exploring approaches for automated page turning and improved valuation. These extensions will result in a more fully automated CAPM system and a valuation framework that will not only be useful for assessing CAPM specifically, but also for library services and functions generally.

Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval - Digital Libraries.

General Terms
Measurement, Design, Economics, Experimentation

Keywords
Information economics, evaluation methods, browsing, digital conversion, digital preservation, robotics, paper manipulation.

1. INTRODUCTION

Libraries face both opportunities and challenges with the development of digital libraries. While some digital library initiatives focus on materials that are "born digitally," most libraries continue to acquire materials in print format and must consider methods to manage their existing, substantial print collections. This approach of developing digital library capabilities while also managing large print collections has led to space pressures. Given the relatively high costs of building new facilities at central campus locations, many libraries have either built, or considered plans to build, off-site shelving facilities to accommodate their growing print collections.
generate keywords through natural language processing of full text. Additionally, the CAPM system will generate tables of While moving materials to off-site locations mitigates space contents. The combination of full text, tables of contents and pressures, patrons lose the ability to browse these materials in keywords will be cross-referenced within a database to identify "real-time." Given the relative ease of accessing electronic items with similar content.This capability, combined with resources, itis possible that patrons are less likely to access traditional metadata, will emulate the serendipitous discovery of open-stack browsing. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are While these benefits are related to access, CAPM could also not made or distributed for profit or commercial advantage and that provide preservation-based benefits. Preservation copies will be copies bear this notice and the full citation on the first page. To copy created by a batch-scanning mode when patrons do not require the otherwise, or republish, to post on servers or to redistribute to lists, system. Because of the enhanced access offered by CAPM, requires prior specific permission and/or a fee. libraries and patrons might be more amenable to the transfer of JCDL '01 , June 24-28, 2001, Roanoke, Virginia, USA. Copyright 2001 ACM 1-58113-345-6/01/0006...$5.00.
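[Editorial illustration: the keyword-generation step described above can be approximated by a simple frequency ranking over the OCR'd text, as in the sketch below. The stoplist, scoring and function names are invented for illustration; the paper does not describe the actual Johns Hopkins pipeline, which would use far richer natural language processing.]

    # Illustrative keyword generation from OCR'd full text. The stoplist
    # and frequency scoring are simplified stand-ins for a real NLP
    # pipeline; nothing here reflects the actual CAPM implementation.
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
                 "that", "for", "with", "as", "on", "by", "are", "will"}

    def extract_keywords(ocr_text, top_n=10):
        # Tokenise on letters only, so stray OCR punctuation drops out.
        words = re.findall(r"[a-z]+", ocr_text.lower())
        counts = Counter(w for w in words
                         if w not in STOPWORDS and len(w) > 3)
        return [word for word, _ in counts.most_common(top_n)]

    print(extract_keywords("Libraries face space pressures; off-site "
                           "shelving facilities ease space pressures."))
    # ['space', 'pressures', 'libraries', 'face', ...]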

3. BENEFITS VALUATION
The benefits outlined above need to be properly delineated and evaluated before libraries will adopt CAPM. Consequently, in addition to the engineering development of CAPM, there has been a concurrent valuation analysis. The analysis has two emphases: an engineering cost analysis and a valuation of potential benefits for patrons. The cost analysis has focused on the following cost categories: labor, increased electricity usage, maintenance of equipment, book containers, loss of storage capacity associated with book containers, and equipment and setup or implementation costs. The costs were calculated as a levelized average cost per use over a ten-year period. Preliminary results indicate an average cost per use of between $3.50 and $24.52, depending on level of use. These costs compare favorably to costs associated with interlibrary lending services [3].

These potential costs will be compared to the potential benefits, as estimated through a contingent valuation methodology (CVM). CVM represents a "stated-choice" technique where individuals express a willingness-to-pay for benefits or a willingness-to-accept compensation for costs, using dollar values. CVM has been used widely for evaluation of non-market goods, especially in the environmental field. A previous application of CVM in the library environment provides evidence of its utility in decisions regarding resource allocations for libraries [2]. For the CAPM project, patrons were queried regarding hypothetical choices (related to potential features of CAPM) through an online survey. The survey design was developed with the help of a workshop that convened economists and librarians, and was pre-tested with a small sample group. The usefulness of these survey results and of the overall valuation analysis will be assessed during a workshop in May 2001.
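[Editorial illustration: for readers unfamiliar with levelized costing, the sketch below shows one common way such a per-use figure can be computed, by discounting each year's costs and uses to present value and dividing. All inputs (setup cost, operating cost, usage levels, discount rate) are invented; the paper reports only the resulting $3.50-$24.52 range, not its inputs or exact method.]

    # Illustrative levelized average cost per use over a ten-year horizon.
    # Every figure below is invented; this is not the CAPM cost model.

    def levelized_cost_per_use(setup_cost, annual_operating_cost,
                               annual_uses, years=10, discount_rate=0.05):
        # Discount each year's operating cost and usage to present value,
        # then spread total present-value cost over discounted total uses.
        pv_costs = float(setup_cost)
        pv_uses = 0.0
        for year in range(1, years + 1):
            factor = (1 + discount_rate) ** -year
            pv_costs += annual_operating_cost * factor
            pv_uses += annual_uses * factor
        return pv_costs / pv_uses

    # Heavier use spreads the fixed setup cost over more transactions:
    print(round(levelized_cost_per_use(200_000, 30_000, 50_000), 2))  # low cost/use
    print(round(levelized_cost_per_use(200_000, 30_000, 5_000), 2))   # high cost/use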
4. FUTURE WORK
Both the engineering and valuation activities of CAPM will be extended. We will test the retriever robot at Moravia Park and explore the development of page-turning devices. Page-turning devices do exist, but they are used in controlled settings, such as for patients undergoing rehabilitation [1, 4]. We will investigate and build prototype page-turning devices that will accommodate a range of paper types.

The valuation analysis will be augmented in two ways. We will conduct a post-survey investigation to assess the effectiveness of online surveys. Additionally, there is some evidence that library patrons have difficulty with dollar values for library services [5]. Consequently, we will explore the use of multi-criteria decision-making (MCDM) techniques to evaluate CAPM. The MCDM techniques will involve assessing patrons' and library administrators' reactions to questions involving tradeoffs of service attributes (such as time) and comparing these reactions to questions involving dollar values. This exploration will provide a more comprehensive evaluation of CAPM and a generalized framework for evaluating library services, both from the perspective of patrons and of library administrators.

5. CONCLUSIONS
The CAPM Project represents an innovative response to fundamental issues related to digital library development. The system will provide an automated capability for on-demand and batch scanning of materials in off-site locations. Additionally, the valuation analysis will result in a rigorous assessment of CAPM and a general framework for the evaluation of library services.

Other libraries could adopt CAPM by implementing a CAPM system in their facilities, by shelving their materials in a facility with CAPM, or by accessing CAPM-scanned materials through the Web. Even at this intermediate stage of development, libraries within the United States, Europe and Asia have expressed interest in CAPM, out of both research interest and potential for implementation.

6. ACKNOWLEDGMENTS
The CAPM Project has been supported by grants from the Council on Library and Information Resources and the Mellon Foundation. The Minolta Corporation has provided scanners for the project. An individual donor has also provided financial support.

7. REFERENCES
[1] Danielsson, C., and Holmberg, L. Evaluation of RAID workstation. In Proceedings of the RESNA '94 Annual Conference (Nashville, TN, June 1994), 451-453.
[2] Harless, D.W., and Allen, F.R. Using the contingent valuation method to measure patron benefits of reference desk service in an academic library. College and Research Libraries, 60 (1), 59-69.
[3] Jackson, M.E. Measuring the Performance of Interlibrary Loan Operations in North American Research & College Libraries. Association of Research Libraries, 1998.
[4] Neveryd, H., Bolmsjö, G., and Eftring, H. Robotics in Rehabilitation. IEEE Transactions on Rehabilitation Engineering, 3 (1), 77-83.
[5] Saracevic, T., and Kantor, P.B. Studying the value of library and information services. Part I. Establishing a theoretical framework. Journal of the American Society for Information Science, 48 (6), 543-563.

(1) Digital Knowledge Center, MSE Library, Johns Hopkins University
(2) Department of Geography and Environmental Engineering, Johns Hopkins University
(3) Department of Mechanical Engineering, Johns Hopkins University

Technology and Values: Lessons from Central and Eastern Europe

Nadia Caidi
Faculty of Information Studies, University of Toronto
Toronto, Ontario M5S 3G6 (Canada)
(416) 978 4664
[email protected]

ABSTRACT
Technology does not develop independently of its social context. Rather, there is a range of social, cultural and economic factors (in addition to technical factors) that define the parameters for the development and use of technologies. This paper presents a case study of the social shaping of one aspect of digital libraries, the development of national union catalogs (NUC), in four countries of Central and Eastern Europe (CEE). It examines the specific choices and values that are embedded in the design of a NUC, and how these might be transferred to other cultural contexts.

Categories and Subject Descriptors
[Social Issues/Implications]: Social Informatics. [Design-Implementation/Evaluation]: Case Studies, DL development. [Policy Issues]: Transborder/International Issues.

General Terms
Design.

Keywords
Information Infrastructure. Central and Eastern Europe. National Union Catalogs. Social Shaping of Technology.

1. INTRODUCTION
The study of digital libraries (DLs) in Central and Eastern Europe makes a fascinating case study of a socio-technical system in the making. This paper builds on prior work on DLs in the CEE region ([1], [2]), and constitutes a preliminary report of a large research project on the development of information infrastructures and DLs in four CEE countries (Czech Republic, Slovakia, Hungary and Poland) [3]. Developing DLs (understood here in the wider sense, as a mixture of content, technology and people) is essentially a collaborative activity in which various actors pool resources and efforts to devise solutions to a specific problem. In the four CEE countries studied, the most interesting debates around digital libraries revolved around the development of national union catalogs (NUC) as a means of creating a shared cataloging system and making each country's bibliographic records available online. The NUC emerges as a result of the "work" of various coalitions (libraries, system vendors, funding agencies, policy makers).

Here, we report results from a series of interviews conducted in four CEE countries with key players in the development of NUC projects. The outcomes of these projects show mixed results, primarily because of the different approaches being used and the various assumptions embedded in the designs of these systems. Our study shows that beyond the technical and design issues, there are socio-cognitive and socio-political dynamics that contribute to (or hamper) the development of an NUC.

2. METHOD
The choice of these four countries was motivated by their similar level of economic development, the advanced state of reforms undertaken, their political stability, and the level of support they receive from Western countries. The focus of the study was on the major academic and research libraries (including the national libraries and the libraries of the national academies). These libraries have shown leadership in adopting information technologies and a deep concern with information access. As such, they play important roles in policy decisions regarding library services in their countries.

Most of the data collected were in the form of in-depth interviews, in which a broad agenda of research questions was presented to respondents about their visions and experiences with DLs, NUCs, and the information infrastructure of their country. Face-to-face interviews were conducted between March and June 1999 with 49 library leaders and policy-makers in 37 institutions (10 in the Czech Republic, 10 in Hungary, 13 in Poland and 4 in Slovakia). A dozen more informal discussions with other respondents, library vendors and project managers were also undertaken.

3. FINDINGS
These CEE countries are faced with the challenges of integration into a global economy and of domestic social and economic transformation from the socialist system. Similarly, the story of CEE libraries is one of institutions striving to survive the social transformations of their countries, while adjusting their organizational structure to the new reality and needs of society and of their user community. The data collected showed evidence both of the presence of common frames (e.g., a set of shared assumptions and beliefs) and of power struggles between various social actors with conflicting interests and agendas (e.g., between types of libraries; between libraries and state agencies; between libraries and the private sector).

3.1 Boundaries and Negotiation
Transition involves the building of new institutions and the creation of new linkages between organizations, which is an eminently political process. The power struggle that ensues was evident in the ways in which most respondents framed their world in terms of binary oppositions (e.g., 'Us' vs. 'Them,' 'insiders' vs. 'outsiders,' East vs. West) and of what divides rather than what unites libraries. Similarly, the respondents' rhetoric around the development of an NUC revolved around the obstacles and barriers to its development (e.g., lack of funding and leadership, poor coordination between governing agencies, conflict between libraries, rigid management and structure).

Power is also expressed in the ways in which meanings and definitions are assigned to a technological artifact. For instance, the various conceptions of the nature of 'information' shape the social construction of both digital libraries and the NUC. In the countries studied, 'information' was often viewed as a resource and a public good. The view of information as a commodity that could be sold, bought or exchanged was not prevalent among these respondents, who often resented the economic aspects associated with it. Most important to them was the need to create new organizational relationships in this transitional environment.

3.2 Cooperation and Resource Sharing
Establishing a climate of cooperation appeared throughout the data as a prerequisite for any further attempts at developing DLs and NUCs. Means to enable cooperation include shared cataloging cooperatives (e.g., networks, consortia) and the adoption of common standards. The earlier initiatives were mostly fostered by the academic and research library communities, with the help of western library-oriented philanthropic foundations. These foundations were instrumental in enabling and encouraging the transfer of technology. More importantly, these foundations (soon followed by governing agencies in each country) strongly encouraged libraries to form consortia. The financial incentives and the realization of the benefits of resource sharing led to a series of alliances such as the Krakow Library Project and NUKAT (Poland); CASLIN (Czech Republic and Slovakia); and HUSLONET and MOKKA (Hungary).

However, after decades of centralization that led to 'artificial' library networks organized by subject areas and/or types of libraries (and imposed on libraries by the ministries), respondents resented the 'forced' cooperation model newly imposed on them. Rather, they expressed more interest in experimenting in a grass-roots manner. The variety of consortia created denotes this trend: while a centralized model was opted for in Slovakia, respondents in the Czech Republic, Hungary and Poland favored a mixture of centralization and decentralization for the development of NUCs: tight or loose alliances (CASLIN), resource-sharing consortia (e.g., HUSLONET), or a highly distributed model (VTLS Group).

3.3 Global vs. Local Considerations
The rush toward adopting internationally recognized standards in the field of networking, technology, and library services was associated with the need to prepare for integration into the European Union, and to "catch up with the West." In the respondents' accounts, and in observing the societies, there seems to be an uncritical glorification of the Western way of life, economic and political arrangements, and technology. Western integrated library systems were often preferred to local systems developed in the country. Similarly, most standards were often borrowed from foreign sources (e.g., USMARC, the French RAMEAU, Library of Congress Subject Headings, etc.). The language of the West was often adopted and adapted: for instance, some respondents used such terms as "information have and have-nots," "Telematics," "intelligent cities," "virtual libraries," etc. It is not clear, however, whether the use of these terms was fully promoted nationally, or just geared toward westerners.

This interest in everything "Western" has to be contrasted with a growing sentiment of cultural identity and nationalism that has appeared throughout the CEE region. In the library field, this has led to decisions such as maintaining HUNMARC (the Hungarian version of MARC) over USMARC or UNIMARC in Hungary, as a means of keeping the distinctiveness of the country. Too often, the introduction of foreign cataloging and classification systems is viewed merely as a technology transfer, and the emphasis on cultural values is neglected. Reconciling national traditions with the universal mechanisms of globalization is key.

4. DISCUSSION AND IMPLICATIONS
Digital libraries are a complex arrangement of people, technology, institutions, and content. As such, they have to be understood as more than the conduits (pipes) or the content that flows through these pipes, to include elements of power and culture. DLs cannot simply be replicated. Rather, they must fit the unique situation of their environment. The CEE countries studied are at the important stage of positioning or redefining the roles of the various players. Rather than asking whether or not these countries have DLs, it may be more useful to ask when and who takes part in their development, and how this relates to the free flow of information and the advancement of democratic values in the country. These issues will be developed further in future articles.

5. ACKNOWLEDGMENTS
Thanks to the A.W. Mellon Foundation for funding this project.

6. REFERENCES
[1] Borgman, C.L. From Gutenberg to the global information infrastructure: Access to information in the networked world. Cambridge, MA; London: MIT Press, 2000.
[2] Lass, A., and Quandt, R. (eds.). Library automation in transitional societies. Oxford University Press, NY, 2000.
[3] Caidi, N. The information infrastructure as a discursive space: A case study of the library community in Central and Eastern Europe. PhD dissertation, Department of Information Studies, UCLA, 2001. (Adviser: C.L. Borgman)

Use of Multiple Digital Libraries: A Case Study

Ann Blandford, Middlesex University, Bounds Green Road, London N11 2NQ, U.K., +44 20 8362 6163, [email protected]
Hanna Stelmaszewska, Middlesex University, Bounds Green Road, London N11 2NQ, U.K., +44 20 8362 6905, [email protected]
Nick Bryan-Kinns, Icon MediaLab London, 1 Martha's Buildings, 180 Old Street, London EC1V 9BP, U.K., +44 20 7549 0206, [email protected]

ABSTRACT
The aim of the work reported here was to better understand the usability issues raised when digital libraries are used in a natural setting. The method used was a protocol analysis of users working on a task of their own choosing to retrieve documents from publicly available digital libraries. Various classes of usability difficulties were found. Here, we focus on use in context, that is, usability concerns that arise from the fact that libraries are accessed in particular ways, under technically and organisationally imposed constraints, and that use of any particular resource is discretionary. The concepts from an Interaction Framework, which provides support for reasoning about patterns of interaction between users and systems, are applied to understand interaction issues.

Keywords
Digital Libraries, video protocols, interaction modelling, HCI.

INTRODUCTION
Digital libraries are moving from research and development into commercial use. If they are to realise their full potential, however, the experiences of end users need to be taken into account from the earliest stages of design. Those end users are typically individuals who have no particular skills in information retrieval, and are accessing library resources from their own desks, without support from a librarian.

These factors clearly have implications for design. In particular, libraries need to be "walk up and use" systems that are easily learned; while more experienced users may require powerful features that support them in performing focused searches, novice users need to get early results for minimal effort [5]. The study reported here investigated how comparative novices work with existing digital libraries, focusing particularly on the patterns of interaction and the difficulties they experienced. Many of these difficulties are not apparent when a library is tested in isolation. Whereas users of physical libraries have to move very deliberately from one library to another, often with a substantial time interval between the use of one and the use of the next, users of digital libraries can move almost seamlessly between them, sometimes without even noticing the transition. The focus of this study is on use of multiple digital libraries within a single session, and on user experiences of the interaction.

BACKGROUND
We briefly review related work on use in context and on types of use. These other studies provide the academic context for the study presented here. We then present an overview of methods used in previous studies, relating the methods to the kinds of findings those studies could establish. In particular, an understanding of the alternatives was used to guide the design of the study reported here.

1.1 Use in context
Several studies have looked at digital libraries in context, that is, not just the library itself, but also how it sits within a larger frame of use. One example of such studies is Bishop's [3] consideration of the use of digital libraries by people from different social and economic backgrounds. Her studies indicate that people from different backgrounds (low-income and academic) can easily be put off using digital libraries: small problems tend to be magnified until they deter potential users, and lack of awareness of library coverage often prevents users from understanding what they could get out of the libraries. As libraries are becoming increasingly available for general use, the finding that people are easily deterred has to be taken seriously; our results, as presented below, highlight some of the deterrent factors, including poor reliability, inadequate feedback and the time taken to familiarise oneself with a new library.

Covi and Kling [7] investigated patterns of use of digital libraries by different groups of users, and how they vary across academic fields and universities. They focussed on interviewing potential users and, moreover, were concerned primarily with university members, rather than considering the population at large. Their study led them to conclude that the development of effective (useful and used) digital libraries needs to take account of the important roles played by other people within the broader system of use (notably colleagues and librarians), and that the views of end users, as well as those of librarians and computer specialists, need to be understood for effective design. Whereas Covi and Kling were concerned with how library use fitted in with overall working patterns, the study reported here looks at how a particular interaction between a user and digital libraries evolves, i.e. the patterns of interaction, rather than the patterns of use. As shown below, the decisions taken by computer scientists and librarians have substantial impact on the user experience, in ways that are hard to anticipate.

1.2 Types of work with digital libraries
One approach to understanding users' views is to analyse the kinds of work people perform with library resources. In this section we summarise studies of the kinds of things people get up to, or might get up to, in a digital library.

Searching for interesting articles is often the first thing that comes to mind when considering "digital libraries". However, as is being increasingly recognised, searching is not just a case of entering a search term and viewing a list of results. Furnas and Rauch [9] found that in searching for information a "one-shot query" is very rare. More typical is an extended and iterative search which involves opportunism; that is, the searching evolves over a period of time and relies on users being able to follow new (interesting) paths as they appear, which may not necessarily have been specified at the start of their search.

These notions of extended searching are supported by research carried out in conventional libraries [14]. As with digital library studies, O'Day & Jeffries found that one-shot searches were rare. Rather, they found that single searches evolved into other kinds of searching, which they identified as: monitoring a topic over time; following an "information-gathering plan"; and exploring a topic in an undirected way. The findings of the current study are consistent with those of others, but illustrate that many searches are unsuccessful.

People do not just search for items in digital libraries, but also browse for them. Jones et al. [12] characterise this distinction as follows:
1. Browsing: users traverse information structures to identify required information.
2. Searching: users specify terms of interest, and information matching those terms is returned by an indexing and retrieval system. Users may, in turn, browse these results in an iterative manner as discussed above.

Gutwin et al. [11] discuss browsing in digital libraries, but tend to focus on how the user interfaces they develop can support browsing, rather than considering what browsing is. However, in their discussion of browsing support they do categorise the purposes of browsing as follows:
1. Collection evaluation: What's in this collection? Is it relevant to my objectives?
2. Subject exploration: How well does this collection cover area X?
3. Query exploration: What kind of queries will succeed in area X? How can I access this collection?

These can all be considered aspects of familiarisation: with the type of information in collections and with how the library works. In the study reported here, many of the purposes associated by Gutwin et al. with browsing arise in situations where there is no significant "browsing" activity, i.e. no substantial traversal of information structures. The purposes of an interaction and the types of behaviour (e.g. searching vs. browsing) do not appear to be closely linked. Nevertheless, the identification of these purposes is helpful: in the discussion below, these kinds of purposes are discussed under the heading of "familiarisation".

Once a document has been located a user typically does some work with it (not necessarily immediately afterwards); otherwise, there would be no point in finding it in the first place. Amongst the activities associated with working with documents are reading and annotating them. One important finding about reading activity [1, 2, 13] is that people do not simply read articles from beginning to end, but rather move between levels of information, for example, from authors and titles to reading the conclusion. Current digital libraries available via the web are, for various reasons, limited in terms of the kinds of reading they support. Even with such technical restrictions, there were substantial variations between subjects in this study as regards the reading processes they adopted.

As well as reading, people also create, update, and annotate documents. O'Hara et al. [15] focussed on such writing activities in their studies of PhD students' use of libraries. They found that reading and writing were inextricably intertwined. Existing web-based library interfaces do not support any writing activities, so subjects in this study saved and printed relevant articles for future organisation and annotation.

1.3 Techniques Used in Studies
Many different techniques have been used to study people's use of digital libraries. This section outlines the range of techniques used and their applicability. An understanding of these past studies and their scope was used to design the study described here.

Adler et al. [1] and O'Hara et al. [15] asked people to keep daily notes of their document activities, and followed up these descriptions with structured interviews. Such a technique has the advantage of low effort on the part of the analysts. However, a large load is placed on the participants, as they have to keep notes of their activities. Moreover, as individuals keep their own diaries, there will be differences between their note-taking styles which may be significant and cause problems in generalising results. Nevertheless, such an approach helps us to understand what people do and why they do it, which can give useful input to design or some qualitative evaluation results.

Questionnaires, on-line form filling, or registration documents can provide simple feedback. Bishop [3] used registration documents to build up an understanding of the different backgrounds of digital library users. In contrast, she used surveys to find out information about users' use of the system after a period of time. Advantages of using such techniques include the ability to get a large response, but the information returned is often shallow: typically just simple answers to the questions asked, with no explanation of the rationale behind the answers. Theng et al.'s work [17] contrasts these approaches by using extensive questionnaires with a small group of users after they have completed tasks with digital libraries. These questionnaires elicit users' perceptions of the digital libraries used but, again, they provide no means of assessing why users felt as they did.

Several studies, e.g. [2, 3, 13], have employed transaction logs to gain an understanding of the activities users were engaging in with digital libraries. These logs give quantitative accounts of user actions and so can be used to make statements such as "(about 5%) took advantage of the ability to search for terms in individual components of articles" [2]. However, such logs do not provide an understanding of why users use particular features of systems. Understanding why things have happened is typically tackled by interview and possibly diary studies.
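[Editorial illustration: the kind of statement that transaction logs support, such as the "about 5%" figure quoted from [2], reduces to straightforward counting over logged events. The sketch below is an invented example; the one-action-per-line log format and the action names are assumptions, not the format used in any of the studies cited.]

    # Illustrative transaction-log counting: what share of users ever
    # used a given feature (e.g. searching within article components)?
    # The tab-separated "user<TAB>action" format is an invented example.
    from collections import defaultdict

    def feature_usage_share(log_lines, feature="component_search"):
        actions_by_user = defaultdict(set)
        for line in log_lines:
            user, action = line.rstrip("\n").split("\t")
            actions_by_user[user].add(action)
        if not actions_by_user:
            return 0.0
        used = sum(1 for acts in actions_by_user.values() if feature in acts)
        return used / len(actions_by_user)

    log = ["u1\tquery", "u1\tcomponent_search", "u2\tquery",
           "u3\tquery", "u3\tbrowse"]
    print(f"{feature_usage_share(log):.0%} of users used component search")

As the paragraph above notes, such counts say nothing about why users did or did not use a feature; that is what interviews and think-aloud protocols are for.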

Several studies have investigated how people use particular interfaces, and how their use differs between interfaces and between tasks. For example, Bishop [3] and Park [16] used an experimental design to compare the applicability of user interfaces for digital libraries. The use of experimental design gives statistically significant results, but such studies are costly to develop and run, and can only answer specific questions. In these cases, the aim was broadly to inform the design and redesign of a particular interface. Therefore, the user and a single interface could be considered as a "closed" system, independent of external influences. This contrasts with the present study, for which use in context is the primary concern.

Bishop [2] used three focus groups to elicit understandings of how faculty members used journal articles, and what requirements such use placed on the design of digital libraries. Following on from that, Bishop [3] used focus groups to help understand the different socio-economic backgrounds of digital library users. Although focus groups are useful for gaining an overview of the issues and problems, they tend to produce information which is often sketchy and in outline form.

Bishop [2, 3] used interviews to follow up on topics that were raised in her focus groups. This approach allows the analysts to develop a fuller understanding of the issues raised in the focus group, but interviews are time consuming and, again, produce qualitative results which feed into design. Other studies, such as those conducted by Covi and Kling [7], relied solely on interviews to assess people's perceptions and use of digital libraries. Approaches such as those employed by Furnas and Rauch [9] involved a combination of techniques, as they used interviews to inform further observation of people using digital libraries. In contrast, Marshall et al. [13] used interviews to follow up other techniques such as examination of transaction logs; their interviews allowed analysts to probe why users were performing certain patterns of interaction identified from the logs.

Observing what people do as they use systems is a time consuming activity. However, it can provide useful insights into the usability of systems. Bishop [3] discusses her use of observation to gather information on engineering work and learning activities. This information can then be used to inform (re)design, and/or followed up in other ways such as interviews, as illustrated by Furnas and Rauch [9].

From the review of the studies presented here, it is clear that although there has been work on studying the usability of specific user interfaces for searching, and to a lesser extent browsing, in digital libraries, there has been little work on understanding the nature of these tasks or how libraries are used in a natural setting.

METHOD
The study reported here aimed to achieve a better understanding of how users interact with digital libraries within a single session, but not necessarily using a single library.

Because we wished to gather detailed interaction data, techniques such as diary-keeping, interviews, transaction logs and focus groups were inappropriate. A video-based observational study with think-aloud commentary was selected as the most appropriate means of gathering data, with a short debriefing interview a few days later to clarify any issues raised by the video data.

Five users were recruited for the study. Three of these were first-year PhD students (referred to below as "A", "B" and "C"), one a final-year PhD student ("D") and one an experienced academic ("E"), all computer scientists. A-D were recruited as subjects specifically for this study; E, aware that this study was being conducted, offered to participate while performing a self-defined library searching task.

The aim was not to give users artificial tasks, which are liable to be either too precisely defined to be natural or too meaningless for participants, but to ask participants to select their own tasks to work on. Therefore the task defined for participants A-D was simply to obtain at least one paper on their own research topic to help with their literature review, using their choice of libraries from a given set (easily accessed via bookmarks in a web browser). They were asked to think aloud while working. They were provided with a little information about each library, as shown in Table 1. Access rights for each library were defined by the subscription held by the organisation in which they are based. Note, in particular, the restriction on ACM access, a cause of difficulties as discussed below.

Table 1: bookmarked libraries for users A-D.
  ACM Digital Library (www.acm.org/dl/): full text access only to journals and magazines (not conference proceedings)
  IDEAL (www.idealibrary.com): access only to articles prior to 1998
  NZDL (www.nzdl.org): full text articles
  EBSCO (www-uk.ebsco.com): full text articles
  Emerald (www.emerald-library.com): full text articles
  Ingenta (www.ingenta.com): full text articles

As noted above, E was not recruited in the same way, but offered to participate. She was planning to search for articles on particular topics to help with writing academic papers, and volunteered to do this with a video camera running, and to "think aloud" while working on her self-defined task. Consequently, she used digital libraries of her own choosing, and did not have explicit information about limitations on access. She held a personal subscription to ACM, so she was not subject to the same restrictions as the other users.

Users A, B, C, D and E worked with the digital libraries for 57, 62, 62, 51 and 80 minutes respectively. The video data was then transcribed, including speech and some description of the interaction between user and computer system. It was analysed using the Interaction Framework [4]. Extracts from these transcripts are used in the following sections as source materials for examples.

1.4 Interaction Framework: overview
The Interaction Framework is an approach to describing actual or possible interactions between agents (users and computer systems) in terms of the communicative events that take place between those agents, and the patterns of interaction. Using this neutral language, which aims to take neither a user- nor a computer-centred view of the interactive system, we can identify and discuss properties of the interaction that might or might not be desirable. For example:

Blind alleys are interactions such that the objective is unachievable, but where that fact does not become apparent until some way into the interaction.

Discriminable events are ones that another agent can easily choose between.

A canonical interaction is one that achieves all its objectives as efficiently as is theoretically possible. In the case of digital libraries, such objectives will include accessing particular papers, accessing papers on a particular topic or general subject area, and gaining familiarisation with a collection: its content, type or features.

Central to any successful interaction is the idea that users must have an adequate understanding of the state of the system. Put in neutral terms, the user and system must share "common ground" [6]; that is, each interactional event has to communicate sufficient information to enable the agents to maintain common ground.
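[Editorial illustration: to make the blind-alley notion concrete, the sketch below represents an interaction as a sequence of communicative events. The event vocabulary and the code are invented for illustration; this is not the notation or formalism of the Interaction Framework [4].]

    # Illustrative only: an interaction as a sequence of communicative
    # events between user and system. A blind alley is flagged when the
    # objective remains unachieved even though the user is several
    # events into the interaction. Not the notation of [4].
    from dataclasses import dataclass

    @dataclass
    class Event:
        agent: str            # "user" or "system"
        action: str
        achieves_objective: bool = False

    def is_blind_alley(events, min_depth=3):
        # Unachievable objective that only becomes apparent some way in.
        return (len(events) >= min_depth
                and not any(e.achieves_objective for e in events))

    session = [
        Event("user", "restrict search to 'journals only'"),
        Event("system", "return results (conference papers included)"),
        Event("user", "request full text of a conference paper"),
        Event("system", "authorisation failure"),
    ]
    print(is_blind_alley(session))  # True: four events in, objective unmet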

SUMMARIES OF INTERACTIONS
Before presenting detailed results summarising the important difficulties found, we present a brief overview of the interaction of each participant with the available digital libraries.

1.5 User A
User A was interested in papers on electronic commerce. As she started working, she spent a while browsing unrelated material, as if orienting herself to browsing, before selecting a link from the bookmark list. She selected the ACM digital library, and spent some time reading through the introductory page. She then searched for "the best of electronic commerce", but seemed rather confused by the results returned. She selected an alternative search mechanism and repeated the search. She found various articles that were "interesting" but did not print them. Although she limited her search in ACM to "journals only", conference articles were listed among the search results; when she tried to download one of these articles, she was asked to enter a user name and password, which she did not have, resulting in an authorisation failure.

She moved to the Computer Science Technical Reports link in NZDL, and got over a thousand hits on her search. She reformulated her search several times, and still too many items were returned. Eventually, she found an article she wished to view and download, but failed to download it, for reasons discussed in section 5.1 below. She then moved on to EBSCO. When her search results were returned, she commented that "this is more readable than from other libraries". She successfully found, saved and printed one article.

1.6 User B
User B was looking for material on knowledge management, text mining and link analysis. Although he had used other browsers, he did not have previous experience of using Netscape, so he started by browsing Netscape pages before connecting to the ACM digital library via the bookmarks. Before specifying the search terms, he spent some time "trying to understand how to make the search". When searching the ACM library, he did not restrict his search to journals only, and consequently received many hits that were conference proceedings; like user A, he got an authorisation failure when he tried to print any of these.

Although new to Netscape, user B was a relatively sophisticated user of information retrieval systems. For example, he understood how to use quotation marks selectively in search queries, and also resorted quickly to using two browser windows so that he could continue working in one window while a document was downloading in the other. Using this strategy, he explored the EBSCO library, then NZDL, then Ingenta, continually flicking from one window to the other as pages were loading. One apparent consequence of this is that his behaviour was more reactive than that of the other users: he appeared not to form clear beliefs about the state of the system, but simply to respond to whatever was currently displayed, and the interaction appeared relatively unstructured and haphazard.

1.7 User C
User C was searching for material on Growing Cell Structures (GCS), text classification and Self-Organising Maps (SOM). He started by accessing Emerald, but rapidly switched to Ingenta when his first few query formulations returned no results. Results from Ingenta were more promising; in fact, there were so many hits he appeared to be overwhelmed: "I found over three hundred documents here". Several times, he selected an article with a promising title and followed the link "full text at ScienceDirect", but was then refused access. [The user organisation had a subscription to Ingenta journals but not to Science Direct.] He saved several abstracts and printed one full text article from Ingenta. He accessed the IDEAL and ACM libraries twice each, but each time judged them "too slow". In EBSCO, he failed to find any matches to his search terms. He returned to the Emerald library, and reformulated his query: "I think I made some mistakes last time. I searched for GCS as the keyword so this time I searched for GCS for full text and I found something". Later on, his search took him to a link "order the book", which took him from the library to an internet bookseller. He conducted further searches, using Emerald, Ingenta and NZDL, usually receiving either no matches or too many to deal with. In NZDL, he found an interesting article, but failed to download it. Although he could read the article on screen, he tried three times to download it, without success. By the end of his interaction, he had found some relevant material on text classification, but none on GCS or SOM.

1.8 User D
User D looked for different things in different libraries. He started with ACM, searching for articles on usability evaluation; a large number of results were returned. He tried to save a selected article to a "binder", but received an "authorisation failure". He moved on to use NZDL, now searching for articles on "musical timbre"; although he found one that he skim-read on screen, he moved on quickly to search for "timbre perception", which returned results that he judged "more or less similar to my previous query". He opened second and third windows to access EBSCO and Emerald, but never switched between windows (they seemed just to be a historical record of where he had got to in a particular library). He browsed and submitted search queries in a relatively unstructured way, apparently trying to familiarise himself with the content and structure of the library. When he tried to follow links to particular journals, an error message, "Type mismatch", was displayed. Subsequent tests indicate that this was a temporary error, but it had a strong influence on this particular interaction.

He tried to access Ingenta, but received an error message: Ingenta was unavailable. He moved on to Emerald, about which he commented: "It's actually good to have some basic description of the journals. It's not my area." After a quick look, he returned to ACM to search for articles on visual texture; on selecting a "Find related articles" link, he found something quite different that interested him, a case of serendipity in the search. He viewed the abstract and tried to download the paper, but got an "authorisation failure": he "wasn't paying attention" when earlier told that he could download journal papers but not conference proceedings. He went on to successfully identify and download a relevant journal article.

1.9 User E
User E required information on diary systems, cognitive modelling and the usability of Artificial Intelligence systems. In meeting these objectives she used three libraries: ACM, IDEAL and the New Zealand Digital Library (NZDL). These were used in a relatively orderly sequence of ACM, followed by IDEAL, and finally the NZDL. The search objectives were repeated to some extent with each information source, as opposed to meeting each objective in sequence.

In detail, she spent most of her time using the ACM DL. Any articles she found that seemed relevant were printed for further review. Her first few searches were unsuccessful, yielding "no matches". A search for information on diaries and calendars was more fruitful. She then switched to browsing various collections, with mixed success. When she found an article that appeared interesting, she always printed it out, rather than trying to read it on screen. Overall, this user's interaction was the most fruitful, and by the end she had printed about a dozen articles.

RESULTS
Clearly, each of the five users in this study behaved in quite different ways when navigating the libraries, and the sample size is far too small to make any general claims about usage patterns or typical behaviours; this was not the purpose of the study. Similarly, the purpose has not been to conduct usability evaluations of particular libraries, or to pit them against each other. Rather, the aim has been to identify core usability issues that arise through the details of user interaction with digital libraries, where the users are not assumed to be information retrieval experts, and where use of any particular library is discretionary.

We discuss the main usability issues raised within this study under two headings: deterrent factors (and encouraging ones) that may influence future acceptance of digital libraries; and usability issues that arise as a direct consequence of libraries not being "closed", i.e. that the interface between a library and other resources is easily traversed.

1.10 Deterrent factors for use in context
The users in this study all experienced substantial difficulties with using one or more of the libraries they accessed. Many of these difficulties resulted in blind alley interactions, that is, interactions that did not achieve the user's objectives. Some of these blind alley interactions took place over several minutes. For example, user A spent over six minutes trying to download an article from one library before giving up, commenting that "I've tried all, but I can't download". Some of her sources of difficulty are illustrated in Figure 1, where the warning message reads: "Expanding the text here will generate a large amount of data for your browser to display".

[Figure 1: difficulties with downloading in NZDL. The original screenshot does not reproduce; its callouts note that selecting postscript generated an ftp error, and that the warning message deterred the user from downloading the whole document.]

Most users suffered from a "triumph of hope over experience", repeating the same unsuccessful action several times without modification, and each time receiving the same result. For example, user B tried eight times to download the full text of an article from Science Direct, while user C did the same six times. User C also tried to download the same postscript file three times, each time receiving the error message "Netscape is unable to find the file or directory named /pub/techreports/1993/tr-93-025psZ. Click the file name and try again". Similarly, user D tried three times to access the Ingenta library, each time receiving the error message "TfeLogin: Data decoding / encoding error". It is too early to follow up these users to find out whether or not their experience working with these libraries on this occasion has deterred them from any future use, but the productivity of all participants in the study was low (one document successfully retrieved after an hour of use for all except user E). Even the most successful user, E, commented at the end of the interaction that:

...I don't think I could cope with doing anything more. I don't feel that I've really found much, apart from on calendars, which I could focus. I haven't found very much that was actually very helpful.

The downloading difficulty of user A can be understood in terms of poor "common ground" between user and system; that is, the user did not fully understand the state of the system, or the meaning of the warning message, and consequently could not find a way around her difficulty.

The errors experienced by users C and D can be understood in similar terms: the event communicating the fault to the user is insufficiently expressive for the user to comprehend it and respond appropriately (see Figure 2). Accepting that occasional errors such as network faults are unavoidable, the challenge to the digital library designer is to communicate the status of the system (which may include multiple servers) effectively to the user.

[Figure 2: users did not understand the difference between the two sources of full text. The original screenshot does not reproduce; its callouts contrast articles whose full text cannot be accessed (available only via Science Direct) with articles whose full text is accessible (via Ingenta).]

Gould [10] discusses the importance of reliability when considering usability: a computer system that cannot be relied on to work predictably well is difficult to learn and use. Digital libraries accessed using standard web technologies are complex systems that are vulnerable to many kinds of failure. Some of the failures affecting basic accessibility of the library resources have already been discussed. Others have a more subtle effect on the interaction. For example, user A, aware that she could only access the full text of journal papers (not conference proceedings) from ACM, conducted a search restricted to journals only. The following transcript shows the user's words, her [> actions] and the [< system responses]:

[> selects "journals only"]
[> clicks on the "search" button]
[< "search result" page appears]
...I found 23 results... money in electronic... not that interesting. Electronic markets and intelligent systems... This is interesting...
[...]
...Building bridges with practice... this
[> clicks on the document link]
[< article abstract page appears]
...this is very close to what I am interested in
[> scrolls down the page]
[< bottom of article abstract page displayed]
[> clicks on the "full text" button]
[< the pop-up window "User name and password" appears]
Ow yaa. You have to get a username and the password because I haven't registered

Her confusion lasted some time longer. What appears to have happened here is that there was a temporary bug in the ACM library, or in the means by which the user's search specification was transmitted to the search engine, such that the search was not restricted as specified. Consequently, many of the results returned were articles from conference proceedings, but the user did not check this, believing that she had restricted her search to eliminate such results. A few days later, when we tried to reproduce the interaction as recorded on videotape, the problem had been corrected. Accepting that such errors are not always avoidable, it should be possible, as discussed above, to give more focused feedback, so that the user can understand and respond to the error in an informed way. In this particular case, the user was restricting her search because she was aware that she had authorisation (based on her host organisation's subscription) to access the full text of journal articles, but not conference papers. Arguably, the system also has access to this information, and might have been designed so that the user's access rights were clearly indicated before she tried to download an article.

As well as failures, the user's choice of libraries is clearly affected by the quality of their experience with a library, and also by their sense of familiarity with that library. Within the study reported here, the three dominant factors that determined perceived quality were: a sense of making progress (whether in finding relevant articles or in understanding a library better); a sense of the task being manageable (in particular, not an overwhelming number of alternatives with poor discriminability); and system response time being acceptable. We consider each of these aspects in turn.

1.10.1 A sense of progress
All participants had the experience of issuing search requests that returned no matches. Particularly for user C, this became very frustrating. E.g.:

...I searched this author in several digital libraries and always cannot get anything ... this professor has published several papers in computer journals.

And:

[> clicks on "search" button]
[< ACM DL search result page]
...no matches... I haven't found any matches... Perhaps I shouldn't put trans... publications...
[> clicks on the browser "back" button]
[< ACM DL search page]
...so I will search in "all journals and proceedings"... no, I will have a look at "journals only"...

Similarly, user E modified her search, but was still unsuccessful:

[< search results page replaces search formulation page]
No matches. Great.
[> clicks back]
[< search page replaces search results page]
So even on full text. Of course, it's possible, no. What happens if I turn off the human and try again?
[> removes human from subject search terms]
[< subject search is now "artificial intelligence" (note author search is still "Hollnagel")]
I would expect to get a fair amount.
[> clicks search]
[< search results replace search formulation page]
No matches. Ha ha ha.

In this case, the source of the problem was that this search was conducted some time after an earlier one that had specified a particular author, but the user had not noticed that there was an author entry in the search specification. From the user's perspective this was a new context, a new query; from the system's perspective, this was an elaborate query (with no matches). Improved feedback, helping the user to understand exactly what the search results referred to, might well ease such situations and improve the user's familiarity with the system.
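[Editorial illustration: the "improved feedback" suggested above could be as simple as echoing the complete effective query, including constraints left over from earlier searches, alongside the results. The sketch below is an invented example; the form-state dictionary is a stand-in for whatever state a real search page keeps.]

    # Illustrative feedback: report every non-empty constraint, so a
    # leftover author field (e.g. "Hollnagel") cannot silently narrow a
    # new search. The form-state representation is an assumption.
    def describe_effective_query(form_state):
        parts = [f"{field} = {value!r}"
                 for field, value in form_state.items() if value]
        return "Searching with: " + "; ".join(parts)

    state = {"subject": "artificial intelligence",
             "author": "Hollnagel",      # stale entry from a prior search
             "collection": "full text"}
    print(describe_effective_query(state))
    # Searching with: subject = 'artificial intelligence';
    # author = 'Hollnagel'; collection = 'full text'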

Conversely, users appeared to have a particularly strong sense of making progress when serendipity worked for them, that is, when they came across an interesting item that was unexpected in the current context. This happened twice to user E; e.g.:

[< ACM DL list of papers in most recent DIS conference]
Triangulation. That actually looks quite interesting.
[> clicks on paper's link]
[...]
So I'll have a look at that one.
[> clicks acrobat print button]

It also happened to user D:

I'm looking into visual texture related papers but I actually found something that doesn't seem to be related to that at all. It's about comparing two different methods of evaluation: empirical testing and walkthrough methods, which is also interesting to me so... I'll just jump to that from the search.

While serendipity is difficult to design for (by definition), it can be supported through discriminability: it is important that it is obvious to a user when such items come into view, that is, that the descriptions of items make their nature clear.

1.10.2 A sense of the task being manageable
While no matches is clearly not a desirable result, there has been little discussion of the effects of too many results. Within the Information Retrieval community, one focus of research, and one criterion by which the quality of a search engine is assessed, is the quality of the search results returned, as measured against suitable metrics. In practice, the users in our study appeared to place more store by quantity, or at least discriminability, than quality. For example, we have the following comments from users in response to various search results:

User A (having specified "some" of the search terms to be used in a search for "electronic commerce"):

Here I found one thousand seven hundred twenty two commerce and two thousand...
...OK, I will have to go back and say all
[> clicks on "back" button]

User C commented:

Sometimes the search finds too many articles so it is a little bit boring to read all these article titles.

More encouragingly, user E was pleased when fewer results were returned:

16 documents, that's not too bad.

As was user B:

...only seven, that's fine...

Repeatedly, users commented positively when the number of results was small (less than twenty) and negatively when it was large (typically over two hundred). We can understand this in terms of the discriminability of events and the number of alternatives: with a large number of results, it is generally not possible to discriminate between those possible future interactions that are likely to be successful and those that are blind alleys (relative to the user's objectives).

1.10.3 Response time: the pace of the interaction
Due to factors such as network bandwidth limitations, the geographical locations of servers, network loading, etc., the response times of libraries were very variable in this study. Dix [8] discusses the importance of the pace of the interaction being appropriate to the task properties. Two of the users in this study (B and E) responded to the pace being too slow by opening a second browser window and interleaving interaction with the two windows. (D also opened multiple windows, but only used them sequentially.) Both B and E commented on download time. For example, user E noted:

We're going to work much more efficiently if we have a new navigator window...
[> clicks browser file -> new navigator window]
[< new window (2nd window) appears with home page]
...and have that one running in the background.

While switching between windows, B commented:

...it's taking quite a long time to download [...] ...so I can continue working with ACM...

C also reacted to the response time of certain servers, but responded in a different way, by avoiding those libraries completely:

[> clicks on IDEAL link]
It seems to be very slow. So I will stop and try another.
[> clicks on browser "Stop" button]
[> selects browser "bookmarks" folder]
[< list of links displayed]
[> selects ACM DL from the bookmarks]

While the cognitive demands of multitasking in this environment are not excessive (because neither task is time-critical), the similarity of the two tasks is liable to cause interference. Both B and E appeared to lose track of the state of the second window at times.

1.10.4 Familiarity
As noted above, user E made her own selection of libraries; for the other four users, information was intentionally re-ordered, both on the bookmarks list and in the accompanying paper documentation, so that users who followed the list order of libraries would naturally start with different libraries. Nevertheless, four of the five users started by working with the ACM library. Two (D and E) stated explicitly that this was because they were familiar with that library. For example, D commented:

...I've used this library in the past so more or less I know that I can find relevant papers ...what I'm looking for.

B commented about a library that he had not used before:

...let's try this one: EBSCO on line. I don't know it but it's always good to know new things.
[...]
...what shall I do. ... I am trying to find out how should I operate this... favorite journals, let's try it.
[> clicks on "Favorite journals" link]
[< "Manage Favorites" pop-up window appears on top of the EBSCO page]
...I am trying to understand ...how to operate this...

These two examples illustrate familiarity (or the lack of it) with different aspects of library use. D is concerned with the type of content (and his expectation of being able to locate relevant papers), while B is concerned with how to work with the library system and the navigation features it offers.

Other aspects include familiarity with the content of collections or articles: their type (e.g. the kinds of papers that are published in a particular journal), structure and detailed content. For each aspect of familiarity, the design challenge is to support recognition (e.g. E, reading the title of a paper: "I happen to know what that work is") and incremental learning. For example, user E had difficulty understanding the "binders" feature in the ACM library:

[< ACM DL search results page]
I don't know what happens when I tick them.
[...]
I've clicked these things now and I haven't got a clue what I'm meant to do with them. Hmm. I don't know what a binder is...
[> clicks and holds on binder link]
[< browser pops up menu of possible actions]
...help...
[> releases mouse]
[< page is replaced with page containing list of articles selected]
...please choose a binder from your bookshelf.
I don't have a binder, and I don't have a bookshelf as far as I'm aware.

1.11.1 Access rights
As described above, user A had difficulty downloading papers from ACM because she believed she had limited her search to those items to which she had access rights, but others were listed in the search results. User B also tried to download conference papers, having ignored the details in the user instructions. Similarly, as described above, users had difficulty understanding the difference between papers that could be downloaded via Ingenta and papers that could not be downloaded because they were only available via Science Direct. A further potential difficulty of the same kind resides in the IDEAL library, where users have access rights only to journals from one publisher, but not others. These distinctions are poorly understood by users and inadequately explained within the libraries. Also, in the current dynamic environment, where new resources are coming on-stream rapidly and spending decisions (e.g. on subscriptions) need to be reviewed frequently, users have great difficulty keeping track of what resources are available to them. Particularly as users are often working in various locations, without easy access to information about permissions settings, that information needs to be made easily accessible at the point of need. Of the five libraries used in this study, one (IDEAL) has recently added a feature to give any user individualised information on their access rights, which is clearly a recognition of the difficulty and a first step towards addressing this particular problem, which is just one aspect of the broader challenge of enabling user and library system to establish common ground.

1.11.2 Working across boundaries
Within this study, there were many cases of users traversing boundaries between collections such as Ingenta and Science Direct, as discussed above, or between NZDL and various ftp
sites, as exemplified in user A's interaction: One of the substantive challenges in designing usable, useful digital library systems is enabling users to learn features CSTR browse result page displayed] incrementally, so that they are not overwhelmed at the outset [...1 byalargenumberofalternativeactions that are [.< top of the page displayed] indiscriminable (because the user has no way of predicting the ...let's try the first one effects of any actions). This would include distinguishing [> clicks on the first link] betweencoreanddiscretionaryfeatures(e.g.document [

186 218 [> clicks on "order the book"] [ amazon] [> clicks on "digital library" link] [< "amazon" page displayed] [< ACM digital library page] User A also accidentally left a librarytwiceas she searched Figure 3 illustrates the source of these interactional detours: for library resources. The first time was almost immediately the user has failed to discriminate between the type of links on after she had followed the ACM link from the bookmarks the top links bar (which are generic ACM links) and those on provided: the body of the page, which are digital library links. ...what's new. Maybe if this is on the top then it will be As well as boundaries with other web based resources, users more convenient for me. also have to work across boundaries with other software and [> clicks on the "search acm" button] the operating system. For example, user C had some difficulty saving a file, shifting attention from library use to working [< ACM search page displayed] with the operating system. User B accidentally closed his ...are not included. What is not included? browser session, and had to start again: [> types "the best of electronic commerce" into search box] ...it's a key word? Yes. [< further pages displayed] [< "key word" pre- selected] ...not interesting... [> clicks on "search" button] [> closes the Netscape window] [< "Search Result" page appears] [< "word" document page displayed] I have 83 pages matching my query and I have here I to 12 [> closes the Microsoft "word" application] The score here is showing what is most related to my search [< pop-up window "Microsoft Word" appears "Do you want Strategic analysis reports...social impact... to save changes in the document"] [> scrolls down the page] [> clicks "no" button] [.< more results appear on the page] [< computer desktop appears] ... I will have a look at the "strategicanalysis reports" ...I got out of the "Netscape" [> clicks on "Netscape" icon] f4f.14117..114=OP4o,O.Y444.4,FitY.4.1. ri [<"Netscape" window opens] 3,fit "A 31 a '0 ...bookmarks...now again.. As the only user in this study who printed files, rather than po,o-tso Ng. 0.41 saving them, user E also had to integrate library work with 0` ACM Nilal Librany using other resourcesnotably a printer: It's all tied up for the minute, so I'll go and collect some LI itirr .i,ao The links on the top things from the printer. n button bar take the user [- ACM DL 1' window showing ontology paper] olto:evfirt,i41101. = M.7.1.s.L.7.4gLiai:Aitiltra .1kr.. fd ac,itlisiztv,11 r out of the library. OK. So I don't know how I will know if I've got everything e.A.:Itne:i14,iii.V.1,102;-41 -:-.1c.r.r.rri I've printed out. Because I'm losing track of it completely. ow.r.t: t4 But I think that's been sent and not needed anymore. ,o(.11,,Cmay,n4v1AIN z.,:ssrm rrr,rtt,N mrt: Such boundaries cannot beeasilyeliminated,butthe rz.rti.:re-r.A.;e'erut,:tax,mr, P aLinEtut.ui consequences of making transitions across boundaries can be LolzsCrst 11.10 mrtocx!....ctthr,mil 4.5.1,..r.ararerrts minimised; for example, user B had difficulty restoring the :nest drr.rszon .rtrtr-r,rm; ts4:.rwat 4: t state before he closed the browser, while user E had difficulty ,t,+1 establishing exactly what state the system was in (including t $ what had and had not been printed or fully downloaded): Figure 3: identifying the exit routes from a library common ground needs to be restored. She is unaware of the fact that she is not using the digital library, but searching the ACM web site. 
This is the behaviour CONCLUSIONS of a novice user who has just entered the ACM digital library We have considered deterrents to the use of digital libraries, for the first time, but she has clearly failed to discriminate and usability issues that are raised by the pragmatics of use in between "search ACM" and "search the Digital Library". This is a natural settingtaking account of organisational and other partly an artifact of the study, since she entered the library concerns. One way of considering usability is as the absence of directly from "outside" without passing through the ACM web any undesirable features of the interaction; therefore, the focus site, but such situations are common, particularly with the in this paper has been mainly on difficulties experienced increasing provision of digital library portals that provide rather than successes, althoughallusersachieved some links to various libraries from one web site. success intheir interactions. If some libraries have been discussed more than others,itislargely because those Later in the interaction, she again left the libraryrather more libraries were used more extensivelya consequence of use brieflyas she tried to return to the digital library home page: being discretionary. [> ACM DL search page displayed] In terms of the Interaction Framework concepts introduced [> click on "home" button] earlier, the most important design issues have been found in [< ACM home page] this study to relate to: So, I would like to find some journals, articles, 1. Familiarity: users need to be able to rapidly acquire magazines... understanding of corelibraryfeatures,content and ... So where can Ifind them...

187 219 structures, if they are to gain a sense of progress while [3]Bishop, A. P. (1999). Making Digital Libraries Go: working with the library. Comparing Use Across Genres. In Proceedings of ACM DL 2. Blind alleys: many interaction sequences did not achieve '99, pp. 94-103. the user's objectives. This is most obvious when a search [4]Blandford, A. E., Harrison, M. D. & Barnard, P. J. (1995) returns "no matches", but occurs in some more extended Using Interaction Framework to guide the design of interactions. Some interactionsthat have no material interactive systems. International Journal of Human- outcome achieve improved familiarisation (the user learns Computer Studies, 43, 101-130. more about the library structure or contents), but others do not. Perhaps more surprisingly, some interactions that [5]Carroll, J.M. and Carrithers, C. (1984). Blocking learner resulted in a large number of hits failed to achieve the errors in a training wheels system. Human Factors, 26, user's objectivesbecausetheuserwas (apparently) 377-389. overwhelmed by choice and retreated from the search [6]Clark, H.H. and Brennan, S.E. (1991). Grounding in results page, due to poor discriminability. Communication. 127-149 In Resnick, L.B., Levine, J and 3. Discriminability: forming understandings of the content Behrend, S.D. (Eds.) Perspectives on Socially Shared and possibilities in a collection relies on being able to Cognition. Washington DC.: APA. discriminate between possibilities. Therefore we need to [7]Covi, L. & Kling, R. (1997). Organisational Dimensions of ensure that the potential events are easily discriminated. Effective Digital Library Use: Closed Rational and Open 4. Serendipityfinding unexpected interesting results Natural Systems Model. In Kiesler, S. Culture of the seems to give users a particular sense of making progress. Internet, Lawrence Erlbaum Associates, New Jersey. pp. Serendipity depends onusersbeingeasilyableto 343-360 identify interesting information, which is one aspect of [8]Dix, A. J. (1992). Pace and Interaction. In A. Monk., D. discriminability. Diaper and M. Harrison, Eds. People and Computers VII, 5. Working across boundaries: transition events, where Cambridge: CUP. one agent (user or computer system) changes context, can [9]Fumas, G. W., & Rauch, S. J. (1998). Considerations for often be a source of interactional difficulties; measures Information Environments and the NaviQue Workspace. need to be taken to ensure agents maintain common Proceedings of ACM DL '98, pp. 79-88. ground and understand the consequences of transitions. This study has identified many cases where decisions taken by [10] Gould, J. D. (1988). How to design usable systems. In M. computer scientists and librarians have had unanticipated Helander (ed.) Handbook of Human-Computer Interaction, consequences. The incremental approach to delivering library pp. 757-789. Elsevier : Amsterdam. resourceshasmanypositiveadvantages(e.g.easeof [1 1] Gutwin, C., Paynter, G. Witten, I., Nevill-Manning, C., & introducing new features) but can also result in inconsistency Frank, E. (1999). Improving Browsing in Digital Libraries as users move between libraries, particularly where that move with Keyphrase Indexes. Journal of Decision Support is made so smooth that users are often unaware it has even Systems, 27(1-2), 81-104. happened. The use of site licenses that permit partial access to [12] Jones, S., McInnes, S., & Staveley, M. (1999). 
A Graphical resources,oraccessthatistemporary,makessystems User Interface For Boolean Query Specification. unpredictabletousers,sothat theycannot developan International Journal on Digital Libraries, Volume 2 Issue adequate (usable) understanding of what detailed goals are and are not achievable. If theoretical possibilities are to become 2/3, pp 207-223 practical solutions, then the kinds of pragmatic issues raised [13] Marshall, C. C., Price, M. N., Golovchinsky, G., & Schilit, by this study need to be taken into account in the design of B. N. (1999). Introducing a Digital Library Reading future libraries. Appliance Into a Reading Group. In Proceedings of ACM DL '99, pp. 77-84. ACKNOWLEDGEMENTS [14] O'Day, V. L., & Jeffries, R. (1993). Orienteering in an This work is funded by EPSRC Grant GR/M81748. We are Information Landscape: How Information Seekers Get grateful to the anonymous participants who took part in this From Here to There. In Proceedings of InterCHI '93, pp. study, and to George Buchanan and anonymous reviewers for 438-445. comments on earlier versions of this paper. [15] O'Hara, K. 0., Smith, F., Newman, W., & Sellen, A. (1998). Student Readers' Use of Library Documents: Implications REFERENCES for Library Technologies. In Proceedings of CHI '98, pp. [I]Adler, A., Gujar, A., Harrison, B. L., O'Hara, K., & Sellen, A. 233-240. (1998). A Diary Study of Work-Related Reading: Design Implications for Digital Reading Devices. In Proceedings [16] Park, S. (1999). User Preferences Vslhen Searching of CHI '98, pp. 241-248. Individual and Integrated Full-text Databases. In Proceedings of ACM DL '99, pp. 195-203. [2]Bishop, A. P. (1998). Digital Libraries and Knowledge Disaggregation: The Use of Journal Article Components. [17] Theng, Y.L., Duncker, E., Mohd Nasir, N., Buchanan, G. & In Proceedings of ACM DL '98, pp. 29-39. Thimbleby, H. (1999), Design guidelines and user-centred digital libraries. In Abiteboul, S. & Vercoustre, A. (Eds.), Proc. ECDL'99, 167 183.

An Ethnographic Study of Technical Support Workers: Why We Didn't Build a Tech Support Digital Library

Sally Jo Cunningham
Department of Computer Science
University of Waikato
Hamilton, New Zealand
+64-7-838-4402
[email protected]

Chris Knowles
ITS Campus Media
University of Waikato
Hamilton, New Zealand
+64-7-838-4466 x7714
[email protected]

Nina Reeves
School of Multimedia and Computing
Cheltenham and Gloucestershire College of Higher Education
Cheltenham, Gloucestershire, UK
+44-1242-543236
[email protected]

ABSTRACT
In this paper we describe the results of an ethnographic study of the information behaviours of university technical support workers and their information needs. The study looked at how the group identified, located and used information from a variety of sources to solve problems arising in the course of their work. The results of the investigation are discussed in the context of the feasibility of developing a potential information base that could be used by all members of the group. Whilst a number of their requirements would easily be fulfilled by the use of a digital library, other requirements would not. The paper illustrates the limitations of a digital library with respect to the information behaviours of this group of subjects and focuses on why a digital library would not appear to be the ideal support tool for their work.

Categories and Subject Descriptors
H.3.7 [Digital Libraries]: User issues; D.2.1 [Requirements/Specifications]: Elicitation methods

General Terms
Design, Human Factors

Keywords
Ethnography, user studies, requirements analysis

1. INTRODUCTION
There have been many studies in information science looking at the nature and frequency of information seeking activities by different groups. Many have focused on the differences in how people search, what they search for, and why. They have also identified how people use the information they have obtained and how the original source of this information has been used on subsequent occasions over the longer term. Today, there is a plethora of information sources available in both electronic and paper-based forms to a wide range of user groups. One of the most promising areas for developing information sources to meet the needs of a variety of user groups has been in the construction and accessibility of digital libraries.

Quantitative studies of usage patterns in existing libraries (digital or physical) have, on occasion, contained indications of a less-than-perfect fit between users and libraries; for example, Cunningham and Mahoui [6] note that only 28% of visitors to two computer science digital libraries make a second visit to those sites. Quantitative studies, such as transaction log analysis, typically give detailed pictures of user actions, but can give little or no insight into users' motivations and needs. For example, we know that the majority of the computer science digital libraries' users do not return to these web sites, but why not? Is it because the users' information needs were perfectly satisfied by a single visit, or because the collections' contents are inadequate, or perhaps because the user interfaces are unacceptable?

Often usability studies, whether quantitative or qualitative, are limited to evaluating the interface to an existing digital library or the ease with which a given user group can find the information in the library. Many of the studies deal with academics or other specific user groups carrying out research activities. The studies do not, however, necessarily evaluate the basic concept of a digital library as a suitable information tool in comparison to other information tools for people who provide a technical support function for those researchers; e.g. a digital library may be constructed for a group of researchers but is developed or maintained by a technical support group. In the case of computer science the researchers may use the library, but the specialist computer support staff who themselves 'work' in the same area (computer science) may not actually choose to use the product.

Instead of looking at groups of potential users who have already been studied in detail, we identified a specific group who have had little attention paid to their circumstances. This group appear to be a potentially rich source of data for looking at information behaviours in terms of the types of tasks they perform, the tools they use, their knowledge base, how they interact with one another, etc. The study reported here concerns a group of consultants who provide technical support in a School consisting of computer science, mathematics and statistics departments.

The data gathered during the study was from six members of the group. As the entire technical support group comprised eight members, the number taking part in this study was, we felt, more than satisfactory to get to grips with the essential elements of its operation and behaviours with regard to information seeking and use.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL '01, June 24-28, 2001, Roanoke, Virginia, USA.
Copyright 2001 ACM 1-58113-345-6/01/0006...$5.00.

The size of a subject group is an issue in any study, but even more so when dealing with specialist groups. Such groups are, by their very nature, different to 'general' groupings. They have a more defined focus, whether that be in terms of the work they carry out, the type of people they are, and so on. The size of this group and its uniqueness within the institution does not make them less worthy of investigation; indeed, it makes them more attractive as a focus of study. In an institution of around seven hundred staff this support group is unique. There may be similar related groups in other institutions that collectively would form a more significant subject base to study. Whilst an increased subject base may raise some underlying level of reliability in the results, the greater influences of the environment (e.g. the nature of the institution, the country, the profile of the department being supported, etc.) may negate this. In such instances, the reasons for the variation would also be of great interest and of relevance for those designing information resources such as digital libraries.

In looking at the value of digital libraries it cannot be desirable that only groups with a large membership be studied in any depth. Such a focus would surely result in an architecture that was limited by the characteristics of those groups. Specialist groups provide different challenges to the norm, which may be more useful in identifying different behaviours. Levy and Marshall [12] assert that the "highest priority of a library, digital or otherwise, is to serve the research needs of its constituents." Different users within that constituency focus on different elements of resources, and those developing and supporting digital libraries may tend to "idealize users and uses, projecting or inventing an incomplete or even inaccurate picture of the real work being done." (p. 80)

Their recommendation to avoid this is to adopt a work-oriented perspective and instead look at the users, the work they do, and how that work is supported through the use of the technologies and documents. To achieve this they suggest the use of ethnographic techniques of observation. This is exactly what this study did, and the results demonstrate how valuable such an approach is in terms of the richness of the results. Identifying differences between different user groups will, hopefully, serve to stimulate ideas about the nature of digital libraries and how they might be used.

In considering the depth and breadth of the different facets of the operation of this group, we gave a great deal of thought to how we should go about identifying the information behaviours of this group. We chose to adopt an ethnographic approach to this research.

Ethnographic methods are being used more and more by researchers and practitioners in human-computer interaction studies and in systems analysis and design, in an attempt to become more aware of the work circumstances, personal and social characteristics, and knowledge base of potential users of proposed systems [4]. Ethnographic techniques are also used to identify the relevance of different information systems to those people and their roles in an organization [17].

The frequently cited usability maxim of 'know thy users' is seen as a critical part of designing any information system. The ethnographic approach appears to provide an ideal opportunity for us to get to know this group in depth and to identify the various types of information behaviours they exhibit. Technically oriented investigations of work systems are often reductionist and miss important aspects of work activities [3] [4] [18]. For instance, support staff may often continue to problem solve outside the formal workplace, in the tea room, pub, or via web access at home. The shortcomings of the reductionist approaches are due, in part, to the 'outsider' perspective of the researcher who, having observed the work processes, then attempts to describe them using their own models and vocabulary. The latter may have little meaning for those participating directly in the work, and often focus on those aspects that are most relevant to the researcher's preferred model and solutions, for example the creation of a digital library.

The goal of this ethnographic study was firstly to discover the types of information that these technical consultants used in their work, and then to understand how they individually and as a group gather and use information. Although adopting the basic principles behind ethnographic forms of investigation was seen as an interesting and potentially valuable way to look at the situation, we acknowledged at the outset that there might be limitations to the investigation, especially in how we might conduct it or draw conclusions from it. For a particularly thoughtful examination of ethnographic techniques in the context of information science, see [4]. The aim of the study was not to use the results to build an information system to support this group, but it did include some element of identifying the requirements such a system might have to meet and the constraints under which such a system might operate. In this sense, the idea of the digital library might be an option to be considered, should the study identify information behaviours that would be supported by such a tool.

This paper is organized as follows: Section two describes the methodology used in this ethnographic study: the consulting group is described, the ethnographic techniques employed are listed, and the consultants' range of work activities are described. We then briefly discuss Greenstone, the digital library construction software developed by the New Zealand Digital Library Research Group (http://www.nzdl.org). Greenstone shares many characteristics of other examples of the current crop of digital library architectures; its assumptions about collection design, user interface, and user characteristics are discussed in Section 3. Section 4 then examines the information behaviours observed in the ethnographic study; as indicated by the title of this paper, the information gathering and usage behaviours of the consulting group were not well suited to support through a Greenstone-like digital library. Section 5 presents our conclusions.

2. Methodology
The authors conducted an ethnographic study of six members of a technical support group (TSG) serving the School of Computing, Mathematics, and Statistics at the University of Waikato.

To those whose expertise is in quantitative empirical methods, six participants may appear rather few, but the aim of naturalistic ethnographic techniques is to discover a rich picture by developing an 'appreciative stance' via the researcher's involvement in the setting. The number of participants is therefore not as important as the depth of the enquiry.

The difficulties of interdisciplinary working between ethnographers and computer scientists, in terms of language and natural attitude, have been summarized by Crabtree et al. [4]. The approach in this study has been to allow the researcher to view technical knowledge as a socially distributed resource that is often stored primarily through an oral culture [7]. The technical consultants' war stories therefore become texts for both the ethnographer and the consultants themselves. Thus, data gathering techniques included: semi-structured interviews of participants; 'shadowing' participants as they worked; observation of semi-social discussions in the School tearoom; and examination of various work artifacts (email, bookmarks, webpages, office bulletin boards, etc.).

The data gathering phase of this study ran from May to August 1999.

2.1 Participant Demographics and Description of Consultant Activities
There were six participants in the study. All were male, ranging in age from the mid-twenties to mid-thirties. All have formal tertiary qualifications. Four of the six have bachelor's degrees in computing, and two have Ph.D.s in physics. The group support a range of users of computer systems within the School, such as lecturing staff, administrative and research staff, and university students at undergraduate and postgraduate level. Some members of the group were students at the University, but all have been students, and have worked closely with academic staff both as students and currently as technical support consultants. The group is, therefore, seen as having a special position as a technical support group in the School, as well as individually being seen as colleagues by a large number of the staff.

Two members of the group have special roles: the group leader, who holds a managerial position and also provides Unix support; and the undergraduate lab support consultant, who provides primary maintenance for the printers, hardware, and software in the large teaching labs. In terms of physical proximity, the undergraduate lab support consultant has his office in a separate building from the other five consultants, who have offices adjacent to each other on the ground floor of the School of Computing, Mathematics, and Statistics. The separation of the group into two areas has some implications for the degree of 'personal' contact between the technical support staff and the people they support, in terms of 'foot traffic' going past their offices and in terms of the means of contact used (e.g. phone, email, etc.).

Each technical consultant is seen as having a primary area of expertise. Often this is in terms of operating systems (e.g. Windows, Unix, Macintosh, etc.). A consultant may also be identified by responsibility (for example, one consultant is in charge of the first year labs). The group is seen as a central resource for the whole School rather than for individual departments. This means a member of the group tends to be called upon for reasons related to their area of expertise rather than some organizational role.

The consultants' work is mainly task-based. The nature of the job involves dealing with both poorly and well-defined tasks. In consequence, the consultants often have to employ a range of information seeking techniques. Often projects are relatively flexible in terms of how they may be approached; frequently other members of the group are recruited to deal with specific elements as required, with one person being responsible for the overall project. It is difficult to describe a typical day of a consultant in this group, but depending upon the time of year he might expect to:

- deal with immediate, low-level problems such as fixing a stopped printer queue;
- interact with novice users to answer questions about standard software packages;
- set up new facilities (ranging from complete installation of a 50+ computer lab, down to setting up a single new laptop for an academic);
- proactively investigate potential software and/or hardware problems, and locate solutions to these problems that haven't occurred yet;
- keep an eye on long term developments in his area of hardware/software expertise, so that he can provide informed advice to decision-makers in the School;
- update information sources used by themselves and their 'customers' in the School; etc.

For some of these tasks there is a level of repetition, such that once the initial information seeking activity is completed the results can be applied and re-applied when the situation arises again. Some of the tasks, however, fall into the category of 'one-offs', with the consequence that the problem solving and information seeking behaviours result in a solution that cannot substantially be used again. It is clear, however, that most of the tasks, despite the level of repetition, share many of the characteristics of information seeking and retrieving activities described in information behaviours studies. In an earlier work [9] we interpreted these activities in terms of a specific information behaviours framework (that of Ellis [7]). In this paper, we examine the potential for supporting these activities through a digital library tailored to the TSG's needs and preferred search behaviours.

3. THE GREENSTONE DIGITAL LIBRARY ARCHITECTURE
Greenstone (http://www.nzdl.org) is a toolkit for constructing and maintaining digital libraries. A digital library is viewed as a set of collections, where each collection has a focus: typically by type of document (for example, music videos), by subject (for example, computer science research documents), or by user needs (for example, people working in disaster relief).

Collections can be composed of documents held locally, can be an index to geographically distributed documents, or can be a mix of local and offsite documents. It is expected that a collection will be constructed after a set of documents grows past the point at which a linear search of the set is feasible. A collection, then, is expected to be large: hundreds, thousands, or millions of documents.

Current digital library architectures, of which Greenstone is typical, make a number of assumptions about the documents and users of a collection:

- The primary interaction mode is presumed to be searching, rather than browsing. The collection creator can organize the Greenstone search interface into 'simple search' and 'advanced search' modes. The advanced search mode offers a more extensive set of options for tailoring a search, but at the expense of requiring the user to know more about search strategies, the collection's metadata, and the system's implementation.
- Browsing facilities are relatively crude: the collection builder can specify simple categorizations and sortings of documents (grouped and sorted by author's last name, for example). These browsing facilities cannot be defined in an ad hoc manner by users. This situation is common across digital library implementations; while a number of novel browsing schemes have been prototyped (see, for example, [13]), they are not standard with digital library construction software.
- A document collection can be expected to grow monotonically, with documents remaining in the collection (and assumed to be potentially useful) for the foreseeable future. Discussions in the digital libraries research literature on collection culling have tended to focus on the need to conserve space or to observe memory limitations, rather than on detecting documents whose contents have become obsolete.
- The contents of a collection are relatively static. Groups of documents are added to the collection periodically, rather than having individual documents added in a steady stream, in 'real time'. This limitation arises because rebuilding the collection's index is an expensive operation, and incremental index construction is not a common feature of digital library software (see the sketch after this list). Typically, then, there is a (sometimes significant) gap in time between a document's production and its availability through a digital library.
- Documents can be presumed to be trustworthy, or can be evaluated for trustworthiness solely on their contents. The collection builder is generally, though not always, expected to serve as gatekeeper for the digital library. This role may be served by selecting additions to the collection on a document-by-document basis, or by exercising quality control through choice of reputable document sources from which additions to the collection are automatically drawn.
- Documents stand alone, in the sense that a document is viewed with little or no regard to its relationship to other documents. For example, it may be difficult or impossible to view relationships between documents (such as the sequential publishing of two documents, or documents that come from the same source, or that are distributed by the same mailing list, etc.).
- Documents, once released or published, do not change. Documents are not usually monitored for changes to their contents or metadata, which can mean that the document's description in the collection may be inaccurate if the document is altered.
- The primary scenario for locating information involves a solo user searching a collection of documents. Little, if any, support is provided for collaborative information behaviours.
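To make the static-contents assumption concrete, the sketch below is a minimal Python illustration written for this discussion (it is not Greenstone's actual code) of a batch-built inverted index: because the whole index is rebuilt from scratch, the cost of adding even a single document grows with the size of the collection, which is why collection contents tend to be updated in periodic batches rather than in 'real time'.

    # Minimal sketch of a batch-built inverted index (illustrative only;
    # this is not Greenstone's implementation).
    from collections import defaultdict

    def build_index(docs):
        """Rebuild the index from scratch: cost grows with the whole
        collection, not with the number of newly added documents."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        """AND-search: ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(index.get(terms[0], set()))
        for term in terms[1:]:
            results &= index.get(term, set())
        return results

    docs = {"d1": "digital library usability study",
            "d2": "technical support information behaviours"}
    index = build_index(docs)

    # Adding one document forces a full rebuild under this scheme,
    # hence the production-to-availability gap noted above.
    docs["d3"] = "digital library index construction"
    index = build_index(docs)
    print(search(index, "digital library"))  # {'d1', 'd3'}

An incremental indexer would instead merge the new document's postings into the existing structure; the point of the sketch is only that the batch approach assumed by the architectures discussed here makes a steady stream of small additions expensive.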

4. TSG INFORMATION BEHAVIOURS
The properties of the Greenstone Digital Library Architecture offer the potential user all the advantages of a typical digital library. Given the type of activities that the technical support group carry out as part of their work, and the associated information seeking behaviours, it is tempting to assume that a digital library would be the most obvious tool for consolidating their information sources. Members of the group share resources, compare information from a number of different sources, keep records of information they have found, and search for information often, sometimes for a specific reason and sometimes for general information gathering to develop specific skills and knowledge in an area. All of these activities would be supported through the architecture of the Greenstone Digital Library. However, the value of adopting an ethnographic approach to the investigation of the group's information seeking and information use, in preference to other, more structured design-based techniques, becomes apparent when looking at how well a digital library would really serve the group's needs.

4.1 Formally Published Documents Usually Aren't Useful
'Standard' academic resources such as journals, conference proceedings, bibliographic indexes, etc. were not used by members of the group for work, as their work-related tasks were not seen as 'research'. Interestingly, this was as true of the consultant seconded to an academic research group; he provided programming support, but did not follow the 'research' side of this work.

Popular IT magazines such as MacWorld were irregularly consulted, primarily for pricing information in advertisements. Occasionally the local version of ComputerWorld was read, mainly for New Zealand-specific news or for job adverts: "I read ComputerWorld for a laugh, because they started sending it to me for no apparent reason. I always hear about non-local things on the web first, the only reason to read it [ComputerWorld] is to see what's happening locally. Everything in it is usually old news." None of the magazines were viewed as core resources.

The printed documentation that accompanied hardware and software purchases in the School was also seen as irrelevant. All members of the TSG had bookshelves full of documentation that they rarely, if ever, consulted. Often the manuals would be saved for years in pristine condition before being thrown away, still shrink-wrapped:

*Tom(1) points at his bookshelves* "I've got documentation for the stuff we've bought. It actually gets used less than you'd think, it rarely goes into the sort of details I need for the problems I deal with. It's probably useful for tutors to explain software to students."

(1) Consultants' names have been altered for privacy reasons.

This is not that surprising, given that most of this application documentation is intended for end users rather than technical consultants. For example, the manuals typically focus on how to carry out specific tasks. There is often a simple 'trouble shooting' section that outlines the most frequently encountered problems and how to solve them, but this is generally pitched at a relatively simple level, not necessarily assuming the user has detailed knowledge of the program or the operating system on which it is running. The technical support staff are generally called upon when more complicated or subtle bugs are encountered, and the TSG then require a different level or type of information to solve these problems.

4.2 Many Documents are Ephemeral...
One striking feature of the information sources used by the TSG consultants is that many are extremely ephemeral. The usefulness of a given document tends to be highly time-dependent; for example, they need the latest news about the latest bug found in the latest version of the operating system. The usefulness (and in that sense, the 'quality') of information tends to deteriorate substantially over time.

In many libraries, having access to historical records, however 'recent' that history may be, is often viewed as an advantage for research purposes, where people can go through the background material in a subject and use it to study developments in the area. The emphasis on timeliness for the technical support group, however, would necessitate a culling of older documents as they lose relevancy. If this were not done then the older, obsolete documents would numerically overwhelm the more timely pieces of information.
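Such a culling policy is easy to state mechanically. The sketch below is a purely hypothetical illustration (nothing like it existed in the systems discussed here, and the record format and shelf-life values are invented): it drops documents once they outlive a shelf life assigned to their kind of source, the sort of policy a tech-support collection would need but that a monotonically growing digital library collection does not provide.

    # Hypothetical age-based culling for a tech-support collection;
    # not a feature of the digital library software discussed here.
    import time

    SECONDS_PER_DAY = 86400

    # Assumed shelf lives, in days, by kind of source (invented values).
    SHELF_LIFE = {"bug-report": 90, "product-news": 30, "spec-sheet": 3650}

    def cull(documents, now=None):
        """Keep only documents still within the shelf life for their kind.

        `documents` is a list of dicts with 'kind' and 'created' (epoch
        seconds) keys -- an assumed record format, for illustration.
        """
        now = now if now is not None else time.time()
        kept = []
        for doc in documents:
            age_days = (now - doc["created"]) / SECONDS_PER_DAY
            if age_days <= SHELF_LIFE.get(doc["kind"], 365):
                kept.append(doc)
        return kept

The deliberately long shelf life for specification sheets anticipates the paradox of the next section: some obsolete documents are precisely the ones worth hoarding.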

4.3 But Some Documents Hang Around Forever!
Paradoxically, however, some documents are valuable because they are obsolete! The TSG have to support several 'legacy machines', elderly computers that run long-outdated versions of software but that are nonetheless still heavily used by some members of the School. It can be difficult to locate information on the specifications of old computers, or on bugs and fixes for obsolete software. This type of information might not be carried on current websites or other information sources, which tend to cull information on outdated machines; instead, if a spec or bug fix is needed, it is likely to be found on a carefully hoarded document in a consultant's private stash. For example, in trying to locate a part number for an eight year old laptop, the TSG member consulted a long outdated version of a cdrom called 'Service Source', distributed to Apple technicians. The TSG member had spotted it in a trash heap in another workshop on campus and retrieved it for his own use, citing the cdrom as "a good starting point for antiques".

At the time of this study, a number of historic documents were held on paper: what one consultant dismissively referred to as "just rubbish, it's filing cabinet stuff":

*Chris asks about Tom's filing cabinet* "A lot of what goes in there I hope never to see again. Some stuff comes up yearly for staff, [other documents are] vendor product details that I didn't want to hear about in the first place, copies of everything purchased end up in it because sometimes you have to tell them try again, all the leave that anyone does, changes in salary, I file them forever."

*Chris asks about Tom's bookcase* [Those are] "the folders that Dave [the former TSG manager] passed on to me. I don't think I ever looked in them. He did them back in the days when paper really was the best way to pass things on. Now if I print something, it will go out of date, it will go out of date if I print them again. I don't think I've ever looked in those boxes."

There is an expectation that much of this "stuff" will never be used, but that it should be archived, just in case: "like somebody buys something and it comes with instructions and a little plastic whatsit, you give it to the punter and he loses it. I keep it and it turns out that no one ever needs it, but at least we've got it."

Gradually much of this documentation was becoming available electronically, rather than on paper. The university was distributing most of its forms as Word files, and many purchasing records were appearing in digital form. This type of record would be well-suited for inclusion in a digital library: once created, the records are unchanging, and many of the records can be easily categorized and cataloged. The ability to easily search for and retrieve a given document could provide a significant advantage over physical filing systems. It was not clear that it would be easy to locate a given document in the cabinets, boxes, and heaps then in use, or even to know whether or not a particular document had been stored!

4.4 Documents Might Not be Trustworthy
Consider one of the many digital libraries intended for computer science researchers. These collections generally contain conference papers, journal articles, and working papers written by members of the computing field. Many documents have undergone the scientific refereeing process, and others (such as the working papers) generally have the imprimatur of a recognized research institution. Digital library users assume that the contents of the documents have been verified or vouched for, and that the documents are, on the whole, trustworthy. Exceptions may arise, but they are expected to be confined to a small minority of the digital library's contents.

In contrast, much of the information that the TSG gather from websites and mailing lists hasn't been verified, and in fact may be expected to contain errors, half-truths, unsubstantiated advertising claims, and rumors. The TSG members recognize that they will often have to do extensive cross-checking to feel assured that the information is reliable. For example, one consultant was observed to regularly monitor the websites of major software producers for news on upcoming and existing products. This information would then be cross-checked with product reviews. The validity of individual reviews would also then be cross-checked, depending on the source.

Sometimes the trustworthiness of a site or a particular document cannot be immediately evaluated. One consultant saves particularly interesting WWW articles on stickies on his desktop, and consults them occasionally to see how the document's contents hold up over time. "As things come to pass on the stickies" [that is, as the events or trends mentioned in the articles actually occur], the consultant can put more trust in the document and, by extension, its source website.

4.5 A Primary Information Source: Other People
Other members of the TSG are a primary, significant source of information. One common strategy for finding a solution to a problem is to ask another member of the group. These interactions are usually not formally recorded; when asked how communications were managed between members of the TSG, the TSG members clearly preferred immediate, personal contact with each other. The following comment was typical: "For something important I go next door or ring, otherwise it's email. Ringing or going to see someone is the first thing for communication." Close physical proximity was seen as a positive advantage in solving problems: "We used to be in the same room, so we were just a shout away from getting answers. BUT one person's interruption was every person's interruption."

TSG members also occasionally use each other as information filters:

Bob: "Occasionally I see something interesting to Dave, or I pass something to someone else."

Tom, a supervisor: "There's a Linux kernel development mailing list but that's too high volume, I rely on the guys [to keep up with that]."

A significant amount of 'passive' or serendipitous information gathering also occurs in face-to-face communication with colleagues. One consultant describes himself as frequently "gossiping" with other consultants on campus to find out activities or events that are relevant to his interests but that he's not directly working on. More formally, two of the consultants attended campus-wide informational meetings scheduled and run by the central computing service (irreverently referred to as "prayer meetings"). All TSG members attend a weekly local TSG group meeting: "We send Tom our weeklies [a weekly report on their individual activities]. He goes through them and picks out things he thinks are important and we talk about it in the meetings. When we finish going through that then each person may bring up other stuff and we chat about them. We may talk just between 2 or 3 people in a meeting."

The TSG members rarely proactively announce solutions to problems to the group as a whole, possibly because the consultants tend to specialize, and most problems would be of interest only to one or two members of the group: "Very infrequently someone will go, this is the solution to this problem and then go tell everyone."

It appears, then, that critical information gathered by TSG consultants is not digitally recorded, and is difficult to formalize for inclusion in a digital library. This is not an unusual situation, of course! Few digital libraries would claim to be a comprehensive source of information for their users. It is, however, important to be aware of the limitations of a proposed information source, particularly in noting the bounds for assembling a complete and comprehensive resource.

4.6 Local Production of DL Documents
Some of the documents used by the TSG (and which presumably should be included in a TSG digital library) must be constructed by the TSG themselves, who as a group aren't known for their love of documentation! These documents are mainly descriptions of the local system: lab configurations, local network descriptions, instructions for setting up new machines, etc.

These documents differ from the static, unchanging documents that typify the contents of most digital libraries. Locally produced documents require considerable maintenance, and are generally not complete or entirely up to date. The TSG recognize that developing documentation necessarily takes time away from other activities. One supervisor pointed out that, "some of the NT guys got all keen and set up a web based thing to track their work and reports. More often the report is created from stuff scribbled in a diary. I'm more interested in what's coming up and what they'll have to do, than what they've spent their time doing."

This concern is not unique to the local group; for example, in an extensive case study of a work-planning process, Solomon [19] describes an information gathering and documentation process gone astray:

"The staff in the regional offices take time away from their technical assistance project activities to fill out project status forms and the study unit's staff takes time away from their program evaluation aims to maintain the project database. Projects fall behind schedule and the goal of evaluation to make things better is thwarted. The well-intentioned drive for accountability and improvement seems to have made things worse." (p. 1106)

On the other hand, inadequate or outdated documentation can cause problems if older versions are used to base decisions on or to solve problems with. Keeping some level of version control also means keeping some record of who produced the documentation and where it is kept. At present, the TSG members attempt to find balance by creating 'just enough' documentation.

Internal documentation (intended primarily for use within the TSG) is by no means complete, but is generally sufficient to jog the memory of the documentation's creator, or to be used as a starting point for exploration by the other members of the TSG team. For example, a work diary, paper or electronic, could be referred to later to recall a problem and its solution.

External documentation, intended to more directly support the consultants' client base, appears to be developed in response to sustained demand from students or School academics. One supervisor notes, "I'm a big fan of automation. Machines should be able to work by themselves. [TSG] Staff turnover is a problem. One of the biggest causes is repetitive work. I tell the guys if you have to tell someone something more than twice, set up a web page and give them the URL, if more people need it try to set up a program to do it automatically. We have developed automated things for costs, measuring web surfing traffic."

Some records of problems and solutions are created by enforcing, where possible, a preference for dealing with "the punters" through email. Email correspondence documents the consultant's activities for internal reports, and can be filed away for later use if the problem is likely to recur. One consultant notes that, "I don't have voice mail. I don't see any point in voice mail when email is much better. The problem with voice mail is you can't file things, you don't have an accurate representation of the problem because you're coping with voice as well as trying to keep your facts straight. With email you can read it over first. And I try to keep things filed with email. I use... what do you call them, folders I guess."

In examining the problem of including locally produced documents in a digital library, another point to consider is that the construction of some documents may not be in the individual TSG consultant's best interest (although the maintenance and development of these may be in the best interest of the group as a whole, and of the university). Chatman [2] identifies secrecy as a strategy sometimes employed to give an individual greater control or influence in the communication process. Chatman's study concentrated on various 'outsiders' characterized as members of the 'information poor', for example, women engaged in job training with the CETA (Comprehensive Employment and Training Act) program. These women did not share information about opportunities for permanent jobs, as letting others know about the positions would reduce the probability that they themselves were offered a coveted permanent position.

It is important to note that we did not observe any member of the technical support group consciously employing the secrecy strategy. However, there are obviously opportunities for a technical consultant to create job security by becoming the only person with critical (and undocumented) information about the structure and function of a given system. For example, a supervisor jokingly noted that one consultant had exclusive knowledge of a particular system setup: "if he leaves, then we will never print again!" Another supervisor, describing an upcoming lab revamp during a brief semester break, pointed out that, "Bob has something planned. I know what they are but not the details. If he gets hit by a truck, we're in the poo."

The idea of someone no longer being a member of the group because of an 'accident' is something that is perceived to be a risk, albeit a remote one. A far greater risk, and an event more likely to occur, is that of the person leaving for more lucrative employment.


This is a real problem that the group has experienced time and time again. Retaining staff is difficult because of the more highly paid opportunities outside of the University. The need to record a person's 'knowledge' as information available to the group is something TSG members are aware of, but again this has to be weighed against the need to solve the problems at hand, and against the fact that recorded information will become out of date relatively quickly.

The TSG see themselves as, to some extent, outsiders in the world of the university; one TSG member remarked that academics are "protected to some extent by tenure," while "technical staff are just cannon fodder." It would not be unexpected if consultants occasionally, unconsciously or consciously, utilized the outsider's information tactic of secrecy. The dependence of a collection's integrity on the producers of documents, who may have different priorities than the digital library maintainers, is likely to be an intractable problem.

4.7 Creating Individual Information Resources
All of the TSG members created information resources that they stored on their own computer. Email filters were used by all participants, primarily to filter all messages from a mailing list directly into an associated mailbox. These messages were generally never read, and the mailbox would be consulted only if the consultant encountered a problem related to that mailing list's topics. In essence, a consultant uses his mail filters to build up private, searchable document collections based around the mailing lists to which the consultant subscribes. Some of the consultants also created significant resources based around files downloaded from various websites. These files are not cataloged, and are rarely formally organized. One consultant discovered that he had over 79,000 files (of all possible description) on his Macintosh hard drive. He reported using Sherlock (the Macintosh 'find file' utility, which supports searching both by file name and file contents) to locate a specific file. The files were organized, if that term can be used, in a very flat and wide file hierarchy: "Quite often it [Sherlock] will find it on the desktop!"

These personal resources should logically be included in a digital library to support TSG activities. Given Greenstone's definition of a digital library as consisting of a set of focused collections, it would be logical to include a personal resource as a distinct collection, accessible only by the individual TSG member. At present, however, the process of creating and maintaining a digital library is not trivial; it would be overly burdensome for an individual to use a full-fledged digital library tool such as Greenstone to maintain these individual resources as a collection integrated into a TSG digital library.
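The filtering strategy itself is mechanically simple, which is part of its appeal. The sketch below is an illustrative reconstruction in Python (the consultants' real filters ran inside their mail clients, and the function and file names here are invented): it copies each mailing-list message, identified by its List-Id header, into a per-list mbox file, producing exactly the never-read-but-searchable private archives described above.

    # Illustrative reconstruction of the consultants' mail-filter habit:
    # copy mailing-list traffic into one mbox archive per list.
    # (Their real filters lived inside their mail clients.)
    import mailbox
    import re

    def archive_list_mail(inbox_path):
        inbox = mailbox.mbox(inbox_path)
        for msg in inbox:
            list_id = msg.get("List-Id")
            if not list_id:
                continue  # not mailing-list traffic; leave it alone
            # Reduce e.g. '"Linux kernel" <linux-kernel.vger.kernel.org>'
            # to a safe archive file name.
            name = re.sub(r"[^A-Za-z0-9.-]+", "_", list_id).strip("_")
            archive = mailbox.mbox(name + ".mbox")
            archive.add(msg)   # filed, and in all likelihood never read...
            archive.flush()    # ...until a related problem turns up
            archive.close()

    # archive_list_mail("Inbox.mbox")

The archives this produces are, in effect, one focused collection per mailing list, which is why they map so naturally, in principle, onto the collection model described in Section 3.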
Some of the consultants also created significant resources based around files downloaded from various websites. These files are not cataloged, and are rarely formally organized. One consultant discovered that he had over 79,000 files (of all possible description) on his Macintosh hard drive. He reported using Sherlock (the Macintosh 'find file' utility, which supports searching both by file name and file contents) to locate a specific file. The files were organized, if that term can be used, in a very flat and wide file hierarchy: "Quite often it [Sherlock] will find it on the desktop!"

These personal resources should logically be included in a digital library to support TSG activities. Given Greenstone's definition of a digital library as consisting of a set of focused collections, it would be logical to include a personal resource as a distinct collection, accessible only by the individual TSG member. At present, however, the process of creating and maintaining a digital library is not trivial; it would be overly burdensome for an individual to use a full-fledged digital library tool such as Greenstone to maintain these individual resources as a collection integrated into a TSG digital library.

4.8 Browsing and search by location
Searching is certainly an important technique for locating documents, whether the search is conducted over the WWW as a whole with generic WWW search engines such as Google, or over Usenet articles with Deja News, or over a TSG member's own hard drive using Sherlock. As noted in Section 3, digital library architectures assume that the primary access mechanism to the collection is searching. The types of search options supported by Greenstone and other digital libraries tend to closely match the standard options on WWW search engines and other search tools, so that movement between a digital library and other commonly used resources should be relatively painless.

Browsing is also an important technique for navigating information sources, with locational cues playing a part in information storage and subsequent retrieval. Frequently used physical documents (in contrast to "filing cabinet stuff") are strategically placed so as to be easily viewable. For example, large items like the network configuration diagram, lab timetable and the holiday rota are drawn on a whiteboard or pinned to a noticeboard, and smaller items, such as a list of shortcut commands, are written on a Post-it note and stuck up anywhere handy. Sometimes the back of the hand is used to note down IP addresses and passwords!

Digitized documents stored on the hard drive are also sometimes retrieved (browsed) by location or appearance. This technique has been noted in earlier studies of file organization [1]. TSG consultants might place files that they expected to use regularly or shortly on their computer's desktop or other readily accessible spot (although this technique falls down if the consultant does not engage in regular housecleaning, as evidenced by the difficulties experienced by the consultant who had accumulated 79,000 files). Color is also sometimes used as an adjunct to location, when browsing through a set of files. One consultant uses color extensively with his stickies, and another consultant uses colored nodes in a mind map to organize information.

Digital libraries, as currently designed, offer little support for the user to structure information for retrieval based on appearance or position in the collection. Documents typically have no location, as such, and their appearance through the library interface cannot be altered by the user.

4.9 Information Might Not Look Like a Document
Sometimes the object of an information search is not a document as such. Instead, the consultant may be looking for an example of some sort: a sample of a type of file setup or a piece of code that solves a problem similar to the situation at hand, for instance. Some examples are located in formal sources, and are intended for use as problem solving aids. IT textbooks recognize this preference for learning through example; some of the most popular technical resources are example-heavy 'cookbooks'. One consultant prefers a particular software development kit because the kit and the associated website include a large amount of sample code. Other examples consulted have not been formally prepared as examples, but are simply remembered portions of the local system that a consultant feels might provide insight into a similar situation being faced. This latter type of example might include the contents of a .login file, a piece of code written by one of the local TSG members, or a particular file organization.

It would be difficult to formalize these examples for inclusion in a digital library. When searching a digital library, what type of key search terms could the TSG member use to describe the specific problem that would match to the metadata recorded in the library? The local examples are exceptionally problematic: what sort of metadata would describe a file hierarchy? Formal examples in code libraries, 'cookbooks', and developer's websites are generally accompanied by a description of what the example does, but this description might not include the features that a consultant finds most useful in the example.

Working from sets of examples generally entails considerable browsing, trying to match features of the solutions to the features of the problem at hand, and again, browsing can be difficult to effectively support in digital libraries.

4.10 Collaboration
Twidale and Nichols [20] describe the process of sense-making through interactions with peers as 'Over the Shoulder Learning' (OTSL). In this scenario, formal information sources such as online help, printed manuals, and training materials are regarded as secondary, rather than primary, sources for coming to an understanding of the overall 'shape' of a system. Instead, in OTSL it is interactions with colleagues that provide a great deal of the context in coming to grips with a complex system. These interactions may be prolonged and intensive, or, more typically, can be informal and short, focused on authentic tasks in the work environment.

Clearly the TSG relies to a large extent on OTSL for bringing new staff members up to speed; the convention is to have new staff members share an office with a more experienced consultant who works on the same operating system or who supports the same group of labs. In the past, when the consultants all shared the same large office, this OTSL would have been achieved more naturally, with the new staff member being immersed in an ongoing conversation about all the different systems, user groups, and physical labs. As Twidale and Nichols [20] note, facilitating OTSL in a digital environment is problematic: the system must support a collaborative interaction with the document set, preserve a sense of history in the user's interactions, and maintain a task- or work-related focus in the presentation of information.
Ideally, the context of the OTSL would be retained, perhaps in the form of a playback facility that would allow the user to review the concepts and procedures presented in OTSL sessions.

Current digital library architectures such as the NZDL do not provide effective support for collaborative information behaviours such as OTSL. In a physical library, a reference librarian may serve as the expert 'colleague' in OTSL; good reference librarians do not simply give users answers to specific questions, but also show the users how their information needs can be satisfied by guiding a library user to a greater understanding of the library's contents and organization. Peer-to-peer OTSL may also occur as, for example, students working on the same assignment cluster around the same library catalog monitor [3].

Direct communication between digital library users is not currently a feature of digital library architectures such as Greenstone. 'Ask a Reference Librarian' services have been incorporated into digital library frameworks, including Greenstone [5]. The reference librarian services offered in a digital library have impoverished interfaces, in comparison to face-to-face interactions in a physical library. Typically the digital reference librarian services are based on an exchange of email, but email 'conversations' are generally too slow to support the (sometimes extensive) back-and-forth required to reach a consensus on the problem to be explored. Email is generally adequate for one person (a reference librarian or an OTSL participant) to give an answer to a question or to retrospectively explain how a solution was found, but not to allow two people to explore an information source collaboratively.

Experiments in providing digital reference services via synchronous communication software such as video conferencing [11] or a MOO [16] have yielded mixed results. On the one hand, true conversational interactions are made possible. However, these systems introduce problems of their own: users find it difficult to master the MOO interface, the videoconferencing systems require awkward-to-use hardware on participating sites, and the systems are easily crippled by slow (or even merely moderately fast) connect times. Further, the MOO interface is text-based, and videoconference image resolution may be fairly coarse, both limitations making it difficult for one person to observe another's interactions with the digital library or other information source.
Digital libraries, as many are currently designed, offer little support for the user to select information for retrieval based on appearance or position in the collection. Although significant research has been carried out in terms of visualizing information spaces [10] [14], current DLs do not provide significant levels of support either for individuals structuring that which they have retrieved or for collaborative groups to structure their combined hits. Of course, a digital library could be used in an OTSL fashion by two TSG consultants physically sharing a monitor, so collaboration between members of the local TSG group is possible to the same extent as with any piece of software. Collaboration between physically distant members of the greater TSG community, however, is not well supported. In sum, then, digital libraries with an architecture grounded in a view of the user as an individual searcher, such as Greenstone, are not well-suited to supporting collaborative information behaviours.

5. CONCLUSIONS
There are a number of conclusions that can be drawn from this study concerning the value of the ethnographic approach, the nature of the information gathered, the practical needs of the support staff for information seeking and information use, and how well different types of information system might best support such a group.

In terms of the use of ethnographic methods of identifying the information behaviours of this group, this study should be viewed as a success. The richness of the data gathered and how it was analyzed enabled us to view the situation from several different perspectives. The information gathered demonstrated how each member of the group interacts with his own information sources and those used by others, including colleagues as information sources in their own right.

The information behaviours we identified through using the ethnographic methods were primarily ones of browsing, cross checking and verifying, filtering information, monitoring sources, working through related information in steps to get to an end point, and extracting information that was relevant from that which was not. These behaviours correspond well with the framework described by Ellis [7].

Of these behaviours, working through steps from a start point to an end point in its generic form could easily be supported by the searching mechanisms within a digital library. If the user were to begin with a simple reference to a subject or problem and work through to lists of citations and further references, this would be similar in many respects to an academic researcher's use of a digital library. Extracting relevant information and differentiating between what was useful and valid would also be relevant behaviours within a digital library construction.

In terms of the information needs, rather than actions, of the group there are several areas where a digital library might provide the support required.

For example, material that is kept for historical reasons or safety reasons (e.g., supporting legacy systems) and is not actually used on a regular basis would be kept in a secure place that could be accessed easily by any member of the group. This type of material is relatively static.

Documents are generated by different members of the group, and as people come into the group or leave it there is a need for continuity of information and for recording 'intellectual capital' for use by the rest of the group. There was the recognition by the group that not having central repositories of information meant too much local knowledge resided in individuals, and the risk in that was significant. The group members trust each other and respect the level of expertise each person has with respect to a particular area of the job. This would again seem to reinforce the utility of a digital library in reducing the risk of losing local information, and the level of trust that the people could place in the content of the library because they would have generated much of it or at least cross-checked the original source with other sources.

Whilst a number of information behaviours and information needs would, at least on initial inspection, appear to be well served by a digital library, not all would be. This study identified several information behaviours and several features of the current digital library architectures that conflict with the information requirements or work habits of the TSG consultants.

If we begin with information behaviours: the activities of browsing are not necessarily well-supported by digital libraries founded on the premise that the basic interaction mode is searching. Browsing is a less well-defined activity but is very relevant when dealing with less well defined problems. Cross-checking different sources is also an activity that is not well-supported, in that the relationship of one document to another is only recorded in very limited terms. Often the group members cross-checked reports from different sources, at different times, for different reasons. No assumptions about the validity or trustworthiness of the sources would be made until cross checking and verification had been carried out. In a digital library the need to record the result of this cross checking and validation would be important.
checking different sources is also an activity that is not well- Underlining was usedinpreferencetohighlighting (using supported in that the relationship of one document to another is highlighter pens) in paperback books because the paper is only recorded in very limited terms. Often the group members absorbent and the highlighter ink would leak through to the other cross-checked reports from different sources, at different times for side of the page. If we look at a simple example of the same type of differentreasons.Noassumptionsaboutthevalidityor relationship for the technical support group, the Post-it note or trustworthiness of the sources would be made until cross checking physical yellow sticky for recording useful pieces of information or and verification had been made. In a digital library the need to lists of tasks to remember has value because it can be transported record the result of this cross checking and validation would be physically from one point to another, over time and edited or important. amended over time. Monitoring sources and filtering information also relates back to The Greenstone architecture can be seen as a good implementation the issue of trustworthiness and being able to re-present or re- of a typical digital library. At the moment, our results compare the structure the relationships between documents often to enable the types of information behaviour a typical implementation supports user to deal with timeliness issues. The documents themselves with those undertaken by our subject group. Having reviewed our would need to be updated regularly and in doing so this may conclusions, are we being overly harsh in judging the potential change the type of relationship the document has to others, again value of a digital library for this group? Are we setting a standard requiring some restructuring or re-presenting of the source. that no comprehensive resource for this group could ever meet? One critical issue when developing a digital library is defining a set The work context for this group of subjects does provide a picture of target users. For the TSG, it is unclear what community should of information behaviours that perhaps a 'typical' digital library be the focus for a collection. Each member of the TSG participates implementation was never designed to support. If that is the case, in a number of communities: the School TSG, the university TSG, we should not be surprised the support is not there. the universal set of technical consultants, technical consultants If we adopt a broader view of digital libraries, as advocated by within their individual operating system specialty (Macintosh, Levy and Marshall [12], would that lessen the significance of our Linux, Windows, etc.), and technical consultants by role (security conclusions? The answer is probably yes, but only in part. Re- administrator, lab administrator, etc.). A digital library's usefulness examining them, in the context of a broader view, still leaves us would be compromised if it did not support an individual's with the problem that the broader view is not currently the typical membership in all of his communities, but at the same time a implementation. This will change but we would hope that the generic digital library for all TSG members would overwhelm an results of thisstudy may contribute to pushing out those individual consultant with irrelevant material from communities boundaries. which he is not a member of. 
Underlining was used in preference to highlighting (using highlighter pens) in paperback books because the paper is absorbent and the highlighter ink would leak through to the other side of the page. If we look at a simple example of the same type of relationship for the technical support group, the Post-it note or physical yellow sticky for recording useful pieces of information or lists of tasks to remember has value because it can be transported physically from one point to another over time, and edited or amended over time.

The Greenstone architecture can be seen as a good implementation of a typical digital library. At the moment, our results compare the types of information behaviour a typical implementation supports with those undertaken by our subject group. Having reviewed our conclusions, are we being overly harsh in judging the potential value of a digital library for this group? Are we setting a standard that no comprehensive resource for this group could ever meet?

The work context for this group of subjects does provide a picture of information behaviours that perhaps a 'typical' digital library implementation was never designed to support. If that is the case, we should not be surprised that the support is not there.

If we adopt a broader view of digital libraries, as advocated by Levy and Marshall [12], would that lessen the significance of our conclusions? The answer is probably yes, but only in part. Re-examining them, in the context of a broader view, still leaves us with the problem that the broader view is not currently the typical implementation. This will change, but we would hope that the results of this study may contribute to pushing out those boundaries.

A final lesson to be learned from this study is the rather prosaic observation that the technical consultants use information sources to support their work. Their job is not to consult digital libraries or to gather information, but rather to selectively use resources that will enable them to effectively and efficiently solve problems. This commonsense point tends to be obscured when system developers concentrate on creating an information system, rather than on ensuring that the system created is useful and usable, or even investigating whether a system should be created at all!

6. ACKNOWLEDGMENTS
We gratefully acknowledge the patience of the technical support group that was studied, and thank them for their participation in this project.

7. REFERENCES
[1] Barreau, D., and Nardi, B.A. Finding and reminding: file organization from the desktop. SIGCHI Bulletin 27(3), July 1995, 39-43. http://www.cwi.nl/steven/sigchi/bulletin/1995.3/barreau.html
[2] Chatman, Elfreda A. The impoverished life-world of outsiders. Journal of the American Society for Information Science 47(3), 1996, 193-206.
[3] Crabtree, Andy, Twidale, Michael, O'Brien, Jon, and Nichols, David M. Talking in the library: implications for the design of digital libraries. In Proceedings of the Second ACM International Conference on Digital Libraries (Philadelphia, PA, 1997), 221-228.
[4] Crabtree, Andy, Nichols, David M., O'Brien, Jon, Rouncefield, Mark, and Twidale, Michael B. Ethnomethodologically-informed ethnography and information system design. Journal of the American Society for Information Science 51(7), 2000, 666-682.
[5] Cunningham, Sally Jo. Providing internet reference service for the New Zealand Digital Library: gaining insight into the user base for a digital library. In Proceedings of the 10th International Conference on New Information Technology (Hanoi, Vietnam, March 24-26 1998), 27-34.
[6] Cunningham, Sally Jo, and Mahoui, Malika. Search behaviour in two digital libraries: a comparative transaction log analysis. In Proceedings of the European Conference on Digital Libraries (Lisbon, Portugal, September 2000), Springer-Verlag, 418-423.
[7] Ellis, D. A behavioural approach to information retrieval design. Journal of Documentation, 46, 1989, 318-338.
[8] Kanai, Hideaki, and Hakozaki, Katsuya. A browsing system for a database using visualization of user preferences. In Proceedings of the IEEE International Conference on Information Visualization (London, 2000), IEEE.
[9] Knowles, Chris, and Cunningham, Sally Jo. Information behaviour of technical support workers: an ethnographic study. In Proceedings of OZCHI 2000 (Sydney, Australia, December 2000), IEEE Press, 275-280.
[10] Lamping, J., Rao, R., and Pirolli, P. A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the ACM Conference on Human Factors in Software (CHI '95), ACM, 1995.
[11] Lessick, Susan, Kjaer, Kathryn, and Clancy, Steve. Interactive Reference Service (IRS) at UC Irvine: expanding reference service beyond the reference desk. ACRL '97 National Conference, 1997. http://www.ala.org/acrl/paperhtm.a10.html
[12] Levy, D., and Marshall, C.C. Going digital: a look at assumptions underlying digital libraries. Communications of the ACM, 38, 1995, 77-84.
[13] Lieu, Y-H., Dantzig, P., Sachs, M., Corety, J.T., Hinnesbusch, M.T., Damashek, M., and Cohen, J. Visualizing document classification: a search aid for the digital library. Journal of the American Society for Information Science 51(3), 2000, 216-227.
[14] Mackinlay, J., Rao, R., and Card, S. An organic user interface for searching citation links. In Proceedings of the ACM Conference on Human Factors in Software (CHI '95), ACM, 1995, 67-73.
[15] Marshall, C.C. Annotation: from paper books to the digital library. In Proceedings of DL '97, ACM, 1997, 131-140.
[16] Meyer, Judy. Servicing reference users. 1997. http://www.ala.org/editions/cyberlib.net/4meyer01.html
[17] Orr, J.E. Talking about Machines: An Ethnography of a Modern Job. Ithaca, NY: ILR Press, 1996.
[18] Reeves, E.M. A study of usability aspects of a graphical user interface for discretionary users. Ph.D. thesis, University of Bristol, Bristol, UK.
[19] Solomon, P. Discovering information behaviours in sense making: time and timing. Journal of the American Society for Information Science 48(12), 1997, 1097-1108.
[20] Twidale, Michael B., and Nichols, David M. Using studies of collaborative activity in physical environments to inform the design of digital libraries. In Proceedings of the CSCW '98 Workshop on Collaborative and Co-operative Information Seeking in Digital Information Environments. Also available as Technical Report CSEG/11/1998, Computing Department, Lancaster University. http://www.comp.lancs.ac.uk/computing/research/cseg/98_rep.htm

Developing Recommendation Services for a Digital Library with Uncertain and Changing Data

Gary Geisler
Interaction Design Laboratory
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3360, USA
+1 919 489 2759
[email protected]

David McArthur, Sarah Giersch
Eduprise
2000 Perimeter Park Drive
Morrisville, NC 27560, USA
+1 919 376 3424
{dmcarthur, sgiersch}@eduprise.com

ABSTRACT
In developing recommendation services for a new digital library called iLumina (www.ilumina-project.org), we are faced with several challenges related to the nature of the data we have available. The availability and consistency of data associated with iLumina is likely to be highly variable. Any recommendation strategy we develop must be able to cope with this fact, while also being robust enough to adapt to additional types of data available over time as the digital library develops. In this paper we describe the challenges we are faced with in developing a system that can provide our users with good, consistent recommendations under changing and uncertain conditions.

Keywords
Digital Library, Recommender System, User Services.

1. INTRODUCTION
Digital libraries, and Web-based resource collections in general, have traditionally enabled their users to locate resources through search and browse services. Over the past decade there has been growing use of recommendation systems as a way to suggest new items of potential interest to people [3]. In designing a new digital library (DL) called iLumina, we are exploring the potential of recommendation services as a way for users to find resources of interest in a digital library. iLumina is a DL of undergraduate teaching materials for science, mathematics, engineering, and technology (SMET) education [1], now being developed by Eduprise, The University of North Carolina at Wilmington, Georgia State University, Grand Valley State, and Virginia Tech. Because the DL is new and non-commercial, the data available to use as input into a recommendation system will be uneven and will change over time. We are particularly interested, therefore, in how to create robust recommendation services for a DL that contains changing and uncertain data.

2. INFORMATION AVAILABLE FOR RECOMMENDATIONS
The iLumina digital library contains content across a wide range of science, math and engineering disciplines. Our user community includes instructors, students, and resource contributors; some, but not all, will register and provide profile information. From our server logs we expect to have usage data related to both resources and users. These relatively standard characteristics provide us with four classes of data potentially useful for recommendation services:

Resource characteristics: All resources will be described by basic, IMS-derived [4] metadata that identifies elements such as title, description, format, and requirements. This metadata will be rich and consistent.

Resource quality judgements: Subject experts will formally review many (but not all) of the library resources. All of the content will, however, be available to be reviewed informally by users; similarly, all materials will have passed minimal acceptability standards. But the written reviews, where available, can provide richer and more subjective information about the potential usefulness of the resource for specific situations.

Resource use: Data describing how often a given resource has been downloaded, reviewed, or used as a component in larger resources is available from server logs and the resource database. Data about resources can also be associated with individual users to develop patterns of usage by user characteristics.

User profile: Our database contains descriptive information about registered users, such as status, affiliation, areas of interest, explicit resource ratings, and service preferences. This information can be used to group users based on various similarity characteristics, track resource usage data for a given user, and adjust recommendations based on expressed preferences.

All of this data can be very useful for generating recommendations, but the different types vary in quality and likelihood of existence. For example, we do not require users to register, so profile information will vary. Reviews will range from expert-level instructor judgements to student comments. Some registered users will rate resources, others will not. Table 1 summarizes the quality and existence characteristics for each of the basic types of data we expect to use as input for generating recommendations.
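One way to make the uneven availability of this evidence concrete is to model it directly; the sketch below is illustrative only, and the field names are our invention rather than the iLumina schema. Optional fields mark the inputs whose existence is not guaranteed.

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative model of the four classes of data (not the iLumina schema).
    @dataclass
    class RecommendationEvidence:
        metadata: dict                    # resource characteristics: always present
        use_count: Optional[int] = None   # resource use: present once logs accrue
        review: Optional[str] = None      # quality judgement: only some resources
        profile: Optional[dict] = None    # user profile: only registered users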


Table 1. Characteristics of the types of data available for generating recommendations

Type of data              Quality    Existence
Resource characteristics  High       Always
Resource quality          Variable   Variable
Resource use              High       Variable
User profile              Variable   Variable

3. GENERATING RECOMMENDATIONS
Given the characteristics of the types of data we have available, several schemes for providing recommendations are possible. At one extreme, we could use only data that we know is high quality and always available, such as resource characteristics and resource use. One example of this strategy would be to recommend to users resources that are similar, on specific metadata fields, to ones they have previously downloaded. The obvious downside of this strategy is that other potentially rich sources of data are ignored.

At the other extreme, we can include all potentially useful sources of data in our recommendation scheme, and attempt to compensate for missing inputs and inputs of variable quality. The downside of this strategy is that it is hard to combine multiple information sources in a principled way, and the variable quality of sources threatens the usefulness of recommendations based on them.

Different rules could be used to, in effect, define various points along this continuum of recommendation strategies. These might include the following (the information types used are noted in parentheses):

- If the user suggests a resource she likes, we can suggest structurally similar resources (resource characteristics)
- If profile information for the user exists, we can use it to suggest resources (resource characteristics, user profile)
- If profile information exists and the user has previously downloaded resources, we can suggest resources based on the previously downloaded resources (resource characteristics, user profile, resource use)
- If the user suggests a resource she likes or profile information exists, we can suggest resources based on all available data (resource characteristics, user profile, resource use, resource quality judgments)
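A minimal sketch of how such a rule ladder might be wired together follows; the helper functions (similar_by_metadata, profile_matches, and so on) are hypothetical stand-ins for iLumina services, not its actual interfaces.

    # Hypothetical sketch of the rule ladder above; the helpers are
    # placeholders, not iLumina APIs.
    def recommend(user, liked_resource=None):
        recs = []
        if liked_resource is not None:
            recs += similar_by_metadata(liked_resource)   # always-available input
        if user.profile is not None:
            recs += profile_matches(user.profile)         # exists only if registered
            for past in user.downloads:                   # resource-use evidence
                recs += similar_by_metadata(past)
        if recs and reviews_exist():
            recs = rerank_by_quality_judgements(recs)     # richest, least reliable
        return dedupe(recs)

The design point of the ladder is graceful degradation: each rule only consumes inputs that are known to exist for this user, so the recommendation never fails outright when reviews, ratings, or profiles are missing.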

4. CHALLENGES
There is evidence that recommendations can be improved by combining methods, such as collaborative and content-based approaches [2]. In iLumina we will collect a range of data that supports such a multiple-source recommendation strategy. However, we can only be certain of the existence and reliability of the basic resource descriptive data; the availability of other data such as reviews, ratings, and user profiles is less predictable.

This uncertainty brings into question the value of a DL recommendation scheme that attempts to use all data sources and adapts when some of the data does not exist or is unreliable. In such cases, will the recommendations still be useful? More generally, do DL recommendations improve as the number of information sources they use increases? Is a strategy that uses all available information, some of which may not be available, more effective than one that uses the high-quality, always available resource characteristics? If so, given that gathering each type of data comes at some cost, is the difference in effectiveness worth the cost? Can we evaluate the cost/benefit ratio of using the different types of data?

Because the iLumina DL is new, we will be in a position to address these kinds of questions. We expect to phase in user services that provide us with relevant data over time; the data available to use in the second year of operation will be much broader than that available in the first year. As iLumina grows, then, we will not only be able to devise recommendation schemes that incorporate new types of information as they become available, but will also be able to employ user opinions to judge how the perceived value of different schemes improves (or not) as the data sources become richer.

5. CONCLUSION
Recommendations can be a valuable service for users of a DL such as iLumina, which will eventually contain many resources. The variety of data associated with both the resources and users of a DL represents a potentially rich source of input for recommendation services. As the iLumina DL evolves, we will be interested in developing both strategies that provide useful recommendations to users even if the data available is uneven and changes over time, and methods for evaluating those strategies. We expect that the results of our efforts will be informative to developers of many types of DLs.

6. REFERENCES
[1] McArthur, D., Giersch, S., Graves, B., Ward, C.R., Dillaman, R., Herman, R., Lugo, G., Reeves, J., Vetter, R., Knox, D., & Owen, S. (in press). Towards a sharable digital library of reusable teaching resources: roles for rich metadata. Journal of Educational Resources in Computing.
[2] Mooney, R.J., and Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the 5th ACM Conference on Digital Libraries (San Antonio, TX, June 2000).
[3] Recommender Systems [theme issue]. (1997). Communications of the ACM 40(3).
[4] The IMS Global Learning Consortium. http://www.imsproject.org

Evaluation of DEFINDER: A System to Mine Definitions from Consumer-Oriented Medical Text

Judith L. Klavans
Center for Research on Information Access
Columbia University
New York, NY, 10027
[email protected]

Smaranda Muresan
Department of Computer Science
Columbia University
New York, NY, 10027
[email protected]

ABSTRACT
In this paper we present DEFINDER, a rule-based system that mines consumer-oriented full text articles in order to extract definitions and the terms they define. This research is part of the Digital Library Project at Columbia University, entitled PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video and Language resources) [5]. One goal of the project is to present information to patients in language they can understand. A key component of this stage is to provide accurate and readable lay definitions for technical terms, which may be present in articles of intermediate complexity.

The focus of this short paper is on quantitative and qualitative evaluation of the DEFINDER system [3]. Our basis for comparison was definitions from the Unified Medical Language System (UMLS), the On-line Medical Dictionary (OMD) and the Glossary of Popular and Technical Medical Terms (GPTMT). Quantitative evaluations show that DEFINDER obtained 87% precision and 75% recall, and reveal the incompleteness of existing resources and the ability of DEFINDER to address gaps. Qualitative evaluation shows that the definitions extracted by our system are ranked higher in terms of the user-based criteria of usability and readability than definitions from on-line specialized dictionaries. Thus the output of DEFINDER can be used to enhance existing specialized dictionaries, and also as a key feature in summarizing technical articles for non-specialist users.

Keywords
Text data mining, medical digital libraries, natural language processing, automatic dictionary creation.

1. The Digital Library and Text Mining for Definitions: the DEFINDER System
The existence of massive digital libraries containing freeform documents has created an unprecedented opportunity to develop and apply effective and scalable text mining techniques for the automatic extraction of knowledge from unstructured text [1]. Text mining applications raise particularly challenging problems within digital libraries since they involve large collections of unstructured documents. Our approach is to combine shallow natural language processing techniques with deep grammatical analysis in order to efficiently mine text.

Automatic identification and extraction of terms from text has been widely studied in the computational linguistics literature [2], and many systems exist for this task using both symbolic and statistical techniques. The extraction of definitions and their associated terms has been less widely studied, although extraction of lexical knowledge has a rich literature [7].

Through an analysis of a set of consumer-oriented medical articles, we identified typical cue-phrases and structural indicators that introduce definitions and the defined terms. Our system, DEFINDER, is based on two main functional modules: 1) a shallow text processing module which performs pattern analyses using a finite state grammar, guided by cue-phrases ("is called", "is the term used to describe", "is defined as", etc.) and a limited set of text markers (such as parentheses), and 2) a grammar analysis module that uses a rich, dependency-oriented lexicalist grammar (English Slot Grammar [4]) for analyzing more complex linguistic phenomena (e.g. apposition, anaphora).
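To make the shallow first stage concrete, here is a rough, regex-only illustration of cue-phrase matching. It is far simpler than DEFINDER's finite-state grammar and Slot Grammar analysis, and the patterns are examples we chose for exposition (restricted to cues where the term precedes the definition), not the system's actual rule set.

    import re

    # Toy cue-phrase matcher; DEFINDER's real first stage is a finite-state
    # grammar, and full grammatical analysis handles apposition and anaphora.
    CUE = re.compile(
        r"(?P<term>[A-Z][\w-]*(?:\s[\w-]+){0,3})\s+"
        r"(?:is defined as|is the term used to describe)\s+"
        r"(?P<definition>[^.]+)\."
    )

    def find_definitions(text):
        return [(m.group("term"), m.group("definition"))
                for m in CUE.finditer(text)]

    print(find_definitions(
        "Anemia is defined as a shortage of red blood cells in the blood."))
    # -> [('Anemia', 'a shortage of red blood cells in the blood')]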
2. Evaluation: Users and Uses
In this brief paper, we present the results of three methods to evaluate the output of our system: 1) performance in terms of precision and recall, 2) quality of extracted definitions in terms of the user-based criteria of readability, usefulness and completeness, and 3) a method to evaluate the coverage of on-line specialized dictionaries. For the first two, we performed a user-centered evaluation using non-specialist subjects. For the latter we chose a set of defined terms extracted by our system and compared them against three on-line dictionaries. The results we have obtained were run over a limited set of articles in order to thoroughly test our methods before moving to a larger scale user-based evaluation of significantly more data. We present the results of three experiments to quantitatively and qualitatively measure DEFINDER output.

2.1 Definition Extraction Performance
The purpose of this experiment was to measure the performance of DEFINDER in terms of precision and recall against a human-determined "gold standard". Four subjects unrelated to the project were provided with a set of nine patient-oriented articles and were asked to annotate definitions and the terms they define.

We chose several genres (medical articles, newspapers, manual chapters, book chapters) from trusted resources. The resulting gold standard was determined by those definitions marked up by at least 3 out of the 4 subjects, and consisted of 53 definitions. DEFINDER identified 40 out of these 53 definitions, obtaining 86.95% precision and 75.47% recall.

2.2 User Judgements on Definition Quality
In this experiment we asked users to rank definitions to determine if they are readable, useful or complete. The motivation is that there is unlikely to exist a single definition suitable for both specialists and non-specialists. Indeed, specialized on-line dictionaries, while valuable resources, can be too technical for non-specialists. We evaluated the quality of DEFINDER output in comparison with two specialized on-line dictionaries (UMLS and OMD). Eight subjects not qualified in the medical domain participated in the experiment. They were provided with a list of 15 randomly chosen medical terms and their definitions from these three sources. The task was to assign to each definition a quality rating for three criteria: usefulness (U), readability (R) and completeness (C), on a scale of 1 to 7 (1 worst, 7 best). The source of each definition was not given, in order not to bias the experiment. Statistical significance tests were performed for subjects and terms using Kendall's coefficient of concordance, W [6], and the sign test [6].

[Figure 1: Average quality rating (AQR) for DEFINDER, UMLS, and OMD on the three criteria]

We first measured the average quality rating for each of the three sources on the three criteria. The results in Fig. 1 show that DEFINDER clearly outperforms the specialized dictionaries for usefulness and readability to a statistically significant degree, given by the sign test (p=0.0003). In terms of completeness both UMLS and OMD performed slightly better (p=0.04).

One question that arises in computing the AQR is whether the high scores given by one subject can compensate for the lower values given by other subjects, thus introducing noise. To validate our results, we performed a second analysis to evaluate the relative ranking of the three definitional sources. Using Kendall's coefficient of concordance, W, we first measured the interjudge reliability on each term, and for terms with significant agreement we computed the level of correlation between them. If W was significant, we compared the overall mean ranks of the three sources. We obtained statistically significant W values for usefulness and readability (W=0.54 and W=0.45 at p=0.01 and p=0.05 respectively), while for completeness the correlation was not statistically significant. Thus Figure 2 shows the results for usefulness (U) and readability (R), for which DEFINDER outranked both UMLS and OMD.

[Figure 2: Ranking of DEFINDER, UMLS, and OMD by usefulness and readability]

2.3 Coverage of On-line Dictionaries
DEFINDER identifies terms and definitions lacking from existing resources. To evaluate coverage, we chose a base test set of 93 terms and their associated definitions, extracted by our system from text. Three cases were found, as shown in Table 1: (1) the term is listed in one of the on-line dictionaries and is defined in that dictionary (defined); (2) the term is listed in one of the on-line dictionaries but does not have an associated definition (undefined); (3) the term is not listed in one of the on-line dictionaries (absent).

Table 1. Coverage of Existing Online Dictionaries
Term       UMLS       OMD        GPTMT
defined    60% (56)   76% (71)   21.5% (20)
undefined  24% (22)   -          -
absent     16% (15)   24% (22)   78.5% (73)

Table 1 shows that on-line medical dictionaries are incomplete compared to potential DEFINDER output. For example, column two shows that in OMD only 71 terms out of 93 are listed, thus leading to 76% completeness, while GPTMT, a glossary addressed to non-specialists, is far from complete: only 20 out of 93 terms were present. This proves the ability of DEFINDER to enhance on-line dictionaries with readable and useful definitions.
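The bookkeeping behind Table 1 is straightforward to express in code; the following sketch uses made-up terms and a toy dictionary rather than the study's 93-term test set.

    # Sketch of the Section 2.3 classification (defined / undefined / absent);
    # the terms and dictionary below are toy data, not the study's test set.
    def coverage(terms, dictionary):
        counts = {"defined": 0, "undefined": 0, "absent": 0}
        for term in terms:
            if term not in dictionary:
                counts["absent"] += 1
            elif dictionary[term]:              # listed with a definition
                counts["defined"] += 1
            else:                               # listed, definition missing
                counts["undefined"] += 1
        total = len(terms)
        return {k: f"{100 * v / total:.1f}% ({v})" for k, v in counts.items()}

    toy_omd = {"anemia": "a shortage of red blood cells", "stent": ""}
    print(coverage(["anemia", "stent", "angioplasty"], toy_omd))
    # -> {'defined': '33.3% (1)', 'undefined': '33.3% (1)', 'absent': '33.3% (1)'}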
3. References
[1] Hearst, M. Untangling text data mining. Proceedings of ACL '99, University of Maryland, June 20-26, 1999 (invited paper).
[2] Justeson, J., and Katz, S. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1), 1995, 9-27.
[3] Klavans, J.L., and Muresan, S. DEFINDER: rule-based methods for the extraction of medical terminology and their associated definitions from on-line text. Proceedings of AMIA 2000, 1906.
[4] McCord, M.C. The Slot Grammar system. IBM Report, 1991.
[5] McKeown, K.R., et al. PERSIVAL, a system for personalized search and summarization over multimedia healthcare information. Proceedings of JCDL 2001.
[6] Siegel, S., and Castellan, N.J. (1988). Non-parametric Statistics for the Behavioural Sciences (2nd edition). New York: McGraw Hill.
[7] Zweigenbaum, P., Bouaud, J., Bachimont, B., Charlet, J., Seroussi, B., and Boisvieux, J.F. From text to knowledge: a unifying document-oriented view of analyzed medical language. Proceedings of IMIA WG6, 1997.

Overview of The Virtual Data Center Project and Software

Micah Altman, L. Andreev, M. Diggory, G. King, E. Kolster, A. Sone, S. Verba
Harvard University
G-4 Littauer Center, North Yard
Cambridge, MA 02138
(617) 496-3847
[email protected]

Daniel L. Kiskis and M. Krot
University of Michigan
Media Union
2281 Bonisteel
Ann Arbor, MI 48109

ABSTRACT
In this paper, we present an overview of the Virtual Data Center (VDC) software, an open-source digital library system for the management and dissemination of distributed collections of quantitative data. The VDC functionality provides everything necessary to maintain and disseminate an individual collection of research studies, including facilities for the storage, archiving, cataloging, translation, and on-line analysis of a particular collection. Moreover, the system provides extensive support for distributed and federated collections, including: location-independent naming of objects, distributed authentication and access control, federated metadata harvesting, remote repository caching, and distributed 'virtual' collections of remote objects.

Categories and Subject Descriptors
H.3.7 [Information Systems]: Information Storage and Retrieval - Digital Libraries; DL System Architecture, Distributed Systems, Information Repositories, Federated Search, DL Impact

General Terms
Management, Design, Standardization

Keywords
Numeric Data, Open-Source, Warehousing.

1. INTRODUCTION
Researchers in social sciences, and in academia in general, increasingly rely upon large quantities of numeric data. The analysis of such data appears in professional journals, in scholarly books, and increasingly often in popular media. For the scholar, the connection between research articles and data is natural. We analyze data and publish results. We read the results of others' analyses, learn from them, and move forward with our own research. But these connections are sometimes difficult to make. Data supporting an article are often difficult to find and even more difficult to analyze. Archiving, disseminating and sharing data is crucial to research, but is often costly and difficult [Sieber 1991]. Thus, our ability to replicate the work of others and to build upon it is diminished. Researchers, university data centers, and students all face challenges when trying to find and use quantitative research data.

The Virtual Data Center (VDC) software is a comprehensive, open-source digital library system, designed to help curators and researchers face the challenges of sharing and disseminating research data in an increasingly distributed world. The VDC software provides a complete system for the management and dissemination of federated collections of quantitative data.
2. Features, Design, and Implementation
The VDC provides functionality for producers, curators and users of data. For producers, it offers naming, cataloging, storage, and dissemination of their data. For curators, it provides facilities to create virtual collections of data that bring together and organize datasets from multiple producers. For users, it enables on-line search, data conversion, and exploratory data analysis facilities. More specifically, the system provides five areas of functionality:

(1) Study preparation (unique naming, conversion tools for multiple data and documentation formats, tools for preparing catalog records for datasets);
(2) Study management (file-system independent dataset and documentation storage, archival formatting, cataloging);
(3) Interoperability (DC, MARC and DDI metadata import and export; OpenArchives and Z39.50 query protocol support);
(4) Dissemination (extract generation, format conversion, subset generation, and exploratory data analysis);
(5) Distributed and federated operation (location-independent naming, distributed virtual collections, federated metadata harvesting, repository exchange and caching, and federated authentication and authorization).

Each VDC library comprises a set of independent, interoperating components. These independent VDC libraries can also be federated together, sharing collections through harvesting and dynamic caching processes (see Figure 1).
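As one concrete illustration of the harvesting half of this picture, the sketch below speaks plain OAI-PMH, which the feature list above names among the supported protocols; the base URL is a placeholder, error handling is omitted, and this is our sketch of a generic harvesting loop rather than the VDC implementation.

    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"

    # Minimal OAI-PMH harvesting loop (illustrative; not VDC code).
    def harvest(base_url, prefix="oai_dc"):
        token = None
        while True:
            url = f"{base_url}?verb=ListRecords&" + (
                f"resumptionToken={token}" if token
                else f"metadataPrefix={prefix}")
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(f"{OAI}record"):
                yield record                     # cache or index locally
            tok = root.find(f".//{OAI}resumptionToken")
            token = tok.text if tok is not None and tok.text else None
            if token is None:
                return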

[Figure 1. Simplified representation of VDC components: a Web Browser reaches a local VDC's User Interface Server, which is backed by Name Resolver, Directory, Index Server, Repository, Collection Manager, Study Manager, Data Services, Authority, and Logging Service components; a Remote VDC, with remote UIs and other components, exchanges harvested metadata, proxied resolution, and shared authorization with the local VDC.]

Many of the core components are modeled after the design described in Arms [1997]: objects are stored in repositories, referenced through names that are resolved by name servers, and are described in index servers that support searching. We also incorporate some features of the NCSTRL [Davis 2000] architecture, and particularly Lagoze's [1998] idea of a collection as an object that groups queries against remote index servers.

We have extended this core design in several significant ways. First, to support completely distributed operation of the components, we have added to each VDC node a directory server, which allows the components to locate each other, and a centralized logging service, which aids the administrator in tracking usage of the system. Second, the idea of the collection has been expanded to provide independent encapsulation of the specification of virtual content and the 'view', or organization, of that content. Third, the metadata harvester, caching repository proxy, and distributed authentication and authorization components work together to support the 'federation' of independent VDC libraries: the result is that content from a group of libraries is made available to their collective users, while each library maintains complete control over how its collections are accessed and how its patrons are authenticated.

Our implementation strategy emphasizes 'open source' development, and integration of the system into a production environment.

3. CONCLUSIONS
By providing a portable software product that makes the process of data archiving and dissemination automatic and standardized, the Virtual Data Center will help researchers and data archives meet the challenges of sharing and using quantitative data. Consequently, we believe that quantitative research will become easier to replicate and extend.

4. ACKNOWLEDGMENTS
Supported by NSF Award #IIS-9874747.

5. REFERENCES
[1] Altman, M., et al., 2001, "The Virtual Data Center," working paper.

Digital Libraries and Data Scholarship

Bruce R. Barkstrom
Atmospheric Sciences
NASA Langley Research Center
Hampton, VA 23681-2199
1-757-864-5676
[email protected]

ABSTRACT
In addition to preserving and retrieving digital information, digital libraries need to allow data scholars to create post-publication references to objects within files and across collections of files. Such references can serve as new metadata in their own right and should also provide methods for efficiently extracting the subset of the original data that belongs to the object. This paper discusses some ideas about the requirements for such references within the context of long-term, active archival, where neither the data format nor the institutional basis can be guaranteed to remain constant.

Keywords
Data scholarship, structural data references, object data references, digital libraries, EOSDIS.

1. ON DATA SCHOLARSHIP
In the humanities, there are many mechanisms for identifying and cross-referencing document contents: "chapter and verse", tables of contents, indexes, concordances, glossaries, commentaries. While the table of contents and index now reside in the original document, most of the other scholarly works appear after publication and do not perturb the original document. To be effective, the references need to provide precise pointers to the start and end of the original document segments they point to. Over the years, the humanities have developed important conventions to anchor the references, including structural features such as numbered document sections and non-structural references such as page numbers. We are only beginning to recognize the need for such apparatus in many areas of digital librarianship. In this brief paper, we explore the implications of this need within the context of scientific data collections for the Earth sciences.
on the same satellite.Satellite data producers work with their data The emphasis in current Earth science data collections, such as "wholesale." They create large files that have "smooth edges" all NASA's Earth Observing System Data and Information System of the files in a data product contain one hour of data or one day of (EOSDIS), has been on preserving collections of files and providing data. In situ data, on the other hand, come from specialized catalogs with some simple metadata to help users find files that platforms near the Earth's surface, such as Earth-fixed collection interest them. These systems treat files in ways that are similar to sites, ocean buoys, ships, or aircraft. The files containing these data journal articles in other kinds of libraries [3, 4]. However, they have tend to be smaller and more diverse.Practical collections may be not dealt with the issue of recording references to and properties of heterogeneous: data collected for a field experiment may contain a interesting objects within the files, or of allowing data scholars to few, highly regular satellite files and a larger number of smaller, extract specific subsets of data that belong to these objects [I]. Thus, heterogeneous files. we have had little serious study of the problems of identifying, In conventional Earth science thinking, the data in these files come from continuous fields. Recently, some Earth scientists have begun This paper is authored by an employee(s) of the United States thinking in terms of data as collections of objects. In this new mode Government and is in the public domain. of thought, Earth scientists do not want data averaged over Earth- JCDL '01, June 24-28, 2001, Roanoke, Virginia, USA. fixed grids.Rather, they want averages for irregularly shaped ACM 1-58113-345-6/01/0006. objects or that recognize the internal structure of object hierarchies. In the conventional approach, scientists might resample the spatial

In the conventional approach, scientists might resample the spatial structures of satellite or in situ data onto grids aligned with latitude and longitude. Of course, this can be quite expensive: satellite orbital inclinations do not align with the North-South boundaries of most latitude-longitude grids, and resampling can involve numerically intensive spatial filters. In the object-oriented approach, rectangular grids become less important. Rather, references to "the original measurements that belong to an object" must allow for its irregular shape. This means that extracting "just the pixels that belong to the object" from the file may involve more than simply extracting a hyperslab from an array.

What makes this problem rather interesting is that normal rules of good library behavior exclude "writing in the original manuscript." This suggests that the indexes to objects need to be treated as external metadata. They remain outside the file that contains the original data, so that it remains unaltered.

A natural solution is to use a "pointer structure" to index the measurements within the file's data that belong to a particular object. Under the assumption that the data are contained in multi-dimensional arrays, an index needs to know the mapping between the array indexes and the sequential position of the measurements within the array. With this mapping, an object reference becomes a list of one-dimensional array segments, where each segment identifies a starting position in the array and an ending position. If the data in the file are images, it may also be possible to use object descriptions of the type used by geographic information systems, e.g. [2].
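A small sketch of such a pointer structure, under the stated assumption that the measurements live in a multi-dimensional array; the object and segment values are invented for illustration.

    import numpy as np

    # Sketch of an object reference as (start, end) segments into the
    # flattened measurement array, so irregularly shaped objects need not
    # be hyperslabs. Note the segments are only valid for one storage
    # order: a copy written in Fortran (column-major) order would permute
    # the positions, which is exactly the variant problem discussed below.
    def extract(granule, segments):
        flat = granule.reshape(-1)              # sequential measurement order
        return np.concatenate([flat[s:e] for s, e in segments])

    image = np.arange(20).reshape(4, 5)         # stand-in for one data granule
    hurricane = [(2, 4), (7, 9), (12, 14)]      # runs of pixels in the object
    print(extract(image, hurricane))            # -> [ 2  3  7  8 12 13]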
This approach suggests that digital libraries need to accommodate "object commentaries" that contain indexing structures with a list of objects and their properties. For each object, the indexing structure needs to include a list of files containing the object. For each file, the index would contain a collection of appropriate object properties and a list of array segments that contain the object. Provided the "list of files" uses a stable referencing scheme, this approach appears to be a satisfactory start for building external references that do not need to "write in the original manuscript."

When there are multiple, geographically dispersed copies of the original files, and when the repository sites have independent hardware and software refresh policies, any object index will need additional functions that can identify structural variants. For example, if one copy of the data were written by a FORTRAN program and another copy were written by a C version, the two arrays may contain the same data, but the array indexes would be permuted. Thus, the pointers in an object index would need to identify which variant created a particular file. The pointer software would also need to recompute the array segments to correspond to the appropriate format variant. Alternatively, digital librarians need to create strong array ordering conventions.

3. COLLECTION INDEXES

The second problem for references arises from identifying collections of files that "belong together," particularly if the collection allows data scholars to obtain a homogeneous view of a particular class of objects. The identification mechanism not only needs to identify the files that belong together, it also needs to distribute this identification to other locations that contain copies of the collection. In this way, scholars who find a collection useful for working with a particular class of objects can make those objects more widely available. Such notification also allows a data set's users to identify who else has worked with a data set and to receive notice when the files move to new media or new formats.

Data producers will not have collection indexes available for their data when they create a data set. As time passes, scientists outside the original team will identify interesting objects within the collection. They are likely to identify the same objects in other collections. For example, a scientist interested in hurricanes might notice that MODIS contains data useful for identifying cloud properties (such as cloud top altitude), that CERES contains data useful for examining the hurricane's radiation budget, and that AMSU-E contains data useful for understanding the distribution of liquid water within the storm. These data are stored in different locations and may well have different array structures to record in the object indexes. Collection indexes thus need flexibility to add indexes for other collections, and are prime candidates for agent automation.

Creating collection indexes that contain object references raises some interesting long-term issues. If we knew for sure that the files would never be reformatted and that repositories would remain constant, we could settle for simply keeping track of the "Universal Name Reference" of the files in the collection. However, the problem becomes more difficult when data repositories find it necessary to reformat files because the original formatting software has a new version or because the file format's creators have disappeared. In this situation, a digital library needs to know which other libraries contain file collections that might fulfill searches for objects within the files. Clearly, this need places a requirement on the digital library to add external metadata that will let the file's users know that reformatting has taken place and where alternative files may reside.

Object indexes and collection indexes that contain references to objects in files are a critical component of data scholarship. At the same time, our inability to guarantee absolute stability in information technology raises interesting issues in implementing such indexes.

4. ACKNOWLEDGMENTS

The author is pleased to acknowledge the support of NASA's CERES project for this work.

5. REFERENCES

[1] Barkstrom, B. R. Digital Archive Issues from the Perspective of an Earth Science Data Producer. http://ssdoo.gsfc.nasa.gov/nost/isoas/us12/barkstrom/Archival%20Issues.html
[2] http://www.gisportal.com provides numerous sources of information.
[3] Lamport, L. The Implementation of Reliable Distributed Multiprocess Systems. Computer Networks, 2, 1978.
[4] Rosenthal, D. S. H., and Reich, V. Permanent Web Publishing. http://lockss.stanford.edu/freenix2000/freenix2000.html

SDLIP + STARTS = SDARTS
A Protocol and Toolkit for Metasearching

Noah Green Panagiotis G. Ipeirotis Luis Gravano [email protected] [email protected] [email protected]

Computer Science Dept. Columbia University

ABSTRACT

In this paper we describe how we combined SDLIP and STARTS, two complementary protocols for searching over distributed document collections. The resulting protocol, which we call SDARTS, is simple yet expressive enough to enable building sophisticated metasearch engines. SDARTS can be viewed as an instantiation of SDLIP with metasearch-specific elements from STARTS. We also report on our experience building three SDARTS-compliant wrappers: for locally available plain-text document collections, for locally available XML document collections, and for external web-accessible collections. These wrappers were developed to be easily customizable for new collections. Our work was developed as part of Columbia University's Digital Libraries Initiative Phase 2 (DLI2) project, which involves the departments of Computer Science, Medical Informatics, and Electrical Engineering, the Columbia University libraries, and a large number of industrial partners. The main goal of the project is to provide personalized access to a distributed patient-care digital library.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering, Query Formulation, Search Process, Selection Process; H.3.5 [Information Storage and Retrieval]: Online Information Services - Data Sharing, Web-based Services; H.3.7 [Information Storage and Retrieval]: Digital Libraries; H.2.4 [Database Management]: Systems - Textual Databases, Distributed Databases; H.2.5 [Database Management]: Systems - Heterogeneous Databases

1. INTRODUCTION

The information available in electronic form continues to grow at an exponential rate, and this trend is expected to continue. Although traditional search engines like AltaVista can solve common information needs, they ignore the often-valuable information that is "hidden" behind search interfaces, the so-called "hidden web."

One way to access the information available in the hidden web is through metasearchers [12, 18, 24], which provide users with a unified searchable interface to query multiple databases simultaneously. A metasearcher performs three main tasks. After receiving a query, it determines the best databases to evaluate the query (database selection), it translates the query into a suitable form for each database (query translation), and finally it retrieves and merges the results from the different sources (result merging) and returns them to the user through a uniform interface.

Metasearching is a central component of the Digital Libraries Initiative Phase 2 (DLI2) project at Columbia University, which involves the departments of Computer Science, Medical Informatics, and Electrical Engineering, the Columbia University libraries, and a large number of industrial partners (e.g., IBM, GE, Lucent). The project is named PERSIVAL (PErsonalized Retrieval and Summarization of Image, Video, And Language resources) and its main goal is to provide personalized access to a distributed patient-care digital library. The information needs vary widely among the users of the system. We have to provide access to all kinds of collections, ranging from Internet sites with consumer health information to the Columbia Presbyterian Hospital information system, which stores patient-record information and other relevant resources. Metasearching is further complicated by the different access methods used by each source, which range from public CGI-based interfaces to proprietary access methods used inside the Columbia Presbyterian Hospital.

Key features of the project are the abilities to access all these distributed resources regardless of whether they are available locally or over the Internet, to fuse repetitive and conflicting information from multiple relevant sources, and to present the retrieved information concisely. Exploiting such a variety of information sources is a challenging task, which could benefit from information sources supporting a common interface for searching and metadata extraction.

Rather than defining yet another new protocol and interface that distributed sources should support, we decided to exploit existing efforts for our project. More specifically, we built on two complementary protocols for distributed search, namely SDLIP [20] and STARTS [11], and combined them to define SDARTS (pronounced "ess-darts"), which is the focus of this paper.

SDLIP (Simple Digital Library Interoperability Protocol) is a protocol jointly developed by Stanford University, the University of California at Berkeley and at Santa Barbara, the San Diego Supercomputer Center, the California Digital Library, and others. The SDLIP protocol defines a layered, uniform interface to query and retrieve the results from each searchable collection through a common interface. SDLIP also supports an interface to access source metadata.

SDLIP is optimized for clients that know which source they wish to access. In contrast, the focus of STARTS (STAnford protocol proposal for Internet ReTrieval and Search) is on metasearching. A crucial component of STARTS is the definition of the specific metadata that a source should export to describe its contents. This metadata includes the vocabulary (i.e., list of words) in the source, plus vocabulary statistics (e.g., the number of documents containing each word). These summaries of the source contents are useful for the metasearchers' database selection task.

SDLIP and STARTS complement each other naturally. SDLIP has a flexible design that allows it to host different query languages and metadata specifications. The major parameter and return types of its methods are passed as XML, and the DTDs for this XML allow for extensions and instantiations of the protocol. Thus, SDLIP can easily host the main components of the STARTS protocol. The result of this combination is SDARTS, which can be regarded as an instantiation of the SDLIP protocol with a specific query language and, more importantly, with the richer metadata interface from STARTS, which is useful for metasearching.

To simplify making document collections compliant with SDARTS, we have developed a software toolkit that is easily configurable. This toolkit includes software to index locally available collections of both plain-text and XML documents. Also, to be able to wrap external collections over which we do not have any control and which do not support SDARTS natively, the toolkit includes a reference wrapper implementation that can be augmented for new external collections with relatively small changes.

SDARTS and its associated software toolkit, which are the focus of this paper, provide the necessary infrastructure for metasearching and for incorporating collections into our digital libraries project with minimal effort. In Section 2 we describe in detail the metasearching tasks and the challenges involved. In Sections 3 and 4 we describe STARTS and SDLIP respectively, the two protocols on which we have based SDARTS. The resulting protocol is described in Section 5, and the toolkit and reference implementations are presented in Section 6. Finally, Section 7 provides further discussion of our overall experience.

2. METASEARCHING

As we briefly mentioned in the Introduction, metasearching mainly consists of three tasks: database selection, query translation, and result merging.

- Database Selection: A metasearcher might have hundreds or thousands of sources available for querying. An alternative to broadcasting queries to every source every time is to send queries only to "promising" sources. This alternative results not only in better efficiency but in better effectiveness as well. Selecting the best sources for a given query requires that the metasearcher have some information about the contents of the sources. Some database selection techniques rely on human-generated descriptions of the sources. More robust approaches rely on simple metadata about the contents of the sources, like their vocabulary (i.e., list of words) together with simple associated statistics. Research in this area includes [10, 13, 21, 12, 18, 24].

- Query Translation: Metasearchers send queries to multiple different sources, which requires translating the queries to the local query language and capabilities supported at each source. Query translation is facilitated if sources export metadata on their query capabilities, on whether they support word stemming, on the attributes or fields available for searching (e.g., author and title), etc. Research in this area includes [6, 7, 8].

- Result Merging: Sources typically use undisclosed document ranking algorithms to answer user queries, which makes the combination of query results coming from multiple sources challenging. Furthermore, even two collections that use the same ranking algorithm might rate a common document quite differently depending on the other documents present in each collection. Result merging is facilitated if sources export metadata on their contents and query results. Research in this area includes [22, 5].
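The decomposition just listed can be made concrete in code. The sketch below casts the three tasks as Java interfaces; these are hypothetical types for exposition only, and are not part of SDLIP, STARTS, or SDARTS.

    import java.util.List;
    import java.util.Map;

    interface DatabaseSelector {
        // Rank the available sources by how promising they look for a query.
        List<String> selectSources(String query, int maxSources);
    }

    interface QueryTranslator {
        // Rewrite a query into the native syntax of a particular source.
        String translate(String query, String sourceId);
    }

    interface ResultMerger {
        // Combine per-source ranked result lists into one ranked list.
        List<String> merge(Map<String, List<String>> resultsBySource);
    }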
A metasearcher needs information about the underlying collections to perform the tasks above successfully. Consequently, there is a need for a layer on top of the information sources that will mask source heterogeneity and export the right metadata to metasearchers. One protocol that was designed to solve exactly this problem is STARTS, which we briefly review next.

3. STARTS: A PROTOCOL PROPOSAL TO FACILITATE METASEARCHING

STARTS [11] is a protocol proposal that defines the information that a source should export to facilitate metasearching. By standardizing on a common way of interacting with clients and by defining what information a document source should export, metasearching becomes a much easier task. Of course, the exported information alone is not a panacea and does not solve the metasearching problems, but at least it can make these problems more tractable. Specifically, STARTS defines the information that should be included in the queries to the sources, the format of the query results, and the metadata a source should export about its contents and capabilities [11].

For historical reasons, the original STARTS specification used Harvest's SOIF format [3] to encode queries, results, and metadata. Also, STARTS did not define explicitly how the information is transported. For our project, we defined an encoding of all the STARTS information in XML. All STARTS queries, result sets, and metadata objects are then represented as XML documents. This means that all STARTS elements can be easily manipulated by available XML technologies, such as XSL, SAX, etc. Additionally, the transformation from one XML dialect to another can be easily achieved using off-the-shelf tools, so it is easy to support other query languages by just adding a thin layer that translates the query format of another query language to STARTS. Another strong reason to transform STARTS to XML format is that SDLIP, the protocol that we review next, uses XML to encode information, so this transformation makes combining the two protocols easier. We refer to the "XMLized" STARTS version as STARTS XML.


Figure 1: A small fraction of a STARTS metadata object for a document collection. (The record shown reports that the term "algorithm" appears 75 times across 34 documents.)

Figure 1 shows part of the metadata object that summarizes the contents of the "20 Newsgroups" document collection [2], which is frequently used in the machine learning community. This collection consists of 19,997 articles posted to 20 newsgroups. 34 of these articles contain the term "algorithm" in their body, as the metadata record in the figure indicates. Also, the record shows that this term appears 75 times over these 34 articles. (Although XML is quite verbose for describing these statistics, these XML objects can be compressed effectively [16], and this compression can take place before a potential transmission to a client.)

A metasearcher can use these content summaries to select the sources to which to send a given query (i.e., for database selection). More specifically, a metasearcher can estimate the number of matches that a given query will produce at each source. As a simple example, given a query on "algorithm," a metasearcher might conclude that it does not need to contact a source with very few or no documents containing that word, as reported in the corresponding STARTS content summary for the source. (See [12, 18, 24] for research on how to exploit this type of information for database selection.)
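To illustrate this selection step, the following Java sketch ranks sources by the document frequency their content summaries report for a single-term query. The ContentSummary type is hypothetical and far simpler than a real STARTS summary; the sketch only captures the skip-empty-sources idea described above.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    record ContentSummary(String sourceId, Map<String, Integer> docFrequency) { }

    class SimpleSelector {
        // Keep only sources whose summaries report at least one matching
        // document, then rank them by that reported document frequency.
        static List<String> rank(List<ContentSummary> summaries, String term) {
            return summaries.stream()
                    .filter(s -> s.docFrequency().getOrDefault(term, 0) > 0)
                    .sorted(Comparator.comparingInt(
                            (ContentSummary s) -> s.docFrequency().get(term)).reversed())
                    .map(ContentSummary::sourceId)
                    .toList();
        }
    }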
4. SDLIP: SIMPLE DIGITAL LIBRARY INTEROPERABILITY PROTOCOL

SDLIP [20] is a layered protocol that defines simple interfaces for interoperability between information sources. Its designers define it as "search middleware," lighter and easier to use for web-related applications than standard middleware protocols like Z39.50 [1].

The main purpose of SDLIP is to provide uniform interfaces to information sources for querying and retrieving results, taking care of the transport of the data across the network. SDLIP defines the interfaces that a source or a wrapper of a source should implement so that it can be accessed by an SDLIP-enabled client. An implementation of the SDLIP interfaces is called a Library Service Proxy (LSP), and functions as a lower-level wrapper around one or more underlying collections. Figure 2, borrowed from [20], depicts the role of SDLIP in a digital library architecture. LSPs mask access differences among the information sources in the digital library, and present these sources to clients through a uniform interface. Additionally, SDLIP places one more layer above the LSP, namely network transport, via HTTP/DASL or CORBA, to define exactly how the servers and the clients should communicate with each other.

Figure 2: The role of SDLIP in a digital library architecture with autonomous sources and wrappers (taken from [20]).

The SDLIP protocol supports both a stateful and a stateless mode of operation. Just as STARTS is a stateless protocol, we decided to use only the stateless version of SDLIP in our project. Often, web search over distributed collections does not justify the added complexity of supporting stateful protocols.

In a typical deployment of SDLIP, the LSP is a wrapper that knows how to interact with the underlying collections and exports a uniform SDLIP interface. The LSP interface is divided into three parts: the search interface, the results interface, and the metadata interface.

The search interface defines the operations for submitting a search request to an LSP. The result interface allows clients to access the result of a search. Finally, the metadata interface allows clients to question a library service proxy about its capabilities and contents. For each interface, SDLIP supports, by design, a limited set of operations. This design decision is based on the observation that the SDLIP interfaces can be enhanced as needed through inheritance.

While inheritance is a useful tool for extending the SDLIP interfaces, there is another mechanism for extending the protocol's capabilities. The methods of the three interfaces described above are designed to pass XML documents as their major parameter and return types. The conditions for an SDLIP search, for example, would be specified not by primitive or object-based parameters, but rather as an XML "property sheet" of search parameters. SDLIP's grammar for this XML is extensible, making it easy to host additional metasearching-related features within SDLIP simply as some dialect of XML.

As mentioned above, SDLIP is optimized for clients that know which source they wish to access. In contrast, STARTS specifies, among other things, the metadata that sources should export for metasearching. This makes merging these two complementary protocols desirable. More specifically, we describe next how we instantiate the SDLIP interfaces with STARTS XML to obtain an expressive protocol for effective metasearching.

5. THE SDARTS PROTOCOL

In this section, we describe how we combined SDLIP and STARTS XML into the SDARTS protocol. Section 5.1 outlines our rationale, while Section 5.2 gives an overview of SDARTS.

5.1 SDARTS Design Rationale

SDARTS combines SDLIP and STARTS XML into an expressive protocol for distributed search. We now discuss our choice of these two existing protocols as the basis for SDARTS.

Our decision to adopt SDLIP was influenced by the fact that SDLIP is already in use by other DLI2 projects around the United States, and the ability to interoperate with resources made available by other digital library efforts is of course attractive. Additionally, the fact that there are already implementations that allow SDLIP clients to access the contents of sources supporting alternative protocols like Z39.50 [1] (but not vice versa) was another important factor in our choice of SDLIP as our "middleware" architecture. (Other emerging related efforts, notably the "Open Archives Initiative" [19], were still under development and not in stable form when we developed SDARTS.) Finally, we believe that agreement on common interoperability protocols is a major factor in the success of efforts like the DLI2 projects.

At the same time, STARTS specifically describes the information that should be exchanged between sources and a metasearcher. STARTS includes support to plug in other attribute sets for searching, like the Dublin Core [23] and Z39.50's Bib-1 [1] attribute sets. In fact, STARTS built on these efforts and already supports some of their most useful features. Additionally, STARTS specifies the metadata that sources should export for effective metasearching. Other protocols that define related metadata are the GILS Z39.50 profile [9] and Z39.50's Exp-1 [1] attribute sets. Again, STARTS built on these protocols. In particular, STARTS defines that sources should export vocabulary frequency information (Section 3). SDLIP does not specify this kind of metadata, which is useful for metasearching. However, SDLIP is designed in such a way that it can easily incorporate additional capabilities, which can be exploited by clients that are aware of them. Thus, it is natural to enrich SDLIP with the STARTS components.

We standardized on STARTS XML as the XML "dialect" for SDLIP to exchange the extra information not included in the original version of SDLIP. SDARTS follows all of SDLIP's original DTDs, which include placeholders that can be exploited to include the necessary STARTS XML objects (e.g., the getPropertyInfo() method in SDLIP's metadata interface returns an element that can be used for this purpose). Since the vocabulary statistics for a source might be large, and given that a client other than a metasearcher might not need them, SDARTS returns only the URL where the content summary resides as part of the metadata for a source. Then, a metasearcher can download the summary outside of SDARTS with the given URL.
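The two-step access pattern just described is easy to picture in code. The sketch below assumes only that a source's metadata has already yielded a content-summary URL; it uses the standard java.net.http client, and none of it reproduces the actual SDARTS API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    class SummaryFetcher {
        // Given the content-summary URL found in a source's metadata, fetch
        // the (potentially large) summary in a separate request.
        static String fetchContentSummary(String summaryUrl) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(summaryUrl)).build();
            // The body is itself a STARTS XML document; parsing is omitted here.
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }
    }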
By standardizing on one XML format for SDLIP, we have created an architecture where an LSP is divided into two pieces: a standardized front end that never needs to change, because it only has to deal with one dialect of XML, and an abstract back end, which is implemented for specific underlying collection types. Thus, a programmer would need to write only this back-end implementation. Furthermore, this standardization and predictability of the back-end LSP make it easier for us to create configurable reference implementations of the wrappers for frequently occurring collection types, at the expense, of course, of reduced flexibility.

Another factor that influenced our decisions was our intention to design a software development kit (SDK) for developers to create SDARTS-compliant wrappers. The main design goal was ease of implementation. To be sure, we wanted any programmer to be able to implement our standard wrapper interface from scratch and create custom wrappers fine-tuned to their underlying collections. But in addition, we wanted to create reference implementations for some common collection types. Moreover, we wanted these implementations to be adjustable without any additional writing of code whenever possible. Thus, SDARTS includes a toolkit with reference implementations of wrappers for frequent types of collections. These wrappers are all configurable with text files. These files tell our wrappers things like how to index the documents in a locally available collection, how to transform queries into forms that external collections can understand, and how to translate results passed back from collections. Once again, we chose XML as the format for these files. Later, in Section 6, we focus on the various reference implementations and the pros and cons of our configuration-driven approach.

5.2 Overall SDARTS Architecture

Figure 3 shows the final SDARTS architecture, as it would be used over the Internet to make one wrapped collection available. The client layer is on the left side of the diagram, and the wrapped collection is on the right side.

Figure 3: The use of the architectural components of SDARTS to query an SDLIP-compatible database over the Internet. (The "S" and "M" stand for SDLIP's "Search" and "Source Metadata" interfaces, respectively.)

In the diagram, we see two possible clients: a standard SDLIP client, since SDARTS uses the SDLIP interfaces and transport layer, and the SDARTSBean, a client component that we developed that simplifies access to SDARTS and enhances SDLIP with the metadata information that can be used by metasearchers.
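As a hedged usage illustration, the interface below suggests the style of access such a client component affords; the actual SDARTSBean API is not reproduced here, so the names and signatures are assumptions.

    // Hypothetical client-side view of an SDARTSBean-like component.
    interface SdartsBeanLike {
        // Issue a query against a wrapped collection, receiving results with
        // the STARTS XML already parsed on the caller's behalf.
        java.util.List<String> search(String collectionUrl, String query) throws Exception;

        // Retrieve a collection's exported metadata (including the URL of
        // its content summary) without the caller handling XML directly.
        String getSourceMetadata(String collectionUrl) throws Exception;
    }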

For simplicity, we chose to use the HTTP/DASL protocol as the transport layer and not to implement, at this point, the support for CORBA. In the diagram we show both prospective clients using the HTTP/DASL protocol for the communication. All messages passing between the clients and the top level of SDARTS are formatted in STARTS XML. As we discussed earlier, the benefits of this are:

- XML is an easily parsed and transformed language standard, and is ideal for client use; and
- Lower layers of SDARTS need only be implemented once, as the incoming XML protocol is already known.

The top layer of the server side is a Java component called the FrontEndLSP. This component implements the major SDLIP interfaces, and is for all intents and purposes an SDLIP LSP. The FrontEndLSP parses the incoming STARTS XML requests using the Simple API for XML (SAX), and generates various search objects that implement the common LSPObject interface.

These LSPObjects are just data containers; basically, they are objectified STARTS XML documents, and are passed to the lower layer. They represent the requests and responses that the system fulfills and produces. By standardizing on this object representation of the various SDARTS operations, we simplify the task for the prospective wrapper implementors. They will not need to parse any incoming XML or generate any outgoing XML; instead, they can just examine and create LSPObjects, whose simple interfaces make it easy to understand queries and generate results. Thus, the wrapper programmer needs only to implement the abstract BackEndLSP interface, which accepts and returns LSPObjects. This implementation reads LSPObjects that come from the FrontEndLSP, and generates other LSPObjects that it returns. Each LSPObject already knows how to represent itself as STARTS XML, so it is easy for the front end to transform them into responses to a SDARTS client. In short, back-end developers do not really need to know anything about STARTS, SDLIP, or even XML in general. All they need to understand is what the back-end interface supports, what the LSPObjects are and how to read them, and, of course, how to handle the collection they are wrapping.
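The division of labor between the two layers can be sketched as follows; these are simplified, hypothetical stand-ins, and the toolkit's real LSPObject and BackEndLSP types may differ in detail.

    interface LSPObject {
        // Every request or response object can render itself as STARTS XML,
        // so the front end needs no help serializing responses.
        String toStartsXml();
    }

    interface BackEndLSP {
        // Receive an objectified STARTS query from the FrontEndLSP and return
        // an objectified result; no XML parsing or generation happens here.
        LSPObject search(LSPObject query) throws Exception;

        // Report the wrapped source's metadata (fields, capabilities) as objects.
        LSPObject getSourceMetadata() throws Exception;
    }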
An alternative design choice for SDARTS could have used the Document Object Model (DOM), which provides an existing framework and implementation for converting XML documents into Java objects. While DOM does simplify the task of objectifying XML documents, it does so at the expense of specificity and performance. DOM objects have very generic interfaces. For example, if the BackEndLSP accepted DOM objects instead of LSPObjects, then wrapper programmers would have to remember what to expect when calling each of the generic getter methods on the DOM objects. They might have to make redundant calls in order to ascertain what data was actually available, and would certainly have to memorize what data types to expect. There would be much casting overhead during each interaction. In short, they would have to perform significant extra work to extract the necessary query data. As mentioned above, our overriding design goal was that it be relatively simple to implement the wrappers. As such, we opted to specify our own object representation of the XML documents, LSPObjects, with well-known method names and well-understood types.

The decisions made during the development of SDARTS resulted in a protocol that made the implementation of wrappers a relatively easy task, even for a moderately experienced programmer. However, our goal was to facilitate even further the development of wrappers for existing collections. For this reason we have created reference implementations of wrappers for common collection types that can be easily configured, as described next.

6. CONFIGURABLE REFERENCE IMPLEMENTATIONS

In this section, we describe the reference wrapper implementations that we developed for common collection types. To wrap a collection, we can use the closest of these implementations and adapt it by defining simple XML-based files. Currently, we have three reference wrapper implementations:

- A wrapper for unindexed text documents residing in a local file system (Section 6.1);
- A wrapper for unindexed XML documents residing in a local file system (Section 6.2); and
- A wrapper for an external indexed collection fronted by a form-based WWW/CGI interface (Section 6.3).

6.1 TextBackEndLSP: Unindexed Text Document Collections

The first wrapper we created is for locally available collections of documents with no index over them. For this, we leveraged an open-source Java search engine known as Lucene [17] to index and search such collections. In its internal architecture, our wrapper hides the Lucene engine behind a more abstract interface, so in practice other search engines could be used with this wrapper.




Figure 4 shows the basic structure of the wrapper, known as TextBackEndLSP.

Figure 4: Structure of the TextBackEndLSP wrapper.

The back-end administrator needs only write one file, doc_config.xml. This tells the wrapper where the documents are, what fields should be indexed, and where to find the fields in the documents. TextBackEndLSP uses the file to index the documents offline, so that when the SDARTS server is available, the collection is fully searchable. During indexing, the wrapper also generates two metadata files, meta_attributes.xml and content_summary.xml; these contain the standard STARTS metadata, and can be returned in response to requests from the front end. The file doc_config.xml is itself in a very simple XML format (see Figure 5). An important point is the use of regular expressions to tell the wrapper how to extract the values associated with each searchable field, like author or title. In our sample file, the author field of a document is found on a line that starts with the string "From:".

Figure 5: Portion of the contents of doc_config.xml, showing a doc-config element with a re-index attribute, the collection's local path and URL (http://localhost/20groups), a stop-word list, and per-field regular expressions.
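To illustrate the regular-expression mechanism, the Java sketch below pulls an author value out of such a line. The pattern is our own example, not necessarily the exact expression used in the sample configuration file.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class FieldExtractor {
        // Example pattern only: capture everything after a leading "From:".
        private static final Pattern AUTHOR = Pattern.compile("^From:\\s*(.+)$");

        // Returns the author value for a "From:" line, or null otherwise.
        static String extractAuthor(String line) {
            Matcher m = AUTHOR.matcher(line);
            return m.matches() ? m.group(1) : null;
        }
    }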

This format is simple enough for a non-programmer to edit by hand, and a GUI administration tool can easily generate it. It has some limitations: it assumes that all documents in the collection are in the same format. In addition, field data must be contiguous within a document. This is especially a problem with XML documents, so we created a separate wrapper, described below, to deal with them.

6.2 XMLBackEndLSP: Unindexed XML Document Collections

This wrapper for documents formatted in XML is quite similar to its plain-text counterpart; the key difference is that to index collections of XML documents a slightly more complicated configuration file is needed.
Here, the configuration file doc_config.xml only tells the wrapper where to find the documents. The administrator must then write a second file, doc_style.xsl, which is an XSL Transformation (XSLT) stylesheet that transforms each XML document from the collection into a new XML format we devised, called starts_intermediate. This is a simple subset of STARTS XML that describes STARTS documents. The output of the transformation is not saved as a document; rather, the transformation emits SAX events that ultimately tell the search engine indexer how to index the document. Like the text document wrapper, this process assumes that all documents in the collection are structured similarly.

Figure 6: Indexing XML documents for the Lucene search engine, using an XSLT stylesheet to locate the fields to index.

When we set out to devise this and the final wrapper, we believed that writing XSLT stylesheets was easier than designing and implementing Java objects; while this is true, XSLT is still non-trivial. In Section 7, we assess XSLT's suitability for making wrappers easy to implement and configure.
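The XSLT-to-SAX path just described can be realized with the standard javax.xml.transform API, as the minimal sketch below shows; the ContentHandler that actually feeds the search engine indexer is assumed to be supplied by the wrapper.

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;

    class XmlIndexer {
        static void index(File stylesheet, File xmlDocument,
                          org.xml.sax.ContentHandler indexingHandler) throws Exception {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(stylesheet));
            // The result is never written to disk: it is delivered as SAX
            // events that tell the indexer how to index the document.
            transformer.transform(new StreamSource(xmlDocument),
                                  new SAXResult(indexingHandler));
        }
    }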

6.3 WWWBackEndLSP: WWW/CGI Collections

WWWBackEndLSP is the most complex of the three wrappers. It is intended for autonomous, remote collections fronted by HTML forms and CGI scripts. Such collections include search engines such as Google and AltaVista, and thousands of other web-based searchable collections.

There are two major issues in creating a generic, configurable wrapper for such collections:

- How to convert a query into the proper CGI-BIN invocation onto a search engine; and
- How to interpret the HTML results returned by such an engine.

For metasearching, there is also the question of how to extract metadata from such engines, since most search engines do not provide any such metadata. Our current implementation relies on the wrapper administrator to write a metadata file (the meta_attributes.xml file) with the information specified by STARTS XML. In the future, we could automatically generate at least an approximation of the content summaries by using the results of research on metadata extraction from "uncooperative" sources [4, 15].

We decided that the best way to make this wrapper configurable without additional Java coding was through the use of XSLT stylesheets and the starts_intermediate format. We extended it to be able to describe CGI invocations
