Final Report 2004-2008
Total Page:16
File Type:pdf, Size:1020Kb
Computational Linguistics for Metadata Building Final Report 2004-2008 Screenshot of CLiMB Toolkit for the ARTstor Art History Survey Collection Principal Investigator: Judith L. Klavans, Ph.D. Project Manager: Carolyn Sheffield Project Team: University of Maryland: Mairead Hunter, Tatyana Lavut, Jimmy Lin, Tandeep Sidhu, Dagobert Soergel, Rachel Wadsworth, Deborah Wallace Drexel University: Eileen Abels, Joan Beaudoin, Laura Jenemann, Dwight Swanson Columbia University: Tom Lippincott, Rebecca Passonneau, Tae Yano University of Maryland, College Park Computational Linguistics for Metadata Building Final Report Executive Summary................................................................................................................ 3 1. Project Background......................................................................................................... 4 2. The CLiMB Toolkit ........................................................................................................ 4 2.1. System Specifications ......................................................................................... 4 2.2. Workflow Studies and Usability Testing............................................................ 5 2.3. Importing Collections ......................................................................................... 5 2.4. Examining Images .............................................................................................. 7 2.5. Reviewing Texts ................................................................................................. 7 2.6. Consulting Thesaural Resources......................................................................... 8 2.7. Exporting CLiMB Metadata ............................................................................... 9 2.8. Usability Improvements...................................................................................... 9 2.9. Disambiguation................................................................................................. 10 2.10. Categorization............................................................................................... 11 2.11. Text Features................................................................................................. 12 2.12. Functional Semantic Categories ................................................................... 13 2.13. Machine Learning Experiments.................................................................... 15 2.14. Interannotator Agreement Studies ................................................................ 16 3. Sustainability................................................................................................................. 16 4. Appendices.................................................................................................................... 18 4.1. Appendix 1: CLiMB Toolkit User Manual........................................................... 18 4.2. Appendix 2: List of CLiMB-2 Presentations........................................................ 18 4.3. Appendix 3: CLiMB-2 Publications (full text)..................................................... 18 4.3.1 Judith Klavans, et al. Computational Linguistics for Metadata Building: Aggregating Text Processing Technologies for Enhanced Image Access. OntoImage 2008. Marrakech, Morocco................................................................ 42 4.3.2 Rebecca Passonneu, et al. Relation between Agreement Measures on Human Labeling and Machine Learning Performance: Results from an Art History Domain. LREC. 2008. Marrakech, Morocco....................................................... 49 4.3.3 Rebecca J. Passonneau, et al. Functional Semantic Categories for Art History Text: Human Labeling and Preliminary Machine Learning. VISAPP, MMIU 2008. Funchal, Madeira - Portugal. ................................................................................ 55 4.3.4 Judith Klavans, et al. Computational Linguistics for Metadata Building (CLiMB) Text Mining for the Automatic Extraction of Subject Terms for Image Metadata. VISAPP, MMIU 2008. Funchal, Madeira - Portugal. ......................... 64 4.3.5 Rebecca Passonneau, et al. CLiMB ToolKit: A case study of iterative evaluation in a multidisciplinary project. LREC, 2006 ........................................ 74 4.4. Appendix 4: Brochures ......................................................................................... 18 4.5. Appendix 5: CLiMB-2 Posters ............................................................................ 18 4.6. Appendix 6: Annotated Bibliography…...………………………………………18 4.7. Appendix 7: Internal Advisory Board Member List............................................ 18 T4.8. Appendix 8: External Advisory Board Member List............................................ 18 p. 2 of 113 Fall 2008 Computational Linguistics for Metadata Building Final Report Executive Summary In this report, we describe the CLiMB Toolkit, including system specifications, the results of our usabilty and work flow experiments, and descriptions of disambiguation and categorization tools. Over the course of the second phase of the CLiMB project (CLiMB-2), we have implemented refinements to the Toolkit developed in the first phase (CLiMB-1). Changes were executed in an iterative fashion and influenced by: • User feedback • Expert recommendations • Iterative testing • Developments in the field An overarching goal throughout the project has been to develop a mechanism for the description of works of art that enhances access and exceeds keyword extraction. A frequent reaction from the image indexing community has been to compare our work to Google Image and similar image keyword-based search engines available for general use. Our goal throughout the project has been to enrich the capabilities of such search and retrieval mechanisms through the integration of existing image indexing standards from the museum and library community and to develop and utilize refined computational linguistic techniques. We exceed simple keyword search by: • Consulting controlled vocabularies and enabling hierarchical query expansion • Incorporating a process for the semi-automatic disambiguation of terms using hierarchical thesauri to facilitate description • Categorizing text segments by their semantic function to further enable query refinement and to assist with disambiguation p. 3 of 113 Fall 2008 Computational Linguistics for Metadata Building Final Report 1. Project Background The Computational Linguistics for Metadata Building (CLiMB) project was initially funded at Columbia University to the Center for Research on Information Access. A description of the project and information on results of the first phase can be found at www.columbia.edu/cu/libraries/inside/projects/climb/. In sum, this initial project (now referred to as CLiMB-1) enabled a proof-of-concept, an implemented client-server prototype, and an initial set of resources for testing with the toolkit. The second phase of the project enabled a full implementation and testing with users; this second phase (CLiMB-2) www.umiacs.umd.edu/~climb/ is the topic of this report. The next phase will involve further user evaluation, and the association of social tags and trust with terms from CLiMB into a platform we have termed T3: Tags, Terms, and Trust. 2. The CLiMB Toolkit Over the course of CLiMB-2, the Toolkit has evolved into a packaged, downloadable application that reflects the image cataloging workflow we observed at museums and academic visual resource centers. We iteratively improved the Toolkit design through an ongoing literature review, user studies, and outreach efforts. Highlights of our accomplishments include: • Reimplementing the Toolkit in a client-side Java-based architecture • Packaging the Toolkit for easy public download • Improving overall usability and interface design based on user feedback • Developing disambiguation and categorization techniques to enrich the description and search processes • Determining future directions for enabling Toolkit use in repositories based on cataloging practices Each of these accomplishments is discussed in more detail in the body of the report. 2.1 System Specifications The CLiMB-1 prototype Toolkit and testbed collections operated on a server-based platform too slow for practical real-world implementation, particularly in digitally-driven image access environments. One of the major goals accomplished in CLiMB-2 was to redesign the Toolkit in a client-side architecture to speed up the download and use of the software, while preserving and extending its original functionality. We chose a Java- based platform. Java provides a rich API to develop fully-featured client-side applications with responsive graphical user interfaces. The cross-platform nature of Java also simplifies many of the issues associated with distribution and deployment of p. 4 of 113 Fall 2008 Computational Linguistics for Metadata Building Final Report applications. Additionally, the Toolkit and a public domain testbed collection are available for download as a set of JAR files which require minimal storage space on a personal computer, thereby limiting the burden of several server calls by multiple users. The Toolkit, along with one sample collection, is now available for public download from the CLiMB website (www.umiacs.umd,edu/~climb). 2.2 Workflow Studies and Usability Testing Our goal was a CLiMB-2 Toolkit interface that was intuitive, required a minimum of training, and integrated