Automatic Document Metadata Extraction using Support Vector Machines ½¾ ½ ½ Hui Han ½ C. Lee Giles Eren Manavoglu Hongyuan Zha ½ Department of Computer Science and Engineering ¾ The School of Information Sciences and Technology The Pennsylvania State University University Park, PA, 16802 hhan,zha,manavogl @cse.psu.edu
[email protected] Zhenyue Zhang Department of Mathematics, Zhejiang University Yu-Quan Campus, Hangzhou 310027, P.R. China
[email protected] Edward A. Fox Department of Computer Science, Virginia Polytechnic Institute and State University 660 McBryde Hall, M/C 0106, Blacksburg, VA 24061
[email protected] Abstract the process, facilitating the discovery of content stored in distributed archives [7, 18]. The digital library CITIDEL Automatic metadata generation provides scalability and (Computing and Information Technology Interactive Digi- usability for digital libraries and their collections. Ma- tal Educational Library), part of NSDL (National Science chine learning methods offer robust and adaptable auto- Digital Library), uses OAI-PMH to harvest metadata from matic metadata extraction. We describe a Support Vector all applicable repositories and provides integrated access Machine classification-based method for metadata extrac- and links across related collections [14]. Support for the tion from header part of research papers and show that it Dublin Core (DC) metadata standard [31] is a requirement outperforms other machine learning methods on the same for OAI-PMH compliant archives, while other metadata for- task. The method first classifies each line of the header into mats optionally can be transmitted. one or more of 15 classes. An iterative convergence proce- However, providing metadata is the responsibility of dure is then used to improve the line classification by using each data provider with the quality of the metadata a signif- the predicted class labels of its neighbor lines in the previ- icant problem.