X 7/ / .E~Smittee Member Certified by
Total Page:16
File Type:pdf, Size:1020Kb
Identifying Expression Fingerprints using Linguistic Information by Ozlem Uzuner Submitted to the Engineering Systems Division in partial fulfillment of the requirements for the degree of MASSACHUS'ETTS i-NTiTmrE Doctor of Philosophy OF TECHNOLOOGY at the MAR .10 2005 MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2005 LIBRARIES ©Massachusetts Institute of Technology 2005. All rights reserved. Author ..................................................... Enginei&ng SystemsDivision C 1A January 31. 2005 Certified by .................................................... Randall Davis Professor of Computer Science Computer Science and Artificial Intelligence Laboratory Thesis Suvervisor Certified by ...................................... ................... B61s Katz Principal Research Scientist Computer Science and Artificial Intelligence Laboratory Thesis Supervisor Certified by........................... ................ ....... '"Lee W. McKnight Associate Professor n School of forition Studies / fayr use University X 7/ / .e~Smittee Member Certified by ........................ J / .............. / f""'T7~ / FrankR.Field mI Senior Research Associate Engineering Systems Division Committee Member Accepted by ................. ...... oo .... .. o .... ...... ...... RKlc de Neufville Prof ssor gineering Systems Chair, Engineering Systems ision Education Committee ARCHIVES Identifying Expression Fingerprints using Linguistic Information by Ozlem Uzuner Submitted to the Engineering Systems Division on January 31, 2005, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract This thesis presents a technology to complement taxation-based policy proposals aimed at address- ing the digital copyright problem. The approach presented facilitates identification of intellectual property using expression fingerprints. Copyright law protects expression of content. Recognizing literary works for copyright protec- tion requires identification of the expression of their content. The expression fingerprints described in this thesis use a novel set of linguistic features that capture both the content presented in docu- ments and the manner of expression used in conveying this content. These fingerprints consist of both syntactic and semantic elements of language. Examples of the syntactic elements of expression include structures of embedding and embedded verb phrases. The semantic elements of expression consist of high-level, broad semantic categories. Syntactic and semantic elements of expression enable generation of models that correctly iden- tify books and their paraphrases 82% of the time, providing a significant (approximately 18%) improvement over models that use tfidf-weighted keywords. The performance of models built with these features is also better than models created with standard features used in stylometry (e.g., func- tion words), which yield an accuracy of 62%. In the non-digital world, copyright holders collect revenues by controlling distribution of their works. Current approaches to the digital copyright problem attempt to provide copyright holders with the same kind of control over distribution by employing Digital Rights Management (DRM) systems. However, DRM systems also enable copyright holders to control and limit fair use, to inhibit others' speech, and to collect private information about individual users of digital works. Digital tracking technologies enable alternate solutions to the digital copyright problem; some of these solutions can protect creative incentives of copyright holders in the absence of control over distribution of works. Expression fingerprints facilitate digital tracking even when literary works are DRM- and watermark-free, and even when they are paraphrased. As such, they enable meter- ing popularity of works and make practicable solutions that encourage large-scale dissemination and unrestricted use of digital works and that protect the revenues of copyright holders, for exam- ple through taxation-based revenue collection and distribution systems, without imposing limits on distribution. Thesis Supervisor: Randall Davis Title: Professor of Computer Science Computer Science and Artificial Intelligence Laboratory Thesis Supervisor: Boris Katz Title: Principal Research Scientist Computer Science and Artificial Intelligence Laboratory 2 Acknowledgments I am grateful to my advisors Prof. Randall Davis, Dr. Boris Katz, Prof. Lee McKnight, and Dr. Frank Field for sharing with me their wisdom and for providing me with support. I am thankful to my friends Gregory Marton, Kate Saenko, Vineet Sinha, Jacob Eisenstein, Metin Sezgin, Mario Chris- toudias, Jason Rennie, Federico Mora, and Juhi Chandalia for their feedback and for always making me smile. I am thankful to Sue Felshin for all her help and for being a wonderful officemate. I am blessed with amazing parents and an amazing brother-I dedicate this thesis to them. 3 4 Contents 1 Introduction 15 1.1 Background ...................................... 15 1.2 Proposed Solutions. 19 1.3 The Promise of Expression Fingerprints ....................... 20 2 Digital Copyright Problem 23 2.1 Copyright Law and its Goals . .......... 27 2.2 Mesh of Stakeholders and Copyright Concerns ................... 32 2.2.1 Copyright Holders .......................... 32 2.2.2 Digital Copyright Problem and Reactions of Copyright Holders . 35 2.2.3 Users ................................. 36 2.2.4 Digital Copyright Problem, DRM systems, DMCA, and Effects on Users . 38 2.2.5 Innovators and Technology Producers .................... 42 2.2.6 Rights Defenders ............................... 45 2.2.7 Policy Makers: the Congress and Courts ................... 46 2.3 Net Effects of DRM systems and Supporting Legislation .......... 46 2.4 Requirements for a Promotion of Progress ...................... 47 2.5 Conclusion ...................................... 48 2.6 Summary ................... ................... 48 3 Technology and Policy Solutions 51 3.1 Technologies. 52 3.1.1 Technologies for Controlling Use. 52 3.1.2 Digital Tracking Technologies .......... 56 3.2 Technology & Policy Solutions .............. 58 5 3.2.1 Trusted Third Party ............ 60 .................. 3.2.2 Creative Commons ............ .................. 61 3.2.3 Taxation-Based Solutions. .................. 62 3.3 Lessons Learned ................. .................. 65 3.4 Digital Tracking Using Expression Fingerprints . .................. 66 3.5 Conclusion .................... .................. 68 3.6 Summary ..................... .................. 68 4 Content, Expression, and Style 71 5 Preliminary Experiments 75 5.1 Features from the Literature 76 5.1.1 Surface Features . 76.. 5.1.2 Syntactic Features. 76 5.1.3 Semantic Features .............................. 77 5.2 Methods ........................................ 78 5.2.1 Decision trees ................................. 78 5.2.2 Boosting ................................... 81 5.2.3 t-Test ..................................... 82 5.3 Experiments...................................... 83 5.3.1 Classification Experiments .......................... 83 5.3.2 Significance Testing for Feature Ranking .................. 84 5.4 Conclusion .......... 85 5.5 Summary ....................................... 86 6 Verb-Based Theory of Expression 87 6.1 Linguistic Complexity and Syntactic Repertoire ................... 87 6.2 Tests of Independence ................................. 88 6.2.1 Pearsons' Chi-Square ............................. 88 6.2.2 Likelihood Ratio Chi-Square ......................... 90 6.3 Expression and Meaning ............................... 91 6.4 Expression as a Function of Syntax and Structure .................. 93 6.5 Expression as a Function of Sentence Structure .. ..... .. 96 6 6.5.1 Expressive Use of Sentence-Initial and -Final Phrase Structures 97 6.6 Expression as a Function of Sentence Complexity ............. 106 ..... 6.6.1 Sentence Complexity as a Function of Clause Structure ...... .107 ..... 6.6.2 Clause Structure and Expression .................. .108 ..... 6.7 Expression as a Function of Verb Phrase Structure ............. .118 ..... 6.7.1 Linguistic Richness ......................... .118 ..... 6.7.2 Semantics of Verbs ................... ...... 119 ..... 6.7.3 Syntax of Verbs ................... ........ .123 ..... 6.8 Summary. 134 7 Semantic Categories and Content 135 7.1 Content in the Literature . .135 ..... 7.2 General Inquirer (GI) Classes . .137 ..... 7.3 Parameter Tuning ....... 140 ..... 7.4 Evaluation of GI Categories for Capturing Content Similarity . O .143 ..... 7.5 Conclusion .......... e .147 ..... 7.6 Summary ........... O .148 ..... 8 Evaluation 149 8.1 Data ........................... 150... ......... 8.2 Feature Set. 152... ......... 8.3 Baseline Features .................... 154... ......... 8.3.1 Baseline Features ................ 155... ......... 8.4 Linguistic Features ................... 158... ......... 8.5 Classification Experiments ............... 158... ......... 8.5.1 Recognizing Paraphrases (Recognizing Titles) 159.. ......... 8.5.2 Recognizing Expression (Recognizing Books) 165... ......... 8.5.3 Recognizing Authors .............. 168.. ......... 8.6 Conclusion ....................... 170.. ......... 8.7 Summary ........................ 170.. ......... 9 Conclusion 171 10 Contributions 175 7 11 Future Work 177 11.1 Implementation of Digital Tracking with Expression Fingerprints .......... 177 11.2 Study of Linguistic Features for Other Applications . ..... 179 References