Machine-Learning Based Classification of Textual Stimuli To
Total Page:16
File Type:pdf, Size:1020Kb
MACHINE-LEARNING BASED CLASSIFICATION OF TEXTUAL STIMULI TO PROMOTE IDEATION IN BIOINSPIRED DESIGN A Dissertation by MICHAEL W. GLIER Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Chair of Committee, Daniel A. McAdams Co-Chair of Committee, Julie S. Linsey Committee Members, Richard J Malak Tracy A. Hammond Head of Department, Andreas Polycarpou August 2013 Major Subject: Mechanical Engineering Copyright 2013 Michael W. Glier ABSTRACT Bioinspired design uses biological systems to inspire engineering designs. One of bioinspired design’s challenges is identifying relevant information sources in biology for an engineering design task. Currently information can be retrieved by searching biology texts or journals using biology-focused keywords that map to engineering functions. However, this search technique can overwhelm designers with unusable results. This work explores the use of text classification tools to identify relevant biology passages for design. Further, this research examines the effects of using biology passages as stimuli during idea generation. Four human-subjects studies are examined in this work. Two surveys are performed in which participants evaluate sentences from a biology corpus and indicate whether each sentence prompts an idea for solving a specific design problem. The surveys are used to develop and evaluate text classification tools. Two idea generation studies are performed in which participants generate and record solutions for designing a corn shucker using either different sets of biology passages as design stimuli, or no stimuli. Based 286 sentences from the surveys, a k Nearest Neighbor classifier is developed that is able to identify helpful sentences relating to the function “separate” with a precision of 0.62 and recall of 0.48. This classifier could potentially double the number of helpful results found using a keyword search. The developed classifier is specific to the function “separate” and performs poorly when used for another function. ii Classifiers developed using all sentences and participant responses from the surveys are not able to reliably identify helpful sentences. From the idea generation studies, we determine that using any biology passages as design stimuli increases the quantity and variety of participant solutions. Solution quantity and variety are also significantly increased when biology passages are presented one at a time instead of all at once. Quality and variety are not significantly affected by the presence of design stimuli. Biological stimuli are also found to lead designers to types of solution that are not typically produced otherwise. This work develops a means for designers to find more useful information when searching biology and demonstrates several ways that biology passages can improve ideation. iii DEDICATION Ad majorem Dei gloriam iv ACKNOWLEDGEMENTS As with any large work, a great many people contributed to this research in ways great and small. The least I can do is to acknowledge some of those people here, even if many are unlikely to read this. First thanks go to my advisor and co-advisor, Dr. McAdams and Dr. Linsey, for their guidance and support, but mostly for their patience. I thank my committee members, Dr. Malak and Dr. Hammond, for generously lending their time and insight to this research. I am very grateful to Dr. Matt Campbell, whose presentation at DCC’12 spurred my research in a completely new direction that led to this work, which is amusingly unrelated to that initial talk. I also appreciate the support for this work provided through the National Science Foundation through NSF EEC 1025155. I am fortunate to have had some wonderful lab mates as I worked on this degree, and I am grateful to them all. A special thanks is due to Shraddha, for her friendship, advice, and commiseration as we both worked toward our doctorates. Ethan and Wes also deserve thanks for tackling the important but time-consuming and mind-numbing task of developing the biology corpus used in this research. Having started as an undergrad at Texas A&M, I am grateful to the countless people who have made my time at A&M a wonderful experience, but particularly to the faculty and staff of the Mechanical Engineering Department. I must also thank Mark Horn, Ray James, Harry Hogan, Kim Moses, and Sue Geller, without whom my eighteen year-old self may not have found his way to Texas A&M. I am further indebted to Bart v Braden, Andy Long, Stephanie Parker, and Frank Shaw, for without them I may not have been able to study at A&M. Finally, I will be forever grateful to my father, mother, brother, and sister for their continuous love and support while I studied engineering a thousand miles from home. vi TABLE OF CONTENTS Page ABSTRACT .......................................................................................................... ii DEDICATION .......................................................................................................... iv ACKNOWLEDGEMENTS ........................................................................................... v TABLE OF CONTENTS ............................................................................................... vii LIST OF FIGURES ........................................................................................................ x LIST OF TABLES ......................................................................................................... xii 1. INTRODUCTION ................................................................................................. 1 1.1. Research Questions ...................................................................................... 3 1.2. Scope of Work .............................................................................................. 4 2. RELATED WORK IN BIOINSPIRED DESIGN ................................................. 8 2.1. Directed Approach ....................................................................................... 9 2.2. Case Study Approach ................................................................................... 9 2.3. Curated Database Tools ............................................................................... 10 2.4. BioTRIZ ....................................................................................................... 12 2.5. Functional Modeling .................................................................................... 14 2.6. Automated Keyword Extraction ................................................................... 17 2.7. Engineering-to-Biology Thesaurus .............................................................. 18 2.8. Studies on Analogies in Bioinspired Design ................................................ 19 3. TEXT CLASSIFICATION BACKGROUND ...................................................... 22 3.1. Text Classifiers ............................................................................................. 22 3.1.1. Naïve Bayes ...................................................................................... 23 3.1.2. k Nearest Neighbors ......................................................................... 23 3.1.3. Support Vector Machines ................................................................. 24 3.1.4. Decision Rule Induction ................................................................... 24 3.2. Natural Language Processing ....................................................................... 25 3.2.1. Tokenization ..................................................................................... 25 3.2.2. Part of Speech Tagging .................................................................... 26 3.2.3. Word Stemming ............................................................................... 26 4. DESCRIPTION OF ANALOGY SURVEYS ....................................................... 28 4.1. Analogy Survey Structure ............................................................................ 28 vii Page 4.2. A Corpus of Biology Journals ...................................................................... 29 4.3. Corn Shucker Analogy Survey ..................................................................... 30 4.4. Alarm Clock Analogy Survey ...................................................................... 35 5. CORN SHUCKER ANALOGY SURVEY RESULTS AND DISCUSSION ...... 38 5.1. Participant Analysis ...................................................................................... 39 5.2. Sentence Analysis ........................................................................................ 42 5.3. Keyword Analysis ........................................................................................ 45 5.4. Word Analysis .............................................................................................. 51 5.5. Summary ...................................................................................................... 54 6. TEXT DISCRETIZATION ................................................................................... 56 6.1. Sentence Tokenization ................................................................................. 56 6.2. Part-of-Speech Tagging ................................................................................ 57 6.3. Stopword Removal ....................................................................................... 59 6.4. Word Stemming ..........................................................................................