Florida State University Libraries

Electronic Theses, Treatises and Dissertations
The Graduate School

2014

Exploring the Data Work Organization of the Gene Ontology

Shuheng Wu

FLORIDA STATE UNIVERSITY

COLLEGE OF COMMUNICATION AND INFORMATION

EXPLORING THE DATA WORK ORGANIZATION OF THE GENE ONTOLOGY

By

SHUHENG WU

A Dissertation submitted to the School of Information in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Degree Awarded: Fall Semester, 2014

Shuheng Wu defended this dissertation on October 24, 2014. The members of the supervisory committee were:

Besiki Stvilia Professor Directing Dissertation

Henry W. Bass University Representative

Corinne L. Jörgensen Committee Member

Michelle M. Kazmer Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.


I dedicate this dissertation to my beloved mother, Peiqiong Ou, my father, my husband, and those who have supported and helped me. It is all of you who have helped me grow and become who I am today.

ACKNOWLEDGMENTS

My deepest appreciation is owed to my major professor Dr. Besiki Stvilia and his family. Without his advice and help, I cannot imagine how I could have finished my doctoral coursework. Thanks go to him for introducing Activity Theory to me and for developing a theoretical framework that I can use in my current and future studies. Working with him, I learned about the beauty of theory and the power of methodology, which will benefit my future research and career. Thanks again for his support, hard work, and wisdom. Thanks also to Dr. Hank Bass. Without his advice and help, I could never have finished my dissertation. I would also like to thank my other committee members Dr. Corinne Jörgensen and Dr. Michelle Kazmer. Their dedication to and passion for research in LIS set an example for me as a researcher. Special thanks go to Dr. Gary Burnett, who is always willing to give me advice on writing and research. I would also like to express my thanks to Dr. Kathleen Burnett for her generous support of the international students of the school. I must also thank Deborah Paul, Dr. Paula Mabee, and Dr. Greg Riccardi for connecting me to insiders of the Gene Ontology community. Thanks are also due to the people of the Gene Ontology community and other interviewees, who were willing to spend time answering my endless questions and emails. I really appreciate that I was able to study such an amazing community. I would also like to thank Naiqian Zhan for helping me recruit participants and being with me through the good and bad times. Thanks must go to my cohort member Adam Worrall, who helped review many of my papers and transcriptions throughout my doctoral program. One of my best times in the program was discussing research with you. I would also like to thank my other cohort members Jung Hoon Baeg, Aisha Johnson, Sheila Baker, and Janice Newsum.
Special thanks must go to Nicole Alemanne, Aaron Elkins, Yong Jeong Yi, and Melinda Whetstone, who were always willing to give me help and advice. I would also like to thank my wonderful friends in the program: Min Sook Park, Jongwook Lee, Blake Robinson, Hengyi Fu, Biyang Yu, and all the other doctoral students. My deepest love goes to my husband, parents, and mother-in-law. Thanks to Xiang Wang for helping with my data collection and participant recruitment. Thanks to my father for inspiring me to pursue the doctoral degree. Lastly and most importantly, I would like to express my thanks, love, and respect to my dearest mother. It is you who gave me the power to finish my dissertation. I am very proud to be your daughter. I hope I make you proud and happy.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1. INTRODUCTION
   1.1 Problem Statement
       1.1.1 Scientific Data Curation
       1.1.2 Bio-ontologies
       1.1.3 The Gene Ontology
       1.1.4 Research Purpose and Significance
   1.2 Research Questions
   1.3 Theoretical Frameworks
       1.3.1 Activity Theory
       1.3.2 Stvilia’s Information Quality Assessment Framework
   1.4 Research Design
   1.5 Conclusion

2. LITERATURE REVIEW
   2.1 Knowledge Organization Systems
       2.1.1 Term Lists
       2.1.2 Classifications and Categories
       2.1.3 Relationship Lists
       2.1.4 Folksonomies
       2.1.5 Structure of KO Systems
       2.1.6 Comparison between Ontologies and Other KO Systems
       2.1.7 Knowledge Organization in Scientific Data Management
   2.2 Ontology Development
       2.2.1 Ontology Development Tools
       2.2.2 Ontology Development Methodologies
   2.3 Bio-ontologies
   2.4 The Gene Ontology
       2.4.1 The GO Term Record
       2.4.2 The GO Structure
       2.4.3 The Development and Maintenance of the GO
   2.5 Data Quality
       2.5.1 Data Quality Assessment Models
       2.5.2 Scientific Data Quality Problems

   2.6 Activity Theory
       2.6.1 The Origin and Development of Activity Theory
       2.6.2 Principles of Activity Theory
       2.6.3 Previous Applications
       2.6.4 Strengths and Limitations
   2.7 Stvilia’s Information Quality Assessment Framework
       2.7.1 Concepts
       2.7.2 Components and Relationships among the Components
       2.7.3 How to Use
       2.7.4 Previous Applications
       2.7.5 Strengths and Limitations
   2.8 Conclusion

3. METHODS
   3.1 Research Questions
   3.2 Research Design
       3.2.1 Ethnography and Netnography
       3.2.2 Fieldsite
       3.2.3 Justification for Netnography
       3.2.4 Research Plan
   3.3 Archival Data Analysis
   3.4 Participant Observations
   3.5 Qualitative Semi-structured Interviews
   3.6 Data Analysis
   3.7 Ethical Issues
       3.7.1 Identifying and Explaining
       3.7.2 Informed Consent
       3.7.3 Disclosure
       3.7.4 Confidentiality
       3.7.5 Harm
       3.7.6 Consequences
   3.8 Quality Control
       3.8.1 Netnography and Ethnography
       3.8.2 Qualitative Interviewing
   3.9 Limitations
   3.10 Conclusion

4. FINDINGS

   4.1 Archival Data Analysis
       4.1.1 Activities around the GO
       4.1.2 Division of Labor
       4.1.3 Communities
       4.1.4 Data Quality Problems of the GO and Corresponding Assurance Actions
       4.1.5 Source of Data Quality Problems
       4.1.6 Tools
       4.1.7 Rules
   4.2 Participant Observations
       4.2.1 Interactions with a GO Curator
       4.2.2 Interactions with UniProt Curators
   4.3 Semi-structured Interviews
       4.3.1 Demographics of the Interviewees
       4.3.2 Activities around the GO
       4.3.3 Division of Labor
       4.3.4 Communities
       4.3.5 Tools
       4.3.6 Types and Sources of Data Quality Problems and Corresponding Assurance Actions
       4.3.7 Rules
       4.3.8 Data Quality Criteria for the GO
       4.3.9 Data Curation Skills for the GO
   4.4 Conclusion

5. DISCUSSION
   5.1 Activities around the GO
       5.1.1 Communities
       5.1.2 Division of Labor
       5.1.3 Tools
       5.1.4 Rules
       5.1.5 Contradictions
       5.1.6 Data Curation Skills for the GO
   5.2 Data Quality Structure of the GO
       5.2.1 Types of Data Quality Problems
       5.2.2 Sources of Data Quality Problems and Assurance Actions
       5.2.3 Data Quality Criteria
   5.3 Conclusion and Future Research

APPENDICES
   A INTERVIEW GUIDE
   B RESEARCH QUESTIONS AND CORRESPONDING THEORIES AND DATA COLLECTION METHODS
   C INITIAL CODING SCHEME
   D COMMUNITIES AROUND THE ONTOLOGY REQUESTS TRACKER
   E TOOLS USED TO RESOLVE DATA QUALITY PROBLEMS OF THE GO
   F SCIENTIFIC LITERATURE REFERENCES ON THE ONTOLOGY REQUESTS TRACKER
   G GO ANNOTATION TO THE MAIZE SMH1 GENE
   H TOOLS
   I INFORMED CONSENT FORM
   J APPROVALS FROM HUMAN SUBJECTS COMMITTEE

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 4.1: Data quality problems of the GO and corresponding assurance actions
Table 4.2: Activities around the GO
Table 4.3: Types and sources of data quality problems and corresponding assurance actions
Table 4.4: Data quality criteria for bio-ontologies
Table 5.1: Contradictions and suggestions

LIST OF FIGURES

Figure 4.1: Division of labor around the GO
Figure 5.1: The activity system of the GO Project team
Figure 5.2: The activity system of the GO Consortium members

ABSTRACT

The advent of high-throughput techniques has led to an exponential increase in the size of biological data encoded in various formats and stored in different databases. This has posed challenges for biological scientists to retrieve, use, analyze, and integrate data. To meet the urgent need to organize a massive amount of heterogeneous data, there has been a trend towards the development of bio-ontologies. Among many current bio-ontologies, the Gene Ontology (GO) is one of the most successful and has been widely used across different biological communities. This study applied Activity Theory and Stvilia’s Information Quality Assessment Framework to examine the infrastructure supporting the development, maintenance, and use of the GO among different biological communities. Employing the netnographic approach, this study gathered data in a natural setting via archival data analysis, participant observations, and qualitative semi-structured interviews. The findings indicated that the GO was collaboratively developed and maintained by a consortium of biological communities, mainly model organism databases. Representatives from each GO Consortium member were assigned the role of GO curator and organized into groups working on different aspects of the GO. The division of labor within the GO Consortium ensures that the formidable ontology development process can be divided into manageable projects. The GO Consortium consists not only of biocurators but also of software engineers and bioinformaticians, who provide technical and software support. As an open community, the GO Consortium has been bringing in new groups and welcomes any individual to submit content for inclusion in the GO database. GO’s collaborative development approach can be adopted by other similar ontologies or large-scale sociotechnical systems.
This study also provided a rich description of GO’s data quality work and a conceptualization of GO’s data quality structure, including a typology of GO’s data quality problems and corresponding quality assurance actions. This knowledge base can be used for the design and management of similar sociotechnical systems and the development of best practices for knowledge organization system curation in molecular biology and biomedicine. The data curation skills that were perceived as important for the GO can not only inform the training of biocurators, but also give new insight into curriculum design and training in Library and Information Science and Data Science. The findings of this study can benefit the GO by

identifying various data quality issues and contradictions in its data curation work as well as suggesting strategies and actions for improvement. Future research includes developing quantitative models for assessing the quality of different aspects of GO’s data curation work. Netnographic studies can be conducted with different groups and teams within the GO Consortium to investigate their data practices and collaboration patterns, which can inform the design of support repertoires for scientific teams.

CHAPTER ONE

INTRODUCTION

This dissertation research explored the data work organization of the Gene Ontology (GO), one of the most widely used and successful ontologies in biomedicine and molecular biology. This introductory chapter begins with a problem statement, including the purpose of this study and the significance of investigating the problem. The chapter then presents the research questions, formulated based on Activity Theory and Stvilia’s Information Quality Assessment Framework, along with a brief overview of those theoretical frameworks. It concludes with the research design and a brief introduction to netnography.

1.1 Problem Statement

1.1.1 Scientific Data Curation

Some sciences have shifted from the empirical, theoretical, and computational paradigms to the eScience paradigm, in which scientists rely on computational technologies to make new scientific discoveries by analyzing immense datasets (Allard, 2012; Gray, 2007). As a new paradigm, eScience is data intensive, focusing on unifying theory, experiment, and simulation. Three aspects highlight the increasing need to maintain long-term access to scientific data: data generated in one discipline might be reused in other disciplines; traditional scientific disciplines have become interdependent and overlapping (e.g., biochemistry, biomedicine); and eScience is emerging with its reliance on computational technologies to generate, analyze, and provide access to a vast amount of scientific data (Allard, 2012; Anderson, 2004; Borgman, Wallis, & Enyedy, 2007; Gray, 2007; Gray et al., 2005). As the quantity and diversity of scientific data grow tremendously, preserving and archiving data may no longer be treated as post-project activities, but as part of daily research activities (Anderson, 2004). Digital libraries and archives should not only collect, organize, and preserve scientific literature, but also expand the scope of their services to meet the changing data management needs of their institutions and users, engaging in scientific data curation and assisting scientists with their daily archiving activities (Gray, 2007; Heidorn, 2011). For example, the United States National Library of Medicine (NLM) has developed the National Center for Biotechnology Information (NCBI, 2004)—a widely used digital portal—to curate and provide open access to data and

literature in molecular biology, biochemistry, biomedicine, and genetics. Scientists can deposit their data in NCBI’s repositories, such as GenBank and PubChem. Data curation can be defined as “the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use” (Lord & Macdonald, 2003, p. 12). Due to the challenges of data curation, there are urgent needs in scientific communities for Knowledge Organization (KO) systems (e.g., metadata schemas, ontologies, the Semantic Web) to provide access to and make sense of a tremendous amount of scientific data (Gray, 2007; Gray et al., 2005). In Library and Information Science (LIS), KO can be defined as the activities that librarians, archivists, information specialists, subject specialists, laymen, and computers perform to describe, index, and classify documents in libraries, museums, archives, bibliographic databases, and other kinds of “memory institutions” (Hjørland, 2008, pp. 86-87). The purpose of KO within LIS is to construct, apply, and evaluate KO systems for information retrieval (Hjørland, 2007). In an LIS-oriented sense, KO systems were originally developed to organize bibliographic collections and support the retrieval of relevant items from libraries, museums, and archives (Hjørland, 2008; Hodge, 2000). Scientific data differ from scientific publications in several respects, including size and volume, security requirements, accessibility, forms and formats, identifier systems, quality requirements, and metadata descriptions (Anderson, 2004). This creates challenges for traditional KO systems in representing scientific data and improving data and metadata interoperability across disparate vocabularies and domains (Allard, 2012). Scientific datasets in many cases are large, poorly organized, and stored on researchers’ computers (Carlson, 2006).
Scientists would like to keep their datasets secure, secret from competitors, close by (e.g., on a server in their departments), in multiple places and versions, and under absolute control (Borgman et al., 2007). Scientific research findings are usually documented and published in journal articles, technical reports, conference proceedings or presentations, books, patents, and blogs or forums, which are available to the public through libraries, online databases, and the Internet (Gray, 2007). However, the underlying datasets are too large to fit in scientific publications, and may become inaccessible to other researchers (Anderson, 2004; Gray, 2007). In terms of origin, scientific data can be classified as observational data, computational data, experimental data, and record-keeping data (Borgman et al., 2007; National Science Board,

2005; Witt, Carlson, Brandt, & Cragin, 2009; Zimmerman, 2003). Scientific data can also be distinguished by their state as raw data, processed data, intermediate data, verified data, certified data (meeting certain standards), and published data (Borgman et al., 2007; Wills, Greenberg, & White, 2012). The same set of data may be analyzed, perceived, and used differently within and across communities. More and more scientific data are born digital and diverse in format (e.g., software, algorithms). Access to and analysis of these datasets depend on specific computational applications (e.g., MATLAB, LabVIEW), which are evolving rapidly with technology (Anderson, 2004; Gray et al., 2005). Furthermore, interdisciplinary science involves the participation of a diversity of scientists, and different scientists may use the same data for different purposes. For example, the experimental data collected by ecologists to answer specific research questions may become contextual data for computer scientists and engineers on the same team to evaluate the performance of sensors (Borgman, 2012; Borgman et al., 2007; Stvilia et al., in press). In addition, eScience raises expectations for the quality of scientific data and their metadata. Assuring data quality before analyzing, preserving, and providing access to scientific data has become essential. With digital libraries increasingly involved in scientific data curation through institutional data repositories, librarians and information scientists have the new task of developing and maintaining domain-specific KO systems to support the long-term access, use/reuse, and sharing of scientific data. Hjørland (2008) claimed that KO cannot be studied independently of other disciplines.
The construction, application, and evaluation of KO systems should be connected to a specific discipline or domain and its user communities: analyzing user needs, identifying KO systems that can satisfy those needs, and implementing and maintaining KO systems in ways that connect disparate communities (Hodge, 2000).

1.1.2 Bio-ontologies

The advent of high-throughput techniques has led to an exponential increase in the size of biological and biomedical data (e.g., gene sequences), encoded in different formats using various controlled vocabularies and stored in different databases (Rubin et al., 2006; Wu, Stvilia, & Lee, 2012). This has posed challenges for scientists to retrieve, use, analyze, and integrate data. To meet the urgent need to manage massive amounts of heterogeneous data, there has been a trend toward the development and adoption of bio-ontologies in the biomedical and molecular biological communities (Kelso, Hoehndorf, & Prüfer, 2010).

In philosophy, ontology is one of the two most fundamental disciplines, concerned with questions “about what exists, about basic kinds, categories, properties, and so on” (Hjørland, 1998, p. 607). Computer science and information science borrowed the word “ontology” from philosophy to denote a conceptual model representing objects, properties of objects, and relationships between objects in a specific domain (Chandrasekaran, Josephson, & Benjamins, 1999). One of the most frequently cited definitions of ontology in computer science is Gruber’s (1993), who briefly described it as “an explicit specification of a conceptualization” (p. 199). Conceptualization refers to an abstract, simplified view of a specific phenomenon of the world that people wish to represent for some purpose (Gašević, Djurić, & Devedžić, 2009). Specification means a formal, explicit, and machine-readable representation of the phenomenon using a set of classes (i.e., concepts), properties of concepts (i.e., slots), constraints on slots (i.e., facets), and relations between the concepts (Noy & McGuinness, 2001). In LIS, ontology has been defined as a representation system providing information systems with a basic semantic structure to support indexing, searching, retrieval, and actionable processes (Greenberg, Murillo, & Kunze, 2012; Jacob, 2003). In biology and biomedicine, ontology (sometimes called bio-ontology) has been defined as a standardized, human-interpretable, and machine-processable representation of entities and the relationships between these entities within a specific biological domain, providing scientists with an approach to annotating, analyzing, and integrating results of scientific and clinical research (Rubin et al., 2006).
The above definitions imply that ontologies are developed for sharing and reusing domain knowledge to (a) enable collaboration and communication among interdisciplinary teams or intelligent agents, (b) allow the interoperation of different information systems, (c) serve as a reliable and objective reference source for education, and (d) act as building blocks for intelligent applications (Gašević et al., 2009; Holsapple & Joshi, 2002; Noy & McGuinness, 2001). In particular, the purposes of bio-ontologies are to (a) represent current biological knowledge, (b) annotate and organize biological data, (c) improve interoperability across biological databases, (d) turn new biological data into knowledge, and (e) assist users in analyzing data across different domains (Bard & Rhee, 2004; Gene Ontology Consortium, 2000).
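Many bio-ontologies, including the GO, are distributed in the OBO flat-file format, in which each term is a stanza of tag-value pairs capturing the classes, properties, and relations described above. The sketch below parses one simplified, illustrative stanza into a Python dictionary; the field values are abbreviated examples for illustration, not a verbatim GO record.

```python
# Simplified, illustrative GO-style term stanza in OBO format.
# Field values are abbreviated examples, not a verbatim GO record.
OBO_STANZA = """\
[Term]
id: GO:0016049
name: cell growth
namespace: biological_process
def: "The process in which a cell irreversibly increases in size." [GOC:ai]
is_a: GO:0040007 ! growth
relationship: part_of GO:0008361 ! regulation of cell size
"""

def parse_term(stanza: str) -> dict:
    """Collect tag-value pairs; repeatable tags (is_a, relationship, ...) become lists."""
    term = {}
    repeatable = {"is_a", "relationship", "synonym", "xref"}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):      # skip blanks and the [Term] header
            continue
        tag, _, value = line.partition(": ")      # split at the first ": "
        value = value.split(" ! ")[0]             # drop the trailing human-readable label
        if tag in repeatable:
            term.setdefault(tag, []).append(value)
        else:
            term[tag] = value
    return term

term = parse_term(OBO_STANZA)
print(term["id"], "->", term["name"])   # GO:0016049 -> cell growth
print(term["is_a"])                     # ['GO:0040007']
```

Real GO releases are far larger and are also published in OWL; production code would use a dedicated ontology-parsing library rather than this hand-rolled sketch.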

1.1.3 The Gene Ontology

Among many current bio-ontologies, the Gene Ontology (GO) is one of the most successful and has been widely used for data annotation, text mining, and information extraction (Bada et al.,

2004; Blaschke, Hirschman, & Valencia, 2002; Rubin et al., 2006). Founded in 1998 by three model organism databases, the GO consists of three sub-ontologies describing the cellular components, molecular functions, and biological processes of genes and gene products in a species-neutral manner, intending to provide each gene and gene product with a cellular context (Gene Ontology Consortium, 2011b, 2014g). Since then, the Gene Ontology Consortium has expanded to include more than 30 members, including research groups and major databases for plant, animal, and microbial genomes (Gene Ontology Consortium, 2014i). The GO is open access, and the Ontology data can be downloaded for free in different formats. Users can view and search the GO terms and annotations (i.e., the associations between a GO term and a gene or gene product supported by an evidence source) via GO’s official browser, AmiGO (http://amigo.geneontology.org/amigo). A GO term record contains a set of essential elements, including the GO term name and accession number (i.e., identifier), the sub-ontology to which the term belongs, the definition of the GO term and the reference source(s) of the definition, and one or more links to related terms in the Ontology (Gene Ontology Consortium, 2014p). The GO uses four types of relationships between terms: ‘is_a,’ ‘part_of,’ ‘regulates’ (positively regulates and negatively regulates), and ‘has_part’ (Gene Ontology Consortium, 2009, 2014o). A GO term record may also include optional elements, such as synonyms classified into different categories (e.g., exact, related, narrow, and broad) to aid searching, comments provided by the GO curators, the subset (e.g., prokaryote-specific) to which the GO term belongs, a link to the GONUTS wiki where users comment on the GO term, and cross-references to other databases (Gene Ontology Consortium, 2011b, 2014p).

1.1.3.1 Ontology development and maintenance. Ontology development is a long-lasting and iterative process (Gašević et al., 2009; Greenberg et al., 2012). In particular, developing bio-ontologies usually relies on curators reading and interpreting scientific literature and extracting concepts and the relationships between these concepts from the literature (Kelso et al., 2010). Given the vast number of journal articles published daily in different biological domains, this process is time-consuming and financially costly without community engagement (Greenberg et al., 2012). The difficulties of building and maintaining bio-ontologies lie in gaining community involvement and acceptance, integrating new knowledge, and reflecting established knowledge (Open Biological and Biomedical Ontologies, 2006). Rather than using the traditional deductive (i.e., top-down), inductive (i.e., bottom-up), or synthetic ontology

design approach (Holsapple & Joshi, 2002), the GO adopts a collaborative approach, involving diverse communities in collectively developing the Ontology and controlling its quality. Instead of relying on ontology engineers and computer scientists, the GO terms are created and organized by various biological communities and therefore receive general acceptance within these communities (Bada et al., 2004; Hjørland, 2007). Similar to Wikipedia, GO’s data curation discussions and review processes are open to the public, enabling error detection and correction (Stvilia, Twidale, Smith, & Gasser, 2008). The GO has created a number of data-related and software-related request trackers hosted at SourceForge (http://sourceforge.net/) to allow any individual to provide feedback on the Ontology, such as suggesting a new term or definition, reorganizing a section of the Ontology, or reporting errors or omissions in the GO annotations (Gene Ontology, 2014; Gene Ontology Consortium, 2006, 2007). The GO curators review individual requests and implement edits where appropriate.

1.1.3.2 Ontology quality evaluation. Despite the great success and acceptance of the GO in biomedicine and molecular biology, concerns remain about its quality (Rubin et al., 2006). Quality is commonly defined as ‘fitness for use’ (Juran, 1992). Data quality can be defined as “the degree the data meets the needs and requirements of the activities in which it is used” (Stvilia et al., in press). The definition of data quality encompasses a duality—subjectivity (i.e., meeting individual expectations) and objectivity (i.e., meeting the task or activity requirements) (Stvilia, Gasser, Twidale, & Smith, 2007). Hence, data quality is contextual, dynamic, and multidimensional (Stvilia & Gasser, 2008a, 2008b). When data quality is evaluated at the individual level, one’s domain knowledge and familiarity with the data repository can affect one’s quality evaluation (Stvilia, Jörgensen, & Wu, 2012).
When moved or aggregated from the individual level to the team, discipline, or community level, data quality can be understood differently (Stvilia et al., 2007). Data quality can be affected by changes to the data, to the underlying object described by the data, or to the context of creating and using the data (Stvilia & Gasser, 2008b). Therefore, data quality can be measured directly by examining the data, or indirectly by analyzing the data’s provenance, the data creator’s reputation, or the process of creating and using the data (Stvilia, 2006; Stvilia et al., 2007; Stvilia, Twidale, Smith, & Gasser, 2008). The perception and assessment of data quality vary by individual, and within and across teams, disciplines, institutions, and communities (Ball, 2010).

There are ongoing efforts to develop models and frameworks for evaluating and improving the quality of the GO. Köhler, Munn, Rüegg, Skusa, and Smith (2006) proposed two automatic metrics—circularity and intelligibility—to assess the quality of term definitions in ontologies and tested these metrics using empirical data collected from the GO. Buza, McCarthy, Wang, Bridges, and Burgess (2008) developed a composite automatic quality metric—the GO Annotation Quality (GAQ) score—to evaluate the quality of GO annotations, and tested it by measuring the annotations for chicken and mouse in the GO over a period of time. Leonelli, Diehl, Christie, Harris, and Lomax (2011) collected empirical data from the GO curators and identified five sources of quality problems: (a) mismatch between the GO representation and reality, (b) scope extension of the GO, (c) divergence in how the GO terminology is used across user communities, (d) new discoveries that change the meaning of GO terminology and relationships, and (e) the addition of new relations. Defoin-Platel et al. (2011) proposed a framework of twelve quality metrics for assessing the quality of GO’s functional annotations. However, most of these quality models and frameworks are incomplete and intuitive, based mainly on individual perceptions of quality requirements without substantial theoretical guidance for quality assessment or input from the user communities.
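To make the flavor of such automatic metrics concrete, the toy check below flags a term definition as circular when it merely reuses every word of the term's own name. This is a much simplified sketch of the idea behind a circularity metric like Köhler et al.'s (2006), not their actual measure, and the term names and definitions are invented examples.

```python
import re

def is_circular(name: str, definition: str) -> bool:
    """True if every word of the term name reappears in its definition
    (a crude signal that the definition restates the name rather than explaining it)."""
    def_words = set(re.findall(r"[a-z]+", definition.lower()))
    return all(w in def_words for w in re.findall(r"[a-z]+", name.lower()))

# Invented examples, not actual GO content.
print(is_circular("cell growth", "The growth of a cell."))              # True: circular
print(is_circular("growth", "An increase in size or mass over time."))  # False: informative
```

A real metric would also normalize morphology (e.g., 'regulate' vs. 'regulation'), weight content words over stop words, and consult the ontology's own term graph; this sketch only shows the general shape of such a check.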

1.1.4 Research Purpose and Significance

The purpose of this empirical study is to examine the data work organization of the GO, gaining an understanding of how different communities collaboratively create new GO terms; how they collectively detect, discuss, and resolve data quality problems of the GO; how they use the GO to represent and organize their data; and how they perceive the data curation skills needed for the GO. As a large-scale, open-access scientific data organization system, the GO offers collaborative data curation mechanisms that can be applied or extended to other similar ontologies and data repositories, informing system designers, data curators, and ontologists in collaboratively developing and maintaining KO systems in an efficient and less expensive way; establishing functional requirements and a quality assurance infrastructure for bio-ontologies; and formulating best practices for ontology development and KO system curation in molecular biology and biomedicine. The data curation skills identified in this study can not only inform the training of biocurators, but also give new insight into curriculum design and training in LIS and Data Science.

Devedžić (2002) stated that the process and methodology of ontology development are analogous to those of software development in that they share similar design criteria (e.g., clarity, extensibility, and coherence); require the assembly of a domain vocabulary; and define classes, objects, and hierarchies. GO’s collaborative development and maintenance mechanisms can be adopted by software development communities, particularly those of open-source software (Sandusky & Gasser, 2005). Kling (1999) defined a sociotechnical system as a complex, interdependent system consisting of people, hardware, software, techniques, support resources (e.g., training help), and information structures (e.g., rules, norms, and regulations). To some extent, the GO is a large-scale sociotechnical system comprising various biological communities collectively developing, maintaining, and using the Ontology with the support of an infrastructure. The findings of this study can also be used to inform the design and management of other large-scale sociotechnical systems. Since quality is a context-specific concept, there is no single quality assessment or assurance model that can be applied to different data repositories or ontologies (Stvilia, 2007; Stvilia et al., 2007, 2008, in press). The quality of the same dataset can be evaluated differently depending on the context in which the dataset is used and the individual or community value structures for quality. To evaluate or assure the quality of a particular data repository or ontology, one needs to conduct an empirical study of the system's data quality assurance work, developing a knowledge base of its data quality work: typologies of data activities around the system, data quality problems present in the system, tools to detect and resolve data quality problems, quality intervention actions, and norms or rules regulating the activities.
This knowledge base can then be used to guide the construction of quantitative quality assessment models and metrics for the system. The findings of this study can benefit the GO by identifying data quality issues and contradictions in its data curation work and by suggesting strategies and actions for improvement.

1.2 Research Questions
Guided by Activity Theory (Engeström, 1990; Leont'ev, 1978) and Stvilia's Information Quality Assessment Framework (Stvilia et al., 2007), this dissertation research aims to gain an understanding of GO's collaborative data curation processes and to build a knowledge base for GO's conceptual data quality model. To accomplish this purpose, it examined the following research questions:

RQ1. What are some of the activities around the GO? What are their objects (objectives)?
RQ1.1. What are some of the communities participating in these activities?
RQ1.2. What is the division of labor within these activities?
RQ1.3. What are some of the tools used in these activities?
RQ1.4. What are some of the norms and rules regulating these activities?
RQ1.5. What are some of the contradictions within and between these activities, and how are these contradictions resolved?
RQ1.6. What are some of the skills needed for data curation of the GO?
RQ2. What is the data quality structure of the GO?
RQ2.1. What are some of the types of data quality problems present in the GO?
RQ2.2. What are some of the sources of these data quality problems?
RQ2.3. What are some of the corresponding quality assurance actions taken to resolve these data quality problems?
RQ2.4. What data quality criteria are considered important for the GO?
RQ2.5. What are some of the policies, procedures, rules, or conventions for data quality assurance adopted by the GO?

1.3 Theoretical Frameworks
1.3.1 Activity Theory
This study used Activity Theory (Engeström, 1990; Leont'ev, 1978) as a methodological framework to help formulate the research questions: exploring the activities around the GO, identifying the communities contributing content and the division of labor among them, examining the tools mediating these activities, identifying the norms and rules regulating them, and understanding the contradictions positively and negatively affecting them. As a historical-cultural framework, Activity Theory also determined the research methods for this study: archival data analysis, participant observations, and qualitative interviews (Nardi, 1996). Activity Theory further served as a conceptual framework, whose concepts and principles were used to construct a coding scheme to structure, analyze, and provide new insights into the data collected for this study.
1.3.1.1 The origin of Activity Theory. Developed by the Russian cultural psychologist Lev Vygotsky and his followers, Activity Theory is a meta-theory, or body of thought, for studying human activity in a specific context (Engeström, 1990; Leont'ev, 1978). The basic unit

of analysis in Activity Theory is the human activity. Vygotsky proposed a triadic structure of human activity consisting of subject, object, and tools (Nardi, 1996). A subject can be a person or a group of persons participating in an activity. The object is the objective held by the subject that motivates an activity. The relationship between the subject and the object is mediated by tools. Vygotsky also proposed the concepts of internalization and externalization (Kaptelinin, 1996; Wilson, 2008). Internalization refers to an individual's thought activity to reason about and reconstruct external objects or to acquire new abilities. Externalization refers to the process in which people manifest, verify, or correct their mental models through external actions.
1.3.1.2 The hierarchical structure of activity. Leont'ev, Vygotsky's student and fellow researcher, developed a hierarchical structure of activity distinguishing among activity, actions, and operations (Allen, Karanasios, & Slavova, 2011; Roos, 2012; Wilson, 2008). Activity, at the top of the hierarchy, is collective in nature and driven by an object or motive; it can be divided into actions according to the division of labor. Actions, at the intermediate level of the hierarchy, are goal-directed processes that cannot be understood outside the social context of the collective activity. Operations, at the bottom of the hierarchy, are condition-dependent and usually require little conscious attention to carry out; they may be routinized or habituated behaviors. The hierarchical activity system is dynamic in that activities, actions, and operations are not immutable (Allen et al., 2011; Wilson, 2008). Over time, activities may become actions and operations. Actions can become operations through internalization, while operations can become actions through externalization. Instead of existing in isolation, an activity system may relate to other activities and be part of a larger activity system network.
1.3.1.3 Engeström's Activity Theory. Because Vygotsky's triadic representation of activity is simple and ignores the interaction between the subject and the activity environment, Engeström (1990) added two new components, community and outcome, to emphasize the social aspects of activity. Community refers to a group of people who share the same object. There are three mutual relationships among subject, object, and community (Kuutti, 1996). The relationship between subject and object is mediated by tools; the subject may transform the object into an outcome with the help of tools. The relationship between subject and community is mediated by rules, and the relationship between object and community is mediated by the division of labor. Rules are explicit or implicit norms, conventions, and regulations that enable or limit the actions, operations, and interactions within an activity system. Division of labor means "both the

horizontal division of tasks between members of the community and the vertical division of power and status" (Engeström, 1990, p. 79).
1.3.1.4 Contradictions. Another key concept for understanding and applying Activity Theory is contradictions: historically accumulated tensions or instabilities within or between activity systems that play a central role in the change, learning, and development of those activities (Allen et al., 2011; Roos, 2012). Contradictions may give rise to the cessation or transformation of existing activities, or to the formation of new ones (Turner, Turner, & Horton, 1999). Contradictions, however, are not necessarily negative: although they may appear as disruptions, conflicts, and breakdowns in an activity system, they can also be sources of innovation. To understand the development of a specific activity system, one may conduct historical and empirical analyses of the system to trace the contradictions that have occurred (Engeström, 1990).
1.3.1.5 Activity Theory for studying the data work organization of the GO. Activity Theory is in essence dynamic, evolving with its application in empirical studies (Nardi, 1996). To conceptualize the data work organization of, and activities around, the GO, this study used Engeström's (1990) version of Activity Theory as a methodological and conceptual framework, supplemented with six interrelated principles summarized by Kaptelinin (1996) to help apply the concepts of Activity Theory in data collection and analysis. Taking activity as the basic unit of analysis, Activity Theory provides a hierarchical structure and a universal language for studying and analyzing the activities of the different scientific communities using, developing, and maintaining the GO.
The concepts of internalization, externalization, and contradictions provide a lens for understanding these communities' routinization (i.e., internalization) and formulation (i.e., externalization) of new tools, rules, and mechanisms for ontology development and maintenance, and for understanding why they favor certain tools, rules, and mechanisms. Activity Theory makes it possible to connect individuals' and communities' ontology development and maintenance activities to the activities of using the GO, and to place GO's data curation processes in a social context.

1.3.2 Stvilia's Information Quality Assessment Framework
Since data quality is contextual, dynamic, and multidimensional (Stvilia & Gasser, 2008a, 2008b), this study used the theoretical Information Quality (IQ) Assessment Framework developed by Stvilia and his colleagues (2007) to help identify the data quality problems of the GO, examine how different communities collaboratively improve the quality of the GO, and

determine the ontology quality requirements of these communities. Built on Activity Theory, Stvilia's IQ Assessment Framework consists of a well-defined typology of IQ problem sources linked with affected information activities, and a taxonomy of 22 IQ dimensions along with 41 generic IQ metrics. Although it was intended for information entities, Stvilia's framework defines information as manifest in the forms of data and knowledge, and it has been applied to evaluate the quality of various types of data (e.g., biological data, condensed matter physics data, image metadata) and KO systems (e.g., biodiversity ontologies) (Huang, Stvilia, Jörgensen, & Bass, 2012; Stvilia, 2007; Stvilia et al., 2012; Stvilia et al., 2013; Stvilia et al., in press; Wu et al., 2012). Stvilia's framework provides consistent and complete logic for dealing with context sensitivity, specifies methodologies for analyzing data activities and identifying individual and community data quality value structures, allows reasoning about the sociotechnical and cognitive aspects of data quality, and presents a typology of IQ dimensions and IQ problem sources reusable in the data analysis of this study. Guided by Activity Theory (Leont'ev, 1978; Nardi, 1996) and his IQ Assessment Framework, Stvilia (2007) constructed a model to assess the quality of biodiversity ontologies by identifying the research activities and quality requirements in Morphbank, a biodiversity repository and collaboratory curating specimen images, taxonomy information, morphological characteristics, and annotations. Through content analysis of the image annotations and Web server logs of Morphbank, Stvilia (2007) identified the research activities of the biodiversity community that involve using an ontology as a tool: (a) determining the taxon of a specimen, (b) tagging part of a taxon or anomalies in a specimen, (c) evaluating the quality of a taxonomy determination, (d) finding images of a taxon, and (e) aggregating data.
These identified research activities could be mapped to all four types of information activities in Stvilia's IQ Framework, each of which may be affected by four sources of IQ problems. Based on the content analysis, Stvilia (2007) proposed a quality evaluation model for biodiversity ontologies consisting of specific IQ dimensions, metrics, and measurement costs, and suggested future research to develop IQ models for type-specific ontologies.

1.4 Research Design
This study employed a netnographic approach (Kozinets, 2010), gathering data in a natural setting via archival data analysis, participant observations, and qualitative semi-structured interviews (Blee & Taylor, 2002; Kazmer & Xie, 2008; Lincoln & Guba, 1985) to investigate the

data work organization of the GO. Kozinets (2010) defined netnography as "participant-observational research based in online fieldwork" using "computer-mediated communications as a source of data to arrive at the ethnographic understanding and representations of a cultural or communal phenomenon" (p. 60). In other words, netnography is a specialized form of ethnography adapted to study online communities or communities demonstrating important social interactions online. As mentioned above, GO's software-related and data-related request trackers at SourceForge provide different scientific communities with a platform to communicate with curators and participate in ontology development and maintenance without geographic or temporal restrictions. These request trackers are appropriate online forums for the researcher to learn about the interactions among those communities and about their practices, cultures, and collaborative patterns. This study selected GO's Ontology Requests Tracker as an online fieldsite from which to collect netnographic observational and archival data. Beyond gathering and analyzing qualitative data from this online forum, the core of netnography, participant observation, allows the researcher to experience the online social interactions and processes in the way that those communities experience them (Kozinets, 2010). Similar to ethnographic research, this study used archival data analysis and qualitative semi-structured interviews to supplement and enrich the fieldnote data collected from participant observations (Hammersley & Atkinson, 1995; Murchison, 2010; O'Reilly, 2005). The researcher first performed archival analysis of the requests submitted to GO's Ontology Requests Tracker during 2011 and 2012 to identify and become familiar with the different communities participating in developing, maintaining, and using the Ontology, gaining an understanding of community members, languages, rules, norms, and practices (Kozinets, 2010).
The researcher next conducted participant observations, becoming a registered user of the Gene Ontology Project on SourceForge and following discussions in the Ontology Requests Tracker. To gain an insider’s view, the researcher also participated in developing the GO, using it to annotate a gene with the guidance of a biologist and submitting the annotation for the GO curators to review. During participant observations, the researcher kept observational and reflective fieldnotes to record her experience as a participant researcher; her learning of community languages, practices, and rules; her interpretation of community cultures and activities; and her conceptualization of the nature of the fieldsite. The data collection concluded with qualitative semi-structured interviews with key informants from those communities—including the GO

project members, the GO content contributors, and the GO users—to allow follow-up questions developed from the archival data analysis and participant observations, and to broaden the understanding gained from the archives and fieldnotes. During participant observations and interviews, the researcher also collected and analyzed additional documents relevant to the research questions. The three research methods mentioned above were selected based on the research purpose, questions, and theoretical frameworks (Nardi, 1996). Archival data analysis and participant observations were used to answer all the research questions except the perception-related ones (RQ1.6 and RQ2.4). Qualitative semi-structured interviews were specifically used to address RQ1.3, RQ1.4, RQ1.6, RQ2.4, and RQ2.5 and to collect data not available from archives and observations, such as different communities' perceptions and requirements of ontology quality and their motivations to develop and maintain the GO.

1.5 Conclusion
This introductory chapter has presented a problem statement, the research questions, the overarching theoretical frameworks guiding this dissertation research, and the netnographic research design. Chapter 2 provides a thorough literature review of previous research relevant to KO systems, ontology development and maintenance, and Stvilia's IQ Assessment Framework. Chapter 2 also includes an extended review of Activity Theory and its applications in various domains. Chapter 3 provides more details of the research design, including the online fieldsite selected for the study, the three methods used, the research plan and procedures, and the ethical issues of conducting netnographic research. Chapter 3 also discusses quality control and the limitations of the study. Chapter 4 presents the findings of this study, organized by the three sequentially used methods: archival data analysis, participant observations, and semi-structured interviews. Chapter 5 discusses the findings in connection with each research question and relevant previous studies, and suggests future research.

CHAPTER TWO

LITERATURE REVIEW

This chapter reviews different types of knowledge organization systems, including ontologies and their representational structure. This is followed by a comparison between ontologies and other knowledge organization systems and a brief discussion of ontology development. It then introduces bio-ontologies and the Gene Ontology, and discusses the concept of data quality. This chapter continues with an extended review of the two theoretical frameworks used for this dissertation research, Activity Theory and Stvilia’s Information Quality Assessment Framework.

2.1 Knowledge Organization Systems
Libraries and librarians have a long history of developing and maintaining knowledge organization (KO) systems to support the retrieval of bibliographic collections. In a broader sense, KO is the social organization and division of the "universe of knowledge"; in a narrower sense, KO comprises the activities that librarians, archivists, information specialists, subject specialists, laymen, and computers perform to describe, index, and classify documents in libraries, museums, archives, bibliographic databases, and other kinds of "memory institutions" (Hjørland, 2008, pp. 86-87). Hjørland (2008) claimed that Library and Information Science (LIS) is a central discipline of KO in the narrower sense, dealing with KO processes and systems. The purpose of KO within LIS is to construct, apply, and evaluate KO systems for information retrieval (Hjørland, 2007). However, KO cannot be studied independently of other disciplines (Hjørland, 2008). The construction, application, and evaluation of KO systems should be connected to a specific discipline or domain and its user communities: analyzing user needs, identifying KO systems that can satisfy those needs, and implementing and maintaining KO systems in a way that can connect disparate communities (Hodge, 2000). Hjørland (2007) provided two definitions of KO systems: in a narrower, LIS-oriented sense, KO systems are related to the organization of bibliographic records; in a broader, general sense, they are "related to the organization of literatures, traditions, disciplines, and people in different cultures" (p. 369). Hodge (2000) proposed a typology of KO systems, including term lists (i.e., authority files, glossaries, dictionaries, gazetteers), classifications and categories (i.e., classification schemes, categorization schemes, taxonomies, subject headings),

and relationship lists (i.e., thesauri, ontologies, semantic networks). Hjørland (2007) argued that when KO systems are considered in the broader sense, Hodge's typology omits certain kinds of KO systems, such as encyclopedias, bibliographic databases, bibliometric maps, and the systems of scientific disciplines. Although the general purpose of KO systems is to support information retrieval, different KO systems vary in complexity, structure, and function (Hodge, 2000). This section briefly reviews the KO systems belonging to Hodge's typology as well as folksonomies, and discusses the representational structure of these KO systems. The section concludes with a brief discussion of knowledge organization as applied in scientific data management.

2.1.1 Term Lists
2.1.1.1 Authority files. In libraries, an authority record is a tool that librarians or catalogers use to "establish forms of names (for persons, places, meetings, and organizations), titles, and subjects used on bibliographic records" (Library of Congress Authorities, 2011). An authority record usually includes the controlled access point for the entity, access points for variant forms of the name, controlled access points for related entities, the rules under which the controlled access point was established, the resources consulted, and the cataloging agency that established the controlled access point (Patton, 2009). An authority file, such as the Library of Congress Name Authority File, is a list or database that assembles all the authority records used in or linked to a bibliographic catalog (Gorman, 2004). Hodge (2000) provided a broader definition of authority files as "lists of terms that are used to control variant names for an entity or the domain value for a particular field" (p. 5). In addition to persons, families, corporate bodies, and works, authority files can be used to control variant names of geographic entities (e.g., countries, cities, lakes), products, colors, and other entities.

2.1.2 Classifications and Categories
2.1.2.1 Classification schemes. Classification is the process of clustering things or concepts based on a set of predetermined rules (Kwaśnik, 1999). Classification schemes are designed to fulfill this process. Jacob (2004) defined a classification scheme as "a set of mutually exclusive and non-overlapping classes arranged within a hierarchical structure," "reflecting a predetermined ordering of reality" (p. 524). Librarians began to devise bibliographic classification schemes as early as the 16th century, most of which were based on philosophers' systems of knowledge at the time of their development (Taylor & Joudrey, 2009).

The Dewey Decimal Classification (DDC) and the Library of Congress Classification (LCC) are two of the most widely used bibliographic classification schemes in libraries, and they represent two different approaches to constructing classification schemes. The DDC is a top-down, deductive classification scheme developed by Melvil Dewey based on his conceptual framework, which means the classes of the scheme can exist even if no work has been published under a given class (Kwaśnik & Rubin, 2003). Unlike the DDC, the LCC is a bottom-up, inductive classification scheme, originally devised to accommodate an existing collection, the LC collection (Chan, 1994; Kwaśnik & Rubin, 2003). New classes are added to an inductive scheme as needed based on the concept of "literary warrant," meaning that works actually exist about those new classes (Taylor & Joudrey, 2009). Rather than being the product of a single individual, the LCC was developed in parts by different Library of Congress subject specialists at different times (Chan, 1994).
2.1.2.2 Categorization schemes. Jacob (2004) defined categorization as a "process of dividing the world into groups of entities whose members are in some way similar to each other" (p. 518). The main difference between categorization and classification is that the process of categorization is less formalized. Constructing a classification scheme usually requires a conceptual framework with a set of predetermined rules based on the consensus of a particular group of members (Allen, 2011). If a classification scheme is well developed, it can function as a theory, being descriptive, explanatory, heuristic, and robust (Kwaśnik, 1999). The process of classification is systematic and lawful (Jacob, 2004). In contrast, categorization divides objects into groups sharing "some immediate similarity within a given context" (Jacob, 2004, p. 528).
Classification schemes can also be devised to cluster objects not only on the basis of similarity but also on other types of relationships, such as structural proximity and causal relationships (Lambe, 2007).
2.1.2.3 Subject headings. In physical libraries, bibliographic classification schemes were devised both to support the retrieval of relevant items from a collection and to satisfy the need to store each item in a single location on a shelf (Hodge, 2000). Each item in a collection can be assigned to one and only one class within a classification scheme (Jacob, 2004). To enhance subject access to bibliographic collections beyond the single access points provided by bibliographic classification schemes, libraries developed subject headings, consisting of sets of controlled terms representing the subjects of items in a collection (Hodge, 2000; Jacob, 2004).

Catalogers can assign more than one controlled term from a list of subject headings to an item to represent different aspects of its subject. Compared to classification or categorization schemes and relationship lists, the structure of subject headings is relatively shallow. To represent more specific concepts, subject headings are usually coordinated or joined according to rules. The Library of Congress Subject Headings (LCSH), the Medical Subject Headings (MeSH), and the Sears List of Subject Headings are among the best-known and most widely used subject headings in libraries and databases.

2.1.3 Relationship Lists
2.1.3.1 Thesauri. A thesaurus is a list of selected terms representing single concepts, supplemented with scope notes, references, and relationships among the terms (Hjørland, 2007; Hodge, 2000). The selected terms in a thesaurus are referred to as preferred terms. A thesaurus can also contain synonyms and spelling variants of the preferred terms, which are called "lead-in terms" or non-preferred terms (Hjørland, 2007, p. 367). Unlike dictionaries, thesauri usually do not include pronunciations and definitions, but they contain qualifiers in parentheses so that each term with multiple meanings denotes a single concept. A thesaurus can be viewed as a kind of authority file, establishing a list of preferred terms and gathering synonymous terms. But a thesaurus is more than a single list of terms: it provides a structure to represent relationships among terms. Relationships in a thesaurus are usually expressed as narrower terms (NT), broader terms (BT), synonymous terms (ST), and related terms (RT) (Allen, 2011; Hodge, 2000). Thesauri can support indexers in assigning all the preferred terms and lead-in terms to a document and ensure users a more comprehensive search with higher recall (Soergel, 1995). Most thesauri were developed for a particular domain (e.g., the Art and Architecture Thesaurus), that is, for a specific discourse community sharing a body of concepts and knowledge and using language in a similar way (Hjørland & Albrechtsen, 1999).
2.1.3.2 Ontologies. The word "ontology" originated from the Greek "ontos" (being) and "logos" (word) (Gašević, Djurić, & Devedžić, 2009). In philosophy, ontology is one of the two most fundamental disciplines, concerned with questions "about what exists, about basic kinds, categories, properties, and so on" (Hjørland, 1998, p. 607).
Computer science and information science borrowed the word “ontology” from philosophy to denote a conceptual model representing objects, properties of objects, and relationships between objects in a specific domain (Chandrasekaran, Josephson, & Benjamins, 1999).

One of the most frequently cited definitions of ontology in computer science is Gruber's (1993), who briefly described it as "an explicit specification of a conceptualization" (p. 199). Conceptualization refers to an abstract, simplified view of a specific phenomenon of the world that people wish to represent for some purpose (Gašević et al., 2009). Specification means a formal, explicit, and machine-readable representation of the phenomenon using a set of classes (i.e., concepts), properties of concepts (i.e., slots), constraints on slots (i.e., facets), and relations between the concepts (Noy & McGuinness, 2001). There are a number of definitions of ontology in artificial intelligence and computing. Hendler (2001) defined an ontology as "a set of knowledge terms, including vocabulary, the semantic interconnections, and some simple rules of inference and logic for some particular topic" (p. 30). Hendler's definition characterizes an ontology as a representation vocabulary specifying concepts and the relationships among them for a particular domain. It also implies that the semantic interconnections of an ontology allow for some form of reasoning or inference. Swartout and Tate (1999) defined an ontology as something that "provides the basic structure or armature around which a knowledge base can be built" (p. 2). This definition distinguishes an ontology from a knowledge base, which uses the concepts of an ontology to represent a particular domain, for example, by describing what is true about the real world. A medical ontology, for instance, provides the definition of a particular disease but may not contain the assertion that a patient had that disease; a knowledge base developed on the basis of that ontology may contain such assertions. In other words, ontologies provide artificial intelligence systems with a foundation on which to assemble knowledge bases. Kalfoglou et al. (2001) defined ontologies as "shared views of the world" (p. 44).
This definition emphasizes that an ontology is a shared understanding of a specific domain agreed upon by groups or communities, and that it thus enables knowledge sharing and reuse as well as semantic interoperation between different intelligent and application systems. This definition also suggests that ontology creation and maintenance engage the efforts of a large number of people and/or software. In LIS, an ontology has been defined as a representation system providing information systems with a basic semantic structure to support indexing, searching, retrieval, and actionable processes (Greenberg, Murillo, & Kunze, 2012; Jacob, 2003). Greenberg et al. (2012) stated that ontologies can be as simple as the Dublin Core metadata standard, library catalogs, and abstracting databases, or as complex as the Gene Ontology.
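Noy and McGuinness's (2001) distinction above between an ontology's building blocks (classes, slots, and facets) and the knowledge-base assertions built from them can be illustrated with a minimal, hypothetical Python sketch; the class name, slots, and instance values below are invented for illustration and are not drawn from any real ontology.

```python
# A minimal, hypothetical sketch of Noy and McGuinness's (2001) building
# blocks: a class (concept) with slots (properties), each constrained by
# facets. All names and values here are invented for illustration.
disease_class = {
    "class": "Disease",
    "slots": {
        "name": {"facets": {"type": str, "cardinality": 1}},
        "affects_organ": {"facets": {"type": str, "cardinality": "many"}},
    },
}

def conforms(instance, cls):
    """Check that an instance fills every slot of the class with values
    of the type required by that slot's facets."""
    for slot, spec in cls["slots"].items():
        if slot not in instance:
            return False
        value = instance[slot]
        values = value if isinstance(value, list) else [value]
        if not all(isinstance(v, spec["facets"]["type"]) for v in values):
            return False
    return True

# A knowledge-base assertion expressed in the ontology's vocabulary:
flu = {"name": "influenza", "affects_organ": ["lung", "throat"]}
print(conforms(flu, disease_class))  # True
```

This mirrors the text's point that the ontology supplies the vocabulary and constraints, while assertions about particular entities (such as `flu`) belong to a knowledge base built on top of it.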

The above definitions imply that, regardless of field, the major purpose of ontologies is to support domain knowledge sharing and reuse. In summary, ontologies can be used to (a) facilitate collaboration and communication among interdisciplinary teams or intelligent agents, (b) allow for interoperability between different information systems or intelligent applications, (c) serve as a reliable and objective reference source for education, and (d) act as building blocks for intelligent applications (Gašević et al., 2009; Holsapple & Joshi, 2002; Noy & McGuinness, 2001). Devedžić (2002) proposed six criteria for assessing the quality of ontologies: (a) precise definitions of domain concepts, (b) consensus of knowledge from different communities, (c) expressiveness, (d) coherence and interoperability, (e) stability and scalability, and (f) a foundation for solving problems and constructing applications.
2.1.3.3 Semantic Web. Berners-Lee, Hendler, and Lassila proposed the concept of the Semantic Web in 2001 as an extension of the current World Wide Web that would bring structure and machine-understandable meaning to the contents of Web pages, creating an environment in which computers can perform tasks for users. The current Web consists mostly of unstructured documents for humans to read, while the Semantic Web is "a Web of actionable information," or data for computer programs to manipulate (Shadbolt et al., 2006, p. 96). This can be achieved using Semantic Web technologies such as the Extensible Markup Language (XML), the Resource Description Framework (RDF), Uniform Resource Identifiers (URIs), and ontologies (Berners-Lee et al., 2001; Shadbolt et al., 2006). XML specifies an encoding syntax for documents. RDF encodes relationships in sets of triples consisting of subjects, verbs, and objects. URIs identify the coded subjects, verbs, and objects in documents, enabling a unique definition of concepts.
Ontologies relate information in documents to specific structures and rules, and enable the integration or exchange of information from different documents or systems. The Semantic Web can address the increasing need in scientific communities for integrating heterogeneous datasets from disparate disciplines (Shadbolt et al., 2006).
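The triple model described above can be sketched in a few lines of Python; the URIs, predicates, and data below are hypothetical, chosen only to show how subject-verb-object statements with unique identifiers support simple pattern matching and data integration.

```python
# A minimal sketch of the RDF idea: statements as (subject, verb, object)
# triples, with URIs giving each resource a unique identity. All URIs
# and data here are invented for illustration.
triples = [
    ("http://example.org/gene/abc1",
     "http://example.org/vocab/participates_in",
     "http://example.org/go/0008150"),
    ("http://example.org/go/0008150",
     "http://example.org/vocab/label",
     "biological_process"),
]

def query(data, subject=None, verb=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in data
            if (subject is None or t[0] == subject)
            and (verb is None or t[1] == verb)
            and (obj is None or t[2] == obj)]

# Find everything asserted about the hypothetical term URI:
print(query(triples, subject="http://example.org/go/0008150"))
```

Because every resource is named by a URI, triples produced by different documents or systems can be pooled into one graph and queried uniformly, which is the integration property the Semantic Web relies on.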

2.1.4 Folksonomies
With the popularity of social media (e.g., Delicious, Flickr), a new form of KO system, the folksonomy, emerged. A folksonomy is the "resulting collective vocabulary" that emerges from social tagging, the practice of publicly and collaboratively assigning tags to resources in a shared, online environment (Trant, 2009, p. 4). People use social tagging to organize resources for future retrieval, manage personal information, express themselves, contribute or share
information, attract attention, and play or compete with others (Smith, 2008; Trant, 2009). When tags created by many people are aggregated automatically, a folksonomy is formed. Because folksonomies aggregate a large number of user-generated tags indicating how users understand and use documents, they can bring out the intentional aboutness of documents requested by users and reflect the user community's conceptual model (Beghtol, 1986; Trant, 2009). The value of folksonomies lies in providing a collaborative and democratic approach to representing and categorizing large masses of information for a large group of users at a lower cost (Mai, 2009). Folksonomies can be used to provide subject access to documents, especially when nomenclature is uncertain or evolving, a field is changing and growing rapidly, or multiple viewpoints are desirable (Smith, 2008). However, the collective intelligence in folksonomies is achieved only when a critical mass of participation is reached (Anderson, 2007). If the user community is too small to achieve this critical mass, the collaborative feature of folksonomies adds little value to existing KO systems. A growing body of research indicates that folksonomies can add value to traditional KO systems (e.g., Jörgensen, Stvilia, & Jörgensen, 2008; Rolla, 2009; Stvilia, Jörgensen, & Wu, 2012; Wetterstrom, 2008). This has been accompanied by increasing efforts to enhance subject access to bibliographic collections by deploying these collections in social media (e.g., Flickr). Several libraries use socially contributed bibliographic databases, such as LibraryThing, to organize their collections in order to improve resource discovery and facilitate information organization (Mendes, Quiñonez-Skinner, & Skaggs, 2008; Westcott, Chappell, & Lebel, 2009). Some libraries host local tagging systems and incorporate them into their online catalogs, allowing users to tag their collections and search by tags.
WorldCat, one of the world's largest online catalogs, has integrated item tagging into its catalog interface. Logged-in users can add an unlimited number of tags to any item retrieved through WorldCat (Online Computer Library Center, 2008). Research also suggests integrating folksonomies formed in social media with the Semantic Web to provide meaning and structure to Web pages at a lower cost (Gruber, 2008; Specia & Motta, 2007).

2.1.5 Structure of KO Systems

The representational structure of a KO system determines its semantic expressiveness, its ability to map a knowledge domain, and its power to support knowledge representation and discovery (Kwaśnik, 1999; Lambe, 2007). The following subsections review different types of
representational structure—lists, hierarchies, trees, paradigms, facets, and maps—of KO systems and discuss their strengths and weaknesses in knowledge representation and organization.

2.1.5.1 List. A list is "a collection of related things" that can describe relationships of commonality or similarity, collocation, sequence (e.g., a project lifecycle), chaining (e.g., cause and effect), genealogy (e.g., parent to child), and gradients in attributes (Lambe, 2007, p. 14). Dictionaries, glossaries, and thesauri are examples of KO systems expressed as alphabetical (sequenced) lists of words or terms (Hodge, 2000). Lists are the most basic building blocks of KO systems (Lambe, 2007). Although they can provide the simplest overview of a domain, lists are limited in length and detail. Other, more complex KO systems, such as trees and maps, can grow from lists when the lists become too long or complicated to manage.

2.1.5.2 Hierarchies. Hierarchies were developed based on Aristotle's classical theory of categories, which holds that an object or concept can be placed under one and only one of ten categories: substance, quantity, quality, relation, place, time, position, state, action, and affection (Taylor & Joudrey, 2009). Although hierarchies are defined differently in different disciplines, to be real or scientific hierarchies they must have the following properties: (a) inclusiveness (the top class includes or contains all the subclasses in a hierarchy); (b) one and only one relationship in a hierarchy—the is-a or generic relationship; (c) inheritance (a subclass inherits all the attributes of its superordinate classes); (d) transitivity (subclasses are members of every superordinate class); (e) predetermined, systematic, and predictable rules for grouping entities; (f) mutual exclusivity (an entity can belong to one and only one class); and (g) necessary and sufficient criteria (Kwaśnik, 1999).
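For illustration, the inheritance and transitivity properties above can be sketched in Python; the classes and attributes below are a toy example invented for this purpose:

```python
# A toy strict hierarchy: each class has exactly one is-a parent.
parent = {"robin": "bird", "penguin": "bird", "bird": "animal"}

# Attributes asserted directly on each class (invented for illustration).
attributes = {
    "animal": {"alive"},
    "bird": {"has_feathers"},
    "robin": {"red_breast"},
}

def superclasses(cls):
    """Transitivity: a class is a member of every superordinate class."""
    chain = []
    while cls in parent:
        cls = parent[cls]
        chain.append(cls)
    return chain

def inherited_attributes(cls):
    """Inheritance: a subclass carries the attributes of all its superclasses."""
    attrs = set(attributes.get(cls, set()))
    for sup in superclasses(cls):
        attrs |= attributes.get(sup, set())
    return attrs

print(superclasses("robin"))                  # ['bird', 'animal']
print(sorted(inherited_attributes("robin")))  # ['alive', 'has_feathers', 'red_breast']
```

Economy of notation follows from the same mechanism: "has_feathers" is recorded once, on "bird," yet holds for every class beneath it.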
Therefore, hierarchies are preferred structures for domains that have theoretical foundations, such as evolution in biology. The MeSH and the DDC are examples of hierarchically structured KO systems. As KO systems, hierarchies have the following strengths: (a) completeness due to predetermined rules; (b) economy of notation because of inheritance; (c) inference of attributes due to transitivity; (d) allowing real definitions owing to the is-a relationships; and (e) providing a holistic perspective of a domain (Kwaśnik, 1999). However, rigid hierarchies are rarely used to construct KO systems in practice (Lambe, 2007). The necessary and sufficient criteria and the generic relationship pose challenges for hierarchies representing less prototypical entities (e.g., robins as birds vs. penguins), multiple aspects of phenomena (e.g., dogs as
mammals vs. dogs as pets), and the ambiguity and vagueness of less developed or emerging domains (Kwaśnik, 1999). To overcome the difficulty of representing overlapping attributes of phenomena, polyhierarchies arose to accommodate more than one set of criteria for grouping entities (Kwaśnik, 1999; Lambe, 2007). Polyhierarchies can connect entities that belong to classes under different hierarchies, such as mapping dogs as mammals to dogs as pets. However, polyhierarchies may become cumbersome and lose their power as KO systems when they cross-connect too many hierarchies.

2.1.5.3 Trees. A tree is a KO structure similar to a hierarchy in that it divides and subdivides entities based on specific rules, but it does not require the is-a relationship between a subclass and its superordinate class (Kwaśnik, 1999). Trees are versatile and able to describe several types of relationships, such as whole to part, general to specific, cause and effect, and parent to child (Lambe, 2007). Trees can highlight relationships of interest, indicate distances between and the distribution of entities, and serve as powerful visualization and mnemonic devices (Kwaśnik, 1999; Lambe, 2007). However, trees are less effective and efficient in knowledge representation than hierarchies, since they cannot support inheritance or inference of entity attributes, and information flows only vertically. Furthermore, a tree can reveal only one type of relationship among entities, but very often a tree becomes impure because several types of relationships exist: in one part of the tree, a subclass is a part of its superordinate class; in another part, a subclass is a specific instance of its superordinate class (Lambe, 2007). The LCSH and the LCC are examples of KO systems with tree structures, the entities of which move from general to specific.

2.1.5.4 Paradigms/Typologies.
A paradigm is a KO structure that describes entities along two or three dimensions, revealing the presence or absence of each entity at the intersection of those dimensions, which is called a cell (Kwaśnik, 1999; Lambe, 2007). The periodic table of elements is a paradigmatic KO system with two axes, organizing the elements by their periods and their numbers of valence electrons. In the social sciences, paradigms are known as typologies or matrices. Each cell in a paradigm can be empty or can hold more than one entity. Empty cells can reveal gaps in a domain, which is useful for sense making and the discovery of new knowledge. For example, the empty cells in Mendeleev's original periodic table represented predictions that were later proven correct and have now been filled in with subsequently discovered elements. Paradigms are often used for naming and comparing entities, creating checklists or inventories, identifying issues and gaps, and describing phenomena.
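The cell structure of a paradigm, including the gap-revealing role of empty cells, can be sketched as a small two-dimensional grid. The fragment below is invented for illustration and is not a faithful excerpt of the periodic table:

```python
# A paradigm as a two-dimensional grid: cells are indexed by (row, column).
# Invented fragment for illustration; an empty cell marks a gap in the domain.
cells = {
    (1, 1): ["entity A"],
    (2, 1): ["entity B"],
    (2, 2): [],  # empty cell: a predicted but not-yet-observed entity
}

empty = [pos for pos, entities in cells.items() if not entities]
print(empty)  # [(2, 2)]
```

Enumerating the empty cells is the computational analogue of using a paradigm to identify issues and gaps in a domain.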

However, paradigms are limited to two or three dimensions and cannot describe every aspect of a phenomenon. Compared to trees, paradigms are more flexible, but they are descriptive in essence, with limited explanatory power.

2.1.5.5 Facets. To overcome the limited perspectives provided by lists, hierarchies, trees, and paradigms, the Indian library scientist Ranganathan (1967/2006) developed facets, as he believed that any entity in the world could be viewed from a number of fundamental, mutually exclusive perspectives, or facets: personality, matter, energy, space, and time. However, not all the faceted KO systems developed later used Ranganathan's facets. For example, the Classification Research Group of the United Kingdom expanded Ranganathan's five fundamental facets to thirteen categories (Broughton, 2006). Kwaśnik (1999) argued that rather than a different representational structure, faceting is a different approach to KO, known as the analytico-synthetic or post-coordinated approach. The faceted approach, or facet analysis, usually follows these steps (Kwaśnik, 1999; La Barre, 2010; Lambe, 2007): (a) defining the entities of interest; (b) choosing the facets important for description; (c) developing or expanding each facet, treating it as a mini-classification scheme; (d) analyzing the entities along each facet; and (e) combining the facets and arranging them in a sequence that makes sense to the user or system. The faceted approach was first applied in bibliographic classification, such as Ranganathan's Colon Classification, but there it faced great difficulty in determining the citation order of facets needed to place physical items in a fixed sequence (Lambe, 2007). Owing to its analytical and descriptive power, the faceted approach was more successful in subject indexing. The chronological, geographical, and form subdivisions of the LCSH can be viewed as an application of the faceted approach intended to bring out various aspects of a document's topic.
The Art and Architecture Thesaurus, one of the world's largest online thesauri, is built on seven facets: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects (Broughton, 2006; La Barre, 2010). Beyond Ranganathan's expectations, the faceted approach has found its greatest potential in the digital environment, especially on the Web (La Barre, 2010; Lambe, 2007). A large number of commercial Websites (e.g., wine.com) have developed faceted schemes to describe their products, allowing Web users to browse and retrieve product information along different dimensions. The North Carolina State University Library has implemented faceted searching and browsing in its online catalog.
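The post-coordinated combination of facets at search time, as on faceted commercial sites, can be sketched as follows; the facet names and products below are invented for illustration:

```python
# Each item is described along independent facets (invented example data).
wines = [
    {"name": "Wine A", "region": "France", "grape": "Pinot Noir", "price": "under $20"},
    {"name": "Wine B", "region": "France", "grape": "Merlot", "price": "over $20"},
    {"name": "Wine C", "region": "Chile", "grape": "Pinot Noir", "price": "under $20"},
]

def browse(**facets):
    """Post-coordination: combine any facet values at search time, in any order."""
    return [w["name"] for w in wines
            if all(w.get(f) == v for f, v in facets.items())]

print(browse(region="France"))                        # ['Wine A', 'Wine B']
print(browse(grape="Pinot Noir", price="under $20"))  # ['Wine A', 'Wine C']
```

Because the facets are combined at search time rather than in a fixed citation order, no single arrangement of the items has to be chosen in advance, which is what makes the approach so well suited to the Web.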

Unlike hierarchies and trees, faceted KO systems do not require complete domain knowledge to construct or use (Hjørland, 2008; Kwaśnik, 1999; Lambe, 2007). For example, Web users can easily navigate and understand the faceted schemes of commercial Websites. Faceted KO systems are also known to be expressive, flexible, and hospitable, and they are especially useful for describing new and emerging fields. However, faceted KO systems cannot display relationships between facets, and they have difficulty providing as good an overview of a knowledge domain as hierarchies and trees do. It is also not easy to select fundamental facets and develop their citation order.

2.1.5.6 System maps. Lambe (2007) defined system maps as "visual representations of a knowledge domain where proximity and connections between entities are used to express their taxonomic and real-world relationships" (p. 42). System maps can be used to represent physical things, serving as analogues of real-world arrangements. Human artery maps and the London Underground map are examples of descriptive system maps. System maps can also be conceptual, describing the concepts of a knowledge domain and their relationships. The nodes in a conceptual system map represent different concepts, and the branches connecting any two nodes denote relationships between the concepts. The relationship between any two concepts in a system map is flexible and can be of any kind, such as "is-a," "part of," and "regulates." WordNet and bio-ontologies are examples of conceptual system maps. Another form of system map is used to organize knowledge assets related to business or project activities into sequenced lists or interrelated clusters of stages or tasks. Project lifecycle diagrams are examples of process system maps. System maps are powerful visualization and mnemonic devices, communicating entities in a rich context and displaying a knowledge domain in a vivid and expressive way.
However, system maps may lose their visual power when describing complex, overlapping, or interacting domains.

2.1.5.7 Structure of folksonomies. Compared to hierarchies and trees, the structure of folksonomies is flat and inconsistent in granularity, but it is inclusive and flexible enough to bring out a variety of aspects of a document simultaneously (Golder & Huberman, 2006). Despite the lack of hierarchy, there are relationships between the tags in a folksonomy that can be inferred from their usage patterns (Smith, 2008). If two tags without any semantic relationship are assigned to the same document by many users, one may assume that they are related in some way. The statistical
analysis of tag co-occurrence can be used to identify relationships among tags and documents (Sigurbjörnsson & Zwol, 2008; Smith, 2008; Specia & Motta, 2007). Folksonomies are usually displayed as tag clouds in social media to facilitate browsing and keyword searching (Spiteri, 2010). In a tag cloud, the more frequently used tags appear in a larger font than the less popular ones, and the tags are usually arranged alphabetically. Popular subjects usually dominate the tag cloud, which lacks logical structure. Facets can be incorporated into folksonomies, grouping tags into a number of fundamental and mutually exclusive categories to provide some control and context to the tags and to facilitate more efficient organization, visualization, and searching of tags (Quintarelli, Resmini, & Rosati, 2007; Spiteri, 2010). The faceted approach has been used successfully in some social media, such as Delicious, to help users organize their tags into categories and enable them to search for tags related to particular categories.
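The co-occurrence analysis described above can be sketched by counting how often pairs of tags are assigned to the same document; the tagging data below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Invented tagging data: document -> tags assigned to it by users.
taggings = {
    "doc1": ["python", "programming", "tutorial"],
    "doc2": ["python", "programming"],
    "doc3": ["python", "snake"],
}

# Count each unordered pair of distinct tags appearing on the same document.
cooccurrence = Counter()
for tags in taggings.values():
    for pair in combinations(sorted(set(tags)), 2):
        cooccurrence[pair] += 1

# Frequently co-occurring tags are assumed to be related in some way.
print(cooccurrence.most_common(1))  # [(('programming', 'python'), 2)]
```

In practice, raw counts would usually be normalized or thresholded before inferring relatedness, but the principle is the same.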

2.1.6 Comparison between Ontologies and Other KO Systems

Unlike in thesauri and taxonomies, the relationship between two concepts in an ontology can be of any type; it is not limited to the Narrower Term, Broader Term, Synonymous Term, and Related Term relationships of thesauri or the 'is-a' relationship of taxonomies (Jupp et al., 2012; Lambe, 2007, p. 238). For example, the Gene Ontology mainly uses four types of relationship between terms—'is_a,' 'part_of,' 'has_part,' and 'regulates'—to encode knowledge about genes and gene products related to biological processes, molecular functions, and cellular components (Gene Ontology, 2011; Gene Ontology Consortium, 2014o). There are no preferred or authorized terms selected for concepts in ontologies (Jacob, 2003; Taylor & Joudrey, 2009). What distinguishes ontologies from other KO systems is that they impose a set of rules allowing for some form of reasoning and inference about the represented knowledge (Berners-Lee et al., 2001; Gašević et al., 2009; Hodge, 2000; Jacob, 2003; Lambe, 2007). For example, the Music Ontology may contain not only concepts about instruments but also simple rules about how to play them.

2.1.7 Knowledge Organization in Scientific Data Management

With the advent of eScience, there are urgent needs in scientific communities for KO systems (e.g., metadata schemas, ontologies, the Semantic Web) to provide access to and make sense of huge amounts of scientific data (Gray, 2007; Gray et al., 2005). Scientific research has become
more and more multi-institutional, multinational, and interdisciplinary. This creates challenges for traditional KO systems in representing interdisciplinary data and improving data and metadata interoperability across disparate vocabularies and domains (Allard, 2012). Furthermore, eScience raises expectations for the quality of scientific data and its metadata (Anderson, 2004). Assuring data quality before analyzing, preserving, and providing access to scientific data has become essential. There have been domain-specific approaches to scientific data management. Different scientific communities have formulated distinctive conceptual models, metadata schemas, and controlled vocabularies to represent and provide access to domain-specific data and entities, control variant forms of entity names, and serve as intermediaries linking to other related scientific data. For example, because of the competing nomenclature systems available for chemical substances, the chemistry community developed the Chemical Abstracts Service (CAS) Registry files (i.e., authority files for chemical names) to control synonymous chemical names by linking each CAS Registry file to a molecular structure file (Hodge, 2000). Likewise, the biological communities have a long history of developing and maintaining taxonomies and nomenclatures to identify and classify organisms, represent relationships among taxa, and ensure consistent and unique naming of taxa. Understanding the KO systems and data practices of other domains can better inform librarians, information scientists, and curators in supporting knowledge representation, communication, preservation, and discovery. Scientists have maintained a tradition of sharing their data through libraries, archives, museums, and zoos (Guterman, 2001).
With digital libraries increasingly involved in scientific data curation through institutional data repositories, librarians and information scientists can make use of their expertise in KO to help develop and maintain domain-specific KO systems, supporting the long-term access, use/reuse, and sharing of scientific data.

2.2 Ontology Development

Ontology development is a long-term, iterative process spanning a lifecycle of "design, implementation, evaluation, validation, maintenance, deployment, mapping, integration, sharing, and reuse" (Gašević et al., 2009, p. 59). Gašević et al. (2009) stated that ontology development usually requires a set of design principles, ontology development tools, development processes and activities, systematic methodologies, and supporting technologies. The following sections briefly review ontology development tools and methodologies.

2.2.1 Ontology Development Tools

Ontology development tools usually include ontology representation languages, graphical ontology development environments, and ontology-learning tools (Gašević et al., 2009).

2.2.1.1 Ontology representation languages. Ontologies can be expressed by humans in natural language as sets of declarative statements (Gašević et al., 2009). Using ontology representation languages, ontologies can be implemented in computers as text-based files, whereas using visual ontology representation languages they can be represented graphically. The early ontology representation languages, such as KIF, Ontolingua, and Loom, were logic-based. Later ones, such as SHOE and XOL, were developed to support semantic representations on the World Wide Web using a markup scheme (e.g., XML, HTML). The most recent ones, such as RDFS and OWL, were developed for ontology representation on the Semantic Web and are thus called Semantic Web languages.

2.2.1.2 Ontology development environments. Besides an ontology representation language, ontology development requires a graphical ontology editor to perform the following tasks: (a) organizing the conceptual structure of the ontology; (b) inserting concepts, properties, constraints, and relations; and (c) reconciling any semantic, syntactic, or logical inconsistencies between elements of the ontology (Gašević et al., 2009). An ontology editor working with multiple ontology representation languages and other tools (e.g., version management, domain knowledge acquisition) forms an ontology development environment, which is intended to support the entire ontology development lifecycle and the ontology's subsequent use. Developed by the Stanford Center for Biomedical Informatics Research (2014), Protégé is one of the most recent and popular ontology development environments supporting RDFS and OWL.
As open-source software, Protégé provides scientific communities with a platform to collaboratively create, edit, view, and share ontologies. Other commonly used ontology development environments include Ontolingua and Chimaera (Noy & McGuinness, 2001).

2.2.1.3 Ontology-learning tools. Some of the difficulties in ontology development and maintenance are acquiring concepts and relations from a domain, attaining consensus among community users and domain experts, and integrating updates arising from domain dynamics (Gašević et al., 2009). Ontology-learning tools were developed to overcome these difficulties by (semi)automating the process of ontology development and maintenance—extracting, annotating, and integrating information from the Web, existing databases, ontologies, and
knowledge bases. The (semi)automated process relies on natural language processing and statistical techniques. The Text-To-Onto workbench and OntoLearn are examples of ontology-learning tools.

2.2.2 Ontology Development Methodologies

Gašević et al. (2009) stated that an ontology development methodology consists of "a set of established principles, processes, practices, methods, and activities used to design, construct, evaluate, and deploy ontologies" (p. 66). There are six common approaches to developing ontologies: inspiration, induction, deduction, synthesis, collaboration, and a hybrid of these approaches (Holsapple & Joshi, 2002). The inspirational approach relies on a single developer who has an identified need and personal views with which to build an ontology. Unless the developer's personal views and recognized need align with those of the community, the ontology may not be widely accepted and adopted. Building an ontology by characterizing specific cases in a domain, the inductive approach assumes that the characterizations of those specific cases are applicable to other cases. In contrast, the deductive approach assumes that an ontology can be built by applying or adapting some generally accepted principles to specific cases. The synthetic approach constructs a unified ontology by integrating multiple existing ones. Using this approach may require identifying relevant ontologies; selecting a standard representation language; converting between ontology languages; and reconciling concepts, relations, and terminologies. The collaborative approach relies on the joint effort of different stakeholders (e.g., ontology developers, domain experts, users) who intentionally cooperate to develop an ontology that reflects their experiences, needs, and viewpoints (Holsapple & Joshi, 2002). Unlike the previous four approaches, the collaborative approach has a built-in evaluation mechanism, which enables the different parties to assess and critique the quality of the ontology.
This approach can divide the formidable ontology development process into a set of manageable tasks, engage different stakeholders in contributing content to the ontology, provide them with opportunities to exchange viewpoints and expertise, and promote ontology commitment and acceptance. However, the acceptability and quality of ontologies developed using the collaborative approach depend on "the nature of the participants, the degree of their involvement/diligence, and developer skills in overseeing the collaborative process" (Holsapple & Joshi, 2002, p. 45). Interestingly, with the popularity of Web 2.0 technologies, there has been a recent trend toward developing ontologies using folksonomies as a basis (Gašević et al., 2009). Ontologies started
from folksonomies may be easier to create. However, owing to the flat structure of folksonomies, such ontologies may lack constraints on concepts and logical relations between the concepts, and thus be difficult to deploy. Devedžić (2002) claimed that the methodologies for ontology development are analogous to those for object-oriented software development. Similarly, the open-source software communities have adopted a coordination mechanism, maintaining open repositories of bug reports and involving different stakeholders (e.g., users, software designers, software engineers) in collectively debugging and managing software quality (Sandusky & Gasser, 2005). For example, the Mozilla community—an open-source software community well known for developing the Firefox Web browser—employs a normative and collaborative software problem management process that can be decomposed into four phases. In phase one, the user or tester who identifies a software problem creates a bug report and submits it to an open bug repository (e.g., Bugzilla). In phase two, a group of software designers—playing the role of software problem managers—assesses the problem, prioritizes it, and assigns it to appropriate experts for debugging. In phase three, the assignee investigates the problem, determines its cause, evaluates the options, and resolves the bug. This phase may involve consulting with other experts, coordinating among multiple experts, voting, negotiating, and collective debugging. In phase four, the platform master or central file manager verifies that the bug has been resolved and closes the bug report.

2.3 Bio-ontologies

The advent of high-throughput techniques has led to an exponential increase in the size of molecular biological and biomedical data (e.g., gene sequences) encoded in different formats using various controlled vocabularies and stored in different databases (Rubin et al., 2006; Wu, Stvilia, & Lee, 2012). This has made it challenging for biomedical scientists to retrieve, use, analyze, and integrate data. To meet the urgent need to manage massive amounts of heterogeneous data, there has been a trend toward the development and adoption of bio-ontologies in the biomedical and molecular biological communities (Kelso, Hoehndorf, & Prüfer, 2010). A bio-ontology has been defined as a standardized, human-interpretable, and machine-processable representation of the entities, and the relationships between those entities, within a specific biological domain, providing scientists with an approach to annotating, analyzing, and integrating the results of scientific and clinical research (Rubin et al., 2006). The purposes of bio-ontologies are to (a) represent current biological knowledge, (b) annotate and organize biological data, (c) improve interoperability across biological databases, (d) turn new biological data into knowledge, and (e) assist users in analyzing data across different domains (Bard & Rhee, 2004; Gene Ontology Consortium, 2000). The development of bio-ontologies usually relies on curators reading and interpreting the scientific literature and extracting concepts, and the relationships between those concepts, from the literature (Kelso et al., 2010). Given the vast number of publications in different biological domains, this process is time-consuming and financially costly without community engagement (Greenberg et al., 2012). The proliferation of bio-ontologies, and their overlap within specific domains, poses great challenges for biologists annotating and integrating their data. The difficulties of building and maintaining bio-ontologies lie in gaining community involvement and acceptance, integrating new knowledge, and reflecting established knowledge (Open Biological and Biomedical Ontologies, 2006). To overcome these obstacles, a consortium of biomedical scientists, informaticians, and ontologists formed the National Center for Biomedical Ontology (NCBO, n.d.; Rubin et al., 2006) to fulfill the following objectives: (a) creating and providing tools for constructing, using, and maintaining ontologies in biomedicine; (b) formulating, testing, and promoting best practices of ontology development; (c) serving as a national center for evaluating the quality of biomedical ontologies; and (d) reaching out to build and support communities of ontology content creators and tool developers. Similarly, in LIS, Greenberg, Murillo, and Kunze (2012) proposed functional requirements for ontologies describing multidisciplinary scientific data.
Those requirements advocate approaches to developing ontologies using crowdsourcing and expert feedback, such as low barriers to contribution, an open review process, collaborative team review, collective ownership by user communities, and stakeholder engagement. To unify divergent ontology development efforts, the Open Biological and Biomedical Ontologies (OBO) Foundry was formed as an experimental body through which life-science ontology developers establish an evolving set of ontology design principles (Smith et al., 2007). Built on the success of the GO, the OBO (2006) principles require that ontologies be (a) open access, (b) expressed in a common shared syntax, (c) given unique identifiers, (d) versioned so that successive versions can be tracked, (e) orthogonal to other OBO ontologies, (f) clearly scoped, (g) supplied with textual definitions for terms, (h) based on unambiguously defined relations, (i) well documented, (j)
actively used, and (k) collaboratively developed with other OBO ontologies. Any community in biology or biomedicine that agrees to adopt and refine these principles is welcome to join the OBO Foundry (2014). With more and more groups joining the OBO Foundry (2012), additional principles were adopted, as follows:

• Candidate OBO ontologies are required to have a liaison with the OBO Foundry coordinating editors.
• Candidate OBO ontologies are required to have trackers for suggestions, additions, and corrections, and a helpdesk to answer users' questions.
• The textual definitions of terms are complemented with equivalent formal definitions.
• The textual definitions of terms are in genus-species form, consisting of a genus (i.e., the broader class to which the term belongs) and species (the properties distinguishing the term from others in that class).
• Each ontology term can have at most one 'is_a' parent.
• Each ontology term has a corresponding instance in reality; this principle is called instantiability.
• Ontology terms are defined using terms and relations represented in other Foundry ontologies.
• Each ontology is subject to evaluation.
• Each ontology is built on the upper-level ontology, the Basic Formal Ontology (BFO).
• Preferred terms in the ontology consist of nouns in their singular form.
• Preferred terms in the ontology are nouns or noun phrases in ordinary English, with technical extensions that have been established in the relevant domains.

2.4 The Gene Ontology

As a founding member of the OBO Foundry, the GO is one of the most successful bio-ontologies and has been widely used for data annotation, text mining, and information extraction (Bada et al., 2004; Blaschke, Hirschman, & Valencia, 2002; Rubin et al., 2006). The GO was founded in 1998 as a collaborative project among the curators of three major model organism databases:
FlyBase, the Saccharomyces Genome Database, and Mouse Genome Informatics (Gene Ontology Consortium, 2011b, 2014g). Since then, the Gene Ontology Consortium has expanded to more than 30 members, including several major repositories for plant, animal, and microbial genomes (Gene Ontology Consortium, 2014i). The GO consists of three sub-ontologies (controlled vocabularies) describing the cellular components, molecular functions, and biological processes of genes and gene products in a species-independent manner, and it intends to provide each gene and gene product with a cellular context. The GO Consortium (2014g) aims to develop the GO as a set of controlled vocabularies that can be used across different species and databases. The GO is open access, and the ontology data can be downloaded for free in different formats. Users can browse and search GO terms and annotations via the GO's official browser, AmiGO, or via browsers developed by GO Consortium members, such as QuickGO (http://www.ebi.ac.uk/QuickGO/) and the Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup/).

2.4.1 The GO Term Record

A GO term record contains a set of essential elements: the GO term name; an accession number (a unique zero-padded seven-digit identifier prefixed by ‘GO:’, e.g., GO:0005125); the sub-ontology (molecular function, biological process, or cellular component) to which the GO term belongs; the textual definition of the GO term and the reference source(s) of the definition; and one or more links relating the GO term to other terms in the Ontology (Gene Ontology Consortium, 2011b, 2014p). A GO term may contain any of the following optional elements: secondary identifiers for terms that have been merged with those having identical meanings; synonyms classified into different categories (i.e., exact, related, narrow, and broad) to aid searching; cross-references to other databases (dbxrefs); comments provided by the GO curators; the subset (e.g., prokaryote-specific) to which the GO term belongs; a link to the GONUTS wiki, where users provide the GO term with comments; and the obsolete tag indicating that the GO term has been deprecated and is no longer used. In particular, cross-references to other databases (dbxrefs) are links to identical or similar objects curated in those databases, displayed as the identifiers used in those databases, such as Enzyme Commission (EC) numbers.
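The structure of a GO term record described above can be sketched as a simple data structure. The following is a hypothetical Python sketch, not the GO’s own implementation; the class, its field names, the validation pattern, and the sample values are illustrative (only the ‘GO:’ plus seven-digit identifier format comes from the description above).

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# 'GO:' prefix followed by a zero-padded seven-digit identifier, as described above.
GO_ID = re.compile(r"^GO:\d{7}$")
SUB_ONTOLOGIES = {"molecular_function", "biological_process", "cellular_component"}

@dataclass
class GOTerm:
    # Essential elements
    accession: str
    name: str
    namespace: str                                     # the sub-ontology of the term
    definition: str
    definition_refs: list = field(default_factory=list)
    links: list = field(default_factory=list)          # relations to other GO terms
    # Optional elements
    secondary_ids: list = field(default_factory=list)  # identifiers of merged terms
    synonyms: dict = field(default_factory=dict)       # exact/related/narrow/broad
    dbxrefs: list = field(default_factory=list)        # e.g., EC numbers
    comment: Optional[str] = None
    subsets: list = field(default_factory=list)        # e.g., 'prokaryote-specific'
    is_obsolete: bool = False

    def __post_init__(self):
        # Reject records whose essential identifying elements are malformed.
        if not GO_ID.match(self.accession):
            raise ValueError(f"malformed accession: {self.accession!r}")
        if self.namespace not in SUB_ONTOLOGIES:
            raise ValueError(f"unknown sub-ontology: {self.namespace!r}")
```

A record such as `GOTerm(accession="GO:0005125", name="cytokine activity", namespace="molecular_function", definition="...")` would pass validation, while an accession like `"GO:125"` would be rejected.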

2.4.2 The GO Structure

The GO is structured as a directed acyclic graph (DAG), where terms are nodes of the graph and relations between the terms are edges of the graph (Gene Ontology Consortium, 2004). Unlike a strict hierarchy, each child term of the GO can have more than one parent. The GO primarily uses four types of relationships between terms: ‘is_a,’ ‘part_of,’ ‘regulates’ (positively regulates and negatively regulates), and ‘has_part’ (Gene Ontology Consortium, 2009, 2014o, 2014p). There is no ‘is_a’ relationship between terms from the three sub-ontologies, but they can be related with other types of relationships. For example, the molecular function term ‘cyclin-dependent protein kinase activity’ can be ‘part_of’ the biological process term ‘cell cycle.’
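The DAG structure described above can be illustrated with a small sketch: each edge carries a relationship type, and a term may reach an ancestor through more than one parent. The term names and relationships below are simplified for illustration and are not taken from the actual Ontology.

```python
# Each child term maps to a list of (parent, relationship) edges; unlike a strict
# hierarchy, a term may have several parents.
EDGES = {
    "mitochondrial translation": [("translation", "is_a"),
                                  ("mitochondrial gene expression", "part_of")],
    "translation": [("gene expression", "is_a")],
    "mitochondrial gene expression": [("gene expression", "part_of")],
}

def ancestors(term):
    """Collect every term reachable upward; the traversal terminates
    because a directed acyclic graph has no cycles."""
    seen, stack = set(), [term]
    while stack:
        for parent, _rel in EDGES.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Here `ancestors("mitochondrial translation")` reaches "gene expression" along both of its parents, illustrating why GO traversal must handle multiple paths to the same ancestor.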

2.4.3 The Development and Maintenance of the GO

Rather than adopting the traditional inductive (i.e., bottom-up), deductive (i.e., top-down), or synthetic ontology design approaches (Holsapple & Joshi, 2002), the GO employs a collaborative approach, bringing diverse communities together to collectively develop the Ontology and control its quality. Instead of relying on ontology engineers and computer scientists, the GO terms were created and organized by various biological communities and have been applied to real data. Therefore, the GO has received widespread acceptance within these communities (Bada et al., 2004; Hjørland, 2007; Kelso et al., 2010). Similar to Wikipedia, GO’s data curation discussions and review processes are open to the general public, enabling error detection and correction (Stvilia, Twidale, Smith, & Gasser, 2008). The GO has created a number of data-related and software-related request trackers hosted at SourceForge (http://sourceforge.net/) to allow any individual to provide feedback on the Ontology, such as suggesting a new term or definition, reorganizing a section of the Ontology, or reporting errors or omissions in the GO annotations (Gene Ontology, 2014; Gene Ontology Consortium, 2006, 2007). The GO curators review individual requests and implement edits where appropriate. As the GO evolves with more and more terms and relationships, developing and maintaining the Ontology becomes cumbersome. To ease or automate the process of ontology maintenance, the GO complements textual term definitions with logical definitions, partitioning the definitions into mutually exclusive internal cross-products or external cross-products (cross-referencing the GO terms to other ontologies) (Gene Ontology Consortium, 2014p; Mungall et al., 2011). The computable logical definitions are in the genus-differentia form, consisting of a genus (i.e., the broader class to which the term belongs) and differentia (a set of properties distinguishing the term from others in the class). The differentia relate the term to other terms within or external to the GO using relationships from the Relation Ontology (RO) and extensions to the RO. For example, the GO term ‘mitochondrial translation’ (GO:0032543) can be defined as ‘translation’ (GO:0006412) that ‘occurs_in’ the mitochondrion (GO:0005739). The logical definitions can be cross-products combining terms from other OBO ontologies. For example, the GO term ‘megasporocyte nucleus’ (GO:0043076) can be defined as a ‘nucleus’ (GO:0005634) that is ‘part_of’ a ‘megasporocyte’ (CL:0000320), in which ‘megasporocyte’ is a Cell Ontology term. Using logical definitions in conjunction with reasoners makes it possible to automatically spot mistakes within the GO, detect inconsistencies between the GO and other ontologies, and add missing links in the GO (Gene Ontology Consortium, 2011b; Mungall et al., 2011). For example, a set of GO terms cross-reference the Chemical Entities of Biological Interest (ChEBI), an ontology arranging chemicals into a structure. By aligning the representation of chemicals in the GO to the representation of chemicals in ChEBI, curators can detect structural inconsistencies or missing links in the GO. Besides quality control, logical definitions and internal cross-products enable logical and probabilistic inference of annotations. Meanwhile, logical definitions cross-referencing external ontologies allow visualization beyond the GO, support cross-ontology querying, and are also a powerful way to involve other communities in developing and maintaining the GO (Gene Ontology Consortium, 2011b).
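The role of logical definitions in quality control can be sketched in miniature. Below, a logical definition is a (genus, differentia) pair, and a minimal check reports terms whose genus is missing from their asserted ‘is_a’ parents, mimicking, in a greatly simplified way, how a reasoner can add missing links; the data structures are illustrative and are not the GO’s own representation.

```python
# term -> (genus, {relationship: filler}); the 'mitochondrial translation'
# example from the text: translation that occurs_in the mitochondrion.
LOGICAL_DEFS = {
    "GO:0032543": ("GO:0006412", {"occurs_in": "GO:0005739"}),
}

# Suppose the curated 'is_a' links omit the parent implied by the genus.
ASSERTED_IS_A = {"GO:0032543": set()}

def missing_is_a_links(defs, asserted):
    """A genus implies an 'is_a' parent; report every definition whose genus
    is not among the term's asserted parents."""
    return [(term, genus)
            for term, (genus, _differentia) in defs.items()
            if genus not in asserted.get(term, set())]
```

Running the check on the data above would flag GO:0032543 as missing its implied ‘is_a’ link to GO:0006412; once the link is asserted, the check reports nothing.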

2.5 Data Quality

Quality is commonly defined as ‘fitness for use’ (Juran, 1992). Data quality can be defined as “the degree the data meets the needs and requirements of the activities in which it is used” (Stvilia et al., in press). The definition of data quality encompasses a duality: subjectivity (i.e., meeting individual expectations) and objectivity (i.e., meeting the task or activity requirements) (Stvilia et al., 2007). Therefore, data quality is contextual, dynamic, and multidimensional. When data quality is evaluated at the individual level, one’s domain knowledge and familiarity with the data repository can affect one’s quality evaluation (Stvilia, Jörgensen, & Wu, 2012). When moved or aggregated from the individual level to the team, discipline, or community level, data quality can be understood differently (Stvilia et al., 2007). Data quality can be affected by changes to the data, the underlying object described by the data, or the context of creating and using the data (Stvilia & Gasser, 2008b). Therefore, data quality can be measured directly by examining the data, or indirectly by analyzing the provenance of the data, the data creator’s reputation, and the process of creating and using the data (Stvilia, 2006; Stvilia et al., 2007; Stvilia, Twidale, Smith, & Gasser, 2008). The perception and assessment of data quality vary across individuals, and within and across teams, disciplines, institutions, and communities (Ball, 2010; Stvilia et al., 2008).

2.5.1 Data Quality Assessment Models

Since quality is contextual, dynamic, and multidimensional (Stvilia & Gasser, 2008a, 2008b; Stvilia et al., 2012), no single framework or model can capture the IQ perceptions or value structures of different communities (Ball, 2010; Stvilia et al., 2008). A number of quality assessment frameworks and models have been proposed in the literature for specific types of data. Wand and Wang (1996) constructed a data quality model for information system designers, consisting of a set of quality dimensions to assess three generic categories of system design deficiencies: incomplete representation, ambiguous representation, and meaningless states. Using a theoretical approach, Wand and Wang’s (1996) model focuses only on data representation activities, and categorizes quality dimensions as internal or external, and data-related or system-related. Wang and Strong (1996) applied an empirical approach to develop a more comprehensive framework grouping data quality dimensions collected from data consumers into four categories: intrinsic, contextual, representational, and accessibility. Most of these frameworks and models have an incomplete set of quality dimensions, and fail to identify the causes of data quality problems and link them to the data activity system context (Stvilia et al., 2007). Isaac, Haslhofer, and Mader (2012) developed a quality assessment tool—consisting of computable quality-checking functions built on 15 quality issues—to evaluate and improve the quality of controlled vocabularies on the Web. Isaac et al. (2012) identified those 15 quality issues for controlled vocabularies by reviewing the literature and manually examining existing vocabularies, and categorized the issues into three groups: labeling and documentation issues (omitted or invalid language tags, incomplete language coverage, undocumented concepts, and label conflicts), structural issues (orphan concepts, weakly connected components, cyclic hierarchical relations, valueless associative relations, solely transitively related concepts, omitted top concepts, and top concepts having broader concepts), and linked data specific issues (missing in-links, missing out-links, broken links, and undefined resources). Isaac et al. (2012) tested the quality assessment tool on 15 existing controlled vocabularies on the Web, including the

Geonames ontology and MeSH, and found labeling and documentation issues and structural issues in all of these vocabularies. In the domain of molecular biology, a number of models and frameworks have been developed to evaluate the quality of KO systems. Köhler, Munn, Rüegg, Skusa, and Smith (2006) proposed two automatic IQ metrics—circularity and intelligibility—to assess the quality of term definitions in ontologies, and tested the metrics using empirical data collected from the GO. These metrics were constructed by using “the linguistic structure of the ontology entries and the WordNet ontology as a reference source to compute quality scores” (Stvilia, 2007, para. 10). Buza, McCarthy, Wang, Bridges, and Burgess (2008) developed a composite automatic IQ metric—the Gene Ontology Annotation Quality (GAQ) score—to evaluate the quality of GO annotations, and tested it by measuring the annotations for chicken and mouse in the GO over a period of time. The GAQ score is a product of the breadth of annotation (i.e., the number of GO terms assigned to each gene product) and the evidence code (i.e., an indicator of the source of an annotation) rank of the annotation. Leonelli et al. (2011) collected empirical data from the GO curators, and constructed a quality change model consisting of five sources of quality problems: (a) mismatch between the GO and reality, (b) the scope extension of the GO, (c) divergence in how the GO terms are used across communities, (d) new discoveries that change the meaning of the GO terms and their relations, and (e) addition of new relations. This quality change model was built on qualitative data collected from a small number of curators without any input from users. Defoin-Platel et al. (2011) proposed a framework with 12 quality metrics for assessing the quality of GO’s functional annotations.
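One plausible reading of the GAQ score described above can be sketched as follows. The evidence-code ranks here are hypothetical placeholders, and the actual ranking and aggregation used by Buza et al. (2008) may well differ; the sketch only illustrates the breadth-times-rank idea.

```python
# Hypothetical ranks: experimentally supported evidence codes (e.g., IDA, IMP)
# rank above computational ones (e.g., ISS, IEA). The real ranking may differ.
EVIDENCE_RANK = {"IDA": 5, "IMP": 4, "ISS": 2, "IEA": 1}

def gaq_score(annotations):
    """annotations: (go_term, evidence_code) pairs for one gene product.
    Breadth = number of distinct GO terms assigned; the score multiplies
    breadth by the mean evidence-code rank of the annotations."""
    if not annotations:
        return 0.0
    breadth = len({term for term, _ in annotations})
    mean_rank = sum(EVIDENCE_RANK.get(code, 0)
                    for _, code in annotations) / len(annotations)
    return breadth * mean_rank
```

Under these placeholder ranks, a gene product with one experimentally supported and one computationally inferred annotation scores higher than one with a single computational annotation, reflecting both breadth and evidence strength.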
Most of these frameworks and models are incomplete and intuitive, based on individual perceptions of quality requirements, and thus are unable to assess data quality in a systematic and consistent manner.

2.5.1.1 Stvilia’s Information Quality Assessment Framework. Built on Activity Theory (Engeström, 1990; Leont’ev, 1978), Stvilia and his colleagues (2007) proposed an Information Quality (IQ) Assessment Framework consisting of a well-defined typology of IQ problem sources linked with affected information activities, and a taxonomy of 22 IQ dimensions along with 41 generic IQ metrics. An IQ assessment framework refers to “a multidimensional structure consisting of general concepts, relations, classifications, and methodologies that could serve as a resource and guide for developing context-specific IQ measurement models” (Stvilia et al., p. 1722). Although it was intended for information entities, Stvilia’s IQ Assessment

Framework defines information as manifest in forms of data and knowledge, and has been applied to evaluate the quality of various types of data (e.g., biological data, condensed matter physics data, image metadata) and KO systems (e.g., biodiversity ontologies) (Huang et al., 2012; Stvilia, 2007; Stvilia et al., 2012; Stvilia et al., 2013; Stvilia et al., in press; Wu et al., 2012). Compared to previous data quality assessment frameworks or models, Stvilia’s is the most comprehensive, predictive, and reusable, providing consistent and complete logic to deal with context sensitivity; specifying methodologies to analyze an information activity system and identify users’ IQ value structure; and allowing rapid and inexpensive development of context-specific IQ measurement models (Stvilia, 2006; Stvilia et al., 2007). The generic IQ metrics have been successfully reused in several different contexts (e.g., the English Wikipedia). Section 2.7 provides a more thorough review of this Framework.

2.5.1.2 IQ Change Model. Stvilia and Gasser (2008a) used the case of the English Wikipedia to study causes and patterns of IQ variance and developed an IQ Change Model consisting of a typology of IQ change sources. To identify reasons for Wikipedia’s IQ change, Stvilia and Gasser (2008a) examined the review process logs of Featured Articles (FAs), which are perceived as Wikipedia’s best articles, and the FA Removal Candidate process logs of former FAs. To identify patterns and trends of IQ dynamics, Stvilia and Gasser (2008a) collected time series data of article attributes (i.e., the number of edits, the number of editors, and article length) from the edit histories of a set of FAs and former FAs and a random sample of 1,000 Wikipedia articles. The time series data suggested a connection between the IQ changes of articles and changes in the real-world context (e.g., a holiday or event) or the lifecycle of the entities represented by the articles.
The analysis of process logs identified three main reasons for articles losing FA status: FA criteria changes, article changes, and underlying entity changes. These findings reaffirm and correspond to the sources of IQ problems proposed by Stvilia’s IQ Assessment Framework: context changes, changes to the information entity, and changes to the underlying entity (Stvilia et al., 2007). The case study of Wikipedia also provided new insights into sources of IQ change, including culture changes, community makeup changes, activity or event changes, agent changes, and knowledge/technology/tool changes. The typology of IQ change sources allows for predicting IQ variance in information systems and data repositories and provides guidance for their IQ maintenance and resource allocation.

2.5.1.3 Value-based quality assessment model. Borrowing cost-benefit analysis from manufacturing, Stvilia and Gasser (2008b) proposed a value-based quality assessment model, quantifying quality changes and linking changes in data quality to changes in its value. The value of a data quality change can be assessed by the value change to the activity outcome, adjusted by the cost of the data quality change. The assessment of data quality change thus requires a baseline quality model for a specific data activity context. The reasons for determining a baseline quality model include the following: certain quality dimensions are more critical than others for improving data quality in a specific context; increasing quality above the baseline level might not provide additional value; and with limited resources it is unrealistic to perform uniform quality control on all the entities of a collection. The baseline quality model is used for prioritizing quality assurance or change. Stvilia and Gasser (2008b) used an aggregated collection of Simple Dublin Core metadata records to demonstrate how to construct a baseline quality model with conceptual and empirical approaches. The conceptual approach involves analyzing the metadata activity system by modeling metadata use scenarios. If a community-shared conceptual model, such as the Functional Requirements for Bibliographic Records (FRBR), exists for specific metadata activities, it can be used in the conceptual approach to construct a baseline quality model. When a representative metadata collection is available, one may use the empirical approach to construct a baseline model inferred from the statistical profile of the representative collection.
Since the conceptual data quality model may differ from a community’s active data quality model, and empirical data embodies the community’s actual data quality requirements, Stvilia and Gasser (2008b) suggested combining the conceptual and empirical approaches to develop a baseline quality model.

2.5.1.4 Cross-contextual IQ measurement model. Stvilia, Al-Faraj, and Yi (2009) used Wikipedia in three languages to explore the issues of cross-contextual IQ assessment and the reusability of IQ measurement models. They compared the FA selection criteria and policies, IQ problem types, IQ metrics, and cultural characteristics of the Arabic and Korean Wikipedias to those of the English Wikipedia. They constructed a cross-contextual IQ measurement model, using Hofstede’s cultural dimensions to examine the relationship between cultural similarity and IQ model similarity. Their findings validate the IQ Change Model’s (Stvilia & Gasser, 2008a) proposition that changes in culture or community may lead to changes in IQ value structures. Furthermore, their

findings reaffirm that IQ is dynamic and contextual, provide new insights into reusing IQ models for different user communities of an information system, and indicate the necessity of constructing community-specific IQ measurement models.

2.5.2 Scientific Data Quality Problems

Based on Stvilia’s IQ Assessment Framework (Stvilia et al., 2007), the IQ Change Model (Stvilia & Gasser, 2008a), and data use scenario analysis, Wu, Stvilia, and Lee (2012) conceptualized data and metadata quality problems in molecular biology as inconsistent mapping, incomplete mapping, and dynamic quality problems caused by context changes, changes in the entity, and changes in the entity’s metadata. One may assume that these data and metadata quality problems might be present in the GO, since the concepts represented in the GO are within the domain of molecular biology.

2.5.2.1 Inconsistent mapping. Scientists usually need to consult or collect data from multiple sources to conduct experiments, interpret results, and make predictions. For example, to study the structure of an unknown protein, researchers need to take into account several types of data, ranging from gene sequences to protein structures. However, most open-access databases curate one type of data, and no resource is available to provide one-stop shopping for all the data (Khatri et al., 2005). In order to gain a complete picture of the problem under study, researchers need to navigate from one resource to another. Therefore, it is necessary to create cross-references among related databases: a gene database needs to link to a genome database to signify the location of a gene on its genome; an mRNA database needs to link to a gene database and a protein database to indicate the gene from which an mRNA is transcribed and the protein to which an mRNA translates; and a protein database needs to link to a gene database and a protein structure database. While many open-access databases provide cross-references to related databases, each of them has its own metadata schema and identifier system. Most of the time, users have to manually convert an identifier from one database to another, or use online converters (e.g., X-REF Converter).

2.5.2.2 Incomplete mapping.
Scientists may have difficulty finding and reusing the data underlying publications due to incomplete metadata (Greenberg, 2009). Missing metadata necessary for discovering, interpreting, using, and reusing data—such as spatial and temporal coverage, specimen identity information, or contextual metadata—may hinder the research process. Publications that do not mention the version number (revision ID) of sequences can

result in ambiguous interpretations and inconsistent descriptions of the data (Dalgleish et al., 2010). Geneticists and clinicians may find that current reference sequences are missing annotations of clinically relevant transcripts (i.e., RNA sequences produced from transcription) that are essential for reporting sequence variants. Bioinformaticians doing phylogenetic analysis (i.e., studying evolutionary relatedness among groups of organisms) or phylogeographic analysis (i.e., studying geographic distributions of organisms) may spend longer than expected collecting and identifying samples from existing sequence databases due to the lack of geographic or contextual metadata describing organisms (Higgs & Attwood, 2005).

2.5.2.3 Dynamic quality problems. Proteins and genes are recommended to be named according to their functions and their homology to known proteins (Goll et al., 2010; Wain et al., 2002). Scientists, however, are still learning more about protein functions, and thus protein names need to be changed frequently to represent newly found or revised knowledge of functions. With the exponential increase of sequence data generated by high-throughput techniques, manual correction of existing problematic names is infeasible (Goll et al., 2010). As a result, several names (synonyms) are in use for the same genes and their corresponding proteins across databases and the literature. Scientists should consider all available gene or protein names when doing database or literature searches; otherwise they risk missing data. The number of cross-references among databases has increased significantly since many of them began to collaborate and share data (Fundel & Zimmer, 2006). Cross-referencing, however, may lead to data redundancy and inconsistency, because some data are stored in multiple databases and may be updated or changed asynchronously (Khatri et al., 2005).
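The synonym problem described above is commonly addressed by expanding a query with every known name before searching. The following is a minimal sketch with a hypothetical synonym table; in practice, synonym sets would come from curated resources rather than a hard-coded list.

```python
# Hypothetical synonym sets; real ones come from curated databases.
SYNONYMS = [
    {"p53", "TP53", "tumor protein p53"},
    {"Hsp70", "HSPA1A"},
]

def expand_query(name):
    """Return every known name for a gene or protein so a database or
    literature search does not miss records indexed under another synonym."""
    for names in SYNONYMS:
        if name in names:
            return sorted(names)
    return [name]  # no synonyms known; search under the given name only
```

A search built on `expand_query("TP53")` would also retrieve records indexed under "p53", whereas a literal search for one name alone would miss them.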

2.6 Activity Theory

This study used Activity Theory (Engeström, 1990; Leont’ev, 1978) as an overarching framework to help formulate research questions; make methodological decisions; and analyze, structure, and explain the netnographic data. This section reviews the origin and development of Activity Theory; introduces six interrelated principles that help apply the concepts of the Theory to data analysis; presents previous applications of the Theory in different areas; and discusses the limitations of using the Theory.

2.6.1 The Origin and Development of Activity Theory

2.6.1.1 Vygotsky’s Activity Theory. Originating in the Soviet Union, Activity Theory is generally considered to have been initiated by the Russian cultural psychologist Lev Vygotsky in the early 20th century (Allen et al., 2011; Roos, 2012). The original Activity Theory studied human consciousness, which was proposed to form during activity and to explain the nature of human behavior as consisting of activities (Wilson, 2008). The triadic structure of human activity (i.e., subject, tools, and object) can be traced to Vygotsky’s work. A subject can be a person or a group of persons participating in an activity (Nardi, 1996). The object refers to an objective held by the subject that motivates an activity. The relationship between the subject and object of an activity is mediated by tools. Due to his concentration on consciousness, Vygotsky formulated tools as psychological tools, such as languages, writings, signs, symbols, and maps (Wilson, 2008). Vygotsky also proposed the concepts of internalization and externalization. Internalization refers to an individual’s thought activity to reason about and reconstruct external objects or acquire new abilities, such as one’s activity of learning community policies and rules (Kaptelinin, 1996; Wilson, 2008). Talking or thinking to oneself and summarizing in one’s mind are examples of internalization techniques. In contrast, externalization refers to the process by which people manifest, verify, and correct their mental models through external actions, such as one’s activity of formulating new community policies and rules. The processes of internalization and externalization are not mutually exclusive but highly integrated and iterative.

2.6.1.2 Leont’ev’s Activity Theory. Since the proposition of Activity Theory, it has undergone continuous development.
Building on Vygotsky’s work, another Russian cultural psychologist, Leont’ev, who was Vygotsky’s student and fellow researcher, introduced the cultural-historical aspect and the concept of division of labor to the Theory, and developed the hierarchical structure of human activity distinguishing among activity, actions, and operations (Allen et al., 2011; Roos, 2012; Wilson, 2008). Activity, at the top of the hierarchy, is collective in nature and driven by an object (i.e., motive); an activity can be divided into actions depending on the division of labor. Actions, at the intermediate level of the hierarchy, are goal-directed processes and cannot be understood without the social context of the collective activity. Operations, at the bottom of the hierarchy, depend on the conditions under which the actions are performed. With practice, carrying out operations requires little conscious attention; they may become routinized or habituated behaviors.

Under the influence of Marxism, Leont’ev shifted the focus of Activity Theory from human consciousness to object-oriented activity and concrete tools (Wilson, 2008). Leont’ev expanded the concept of tools to include both abstract (e.g., skills, experience, laws, and procedures) and physical tools (Allen et al., 2011; Kuutti, 1996; Wilson, 2008). Tools carry the cultural-historical development of activities, both enabling and restricting the transformation process from object to outcome. Corresponding to the hierarchical structure of human activity, Wartofsky (1979) proposed a three-level hierarchical structure of tools: primary, secondary, and tertiary tools. Corresponding to the level of operations, a subject may be unconsciously aware of the primary tools that she or he is using. Secondary tools are internal or external representations of the primary tools. Whenever a subject reflects on the nature or use of a tool, it becomes a secondary tool. This transition can be observed when a subject encounters an obstacle, forcing her or his automatic operations to become goal-directed actions and reflections on the tool. As representations of the secondary tools, tertiary tools are abstract and imaginative in nature, and can only be perceived or contemplated in one’s mind. Tertiary tools, such as religious creeds and scientific paradigms, can be transmitted beyond their immediate context of use. Engeström (1990) modified Wartofsky’s (1979) hierarchy of tools, separating the secondary tools into ‘why’ and ‘how’ artifacts and naming the primary tools ‘what’ artifacts. The ‘what’ artifacts are visible, external physical entities, such as hammers and computers. The ‘how’ artifacts are somewhat visible and external, such as instructions and algorithms, articulating how to deal with the primary artifacts. They can be expressed in written formats, but can also be internalized in one’s mind, remaining invisible.
The ‘why’ artifacts are the most elusive, justifying the selection and use of specific primary artifacts. The physical formats of the ‘why’ artifacts are usually difficult to pin down.

2.6.1.3 Engeström’s Activity Theory. Because Vygotsky’s triadic representation of human activity is simple and ignores the interaction between the subject and the activity environment, Engeström (1990) added two new components—community and outcome—to emphasize the social aspects of activity. Community can be defined at different levels: it can be the immediate group or team to which the subject belongs; it can be an organization; or it can be as broad as society (Wilson, 2008). At whatever level it is defined, subjects within the same community share a common object. There are three mutual relationships between subject, object,

and community (Kuutti, 1996). As mentioned above, the relationship between subject and object is mediated by tools. A subject may transform the object into an outcome with the help of tools. The relationship between subject and community is mediated by rules, and the relationship between object and community is mediated by the division of labor. Rules refer to explicit or implicit norms, conventions, and regulations that enable or limit the actions, operations, and interactions within an activity system (Engeström, 1990). Division of labor means “both the horizontal division of tasks between the members of the community and the vertical division of power and status” (Engeström, 1990, p. 79). Engeström’s activity system indicates that the focus of Activity Theory shifted from human consciousness and individual activity to community (Wilson, 2008). Another key concept that helps in understanding and using Activity Theory is ‘context.’ Nardi (1996) claimed that “activity itself is context” (p. 76). Case (2012) broadly defined context as “the particular combination of person and situation that served to frame an investigation” (p. 13). Nardi (1996) defined context as the “relations among individuals, artifacts, and social groups” to which individuals belong (p. 69). Compared to Case’s definition, Nardi’s places an emphasis on tools and individuals’ social roles within an activity system. Context is both internal and external to individuals, involving particular objects and goals while engaging tools, other individuals, and specific settings. Activity Theory unifies the internal and external aspects of context. Allen et al. (2011) stated that context is “the legacy of past activities” and “a determinant of present activities,” and that it is not static, as “it is emergent, continually renegotiated” during activities (p. 783).
Context can only be understood through its association with relevant activities, both collective and individual; through the concepts of internalization and externalization; and by analyzing the tools, rules, norms, division of labor, and community within the activity system. In other words, context comprises the cultural-historical influences on the activity system. Activity Theory is in essence dynamic, evolving with its application in empirical studies (Nardi, 1996). This section has reviewed the origin and development of Activity Theory and identified three generations of the Theory. The earliest version—Vygotsky’s Activity Theory—studied human consciousness and proposed that it was mediated by culture, that is, by abstract tools such as language, signs, and symbols. The structural representation of Vygotsky’s Activity Theory consists of two components, subject and object, the relationship between which is mediated by tools. The second version—Leont’ev’s Activity Theory—proposed that human

consciousness was mediated by activities, and expanded the concept of tools to include both abstract and physical tools. Unlike Vygotsky’s version, which concentrated on individual activities, Leont’ev introduced the concepts of division of labor and cultural-social context, distinguishing between individual and collective activities. The latest version—Engeström’s Activity Theory—added two new components, community and outcome. The relationship between subject and community is mediated by rules and norms, and the relationship between object and community is mediated by the division of labor. Compared to the previous two versions, Engeström’s places an emphasis on identifying the interactions among a network of related activities (Wilson, 2008).

2.6.2 Principles of Activity Theory Kaptelinin (1996) specified six interrelated principles of Activity Theory. The first and most fundamental principle is the unity of consciousness and activity. Consciousness refers to the human mind; activity refers to human interaction with the external world. This principle derives from the proposition of Activity Theory that human consciousness is formed during people’s interaction with the external world, that is, during activity (Wilson, 2008). It indicates that the human mind can be understood only within the context of activity. The second principle is object-orientedness, which specifies that activity theorists view the social and cultural aspects of the environment with which human beings interact as being as objective as the physical ones. The third principle is the hierarchical structure of activity, with activities at the highest level of the hierarchy, actions at the intermediate level, and operations at the lowest level (Kaptelinin, 1996). The unit of analysis of Activity Theory is the activity, which provides the context for understanding individual actions and operations (Kuutti, 1996; Leont’ev, 1978). Activities are distinguishable by their motives or objects. Activities, actions, and operations differ in their processes: activities are oriented to motives that are impelling in themselves, actions are directed to particular goals that are auxiliary to those motives, and operations are condition-dependent and automatized. Kaptelinin (1996) pointed out that when motives, goals, and conditions are frustrated, the predictability of human behavior and emotion differs. When conditions change, people may adapt to the new conditions without even realizing the changes. Similarly, when goals are frustrated, people may set new goals without much negative emotion. However, if motives are upset, people may become distressed and their behavior may become unpredictable. The process of transforming an object into an outcome involves this hierarchical structure of activity. The hierarchical activity system is dynamic in that activities, actions, and operations are not immutable but interchangeable (Allen et al., 2011; Wilson, 2008). Over time, activities may become actions and operations: actions can become operations through internalization, while operations can become actions through externalization. An activity may not exist in isolation but relate to other activities and be part of a larger network of activity systems. The fourth principle is internalization-externalization, which holds that people’s consciousness of the external world is formed through internalizing their external activities (Wilson, 2008). Language is one of the techniques that people employ to internalize their external experience. The fifth principle is mediation, stating that tools, whether physical (e.g., hammers, computers) or abstract (e.g., language, symbols), mediate human activities (Kaptelinin, 1996). As a source of socialization carrying specific cultures and social experience, tools shape the way people act. The combination of human abilities with tools forms “functional organs,” providing new functions or strengthening existing ones (Kaptelinin, 1996, p. 109). Wilson (2008) claimed that rules, norms, and division of labor in Engeström’s structural representation of human activity could also be considered tools mediating between the subject and object. The last but not least principle is development. Activities, including each component of an activity system, are not static but under continuous development (Kaptelinin, 1996). There is a concept—contradictions—closely related to this principle.
Contradictions refer to historically accumulated tensions or instabilities within or between activity systems, and they play a central role in the change, development, and learning of activities (Allen et al., 2011; Roos, 2012). In other words, contradictions are the source of the development of activities and can explain what has motivated certain actions or operations (Kuutti, 1996). Contradictions may give rise to the cessation or transformation of existing activities, or to the formation of new activities (Turner, Turner, & Horton, 1999). Contradictions, however, are not necessarily negative: they can manifest as innovations as well as disruptions, conflicts, and breakdowns in an activity system. To understand the development of a specific activity system, one may conduct historical and empirical analyses of the system to trace the contradictions that have occurred (Engeström, 1990). Turner, Turner, and Horton (1999) stated that “an activity is a nexus with an internal structure and a location in a cultural-historical continuum wherein it developed and evolved” (p. 289). Roos (2012) claimed that the development of contradictions is “both an object of study and general research methodology in activity theory” (para. 20). Engeström (1990) classified contradictions into four categories: primary contradictions exist within each component of an activity system, secondary contradictions occur between components of the activity system, tertiary contradictions happen between different developmental phases of the activity, and quaternary contradictions appear between different but interconnected activity systems.
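Engeström’s six-component structure and four categories of contradictions can be sketched as a minimal data model. This is illustrative only; the class names, field names, and example values are mine, not part of Activity Theory’s vocabulary:

```python
from dataclasses import dataclass, field
from enum import Enum


class Contradiction(Enum):
    """Engeström's (1990) four categories of contradictions."""
    PRIMARY = "within a single component of an activity system"
    SECONDARY = "between components of one activity system"
    TERTIARY = "between developmental phases of the activity"
    QUATERNARY = "between different but interconnected activity systems"


@dataclass
class ActivitySystem:
    """Engeström's six-component structural representation of an activity."""
    subject: str                 # individual or group carrying out the activity
    object: str                  # what the activity is directed at
    outcome: str                 # what the object is transformed into
    tools: list = field(default_factory=list)              # mediate subject and object
    rules: list = field(default_factory=list)              # mediate subject and community
    community: str = ""                                    # others sharing the same object
    division_of_labor: list = field(default_factory=list)  # mediates object and community


# Hypothetical example: a student group completing an assignment
group_work = ActivitySystem(
    subject="student group",
    object="completion of the group assignment",
    outcome="the research report",
    tools=["library databases", "word processor"],
    rules=["assignment guidelines", "grading criteria"],
    community="the class and the university",
    division_of_labor=["each student drafts one section of the report"],
)
```

A sketch of this kind makes explicit which mediating relationships (tools, rules, division of labor) an analysis would need to populate before contradictions between components could be traced.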

2.6.3 Previous Applications Originating in cultural psychology, Activity Theory has been applied in a variety of other disciplines, such as education, work studies, organization studies, and information systems (Allen et al., 2011; Wilson, 2008). Although relatively little research in the domain of LIS has used Activity Theory, Wilson (2008) claimed that it is appropriate for any area in the domain. The following section reviews applications of Activity Theory in human-computer interaction (HCI), information behavior/practices, IQ assessment, and work studies. 2.6.3.1 Human-computer interaction. Nardi (1996) compared three approaches to studying context in HCI research—Activity Theory, Situated Action Models, and Distributed Cognition—and concluded that Activity Theory is the richest and most comprehensive framework. The basic unit of analysis in Situated Action Models is “the activity of persons-acting in setting” (Nardi, 1996, p. 71). As a framework for studying the relation between persons and the arena in which they act, Situated Action highlights the improvisatory nature of human activity and responses to contingent, emergent, and unique situations rather than stable, enduring patterns across situations. As a complementary approach to artificial intelligence and cognitive science, Situated Action deemphasizes plans and predefined actions, paying attention to actions in real time. Distributed Cognition is an approach to studying how knowledge is represented both internally in one’s mind and externally in the world (Nardi, 1996). Its basic unit of analysis is “a cognitive system composed of individuals and the artifacts they use” (Nardi, 1996, p. 77). The cognitive system is similar to the collective activity in Activity Theory. Distributed Cognition underscores the study of coordination among individuals and tools, such as how individuals coordinate to complete a task or how a tool facilitates communication within an organization.

Nardi (1996) claimed that Activity Theory and Distributed Cognition share some similarities and might merge over time, since they both provide a way of comparing different activities and distinguishing them by their objects or systemic goals. Activity Theory considers tools to be mediators of consciousness under human control, assimilating the historical-cultural development of humanity (Kuutti, 1996). Distributed Cognition, however, treats people and tools equally and focuses on analyzing the use of tools in real situations, ignoring the historical development of tools. In Situated Action, a subject’s behavior is situation-oriented rather than determined by his or her knowledge, interests, and intentions (Nardi, 1996). However, what constitutes a situation depends on the researcher’s definition. Unlike Activity Theory, Situated Action is concerned with analyzing the particularities of situations while ignoring the routine, predictable practices that can also be observed in human behavior, which makes cross-context comparison and generalization impossible. Situated Action relies on analyzing observational data, excluding research methods such as interviewing that involve the subject’s conscious awareness. In contrast, Activity Theory values a variety of data collection techniques—interviewing, observation, and historical documentary analysis—and aims to reveal human activities over the longer term from the subject’s perspective. Nardi (1996) suggested using Activity Theory as a backbone for studying context in HCI, combined with Situated Action and Distributed Cognition to encompass the particularities of real activity and the representation of knowledge. Nardi (1996) provided an example of applying Activity Theory to reanalyze data collected in an ethnographic study that examined how people create presentation slides and whether they prefer generic or task-specific application software.
The study found that the type of software professional slide-makers preferred depended on the desired presentation quality. The reasons for using Activity Theory to reanalyze the data included: (a) the lack of a conceptual framework to structure and explain the ethnographic data, (b) the absence of a shared vocabulary in HCI to describe and communicate the data, and (c) the bias of cognitive science toward studying end users as individuals rather than as groups. Nardi (1996) found that using the concepts of ‘object,’ ‘action,’ and ‘goal’ from Activity Theory, instead of creating concepts such as ‘task’ and ‘subtask,’ could precisely describe the data and explain that the actions involved in slide-making (using generic or task-specific software) varied with the subject’s object (creating ordinary or fancy slides). Using the concept of ‘subject’ as both an individual and a group enabled the identification that current task-specific software did not support collaborative slide-making well—an important additional finding for software design studies. 2.6.3.2 Information behavior/practices. Allen, Karanasios, and Slavova (2011) demonstrated how to apply Activity Theory in examining the collective activity of students completing a group assignment. Within this collective activity system, the group of students can be conceptualized as the subject, the completion of the group assignment as the object, the research report as the outcome, and the students’ information-seeking behavior as an action. Meanwhile, an individual student’s information-seeking behavior can be conceptualized as an activity—a unit of analysis—to enable a deeper understanding of information behavior within the context of the group’s collective activity. Within this activity system, the student becomes the subject, and the information needed for his or her section of the assignment becomes the object. This study suggests that Activity Theory empowers researchers to analyze human activity both collectively and individually in a dynamic, multi-level manner, depending on the level of abstraction selected. Allen et al. (2011) discussed how information needs could be conceptualized as the object driving information-seeking behavior. This conceptualization enables linking information needs to the context (e.g., university education, a student’s historical-cultural background) of an activity system, and provides an opportunity for understanding the causes of information needs (e.g., to achieve a higher grade, to attain a degree). Roos (2012) used Activity Theory as a conceptual framework to reanalyze data collected from survey questionnaires, semi-structured interviews, and observations about the research practices of two research groups in molecular medicine.
The purpose of the study was to examine whether Activity Theory could help in gaining an understanding of molecular medical scientists’ information practices, consisting of information needs, use, management, seeking, and giving, in context. As an interdisciplinary scientific discipline, molecular medicine “utilizes molecular and genetic techniques in the study of the biological process and mechanisms of diseases,” aiming to “provide new approaches to the diagnosis, prevention and treatment of disorders and diseases” (Roos, 2012, para. 9). Closely related to biochemistry, cell biology, molecular biology, and molecular genetics, molecular medicine is data-intensive, relying on expensive instruments and technologies to collect and analyze massive amounts of data.

Taking activity as the unit of analysis, Roos (2012) deconstructed the research work in molecular medicine into its components (i.e., subject, outcome, object, tools, rules, community, and division of labor), identified other interconnected activity systems from the data, and categorized them into the central activity (i.e., research work), object activity (i.e., preventing and treating diseases), subject-producing activity (i.e., university education), tool-producing activity (e.g., information services in libraries, bioinformatics), and rule-producing activity (e.g., management and administration of the research institute). Roos (2012) also reduced the research work in molecular medicine to a chain of actions: creating ideas, planning the experiment, laboratory work, analyzing results, and reporting results. Among these actions, creating ideas and analyzing and publishing results are the most information-intensive. Roos (2012) identified the information practices relevant to each of these actions, such as reading literature from PubMed, discussing with colleagues at conferences, and searching bioinformatics databases. Placing information practices in the broader context of the activity system of research work can make it easier to identify contradictions within and between these systems and to find solutions. Studying the interaction between the different activity systems (i.e., information practices and research work) helped identify that information practices occupied a lower level in the hierarchy of research activities in molecular medicine, playing the role of mediating tools in the research process. 2.6.3.3 IQ assessment. As mentioned above, Stvilia et al. (2007) used Activity Theory as an overarching framework to develop a theoretical IQ Assessment Framework. Activity Theory helped conceptualize IQ problems, situating them in the context of using, creating, and improving information (Stvilia, 2006).
To determine the IQ perceptions of specific communities or the IQ problems they have encountered, one may start by identifying and analyzing their information activities. Stvilia’s Framework encompasses an information activity typology capturing the shared characteristics of related activities. Each type of information activity is associated with one or several sources of IQ problems. The typology enables quick identification or prediction of IQ problems and relevant IQ assessment variables. Activity Theory also enabled deconstructing information activities into webs or systems of related elements, such as roles (subjects), community, actions, tools, and norms and conventions (including IQ requirements). One may expect that changes to any element of an information activity system may cause quality changes to the information entity (Stvilia, 2007; Stvilia & Gasser, 2008a). As mentioned above, based on Activity Theory, Stvilia and Gasser (2008a) established a typology of sources of IQ change, including culture changes, community makeup changes, activity or event changes, agent changes, and knowledge/technology/tool changes, and validated this typology through a case study of the English Wikipedia. 2.6.3.4 Work studies. Engeström (1990) demonstrated how to use Activity Theory to conduct developmental work research, using the case of general practitioners working in two health stations in Espoo, Finland. Engeström’s (1990) developmental work research consisted of three phases: (a) identifying the historical development and the current mode of the practitioners’ work activities, (b) designing a new model of work for the practitioners, and (c) testing the new model. Based on Activity Theory, Engeström (1990) developed three methodological principles: (a) using a collective, institutionally organized activity system as the unit of analysis, as it can provide seemingly random events with context and meaning; (b) studying the historical development of the activity system and its components; and (c) analyzing the inner contradictions of the activity system as the source of problems, innovations, and changes occurring in the system. To understand the development of the practitioners’ activity system, Engeström (1990) took the viewpoint of inner contradictions, analyzing the data to identify discoordinations in the doctor-patient discourse. The data collected in this study included 85 videotaped patient consultations and follow-up stimulated-recall interviews on the consultations, conducted with patients and doctors separately.
Engeström (1990) identified three secondary contradictions between components of the activity system: (a) the contradiction between tools and a new object, that is, patients’ changing or new problems could not be solved with the traditional biomedical instruments; (b) the contradiction between rules and the new object, that is, the administrative rules of the health stations delayed or discouraged solving patients’ complicated and ambiguous problems; and (c) the contradiction between division of labor and the new object, that is, patients’ complex problems could not be resolved by a single practitioner but required cooperation among the doctors in the health stations. This study provides an excellent example of applying Activity Theory to guide methodological decisions and using the concept of contradictions to help identify problems in work organization. Turner, Turner, and Horton (1999) conducted an ethnographic study of the organizational culture and working practices of a collaborative project team in a software house. Concepts of Activity Theory—actions, object, rules, norms, tools, development, mediation, transformation, and contradictions—were used to organize and analyze the rich but unstructured ethnographic data of the study, including foreground data (video recordings of team meetings and member interviews) and background data. In particular, contradictions at the four different levels were used to derive requirements for the work situation. For example, the data analysis identified a primary contradiction between tools (e.g., a notepad) used to document individual work and those (e.g., a whiteboard) used to facilitate team communication. This primary contradiction suggested developing a new tool that supports personal note-taking and enables sharing certain parts of personal notes without laborious transcription or invasion of privacy.

2.6.4 Strengths and Limitations Activity Theory has been used as a conceptual framework or semantic tool, providing researchers with concepts and structure to organize, analyze, re-examine, and gain new insight into their data. It enables deconstructing an activity into components (e.g., subject, object, tools, rules), analyzing it at different levels (collective and individual), and reducing it to a hierarchical structure (i.e., activity-actions-operations). It also serves as a unifying language, facilitating communication between different research communities and allowing the accumulation and transferability of knowledge. As a methodological framework, Activity Theory can be used to develop research instruments and formulate research questions, guiding methodological decisions. As a meta-theory, it can help develop theoretical frameworks and models. Providing a holistic approach, Activity Theory situates human activity in a specific context, drawing researchers’ attention from a stand-alone, single activity to the network of activities and to the interaction between different activity systems. Due to its origin, there are terminology problems in translating concepts of Activity Theory from Russian to English, such as activity and object (Allen et al., 2011; Kuutti, 1996). Wilson (2008) emphasized that the concept of ‘activity’ in Marxist philosophy is closely related to the concepts of ‘practice’ and ‘labor’. Activity in Russian carries the meaning of human behavior (or practices) transforming an object into an outcome, a meaning that its English equivalent does not convey. Some scholars (e.g., Nardi, 1996) understand the concept of object as an objective held by a subject that motivates an activity. Some scholars (e.g., Engeström, 1990) view the object as the raw material or problem space upon which the activity acts. Other scholars (e.g., Turner, Turner, & Horton, 1999) perceive it as an objectified motive, encapsulating both the meaning of the activity’s motive and the object upon which the activity acts. The socio-cultural concepts in Engeström’s (1990) Activity Theory—rules, norms, and division of labor—can also be perceived as tools (Wilson, 2008). These vocabulary problems pose challenges for using Activity Theory as a semantic tool to communicate, analyze, and explain data. Whether Activity Theory can be used without accepting Marxist philosophy and performing a historical analysis of capitalism is a contentious issue. Engeström (2008) claimed that Activity Theory would become “another management toolkit or another psychological approach” without “historical analysis of contradictions of capitalism” (p. 258). Activity Theory has been widely used and expanded in the fields of psychology and education, but it is not well operationalized in LIS, except in research on information systems and HCI (Allen et al., 2011; Wilson, 2008). To overcome the immaturity of Activity Theory in LIS, it has been used in combination with other theories (e.g., Huang et al., 2012; von Thaden, 2007; Wu, 2013).

2.7 Stvilia’s Information Quality Assessment Framework This study used Stvilia’s IQ Assessment Framework to help formulate RQ2 and its sub- questions, and analyze, structure, and explain part of the netnographic data (Stvilia et al. 2007). To understand and use the Framework, it is critical to identify or conceptualize the components of the Framework, analyze relationships among the components, and know about approaches to operationalizing the Framework to develop context-specific IQ measurement models. This section introduces the concepts and components of the Framework, specifies how to apply the Framework in developing context-specific IQ models, reviews previous applications of the Framework, and discusses the strengths and limitations of the Framework.

2.7.1 Concepts Nine concepts can be used to describe or contextualize Stvilia’s IQ Assessment Framework: information, IQ, IQ dimensions, IQ criteria, IQ metrics, reference bases, IQ problems, information activities, and IQ measurement models. 2.7.1.1 Information. There are no commonly agreed definitions for three closely related concepts in the field of information science: data, information, and knowledge. Stvilia et al. (2007) reviewed the definitions of these concepts provided by different researchers from varying perspectives, including Bar-Hillel’s definition from the semantics perspective, Belkin and Robertson’s definition based on information theory, Taylor’s pragmatic definition, and Buckland’s three main definitions (i.e., information-as-process, information-as-knowledge, and information-as-thing). Based on previous definitions in the literature, Stvilia et al. (2007) described a hierarchical relationship among these concepts, defining data as “a raw sequence of symbols,” information as “data plus the context of its interpretation and/or use,” and knowledge as “a stock of information internally consistent and relatively stable for a given community” (p. 1721). They emphasized that the definition of information cannot be interpreted independently of the physical, cognitive, and social properties of information (Raber, 2003), including the underlying entity represented by the data and its social, cultural, and technological environments. 2.7.1.2 Information quality. As mentioned above, quality is a multidimensional concept (Stvilia & Gasser, 2008b). Stvilia et al. (2007) reviewed the definitions of this concept proposed by different researchers from diverse viewpoints, and adopted Juran’s (1992) definition from manufacturing, “fitness for use” (as cited in Stvilia et al., 2007), as they believed this definition encompasses the duality of quality: subjectivity and objectivity. 2.7.1.3 IQ dimensions. Wang and Strong (1996) defined a data quality dimension as “a set of data quality attributes that represent a single aspect or construct of data quality” (p. 6). Likewise, Stvilia et al. (2007, p. 1722) defined an IQ dimension as “any component of the IQ concept,” intended to conceptualize “measurable variations for a single aspect” of IQ (Huang et al., 2012, p. 196). To measure the quality of an information entity, one needs to link the IQ dimensions of the entity to its measurable attributes. Stvilia et al. (2007) developed a set of 22 IQ dimensions from the IQ-related literature and classified them into three categories: intrinsic IQ, relational or contextual IQ, and reputational IQ.
The IQ dimensions that fall into the intrinsic category include accuracy/validity, cohesiveness, complexity, semantic consistency, structural consistency, currency, informativeness/redundancy, naturalness, and precision/completeness. These IQ dimensions are classified as intrinsic because their assessment is relatively context-independent and objective, measuring internal attributes of an information entity “in relation to some reference standard [e.g., dictionary] in a given culture” (Stvilia et al., 2007, p. 1724). Unlike intrinsic IQ dimensions, the assessment of relational or contextual IQ dimensions depends on the context and reference bases of a specific information activity system, which requires mapping an information entity to some external condition (Stvilia et al., 2007). The IQ dimensions belonging to the relational category include accuracy, precision/completeness, complexity, naturalness, informativeness/redundancy, relevance, semantic consistency, structural consistency, volatility, accessibility, security, and verifiability. The first nine dimensions are further classified as representational IQ, measuring how well an information entity can represent the external condition in a given information activity; the last three are infrastructure-related relational dimensions. The reputational IQ category consists of only one dimension, authority, which measures “the position of an information entity in a cultural or activity structure, often determined by its origin and record of mediation” (Stvilia et al., 2007, p. 1724). 2.7.1.4 IQ criteria. The concept of IQ criteria has appeared in a number of Stvilia’s IQ-related works, sometimes used interchangeably with the concept of IQ dimensions (e.g., Huang et al., 2012; Jörgensen, Stvilia, & Wu, 2011; Stvilia, Mon, & Yi, 2009; Stvilia, Twidale, Smith, & Gasser, 2008). 2.7.1.5 IQ metrics. IQ metrics are direct or indirect measures for assessing the quality of an information entity along a particular dimension; they are developed by identifying attributes of the entity and relating them to the identified IQ problems (Stvilia, 2006). Stvilia’s IQ Assessment Framework encompasses 41 general IQ metrics, 30 of which are entity-based and 11 of which are process-based, indirect measures (Stvilia et al., 2007). These general IQ metrics can be reused for context-specific IQ assessment. The cost of applying a specific IQ metric can be as low as an automatic or semi-automatic calculation, or as high as a manual quality evaluation. The selection of specific IQ metrics for a given IQ dimension depends on the availability of resources, the cost of measurement, the importance of the IQ dimension, and the precision required. 2.7.1.6 Reference bases.
Reference bases are the sources or contexts from which IQ dimensions are derived to evaluate the quality of an information entity (Stvilia, 2006; Stvilia, 2007). Stvilia et al. (2007) proposed two types of reference bases: (a) cultural norms, conventions, and languages, which are context-independent; and (b) the context of a particular activity system (e.g., actions, goals, tools, and roles). 2.7.1.7 IQ problems. Stvilia et al. (2007) defined an IQ problem as “occurring when the IQ of an information entity does not meet the IQ requirement of an activity on one or more IQ dimensions” (p. 1722). Stvilia’s IQ Assessment Framework identifies four sources of IQ problems: “mapping, changes to the information entity, changes to the underlying entity or condition, and context changes” (Stvilia et al., 2007, p. 1722). Similar to the information system design deficiencies (i.e., incomplete representation, ambiguous representation, and meaningless states) identified by Wand and Wang (1996), mapping-related IQ problems can occur “when there is incomplete, ambiguous, inaccurate, inconsistent, or redundant mapping between some state, event, or entity and an information entity” (Stvilia et al., 2007, p. 1722). IQ problems caused by context changes and by changes to an information entity or the underlying entity are dynamic IQ problems (Stvilia, 2007). Contextual changes, such as temporal or spatial changes to an information entity, can be cultural or sociotechnical. IQ changes can be direct (e.g., changes to an information entity) or indirect (e.g., changes to the context or to the real-world object represented by an information entity). IQ changes can be positive, eliminating or alleviating IQ problems, or negative (though not necessarily malicious), making the information entity less usable. 2.7.1.8 Information activity types. According to Activity Theory (Engeström, 1990; Leont’ev, 1978), information activities are complex webs or systems of related elements: roles (i.e., subjects, objects), actions, artifacts (tools), and norms and conventions (including IQ requirements) (Stvilia, 2007; Stvilia & Gasser, 2008a). Changes to any element of an information activity system might cause IQ changes to an information entity. Stvilia et al.
(2007) identified information activities that could be affected by IQ problems and classified them into four types: (a) representation-dependent activities, which represent another information entity or some external situation or process; (b) decontextualizing activities, which use an information entity beyond its original creation context; (c) stability-dependent activities, which rely on the stability of an information entity or the underlying entity; and (d) provenance-dependent activities, which rely on the quality of creation and mediation-process metadata. 2.7.1.9 IQ measurement models. An IQ measurement model is a conceptualized system comprising a complete but non-redundant set of IQ dimensions, along with related IQ metrics, that can capture the IQ variance of specific information entities relatively inexpensively and “aggregate and link the degrees of IQ measurements to the degrees of the activity outcomes in some systematic and sound way” (Stvilia, 2006, pp. 81-82; Stvilia, 2007). Unlike a general IQ assessment framework, an IQ measurement model is determined by local IQ needs and requirements.
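As a simplified illustration of the distinction between entity-based direct metrics and process-based indirect metrics, the following sketch operationalizes two such measures. These are not metrics taken from the Framework itself, and the sample record and field names are hypothetical:

```python
from datetime import date


def completeness(record: dict, required_fields: list) -> float:
    """Entity-based direct metric: share of required fields that are filled."""
    filled = sum(1 for f in required_fields if record.get(f) not in (None, ""))
    return filled / len(required_fields)


def currency_days(last_modified: date, today: date) -> int:
    """Process-based indirect metric: age of the entity's last revision,
    taken from process metadata rather than from the entity itself."""
    return (today - last_modified).days


# Hypothetical record with an empty 'definition' field
record = {"id": "GO:0008150", "name": "biological_process", "definition": ""}
print(round(completeness(record, ["id", "name", "definition"]), 2))  # prints 0.67
print(currency_days(date(2014, 1, 1), date(2014, 10, 24)))
```

The first metric can be computed automatically at low cost; a manual evaluation of, say, accuracy against domain literature would sit at the expensive end of the cost range described above.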

2.7.2 Components and Relationships among the Components Stvilia’s IQ Assessment Framework consists mainly of four components: sources of IQ problems, types of activities that can be prone to IQ problems, a taxonomy of IQ dimensions along with generic IQ metrics, and reference bases for the IQ dimensions. The central part of Stvilia’s IQ Assessment Framework is the taxonomy of IQ dimensions. As mentioned above, Stvilia et al. (2007) organized the IQ dimensions into three categories and proposed a set of generic IQ metrics for each dimension. The classification of IQ dimensions implies that the reference base of intrinsic IQ (measuring internal attributes of an information entity in relation to some cultural standards) is cultural norms, conventions, and languages, while the reference base of relational IQ is the context of a particular activity system. Since the authority of an information entity can be determined against internal standards or in relation to an information activity, the reference base of reputational IQ can be both cultural norms, conventions, and languages and the activity system context. The purpose of the information activity typology is to capture the shared characteristics of related activities and to enable quick identification or prediction of sources of IQ problems and relevant IQ assessment variables (Stvilia et al., 2007). Once the sources of IQ problems are known, one can identify a set of relevant IQ dimensions, along with generic IQ metrics and reference bases, from Stvilia’s IQ Assessment Framework without difficulty. Stvilia’s IQ Assessment Framework indicates that decontextualizing activities can be affected by contextual changes to an information activity (Stvilia et al., 2007). One may expect that IQ problems in the relational and reputational dimensions may occur in these activities.
Similarly, representation-dependent activities can be affected by any mapping problems between an information entity and the external condition or underlying entity it represents; these activities can be prone to IQ problems in the intrinsic and representational dimensions. Stability-dependent activities can be affected by changes in an information entity and the underlying entity, so one may predict that IQ problems in the intrinsic and representational dimensions may arise in these activities. Provenance-dependent activities involve tracing an information entity through its whole lifecycle, and thus IQ problems in any dimension of the taxonomy may influence these activities. Based on the assumption that reputation can be an indicator of other IQ dimensions, the reputational and infrastructure-related relational IQ dimensions can be used to assess IQ problems in the intrinsic and representational dimensions indirectly.

2.7.3 How to Use
Since quality is contextual and dynamic, no single IQ measurement model is applicable to all information or data entities (Stvilia et al., 2008). The quality of an information entity can be measured directly, by examining the entity itself, or indirectly, by analyzing the entity’s process metadata and the entity creator’s reputation (Stvilia, 2006; Stvilia et al., 2007; Stvilia et al., 2008). To apply Stvilia’s IQ Assessment Framework to develop a context-specific IQ measurement model, the first step is to analyze the activity system of an information entity and map the activities to the activity typology of the Framework (Stvilia et al., 2007). The next step is to identify sources of IQ problems from the Framework based on the activity types, select relevant IQ dimensions, and identify potential generic IQ metrics and reference sources. Guided by Activity Theory (Engeström, 1990; Leont’ev, 1978), the following step is to break down the activity system into actions, operations, roles, and tools, and to analyze the actions and operations through information use scenarios to identify the relationships between actions and entity attributes (Stvilia & Gasser, 2008a; Stvilia et al., 2007). Meanwhile, one may analyze the information entity itself, relating IQ problems to entity attributes. The purpose of this step is to develop activity-specific IQ metrics with estimated critical values based on the identified entity attributes and their relations to IQ problems and information activities. Since an IQ measurement model aims to capture the IQ variance of an information entity without redundancy and bias, the final step is IQ measurement aggregation based on the value structure and activity system of the information entity, using an analytical (top-down, conceptual) or empirical (bottom-up, grounded) approach (Stvilia, 2007; Stvilia et al., 2007).
Built upon an information entity’s use scenarios, the analytical approach involves analyzing the information entity, the activity system context, the IQ measurement context, the cultural norms and conventions, and the ranking of IQ dimensions (Stvilia et al., 2007). The empirical approach involves qualitative and quantitative analyses of the information entity and of users’ IQ evaluations, generating statistical profiles of the information entity and of users’ IQ value structure. The analytical approach focuses on an information entity’s use activities, while the empirical approach aims to capture users’ IQ requirements. Whether to use the analytical or the empirical approach depends on the cost of analysis, the availability of users’ IQ evaluation data, the required precision of the IQ measurement model, and the criticality of IQ measurement to an

individual or organization. Some studies combine the analytical and empirical analyses to construct a more comprehensive context-specific IQ measurement model (e.g., Huang et al., 2012; Stvilia, 2007). If the genre of an information entity is known, one may construct the IQ measurement model based on the expected structure, functionality, and representation associated with the genre. If a representative collection of information entities for a specific community is available, one may develop the IQ model by constructing a statistical profile of the collection, consisting of statistical analyses of entity attributes and relevant process metadata (Stvilia et al., 2007). When using transaction logs to construct statistical profiles, one should keep in mind that the design of the search interface may steer users toward highly accessible metadata elements (Stvilia, 2006).
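The statistical-profile step described above can be illustrated with a minimal sketch. The code below computes a per-field completeness ratio, one of the simplest generic IQ metrics, over a toy metadata collection; the field names and records are invented for illustration and do not come from any of the studies cited here.

```python
# Illustrative sketch only: a tiny "statistical profile" of a metadata
# collection, computing a per-field completeness ratio. The field names
# and records below are hypothetical examples, not real data.
from collections import Counter

def completeness_profile(records, fields):
    """Fraction of records with a non-empty value for each field."""
    counts = Counter()
    for rec in records:
        for f in fields:
            if rec.get(f):           # field present and non-empty
                counts[f] += 1
    n = len(records)
    return {f: counts[f] / n for f in fields}

records = [
    {"title": "Specimen 1", "creator": "A. Smith", "date": "1999"},
    {"title": "Specimen 2", "creator": "", "date": "2001"},
    {"title": "Specimen 3", "date": ""},
]
profile = completeness_profile(records, ["title", "creator", "date"])
# "title" is always filled; "creator" and "date" are only partially complete.
```

A real profile of the kind Stvilia et al. (2007) describe would cover many more attributes (vocabulary use, field lengths, process metadata) and a representative sample of the collection, but the principle is the same: aggregate entity attributes statistically and read quality signals off the distribution.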

2.7.4 Previous Applications
Stvilia and his colleagues have applied the Framework to determine the IQ value structures of different communities and to construct IQ measurement models for various types of information and data entities in diverse domains, including a federated metadata collection, an online collaborative encyclopedia, socially created metadata for image indexing, online consumer health information, biodiversity ontologies, genome annotation data, and condensed matter physics data. 2.7.4.1 Federated metadata collection. To test and refine his IQ Assessment Framework, Stvilia (2006) applied it to evaluate the quality of an aggregated item-level metadata collection of the Colorado Digitization Program of Cultural Heritage, and developed 13 automatic but indirect IQ measures, eight of which were based on the generic IQ metrics of the Framework (Stvilia et al., 2007). The study confirmed that the Framework can be applied to large-scale aggregated metadata repositories and that the generic IQ metrics can be reused in a particular context. 2.7.4.2 Online collaborative encyclopedia. Stvilia’s IQ Assessment Framework was also validated and operationalized on the English Wikipedia, a large-scale, online, collaborative encyclopedia (Stvilia et al., 2007; Stvilia, Twidale, Smith, & Gasser, 2005). Stvilia et al. (2005) developed seven groupings of IQ metrics to capture the IQ variance of Wikipedia articles, and tested them on a sample of random articles and featured articles (FAs). The test indicated the proposed IQ metrics

were able to discriminate high-quality Wikipedia articles from the rest, demonstrating that the Framework can guide the inexpensive and rapid development of IQ measurement models. 2.7.4.3 Socially created metadata for image indexing. Stvilia and Jörgensen (2010) evaluated the quality of socially created metadata for photos from the Library of Congress (LC) photostream on Flickr relative to two controlled vocabularies, the Thesaurus for Graphic Materials (TGM) and the Library of Congress Subject Headings (LCSH). Guided by Stvilia’s IQ Assessment Framework and Activity Theory (Engeström, 1990; Leont’ev, 1978), the study first identified the structure and patterns of Flickr member activities around the photos, with the aim of learning about the community’s quality value structure and quality assurance practices. The analysis of members’ comments indicated that their activities included disambiguating or resolving uncertainties about photo contents, evaluating metadata quality, and suggesting corrections to existing metadata. These activities revealed that members encountered intrinsic and relational metadata quality problems. As suggested by Stvilia’s Framework, the study evaluated the intrinsic quality of Flickr’s social metadata in relation to WordNet (a general reference standard) to identify the ratio of valid terms (Stvilia & Jörgensen, 2010). The study assessed the relational quality of Flickr’s social metadata relative to the context of extending controlled vocabularies, mapping the social metadata to the TGM and LCSH to determine whether it could be a source of new terms for those controlled vocabularies. The study is an excellent example of applying Stvilia’s Framework to identify a specific community’s IQ problems through analyzing its activities around information entities, and to evaluate intrinsic and relational metadata quality. 2.7.4.4 Consumer health information.
Using the empirical approach suggested by Stvilia’s IQ Assessment Framework, Stvilia, Mon, and Yi (2009) developed an IQ model for online consumer health information consisting of five IQ criteria constructs: accuracy, completeness, authority, usefulness, and accessibility. Drawing on the IQ dimensions of Stvilia’s Framework, they developed an aggregated set of healthcare IQ criteria from three perspectives: health information Web providers, consumers, and intermediaries (e.g., librarians). To elicit health information consumers’ IQ value structure, the study conducted a survey asking participants to rate the aggregated set of health IQ criteria on a five-point Likert scale, and generated five IQ criteria constructs from participants’ ratings using factor analysis. Taking the same empirical approach, the study also developed a set of five IQ marker constructs for

health Web pages, and found them correlated with the types and genres of health Web pages, suggesting future research on genre-specific health IQ measurement models. 2.7.4.5 Biodiversity ontologies. Stvilia (2007) developed a model to assess the quality of biodiversity ontologies by identifying the information activities and IQ requirements of Morphbank, a biodiversity repository and collaboratory curating specimen images, taxonomy information, morphological characteristics, and annotations. The proposed IQ model contains specific IQ dimensions, metrics, and measurement costs. Stvilia (2007) suggested future research on developing IQ models for type-specific biodiversity ontologies. 2.7.4.6 Genome annotation work. Huang et al. (2012) proposed a data quality model and a data quality assurance skills model characterizing genome annotation work, using the analytical and empirical approaches suggested by Stvilia’s IQ Assessment Framework. Unlike Stvilia’s previous IQ research, this study applied the Framework to develop a data quality assurance skills model, suggesting the added value and extensibility of the Framework for conceptualizing data quality related activities other than IQ measurement. 2.7.4.7 Condensed matter physics data. Stvilia and his colleagues (in press) recently used an online survey to identify the data quality perceptions and priorities of the condensed matter physics community gathered around the National High Magnetic Field Laboratory in Tallahassee, Florida. Based on the IQ dimensions of Stvilia’s IQ Assessment Framework, the findings of semi-structured interviews with key informants of the community (Stvilia et al., 2013), and a literature review, Stvilia et al. (in press) defined 14 data quality dimensions and asked the survey participants to rate their importance on a 7-point Likert scale from extremely unimportant to extremely important.
Using factor analysis, the study developed a model of data quality perceptions of the community consisting of four quality constructs: accuracy, accessibility, informativeness, and stability.
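The factor-analytic step used in these studies can be sketched informally. The code below generates synthetic Likert-style ratings driven by two hidden constructs and shows the correlational structure that a factor analysis exploits: items reflecting the same construct correlate strongly with one another and weakly with the rest. The data, item counts, and constructs are all invented for illustration; the published studies used real survey responses and a full factor analysis with rotation, not this simplified stand-in.

```python
# Illustrative sketch only: synthetic ratings where two latent quality
# constructs drive six observed survey items (three items each). This is
# not a reimplementation of any cited study's analysis.
import numpy as np

rng = np.random.default_rng(0)
n = 200                                          # hypothetical respondents
latent = rng.normal(size=(n, 2))                 # two latent constructs
loadings = np.array([[1, 0], [1, 0], [1, 0],     # items 1-3 -> construct 1
                     [0, 1], [0, 1], [0, 1]],    # items 4-6 -> construct 2
                    dtype=float)
ratings = latent @ loadings.T + 0.3 * rng.normal(size=(n, 6))

corr = np.corrcoef(ratings, rowvar=False)
# Mean correlation among items sharing a construct vs. across constructs.
within = np.mean([corr[0, 1], corr[0, 2], corr[1, 2],
                  corr[3, 4], corr[3, 5], corr[4, 5]])
cross = np.abs(corr[:3, 3:]).mean()
# `within` is high and `cross` is low: factor analysis recovers the two
# constructs from exactly this block structure in the correlation matrix.
```

In the studies above, the extracted factors were then interpreted and named (e.g., accuracy, accessibility) based on which survey items loaded on each factor.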

2.7.5 Strengths and Limitations
As a theoretical framework in information science, Stvilia’s IQ Assessment Framework takes into account the physical, cognitive, and social properties of information (Raber, 2003), defining information as manifest in the forms of data and knowledge. It has been applied to evaluate the quality of various types of data (e.g., biological data, condensed matter physics data, image metadata), information entities (e.g., consumer health information, health Web pages, Wikipedia articles), and KO systems (i.e., biodiversity ontologies). Using Activity Theory (Engeström,

1990; Leont’ev, 1978) allows the Framework to reason about the sociotechnical and cognitive aspects of IQ variance that are missing from the IQ frameworks and models developed in manufacturing. Compared to previous IQ assessment frameworks, Stvilia’s is the most comprehensive, predictive, and reusable, providing solid theoretical guidance for activity-specific IQ measurement aggregation and allowing rapid and inexpensive development of context-specific IQ measurement models (Stvilia, 2006; Stvilia et al., 2007). The Framework has been operationalized in different settings (e.g., an online collaborative encyclopedia, an aggregated digital repository) and domains (e.g., molecular biology, condensed matter physics, healthcare). Most IQ metrics of the Framework can be computed automatically and have been reused in several contexts. Stvilia and his colleagues have further extended the Framework to develop several conceptual quality assessment models that can be reused in different settings, including a value-based quality assessment model (Stvilia & Gasser, 2008b), an IQ change model (Stvilia & Gasser, 2008a), and a cross-sociocultural IQ measurement model (Stvilia, Al-Faraj, & Yi, 2009). Published in the Journal of the American Society for Information Science and Technology (JASIST) in 2007, Stvilia’s IQ Assessment Framework is relatively young and has primarily been used by Stvilia and his colleagues and students at Florida State University and the University of Illinois at Urbana-Champaign. However, with Stvilia’s ongoing research to validate and extend the Framework, increasing numbers of scholars and scientists are using it in their own areas of research. For example, recent work by Paul Conway (2011) of the University of Michigan adopted Stvilia’s Framework to construct archival quality measures for digitized books and journals. As of January 2014, the JASIST article presenting Stvilia’s Framework had received 177 citations in Google Scholar.
Due to resource limitations, some of the IQ metrics proposed in Stvilia’s IQ Assessment Framework are indirect measures (e.g., the English Wikipedia’s IQ metrics) or relatively expensive to compute, involving manual intervention. To avoid interdependencies among IQ metrics, operationalization requires the additional technique of factor analysis to identify groupings of related metrics for IQ dimensions. Although Stvilia’s Framework is robust for developing context-specific IQ measurement models for different types of information and data entities, it has mainly been applied in the domains of library, museum, and archive studies; healthcare; biology; and condensed matter physics (Ochoa & Duval, 2009). Future research is

needed to test the Framework’s reusability for assessing IQ variance in other disciplines and communities, such as biochemistry and engineering.

2.8 Conclusion
This chapter introduces different types of KO systems, including ontologies, and discusses the problems with ontology development and maintenance. This chapter also reviews the concepts of Activity Theory, the origin and development of the Theory, previous applications of the Theory in different domains, and the strengths and limitations of using the Theory. Similarly, this chapter provides a thorough review of the other theoretical framework used in this dissertation research, Stvilia’s IQ Assessment Framework. The next chapter presents details of the research design.

CHAPTER THREE

METHODS

This chapter begins by presenting the research purpose and questions of this study. It then describes netnography, compares it with ethnography, and introduces the online fieldsite selected for the study. The chapter continues by presenting a research plan with details about how archival data analysis, participant observations, and qualitative semi-structured interviews were conducted and how the data were analyzed. It also covers discussions of ethical issues, quality control, and limitations of the study.

3.1 Research Questions
Guided by Activity Theory (Engeström, 1990; Leont’ev, 1978) and Stvilia’s IQ Assessment Framework (Stvilia et al., 2007), the purpose of this dissertation research is to examine the data work organization of the GO and to build a knowledge base of GO’s conceptual data quality model, gaining an understanding of how different communities use the GO to represent and organize their data; how they collaboratively create new GO terms; how they collectively detect, discuss, and resolve the data quality problems in the GO; and their quality requirements for bio-ontologies. To accomplish this purpose, this dissertation examined the following research questions:

RQ1. What are some of the activities around the GO? What are their objects (objectives)?
RQ1.1. What are some of the communities participating in these activities?
RQ1.2. What is the division of labor within these activities?
RQ1.3. What are some of the tools used in these activities?
RQ1.4. What are some of the norms and rules regulating these activities?
RQ1.5. What are some of the contradictions within and between these activities and how are these contradictions resolved?
RQ1.6. What are some of the skills needed for data curation of the GO?
RQ2. What is the data quality structure of the GO?
RQ2.1. What are some of the types of data quality problems present in the GO?
RQ2.2. What are some of the sources of these data quality problems?
RQ2.3. What are some of the corresponding quality assurance actions taken to resolve these data quality problems?

RQ2.4. What data quality criteria are considered important for the GO?
RQ2.5. What are some of the policies, procedures, rules, or conventions for data quality assurance adopted by the GO?

3.2 Research Design
This dissertation research employed the netnographic approach (Kozinets, 2010), gathering data in a natural setting via archival data analysis, participant observations, and qualitative semi-structured interviews (Blee & Taylor, 2002; Kazmer & Xie, 2008; Lincoln & Guba, 1985) to investigate the data work organization of the GO. Netnography can be defined as “participant-observational research based in online fieldwork” using “computer-mediated communications as a source of data to arrive at the ethnographic understanding and representations of a cultural or communal phenomenon” (Kozinets, 2010, p. 60). In other words, netnography is a specialized form of ethnography—Internet ethnography—adapted to study online communities or communities demonstrating important social interactions online.

3.2.1 Ethnography and Netnography
3.2.1.1 Ethnography. Ethnography can be broadly defined as a methodology or set of methods involving “the ethnographer participating, overtly or covertly, in people’s daily lives for an extended period of time, watching what happens, listening to what is said, asking questions,” and “collecting whatever data are available to throw light on the issues that are the focus of the research” (Hammersley & Atkinson, 1995, p. 1). Since its emergence at the end of the 19th century and the beginning of the 20th century, ethnography has been applied in different ways in diverse disciplines with distinct traditions, and it has become difficult to define ethnography or to draw a distinction between it and other methods. Instead of defining it, O’Reilly (2005) summarized the minimum requirements for ethnography as: (a) involving an iterative-inductive research process; (b) employing a family of research methods; (c) having direct and sustained contact with human beings in the context of their daily lives; (d) watching, listening, and enquiring; (e) producing a descriptive account that “respects the irreducibility of human experience” and “acknowledges the role of theory” and researcher; and (f) envisioning human beings as “part object/part subject” (p. 3). Originating in anthropology, early ethnographies were characterized by white male ethnographers from the United States and Europe moving to a remote fieldsite for an extended

period of time (usually a year or longer) to study an isolated or marginalized group of people, such as the Nuer and the Trobriand Islanders (Murchison, 2010; O’Reilly, 2005). Usually without prior knowledge of the group under study, early ethnographers picked up the local language, built rapport with the natives, observed and participated in their daily lives, took fieldnotes, and employed other techniques such as drawing maps and kinship charts and collecting artifacts and archives. Over time, early ethnographers were assumed to gain an insider’s view of the group’s behavior and thought, and to produce a holistic account of a seemingly well-defined and homogeneous group. However, early ethnographies were criticized as ethnocentric and overly holistic, ignoring the dynamics and variations within groups and between individuals. In the 20th century, the Chicago School adapted ethnography in sociology to study local communities (e.g., immigrants) in the city of Chicago (Murchison, 2010; O’Reilly, 2005). The fieldsite, which refers to “a naturally occurring setting” (Brewer, 2000, p. 10), was no longer limited to remote, strange, and exotic villages, but could be an urban or rural setting. Ethnography became a tool enabling sociologists to study particular topics or groups (e.g., the homeless) that could not be researched using traditional sociological methods such as surveys. Ethnographers, becoming attentive to group dynamics and changes within cultures and societies, began incorporating the historical approach, studying historical documents and oral histories. Ethnographers today extend the fieldsite to everyday settings, such as hospitals, school playgrounds, and corporate boardrooms, focusing on gaining an understanding of individuals’ actions (O’Reilly, 2005).
Contemporary ethnographies tend to be small scale and flexible, allowing the research design to evolve through the study and using a wide range of methods beyond participating, observing, and interviewing. Besides anthropology and sociology, ethnography has now been applied in education, psychology, health studies, and human-computer interaction (HCI) research. Despite the development of ethnography in different disciplines, contemporary ethnographers still maintain the tradition of conducting research in a natural setting over time, demonstrating interest in the social and cultural aspects of human life and interaction, using participation and observation, and embracing multiple research methods rather than relying on one source of data (Murchison, 2010; Wolcott, 2003). 3.2.1.2 Netnography. As a relatively new form of ethnography in the Internet age, netnography has been applied in consumer and marketing research (Kozinets, 2010). Similar to

ethnography, netnography is naturalistic, intending to capture the meaning of human lives in a natural setting; immersive, requiring the researcher to invest a significant amount of time to interact with and become a part of an online community; descriptive, seeking a thick description of community cultures and practices; multi-method, extending beyond participant observation to include other research techniques (e.g., face-to-face interviews, archival data analysis); and adaptable, continually being attuned to suit the research questions, the researcher’s skills, and the fieldsite. The distinction between netnography and ethnography is increasingly vague because much of today’s communication no longer takes place merely face-to-face or over the telephone (Garcia, Standlee, Bechkoff, & Cui, 2009). Most contemporary ethnographies cannot be conducted without examining computer-mediated communication (CMC), such as email, Websites, and instant messaging (IM). However, Kozinets (2010) claimed that netnography has a unique set of techniques and procedures that differentiate it from ethnography. The most pronounced characteristic of netnography is that the mode of interaction has changed from face-to-face communication to CMC, presenting new challenges to researchers. In synchronous CMC (e.g., IM, chat rooms), interactions tend to be longer and more fragmented than face-to-face exchanges, and may be prone to interruptions, lapses, noise, and technical malfunctions (Kazmer & Xie, 2008; Kozinets, 2010). In asynchronous CMC (e.g., email, forums), interactions tend to be text-based, lengthy, and more artificial, and participants have more control over their messages and self-presentation than in face-to-face exchanges. Instead of picking up a local language, netnography requires the researcher to learn the specific codes, rules, emoticons, abbreviations, and commands used in CMC.
Netnographers may also have to handle data in multiple formats from CMC, such as attachments, hyperlinks, audio, video, and images. Anonymity and pseudonymity in CMC may make it difficult for netnographers to associate demographic information with textual data. Compared with ethnography, netnography is easier, faster, and less labor-intensive to conduct. In many cases, netnography collects publicly or globally accessible data from online forums, blogs, newsgroups, mailing lists, and social networking sites, and there is no need for the researcher to travel to a remote fieldsite. Social interactions in online fieldsites (e.g., forums, blogs, newsgroups) are usually automatically archived, and thus are easier to observe, record, and copy than those in face-to-face fieldsites. The researcher also has the option of conducting

netnography unobtrusively, lurking in online settings and tracing past online interactions that were unaffected by the researcher’s presence or actions.

3.2.2 Fieldsite
Kozinets (2010) proposed six criteria for choosing a netnographic fieldsite: (a) relevant to the research purposes and questions, (b) active in communications, (c) interactive between participants, (d) substantial in the number of communicators, (e) heterogeneous in participants, and (f) data-rich in postings. As mentioned in Chapter One, GO’s software-related and data-related request trackers at SourceForge are online forums where different scientific communities communicate with the GO curators and participate in ontology development and maintenance without geographic or temporal restrictions (Gene Ontology, 2014; Gene Ontology Consortium, 2006, 2007). Online forums are online discussion boards where “participants post textual messages (these could also include graphics or photos, and often contain hyperlinks), others reply and over time these messages form an asynchronous, conversational thread” (Kozinets, 2010, p. 85). All the requests submitted to those trackers remain in SourceForge indefinitely for the purposes of record keeping and enabling outsiders to keep track of what changes have been made to the Ontology (Gene Ontology Consortium, 2014u). GO’s request trackers are appropriate fieldsites for the researcher to learn about the interactions among those communities; their activities, practices, cultures, and collaborative patterns; and community members’ sense of meaning. Since the purpose of this dissertation research is to explore the data work organization of the GO, it focused only on GO’s data-related request trackers. Among these, GO’s Ontology Requests Tracker is the most active forum for a wide range of communities to discuss and address the data quality problems in the GO, receiving the highest number of requests from the registered users of the GO Project on SourceForge.
As of January 1, 2014, GO’s Ontology Requests Tracker (http://sourceforge.net/p/geneontology/ontology-requests/stats/) had received 10,594 requests from different communities, while other data-related request trackers (e.g., GO Relations, GO Big Ideas) had each received fewer than 20. This study selected GO’s Ontology Requests Tracker as the online fieldsite for collecting netnographic observational and archival data. Examining the discussions and negotiations in this tracker could not only help the researcher understand the interactions, norms, rules, and collaborative patterns of those communities, but also identify GO’s typology of

data quality problems and dimensions that are deemed important by those communities, as well as their quality assurance practices (Stvilia et al., 2008).

3.2.3 Justification for Netnography
Kozinets (2010) claimed that netnography is appropriate for qualitative exploratory research seeking to reveal social processes and to analyze how meanings are shared among communities and embedded in rules and practices. This dissertation research aims to explore the data work organization of the GO, gaining an understanding of different communities’ data activities and interactions around the GO, such as how they use the GO to represent and organize their data; how they collaboratively create new GO terms; and how they collectively detect, discuss, and resolve data quality problems in the GO. Studying this social phenomenon thus requires close and prolonged engagement with those communities. Netnography requires a lengthy stay in the fieldsite, assuming that over time the researcher can learn the language, rules, and norms of the field and gain a deep understanding of community practices and thoughts (Kozinets, 2010; Murchison, 2010; O’Reilly, 2005). This makes netnography a perfect fit for the study. Netnography is also appropriate for studying communities demonstrating important social interactions online, on the assumption that examining their online communications can reveal the practices, values, or beliefs of those communities. As a type of ethnography for the Internet age, netnography provides a set of techniques and procedures to help examine GO’s Ontology Requests Tracker, the most active online forum for various communities to discuss and negotiate the data quality issues in the GO. Data collection in netnography entails a certain “involvement, engagement, contact, interaction, communication, relation, collaboration, and connection with community members” (Kozinets, 2010, p. 95).
Compared to content analysis, interviews, surveys, and other sociological research methods that collect data from people out of context, netnography allows the researcher to observe and participate in community activities (e.g., using, developing, and maintaining the GO), capture the thought and context that inform community practices, and learn and ponder how to become a community member (Kozinets, 2010; O’Reilly, 2005). As a participant observer, the researcher becomes a primary instrument for generating fieldnote data—her subjective impressions of and reflections on community practices and cultures—that are not obtainable through surveys and interviews (Kozinets, 2010; Murchison, 2010; Wolcott, 2003). Since ethnography and netnography are well known for studying interactions and connections

among various elements within society and culture, they can help situate different communities’ activities around the GO in a social context, linking the activities of developing and maintaining the GO to the activities of using the GO for different purposes (e.g., education, research). As a methodology, netnography can also provide guidance on how to organize, sort, and code a relatively large amount of data collected from various sources using different methods.

3.2.4 Research Plan
Similar to ethnography, the research design of netnography is iterative, reflexive, and continuous, evolving through the study (Kozinets, 2010; O’Reilly, 2005). Kozinets (2010) proposed six overlapping steps of netnography based on the procedures of ethnography: planning, entering the field, collecting data, analyzing data and interpreting findings, adhering to ethical standards, and writing and reporting findings. Although the process of conducting netnography cannot be predetermined, it is necessary to develop a research plan that specifies the research methods for data collection, links methods to specific research questions and sources of information, and assesses the availability and accessibility of those sources (Hammersley & Atkinson, 1995; Kozinets, 2010; Murchison, 2010). This section presents details of the research plan for this study. This study planned to collect and create netnographic data through three key netnographic methods: archival data analysis, participant observations, and qualitative semi-structured interviews (Kozinets, 2010). These were selected based on the research purpose, questions, and theoretical frameworks (Murchison, 2010; Nardi, 1996b). Archival data analysis and participant observations were used to answer all of the research questions except the perception-related ones (RQ1.6 and RQ2.4). Qualitative semi-structured interviews were specifically used to address RQ1.3, RQ1.4, RQ1.6, RQ2.4, and RQ2.5 and to collect data that might not be available from archives and observations, such as different communities’ perceptions of ontology quality and their motivations to use, develop, and maintain the GO. Appendix B lists each research question and the corresponding data collection method(s) and interview question(s).
To identify and become familiar with the different communities participating in developing, maintaining, and using the GO, and to learn about their languages, rules, norms, and practices in GO’s Ontology Requests Tracker, the researcher first performed archival analysis of requests submitted to GO’s Ontology Requests Tracker during 2011 and 2012. The researcher next

conducted participant observations, becoming a registered user of the GO Project on SourceForge and following discussions in the Ontology Requests Tracker. The researcher kept observational and reflective fieldnotes to record her experience as a participant researcher; her learning of community languages, activities, and rules; her interpretation of community cultures and practices; and her conceptualization of the nature of the fieldsite. Finally, the researcher conducted qualitative semi-structured interviews with key informants from those communities to allow follow-up questions developed from archival data analysis and participant observations and to broaden the understanding gained from archives and fieldnotes. As discussed above, the course of netnography could not be programmed (Hammersley & Atkinson, 1995; Kozinets, 2010). Any documents relevant to the research questions identified during observations and interviews were also collected and analyzed. In other words, archival data analysis was conducted throughout the course of research. Sections 3.3, 3.4, and 3.5 provide details about how the archival data analysis, participant observations, and qualitative semi-structured interviews were conducted following this research plan, and Section 3.6 describes how the data were analyzed.

3.3 Archival Data Analysis

Data collected by netnographers through participant observations and interviews are restricted to specific times, places, and people (Murchison, 2010). Such netnographic data may lack historical depth, geographic coverage, and diverse perspectives. Netnographers usually utilize archival data to deepen their understanding and gain insight into historical continuity, change, and contradiction. Archival data refer to “a type of conversational cultural data” collected from community archives, which are “unaffected by the actions of the netnographer” (Kozinets, 2010, p. 104). Archives are historical documents or artifacts collected by individuals, organizations, and institutions to record their own histories (Murchison, 2010). Archives may exist at different levels—personal, local, regional, national, and international—and in various formats—physical, virtual, or a combination of the two. Besides the conversational data archived online, other documents, such as Websites, Wikis, and reports, can be used to provide netnographic data with a social and cultural context. Because online conversational data are relatively easy to download or copy, netnographers usually encounter the problem of data overload (Kozinets, 2010). Before using archival data to build a netnographic description or argument, netnographers need to consider their accessibility and availability; filter for relevant data; critically analyze their

meaning, purpose, intended audience, provenance, and the context of production; and assess their strengths and limitations (Murchison, 2010). The researcher conducted archival data analysis in two phases: one before entering the fieldsite and one concurrent with participant observations and qualitative semi-structured interviews. The initial phase of archival data analysis was intended to identify and familiarize the researcher with the different communities participating in developing, maintaining, and using the GO; to learn about the participants active in GO’s Ontology Requests Tracker and their languages, rules, norms, and practices; and to prepare the researcher to act like a genuine member of those communities. This phase focused on two types of archives: different communities’ online interactions archived in GO’s Ontology Requests Tracker during 2011 and 2012, and other public documents about the GO (e.g., the GO Website, the GO Wiki, and presentations, reports, and journal articles about the GO). Since a large number of requests (1,891) were submitted to GO’s Ontology Requests Tracker during 2011 and 2012, a random sample of 320 requests was drawn for archival data analysis. The sample size was determined using the technique introduced by Powell and Connaway (2004). The unit of analysis was the individual request submitted to the Tracker; most requests included curators’ comments and the curation actions they had taken. The second phase of archival data analysis was intended to supplement or enrich the fieldnote and interview data. After entering the fieldsite, the researcher collected and analyzed documents that were observed being used or mentioned by participants in GO’s Ontology Requests Tracker. During interviews, the researcher also asked key informants for documents that they mentioned were relevant to or could inform the research questions.
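The sample-size determination and random draw described above can be illustrated with a short sketch. Powell and Connaway’s (2004) exact procedure is not reproduced here; the sketch uses the widely taught finite-population sample-size formula (Cochran’s formula with a finite-population correction), assuming a 95% confidence level and a ±5% margin of error, which yields the 320 figure for a population of 1,891 requests. The request IDs are placeholders, not real Tracker identifiers.

```python
import math
import random

def sample_size(population: int, z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's sample-size formula with a finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)   # infinite-population estimate (384.16)
    n = n0 / (1 + (n0 - 1) / population)     # correct for the finite population
    return math.ceil(n)

N = 1891                 # requests submitted to the Tracker during 2011 and 2012
n = sample_size(N)       # 320 at a 95% confidence level with a +/-5% margin
print(n)

# Draw a simple random sample of request IDs (placeholder IDs 1..N).
random.seed(42)          # fixed seed so the draw is reproducible
sample_ids = random.sample(range(1, N + 1), n)
```

With different confidence levels or margins of error the formula would of course give a different sample size; the parameters above are assumptions chosen to match the reported figure.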
Archives may be textual, visual, or physical; may be owned or created by individuals or institutions; may be stored on personal computers, hard drives, servers, institutional or public repositories, or cloud storage; may contain private, sensitive, or competitive information; and may allow open or restricted access. Because these documents vary in format, storage, accessibility, ownership, coverage, and representativeness, they may be used and analyzed differently. To ensure appropriate and ethical use of these documents, the researcher created a registry to catalog and manage them. As recommended by Murchison (2010), the researcher performed critical analysis on these documents to obtain contextual information, such as their provenance, purpose, and intended audience, before using them. Kozinets (2010) suggested netnographers capture online

conversations between community members as they occurred on screen without correcting spelling and punctuation, and pay attention to graphical representations, such as emoticons and font size. More details about how the archival data were analyzed are provided in Section 3.6.

3.4 Participant Observations

Participant observation in ethnography can be defined as a research method that involves “participating in people’s daily lives over a period of time, observing, asking questions, taking notes and collecting other forms of data” (O’Reilly, 2005, p. 110). As in ethnography, participant observation is deemed the centerpiece of netnographic research, because it avoids the artificiality of controlled experiments and the unnatural setting of surveys and allows access to firsthand data that may be otherwise unobtainable (Kozinets, 2010; Murchison, 2010). However, instead of traveling to a remote fieldsite, the netnographer visits an Internet fieldsite that may provide open access to everyone. Participant observers in ethnography can adopt four different roles: covert observer, overt observer, covert participant, and overt participant (i.e., true participant observer) (Schutt, 2006). As a covert observer, the ethnographer seeks to observe things as they are without participating or disclosing her/his role as an ethnographer. In contrast, the overt observer announces her/his role as an ethnographer. The covert participant acts like the people under study without identifying her/his role as an ethnographer, while the overt participant announces her/his research role and participates in group activities. The presence of overt observers or overt participants might alter the behavior of the people under study. Participants have the opportunity to experience others’ lives and learn from their points of view, but run the risk of going native and losing objectivity. Observers have enough time to record what happens and to stay objective and scientific, but may fail to gain the insider’s view. Which role to adopt depends on the research topic, the ethnographer’s personality and background, the nature of the field, and ethical concerns (Murchison, 2010; Schutt, 2006).
The ethnographer’s role does not necessarily remain fixed during an ethnographic study, but may change depending on the situation (O’Reilly, 2005). Murchison (2010) suggested that the ethnographer balance participation and observation and ensure that both are done. Similar to ethnography, there is a spectrum of participation and observation in netnography, ranging from “reading messages regularly and in real time, following links, rating, replying to other members via e-mail or other one-on-one communications, offering short

comments…contributing to community activities, to becoming an organizer, expert, or recognized voice of the community” (Kozinets, 2010, p. 96). At one extreme, the netnographer can take a content analytic approach, lurking around an online setting, but risks attaining only a shallow cultural understanding. At the other extreme, the netnographer can take an auto-netnographic approach, being a real community member and recording and analyzing her/his online experience. Accordingly, Kozinets (2010) suggested that the netnographer balance the objective observational approach and the subjective autobiographical approach. In terms of observation, the researcher became a registered user of the GO Project on SourceForge, following discussions in GO’s Ontology Requests Tracker on a daily basis. In terms of participation, with the guidance of a biologist, the researcher used the GO to annotate a gene and submitted the annotation for the GO curators to review. As described in the research plan (see Section 3.2.4), the researcher kept both observational and reflective fieldnotes. Kozinets (2010) stated that netnographic fieldnotes should be “a combination of what is seen on the screen and what is experienced by the researcher” and as descriptive as possible through the frequent use of adjectives and adverbs (p. 115). Because similar observations of community interactions had been conducted in the archival data analysis, the researcher focused on writing reflective fieldnotes, recording her initial impressions of the communities, their interactions, and their cultures; her learning of community languages, practices, and rules; her experience of using the GO and communicating with the GO project members; her contemplation of community meanings and cultures; and her conceptualization of the nature of the fieldsite.

3.5 Qualitative Semi-structured Interviews

Qualitative interviewing is a research method used to understand people’s points of view, experiences, thoughts, and feelings with the purpose of producing knowledge (Kvale & Brinkmann, 2009; Schutt, 2006). Qualitative interviews can be categorized by their degree of structure into semi-structured interviews and unstructured interviews (Blee & Taylor, 2002; Lincoln & Guba, 1985). In unstructured interviews, the researcher does not formulate specific questions but expects the interviewee to direct the flow of conversation and to introduce and structure the problem in her/his own words in response to the broad issues raised by the interviewer. In semi-structured interviews, the researcher creates an interview guide consisting of a list of questions before interviewing, but retains the flexibility to alter the order of questions, ask follow-up questions, seek clarifications, and add extra questions during the interview. The

researcher plays a more active role in leading semi-structured interviews than in unstructured interviews. The purposes of semi-structured interviews are to explore, discover, and interpret the meaning of phenomena. As mentioned in the research plan (see Section 3.2.4), the researcher conducted qualitative semi-structured interviews after completing the initial archival data analysis and participant observations. Compared to unstructured interviews, semi-structured interviews can ensure that the research questions are addressed. The researcher could retain the flexibility to ask follow-up questions developed from archival data analysis and participant observations, seek clarifications, obtain explanations and background information, and tailor the interview guide to different informants (Murchison, 2010). This study aimed to interview key informants from the different communities participating in developing, maintaining, and using the GO. Key informants are community insiders who are willing to talk, knowledgeable about the research topic, and representative of the range of viewpoints (O’Reilly, 2005; Schutt, 2006). Kozinets (2010) suggested that those who study communities online using netnography extend the area of interest to communities offline. Besides participants of GO’s Ontology Requests Tracker, there may be other community members who actively develop, maintain, and use the GO without participating in the online discussions on GO’s request trackers. This study selected key informants from three groups: the GO project members, the GO content contributors, and the GO users. The researcher recruited key informants both from those who were active participants on GO’s Ontology Requests Tracker and from those who were not. The number of key informants to be interviewed depends on the purpose of the study (Kvale & Brinkmann, 2009).
If the study intends to explore and describe a specific phenomenon, new interviews should be conducted until reaching the saturation point, where further interviews produce little additional knowledge. Kvale and Brinkmann (2009) suggested the number of interviews could range from five to 25 for a typical interview inquiry. This study planned to interview five informants from each of the three groups mentioned above. The GO Project on SourceForge maintains a roster containing the GO project members’ identity information (e.g., real name, user name, and role/position) and provides an interface to contact registered users of the project. The researcher used the roster provided by SourceForge as a starting point to identify and contact the GO Project members and other participants of the Ontology Requests Tracker.

Key informants who were not participants of GO’s request trackers were identified through archival data analysis or referred by those who had already been interviewed. The researcher recruited key informants via emails sent from her university email account. Unlike surveys, the implementation of qualitative semi-structured interviews follows a less standardized procedure. The researcher followed the seven steps proposed by Kvale and Brinkmann (2009): (a) preparing by clarifying the purpose of the study, conceptualizing the theme, and obtaining knowledge of the research topic and local setting; (b) creating an interview guide (see Appendix A); (c) conducting interviews following the interview guide, establishing rapport with the informant, and paying attention to her/his nonverbal expressions with symbolic meanings; (d) transcribing interviews from oral speech to written text; (e) analyzing the interviews based on the purpose and theme of the study; (f) assuring the reliability and validity of the findings; and (g) reporting. In practice, the interview inquiry was iterative and circular, moving back and forth among the different steps. There were situations where the researcher identified new issues during an interview and thus changed the questions in the interview guide. With the popularity of computer-mediated communication (CMC), researchers can use Internet media, such as email and instant messaging (IM), to conduct qualitative interviews (Kazmer & Xie, 2008). This occurs more often in research exploring online or virtual communities. Because this study examined communities demonstrating important social interactions online, the researcher conducted interviews using four different modes: face-to-face, telephone, online audiovisual media (Skype and Google Plus Hangouts), and email.
Kazmer and Xie (2008) reminded researchers using email interviews that (a) email lacks social cues, such as facial expressions and body language; (b) informants have more control over their messages; (c) email responses conceal informants’ thought processes; (d) the interview data may be in multiple formats; and (e) informants may leave mid-interview. Despite the time saved on transcription, email interviewing requires researchers to prepare for retaining informants, using incomplete interview data, and organizing interview data in multiple formats. Therefore, the researcher only conducted email interviews with informants who preferred this mode of interaction, and mainly used them as a supplement to formal face-to-face, online audiovisual media, or phone interviews. Except for email interviews, the researcher recorded and transcribed all the interviews with the help of Audacity (http://audacity.sourceforge.net/), an open-source audio editor and recorder.

3.6 Data Analysis

The netnographic data collected in this study—including archival data, fieldnote data, and interview transcripts—were stored on a password-protected secure server. Other relevant documents, such as the literature review, consent forms, emails with informants, memos on the data, and notes on research design and decisions, were also stored on the server as references or guidelines (Kozinets, 2010; Richards, 2005). The interview transcripts were coded using NVivo 9, a qualitative data analysis software package. The qualitative content analysis was conducted in four phases. Based on Activity Theory (Engeström, 1990; Leont’ev, 1978) and Stvilia’s IQ Assessment Framework (Stvilia et al., 2007), an initial coding scheme (see Appendix C) consisting of a list of top-level categories was developed for the first phase of content analysis. Similar to grounded theory coding (Charmaz, 2006), the purposes of the first phase were to remain open to the data, explore the data, and develop a more specific coding scheme. The researcher coded all the data by sentence or paragraph. Subcategories of the initial coding scheme and emergent codes were generated based on interpretation of the data. The researcher compared the codes and then expanded, merged, related, and organized them to form a new coding scheme. A fellow doctoral student familiar with those theoretical frameworks and qualitative content analysis was recruited to code 10% of the interview transcripts, allowing the researcher to assess whether agreement could be reached on interpreting the data (Schutt, 2006). During the second phase of data analysis, the researcher and the recruited doctoral student used the new coding scheme developed in the first phase to independently code the same 10% of the interview transcripts. The researcher and the doctoral student then compared their codes, and discussed and resolved the differences in their coding.
Based on the discussion with the doctoral student, the researcher refined the coding scheme and used it to code all the remaining data during the third phase. In the fourth phase, a senior researcher with more experience with those theoretical frameworks and qualitative content analysis reviewed the coding process and findings. Based on the comments from the senior researcher, the coding scheme was revised. Netnographic and ethnographic data are known to be messy due to the complexity of informants’ lives, actions, and ideas (Murchison, 2010). Netnographers should expect to see apparent contradictions in the data, which may be the product of individual differences or different perspectives among informants (Kozinets, 2010). These apparent contradictions may be

an indication of ‘contradictions’ within or between informants’ activity systems, providing important insight into the data (Engeström, 1990). As suggested by Murchison (2010), the researcher accepted “the messiness of the ethnographic record,” paying attention to the apparent contradictions in the data (p. 181).

3.7 Ethical Issues

The researcher followed the general procedures proposed in the literature to address the ethical issues involved in conducting a netnography or ethnography: (a) identifying and explaining, (b) attaining informed consent from informants, (c) disclosing to informants what has been produced from the study, (d) respecting informants’ confidentiality, (e) avoiding harm to informants, and (f) preventing negative consequences for future research (Hammersley & Atkinson, 1995; Kozinets, 2010; Kvale & Brinkmann, 2009; O’Reilly, 2005).

3.7.1 Identifying and Explaining

The researcher fully disclosed her identity and affiliation and explained the research purpose to community members and informants during research interactions. As suggested by Kozinets (2010), the researcher revealed to all informants that she was conducting research on the data work organization of the GO. As the researcher mainly used emails to recruit informants, she included a link to her research Web page in her email signature, giving community members the opportunity to learn more about the research.

3.7.2 Informed Consent

Informed consent involves specifying the research purpose and procedure to informants, notifying them of any risks and benefits of participating in the study, obtaining their voluntary participation, and informing them of the right to withdraw from the study at any time (Kvale & Brinkmann, 2009). Kozinets (2010) claimed that informed consent is not required if the netnographer “interacts normally in the online community or culture, that is, as she interacts as other members do on the site but also takes fieldnotes of her experience” (p. 151). Therefore, the researcher did not need to obtain informed consent when analyzing publicly available archives and documents, or when taking observational and reflective fieldnotes. However, the researcher was still required to attain informed consent from informants before interviews, whether conducted face-to-face, online, by telephone, or via email.

For interviews conducted face-to-face, the researcher presented the consent form (see Appendix I) to informants before the interviews and explained the purpose and procedure of the study to them. The researcher also ensured that informants had enough time to read through the consent form and ask any questions. For interviews not conducted face-to-face, the researcher emailed informants a link to an online consent form hosted at Qualtrics (http://www.qualtrics.com/), usually at least two days before the scheduled interviews. The informants were asked to type their name at the end of the consent form to confirm that they had read and understood the form.

3.7.3 Disclosure

During the interviews, some of the informants expressed interest in reading the dissertation once it was complete. The researcher will share the findings and any publications from this dissertation research with the informants.

3.7.4 Confidentiality

Confidentiality concerns whether informants’ identifying information will be disclosed and whether they can be recognized by others from published information, causing them serious harm (Kvale & Brinkmann, 2009; Schutt, 2006). The researcher ensured that the identities of all the informants who were interviewed remained confidential. The fieldnotes, interview recordings and transcripts, and any private documents collected from informants during interviews were stored on a password-protected secure server. Although quotations from informants were included in the dissertation and may appear in future publications, their names will not be disclosed in any of the results. Direct quotes from the conversations recorded on GO’s Ontology Requests Tracker were also included in the findings of the dissertation. However, because GO’s Ontology Requests Tracker is publicly accessible and anyone with Internet access can use a search engine to trace direct quotes to the original threads and identify community members (Kozinets, 2010), the names and other identifying information in those quotes were altered. The researcher ensured that no quotes or details that might be harmful to the communities and their members were included in the dissertation, publications, or unpublished reports.

3.7.5 Harm

The United States Code of Federal Regulations defines minimal risk to human subjects in research as follows: “the probability and magnitude of harm or discomfort anticipated in the research are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests” (Department of Health and Human Services, 2009, p. 4). Most informants were happy to be interviewed in this study. As a result of being interviewed, they experienced no risks or discomforts greater than those ordinarily encountered in daily life. The researcher also ensured the well-being of informants during interviews (Kvale & Brinkmann, 2009).

3.7.6 Consequences

Considering the consequences of this study for the larger group that the informants represent and for future researchers who may wish to work with this group, the researcher took care to behave appropriately in the fieldsite, to avoid exploiting the informants during interviews, and not to publish untruths, misrepresentations, or results that could not be supported (Hammersley & Atkinson, 1995; Kvale & Brinkmann, 2009; O’Reilly, 2005).

3.8 Quality Control

Reliability and validity are commonly used to evaluate the quality of quantitative research. Reliability in quantitative research concerns “whether the result is replicable,” and validity pertains to “whether the means of measurement are accurate and whether they are actually measuring what they are intended to measure” (Golafshani, 2003, p. 599). Some qualitative researchers have borrowed the concepts of reliability and validity from quantitative research and applied them in assessing the quality of qualitative studies (Patton, 2001). Others have argued that reliability and validity, as quality criteria originating in quantitative research, are inadequate or inapplicable to qualitative research and should be redefined (Golafshani, 2003). They proposed an array of terms corresponding to reliability and validity in quantitative research, such as dependability, applicability, confirmability, consistency, credibility, neutrality, rigor, transferability, and trustworthiness. Regardless of which terms they use, qualitative researchers employ routine procedures to establish high-quality studies, such as “member checking, triangulation, thick description, peer reviews, and external audits” (Creswell & Miller, 2000, p. 124). This study involved the

application and combination of multiple research methods, each of which has its own set of quality criteria. The following sections present the criteria proposed in the literature for netnography, ethnography, and qualitative interviewing, and discuss how these criteria were met in this study.

3.8.1 Netnography and Ethnography

As a specialized form of ethnography, netnography is a relatively new research approach for studying communities and cultures in the Internet age. Most of the quality criteria in netnography were borrowed from ethnography and qualitative research. The researcher used the following criteria recommended in the literature, as applicable to this study, to ensure the quality of the netnography.

3.8.1.1 Rigor. Kozinets (2010) defined rigor as “the extent to which the text recognizes and adheres to the standards of netnographic research” (p. 164). To meet this criterion, the researcher familiarized herself with both ethnographic and netnographic protocols and techniques; followed netnographic procedures; related each research question to the research methods, theoretical frameworks, and fieldsite selected for the study (see Appendix B); and spent a significant amount of time on her cultural immersion and internalization.

3.8.1.2 Validity. Validity in ethnographic research refers to whether the research is “plausible or credible, and there is enough evidence to support the argument” (O’Reilly, 2005, p. 226). Like ethnography, netnography involves an iterative-inductive research process: moving back and forth between research questions and data, collecting varied evidence through multiple methods, having virtual and sustained contact with informants to seek clarification and confirmation, and acknowledging the complexity and messiness of human lives. Quality netnographies should give readers a sense that they were interacting with community members and had gained an understanding of their languages, rules, and practices (Kozinets, 2010).
To ensure the validity of the study, the researcher was committed to doing netnography carefully and thoughtfully; avoiding prejudice and bias; linking any cultural understanding and theoretical claims to the netnographic data; citing and quoting online conversations and relevant documents; and triangulating the findings via archival data analysis, participant observations, and qualitative semi-structured interviews (Kozinets, 2010; O’Reilly, 2005).

3.8.1.3 Literacy. Kozinets (2010) defined literacy as “the extent to which the netnographic text recognizes and is knowledgeable of literature and research approaches that are relevant to its inquiry” (p. 165). This criterion requires the researcher to thoroughly review and acknowledge previous literature in relevant areas. Because this was an interdisciplinary study of how different biological communities use, develop, and maintain an ontology, the researcher reviewed literature in molecular biology, biomedicine, bioinformatics, ontology engineering, artificial intelligence, Internet research, and LIS.

3.8.1.4 Innovation. Innovation refers to “the extent to which the constructs, ideas, frameworks and narrative form of the netnography provide new and creative ways of understanding systems, structures, experience or actions” (Kozinets, 2010, p. 166). To extend current knowledge of bio-ontology development and maintenance, this study applied multiple research techniques (e.g., qualitative interviews, archival data analysis) to collect empirical data from different communities, and used Activity Theory from cultural psychology and a theoretical IQ assessment framework from LIS to provide new insight into the netnographic data.

3.8.1.5 Reflexivity. Reflexivity can be defined as “the extent to which the netnographic text acknowledges the role of the researcher and is open to alternative interpretations” (Kozinets, 2010, p. 169). This criterion indicates that the netnographer should not only conduct unobtrusive observations of online cultural conversations, but also take a more active role in communicating with community members to solicit their feedback and comments. To have more direct interactions with community members, the researcher interviewed both the GO Project members and other community members who were active participants in the online fieldsite to verify what she had learned from the archival data analysis and participant observations.
3.8.1.6 Intermix. Intermix is a criterion specifically constructed for netnography, asking the netnographer to “take account of the interconnection of the various modes of social interaction—online and off—in culture members’ daily lived experiences” through a combination of online and offline fieldwork (Kozinets, 2010, p. 171). To meet this criterion, the researcher supplemented participant observations with face-to-face, phone, and online interviews with key informants, who were not limited to participants of GO’s Ontology Requests Tracker but included active community members who did not use GO’s online forums.

3.8.2 Qualitative Interviewing

Kvale and Brinkmann (2009) defined reliability in qualitative interviewing as “the consistency and trustworthiness of research findings,” which can be evaluated by whether the findings are reproducible by other researchers at other times (p. 245). Reliability measures in qualitative interviewing include reliability during interviewing (interviewer reliability), transcribing (transcriber reliability), and analyzing (intersubjective reliability).

3.8.2.1 Interviewer reliability. Interviewer reliability can be influenced by the interviewer’s interviewing technique. To ensure interviewer reliability, before conducting any interviews for this study, the researcher had received a sufficient amount of training in qualitative research techniques and had practiced interviewing through a number of research projects involving semi-structured interviews.

3.8.2.2 Transcriber reliability. Transcriber reliability can be assessed by asking two transcribers to type the same passage of a recorded interview and comparing their transcripts. The researcher transcribed all the interviews conducted for this study. To ensure transcriber reliability, a doctoral student, a native English speaker with ample experience in semi-structured interviews and transcription, was invited to transcribe a small number of interviews and review part of the researcher’s transcriptions.

3.8.2.3 Intersubjective reliability. Intersubjective reliability consists of arithmetic intersubjectivity and dialogical intersubjectivity. Arithmetic intersubjectivity refers to reliability “measured statistically by the degree of concurrence among independent observers or coders” (Kvale & Brinkmann, 2009, p. 243). Because this study was qualitative in nature, arithmetic intersubjectivity was not applicable. Dialogical intersubjectivity refers to the agreement between researchers who are interpreting a phenomenon.
The researcher coded all the transcripts, but also recruited a fellow doctoral student familiar with qualitative content analysis and the theoretical frameworks used in this study to code part of the data, in order to evaluate the extent of agreement reached on interpreting the data. As mentioned above, a senior researcher with more experience in those theoretical frameworks and qualitative content analysis reviewed the coding scheme and findings and provided suggestions for revisions. Kvale and Brinkmann (2009) described validity in qualitative interviewing as “the degree that a method investigates what it is intended to investigate” (p. 246). They related this to the reliability of the researcher (e.g., moral integrity, the quality of her previous research), the

quality of craftsmanship, and communicative validity. Validation is a continual process that is required in all steps of an interview inquiry. 3.8.2.4 The quality of craftsmanship. The quality of craftsmanship is ascertained by “continually checking, questioning, and theoretically interpreting the findings” (Kvale & Brinkmann, 2009, p. 249). Miles and Huberman (1994) suggested several strategies for checking the findings, including examining extreme or critical cases, replicating a finding, weighing the evidence, analyzing rival explanations, following up on unexpected findings, and collecting feedback from key informants. As requested by some of the informants, the researcher will share the findings with them and collect their feedback. 3.8.2.5 Communicative validity. Communicative validity refers to “the validity of knowledge claims in a conversation” that involves interviewees, interviewers, researchers, peer scholars, the scientific community, and the general public (Kvale & Brinkmann, 2009, p. 253). Communicative validation can be a negotiation between the interviewee and the interviewer, termed member validation, verifying whether the interviewer’s interpretation reflects the interviewee’s understanding of her/his statements (Kvale & Brinkmann, 2009). The researcher shared transcripts with some of the informants and asked for their feedback to assess and ensure communicative validity.

3.9 Limitations

Netnography requires the researcher’s immersion into the fieldsite for an extended period of time to learn the cultural language, norms, and rules in order to gain an insider’s view (Kozinets, 2010). The terminology and nomenclature of biomedicine, molecular biology, and bioinformatics posed challenges for the researcher in understanding and analyzing the conversations among the different biological communities participating in using, developing, and maintaining the GO. This dissertation research selected only one online forum to study the data work organization of the GO. There are other interesting sites recording different communities’ interactions and their data activities around the GO. The archival data analysis of this study was limited to the requests submitted to GO’s Ontology Requests Tracker during 2011 and 2012. The findings lack generalizability due to the small sample size. These limitations suggest future research: a historical analysis of the archives of GO’s Ontology Requests Tracker, and studies of GO’s other active and data-rich online forums.

3.10 Conclusion

This chapter presents details of the research design for this dissertation study. Using the netnographic approach and a combination of three key netnographic methods—archival data analysis, participant observations, and qualitative semi-structured interviews—allowed the research purpose to be satisfied and the research questions to be addressed. The next chapter presents findings obtained from this study, organized by these three research methods.

CHAPTER FOUR

FINDINGS

This chapter presents findings obtained from three research methods: archival data analysis, participant observations, and qualitative semi-structured interviews. As mentioned in Chapter 3, archival data analysis and participant observations were used to answer all of the research questions, except the perception-related questions (RQ1.6 and RQ2.4). Qualitative semi-structured interviews were specifically used to address RQ1.3, RQ1.4, RQ1.6, RQ2.4, and RQ2.5, and to collect data that might not be available from archives and observations.

4.1 Archival Data Analysis

As mentioned in Section 3.3, all 1,891 requests submitted to GO’s Ontology Requests Tracker during 2011 and 2012 were used as a sampling frame. A random sample of 320 requests was drawn from the sampling frame to conduct archival data analysis. The sample size was determined using the technique introduced by Powell and Connaway (2004). The unit of analysis was the individual request submitted to that Tracker, most of which included curators’ comments and the curation actions they had taken. Section 3.6 presents more details about how the archival data were analyzed. Following Kozinets’ (2010) suggestion to capture online conversations between community members as they occurred on screen, the researcher did not correct most of the spelling and punctuation in the archival data. However, any personal names in the quotes from the archival data were replaced with pseudonyms. QuickGO, a GO browser developed by the UniProt-GOA Project based at the European Bioinformatics Institute (EBI), maintains a change log for each GO term to record and keep track of all the changes made to the term. The researcher used GO’s official browser AmiGO and QuickGO’s change log to help with data analysis. As of September 2014, of these 320 requests, 311 were closed for discussion, and nine were still open or temporarily closed awaiting later discussion. Of those 311 closed requests/threads, the longest took 1,293 days to resolve, and the shortest lasted only a day. The highest number of participants in a thread/request was nine, and the lowest was one. The highest number of messages in a thread/request was 18, and the lowest number of messages in a thread/request

was one. The average time to resolve a request was 71.88 days, with the median being 9 days and the mode being 2 days. The average number of people participating in a request was 2.42, with the median being 2 and the mode being 2 as well. The average number of messages in a request/thread was 4.05, with the median being 3 and the mode being 2. The researcher observed that some threads may have taken longer to resolve because of holidays or participants’ travel.
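The per-request descriptive statistics reported above (mean, median, and mode) are straightforward to compute; the following minimal Python sketch illustrates the calculation. The list of resolution times here is illustrative only, not the study’s actual data.

```python
from statistics import mean, median, mode

# Illustrative resolution times in days for a handful of closed requests;
# the study itself computed these statistics over 311 closed requests.
resolution_days = [2, 2, 9, 30, 1293, 1, 4]

summary = {
    "mean": round(mean(resolution_days), 2),  # average days to resolve
    "median": median(resolution_days),        # middle value when sorted
    "mode": mode(resolution_days),            # most frequent value
}
print(summary)  # {'mean': 191.57, 'median': 4, 'mode': 2}
```

Note how a single very long-running thread (1,293 days in the sample above) pulls the mean far above the median and mode, which is why all three measures were reported.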

4.1.1 Activities around the GO

The archival data analysis identified six types of activities around the GO: (a) assigning the GO terms to genes or gene products discussed in the literature or curated in other databases (e.g., UniProt); (b) adding new GO terms or types of relationships between the terms to represent the current biological knowledge; (c) maintaining the GO by reporting, discussing, and resolving any problems with the existing GO terms and relationships between the terms; (d) developing tools to facilitate the use, development, and maintenance of the GO; (e) maintaining those tools to ensure that they can support different activities around the GO; and (f) having meetings to discuss issues related to the GO.

4.1.2 Division of Labor

Division of labor means “both the horizontal division of tasks between members of the community and the vertical division of power and status” (Engeström, 1990, p. 79). 4.1.2.1 Division of labor around GO’s Ontology Requests Tracker. In terms of the horizontal division of their tasks, participants of GO’s Ontology Requests Tracker can be divided into requesters, editors, reviewers, and commenters. Requesters are those who have access to the Tracker and submit suggestions for changes to the Ontology. Editors are the GO curators who receive those requests, decide who will review the requests depending on their areas of expertise, and make edits to the Ontology. Reviewers are the GO curators assigned to review those requests and correspond with the requesters, commenters, and editors. Commenters are those who comment on others’ requests, expressing their viewpoints, supporting or opposing the requests, or providing additional evidence. Although they cannot make a decision on whether to accept or reject a request, commenters may change the direction of discussions, raise new quality issues, or become requesters.

With respect to the vertical division of power and status, the archival data analysis found three types of system accounts in the GO Project on SourceForge: the GO Administrators, the GO Developers, and the SourceForge registered users. The GO Administrators and the GO Developers are SourceForge registered users with specific privileges and permissions. The GO Consortium requires each member database to have at least one curator serve on the request trackers, each of whom is granted either the GO Administrator or the GO Developer privilege (Gene Ontology Consortium, 2014f). The GO Project maintains a roster of those curators at SourceForge (http://sourceforge.net/p/geneontology/_members/). Both the GO Administrators and the GO Developers can play the role of reviewers on those trackers. However, only the GO Administrators can act as editors, assigning requests to the GO Developers or other Administrators and making edits to the Ontology. Any SourceForge registered user, including the GO Administrators and the GO Developers, can become a requester or a commenter. However, the GO Administrators and the GO Developers usually cannot review their own requests. 4.1.2.2 Division of labor around the GO. As mentioned in Section 2.4, the GO Consortium consists of more than 30 databases and biological research communities participating in developing and maintaining the GO (Gene Ontology Consortium, 2014i). Of the 33 current GO Consortium members, four are the core groups of the GO Project: the Jackson Laboratory, the GO Editorial Office at EBI in the United Kingdom, the Lawrence Berkeley National Laboratory, and the Cherry Laboratory at Stanford University. The Jackson Laboratory, hosting the Mouse Genome Informatics Database (MGI), is responsible for leading the ontology development, developing software applications for the GO Project, and providing GO annotations for mouse gene products.
The GO Editorial Office at EBI is responsible for editing the vocabularies (e.g., adding new GO terms, revising the structure, and aligning with other ontologies). The Lawrence Berkeley National Laboratory, running the Berkeley Bioinformatics Open-source Projects (BBOP), engages in developing software tools for viewing and editing the Ontology (e.g., AmiGO, OBO-Edit, TermGenie). The Cherry Laboratory, managing the Saccharomyces Genome Database (SGD), is responsible for maintaining the GO database and Web interfaces, ensuring public access to the database, and providing GO annotations for the budding yeast genes. All the other Consortium members (e.g., PomBase) contribute to the GO mainly by providing annotation data.
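The vertical division of power described in Section 4.1.2.1 can be summarized as a small set of role-to-action rules. The sketch below is an illustrative encoding of the rules reported in the findings, not actual SourceForge access-control code; the account-type names and action labels are the author’s paraphrase.

```python
# Illustrative mapping of SourceForge account types to the tracker actions
# described above; "assign" and "edit" are reserved for the GO Administrators.
PRIVILEGES = {
    "go_administrator": {"request", "comment", "review", "assign", "edit"},
    "go_developer": {"request", "comment", "review"},
    "registered_user": {"request", "comment"},
}

def can(account_type: str, action: str) -> bool:
    """Check whether an account type may perform a given tracker action."""
    return action in PRIVILEGES.get(account_type, set())

assert can("go_administrator", "edit")    # only Administrators edit the Ontology
assert not can("go_developer", "edit")    # Developers review but do not edit
assert can("registered_user", "request")  # any registered user may submit requests
```

The one rule this static mapping cannot capture is the contextual constraint that Administrators and Developers usually cannot review their own requests, which depends on who submitted a particular request.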

4.1.3 Communities

According to Engeström (1990), community refers to a group of people who share the same objective. The GO Consortium maintains a roster of the GO contributors, including their names, initials, and affiliations (http://www.geneontology.org/doc/GO.curator_dbxrefs). Based on this roster and the user profiles at SourceForge, the archival data analysis identified a typology of 34 communities participating in the activities around GO’s Ontology Requests Tracker, including the GO Consortium members and external communities and groups (see Appendix D). The GO Consortium members can be categorized as model organism databases (e.g., FlyBase, PomBase), protein databases (e.g., InterPro), and biological research communities (e.g., University College London-based annotation group). Similarly, the external communities and groups can be divided into biological databases (e.g., MetaCyc), external ontologies (e.g., Plant Ontology), biological research communities (e.g., the European Molecular Biology Laboratory), and individual scientists or groups. Among those 34 participating communities, 20 were GO Consortium members, and 14 were external communities or groups. Of the 320 requests in the sample, 292 were submitted by curators or scientists from the GO Consortium members, accounting for 91.25% of the sample, and 28 were submitted by external communities or groups, making up 8.75% of the sample. Of the GO Consortium members, PomBase was the most active community on GO’s Ontology Requests Tracker, submitting 77 of the 320 requests in the sample. Of the external communities or groups, MetaCyc was the most active, submitting nine of the 320 requests in the sample. Interestingly, two new term requests were submitted by an undergraduate student of Texas A&M University’s Community Assessment of Community Annotation with Ontologies (CACAO) Project.
As a means of attaining community annotation, the CACAO Project is a competition for teams of undergraduate students to contribute functional annotation of genes (Gene Ontology Consortium, 2014e). Students in the project, which is run at Texas A&M University, learn how to read scientific literature and perform manual literature annotation using the GO. During the competition, teams of students get points for making annotations, and lose points if competitors correct their annotations. Below is one of the new term requests submitted by a CACAO undergraduate: I am a student in the Texas A&M CACAO class. I am currently annotating the article http://circres.ahajournals.org/content/98/5/617.long, and believe that a new term is

necessary for the protein with the UniProt code P02778. The term I am requesting is negative regulation of GO:0061154. The editor accepted the request and added the new term ‘regulation of endothelial tube morphogenesis’ to the GO: Added via TermGenie. ID: GO:1901509 Label: regulation of endothelial tube morphogenesis

4.1.4 Data Quality Problems of the GO and Corresponding Assurance Actions

Data quality was defined as “the degree to which the data meet the needs and requirements of the activities in which they are used” (Stvilia et al., 2014). Data quality problems occur when the data quality is lower than the level required on one or more quality dimensions in the context of the activities in which the data are used. Table 4.1 lists a typology of 22 data quality problems of the GO identified from the archival data analysis, with their distribution and proportion in the sample of 320 requests and corresponding assurance actions to resolve them. Often, more than one data quality problem was discussed in a request/thread. Based on Stvilia’s IQ Assessment Framework (Stvilia et al., 2007), these data quality problems can be clustered as ambiguity, inaccuracy, incompleteness, inconsistency, redundancy, and unnaturalness.

Table 4.1: Data quality problems of the GO and corresponding assurance actions

Data Quality Problems | # | % | Type of Quality Problem | Assurance Actions

GO term related
1. Incomplete GO terms | 228 | 71.25% | Incompleteness | Add, define, link, attribute
2. Invalid GO terms | 8 | 2.50% | Inaccuracy | Obsolete, replace
3. Redundant GO terms | 10 | 3.13% | Redundancy | Merge, obsolete, use (as synonym)
4. Ambiguous conglomerate GO terms | 3 | 0.94% | Ambiguity | Split, obsolete, add

GO term name related
5. Incomplete GO term names | 8 | 2.50% | Incompleteness | Rename, standardize, use (as synonym), update
6. Inaccurate GO term names | 15 | 4.69% | Inaccuracy | Rename, change, use (as synonym)
7. Incorrect selection of the preferred GO terms | 4 | 1.25% | Unnaturalness | Change
8. Spelling errors in the GO term names | 2 | 0.63% | Inaccuracy | Correct

Synonym related
9. Incomplete synonyms | 19 | 5.94% | Incompleteness | Add
10. Inaccurate synonyms | 5 | 1.56% | Inaccuracy | Delete, move
11. Incorrect categorization of synonyms | 4 | 1.25% | Inaccuracy | Change

Definition related
12. Incomplete GO term definitions | 11 | 3.44% | Incompleteness | Add, expand, modify
13. Inaccurate GO term definitions | 21 | 6.56% | Inaccuracy | Redefine, update, correct

Reference related
14. Inaccurate references | 4 | 1.25% | Inaccuracy | Remove
15. Incomplete references | 5 | 1.56% | Incompleteness | Add
16. Inconsistent identifiers used in references | 1 | 0.31% | Inconsistency |

Relationship related
17. Inaccurate placement of the GO terms | 18 | 5.63% | Inaccuracy | Move
18. Inaccurate relationships between the GO terms | 13 | 4.06% | Inaccuracy | Change, remove
19. Incomplete relationships between the GO terms | 22 | 6.88% | Incompleteness | Add, link
20. Incomplete set of relations | 2 | 0.63% | Incompleteness | Add, define, attribute, exemplify

Taxon constraint related
21. Inaccurate taxon constraints to the GO terms | 6 | 1.88% | Inaccuracy | Release, change
22. Incomplete taxon constraints to the GO terms | 5 | 1.56% | Incompleteness | Add
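The proportions in Table 4.1 are simple percentages of the 320-request sample. A short Python sketch, using a few of the counts from the table, reproduces them:

```python
SAMPLE_SIZE = 320  # requests in the analyzed sample

# A few of the 22 problem counts reported in Table 4.1.
problem_counts = {
    "Incomplete GO terms": 228,
    "Inaccurate GO term definitions": 21,
    "Incomplete synonyms": 19,
    "Inaccurate GO term names": 15,
}

def proportion(count: int, total: int = SAMPLE_SIZE) -> float:
    """Percentage of the sample, rounded to two decimals as in Table 4.1."""
    return round(100.0 * count / total, 2)

for problem, count in problem_counts.items():
    print(f"{problem}: {proportion(count)}%")
# Incomplete GO terms: 71.25%
# Inaccurate GO term definitions: 6.56%
# Incomplete synonyms: 5.94%
# Inaccurate GO term names: 4.69%
```

Because more than one data quality problem could be discussed in a single request/thread, these percentages sum to more than 100% across all 22 problem types.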

4.1.4.1 Incomplete GO terms. The most frequently occurring data quality problem was incomplete GO terms, identified in 228 requests/threads accounting for 71.25% of the sample. Incomplete GO terms refer to those that should be in the Ontology but are missing for the purpose of annotation. This data quality problem was usually identified when curators were using the GO to annotate papers, but failed to find the GO terms that they needed. This illustrates the difficulty bio-ontologies face in capturing and representing current biological knowledge. On the

other hand, the proportion of new GO term requests in the sample demonstrates that the GO Consortium works hard to keep the Ontology up to date with new knowledge. Below is a typical new GO term request: Hello, I need a new term for annotation several genes described in PMID 22902739 where they investigate stalk morphogenesis. NEW: sorocarp stalk morphogenesis part_of GO:0031288 sorocarp morphogenesis Def: The process in which the sorocarp stalk is generated and organized. An example of this process is found in Dictyostelium discoideum. The requester was annotating a paper with PubMed identifier ‘22902739’ and found that the term she needed was not in the GO. She submitted a request, including the definition and placement of the new GO term in the Ontology. This request was accepted and the term ‘sorocarp stalk morphogenesis’ with the accession number ‘GO:0036360’ was added to the Ontology. Interestingly, a link to the paper (PMID:22902739) and the requester’s initials (pf) were placed next to the GO term definition in AmiGO (see http://amigo.geneontology.org/amigo/term/GO:0036360): The process in which the sorocarp stalk is generated and organized. The sorocarp stalk is a tubular structure that consists of cellulose-covered cells stacked on top of each other and surrounded by an acellular stalk tube composed of cellulose and glycoprotein. An example of this process is found in Dictyostelium discoideum. Source: PMID:22902739, GOC:pf, DDANAT:0000068 4.1.4.2 Invalid GO terms. Invalid GO terms refer to those that are beyond the scope of the GO project, or misleadingly named or defined; these invalid terms should be obsoleted. Although they can no longer be used, the obsoleted GO terms and their accession numbers still persist in the GO, with comments explaining the reason for obsoletion and the suggested replacement terms (Gene Ontology Consortium, 2014p). The archival data analysis identified eight requests/threads discussing invalid GO terms, making up 2.50% of the sample.
The following is an example of a request to obsolete an invalid GO term: As GO:0042980 'cystic fibrosis transmembrane conductance regulator binding' is a gene product-specific term, should it be made obsolete?

The GO is intended to provide controlled vocabularies describing molecular phenomena, not to serve as a nomenclature for genes and gene products (Gene Ontology Consortium, 2014g). The requester reported a gene product-specific term, which is outside the scope of the GO project. This term was obsoleted and replaced by GO:0044325 ‘ion channel binding’ (see http://amigo.geneontology.org/amigo/term/GO:0042980). 4.1.4.3 Redundant GO terms. Redundant GO terms duplicate other GO terms representing the same meaning; such terms should be merged, along with their annotations. Redundant GO terms were found in 10 requests/threads, forming 3.13% of the sample. After being merged, redundant GO terms were usually reused as synonyms of the merged GO terms, with their accession numbers as alternative identifiers. Below is a request to merge two repetitive GO terms: I failed to notice during the kidney term development work that we made a new term garland nephrocyte differentiation; GO:0061321 that is redundant with the existing term garland cell differentiation; GO:0007514. All garland cells are nephrocytes. I think the old term should be merged with the new term as the new def is more comprehensive. FlyBase is the only group that has used GO:0007514. The editor accepted the request, and merged ‘garland cell differentiation’ with ‘garland nephrocyte differentiation’ (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0061321). The accession number ‘GO:0007514’ became an alternative identifier of the GO term ‘garland nephrocyte differentiation’. The editor’s reply to the requester was as below: this is done: Merged GO:0007514 garland cell differentiation into GO:0061321 garland nephrocyte differentiation 4.1.4.4 Ambiguous conglomerate GO terms. Some of the GO terms are a conglomeration of two or more concepts, each of which warrants a separate GO term.
These conglomerate GO terms might be ambiguous to users, who might not know which concept the terms refer to. This data quality problem was discussed in only three requests/threads, accounting for 0.94% of the sample. All three identified ambiguous conglomerate GO terms were biological function terms consisting of two or more reactions (functions), each of which warranted a separate GO term. The solution is to split the

conglomerate GO terms, create a new functional GO term for each reaction, and obsolete the conglomerate terms. The following request indicates how the requester proposed to resolve this data quality problem: Obsolete GO:0018853, perillyl-CoA synthetase activity This term currently has 0 (count 'em!) annotations, and it is a conglomeration of two reactions. I will create brand new terms for the two reactions and remove the existing conglomerate. New terms: id: GO:0052685 name: perillic acid:CoA (ADP-forming) activity id: GO:0052686 name: perillic acid:CoA ligase (AMP-forming) activity 4.1.4.5 Incomplete GO term names. Incomplete GO term names refer to those that are not specific enough to describe the concepts according to a reference source or in the context of a given activity. Eight requests/threads involved the discussion of incomplete GO term names, making up 2.50% of the sample. The requester of the following thread found that the names of several GO terms implied concepts that were more general than their definitions: I'm submitting these terms for review because their name implies a more generalized reaction yet their definitions (and sometimes "exact" synonyms) are more specific. 1) 20-alpha-hydroxysteroid dehydrogenase activity GO:0047006 Ontology: Molecular Function Synonyms related: 20alpha-HSD related: 20alpha-HSDH exact: 20alpha-hydroxy steroid dehydrogenase activity broad: 20alpha-hydroxysteroid dehydrogenase exact: 20alpha-hydroxysteroid:NAD(P)+ 20- activity Definition of the reaction: NAD(P)+ + 17-alpha,20-alpha-dihydroxypregn-4-en-3-one =

NAD(P)H + H+ + 17-alpha-hydroxyprogesterone. Source: EC:1.1.1.149, MetaCyc:1.1.1.149-RXN The reply below indicates that the editor resolved the data quality problem by renaming those terms to be more specific to match their definitions: I've renamed all of these so the name and definition represent the same specificity; if there was a synonym that matches the definition, I used it as the name. The old name is now a broad synonym in each case, and I've adjusted other synonym scopes as needed. Another request indicated that an incomplete GO term name was caused by misannotations inferred from electronic annotation (IEA), and did not match its parent term. The requester proposed to rename the term to be more specific: GO:0015446 : arsenite-transporting ATPase activity Is under "transmembrane transporter" functions, but is used to annotate terms (by IEA) which are involved in transport process, and are ATPases but are not "transmembrane transporters" Can the term be rehoused, or renamed to say arsenite trasnsembrane transporting ATPase activity The editor accepted the request and expanded the GO term name to ‘arsenite-transmembrane transporting ATPase activity’ to match its parent: Given it's current placement and definition: Catalysis of the transfer of a solute or solutes from one side of a membrane to the other according to the reaction: ATP + H2O + arsenite(in) = ADP + phosphate + arsenite(out). it should be renamed as a transmembrane transporter. Where are the incorrect IEA mappings coming from? […] I updated: GO:0015446 : arsenite-transporting ATPase activity > GO:0015446 : arsenite-transmembrane transporting ATPase activity 4.1.4.6 Inaccurate GO term names. Inaccurate GO term names refer to those that cannot correctly represent the concepts according to a reference source or in the context of a given activity. For example, a requester proposed to change a GO term name to make it accurately match the scientific findings about the substance represented by the term:

GO:0009360 ! DNA III complex currently shows two children: GO:0043846 DNA polymerase III, DnaX complex and GO:0043845 DNA polymerase III, proofreading complex O'Donnell (http://www.jbc.org/content/281/16/10653.long) describes "three major functional units: 1) pol III core, 2) the beta sliding clamp, and 3) the gamma/tau complex clamp loader." I would suggest that the GO:0043846 DNA polymerase III, DnaX complex should be renamed as something like the "processivity clamp loader complex" to remove the gene name from the complex name. As mentioned in Section 4.1.4.2, this GO term name also violated the rule that the GO terms should not be gene or gene product specific. The editor accepted the request, renamed the GO term ‘DNA polymerase III, clamp loader complex’ to match the third major functional unit of DNA polymerase III described in O’Donnell’s article, and added two new GO terms to reflect the first two functional units. The original GO term name (‘DNA polymerase III, DnaX complex’) was reused as an exact synonym (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0043846). There were 15 requests/threads discussing inaccurate GO term names, forming 4.69% of the sample. 4.1.4.7 Incorrect selection of the preferred GO terms. As mentioned in Section 2.4.1, a GO term record contains a GO term name and an optional set of synonyms categorized as exact, related, narrow, and broad (Gene Ontology Consortium, 2014p). In MeSH, various terms can identify a concept, but only one is selected as the preferred term; the others are used as entry terms or lead-in terms for the purpose of indexing and retrieval (Hjørland, 2007; Soergel, 1995). Similarly in the GO, the preferred terms are selected as the GO term names, and the entry terms are labeled as exact synonyms. The archival data analysis found that some of the GO term names were not the conventional terms that users would choose for searching, compared to their exact synonyms.
There were four threads questioning the selection of preferred GO terms, accounting for 1.25% of the sample. According to Stvilia’s IQ Assessment Framework (Stvilia et al., 2007), this data quality problem can be categorized as unnaturalness since the concepts were not “expressed by conventional, typified terms and forms”

in the context of a given activity (p. 1729). Below is an example of a request to change a GO term name to the one that most people use: Can the term name of DNA-dependent DNA replication initiation be changed to DNA replication initiation since there is not a DNA-independent term, and most people search on "DNA replication initiation" (would help with lucene searches where shorter strings are matched more easily, at present some children are matched in preference) The editor explained that ‘DNA-dependent DNA replication initiation’ was created to distinguish it from ‘RNA-dependent DNA replication.’ The solution was to make the commonly used term ‘DNA replication initiation’ the GO term name, and keep ‘DNA-dependent DNA replication initiation’ as an exact synonym. The editor replied to the requester as follows: Okay, for consistency I've gone for the simplest option and renamed DNA-dependent DNA replication initiation ; GO:0006270 DNA replication initiation ; GO:0006270 Kept the more specific name as a synonym. Updated the definition to make it clearer that it's referring to a DNA-dependent process. 4.1.4.8 Spelling errors in the GO term names. There were only two requests/threads reporting spelling errors in the sample. Below is one of those two requests: Also please edit spelling error in GO:0061384 heart trabecular morphogenesis should be trabecula 4.1.4.9 Incomplete synonyms. Nineteen threads, making up 5.94% of the sample, involved requesting to add synonyms to the GO terms. The following is an example of such requests: Can we add an exact synonym, "inner nuclear membrane", to GO:0005637? I've come across this phrase multiple times in publications (e.g. PMID:21411627) One of the GO Developers commented on the request as follows: "inner nuclear membrane" 845,000

"nuclear inner membrane" 351,000 (mainly GO hits?) so could Ashley’s suggestion be the primary name? pombe community always use "inner nuclear membrane" / abbreviated to INM The commenter requested that the exact synonym be made the GO term name (preferred term) because the synonym was commonly used in his community and had more hits in the database. This is an example of how a commenter could change the direction of discussion, raise a new data quality issue, and become a requester. The editor added ‘inner nuclear membrane’ as an exact synonym, but kept ‘nuclear inner membrane’ as the GO term name to make it match its parent and sibling terms: I've added in the synonym- thanks Ashley, I've kept the term name as 'nuclear inner membrane' to match the sibling and parent terms, and the outer membrane terms: organelle inner membrane --[isa]mitochondrial inner membrane --[isa]nuclear inner membrane --[isa]plastid inner membrane 4.1.4.10 Inaccurate synonyms. Inaccurate synonyms refer to those that fail to match the GO term names or cannot correctly represent the concepts according to a reference source or in the context of a given activity. This data quality problem was discussed in five threads, forming 1.56% of the sample. The following request demonstrates that two synonyms did not match the GO term name in terms of identifying the concept. The requester proposed to move those synonyms to another GO term: Hi GO folks, At present 'new end take-off' and the abbreviation 'NETO' are synonyms of GO:0061161 'positive regulation of establishment of bipolar cell polarity resulting in cell shape'. This doesn't seem quite right -- it's no problem that new end take-off is a synonym, but GO:0061161 is the wrong term for it to be a synonym of. NETO is the initiation of growth from the cell end newly formed by cell division, i.e. switching from monopolar to bipolar growth. Establishment of cell polarity is a prerequisite, not a regulation target, of NETO.
So I recommend moving the synonyms to GO:0051519 'activation of bipolar cell growth'.

The editor accepted the request, and made 'new end take-off' and 'NETO' exact synonyms of GO:0051519 (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0051519). 4.1.4.11 Incorrect categorization of synonyms. As mentioned above, the GO term record may contain a set of optional synonyms categorized into exact, broad, narrow, and related (Gene Ontology Consortium, 2014p). Four requests/threads, accounting for 1.25% of the sample, involved discussing incorrect categorization of synonyms. For undecaprenol kinase activity - GO:0009038, the term name and definition seem relatively specific: Catalysis of the reaction: ATP + undecaprenol = all-trans-undecaprenyl phosphate + ADP + 2 H(+). Source: EC:2.7.1.66, RHEA:23755 But the following "exact" synonyms seem like they might be more general and therefore, perhaps should be "broad" or "related" exact: isoprenoid alcohol kinase (phosphorylating) exact: isoprenoid alcohol kinase activity exact: isoprenoid alcohol phosphokinase activity exact: polyisoprenol kinase activity The editor accepted the request, and recategorized those synonyms as broad synonyms (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0009038). The editor’s reply to the requester was as below: Fixed. Those synonyms were pulled in automatically from EC and marked as exact, even though some of them don't exactly correspond to the term name. Thanks for catching that! 4.1.4.12 Incomplete GO term definitions. Incomplete GO term definitions refer to those that are not specific enough to distinguish the terms from others in the Ontology according to a reference source or in the context of a given activity. There were 11 threads/requests discussing the problem of incomplete GO term definitions, making up 3.44% of the sample. This data quality problem was usually resolved by editing the definitions to make them more specific or clear, or by adding comments to the GO terms to explain how they should be used for annotations.
The following request indicates that the requester took novice curators and the wider user community into account when defining the GO terms:

Would it possible to add something to the defn of GO:0014898 cardiac muscle hypertrophy in response to stress to make it clear the stress should be physiological, and not pathological? I appreciate that experienced GO curators would not make this error, but it would clarify the term usage for novice curators, as well as for non-GOC users of the ontology.

The editor accepted the request, and added 'physiological' to the definition (see http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0014898#term=history).

4.1.4.13 Inaccurate GO term definitions. Inaccurate GO term definitions refer to those that cannot correctly explain the meaning of terms according to a reference source or in the context of a given activity. The archival data analysis identified 21 threads/requests (forming 6.56% of the sample) examining this data quality problem. Below is an example of an inaccurately defined GO term:

The term spindle pole body has a parentage to cytoplasmic part which is not true for organisms which undergo a closed mitosis. i guess the parent microtubule organizing centre needs the "cytoplasmic part" parent removing?.. but its def says "cytoplasmic part", that is probably causing the reasoner to suggest this relationship ...the def needs refining...

After negotiating with the requester, the editor modified the definition and attributed it to the requester by placing his initials ('vw') next to the updated definition:

this is fixed now. Current stanza is:
id: GO:0005815
name: microtubule organizing center
namespace: cellular_component
def: "An intracellular structure that can catalyze gamma-tubulin-dependent microtubule nucleation and that can anchor microtubules by interacting with their minus ends, plus ends or sides." [GOC:vw, http://en.wikipedia.org/wiki/Microtubule_organizing_center, ISBN:0815316194, PMID:17072892, PMID:17245416]

In some cases, the GO term definitions were inaccurate according to a reference source, such as scientific literature.
Below is an example of such a request:

Please may I have an update to the current definition of peroxisome GO:0005777 peroxisome which reads ' A small, membrane-bounded organelle that uses dioxygen (O2) to oxidize organic molecules; contains some that produce and others that degrade hydrogen peroxide (H2O2). Please may the new definition feature a broader description for the peroxisomal functions as suggested from an analysis of the information in the literature
Peroxisomes are small, membrane-enclosed organelles found in most eukaryotes. Peroxisomes contain enzymes involved in a variety of metabolic pathways involving β-oxidation of long chain fatty acids, free radical detoxification, differentiation, development and morphogenesis.

After asking the requester for relevant literature, the editor updated the definition based on that literature and placed the requester's initials (pm) and the reference source (PMID:9302272) next to the updated definition:

Updated definition to:
A small organelle enclosed by a single membrane, and found in most eukaryotic cells. Contains peroxidases and other enzymes involved in a variety of metabolic processes including free radical detoxification, lipid catabolism and biosynthesis, and hydrogen peroxide metabolism.
GOC:pm PMID:9302272 UniProtKB-KW:KW-0576

4.1.4.14 Inaccurate references. The archival data analysis identified four types of reference sources in the GO term records: literature references, curators who contributed the terms, internal cross-references, and external cross-references to other ontologies or databases. An inaccurate reference refers to an incorrect reference to any of these four types of reference sources. Four requests/threads concerned inaccurate references to the GO terms, accounting for 1.25% of the sample. The following is a request to update or remove inaccurate external cross-references from three GO terms:

Someone at FlyBase has spotted the following obsolete EC numbers in GO dbxrefs.
Could they please be updated or removed as appropriate?

EC 2.1.1.36 linked to tRNA (adenine-N1-)-methyltransferase activity ; GO:0016429
EC 2.1.1.29 linked to tRNA (cytosine-5-)-methyltransferase activity ; GO:0016428
EC 2.1.1.32 linked to tRNA (guanine-N2-)-methyltransferase activity ; GO:0004809

As mentioned in Section 2.4.3, database cross-references (abbreviated as dbxrefs) are links to identical or similar objects curated in external databases, displayed as the identifiers used in those databases (Gene Ontology Consortium, 2014p; Mungall et al., 2011). The inaccurate dbxrefs mentioned in the request above were caused by changes to the reference source ExPASy, a bioinformatics resource portal. The editor accepted the request and removed the cross-references to ExPASy from those GO terms, aligning the GO with ExPASy.

In the following request, a GO curator asked to have his initials removed from the definitions of two new GO terms because he disagreed with creating those terms:

Re[garding] the new terms you are proposing, do you mind not attributing the defs to me (I don't think we require the terms, for SMC family, if you have an experiment to say something is SMC binding, you should be able to say which gene product (they aren't all that similar). Likewise for MDM2/4 (I don't even know if these are related?). My point was that MDM2 is single loci (with a number of isoforms), but the uniprot entry would cover for any one of them, so the with field could be used instead of the specific term. I think the new terms are as bad (unnecessary) ;)

The editor still created the two new GO terms ('GO:0043221 SMC family protein binding' and 'GO:0097371 MDM2/MDM4 family protein binding'), but removed that curator's initials from the definitions ("I've removed your initials from that term :-)"):

We just want to provide a bit of structure for this node beyond just 'protein binding', not so much for annotating cases where you don't know the individual protein bound (although this will be the case for very related proteins e.g. actins) but for things like enrichments and slims, and for annotating groups who don't use c16 who would otherwise only have an annotation to 'protein binding'. […]

4.1.4.15 Incomplete references. Five threads, making up 1.56% of the sample, involved requests to add references to the GO terms. As mentioned in Section 2.4.3, the GO is creating logical definitions to complement textual definitions, cross-referencing to external databases (Gene Ontology Consortium, 2014p). Four of these five threads requested dbxrefs, aligning the GO with external ontologies or databases; one requested adding a newly published article to a GO term. The following is an example of a request for a dbxref to UniPathway, a manually curated database of metabolic pathways (Morgat et al., 2012):

please could you add the Xref: http://www.unipathway.org//upa?upid=UPA00883 to the GO term GO:0036101 leukotriene B4 catabolic process.

The editor accepted the request and added a database cross-reference to the UniPathway entry UPA00883 'leukotriene B4 degradation.' The definition of the GO term GO:0036101 'leukotriene B4 catabolic process' was updated as below, as of September 2014 (see http://amigo.geneontology.org/amigo/term/GO:0036101):

The chemical reactions and pathways resulting in the breakdown of leukotriene B4, a leukotriene composed of (6Z,8E,10E,14Z)-eicosatetraenoic acid having (5S)- and (12R)-hydroxy substituents. Source: CHEBI:15647, GOC:yaf, UniPathway:UPA00883, PMID:9799565

4.1.4.16 Inconsistent identifiers used in references. There was a request in the sample discussing the problem of using different types of identifiers in the GO term records:

we currently use stable IDs (e.g. REACT_nnn) in def xrefs, but DB IDs (nnnn) in xrefs. We should use a single unified system, and John recommends the stable IDs.

The requester pointed out that the identifiers used for cross-references in the GO term definitions were stable identifiers, whereas the ones used for cross-references elsewhere in the GO term records were database identifiers. One of the commenters explained the reason for the inconsistent identifiers:

the xrefs in the ontology file come from a mapping supplied by Reactome, which, as John ascertained, is a mix of both stable and database IDs.
I believe that Reactome are now aware of the problem (John certainly is) but that they weren't going to be able to tackle it properly until May or June.

4.1.4.17 Inaccurate placement of the GO terms. The archival data analysis found 18 threads/requests (forming 5.63% of the sample) concerning inaccurate placement of the GO terms in the Ontology. In the thread below, the requester questioned the placement of the GO term GO:0032258 'CVT pathway' in the Ontology:

I have a large increase in the number of annotations to one of my slim terms "peroxisome biogenesis" This appears to be because the tem CVT pathway has moved under peroxisime biogenesis. Is this placement correct?
CVT pathway
A constitutive biosynthetic process that occurs under nutrient-rich conditions, in which two resident vacuolar hydrolases, aminopeptidase I and alpha-mannosidase, are sequestered into vesicles; these vesicles are transported to, and then fuse with, the vacuole. This pathway is mostly observed in yeast.

The editor explained to the requester that when he added the term 'protein localization to vacuole,' an ancestor term of 'CVT pathway,' he had misplaced it in the Ontology:

oops, my fault ... when I added 'protein localization to vacuole' (GO:0072665) I dropped it under 'peroxisome organization' by mistake. Now fixed - moved to correct parent 'vacuole organization'.

4.1.4.18 Inaccurate relationships between the GO terms. The GO mainly uses four types of relationships between the terms: 'is_a,' 'part_of,' 'has_part,' and 'regulates' (Gene Ontology Consortium, 2014o). Inaccurate relationships between the GO terms refer to cases where the relationship between two GO terms is incorrectly determined or the GO terms should not be linked at all. The archival data analysis identified 18 requests/threads discussing this data quality problem, accounting for 5.63% of the sample. In the following request, a molecular function term was found to be linked to a biological process term with an incorrect type of relationship:

While investigating a query for the OBO-Edit working group, I noticed that these terms have has_part relationships the wrong way round (some tags omitted for clarity):
[Term]
id: GO:0030745
name: dimethylhistidine N-methyltransferase activity
namespace: molecular_function
def: "Catalysis of the reaction: S-adenosyl-L-methionine + N-alpha,N-alpha-dimethyl-L-histidine = S-adenosyl-L-homocysteine + N-alpha,N-alpha,N-alpha-trimethyl-L-histidine."
[EC:2.1.1.44]
is_a: GO:0008757 ! S-adenosylmethionine-dependent methyltransferase activity
relationship: has_part GO:0052707 ! N-alpha,N-alpha,N-alpha-trimethyl-L-histidine biosynthesis from histidine
[…]
They should be either MF part_of BP or BP has_part MF, i.e.

As mentioned in Section 2.4.2, terms from different sub-ontologies can be linked with a relationship other than 'is_a' (Gene Ontology Consortium, 2014p). For example, a molecular function term can be 'part_of' a biological process term, and a biological process term can 'has_part' a molecular function term; however, a molecular function term cannot 'has_part' a biological process term. Since 'N-alpha,N-alpha,N-alpha-trimethyl-L-histidine biosynthesis from histidine' (GO:0052707) is a biological process term, the molecular function term GO:0030745 cannot 'has_part' GO:0052707. The editor accepted the request and changed the relationship between the GO terms to GO:0052707 'has_part' GO:0030745. The editor's reply to the requester reveals that this data quality problem was caused by human error:

Ah, those are my fault, misinterpreting the OBO-Edit display (should've hand-edited the file ;) ). All fixed!

Interestingly, a commenter requested adding GO software checks (scripts) to prevent molecular function terms from having 'has_part' relationships to biological process terms:

Ashley, can you add something to go/software/utilities/check-obo-for-standard-release.pl preventing MF has_part some BP

4.1.4.19 Incomplete relationships between the GO terms. Incomplete relationships between the GO terms occur when GO terms should be linked with a specific type of relationship but are not. The archival data analysis found 22 requests/threads concerning incomplete relationships between the GO terms, making up 6.88% of the sample. This was the second most frequently occurring data quality problem in the sample, indicating the difficulty of achieving completeness in relating the existing GO terms.
Below is a request to add a 'part_of' (pf) link between two GO terms:

Missing pf link
GO:0006490 Name oligosaccharide-lipid intermediate biosynthetic process
GO:0042281 Name dolichyl pyrophosphate Man9GlcNAc2 alpha-1,3-glucosyltransferase activity
(hope that's correct) thanks

The editor accepted the request and added the 'part_of' (pf) relationship between those two GO terms:

Based on the only current annotation to GO:0042281, that seems to be correct, so I've added the link:
GO:0042281 dolichyl pyrophosphate Man9GlcNAc2 alpha-1,3-glucosyltransferase activity part_of GO:0006490 oligosaccharide-lipid intermediate biosynthetic process

4.1.4.20 Incomplete set of relations. The most commonly used relationships between the GO terms—'is_a,' 'part_of,' 'has_part,' and 'regulates'—do not constitute a comprehensive set of relations in the GO (Gene Ontology Consortium, 2014o). As mentioned in Section 2.4.2, the logical definitions use relationships from the Relation Ontology (RO) and extensions to the RO, such as 'occurs_in' and 'results_in_complete_development_of.' Similarly, the GO annotation extensions allow the curators to use additional terms (from the GO or external OBO ontologies) and relations to provide the GO annotations with a context (Gene Ontology Consortium, 2014b). There were two requests/threads in the sample concerning adding new relations for the GO annotation extensions. The following is one such request:

adding to go_annotation_extension_relations.obo
[Typedef]
id: requires_regulation_by
name: requires_regulation_by
def: "annotated gene product participates in BP or executes MF activity only if regulated by action of gene product; regulation is indirect (or unknown whether direct or indirect)" [GOC:mah]
domain: BFO:0000007 ! process
range: TEMP:0000001 ! gene product or complex
note: example for go_annotation_extension_examples.obo will be added when available

Interestingly, the requester specified that he would provide examples of how to use this new relation in the GO's annotation extension examples file.

4.1.4.21 Inaccurate taxon constraints to the GO terms. Although the GO intends to be taxon neutral, some of the GO terms are valid only within certain taxonomic groups (Gene Ontology Consortium, 2011b; Huntley, Sawford, Martin, & O'Donovan, 2014). Since 2011, the GO has added taxon constraints to certain GO terms to ensure annotation accuracy and to detect errors. The GO maintains an OBO format file (http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/quality_control/annotation_checks/taxon_checks/taxon_go_triggers.obo) containing a list of GO terms with the taxonomic groups for which they are appropriate or inappropriate to use. Each GO term in the file is related to a taxonomic group by 'only_in_taxon' or 'never_in_taxon' (Gene Ontology Consortium, 2014t). For example, the GO term 'female pregnancy' is related to Mammalia with the relationship 'only_in_taxon,' indicating that the term can only be used for Mammalia. Inaccurate taxon constraints occur when incorrect or redundant taxonomic groups are added to the GO terms. The solution is either to change the taxonomic groups or to remove the incorrect taxon constraints from the GO terms. Six requests/threads discussed inaccurate taxon constraints to the GO terms, forming 1.88% of the sample.
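Read this way, each taxon constraint is a simple predicate over the annotated species' taxonomic lineage. The following minimal Python sketch illustrates the logic; the constraint table is a toy, hand-written stand-in for the taxon_go_triggers.obo file mentioned above, and the function name is invented for illustration:

```python
# Illustrative sketch of evaluating 'only_in_taxon' / 'never_in_taxon'
# constraints against a species' taxonomic lineage. Toy data only.

# GO term name -> (constraint type, taxonomic group)
TAXON_CONSTRAINTS = {
    "female pregnancy": ("only_in_taxon", "Mammalia"),
    "photosynthesis": ("never_in_taxon", "Metazoa"),
}

def violates_taxon_constraint(term, lineage):
    """True if annotating `term` to a species whose ancestor taxa are
    `lineage` would break the term's taxon constraint."""
    if term not in TAXON_CONSTRAINTS:
        return False  # unconstrained terms are always allowed
    kind, taxon = TAXON_CONSTRAINTS[term]
    in_group = taxon in lineage
    if kind == "only_in_taxon":
        return not in_group  # must be inside the group
    return in_group          # never_in_taxon: must be outside the group
```

For example, annotating 'female pregnancy' to a species with lineage ["Eukaryota", "Metazoa", "Mammalia"] passes the check, while the same annotation under a fungal lineage ["Eukaryota", "Fungi"] is flagged as a violation.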
In the following example, the requester claimed that the GO term 'photosystem II reaction center' could not be used for photosynthetic algae since the taxon constraint for the term was too restrictive, and requested to change it:

please may you amend the taxon constraint for GO:0009539 photosystem II reaction center as it excludes photosynthetic algae (PMID: 21496452)

The editor's reply to the requester indicates that the request was accepted and the taxon constraint on 'thylakoid'—the ancestor term of GO:0009539 'photosystem II reaction center'—was changed to 'never_in_taxon' Metazoa:

Changed all thylakoid/chlorphyll/photosynthesis taxon restrictions to 'never_in Metazoa' to allow for photosynthesising algae. Did it this way rather than adding another restriction to the current 'only_in taxon union' stanzas, because many of the algae are direct children of 'Eukaryota' in the taxonomic tree, and the main point of the taxon restrictions are to ensure that animals aren't annotated with photosynthesis terms.

In the following request, the requester claimed that a taxon constraint—'only_in_taxon' cellular organisms—had been mistakenly added to two GO terms ('GO:0005622 intracellular' and 'GO:0019034 viral replication complex'), and thus the terms could not be annotated to viral proteins:

I am analysing taxon violation to GO terms and have found some due to viral proteins that can be resolved by relaxation of the taxon constraint
intracellular
viral replication complex
Currently viral proteins are excluded from having these annotations as the terms are currently restricted to cellular organism. These viral proteins are found intracellularly in the host cell or replication complex of the host.

The editor accepted the request to remove the taxon constraint from those GO terms:

Strictly speaking, they should be annotated to 'host intracellular' if we have such a term. But I appreciate that this is counter-intuitive because for viruses there isn't a non-host intracellular. In addition, the 'host X' terms have been slated for replacement with annotation extensions so this restriction would be removed then anyway. So okay, I will remove this constraint. However, we're still figuring out the technical details of how we obsolete taxon constraints, so it will have to wait until we've done that!

4.1.4.22 Incomplete taxon constraints to the GO terms. Incomplete taxon constraints occur when the GO terms should be restricted to use by certain taxonomic groups but are not. The solution is to add the relevant taxonomic groups to the GO terms. The archival data analysis identified five requests/threads concerning incomplete taxon constraints to the GO terms, accounting for 1.56% of the sample. Below is a request to add a taxon constraint to a GO term, preventing it from being applied to gene products of plants:

Could you please add in a taxon restriction so that GO:0006915 apoptosis is never applied to gene products in the plant kingdom (Viridiplantae taxon:33090).
Evidence from 'Morphological classification of plant cell deaths' van Doorn et al. Cell Death Differ. 2011 Aug;18(8):1241-6. (PMID:21494263) I've checked this pass Tina and Derek who are happy with this restriction.

Interestingly, the requester emphasized that the taxon constraint had been agreed upon by Tina and Derek, who were both TAIR curators. The editor accepted the request and added the taxon constraint ('never_in_taxon' Viridiplantae) to the GO term 'apoptosis':

Added taxon rule to GO:0006915 apoptosis: never_in Viridiplantae
Note that this taxon rule will still be valid and won't be affected when we soon rename GO:0006915 apoptosis to "apoptotic process" in the GO apoptosis overhaul.

4.1.5 Source of Data Quality Problems

The archival data analysis identified four sources of data quality problems of the GO: (a) new discoveries or knowledge, (b) human (curator) errors, (c) inconsistent mapping between the GO and collaborating databases, and (d) errors imported from collaborating databases.

4.1.5.1 New discoveries or knowledge. Incomplete GO terms and an incomplete set of relations can be interpreted as the GO's failure to meet the needs and requirements of the activity of annotating newly published literature, since curators could not find the GO terms or relations they needed to capture new knowledge or discoveries. To resolve these data quality problems, the GO has created request trackers at SourceForge that allow curators or any other users to submit requests for new GO terms or relations.

4.1.5.2 Human errors. Since the GO is manually curated, data quality problems caused by human errors are inevitable. The requests discussed in Sections 4.1.4.3, 4.1.4.8 (spelling errors in the GO term names), and 4.1.4.17 are examples of data quality problems caused by curator errors. The redundant GO term mentioned in Section 4.1.4.3 was created by a curator who failed to notice that an existing GO term with the same meaning was already in the Ontology. The inaccurate placement of the GO term discussed in Section 4.1.4.17 was caused by the curator who created its ancestor term: the curator misplaced the ancestor term in the Ontology, which in turn misplaced all of its child terms.

4.1.5.3 Inconsistent mapping. The request mentioned in Section 4.1.4.14 is an example of data quality problems caused by inconsistent mapping between the GO and collaborating databases. Several GO terms cross-referenced enzyme entries in ExPASy that were later made obsolete in ExPASy. The GO, however, failed to stay consistent with ExPASy, and those obsolete cross-references remained in the GO.

4.1.5.4 Imported errors from collaborating databases. In Section 4.1.4.16, the data quality problem of using different identifiers in the GO term references was caused by an inconsistency imported from Reactome, which used both database identifiers and stable identifiers. Once this data quality problem was identified in the GO, the GO curators informed the Reactome curators of the problem. Cross-referencing can thus be used not only to detect structural inconsistency or missing links in the GO, but also to help the collaborating databases control their data quality.

4.1.6 Tools

According to Activity Theory (Engeström, 1990), tools can be understood as external objects or internal symbols that communities use to detect and resolve data quality problems present in the GO. The archival data analysis identified a typology of tools used to resolve data quality problems discussed on GO's Ontology Requests Tracker (see Appendix E). These tools can be classified into three broad categories: (a) GO's internal tools, (b) external reference sources, and (c) other biological communities.

4.1.6.1 GO's internal tools. The archival data analysis found that participants of GO's Ontology Requests Tracker, especially the GO curators, made frequent use of the tools developed or created by the GO Consortium. These internal tools can be divided into seven subcategories: (a) GO's software tools, (b) GO's documentation, (c) GO's cross-referencing files, (d) the GO Annotation File (GAF), (e) the GO Subversion (SVN) Repository, (f) the GO Projects, and (g) the GO meetings.

The GO Consortium has developed a set of software tools to access and process the Ontology (controlled vocabularies) and annotation data, such as AmiGO, QuickGO, GO Slim, TermGenie, GO's Annotation Quality Control Checks, and GO's Annotation Extensions. As mentioned above, AmiGO and QuickGO are tools for searching and browsing the Ontology and annotation data. GO Slim enables users to create a subset of the GO according to their needs to simplify the view of the GO terms and annotation data (Gene Ontology Consortium, 2014l). In other words, a GO Slim is a customized cut-down version of the GO, which can be species specific or gene-product specific, depending on the user need. TermGenie (http://go.termgenie.org/) is a Web-based tool providing the GO curators with numerous templates for creating new GO terms (Gene Ontology Consortium, 2014u). Using a pattern-based approach, TermGenie can semi-automate, standardize, and ease the process of adding new GO terms compared with GO's Ontology Requests Tracker. GO's Annotation Quality Control Checks are a set of scripts developed to automatically check and ensure the quality of annotations submitted to the GO database (Gene Ontology Consortium, 2014d). Instead of creating more specific GO terms that would inflate the Ontology, GO's Annotation Extension field (column 16) allows the curators to use additional terms (from the GO or external OBO ontologies) and relations to further specify the GO terms used for annotation (Gene Ontology Consortium, 2014c). Several new GO term requests in the sample were rejected on these grounds, with the requesters advised to use existing GO terms and column 16 (annotation extension) instead of adding new terms.

The GO Consortium mainly uses its official Website (http://geneontology.org/) and a Wiki (http://wiki.geneontology.org/index.php/Main_Page) to store, edit, and provide access to GO's documentation, such as meeting notes, agendas, policies, guides, and instructions. In addition to serving as a tool for data curation discussions, GO's request trackers at SourceForge are treated as the community's documentation to record and keep track of the changes that have been made to the Ontology, as all the requests submitted to the trackers remain available indefinitely (Gene Ontology Consortium, 2014u). Below is an example of an editor using the GO Wiki and a related SourceForge request as evidence to reject a new GO term request:

I'm afraid we'd have to reject this request, based on the recent proposal to obsolete GO:0010843 promoter binding and some of its children, including hormone response binding terms (such as the existing estrogen response element binding, similar to the one you're requesting).
The obsoletion wiki is here:
http://wiki.geneontology.org/index.php/Proposal_to_obsolete_%22promoter_binding%22_and_child_terms
And there's a related SF request here:
https://sourceforge.net/tracker/index.php?func=detail&aid=3286579&group_id=36855&atid=440764

The archival data analysis found that some of GO's cross-referencing files, including GOCHE (cross-referencing to ChEBI), Protein2GO, InterPro2GO, rhea2GO, and UniPathway2GO, were mentioned in the discussions on GO's Ontology Requests Tracker. These files are mappings of the GO terms to identical, similar, or related terms in external databases. Curators used these cross-referencing files to detect structural inconsistency or missing links in the GO and to suggest new GO terms.

There were cases in which editors could not resolve the requests and needed to discuss them with other GO curators during meetings. The archival data analysis identified a number of GO meetings, including but not limited to the Cambridge GO Editors meeting, the GO Editors call, the GO Developers call, the GO Consortium (GOC) Meeting, the GO-ChEBI meeting, and the PAINT Workshop. Below is an editor's response to a request, indicating that she needed to discuss it with others at the GO Consortium Meeting:

I've added an agenda item to the May GOC meeting about how specific GO should be, when creating individual complexes. When protein binding becomes a heterodimeric complex. And what comes under GO versus PRO for protein complexes, so I'll await the outcome of that.

Below is another editor's response, demonstrating how the GO Editors meeting was used to resolve data quality problems of the GO:

I discussed this with Jade and Pam at our Cam-Editors meeting this morning. We decided that since we have grouping terms for histone processes elsewhere in GO (E.g. histone modifications) we could make an exception and add in terms for histone expression, mRNA processing and catabolism. I'll put a note in saying it's an exception and not a precedent since the histones are a well recognized and conserved group of proteins.

4.1.6.2 External reference sources. Besides GO's documentation, participants of GO's Ontology Requests Tracker used external reference sources as evidence to support their requests and viewpoints or to oppose others'. These external reference sources can be divided into eight subcategories: (a) scientific literature, (b) scientific thesauri, (c) scientific nomenclatures, (d) encyclopedias, (e) books, (f) dictionaries, (g) other bio-ontologies, and (h) biological databases. Most of the new GO term requests in the sample contained at least one piece of scientific literature as a reference. Requesters usually provided a PubMed identifier (PMID) in the requests instead of listing the title and author(s) of the literature.
The archival data analysis compiled a list of scientific journals that were used as reference sources on GO's Ontology Requests Tracker (see Appendix F). The most frequently occurring journal was the Journal of Biological Chemistry, cited in 32 requests and accounting for 10% of the sample. Other frequently cited journals included Molecular Biology of the Cell (4.06%), Proceedings of the National Academy of Sciences (3.75%), Genes & Development (2.81%), Cell (2.19%), the Biochemical Journal (1.56%), Current Biology (1.56%), and the Journal of Cell Science (1.56%). This indicates that the GO represents and organizes knowledge largely in the domains of molecular biology and biochemistry.

Other ontologies and biological databases were also frequently used as reference sources for the new GO term requests; in such cases, requesters usually provided an ontology or database identifier in the requests. Reference sources for the new GO term requests were not limited to scientific literature and biological databases: the English Wikipedia and textbooks were occasionally used as references for new GO term definitions. The following new GO term request included a UniProt identifier (P19801), a PubMed identifier (PMID:9786936), a link to an InterPro entry, and a link to a Wikipedia article as references. In particular, Wikipedia and InterPro were used to help define the new GO term:

Please may I request the following new term to annotate the tetranectin protein (P19801) in PMID:9786936
GO:0019904 protein domain specific binding
--

After speaking with P.T. Roddick at MIT, who has experience with Halobacter's bacterio-opsin light dependent proton pump, he is of the opinion that this should map to proton and not oxonium or hydronium ion, since these do not go up a channel, but the actual proton does, going from proton acceptor to another (walks up, either via NH2- or COO- groups available), etc. The porters bind the H+ using some group, then release it. The oxonium would only refer to the proton when in water/solution.

4.1.7 Rules

According to Activity Theory, rules refer to explicit or implicit norms, conventions, and regulations that enable or limit activities (Engeström, 1990). The archival data analysis identified a set of rules regulating the activities on GO's Ontology Requests Tracker.

4.1.7.1 Literary warrant. In LIS, literary warrant refers to "justification for the inclusion of a term in a vocabulary based on published evidence that is sufficient to prove that the form, spelling, usage, and meaning of the term are widely agreed upon in authoritative sources" (Harpring, 2010, p. 225). The archival data imply that requesters were usually required to provide pertinent scientific literature, usually PubMed articles, to support a new GO term request. This requirement is similar to literary warrant in LIS: the new GO terms should be empirically derived from scientific literature. Below is a request for a new GO term:

I would like to request a new biological process term for glycolate transmembrane transport. Suggested def. - the directed movement of glycolate across a membrane by means of some agent such as a transporter or pore. Child of GO:0034220 ion transmembrane transport. Thanks

The requester included a definition and suggested where to place the new term in the Ontology, but did not provide a literature reference to support the request.
The editor asked the requester for a reference: "Before I add your new term, would you like to suggest a PubMed ID or other reference to support it?"

4.1.7.2 Principle of consistency. The principle of consistency refers to selecting preferred terms for a controlled vocabulary based on structural consistency with other terms in the vocabulary, regardless of literary or organizational warrant (Svenonius, 2003). This principle was observed to regulate the selection of preferred GO terms in the archival data, such as in the request discussed in Section 4.1.4.9. Although the requester claimed that the pombe community always used 'inner nuclear membrane,' the editor still chose "nuclear inner

membrane" as the preferred GO term to keep it consistent with its sibling and parent terms. 'Inner nuclear membrane' was added as an exact synonym. This indicates that the inclusion of synonyms in the GO was not as strict as the inclusion of GO terms and the selection of preferred GO terms, and could be based on user or organizational warrant (Harpring, 2010).

4.1.7.3 Nouns in singular format. One of the OBO Foundry principles requires that the preferred terms in an ontology consist of nouns in their singular format (OBO Foundry, 2012). The archival data analysis found that, as a founding member of the OBO Foundry, the GO strictly followed this principle. For example, a requester suggested two new GO terms with definitions and a PubMed reference: 'senescence-associated heterochromatin foci' and 'senescence-associated heterochromatin foci formation.' Although the editor accepted the request, she slightly changed 'foci' to 'focus' in the new GO term names to ensure the singular format of the preferred GO terms:

Changed them (slightly) to the singular, and added in: senescence-associated heterochromatin focus ; GO:0035985 senescence-associated heterochromatin focus formation ; GO:0035986

4.1.7.4 True path violation. When a new term is added to the Ontology, it automatically inherits all the properties of its parent terms. In other words, the definitions of all the parents should apply to the new GO term. A true path violation occurs when a GO term fails to meet this criterion, and the Ontology needs to be revised (Aslett & Wood, 2006). This rule was observed in several requests/threads of the sample concerning the placement of specific GO terms in the Ontology. Based on the true path rule, a requester questioned the child terms of 'GO:0045990 carbon catabolite regulation of transcription,' as the definition of the parent term did not apply to its children:

There are true path violations in here too.
The term

carbon catabolite regulation of transcription
A transcription regulation process in which the presence of one carbon source leads to the modulation of the frequency, rate, or extent of transcription of specific genes involved in the metabolism of other carbon sources.

has children

Any process involving glucose that modulates the frequency, rate or extent of transcription.

which do not specify that the genes regulated are specifically involved in the metabolism of other carbon sources.

4.1.7.5 Species neutrality. Since the GO intends to be species independent, GO term names should be species neutral to ensure that they can be used across species (Gene Ontology Consortium, 2014g). This rule was observed in the archival data regulating the naming of certain GO terms. In the following request, the requester suggested changing a GO term name from 'CenH3-containing nucleosome assembly at centromere' to 'CENP-A containing nucleosome assembly at centromere,' as CenH3 is an organism-specific name:

Actually, could the primary term name be changed to CENP-A containing nucleosome assembly at centromere This is used more universally (I think CenH3 is an organism specific name?)

The editor accepted the request and changed the name of GO:0034080 from 'CenH3-containing nucleosome assembly at centromere' to 'CENP-A containing nucleosome assembly.' The original GO term name became a related synonym.

4.1.7.6 No gene- or gene product-specific GO terms. This rule requires that GO terms not be specific to particular genes or gene products. The requests discussed in Sections 4.1.4.2 and 4.1.4.6 reveal how this rule regulated the naming and inclusion of GO terms in the Ontology. The requester mentioned in Section 4.1.4.2 claimed that the GO term 'cystic fibrosis transmembrane conductance regulator binding' was a gene product-specific term and should be obsoleted. The requester mentioned in Section 4.1.4.6 suggested that the GO term 'DNA polymerase III, DnaX complex' be renamed because it contained the specific gene name 'DnaX.'

4.1.7.7 Other rules.
In the following request, the requester suggested renaming the GO term 'tectonic complex' to 'tectonic-like complex' based on a newly published article:

Hi, I would like to rename the GO:0036038 tectonic complex into 'tectonic-like complex'. A new paper published a similar complex which they named B9 complex (PubMed=22179047) and characterizes its function

The editor's reply to the requester indicated that GO term names should not contain '-like':

If I understand correctly, the protein complex described in PubMed=21725307 (and represented by GO:0036038, see the previous SourceForge request https://sourceforge.net/tracker/index.php?func=detail&aid=3440669&group_id=36855&atid=440764) is not exactly the same as the one you refer to, though they're both from mouse. In this case, it would be appropriate to create a general parent term, as you suggest, though we'd prefer to avoid "-like" in new GO term names if possible.

The GO consists of three sub-ontologies: cellular component, molecular function, and biological process. The GO terms in one sub-ontology may relate to terms in another sub-ontology, and several rules identified in the archival data concerned the relationships between terms belonging to different sub-ontologies. A GO term in one sub-ontology can never relate to a GO term in another sub-ontology through the 'is_a' relationship (Gene Ontology Consortium, 2014p). As mentioned in Section 4.1.4.18, a molecular function term can be 'part_of' a biological process term, and a biological process term can 'has_part' a molecular function term. However, a molecular function term cannot 'has_part' a biological process term, and a biological process term cannot be 'part_of' a molecular function term.

Another rule observed in the archival data regulated the creation of new GO terms. The editor's reply to a requester below implies that the GO curators preferred adding new terms via TermGenie to doing so manually:

Hi Ben, We're waiting for TermGenie to allow us to put template terms in for these type of terms, as it is fairly time-consuming to do them by hand. Let us know if you wish us to prioritize this request. Thanks for your patience […]
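The cross-sub-ontology relationship constraints just described can be expressed as a small validity check. The sketch below is illustrative only, not GO tooling: the rule table is one reading of the constraints stated in the text, and the namespace and relation names follow the text rather than any particular GO file format.

```python
# Cross-sub-ontology edges that the rules above permit, as
# (relation, subject namespace, object namespace) triples.
ALLOWED_CROSS = {
    # a molecular function term can be part_of a biological process term...
    ("part_of", "molecular_function", "biological_process"),
    # ...and a biological process term can has_part a molecular function term
    ("has_part", "biological_process", "molecular_function"),
}

def relation_allowed(relation: str, subj_ns: str, obj_ns: str) -> bool:
    """Check a proposed edge against the cross-sub-ontology rules."""
    if subj_ns == obj_ns:
        return True  # within one sub-ontology these relations are unrestricted
    if relation == "is_a":
        return False  # is_a may never cross sub-ontologies
    return (relation, subj_ns, obj_ns) in ALLOWED_CROSS

assert relation_allowed("part_of", "molecular_function", "biological_process")
assert not relation_allowed("is_a", "molecular_function", "biological_process")
assert not relation_allowed("part_of", "biological_process", "molecular_function")
```

A check of this kind makes the asymmetry of the rules explicit: 'part_of' and 'has_part' are permitted in only one direction between molecular function and biological process terms.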

4.2 Participant Observations

As described in Section 3.4, the researcher became a registered user of the GO Project on SourceForge, observing discussions on GO's Ontology Requests Tracker on a daily basis from January 9, 2014 to August 8, 2014. In terms of participation, with the guidance of a biologist, the researcher used the GO to annotate a maize gene named Single myb histone 1 (Smh1), which was reported in a scientific article (Marian et al., 2003) but not found in the GO database. The participation started on January 30 and ended on March 21. Since similar observations of community interactions had been conducted in the archival data analysis, the researcher focused on writing

and reporting reflective fieldnotes, which recorded her learning of community languages, practices, cultures, and rules; her experience of using the GO and communicating with the GO curators; and her reflections on community meanings and cultures.

4.2.1 Interactions with a GO Curator

The biologist and the researcher sent a message to the GO HelpDesk on January 30, 2014, stating that a maize protein family called SMH was not annotated in the GO, despite published biochemical evidence for its telomeric DNA binding function. A link to the PubMed article about this protein family was included in the message. Within 24 hours, a GO curator located in the United Kingdom responded to the message and expressed her willingness to read the article and annotate the SMH protein family. The researcher's impression of the GO community was that the curator responded to users promptly and was open to suggestions and input. The conversation between the GO curator and the biologist then switched from the GO HelpDesk to email, as email made it easier for the biologist to make suggestions. The GO curator emphasized that any statement made about a gene product should be supported by a reference, preferably experimental data. She also indicated that the GO could capture author opinion to a certain extent if it was based on a peer-reviewed article with a PubMed identifier. This corresponds to the finding of the archival data analysis that the activities of developing and maintaining the GO were regulated by the rule of scientific evidence. When the GO curator realized that the biologist was an expert on that protein family, she suggested that the biologist search QuickGO and provide the GO terms and proteins that could be associated. The researcher learned that the GO community respected lab scientists' input and opinions and might try to reflect them in the Ontology. Interestingly, although AmiGO is GO's official browser, the GO curator suggested using QuickGO. Following the GO curator's suggestion, the researcher searched QuickGO using the same keyword (i.e., Smh1) used in AmiGO, and found two hits on Smh1 in maize, which were not found in AmiGO. Both entries in QuickGO came from UniProtKB.
Clicking on one of the entries, 'Q6VSV4' (a UniProt identifier), showed a GO term, 'double-stranded telomeric DNA binding,' associated with the gene. The evidence code for this annotation was 'inferred from electronic annotation' (IEA). However, this function had been validated by experimental data and included in the PubMed article sent to the GO curator, so the evidence code should have been 'inferred from direct assay' (IDA). Since these gene entries were

imported from UniProt, the researcher did a UniProt search using the same keyword, 'Smh1.' Similarly, two entries relevant to the gene were found in UniProt: one with a length of 55 bps and the other with a length of 299 bps. The status of both entries was a grey star, indicating that they had not been reviewed by a curator and had only been annotated by automatic pipelines. The biologist also requested standardizing the naming of the SMH proteins and updating their domains, which were outside the scope of the GO and beyond the GO curator's responsibilities. The GO curator connected us by email to UniProt curator A, who dealt with plants, on February 5, 2014, and copied both of us in the email.

4.2.2 Interactions with UniProt Curators

UniProt curator A immediately replied to the GO curator and indicated his willingness to annotate the SMH protein family by adding it to UniProt's annotation priorities. Curator A assigned the annotation (including the GO annotation) to curator B, and specified that B could not work on it until the end of the month. However, on February 14, curator B contacted the biologist by email, indicating that he was the curator in charge of annotating the maize SMH protein family. He also attached a text file of his preliminary annotation to the email and asked the biologist to provide feedback from his expert perspective. Like the GO curator, the UniProt curators responded to users promptly and respected lab scientists' input and feedback. To gain an insider's view, the researcher annotated the Smh1 gene with the guidance of the biologist. Before annotating the gene, the researcher carefully studied numerous GO annotation-related documents available on GO's official website, including Annotation Standard Operating Procedures, GO Annotation Policies and Guidelines, Guide to GO Evidence Codes, GO Annotation Conventions, How to Submit GO Annotations, and Submit GO Annotations. The GO website provided several templates for GO annotations. Since the Smh1 gene we were annotating is in maize (Zea mays), the researcher downloaded the template for analyzing plant (Arabidopsis thaliana) publications. The researcher then carefully read the article discussing the Smh1 gene, paying attention to the functions and cellular location of the gene. Using the template provided by the GO, the researcher searched QuickGO and assigned a set of GO terms and evidence codes to the gene (see Appendix G). The biologist reviewed the annotation and supplemented additional information in the "Evidence with" column, indicating the tables and figures in the article that could support the annotation. We placed a question mark

next to the evidence code 'inferred from physical interaction' (IPI) for the GO term 'double-stranded telomeric DNA binding' to denote our uncertainty in assigning that evidence code. We sent our annotation to curator B, who reviewed it and changed the IPI evidence code for 'double-stranded telomeric DNA binding' to 'inferred from direct assay' (IDA). He explained that IPI was used for interactions, including those other than protein-protein interactions, and required the inclusion of a valid identifier (e.g., UniProt, ChEBI, WormBase) for each partner of the interaction; this did not apply to our case. Curator B also pointed out that since the GO terms are organized in a hierarchy, there was no need to annotate the gene with 'DNA binding' and 'telomeric DNA binding,' which were parents of 'single-stranded telomeric DNA binding.' This implies the following rules regulating GO annotation: a gene should be annotated to the most specific GO terms, and annotating to a GO term implies annotating to all of its parent terms. As described above, we found two entries in UniProt for the Smh1 gene in maize. One ('Q6VSV4') was a fragment of the other ('Q6WS85'). The biologist did a BLAST alignment and confirmed that one of them was redundant. This data redundancy problem was propagated to QuickGO. We reported the problem to curator B and made the following requests: (a) changing the gene name of 'Q6VSV4' from 'smh1' to 'smh1 gene fragment,' (b) changing the protein name of the main entry 'Q6WS85' from 'Putative MYB-domain histone H1 family protein' to 'Single myb histone1, smh1,' and (c) changing the status of the Smh1 gene in UniProt to 'Reviewed' since the biologist and curator B had reviewed it. Curator B informed us later that UniProt had combined the gene fragment with the protein entry to reduce redundancy.
When searching for 'Smh1' in UniProt afterwards, only one hit on Smh1 in maize ('Q6WS85') was found, and the status of the entry had become 'Reviewed.' The researcher did a QuickGO search on 'Smh1' and likewise found only one entry, 'Q6WS85.' The gene was associated with the GO terms 'double-stranded telomeric DNA binding' and 'single-stranded telomeric DNA binding,' both with the evidence code 'IDA.'
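The two annotation rules noted above (annotate a gene to the most specific GO terms, since an annotation to a term implies annotation to all of its parent terms) can be sketched as a simple ancestor-closure computation. The parent table below is a toy fragment for illustration only, not the real GO structure, and the redundancy check is an assumed reading of curator B's advice, not GO software.

```python
# Toy is_a fragment: child term -> set of parent terms (illustrative only)
PARENTS = {
    "double-stranded telomeric DNA binding": {"telomeric DNA binding"},
    "telomeric DNA binding": {"DNA binding"},
    "DNA binding": set(),
}

def ancestors(term):
    """All terms implied by annotating to `term` (excluding the term itself)."""
    found = set()
    stack = list(PARENTS.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, ()))
    return found

def remove_redundant(annotations):
    """Drop terms already implied by a more specific annotated term."""
    implied = set().union(*(ancestors(t) for t in annotations))
    return {t for t in annotations if t not in implied}

# Annotating to the leaf term already implies both of its ancestors,
# so the explicit parent annotations are redundant:
keep = remove_redundant({"double-stranded telomeric DNA binding",
                         "telomeric DNA binding", "DNA binding"})
assert keep == {"double-stranded telomeric DNA binding"}
```

This mirrors why curator B removed the 'DNA binding' and 'telomeric DNA binding' annotations: they add no information once the more specific term is annotated.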

4.3 Semi-structured Interviews

As mentioned in Section 3.5, the researcher conducted semi-structured interviews after the archival data analysis and participant observations to allow for follow-up questions and to broaden the understanding gained from the archives and observations. The researcher pretested the interview guide (see Appendix A) with two biologists to ensure that the questions were clearly written and that informants would have no difficulty understanding them. Considering that some of the

interviews would not be conducted face-to-face, one of the pretests was done face-to-face, and the other was conducted by phone to ensure that the entire procedure went well. A total of 21 informants were interviewed, most of whom were recruited using snowball sampling (Schutt, 2006). Two informants working at the same institution were interviewed at the same time at the request of one of them. All the informants were given the option of being interviewed face-to-face, using online audiovisual media (including Skype, Google Plus Hangouts, and Apple FaceTime), or by phone. Of the 20 interviews, ten were conducted face-to-face, six via Skype, two by phone, one via Google Plus Hangouts, and one via email. The researcher recorded and transcribed all the interviews, except the one conducted via email at the request of the informant. Any personal names in the quotes from the transcripts were replaced with pseudonyms, and any identifying information was replaced with letters. All the interviews were analyzed as described in Section 3.6. Of the recorded interviews, the shortest was 9 minutes and 52 seconds, and the longest lasted 82 minutes and 15 seconds. The average interview length was 26 minutes and 50 seconds, and the median was 20 minutes and 3 seconds.

4.3.1 Demographics of the Interviewees

Of the 21 interviewees, three were GO Project members; three were GO contributors; and 15 were GO users who did not directly participate in developing and maintaining the GO. They came from nine research institutions located in three countries. In terms of gender, 33.33% of the interviewees were female and 66.67% were male. With respect to education, 52.38% of the interviewees had a doctoral degree, 19.05% had a master's degree, and 28.57% had a bachelor's degree. In terms of position, six of the 21 interviewees were biocurators for model organism databases or other biological databases; five were either associate or assistant professors in biology, biochemistry, medicine, or information science; one was a scientist at a national laboratory; one was a postdoctoral fellow in medicine; and nine were PhD students in biology, bioinformatics, or ecology. Notably, two of the six biocurators were both GO curators and biological database curators. The 15 GO users who were interviewed indicated that their frequency of using the GO ranged from daily (e.g., doing research on the GO) to once or twice a year. Most of these GO users implied that they had learned about the GO from their colleagues and the literature. Three of the GO users mentioned that they first learned about the Ontology in a bioinformatics course.

4.3.2 Activities around the GO

A typology of activities around the GO (see Table 4.2) was identified from the interview data. These activities can be grouped into three broad categories: (a) developing and maintaining the GO; (b) using the GO for different research purposes; and (c) educating and disseminating information about the GO. Each of these categories has subcategories.

4.3.2.1 Developing and maintaining the GO. Similar to the activities identified in the archival data analysis, the GO curators and contributors participate in developing the Ontology by adding new terms to capture the most current biological knowledge. A GO contributor described her process of contributing new terms:

I am not an actually GO curator, although I do submit requests for GO terms. … I do make request to the GO Consortium for terms that need to be included in the hierarchies and stuff like that. … [For example,] I need terms for lateral line development, which was not at all represented in the GO trees. So what I did was I put together a proposal of all of the terms I need with the definitions and where I thought they belonged in the tree... And I submitted that document on the Tracker [GO's Ontology Requests Tracker].

Table 4.2: Activities around the GO

Developing and maintaining the GO (Section 4.3.2.1):
- Adding new GO terms and types of relationships between the terms
- Assigning the GO terms to genes or gene products
- Maintaining the GO by reporting, discussing, and resolving any quality problems with the controlled vocabularies and annotations
- Developing and maintaining tools to support different activities around the GO
- Bringing in new communities to the GO Consortium
- Having meetings to discuss issues related to the development and maintenance of the GO

Using the GO for different research purposes (Section 4.3.2.2):
- Doing gene enrichment analysis
- Predicting novel gene or protein function
- Building predictive models
- Creating new ontologies

Educating and disseminating information about the GO (Section 4.3.2.3):
- Training curators to use the GO
- Writing GO-related papers
- Educating undergraduates to use the GO

Besides adding new terms to the Ontology, the GO curators also engage in establishing, defining, and determining relationships between the terms. A GO curator stated:

So there are several people, I am one of them, although I am not that terribly involved, who develop the [Gene] Ontology. That means not just thinking of the terms, but it's very important to think of the relationship between the terms.

A GO Project member explained the importance of adding relationships between the terms by comparing the GO to MeSH:

There is the MeSH, Medical Subject Headings. And MeSH is great. It is very comprehensive, but it was built to suit a given purpose, which was to index papers, and books, and journals. But it's good for that. It's not good for anything more sophisticated than that because the relationship types are defined. And one of the steps forwards was adding relationship types [between the GO terms].

The GO not only consists of a set of terms with relations operating between them, but also curates the annotation, that is, the association between the GO terms and genes or gene products in the collaborating databases (Gene Ontology Consortium, 2014g). GO's collaborating databases (e.g., UniProt) have spent a great deal of effort assigning GO terms to genes or gene products (i.e., annotation) and uploading annotations to the GO database. A model organism database curator described the annotation process as follows:

So I use it [GO] specifically to annotate gene functions. So what we do in X [a model organism database] is we read primary literature, primary scientific literature that deals with x [a model organism] development. And based on the data presented in the literature, we annotate to the gene various gene functions. And we utilize GO to do that.
So, for example, if a gene is said to be involved in mitosis and in the paper they described that there is a mutant for this particular gene and the gene has a defected mitosis process, so then we would annotate that gene with the GO process for mitosis and make a phenotype annotation for that.

What this model organism database curator did is called manual or literature annotation, which involves reading full-text scientific literature, extracting genes or gene products from the literature, and, based on the experimental findings and author statements, assigning GO terms to the genes or gene products. In contrast to manual annotation, computational annotations can be generated based on sequence or structure similarity, or by mapping to

annotations created using other controlled vocabularies. A model organism database curator with a computational background described how he did computational GO annotations:

We work for the model organism database for y. … We also have a metabolic pathway viewer that has taken several pipelines that do automatic GO annotations onto y genes. … So I don't do any [literature] annotation myself. I am more a computational person. We're doing some automated functional annotation using GO terms to automate the whole y genome. So given the whole assembly, look at the structural annotations of the assembly and be able to assign GO terms to them.

Similar to what was found in the archival data analysis, a great deal of collaborative effort from different communities has gone into maintaining the GO. Through the GO Helpdesk and the various request trackers at SourceForge, the GO allows users and curators to provide feedback that helps detect and correct errors or omissions in the controlled vocabularies and annotation data. A GO curator described his ontology maintenance activities around the request trackers and the GO Helpdesk:

But everyday I get bounces from a list of every SourceForge item say for a new request or something like that. … The other thing we have is we have something called GO Help. And people can request things from GO Help. … So I am more interested in Ontology Requests and Annotation Requests. And the Annotation Request is usually like somebody disagrees with a particular annotation that they see or some authors very upset that we haven't mention his paper. So you will see things like that on the Annotation Tracker quite a lot. Ontology Request we see somebody thinks that… A simple one, somebody sees a process, and usually one of the annotators, and they need regulation terms that do not exist. So they put in a request. Please can we have regulation terms with that process? And one of us will put them in if it's a reasonable request.
In addition to developing and maintaining the GO and providing annotation data, the GO Project involves developing and maintaining tools to support the use, creation, and maintenance of the GO. A GO Project member revealed that the GO was building tools for communities to contribute their annotations:

We actually are in the process of building some software tools, hopefully in the next year, cause we would like to see community annotation. … We are hoping to get something

that's a JavaScript-based tool out there in this grant period. Actually we have four more years to go. But in this grant period we would like to get that out there.

Interestingly, GO users, particularly bioinformaticians with computing skills, also engage in developing tools to access and process the Ontology data. A GO user who was a bioinformatician explained how his lab was building a tool to visualize the GO:

Actually in my group there has been work on this tool to visualize the Gene Ontology all at once. … We reduced the Gene Ontology to a tree by removing edges until we had a tree. So what I mean is we remove that just until each node had at most one parent. And then, you know, there are algorithms for laying out this tree. And then after we lay out the nodes, then we put back the edges.

The GO aims to unify the representation of genes or gene products across all species. To fulfill this objective, the GO Consortium has been inviting more and more databases and communities to collaboratively develop and maintain the Ontology. A GO Project member revealed:

Frankly speaking, we [the GO Consortium] don't have enough biological expertise in lots of areas. There has been a project recently to work on cilia. I had no idea how important cilia were in development. But apparently they really are. … Apparently it's if you have a problem with your cilia, you end up with those problems where your heart is misplaced. So it's just I had no idea. So we're bringing in people. … We've got another one set up in December with the group for synapse, genes involved in synapse formation and, you know, just signaling across synapses.

To facilitate communication among the different collaborating databases and communities, there are regular meetings at different levels to discuss issues related to the development and maintenance of the GO. A GO curator indicated:

There is an annotation meeting I think every two weeks.
And it's not all the annotators, but some representative of each of the databases tries to go to those meetings. GO managers have a call every other week. And then there is something called GO Talks, which are the PIs [of the GO Project]. They have something once a week. …The GO Consortium gets together twice yearly.

4.3.2.2 Using the GO for different research purposes. The interviews with GO users allowed the identification of a set of activities using the GO for different research purposes, which were missing from the archival data. The most frequently mentioned use of the GO was gene or gene

set enrichment analysis, which is the process of finding GO terms that characterize a set of genes that were over- or under-expressed under certain experimental conditions, based on what is already known about those genes (i.e., GO annotations) (Gene Ontology Consortium, 2014j). Most of the interviewees used gene set enrichment analysis to help interpret their data, such as obtaining a functional profile of a gene set. For example, one of the interviewees mentioned:

In one study, we analyzed the location of sequences that conform the structure called quadruplex. And they were associated with a number of genes. And we used Gene Ontology to see if there was any functional categories that these genes less or preferentially enriched for it.

Some of the interviewees used gene set enrichment analysis for quality control:

So mostly what I use it [the GO] for is sort of quality control. … if I have a dataset, just to make sure that there is a signal in the dataset. A lot of times I just do a Gene Ontology enrichment to see that the right things are showing up. So for cancer, you expect to see some sort of cell cycle, activation, and that kind of thing.

Some of the interviewees used gene set enrichment analysis to help organize or filter down their data:

So we examine gene expression, but on a genome wide level, so in flies that might be examining 14,000 genes at one time. And mouse, you know, it's closer to 30,000. And you know, you did different treatments; like in fruit flies we might look at males versus females. And then you get a long list of 1,000 genes. And it's really complicated to make sense of it. And so the Gene Ontology databases help kind of organize the data in a way that is easier for a brain to think about rather than 1,000 individual genes.

Besides gene set enrichment analysis, users applied the GO to predicting the functions of novel proteins or genes that were poorly studied or about which little was known.
A GO user explained:

As I said, Gene Ontology I think is able to cluster the proteins of known functions. But my case is where I get a lot of like unknown proteins. So I kind of use it indirectly where I basically, you know, find something, which has known function, related to my unknown target and trying to get the background information about maybe potentially where these things belong to.

Interestingly, the GO was used as a comparison target when building other ontologies. One interviewee said, "We also have a couple of people on the lab who work on building ontologies from molecular data, and so they use the GO ontology as sort of to compare their ontologies against." In contrast to gene set enrichment analysis as a post-experiment validation, another interviewee used the GO for phenotype analysis, developing predictive models before running experiments:

But people actually use it not so much in predictive methods. So before you run experiments, using the Gene Ontology to create a model what's going to happen, not in model to explain what you observed after the fact. … I'm trying to use the Gene Ontology as a predictive model of why the cell has a certain phenotype. And what I'd like to show is, for example, that certain terms in the Gene Ontology are sort of being activated in a certain genotype. So, for example in a genotype, if genes A, B, and C are mutated, I'd like to have a global visualization of the Ontology where the terms that gene A, B, and C are involved with, are lit up… It's a network diagram basically with things are lit up maybe different colors, different sizes. I want a network visualization.

4.3.2.3 Educating and disseminating information about the GO. Both the archival and interview data indicate that the GO Consortium has been involved in community outreach, such as educating undergraduate students to use the GO (the CACAO Project), training curators of collaborating databases to do GO annotation, and writing papers introducing the GO and its development. One of the interviewees, who was both a GO curator and a collaborating database curator, mentioned that one of her responsibilities was to train curators to do GO annotations:

I am responsible for training new curators. I will teach them how to curate papers, and I will check their annotations before they go public.
… After some time, I will not check their annotations one by one. … We also have some hands-on tutorials, teaching how to use QuickGO and do GO annotations, and also introducing GO.
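The "lit up" network visualization this interviewee describes can be illustrated with a small sketch: given a toy is_a hierarchy and hypothetical gene-to-term annotations (all identifiers below are invented for illustration), the terms to highlight for a genotype are the annotated terms of the mutated genes plus the ancestors of those terms, since GO annotations propagate up the hierarchy.

```python
# Toy GO fragment: child term -> list of is_a parents (invented IDs).
PARENTS = {
    "GO:B": ["GO:A"],
    "GO:C": ["GO:B"],
    "GO:D": ["GO:A"],
}

# Hypothetical direct annotations: gene -> set of GO terms.
ANNOTATIONS = {"geneA": {"GO:C"}, "geneB": {"GO:D"}, "geneC": {"GO:C"}}

def ancestors(term):
    """Return the term itself plus all of its is_a ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

def activated_terms(mutated_genes):
    """GO terms that would be 'lit up' for a genotype with these genes mutated."""
    lit = set()
    for gene in mutated_genes:
        for term in ANNOTATIONS.get(gene, ()):
            lit |= ancestors(term)
    return lit

print(sorted(activated_terms(["geneA", "geneB"])))
```

A real implementation would read the term graph from a GO download and hand the resulting term set to a graph-drawing library for coloring and sizing; this sketch only computes which nodes to highlight.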

4.3.3 Division of Labor

The interview data imply that, based on their tasks, the people around the GO can be divided into the GO Project team, the GO contributors, and the GO users (see Figure 4.1).

4.3.3.1 The GO Project team. In terms of tasks, the GO Project team has created the Ontology (the controlled vocabularies). As the leading group of the GO Consortium, the GO Project team is responsible for developing, maintaining, and delivering the Ontology; providing annotation data; and developing and maintaining tools to support different activities around the GO. In terms of power and status, the GO Project team can be divided into the GO curators, the GO gatekeepers, the GO software engineers, and the GO directors group. Based on their tasks, the GO curators can be further divided into the GO editors and the GO annotators. The GO editors focus on developing and editing the controlled vocabularies. The GO annotators are mainly responsible for annotating papers using the GO. A GO curator can be a GO editor and a GO annotator at the same time. One GO curator described his responsibilities as follows:

We have an Ontology Request Tracker at SourceForge. There are new terms need to be added to the Ontology. We check the request trackers daily. There is a roster of editors doing rotation. We try to respond to users as soon as possible. … as an ontology editor, sometimes I am asked to obsolete the term, not delete it. … There is a GO Help roster. That's probably around 11 people. And I keep thinking my term is gonna coming up pretty soon. And, you know, I mean it's just a week of looking at that GO Help thing, and seeing... And it's a lot of time it's like, oh that's a software question. I can't help you with that. But so and so knows more about that than I do. And so I forwarded the thing on to him, something like that. … There is people who use the vocabulary to annotate gene products, which is another thing that I do.

His responsibilities imply that he was working as a GO editor and a GO annotator at the same time. The GO curator also indicated, "Typically a GO curator is a model organism database curator." However, the two roles differ in their permissions and responsibilities. A model organism database curator revealed the differences:

I am not an actually GO curator, although I do submit requests for GO terms and I… curate with GO terms.
And I do make request to the GO Consortium for terms that need to be included in the hierarchies and stuff like that. So I am not actually inputting GO terms, but…asking for someone to put in terms. … And we have a person here…who is kind of like our local GO person. If there are things that he finds when goes to the GO meeting, he passes that information onto us.

Compared to model organism database curators, the GO curators have more knowledge of the Ontology and the permission to make modifications to the GO, attend the GO meetings, train local curators to do the GO annotations, review requests submitted to GO's request trackers at SourceForge, and add new GO terms via TermGenie. The GO gatekeepers are GO curators with more knowledge of the Ontology and more power in terms of administering the request trackers and TermGenie. A GO curator explained:

They [the GO gatekeepers] are curators, but they are curators that have a little more knowledge of the Ontology, for example, and the whole process. Should not every data curator, annotation curators, you know, is a gatekeeper. There is a gatekeeper roster. It might be 4 people, just like there is a GO Help roster. … the gatekeeper decides who is doing what, you know, kind of things [on the request trackers]. … most of the curators, the Ontology curators have access to TermGenie. … at the end of the day every term is vetted by the gatekeeper at the time for the TermGenie.

[Figure 4.1 summarizes the division of labor. The GO Project team (the GO directors group, the GO gatekeepers, and the GO curators and software engineers) develops, delivers, and maintains the vocabularies; provides annotations; develops and maintains tools; and trains local curators. The GO contributors (the GO Consortium members and external communities and groups) produce manual and computational GO annotations, submit annotations to the GO database, suggest new GO terms, and report errors in annotations. The GO users (bioinformaticians and bench scientists) use the GO for research, publish papers, produce computational GO annotations, and develop tools to help use the GO.]

Figure 4.1: Division of labor around the GO

There is a specific group of people in the GO Project responsible for developing and maintaining software tools for the GO. A GO curator pointed out this group of people and their importance for the Ontology:

The GO also has software engineers. Their job is to make sure things work. I can't do that. You probably saw Andy. His name you might see a lot at SourceForge. And I can see Frank Hart. These are pure ontology speak people who make sure the software does when it is supposed to do when it is checking loads or things like that. Without these people, we would be in a mess.

The GO directors group consists of four scientists, responsible for gaining and allocating funding, leading the project, charting the direction, setting priorities, establishing milestones, resolving conflicts, administering the GO Consortium meetings, and reporting to the funding agency (Gene Ontology Consortium, 2014k).

4.3.3.2 The GO contributors. The primary contributors to the GO are the GO Consortium members, particularly model organism databases. Founded by three model organism databases, the GO Consortium now consists of more than 30 model organism and protein databases and biological research communities, which annotate the GO terms to their genes or gene products and contribute their annotations to the GO database (Gene Ontology Consortium, 2014i). A model organism database curator provided the number of manual annotations that his database had contributed:

[As of June 27, 2014] we've got 25,426 genes that have some GO annotation of some sort, could be electronic, could be manual, [could be] both, for a total of 336,800 annotations. … So roughly we have a total of roughly, yes, 12,000 genes that have been manually annotated with literature. All those that is a total of almost 100,000 manual annotations in different streams.

Model organism database curators not only contribute annotations to the GO database, but also suggest new terms or changes to the Ontology.
As described above, whenever the curators cannot find the GO terms that they need to annotate a paper or a gene, they may request them on GO's Ontology Requests Tracker at SourceForge. The GO contributors are not restricted to the GO Consortium members. As an open community, the GO Consortium encourages any other communities or groups to submit content and annotations for inclusion in the GO database. The GO Consortium has specific instructions on how external communities can contribute annotations to the GO and a mechanism to ensure that they receive credit for their contributions (Gene Ontology Consortium, 2014m).

4.3.3.3 The GO users. Two groups of GO users were identified from the interview data: bench scientists and bioinformaticians. Even though the model organism database curators use the GO to curate papers, they identify themselves as less a part of the user community. A model organism database curator claimed, "We are less of the user community, although we use it [GO] to annotate the gene." As mentioned above, bench scientists conduct gene set enrichment analysis to help interpret, organize, and filter their experimental data. A GO curator provided an example of how bench scientists could make use of the GO:

What people can do and this might be bench scientists, they do microarray experiment, where they dump ethanol on a bunch of cells versus they don't. And then they isolate the RNA, and they see what expression of, what genes increase the expression, and what genes decrease the expression. As they get a list, and the list is really meaningless cause they, you know, like sometimes with these microarrays dealing with thousands of points. … they can use the vocabulary [the GO] plus the annotations that the community has already done on those genes to ask of this set that I have, what categories of functions do they fall into these already known from other genes.

The GO curator provided another example of how bench scientists could use the GO, that is, suggesting experiments to be done:

One of my favorite terms is an experiment that a lot of people do with the new gene they clone, very easy to do, is to do some sort of protein-protein interaction experiment … maybe we cloned the gene for that we know already say for Actin-1. We express the protein, and we see what proteins bind to Actin-1. And by that you get some data, and there is genes that you never see before.
She calls them, you know, gene X, gene Y, gene Z. You don't know anything about it. But at least for one thing it's somehow interacts with Actin-1. … Ah, we know what Actin-1 does. Maybe this thing does the same thing. Maybe it has something to do with it because it is binding one of the partners or whatever. Let's do an experiment. So that's how I think that the GO really serves the general community in supplying, based on the annotation sets, the ability to analyze data, to suggest experiments that should be done.

A GO user specifically identified himself as an experimental biologist:

Most people in our department use it [the GO] more or less, but not so deeply, because we are not a bioinformatics department. We will focus on biology and experiments.

The other group of GO users identified from the interview data is bioinformaticians, who use the GO to generate computational annotations, develop tools, build predictive models, or construct other ontologies to process and represent high-throughput experimental data. Although bench scientists do not participate in developing and maintaining the GO, with the help of the GO they design and conduct experiments, process and analyze data, obtain new findings, and publish papers. Model organism database curators annotate those papers with the GO terms and input the annotations to the GO database, which are in turn used by bench scientists to help generate new experimental data and publish papers. In other words, the experimental data produced by bench scientists are the building blocks of the GO. Bench scientists indirectly develop the GO by using it to conduct research and produce experimental data. Generally speaking, any bench scientist or bioinformatician can be a contributor to the GO.
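At its core, the gene set enrichment analysis described in this section is an over-representation test: given N genes in the background, K of which are annotated to a GO term, how surprising is it that a list of n differentially expressed genes contains k of them? The sketch below uses the hypergeometric upper tail, the kind of statistic such enrichment tools build on; all counts are invented for illustration.

```python
from math import comb

def enrichment_p(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term
    (hypergeometric upper-tail probability)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Toy numbers: 1,000 background genes, 40 annotated to the term,
# 20 differentially expressed genes, 6 of them carrying the term.
p = enrichment_p(1000, 40, 20, 6)
print(f"p = {p:.2e}")
```

With an expected overlap of only 20 × 40 / 1000 = 0.8 genes, observing 6 yields a very small p-value, which is exactly why a "meaningless" gene list becomes interpretable once grouped by GO categories. Real tools additionally correct for testing thousands of terms at once.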

4.3.4 Communities

The GO Consortium is the primary contributor to the GO, consisting of three groups: (a) model organism databases, (b) protein databases, and (c) biological research communities. A GO curator described the communities within the GO Consortium as follows:

The Consortium members you probably see on the Webpage. And there is the founders which are like FlyBase, and mouse, and yeast, and C. elegans. And there is people that come later, like Dictyostelium, Rat Genome Database. And then, you know, we always say there is no real human database. But GO uses UniProt. Somebody from UniProt usually comes cause after all human is not a model organism.

In terms of the user community, experimental biologists are the major user group. They come from different communities, studying humans or a specific model organism (e.g., chicken, dictyostelium, fruit fly, pombe, maize, or yeast). The interview data imply that the GO users are not limited to biologists, but include biochemists, biomedical scientists, bioinformaticians, ecologists, and computer scientists. One of the interviewees identified herself as a bioinformatician:

There is a lot of people now, I mean bioinformatics too, you get people who their job… They make a career out of analyzing the data once it's in that form. I will put myself sort of in that category that I don't do the curation, but I work closely with curators because I realized that any of the analyses that we do are dependent on the accuracy and fidelity, and completeness of the information that has been collected.

A GO user, who was doing proteomics work, specifically pointed out that he was not a biologist:

I mean for me it's like not a real biologist. … Because I'm mainly doing a lot of like the proteomic work, … uncover a lot of like unknown proteins. So basically people don't really know about what the function about the protein. And then I'm trying to basically understand what kind of pathway these proteins belong to. And then usually I'll just, you know, look up these proteins in the related database, and sometimes the Gene Ontology, … where, you know, they give some like physics or the clues about what the proteins are.

Another GO user revealed that computer scientists also made use of the GO to build computational models:

I worked in a machine-learning laboratory with Dr. Richard Howard. And we were sort of working on to be able to predict GO annotation from protein sequence. … We did reduce alphabet representations of proteins. So instead of using the 20 letter amino acid alphabet, we looked at ways to reduce that alphabet, whether it's, you know, physical properties or looking at substitution matrix, so things like that. So to be able to test or validate our methods, we used GO as a functional, or, you know, a way to functionally annotate. So we could pull those annotations from GO pretty easily.

4.3.5 Tools

The interview data analysis generated a typology of tools that help people use, develop, and maintain the GO (see Appendix H). These tools can be divided into six categories: (a) ontology browsing and searching tools, (b) ontology development tools, (c) the GO annotation tools, (d) the GO term enrichment tools, (e) communication tools, and (f) others.

4.3.5.1 Ontology browsing and searching tools. The ontology browsing and searching tools that the interviewees used included, but were not limited to, the ones developed by the GO Consortium (e.g., AmiGO, QuickGO), the ones created by other communities or groups (e.g., REViGO), and model organism databases or other biological databases providing the GO annotations. Since the interviewees included the GO curators and users from different biological communities, the tools that they used to browse and search the GO varied by their communities. Most of the users interviewed did not use the GO directly, but found the GO terms associated with the genes or gene products that they were interested in from model organism databases or relevant databases of their communities. For example, the interviewees of the maize community mentioned that they used CornCyc, MaizeCyc, MaizeGDB, and TAIR, all of which contain the GO annotations to the maize genes or gene products. The interview data imply that the GO curators also used different browsers. They tended to favor the ones developed by their own communities. One of the interviewees specifically pointed out why she preferred AmiGO: "Since we developed AmiGO, I have to admit a certain partiality to AmiGO." A GO curator from UniProt explained why she favored QuickGO:

UniProt developed the QuickGO. I can't remember which one was developed first. But QuickGO is quicker. AmiGO is slower. Now AmiGO has a new interface. …

A model organism database curator, who was also an editor of another ontology, used the Ontology Lookup Service at EMBL-EBI. This may be due to her role as an ontology editor:

I use AmiGO. I don't use it as much. I don't really care for that interface. I use often the Ontology Lookup Service through the EMBL site to look for terms and their definitions, make sure I'm using the right term for the right process I am trying to describe. … So this, the Ontology Lookup Service, it actually holds all ontologies and the GO just happens is also within that. So I use it. It's the interface I prefer. So this Ontology Lookup Service, like I said, it has a lot of ontologies and GO happens to be one of the ontologies. I just like the interface that they present. I find things easier with it.

4.3.5.2 Ontology development tools. Ontology development tools help the GO curators edit and maintain the Ontology. They include cross-referencing databases (e.g., ChEBI, the Cell Type Ontology), which are similar to the ones identified in the archival data analysis; TermGenie for adding new GO terms; ontology editors (e.g., Protégé, OBO-Edit); scientific literature as evidence for changes to the Ontology; and contributors' initials or ORCIDs for attribution and accountability. Interestingly, a GO curator indicated that he used the ontology editor OBO-Edit more as a browser:

I like OBO-Edit, which is not only an ontology tool. You can, you know, make an ontology with it. But for me it helps me browse very quickly because I can search for term, and then I can say, OK, please show me the children and the parents of the term. And often times the parents of the term are important so that I can see how it travels up to the tree... It often tells me whether I am in the right area of the Ontology. So I find the use of OBO-Edit as a browser is very important for me. … Now more recently we are switching to Protégé, because we are going from OBO format to OWL.

The archival data analysis found that the GO term contributors' initials were placed next to the term definitions. A GO Project member explained how these initials were used:

What we are looking for is we want the definitions to have some provenance, where this definition came from, why it was described cause it allows this, or who was responsible. So it's partly for credit. But it's also partly for when these issues come up, when people are trying to improve the Ontology. We can go back to the original source of the definition and see, you know, did someone misinterpret the paper or something like that. So it's very useful to have the original source so that we can go back and see, you know, was it a misuse of the term in the first place or simply was it extracted incorrectly.

The GO Project member revealed that the GO was adopting ORCIDs to replace these initials as a more formal way of attribution and accountability:

We would like to have a more formal way of doing that cause I think it would help people a lot especially, you know, who work on these. And we are looking at one possibility would be ORCIDs. We didn't have those originally. But now that they are around. We are thinking we might switch to that. And actually I should check down the hall cause I don't know. They may have just been implemented in the last week.

4.3.5.3 GO annotation tools. The GO curators mentioned several tools that helped them with the GO annotation, including the annotation quality control checks developed by the GO Consortium, some Web-based software tools supporting the GO annotation (e.g., Blast2GO, Protein2GO, PAINT), and the Evidence Ontology.
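The browsing pattern the curator describes above — look up a term, then walk its is_a parents "up the tree" — and the OBO flat-file format being migrated to OWL can be sketched with a minimal parser over a few hand-written stanzas. The term IDs and the mini-hierarchy below are invented for illustration; real GO files are much larger and richer.

```python
def parse_obo(text):
    """Very small OBO parser: return {id: {"name": ..., "is_a": [parent ids]}}."""
    terms, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line == "[Term]":
            current = {"name": "", "is_a": []}
        elif current is not None and line.startswith("id: "):
            terms[line[4:]] = current
        elif current is not None and line.startswith("name: "):
            current["name"] = line[6:]
        elif current is not None and line.startswith("is_a: "):
            # OBO writes "is_a: <id> ! <name>"; keep only the id.
            current["is_a"].append(line[6:].split(" ! ")[0])
    return terms

OBO = """\
[Term]
id: GO:TOY1
name: glucose metabolism
is_a: GO:TOY2 ! carbohydrate metabolism

[Term]
id: GO:TOY2
name: carbohydrate metabolism
is_a: GO:TOY3 ! metabolism

[Term]
id: GO:TOY3
name: metabolism
"""

terms = parse_obo(OBO)

def path_to_root(term_id):
    """Follow the first is_a parent at each step, browser-style."""
    path = [term_id]
    while terms[path[-1]]["is_a"]:
        path.append(terms[path[-1]]["is_a"][0])
    return path

print([terms[t]["name"] for t in path_to_root("GO:TOY1")])
```

This is only the handful of OBO tags needed for browsing; the full format defines many more (definitions, synonyms, other relationship types), which is part of what motivates the move to OWL tooling.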
Phylogenetic Annotation and Inference Tool (PAINT) is an open-source software application for inferring gene function by homology; it was built on a phylogenetic model instead of BLAST (Gene Ontology Consortium, 2014q). A GO curator explained how PAINT was used for ontology development: "There is a project called PAINT, which is trying to apply all of the GO annotations made across all the 12 model organisms, see how we can propagate some of the annotations to species that there are no databases for."

Initially developed by the GO Project members, the Evidence Ontology (EO) is a controlled vocabulary for describing scientific evidence in biological research, ranging from laboratory experiments to computational methods and manual literature annotation. It allows the GO curators to describe why a GO term was associated with a particular gene or gene product. Most of the EO terms start with 'inferred from' followed by a specific scientific method.

4.3.5.4 GO term enrichment tools. The GO term enrichment tools are either Web-based applications or downloadable software that enable gene set enrichment analysis. The enrichment tools that the interviewees used include DAVID, GOrilla, PANTHER, and WebGestalt. Of these four tools, GOrilla was the most popular among the 15 GO users who were interviewed: six of them indicated they had used GOrilla, and four mentioned DAVID. As part of the GO Reference Genome Project, PANTHER is maintained with the most up-to-date version of GO's annotation data, and is the only enrichment tool accessible on GO's front page (Gene Ontology Consortium, 2014j; PANTHER, 2014). A GO Project member discussed the current GO term enrichment tools as follows:

There are so many enrichment tools. And so we have originally for the first 10, 12, or longer years we didn't want to offend anybody. So we didn't build the enrichment tools. We just made the Ontologies available and let other groups build enrichment tools. We just recently or maybe as of today are introducing an interface to an enrichment tool [PANTHER] cause there is so much demand and there is just too many of them.

4.3.5.5 Communication tools. Communication tools are those that facilitate communication within the GO Consortium and between the GO curators and users. The interview data imply five types of communication tools: the GO Helpdesk, GO's request trackers at SourceForge, the GO mailing lists, the GO Wiki, and various GO meetings for different groups. The GO Helpdesk (http://geneontology.org/form/contact-go) is an interface on the GO Website for anyone to submit questions, comments, or feedback to the GO. There is a roster of GO curators who rotate to serve at the Helpdesk for a week.
They are expected to answer questions in a prompt manner. A GO curator indicated, "We try to get all the help inquiries answered within 24 hours." As mentioned above, the GO has created several data-related and software-related request trackers at SourceForge to allow any individual to submit requests for a new term or definition, reorganize a section of the Ontology, and correct errors in the GO annotations (Gene Ontology, 2014; Gene Ontology Consortium, 2006, 2007). Similar to the GO Helpdesk, there is a roster of GO curators who rotate to review requests submitted to the request trackers. A GO curator explained the difference between the GO Helpdesk and the request trackers in terms of the questions received:

GO Help Desk is very obvious that if it's your week, hopefully, you've done everything. For everything that has happened in your week, you've addressed in some way by the end of your week, because then the next poor person has to take it over. And they don't like that. It's just assumed you get to those very quickly. SourceForge requests, however, are different, because a lot of things that could come into SourceForge aren't just simple things. So, for example, somebody would just say, I think that this process should be split into cellular and multi-organismal process. That kind of request is usually… This is gonna to be an entirely whole project. And so that's not gonna get done overnight.

The GO has three mailing lists available on the GO Website, where people post messages and follow the development of the Ontology and activities within the GO Consortium. One of these mailing lists is restricted to the GO Consortium members, and the other two are open to the public. The GO Consortium uses a Wiki (http://wiki.geneontology.org/index.php/Main_Page) for members to store, edit, and have access to GO's documentation, such as meeting notes, policies, guidelines, and instructions. There are various regular meetings for different groups within the GO Consortium: the GO editors of the Editorial Office at EBI have a meeting every week (the Cambridge GO Editors meeting); the GO editors also have a weekly meeting; the GO curators from different databases have an annotation meeting every two weeks; the GO managers have a meeting every other week; and the PIs of the GO Project have a meeting every week. The GO Consortium as a whole meets twice a year. A GO curator specified how these meetings were structured:

Every week the editors would go over anything. And some of these questions zipping on and on and on. But we still, you know, mull it around some more. There is an annotation meeting I think every two weeks. And it's not all the annotators, but some representative of each of the databases tries to go to those meetings. … There is a separate call for the PAINT project. Then GO managers have a call every other week. And then there is something called GO talks, which are the PIs. They have something once a week. So it's structured.

Since most of these meetings involve people from different groups at various locations, net meetings and conference calls were often held instead of face-to-face gatherings. A GO curator described the online conferencing applications that were used in different groups:

We use GoToMeeting a lot, just like WebEx, only it is cheaper. People use WebEx. It's the Stanford people use WebEx cause Stanford gets a deal, because WebEx was made there or something like that. We don't. So we use GoToMeeting. And that way we can share our screens to the group, which is an important aspect of doing things.

4.3.5.6 Others. One interesting finding is that some of the interviewees, who had computer science or bioinformatics backgrounds, developed their own scripts to access and process the Ontology data. A GO user mentioned, "Mostly I sort of build my own tools." Another GO user explained that the reason for developing his own scripts was a lack of available tools to accomplish his research purpose:

I've written a lot of my own scripts to process the Gene Ontology. I find that a lot of the scripts out there, a lot of the tools are sort of… They're not very good actually. Most of the tools are again related to term enrichments, which is not what I'm doing. And there is sort of a lack of tools for visualizing ontologies. There are some tools but they're horrible... Some of these tools they use like old-fashioned visualization technologies. They were meant more for visualizing trees I think, and maybe not. But it's just that what's produced looks really old-fashioned, just like the look-and-feel. More importantly, these tools have a hard time of visualizing the entire Gene Ontology. When you use it, you can use it only visualize like a local part, like this term and maybe its neighbors, like two edges or so removed. But it's really hard to visualize the entire ontology simultaneously.
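A "build my own tools" script of the kind these users describe might, for instance, compute the depth of every term so that an entire ontology can be laid out level by level rather than one local neighborhood at a time. The mini-ontology below is invented; a real script would read the edges from a GO download.

```python
from collections import deque

# Invented is_a edges: child -> parents.
PARENTS = {
    "root": [],
    "t1": ["root"],
    "t2": ["root"],
    "t3": ["t1", "t2"],   # the GO is a DAG: a term may have several parents
    "t4": ["t3"],
}

def depths(parents):
    """Shortest distance from the root to every term (BFS over child links)."""
    children = {t: [] for t in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    dist, queue = {"root": 0}, deque(["root"])
    while queue:
        t = queue.popleft()
        for c in children[t]:
            if c not in dist:
                dist[c] = dist[t] + 1
                queue.append(c)
    return dist

print(depths(PARENTS))
```

Assigning each term to a level this way is one simple basis for a whole-ontology layout (e.g., concentric rings by depth), which is exactly the kind of global view the interviewee found missing from existing tools.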

4.3.6 Types and Sources of Data Quality Problems and Corresponding Assurance Actions

The archival data analysis was restricted to one online fieldsite, GO's Ontology Request Tracker. The data quality problems identified in Section 4.1.4 were related to the Ontology, that is, the controlled vocabularies. Since the GO database contains not only the controlled vocabularies but also the annotation data, additional data needed to be collected to identify the quality problems of GO's annotation data. The semi-structured interviews complemented the archival data analysis with a typology of quality problems of the Ontology as well as of GO's annotation data (see Table 4.3).

4.3.6.1 Ambiguous GO terms. Ambiguous GO terms refer to those that are difficult to comprehend or to distinguish from other terms. A user expressed his confusion about the difference between the molecular function GO terms and the biological process GO terms: "There is vagueness in the description of its function. … This is function and this is a process. Yes. And function and process, what is the difference?" The GO curators were aware of this ambiguity problem and resolved it by adding 'activity' to the end of the molecular function terms to tell them apart from the biological process terms:

In the GO you will see that all of the function terms have the term 'activity' stuck to the end of it. That was made for two things. One of which was to prevent, to make it obvious when something was going into the Ontology that was a gene name or protein name. And then the other thing is to distinguish that reaction [function] from an overall process.

Table 4.3: Types and sources of data quality problems and corresponding assurance actions

1. Ambiguous GO terms (Ambiguity). Assurance actions: add 'activity' to the molecular function GO terms.

2. Inaccurate placement of the GO terms (Inaccuracy). Assurance actions: cross-reference to other ontologies.

3. Lack of specificity in the GO terms (Incompleteness). Assurance actions: include contextual information in the annotation extension (column 16); include the proteoform in column 17.

4. Incomplete GO terms (Incompleteness). Sources: new knowledge or discoveries; lack of experts in specific biological domains. Assurance actions: bring in new biological communities; have different scientific communities work on different parts of the Ontology; submit new terms on GO's Ontology Requests Tracker at SourceForge; add new GO terms via TermGenie.

5. Invalid GO terms (Inaccuracy). Assurance actions: obsolete …; suggest ….

6. Redundant GO terms (Redundancy). Sources: inconsistent naming of molecular entities. Assurance actions: align the GO with ChEBI.

7. Incomplete GO annotations (Incompleteness). Sources: variance in gene names; lack of curators; lack of experts in specific biological domains. Assurance actions: report omissions in annotations on GO's Annotation Issues Tracker at SourceForge; bring in new biological communities; the PAINT Project.

8. Inconsistent GO annotations (Inconsistency). Sources: variance in the GO annotation policies; variance in annotation conventions.

9. Misannotations (Inaccuracy). Sources: new knowledge or discoveries; annotations based on sequence similarity; automatic GO annotations; curator errors; literature errors; quality problems with the experimental data. Assurance actions: implement the GO annotation quality control checks; include contextual information in the annotation extension (column 16); include the proteoform in column 17; report misannotations on GO's Annotation Issues Tracker at SourceForge; predict GO misannotations.

10. Annotation imbalance (Incompleteness). Sources: curator bias; literature bias. Assurance actions: complement the GO with high-throughput data.

11. Inaccurate GO term enrichment (Inaccuracy). Sources: software errors. Assurance actions: maintain up-to-date with the GO.
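The alignment of the GO with ChEBI listed as an assurance action in Table 4.3 amounts to a consistency check: if glucose is_a carbohydrate in ChEBI, then 'glucose metabolism' should have 'carbohydrate metabolism' among its ancestors in the GO. A minimal sketch, with all edges invented and a GO link deliberately left out so the check fires:

```python
# Invented is_a edges; a real check would load ChEBI and the GO.
CHEBI_PARENTS = {"glucose": ["carbohydrate"], "carbohydrate": []}
GO_PARENTS = {
    "glucose metabolism": [],            # missing link, on purpose
    "carbohydrate metabolism": ["metabolism"],
    "metabolism": [],
}

def ancestors(term, parents):
    """All is_a ancestors of a term (not including the term itself)."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def missing_links():
    """For each ChEBI edge chemical -> parent, require the corresponding
    '<chemical> metabolism' -> '<parent> metabolism' path in the GO."""
    problems = []
    for chem, chem_parents in CHEBI_PARENTS.items():
        for parent in chem_parents:
            go_child, go_parent = f"{chem} metabolism", f"{parent} metabolism"
            if go_child in GO_PARENTS and go_parent not in ancestors(go_child, GO_PARENTS):
                problems.append((go_child, go_parent))
    return problems

print(missing_links())   # flags the deliberately missing GO link
```

The naming convention used to pair ChEBI chemicals with GO metabolism terms here is a simplification for the sketch; in practice the pairing is maintained through explicit cross-references rather than string patterns.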

4.3.6.2 Inaccurate placement of GO terms in the Ontology. A GO Project member pointed out the difficulty of correctly placing the GO terms in the Ontology: "Sometimes there are questions about where it should go in the Ontology, you know. Do we need new parents or, you know, what's the best placement?" The GO cross-references ChEBI, making use of the structural representation of similar entities in ChEBI to determine the placement of GO terms or to detect missing links between the GO terms. A GO curator explained how to ensure the accurate placement of GO terms in the Ontology by cross-referencing ChEBI:

So the thing that we wanted to do is align the ChEBI Ontology with the GO Ontology. And why this's important? Well, depending upon the relationship, if you say this gene is involved in glucose metabolism, OK? So the [GO] term 'glucose metabolism'… So glucose is a carbohydrate in ChEBI. Therefore, if I annotate something to 'glucose metabolism,' I should also automatically be annotating it to its parent 'carbohydrate metabolism.' And if there is no link between glucose and carbohydrate metabolism [in the GO], it's not aligning with ChEBI. So we go in and we make sure that there is a link, 'glucose metabolism' is a 'carbohydrate metabolism,' just like glucose is a carbohydrate.

4.3.6.3 Lack of specificity in the GO terms. Lack of specificity refers to cases where the granularity of the GO terms cannot meet the requirements of the activities in which they are used. A GO user indicated that some of the GO terms were not specific enough for him to learn about the maize genes:

For maize genes, no, they are not in details. So I don't know how they organize the GO. I know a little bit. Of course, it's better to have very detailed terms. But sometimes for many maize genes, they just have a very brief or very general terms. So that's a question about that.

Another GO user pointed out the same data quality problem with the GO terms:

I think when I used the [GO] Website, it only will list very general terms, like cell motility or metabolism, or something like that. It's really general term.
I think it would be much better if we can find more specific information, or the Ontology, every term of it should be divided into many subcategories… Yeah, I think I had the impression that those are not that useful because that the terms it uses too general from my memory. So I think it will be much better if it’s more specific.

Similar to what was found in the archival data analysis, a GO curator stated that annotation extensions allowed the GO terms to be further specified with a context and more details, such as the gene products participating in a biological process and tissues or cell types in which the genes were expressed:

Now we have what we call an annotation extension, so-called column 16 if you read it in the documentation. And this actually what it does is supply a context for the annotation. So, for example, you might find that gene X negatively regulates a process. And then you see another annotation that says gene X positively regulates a process. How can that be? And you look at the annotation extension. And ah! And one instance the experiment was done with liver cells. And in another case, the experiment was done in muscle cells. And they get two different results, somewhat opposing. But it’s again most of these are like regulatory proteins. It turned on a promoter, off a promoter, and depending upon what genes were being expressed in that tissue. You see a different outcome. Now with the annotation extension, we can track the tissue type that the experiment was done in, the cell type, the target. ... And then based on the things that you say in column 16 can do checks to make sure that what you are saying is logical outside of the whole thing.

In addition to supplementing the GO terms with specificity, the GO curator indicated that annotation extensions could also be used to assess the quality of GO annotations and detect misannotations. Besides column 16 for annotation extension, column 17 was in use for proteoform, designating all the different forms of proteins arising from a single gene:

That brings in an additional column, which is column 17, which is the proteoform. It’s called isoform in AmiGO. But actually it’s a proteoform. Quite often, you will see that a protein X resides in the nucleus, unless it’s phosphorylated. And that it can’t get into the nucleus any more. So it’s cytoplasmic. And the only difference is the same protein, but it has a phosphate group on it instead now. And what happens is in column 17 we can add IDs where that gives the specific proteoform that is being talked about.
Both forms are expressed by the same gene, but one has something else done to it after the fact. And so we can add those. 4.3.6.4 Incomplete GO terms. Similar to what was identified from the archival data, the interviewees also mentioned the data quality problem of incomplete GO terms. A GO curator admitted that, “GO is in progress because of new knowledge.” To keep up-to-date with new knowledge, the GO curator explained that the Consortium had created the Ontology Requests Tracker at SourceForge for new term submissions and had been bringing in different biological communities to work on specific parts of the Ontology:

We have an Ontology Request Tracker at SourceForge. There are new terms need to be added to the Ontology. ... We have working groups from different scientific communities to help us work on different parts of the Ontology.

A model organism database curator, who has used GO’s Ontology Requests Tracker, deemed it an invaluable tool for ontology development:

I mean it’s an invaluable tool, because otherwise you just kind of don’t know where things… You know, it’s kind of hard to track stuff. So having a ticket tracker and a way of communicating with the people who ask for the terms it’s really... I don’t see how people could do it otherwise, cause there is a lot of information that they are curating. And I think not having a tracker would make it nearly impossible to do.

A GO Project member, who participated in developing TermGenie, illustrated how TermGenie and its templates could expedite and standardize the process of adding new GO terms:

TermGenie’s been the best thing that happened though...given the fact that before was simply a manual… Here is the list. Here is the request. And if you were a curator sitting there, trying to capture the information from a given paper, and suddenly you come across a term, a class that’s not in the Ontology, it would take weeks before. You have to put the request on SourceForge, wait until somebody that edit it, wait till that was committed, and then get it back again. And it could be weeks. We have a couple of cases that were unsettling that were years for some bio-terms. And it’s just that and then going back and getting your head mentally back into that paper and having this started, you know, getting it all into your brain again. It really slowed things down. TermGenie isn’t perfect yet. We are working on and improving it. But the fact that you can request a term and you would immediately have it. You could just go on with your curating without any hesitation. It has been a tremendous aid.
There is a lot of things where, you know, we are using templates because basically what we are using TermGenie for is for doing these combinations like is this plus that. And if the template isn’t there, then we’ve been really hesitant to make it completely free form. But we’re going more and more that way, well, because you don’t want the Ontology to just grow without any control… A GO curator stated that having people to use the Ontology and then provide feedback to fix it was an efficient way to overcome the incompleteness:

One of the things that I think set the GO apart was that, you know, they didn’t wait until it was perfect and all-inclusive. They got it out there so that people could start using it. And as people use it, we get feedback and fix it. And then the fixes generate things for people to use, and so on.

4.3.6.5 Invalid GO terms. Similar to the invalid GO terms found in the archival data, a GO curator revealed that a majority of the invalid GO terms were protein names mistakenly added to the Ontology as molecular function terms, which should be obsoleted:

A majority of them fall into a case where somebody made the term whose activity is already in the GO, but they named it because it’s the name of a protein. And so it doesn’t really differ from the actual activity. It’s just the protein that carries it out. And that’s an annotation. That’s not an ontology term. And so we might obsolete that term. And we make a suggestion that you should use this term and that term to make the annotation. So again the majority of those are things that turn out to be just protein names, and not function names.

4.3.6.6 Redundant GO terms. Similarly, the interviewees pointed out the data quality problem of redundant GO terms. A GO curator gave an example of the biological process ‘nucleic acid metabolism,’ which was represented by various terms in the GO due to the inconsistent naming of molecules:

A lot of our nucleic acid metabolism stuff was really screwed up. And so we use ChEBI to sort that out, and mostly was because you had people would use a name that actually mean the same thing. So in the nucleotide, you have your base plus sugar. And that is usually a side. Adenine is the base. Adenosine is a nucleoside. That is base plus sugar. Now you put a phosphate on it. And now is a nucleotide. So the nucleotide has the phosphate, the sugar, and the base. And we were seeing in the GO that there were like duplicate terms, which meant the same thing.
So somebody would say, for example, adenine nucleotide metabolism. OK? And then you saw adenosine nucleotide metabolism. Wait a minute. That doesn’t make sense. Or they would say, adenosine phosphate metabolism...And so we use ChEBI to make sure everything makes sense.

As explained by the GO curator, the GO standardizes the molecule names in the GO terms by aligning with ChEBI. A user also mentioned that he found redundant terms in the GO and developed some scripts to remove them:

I try to remove redundant terms, which don’t add to the topology. So, for example, there may be two terms. One is a parent of the other. And they have the exact same set of genes. And this parent, the parent term has no other children, right? So it’s just the string of two terms that are exactly the same. They may have actually different names and different biological descriptions. But inherently besides some curator having given, stating a difference, there is no known biological difference, right? If there were, you should see that in different genes, maybe more genes in the parent, or maybe the parent having other children. But insofar as the topology, in what they actually put in the Gene Ontology, there is no difference. So what I do is I remove those terms basically.

4.3.6.7 Incomplete GO annotations. Incomplete GO annotations refer to cases where no or not enough GO terms have been assigned to specific genes or gene products, especially those that have been physically characterized by laboratory experiments. A GO user pointed out that some of the genes that he was interested in had no annotations or their annotations had not been updated with new discoveries or knowledge: “Numerous genes that have no annotations at all, or they have been annotated but it’s not been updated in central databases, especially with long, long coding RNAs and things like that.” Another user indicated, “I can’t give an exact percentage of that.
But I guess it’s at least 70 or 80 percent of genes don’t have GO, characterized GO terms.”

One of the GO users assumed that incomplete GO annotations might have occurred because the gene names that he inputted were different from those in the GO term enrichment tool:

The most trouble I have is some genes that I input not being recognized or not found by the Ontology program we are using, whether that’s due to a shorten version of the gene name or just a different input name of the gene and what they have as their specific in their Ontology, cause, I guess, there are many different names for some genes.

To resolve this data quality problem, the GO has created an Annotation Issues Tracker at SourceForge to allow any individual to report omissions in the GO annotations. A GO curator, who was also a model organism database curator, explained that the main reason for incomplete GO annotations was a lack of curators to deal with a huge number of papers:

And the Annotation Request is usually like…some authors very upset that we haven’t mention his paper. I don’t know. You know, it is like they wanna know why. It’s supposed because, you know, we got 400 papers a week...We have 30 curators. You can do the math.

As indicated by another GO curator, incomplete annotations may be due to a lack of biological experts in specific areas in the GO Consortium. The GO is bringing in new communities (e.g., Synapse) to work on specific parts of the Ontology. GO’s PAINT Project, which is trying to propagate some of the existing GO annotations to species without annotations based on homology, could help alleviate the problem of incompleteness.

4.3.6.8 Inconsistent GO annotations. Due to the variance in local annotation policies, there were cases where annotations to the same genes or gene products differed among the GO Consortium member groups, leading to inconsistent annotations. A GO curator, who was also a model organism database [X] curator, revealed:

Some of the databases will differ in their use of certain evidence code called ‘inferred from expression,’ IEP. The X Database has decided not to use that evidence. We don’t feel that if something is expressed in the liver that it necessarily has something to do with liver function. I mean, you know, this is just we differ. And so you will see IEP annotations in QuickGO. You won’t see them in X. So that’s just, you know, policy decision. You know, we try for annotation consistency. But at the end of the day the annotations are up to the individual database. The Consortium supplies the Ontology to be used.

There were also cases where inconsistent annotations occurred due to the variance in annotation conventions among different Consortium member groups. The GO curator gave EBI as an example:

EBI has a different philosophy in annotating. Sometimes you will find discrepancies. And that’s mostly because they are protein-based. And if they’re taking annotations from X, for example, they will then apply those annotations to every protein associated with the gene, UniProt ID. So they ended up having more. But it’s really to the same gene.
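The policy difference the curator describes is mechanically checkable, because the evidence code occupies a fixed position (the seventh column) in the tab-delimited GAF annotation files the Consortium distributes. The following is a minimal sketch of filtering out IEP annotations; the sample rows are hypothetical, not taken from any real GAF release:

```python
# Sketch: filtering GO annotations in GAF 2.x format by evidence code.
# Illustrates the kind of local policy described above, where a database
# excludes IEP ("inferred from expression pattern") annotations.
# The sample rows below are hypothetical.

def parse_gaf(lines):
    """Yield GAF rows as lists of tab-separated columns, skipping '!' comment lines."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")

def drop_evidence(rows, excluded=frozenset({"IEP"})):
    """Keep only annotations whose evidence code (column 7) is not excluded."""
    return [row for row in rows if row[6] not in excluded]

gaf = [
    "!gaf-version: 2.1",
    "MGI\tMGI:97490\tPax6\t\tGO:0007601\tPMID:1\tIMP\t\tP\t\t\tgene\ttaxon:10090\t20140101\tMGI",
    "MGI\tMGI:97490\tPax6\t\tGO:0030154\tPMID:2\tIEP\t\tP\t\t\tgene\ttaxon:10090\t20140101\tMGI",
]

kept = drop_evidence(parse_gaf(gaf))
print(len(kept))  # 1 -- the IEP annotation is dropped
```

A database applying a policy like the one quoted would run such a filter at load time, which is why the same gene can show IEP annotations in QuickGO but not in the member database.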
4.3.6.9 Misannotations. Misannotations refer to cases where the GO terms were inaccurately associated with specific genes or gene products. The interview data imply that this data quality problem was complex, and might be caused by six different sources. A GO user stated that misannotations were due to new knowledge or discoveries that changed the context of data interpretation:

The GO was used to describe the genes with same functions from different species. Somehow, mismatch happen. It is due to the biological knowledge and new discovery.

Interestingly, a GO user pointed out that since the GO is manually curated, either low-quality scientific papers or curators who did not interpret papers accurately might lead to misannotations:

The problem with the Gene Ontology in being manually curated is that you don’t have any measurements like that. Verifying quality in the Gene Ontology was done in the actual construction in that the curators figure it out, like they look up some references, some papers, and then they decided to update the Gene Ontology in this particular way. And they may document that. They may document what papers are related, and like what was the type of evidence. ... You could have a very well documented ontology in a sense that it’s listed what all the papers that were used to generate it. But it could be a very bad one because maybe, for example, those papers that were used were all very bad. They were just bad papers. Or they were not interpreted correctly for constructing the Ontology.

Another GO user, who was an experimental biologist, emphasized that most of the misannotations were caused by the quality problems of experimental data:

Most of the problems I think are from [experimental] data, not from the Gene Ontology I think...Well, I think most of data quality are controlled not by Gene Ontology, [but] by ourselves, by experiments.

The GO database curates not only manual literature annotations but also annotations inferred from different evidence, such as sequence similarity and electronic annotations. A GO user described a misannotation that he encountered in the GO, which was based on sequence similarity:

So one that I encountered was we studied proteins that bind to the end of the chromosome called telomeres. And these are telomere-binding proteins. ... we were trying to identify these in maize.
And so we used sequence analysis to clone according to sequence. And then we used some DNA hybridization techniques to get similar sequences. We isolated a protein that if you aligned it on a computer to known databases, it would be identified as a Myb-like protein. ... When you do a blast alignment or look up the matching sequences, it had a terminal domain called the myb-like domain or the telobox, telomere-binding type domain. So it turned out that this domain had sequence similarity, but it had a different chemistry and it did not bind DNA. We could not get it to bind DNA in a test to... It turns out that it had the wrong chemistry, the wrong charge. And so it was a case a motif that was misidentified. And this is now called SANT domain, S-A-N-T. So there was an example of the Myb domain and the SANT domain looked alike. They aligned to each other on the computer, but they had different chemistries. And so one binds DNA, and one probably binds protein. They may both function in a nucleus. But they were merged into one group incorrectly. And so we named the protein Terminal acidic SANT domain. So Myb domain were basic proteins, and they bind to DNA, which was an acid. So our test, our Myb-like domain actually was acidic. So that is acidic protein would not bind to an acidic DNA molecule. And so we named it Terminal acidic SANT domain 1, T-A-C-S-1. That is a case where all the proteins that have a Tacs1 domain are probably labeled in GO databases as Myb-like DNA binding. But there would be no evidence for that. It would be just by sequence similarity.

A GO curator specified that some of the misannotations that he often saw on GO’s Annotation Issues Tracker were domain annotations generated automatically:

So I was talking about this InterPro, domain annotations. It’s done automatically. And it’s based on the flat file that maps GO terms to domains. And every once in a while somebody spots something that doesn’t make sense. That means it has to be removed because it’s no longer true 100% of the time. So you will see things like that on the Annotation Tracker quite a lot.

To resolve the data quality problem of misannotations, as mentioned above, the GO has developed and implemented annotation quality control checks to correct misannotations in the GO. GO’s Annotation Issues Tracker at SourceForge provides a virtual place where users could report errors in annotations.
Including more details in the annotation extension (column 16) and column 17 may help detect misannotations. A GO user, who has a computer science background, used a machine learning approach to identify misannotations in the GO:

So we wrote a paper that looked at GO annotations using machine learning approaches to be able to identify potential GO terms that were misannotated in the database to be able to identify an example set that… Some kinase proteins in the mouse genome that were misannotated. And those annotations were propagated to other species, for example, rat. The rat database used the mouse annotations. And then the rat database was misannotated. So we used some machine learning approaches, mainly Naïve Bayes-based approaches to be able to predict these.

4.3.6.10 Annotation imbalance. Annotation imbalance refers to cases where some of the terms in the Ontology are associated with many more genes or gene products than the others. In other words, part of the Ontology was heavily annotated, and part of it was barely used. Interestingly, a GO user described the Gene Ontology as being lopsided:

So a lot of the problems that we run into are Gene Ontology terms that are too specific. A term will have one gene. So a lot of those kind of we run into problems with computational approaches things like that, so reducing the complexity of the Gene Ontology and also balancing the Gene Ontology. So a lot of the terms having are sort of lopsided. And that they don’t have very many genes, and the others have a lot of genes.

He attributed this problem to literature bias, indicating that the GO was inclined to characterize those well-studied genes but left out those that were not well studied:

So the point of building the ontology from data is to have it, to get rid of sort of literature bias. So if you start from high-throughput data, then every sort of gene is kind of starts at a level playing field, whereas in a lot of cases you can get sort of literature bias for well-studied genes... But a lot of what we do in the lab revolves around high-throughput experiments. So that even genes that aren’t well studied have equal way to well-studied genes.

Another GO user pointed out the same quality problem and attributed it to curator bias:

But I think more generally the fact that it’s manually curated just creates a general suspicion that there’s maybe certain biases. Certain areas of the Gene Ontology may be more covered than others just because of manpower, more curation time.
A GO Project member discussed the same data quality problem and indicated that those heavily annotated GO terms might need to be split:

For the Ontology itself, we’ve looked at information content. You know, it’s just pretty simple, which is just, you know, if thousands of things are annotated to a term, then there’s not much information to be gained there. It’s just like you are one of many. It’s just like so what. It doesn't distinguish anything if there is... So we’ve looked at that because it’s interesting to look at terms that are leaf terms that have never had anything annotated to them. It’s like why was that leaf term added if nobody was going to use it. So look at things like that. We would also look for pile-ups where it’s a leaf term and it’s got tons of things annotated to it because then it’s a leaf with low information content. And we don't expect that. So we sort of look at the two extremes whether there is zero information that’s never been used or there is, you know, it’s a term that has never been used or it’s a term that has been used too much, which case might indicate that maybe it needs to be broken out.

4.3.6.11 Inaccurate GO term enrichment. Inaccurate GO term enrichment refers to cases where the results of GO term enrichment analysis are inaccurate. A GO Project member pointed out that some of the GO term enrichment tools did not keep up-to-date with the GO database, and thus produced inaccurate results. She used DAVID, a GO term enrichment analysis tool, as an example:

DAVID, for example, is not what we would recommend because it’s very out of date. Part of the problem that we run into from the GO perspective is that people don’t update the data. So there will be changes to the Ontology and changes to the annotation. And enrichment isn’t going to work if those aren’t current or will, you know, skew the answers. And that is one of the reasons. DAVID is like at least 3 or 4 years out of date. And that is pretty far.
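The information-content measure the Project member refers to is conventionally computed as the negative log of a term’s annotation frequency: a term annotated thousands of times scores near zero, while a term never used has no defined score. A minimal sketch, using made-up annotation counts:

```python
import math
from collections import Counter

# Sketch: the "information content" check described above. A term that
# accounts for a large share of all annotations carries little
# information; an unused leaf term carries none. Counts are hypothetical.

annotations = [
    ("geneA", "GO:0008150"), ("geneB", "GO:0008150"), ("geneC", "GO:0008150"),
    ("geneA", "GO:0042555"),
]

counts = Counter(term for _, term in annotations)
total = sum(counts.values())

def information_content(term):
    """IC = -log2(p), where p is the term's share of all annotations."""
    n = counts.get(term, 0)
    return -math.log2(n / total) if n else None  # None: term never used

print(information_content("GO:0008150"))  # heavily used -> low IC
print(information_content("GO:0042555"))  # rarely used  -> high IC
```

The two extremes the interviewee describes correspond to `None` (a leaf term nobody has used) and an IC near zero (a pile-up term that may need to be broken out).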

4.3.7 Rules

The interview data suggest a typology of rules regulating the activities of GO development, GO annotation, GO communication, and GO use. Some of these rules are similar to those identified from the archival data. However, the interview data provide some in-depth insight into why those rules were established, from an insider’s perspective.

4.3.7.1 GO development. The archival data analysis identified a rule regulating new GO term creation: the GO curators preferred to use TermGenie. A GO curator explained why they favored TermGenie and also pointed out that only the GO curators could have access to it:

There is also another set of things called TermGenies. And these are for a subset of people to make our life easier. So I love it because I can just go in and add terms, boom, boom, boom, boom, without having to get the Ontology downloaded, open it up, deal with all sorts of things. It’s just a simple term request. Those are fast. But not everybody is allowed to do that or else there would be a lot of, maybe, silly terms going in or inappropriate terms. So most of the curators, the Ontology curators, have access to TermGenies.

A GO Project member, who was involved in developing TermGenie, elaborated on the process of adding new GO terms via TermGenie:

It’s added to the Ontology immediately, but its flag is tentative. And then, you know, one of those admin people assigns that to somebody. Most of the time they just go through and it’s like approved. Your identifier though was assigned at the very beginning, so you don't have to change anything. If somebody says, hey, that’s already in there. We are just going to add this synonym. Then the thing that got added is that identifier is sort of deprecated. And it’s like use this instead. And so we would automatically say, oh, you know, that’s a deprecated one. Here is the one that should be used instead.

The archival data analysis also identified a rule prohibiting true path violations. A GO curator provided more details about this rule and illustrated the difference between ‘part_of’ and ‘has_part’:

In terms of making the Ontology itself, the biggest thing is we have to make sure that there is no true path violation. And what this essentially means is that so typically if A is a B, and B is a C, then A is a C. But you can’t make something that breaks this rule. … Hand is part of arm. Obviously, arm is not part of hand. When we say ‘part of’ as a rule, we mean that every time you see a hand, it is part of an arm. A hand does not exist in a living organism without being attached to an arm. So that’s how ‘part_of’ works. But if you have something that can be used by two or more different things, then you can’t say it’s part of both things because it can’t be a part of simultaneously of two different things. But what you can say is that in order for one thing to exist, it has to ‘have_part.’ For example, a wheel is not always part of a car. Wheel can be part of a bicycle. But car can ‘has_part’ wheel. So that every incidence of car has to have wheel.
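The true path rule amounts to requiring that an annotation to a term also hold for every ancestor reachable over is_a/part_of links, which is the same transitive reasoning behind the glucose/carbohydrate alignment with ChEBI described earlier in this chapter. A minimal sketch of computing those ancestors over a toy hierarchy (the edges below are illustrative, not taken from the GO files):

```python
# Sketch: the "true path" rule the curator describes. If a gene is
# annotated to a term, the annotation must also hold for every ancestor
# reachable over is_a/part_of edges. The toy edges below mirror the
# glucose/carbohydrate example discussed earlier in this section.

IS_A = {
    "glucose metabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
}

def ancestors(term):
    """All terms reachable upward from `term` (the transitive closure)."""
    seen = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("glucose metabolism")))
# ['carbohydrate metabolism', 'metabolism']
```

A true path violation would be an edge whose removal makes some annotation stop being true of an ancestor; checking the closure after every edit is how editors keep the constraint from being broken silently.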
In addition to differentiating ‘has_part’ from ‘part_of,’ the GO curator described the difference between ‘part_of’ and ‘regulates’ by explaining the rule of using ‘regulates’:

So we have a ‘regulates’ relationship. And this was kind of a new thing. And so you can have something that regulates a process without being part of the process. And that was the whole purpose for doing that, because if you talk about enzymes that make a that then feed back on itself to shut the enzyme down, then that regulation is part of the process because, it’s actually, you know, part of the reaction. But if you have something from the outside, some other protein comes in, and says adds a phosphate group to the enzyme, and the enzyme shuts down. That’s not part of the process of whatever pathway. But it regulates, and in fact in that case it’s negatively regulating the process. And these usually don't, can’t travel back up a tree. So, you know, if you find something, A regulates B, you can’t say B regulates C. So it is not directional.

A GO editor specified a rule regulating the activity of GO term obsoletion. He indicated that before obsoleting any GO terms, the GO editor needed to contact curators who had used those terms to fix their annotations:

So as an ontology editor, sometimes I am asked to obsolete the term, not delete it. It means obsolete. So that means that I need to first find out who is using that term. I don’t have time to email everybody. But I can go out, and I can use AmiGO to see all of the annotations that have been submitted, that used the term, and who submitted them. Then I can say, OK, I am going to obsolete this term, and have this many people use it. These people need to or these groups need to fix their annotations, or they are going to have annotations to obsolete terms.

4.3.7.2 GO annotation. A GO curator, who was also a model organism database curator, described an implicit rule regulating the activity of GO annotation:

We’re not reviewing the paper. We are not because actually we are very careful to say when we make a GO annotation. When you make a GO annotation, you know, you give a reference, right? And that saying so and so said this. I am not saying this. So and so said it. And that’s important because we don’t want to act like this going over the heads of papers that have already been reviewed. Maybe you might think the data is crappy. You could choose not to annotate it, or just say, well, he said it.

As mentioned in the archival data analysis, the GO has established a set of annotation rules.
The GO curator gave an example of those annotation rules: So here for example, there is a term ‘protein binding.’ And you can’t just say protein binding without saying binds what, right? So protein binding, and you have a specific evidence code ‘IPI’ for protein interaction. And you have to say, have to, have to, have to say what the protein ID it’s bound to. So there is a rule. You break the rule, we come get you.
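Rules like the one quoted can be enforced by automated checks over the annotation files. The sketch below encodes the protein-binding check under the assumption that annotations are GAF-style rows (GO ID in column 5, evidence code in column 7, With/From in column 8); the rows themselves are hypothetical:

```python
# Sketch: the annotation rule quoted above. An annotation to
# 'protein binding' (GO:0005515) made with the IPI evidence code must
# name its interaction partner in the GAF With/From column (column 8).
# Real checks run against full GAF files; these rows are made up.

PROTEIN_BINDING = "GO:0005515"

def violates_binding_rule(row):
    """row: list of GAF columns. Returns True if the rule is broken."""
    go_id, evidence, with_from = row[4], row[6], row[7]
    return go_id == PROTEIN_BINDING and evidence == "IPI" and not with_from.strip()

good = ["DB", "P1", "sym", "", PROTEIN_BINDING, "PMID:1", "IPI", "UniProtKB:P12345", "F"]
bad  = ["DB", "P2", "sym", "", PROTEIN_BINDING, "PMID:2", "IPI", "", "F"]

print(violates_binding_rule(good), violates_binding_rule(bad))  # False True
```

Running such a predicate over every submitted row is what makes the curator’s “you break the rule, we come get you” enforceable in practice.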

Interestingly, there is a subset of GO terms that cannot be used for manual annotations, but may be used for mapping to external ontologies to generate computational annotations. A GO curator brought up this annotation rule during the interview:

One of the subsets we have for GO terms is "do_not_manually_annotate," which comprise the list of GO terms that should not be used for manual annotation as they are too high level to be useful, e.g., response to stress.

4.3.7.3 GO communication. The archival data analysis found that when editors could not resolve the requests, they would bring them up in meetings to discuss with other GO curators. A GO curator disclosed a relevant rule, confirming the finding gained from the archival data analysis:

And in general we have a rule of something, for example, [if a request] takes more than 10 emails back and forth, this might go on the agenda for the next GO Consortium Meeting, where we meet face-to-face.

4.3.7.4 GO use. Several users mentioned during interviews that if they used the GO or tools built on the GO for their data analysis, they would cite either the GO or the associated publications. A GO user specified how he credited a GO term enrichment tool:

If we use GOrilla, we cite the associated publication for GOrilla. And if we did change any of the parameters, we would state we used, you know, GOrilla with the following modifications, and would say what parameters we changed too. Otherwise we would say we used GOrilla with the default settings.

4.3.8 Data Quality Criteria for the GO

Since most of the participants in the online fieldsite (GO’s Ontology Requests Tracker) were either the GO curators or model organism database curators, users’ data quality requirements for bio-ontologies are missing in the archival data. One of the interview questions concerned users’ quality perceptions of bio-ontologies. The semi-structured interviews with the GO users complemented the archival data analysis with different perspectives. The interview data suggest a typology of 12 data quality criteria for bio-ontologies that were perceived as important by biocurators and ontology users, which are applicable to the GO (see Table 4.4).

4.3.8.1 Accessibility. Accessibility can be defined as the ease of locating a data object relative to a particular activity (Stvilia et al., 2007). A GO user pointed out the importance of accessibility for bio-ontologies: “I think GO was referred to quite a bit just because it was very easy to get access to the data.”

Table 4.4: Data quality criteria for bio-ontologies

1. Accessibility: The ease of locating and obtaining a data object relative to a particular activity
2. Accuracy: The degree to which the data correctly represent an object, process, or phenomenon in the context of a particular activity or culture
3. Authority: The degree of reputation of data in a given community or culture
4. Redundancy: The extent to which the data are new or informative in the context of a particular activity or community
5. Completeness: The extent to which the data are complete according to some general or contextual reference source
6. Consistency: The extent to which similar attributes or elements of data are consistently represented using the same structure, format, and precision
7. Currency: The extent to which the data represent the most up-to-date status of an object, process, or phenomenon in the context of a particular activity or culture
8. Relevance: The extent to which the data are related to the matter at hand
9. Reliability: The degree of confidence in data in the context of a particular activity
10. Simplicity: The extent of cognitive complexity/simplicity of data measured by some index or indices
11. Stability: The amount of time the data remain valid in the context of a particular activity
12. Verifiability: The extent to which the correctness of data is verifiable or provable in the context of a particular activity

4.3.8.2 Accuracy. Accuracy can be defined as the degree to which the data correctly represent an object, process, or phenomenon in the context of a particular activity or culture (Stvilia et al., 2007). A GO user explained the importance of data accuracy for bio-ontologies: “You need the data accuracy. I think now we have like millions of papers published like every year. … But I think you need to be careful like what data you choose, not just like [using] the random computer tool to get all the data together.” A model organism database curator valued the accuracy of the vocabulary, paying attention to whether each element of a bio-ontology was accurately defined and related:

The ontology terms are appropriate. You have a good hierarchy, you have a good path to root, you have definitions, you have term IDs, you have… It’s almost like a scientific body of work to represent these ideas in an ontology. And so you want something that effectively represents what you’re trying to classify. 4.3.8.3 Authority. Authority refers to the degree of reputation of data in a given community or culture (Stvilia et al., 2007). A model organism database curator indicated her trust in the quality of OBO ontologies: “I think being part of the OBO Foundry and an ontology being part of the OBO Foundry ensures that you are using something that is quality.” A GO user implied the reputation of GO in academia: “Gene Ontology is really mentioned a lot by teachers or in lectures. I think this is more widely used one.” Another GO user specifically pointed out the data quality criterion of reputation: “You need to have good reputation. It means most of the time the data is trustworthy, and they will take their reputation more seriously, and also they probably have more people to help.” 4.3.8.4 Redundancy. Redundancy can be defined as the extent to which the data are new or informative in the context of a particular activity or community (Stvilia et al., 2007). A GO Project member perceived cross-referencing to other ontologies or databases as redundancy. Meanwhile, she found that redundancy (e.g., having chemicals in the GO terms) enabled curation efficiency, preventing curators from looking up numerous ontologies when annotating papers: If you are curating and you are trying to get through 20 papers a day, if you have to pick classes from 10 different ontologies to be able to express what you read in this paper, it slows you down. So those, a lot of just practical reasons for having what we call these pre-composed ontologies. For example, even within the GO, we have anatomical terms. So you could say we are redundant with any anatomical ontologies.
There is redundancy there. We have cell types in the GO. We have chemical entities in the GO. So what we’ve been really trying to do is basically have it both ways. … for curator efficiency…the better way is to have cross-references to the three or more different classes and then make your, compose your class terms from these other class terms. 4.3.8.5 Completeness. Completeness refers to the extent to which the data are complete according to some general or contextual reference source (Stvilia et al., 2007). A GO user illustrated the importance of having complete annotation data in the Ontology, especially at the genome-wide level: “I think having a high coverage where you have annotations, you know, at

genome wide levels or cross a lot of different types of function.” Another GO user also highlighted the data quality criterion of completeness: “The most important criteria for high quality ontologies would be completeness of gene data.” Completeness can refer to cross-referencing to other ontologies to form a network of biological knowledge. A model organism database curator described her requirements for high quality bio-ontologies as: They are utilizing the resources that are available, such as, you know, they are not making a whole another Cell Ontology when there are already existed ones. So they are, you know, they are able to cross-reference to the information that’s in the Cell Ontology or imported and not remake new... You know, it’s the whole idea not proliferating the same ideas with the whole bunch of different IDs and a whole bunch of different name spaces, but utilizing the resources that are available to further the field. 4.3.8.6 Consistency. Consistency refers to the extent to which similar attributes or elements of data are consistently represented using the same structure, format, and precision (Stvilia et al., 2007). A GO curator emphasized data consistency across different bio-ontologies: “It needs to have links to other ontologies and is consistent with different ontologies, such as the Cell Ontology.” 4.3.8.7 Currency. Currency can be defined as the extent to which the data represent the most up-to-date status of an object, process, or phenomenon in the context of a particular activity or culture. A model organism database curator brought up the importance for bio-ontologies of staying current with science: They’re staying current with the science. And I think if they won’t always working on it and always trying to make it better, then I would have a problem I think. Then I think we would have some data quality issues, because they would not stay current with the science.
Another model organism database curator raised a similar point that bio-ontologies needed to be actively maintained and edited in order to stay current: You don’t want an ontology that somebody makes for their PhD dissertation. And then what I hear is and then they go away. And so nothing ever happens to it. That doesn’t change, doesn’t get edit. If somebody finds something that doesn’t make sense, there is nobody around to fix it. And so this would be a very bad one.

4.3.8.8 Relevance. Relevance refers to the extent to which the data are related to the matter at hand (Stvilia et al., 2007). A GO user indicated her preference for data relevance when selecting bio-ontologies for her research: “I’m more interested in functional aspects and more proteomics level of, kind of what GO provides. So GO is kind of perfect for what I do.” 4.3.8.9 Reliability. Reliability can be defined as the degree of confidence in the data in the context of a particular activity (Stvilia et al., 2007). A GO user placed importance on the reliability of the source from which the annotation data were derived: “If gene A is published on a peer reviewed journal saying that it’s in the signaling transduction pathway, I think that’s a pretty, is a reliable source of information.” Another GO user also mentioned the reliability of evidence, indicating how annotations could be supported: “You can see what criteria was used to be able to generate that data. And that has to be reliable, like those evidence codes are what you think they are.” 4.3.8.10 Simplicity. Simplicity refers to the extent of cognitive complexity/simplicity of data measured by some index or indices (Stvilia et al., 2007). A GO user, who was an experimental biologist, specifically pointed out the importance of simplicity in using the annotation data, which should be “easy to comprehend and less requirement of computing skills.” 4.3.8.11 Stability. Stability refers to the amount of time the data remain valid in the context of a particular activity (Stvilia et al., 2007). A GO Project member revealed the importance and difficulty of keeping bio-ontologies stable and sustainable in the domain: If I am going to put all the effort into developing an ontology, I would like it to be something that’s going to be around the topic, the domain, to be something that’s going to be around for a while, a little more sustained cause they are hard to maintain. And biology changes fast enough.
A model organism database curator also mentioned stability, but in the sense of its usage: The Gene Ontology is a stable, what I would consider a stable ontology, although it’s being built and continually updated and everything. The general usage of it, the general rules for usage of it haven’t really changed that much. 4.3.8.12 Verifiability. Verifiability refers to the extent to which the correctness of data is verifiable or provable in the context of a particular activity (Stvilia et al., 2007). A GO user discussed the importance of verifiability of the annotation data:

I would say that there are a lot of care has been taken to annotate the gene. So, you know, you might have a gene, but it’s linked to a Gene Ontology term, like development. But someone really had to go to the literature and show that that gene had a role in a developmental process. And so the fly community knowledge base, it’s quite extensive. We have a well-curated database. And so I think that has allowed the Gene Ontology terms to also be high quality. Similarly, another GO user stressed the importance of providing annotation data with scientific evidence to verify the strength of annotations: I think the most important criterion is scientific evidence, which itself can be … more direct and less direct. I am ok with using indirect evidence, but I need to…It’s helpful to know when you are looking at a Gene Ontology assignment, the relative strength of that assignment, whether it’s direct, indirect, supported by multiple lines of evidence, a single line of evidence. And if it’s sequence homology across species, how many have been characterized?
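The "relative strength of that assignment" this user describes is carried in GO annotations by evidence codes, which can be ranked programmatically. A minimal sketch of that idea follows; the evidence code groupings below follow GO's published evidence code classes (abridged), while the (gene, term, code) annotations are invented examples.

```python
# Sketch: ranking GO annotations by evidence class (direct experimental
# evidence vs. indirect inference vs. unreviewed electronic annotation).
# Groupings abridged from GO's evidence code classes; annotations invented.

EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}  # direct assay evidence
CURATED = {"ISS", "ISO", "ISA", "IBA", "TAS", "IC"}        # inferred or curator-judged
ELECTRONIC = {"IEA"}                                       # automatic, no curator review

def evidence_strength(code):
    """Map a GO evidence code to a coarse strength label."""
    if code in EXPERIMENTAL:
        return "direct"
    if code in CURATED:
        return "indirect"
    if code in ELECTRONIC:
        return "electronic"
    return "unknown"

annotations = [
    ("geneA", "GO:0007165", "IDA"),  # hypothetical: shown by direct assay
    ("geneA", "GO:0007165", "IEA"),  # same term, electronic inference only
    ("geneB", "GO:0032502", "ISS"),  # inferred from sequence similarity
]

for gene, term, code in annotations:
    print(gene, term, code, "->", evidence_strength(code))
# geneA GO:0007165 IDA -> direct
# geneA GO:0007165 IEA -> electronic
# geneB GO:0032502 ISS -> indirect
```

A user filtering annotations this way can, as the interviewee suggests, weight a directly assayed assignment more heavily than one inferred by homology or produced electronically.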

4.3.9 Data Curation Skills for the GO

One of the interview questions concerned the data curation skills required for the GO. Data curation refers to “the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use” (Lord & Macdonald, 2003, p. 12). The interview data suggest a typology of 16 skills: basic biological knowledge, domain-specific biological knowledge, computational skill, staying current on biology, bioinformatics, ontologies, reading scientific literature, detail-orientation, science experience, lab work experience, communication, data collection, data organization, statistics, collaboration and cooperation, and user studies. The most frequently mentioned skill is biology. A model organism database curator stated: “You need to have a pretty solid understanding of biology, that has to be something that you have a good foothold with at least background, understanding the processes involved.” Another model organism database curator noted that curation might require not only basic biological knowledge but also “species-specific knowledge,” which is “useful if you’re annotating a specific species level since annotations and the pathways may be different across, you know, different species.” A GO user specifically pointed out the domain knowledge represented by the GO: “You need a good gene vocabulary, and a good scientific vocabulary,

and knowledge of basic biological processes.” However, having domain-specific biological knowledge alone is not enough. Another GO user brought to attention the skill of staying current on biology: I guess everyday new gene could be found. We get involved in new processes or pathway. So we have to be, you know, frequently updated with latest understanding to get more accurate ontology relationships between different gene groups or genes. The second most frequently mentioned skill is computational expertise. A model organism database curator, who worked on computational annotations, indicated, “Actual [electronic] annotation and building the pipelines that stuff, I mean it takes computational… You need good computational background, but you don’t need a lot of, bioinformatics knowledge to do that type of work.” A GO Project member stated, “The biologist, especially they come from lab, they usually don’t know a programming language. But we need the computer scientists because it’s how to build it [GO].” Three interviewees mentioned the bioinformatics skill. One of them stated: I think definitely you need to have a very strong bioinformatics background. It [GO] is a very large database. And then how you organize them, like create an interface easy for users like us who don’t really have this kind of background to look up the information very easily is important. Not surprisingly, data curation for the GO requires some basic understanding of ontologies and the GO.
A model organism database curator, who was also an ontology editor, described, “Having the biology component, but also having a good component of general ontology understanding is necessary”, such as “how ontologies are built, how they are utilized, and how to effectively build terms in.” Another model organism database curator said, “There needs to be some understanding of what GO is and how to create what the evidence codes are.” Nearly all the biocurators who were interviewed mentioned the skill of reading scientific literature. A GO curator further required “having good eyes of detail” when reading and curating papers. Besides scientific knowledge and reading literature, data curation for the GO requires science experience, especially lab work experience. The GO curator emphasized, “They need to have a background in science, and also have experience of doing good lab work and know about the experimental techniques.” A model organism database curator, who was also a GO curator, specifically pointed out the requirement for science experience:

We want them to have been science experience, even postdoc experience. We rarely take somebody that just got their PhD, and then they decided never wanna to see another test tube again in their life. And the whole point of that is you have to be able to read the paper and know is this particular assay that they are doing, you know, appropriate for what they are claiming … you go from a bench, and here you are. You are not doing bench work anymore. Another model organism database curator highlighted the communication skill. She explained: Communication with colleagues and peers is exceedingly important because you need to be able to… You know, when you’re putting in your terms for, you know, when someone is requesting a term, especially if you’re requesting a term, you need to be able to communicate what it is that you’re asking for, and whether that’s, you know, in the written, or whether you are talking to someone, you’re being able to understand your need and communicate what your need is pretty important. And then being able to, you know, talk to your colleagues about what it’s necessary for that term. Several GO users indicated the skills of data collection and organization. One of them perceived ontology as “an organization of ideas,” which is usually the result of “team efforts.” Besides organizational skill, she brought up the skill of collaboration and cooperation. Other skills implied from the interview data include statistics and user studies (i.e., getting feedback from the user community).

4.4 Conclusion

This chapter presented findings attained from three research methods: archival data analysis of discussions on the online fieldsite, participant observations, and semi-structured interviews with both biocurators and ontology users. The findings from the archival data analysis and semi-structured interviews were further divided into subsections based on the concepts of Activity Theory and Stvilia's IQ Assessment Framework, including activities, communities, division of labor, tools, rules, types and sources of data quality problems, and corresponding quality assurance actions. Section 4.3 also reported findings on data quality criteria of bio-ontologies and the ontology curation skills required for the GO. Chapter 5 will discuss each of the research questions and sub-questions, relating the findings of this study to the literature.

CHAPTER FIVE

DISCUSSION

This chapter discusses findings presented in Chapter 4 in light of the research questions, providing a synthesized view of the findings attained from three different research methods. This chapter also includes implications of the findings in relation to the literature and discussion of potential future research. Based on the analysis of contradictions within and between activity systems around the GO, suggestions were made to the GO Consortium and communities engaging in developing large-scale KO systems.

5.1 Activities around the GO

The first research question focuses on the activities around the GO and their objectives. According to Activity Theory (Engeström, 1990; Leont’ev, 1978), activity refers to a complex system of related elements, including subject, object, actions, community, division of labor, rules, tools, and outcome. Activities around the GO vary by subject, community, and objective. The GO Consortium is dedicated to developing and maintaining the GO as a shared knowledge base for different biological communities. Within the GO Consortium, the GO Project team engages in developing the controlled vocabularies (e.g., adding new terms and relations), maintaining the vocabularies (e.g., reorganizing a section of the Ontology), developing tools (e.g., AmiGO) to support different activities around the GO, maintaining those tools, providing annotation data, promoting the GO to the scientific community, and bringing in new communities (e.g., synapse) to the Consortium. Members of the GO Consortium, such as PomBase and UniProt, concentrate on assigning the GO terms to their genes or gene products and uploading annotations to the GO database. Besides producing annotation data, the GO Consortium members also participate in developing and maintaining the GO by requesting new terms and relationships, suggesting renaming or redefining the terms, and reporting errors or omissions in annotations. The curators of PomBase described their manual literature annotation process as “not a passive process,” but as one that includes “contributing to the development of the GO by identifying missing relationships, refining existing term definitions and extending the vocabularies by identifying new terms” (Aslett & Wood, 2006, p. 914). To facilitate

communication and collaboration, different groups within the GO Consortium have regular meetings, and all the groups get together twice a year at the GO Consortium meeting. In addition to ontology development and maintenance, the GO Consortium engages in educating and disseminating information about the GO, such as training model organism or protein database curators to do manual GO annotations, educating undergraduates to use the GO (e.g., the CACAO Project), updating and editing the GO Website and Wiki, and publishing scholarly articles introducing the GO and reporting its developments. Users apply the GO for different research purposes, such as controlling data quality, validating experimental data, organizing or filtering down data, suggesting new experiments, and developing new ontologies. Some of the GO users with computational skills develop their own tools to access and process the Ontology data, and/or build computational models on the GO data. The archival data analysis discovered that individual researchers or groups also contributed to the GO by suggesting new terms, reporting errors in the GO, and submitting annotations. The participation activity described in Section 4.2 is an example of individual researchers’ contributions to the GO.

5.1.1 Communities

Research question 1.1 asks about the communities participating in different activities around the GO. The GO Project was initiated in 1998 by three model organism databases: FlyBase, the Saccharomyces Genome Database (SGD), and Mouse Genome Informatics (MGI) (Gene Ontology Consortium, 2011b, 2014g). One of the objectives of the GO Consortium is to build species-neutral controlled vocabularies that can be used across different organisms. To meet this objective, the GO Consortium has been bringing in communities or groups from different biological domains. It has since expanded to include more than 30 members, including major model organism databases, protein databases, and other biological research communities (Gene Ontology Consortium, 2014i). Interestingly, several of the GO Consortium members are collaborations between different groups. For example, as a member of the GO Consortium, the University College London (UCL) based annotation team represents a collaboration among four groups: UCL, the Proteomic Services group at EBI, the UniProt Content group at EBI, and Manuel Mayr’s group at King’s College London (University College London, 2014). Funded by the British Heart Foundation, the team engages in providing cardiovascular-relevant genes with detailed and high quality GO annotations. Besides supplying the GO annotations, the team

has contributed to creating 1,100 new GO terms for heart development and cardiac conduction during the past five years. In other words, the GO is a product of meta-collaboration. The GO has diverse user communities. They may include, but are not limited to, experimental biologists studying humans or different model organisms, biochemists doing proteomics work, biomedical scientists studying specific diseases (e.g., Parkinson’s disease), ecologists doing evolutionary research, bioinformatics analysts performing data analysis, and computer scientists or software engineers building computational models or tools (e.g., GOrilla). Interestingly, since the GO is one of the most successful bio-ontologies, there are specific communities doing research on the GO. Built on the success of the GO, ontologists in the life sciences are establishing principles (e.g., the OBO principles) and best practices for ontology design (OBO, 2006; Smith et al., 2007). A group of bioengineers is constructing ontologies comparable or complementary to the GO (Dutkowski et al., 2013). Information scientists are studying the construction of GO’s controlled vocabularies (e.g., Mayor & Robinson, 2014) and examining its cyberinfrastructure that can support collaborative ontology development and maintenance (e.g., Wu & Stvilia, 2014).

5.1.2 Division of Labor

Research question 1.2 examines the division of labor within activities around the GO. To coordinate the GO Consortium members, a GO directors group was formed, consisting of four scientists who are the founders of the GO (Gene Ontology Consortium, 2014k). The directors group is responsible for gaining and allocating funding, leading the GO Project, setting priorities, charting direction, establishing milestones, resolving conflicts, administering the GO Consortium meetings, and reporting to the funding agency. 5.1.2.1 Leading groups. In terms of ontology development, four of the GO Consortium members are the leading groups. The Jackson Laboratory, hosting the MGI, is responsible for leading the ontology development, developing software applications for the GO Project, and providing mouse gene products with the GO annotations. The GO Editorial Office at EBI is accountable for developing and editing the controlled vocabularies. The Lawrence Berkeley National Laboratory is in charge of developing software tools (e.g., AmiGO, OBO-Edit) for viewing, editing, and processing the controlled vocabularies and annotation data. The Cherry Laboratory, managing the SGD, is responsible for maintaining the GO database, ensuring public access to the data, and providing the budding yeast genes with the GO annotations. The other

GO Consortium members are mainly responsible for contributing their annotations to the GO database. 5.1.2.2 The GO Project Team. To ensure that expertise from different biological domains is available to support ontology development and maintenance, each member database of the GO Consortium is required to assign at least one curator to serve as the GO curator (Gene Ontology Consortium, 2014f). The GO curators can be further divided into the GO editors, who develop and edit the controlled vocabularies, and the GO annotators, who annotate genes or gene products with the GO terms. Compared to the curators of member databases, the GO curators have more privileges and responsibilities, such as sitting at the GO Helpdesk to answer user questions, serving on the request trackers at SourceForge to review requests, having access to TermGenie to add new GO terms, making modifications to the Ontology, training local curators to do the GO annotations, and having regular GO meetings. In terms of power and status, there is a small group of GO curators with more knowledge of the GO who oversee TermGenie and the request trackers at SourceForge. They can assign requests to other GO curators depending on their expertise and review the new GO terms added via TermGenie. One of the interviewees called them the GO gatekeepers. The GO curators from each member database and the four leading groups of the GO Consortium formed the GO Project team, collaboratively developing and maintaining the Ontology. The GO Project team can be divided into the following groups working on different aspects of the GO: ontology development, annotation advocacy, reference genome annotation, user advocacy, software and utilities, and other working groups. 5.1.2.3 Ontology development group.
The ontology development group ensures that the controlled vocabularies represent current biological knowledge and remain useful for annotating gene products of the reference genomes and other model organisms (Gene Ontology Consortium, 2014n). The group is responsible for editing the Ontology, developing the controlled vocabularies, reviewing requests on GO’s Ontology Requests Tracker, and reporting suggestions from different member databases for changes to the Ontology. Having the GO curators from different member databases allows the group to have the necessary domain knowledge to review requests submitted from different user communities for changes to the Ontology. Meanwhile, the group maintains communication among different Consortium members, ensuring the GO can fulfill the needs of different members for annotation.

5.1.2.4 Annotation advocacy group. The annotation advocacy group ensures the accuracy and consistency of GO annotations provided by the GO Consortium members (Gene Ontology Consortium, 2014a). The group is responsible for formulating and enforcing annotation rules and policies, establishing best practices for annotation, training the GO curators, and keeping member databases up-to-date with changes to the GO. 5.1.2.5 Reference genome annotation group. The reference genome annotation group provides reference genomes with complete GO annotations, ensuring that all genes with experimental data are annotated in depth and promoting these annotations to other genomes (Gene Ontology Consortium, 2014r). The group consists of representative GO curators from 12 model organism databases, centralizing the review of key genome annotations, assisting in the enforcement of GO annotation standards, and providing the annotation advocacy group with feedback. 5.1.2.6 Software and utilities group. The software and utilities group consists of software engineers who provide the GO Consortium and users with technical and software support, such as developing software tools for ontology development and ontology data analysis (Gene Ontology Consortium, 2014s). This group also supports the other groups of the GO Project team. 5.1.2.7 User advocacy group. The user advocacy group establishes communication between the GO Consortium and the scientific community, ensuring that the GO stays useful, relevant, and accessible (Gene Ontology Consortium, 2014w). Through the GO Helpdesk, tutorials, reports, and publications, the group aims to keep users up-to-date with changes in the GO, bridge the gap between the GO Consortium and the scientific community, and provide users with support for the GO tools.
The objective of the GO Consortium is to develop controlled vocabularies that can be used across different organisms and facilitate uniform queries across their databases (Gene Ontology Consortium, 2014g). The division of labor within the GO Consortium ensures that the formidable ontology development process can be divided into manageable projects, and certain biological communities can collaboratively participate in developing and maintaining the Ontology and have their experimental data annotated with the GO terms. These GO annotation data in turn support the bench scientists of those communities to produce new experimental data and discoveries, forming a cycle of knowledge creation. As Holsapple and Joshi (2002) stated,

GO’s collaborative development approach enables a built-in evaluation mechanism that involves different communities controlling ontology quality, contributing content, and exchanging viewpoints and expertise. On the other hand, the GO Consortium is dominated by model organism databases. The GO directors group consists of scientists representing three model organism databases (i.e., FlyBase, MGI, and SGD), which are the founding members of the GO (Gene Ontology Consortium, 2014k). Concerns remain about whether the needs of smaller communities or less funded groups can be met and their experimental data can be accurately captured by the GO. The GO Consortium consists of biocurators and scientists either from Europe or North America. However, the GO users are located all over the world. For example, one of the interviewees of this study was a principal investigator of a life science lab located in Asia. Questions exist about whether the GO Consortium should attend to users from non-English speaking countries and be receptive to their needs. The GO Consortium may consider seeking collaboration or participation from scientific communities in other locations and expanding community outreach to other countries or areas, such as having the GO Consortium meeting outside Europe and North America.

5.1.3 Tools

Research question 1.3 examines the tools used in activities around the GO. Appendices E and H list the tools identified from the archival data analysis and semi-structured interviews. Different tools were selected for specific activities depending on their objective, community, and subject. 5.1.3.1 The GO Project team. The GO Project team mainly uses the tools developed by the GO Consortium (e.g., AmiGO, QuickGO, TermGenie, GO Slim) with the help of some external tools (e.g., OBO-Edit, Protégé) and reference sources (e.g., PubMed, Wikipedia) to develop and edit the controlled vocabularies (see Figure 5.1). Certain communication tools, such as request trackers at SourceForge, GO’s mailing lists, and the GO Helpdesk, are used to facilitate the communication among different members, groups, and user communities. Occasionally, when the GO editors could not resolve quality problems raised on the request trackers, they would ask other community experts for help. For documentation purposes, the GO Consortium uses a Wiki for members to store, edit, educate, and disseminate information, such as meeting notes, policies, guidelines, and instructions. The GO Project team creates tutorials

and publishes scholarly articles to communicate to the scientific community the activities of the GO Consortium and changes to the GO. 5.1.3.2 The GO Consortium members. The GO Consortium members use the tools developed by the Consortium (e.g., QuickGO, Protein2GO, PAINT) and external tools (e.g., Blast2GO, Evidence Ontology) to help with their annotations. When the GO curators or member database curators cannot find the GO terms that they need for annotation or detect any misannotations, they will submit suggestions to the GO Helpdesk or the request trackers at SourceForge. They may study the GO annotation policies, guidelines, conventions, or tutorials available at the GO Website or Wiki. Compared to the GO Project team, the GO annotators or member database curators use a larger variety of tools. A GO Project member indicated, “Even within the GO [Consortium], people, who actually have some funding from NIH for the GO project, everybody’s been using a different ad hoc tool [for annotation].”

[Figure 5.1 is a diagram of the activity system of the GO Project team. Its components are:
Subject: GO Project team. Object: species-neutral controlled vocabularies.
Activities: adding new GO terms and relations; maintaining the controlled vocabularies; developing and maintaining tools; providing annotations; bringing in new communities; having GO meetings; training GO curators; educating and disseminating information about the GO.
Tools: AmiGO, QuickGO, GO Slim, OBO-Edit, Protégé, TermGenie, GO Wiki, GO Website, GO tutorials and publications, request trackers, GO Helpdesk, GO’s mailing lists, reference sources, other bio-ontologies, biological databases, cross-referencing files, other community experts.
Rules: literary warrant; principle of consistency; species neutrality; true path violation.
Community: model organism databases; protein databases; biological research communities.
Division of Labor: GO directors group; GO gatekeepers; GO curators; software engineers; others.]

Figure 5.1: The activity system of the GO Project team

167 5.1.3.3 The GO users. The GO users, who were interviewed in this study, seldom use the tools developed by the GO Consortium. Most of them either use model organism databases to look for the GO terms associated with the genes or gene products that they are interested in, or use the gene set enrichment analysis tools built by external communities or groups (e.g., GOrilla) to identify over-represented GO terms for a large set of genes. Despite a great demand for gene set enrichment analysis tools from the scientific community, the GO Consortium did not build any of them until recently. Most of the GO tools were built for the activities of ontology development, maintenance, and annotation. This may explain why the GO users rarely use the tools developed by the GO Consortium. Some of the GO users with strong computational skills develop their own tools to help meet the objectives of their research activities.
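The gene set enrichment (over-representation) analyses mentioned above typically ask, for each GO term, whether a study gene set contains more genes annotated with that term than chance alone would predict. A minimal sketch of the underlying test, using the hypergeometric tail probability (all counts here are hypothetical, not taken from the study):

```python
from math import comb

def hypergeom_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k): probability of drawing at least k annotated genes
    when n study genes are sampled from a background of N genes,
    K of which carry the GO term in question."""
    return sum(
        comb(K, i) * comb(N - K, n - i) / comb(N, n)
        for i in range(k, min(K, n) + 1)
    )

# Hypothetical numbers: 5 of 20 study genes carry the term,
# versus 100 of 10,000 genes in the background.
p = hypergeom_tail(k=5, K=100, n=20, N=10_000)
print(f"over-representation p-value: {p:.3g}")
```

Production enrichment tools additionally correct for multiple testing (e.g., Bonferroni or false discovery rate), since thousands of GO terms are tested at once.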

[Figure 5.2 is a diagram of the activity system of the GO Consortium members. Its components are:
Subject: GO Consortium members. Object: genes or gene products manually annotated in depth.
Activities: assigning the GO terms to genes or gene products; uploading annotations to the GO; having GO meetings; suggesting new GO terms and relations; reporting misannotations.
Tools: AmiGO, QuickGO, GO Wiki, BLAST, Blast2GO, Protein2GO, TermGenie, HCOP, PAINT, Ontology Lookup Service, request trackers, GO Helpdesk, GO mailing lists, scientific literature, Evidence Ontology, GO’s annotation quality control checks, GO’s annotation extension.
Rules: GO annotation policies and guidelines; GO annotation conventions; local annotation policies.
Community: model organism databases; protein databases; biological research communities.
Division of Labor: GO curators; local curators.]

Figure 5.2: The activity system of the GO Consortium members

5.1.4 Rules Research question 1.4 investigates the rules regulating activities around the GO. This study identified a set of rules regulating the activities of developing the GO, annotating with the GO, using the GO, and communicating within the GO Consortium. 5.1.4.1 Ontology development. In terms of ontology development, the GO Consortium has established rules for GO term inclusion, naming, preferred-term selection, synonyms, and term obsoletion. Similar to other controlled vocabularies, the GO has a scope definition specifying certain concepts, such as gene products and protein domains, to be excluded from the Ontology (Gene Ontology Consortium, 2014g; Svenonius, 2003). A previous study by Mayor and Robinson (2014) found that the GO Consortium followed literary warrant for term selection. Likewise, this study found the rule of literary warrant regulating the inclusion of new terms in the Ontology (Harpring, 2010; Svenonius, 2003). Requesters, who suggest new GO terms on GO’s Ontology Requests Tracker, are usually required to provide published evidence, mostly in the form of PubMed identifiers, to support their requests and verify the form, spelling, and definition of the terms. The reference sources for GO term definitions are more flexible. They are not restricted to PubMed articles but can be books, consensus reached at the GO meetings, individual curators, groups of curators, community experts, external databases or ontologies, and even the English Wikipedia. For example, some of the GO terms were defined using the Oxford Dictionary of Biochemistry and Molecular Biology (Gene Ontology Consortium, 2014h). The GO Consortium requires curators to document the source of definitions following specific guidelines (Gene Ontology Consortium, 2009). For example, if a curator whose initials are ‘pf’ defined a GO term, ‘dbxref GOC:pf’ would be placed next to the definition as a reference source for attribution and accountability.
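As an illustration, in the OBO flat-file format used to distribute the GO, a term’s definition line carries its reference sources in a bracketed dbxref list, which is where attributions such as GOC:pf appear. A minimal parsing sketch (the definition text and PubMed identifier below are hypothetical):

```python
import re

# Hypothetical OBO 'def' line; 'GOC:pf' credits the defining curator.
def_line = 'def: "A hypothetical biological process." [GOC:pf, PMID:12345678]'

match = re.match(r'def:\s*"(?P<text>.*)"\s*\[(?P<refs>.*)\]', def_line)
definition = match.group("text")
dbxrefs = [r.strip() for r in match.group("refs").split(",")]

print(definition)   # A hypothetical biological process.
print(dbxrefs)      # ['GOC:pf', 'PMID:12345678']
```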
Such attribution allows the source of a GO term to be traced when problems with its use arise. Recently, the GO Consortium began adopting ORCIDs in place of curators’ initials as a more formal means of attribution. The GO Consortium has established a set of rules for naming the GO terms. These rules include but are not limited to: (a) preferring US spelling to British spelling, (b) avoiding abbreviations except for self-explanatory terms (e.g., DNA), (c) spelling out Greek symbols, (d) using lower case except where demanded by context (e.g., RNA), (e) using nouns in their singular form, (f) excluding anatomical qualifiers, and (g) avoiding ‘-like’

for protein names. This study found that the principle of consistency governed the selection of preferred GO terms, regardless of what literary or community practices warranted, in order to keep the GO term names consistent with those of their siblings and parents (Svenonius, 2003). The inclusion of synonyms, which can be based on organizational or use warrant, is less restrictive. For example, the GO Consortium allows jargon and acronyms as synonyms. To accommodate the specificity of GO terms required by users without inflating the GO, annotation extension (column 16) was implemented to provide annotations with context and details. The GO Consortium claimed, “GO editors will regularly review the contents of the annotation extension field in submitted annotation files and create new, more specific terms if they feel enough annotations exist to warrant a pre-composed term” (Gene Ontology Consortium, 2004a, para. 9). This indicates that the GO Consortium may follow use warrant for including some terms in the Ontology (Svenonius, 2003). The number of times that people use those terms may need to reach a specific threshold to justify their inclusion in the Ontology; however, the GO Consortium has not specified that threshold. The GO Consortium has specific rules for GO term obsoletion. GO terms are made obsolete if they are invalid, out of scope, or their meanings have changed (Gene Ontology Consortium, 2011a). Before making any GO terms obsolete, the GO editors inform the databases that have used those terms so that they can fix their annotations. A GO term is never deleted but made obsolete, with a comment explaining the reason for obsoletion and a suggestion for replacement term(s). The accession numbers of obsolete GO terms persist, so that users can search using those identifiers and be directed to the replacement terms. 5.1.4.2 GO annotations.
In terms of annotation, a set of rules, such as annotation policies and guidelines, annotation conventions, annotation standard operating procedures, and a guide to evidence codes, is available at the GO Website and Wiki. Meanwhile, the GO annotation advocacy group ensures the enforcement of those annotation rules in the GO Consortium member databases. As described by a GO curator during the interview, the GO Consortium members can differ in their philosophy of GO annotation and in their use of certain evidence codes (e.g., IEP). This implies that member databases are allowed some flexibility in establishing local annotation policies. Besides these explicit rules, there is an implicit norm regulating the annotation activity. The process of manual literature annotation differs from that of journal paper review in

that the GO annotators should not judge the papers but should stay objective, representing what they read in the papers using the GO terms. 5.1.4.3 GO use. Since the GO database is open access, activities of using the GO follow less restrictive rules. The GO data and software tools are released under a Creative Commons license (Gene Ontology Consortium, 2014v). Users are free to copy, modify, and redistribute the data as long as they provide the date and version number of the GO data and software tools. The GO users interviewed in this study indicated that they would cite either the GO or the papers associated with the software in their publications or reports, but did not mention whether they would specify the date and version. 5.1.4.4 Communications. Different groups within the GO Consortium have various levels of regular meetings. All GO Consortium members meet face-to-face twice a year. There is also an implicit rule regulating communications within the GO Consortium: if an issue takes more than 15 emails to discuss on the request trackers at SourceForge, it is added to the agenda of the next GO Consortium meeting.
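Several of the term-naming rules described in Section 5.1.4.1 lend themselves to automated checking. A minimal lint sketch, with the rules heavily simplified and the whitelist of self-explanatory abbreviations an illustrative assumption rather than the Consortium’s actual list:

```python
# Simplified checks for a few GO term-naming rules; the whitelist of
# permitted abbreviations is an illustrative assumption.
ALLOWED_UPPER = {"DNA", "RNA", "ATP"}
GREEK = set("αβγδ")

def name_problems(name: str) -> list[str]:
    """Return a list of naming-rule violations found in a term name."""
    problems = []
    for word in name.split():
        if word.isupper() and word not in ALLOWED_UPPER:
            problems.append(f"unexpected abbreviation or casing: {word}")
    if any(ch in GREEK for ch in name):
        problems.append("Greek symbols should be spelled out")
    if name.endswith("-like"):
        problems.append("avoid '-like' for protein names")
    return problems

print(name_problems("DNA replication"))   # []
print(name_problems("β-catenin-like"))    # two violations
```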

5.1.5 Contradictions Research question 1.5 examines the contradictions within and between different activity systems around the GO and how these contradictions are resolved. Contradictions refer to historically accumulated tensions or instabilities within or between activity systems, and they play a central role in the change, development, and learning of activities (Allen et al., 2011; Roos, 2012). Contradictions may exist within each component of an activity system, between components of the activity system, between different developmental phases of the activity, and between different but interconnected activity systems (Engeström, 1990). The following sections provide examples of contradictions within several activities around the GO. Since contradictions are the source of development (Kuutti, 1996), suggestions concerning some of these activities were made in order to help their subjects or communities meet their objectives (see Table 5.1). 5.1.5.1 Contradictions between objective and tool. Contradictions between objective and tool refer to cases where the limitations of a tool prevent the objective of an activity from being met. A GO Project member revealed the contradiction between the GO browsers—tools—and the objective of visualizing the GO: Like the graph drawing is better in QuickGO, so the AmiGO actually calls out the graph drawing in QuickGO. And the taxon checks are done by AmiGO module. So QuickGO

actually calls out of those. You know, we are both part of the same Consortium. Both are useful. Both should have similar versions of the data... The GO Project member indicated that each of the two browsers had some function that the other one was missing to meet a specific objective. To resolve this contradiction, the GO Consortium called out different features from those two browsers to enhance each other. Although those browsers were developed by the GO Consortium members, concerns may arise about whether they have the same version of data. During the participation activity described in Section 4.2.1, the researcher found a gene with GO annotations in QuickGO but could not find it in AmiGO. To ensure data consistency and eliminate user confusion, the GO Consortium may consider consolidating the efforts of different member databases to build an integrated browser unifying the functions and strengths of existing GO browsers. A GO user indicated a contradiction between the GO browser AmiGO—a tool—and his objective of building a predictive model using the GO: The way the current [visualization] tools exist are sort of built for one type of research, which is I have one term in question. I already decided which term I want to look at. I don’t care but any other terms of the Gene Ontology. These tools, like AmiGO and whatever, are fine for that, because they zoom in on a local structure of the Gene Ontology, like a local part. They zoom in into the term in question. … But it’s limited to the attitude or the scenario where you really knew what term you wanna to look at. But in my case, I’m trying to use the entire Gene Ontology or any entire ontology as a predictive model. And before I do that, I want to use the visualization, so I can make sense of what is going on. It’s sort of a chicken and egg problem. I don’t want to zoom in on any one portion yet because I haven’t decided what I wanna zoom in on.
The whole point of the visualization is to help me figure out what parts of the Gene Ontology were important. Obviously, the existing GO browsers could not help the GO user meet his objective of visualizing the entire Ontology and building a predictive model using the GO data. The interview data indicated that some of the GO users with computational skills developed their own scripts or tools to help visualize and process the GO data. The suggestion for resolving the contradiction between the GO browsers and different users’ research objectives is that the GO Consortium may consider collaborating with the user community of bioinformaticians or computer scientists,

especially those doing research on the GO, to develop more versatile browsers that can serve users’ different research purposes. A GO Project member provided another example of contradictions between tool and objective. She described how DAVID—a GO term enrichment analysis tool—failed to produce accurate enrichment results to meet the objective of a particular activity, as the tool had not kept up to date with the GO database: DAVID, for example, is not what we would recommend because it’s very out of date. Part of the problem that we run into from the GO perspective is that people don’t update the data. So there will be changes to the Ontology and changes to the annotation. And enrichment isn’t going to work if those aren’t current or will, you know, skew the answers. And that’s one of the reasons. DAVID is like at least 3 or 4 years out of date. And that’s pretty far. To resolve this contradiction, the GO Consortium made PANTHER, a GO term enrichment analysis tool maintained with the most up-to-date version of GO’s annotation data, available at GO’s Website. As mentioned by the GO Project member, numerous GO term enrichment analysis tools developed by outside groups are available. Users who get inaccurate or incomplete analysis results from those tools may attribute the data quality problems partly to the GO. The GO Consortium may consider bringing in the communities or groups that have built GO term enrichment analysis tools for the purpose of collaborative development and maintenance, ensuring that those tools have the most current GO data. 5.1.5.2 Contradictions between objective and division of labor. Contradictions between objective and division of labor refer to cases where the current division of tasks among members of a community inhibits the objective of an activity from being met.
A GO curator, who was also a model organism database curator, described the contradiction between the number of curators doing manual annotation in his community—division of labor—and the objective of annotating a huge number of papers each month: I can only speak for Z [a model organism database]. So there are actually a triage process. We add roughly 1,200 papers to our database a month. We have 30 curators. You can do the math. There is no physical way… We have to put priorities on things. So it’s just, you know, we have to pick and choose.

The GO curator implied that his community had a huge number of papers published monthly—at the rates he cited, roughly 40 papers per curator per month—while only a small number of curators were manually curating those papers. To resolve this contradiction, the model organism database employed a triage process, prioritizing certain journals or papers to annotate: Every mod [model organism database] has their own way of triage. What we do, first of all, we have about 150 journals that we keep an eye on. That list is based on how many papers previously have we gotten from those journals. So those are the journals to watch, at least in our field. We then have various ways of finding… For say one issue, please find all papers that have anything to do with the z. ... So then we have a process where all those papers now we have to manually… Each curator manually looks at the paper. And again it’s triage. So you do it as fast as you can. And if you lose some, you lose some. You know, don’t worry about it. And you try to identify, oh, this paper is about z knockout. So we know that is good for alleles and phenotype. And most of the time we will select that for the GO, because most of the time z knockout gene, knocking out the genes say gives you a tailored z. A GO Project member explained the triage process in another model organism database: They do a lot of triage. You know, they’re keeping up with every paper. They’ll do the ones that are on their [priority] list or they get a special request for. It’s not a 100%, definitely not. I know for FlyBase, they get increasing the ante. You know, it’s like, well, we will only do these journals. And then, you know, which journals they did kept going down and down from like doing everything to just doing particular 7, 10 journals where most of their information came from, swamped. The triage process may leave a certain number of papers unannotated and result in the data quality problem of incomplete annotations.
The suggestion for resolving the contradiction between division of labor and the objective of complete annotations for a specific model organism is that the GO Consortium can employ the community annotation approach, engaging the user community—bench scientists (especially experts on particular genes or gene products)—in contributing annotations to the GO database. This is similar to social tagging, a process of publicly and collaboratively assigning tags to resources in a shared, online environment (Stvilia et al., 2012; Trant, 2009). The value of social tagging is to provide a collaborative and democratic approach to representing and categorizing huge masses of

information for a large group of users at a lower cost (Mai, 2009). However, the collective intelligence in social tagging is achieved only when a critical mass of participation is reached (Anderson, 2007). If the user community is too small to achieve that critical mass, the collaborative feature of social tagging adds little value to existing KO systems. Similarly, community annotation requires participation and collaboration from a large number of bench scientists. Creating an online interface for the user community to submit annotations is not enough; it also requires community outreach, instruction in the process of manual GO curation, and motivating scientists to contribute. In the 1990s, GenBank was initially maintained by NCBI staff who manually scanned published literature and uploaded nucleotide sequences to the database. Now most GenBank sequences are produced and contributed by individual laboratories around the world. GenBank continues to grow at an exponential rate and contains public sequences for more than 280,000 species (Benson et al., 2014). This is partially due to the requirement by most journal publishers that newly sequenced DNA first be deposited in public nucleotide databases (e.g., GenBank) so that it can be cited and retrieved when the articles are published (Higgs & Attwood, 2005; Wu et al., 2012). The GO Consortium may learn from GenBank by collaborating with scholarly journals (especially the frequently cited journals identified in Section 4.1.6.2) and requiring individual laboratories to submit GO annotations before their articles are published, so that the GO can receive community annotations and may achieve completeness and currency. Meanwhile, bench scientists and researchers may be encouraged to communicate using the GO vocabularies and accession numbers.
Social tags have well-known data quality problems, such as ambiguity, flat structure, spelling errors, and lack of context (Stvilia, Jörgensen, & Wu, 2012). Likewise, concerns may arise about the quality of community annotations. To assure the data quality of community annotations, the GO Consortium may consider creating specific templates for different user communities, developing automatic quality control checks, and establishing a collective community review and attribution mechanism. 5.1.5.3 Contradictions between objective and rule. Contradictions between objective and rule refer to cases where specific rules create barriers that inhibit the subject from meeting her/his objective. A GO Project member illustrated the contradiction between the rules of using Protein2GO (i.e., a manual GO annotation tool) and the objective of gaining community annotations:

I think the chicken community, chicken and dictyostelium, they are actually using the UniProt tool called Protein2GO. But that is tightly coupled to the UniProt database. And so it’s not really quite as open as you would [think]. You have to set up a special arrangement and agreement with the EBI. So it’s a little bit more of a barrier. There is several problems with the [Protein2GO]. One is you have to have the UniProt identifier, which you don’t necessarily have. The rules of Protein2GO inhibit different scientific communities from using it to contribute GO annotations. To resolve this contradiction, the GO Consortium may consider developing community annotation tools that are open access, require less cognitive effort from users, and have built-in quality control checks and attribution mechanisms. 5.1.5.4 Contradictions between objective and community. Contradictions between objective and community refer to cases where the existing communities fail to meet their shared objective. As mentioned above, the GO Consortium aims to develop controlled vocabularies that can be used across different species and databases. Although it now comprises more than 30 biological communities, the GO Consortium still lacks experts from specific domains to meet its objective of developing the GO as a shared knowledge base. A GO Project member acknowledged this contradiction: Frankly speaking, we [the GO Consortium] don't have enough biological expertise in lots of areas. ... So we’re bringing in people. … So one of the most successful groups is run by Jessie, the person I mentioned at UCL, University of College London. And she was able to get independent funding for specifically annotating genes involved in heart disease and genes involved in kidney disease from the UK Heart Foundation I guess. And so there have been cases, but it’s mostly outside groups getting some funding. And then we work with them.
The GO Consortium is trying to resolve this contradiction by bringing in new biological communities and collaborating with outside groups that have funding. As discussed above, the GO Consortium is dominated by model organism databases whose curators come from either Europe or North America, whereas the GO users are located all over the world. Besides bringing in communities in different biological domains, the GO Consortium may consider seeking collaboration and participation from scientific communities in other regions, such as Asia and Africa, thus enabling communication with non-English-speaking scientists.

Table 5.1: Contradictions and suggestions

Contradiction: Each of the two GO browsers has functions that the other is missing to meet a specific objective. Although these browsers were developed by GO Consortium members, concerns may arise about whether they have the same version of data.
Suggestion: The GO Consortium may consider consolidating the efforts of different member databases to build an integrated browser unifying the functions and strengths of existing GO browsers.

Contradiction: The existing GO browsers cannot help the GO user meet the objective of visualizing the entire Ontology and building a predictive model.
Suggestion: The GO Consortium can collaborate with the user communities in bioinformatics and computer science, especially those doing research on the GO, to develop more versatile browsers that can serve users’ different research purposes.

Contradiction: Some of the GO term enrichment analysis tools fail to produce accurate enrichment results.
Suggestion: The GO Consortium can bring in the communities or groups that built the GO term enrichment analysis tools for collaborative development and maintenance, ensuring that those tools have the most up-to-date version of GO’s annotation data.

Contradiction: Specific biological communities have a huge number of papers published monthly, but only a tiny number of curators manually curating those papers. The triage process employed by those communities leaves a certain number of papers unannotated, so those communities fail to meet the objective of complete GO annotations for a specific model organism.
Suggestions: The GO Consortium can employ the community annotation approach, engaging the user community—bench scientists (especially experts on particular genes or gene products)—in contributing annotations to the GO database. The GO Consortium may also learn from GenBank, collaborating with scholarly journals and requiring individual laboratories to submit GO annotations for genes or gene products before their articles are published. To assure the data quality of community annotations, the GO Consortium can create specific templates for different user communities, develop automatic quality control checks, and establish a collective community review and attribution mechanism.

Contradiction: The rules of Protein2GO inhibit different scientific communities from using it to contribute annotations.
Suggestion: The GO Consortium may consider developing community annotation tools that are open access, require less cognitive effort from users, and have built-in quality control checks and attribution mechanisms.

Contradiction: The GO Consortium lacks experts from specific domains to meet its objective of building the GO as a shared knowledge base across different species and databases. The GO Consortium is dominated by model organism databases whose curators come from either Europe or North America, whereas the GO users are located all over the world.
Suggestion: Besides bringing in communities in different biological domains, the GO Consortium may seek collaboration and participation from scientific communities in other regions, such as Asia and Africa.

5.1.6 Data Curation Skills for the GO Research question 1.6 focuses on the data curation skills required for the GO. This study identified a list of 16 skills that were perceived as important by the interviewees, including biocurators and the GO users (see Section 4.1.6). The International Society for Biocuration (ISB) conducted a survey of biocurators to learn about their backgrounds, motivations, and career goals (Burge et al., 2012). One of the survey questions asked what attributes were perceived as important for biocurators to possess. Theoretical knowledge, formal scientific training, written and verbal communication, and experimental experience were perceived as the most important in that study. Formal training in data management/curation and scripting and programming skills were considered less important. A significant number of respondents in that study indicated that they would benefit from training in computer languages and bioinformatics. Interviewees of the current study mentioned all of the attributes (skills) identified in the ISB’s survey. Skills that were missing from the survey but found in the current study include ontologies, domain-specific biological knowledge, staying current on biology, reading scientific literature, detail orientation, statistics, collaboration and cooperation, and user studies. The differences between the two studies may be explained by the fact that the skills identified in the current study were specific to the data curation of the GO and that the interviewees included the GO users (e.g., bench scientists, bioinformaticians), who might have different perspectives from biocurators. Huang et al. (2012) conducted a survey to generate a typology of 17 data quality assurance skills needed in genome annotation. The data curation skills for the GO identified in the current study overlap somewhat with the data quality assurance skills for genome

annotation, such as data mining, software tools, Structured Query Language (SQL), statistical skills, data warehouse setup, information overload, and user requirement. The first three skills correspond to the computational and bioinformatics skills identified in the current study. Data warehouse setup and information overload are similar to the data collection and organization skills required for GO’s data curation. User requirement parallels the user study and communication skills identified in the current study. A set of data-quality literacy skills reported in Huang et al. (2012) was not found in the current study. Future research can examine more specific research questions about the data quality assurance skills needed for the data curation of the GO. The process of manual GO annotation shares some similarity with that of subject indexing in libraries. However, GO annotation goes one step further, extracting genes or gene products from the literature and associating them with the GO terms. The following section discusses the data curation skills required for the GO and compares them to those necessary for subject indexing in libraries. The comparison may suggest the skills needed by curators of scientific data repositories and metadata librarians of digital libraries, and inform curriculum design and training in LIS and Data Science. Similar to catalogers or indexers, the GO curators are required to have the skill of reading scientific literature and a basic understanding of the controlled vocabularies—in this case, the GO. Catalogers or indexers may be required to have a second master’s degree or know a second language to index and classify books in a specific domain or language. Similarly, besides basic biological knowledge, the GO curators are expected to have species-specific knowledge, which is useful for annotating gene products of a particular species and developing the GO terms appropriate for that species.
Comparable to reference librarians, the GO curators need community outreach and communication skills to answer user questions, review requests, instruct bench scientists in using the GO, and provide users with technical support for software tools. Interestingly, three biocurators interviewed in the current study mentioned science experience and laboratory techniques, which are necessary for curators to understand the literature and determine whether a particular assay claimed in an article is appropriate. To better represent and organize scientific data, digital libraries and data repositories should consider training their metadata librarians or data curators to acquire, or at least know about, the basic experimental techniques of specific domains. Not surprisingly, the GO curators are

expected to have some computational and statistical skills to conduct data analysis, such as sequence similarity analysis using BLAST. With the data deluge in science, engineering, and medicine, Twidale, Blake, and Gant (2013) argued that data literacy skills (e.g., computational technologies, statistics) are critical not only for information professionals, data curators, and librarians, but also for citizens. Following their suggestions and the findings of this study, the education and training of data curators and metadata librarians should include data literacy skills.

5.2 Data Quality Structure of the GO

The second research question examines the data quality structure of the GO, including the types of data quality problems present in the GO, the sources of these problems, the corresponding data quality assurance actions taken to resolve them, the data quality criteria perceived as important, and the reference sources used for data quality assessment.

5.2.1 Types of Data Quality Problems

Synthesizing the findings from the three research methods, a typology of 28 data quality problems was identified in the GO. These data quality problems were found in the activities of assigning GO terms to genes or gene products (i.e., manual GO annotation) and of using the GO for different research purposes (e.g., obtaining a functional profile of a gene set). According to Noy and McGuinness (2001), an ontology is a formal, explicit, and machine-readable representation of a phenomenon using a set of concepts, properties of concepts, constraints on concepts, and relations between the concepts. There are usually no preferred or authorized terms selected for concepts in ontologies (Jacob, 2003; Taylor & Joudrey, 2009). The GO is clearly more than an ontology: it also serves as controlled vocabularies, containing preferred terms (to standardize the naming and spelling of concepts), synonyms (to improve recall for users), and references (Hjørland, 2007; Hodge, 2000; Svenonius, 2003). The National Information Standards Organization (2005) defined a controlled vocabulary as “a list of terms that have been enumerated explicitly” and controlled (p. 5). A vocabulary is controlled if it contains a restricted subset of terms that are authorized for use (Svenonius, 2003). In LIS, controlled vocabularies are mainly used to facilitate information retrieval and can take a variety of forms, including name authority files, classification schemes, subject headings, and thesauri (Stvilia et al., 2012; Svenonius, 2003). Terms in a controlled vocabulary are usually

related through equivalence, hierarchical, and related-term (RT) relationships. The relationships between GO terms are more flexible, however: a term can be linked to another via ‘is_a,’ ‘part_of,’ ‘has_part,’ or ‘regulates.’ Nor is this an exhaustive set of the relations used in the GO; other types of relations (e.g., ‘occurs_in’) can be found in GO’s logical definitions and annotation extensions (Gene Ontology Consortium, 2014o). In addition, the GO differs from traditional controlled vocabularies in that synonyms of GO terms are related through different types of relationships (e.g., exact, narrow, broad, and related) (Gene Ontology Consortium, 2014g). Meanwhile, the GO database curates not only the controlled vocabularies but also the annotation data, that is, the associations between GO terms and genes or gene products. The complexity of the GO accounts for the variety of data quality problems identified in this study, which can be divided into ontology-related, controlled-vocabulary-related, and annotation-related problems.

Ontology-related data quality problems include concept-related problems (i.e., incomplete GO terms, invalid GO terms, ambiguous GO terms, redundant GO terms, and lack of specificity in the GO terms), definition-related problems (i.e., incomplete GO term definitions, inaccurate GO term definitions), relation-related problems (i.e., inaccurate placement of the GO terms, inaccurate relationships between the GO terms, incomplete relationships between the GO terms, and an incomplete set of relations), and constraint-related problems (i.e., inaccurate taxon constraints on the GO terms, incomplete taxon constraints on the GO terms).
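The term structure described above, with preferred names, scoped synonyms, and typed relations such as ‘is_a’ and ‘part_of,’ can be sketched as a small in-memory graph. The sketch below uses real GO identifiers, but the miniature graph is heavily pruned and its synonym and relation details are simplified for illustration; it is not the GO's actual data model.

```python
# Minimal, illustrative model of GO-style term records: each term has a
# preferred name, scoped synonyms (exact/narrow/broad/related), and typed
# relations (is_a, part_of). The graph is pruned and simplified.
TERMS = {
    "GO:0005575": {"name": "cellular_component", "synonyms": [], "rels": []},
    "GO:0043226": {"name": "organelle", "synonyms": [],
                   "rels": [("is_a", "GO:0005575")]},
    "GO:0005739": {"name": "mitochondrion",
                   "synonyms": [("exact", "mitochondria")],
                   "rels": [("is_a", "GO:0043226")]},
    "GO:0005743": {"name": "mitochondrial inner membrane", "synonyms": [],
                   "rels": [("part_of", "GO:0005739")]},  # simplified path
}

def ancestors(term_id, relations=("is_a", "part_of")):
    """Transitive closure over the selected relation types."""
    seen = set()
    stack = [term_id]
    while stack:
        for rel, parent in TERMS[stack.pop()]["rels"]:
            if rel in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# A tool propagating annotations up the graph would treat an annotation to
# "mitochondrial inner membrane" as implying its parents:
print(sorted(ancestors("GO:0005743")))  # ['GO:0005575', 'GO:0005739', 'GO:0043226']
```

Unlike a thesaurus's fixed BT/NT/RT relations, restricting the traversal to particular relation types (e.g., only ‘is_a’) changes which inferences are licensed, which is one reason typed relations matter for annotation tools.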
Controlled-vocabulary-related data quality problems contain GO term name related problems (i.e., incomplete GO term names, inaccurate GO term names, incorrect selection of the preferred GO terms, and spelling errors in the GO term names), synonym-related problems (i.e., incomplete synonyms, inaccurate synonyms, and incorrect categorization of synonyms), and reference-related problems (i.e., inaccurate references, incomplete references, and inconsistent identifiers used in references). Annotation-related data quality problems comprise incomplete GO annotations, inconsistent GO annotations, misannotations, annotation imbalance, and inaccurate GO term enrichment. When mapped to Stvilia’s IQ Assessment Framework, these data quality problems cluster into ambiguity, inaccuracy, incompleteness, inconsistency, redundancy, and unnaturalness. In a previous study, Stvilia (2006) assessed the quality of an aggregated metadata collection of the Colorado Digitization Program of Cultural Heritage, which harvested data

objects from more than 30 data providers in simple Dublin Core format. By analyzing a random sample of 150 records drawn from the aggregated collection, Stvilia (2006) identified five clusters of metadata quality problems: incompleteness, redundancy, ambiguity, inconsistency, and inaccuracy. The current study identified one additional type of data quality problem in the GO: unnaturalness. This was because the GO was used as controlled vocabularies: users found that some of the preferred GO terms were not the conventional ones they would choose for searching (Stvilia et al., 2007).

5.2.2 Sources of Data Quality Problems and Assurance Actions

Integrating the findings from the three research methods, the sources of GO's data quality problems include new knowledge or discoveries, curator errors, curator bias, lack of curators, lack of experts in specific biological domains, literature bias, literature errors, quality problems with the experimental data, inconsistent mapping to external ontologies or databases, errors imported from external ontologies or databases, inconsistent naming of molecular entities, variance in gene names, variance in GO annotation policies, variance in annotation conventions, annotations based on sequence similarity, automatic GO annotations, and software errors. In most cases, these sources led to data quality problems classified as incompleteness or inaccuracy.

Incompleteness is the most frequently occurring type of data quality problem in the archival data. The article annotated in the participation activity of the current study was published in 2003 but had not previously been manually annotated by the GO Consortium; this can be classified as an incomplete GO annotation. Incompleteness was also identified in the interview data, in the forms of incomplete GO annotations and annotation imbalance. Most of the data quality problems falling into incompleteness were caused by new knowledge or discoveries, a lack of curators to annotate the scientific literature, a lack of experts in specific biological domains to develop and update the controlled vocabularies, variance in gene names, curator bias, and literature bias. This may be because modern biology is a fast-moving, data-driven field.
Advances in high-throughput techniques, the open access of biological databases (e.g., GenBank, UniProt), the diversity of biological communities and domains, the increasing volume of scientific data and publications, diverse and controversial nomenclature (e.g., gene names), and a lack of funding all pose great challenges for biocurators in manually annotating the scientific literature and controlling the quality of their annotations (Burge et al., 2012; Burkhardt, Schneider, & Ory, 2006; Salimi & Vita, 2006).

To resolve the incompleteness problems, the GO Consortium employs a collaborative approach, bringing in different biological communities to collectively develop and maintain the Ontology. The use of GO's Ontology Requests Tracker and TermGenie can help identify incomplete GO terms and expedite the process of adding and reviewing new GO terms. The adoption of community annotations may help reduce the GO curators' workload.

Most of the data quality problems classified as inaccuracy were caused by curator errors, new knowledge or discoveries, literature errors, quality problems with the experimental data, inconsistent mapping to external databases, annotations based on sequence similarity, automatic GO annotations, and software errors. Subject indexing is prone to inter-indexer inconsistency, intra-indexer inconsistency, and errors caused by indexers' misinterpretation of documents (Cleveland & Cleveland, 2001). Similarly, manual literature annotation is subject to curators' misinterpretation of the literature, inconsistency in annotating, and bias in selecting articles to annotate (Dutkowski et al., 2013). The GO database also contains computational annotations, which were generated based on sequence or structure similarity or by mapping annotations created with other controlled vocabularies. There have been cases where bench scientists later conducted experiments and found errors or incompleteness in computational annotations. Because the GO cross-references external ontologies and databases, any inconsistent mapping will introduce inaccuracy into the GO terms and may lead to misannotations. Likewise, any errors in external ontologies or databases may be imported into the GO and cause inaccuracy in the GO database. The software errors were caused by external communities or groups who built software tools on the GO data but failed to keep up to date with the GO database.
To resolve inaccuracy problems, the GO Consortium has established annotation guidelines and policies to ensure curators' annotation consistency. It has also implemented annotation quality control checks to help detect and correct misannotations. Annotation extensions (column 16) provide curators with more contextual information for checking or reviewing annotations. Despite the possibility of importing errors from external ontologies or databases, the purpose of cross-referencing is to improve the accuracy, completeness, and currency of the GO. GO's Annotation Issues Tracker at SourceForge allows different biological communities to collaboratively review and correct misannotations. The GO Consortium's participation in developing and maintaining GO term enrichment analysis tools (e.g., PANTHER) may help improve the accuracy of gene set enrichment analysis. There are external

communities or groups in bioinformatics and computer science developing models or algorithms to help detect or predict GO misannotations.

Leonelli et al. (2011) conducted informal interviews with GO curators and identified five sources that could cause quality change in the GO: (a) mismatch between the GO and reality, (b) scope extension of the GO, (c) divergence in how the GO is used across communities, (d) new discoveries that change the meaning of GO terms and their relationships, and (e) the addition of new relations. All of these were identified in the archival and interview data of the current study. Pre-composing GO terms using concepts or terms from external ontologies or databases is an example of scope extension; any quality changes to the external ontologies or databases may lead to quality changes in the GO. The unnaturalness problem discussed in Section 4.1.4.7 is an example of how GO terms were used differently across communities. Through its interviews with GO users, the current study found additional sources that could cause data quality problems in the Ontology and annotation data. In a previous study, Stvilia et al. (2014) surveyed members of the Condensed Matter Physics (CMP) community and found nine sources of data quality problems in CMP. Of those nine sources, three are similar to the ones identified in the current study: human errors, software errors, and changes in the context of data (e.g., new knowledge or technology). However, the sources of GO's data quality problems are more complex, because the activity systems around the GO involve a diversity of communities with different objectives.
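The annotation quality control checks mentioned above can be illustrated with a toy sketch of one class of automated check: flagging annotations that violate taxon constraints. The constraint and annotation records below are invented for illustration (the GO term and taxon identifiers are real, but the rule set and gene identifiers are hypothetical); the GO Consortium's actual checks are far richer and operate over the full NCBI taxonomy hierarchy.

```python
# Toy sketch of a "never_in_taxon"-style annotation quality-control check.
# Real GO checks also handle "only_in_taxon" constraints and walk the
# taxonomy tree; this sketch matches taxa literally for simplicity.
NEVER_IN_TAXON = {
    # GO term id -> taxa to which the term must never be annotated
    # (e.g., a muscle-development term should not be annotated to yeast)
    "GO:0007517": {"taxon:4932"},
}

def check_annotations(annotations):
    """Return annotations violating a never_in_taxon constraint.

    Each annotation is a (gene_id, go_id, taxon_id) triple, loosely
    mirroring columns of a GO annotation (GAF) file.
    """
    return [a for a in annotations
            if a[2] in NEVER_IN_TAXON.get(a[1], set())]

annotations = [
    ("GENE001", "GO:0007517", "taxon:7227"),  # fly: allowed
    ("GENE002", "GO:0007517", "taxon:4932"),  # yeast: flagged
]
print(check_annotations(annotations))  # [('GENE002', 'GO:0007517', 'taxon:4932')]
```

Checks of this kind are cheap to run on every release, which is why they complement, rather than replace, manual review of flagged annotations.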

5.2.3 Data Quality Criteria

Research question 2.4 examined the data quality criteria perceived as important for the GO. The interview data suggest a set of data quality criteria that the interviewees considered important (see Table 4.4). The study by Huang et al. (2012) identified a typology of 17 data quality criteria perceived as important in genome annotation. The data quality criteria found in the current study can be mapped to those in Huang et al. (2012), except for stability and redundancy. As a KO system, the GO is expected to be sustainable in terms of content and data and stable in terms of usage, despite the dynamics of modern biology. The concept of redundancy can be understood differently depending on the context of the activities. In the context of developing GO terms, redundancy can be understood as pre-composing the GO terms using

concepts/terms from external ontologies or databases. As a GO Project member mentioned, GO's redundancy or overlap with other ontologies allows for curator efficiency, saving the time and effort curators would otherwise spend looking up numerous ontologies to represent what they read in the literature. In developing its controlled vocabularies, the GO Consortium balances cohesiveness against redundancy. The GO Project member specified that a better way to develop GO terms is “to have cross-references to the three or more different classes [ontologies] and then make your, compose your class terms from these other class terms.” Meanwhile, redundancy, or cross-referencing to external ontologies, means that the quality of the GO may to some extent depend on the quality of those external ontologies. The GO must not only maintain consistent mappings to those external ontologies but also be able to detect and correct any imported errors.

Stvilia (2007) proposed a quality evaluation model for biodiversity ontologies consisting of 12 quality dimensions: accuracy, cohesiveness, complexity, semantic consistency, structural consistency, currency, redundancy, naturalness, completeness, verifiability, volatility, and authority. All of these quality dimensions can be mapped to the quality criteria identified in the current study except naturalness and cohesiveness. Even though the interviewees did not mention naturalness, the archival data imply that GO curators and users require naturalness in GO terms to improve recall in searching (see Section 4.1.4.7). Cohesiveness refers to the extent to which the content of an ontology concentrates on one topic (Stvilia, 2012). This criterion does not apply to the GO, as the GO cross-references other ontologies and databases and pre-composes its terms using external concepts or terms.
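Maintaining consistent mappings to external ontologies and detecting imported errors, as discussed above, lends itself to simple automated comparison. The sketch below compares locally recorded cross-references against a current snapshot of an external ontology; all identifiers and labels here are fabricated placeholders (loosely modeled on a ChEBI-style vocabulary), not actual GO or ChEBI data.

```python
# Toy sketch of a cross-reference consistency check between GO-side
# mappings and a snapshot of an external ontology. A label mismatch or a
# vanished identifier marks a cross-reference as needing curator review.
go_xrefs = {
    # GO term id -> (external id, label recorded when the xref was made)
    "GO:0000001": ("CHEBI:11111", "alpha-D-glucose"),
    "GO:0000002": ("CHEBI:22222", "ethanol"),
}

external_snapshot = {
    "CHEBI:11111": "alpha-D-glucose",
    # CHEBI:22222 was renamed upstream, so the recorded label is stale:
    "CHEBI:22222": "ethanol (renamed)",
}

def stale_xrefs(xrefs, snapshot):
    """GO terms whose external label changed (or disappeared) upstream."""
    return {go_id for go_id, (ext_id, label) in xrefs.items()
            if snapshot.get(ext_id) != label}

print(stale_xrefs(go_xrefs, external_snapshot))  # {'GO:0000002'}
```

A check like this only flags candidates; deciding whether an upstream change is a rename, a correction, or an imported error still requires curator judgment.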
Interestingly, when asked what criteria were used to select cross-referencing ontologies for the GO, two of the interviewees mentioned cohesiveness. A GO Project member explained the reason for GO's cross-referencing to ChEBI and the Cell Type Ontology:

    The more elemental they are, the better. ChEBI is sort of… They are well scoped. The remit of ChEBI is very tightly scoped. It’s chemical compounds. They are just like that’s it. That is all they do. And they do it well. And so that’s one of the things. So I think the Cell Type [Ontology] is pretty much the same thing. It’s nothing but cell types. If it’s bigger than a cell tissue, no, not in it. So ChEBI [has] really clear scoping and limited things, so that very, very elemental.

5.3 Conclusion and Future Research

This study applied Activity Theory (Engeström, 1990; Leont’ev, 1978) and a theoretical IQ Assessment Framework (Stvilia et al., 2007) to examine the infrastructure supporting the development, maintenance, and use of a large-scale KO system shared among different biological communities. The concepts and principles of those theories helped formulate the research questions, develop the research design, and create a coding scheme for the data organization and analysis of this study. Employing a netnographic approach (Kozinets, 2010), this study gathered data in a natural setting via archival data analysis, participant observations, and qualitative semi-structured interviews (Blee & Taylor, 2002; Kazmer & Xie, 2008; Lincoln & Guba, 1985) to investigate the data work organization of the GO. The archival data analysis enabled the researcher to identify communities participating in ontology development and maintenance, learn about the activity systems around the GO, develop a typology of GO data quality problems with corresponding assurance actions, and prepare for participant observations. During participant observations, the researcher further learned about the culture, norms, and practices of those communities; participated in manual GO annotation; interacted with GO curators; took fieldnotes to record her impressions of and reflections on community practices and cultures; generated a list of additional questions for the interviews; and identified a list of potential interviewees. The semi-structured interviews with different groups of people allowed the researcher to validate and extend the findings attained from the archival data analysis, gain a deeper understanding of those findings, identify a set of skills perceived as important for the data curation of the GO, and learn which quality criteria were considered important for the GO.
The research design and theories used in this study can be applied to other studies examining the infrastructure of large-scale sociotechnical systems (Kling, 1999). All the research questions formulated in the study were addressed, and a synthesized view of the findings gained from the three research methods was presented in this chapter. The findings indicated that the GO was collaboratively developed and maintained by a consortium of biological communities, mainly model organism databases. To facilitate communication and collaboration, representatives from each GO Consortium member were assigned the role of GO curator and formed several groups working on different aspects of the Ontology. The division of labor within the GO Consortium ensures that the formidable ontology development

and maintenance process can be divided into manageable projects and that the needs of different communities can be met. The GO Consortium comprises not only biocurators but also software engineers and bioinformaticians, who provide the Consortium and its users with technical and software support. As an open community, the GO Consortium has been bringing in new groups or communities, and it welcomes any individual to submit content and annotations for inclusion in the GO database through GO's Helpdesk and request trackers at SourceForge. The GO curators from each member database rotate to serve at the helpdesk and request trackers, answering questions and reviewing suggestions. GO's collaborative development approach, which involves different communities in controlling ontology quality, contributing content and annotations, and exchanging viewpoints and expertise, can be adopted by other similar ontologies or large-scale scientific data repositories.

The GO Consortium also engages in developing software tools. However, most of these tools were created to support the activities of ontology development and maintenance, and they could not help different user communities meet the objectives of their research activities. The GO Consortium and its directors group were dominated by model organism databases; concerns remain about whether the needs of other communities can be met and whether their experimental data can be captured by the GO. The GO Consortium consists of biocurators and scientists from Europe or North America, but the GO users are located all over the world. Questions exist about whether the GO should bring in communities from non-English-speaking countries and be receptive to their needs. Due to the complexity of the GO, this study identified a typology of data quality problems that could be divided into ontology-related, controlled-vocabulary-related, and annotation-related problems.
Based on Stvilia's IQ Assessment Framework, these data quality problems could be clustered into incompleteness, inaccuracy, inconsistency, redundancy, ambiguity, and unnaturalness (Stvilia et al., 2007). The GO Consortium has developed tools and implemented mechanisms to help improve the quality of the Ontology and annotation data, such as TermGenie, annotation quality control checks, and cross-referencing to external ontologies. The typology of data quality problems, their sources and corresponding assurance actions, and the data quality criteria the interviewees perceived as important can serve as a conceptual data quality model for the GO Consortium to develop context-specific metrics to assess and assure the data quality of the GO. The data curation skills that the interviewees perceived as important for the GO

can not only inform the training of GO curators and other biocurators but also offer new insight into the training of metadata librarians and curriculum design in LIS and Data Science.

Future research includes examining GO's Annotation Issues Tracker at SourceForge to identify the quality problems of GO's annotation data. Based on the findings of the current study, a survey will be developed and conducted with different biological communities to identify their data quality requirements for the GO and the data curation skills they perceive as important for the GO. The findings of the survey can help further develop quantitative models and metrics to evaluate and assure the data quality of the GO. A data quality change model will also be constructed to identify the quality dynamics of the GO and corresponding quality maintenance interventions. Ethnographic or netnographic studies can be conducted with different groups and teams within the GO Consortium to investigate their data practices and collaboration patterns, which can inform the design of support repertoires for scientific teams. Ethnographic studies can also be performed with individual laboratories to identify how bench scientists use the GO for different research purposes, which can offer new insight for the development and maintenance of the GO. Since the GO serves as controlled vocabularies, future studies can compare the authority control practices of the GO with those in libraries, which can inform librarians and library educators about the requirements for authority control in molecular biology and biomedicine, and help align library authority control models and vocabularies with the needs of scientific data curation. With the growing adoption of community annotation, future research can study the annotation practices of individual scientists or laboratories, and evaluate or assure the quality of community-contributed annotation data.

APPENDIX A

INTERVIEW GUIDE

A. Thank you for participating in this study. The purpose of this research is to gain an understanding of the data work organization of the Gene Ontology (GO). In particular, I am interested in how you use the GO to annotate and manage your data, how you help develop and maintain the GO, and how you perceive the quality of GO. I have several questions to ask, and hope you will feel free to talk about any experiences or ideas that come to mind.

B. I would like to record our conversation in order to facilitate note taking. The recording will be transcribed. Please remember that you may ask to turn off the recorder at any time. Do I have your permission to record this conversation? Yes____ No____

Background Information

1. Can you tell me about your current position?
2. What is your highest degree? Where and when did you get your highest degree? What was the formal discipline of your degree?
3. What are your research interests? Do you hold any other professional positions? If yes, can you tell me what they are?

Activities around the GO

4. Have you heard of the GO before? If so, how did you know about it?
5. Did you use the GO before? If yes, how long have you used it and how often do you use it? If no, why don't you use it?
6. How do you use the GO? Can you give me an example of using the GO?
7. What are some of the tools that help you use the GO?
8. Are you aware of any policies, rules, or norms that influence your use of the GO? If yes, can you tell me what they are? How did you know about these policies, rules, or norms?
9. Do you know anyone who has used the GO before? If so, can you tell me how she/he uses the GO?
10. What skills do you think are needed for the data curation of the GO?

11. Do you know that the GO has created several request trackers at SourceForge to allow users to submit suggestions for changes to the Ontology? If so, how did you know about these request trackers?
12. Did you use the Ontology Requests Tracker before? If yes, how long have you used it and how often do you use it? If no, can you tell me why you don't use it?
13. How do you use GO's Ontology Requests Tracker? Can you provide me with an example of using that Tracker?
14. What policies, rules, or norms influence your activities in GO's Ontology Requests Tracker? How did you know about these policies, rules, or norms?
15. Do you know anyone who has used GO's Ontology Requests Tracker before? If so, can you tell me how she/he uses it?
16. In addition to the Ontology Requests Tracker, do you use any other request trackers? If yes, can you describe how you use them? If no, why don't you use other request trackers?

Data Quality Structure of the GO

17. What do you consider some of the most important criteria that distinguish high-quality bio-ontologies from low-quality bio-ontologies?
18. Did you encounter any data quality problems with the GO before? If so, can you describe what the problem was and how you solved it?
19. Are you familiar with any policies, procedures, rules, or conventions for data quality assurance adopted by the GO or other communities? If so, how do you employ them?
20. Did you use any other ontologies? If yes, why and how do you use them? If no, why don't you use other ontologies?

APPENDIX B

RESEARCH QUESTIONS AND CORRESPONDING THEORIES AND DATA COLLECTION METHODS

Research Questions | Theories/Concepts | Data Collection Methods | Interview Questions

1. What are some of the activities around the GO? What are their objects (objectives)? | AT: Activity, Object; IQAF: Data activity | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #3, #6, #9, #13, #15, #16

1.1. What are some of the communities participating in these activities? | AT: Community | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #1, #3, #4, #8, #9, #11, #14, #15

1.2. What is the division of labor within these activities? | AT: Division of Labor | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #3

1.3. What are some of the tools used in these activities? | AT: Tools | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #7

1.4. What are some of the norms and rules regulating these activities? | AT: Rules | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #8, #14

1.5. What are some of the contradictions within and between these activities and how are these contradictions resolved? | AT: Contradictions | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #5, #12, #16, #20

1.6. What are some of the skills needed for data curation of the GO? | AT: Division of Labor | Semi-structured Interviews | #10

2. What is the data quality structure of the GO? | IQAF: Conceptual data quality model | (none) | (none)

2.1. What are some of the types of data quality problems in the GO? | IQAF: Data quality problems | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #18

2.2. What are some of the sources of these data quality problems? | IQAF: Sources of data quality problems | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #18

2.3. What are some of the corresponding quality assurance actions taken to resolve these data quality problems? | AT: Actions; IQAF: Data quality assurance actions | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #18

2.4. What data quality criteria are considered important for the GO? | IQAF: Data quality criteria/dimensions | Semi-structured Interviews | #17

2.5. What are some of the policies, procedures, rules, or conventions for data quality assurance adopted by the GO? | IQAF: Reference bases; AT: Rules | Archival Data Analysis; Participant Observations; Semi-structured Interviews | #19

AT: Activity Theory
IQAF: Stvilia's Information Quality Assessment Framework

APPENDIX C

INITIAL CODING SCHEME

Activity: a complex system of related elements, including subject, object, community, division of labor, rules, tools, actions, and outcome

Object: an objective held by the subject motivating an activity

Tools: artifacts, abstract or physical, used by the subject of an activity

Community: a group of people who share the same object

Rules: explicit or implicit norms, conventions, and regulations that enable or limit the actions, operations, and interactions within an activity system

Division of Labor: both the horizontal division of tasks between members of the community and the vertical division of power and status

Outcome: the result of an activity after the mediation of tools

Contradictions: historically accumulated tensions or instabilities within or between activity systems, playing a central role in changing, learning, and developing these activities

Internalization: an individual's thought activity to reason about and reconstruct external objects or acquire new abilities, such as one's activity of learning community policies and rules

Externalization: the process by which people manifest, verify, and correct their mental models through external actions, such as one's activity of formulating new community policies and rules

Data Quality Problem:

any problem that occurs when the data cannot meet the needs and requirements of the activities in which they are used

Source of Data Quality Problem: any source that leads to data quality variance and may suggest both the types of data quality problems and the types of data quality assurance actions

Data Quality Assurance Action: any action taken by the subject to improve the data quality to meet the needs and requirements of the activities in which the data are used

Data Quality Criterion: a set of attributes that represents a single aspect or construct of data quality (any component of the data quality concept)

APPENDIX D

COMMUNITIES AROUND THE ONTOLOGY REQUESTS TRACKER

Community | URL | Requests Submitted | GOC Member

AspGD-CGD | http://www.aspergillusgenome.org/ | 13 | Yes
Berkeley Bioinformatics Open-source Projects (BBOP) | http://www.berkeleybop.org/ | 14 | Yes
British Heart Foundation, University College London-based annotation group (BHF-UCL) | http://www.ucl.ac.uk/functional-gene-annotation/ | 39 | Yes
BioModels Database (BIOMD) | http://www.ebi.ac.uk/biomodels-main/ | 1 | No
Community Assessment of Community Annotation with Ontologies (CACAO) | http://gowiki.tamu.edu/wiki/index.php/Category:CACAO | 2 | No
Computer Analysis and Laboratory Investigation of Proteins of Human Origin (CALIPHO) | http://web.expasy.org/groups/calipho/ | 1 | No
Candida Genome Database (CGD) | http://www.candidagenome.org/ | 2 | Yes
Cell Type Ontology (CL) | https://code.google.com/p/cell-ontology/ | 1 | No
dictyBase | http://dictybase.org/ | 7 | Yes
EcoliWiki | http://ecoliwiki.net/colipedia/index.php/Welcome_to_EcoliWiki | 2 | Yes
EcoCYC | http://ecocyc.org/ | 1 | No
European Molecular Biology Laboratory (EMBL) at Heidelberg | http://www.embl.de/index.php | 1 | No
FlyBase | http://flybase.org/ | 8 | Yes
GO Editorial Office at EBI | http://www.ebi.ac.uk/services/teams/go | 6 | Yes
Gene Ontology Annotation (UniProt-GOA) Database | http://www.ebi.ac.uk/GOA | 39 | Yes
InterPro | http://www.ebi.ac.uk/interpro/ | 1 | Yes
MetaCyc | http://metacyc.org/ | 9 | No
Mouse Genome Informatics (MGI) | http://www.informatics.jax.org/ | 0 | Yes

195 # of Requests GOC Communities URLs Submitted Members MTBbase http://www.ark.in- 2 No berlin.de/Site/MTBbase.html neXtProt http://www.nextprot.org/ 4 No Ontology for Biomedical http://bioportal.bioontology.org/ont 2 No Investigations (OBI) ologies/OBI Plant-associated Microbe http://pamgo.vbi.vt.edu/ Gene Ontology 0 Yes (PAMGO) Plant Ontology (PO) http://www.plantontology.org/ 1 No PomBase http://www.pombase.org/ 77 Yes Reactome http://www.reactome.org/ 8 Yes RESID Database of Protein http://pir.georgetown.edu/resid/ 0 No Modifications Rat Genome Database http://rgd.mcw.edu/ 5 Yes (RGD) Saccharomyces Genome http://www.yeastgenome.org/ 31 Yes Database (SGD) SRI International http://www.sri.com/ 1 No The Arabidopsis https://www.arabidopsis.org/ Information Resource 6 Yes (TAIR) UniProt http://www.uniprot.org/ 23 Yes WormBase http://www.wormbase.org/#01-23-6 5 Yes Zebrafish Model Organism http://zfin.org/ 6 Yes Database (ZFIN) Individual scientists* 2 No Total 320 *cannot identify these scientists’ communities

APPENDIX E

TOOLS USED TO RESOLVE DATA QUALITY PROBLEMS OF THE GO

GO’s software tools
  AmiGO | http://amigo.geneontology.org/amigo
  QuickGO | http://www.ebi.ac.uk/QuickGO
  GO Slim | http://geneontology.org/page/go-slim-and-subset-guide
  TermGenie | http://go.termgenie.org/
  GO’s Annotation Quality Control Checks | http://geneontology.org/page/annotation-quality-control-checks
  GO’s Annotation Extension | http://geneontology.org/page/annotation-extension

GO’s cross-referencing files
  GOCHE
  InterPro2GO | http://www.geneontology.org/external2go/interpro2go
  Protein2GO
  Rhea2GO | http://geneontology.org/external2go/rhea2go
  UniPathway2GO | http://geneontology.org/external2go/unipathway2go

GO’s other databases
  GO Subversion (SVN) Repository | http://viewvc.geneontology.org/viewvc/GO-SVN/trunk/
  GO Annotation File (GAF) | http://geneontology.org/page/go-annotation-file-gaf-format-20

GO Projects
  Viral Process | http://geneontology.org/page/go-projects

GO’s documentation
  GO Wiki | http://wiki.geneontology.org/index.php/Main_Page
  GO’s official Website | http://geneontology.org/
  GO’s SourceForge requests | http://sourceforge.net/projects/geneontology/?source=navbar

GO meetings
  GO Cambridge Editors meetings
  GO Editors calls
  GO Developers calls | http://wiki.geneontology.org/index.php/GO_Managers
  GO-ChEBI meetings
  PAINT Workshop
  The GO Consortium meetings | http://wiki.geneontology.org/index.php/Consortium_Meetings

Scientific literature
  PubMed | http://www.ncbi.nlm.nih.gov/pubmed
  PubMed Central (PMC) | http://www.ncbi.nlm.nih.gov/pmc/

Encyclopedias
  Wikipedia | http://www.wikipedia.org/
  Cytokines & Cells Online Pathfinder Encyclopedia (COPE) | http://www.copewithcytokines.de/cope.cgi?key=KC

Books
  Textbooks
  Wikibooks | http://en.wikibooks.org/wiki/Main_Page

Scientific thesauri
  National Cancer Institute Thesaurus | http://ncit.nci.nih.gov/

Scientific nomenclatures
  International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature | http://www.chem.qmul.ac.uk/iubmb/enzyme/

Dictionaries
  Wiktionary | https://www.wiktionary.org/
  The Free Dictionary | http://www.thefreedictionary.com/

Other bio-ontologies
  Cell Type Ontology (CL) | https://code.google.com/p/cell-ontology/
  The Ontology of Chemical Entities of Biological Interest (ChEBI) | http://www.ebi.ac.uk/chebi/
  Neuroscience Information Framework (NIF) Subcellular Ontology | http://bioportal.bioontology.org/ontologies/NIFSUBCELL?p=classes
  OBO Relation Ontology | http://www.obofoundry.org/ro/
  Plant Ontology | http://plantontology.org/
  Protein Ontology | http://pir.georgetown.edu/pro/pro.shtml

Biological databases
  AspGD-CGD | http://www.aspergillusgenome.org/
  BioCyc | http://biocyc.org/
  BRENDA | http://www.brenda-enzymes.org/
  ExPASy | http://www.expasy.org/
  FlyBase | http://flybase.org/
  GeneCards | http://www.genecards.org/
  GeneDB | http://www.genedb.org
  Immune Epitope Database (IEDB) | http://www.iedb.org/
  InterPro | http://www.ebi.ac.uk/interpro/
  KEGG | http://www.genome.jp/kegg/
  MetaCyc | http://metacyc.org/
  Mouse Genome Informatics (MGI) | http://www.informatics.jax.org
  Neuromuscular | http://neuromuscular.wustl.edu
  Online Mendelian Inheritance in Man (OMIM) | http://www.omim.org/
  PANTHER | http://pantherdb.org/
  Protein Database of 3D Structures in the Protein Data Bank (PDBsum) | http://www.ebi.ac.uk/pdbsum/
  PomBase | http://www.pombase.org/
  Protein Information Resource (PIR) | http://pir.georgetown.edu/
  RESID Database of Protein Modifications | http://pir.georgetown.edu/resid/
  Saccharomyces Genome Database (SGD) | http://www.yeastgenome.org/
  Uberon | http://uberon.github.io/
  UniPathway | http://www.grenoble.prabi.fr/obiwarehouse/unipathway
  UniProt | http://www.uniprot.org/
  Zebrafish Model Organism Database (ZFIN) | http://zfin.org/

Other biological communities

APPENDIX F

SCIENTIFIC LITERATURE REFERENCES ON THE ONTOLOGY REQUESTS TRACKER

Journal | Frequency | Percentage
Journal of Biological Chemistry | 32 | 10.00%
Molecular Biology of the Cell | 13 | 4.06%
Proceedings of the National Academy of Sciences | 12 | 3.75%
Genes & Development | 9 | 2.81%
Cell | 7 | 2.19%
Biochemical Journal | 5 | 1.56%
Current Biology | 5 | 1.56%
Journal of Cell Science | 5 | 1.56%
Applied & Environmental Microbiology | 4 | 1.25%
Biochemical & Biophysical Research Communications | 4 | 1.25%
EMBO Journal | 4 | 1.25%
Journal of Immunology | 4 | 1.25%
Molecular & Cellular Biology | 4 | 1.25%
Molecular Cell | 4 | 1.25%
BBA Molecular & Cell Biology of Lipids | 3 | 0.94%
Biochimie | 3 | 0.94%
Journal of Cell Biology | 3 | 0.94%
Journal of Neuroscience | 3 | 0.94%
Molecular Microbiology | 3 | 0.94%
Nature | 3 | 0.94%
Nature Cell Biology | 3 | 0.94%
Nucleic Acids Research | 3 | 0.94%
PLoS One | 3 | 0.94%
Science | 3 | 0.94%
Annual Plant Reviews | 2 | 0.63%
Annual Review of Genetics | 2 | 0.63%
Cellular Signaling | 2 | 0.63%
Cold Spring Harbor Perspectives in Biology | 2 | 0.63%
Current Opinion in Cell Biology | 2 | 0.63%
Current Opinion in Microbiology | 2 | 0.63%
Developmental Biology | 2 | 0.63%
Developmental Cell | 2 | 0.63%
European Journal of Biochemistry | 2 | 0.63%
FEBS Letters | 2 | 0.63%
FEMS Yeast Research | 2 | 0.63%
Genes to Cells | 2 | 0.63%

Genetics | 2 | 0.63%
Journal of Bacteriology | 2 | 0.63%
Journal of Biochemistry | 2 | 0.63%
Microbiology | 2 | 0.63%
Molecular Human Reproduction | 2 | 0.63%
Molecular Immunology | 2 | 0.63%
PLoS Biology | 2 | 0.63%
The EMBO Journal | 2 | 0.63%
The Plant Journal | 2 | 0.63%
Toxicon | 2 | 0.63%
American Journal of Obstetrics & Gynecology | 1 | 0.31%
American Journal of Pathology | 1 | 0.31%
American Journal of Physiology | 1 | 0.31%
American Journal of Respiratory & Critical Care Medicine | 1 | 0.31%
Annual Review of Biochemistry | 1 | 0.31%
Annual Review of Cell & Developmental Biology | 1 | 0.31%
Annual Review of Physiology | 1 | 0.31%
Arh Hig Rada Toksikol | 1 | 0.31%
Behavioral Brain Research | 1 | 0.31%
Biochemistry | 1 | 0.31%
Biological Chemistry Hoppe-Seyler | 1 | 0.31%
Bioscience, Biotechnology, & Biochemistry | 1 | 0.31%
Blood | 1 | 0.31%
Brain Research | 1 | 0.31%
Carbohydrate Research | 1 | 0.31%
Cell & Tissue Research | 1 | 0.31%
Cell Cycle | 1 | 0.31%
Cell Death & Differentiation | 1 | 0.31%
Cellular & Molecular Life Sciences | 1 | 0.31%
Current Genetics | 1 | 0.31%
Current Opinion in Plant Biology | 1 | 0.31%
Current Topics in Medical Mycology | 1 | 0.31%
Developmental & Comparative Immunology | 1 | 0.31%
Diabetes | 1 | 0.31%
Endocrine Reviews | 1 | 0.31%
European Journal of Cancer | 1 | 0.31%
The FASEB Journal | 1 | 0.31%
FEBS Journal | 1 | 0.31%
FEMS Microbiology Letters | 1 | 0.31%
Fungal Genetics & Biology | 1 | 0.31%
Gene | 1 | 0.31%
Genes, Brain, and Behavior | 1 | 0.31%

Genome Biology | 1 | 0.31%
Genome Dynamics | 1 | 0.31%
Glia | 1 | 0.31%
Human Molecular Genetics | 1 | 0.31%
Human Mutation | 1 | 0.31%
Human Reproduction Update | 1 | 0.31%
Infection & Immunity | 1 | 0.31%
Integrative & Comparative Biology | 1 | 0.31%
International Journal of Developmental Biology | 1 | 0.31%
Journal of Cancer Research | 1 | 0.31%
Journal of Cellular Physiology | 1 | 0.31%
Journal of Clinical Psychology | 1 | 0.31%
Journal of Developmental Cell | 1 | 0.31%
Journal of Hepatology | 1 | 0.31%
Journal of Infectious Diseases | 1 | 0.31%
Journal of Leukocyte Biology | 1 | 0.31%
Journal of Molecular Biology | 1 | 0.31%
Journal of RNA | 1 | 0.31%
Journal of the American Chemical Society | 1 | 0.31%
Minerva Anestesiologica | 1 | 0.31%
Molecular & Cellular Neuroscience | 1 | 0.31%
Molecular & General Genetics | 1 | 0.31%
Molecular BioSystems | 1 | 0.31%
Molecular Cell Biology | 1 | 0.31%
Molecular Endocrinology | 1 | 0.31%
Molecular Genetics & Genomics | 1 | 0.31%
Mycopathologia | 1 | 0.31%
Nature Reviews Immunology | 1 | 0.31%
Nature Communications | 1 | 0.31%
Nature Genetics | 1 | 0.31%
Nature Reviews | 1 | 0.31%
Neuron | 1 | 0.31%
Neurotoxicology | 1 | 0.31%
Oncogene | 1 | 0.31%
Philosophical Transactions of the Royal Society of London | 1 | 0.31%
Plant Signaling & Behavior | 1 | 0.31%
PLoS Genetics | 1 | 0.31%
Proteomics | 1 | 0.31%
Reproductive Biology & Endocrinology | 1 | 0.31%
Respiratory Research | 1 | 0.31%
Trends in Plant Science | 1 | 0.31%
Yeast | 1 | 0.31%
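The Percentage column in the table above is each journal's frequency divided by the total number of literature references, rounded to two decimal places. A minimal sketch of that arithmetic follows; the denominator of 320 is an assumption inferred from 32 mapping to 10.00%, not stated explicitly in the table.

```python
# Sketch: reproduce the Percentage column of Appendix F.
# TOTAL = 320 is an inferred assumption (32 / 320 = 10.00%).
TOTAL = 320

def pct(frequency, total=TOTAL):
    """Frequency as a percentage of all references, to two decimals."""
    return round(frequency / total * 100, 2)

print(pct(32))  # 10.0  (Journal of Biological Chemistry)
print(pct(13))  # 4.06  (Molecular Biology of the Cell)
print(pct(5))   # 1.56  (Biochemical Journal)
```

The same denominator reproduces every row shown, which supports the inference that percentages were computed over 320 references in total.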

APPENDIX G

GO ANNOTATION TO THE MAIZE SMH1 GENE

Annotations using Zea mays publication

PMID: 14576282 Marian et al. 2003 Plant Physiology. 133:1336-1350. The Maize Single myb histone 1 gene, Smh1, Belongs to a Novel Gene Family and Encodes a Protein that Binds Telomere DNA Repeats in Vitro

Molecular Function

Gene | Qualifier | GO Term | Evidence code | Evidence with
Smh1 | | double-stranded telomeric DNA binding (GO:0003691) | IDA |
Smh1 | ? | double-stranded telomeric sequence-specific DNA binding (GO:0003691) | IPP | interaction with maize telomere repeat dsDNA TTTAGGG/CCCTAAA (code for that unknown? - or is IPP ONLY for protein-protein interactions?) (from Fig. 6, 7, and Table II of Marian et al.)
Smh1 | | single-stranded telomeric DNA binding, sequence non-specific (GO:0043047) | IDA | ssDNA, interaction with oligonucleotides (Fig. 8 of Marian et al.)
Smh1 | | telomeric DNA binding (GO:0042162) | IDA |
Smh1 | | DNA binding (GO:0003677) | IDA |

Cellular Component

Gene | Qualifier | GO Term | Evidence code | Evidence with
Smh1 | | chromosome (GO:0005694) | IEA |

APPENDIX H

TOOLS

Ontology browsing & searching tools
  AmiGO | http://amigo.geneontology.org/amigo
  QuickGO | http://www.ebi.ac.uk/QuickGO
  Ontology Lookup Service at EMBL-EBI | http://www.ebi.ac.uk/ontology-lookup/
  CornCyc | http://alpha.maizegdb.org/metabolic_pathways/
  FlyBase | http://flybase.org/
  FlyMine | http://www.flymine.org/
  MaizeCyc | http://maizecyc.maizegdb.org/
  MaizeGDB | http://www.maizegdb.org/
  MapMan | http://mapman.gabipd.org/web/
  Mouse Genome Informatics (MGI) | http://www.informatics.jax.org/
  modMine | http://intermine.modencode.org/
  NeXO | http://www.nexontology.org/
  PAMGO | http://pamgo.vbi.vt.edu/index.php
  REViGO | http://revigo.irb.hr/
  TAIR | http://www.arabidopsis.org/
  UniProt | http://www.uniprot.org/
  Zebrafish Model Organism Database (ZFIN) | http://zfin.org/

Ontology development tools
  Cell Type Ontology (CL) | https://code.google.com/p/cell-ontology/
  ChEBI | http://www.ebi.ac.uk/chebi/
  Other ontologies or biological databases
  Reference sources
  OBO-Edit | http://oboedit.org/
  Protégé | http://protege.stanford.edu/
  TermGenie | http://go.termgenie.org/
  GO contributors’ initials
  GO contributors’ ORCIDs

GO annotation tools
  BLAST | http://blast.ncbi.nlm.nih.gov/Blast.cgi
  Blast2GO | https://www.blast2go.com/b2ghome
  Protein2GO
  Scientific literature
  GO annotation quality control checks | http://geneontology.org/page/annotation-quality-control-checks
  The HGNC Comparison of Orthology Predictions (HCOP) | http://www.genenames.org/cgi-bin/hcop
  PAINT | http://gocwiki.geneontology.org/index.php/PAINT
  Evidence Ontology | http://www.evidenceontology.org/

GO term enrichment tools
  DAVID | http://david.abcc.ncifcrf.gov/
  GOrilla | http://cbl-gorilla.cs.technion.ac.il/
  PANTHER | http://pantherdb.org/
  WebGestalt | http://bioinfo.vanderbilt.edu/webgestalt/

GO communication tools
  GO Helpdesk | http://geneontology.org/form/contact-go
  Request trackers at SourceForge | http://geneontology.sourceforge.net/
  GO mailing lists | http://geneontology.org/page/go-mailing-lists
  GO meetings

GO documentation tools
  GO Website | http://geneontology.org/
  GO Wiki | http://wiki.geneontology.org/index.php/Main_Page

Others
  Scripts developed by users

APPENDIX I

INFORMED CONSENT FORM

APPENDIX J

APPROVALS FROM HUMAN SUBJECTS COMMITTEE


REFERENCES

Allard, S. (2012). DataONE: Facilitating eScience through collaboration. Journal of eScience Librarianship, 1, 4-17. doi:10.7191/jeslib.2012.1004

Allen, D., Karanasios, S., & Slavova, M. (2011). Working with Activity Theory: Context, technology, and information behavior. Journal of the American Society for Information Science and Technology, 62, 776-788. doi:10.1002/asi.21441

Allen, R. B. (2011). Category-based models for knowledge representation. In Information: A fundamental construct. Manuscript in preparation. Retrieved from http://boballen.info/ISS/

Anderson, P. (2007). ‘All that glisters is not gold’ Web 2.0 and the librarian. Journal of Librarianship and Information Science, 39, 195-198. doi:10.1177/0961000607083210

Anderson, W. L. (2004). Some challenges and issues in managing, and preserving access to, long-lived collections of digital scientific and technical data. Data Science Journal, 3, 191-202.

Aslett, M., & Wood, V. (2006). Gene Ontology annotation status of the fission yeast genome: Preliminary coverage approaches 100%. Yeast, 23, 913-919. doi:10.1002/yea.1420

Bada, M., Stevens, R., Goble, C., Gil, Y., Ashburner, M., Blake, J. A., Cherry, J. M., Harris, M., & Lewis, S. (2004). A short study on the success of the Gene Ontology. Journal of Web Semantics, 1, 235-240. doi:10.1016/j.websem.2003.12.003

Ball, A. (2010). Review of the state of the art of the digital curation of research data (ERIM Project Report No. erim1rep091103abl2). Bath, UK: University of Bath. Retrieved from http://opus.bath.ac.uk/19022/

Bard, J. B. L., & Rhee, S. Y. (2004). Ontologies in biology: Design, applications and future challenges. Nature Reviews Genetics, 5, 213-222. doi:10.1038/nrg1295

Beghtol, C. (1986). Bibliographic classification theory and text linguistics: Aboutness analysis, intertextuality and the cognitive act of classifying documents. Journal of Documentation, 42, 84-109.

Benson, D. A., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2014). GenBank. Nucleic Acids Research, 42, D32-D37. doi:10.1093/nar/gkt1030

Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. Scientific American. Retrieved from http://www.scientificamerican.com/article.cfm?id=the-semantic-web

Birnholtz, J. P., & Bietz, M. J. (2003). Data at work: Supporting sharing in science and engineering. In M. Pendergast (Ed.), Proceedings of the 2003 International ACM SIGGROUP Conference on Supporting Group Work (pp. 339-348). New York, NY: ACM. doi:10.1145/958160.958215

Blaschke, C., Hirschman, L., & Valencia, A. (2002). Information extraction in molecular biology. Briefings in Bioinformatics, 3, 154-165.

Blee, K. M., & Taylor, V. (2002). Semi-structured interviewing in social movement research. In B. Klandermans & S. Staggenborg (Eds.), Methods of social movement research (pp. 92-117). Minneapolis, MN: University of Minnesota Press.

Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 63, 1059-1078. doi:10.1002/asi.22634

Borgman, C. L., Wallis, J. C., & Enyedy, N. (2007). Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries. International Journal on Digital Libraries, 7, 17-30. doi:10.1007/s00799-007-0022-9

Brewer, J. D. (2000). Ethnography. Philadelphia, PA: Open University Press.

Broughton, V. (2006). The need for a faceted classification as the basis of all methods of information retrieval. Aslib Proceedings: New Information Perspectives, 58, 49–72. doi:10.1108/00012530610648671

Burge, S., Attwood, T. K., Bateman, A., Berardini, T. Z., Cherry, M., O’Donovan, C., Xenarios, I., & Gaudet, P. (2012). Biocurators and biocuration: Surveying the 21st century challenges. Database: The Journal of Biological Databases and Curation, 2012. doi:10.1093/database/bar059

Burkhardt, K., Schneider, B., & Ory, J. (2006). A biocurator perspective: Annotation at the Research Collaboratory for Structural Bioinformatics Protein Data Bank. PLoS Computational Biology, 2(10), 1186-1189. doi:10.1371/journal.pcbi.0020099

Buza, T. J., McCarthy, F. M., Wang, N., Bridges, S. M., & Burgess, S. C. (2008). Gene Ontology annotation quality analysis in model eukaryotes. Nucleic Acids Research, 36(2), e12. doi:10.1093/nar/gkm1167

Case, D. O. (2012). Looking for information: A survey of research on information seeking, needs and behavior (3rd ed.). Bingley, UK: Emerald.

Chan, L. M. (1994). Cataloging and classification: An introduction (2nd ed.). New York, NY: McGraw-Hill.

Chandrasekaran, B., Josephson, J. R., & Benjamins, V. R. (1999). What are ontologies, and why do we need them? Intelligent Systems and Their Applications, IEEE, 14(1), 20-26.

Cleveland, D. B., & Cleveland, A. D. (2001). Introduction to indexing and abstracting (3rd ed.). Englewood, CO: Libraries Unlimited.

Conway, P. (2011). Archival quality and long-term preservation: a research framework for validating the usefulness of digital surrogates. Archival Science, 11, 293-309. doi:10.1007/s10502-011-9155-0

Dalgleish, R., Flicek, P., Cunningham, F., Astashyn, A., Tully, R. E., Proctor, G., … Maglott, D. R. (2010). Locus Reference Genomic sequences: An improved basis for describing human DNA variants. Genome Medicine, 2, 24. doi:10.1186/gm145

Defoin-Platel, M., Hindle, M. M., Lysenko, A., Powers, S. J., Habash, D. Z., Rawlings, C. J., & Saqi, M. (2011). AIGO: Towards a unified framework for the analysis and the inter-comparison of GO functional annotations. BMC Bioinformatics, 12, 431. doi:10.1186/1471-2105-12-431

Department of Health and Human Services. (2009). Code of Federal Regulations. Retrieved from http://www.hhs.gov/ohrp/policy/ohrpregulations.pdf

Devedžić, V. (2002). Understanding ontological engineering. Communications of the ACM, 45(4), 136-144. doi:10.1145/505248.506002

Dutkowski, J., Kramer, M., Surma, M. A., Balakrishnan, R., Cherry, J. M., Krogan, N. J., & Ideker, T. (2013). A gene ontology inferred from molecular networks. Nature Biotechnology, 31(1). doi:10.1038/nbt.2463

Engeström, Y. (1990). Learning, working and imagining: Twelve studies in activity theory. Helsinki, Finland: Orienta-Konsultit Oy.

Engeström, Y. (2008). Enriching activity theory without shortcuts. Interacting with Computers, 20, 256-259. doi:10.1016/j.intcom.2007.07.003

Fundel, K., & Zimmer, R. (2006). Gene and protein nomenclature in public databases. BMC Bioinformatics, 7, 372. doi:10.1186/1471-2105-7-372

Garcia, A. C., Standlee, A. I., Bechkoff, J., & Cui, Y. (2009). Ethnographic approaches to the Internet and computer-mediated communication. Journal of Contemporary Ethnography, 38(1), 52-84. doi:10.1177/0891241607310839

Gašević, D., Djurić, D., & Devedžić, V. (2009). Model driven engineering and ontology development (2nd ed.). New York, NY: Springer.

Gene Ontology. (2014). The Gene Ontology at SourceForge. Retrieved from http://geneontology.sourceforge.net/

Gene Ontology Consortium. (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29. doi:10.1038/75556

Gene Ontology Consortium. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32, D258-D261. doi:10.1093/nar/gkh036

Gene Ontology Consortium. (2006). The Gene Ontology (GO) project in 2006. Nucleic Acids Research, 34, D322-D326. doi:10.1093/nar/gkj021

Gene Ontology Consortium. (2007). The Gene Ontology (GO) project in 2008. Nucleic Acids Research, 36, D440-D444. doi:10.1093/nar/gkm883

Gene Ontology Consortium. (2009). Curator guide: General conventions. Retrieved from http://wiki.geneontology.org/index.php/Curator_Guide:_General_Conventions#General_ Rules_For_GO_Terms

Gene Ontology Consortium. (2009). The Gene Ontology in 2010: Extensions and refinements. Nucleic Acids Research, 38, D331-D335. doi:10.1093/nar/gkp1018

Gene Ontology Consortium. (2011a). Curator guide: Obsoletion. Retrieved from http://wiki.geneontology.org/index.php/Curator_Guide:_Obsoletion

Gene Ontology Consortium. (2011b). The Gene Ontology: Enhancements for 2011. Nucleic Acids Research, 2011, 1-6. doi:10.1093/nar/gkr1028

Gene Ontology Consortium. (2014a). Annotation advocacy and development. Retrieved from http://wiki.geneontology.org/index.php/Annotation_Advocacy_and_Coordination

Gene Ontology Consortium. (2014b). Annotation extension. Retrieved from http://geneontology.org/page/annotation-extension

Gene Ontology Consortium. (2014c). Annotation Extension Guide. Retrieved from http://wiki.geneontology.org/index.php/Annotation_Extension#Annotation_Examples

Gene Ontology Consortium. (2014d). Annotation quality control. Retrieved from http://geneontology.org/page/annotation-quality-control

Gene Ontology Consortium. (2014e). Category:CACAO. Retrieved from http://gowiki.tamu.edu/wiki/index.php/Category:CACAO

Gene Ontology Consortium. (2014f). Curator Guide: SourceForgery. Retrieved from http://wiki.geneontology.org/index.php/Curator_Guide:_SorceForgery

Gene Ontology Consortium. (2014g). Documentation. Retrieved from http://geneontology.org/page/documentation

Gene Ontology Consortium. (2014h). GO acknowledgements. Retrieved from http://geneontology.org/page/go-acknowledgements

Gene Ontology Consortium. (2014i). GO Consortium contributors list. Retrieved from http://geneontology.org/page/go-consortium-contributors-list

Gene Ontology Consortium. (2014j). GO enrichment analysis. Retrieved from http://geneontology.org/page/go-enrichment-analysis

Gene Ontology Consortium. (2014k). GO leadership group summary. Retrieved from http://wiki.geneontology.org/index.php/GO_Leadership_group_summary

Gene Ontology Consortium. (2014l). GO Slim and Subset Guide. Retrieved from http://geneontology.org/page/go-slim-and-subset-guide

Gene Ontology Consortium. (2014m). How external communities can contribute annotations to the GO Consortium. Retrieved from http://wiki.geneontology.org/index.php/How_External_Communities_can_contribute_annotations_to_the_GO_Consortium#Credit_for_annotation_work

Gene Ontology Consortium. (2014n). Ontology development. Retrieved from http://wiki.geneontology.org/index.php/Ontology_Development

Gene Ontology Consortium. (2014o). Ontology relations. Retrieved from http://geneontology.org/page/ontology-relations

Gene Ontology Consortium. (2014p). Ontology structure. Retrieved from http://geneontology.org/page/ontology-structure

Gene Ontology Consortium. (2014q). PAINT. Retrieved from http://gocwiki.geneontology.org/index.php/PAINT

Gene Ontology Consortium. (2014r). Reference genome annotation project. Retrieved from http://wiki.geneontology.org/index.php/Reference_Genome_Annotation_Project

Gene Ontology Consortium. (2014s). Software and utilities. Retrieved from http://wiki.geneontology.org/index.php/Software_and_Utilities

Gene Ontology Consortium. (2014t). Species-specific terms. Retrieved from http://geneontology.org/page/species-specific-terms

Gene Ontology Consortium. (2014u). Submitting term suggestions to GO. Retrieved from http://geneontology.org/page/submitting-term-suggestions-go

Gene Ontology Consortium. (2014v). Use and license. Retrieved from http://geneontology.org/page/use-and-license

Gene Ontology Consortium. (2014w). User advocacy. Retrieved from http://wiki.geneontology.org/index.php/User_Advocacy

Golder, S. A., & Huberman, B. (2006). Usage patterns of collaborative tagging systems. Journal of Information Science, 32, 198-208. doi:10.1177/0165551506062337

Goll, J., Montgomery, R., Brinkac, L. M., Schobel, S., Harkins, D. M., Sebastian, Y., … Sutton, G. (2010). The Protein Naming Utility: A rules database for protein nomenclature. Nucleic Acids Research, 38, D336-D339. doi:10.1093/nar/gkp958

Gorman, M. (2004). Authority control in the context of bibliographic control in the electronic environment. Cataloging and Classification Quarterly, 38(3/4), 11-22. doi:10.1300/J104v38n03_03

Gray, J. (2007). Jim Gray on eScience: A transformed scientific method. In: T. Hey, S. Tansley, & K. Tolle (Eds.), The fourth paradigm: Data intensive scientific discovery (pp. 5-12). Edmond, WA: Microsoft Research.

Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005, February). Scientific data management in the coming decade. CTWatch Quarterly, 1(1). Retrieved from http://www.ctwatch.org/quarterly/articles/2005/02/scientific-data-management/

Greenberg, J. (2009). Theoretical considerations of lifecycle modeling: An analysis of the Dryad Repository demonstrating automatic metadata propagation, inheritance, and value system adoption. Cataloging & Classification Quarterly, 47, 380-402. doi:10.1080/01639370902737547

Greenberg, J., Murillo, A., & Kunze, J. A. (2012). Ontological ownership: Empowerment and sustainability. Advances in Classification Research Online, 23, 47-48.

Gruber, T. R. (1993). A translational approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.

Gruber, T. R. (2008). Collective knowledge systems: Where the social Web meets the Semantic Web. Web Semantics: Science, Services and Agents on the World Wide Web, 6, 4-13. doi:10.1016/j.websem.2007.11.011

Guterman, L. (2001). Learning to swim in the rising tide of scientific data: From astronomy to zoology, researchers face an unprecedented wealth of information. The Chronicle of Higher Education, 47, A14.

Hammersley, M., & Atkinson, P. (1995). Ethnography: Principles in practice (2nd ed.). New York, NY: Routledge.

Harpring, P. (2010). Introduction to controlled vocabularies: Terminology for art, architecture, and other cultural works. Los Angeles, CA: Getty Research Institute.

Heidorn, P. B. (2011). The emerging role of libraries in data curation and e-science. Journal of Library Administration, 51, 662-672. doi:10.1080/01930826.2011.601269

Hendler, J. (2001). Agents and the Semantic Web. IEEE Intelligent Systems, 16(2), 33-37. doi:10.1109/5254.920597

Higgs, P. G., & Attwood, T. K. (2005). Bioinformatics and molecular evolution. Malden, MA: Blackwell Publishing Company.

Hjørland, B. (1998). Theory and metatheory of information science: A new interpretation. Journal of Documentation, 54, 606-621. doi:10.1108/EUM0000000007183

Hjørland, B. (2007). Semantics and knowledge organization. Annual Review of Information Science and Technology, 41, 367-405.

Hjørland, B. (2008). What is knowledge organization (KO)? Knowledge Organization, 35, 86-101.

Hjørland, B., & Albrechtsen, H. (1999). An analysis of some trends in classification research. Knowledge Organization, 26, 131-139.

Hodge, G. (2000). Systems of knowledge organization for digital libraries: Beyond traditional authority files (CLIR Report 91). Washington, DC: Council on Library and Information Resources. Retrieved from http://www.clir.org/pubs/reports/pub91/pub91.pdf

Holsapple, C. W., & Joshi, K. D. (2002). A collaborative approach to ontology design. Communications of the ACM, 45(2), 42-47. doi:10.1145/503124.503147

Huang, H., Stvilia, B., Jörgensen, C., & Bass, H. W. (2012). Prioritization of data quality dimensions and skills requirements in genome annotation work. Journal of the American Society for Information Science and Technology, 63, 195-207. doi:10.1002/asi.21652

Huntley, R. P., Sawford, T., Martin, J., & O’Donovan, C. (2014). Understanding how and why the Gene Ontology and its annotations evolve: The GO within UniProt. GigaScience, 3(4), doi:10.1186/2047-217X-3-4

Isaac, A., Haslhofer, B., & Mader, C. (2012). Finding quality issues in SKOS vocabularies. Theory and Practice of Digital Libraries, 7849, 222-233. doi:10.1007/978-3-642-33290-6_25

Jacob, E. K. (2003). Ontologies and the Semantic Web. Bulletin of the American Society for Information Science and Technology, 29(4), 19-22. doi:10.1002/bult.283

Jacob, E. K. (2004). Classification and categorization: A difference that makes a difference. Library Trends, 52, 515-540.

Jörgensen, C., Stvilia, B., & Jörgensen, P. (2008). Is there a role for controlled vocabulary in taming tags? In J. Lussky (Chair), 19th workshop of the Special Interest Group in Classification Research. Workshop conducted at the American Society for Information Science and Technology, Columbus, OH.

Jörgensen, C., Stvilia, B., & Wu, S. (2011). Assessing the quality of socially created metadata to image indexing. In A. Grove (Ed.), Proceedings of the 74th ASIS&T Annual Meeting, 48. Retrieved from http://www.asis.org/asist2011/proceedings/submissions/325_FINAL_SUBMISSION.pdf

Jupp, S., Gibson, A., Malone, J., Parkinson, H., & Stevens, R. (2012, July). Taking a view on bio-ontologies. Paper presented at the 3rd International Conference on Biomedical Ontology, Graz, Austria. Retrieved from http://ceur-ws.org/Vol-897/session4-paper22.pdf

Juran, J. (1992). Juran on quality by design. New York, NY: The Free Press.

Kalfoglou, Y., Domingue, J., Motta, E., Vargas-Vera, M., & Shum, S. B. (2001). myPlanet: An ontology-driven Web-based personalised news service. In Proceedings of the 2001 International Joint Conference on Artificial Intelligence (pp. 44-52). Retrieved from http://eprints.aktors.org/10/02/ontoIS01final.pdf

Kaptelinin, V. (1996). Activity theory: Implications for human-computer interaction. In Nardi, B. A. (Ed.), Context and consciousness: Activity theory and human-computer interaction (pp. 103-116). Cambridge, MA: MIT Press.

Kazmer, M. M., & Xie, B. (2008). Qualitative interviewing in Internet studies: Playing with the media, playing with the method. Information, Communication, and Society, 11, 257-278. doi:10.1080/13691180801946333

Kelso, J., Hoehndorf, R., & Prüfer, K. (2010). Ontologies in biology. In R. Poli, M. Healy, & A. Kameas (Eds.), Theory and applications of ontology: Computer applications (pp. 347-371). New York, NY: Springer. doi:10.1007/978-90-481-8847-5_15

Khatri, P., Sellamuthu, S., Malhotra, P., Amin, K., Done, A., & Draghici, S. (2005). Recent additions and improvements to the Onto-Tools. Nucleic Acids Research, 33, W762- W765. doi:10.1093/nar/gki472

Kling, R. (1999). What is social informatics and why does it matter? D-Lib Magazine, 5(1). Retrieved from http://www.dlib.org/dlib/january99/kling/01kling.html

Köhler, J., Munn, K., Rüegg, A., Skusa, A., & Smith, B. (2006). Quality control for terms and definitions in ontologies and taxonomies. BMC Bioinformatics, 7, 212. doi:10.1186/1471-2105-7-212

Kozinets, R. V. (2010). Netnography: Doing ethnographic research online. Washington, DC: Sage.

Kuutti, K. (1996). Activity Theory as a potential framework for human-computer interaction research. In Nardi, B. A. (Ed.), Context and consciousness: Activity theory and human- computer interaction (pp. 17-44). Cambridge, MA: MIT Press.

Kwaśnik, B. H. (1999). The role of classification in knowledge representation and discovery. Library Trends, 48, 22-47.

Kwaśnik, B. H., & Rubin, V. L. (2003). Stretching conceptual structures in classifications across languages and cultures. Cataloging and Classification Quarterly, 37(1/2), 33-47. doi:10.1300/J104v37n01_04

La Barre, K. (2010). Facet analysis. Annual Review of Information Science and Technology, 44, 243-284. doi:10.1002/aris.2010.1440440113

Lambe, P. (2007). Organising knowledge: Taxonomies, knowledge and organisational effectiveness. Oxford, England: Chadons Publishing.

Leonelli, S., Diehl, A. D., Christie, K. R., Harris, M. A., & Lomax, J. (2011). How the Gene Ontology evolves. BMC Bioinformatics, 12, 325. doi:10.1186/1471-2105-12-325

Leont’ev, A. (1978). Activity, consciousness, personality. Englewood Cliffs, NJ: Prentice-Hall.

Library of Congress Authorities. (2011). Library of Congress Authorities help pages. Retrieved from http://authorities.loc.gov/help/auth-faq.htm#1

Lincoln, Y. S., & Guba, E. G. (1985). Implementing the naturalistic inquiry. In Naturalistic Inquiry (pp. 250-288). Newbury Park, CA: Sage.

Lord, P., & Macdonald, A. (2003). E-Science curation report: Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision. Bristol, UK: JISC.

Mai, J.-E. (2009, November). The boundaries of classification. Paper presented at the 20th Workshop of the American Society for Information Science and Technology Special Interest Group in Classification Research, Vancouver, BC, Canada.

Marian, C. O., Bordoli, S. J., Goltz, B., Goltz, M., Santarella, R. A., Jackson, L. P., … Bass, H. W. (2003). The maize single myb histone 1 gene, Smh1, belongs to a novel gene family and encodes a protein that binds telomere DNA repeats in vitro. Plant Physiology, 133, 1336-1350.

Mayor, C., & Robinson, L. (2014). Ontological realism and classification: Structure and concepts in the Gene Ontology. Journal of the Association for Information Science and Technology, 65, 686-697. doi:10.1002/asi.23057

Mendes, L. H., Quiñonez-Skinner, J., & Skaggs, D. (2009). Subjecting the cataloging to tagging. Library Hi Tech, 27, 30-41. doi:10.1108/07378830910942892

Morgat, A., Coissac, E., Coudert, E., Axelsen, K. B., Keller, G., Bairoch, A., … Viari, A. (2012). UniPathway: A resource for the exploration and annotation of metabolic pathways. Nucleic Acids Research, 40, D761-D769. doi:10.1093/nar/gkr1023

Mungall, C. J., Bada, M., Berardini, T. Z., Deegan, J., Ireland, A., Harris, M. A., … Lomax, J. (2011). Cross-product extensions of the Gene Ontology. Journal of Biomedical Informatics, 44, 80-86. doi:10.1016/j.jbi.2010.02.002

Murchison, J. M. (2010). Ethnography essentials: Designing, conducting, and presenting your research. San Francisco, CA: Jossey-Bass.

Nardi, B. A. (1996). Studying context: A comparison of activity theory, situated action models, and distributed cognition. In B. A. Nardi (Ed.), Context and consciousness: Activity theory and human-computer interaction (pp. 69-102). Cambridge, MA: MIT Press.

National Center for Biomedical Ontology. (n.d.). About NCBO. Retrieved from http://www.bioontology.org/about-ncbo

National Center for Biotechnology Information. (2004). About NCBI. Retrieved from http://www.ncbi.nlm.nih.gov/About/glance/ourmission.html

National Information Standards Organization. (2005). Guidelines for the construction, format, and management of monolingual controlled vocabularies (ANSI/NISO Z39.19-2005). Bethesda, MD: NISO Press.

National Science Board. (2005). Long-lived digital data collections: Enabling research and education in the 21st century (NSB Report No. 05-40). Arlington, VA: National Science Foundation. Retrieved from http://www.nsf.gov/pubs/2005/nsb0540/nsb0540.pdf

Noy, N. F., & McGuinness, D. L. (2001). Ontology development 101: A guide to creating your first ontology (KSL-01-05). Stanford Knowledge Systems Laboratory. Retrieved from http://www.ksl.stanford.edu/people/dlm/papers/ontology-tutorial-noy-mcguinness-abstract.html

Ochoa, X., & Duval, E. (2009). Automatic evaluation of metadata quality in digital repositories. International Journal on Digital Libraries, 10, 67-91. doi:10.1007/s00799-009-0054-4

Open Biological and Biomedical Ontologies. (2006). Archive of original principles. Retrieved from http://www.obofoundry.org/crit_2006.shtml

Open Biological and Biomedical Ontologies. (2012). OBO Foundry Principles 2008. Retrieved from http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles_2008

Open Biological and Biomedical Ontologies. (2014). How to join the OBO Foundry. Retrieved from http://www.obofoundry.org/join.shtml

O’Reilly, K. (2005). Ethnographic methods. New York, NY: Routledge.

PANTHER. (2014). About PANTHER. Retrieved from http://www.pantherdb.org/about.jsp

Patton, G. (Ed.). (2009). Functional requirements for authority data: A conceptual model (FRAD). IFLA Working Group on the Functional Requirements and Numbering of Authority Records (FRANAR). IFLA Series on Bibliographic Control (Vol. 34). München, Germany: K.G. Saur.

Powell, R. R., & Connaway, L. S. (2004). Basic research methods for librarians (4th ed.). Westport, CT: Libraries Unlimited.

Quintarelli, E., Resmini, A., & Rosati, L. (2007). FaceTag: Integrating bottom-up and top-down classification in a social tagging system. ASIS&T Bulletin, June/July 2007. Retrieved from http://www.asis.org/Bulletin/Jun-07/quintarelli_et_al.html

Raber, D. (2003). The problem of information: An introduction to information science. Lanham, MD: Scarecrow Press.

Ranganathan, S. R. (2006). Prolegomena to library classification (3rd ed.). Bangalore: Sarada Ranganathan Endowment. (Original work published in 1967)

Richards, L. (2005). Handling qualitative data: A practical guide. London, UK: Sage.

Rolla, P. J. (2009). User tags versus subject headings: Can user-supplied data improve subject access to library collections? Library Resources and Technical Services, 53(3), 174-184.

Roos, A. (2012). Activity theory as a theoretical framework in the study of information practices in molecular medicine. Information Research, 17(3).

Rubin, D. L., Lewis, S. E., Mungall, C. J., Misra S., Westerfield, M., Ashburner, M., … Musen M. A. (2006). National Center for Biomedical Ontology: Advancing biomedicine through structured organization of scientific knowledge. OMICS, 10(2), 185-198.

Salimi, N., & Vita, R. (2006). The biocurator: Connecting and enhancing scientific data. PLoS Computational Biology, 2(10), 1190-1192. doi:10.1371/journal.pcbi.0020125

Sandusky, R. J., & Gasser, L. (2005). Negotiation and the coordination of information and activity in distributed software problem management. In G. Mark & M. Ackerman (Chairs), Proceedings of the 2005 International ACM SIGGROUP Conference on Supporting Group Work. New York, NY: ACM. doi:10.1145/1099203.1099238

Schutt, R. K. (2006). Investigating the social world: The process and practice of research (5th ed.). Thousand Oaks, CA: Sage.

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web revisited. IEEE Intelligent Systems, 21, 96-101. doi:10.1109/MIS.2006.62

Sigurbjörnsson, B., & van Zwol, R. (2008). Flickr tag recommendation based on collective knowledge. In Proceedings of the 17th International Conference on World Wide Web (pp. 327-336). New York, NY: ACM. doi:10.1145/1367497.1367542

Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., … Lewis, S. (2007). The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25(11), 1251-1255. doi:10.1038/nbt1346

Smith, G. (2008). Tagging: People-powered metadata for the social web. Berkeley, CA: New Riders.

Soergel, D. (1995). The Art and Architecture Thesaurus (AAT): A critical appraisal. Visual Resources, 10, 269-400.

Specia, L., & Motta, E. (2007). Integrating folksonomies with the Semantic Web. The Semantic Web: Research and Applications, 4519, 624-639. doi:10.1007/978-3-540-72667-8_44

Spiteri, L. F. (2010). Incorporating facets into social tagging applications: An analysis of current trends. Cataloging and Classification Quarterly, 48, 94-109. doi:10.1080/01639370903338345

Stanford Center for Biomedical Informatics Research. (2014). Protégé. Retrieved from http://protege.stanford.edu/about.php

Stvilia, B. (2006). Measuring information quality. (Doctoral dissertation, University of Illinois at Urbana - Champaign). Retrieved from http://wwwlib.umi.com/dissertations/fullcit/3223727

Stvilia, B. (2007). A model for ontology quality evaluation. First Monday, 12(12). Retrieved from http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2043/1905

Stvilia, B., Al-Faraj, A., & Yi, Y. J. (2009). Issues of cross-contextual information quality evaluation—The case of Arabic, English, and Korean Wikipedia. Library and Information Science Research, 31, 232-239. doi:10.1016/j.lisr.2009.07.005

Stvilia, B., & Gasser, L. (2008a). An activity model for information quality change. First Monday, 13(4). Retrieved from http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2126/1951

Stvilia, B., & Gasser, L. (2008b). Value-based metadata quality assessment. Library and Information Science Research, 30, 67-74. doi:10.1016/j.lisr.2007.06.006

Stvilia, B., Gasser, L., Twidale, M., & Smith, L. C. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 58, 1720-1733. doi:10.1002/asi.20652

Stvilia, B., Hinnant, C. C., Wu, S., Worrall, A., Lee, D. J., Burnett, K., Burnett, G., … Marty, P. (2013). Studying the data work of a scientific community. In J. S. Downie & R. H. McDonald (Chairs), Proceedings of the ACM/IEEE Joint Conference on Digital Libraries 2013. Indianapolis, IN. New York, NY: ACM.

Stvilia, B., Hinnant, C. C., Wu, S., Worrall, A., Lee, D. J., Burnett, K., Burnett, G., … Marty, P. (in press). Research project tasks, data, and perceptions of data quality in a condensed matter physics community. Journal of the Association for Information Science and Technology.

Stvilia, B., & Jörgensen, C. (2010). Member activities and quality of tags in a collection of historical photographs in Flickr. Journal of the American Society for Information Science and Technology, 61, 2477-2489. doi:10.1002/asi.21432

Stvilia, B., Jörgensen, C., & Wu, S. (2012). Establishing the value of socially created metadata to image indexing. Library and Information Science Research, 34, 99-109. doi:10.1016/j.lisr.2011.07.011

Stvilia, B., Mon, L., & Yi, Y. J. (2009). A model for online consumer health information quality. Journal of the American Society for Information Science and Technology, 60, 1781-1791. doi:10.1002/asi.21115

Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2005). Assessing information quality of a community-based encyclopedia. In F. Naumann, M. Gertz, & S. Mednick (Eds.), Proceedings of the International Conference on Information Quality (pp. 442-454). Cambridge, MA: MITIQ.

Stvilia, B., Twidale, M., Smith, L. C., & Gasser, L. (2008). Information quality work organization in Wikipedia. Journal of the American Society for Information Science and Technology, 59, 983–1001. doi:10.1002/asi.20813

Svenonius, E. (2003). Design of controlled vocabularies. In M. Dekker (Ed.), Encyclopedia of Library and Information Science (pp. 822-838). doi:10.1081/E-ELIS120009038

Swartout, W., & Tate, A. (1999). Guest editors’ introduction: Ontologies. IEEE Intelligent Systems, 14(1), 18-19.

Taylor, A. G., & Joudrey, D. N. (2009). The organization of information (3rd ed.). Westport, CT: Libraries Unlimited.

Trant, J. (2009). Studying social tagging and folksonomy: A review and framework. Journal of Digital Information, 10(1). Retrieved from http://arizona.openrepository.com/arizona/handle/10150/105375

Turner, P., Turner, S., & Horton, J. (1999). From description to requirements: An activity theoretic perspective. In: S. C. Hayne (Ed.), Proceedings of the International ACM SIGGROUP conference on Supporting Group Work (pp. 286-295). New York, NY: ACM. doi:10.1145/320297.320331

Twidale, M. B., Blake, C., & Gant, J. (2013). Towards a data literate citizenry. In W. Moen (Chair), Proceedings of iConference 2013 (pp. 247-257), Fort Worth, TX. Champaign, IL: iSchools. doi:10.9776/13189

University College London. (2014). Functional gene annotation: Providing functional annotation to human genes. Retrieved from http://www.ucl.ac.uk/functional-gene-annotation/

von Thaden, T. L. (2007). Building a foundation to study distributed information behavior. Information Research, 12(3). Retrieved from http://informationr.net/ir/12-3/paper312.html

Wain, H. M., Bruford, E. A., Lovering, R. C., Lush, M. J., Wright, M. W., & Povey, S. (2002). Guidelines for Human Gene Nomenclature. Genomics, 79, 464-470. doi:10.1006/geno.2002.6748

Wand, Y., & Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39(11), 86-95.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-34.

Wartofsky, M. (1979). Models: Representation and scientific understanding. Dordrecht, Netherlands: Reidel.

Westcott, J., Chappell, A., & Lebel, C. (2009). LibraryThing for libraries at Claremont. Library Hi Tech, 27(1), 78-81. doi:10.1108/07378830910942937

Wetterstrom, M. (2008). The complementarity of tags and LCSH—A tagging experiment and investigation into added value in a New Zealand library context. The New Zealand Library and Information Management Journal, 50, 296-310.

Wills, C., Greenberg, J., & White, H. (2012). Analysis and synthesis of metadata goals for scientific data. Journal of the American Society for Information Science and Technology, 63, 1505-1520. doi:10.1002/asi.22683

Wilson, T. D. (2008). Activity theory and information seeking. Annual Review of Information Science and Technology, 34, 457-464. doi:10.1002/aris.2008.1440420111

Witt, M., Carlson, J., Brandt, D. S., & Cragin, M. H. (2009). Constructing data curation profiles. The International Journal of Digital Curation, 3(4), 93-103.

Wolcott, H. F. (2008). Ethnography: A way of seeing (2nd ed.). New York, NY: AltaMira Press.

Wu, S. (2013). A model for assessing the quality of Gene Ontology. In W. Moen (Chair), Proceedings of iConference 2013 (pp. 953-956). Champaign, IL: iSchools. doi:10.9776/13492

Wu, S., & Stvilia, B. (2014). Exploring the development and maintenance practices in the Gene Ontology. Advances in Classification Research Online, 24(1), 38-42. doi:10.7152/acro.v24i1.14675

Wu, S., Stvilia, B., & Lee, D. J. (2012). Authority control for scientific data: The case of molecular biology. Journal of Library Metadata, 12, 61-82. doi:10.1080/19386389.2012.699822

Zimmerman, A. S. (2003). Data sharing and secondary use of scientific data: Experience of ecologists. (Doctoral dissertation, University of Michigan). Retrieved from http://deepblue.lib.umich.edu/handle/2027.42/61844

BIOGRAPHICAL SKETCH

Shuheng Wu received baccalaureate degrees in Archives Science and in Computer Science and Technology from Sun Yat-sen (Zhongshan) University in China, and an MS in Library and Information Science from Syracuse University. She joined the doctoral program at the Florida State University School of Information in fall 2009 and has worked as a research assistant on an OCLC/ALISE-funded project and an NSF-funded project. Shuheng's research interests lie at the intersection of knowledge organization (KO) and sociotechnical systems, applying both to scientific data curation and image indexing. Her current research focuses on studying the project teams and data practices of a condensed matter physics community, exploring the data work organization of a bio-ontology, and examining the value that socially created metadata adds to traditional KO systems for indexing historical images. Her future research agenda will expand on this work, studying the data practices and KO systems of different scientific communities, such as molecular biology, biomedicine, and earthquake engineering.
