Comparing Bioinformatics Software Development by Computer Scientists and Biologists: an Exploratory Study

Comparing Bioinformatics Software Development by Computer Scientists and Biologists: An Exploratory Study Parmit K. Chilana Carole L. Palmer Amy J. Ko The Information School Graduate School of Library The Information School DUB Group and Information Science DUB Group University of Washington University of Illinois at University of Washington [email protected] Urbana-Champaign [email protected] [email protected] Abstract Although research in bioinformatics has soared in the last decade, much of the focus has been on high- We present the results of a study designed to better performance computing, such as optimizing algorithms understand information-seeking activities in and large-scale data storage techniques. Only recently bioinformatics software development by computer have studies on end-user programming [5, 8] and scientists and biologists. We collected data through semi- information activities in bioinformatics [1, 6] started to structured interviews with eight participants from four emerge. Despite these efforts, there still is a gap in our different bioinformatics labs in North America. The understanding of problems in bioinformatics software research focus within these labs ranged from development, how domain knowledge among MBB computational biology to applied molecular biology and and CS experts is exchanged, and how the software biochemistry. The findings indicate that colleagues play a development process in this domain can be improved. significant role in information seeking activities, but there In our exploratory study, we used an information is need for better methods of capturing and retaining use perspective to begin understanding issues in information from them during development. Also, in bioinformatics software development. We conducted terms of online information sources, there is need for in-depth interviews with 8 participants working in 4 more centralization, improved access and organization of bioinformatics labs in North America. These included resources, and more consistency among formats. More software and database developers, and other broadly, the findings illustrate a variety of information professional staff, such as systems analysts and problems that end-user biologists and professional network administrators. Half of the participants in software developers face in developing bioinformatics these practitioner roles had CS degrees, while the other software and how they are influenced by the level of half had primary training in MBB or a related domain knowledge and technical expertise. biological field. We also compared the roles of CS researchers, who were more focused on the application 1. Introduction of theoretical concepts in computational biology with MBB researchers, who were concerned with the Since the completion of the human genome project, acquisition of CS-related skills to solve MBB computational tools have become indispensable in problems. supporting analysis, integration, modeling, and We used Robert Taylor’s concept of information visualization of large amounts of molecular data and use environments as the conceptual framework for this advancing a major core of biological research [10]. study [14]. Furthermore, we also applied MacMullin & Bioinformatics has become one of the fastest growing Taylor’s problem dimensions [7] in the comparative interdisciplinary scientific fields, bringing together analysis of the researcher and practitioner roles that Molecular Biology and Biochemistry (MBB) and emerged. Computer Science (CS), among other disciplines. A The results of this study shed light on how plethora of commercial and open-source information is used to understand the biological bioinformatics tools are emerging, but they often lack problem, to translate the problem into code, to interpret transparency in that researchers end up dealing more the output, and to resolve related technical issues. The with the complexity of the tools, rather than the results show interesting similarities and differences scientific problems at hand [3]. among the information activities of practitioners versus researchers and computer scientists versus biologists. SECSE’09, May 23, 2009, Vancouver, Canada 978-1-4244-3737-5/09/$25.00 © 2009 IEEE 72 ICSE’09 Workshop Based on these preliminary results, we believe there is documentation could have potentially useful merit in further study to better understand how information that should be exploited. software developers with no scientific domain In terms of end-user programming, Massar et al. [8] knowledge participate in not only bioinformatics, but designed BioLingua, an interactive web-based also other sciences. Also, there is need to investigate programming environment to give biologists more how specific methods or tools can be developed to control and flexibility in carrying out analyses with facilitate their participation. We are continuing to genomics, metabolic, and experimental data and explore such issues in our current research. higher-level representations. Their system made use of a symbolic programming language to provide a 2. Related Work transparent, integrated interface to commonly used bioinformatics tools by hiding the implementation Although bioinformatics is a relatively new area of details. scientific research, we are beginning to understand the Similarly, Letondal [5] developed Biok an underlying information tasks and end-user integrated programmable application for analyzing programming activities of biologists. DNA and protein sequences or multiple alignments. For example, MacMullen & Denn’s [6] research They used a novel approach in creating this tool by into information tasks of biologists revealed three using a combination of participatory design and end- categories: sequence alignment (i.e. sequences of user programming techniques. Using a “participatory DNA), structure prediction (i.e. protein folding) and programming” approach, the goal was to allow function prediction (i.e. gene function). Tran et al. [15] biologists have more control and flexibility over tools developed a cross-sectional study with six different by working collaboratively with software developers bioinformatics research labs and found four common during design. themes with the way bioinformatics tasks were carried In summary, although these studies shed light on out. These included a lack of procedural the general information and end-user programming documentation of high level tasks, use of home-grown activities of biologists, we know little about the strategies, lack of awareness of existing bioinformatics information activities of computer scientists in tools, and variations in individual needs and bioinformatics and how they compare. Also, while preferences. there has been recent research in understanding the Bartlett & Toms [1] also worked with biologists to information activities of software developers [4], their understand how they conduct functional analyses of activities in complex scientific domains are not well- gene sequences using tools such as, GenBank, the understood. This study is a step towards filling this genetic sequence database, and BLAST, an analytical important gap by investigating the information tool for calculating similarities between sequences [1]. activities that occur in developing and maintaining They discovered that these analyses served as an bioinformatics software by considering both biologists ‘extension’ of traditional techniques and a means of and computer scientists with no biological domain ‘data reduction’, but the scientists still had to verify knowledge. results through experimentation. Furthermore, they suggested that each bioinformatics expert followed a 3. Method ‘unique process’ in using these resources and that there were a number of technical and interface 3.1. Study Design inconsistencies among these tools. Umarji and Seaman [16] recently conducted a We used the semi-structured interview technique for survey of software developers in bioinformatics and data collection, since it allows for the kind of open- proposed the design of a search tool that would ended discussion that can capture the situational facilitate access to open-source bioinformatics aspects of information use, while also providing a way software components. About half of their 126 to find consistencies among responses from different respondents had CS degrees and most others had participants. We developed ten interview questions Biology-related training. Their findings indicated that (see Appendix 1) based on the following themes: open-source projects were the most common, • Nature of current software development work configuration management tools were used, but there and description of technical environment was room for improvement in testing and • Strategy used to overcome technical issues maintainability of software, and that comments and during development 73 • Strategy used to overcome domain-related in this field was about 4.25 years, ranging from one challenges during development year in the same lab to eight years in multiple labs. • Recall of recently encountered problems and the steps used for resolution 3.3. Analysis Approach • Awareness and choice of information resources and decisions made to determine relevance We used information use environments [14] as the conceptual framework for the study. This model We interviewed participants in their work settings emphasizes the context or situational aspects of to establish context and facilitate recall. Each interview information needs and how they affect how

Comparing Bioinformatics Software Development by Computer Scientists and Biologists: an Exploratory Study

Programming Paradigms & Object-Oriented

Bioconductor: Open Software Development for Computational Biology and Bioinformatics Robert C

Bioinformatics 1

13 Genomics and Bioinformatics

Paradigms of Computer Programming

The Machine That Builds Itself: How the Strengths of Lisp Family

Use of Bioinformatics Resources and Tools by Users of Bioinformatics Centers in India Meera Yadav University of Delhi, [email protected]

Scripting: Higher- Level Programming for the 21St Century

Challenges in Bioinformatics

Six Canonical Projects by Rem Koolhaas

Technological Advancement in Object Oriented Programming Paradigm for Software Development

Object Oriented Database Management Systems-Concepts