Research Directions in Biodiversity Informatics
Total Page:16
File Type:pdf, Size:1020Kb
Research Directions in Biodiversity Informatics John L. Schnase Earth and Space Data Computing Division NASA Goddard Space Flight Center Greenbelt, MD 20771 USA [email protected] Abstract plant pollination, seed dispersal, grazing land, carbon dioxide removal, nitrogen fixation, flood control, waste This paper provides an introduction to the major breakdown, and the biocontrol of crop pests. And research directions in biodiversity informatics. biodiversity — the species richness of habitats per se — The biodiversity enterprise is a vast and complex is perhaps the single most important factor influencing the information domain. I describe the need to build stability and health of ecosystems [2, 3]. infrastructure for this domain, major research It is not surprising, then, that information about thrusts needed to improve its work practices, and biodiversity forms the basis of one of our most important areas of research that could contribute to the knowledge domains, vital to a wide range of scientific, advancement of the field. I emphasize that the educational, commercial, and government uses. science of biodiversity is fundamentally an Unfortunately, most biodiversity information now exists information science, worthy of special attention in forms that are not easily accessed or used. From from the computer and information science traditional paper-based libraries to scattered databases of communities because of its distinctive attributes varying size and physical specimens preserved in natural- of scale and socio-technical complexity. history collections throughout the world, our record of biodiversity is uncoordinated and poorly integrated, and 1. Biodiversity large parts of it are isolated from general use. We lack the technologies needed to effectively gather, analyze, and The most striking feature of Earth is the existence of life, synthesize these data into new discoveries. As a result, and the most striking feature of life is its diversity. This this information is not being used as effectively as it could biological diversity — or biodiversity — provides us with by scientists, resource managers, policy-makers, or other clean air, clean water, food, clothing, shelter, medicines, potential client communities. The good news is that and aesthetic enjoyment. Biodiversity, and the ecosystems research activities are being conducted around the world that support it, contribute trillions of dollars to national that could improve our ability to manage biodiversity and global economies, directly through industries such as information, and the emerging field of biodiversity agriculture, forestry, fishing, and ecotourism and informatics is attempting to meet the challenges posed by indirectly through biologically-mediated services such as this domain. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for 2. Biodiversity Informatics direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given Until recently, little attention has been paid to computer that copying is by permission of the Very Large Data Base and information science and technology research in the Endowment. To copy otherwise, or to republish, requires a fee biodiversity domain. Those working in the field of and/or special permission from the Endowment. biodiversity informatics — many cross-trained or cross- Proceedings of the 26th International Conference on Very teamed in computer science and biology — are attempting Large Databases, Cairo, Egypt, 2000 to change that. If we are to keep pace with our need for 697 quality information about the living systems of our planet, coordination — among agencies, among divergent we must produce mechanisms that can efficiently manage interests, and among groups of people from different petabytes of a whole new generation of high-resolution, regions, different backgrounds (academia, industry, Earth-observing satellite data. We must understand how government), and with different views and requirements. to integrate these new datasets with traditional The kinds of data humans have collected about organisms biodiversity data, such as specimen data held in natural and their relationships vary in precision, accuracy, and in history collections, and genomic data from cellular- and numerous other ways. Biodiversity data types include text molecular-level work. We must be able to make and numerical measurements, images, sound, and video. correlations among data from these and even more The range of other databases with which biodiversity disparate sources, such as ecosystem-scale global change datasets must interact is also broad, including and carbon cycle data, compile those data in new ways, geographical, meteorological, geological, chemical, analyze and synthesize them, and present the results in an physical, and genomic datasets. The mechanisms used to understandable and usable manner. collect and store biodiversity data are almost as varied as Despite encouraging advances in computation and the natural world that they document. In addition, communication performance in recent years, we are able biological data can be politically and commercially to perform these activities only on a very small scale. It is sensitive and can entail conflicts of interest. User’s skill only recently, for example, that IBM announced plans to levels are highly variable, and training in this field is not build the world’s fastest supercomputer — Blue Gene — well developed. which will attempt to compute the three-dimensional Because of these complexities, humans still play a folding of human protein molecules. Given the thousands crucial role in the processing of biological data. of proteins that are produced by the unknown millions of Biological information is not as amenable to automatic species on this planet, and given too that many of these correlation, analysis, synthesis, and presentation as many molecules may have potentially significant economic other types of information, such as that in value, we are clearly just embarking on a whole new radioastronomy, where there is more coherent global world of computer-mediated exploration. We can, organization and the problems being studied are often however, make rapid progress if the computer and conducive to automatic analysis. In biodiversity research, information science and technology research community people act as sophisticated filters and query processors — becomes focused on the challenges posed by the locating resources on the Internet, downloading datasets, biodiversity research community. reformatting and organizing data for input to analysis tools, then reformatting again to visualize results. This 3. Managing Complexity process of creating higher-order understanding from dispersed datasets is a fundamental intellectual process in The single most important factor influencing the nature of the biodiversity sciences, but it breaks down quickly as work in biodiversity informatics is the problem of the volume and dimensionality of the data increase. Who complexity. Living systems are complex, and they are could be expected to “understand” millions of cases, each complex at many levels. The computational challenges having hundreds of attributes? Yet problems on this scale associated with work on cellular processes — such as are common in biodiversity and ecosystem research. decoding the structure of DNA — are enormous. But DNA-level functions are only part of an elaborate web of 4. Research Directions biotic and abiotic interactions that span from molecules to cells, and from organisms to populations and entire ecosystems. Knowledge about biodiversity is a vast and 4.1 Information Infrastructure complex information domain. The total volume of biodiversity and ecosystem This complexity arises from two sources. The first of information is almost impossible to measure. We do know these is the underlying biological complexity of the organisms themselves. There are millions of species, each that whatever the total, only a fraction has been captured of which is highly variable across individual organisms, in digital form. The natural history museums in the US populations, and time. Species have complex chemistries, alone, for example, contain at least 750 million physiologies, developmental cycles, and behaviors specimens, the vast majority of which have not been recorded in databases. The same holds for the published resulting from more than three billion years of evolution. record, where most biodiversity and ecosystem There are hundreds, if not thousands, of ecosystems, each comprising complex interactions among large numbers of information still resides in paper-based journals, books, species and between those species and multiple abiotic field notes, and the like. Clearly, one of the most factors. important infrastructure issues is to move the biodiversity The second source of complexity in biodiversity enterprise into a digital world — to create the content for a global biodiversity digital library — by digitizing the information is sociologically generated. The sociological existing corpus of scholarly work on a large scale. complexity includes problems of communication and 698 Such an infrastructure would place challenging indexing, must be elaborated upon and scaled to demands on network hardware services and on software accommodate large information sources. services related to authentication, integrity, and security. Also much needed are software