Bioinformatics

Bioinformatics Bioinformatics Introductory article Wojciech Makalowski, Pennsylvania State University, Philadelphia, Pennsylvania, USA Article contents Introduction Bioinformatics may be defined as the development and/or application of computational Databases tools and approaches for expanding the use of biological data, including those to acquire, Sequence Analysis store, organize, archive, analyze or visualize such data. Nucleotide Sequence Assembly Gene Prediction Homology Search Introduction Protein Analysis Technical Aspects Molecular biologists generate data at an unprecedent- Education ed speed (Figure 1). For example, an average bacterial Conclusions genome can be sequenced in one day by a sequencing center (factory) and the analysis of the expression of doi: 10.1038/npg.els.0005247 hundreds of genes in a single experiment has become routine. The demands and opportunities for inter- preting the data are expanding more than ever. In this context, a new discipline – bioinformatics – has Classical bioinformatics emerged as a strategic frontier between biology, Fredj Tekaia at the Institut Pasteur offers this computer science and statistics. Although the term definition of bioinformatics: ‘The mathematical, sta- ‘bioinformatics’ does not appear in the literature until tistical and computing methods that aim to solve around 1991, since the 1960s, long before anyone biological problems using deoxyribonucleic acid thought to label this activity with a special term, some (DNA) and amino acid sequences and related infor- evolutionarily oriented biologists (e.g. Margaret O. mation.’ Most biologists talk about ‘doing bioinfor- Dayhoff, Russell F. Doolittle, Walter M. Fitch and matics’ when they use computers to store, retrieve, Masatoshi Nei) have been building databases, devel- analyze or predict the composition or the structure of oping algorithms and making biological discoveries by biomolecules, for example, nucleic acids and proteins. sequence analysis. The field is not mature and These are the concerns of ‘classical’ bioinformatics, therefore not very well defined. Almost anybody active dealing primarily with sequence analysis. It is a in this field has a different definition of bioinformatics mathematically interesting property of most large and the discussion of what does and what does not biological molecules that they are polymers: ordered belong to it may go for hours. chains of simpler molecular modules called monomers. Each monomer molecule is of the same general class, but each kind of monomer has its own well-defined set of characteristics. Many monomer molecules can be joined together to form a single, far larger, macromol- 15 ecule, which has exquisitely specific informational content and/or chemical properties. Therefore, the monomers in a given macromolecule of DNA or protein can be treated computationally as letters of an 10 alphabet put together in preprogrammed arrangements to carry messages or do work in a cell. Millions 5 New bioinformatics The greatest achievement of biology, the Human 0 Genome Project (HGP), has recently been completed. The project brought not only the ultimate blueprint of 1965 1970 1975 1980 1985 1990 1995 2000 our genetic material but also rapid development of GenBank MEDLINE new technologies, for instance microarrays. As we Figure 1 Growth of biological information measured by number of enter the ‘postgenomic’ era, the nature and priorities entries in two major databases: GenBank – a collection of annotated of bioinformatics research and applications are chang- nucleotide sequences and MEDLINE – a collection of biomedical ing, for instance statistic analysis plays a more literature. prominent role. ENCYCLOPEDIA OF LIFE SCIENCES & 2005, John Wiley & Sons, Ltd. www.els.net 1 Bioinformatics fragments that are subjected to the shotgun sequencing Databases process – about 200 kb long bacterial artificial chromosomes (BACs) in the public project and the Databases are an invaluable component of bioinfor- whole human genome in the private one. The raw data, matics. They can be defined as large organized sets of after some cleaning, are used to assemble longer data usually linked with software that allows database contigs. First, all the sequences are checked for the searching, extraction of the relevant information and overlaps. The information about overlapping frag- the update of their content. A good database should ments can be summarized in a graph structure. Then, enable easy access to the data and allow the extraction the problem of sequence assembly is to find a path that of information in a way that all required data will be covers all fragments. Unfortunately, this is an NP- extracted and none of unwanted data will be retrieved. hard problem (which means that it requires non- Biological databases can be divided into primary deterministic polynomial time to be solved) and and derivative databases. Primary databases store the therefore all known algorithms are relatively slow. original information submitted by experimentalists. Recently, a Eulerian approach (one of the algorithms Database staff organize but do not add additional or that uses graph theory) has been proposed to solve the alter information submitted by a biologist. Three assembly problem but it remains to be shown if this nucleic acid databases, DDBJ, EMBL and GenBank, approach is applicable to large eukaryotic genomes. are the best known examples of primary databases. One of the greatest challenges of the human genome Derivative databases may be human curated or assembly is the recently duplicated fragments of a computationally derived. In such a case, they usually genome. These fragments, which can be as long as represent a subset of primary database with added 10 kb, have a sequence similarity higher than a sequ- value, such as correction of the data or adding encing error rate. Therefore, the assembly software some new information. SWISS-PROT and UniGene treats them as an identical sequence. This causes the represent human-curated and computer-generated known problem of collapsing legitimate duplicons into derivative databases respectively. one fragment. On the other hand, it is not uncommon that some fragments are ‘artificially’ duplicated by the assembly software, placing the same fragment in two Sequence Analysis distinct chromosomal locations. Development of technologies to determine the sequence of the nucleotides in nucleic acids or amino Gene Prediction acids in proteins has created the need for sequence analysis. It is a process of investigating the information The assembled genome is the starting point, not the content of raw sequence data. This is not a trivial task goal, of the sequencing projects. The linear sequence is as the information may be obscured from the useless unless annotated. The annotation process researcher by long stretches of sequences that do not reveals biological information hidden in the nucleotide carry any information (this is particularly true for (or amino acid) sequence. Only approximately 2% of eukaryotic genomic sequences). the human genome codes for proteins, and the protein- coding regions are not contiguous – exons are interlaced with introns. Many techniques were devel- Nucleotide Sequence Assembly oped to identify exons. Computational methods can be divided into homology-based or model-based. The The automation of DNA sequencing was a major former methods work well if a homologous sequence contributing factor to the success of the HGP in the of a given gene is already known for other organisms. first place. Despite all improvements, DNA sequenc- The latter do not require any previous knowledge ing has one crucial limitation – no more than several about a given gene. Since many prokaryotic genomes hundreds of nucleotides can be read in a single have been sequenced in the last several years, the reaction. To determine the sequence of a longer homology-based approach is a first choice for these molecule, it has to be chopped into smaller pieces, organisms. In fact, most of the genes of the newly sequenced and then the whole sequence of the long sequenced prokaryotic genomes can be determined in molecule has to be deduced based on the nucleotide this way. Although for eukaryotic genomes model- sequence of shorter molecules. The process of the long based methods are still dominating, a comparative sequence deduction is called sequence assembly. Both genomic approach can add valuable information to public and private sequencing projects used a shotgun any method applied. None of the existing methods are technique to produce raw sequencing data. The perfect and we are far from solving the gene prediction difference lies in the size of the human genome problem. In practice, several methods should be applied 2 Bioinformatics and experimental validation is required. Luckily, it top scoring sequences. The important part of the seems that computational methods can at least point process is a statistical assessment of the obtained to the protein-coding region, but precise boundaries of alignment, in other words estimating the probability of the exons are still quite difficult to predict. The major obtaining agiven alignmentby chance. Differentflavors problem is that with increased sensitivity, specificity of programs can compare nucleotide or amino acid declines. Therefore, to avoid too many false positives sequence against nucleic acid or protein databases. we have to give up some sensitivity and allow a Interestingly, because of the larger alphabet and more fraction of exons not

Load more