Appendices 1. Glossary
Total Page:16
File Type:pdf, Size:1020Kb
ApPENDICES 1. Glossary This glossary does not aim at completeness and defines only those terms of genomics and computational biology that are extensively used in this book and a few additional terms deemed by the authors to be specifically important in comparative genomics. Many addititional definitions can be found at the following WWW sites: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossarv2.html. http://genomics.phrma.org/lexicon/index.html. Alignment The principal form cf representation of the similarity between nucleotide or amino acid sequences whereby nucleotides or amino acids inferred to have been derived from the same ancestral residue are superimposed. Identical or similar (in the case of amino acid sequences) residues are typically marked in an alignment. A pairwise alignment consists of two aligned sequences; pairwise alignments are typically produced as the output of a database search. A multiple alignment consists of three or more aligned sequences. In aglobai alignment, sequences are aligned in their entirety, whereas a loeal alignment consists of aligned subsequences. The optimal (global or loeal) alignment of two sequences is the alignment with the highest score under a given scoring system, which is least likely to be found by chance; optimal alignment algorithrns for multiple sequences are currently not feasible. Struetural alignment is a sequence alignment produced by superposition of the three-dimensional structures of the respective proteins; structural alignments are generally thought to be the most accurate form of alignment. Annotation (genome annotation) Primary analysis of genomes deposited in sequence databases; typically involves prediction of genes and their functions. Archaea (formerly archaebacteria) One of the three primary kingdoms ("domains" or "superkingdoms") of life, along with bacteria and eukaryotes. Like bacteria, archaea are prokaryotes, i.e. simple, unicellular organisms with small cells containing no distinct organelles. Archaeal systems of replication, transcription and translation are phylogenetically and mechanistically affiliated with those of eukaryotes, whereas the metabolic and signaling systems largely resemble those of bacteria. 382 Bioinformatics Narrowly and most appropriately defined, methods of computer science and informatics as applied to storage and retrieval of biological (particularly sequence and structural) information. More often used to designate almost any application of computers in biology, including the entire field of computational genomics. BLAST (= Basic Local Alignment Search Tool) The most commonly used suite of programs for fast similarity searches in nucleotide or protein sequence databases. BLOSUM (= BLOcks SUbstitution Matrix) Aseries of amino acid substitution matrices, in which the scores for each pairs of amino acid residues are derived from the frequencies of substitutions in local alignments of homologous proteins from the BLOCKS database. Each matrix is tailored to a particular evolutionary distance. For example, in the BLOSUM62 matrix, which is used by default by BLAST, the scores were derived from the alignments of sequences with no more than 62% identity. CATH A hierarchie al classification of protein structures that clusters proteins at the levels of Class, Architecture, Topology and Homologous superfamily (http://www.biochem.ucl.ac.uklbsm/cathnew/index.html) . Composition-based statistics Modification of the Karlin-Altschul statistics of sequence comparison that involves calculation of the scaling parameters separately for each database sequence. This typically solves problems associated with low complexity regions and obviates the necessity for filtering. Consensussequence One of the forms of representation of the sequence conservation in an alignment of DNA or protein sequences. A consensus includes residues that are most frequent in each position of the alignment. Convergence Independent ongm of similar features in sequences, structures and biological processes due to shared functional constraints. 383 Domain (protein domain) A compact, stable, independently folding unit of protein structure. Often used also to designate a portion of a protein with distinct evolutionary history, which may exist in the form of a stand-alone protein or as part of different domain architectures (linear sequence of domains in a multidomain protein ). An evolutionary domain is typically identical to a structural domain, but in some cases, may inc1ude more than one structural domain. Domain accretion A notable phenomenon in the evolution of complex life forms, primarily eukaryotes, whereby increasingly complex domain architectures are observed within a set of orthologs. E-value (= Expectation value of a given score). The number of different alignments with scores equal to or greater than a given value that are expected to be found by chance in a search of a given database with a particular method and substitution matrix (e.g. BLAST with BLOSUM62). The lower the E-value the more statistically significant is the score. Exaptation Recruitment of a protein for a new function, which may be mechanistically, but not biologically related to the original one Family (of homologous proteins) One of the main levels in the hierarchical c1assification of proteins. There is no generally accepted, formal definition. Typically, proteins with readily detectable, statistically significant sequence similarity to each other are c1assified as a family. FASTA The first fast and sensitive method for sequence database search. Filtering (sequence filtering) Masking repeats, regions of low sequence complexity or regions with other features in nuc1eotide or amino acid sequences, typically in the query sequence prior to a database search to avoid spurious hits. Also see SEG. 384 Fold (protein fold, protein structural fold) The central category in the hierarchical c1assification of proteins. Proteins with the same basic topology of structural elements are c1assified as having the same fold. Functional genomics Study of biological functions of individual proteins, complexes, pathways etc. based on or at least facilitated by analysis of genome sequences. Gap Aspace introduced into an alignment to compensate for an insertion or deletion in one aligned sequence relative to the other(s); typically designated by a dash. Gap penalty A penalty to the overall alignment score (a negative score) for gap introduction andJor extension, intended to prevent accumulation of too many gaps in an alignment (reflecting the fact that insertions and deletions are less common during evolution than substitutions). Gene A piece of genOInic DNA that encodes the synthesis of a (at least one) mRNA or structural RNA molecule. GenBank Archival database of nuc1eotide sequences; typically, searched for nuc1eotide sequence similarity. Genome The complete DNA sequence of a life form; consists of genes and intergenic regions. Genomics The study of genomes. GenPept Archival database of protein sequences translated from ORFs annotated in GenBank; the basis for the NR database. Hidden Markov Models (HMM) Mathematical formalism used to describe sequence alignments or individual sequences in terms of the probability of a given symbol in a given position. 385 Homology Origin from a common ancestor. Applies primarily to genes and their products (homologs), but entire complexes, pathways, operons etc. also may be called homologous when they are inferred to have a common origin. Horizontal (lateral) gene transfer (HGT or LGT) Transfer of genes from one phylogenetic lineage to another, as opposed to vertical descent along alineage. HSP (High-scoring Segment Pair) A segment of an alignment of two sequences such that its similarity score cannot be improved by adding or trimming any letters; applies to similar sequence segments detected in database searches, such as BLAST. Karlin-Atlschul statistics Statistical theory based on the extreme value distribution of the best scores for pairs of sequences; used for E-value calculation in BLAST. Lineage-specific expansion (of paralogous gene families) Aseries of duplications of genes during the evolution of a phylogenetic lineage leading to the emergence of a lineage-specific cluster of paralogs. Typically, lineage-specific expansions are associated with adaptations linked to the lifestyle of the respective organisms. Lineage-specific gene loss Elimination of a distinct set of genes in a particular phylogenetic lineage. Low-complexity regions Sequence regions of biased nucleotide or amino acid composition, e.g. homopolymeric runs, short repeats, and more subtle overrepresentations of one or a few residues. Low-complexity regions in amino acid sequences typically assume non-globular structure in proteins. These regions cause spurious hits in database searches unless they are masked (e.g. with the SEG pro gram) or composition-based statistics is applied. Molecular dock The evolutionary concept, introduced by Emile Zuckerkandl and Linus Pauling, according to which the rate of a gene's evolution remains constant throughout its history . Motif (protein sequence motiO A pattern of evolutionarily conserved amino acid residues in a protein that are important for its function and are located within a certain (typically, short) distance from each other in the protein sequence. A single domain 386 may contain several conserved sequence motifs. Non-orthologous gene displacement (NOGD) Displacement, in the course of evolution, ofa gene coding for a protein responsible for a particular function