Screening of Ancestral Polymorphisms for Immune Response Genes
submitted by Diplom-Ingenieur Sven M. Dillenburger from Berlin

to Faculty III - Process Sciences of the Technische Universität Berlin
in fulfilment of the requirements for the academic degree
Doktor der Ingenieurwissenschaften (Dr.-Ing.)

Approved dissertation

Doctoral committee:
Chair: Prof. Dr. rer. nat. R. Lauster
Reviewer: Prof. Dipl.-Ing. Dr. U. Stahl
Reviewer: Dr. T. D. Taylor

Date of the scientific defense: 20 December 2007

Berlin 2008
D 83

Abbreviations and Glossary

Allele
One of two or more alternative forms of a gene, resulting in different gene products and thus different phenotypes. An organism is homozygous for a gene if its alleles are identical, and heterozygous if they are different.

Adenine (A)
A purine base (nitrogenous base) and constituent of nucleotides, and as such one member of the base pair A-T (adenine-thymine) in DNA and A-U (adenine-uracil) in RNA.

Annotation
The process of attaching biological information to DNA sequences. It consists of two procedures: (1) identifying elements in a genome, and (2) attaching biological information (ORFs and their localization, gene structure, coding regions, location of regulatory motifs, etc.) to these elements. Automatic annotation tools (e.g. Ensembl) try to perform every step by computer analysis, as opposed to manual annotation, which involves human expertise.

Alignment
The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the presence of homology.

BAC clone
Bacterial artificial chromosome vector that carries a genomic DNA insert.

Base pair (bp)
Two nitrogenous (purine or pyrimidine) bases (adenine and thymine, or guanine and cytosine) held together by weak hydrogen bonds.
Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs. The number of base pairs is often used as a measure of the length of a DNA segment, e.g. 500 bp.

Breakpoint
A site of recombination when chromosomes undergo meiosis.

BLAST
Basic Local Alignment Search Tool. A sequence comparison algorithm, optimized for speed, used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding a threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.

BLAT
The BLAST-Like Alignment Tool is a bioinformatics software tool that performs rapid mRNA/DNA and cross-species protein alignments. BLAT is more accurate and 500 times faster than other popular existing tools for mRNA/DNA alignments, and 50 times faster for protein alignments, at the sensitivity settings typically used when comparing vertebrate sequences.

Consed
A tool for viewing, editing, and finishing sequence assemblies created with Phrap. Finishing capabilities include allowing the user to pick primers and templates, suggesting additional sequencing reactions to perform, and facilitating checks of assembly accuracy using digest and forward/reverse pair information.

Contig
The result of joining an overlapping collection of sequences or clones.

Chromosome
The individual threads of DNA within a cell nucleus. The self-replicating genetic structures of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes.

Cytosine (C)
A pyrimidine base (nitrogenous base) and constituent of nucleotides, and as such one member of the base pair G-C (guanine and cytosine) in DNA.

Deoxyribonucleic acid (DNA)
The molecule that encodes genetic information.
DNA is a double-stranded molecule, held together by weak bonds between base pairs of nucleotides. The four nucleotides in DNA contain the bases adenine (A), guanine (G), cytosine (C), and thymine (T).

Draft genome sequence
A sequence produced by combining the information from the individual sequenced clones and positioning the sequence along the chromosomes. Compared to finished sequence, many gaps exist and the ordering and orientation of contigs are tentative.

Ensembl
A genome browser developed by the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute. It is a software system that produces and maintains automatic annotation on selected eukaryotic genomes.

EST
Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.

Exon
The protein-coding DNA sequence and untranslated regions (UTR) of a gene that remain after the introns are spliced out of the mRNA.

Filtering (Masking)
The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics, such as repetitiveness or low complexity, that frequently lead to spurious high scores. The most commonly used software is RepeatMasker.

Finished Sequence
Complete sequence of a clone or genome, with an accuracy of at least 99.99% and only intractable gaps.

Gene
An ordered sequence of nucleotides located in a particular position (locus) on a particular chromosome that encodes a specific functional product (the gene product, i.e. a protein or RNA molecule). It includes regions involved in the regulation of expression and regions that code for a specific functional product.

Genotype
The set of genes carried by an individual, referring to a particular pair of alleles.

Genomic DNA
DNA from the genome, containing all coding (exon) and noncoding (intron and other) sequences, in contrast to cDNA, which contains only coding sequences.
Guanine (G)
A purine base (nitrogenous base) and constituent of nucleotides, and as such one member of the base pair G-C (guanine and cytosine).

Intron
The DNA base sequence interrupting the exon sequences of a gene; intron sequences are transcribed into RNA but are spliced out of the messenger RNA (mRNA) before it is translated into protein.

LD
Linkage disequilibrium (LD) is the non-random association of alleles at two or more loci. LD describes a situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected if haplotypes formed at random from alleles according to their frequencies.

Megabase (Mb)
Unit of length for DNA fragments equal to 1 million nucleotides and roughly equal to 1 centimorgan (cM).

Polymerase Chain Reaction (PCR)
A method for amplifying a DNA base sequence using a heat-stable polymerase and two primers of approximately 20 bases each, one complementary to the (+)-strand at one end of the sequence to be amplified and the other complementary to the (-)-strand at the other end.

Primer3
Primer design software.

Phrap
A widely used computer program that assembles raw sequences into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the Phred scores of the raw sequence reads. A Phrap quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a Phrap quality score of 30 corresponds to an accuracy of 99.9% for a base in the assembled sequence.

Phred
A widely used computer program that analyses raw sequences to produce a base call with an associated quality score for each position in the sequence. A Phred quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a Phred quality score of 30 corresponds to an accuracy of 99.9% for the base call in the raw read.

Raw sequence
Individual sequence reads, produced by sequencing of clones containing DNA inserts.
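
The relationship stated in the Phrap and Phred entries, an error probability of approximately 10^(-X/10) for a quality score of X, can be checked with a short calculation. The Python sketch below is illustrative only: the function names are hypothetical and are not part of Phred or Phrap, and it assumes nothing beyond the formula quoted in those entries.

```python
import math


def quality_to_error_probability(q: float) -> float:
    """Error probability for a quality score q, using P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_quality(p: float) -> float:
    """Inverse relation: quality score for an error probability p."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error probability and per-base accuracy for some scores.
    for q in (10, 20, 30, 40):
        p = quality_to_error_probability(q)
        print(f"Q{q}: error probability {p:.4g}, accuracy {100 * (1 - p):.2f}%")
```

Under this formula a score of 30 gives an error probability of 0.001, which is the 99.9% accuracy quoted in both entries, and a score of 40 gives the 99.99% accuracy required of finished sequence.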
sim2
A program for building local alignments of two sequences, each of which may be hundreds of kilobases long. sim2 first constructs the n best non-intersecting chains of 'fragments', such as all occurrences of identical 5-tuples in each of two DNA sequences, for any specified n >= 1. Each chain is then refined by delivering an optimal alignment in a region delimited by the chain.

SNP
Single nucleotide polymorphism. A single nucleotide position in the genome sequence for which two or more alternative alleles are present.

STS
Sequence-tagged site. A site that corresponds to a short (usually less than 500 bp), specific locus for which a unique PCR assay has been developed.

Perl
A programming language (Practical Extraction and Report Language) extensively used in areas such as bioinformatics and web programming.

Thymine (T)
A pyrimidine base (nitrogenous base) and constituent of nucleotides, and as such one member of the base pair A-T (adenine-thymine) in DNA.

Index

Abbreviations and Glossary
1. Acknowledgements
2. Abstract / Zusammenfassung
3. Introduction
3.1 Comparative Genomics
3.2 The Chimpanzee Genome and Positive Selection
3.3 The Immune System
3.3.1 Nonspecific (Innate) Immunity