An Introduction to Bioinformatics
Mohamed Abdel-Hakim Mahmoud
Genetics Department, Faculty of Agriculture, Minia University, El-Minia, EGYPT

WHAT IS BIOINFORMATICS?

• Applying "informatics" techniques from mathematics, statistics and computer science to understand and organize the information associated with biological molecules on a large scale.

• It can be defined as the body of tools and algorithms needed to handle large and complex biological information.

• Bioinformatics is a new scientific discipline created from the interaction of biology and computer science.

• Bioinformatics is clearly a multidisciplinary field: it uses mathematical, statistical and computing methods for the organization, management, analysis and interpretation of biological information (DNA, amino acid sequences and related information), with the aim of solving biological problems.

More Definitions

• The NCBI defines bioinformatics as "a field of science in which biology, computer science, and information technology merge into a single discipline".

• In Wikipedia: Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

• Roughly, bioinformatics describes any use of computers to handle biological information (storing and processing large amounts of biologically derived information, whether DNA or protein sequences).

Preliminaries of Biology

Structure & function of:
• Nucleic acids (DNA & RNA)
• Proteins

Transcription & translation

Genome sizes of different organisms.

[Figure: chromosome sizes (×10^6 bp); chromosome 21 is labeled.]

The $1,000 genome refers to an era of predictive and personalized medicine during which the cost of fully sequencing an individual's genome (whole-genome sequencing, WGS) is roughly one thousand USD.[1][2] It is also the title of a book by British science writer and founding editor of Nature Genetics, Kevin Davies.[3] By late 2015, the cost to generate a high-quality 'draft' whole human genome sequence was just below $1,500.[4]

Genomics era: High-throughput DNA sequencing

• The first high-throughput genomics technology was automated DNA sequencing, in the early 1990s.
• In 1995, Venter and Hamilton Smith used a whole-genome shotgun sequencing strategy to sequence the genomes of Mycoplasma and Haemophilus.
• In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.
• The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera Genomics.

Completed genomes

• Currently, the genomes of the following numbers of organisms have been sequenced: Eukaryotes (10811), Prokaryotes (239173), Viruses (35013), Plasmids (20416), Organelles (15408).

• This generates large amounts of information to be handled by individual computers.

The trend of data growth

The 21st century is a century of:
• Genomics: new sequence information is being produced at increasing rates (the contents of GenBank have doubled roughly every 18 months).
  o Metagenomics: "Who is there and what are they doing?"
• Microarray: global expression analysis; RNA levels of every gene in the genome are analyzed in parallel (now largely replaced by RNA-seq).
• Proteomics: global protein analysis generates large mass-spectra libraries.
• Metabolomics: global metabolite analysis; 25,000 secondary metabolites characterized.

How to handle the large amount of information?
• Answer: BIOINFORMATICS & INTERNET

Why do we need the Internet?

• "Omics" projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.
• Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

• There is a need for computers and algorithms that allow:
  o Access, processing, storing, sharing, retrieving, visualizing, annotating…

Database storage

Things you must have

• You have a PC running Microsoft Windows.
• You have an Internet connection (a fast one if possible, but not necessarily).
• You likely have a background in molecular biology.
• You know how to use an Internet browser but not much more about computers.
• You don't want to become a bioinformatics guru; you simply want to use the right tools for your problem.
• Most private biotech companies consider it unsafe to send data over the Internet. We assume here that the data you want to analyze over the Internet is not very confidential.

Bioinformatics history

• Before the era of bioinformatics, only two ways of performing biological experiments were available:
  o within a living organism (so-called in vivo), or
  o in an artificial environment (so-called in vitro, from the Latin for "in glass").
• Taking the analogy further, we can say that bioinformatics is in fact in silico biology, from the silicon chips on which microprocessors are built.
• In the 1960s: the birth of bioinformatics. The beginning of bioinformatics can be traced back to Margaret Dayhoff in 1968 and her collection of protein sequences known as the Atlas of Protein Sequence and Structure. Sci. Am. 1969 Jul; 221(1):86-95.

• Early significant experiments in bioinformatics
  o In this study, scientists used one of the first sequence-similarity searching computer programs (called FASTP) to determine that a cancer-causing viral sequence was most similar to the well-characterized cellular PDGF gene.
  o This surprising result provided important mechanistic insights for biologists working on how this viral sequence causes cancer. Science. 1983 Jul 15; 221(4607):275-7. Nature. 1983 Jul 7-13; 304(5921):35-9.
• First complete genome in GenBank
  o The genome of Haemophilus influenzae Rd is the first genome of a free-living organism to be deposited into the public sequence databanks. Science. 1995 Jul 28; 269(5223):496-512.

Why do we use Bioinformatics?
• Store/retrieve biological information (DATABASES).
• Retrieve/compare gene(s) and/or protein(s) sequences.
• Predict the function of unknown gene(s) and/or protein(s).
• Search for previously known functions of gene(s) and/or protein(s).
• Compare data with other researchers.
• Compile/distribute data for other researchers.

Fields related to Bioinformatics
• Genomics.
  o "Genomics is any attempt to analyze or compare the entire genetic complement of one or more species."
• Proteomics: "the PROTEin complement of the genOME".
  o "Qualitative and quantitative studies of gene expression at the level of the functional proteins themselves."
• Pharmacogenomics.
  o "Pharmacogenomics is the application of genomic approaches and technologies to the identification of drug targets."
• Pharmacogenetics.
  o Pharmacogenetics is a subset of pharmacogenomics which uses genomic/bioinformatic methods to identify genomic correlates
• Biophysics.
  o "An interdisciplinary field which applies techniques from the physical sciences to understanding biological structure and function."
• Mathematical Biology.
  o It focuses almost exclusively on specific algorithms that can be applied to large molecular biological data sets.
• Medical informatics/Medinformatics.
  o "Study, invention, and implementation of structures and algorithms to improve communication, understanding and management of medical information."
• Cheminformatics.
  o "The combination of chemical synthesis, biological screening, and data-mining approaches used to guide drug discovery and development."

Computational Biology

An "approach" involving the use of computers to study biological processes:
• Finding the genes in the DNA sequences of various organisms.
• Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
• Clustering protein sequences into families of related sequences and developing protein models.
• Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.

Some Applications of bioinformatics:
• Medicine {Molecular Med.; Personalized Med.; Preventative Med.; Gene Therapy; Disease Diagnosis; Forensic Analysis; Drug Development; …}
• Microbial Genome Applications.
• Waste Cleanup.
• Crop and Livestock Improvement.
• Evolutionary Studies.
• Climate Change Studies.
• Alternative Energy Sources.
• Improve Nutritional Quality.
• Bio-Weapons Creation.
• Biotechnology, etc.

Some Applications: Medical Implications
• Pharmacogenomics
  o Not all drugs work on all patients; some good drugs cause death in some patients.
  o By doing a gene analysis before treatment, the offending drugs can be avoided.
  o Drugs that cause death in most patients can still be used on the minority whose genes are well suited to them (volunteers wanted!).
  o Customized treatment.
• Gene Therapy
  o Replace or supply the defective or missing gene.
  o E.g., insulin, and Factor VIII for haemophilia.
• Diagnosis of Disease
  o Identifying the genes which cause a disease will help detect the disease at an early stage.
• Drug Design
  o One of the goals of bioinformatics is to reduce the time and cost involved in drug design.
• Drug Discovery
  o Target identification (proteins are the most common targets).
  o For example, HIV produces HIV protease, a protein which in turn cleaves other proteins.
  o HIV protease has an active site where it binds to other molecules.
  o An HIV drug therefore has to bind to that active site: easier said than done!
  o Lead compounds are the molecules that bind to the target protein's active site.

The Commercial Market

CAGR (compound annual growth rate)

This is a list of bioinformatics companies that have articles at Wikipedia.
• Applied Maths: provides the software suite BioNumerics
• Astrid Research
• BIOBASE
• BioBam Bioinformatics: creator of Blast2GO
• Biomax Informatics AG: bioinformatics services
• Biovia (formerly Accelrys)
• Chemical Computing Group: MOE software for structural modelling
• CLC Bio: bioinformatics workbenches
• DNASTAR: provides DNA sequence assembly and analysis
• Gene Codes Corporation
• Genedata: software for data analysis and storage
• GeneTalk: web-based services
• GenoCAD
• Genomatix
• Genostar: provides streamlined bioinformatics
• Inte:Ligand
• Integromics
• Invitrogen: creator of Vector NTI
• Korea Computer Centre Sinhung Company
• Leidos Biomedical Research Inc. (formerly SAIC): services aimed at the Federal Government market
• MacVector
• QIAGEN Silicon Valley (formerly Ingenuity Systems)
• Qlucore
• Phalanx Biotech Group
• SimBioSys: created the eHITS software
• SRA International: services aimed at the Federal Government market
• Strand Life Sciences
• TimeLogic: offers DeCypher FPGA-accelerated BLAST, Smith-Waterman, HMMER and other sequence search tools

The most common computational tasks:

Similarity search

Sequence comparison:
• Alignment, multiple alignment, retrieval

Sequence analysis:
• Signal peptide, transmembrane domain, …

Protein folding:
• Secondary structure from sequence

Sequence evolution:
• Phylogenetic trees

Applying algorithms to analyze genomics data

You have just cloned a gene:
• Is it already in databases? Accession #? Annotation?
• Protein information? Sub-localization? Soluble? 3D fold?
• Other characteristics? Expression profile? Mutants?
• Is there any similar sequence(s)? % identity? Family member?
• Are there conserved regions? Alignments? Domains?
• Evolutionary relationship? Phylogenetic tree.

A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.

SEQUENCE ANALYSIS:

• Determine the correct sequence of DNA

• Compare the sequence with known sequences

• Translate DNA into protein sequence

• Understand protein function in context

BIOLOGICAL SEQUENCES

DNA Sequences

IUPAC  Chemical Structure of DNA nucleotide Base  Sequences Complementarity code A Adenine of the two strands forming C Cytosine the double helix G Guanine  Palindromes in DNA T (or U) Thymine (or Uracil) R A or G sequences and their Y C or T important biological roles S G or C o Sites for most of the W A or T K G or T restriction enzymes. M A or C o Serve as binding sites for B C or G or T many regulatory proteins. D A or G or T H A or C or T o Have a strong influence V A or C or G on the 3-D structure of N any base DNA molecules. . or - gap International Union of Pure and Applied Chemistry Points to Remember

• We say number of positions rather than number of nucleotides.
• A double-stranded DNA molecule that is 400 positions long contains 800 nucleotides (400 bp).
• To make this clearer, DNA sequence sizes are often given in base pairs (bp).
• Larger units, such as kb (1000 bp) or Mb (1000 kb, mega-bp), are also used.

The genetic code…

Writing DNA Sequences.

ATGGAAGTATTTAAAGCGCCACCTATTGGGATAATG…
ATG GAA GTA TTT AAA GCG CCA CCT ATT GGG ATA ATG …

The Reading frames.
• 5' to 3' DNA sequence and N- to C-terminal polypeptide chain.
• Because the genetic code is triplet-based, the computer can generate six possible reading frames from any given DNA sequence:
  o three on each strand, because both strands of the DNA can be used.

THE GENETIC CODE

The genetic code is triplet; the four nucleotides give 64 (4^3) codons.

First position               Second position                      Third position
(5' end)      U            C            A            G            (3' end)
U             UUU Phe (F)  UCU Ser (S)  UAU Tyr (Y)  UGU Cys (C)  U
              UUC Phe (F)  UCC Ser (S)  UAC Tyr (Y)  UGC Cys (C)  C
              UUA Leu (L)  UCA Ser (S)  UAA Stop     UGA Stop     A
              UUG Leu (L)  UCG Ser (S)  UAG Stop     UGG Trp (W)  G
C             CUU Leu (L)  CCU Pro (P)  CAU His (H)  CGU Arg (R)  U
              CUC Leu (L)  CCC Pro (P)  CAC His (H)  CGC Arg (R)  C
              CUA Leu (L)  CCA Pro (P)  CAA Gln (Q)  CGA Arg (R)  A
              CUG Leu (L)  CCG Pro (P)  CAG Gln (Q)  CGG Arg (R)  G
A             AUU Ile (I)  ACU Thr (T)  AAU Asn (N)  AGU Ser (S)  U
              AUC Ile (I)  ACC Thr (T)  AAC Asn (N)  AGC Ser (S)  C
              AUA Ile (I)  ACA Thr (T)  AAA Lys (K)  AGA Arg (R)  A
              AUG Met (M)  ACG Thr (T)  AAG Lys (K)  AGG Arg (R)  G
G             GUU Val (V)  GCU Ala (A)  GAU Asp (D)  GGU Gly (G)  U
              GUC Val (V)  GCC Ala (A)  GAC Asp (D)  GGC Gly (G)  C
              GUA Val (V)  GCA Ala (A)  GAA Glu (E)  GGA Gly (G)  A
              GUG Val (V)  GCG Ala (A)  GAG Glu (E)  GGG Gly (G)  G

• THE CODE IS DEGENERATE (degeneracy): codons specifying the same amino acid are synonyms. This explains the great variation in the AT/GC ratios in the DNA of various organisms without correspondingly large changes in the relative proportions of amino acids in their proteins.
• Three codons direct chain termination (stop or nonsense codons: UAA, UAG and UGA).

Any region of the DNA sequence can, in principle, code for six different amino acid sequences, because any one of three different reading frames can be used to interpret each of the two strands.

Non-coding DNA
• Additional complications arise from the fact that some DNA sequences do not encode proteins at all, and that higher organisms have large pieces of noncoding DNA inserted within their genes.
• A large part of bioinformatics is devoted to the development of methods to locate protein-coding regions in DNA sequences, to delineate precisely where genes start and end, or where they are interrupted by the noncoding intervals (called introns).

RNA Sequences

• Chemical structure of RNA.
• Differences between DNA & RNA.
• Types of RNA molecules: mRNA, tRNA, rRNA & small RNAs (siRNA, miRNA & piRNA). Small RNAs are crucial elements in a host of cellular processes including development, apoptosis, genome organization, and several diseases, notably cancer.
• Structural conformation of RNA molecules and their functional roles.

PROTEINS

Proteins have a variety of roles that they must fulfill:
1. They are the enzymes that rearrange chemical bonds.
2. They carry signals to and from the outside of the cell, and within the cell.
3. They transport small molecules.
4. They form many of the cellular structures.
5. They regulate cell processes, turning them on and off and controlling their rates.

AMINO ACIDS

Structural features of an amino acid. Each of the 20 amino acids (aa) consists of two parts:

1. a part that is identical among all 20 amino acids; this part is used to link one amino acid to another to form the backbone of the protein.
2. a unique side chain (or "R group") that determines the distinctive physical and chemical properties of the amino acid.

[Figure: peptide-bond formation.]

Classification of the Amino Acids

The 20 different amino acids can be classified into four categories based upon their major chemical properties.

1. Positively charged (and therefore basic) amino acids (3). o Arginine Arg R, Histidine His H, Lysine Lys K

2. Negatively charged (and therefore acidic) amino acids (2). o Aspartic acid Asp D, Glutamic acid Glu E

3. Polar amino acids (7). Though uncharged overall, can form hydrogen bonds with water (hydrophilic) and are often found on the outer surface of folded proteins. o Asparagine Asn N, Cysteine Cys C, Glutamine Gln Q, Glycine Gly G, Serine Ser S, Threonine Thr T, Tyrosine Tyr Y

4. Nonpolar amino acids (8). These amino acids are uncharged and have a uniform charge distribution. Because of this, they do not form hydrogen bonds with water, are called hydrophobic, and tend to be found in the interior of folded proteins. o Alanine Ala A, Isoleucine Ile I, Leucine Leu L, Methionine Met M, Phenylalanine Phe F, Proline Pro P, Tryptophan Trp W, Valine Val V

[Figure: the 20 naturally occurring amino acids in proteins. Commonly used abbreviations for amino acids, including the single-letter code, are shown in parentheses. (amino acid similarity)]

This variety of roles is accomplished by the variety of the three-dimensional (3D) shapes of proteins.
• The three-dimensional shape of the protein is determined by the specific linear sequence of aa from N- to C-terminus.
• Protein size is usually measured in terms of the number of aa that comprise it. It can range from fewer than 20 to more than 5000 aa in length, although an average protein is about 350 aa in length.
• Each protein that an organism can produce is encoded in a piece of the DNA called a "gene".
• The number of proteins that can be produced by an organism, especially in eukaryotes, greatly exceeds the number of genes, owing to "alternative splicing".

PROTEIN STRUCTURE CAN BE DESCRIBED AT FOUR LEVELS

[Figure panels: "Primary structure", "Secondary structure", "Tertiary structure", "Quaternary structure".]

Levels of protein structure, illustrated by hemoglobin. Hemoglobin is a tetramer of two "α chains" and two "β chains," but the two kinds of chain have very similar tertiary structures, as can be seen in the drawing.

Formation of the disulfide bond.

Protein domains.

Domain: A part of a polypeptide chain with a specific folded structure that does not depend for its stability on any of the remaining parts of the protein.

• Two of the four domains of the protein CD4, which is found on the surface of certain T-cells and macrophages.

• Two enzymes: triosephosphate isomerase (TIM; left) and pyruvate kinase (PK; right). The figure shows one monomer of the TIM dimer. PK folds into three domains. The central domain is a TIM barrel (compare with the side view of TIM). The comparison of TIM and PK shows that a domain found as an isolated unit in one protein can join with additional domains in another protein.

Protein motifs

Motif (sequence): A short amino acid sequence with characteristic properties, often those suitable for association with a specific kind of domain on another protein. (Note that the term "domain" is sometimes incorrectly applied to such sequence motifs.)

Motif (structural): A domain substructure that occurs in many different proteins, often having some characteristic amino acid sequence properties (e.g., the helix-turn-helix motif in many DNA-recognition domains).
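A sequence motif can be written as a pattern and searched for with a regular expression. The sketch below is only an illustration: it uses a simplified consensus often quoted for Cys2His2 zinc fingers (C, 2 to 4 residues, C, 12 residues, H, 3 to 5 residues, H), not a curated database pattern, and the test sequence and the name `find_motif` are made up for the example.

```python
import re

# Simplified Cys2His2 zinc-finger consensus: C-x(2,4)-C-x(12)-H-x(3,5)-H.
# This is an assumption for illustration; curated motif databases use stricter patterns.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_motif(protein, pattern=C2H2_PATTERN):
    """Return (start, end, matched_text) for every non-overlapping match.

    Positions are 0-based and the end is exclusive, as in Python slices.
    """
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(protein)]


# Hypothetical sequence containing one finger-like stretch.
sequence = "MKCPECGKSFSQKSNLQKHQRTHTG"
for start, end, text in find_motif(sequence):
    print(start, end, text)
```

A match only says that the spacing of cysteines and histidines fits the pattern; whether the stretch really folds as a zinc finger is a separate, structural question.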

[Figure: the Cys2His2 zinc finger motif; the Zif268–DNA complex.]

Glossary of Terms

Primary structure: Amino acid sequence of a polypeptide chain.
Secondary structure: Elements of regular polypeptide-chain structure with main-chain hydrogen bonds satisfied. The secondary structures that occur frequently in proteins are the α helix and the parallel and antiparallel β sheets.
Tertiary structure: The folded, three-dimensional conformation of a polypeptide chain.
Quaternary structure: Multi-subunit organization of an oligomeric protein or protein assembly.
Domain: A part of a polypeptide chain with a folded structure that does not depend for its stability on any of the remaining parts of the protein.
Motif (sequence): A short amino acid sequence with characteristic properties, often those suitable for association with a specific kind of domain on another protein. (Note that the term "domain" is sometimes incorrectly applied to such sequence motifs.)
Motif (structural): A domain substructure that occurs in many different proteins, often having some characteristic amino acid sequence properties (e.g., the helix-turn-helix motif in many DNA-recognition domains).
Topology (or fold): The structure of most protein domains can be represented schematically by the connectivity in three dimensions of their constituent secondary-structural elements and the packing of those elements against each other. Jane Richardson introduced "ribbon diagrams," such as those in many of the figures in this chapter, as convenient ways to visualize the fold of a domain (see the caption to Fig. 6-10). Not all folds are found in naturally occurring proteins (e.g., knotted folds are not found), and some folds are more common than others.
Homologous domains (or proteins): Domains (or proteins) that derive from a common ancestor. They necessarily have the same fold, and they often (but not always) have recognizably similar amino acid sequences.
Homology modeling: Modeling the structure of a domain based on that of a homologous domain.
Ectodomain: The part of a single-pass membrane protein that lies on the exterior side of the cell membrane.
Glycosylation: Addition of a chain, sometimes branched, of one or more sugars (glycans) to a protein side chain. The glycans can be N-linked (attached to the side-chain amide of asparagine) or O-linked (attached to the side-chain hydroxyl of serine or threonine).
Denaturation: Unfolding a protein or a domain of a protein, either by elevated temperature or by agents such as urea, guanidinium hydrochloride, or strong detergent ("denaturants").
Chaperone: A protein that increases the probability of native folding of another protein, usually by preventing aggregation or by unfolding a misfolded polypeptide chain so that it can "try again" to fold correctly.
Active site (or catalytic site): The site on an enzyme that binds the substrate(s), often in a configuration resembling the transition state of the reaction catalyzed.
Allosteric regulation: Control of affinity or of the rate of an enzymatic reaction by a ligand that binds at a site distinct from that of the substrate(s). The mechanism of allosteric regulation often involves a change in quaternary structure, that is, a reorientation or repositioning of subunits with respect to each other.

Sequence Analysis

Once a genome is completely sequenced, what sorts of analyses are performed on it? Some of the goals of sequence analysis are the following:
1. Identify the genes.
2. Determine the function of each gene. One way to hypothesize the function is to find another gene (possibly from another organism) whose function is known and to which the new gene has high sequence similarity. This assumes that sequence similarity implies functional similarity, which may or may not be true.
3. Identify the proteins involved in the regulation of gene expression.
4. Identify sequence repeats.
5. Identify other functional regions, for example origins of replication, pseudogenes (sequences that look like genes but are not expressed), sequences responsible for the compact folding of DNA, and sequences responsible for nuclear anchoring of the DNA.

Many of these tasks are computational in nature. Given the incredible rate at which sequence data is being produced, the integration of computer science, mathematics, and biology will be integral to analyzing those sequences.

PROTEIN

SEQUENCE

STRUCTURE

FUNCTION

The table gives you the list of these 20 building blocks, with their full names, three-letter codes, and one-letter codes (the IUPAC code, after the International Union of Pure and Applied Chemistry committee that designed it).
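The same 20 building blocks and their codes can be kept in a small lookup table. The sketch below is one possible layout, not a standard library structure: the names `AMINO_ACIDS`, `THREE_TO_ONE`, `CATEGORY` and `three_to_one` are illustrative, and the category column simply follows the four-way classification given in these notes (other textbooks group some residues differently).

```python
# (full name, three-letter code, one-letter code, category used in these notes)
AMINO_ACIDS = [
    ("Arginine", "Arg", "R", "positive"),
    ("Histidine", "His", "H", "positive"),
    ("Lysine", "Lys", "K", "positive"),
    ("Aspartic acid", "Asp", "D", "negative"),
    ("Glutamic acid", "Glu", "E", "negative"),
    ("Asparagine", "Asn", "N", "polar"),
    ("Cysteine", "Cys", "C", "polar"),
    ("Glutamine", "Gln", "Q", "polar"),
    ("Glycine", "Gly", "G", "polar"),
    ("Serine", "Ser", "S", "polar"),
    ("Threonine", "Thr", "T", "polar"),
    ("Tyrosine", "Tyr", "Y", "polar"),
    ("Alanine", "Ala", "A", "nonpolar"),
    ("Isoleucine", "Ile", "I", "nonpolar"),
    ("Leucine", "Leu", "L", "nonpolar"),
    ("Methionine", "Met", "M", "nonpolar"),
    ("Phenylalanine", "Phe", "F", "nonpolar"),
    ("Proline", "Pro", "P", "nonpolar"),
    ("Tryptophan", "Trp", "W", "nonpolar"),
    ("Valine", "Val", "V", "nonpolar"),
]

THREE_TO_ONE = {three: one for _, three, one, _ in AMINO_ACIDS}
CATEGORY = {one: category for _, _, one, category in AMINO_ACIDS}


def three_to_one(peptide):
    """Convert hyphen-separated three-letter codes to a one-letter string."""
    return "".join(THREE_TO_ONE[res.capitalize()] for res in peptide.split("-"))


print(three_to_one("Met-Glu-Val-Phe-Lys"))
print([CATEGORY[aa] for aa in "MEVFK"])
```

One-letter strings are what sequence databases and alignment programs work with, which is why the conversion is worth having at hand.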

Central concept of Molecular Biology and Bioinformatics

For DNA, RNAs and Proteins

SEQUENCE ➪ ➪ ➪STRUCTURE ➪ ➪ ➪FUNCTION

• The function of a protein turned out to be a direct consequence of its 3-D structure (shape).

• Hence, it is predicted that proteins with similar sequences fold into similar shapes and, conversely, that proteins with similar structures are encoded by similar sequences of DNA.

DNA/RNA Bioinformatics

Bioinformatics Analyses that Are Relevant To DNA/RNA Sequences

• Retrieving DNA sequences from databases.
• Computing nucleotide compositions.
• Identifying restriction sites.
• Designing polymerase chain-reaction (PCR) primers.
• Identifying open reading frames (ORFs).
• Predicting elements of DNA/RNA secondary structure.
• Finding repeats.
• Computing the optimal alignment between two or more DNA sequences.
• Finding polymorphic sites in genes (single nucleotide polymorphisms, SNPs).
• Assembling sequence fragments.

Working with Entire Genomes

Historical view:
• The first truly efficient technique to sequence DNA was discovered in 1977.
• In 1995, the first sequence of an entire genome (from the microbe Haemophilus influenzae) was determined.
• During this period, biologists were mostly sequencing DNA fragments that were a few thousand nucleotides in length.
• Most of the bioinformatics tools available today were created during that period:
  1. All basic sequence-alignment programs.
  2. Phylogenetic and classification methods.
  3. Various display tools adapted to relatively small sequence objects, such as protein sequences no more than a few thousand characters long.

Genomics: Getting all the genes at once

• The determination of the first complete genome sequence ended the gene-by-gene routine and initiated the era of genomics.
• This revolution called for the design of new bioinformatics tools and databases capable of storing, querying, analyzing, and displaying these huge objects in a user-friendly manner.
• This development prompted the emergence of an entirely new branch of bioinformatics devoted to the parsing of large DNA sequences into their components (genes, transcription units, protein-coding regions, regulatory elements, and so forth).
• This first pass is then followed by a longer phase of genome annotation, where the biological functions of these various elements are (more or less tentatively) predicted.

The figure, representing the whole genome of the bacterium Rickettsia conorii, illustrates this new level of complexity. This circular DNA molecule is 1.3 million bp long, on the small side for a bacterium. Each little rectangle in the two most external circles of features (one circle per strand) corresponds to a protein-coding gene in the circular genome. Each rectangle corresponds to approximately 1000 bp.

Genome bioinformatics
• Finding which genomes are available.
• Analyzing sequences in relation to specific genomes.
• Displaying genomes.
• Parsing a microbial genome sequence: ORFing.
• Parsing a eukaryotic genome sequence: GenScan.
• Finding orthologous and paralogous genes.
• Finding repeats.

Sequence Analysis
• DNA and protein sequences are biological information that is well suited for computer analysis.
• Fundamental axiom: homologous sequences share an evolutionary ancestor and are almost surely performing the same or a similar function.
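To make concrete why sequences suit computer analysis, here is a minimal sketch of "ORFing": scanning all six reading frames (three per strand) for stretches that start with ATG and end at a stop codon. It is a teaching toy under stated assumptions, not how production gene finders work: the function names, the `min_codons` threshold and the demo sequence are all hypothetical, only ATG is treated as a start, and only unambiguous A/C/G/T input is handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Find ATG...stop ORFs in all six reading frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand being read; for the '-' strand
    that is the reverse complement of the input. min_codons counts the
    codons before the stop codon.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


demo = "CCATGGAAGTATAGCC"  # made-up sequence with one short ORF
for strand, frame, start, end in find_orfs(demo):
    print(strand, frame, start, end)
```

The same `reverse_complement` helper also tests for palindromic sites: a sequence such as GAATTC is its own reverse complement.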

Sequence Analysis: topics for today
• Reverse complement.
• Restriction enzyme sites for diagnostics & cloning.
• Open reading frame analysis.
• Conceptual translation.
• Oligo-primer design (for PCR and sequencing).

Alignments
• Sequence alignments (types of alignments).
• Alignments document homologous relationships.
• DNA sequence alignments: best for showing identity.
• Protein sequence alignments: best for showing similarity.

You Shouldn't Have To Work With Limiting Information