DNA Sequence Alignments: Best for Showing Identity. Protein Sequence Alignments: Best for Showing Similarity. You Shouldn't Have to Work with Limiting Information.
An Introduction to Bioinformatics
Mohamed Abdel-Hakim Mahmoud
Genetics Department, Faculty of Agriculture, Minia University, El-Minia, EGYPT

WHAT IS BIOINFORMATICS?
• Applying "informatics" techniques from math, statistics and computer science to understand and organize the information associated with biological molecules on a large scale.
• It can be defined as the body of tools and algorithms needed to handle large and complex biological information.
• Bioinformatics is a new scientific discipline created from the interaction of biology and computing.
• Bioinformatics is clearly a multi-disciplinary field. It includes the use of mathematical, statistical and computing methods for the organization, management, analysis and interpretation of biological information (DNA, amino acid sequences and related information), with the aim of solving biological problems.

More Definitions
• The NCBI defines bioinformatics as "a field of science in which biology, computer science, and information technology merge into a single discipline".
• In Wikipedia: Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.
• Roughly, bioinformatics describes any use of computers to handle biological information (storing and processing large amounts of biologically derived information, whether DNA or protein sequences).

Preliminaries of Biology
• Structure & Function of Nucleic Acids (DNA & RNA)
• Proteins
• Transcription & Translation
• Genome sizes of different organisms
[Figure: chromosome sizes, in units of 10^6; Chromosome 21]

The $1,000 genome
The $1,000 genome refers to an era of predictive and personalized medicine during which the cost of fully sequencing an individual's genome (whole-genome sequencing, WGS) is roughly one thousand USD. It is also the title of a book by Kevin Davies, British science writer and founding editor of Nature Genetics. By late 2015, the cost to generate a high-quality "draft" whole human genome sequence was just below $1,500.

Genomics era: High-throughput DNA sequencing
• The first high-throughput genomics technology was AUTOMATED DNA SEQUENCING, in the early 1990s.
• In 1995, Venter and Hamilton used a whole-genome shotgun sequencing strategy to sequence the genomes of Mycoplasma and Haemophilus.
• In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.
• The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera Genomics.

Completed genomes
Genomes sequenced to date:
• Eukaryotes (10811)
• Prokaryotes (239173)
• Viruses (35013)
• Plasmids (20416)
• Organelles (15408)
This generates large amounts of information to be handled by individual computers.

The trend of data growth (the 21st century is a century of biotechnology)
• Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)
  o Metagenomics: "Who is there and what are they doing?"
• Microarray: Global expression analysis, in which the RNA levels of every gene in the genome are analyzed in parallel. (Now out of use; replaced by RNA-seq.)
• Proteomics: Global protein analysis generates large mass spectra libraries.
• Metabolomics: Global metabolite analysis; 25,000 secondary metabolites characterized.

How to handle the large amount of information?
Answer: BIOINFORMATICS & INTERNET

Why do we need the Internet?
• "Omics" projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.
• Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.
• There is a need for computers and algorithms that allow:
  o Access, processing, storing, sharing, retrieving, visualizing, annotating…

[Figure: database storage, with a "You are here" marker]

Things you must have
• You have a PC running Microsoft Windows.
• You have an Internet connection (a fast one if possible, but not necessarily).
• You likely have a background in Molecular Biology.
• You know how to use an Internet browser but not much more about computers.
• You don't want to become a bioinformatics guru; you simply want to use the right tools for your problem.
• Most private biotech companies consider it unsafe to send data over the Internet. We assume here that the data you want to analyze over the Internet is not very confidential.

Bioinformatics history
Before the era of bioinformatics, only two ways of performing biological experiments were available: within a living organism (so-called in vivo) or in an artificial environment (so-called in vitro, from the Latin "in glass"). Taking the analogy further, we can say that bioinformatics is in fact in silico biology, from the silicon chips on which microprocessors are built.

In the 1960s: the birth of bioinformatics
The beginning of bioinformatics can be traced back to Margaret Dayhoff in 1968 and her collection of protein sequences known as the Atlas of Protein Sequence and Structure.
Sci. Am. 1969 Jul; 221(1):86-95.

Early significant experiments in bioinformatics
In this study, scientists used one of the first sequence similarity searching computer programs (called FASTP) to determine that the contents of a cancer-causing viral sequence were most similar to the well-characterized cellular PDGF gene.
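To make the idea of a sequence similarity search concrete, here is a minimal Python sketch. It is not FASTP and does not reproduce that program's algorithm. It assumes the sequences are already aligned and of equal length, scores each database entry by percent identity to the query, and reports the best-scoring entry. The function names `percent_identity` and `best_match`, and the toy sequences, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions holding the same residue in two
    pre-aligned sequences of equal length (case-insensitive)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(seq_a.upper(), seq_b.upper()) if x == y)
    return 100.0 * matches / len(seq_a)


def best_match(query: str, database: Dict[str, str]) -> Tuple[str, float]:
    """Return the name and score of the database entry most similar to the query."""
    scores = {name: percent_identity(query, seq) for name, seq in database.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores[best_name]


if __name__ == "__main__":
    # Toy data only: these are made-up sequences, not real genes.
    toy_database = {
        "entry_1": "ACGTAA",
        "entry_2": "TTTTTT",
    }
    name, score = best_match("ACGTAC", toy_database)
    print(f"best match: {name} ({score:.1f}% identity)")
```

Real search programs also have to find the alignment itself (allowing gaps and, for proteins, scoring similar but non-identical residues), which this sketch deliberately leaves out.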
Surprising result
This surprising result provided important mechanistic insights for biologists working on how this viral sequence causes cancer.
Science. 1983 Jul 15; 221(4607):275-7
Nature. 1983 Jul 7-13; 304(5921):35-9.

First complete genome in GenBank
The genome of Haemophilus influenzae Rd is the first genome of a free-living organism to be deposited into the public sequence databanks.
Science. 1995 Jul 28; 269(5223):496-512.

Why do we use Bioinformatics?
• Store/retrieve biological information (DATABASES).
• Retrieve/compare gene and/or protein sequences.
• Predict the function of unknown genes and/or proteins.
• Search for previously known functions of genes and/or proteins.
• Compare data with other researchers.
• Compile/distribute data for other researchers.

Fields related to Bioinformatics
• Genomics. "Genomics is any attempt to analyze or compare the entire genetic complement of one or more species."
• Proteomics. "The PROTEin complement of the genOME." "Qualitative and quantitative studies of gene expression at the level of the functional proteins themselves."
• Pharmacogenomics. "Pharmacogenomics is the application of genomic approaches and technologies to the identification of drug targets."
• Pharmacogenetics. Pharmacogenetics is a subset of pharmacogenomics which uses genomic/bioinformatic methods to identify genomic correlates
• Biophysics. "An interdisciplinary field which applies techniques from the physical sciences to understanding biological structure and function."
• Mathematical Biology. It focuses almost exclusively on specific algorithms that can be applied to large molecular biological data sets.
• Medical informatics/Medinformatics. "Study, invention, and implementation of structures and algorithms to improve communication, understanding and management of medical information."
• Cheminformatics.
"the combination of chemical synthesis, biological screening, and data-mining approaches used to guide drug discovery and development" Computational Biology Is an "approach" involving the use of computers to study biological processes Finding the genes in the DNA sequences of various organisms. Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences. Clustering protein sequences into families of related sequences and the development of protein models. Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships. Some Application of bioinformatics: Medicine { Molecular Med.; Personalized Med.; Preventative Med.; Gene Therapy; Disease Diagnosis; Forensic Analysis; Drug Ddevelopment ………………} Microbial Genome Applications. Waste Cleanup. Crop and livestock Improvement. Evolutionary Studies. Climate change studies. Alternative energy sources. Improve nutritional quality. Bio-Weapons Creation. Biotechnology ………………. etc. Some Applications…. Medical Implications Pharmacogenomics • Not all drugs work on all patients, some good drugs cause death in some patients • So by doing a gene analysis before the treatment the offensive drugs can be avoided • Also drugs which cause death to most can be used on a minority to whose genes that drug is well suited – volunteers wanted! • Customized treatment Gene Therapy • Replace or supply the defective or missing gene. • e.g: Insulin and Factor VIII or Haemophilia. Diagnosis of Disease o Identification of genes which cause the disease will help detect disease at early stage. Drug Design o One of the goals of bioinformatics is to reduce the time and cost involved with it. Drug Discovery o Target identification (Proteins are the most common targets) For example HIV produces HIV protease which is a protein and which in turn eat other proteins. This HIV protease has an active site where it binds to other molecules. 
So HIV drug will go and bind with