M. Sc. 2nd Semester 2017

Bioinformatics

Compiled by Mr. Nitin Swamy Asst. Prof. Department of Biotechnology


Department of Biotechnology, St. Aloysius College (Autonomous), Jabalpur


Questions asked in previous years:

1. What is ? Describe its basic needs for bioinformatics application in biological prospects.
2. Enlist various applications of Bioinformatics in Biotechnology.
3. Explain the role of NCBI database in biotechnology? What are other bioinformatics softwares used for various biotechnological applications?
4. Define Bioinformatics. Discuss in brief major applications of bioinformatics in biotechnology.
5. Write the full form of BLAST.

Bioinformatics

Biological data are being produced at a phenomenal rate. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries and the SWISS-PROT database of protein sequences contained 88,166. On average, these databases are doubling in size every 15 months. In addition, since the publication of the H. influenzae genome, complete sequences for over 40 organisms have been released, with gene counts ranging from 450 to over 100,000. Add to this the data from the myriad of related projects that study gene expression, determine the protein structures encoded by the genes, and detail how these products interact with one another, and we can begin to imagine the enormous quantity and variety of information that is being produced.
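To get a feel for what "doubling every 15 months" means, the short Python sketch below projects a database size forward under the assumption of steady exponential growth. The function names are illustrative only, and real database growth need not follow this idealised curve.

```python
import math


def projected_entries(initial: float, months: float, doubling_months: float = 15.0) -> float:
    """Entries after `months`, assuming the count doubles every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


def months_to_reach(initial: float, target: float, doubling_months: float = 15.0) -> float:
    """Months needed to grow from `initial` to `target` under the same assumption."""
    return doubling_months * math.log2(target / initial)


genbank_aug_2000 = 8_214_000  # figure quoted in the text

# One doubling period (15 months) and four doubling periods (60 months).
print(round(projected_entries(genbank_aug_2000, 15)))  # 16428000
print(round(projected_entries(genbank_aug_2000, 60)))  # 131424000
```

Under this assumption, four doubling periods (five years) multiply the number of entries sixteen-fold.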

As a result of this surge in data, computers have become indispensable to biological research. Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature. Bioinformatics is often defined as the application of computational techniques to understand and organize the information associated with biological macromolecules. This unexpected union between the two subjects is largely attributed to the fact that life itself is an information technology; an organism's physiology is largely determined by its genes, which at their most basic can be viewed as digital information.


Bioinformatics - a definition

(Molecular) bio – informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

- As submitted to the Oxford English Dictionary

Aims of bioinformatics

The aims of bioinformatics are threefold:

1. First, at its simplest, bioinformatics organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g., the Protein Data Bank for 3D macromolecular structures. While data management is an essential task, the information stored in these databases is essentially useless until analyzed. Thus the purpose of bioinformatics extends much further.
2. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This needs more than a simple text-based search: programs such as FASTA and PSI-BLAST must consider what comprises a biologically significant match. Development of such resources requires expertise in computational theory as well as a thorough understanding of biology.
3. The third aim is to use these tools to analyze the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared those with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.


BIOLOGICAL DATABASES

As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data.

Bioinformatics is the application of information technology to store, organize and analyze the vast amount of biological data that is available in the form of sequences and structures of proteins (the building blocks of organisms) and nucleic acids (the information carriers). The biological information of nucleic acids is available as sequences, while the data of proteins is available as both sequences and structures. Sequences are represented in a single dimension, whereas structures contain the three-dimensional data of those sequences.

Sequences and structures are only two of the several different types of data required in the practice of modern molecular biology. Other important data types include metabolic pathways and molecular interactions, mutations and polymorphisms in molecular sequences and structures, organelle structures and tissue types, genetic maps, physicochemical data, gene expression profiles, two-dimensional DNA chip images of mRNA expression, two-dimensional gel electrophoresis images of protein expression, and other data.

A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. There are two main functions of biological databases:

•To make biological data available to scientists.

o As much as possible of a particular type of information should be available in one single place (book, site, or database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published explicitly in an article (genome sequences, for example).

•To make biological data available in computer-readable form.

o Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.
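As one illustration of what "computer-readable form" means in practice, sequences are commonly exchanged as plain text in the FASTA format: a header line beginning with ">" followed by one or more lines of sequence. The Python sketch below reads such text into a dictionary. The function name and the two example records are made up for illustration, and the parser deliberately ignores the many variations found in real files.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier (first word after '>') to its full sequence."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)  # sequence may span several lines
    return {key: "".join(value) for key, value in chunks.items()}


# Hypothetical records, not taken from any real database.
example = """>seq1 example DNA fragment
ATGGCC
TAA
>seq2 example peptide
MKT
"""

records = parse_fasta(example.splitlines())
print(records)  # {'seq1': 'ATGGCCTAA', 'seq2': 'MKT'}
```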

Data Domains

•Types of data generated by molecular biology research:

– Nucleotide sequences (DNA and mRNA)

– Protein sequences

– 3-D protein structures

– Complete genomes and maps

When Sanger first discovered the method to sequence proteins, there was a lot of excitement in the field of molecular biology. Initial interest in bioinformatics was propelled by the necessity to create databases of biological sequences. Biological databases can be broadly classified into sequence and structure databases. Sequence databases hold both nucleic acid sequences and protein sequences, whereas structure databases hold only proteins. The first database was created within a short period after the insulin protein sequence was made available in 1956. Incidentally, insulin was the first protein to be sequenced. The sequence of insulin consists of just 51 residues (analogous to the letters in a sentence), which characterize the sequence.

While the initial databases of protein sequences were maintained at individual laboratories, the development of a consolidated formal database, the SWISS-PROT protein sequence database, was initiated in 1986. It now has about 70,000 protein sequences from more than 5000 model organisms, a small fraction of all known organisms. This huge variety of divergent data resources is now available for study and research by both academic institutions and industry. They are made available as public domain information, in the larger interest of the research community, through the Internet (www.ncbi.nlm.nih.gov).

Databases in general can be classified into primary, secondary and composite databases.

1. A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for genome sequences, and the Protein Data Bank (PDB) for protein structures.

2. A secondary database contains information derived from a primary database. A secondary sequence database contains information such as the conserved sequences, signature sequences and active site residues of protein families, arrived at by multiple sequence alignment of a set of related proteins. A secondary structure database contains entries of the PDB in an organized way. These entries are classified according to their structure (all-alpha proteins, all-beta proteins, etc.) and also contain information on conserved secondary structure motifs of a particular protein. Secondary databases created and hosted by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.

3. A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources. Different composite databases use different primary databases and different criteria in their search algorithms. Various search options have also been incorporated into composite databases. The National Center for Biotechnology Information (NCBI), which hosts these nucleotide and protein databases on a large, highly available, redundant array of computer servers, provides free access to anyone involved in research. It also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins involved in genetic diseases.


Organizations related to Bioinformatics

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland, and was founded in 1988. The NCBI houses a series of databases relevant to biotechnology and biomedicine and is an important resource for bioinformatics tools and services. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine.


Nucleotide database

GenBank

NCBI has had responsibility for making available the GenBank DNA sequence database since 1992. GenBank coordinates with individual laboratories and other sequence databases such as those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ).

Since 1992, NCBI has grown to provide other databases in addition to GenBank. NCBI provides Gene, Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D protein structures), dbSNP (a database of single-nucleotide polymorphisms), the Reference Sequence Collection, a map of the human genome, and a taxonomy browser, and coordinates with the National Cancer Institute to provide the Cancer Genome Anatomy Project. The NCBI assigns a unique identifier (taxonomy ID number) to each species of organism. The NCBI has software tools that are available by WWW browsing or by FTP. For example, BLAST is a sequence similarity searching program. BLAST can do sequence comparisons against the GenBank DNA database in less than 15 seconds.

European Molecular Biology Laboratory

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 22 member states, four prospect member states and two associate member states. EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. Each of the EMBL sites has a specific research field. EMBL-EBI is a hub for bioinformatics research and services, developing and maintaining a large number of scientific databases, which are free of charge. At Grenoble and Hamburg, research is focused on structural biology. EMBL's dedicated Mouse Biology Unit is located in Monterotondo. At the headquarters in Heidelberg, there are units in Cell Biology and Biophysics, Developmental Biology, Genome Biology, and Structural and Computational Biology, as well as service groups complementing these research fields.

The DNA Data Bank of Japan

The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration (INSDC). It exchanges its data with the European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. Thus these three databanks contain the same data at any given time. DDBJ began data bank activities in 1986 at NIG and remains the only nucleotide sequence data bank in Asia. Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors from any other country.

Protein database

Swiss-Prot

Swiss-Prot was created in 1986 by Amos Bairoch during his PhD. It was developed first by the Swiss Institute of Bioinformatics and subsequently by Rolf Apweiler at the European Bioinformatics Institute. Swiss-Prot aimed to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Recognizing that sequence data were being generated at a pace exceeding Swiss-Prot's ability to keep up, TrEMBL (Translated EMBL Nucleotide Sequence Data Library) was created to provide automated annotations for those proteins not in Swiss-Prot.

Protein Information Resource

The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic, proteomic and systems biology research and scientific studies. PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information. Prior to that, the NBRF compiled the first comprehensive collection of macromolecular sequences in the Atlas of Protein Sequence and Structure, published from 1965 to 1978 under the editorship of Margaret O. Dayhoff. Dr. Dayhoff and her research group pioneered the development of computer methods for the comparison of protein sequences, for the detection of distantly related sequences and duplications within sequences, and for the inference of evolutionary histories from alignments of protein sequences.

SOFTWARE USED IN BIOINFORMATICS

BLAST

In bioinformatics, BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the type of query sequence. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST algorithm and program were designed by a team including Warren Gish and David J. Lipman at the National Institutes of Health.
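The idea of scoring local similarity and keeping only hits above a threshold can be sketched in a few lines of Python. Note that this is not BLAST itself: BLAST is a fast heuristic, whereas the sketch below uses exhaustive dynamic programming in the style of the Smith-Waterman local alignment method. The scoring values, function names and the tiny "database" are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (simple linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search(query: str, database: Dict[str, str], threshold: int) -> List[Tuple[str, int]]:
    """Return (name, score) for database sequences scoring at or above threshold."""
    hits = [(name, local_alignment_score(query, seq)) for name, seq in database.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: h[1], reverse=True)


# Hypothetical sequences: only "hit" shares the stretch ACGT with the query.
toy_database = {"hit": "GGACGTGG", "miss": "CCCCCCCC"}
print(search("TTACGTTT", toy_database, threshold=6))  # [('hit', 8)]
```

Real tools replace the flat match and mismatch scores with substitution matrices and report statistical significance, which this sketch does not attempt.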

BLAST is actually a family of programs (all included in the blastall executable). These include:

1. Nucleotide-nucleotide BLAST (blastn): This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.
2. Protein-protein BLAST (blastp): This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.
3. Nucleotide 6-frame translation-protein (blastx): This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
4. Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx): This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.
5. Protein-nucleotide 6-frame translation (tblastn): This program compares a protein query against all six reading frames of a nucleotide sequence database.
6. Large numbers of query sequences (megablast): When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times. It concatenates many input sequences together to form a large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.

BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

FASTA

FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format, which is now ubiquitous in bioinformatics. The original FASTA program was designed for protein sequence similarity searching. Because of the exponentially expanding genetic information and the limited speed and memory of computers in the 1980s, heuristic methods were introduced for aligning a query sequence to entire databases. FASTA (developed in 1988) added the ability to do DNA:DNA searches and translated protein:DNA searches, and also provided a more sophisticated shuffling program for evaluating statistical significance. There are several programs in this package that allow the alignment of protein sequences and DNA sequences. The FASTA programs find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence.
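The "six-frame conceptual translation" used by blastx, tblastx and tblastn in the BLAST list above can be made concrete with a short Python sketch. It builds the standard genetic code, translates the three reading frames of the forward strand, and then does the same for the reverse complement. The function names and the example sequence are illustrative assumptions; stop codons are written as "*" and any codon containing a character other than A, C, G or T is written as "X".

```python
from itertools import product
from typing import Dict

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> Dict[str, str]:
    """Translations of frames +1, +2, +3 (forward) and -1, -2, -3 (reverse)."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
```

For the toy sequence ATGGCCTAA, frame +1 reads ATG GCC TAA and gives "MA*", while frame -1 reads the reverse complement TTAGGCCAT and gives "LGH".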

Data retrieval tools

ENTREZ

The Entrez Global Query Cross-Database Search System is a federated search engine, or web portal, that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The name "Entrez" (a greeting meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the public to search the content. Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system. Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search using a web forms interface.

Entrez searches the following databases:

• PubMed
• PubMed Central
• Site Search
• Books
• Online Mendelian Inheritance in Man (OMIM)
• Nucleotide: sequence database (GenBank)
• Protein: sequence database
• Genome: whole genome sequences and mapping
• Structure: three-dimensional macromolecular structures
• Taxonomy: organisms in GenBank Taxonomy
• SNP: single nucleotide polymorphism
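Besides the web interface, NCBI offers programmatic access to the Entrez databases through its E-utilities service. The Python sketch below only builds a search URL and does not send any request. The short database names shown (for example "pubmed" or "protein") and the chosen parameters are assumptions that should be checked against the current NCBI documentation before use, and the search terms are arbitrary examples.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example queries against two of the databases listed above.
print(esearch_url("pubmed", "BLAST algorithm"))
print(esearch_url("protein", "insulin", retmax=5))
```

Opening such a URL in a browser, or fetching it with any HTTP client, returns the identifiers of matching records, which can then be retrieved in a second step.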

PubMed

PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval. From 1971 to 1997, MEDLINE online access to the MEDLARS Online computerized database primarily had been through institutional facilities, such as university libraries. PubMed, first released in January 1996, ushered in the era of private, free, home- and office-based MEDLINE searching. PubMed comprises more than 27 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

PubMed provides access to:

1. older references from the print version of Index Medicus, back to 1951 and earlier
2. references to some journals before they were indexed in Index Medicus and MEDLINE, for instance Science, BMJ, and Annals of Surgery
3. very recent entries to records for an article before it is indexed with Medical Subject Headings (MeSH) and added to MEDLINE
4. a collection of books available full-text and other subsets of NLM records
5. PMC citations

Taxonomy Browsers

The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.

Applications of bioinformatics in Biotechnology

A number of bioinformatics tools, software packages and databases are available for better understanding biological complexity and for analyzing and storing biological data. Bioinformatics thus accelerates research while saving time, cost and wet-lab effort. Its applications in biotechnology include automatic genome sequencing, gene identification, prediction of gene function, prediction of three-dimensional protein structure and phylogeny, to name a few. As shown in Figure 1, bioinformatics plays a role in several important areas of biotechnology, one of the fastest growing fields. It can also help in the identification of organisms, drug discovery and vaccine design, and in understanding gene and genome complexity and protein structure, function and folding.

Figure 1: Role of bioinformatics in different important areas of biotechnology.

1. Automatic Genome Sequencing: Bioinformatics is used in genome sequencing in the following ways:
a. In the development of automated sequencing techniques that use PCR- or BAC-based gene amplification, two-dimensional electrophoresis and automated reading of nucleotides.
b. In joining the sequences of smaller DNA fragments to form a complete genome sequence.
c. In the prediction of promoter regions of the genome.
d. In the prediction of protein-coding regions of the genome.

Small fragments of a genome can be amplified using the polymerase chain reaction (PCR) or bacterial artificial chromosomes (BACs). Amplified fragments may suffer from nucleotide reading errors and repeats, which in turn generate multiple copies of the same genome fragments. These repeats are removed by mathematical models just before the final assembly of all genome fragments.
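The joining of smaller fragments into a longer sequence can be illustrated with a deliberately simple greedy approach: drop exact duplicate fragments, then repeatedly merge the pair of fragments with the longest suffix-to-prefix overlap. This Python sketch is only a toy model under strong assumptions (error-free reads, a single strand, no repeated regions, no fragment contained inside another); production assemblers use far more elaborate methods. All names and the example fragments are illustrative.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Merge fragments greedily by largest overlap; returns the remaining contigs."""
    frags = list(dict.fromkeys(fragments))  # remove exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps enough
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical reads; the first one appears twice, as a duplicated copy would.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA", "ATGGCGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTA']
```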

2. Identification of Genes: Bioinformatics programmes such as GLIMMER are used to identify the coding regions or open reading frames in a genome, drawing on sequence data held in databases such as GenBank.
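What it means to identify an open reading frame (ORF) can be shown with a naive scan: in each of the three forward reading frames, look for a start codon (ATG) followed in the same frame by a stop codon. The Python sketch below is an illustration only. Real gene finders such as GLIMMER rely on statistical models rather than this simple rule, and the function name, the minimum-length parameter and the example sequence are assumptions.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


sequence = "CCATGGCCTAAGG"  # made-up example
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])  # 2 ATGGCCTAA
```

A fuller version would also scan the reverse complement strand and apply a much larger minimum length, since short ORFs occur by chance.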

3. Identifying Gene Function: After identifying the open reading frames present in the genome, the next step is to annotate the structure and function of the genes. Bioinformatics techniques such as sequence search and pairwise gene alignment are used to identify gene function. Four major resources, BLAST, BLOSUM, ClustalX and SMART, are used for the functional annotation of genes.

4. Three-dimensional (3D) Structure Modelling: A single protein may be present in different conformational states depending upon its interaction with other proteins in the vicinity. The three-dimensional structure of a protein has some regions exposed for protein-protein or protein-DNA interactions, and the function of the protein also depends upon these interactions. Bioinformatics techniques are used to predict the possible conformation of the protein coded by a gene; this in turn helps in identifying the function of that protein.

5. Pairwise Genome Comparison: After the function of a particular gene has been identified, bioinformatics is also used for pairwise genome comparisons. Pairwise comparison of a genome with itself is done to find paralogous genes, that is, duplicated genes that have essentially the same base sequence with some functional variations. Pairwise comparison against another genome is done to find orthologous genes, which are equivalent genes present in two different genomes as a result of speciation. This technique is also used to identify different types of gene groups, adjacent genes that occur in close proximity because they are involved in a common higher-level function, and lateral gene transfer, that is, the transfer of genes from an evolutionarily distant microorganism.

Bioinformatics techniques are also used to analyse gene fusion, gene fission, gene-group duplication and much more.

6. Drug Discovery: Bioinformatics techniques are also used in drug discovery, where they speed up the process.

7. Vaccine Design: Bioinformatics techniques are used in the effective and efficient designing of vaccine and also in the designing of antimicrobial agents.

8. Microbial Genome Sequencing: Bioinformatics is also used in automating microbial genome sequencing.

9. Genome Comparison: Bioinformatics techniques are also used to compare genomes, as follows:
a. Identifying conserved functions within a genome family
b. Identifying specific genes in a group of genomes
c. Modelling the 3D structure of proteins
d. Docking of biochemical compounds and receptors

This helps in developing integrated databases over the internet. It also helps in understanding the function of genes and genomes.

Conclusion: Bioinformatics has had a major impact on biotechnology and its applications. The vast amount of data generated by the Human Genome Project and by other genome sequencing projects would be unmanageable without bioinformatics techniques, and interpretation of these data would be unthinkable. Because of bioinformatics, the cost of all these major projects has come down drastically. Bioinformatics has also quickened drug discovery, vaccine design and the design of antimicrobial agents. It is used to understand genes and genomes, and bioinformatics programmes are used for pairwise gene alignment, which helps in identifying the functions of genes and genomes.
