MAVE-NN: Quantitative Modeling of Genotype-Phenotype Maps As

Total Page:16

File Type:pdf, Size:1020Kb

MAVE-NN: Quantitative Modeling of Genotype-Phenotype Maps As bioRxiv preprint doi: https://doi.org/10.1101/2020.07.14.201475; this version posted July 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Tareen et al. RESEARCH MAVE-NN: Quantitative Modeling of Genotype- Phenotype Maps as Information Bottlenecks Ammar Tareen1, William T. Ireland2, Anna Posfai1, David M. McCandlish1 and Justin B. Kinney1* are commonly used to study DNA or RNA sequences that regulate gene expression at a variety of steps, in- cluding transcription [16, 17, 18, 19, 20, 21], splicing [22, 23, 24, 25, 26], polyadenylation [27], and mRNA Abstract degradation [28, 29, 30, 31, 32]. Most MPRAs read Multiplex assays of variant effect (MAVEs) are out the expression of a reporter gene in one of two being rapidly adopted in many areas of biology ways [1]: by quantifying RNA abundance via the se- including gene regulation, protein science, and quencing of RNA barcodes that are linked to known evolution. However, inferring quantitative models variants (RNA-seq MPRAs; Fig. 1c), or by quanti- of genotype-phenotype maps from MAVE data fying protein abundance using fluorescence-activated remains a challenge. Here we introduce MAVE-NN, cell sorting (FACS) then sequencing the sorted vari- a neural-network-based Python package that ants (sort-seq MPRAs; Fig. 1e). addresses this problem by conceptualizing MAVE data can enable rich quantitative modeling of genotype-phenotype maps as information G-P maps. This key point was recognized in some of bottlenecks. We demonstrate the versatility, the earliest work on MAVEs [17, 18] and has persisted performance, and speed of MAVE-NN on a diverse as a major theme in MAVE studies [33, 34, 35, 25, 31, range of published MAVE datasets. MAVE-NN is 36]. But in contrast to MAVE experimental techniques, easy to install and is thoroughly documented at which continue to advance rapidly, there remain key https://mavenn.readthedocs.io. gaps in the methodologies available for quantitatively Keywords: MAVE; Neural Networks; Noise modeling G-P maps from MAVE data. Agnostic Regression; Global Epistasis Regression Most computational methods for analyzing MAVE data have focused on accurately quantifying the ac- tivities of individual assayed sequences [37, 38, 39, 40, 41, 42, 43]. However, MAVE measurements for individ- Background ual sequences often cannot be interpreted as providing Over the last decade, the ability to quantitatively direct quantification of the underlying G-P map that study genotype-phenotype (G-P) maps has been rev- one is interested in. First, MAVE measurements are olutionized by the development of multiplex assays of usually distorted by strong nonlinearities and noise, variant effect (MAVEs), which can measure molecu- and distinguishing interesting properties of G-P maps lar phenotypes for thousands to millions of genotypic from these confounding factors is not straight-forward. variants in parallel [1]. MAVE is an umbrella term Second, MAVE data is often incomplete. Missing data that describes a diverse set of experimental methods is common, but a more fundamental issue is that re- [2,3], three examples of which are illustrated in Fig. searchers often want to understand G-P maps over 1. Deep mutational scanning (DMS) experiments are vastly larger regions of sequence space than can be one large class of MAVE [4]. These work by linking exhaustively assayed. proteins [5,6, 7] or structural RNAs [8, 9, 10, 11] to their coding sequences, either directly or indirectly, Quantitative modeling can address both the incom- then using deep sequencing to assay which variants pleteness and indirectness of MAVE measurements [1]. survive a process of activity-dependent selection (Fig. The goal here is to determine a mathematical func- 1a). Massively parallel reporter assays (MPRAs) are tion that, given any sequence as input, will return a another major class of MAVE [12, 13, 14, 15], and quantitative value for that sequence's molecular phe- notype. Quantitative models thus fill in the gaps in *Correspondence: [email protected] G-P maps and, if appropriate inference methods are 1Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 11375, Cold Spring Harbor, NY used, can further remove confounding effects of nonlin- Full list of author information is available at the end of the article earities and noise. The simplest quantitative modeling bioRxiv preprint doi: https://doi.org/10.1101/2020.07.14.201475; this version posted July 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Tareen et al. Page 2 of 15 strategy is linear regression (e.g. [18, 44]). However, into a type of information bottleneck [63, 64], one that linear regression yields valid results only when one's separates the task of compressing sequence-encoded in- measurements are linear readouts of phenotype and formation (the job of the G-P map) from the task of exhibit uniform Gaussian noise. Such assumptions are mapping this information to a realistic experimental often violated in dramatic fashion by MAVEs, and fail- output (the job of the measurement process). Some ure to account for them can give rise to major artifacts, ambiguity between the quantitative properties of the such as spurious epistatic interactions [45]. G-P map and those of the measurement process in- Multiple MAVE analysis approaches that can sepa- evitably remain, but such ambiguities often affect only rate the effects of nonlinearities and noise from under- a small number of parameters in the underlying G-P lying G-P maps have been reported, but a conceptually map [54, 55]. unified strategy is still needed. Work in the theoreti- We also introduce MAVE-NN, a software package cal evolution literature has focused on a phenomenon that can rapidly execute this type of bottleneck-based called global epistasis (GE), in which measurements re- inference on large MAVE datasets. MAVE-NN cur- flect a nonlinear function of an underlying latent phe- rently supports two inference approaches: GE regres- notype [46, 47, 48, 49, 50, 51, 52, 53, 45]. In particular, sion and noise agnostic (NA) regression. GE regres- Otwinowski et al. [46] describe a regression approach in sion is modeled after the approach of [46]. NA re- which one parametrically models G-P maps while non- gression is closely related to non-parametric regres- parametrically modeling nonlinearities in the MAVE sion using information maximization (see [54, 55]) but measurement process. Parallel work in the biophysics is much faster due to its compatibility with back- literature has focused on developing ways to infer G-P propagation. MAVE-NN is available as a Python API maps from high-throughput data in a manner that is and is built on top of TensorFlow, allowing the use agnostic to the quantitative form of both nonlinearities of the powerful and flexible optimization methods. and noise [54, 55, 56, 17]. This approach focuses on the MAVE-NN is rigorously tested and is easily installed use of mutual information as an objective function. It arose from techniques in sensory neuroscience [57, 58] from PyPI using the command pip install mavenn. that were elaborated and adapted for the analysis Comprehensive documentation is available at http: of microarray data [59, 56], then applied to MPRAs //mavenn.readthedocs.io. [33, 34, 17, 18]. However, due to difficulties in esti- mating mutual information, such analyses of MAVE Results data have relied on Metropolis Monte Carlo, which Inference approaches (in our experience) is too slow to support widespread MAVE-NN supports the analysis of DNA, RNA, and adoption. A third thread in the literature has arisen protein sequences. All sequences must be the same from efforts to apply techniques from deep learning to length and, for the resulting models to be inter- modeling the genotype-phenotype map [60, 28]. Here pretable, must satisfy a natural notion of alignment. the emphasis has been on using the highly expressive The two inference methods supported by MAVE-NN, nature of deep neural networks to directly model ex- GE regression and NA regression, are illustrated in perimental output from input sequences. Yet it has re- Fig. 3. Each type of regression can be used to infer mained unclear how such neural networks might sepa- three different types of model: additive, neighbor, or rate out the intrinsic features of GP maps from effects of MAVE measurement processes. This is a manifes- pairwise. MAVE-NN represents each sequence as a bi- tation of the neural network interpretability problem, nary vector ~x of one-hot encoded sequence features. one that that is not addressed by established post-hoc The measurement obtained for each sequence is then attribution methods [61, 62]. assumed to depend on a latent phenotype φ, which is Here we describe a unified conceptual and com- assumed to depend the sequence via a linear model ~ ~ putational framework for the quantitative modeling φ = θ · ~x, where θ denotes model parameters. In both of MAVE data, one that unites the three strains of types of regression, MAVE-NN assumes that the mea- thought described above. As illustrated in Fig. 2, surement of each sequence is a function of a latent we assume that each sequence has a well-defined la- phenotype φ. This function is inferred from data using tent phenotype, of which the MAVE provides a noisy the dense subnetworks shown in Fig. 3a and 3c. Ex- nonlinear readout. To remove potentially confound- amples of these functions are shown in Fig. 3b and 3d.
Recommended publications
  • What Is Biomath?
    What is Biomath? Summary Biomathematics covers a wide range of activities at the interface between the mathematical and biological sciences. Depending on who you talk to, it might have a different name, including mathematical biology, systems biology, quantitative biology or theoretical biology. The focus of our biomathematics program is on building and using mathematical models to describe and analyze biological systems. Often this involves the analysis of biological data, which typically means using statistical approaches together with the models. In many cases, our models are mechanistic, meaning that their components represent biological processes, although many of us also use models that are more statistical in nature. More In-Depth Biomathematics is the use of mathematical models to help understand phenomena in biology. Modern experimental biology is very good at taking biological systems apart (at all levels of organization, from genome to global nutrient cycling), into components simple enough that their structure and function can be studied in isolation. Dynamic models are a way to put the pieces back together, with equations that represent the system’s components, processes, and the structure of their interactions. Mathematical models are important tools in basic scientific research in many areas of biology, including physiology, cellular biology, developmental biology, ecology, evolution, toxicology, epidemiology, immunology, natural resource management, and conservation biology. The results obtained from analysis and simulation
    [Show full text]
  • Ten Simple Rules to Consider Regarding Preprint Submission
    EDITORIAL Ten simple rules to consider regarding preprint submission Philip E. Bourne1*, Jessica K. Polka2, Ronald D. Vale3, Robert Kiley4 1 Office of the Director, The National Institutes of Health, Bethesda, Maryland, United States of America, 2 Whitehead Institute, Cambridge, Massachusetts, United States of America, 3 Department of Cellular and Molecular Pharmacology and the Howard Hughes Medical Institute, University of California San Francisco, San Francisco, California, United States of America, 4 Wellcome Library, The Wellcome Trust, London, United Kingdom * [email protected] For the purposes of these rules, a preprint is defined as a complete written description of a body of scientific work that has yet to be published in a journal. Typically, a preprint is a research article, editorial, review, etc. that is ready to be submitted to a journal for peer review or is under review. It could also be a commentary, a report of negative results, a large data set and its description, and more. Finally, it could also be a paper that has been peer reviewed and either is awaiting formal publication by a journal or was rejected, but the authors are willing to make the content public. In short, a preprint is a research output that has not completed a typi- cal publication pipeline but is of value to the community and deserving of being easily discov- ered and accessed. We also note that the term preprint is an anomaly, since there may not be a a1111111111 print version at all. The rules that follow relate to all these preprint types unless otherwise a1111111111 noted.
    [Show full text]
  • 1 the Challenges Facing Genomic Informatics
    1 The Challenges Facing Genomic Informatics Temple F. Smith What are these areas of intense research labeled bioinformatics and functional genomics? If we take literally much of the recently published "news and views," it seems that the often stated claim that the last century was the century of physics, whereas the twenty-first will be the century of biology, rests significantly on these new research areas. We might therefore ask: What is new about them? After all, compu- tational or mathematical biology has been around for a long time. Surely much of bioinformatics, particularly that associated with evolution and genetic analyses, does not appear very new. In fact, the related work of researchers like R. A. Fisher, J. B. S. Haldane, and SewellWright dates nearly to the beginning of the 1900s. The modem analytical approaches to genetics, evolution, and ecology rest directly on their and similar work. Even genetic mapping easily dates to the 1930s, with the work of T. S. Painter and his students of Drosophila (still earlier if you include T. H. Morgan's work on X-linked markers in the fly). Thus a short historical review might provide a useful perspective on this anticipated century of biology and allow us to view the future from a firmer foundation. First of all, it should be helpful to recognize that it was very early in the so-called century of physics that modem biology began, with a paper read by Hermann Muller at a 1921meeting in Toronto. Muller, a student of Morgan's, stated that although of submicroscopic size, the gene was clearly a physical particle of complex structure, not just a working construct! Muller noted that the gene is unique from its product, and that it is normally duplicated unchanged, but once mutated, the new form is in turn duplicated faithfully.
    [Show full text]
  • QUANTITATIVE BIOLOGY This Interdiscplinary Major Bridges the Space Between Biology, Chemistry, Data Science, and Engineering
    QUANTITATIVE BIOLOGY This interdiscplinary major bridges the space between biology, chemistry, data science, and engineering. It allows students interested in studying the life sciences to achieve a fuller background in the quantitative sciences such as computer science and statistics that are essential for modern data-driven biological science. The Quantitative Biology major is designed to train students to be both high-level biologists and high-level data analysts. Students will take an introductory seminar, participate in undergraduate research, and write an honors thesis. BACHELOR OF SCIENCE (BS) GENERAL OVERVIEW Seven lower division courses: One advanced biology course. Examples include: General Biology — Organismal Biology and Evolution Evolution and Population Genetics General Biology — Cell Biology and Physiology Molecular Biology Introduction to Quantitative Biology Seminar General Chemistry A At least two upper division courses. Examples include: Introduction to Programming Cellular and Molecular Neuroscience Data Structures and Object Oriented Design Mathematics of Machine Learning Calculus I Fundamentals of Physics I Choose one Capstone course from the following: Choose five courses from the following areas: Computational Genome Analysis Structural Bioinformatics: From Atoms to Cells Physics Chemistry Research Experience: Mathematics Students are required to enroll in directed Computer Science research in a lab approved by the Quantitative Biology Executive Committee or assigned faculty adviser EXPERIENTIAL OPPORTUNITIES Freshman
    [Show full text]
  • Gender Disparity in Computational Biology Research Publications
    bioRxiv preprint doi: https://doi.org/10.1101/070631; this version posted August 26, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Gender disparity in computational biology research publications Kevin S. Bonham and Melanie I. Stefan Abstract While women are generally underrepresented in STEM fields, there are noticeable differences between fields. For instance, the gender ratio in biology is more balanced than in computer science. We were interested in how this difference is reflected in the interdisciplinary field of computational/quantitative biology. To this end, we examined the proportion of female authors in publications from the PubMed and arXiv databases. There are fewer female authors on research papers in computational biology, as compared to biology in general. This is true across authorship positions, year, and journal impact factor. A comparison with arXiv shows that quantitative biology papers have a higher ratio of female authors than computer science papers, placing computational biology in between its two parent fields in terms of gender representation. Both in biology and in computational biology, a female last author increases the probability of other authors on the paper being female, pointing to a potential role of female PIs in influencing the gender balance. Intro There is ample literature on the underrepresentation of women in STEM fields and the biases contributing to it. Those biases, though often subtle, are pervasive in several ways: they are often held and perpetuated by both men and women, and they are apparent across all aspects of academic and scientific practice.
    [Show full text]
  • The Center for Quantitative Biology at Peking University
    Quantitative Biology 2015, 3(1): 1–3 DOI 10.1007/s40484-015-0041-2 INTRODUCING A CENTER The Center for Quantitative Biology at Peking University Jing Yuan*, Luhua Lai and Chao Tang Center for Quantitative Biology, Peking University, Beijing 100871, China * Correspondence: [email protected] Received January 30, 2014; OVERVIEW biology; (ii) synthetic biology; (iii) computational biology and bioinformatics; (iv) disease mechanism and drug In the 21st century, scientific research is becoming more design from a system biology perspective. The goal of and more interdisciplinary. New discoveries are often CQB is to become a leading research center in the field of made at the boundaries of different disciplines. Biology is system biology and to promote the development of entering a new era in which quantitative measurements quantitative life sciences. In recent years, CQB received a and mathematical modeling play increasingly important number of major funding supports from the Ministry of roles in understanding and predicting biological behavior. Science and Technology of China and the National Facing this challenge and opportunity, the Center for Natural Science Foundation of China. Exciting and Quantitative Biology (CQB, formally the Center for cutting-edge researches are being carried out at CQB, Theoretical Biology) was established in 2001, with the some of which have been published in prestigious support of the Nobel Laureate Prof. T. D. Lee and the journals such as Cell, Science, Nature Chemistry, PNAS, leadership of Peking University. As part of PKU’s PRL, JACS, Angew Chem. strategic initiative in enhancing interdisciplinary research, CQB is dedicated to research and education at the EDUCATION interface between the traditional more quantitative disciplines (such as mathematics, physical sciences, Education has always been a priority of CQB.
    [Show full text]
  • Qbio Master Program Quantitative Biology in Practice
    QBIO MASTER PROGRAM quantitative biology in practice At the crossroads of Biology, Physics, Chemistry and Informatics, qbio is the qbio is a new graduate program destined for OBJECTIVES curriculum of the students interested in studying Bio-Health Biology with a quantitative qbio is built around emerging interdisci- Master’s Degree perspective founded on plinary fields in biosciences, ranging from of the University experimental state-of-the-art techniques in transdisciplinary approaches. microscopy, synthetic and structural biolo- of Montpellier. gy, to modelling and systemic approaches We recruit outstanding and of life science. highly motivated students from You will learn the core of these disciplines An innovative, various backgrounds. by performing team and individual projects hands-on and mentored by leading researchers in the interdisciplinary fields and by means of an innovative peda- gogical program. program entirely held in English (2 years, 120 ECTS). QBIO MASTER PROGRAM quantitative biology in practice qbio is a hands-on PEDAGOGICAL PROFESSIONAL curriculum INNOVATION OPPORTUNITIES designed around the active experience and We aim to transmit knowledge by uncon- We aim to form high-level PhD candidates, ventional teaching strategies built around but also provide the knowledge and back- the dynamic hands-on practical courses and student ground necessary to enter the private and interactions of projects, driven by outstanding researchers biotech environment. students. from Montpellier’s community. Scientific communication and project management also play a key role in our pedagogical offer. “Cross-disciplinarity INTERNSHIPS Discussions animated by the teachers, is more than just together with your active and direct bringing disciplines involvement in handling experiments In addition to the wide choice of internship together around and concrete difficulties, will help you opportunities from the vibrant Bio-Health community in Montpellier, you will benefit some clearly appropriating the different subjects.
    [Show full text]
  • Quantitative Biology Minor Worksheet
    Minor in Quantitative Biology – Minimum Requirements BIOL CORE COURSES (30 credits; 7 upper-level) Pre-requisites/Co-requisites credit BIOL 141 – Foundations of Biology: Cells, Energy & Organisms MATH 150 or higher (pre-req) 4 BIOL 142 – Foundations of Biology: Ecology & Evolution BIOL 141 4 BIOL 141, BIOL 142, BIOL 302 – Molecular & General Genetics CHEM 101/123, 4 CHEM 102/124 (pre/co-req) STAT 355 - Introduction to Probability and Statistics for MATH 142 or 152 Scientists and Engineers OR 4 STAT 350 Statistics with Applications in the Biological MATH 150 or placement in MATH Sciences 151 MATH 151 - Calculus and Analytic Geometry I MATH 150 4 MATH151 or MATH141 or MATH 152 - Calculus and Analytic Geometry II 4 MATH155B MATH141 or MATH151 or MATH 221 - Introduction to Linear Algebra 3 MATH380 MATH 355 - Biomathematics MATH152 and MATH 221 3 ELECTIVES (Choose any combination of courses for 6 credits) Pre-requisites/Co-requisites credit BIOL 300L and Either STAT 350, BIOL 312L - Modeling in the Life Sciences 2 MATH 151 or MATH 155 BIOL 303 - Cell Biology BIOL 302, CHEM 102 3 BIOL 313 - Introduction to Bioinformatics and Computational MATH 151 and either BIOL 141 3 Biology or CMSC 104 BIOL 442 - Developmental Biology BIOL 302 and BIOL 303 3 BIOL 445 - Signal Transduction BIOL 302 and BIOL 303 4 BIOL 142 and STAT 350 or STAT BIOL 483 - Evolution: From Genes to Genomes 4 355 Permission of Instructor; Junior Biol 495 - Seminar in Bioinformatics 2 - 4 standing Important Notes: 1) The Quantitative Biology minor may NOT be taken with a major in Biological Sciences (BIOL BA or BS), Biochemistry & Molecular Biology (BIOC), or Bioinformatics and Computational Biology (BINF) because of substantial overlap in requirements.
    [Show full text]
  • Masters in Quantitative and Systems Biology (QSB) Program
    Masters in Quantitative and Systems Biology (QSB) Program Syllabus for 2019-2020 Note: Syllabus is subject to revision as course scheduling and availability changes. SUMMER PRE-QUARTER ACTIVITIES Required Activities: • Select thesis lab with assistance of the Program director • Develop a pre-arrival plan with thesis advisor • Begin reading background papers on field and project, begin any online courses/training YEAR-ROUND WORKSHOPS, JOURNAL CLUBS, & MEETINGS *Partial listing of workshops, club meetings, and events *Registration costs included as part of QSB Program NU Information Technology Workshops and Training (view website): Every year NUIT offers many workshops and training events, including: • R-fundamentals: Parts I, II and III • Programming Concepts Workshop • Git and GitHub: Introduction • Command Line/Bash (Introduction) • Quest (NU supercomputer cluster) Introduction • Python: Scikit-learn (machine learning module for Python) • R: ggplot2 • Data Drop-In Hours and Consultations: NU provides free consultations and assistance for researchers. Available on Tuesdays and Wednesdays from 2-4pm at Mudd Library (Data Visualization and GIS Lab) • DataCamp online courses: o R, Python, SQL, command line, git, spreadsheets, and more o R User Group – meets monthly, Thursdays 4pm NSF-Simons Center for Quantitative Biology Journal Club (view website): Journal Clubs are held monthly to advance interdisciplinary communication and integrate knowledge between math and biology. Meets monthly on Fridays at 2:00 PM Data Science Nights (view website): Monthly hack nights on popular data science topics, organized by northwestern university graduate students and scholars. each night will feature refreshments, a talk on data science techniques or applications, and a hacking night with data science projects or learning groups of your choice.
    [Show full text]
  • Preprints in Motion: Tracking Changes Between Posting And
    bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432090; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Preprints in motion: tracking changes 2 between posting and journal publication 3 4 Jessica K Polka1, Gautam Dey2, Máté Pálfy3, Federico Nanni4, Liam Brierley5, Nicholas Fraser6 & 5 Jonathon Alexis Coates7,* 6 7 8 1 ASAPbio, 3739 Balboa St # 1038, San Francisco, CA 94121, USA 9 2 Cell Biology and Biophysics Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 10 Heidelberg, Germany 11 3 The Company of Biologists, Bidder Building, Station Road, Histon, Cambridge CB24 9LF, UK 12 4 The Alan Turing Institute, 96 Euston Rd, London NW1 2DB, UK 13 5 Department of Health Data Science, University of Liverpool, Brownlow Street, Liverpool, L69 3GL, 14 UK 15 6 Leibniz Information Centre for Economics, Düsternbrooker Weg 120, 24105 Kiel, Germany 16 7 William Harvey Research Institute, Charterhouse Square, Barts and the London School of Medicine 17 and Dentistry Queen Mary University of London, London, EC1M 6BQ, UK 18 19 20 21 * Correspondence: [email protected] 22 23 24 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.20.432090; this version posted February 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
    [Show full text]
  • Quantitative and Computational Biology Course Number: BY512 Credit
    Course Name: Quantitative and Computational Biology Course Number: BY512 Credit: 3-0-0-3 Prerequisites: - IC 136 - Understanding Biotechnology & its Applications OR Consent of Faculty member Students intended for: B. Tech. 3rd and 4th year, MS/MSc. /M. Tech., Ph.D. Elective or Core: Core for M. Tech. Biotechnology, elective for others Semester: Odd/Even Course objective: This course teaches the students with the fundamentals of quantitative biology and computational biology. These are essential components for statistical analysis of biological data and will allow student to learn the basic mathematical and statistical tool required to biologists. The computational biology aspects will introduce the students with additional practical skills that will allow them to handle biological data comprehensively. Course Outline: Module 1 [21 Lectures] Quantitative Biology Probability Theory, Probability Distributions - Binomial, Gaussian and Poisson Distributions. Descriptive statistics: mean, variance and sum of squares; mean and variance of a distribution, random numbers, random sampling. Regression analysis: linear, multiple and nonlinear. Test of hypotheses: t-test, z-test; Chi-square test of independence. Multivariate Analysis: various types of classification, ANOVA, PCA Examples of Statistics in biological data analysis. Module 2 [21 Lectures] Computational biology: Bioinformatics, bio-algorithms and Tools Introduction to Basic Programming: Introduction to basic scripting and programming routinely used in computational biology. Biological Databases and Sequence File Formats: Introduction to different biological databases, their classification schemes, and biological database retrieval systems. Sequence Alignments: Introduction to concept of alignment, Scoring matrices, Alignment algorithms for pairs of sequences, Multiple sequence alignment. Gene Prediction Methods: What is gene prediction? Computational methods of gene prediction-prokaryotic & eukaryotic.
    [Show full text]
  • Gender Disparity in Computational Biology Research Publications 2
    Manuscript bioRxiv preprint doi: https://doi.org/10.1101/070631; this version posted April 13, 2017. TheClick copyright here toholder download for this preprint Manuscript (which wasmanuscript.pdf not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Gender disparity in computational biology research publications 2 3 1* 2 Kevin S. Bonham , Melanie I. Stefan , 4 1 Microbiology and Immunobiology, Harvard Medical School. Boston, MA, USA 5 2 Centre for Integrative Physiology, Edinburgh Medical School. Biomedical Sciences, University of 6 Edinburgh. Edinburgh, UK 7 * kevin_bonhamhms.harvard.edu 8 Abstract 9 While women are generally underrepresented in STEM fields, there are noticeable differences between fields. 10 For instance, the gender ratio in biology is more balanced than in computer science. We were interested in 11 how this difference is reflected in the interdisciplinary field of computational/quantitative biology. To this 12 end, we examined the proportion of female authors in publications from the PubMed and arXiv databases. 13 There are fewer female authors on research papers in computational biology, as compared to biology in 14 general. This is true across authorship position, year, and journal impact factor. A comparison with arXiv 15 shows that quantitative biology papers have a higher ratio of female authors than computer science papers, 16 placing computational biology in between its two parent fields in terms of gender representation. Both in 17 biology and in computational biology, a female last author increases the probability of other authors on the 18 paper being female, pointing to a potential role of female PIs in influencing the gender balance.
    [Show full text]