Multiview Pattern Recognition Methods for Data Visualization, Embedding and Clustering

Total Page:16

File Type:pdf, Size:1020Kb

Multiview Pattern Recognition Methods for Data Visualization, Embedding and Clustering Doctoral Thesis Multiview pattern recognition methods for data visualization, embedding and clustering Ph.D. Student: Samir Kanaan Izquierdo Ph.D. Advisor: Alexandre Perera Lluna Thesis Presented in Partial Fulfillment of the Requirements for the Degree of Ph.D. Biomedical Engineering Ph.D. Program Automatic Control Department Universitat Polit`ecnicade Catalunya Barcelona, 2017 i Abstract Multiview data is defined as data for whose samples there exist sev- eral different data views, i.e. different data matrices obtained through different experiments, methods or situations. Multiview dimensionality reduction methods transform a high-dimensional, multiview dataset into a single, low-dimensional space or projection. Their goal is to provide a more manageable representation of the original data, either for data visu- alization or to simplify the following analysis stages. Multiview clustering methods receive a multiview dataset and propose a single clustering as- signment of the data samples in the dataset, considering the information from all the input data views. The main hypothesis defended in this work is that using multiview data along with methods able to exploit their information richness pro- duces better dimensionality reduction and clustering results than simply using single views or concatenating all views into a single matrix. Consequently, the objectives of this thesis are to develop and test multiview pattern recognition methods based on well known single-view dimensionality reduction and clustering methods. Three multiview pat- tern recognition methods are presented: multiview t-distributed stochas- tic neighbourhood embedding (MV-tSNE), multiview multimodal scal- ing (MV-MDS) and a novel formulation of multiview spectral clustering (MVSC-CEV). These methods can be applied both to dimensionality reduction tasks and to clustering tasks. The MV-tSNE method computes a matrix of probabilities based on distances between samples for each input view. Then it merges the differ- ent probability matrices using results from expert opinion pooling theory to get a common matrix of probabilities, which is then used as reference to build a low-dimensional projection of the data whose probabilities are similar. The MV-MDS method computes the common eigenvectors of all the normalized distance matrices in order to obtain a single low-dimensional space that embeds the essential information from all the input spaces, avoiding redundant information to be included. The MVSC-CEV method computes the symmetric Laplacian matri- ces of the similarity matrices of all data views. Then it generates a single, low-dimensional representation of the input data by computing the com- mon eigenvectors of the Laplacian matrices, obtaining a projection of the data that embeds the most relevant information of the input data views, also avoiding the addition of redundant information. A thorough set of experiments has been designed and run in order to compare the proposed methods with their single view counterpart. Also, the proposed methods have been compared with all the available results of equivalent methods in the state of the art. Finally, a comparison between the three proposed methods is presented in order to provide guidelines on which method to use for a given task. MVSC-CEV consistently produces better clustering results than other multiview methods in the state of the art. MV-MDS produces overall better results than the reference methods in dimensionality reduction ii experiments. MV-tSNE does not excel on any of these tasks. As a con- sequence, for multiview clustering tasks it is recommended to use MVSC- CEV, and MV-MDS for multiview dimensionality reduction tasks. Although several multiview dimensionality reduction or clustering methods have been proposed in the state of the art, there is no software implementation available. In order to compensate for this fact and to provide the community with a potentially useful set of multiview pat- tern recognition methods, an R software package containg the proposed methods has been developed and released to the public. Unesco codes: 120304, 120804, 120110 Resumen Los datos multivista se definen como aquellos datos para cuyas mues- tras existen varias vistas de datos distintas, es decir diferentes matrices de datos obtenidas mediante diferentes experimentos, m´etodos o situa- ciones. Los m´etodos multivista de reducci´onde la dimensionalidad trans- forman un conjunto de datos multivista y de alta dimensionalidad en un ´unicoespacio o proyecci´onde baja dimensionalidad. Su objetivo es pro- ducir una representaci´onm´asmanejable de los datos originales, bien para su visualizaci´ono para simplificar las etapas de an´alisissubsigu- ientes. Los m´etodos de agrupamiento multivista reciben un conjunto de datos multivista y proponen una ´unicaasignaci´onde grupos para sus muestras, considerando la informaci´onde todas las vistas de datos de entrada. La principal hip´otesisdefendida en este trabajo es que el uso de datos multivista junto con m´etodos capaces de aprovechar su riqueza informa- tiva producen mejores resultados en reducci´onde la dimensionalidad y agrupamiento frente al uso de vistas ´unicaso la concatenaci´onde varias vistas en una ´unicamatriz. Por lo tanto, los objetivos de esta tesis son desarrollar y probar m´etodos multivista de reconocimiento de patrones basados en m´etodos univista reconocidos. Se presentan tres m´etodos multivista de reconoci- miento de patrones: proyecci´onestoc´asticade vecinos multivista (MV- tSNE), escalado multidimensional multivista (MV-MDS) y una nueva formulaci´onde agrupamiento espectral multivista (MVSC-CEV). Estos m´etodos pueden aplicarse tanto a tareas de reducci´onde la dimension- alidad como de agrupamiento. MV-tSNE calcula una matriz de probabilidades basada en distancias entre muestras para cada vista de datos. A continuaci´oncombina las matrices de probabilidad usando resultados de la teor´ıade combinaci´on de expertos para obtener una matriz com´unde probabilidades, que se usa como referencia para construir una proyecci´onde baja dimensionalidad de los datos. MV-MDS calcula los vectores propios comunes de todas las matrices normalizadas de distancia para obtener un ´unicoespacio de baja dimen- iii sionalidad que integre la informaci´onesencial de todos los espacios de entrada, evitando informaci´onredundante. MVSC-CEV calcula las matrices Laplacianas de las matrices de simil- itud de los datos. A continuaci´ongenera una ´unica representaci´onde baja dimensionalidad calculando los vectores propios comunes de las Lapla- cianas. As´ıobtiene una proyecci´onde los datos que integra la informaci´on m´asrelevante y evita a~nadirinformaci´onredundante. Se ha dise~nadoy ejecutado una bater´ıade experimentos completa para comparar los m´etodos propuestos con sus equivalentes univista. Adem´aslos m´etodos propuestos se han comparado con los resultados disponibles en la literatura. Finalmente, se presenta una comparaci´on entre los tres m´etodos para proporcionar orientaciones sobre el m´etodo m´asadecuado para cada tarea. MVSC-CEV produce mejores agrupamientos que los m´etodos equiv- alentes en la literatura. MV-MDS produce en general mejores resultados que los m´etodos de referencia en experimentos de reducci´onde la di- mensionalidad. MV-tSNE no destaca en ninguna de esas tareas. Conse- cuentemente, para agrupamiento multivista se recomienda usar MVSC- CEV, y para reducci´onde la dimensionalidad multivista MV-MDS. Aunque se han propuesto varios m´etodos multivista en la literatura, no existen programas disponibles p´ublicamente. Para remediar este he- cho y para dotar a la comunidad de un conjunto de m´etodos potencial- mente ´util,se ha desarrollado un paquete de programas en R y se ha puesto a disposici´ondel p´ublico. C´odigosUnesco: 120304, 120804, 120110 To J&B, with love. vii Acknowledgements First I would like to express my gratitude to my advisor Dr. Alexandre Perera for his support in the creation of this thesis. As my advisor, he has shown me the way to go when I was disoriented, he has encouraged me when I was stranded in my research, and he has also taught me to realize that, eventually, things get working. As a researcher I have learnt a lot from him as well: he is careful with his work, always willing to learn new things, and has a keen eye for spotting flaws! Finally, he is a kind person and it has been a pleasure to work with him all these years. In second place I want to thank my colleagues at my workplace, the EEBE (former EUETIB), for their support, advice and encouragement during all these years. Very specially I want to thank Dr. Gerard Escudero for his generous help assuming so many common tasks to relieve me and let me have more spare time to devote to my thesis. I also want to thank Drs. Pedro Gomis and Montserrat Vallverd´ufor introducing me to my advisor. Seemingly they had a perfect notion of what I could and wanted to do and I think they made a perfect match. Raimon Jan´eand Noem´ıZapata also deserve my gratitude for their help with all the academic requirements and the paperwork. At times I have caused them extra work and I am very thankful for this. Of course I am not forgetting my fellow mates, with whom I have shared all these years of work, exchange of ideas and hopes. And some mountain hiking too! Thank you Andrey, Francesc, Helena, Maria, Gio, Sergi and many others along the way. Finally I want to thank my family and friends for supporting (and stand- ing) me during all these long years. Hopefully they (and me) will be finally relieved from the old question \How's your thesis going?". It is about time to move on to new questions. Contents Contents ix List of Tables xiii List of Figures xx 1 Multiview unsupervised pattern recognition methods 1 1.1 Introduction . 1 1.1.1 Multiview data . 1 1.1.2 Unsupervised pattern recognition methods . 11 1.2 State of the art: multiview dimensionality reduction . 12 1.2.1 Multiview dimensionality reduction methods . 13 1.3 State of the art: multiview clustering .
Recommended publications
  • CV Page 1 THOMAS JEFFERSON SHARPTON Rank
    THOMAS JEFFERSON SHARPTON Rank: Associate Professor Department: Department of Microbiology, Oregon State University Department of Statistics, Oregon State University Date Hired: October 15, 2013 Address: 226 Nash Hall, Corvallis, OR 97331 Phone: (541) 737-8623 Email: [email protected] Web: lab.sharpton.org A. EDUCATION AND EMPLOYMENT INFORMATION Education 2009 Ph.D. Microbiology, Designated Emphasis in Computational Biology, University of California, Berkeley, CA, USA Dissertation: “Investigations of Natural Genomic Variation in the Fungi” Advisors: John W. Taylor & Sandrine Dudoit 2003 B.S. (Cum Laude) Biochemistry and Biophysics, Oregon State University, Corvallis, OR, USA Employment history 2019- Associate Professor Department of Microbiology, Oregon State University Department of Statistics, Oregon State University 2016- Founding Director Oregon State University Microbiome Initiative 2013-2019 Assistant Professor Department of Microbiology, Oregon State University Department of Statistics, Oregon State University 2013- Center Investigator Center for Genome Research and Biocomputing, Oregon State University 2009-2013 Bioinformatics Fellow Gladstone Institute of Cardiovascular Disease, Gladstone Institutes Advisors: Katherine Pollard & Jonathan Eisen (UC Davis) 2003-2009 Graduate Student Researcher Department of Microbiology, University of California, Berkeley Advisors: John Taylor & Sandrine Dudoit 2005 Bioinformatics Research Intern The Broad Institute Advisor: James Galagan Additional Affiliations 2017-2018 Scientific Advisory Board Member Thomas J. Sharpton – CV Page 1 Resilient Biotics, Inc. B. AWARDS 2019 Phi Kappa Phi Emerging Scholar Award 2018 OSU College of Science Research and Innovation Seed Program Award 2017 OSU College of Science Early Career Impact Award 2014 Finalist: Carter Award in Outstanding & Inspirational Teaching in Science 2013 Gladstone Scientific Leadership Award 2007 Dean’s Council Student Research Representative 2007 Effectiveness in Teaching Award, UC Berkeley 2006-2008 Chang-Lin Tien Graduate Research Fellowship C.
    [Show full text]
  • Nandita R. Garud
    NANDITA R. GARUD Department of Ecology and Evolutionary Biology University of California, Los Angeles 621 Charles E. Young Drive South Los Angeles, CA 90095-1606 [email protected] Curriculum Vitae January 2020 EDUCATION 2015 Ph.D. in Genetics Stanford University, Stanford, CA Focus on adaptation in natural populations 2012 MS in Statistics Stanford University, Stanford, CA 2008 BS in Biometry & Statistics and Biological Sciences, (Magna Cum Laude) Cornell University, Ithaca, NY EMPLOYMENT 2019- current Assistant Professor University of California, Los Angeles, CA Department of Ecology and Evolutionary Biology Department of Human Genetics Member, Interdepartmental Program in Bioinformatics Member, Genetics & Genomics Home Area 2015-2019 Bioinformatics Fellow Postdoctoral Position The Gladstone Institutes at the University of California, San Francisco, CA Advisor: Katherine Pollard 2015 summer Data Scientist Intern Intuit Inc., Mountain View, CA PREVIOUS RESEARCH 2010-2015 Ph.D. Candidate Department of Genetics, Stanford University, Stanford, CA Advisor: Dmitri Petrov 2008-2009 Fulbright Scholar The Bioinformatics Centre, University of Copenhagen, Denmark Advisor: Jakob Pedersen 2006-2008 Undergraduate Honors Researcher Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY Advisor: Andrew Clark 2006 Cold Spring Harbor Summer Undergraduate Research Intern Cold Spring Harbor, NY Advisor: Doreen Ware AWARDS AND HONORS 2014-2015 Stanford Center for Evolution and Human Genomics Fellow 2013 Walter Fitch Finalist, Society for Molecular Biology and Evolution, Chicago. 2010-2013 National Science Foundation Graduate Research Fellow 2008-2009 Fulbright Scholar, Denmark 2009-2010 Stanford Genome Training Program Grant 2007-2008 Cornell Hughes Scholar 2007-2008 CALS Charitable Trust Grant and Morley Grant for Undergraduate Research 2004-2008 Dean’s List, Cornell University 2004-2008 Byrd Undergraduate Scholarship 2004-2008 New York Lottery Undergraduate Scholarship PUBLICATIONS 1.
    [Show full text]
  • MSTATNEWS Athe Membership Magazine of the American Statistical Association What Makes Y O U R V O T E Matter?
    April 2008 • Issue #370 MSTATNEWS AThe Membership Magazine of the American Statistical Association What Makes Y o u r V o t e Matter? Publications Agreement No. 41544521 Board of Directors Meeting Highlights Making Statistics Delicious, Not Just Palatable April 2008 • Issue #370 VISION STATEMENT To be a world leader in promoting statistical practice, applications, and research; publishing statistical journals; improving statistical education; FEATURES and advancing the statistics profession. Executive Director 2 Longtime Members Ron Wasserstein: [email protected] 10 President’s Invited Column Associate Executive Director and Director of Operations Stephen Porzio: [email protected] 11 Highlights of the March 2008 Board of Directors Meeting Director of Programs 12 ASA Board Calls for Audits To Increase Confidence in Martha Aliaga: [email protected] Electoral Outcomes Director of Science Policy Steve Pierson: [email protected] 12 ASA Position on Electoral Integrity Managing Editor Megan Murphy: [email protected] 13 Wise Elders Program Production Coordinators/Graphic Artists 14 What Makes Your Vote Matter? Colby Johnson: [email protected] Lidia Vigyázó: [email protected] 16 Canadian Model for Accrediting Professional Statisticians Publications Coordinator Valerie Snider: [email protected] 17 Proposals Sought for NSF-CBMS Conferences Advertising Manager 18 Making Statistics Delicious, Not Just Palatable Claudine Donovan: [email protected] Contributing Staff Members 20 Big Science and Little Statistics Keith Crank • Ron Wasserstein 21 Free Textbook Offered to Stats Teachers Amstat News welcomes news items and let- 22 Success of Statistical Service Leads to Expanded Network ters from readers on matters of interest to the asso- ciation and the profession. Address correspondence 22 NHIS Paradata Released to Public to Managing Editor, Amstat News, American Statistical Association, 732 North Washington Street, Alexandria VA 22314-1943 USA, or email amstat@ 23 Staff Spotlight amstat.org.
    [Show full text]
  • Cvauc: Cross- Validated Area Under the ROC Curve Confidence Intervals
    CURRICULUM VITAE and BIBLIOGRAPHY CONTACT INFORMATION Name: Mark Johannes van der Laan. Nationality: Dutch. Marital status: Married to Martine with children Laura, Lars, and Robin. University address: University of California Division of Biostatistics School of Public Health 108 Haviland Hall Berkeley, CA 94720-7360 email: [email protected]. Telephone number: office: 510-643-9866 fax: 510-643-5163 Web address: www.stat.berkeley.edu/ laan Working Papers, Division of Biostatistics: www.bepress.com/ucbbiostat EDUCATION 1990-1993: Department of Mathematics, Utrecht University. Ph.D student of Prof. Dr. R.D. Gill. Position included 25% teaching, 75% research and education. Specialization in Estimation in Semiparametric and Censored Data Models. 1991-1992: University of California, Berkeley. Statistics Program at M.S.R.I.: \Semiparametric Models and Survival Analysis". Research with guidance by second Promotor Prof. Dr. P.J. Bickel. Subject: “Efficient Estimation in the Bivariate Censoring Model". December 13, 1993: Official Public Defense of Ph.D Thesis. 1985-1990: Masters degree in Mathematics at the University of Utrecht, The Netherlands. Statistics Major. 1988-1989: One year study, Masters degree courses at the Department of Statistics, North Carolina State University, Raleigh, North Carolina, U.S.A. G.P.A 4.0, Dean's List. 1989-1990: Masters thesis under guidance of Prof. Dr. R.D. Gill. Subject: The Dabrowska Estimator and the Functional Delta method. Grade (from 1-10, 10=top): 9.5. Official Completion: May 1, 1990. 1 ACADEMIC POSITIONS 2006-present: Jiann-Ping Hsu/Karl E. Peace Endowed Chair in Biostatistics. 2013-2018: Investigator and core leader of the methods workgroup of the Sustainable East African Research in Community Health (SEARCH).
    [Show full text]
  • Downloaded from NCBI Genbank Or Sequence
    Lind and Pollard Microbiome (2021) 9:58 https://doi.org/10.1186/s40168-021-01015-y METHODOLOGY Open Access Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing Abigail L. Lind1 and Katherine S. Pollard1,2,3,4,5* Abstract Background: Microbial eukaryotes are found alongside bacteria and archaea in natural microbial systems, including host-associated microbiomes. While microbial eukaryotes are critical to these communities, they are challenging to study with shotgun sequencing techniques and are therefore often excluded. Results: Here, we present EukDetect, a bioinformatics method to identify eukaryotes in shotgun metagenomic sequencing data. Our approach uses a database of 521,824 universal marker genes from 241 conserved gene families, which we curated from 3713 fungal, protist, non-vertebrate metazoan, and non-streptophyte archaeplastida genomes and transcriptomes. EukDetect has a broad taxonomic coverage of microbial eukaryotes, performs well on low-abundance and closely related species, and is resilient against bacterial contamination in eukaryotic genomes. Using EukDetect, we describe the spatial distribution of eukaryotes along the human gastrointestinal tract, showing that fungi and protists are present in the lumen and mucosa throughout the large intestine. We discover that there is a succession of eukaryotes that colonize the human gut during the first years of life, mirroring patterns of developmental succession observed in gut bacteria. By comparing DNA and RNA sequencing of paired samples from human stool, we find that many eukaryotes continue active transcription after passage through the gut, though some do not, suggesting they are dormant or nonviable. We analyze metagenomic data from the Baltic Sea and find that eukaryotes differ across locations and salinity gradients.
    [Show full text]
  • Patrick J. H. Bradley
    (415) 734-2745 Patrick J. H. Bradley Gladstone Institutes, GIDB [email protected] 1650 Owens Street Pronouns: he/him/his Bioinformatics Fellow San Francisco, CA 94158 Current Position ······················································································· 2013— J. David Gladstone Institutes at UCSF Bioinformatics Fellow, Prof. Katherine S. Pollard Lab Academic History ····················································································· 2012—13 Lewis-Sigler Institute for Integrative Genomics, Princeton University Postdoctoral Fellow, Prof. Olga G. Troyanskaya Lab 2005—12 Dept. of Molecular Biology, Princeton University Ph.D. in Molecular Biology, Specialization in Quantitative and Computational Biology Thesis: Inferring Metabolic Regulation from High-Throughput Data Advisors: Prof. Joshua D. Rabinowitz, Prof. Olga G. Troyanskaya Committee: Prof. Ned S. Wingreen, Prof. David Botstein 2005 Dept. of Biology, Harvard College A.B. in Biology Peer-Reviewed Publications ·········································································· 1. Patrick H. Bradley, Katherine S. Pollard. “phylogenize: correcting for phylogeny reveals genes associated with microbial distributions.” Bioinformatics, 2019; btz722.∗;z 2. Patrick H. Bradley, Patrick A. Gibney, David Botstein, Olga G. Troyanskaya, Joshua D. Rabinowitz. “Minor isozymes tailor yeast metabolism to carbon availability.” mSystems, 2019; 4:e00170-18.∗;y 3. Patrick H. Bradley, Stephen Nayfach, Katherine S. Pollard. “Phylogeny-corrected identification
    [Show full text]
  • CV 2015 Sharpton, Thomas.Pdf
    THOMAS JEFFERSON SHARPTON Department of Microbiology [email protected] Oregon State University lab.sharpton.org 226 Nash Hall, Corvallis, OR 97331 @tjsharpton tel: (541) 737-8623 fax: (541) 737-0496 github.com/sharpton EDUCATION 2009 University of California, Berkeley, CA Ph.D., Microbiology, Designated Emphasis in Computational Biology 2003 Oregon State University, Corvallis, OR B.S. (Cum Laude), Biochemistry and Biophysics. APPOINTMENTS 2013- Assistant Professor Departments of Microbiology and Statistics, Oregon State University 2009-2013 Bioinformatics Fellow Gladstone Institute of Cardiovascular Disease, Gladstone Institutes Mentors: Katherine Pollard & Jonathan Eisen (UC Davis) 2003-2009 Graduate Student Researcher Department of Microbiology, University of California, Berkeley Mentors: John Taylor & Sandrine Dudoit 2005 Bioinformatics Research Intern The Broad Institute Mentor: James Galagan 2002-2003 Research Assistant Department of Zoology, Oregon State University 2001 Research Assistant International Arctic Research Center, University of Alaska, Fairbanks, AK HONORS & AWARDS 2013 Gladstone Scientific Leadership Award 2007 Dean’s Council Student Research Representative 2007 Effectiveness in Teaching Award, UC Berkeley 2006-2008 Chang-Lin Tien Graduate Research Fellowship 2002 Student Leadership Fellowship 2002 HHMI Undergraduate Research Fellowship 2002-2003 Dr. Donald MacDonald Excellence in Science Scholarship 2002 OSU College of Science's Excellence in Science Scholarship 1999-2003 Provost Scholarship 1999 American Legion Student Ethics Award PUBLICATIONS Refereed Journal Articles 17. O’Dwyer JP, Kembel SW, Sharpton TJ (In Review) Backbones of Evolutionary History Test Biodiversity Theory for Microbes 16. Quandt AC, Kohler A, Hesse C, Sharpton TJ, Martin F, Spatafora JW. (In Print) Metagenomic sequence of Elaphomyces granulatus from sporocarp tissue reveals Ascomycota ectomycorrhizal fingerprints of genome expansion and a Proteobacteria rich microbiome.
    [Show full text]
  • ASA Commends NSF for Initiative To
    Embargoed until 9 a.m. (EDT) August 13, 2015 STATISTICAL ADVANCES HELP UNLOCK MYSTERIES OF THE HUMAN MICROBIOME SEATTLE, WA, AUGUST 13, 2015 – Advances in the field of statistics are helping to unlock the mysteries of the human microbiome—the vast collection of microorganisms living in and on the bodies of humans, said Katherine Pollard, a statistician and biome expert, during a session today at the 2015 Joint Statistical Meetings (JSM 2015) in Seattle. Pollard, senior investigator at the Gladstone Institutes and professor of epidemiology and biostatistics at the University of California, San Francisco, delivered a presentation titled “Estimating Taxonomic and Functional Diversity in Shotgun Metagenomes” during an invited session focused on statistics, the microbiome and human health. “While we are only just beginning to understand the complex roles microbes play in human biology, it is clear specific changes in microbial flora are associated with—and sometimes cause or cure—disease in the host,” said Pollard while explaining her research focus. “Some of the best-supported links are with autoimmune diseases, which are on the rise in the United States, perhaps due to antibiotic use and lack of exposure to a diverse collection of microbes during childhood. This ‘hygiene hypothesis’ suggests that health risks not attributable to human genetics and behavior may stem from differences in microbiome composition between individuals.” What’s more, a given microbial species can share less than 50% of the same genes when found in two people. These differences track with functional capabilities of the microbial communities, including genes related to sugar metabolism, biosynthesis and two-component systems.
    [Show full text]
  • Clinical Efficacy and Increased Donor Strain Engraftment After Antibiotic
    medRxiv preprint doi: https://doi.org/10.1101/2021.08.07.21261556; this version posted August 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license . Clinical efficacy and increased donor strain engraftment after antibiotic pretreatment in a randomized trial of ulcerative colitis patients receiving fecal microbiota transplant Byron J. Smith A,B (ORCID: 0000-0002-0182-404X) Yvette Piceno C (ORCID: 0000-0002-7915-4699) Martin Zydek D Bing Zhang D (ORCID: 0000-0002-3377-0963) Lara Aboud Syriani E Jonathan P. Terdiman D Zain Kassam F Averil Ma G (ORCID: 0000-0003-4817-1258) Susan V. Lynch D,H (ORCID: 0000-0001-5695-7336) Katherine S. Pollard A,B,I,* (ORCID: 0000-0002-9870-6196) Najwa El-Nachef D,* A The Gladstone Institute of Datascience and Biotechnology, San Francisco, CA B Department of Epidemiology and Biostatistics, University of California, San Francisco, CA C Symbiome, Inc., San Francisco, CA D Division of Gastroenterology, University of California, San Francisco, CA E College of Osteopathic Medicine of the Pacific, Western University of Health Sciences, Pomona, CA F Finch Therapeutics, Somerville, MA G Department of Medicine, University of California, San Francisco, CA H Benioff Center for Microbiome Medicine, University of California, San Francisco, CA I Chan-Zuckerberg Biohub, San Francisco, CA * Corresponding authors: [email protected] [email protected] NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.
    [Show full text]
  • Exploring the Behaviour of the Hidden Markov Model on Cpg Island Prediction
    Exploring the behaviour of the Hidden Markov Model on CpG island prediction A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science University of Saskatchewan Saskatoon By Arnie Berg c Arnie Berg, May/2013. All rights reserved. Abstract DNA can be represented abstrzctly as a language with only four nucleotides represented by the letters A, C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C nucleotides) and guanine (G nucleotides), especially the CpG di-nucleotide, are frequently associated with biological function related to gene expression, and concentrations of CpGs referred to as \CpG islands" are known to collocate with regions upstream from gene coding sequences within the promoter region. The pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides), lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without an awareness of how the outcomes are affected by the manner in which the HMM is applied. Two main findings of this study are: 1. The outcome of a HMM is highly sensitive to the setting of the initial probability estimates. 2.
    [Show full text]
  • Chromatin Features Constrain Structural Variation Across Evolutionary Timescales
    Chromatin features constrain structural variation across evolutionary timescales Geoff Fudenberga,1 and Katherine S. Pollarda,b,c,d,e,f,1 aGladstone Institute of Data Science and Biotechnology, San Francisco, CA 94158; bDepartment of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158; cInstitute for Human Genetics, University of California, San Francisco, CA 94158; dQuantitative Biology Institute, University of California, San Francisco, CA 94158; eInstitute for Computational Health Sciences, University of California, San Francisco, CA 94158; and fChan-Zuckerberg Biohub, San Francisco, CA 94158 Edited by Jasper Rine, University of California, Berkeley, CA, and approved December 10, 2018 (received for review May 23, 2018) The potential impact of structural variants includes not only the with autism and developmental delay, where deletions occur re- duplication or deletion of coding sequences, but also the pertur- markably uniformly across the genome, and in cancer, where dele- bation of noncoding DNA regulatory elements and structural tions in fact show a slight enrichment for disrupting otherwise chromatin features, including topological domains (TADs). Struc- important features. Together our analyses uncover a genome- tural variants disrupting TAD boundaries have been implicated wide pattern of negative selection against deletions that could po- both in cancer and developmental disease; this likely occurs via tentially alter chromatin structure and lead to enhancer hijacking. “enhancer hijacking,” whereby
    [Show full text]
  • Exploring the Behaviour of the Hidden Markov Model on Cpg Island Prediction
    CORE Metadata, citation and similar papers at core.ac.uk Provided by University of Saskatchewan's Research Archive Exploring the behaviour of the Hidden Markov Model on CpG island prediction A Thesis Submitted to the College of Graduate Studies and Research in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science University of Saskatchewan Saskatoon By Arnie Berg c Arnie Berg, May/2013. All rights reserved. Abstract DNA can be represented abstrzctly as a language with only four nucleotides represented by the letters A, C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C nucleotides) and guanine (G nucleotides), especially the CpG di-nucleotide, are frequently associated with biological function related to gene expression, and concentrations of CpGs referred to as \CpG islands" are known to collocate with regions upstream from gene coding sequences within the promoter region. The pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides), lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without an awareness of how the outcomes are affected by the manner in which the HMM is applied. Two main findings of this study are: 1.
    [Show full text]