Application of Biclustering Algorithms to Biological Data

Total Pages: 16

File Type: PDF, Size: 1020 KB

APPLICATION OF BICLUSTERING ALGORITHMS TO BIOLOGICAL DATA

MASTERS THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of the Ohio State University

By Kemal Eren, BS
Graduate Program in Computer Science
The Ohio State University
2012

Thesis Committee: Ümit V. Çatalyürek, Advisor; Srinivasan Parthasarathy

© Copyright by Kemal Eren 2012

ABSTRACT

Microarrays have made it possible to cheaply collect large gene expression datasets. Biclustering has become established as a popular method for mining patterns in these datasets. Biclustering algorithms simultaneously cluster rows and columns of the data matrix; this approach is well suited to gene expression data because genes are not related across all samples, and vice versa. In the past decade many biclustering algorithms that specifically target gene expression data have been published. However, only a few are commonly used in bioinformatics pipelines, for several reasons: implementations have been published for only a small fraction of these algorithms, the published implementations have different interfaces, and the literature contains few comparisons of algorithms or guidelines for choosing among them.

In this thesis we address three problems: the development of an efficient and effective biclustering algorithm, the development of a software framework for biclustering tasks, and a comprehensive benchmark of biclustering techniques. We improved the Correlated Patterns Biclustering (CPB) algorithm's running time and accuracy by modifying its heuristic for evaluating rows and columns for inclusion in a bicluster. This calculation was previously performed by an iterative approach, but we developed a more computationally efficient method. We further improved CPB by removing unnecessary parameters and developing a nonparametric method for filtering irrelevant biclusters.

To provide a common interface and also enable comparison of biclustering algorithms, we developed a Python package for bicluster analysis, which we introduce in this thesis. This package, BiBench, provides wrappers to twelve biclustering algorithms, as well as functionality for generating synthetic data, downloading gene expression data, transforming datasets, and validating biclusters.

Using BiBench we compared twelve algorithms, including the modified version of CPB. The algorithms were tested on synthetic datasets for their ability to recover specific bicluster models, resist noise, recover multiple biclusters, and recover overlapping biclusters. They were also tested on gene expression data; Gene Ontology enrichment was used to identify biologically relevant biclusters.

For Susan Slattery

ACKNOWLEDGMENTS

This work would have been impossible without the help of the people, both past and present, at the HPC Lab. In particular I would like to acknowledge my adviser, Ümit Çatalyürek, for all of his guidance and support.

VITA

1986: Born in Cleveland, OH.
2009: B.S. in Biology, minors in Mathematics and Computer Science, The University of Michigan.
2009-Present: Joined the High Performance Computing Laboratory, The Ohio State University.
FIELDS OF STUDY

Major Field: Computer Science and Engineering
Specialization: Machine Learning, Data Mining, Bioinformatics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Tables

1 Introduction
2 Background
   2.1 Notation
   2.2 Algorithms
      2.2.1 Cheng and Church
      2.2.2 OPSM
      2.2.3 xMOTIFs
      2.2.4 Spectral
      2.2.5 ISA
      2.2.6 Plaid
      2.2.7 BiMax
      2.2.8 Bayesian Biclustering (BBC)
      2.2.9 QUBIC
      2.2.10 COALESCE
      2.2.11 FABIA
3 The Correlated Patterns Biclustering Algorithm
   3.1 Original algorithm
      3.1.1 The tendency vector
      3.1.2 Updating rows of a bicluster
      3.1.3 Updating columns of a bicluster
   3.2 Improvements
      3.2.1 Improving tendency vector calculation
      3.2.2 A lower bound for the minimum pairwise correlation
      3.2.3 Removing the row-to-column ratio
      3.2.4 Filtering biclusters found by chance
   3.3 Implementations
4 BiBench: a framework for biclustering analysis
   4.1 Examples of use
   4.2 Related work
   4.3 Overview of functionality
      4.3.1 Algorithms
      4.3.2 Datasets
      4.3.3 Validation
      4.3.4 Visualization
   4.4 Availability
5 Comparison of biclustering algorithms
   5.1 Related work
   5.2 Algorithms
      5.2.1 Parameter selection
   5.3 Synthetic dataset experiments
      5.3.1 Synthetic data model
      5.3.2 Bicluster similarity measures
      5.3.3 Model experiment
      5.3.4 Noise experiment
      5.3.5 Number experiment
      5.3.6 Overlap experiment
   5.4 Microarray dataset experiments
      5.4.1 Gene Ontology enrichment analysis
      5.4.2 GDS results
   5.5 Discussion
6 Conclusion and future work
Bibliography

LIST OF FIGURES

1.1 Illustration of a bicluster made contiguous by row and column exchanges.
2.1 Three types of bicluster structures. Note that biclusters may not be continuous, as in these figures, until the appropriate row and column permutation.
3.1 Effect of noise on the correlation of rows of ten 50x50 matrices with perfectly correlated rows. Top: pairwise correlation density. Bottom: correlation density with the tendency vector. 'Random' is a control for comparison: randomly generated matrices with no special correlation between rows.
3.2 Relationship between Corr and Error on random vectors with 50 elements.
3.3 Row expression levels before and after making all pairwise correlations positive.
3.4 Running time: iterative vs. single-pass tendency vector calculation.
3.5 Scores: iterative vs. single-pass tendency vector calculation.
3.6 The cone formed by tendency vector t and angle θ = arccos(ρ_min).
3.7 ρ_min as a function of ρ_pw.
3.8 Effect of row-to-column ratio on CPB's scores for three different bicluster sizes: 25, 50, and 100 rows. (The numbers of bicluster columns were 100, 50, and 25, so that each bicluster always contained 2500 elements.)
3.9 Improvement of relevance scores after filtering.
3.10 Running times of the original C version and the Python version of CPB as the number of rows in the dataset increases.
4.1 Four types of visualization supported by BiBench. The first three are performed by the biclust package; the last by BicOverlapper.
5.1 The intersection area (dark red) and union area (all shaded areas) involved in the calculation of the Jaccard index.
5.2 Bicluster model experiment. Each data point represents the mean of twenty datasets. A score of (1,1) is best.
5.3 Results of the bicluster model experiment after filtering.
5.4 Noise experiment: additive noise. Dots show the mean score; lines show one standard deviation.
5.5 Number experiment: increasing number of biclusters.
5.6 Overlap experiment: increasing amounts of overlap between two biclusters.
5.7 Number of rows in all GDS biclusters.
5.8 Number of rows in enriched GDS biclusters.
5.9 Number of rows in enriched, non-overlapping GDS biclusters.

LIST OF TABLES

5.1 Theoretical support of each bicluster model by algorithm.
5.2 GDS datasets.
5.3 Aggregated results on all eight GDS datasets. Biclusters were considered enriched if any GO term was enriched at the p = 0.05 level after multiple test correction. The set of enriched biclusters was further processed to discard overlapping biclusters.
5.4 Five most enriched terms for each algorithm's best bicluster on GDS589.

CHAPTER 1

INTRODUCTION

Clustering refers to a class of unsupervised
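The abstract describes a bicluster as a simultaneously selected subset of rows and columns of a data matrix, and Figure 5.1 refers to the Jaccard index, computed from the intersection and union of two biclusters' cells. The short Python sketch below illustrates both ideas on a toy matrix with one planted correlated-row bicluster; it is illustrative only, is not code from the thesis or from the BiBench package, and every name and parameter value in it is an assumption made for the example.

```python
# Illustrative sketch only: NOT code from the thesis or from BiBench.
# Plants one shift-and-scale (correlated-row) bicluster, the kind of pattern
# CPB targets, in a noisy matrix, then scores a hypothetical recovered
# bicluster against it with the Jaccard index described in Figure 5.1.
import numpy as np

rng = np.random.default_rng(0)

# Background expression matrix: 100 genes (rows) x 50 samples (columns).
data = rng.normal(size=(100, 50))

# Planted bicluster: rows 10-29, columns 5-19. Each row is a shifted and
# scaled copy of one base profile, so all pairwise row correlations are high.
true_rows = np.arange(10, 30)
true_cols = np.arange(5, 20)
base = rng.normal(size=true_cols.size)
for i in true_rows:
    shift, scale = rng.normal(), rng.uniform(0.5, 2.0)
    data[i, true_cols] = shift + scale * base + rng.normal(scale=0.1, size=base.size)

# Mean pairwise Pearson correlation of the planted rows over the planted
# columns: the kind of coherence a correlation-based algorithm tries to find.
sub = data[np.ix_(true_rows, true_cols)]
corr = np.corrcoef(sub)
mean_corr = corr[np.triu_indices_from(corr, k=1)].mean()
print(f"mean pairwise row correlation in planted bicluster: {mean_corr:.3f}")

def jaccard(rows_a, cols_a, rows_b, cols_b):
    """Jaccard index between two biclusters, each treated as the set of
    matrix cells (row, column) it covers, as in Figure 5.1."""
    cells_a = {(r, c) for r in rows_a for c in cols_a}
    cells_b = {(r, c) for r in rows_b for c in cols_b}
    return len(cells_a & cells_b) / len(cells_a | cells_b)

# A hypothetical recovered bicluster that misses two rows and adds a column.
found_rows = np.arange(12, 30)
found_cols = np.arange(5, 21)
print(f"Jaccard index vs. planted bicluster: "
      f"{jaccard(true_rows, true_cols, found_rows, found_cols):.3f}")
```

Representing each bicluster by its row and column index sets, as above, keeps such a comparison independent of any particular algorithm's output format.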