University of Sheffield Department of Chemical and Biological Engineering

Predicting Functional Alterations Caused By Non-synonymous Variants in CHO Using Models Based on Phylogenetic Tree and Evolutionary Preservation

Author: Qixun Fang
Supervisor: David James
Second Supervisor: Robert Falconer

Submitted for the degree of Doctor of Philosophy
February, 2018

Acknowledgements

I would like to give my most sincere thanks to Dr Paul Dobson, who offered help throughout the entire project, even after he had left for another university. This project would not have been possible without his effort and expertise. I am also very grateful to Professor David James, who continued the supervision after Dr Dobson had left and provided much inspiration and fun during my research. I am also very grateful to my parents, who sponsored my PhD and supported me in every way they could. I would also like to thank all of my colleagues in both Paul's group and David's group, and my friends here in the UK. My life here is much better with all these nice people in it.

Abstract

Chinese Hamster Ovary (CHO) cells are a major manufacturing platform for one of the most valuable classes of biopharmaceutical products: monoclonal antibodies. As an immortal cell line adapted to different environments, CHO has accumulated a large number of mutations in its genome. Continuous effort has been invested in building computational models to predict CHO cell productivity. However, little attention has been paid to its proteins, which are surely affected to some extent by the accumulated mutations. In this project, we focused on the functional effects of non-synonymous variants found in the CHO genome. A tool was built to first identify these variants and then predict their potential functional effect using preservation, a concept derived from evolutionary conservation.

First, the PANTHER subfamilies, which are defined on the basis of potential functional change within gene trees, were extended by adding proteins from species not covered by PANTHER. Sequences within the same subfamily were then aligned, and Hidden Markov Models (HMMs) were built on these alignments. The HMMs were used to identify homologs among CHO proteins. Preservation was then calculated at every site of the alignments and used to predict the functional alterations caused by mutations at each site. Our tool was validated using data from the original PANTHER subfamilies; PANTHER-PSEP, which also calculates site preservation; and BLAST, a well-accepted homolog search algorithm. CHO protein sequences were then imported and analysed by our tool. For comparison, protein sequences from Chinese hamster were also analysed along with two published CHO cell lines: CHO-K1 and CHO-K1GS. The predictions for proteins from these three genomes were then compared by mapping onto Gene Ontology (GO). Some detailed case studies are also presented. Our tool performed well in validation; however, it failed to produce useful hypotheses that would motivate further bench experiments. The potential causes are discussed at the end.

Contents

Acknowledgements
Abstract
Contents
List of Abbreviations
List of Tables
List of Figures
1 Introduction
  1.1 Research background and motivations
  1.2 Objectives
  1.3 Thesis structure
2 Literature Review
  2.1 Chinese Hamster Ovary Cells
  2.2 High-throughput Sequencing Techniques
  2.3 Bioinformatics Techniques
    2.3.1 Sequence Alignment
    2.3.2 Hidden Markov Models
    2.3.3 Phylogenetic Tree Construction
    2.3.4 Protein Structure Prediction
    2.3.5 Non-synonymous Nucleotide Variant Impact
  2.4 Important Databases
    2.4.1 Gene databases
    2.4.2 Single Nucleotide Variation (SNV) databases
    2.4.3 Protein databases
    2.4.4 Function and pathway databases
  2.5 Related Data Resource and Researches on CHO
3 Methodology
  3.1 Homolog Selection
  3.2 Aligning Sequences
  3.3 Phylogenetic Tree Construction
  3.4 Member Selection for Extended Subfamilies
  3.5 Building HMMs
  3.6 Preservation Calculation
  3.7 CHO Data Analysis and Prediction
  3.8 Related Software Description
    3.8.1 MAFFT
    3.8.2 HMMer
    3.8.3 RAxML
    3.8.4 TreeFix
4 Model validation and performance
  4.1 HMM validation
  4.2 Preservation analysis
  4.3 Summary
5 Model Predictions on CHO Related Data
  5.1 Prediction Overview
  5.2 Verification with BLAST
  5.3 Expression and Prediction Correlation
  5.4 Comparison of Predictions of Different CHO Related Genomes
  5.5 Summary
6 Case Studies and Hypotheses
  6.1 Mutations on TP53
  6.2 Glycolytic Process

