Long Overlapping Genes in Pseudomonas Aeruginosa
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430400; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 Shadow ORFs illuminated: long overlapping genes in Pseudomonas 2 aeruginosa are translated and under purifying selection 3 4 Michaela Kreitmeier1, Zachary Ardern1*, Miriam Abele2, Christina Ludwig2, Siegfried Scherer1, 5 Klaus Neuhaus3* 6 7 1Chair for Microbial Ecology, TUM School of Life Sciences, Technische Universität München 8 Weihenstephaner Berg 3, 85354 Freising, Germany. 9 2Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), TUM School of Life 10 Sciences, Technische Universität München, Gregor-Mendel-Strasse 4, 85354 Freising, 11 Germany. 12 3Core Facility Microbiome, ZIEL – Institute for Food & Health, Technische Universität München 13 Weihenstephaner Berg 3, 85354 Freising, Germany. 14 15 *email: [email protected]; [email protected] 16 17 Abstract 18 The existence of overlapping genes (OLGs) with significant coding overlaps revolutionises our 19 understanding of genomic complexity. We report two exceptionally long (957 nt and 1536 nt), 20 evolutionarily novel, translated antisense open reading frames (ORFs) embedded within 21 annotated genes in the medically important Gram-negative bacterium Pseudomonas 22 aeruginosa. Both OLG pairs show sequence features consistent with being genes and 23 transcriptional signals in RNA sequencing data. Translation of both OLGs was confirmed by 24 ribosome profiling and mass spectrometry. Quantitative proteomics of samples taken during 25 different phases of growth revealed regulation of protein abundances, implying biological 26 functionality. Both OLGs are taxonomically highly restricted, and likely arose by overprinting 27 within the genus. Evidence for purifying selection further supports functionality. The OLGs 28 reported here are the longest yet proposed in prokaryotes and are among the best attested in 29 terms of translation and evolutionary constraint. These results highlight a potentially large 30 unexplored dimension of prokaryotic genomes. 31 32 Keywords: Overlapping genes, mass spectrometry, ribosome profiling, orphan genes, 33 overprinting, de novo gene origin 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430400; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 34 35 Introduction 36 The tri-nucleotide character of the genetic code enables six reading frames in a double- 37 stranded nucleotide sequence. Protein-coding ORFs at the same locus but in different reading 38 frames are referred to as overlapping genes (OLGs). Studies of coding overlaps of more than 39 90 nucleotides, i.e. non-trivial overlaps, have mainly been restricted to viruses, where the first 40 OLG was found in 19761. An OLG pair with such a non-trivial overlap can be described in terms 41 of an older, typically longer “mother gene” and more recently evolved “daughter” gene by 42 analogy to mother and daughter cells in reproduction. 43 In prokaryotes, automated genome-annotation algorithms like Glimmer allow only one open 44 reading frame (ORF) per locus with the exception of only short overlaps2. This systematically 45 excludes overlapping ORFs from being annotated as genes3. The ‘inferior’ ORFs (e.g. shorter 46 or fewer hits in databases) within overlapping gene pairs have been called ‘shadow ORFs’ 47 since they are found in the shadow of the annotated coding ORF4. Determining which ORF to 48 annotate within such pairs has been described as the most difficult problem in prokaryotic gene 49 annotation5. Nevertheless, a few prokaryotic OLGs have been discovered, often 50 serendipitously in the pursuit of other unannotated genes6-8. For instance in some Escherichia 51 coli strains a few non-trivial overlaps have been detected and experimentally analysed9-16. 52 Transcriptomic or translatomic evidence for OLGs also exist in genera such as 53 Mycobacterium17 or Pseudomonas18. Recently, antisense OLGs have also been reported in 54 archaea19 and in mammals20,21. These findings support the hypothesis that OLGs are 55 ubiquitous. 56 Verifying OLGs using mass spectrometry (MS) is difficult because most OLGs appear to be 57 short and weakly expressed, and proteomics has limited abilities in detecting such proteins22. 58 RNA sequencing led to the discovery of many antisense transcripts, but whether many of these 59 are translated is controversial23. More recently, mRNA protected by ribosomes after enzymatic 60 degradation has been sequenced using “ribosome profiling”24 (RiboSeq), showing evidence of 61 translation for antisense transcripts16,23. However, artefacts may occur due to structured 62 RNAs25. Moreover, proteins produced from such transcripts have been claimed to be 63 predominantly non-functional in Mycobacterium tuberculosis17. Detecting purifying selection 64 would provide evidence for functionality in OLGs; however, this is made difficult by selection 65 pressure on the other reading frame. Methods have recently been developed26-30, but so far 66 applied mainly in viruses. One exception is a cursory study of some OLG candidates in 67 E. coli31. Additional reason to expect to find OLGs under selection in bacteria comes from a 68 study showing excess long antisense ORFs in a pathogenic E. coli strain32. Another is the 69 frequent exchange of functional genes between phages and bacteria33. 2 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430400; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 70 The evolution and origin of OLGs and their constraints have long been discussed34-37 and were 71 briefly mentioned in two early books38,39. The origin of a new gene within an existing gene is 72 termed “overprinting”40. Advantageous for same-strand overlaps is that the hydrophobicity 73 profile of a frame-shifted sequence tends to be similar to that of the unshifted sequence41, as 74 may have contributed to the origin of an OLG in SARS-CoV-242. Intriguingly, overlaps of the 75 reference frame “+1” and antisense “-1” tend to have opposite hydrophobicities for their amino 76 acids. Further, similar amino acids are conserved in the antisense frame following synonymous 77 mutation in the reference frame43, facilitating the maintenance of overlapping genes. The 78 developing research area of overlapping gene origins complements recent findings of many 79 taxonomically restricted (“orphan”) genes44 and unannotated short genes in prokaryotes45. 80 The genus Pseudomonas (Gram-negative, Gammaproteobacteria) is of particular interest 81 regarding long OLGs. Its high GC content of 60-70% results in longer average lengths for 82 antisense ORFs compared to Escherichia (~50% GC)32. Previous studies reported OLGs in 83 the well-characterized species P. fluorescens and P. putida, although based on limited 84 evidence46-48. Since then, methods allowing improved discovery and verification of such genes 85 have been developed. Here, we examine Pseudomonas aeruginosa, which is a versatile 86 pathogenic species49. As an opportunistic human pathogen, it predominately causes disease 87 in immunocompromised individuals50. Hospital-acquired infections with P. aeruginosa are 88 often associated with high morbidity and mortality and may include infections of the respiratory 89 tract, the urinary tract, the bloodstream, the skin, and wounds51. The high level of intrinsic and 90 acquired resistance as well as the rapid rise of multi-drug resistant strains limit effective 91 antibiotic treatment options52,53. Thus, it is important to understand the entire genomic 92 complexity of this organism. In P. aeruginosa, we detected two exceptionally long novel ORFs 93 showing large overlaps of 957 nt and 1542 nt in antisense to annotated genes, by employing 94 RiboSeq as well as mass spectrometry. We present evidence that both OLGs indeed encode 95 proteins which evolved only recently by overprinting. Further, we detected purifying selection 96 operating through a depletion of stop codons and non-synonymous changes. 97 98 Results 99 Genomic locations of two long overlapping genes. The novel olg1 of P. aeruginosa PAO1 100 (NC_002516.2) is located at the coordinates 291556-292512(+) in frame -1 (i.e. directly 101 antisense) with respect to its mother gene (Fig. 1a). It has a minimum length of 957 nucleotides 102 (nt) and the most probable start codon is ATG291556 (see Supplementary Information). olg1 103 completely overlaps with the annotated mother gene tle3 (PA0260), encoding the toxic type VI 104 lipase effector 3 of the vgrG2b-tli3-tle3-tla3 operon. Tle3 contains two structural domains, an 3 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.09.430400; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND