And Pan-Genome Content on Rhodobacteraceae Chromosomes
Total Page:16
File Type:pdf, Size:1020Kb
GBE Clustered Core- and Pan-Genome Content on Rhodobacteraceae Chromosomes Karel Kopejtka1,2,YanLin3,4,Marketa Jakubovicova5, Michal Koblızek1,2,andJu¨rgenTomasch6,* 1Laboratory of Anoxygenic Phototrophs, Center Algatech, Institute of Microbiology CAS, Trebon, Czech Republic 2Faculty of Science, University of South Bohemia, Cesk eBud ejovice, Czech Republic 3Department of Physics, School of Science, Tianjin University, China 4SynBio Research Platform, Collaborative Innovation Center of Chemical Science and Engineering, Tianjin, China 5Faculty of Information Technology, Czech Technical University in Prague, Czech Republic 6Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany *Corresponding author: E-mail: [email protected]. Accepted: June 29, 2019 Abstract In Bacteria, chromosome replication starts at a single origin of replication and proceeds on both replichores. Due to its asymmetric nature, replication influences chromosome structure and gene organization, mutation rate, and expression. To date, little is known about the distribution of highly conserved genes over the bacterial chromosome. Here, we used a set of 101 fully sequenced Rhodobacteraceae representatives to analyze the relationship between conservation of genes within this family and their distance from the origin of replication. Twenty-two of the analyzed species had core genes clustered significantly closer to the origin of replication with representatives of the genus Celeribacter being the most apparent example. Interestingly, there were also eight species with the opposite organization. In particular, Rhodobaca barguzinensis and Loktanella vestfoldensis showed a significant increase of core genes with distance from the origin of replication. The uneven distribution of low-conserved regions is in particular pronounced for genomes in which the halves of one replichore differ in their conserved gene content. Phage integration and horizontal gene transfer partially explain the scattered nature of Rhodobacteraceae genomes. Our findings lay the foundation for a better understanding of bacterial genome evolution and the role of replication therein. Key words: genome architecture, genome evolution, origin of replication, Rhodobacteraceae. Introduction gene-dosage effect is especially pronounced in fast-growing Replication is assumed to be a key factor in the evolution of bacteria, where strongly expressed genes are preferentially genome structure and organization (Rocha 2004; Rocha concentrated near the oriC (Rocha 2004; Couturier and 2008; Cagliero et al. 2013; Jun et al. 2018). In contrast to Rocha 2006). Bacterial chromosome architecture can also eukaryotes and archaea, where the chromosome replication be shaped by large-scale interreplichore translocations, as proceeds simultaneously from multiple sites, replication of was recently shown using genome sequence comparisons bacterial chromosomes starts from a single origin of replica- between 262 closely related pairs of bacterial species tion (oriC) and continues equally along both replichores (two (Khedkar and Seshasayee 2016). halves of the chromosome extending from oriC)uptothe Relative distance of genes from the oriC is commonly terminus of replication (terC). Since cell division is often thought to be one of the most conserved properties of shorter than the time required for the replication of the chro- genome organization (Eisen et al. 2000; Sobetzko et al. mosome itself, it leads to the occurrence of multiple replica- 2012). Results of an extensive analysis comprising a set tion complexes in the cell. In result, the genes located in the of 131 gammaproteobacterial genomes showed strong early replicating regions near the oriC can be present in mul- conservation in the relative distance of conserved genes tiple copies and hence have a higher expression level com- coding for regulatory elements from the oriC (Sobetzko pared with genes in the late replicating regions. This so-called et al. 2012). ß The Author(s) 2019. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 2208 Genome Biol. Evol. 11(8):2208–2217. doi:10.1093/gbe/evz138 Advance Access publication July 3, 2019 Clustered Core- and Pan-Genome Content on Rhodobacteraceae Chromosomes GBE A replication-biased genome organization was also (Shakya et al. 2017) suggest that HGT plays an important revealed in archaeal genomes. Flynn et al. (2010) selected role in shaping the genomes of this family. six Sulfolobus genomes each with multiple replication origins, to test the hypothesis that genes situated close to oriC tend to Materials and Methods be more conserved than genes more distant from oriC. Results of this study clearly demonstrated a bias in the location Data of conserved orthologous genes (orthologs) toward the oriC. Nucleotide genomic sequences and corresponding Genbank Moreover, the analysis of evolutionary rates of these orthologs FASTA files for 109 fully sequenced Rhodobacteraceae strains revealed their slower evolution when compared with genes (Fig. 1) were obtained from NCBI GenBank (May 2018). 16S more distant from oriC (Flynn et al. 2010). Another study of rRNA gene sequences for the same set of strains were Sulfolobus genome architecture showed a mosaic of recom- obtained either from the SILVA database (Quast et al. 2012) binant single nucleotide polymorphisms along the chromo- or NCBI GenBank (May, 2018). somes of ten closely related Sulfolobus islandicus strains. This comparative genome analysis revealed large genomic Software regions surrounding all oriC sites that show reduced recom- bination rates (Krause et al. 2014). The programs and packages used in our analysis are summa- Based on an analysis of the codon composition of genomes rized in the supplementary table S1, Supplementary Material from 59 prokaryotic organisms, it was shown that genes orga- online. Commands lines for Prokka and ProteinOrtho can be nized close to the ter are in many cases A þ T-enriched at the found in supplementary table S2, Supplementary Material on- third codon position thought to reflect a higher evolutionary line. Custom scripts as well as the obtained pan-genome data rate (Daubin and Perriere 2003). In addition, sets are available from the authors upon request. horizontally transferred DNA was suggested to cluster near the terC (Rocha 2004); this was documented in the genome Comparative Genomic Analysis of Escherichia coli (Lawrence and Ochman 1998), as well as in In order to standardize further analysis, we reannotated all other prokaryotic genomes (Touchon and Rocha 2016). Early genomes using Prokka (Seemann 2014). Orthologous gene studies presenting complete genome sequences of Bacillus sub- cluster analysis was performed using the Proteinortho data- tilis (Kunstetal.1997)andE. coli (Blattner et al. 1997; Lawrence frame (Lechner et al. 2011). As in previous comparative studies andOchman1998) reported a frequent occurrence of pro- (Kalho¨ fer et al. 2011; Thole et al. 2012; Vollmers et al. 2013), phages around the terC. Furthermore, a recent study conducted orthologous protein sequences were identified with three cut- on a large and diverse sample of bacterial species revealed a off criteria: 1) e-value, 2) alignment coverage, and 3) sequence positive bias in the occurrence of hot-spots for HGT containing identity. Three pan-genome data sets (pan60, pan30, and prophages toward the terC (Oliveira et al. 2017). In addition, it pan15) were produced that differ in stringency of ortholog has been speculated that the gene-dosage effect might lead to identification. Cut-off criteria as well as the number of identified fixing of typically weakly expressed (or not at all) horizontally protein families and core genome are summarized in table 1. transferred genes closer to the terC (Rocha 2004). To date, there is no comprehensive study on the distribu- tion of conserved genes in the bacterial chromosome solely Phylogenomic Analysis focusing on one bacterial family. We therefore decided to Since for highly conserved proteins with >50% sequence analyze the relationship between the degree of gene conser- identity the probability of completely incorrect annotation is vation and its distance from the origin of replication for very low (<6%) (Sangar et al. 2007), we used the core ge- the core- and pan-genome of all Rhodobacteraceae nome identified with the most stringent parameters (pan60 (Alphaproteobacteria) with closed genomes. All the gene data set: e-value <10À10, 80% sequence coverage and 60% families present in a certain (microbial) clade are the pan- sequence identity) for constructing the phylogenomic tree. genome. The gene families with representatives present in Furthermore, we excluded all protein families that contained genomes of all strains are defined as core genome, whereas paralogs in one or more of the genomes. As a last step, we the term accessory genome describes partially shared gene excluded all protein families with a Proteinortho connectivity families and strain specific genes (Medini et al. 2005; Tettelin score <0.9. Amino acid sequences for these 85 highly con- et al. 2005). Rhodobacteraceae were selected as a family with served core genome proteins were individually