Evolutionary Evidence of Chromosomal Rearrangements Through SNAP Marina Mota Merlo

Evolutionary evidence of chromosomal rearrangements through SNAP (Selection during Niche AdaPtation) Marina Mota Merlo Degree project in bioinformatics, 2021 Examensarbete i bioinformatik 45 hp till masterexamen, 2021 Biology Education Centre and Dept. of Medical Biochemistry and Microbiology, Uppsala University Supervisors: Lionel Guy and Andrei Guliaev Abstract The Selection during Niche AdaPtation (SNAP) hypothesis aims to explain how the gene order in bacterial chromosomes can change as the result of bacteria adapting to a new environment. It starts with a duplication of a chromosomal segment that includes some genes providing a fitness advantage. The duplication of these genes is preserved by positive selection. However, the rest of the duplicated segment accumulates mutations, including deletions. This results in a rearranged gene order. In this work, we develop a method to identify SNAP in bacterial chromosomes. The method was tested in Salmonella and Bartonella genomes. First, each gene was assigned an orthologous group (OG). For each genus, single-copy panorthologs (SCPos), the OGs that were present in most of the genomes as one copy, were targeted. If these SCPos were present twice or more in a genome, they were used to build duplicated regions within said genome. The resulting regions were visualized and their possible compatibility with the SNAP hypothesis was discussed. Even though the method proved to be effective on Bartonella genomes, it was less efficient on Salmonella. In addition, no strong evidence of SNAP was detected in Salmonella genomes. The adaptability of bacteria Popular Science Summary Marina Mota Merlo Every living being, from humans to bacteria, depends on its ability to adapt. Imagine you agree to participate in a TV program with two rounds. In the first round, you have to cook a dish using a very limited set of ingredients. In the second round, you are given a wide variety of ingredients, and you have to try to prepare a dish using all of them. The idea in both cases is making the most of what you have. Who wouldn’t want to be an expert using the limited ingredients from the first round? Who wouldn’t want to be fast enough to incorporate all the ingredients from the second? Imagine that you could become twice as good at any of the rounds. You would definitely have an advantage over the other contestants and maybe you would be even chosen as a winner. Bacteria can do this much more efficiently than humans. Parts of their genomes can be duplicated just by random chance, which means that they would have twice the original dosage of some of their genes. Let’s imagine the function of these genes is related to metabolism, i.e., the way cells —and therefore bacteria— process nutrients, similar to how ingredients are cooked when following a recipe. If there is a limited set of nutrients in their environment, and the genes involved in the processing (“cooking”) of these nutrients are present twice, then the bacterium with the duplication will be more efficient than others, and therefore it will “win” over other bacteria. The bacterium will divide and divide and eventually there will be more clones of itself carrying the duplication. For a nutrient-rich environment, the same would apply, but effectiveness would be more important than efficiency. Following the contest analogy, the participant who could cook faster, and therefore include all the ingredients, would win in this case. During the contest, the participant also learnt irrelevant information, such as the birthdays of the members of the jury. In this case, the contestant would most likely forget these facts. The same happens for bacteria, since some genes in the duplication do not really confer an advantage. In this case, the genes would eventually be lost, just like the memories of the birthday dates. In fact, this process follows a similar logic as remembering: only the relevant parts are retained. And, much like entangled memories, because some genes are kept and other are lost, this results in a new, rearranged order of genes. The name of the full process is SNAP (Selection during Niche AdaPtation), since it describes how bacteria adapt to new environments that are either harsh or nutrient- rich, each one represented by a round of the contest. The purpose of this work was to develop a method able to find evidence of SNAP in bacteria. Degree project in bioinformatics, 2021 Examensarbete i bioinformatik 45 hp till masterexamen, 2021 Biology Education Centre and EvoLegioLab, Dept. of Medical Biochemistry and Microbiology Supervisors: Lionel Guy and Andrei Guliaev Table of contents 1. Introduction .................................................................................................................... 11 2. Materials and methods .................................................................................................... 14 2.1. Defining the pipeline ........................................................................................................ 14 2.2. Downloading data ............................................................................................................ 14 2.3. Finding OGs ...................................................................................................................... 15 2.4. Identifying duplicated SCPos ............................................................................................ 16 2.5. Building duplicated regions .............................................................................................. 17 2.6. Analysis of the duplicates ................................................................................................. 18 3. Results ............................................................................................................................. 19 3.1. OGs and protein locations ................................................................................................ 19 3.2. Duplicated SCPos .............................................................................................................. 20 3.3. Duplicated regions ........................................................................................................... 21 3.4. Analysis of the duplicates ................................................................................................. 23 3.4.1. Bartonella .................................................................................................................. 23 3.4.2. Salmonella ................................................................................................................. 24 4. Discussion ........................................................................................................................ 28 4.1. Distribution of genomes per species ................................................................................ 28 4.2. OGs and protein locations ................................................................................................ 29 4.3. Duplicated SCPos .............................................................................................................. 30 4.4. Duplicated regions in Bartonella ...................................................................................... 31 4.5. Duplicated regions in Salmonella ..................................................................................... 32 4.5.1. Duplicated genes ....................................................................................................... 32 4.5.2. Evidence of deletions ................................................................................................ 33 4.5.3. Veracity of the duplications ...................................................................................... 34 4.6. Performance of the method ............................................................................................. 35 5. Conclusion ....................................................................................................................... 36 6. Conflict of interest ........................................................................................................... 36 7. Acknowledgements ......................................................................................................... 37 References .......................................................................................................................... 37 Appendix ............................................................................................................................. 43 Abbreviations ABC ATP-Binding Cassette Blast Basic local alignment search tool CDS CoDing Sequence DNA DeoxyriboNucleic Acid DUF Domain of Unknown Function GFF General Feature Format HGT Horizontal Gene Transfer FTP File Transfer Protocol IAD Innovation, Amplification and Divergence MFS Major Facilitator Superfamily NCBI National Center for Biotechnology Information OG Orthologous Group SCPo Single Copy Panortholog SNAP Selection during Niche AdaPtation 1. Introduction Many of the processes that can be affected by genome organization in bacteria are integral to the proper functioning of the cell, such as replication and transcription (Touchon & Rocha, 2016). In particular, replication is the key factor structuring the chromosome. For example, its interplay with gene expression results in an overabundance of genes in the leading strand (Rocha, 2008). Another example of this is how gene order affects levels of expression, with genes placed closer to the

Load more