Revisiting the Landscape of Evolutionary Breakpoints Across Human Genome Using Multi-Way Comparison

bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Revisiting the landscape of evolutionary breakpoints across human genome using multi-way comparison Golrokh Vitae1, Amine M. Remita1, Abdoulaye BaniréDiallo1, 1 Universitédu Québec A` Montréal * [email protected] Abstract Genome rearrangement is one of the major forces driving the processes of the evolution and disease development. The chromosomal position affected by these rearrangements are called breakpoints. The breakpoints occurring during the evolution of species are known to be non randomly distributed. Detecting their landscape and mapping them to genomic features constitute an important features in both comparative and functional genomics. Several studies have attempted to provide such mapping based on pairwise comparison of genes as conservation anchors. With the availability of more accurate multi-way alignments, we design an approach to identify synteny blocks and evolutionary breakpoints based on UCSC 45-way conservation sequence alignments with 12 selected species. The multi-way designed approach with the mild flexibility of presence of selected species, helped to have a better determination of human lineage-specific evolutionary breakpoints. We identified 261,391 human lineage-specific evolutionary breakpoints across the genome and 2,564 dense regions enriched with biological processes involved in adaptive traits such as response to DNA damage stimulus, cellular response to stress and metabolic process. Moreover, we found 230 regions refractory to evolutionary breakpoints that carry genes associated with crucial developmental process such as organ morphogenesis, skeletal system development, chordate embryonic development, nerve development and regulation of biological process. This initial map of the human genome will help to gain better insight into several studies including developmental studies and cancer rearrangement processes. Introduction 1 Genome rearrangement is one of the major forces driving the process of evolution, 2 population diversity and development of diseases including cancers. It happens when 3 DNA breaks in specific positions (breakpoints) and reassembles in a way different from 4 the initial genome conformation and changes the genome landscape [6]. It is now 5 well-accepted that genome rearrangements do not happen randomly along the genome 6 and not all genomic regions are susceptible to such dramatic modifications 7 [7, 25, 35, 36]. Millions of years of evolution and natural selection driven by these 8 structural modifications curved the genome in a way that regions carrying high 9 functional pressure maintained their integrity and could be identified as orthologs with 10 same order and on the same chromosome in a range of relatively close species. Hence, 11 resistance of genomic regions to any major modification implies that these regions 12 harbor functional importance to survival as well as reproduction of species and any 13 modification could have a deleterious effect in the process of natural selection [30]. 14 1/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Studies on the genome synteny have shown that conserved regions are not only 15 significantly enriched in putative regulatory regions [23, 31] but also are associated with 16 transcriptional regulations and developmental processes [31, 41, 46]. On the other hand, 17 are the regions that are more receptive to rearrangements, participating in speciation, 18 adaptation and development of species-specific traits and behavior that are not 19 detrimental to the survival nor reproduction of the species. These break-prone regions 20 or "breakpoint hotspots" are also known to carry distinctive functional and structural 21 markers [10]. The identification of such regions are difficult and controversial due to 22 several assumptions. "Comparative genomics relies on the structuring of genomes into 23 syntenic blocks" [19]. However, comparative studies still lack a clear definition of 24 synteny. Since the term "synteny" was introduced, even the criteria that define the 25 synteny conservation vary from one study to another. Also, in most of the methods, to 26 identify synteny, synteny fragments were defined as a set of genes with a conserved 27 order [9, 37, 38, 40]. These issues are well discussed by Ghiurcuta, 2014 [19]. The lack 28 of clear definition, choice of granularity of study (minimum size), and different type of 29 comparison could led to have the high probability of different results generated by same 30 data [19]. One conclusion provided to tackle the drawback of these methods are to use 31 multi sets of homologous markers instead of only genes as anchors as well as extending 32 the pairwise comparison to multi-ways [19]. Moreover, the identification of synteny 33 breaks is affected by the selection of species. This influences the size and position of 34 synteny blocks and their corresponding breakpoints as breakpoint is not a tangible 35 physical entity in a genome; it is a result of an analytical construct based on the 36 comparison of selected genomes [42]. Other than exogenous factors, some biological 37 mechanisms such as non-allelic homologous recombination (NAHR) and non-homologous 38 endjoining (NHEJ) [20] and the presence of endogenous factors such as CpG islands [10], 39 GC content[13], some repeat elements [21, 24] and G-quadruplex [10, 39] could affect 40 the susceptibility of genomic regions to rearrangements. Due to some of these factors, 41 DNA could undergo a conformation change (e.g. from B to Z) that could destabilize the 42 molecule and make it prone to structural damage [2, 44, 47]. Also, phenomena such as 43 segmental duplications (SDs) [1, 3, 32], known to be associated evolutionary 44 breakpoints causing a double strand breaks on DNA structure. Hence, it is important to 45 revisit Evolutionary Breakpoints (EBrs) with more accurate and well-drafted 46 mammalian sequenced genomes and updated genomics features. In this paper we 47 propose a new map of accurate and more precise evolutionary breakpoints region based 48 on human lineage synteny breaks, multi-way comparison and ancestral breakpoint 49 reconstruction. Furthermore, we explore the hotspots of EBrs (EBHRs) based on their 50 coassociation to multiple structural and functional human genome features. 51 Results 52 0.1 Identification of human lineage-specific synteny breaks 53 Synteny block is a cluster of (relatively close) markers conserved among related genomes 54 that maintained their orientation and order. The regions bordered by synteny blocks 55 are the position of the genome that have been subjected to rearrangement in one or 56 more compared species since a common ancestor. Multi-way comparative analysis of 57 human genome with 10 other mammals and chicken (see supplementary table S2), 58 revealed 261,420 synteny blocks covering 52% of the human genome with a size 59 distribution from 1 Kbp to 200 Kbp. The corresponding 261,391 genomic regions were 60 identified as human lineage-specific synteny breaks or evolutionary breakpoints (EBrs). 61 30 break regions were eliminated due to their size higher than 2 Mbp. These regions are 62 whether on centromeric or telomeric regions or on genomic regions with missing 63 2/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Figure 1. Coverage of chromosomes by EBrs Figure 2. The percentages of human lineage specific breakpoints origin along the vertebrates reference tree information. See table 1. EBrs are associated with 39.58% of the human genome. 64 However, the coverage of breakpoints raise to over 40% in chr11, chr12, chr8, chr4 and 65 chr7 and to over 55% in chr19 and chrX. See figure 1. The size of these regions were 66 from 1 bp to 1.857 Mbp. The seven largest EBrs (size > 1 Mbp) were located chr14, 67 chrY, chr11, chrX, chr2 and chr7. Based on a Least Common Ancestor approach, we 68 identify the lineage-specific breakpoints, over 50% of these breakpoints were reused and 69 originated from separation of placental mammals. The next most representative 70 ancestral node was the Boreotherian ancestor with 20% of EBrs that originated from. 71 Others were originated from various ancestry nodes along the reference phylogeny tree 72 with a representation between < 1% to about 10%. The distribution of EBr ancestry 73 origins are presented in figure 2. 74 Method Input Output Size of output Conservation anchor extraction 45-way multiple alignments Conservation anchors 21,809,215 Fusion Conservation anchors Synteny clusters 346,515 filter according to length Synteny clusters Synteny blocks 261,420 Synteny break extraction Synteny blocks Synteny breaks 261,421 Size filter Synteny breaks EBrs 261,391 Permutation EBrs EBHRs 2,564 EBr desert extraction EBrs + human genome EBRRs 230 Table 1. Summary of EBr-related results 3/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 0.2 Localisation of breakpoint hotspots 75 Using an overlapping sliding window approach of 100 Kbp and a permutation test with 76 100,000 iterations, we identified 2,564 regions as EBr hotspot regions (EBHRs). 77 Although evolutionary breakpoints are distributed across 90% of genomic windows, only 78 8.28% of genomic windows (2,564) were enriched in EBrs (p-value < 0.05). The results 79 show that EBHRs contain 12 to 21 EBrs with an average of 14:85 ± 1:56 (except 80 chromosome Y).

Load more