bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Revisiting the landscape of evolutionary breakpoints across human genome using multi-way comparison
Golrokh Vitae1, Amine M. Remita1, Abdoulaye Baniré Diallo1,
1 Université du Québec à Montréal
Abstract
Genome rearrangement is one of the major forces driving evolution and disease development. The chromosomal positions affected by these rearrangements are called breakpoints. The breakpoints occurring during the evolution of species are known to be non-randomly distributed. Detecting their landscape and mapping them to genomic features is an important task in both comparative and functional genomics. Several studies have attempted to provide such a mapping based on pairwise comparison of genes as conservation anchors. With the availability of more accurate multi-way alignments, we designed an approach to identify synteny blocks and evolutionary breakpoints based on UCSC 45-way conservation sequence alignments with 12 selected species. The multi-way approach, which tolerates the absence of some selected species from an alignment block, allowed a better determination of human lineage-specific evolutionary breakpoints. We identified 261,391 human lineage-specific evolutionary breakpoints across the genome and 2,564 dense regions enriched in biological processes involved in adaptive traits, such as response to DNA damage stimulus, cellular response to stress and metabolic process. Moreover, we found 230 regions refractory to evolutionary breakpoints that carry genes associated with crucial developmental processes such as organ morphogenesis, skeletal system development, chordate embryonic development, nerve development and regulation of biological process. This initial map of the human genome will help to gain better insight in several areas, including developmental studies and cancer rearrangement processes.
Introduction
Genome rearrangement is one of the major forces driving evolution, population diversity and the development of diseases, including cancers. It happens when DNA breaks at specific positions (breakpoints) and reassembles in a way that differs from the initial genome conformation, changing the genome landscape [6]. It is now well accepted that genome rearrangements do not happen randomly along the genome and that not all genomic regions are susceptible to such dramatic modifications [7, 25, 35, 36]. Millions of years of evolution and natural selection driven by these structural modifications shaped the genome in such a way that regions under high functional pressure maintained their integrity and can be identified as orthologs with the same order and on the same chromosome in a range of relatively close species. Hence, the resistance of genomic regions to any major modification implies that these regions are functionally important to the survival and reproduction of species, and that any modification could have a deleterious effect in the process of natural selection [30].
Studies on genome synteny have shown that conserved regions are not only significantly enriched in putative regulatory regions [23, 31] but are also associated with transcriptional regulation and developmental processes [31, 41, 46]. On the other hand, there are regions that are more receptive to rearrangements, participating in speciation, adaptation and the development of species-specific traits and behavior that are not detrimental to the survival or reproduction of the species. These break-prone regions, or "breakpoint hotspots", are also known to carry distinctive functional and structural markers [10]. The identification of such regions is difficult and controversial because it rests on several assumptions. "Comparative genomics relies on the structuring of genomes into syntenic blocks" [19]. However, comparative studies still lack a clear definition of synteny. Since the term "synteny" was introduced, even the criteria that define synteny conservation have varied from one study to another. Also, in most methods, synteny fragments were defined as a set of genes with a conserved order [9, 37, 38, 40]. These issues are well discussed by Ghiurcuta, 2014 [19]. The lack of a clear definition, the choice of granularity (minimum size) and the different types of comparison make it highly probable that different results are generated from the same data [19]. One remedy proposed for the drawbacks of these methods is to use multiple sets of homologous markers, instead of only genes, as anchors and to extend the pairwise comparison to multi-way comparison [19]. Moreover, the identification of synteny breaks is affected by the selection of species. This influences the size and position of synteny blocks and their corresponding breakpoints, because a breakpoint is not a tangible physical entity in a genome; it is an analytical construct based on the comparison of selected genomes [42]. Apart from such exogenous factors, some biological mechanisms such as non-allelic homologous recombination (NAHR) and non-homologous end joining (NHEJ) [20], and the presence of endogenous factors such as CpG islands [10], GC content [13], some repeat elements [21, 24] and G-quadruplexes [10, 39], could affect the susceptibility of genomic regions to rearrangements. Due to some of these factors, DNA could undergo a conformational change (e.g. from B to Z) that could destabilize the molecule and make it prone to structural damage [2, 44, 47]. Also, phenomena such as segmental duplications (SDs) [1, 3, 32] are known to be associated with evolutionary breakpoints, causing double-strand breaks in the DNA structure. Hence, it is important to revisit evolutionary breakpoints (EBrs) with more accurate and well-drafted sequenced mammalian genomes and updated genomic features. In this paper we propose a new, more precise map of evolutionary breakpoint regions based on human lineage synteny breaks, multi-way comparison and ancestral breakpoint reconstruction. Furthermore, we explore the hotspots of EBrs (EBHRs) based on their co-association with multiple structural and functional features of the human genome.
Results
0.1 Identification of human lineage-specific synteny breaks
A synteny block is a cluster of (relatively close) markers conserved among related genomes that have maintained their orientation and order. The regions bordered by synteny blocks are the positions of the genome that have been subjected to rearrangement in one or more of the compared species since a common ancestor. Multi-way comparative analysis of the human genome with 10 other mammals and chicken (see supplementary table S2) revealed 261,420 synteny blocks covering 52% of the human genome, with a size distribution from 1 Kbp to 200 Kbp. The corresponding 261,391 genomic regions were identified as human lineage-specific synteny breaks or evolutionary breakpoints (EBrs). Thirty break regions were eliminated because their size exceeded 2 Mbp. These regions are either in centromeric or telomeric regions or in genomic regions with missing
Figure 1. Coverage of chromosomes by EBrs
Figure 2. The percentages of human lineage specific breakpoints origin along the vertebrates reference tree
information (see table 1). EBrs are associated with 39.58% of the human genome. However, the coverage of breakpoints rises to over 40% in chr11, chr12, chr8, chr4 and chr7 and to over 55% in chr19 and chrX (see figure 1). The size of these regions ranges from 1 bp to 1.857 Mbp. The seven largest EBrs (size > 1 Mbp) are located on chr14, chrY, chr11, chrX, chr2 and chr7. Using a Lowest Common Ancestor (LCA) approach, we identified the origin of the lineage-specific breakpoints: over 50% of these breakpoints were reused and originated from the separation of placental mammals. The next most represented ancestral node was the Boreotherian ancestor, from which 20% of EBrs originated. The others originated from various ancestry nodes along the reference phylogenetic tree, each representing between < 1% and about 10% of EBrs. The distribution of EBr ancestry origins is presented in figure 2.
Method                          Input                        Output                 Size of output
Conservation anchor extraction  45-way multiple alignments   Conservation anchors   21,809,215
Fusion                          Conservation anchors         Synteny clusters       346,515
Filter according to length      Synteny clusters             Synteny blocks         261,420
Synteny break extraction        Synteny blocks               Synteny breaks         261,421
Size filter                     Synteny breaks               EBrs                   261,391
Permutation                     EBrs                         EBHRs                  2,564
EBr desert extraction           EBrs + human genome          EBRRs                  230
Table 1. Summary of EBr-related results
0.2 Localisation of breakpoint hotspots
Using a non-overlapping sliding window approach with 100 Kbp windows and a permutation test with 100,000 iterations, we identified 2,564 regions as EBr hotspot regions (EBHRs). Although evolutionary breakpoints are distributed across 90% of genomic windows, only 8.28% of genomic windows (2,564) were enriched in EBrs (p-value < 0.05). The results show that EBHRs contain 12 to 21 EBrs, with an average of 14.85 ± 1.56 (excluding chromosome Y). EBHRs located on chromosome Y harbor 3 to 14 breakpoints per window. The distribution of EBHRs shows that chromosomes are heterogeneously affected by these regions: EBHRs cover from 3% to 23% of each chromosome. Based on EBHR coverage, chromosomes were divided into three categories: chromosomes with EBHR coverage under 8%, chromosomes with coverage between 8% and 15%, and chromosomes with coverage over 15% (see supplementary table S3). EBHR coverage is highest in chromosome 22 (> 23%), whereas chromosome 1 has the highest number of EBHRs (204 regions) but only a moderate coverage because of its size.
We also classified the EBHRs into groups based on their predicted evolutionary origin, according to the origin of the EBrs present in each region. The origins covered are the five major human ancestry nodes along the reference species tree, including Placental, Boreotheria, Supraprimates and Primates. The results show that about 55% of EBHRs are mostly composed of EBrs that originated around the separation of the human placental ancestor, more than 30% are mostly composed of EBrs that originated from different ancestry nodes down to Boreotheria, and 10% of EBHRs are mostly composed of EBrs that originated at and after the Boreotherian ancestry node.
0.3 Biological processes enrichment of genes associated with EBHRs
The GO enrichment analysis of biological processes showed that EBHRs are enriched in genes associated with distinct biological processes compared to the genome. These include processes of response to stress (response to DNA damage stimulus, cellular response to stress and DNA repair); metabolic processes of organic and inorganic compounds and the transformation of chemical substances (metabolic process, nucleic acid metabolic process, macromolecule metabolic process, nitrogen compound metabolic process and primary metabolic process); catabolic process (the chemical reactions and pathways resulting in the breakdown of a histone protein); and cellular process (for example cell communication, which occurs among more than one cell but at the cellular level) [5]. The list of enriched biological processes and the corresponding p-values is provided in supplementary table S1.
0.4 EBHRs associations with genomic features
We mapped EBHRs against 25 selected genomic markers known to have a positive or negative effect on the fragility of the genome. The full description of these markers is included in supplementary table S6. An overview of these regions showed that, compared to the genome, EBHRs have higher coverage of CpG islands (1.3 fold), direct repeats (1.3 fold), SINEs (1.4 fold), G-quadruplexes (1.4 fold), self chains (1.6 fold), simple repeats (1.9 fold) and SDs (2 fold). The coverage of satellite repeats in EBHRs, in contrast, falls to half of that in the genome. The most illustrative coverage comparisons are shown in figure 3. The complete report on feature coverage is provided in supplementary table S6. It should be noted that although 24.9% of EBHRs appear to be covered by genes, not all EBHRs overlap with genes: half of the EBHRs have no overlap with genes. Hence, when
gene coverage is calculated for the other half of EBHRs, it rises almost twofold (44.48%).
Figure 3. Comparison of the coverage of selected genomic features in the genome, EBHRs and EBRRs. The x-axis represents the coverage percentage of each feature.
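The fold values reported for EBHRs compare the fraction of bases covered by a feature inside a set of regions with the fraction covered genome-wide. The Python sketch below illustrates one way such a ratio could be computed. It assumes half-open (start, end) intervals on a single chromosome, and the function names are illustrative, not taken from the pipeline used in this study.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open (start, end), an assumed convention


def merge(intervals: Sequence[Interval]) -> List[List[int]]:
    """Merge overlapping or touching intervals so bases are counted once."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(regions: Sequence[Interval], features: Sequence[Interval]) -> int:
    """Number of bases inside `regions` that are overlapped by `features`."""
    feats = merge(features)
    total = 0
    for r_start, r_end in merge(regions):
        for f_start, f_end in feats:
            total += max(0, min(r_end, f_end) - max(r_start, f_start))
    return total


def coverage(regions: Sequence[Interval], features: Sequence[Interval]) -> float:
    """Fraction of the region set covered by the feature."""
    size = sum(end - start for start, end in merge(regions))
    return covered_bases(regions, features) / size


def fold_change(regions: Sequence[Interval], features: Sequence[Interval],
                genome_len: int) -> float:
    """Coverage inside the regions divided by genome-wide coverage."""
    return coverage(regions, features) / coverage([(0, genome_len)], features)
```

With a feature covering 20% of a toy 1,000 bp genome and 50% of a 200 bp region, `fold_change` returns 2.5, which is the kind of ratio quoted for each feature.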
0.5 The most resistant genomic regions to EBrs
We found that less than one percent of genomic windows (230 regions) have a complete absence of EBrs. These regions are mostly contiguous and are located in 50 genomic locations on all chromosomes. We called these regions evolutionary breakpoint deserts or EBr refractory regions (EBRRs). The 5 largest locations belong to chrY and chr19 (see figure 4). More than half of the EBRRs (120 windows) do not overlap with any gene annotation. However, GO enrichment analysis of the overlapping genes showed that these regions contain genes enriched in biological processes completely different from those of other genomic regions. These regions were highly enriched in critical biological processes involved in development, such as anatomical structure arrangement, anterior/posterior pattern formation, chordate embryonic development, cranial nerve development and morphogenesis, embryonic development, embryonic morphogenesis, embryonic organ development, embryonic organ morphogenesis, skeletal system morphogenesis, facial nerve development, nerve development and organ morphogenesis, and in regulatory processes such as regulation of biological process, regulation of cellular process, regulation of gene expression, regulation of metabolic process and regulation of transcription. See supplementary table S4 for the complete list of enriched biological processes. These genes included thirteen members of the Hox gene clusters, which encode conserved transcription factors in most bilaterian species [34]. Another interesting group of genes was the 33 zinc finger genes located on 19p12.
[Figure 4 image: panels for chr19 and chrY; tick labels mark positions from 0 MB to 58 MB.]
Figure 4. Chromosome 19 shows only a single EBr in a complete chromosome band of 4.4 Mbp. This region codes for 33 members of the zinc finger family, which is highly conserved in primates. Chromosome Y is one of the most remarkable chromosomes of the human genome: it is among the smallest human chromosomes, yet has the highest presence of EBrs, EBHRs and EBRRs.
The other genes were also all conserved among mammals or vertebrates. Some examples that illustrate the importance of these genes for survival and reproduction are: BCL11A, mutations in which are associated with intellectual developmental disorder with persistence of fetal hemoglobin (IDPFH) [33]; SRY (sex-determining region Y), associated with 46,XX testicular disorder of sex development and Swyer syndrome [33]; FOXP1, associated with mental retardation with language impairment and autistic features (MRLIAF) [33]; and TCF4, associated with Pitt-Hopkins syndrome, a condition characterized by intellectual disability and developmental delay, breathing problems, recurrent seizures (epilepsy) and distinctive facial features [33]. Study of the genomic features in these regions showed that their hallmark is satellite repeats. About 8% of these regions overlap with satellite repeats, which is 60 times more than their presence in the genome (0.13%) and 500 times more than their presence in EBHRs (0.02%). Moreover, these regions have higher coverage of low complexity repeats, LTRs and A-phased repeats. On the other hand, they show a lower presence of simple repeats, SINEs and common fragile sites (see figure 3).
Discussion
0.6 Evaluation of synteny identification
We conducted a multi-way comparative analysis of 12 genomes and identified synteny blocks and their corresponding 261,391 evolutionary breakpoints (EBrs) on the human genome. Using a permutation approach, 2,564 regions of 100 Kbp were identified as evolutionary breakpoint hotspot regions (EBHRs). One of the foundations of comparative genomics is the identification of synteny. However, there is no precise definition of synteny. Species selection, evolutionary distances, pairwise or multi-way comparison and the size of the anchors can all have a large impact on the resulting synteny blocks (reviewed by Ghiurcuta, 2014 [19]). In this study, 12 species were selected from different principal branches of the species tree, with human as the reference species. We selected species according to their evolutionary distance from human as well as the
quality of their genome assembly. Including species from more distant lineages of the mammalian evolutionary tree and a non-mammalian vertebrate as an out-group allows a higher resolution. However, it yields shorter synteny blocks compared to previous studies. From the multi-way alignment blocks, blocks shared by at least 7 selected species were accepted as conservation anchors. This flexibility lowers the dependency of the method on the precise species list and could reduce the bias due to species selection and missing data. Also, the fact that about 80% of our EBrs have a Lowest Common Ancestor at the primate node or farther up the reference tree shows that these breakpoints are reused, and that carefully replacing some of the selected species should not have a dramatic effect on the identified synteny blocks and their positions. Nonetheless, comparing the identified synteny blocks with the Conserved Ancestral Regions (CARs) reconstructed by Ma, 2006 [27] showed that 98.4% of the identified synteny blocks overlap with 59.98% of CARs. Considering the estimation of Ma, 2006 [27] as the closest to the real ancestral genomic regions, the presence of synteny blocks in conserved ancestral regions is not far from the presence of these synteny regions in the contemporary human genome calculated in this study (52%). Because human-oriented multi-way conservation alignments were used as anchors, over 7% of EBrs have a size of 1 bp, as conservation anchors are contiguous on the human genome. Comparison of EBrs with the evolutionary breakpoints produced in the study of Lemaitre, 2009 [26] showed that EBrs were similar to what Lemaitre, 2009 [26] identified as breakpoint regions (BPRs). It should be noted that their study used pairwise comparison of orthologous genes based on five other mammals (4 of which are among the species in this study) to identify synteny, and that their BPRs have a size distribution similar to that of the EBrs produced in our study. Besides the choice of species, the use of genes as conservation anchors limited their study to coding regions only. Nevertheless, 3% of the EBrs produced in this study overlap with 69.8% of their breakpoint regions.
0.7 Location of genes based on different selective pressure
The results presented in this study highlight the dynamic nature of the genome. EBrs are dense in regions coding for functions related to adaptive response, such as cellular process, metabolic process, DNA repair, response to DNA damage stimulus, cellular response to stress and catabolic process. On the other hand, genes under high selection pressure, such as members of the homeobox gene family of transcription factors, sex-determining region Y (SRY), BAF chromatin remodeling complex subunit (BCL11A), forkhead box transcription factors (FOXP1) and the zinc finger gene family, are located in EBRRs. All the genes in these regions are conserved among mammals and vertebrates. SRY initiates male sex determination. Mutations in this gene are known to be associated with 46,XY complete gonadal dysgenesis (also known as 46,XY pure gonadal dysgenesis or Swyer syndrome) [33]. Translocation of the part of the Y chromosome containing this gene to the X chromosome causes XX male syndrome [18]. The homeobox genes encode a highly conserved family of transcription factors that play an important role in morphogenesis in all multicellular organisms [18]. HOXB genes are known to encode conserved transcription factors in most bilaterian species [34]. FOXP1 belongs to subfamily P of the forkhead box (FOX) transcription factor family and is known to be associated with mental retardation with language impairment and autistic features (MRLIAF). Zinc finger genes code for transcription factors in all eukaryotes. Thirty-three zinc finger genes are located on the 19p12 chromosome band. This 4.4 Mbp region is the largest EBr desert in our dataset, containing only a single EBr. This gene cluster is known to be highly conserved in primates [4, 14]. This region is presented in figure 4. These results are consistent with previous studies indicating that refractory regions are strongly enriched for genes involved in development [15, 30] and that any disruption in these regions should have a detrimental effect [43]. In contrast, regions with a high tendency of
rearrangements are more involved in adaptive processes such as metabolic process, DNA repair and response to stress [30].
In conclusion, the identified EBHRs provide better insight into the nature of the genome by emphasizing the difference between the functions that EBr hotspots and EBr refractory regions harbor. They point to the dynamic process of evolution, adaptation and natural selection in interaction with the selective pressures on several genomic regions that maintain the integrity of functions crucial for the survival of the species. The EBHRs and EBRRs reported in this paper will constitute a good asset for further studies, including studies of the affinity of some genomic regions for rearrangements associated with diseases known to be driven by genome rearrangements, such as cancers.
Materials and Methods
0.8 Basic definitions
• Distance δ: The distance between two consecutive blocks, $B_i$ and $B_j$, is the minimum distance between the head of one and the tail of the other for each species $s$: $\delta = B_j.s_t - B_i.s_h$.
• Multi-way conservation alignment: A series of multiple alignments of conserved segments in a set of genomes based on a reference genome. The multiple alignments used in this study were generated using multiz and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. Conserved elements are identified by phastCons [22].
• Syntenic block: A large region of the genome that corresponds to a collection of contiguous blocks that have maintained their positions and order among a group of species since a common ancestor.
• Synteny break (evolutionary breakpoint or EBr): Given two consecutive conserved blocks, any discontinuity (in distance, chromosome or orientation) between the two sequences of one or more compared species that originated from any ancestry node of the reference genome (by the Lowest Common Ancestor (LCA) approach) is considered a synteny break.
• Breakpoint region: Given two consecutive synteny blocks, the interval between the two blocks on the reference genome.
• Breakpoint-hotspot: A genomic region that contains a number of breakpoints significantly higher than other genomic regions.
0.9 Synteny identification
To perform a comparative analysis on the human genome, the most appropriate candidates for capturing more evolutionary patterns are of two kinds: 1) neighboring species that share common features, and 2) more or less distant species that could have a broad divergence from the reference species (human). For the first group, three well-studied primates were chosen: chimpanzee, orangutan and marmoset. For the second group, two well-sequenced and well-studied species from each of three major branches of the mammalian phylogenetic tree were selected as follows: rat and mouse from Supraprimates, dog and cow from Laurasiatheria, and elephant and armadillo as non-Boreotherian mammals. We also added two outgroups: one non-placental mammal, opossum, and one non-mammalian vertebrate, chicken. 45-way conservation sequence
alignments were obtained from the UCSC Genome Browser in MAF format from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf.
Conservation alignment blocks containing at least seven of the 12 selected species were extracted as conservation anchors. Unselected species were removed from the blocks, and missing selected species were added with an unknown chromosome and gaps as their sequences. Synteny is constructed as follows:
Given:
• a set of selected species: $S = \{s_R, s_a, s_b, \ldots, s_n\}$, with $s_R$ as the reference species.
• a species reference tree $T$ with the set of ancestry nodes of $s_R$: $A = \{A_1, A_2, A_3, \ldots, A_n\}$.
• a set of multi-way conservation sequence alignment blocks $B = \{B_1, B_2, B_3, \ldots, B_n\}$, where each species sequence in a block $B_i$ is identified by its chromosome $B_i.s_c$, its orientation $B_i.s_o$, its head $B_i.s_h$ and its tail $B_i.s_t$.
Discontinuity between two blocks, $B_i$ and $B_j$, with respect to each species is defined by any of three conditions:
• difference in chromosomes: $B_i.s_c \neq B_j.s_c$
• difference in orientations: $B_i.s_o \neq B_j.s_o$
• distance $\delta = B_j.s_t - B_i.s_h$ greater than a defined threshold $G$: $\delta > G$
for each two blocks, $B_i$ and $B_j$:
    1. $S^* \leftarrow$ subset of $S$ with discontinuity
    2. if $\mathrm{LCA}(S^*) \in A$ then:
           next
    3. else:
           $B_i \leftarrow \mathrm{fus}(B_i, B_j)$
           $B_j \leftarrow B_{j+1}$
           go to step 1.
For each two consecutive anchors (with respect to the reference genome), the species with any discontinuity (chromosome, orientation or distance) were identified. If the lowest common ancestor of the species with discontinuity is one of the human ancestry nodes of the reference species tree, the two blocks are considered discontinuous and the algorithm continues with the next pair of blocks. Otherwise, the two blocks are fused and the fused block is compared with the next block. In each iteration, the list of species with discontinuity and their LCA is recorded. The result of this step is a list of larger conservation clusters that go through a size filter (minimum size of 1000 bp). The clusters that pass the filter are the resulting synteny blocks. These steps are illustrated in figure 5.
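The fusion procedure described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not stated in the text: the tail is taken to be the start coordinate and the head the end coordinate of a segment, species missing from a block are skipped rather than counted as discontinuous, a fused block keeps the position of the later block for non-reference species, and the tree is stored as a child-to-parent map. All names (`Seg`, `build_clusters`, `size_filter`, and so on) are hypothetical, and the gap threshold G has to be supplied by the caller because no value is given here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Seg:
    """Position of one species inside an alignment block."""
    chrom: Optional[str]  # None marks a species missing from the block
    strand: str           # '+' or '-'
    tail: int             # assumed to be the start coordinate
    head: int             # assumed to be the end coordinate


Block = Dict[str, Seg]    # species name -> segment


def discontinuous_species(bi: Block, bj: Block, gap: int) -> Set[str]:
    """Species whose chromosome, orientation or distance differs between bi and bj."""
    broken = set()
    for sp, a in bi.items():
        b = bj[sp]
        if a.chrom is None or b.chrom is None:
            continue  # assumption: missing data is not counted as a break
        delta = min(abs(b.tail - a.head), abs(a.tail - b.head))
        if a.chrom != b.chrom or a.strand != b.strand or delta > gap:
            broken.add(sp)
    return broken


def lca(species: Set[str], parent: Dict[str, str]) -> Optional[str]:
    """Lowest common ancestor in a tree stored as a child -> parent map."""
    if not species:
        return None
    paths = []
    for node in species:
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        paths.append(path)
    shared = set(paths[0]).intersection(*paths[1:])
    return next(node for node in paths[0] if node in shared)


def fuse(bi: Block, bj: Block, ref: str) -> Block:
    """Fused block: the reference spans both blocks, other species keep bj's position."""
    fused = dict(bj)
    fused[ref] = Seg(bi[ref].chrom, bi[ref].strand, bi[ref].tail, bj[ref].head)
    return fused


def build_clusters(blocks: List[Block], parent: Dict[str, str],
                   ref_ancestors: Set[str], ref: str,
                   gap: int) -> Tuple[List[Block], List[str]]:
    """Fuse consecutive anchors until a break traces back to a reference ancestry node."""
    clusters, origins = [], []
    current = blocks[0]
    for nxt in blocks[1:]:
        node = lca(discontinuous_species(current, nxt, gap), parent)
        if node in ref_ancestors:
            clusters.append(current)  # lineage-specific break: close the cluster
            origins.append(node)      # record the origin of the break
            current = nxt
        else:
            current = fuse(current, nxt, ref)
    clusters.append(current)
    return clusters, origins


def size_filter(clusters: List[Block], ref: str, min_size: int = 1000) -> List[Block]:
    """Keep clusters spanning at least min_size bp on the reference genome."""
    return [c for c in clusters if c[ref].head - c[ref].tail >= min_size]
```

In this sketch a discontinuity seen in a single non-reference species, or in a clade that does not include the reference, has an LCA outside the reference ancestry nodes, so the two blocks are fused. Only a discontinuity whose LCA lies on the reference lineage closes the current cluster.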
0.10 Identification of lineage-specific evolutionary breakpoint
Genomic regions bounded by two consecutive synteny blocks and smaller than 2 Mbp are considered reference species lineage-specific breakpoints, also called synteny breaks or evolutionary breakpoints (EBrs). The size constraint is necessary to avoid ambiguous regions that cannot be associated correctly with breakpoints (e.g. heterochromatin sequences). The origin of each EBr is the lowest common ancestor of the species with discontinuity found when comparing the two bounding synteny blocks in the synteny identification step.
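As an illustration, the extraction of EBrs from an ordered list of synteny blocks can be sketched as below. The sketch assumes blocks given as (chromosome, start, end) tuples on the reference genome with half-open coordinates, sorted by chromosome and start; this coordinate convention and the names `extract_ebrs` and `MAX_EBR_SIZE` are assumptions made for the example.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open (assumed)

MAX_EBR_SIZE = 2_000_000       # 2 Mbp upper bound on the size of an EBr


def extract_ebrs(blocks: List[Region], max_size: int = MAX_EBR_SIZE) -> List[Region]:
    """Return the regions between consecutive synteny blocks that pass the size filter.

    `blocks` must be sorted by chromosome and start position.
    """
    ebrs = []
    for (chrom_a, _, end_a), (chrom_b, start_b, _) in zip(blocks, blocks[1:]):
        if chrom_a != chrom_b:
            continue  # no region is defined across two chromosomes
        if start_b - end_a < max_size:
            ebrs.append((chrom_a, end_a, start_b))
    return ebrs
```

Regions larger than the bound are simply dropped here, which mirrors the elimination of break regions that fall in centromeric, telomeric or unsequenced parts of the genome.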
Figure 5. Extraction of synteny blocks and their corresponding breaks. Illustration of multi-way alignment blocks with respect to a reference species (SR). Sa to Sf are the selected species present in the alignment blocks. The reference tree shows the phylogenetic relationship of these species. A1 to A5 are the SR ancestry nodes. Each arrow represents a conserved region on each genome. The direction of each arrow represents the orientation of that region in each species. The distance between two adjacent vertical lines represents 1 Kbp. Each X shows the start of a chromosome.
0.11 Prediction of breakpoint-hotspots 313
To identify the genomic regions that are significantly enriched in breakpoint or 314
breakpoint hotspot regions, we followed the permutation strategies suggested by 315
(author?) [11]. We used a non-overlapping sliding window approach of size 100 Kb. In 316
each window, the number of breakpoints was counted. Breakpoints were considered to 317
fall into a window if they have at least one position overlapping that window. For 318
100,000 iterations, each breakpoint in the data set were simulated based on the 319
following constraints: 320
• Breakpoint should be simulated within the original chromosome. 321
• Simulated breakpoint should not fall into centromeric or telomeric regions. 322
• Breakpoint should be simulated with respect to its size. 323
• Overlaps would be allowed to lower the calculation costs. 324
A p-value is calculated for each window by comparing the number of simulated breakpoints
in that window with the original count. Windows with a p-value < 0.05 are considered
significantly enriched windows, or breakpoint hotspot regions.
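The permutation procedure above can be sketched for a single chromosome as follows. This is a minimal illustration under stated assumptions, not the code used in this study: positions are drawn uniformly with rejection sampling, the p-value is an empirical one-sided estimate (the exact formula is not given in the text), and all function names, the toy coordinates and the reduced iteration count are ours.

```python
import random


def window_counts(breakpoints, chrom_len, window):
    """Count breakpoints per non-overlapping window; intervals are half-open [start, end)."""
    n_win = -(-chrom_len // window)  # ceiling division
    counts = [0] * n_win
    for start, end in breakpoints:
        # A breakpoint counts in every window it overlaps by at least one position.
        for w in range(start // window, (end - 1) // window + 1):
            counts[w] += 1
    return counts


def overlaps_any(start, end, regions):
    """True if [start, end) overlaps any region in `regions`."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


def simulate(breakpoints, chrom_len, excluded, rng):
    """Re-place each breakpoint on the same chromosome, keeping its size and
    rejecting positions that overlap excluded (centromeric/telomeric) regions.
    Overlaps between simulated breakpoints are allowed."""
    simulated = []
    for start, end in breakpoints:
        size = end - start
        while True:
            s = rng.randrange(0, chrom_len - size + 1)
            if not overlaps_any(s, s + size, excluded):
                simulated.append((s, s + size))
                break
    return simulated


def hotspot_pvalues(breakpoints, chrom_len, excluded, window=100_000, n_iter=1000, seed=0):
    """Return observed counts and empirical p-values per window.
    Assumed estimator: (1 + #iterations with simulated >= observed) / (1 + n_iter)."""
    rng = random.Random(seed)
    observed = window_counts(breakpoints, chrom_len, window)
    exceed = [0] * len(observed)
    for _ in range(n_iter):
        sim = window_counts(simulate(breakpoints, chrom_len, excluded, rng), chrom_len, window)
        for i, (obs, cnt) in enumerate(zip(observed, sim)):
            if cnt >= obs:
                exceed[i] += 1
    return observed, [(e + 1) / (n_iter + 1) for e in exceed]


# Toy example: eight breakpoints clustered in the first window of a short chromosome.
toy_bps = [(i * 10, i * 10 + 5) for i in range(8)]
obs, pvals = hotspot_pvalues(toy_bps, chrom_len=1000, excluded=[(450, 550)], window=100, n_iter=200)
hotspots = [i for i, p in enumerate(pvals) if p < 0.05]
print(obs, hotspots)
```

Note that under this estimator a window with no observed breakpoint always receives a p-value of 1, so only windows with an excess of breakpoints can be called hotspots.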
0.12 Collection of genomic features
CpG islands, nested repeats, segmental duplications, CG content and self chain markers
were downloaded from the UCSC Genome Browser [29]. Annotations of non-B DNA
conformations were downloaded from Non-B DB [8]. Conserved elements of amniotes
were obtained from the catalog of conserved elements from genomic alignments (CEGA)
[12]. Common fragile sites were taken from the supplementary material of
Fungtammasan et al., 2012 [16]. RNA G-quadruplexes were downloaded from the RNA
G-quadruplex database (G4RNA) [17]. Benign structural variations were downloaded
from the Database of Genomic Variants [28]. RefSeq genes and exons were obtained from
the NCBI FTP portal. G-quadruplexes were downloaded from the G4 database [45] in 2013.
However, this database is no longer accessible. The complete list of these features is
given in supplementary Table S5.
0.13 Identification of EBRRs
Windows with no identified EBrs were extracted and their positions were compared to
genomic regions that are not well sequenced or annotated. We found 1,462 genomic
windows outside telomeric and centromeric regions that fall into such regions, including the
short arms of chromosomes 13, 14, 15, 21 and 22, as well as chromosome bands 1q12, 9q12
and 16q11. These windows were eliminated from our analysis. We were therefore left with 230
windows of 100 Kbp with no EBr.
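The filtering described above can be sketched as follows for one chromosome. This is a minimal illustration, not the pipeline used in this study: the function name, the half-open interval convention and the toy coordinates are assumptions, and `masked` stands for any set of poorly sequenced, centromeric or telomeric intervals.

```python
def refractory_windows(ebrs, chrom_len, masked, window=100_000):
    """Return windows that contain no EBr and do not overlap any masked region.
    All intervals are half-open [start, end) on a single chromosome."""
    n_win = -(-chrom_len // window)  # ceiling division
    kept = []
    for w in range(n_win):
        w_start = w * window
        w_end = min(w_start + window, chrom_len)
        has_ebr = any(s < w_end and w_start < e for s, e in ebrs)
        is_masked = any(s < w_end and w_start < e for s, e in masked)
        if not has_ebr and not is_masked:
            kept.append((w_start, w_end))
    return kept


# Toy example with a 500 bp chromosome and 100 bp windows.
toy = refractory_windows(
    ebrs=[(50, 60), (250, 260)],
    chrom_len=500,
    masked=[(300, 400)],
    window=100,
)
print(toy)
```

In the toy example, the first and third windows are dropped because they contain an EBr, the fourth because it is masked, leaving the second and fifth windows as candidate refractory regions.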
Supporting Information
Supporting data for this study are available at:
https://github.com/bioinfoUQAM/RECOMB-CG-2019_supp
Acknowledgments
We would like to thank Julie Horvath, Aida Ouangraoua, Abou Abdallah Malick
Diouara and Bruno Daigle for helpful discussions. Special thanks to Emmanuel Mongin,
who initially inspired this project. This work is supported by the Natural
Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de Recherche du
Québec - Nature et Technologies (FRQNT) funds to ABD. AMR is an NSERC fellow.
AMR and GV are FRQNT fellows.
References
1. L. Armengol, M. A. Pujana, J. Cheung, S. W. Scherer, and X. Estivill. Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Human molecular genetics, 12(17):2201–2208, 2003.
2. A. Bacolla, J. A. Tainer, K. M. Vasquez, and D. N. Cooper. Translocation and deletion breakpoints in cancer genomes are associated with potential non-b dna-forming sequences. Nucleic acids research, 44(12):5673–5688, 2016.
3. J. A. Bailey, G. Liu, and E. E. Eichler. An Alu transposition model for the origin and expansion of human segmental duplications. The American Journal of Human Genetics, 73(4):823–834, 2003.
4. E. J. Bellefroid, J.-C. Marine, A. G. Matera, C. Bourguignon, T. Desai, K. C. Healy, P. Bray-Ward, J. A. Martial, J. N. Ihle, and D. C. Ward. Emergence of the znf91 krüppel-associated box-containing zinc finger gene family in the last common ancestor of anthropoidea. Proceedings of the National Academy of Sciences, 92(23):10757–10761, 1995.
5. D. Binns, E. Dimmer, R. Huntley, D. Barrell, C. O’donovan, and R. Apweiler. Quickgo: a web-based tool for gene ontology searching. Bioinformatics, 25(22):3045–3046, 2009.
6. M. Blanchette. Evolutionary puzzles: An introduction to genome rearrangement. In Computational Science-ICCS 2001, pages 1003–1011. Springer, 2001.
7. G. Bourque, P. A. Pevzner, and G. Tesler. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome research, 14(4):507–516, 2004.
8. R. Z. Cer, K. H. Bruce, U. S. Mudunuri, M. Yi, N. Volfovsky, B. T. Luke, A. Bacolla, J. R. Collins, and R. M. Stephens. Non-b db: a database of predicted non-b dna-forming motifs in mammalian genomes. Nucleic acids research, 39(suppl 1):D383–D391, 2010.
9. A. T. Chinwalla, L. L. Cook, K. D. Delehaunty, G. A. Fewell, L. A. Fulton, R. S. Fulton, T. A. Graves, L. W. Hillier, E. R. Mardis, J. D. McPherson, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 2002.
10. S. De and F. Michor. Dna secondary structures and epigenetic determinants of cancer genome evolution. Nature structural & molecular biology, 18(8):950–955, 2011.
11. S. De, B. S. Pedersen, and K. Kechris. The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment. Briefings in bioinformatics, 15(6):919–928, 2013.
12. A. Dousse, T. Junier, and E. M. Zdobnov. Cega—a catalog of conserved elements from genomic alignments. Nucleic acids research, 44(D1):D96–D100, 2015.
13. Y. Drier, M. S. Lawrence, S. L. Carter, C. Stewart, S. B. Gabriel, E. S. Lander, M. Meyerson, R. Beroukhim, and G. Getz. Somatic rearrangements across cancer reveal classes of samples with distinct patterns of dna breakage and rearrangement-induced hypermutability. Genome research, 23(2):228–235, 2013.
14. E. E. Eichler, S. M. Hoffman, A. A. Adamson, L. A. Gordon, P. McCready, J. E. Lamerdin, and H. W. Mohrenweiser. Complex β-satellite repeat structures and the expansion of the zinc finger gene cluster in 19p12. Genome research, 8(8):791–808, 1998.
15. P. G. Engström, S. J. H. Sui, Ø. Drivenes, T. S. Becker, and B. Lenhard. Genomic regulatory blocks underlie extensive microsynteny conservation in insects. Genome research, 17(12):1898–1908, 2007.
16. A. Fungtammasan, E. Walsh, F. Chiaromonte, K. A. Eckert, and K. D. Makova. A genome-wide analysis of common fragile sites: what features determine chromosomal instability in the human genome? Genome research, 22(6):993–1005, 2012.
17. J.-M. Garant, M. J. Luce, M. S. Scott, and J.-P. Perreault. G4rna: an rna g-quadruplex database. Database, 2015, 2015.
18. L. Y. Geer, A. Marchler-Bauer, R. C. Geer, L. Han, J. He, S. He, C. Liu, W. Shi, and S. H. Bryant. The ncbi biosystems database. Nucleic acids research, 38(suppl 1):D492–D496, 2009.
19. C. G. Ghiurcuta and B. M. Moret. Evaluating synteny for improved comparative studies. Bioinformatics, 30(12):i9–i18, 2014.
20. W. Gu, F. Zhang, and J. R. Lupski. Mechanisms for human genomic rearrangements. Pathogenetics, 1(1):4, 2008.
21. H. Kehrer-Sawatzki and D. N. Cooper. Molecular mechanisms of chromosomal rearrangement during primate evolution. Chromosome Research, 16(1):41–56, 2008.
22. W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at ucsc. Genome research, 12(6):996–1006, 2002.
23. H. Kikuta, M. Laplante, P. Navratilova, A. Z. Komisarczuk, P. G. Engström, D. Fredman, A. Akalin, M. Caccamo, I. Sealy, K. Howe, et al. Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome research, 17(5):545–555, 2007.
24. J. Lee, K. Han, T. J. Meyer, H.-S. Kim, and M. A. Batzer. Chromosomal inversions between human and chimpanzee lineages caused by retrotransposons. PLoS One, 3(12):e4047, 2008.
25. C. Lemaitre, L. Zaghloul, M.-F. Sagot, C. Gautier, A. Arneodo, E. Tannier, and B. Audit. Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relation to genome organisation. BMC genomics, 10(1):335, 2009.
26. C. Lemaitre, L. Zaghloul, M.-F. Sagot, C. Gautier, A. Arneodo, E. Tannier, and B. Audit. Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relation to genome organisation. BMC genomics, 10(1):335, 2009.
27. J. Ma, L. Zhang, B. B. Suh, B. J. Raney, R. C. Burhans, W. J. Kent, M. Blanchette, D. Haussler, and W. Miller. Reconstructing contiguous regions of an ancestral genome. Genome research, 16(12):1557–1565, 2006.
28. J. R. MacDonald, R. Ziman, R. K. Yuen, L. Feuk, and S. W. Scherer. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic acids research, 42(D1):D986–D992, 2013.
29. L. R. Meyer, A. S. Zweig, A. S. Hinrichs, D. Karolchik, R. M. Kuhn, M. Wong, C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, et al. The ucsc genome browser database: extensions and updates 2013. Nucleic acids research, 41(D1):D64–D69, 2012.
30. E. Mongin, K. Dewar, and M. Blanchette. Long-range regulation is a major driving force in maintaining genome integrity. BMC evolutionary biology, 9(1):203, 2009.
31. E. Mongin, K. Dewar, and M. Blanchette. Mapping association between long-range cis-regulatory regions and their target genes using synteny. Journal of Computational Biology, 18(9):1115–1130, 2011.
32. W. J. Murphy, D. M. Larkin, A. Everts-van der Wind, G. Bourque, G. Tesler, L. Auvil, J. E. Beever, B. P. Chowdhary, F. Galibert, L. Gatzke, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science, 309(5734):613–617, 2005.
33. H. page: National Library of Medicine (US). Genetics Home Reference [Internet]. Bethesda (MD). The library. Internet, June 2019.
34. J. C. Pearson, D. Lemons, and W. McGinnis. Modulating hox gene functions during animal body patterning. Nature Reviews Genetics, 6(12):893, 2005.
35. Q. Peng, P. A. Pevzner, and G. Tesler. The fragile breakage versus random breakage models of chromosome evolution. PLoS computational biology, 2(2):e14, 2006.
36. P. Pevzner and G. Tesler. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Research, 13(1):37–45, 2003.
37. P. Pevzner and G. Tesler. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Research, 13(1):37–45, 2003.
38. S. K. Pham and P. A. Pevzner. Drimm-synteny: decomposing genomes into evolutionary conserved segments. Bioinformatics, 26(20):2509–2516, 2010.
39. D. B. Pontier, E. Kruisselbrink, V. Guryev, and M. Tijsterman. Isolation of deletion alleles by g4 dna-induced mutagenesis. nature methods, 6(9):655, 2009.
40. S. Proost, J. Fostier, D. De Witte, B. Dhoedt, P. Demeester, Y. Van de Peer, and K. Vandepoele. i-adhore 3.0—fast and sensitive detection of genomic homology in extremely large data sets. Nucleic acids research, 40(2):e11–e11, 2011.
41. A. Sandelin, P. Bailey, S. Bruce, P. G. Engström, J. M. Klos, W. W. Wasserman, J. Ericson, and B. Lenhard. Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC genomics, 5(1):99, 2004.
42. D. Sankoff. The where and wherefore of evolutionary breakpoints. J Biol, 8:66, 2009.
43. F. Spitz, C. Herkenne, M. A. Morris, and D. Duboule. Inversion-induced disruption of the hoxd cluster leads to the partition of regulatory landscapes. Nature genetics, 37(8):889, 2005.
44. G. Wang, L. A. Christensen, and K. M. Vasquez. Z-dna-forming sequences generate large-scale deletions in mammalian cells. Proceedings of the National Academy of Sciences, 103(8):2677–2682, 2006.
45. H. M. Wong, O. Stegle, S. Rodgers, and J. L. Huppert. A toolbox for predicting g-quadruplex formation and stability. Journal of nucleic acids, 2010, 2010.
46. A. Woolfe, M. Goodson, D. K. Goode, P. Snell, G. K. McEwen, T. Vavouri, S. F. Smith, P. North, H. Callaway, K. Kelly, et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS biology, 3(1):e7, 2004.
47. J. Zhao, A. Bacolla, G. Wang, and K. M. Vasquez. Non-b dna structure-induced genetic instability and evolution. Cellular and Molecular Life Sciences, 67(1):43–62, 2010.