<<

bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Revisiting the landscape of evolutionary breakpoints across human genome using multi-way comparison

Golrokh Vitae1, Amine M. Remita1, Abdoulaye Banir´eDiallo1,

1 Universit´edu Qu´ebec A` Montr´eal

* [email protected]

Abstract

Genome rearrangement is one of the major forces driving the processes of the and disease development. The chromosomal position affected by these rearrangements are called breakpoints. The breakpoints occurring during the evolution of are known to be non randomly distributed. Detecting their landscape and mapping them to genomic features constitute an important features in both comparative and functional . Several studies have attempted to provide such mapping based on pairwise comparison of genes as conservation anchors. With the availability of more accurate multi-way alignments, we design an approach to identify synteny blocks and evolutionary breakpoints based on UCSC 45-way conservation sequence alignments with 12 selected species. The multi-way designed approach with the mild flexibility of presence of selected species, helped to have a better determination of human lineage-specific evolutionary breakpoints. We identified 261,391 human lineage-specific evolutionary breakpoints across the genome and 2,564 dense regions enriched with biological processes involved in adaptive traits such as response to DNA damage stimulus, cellular response to stress and metabolic process. Moreover, we found 230 regions refractory to evolutionary breakpoints that carry genes associated with crucial developmental process such as organ morphogenesis, skeletal system development, chordate embryonic development, nerve development and regulation of biological process. This initial map of the human genome will help to gain better insight into several studies including developmental studies and cancer rearrangement processes.

Introduction 1

Genome rearrangement is one of the major forces driving the process of evolution, 2

population diversity and development of diseases including cancers. It happens when 3

DNA breaks in specific positions (breakpoints) and reassembles in a way different from 4

the initial genome conformation and changes the genome landscape [6]. It is now 5

well-accepted that genome rearrangements do not happen randomly along the genome 6

and not all genomic regions are susceptible to such dramatic modifications 7

[7, 25, 35, 36]. Millions of years of evolution and natural selection driven by these 8

structural modifications curved the genome in a way that regions carrying high 9

functional pressure maintained their integrity and could be identified as orthologs with 10

same order and on the same in a range of relatively close species. Hence, 11

resistance of genomic regions to any major modification implies that these regions 12

harbor functional importance to survival as well as reproduction of species and any 13

modification could have a deleterious effect in the process of natural selection [30]. 14

1/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Studies on the genome synteny have shown that conserved regions are not only 15

significantly enriched in putative regulatory regions [23, 31] but also are associated with 16

transcriptional regulations and developmental processes [31, 41, 46]. On the other hand, 17

are the regions that are more receptive to rearrangements, participating in speciation, 18

adaptation and development of species-specific traits and behavior that are not 19

detrimental to the survival nor reproduction of the species. These break-prone regions 20

or ”breakpoint hotspots” are also known to carry distinctive functional and structural 21

markers [10]. The identification of such regions are difficult and controversial due to 22

several assumptions. ” relies on the structuring of genomes into 23

syntenic blocks” [19]. However, comparative studies still lack a clear definition of 24

synteny. Since the term ”synteny” was introduced, even the criteria that define the 25

synteny conservation vary from one study to another. Also, in most of the methods, to 26

identify synteny, synteny fragments were defined as a set of genes with a conserved 27

order [9, 37, 38, 40]. These issues are well discussed by Ghiurcuta, 2014 [19]. The lack 28

of clear definition, choice of granularity of study (minimum size), and different type of 29

comparison could led to have the high probability of different results generated by same 30

data [19]. One conclusion provided to tackle the drawback of these methods are to use 31

multi sets of homologous markers instead of only genes as anchors as well as extending 32

the pairwise comparison to multi-ways [19]. Moreover, the identification of synteny 33

breaks is affected by the selection of species. This influences the size and position of 34

synteny blocks and their corresponding breakpoints as breakpoint is not a tangible 35

physical entity in a genome; it is a result of an analytical construct based on the 36

comparison of selected genomes [42]. Other than exogenous factors, some biological 37

mechanisms such as non-allelic homologous recombination (NAHR) and non-homologous 38

endjoining (NHEJ) [20] and the presence of endogenous factors such as CpG islands [10], 39

GC content[13], some repeat elements [21, 24] and G-quadruplex [10, 39] could affect 40

the susceptibility of genomic regions to rearrangements. Due to some of these factors, 41

DNA could undergo a conformation change (e.g. from B to Z) that could destabilize the 42

molecule and make it prone to structural damage [2, 44, 47]. Also, phenomena such as 43

segmental duplications (SDs) [1, 3, 32], known to be associated evolutionary 44

breakpoints causing a double strand breaks on DNA structure. Hence, it is important to 45

revisit Evolutionary Breakpoints (EBrs) with more accurate and well-drafted 46

mammalian sequenced genomes and updated genomics features. In this paper we 47

propose a new map of accurate and more precise evolutionary breakpoints region based 48

on human lineage synteny breaks, multi-way comparison and ancestral breakpoint 49

reconstruction. Furthermore, we explore the hotspots of EBrs (EBHRs) based on their 50

coassociation to multiple structural and functional human genome features. 51

Results 52

0.1 Identification of human lineage-specific synteny breaks 53

Synteny block is a cluster of (relatively close) markers conserved among related genomes 54

that maintained their orientation and order. The regions bordered by synteny blocks 55

are the position of the genome that have been subjected to rearrangement in one or 56

more compared species since a common ancestor. Multi-way comparative analysis of 57

human genome with 10 other mammals and chicken (see supplementary table S2), 58

revealed 261,420 synteny blocks covering 52% of the human genome with a size 59

distribution from 1 Kbp to 200 Kbp. The corresponding 261,391 genomic regions were 60

identified as human lineage-specific synteny breaks or evolutionary breakpoints (EBrs). 61

30 break regions were eliminated due to their size higher than 2 Mbp. These regions are 62

whether on centromeric or telomeric regions or on genomic regions with missing 63

2/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 1. Coverage of by EBrs

Figure 2. The percentages of human lineage specific breakpoints origin along the vertebrates reference tree

information. See table 1. EBrs are associated with 39.58% of the human genome. 64

However, the coverage of breakpoints raise to over 40% in chr11, chr12, chr8, chr4 and 65

chr7 and to over 55% in chr19 and chrX. See figure 1. The size of these regions were 66

from 1 bp to 1.857 Mbp. The seven largest EBrs (size > 1 Mbp) were located chr14, 67

chrY, chr11, chrX, chr2 and chr7. Based on a Least Common Ancestor approach, we 68

identify the lineage-specific breakpoints, over 50% of these breakpoints were reused and 69

originated from separation of placental mammals. The next most representative 70

ancestral node was the Boreotherian ancestor with 20% of EBrs that originated from. 71

Others were originated from various ancestry nodes along the reference phylogeny tree 72

with a representation between < 1% to about 10%. The distribution of EBr ancestry 73

origins are presented in figure 2. 74

Method Input Output Size of output Conservation anchor extraction 45-way multiple alignments Conservation anchors 21,809,215 Fusion Conservation anchors Synteny clusters 346,515 filter according to length Synteny clusters Synteny blocks 261,420 Synteny break extraction Synteny blocks Synteny breaks 261,421 Size filter Synteny breaks EBrs 261,391 Permutation EBrs EBHRs 2,564 EBr desert extraction EBrs + human genome EBRRs 230 Table 1. Summary of EBr-related results

3/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

0.2 Localisation of breakpoint hotspots 75

Using an overlapping sliding window approach of 100 Kbp and a permutation test with 76

100,000 iterations, we identified 2,564 regions as EBr hotspot regions (EBHRs). 77

Although evolutionary breakpoints are distributed across 90% of genomic windows, only 78

8.28% of genomic windows (2,564) were enriched in EBrs (p-value < 0.05). The results 79

show that EBHRs contain 12 to 21 EBrs with an average of 14.85 ± 1.56 (except 80

chromosome Y). EBHRs located on chromosome Y harbor 3 to 14 breakpoints in each 81

window. The EBHRs distribution with respect to chromosomes shows that 82

chromosomes are heterogeneously affected by these regions. Chromosomes are covered 83

by EBHRs from 3% to 23%. Based on the coverage of EBHRs, chromosome were 84

divided into three (3) categories: chromosomes with EBHRs representation under 8%, 85

chromosomes with a coverage between 8% to 15% and chromosomes with an over 15% 86

of EBHRs presence. See supplementary table S3. EBHRs is highly observed in 87

chromosome 22 (> 23%) when Chromosome 1 has the highest number of EBHRs (204 88

regions). However, it has a moderate coverage of EBHRs due to its size. 89

We also classified each EBHRs into different groups based on their predicted 90

evolutionary origin according to the origin of the EBRs present in that regionn. The 91

covered orgins are the five major human ancestry nodes along the reference species tree, 92

Placental, Boreotheria, Superprimates and Primates. The results show that about 55% 93

of EBHRs are mostly composed of EBrs originated around separation of human 94

placental ancestor, more than 30% are mostly composed of EBrs originated from 95

different ancestry nodes down to Boreotheria, and 10% of EBHRs were mostly composed 96

of EBrs originated from ancestry node at and after Boreotherian ancestry nodes. 97

0.3 Biological processes enrichment of genes associated with 98

EBHRs 99

The GO enrichment analysis of biological process showed that these EBHRs are 100

enriched in genes associated with distinct biological processes compared to genome. 101

Biological process to response to stress: response to DNA damage stimulus, cellular 102

response to stress, and DNA repair; metabolic process of organic and non organic 103

compounds and transformation of chemical substances: metabolic process, nucleic acid 104

metabolic process, macromolecule metabolic process, nitrogen compound metabolic process 105

and primary metabolic process; The chemical reactions and pathways resulting in the 106

breakdown of a histone protein: catabolic process as well as cellular process such as in 107

cell communication occurs among more than one cell, but occurs at the cellular level [5]. 108

The list of enriched biological process and the corresponding p-values are provided in 109

supplementary table S1. 110

0.4 EBHRs associations with genomic features 111

We mapped EBHR with 25 selected genomic markers known to have positive or 112

negative effect on fragility of the genome. The full description of these markers is 113

included in supplementary table S6. Overall overview of these regions showed that 114

EBHRs have higher coverage of CpG islands (1.3 fold), direct repeats (1.3 fold), SINE 115

(1.4), G-quadruplex (1.4 fold), self chains (1.6 fold), simple repeats (1.9 fold) and SDs (2 116

folds) compare to the genome. In terms of satellite repeats, the presence of this feature 117

falls in half of EBHRs compare to the genome. The most illustrative coverage 118

comparison of these features are illustrated in figure 3. The complete report on the 119

features coverage are provided in supplementary table S6. It should be noted that 120

although the it seems that 24.9% of EBHRs are covered by genes, however, not all the 121

EBHRs overlap with genes. Half of EBHRs has no overlap with genes. Hence, when 122

4/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

genes coverage is calculated for the other half of EBHRs, the coverage of genes 123 augments by almost 2 times (44.48%).

Figure 3. This figure shows the comparison between the coverage of selected genomic features on genome, EBHRs and EBRRs. x-axis represents the coverage percentage of each feature.

124

0.5 The most resistant genomic regions to EBrs 125

We found that less than one percent of genomic windows, 230 regions have a complete 126

absence of EBrs. These regions that are mostly continuous, located in 50 genomic 127

locations on all chromosomes. We called these regions evolutionary breakpoint deserts 128

or EBr refractory regions (EBRRs). The 5 largest locations are belonging to chrY and 129

chr19. See figure 4. More than half of EBRRs regions (120 windows) do not overlap 130

with any gene annotations. However, GO enrichment analysis of overlapping genes 131

showed that, these regions are located among genes that are enriched in biological 132

processes completely different from other genomic regions. These regions were highly 133

enriched in critical biological processes involved in development such as anatomical 134

structure arrangement, anterior/posterior pattern formation, chordate embryonic 135

development, cranial nerve development and morphogenesis, embryonic development, 136

embryonic morphogenesis, embryonic organ development, embryonic organ 137

morphogenesis, skeletal system morphogenesis, facial nerve development, nerve 138

development, organ morphogenesis and regulatory processes such as regulation of 139

biological process, regulation of cellular process, regulation of gene expression, regulation 140

of metabolic process, regulation of transcription, etc. See supplementary table S4 for the 141

complete list of enriched biological process. These genes included thirteen members of 142

Hox-cluster genes that encode conserved transcription factors in most bilateral species 143

[34]. Another interesting group of genes were the 33 members of zinc finger genes 144

located on the 19p12. 145

5/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

44MB 46MB 44MB 46MB 42MB 48MB 42MB 48MB

40MB 40MB 50MB 50MB

38MB 38MB 52MB 52MB

36MB 36MB 54MB 54MB

34MB 34MB 56MB

56MB

32MB

32MB

58MB

58MB

19

Y 30MB chr19 30MB chrY

0MB 0MB

28MB

28MB

2MB 2MB

26MB 26MB

4MB 4MB

24MB 24MB

6MB 6MB

22MB 22MB

8MB 8MB

20MB 20MB

10MB 10MB

18MB 18MB 12MB 12MB 16MB 14MB 16MB 14MB

Figure 4. Chromosome 19 showed only one single EBr in a complete chromosome band of 4.4 Mbp. This region codes for 33 members of zinc finger family which is highly conserved in primates. Chromosome Y is one of the most remarkable chromosomes on human genome. The smallest human genome with the highest presence of EBrs, EBHRs and EBRRs.

The other genes were also all conserved genes among mammals or vertebrates. 146

BCL11A, that mutation in this gene is known to be associated with intellectual 147

developmental disorder with persistence of fetal hemoglobin (IDPFH) [33], SRY 148

(sex-determining region Y), known to be associated with 46,XX testicular disorder of 149

sex development and Swyer syndrome [33], FOXP1, known to be associated with 150

Mental retardation with language impairment and autistic features (MRLIAF) [33], 151

TCF4 known to be associated with Pitt-Hopkins syndrome is a condition characterized 152

by intellectual disability and developmental delay, breathing problems, recurrent 153

seizures (epilepsy), and distinctive facial features [33], are some examples of genes and 154

their importance in terms of survival and reproduction. Study of the genomic features 155

in these regions showed that the landmark of these regions are satellite repeats. About 156

8% of these regions overlap with satellite repeats which is 60 times more than their 157

presence in the genome (0.13%) and 500 times more than their presence in EBHRs 158

(0.02%). Moreover, these regions have higher coverage of low complexity repeats, LTR 159

and a-phased repeats. On the other hand, they show a lower presence of simple repeats, 160

SINE and common fragile sites. See figure 3. 161

Discussion 162

0.6 Evaluation of synteny identification 163

We conducted a multi-way comparative analysis of 12 genomes and identified synteny 164

blocks and their corresponding 261,391 evolutionary breakpoints (EBrs) on the human 165

genome. Using a permutation approach, 559 regions of 100 Kbp were identified as 166

evolutionary breakpoint hotspot regions (EBHRs). One of the basics of comparative 167

genomics is the identification of synteny. However, there is no precise definition of 168

synteny. Species selection, their evolutionary distances, pairwise or multi-way 169

comparison, size of the anchors could all have huge impacts on the resulting synteny 170

blocks (Reviewed by Ghiurcuta, 2014 [19]). In this study 12 species were selected with 171

the human as the reference species from different principal branches of species tree. We 172

selected species according to their evolutionary distance from the human as well as the 173

6/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

quality of their genome assembly. Species from more distant lineages of mammalian 174

evolutionary tree and a non-mammalian vertebrate as out-group allows a higher 175

resolution. However, it yields to shorter size of synteny blocks compared to previous 176

studies. From multi-way alignment blocks, blocks shared by at least 7 selected species 177

were accepted as conservation anchors. This looseness lowers the dependency of this 178

method to the precise species list and could lower the bias of species selection and 179

missing data. Also, the fact that about 80% of our EBrs have a Least Common 180

Ancestor originated from the primate node to farther up on the reference tree, shows 181

that these breakpoints are reuse and a careful replace of some selected species should 182

not have a dramatic effect on the identified synteny blocks and their positions. 183

Nonetheless, comparing identified synteny blocks with Conserved Ancestral Regions 184

(CARs) that reconstructed by Ma, 2006 [27], showed that 98.4% of identified synteny 185

blocks overlaps with 59.98% CARs. Considering the estimation of Ma, 2006 [27] closest 186

to the real ancestral genomic regions, the presence of synteny blocks in conserved 187

ancestral regions is not far from the presence of these synteny regions in the human 188

contemporary genome calculated in this study (52%). Due to the use of human oriented 189

multi-way conservation alignments as anchors, over 7% of EBrs have a size equal to 1 190

bp as conservation anchors are contiguous on human genome. Comparison of EBrs with 191

evolutionary breakpoints produced in the study of Lemaitre, 2009 [26] showed that 192

EBrs were similar to what Lemaitre, 2009 [26] identified as breakpoint regions (BPR). It 193

should be noted that, in this study pairwise comparison of orthologous genes based on 194

five other mammals (with 4 that were common with species in this study) were used to 195

identify synteny and their BPRs have similar size distribution of EBrs produced in our 196

study. Other than the choice of the species, the use of genes as conservation anchors 197

limited their study to only the coding regions. However, still 3% of EBrs produced in 198

this study overlaps with 69.8% of their breakpoint regions. 199

0.7 Location of genes based on different selective pressure 200

The results presented in this study highlights the dynamic nature of the genome. EBrs 201

are dense in regions coding for functions related to adaptive response such as cellular 202

process, metabolic process, DNA repair, response to DNA damage stimulus, cellular 203

response to stress and catabolic process. On the other hand, genes with high selection 204

pressure such as members of gene family of transcription factors, sex 205

determining region Y (SRY), BAF chromatin remodeling complex subunit (BCL11A), 206

Forkhead box transcription factors (FOXP1) and zinc finger gene family are located in 207

EBRRs. All the genes in these regions are conserved among mammals and vertebrates. 208

SRY initiates male sex determination. Mutation in this gene is known to be associated 209

with 46,XY complete gonadal dysgenesis or 46,XY pure gonadal dysgenesis or Swyer 210

syndrome [33]. Translocation of part of the Y chromosome containing this gene to the X 211

chromosome causes XX male syndrome [18]. The homeobox genes encode a highly 212

conserved family of transcription factors that play an important role in morphogenesis 213

in all multicellular organisms [18]. HOXB genes are known to encode conserved 214

transcription factors in most bilateral species [34]. FOXB1 belongs to subfamily P of 215

the forkhead box (FOX) transcription factor family and known to be associated with 216

Mental retardation with language impairment and autistic features (MRLIAF). Zinc 217

finger genes code for transcription factors in all eukaryotes. 33 members of zinc finger 218

genes located on the 19p12 chromosome band. This 4.4 Mbp region is the largest EBr 219

desert in our dataset having only one single EBr within this region. This gene cluster is 220

known to be highly conserved in primate [4, 14]. This region is presented in figure 4. 221

These results are in harmony with previous study indicating that refractory regions are 222

strongly enriched for genes involved in development [15, 30] and any disruption in these 223

regions should have detrimental effect [43]. Whereas, regions with high tendency of 224

7/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

rearrangements are more involved in adaptive process such as metabolic process, DNA 225

repair and response to stress [30]. 226

In conclusion the identified EBHRs provides a better insight on the nature of 227

genome by emphasizing on the difference of the functions that EBr hotspots and EBr 228

refractory regions harbor. It points out the dynamic process of evolution, adaptation 229

and natural selection in interaction with the selective pressures of several genomic 230

regions that maintain the integrity functions crucial for the survival of the species. The 231

EBHRs and EBBRs reported in this paper will constitute a good asset for further 232

studies including the affinity of some genomic regions to rearrangement associated to 233

diseases known to be driven by genome rearrangements such as cancers. 234

Materials and Methods 235

0.8 Basic definitions 236

• Distance δ: A distance between two consecutive blocks, Bi, Bj, is the minimum 237 distance of one’s head with the tail of the other for each species: 238 δ = (Bj.st − Bi.sh). 239

• Multi-way conservation alignment: is a series of multiple alignment of 240

conserved segments in a set of genomes based on a reference genome. The 241

multiple alignments used in this study were generated using multiz and other 242

tools in the UCSC/Penn State Bioinformatics comparative genomics alignment 243

pipeline. Conserved elements are identified by phastCons [22]. 244

• Synetnic block: A syntenic block is defined as a large region of the genome that 245

corresponds to a collection of contiguous blocks that maintained their positions 246

and orders among a group of species since a common ancestor. 247

• Synteny break (evolutionary breakpoint or EBr): Given two consecutive 248

conserved blocks, any discontinuity (as in distance, chromosome or orientation) 249

between two sequences of one or more compared species that originated from any 250

ancestry node of the reference genome (by Lowest Common Ancestor (LCA) 251

approach) is considered a synteny breaks. 252

• Breakpoint region: Given two consecutive synteny blocks, the distance between 253

the two blocks on the reference genome. 254

• Breakpoint-hotspot: Breakpoint-hotspot regions is a genomic region that 255

represents a number of breakpoints significantly higher with respect to other 256

genomic regions. 257

0.9 Synteny identification 258

To perform a comparative analysis on the human genome, the most appropriate 259

candidates to capture more evolutionary patterns in the analysis are of two kinds : 1) 260

neighboring species that share common features, and 2) more or less distant species that 261

could have a broad divergence with the reference species (human). For the first group, 262

three well-studied primates have been chosen, chimpanzee, orangutan and marmoset. 263

For the second group, two well-sequenced and well-studied species from three major 264

branches of mammalian phylogenetic tree were selected as follows: rat and mouse from 265

Supraprimates, dog and cow from Laurasiatheria as well as elephant and armadillo as 266

non-Boreotherian mammals. We also added two outliers, one non placental mammal, 267

opossum and one non mammalian vertebrate, chicken. 45-way conservation sequences 268

8/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

alignments were obtained from UCSC genome browser in MAF format from: 269

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf. 270

Conservation alignment blocks having at least seven species among the 12 selected 271

species were extracted as conservation anchors. Unselected species were removed from 272

the blocks and missing selected species were added having unknown chromosome and 273

gaps as their sequences. Synteny is constructed as follows: 274

Given: 275

• a set of selected species: S = {sR, sa, sb, ..., sn}, with SR as reference species. 276

• a species reference tree T with set of ancestry nodes of sR A = 277 {A1,A2,A3, ..., An}. 278

• a multi-way conservation sequence alignment blocks B = {B1,B2,B3, ..., Bn} 279 each species sequence in a block Bi identified by its chromosome, Bi.sc, its 280 orientation, Bi.so and its head Bi.sh and tail Bi.st 281

Discontinuity between two blocks, Bi and Bj with respect to each species is defined by 282 any of three conditions: 283

• difference in chromosomes: Bi.sc 6= Bj.sc 284

• difference in orientations: Bi.so 6= Bj.so 285

• distance, δ(Bj.st − Bi.sh), greater than a defined threshold G: δ > G 286

for each two blocks, Bi and Bj: 287 ∗ 1. S ←− get subset of S with discontinuity 288

∗ 2. if LCA(S ) ⊂ A then: 289

next 290

3. else 291

Bi ←− fus(Bi,Bj) 292

Bj ←− Bj+1 293

go to step 1. 294

For each two continuous anchors (with respect to the reference genome), species with 295

any discontinuity (chromosome, orientation and distance) were identified. If the lowest 296

common ancestor of species with discontinuity is an element of human ancestry nodes of 297

the reference species tree, the two blocks would be considered as discontinuous and the 298

algorithm will continue to compare the next couple in the line. Otherwise, the two 299

blocks will be fused together to be compared with the next block. In each iteration, the 300

list of species with discontinuity and their LCA will be documented. The results of this 301

step is a list of larger conservation clusters that will go through a size filter (minimum 302

size of 1000 bp). The clusters passed through the filter are the resulting synteny blocks. 303

These steps are well illustrated in figure 5. 304

0.10 Identification of lineage-specific evolutionary breakpoint 305

Genomic regions bounded by two consecutive synteny blocks with a size smaller than 2 306

Mbp are considered as reference species lineage-specific breakpoints, or synteny breaks 307

or evolutionary breakpoints (EBrs). The size constraint is necessary to avoid ambiguous 308

regions that could not be associated correctly to breakpoints (e.g. sequences of 309

heterochromatin). The origin of each EBr is the lowest common ancestor of species with 310

discontinuity comparing the two bounded synteny blocks in the synteny identification 311

step. 312

9/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 5. Extraction of synteny blocks and their corresponding breaks: In this illustration of a multi-way alignment blocks with respect to a reference species (SR). Sa to Sf are the selected species present in the alignment blocks. The reference tree show the phylogenetic relationship of these species. A1 to A5 are the SR ancestry nodes. Each arrow represents a conserved region on each genome. Directions in each arrow represent the orientation of that region in each species. The distance between each two vertical lines represents 1 Kbp. Each X shows the start of a chromosome.

0.11 Prediction of breakpoint-hotspots 313

To identify the genomic regions that are significantly enriched in breakpoint or 314

breakpoint hotspot regions, we followed the permutation strategies suggested by 315

(author?) [11]. We used a non-overlapping sliding window approach of size 100 Kb. In 316

each window, the number of breakpoints was counted. Breakpoints were considered to 317

fall into a window if they have at least one position overlapping that window. For 318

100,000 iterations, each breakpoint in the data set were simulated based on the 319

following constraints: 320

• Breakpoint should be simulated within the original chromosome. 321

• Simulated breakpoint should not fall into centromeric or telomeric regions. 322

• Breakpoint should be simulated with respect to its size. 323

• Overlaps would be allowed to lower the calculation costs. 324

A p-value is calculated for the number of simulated breakpoints per window with 325

compare to the original count. Windows with a p-vaule ¡ 0.05 is considered as 326

significantly enriched windows or breakpoint hotspot regions. 327

0.12 Collection of genomic features 328

CpG islands, nested repeats, segmental duplications, CG content and self chain markers 329

were downloaded from the UCSC Genome Browser [29]. Annotation on non-B DNA 330

conformation were downloaded from Non-B DB [8]. Conserved elements of amniotes 331

were obtained from catalog of conserved elements from genomic alignments (CEGA) 332

[12]. Common fragile sites were provided by the supplementary material of 333

Fungtammasan, 2012 [16] paper. RNA G-quadrupex were downloaded from RNA 334

G-quadruplex database (G4RNA) [17]. Benign structural variation were donwloaded 335

from Database of Genomic Variants [28]. RefSeq genes and exons where obtained from 336

NCBI FTP portal. G quadruplex were downloaded from G4 database [45] in 2013. 337

10/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

However, the database is not accessible anymore. The complete list of these features are 338

represented in supplementary table S5. 339

0.13 Identification of EBRRs 340

Regions with no identified EBrs were identified and their positions were compared to 341

genomic regions that are not well sequenced or annotated. We found 1,462 genomic 342

windows outside telomeric and centromeric regions that fall into these regions such as 343

short arms of chromosome 13, 14, 15, 21, 22, as well as chromosome bounds 1q12, 9q12 344

and 16q11. These regions were eliminated from our analysis. Hence we left with 230 345

windows of 100 Kbp long with no EBr. 346

Supporting Information 347

Supporting data for this study is available at: 348

https://github.com/bioinfoUQAM/RECOMB-CG-2019_supp 349

Acknowledgments 350

We would like to thank Julie Horvath, Aida Ouangraoua, Abou Abdallah Malick 351

Diouara and Bruno Daigle for helpful discussions. Special thanks to Emmanuel Mongin, 352

by whom this project has been initially inspired. This work is supported by Natural 353

Science and Engineering council of Canada (NSERC) and the Fonds de Recherche du 354

Qu´ebec-Nature et Technologie (FRQNT) funds to ABD. AMR is a NSERC fellow. 355

AMR and GV are FRQNT fellows. . . . 356

References

1. L. Armengol, M. A. Pujana, J. Cheung, S. W. Scherer, and X. Estivill. Enrichment of segmental duplications in regions of breaks of synteny between the human and mouse genomes suggest their involvement in evolutionary rearrangements. Human molecular genetics, 12(17):2201–2208, 2003. 2. A. Bacolla, J. A. Tainer, K. M. Vasquez, and D. N. Cooper. Translocation and deletion breakpoints in cancer genomes are associated with potential non-b dna-forming sequences. Nucleic acids research, 44(12):5673–5688, 2016. 3. J. A. Bailey, G. Liu, and E. E. Eichler. An Alu transposition model for the origin and expansion of human segmental duplications. The American Journal of Human Genetics, 73(4):823–834, 2003. 4. E. J. Bellefroid, J.-C. Marine, A. G. Matera, C. Bourguignon, T. Desai, K. C. Healy, P. Bray-Ward, J. A. Martial, J. N. Ihle, and D. C. Ward. Emergence of the znf91 kr¨uppel-associated box-containing zinc finger gene family in the last common ancestor of anthropoidea. Proceedings of the National Academy of Sciences, 92(23):10757–10761, 1995. 5. D. Binns, E. Dimmer, R. Huntley, D. Barrell, C. O’donovan, and R. Apweiler. Quickgo: a web-based tool for gene ontology searching. Bioinformatics, 25(22):3045–3046, 2009. 6. M. Blanchette. Evolutionary puzzles: An introduction to genome rearrangement. In Computational Science-ICCS 2001, pages 1003–1011. Springer, 2001.

11/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

7. G. Bourque, P. A. Pevzner, and G. Tesler. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome research, 14(4):507–516, 2004. 8. R. Z. Cer, K. H. Bruce, U. S. Mudunuri, M. Yi, N. Volfovsky, B. T. Luke, A. Bacolla, J. R. Collins, and R. M. Stephens. Non-b db: a database of predicted non-b dna-forming motifs in mammalian genomes. Nucleic acids research, 39(suppl 1):D383–D391, 2010. 9. A. T. Chinwalla, L. L. Cook, K. D. Delehaunty, G. A. Fewell, L. A. Fulton, R. S. Fulton, T. A. Graves, L. W. Hillier, E. R. Mardis, J. D. McPherson, et al. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 2002. 10. S. De and F. Michor. Dna secondary structures and epigenetic determinants of cancer genome evolution. Nature structural & molecular biology, 18(8):950–955, 2011. 11. S. De, B. S. Pedersen, and K. Kechris. The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment. Briefings in bioinformatics, 15(6):919–928, 2013. 12. A. Dousse, T. Junier, and E. M. Zdobnov. Cega—a catalog of conserved elements from genomic alignments. Nucleic acids research, 44(D1):D96–D100, 2015. 13. Y. Drier, M. S. Lawrence, S. L. Carter, C. Stewart, S. B. Gabriel, E. S. Lander, M. Meyerson, R. Beroukhim, and G. Getz. Somatic rearrangements across cancer reveal classes of samples with distinct patterns of dna breakage and rearrangement-induced hypermutability. Genome research, 23(2):228–235, 2013. 14. E. E. Eichler, S. M. Hoffman, A. A. Adamson, L. A. Gordon, P. McCready, J. E. Lamerdin, and H. W. Mohrenweiser. Complex β-satellite repeat structures and the expansion of the zinc finger gene cluster in 19p12. Genome research, 8(8):791–808, 1998. 15. P. G. Engstr¨om,S. J. H. Sui, Ø. Drivenes, T. S. Becker, and B. Lenhard. Genomic regulatory blocks underlie extensive microsynteny conservation in insects. Genome research, 17(12):1898–1908, 2007. 16. A. Fungtammasan, E. Walsh, F. Chiaromonte, K. A. Eckert, and K. D. Makova. A genome-wide analysis of common fragile sites: what features determine chromosomal instability in the human genome? Genome research, 22(6):993–1005, 2012. 17. J.-M. Garant, M. J. Luce, M. S. Scott, and J.-P. Perreault. G4rna: an rna g-quadruplex database. Database, 2015, 2015. 18. L. Y. Geer, A. Marchler-Bauer, R. C. Geer, L. Han, J. He, S. He, C. Liu, W. Shi, and S. H. Bryant. The ncbi biosystems database. Nucleic acids research, 38(suppl 1):D492–D496, 2009. 19. C. G. Ghiurcuta and B. M. Moret. Evaluating synteny for improved comparative studies. Bioinformatics, 30(12):i9–i18, 2014. 20. W. Gu, F. Zhang, and J. R. Lupski. Mechanisms for human genomic rearrangements. Pathogenetics, 1(1):4, 2008.

12/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

21. H. Kehrer-Sawatzki and D. N. Cooper. Molecular mechanisms of chromosomal rearrangement during primate evolution. Chromosome Research, 16(1):41–56, 2008. 22. W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at ucsc. Genome research, 12(6):996–1006, 2002. 23. H. Kikuta, M. Laplante, P. Navratilova, A. Z. Komisarczuk, P. G. Engstr¨om, D. Fredman, A. Akalin, M. Caccamo, I. Sealy, K. Howe, et al. Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome research, 17(5):545–555, 2007. 24. J. Lee, K. Han, T. J. Meyer, H.-S. Kim, and M. A. Batzer. Chromosomal inversions between human and chimpanzee lineages caused by retrotransposons. PLoS One, 3(12):e4047, 2008. 25. C. Lemaitre, L. Zaghloul, M.-F. Sagot, C. Gautier, A. Arneodo, E. Tannier, and B. Audit. Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relation to genome organisation. BMC genomics, 10(1):335, 2009. 26. C. Lemaitre, L. Zaghloul, M.-F. Sagot, C. Gautier, A. Arneodo, E. Tannier, and B. Audit. Analysis of fine-scale mammalian evolutionary breakpoints provides new insight into their relation to genome organisation. BMC genomics, 10(1):335, 2009. 27. J. Ma, L. Zhang, B. B. Suh, B. J. Raney, R. C. Burhans, W. J. Kent, M. Blanchette, D. Haussler, and W. Miller. Reconstructing contiguous regions of an ancestral genome. Genome research, 16(12):1557–1565, 2006. 28. J. R. MacDonald, R. Ziman, R. K. Yuen, L. Feuk, and S. W. Scherer. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic acids research, 42(D1):D986–D992, 2013. 29. L. R. Meyer, A. S. Zweig, A. S. Hinrichs, D. Karolchik, R. M. Kuhn, M. Wong, C. A. Sloan, K. R. Rosenbloom, G. Roe, B. Rhead, et al. The ucsc genome browser database: extensions and updates 2013. Nucleic acids research, 41(D1):D64–D69, 2012. 30. E. Mongin, K. Dewar, and M. Blanchette. Long-range regulation is a major driving force in maintaining genome integrity. BMC evolutionary biology, 9(1):203, 2009. 31. E. Mongin, K. Dewar, and M. Blanchette. Mapping association between long-range cis-regulatory regions and their target genes using synteny. Journal of Computational Biology, 18(9):1115–1130, 2011. 32. W. J. Murphy, D. M. Larkin, A. Everts-van der Wind, G. Bourque, G. Tesler, L. Auvil, J. E. Beever, B. P. Chowdhary, F. Galibert, L. Gatzke, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science, 309(5734):613–617, 2005. 33. H. page: National Library of Medicine (US). Genetics Home Reference [Internet]. Bethesda (MD). The library. Internet, June 2019. 34. J. C. Pearson, D. Lemons, and W. McGinnis. Modulating hox gene functions during body patterning. Nature Reviews Genetics, 6(12):893, 2005.

13/19 bioRxiv preprint doi: https://doi.org/10.1101/696245; this version posted July 9, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

35. Q. Peng, P. A. Pevzner, and G. Tesler. The fragile breakage versus random breakage models of chromosome evolution. PLoS computational biology, 2(2):e14, 2006. 36. P. Pevzner and G. Tesler. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Research, 13(1):37–45, 2003. 37. P. Pevzner and G. Tesler. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Research, 13(1):37–45, 2003. 38. S. K. Pham and P. A. Pevzner. Drimm-synteny: decomposing genomes into evolutionary conserved segments. Bioinformatics, 26(20):2509–2516, 2010. 39. D. B. Pontier, E. Kruisselbrink, V. Guryev, and M. Tijsterman. Isolation of deletion alleles by g4 dna-induced mutagenesis. nature methods, 6(9):655, 2009. 40. S. Proost, J. Fostier, D. De Witte, B. Dhoedt, P. Demeester, Y. Van de Peer, and K. Vandepoele. i-adhore 3.0—fast and sensitive detection of genomic in extremely large data sets. Nucleic acids research, 40(2):e11–e11, 2011. 41. A. Sandelin, P. Bailey, S. Bruce, P. G. Engstr¨om,J. M. Klos, W. W. Wasserman, J. Ericson, and B. Lenhard. Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC genomics, 5(1):99, 2004. 42. D. Sankoff. The where and wherefore of evolutionary breakpoints. J Biol, 8:66, 2009. 43. F. Spitz, C. Herkenne, M. A. Morris, and D. Duboule. Inversion-induced disruption of the hoxd cluster leads to the partition of regulatory landscapes. Nature genetics, 37(8):889, 2005. 44. G. Wang, L. A. Christensen, and K. M. Vasquez. Z-dna-forming sequences generate large-scale deletions in mammalian cells. Proceedings of the National Academy of Sciences, 103(8):2677–2682, 2006. 45. H. M. Wong, O. Stegle, S. Rodgers, and J. L. Huppert. A toolbox for predicting g-quadruplex formation and stability. Journal of nucleic acids, 2010, 2010. 46. A. Woolfe, M. Goodson, D. K. Goode, P. Snell, G. K. McEwen, T. Vavouri, S. F. Smith, P. North, H. Callaway, K. Kelly, et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS biology, 3(1):e7, 2004. 47. J. Zhao, A. Bacolla, G. Wang, and K. M. Vasquez. Non-b dna structure-induced genetic instability and evolution. Cellular and Molecular Life Sciences, 67(1):43–62, 2010.

14/19