Supplementary Information

Supplementary information

A method for complete characterization of complex germline rearrangements from long DNA reads

Satomi Mitsuhashi1,6, Sachiko Ohori1,6, Kazutaka Katoh2,3, Martin C Frith3-5*, Naomichi Matsumoto1*

1. Department of Human Genetics, Yokohama City University Graduate School of Medicine 2. Research Institute for Microbial Diseases, Osaka University, Suita, Japan 3. Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST) 4. Graduate School of Frontier Sciences, University of Tokyo 5. Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST 6. Contributed equally

To whom correspondence should be addressed: M.C.F. ([email protected]), N.M. ([email protected])

Supplementary Results Gene expression level of possible causative genes in Patient 3 Patient3 had a phenotype of split-hand/foot malformation (SHFM) with hearing loss, delayed development, self-injuries, hyperactivity and sleep disorders. SHFM has several causative loci, and one has been mapped to 7q21.3-q22.11. Three genes SEM1, DLX5 and DLX6 are suggested to be related to SHFM with hearing impairment2. None of the three genes are disrupted in this patient (Supplementary Fig 7a), which is also true of other SHFM patients, suggesting some regulatory effect of the rearrangement of flanking regions in SHFM2. To test the effect of the complex rearrangement on expression of those genes, we analyzed mRNA expression levels using the patient’s lymphoblastoid cells (LCL). As we found out that SEM1 was expressed in LCLs, we analyzed expression level of SEM1. Expression levels of SEM1 cDNA were evaluated by quantitative polymerase chain reaction (qPCR) in the patient and three healthy controls. SEM1 expression of LCLs in this patient was not lower than the controls (Supplementary Fig 7b). We could not test DLX5 and DLX6 because those genes were not expressed in LCLs. It is possible these two gene(s) are contributing to disease pathogenesis.

TE insertions, processed pseudogene insertions and nuclear mitochondrial sequences Large fractions of patient-specific rearrangements were smaller-scale rearrangements including tandem duplications or insertions. Among insertions, many of them were annotated as transposable elements, especially L1HS, AluYa5 or AluYb8 (Supplementary Table5). Comparison of the fraction of patient-only TE-insertions (patients) to all TEs from RepeatMasker annotations (rmsk) from the UCSC genome browser (https://genome.ucsc.edu), suggests that these active TEs are enriched in patient-only insertions (Supplementary Fig 12), supporting the notion that recently integrated TEs are the source of these genomic variations. In addition, we observed an ERVK insertion in Patient1 (Fig2e), which was previously described3. We also identified an SVA insertion shared by patients 2 and 3 (Supplementary Table 5: group1 in Patient 2 and group6 in Patient 3). Aside from these TE insertions, there were insertions which were aligned to multiple exons of genes that are located distant from the insertion sites, possibly due to processed pseudogene insertion, in 3 patients (Fig4g, Supplementary Fig 5, Supplementary Fig 11). Insertions were from exons of MFF, MATR3 and FXR1 genes, in patient 2, 3 and 4, respectively. MFF and MATR3 processed pseudogene insertions into these loci were described previously4, but not FXR1. This kind of rare structural variation might be commonly present in our genomes because we observed it in 3/4 patients in this study, although further study of multiple individuals may be necessary to conclude. We also observed nuclear mitochondrial sequence (NUMT) in Patients 1 and 2 (Fig2e, Supplementary Fig 5, Supplementary Table 5, 7). Two of them are inserted into LINE regions of introns of SPAG16 and CEP128, respectively. In all NUMTs, there were flanking A-T oligomers at the insertion loci as suggested previously (Supplementary Table 7)5.

Supplementary Methods Clinical details of the patients Patient 1 Detailed clinical information was described elsewhere6,7. Briefly, the patient was a Caucasian female. She was 40 years old at the time of the previous study. She was delivered at 37 weeks of gestation with a birth weight of 2,930g. She presented with amenorrhea at age 17. Primary ovarian failure was indicated by hormonal level and her hypoplastic uterus with bilateral streak gonads were found by laparotomy. G-banded chromosomal analysis showed a balanced reciprocal translocation 46X, t(X;2) (q22;p13). Array CGH analysis using the Agilent 444K oligo array platform at a resolution of 44K showed no deletions at the breakpoint. She was taller than other members of her family (177cm, 98th tile). She got pregnant by in vitro fertilization with egg donation at the age of 36 years. She underwent parathyroidectomy at 24-week gestation due to primary hypoparathyroidism.

Patient 2 Detailed clinical information was described elsewhere7. Briefly, the patient was a female born to non-consanguineous Japanese parents and was 38 years old at the time of the previous study. She was born with neonatal asphyxia. She had her first menstrual period once at 14 years old, but then became amenorrheic and started on hormonal replacement therapy. She was diagnosed with primary hyperthyroidism at the age of 18 years and underwent subtotal thyroidectomy and started thyroid hormonal therapy. G-banded chromosomal analysis showed a balanced reciprocal translocation 46X, t(X;4)(q21.3;p15.2) at the age of 38 years.

Patient 3 Detailed clinical information was described elsewhere8. Briefly, the patient was a girl to non- consanguineous Japanese parents and was 9 years old at the time of the previous study. She had no family with SHFM. She was delivered at 38 weeks of gestation after an uneventful pregnancy with a birth weight of 2,850g (-0.02SD), length of 48cm (-0.4SD), and occipitofrontal circumference of 32cm (-0.5SD). She was admitted to a hospital for weak sucking when she was one month old. She showed cutaneous syndactyly of 4th and 5th digits of the right foot and 1st, 2nd, 4th and 5th digits of the left foot. Her hands were normal. In addition, she had strabismus, micrognathia, full lower lip, bilateral ear canal stenosis, a severe mixed type deafness, and developmental disorder. She walked alone at 21 months. Her developmental stage at 25 months was equal to 7-8 months. Since 3 years of age, self- injuries, hyperactivity, and sleep disorders appeared.

Patient 4 Detailed clinical information was described elsewhere9. Briefly, the patient was a girl to non- consanguineous Japanese parents and was 5 years old at the time of the previous study. She was delivered at term without asphyxia after an uneventful pregnancy. She started clonic convulsions of extremities 2 days after birth. She was diagnosed with WEST syndrome at 5 months of age. She had intellectual disability without head control, series of tonic-spasms, and hypsarrhythmia on Electroencephalogram. dnarrange details Definition of two reads sharing a rearrangement Two reads R and S (Supplementary Fig 13) are deemed to share a rearrangement if: 1. The alignments of R to the genome include two alignments A and B, and the alignments of S include X and Y, such that: 2. A overlaps X in the genome. 3. B overlaps Y in the genome. 4. A and B exhibit one of the four rearrangement types (inter-chromosome, inter-strand, non-colinear, or big gap), and X and Y exhibit the same rearrangement type. 5. If the rearrangement type is non-colinear (jumps backwards in the chromosome) or big gap (jumps forwards in the chromosome): 1. The chromosome range jumped between A and B overlaps the range jumped between X and Y. 2. The number of chromosome bases jumped between A and B is <= 2x that between X and Y, and vice-versa. 6. The alignments have consistent strandedness. Strandedness means: which chromosome strand the read aligns to. "Consistent" means that either: A and X have the same strandedness and so do B and Y, or: A and X have opposite strandedness and so do B and Y. 7. The alignments' order in their reads is consistent with the strandedness. If the alignments have the same strandedness they must occur in the same order in their reads, else they must occur in the opposite order. 8. The number of bases in read R between A and B is close to the number of bases in read S between X and Y. Specifically: abs(s-r+m-n) <= d (default 1000), where s, r, m, and n are defined in Supplementary Fig 13.

Discarding any read with any rearrangement not shared by another read For each read: check each pair of alignments (A and B) that occur consecutively in the read and exhibit one of the four rearrangement types (Supplementary Fig 13). Require that this rearrangement is shared (as defined above) by a pair of alignments (X and Y) that occur consecutively in another read.

Miscellaneous dnarrange details Two alignments are considered to be on different chromosomes only if the chromosomes are known to be different. E.g. "chr7" and "chrUn" are not known to be different, but "chr7" and "chr5_random" are. The non-colinear rearrangement type is not considered for chrM, which is circular. Two alignments A and B of read R are not deemed to exhibit a "big gap" if any other alignment is between them in read R (Supplementary Fig 13). Coordinates of left and right ends dnarrange-link needs to decide whether a left end is "downstream of" a right end, i.e. whether it is rightwards/downstream in the chromosome. We wish to allow for some overlap, and some slop in the alignments. In the current version, dnarrange-link crudely defines the chromosome coordinate of a (left or right) end as: the average of the start and end coordinates of the alignment at that end.

Limitations of dnarrange 1. It may have trouble finding patient-specific rearrangements that are close to rearrangements shared with controls. This is because, if a patient read shares a rearrangement with a control read, the patient read is discarded. 2. It will not work well with extremely long reads, or assembled chromosomes. This is because it starts by seeking reads that contain patient-specific rearrangements and lack rearrangements shared with controls. It may work best with a mixture of shorter reads (to separate nearby rearrangements) and longer reads (to span huge repeats, and know the order and orientation of rearranged fragments unambiguously). 3. It is not really designed to find transposable element insertions. If two reads overlap the same TE insertion, they may get aligned to different source TEs in the genome, because this alignment is highly ambiguous. Then dnarrange will not consider these reads to share a rearrangement, so will not group them. This could be fixed by loosening the criteria for sharing a rearrangement, at a risk of retaining many artefactual rearrangements (e.g. Supplementary Fig 14).

Alignment to human reference genome Reads were aligned to human reference genome (hg38) using LAST version 959 (http://last.cbrc.jp) as follows. windowmasker10 -mk_counts -in hg38.fa > hg38.wmstat windowmasker -ustat hg38.wmstat -outfmt fasta -in hg38.fa > hg38-wm.fa lastdb -P8 -uNEAR -R11 -c hg38 hg38-wm.fa last-train -P8 hg38 reads.fa > train.out lastal -P8 -p train.out hg38 reads.fa | last-split –m1 > alns.maf

Finding reads with translocations and complex rearrangements Rearrangements were found using dnarrange (https://github.com/mcfrith/dnarrange): dnarrange –s3 patient.maf > patient-groups.maf dnarrange finds rearranged reads, and groups those reads that share a rearrangement. To remove rearrangements that are shared by other individuals, we used 33 controls (Fig1, Supplementary Table 1): dnarrange –s3 patient.maf : control.maf > patient-only.maf Option –s3 means that the minimum number of supporting reads per group is 3. The patient-only groups file was re-analyzed with option –c1 to remove unreliable reads (Supplementary Fig 14): dnarrange –c1 –s3 patient-only.maf > final.maf

Drawing dot-plot pictures of each group of rearranged reads Dot plot pictures were obtained like this with some modification: last-multiplot final.maf final-pic Modified last-dotplot options were, for example: last-dotplot -1 chr1:149390802-149390842 --sort2=3 --strands2=1 --rot1=v -

-rot2=h --labels1=2 --rmsk1 rmsk.txt --genePred1 refFlat.txt alignment.maf alignment.png

To draw gray lines joining breakpoints, this option was used: --join=2

Assembly of reads and breakpoint detection Each group of rearranged reads was merged into a consensus sequence like this: dnarrange-merge reads.fa train.out dnarrange-output > consensus.fa Consensus sequences were re-aligned to the unmasked reference genome with these commands: lastdb -P8 -uNEAR -R01 hg38 hg38.fa last-train -P8 hg38 consensus.fa > train.out lastal -P8 -p train.out hg38 consensus.fa | last-split -m1 > re-alns.maf

Breakpoints were determined from the re-aligned file: dnarrange –s1 re-alns.maf > consensus-rearrangements Then, dot-plot pictures were produced like this: last-multiplot consensus-rearrangements dotplot-picture lamassemble details "Doubling" of substitution probabilities in lamassemble From last-train, we have a 4x4 matrix P(x,y): the probability of read base y aligned to genome base x. These 16 probabilities sum to 1. Let us define c(x) to be the complement of base x, and G(x) the probability of base x in the genome: G(x) = sum(y in a,c,g,t) P(x,y). For the subsequent steps to make sense, we require that G has parity: G(x) = G(c(x)). lamassemble forces parity by rescaling: G'(x) = [ G(x) + G(c(x)) ] / 2 P'(x,y) = P(x,y) * G'(x) / G(x) It then calculates the probability of base x in a read forward strand aligned to base y in a read forward strand: F(x,y) = sum(z in a,c,g,t) [ P'(z,x) * P'(z,y) / G'(z) ] And the probability of x in a read forward strand aligned to y in a read reverse strand: R(x,y) = sum(z in a,c,g,t) [ P'(z,x) * P'(c(z),c(y)) / G'(z) ]

"Doubling" of gap probabilities in lamassemble last-train calculates read-to-genome gap probabilities like this11: delOpenProb = delOpenCount / n insOpenProb = insOpenCount / n delExtendProb = (delCount - delOpenCount) / delCount insExtendProb = (insCount - insOpenCount) / insCount lamassemble crudely calculates these read-to-read gap probabilities: gapOpenProb = 1 - (1 - delOpenProb) * (1 - insOpenProb) gapExtendProb = (gapCount - gapOpenCount) / gapCount where: gapCount = delCount + insCount gapOpenCount = delOpenCount + insOpenCount By basic algebra, we can calculate gapExtendProb in terms of the four gap probabilities from last-train.

Alignment in lamassemble lamassemble finds pairwise alignments between the reads like this:

lastdb -uNEAR -c -R01 -W19 my_db seq_file

lastal -j4 -D1e9 -m5 -z30g -s1 -p fwd_scores my_db seq_file

lastal -j4 -D1e9 -m5 -z30g -s0 -p rev_scores my_db seq_file where the 1st lastal command compares read forward strands to each other, and the 2nd lastal command compares forward strands to reverse strands. Alignments of a sequence to itself are discarded.

The alignments are sorted in descending order of score. For each alignment in turn:

* Record a link between the 2 aligned sequences, only if they are not yet linked directly or indirectly. This link defines the next step of progressive alignment. The (forward or reverse) strands in the alignment define which strands will be used for progressive alignment.

* If the alignment uses forward/reverse strands inconsistent with strands aligned (directly or indirectly) by previous alignments: discard this alignment.

* If the alignment is not "roughly colinear" with previous alignments of these 2 sequences: discard it. Two alignments A and B, between sequences S and T, are "roughly colinear" if: start_coordinate(A,S) < start_coordinate(B,S) start_coordinate(A,T) < start_coordinate(B,T) end_coordinate(A,S) < end_coordinate(B,S) end_coordinate(A,T) < end_coordinate(B,T)

Before running MAFFT, lamassemble trims bases at the start and end of each sequence that are outside any non-discarded LAST alignment.

Gene expression levels in lymphoblastoid cell Total RNA was extracted from lymphoblastoid cells from Patient 3, and 3 controls, using RNeasy Plus Mini Kit (QIAGEN, Hilden, Germany), then subjected to reverse-transcription reaction using SuperScriptIII (Thermo Fisher Scientific). Quantitative real-time PCR was performed using Rotor-Gene SYBR Green PCR Kit and Rotor-Gene (QIAGEN, Hilden, Germany). Primers used were described in Supplementary Table 6. delta-delta CT method was used to compare gene expression levels of SEM1.

Supplementary Tables

Sequencer Median length number of reads total bases

Patient1 PromethION 13,104 8,642,604 111,990,048,861 Patient2 PromethION 3251 17,131,141 117,132,669,422 Patient3 PromethION 3223 14,926,358 94,531,709,149 Patient4 PromethION 2334 6,845,364 41,101,961,286

Control1 PromethION 17,218 5,074,319 96,709,785,254 Control2 PromethION 18,142 4,497,556 90,866,959,081 Control3 PromethION 16,986 4,403,236 81,554,440,718 Control4 PromethION 1,558 12,830,261 57,451,863,375 Control5 PromethION 2,452 9,635,261 35,004,315,947 Control6 PromethION 691 23,303,818 54,535,886,940 Control7 PromethION 7,127 7,824,636 77,878,925,289 Control8 PromethION 14,670 3,557,359 53,696,135,713 Control9 PromethION 3,520 15,926,839 82,870,233,437 Control10 PromethION 3,298 10,589,493 57,885,853,963 Control11 PromethION 2,990 11,109,622 62,195,375,599 Control12 PromethION 3,736 10,658,181 72,698,332,750 Control13 PromethION 3,169 5,529,417 38,773,128,508 Control14 PromethION 5,889 9,091,759 63,874,106,363 Control15 PromethION 1,608 17,969,232 49,838,565,927 Control16 PromethION 3,520 6,786,707 55,398,873,781 Control17 PromethION 3,110 11,467,114 56,214,013,607 Control18 PromethION 3,861 13,655,084 78,995,795,042 Control19 PromethION 3,425 17,761,209 89,558,068,739 Control20 PromethION 3,392 10,500,704 68,684,084,437 Control21 PromethION 3,382 9,582,128 63,680,708,046 Control22 PromethION 2,645 12,998,831 66,361,384,504 Control23 PromethION 2,035 20,502,426 84,790,477,596 Control24 PromethION 3,893 17,229,947 125,180,249,687 Control25 PromethION 3,601 17,880,435 112,748,995,082 Control26 PromethION 2,744 16,280,893 97,732,025,016 Control27 PromethION 1,719 23,432,032 88,440,096,328 Control28 PromethION 20,468 2,313,819 51,571,485,654 Control29 PromethION 3,087 8,709,727 42,536,118,154 Control30 PromethION 2,126 13,143,397 43,307,793,923 Control31 PromethION 13,749 2,515,618 38,569,602,139 Control32 PromethION 12,248 3,795,124 50,547,125,919 Control33 PromethION 9,606 10,503,537 95,293,663,017

Supplementary Table 1. Summary of PromethION sequencing data.

Patient1 Patient2 Patient3 Patient4

Original 2,773 3,336 3,351 2,523 data1 1,392 2,858 2,075 1,302 data2 794 1,860 1,206 806 data3 539 1,257 804 570 data4 409 609 428 292 data5 367 459 316 225 data6 307 335 234 140 data7 267 262 186 106 data8 246 228 160 98 data9 218 182 136 79 data10 204 159 120 66 data11 194 147 112 58 data12 181 122 103 51 data13 171 111 96 47 data14 164 105 92 45 data15 162 99 87 39 data16 160 93 82 36 data17 156 89 79 36 data18 151 82 77 35 data19 144 74 70 32 data20 141 71 67 32 data21 141 68 66 30 data22 137 66 63 29 data23 133 61 62 29 data24 129 54 59 29 data25 125 52 54 28 data26 124 48 51 27 data27 122 48 50 24 data28 121 46 48 22 data29 119 46 47 22 data30 119 45 47 21 data31 118 42 47 21 data32 117 41 47 21 data33 101 37 46 21 with -c1 option 80 33 43 14

Supplementary Table 2. Number of groups of rearranged reads in each patient, after successive filtering using 33 control humans. Original: number of groups in each patient before filtering.

Patient1 Patient2 Patient3 Patient4

real 205m48.036s 234m9.006s 154m17.363s 134m53.602s user 202m43.690s 232m40.968s 153m2.223s 133m49.292s sys 2m51.771s 1m30.713s 1m19.762s 1m5.775s

Supplementary Table 3. Computational time count for grouping rearranged reads from the patient and subsequent filtering using 33 controls. real: wall clock time. user: CPU time. sys: CPU time within the system.

Patient1 Patient2 Patient3 Patient4

tandem multiplication 15 3 7 3 tandem repeat expansion 13 6 3 1 retrotransposition 16 6 3 1 non-tandem duplication (insertion) 9 4 3 0 non-tandem duplication with target site deletion 3 0 0 0 large tandem duplication 1 3 1 3 deletion 8 1 5 2 inversion 1 0 1 1 processed pseudogene insertion 0 1 1 1 NUMT 1 2 0 0 unclear 7 2 4 2 chromosomal translocation 2 2 0 0 possible inversion duplication 2 0 0 0 complex rearrangement 0 3 15 0

Supplementary Table 4. Number of patient-only rearrangements in each category. NUMT: nuclear mitochondrial DNA insertions.

Supplementary Table 5 is shown in a separate file. Supplementary Table 5. Detailed description of patient-only rearrangements in Supplementary Fig 3, 5, 9 and 11. Group25 in Patient1 is an L1HS insertion, which has a 3’-transduction indicating that the source is a specific L1HS in chr4. This chr4 L1HS is absent in the reference genome, but is present in some humans (hg18.chr4:129184647)12. For group 24 & 52, inverted duplication in chr16 and chr20, are also present in public nanopore data NA12878 (rel3 and rel4. Jain et al.2018, detected by our analysis)13, suggesting that a reported duplication (array CGH) or inversion (WGS) near this locus (International HapMap, C. et al. 2010, Genomes Project, C. et al. 2010)14,15 in NA12878, is actually identical to this inverted duplication. For group6, 12 and 16 in Patient2, dnarrange-link infers a chr11 complex rearrangement (Fig3e).

group Forward Reverse FigS6 breakpoint PCR der-chr7 18 caggaaataagagactggttcctaa catgttactgcagatgatgagattt a

der-chr15 17 tgattagctctgtacctgaggactt aaaggattacattgtgatgcaaact b 13 ttccagagcgattgtattatttagc tcaatggcagtactagattcacaac c 1 gtttaatttaagcgtcccacctaat tgaaacaagccgtatgtatgatatg d-1 1 aactctctcatcacatggttcattt actaacaagcagggatacaaacaag d-2 5 agaattggaaaagagatcttgtgtg aaccctgtaaatgtggaattctgta e-1 5 agtgaattccattcagtgtaccatt gatatagcaggcatcctattttgtg e-2 5 ccatagatggatacatggataaaca actcagacttgtaggtaaccccatt e-3 12 tatagcttgaaggaattccatttgt gctatccagagcatggtctattcta f 35 atagtagctgcttgtgcctgtaatc ggtgatggttatggttatcttctca g 3 accagagaattgaagatgactatgg aattgtgggaggtaattatccattt h

der-chr9 21 agctagacctgcaatagcaaactta gcatttagaaaccattcctgtagaa i 33 atttgtgaactctgacttgaggaag ataactttcttataatgtattaaccatgctgtag j 11 ctgcatggaagtaggggtgt acactggcccacaaaaagag k 30 ccagtctaggacctctcttagtacttttat cccaactctctaatgttctactaccttact l 22 ggtttcttatatattctggttattaattctttgt ctagtccttcccaacacattagttttat m 21 ggaaggcaaattcaatccaa acctggctcacaaggtatgg n

der-chr14 6 cctgacctagaactgtcccactaag accaagatgtttctacaaaaagcac o

QPCR SEM1 cagttacgagctgaactagagaaa gggttctctgccctagaatgt

Supplementary Table 6. Primers. Primers used to confirm breakpoints and qPCR in Patient3.

Patient insert locus insert length strand mt gene mismap probability insertion locus insertion locus gene RepeatMasker annotation Near-by A-T sequence (<20bp)

Patient1 chrM:15065-15101 36 + CYB mismap=1.39e-05 chr2:213419058-213419060 SPAG16 intron LINE L1 AATAAAA Patient2 chrM:10556-10682 126 - ND4L mismap=1e-10 chr14:80547592-80547596 CEP128 intron LINE L2 TAAAAT Patient2 chrM:11680-11744 64 + ND4 mismap=0.5 chr4:7252649-7252650 SORCS2 intron DNA/hAT-Blackjack AATT

Supplementary Table 7. NUMT origin and insertion site. Nuclear Mitochondrial sequences (NUMT) found in Patient 1 and Patient2. a Patient 1 b Patient 1 c Patient 1 d Patient 3 e Patient 3 group8 group4 group6 group4

tandem deletion Transposable large tandem duplication inversion or translocation element insertion (SVA) duplication

chr9:115,801,133 chr13:37,297,849 chr5:2,004,152 chr8:94,190,000 chr20:2,822,000 chr8:19,662,869 -115,847,234 -37,336,758 -2,152,404 -94,199,000 -2,829,000 -19,736,836

lamassemble

chr9:115,801,508 chr9:37,298,350 chr5:2,018,010 chr8:94,190,000 chr20:2,822,000 chr8:19,662,998 -115,838,595 -37,336,682 -2,097,935 -94,199,000 -2,829,000 -19,736,704

transposable elements (reverse-strand) transposable elements (forward-strand) tandem repeat protein-coding sequence exons

Supplementary Figure 1. Examples of grouped rearranged reads and consensus sequences. Dot-plot of examples of grouped rearranged reads. In a dot-plot, a diagonal line is drawn where a read sequence (vertical) aligned to the reference sequence (horizontal). The horizontal black lines are boundaries between different dot-plots showing different DNA reads. a. tandem-multiplication. b. inversion. c. deletion or translocation. d. non-tandem duplication or translocation (insertion). In this dotplot, the inserted sequence is aligned to a transposable element (pale pink) e. possible large tandem duplication. It is not clear if this is a tandem duplication because no one read encompasses the whole duplication. However, this is the simplest interpretation unless other rearrangements are found in the region, thus we categorized this as large duplication. Examples a-c: Patient 1, d,e. Patient 3. The vertical stripes indicate annotations in the reference genome: tandem repeats (purple), transposable elements (pink:forward-strand, blue:reverse-strand), green (exon) and dark green (protein-coding sequence). Each group of reads was assembled by lamassemble: the resulting consensus sequences were realigned to the reference genome. Patient1 46,X,t(X;2)(q22;p13)

chr2:65,976,415 chrX:108,676,529 chr2:65,997,089 chrX:108,700,719 -66,000,996 -108,704,621 -66,023,620 -108,734,586

group3 group9

lamassemble

chr2:65,976,976 chrX:108,680,162 chr2:65,997,739 chrX:108,701,370 -66,000,827 -108,704,449 -66,020,533 -108,720,718

Supplementary Figure 2. Reciprocal chromosomal translocation in Patient 1. Dot-plot pictures of grouped rearranged reads and lamassemble consensus sequences at the translocation sites in Patient 1. Translocation t(X;2)(q22;p13) in Patient 1 does not lose any regions nor disrupt any genes. The vertical stripes indicate annotations in the reference genome: tandem repeats (purple), transposable elements (pink:forward-strand, blue:reverse-strand), green (exon) and dark green (protein-coding sequence). The horizontal black lines indicate boundaries between different dot-plots showing different reads. Patient1

tandem multiplication group1 group7 group8 group11 group12 group16 group20 chr3:989,645 -991,674 chr19:4,202,546 -4,229,255 chrX:13,216,933 -13,218,968 chr1:75,365,255 -75,410,519 chr9:115,801,508 -115,838,595 chr1:26,563,756 -26,610,138 chr17:29,469,951 -29,497,806 chr1:226,034,395 -226,088,941 chr13:36,552,010 -36,585,289

group21 group26 group27 group31 group35 group39 group80 chr7:124,892,970 -124,916,764 chr7:72,854,707 -72,883,627 chr5:145,247,125 -145,250,332 chr9:127,858,082 -127,899,971 chr16:64,660,842 -64,664,262 chrX:66,075,417 -66,078,857 chr3:23,261,050 -23,279,779 chr9:131,721,093 -131,760,932 chr12:37,984,466 -38,016,705 chr9:63,357,159 -63,375,941

tandem repeat expansion group18 group30 group40 group43 group46 group50 chrX:347,267 -376,282 chr3:195,774,592 -195,798,789 chr10:331,688 -360,313 chr3:184,740,959 -184,764,506 chr18:79,290,350 -79,324,082 chr12:132,292,746 -132,315,590

group53 group54 group58 group62 group63 group64 group66 chr6:170,393,757 -170,401,993 chr1:3,689,415 -3,703,207 chr1:148,539,093 -148,549,802 chr6:168,836,222 -168,845,924 chr12:132,142,771 -132,157,642 chr4:189,832,269 -189,842,211 chr4:181,229,310 -181,253,388

Large tandem duplications NUMT group13 group47 chrM:14,985 -15,181 chr2:213,418,000 -213,420,000 chr16:15,417,819 -15,421,445 chr18:60,594,130 -60,647,741 Patient1 Retrotransposition

L1HS insertions group22 group25 group60 group61 group76 chr10:38,280,777 -38,313,729 chr20:23,430,986 -23,434,221 chr4:128,043,568 -128,044,547 chr7:141,920,045 -141,922,648 chr20:7,116,896 -7,118,790 chrX:124,095,535 -124,096,883 chr3:98,198,734 -98,214,787 chr17:70,458,141 -70,465,814 chr14:95,138,540 -95,139,256 chr17:51,896,000 -51,902,000 chr2:198,902,811 -198,917,173 chr10:22,626,966 -22,628,239 chr16:31,930,130 -31,944,735 chrX:141,420,443 -141,427,999

SVA insertions group56 group57 group75 chr2:206,171,698 -206,173,730 chr12:56,267,446 -56,287,909 chr12:114,873,512 -114,875,518 chr13:49,060,845 -49,062,712 chr20:2,821,628 -2,826,659 chr22:40,745,542 -40,747,520 chr3:106,110,683 -106,112,530 chr4:128,082,804 -128,084,326 chr7:25,017,823 -25,020,946 chr9:72,850,362 -72,852,087 chr12:107,278,662 -107,294,565 chr13:31,753,213 -31,755,560 chr14:23,232,107 -23,233,621 chr19:29,900,544 -29,902,700 chr1:222,487,592 -222,488,962 chr7:101,357,528 -101,359,954 chr15:57,409,398 -57,420,264

AluYb8 insertion AluYa5 insertions group71 group51 group65 group72 chr12:58,984,000 -58,990,000 chrX:49,091,221 -49,092,020 chr1:95,083,000 -95,089,000 chr9:108,731,925 -108,732,703 chr1:4,225,000 -4,231,000 chrX:120,293,909 -120,294,705 chr8:31,927,206 -31,927,989 chr12:124,070,000 -124,076,000

AluYa5 insertion with ERVK insertion tandem duplication group70 group34 group77 group78 group49 chr7:125,220,223 -125,223,082 chr12:123,566,665 -123,591,525 chr6:166,781,000 -166,787,000 chr18:50,308,454 -50,309,445 chr6:166,774,927 -166,787,319 chr8:143,320,996 -143,322,242 chr18:50,308,239 -50,309,655 chr6:166,772,751 -166,789,334 chrX:155,868,014 -155,869,707 chr3:148,731,500 -148,734,000 chr3:170,228,000 -170,238,000 chr10:18,902,412 -18,903,565 Patient1 Non-tandem duplication or translocation (insertion) group17 group23 group36 group41 group42 group44 chr11:761,000 -765,000 chr9:129,585,689 -129,586,417 chr15:41,195,000 -41,200,000 chr11:25,331,000 -25,337,000 chr15:80,763,000 -80,772,000 chr1:212,812,000 -212,816,000 chr16:74,289,450 -74,289,819 chr6:113,401,271 -113,402,123 chr14:24,962,000 -24,968,000

group48 group55 group6 chr3:155,012,000 -155,016,229 chr1:66,986.112 -66,988,103 chr11:84,851,000 -84,856,000 chr16:77,750,000 -77,760,000

Deletions group2 group4 group5 group14 chr5:97.696,123 -97,778,566 chr6:13,485,711 -13,489,683 chr6:111,204,251 -111,208,258 chr16:19,907,649 -19,983,106 chr5:2,018,008 -2,097,937 chr20:31,430,193 -31,484,765

group15 group29 group33 group37 chr4:161,084,746 -161,097,242 chr4:161,173,306 -161,191,912 chr15:76,516,011 -76,567,884 chr9:61,587,992 -61,606,291 chr9:65,537,017 -65,559,089 chr1:110,808,061 -110,868,446

non-tandem duplication with target-site deletion Inversions group10 group19 group32 group74 chr13:99,870,000 -99,871,000 chr6:68,970,000 -68,985,000 chr6:15,665,000 -15,670,000 chrX:141,108,348 -141,108,889 chr5:141,274,219 -141,289,493 Inverted duplications? Patient1

group24 group52 group28 group45 chr20:5,216,594 -5,237,964 chr20:4,605,501 -4,620,461 chr20:5,045,148 -5,071,396 chr20:4,852,507 -4,871,270 chr16:35,139,354 -35,158,943 chr16:35,229,191 -35,244,007 chr16:35,508,962 -35,524,572 chr16:35,160,799 -35,174,268

dnarrange-link

chr16:34,796,604-35,551,156 chr16:34,796,604-35,551,156 chr20:4,589,868-5,253,647 chr20:4,589,868-5,253,647

4 kb deletion

Unclear group38 group59 group67 group68 group69 group73 chr6:33,226,343 -33,241,582 chr4:11,245,478 -11,247,141 chr10:89,536,858 -89,558,799 chr3:142,305,766 -142,306,914 chr13:60,886,215 -60,888,881 chr5:21,898,993 -21,917,903 chr6:59,120,407 -59,127,319 chr1:13,110,193 -13,175,011 chr18:13,180,073 -13,182,682 chr6:59,822,369 -59,827,059 chr13:46,461,765 -46,474,252 chr15:28,440,017 -28,440,985

group79 chr1:23,852,790 -23,854,163 chr6:156,668,181 -156,684,802

transposable elements (reverse-strand) transposable elements (forward-strand) tandem repeat protein-coding sequence exons

Supplementary Figure 3. Other patient-only rearrangements in Patient 1. Dot-plot pictures of lamassemble consensus sequences are shown. For some of the retrotranspositions (e.g. group57), the insertion is aligned to multiple chromosomes. In these cases, the insertion is aligned to different copies of the same type of retrotransposon (e.g. SVA). Our interpretation is that the true source copy is absent from (or misassembled in) the reference genome, so we get a fragmented alignment to different copies. Patient 2 46,X,t(X;4)(q21.3;p15.2)

chr4:12,203,846 chrX:108,161,278 chr4:12,238,970 chrX:108,193,484 chrX:107,916,034 -12,237,984 -108,195,327 -12,284,239 -108,255,145 -108,008,854 group2 group5 group10

lamassemble

chr4:12,213,510 chrX:108,172,765 chr4:12,239,035 chrX:108,193,541 chrX:107,929,975 -12,237,204 -108,194,547 -12,284,173 -108,253,689 -108,006,361

TEX13B

COL4A6 COL4A6

Supplementary Figure 4. Reciprocal chromosomal translocation in Patient 2. Dot-plot pictures of grouped rearranged reads and lamassemble consensus sequences at the translocation sites in Patient 2. Translocation t(X;4)(q21.3;p15.2) in Patient 2 disrupts the COL4A6 gene. Near the translocation site, there is a 43-kb deletion, which eliminates the whole TEX13B gene. The vertical stripes indicate annotations in the reference genome: tandem repeats (purple), transposable elements (pink:forward-strand, blue:reverse-strand), green (exon) and dark green (protein-coding sequence). The horizontal black lines indicate boundaries between different dot-plots showing different reads. Patient2

tandem repeat expansion AluYa5 insertion + tandem duplication group8 group11 group17 group19 group20 group24 group31 group14 chr19:27,468,921 -27,501,871 chr10:635,320 -666,167 chr4:189,794,848 -189,845,574 chr21:42,573,768 -42,601,845 chr9:133,143,172 -133,144,392 chr12:1,075,092 -1,095,218 chr12:40,471,931 -40,493,539 chr6:170,370,623 -170,386,610 chr1:104,505,000 -104,515,000 chr9:106,261,437 -106,262,559

Non-tandem duplication or translocation (insertion) group3 group21 group23 group33 chr20:43,340,000 -43,346,000 chrX:6,932,796 -6,933,292 chr3:11,599,000 -11,607,000 chr5:142,382,886 -142,383,951 chr7:39,808,065 -39,809,203 chr12:29,410,000 -29,414,000 chr7:25,236,034 -25,236,412 chr8:65,921,151 -65,924,000 chr14:86,330,470 -86,331,030 chr21:41,216,945 -41,217,310 chrX:149,340,790 -149,341,177

Retrotransposition SVA insertion L1HS insertions AluYa5 insertion group1 group26 group27 group28 group29 group25 chr7:97,617,537 -97,619,804 chrX:112,313,869 -112,314,630 chr4:73,577,155 -73,585,850 chr10:85,354,964 -85,362,095 chr7:38,468,740 -38,469,309 chr9:126,807,000 -126,810,000 chr8:94,174,057 -94,217,109 chr20:2,822,757 -2,827,961 chr2:136,894,445 -136,905,587 chr13:29,641,203 -29,644,343 chr3:130,266,347 -130,272,946 chr6:19,768,053 -19,771,285

Large tandem duplications group7 group13 group30 chr5:8,490,127 -8,521,907 chr5:8,748,264 -8,762,012 chr16:18,936,027 -18,946,155 chr16:19,063,155 -19,071,664 chr6:26,672,131 -26,680,684 Patient2 Processed-pseudogene insertion NUMT

group4 group9 group15 chr1:105,892,061 -105,894,122 chr5:139,293,359 -139,331,006 chr12:113,442,000 -113,455,000 chr17:45,551,893 -45,553,140 chr4:7,251,000 -7,254,000 chrM:11,000 -12,000 chrM:10,000 -11,000 chr14:80,546,000 -80,549,000

MATR3

Complex rearrangement

group6 group12 group16 chr11:54,620,741 -54,634,822 chr11:55,066,582 -55,106,289 chr11:55,048,429 -55,069,178 chr11:55,245,274 -55,265,013 chr8:73,464,003 -73,467,725 chr11:54,683,316 -54,707,594 chr11:55,272,540 -55,301,772

Unclear group22 group32

transposable elements (reverse-strand)

chr7:86,875,161 -86,884,055 chr10:14,531,733 -14,537,336 chr12:112,357,634 -112,360,898 chr7:54,442,243 -54,447,106 transposable elements (forward-strand) tandem repeat protein-coding sequence exons

Supplementary Figure 5. Other patient-only rearrangements in Patient 2. Other patient-only rearrangements in Patient 2. Dot-plot pictures of lamassemble consensus sequence are shown. Patient 2 also has a processed-pseudogene insertion into chr12, which aligns to exons of MATR3 in chr5. Part of this insertion is aligned, perhaps incorrectly, to a MATR3 processed pseudogene in chr1. -49,842,834 chr15:49,823,188 (protein-coding sequence). tandem repeats(purple),transposable elements(pink:forward-strand,blue:reverse-strand), green(exon)anddark t(7;15)(q21;q15) andt(9,14)(q21;q11.2) inPatient3.Theverticalstripesindicate annotationsinthereferencegenome: Dot-plot picturesoflamassemble consensussequencesalignedtothehumanreference genomethatexplaintranslocation Supplementary Figure6. Rearrangementgroupsdetectedintworeciprocal chromosomaltranslocationsinPatient3. -96,761,593 chr7:96,746,992 group8 group31 group17 -65,292,701 chr4:65,264,236

49,826,620 49,919,292 -49,939,827 chr15:49,917,860 24,267,175 -45,085,241 chr15:45,065,036 45,084,077 -24,269,054 chr14:24,243,015 65,290,822 96,748,157 -49,806,326 chr15:49,789,475 group23 -97,321,266 chr7:97,300,189 group10 -62,794,585 chr4:62,777,345 group19 50,520,367

-50,521,761 chr15:50,489,230 49,790,864

-98,147,629 chr7:98,124,535 97,319,745 98,126,049

-65,417,663 chr4:65,395,025 62,793,220 65,396,391 -98,182,717 chr7:98,161,804 -97,451,091 chr7:97,422,126 group3

49,824,619 group1 -49,825,918 chr15:49,809,110 -64,512,250 chr4:64,495,049 98,163,104

group20 97,449,143 -98,165,048 chr7:98,148,114 49,893,590

49,905,364 64,511,011 -24,284,750 chr14:24,265,935 -96,749,416 chr7:96,730,489 98,163,102 24,267,175 -49,907,310 chr15:49,891,643 group18

96,748,159 -50,536,698 chr15:50,519,138 50,520,398 -97,344,717 chr7:79,317,714 -78,142,250 chr9:78,119,394 group5 group7

-64,556,051 chr4:64,508,500 49,893,591 group9 -22,802,715 chr14:22,780,325 22,801,173 49,905,361 78,120,936 49,919,290 65,390,269 49,790,859

-65,392,791 chr4:65,367,029 64,511,025 49,789,339

-49,921,336 chr15:49,787,293 97,319,764 -78,122,328 chr9:78,101,295 group14 -62,803,722 chr4:62,792,309 group28 78,120,935

-22,819,542 chr14:22,799,784 22,801,178 65,290,820 chr15 chr14 chr9 chr7 chr4 -65,304,874 chr4:65,289,910 62,793,219 a deletion

chr7

SEM1 deletion

chr15

der chr15 b der chr7 SEM1

1.6

1.4

1.2

1.0

0.8

0.6 fold change

0.4

0.2

Patient3 Control1 Control2 Control3

Supplementary Figure 7. Genes at the rearrangement loci in Patient 3. a. Joined rearranged fragments from Patient 3 are aligned to the reference genome and shown by UCSC genome browser. Several genes are completely deleted or disrupted by breakpoints. b. RT-PCR of SEM1. Error bars represent standard deviation. 49,826,620 49,919,292 49,824,619 98,163,104

45,084,077 96,748,157

49,893,591

49,905,361

49,919,290 97,449,143 49,790,859 97,319,761 49,789,339 98,163,102 49,905,364 49,893,590 97,319,745 98,126,049 50,520,369 49,790,864 96,748,159 50,520,398 der chr15 der chr7 96,748,159 50,520,397 97,319,745 98,126,049 45,084,077 96,748,157 a b c

chr7 chr15 chr7 der7 der15 der15 chr15 chr7 chr7 97,449,143 98,163,102 49,905,364 49,893,590 d-1 d-2

chr7 chr7 der15 der15

chr15 chr15

49,893,591

49,905,361

49,919,290 49,790,859 97,319,761 49,789,339

e-1 e-2 e-3

chr7 chr15 chr15 der15 der15 der15 chr15 chr15 chr15

f g h 49,826,620 49,919,292 49,824,619 98,163,104 50,520,369 49,790,864

chr15 chr15 chr15 der15 der15 chr7 der15 chr15 chr7 62,793,219 64,511,011 24,267,175 24,267,175 65,290,822 64,511,025 65,390,269 65,290,820 62,793,220 65,396,391 78,120,935 22,801,178 22,801,173 78,120,936 der chr9 der chr4 der chr14 78,120,935 22,801,178 24,267,175 65,290,822 64,511,025 65,390,269 i j k

chr9 chr14 chr4 der9 der9 der9 chr14 chr4 chr4 64,511,011 24,267,175 62,793,219 65,290,820 62,793,220 65,396,391 m l n

chr4 chr4 chr4 der9 der9 der4 chr4 chr4 chr4 22,801,173 78,120,936

o chromosome 4 chromosome 7 chromosome 9 chr14 chromosome 14 der14 chr9 chromosome 15

Supplementary Figure 8. Sanger-sequence confirmation of the breakpoints. Sanger sequence confirmation of all 18 breakpoints in Patient 3. Electropherograms of the Sanger sequencing data of each breakpoint are shown with the schematic picture of the rearrangements. from theSanger-sequenceresults. dnarrangepredictions.Thereisusually0or1bpdifference Difference betweenSangersequence-confirmedbreakpointsand Supplementary Figure9.Accuracyofbreakpointprediction bydnarrangeconfirmedSangersequencing.

Difference from Sanger-confirmed breakpoint (bp) Patient 3

Tandem multiplication group12 group15 group16 group21 group27 group30 group32 chr6:17,336,996 -17,372,098 chr19:49,288,075 -49,314,402 chr10:15,168,530 -15,200,784 chr15:20,315,612 -20,342,908 chr4:47,960,536 -47,985,563 chr8:112,014,032 -112,035,707 chr15:64,147,030 -64,177,178

Tandem repeat expansion group25 group29 group33 chr1:201,242,094 -201,267,480 chr20:28,921,123 -28,941,892 chr5:145,172,901 -145,206,231

Non-tandem duplication Large tandem duplications Inversion group13 group42 group4 group34 chrX:126,415,871 -126,439,693 chr6:88,724,882 -88,726,037 chrX:120,141,977 -120,155,049 chr8:19,662,997 -19,736,705 chr1:48,466,452 -48,486,408

Deletions (or translocations) group2 group11 group24 group26 group41 chr4:19,058,323 -19,146,728 chr15:42,112,881 -42,156,099 chr7:35,630,680 -35,633,000 chr20:14,918,037 -14,965,376 chrX:27,402,272 -27,404,431 chr17:1,336,062 -1,390,058 chr5:114,983,887 -115,001,453 Patient 3

Retrotransposition

AluYb8 insertion L1HS insertion SVA insertion group43 group40 group6 group39 chr7:150,873,246 -150,886,803 chr9:35,780,879 -35,781,960 chr12:29,419,996 -29,421,329 chr7:10,463,158 -10,466,771 chr8:94,589,596 -94,593,063 chr19:13,701,155 -13,705,251 chr8:94,169,468 -94,213,271 chr20:2,822,744 -2,827,973 chr15:84,699,158 -84,738,412 chr3:187,781,911 -187,786,669 chr8:128,455,105 -128,459,325

Processed-pseudogene insertion nearby AluYa5 insertion group22 chr2:227,324,769 -227,358,328 chr15:93,290,000 -93,301,129 chr10:29,305,768 -29,307,060

Unclear group35 group36 group37 group38

transposable elements (reverse-strand) transposable elements chr18:57,703,671 -57,705,301 chr12:14,285,235 -14,310,724 chrX:34,160,496 -34,162,864 chr13:38,860,034 -38,867,819 chr1:87,824,388 -87,825,451 chr8:10,899,674 -10,901,103 chr14:92,455,238 -92,462,752 chr2:12,292,866 -12,297,176 chr6:153,179,598 -153,197,280 chr13:98,541,624 -98,544,409 chr19:38,605,175 -38,618,861 chr2:101,572,663 -101,575,951 chr3:94,120,628 -94,135,105 chr10:114,906,403 -114,910,533 (forward-strand) tandem repeat protein-coding sequence exons

Supplementary Figure 10. Other patient-only rearrangements in Patient 3. Other patient-only rearrangements in Patient 3. Dot-plot pictures of lamassemble consensus sequences are shown. The vertical stripes indicate annotations in the reference genome: tandem repeats (purple), transposable elements (pink:forward-strand, blue:reverse-strand), green (exon) and dark green (protein-coding sequence). Tandem multiplication Deletions (or translocations) Patient4 group1 group3 group4 group5 group8 chr8:6,221,967 -6,293,719 chr11:91,214,405 -91,230,378 chr11:91,313,237 -91,320,572 chr9:36,299,434 -36,348,231 chr2:45,717,142 -45,753,450 chr10:87,801,998 -87,869,155

Large tandem duplications Retrotransposition group7 group9 group13 full-length SVA_D insert group14 chr2:88,861,090 -88.868,137 chr6:62,794,028 -62,945,315 chr2:89,136,352 -89,143,044 chrX:137,568,183 -137,586,424 chrX:137,996,822 -138,020,522 chr8:94,645,307 -94,658,695 chr10:80,532,759 -80,536,153

Tandem repeat expansion Inversion Processed-pseudogene insertion group6 group10 group2 chr3:180,962,545 -180,976,888 chr8:123,378,000 -123,385,000 chr12:42,432,024 -42,433,060 chr3:9,580,993 -9,592,249 chr7:158,065,151 -158,078,809

Unclear FXR1 group11 group12

transposable elements (reverse-strand) chr12:70,473,577 -70,474,223 chr13:25,639,220 -25,639,643 chr19:15,491,975 -15,492,600 chr12:112,037,580 -112,038,298 transposable elements (forward-strand) tandem repeat protein-coding sequence exons

Supplementary Figure 11. Patient-only rearrangements in Patient 4. Other patient-only rearrangements in Patient 4. Patient 4 has a processed pseudogene insertion from exons of FXR1. Part of this insertion is aligned, perhaps incorrectly, to an FXR1 processed pseudogene in chr12. 100% L1HS 90% AluYa5 AluYb8 80% SVA ERV 70% others

60% unknown

50%

40%

30%

20%

10%

0% rmsk patients (n=4,808,373) (n=45)

Supplementary Figure 12. Active TEs are enriched in insertions. Proportions of various transposable element types in the reference genome (rmsk, https://genome.ucsc.edu) and in patient-specific insertions (patients). Supplementary Figure 13. Sketch of the information considered by dnarrange to judge whether two reads share a rearrangement. a

discarded

Supplementary Figure 14. Example of rearrangements filtered by -c1 option. a. Examples of rearranged reads, where we suspect the rearrangements to be artifacts of the sequencing process, which were excluded by running dnarrange with option –c1 (they are likely to be artifacts because almost the same length of reverse complimentary strand starts from the end of the other strand: we suspect this might be caused by the chimeric reads generated from the other strand of the nanopore DNA library). Three groups of rearranged reads are shown. b. In this group of rearranged reads, two reads with unique rearrangements were excluded by the –c1 option. Reference 1. Scherer, S. W. et al. Fine mapping of the autosomal dominant split hand/split foot locus on chromosome 7, band q21.3-q22.1. Am J Hum Genet 55, 12-20 (1994). 2. Crackower, M. A. et al. Characterization of the split hand/split foot malformation locus SHFM1 at 7q21.3-q22.1 and analysis of a candidate gene for its expression during limb development. Hum Mol Genet 5, 571-579, (1996). 3. Marchi, E., Kanapin, A., Magiorkinis, G. & Belshaw, R. Unfixed endogenous retroviral insertions in the human population. J Virol 88, 9529-9537, (2014). 4. Ewing, A. D. et al. Retrotransposition of gene transcripts leads to structural variation in mammalian genomes. Genome Biol 14, R22, (2013). 5. Tsuji, J., Frith, M. C., Tomii, K. & Horton, P. Mammalian NUMT insertion is non-random. Nucleic Acids Res 40, 9073-9088, (2012). 6. Bano, G., Mansour, S. & Nussey, S. The association of primary hyperparathyroidism and primary ovarian failure: a de novo t(X; 2) (q22p13) reciprocal translocation. Eur J Endocrinol 158, 261-263, (2008). 7. Nishimura-Tadaki, A. et al. Breakpoint determination of X;autosome balanced translocations in four patients with premature ovarian failure. J Hum Genet 56, 156-160, (2011). 8. Saitsu, H. et al. Characterization of the complex 7q21.3 rearrangement in a patient with bilateral split-foot malformation and hearing loss. Am J Med Genet A 149A, 1224-1230, (2009). 9. Saitsu, H. et al. Early infantile epileptic encephalopathy associated with the disrupted gene encoding Slit-Robo Rho GTPase activating protein 2 (SRGAP2). Am J Med Genet A 158A, 199-205, (2012). 10. Morgulis, A., Gertz, E. M., Schaffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics 22, 134-141, (2006). 11. Hamada, M., Ono, Y., Asai, K. & Frith, M. C. Training alignment parameters for arbitrary sequencers with LAST-TRAIN. Bioinformatics 33, 926-928, (2017). 12. Ewing, A. D. & Kazazian, H. H., Jr. High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20, 1262-1270, (2010). 13. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338-345, (2018). 14. International HapMap, C. et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58, (2010). 15. Genomes Project, C. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073, (2010).