COMPUTATIONAL STUDIES OF THE GENOME DYNAMICS OF MAMMALIAN TRANSPOSABLE ELEMENTS AND THEIR RELATIONSHIPS TO GENES

by

Ying Zhang

M.Sc., Katholieke Universiteit Leuven (BELGIUM), 2004

B.E., Harbin Institute of Technology (CHINA), 1993

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES

(Genetics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

May, 2012

© Ying Zhang, 2012

Abstract

Sequences derived from transposable elements (TEs) comprise roughly 40-50% of the genomic DNA of most mammalian species, including mouse and human. However, what impact they exert on their hosts remains an intriguing question. Originally considered merely genomic parasites or “selfish DNA”, these mobile elements show detrimental effects through a variety of mechanisms, from physical DNA disruption to epigenetic regulation. On the other hand, evidence has been mounting that TEs may sometimes also play important roles by participating in essential biological processes in the host. These dual roles make it critical to understand the relationship between TEs and the host, which may ultimately help us to better understand both normal cellular functions and disease.

This thesis encompasses my three genome-wide computational studies of TE dynamics in mammals. In the first, I identified high levels of TE insertional polymorphisms among inbred mouse strains, and systematically analyzed their distributional features and biological effects, by mining tens of millions of mouse genomic DNA sequences. In the second, I examined the properties of TEs located in introns, and identified key factors, such as the distance to the intron-exon boundary, insertional orientation, and proximity to splice sites, that influence the probability that TEs will be retained in genes. In the third, a study specifically focused on genes with extremely high or low TE content in three mammalian species, I showed associations between TE density and the function/conservation of genes, as well as the relevance of chromatin state to TE accumulation in genes.

While most of my results clearly support the idea that today’s TE distribution pattern is an outcome of natural selection or genetic drift during evolution, the final part of my work, which compares TE density to chromatin state in embryonic stem cells, suggests that traces of the initial integration preference of TEs still exist. Taken together, these results demonstrate the effects of both initial TE integration and natural selection in shaping the landscape of today’s mammalian genomes and, most importantly, shed light on the roles of mobile elements in evolution.

Preface

A version of Chapter 2 has been published: Ying Zhang, Irina A. Maksakova, Liane Gagnier, Louie N. van de Lagemaat, Dixie L. Mager (2008). “Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements.” PLoS Genet 4(2): e1000007. I designed all the computational methods for identifying mouse ERV polymorphisms, performed most computational data analyses, and wrote sections of the paper. I.A.M. and L.G. performed all the biological experiments for Figures 2.5 and 2.7, L.N.v.d.L. analyzed the data in Figure 2.6, and D.L.M. and I.A.M. wrote sections of the paper.

A version of Chapter 3 has been published: Ying Zhang, Mark T. Romanish, Dixie L. Mager (2011). “Distributions of transposable elements reveal hazardous zones in Mammalian introns.” PLoS Comput Biol 7(5): e1002046. I designed and performed all the computational data analyses, and wrote sections of the paper. M.T.R. performed the biological experiments for Figure 2.8. D.L.M. and M.T.R. wrote sections of the paper.

A version of Chapter 4 has been published: Ying Zhang and Dixie L. Mager (2012). “Gene properties and chromatin state influence the accumulation of transposable elements in genes.” PLoS One 7(1): e30158. I designed and performed all the computational data analyses, and wrote the paper. D.L.M. corrected and revised the paper.

The mouse work was covered by Animal Care Certificate A09-0372 issued by the UBC

Animal Care Committee.

Table of Contents

Abstract ...... ii

Preface ...... iv

Table of Contents ...... v

List of Tables ...... xi

List of Figures ...... xii

Acknowledgements ...... xv

Chapter 1: Introduction ...... 1

1.1 Mammalian Transposable Elements ...... 2

1.1.1 Long interspersed nucleotide elements (LINEs) ...... 6

1.1.2 Short interspersed nucleotide elements (SINEs) ...... 7

1.1.3 Long terminal repeat (LTR) retrotransposons ...... 8

1.1.4 DNA transposons ...... 10

1.2 Distribution of TEs in Mammalian Genomes ...... 11

1.2.1 TE distribution and the local G/C content ...... 12

1.2.2 TE orientation bias in genes ...... 13

1.2.3 TE density within genes ...... 14

1.3 Initial Integration Site Preference of TEs ...... 15

1.3.1 Integration site preference of retroviruses and ERVs ...... 16

1.3.2 Integration site preference of other TEs ...... 19

1.4 Activity and Polymorphism of TEs in Mammals ...... 20

1.4.1 TE activity and polymorphism in humans ...... 21

1.4.2 TE activity and polymorphism in mice ...... 24

1.4.3 TE activity and polymorphism in other mammals ...... 26

1.5 Effects of TE Integration in the Host Genome ...... 29

1.5.1 TE-mediated physical damage at the DNA level ...... 29

1.5.2 Transcriptional influence of TE sequences on host genes ...... 32

1.5.3 Epigenetic Effects of TEs ...... 36

1.6 Transposable Elements and Host Evolution...... 39

1.6.1 TEs vs. the host genome: an everlasting battle ...... 39

1.6.2 TE exaptation: turning “junk” into “gold” ...... 43

1.7 Thesis Objectives ...... 47

Chapter 2: Identification and Investigation of ERV Polymorphisms in Mice ...... 50

2.1 Background ...... 51

2.2 Results and Discussion ...... 53

2.2.1 Prevalence of ETn/MusDs and IAPs in different strains ...... 53

2.2.2 Identification and frequency of polymorphic ERVs ...... 54

2.2.3 Genic distribution patterns of the youngest ERVs are distinct from older

elements ...... 57

2.2.4 Confirmation of polymorphic ERVs in gene introns ...... 58

2.2.5 Potential effects mediated by polymorphic ERVs ...... 63

2.3 Concluding Remarks ...... 71

2.4 Materials and Methods ...... 72

2.4.1 Source data ...... 72

2.4.2 Design of ERV probes and detection of ERVs in the assembled B6 genome .... 73

2.4.3 Detection of ERV insertions in test strains ...... 74

2.4.4 Determining the polymorphism status of ERVs present in B6 or in the test

strains ...... 75

2.4.5 B6 trace sampling and screening simulations ...... 76

2.4.6 Experimental verification of polymorphic insertions ...... 77

2.4.7 RT-PCR...... 78

2.4.8 Northern blotting ...... 79

2.4.9 Microarray analysis ...... 79

2.4.10 PolyERV custom tracks for Genome Browser ...... 80

Chapter 3: TE Distributions Reveal Hazardous Zones and Residential Preferences in

Gene Introns ...... 81

3.1 Background ...... 82

3.2 Results and Discussion ...... 83

3.2.1 Intronic regions near exon boundaries are depleted of TE insertions ...... 83

3.2.2 TEs within their U-Zones are shorter ...... 86

3.2.3 TEs near exons exhibit strong orientation and splice-site bias ...... 88

3.2.4 A high fraction of known mutagenic intronic TEs reside within U-zones ...... 91

3.2.5 Polymorphic LTR elements in mice show an intermediate distribution pattern 96

3.2.6 Chimeric transcripts and cryptic splice signals differ within and outside the U-

zone ...... 97

3.2.7 Abnormal gene splicing linked to polymorphic LTR element insertions near

intron boundaries ...... 99

3.3 Concluding Remarks ...... 103

3.4 Materials and Methods ...... 104

3.4.1 Source Data ...... 104

3.4.2 Computer Simulation of Random TE insertions ...... 105

3.4.3 Normalization of intronic TE distribution by G/C content ...... 106

3.4.4 Calculation of the standardized TE frequency levels ...... 106

3.4.5 Computational analysis of potential splice sites in LTRs ...... 107

3.4.6 RT- and qRT-PCR of polymorphic LTR insertions in mouse genes ...... 108

Chapter 4: Gene Properties and Chromatin State Influence the Accumulation of TEs in Genes ...... 110

4.1 Background ...... 111

4.2 Results and Discussion ...... 112

4.2.1 Determining sets of orthologous genes with the same extreme of TE density . 112

4.2.2 Chromosomal distribution and TE composition of shared outlier genes ...... 114

4.2.3 Extreme TE density is associated with the function and conservation level of

genes ...... 117

4.2.4 Binding of Polymerase-II at gene promoters reveals association between TE-

content and tissue-specificity of genes ...... 120

4.2.5 Histone marks at promoters confirm the overall open status of SUO genes in

ESCs ...... 123

4.2.6 Expression data of SUO and SLO genes in ESCs support the ‘open chromatin

status’ hypothesis ...... 125

4.2.7 Gene function and chromatin status both contribute to the low TE density of

SLO genes ...... 127

4.3 Concluding Remarks ...... 129

4.4 Materials and Methods ...... 131

4.4.1 Selection of source datasets ...... 131

4.4.2 Identification of genes with extreme TE densities ...... 131

4.4.3 Normalization of gene size and exon density ...... 132

4.4.4 Identification of shared-outlier genes ...... 133

4.4.5 Classification of gene conservation levels ...... 133

4.4.6 Comparison of the average TE age between SUO and SLO genes ...... 133

4.4.7 Examination of the tissue specificity of Polr2a binding at SUO/SLO promoters

...... 134

4.4.8 Statistical tests ...... 134

Chapter 5: Thesis Summary and Conclusions ...... 135

5.1 Thesis Summary ...... 136

5.1.1 Which ERVs are polymorphic in mice? ...... 136

5.1.2 What features of TE insertions are linked to their residential probability in

genes? ...... 137

5.1.3 Why are some genes highly enriched in TEs while others are depleted of TEs?

...... 138

5.2 Conclusions ...... 139

References ...... 144

Appendices ...... 163

Appendix A Supporting information of Chapter 2 ...... 164

A.1 Supplementary figures ...... 164

A.2 Supplementary tables ...... 172

A.3 Accession numbers ...... 190

Appendix B Supporting information of Chapter 3 ...... 191

B.1 Supplementary figures ...... 191

B.2 Supplementary tables ...... 196

Appendix C Supporting information of Chapter 4 ...... 198

C.1 Supplementary figures ...... 198

C.2 Supplementary tables ...... 207

List of Tables

Table 1.1 Genomic compositions of major TE families in human, mouse and cow ...... 5

Table 2.1 Genomic PCR verification of ETn/MusD intronic insertions in different mouse strains ...... 62

Table 3.1 Intronic underrepresentation zones by TE type ...... 84

Table 3.2 Intronic distributional biases of mutagenic, polymorphic, and all full-length TEs 94

Table A.1 Details of ERV probes ...... 172

Table A.2 Polymorphic IAP cases in genes ...... 173

Table A.3 Polymorphic ETn/MusD cases in genes ...... 186

Table A.4 Primer sequences ...... 188

Table B.1 Mutagenic intronic Alu insertions in humans ...... 196

Table B.2 Mutagenic intronic L1 insertions in humans ...... 196

Table B.3 Mutagenic intronic ERV insertions in mice ...... 196

Table C.1 SLOs identified among human, mouse and cow ...... 207

Table C.2 SUOs identified among human, mouse and cow ...... 215

Table C.3 Overrepresentation of GO terms for SLOs (BiNGO results) ...... 219

Table C.4 Overrepresentation of GO terms for SUOs (BiNGO results) ...... 223

Table C.5 The optimization table for outlier threshold selection ...... 225

Table C.6 Number of TUs/genes derived in each step of SUO/SLO calculation ...... 226

List of Figures

Figure 1.1 Composition of TEs in the human genome ...... 2

Figure 1.2 The four major types of TEs in mammals...... 3

Figure 1.3 Target integration site preferences of HIV, MLV and ASLV...... 17

Figure 1.4 Transcriptional influences of TEs...... 35

Figure 2.1 Screening strategy for detection of polymorphic ERV insertions...... 55

Figure 2.2 Fractions of polymorphic ERVs based on the four strains...... 57

Figure 2.3 Distributions of young versus older ERV elements with respect to genes...... 59

Figure 2.4 An apparent heterozygous ETn insertion in the Sytl3 gene in an A/J mouse...... 61

Figure 2.5 Detection of ETn-gene chimeric transcripts...... 65

Figure 2.6 Normalized, tissue-averaged expression of Dnajc10 across strains...... 67

Figure 2.7 Transcript levels of Opcml in A/J versus B6...... 69

Figure 3.1 Intronic distributions of the four major TE types in human (normalized)...... 86

Figure 3.2 Average size of human TEs within and outside the U-zone...... 87

Figure 3.3 Distributional biases of full-length human intronic TEs...... 88

Figure 3.4 Orientation bias of human full-length intronic TEs based on their proximity to different types of splice sites...... 90

Figure 3.5 Comparisons of TE frequency within the U-zone...... 93

Figure 3.6 Chimeric transcripts and cryptic splice signals of TEs within and outside the U-zone...... 99

Figure 3.7 Chimeric transcripts of the Trpc6 gene and Kcnh6 gene in mice...... 101

Figure 3.8 Effect of polymorphic LTR element insertions on transcription of the Trpc6 and Kcnh6 genes...... 102

Figure 4.1 TE composition patterns of SUO genes...... 116

Figure 4.2 TE density and conservation level of outlier genes...... 119

Figure 4.3 Tissue-specificity of Polr2a binding at outlier genes...... 122

Figure 4.4 Histone marks at promoters of shared outlier genes...... 124

Figure 4.5 Gene expression data analyses for SUOs and SLOs in mouse ESCs...... 126

Figure 4.6 Effect-evaluation matrix for gene function and chromatin status of SLOs...... 128

Figure A.1 Mouse trace sequence archive composition (May 2007)...... 164

Figure A.2 Design of probes...... 165

Figure A.3 Estimation of fractional loss of detection of ERVs using sequence traces...... 166

Figure A.4 Splice sites in ETn elements detected in chimeric transcripts...... 167

Figure A.5 Comparative microarray expression data of Dnajc10 in different strains...... 168

Figure A.6 Comparative microarray expression data of Opcml in different strains...... 169

Figure A.7 Semi-quantitative RT-PCR of Opcml in A/J versus B6...... 170

Figure A.8 UCSC Genome Browser screen shot of the PolyERV track...... 171

Figure B.1 Intronic distributions of the four major TE types in human (non-normalized). . 191

Figure B.2 Intronic distributions of the four major TE types in mouse (normalized)...... 192

Figure B.3 Average size of mouse TEs within and outside the U-zone...... 193

Figure B.4 Distributional biases of mouse full-length intronic TEs...... 194

Figure B.5 Orientation bias of mouse full-length intronic TEs based on proximity to different types of splice sites...... 195

Figure C.1 TE density distribution of human genes...... 198

Figure C.2 The relationship between gene size and exon density in human...... 199

Figure C.3 Identification of outlier genes by controlling gene size and exon density...... 200

Figure C.4 Chromosomal distribution of SUOs and SLOs in human...... 201

Figure C.5 The relationship between the number of SUOs/SLOs on human chromosomes and the chromosome size...... 202

Figure C.6 Correlation analysis of the TE composition of SUOs between human and mouse.

...... 203

Figure C.7 Correlation analyses of G+C content vs. LINE/SINE composition of SUOs in human...... 204

Figure C.8 Tissue-type composition of tissue-specific outlier genes...... 205

Figure C.9 Histone marks at promoters of all outlier genes...... 206

Acknowledgements

I wish to thank the many people who have contributed, in various ways, to the research work presented here. Their kind assistance and many contributions have been essential to my

completion of this thesis.

Many thanks go to my supervisor, Dr. Dixie Mager, who gave me the great opportunity to undertake the research in this fascinating field of biological science. At the beginning,

coming from a computer science/engineering background, I knew very little about molecular

biology. It was Dr. Mager’s great patience and supportive guidance that gave me the

confidence to overcome numerous difficulties encountered along the way. I thank her not only for the knowledge of biology that she imparted, but also for helping me to think as a real biologist. Her tutelage will benefit me throughout my career as a scientist.

Furthermore, I express here my deep appreciation to everyone who has been involved in

this thesis work. This includes Dr. Mark Wilkinson, Dr. Ryan Brinkman and Dr. Robert Kay from my thesis committee, who gave me many insightful ideas, not only on technical details but also on strategic guidelines. Also included are the many current and past lab

colleagues involved in my thesis projects, who significantly contributed their time and skills

in helping design and conduct related experiments. In particular, I express my appreciation

to Dr. Louie van de Lagemaat, my lab mentor in bioinformatics, who helped me to adapt

quickly to this new field of research during the early days of my graduate study. I also thank

my friend Blair Shakell, who offered his precious time in reading my thesis and correcting

errors. My thesis work has been generously funded by the Canadian Institutes of Health Research, with core support provided by the BC Cancer Agency, as well as fellowship support from the Natural Sciences and Engineering Research Council of Canada. I thank all of these organizations for their support.

Lastly, I thank my family and friends for being so supportive through all my years of study. I especially thank my dear wife Yuan Xue, my best friend and the real hero behind any of my little successes and achievements. Without her love, sacrifice and forgiveness, it would not have been possible for me to achieve what I have. I thank my parents, Youfang Zhang and Ping Jin, who always gave me their greatest support, encouragement and understanding, and by so doing, enabled me to turn my childhood interest in science into a real-life career. I thank my parents-in-law, Yi Xue and Shuying Zhang, for so diligently helping to take care of our family while I was working on my thesis. And finally, I thank my dear Gavin, among the most important persons in my life, for gifting me with the extraordinary joy and great happiness of being his father and watching him grow up day by day, which has been so amazing and rewarding.

Chapter 1: Introduction

1.1 Mammalian Transposable Elements

Transposable elements (TEs), also known as transposons, are DNA fragments that can move from one genomic location to another and proliferate within the genome as molecular hitchhikers. Because of this mobility, they are sometimes also referred to as “jumping genes”. Since the identification of TEs in maize by Barbara McClintock more than half a century ago (McClintock, 1950), these mobile DNA sequences have been found in almost all living organisms from bacteria to humans. Sequencing of the human genome has revealed, for example, that at least 45% of the total human genomic DNA is derived from TEs (Figure

1.1) (Lander et al., 2001), which can be divided into four major types: the long interspersed nucleotide element (LINE; Figure 1.2A), the short interspersed nucleotide element (SINE;

Figure 1.2B), the long terminal repeat (LTR) retrotransposon (Figure 1.2C), and the DNA transposon (Figure 1.2D) (Smit, 1996). The first three types, known collectively as retrotransposons or Class I transposons, account for most TEs in mammalian genomes and utilize an RNA intermediate during their retrotransposition, a hallmark “copy-and-paste” translocation process of TEs (see below for details). Unlike retrotransposons, DNA transposons (Class II transposons) proliferate in the host genome without an RNA intermediate. Instead, most of them move directly to new genomic loci in a “cut-and-paste” manner, without leaving the original copy at the donor site.

Figure 1.1 Composition of TEs in the human genome.

Figure 1.2 The four major types of TEs in mammals. The molecular structures of A) LINE, B) SINE, C) LTR retrotransposon and D) DNA transposon are shown. Note that target-site direct repeats (TSDs) are not part of the TE. EN – endonuclease; RT – reverse transcriptase; C – C-terminal domain; A(n) – poly-A tail; TSD – target site direct repeat; LTR – long terminal repeat; IR – inverted repeat.
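The difference between the “copy-and-paste” and “cut-and-paste” modes described above can be illustrated with a toy model. The sketch below is only an illustration under strong simplifying assumptions: the genome is a list of labelled segments, and the function names `copy_and_paste` and `cut_and_paste` are hypothetical helpers introduced here, not part of any analysis in this thesis.

```python
# Toy model only: a genome is a list of labelled segments and "TE" marks one
# transposable element. The enzymes and sequence context involved in real
# transposition are deliberately ignored.

def copy_and_paste(genome, donor_index, target_index):
    """Retrotransposon-like move: the donor copy stays in place."""
    element = genome[donor_index]
    new_genome = list(genome)
    new_genome.insert(target_index, element)
    return new_genome


def cut_and_paste(genome, donor_index, target_index):
    """DNA-transposon-like move: the donor copy is excised first."""
    new_genome = list(genome)
    element = new_genome.pop(donor_index)
    # After excision, positions to the right of the donor shift left by one.
    if target_index > donor_index:
        target_index -= 1
    new_genome.insert(target_index, element)
    return new_genome


genome = ["a", "TE", "b", "c"]
copied = copy_and_paste(genome, donor_index=1, target_index=3)
moved = cut_and_paste(genome, donor_index=1, target_index=3)
print(copied)  # the element now occurs twice
print(moved)   # the element occurs once, at its new position
```

In this simplified picture, only the copy-and-paste mode increases the copy number with each transposition event, whereas a cut-and-paste event relocates the single existing copy.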

Most mammals contain all four types of TEs mentioned above. Each type can be further divided into different families or clades, which show distinct genomic compositions in different host species. For example, about 83% of all human LINEs belong to the relatively young L1 family, some copies of which are still active in humans. In some other mammals such as the cow, however, only about half of LINEs are L1s, while most of the remainder belong to the retrotransposable element (RTE) family (Elsik et al., 2009). Similarly, about 80% of SINE elements in humans are from the Alu family, which exists only in primates (Batzer and Deininger, 2002). As a simplified overview of the TE composition in different mammals, Table 1.1 lists the genomic compositions of major TE families present in human, mouse and cow.

As molecular symbionts residing in the host genome, TEs have coevolved with their host species for many millions of years. Ancient elements have been utilized by biologists as fossils to calibrate the neutral rate of nucleotide divergence (Lander et al., 2001), while insertions of “modern” or still active elements are often studied for their effects on genes (see

Section 1.5 below). Since different types of TEs show distinct characteristics in terms of their sequence, structure, age, life cycle, as well as distribution in the host genome, it is important to study them separately, based on their type or family.

In this introductory chapter, I describe briefly some essential features of each major TE type in mammals, and summarize in greater detail the known information in terms of TE genomic distributions, integration site preferences, activities and polymorphisms, biological/molecular effects, and influences on host evolution. At the end of this chapter I provide an overview of my thesis studies on the genome dynamics of TEs in mammals, which themselves are discussed more thoroughly in the following chapters.

Table 1.1 Genomic compositions of major TE families in human, mouse and cow

                Human (1)                 Mouse (2)                 Cow (3)
Type/Family     Copy number  Genomic      Copy number  Genomic      Copy number  Genomic
                (x1000)      coverage (%) (x1000)      coverage (%) (x1000)      coverage (%)
LINEs           868          20.42        660          19.20        1139         23.29
  L1            516          16.89        599          18.78        616          11.26
  L2            315          3.22         53           0.38         132          1.18
  L3/CR1        37           0.31         1.2          0.05         -            -
  RTE (BovB)    -            -            -            -            376          10.74
SINEs           1558         13.14        1498         8.22         2883         17.66
  B1 (Alu)      1090         10.60        564          2.66         -            -
  B2            -            -            348          2.39         -            -
  B4/RSINE      -            -            391          2.36         -            -
  ID            -            -            79           0.25         -            -
  MIR/MIR3      468          2.54         115          0.57         301          1.39
  BOV-A2        -            -            -            -            378          2.36
  Bov-tA        -            -            -            -            1462         7.73
  ART2A         -            -            -            -            349          4.18
  tRNA          -            -            -            -            389          1.99
LTR elements    443          8.29         631          9.87         311          3.20
  ERV class I   112          2.89         34           0.68
  ERV class II  8            0.31         127          3.14
  ERV class III 83           1.44         37           0.58
  MaLR          240          3.65         388          4.82
DNA elements    294          2.84         112          0.88         244          1.96
Total                        44.69                     38.17                     46.11

1. Data taken from (Lander et al., 2001).
2. Data taken from (Waterston et al., 2002).
3. Data taken from (Elsik et al., 2009).
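As a quick arithmetic check, the family shares quoted in the text before Table 1.1 (about 83% L1 among human LINEs, about 80% Alu among human SINEs, and roughly half L1 among cow LINEs) can be recomputed from the table. The sketch below assumes that these shares are ratios of genomic coverage rather than of copy number; that reading is an inference from the numbers, not something stated in the text, and the helper name `share` is hypothetical.

```python
# Genomic coverage (%) values copied from Table 1.1.
coverage = {
    "human": {"LINEs": 20.42, "L1": 16.89, "SINEs": 13.14, "Alu": 10.60},
    "cow": {"LINEs": 23.29, "L1": 11.26},
}


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


human_l1 = share(coverage["human"]["L1"], coverage["human"]["LINEs"])
human_alu = share(coverage["human"]["Alu"], coverage["human"]["SINEs"])
cow_l1 = share(coverage["cow"]["L1"], coverage["cow"]["LINEs"])

print(f"L1 share of human LINE coverage:  {human_l1:.1f}%")
print(f"Alu share of human SINE coverage: {human_alu:.1f}%")
print(f"L1 share of cow LINE coverage:    {cow_l1:.1f}%")
```

By copy number the human L1 share would be noticeably lower (516 of 868 thousand copies, about 59%), which is why the coverage-based reading appears to be the one that matches the text.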

1.1.1 Long interspersed nucleotide elements (LINEs)

As the most prevalent TEs in humans, LINEs cover more than 20% of our entire genome

(Lander et al., 2001). Although several LINE families exist in mammals based on sequence similarity (Table 1.1), many of them show similar molecular structures and life cycles.

Generally, LINEs are autonomous non-LTR retrotransposons, which means that they encode their own proteins required for retrotransposition. For example, a full-length copy of L1, the most abundant and active LINE family in humans, is about 6 kb long and carries an internal RNA polymerase II (Pol-II) promoter in the 5′ untranslated region (5′ UTR), two open reading frames (ORFs), and a short 3′ UTR immediately followed by a poly(A) tail (Figure 1.2A). L1 ORF1 encodes a 40-kDa protein with both RNA-binding and protein-protein interaction domains, but the specific roles of this protein in L1 retrotransposition are still not fully understood. L1 ORF2 encodes a 150-kDa protein that contains an endonuclease (EN) domain, a reverse transcriptase (RT) domain, and a zinc knuckle-like domain (C-terminal).

During retrotransposition, the EN domain cleaves the genomic DNA at the target site, while the RT domain plays a key role in converting the L1 RNA into complementary DNA

(cDNA) through a process known as target primed reverse transcription (TPRT) (Ostertag and Kazazian, 2001). The role of the ORF2 C-terminal zinc knuckle-like domain is not fully clear, but studies of similar domains in retroviral nucleocapsid proteins suggested that it might bind to the L1 RNA during the formation of retrotransposition intermediates and facilitate unfolding of structured RNA (Ostertag and Kazazian, 2001).

The TPRT step in the life cycle of non-LTR retrotransposons such as L1s takes place in the nucleus. First, a full-length active L1 element is transcribed from its internal 5′ promoter. After the L1 RNA is exported from the nucleus into the cytoplasm, the ORF1 and ORF2 proteins (ORF1p and ORF2p) are produced from it and, by interacting in cis with the L1 RNA itself, form an intermediate ribonucleoprotein (RNP) complex. In the next step, this RNP complex is imported back into the nucleus, where TPRT is completed with the help of the EN and RT domains of ORF2p. During this process, the L1 RNA is reverse transcribed to cDNA and integrated into the host genome. Due to the staggered cleavage of the target DNA by the EN domain at the new L1 integration site, a perfect pair of target site duplications (TSDs), ranging in length from 7 to 20 bp, is created flanking the newly generated copy of L1 (Ostertag and Kazazian, 2001).
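The way a staggered cut produces flanking target site duplications can be sketched at the sequence level. The following is a minimal illustration, not a model of the TPRT biochemistry: it assumes the two nicks are `stagger` bases apart on the target, so that the bases between them end up on both sides of the new insert after gap repair. The function name `insert_with_tsd`, the example sequences and the chosen positions are all hypothetical.

```python
def insert_with_tsd(target, element, cut_position, stagger):
    """Insert `element` into `target` after a staggered cut.

    The `stagger` bases starting at `cut_position` lie between the two nicks
    and are therefore present on both sides of the insert: these bases form
    the target site duplication (TSD).
    """
    if stagger < 1 or not 0 <= cut_position <= len(target) - stagger:
        raise ValueError("cut_position and stagger must fit inside the target")
    tsd = target[cut_position:cut_position + stagger]
    upstream = target[:cut_position + stagger]  # ends with the TSD
    downstream = target[cut_position:]          # starts with the TSD
    return upstream + element + downstream, tsd


target = "ACGTTTAAGCCGTACGGATC"
element = "[L1-copy]"  # placeholder standing in for the new L1 sequence
product, tsd = insert_with_tsd(target, element, cut_position=5, stagger=7)
print(tsd)
print(product)
```

The stagger of 7 used here lies at the lower end of the 7 to 20 bp range quoted above; the resulting sequence is longer than the target by the element length plus one TSD length.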

Notably, the vast majority of L1s in the human genome are 5′-truncated, most likely due to incomplete reverse transcription of the template L1 RNA during TPRT (Lander et al., 2001). Moreover, since L1s carry only a weak polyadenylation (poly(A)) signal, transcription of the L1 element very often bypasses its own poly(A) signal, so that the transcriptional machinery keeps moving downstream until it reaches the next poly(A) signal. Thus, the accidentally transcribed downstream sequence can be copied into a new L1 integration site. This process is called “L1-mediated transduction” (Moran et al., 1999).

1.1.2 Short interspersed nucleotide elements (SINEs)

SINE elements, as indicated by the name, are much shorter than other TEs, a full-length element typically being less than 500 bp. Due to their limited size, SINEs do not usually encode any proteins, but instead are thought of as hitchhikers, which retrotranspose only with the help of autonomous TEs such as LINEs (Ohshima and Okada, 2005). For example, Alu SINEs, the most abundant TE family in the human genome (more than one million copies; see Table 1.1), share structural features with L1s, such as the poly(A) tail at the 3′ end (Figure 1.2B), which might be critical during the TPRT process. Furthermore, the coincidence of bursts of amplification of both Alus and L1s in primates during the past 150 million years also suggests that Alus might have utilized the L1 endonuclease and reverse transcriptase during their retrotransposition. Indeed, a well-designed transposition assay of marked Alus in human HeLa cells clearly demonstrated that the L1 ORF2p protein and the poly(A) tail of the Alu sequence are both essential for active Alu retrotransposition (Dewannieux et al., 2003), confirming the role of the Alu SINE as a “parasite of a parasite”.

While most SINE families originated from cellular tRNAs, human Alus differ from the others in deriving ancestrally from the 7SL RNA gene (Batzer and Deininger, 2002), which encodes a small RNA that forms part of the signal recognition particle. Like L1 elements, Alus contain an internal 5′ promoter, but they are usually transcribed by RNA polymerase III (Pol-III) instead of Pol-II. However, the internal promoter of Alu is generally too weak to initiate transcription of Alu RNAs effectively by itself, and upstream sequences adjacent to the integration site could be very important for active Alu transcription (Ullu and Weiner, 1985). This feature has likely greatly reduced the number of retrotransposition-competent Alus in host genomes.

1.1.3 Long terminal repeat (LTR) retrotransposons

Like non-LTR retrotransposons, donor LTR elements also amplify in the host genome via an intermediate RNA template in a “copy-and-paste” manner. These repeats, however, contain a hallmark structure known as the “long terminal repeat” or LTR at both the 5′ and 3′ ends of the integrated DNA sequence (Figure 1.2C). The two identical LTRs flanking these elements are, in fact, products of a sophisticated reverse-transcription process conducted in the cytoplasm rather than in the nucleus. Briefly, after an LTR retrotransposon is transcribed from its own promoter located within the 5′ LTR, the resulting RNA is exported into the cytoplasm, where a cellular tRNA molecule that is complementary to a special region called the “primer binding site” (PBS), at the 5′ end of the internal sequence, is used to initiate reverse transcription. Since cDNA synthesis does not start from the beginning of the LTR retrotransposon, completing the entire reverse transcription process without losing sequence information requires a series of biochemical reactions, which are not discussed here (for a detailed review, see Telesnitsky and Goff, 1997).

Interestingly, the LTR structure and the cytoplasmic reverse transcription process of LTR retrotransposons closely resemble those of infectious retroviruses such as the murine leukemia virus (MLV) and the human immunodeficiency virus (HIV). Moreover, many LTR retrotransposons encode at least two ORFs that are also found in exogenous retroviruses (Figure 1.2C): gag, which produces proteins involved in viral particle assembly, and pol, which encodes key enzymes such as the reverse transcriptase (RT) and integrase (IN) that are crucial during the reverse-transcription and integration process (Boeke and Stoye, 1997).

In fact, it has been postulated that in certain circumstances some LTR retrotransposons may capture an envelope gene and thus become able to infect other cells, giving rise to true exogenous retroviruses (Eickbush and Jamburuthugoda, 2008; Gifford and Tristem, 2003).

Other evidence indicates that some exogenous retroviruses have lost (at least the function of) their envelope genes and, as a result, their ability to enter other cells. When these retroviruses are trapped in host germ cells such as the sperm or egg (or other germ cell progenitors), they can be vertically passed to the next host generation as part of the genome, a process known as “endogenization”. For this reason, these germline-trapped retroviruses are also known as endogenous retroviruses or ERVs. Notably, there is also a substantial number of LTR retrotransposons in mammals that do not show any similarity to retroviral genes (e.g. the mammalian apparent LTR retrotransposons (MaLRs) and the early transposons (ETns)), yet maintain the same LTR structures as ERVs. It is likely that these non-coding LTR retrotransposons have amplified in the genome by hijacking the reverse transcription machinery provided by autonomous ERVs that encode competent gag and pol proteins for retrotransposition. An example of this phenomenon is the family of non-autonomous ETn elements in mice and their autonomous counterparts, the Mouse type-D endogenous proviruses (MusDs) (Mager and Freeman, 2000; Ribet et al., 2004).

Since the origins of LTR-like structures in mammalian genomes are not always clear, the terms “LTR element”, “LTR retrotransposon” and “ERV” are used interchangeably in the field and in this thesis.

1.1.4 DNA transposons

DNA transposons, or Class II transposons (Figure 1.2D), differ fundamentally from all retrotransposons discussed above in that they transpose without an RNA intermediate. Many DNA transposons, in fact, move from one genomic location to another as double-stranded DNA (dsDNA): the original element is excised by a DNA transposase (an enzyme encoded by DNA transposons) and integrates back into the host genome at a different location. Not all DNA transposons use this “cut-and-paste” strategy for transposition, however. Helitrons, first identified in the thale cress Arabidopsis thaliana, use a “rolling-circle” replication mechanism that involves the displacement of a single-stranded DNA (ssDNA) intermediate to increase their genomic copy numbers (Kapitonov and Jurka, 2007). Polintons (also named “Mavericks”), another type of DNA transposon, are very large elements, typically 15 to 20 kb in length, that even encode their own DNA polymerase for element duplication (Kapitonov and Jurka, 2006; Pritham et al., 2007). Since the numbers of both Helitrons and Polintons are negligible in most mammals, I focus only on classical “cut-and-paste” DNA transposons in this thesis.

1.2 Distribution of TEs in Mammalian Genomes

Significant advances in whole genome sequencing techniques during the past decade have provided unprecedented opportunities for researchers to study properties of TEs and their relationships to the host genome on a genome-wide scale. Analysis of the many assembled genomes available today has revealed that the abundance of TEs varies greatly between different species. In the puffer fish Fugu rubripes, for example, only 2.7% of the genome matches TE sequences (Aparicio et al., 2002), whereas in the B73 Maize (Zea mays), the genomic coverage of TEs reaches 85% (Schnable et al., 2009). In most mammals, the abundance of TEs lies in between the above two extremes, typically with a genomic coverage of 40 - 50% as seen in humans. Interestingly, genome-wide analyses of TE distributions in mammals, as well as in many other species, have shown apparent non-random patterns, which are likely consequences of at least three independent mechanisms: the initial integration site preference, genetic drift, and natural selection. Therefore, studying the genomic distributions of TEs becomes a powerful tool in understanding the biological interactions between these mobile elements and their host genomes through evolution.

1.2.1 TE distribution and the local G/C content

At the beginning of the 21st century, both the human and the mouse genomes were fully sequenced and assembled (Lander et al., 2001; Waterston et al., 2002), providing excellent platforms for researchers to generate a comprehensive view of TE-host relationships based on the distributions of different cohorts of TEs in two distantly related mammals. For example, the distributions of LINEs and SINEs correlate strongly with local G/C content in both genomes. Specifically, LINEs are typically enriched in A/T-rich regions, whereas their non-autonomous partners, SINEs, are significantly overrepresented within G/C-rich regions. While the reasons behind this surprising difference are still under debate, several possible explanations have been proposed and are discussed below.

First of all, since the preferred target integration site of L1 is A/T-rich (TTTT/AA), the initial target site preference seems to be the most straightforward and reasonable explanation for the enrichment of LINEs in A/T-rich regions. However, the high occurrence of SINEs in G/C-rich regions is puzzling. Since SINEs share similar base-composition patterns with LINEs near the target site and are likely assisted by LINE proteins during retrotransposition (Jurka, 1997), it is unlikely that SINEs have an opposite integration site preference.

The second possible explanation is selection, either purifying selection that clears SINEs from A/T-rich regions or positive selection that retains them in G/C-rich regions. Since most A/T-rich DNA in the genome is gene-poor (Lander et al., 2001) and tolerates the accumulation of most other TEs well, it is unlikely that stronger selection pressure acts to eliminate SINEs from such regions. Alternatively, high G/C content is known to correlate positively with gene density (Lander et al., 2001), and the enrichment of SINEs in G/C-rich regions could be due to their possible beneficial roles when located near or within genes. Indeed, evidence has shown that SINE RNAs can promote cellular protein translation by acting as an inhibitor of the eIF2 kinase PKR and thus might benefit host cells under stress (Schmid, 1998). Since SINE RNAs function here directly at the RNA level, the accumulation of SINEs in the more readily transcribed gene-rich regions could allow a faster cellular response (Lander et al., 2001). However, this model is based mostly on speculation, and measuring the degree of such effects could be difficult.

Lastly, spontaneous loss of SINEs in A/T-rich (gene-poor) regions may also explain this phenomenon. In gene-poor regions, such loss could be neutral, with a loss rate that correlates positively with SINE density. In gene-rich regions, by contrast, negative selection against deleterious effects on nearby genes may slow the rates of both SINE accumulation and recombination/deletion, resulting in a relatively higher SINE density in G/C-rich regions (Medstrand et al., 2002).

Although each of the above explanations has its own supporting evidence, the real situation could be far more complex with a combination of multiple factors, and more computational and experimental studies are needed to gain a deeper understanding of such effects.

1.2.2 TE orientation bias in genes

Even before the publication of the human genome sequence, it had been noticed that the majority of intronic TEs (especially LINEs and LTR elements) are in antisense orientation with respect to the enclosing gene (Smit, 1999). Subsequently, the availability of whole-genome sequence data for many species confirmed this observation (Cutter et al., 2005; Lander et al., 2001; Medstrand et al., 2002; Waterston et al., 2002). Limited experiments measuring de novo integration patterns, however, show no evidence of any initial orientation bias when exogenous retroviruses or ERV/LTR retrotransposons integrate into the genome (Dewannieux et al., 2004; Gasior et al., 2007; Ribet et al., 2004; Schroder et al., 2002). These results suggest strong selection pressure against sense-oriented TE insertions within genes, very likely due to the cryptic regulatory signals these elements carry in their sequence (Smit, 1999; van de Lagemaat et al., 2006). For example, substantial evidence has shown that

intronic TE insertions can cause or premature polyadenylation, or used as

alternative promoters to produce aberrant transcripts of the host gene (see Section 1.5 of this chapter). Notably, most de novo intronic TE insertions that do cause mutations or diseases are, indeed, in the same orientation as the enclosing gene (van de Lagemaat et al., 2006), suggesting a much higher chance of a detrimental effect when TEs integrate in the sense orientation.

1.2.3 TE density within genes

When the initial analysis of the human genome was published, it was noticed that TE density in some genomic regions was extremely low compared to the average level, even after correction for G/C content (Lander et al., 2001). For example, the four homeobox gene clusters, namely HOXA, HOXB, HOXC and HOXD, sized at ~100 kb each, contain less than 2% TEs, in contrast to the genome-wide level of 45%. Given the crucial functions of these genes during human development, it is not hard to imagine how detrimental a TE integration event could be, and how important it is for the host to maintain the integrity of these regions through purifying selection. In fact, a general trend of low TE density within genes has been observed for most TE families (Medstrand et al., 2002), and a genome-wide study of transposon-free regions (TFRs) larger than 10 kb identified nearly a thousand such regions in both human and mouse (Simons et al., 2006). While most bases covered by TFRs are non-coding, the majority (85%) of TFRs, in fact, overlap one or more annotated genes. Further Gene Ontology (GO) analysis of TFR-related genes revealed significant enrichment of gene functions such as transcriptional regulation and development, which agrees with several other studies on TE density within or around genes (Grover et al., 2003; Huda et al., 2009; Mortada et al., 2010; Sironi et al., 2006).
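The idea of a transposon-free region can be made concrete with a short interval scan. The following Python sketch is only an illustration of the concept, not the procedure used by Simons et al.: the function name, the half-open coordinate convention and the toy annotations are assumptions, and real analyses would also need to exclude assembly gaps.

```python
def transposon_free_regions(te_intervals, chrom_length, min_length=10_000):
    """Return TE-free intervals longer than min_length on one chromosome.

    te_intervals: iterable of (start, end) TE annotations, 0-based and
    half-open (an assumed convention); they may overlap and be unsorted.
    """
    regions = []
    cursor = 0  # end of the right-most TE seen so far
    for start, end in sorted(te_intervals):
        if start - cursor > min_length:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_length:
        regions.append((cursor, chrom_length))
    return regions


# Hypothetical toy annotations on a 50 kb sequence.
toy_tes = [(20_000, 21_000), (5_000, 5_300), (20_500, 26_000), (30_000, 30_100)]
toy_tfrs = transposon_free_regions(toy_tes, chrom_length=50_000)
```

Intersecting the resulting intervals with gene annotations would then give the fraction of TFRs that overlap genes, which is the quantity reported as 85% above.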

Exceptions do exist, however. In a recent study of Hox gene clusters in the green anole lizard Anolis carolinensis, researchers surprisingly found massive accumulations of TEs in these regions, as well as in many other development-related genes (Di-Poi et al., 2009). This is inconsistent with observations in most other vertebrates. Since there is huge morphological variation among Squamata species (e.g. Anolis lizards have radiated into ~400 species that are highly adapted to a variety of ecological niches (Alfoldi et al., 2011)), a speculative explanation for this exception is that allowing more TEs into genomic regions important for lizard morphogenesis increases genetic divergence, which could be beneficial. Nonetheless, the above data demonstrate an uneven distribution of TEs relative to genes, suggesting a general theme of purifying selection around fundamentally important coding regions.

1.3 Initial Integration Site Preference of TEs

The distributions of fixed TEs in today’s mammalian genomes are the combined outcomes of initial target site preference, natural selection and genetic drift. However, except for a limited number of lineages, most TEs in mammals are only molecular fossils carrying numerous mutations, and do not actively transpose. For this reason, directly detecting the initial TE integration site preference is largely infeasible, especially for old elements. Nonetheless, currently active copies of LINEs and SINEs have been isolated and used to investigate their de novo integration features (Dewannieux et al., 2003; Gilbert et al., 2002). Alternatively, based on the kinship between exogenous and endogenous retroviruses (see Section 1.1.3), the potential integration site preferences of related ERVs can be inferred from insertion site analyses of selected retroviruses (Mitchell et al., 2004; Wu and Burgess, 2004). Moreover, functional copies of presently inactive TEs, such as the human ERV HERV-K, have been successfully reconstructed from consensus sequences (Dewannieux et al., 2006; Lee and Bieniasz, 2007) and, as discussed below, this “resurrected” HERV-K has been used to determine its initial target site preferences.

1.3.1 Integration site preference of retroviruses and ERVs

Exogenous retroviruses have long been studied intensively because of their often pathogenic effects on host organisms. As close relatives of ERVs, these infectious pathogens share very similar features with their endogenous counterparts, from their sequences to their life cycle. A deep understanding of exogenous retroviruses can therefore greatly assist biologists in inferring various characteristics of ERVs, such as integration mechanisms and target site preferences, without being blinded by post-integration changes during evolution.

Shortly after the publication of the first human genome sequence assembly, several genome-wide studies of retrovirus integration patterns were reported. For example, one study used human immunodeficiency virus (HIV) and HIV-based vectors to infect a human T cell line (SupT1), with naked DNA from the same cell line as an in vitro control, and unexpectedly revealed a strong initial integration preference of HIV for genes (Schroder et al., 2002). Specifically, while only about 35% of the 111 HIV integration sites collected in vitro were located within transcription units (which is close to the genomic coverage of genes in human), nearly 70% of the 524 HIV integration sites in human SupT1 cells resided in genes. Interestingly, unlike most ERVs fixed in the genome, these de novo HIV integrations showed no evidence of orientation bias relative to the enclosing genes. Similarly, integration assays of murine leukaemia virus (MLV) in human HeLa cells and avian sarcoma-leukosis virus (ASLV) in both human 293T cells and HeLa cells have also been conducted (Mitchell et al., 2004; Wu et al., 2003), with the former showing a strong preference for gene promoter regions and the latter a much more random distribution of integration sites. A schematic illustration of the target integration site preferences of the above-mentioned retroviruses is given in Figure 1.3.

Figure 1.3 Target integration site preferences of HIV, MLV and ASLV.

A conceptual gene structure is shown with boxes in different colors representing the promoter region (yellow), exons (dark blue) and introns (light blue). Arrows with different colors represent insertion sites of HIV (red), MLV (black) and ASLV (green).

To investigate the potential factors that may influence retrovirus integration site preference, additional analyses were performed in the studies cited above. Comparison of the transcriptional profiles of different target cell types with the HIV integration site distribution revealed a trend towards integration in highly expressed genes (Schroder et al., 2002). Such a result supports the hypothesis that retroviruses may prefer to integrate into open chromosomal regions, where genes are actively transcribed and thus potentially easier to access. However, the distinct patterns of target site selection of different retroviruses indicate that DNA accessibility is unlikely to be the only underlying mechanism. Based on evidence from Ty elements (LTR retrotransposons) in yeast, a “tethering model” involving interactions between the retroviral integration complex and sequence-specific DNA binding proteins has also been proposed (Bushman, 2003).

Theoretically, selection forces can be quite different between exogenous and endogenous retroviruses due to their distinct criteria of fitness. Given enough evolutionary time, this might lead to different integration targeting preferences (Bushman et al., 2005). For example, in terms of survival strategy, HIV may benefit from targeting actively transcribed genes in order to maximize its proliferation rate in a cell before the cell is killed. Similarly, MLV has a higher capacity to utilize an adjacent promoter to gain higher expression levels (De Palma et al., 2005), leading to a selective advantage of targeting the 5′-end of genes. The integration sites of ASLV, however, are much more random, which might lower the chance of an insertion being deleterious and rapidly selected against. By contrast, endogenous LTR elements such as the Ty elements in yeast show an apparent targeting preference for less critical regions of the host genome, probably because of the absence of an extracellular stage in their life cycle and, therefore, their dependence on the survival of the host cell (Wu and Burgess, 2004). For these reasons, caution is necessary when inferring the general target site preferences of ERVs from studies of their exogenous relatives.

Although many ERV families in mammals have accumulated enough mutations to become molecular fossils, the original forms of some inactive ERV families have been successfully reconstructed and used to directly study their initial integration site preferences. For example, a recent study examined the integration pattern of a reconstructed HERV-K element (HERV-Kcon) that can efficiently retrotranspose in selected human cell lines (Brady et al., 2009). In contrast to the distribution of existing HERV-Ks in the human genome, de novo integration sites of HERV-Kcon show highly distinct features, including a preference for transcription units and a lack of orientation bias within genes. Notably, the youngest HERV-Ks in the human genome show a distribution intermediate between that of the resurrected HERV-Kcon and that of older HERV-Ks fixed in the genome, indicating that the current distribution pattern of ERVs is a consequence of initial integration followed by genetic drift and natural selection.

1.3.2 Integration site preference of other TEs

Although there is no evidence for ongoing transposition activity of either ERVs or DNA transposons in the human genome, active LINEs and SINEs still exist and have been successfully isolated or engineered for de novo retrotransposition assays (Dewannieux et al., 2003; Dewannieux and Heidmann, 2005; Gilbert et al., 2002; Moran et al., 1996). To conduct an evolutionarily unbiased investigation of the target integration site preference of LINEs and SINEs in the human genome, Gasior and colleagues collected over one hundred de novo L1 integration loci in human HeLa cells from various sources, as well as a total of 13 de novo SINE integration loci (including human Alu, mouse B1 and B2 elements) obtained from similar experiments (Gasior et al., 2007). Unexpectedly, while an A/T-rich bias was observed in the 50 bp flanking regions of these de novo L1 insertions, as predicted by the “TTAAAA” consensus sequence of L1 integration sites (Jurka, 1997), only a neutral G/C content was found for such elements based on a 20 kb window size. This observation differs markedly from the genomic distribution of existing L1s, which shows a strong bias toward A/T-rich sequence even at relatively large genomic intervals. The same study showed a similar pattern of target site preferences for de novo Alu insertions based on either the 50 bp or the 20 kb window size, but the Alu dataset was too small to support any strong conclusions (Gasior et al., 2007). Comparing these results to the genomic distributions of LINEs and SINEs that have been fixed in the human genome, it is clear that the current distributions of TEs are the combined outcomes of both initial integration preferences and evolutionary constraints, although the latter factor shows a stronger long-term effect, especially for old TE families.
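The contrast between local (50 bp) and regional (20 kb) base composition around an insertion site can be illustrated with a simple window calculation. The Python sketch below is a minimal example under stated assumptions: the function names are hypothetical, the window is taken as a fixed number of bases on each side of a 0-based insertion coordinate (the cited study may define its windows differently), and the toy sequence is invented.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return float("nan")  # no unambiguous bases in the window
    return gc / (gc + at)


def flanking_gc(genome_seq, site, flank):
    """G/C fraction of the window spanning `flank` bases on each side of a
    0-based insertion site, truncated at the ends of the sequence."""
    left = max(0, site - flank)
    right = min(len(genome_seq), site + flank)
    return gc_fraction(genome_seq[left:right])


# Hypothetical toy sequence: a short A/T-rich patch inside G/C-rich DNA.
toy_seq = "GC" * 50 + "AT" * 10 + "GC" * 50
local_gc = flanking_gc(toy_seq, site=110, flank=10)      # narrow window
regional_gc = flanking_gc(toy_seq, site=110, flank=110)  # wide window
```

In the toy sequence the narrow window is entirely A/T while the wide window is G/C-rich, which mirrors how a strong local target-site signal can coexist with neutral regional composition.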

1.4 Activity and Polymorphism of TEs in Mammals

The time elapsed since a given TE copy integrated into the genome can be estimated from its sequence divergence from the consensus of the corresponding TE family and the neutral substitution rate of genomic DNA. Generally, most TEs in the mammalian genome today are ancient molecular fossils with ages ranging from tens to hundreds of Myr (Lander et al., 2001), which have lost their ability to transpose through the accumulation of mutations. In addition to genetic degradation leading to inactivity, the activity of TEs can sometimes also be repressed by active silencing mechanisms of the host cell, such as epigenetic modifications and RNA silencing (see Section 1.6 for a detailed discussion). However, it is noteworthy that the age distributions and activities of TEs in various mammalian species can be surprisingly different. Moreover, ongoing TE activity can give rise to TE insertional polymorphisms, that is, unfixed TE insertions present in only a subpopulation of the host species. This information can be very useful when examining the effects and properties of TE integrations, and can serve as an additional source of genomic variation when studying the evolutionary history and disease susceptibility of the host organisms.
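The age estimate mentioned at the start of this section amounts to dividing the divergence of a TE copy from its family consensus by the neutral substitution rate. The Python sketch below shows this calculation, optionally with a Jukes-Cantor correction for multiple substitutions at the same site. The function names are hypothetical, and the rate used in the example is a placeholder assumption rather than a value taken from this thesis; an appropriate lineage-specific rate should be substituted.

```python
from math import log

# Placeholder neutral substitution rate (substitutions per site per year).
# This value is an assumption for illustration only.
ASSUMED_RATE = 2.2e-9


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed mismatch fraction p."""
    if not 0 <= p < 0.75:
        raise ValueError("observed divergence must be in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


def estimate_age_myr(divergence, rate_per_site_per_year, correct=True):
    """Estimated time since integration, in millions of years.

    divergence: fraction of mismatched sites between the TE copy and its
    family consensus, which stands in for the sequence at integration.
    """
    d = jukes_cantor_distance(divergence) if correct else divergence
    return d / rate_per_site_per_year / 1e6


# Hypothetical example: a copy that is 10% diverged from its consensus.
raw_age = estimate_age_myr(0.10, ASSUMED_RATE, correct=False)
corrected_age = estimate_age_myr(0.10, ASSUMED_RATE, correct=True)
```

Because only the TE copy accumulates substitutions relative to the consensus, the divergence is divided by the rate itself rather than by twice the rate, as would be done for two diverging lineages. The correction matters more for older, more diverged copies.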

1.4.1 TE activity and polymorphism in humans

In the contemporary human genome, TE transposition has been largely silenced. In fact, more than 97% of human TEs are remnants of ancient elements older than 25 Myr (Lander et al., 2001), most of which have accumulated enough mutations to completely lose their ability to generate new transposition events. Initial analysis of the human genome, for example, showed a flourishing period of LTR elements (mostly ERV-Ls and MaLRs) that lasted for about 100 Myr, most of which appeared to have died out ~40 Myr ago (Lander et al., 2001). Today, about 85% of the LTR retrotransposons in the human genome have lost their internal sequence due to homologous recombination between the flanking LTRs, leaving only a solitary LTR. So far only a single family (HERV-K) is known to have transposed since our divergence from the chimpanzee ~7 Myr ago, and it is polymorphic in today’s human population (Belshaw et al., 2005; Cordaux and Batzer, 2009; Mills et al., 2007). However, no evidence has been reported for ongoing de novo HERV-K retrotransposition. Even more dramatically, the activity of DNA transposons, which comprise less than 3% of the human genome, was completely extinguished about 37 - 50 Myr ago (Lander et al., 2001; Pace and Feschotte, 2007).

In contrast to LTR retrotransposons and DNA transposons, LINE and SINE elements have had extremely long lives in mammals, and are responsible for most current TE activity in humans. Based on the draft sequence assembly of the haploid human genome (95% complete), Brouha et al. estimated that an average human being carries 80-100 retrotransposition-competent (RC) L1s, with the bulk of retrotransposition activity coming from only a few “hot L1s” (Brouha et al., 2003). More interestingly, the same study also showed that half of the identified RC-L1s are polymorphic in humans, indicating their relatively young age compared with elements fixed in the human population. Recently, the emergence of high-throughput technologies has facilitated the discovery of an increasing number of human LINE/SINE polymorphisms (Beck et al., 2010; Ewing and Kazazian, 2010; Huang et al., 2010; Iskow et al., 2010). Using a paired-end DNA sequencing strategy, Beck and colleagues identified 68 full-length L1s that are differentially present among individuals but absent from the reference human genome (Beck et al., 2010). Among these polymorphic human L1s, 37 (or 55%) were hot L1s with strong retrotransposition activity, a number much greater than the total of six hot L1s found previously by Brouha and colleagues. Similarly, using high-throughput pyrosequencing technology, another research group reported as many as 742 new polymorphic L1 insertions and 403 Alu insertions from a larger set of human tissue samples (Iskow et al., 2010). While some of the L1 polymorphisms reported in the above two independent studies are identical, both studies revealed many unique low-frequency L1s present in only one or a few individuals, indicating the very young age of these L1s in the human population.

L1s and Alus are the most active TE families in humans, and de novo germline integrations of both that are associated with human diseases have been reported. The detection of a de novo L1 insertion in a patient with X-linked retinitis pigmentosa (XLRP) led to the identification of mutations in a novel gene, RP2, which are responsible for the progressive retinal degeneration (Schwahn et al., 1998). In another case, the insertion of Alu sequences in the fibroblast growth-factor receptor 2 (FGFR2) gene appeared to cause Apert syndrome in two patients (Oldridge et al., 1999). As of 2011, at least 51 disease-associated de novo germline TE insertions (including 33 Alus, 15 L1s and 3 others) had been documented in the dbRIP database (http://dbrip.brocku.ca/) (Wang et al., 2006). This clearly demonstrates the ongoing activity and mutagenic effects of LINEs/SINEs in humans.

In addition to germline retrotranspositions, activity of TEs has also been reported in human somatic cells. In 2005, it was demonstrated that engineered human L1s can retrotranspose in adult rat hippocampus progenitor cells in vitro and in the mouse brain in vivo (Muotri et al., 2005). Subsequently, the same group showed that neural progenitor cells, either isolated from human fetal brain or derived from human embryonic stem cells, support the retrotransposition of engineered human L1s in vitro, and found clear evidence of increased copy number of endogenous L1s in several human brain regions compared to other somatic tissues such as liver or heart from the same donor (Coufal et al., 2009). Recently, by applying a high-throughput method, Baillie et al. directly showed high retrotransposition activity of endogenous L1s in the hippocampus and caudate nucleus of three human individuals, revealing more than 7,700 de novo somatic L1 insertions and nearly twice as many Alu insertions in these brain regions. This suggests that somatic genome mosaicism driven by active retrotransposition may reshape the genetic circuitry underlying normal and abnormal neurobiological processes (Baillie et al., 2011). Notably, the massive L1/Alu retrotransposition activity in human brain seems exceptional compared to most other adult tissues, in which TEs are largely repressed. However, in the study by Iskow et al. cited above (Iskow et al., 2010), the authors also identified frequent (30%) somatic L1 insertions in lung tumors, suggesting elevated TE activity in at least some tumors, and the possibility that new retrotransposition events may be involved in tumorigenesis.

1.4.2 TE activity and polymorphism in mice

Unlike in humans, LTR retrotransposons/ERVs are highly active in mice, causing about 10% of all spontaneous germline mutations (Maksakova et al., 2006). The retroviral-like Intracisternal A Particle (IAP) and the MusD/Early Transposon (ETn), for example, are two high copy number ERV families that are responsible for most of the insertional germline mutations described in mice. IAP elements have been extensively studied since the early 1980s (Kuff and Lueders, 1988), and appear to cause both germline mutations and oncogene or growth factor gene activation in somatic cells (Druker and Whitelaw, 2004; Wang et al., 1997). ETn elements were originally reported in the early 1980s as non-coding transposon-like sequences specifically expressed in early embryogenesis (Brulet et al., 1985; Loebel et al., 2004; Shell et al., 1990), and are also capable of causing new mutations. It is now known that ETns represent non-autonomous partners of the retroviral-like MusD elements (Baust et al., 2003; Mager and Freeman, 2000), which provide in trans the proteins necessary for ETns to retrotranspose (Ribet et al., 2004). For this reason, they are usually referred to as ETn/MusD elements.

Notably, unlike most wild species, laboratory mice were derived from intensive inbreeding of ancestral species of Mus, and have been carefully maintained in artificial living conditions. Thus, the effective population size of each inbred strain is down to only two, leading to a significantly higher probability of fixing random mutations (including TE insertions) in each strain by genetic drift. Due to the lack of heterogeneity within each inbred strain, strain-specific mutations are not subject to selection pressure acting across the entire mouse population. Nonetheless, in this thesis I use the population genetics term “polymorphism” to describe strain-specific variants in mice, as is common in other studies of mouse genetic variation (Frazer et al., 2007; Li et al., 2004; Wade and Daly, 2005). According to a list compiled at the end of 2005 (Maksakova et al., 2006), six strain polymorphisms and 26 mutations in mice due to germline insertions of IAPs have been documented, and at least four polymorphisms and 19 mutations have been reported for ETns/MusDs. Genomic hybridization and Polymerase Chain Reaction (PCR) methods have demonstrated that the IAP (Kaushik and Stoye, 1994; Lueders and Frankel, 1994; Lueders et al., 1993) and ETn/MusD (Baust et al., 2002) families are polymorphic among strains. The extent of this variation was unknown in 2006 when I commenced my genome-wide study of ERV polymorphisms in mice (see Chapter 2 for details).

The retrotransposition activity of LINEs/SINEs is also much higher in mice than in humans. The copy number of active full-length mouse L1s estimated by cell culture assay could be more than 3000, about 30 times the estimate for active human L1s. This ratio is close to the 35-fold excess in the proportion of L1 insertions causing disease in mice compared to humans (2.5% vs. 0.07%) (Ostertag and Kazazian, 2001). Furthermore, due to their relatively younger age and stronger retrotransposition activity, L1s show a much higher level of polymorphism among different mouse strains. In 2008, by analyzing DNA sequence traces derived from whole genome sequencing (WGS), Akagi et al. reported nearly 7,000 polymorphic L1 insertions present in the C57BL/6J strain but absent from at least one of four other inbred mouse strains (Akagi et al., 2008). In fact, this total is very likely an underestimate, and the number is expected to increase as sequence information becomes available for additional mouse strains.
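The criterion quoted above (present in the reference strain but absent from at least one other strain) can be written as a simple filter over presence/absence calls. The Python sketch below is illustrative only: the function name, the data layout, the insertion identifiers and the strain names other than C57BL/6J are hypothetical, and it does not reproduce the trace-based detection method of Akagi et al.

```python
def polymorphic_insertions(calls, reference_strain):
    """Return IDs of insertions present in the reference strain and
    confidently absent from at least one other strain.

    calls: {insertion_id: {strain: True | False | None}}, where True means
    present, False means absent and None means no usable sequence coverage.
    """
    polymorphic = []
    for insertion_id, by_strain in calls.items():
        if by_strain.get(reference_strain) is not True:
            continue  # not confirmed in the reference strain
        other_calls = [call for strain, call in by_strain.items()
                       if strain != reference_strain]
        if any(call is False for call in other_calls):
            polymorphic.append(insertion_id)
    return polymorphic


# Hypothetical toy calls for three insertions in three strains.
toy_calls = {
    "L1_a": {"C57BL/6J": True, "strainA": True, "strainB": False},
    "L1_b": {"C57BL/6J": True, "strainA": True, "strainB": True},
    "L1_c": {"C57BL/6J": True, "strainA": None, "strainB": True},
}
toy_polymorphic = polymorphic_insertions(toy_calls, "C57BL/6J")
```

Treating missing coverage (None) separately from confirmed absence makes the count conservative. Adding strains can only increase the number of insertions called polymorphic, consistent with the expectation stated above.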

In contrast to the higher activity of most other TE types in mice, DNA transposons have been even more restricted in this rodent species than in humans. Analysis of the mouse genomic sequence revealed only four lineage-specific DNA transposons in mice, whereas more than 14 have been identified in humans, most of which were deposited in early primate evolution (Waterston et al., 2002). Since the transposase of DNA transposons usually works in trans during transposition (i.e. it can mobilize an element other than the one that produced it) and may thus facilitate the proliferation of “dead” elements, DNA transposons intrinsically require periodic horizontal transfer (HT) to refresh their transposition capacity (Feschotte and Pritham, 2007). Based on the observation that DNA transposons are the least active of the four major TE types in both human and mouse, it has been postulated that the highly developed mammalian immune system has probably contributed significantly to suppressing the invasion and amplification of these TEs (Waterston et al., 2002).

1.4.3 TE activity and polymorphism in other mammals

With an increasing number of fully sequenced genomes, it is of interest to evaluate and compare TE distributions and activities in other mammalian species. A comparison of genomic sequences between human and chimpanzee revealed similar activities of LINEs in the two primate species, but a significant difference for Alu SINEs (Mikkelsen et al., 2005). In contrast to the ~7,000 lineage-specific Alu insertions in human, the chimpanzee genome contains ~2,300 such insertions, indicating either elevated Alu activity in human or reduced Alu activity in chimpanzee since the divergence of the two species. Recently, the availability of the orangutan genomic sequence provided an opportunity to calibrate the above analysis of lineage-specific Alu insertions in each primate species. Surprisingly, while the TE content in general was consistent with the previous findings for human and chimpanzee, only ~250 orangutan-specific Alu insertions were identified (Locke et al., 2011), indicating that Alu activity has been strongly limited in the orangutan genome since its divergence from the other two primates.

However, an LTR retrotransposon family named Pan troglodytes endogenous retrovirus (ptERV1) was identified with nearly 200 copies in chimpanzee and several other primate species, but not in human (Mikkelsen et al., 2005). Unlike HERV-K, the youngest ERV family in human (with a majority being only solitary LTRs resulting from LTR-LTR recombination), ptERV1s in chimpanzee are much more homogeneous, and more than half of them are still full-length. Until Kaiser et al. showed a link between ptERV and the human TRIM5α antiviral protein (Kaiser et al., 2007), it had been a conundrum as to why other primates, but not humans, harbor this young ERV family. This is discussed further in the next section.

In addition to primates, genomic studies of other mammalian species distantly related to

humans could help provide a more complete picture of TE activities in mammals. The

identification of both the endogenous koala retrovirus (KoRV) in the germline and the

exogenous version of the same virus in the peripheral blood mononuclear cells (PBMCs) of

diseased koalas, for instance, illustrates an example of an ongoing bombardment of retroviral

infection and endogenization in these already endangered mammals (Stoye, 2006; Tarlinton

et al., 2006). KoRV was originally identified as an endogenous retrovirus, but its competency in producing viral particles and its high sequence similarity to an exogenous retrovirus called gibbon ape leukemia virus (GALV) indicate that it was most likely endogenized only recently. Indeed, geographical studies of koalas carrying KoRV

(Tarlinton et al., 2006) showed highly variable distribution of this ERV within the whole

koala population, with all koalas living in the northeast of Australia carrying the virus, and

none on Kangaroo Island (a small isolated island located off southern Australia). Based on

the fact that the koala subpopulation on Kangaroo Island was only established in the early

twentieth century, it seems that the invasion of KoRV into the koala genome only happened

within the last 100 years.

Like KoRVs in koalas, the recent or even ongoing amplification of a large number of

DNA transposons discovered in bats was another big surprise with respect to TE activities in

mammals. Based on the results of initial analyses of the human and mouse genomic

sequences (Lander et al., 2001; Waterston et al., 2002), it was widely assumed that DNA

transposons essentially died out in mammalian genomes tens of millions of years ago.

However, when Ray et al. recently examined the genomic sequence data of a distant

mammalian lineage, the bat genus Myotis, they identified six hAT-like DNA transposon

families showing low sequence divergence among individual copies and polymorphisms

between different Myotis species, suggesting their relatively young age (6.4 – 15 Myr) and

their recent activity in bats. Further studies on these elements, as well as other bat DNA

transposons, showed multiple waves of recent DNA transposon activities in these flying

mammals (Ray et al., 2008), including some very recent and probably currently active

piggyBac-like transposons, and a massive amplification of two Helitron DNA transposon

families that reached at least 3% of the bat M. lucifugus genome (Pritham and Feschotte,

2007).

1.5 Effects of TE Integration in the Host Genome

Although most mammalian TEs are neutral components of the genome with no significant

biological effects (Brouha et al., 2003; Mills et al., 2007), some elements do impact the

cell/organism by acting as insertional mutagens, inducing DNA rearrangements, assuming

cellular functions, and altering gene regulation (Batzer and Deininger, 2002; Cordaux and

Batzer, 2009; Maksakova et al., 2006; Mills et al., 2007). Not surprisingly, such mutations are usually associated with various types of genetic disorders, but sometimes may also contribute to innovative changes that are beneficial to the host organisms. In this section I review some of the common mechanisms observed in TE mutagenesis, as well as potential roles that TEs might have played during host evolution.

1.5.1 TE-mediated physical damage at the DNA level

Perhaps the most intuitive effect of TE integration in the host genome is the increased genome size. Primarily as selfish genomic parasites, these mobile DNA elements continuously replicate themselves, producing more and more new copies within host genomes. Although each host species may bear a different load of TEs, the ancient origin and ubiquitous presence of these repetitive elements suggest a global genome expansion for most of today’s living creatures. Indeed, a genome size comparison of 12 eukaryote species revealed a nearly linear relationship between the size of the host genome and TE content

(Kidwell, 2002), implying that non-coding repetitive DNA is an important contributor to genome

size. Consistent with this observation, genomic comparisons of multiple species have also

shown that the increase of genome size is an ongoing process in many species including the human, which has already accumulated ~2,000 L1, ~7,000 Alu and ~1,000 SVA SINE copies over the past 6 Myr (Mikkelsen et al., 2005). In contrast, some non-mammalian species show only a small proportion of their genomes occupied by TEs, especially those with a compact genome size of less than 500 Mb (Kidwell, 2002). Most of these small-genome species still contain many TE families, but the copy number of each TE family is much lower compared with larger genomes (Kidwell, 2002). Taking the fruit fly as an example, the 117 Mb fully sequenced euchromatic portion comprises two thirds of the entire

180 Mb Drosophila melanogaster genome, but only ~1,500 full or partial TEs have been identified, corresponding to just 4% of this genomic portion (Adams et al., 2000). Further studies of Drosophila TEs revealed more than 90 distinct TE families (i.e. ~16 copies/family), with the largest family reaching a total copy number of only 146 (Celniker and Rubin, 2003). Indeed, higher DNA deletion rates in general have been shown for both fly and puffer fish (only 2.7% of the 365 Mb puffer fish genome is occupied by TEs (Aparicio et al., 2002)), which potentially explains the relatively compact genome size of both species.
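
The Drosophila figures quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are chosen here for convenience, and the family count is taken as exactly 90 although the text states "more than 90", so the copies-per-family value should be read as an upper bound.

```python
# Back-of-the-envelope check of the Drosophila melanogaster figures quoted
# in the text. All inputs are the rounded values given in the paragraph above.
euchromatin_mb = 117   # fully sequenced euchromatic portion, in Mb
genome_mb = 180        # entire genome, in Mb
te_copies = 1500       # approximate number of full or partial TEs
te_families = 90       # assumption: "more than 90" families, 90 used as a floor

# Share of the genome represented by the sequenced euchromatic portion.
euchromatin_fraction = euchromatin_mb / genome_mb

# Mean copy number per family (an upper bound, since te_families is a floor).
copies_per_family = te_copies / te_families

print(f"euchromatic fraction of genome: {euchromatin_fraction:.2f}")
print(f"mean copies per TE family:      {copies_per_family:.1f}")
```

The first value is consistent with the "two thirds" stated in the text, and the second with the approximate per-family copy number given in parentheses.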

Insertional disruption of coding sequences (i.e. exons) is one of the most straightforward mechanisms by which TEs may physically influence the functionality of host genes. In most cases, the affected genes are only able to produce truncated or nonsense RNA templates, leading to a severe dysfunction of the gene. Human genetic disorders caused by de novo TE

insertions have been documented in a growing number of cases, with Apert syndrome, cystic

fibrosis, neurofibromatosis, muscular dystrophy and breast and colon cancers as a few of

many examples (Belancio et al., 2008; Chen et al., 2005; Deininger and Batzer, 1999).

Sometimes, the inserted TE also concomitantly brings in extra flanking sequences to

the integration site (a process known as “TE-mediated transduction”), whereas in some other cases, retrotransposition-mediated deletions from 1 bp to several hundred kb of the target

DNA sequence can occur. Such genomic rearrangements have been clearly demonstrated by both TE retrotransposition assays in cultured cells (Gilbert et al., 2002; Symer et al., 2002)

and genomic comparisons between multiple host genomes (Callinan et al., 2005; Han et al.,

2005), and instances related to disease were also reported (Callinan and Batzer, 2006; Mine

et al., 2007; Solyom et al., 2012). A recent study of a Japanese boy with Duchenne muscular

dystrophy (Solyom et al., 2012) revealed an insertion of a 212 bp non-coding unique

sequence from 11q22.3 plus a 115 bp poly(A) tail in exon 67 of the patient’s Dystrophin

gene on Chromosome X. Remarkably, this turned out to be a 3′ transduction mediated by a

polymorphic L1 that has an allele frequency of only 6% in Japanese people and had never been

previously detected. Since the 3′ transduction of the downstream unique sequence was coupled with severe 5′ truncation during TPRT, no L1 sequence could even be found at the insertion site, a phenomenon known as “orphan transduction”.

In addition to the above-mentioned TE-mediated physical alterations of the target DNA, non-allelic or unequal homologous recombination involving existing TE loci is another major mechanism causing physical damage/rearrangements commonly seen in TE-mediated mutagenesis (Burwinkel and Kilimann, 1998; Cordaux and Batzer, 2009). By generating ectopic recombination between non-allelic homologous elements, various types of genomic rearrangements (including deletions, segmental duplications and gene inversions) can occur within the same or between different chromosomes. For example, Alu recombination-mediated deletions (RMDs) in humans have been recognized for decades, and many dozens

of cases related to various genetic disorders and cancers have been reported (Cordaux and

Batzer, 2009; Deininger and Batzer, 1999). Further, genomic comparisons between the

human and chimpanzee genomes revealed 492 Alu and 73 L1 RMDs in the human genome

since the divergence between the two primate species, including hundreds of cases located

within genes, a few of them involving deletions of exons (Han et al., 2008; Sen et al., 2006).

Similarly, genome-wide chromosomal inversions have been investigated by comparative genomics, and at least 44% of the 252 inversions found in the human and chimpanzee

genomes are related to either L1 or Alu elements (Lee et al., 2008a). Genomic segmental

duplications are also often associated with TEs. In a genome-wide study of nearly

10,000 segmental duplication junctions, Alus were found to be highly overrepresented within

these regions (27% of segmental duplication junctions terminate within Alus), and seem

to have contributed significantly to such genomic rearrangements in the human genome

during the past 30 – 40 Myr (Bailey et al., 2003).

1.5.2 Transcriptional influence of TE sequences on host genes

Being equipped with native/cryptic transcriptional regulatory signals such as transcription

factor binding sites (TFBSs) and splice/polyadenylation signals, TEs serve as a unique

reservoir of regulatory elements for host gene expression. It is now well appreciated that TEs

may work as alternative promoters or enhancers to drive transcription of nearby genes

(Cohen et al., 2009; Thornburg et al., 2006), or facilitate alternative splicing or

polyadenylation of mRNA transcripts (Lee et al., 2008b; Lev-Maor et al., 2008; Sorek et al.,

2002). During the past decades, accumulating evidence has shown that TEs have been

involved in shaping gene regulatory networks in the host genome, though many of the

mechanisms underlying TE-gene dynamics remain poorly understood (Feschotte, 2008).

In order to move around and proliferate in the host genome, autonomous TEs usually

encode their own proteins critical for transposition. In most cases, the genes encoding these

proteins are transcribed in the same way as cellular genes but under the control of TE

internal promoters, which sometimes can also promote the transcription of a nearby gene

(Figure 1.4A). Sequence mutations, on the other hand, may lead to the creation of new

TFBSs in a TE, which again may act as an alternative promoter or enhancer and drive gene expression. An LTR-derived alternative promoter of the human β1,3-galactosyltransferase 5

(β3Gal-T5) gene, for example, was shown to be the dominant promoter in the colon (Dunn et

al., 2003), indicating that this ERV-derived element may play an important role in tissue-

specific regulation of the gene. In another example, an alternative promoter derived from a

HERV-P LTR contributes, specifically in testis, ~12% of the normal total transcription of the

human gene encoding the neuronal apoptosis inhibitory protein (NAIP), whereas an unrelated

LTR promoter in rodents confers constitutive expression of the orthologous gene (Romanish

et al., 2007). Genome-wide bioinformatics studies also revealed numerous examples of TE-

derived alternative promoters. A 2003 computational survey of the human and mouse mRNA databases showed that more than 27% and 18% of genes, respectively, have at least one mRNA

with TE sequence in either the 5′ or the 3′ untranslated regions (UTRs) (van de Lagemaat et al., 2003). While it had previously been shown that the human CYP19 gene, which encodes the aromatase P450 (a key enzyme in estrogen biosynthesis), has tissue-specific expression in placenta, it was not until the above-mentioned bioinformatics study that this expression was found to be driven by an LTR promoter sequence that had integrated into the genome during

early primate evolution. Strikingly, the same study discovered a total of 16 such examples of

the apparent usage of TE-derived promoters not appreciated before. Recently, based on high-

throughput sequencing data, Conley et al. discovered more than 50,000 ERV-initiated

transcripts in the human genome, and demonstrated a total of 114 transcription start sites

(TSSs) located within ERV sequences that contribute to the transcription of 97 genes (Conley et al., 2008). Notably, while TE-derived promoters have been found for various types of TEs,

ERV/LTR elements are the dominant type due to the abundance of cryptic TFBSs in their sequences. A more complete assessment of this phenomenon was given by Cohen et al. in a

2009 review (Cohen et al., 2009), in which both experimental and bioinformatic results were summarized and thoroughly discussed.

Alternative splicing and polyadenylation are also common mechanisms by which TEs alter the normal transcription of host genes (Figure 1.4B, C). It has long been appreciated that Alu elements are frequently involved in alternative splicing and a process known as exonization (Hasler and Strub, 2006; Keren et al., 2010; Sorek et al., 2002), during which part of an Alu sequence is incorporated into the mature mRNA transcripts of a host gene. A comparative analysis of 1,176 alternatively spliced and 4,151 constitutively spliced internal exons in the human genome showed that more than 5% of the former exon set contain

Alu repeats, whereas none of the latter do (Sorek et al., 2002). A further study revealed that the left

but not the right arm of the Alu dimer sequence seems to be critical to promote alternative splicing and Alu exonization events, probably due to the weaker splice signals and lower density of exonic splicing regulatory elements (ESRs) (Gal-Mark et al., 2008). Furthermore, it has been experimentally demonstrated that two Alu elements inserted into an intron in opposite orientation undergo base-pairing and RNA editing, which, in turn, affects the


Figure 1.4 Transcriptional influences of TEs.

TE insertions are depicted as thick green arrows. Blue boxes are exons of a hypothetical gene. The black arrows are transcription start sites. Red dotted lines show splicing of the mRNA transcripts of the gene. A) TEs as alternative promoters. The TE promoter upstream of the gene may contribute to tissue-specific transcription, and the TE promoter within an intron may give rise to a truncated form of the protein product. B) Alternative splicing by TEs. The intronic TE insertion is exonized and may also cause exon skipping. C) Premature polyadenylation by TEs. The gene transcription terminates at an intronic TE in the middle of the gene.

splicing patterns of the downstream exon by shifting it from constitutive to alternative

(Lev-Maor et al., 2008). Indeed, a recent computational study of exonized Alus vs. their

non-exonized counterparts revealed multiple features that significantly influence the probability

of an Alu sequence being recognized as an exon by the splicing machinery, including splice signal strength, density of exonic splicing enhancers (ESEs) and silencers (ESSs), length of

flanking introns, and more interestingly, the stability of mRNA secondary structures

(Schwartz et al., 2009). Alternative splicing has also been reported for many other TE types,

such as LINEs and LTR retrotransposons (Belancio et al., 2006; Maksakova et al., 2006), and genome-wide computational assessments have identified hundreds of alternative splicing events involving TEs in human genes (Kim et al., 2010). Moreover, like TE-induced alternative splicing, the cryptic polyadenylation signals carried by many TEs may cause

premature termination of normal gene transcription, which has also been widely documented

and studied for various TE types in mammals (Lee et al., 2008b; Maksakova et al., 2006;

Perepelitsa-Belancio and Deininger, 2003).

1.5.3 Epigenetic Effects of TEs

Being used by the host cell as a fast and efficient mechanism to inhibit TE expression, epigenetic changes such as DNA methylation, histone modifications and small RNA

targeting can repress the transcription of various types of TEs (Leung and Lorincz, 2011;

Slotkin and Martienssen, 2007; Yoder et al., 1997). However, such an epigenetic defense

may sometimes cause “side-effects” on normal gene transcription, and TEs may also escape

repression and change the expression pattern of nearby genes due to the reversible nature of

epigenetic modifications.

One fascinating example of TE methylation influencing gene expression is the Avy allele generated by an IAP ERV insertion near the mouse agouti locus (Duhl et al., 1994). While the wild-type allele (a) of the agouti gene in C57BL/6 mice produces a normal black coat color, mutant mice possessing an Avy allele show a varying color range from yellow to agouti.

Detailed analyses of this allele revealed that an IAP LTR-retrotransposon, which is absent in

wild-type C57BL/6 mice, is located in antisense orientation 100 kb upstream of the conventional TSS of

the agouti gene and works as an alternative promoter for the gene. This causes ectopic

expression of the agouti protein, resulting in yellow fur, obesity, diabetes and increased

susceptibility to tumors. Interestingly, the variable coat color of the mutant mouse is

correlated with the methylation state of the IAP element (i.e. hypomethylated – yellow;

hypermethylated – agouti), which displays epigenetic inheritance following maternal but not

paternal transmission (Morgan et al., 1999). Moreover, a further study showed dynamic

programming of DNA methylation of this allele in early mouse development, during which

the maternally inherited allele is demethylated much later than the paternal counterpart, and

methylation is reestablished later in mouse embryogenesis (Blewitt et al., 2006). However, the lack of

methylation of the maternally inherited allele at a certain stage during mouse development

indicates that DNA methylation may not be the direct mechanism of its inheritance.

The fact that haplo-insufficiency of a polycomb protein induces the epigenetic inheritance of

the paternally derived Avy epiallele suggests the involvement of other epigenetic modifiers

(e.g. histone modifications) that regulate chromatin state and DNA methylation (Blewitt et

al., 2006).

Intimately related to DNA methylation, chromatin state change induced by histone

modifications is also linked to TE repression (Leung and Lorincz, 2011) and, in certain

circumstances, may also influence nearby genes. It has been proposed that LINEs are related

to imprinting and X-chromosome inactivation (Allen et al., 2003; Lyon, 2006), and examples

of repressive chromatin spreading caused by TEs have been shown in both plants and animals (Martin et al., 2009; Rebollo et al., 2011; Sun et al., 2004), albeit reported cases are rare. Using strain-specific polymorphic ERV insertions in hybrid mice (reported in Chapter

2), our lab’s recent collaborative study of the epigenetic effects of mouse ERVs showed a general trend of heterochromatin formation induced by IAP elements, and at least one apparent example of heterochromatin spreading from an IAP into a nearby gene (Rebollo et al., 2011). Specifically, based on a genome-wide assessment using chromatin immunoprecipitation sequencing (ChIP-seq), we showed a significant enrichment of the heterochromatin mark H3K9me3 around common IAP insertion sites in two ES cell lines derived from different mouse strains. The same analysis of polymorphic IAPs present only in one of the two strains clearly demonstrated a TE-induced enrichment of H3K9me3 flanking the polymorphic IAPs, but not at the orthologous loci in the mouse strain in which these IAPs are missing. More interestingly, the same study also reported an apparent case of TE-induced heterochromatin spreading into a neighboring gene, in which the H3K9me3 mark induced by a solitary antisense IAP LTR extends into the promoter region of the downstream beta 1,3-galactosyltransferase-like (B3galtl) gene, causing a decrease in the expression of this gene. In comparison, the mouse strain lacking this IAP insertion showed neither heterochromatin spreading nor impaired gene expression. To eliminate any possible genetic background differences between the two mouse strains, a hybrid strain was used for allelic methylation and expression analyses, which confirmed the IAP as the cause of heterochromatin spreading

and down-regulation of the neighboring gene. It is believed to be the first such case involving

ERVs documented in mammals.

The recently appreciated post-transcriptional gene regulation by small RNAs such as microRNAs (miRNAs) and small interfering RNAs (siRNAs) may also have originated from an ancient cellular defense against virus infection and TE transposition (Obbard et al., 2009). These

short RNA molecules are about 20 – 30 nucleotides in length, usually derived from double stranded

RNA (dsRNA) fragments cleaved by the RNase III enzyme Dicer, and can interact with

Argonaute proteins to form the RNA-induced silencing complex (RISC), which can bind to

mRNA targets complementary in sequence to the small RNA and cause either target mRNA

degradation or translational inhibition (Almeida and Allshire, 2005; Okamura, 2011). While

TEs are usually targets of RNA silencing, it has been shown that these repetitive sequences

can also be used as a source of small RNA biogenesis, and may repress the transcriptional activity of both TEs and cellular genes (Watanabe et al., 2006; Yang and Kazazian, 2006).

Indeed, a computational study of human miRNAs showed that ~12% of annotated human miRNAs are completely derived from sequences of various TE families (Piriyapongsa et al.,

2007), which might further regulate the transcription of thousands of host genes.
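
The sequence complementarity that guides a small RNA to its mRNA target can be illustrated with a toy example. The sketch below is a deliberate simplification: it assumes perfect complementarity over the full length of the small RNA, whereas real miRNA targeting typically involves only partial pairing. The sequences are invented for illustration and the function names are hypothetical, so this should be read as a demonstration of the principle rather than a target-prediction method.

```python
# Toy model of small RNA target recognition by perfect base-pairing.
# Assumption: full-length, perfect Watson-Crick complementarity (no wobble
# pairs, no seed-only matching). Sequences below are invented examples.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (5'->3')."""
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(small_rna: str, mrna: str) -> list[int]:
    """Return 0-based start positions in mrna that pair perfectly with small_rna."""
    site = reverse_complement(small_rna)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions


# A 21-nt invented small RNA and an invented mRNA carrying one matching site.
small_rna = "AUGGCUAGCUAGGAUCCGAUA"
mrna = "GGGAAA" + reverse_complement(small_rna) + "CCCUUU"

print(find_target_sites(small_rna, mrna))
```

Because a TE-derived small RNA is complementary to every transcript that carries a copy of the same TE sequence, a single small RNA species could in principle match many sites at once under this simplified model.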

1.6 Transposable Elements and Host Evolution

1.6.1 TEs vs. the host genome: an everlasting battle

The arms race between TEs and their hosts has never stopped during the long course of evolution. Essentially as genomic parasites, TEs need to survive in the host genome by continuously generating new copies and escaping genomic silencing. Since TE activity could

be highly deleterious, the host organisms have developed various molecular mechanisms to limit the transposition and proliferation of these “jumping genes”.

Insertional tolerance can be considered a passive solution that the host genome uses to relieve itself from the pressure of genomic TE bombardment. Generally, TE insertions may happen anywhere in the host genome, but presumably only a small fraction of such insertions become fixed either by random genetic drift or, in rare instances, due to positive selection for retaining the TE. As mentioned earlier, organisms with relatively small genome size usually contain a small proportion of fixed TEs, presumably due to stronger negative selection, higher deletion rate or more stringent TE integration control (see Section 1.5.1). On the other hand, many larger genomes (e.g. the mammalian genome) are filled with TE sequences that are located mostly within heterochromatin or intergenic regions, where TEs are presumably less likely to have detrimental effects on host genes. Even when TEs do insert into genes, the intron-exon structure of genes means that insertions are far more likely to land in introns than in protein-coding exons. As a consequence, the larger the proportion of the host genome covered by TEs, the lower the probability of having new TE insertions in critical regions (e.g. exons). However, insertional tolerance may only be effective for genomes that are already large and contain many TEs and/or much non-functional DNA; otherwise, the cost of deleterious TE insertions would outweigh the potential benefits obtained from relaxing the control of genome size.
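
The dilution argument above can be made concrete with a simple calculation. The sketch below assumes that a new insertion lands uniformly at random across the genome, which is a simplification because real TEs show integration-site preferences. The exon and genome sizes are hypothetical round numbers chosen for illustration, not measured values, and the function name is likewise hypothetical.

```python
# Simplified model of insertional tolerance: if insertions are placed
# uniformly at random, the chance of hitting an exon is the exonic share
# of the genome. All sizes below are hypothetical round numbers.


def exon_hit_probability(exon_bp: float, genome_bp: float) -> float:
    """Probability that one uniformly placed insertion lands in an exon."""
    return exon_bp / genome_bp


exon_bp = 30e6  # hypothetical total length of protein-coding exons

# Hold the exonic content fixed and let the genome grow with TEs and other
# non-coding DNA: the per-insertion risk to exons falls in proportion.
for genome_bp in (100e6, 1e9, 3e9):
    p = exon_hit_probability(exon_bp, genome_bp)
    print(f"genome {genome_bp / 1e6:6.0f} Mb -> P(exon hit) = {p:.3f}")
```

Under these assumptions a thirty-fold increase in genome size lowers the per-insertion risk to exons thirty-fold, which is the sense in which accumulated TEs and other non-functional DNA buffer a large genome against new insertions.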

In addition to passive insertional tolerance, epigenetic silencing is an important active mechanism used by the host to effectively control TE activities. As noted above, DNA methylation and histone modifications can change the regional chromatin into a repressive state so that the embedded TEs can no longer be transcribed. This largely limits new TE

transpositions and their interference with host gene regulation. However, this mechanism raises

a new challenge to the host genome. When a TE is epigenetically silenced by the state

change of local chromatin, the epigenetic marks associated with the target TE may spread

into nearby genes and cause deleterious side effects on regular gene expression. Indeed, a

study in the plant Arabidopsis thaliana showed a negative correlation between the density of

methylated TEs (mTEs) and the expression level of nearby genes, and population genetic

analysis in the same study confirmed that mTEs near genes, but not their unmethylated

counterparts (uTEs), segregate at lower frequencies than neutral markers (e.g. synonymous single

nucleotide polymorphisms), suggesting purifying selection against mTEs (Hollister and Gaut, 2009). This

intriguing study clearly demonstrated an evolutionary trade-off between reduced TE activity

and potential side effects on nearby genes when using DNA methylation for TE silencing. To date, the spreading of epigenetic marks from silenced TEs into nearby genomic regions in mammals has been robustly confirmed for at least the B1 SINE and IAP LTR-retrotransposon elements in mice, though cases of spreading into nearby genes are rare

(Rebollo et al., 2011; Yates et al., 1999). While it is still unclear how the host genome

effectively prevents the deleterious spreading of epigenetic silencing marks, some evidence has shown the existence of a buffer zone between the silenced TE and its nearby gene,

suggesting unknown insulators might be involved (unpublished data in the Mager lab)

(Rebollo et al., 2011).

Besides DNA methylation and histone modifications, many host organisms can also

efficiently inhibit TE activities via various small RNA interference pathways such as

miRNAs, siRNAs and piRNAs. Among them, the piRNA pathway is particularly interesting

for the following reasons: 1) piRNAs are mainly produced from TE sequences; 2) their major

targets are TEs (Khurana and Theurkauf, 2010). This is a spectacular example of how the host genome may use existing TEs to prevent new TE transpositions. Notably, the piRNA pathway is only available in germ-line cells, in which other mechanisms of TE inhibition such as DNA methylation are often weak or absent, at least during certain developmental

stages (Sasaki and Matsui, 2008). Since only in germ-line can newly transposed TEs be

passed to the next generation of the host organism, the maintenance of genetic integrity and

genomic stability is even more important in germ cells. For this reason, piRNA silencing is

sometimes referred to as “the vanguard of genome defense” (Siomi et al., 2011).

Although TEs are largely considered as endogenous and thus usually do not induce any

immunological response, evidence has indicated that the intrinsic immunity of the host cell

could be involved, at least for TEs with an exogenous origin (e.g. ERVs). A classical

example is the restriction of an extinct retrovirus by the human TRIM5α antiviral protein

shortly after the divergence between human and chimpanzee (Kaiser et al., 2007). The Pan

troglodytes endogenous retrovirus, or PtERV1, has been found with more than 100 copies in

both the chimpanzee and gorilla genomes, but it is absent in humans (Yohn et al., 2005).

While phylogenetic analysis strongly suggests that the invasion of exogenous PtERV1 into

multiple primate species occurred about 3-4 Myr ago (Yohn et al., 2005), it was not

understood why the pandemic apparently did not affect humans who by then were

cohabitating with other Old World primates. To answer this question, Kaiser and colleagues

studied the relationship between PtERV1 and TRIM5α protein in a panel of selected primates, and found intriguing results (Kaiser et al., 2007). For example, their sequence analysis showed an arginine (R) residue at position 332 of the human TRIM5α protein, while the hominoid ancestral sequence carries a glutamine (Q) at the same position. When these

researchers resurrected the silenced PtERV1 by reconstructing its DNA from the consensus

sequence, they tested the restriction power of TRIM5α in a panel of primate species and

found only the human version of this antiviral protein to be highly effective against the

infectious PtERV1 retrovirus. When the arginine at position 332 of the human TRIM5α

protein was replaced by a glutamine (R332Q), however, the restriction of PtERV1 by human

TRIM5α was significantly reduced. More interestingly, when challenged with human immunodeficiency virus type 1 (HIV-1), human TRIM5α carrying the R332Q mutation reduced HIV-1 infectivity significantly more than the wild-type TRIM5α bearing R332. It appears that the ability to repel infection by PtERV1 is fixed in the human population, accounting for the lack of endogenous PtERV1 copies in the human genome. This

TRIM5 genetic variation, which was advantageous against PtERV, now renders humans susceptible to the modern virus, HIV.

1.6.2 TE exaptation: turning “junk” into “gold”

“Ex-aptation” is a term introduced by the paleontologists Stephen J. Gould and Elisabeth

Vrba to describe the evolutionary scenario in which biological features that now enhance the fitness of an organism were not built by natural selection for their current role (Gould and

Vrba, 1982). Exaptation can be further classified into two categories: 1) functional shift (or type 1 exaptation), which refers to the reuse by natural selection of a character that previously served a different purpose; 2) functional cooptation (or type 2 exaptation), which describes the situation where a character whose origin cannot be linked directly to natural selection (i.e. a “non-aptation” as coined by Gould and Vrba) is co-opted for a current use (Gould and Vrba, 1982). Unlike the widely (and often mistakenly) used term

“ad-aptation”, which refers to the evolutionary development of a brand new feature based on

natural selection, “ex-aptation” did not receive sufficient attention until recently, but has now

become an important subject in the study of evolution.

In the context of TE-host relations, exaptation has played critical roles during host

evolution. When a TE is “exonized”, i.e. incorporated into the mature mRNA of a cellular

gene as an exon, it may become part of the coding sequence and alter the protein product that

the gene makes. If the new protein product is not only functional, but also beneficial to the

host, it may be positively selected and eventually fixed in the host population (i.e. type 2

exaptation). It is noteworthy that not all exonization events will lead to exaptation, however.

Alu SINEs, for instance, are well-known as frequent exonization targets, but in a genome-

wide study of Alu exonization in primates, Krull et al. only identified a total of 153 human

chromosomal loci where Alu elements were conceivably exonized (Krull et al., 2005).

Further examination of these cases showed a large proportion being lost again in some

descendent primate lineages after exonization during evolution. The authors proposed that

such dynamic exonization of Alus was due to the relatively younger ages of Alu elements

(~60 Myr) and, indeed, a similar study of the mammalian-wide interspersed repeat elements

(MIRs) revealed a much higher extent of stable exonization for this older SINE family (~130

Myr) (Krull et al., 2007). A closer examination of five selected MIR-derived exons showed

that four of these exonized TEs contribute to only a proportion of the total gene transcripts, as alternatively spliced forms, with the remaining case (in gene ZNF639) being constitutively spliced in all mammals tested. The studies cited above suggest that true exaptation of a TE may require a long evolutionary time to either test or acquire all the mutations required for a beneficial role, which help it withstand the pressure of being lost during evolution.

More dramatically, exaptation of TE sequences that encode proteins may sometimes give rise to brand new genes that can significantly benefit the host organisms, a phenomenon known as “domestication”. One fascinating example of TE domestication is provided by the Syncytin genes, which perform crucial functions in placenta formation and were adopted from endogenous retroviral elements independently several times during the radiation of mammalian evolution (Dupressoir et al., 2005; Heidmann et al., 2009; Mi et al., 2000; Vernochet et al., 2011). In humans, HERV-W is an endogenous retrovirus family specifically expressed in the placenta, and analyses of HERV-W mRNA from this tissue have revealed the expression of an intact open reading frame (ORF) of the retroviral envelope gene from one specific HERV-W copy. Since the life cycle of ERVs is intracellular and usually does not need the retroviral envelope protein for virion secretion and reinfection, the identification of functional transcripts from the HERV-W envelope gene (named “Syncytin 1”) suggests positive selection. Indeed, in situ hybridization experiments clearly showed that Syncytin is a highly fusogenic membrane glycoprotein, and its expression in placenta is crucial in cell fusion and syncytia formation (the generation of multinucleated giant cells) (Blond et al., 2000; Mi et al., 2000). Shortly after the identification of Syncytin 1, a genome-wide screen of fusogenic ERV envelopes in primates identified Syncytin 2, a conserved retroviral envelope gene that was co-opted from an independent ERV family, HERV-FRD, and is also involved in placental morphogenesis in primates. Similar genome-wide screens of ERV envelope genes in other mammals such as the mouse (Dupressoir et al., 2005) and the rabbit (Heidmann et al., 2009) revealed multiple examples of ERV-derived genes that originated from independent exaptation events, all of which are convergently involved in placenta formation (Malik, 2012).

In addition to directly contributing to the reservoir of protein coding sequences of the host

genome, more TE exaptation events happen at the regulatory level. As mentioned earlier in

this chapter, many TEs carry cryptic or ready-to-use transcriptional signals (e.g. TFBSs) in

their sequences, and numerous cases of TE-derived promoters and enhancers that may

contribute to the regulation of normal host gene functions have been reported. Indeed, our

laboratory has recently created a web site to catalog the growing number of such cases

(http://sites.google.com/site/tecatalog/welcome). More importantly, the abundance and wide

dispersion of TEs in the host genome may lead to significant rewiring of various gene

regulatory pathways at a network level. An examination of genome-wide ChIP-seq data for

p53 showed about one third of the identified p53 binding sites occurring within human ERV

sequences (Wang et al., 2007). Similarly, another genome-wide ChIP-seq study of three key

regulatory proteins (OCT4, NANOG and CTCF) in human and mouse ES cells (Kunarso et

al., 2010) showed that repeat-associated binding sites (RABSs) contribute up to 28% of all

mapped regions, and multiple lines of evidence were shown for functional relevance of these

RABSs. It is worth remarking that, while both OCT4 and NANOG are conserved proteins

between human and mouse and regulate similar target genes, only 5% of their binding sites

are homologously occupied in the two species. By examining the origin of the RABSs of these regulatory proteins, the authors found most to be located in species-specific TEs and, as a result, revealed a group of human-specific target genes apparently rewired into the regulatory networks of OCT4 and NANOG via species-specific RABSs. All the above studies, plus the fact that thousands of conserved non-coding regions (presumably functional) in mammalian genomes are derived from TEs (Lowe et al., 2007), clearly demonstrate the

importance of TEs to host gene regulation in mammals. This strongly implies a beneficial role played by these once-considered “selfish” genomic parasites during host evolution.

1.7 Thesis Objectives

The overall objective of this thesis work has been to study the genome dynamics of mammalian TEs and their relationships to host genes using computational approaches. My work can be divided into three relatively independent projects, which form the foundations of the following three chapters.

Chapter 2 describes my study of genome-wide identification and evaluation of polymorphic ERV/LTR retrotransposons in mice. Although it had been recognized that

ERVs in mice were highly active and caused ~10% of all reported germ-line spontaneous mutations, the level of ERV polymorphisms in various lab mouse strains was unclear at the time I initiated this project. Using publicly available data of millions of short mouse genomic

DNA sequences with an average size around 800 bp, I designed a computational algorithm that identified both common and polymorphic insertions of the ETn/MusD and IAP elements, the two most active ERV families in mice, in four inbred mouse strains. My comparison of the genomic distributions between common and polymorphic ERVs was based on the hypothesis that if polymorphic ERVs generally represent younger elements not fixed in the host population, they should show a distribution pattern closer to the one expected by chance, which would imply an ongoing selection on these polymorphic elements. With the aid of lab colleagues, I also investigated the transcriptional effects of selected strain-specific ERV insertions.

Chapter 3 describes my study of intronic distributions of TEs in mammalian genes. While many genomic studies had been done on global TE distributions in various types of host

organisms, there was no comprehensive investigation of TE distributions within genes at the time I started this project. Previous studies had shown that TEs were largely underrepresented in genic regions, very likely due to strong purifying selection. However, I was intrigued by the question of why, although many intronic TE insertions are highly mutagenic and have been selected against, numerous fixed TEs reside within introns and apparently do no harm. Examining intronic distributions of the four major types of TEs in the human and mouse, I identified genomic features that may influence the retention probability of de novo TE integrations in gene introns. To test whether natural selection is the major driving force underlying the TE distribution pattern in genes, I compared the intronic distribution of fixed (and thus presumably harmless) TEs with either the mutagenic TE insertions in introns reported in the literature or the intronic polymorphic

ERVs (presumably still under selection) identified in Chapter 2 to determine if a shift of distribution patterns could be revealed. In addition, I sought assistance from my colleagues to use experimental approaches to evaluate the transcriptional effects of selected TEs.

Chapter 4 describes my study of the properties of genes that have either very high or very

low TE densities, as well as the relevance of these gene properties to the accumulation of

TEs in introns. Although previous studies had touched on the properties of genes with

extreme TE densities (i.e. the outlier genes), none had examined those genes sharing the

same TE content extremity in multiple species. Using orthologous genes in three mammalian

species (human, mouse and cow), I identified computationally two sets of genes that have

either extremely high or low density of TEs in all three species. With these two highly

reliable gene sets in hand, I considered whether common properties could be found for each gene set and what the biological relevance would be. Based on the hypothesis that the extreme density of TEs might be a consequence of natural selection, I examined gene properties such as chromosomal distribution, biological function and conservation level of both gene sets, to see if any difference existed between the two groups. Since it had long been hypothesized that genes with an open chromatin state in either embryonic stem (ES) cells or germ-line cells are more prone to heritable TE insertions (i.e. all TEs that we can observe in the host genome today), I searched for evidence to support that hypothesis by examining the polymerase II occupancy and histone modification status of the shared outlier genes in mouse ES cells.

Chapter 2: Identification and Investigation of ERV Polymorphisms in Mice

A version of Chapter 2 has been published: Ying Zhang, Irina A. Maksakova, Liane Gagnier, Louie N. van de Lagemaat, Dixie L. Mager (2008). "Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements." PLoS Genet 4(2): e1000007.

2.1 Background

The mouse is the model of choice for mammalian biological research and a plethora of mouse genomic resources and databases now exist (Peters et al., 2007). Notably, fueled by the availability of genomic sequence for the common strain C57BL/6J (B6) (Waterston et al., 2002), several research groups have documented genetic variation among strains using single nucleotide polymorphisms (SNPs) (Frazer et al., 2007; Wade and Daly, 2005; Yang et al., 2007). Surveys of mouse polymorphism due to segmental duplications or copy number variations have also recently been published (Graubert et al., 2007; Li et al., 2004). Such resources are invaluable in trait mapping, tracing strain origins, and genotype/phenotype studies. However, genome-wide studies to document other types of genetic variation have been lacking. ERVs/LTR retrotransposons are known to be highly active in inbred mice, causing ~10% of spontaneous mutations (Maksakova et al., 2006), but relatively little is known about the level of polymorphism of such sequences (Chapter 1, Section 1.4). Southern blotting and extensive genetic mapping have clearly demonstrated that ERVs related to murine leukemia virus (MLV) are highly polymorphic (Boeke and Stoye, 1997; Frankel et al., 1990; Stoye and Coffin, 1988), but such techniques are feasible only for low copy number ERVs, which constitute a very small fraction of ERVs and LTR retrotransposons in the mouse genome. Due to limitations of the array-based technology employed, the largest mouse polymorphism study performed by Perlegen focused only on SNPs in non-repetitive genomic regions, and was not designed to detect insertional ERV polymorphisms (Yang et al., 2007).

Compared to a single nucleotide difference, genetic variation due to insertion of an ERV obviously has a much greater probability of affecting the host (Chapter 1, Section 1.5). The

phenotypes of most mouse germ-line mutations caused by ERV insertions result not from

simple physical disruption of coding regions (although this does occur), but from

transcriptional abnormalities mediated by ERVs located in introns or near the affected genes

(Maksakova et al., 2006). Further, it is well appreciated that retroviruses can activate

oncogenes or growth control genes leading to malignancy (Boeke and Stoye, 1997; Kung et al., 1991; Rosenberg and Jolicoeur, 1997) and, indeed, are used as tags to identify genes involved in cancer (Dudley, 2003; Theodorou et al., 2007). Determining the extent of mouse

ERV polymorphism, therefore, is critical to understanding how ERVs contribute to diversity

and disease susceptibility among inbred strains.

The retroviral-like IAP and ETn/MusD families are two high copy number ERVs

responsible for most of the insertional germ-line mutations described in mice, and have been

postulated as highly polymorphic among inbred mouse strains (Chapter 1, Section 1.4.2). The

goal of the study presented in this Chapter was to quantitatively assess the genome-wide

polymorphism levels of the IAP and ETn/MusD families, and to identify those polymorphic

ERVs with the highest probability of affecting host genes. By comparing only the few strains

for which sufficient genomic sequence is available, I found high levels of insertional

polymorphism for both the IAP and ETn/MusD families. Moreover, I detected 695

polymorphic members of these ERV families located within genes, and found evidence that some of these affect gene transcription. Such polymorphisms represent a substantial source of genetic variability among inbred strains, and may play a major role in strain-specific traits.

2.2 Results and Discussion

2.2.1 Prevalence of ETn/MusDs and IAPs in different strains

As the first step in assessing the ERV polymorphisms in mice, I conducted a survey of the

overall copy numbers of IAP and ETn/MusD elements in the well-sequenced, assembled B6

genome using BLAST (see Section 2.4.2 of this chapter for details). For the IAP family, I

detected 2,595 full-length or partly deleted elements, plus 2,477 solitary LTRs, for a total of

5,072. ETn/MusD elements are less numerous than IAPs, with 1,873 sequences in the B6

genome, 1,457 of which are solitary LTRs. In accord with previous studies (Mager and

Medstrand, 2003), my results indicated that solitary LTRs, the result of recombination

between the 5′ and 3′ LTRs of proviral forms, are typically more common than full length

ERVs.

At the time of this study, sufficient whole genome shotgun (WGS) sequence traces

(Figure A.1) were available for only three mouse strains other than B6: A/J, DBA/2J, and

129X1/SvJ (referred to hereafter as “the three test strains”). To identify all traces containing

IAP or ETn/MusD sequences, I used specifically designed ERV probes (Figure A.2 and

Table A.1) to screen the trace archives of the three test strains with local sequence alignment.

Sequences flanking the ERV segment in each trace were used to map the region to a unique position in the assembled B6 genome, and to combine redundant traces (Section 2.4.3). This screening method identified 1,659, 1,509 and 1,379 ETn/MusD elements which could be assigned a unique location in A/J, DBA/2J and 129X1/SvJ, respectively. Similarly, for the

IAP elements, I identified 4,696 elements in A/J, 4,320 in DBA/2J, and 3,878 in 129X1/SvJ.

As discussed earlier, my genomic survey detected 1,873 MusD/ETn elements and 5,072 IAPs

in the assembled B6 genome. It is likely that the lower ERV numbers detected in the three test strains

compared to B6 are due mainly to incomplete sequence coverage of the traces available for each strain. Another factor contributing to the loss of detectable ERV insertions is the inability to map the trace to a unique location. This is usually because the flanking non-ERV portion is too short, is composed of other types of repeats, or is located within duplicated genomic regions. To estimate the approximate fraction of elements from each of the three test strains not detectable due to incomplete sequence coverage or other reasons, I determined how many elements in the assembled B6 genome could be found with my method, using randomly sampled sets of WGS traces from the B6 trace archive database. Using numbers of

B6 traces equivalent to that available for A/J (11,646,236), DBA/2J (7,998,826) and

129X1/SvJ (5,998,950), I detected, respectively, 83.8%, 77.9% and 68.6% of the 1,865

ETn/MusD insertions present in the assembled B6 genome (Figure A.3). It therefore seems reasonable to assume that approximately 16.2%, 22.1% and 31.4% of the ERVs present in A/J, DBA/2J and 129X1/SvJ, respectively, could not be found due to incomplete coverage or mapping difficulties. Moreover, this B6 trace sampling experiment also allowed me to conservatively estimate the false discovery rate of this procedure to be ~0.4% (Section 2.4.5).
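The fraction of elements expected to be missed in each test strain is simply the complement of the detection rate measured in the B6 down-sampling experiment. The following is a minimal sketch of that arithmetic, taking the three reported detection rates as given; the variable and function names are illustrative only and are not part of the original pipeline.

```python
# Detection rates (%) obtained when the B6 traces were down-sampled to the
# number of traces available for each test strain (values from the text).
detection_rate = {"A/J": 83.8, "DBA/2J": 77.9, "129X1/SvJ": 68.6}


def missed_fraction(rate_percent):
    """Percentage of elements expected to go undetected at a given detection rate."""
    return round(100.0 - rate_percent, 1)


missed = {strain: missed_fraction(rate) for strain, rate in detection_rate.items()}

for strain, value in missed.items():
    print(f"{strain}: about {value}% of ERVs expected to be missed")
```

These complements are estimates for B6-derived traces; applying them to the test strains assumes that trace quality and mappability are comparable between strains.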

2.2.2 Identification and frequency of polymorphic ERVs

As outlined in Figure 2.1 and described fully in Section 2.4.4, I designed a four-phase screening process to identify polymorphic ERVs. In the first phase, probes derived from known ERV sequences were used to screen the B6 assembled genome, and a collection of

ETn/MusD or IAP elements in B6 was obtained. In the second phase (illustrated in Figure

2.1A), I determined if the ERVs identified in the three test strains were also present in B6 by checking for existence of such ERV sequences at corresponding loci in the assembled B6 genome.

Figure 2.1 Screening strategy for detection of polymorphic ERV insertions.

A) Identification of ERV insertions in test strains. In the first step, ERV probes of different lengths were designed based on known ERV sequences (see Section 2.4.2 and Table A.1 for more details). Next, the ERV probes were aligned to trace sequences of the test strain with WU-BLAST, and all traces containing the target ERV sequences were retrieved (step 2). From each ERV-containing trace, a chimeric tag was constructed by taking the flanking genomic sequence appended with a small tail (≤ 50 bp) of the target ERV sequence (step 3). In the final step, all chimeric tags were mapped to the assembled B6 genome with BLAT, and the existence of corresponding ERVs in B6 was determined by checking whether the small ERV tail was included in the alignment (step 4).

B) Determining the polymorphism status of ERVs present in B6. In the first step, probes were built based on the sequences flanking all ERV insertions in the B6 genome. In the next step, these probes were used to select all traces containing such flanking sequences in test strains. In the third step, a 35-bp region adjacent to the mapped flanking sequence in each trace obtained from the previous step was compared to the corresponding ERV sequence in the B6 genome, and the existence of such an ERV element in the test strain was assessed according to the sequence identity.

In both panels, solid blue bars represent genomic sequences flanking the ERV insertions in mice. Green hatched bars or arrows with solid borders are ERV internal or LTR sequences, respectively. Gray shaded bars or arrows with broken borders are suspected ERV sequences, whose existence is determined by the alignment score of regions annotated with “?”s.

In the third phase (represented in Figure 2.1B), I included the dataset of all IAP and

ETn/MusD elements found in B6, and determined the presence of these ERVs in the three

test strains. To achieve this outcome, I retrieved the 5′ and 3′ flanking sequences from

elements present in the assembled B6 genome, obtained those flanking segments that could

be uniquely mapped to the genome, and identified sequence traces from the test strains that

contain these flanking segments. The traces were then checked for presence of the ERV. In

the final phase, a similar strategy was applied to the polymorphic ERV insertions found in

each test strain (but not in B6), so that the existence of corresponding ERVs in the other two

test strains could be assessed. The combination of these steps allowed me to compile lists of

ERV genomic locations and the polymorphism status of each ERV in the four strains. Due to

inability to uniquely map many ERV flanking regions to the short, unassembled sequence

traces, the status of many elements present in the assembled B6 genome could not be

computationally determined in the test strains (see below). Additionally, as discussed above,

incomplete sequence coverage of the test strains resulted in an “unknown” status for a

proportion of ERVs in each test strain.
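The two presence tests at the core of this screen (the chimeric tag of Figure 2.1A and the 35-bp comparison of Figure 2.1B) can be sketched as follows. This is an illustration only, not the code used in the study: all function names are hypothetical, alignment coordinates are assumed to be 0-based and half-open on the tag, and the identity threshold is a placeholder because the actual cutoff is specified in Section 2.4.4 rather than here.

```python
TAIL_MAX = 50  # longest ERV tail kept in a chimeric tag (bp), as in Figure 2.1A


def build_chimeric_tag(trace, erv_start, erv_end, tail_max=TAIL_MAX):
    """Join the longer genomic flank of a trace to a short tail of its ERV part.

    [erv_start, erv_end) is the ERV-matching span of the trace reported by the
    probe alignment. Returns (tag, tail_len, tail_at_end).
    """
    left, erv, right = trace[:erv_start], trace[erv_start:erv_end], trace[erv_end:]
    tail_len = min(tail_max, len(erv))
    if len(left) >= len(right):
        # upstream flank followed by the first bases of the ERV
        return left + erv[:tail_len], tail_len, True
    # last bases of the ERV followed by the downstream flank
    return erv[len(erv) - tail_len:] + right, tail_len, False


def tail_in_alignment(tag_len, tail_len, tail_at_end, aln_start, aln_end):
    """True if the aligned part of the tag, [aln_start, aln_end), covers the ERV tail.

    Requiring the whole tail to be aligned is a strict assumption of this sketch.
    """
    if tail_len == 0:
        return False
    return aln_end >= tag_len if tail_at_end else aln_start == 0


def percent_identity(a, b):
    """Ungapped percent identity over the shared length of two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n


def erv_in_trace(trace_window, erv_window, min_identity=90.0):
    """Figure 2.1B style call; min_identity is a hypothetical threshold."""
    return percent_identity(trace_window, erv_window) >= min_identity


# Synthetic example: a 40-bp flank followed by 120 bp of "ERV" sequence.
flank, erv = "ACGT" * 10, "TTGGCC" * 20
tag, tail_len, tail_at_end = build_chimeric_tag(flank + erv, len(flank), len(flank) + len(erv))
print(len(tag), tail_len, tail_at_end)
```

If the tag aligns to B6 only over its flank, the ERV is scored as absent from B6 at that locus; if the alignment runs through the tail, it is scored as present.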

In spite of the limitations described, I identified a large number of polymorphic ERVs

(Figure 2.2). Of all IAP elements detected in at least one strain, 2,143 were present in all

four strains, while 3,394 elements were scored as polymorphic (i.e. absent in at least one of the four strains), giving an overall polymorphic fraction of 61.3%. For ETn/MusD elements,

1,087 were mapped as present in all four strains, and 375 could be scored as absent in one or

more strains, a polymorphic fraction of 25.6% of all the elements having a determinable

status. Another 1,767 IAP and 660 ETn/MusD elements present in the assembled B6 genome

could not be mapped to the test strain traces due to incomplete trace coverage or repetitive

flanking regions, so their polymorphic status could not be computationally determined.

Remarkably, these high levels of insertional polymorphism were obtained by considering just

four strains, despite the fact that the status of many elements could not be ascertained in

some strains. Thus, the numbers of polymorphic ERVs among all inbred mice must be

significantly higher. Details of all the polymorphic ERV insertions identified in this study

can be found online in the form of custom tracks for the UCSC Genome Browser (see Section

2.4.10 for details).
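The scoring rule and the resulting fractions can be reproduced with a few lines. This is a minimal sketch under the definitions given above (an element is "common" if scored present in all four strains and "polymorphic" if detected in at least one strain and scored absent in at least one); the function names and the "+", "-", "?" encoding are illustrative choices.

```python
def classify(status_by_strain):
    """Classify one ERV locus from per-strain calls: '+', '-' or '?' (unknown)."""
    calls = list(status_by_strain.values())
    if "+" in calls and "-" in calls:
        return "polymorphic"
    if calls and all(c == "+" for c in calls):
        return "common"
    return "undetermined"


def polymorphic_fraction(n_common, n_polymorphic):
    """Percentage of polymorphic elements among those with a determinable status."""
    return 100.0 * n_polymorphic / (n_common + n_polymorphic)


# Counts reported in the text.
iap = polymorphic_fraction(n_common=2143, n_polymorphic=3394)
etn_musd = polymorphic_fraction(n_common=1087, n_polymorphic=375)
print(f"IAP: {iap:.1f}%  ETn/MusD: {etn_musd:.1f}%")

example = classify({"B6": "+", "A/J": "-", "DBA/2J": "?", "129X1/SvJ": "+"})
print(example)
```

Note that a locus with one "-" call is polymorphic even if other strains are unknown, whereas a single "?" is enough to prevent a locus from being scored as common.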

Figure 2.2 Fractions of polymorphic ERVs based on the four strains. Pie charts indicate the status of all detectable A) IAP elements and B) ETn/MusD elements. White sections indicate the fraction of elements that could be scored as present in all four strains (annotated as ‘common’). Dark blue sections represent the fraction of elements scored as absent in at least one strain (annotated as ‘polymorphic’). Side bars illustrate the data composition of polymorphic ERVs, with dotted/striped sections indicating polymorphic elements for which status could be/not be confirmed in all four strains, respectively.

2.2.3 Genic distribution patterns of the youngest ERVs are distinct from older elements

Given that the genomic distributions of ERVs fixed in a species are strongly shaped by

selection, I predicted that recently inserted ERVs would display genic distributions different from

their older cousins. To test this prediction, I compared the distributional properties of a subset

enriched for the youngest ERVs with that of ERVs common to all four strains. To obtain the youngest elements, I chose those present in one strain but absent in the other three strains.

Many of these likely still represent older polymorphic elements due to the fact that lab strains are genetic mixtures of subspecies of Mus (Beck et al., 2000; Wade and Daly, 2005; Yang et

al., 2007). However, this group will contain all the truly young elements inserted after strain

divergence. As shown in Figure 2.3, these datasets enriched for the “youngest” elements are

more likely to be found in genes (Figure 2.3A) and in the sense orientation within genes

(Figure 2.3B), compared with elements shared between all four strains. The higher

prevalence in genes and the reduced intronic orientation bias displayed by ERV subsets

enriched for the youngest elements suggest that some of them inserted very recently

and may be deleterious, but have not yet been completely eliminated by selection.
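The comparison between the "young" and "old" subsets is a comparison of two proportions. A two-sample z-test of this kind can be sketched as below. The pooled-variance form is an assumption (the text does not state which variant was used), and the counts in the example are invented purely for illustration; the real counts are given in Figure 2.3.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions (pooled variance).

    x1/n1 and x2/n2 are the successes/totals of the two groups.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 60 of 200 "young" and 200 of 1000 "old" elements in genes.
z, p = two_proportion_z_test(60, 200, 200, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With small groups, such as the sense-oriented ETn/MusD elements, the standard error becomes large and the test loses power, which is consistent with the one non-significant comparison noted in Figure 2.3.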

2.2.4 Confirmation of polymorphic ERVs in gene introns

My bioinformatics screens identified 623 polymorphic IAP elements and 72 polymorphic

ETn/MusD elements located within genes in one or more of the four strains. Complete lists

of these elements and their locations with respect to the B6 genome are provided in Table

A.2 and A.3. These Tables list in which of the four strains each element was computationally

detected by my screens. The question marks in the Tables are mainly due to mapping

difficulties or incomplete sequence coverage of the trace databases. A subset of these

elements was analyzed using genomic PCR on DNA from a panel of mouse strains

(including B6 and the three test strains) with primers flanking the insertion site to verify the

insertion status.

Figure 2.3 Distributions of young versus older ERV elements with respect to genes. A) Fraction of elements located within genes. B) Fraction of genic elements oriented in the same transcriptional direction as the gene. Dark blue bars represent elements found in all four strains. White bars represent ERVs present in only one of the four strains. Dashed lines indicate the expected fractions assuming a random genomic integration pattern. Error bars show standard errors, and P-values based on a two-sample z-test comparing young and old groups are shown. All differences between the “young” and “old” subsets are statistically significant except for the orientation bias of ETn/MusD elements (marked with “*” in B), due to the low numbers of elements in this category. Actual numbers of elements in each category are shown as numerators in fractions, with denominators being the total numbers of elements in the different groups.

For this analysis, I chose all 28 cases of ETn/MusD elements found in A/J gene introns but absent in B6, and 12 cases of ETn/MusD elements present in B6 gene introns but scored as absent in A/J (Table 2.1). For the 28 cases of elements computationally detected in A/J (cases 1-28 in Table 2.1), the ETn insertion in the dysferlin (Dysf) gene (case #9) is the only previously reported case and occurred 20-30 years ago in the A/J breeding stocks (Ho et al., 2004). For the set of 12 elements present in B6 (cases 29-40 in Table 2.1), the ETn element in the Wiz gene (case #40) has also previously been reported as polymorphic

(Baust et al., 2002).

In total, these 40 selected cases and four strains generated an experimental space of 160

predictions. In Table 2.1, columns headed by a strain name followed by “(p)” contain the computational predictions of the presence of the ERV

insertions in the corresponding strain. After excluding 16 undeterminable instances (denoted

as “?”s in these columns in Table 2.1), I was left with computational calls for these

ERV insertions across the four strains totaling 144 predictions. For 140 of these,

my computational predictions matched precisely the experimental confirmation of ERV

insertion status using genomic PCR (performed by my colleague, Liane Gagnier),

demonstrating the high accuracy of my bioinformatics screens. In one instance (case #39 in

DBA), the PCR failed, so the predicted insertion could not be tested. As a result, only three

cases showed anomalous PCR results that did not match my bioinformatics predictions. One

of these cases was #24 in Table 2.1, where I predicted an ETn/MusD insertion in an intron of

the Sytl3 gene in A/J mice. However, the PCR verification result showed no evidence of this

insertion in the A/J DNA sample used. I consequently reexamined the A/J sequence dataset,

and detected orthologous sequence traces both with and without this particular ERV element

(Figure 2.4). The most likely explanation for this finding is that this ERV represents a very

recent insertion present in a heterozygous state in the A/J genomic DNA used to generate the

trace sequence data. Since the rate of ETn/MusD retrotransposition in A/J is relatively high

compared with other strains (Maksakova et al., 2006), it is not surprising that individual A/J

mice will have occasional “private” insertions.

Figure 2.4 An apparent heterozygous ETn insertion in the Sytl3 gene in an A/J mouse. The top line is the first 30 bp of an ETnII LTR. The second line is from the A/J trace gnl|ti|1104656312, which consists of a non-ERV part and an ETn LTR part. The third line is a different trace sequence from A/J (gnl|ti|1344398576). The bottom line is from the RefSeq gene Sytl3 in the assembled B6 genome (build 36) with genomic coordinates shown. ETn sequences are bold red, and non-ETn sequences are blue.

The second anomalous case was #34, an ETn/MusD LTR found in the B6 genome within the Cadm4 gene and confirmed as present in all tested strains by PCR (Table 2.1). My computational screens correctly scored this LTR as present in DBA/2J and 129X1/SvJ, but as absent in A/J. Upon further examination of the

sequence data, I found one of the two available A/J sequence traces mapping to this

location to be an artifact, since it contains a segment of unknown origin. The other trace is

also unusual, as segments of it map to two locations several kb apart. Thus, this case can be

explained by artifactual sequence traces, demonstrating that the trace archives and, therefore,

my dataset are not without errors. The last inconsistent case was #37, an insertion located within the Slfn8 gene and predicted as present in both B6 and DBA. In this case, the PCR verification in DBA showed that the element is not present. Since both the computational and experimental results were clear yet contradictory, I do not have a definitive explanation for this case, although it is possible the trace is not of DBA origin. In any event, this case was regarded as a false positive. In several instances, the PCR data also allowed me to assign a definite insertion status to elements in test strains that could not be predicted in silico due to incomplete sequence coverage of the traces (Table 2.1).
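The concordance figures quoted above can be re-derived directly from Table 2.1. The sketch below is not part of the original analysis: it transcribes, for each of the 40 cases, the PCR result and the computational prediction for B6, A/J, DBA and 129X1 (in that order) and tallies the outcomes. The string encoding and the function name are illustrative choices.

```python
from collections import Counter

# One token per case (1-40). Each token is four (PCR, prediction) pairs for
# B6, A/J, DBA and 129X1; PCR is '+', '-' or 'F' (failed), prediction is
# 'Y', 'N' or '?' (undeterminable). Transcribed from Table 2.1.
CALLS = """
-N+Y-N-N -N+Y-N-N -N+Y-N-N -N+Y-N-? -N+Y-N-? -N+Y-N-? -N+Y-N-N -N+Y-?-N
-N+Y-N-? -N+Y+?+? -N+Y-N-N -N+Y-N+? -N+Y-N-N -N+Y+Y-N -N+Y+Y+Y -N+Y-N-N
-N+Y-N-N -N+Y-N-? -N+Y-N-N -N+Y-N-? -N+Y-N+Y -N+Y-?+Y -N+Y+Y+Y -N-Y-N-?
-N+Y+Y+Y -N+Y-N-N -N+Y-N-? -N+Y-N+Y +Y-N-N-N +Y-N-N+? +Y-N-N-? +Y-N-N-N
+Y-N-N+Y +Y+N+Y+Y +Y-N+Y-N +Y-N+Y+Y +Y-N-Y-N +Y-N-?+Y +Y-NFY+Y +Y-N-N-N
""".split()


def tally(calls):
    """Count concordant, discordant, untestable and undeterminable predictions."""
    counts = Counter()
    for row in calls:
        for i in range(0, len(row), 2):
            pcr, pred = row[i], row[i + 1]
            if pred == "?":
                counts["undetermined"] += 1
            elif pcr == "F":
                counts["pcr_failed"] += 1
            elif (pcr, pred) in {("+", "Y"), ("-", "N")}:
                counts["concordant"] += 1
            else:
                counts["discordant"] += 1
    return counts


counts = tally(CALLS)
print(dict(counts))
```

The three discordant pairs correspond to cases #24 (A/J), #34 (A/J) and #37 (DBA), and the single untestable prediction to case #39 (DBA).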

61 Table 2.1 Genomic PCR verification of ETn/MusD intronic insertions in different mouse strainsa

c case# gene location intron size insert_site b orient. B6 B6(p) A/J A/J(p) DBA DBA(p) 129X1 129X1(p) A/WySn SWR/J C3H Balb Sizef 1 2310035C23Rik intron 13 1924 1:107544558 A -d Ne +d Ye - N - N + + + + 350 2 Dnajc10 intron 3 1920 2:80121109 S - N + Y - N - N + - - - 5400 3 Atp9a intron 14 5914 2:168357724 A - N + Y - N - N - - - - 5700 4 Gem intron 2 4896 4:11637425 A - N + Y - N - ? + - - - 5500 5 B230396O12Rik intron 6 2946 4:153708611 A - N + Y - N - ? + - - - 5500 6 Art3 intron 2 8817 5:93471996 A - N + Y - N - ? F - - - 5500 7 Foxk1 intron 1 33098 5:142663608 A - N + Y - N - N + - + + 5500 8 Stk31 intron 3 6516 6:49330234 A - N + Y - ? - N + - - - 5500 9 Dysf intron 4 4795 6:84024675 S - N + Y - N - ? - - - - 6000 10 Zfp82 intron 5 4708 7:29768570 A - N + Y + ? + ? + - - + 1700 11 Pde8a intron 12 1773 7:81189399 A - N + Y - N - N + + - + 400 12 Pgbd5 intron 1 49175 8:127312767 A - N + Y - N + ? + - - + 400 13 Opcml intron 2 270720 9:28170744 S - N + Y - N - N - - - - 5400 14 Alg9 intron 14 20017 9:50582311 A - N + Y + Y - N + - + + unk. 15 Zfp291 intron 8 8166 9:55688688 A - N + Y + Y + Y + - + - 350 16 Cdh23 intron 32 2103 10:59779699 A - N + Y - N - N - - - - 5600 17 Odz2 intron 8 35382 11:36015943 A - N + Y - N - N + - - + 7500 18 Prkca intron 3 134255 11:107934734 S - N + Y - N - ? + - - - 5500 19 Akap6 intron 7 71966 12:53930690 A - N + Y - N - N + - + + unk. 20 Mark3 intron 14 7429 12:112088888 A - N + Y - N - ? + - - - 7800 21 LOC432723 intron 1 61826 13:4788783 A - N + Y - N + Y + + - + 350 22 Cacna2d3 intron 11 109422 14:28067384 A - N + Y - ? + Y + + + + 400 23 A2bp1 intron 3 321339 16:6641077 S - N + Y + Y + Y + + + + 5500 24 Sytl3 intron 4 12675 17:6573500 S - N - Y - N - ? - - - - N/A 25 Dlgap1 intron 4 55340 17:70603001 A - N + Y + Y + Y + + + + 350 26 Dym intron 16 43220 18:75389795 A - N + Y - N - N - - - - 5700 27 Mtm1 intron 8 4446 X:67552081 S - N + Y - N - ? 
+ - + + 5500 28 Col4a6 intron 2 146075 X:136618162 S - N + Y - N + Y + - - + 5500 29 Sh3bp4 intron 1 36659 1:90927936 A + Y - N - N - N - - + + 5550 30 Tor3a intron 4 10001 1:158497287 A + Y - N - N + ? - + - - 330 31 Cd84 intron 2 20560 1:173693834 A + Y - N - N - ? - - - - 1574 32 Ttbk2 intron 4 16402 2:120487764 A + Y - N - N - N - - - - 5478 33 Unc13b intron 7 46281 4:43167395 A + Y - N - N + Y - - - - 322 34 Cadm4 intron 1 16992 7:24206094 A + Y + N + Y + Y + + + + 338 35 Dpep1 intron 1 7590 8:126077748 A + Y - N + Y - N - + + - 322 36 Vnn3 intron 3 7946 10:23546116 A + Y - N + Y + Y - - - - 5696 37 Slfn8 intron 4 8583 11:82825929 A + Y - N - Y - N - F - + 334 38 Klhl1 intron 8 15107 14:95020963 A + Y - N - ? + Y - - - - 320 39 Mapk14 intron 6 2511 17:28455067 A + Y - N F Y + Y - - - - 320 40 Wiz intron 2 19667 17:32100856 A + Y - N - N - N - - - - 7121

a Full names of mouse strains are given in Materials and Methods (Section 2.4.6). Highlighted frame shows the four strains used in computational analysis.
b Corresponding position in B6 genome (version mm8).
c A: insertion is antisense to the gene; S: insertion is sense to the gene.
d + indicates presence of insertion; - indicates absence of insertion; “F” indicates PCR failure. (All PCR experiments listed in this table were performed by Liane Gagnier.)
e Y: computational prediction of having insertion; N: computational prediction of no insertion; ?: presence of insertion could not be determined computationally.
f Size of ERV insertion in bp (approximate for cases 1-28).

As expected, some of the identified insertions are not specific to a single strain. This finding indicates that many of the polymorphic ERV insertions arose prior to divergence of common inbred strains, or represent even older polymorphisms due to different origins of chromosomal segments in the genomes of today’s lab mice. For the 28 cases present in A/J but absent from B6, the short A/J sequence traces do not contain the entire ERV. However, the length of the inserted element could be estimated from the size of the genomic PCR product for 25 of these cases (last column in Table 2.1). In 15 cases, the size matches that of a full-length ETn element (5.5-6 kb), whereas two other cases appear to be full-length MusD elements (7.5-7.8 kb) and one case is likely a partly deleted element (case #10). Seven are solitary LTRs (320-400 bp), so the nature of the original insertion cannot be determined, since the LTRs of ETnII elements and MusDs are extremely similar (Baust et al., 2003). For the 12 elements present in the assembled B6 genome, seven are solitary LTRs, one is a partial element and four are ETn elements based on size and sequence. The element in the Wiz gene is a longer ETn variant (Baust et al., 2002).
Given that most published mutagenic insertions of this family are of the ETnII subfamily (Baust et al., 2002; Maksakova et al., 2006), the preponderance of polymorphic ETn elements over MusDs was expected.
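
The size-based grouping described above can be illustrated with a short Python sketch. The sizes are the approximate values listed for cases 1-28 in Table 2.1 (cases 14, 19 and 24 have no size estimate). The numeric cut-offs in `classify` are assumptions chosen only to reproduce the grouping given in the text; they were not part of the original analysis.

```python
from collections import Counter

# Approximate insertion sizes (bp) from Table 2.1, cases 1-28,
# omitting cases 14, 19 ("unk.") and 24 ("N/A").
SIZES_BP = {
    1: 350, 2: 5400, 3: 5700, 4: 5500, 5: 5500, 6: 5500, 7: 5500,
    8: 5500, 9: 6000, 10: 1700, 11: 400, 12: 400, 13: 5400, 15: 350,
    16: 5600, 17: 7500, 18: 5500, 20: 7800, 21: 350, 22: 400, 23: 5500,
    25: 350, 26: 5700, 27: 5500, 28: 5500,
}


def classify(size_bp):
    """Assign a size class. The thresholds are illustrative assumptions."""
    if size_bp <= 500:
        return "solitary LTR"
    if 5000 <= size_bp <= 6500:
        return "full-length ETn"
    if 7000 <= size_bp <= 8000:
        return "full-length MusD"
    return "partly deleted element"


counts = Counter(classify(size) for size in SIZES_BP.values())
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

With these assumed thresholds, the tally matches the counts stated in the text (15 ETn, two MusD, seven solitary LTRs and one partly deleted element).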

2.2.5 Potential gene expression effects mediated by polymorphic ERVs

Since ERVs/LTRs can affect gene transcription through a variety of mechanisms (Section 1.5), some of the polymorphic ERVs detected in this study may contribute to gene expression differences between strains, possibly leading to phenotypic differences. However, the factors that determine whether transcription of a gene will be affected by a nearby or intragenic ERV insertion are not understood, and in all likelihood, are complex. It is therefore not possible to estimate what fraction of the polymorphic insertions documented here may have functional consequences. Nonetheless, it can be predicted which cases may be more likely to affect gene expression. In the majority of documented cases where a new mutagenic ETn/MusD insertion causes significant transcriptional defects, the element has been located within an intron in the sense orientation, disrupting splicing patterns of the gene (Maksakova et al., 2006; van de Lagemaat et al., 2006). Thus, it is reasonable to postulate that ETn elements within introns and oriented in the same direction as the enclosing gene have a relatively high probability of affecting mRNA processing. Moreover, compared with older insertions, the youngest, polymorphic subsets of these elements are potentially more likely to affect host gene expression, as selection may still be operating in these cases.

Based on the foregoing reasoning, I chose to examine further a subset of cases using three criteria. First, since the consequences of IAP insertions can involve LTR bidirectional promoter effects (Maksakova et al., 2006) that are more complicated and difficult to predict, I chose to focus on ETn/MusD insertions. Second, only intronic ETn elements oriented in the same direction as the gene were selected. Third, I chose elements verified as present in A/J but lacking in B6 using genomic PCR (see Table 2.1). Seven such cases exist, involving ETns in the Dnajc10, Dysf, Opcml, Prkca, A2bp1, Mtm1 and Col4a6 genes. My colleague, Irina Maksakova, performed RT-PCR on RNA from A/J mice using primers from the gene exon upstream of the ETn insertion, coupled with primers from within the ETn chosen to detect the most frequently reported types of ETn-mediated transcriptional fusions from the literature (Maksakova et al., 2006). Sources of RNAs were chosen based on known expression patterns of the gene. As shown in Figure 2.5, chimeric transcripts were detected for all five of the genes tested, namely Dnajc10, Prkca, Mtm1, Opcml and Col4a6. The sense-oriented ETn element found in the Dysf gene in A/J has already been shown to cause similar splicing defects (Ho et al., 2004), and I did not examine A2bp1. In most cases, the splice sites used in the ETn element in the examples analyzed were analogous to those characterized in known mutagenic cases. For Prkca, however, analysis showed that the insertion is a member of the ETnI subfamily, as opposed to ETnII, and revealed usage of cryptic splice acceptor sites not previously documented (see Figure A.4 for sequences of splice sites). It should be noted that the subset of chimeric transcripts shown in Figure 2.5 is likely an underestimate, since a limited number of clones were sequenced and not all transcript variants would have been detected with the primers used. This RT-PCR analysis demonstrates that these ETn elements cause patterns of aberrant splicing similar to those documented in cases of known mutations due to new ETn integrations. Nevertheless, further quantitative analyses are required to determine the significance of these splicing abnormalities in affecting overall levels of gene expression. Such in-depth experimental investigations for each case were beyond the scope of this study.

Figure 2.5 Detection of ETn-gene chimeric transcripts. Aberrant transcripts induced by a novel ETn insertion into a gene intron are shown. Gene direction, from left to right, is the same as ETn orientation. Genes Dnajc10, Opcml, Mtm1 and Col4a6 harbor ETnII insertions, while Prkca has an ETnI insertion. ETn sections that are different in an ETnI element compared to ETnII are shown as striped. Cryptic and natural splice acceptor sites are designated as blue (previously identified) or yellow (newly identified in this study) vertical arrows. Splice donor sites are represented as red (previously known) or light blue (newly identified) triangles. Natural and cryptic polyadenylation sites are marked as pA. Numbered thin arrows denote primers used to amplify chimeric transcripts. Sense primers in the upstream exon are as follows: 1, Prkca-up-ex-s; 2, Dnajc10-up-ex-s; 3, Mtm1-up-ex-s; 4, Opcml-up-ex-s; 5, Col4a6-up-ex-s. Antisense primers in the ETn are as follows: 6, IM_3as; 7, MusD2_7130as; 8, LTR_2as. Number of clones for each transcript variant compared to all clones sequenced for each primer pair is indicated. Chimeric ETn transcripts and number of their examples identified previously (Maksakova et al., 2006) are shown at the bottom. Sequences of all splice sites used in these cases are shown in Figure A.4.

Note: Irina A. Maksakova performed the experiments and made this figure.

Microarray data on gene expression differences in inbred strains available through the Gene Expression Omnibus (Barrett et al., 2007) were also surveyed (see online resource at http://www.ncbi.nlm.nih.gov/geo/). My colleague, Louie van de Lagemaat, examined all cases listed in Table 2.1 for correlations between presence of the insertion and differences in gene transcript levels compared with strains lacking the insertion (Section 2.4.9). Specifically, he analyzed the microarray data available through NCBI GEO accession GSE3594 (Zapala et al., 2005), which includes data on gene expression in 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice. For the Dnajc10 gene, tissue-wide reduction in expression was noted in A/J mice relative to the other four strains (p < 10⁻⁴, binomial distribution) (Figure 2.6). Microarray data available through the GeneNetwork web site (http://www.genenetwork.org/) also showed that transcript levels of this gene in A/J are much lower than in all other tested strains, based on whole brain, cerebellum, hippocampus and eye datasets (Figure A.5). Dnajc10 has a sense-oriented ETn element in the third intron in A/J and the related A/WySn mice, but in no other strain tested (Table 2.1). This gene (also termed ERdj5) encodes an endoplasmic reticulum (ER) chaperone protein induced during ER stress, and is likely involved in protein folding (Corazzari et al., 2007; Cunnea et al., 2003).

Figure 2.6 Normalized, tissue-averaged expression of Dnajc10 across strains. Analysis of microarray data of transcript levels in 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice. Data retrieved from NCBI GEO accession GSE3594 (Zapala et al., 2005). Replicates within a tissue/strain were averaged and a variation pattern of relative expression levels across the above five strains was computed for each tissue. These patterns were found to be similar across tissues (except white adipose tissue which was an outlier).

Note: Louie N. van de Lagemaat performed the data analysis and made the figure.

Another gene for which significant differences in expression correlate with presence of an ETn element is Opcml. No data are available from the Zapala et al. study on this gene, but datasets accessed through GeneNetwork show that transcript levels in A/J, the only tested strain carrying an ETn insertion (Table 2.1), are significantly lower than in any other strain in cerebellum, whole brain, hippocampus and eye, the only tissues where A/J microarray information is available for this gene (Figure A.6). Opcml (Opioid binding protein/cell adhesion molecule-like), also termed Obcam, is a member of the IgLON gene family and encodes a synaptic neural cell adhesion molecule (Schofield et al., 1989; Yamada et al., 2007). Loss of expression and/or promoter hypermethylation of this gene has been reported in some human cancers, suggesting that it may play a tumor suppressive role (Reed et al., 2007; Sellar et al., 2003). Northern blot analysis on total RNA from A/J and B6 cerebellum was performed by Irina Maksakova using a probe derived from the exon upstream of the insertion site, and results are shown in Figure 2.7A. The ~6.5 kb band corresponding to Opcml full-length mRNA is markedly decreased in A/J compared with B6. A similar reduction in Opcml RNA was also observed in A/J using an exon probe downstream of the insertion site (data not shown). The two bands at 3-3.5 kb are due to cross-hybridization to another gene, neurotrimin (Hnt), which is a closely linked member of the IgLON family and highly related to Opcml in the region used as a probe (Struyk et al., 1995). Irina Maksakova also performed semi-quantitative RT-PCR on total RNA from A/J and B6 cerebral hemispheres using primers from Opcml exons just upstream and downstream of the ETn insertion site, and found an approximately 4.6-fold reduction in the correctly spliced Opcml RNA in A/J relative to B6 (Figure 2.7B). These results confirm the microarray data (Figure A.6), and indicate that presence of the ETn insertion correlates with a substantial decrease in full-length, correctly spliced Opcml mRNA. While there may be other reasons for the reduced transcript levels, such patterns suggest that the ETn element in these genes significantly affects expression by causing aberrant splicing (Figure 2.5), allowing only a minor fraction of normal transcripts to be produced.


Figure 2.7 Transcript levels of Opcml in A/J versus B6. A) Northern blotting. Total RNA from the cerebellum of an A/J and B6 mouse was hybridized to a probe from the Opcml exon upstream of the ETn insertion in A/J. The lower part of the figure is the ethidium bromide-stained gel, showing even loading as indicated by the ribosomal RNA bands. The band corresponding to Opcml is marked with an arrow. The probe cross-hybridized with a related gene Hnt, as explained in the text. B) Semi-quantitative PCR on cDNA from A/J and B6 cerebral hemispheres. Opcml cDNA was amplified with primers from upstream and downstream of the ETn insertion site. Opcml and Gapdh fragments were amplified from cDNA dilutions; for each dilution, the intensity of the resulting band was quantified and graphed as transcript levels of Opcml relative to Gapdh (see Figure A.7). The average and standard deviation for all experiments is shown.

Note: Irina A. Maksakova performed the experiments and made the figure.

For all other cases from Table 2.1, including the other genes with insertions that cause aberrant splicing detected by RT-PCR (Figure 2.5), the available microarray data were either inconsistent or did not show a clear relationship between presence of the insertion and altered levels of transcripts. These findings suggest that, in most cases, the ETn insertion has no significant effect on expression. This result is not surprising, given that thousands of ERVs or LTRs have become fixed during evolution in human and mouse genes, indicating that they can reside within introns without a functional impact (van de Lagemaat et al., 2006). As illustrated by the Dysf case, however, the microarray data should be treated with caution. It has been convincingly shown by Northern analysis that A/J mice with the ETn insertion lack full-length Dysf mRNA and protein in skeletal muscle (Ho et al., 2004). However, the available microarray data for Dysf are limited to cerebellum and whole brain, neither of which shows abnormally low transcript levels in A/J (data not shown). There could be several reasons for this discrepancy, but it illustrates that wet lab approaches are necessary to properly evaluate each case.

It is well established that IAP LTRs can promote ectopic gene transcription in cases of somatic oncogene activations and germ line mutations (Boeke and Stoye, 1997; Druker and Whitelaw, 2004; Kung et al., 1991; Maksakova et al., 2006; Wang et al., 1997) in addition to causing gene splicing defects similar to ETns. Moreover, a few mutations caused by IAP-driven aberrant gene expression have been shown to act as metastable epialleles, exhibiting variable expressivity among genetically identical mice linked to the variable epigenetic state of the IAP LTR (Blewitt et al., 2006; Druker and Whitelaw, 2004; Wang et al., 1997). In a recent study, Horie et al. identified transcripts from 11 loci in 129 strain embryonic stem cells that initiate in an IAP LTR and read into flanking sequence, in five cases giving rise to chimeric RNAs between an intronic IAP and the enclosing gene (Horie et al., 2007). In six of the 11 loci analyzed, the IAP element was not present in the B6 genome, prompting the authors to postulate that variations in IAPs may contribute to strain-specific traits. Although I have not functionally examined any cases of polymorphic IAPs identified here to look for LTR-initiated fusion gene transcripts, it is likely that numerous such cases exist.

2.3 Concluding Remarks

In this initial genomic assessment of ERV polymorphisms in mice, I used the available DNA sequence from four inbred strains to determine the level of insertional polymorphism of the currently active IAP and ETn/MusD ERV families. Despite mapping limitations and incomplete sequence coverage, I identified 3394 IAP and 375 ETn/MusD elements that are polymorphic among the four strains, resulting in polymorphic fractions of 61.3% and 25.6%, respectively. This is the first genome-wide determination of the extent of polymorphism of these ERV families. Given that this study was based on only a few strains, the total numbers of polymorphic elements must be substantially higher and represent a large source of genetic variation among inbred strains.

Among the polymorphic copies, 623 IAPs and 72 ETn/MusD elements reside in gene introns. In all five cases of sense-oriented ETn elements in A/J introns that we examined, evidence for gene splicing disruption was found by RT-PCR and, for two genes, further evidence of lower gene expression in A/J mice was observed through surveys of microarray data. While most polymorphic ERVs likely have little effect on host genes, I found that the prevalence within genes and the intronic orientation bias exhibited by polymorphic ERV subsets enriched for the youngest elements are distinctly different from those of older elements. This observation suggests that some of the polymorphic ERVs are deleterious but have not yet been eliminated by selection due to their short time in the genome or the controlled breeding environment of laboratory mice. Indeed, new insertions of these elements could play a significant role in genetic drift and inbreeding depression of mouse lines (Taft et al., 2006). I propose that a comprehensive effort to document ERV and other transposable element polymorphisms among multiple inbred strains would complement SNP data, and contribute greatly to our understanding of mouse genetic history, and genotypic and phenotypic variation.

2.4 Materials and Methods

2.4.1 Source data

The NCBI Trace Archive included a total of 195,993,571 traces from 38 mouse strains as of May, 2007 (http://0-www.ncbi.nlm.nih.gov.ilsprod.lib.neu.edu/Traces/trace.cgi). The majority of these traces were obtained by CHIP-related resequencing techniques, which exclude most repetitive sequences. In this study, I used only sequence traces obtained by whole genome shotgun (WGS) sequencing, which are unbiased in their content of repetitive elements. Three mouse strains (A/J, DBA/2J, 129X1/SvJ) were chosen to compare to the assembled B6 genome (version mm8) because these were the only strains with sufficient traces sequenced by shotgun-related strategies (Figure A.1).

RefSeq gene annotations were retrieved from the RefGene annotation table (version mm8, April 2007 annotation) downloaded from the UCSC Genome Browser. When an ERV insertion was found in a genomic region with multiple overlapping annotations, the smallest gene was chosen to improve specificity. The genomic coverage of annotated RefSeq genes in the mouse genome (used as the “expected value” in Figure 2.3A) was also calculated from the same annotation table. After merging overlapping RefSeq annotations and removing redundancies, the total coverage of genic regions in the mouse genome was calculated to be 31.58%.
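
The coverage calculation described above can be sketched as follows. This is a minimal illustration rather than the script used in this study: it assumes gene coordinates are half-open intervals (0-based start, exclusive end, as in UCSC tables), and the function and variable names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def genic_coverage(genes_by_chrom, chrom_sizes):
    """Fraction of the genome covered by at least one gene annotation."""
    covered = sum(
        end - start
        for intervals in genes_by_chrom.values()
        for start, end in merge_intervals(intervals)
    )
    return covered / sum(chrom_sizes.values())


# Toy example with made-up coordinates: two overlapping genes and one separate gene.
toy_genes = {"chr1": [(0, 100), (50, 150), (300, 400)]}
toy_sizes = {"chr1": 1000}
print(genic_coverage(toy_genes, toy_sizes))
```

Merging before summing ensures that bases shared by overlapping annotations are counted only once.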

2.4.2 Design of ERV probes and detection of ERVs in the assembled B6 genome

Three types of probes were designed based on template ERV sequences (only the type-1 probe is shown in Figure 2.1, step A1). For IAP, probes were based on a recently inserted polymorphic IAP 1Δ1 element (accession #EU183301) (Juriloff et al., 2005). For ETn/MusD, probes were based on a mutation-causing ETnII element (accession #Y17106) (Hofmann et al., 1998). MusD and ETnII elements are on average over 90% identical in the regions of the probes. To capture ETnI elements, which differ from ETnII/MusD elements in the 3′ part of the LTR and 5′ internal region (Baust et al., 2003; Shell et al., 1990), I used a representative ETnI element (accession #AC068908). As shown in Figure A.2, the type-1 probe included the full-length LTR and a small fragment of the internal ERV sequence; type-2 included only the full LTR; type-3 was only the first/last 60 bp of the 5′/3′ LTR. Additional information about probe design is summarized in Table A.1.

I conducted a survey of both ETn/MusD and IAP insertions in the B6 genome using the 60 bp type-3 probes because these probes lie in regions of low divergence between family members (data not shown), ensuring that all ERVs of each group would be detected. The probes were aligned to the B6 genome using the WU-BLAST 2.0 program (Gish, W. 1996-2004 http://blast.wustl.edu/), and any hit above the cut-off threshold was scored as an ERV insertion. To keep both sensitivity and specificity as high as possible, I designed an experiment to optimize the parameters of alignment identity and length of the aligned region. Results suggested a value of 80% for both parameters. To obtain an estimate of the sizes and numbers of ETn/MusD and IAP elements, all mapped ERV fragments (LTR termini) were merged into one individual element if they were: 1) on the same chromosome; 2) in the same orientation; 3) within 10 kb of each other.
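
The three merging rules can be illustrated with a short Python sketch. It is not the code used in this study: the `Hit` record and `merge_hits` function are hypothetical names, the distance is measured from the end of the growing element to the start of the next hit, and hits are chained greedily, both of which are assumptions about details not specified above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chrom: str
    start: int
    end: int
    strand: str


def merge_hits(hits, max_gap=10_000):
    """Merge LTR-terminus hits into elements (same chromosome, same strand, <= max_gap apart)."""
    elements = []
    for hit in sorted(hits, key=lambda h: (h.chrom, h.strand, h.start)):
        last = elements[-1] if elements else None
        if (
            last is not None
            and last.chrom == hit.chrom
            and last.strand == hit.strand
            and hit.start - last.end <= max_gap
        ):
            last.end = max(last.end, hit.end)  # extend the current element
        else:
            elements.append(Hit(hit.chrom, hit.start, hit.end, hit.strand))
    return elements


# Made-up coordinates: the first two hits look like the 5' and 3' LTR of one element.
toy_hits = [
    Hit("chr1", 1000, 1060, "+"),
    Hit("chr1", 6500, 6560, "+"),
    Hit("chr1", 7000, 7060, "-"),
    Hit("chr1", 30000, 30060, "+"),
    Hit("chr2", 100, 160, "+"),
]
for element in merge_hits(toy_hits):
    print(element)
```

Under this scheme a solitary LTR remains a single short element, whereas two termini of a full-length element collapse into one record spanning the whole insertion.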

2.4.3 Detection of ERV insertions in test strains

The standalone version of the WU-BLAST v2.0 program (Gish, W. 1996-2004 http://blast.wustl.edu/) was used to make local alignments between ERV probes and mouse traces in the NCBI trace archive database (step 2 in Figure 2.1A). The threshold parameters for BLAST were 80% for sequence identity, and 80% for length of the aligned region. A usable ERV-containing trace consists of two parts: a non-ERV flanking sequence and the target-ERV sequence. All ERV-containing traces with a flanking portion shorter than 30 bp were discarded. For each retained trace, a chimeric tag was constructed by taking the whole flanking portion appended with a small tail of its target-ERV sequence (Figure 2.1A, step 3). I required the target-ERV tail of the tag to be 1/5 of the flanking portion in length, up to a maximum of 50 bp.
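
The tag construction rule can be sketched in Python as follows. This is an illustration only: `build_chimeric_tag` and its parameters are hypothetical names, integer division is an assumed rounding rule for the 1/5 tail, and the `erv_on_right` flag is an assumed way of handling traces in which the ERV lies on either side of the flank.

```python
def build_chimeric_tag(flank, erv, erv_on_right=True, min_flank=30, max_tail=50):
    """Return flank plus a short ERV tail, or None if the flank is too short.

    The tail is 1/5 of the flank length, capped at max_tail bases.
    """
    if len(flank) < min_flank:
        return None  # flanking portion too short to map reliably
    tail_len = min(max_tail, len(flank) // 5)
    if erv_on_right:
        return flank + erv[:tail_len]   # flank followed by the start of the ERV
    return erv[-tail_len:] + flank      # end of the ERV followed by the flank


# Placeholder sequences: a 100 bp flank yields a 20 bp ERV tail.
tag = build_chimeric_tag("A" * 100, "C" * 80)
print(len(tag))
```

Keeping the ERV tail short relative to the flank means the tag maps through its unique genomic portion, while the tail is still long enough to test whether the ERV is present at the mapped locus.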

The ERV-containing traces were then mapped to the assembled B6 genome (version mm8), using the chimeric tags derived from the previous step as input queries for BLAT (Kent, 2002) (Figure 2.1A, step 4). The sequencing error rate of the mouse traces was estimated to be ~5% (data not shown). I therefore defined criteria for a significant mapping as follows: 1) the score of alignment being the highest mapping score among all BLAT hits; 2) the best hit being at least 2% higher in identity and 10% longer in mapping length compared to the second hit; 3) the alignment identity between the chimeric tag and the target locus being greater than 90%; 4) the length of aligned region being more than 70% of the tag length. Once a significant BLAT mapping site was identified, it was straightforward to check for presence of the ERV in the B6 genome based on alignment of the small target-ERV tail of the chimeric tag. If the BLAT mapping included more than 2/3 of the target-ERV tail, it was considered a common insertion also present in B6; if the mapping included less than 1/3 of the target-ERV tail, it was scored as absent from B6. Situations in between these two boundaries were extremely rare and were discarded.
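
The mapping criteria and the presence/absence call can be expressed as a small decision procedure. The sketch below is hypothetical and not the original implementation: `BlatHit` is an invented record rather than a parsed BLAT output format, "2% higher in identity" is interpreted as two percentage points, and "10% longer" as a relative difference in aligned length.

```python
from dataclasses import dataclass


@dataclass
class BlatHit:
    score: int
    identity: float    # percent identity of the alignment
    aligned_len: int   # number of tag bases aligned
    tail_aligned: int  # number of ERV-tail bases included in the alignment


def significant_hit(hits, tag_len):
    """Return the best hit if it satisfies criteria 1-4, otherwise None."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if not ranked:
        return None
    best = ranked[0]                                   # criterion 1
    if best.identity <= 90 or best.aligned_len <= 0.7 * tag_len:
        return None                                    # criteria 3 and 4
    if len(ranked) > 1:                                # criterion 2
        second = ranked[1]
        if best.identity - second.identity < 2:
            return None
        if best.aligned_len < 1.1 * second.aligned_len:
            return None
    return best


def b6_status(hit, tail_len):
    """Call the ERV present or absent in B6 from the aligned fraction of the tail."""
    fraction = hit.tail_aligned / tail_len
    if fraction > 2 / 3:
        return "present"
    if fraction < 1 / 3:
        return "absent"
    return None  # ambiguous cases were discarded


# Made-up hits for a 120 bp tag with a 20 bp ERV tail.
best = significant_hit([BlatHit(300, 93.0, 60, 0), BlatHit(500, 98.0, 110, 2)], tag_len=120)
print(best, b6_status(best, tail_len=20))
```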

2.4.4 Determining the polymorphism status of ERVs present in B6 or in the test strains

All sequences in the B6 genome with a length of 35 bp flanking both the 5′ and 3′ end of each detectable ERV element were aligned back to the B6 genome with BLAT, and only those with a unique location were retained. All retained 35-bp flanking sequences were used as queries of the WU-BLAST program, and all traces from the test strains containing such flanking sequences were collected (Figure 2.1, step B2). A minimum identity of 90% and a minimum mapping length of 80% were required. Because of incomplete genomic coverage of traces of test strains, many ERV flanking regions in B6 have no corresponding traces in the trace archive database, and their polymorphism status could not therefore be determined (denoted as “?” in Table A.2 and A.3). However, for ERVs in B6 with unique flanking sequence found in one or more test strain traces, presence of the ERV in test strains was determined by assessing identity between the ERV sequence in B6 and the sequence adjacent to the flanking sequence in the trace of the test strain. To align the two sequences, I used an implementation of the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970). I required a minimum identity of 90% and an alignment length of at least 35 bp to score the ERV as present in the test strain.
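
As an illustration of this step, the following is a minimal Needleman-Wunsch global alignment in Python together with the two thresholds stated above. It is a sketch, not the implementation used in this study: the scoring values (match +1, mismatch -1, linear gap -1) and the definition of identity as matches divided by alignment columns are assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (matching columns, total alignment columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting columns and matches.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1   # gap in b
        else:
            j -= 1   # gap in a
    return matches, columns


def erv_present(b6_erv, trace_seq, min_identity=0.90, min_length=35):
    """Apply the identity and alignment-length thresholds."""
    matches, columns = needleman_wunsch(b6_erv, trace_seq)
    return columns >= min_length and matches / columns >= min_identity


# Placeholder sequences: a 40 bp sequence and a copy with one substitution.
reference = "ACGT" * 10
variant = reference[:20] + "T" + reference[21:]
print(erv_present(reference, variant))
```

Because the alignment is global, both sequences should first be trimmed to the region expected to correspond; otherwise terminal gaps lower the identity.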

Using a strategy similar to the one just described, I assessed the polymorphism status of ERV insertions found in a test strain but not in B6. Using their locations with respect to the B6 genome, probes based on flanking genomic sequences were built, and the trace archive database was searched to check if traces with the same flanking sequences were present for other test strains. All qualified traces obtained from other test strains were aligned to sequences of corresponding ERV families based on the same mapping criteria used above, and the existence of such ERV elements in other test strains was determined. Instead of using the ERV portion in the original ERV-containing traces, I used exemplar ERV sequences because, for some traces, the ERV portion was too short (less than 35 bp) to make an effective alignment.

2.4.5 B6 trace sampling and screening simulations

The ERV numbers found in the three test strains are lower than the numbers detected in the assembled B6 genome. Incomplete sequence coverage and the inability to map the trace to a unique location are responsible for most of the loss of detectable ERV insertions. To estimate the fraction of ERVs that were not detected in each test strain, I applied my screening method using random samples of the unassembled B6 traces, and plotted an ERV detection curve based on this simulation (Figure A.3). Since the sequence quality of the B6 trace archive is generally lower than that of the three test strains, the sampling process was based only on B6 traces with less than 1% “N”s. Sample trace datasets of different sizes were built into simulated trace databases, and the corresponding numbers of B6 ERV insertions detected with these datasets were plotted in Figure A.3. Independent random sampling was applied twice for datasets smaller than 12 million traces.
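
The filtering and sampling step can be sketched as below. This is a hypothetical illustration: the function names are invented, traces are represented as plain strings, and sampling without replacement is an assumption.

```python
import random


def n_fraction(seq):
    """Fraction of ambiguous bases ('N') in a trace; empty traces count as unusable."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def sample_traces(traces, sample_size, max_n_fraction=0.01, seed=None):
    """Draw a random sample from traces containing less than 1% N."""
    usable = [t for t in traces if n_fraction(t) < max_n_fraction]
    rng = random.Random(seed)  # a fixed seed makes a sampling run repeatable
    return rng.sample(usable, min(sample_size, len(usable)))


# Placeholder traces of 200 bp with 0%, 2% and 0.5% N, respectively.
toy_traces = ["ACGT" * 50, "ACGT" * 49 + "NNNN", "A" * 199 + "N"]
print(len(sample_traces(toy_traces, sample_size=2, seed=1)))
```

Repeating the draw with different seeds at each dataset size gives the independent replicates used to build a detection curve.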

The second purpose for performing the screening simulations with B6 traces was to evaluate the accuracy of my screening method. Theoretically, all insertions found in the simulation assays in the B6 traces should be detected in the B6 reference genome. However, I found a few cases of insertions cataloged as “polymorphic”, meaning they are from the B6 traces and were mapped to a significant locus in the B6 reference genome where no such insertion was found. One of the possible explanations for this is that the assembly of the B6 genome is not perfect, especially in repetitive regions. Indeed, only 49 of the 54 non-ecotropic murine leukemia viruses (MLV) known to be present in B6 can be found in the mouse B6 assembly (Jern et al., 2007). Nonetheless, I considered all the “polymorphic” cases in each simulation assay to be false positives, and derived a conservative estimate of the accuracy of my screening method, with an average false discovery rate of 0.4% ± 0.1%.

2.4.6 Experimental verification of polymorphic insertions

The presence of an insertion was tested by amplifying genomic DNA from the following strains: SWR/J, C3H/HeJ, Balb/cJ, B6, A/J, DBA/2J, 129X1/SvJ and A/WySn. All strains or DNA were from the Jackson Laboratory. Primers (see Table A.2) flanking the potential insertion sites were used to amplify specific sequences from 75 ng of genomic DNA in a 25 µl reaction with Phusion DNA polymerase (New England Biolabs). As per the manufacturer’s instructions, cycling conditions had annealing temperatures of between 55 and 65°C, and extension times between 20 seconds and 4 minutes. PCR products were visualized on agarose gels. In some cases, amplification with the flanking primers did not produce a product, so one flanking primer and one LTR primer were used to confirm presence of an insertion. In these cases, therefore, the size of the ERV insertion could not be estimated. In two cases, marked as “F” in Table 2.1, the PCRs were unsuccessful in one of the strains, suggesting a structural rearrangement or the presence of other polymorphisms that prevented amplification with the primers used. Some products were sequenced directly from MinElute (Qiagen) gel-purified PCR fragments using the BigDye Terminator v3.1 Cycle Sequencing Kit (ABI) in an ABI PRISM® 3730XL DNA Analyzer system.

2.4.7 RT-PCR

RNA from mouse tissues was extracted using the RNeasy RNA isolation kit (Qiagen) according to the manufacturer’s recommendations. The presence of native transcripts was confirmed using the following primer pairs, located in exons flanking the intron with the ETn insertion: Col4a6-up-ex-s and Col4a6-down-ex-as; Dnajc10-up-ex-s and Dnajc10-down-ex-as; Mtm1-up-ex-s and Mtm1-down-ex-as; Opcml-up-ex-s and Opcml-down-ex-as; Prkca-up-ex-s and Prkca-down-ex-as. Then, RT-PCRs designed to look for chimeric transcripts between gene exons and the intronic ETn were performed. To search for transcripts utilizing the 2nd and 3rd splice acceptor sites in the LTR (see Figure 2.5), cDNA from A/J tissues specified in parentheses was amplified using a common ETn primer located downstream of the LTR, IM_3as, and the following upper exon-specific primers: Col4a6-up-ex-s (eye), Dnajc10-up-ex-s (testis), Mtm1-up-ex-s (lung), Opcml-up-ex-s (cerebral hemisphere), and Prkca-up-ex-s (eye). The same exon-specific primers and cDNA were used for the search of transcripts utilizing the first splice acceptor site, this time with the LTR-specific primer located upstream of the first PolyA site, MusD2_7130as. For Dnajc10, an additional PCR was performed with an upstream exon primer and a primer located at the very end of the LTR, IM_LTR_2as.

Semi-quantitative RT-PCR for the Opcml gene was performed with a series of A/J and B6 cerebral hemisphere cDNA dilutions, using the primers Opcml-ex2-s and Opcml-ex3-as in the exons upstream and downstream of the intronic ETn insertion. For Gapdh, primers Gapdh_ex6F and Gapdh_ex7R were used. Opcml and Gapdh fragments were amplified from cDNA dilutions of 1/20, 1/40 and 1/80 (Figure A.7A). For each dilution, the intensity of the resulting band was quantified using ImageQuant LT (GE Healthcare) software, and graphed as the intensity of Opcml relative to Gapdh (Figure A.7B). The average levels of all experiments are displayed in Figure A.7B. All primer sequences for RT-PCR experiments are listed in Table A.2.
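
The normalization described above amounts to a ratio of band intensities at each dilution, followed by averaging. A brief Python sketch is given below; the intensity values are hypothetical placeholders, not measurements from this study, and the function name is invented for illustration.

```python
from statistics import mean, stdev


def relative_level(target, reference):
    """Mean and standard deviation of target/reference intensity ratios across dilutions."""
    ratios = [t / r for t, r in zip(target, reference)]
    return mean(ratios), stdev(ratios)


# Hypothetical band intensities at 1/20, 1/40 and 1/80 dilutions (arbitrary units).
aj = {"Opcml": [10.0, 5.0, 2.5], "Gapdh": [100.0, 50.0, 25.0]}
b6 = {"Opcml": [40.0, 20.0, 10.0], "Gapdh": [100.0, 50.0, 25.0]}

aj_mean, aj_sd = relative_level(aj["Opcml"], aj["Gapdh"])
b6_mean, b6_sd = relative_level(b6["Opcml"], b6["Gapdh"])
print(f"fold reduction in A/J relative to B6: {b6_mean / aj_mean:.1f}")
```

Dividing by Gapdh at the same dilution corrects for differences in input cDNA, so the fold change reflects the target transcript rather than loading.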

2.4.8 Northern blotting

RNA from A/J and B6 cerebellum was used. For each lane, 6 µg of RNA was denatured, electrophoresed in a 1% agarose, 3.7% formaldehyde gel in 1x MOPS buffer, transferred overnight to a Zeta-probe nylon membrane (Bio-Rad) and baked at 80°C. A probe specific for the Opcml exon upstream of the ETn insertion was synthesized by PCR using primers Opcml-ex2-s and Opcml-ex2-as and labeled with 32P using a Random Primers DNA Labeling System (Invitrogen). Membranes were prehybridized in ExpressHyb (BD Biosciences) for 4 hours at 68°C, hybridized overnight at the same temperature in fresh ExpressHyb, washed according to the manufacturer’s instructions and exposed to film.

2.4.9 Microarray analysis

The microarray data of mRNA expression were obtained through NCBI GEO accession GSE3594 (Zapala et al., 2005), which included 10 tissues profiled in A/J, B6, C3H/HeJ, DBA/2J and 129S6/SvEvTac mice. The expression values for a given probeset replicated within the same strain and tissue were averaged, and the probeset expression rank was examined in two ways. First, each strain’s expression rank across genes within a given tissue was determined, and second, the expression rank of the insertion-carrying strain for a given gene was determined across tissues.
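
The averaging and ranking steps, together with one way a binomial tail probability could be computed from the ranks, are sketched below in Python. This is an illustration under assumptions, not the analysis script: the data layout and function names are hypothetical, and the null model (each of five strains equally likely to rank lowest in a tissue, with tissues treated as independent) is an assumption, since the exact test is not spelled out here.

```python
from math import comb
from statistics import mean


def average_replicates(replicates):
    """Average replicate values for each (strain, tissue) pair."""
    return {key: mean(values) for key, values in replicates.items()}


def rank_in_tissue(averaged, strain, tissue):
    """Rank of a strain among all strains in one tissue (1 = lowest expression)."""
    ordered = sorted((value, s) for (s, t), value in averaged.items() if t == tissue)
    return [s for _, s in ordered].index(strain) + 1


def binomial_tail(k, n, p):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Made-up replicate values for one probeset in one tissue.
toy = {("A/J", "liver"): [1.0, 3.0], ("B6", "liver"): [5.0, 7.0], ("DBA/2J", "liver"): [4.0, 4.0]}
averaged = average_replicates(toy)
print(rank_in_tissue(averaged, "A/J", "liver"))

# Illustrative count: a strain ranking lowest in 9 of 10 tissues, with p = 1/5 under the null.
print(binomial_tail(9, 10, 0.2))
```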

2.4.10 PolyERV custom tracks for Genome Browser

For the ease of accessing detailed information of all the polymorphic ERV insertions identified in this study by peer researchers, I produced custom tracks for the UCSC Genome

Browser (named “PolyERV”), and have made them available online:

Mouse mm8 version URL: http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm8&hgt.customText=http://142.103.207.67/people/dmager/PolyERV/track_data.txt

Mouse mm9 version URL: http://genome.ucsc.edu/cgi-bin/hgTracks?db=mm9&hgt.customText=http://142.103.207.67/people/dmager/PolyERV/track_data_mm9.txt

A screenshot of the PolyERV track is also given in Figure A.8. The polymorphic IAP and

ETn/MusD insertions are annotated in red and green, respectively. An ERV insertion displayed as a horizontal bar is present in the B6 reference genome, and its orientation is shown by arrows inside the bar. An ERV insertion displayed as a vertical line is absent from B6. More detailed information, such as the genomic coordinates and the presence or absence of the polymorphic ERV in different inbred strains, can be found by clicking on the ERV ID.
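As an illustration of how such a track could be assembled, the sketch below writes insertions in the UCSC BED9 custom-track format with per-item colours. It is an assumption-laden example and not the actual PolyERV track file: the field choices, the 1-bp representation of insertions absent from B6, and all function names are hypothetical.

```python
# Colours follow the description above: IAP in red, ETn/MusD in green.
COLORS = {"IAP": "255,0,0", "ETn/MusD": "0,128,0"}


def bed_line(chrom, start, end, erv_id, family, strand, in_b6):
    """Return one tab-separated BED9 line (0-based, half-open coordinates)."""
    if not in_b6:
        # Assumption: an insertion absent from the reference is drawn as a
        # 1-bp feature at the insertion site, with no orientation arrows.
        end = start + 1
        strand = "."
    fields = [chrom, start, end, erv_id, 0, strand, start, end, COLORS[family]]
    return "\t".join(str(field) for field in fields)


def custom_track(insertions):
    """Build the full track text from a list of keyword dicts for bed_line."""
    header = ('track name="PolyERV" '
              'description="Polymorphic ERV insertions" itemRgb="On"')
    return "\n".join([header] + [bed_line(**ins) for ins in insertions])
```

The resulting text can be pasted into the Genome Browser custom track form or served from a URL, as done for the tracks listed above.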

Chapter 3: TE Distributions Reveal Hazardous Zones and Residential Preferences in Gene Introns

A version of Chapter 3 has been published: Ying Zhang, Mark T. Romanish, Dixie L. Mager (2011). "Distributions of transposable elements reveal hazardous zones in Mammalian introns." PLoS Comput Biol 7(5): e1002046.

3.1 Background

As discussed in Chapter 1, mammalian TEs, which comprise at least one-third to half of

mammalian genomic DNA, are major factors that have shaped the landscape of the

mammalian genome through evolution. Although the majority of TEs reside outside genic

regions (Lander et al., 2001; Medstrand et al., 2002; Waterston et al., 2002), about 90% of all

human RefSeq genes still contain at least some TE sequences in their introns. Due to their

proximity to host genes, intronic TE insertions are generally considered to possess a greater

potential for affecting gene expression through various molecular mechanisms (Section 1.5).

Our current knowledge of the distribution of TEs within gene introns is very limited,

however, and it remains unclear why some intronic TEs perturb gene transcription while

most do not. To fully understand their biological effects, it would be useful to determine

which intronic TEs are most likely to affect gene expression, so they may be prioritized for

functional analyses. With a growing appreciation for SINE and LINE insertional

polymorphisms in human (see Section 1.4.1), such predictions would be particularly helpful

in identifying those polymorphic TE insertions with the greatest probability of affecting gene transcription, and thereby possibly contributing to phenotypic variability or disease susceptibility in humans. In this study, I conducted a set of bioinformatics analyses of TE distribution patterns within human and mouse genes, which revealed TE underrepresentation zones and distributional biases in gene introns. TEs that do occur within the underrepresentation zones are more likely to be involved in aberrant gene splicing. Known cases of intronic disease-causing TE insertions are primarily located within these zones, strongly suggesting that TEs in these locations are more likely to be harmful and be selected

against. The results of my study reveal a distinct tendency for TEs to affect gene transcription when poised near exons, and point to their continued role in catalyzing genome evolution.

3.2 Results and Discussion

3.2.1 Intronic regions near exon boundaries are depleted of TE insertions

According to my genomic survey, 85 - 90% of mouse and human protein coding genes contain TE sequences in their introns. In a computational study of the relationship between

Alu SINEs and alternative splicing, Lev-Maor et al. reported a drop of Alu density within

150 bp from intron boundaries (Lev-Maor et al., 2008). Based on this observation and the fact that most intronic splice signals are located at the 5′- and 3′-end of introns (Lodish et al.,

2007), I hypothesized that de novo intronic TE insertions near exons are more likely to be mutagenic, and consequently, that the frequency of TEs would be significantly lower than expected in general near intron ends.

To analyze the distributions of various TE types within introns, I conducted computer simulations to determine theoretical TE distribution patterns (see Section 3.4.2 for details). I then determined the actual distribution pattern of intronic TEs according to their distance to the nearest exon. To alleviate my concern about the potential effect of “distance shifting” (a hypothesized result of later TE insertions or other rearrangements occurring between a specific TE and its nearest exon), I also analyzed the distribution of the youngest 20% of intronic TEs. However, only minor differences were observed compared to all intronic TEs in the genome (data not shown). To show clearly the difference between simulated and actual

TE distributions at each predefined position in introns, I calculated the “standardized frequency” of observed TEs (see Section 3.4.4). The level of TE representation at each

predefined intronic interval is determined from the difference between the actual TE distribution in the genome (observed) and the computer simulation of random TE insertions

(expected). An overrepresentation of a given TE type within the corresponding intronic region is reflected by a positive value; an underrepresentation is indicated by a negative value.
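The binning and the observed-versus-expected comparison can be sketched as follows. This is a minimal illustration only. The exact definition of the standardized frequency is given in Section 3.4.4; here it is assumed, for the sake of the example, to be the relative difference (observed - expected) / expected. The bin edges and function names are likewise hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower edges (bp) of the predefined intronic intervals;
# the last bin is open-ended.
BIN_EDGES = [0, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000]


def distance_to_nearest_exon(te_start, te_end, intron_start, intron_end):
    """Distance (bp) from a TE to the closer intron boundary.

    Assumes half-open coordinates and a TE lying entirely inside the intron.
    """
    return min(te_start - intron_start, intron_end - te_end)


def bin_index(distance):
    """Index of the interval containing a non-negative distance."""
    return bisect_right(BIN_EDGES, distance) - 1


def bin_counts(distances):
    """Count TEs per intronic interval."""
    counts = [0] * len(BIN_EDGES)
    for distance in distances:
        counts[bin_index(distance)] += 1
    return counts


def standardized_frequency(observed, expected):
    """Assumed form: (observed - expected) / expected for each interval."""
    return [(o - e) / e if e else float("nan")
            for o, e in zip(observed, expected)]
```

With this convention, a value of -0.5 means that half as many TEs were observed in an interval as in the random simulation, and positive values indicate overrepresentation.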

I found, as expected, that all four major TE types are highly underrepresented near intron boundaries in both human (Figure B.1A) and mouse (data not shown). I next applied the same distribution analysis for only full-length or near full-length TE sequences (see Table 3.1 for “full-length” definitions). As shown in Figure B.1B for human, full-length TEs were again highly underrepresented when close to exons, but most TE types except SINEs showed larger underrepresentation zones (hereafter shortened to “U-zone”) compared with the all-TE distributions.

Table 3.1 Intronic underrepresentation zones by TE type*

TE type   Human U-zone     Mouse U-zone     Human U-zone     Mouse U-zone     Human cutoff size   Mouse cutoff size
          for all (bp) a   for all (bp) a   for FL (bp) b    for FL (bp) b    of FL (bp) c        of FL (bp) c
SINE      100              100              100              100              >250                >100
LINE      50               100              2000             2000             >5000               >5000
LTR       2000             1000             5000             2000             >5000               >5000
DNA       50               100              2000             2000             >1000               >1000

* The distributions of TEs were normalized by the overall G/C content preference of each TE type.
a Underrepresentation zone based on the distribution of all elements of each TE type.
b Underrepresentation zone based on the distribution of only 'near full-length' (FL) elements.
c The cutoff size of full-length elements for each TE type was set slightly shorter than the average full-length element as described in (Lander et al., 2001) and (Waterston et al., 2002).
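For readers who wish to apply these thresholds, the values in Table 3.1 can be encoded as a simple lookup. The sketch below is illustrative; the dictionary layout and function names are hypothetical, and the strict "less than" comparison follows the text's description of U-zones (e.g. <100 bp from the nearest exon).

```python
# U-zone sizes (bp) from Table 3.1: (all elements, full-length elements).
U_ZONE_BP = {
    ("human", "SINE"): (100, 100),
    ("human", "LINE"): (50, 2000),
    ("human", "LTR"): (2000, 5000),
    ("human", "DNA"): (50, 2000),
    ("mouse", "SINE"): (100, 100),
    ("mouse", "LINE"): (100, 2000),
    ("mouse", "LTR"): (1000, 2000),
    ("mouse", "DNA"): (100, 2000),
}

# Minimum length (bp) for an element to count as near full-length (Table 3.1).
FL_CUTOFF_BP = {
    ("human", "SINE"): 250, ("mouse", "SINE"): 100,
    ("human", "LINE"): 5000, ("mouse", "LINE"): 5000,
    ("human", "LTR"): 5000, ("mouse", "LTR"): 5000,
    ("human", "DNA"): 1000, ("mouse", "DNA"): 1000,
}


def is_full_length(species, te_type, length_bp):
    """True if the element is longer than the full-length cutoff."""
    return length_bp > FL_CUTOFF_BP[(species, te_type)]


def in_u_zone(species, te_type, distance_bp, full_length=False):
    """True if a TE at this distance from the nearest exon lies in the U-zone."""
    all_zone, fl_zone = U_ZONE_BP[(species, te_type)]
    return distance_bp < (fl_zone if full_length else all_zone)
```

For example, a full-length LTR element 3 kb from the nearest exon falls inside the human U-zone (5 kb) but outside the mouse U-zone (2 kb).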

I also noticed that intronic regions more than 20 kb from exons showed a significant underrepresentation of SINEs compared to random simulations. Unlike patterns close to exons, intronic TE distributions greater than 20 kb from exons are less likely due to purifying selection, so I searched for other explanations. SINE elements are more abundant in G/C-rich

regions (Lander et al., 2001; Medstrand et al., 2002) and, since large introns resemble intergenic regions in terms of G/C content (which is generally A/T rich) (Kalari et al., 2006),

I postulated that the drop of SINE frequency compared to random simulations in deep intronic regions was an effect of local G/C content. To determine if this was indeed the case, I normalized my random simulations with the local G/C content as described in Materials and

Methods (Section 3.4.3). Indeed, after applying such normalization, the underrepresentation of SINEs in deep intronic regions greatly flattened out, while the sizes of the U-zones near exons were not affected. All my subsequent analyses, therefore, employed this normalization.

Figure 3.1 shows the normalized plots for all human TEs (Figure 3.1A) and full length TEs

(Figure 3.1B), and these plots are very similar for mouse TEs (Figure B.2). It is of interest to note that the sizes of the U-zones near intron boundaries are different between TE types

(Table 3.1).

Original insertion site preferences, natural selection and genetic drift can all contribute to global TE distributions. While determining the initial integration site preference of TEs is difficult, if not impossible (especially for ancient families), a limited number of de novo TE integration studies showed that TEs in today’s human genome are distributed very differently from their initial target site preferences (see Section 1.3). Since 97% of TEs in the human genome and 93% in the mouse genome have been fixed for more than 25 million years

(Lander et al., 2001), it is indeed reasonable that their current distributions will bear little resemblance to any original insertion site preferences, but will primarily be the result of selection and genetic drift. Therefore, the TE U-zones identified here most likely result from purifying selection, rather than original avoidance of these regions during the integration process.


Figure 3.1 Intronic distributions of the four major TE types in human (normalized). The distributions of all (A) and full-length (B) intronic TEs in human are shown separately. The sizes of the U-zone observed for each TE type are specified in Table 3.1. In both A and B, the x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs at each intronic region, and is normalized by G/C content for each TE type. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

3.2.2 TEs within their U-Zones are shorter

The larger U-zones for full-length TEs (compare Figures 3.1A and B) suggest that purifying selection acts at much greater distances on full-length elements than on their partly deleted counterparts. This effect is not observed for SINEs, but these elements have a much shorter full-length size (~300 bp for human Alus) (Batzer and Deininger, 2002; Lander et al.,

2001), generally carry fewer cryptic transcriptional regulatory signals, and are less harmful to the enclosing genes than other TEs (Cordaux et al., 2006). For these reasons, full-length

SINE elements may be better tolerated at a closer distance to exons.

I next compared the average length of intronic TEs within and outside their full-length U-zones, and discovered a significant difference for all TE types in both species (Figure 3.2 for human; Figure B.3 for mouse). In fact, most elements within their respective U-zones are truncated, while a greater portion of TEs beyond such zones are full-size elements, resulting in a larger size variance (see the difference between upper whiskers in Figure 3.2 for human and also Figure B.3 for mouse). The length of individual TEs is therefore an important factor in dictating their genomic distributions, indicating that larger elements are more likely to be genotoxic when positioned near exons. These results support previous work regarding L1 LINEs, indicating that, compared to shorter elements, full-length L1s have more potentially disruptive splice and polyadenylation signals (Belancio et al., 2006), greater effects on expression of enclosing genes (Ustyugova et al., 2006), and a greater fitness cost (Boissinot et al., 2006).

Figure 3.2 Average size of human TEs within and outside the U-zone. Each TE type is divided into two groups as shown on the x-axis: one group for elements located within the corresponding full-length U-zone, and another group for those beyond. The average size of each TE group is indicated as the horizontal bar within each box, which represents the central 50% of data points of the group. Outliers beyond the 1.5x IQR (interquartile range) whiskers are not shown. P-values shown on top of each boxplot are based on the two-sample Wilcoxon test.

3.2.3 TEs near exons exhibit strong orientation and splice-site bias

Figure 3.3 Distributional biases of full-length human intronic TEs. A) Orientation bias of full-length intronic TEs. Y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented full-length TEs. B) Splice site bias of full-length intronic TEs. Y-axis shows the logarithmic fold-difference of full-length TE frequency between TEs close to the SA site and TEs close to the SD site. The x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

I next examined the distribution of intronic TEs in the sense orientation versus those in antisense with respect to the enclosing genes (see Figure 3.3A for human and Figure B.4A for mouse). Since DNA transposons comprise only about 3% of both the human and the mouse genomes, almost all of them being ancient elements without evidence of any transposition activity during the past 37-50 Myr (Section 1.4.1), I excluded them from the following analyses to avoid uncertainties introduced by their relatively small numbers. While

previous studies have found an overall antisense orientation bias in genes (particularly for

LTR elements and LINEs; see Section 1.2.2), I show in this study the existence of a much stronger antisense bias for both LINEs and LTR elements near exons. The excess of antisense TEs compared with sense elements near intron boundaries is probably the result of purifying selection, like the genome-wide orientation bias of TEs in genes. This indicates that, in general, sense-oriented TEs near splice sites have a higher probability of influencing normal gene transcription, and are potentially more harmful to the host gene. Interestingly, for SINEs I observed the same strong antisense bias in the mouse (Figure B.4A), but in the human genome I observed a sense orientation bias instead of antisense for SINEs

at a close distance of 20 - 200 bp from exons (Figure 3.3A). These data are consistent with

the Alu SINE study of Lev-Maor et al. (Lev-Maor et al., 2008), in which the authors

observed more sense-oriented Alu elements near intron termini. Since Alus account for two-thirds

of human SINE elements, and many antisense Alus possess a strong cryptic SA signal

(Gal-Mark et al., 2008), selection against antisense-oriented elements may explain the

unusual underrepresentation of antisense oriented SINEs near splice sites in humans.
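The orientation bias plotted in Figure 3.3A can be illustrated with a short sketch. The logarithm base is not stated in the figure legend, so base 2 is assumed here; the function names and the input layout are hypothetical, and bins with a zero count are simply reported as undefined.

```python
import math


def orientation(te_strand, gene_strand):
    """Classify a TE as 'sense' or 'antisense' relative to its enclosing gene."""
    return "sense" if te_strand == gene_strand else "antisense"


def log_fold_difference(count_a, count_b):
    """log2(count_a / count_b); None when either count is zero."""
    if count_a == 0 or count_b == 0:
        return None
    return math.log2(count_a / count_b)


def orientation_bias_by_bin(tes):
    """tes: iterable of (bin_label, te_strand, gene_strand) tuples.

    Returns {bin_label: log2(sense / antisense)}.
    """
    counts = {}
    for bin_label, te_strand, gene_strand in tes:
        tally = counts.setdefault(bin_label, {"sense": 0, "antisense": 0})
        tally[orientation(te_strand, gene_strand)] += 1
    return {label: log_fold_difference(t["sense"], t["antisense"])
            for label, t in counts.items()}
```

Negative values indicate an excess of antisense elements in a bin, and positive values an excess of sense elements.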

I further looked for evidence of any distributional bias of intronic TEs in terms of their

proximity to either splice donor sites (SDs) or splice acceptor sites (SAs). I found the total

numbers of elements near SA sites to be much lower than SD sites for all three

retrotransposon classes examined (see Figure 3.3B for human and Figure B.4B for mouse).

Since the core intronic splice signals at SD sites usually consist only of about 6 bp of

terminal intron sequence compared with 20-50 bp at SA sites (Lodish et al., 2007), selection

against physical disruption of critical splice motifs likely underlies this TE underrepresentation near SA sites.
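Assigning an intronic TE to its nearest splice site depends on the strand of the enclosing gene, which the following sketch makes explicit. The function name and the coordinate convention (half-open genomic coordinates, TE entirely within the intron) are assumptions made for this illustration.

```python
def nearest_splice_site(te_start, te_end, intron_start, intron_end, gene_strand):
    """Return ('SD' or 'SA', distance in bp) for the splice site nearest a TE.

    On the '+' strand the intron's left (lower-coordinate) end is its 5' end,
    i.e. the splice donor (SD) site; on the '-' strand the left end is the
    intron's 3' end, i.e. the splice acceptor (SA) site.
    """
    left = te_start - intron_start
    right = intron_end - te_end
    near_left = left <= right
    if gene_strand == "+":
        site = "SD" if near_left else "SA"
    else:
        site = "SA" if near_left else "SD"
    return site, min(left, right)
```

Counting TEs per distance bin separately for SA and SD sites then gives the two frequencies whose logarithmic ratio is shown in Figure 3.3B.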

Theoretically, harmful antisense transcripts of protein-coding exons may be generated by read-through transcription of antisense TEs near SD sites. If such antisense transcripts have significant detrimental effects, due to purifying selection, one might expect a larger proportion of TEs near SD sites to be in sense rather than in antisense. As shown in Figure

3.4A (human) and Figure B.5A (mouse), the predicted bias of sense-oriented TEs near SD sites was not found, except for human SINEs. This is likely explained by the previously noted fact that antisense Alus possess cryptic SA signals. For other TE types I observed more SD-associated elements oriented in antisense, indicating that antisense transcription is most likely effectively silenced or not a general problem, and that sense-oriented TE insertions are, in fact, more detrimental. The same analysis of TEs near SA sites revealed similar orientation bias patterns.

Figure 3.4 Orientation bias of human full-length intronic TEs based on their proximity to different types of splice sites. Orientation bias of full-length TEs near SD sites (A) and SA sites (B). The x-axis shows a series of predefined intronic regions based on the distance from a TE to the nearest exon. Y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented TEs. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

3.2.4 A high fraction of known mutagenic intronic TEs reside within U-zones

If the reduced frequency of TEs near intron boundaries reflects the force of selection against harmful insertions, one would predict a higher fraction of mutagenic TEs in gene introns located within these TE underrepresentation zones. To test this prediction, I compiled information on documented intronic mutagenic TE insertions, and examined their integration sites in introns.

Based on the TE activity and data availability, I focused on the following three TE families in my analyses: human Alu (SINE), human L1 (LINE), and mouse LTR elements. Alus, the most abundant TE family, have successfully propagated in the human genome and reached a total number of over one million copies (Lander et al., 2001). Some of these elements are still active today, generating new insertions and causing mutations linked to diseases (Batzer and

Deininger, 2002; Gallus et al., 2010; Ganguly et al., 2003). Based on the information provided by the dbRIP database (http://dbrip.brocku.ca/) (Wang et al., 2006), I found six de novo Alu insertions associated with human diseases within introns, all belonging to the AluY subfamily (the youngest subfamily of Alu) and causing splice defects of the enclosing gene

(Table B.1). De novo disease-causing insertions of L1, the active LINE family in humans, have also been reported (Brouha et al., 2003; Chen et al., 2005; Ostertag and Kazazian, 2001;

Yoshida et al., 1998). These elements play important roles in human retrotransposon-mediated

pathogenesis, not only by encoding reverse transcriptase (RT) and other proteins

required for their own retrotransposition, but also for mobilizing Alus (see Section 1.1.2). My

search of the dbRIP database in this study identified a total of five intronic L1s associated

with human diseases (Table B.2), all of which cause transcriptional disruptions. Since no

mutagenic LTR insertions and only a few insertionally polymorphic ERVs or LTRs have

been reported in human (Section 1.4.1), I turned to the mouse genome, where ERVs/LTR

elements cause ~10% of germline mutations, many of which have been well studied

(Maksakova et al., 2006). I collected in total 40 cases of mutagenic LTR elements in mice:

15 from the IAP family, 18 from the ETn/MusD family, and seven from other ERV/LTR

retrotransposons. All these ERV-induced intronic mutations in mice are due to transcriptional disruption of the enclosing gene (Table B.3).

For the three TE families listed above, I compared the intronic distribution of mutagenic elements with all full-length counterparts in the reference genomes, and found highly consistent results (Figure 3.5 and Table 3.2). As indicated in Figure 3.5A, all six mutagenic Alu insertions are within the U-zone of SINEs (i.e. <100 bp from the nearest exon), and all are oriented antisense with respect to the enclosing gene. Moreover, five out of the six cases are near SA sites. In comparison, only 1.83% of all full-length AluYs in the reference human genome are located within the 100 bp U-zone, strikingly lower than the fraction of mutagenic elements and also more than two-fold lower than expected by chance (p < 2.2e-16; one-sample proportion test). Of all full-length AluYs within the U-zone, 47.7% are in antisense, slightly lower than the random level (50%) but much lower than for mutagenic insertions. Since intronic TEs show their strongest splice site bias when they are in extremely close proximity to an exon (Figure 3.3B), I examined full-length intronic AluYs located less than 20 bp from exons, and observed only 10% of such elements near SA sites. Although this result cannot be directly compared to the case of mutagenic Alus due to their insufficient number within 20 bp of exons, the fact that five out of six mutagenic Alus are near SAs is noteworthy.

Figure 3.5 Comparisons of TE frequency within the U-zone. Three TE types were examined and results were plotted in panel A, B, and C for the human Alu, human L1, and mouse LTR elements, respectively. In each plot, three groups of comparisons are shown: ‘U-zone’ stands for TE insertions within the U-zone; ‘antisense’ for human Alu or ‘sense’ for human L1 and mouse ERV indicates TEs within the U-zone in the corresponding orientation with respect to the enclosing gene; ‘< 20 bp to SA’ indicates TE insertions within 20 bp of SA sites with an exception for human Alu and L1 mutagenic TEs (marked by shading), for which all cases were included due to limited total numbers. The y-axis shows the percentage of TEs that belong to the corresponding groups. Bars in each group represent mutagenic TE insertions (green), polymorphic TE insertions (light blue), all full-length TE insertions in the reference genome (yellow), and computational simulation as a random control (dark blue). Error bars represent standard errors derived from the total number of cases (sample size) for each category.

Similarly to mutagenic Alus, Figure 3.5B shows that all five mutagenic L1 elements are within the U-zone for full-length LINEs (i.e. <2 kb from the nearest exon). Among them, four are sense-oriented and four are near SA sites (three elements are common to both sets). In contrast, only 23.0% of full-length intronic L1s in the reference genome are within the U-zone, which is significantly lower than both the mutagenic L1s and my random simulation (p < 0.0004 and p < 2.2e-16, respectively; two-/one-sample proportion test). Of those elements within the U-zone, only 27.7% are in sense, significantly lower than both mutagenic insertions and the simulation (p < 0.035 and p < 2.2e-16, respectively; two-/one-sample proportion test). Although the number of full-length L1s in the reference genome within 20 bp of exons is very limited, among a total of seven cases only two were found near SA sites.

Table 3.2 Intronic distributional biases of mutagenic, polymorphic, and all full-length TEs

TE type     TE group        Total TE   TEs in U-zone /      Sense TEs /         TEs near SA /
                            cases      total TEs            TEs in U-zone       TEs ≤ 20 bp to exon
Human Alu   Mutagenic a     6          6/6 (100%)           6/6 (100%)          5/6* (83.3%)
            Full-length b   54136      989/54136 (1.8%)     472/989 (47.7%)     3/30 (10%)
            Expected c                 4.5%                 50%                 50%
Human L1    Mutagenic       5          5/5 (100%)           4/5 (80%)           4/5* (80%)
            Full-length     10134      2328/10134 (23.0%)   644/2328 (27.7%)    2/7 (28.6%)
            Expected                   28.6%                50%                 50%
Mouse LTR   Mutagenic       40         29/40 (72.5%)        20/26 (76.9%)       4/6 (66.7%)
            Polymorphic d   161        56/161 (34.8%)       13/56 (23.2%)       0/0
            Full-length     10150      1447/10150 (14.3%)   435/1447 (30.1%)    1/6 (16.7%)
            Expected                   36.6%                50%                 50%

* Due to the limited number of cases, all human Alu and L1 mutagenic insertions are included rather than only using elements within 20 bp of SAs.
a Mutagenic insertions documented in the literature.
b All full-length TEs (see cutoff size of full-length TEs in Table 3.1) in the reference human/mouse genome.
c Based on random computational simulation.
d Polymorphic ERV insertions present in the B6 mouse reference genome and at least one other mouse strain.

I also examined the same parameters for mouse LTR elements (Figure 3.5C and Table

3.2). As expected, a high fraction of mutagenic insertions (72.5%) are within the U-zone of full-length mouse LTR elements (i.e. <2 kb from the nearest exon). Remarkably, all 15 mutagenic insertions from the IAP family were within the 2 kb U-zone. Since the orientation of some mutagenic LTR elements within the U-zone was not indicated in their original reports, I checked the remaining 26 cases, and found that 20 (76.9%) were oriented in sense. Among these mutagenic insertions in mice, five are located within 20 bp of exons, three of which are near SA sites (60%). The situation is completely different, however, for all full-length LTR elements in the sequenced mouse genome (strain C57BL/6J, or B6). In contrast to mutagenic insertions, only 14.3% of full-length LTR elements in the reference genome were located within the 2 kb U-zone (p < 2.2e-16; two-sample proportion test), and of these elements only 30.1% are in the sense orientation (p < 2.65e-09; two-sample proportion test). At a distance of less than 20 bp from exons, I found six full-length LTR elements in the B6 reference genome, only one of which is near an SA site (16.7%).
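The one- and two-sample proportion tests used above can be approximated with a normal-theory z-test, sketched below using only the standard library. This is an illustration and not the procedure that produced the reported p-values: the software, any continuity correction, and the sidedness of the tests are not specified in this section, so the numbers will not match exactly, and the normal approximation is poor for groups of only five or six cases. Function names are hypothetical.

```python
import math


def _two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def one_sample_proportion_test(successes, n, p0):
    """Test an observed proportion against a fixed expected proportion p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    return z, _two_sided_p(z)


def two_sample_proportion_test(x1, n1, x2, n2):
    """Test whether two observed proportions differ (pooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, _two_sided_p(z)


# Counts taken from Table 3.2:
# full-length AluYs in the U-zone versus the simulated expectation of 4.5%
z_alu, p_alu = one_sample_proportion_test(989, 54136, 0.045)
# mutagenic versus all full-length mouse LTR elements in the U-zone
z_ltr, p_ltr = two_sample_proportion_test(29, 40, 1447, 10150)
```

Both comparisons give very small p-values under this approximation, in line with the direction of the differences reported in the text.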

In summary, the foregoing analyses of mutagenic versus all full-length elements for the three retrotransposon families consistently showed an overrepresentation of mutagenic TEs within their respective U-zones, but an underrepresentation of all full-length elements within the same regions. Moreover, apparent differences in orientation and splice-site biases were also observed between mutagenic TEs and all full-length elements in the reference genomes.

These observations strongly suggest that intronic TE insertions within the U-zone have a

much higher potential to be deleterious to the enclosing gene, particularly when oriented in

antisense for human SINEs and in sense for LINEs and LTR elements. When intronic TE

insertions are in extreme proximity (e.g. < 20 bp) to an SA site, they are very likely to be

harmful and may cause functional abnormality of the enclosing gene.

3.2.5 Polymorphic LTR elements in mice show an intermediate distribution pattern

I next extended my analyses to polymorphic AluY and L1 insertions not associated with

any disease based on the dbRIP data. These elements are considered relatively young since

they are not fixed in humans. If, indeed, selection is still working upon these TEs, one might

expect to see an intermediate distribution pattern between that of mutagenic and all elements.

However, for both polymorphic AluYs and L1s, I observed no significant differences from all full-length elements in the reference human genome (data not shown). Though this result may be partially accounted for by the limited total number of polymorphic insertions documented in dbRIP, it is very likely that the distribution of these polymorphic TEs has already been shaped by selection.

For the youngest insertionally polymorphic mouse LTR elements, however, I had previously shown that they do have a distinct prevalence in introns and orientation bias compared with older elements (Section 2.2.3). This finding suggests that some of these insertions are detrimental, but have not been eliminated due to the artificial breeding environment of inbred strains (Maksakova et al., 2006; Waterston et al., 2002; Zhang et al.,

2008). Indeed, some known detrimental LTR insertions have even become fixed in one or a few mouse strains (Druker et al., 2004; Ho et al., 2004). I therefore analyzed a list of polymorphic LTR insertions in four mouse strains from my previous study (see Chapter 2),

in which I had detected different distributions between polymorphic and common LTR elements. Here I used polymorphic IAP and ETn/MusD elements present in only one of the four analyzed mouse strains (presumed to be the youngest elements), and found that 34.8% of intronic insertions were within the 2 kb U-zone (Figure 3.5C and Table 3.2), a fraction very close to the simulated prediction of a random distribution, but significantly higher than all full-length LTR elements in the mouse reference genome (14.3%; p < 5.58e-13; two-sample proportion test) and lower than the mutagenic insertions (72.5%; p < 9.79e-05; two-sample

proportion test). Moreover, I observed 23.2% of polymorphic LTRs in the U-zone to

be sense-oriented, which, though showing no statistical difference from that of all LTRs, is

highly significantly lower than the mutagenic cases (p < 6.26e-07; two-sample proportion

test). Since my list of polymorphic LTR insertions in mice does not contain any intronic

insertions within 20 bp of an exon, I could not perform the analysis of splice site proximity

bias. Nonetheless, the above observation of an intermediate distribution pattern of

polymorphic LTRs between mutagenic LTR insertions and all full-length LTRs in the

reference genome demonstrates that purifying selection is the most likely underlying force

shaping the observed intronic TE distribution patterns, and evidence further suggests that

such a process is ongoing.

3.2.6 Chimeric transcripts and cryptic splice signals differ within and outside the U-zone

If TEs within their respective U-zones are more likely to be harmful by causing splicing

abnormalities, one can make two predictions. The first is that TEs located in the U-zones

would be associated with chimeric TE-gene transcripts more often than TEs located

elsewhere in introns. To test this prediction, I downloaded and analyzed the human expressed

sequence tag (EST) data from the UCSC Genome Browser, in which only spliced transcripts

were included. I then screened for all spliced ESTs overlapping with intronic TEs (i.e.

chimeric ESTs). As Figure 3.6A shows, 11.7% of human SINE elements within the 100 bp

U-zone are associated with chimeric ESTs. In contrast, this percentage is only 1.6% for SINE

elements outside the U-zone. Similarly, for human LINEs in their 2 kb U-zone, I found 4.6%

to be associated with chimeric ESTs, while the percentage outside the U-zone significantly

drops to 0.7%. I also identified 2.9% of human LTR elements as chimeric-EST-related in the

5 kb human LTR U-zone, but only 0.9% for elements outside the U-zone. All these results

are highly statistically significant (all p-values < 2.2e-16; two-sample proportion test), which

reinforces the notion that TEs within their U-zones are more likely to be involved in aberrant

splicing. It should be pointed out, however, that the splicing events detected by this analysis

are of unknown relevance and, indeed, are unlikely to have significant detrimental effects

because these TEs are fixed.
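The screen for chimeric ESTs amounts to an interval-overlap test between the aligned blocks of each spliced EST and each intronic TE. The sketch below shows the idea for features on a single chromosome; the data layout and function names are hypothetical, and the brute-force loop would need an interval index to scale to genome-wide data.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def is_chimeric(est_blocks, te_start, te_end):
    """True if a spliced EST (two or more aligned blocks) overlaps the TE.

    est_blocks: list of (start, end) genomic blocks of one EST alignment.
    """
    if len(est_blocks) < 2:  # unspliced alignments are ignored
        return False
    return any(overlaps(start, end, te_start, te_end)
               for start, end in est_blocks)


def chimeric_fraction(tes, ests):
    """Fraction of TEs overlapped by at least one spliced EST.

    tes: list of (start, end); ests: list of block lists, same chromosome.
    """
    if not tes:
        return 0.0
    hits = sum(1 for te_start, te_end in tes
               if any(is_chimeric(blocks, te_start, te_end) for blocks in ests))
    return hits / len(tes)
```

Computing this fraction separately for TEs inside and outside the U-zone gives the two percentages compared for each TE type.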

The second prediction is that TEs not eliminated from the U-zone would have weaker

splicing signals compared with other TEs. To examine this issue, I computationally analyzed

potential splice sites within randomly selected solitary LTR sequences in human introns

using NNSplice (Reese et al., 1997) (see Section 3.4.5 for details). As shown in Figure 3.6B,

as the distance between the intronic LTR and its nearest exon decreases, the average number and the strength of predicted splice sites in these LTR sequences also decrease. This

observation indicates that LTRs carry fewer and weaker cryptic splice sites within the U-zone,

especially when they are located in close proximity to exons.

Figure 3.6 Chimeric transcripts and cryptic splice signals of TEs within and outside the U-zone. A) EST-associated human intronic TEs within and outside the U-zone. Each TE type is shown as a group on the x-axis. The y-axis shows the percentage of intronic TEs that contribute to chimeric ESTs with the enclosing gene. The white/dark bar represents all TEs of each TE type within/outside the U-zone, respectively. The fractions beside each bar indicate the total number of TEs in each category (denominator) and the number of cases involved in chimeric ESTs (numerator). Error bars represent standard errors derived from the total number of cases (sample size) for each category. B) Predicted number and strength of cryptic splice sites in human LTRs. The top panel gives the average strength of predicted splice sites within sampled LTR sequences in each bin (the vertical axis) based on the distance from the LTR to its nearest intron boundary (the horizontal axis). The bottom panel shows the same but for the average total number of predicted splice sites.

3.2.7 Abnormal gene splicing linked to polymorphic LTR element insertions near intron boundaries

While the foregoing EST analysis suggests the importance of U-zones in TE-gene interactions, it would be useful to predict which particular intronic TEs are most likely to influence gene transcription based on their size, distance to the nearest exon, orientation, and

proximity to particular splice sites. In my initial evaluation of this concept, I examined a panel

of polymorphic LTR element insertions in inbred mouse strains since they are currently

highly active and, as previously discussed, their genomic distribution suggests that some are likely detrimental yet maintained due to the artificial breeding environment. In order to take advantage of the available EST/mRNA data in the B6 reference genome, I restricted the intronic polymorphic LTR elements examined to those present in the B6 mouse strain (Zhang et al., 2008). After excluding solitary LTRs and complex cases due to multi-gene families, I identified 44 full-length polymorphic LTR elements within the 2 kb U-zone

(data not shown). I then inspected each region using the UCSC Genome Browser (mouse genome version: mm9) to look for chimeric ESTs/mRNAs involving the LTR element and the enclosing gene, and found such transcripts for 19 of the 44 genes. For most of these 19 genes, the aberrant forms appear to be minor in abundance, so it is difficult to estimate their

overall impact on gene expression. Among these 19 genes, however, transcription of three

(Cdk5rap1, Adamts13, and Wiz) has been shown previously to be affected significantly by

the embedded LTR element (Banno et al., 2004; Baust et al., 2002; Druker et al., 2004).

Judging from the frequency of annotated chimeric transcripts, two other genes among the

group of 19 are of special interest: Kcnh6 (potassium voltage-gated channel, subfamily H

(eag-related), member 6) and Trpc6 (transient receptor potential cation channel, subfamily C,

member 6). While no evidence of transcriptional disruption caused by LTR element

insertions for these genes has been reported in the literature, UCSC Genome Browser

snapshots of their deposited mRNAs suggest significant involvement in the transcription of

each gene. For Trpc6, two of seven mRNAs in the database terminate within a polymorphic

IAP LTR element (Figure 3.7A), and for Kcnh6, one of three annotated mRNAs terminates

Figure 3.7 Chimeric transcripts of the Trpc6 gene and Kcnh6 gene in mice. Snapshots of the Trpc6 gene (A) and Kcnh6 gene (B) in UCSC Genome Browser are shown, with protein domains indicated. The red bar above the RefSeq gene annotation track shows the polymorphic LTR element insertion in B6 mice. For each gene, the mRNA track is shown, including the mRNAs terminating in the LTR element. Positions of primer sets used in the qRT-PCR experiments are indicated as arrowheads below the snapshot for each gene, with the upper pair (blue) for primers upstream of the polymorphic LTR element insertion, and the lower pair (green) for primers flanking the position of the LTR element insertion.

within another IAP insertion (Figure 3.7B). Trpc6 plays an important role in vascular and pulmonary smooth muscle cells, and its deficiency impairs certain allergic immune responses and smooth muscle contraction (Gonzalez-Cobos and Trebak, 2010). Kcnh6, also termed

Erg2 (eag related protein 2), encodes a pore-forming (alpha) subunit of potassium channels, and may serve a role in neural activation (Elmedyb et al., 2007). To examine the potential effect of the IAP polymorphisms on transcription of these two genes, the presence or absence of these insertions was confirmed by my colleague, Mark Romanish, using genomic PCR in

a panel of mouse strains that included B6, A/J, and 129SvEv. As a result, we were able to confirm that an IAP is present in B6 and A/J but not in 129SvEv for the Trpc6 gene, and that the IAP in the Kcnh6 gene is present only in B6 and not in A/J or 129SvEv (data not shown). Since both genes are highly expressed in the brain, quantitative RT-PCR was performed on brain cDNA from all three mouse strains by setting one primer pair upstream of the insertion site and another primer pair flanking the insertion site, as indicated in Figure

3.7. In mouse strains carrying the IAP insertion, we found a significant decrease in the amount of normally spliced transcripts involving exons flanking the ERV insertion compared

Figure 3.8 Effect of polymorphic LTR element insertions on transcription of the Trpc6 and Kcnh6 genes. Quantitative RT-PCR of the Trpc6 (A) and Kcnh6 (B) genes using brain RNA from the indicated mouse strains. Green bars show the amount of transcripts detected by the primer set upstream of the polymorphic IAP insertion, and blue bars show the amount of transcripts detected by the primer set flanking the location of the IAP insertion. Each bar represents the mean of at least 4 experiments ± standard deviation, which was first normalized to β-actin levels in the queried strain, and then represented relative to 5’ expression levels for each gene in B6 mice. The plus and minus signs show the presence and absence of the IAP insertion in the corresponding mouse strain, respectively.

Note: Experiments performed by Mark Romanish.

with exons upstream of the insertion. In contrast, we saw less difference between the upstream and flanking primer sets in strain(s) without the IAP insertion (Figure 3.8). The blockage of normal Kcnh6 transcription is particularly striking, with very little normal splicing occurring for exons flanking the IAP in the B6 strain. These data suggest significant transcriptional interference of these two genes mediated by the embedded IAPs. It would be interesting to determine if this interference does, in fact, result in phenotypic differences between mouse strains with and without these insertions.

3.3 Concluding Remarks

In this genomic study of mammalian TE distributions within genes, I have identified intronic underrepresentation zones near exons, where fixed TEs occur less often than expected by chance. Strikingly, all documented human intronic Alu and L1 insertions and most mouse intronic LTR elements known to cause disease are located within these U-zones, strongly suggesting that TEs in these locations are more likely to cause transcriptional disruptions and to be eliminated by selection. Moreover, TEs within their U-zones are more likely to be involved in spliced chimeric transcripts than those located elsewhere in introns, suggesting that some may be slightly detrimental. Presumably, in most of these cases the transcriptional effects are insufficient to cause such insertions to be eliminated by purifying selection. It is possible, however, that even apparently subtle effects on gene splicing could have functional consequences. Previous studies have demonstrated, on the other hand, that TEs fixed in the host genome can participate in gene transcription, producing alternative transcript isoforms that might have functional importance (see Section

1.6 in Chapter 1). Beyond identifying potentially deleterious TE insertions, it is

also of great value to identify fixed TEs that contribute to normal gene expression and cell

functionality. The U-zones identified in this study, coupled with TE size, orientation bias,

and location relative to SD or SA sites can all be combined to help predict those TEs with a

higher likelihood of functional significance, while also yielding new insights into the effects

of TEs on gene regulation and evolution.

3.4 Materials and Methods

3.4.1 Source Data

TE annotations

The original TE annotation data obtained from the RepeatMasker tracks of the human

hg18 genome and the mouse mm9 genome at the UCSC Genome Browser

(http://genome.ucsc.edu) were further processed to fit this study. Since the annotations of

TEs defined by RepeatMasker (http://www.repeatmasker.org) are only fragments based on

the similarity to the consensus sequences of different TE families, a single full-length TE

element may have multiple RepeatMasker entries if its sequence is not continuous in the

genome. Therefore, in my analyses I computationally merged such TE fragments into a

single element and counted them only once as an independent TE insertion event when they:

1) belong to the same TE family; 2) are on the same chromosome; 3) are within 10 kb

of each other; and 4) are in the same orientation.
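The four merging criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used in the thesis: the `Fragment` record, the `merge_fragments` function, and the reading of "within 10 kb" as the gap between the end of one fragment and the start of the next are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """Hypothetical record for one RepeatMasker entry (half-open coordinates)."""
    chrom: str
    start: int
    end: int
    family: str
    strand: str

def merge_fragments(fragments, max_gap=10_000):
    """Merge fragments that share family, chromosome and orientation and lie
    within max_gap bp of each other, so each insertion is counted once."""
    merged = []
    # Sorting by the grouping keys places mergeable fragments next to each other.
    ordered = sorted(fragments,
                     key=lambda f: (f.chrom, f.family, f.strand, f.start))
    for frag in ordered:
        last = merged[-1] if merged else None
        if (last is not None
                and last.chrom == frag.chrom
                and last.family == frag.family
                and last.strand == frag.strand
                and frag.start - last.end <= max_gap):
            # Assumed meaning of "within 10 kb": gap between neighbouring fragments.
            last.end = max(last.end, frag.end)
        else:
            merged.append(Fragment(frag.chrom, frag.start, frag.end,
                                   frag.family, frag.strand))
    return merged

example = [
    Fragment("chr1", 100, 400, "L1", "+"),
    Fragment("chr1", 900, 1500, "L1", "+"),      # merged with the first fragment
    Fragment("chr1", 20000, 20500, "L1", "+"),   # more than 10 kb away, kept apart
    Fragment("chr1", 950, 1000, "Alu", "+"),     # different family
    Fragment("chr1", 1600, 1700, "L1", "-"),     # opposite orientation
]
merged_example = merge_fragments(example)
```

In this toy input, five annotation entries collapse into four independent insertion events.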

EST data

The human EST data were also downloaded from the UCSC Genome Browser. Here I

used the UCSC Table Browser to download only the spliced EST data, which are stored in the intronEst table. According to a reference at the UCSC Genome Browser

(https://lists.soe.ucsc.edu/pipermail/genome/2008-June/016560.html), TE-only transcripts

were not included in these datasets.

3.4.2 Computer Simulation of Random TE insertions

To establish a baseline of TE distributions in gene introns, I applied computational

simulations of random TE insertions in both the hg18 human genome and the mm9 mouse

genome. In this study, I used the RefSeq gene annotation data downloaded from the UCSC

Genome Browser. For each round of simulation, I generated 1,000,000 random genomic loci

across the entire host genome to mimic randomized TE insertions. I divided intronic regions

into 13 bins with gradually increasing bin size according to their distance to the nearest exon:

0-20 bp, 20-50 bp, 50-100 bp, 100-200 bp, 200-500 bp, 500-1000 bp, 1-2 kb, 2-5 kb, 5-10 kb,

10-20 kb, 20-50 kb, 50-100 kb, >100 kb. My intention in using increasing bin size was to establish a higher resolution at the interesting regions near intron boundaries, while at the same time maintaining a good overview of other intronic regions. I then calculated the fraction of simulated TE insertions located in each bin with respect to the total simulated insertions in introns. This simulation process was applied three times, and the average was taken for each genome as a control distribution for all further analyses. My calculation of the standard error of the mean based on three rounds of simulations showed a negligible

sampling error for each bin (data not shown), confirming the suitability of using these results

to represent the theoretical random TE distribution.
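The binning and averaging procedure described above can be illustrated with a short Python sketch. It is a simplified stand-in rather than the original pipeline: it assumes introns are given as half-open `(start, end)` intervals, and it samples positions directly inside introns (weighted by intron length) instead of drawing loci genome-wide and keeping the intronic ones, which should yield the same intronic fractions. All function names are hypothetical.

```python
import bisect
import random

# Lower edges (bp) of the 13 distance bins listed in the text; the last bin is open-ended.
BIN_EDGES = [0, 20, 50, 100, 200, 500, 1_000, 2_000, 5_000,
             10_000, 20_000, 50_000, 100_000]

def distance_bin(distance):
    """Index of the distance bin (0..12) for a distance to the nearest exon."""
    return bisect.bisect_right(BIN_EDGES, distance) - 1

def simulate_round(introns, n_insertions, rng):
    """Fraction of simulated intronic insertions that falls in each bin."""
    lengths = [end - start for start, end in introns]
    # Longer introns receive proportionally more random insertions.
    chosen = rng.choices(introns, weights=lengths, k=n_insertions)
    counts = [0] * len(BIN_EDGES)
    for start, end in chosen:
        pos = rng.randrange(start, end)
        distance = min(pos - start, end - 1 - pos)  # nearest intron boundary
        counts[distance_bin(distance)] += 1
    return [c / n_insertions for c in counts]

def control_distribution(introns, n_insertions=1_000_000, rounds=3, seed=0):
    """Average the per-bin fractions over several independent rounds."""
    rng = random.Random(seed)
    runs = [simulate_round(introns, n_insertions, rng) for _ in range(rounds)]
    return [sum(values) / rounds for values in zip(*runs)]

# Toy input: one short and one very long intron, with fewer insertions for speed.
toy_distribution = control_distribution([(0, 1_000), (5_000, 305_000)],
                                        n_insertions=2_000)
```

Because the bins widen with distance, most of the simulated mass lands in the distal bins even though the resolution is highest next to the intron boundaries.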

3.4.3 Normalization of intronic TE distribution by G/C content

To minimize the influence on TE distribution by local G/C content, I corrected my computational simulations of random TE distribution according to the overall G/C preference of each TE type. Specifically, I first performed a genome-wide evaluation of the G/C preference of each TE type by dividing the entire host genome into a set of consecutive 20 kb windows and calculating both the density of each TE type and the G/C density for each window. Then I grouped these 20 kb windows by G/C density level (with a resolution of 1%) and calculated their average TE density at each G/C level. Based on the assumption that TE density should be close to the overall genome-wide TE density anywhere in the genome when there is no G/C preference, I calculated the fold-difference of the actual TE density at each G/C level compared with the genomic background level for each TE type. In this way, I derived a list of fold change values of TE density at each G/C level, which were then used as

the normalization coefficient to correct the simulated distribution of random TE insertions.
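As a rough illustration of how such normalization coefficients could be derived, the following Python sketch groups windows by G/C level at 1% resolution and computes the fold difference of the mean TE density at each level relative to the genome-wide mean. The input format (one `(gc_fraction, te_fraction)` pair per 20 kb window) and the function name are assumptions made for this example; how the coefficients are then applied to the simulated insertions is not shown.

```python
from collections import defaultdict

def gc_fold_change(windows):
    """Map each G/C level (integer percent) to the fold difference between the
    mean TE density at that level and the genome-wide mean TE density.

    windows: iterable of (gc_fraction, te_fraction) pairs, one per
    equal-sized window (assumed input format).
    """
    windows = list(windows)
    # With equal-sized windows, the mean over windows equals the genome-wide density.
    genome_mean = sum(te for _, te in windows) / len(windows)
    by_level = defaultdict(list)
    for gc, te in windows:
        by_level[round(gc * 100)].append(te)  # 1% resolution
    return {level: (sum(values) / len(values)) / genome_mean
            for level, values in by_level.items()}

# Toy data: a TE type that prefers G/C-rich windows.
toy_coefficients = gc_fold_change([(0.40, 0.10), (0.402, 0.30),
                                   (0.60, 0.50), (0.60, 0.70)])
```

A coefficient above 1 marks a G/C level at which the TE type is denser than the genomic background, and a coefficient below 1 marks the opposite.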

3.4.4 Calculation of the standardized TE frequency levels

To determine the difference between the “observed” and the “expected” TE frequency at each predefined distance bin, I used the concept of residual to measure the standardized TE

frequency:

$$c = \log_{10}\left(\frac{obs - exp}{exp} + 1\right) = \log_{10}\left(\frac{obs}{exp}\right)$$

where c is the residual of a given distance bin, obs is the total observed occurrence of a given

TE type in that bin, and exp is the expected number of such TE insertions derived from my computational simulations. Common logarithm (log10) was used here to equalize the value

ranges of over- and under-represented data, and the addition of “1” in the formula was to

fulfill the requirement that the argument of a logarithm cannot be a negative number. The

absolute value of residual c literally shows the degree of relative difference between the

“observed” and “expected”. When c is positive, it means the corresponding TE type is

overrepresented in this region; when c is negative, it means such TE type is underrepresented.
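A small Python sketch of the residual makes the symmetry of the log scale concrete. The function name and the handling of an empty bin (returning negative infinity when no TE is observed) are choices made for this illustration only.

```python
import math

def residual(obs, exp):
    """Standardized TE frequency c = log10((obs - exp) / exp + 1) = log10(obs / exp)."""
    if exp <= 0:
        raise ValueError("expected count must be positive")
    if obs == 0:
        # log10(0) is undefined; treating an empty bin as -inf is an assumption here.
        return float("-inf")
    return math.log10((obs - exp) / exp + 1)

two_fold_over = residual(200, 100)   # observed twice the expectation
two_fold_under = residual(50, 100)   # observed half the expectation
```

A two-fold excess and a two-fold deficit give residuals of equal magnitude and opposite sign, which is the equalization of value ranges mentioned above.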

3.4.5 Computational analysis of potential splice sites in LTRs

I used the web-based interface of the NNSplice program (http://www.fruitfly.org/seq_tools/splice.html), a bioinformatics tool based on artificial neural networks and used for predicting the presence and the strength of potential splice sites in any given input DNA sequence. Due to the limitation of the maximum length of total input sequences that

NNSplice can take, in this analysis I only chose sense-oriented solitary LTR sequences (i.e.

LTR sequences annotated with a size between 200-600 bp) in human introns. The intronic region was divided into a set of consecutive bins with bin size increasing according to the following distances from the LTR to the nearest exon: 0-200 bp, 200-500 bp, 500-1000 bp, 1-2 kb, 2-5 kb, >5 kb. For all bins except the first, 100 LTR sequences were randomly sampled three times, independently. The averaged total numbers and strength of potential splice sites based on the three samples were taken as the final values for each bin.

Since the first bin (0-200 bp) contains only 101 cases in total, I took all those cases to calculate the average total number and strength of potential splice sites for this bin without sampling. Notably, when I calculated the average strength of potential splice sites for each bin, only the site with the highest score was considered for each LTR sequence.
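The sampling and averaging scheme can be sketched as follows. NNSplice itself was run through its web interface, so this example starts from scores that are assumed to have been collected already: each LTR is represented by a list of predicted splice-site scores. The function names are hypothetical, and skipping LTRs without any predicted site when averaging strength is an assumption of this sketch, not a documented detail of the analysis.

```python
import random
from statistics import mean

def summarize(ltr_scores):
    """Return (average number of predicted sites, average top score) for a set
    of LTRs, each given as a list of predicted splice-site scores."""
    avg_count = mean(len(scores) for scores in ltr_scores)
    # Only the highest-scoring site of each LTR contributes to the strength.
    top_scores = [max(scores) for scores in ltr_scores if scores]
    avg_strength = mean(top_scores) if top_scores else 0.0
    return avg_count, avg_strength

def bin_summary(ltr_scores, rng, sample=True, sample_size=100, rounds=3):
    """Average the summaries of several independent random samples, or use
    every LTR in the bin when sample is False (as for the 0-200 bp bin)."""
    if not sample:
        return summarize(ltr_scores)
    results = [summarize(rng.sample(ltr_scores, sample_size))
               for _ in range(rounds)]
    return (mean(r[0] for r in results), mean(r[1] for r in results))

rng = random.Random(1)
whole_bin = bin_summary([[0.5, 0.9], [0.7], []], rng, sample=False)
sampled_bin = bin_summary([[0.4, 0.8]] * 200, rng)
```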

3.4.6 RT- and qRT-PCR of polymorphic LTR insertions in mouse genes

DNA and RNA Isolation

Primary mouse tissue samples were dissected from healthy adult male C57BL/6J,

129SvEv and A/J mice, and later preserved in RNAlater (Ambion). Genomic DNA and total

RNA were isolated from the indicated strains and tissues using TRIzol (Invitrogen) according

to the manufacturer’s specification. Subsequently, nucleic material was quantified using a

NanoDrop UV spectrophotometer (Thermo Scientific), and quality was assessed by gel

electrophoresis.

Confirmation of LTR polymorphisms

The presence/absence of ERV/LTR polymorphisms was initially determined

computationally (Chapter 2; Zhang et al., 2008). These predicted events were confirmed by

PCR using 50 ng of genomic DNA from C57BL/6J, 129SvEv, and A/J mice. Primers

flanking the predicted LTR polymorphisms in the Kcnh6 intron (gKcnh6-F:

catcccagagctcaaagtgg; gKcnh6-R: tgcaccagtgcatgcatgc) and the Trpc6 intron (gTrpc6-F:

gaagcatgccactctagagc; gTrpc6-R: tgtgcatgattgtgtaggtg) were used in a standard Platinum Taq

DNA polymerase reaction (94°C-5 min; [94°C-0.5 min; 58°C-0.5 min; 72°C-0.5 min] x35;

72°C-7 min; 4°C-∞).

Quantitative RT-PCR

One microgram of total C57BL/6J, 129SvEv and A/J RNA was reverse transcribed using

Superscript III (Invitrogen) following the manufacturer’s recommended protocol. The effect of the LTR polymorphisms on the expression of Trpc6 and Kcnh6 was assessed by qRT-PCR using primers situated 5′ and 3′ of the respective LTR insertions. Relative quantification of the indicated targets was carried out by the ΔΔCT method, essentially as before (Romanish et

al., 2009), using the following primer sets: Kcnh6 5′ (qKcnh6-Ex11-F: cgagagaagctggattgctg; qKcnh6-Ex12-R: ctgtggatgctgaagtagctg); Kcnh6 3′ (qKcnh6-3′-F3: ctcagagttcagagtcgatgc; qKcnh6-3′-R: caccagagatttgtccattgc); Trpc6 5′ (qTrpc6-Ex2-F: cttagccaatgagctggcagtg; qTrpc6-Ex3-R: ccacttcctctgtgtttctgc); Trpc6 3′ (qTrpc6-Ex3-F2: agtatgaagtaaaaaaatttgtggctc; qTrpc6-Ex4-R2: aatggcaacagcaaggaccac); β-actin (β-actin-F: aaggccaaccgtgaaaagat; β-actin-R: gtggtacgaccagaggcatac). The amplification efficiency of each primer set was derived across a template dilution series. The achieved efficiencies were: Kcnh6 5′ 87%; Kcnh6 3′

101%; Trpc6 5′ 98%; Trpc6 3′ 94%; and β-actin 96%. These values were incorporated into calculations to determine relative expression levels of the target and control genes. All

Kcnh6- and Trpc6-specific primer sets were validated to amplify at an equal efficiency with

respect to β-actin (normalization gene), and were determined to be suitable for subsequent

ΔΔCT relative quantification qRT-PCR experiments.
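For readers unfamiliar with the calculation, a minimal Python sketch of relative quantification is given below. It shows a generic efficiency-corrected form that reduces to the standard 2^(-ΔΔCT) formula when both efficiencies are 100%; whether the original calculations used exactly this form is an assumption, and the function name and all CT values are hypothetical.

```python
def relative_expression(ct_target, ct_ref, ct_target_cal, ct_ref_cal,
                        eff_target=1.0, eff_ref=1.0):
    """Expression of a target amplicon relative to a calibrator sample,
    normalized to a reference gene (for example β-actin).

    eff_* are fractional amplification efficiencies (1.0 means 100%,
    i.e. the product doubles every cycle).
    """
    target_ratio = (1 + eff_target) ** (ct_target_cal - ct_target)
    ref_ratio = (1 + eff_ref) ** (ct_ref_cal - ct_ref)
    return target_ratio / ref_ratio

# Hypothetical CT values: the target crosses threshold two cycles later than
# in the calibrator while the reference gene is unchanged.
classic = relative_expression(26.0, 18.0, 24.0, 18.0)
corrected = relative_expression(26.0, 18.0, 24.0, 18.0, eff_target=0.87)
```

With perfect efficiencies, a two-cycle delay corresponds to one quarter of the calibrator level; a lower target efficiency makes the same delay correspond to a smaller fold change.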

Chapter 4: Gene Properties and Chromatin State Influence the

Accumulation of TEs in Genes

A version of Chapter 4 has been published: Ying Zhang and Dixie L. Mager (2012). "Gene properties and chromatin state influence the accumulation of transposable elements in genes." PLoS One 7(1): e30158.

4.1 Background

Increasing availability of whole genome sequences during the last decade has revealed

insights into the relationships of TEs and genes. As discussed in Chapter 1, it has been well

studied and widely accepted that the genome-wide distribution of TEs is far from random,

which is likely due primarily to the long-term effects of natural selection (Section 1.2). While

most previous studies have focused on identifying properties of TEs that might contribute to

their genomic distribution (e.g. insertional orientation, local G/C content preference, etc.), in

the study described in this chapter, I focused on evaluating the host-TE relationship by

examining properties of the genes in which they reside. Several studies of intronic TE density

and gene function have shown that TE-poor genes are usually linked to developmental

processes, while TE-rich genes tend to be involved in metabolic pathways (Grover et al.,

2003; Mortada et al., 2010; Sironi et al., 2006). Moreover, TE density in genes has been

reported to be associated with gene expression patterns (Jjingo et al., 2011; Mortada et al., 2010;

Sironi et al., 2006). Indeed, a study of TE accumulation in Drosophila melanogaster

euchromatin examined TE density differences between soma- vs. germline-expressed genes

(Fontanillas et al., 2007), finding a higher TE insertion rate for the latter. Notably, Kunarso et al. recently reported that TEs have impacted the core regulatory networks of both human and mouse embryonic stem cells (ESCs) by being involved in ES-specific transcription factor binding and gene regulation (Kunarso et al., 2010).

The previous studies cited above, while providing evidence that multiple gene properties correlate with TE accumulation/fixation rate in genes, are limited in the following respects.

First, TE density conservation among orthologous genes in multiple species was not considered when identifying TE-poor and TE-rich genes. Given the low probability of having

the same extreme density of independent TE insertions in orthologous genes among multiple species by chance, an enrichment of TE-rich/poor genes shared across species may lead to new insights into TE-host relations. Second, while one study addressed the relationship between TE distribution and gene expression in germline/early embryo tissues (Fontanillas et al., 2007), it focused on invertebrates and was based only on indirect measurements of chromatin status such as EST abundance and microarray expression data. To address these limitations, I examined TE densities in genes of three mammalian species: human, mouse and cow, which are sufficiently diverged that most recognizable TE insertions are independent among these genomes (Elsik et al., 2009; Waterston et al., 2002). Additionally, since more than half of TEs in cows belong to ruminant-specific TE families that do not exist in humans and mice (Elsik et al., 2009), including the cow provides higher confidence when evaluating possible factors that may contribute to the extreme density of TEs. I then examined various properties of the TE-rich/poor genes shared across the above three species, including gene function, conservation, tissue-specificity of gene expression, and histone modification profiles in mouse ESCs. My results indicated that TE distributions in genes have been determined by or correlate with multiple properties of host genes, reflecting the influence of

TEs in shaping the landscape of host genomes.

4.2 Results and Discussion

4.2.1 Determining sets of orthologous genes with the same extreme of TE density

Although ~90% of human RefSeq genes contain sequences derived from TEs (mostly in introns), the coverage of TEs in each gene can be quite different. To determine if the difference is completely random, I identified genes that are either unusually enriched or

depleted of TE sequences in three mammalian species: human, mouse and cow. As

determined by the initial sequencing and comparative analyses of the mouse genome

(Waterston et al., 2002), more than half of all human TEs are lineage-specific elements inserted after the human-mouse divergence, and 87% of all TEs recognizable in the mouse

are lineage-specific, probably due to higher deletion/mutation rates in the mouse. Further, the

whole-genome sequencing of taurine cattle revealed that at least 58% of TEs in the cow

belong to ruminant-specific repeat families (Elsik et al., 2009). These data show that the

majority of TEs detectable in the human, mouse and cow today were independent insertions

introduced after their divergence. To find outlier genes enriched/depleted of TEs in each

species, I downloaded the RepeatMasker annotation of TEs from the UCSC Genome

Browser, and calculated the TE density of each gene by taking the ratio of TE coverage and

gene size. Since most small genes contain few TE sequences due to the limitation of their

size (Figure C.1; see Section 4.4.1), I restricted my analyses to genes larger than 10 kb.
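The density measure described above (TE coverage divided by gene size) can be sketched in Python as follows. The sketch assumes half-open coordinates and takes the union of overlapping TE annotations so that no base is counted twice; the function name and the toy coordinates are hypothetical.

```python
def te_density(gene_start, gene_end, te_intervals):
    """Fraction of a gene's span covered by TE-derived sequence.

    te_intervals: iterable of (start, end) half-open TE annotations; parts
    outside the gene are clipped and overlapping annotations are merged.
    """
    clipped = sorted((max(start, gene_start), min(end, gene_end))
                     for start, end in te_intervals
                     if start < gene_end and end > gene_start)
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)         # extend the overlapping block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)

MIN_GENE_SIZE = 10_000  # size cutoff stated in the text

gene = (0, 20_000)
tes = [(-500, 1_000), (500, 2_000), (15_000, 16_000), (30_000, 31_000)]
density = te_density(*gene, tes) if gene[1] - gene[0] > MIN_GENE_SIZE else None
```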

I next sought outlier genes using “top 10%” as an optimized cutoff threshold for genes

with the highest/lowest TE densities (see Section 4.4.2 for details on optimization). Since

TE/gene fraction (i.e. TE density in genes) is linked to the gene length (Jjingo et al., 2011), I

decided to control my TE density outlier analysis by gene size. Moreover, in my previous

study of TE underrepresentation zones (U-zones) in mammalian gene introns (see Chapter 3),

I had found a significant decrease of TE density near exons, suggesting that exon density is

also a potential factor limiting the coverage of TEs in genes (i.e. genes with higher exon

density are expected to have fewer TEs). Despite the fact that exon density and gene size are

not independent (Figure C.2), I controlled my outlier analysis for both factors (Figure C.3;

see Section 4.4.3) because of their moderate association strength (r = 0.688, where r is the

correlation coefficient for gene size vs. 1/exon density). By intersecting my datasets from the

three species (see Section 4.4.4), those genes identified as “shared outliers” were selected for

further analysis because of the low probability that their outlier status is due to random

chance (single-side frequency < 0.0005). I found a total of 189 shared lower-outlier genes

(SLOs; Table C.1) and 84 shared upper-outlier genes (SUOs; Table C.2), both numbers much

higher than expected by chance (p < 2.2e-16 for both SUOs and SLOs, proportion equality

test).
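The outlier selection and cross-species intersection can be illustrated with a deliberately simplified Python sketch. It ranks genes by TE density alone and therefore omits the control for gene size and exon density described in Section 4.4.3; it also assumes that genes are keyed by a shared ortholog identifier in every species. Function names and toy values are hypothetical.

```python
def density_outliers(density_by_gene, fraction=0.10):
    """Return (lower, upper) sets holding the given fraction of genes with the
    lowest and highest TE density."""
    ranked = sorted(density_by_gene, key=density_by_gene.get)
    k = int(len(ranked) * fraction)
    if k == 0:
        return set(), set()
    return set(ranked[:k]), set(ranked[-k:])

def shared_outliers(species_densities, fraction=0.10):
    """Intersect the lower and upper outlier sets across all species."""
    per_species = [density_outliers(d, fraction) for d in species_densities]
    shared_lower = set.intersection(*(lower for lower, _ in per_species))
    shared_upper = set.intersection(*(upper for _, upper in per_species))
    return shared_lower, shared_upper

# Toy ortholog groups g0..g9 with made-up densities.
human = {f"g{i}": i / 10 for i in range(10)}
mouse = {f"g{i}": i / 20 for i in range(10)}
cow = {"g0": 0.0, "g5": 0.05, "g1": 0.3, "g2": 0.35, "g4": 0.4,
       "g6": 0.45, "g7": 0.5, "g8": 0.55, "g3": 0.9, "g9": 0.95}
slo, suo = shared_outliers([human, mouse, cow], fraction=0.2)
```

If outlier status were independent between species, a gene would be a shared outlier on one side with probability of roughly the cutoff fraction cubed, which is why an excess of shared outliers is informative.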

4.2.2 Chromosomal distribution and TE composition of shared outlier genes

Before performing any functional analysis of the SUO/SLO genes identified above, I was

curious to see if there is any distributional bias at the chromosome level. To answer this

question, I plotted physical locations of both SUOs and SLOs along each human

chromosome. As shown in Figure C.4, except for a disproportionately high number of SUOs

on chromosome 19, no apparent enrichment was found on any particular chromosome for either

type of shared-outlier gene. When I counted the total SUOs and SLOs separately for each

chromosome and produced scatter-plots according to chromosome size (Figure C.5),

chromosome 19 stood out as an outlier with a relatively short size (2% of the human genome)

yet encoding 13 of 84 (15.5%) SUO genes. Since human chromosome 19 is gene-rich

compared to all other chromosomes (Lander et al., 2001), I wondered whether the majority of

TEs in SUOs are SINE elements, which have been shown as highly enriched near genes and

G/C-rich regions (Section 1.2.1), especially Alus. Indeed, further examination revealed that

53 of 84 (61%) SUOs contain more than 50% TEs as SINEs, including all of the 13 SUOs located on chromosome 19. LINEs, however, are known to associate with A/T-rich or

gene-poor regions (see Section 1.2.1), and to be overrepresented in gene-poor chromosomes such

as chromosome X (Graham and Boissinot, 2006; Medstrand et al., 2002).

Notably, while only three SUOs are on chromosome X, two of them contain more than 50%

TEs as LINEs. These results suggest that the distribution of SUOs is not random and is likely associated with the TE composition and the overall G/C content (or gene density) of each chromosome.

Further, I wondered if orthologous SUOs in different species might contain the same types of TEs at similar percentages. When I examined the TE composition of the 84 SUOs, I found similar patterns for most between human and mouse (Figure 4.1A, B), but not for cow

(Figure 4.1C). To quantitatively measure the correlation of TE compositions between human and mouse SUOs, I calculated the proportion contributed by each of the four major TE types

(i.e. the normalized TE composition) and the corresponding correlation coefficient (r). Figure

C.6 shows that the proportions of LINE and SINE elements are highly correlated between human and mouse (r = 0.827 for LINEs; r = 0.836 for SINEs), with only poor correlations being found for ERV and DNA elements (r = 0.408 for ERVs; r = 0.267 for DNA elements).

Since the genomic density of LINEs or SINEs is highly associated with local G/C content, I calculated for all SUOs the correlation coefficient between the percentage of LINEs/SINEs and the local G/C content. As shown in Figure C.7, a strong correlation to G/C content is found for both TE types (r = -0.702 for LINEs and r = 0.775 for SINEs), suggesting that it is a major factor in determining the insertion/fixation probability of different TE types.
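As an illustration of the two quantities used in this comparison, the Python sketch below computes a normalized TE composition for one gene and a Pearson correlation coefficient between two paired series. The function names and the base-pair counts are hypothetical, and the sketch assumes that r in the text denotes the Pearson coefficient.

```python
import math

def normalized_composition(bp_by_type):
    """Proportion of a gene's TE-derived bases contributed by each TE type."""
    total = sum(bp_by_type.values())
    return {te_type: bp / total for te_type, bp in bp_by_type.items()}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up TE content (bp) of one gene, then made-up SINE proportions of
# three orthologous genes in two species.
composition = normalized_composition(
    {"LINE": 300, "SINE": 500, "ERV": 100, "DNA": 100})
r_sine = pearson_r([0.2, 0.5, 0.7], [0.25, 0.45, 0.8])
```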

However, the same analysis in the cow genome showed much less tendency for similar TE composition patterns in SUOs compared with the human or mouse (Figure 4.1C). I reasoned that this is likely due to the fact that a large number of TEs in the cow are extremely rare in

Figure 4.1 TE composition patterns of SUO genes. Patterns for human, mouse and cow are shown in (A), (B) and (C), respectively. Relative proportions taken by different TE types for each SUO gene are shown as a stacked bar, with the color scheme for the four major TE types indicated at the top. Genes are arranged in the same order for all three species. Gene names at the bottom are from the annotation of human RefSeq genes.

the other two species (Elsik et al., 2009). Almost half of the LINE elements in the cow, for example, belong to the RTE clade, which is also found in many other species including reptiles, insects and nematodes, but is absent from most mammals including primates and rodents (Adelson et al., 2009; Elsik et al., 2009; Malik and Eickbush, 1998). Moreover, while most of the remaining LINEs are L1 elements (the most abundant LINE family in human and mouse), only 60% belong to subfamilies present in humans (Adelson et al., 2009). More dramatically, more than 92% of bovine SINEs are found in neither human nor mouse.

Unlike the major SINE families such as Alu in humans and B1 in mice, SINE elements in

cows are derived from either tRNAs or truncated LINEs instead of 7SL RNAs (Malik and

Eickbush, 1998). Based on this unique genomic composition of TEs, it is not surprising to see a distinct composition pattern of TEs in cow SUOs. Indeed, a previous study reported that L1s in the cow show little correlation with local G/C content, while RTE LINEs and most SINEs are both negatively associated with the density of local G/C (Adelson et al.,

2009). This feature is apparently different from that of human and mouse, in which LINEs

(mostly L1) are negatively correlated with local G/C content, while SINEs (mostly Alu/B1) show the opposite trend (Section 1.2).

4.2.3 Extreme TE density is associated with the function and conservation level of genes

In a pioneering study of Alu SINE distribution in genes based on sequence data of human chromosomes 21 and 22, Grover et al. reported an enrichment of Alu elements in genes involved in metabolic pathways and signaling/transport processes, as well as a depletion from genes coding for information pathways and structural proteins (Grover et al., 2003).

The authors postulated that both positive and negative selection forces had been involved in shaping the current distribution of Alus in human genes. In a different study on transposon- free regions (TFRs) in mammals (Simons et al., 2006), the authors found almost 1000 genomic regions larger than 10 kb completely depleted of TEs. Although over 90% of the bases covered by these regions are non-coding, genes within them are significantly associated with developmental functions, suggesting that these regions are largely unable to accept or tolerate TE insertions. While the above and some other genome-wide studies

(Mortada et al., 2010; Sironi et al., 2006) showed an association between TE density and

specific gene functions, I wanted to determine if such trends exist for the outlier genes shared

among multiple mammalian species, after excluding the potential confounding effects of exon density and gene size. I examined my SUO/SLO gene lists derived from human, mouse and cow using BiNGO (Maere et al., 2005), a Gene Ontology (GO) tool used to identify statistically significant over-/under-representation of certain gene functions for a given gene

set compared to the genomic background. In accordance with previous studies of non-shared

TE density outlier genes, I observed significant enrichment of genes involved in

developmental processes for SLOs (Table C.3), and enrichment of genes involved in

metabolic pathways and DNA repair for SUOs (Table C.4). Since these orthologous genes

show similar extremes of TE density in several species and have been controlled by their size

and exon density, my observations firmly support the hypothesis that the density of TEs in

genes is not random, and is evidently associated with specific gene functions.

Intrigued by the possible association between TE density and functional importance of

genes, the next gene property I examined was the phylogenetic conservation level. In a recent

study conducted by Mortada et al., the authors applied a comprehensive analysis of TE-free

vs. TE-rich genes based on the difference of selection pressure ratio (Ka/Ks) among four

primate species (Mortada et al., 2010). That study could not find significant support for the

association between TE density and gene conservation level. The authors proposed that the

phylogenetic distances between the four primates they examined might be too close. Indeed, when they looked at the selection pressure ratio between human and mouse, evidence was found that TE-free genes tend to be more conserved than TE-rich genes. To further test this hypothesis, I defined three levels of gene conservation according to the HomoloGene database (release 63; http://www.ncbi.nlm.nih.gov/homologene): species-specific,

mammalian-specific, and ancient genes (see Section 4.4.5 for details). In order to include species-specific genes, I expanded my analysis to all outlier genes, including both shared and non-shared among human, mouse and cow, and obtained consistent results for all three mammalian species. In accord with the results of Mortada et al. (Mortada et al., 2010), I found clear evidence showing a higher proportion of species-specific genes among SUOs, and a higher proportion of ancient genes in SLOs (Figure 4.2). These findings indicate that

TE-poor genes are more likely to be conserved among distantly related species, while genes extremely rich in TEs show the opposite trend, suggesting that TE insertions within highly conserved genes have generally been selected against due to their detrimental effects on these genes. On the other hand, the emergence of non-conserved, species-specific genes has increased the genetic variation of the host population, and my data support the idea that TEs

Figure 4.2 TE density and conservation level of outlier genes. Patterns for human, mouse and cow are shown in (A), (B) and (C), respectively. Proportions corresponding to different gene conservation levels for each gene set are shown as a stacked bar, with the color scheme for species-specific, mammalian-specific and ancient genes indicated at the top.

may have contributed to, or even played an important role in creating genetic diversity during

evolution (Biemont, 2010; Rebollo et al., 2010).

I also compared the average TE ages between SUO and SLO genes in human (Section

4.4.6). This comparison revealed a highly significant difference (p < 2.2e-16, Wilcoxon Rank

Sum test), with TEs in SUOs at 16.1% and TEs in SLOs at 21.3% sequence divergence from their consensus sequences. (As a reference, 16–18% of unconstrained nucleotides have been

substituted since the split of primates from other mammalian orders (Smit, 1999).) To

evaluate if such a difference in TE age was a reflection of the possible age difference

between SUO and SLO genes, I randomly selected 200 mammalian-specific genes, and another 200 ancient genes also conserved in fish or invertebrates, and compared their ages

(Section 4.4.6). Although the age of genes in the two random control groups is apparently different, the average age of TEs is about the same (19.7% vs. 19.1% divergence; p=0.1367,

Wilcoxon Rank Sum test). Based on the fact that the pattern of TE distributions is a combined effect of both initial TE integration patterns and outcomes of selection/genetic drift, and thus may change through evolution (Medstrand et al., 2002), I postulate that TEs in

SLOs may show more features of selection outcomes due to a longer evolutionary period, while the high TE density of SUOs may result partially from the fact that young TEs in such genes are not yet stably fixed and are still subject to natural loss and purifying selection.

4.2.4 Binding of Polymerase-II at gene promoters reveals association between TE content and tissue-specificity of genes

Recently, Jjingo et al. reported that both higher TE density and larger gene size generally associate with less tissue-specificity of gene expression (Jjingo et al., 2011). To determine

which factor is more important to the tissue-specificity of genes, the authors applied multiple regression analyses and found a higher contribution of TE density (66%) than gene size

(53%). While these results provided important insights into the relationship between TE density and gene expression, the confounding effects of stochastic TE integration in a single species and gene structure constraints greatly impair the accurate evaluation of such relations. To overcome this problem, I directly compared polymerase-II (Pol-II, and more specifically, the Polr2a subunit of Pol-II) binding states between the SUO and SLO gene groups, controlled for both exon density and gene size.

To obtain accurate information concerning the tissue-specificity of genes, I downloaded the ChIP-chip genome-wide RNA Polr2a binding maps of active promoters in mouse embryonic stem cells (ESCs) and adult organs (brain, heart, kidney, liver) (Barrera et al.,

2008). According to the tissue-specificity (TS) values described by the authors, I classified the SUO and SLO genes into two categories: high-TS and low-TS (see Section 4.4.7 for methods). As a control, all genes larger than 10 kb in the mouse genome were similarly categorized, and the proportions were compared with those of SUOs and SLOs. As indicated in Figure 4.3A, my results reveal that SUO genes exhibit significantly lower tissue-specificity compared to the genomic background (30% vs. 46%; p = 0.0338), and vice versa for SLOs (64% vs. 46%; p = 0.0001). These results accord with previous findings (Jjingo et al., 2011).

What next drew my interest was whether there is any difference in tissue-type composition between tissue-specific SUOs and SLOs. For those outlier genes expressed only in one tissue, it would be interesting to know if specific tissue-types are significantly overrepresented. Using the same data set from Barrera et al., I examined the tissue-type


Figure 4.3 Tissue-specificity of Polr2a binding at outlier genes. (A) Associations between TE density and tissue-specificity of Polr2a binding. All gene sets are divided into two categories according to the Polr2a binding pattern of genes: the high tissue-specific (high-TS) and the low tissue-specific (low-TS). The proportions corresponding to the above two categories are shown for SUOs, SLOs and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set. (B) Tissue-type composition of SUOs/SLOs associated with strong Polr2a binding in only one tissue. For each gene set, the proportion taken by each tissue type is shown with corresponding color in a stacked bar. The 'genomic background' was calculated based on all mouse genes > 10 kb that show strong Polr2a binding in only one tissue. The color scheme for different tissue types is shown at the top.

composition of all SUOs and SLOs showing strong Polr2a binding in only one tissue type

(Section 4.4.7). Despite identifying only 15 SUO genes that satisfied my criteria, nine are observed in ESCs, a proportion (60%) much higher than either SLOs (40%) or the genomic background (37%) (Figure 4.3B). In order to gain more statistical power, I expanded my analysis to all mouse outlier genes, including both shared and non-shared. As shown in

Figure C.8, I observed 89 ESC-specific upper outliers among a total of 181 upper outlier genes (49%). This is significantly higher than the genomic background level (37%; p =

0.00146). More interestingly, after expanding my analysis from shared-only to all outlier

genes, I observed a significant depletion of ESC-specific genes among tissue-specific lower-outliers

(93 out of 367, or 25%; p = 1.406e-05). Based on these observations and the fact that

only TE insertions occurring in the germline and early embryonic cells could be inherited by the next generation, I propose that actively transcribed genes (presumably with an open chromatin state) are prone to TE integrations, which could, in turn, lead to a faster accumulation of TEs in such genes during evolution.

4.2.5 Histone marks at promoters confirm the overall open status of SUO genes in

ESCs

To investigate the hypothesis that the chromatin regions of upper outlier genes are generally open in ES cells, I examined histone modification marks at gene promoters in mouse ESCs. In 2007, Mikkelsen et al. used ChIP-seq to generate genome-wide maps of various histone modification marks, including the open-chromatin mark histone H3 lysine 4 trimethylation (H3K4me3) and the condensed-chromatin mark histone H3 lysine 27 trimethylation (H3K27me3) in mouse ESCs (Mikkelsen et al., 2007). Although histone modification data were largely limited to gene promoters in this study, the authors showed that both H3K4me3 and H3K27me3 can effectively discriminate genes that are expressed, are poised for expression, or are stably repressed, and therefore reflect both chromatin and transcriptional states. I used this dataset to evaluate the general chromatin status of both SUO and SLO genes in mouse ES cells (Figure 4.4). Intriguingly, my results revealed a significant

enrichment of the open-chromatin mark H3K4me3 at promoters of SUOs (78%), but a

depletion at SLO promoters (43%) compared with the genomic background (64%; p = 0.0333

and p = 5.653e-08 for SUOs and SLOs, respectively), indicating a general open state


Figure 4.4 Histone marks at promoters of shared outlier genes. The proportions of genes associated with different histone marks are shown for SUOs, SLOs and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

of SUOs and closed state of SLOs in mouse ESCs. When I looked at the closed-chromatin mark H3K27me3, I observed a trend opposite to the case of H3K4me3. SLOs are more associated with H3K27me3 (1.9%) than the genomic background (0.5%; p = 0.0766, equality of proportion test), but there were no SUOs in this category (0/60 hit for H3K27me3). While the above observation is intriguing, there is unavoidably a lack of statistical significance due to the limited total number of SUOs and SLOs associated with H3K27me3 alone. When I examined the enrichment of the bivalent mark H3K4me3+H3K27me3, however, SLOs showed a much higher enrichment than the genomic background (39% vs. 17%; p = 5.591e-13),

whereas SUOs showed the opposite effect (6.7% vs. 17%; p = 0.05). Since it has been shown that the H3K4me3+H3K27me3 bivalent mark is strongly associated with the “poised state” of developmental genes that are temporarily repressed in ES cells yet will become activated upon differentiation (Spivakov and Fisher, 2007), the enrichment of the bivalent

histone mark in SLOs supports my earlier observation that many SLO genes are involved in

essential developmental processes and, consequently, are depleted of TE insertions. For

shared outlier genes not linked to either H3K4me3 or H3K27me3, no significant differences were detected compared to the genomic background. The same analysis expanded to both

shared and non-shared outlier genes showed very similar results, as illustrated in Figure C.9.

4.2.6 Expression data of SUO and SLO genes in ESCs support the ‘open chromatin

status’ hypothesis

If indeed genes with an open chromatin status in ES cells are more prone to heritable

TE insertions, one might expect to see a larger proportion of genes that are highly expressed in

ES cells among SUOs. To investigate this, I downloaded mouse ES cell expression data previously generated by Mikkelsen et al. (Mikkelsen et al., 2007). As expected, when SUOs

and SLOs were classified into lowly- and highly-expressed genes based on the median

genomic transcriptional level in mouse ESCs, I observed that 53% of SUOs are highly

expressed in ESCs, compared with only 41% for SLOs (Figure 4.5A). However, while it

shows a clear trend that SUOs contain a larger proportion of active genes in ESCs than

SLOs, the result lacks statistical significance (p = 0.1428). Since chip-based techniques

commonly suffer from cross-hybridization (where non-specific binding of probes produces

unavoidable experimental noise), I performed the same analysis using a different mouse ESC

expression dataset recently generated by whole transcriptome sequencing (RNA-seq) (Karimi

et al., 2011). As shown in Figure 4.5B, I made a similar observation with the RNA-seq data

as with GeneChip, but with a more dramatic difference between the two outlier groups (60%

highly expressed genes among SUOs vs. 43% for SLOs; p = 0.0155). These results suggest


Figure 4.5 Gene expression data analyses for SUOs and SLOs in mouse ESCs. (A) Gene expression data based on GeneChip technology. (B) Gene expression data based on RNA-seq technology. In both (A) and (B), SUOs and SLOs are divided into 'low' and 'high' expression genes in mouse ESCs based on the median expression level of all genes > 10 kb. The red dotted line shows the 50% level, which is the proportion taken by lowly/highly expressed genes for all genes > 10 kb in mouse ESCs. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

that genes with open chromatin status and high expression level in ES cells may be more prone to TE insertions that can be inherited, which could lead, in turn, to the accumulation of

TEs in such genes.

Of note is a recent study documenting large numbers of somatic L1 retrotransposition events in the human brain, which showed that such events occur significantly more often in genes expressed in the brain compared to random expectations (Baillie et al., 2011). This study lends support to the view that insertions of at least some TE types are more likely to occur in open chromatin.

4.2.7 Gene function and chromatin status both contribute to the low TE density of

SLO genes

Although the evaluation of chromatin status at SUO/SLO genes using genome-wide histone modification maps clearly shows a correlation between the density of TEs and chromatin status in ES cells, such observations may be explained by at least two possible underlying mechanisms. One mechanism could be that TEs are more likely to insert into genes with open chromatin, leading to an accumulation of TE sequences in genes active in

ESCs. Alternatively, the strong association between low TE density and poised/inactive genes in ESCs could be explained by the essential roles of these genes in cell differentiation, and thus, their limited tolerance to TE residence. As the two mechanisms are not mutually exclusive, I was curious to evaluate separately the contributions of chromatin status and gene

function. I began by classifying all SLO genes depending on whether or not they are under

the GO term of developmental processes. I then checked the SLOs associated with either

H3K4me3-only (open state) or H3K4me3+H3K27me3 bivalent mark (poised state), and

calculated the proportion of genes involved in developmental processes in each set. (Due to

the limited number of SLOs associated with the H3K27me3-only mark, I did not include

such analysis in this study.) As seen in section A of Figure 4.6, 42% (24 out of 57) SLOs

associated with an open chromatin state in ESCs were developmental genes, which is higher

(but not statistically significant) than the whole-genome background (31%; p = 0.1). The

analysis of SLOs associated with the H3K4me3+H3K27me3 bivalent mark showed 65% (30

out of 46) were developmental genes (section B of Figure 4.6), which is twofold more than

the genomic background (p = 1.470e-06). Notably, chromatin status was held constant within

each of these analyses, confirming that gene function plays a role in influencing


Figure 4.6 Effect-evaluation matrix for gene function and chromatin status of SLOs. As shown in the inner 2x2 matrix (white area), SLOs are divided into four groups according to their associated histone mark and gene function. ‘K4’ stands for H3K4me3; ‘K4+K27’ stands for H3K4me3 + H3K27me3; ‘Dvlp’ stands for developmental gene; ‘Non-Dvlp’ stands for non-developmental gene. Bar plots in the most right column show the proportion of SLOs associated with different histone marks compared with the genomic background when gene function is controlled. Bar plots in the bottom row show the proportion of SLOs associated with different gene functions compared with the genomic background when chromatin status is controlled. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

the density of TEs in genes. Moreover, the higher proportion of genes involved in developmental processes for SLOs associated with the H3K4me3+H3K27me3 bivalent mark

(65%; B/T2 in Figure 4.6) compared with SLOs associated only with H3K4me3 (42%; A/T1 in Figure 4.6) implies an effect of chromatin status of genes. Indeed, when I looked at those

SLOs that are either involved in developmental processes or not (i.e. controlled by gene function), the association between SLOs and the H3K4me3+H3K27me3 bivalent mark

(poised status) is consistently and significantly higher than expected: 56% when controlled for developmental genes (B/T3 in Figure 4.6) and 33% for non-developmental genes (D/T4 in Figure 4.6), compared with 17% for genomic background (p = 2.825e-13 and p = 0.007, respectively), showing that the chromatin status in ES cells is also an important factor contributing to the overall TE density of genes.
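The four proportions quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts A = 24, T1 = 57, B = 30 and T2 = 46 are taken from the text, whereas the counts for the non-developmental cells (C and D) and the row totals (T3 and T4) are not stated and are inferred here under the assumption that T1 = A + C, T2 = B + D, T3 = A + B and T4 = C + D in the matrix of Figure 4.6.

```python
# Counts stated in the text (SLOs that are developmental genes, per histone mark).
A, T1 = 24, 57   # H3K4me3-only: developmental genes / all SLOs with this mark
B, T2 = 30, 46   # H3K4me3+H3K27me3: developmental genes / all SLOs with this mark

# Inferred cells and totals (assumed layout of the 2x2 matrix, not stated in the text).
C = T1 - A       # H3K4me3-only, non-developmental
D = T2 - B       # bivalent, non-developmental
T3 = A + B       # developmental SLOs carrying either mark
T4 = C + D       # non-developmental SLOs carrying either mark

proportions = {
    "A/T1": A / T1,  # developmental fraction among H3K4me3-only SLOs
    "B/T2": B / T2,  # developmental fraction among bivalent SLOs
    "B/T3": B / T3,  # bivalent fraction among developmental SLOs
    "D/T4": D / T4,  # bivalent fraction among non-developmental SLOs
}

for name, value in proportions.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under this assumed layout, the inferred totals reproduce the rounded percentages quoted in the text, which suggests that the row totals T3 and T4 count only SLOs carrying one of the two marks.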

4.3 Concluding Remarks

Using the genomic sequences of three distantly related mammalian species (i.e. human, mouse and cow), I identified “shared outlier” genes bearing highly conserved patterns of unusually high or low TE densities. These outlier genes, denoted as SUOs and SLOs respectively, provide a highly reliable resource for the study of TE-gene relations. Specifically, I showed that SUOs are enriched for metabolic genes, and that SLOs are enriched for genes involved in developmental processes. Moreover, SUOs show in general less conservation at the protein sequence level, and an expanded analysis involving non-shared outlier genes in each species revealed a disproportionate enrichment of species-specific genes for TE-rich genes (upper outliers). These findings are also in agreement with previous studies of the association between TE density and gene function (Grover et al., 2003; Mortada et al., 2010; Sironi et al.,

2006).

On the other hand, additional factors besides natural selection have also contributed to TE-host evolution, and therefore, cannot be neglected. The comparison of the average TE ages, for example, shows SUOs contain significantly younger TEs than SLOs, implying that the

high TE density of SUOs could also be due to recent TE insertions, some of which will be

lost with time. Furthermore, other results presented in this study also suggest that the

genomic signature of initial insertion site preference may still exist. I found that extreme TE

content in introns is clearly associated with the chromatin status and expression level of

genes in embryonic stem cells, with upper outlier genes being more likely to be active in ES

cells, and lower outliers being the opposite. Given that all heritable TEs are the result of

integration events that occurred in either germline cells or early embryos, and if I postulate

that TEs are more likely to insert into actively transcribing genes, it is possible that more TEs would accumulate in such genes and less in genes that are inactive in these tissues. Indeed,

only a very minor tendency for TEs to insert into actively transcribing genes in the germ line

or early embryo could contribute, over evolutionary time, to their resultant densities in genes.

In conclusion, my data support the view that both selection and initial insertion site

preference have played roles in the extreme intronic TE densities observed in mammalian genomes.

While my data provide intriguing insights on TE-gene relationships based on highly

reliable gene sets shared in multiple species, due to the limited numbers of genes studied,

caution should be taken when applying these results to a broader theme. On the other hand,

my analyses, using all (i.e. both shared and non-shared) outlier genes, show very similar

results, confirming that the findings in this study very likely reflect general characteristics of

TE-gene relationships in mammals.

4.4 Materials and Methods

4.4.1 Selection of source datasets

To obtain the genomic coordinates of TEs in human, mouse and cow, I downloaded the

RepeatMasker annotation of TEs from the UCSC Genome Browser (http://genome.ucsc.edu).

The versions of genome assemblies used were hg18 for human, mm9 for mouse, and bosTau4 for cow. For the annotation of genes, I downloaded RefSeq gene tracks based on the same genome assemblies listed above. I defined TE density as the ratio between the total length of TE sequences present in a given gene and the size of the longest RefSeq isoform of that gene. An analysis of TE density in human genes revealed an approximately normal distribution, except for a single peak near zero TE density (Figure C.1A). After excluding all genes shorter than 10 kb, this peak disappeared (Figure C.1B). I applied the same analysis for mouse and found similar results (data not shown). This observation indicates that most genes with near zero TE density are simply very small genes, very likely low in TEs due to their limited size. For this reason, my study included only genes larger than 10 kb. Since my intention was to identify outlier genes shared among multiple species, I also confined my data to only genes included in both the NCBI HomoloGene and RefSeq databases.
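The TE density definition above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names are hypothetical, coordinates are assumed to be half-open intervals on a single chromosome, the gene is represented by the genomic span of its longest isoform, and overlapping TE annotations are merged before counting (an assumption, since the handling of overlaps is not described in the text).

```python
MIN_GENE_SIZE = 10_000  # genes shorter than 10 kb are excluded, as described above


def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def te_density(gene_start, gene_end, te_intervals):
    """Fraction of the gene span covered by TE sequence."""
    gene_len = gene_end - gene_start
    # Keep only TE fragments that overlap the gene, clipped to its boundaries.
    clipped = [
        (max(s, gene_start), min(e, gene_end))
        for s, e in te_intervals
        if s < gene_end and e > gene_start
    ]
    covered = sum(e - s for s, e in merge_intervals(clipped))
    return covered / gene_len


# Made-up example: a 20 kb gene with two overlapping TEs and one TE
# extending past the gene end.
example_density = te_density(0, 20_000, [(1_000, 3_000), (2_500, 4_000), (19_000, 21_000)])
print(example_density)
```

Clipping each fragment to the gene boundaries keeps a TE that straddles the gene end from contributing sequence outside the gene.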

4.4.2 Identification of genes with extreme TE densities

In order to identify outlier genes, a cutoff percentile for genes with high/low TE densities was required. The selection of this threshold was a balance between the significance of effect and a reasonable outlier population size for the ease of further data analyses. I performed a computational optimization experiment by testing multiple cutoff thresholds of the outlier percentile from 2.5% to 25%, with an incremental step of 2.5%. For each percentile tested, I

calculated both the total number of SUOs and SLOs derived, and the ratio between the

theoretical probability by chance and the actual frequency of observing shared outliers

among the three species (Table C.5). Based on these optimization results, I chose 10% as the

optimized cutoff percentile for outlier genes in each species, with the number of genes derived in each step being indicated in Table C.6.
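The ratio evaluated at each cutoff can be sketched as follows. This is a simplified illustration under stated assumptions: the chance probability that a gene is an outlier of the same type in all three species is taken to be the cube of the cutoff percentile (i.e. the species are treated as independent), and the gene counts in the example are invented placeholders rather than values from Table C.5.

```python
def chance_vs_observed(cutoff, n_genes, n_shared):
    """Ratio of the chance probability of a shared outlier to its observed frequency.

    cutoff   : outlier percentile as a fraction (e.g. 0.10 for 10%)
    n_genes  : number of orthologous genes examined (hypothetical input)
    n_shared : number of genes that are outliers in all three species
    """
    expected_prob = cutoff ** 3          # independence across species is assumed
    observed_freq = n_shared / n_genes
    return expected_prob / observed_freq


# Cutoffs from 2.5% to 25% in steps of 2.5%, as described in the text.
cutoffs = [i * 0.025 for i in range(1, 11)]

# Placeholder numbers for illustration only.
example_ratio = chance_vs_observed(0.10, n_genes=10_000, n_shared=100)
print(example_ratio)
```

A ratio well below 1 indicates that shared outliers are observed more often than expected by chance at that cutoff.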

4.4.3 Normalization of gene size and exon density

In order to eliminate the confounding effects of the size and exon density of the gene, I controlled my outlier selection for both factors. Specifically, I designed a 5x5 factor matrix composed of the following two dimensions (Figure C.3): the first dimension (horizontal) containing five predefined consecutive ranges of gene size, each including 20% of all genes; the second dimension (vertical) made up of five consecutive bins of exon density levels, each containing 20% of all genes within the corresponding range of gene size. In this way, I arranged my gene data into a well-organized lattice, in which each unit contains the same number of genes. Based on this factor matrix, I applied the optimized cutoff percentile

threshold (i.e. 10%) on each lattice unit to obtain outlier genes with either the highest or the

lowest TE density. All outlier genes identified in each unit (the shaded part at either side of

the gene distribution within each unit in Figure C.3) were then merged into two categories

according to whether they are high or low in TEs (i.e. total upper-/lower-outliers). To verify

the efficiency of this normalization strategy, I applied statistical tests to the final sets of total

upper- and lower-outlier genes, and found no significant difference of either gene size or

exon density (data not shown).
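The lattice-based selection described above can be sketched in Python as follows. This is a hedged illustration, not the original implementation: the gene records, field names and synthetic values are hypothetical, and ties are resolved simply by sort order.

```python
def quantile_bins(items, key, n_bins):
    """Split items into n_bins consecutive, near-equal bins ordered by key."""
    ordered = sorted(items, key=key)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def select_outliers(genes, cutoff=0.10, n_bins=5):
    """Return (upper, lower) outlier gene IDs from an n_bins x n_bins lattice.

    genes: list of dicts with keys 'id', 'size', 'exon_density', 'te_density'.
    """
    upper, lower = [], []
    # First dimension: consecutive ranges of gene size.
    for size_bin in quantile_bins(genes, lambda g: g["size"], n_bins):
        # Second dimension: exon density bins within each gene size range.
        for cell in quantile_bins(size_bin, lambda g: g["exon_density"], n_bins):
            ranked = sorted(cell, key=lambda g: g["te_density"])
            k = int(len(ranked) * cutoff)
            if k == 0:
                continue
            lower.extend(g["id"] for g in ranked[:k])
            upper.extend(g["id"] for g in ranked[-k:])
    return upper, lower


# Synthetic genes for illustration only (250 genes, 10 per lattice unit).
genes = [
    {"id": i,
     "size": 10_000 + 100 * i,
     "exon_density": (i * 37 % 101) / 1000,
     "te_density": (i * 53 % 97) / 100}
    for i in range(250)
]
upper, lower = select_outliers(genes)
print(len(upper), len(lower))
```

Because the cutoff is applied within each lattice unit rather than across the whole gene set, every combination of gene size and exon density contributes the same number of outliers.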

4.4.4 Identification of shared-outlier genes

To identify shared outlier genes among the three mammals used in this study, I took the

HomoloGene IDs (HIDs) of the upper- and lower-outlier genes commonly identified in all

three species, and defined the resulting gene set as SUOs and SLOs, respectively. Notably,

based on the normalization of gene size and exon density, even outlier genes with a different

category of gene size or exon density in each species were, nevertheless, still able to be

correctly identified as long as they share the same extreme of TE density in all three species.
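In practice, this step amounts to a set intersection over HomoloGene IDs. The sketch below assumes that each species' outlier genes have already been mapped to HIDs; the IDs shown are invented for illustration.

```python
def shared_outliers(outlier_hids_by_species):
    """Return the HomoloGene IDs that are outliers in every species."""
    sets = [set(hids) for hids in outlier_hids_by_species.values()]
    return set.intersection(*sets)


# Hypothetical upper-outlier HIDs per species.
upper_by_species = {
    "human": [101, 102, 103, 104],
    "mouse": [102, 103, 104, 105],
    "cow":   [100, 103, 104],
}
suo_hids = shared_outliers(upper_by_species)
print(sorted(suo_hids))
```

The same function applied to the lower-outlier sets would give the SLOs.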

4.4.5 Classification of gene conservation levels

To evaluate the relationship between TE density and gene conservation level, I classified

all genes in each species into three conservation levels: species-specific, mammalian-

specific, and ancient genes. Based on the HomoloGene Database, I defined “species-specific”

genes as genes found only in one of the three mammalian species used in this study. I defined

“mammalian-specific” genes as genes shared between at least two of the three species, due to

the relatively poor annotation of the cow genome. I defined “ancient genes” as genes present in mammals and at least one of the following species: zebrafish, fruit fly and yeast.
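The three definitions above can be expressed as a small classification function. This is a sketch under assumptions: the species labels are hypothetical placeholders for however HomoloGene group members are encoded, and the "ancient" category is checked first, so a gene found in one mammal plus an outgroup species is classified as ancient.

```python
MAMMALS = {"human", "mouse", "cow"}
OUTGROUPS = {"zebrafish", "fruit fly", "yeast"}


def conservation_level(species_in_group):
    """Classify a HomoloGene group by the species it contains."""
    present = set(species_in_group)
    mammals = present & MAMMALS
    if not mammals:
        raise ValueError("group contains none of the three mammals studied")
    if present & OUTGROUPS:
        return "ancient"
    if len(mammals) >= 2:
        return "mammalian-specific"
    return "species-specific"


print(conservation_level(["human"]))
print(conservation_level(["human", "mouse"]))
print(conservation_level(["human", "mouse", "cow", "yeast"]))
```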

4.4.6 Comparison of the average TE age between SUO and SLO genes

To calculate approximately the average TE age for a gene, I collected the RepeatMasker

annotation of TEs from the UCSC Genome Browser, and obtained the divergence of each TE

fragment from the consensus sequence of each TE family. The average age of all TEs in a

given gene is calculated with the equation:

$$D = \frac{\sum_i d_i \times l_i}{L}$$

where $D$ is the average divergence of TE sequences from the consensus (i.e. the age of TEs), $d_i$ is the sequence divergence of each TE fragment in the gene, $l_i$ is the length (in bp) of each TE fragment, and $L$ is the total coverage (in bp) of all TEs in the gene. Based on this methodology, the average TE ages of SUO and SLO genes were calculated and subsequently compared using the Wilcoxon Rank Sum test.
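The length-weighted average defined above can be sketched in a few lines of Python. The function name and input layout are hypothetical, and L is assumed here to be the sum of the fragment lengths; divergence values would come from the RepeatMasker annotation.

```python
def average_te_divergence(fragments):
    """Length-weighted mean divergence D of the TE fragments in one gene.

    fragments: list of (divergence_percent, length_bp) tuples, one per TE fragment.
    Returns None when the gene contains no TE sequence.
    """
    total_length = sum(length for _, length in fragments)  # L
    if total_length == 0:
        return None
    weighted_sum = sum(div * length for div, length in fragments)  # sum of d_i * l_i
    return weighted_sum / total_length


# Made-up example: a 300 bp fragment at 10% divergence and a 100 bp fragment at 20%.
example_age = average_te_divergence([(10.0, 300), (20.0, 100)])
print(example_age)
```

Weighting by fragment length prevents short, highly diverged fragments from dominating the per-gene average.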

4.4.7 Examination of the tissue specificity of Polr2a binding at SUO/SLO promoters

As suggested by the authors of the Polr2a binding data (Barrera et al., 2008), I defined

“ubiquitous” promoter activity, based on all five tissues tested, as genes with an overall

Polr2a binding entropy $H > 2$, where $H = -\sum_{1 \le t \le N} p_t \log_2 p_t$ ($p_t$ is the relative Polr2a binding strength in tissue $t$, and $N$ is the total number of tissues tested). I took genes with

$H \le 2$ as having “tissue-specific” Polr2a binding. For these genes, the tissue in which Polr2a binding at the given promoter is strongest was identified as the one with the lowest value of

“categorical tissue-specificity” $Q_t$ among all five tissues, where $Q_t = H - \log_2(p_t)$, as described by Barrera and colleagues (Barrera et al., 2008).
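The entropy-based classification can be sketched as follows. This is an illustration rather than the original analysis code: the relative binding strength in each tissue is assumed to be the raw signal divided by the sum over all tissues, and the signal values in the examples are invented.

```python
import math


def binding_entropy(signal):
    """Return (H, p) for a dict mapping tissue -> non-negative binding strength."""
    total = sum(signal.values())
    p = {tissue: value / total for tissue, value in signal.items()}
    # Tissues with zero signal contribute nothing to the entropy.
    H = -sum(pt * math.log2(pt) for pt in p.values() if pt > 0)
    return H, p


def classify_promoter(signal, threshold=2.0):
    """Classify a promoter as ubiquitous (H > threshold) or tissue-specific."""
    H, p = binding_entropy(signal)
    if H > threshold:
        return "ubiquitous", None
    # Categorical tissue-specificity Q_t = H - log2(p_t); the lowest Q_t
    # marks the tissue with the strongest relative binding.
    q = {tissue: H - math.log2(pt) for tissue, pt in p.items() if pt > 0}
    return "tissue-specific", min(q, key=q.get)


uniform = {"ESC": 1, "brain": 1, "heart": 1, "kidney": 1, "liver": 1}
esc_biased = {"ESC": 96, "brain": 1, "heart": 1, "kidney": 1, "liver": 1}
print(classify_promoter(uniform))
print(classify_promoter(esc_biased))
```

With five tissues, the maximum possible entropy is log2(5), or about 2.32, so the threshold of 2 separates promoters bound almost evenly across tissues from those with a skewed binding profile.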

4.4.8 Statistical tests

All statistical tests were done in R (version 2.9.2). All p-values described in the text were based on the equality of proportion test (also known as Binomial proportion test), unless specifically noted otherwise.
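For readers without access to R, the logic of a test of this kind can be sketched in Python using only the standard library. The sketch below implements a one-sample test of a proportion against a fixed background proportion, with a continuity correction and a chi-square statistic on one degree of freedom. Whether the tests in this study were set up as one-sample or two-sample comparisons is not stated in the text, so this form is an assumption; the example reuses the counts 24 out of 57 and the 31% background from Section 4.2.7 purely for illustration.

```python
import math


def one_sample_prop_test(successes, n, p0, correct=True):
    """Chi-square test (1 d.f.) of an observed proportion against p0.

    Returns (chi2, p_value). A continuity correction of at most 0.5
    is applied when correct is True.
    """
    expected = n * p0
    diff = abs(successes - expected)
    yates = min(0.5, diff) if correct else 0.0
    chi2 = (diff - yates) ** 2 / (n * p0 * (1 - p0))
    # Survival function of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = one_sample_prop_test(24, 57, 0.31)
print(chi2, p_value)
```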

Chapter 5: Thesis Summary and Conclusions

5.1 Thesis Summary

5.1.1 Which ERVs are polymorphic in mice?

ERVs in mice are significant genomic mutagens, causing ~10% of all reported

spontaneous germ line mutations in laboratory strains (Maksakova et al., 2006). The majority

of these mutations are due to insertions of two high copy ERV families, the IAP and

ETn/MusD elements. This significant level of ongoing retrotranspositional activity suggests

that the content of these two ERV groups could be highly variable in inbred mice. At the time

of my study, no comprehensive genome-wide studies had been performed to assess their

level of polymorphism, however. In the study described in Chapter 2, three test strains, for

which sufficient genomic sequence is available, were compared to each other and to the

reference C57BL/6J genome, and very high levels of insertional polymorphism for both ERV

families were detected (with an estimated false discovery rate of only 0.4%). Specifically, I

found that at least 60% of IAP and 25% of ETn/MusD elements detected in any strain are

absent in one or more of the other three strains. The polymorphic nature of a set of 40

ETn/MusD elements found within gene introns was confirmed using genomic PCR on DNA

from a panel of mouse strains. For some cases, gene-splicing abnormalities involving the

ERV were detected, and additional evidence for decreased gene expression in strains

carrying the insertion was obtained. In total, I identified nearly 700 polymorphic IAP or

ETn/MusD ERVs or solitary LTRs that reside in gene introns, providing potential candidates

that may contribute to gene expression differences among strains. These previously

unappreciated extreme levels of polymorphism suggest that ERV insertions play a significant

role in the genetic drift of mouse lines.

The identification of such polymorphic ERV elements can be very useful in studying TE-host interactions and their evolutionary effects. In a recent study of ERV methylation patterns in the mouse, for example, our laboratory utilized the polymorphic ERV data described in this thesis, revealing a positive relationship between the age and methylation level of ETns in mice, as well as a negative effect of age on the methylation variation of particular ETn insertions between cells/individuals (Reiss et al., 2010). In another study, we took advantage of strain-specific ERV insertions in allelic expression assays, and showed a general trend of heterochromatin formation induced by IAP ERV elements, as well as at least one apparent example of heterochromatin spreading from an IAP into a nearby gene (Rebollo et al., 2011).

Along with other potential applications such as genetic association studies and population/evolution genetics, the mouse polymorphic ERV data presented here shed light on fundamental mouse genetics, phenotypic variability and TE-host gene relations.

5.1.2 What features of TE insertions are linked to their residential probability in genes?

Comprising nearly half of the mammalian genomes, TEs are found within most genes.

Although the vast majority of TEs in introns are fixed in the species and presumably exert no significant effects on the enclosing gene, some markedly perturb transcription and result in disease or a mutated phenotype. Not all of the factors that determine the likelihood that an intronic

TE will affect transcription are clear. In the study described in Chapter 3, I examined intronic TE distributions in both human and mouse, and found that several factors likely contribute to whether a particular TE can influence gene transcription. Specifically, I

observed that TEs near exons are greatly underrepresented compared to random distributions,

and further that the size of these “underrepresentation zones” (U-zones) differs between TE

types. Compared to elsewhere in introns, TEs within these zones are shorter, on average, and

show stronger orientation biases. Moreover, TEs in extremely close proximity (< 20 bp) to exons show a strong bias to be near splice-donor sites. Interestingly, disease-causing intronic

TE insertions show the opposite distributional trends. By examining EST databases, I found

that the proportion of TEs contributing to chimeric TE-gene transcripts is significantly higher within their U-zones. In addition, an analysis of predicted splice sites within human long terminal repeat (LTR) elements showed a significantly lower total number and weaker strength for intronic LTRs near exons. I selectively examined a list of polymorphic mouse

LTR elements in introns based on these factors, and showed clear evidence of transcriptional disruption by LTR element insertions in the Trpc6 and Kcnh6 genes. These studies, taken together, provide insight into the potential selective forces that have shaped intronic TE distributions, and facilitate identification of TEs most likely to exert transcriptional effects on genes.

5.1.3 Why are some genes highly enriched in TEs while others are depleted of TEs?

By measuring the normalized coverage of TE sequences within genes, as described in

Chapter 4, I identified sets of genes with conserved extremes of high/low TE density in the genomes of human, mouse and cow, and denoted them as “shared upper/lower outliers

(SUOs/SLOs)”. By comparing these outlier genes to the genomic background, I showed that a large proportion of SUOs are involved in metabolic pathways and that they tend to be mammal-specific, whereas many SLOs are related to developmental processes and have

more ancient origins. Furthermore, the proportions of different types of TEs within human

and mouse orthologous SUOs showed high similarity, even though most detectable TEs in

these two genomes inserted after their divergence. My computational analysis of polymerase-

II (Pol-II) occupancy at gene promoters in different mouse tissues showed that 60% of tissue-

specific SUOs have strong Pol-II binding in embryonic stem cells (ESCs), a proportion

significantly higher than the genomic background (37%). In addition, the analysis of histone

marks, such as H3K4me3 and H3K27me3 in mouse ESCs, also suggests a strong association

between TE-rich genes and open-chromatin at promoters. Two independent whole-

transcriptome datasets also showed a positive association between TE density and gene expression level in ESCs. While this study focused on genes with extreme TE densities, the results clearly show that the probability of TE accumulation/fixation in mammalian genes is not random and is likely associated with different factors/gene properties. Most importantly, my results show an association between the TE insertion/fixation rate and gene activity status in ES cells.
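The outlier-based approach summarized above can be illustrated with a minimal sketch. It is not the procedure of Chapter 4: the normalization used there is not restated in this chapter, so the sketch assumes that TE density is simply the fraction of a gene's length covered by TE annotations, and that outliers are the genes in the lowest and highest fraction of the ranked list. The function names, the `frac` cutoff and the use of ortholog-group identifiers as dictionary keys are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def covered_bp(intervals: List[Interval], start: int, end: int) -> int:
    """Count bases in [start, end) covered by at least one interval.

    Overlapping TE annotations are merged so that no base is counted twice.
    """
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in intervals if s < end and e > start
    )
    total = 0
    cur_s = cur_e = None
    for s, e in clipped:
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                total += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        total += cur_e - cur_s
    return total


def te_density(gene: Interval, tes: List[Interval]) -> float:
    """Assumed definition: TE-covered bases divided by gene length."""
    start, end = gene
    return covered_bp(tes, start, end) / (end - start)


def outliers(density: Dict[str, float], frac: float) -> Tuple[Set[str], Set[str]]:
    """Return (lower, upper) outlier sets: the lowest and highest `frac` of genes."""
    ranked = sorted(density, key=density.get)
    k = max(1, int(len(ranked) * frac))
    return set(ranked[:k]), set(ranked[-k:])


def shared_outliers(
    per_species: List[Dict[str, float]], frac: float = 0.1
) -> Tuple[Set[str], Set[str]]:
    """Ortholog groups that are lower (or upper) outliers in every species.

    Each dictionary maps a hypothetical ortholog-group identifier to the
    TE density of that group's gene in one species.
    """
    lows, highs = zip(*(outliers(d, frac) for d in per_species))
    return set.intersection(*lows), set.intersection(*highs)
```

Calling `shared_outliers` with one density dictionary per species (for example human, mouse and cow) returns the candidate SLO and SUO sets under these assumptions. Intersecting the per-species outlier sets is only one simple way to express "conserved extremes".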

5.2 Conclusions

This thesis work set out to study TE-gene dynamics in mammals, using computational approaches to analyze various biological data at the whole genome level. To achieve this, my research began by identifying polymorphic ERVs in mice. Since relatively little evolutionary time has elapsed since these young ERVs integrated, many of them are not yet fixed in the host population (i.e. they are polymorphic). For the same reason, some of these polymorphic elements may have slightly or moderately deleterious effects on the host, and are probably still under selection. Polymorphic ERVs do show genomic location features resulting from selection pressure; however, their distributional biases are generally weaker than those of their fixed counterparts. In Chapter 2, I showed an intermediate level of underrepresentation of polymorphic ERVs in mouse genes, as well as a moderate orientation bias that lies between the distributions for random simulations and the actual pattern for fixed elements. Finding such features for polymorphic TEs is important because, along with fixed elements and the expected random distribution derived from computational simulations, polymorphic TEs can help us capture a “snapshot” of the ongoing processes of natural selection imposed on TEs and the host genome. As an example, in Section 3.2.5 I show distinct “U-zone effects” for mutagenic, random and fixed TEs, and by adding mouse polymorphic ERV data to the analysis, it is much more convincing to conclude that the U-zone effect is a result of natural selection.
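The comparison of orientation bias against a random expectation can be sketched as an exact binomial test. This is an illustration rather than the statistical procedure used in Chapter 2 or Chapter 3: it assumes that, in the absence of selection, an intragenic insertion is equally likely to be in the sense or antisense orientation (p = 0.5), whereas the thesis compares against computational simulations. The function names are hypothetical, and any counts passed in would have to come from the actual annotation data.

```python
from math import comb


def antisense_fraction(sense: int, antisense: int) -> float:
    """Fraction of intragenic TEs in the antisense orientation."""
    return antisense / (sense + antisense)


def binom_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for observing k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Intended for modest n (a few hundred); very large n would
    need a log-space or approximate method instead.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that ties are not lost to rounding error.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))
```

Applying the same test separately to polymorphic and fixed elements would show whether both deviate from the 50:50 expectation, while `antisense_fraction` gives the size of the bias, which is the quantity expected to be intermediate for polymorphic elements.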

In addition to polymorphic/active TEs, ancient elements that have been fixed in the host genome for millions of years may also contain valuable information concerning their characteristics and evolutionary roles, in the same way that the information found in fossils is valuable to paleontologists. Although over a million TE copies have become fixed in mammalian gene introns during evolution, the vast majority of them apparently have no functional impact on the gene. Yet, new disease-causing TE insertions do occur in introns and exert detrimental effects on the host. Intrigued by the question of why some intronic TEs are harmful whereas others are not, I looked for clues by examining the distribution of fixed TEs in gene introns and comparing it to the distribution of non-fixed elements, such as polymorphic or mutagenic insertions. My results showed multiple genomic features that may influence the retention probability of de novo TE integrations in gene introns, including the distance to the intron-exon boundary, orientation, and proximity to splice sites. Moreover, by using the mouse polymorphic ERV data obtained in Chapter 2, as well as mutagenic TE insertions reported in the literature, I have been able to show convincing evidence to support the premise that natural selection was the major driving force underlying the identified TE distribution patterns in genes.
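The three intronic features named above (distance to the intron-exon boundary, orientation, and the type of the nearest splice site) can be computed from coordinates alone. The following is a minimal sketch under stated assumptions, not the pipeline used in Chapter 3: coordinates are taken to be 0-based and half-open, the TE is assumed to lie entirely within one intron, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Intron:
    start: int   # first intronic base (0-based)
    end: int     # one past the last intronic base
    strand: str  # strand of the host gene, '+' or '-'


@dataclass(frozen=True)
class TE:
    start: int
    end: int
    strand: str  # strand of the TE annotation, '+' or '-'


def intronic_features(te: TE, intron: Intron) -> Dict[str, Union[int, str]]:
    """Describe an intronic TE relative to the flanking exons."""
    if not (intron.start <= te.start and te.end <= intron.end):
        raise ValueError("TE is not fully contained in the intron")

    d_left = te.start - intron.start   # gap to the lower-coordinate boundary
    d_right = intron.end - te.end      # gap to the higher-coordinate boundary

    # The splice donor sits at the 5' end of the intron, which is the
    # lower coordinate on the '+' strand and the higher one on the '-' strand.
    if intron.strand == "+":
        d_donor, d_acceptor = d_left, d_right
    else:
        d_donor, d_acceptor = d_right, d_left

    return {
        "distance_to_boundary": min(d_donor, d_acceptor),
        "nearest_site": "donor" if d_donor <= d_acceptor else "acceptor",
        "orientation": "sense" if te.strand == intron.strand else "antisense",
    }
```

For instance, a TE 15 bp inside the lower-coordinate edge of an intron in a plus-strand gene is reported as nearest to the donor site, whereas the same coordinates in a minus-strand gene place it nearest to the acceptor site. Handling the strand explicitly is what keeps the donor/acceptor assignment correct.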

While the study described in Chapter 3 revealed insertional characteristics of TEs that may determine the “harmfulness” of a TE insertion in the gene, the distinct properties of genes themselves may also carry their own influences. By identifying orthologous genes that share extremes of either low or high density of TEs (i.e. SLOs and SUOs, respectively) in three distant mammalian species, I showed that TE density within genes is associated with factors such as the biological function and conservation level of genes, associations that are likely, again, an outcome of natural selection. In theory, initial insertional preference, natural selection and genetic drift can all contribute to the current distribution pattern of TEs in host genes. Although evaluating initial insertional preference is difficult for many ancient TEs, a limited number of experiments using reconstructed TE sequences have shown distinct distribution patterns of de novo TE insertions compared to fixed genomic distributions of corresponding TEs (Brady et al., 2009; Gasior et al., 2007), suggesting much greater effects of natural selection or genetic drift on fixed elements. It is therefore natural to propose that the enrichment of genes involved in developmental processes among SLOs, for example, is due to the essential roles these genes play in the development of the host organism and, as a result, to strong selection against TE insertions in such genes. One can further postulate that the enrichment of metabolic genes among the SUOs could in fact be beneficial, since TEs in these genes can potentially contribute to the genetic variation of the host species. While it is apparent that the effect of natural selection is dominant compared to the initial integration preference of TEs, my results also suggest that signatures of the latter are still evident. Using genome-wide chromatin state data generated by high-throughput techniques, I observed an enrichment of TEs in genes with open chromatin in ES cells, where new TE integrations in the genome can be passed on to the next generation. Furthermore, by controlling for either the functional category of genes or the chromatin state in ES cells, a quantitative analysis shows effects linked to both factors (Section 4.2.6), providing further insight into the relationship between TEs and genes.

The relationship between TEs and host genes has intrigued biologists since the discovery of these mobile DNA sequences in various host genomes more than half a century ago. What are these repeats? Why are they there? What do they do? What can they tell us about genome evolution and speciation? All of these are intriguing questions that I hope my thesis work has helped to address. We now know that, from selfish genomic parasites to evolutionary catalysts, TEs have played various roles during host evolution (Belancio et al., 2008; Biemont, 2010). Boosted by many major achievements in molecular biology and genomics during the past few decades, our knowledge of TEs has been rapidly advancing, and our understanding of TE-gene relationships is also evolving. However, while various new technologies have significantly changed the fundamental methodology of biological research, numerous new challenges have arisen. For example, the great advances in whole genome sequencing and next generation sequencing techniques have provided unprecedented opportunities to systematically study TEs at the genomic level, but extracting biological insights from the overwhelming “-omics” data being produced today poses new challenges at the same time. The emergence of high throughput technologies has facilitated the discovery of an increasing number of TE germline polymorphisms and somatic insertions in human cancers, with the recent advances in studies of human L1 polymorphisms being good examples (Beck et al., 2010; Ewing and Kazazian, 2010; Huang et al., 2010; Iskow et al., 2010). Another recent study on the sequencing of 17 mouse strains has uncovered over 700,000 structural variants among these strains, of which over half are due to TE insertional polymorphisms (Yalcin et al., 2011). While this mouse study reported a potential biological effect of one ERV polymorphism (i.e. an IAP ERV insertion upstream of the Eps15 gene, which severely disrupted the transcription of the gene and led to reduced locomotor activity), there have been no systematic efforts published to identify which human or mouse polymorphic or somatically-acquired TEs may contribute to allele-specific gene expression differences and phenotypic variation or disease. One future direction continuing from the work presented in this thesis could be the development of comprehensive methods to help choose potential candidate TE insertions that have functional effects on cellular genes. To achieve this purpose, many of the findings from this thesis work, as well as results from other studies, could be integrated and used to build a “TE-gene interaction” model that takes into account all the genomic features of TE insertions (e.g. results from Chapter 3) and the characteristics of genes harboring TEs (e.g. results from Chapter 4). Aided by new sequencing technologies and sophisticated computational/bioinformatics methods, an ambitious but worthy goal would be to make reliable predictions of the most influential TE insertions, which are potentially linked either to various diseases or to phenotypic variations (including both somatic and germline mutations). Ultimately, such knowledge will contribute to our understanding of the effects of TEs and their relationships to their host organisms, which is important to a broad range of studies from evolution to human medicine.

References

Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000). The genome sequence of Drosophila melanogaster. Science 287, 2185-2195.

Adelson, D.L., Raison, J.M., and Edgar, R.C. (2009). Characterization and distribution of retrotransposons and simple sequence repeats in the bovine genome. Proc Natl Acad Sci U S A 106, 12855-12860.

Akagi, K., Li, J., Stephens, R.M., Volfovsky, N., and Symer, D.E. (2008). Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880.

Alfoldi, J., Di Palma, F., Grabherr, M., Williams, C., Kong, L., Mauceli, E., Russell, P., Lowe, C.B., Glor, R.E., Jaffe, J.D., et al. (2011). The genome of the green anole lizard and a comparative analysis with birds and mammals. Nature 477, 587-591.

Allen, E., Horvath, S., Tong, F., Kraft, P., Spiteri, E., Riggs, A.D., and Marahrens, Y. (2003). High concentrations of long interspersed nuclear element sequence distinguish monoallelically expressed genes. Proc Natl Acad Sci U S A 100, 9940-9945.

Almeida, R., and Allshire, R.C. (2005). RNA silencing and genome regulation. Trends Cell Biol 15, 251-258.

Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. (2002). Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297, 1301-1310.

Bailey, J.A., Liu, G., and Eichler, E.E. (2003). An Alu transposition model for the origin and expansion of human segmental duplications. Am J Hum Genet 73, 823-834.

Baillie, J.K., Barnett, M.W., Upton, K.R., Gerhardt, D.J., Richmond, T.A., De Sapio, F., Brennan, P., Rizzu, P., Smith, S., Fell, M., et al. (2011). Somatic retrotransposition alters the genetic landscape of the human brain. Nature.

Banno, F., Kaminaka, K., Soejima, K., Kokame, K., and Miyata, T. (2004). Identification of strain-specific variants of mouse Adamts13 gene encoding von Willebrand factor-cleaving protease. J Biol Chem 279, 30896-30903.

Barrera, L.O., Li, Z., Smith, A.D., Arden, K.C., Cavenee, W.K., Zhang, M.Q., Green, R.D., and Ren, B. (2008). Genome-wide mapping and analysis of active promoters in mouse embryonic stem cells and adult organs. Genome Res 18, 46-59.

Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., and Edgar, R. (2007). NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res 35, D760-765.

Batzer, M.A., and Deininger, P.L. (2002). Alu repeats and human genomic diversity. Nat Rev Genet 3, 370-379.

Baust, C., Baillie, G.J., and Mager, D.L. (2002). Insertional polymorphisms of ETn retrotransposons include a disruption of the wiz gene in C57BL/6 mice. Mamm Genome 13, 423-428.

Baust, C., Gagnier, L., Baillie, G.J., Harris, M.J., Juriloff, D.M., and Mager, D.L. (2003). Structure and expression of mobile ETnII retroelements and their coding-competent MusD relatives in the mouse. J Virol 77, 11448-11458.

Beck, C.R., Collier, P., Macfarlane, C., Malig, M., Kidd, J.M., Eichler, E.E., Badge, R.M., and Moran, J.V. (2010). LINE-1 retrotransposition activity in human genomes. Cell 141, 1159-1170.

Beck, J.A., Lloyd, S., Hafezparast, M., Lennon-Pierce, M., Eppig, J.T., Festing, M.F., and Fisher, E.M. (2000). Genealogies of mouse inbred strains. Nat Genet 24, 23-25.

Belancio, V.P., Hedges, D.J., and Deininger, P. (2006). LINE-1 RNA splicing and influences on mammalian gene expression. Nucleic Acids Res 34, 1512-1521.

Belancio, V.P., Hedges, D.J., and Deininger, P. (2008). Mammalian non-LTR retrotransposons: for better or worse, in sickness and in health. Genome Res 18, 343-358.

Belshaw, R., Dawson, A.L., Woolven-Allen, J., Redding, J., Burt, A., and Tristem, M. (2005). Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. J Virol 79, 12507-12514.

Biemont, C. (2010). A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics 186, 1085-1093.

Blewitt, M.E., Vickaryous, N.K., Paldi, A., Koseki, H., and Whitelaw, E. (2006). Dynamic reprogramming of DNA methylation at an epigenetically sensitive allele in mice. PLoS Genet 2, e49.

Blond, J.L., Lavillette, D., Cheynet, V., Bouton, O., Oriol, G., Chapel-Fernandes, S., Mandrand, B., Mallet, F., and Cosset, F.L. (2000). An envelope glycoprotein of the human endogenous retrovirus HERV-W is expressed in the human placenta and fuses cells expressing the type D mammalian retrovirus receptor. J Virol 74, 3321-3329.

Boeke, J.D., and Stoye, J.P. (1997). Retrotransposons, Endogenous Retroviruses, and the Evolution of Retroelements. In Retroviruses, J.M. Coffin, L. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 343-435.

Boissinot, S., Davis, J., Entezam, A., Petrov, D., and Furano, A.V. (2006). Fitness cost of LINE-1 (L1) activity in humans. Proc Natl Acad Sci U S A 103, 9590-9594.

Brady, T., Lee, Y.N., Ronen, K., Malani, N., Berry, C.C., Bieniasz, P.D., and Bushman, F.D. (2009). Integration target site selection by a resurrected human endogenous retrovirus. Genes Dev 23, 633-642.

Brouha, B., Schustak, J., Badge, R.M., Lutz-Prigge, S., Farley, A.H., Moran, J.V., and Kazazian, H.H., Jr. (2003). Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl Acad Sci U S A 100, 5280-5285.

Brulet, P., Condamine, H., and Jacob, F. (1985). Spatial distribution of transcripts of the long repeated ETn sequence during early mouse embryogenesis. Proc Natl Acad Sci U S A 82, 2054-2058.

Burwinkel, B., and Kilimann, M.W. (1998). Unequal homologous recombination between LINE-1 elements as a mutational mechanism in human genetic disease. J Mol Biol 277, 513-517.

Bushman, F., Lewinski, M., Ciuffi, A., Barr, S., Leipzig, J., Hannenhalli, S., and Hoffmann, C. (2005). Genome-wide analysis of retroviral DNA integration. Nat Rev Microbiol 3, 848-858.

Bushman, F.D. (2003). Targeting survival: integration site selection by retroviruses and LTR-retrotransposons. Cell 115, 135-138.

Callinan, P.A., and Batzer, M.A. (2006). Retrotransposable elements and human disease. Genome Dyn 1, 104-115.

Callinan, P.A., Wang, J., Herke, S.W., Garber, R.K., Liang, P., and Batzer, M.A. (2005). Alu retrotransposition-mediated deletion. J Mol Biol 348, 791-800.

Celniker, S.E., and Rubin, G.M. (2003). The Drosophila melanogaster genome. Annu Rev Genomics Hum Genet 4, 89-117.

Chen, J.M., Stenson, P.D., Cooper, D.N., and Ferec, C. (2005). A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet 117, 411-427.

Cohen, C.J., Lock, W.M., and Mager, D.L. (2009). Endogenous retroviral LTRs as promoters for human genes: a critical assessment. Gene 448, 105-114.

Conley, A.B., Piriyapongsa, J., and Jordan, I.K. (2008). Retroviral promoters in the human genome. Bioinformatics 24, 1563-1567.

Corazzari, M., Lovat, P.E., Armstrong, J.L., Fimia, G.M., Hill, D.S., Birch-Machin, M., Redfern, C.P., and Piacentini, M. (2007). Targeting homeostatic mechanisms of endoplasmic reticulum stress to increase susceptibility of cancer cells to fenretinide-induced apoptosis: the role of stress proteins ERdj5 and ERp57. Br J Cancer 96, 1062-1071.

Cordaux, R., and Batzer, M.A. (2009). The impact of retrotransposons on human genome evolution. Nat Rev Genet 10, 691-703.

Cordaux, R., Lee, J., Dinoso, L., and Batzer, M.A. (2006). Recently integrated Alu retrotransposons are essentially neutral residents of the human genome. Gene 373, 138-144.

Coufal, N.G., Garcia-Perez, J.L., Peng, G.E., Yeo, G.W., Mu, Y., Lovci, M.T., Morell, M., O'Shea, K.S., Moran, J.V., and Gage, F.H. (2009). L1 retrotransposition in human neural progenitor cells. Nature 460, 1127-1131.

Cunnea, P.M., Miranda-Vizuete, A., Bertoli, G., Simmen, T., Damdimopoulos, A.E., Hermann, S., Leinonen, S., Huikko, M.P., Gustafsson, J.A., Sitia, R., et al. (2003). ERdj5, an endoplasmic reticulum (ER)-resident protein containing DnaJ and thioredoxin domains, is expressed in secretory cells or following ER stress. J Biol Chem 278, 1059-1066.

Cutter, A.D., Good, J.M., Pappas, C.T., Saunders, M.A., Starrett, D.M., and Wheeler, T.J. (2005). Transposable element orientation bias in the Drosophila melanogaster genome. J Mol Evol 61, 733-741.

De Palma, M., Montini, E., Santoni de Sio, F.R., Benedicenti, F., Gentile, A., Medico, E., and Naldini, L. (2005). Promoter trapping reveals significant differences in integration site selection between MLV and HIV vectors in primary hematopoietic cells. Blood 105, 2307-2315.

Deininger, P.L., and Batzer, M.A. (1999). Alu repeats and human disease. Mol Genet Metab 67, 183-193.

Dewannieux, M., Dupressoir, A., Harper, F., Pierron, G., and Heidmann, T. (2004). Identification of autonomous IAP LTR retrotransposons mobile in mammalian cells. Nat Genet 36, 534-539.

Dewannieux, M., Esnault, C., and Heidmann, T. (2003). LINE-mediated retrotransposition of marked Alu sequences. Nat Genet 35, 41-48.

Dewannieux, M., Harper, F., Richaud, A., Letzelter, C., Ribet, D., Pierron, G., and Heidmann, T. (2006). Identification of an infectious progenitor for the multiple-copy HERV-K human endogenous retroelements. Genome Res 16, 1548-1556.

Dewannieux, M., and Heidmann, T. (2005). L1-mediated retrotransposition of murine B1 and B2 SINEs recapitulated in cultured cells. J Mol Biol 349, 241-247.

Di-Poi, N., Montoya-Burgos, J.I., and Duboule, D. (2009). Atypical relaxation of structural constraints in Hox gene clusters of the green anole lizard. Genome Res 19, 602-610.

Druker, R., Bruxner, T.J., Lehrbach, N.J., and Whitelaw, E. (2004). Complex patterns of transcription at the insertion site of a retrotransposon in the mouse. Nucleic Acids Res 32, 5800-5808.

Druker, R., and Whitelaw, E. (2004). Retrotransposon-derived elements in the mammalian genome: a potential source of disease. J Inherit Metab Dis 27, 319-330.

Dudley, J.P. (2003). Tag, you're hit: retroviral insertions identify genes involved in cancer. Trends Mol Med 9, 43-45.

Duhl, D.M., Vrieling, H., Miller, K.A., Wolff, G.L., and Barsh, G.S. (1994). Neomorphic agouti mutations in obese yellow mice. Nat Genet 8, 59-65.

Dunn, C.A., Medstrand, P., and Mager, D.L. (2003). An endogenous retroviral long terminal repeat is the dominant promoter for human beta1,3-galactosyltransferase 5 in the colon. Proc Natl Acad Sci U S A 100, 12841-12846.

Dupressoir, A., Marceau, G., Vernochet, C., Benit, L., Kanellopoulos, C., Sapin, V., and Heidmann, T. (2005). Syncytin-A and syncytin-B, two fusogenic placenta-specific murine envelope genes of retroviral origin conserved in Muridae. Proc Natl Acad Sci U S A 102, 725-730.

Eickbush, T.H., and Jamburuthugoda, V.K. (2008). The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res 134, 221-234.

Elmedyb, P., Olesen, S.P., and Grunnet, M. (2007). Activation of ERG2 potassium channels by the diphenylurea NS1643. Neuropharmacology 53, 283-294.

Elsik, C.G., Tellam, R.L., Worley, K.C., Gibbs, R.A., Muzny, D.M., Weinstock, G.M., Adelson, D.L., Eichler, E.E., Elnitski, L., Guigo, R., et al. (2009). The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324, 522-528.

Ewing, A.D., and Kazazian, H.H., Jr. (2010). High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20, 1262-1270.

Feschotte, C. (2008). Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9, 397-405.

Feschotte, C., and Pritham, E.J. (2007). DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet 41, 331-368.

Fontanillas, P., Hartl, D.L., and Reuter, M. (2007). Genome organization and gene expression shape the transposable element distribution in the Drosophila melanogaster euchromatin. PLoS Genet 3, e210.

Frankel, W.N., Stoye, J.P., Taylor, B.A., and Coffin, J.M. (1990). A linkage map of endogenous murine leukemia proviruses. Genetics 124, 221-236.

Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B., et al. (2007). A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature 448, 1050-1053.

Gal-Mark, N., Schwartz, S., and Ast, G. (2008). Alternative splicing of Alu exons--two arms are better than one. Nucleic Acids Res 36, 2012-2023.

Gallus, G.N., Cardaioli, E., Rufa, A., Da Pozzo, P., Bianchi, S., D'Eramo, C., Collura, M., Tumino, M., Pavone, L., and Federico, A. (2010). Alu-element insertion in an OPA1 intron sequence associated with autosomal dominant optic atrophy. Mol Vis 16, 178-183.

Ganguly, A., Dunbar, T., Chen, P., Godmilow, L., and Ganguly, T. (2003). Exon skipping caused by an intronic insertion of a young Alu Yb9 element leads to severe hemophilia A. Hum Genet 113, 348-352.

Gasior, S.L., Preston, G., Hedges, D.J., Gilbert, N., Moran, J.V., and Deininger, P.L. (2007). Characterization of pre-insertion loci of de novo L1 insertions. Gene 390, 190-198.

Gifford, R., and Tristem, M. (2003). The evolution, distribution and diversity of endogenous retroviruses. Virus Genes 26, 291-315.

Gilbert, N., Lutz-Prigge, S., and Moran, J.V. (2002). Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325.

Gonzalez-Cobos, J.C., and Trebak, M. (2010). TRPC channels in smooth muscle cells. Front Biosci 15, 1023-1039.

Gould, S.J., and Vrba, E.S. (1982). Exaptation - a Missing Term in the Science of Form. Paleobiology 8, 4-15.

Graham, T., and Boissinot, S. (2006). The genomic distribution of L1 elements: the role of insertion bias and natural selection. J Biomed Biotechnol 2006, 75327.

Graubert, T.A., Cahan, P., Edwin, D., Selzer, R.R., Richmond, T.A., Eis, P.S., Shannon, W.D., Li, X., McLeod, H.L., Cheverud, J.M., et al. (2007). A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3.

Grover, D., Majumder, P.P., C, B.R., Brahmachari, S.K., and Mukerji, M. (2003). Nonrandom distribution of alu elements in genes of various functional categories: insight from analysis of human chromosomes 21 and 22. Mol Biol Evol 20, 1420-1424.

Han, K., Lee, J., Meyer, T.J., Remedios, P., Goodwin, L., and Batzer, M.A. (2008). L1 recombination-associated deletions generate human genomic variation. Proc Natl Acad Sci U S A 105, 19366-19371.

Han, K., Sen, S.K., Wang, J., Callinan, P.A., Lee, J., Cordaux, R., Liang, P., and Batzer, M.A. (2005). Genomic rearrangements by LINE-1 insertion-mediated deletion in the human and chimpanzee lineages. Nucleic Acids Res 33, 4040-4052.

Hasler, J., and Strub, K. (2006). Alu elements as regulators of gene expression. Nucleic Acids Res 34, 5491-5497.

Heidmann, O., Vernochet, C., Dupressoir, A., and Heidmann, T. (2009). Identification of an endogenous retroviral envelope gene with fusogenic activity and placenta-specific expression in the rabbit: a new "syncytin" in a third order of mammals. Retrovirology 6, 107.

Ho, M., Post, C.M., Donahue, L.R., Lidov, H.G., Bronson, R.T., Goolsby, H., Watkins, S.C., Cox, G.A., and Brown, R.H., Jr. (2004). Disruption of muscle membrane and phenotype divergence in two novel mouse models of dysferlin deficiency. Hum Mol Genet 13, 1999-2010.

Hofmann, M., Harris, M., Juriloff, D., and Boehm, T. (1998). Spontaneous mutations in SELH/Bc mice due to insertions of early transposons: molecular characterization of null alleles at the nude and albino loci. Genomics 52, 107-109.

Hollister, J.D., and Gaut, B.S. (2009). Epigenetic silencing of transposable elements: a trade-off between reduced transposition and deleterious effects on neighboring gene expression. Genome Res 19, 1419-1428.

Horie, K., Saito, E.S., Keng, V.W., Ikeda, R., Ishihara, H., and Takeda, J. (2007). Retrotransposons influence the mouse transcriptome: implication for the divergence of genetic traits. Genetics 176, 815-827.

Huang, C.R., Schneider, A.M., Lu, Y., Niranjan, T., Shen, P., Robinson, M.A., Steranka, J.P., Valle, D., Civin, C.I., Wang, T., et al. (2010). Mobile interspersed repeats are major structural variants in the human genome. Cell 141, 1171-1182.

Huda, A., Marino-Ramirez, L., Landsman, D., and Jordan, I.K. (2009). Repetitive DNA elements, nucleosome binding and human gene expression. Gene 436, 12-22.

Iskow, R.C., McCabe, M.T., Mills, R.E., Torene, S., Pittard, W.S., Neuwald, A.F., Van Meir, E.G., Vertino, P.M., and Devine, S.E. (2010). Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141, 1253-1261.

Jern, P., Stoye, J.P., and Coffin, J.M. (2007). Role of APOBEC3 in genetic diversity among endogenous murine leukemia viruses. PLoS Genet 3, 2014-2022.

Jjingo, D., Huda, A., Gundapuneni, M., Marino-Ramirez, L., and Jordan, I.K. (2011). Effect of the transposable element environment of human genes on gene length and expression. Genome Biol Evol, 259-271.

Juriloff, D.M., Harris, M.J., Dewell, S.L., Brown, C.J., Mager, D.L., Gagnier, L., and Mah, D.G. (2005). Investigations of the genomic region that contains the clf1 mutation, a causal gene in multifactorial cleft lip and palate in mice. Birth Defects Res A Clin Mol Teratol 73, 103-113.

Jurka, J. (1997). Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci U S A 94, 1872-1877.

Kaiser, S.M., Malik, H.S., and Emerman, M. (2007). Restriction of an extinct retrovirus by the human TRIM5alpha antiviral protein. Science 316, 1756-1758.

Kalari, K.R., Casavant, M., Bair, T.B., Keen, H.L., Comeron, J.M., Casavant, T.L., and Scheetz, T.E. (2006). First exons and introns--a survey of GC content and gene structure in the human genome. In Silico Biol 6, 237-242.

Kapitonov, V.V., and Jurka, J. (2006). Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci U S A 103, 4540-4545.

Kapitonov, V.V., and Jurka, J. (2007). Helitrons on a roll: eukaryotic rolling-circle transposons. Trends Genet 23, 521-529.

Karimi, M.M., Goyal, P., Maksakova, I.A., Bilenky, M., Leung, D., Tang, J.X., Shinkai, Y., Mager, D.L., Jones, S., Hirst, M., et al. (2011). DNA Methylation and SETDB1/H3K9me3 Regulate Predominantly Distinct Sets of Genes, Retroelements, and Chimeric Transcripts in mESCs. Cell Stem Cell 8, 676-687.

Kaushik, N., and Stoye, J.P. (1994). Intracisternal A-type particle elements as genetic markers: detection by repeat element viral element amplified locus-PCR. Mamm Genome 5, 688-695.

Kent, W.J. (2002). BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664.

Keren, H., Lev-Maor, G., and Ast, G. (2010). Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet 11, 345-355.

Khurana, J.S., and Theurkauf, W. (2010). piRNAs, transposon silencing, and Drosophila germline development. J Cell Biol 191, 905-913.

Kidwell, M.G. (2002). Transposable elements and the evolution of genome size in eukaryotes. Genetica 115, 49-63.

Kim, D.S., Huh, J.W., Kim, Y.H., Park, S.J., and Chang, K.T. (2010). Functional impact of transposable elements using bioinformatic analysis and a comparative genomic approach. Mol Cells 30, 77-87.

Krull, M., Brosius, J., and Schmitz, J. (2005). Alu-SINE exonization: en route to protein-coding function. Mol Biol Evol 22, 1702-1711.

Krull, M., Petrusma, M., Makalowski, W., Brosius, J., and Schmitz, J. (2007). Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res 17, 1139-1145.

Kuff, E.L., and Lueders, K.K. (1988). The intracisternal A-particle gene family: structure and functional aspects. Adv Cancer Res 51, 183-276.

Kunarso, G., Chia, N.Y., Jeyakani, J., Hwang, C., Lu, X., Chan, Y.S., Ng, H.H., and Bourque, G. (2010). Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat Genet 42, 631-634.

Kung, H.J., Boerkoel, C., and Carter, T.H. (1991). Retroviral mutagenesis of cellular oncogenes: a review with insights into the mechanisms of insertional activation. Curr Top Microbiol Immunol 171, 1-25.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Lee, J., Han, K., Meyer, T.J., Kim, H.S., and Batzer, M.A. (2008a). Chromosomal inversions between human and chimpanzee lineages caused by retrotransposons. PLoS One 3, e4047.

Lee, J.Y., Ji, Z., and Tian, B. (2008b). Phylogenetic analysis of mRNA polyadenylation sites reveals a role of transposable elements in evolution of the 3'-end of genes. Nucleic Acids Res 36, 5581-5590.

Lee, Y.N., and Bieniasz, P.D. (2007). Reconstitution of an infectious human endogenous retrovirus. PLoS Pathog 3, e10.

Leung, D.C., and Lorincz, M.C. (2011). Silencing of endogenous retroviruses: when and why do histone marks predominate? Trends Biochem Sci.

Lev-Maor, G., Ram, O., Kim, E., Sela, N., Goren, A., Levanon, E.Y., and Ast, G. (2008). Intronic Alus influence alternative splicing. PLoS Genet 4, e1000204.

Li, J., Jiang, T., Mao, J.H., Balmain, A., Peterson, L., Harris, C., Rao, P.H., Havlak, P., Gibbs, R., and Cai, W.W. (2004). Genomic segmental polymorphisms in inbred mouse strains. Nat Genet 36, 952-954.

Locke, D.P., Hillier, L.W., Warren, W.C., Worley, K.C., Nazareth, L.V., Muzny, D.M., Yang, S.P., Wang, Z., Chinwalla, A.T., Minx, P., et al. (2011). Comparative and demographic analysis of orang-utan genomes. Nature 469, 529-533.

Lodish, H., Berk, A., Kaiser, C.A., Krieger, M., Scott, M.P., Bretscher, A., Ploegh, H., and Matsudaira, P. (2007). Molecular Cell Biology. In (New York: W. H. Freeman and Company), pp. 329-330.

Loebel, D.A., Tsoi, B., Wong, N., O'Rourke, M.P., and Tam, P.P. (2004). Restricted expression of ETn-related sequences during post-implantation mouse development. Gene Expr Patterns 4, 467-471.

Lowe, C.B., Bejerano, G., and Haussler, D. (2007). Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc Natl Acad Sci U S A 104, 8005-8010.

Lueders, K.K., and Frankel, W.N. (1994). Mapping of mouse intracisternal A-particle proviral markers in an interspecific backcross. Mamm Genome 5, 473-478.

Lueders, K.K., Frankel, W.N., Mietz, J.A., and Kuff, E.L. (1993). Genomic mapping of intracisternal A-particle proviral elements. Mamm Genome 4, 69-77.

Lyon, M.F. (2006). Do LINEs have a role in X-chromosome inactivation? J Biomed Biotechnol 2006, 59746.

Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448-3449.

Mager, D.L., and Freeman, J.D. (2000). Novel mouse type D endogenous proviruses and ETn elements share long terminal repeat and internal sequences. J Virol 74, 7221-7229.

Mager, D.L., and Medstrand, P. (2003). Retroviral repeat sequences. In Nature Encyclopedia of the Human Genome (Nature Publishing Group), pp. 57-63.

Maksakova, I.A., Romanish, M.T., Gagnier, L., Dunn, C.A., van de Lagemaat, L.N., and Mager, D.L. (2006). Retroviral elements and their hosts: insertional mutagenesis in the mouse germ line. PLoS Genet 2, e2.

Malik, H.S. (2012). Retroviruses push the envelope for mammalian placentation. Proc Natl Acad Sci U S A.

Malik, H.S., and Eickbush, T.H. (1998). The RTE class of non-LTR retrotransposons is widely distributed in animals and is the origin of many SINEs. Mol Biol Evol 15, 1123-1134.

Martin, A., Troadec, C., Boualem, A., Rajab, M., Fernandez, R., Morin, H., Pitrat, M., Dogimont, C., and Bendahmane, A. (2009). A transposon-induced epigenetic change leads to sex determination in melon. Nature 461, 1135-1138.

McClintock, B. (1950). The origin and behavior of mutable loci in maize. Proc Natl Acad Sci U S A 36, 344-355.

Medstrand, P., van de Lagemaat, L.N., and Mager, D.L. (2002). Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res 12, 1483-1495.

Mi, S., Lee, X., Li, X., Veldman, G.M., Finnerty, H., Racie, L., LaVallie, E., Tang, X.Y., Edouard, P., Howes, S., et al. (2000). Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature 403, 785-789.

Mikkelsen, T.S., Hillier, L.W., Eichler, E.E., et al. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87.

Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.K., Koche, R.P., et al. (2007). Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448, 553-560.

Mills, R.E., Bennett, E.A., Iskow, R.C., and Devine, S.E. (2007). Which transposable elements are active in the human genome? Trends Genet 23, 183-191.

Mine, M., Chen, J.M., Brivet, M., Desguerre, I., Marchant, D., de Lonlay, P., Bernard, A., Ferec, C., Abitbol, M., Ricquier, D., et al. (2007). A large genomic deletion in the PDHX gene caused by the retrotranspositional insertion of a full-length LINE-1 element. Hum Mutat 28, 137-142.

Mitchell, R.S., Beitzel, B.F., Schroder, A.R., Shinn, P., Chen, H., Berry, C.C., Ecker, J.R., and Bushman, F.D. (2004). Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biol 2, E234.

Moran, J.V., DeBerardinis, R.J., and Kazazian, H.H., Jr. (1999). Exon shuffling by L1 retrotransposition. Science 283, 1530-1534.

Moran, J.V., Holmes, S.E., Naas, T.P., DeBerardinis, R.J., Boeke, J.D., and Kazazian, H.H., Jr. (1996). High frequency retrotransposition in cultured mammalian cells. Cell 87, 917-927.

Morgan, H.D., Sutherland, H.G., Martin, D.I., and Whitelaw, E. (1999). Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23, 314-318.

Mortada, H., Vieira, C., and Lerat, E. (2010). Genes devoid of full-length transposable element insertions are involved in development and in the regulation of transcription in human and closely related species. J Mol Evol 71, 180-191.

Muotri, A.R., Chu, V.T., Marchetto, M.C., Deng, W., Moran, J.V., and Gage, F.H. (2005). Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature 435, 903-910.

Needleman, S.B., and Wunsch, C.D. (1970). A general method applicable to the search for similarities in the sequence of two proteins. J Mol Biol 48, 443-453.

Obbard, D.J., Gordon, K.H., Buck, A.H., and Jiggins, F.M. (2009). The evolution of RNAi as a defence against viruses and transposable elements. Philos Trans R Soc Lond B Biol Sci 364, 99-115.

Ohshima, K., and Okada, N. (2005). SINEs and LINEs: symbionts of eukaryotic genomes with a common tail. Cytogenet Genome Res 110, 475-490.

Okamura, K. (2011). Diversity of animal small RNA pathways and their biological utility. Wiley Interdiscip Rev RNA.

Oldridge, M., Zackai, E.H., McDonald-McGinn, D.M., Iseki, S., Morriss-Kay, G.M., Twigg, S.R., Johnson, D., Wall, S.A., Jiang, W., Theda, C., et al. (1999). De novo Alu-element insertions in FGFR2 identify a distinct pathological basis for Apert syndrome. Am J Hum Genet 64, 446-461.

Ostertag, E.M., and Kazazian, H.H., Jr. (2001). Biology of mammalian L1 retrotransposons. Annu Rev Genet 35, 501-538.

Pace, J.K., 2nd, and Feschotte, C. (2007). The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res 17, 422-432.

Perepelitsa-Belancio, V., and Deininger, P. (2003). RNA truncation by premature polyadenylation attenuates human mobile element activity. Nat Genet 35, 363-366.

Peters, L.L., Robledo, R.F., Bult, C.J., Churchill, G.A., Paigen, B.J., and Svenson, K.L. (2007). The mouse as a model for human biology: a resource guide for complex trait analysis. Nat Rev Genet 8, 58-69.

Piriyapongsa, J., Marino-Ramirez, L., and Jordan, I.K. (2007). Origin and evolution of human microRNAs from transposable elements. Genetics 176, 1323-1337.

Pritham, E.J., and Feschotte, C. (2007). Massive amplification of rolling-circle transposons in the lineage of the bat Myotis lucifugus. Proc Natl Acad Sci U S A 104, 1895-1900.

Pritham, E.J., Putliwala, T., and Feschotte, C. (2007). Mavericks, a novel class of giant transposable elements widespread in eukaryotes and related to DNA viruses. Gene 390, 3-17.

Ray, D.A., Feschotte, C., Pagan, H.J., Smith, J.D., Pritham, E.J., Arensburger, P., Atkinson, P.W., and Craig, N.L. (2008). Multiple waves of recent DNA transposon activity in the bat, Myotis lucifugus. Genome Res 18, 717-728.

Rebollo, R., Horard, B., Hubert, B., and Vieira, C. (2010). Jumping genes and epigenetics: Towards new species. Gene 454, 1-7.

Rebollo, R., Karimi, M.M., Bilenky, M., Gagnier, L., Miceli-Royer, K., Zhang, Y., Goyal, P., Keane, T.M., Jones, S., Hirst, M., et al. (2011). Retrotransposon-induced heterochromatin spreading in the mouse revealed by insertional polymorphisms. PLoS Genet 7, e1002301.

Reed, J.E., Dunn, J.R., du Plessis, D.G., Shaw, E.J., Reeves, P., Gee, A.L., Warnke, P.C., Sellar, G.C., Moss, D.J., and Walker, C. (2007). Expression of cellular adhesion molecule 'OPCML' is down-regulated in gliomas and other brain tumours. Neuropathol Appl Neurobiol 33, 77-85.

Reese, M.G., Eeckman, F.H., Kulp, D., and Haussler, D. (1997). Improved splice site detection in Genie. J Comput Biol 4, 311-323.

Reiss, D., Zhang, Y., Rouhi, A., Reuter, M., and Mager, D.L. (2010). Variable DNA methylation of transposable elements: the case study of mouse Early Transposons. Epigenetics 5, 68-79.

Ribet, D., Dewannieux, M., and Heidmann, T. (2004). An active murine transposon family pair: retrotransposition of "master" MusD copies and ETn trans-mobilization. Genome Res 14, 2261-2267.

Romanish, M.T., Lock, W.M., van de Lagemaat, L.N., Dunn, C.A., and Mager, D.L. (2007). Repeated recruitment of LTR retrotransposons as promoters by the anti-apoptotic locus NAIP during mammalian evolution. PLoS Genet 3, e10.

Romanish, M.T., Nakamura, H., Lai, C.B., Wang, Y., and Mager, D.L. (2009). A novel protein isoform of the multicopy human NAIP gene derives from intragenic Alu SINE promoters. PLoS One 4, e5761.

Rosenberg, N., and Jolicoeur, P. (1997). Retroviral pathogenesis. In Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 475-585.

Sasaki, H., and Matsui, Y. (2008). Epigenetic events in mammalian germ-cell development: reprogramming and beyond. Nat Rev Genet 9, 129-140.

Schmid, C.W. (1998). Does SINE evolution preclude Alu function? Nucleic Acids Res 26, 4541-4550.

Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T.A., et al. (2009). The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112-1115.

Schofield, P.R., McFarland, K.C., Hayflick, J.S., Wilcox, J.N., Cho, T.M., Roy, S., Lee, N.M., Loh, H.H., and Seeburg, P.H. (1989). Molecular characterization of a new immunoglobulin superfamily protein with potential roles in opioid binding and cell contact. EMBO J 8, 489-495.

Schroder, A.R., Shinn, P., Chen, H., Berry, C., Ecker, J.R., and Bushman, F. (2002). HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110, 521-529.

Schwahn, U., Lenzner, S., Dong, J., Feil, S., Hinzmann, B., van Duijnhoven, G., Kirschner, R., Hemberger, M., Bergen, A.A., Rosenberg, T., et al. (1998). Positional cloning of the gene for X-linked retinitis pigmentosa 2. Nat Genet 19, 327-332.

Schwartz, S., Gal-Mark, N., Kfir, N., Oren, R., Kim, E., and Ast, G. (2009). Alu exonization events reveal features required for precise recognition of exons by the splicing machinery. PLoS Comput Biol 5, e1000300.

Sellar, G.C., Watt, K.P., Rabiasz, G.J., Stronach, E.A., Li, L., Miller, E.P., Massie, C.E., Miller, J., Contreras-Moreira, B., Scott, D., et al. (2003). OPCML at 11q25 is epigenetically inactivated and has tumor-suppressor function in epithelial ovarian cancer. Nat Genet 34, 337-343.

Sen, S.K., Han, K., Wang, J., Lee, J., Wang, H., Callinan, P.A., Dyer, M., Cordaux, R., Liang, P., and Batzer, M.A. (2006). Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet 79, 41-53.

Shell, B.E., Collins, J.T., Elenich, L.A., Szurek, P.F., and Dunnick, W.A. (1990). Two subfamilies of murine retrotransposon ETn sequences. Gene 86, 269-274.

Simons, C., Pheasant, M., Makunin, I.V., and Mattick, J.S. (2006). Transposon-free regions in mammalian genomes. Genome Res 16, 164-172.

Siomi, M.C., Sato, K., Pezic, D., and Aravin, A.A. (2011). PIWI-interacting small RNAs: the vanguard of genome defence. Nat Rev Mol Cell Biol 12, 246-258.

Sironi, M., Menozzi, G., Comi, G.P., Cereda, M., Cagliani, R., Bresolin, N., and Pozzoli, U. (2006). Gene function and expression level influence the insertion/fixation dynamics of distinct transposon families in mammalian introns. Genome Biol 7, R120.

Slotkin, R.K., and Martienssen, R. (2007). Transposable elements and the epigenetic regulation of the genome. Nat Rev Genet 8, 272-285.

Smit, A.F. (1996). The origin of interspersed repeats in the human genome. Curr Opin Genet Dev 6, 743-748.

Smit, A.F. (1999). Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev 9, 657-663.

Solyom, S., Ewing, A.D., Hancks, D.C., Takeshima, Y., Awano, H., Matsuo, M., and Kazazian, H.H., Jr. (2012). Pathogenic orphan transduction created by a nonreference LINE-1 retrotransposon. Hum Mutat 33, 369-371.

Sorek, R., Ast, G., and Graur, D. (2002). Alu-containing exons are alternatively spliced. Genome Res 12, 1060-1067.

Spivakov, M., and Fisher, A.G. (2007). Epigenetic signatures of stem-cell identity. Nat Rev Genet 8, 263-271.

Stoye, J.P. (2006). Koala retrovirus: a genome invasion in real time. Genome Biol 7, 241.

Stoye, J.P., and Coffin, J.M. (1988). Polymorphism of murine endogenous proviruses revealed by using virus class-specific oligonucleotide probes. J Virol 62, 168-175.

Struyk, A.F., Canoll, P.D., Wolfgang, M.J., Rosen, C.L., D'Eustachio, P., and Salzer, J.L. (1995). Cloning of neurotrimin defines a new subfamily of differentially expressed neural cell adhesion molecules. J Neurosci 15, 2141-2156.

Sun, F.L., Haynes, K., Simpson, C.L., Lee, S.D., Collins, L., Wuller, J., Eissenberg, J.C., and Elgin, S.C. (2004). cis-Acting determinants of heterochromatin formation on Drosophila melanogaster chromosome four. Mol Cell Biol 24, 8210-8220.

Symer, D.E., Connelly, C., Szak, S.T., Caputo, E.M., Cost, G.J., Parmigiani, G., and Boeke, J.D. (2002). Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338.

Taft, R.A., Davisson, M., and Wiles, M.V. (2006). Know thy mouse. Trends Genet 22, 649-653.

Tarlinton, R.E., Meers, J., and Young, P.R. (2006). Retroviral invasion of the koala genome. Nature 442, 79-81.

Telesnitsky, A., and Goff, S.P. (1997). Reverse Transcriptase and the Generation of Retroviral DNA. In Retroviruses, J.M. Coffin, S.H. Hughes, and H.E. Varmus, eds. (Cold Spring Harbor: Cold Spring Harbor Laboratory Press), pp. 121-160.

Theodorou, V., Kimm, M.A., Boer, M., Wessels, L., Theelen, W., Jonkers, J., and Hilkens, J. (2007). MMTV insertional mutagenesis identifies genes, gene families and pathways involved in mammary cancer. Nat Genet 39, 759-769.

Thornburg, B.G., Gotea, V., and Makalowski, W. (2006). Transposable elements as a significant source of transcription regulating signals. Gene 365, 104-110.

Ullu, E., and Weiner, A.M. (1985). Upstream sequences modulate the internal promoter of the human 7SL RNA gene. Nature 318, 371-374.

Ustyugova, S.V., Lebedev, Y.B., and Sverdlov, E.D. (2006). Long L1 insertions in human gene introns specifically reduce the content of corresponding primary transcripts. Genetica 128, 261-272.

van de Lagemaat, L.N., Landry, J.R., Mager, D.L., and Medstrand, P. (2003). Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet 19, 530-536.

van de Lagemaat, L.N., Medstrand, P., and Mager, D.L. (2006). Multiple effects govern endogenous retrovirus survival patterns in human gene introns. Genome Biol 7, R86.

Vernochet, C., Heidmann, O., Dupressoir, A., Cornelis, G., Dessen, P., Catzeflis, F., and Heidmann, T. (2011). A syncytin-like endogenous retrovirus envelope gene of the guinea pig specifically expressed in the placenta junctional zone and conserved in Caviomorpha. Placenta 32, 885-892.

Wade, C.M., and Daly, M.J. (2005). Genetic variation in laboratory mice. Nat Genet 37, 1175-1180.

Wang, J., Song, L., Grover, D., Azrak, S., Batzer, M.A., and Liang, P. (2006). dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat 27, 323-329.

Wang, T., Zeng, J., Lowe, C.B., Sellers, R.G., Salama, S.R., Yang, M., Burgess, S.M., Brachmann, R.K., and Haussler, D. (2007). Species-specific endogenous retroviruses shape the transcriptional network of the human tumor suppressor protein p53. Proc Natl Acad Sci U S A 104, 18613-18618.

Wang, X.Y., Steelman, L.S., and McCubrey, J.A. (1997). Abnormal activation of cytokine gene expression by intracisternal type A particle transposition: effects of mutations that result in autocrine growth stimulation and malignant transformation. Cytokines Cell Mol Ther 3, 3-19.

Watanabe, T., Takeda, A., Tsukiyama, T., Mise, K., Okuno, T., Sasaki, H., Minami, N., and Imai, H. (2006). Identification and characterization of two novel classes of small RNAs in the mouse germline: retrotransposon-derived siRNAs in oocytes and germline small RNAs in testes. Genes Dev 20, 1732-1743.

Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562.

Wu, X., and Burgess, S.M. (2004). Integration target site selection for retroviruses and transposable elements. Cell Mol Life Sci 61, 2588-2596.

Wu, X., Li, Y., Crise, B., and Burgess, S.M. (2003). Transcription start regions in the human genome are favored targets for MLV integration. Science 300, 1749-1751.

Yalcin, B., Wong, K., Agam, A., Goodson, M., Keane, T.M., Gan, X., Nellaker, C., Goodstadt, L., Nicod, J., Bhomra, A., et al. (2011). Sequence-based characterization of structural variation in the mouse genome. Nature 477, 326-329.

Yamada, M., Hashimoto, T., Hayashi, N., Higuchi, M., Murakami, A., Nakashima, T., Maekawa, S., and Miyata, S. (2007). Synaptic adhesion molecule OBCAM; synaptogenesis and dynamic internalization. Brain Res 1165, 5-14.

Yang, H., Bell, T.A., Churchill, G.A., and Pardo-Manuel de Villena, F. (2007). On the subspecific origin of the laboratory mouse. Nat Genet 39, 1100-1107.

Yang, N., and Kazazian, H.H., Jr. (2006). L1 retrotransposition is suppressed by endogenously encoded small interfering RNAs in human cultured cells. Nat Struct Mol Biol 13, 763-771.

Yates, P.A., Burman, R.W., Mummaneni, P., Krussel, S., and Turker, M.S. (1999). Tandem B1 elements located in a mouse methylation center provide a target for de novo DNA methylation. J Biol Chem 274, 36357-36361.

Yoder, J.A., Walsh, C.P., and Bestor, T.H. (1997). Cytosine methylation and the ecology of intragenomic parasites. Trends Genet 13, 335-340.

Yohn, C.T., Jiang, Z., McGrath, S.D., Hayden, K.E., Khaitovich, P., Johnson, M.E., Eichler, M.Y., McPherson, J.D., Zhao, S., Paabo, S., et al. (2005). Lineage-specific expansions of retroviral insertions within the genomes of African great apes but not humans and orangutans. PLoS Biol 3, e110.

Yoshida, K., Nakamura, A., Yazaki, M., Ikeda, S., and Takeda, S. (1998). Insertional mutation by transposable element, L1, in the DMD gene results in X-linked dilated cardiomyopathy. Hum Mol Genet 7, 1129-1132.

Zapala, M.A., Hovatta, I., Ellison, J.A., Wodicka, L., Del Rio, J.A., Tennant, R., Tynan, W., Broide, R.S., Helton, R., Stoveken, B.S., et al. (2005). Adult mouse brain gene expression patterns bear an embryologic imprint. Proc Natl Acad Sci U S A 102, 10357-10362.

Zhang, Y., Maksakova, I.A., Gagnier, L., van de Lagemaat, L.N., and Mager, D.L. (2008). Genome-wide assessments reveal extremely high levels of polymorphism of two active families of mouse endogenous retroviral elements. PLoS Genet 4, e1000007.

Appendices

Appendix A Supporting information of Chapter 2

A.1 Supplementary figures

Figure A.1 Mouse trace sequence archive composition (May 2007).

Numbers of sequence traces produced by whole genome shotgun methods are shown in yellow, traces produced by a re-sequencing chip technology in blue, and traces produced by other methods in green.

Figure A.2 Design of probes.

Type-1 probes include the full-length LTR and a small 23 bp fragment of the internal 5′ or 3′ ERV sequence. Type-2 probes consist only of the full LTR. Type-3 probes cover only the first/last 60 bp of the 5′/3′ LTR.
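
The three probe layouts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `design_probes`, its parameters, and the simplification that the 5′ and 3′ LTRs of an element are identical are all hypothetical and are not taken from the thesis. With a 317 bp LTR and a 23 bp internal fragment, the Type-1 probes come out at 340 bp, which agrees with the ETn/MusD lengths listed in Table A.1.

```python
def design_probes(ltr, internal_5p, internal_3p, flank=23, end=60):
    """Build Type-1, Type-2 and Type-3 probe sequences for one ERV.

    ltr          -- full LTR sequence (assumed identical at both ends; an
                    assumption made for this sketch only)
    internal_5p  -- internal ERV sequence immediately after the 5' LTR
    internal_3p  -- internal ERV sequence immediately before the 3' LTR
    flank        -- internal bases added to Type-1 probes (23 bp in the caption)
    end          -- LTR bases kept in Type-3 probes (60 bp in the caption)
    """
    if flank <= 0 or end <= 0:
        raise ValueError("flank and end must be positive")
    return {
        # Type-1: full LTR plus a short stretch of adjacent internal sequence
        "type1_5p": ltr + internal_5p[:flank],
        "type1_3p": internal_3p[-flank:] + ltr,
        # Type-2: the full LTR only
        "type2": ltr,
        # Type-3: first bases of the 5' LTR and last bases of the 3' LTR
        "type3_U3": ltr[:end],
        "type3_U5": ltr[-end:],
    }
```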

Figure A.3 Estimation of fractional loss of detection of ERVs using sequence traces.

The graph shows the fraction of ERVs present in the assembled B6 genome that were found using raw sequence traces from B6. Varying numbers of whole genome shotgun traces from B6 were mapped back to the assembled genome to detect ETn/MusD sequences as described in Materials and Methods. Arrows show the fraction of insertions found with an equivalent number of WGS traces as that available for each test strain. Standard deviations in percentages are plotted but are too small to see, the largest being 1% for the sample size of four million traces.
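
The subsampling idea behind this figure can be illustrated with a small Python sketch. It is not the pipeline used in the thesis: it assumes the trace-to-genome mapping has already been done and summarised as a dictionary (`trace_hits`, a hypothetical name) from each trace ID to the set of known insertions that the trace detects, with an empty set for traces that detect none. The function then draws random subsets of a given size and reports the mean and standard deviation of the fraction of known insertions recovered.

```python
import random
import statistics


def detection_fraction(trace_hits, known_insertions, n_traces,
                       n_repeats=5, seed=0):
    """Estimate the fraction of known insertions found with n_traces traces.

    trace_hits       -- dict: trace ID -> set of insertion IDs it detects
                        (hypothetical pre-computed mapping result)
    known_insertions -- set of insertion IDs present in the assembled genome
    Returns (mean fraction, population standard deviation) over n_repeats
    random subsamples drawn without replacement.
    """
    rng = random.Random(seed)
    trace_ids = list(trace_hits)
    fractions = []
    for _ in range(n_repeats):
        sample = rng.sample(trace_ids, n_traces)
        found = set()
        for trace_id in sample:
            found.update(trace_hits[trace_id])
        fractions.append(len(found & known_insertions) / len(known_insertions))
    return statistics.mean(fractions), statistics.pstdev(fractions)
```

Calling the function for a series of increasing `n_traces` values gives the kind of saturation curve shown in the figure; the fraction missed at a test strain's trace count is then one minus the mean at that point.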

Figure A.4 Splice sites in ETn elements detected in chimeric transcripts.

An alignment of an ETnI element (chr2:110,262,865-110,268,371 of the mm8 version of the B6 genome) and an ETnII element is shown. The 5′ LTR and part of the 3′ LTR are shown with some interior sequence. Filled arrows indicate SA sites identified in cases of ETn mutations in the literature and also found here. Open arrows show SA sites newly identified in this study. The open triangle shows an SD site in ETnI newly identified here. "PolyA site" marks a polyadenylation site used in some published cases. The sequence of the Dnajc10 ETnII insertion differs from that of the ETnII shown: two C-to-A mutations produce new SA sites, one suspected (marked with a "?") and the other confirmed by sequencing. An ETnII element harboring these and one other mutation present in the Dnajc10 insertion is present in the B6 genome at chr13:23,177,615-23,184,720. Locations of primers used are shown with arrowed lines.

Note: Figure made by Irina A. Maksakova.

Figure A.5 Comparative microarray expression data of Dnajc10 in different strains.

Graphs of relative transcript levels are screenshots of datasets available on the GeneNetwork site (www.GeneNetwork.org). The line at the top represents the intron/exon structure of the gene, with the location of the ETn insertion in A/J and the position of the microarray probe indicated. A) Dataset from The Hippocampus Consortium M430v2 (Dec. 2005) RMA (Robust Multi-array Average) series. Mouse strains are 129S1/SvImJ, A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CAST/Ei, DBA/2J, KK/HlJ, LG/J, NOD/LtJ, NZO/HlLtJ, PWD/PhJ, PWK/PhJ, WSB/EiJ, B6D2F1, D2B6F1. B) Dataset from the Hamilton Eye Institute mouse eye M430v2 (Nov. 2005) RMA series. Mouse strains are 129S1/SvImJ, A/J, BALB/cByJ, C3H/HeJ, C57BL/6J, CAST/EiJ, DBA/2J, KK/HlJ, LG/J, NOD/LtJ, NZO/HlLtJ, PWD/PhJ, PWK/PhJ, WSB/EiJ, B6D2F1, D2B6F1. C) Dataset from the Univ. of Colorado at Denver and Health Sciences Center whole brain M430v2 (Nov06) RMA series. Mouse strains are 129P3/J, 129S1/SvImJ, A/J, AKR/J, BALB/cByJ, BALB/cJ, C3H/HeJ, C57BL/6J, C58/J, CAST/EiJ, CBA/J, DBA/2J, FVB/NJ, KK/HlJ, MOLF/EiJ, NOD/LtJ, NZW/LacJ, PWD/PhJ, SJL/J. D) Dataset from the GE-NIAAA (National Institute on Alcohol Abuse and Alcoholism) cerebellum Affymetrix M430v2 (May05) PDNN (Position-Dependent Nearest Neighbor) series. Mouse strains are 129S1/SvImJ, A/J, AKR/J, BALB/cByJ, BALB/cJ, C3H/HeJ, C57BL/6J, CAST/EiJ, DBA/2J, KK/HlJ, LG/J, NOD/LtJ. Values for A/J in each graph are marked with a star and fall significantly below the normal distribution displayed by all other strains.

Figure A.6 Comparative microarray expression data of Opcml in different strains.

Graphs of relative transcript levels are screenshots of datasets available on the GeneNetwork site (www.GeneNetwork.org). The line at the top represents the intron/exon structure of the gene, with the location of the ETn insertion in A/J and the position of the microarray probe indicated. A) Dataset from The Hippocampus Consortium M430v2 (Dec. 2005) RMA (Robust Multi-array Average) series. B) Dataset from the Hamilton Eye Institute mouse eye M430v2 (Nov. 2005) RMA series. C) Dataset from the Univ. of Colorado at Denver and Health Sciences Center whole brain M430v2 (Nov06) RMA series. D) Dataset from the GE-NIAAA (National Institute on Alcohol Abuse and Alcoholism) cerebellum Affymetrix M430v2 (May05) PDNN (Position-Dependent Nearest Neighbor) series. Mouse strains for all datasets are listed in the legend to Figure A.5. Values for A/J in each graph are marked with a star and fall significantly below the normal distribution displayed by all other strains.

Figure A.7 Semi-quantitative RT-PCR of Opcml in A/J versus B6.

A) Semi-quantitative RT-PCR. Opcml cDNA was amplified with primers located upstream and downstream of the ETn insertion. Opcml and Gapdh fragments were amplified from undiluted cDNA and from dilutions of 1/20, 1/40 and 1/80. B) Graphical representation of the RT-PCR. For each dilution, the intensity of the resulting band was quantified and graphed as the transcript level of Opcml relative to Gapdh. The results of one of two representative experiments are shown.

Note: Irina A. Maksakova performed the experiments and made this figure.
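
The normalisation in panel B amounts to dividing the Opcml band intensity by the Gapdh band intensity at each cDNA dilution. The following Python sketch shows that arithmetic only. The function name and the dictionary layout are hypothetical, and the sketch makes no assumption about how band intensities were measured.

```python
def relative_transcript_levels(target_intensity, reference_intensity):
    """Return target/reference band-intensity ratios keyed by cDNA dilution.

    Both arguments are dicts mapping a dilution label (for example "1/20")
    to a quantified band intensity; the labels are illustrative only.
    """
    levels = {}
    for dilution, target in target_intensity.items():
        reference = reference_intensity[dilution]
        if reference <= 0:
            raise ValueError(
                f"non-positive reference intensity at dilution {dilution}"
            )
        levels[dilution] = target / reference
    return levels
```

Computing the ratios separately for each strain and dilution gives the values that would be plotted side by side for A/J and B6.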

Figure A.8 UCSC Genome Browser screen shot of the PolyERV track.

The polymorphic IAP and ETn/MusD insertions in the mouse are annotated in red and green, respectively. An ERV insertion displayed as a horizontal bar is present in the B6 reference genome, and its orientation is shown by arrows inside the bar. An ERV insertion displayed as a vertical line is not present in B6. More detailed information, such as the genomic coordinates and the presence or absence of the polymorphic ERV in different inbred strains, can be found by clicking on the ERV ID.

A.2 Supplementary tables

Table A.1 Details of ERV probes

Target ERV  Probe Type  Probe Name  Probe Structure         Length  Template ERV(a)
ETn/MusD    Type-1      probe1_5p   5′ LTR + ERV internal   340 bp  Y17106 (ETn II)
ETn/MusD    Type-1      probe1_3p   ERV internal + 3′ LTR   340 bp  Y17106 (ETn II)
ETn/MusD    Type-2      probe2      full LTR                317 bp  Y17106 (ETn II)
ETn/MusD    Type-3      Probe3_U3   5′-end of the LTR       60 bp   Y17106 (ETn II)
ETn/MusD    Type-3      Probe3_U5A  3′-end of the LTR       60 bp   Y17106 (ETn II)
ETn/MusD    Type-3      Probe3_U5B  3′-end of the LTR       60 bp   AC068908 (ETn I)
IAP         Type-1      probe1_5p   5′ LTR + ERV internal   376 bp  EU183301 (IΔ1)
IAP         Type-1      probe1_3p   ERV internal + 3′ LTR   376 bp  EU183301 (IΔ1)
IAP         Type-2      probe2      full LTR                317 bp  EU183301 (IΔ1)
IAP         Type-3      Probe3_U3   5′-end of the LTR       60 bp   EU183301 (IΔ1)
IAP         Type-3      Probe3_U5   3′-end of the LTR       60 bp   EU183301 (IΔ1)

(a) Accession numbers of ERVs used for probe design.

172 Table A.2 Polymorphic IAP cases in genes insertion sitea gene orientb location intron_size B6 AJ DBA 129X1 chr1:10631189 Cpa6 S intron 1 123915 Nc Yc N Y chr1:107135130 Rnf152 S intron 2 70085 N ? Y ? chr1:112765156 Cdh19 S intron 3 16857 Y N Y Y chr1:117712805 Cntnap5a A intron 1 215710 Y N ? N chr1:11872330 A830018L16Rik A intron 10 103651 Y N ? N chr1:11932192 A830018L16Rik A intron 11 17370 N ? Y ? chr1:128251571 E030049G20Rik A intron 2 66267 Y N N N chr1:129385804 2610024A01Rik S intron 2 115845 Y N N N chr1:129702627 Rab3gap1 A intron 3 14999 Y N N Y chr1:13141941 Ncoa2 A intron 15 3624 N Y N Y chr1:133540602 5430435G22Rik A exon 6 1846 N N Y Y chr1:140332028 Nek7 S intron 7 12958 N Y ? ? chr1:141088753 Crb1 A intron 5 65494 N Y N N chr1:141541180 Cfhrc A intron 13 29547 Y N N Y chr1:148306255 B830045N13Rik A intron 2 167621 N Y ? ? chr1:148568011 B830045N13Rik A intron 6 79568 Y Y N N chr1:154325012 Rgl1 A intron 4 14009 N N Y N chr1:154998354 Lamc1 A intron 18 4600 N N ? Y chr1:159356956 Lztr2 S intron 1 17923 N N N Y chr1:162444080 Rabgap1l A intron 13 155102 N ? ? Y chr1:162950622 Klhl20 A intron 2 20592 N N Y N chr1:164677213 Fmo1 A intron 4 10300 Y N N N chr1:166636721 Dpt S intron 1 11868 N N Y N chr1:16687299 Ly96 A intron 3 5248 Y N N Y chr1:171886435 Ddr2 A intron 2 52772 N Y N N chr1:172255709 Nos1ap A intron 2 123790 N Y N Y chr1:172644970 Atf6 A intron 9 15676 N Y N Y chr1:175833708 Ifi202b S intron 1 48881 N Y N N chr1:176449911 Fmn2 A intron 7 34453 N ? N Y chr1:177104097 Rgs7 A intron 3 81588 N Y Y Y chr1:181190740 Smyd3 A intron 5 310879 N Y N N chr1:18238083 Defb41 S intron 1 13691 Y Y Y N chr1:187004287 Iars2 A intron 12 12491 Y Y Y N chr1:20147389 Pkhd1 S intron 60 80069 N Y ? ? chr1:25364967 Bai3 A intron 15 24077 N N Y ? chr1:25694047 Bai3 A intron 2 266137 N N ? Y chr1:36489068 Ankrd39 A intron 1 4219 N ? ? Y chr1:37197114 Cnga3 A intron 3 8943 N Y N N chr1:38226945 Aff3 A intron 10 50349 N N Y ? 
chr1:40642830 Slc9a2 S intron 1 36428 Y N Y N chr1:44606648 Gulp1 A intron 1 137270 N Y Y N chr1:4930959 Rgs20 S intron 1 95320 N Y ? ? chr1:58875297 Trak2 A intron 2 9017 Y Y Y N chr1:79679405 Serpine2 A intron 6 1900 N N Y N chr1:8648290 Sntg1 S intron 10 52924 Y N Y Y chr1:8942202 Sntg1 S intron 3 116970 N N Y Y chr1:9128706 Sntg1 A intron 2 203928 N Y N N

173 insertion sitea gene orientb location intron_size B6 AJ DBA 129X1 chr2:102508928 Slc1a2 A intron 1 76641 N N N Y chr2:104809033 Ga17 A intron 5 6010 N Y Y ? chr2:106367986 C130023O10Rik A intron 2 335222 Y Y Y N chr2:110408677 Slc5a12 A intron 2 3625 N Y Y Y chr2:112349790 Aven A intron 2 110526 Y N N N chr2:113610781 Scg5 S intron 2 35308 Y N N Y chr2:119908942 Pla2g4e S intron 1 44399 Y N N N chr2:120505366 Ttbk2 A intron 3 15565 Y N N N chr2:122147455 Slc28a2 S intron 15 563 N Y Y N chr2:122381289 Slc30a4 A intron 2 4841 N Y Y Y chr2:122487190 Sqrdl A intron 4 5633 Y N N N chr2:125741346 Fgf7 A intron 2 52233 N Y N ? chr2:127443107 Nphp1 A intron 16 5859 N N Y N chr2:131622077 Prnp A intron 2 24126 Y Y Y N chr2:134675433 Plcb1 S intron 2 174444 N N Y N chr2:140069047 Sel1l2 A intron 1 49416 N N Y N chr2:141440826 LOC433479 S intron 2 85934 N N Y N chr2:142050494 LOC433479 A intron 11 55722 N ? Y ? chr2:143544044 Bfsp1 S intron 2 9032 N N Y Y chr2:144319334 Dtd1 A intron 3 18329 Y Y N N chr2:144360438 Dtd1 A intron 4 111035 Y Y N N chr2:144442331 Dtd1 S intron 5 21136 Y Y N N chr2:145048177 Slc24a3 A intron 2 203415 N N Y Y chr2:145316680 Slc24a3 A intron 15 17425 Y Y N ? chr2:154045704 Cdk5rap1 A intron 6 6345 Y N N N chr2:155549661 2410003P15Rik A intron 7 28918 N N ? Y chr2:155902424 Phf20 S intron 1 26536 Y Y Y N chr2:161660353 Ptprt A intron 7 207432 N N N Y chr2:164053003 Slpi S intron 1 13248 N N N Y chr2:164091225 Matn4 A intron 3 672 N Y N ? chr2:165366393 Eya2 A intron 2 17949 Y Y N ? chr2:166361281 Setd6 S intron 1 66468 N N N Y chr2:166812758 Kcnb1 A intron 1 81696 N N N Y chr2:179589378 Cdh4 A intron 2 335441 N N Y Y chr2:20231325 BC026657 S intron 1 49879 N Y Y N chr2:26155230 Gpsm1 A intron 9 12219 Y ? N N chr2:26820307 Adamts13 S intron 23 7828 Y Y Y N chr2:38777659 Olfml2a S intron 6 2896 N ? Y ? chr2:41486698 Lrp1b S intron 7 159272 Y N ? ? chr2:41566379 Lrp1b A intron 5 42033 Y N ? 
N chr2:54323971 Galnt13 A intron 2 38722 N Y Y Y chr2:54744114 Galnt13 A intron 8 53215 N Y Y Y chr2:5601298 Camk1d A intron 1 148364 Y Y N N chr2:62018942 Slc4a10 A intron 4 17072 Y Y Y N chr2:68091288 Stk39 S intron 14 39948 N Y Y N chr2:69001191 Spc25 A intron 3 2481 N Y ? Y chr2:69049150 Abcb11 A intron 25 2761 N Y N Y chr2:71059935 Dync1i2 A intron 14 2596 Y N Y Y chr2:79157475 Cerkl A intron 4 6258 N Y Y N

174 insertion sitea gene orientb location intron_size B6 AJ DBA 129X1 chr2:79900254 Pde1a S intron 1 200944 N Y Y N chr2:90336025 Ptprj A intron 1 105165 N Y Y N chr2:91185450 1110051M20Rik S intron 2 38005 N Y Y N chr3:100216189 Spag17 A intron 34 2856 N N Y N chr3:103057526 Sycp1 S intron 8 3988 N N N Y chr3:115901132 Dph5 S intron 4 15235 Y Y Y N chr3:116752419 Agl A intron 25 9868 Y N N N chr3:121317947 Alg14 A intron 3 41343 N ? Y ? chr3:122158252 Abca4 A intron 43 2752 N N Y N chr3:126427417 Arsj S intron 1 72832 N N Y N chr3:131247008 Hadh A intron 1 21943 N N Y ? chr3:135782671 Slc39a8 A intron 1 22324 N N Y ? chr3:136055611 Bank1 A intron 7 109900 Y Y Y N chr3:136111166 Bank1 A intron 7 109900 Y N N N chr3:139595022 B930007M17Rik A intron 12 178647 N Y N Y chr3:141594386 Unc5c A intron 1 212148 N N Y N chr3:141868130 Bmpr1b A intron 1 101403 N N Y N chr3:152097382 Gipc2 S intron 1 27888 N N ? Y chr3:152826279 St6galnac5 A intron 2 134386 N Y ? ? chr3:154890028 Tnni3k A intron 11 15244 N Y Y N chr3:156734836 Negr1 A intron 1 297083 Y N ? N chr3:156960231 Negr1 A intron 3 52857 N Y N N chr3:28253814 Pld1 A intron 12 14620 Y N Y N chr3:30861316 4930558O21Rik S intron 7 12042 N Y ? ? chr3:51327897 Ccrn4l S intron 1 22516 Y Y Y N chr3:62486655 4631416L12Rik A intron 5 38556 N ? Y ? chr3:67115872 Rsrc1 S intron 3 85944 Y Y N Y chr3:67237206 Rsrc1 A intron 4 98118 N N Y ? chr3:75705022 Serpini1 S intron 4 2355 N N Y N chr3:76519100 Fstl5 S intron 6 52361 N Y ? Y chr3:76531573 Fstl5 A intron 6 52361 N N Y N chr3:76693308 Fstl5 A intron 10 26901 N Y N N chr3:84192475 Mnd1 S intron 5 11376 N N Y N chr3:84211092 Mnd1 S intron 4 17515 N Y Y Y chr3:84228309 Mnd1 S intron 1 13896 Y N N ? chr3:85708209 Pet112l A intron 9 17909 Y N N N chr3:86996186 Dcamkl2 A intron 1 13532 N Y ? ? 
chr4:100621986 Raver2 A intron 5 10683 Y N Y N chr4:102373594 Sgip1 A intron 7 29800 Y N Y Y chr4:106662737 2210012G02Rik S intron 3 17838 N Y N Y chr4:106671735 2210012G02Rik S intron 2 11874 N N Y N chr4:111069681 4931433A01Rik S intron 9 48229 Y N ? N chr4:120433792 Zfp69 S intron 2 12094 Y ? N ? chr4:123292955 Rhbdl2 S intron 1 21941 N N N Y chr4:133452899 Ccdc21 S intron 1 14061 N Y N N chr4:140076408 Padi3 A intron 1 6867 N ? Y ? chr4:140184613 Padi2 S intron 1 11022 N N Y N chr4:140211091 Padi2 A intron 12 6376 Y Y N N chr4:146246790 LOC195531 S intron 1 334847 N ? ? Y

chr4:146251747  LOC195531  S  intron 1  334847  N  Y  ?  N
chr4:146280727  LOC195531  S  intron 1  334847  N  ?  ?  Y
chr4:146329110  LOC195531  A  intron 1  334847  N  N  Y  N
chr4:146677302  MGC67181  S  intron 1  19480  Y  Y  ?  N
chr4:151368737  Nphp4  A  intron 11  11540  N  N  N  Y
chr4:17877006  Mmp16  A  intron 1  133718  N  Y  N  Y
chr4:20396970  E130310K16Rik  A  intron 3  186380  N  Y  Y  Y
chr4:37843384  Olfr157  A  intron 7  535229  N  N  N  Y
chr4:38148828  Olfr157  S  intron 8  144288  N  ?  ?  Y
chr4:38616468  Olfr157  A  intron 10  272908  Y  N  Y  N
chr4:39058766  Olfr157  S  intron 11  504175  Y  N  Y  N
chr4:40141430  Olfr157  A  intron 16  797293  Y  N  Y  N
chr4:40630644  Olfr157  S  intron 17  272631  N  ?  Y  ?
chr4:40922556  Dnaja1  A  exon 9  2238  N  Y  N  N
chr4:41643406  Kif24  A  intron 1  35726  Y  N  N  Y
chr4:43318824  4930417M19Rik  A  intron 3  10849  Y  N  N  Y
chr4:43459920  Olfr157  A  intron 20  794532  N  ?  Y  ?
chr4:46532187  Trim14  A  intron 4  11451  Y  N  N  ?
chr4:46936870  Gabbr2  A  intron 1  115304  N  N  Y  Y
chr4:57976790  Akap2  A  intron 1  25985  N  N  N  Y
chr4:62059985  Rgs3  S  intron 2  15221  Y  N  N  N
chr4:86274553  Dennd4c  A  intron 13  2316  N  Y  ?  Y
chr4:87586931  BC057079  A  intron 1  26593  N  N  Y  N
chr4:94305261  Tek  A  intron 7  8666  Y  N  Y  ?
chr4:9450826  Asph  A  intron 17  12150  N  Y  N  ?
chr4:99310397  Itgb3bp  A  intron 1  14981  Y  Y  ?  N
chr5:100291561  Enoph1  S  intron 1  18836  Y  N  ?  Y
chr5:104196736  Hsd17b13  A  intron 6  9487  N  N  N  Y
chr5:104737400  Pkd2  A  intron 10  3448  N  Y  Y  N
chr5:105659638  Lrrc8b  A  intron 1  52883  Y  N  N  Y
chr5:105964456  Lrrc8d  A  intron 1  31143  Y  N  ?  Y
chr5:110435196  Golga3  A  intron 9  8073  Y  N  N  N
chr5:115217895  Tcf1  A  intron 2  4012  N  N  Y  Y
chr5:116255931  Cit  A  intron 42  1051  N  N  Y  Y
chr5:117755717  Ksr2  A  intron 1  85703  N  Y  ?  ?
chr5:118371007  Fbxw8  A  intron 5  17881  N  Y  N  N
chr5:122266574  Cutl2  A  intron 2  118853  Y  N  Y  N
chr5:124578173  Mphosph9  A  intron 3  4947  N  Y  N  Y
chr5:12465510  Sema3d  A  intron 2  15035  N  N  ?  Y
chr5:12523097  Sema3d  A  intron 7  16029  Y  Y  Y  N
chr5:127958002  Glt1d1  A  intron 7  12495  N  Y  ?  N
chr5:127975659  Glt1d1  A  intron 9  11695  N  Y  N  N
chr5:14208932  Sema3e  S  intron 4  46111  N  Y  ?  ?
chr5:143269838  2810055G22Rik  A  intron 15  22051  Y  Y  N  N
chr5:144543260  Baiap2l1  S  intron 3  32616  N  Y  ?  ?
chr5:144561331  Baiap2l1  A  intron 3  32616  N  N  N  Y
chr5:144803458  Nptx2  A  intron 2  5063  N  Y  Y  ?
chr5:147070481  Usp12  A  intron 3  8600  N  Y  N  ?
chr5:148307541  C130038G02Rik  A  intron 1  118143  N  ?  N  Y

chr5:151339140  Stard13  S  intron 4  9643  N  N  N  Y
chr5:15796736  Cacna2d1  A  intron 10  32452  Y  Y  ?  N
chr5:17097612  Sema3c  A  intron 2  76736  N  N  Y  N
chr5:20747242  A530088I07Rik  A  intron 12  13557  Y  Y  ?  N
chr5:21148060  Fbxl13  S  intron 1  12598  N  ?  ?  Y
chr5:21168567  Armc10  A  intron 4  6999  N  Y  Y  N
chr5:21712860  Reln  A  intron 3  56243  Y  Y  N  ?
chr5:22638613  Lhfpl3  S  intron 2  119215  N  N  Y  Y
chr5:22696864  Lhfpl3  S  intron 2  119215  Y  Y  N  N
chr5:23081808  Srpk2  S  intron 1  51060  Y  Y  ?  N
chr5:34076523  Letm1  A  intron 6  8110  Y  Y  N  ?
chr5:3413005  Cdk6  A  intron 2  38173  N  Y  ?  ?
chr5:38418593  Stx18  A  intron 9  7107  Y  N  Y  Y
chr5:38693940  Slc2a9  A  intron 6  44714  Y  N  Y  Y
chr5:44117077  Bst1  A  intron 6  11103  Y  Y  N  Y
chr5:48958008  Kcnip4  A  intron 1  406709  Y  N  N  Y
chr5:49108457  Kcnip4  S  intron 1  406709  N  Y  Y  N
chr5:64548798  Tbc1d1  A  intron 6  6630  N  Y  N  ?
chr5:67123782  3732412D22Rik  A  intron 1  111876  N  Y  Y  N
chr5:72348808  Gabrb1  A  intron 5  78397  Y  Y  N  Y
chr5:73814057  Dcun1d4  A  intron 7  9348  N  Y  N  N
chr5:73831442  Dcun1d4  A  intron 8  11202  N  N  Y  N
chr5:87392834  Tmprss11d  S  intron 7  16883  Y  Y  N  Y
chr5:90905921  Adamts3  A  intron 3  85856  Y  Y  N  N
chr5:93689748  4932413O14Rik  A  intron 22  10764  Y  Y  Y  N
chr5:96725801  Fras1  A  intron 2  142399  Y  Y  ?  N
chr6:104611850  Cntn6  A  intron 3  7365  N  N  N  Y
chr6:105663923  Cntn4  S  intron 1  162525  Y  Y  Y  N
chr6:108767487  Arl8b  A  intron 1  30248  N  N  Y  N
chr6:111326901  Grm7  A  intron 8  136571  N  Y  Y  Y
chr6:120093223  Ninj2  A  intron 1  104443  Y  Y  Y  N
chr6:120777881  Atp6v1e1  A  intron 2  9861  Y  N  Y  ?
chr6:122007148  Mug2  A  intron 12  6664  N  ?  ?  Y
chr6:128789794  Klrb1b  S  intron 1  5117  N  Y  Y  Y
chr6:136801806  BC049715  A  intron 2  8742  N  Y  Y  ?
chr6:137520031  Eps8  A  intron 2  52015  N  ?  ?  Y
chr6:138416793  Lmo3  A  intron 2  72279  Y  N  ?  Y
chr6:139788455  Pik3c2g  A  intron 18  71383  Y  N  ?  ?
chr6:149027741  D030011O10Rik  S  intron 2  12088  N  N  Y  N
chr6:149391024  Bicd1  A  intron 1  74341  N  ?  Y  ?
chr6:17074195  EG232599  A  intron 1  84126  N  N  N  Y
chr6:18439758  Cttnbp2  S  intron 2  53478  N  Y  N  N
chr6:45030853  Cntnap2  S  intron 1  593602  N  N  Y  Y
chr6:46608110  Cntnap2  A  intron 13  275609  Y  N  Y  Y
chr6:54884614  Nod1  A  intron 1  22514  N  Y  Y  Y
chr6:5796637  Dync1i1  A  intron 6  104213  Y  Y  N  Y
chr6:63704980  Grid2  A  intron 2  405417  N  N  Y  N
chr6:71884936  Rpo1-4  A  intron 20  6677  Y  Y  N  Y
chr6:72534267  0610039N19Rik  S  intron 5  1013  N  N  Y  N

chr6:92170731  Zfyve20  A  intron 3  4373  N  Y  Y  Y
chr6:93756060  Magi1  A  intron 3  6699  N  Y  N  Y
chr6:95653698  Suclg2  A  intron 1  63088  N  N  Y  N
chr6:96899757  C130034I18Rik  A  intron 3  150455  N  N  Y  N
chr7:112221386  Parva  A  intron 1  116075  Y  N  N  N
chr7:113733465  Spon1  S  intron 6  88865  Y  Y  N  Y
chr7:118356014  Tmc7  A  intron 2  1818  N  N  Y  Y
chr7:126557998  Gdpd3  S  intron 1  291  N  N  N  Y
chr7:132882066  Ctbp2  A  intron 2  55883  N  Y  Y  Y
chr7:134162112  4933400E14Rik  A  intron 2  50310  N  Y  N  Y
chr7:134799549  Dock1  A  intron 27  114205  N  ?  Y  N
chr7:134852230  Dock1  A  intron 29  77815  N  N  Y  ?
chr7:16258225  Psg16  A  intron 5  32340  N  N  N  Y
chr7:18698208  BC024868  A  intron 1  24342  N  N  N  Y
chr7:19100806  2210010C17Rik  A  intron 2  2683  N  N  N  Y
chr7:23016479  Nalp4e  S  intron 1  18807  N  Y  ?  ?
chr7:24150314  Irgc1  S  intron 1  9545  N  Y  N  N
chr7:29792842  Zfp566  S  intron 2  5881  N  Y  Y  N
chr7:34989957  Gpatc1  A  intron 19  4519  N  Y  ?  N
chr7:3879939  Pira6  A  intron 5  4553  Y  N  N  N
chr7:43438369  BC043301  A  intron 3  4844  N  N  ?  Y
chr7:43813698  1700127D06Rik  S  intron 2  371  N  N  N  Y
chr7:4452434  Isoc2b  A  intron 5  4083  N  Y  Y  N
chr7:45528137  Fut2  A  intron 1  14848  N  N  Y  Y
chr7:46092624  Ush1c  A  intron 8  1934  N  N  N  Y
chr7:46380203  Sergef  A  intron 10  72131  Y  Y  Y  N
chr7:50100373  Nell1  A  intron 9  19381  Y  N  Y  N
chr7:50668010  Nell1  A  intron 15  125007  N  N  N  Y
chr7:51851912  Gas2  A  intron 7  39988  N  Y  N  N
chr7:54807033  Luzp2  A  intron 1  216731  Y  Y  N  N
chr7:55814787  Nipa2  A  intron 1  16365  Y  ?  N  ?
chr7:56278670  p  A  intron 19  56499  N  N  N  Y
chr7:56898887  Gabrg3  A  intron 3  338483  N  N  N  Y
chr7:56983305  Gabrg3  A  intron 3  338483  Y  ?  N  ?
chr7:57007475  Gabrg3  S  intron 3  338483  Y  N  Y  N
chr7:57026889  Gabrg3  S  intron 3  338483  N  Y  N  N
chr7:57088731  Gabrg3  A  intron 3  338483  Y  Y  Y  N
chr7:57280026  Gabra5  A  intron 9  4675  N  Y  N  N
chr7:65527529  Tarsl2  A  intron 3  4590  N  N  Y  ?
chr7:66294972  Aldh1a3  A  intron 3  6071  N  Y  N  ?
chr7:71980533  Gm489  A  intron 19  18443  N  N  Y  ?
chr7:75733740  Klhl25  A  intron 1  16852  N  Y  Y  N
chr7:76290339  EG244071  A  intron 3  3941  N  Y  N  N
chr7:76310437  EG244071  A  intron 10  11341  N  Y  N  ?
chr7:76423064  EG244071  A  intron 11  144374  N  N  N  Y
chr7:78957337  Agc1  A  intron 2  2201  N  N  N  Y
chr7:79210579  Abhd2  A  intron 5  22657  Y  N  ?  Y
chr7:80808851  Sec11l1  A  intron 1  12264  Y  N  Y  N
chr7:81468649  Whdc1  A  intron 10  1765  N  N  N  Y

chr7:82613322  Eftud1  A  intron 15  52773  Y  N  N  ?
chr7:89519920  Me3  A  intron 1  103533  N  Y  N  Y
chr7:89525385  Me3  A  intron 1  103533  Y  N  Y  N
chr7:89693986  Me3  S  intron 6  26945  Y  N  N  N
chr7:91940259  Dlg2  A  intron 11  47694  N  Y  N  ?
chr7:92100499  Dlg2  A  intron 13  159847  Y  N  N  Y
chr7:92637224  Rab30  A  intron 1  78139  N  N  N  Y
chr7:92661278  Rab30  A  intron 1  78139  N  N  ?  Y
chr7:92744697  4632434I11Rik  A  intron 2  4570  N  N  Y  Y
chr7:92809075  Prcp  S  exon 9  1029  N  Y  N  N
chr7:96192800  Odz4  A  intron 1  144584  Y  N  N  Y
chr7:97882033  Gdpd4  A  intron 15  8581  Y  Y  N  Y
chr8:122288343  Mlycd  S  intron 2  839  N  N  Y  N
chr8:125686127  BC021611  A  intron 6  11814  N  Y  N  Y
chr8:128131514  Disc1  A  intron 10  17889  N  N  Y  N
chr8:130449167  Pard3  S  intron 22  118444  N  N  N  Y
chr8:131991574  1700008F21Rik  S  intron 2  47626  N  N  N  Y
chr8:13317388  Tmco3  S  intron 10  4885  Y  Y  Y  N
chr8:13317701  Tmco3  A  intron 10  4885  Y  N  Y  ?
chr8:16790154  Csmd1  S  intron 3  316521  Y  Y  Y  N
chr8:16896338  Csmd1  A  intron 3  316521  Y  N  N  N
chr8:20395567  LOC436177  S  intron 7  127285  N  Y  ?  ?
chr8:23600168  Nek3  S  intron 11  9793  N  Y  Y  Y
chr8:23962594  Slc20a2  A  intron 1  58092  N  Y  ?  N
chr8:24016713  Slc20a2  S  intron 5  13340  Y  N  N  N
chr8:32657254  Fut10  S  intron 1  7490  N  Y  Y  Y
chr8:40595747  Tusc3  A  intron 7  4149  N  Y  Y  ?
chr8:42108760  Mtmr7  A  intron 1  25680  Y  N  N  ?
chr8:42575690  Mtus1  A  intron 2  6185  N  Y  ?  Y
chr8:55641586  Vegfc  A  intron 1  79031  Y  N  Y  N
chr8:64179531  Sh3rf1  A  intron 2  102899  N  ?  Y  N
chr8:64259164  Sh3rf1  A  intron 8  9261  Y  N  N  Y
chr8:64410863  Palld  A  intron 16  6140  N  Y  N  N
chr8:68653188  March1  A  intron 1  260565  N  N  Y  N
chr8:68967805  March1  A  intron 3  309405  N  Y  N  N
chr8:68984286  March1  A  intron 3  309405  N  N  Y  N
chr8:69009827  March1  A  intron 3  309405  Y  N  N  Y
chr8:69217878  March1  A  intron 4  109726  N  Y  ?  ?
chr8:69258126  March1  A  intron 4  109726  N  N  Y  ?
chr8:71437203  4732435N03Rik  S  intron 3  95119  N  Y  Y  Y
chr8:84929641  Inpp4b  A  intron 16  1150  N  Y  Y  N
chr8:88047485  EG624855  S  intron 1  19301  N  Y  Y  Y
chr8:90680847  Zfp423  A  intron 3  75869  N  Y  ?  ?
chr8:94277841  Fto  A  intron 1  88185  N  N  N  Y
chr8:94454988  Fto  A  intron 8  143384  N  N  ?  Y
chr8:97505641  Rspry1  A  intron 1  20402  Y  ?  N  N
chr9:101947479  Ephb1  A  intron 3  127746  N  Y  Y  Y
chr9:103330213  Bfsp2  A  intron 1  26589  N  Y  N  Y
chr9:104482616  Cpne4  S  intron 1  116322  N  Y  Y  Y

chr9:104825982  Cpne4  A  intron 7  63731  N  Y  Y  N
chr9:104841903  Cpne4  A  intron 7  63731  N  ?  Y  ?
chr9:105238657  Nek11  A  intron 2  44709  N  Y  Y  Y
chr9:106982772  Dock3  A  intron 4  26329  N  Y  ?  ?
chr9:107085292  Dock3  A  intron 1  52306  N  Y  N  ?
chr9:108535147  Slc25a20  A  intron 4  2411  N  Y  Y  ?
chr9:114265329  Glb1  A  intron 1  15690  N  N  Y  Y
chr9:114926725  Osbpl10  A  intron 1  42073  Y  N  Y  N
chr9:115988048  Tgfbr2  A  intron 2  26556  N  Y  N  N
chr9:118735079  Itga9  A  intron 26  10539  N  Y  N  ?
chr9:121255107  Trak1  A  intron 2  39426  Y  N  N  Y
chr9:122805410  Kif15  S  intron 1  7963  N  Y  ?  ?
chr9:123591415  Ccr9  A  intron 1  95843  Y  Y  Y  N
chr9:34626673  Kirrel3  A  intron 1  422730  N  N  Y  ?
chr9:36288332  EG434396  A  intron 1  6757  N  N  Y  ?
chr9:42465562  Grik4  S  intron 3  120394  N  N  Y  Y
chr9:42495504  Grik4  A  intron 3  120394  N  N  Y  Y
chr9:48054003  D930028F11Rik  A  intron 1  192726  N  Y  Y  Y
chr9:50294820  Bcdo2  A  intron 4  3284  N  Y  Y  ?
chr9:57827226  4930535E21Rik  A  intron 18  963  N  Y  N  N
chr9:60115842  Thsd4  A  intron 6  123165  Y  N  Y  N
chr9:65862744  Ppib  S  intron 3  2403  Y  Y  N  Y
chr9:68960009  Rora  A  intron 1  542101  N  N  N  Y
chr9:71317979  AA407270  A  intron 11  9611  N  Y  Y  ?
chr9:75433883  Scg3  A  intron 11  7685  N  Y  Y  N
chr9:75568793  Bmp5  A  intron 1  51434  N  Y  Y  N
chr9:8121393  AK129341  A  intron 2  9407  N  N  N  Y
chr9:83774731  Bckdhb  A  intron 3  34993  Y  Y  N  Y
chr9:8590112  Trpc6  S  intron 3  9303  Y  Y  Y  N
chr9:86482671  Mod1  S  intron 1  15823  N  ?  Y  ?
chr9:92152022  1700057G04Rik  A  intron 3  3080  N  Y  ?  N
chr9:99360375  Armc8  S  intron 2  13782  N  Y  N  Y
chr10:106410062  8430416H19Rik  S  intron 7  18837  N  Y  Y  N
chr10:120994919  Xpot  A  intron 24  3445  N  Y  Y  N
chr10:121066964  D930020B18Rik  A  intron 4  6259  N  Y  Y  N
chr10:122235925  Ppm1h  A  intron 4  45501  Y  N  Y  N
chr10:13359312  Aig1  A  intron 4  36773  Y  Y  N  Y
chr10:13375267  Aig1  A  intron 4  36773  N  Y  N  Y
chr10:20727441  Ahi1  A  intron 18  23151  N  Y  N  N
chr10:22027763  Raet1c  S  intron 5  191660  N  Y  N  Y
chr10:24375847  Enpp1  A  intron 1  32544  Y  N  Y  ?
chr10:25159767  Epb4.1l2  A  intron 5  2146  N  Y  N  N
chr10:28477359  E430004N04Rik  A  intron 4  80556  N  ?  Y  ?
chr10:40281941  Slc22a16  A  intron 6  3361  N  N  N  Y
chr10:5384312  Esr1  S  intron 6  29807  N  N  Y  N
chr10:73468636  Pcdh15  S  intron 4  118784  Y  Y  Y  N
chr10:84145835  Polr3b  S  intron 25  3997  N  Y  N  N
chr10:85850816  Syn3  A  intron 3  41160  N  ?  Y  ?
chr10:88393677  Tmem16d  A  intron 23  10565  Y  N  ?  N

chr10:90516147  1200009F10Rik  A  intron 1  9841  N  Y  ?  Y
chr10:9355865  E130306M17Rik  A  intron 1  45454  N  Y  N  N
chr11:102874052  Nmt1  A  intron 7  668  N  Y  N  N
chr11:105844336  Kcnh6  A  intron 12  6044  Y  N  N  ?
chr11:107113157  Pitpnc1  A  intron 4  24521  N  Y  Y  N
chr11:107773664  Prkca  A  intron 14  7822  N  Y  Y  Y
chr11:108291118  Ccdc46  S  intron 7  14497  Y  N  N  N
chr11:108294001  Ccdc46  A  intron 7  14497  N  Y  Y  Y
chr11:108304916  Ccdc46  A  intron 9  11252  N  Y  Y  N
chr11:108465990  Ccdc46  S  intron 21  58099  Y  N  ?  N
chr11:108642440  Ccdc46  A  intron 25  47042  Y  N  N  N
chr11:113728274  Sdk2  A  intron 2  39977  Y  N  Y  N
chr11:114674870  Gprc5c  A  intron 1  11192  N  Y  N  N
chr11:19976586  Actr2  A  intron 7  4598  N  Y  N  ?
chr11:3032912  Sfi1  A  exon 31  80  N  N  N  Y
chr11:3052573  Sfi1  S  intron 14  7138  N  Y  N  Y
chr11:3062637  Sfi1  S  intron 9  4409  N  N  Y  N
chr11:3072587  Sfi1  S  intron 6  6721  N  Y  N  N
chr11:3093109  Sfi1  A  intron 1  4302  N  Y  Y  N
chr11:47005426  Sgcd  A  intron 3  62344  N  ?  Y  ?
chr11:51213654  Col23a1  A  intron 2  200242  N  Y  Y  ?
chr11:51281139  Col23a1  S  intron 2  200242  Y  Y  N  Y
chr11:52728669  Fstl4  A  intron 3  185537  N  ?  Y  ?
chr11:5517981  Ankrd48  A  intron 13  2895  N  Y  N  Y
chr11:5588311  Ankrd48  A  intron 26  1656  N  Y  N  ?
chr11:6041515  Nudcd3  S  intron 3  24932  Y  Y  Y  N
chr11:7000997  Adcy1  A  intron 3  8311  N  N  N  Y
chr11:70950223  Nalp1  S  intron 1  17943  N  ?  ?  Y
chr11:72498818  Ube2g1  A  intron 5  4030  N  N  Y  N
chr11:72879659  1200014J11Rik  S  intron 3  11458  Y  Y  N  Y
chr11:74356372  Garnl4  A  intron 2  82243  Y  N  N  N
chr11:74394456  Garnl4  S  intron 1  22707  N  Y  Y  Y
chr11:74700606  Rutbc1  A  intron 2  22817  Y  N  N  N
chr11:75966048  Vps53  A  intron 4  25402  N  Y  Y  ?
chr11:81361676  Accn1  A  intron 1  996014  Y  N  Y  Y
chr11:84929126  Usp32  A  intron 1  35683  N  ?  Y  N
chr11:88907472  Gm525  A  intron 3  4193  Y  N  N  ?
chr11:98623955  Casc3  S  intron 1  4688  N  N  Y  Y
chr12:102733042  Rin3  A  intron 3  24968  N  Y  ?  N
chr12:11130003  Kcns3  S  intron 2  27002  Y  N  N  N
chr12:11374493  Vsnl1  A  intron 2  54597  N  Y  N  N
chr12:117020350  Ptprn2  A  intron 1  94424  N  Y  Y  N
chr12:21436950  Ddef2  S  intron 3  14299  N  Y  N  N
chr12:32613270  Prkar2b  A  intron 2  44837  Y  N  Y  N
chr12:38731481  Dgkb  A  intron 18  11853  N  Y  ?  N
chr12:40912515  Zfp277  A  intron 2  31700  Y  Y  N  Y
chr12:42508770  Immp2l  A  intron 5  74824  N  N  Y  N
chr12:42634020  Immp2l  A  intron 6  251222  N  Y  Y  N
chr12:51477047  Prkcm  S  intron 1  158698  N  N  N  Y

chr12:53469443  Arhgap5  A  intron 2  14851  N  N  N  Y
chr12:53651725  Akap6  A  intron 1  96393  Y  N  Y  ?
chr12:57665867  Slc25a21  S  intron 6  27382  Y  Y  Y  N
chr12:60018304  Mia2  A  intron 1  5555  N  Y  Y  ?
chr12:70792838  Map4k5  A  intron 2  17412  N  Y  Y  ?
chr12:73247588  Rtn1  S  intron 1  99380  N  ?  N  Y
chr13:101397364  Birc1g  A  intron 10  1685  N  ?  ?  Y
chr13:110972570  Pde4d  S  intron 7  130895  Y  Y  N  Y
chr13:112815113  Mier3  A  intron 3  12246  Y  N  N  Y
chr13:114964103  Arl15  A  intron 1  127549  N  N  Y  Y
chr13:116144653  Itga1  A  intron 6  13774  N  Y  N  N
chr13:17223595  AF397014  A  intron 9  129179  N  N  N  Y
chr13:17364703  AF397014  A  intron 6  67198  N  Y  Y  ?
chr13:17572651  Cdc2l5  A  intron 1  30479  N  Y  Y  ?
chr13:19122471  Amph  S  intron 12  3016  Y  N  Y  Y
chr13:19400469  Stard3nl  A  intron 1  19041  N  Y  Y  N
chr13:30377593  Agtr1a  S  intron 2  35880  Y  Y  Y  N
chr13:36372845  Fars2  A  intron 4  67981  N  Y  N  N
chr13:43288515  Gfod1  A  intron 1  101999  Y  N  N  ?
chr13:46636254  Cap2  A  intron 7  20264  N  N  N  Y
chr13:47037177  Tpmt  A  intron 8  1773  Y  N  ?  N
chr13:56722503  Smad5  A  intron 1  20296  Y  N  N  ?
chr13:59010939  Ntrk2  A  intron 13  51975  N  N  Y  ?
chr13:59131106  Ntrk2  A  intron 16  65969  Y  N  Y  Y
chr13:63166458  2010111I01Rik  A  intron 5  13957  Y  N  ?  N
chr13:63227285  2010111I01Rik  A  intron 10  29920  Y  N  N  N
chr13:64892851  Cntnap3  A  intron 2  28900  Y  ?  ?  N
chr13:76462125  Spata9  A  intron 4  5374  N  Y  N  Y
chr13:76652452  AK129128  A  intron 43  2348  N  N  Y  ?
chr13:76986607  Mctp1  S  intron 1  254004  N  Y  N  Y
chr13:77101796  Mctp1  S  intron 1  254004  N  Y  N  N
chr13:90447941  Xrcc4  A  intron 6  49962  N  ?  ?  Y
chr13:9431905  Dip2c  A  intron 1  216037  Y  Y  Y  N
chr13:94978085  Arsb  S  intron 5  18156  N  Y  N  ?
chr13:94988399  Arsb  S  intron 6  58667  N  N  Y  Y
chr14:10866447  Ptprg  A  intron 5  53977  N  N  Y  Y
chr14:115044689  Gpc5  S  intron 7  735946  N  Y  Y  Y
chr14:116738973  Gpc6  A  intron 4  245801  Y  Y  Y  N
chr14:118510725  Hs6st3  A  intron 1  729767  Y  Y  ?  N
chr14:118613808  Hs6st3  A  intron 1  729767  Y  Y  N  N
chr14:12098529  Synpr  S  intron 2  208300  N  N  Y  Y
chr14:121126269  Phgdhl1  A  intron 6  10908  N  Y  N  N
chr14:12840210  Atxn7  A  intron 3  39262  Y  Y  N  N
chr14:13548865  Slc4a7  A  intron 9  6761  N  ?  Y  ?
chr14:13719761  Nek10  S  intron 3  45445  N  ?  Y  ?
chr14:17452304  Ube2e2  A  intron 3  241061  N  N  Y  Y
chr14:17491551  Ube2e2  A  intron 3  241061  N  N  Y  Y
chr14:17599622  Ube2e2  A  intron 3  241061  N  N  Y  ?
chr14:19962286  Adk  A  intron 4  64953  N  N  Y  N

chr14:25733698  Asb14  A  intron 6  7807  N  Y  ?  ?
chr14:26820315  D14Ertd171e  A  intron 7  45071  N  Y  N  N
chr14:33106287  3110001K24Rik  S  intron 2  39502  N  N  Y  N
chr14:33212947  Mmrn2  A  intron 1  20385  Y  Y  N  N
chr14:34343293  Grid1  A  intron 12  68288  Y  Y  Y  N
chr14:45904677  Samd4  A  intron 2  131259  N  Y  ?  N
chr14:50480097  LOC434459  S  intron 1  101657  N  ?  Y  ?
chr14:51059678  Rpgrip1  A  intron 11  1919  N  N  Y  N
chr14:52937443  A630098A13Rik  S  intron 2  161469  N  Y  Y  Y
chr14:53501754  D14Ertd500e  A  intron 5  3042  N  Y  Y  Y
chr14:6142859  LOC544988  S  intron 1  174953  Y  N  Y  Y
chr14:62978260  Rp1hl1  A  intron 2  3521  N  N  Y  Y
chr14:65228565  Ptk2b  A  intron 1  67329  N  ?  Y  ?
chr14:71893381  Rcbtb2  A  intron 4  7045  N  N  Y  ?
chr14:7419275  4930452B06Rik  A  intron 4  65933  N  N  Y  Y
chr14:76357616  D230005D02Rik  S  intron 13  22058  N  Y  N  ?
chr14:76417949  D230005D02Rik  S  intron 16  30400  N  Y  N  N
chr14:85842569  Diap3  A  intron 5  30054  N  Y  N  N
chr15:100768243  Scn8a  A  intron 2  18808  N  N  N  Y
chr15:11025828  Adamts12  A  intron 2  80053  N  N  N  Y
chr15:12471499  Pdzd2  A  intron 1  133780  Y  Y  Y  N
chr15:28264316  Dnahc5  A  intron 35  12606  N  Y  ?  ?
chr15:30531730  Ctnnd2  A  intron 3  138326  N  Y  ?  ?
chr15:31299195  5730557B15Rik  S  intron 1  41849  Y  Y  Y  N
chr15:32617793  Sema5a  A  intron 17  3685  N  N  N  Y
chr15:33339872  Pgcp  S  intron 5  85177  N  ?  Y  ?
chr15:34204246  Laptm4b  A  intron 2  13550  N  N  N  Y
chr15:34947807  Stk3  A  intron 7  8203  N  N  ?  Y
chr15:40647677  Zfpm2  S  intron 3  96380  N  N  ?  Y
chr15:43333952  Ttc35  A  intron 5  10525  N  N  Y  N
chr15:44220503  Nudcd1  A  intron 7  7915  N  ?  Y  Y
chr15:52161185  Slc30a8  A  intron 5  5855  N  ?  Y  ?
chr15:53531677  Samd12  A  intron 3  61123  N  Y  Y  ?
chr15:64192357  Ddef1  A  intron 1  32802  N  Y  N  N
chr15:64696361  Adcy8  A  intron 2  49330  N  Y  ?  N
chr15:66275903  Lrrc6  A  intron 7  5369  N  Y  N  N
chr15:74573404  2300005B03Rik  S  intron 1  3133  N  N  N  Y
chr15:79411098  Dmc1  A  intron 12  15562  N  Y  Y  Y
chr15:82939357  Serhl  A  intron 8  6518  N  N  N  Y
chr15:86140030  Tbc1d22a  A  intron 8  39811  Y  N  N  Y
chr15:95163649  Nell2  A  intron 14  47780  N  Y  Y  ?
chr16:11593592  4933437K13Rik  S  intron 11  54285  Y  Y  Y  N
chr16:31360592  Bdh1  S  intron 2  8473  Y  N  Y  Y
chr16:31638524  Dlgh1  A  intron 4  58810  N  ?  Y  Y
chr16:32021504  1500031L02Rik  S  intron 1  2854  N  N  Y  Y
chr16:32067824  Lrrc33  A  intron 2  14563  Y  N  ?  ?
chr16:34964523  Ptplb  A  intron 2  22776  N  N  Y  ?
chr16:35741481  Hspbap1  A  intron 6  5925  N  Y  N  Y
chr16:36243599  Stfa1  A  intron 1  11207  Y  N  Y  ?

chr16:36684294  Slc15a2  S  intron 7  9675  Y  N  N  Y
chr16:56154949  Impg2  A  intron 7  5595  N  Y  Y  Y
chr16:5991812  A2bp1  A  intron 1  287739  N  Y  ?  N
chr16:59935140  Epha6  A  intron 7  45093  N  ?  ?  Y
chr16:60313157  Epha6  A  intron 3  203583  N  Y  Y  N
chr16:6308438  A2bp1  A  intron 2  314150  Y  N  N  ?
chr16:6598891  A2bp1  A  intron 3  321339  Y  N  ?  N
chr16:70425126  Gbe1  S  intron 14  32012  Y  N  ?  ?
chr16:87824174  Grik1  A  intron 13  12594  N  N  ?  Y
chr16:91599319  Cryzl1  A  intron 4  3320  N  N  N  Y
chr17:11283761  Park2  A  intron 5  58044  N  Y  Y  N
chr17:11516654  Park2  A  intron 7  203154  N  Y  Y  ?
chr17:12127053  Map3k4  A  intron 1  40385  Y  N  N  Y
chr17:24002501  Abca17  A  intron 25  11309  N  Y  Y  Y
chr17:29772262  Zfand3  A  intron 1  55500  N  Y  N  N
chr17:33574868  Vps52  A  intron 17  2036  Y  Y  ?  N
chr17:33594630  AA388235  S  exon 1  3868  N  N  Y  N
chr17:35657073  EG547347  A  intron 1  13134  Y  ?  N  N
chr17:42054881  Opn5  A  intron 4  11897  N  ?  Y  ?
chr17:42755596  Gpr110  A  intron 2  6068  N  Y  ?  N
chr17:49543989  Rftn1  S  intron 4  30939  Y  N  N  N
chr17:49571223  Rftn1  A  intron 3  20392  N  Y  Y  Y
chr17:50126162  Plcl2  A  intron 5  27806  Y  N  N  Y
chr17:53047148  Pcaf  A  intron 1  43284  N  Y  Y  Y
chr17:55269308  2810458H16Rik  A  intron 8  13647  Y  ?  ?  N
chr17:6188639  Tulp4  A  exon 13  2514  N  Y  ?  ?
chr17:68201613  D930040M24Rik  S  intron 1  40481  Y  N  Y  N
chr17:71003229  Myom1  A  intron 22  7738  Y  ?  ?  N
chr17:71300790  Smchd1  A  intron 10  3606  N  Y  ?  Y
chr17:72259504  Alk  A  intron 3  154558  Y  ?  ?  N
chr17:72479884  Alk  S  intron 1  222865  Y  ?  ?  N
chr17:74664131  Ttc27  A  intron 9  19113  Y  N  N  N
chr17:7897922  Gm1604  A  intron 2  50426  Y  Y  N  N
chr17:80192514  Dhx57  S  intron 1  8626  N  Y  N  N
chr17:83275521  Eml4  A  intron 1  58785  N  Y  N  ?
chr17:83674547  Mta3  S  intron 7  3492  N  N  Y  ?
chr17:90760663  Nrxn1  A  intron 4  283719  N  Y  N  Y
chr17:90922764  Nrxn1  A  intron 2  92475  N  N  Y  N
chr18:22924347  Nol4  A  intron 6  52425  Y  N  N  N
chr18:30473818  Pik3c3  A  intron 21  8541  N  N  N  Y
chr18:53659933  Prdm6  S  intron 2  62882  N  Y  N  ?
chr18:62094087  Sh3tc2  A  intron 3  2824  N  N  N  Y
chr18:62132682  Sh3tc2  A  intron 12  16328  N  N  ?  Y
chr18:68217607  D18Ertd653e  A  intron 2  41863  N  Y  N  N
chr18:68537613  Mc2r  A  intron 1  19292  Y  N  ?  N
chr18:70608529  Stard6  A  intron 5  6979  N  N  Y  ?
chr18:72353140  Dcc  A  intron 1  395280  N  Y  N  N
chr18:74063506  Mapk4  A  intron 2  32615  N  Y  ?  N
chr18:74843234  Myo5b  S  intron 26  3728  N  Y  N  ?

chr18:82664133  Mbp  A  intron 3  32181  N  Y  N  ?
chr18:89875351  Dok6  A  intron 1  170283  N  N  ?  Y
chr19:10761586  Vps37c  A  intron 1  17229  N  Y  N  ?
chr19:16043758  Cep78  S  intron 5  2417  N  Y  N  ?
chr19:22461274  Trpm3  A  intron 1  558119  N  Y  Y  Y
chr19:23378836  Mamdc2  S  intron 11  19863  N  N  Y  ?
chr19:23696514  1700028P14Rik  A  intron 1  35949  N  Y  ?  ?
chr19:37682580  Exoc6  A  intron 18  30515  Y  ?  N  Y
chr19:39089896  Cyp2c55  S  intron 4  12457  N  Y  Y  ?
chr19:40152902  Cyp2c50  A  intron 7  13132  N  N  Y  N
chr19:42399286  Crtac1  A  intron 2  79867  Y  N  N  Y
chr19:42992748  LOC545291  A  intron 6  35297  Y  N  N  Y
chr19:43100351  LOC545291  A  intron 3  280983  Y  ?  N  ?
chr19:43859474  Abcc2  A  intron 9  1895  N  Y  Y  N
chr19:44288550  Scd3  A  intron 3  1416  N  Y  Y  N
chr19:48570031  Sorcs3  A  intron 3  47964  Y  N  ?  ?
chr19:57507282  Trub1  S  intron 1  4892  N  Y  N  Y
chr19:57880359  Atrnl1  A  intron 26  138092  Y  N  N  N
chr19:8278386  D630002G06Rik  A  intron 2  10344  N  ?  Y  ?
chrX:102893625  EG331493  A  intron 3  34408  Y  Y  N  Y
chrX:133309958  Il1rapl2  A  intron 2  522726  Y  Y  N  Y
chrX:137384135  Gucy2f  A  intron 7  13304  N  ?  ?  Y
chrX:146727042  ORF34  S  intron 3  17680  N  ?  N  Y
chrX:146918864  Phf8  A  intron 13  23783  Y  Y  Y  N
chrX:148612513  Klf8  A  intron 1  70758  N  Y  N  N
chrX:150478771  EG546400  A  intron 1  12614  N  Y  N  Y
chrX:152550699  Phex  A  intron 15  26562  N  Y  N  Y
chrX:161879907  Egfl6  A  intron 9  4125  N  ?  Y  ?
chrX:67247535  Olfr157  S  intron 22  709627  N  Y  ?  ?
chrX:68580222  Olfr157  A  intron 18  276602  N  ?  Y  ?
chrX:69194673  Nsdhl  S  intron 3  19381  Y  N  N  Y
chrX:69587660  Olfr157  S  intron 6  586741  Y  Y  N  Y
chrX:73839228  Tbl1x  A  intron 1  122275  N  N  Y  ?
chrX:7674806  Ssxb9  S  intron 7  379739  Y  Y  Y  N
chrX:77695025  4930595M18Rik  S  intron 1  31292  N  Y  Y  N
chrX:9241304  Srpx  S  intron 1  43614  N  N  N  Y
chrX:92550530  B230358A15Rik  A  intron 2  18589  Y  N  ?  ?
(a) Genomic location with respect to the B6 genome (mm8 version).
(b) S = sense; A = antisense, with respect to gene transcription.
(c) Computational prediction of element presence within the strain. N = not present; Y = present; ? = unknown.

Table A.3 Polymorphic ETn/MusD cases in genes

insertion site (a)  gene  orient (b)  location  intron_size  B6  AJ  DBA  129X1
chr1:107544558  2310035C23Rik  A  intron 13  1924  N(c)  Y(c)  N  N
chr1:11761263  A830018L16Rik  A  intron 8  50388  Y  ?  ?  N
chr1:158497287  Tor3a  A  intron 4  10001  Y  N  N  ?
chr1:173693834  Cd84  A  intron 2  20560  Y  N  N  ?
chr1:176956278  Rgs7  A  intron 9  27848  N  N  Y  ?
chr1:179852136  EG545391  S  intron 3  712  N  N  Y  N
chr1:90927936  Sh3bp4  A  intron 1  36659  Y  N  N  N
chr2:120487764  Ttbk2  A  intron 4  16402  Y  N  N  N
chr2:161740844  Ptprt  A  intron 7  207432  N  ?  ?  Y
chr2:168357724  Atp9a  A  intron 14  5914  N  Y  N  N
chr2:80121108  Dnajc10  S  intron 3  1920  N  Y  N  N
chr3:101214201  Ptgfrn  A  intron 1  29594  Y  Y  N  Y
chr3:53201348  Lhfp  A  intron 2  188037  Y  Y  N  Y
chr4:11637421  Gem  A  intron 2  4896  N  Y  N  ?
chr4:141314667  Tmem51  A  intron 1  46127  Y  Y  N  N
chr4:153708611  B230396O12Rik  A  intron 6  2946  N  Y  N  ?
chr4:153833591  Plcl4  A  exon 18  86  N  Y  N  N
chr4:43167395  Unc13b  A  intron 7  46281  Y  N  N  Y
chr4:69880108  Cdk5rap2  A  intron 3  20942  N  N  ?  Y
chr4:83332767  4930473A06Rik  A  intron 25  46078  N  N  N  Y
chr5:142663608  Foxk1  A  intron 1  33098  N  Y  N  N
chr5:148304836  C130038G02Rik  S  intron 1  118143  N  N  N  Y
chr5:21020983  Fbxl13  A  intron 16  21545  N  N  N  Y
chr5:90278338  Slc4a4  A  intron 18  14656  Y  Y  N  N
chr5:93471996  Art3  A  intron 2  8817  N  Y  N  ?
chr6:37085228  Dgki  S  intron 1  149358  N  N  N  Y
chr6:49330234  Stk31  A  intron 3  6516  N  Y  ?  N
chr6:57722531  AW146242  A  intron 1  34890  Y  Y  Y  N
chr6:84024675  Dysf  S  intron 4  4795  N  Y  N  ?
chr7:24206094  Cadm4  A  intron 1  16992  Y  N  Y  Y
chr7:26386874  EG243881  A  intron 4  4219  N  N  Y  Y
chr7:29768570  Zfp82  A  intron 5  4708  N  Y  ?  ?
chr7:81189399  Pde8a  A  intron 12  1773  N  Y  N  N
chr8:126077748  Dpep1  A  intron 1  7590  Y  N  Y  N
chr8:127312767  Pgbd5  A  intron 1  49175  N  Y  N  ?
chr9:120202643  Myrip  A  intron 2  52370  N  N  Y  ?
chr9:28170744  Opcml  S  intron 2  270720  N  Y  N  N
chr9:30821365  BC038156  A  exon 1  3731  N  Y  N  Y
chr9:50582311  Alg9  A  intron 14  20017  N  Y  Y  N
chr9:55688688  Zfp291  A  intron 8  8166  N  Y  Y  Y
chr9:83896502  Bckdhb  A  intron 9  53041  Y  Y  N  Y
chr10:23546116  Vnn3  A  intron 3  7946  Y  N  Y  Y
chr10:59779699  Cdh23  A  intron 32  2103  N  Y  N  N
chr11:103427206  Gm884  A  intron 1  2885  N  N  N  Y
chr11:107934667  Prkca  S  intron 3  134255  N  Y  N  ?
chr11:35233113  Slit3  A  intron 4  273313  N  ?  ?  Y
chr11:36015957  Odz2  A  intron 8  35382  N  Y  N  N
chr11:75331799  Scarf1  A  exon 4  524  N  Y  N  N
chr11:82825929  Slfn8  A  intron 4  8583  Y  N  Y  N
chr12:112088887  Mark3  A  intron 14  7429  N  Y  N  ?
chr12:53930690  Akap6  A  intron 7  71966  N  Y  N  N
chr13:4788787  LOC432723  A  intron 1  61826  N  Y  N  Y
chr14:28067385  Cacna2d3  A  intron 11  109422  N  Y  ?  Y
chr14:6880103  Rpp14  A  intron 4  1177  Y  Y  N  N
chr14:95020963  Klhl1  A  intron 8  15107  Y  N  ?  Y
chr15:44040286  Trhr  A  intron 2  31282  Y  Y  Y  N
chr15:96269646  Sfrs2ip  A  intron 3  8946  Y  Y  Y  N
chr16:6641081  A2bp1  S  intron 3  321339  N  Y  Y  Y
chr16:90324059  Hunk  A  intron 2  14502  N  N  Y  Y
chr16:96314083  Sh3bgr  A  intron 2  7344  Y  N  Y  Y
chr17:28455067  Mapk14  A  intron 6  2511  Y  N  Y  Y
chr17:32100856  Wiz  A  intron 2  19667  Y  N  N  N
chr17:52975332  4921523A10Rik  A  intron 2  6341  N  ?  Y  ?
chr17:6430672  Sytl3  A  intron 1  247001  N  ?  Y  ?
chr17:6573500  Sytl3  S  intron 4  12675  N  Y  N  ?
chr17:70602997  Dlgap1  A  intron 4  55340  N  Y  Y  Y
chr17:75006067  Ltbp1  A  intron 3  85061  N  ?  ?  Y
chr18:75389792  Dym  A  intron 16  43220  N  Y  N  N
chr18:76066620  Zbtb7c  A  intron 2  110866  N  N  N  Y
chrX:136618166  Col4a6  S  intron 2  146075  N  Y  N  Y
chrX:66814504  Olfr157  S  intron 23  685526  N  Y  ?  ?
chrX:67552081  Mtm1  S  intron 8  4446  N  Y  N  ?

(a) Genomic location with respect to the B6 genome (mm8 version).
(b) S = sense; A = antisense, with respect to gene transcription.
(c) Computational prediction of element presence within the strain. N = not present; Y = present; ? = unknown.
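
As an illustration of how one record of Tables A.2 and A.3 can be read programmatically, the following is a minimal sketch only. The names `Insertion`, `parse_row` and `STRAINS` are hypothetical and are not part of the thesis pipeline; the sketch assumes each record consists of ten whitespace-separated tokens in the column order given in the table header.

```python
from typing import NamedTuple

# Strain columns in the order used by the table header (assumed fixed).
STRAINS = ("B6", "AJ", "DBA", "129X1")


class Insertion(NamedTuple):
    """One table record; field names are illustrative, not from the thesis."""
    chrom: str
    pos: int            # mm8 coordinate, per footnote (a)
    gene: str
    orient: str         # "S" = sense, "A" = antisense, per footnote (b)
    region: str         # "intron" or "exon"
    region_index: int   # intron/exon number within the gene
    region_size: int    # value of the intron_size column (bp)
    presence: dict      # strain -> "Y", "N" or "?", per footnote (c)


def parse_row(row: str) -> Insertion:
    """Parse a record such as 'chr1:90927936 Sh3bp4 A intron 1 36659 Y N N N'."""
    site, gene, orient, region, idx, size, *calls = row.split()
    chrom, pos = site.split(":")
    # Keep only the first character so that a footnote marker attached to a
    # call (for example "Nc" or "N(c)") does not end up in the value.
    calls = [c[0] for c in calls]
    return Insertion(chrom, int(pos), gene, orient, region,
                     int(idx), int(size), dict(zip(STRAINS, calls)))


record = parse_row("chr1:107544558 2310035C23Rik A intron 13 1924 Nc Yc N N")
strains_with_element = [s for s, call in record.presence.items() if call == "Y"]
```

Here `strains_with_element` lists the strains predicted to carry the insertion for that record.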

Table A.4 Primer sequences

case# (a)  primer name  sequence 5' to 3'
16  Cdh23-s  CTGTGGCAGTGTGAACTTAG
16  Cdh23-as  ACGAGGCAGTCATCATGGAG
17  Odz2-s  CACAGACTCAAAGCCACTGC
17  Odz2-as  GGGTTATGGTTGATTTCTG
20  Mark3-s  ATGCAGTCAGGCTGACAGTC
20  Mark3-as  TTTCTTAGGCAGTCAGGTCAC
22  Cacna2d3-s  AGAGAGTGATAGTAGTCATGC
22  Cacna2d3-as  TTCTCATAAGCTCTAGGAAGC
3  Atp9a-s  CAGAGATTACTCCAGCCTGC
3  Atp9a-as  GAGGAAGACTATCAACAAAGC
2  Dnajc10-s  CATCAGGTCACAGGTCACAG
2  Dnajc10-as  GTCTTGCTGGAGAAAGTCAGT
4  Gem-s  CACAATGGAGTTCCCATAAAG
4  Gem-as  CTAAGCAAATCTCCCACAGC
7  Foxk1-s  CAGACACTGACTACAGTTGC
7  Foxk1-as  ACACCACCATTGCCTGTTAC
12  Pgbd5-s  TCTGATGGCTCTGGTTTCC
12  Pgbd5-as  CGAGAGACAGCACTAGAGC
13  Opcml-s  TCTACCACCTGTCTTGTTTAC
13  Opcml-as  CCTAACACATCCTTCCTTGC
27  Mtm1-s  GCTACCAAATTCCAGTAACAG
27  Mtm1-as  CACGACAGTACCTATGGTAG
26  Dym-s  ACAGTCCTCTGAGCCCTTGA
26  Dym-as  GAAAAAGAGCCAGGGGTAGG
18  Prkca-s  TTGGGATGGAGTGTTGGTTT
18  Prkca-as  CAAGCATCTTTGCTTCCACA
11  Pde8a-s  CATTCAGTGCTCTGGCATCT
11  Pde8a-as  TGCGAACGTACTCACTATTCG
28  Col4a6-s  TACATTCACTCCTGCCCTTG
28  Col4a6-as  TCATTCGGGGTCTACTGACA
1  2310035C23Rik-s  AGAGTCCTGTGGAGCATTGG
1  2310035C23Rik-as  AACCAAATTCCACAACAAGTCTC
25  Dlgap1-s  GAGAGAGAGACGGTCACATGG
25  Dlgap1-as  CTACATACCCAGCCCCTGAA
9  Dysf-s  CACAAACCATTCCCAGTCCT
9  Dysf-as  ATGTCGGGTTCAGCAGTAGC
23  A2bp1-s  TTTTCCCCAAGTGAGTGC
23  A2bp1-as  CAGGTTTCTGGCTAGGAAAGG
24  Sytl3-s  TTGGTTGGCATTTTCCTTTC
24  Sytl3-as  AATCATGGGGCATTCACAGT
19  Akap6-s  CTGTGTGTGAGAAGCCCAGA
19  241s  CGTCTAGATTCCTCTCTTACAGC
14  210as  TGATCAA(A/G)GCTCAAATTTTATTG
14  Alg9-as  CAGTCTCTGTCGCTCTGCTG
6  Art3-s  GTCTCAGAAACAGGGAAGG
6  Art3-as  TGAAGGGCTGAATGCTCTTT
5  B230396O12Rik-s  TTCAGCCCTCCTAGCAAAAA
5  B230396O12Rik-as  GGACTGGTTTCCTCCTGTGA
21  LOC432723-s  TTTCCATCTCAGAGGCTTCC
21  LOC432723-as  TCCCTCTTTCCTTAGGTCTGC
8  Stk31-s  GGGGTAGGGAGAGGAAATTG
8  Stk31-as  AAACCACATCCCCAGTCCTT
15  Zfp291-s  AAGAAATGGCAAGTGGATGG
15  Zfp291-as  TGCTGCCAATACATGAAAGC
10  Zfp82-s  TCAACGACCCATGTCAAAGA
10  Zfp82-as  ACCCTCACCCACCACTGTAA
-  Col4a6-up-ex-s  GAACAAACTGCCAAGCATC
-  Col4a6-down-ex-as  TCAGGAAAGCATGTACAGAC
-  Dnajc10-up-ex-s  AGCACTGAAGTTACATCCTG
-  Dnajc10-down-ex-as  GTACTCAAAGATGAAGATCTAC
-  Gapdh_ex6F  GACTTCAACAGCAACTCCCAC
-  Gapdh_ex7R  TCCACCACCCTGTTGCTGT
-  Mtm1-up-ex-s  AACAAGTGCTATGAGCTCTG
-  Mtm1-down-ex-as  CTGGGTGAATCCACGACAGT
-  Opcml-up-ex-s  CGCAGCGGAGATGCCAC
-  Opcml-down-ex-as  AGGATTGTGCTGCGGTTTAG
-  Prkca-up-ex-s  GGTTCATAAGAGGTGCCATG
-  Prkca-down-ex-as  TAGGGCTTCCGTATGTGTG
-  MusD2-7130as  GAATGAGCAGGAAGCTCCACC
-  IM_LTR_2as  TGTTGC(G/A)GCCGCCAGCAGC
-  IM_3as  ATC(T/C)CTCTGCCATTCTTCAGG
-  Opcml-ex2-s  CCTTGTACCCACAGGAGTG
-  Opcml-ex2-as  AGGAAGGCTCCCTACCTGA
-  Opcml-ex3-as  CTTGCACTATGAGATGGAC
29  Sh3bp4-s  AGAGGGAAGGAAGGCTCTTG
29  Sh3bp4-as  TGGGTTATCCGTGAGTGTGA
30  Tor3a-s  ATTCCGCTGTCTCTGCTTGT
30  Tor3a-as  CTGTGCTCTGTATGCCCAGA
31  Cd84-s  TCCCTGCATCCTATCAGTACG
31  Cd84-as  AGAACCCAGGTTGTGGTCTG
32  Ttbk2-s  AACCAAGCAAACAAAGCAAAA
32  Ttbk2-as  TTTTTCTGGGCTGGTGAGAT
33  Unc13b-s  TCAAGAACAGCCATGCTCAG
33  Unc13b-as  CGGTGGCTCACACCTAGAAT
34  Cadm4-s  TGGGGGAAGTAGCATAGGTG
34  Cadm4-as  GGAGCCAGACAAGACTCAGG
35  Dpep1-s  GGGCCAATCAAAAGTTCAGA
35  Dpep1-as  TGTGGAAAGCACAAAAGCAG
36  Vnn3-s  ATGGCAAATATTGGGGACAA
36  Vnn3-as  GTGCAGCTGTAGCCAAGTGA
37  Slfn8-s  TGCTAGCCAGTGGACAAGTG
37  Slfn8-as  CTGGTCCTCCTGGTGTTGTT
38  Klhl1-s  TGGAATTGTTTTGTTGACAGGA
38  Klhl1-as  GAGCAGCCTCAATTTTCTGC
39  Mapk14-s  CCATGGGTTTCTGTCTCACA
39  Mapk14-as  AGGAGGGGACACTCAAACCT
40  Wiz-s  TATCTGCCTGTGGCTGACTG
40  Wiz-as  GCCCTTGAGTTGAAGCAGAC

(a) Case # from Table 2.1. A dash marks primers listed without a case number.
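
A few primers in Table A.4 contain degenerate positions written as alternatives in parentheses, for example `(A/G)`. The sketch below, with the hypothetical helper name `expand_degenerate`, shows one way to enumerate the concrete sequences such a primer stands for. It assumes that the parenthesised slash notation is the only form of degeneracy used in the table.

```python
import itertools
import re


def expand_degenerate(primer: str) -> list:
    """Expand '(A/G)'-style alternatives into every concrete primer sequence."""
    # re.split with a capture group alternates fixed text (even indices)
    # and the captured alternatives such as "A/G" (odd indices).
    parts = re.split(r"\(([ACGT](?:/[ACGT])+)\)", primer)
    choices = [p.split("/") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = expand_degenerate("TGATCAA(A/G)GCTCAAATTTTATTG")  # primer 210as
```

A primer without degenerate positions is returned unchanged as a single-element list.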

A.3 Accession numbers

The National Center for Biotechnology Information (NCBI) Nucleotide database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=Nucleotide) accession number for the ETnII element used for probe design and for the alignments in Figures 5 and S4 is Y17106. The ETnI element used for probe design is located in a BAC clone with accession number AC068908. The accession number for the IAP element used in probe design is EU183301.
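
For readers who wish to retrieve these records programmatically, the following sketch builds request URLs for the NCBI E-utilities `efetch` service. It is an illustration only and was not part of the original analysis; the choice of the `nuccore` database and FASTA output is an assumption, and the helper name `efetch_url` is hypothetical. No network request is made here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers quoted in Section A.3.
ACCESSIONS = {
    "ETnII element": "Y17106",
    "BAC clone containing the ETnI element": "AC068908",
    "IAP element": "EU183301",
}


def efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Return an efetch URL for one nucleotide accession (assumed db: nuccore)."""
    params = {"db": "nuccore", "id": accession,
              "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


urls = {name: efetch_url(acc) for name, acc in ACCESSIONS.items()}
```

Each URL can then be passed to any HTTP client, subject to NCBI's usage guidelines.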

Appendix B Supporting information of Chapter 3

B.1 Supplementary figures

Figure B.1 Intronic distributions of the four major TE types in human (non-normalized).

The distributions of all (A) and full-length (B) intronic TEs in mouse are shown. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.


Figure B.2 Intronic distributions of the four major TE types in mouse (normalized).

The distributions of all (A) and full-length (B) intronic TEs in mouse are shown. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis represents the standardized frequency of TEs and is normalized by G/C content for each TE type. The red dotted line indicates the expected distribution of TEs based on random computational simulations. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.
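
The binning described in these captions can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to draw the figures: the bin edges are invented for the example, the frequency is taken as the simple fraction of TEs per bin, and the standard error is assumed to be the binomial standard error of that fraction. The function name `bin_frequencies` is hypothetical, and neither the G/C normalization nor the simulated expectation is reproduced.

```python
import bisect
import math


def bin_frequencies(distances, edges):
    """Count TEs per distance bin and return (count, fraction, SE) per bin.

    Bins are half-open intervals [edges[i], edges[i+1]); distances outside
    the outer edges are ignored. SE is the binomial standard error of the
    fraction, which is an assumption made for this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for d in distances:
        i = bisect.bisect_right(edges, d) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    total = sum(counts)
    result = []
    for c in counts:
        p = c / total if total else 0.0
        se = math.sqrt(p * (1.0 - p) / total) if total else 0.0
        result.append((c, p, se))
    return result


# Hypothetical distances (bp) from each TE to its nearest exon.
example = bin_frequencies([10, 50, 150, 500, 2000, 5000, 9000, 20000],
                          edges=[0, 100, 1000, 10000])
```

Dividing each observed fraction by the corresponding fraction from randomly simulated insertions would then give a comparison against the expected distribution.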


Figure B.3 Average size of mouse TEs within and outside the U-zone.

Each TE type is divided into two groups as shown on the x-axis: one group for elements located within the corresponding U-zone and another group for those beyond it. The average size of each TE group is indicated by the horizontal bar within each box, which represents the central 50% of data points of the group. Outliers beyond the 1.5 × IQR (interquartile range) whiskers are not shown. P-values shown on top of each boxplot are based on the two-sample Wilcoxon test.


Figure B.4 Distributional biases of mouse full-length intronic TEs.

A) Orientation bias of full-length intronic TEs. Y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented full-length TEs. B) Splice site bias of full-length intronic TEs. Y-axis shows the logarithmic fold-difference of TE frequency between full-length TEs close to the SA site and TEs close to the SD site. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.
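The bias measure plotted in this figure can be sketched as a log ratio of two frequencies. This is only an illustration: the caption does not state the logarithm base, so base 2 is assumed here, and the function name `log_fold_difference` is hypothetical.

```python
import math

def log_fold_difference(freq_a: float, freq_b: float) -> float:
    """Log2 ratio of two TE frequencies (base 2 is an assumption).

    Positive values mean group A (e.g. sense-oriented TEs, or TEs near the
    SA site) is more frequent than group B in the bin; negative values mean
    the opposite; 0 means no bias.
    """
    return math.log2(freq_a / freq_b)

# Hypothetical bin in which antisense TEs are twice as frequent as sense TEs.
print(log_fold_difference(0.1, 0.2))  # -1.0
```

With this convention, a symmetric two-fold excess in either direction gives values of equal magnitude and opposite sign, which is the usual reason for plotting fold-differences on a log scale.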


Figure B.5 Orientation bias of mouse full-length intronic TEs based on proximity to different types of splice sites.

Orientation bias of full-length TEs near A) SD sites and B) SA sites. The x-axis shows a series of intronic regions based on the distance from a TE to the nearest exon. The y-axis shows the logarithmic fold-difference of TE frequency between sense and antisense oriented TEs. Error bars are standard errors derived from the total number of corresponding TEs (sample size) in each bin.

B.2 Supplementary tables

Table B.1 Mutagenic intronic Alu insertions in humans

Gene with mutation  Alu subfamily  Orientation#  Nearest splice-site  Distance to exon (bp)  Related diseases
FAS                 Sb1            -             SA                   50                     Autoimmune lymphoproliferative syndrome (ALPS)
FGFR2               Yb8            -             SA                   10                     Apert Syndrome
APC                 Yb9            -             SD                   20                     Familial Adenomatous Polyposis
GK                  Ya5            -             SA                   30                     Glycerol kinase deficiency
F8                  Yb9            -             SA                   10                     Hemophilia A
OPA1                Yb8            -             SA                   21                     Autosomal dominant optic atrophy

# The "-" sign represents antisense orientation of the TE with respect to the enclosing gene

Table B.2 Mutagenic intronic L1 insertions in humans

Gene with mutation  Orientation#  Nearest splice-site  Distance to exon (bp)  Related diseases
HBB                 +             SA                   100                    Beta-thalassemia
FKTN                +             SA                   24                     Fukuyama-type congenital muscular dystrophy (FCMD)
RPS6KA3             -             SA                   10                     Coffin-Lowry syndrome (CLS)
CYBB                +             SA                   265                    Chronic granulomatous disease
RP2                 +             SD                   653                    X-linked retinitis pigmentosa (XLRP)

# The "+"/"-" sign represents sense/antisense orientation of the TE with respect to the enclosing gene

Table B.3 Mutagenic intronic ERV insertions in mice

Mutation        ERV family  Orientation#  Nearest splice-site*  Distance to exon (bp)
Ap3d1mh2J       IAP         +             SD                    6
Atrnmg          IAP         -             SD                    136
Atrnmg-L        IAP         +             SD                    ~420
Eya1bor         IAP         +             SA                    1575
Gusmps2J        IAP         +             SA                    btw. 850-1100
Lama2Pas        IAP         +             SD                    ~300
LamB3IAP        IAP         -             SA                    1
Mgrn1md-2J      IAP         +             SD                    616
Mgrn1md         IAP         +             SA                    ~942
Pitpnavb        IAP         +             SA                    ~1126
Spna1Dem        IAP         ?             SA                    1
Pofut1cax       IAP         -             SA                    24
Pmca2joggle     IAP         +             SD                    ~15
Gria4spkw1      IAP         +             SD                    ~720
Zfp69SJL        IAP         ?             SA                    965
Adcy1brl        ETn^a       +             SD                    ~1700
Cacng2stg       ETn         +             SD                    btw. 1500-2100
Cacng2stg-3J    ETn         +             SD                    btw. 2500-4100
Clcn1adr        ETn         +             SD                    1033
Faslpr          ETn         +             SA                    3500
Fbxw4Dac-2J     ETn         +             SD                    ~14000
Fignfi          ETn         +             SD                    ~60000
Foxn1nu-Bc      ETn         -             SA                    ~5200
Gli3pdn         ETn         +             SA                    24514
Hk1dea          ETn         ?             SA                    901
Lep0b-2J        ETn         +             SD                    ~3200
MipCat-Fr       ETn         +             SA                    ~800
Mutedmu         ETn         +             SD                    2362
Ttc7fsn         ETn         +             SA                    57
Hsf4lop11       ETn         +             SA                    61
Fig4paletremor  ETn         +             SA                    384
Dysf prmd       ETn         +             SD                    495
Zhx2Afr1        ETn         +             SA                    ~20600
a               VL30^b      -             SD                    ~1200
Abcb1amds       MuLV^c      -             SA                    4
Myo5ad          MuLV        +             SD                    ~500
Nox3het         unk^d       ?             SD                    ~4000
Pdcd8Hq         MuLV        +             SD                    3432
Pde6brd1        MuLV        -             SD                    1511
Lmf1cld         MuERV       +             ?                     <250

# "+"/"-" indicates the orientation of the TE with respect to the enclosing gene; "?" indicates that the orientation is unknown. * SD: splice donor site; SA: splice acceptor site; "?" indicates that the nearest splice site was not given. ^a ETn represents the ETn/MusD family. ^b Virus-like 30 element. ^c Murine leukemia virus. ^d "unk" indicates that the ERV type was not given.

Appendix C Supporting information of Chapter 4

C.1 Supplementary figures

Figure C.1 TE density distribution of human genes.

A) TE density distribution of all human RefSeq genes. B) TE density distribution of human RefSeq genes larger than 10 kb.


Figure C.2 The relationship between gene size and exon density in human.

A) The negative association between gene size and exon density. B) The linear regression between gene size and the inverse of exon density. r is the correlation coefficient.
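The correlation coefficient r reported in panel B can be illustrated with a plain Pearson correlation between gene size and the inverse of exon density. This is a minimal sketch rather than the analysis code behind the figure; `pearson_r` is a hypothetical helper and the input values are made-up toy numbers, not data from the thesis.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy genes (hypothetical): size in bp and exon count.
gene_sizes = [20_000, 50_000, 120_000, 300_000]
exon_counts = [10, 12, 15, 20]

# Exon density is exon count per unit length, so its inverse is length per exon.
inverse_exon_density = [s / e for s, e in zip(gene_sizes, exon_counts)]
r = pearson_r(gene_sizes, inverse_exon_density)
```

Because exon density has gene size in its denominator, its inverse grows with gene size whenever exon count rises more slowly than gene length, which is why the relationship becomes approximately linear after inversion.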


Figure C.3 Identification of outlier genes by controlling gene size and exon density.

In any given species, all genes ≥ 10 kb were divided into 25 subsets based on both gene size and exon density and were placed in a 5 x 5 matrix. For the genes in each subset, upper/lower outliers were identified by taking the top or bottom 10% of genes with the most extreme TE density. The final set of upper/lower outlier genes was collected by merging the upper/lower outliers from each subset, so that the variation in both gene size and exon density is controlled.
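The procedure in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the thesis: the caption does not say how the bin boundaries were chosen, so equal-count (quintile) bins are assumed, and the names `Gene`, `quantile_bin` and `find_outliers` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    size: int            # gene size in bp
    exon_density: float  # exons per unit length
    te_density: float    # fraction of the gene covered by TEs

def quantile_bin(values, n_bins=5):
    """Assign each value a bin index 0..n_bins-1 by rank (assumed equal-count bins)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def find_outliers(genes, min_size=10_000, n_bins=5, fraction=0.10):
    """Return (upper, lower) TE-density outliers, controlling size and exon density."""
    genes = [g for g in genes if g.size >= min_size]
    size_bins = quantile_bin([g.size for g in genes], n_bins)
    exon_bins = quantile_bin([g.exon_density for g in genes], n_bins)

    # Group genes into the cells of the n_bins x n_bins matrix.
    cells = defaultdict(list)
    for gene, sb, eb in zip(genes, size_bins, exon_bins):
        cells[(sb, eb)].append(gene)

    upper, lower = [], []
    for members in cells.values():
        members.sort(key=lambda g: g.te_density)
        k = int(len(members) * fraction)  # number of outliers per tail in this cell
        if k > 0:
            lower.extend(members[:k])
            upper.extend(members[-k:])
    return upper, lower
```

Taking the extremes within each cell, rather than across all genes at once, keeps large or exon-poor genes from dominating the outlier sets simply because of their size or structure.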


Figure C.4 Chromosomal distribution of SUOs and SLOs in human.

The short red lines along the left side of each chromosome show the chromosomal locations of SLOs. The short blue lines along the right side of each chromosome show the chromosomal locations of SUOs.


Figure C.5 The relationship between the number of SUOs/SLOs on human chromosomes and the chromosome size.

Results for SUOs and SLOs are shown in A) and B), respectively. In both (A) and (B), the x-axis shows the genomic coverage of each chromosome in percentage, and the y-axis shows the total number of SUOs/SLOs on a given chromosome.


Figure C.6 Correlation analysis of the TE composition of SUOs between human and mouse.

Results for LINEs, SINEs, LTR retroelements and DNA transposons are shown as linear regression plots in A), B), C) and D), respectively. In each plot, each open circle represents an SUO gene, and its location is determined by the density of the corresponding TE type in the SUO orthologs of the two species. The line across the data points in each plot represents the regression line, and r is the correlation coefficient.


Figure C.7 Correlation analyses of G+C content vs. LINE/SINE composition of SUOs in human.

Results for LINE and SINE are shown in (A) and (B), respectively. The x-axis shows the proportion covered by LINEs/SINEs relative to all TEs in each SUO gene. The y-axis shows the average G+C content of human SUOs. The line across the data points in each plot represents the regression line, and r is the correlation coefficient.


Figure C.8 Tissue-type composition of tissue-specific outlier genes.

For each gene set, the proportion corresponding to each tissue type is shown in a stacked bar according to the color scheme indicated at the top. The 'genomic background' was calculated based on all mouse genes > 10 kb that show strong Polr2a binding in only one tissue.


Figure C.9 Histone marks at promoters of all outlier genes.

The proportions of genes associated with different histone marks are shown for all upper outliers, all lower outliers and the genomic background as side-by-side bars. Error bars are standard errors derived from the total number of genes (sample size) in each gene set.

C.2 Supplementary tables

Table C.1 SLOs identified among human, mouse and cow

Hid gene_name gene_size te_coverage te_density chr gStart gEnd exonCount exon_density
13092 FOXP1 628405 154657 0.24611 chr3 71087425 71715830 21 0.334179
4086 NFIA 380480 82899 0.21788 chr1 61320567 61701047 11 0.289108
8641 CADM1 330897 67999 0.205499 chr11 1.15E+08 1.15E+08 9 0.271988
8822 TOX 313791 69067 0.220105 chr8 59880530 60194321 9 0.286815
20574 PBX1 292244 50083 0.171374 chr1 1.63E+08 1.63E+08 9 0.307962
9151 ZNF521 290226 60633 0.208916 chr18 20895888 21186114 8 0.275647
22888 FAM19A5 269799 41124 0.152425 chr22 47263951 47533750 4 0.148259
8556 TRPS1 260503 41795 0.16044 chr8 1.16E+08 1.17E+08 7 0.268711
88724 NFIB 232099 38043 0.163909 chr9 14071846 14303945 9 0.387766
7564 TCF7L2 216062 44874 0.20769 chr10 1.15E+08 1.15E+08 14 0.647962
2354 SPTBN1 215130 48890 0.227258 chr2 54536957 54752087 36 1.673407
48068 FYN 212143 57295 0.270077 chr6 1.12E+08 1.12E+08 14 0.659932
22712 LPHN2 192026 31203 0.162494 chr1 82038669 82230695 20 1.041526
21214 ZBTB16 190967 42541 0.222766 chr11 1.13E+08 1.14E+08 7 0.366555
31087 MEF2C 185811 36817 0.198142 chr5 88049814 88235625 11 0.591999
75187 CTBP2 173207 28127 0.16239 chr10 1.27E+08 1.27E+08 11 0.635078
8549 MACROD1 167489 41650 0.248673 chr11 63522605 63690094 11 0.65676
4222 LHFPL2 163611 39124 0.239128 chr5 77816793 77980404 5 0.305603
20437 COL4A1 158187 25652 0.162163 chr13 1.1E+08 1.1E+08 52 3.287249
20933 EPHA4 154264 30651 0.198692 chr2 2.22E+08 2.22E+08 18 1.166831
1320 CALCR 149952 34159 0.2278 chr7 92891734 93041686 13 0.866944
11224 SCUBE1 140126 36733 0.262143 chr22 41929173 42069299 22 1.570016
86803 MEIS1 137360 19802 0.144161 chr2 66516035 66653395 13 0.946418

3801 RUNX1T1 136292 26739 0.196189 chr8 93040327 93176619 11 0.807091
4340 AFF1 134039 30853 0.230179 chr4 88147176 88281215 20 1.492103
32336 NFATC1 133552 16317 0.122177 chr18 75256759 75390311 10 0.748772
32426 SEMA6A 131301 19913 0.151659 chr5 1.16E+08 1.16E+08 19 1.447057
9635 BANP 125887 27349 0.21725 chr16 86542538 86668425 12 0.953236
55948 IKZF1 125369 9354 0.074612 chr7 50314923 50440292 8 0.638116
7813 LEF1 120878 26086 0.215804 chr4 1.09E+08 1.09E+08 10 0.82728
23384 LMF1 117351 26577 0.226474 chr16 843634 960985 11 0.937359
86984 NRXN2 117015 17984 0.15369 chr11 64130221 64247236 24 2.051019
2875 NRP2 115634 13558 0.117249 chr2 2.06E+08 2.06E+08 17 1.470156
6528 VAC14 113720 29396 0.258495 chr16 69278842 69392562 19 1.67077
7673 COL18A1 108538 11784 0.10857 chr21 45649524 45758062 43 3.961746
1952 PDE2A 98228 22426 0.228306 chr11 71964832 72063060 32 3.257727
7485 PPARGC1A 98057 16479 0.168055 chr4 23402741 23500798 13 1.32576
3636 ETV1 97909 20020 0.204476 chr7 13897380 13995289 12 1.225628
20623 PTPRF 92797 20639 0.22241 chr1 43769133 43861930 33 3.556149
2232 SATB1 90987 5610 0.061657 chr3 18364269 18455256 11 1.208964
1095 EPAS1 89274 16729 0.187389 chr2 46378066 46467340 16 1.792235
515 IGF1 84734 19863 0.234416 chr12 1.01E+08 1.01E+08 5 0.590082
187 KIT 82787 17732 0.214188 chr4 55218851 55301638 21 2.53663
6279 ERCC6 82657 22096 0.267322 chr10 50334496 50417153 21 2.54062
9697 BAIAP2 82271 9214 0.111996 chr17 76623556 76705827 15 1.823243
4526 ENPP2 81788 19739 0.241343 chr8 1.21E+08 1.21E+08 26 3.17895
3332 MYT1 77780 7674 0.098663 chr20 62266270 62344050 23 2.957058
6389 DDX31 76113 19908 0.261558 chr9 1.34E+08 1.35E+08 20 2.627672
7889 PIK3R1 75188 9631 0.128092 chr5 67558217 67633405 15 1.994999
2119 PTPN1 74196 20218 0.272494 chr20 48560297 48634493 10 1.347782
9341 PHF20L1 73449 10362 0.141077 chr8 1.34E+08 1.34E+08 21 2.859127

13251 DICER1 71195 15054 0.211447 chr14 94622317 94693512 27 3.792401
20579 PFKP 69245 16905 0.244133 chr10 3099751 3168996 22 3.177125
36206 SLA 66338 17229 0.259715 chr8 1.34E+08 1.34E+08 9 1.356688
22401 ANGPT2 63612 14023 0.220446 chr8 6344580 6408192 9 1.414827
3837 ETS1 63502 6190 0.097477 chr11 1.28E+08 1.28E+08 8 1.259803
10608 PMEPA1 63090 10837 0.17177 chr20 55656857 55719947 4 0.634015
55619 SPEG 58655 9830 0.16759 chr2 2.2E+08 2.2E+08 41 6.990026
1040 DHX15 57097 14739 0.25814 chr4 24138185 24195282 14 2.451968
6899 ASS1 56568 13927 0.246199 chr9 1.32E+08 1.32E+08 16 2.828454
32466 SMYD2 55913 14631 0.261674 chr1 2.13E+08 2.13E+08 12 2.146191
870 ADCYAP1R1 54170 13374 0.246889 chr7 31058666 31112836 16 2.953664
55740 EZR 53684 13920 0.259295 chr6 1.59E+08 1.59E+08 14 2.607853
110441 ALDH1A1 52383 8426 0.160854 chr9 74705406 74757789 13 2.481721
8860 TSC22D2 50828 9664 0.190131 chr3 1.52E+08 1.52E+08 4 0.786968
21059 TLE3 49714 6723 0.135234 chr15 68127596 68177310 20 4.023012
5109 ADAMTS5 49208 6728 0.136726 chr21 27212102 27261310 8 1.625752
2069 PROX1 47903 2248 0.046928 chr1 2.12E+08 2.12E+08 5 1.043776
2266 SFRP1 47503 10678 0.224786 chr8 41238634 41286137 3 0.631539
55639 KDR 47336 8499 0.179546 chr4 55639183 55686519 30 6.337671
49690 ATP2B3 46808 6779 0.144826 chrX 1.52E+08 1.53E+08 20 4.272774
8612 OLFM1 45942 4815 0.104806 chr9 1.37E+08 1.37E+08 6 1.305995
37922 DSP 45077 7749 0.171906 chr6 7486868 7531945 24 5.324223
56497 LHX4 44747 5529 0.123561 chr1 1.78E+08 1.79E+08 6 1.340872
635 GAD1 44460 8423 0.189451 chr2 1.71E+08 1.71E+08 17 3.823662
6628 WDR1 42611 6027 0.141442 chr4 9685060 9727671 12 2.816174
74967 PCDH10 42263 4231 0.100111 chr4 1.34E+08 1.34E+08 5 1.183068
16375 TPCN2 41723 6654 0.15948 chr11 68572925 68614648 25 5.991899
7301 EHF 40414 6637 0.164225 chr11 34599243 34639657 9 2.226951

55709 SH3GL1 40105 10130 0.252587 chr19 4311366 4351471 10 2.493455
55759 SLC7A5 39472 6557 0.166118 chr16 86421129 86460601 10 2.533441
1750 LSP1 39294 2147 0.054639 chr11 1830775 1870069 11 2.79941
1531 FMR1 39133 5526 0.141211 chrX 1.47E+08 1.47E+08 17 4.34416
4912 SSFA2 38993 4639 0.11897 chr2 1.82E+08 1.83E+08 18 4.616213
20445 CSPG4 38527 6043 0.156851 chr15 73753717 73792244 10 2.595582
55433 COL3A1 38374 1906 0.049669 chr2 1.9E+08 1.9E+08 51 13.29025
2438 THBS2 38263 3500 0.091472 chr6 1.69E+08 1.69E+08 23 6.011029
1997 PLCG1 38197 6434 0.168443 chr20 39199574 39237771 32 8.377621
69 COL1A2 36672 1276 0.034795 chr7 93861808 93898480 52 14.17976
12206 CCDC80 36569 6735 0.184172 chr3 1.14E+08 1.14E+08 8 2.187645
9768 ANKRD10 36530 5791 0.158527 chr13 1.1E+08 1.1E+08 6 1.642486
180 JAG1 36363 3770 0.103677 chr20 10566331 10602694 26 7.150125
10551 SON 34463 3778 0.109625 chr21 33837219 33871682 12 3.481995
53111 INTS1 34106 4037 0.118366 chr7 1476438 1510544 48 14.07377
68520 SLC2A1 33802 8665 0.256346 chr1 43163632 43197434 10 2.958405
21389 TPPP 33534 2471 0.073686 chr5 712976 746510 4 1.192819
2410 TCF7 33518 6829 0.203741 chr5 1.33E+08 1.34E+08 10 2.983472
41463 C10orf84 33268 6156 0.185043 chr10 1.2E+08 1.2E+08 8 2.404713
1212 PAX6 33170 0 0 chr11 31762915 31796085 13 3.919204
201 KCNH2 32966 4919 0.149214 chr7 1.5E+08 1.5E+08 15 4.550143
443 VLDLR 32693 4898 0.149818 chr9 2611792 2644485 19 5.811642
37525 CCND2 31579 3290 0.104183 chr12 4253198 4284777 5 1.583331
10919 CADM3 31556 2790 0.088414 chr1 1.57E+08 1.57E+08 10 3.168969
4314 SMAD7 30859 3146 0.101948 chr18 44700220 44731079 4 1.296218
9482 SLCO4A1 29851 2357 0.078959 chr20 60744241 60774092 12 4.019966
22547 COL11A2 29777 1847 0.062028 chr6 33238446 33268223 66 22.16476
11384 TBX18 29743 2605 0.087584 chr6 85500875 85530618 8 2.689709

37370 PDE1B 29620 4568 0.15422 chr12 53229670 53259290 16 5.401756
74841 CSNK1D 29334 5595 0.190734 chr17 77795528 77824862 10 3.409013
55612 CRHR2 29278 5598 0.191202 chr7 30659387 30688665 12 4.098641
20688 TFAP2B 28888 607 0.021012 chr6 50894397 50923285 7 2.423151
23129 DAZAP1 28099 2366 0.084202 chr19 1358583 1386682 12 4.270615
7968 TBX4 27858 5380 0.193122 chr17 56888588 56916446 8 2.871707
88554 PFKFB2 27749 4271 0.153915 chr1 2.05E+08 2.05E+08 15 5.4056
10772 RIPK4 27721 3385 0.12211 chr21 42032597 42060318 8 2.885899
55668 PFKL 27327 1945 0.071175 chr21 44544357 44571684 22 8.050646
2709 CPZ 27054 5218 0.192874 chr4 8645334 8672388 11 4.065942
12767 CLPTM1L 27003 1717 0.063586 chr5 1370999 1398002 17 6.295597
32531 EGLN3 26864 3899 0.145138 chr14 33463171 33490035 5 1.861227
8443 AZI2 26772 3543 0.13234 chr3 28338850 28365622 8 2.988197
57205 SEPT8 26559 2382 0.089687 chr5 1.32E+08 1.32E+08 10 3.765202
82484 ARHGEF16 26531 2562 0.096566 chr1 3361006 3387537 15 5.653764
3098 USP2 26512 3367 0.126999 chr11 1.19E+08 1.19E+08 13 4.90344
23133 NLGN3 26341 5542 0.210394 chrX 70281435 70307776 7 2.657454
3638 NR5A1 26185 3645 0.139202 chr9 1.26E+08 1.26E+08 7 2.673286
37509 ATP1B1 26014 3832 0.147305 chr1 1.67E+08 1.67E+08 6 2.30645
35319 TMEM201 25959 4385 0.16892 chr1 9571563 9597522 10 3.852229
1078 CELSR2 25738 1568 0.060922 chr1 1.1E+08 1.1E+08 34 13.21004
1918 SLC22A18 25526 4204 0.164695 chr11 2877526 2903052 11 4.309332
21322 HSPH1 25355 2019 0.079629 chr13 30608762 30634117 18 7.099191
2252 SDC1 24637 2476 0.100499 chr2 20264038 20288675 6 2.435361
925 PRDM1 23620 1736 0.073497 chr6 1.07E+08 1.07E+08 7 2.96359
1391 COL6A1 23301 208 0.008927 chr21 46226090 46249391 35 15.02081
2421 TFAP2A 22882 149 0.006512 chr6 10504901 10527783 7 3.059173
37750 NR2E1 22752 2416 0.106188 chr6 1.09E+08 1.09E+08 9 3.955696

68427 KCNC4 22602 2756 0.121936 chr1 1.11E+08 1.11E+08 4 1.769755
32055 PDGFA 22583 3215 0.142364 chr7 503424 526007 6 2.656866
3845 FOSL2 21738 3153 0.145046 chr2 28469282 28491020 4 1.840096
20263 RARG 21684 2062 0.095093 chr12 51890619 51912303 10 4.611695
5075 ALX1 21526 1724 0.080089 chr12 84198166 84219692 4 1.858218
40725 HOXA3 20831 385 0.018482 chr7 27112333 27133164 4 1.920215
22410 HNRNPD 20683 3041 0.147029 chr4 83493490 83514173 9 4.3514
1550 GATA3 20498 1384 0.067519 chr10 8136672 8157170 6 2.927115
510 IGF2 20487 0 0 chr11 2106922 2127409 5 2.440572
8636 PATZ1 20460 2244 0.109677 chr22 30051789 30072249 4 1.955034
4927 LMO4 20453 168 0.008214 chr1 87566738 87587191 5 2.444629
31360 PAX9 20230 851 0.042066 chr14 36196532 36216762 5 2.471577
55800 FASN 19893 631 0.03172 chr17 77629502 77649395 43 21.61564
1877 NGFR 19718 3016 0.152957 chr17 44927653 44947371 6 3.042905
31405 TNNT3 19138 779 0.040704 chr11 1897374 1916512 15 7.83781
11450 KCTD15 18916 2573 0.136022 chr19 38979590 38998506 7 3.700571
4691 DPYSL4 18867 962 0.050988 chr10 1.34E+08 1.34E+08 14 7.420364
37520 KLF5 18535 1381 0.074508 chr13 72531142 72549677 4 2.158079
40883 GDF6 18463 1887 0.102204 chr8 97223733 97242196 2 1.083248
68057 MAT1A 17859 1653 0.092558 chr10 82021555 82039414 9 5.039476
7816 LHX9 17639 54 0.003061 chr1 1.96E+08 1.96E+08 6 3.401553
73874 COL1A1 17544 1055 0.060135 chr17 45616455 45633999 51 29.06977
55799 EMX1 17417 821 0.047138 chr2 72998111 73015528 3 1.722455
31142 THBS1 16389 641 0.039112 chr15 37660571 37676960 22 13.42364
2534 VEGFA 16271 541 0.033249 chr6 43845930 43862201 7 4.302133
7605 ZBTB7B 15888 888 0.055891 chr1 1.53E+08 1.53E+08 4 2.517623
4582 TNFAIP3 15869 689 0.043418 chr6 1.38E+08 1.38E+08 9 5.671435
55437 FGFR3 15560 373 0.023972 chr4 1764836 1780396 16 10.28278

10233 GABRQ 15189 1981 0.130423 chrX 1.52E+08 1.52E+08 9 5.925341
99848 RAP2C 15137 777 0.051331 chrX 1.31E+08 1.31E+08 4 2.642532
2753 STC2 14781 595 0.040254 chr5 1.73E+08 1.73E+08 4 2.706177
1293 BGN 14594 457 0.031314 chrX 1.52E+08 1.52E+08 8 5.481705
4712 MXD4 14580 793 0.05439 chr4 2218957 2233537 6 4.115226
23132 SLC38A2 14577 893 0.061261 chr12 45038237 45052814 16 10.9762
11186 ITM2C 14343 1269 0.088475 chr2 2.31E+08 2.31E+08 5 3.486021
6461 SF1 14164 303 0.021392 chr11 64288653 64302817 13 9.178198
1653 INHBA 14106 797 0.056501 chr7 41695125 41709231 3 2.126755
2694 ENC1 14016 1282 0.091467 chr5 73958989 73973005 3 2.140411
20128 L1CAM 13925 659 0.047325 chrX 1.53E+08 1.53E+08 28 20.10772
21215 ZFAND5 13823 927 0.067062 chr9 74156160 74169983 6 4.340592
3263 EFNB1 13167 514 0.039037 chrX 67965564 67978731 5 3.797372
41636 ARRDC4 13136 385 0.029309 chr15 96304936 96318072 8 6.090134
8614 PUF60 12991 1184 0.09114 chr8 1.45E+08 1.45E+08 11 8.467401
24916 DISP2 12823 1730 0.134914 chr15 38437725 38450548 8 6.23879
81909 HNRNPK 12572 114 0.009068 chr9 85772817 85785389 17 13.52211
68089 CYP2E1 11754 713 0.06066 chr10 1.35E+08 1.35E+08 9 7.656968
22705 HEY2 11684 833 0.071294 chr6 1.26E+08 1.26E+08 5 4.279356
36929 VASN 11681 1172 0.100334 chr16 4361849 4373530 2 1.712182
69317 NLGN2 11678 118 0.010104 chr17 7252225 7263903 7 5.994177
1661 ISL1 11606 52 0.00448 chr5 50714714 50726320 6 5.16974
37900 SLC16A3 11077 305 0.027535 chr17 77779581 77790658 5 4.513858
37251 AGXT 10375 267 0.025735 chr2 2.41E+08 2.41E+08 11 10.60241
3354 OVOL1 10157 63 0.006203 chr11 65311104 65321261 4 3.938171
7884 PEA15 10036 255 0.025409 chr1 1.58E+08 1.58E+08 4 3.985652

hid: the HomoloGene Database ID of the gene
gene_name: the human gene name annotated by the RefSeq Database (hg18)
gene_size: the human gene size annotated by the longest RefSeq isoform
te_coverage: the coverage of TE sequences in the human gene in base pairs (bp)
te_density: the density of TE sequences, derived from the ratio te_coverage/gene_size
chr: the name of the human chromosome on which the gene is located
gStart: the genomic coordinate of the 5' start site of the gene in the human genome (hg18)
gEnd: the genomic coordinate of the 3' end site of the gene in the human genome (hg18)
exonCount: the number of exons in the gene
exon_density: the density of exons, derived from the ratio exonCount/gene_size (exons per 10 kb)
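The two derived columns can be checked against any row of the table. The sketch below recomputes them for the FOXP1 row; the function names are hypothetical, and the scaling of exon density to 10 kb is inferred from the tabulated values rather than stated as the original calculation.

```python
def te_density(te_coverage_bp: int, gene_size_bp: int) -> float:
    """Fraction of the gene length covered by TE sequence."""
    return te_coverage_bp / gene_size_bp

def exon_density(exon_count: int, gene_size_bp: int, per_bp: int = 10_000) -> float:
    """Number of exons per `per_bp` base pairs of gene length.

    The default of 10,000 bp is an inference: it is the scale that
    reproduces the exon_density values listed in the table.
    """
    return exon_count / gene_size_bp * per_bp

# FOXP1: gene_size 628405, te_coverage 154657, exonCount 21
print(round(te_density(154657, 628405), 5))   # 0.24611
print(round(exon_density(21, 628405), 6))     # 0.334179
```

The same check applies to the other rows, for example PAX6 (13 exons over 33170 bp), whose tabulated exon density is 3.919204.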

Table C.2 SUOs identified among human, mouse and cow

Hid gene_name gene_size te_coverage te_density chr gStart gEnd exonCount exon_density
8584 COMMD9 17158 9849 0.574018 chr11 36250417 36267575 5 2.914093
9800 PPP1R14D 13265 9717 0.732529 chr15 38894934 38908199 5 3.769318
13945 HSCB 15454 10629 0.687783 chr22 27468042 27483496 6 3.88249
57091 CALR3 17129 12222 0.713527 chr19 16450874 16468003 9 5.254247
12018 KLHL10 10557 6214 0.588614 chr17 37247568 37258125 5 4.736194
19263 TMEM219 11023 6169 0.559648 chr16 29880851 29891874 6 5.443164
7862 NFATC2IP 15450 8514 0.551068 chr16 28869818 28885268 8 5.177994
6289 ITPA 14451 7869 0.54453 chr20 3138055 3152506 8 5.535949
1501 ERCC1 14306 9285 0.649028 chr19 50604711 50619017 10 6.990074
55906 AP1M2 14645 9176 0.626562 chr19 10544346 10558991 12 8.193923
8709 DHDH 11288 6703 0.593816 chr19 54128750 54140038 7 6.201276
36381 PRODH2 13310 7541 0.566566 chr19 40982731 40996041 11 8.264463
1592 HARS 17482 9279 0.530775 chr5 1.4E+08 1.4E+08 13 7.43622
3280 FARSA 11275 6439 0.571086 chr19 12894283 12905558 13 11.52993
9232 AAAS 14173 7581 0.53489 chr12 51987506 52001679 16 11.28907
27612 ANKRD13D 13136 6658 0.506851 chr11 66813394 66826530 16 12.18027
81844 ACTL6B 13359 6771 0.506849 chr7 1E+08 1E+08 14 10.47983
16533 CCDC151 14709 7448 0.506357 chr19 11392271 11406980 13 8.838126
12877 FERMT3 17151 8426 0.491283 chr11 63730788 63747939 15 8.745846
12723 RILPL2 21329 15612 0.731961 chr12 1.22E+08 1.22E+08 4 1.875381
12083 ZBTB8OS 28879 21946 0.759929 chr1 32859893 32888772 7 2.423907
2628 NME5 24272 16218 0.668177 chr5 1.37E+08 1.38E+08 6 2.471984
75168 CHIA 29702 20082 0.676116 chr1 1.12E+08 1.12E+08 9 3.030099
8080 ECSIT 23187 15605 0.673006 chr19 11477743 11500930 8 3.450209
9315 NOSIP 24836 16374 0.659285 chr19 54750779 54775615 10 4.026413
12320 MTFMT 28128 16829 0.598301 chr15 63080902 63109030 9 3.199659
37712 RPA2 23188 13722 0.591772 chr1 28090635 28113823 9 3.881318

14939 PHYHD1 21147 13327 0.630208 chr9 1.31E+08 1.31E+08 12 5.674564
1160 GTF2H3 26954 16713 0.620056 chr12 1.23E+08 1.23E+08 13 4.823032
83 DDB2 24277 13649 0.562219 chr11 47193068 47217345 10 4.119125
7893 PLD3 30059 16855 0.560731 chr19 45546171 45576230 13 4.324828
1620 HPD 19337 13533 0.69985 chr12 1.21E+08 1.21E+08 14 7.240006
426 F2 20314 13259 0.652703 chr11 46697318 46717632 14 6.891799
23489 RABEP2 20791 12803 0.615795 chr16 28823242 28844033 13 6.252705
14189 DNTTIP1 19491 11456 0.587758 chr20 43853982 43873473 13 6.669745
38185 POLR3C 18280 10411 0.56953 chr1 1.44E+08 1.44E+08 15 8.205689
730 PRIM1 20783 11772 0.566424 chr12 55411630 55432413 13 6.255112
74971 PTPN18 18527 10472 0.565229 chr2 1.31E+08 1.31E+08 15 8.096292
20712 TYK2 30045 16184 0.538659 chr19 10322203 10352248 25 8.320852
83157 KIFC1 18387 9829 0.534562 chr6 33467290 33485677 11 5.982488
3426 CDC23 25696 13721 0.533974 chr5 1.38E+08 1.38E+08 16 6.22665
24933 COMMD7 41322 26774 0.647936 chr20 30754153 30795475 9 2.178017
4309 IPP 47727 30873 0.646867 chr1 45936993 45984720 8 1.6762
1363 CDK7 42636 29275 0.686626 chr5 68566377 68609013 12 2.814523
7973 TEKT1 31761 21780 0.685747 chr17 6644023 6675784 8 2.518812
1356 CDC25C 46558 30750 0.660467 chr5 1.38E+08 1.38E+08 11 2.362644
9897 C2orf42 41135 26304 0.639455 chr2 70230520 70271655 10 2.43102
37483 XRCC6 42758 26761 0.625871 chr22 40347240 40389998 13 3.040367
39824 PTGR2 33489 20749 0.619577 chr14 73388426 73421915 10 2.986055
9502 CDK5RAP1 42693 25233 0.591034 chr20 31410305 31452998 14 3.279226
32123 SHCBP1 40844 23840 0.583684 chr16 45171968 45212812 13 3.182842
14239 GLMN 52612 30573 0.581103 chr1 92484542 92537154 19 3.611343
55978 CCT6B 33569 21982 0.65483 chr17 30279050 30312619 14 4.170514
1159 GTF2H2 32547 20007 0.614711 chr5 70366706 70399253 16 4.915968
57000 DHX40 42817 23949 0.559334 chr17 54997667 55040484 18 4.203938

4597 SMC1A 48549 26669 0.549321 chrX 53417794 53466343 25 5.149437
41349 ACN9 65171 40285 0.618143 chr7 96583840 96649011 2 0.306885
78167 C3orf70 74965 44978 0.599987 chr3 1.86E+08 1.86E+08 2 0.266791
17076 UBXN2A 60318 42993 0.712772 chr2 24016879 24077197 7 1.160516
45811 AMN1 58038 36082 0.621696 chr12 31715337 31773375 7 1.206106
18584 CYP20A1 67400 45675 0.677671 chr2 2.04E+08 2.04E+08 13 1.928783
41769 TTLL9 72354 48746 0.673715 chr20 29922165 29994519 15 2.07314
37744 TBCE 81553 52240 0.640565 chr1 2.34E+08 2.34E+08 17 2.084534
33650 BOLL 59336 36308 0.611905 chr2 1.98E+08 1.98E+08 11 1.853849
2376 NEK4 60151 41944 0.697312 chr3 52719840 52779991 16 2.659972
105654 PFKFB1 60922 42087 0.690834 chrX 54976314 55037236 14 2.29802
72261 NWD1 97988 66889 0.682624 chr19 16691786 16789774 21 2.14312
65105 NLRP5 62083 38729 0.623826 chr19 61202903 61264986 15 2.41612
6920 ALG6 69581 41927 0.602564 chr1 63605885 63675466 15 2.155761
1371 CENPC1 73268 44148 0.602555 chr4 68020583 68093851 19 2.593219
3322 KIF11 62326 34719 0.557055 chr10 94342804 94405130 22 3.529827
5092 ZPBP 155788 118841 0.762838 chr7 49947584 50103372 8 0.513518
20218 GABRA3 284198 191698 0.674523 chrX 1.51E+08 1.51E+08 10 0.351867
12423 ITFG1 305718 180212 0.589471 chr16 45746798 46052516 18 0.588778
49914 NHEDC1 134682 91702 0.680878 chr4 1.04E+08 1.04E+08 12 0.890988
10410 SPATA6 173568 117494 0.676934 chr1 48536864 48710432 13 0.748986
52235 CCDC73 192562 124780 0.647999 chr11 32580201 32772763 18 0.934764
17021 ADAM32 177387 133355 0.751774 chr8 39084206 39261593 25 1.409348
51868 ALS2CR11 131753 90074 0.683658 chr2 2.02E+08 2.02E+08 15 1.138494
10778 SENP7 188968 122521 0.648369 chr3 1.03E+08 1.03E+08 23 1.217137
49799 EIF2C3 125292 79686 0.636002 chr1 36169358 36294650 19 1.516458
10225 ASH1L 227273 143196 0.630062 chr1 1.54E+08 1.54E+08 28 1.231999
2120 PTPN4 217831 127661 0.586055 chr2 1.2E+08 1.2E+08 27 1.239493

9084 TRIM37 124267 70509 0.567399 chr17 54414781 54539048 25 2.011797

hid: the HomoloGene Database ID of the gene
gene_name: the human gene name annotated by the RefSeq Database (hg18)
gene_size: the human gene size annotated by the longest RefSeq isoform
te_coverage: the coverage of TE sequences in the human gene in base pairs (bp)
te_density: the density of TE sequences, derived from the ratio te_coverage/gene_size
chr: the name of the human chromosome on which the gene is located
gStart: the genomic coordinate of the 5' start site of the gene in the human genome (hg18)
gEnd: the genomic coordinate of the 3' end site of the gene in the human genome (hg18)
exonCount: the number of exons in the gene
exon_density: the density of exons, derived from the ratio exonCount/gene_size (exons per 10 kb)

Table C.3 Overrepresentation of GO terms for SLOs (BiNGO results)

GO-ID p-value corr p-value x n X N Description
48731 3.60E-24 7.24E-21 87 2421 173 14301 system development
48856 5.05E-22 5.09E-19 88 2655 173 14301 anatomical structure development
7275 3.55E-21 2.38E-18 92 2971 173 14301 multicellular organismal development
32502 2.11E-20 1.06E-17 95 3234 173 14301 developmental process
48513 1.43E-18 5.78E-16 67 1791 173 14301 organ development
30154 3.03E-16 1.02E-13 61 1668 173 14301 cell differentiation
48869 1.10E-15 3.16E-13 61 1714 173 14301 cellular developmental process
32501 2.40E-15 6.04E-13 103 4375 173 14301 multicellular organismal process
9653 1.84E-14 4.12E-12 49 1217 173 14301 anatomical structure morphogenesis
9888 2.67E-14 5.38E-12 38 749 173 14301 tissue development
51252 1.69E-13 3.10E-11 60 1857 173 14301 regulation of RNA metabolic process
6355 1.92E-13 3.23E-11 59 1808 173 14301 regulation of transcription, DNA-dependent
1944 2.21E-13 3.43E-11 23 273 173 14301 vasculature development
9887 9.24E-13 1.33E-10 33 635 173 14301 organ morphogenesis
1568 1.03E-12 1.38E-10 22 265 173 14301 blood vessel development
7399 3.85E-12 4.85E-10 44 1155 173 14301 nervous system development
50794 1.32E-11 1.57E-09 119 6223 173 14301 regulation of cellular process
50789 9.52E-11 1.07E-08 121 6552 173 14301 regulation of biological process
65007 1.56E-10 1.64E-08 125 6941 173 14301 biological regulation
10468 1.69E-10 1.64E-08 72 2928 173 14301 regulation of gene expression
48514 1.71E-10 1.64E-08 18 220 173 14301 blood vessel morphogenesis
10556 4.37E-09 4.00E-07 68 2877 173 14301 regulation of macromolecule biosynthetic process
45449 4.58E-09 4.01E-07 64 2622 173 14301 regulation of transcription
31326 5.25E-09 4.40E-07 70 3021 173 14301 regulation of cellular biosynthetic process
6357 5.89E-09 4.74E-07 30 748 173 14301 regulation of transcription from RNA polymerase II promoter
9889 7.42E-09 5.75E-07 70 3045 173 14301 regulation of biosynthetic process
60255 1.74E-08 1.30E-06 74 3377 173 14301 regulation of macromolecule metabolic process

48646 2.76E-08 1.99E-06 20 375 173 14301 anatomical structure formation involved in morphogenesis
19219 3.11E-08 2.16E-06 68 3014 173 14301 regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
7167 3.96E-08 2.66E-06 19 346 173 14301 enzyme linked receptor protein signaling pathway
51171 4.44E-08 2.88E-06 68 3040 173 14301 regulation of nitrogen compound metabolic process
48608 9.49E-08 5.97E-06 13 164 173 14301 reproductive structure development
42127 9.86E-08 6.02E-06 30 848 173 14301 regulation of cell proliferation
3006 1.13E-07 6.69E-06 17 296 173 14301 reproductive developmental process
48518 1.29E-07 7.42E-06 54 2209 173 14301 positive regulation of biological process
43588 1.37E-07 7.62E-06 7 34 173 14301 skin development
60429 1.40E-07 7.62E-06 18 337 173 14301 epithelium development
7166 1.49E-07 7.92E-06 38 1280 173 14301 cell surface receptor linked signaling pathway
80090 1.64E-07 8.46E-06 74 3554 173 14301 regulation of primary metabolic process
1974 1.83E-07 9.19E-06 6 22 173 14301 blood vessel remodeling
48729 2.28E-07 1.12E-05 16 275 173 14301 tissue morphogenesis
48522 2.49E-07 1.20E-05 50 2005 173 14301 positive regulation of cellular process
31323 2.62E-07 1.23E-05 76 3735 173 14301 regulation of cellular metabolic process
50793 2.77E-07 1.25E-05 28 791 173 14301 regulation of developmental process
10628 2.79E-07 1.25E-05 24 604 173 14301 positive regulation of gene expression
1525 3.14E-07 1.38E-05 12 152 173 14301 angiogenesis
19222 4.11E-07 1.76E-05 78 3918 173 14301 regulation of metabolic process
1501 5.74E-07 2.41E-05 17 332 173 14301 skeletal system development
51254 8.51E-07 3.50E-05 21 507 173 14301 positive regulation of RNA metabolic process
10604 9.32E-07 3.75E-05 30 942 173 14301 positive regulation of macromolecule metabolic process
45944 1.15E-06 4.52E-05 18 389 173 14301 positive regulation of transcription from RNA polymerase II promoter
45935 1.26E-06 4.87E-05 24 657 173 14301 positive regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
7548 1.51E-06 5.74E-05 12 176 173 14301 sex differentiation
45941 1.76E-06 6.58E-05 22 576 173 14301 positive regulation of transcription
32583 1.81E-06 6.63E-05 13 212 173 14301 regulation of gene-specific transcription

GO-ID   p-value    corr p-value  x    n     X    N      Description
122     2.04E-06   7.32E-05      15   286   173  14301  negative regulation of transcription from RNA polymerase II promoter
10557   2.07E-06   7.32E-05      24   676   173  14301  positive regulation of macromolecule biosynthetic process
51173   2.18E-06   7.58E-05      24   678   173  14301  positive regulation of nitrogen compound metabolic process
48468   2.28E-06   7.79E-05      23   632   173  14301  cell development
45893   2.78E-06   9.34E-05      20   501   173  14301  positive regulation of transcription, DNA-dependent
45446   3.27E-06   1.08E-04      5    20    173  14301  endothelial cell differentiation
8406    3.68E-06   1.18E-04      10   129   173  14301  gonad development
48771   3.68E-06   1.18E-04      7    54    173  14301  tissue remodeling
16055   3.95E-06   1.23E-04      10   130   173  14301  Wnt receptor signaling pathway
23052   3.97E-06   1.23E-04      64   3132  173  14301  signaling
61061   4.16E-06   1.27E-04      14   265   173  14301  muscle structure development
9893    4.63E-06   1.39E-04      30   1019  173  14301  positive regulation of metabolic process
3158    5.45E-06   1.61E-04      5    22    173  14301  endothelium development
30855   6.18E-06   1.80E-04      11   168   173  14301  epithelial cell differentiation
31328   6.26E-06   1.80E-04      24   721   173  14301  positive regulation of cellular biosynthetic process
7398    6.33E-06   1.80E-04      12   202   173  14301  ectoderm development
45892   6.52E-06   1.82E-04      17   397   173  14301  negative regulation of transcription, DNA-dependent
51270   6.77E-06   1.87E-04      13   239   173  14301  regulation of cellular component movement
9891    8.08E-06   2.20E-04      24   732   173  14301  positive regulation of biosynthetic process
51253   8.20E-06   2.20E-04      17   404   173  14301  negative regulation of RNA metabolic process
8584    9.43E-06   2.50E-04      7    62    173  14301  male gonad development
51093   1.12E-05   2.90E-04      14   289   173  14301  negative regulation of developmental process
42692   1.13E-05   2.90E-04      9    116   173  14301  muscle cell differentiation
51239   1.15E-05   2.90E-04      30   1067  173  14301  regulation of multicellular organismal process
8284    1.15E-05   2.90E-04      18   459   173  14301  positive regulation of cell proliferation
45137   1.18E-05   2.93E-04      10   147   173  14301  development of primary sexual characteristics
30334   1.25E-05   3.07E-04      12   216   173  14301  regulation of cell migration
48523   1.30E-05   3.15E-04      43   1844  173  14301  negative regulation of cellular process

GO-ID   p-value    corr p-value  x    n     X    N      Description
31325   1.37E-05   3.29E-04      28   967   173  14301  positive regulation of cellular metabolic process
22414   1.39E-05   3.30E-04      25   808   173  14301  reproductive process
7169    1.44E-05   3.36E-04      12   219   173  14301  transmembrane receptor protein kinase signaling pathway
3       1.48E-05   3.43E-04      25   811   173  14301  reproduction
35295   1.52E-05   3.49E-04      14   297   173  14301  tube development
48534   1.60E-05   3.63E-04      13   259   173  14301  hemopoietic or lymphoid organ development
48519   2.47E-05   5.46E-04      45   2020  173  14301  negative regulation of biological process
43062   2.47E-05   5.46E-04      10   160   173  14301  extracellular structure organization
9987    2.53E-05   5.54E-04      138  9366  173  14301  cellular process
30097   2.66E-05   5.76E-04      12   233   173  14301  hemopoiesis
8585    2.79E-05   5.96E-04      7    73    173  14301  female gonad development
51128   2.81E-05   5.96E-04      19   538   173  14301  regulation of cellular component organization
23033   2.99E-05   6.24E-04      46   2100  173  14301  signaling pathway
2520    3.01E-05   6.24E-04      13   275   173  14301  immune system development
6916    3.04E-05   6.24E-04      11   199   173  14301  anti-apoptosis
43009   3.20E-05   6.51E-04      15   360   173  14301  chordate embryonic development

GO-ID: the Gene Ontology term ID
p-value: the p-value derived from the hypergeometric test
corr p-value: the corrected p-value derived from the Benjamini & Hochberg False Discovery Rate (FDR) correction
x: the number of sample genes (i.e. SUOs/SLOs) belonging to the given GO term
n: the number of population genes (i.e. genomic background) belonging to the given GO term
X: the sample size (i.e. all SUOs/SLOs with available GO annotations)
N: the population size (i.e. genomic background)
Description: the GO term description
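The quantities in the legend above can be related to each other with a short calculation. The following is a minimal sketch, assuming the enrichment p-value is the upper tail of the hypergeometric distribution (the probability of seeing at least x annotated genes in a sample of X, when n of the N background genes carry the term) and that the correction is the standard Benjamini & Hochberg step-up procedure. The function names are hypothetical and this is not BiNGO's own code. The corrected p-values in the tables also depend on the total number of GO terms tested, which is not listed here, so only the uncorrected p-values can be recomputed from a single row.

```python
from math import comb


def hypergeom_upper_tail(x, n, X, N):
    """P(at least x of the X sampled genes carry the term),
    given that n of the N background genes carry it."""
    total = comb(N, X)
    upper = min(n, X)
    hits = sum(comb(n, k) * comb(N - n, X - k) for k in range(x, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Values taken from one table row: x=6, n=21, X=63, N=14304.
    print(hypergeom_upper_tail(6, 21, 63, 14304))
```

Exact integer binomial coefficients are used instead of floating-point approximations so that very small tail probabilities are not lost to rounding.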

Table C.4 Overrepresentation of GO terms for SUOs (BiNGO results)

GO-ID   p-value    corr p-value  x    n     X    N      Description
718     2.95E-10   2.52E-07      6    21    63   14304  nucleotide-excision repair, DNA damage removal
6308    1.74E-07   4.97E-05      6    57    63   14304  DNA catabolic process
6289    1.74E-07   4.97E-05      6    57    63   14304  nucleotide-excision repair
6281    4.85E-05   7.37E-03      8    300   63   14304  DNA repair
44238   4.90E-05   7.37E-03      39   5286  63   14304  primary metabolic process
8152    5.17E-05   7.37E-03      42   5957  63   14304  metabolic process
6259    7.95E-05   9.71E-03      10   517   63   14304  DNA metabolic process
9056    9.46E-05   1.01E-02      14   1005  63   14304  catabolic process
43170   1.12E-04   1.06E-02      32   4016  63   14304  macromolecule metabolic process
44260   1.58E-04   1.35E-02      29   3507  63   14304  cellular macromolecule metabolic process
6807    2.94E-04   2.20E-02      21   2194  63   14304  nitrogen compound metabolic process
6974    3.29E-04   2.20E-02      8    396   63   14304  response to DNA damage stimulus
9057    3.34E-04   2.20E-02      9    503   63   14304  macromolecule catabolic process
34641   3.99E-04   2.38E-02      20   2076  63   14304  cellular nitrogen compound metabolic process
51301   4.43E-04   2.38E-02      7    314   63   14304  cell division
90304   4.45E-04   2.38E-02      16   1459  63   14304  nucleic acid metabolic process
6139    4.98E-04   2.49E-02      18   1785  63   14304  nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
70      5.24E-04   2.49E-02      3    36    63   14304  mitotic sister chromatid segregation
819     5.69E-04   2.55E-02      3    37    63   14304  sister chromatid segregation
44237   6.05E-04   2.55E-02      35   4988  63   14304  cellular metabolic process
87      6.27E-04   2.55E-02      6    239   63   14304  M phase of mitotic cell cycle
44265   6.71E-04   2.61E-02      8    441   63   14304  cellular macromolecule catabolic process
6350    7.75E-04   2.88E-02      7    345   63   14304  transcription
279     8.58E-04   3.06E-02      7    351   63   14304  M phase
6368    1.22E-03   4.18E-02      3    48    63   14304  RNA elongation from RNA polymerase II promoter
6354    1.46E-03   4.80E-02      3    51    63   14304  RNA elongation


GO-ID: the Gene Ontology term ID
p-value: the p-value derived from the hypergeometric test
corr p-value: the corrected p-value derived from the Benjamini & Hochberg False Discovery Rate (FDR) correction
x: the number of sample genes (i.e. SUOs/SLOs) belonging to the given GO term
n: the number of population genes (i.e. genomic background) belonging to the given GO term
X: the sample size (i.e. all SUOs/SLOs with available GO annotations)
N: the population size (i.e. genomic background)
Description: the GO term description

Table C.5 The optimization table for outlier threshold selection

cutoff  SLO_count  SLO P-fold diff  SUO_count  SUO P-fold diff
0.025   39         414.8246635      6          63.81917899
0.05    81         107.6948646      14         18.61392721
0.075   121        47.66741147      36         14.18203978
0.1     183        30.41382749      66         10.96892139
0.125   239        20.33704504      115        9.785607446
0.15    293        14.42825574      160        7.878910987
0.175   353        10.94663274      213        6.605191995
0.2     437        9.078444408      265        5.505235167
0.225   521        7.601690045      329        4.800299472
0.25    603        6.413827489      398        4.233338873

cutoff: the cutoff threshold, given as the top/bottom fraction of the TE density distribution used to call upper/lower outlier genes
SLO_count: the number of SLOs derived from the three species using the corresponding cutoff
SLO P-fold diff: the fold difference between the observed frequency of SLOs and the probability of observing SLOs by chance, based on the shared genes among the three species
SUO_count: the number of SUOs derived from the three species using the corresponding cutoff
SUO P-fold diff: the fold difference between the observed frequency of SUOs and the probability of observing SUOs by chance, based on the shared genes among the three species
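The P-fold difference defined above can be illustrated with a small calculation. This is a minimal sketch under two assumptions that are not stated in this appendix: that outlier status is independent across the three species, so the chance probability of a shared outlier is the cutoff raised to the third power, and that the number of shared genes is 6017, a value back-calculated from the tabulated rows rather than quoted from the text. The function name and parameters are hypothetical.

```python
def p_fold_difference(shared_outlier_count, shared_gene_count, cutoff, n_species=3):
    """Observed frequency of shared outliers divided by the frequency
    expected by chance if outlier status were independent across species."""
    observed = shared_outlier_count / shared_gene_count
    expected = cutoff ** n_species  # assumption: independence across species
    return observed / expected


if __name__ == "__main__":
    SHARED_GENES = 6017  # assumption: inferred from the table, not given in the text
    for cutoff, slo, suo in [(0.025, 39, 6), (0.05, 81, 14), (0.1, 183, 66)]:
        print(cutoff,
              p_fold_difference(slo, SHARED_GENES, cutoff),
              p_fold_difference(suo, SHARED_GENES, cutoff))
```

Under these assumptions the printed values agree with the corresponding rows of Table C.5 to the displayed precision, which suggests, but does not confirm, that this is how the fold differences were computed.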

Table C.6 Number of TUs/genes derived in each step of SUO/SLO calculation

A) Human

Step                 TU/gene type*                      Total of TUs/genes involved
1. source data       whole genome RefSeq TUs (hg18)     27212
2. HomoloGene        RefSeq genes with HID              17357
3. gene size cutoff  genes >= 10 kb                     12423
4. all outliers      all upper outliers                 1231
                     all lower outliers                 1228
5. shared outliers   SUOs                               84
                     SLOs                               189

B) Mouse

Step                 TU/gene type*                      Total of TUs/genes involved
1. source data       whole genome RefSeq TUs (mm9)      21873
2. HomoloGene        RefSeq genes with HID              18248
3. gene size cutoff  genes >= 10 kb                     11670
4. all outliers      all upper outliers                 1175
                     all lower outliers                 1175
5. shared outliers   SUOs                               84
                     SLOs                               189

C) Cow

Step                 TU/gene type*                      Total of TUs/genes involved
1. source data       whole genome RefSeq TUs (bosTau4)  12441
2. HomoloGene        RefSeq genes with HID              9389
3. gene size cutoff  genes >= 10 kb                     6206
4. all outliers      all upper outliers                 615
                     all lower outliers                 608
5. shared outliers   SUOs                               84
                     SLOs                               189

Abbreviations:
TU - Transcription Unit
SUO - Shared Upper Outlier
SLO - Shared Lower Outlier

* Genes are defined as the longest isoforms of multiple TUs associated with the same gene
