UNLV Theses, Dissertations, Professional Papers, and Capstones

8-1-2020

Transposable Element Expression in Human Embryo Single-Cell RNA-Seq Data

Corinne Sexton

Follow this and additional works at: https://digitalscholarship.unlv.edu/thesesdissertations

Part of the Bioinformatics Commons, and the Genetics Commons

Repository Citation Sexton, Corinne, " Expression in Human Embryo Single-Cell RNA-Seq Data" (2020). UNLV Theses, Dissertations, Professional Papers, and Capstones. 4026. http://dx.doi.org/10.34917/22110092

This Thesis is protected by copyright and/or related rights. It has been brought to you by Digital Scholarship@UNLV with permission from the rights-holder(s). You are free to use this Thesis in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s) directly, unless additional rights are indicated by a Creative Commons license in the record and/ or on the work itself.

This Thesis has been accepted for inclusion in UNLV Theses, Dissertations, Professional Papers, and Capstones by an authorized administrator of Digital Scholarship@UNLV. For more information, please contact [email protected]. TRANSPOSABLE ELEMENT EXPRESSION IN HUMAN EMBRYO

SINGLE-CELL RNA-SEQ DATA

By

Corinne Sexton

Bachelor of Science – Bioinformatics Brigham Young University 2018

A thesis submitted in partial fulfillment of the requirements for the

Master of Science – Biological Sciences

School of Life Sciences College of Sciences The Graduate College

University of Nevada, Las Vegas August 2020

Thesis Approval

The Graduate College The University of Nevada, Las Vegas

June 17, 2020

This thesis prepared by

Corinne Sexton entitled

Transposable Element Expression in Human Embryo Single-Cell RNA-Seq Data is approved in partial fulfillment of the requirements for the degree of

Master of Science – Biological Sciences School of Life Sciences

Mira Han, Ph.D. Kathryn Hausbeck Korgan, Ph.D. Examination Committee Chair Graduate College Dean

Philippos Tsourkas, Ph.D. Examination Committee Member

Jeffery Shen, Ph.D. Examination Committee Member

Edwin Oh, Ph.D. Graduate College Faculty Representative

ii

Abstract

Transposable elements (TEs) are genetic sequences which are mobile within the , including DNA transposons and . Though the vast majority are no longer able to move or duplicate in humans, they still are actively transcribed in both germline and somatic cells, particularly in early human development. TEs are expressed in an extremely cell-type and stage specific pattern during embryogenesis, suggesting that they may either have a regulatory role in the cell or be transcribed along with cell-specific genes. However, earlier studies have focused on hESC models or early embryos up to day 6, with differing patterns of TE expression.

Here, I investigate the pattern of differentially expressed TEs in single-cell RNA- seq data obtained from human embryos ranging in age from day 6 to day 14 post fertilization. The high resolution of this dataset lays bare the expression pattern of specific retrotransposons known as human endogenous retroviruses, specifically of subfamily H (HERVH), found exclusively in epiblasts. I confirm that TE expression is cell-type specific and does not rely entirely on proximity to cell-type specific genes.

Additionally, HERVHs have a similar pattern of expression to stem cell and pluripotency regulation genes according to network analysis. Functionally, HERVH elements contain open reading frames which could have coding potential. Lastly, HERVHs associated with epiblasts are enriched for developmental binding sites, suggesting that they may be candidate enhancers. This work demonstrates that HERVH expression is epiblast cell specific up to 14 days post fertilization, and provides a starting point for further analysis on the functional potential of these HERVs as promoters, enhancers or possibly as coding proteins.

iii Table of Contents

Approval page ii Abstract iii Table of Contents iv List of Figures v 1. Background 1 1.1 Transposable Elements 1 1.2 Predicted Functional Roles of TEs 2 1.3 TEs in Human Developmental Biology 3 2. Differential TE Expression in Early Development 6 2.1 Methods 6 2.2 Differential Expression Between Cell Types 8 2.3 Epiblast Specific TE Expression 11 2.4 Coding Potential of Differentially Expressed HERVH Elements 13 3. Co-expression of TE Elements with Genes 16 3.1 Methods 16 3.2 Proximity of TEs to Genes 16 3.3 TE Element Clustering 19 3.4 Gene Set Enrichment Analysis of TE/Gene Clusters 22 4. Transcription Factor Binding Sites in HERVH Elements 24 4.1 Methods 24 4.2 Sequence Divergence Among Differentially Expressed HERVHs 25 4.3 Enrichment of TFBS in Differentially Expressed HERVHs 26 4.4 HERVH TFBS Target Correlation 30 5. Discussion 33 Appendix 36 Curriculum Vitae 50

iv List of Figures

Figure 1. Figure from Xiang et al Describing Differentiation of Embryonic Cells. 4

Figure 2. Distribution of TEs Included in Differential Expression Analysis. 7

Figure 3. Log2 Normalized Expression of 451 Significant TEs in All Cell Types. 9

Figure 4. Volcano Plot of EPI vs EVT log2 fold change. 10

Figure 5. PCA Plots of TE expression. 11

Figure 6. Log2 Normalized Expression of 26 Significant TEs in Epiblast Cells. 12

Figure 7. Boxplot of Length of HERVH Alignment with Viral Protein Domain Profiles. 14

Figure 8. Boxplot of Coding Potential for the 155 Elements Which Matched Domains. 15

Figure 9. Expression of Gene Transcripts within 100kb of HERVH Elements. 17

Figure 10. Expression of a Subset of Gene Transcripts within 100kb of HERVH Elements. 18

Figure 11. Normalized Expression z-scores within Module 8 and Module 32. 20

Figure 12. Topological Overlap Matrix for Module 8 and Module 32. 21

Figure 13. Gene Set Enrichment Analysis for HERVH Module 32. 22

Figure 14. Heatmap of Genes within Module Present in Enriched REACTOME Gene Set. 23

Figure 15. Significant HERVH Elements Clustered Based on Sequence Similarity with Corresponding Ordered log2 Expression Values. 26

Figure 16. Upset Plot of Significant Transcription Factor Between the Three HERVH Clusters. 27

Figure 17. Normalized Heatmap of Transcription Factors Found in HERVHs and Known to be Upregulated in Epiblast Cells. 30

Figure 18. Normalized Heatmap of Downstream Genes. 31

v 1. Background

1.1 Transposable Elements

First discovered in 1956, transposable elements (TEs) are mobile DNA sequences that are able to move within the genome. Although Barbara McClintock, the scientist who discovered TEs, speculated that TEs had some sort of regulatory potential, for years they were regarded as simply “junk” DNA without any real purpose

(1). Recent evidence however has shown that TEs potentially play an important role in both the evolution of regulatory pathways as well as gene regulation itself (2).

TEs are found in nearly all eukaryotic and specifically make up almost

50% of the human genome (3). TEs can be categorized into two classes: DNA transposons and retrotransposons (4). Though both classes are capable of transposition, DNA transposons transpose through a DNA intermediate and retrotransposons transpose through an RNA intermediate.

Although almost all TEs are not actively transposing in the modern human population, specific retrotransposons in the endogenous retrovirus (ERV) subfamily are systematically transcribed during the development of human embryos (5). ERV elements make up 8-10% of the human genome (3) and are characterized by identical flanking long terminal repeat (LTR) sequences at both the 5’ and 3’ ends. A full length

ERV element is about 9kb and contains three coding domains which include gag, pol and env (6), however most, if not all, ERVs no longer contain protein-coding domains because of mutation. Also because of the similarity between the flanking LTR sequences, often ERVs recombine incorrectly, completely deleting the coding region and leaving only a solo LTR element (7). ERVs are also extremely abundant in

1 embryonic cells. In mouse embryos they have been shown to make up 19.36% of the total transcripts expressed in full-grown oocytes and two-cell embryos (8).

Human ERVs (HERV) are a subfamily of ERVs which were first observed in humans (9). Specific cases of HERVs have been observed to regulate gene expression through activities as enhancers, promoters, repressors, and by providing alternative splice sites (5,10–12). Additionally, HERVs have been observed to be highly expressed in individuals with disorders such as autism (13), bipolar disorder and schizophrenia

(14).

1.2 Predicted Functional Roles of TEs

The functional role of highly expressed HERVs and TEs in general has not been clearly defined, though several hypotheses have been put forward. One major component of TE structure which lends itself to the function of a regulatory element is the presence of transcription factor binding sites in TEs. Though many families of TEs harbor binding sites, LTR elements (the family which contains ERVs) are particularly enriched. They make up 19% of all human TE elements, but based on a set of CHIP- seq experiments done on K562 and GM12878 cell lines, they contained 39% of the binding peaks which were found in TEs. The same study found that on average 20% of the detected transcription factor binding peaks for the 26 transcription factors they tested were present in TE elements (15).

This prolific presence of transcription factor binding sites gives merit to the idea that TEs may be used as cis-regulatory elements, such as enhancers, promoters or insulators. Using modifications as input, Cao et al (16) built a random forest based machine learning model to predict enhancer activity of TE elements in different

2 cell types. Their results showed that most TEs have the potential to be enhancers and of the potential enhancers, 74.35% are within 10 kb of a gene expression quantitative trait loci (eQTL). The proximity to these eQTLs explains part of how these TEs could have functional impacts on gene expression. They also observed a strict cell-type specific pattern based on the predicted enhancer TEs.

Though there is much evidence supporting TEs as cis-regulatory elements, that hypothesis does not entirely explain the extremely tissue-specific TE expression pattern observed in RNA-seq data. Possible explanations for this pattern could include that TEs function as alternative transcription start sites (TSS) for genes (17) or that expressed

TEs themselves function similarly to long non-coding RNAs (lncRNA) by performing post-transcriptional modifications (17).

Overall, the observed pattern of cell-specific TE expression leads to questions of what function TEs play in the gene regulatory network and what mechanisms they employ in order to influence regulation.

1.3 TEs in Human Developmental Biology

The dataset used in this study is described in depth in section 2.1, but here I provide a brief overview of the early developmental biology discussed in the study. As shown in Figure 1 (taken from Xiang et al (18)), the single-cells sequenced in this dataset range in age from day 6 post-fertilization to day 14 post-fertilization. These cells are classified by day post fertilization as well as cell type, as determined by marker gene expression and developmental time. The cell types in the dataset are inner cell mass (ICM), epiblast (EPI), primitive endoderm/hypoblast (PrE), trophoblast (TrB) which

3 includes cytotrophoblast (CTB), syncytiotrophoblast (STB) and extravillous cytotrophoblast (EVT), and primitive streak anlage epiblast (PSA-EPI).

Figure 1. Figure from Xiang et al Describing Differentiation of Embryonic Cells. (18) Ranging from 6 to 14 days post fertilization (dpf), this illustration shows the cell types and locations in the 3D culture that was used to produce the single-cell dataset for this thesis.

These cells are taken from the blastocyst stage and undergo differentiation between trophoblasts (the outer layer) and epiblasts (the inner layer). Of particular interest are the cells within the epiblast layer, which displays marked TE expression

(see 2.3). These epiblast cells, pictured in red in Figure 1, represent the founding cell population and will differentiate into three germ layers, ectoderm, mesoderm and endoderm, which make up the embryo (19).

It is well observed that TE expression is extremely prolific during early embryo stages, specifically before the blastocyst stage. This is attributed to the fact that tends to be more accessible early on in embryogenesis. In fact, Pontis et al found that TEs harbored at least 33% of the open chromatin sites in preimplantation embryos (20). This upregulation in TE transcription makes early development a prime environment to study the role of TEs in gene regulation.

4 The youngest cells in the dataset used in this thesis are at the beginning of the blastocyst stage and therefore general TE expression is low. However, ERVs and specifically HERVs are systematically expressed later on in development in blastocyst cells and more specifically in pluripotent epiblast cells (5). This expression is so unique that HERVH has been proposed as a marker for pluripotency (21). Interestingly, while in vitro pluripotent stem cell gene expression is similar to the blastocyst gene expression,

ERV expression differs significantly between in vitro cell lines and blastocyst single-cell

RNA-seq (5). Our hypothesis is that because this single-cell embryo data was grown in

3D culture and is more mature than the blastocyst stage, it will therefore be more similar to hESCs cell lines than preimplantation embryo single-cell data.

TEs also play a major role in the regulatory network in early embryo development and contain extensive pluripotency-specific transcription factor binding sites. Three of the major human transcription factors used to induce pluripotency are OCT4 (POU5F1),

SOX2, and KLF4 (22). In mouse ESCs, Sundaram et al found that 20.2% of OCT4,

26.9% of SOX2, and 13.3% of KLF4 CHIP-seq peaks occurred within TEs, with the majority of those overlapping with LTR elements specifically (23).

In addition to transcription factor binding sites, studies have shown that a knockdown of specific ERVs through RNA interference can lead to changes in cell morphology and downregulation of known pluripotency factors (24). The specific expression pattern of ERVs, the significant presence of transcription factor binding sites, and the cell changes induced by ERV knockdown all suggest that ERVs may play a functionally important role in the pluripotency state of stem cells.

5 2. Differential TE Expression in Early Development

2.1 Methods

The dataset used in all analyses for this thesis comes from Xiang et al (18).

Briefly, the dataset was obtained from the donated surplus embryos of couples who already had a healthy baby by in vitro fertilization. These embryos were grown in a specialized 3D culture and then single-cell RNA-seq was performed following the

Smart-seq2 protocol, which is able to capture full-length cDNA from single cells (25).

Paired-end reads with a length of 150 bp were obtained for 555 single cells from 42 embryos ranging in age from day 6 to day 14 post-fertilization. This dataset was also separated into cell types based on analysis of specific marker genes.

For this analysis, raw reads of the single-cell data were aligned to the hg38 human genome using Bowtie1 (26) allowing for 10 best hits to be reported for every read (-k 10). Though Bowtie1 is a DNA sequence aligner, it was deemed the appropriate tool in this case because TEs are not generally spliced and using an RNA-seq specific aligner with built-in splicing detection capabilities, such as STAR, can lead to erroneous gapped TE read alignment across vast distances in the genome.

After reads were aligned, TE quantification was performed using a modified version of TEtranscripts (27) as described in Chung et al (28). This modified version of

TEtranscripts discounts quantification of TEs which originate from intronic regions in the genome and also quantifies each specific instance of a TE rather than providing only an aggregated family level quantification. This can be thought of as similar to reporting transcript versus gene counts. Though it is generally thought that TEs are too repetitive to accurately assess count levels at the instance level, in our previous work we showed

6 that individual human TEs are in fact unique enough for these instance-level counts to be valid for downstream analysis (29). TE counts were filtered to only include those TEs which were assigned 5 or more raw reads in at least 25 single cell samples, resulting in

33,614 TE elements. The distribution of those elements across family and class distinctions is shown in Figure 2.

Figure 2. Distribution of TEs Included in Differential Expression Analysis. Of the 33,614 elements which passed filtering of raw counts, the majority are LINE elements, followed by SINE, LTR, and DNA elements. Specifically the L1, Alu, MIR, and ERV families dominate the TEs included in the analysis.

Raw TE counts were then processed by DESeq2 (30), following the recommendation from Van den Berge et al (31) to employ a zero-inflated binomial model which helps account for the vast amounts of zero counts present in single-cell datasets. After processing, differential expression analysis was performed with a likelihood ratio test (LRT) which is suggested by the authors of DESeq2 when analyzing single cell data. Unlike the traditional Wald test, which tests the estimated standard error of log2 fold change, the LRT tests a difference between two models: one model with all terms specified, and the other a reduced model. This test is especially useful when

7 comparing several different states, such as numerous different cell types or across a time series. In the set up for this analysis, two models were tested. These included 1) a full model of days post fertilization and cell-type with a reduced model based solely on days post fertilization to investigate differences in cell-type specific TE expression and

2) a full model including only days post fertilization and a reduced model only containing the intercept to investigate differences in TE expression in a time series across epiblast cells specifically.

For both tests, significance of TEs was based on an FDR-adjusted p-value <

10e-5, a base mean of reads across samples > 10, a log2 fold change >= 4 and a log2 fold change standard error < 1.

2.2 Differential Expression Between Cell Types

In keeping with current research, TE expression is generally low in these cells.

Even after the extensive filtering performed above (see 2.1 Methods) only 3% of the quantified TEs have a median normalized count value greater than 0 across the 555 single-cell samples. TE transcripts are highly expressed early in development, peaking in humans near the 8-cell stage of embryogenesis (32). This dataset is taken from cells which are already older than the peak stage, so the lower expression is to be expected.

However there are still a few families of TEs that are highly differentially expressed between cell types.

The heatmap in Figure 3 shows the normalized log2 counts for the significant TEs across all cell types. Though these significant TEs belong to several different families, the most obvious expression pattern arises in the epiblast cells with a strong

8 upregulation of the ERV1 family of the LTR class. Specifically, the vast majority of these elements are members of the HERVH-int subfamily.

Figure 3. Log2 Normalized Expression of 451 Significant TEs in All Cell Types. This heatmap shows the normalized count expression data for the 451 significant TE instances determined by p-value < 10e-5 and log fold change between epiblast cells and extravillous trophoblast >= 4. Each row represents a single TE, annotated by TE family and TE class and columns are samples sorted by cell type and days post fertilization.

Among the 451 TEs which were deemed significant based on the above criteria

(see 2.1 Methods), there is a significant enrichment for LTR elements (p-value =

9.82e-150, Fisher’s Exact Test) and specifically for the subfamily HERVH-int (p-value =

1.92e-263, Fisher’s Exact Test). The volcano plot in Figure 4 shows the log2 fold

9 changes between epiblast (EPI) and extravillous trophoblast cells (EVT) and the p- values for the LRT test used to detect differences in expression between cell types. The enrichment for an upregulation of LTR elements is evident in the plot as almost all significantly upregulated TEs belong to the LTR family and HERVH-int subfamily.

Figure 4. Volcano Plot of EPI vs EVT log2 fold change. Each TE element is plotted with its -log2 p-value from the LRT test which detects differential expression between cell types on the y-axis and the log2 fold change between epiblast (EPI) and extravillous trophoblast (EVT) cells on the x-axis. The majority of the significant plotted LTRs are HERVH-int elements (251 / 325). Plot A and B show the same data with different labels

The cell type specificity of the TE element expression can easily be observed in the PCA plot generated using only normalized TE transcript counts. In fact, including only the 451 significant TEs in the PCA analysis is sufficient to capture the same cell- type specific pattern as shown in Figure 5.

10

Figure 5. PCA Plots of TE expression. Both plots show PCA coordinates labeled by color for cell type and shape for days post fertilization. Graph A was generated based solely on the 451 TE elements that were deemed significantly differentially expressed. Graph B was based on all TEs included in the differential expression analysis. Neither graph includes any gene counts.

Though there is not a clear clustered separation between cell types or day post fertilization, this u-shaped pattern is a common PCA result in cell differentiation and development datasets (33). This strong pattern in the PCA graphs indicates that TE expression, and specifically the 451 TE elements that were deemed significantly differentially expressed, could play a role in cell-type specific processes.

2.3 Epiblast Specific TE Expression

I further analyzed this distinct group of HERVH-int elements which show a strong upregulation expression pattern in epiblast cells with another model. This model controlled for cell type. I isolated only the cells that were characterized as epiblast cells and evaluated the changes in TE expression over a time-course.

As shown in Figure 6, several LTR elements belonging to the ERV1 family are significantly differentially expressed over time. Specifically, several elements of the

HERVH-int subfamily show strong expression early on between day 6 to day 10

11 postfertilization and then lessen by day 14 postfertilization. Surprisingly some elements also show the opposite trend, increasing expression through time.

Figure 6. Log2 Normalized Expression of 26 Significant TEs in Epiblast Cells. This heatmap shows the normalized count expression data for the 26 significant TE instances determined by p-value < 10e-5 and log fold change between day 6 cells and day 14 cells >= 4. Rows represent a TEs, annotated with family and class, and columns represent samples sorted by cell type and days post fertilization.

Just as in the model investigating cell type, LTR elements, specifically HERVH-int elements, are the most prevalent transcripts among TEs with significantly different expression patterns over time, suggesting that potentially ERVs may play a functional role in these cells. We further investigated these HERVH-int elements to try and determine why these elements are so highly expressed in pluripotent human embryonic cells, and specifically in epiblast cells.

12 2.4 Coding Potential of Differentially Expressed HERVH Elements

Functional full-length ERVs contain gag, env, and pol proteins (6). Though almost all HERVs have acquired mutations which render their proteins inactive, there are a few notable cases where a protein can still be transcribed from the HERV sequence. For example, there is one HERVW locus from which a functional envelope protein named syncytin is transcribed. Syncytin is present in placental tissue and is directly involved in trophoblast differentiation (34). Viral gag proteins and viral-like particles from HERVK have also been found in pluripotent cells (35).

Because of the strong pattern of the HERVH element expression in the above differential expression analysis, I looked to see if there was any coding potential within these elements. While it is impossible to claim that these sequences contain coding sequences without empirical evidence, I looked for computational markers of coding potential.

First I used HMMER (http://hmmer.org/) to search the 251 differentially expressed

HERVH elements for a set of marker retrovirus protein domain Hidden Markov Model profiles obtained from Zdobnov et al (36). Of the 251 HERVH elements, 118 elements had a match with the Gag P30 core shell protein profile, 47 elements had a match with the Retroviral aspartyl protease profile, and 63 elements had a match with the reverse transcriptase (RNA-dependent DNA polymerase) profile (all E-values < 0.01). The lengths of these matches are shown in Figure 7.

13

Figure 7. Boxplot of Length of HERVH Alignment with Viral Protein Domain Profiles. Each box represents one of the three different protein domains: Gag P30 core shell protein (GAG), Retroviral aspartyl protease (RVP), reverse transcriptase (RVT). The y-axis is the amount of amino acids (AA) of the HERVH element which coincided with the HMM protein domain profile.

To further investigate the possibility of coding in these elements, I used the CPC2

(Coding Potential Calculator) software which predicts coding potential based on a support vector machine which is trained on four parameters: Fickett score, open reading frame (ORF) length, ORF integrity and isoelectric point (37). Based on this model, the coding probability scores of the 155 HERVH sequences which matched at least one viral protein domain are shown in Figure 8.

14

Figure 8. Boxplot of Coding Potential for the 155 Elements Which Matched Domains. The y-axis is the coding potential as calculated by CPC2. Each element is sorted by the protein domain(s) which they matched and also subsetted by the scores for coding probability on the forward or reverse strand.

In total, 32 HERVH elements of the queried 155 HERVH elements have a coding probability of > 0.75, suggesting that there may be some merit to investigating the coding potential of these elements.

15 3. Co-expression of TE Elements with Genes

3.1 Methods

To investigate the co-expression of significant TE elements with genes, I used the weighted correlation network analysis (WGCNA) software (38). Briefly, WGCNA traditionally creates correlation-based gene modules to identify hubs of closely related genes. I used this same approach to investigate the relatedness of TE element expression and gene expression. The network was built using normalized counts from

451 significant TEs identified from differential expression analysis and 75,986 gene transcripts for which expression was calculated using salmon (39). In order to reduce the number of conditions present and the noise in the network, I removed samples in the analysis which were 12 or 14 days post fertilization.

After the network was created, I looked at genes and TEs clustered within a module as well as intramodular correlation based on eigengene correlation analysis. To discover any functional attributes of the genes which clustered with significant TEs or

TE modules, I performed gene set enrichment analysis using the anRichment R package which was written by the same authors who created WGCNA (40). I evaluated enrichment of these genes against several different datasets: the Gene Ontology database (GO) (41,42), the KEGG Pathway Database (43–45), and the REACTOME

Pathway Database (46).

3.2 Proximity of TEs to Genes

Before performing WGCNA, I wanted to ensure that any co-expression between

TEs and genes was not simply a side effect of proximity to one another. In other words,

I investigated the possibility that an HERVH element could be located so near a gene

16 that co-transcription would occur. As shown in figure 9, within 100kb of all gene and lncRNA transcripts in the Gencode v33, in total 224 / 251 total HERVHs were near 497 genes or lncRNA elements.

Figure 9. Expression of Gene Transcripts within 100kb of HERVH Elements. Rows are gene or lncRNA transcripts which were located within 100kb of one of the 251 significant HERVH elements. Columns represent samples. Log2 normalized expression data was scaled to produce z-scores across each row. Red represents relative high expression and blue represents relative low expression.

As shown in Figure 9, in general the expression of nearby genes is very different from the HERVH expression signature. However there is one block of gene transcripts in particular which exhibits the same epiblast-specific upregulation of expression. Figure

17 10 shows the same data as Figure 9, but focusing only the cluster of elements exhibiting this expression pattern.

Figure 10. Expression of a Subset of Gene Transcripts within 100kb of HERVH Elements. Rows are gene or lncRNA transcripts which exhibit similar expression patterns to HERVH elements. Columns represent samples. Log2 normalized expression data was scaled to produce z-scores across each row. Red represents relative high expression and blue represents relative low expression.

This cluster of genes consists of 21 protein-coding genes and 29 lncRNA loci which are found within 100kb of 50 HERVH elements. Therefore this shows evidence that co-transcription could be a source of the epiblast-specific expression found in

HERVH elements in some instances, but it does not account for all 251 differentially expressed HERVH elements. A table of the transcripts shown in Figure 10, along with the HERVH elements which are within 100kb is found in Table 1 in the appendix.

18 3.3 TE Element Clustering

Of the total 193 network modules, the 451 significant TEs clustered into 56 separate modules. Two modules in particular contained the majority of HERVH elements; module 32 contained 142 total TEs (125 HERVH elements) and 469 gene transcripts and module 8 contained 60 total TEs (57 HERVH elements) and 1,207 gene transcripts. The breakdown of gene types and TEs present in these two modules is shown in Table 1.

Module TEs HERVHs Protein Coding Pseudogene RNA Other Total Genes ME 8 60 57 999 42 152 14 1,207 ME 32 142 125 384 10 70 5 469

Table 1. Type Statistics of Modules which Contained Numerous TEs. The TE and gene type membership of the 2 modules which contain the most TEs.

The heatmap of normalized scaled expression counts for all members of module

8 and module 32 are shown in Figure 11A and 11B respectively. Visually, it’s clear that although many TEs were clustered into module 8 by WGCNA, the expression pattern of those TEs in module are different when compared to the other protein coding genes in the same module, whereas module 32 has a clear correlation in expression between protein coding genes and TEs. Because module 8 contains so many genes and the expression pattern is so diverse, I only include module 32 in the following functional enrichment analysis.

19 Figure 11. Normalized Expression z-scores within Module 8 and Module 32. Rows are TEs or genes contained within A) module 8 or B) module 32. Columns represent samples excluding those 12 or 14 days past fertilization. Log2 normalized expression data was scaled to produce z-scores across each row. Module 8 includes 1,267 members and module 32 includes 611 members. Red represents relative high expression and blue represents relative low expression.

I graphed the topological overlap matrices for modules 8 and 32 to see how similar the connectivity of the HERVHs were in comparison to genes within the same module. Topological overlap calculations factor in the connectivity of each gene pair to all other genes, rather than a simple adjacency calculation which uses the relationship between two genes only (47). Shown in Figure 12, the TE elements tend to be very closely related to each other topologically, in some cases even more related to TEs in other modules than other genes within their own module. From this heatmap we can also see that the topological overlap for TEs and genes in module 8 are not especially strong as compared to TE-TE overlap in module 32. This suggests that although TEs show similar RNA expression to other genes, they still retain a fairly unique signature in comparison to the entire network.

20

Figure 12. Topological Overlap Matrix for Module 8 and Module 32. Rows and columns are gene transcripts or TEs (annotated top) which are members of module 8 or module 32 (annotated left). The heatmap shows topological overlap with the darker red color representing higher degree of overlap and the light yellow representing lower degree of overlap. Dendrograms were hierarchically clustered based on the topological overlap distance.

I checked the location of each TE to ensure that the clustering of these modules was not a side effect of TEs being located within 100,000 base pairs of the genic regions they cluster with. Within module 8, there are 11 TEs out of 60 total TEs which overlap with 14 genes out of 1,207 total genes. Within module 32, there are 18 TEs out of 142 total which overlap with 19 genes out of 469 total genes. Because there is such a small amount of overlap between the nearby genes and the TEs in these modules, we can assume that extensive TE co-expression with genes in these modules is not exclusively a side effect of proximity.

21 3.4 Gene Set Enrichment Analysis of TE/Gene Clusters

Using the KEGG, REACTOME, and GO gene set databases, I performed a gene set enrichment analysis on the TE modules which contained the majority of HERVH elements, module 32. The enrichment results are shown in Figure 13.

Figure 13. Gene Set Enrichment Analysis for HERVH Module 32. The results of the gene set enrichment analysis module32.. The -log10 FDR corrected p-value is shown by the color and the enrichment ratio (observed/expected) correlates to the size of the dot.

The main area of enrichment for genes within module 32 is stem cell and pluripotency regulation. Some of the main pluripotency transcription factors are present in this module as shown in Figure 14. It is well known that HERVH is particularly active

22 in pluripotent cells and it has even been considered as a useful human-specific pluripotency marker (48). Also found in this module are two long non-coding RNA transcripts, linc-ROR and linc00458 which are known to be derived from HERVH elements and play a major role in pluripotency (49). For example, linc-ROR is a target of

POU5F1 (OCT4), SOX2, and NANOG transcription factors and has been shown by knockdown experiment to be functional for iPSC derivation (50).

Figure 14. Heatmap of Genes within Module Present in Enriched REACTOME Gene Set. The gene transcripts which were members of the REACTOME gene set and also members of module 32. Gene names are duplicated for expression of different isoforms of the genes.

While it is interesting to note that these TEs are clustered with other pluripotency markers such as POU5F1 (OCT4) and NANOG, it is still unknown what functional role these TEs play in regulating or maintaining pluripotency in stem cells.

23 4. Transcription Factor Binding Sites in HERVH Elements

4.1 Methods

Transcription factor binding sites were identified computationally using Clover

(51) which detects statistical over-representation of motifs in sequences. Clover requires a pre-compiled database of transcription factor motifs and I used the

HOCOMOCO motif collection for this analysis which contains models for 680 human transcription factors (52). To detect over-representation, Clover uses a background set of sequences to contrast enrichment against. For these analyses, I provided the sequence of human chromosome 11 to use as a background set and therefore all p- values listed in these results will be describing significance of transcription factor enrichment in relation to chromosome 11 as well as all other HERVH sequences which were not deemed significantly differentially expressed (n = 5,836).

Because Clover is a sequence-based algorithm, I used Clustal Omega to ascertain sequence similarity within the HERV subfamily (53,54). The percent identity matrices were obtained by running a multiple sequence alignment with the 251 HERVs that were deemed significant based on the methods described in Section 2.1. I then hierarchically clustered the percent identity matrices to identify distinct clusters of

HERVs with similar sequences. Three distinct groups of HERVs resulted from this analysis and all downstream Clover analyses were run for each group separately.

I also investigated the upstream and downstream gene expression effects in relation to the resulting transcription factors with significantly enriched binding sites in

TEs. To identify genes which are downstream targets of these transcription factors I used the TRRUST database which was created based on sentence-based text mining

24 techniques of pubmed articles (55). It is manually curated, and contains 8,444 regulatory relationships for 800 human transcription factors.

4.2 Sequence Divergence Among Differentially Expressed HERVHs

Traditionally, HERV elements were classified based on the amino acid found at the proviral binding site for host tRNAs, hence the name HERVH because the binding site contained a histidine (56). However this method was used out of necessity before we had access to comprehensive genetic sequencing data. This has since changed and more advanced methods have been employed such as determining relationships based on the sequence similarity of viral protein with ERVs (57). Despite the more complex methods used today, there still remains a debate over correct naming conventions for both ERVs and TEs in general.

Because the transcription factor analysis is completely sequence-based, I decided to investigate the sequence similarity of the 251 HERVH elements that I was interested in. After completing a multiple sequence alignment with Clustal Omega

(described in 4.1), three distinct groups of HERVHs emerged based on percent identity.

Though this alignment was not based on the pol viral protein, the strong pattern indicates that these clusters may have divergent sequences even though they are in the same subfamily.

25 Figure 15. Significant HERVH Elements Clustered Based on Sequence Similarity with Corresponding Ordered log2 Expression Values. Based on clustering obtained by using a multiple sequence alignment from Clustal Omega there are 3 major groups among the 251 significant HERVH elements, shown on the left. The corresponding heatmap shown on the right is ordered based on these cluster assignments.

In Figure 15, I included the log2 normalized expression counts for each of the 251 significant HERVHs in order of hierarchical clustering based on the percent identity matrix. The largest cluster seems to have a unique pattern of high expression during the inner cell mass (ICM) stage of embryo development when compared to the two smaller clusters. Because of this cluster specific pattern, I next investigated the specificity of transcription factor binding sites within and between the three clusters.

4.3 Enrichment of TFBS in Differentially Expressed HERVHs

Clover was run for each cluster separately to assess transcription factor binding sites (TFBS) which are significantly enriched when compared to background sequences of both chromosome 11 and non-significant HERVH elements. The results of the number of enriched transcription factors for each separate cluster are shown in Figure

26 16. The list of TFBSs which were significantly enriched in at least 2 clusters is shown in

Table 2 in the appendix.

Figure 16. Upset Plot of Significant Transcription Factor Between the Three HERVH Clusters. This set graph shows the intersections in significantly enriched transcription factor binding sites (TFBS) between the three HERVH clusters identified by percent identity in Figure 13. The list of TFBSs which were significantly enriched in at least 2 clusters is shown in Table 4 in the appendix.

The largest cluster was cluster three which contained 177 HERVH elements and was significantly enriched for 182 TFBSs (p < 0.05, in two background sequences).

Clusters one and two contained 31 and 42 HERVH elements and were significantly enriched for 44 and 72 TFBSs, respectively. All three clusters were enriched for mostly

27 unique TFBSs as compared to the other cluster results. In fact, only 2 TFBSs were enriched in all three clusters: STA5A and SUH.

The STA5A transcription factor is encoded by the STAT5A gene which activates transcription in the nucleus in response to cytokines. STAT5A expression is found in nearly all cells and is vital to processes such as cell proliferation and differentiation (58).

It is critical to the immune system and regulated by several different interleukins and cytokines. In a study looking at autism in mice, proinflammatory cytokines were upregulated along with ERVs in case versus control mice (59). This association of cytokines with ERVs is interesting considering the role of STAT5A in the immune system. The role of STAT5A in early embryogenesis is not clear, but it has been shown to function in preimplantation development by mediating the signals of cytokines (60).

SUH is the protein encoded by the RBPJ gene which is important as a transcriptional repressor and activator in the Notch signaling pathway. The default activity of SUH (CSL; also known as RBPJ) is transcriptional repression, through interaction with a histone deacetylase (HDAC) compressor complex. Upon Notch activation, SUH forms a complex with the activated Notch intracellular domain (NICD), together with the co-activator Mastermind (Mam; Mastermind-like transcriptional co- activator 1 MAML1), and stimulates transcription of target genes of the Notch signaling pathway (61). RBPJ is often down-regulated in human tumors and when depleted in mice tumors leads to accelerated cell death (62). Besides its role as a repressor, RBPJ is known to bind methylated DNA (63) which is interesting to note considering many TEs are highly methylated to prevent transcription (64). Also, RBPJ has been known to act

28 as a transcriptional activator under hypoxic conditions (65) and TEs are known to become activated during stress conditions (66).

The most enriched TFBS in the largest cluster is ZN281 (Zfp281) and is required for the establishment and maintenance of primed pluripotency in mouse and human

ESC. It is also essential for mouse epiblast maturation (67,68).

Interestingly, two commonly observed TFBSs in development, POU5F1 (OCT4) and KLF4, were absent in these elements.The original paper that investigated this dataset found that KLF4 was only expressed in early ICM cells and not highly expressed in epiblast cells (18). Other relevant pluripotency transcription factors were specific to certain clusters; SOX2 and ESRRB binding sites were only present in cluster three and NANOG binding sites were only present in cluster two. Also interesting, KLF4,

NANOG, and SOX2 sites were all found to be present in the 9 LTR7 elements which were also differentially expressed in epiblast cells. LTR7 is the flanking region that traditionally encloses the HERVH element.

I investigated the ENCODE database (69) to try and determine any empirical evidence for actual TF binding at these sites. For the two available NANOG ChIP-seq datasets on H1-hESC cell lines (ENCSR000BMT) there was no overlap in binding peaks with the HERVH elements. It is important to note that the HERVH elements studied here do not overlap with the “blacklisted regions” specified in Amemiya et al (70) and therefore CHIP-seq experiments could be performed to explore binding of other transcription factors. For future work, it will be important to use technologies such as

CHIP-seq and ATAC-seq to validate that these significantly enriched TFBSs do in fact

29 provide HERVH sequences with the ability to bind transcription factors and thus regulate expression.

4.4 HERVH TFBS Target Gene Expression Correlation

To see if HERVH elements could be regulated by transcription factors, I investigated the expression pattern of known target genes for the TFBSs present in

HERVH elements. Of the TFBSs which were significant in at least one of the HERVH clusters, 17 of those TFs were also found to be upregulated in epiblast cells in the original study which produced the dataset. The expression pattern of the isoforms of these TFs is shown in Figure 17 below.

Figure 17. Normalized Heatmap of Transcription Factors Found in HERVHs and Known to be Upregulated in Epiblast Cells. This heatmap shows the expression z-score scaled across samples for each TF gene which was significantly enriched in at least one cluster of HERVH elements and also found to be over-expressed in epiblast cells in Xiang et al.

30 Of the 17 transcription factors identified in Figure 15, I located 8 of their downstream target genes using the TRRUST database which had a similar expression pattern to the HERVH elements. Shown in Figure 18, the downstream genes are labelled with their corresponding transcription factors.

Figure 18. Normalized Heatmap of Downstream Genes. This heatmap shows the expression z-score scaled across samples for each downstream gene. Labelled in parentheses are the transcription factors which are upstream of these genes according to the TRRUST database.

These isoforms of these 8 genes mirror the expression pattern of HERVH elements and are significantly enriched in several GO early developmental pathways including stem cell population maintenance (p = 0.028, FDR corrected) and primitive streak formation (p = 0.013, FDR corrected). Interestingly, only SPP1 is within 100kb of an HERVH element which disproves the idea that any similarities in HERVH and gene expression are entirely due to a side effect of nearby gene transcription.

31 Overall, the similarities between expression patterns and the enrichment for developmentally relevant TFBSs suggests that HERVH elements could be involved in early developmental pathways within epiblast cells.

32 5. Discussion

In summary, using a deeply sequenced single-cell RNA dataset I was able to show the extremely specific upregulation expression pattern of certain HERVH loci in embryonic epiblast cells. Our result confirmed the result of Göke et al which showed that HERVH elements were upregulated in epiblast cell populations of single-cell transcriptomes from the blastocyst, and in a cell line of epiblast-like (naïve) hESCs when compared to conventional hESCs. With this unique single-cell dataset, we were able to show the HERVH expression trajectory in post-implantation 3D cultured cells.

Our hypothesis is that these HERVH elements are either 1) a side effect of co- transcription with nearby genes or 2) that they are functionally relevant with regard to the pluripotency of epiblast cells.

With regards to 1) I found that within 100k, the differentially expressed HERVHs were near around 500 genes. However only a subset of those genes shared a similar expression pattern, resulting in only around 50 HERVHs being near genes expressed in epiblast cells. Therefore I conclude that HERVH epiblast-specific expression is not entirely a side effect of co-transcription with nearby genes.

And to assess 2) functional relevance of this expression pattern, I investigated co-expression of HERVHs with known genes, presence of transcription factor binding sites in HERVHs, and HERVH coding potential (Table 2). The differentially expressed

HERVH elements were co-expressed with known pluripotency factors and other genes enriched for stem cell pluripotency maintenance suggesting that they may be functionally relevant. This idea is also bolstered by an RNAi knockdown which shows the necessity of HERVH in pluripotency maintenance (24).

33 Analysis Software Conclusion Co-expression of Genes WGCNA HERVHs and pluripotency genes are co- and HERVHs expressed and therefore could be co- regulated. Trancription Factor Binding Clover HERVHs are potential enhancers for Sites in HERVHs developmental genes. Coding Potential of HMMER, HERVHs have ORFs and contain protein- HERVHS CPC2 coding domains. There is a predicted potential of coding ability.

Table 2. Summary of Analyses and Conclusions. The breakdown of the three main analyses in this study as well as the software used to perform them and the conclusions drawn from them.

Interestingly, that same study showed a high number of OCT4 transcription factor binding sites in HERVH elements, whereas I found no evidence of OCT4 sites in these specific HERVHs. I did find other pluripotency factors such as SOX2 and NANOG.

Given the co-expression with pluripotency factors and reports from earlier studies, it will be worth examining the upstream region of the HERVHs, to find binding sites of co- regulating transcription factors. Within the HERVs, we found enrichment of developmental transcription factor binding sites. These binding sites suggest that

HERVH elements could have functional relevance as enhancers for early developmental programs.

Lastly, I briefly investigated the coding potential of these elements. Based on computational approaches, surprisingly there are some elements which both have an open reading frame and align with viral protein coding domains. These two evidences suggest that there may be a chance of coding potential in these HERVH elements, but

34 because I have only investigated computationally this would have to be confirmed by antibody staining to determine if any viral proteins do exist in these cells.

To further determine functional significance of these HERVH elements there are many future experiments to be done. Firstly, the comparison between these 3D-cultured post-implantation cells and other stem cell lines must be investigated to determine an appropriate model system to further study HERVHs. Potentially epiblast-like cell lines may be a better model system than general hESCs. Secondly, the relationship between the transcription factor binding sites found in these HERVHs and the pluripotent state of cells should be investigated through CHIP-seq analysis. This would determine whether transcription factors truly bind to these sequences. And lastly, performing an RNAi knockdown of differentially expressed HERVH elements and observing the behaviors of both cell morphology and other pluripotency genes could help reveal at the function of these HERVHs in epiblast cells.

Overall this work showed the specificity of HERVH expression in embryonic cells up to day 14 post fertilization. This furthers the field by showing that in the older embryo this expression is only found in epiblast cells before PSA differentiation. I also showed that genes essential to pluripotency maintenance are co-expressed with HERVH expression. Therefore, I conclude that because HERVH elements are heavily expressed in non-differentiated epiblast cells they are likely functional as promoters, enhancers or possibly as coding proteins.

35 Appendix

Table 1. Genes which are within 100kb of an HERVH element.

Gene Type HERVH Element(s) AC008517.1 lncRNA HERVH-int_dup2109 AC008753.2 lncRNA HERVH-int_dup5587 HERVH-int_dup5590 HERVH-int_dup5592 AC008753.3 lncRNA HERVH-int_dup5587 HERVH-int_dup5590 HERVH-int_dup5592 AC009055.1 lncRNA HERVH-int_dup5266 AC009055.2 lncRNA HERVH-int_dup5266 AC011447.3 lncRNA HERVH-int_dup5486 HERVH-int_dup5492 AC011447.7 TEC HERVH-int_dup5486 HERVH-int_dup5492 AC027288.1 lncRNA HERVH-int_dup4654 HERVH-int_dup4659 AC064802.1 lncRNA HERVH-int_dup3190 HERVH-int_dup3191 HERVH-int_dup3197 AC104257.1 lncRNA HERVH-int_dup3232 AC104333.4 lncRNA HERVH-int_dup323 HERVH-int_dup327 AC104791.1 lncRNA HERVH-int_dup1837 HERVH-int_dup1839 HERVH-int_dup1841 AC226119.1 lncRNA HERVH-int_dup1504 AGTPBP1 protein_coding HERVH-int_dup3269 AL117378.1 lncRNA HERVH-int_dup2636 AL117378.2 TEC HERVH-int_dup2636 AL121821.1 snoRNA HERVH-int_dup4935

36 AL121885.2 lncRNA HERVH-int_dup5749 AL157817.1 lncRNA HERVH-int_dup4777 HERVH-int_dup4779 AL161431.1 lncRNA HERVH-int_dup4903 AL161449.2 lncRNA HERVH-int_dup3292 HERVH-int_dup3297 AL357562.1 TEC HERVH-int_dup3805 AL392023.1 lncRNA HERVH-int_dup4930 AL392023.2 lncRNA HERVH-int_dup4930 AP000943.2 lncRNA HERVH-int_dup4362 AP003096.1 lncRNA HERVH-int_dup4279 C13orf42 protein_coding HERVH-int_dup4777 HERVH-int_dup4779 CABLES1 protein_coding HERVH-int_dup5353 CALB1 protein_coding HERVH-int_dup3141 DAB2IP protein_coding HERVH-int_dup3480 DAPK1 protein_coding HERVH-int_dup3404 DSCAML1 protein_coding HERVH-int_dup4392 ENPP1 protein_coding HERVH-int_dup2636 FAM124A protein_coding HERVH-int_dup4777 HERVH-int_dup4779 FXYD6 protein_coding HERVH-int_dup4392 FXYD6- protein_coding HERVH-int_dup4392 FXYD2 HDAC2-AS2 lncRNA HERVH-int_dup2353 KIAA1257 protein_coding HERVH-int_dup1225 HERVH-int_dup1228 HERVH-int_dup1230 LINC01067 lncRNA HERVH-int_dup4903 LINC01108 lncRNA HERVH-int_dup2374 LRFN5 protein_coding HERVH-int_dup4935

37 MN1 protein_coding HERVH-int_dup5749 PCDH15 protein_coding HERVH-int_dup4024 SPP1 protein_coding HERVH-int_dup1632 SYT1 protein_coding HERVH-int_dup4654 HERVH-int_dup4659 TESMIN protein_coding HERVH-int_dup4279 TRIB2 protein_coding HERVH-int_dup476 HERVH-int_dup480 VASH2 protein_coding HERVH-int_dup323 HERVH-int_dup327 ZNF90 protein_coding HERVH-int_dup5486 HERVH-int_dup5492

38 Table 2. Transcription Factor Binding Site Enrichment p-values and Scores.

Cluster 1 Cluster 2 Cluster 3 TF p-value score p-value score p-value score

COE1 0 14.1 NA NA 0 132 COE2 NA NA 0 28.2 0 339 EGR4 NA NA 0 38.3 0 429 FEZF1 NA NA 0 111 0 297 FOZ 0.002 5.79 0 17.4 0 NA GATA1 0 10.5 0.997 -4.92 0 40.1 GATA2 0 11.1 1 -3.64 0 75.7 GATA4 0 14.7 NA NA 0 89.4 GSC 0 14 0 18.3 NA NA HEY2 0.003 9.46 1 -5.7 0 143 HOXB13 0 17.7 1 -5.33 0 43.4 HOXC12 0 9.93 1 -5.81 0 109 HOXC12 0 5.78 0.996 -4.37 0 103 IRF9 0 7.09 0.999 -3.63 0 25.3 JUNB 0 7.66 0 11.5 NA NA MAFF NA NA 0 27.4 0 168 MAZ 0 51.9 0.995 -4.32 0 568 MIXL1 NA NA 0 6.7 0 78.8 NOTO 0 3.79 NA NA 0 54.5 NR4A1 NA NA 0 31.2 0 133 NR4A2 0 26.8 1 -4.95 0 144 OTX2 0 19.4 0 20.9 NA NA P73 0 27.8 1 -4.26 0 132

39 PLAGL1 0 13.9 NA NA 0 59.6 POU3F2 NA NA 0 33.2 0 111 PPARA NA NA 0 32 0 182 PRDM1 0 53.4 1 -4.84 0 185 RORA NA NA 0 55.2 0 441 SMAD3 NA NA 0 116 0 288 SP2 NA NA 0 71.6 0 478 SPIC NA NA 0 64.1 0 271 STA5A 0 39.3 0 49.2 0 149 STAT1 0.006 6.44 1 -6.92 0 189 SUH 0 30.3 0 118 0 174 TEAD1 0 24.6 1 -4.51 0 120 TEAD4 0 25.6 1 -4.53 0 224 ZNF354A NA NA 0 21.7 0 190 ZIC1 NA NA 0 21.8 0 157 ZNF219 NA NA 0 31.8 0 129

40 References

1. McClintock B. Intranuclear systems controlling gene action and mutation.

Brookhaven Symp Biol. 1956 Feb;8(8):58–74.

2. Lanciano S, Cristofari G. Measuring and interpreting transposable element

expression. Nat Rev Genet. 2020 Jun 23;1–16.

3. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial

sequencing and analysis of the human genome. Nature. 2001 Feb;409(6822):860–

921.

4. Finnegan DJ. Eukaryotic transposable elements and genome evolution. Trends

Genet. 1989 Jan 1;5:103–7.

5. Göke J, Lu X, Chan Y-S, Ng H-H, Ly L-H, Sachs F, et al. Dynamic Transcription of

Distinct Classes of Endogenous Retroviral Elements Marks Specific Populations of

Early Human Embryonic Cells. Cell Stem Cell. 2015 Feb 5;16(2):135–41.

6. Coffin JM. Structure and Classification of Retroviruses. In: Levy JA, editor. The

Retroviridae [Internet]. Boston, MA: Springer US; 1992 [cited 2020 Apr 3]. p. 19–49.

(The Viruses). Available from: https://doi.org/10.1007/978-1-4615-3372-6_2

7. Sverdlov ED. Perpetually mobile footprints of ancient infections in human genome.

FEBS Lett. 1998;428(1–2):1–6.

8. Peaston AE, Evsikov AV, Graber JH, de Vries WN, Holbrook AE, Solter D, et al.

Retrotransposons Regulate Host Genes in Mouse Oocytes and Preimplantation

Embryos. Dev Cell. 2004 Oct 1;7(4):597–606.

9. Gifford RJ, Blomberg J, Coffin JM, Fan H, Heidmann T, Mayer J, et al. Nomenclature

for endogenous retrovirus (ERV) loci. Retrovirology. 2018 Aug 28;15(1):59.

41 10. Chuong EB, Elde NC, Feschotte C. Regulatory activities of transposable elements:

from conflicts to benefits. Nat Rev Genet. 2017 Feb;18(2):71–86.

11. Jern P, Coffin JM. Effects of Retroviruses on Host Genome Function. Annu Rev

Genet. 2008 Nov 4;42(1):709–32.

12. Ye M, Goudot C, Hoyler T, Lemoine B, Amigorena S, Zueva E. Specific subfamilies

of transposable elements contribute to different domains of T lymphocyte

enhancers. Proc Natl Acad Sci [Internet]. 2020 Mar 19 [cited 2020 Mar 26]; Available

from: https://www.pnas.org/content/early/2020/03/18/1912008117

13. Balestrieri E, Pica F, Matteucci C, Zenobi R, Sorrentino R, Argaw-Denboba A, et al.

Transcriptional Activity of Human Endogenous Retroviruses in Human Peripheral

Blood Mononuclear Cells [Internet]. Vol. 2015, BioMed Research International.

Hindawi; 2015 [cited 2020 Apr 3]. p. e164529.

14. Li F, Sabunciyan S, Yolken RH, Lee D, Kim S, Karlsson H. Transcription of human

endogenous retroviruses in human brain by RNA-seq analysis. PloS One.

2019;14(1).

15. Sundaram V, Cheng Y, Ma Z, Li D, Xing X, Edge P, et al. Widespread contribution of

transposable elements to the innovation of gene regulatory networks. Genome Res.

2014 Dec 1;24(12):1963–76.

16. Cao Y, Chen G, Wu G, Zhang X, McDermott J, Chen X, et al. Widespread roles of

enhancer-like transposable elements in cell identity and long-range genomic

interactions. Genome Res. 2019 Jan 1;29(1):40–52.

17. Brind’Amour J, Kobayashi H, Richard Albert J, Shirane K, Sakashita A, Kamio A, et

al. LTR retrotransposons transcribed in oocytes drive species-specific and heritable

42 changes in DNA methylation. Nat Commun. 2018 Aug 20;9(1):1–14.

18. Xiang L, Yin Y, Zheng Y, Ma Y, Li Y, Zhao Z, et al. A developmental landscape of 3D-

cultured human pre-gastrulation embryos. Nature. 2020 Jan;577(7791):537–42.

19. Gilbert SF. Early Mammalian Development. Dev Biol 6th Ed [Internet]. 2000 [cited

2020 May 19]; Available from: https://www.ncbi.nlm.nih.gov/books/NBK10052/

20. Pontis J, Planet E, Offner S, Turelli P, Duc J, Coudray A, et al. Hominoid-Specific

Transposable Elements and KZFPs Facilitate Human Embryonic Genome Activation

and Control Transcription in Naive Human ESCs. Cell Stem Cell. 2019 May

2;24(5):724-735.e5.

21. Gerdes P, Richardson SR, Mager DL, Faulkner GJ. Transposable elements in the

mammalian embryo: pioneers surviving through stealth and service. Genome Biol.

2016 May 9;17(1):100.

22. Park I-H, Zhao R, West JA, Yabuuchi A, Huo H, Ince TA, et al. Reprogramming of

human somatic cells to pluripotency with defined factors. Nature. 2008

Jan;451(7175):141–6.

23. Sundaram V, Choudhary MNK, Pehrsson E, Xing X, Fiore C, Pandey M, et al.

Functional cis -regulatory modules encoded by mouse-specific endogenous

retrovirus. Nat Commun. 2017 Mar 28;8(1):1–12.

24. Lu X, Sachs F, Ramsay L, Jacques P-É, Göke J, Bourque G, et al. The retrovirus

HERVH is a long noncoding RNA required for human embryonic stem cell identity.

Nat Struct Mol Biol. 2014 Apr;21(4):423–5.

25. Picelli S, Faridani OR, Björklund ÅK, Winberg G, Sagasser S, Sandberg R. Full-

length RNA-seq from single cells using Smart-seq2. Nat Protoc. 2014 Jan;9(1):171–

43 81.

26. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient

alignment of short DNA sequences to the human genome. Genome Biol.

2009;10(3):R25.

27. Jin Y, Tam OH, Paniagua E, Hammell M. TEtranscripts: a package for including

transposable elements in differential expression analysis of RNA-seq datasets.

Bioinformatics. 2015 Nov 15;31(22):3593–9.

28. Chung N, Jonaid GM, Quinton S, Ross A, Sexton CE, Alberto A, et al. Transcriptome

analyses of tumor-adjacent somatic tissues reveal genes co-expressed with

transposable elements. Mob DNA. 2019 Sep 3;10(1):39.

29. Sexton CE, Han MV. Paired-end mappability of transposable elements in the human

genome. Mob DNA. 2019 Jul 10;10(1):29.

30. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion

for RNA-seq data with DESeq2. Genome Biol. 2014 Dec 5;15(12):550.

31. Van den Berge K, Perraudeau F, Soneson C, Love MI, Risso D, Vert J-P, et al.

Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell

applications. Genome Biol. 2018 Feb 26;19(1):24.

32. Rodriguez-Terrones D, Torres-Padilla M-E. Nimble and Ready to Mingle:

Transposon Outbursts of Early Development. Trends Genet. 2018 Oct

1;34(10):806–20.

33. Nguyen LH, Holmes S. Ten quick tips for effective dimensionality reduction. PLOS

Comput Biol. 2019 Jun 20;15(6):e1006907.

34. Frendo J-L, Olivier D, Cheynet V, Blond J-L, Bouton O, Vidaud M, et al. Direct

44 Involvement of HERV-W Env Glycoprotein in Human Trophoblast Cell Fusion and

Differentiation. Mol Cell Biol. 2003 May;23(10):3566–74.

35. Grow EJ, Flynn RA, Chavez SL, Bayless NL, Wossidlo M, Wesche D, et al. Intrinsic

retroviral reactivation in human preimplantation embryos and pluripotent cells.

Nature. 2015 Jun 11;522(7555):221–5.

36. Zdobnov EM, Campillos M, Harrington ED, Torrents D, Bork P. Protein coding

potential of retroviruses and other transposable elements in vertebrate genomes.

Nucleic Acids Res. 2005 Feb 1;33(3):946–54.

37. Kang Y-J, Yang D-C, Kong L, Hou M, Meng Y-Q, Wei L, et al. CPC2: a fast and

accurate coding potential calculator based on sequence intrinsic features. Nucleic

Acids Res. 2017 Jul 3;45(W1):W12–6.

38. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network

analysis. BMC Bioinformatics. 2008 Dec 29;9(1):559.

39. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and

bias-aware quantification of transcript expression. Nat Methods. 2017

Apr;14(4):417–9.

40. Langfelder P, Miller J. anRichment: Annotation and Enrichment Functions [Internet].

[cited 2020 May 7]. Available from: https://horvath.genetics.ucla.edu/html/

CoexpressionNetwork/GeneAnnotation/

41. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene

Ontology: tool for the unification of biology. Nat Genet. 2000 May;25(1):25–9.

42. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res.

2019 Jan 8;47(D1):D330–8.

45 43. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic

Acids Res. 2000 Jan 1;28(1):27–30.

44. Kanehisa M. Toward understanding the origin and evolution of cellular organisms.

Protein Sci. 2019;28(11):1947–51.

45. Kanehisa M, Sato Y, Furumichi M, Morishima K, Tanabe M. New approach for

understanding genome variations in KEGG. Nucleic Acids Res. 2019 Jan

8;47(D1):D590–5.

46. Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, et al. The reactome

pathway knowledgebase. Nucleic Acids Res. 2020 08;48(D1):D498–503.

47. Yip AM, Horvath S. Gene network interconnectedness and the generalized

topological overlap measure. BMC Bioinformatics. 2007 Jan 24;8(1):22.

48. Santoni FA, Guerra J, Luban J. HERV-H RNA is abundant in human embryonic stem

cells and a precise marker for pluripotency. Retrovirology. 2012 Dec 20;9(1):111.

49. Izsvák Z, Wang J, Singh M, Mager DL, Hurst LD. Pluripotency and the endogenous

retrovirus HERVH: Conflict or serendipity? BioEssays. 2016;38(1):109–17.

50. Loewer S, Cabili MN, Guttman M, Loh Y-H, Thomas K, Park IH, et al. Large

intergenic non-coding RNA-RoR modulates reprogramming of human induced

pluripotent stem cells. Nat Genet. 2010 Dec;42(12):1113–7.

51. Frith MC, Fu Y, Yu L, Chen J-F, Hansen U, Weng Z. Detection of functional DNA

motifs via statistical over-representation. Nucleic Acids Res. 2004 Feb

15;32(4):1372–81.

52. Kulakovskiy IV, Vorontsov IE, Yevshin IS, Sharipov RN, Fedorova AD, Rumynskiy

EI, et al. HOCOMOCO: towards a complete collection of transcription factor binding

46 models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res.

2018 Jan 4;46(D1):D252–9.

53. Goujon M, McWilliam H, Li W, Valentin F, Squizzato S, Paern J, et al. A new

bioinformatics analysis tools framework at EMBL–EBI. Nucleic Acids Res. 2010 Jul

1;38(suppl_2):W695–9.

54. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable

generation of high-quality protein multiple sequence alignments using Clustal

Omega. Mol Syst Biol. 2011 Jan 1;7(1):539.

55. Han H, Cho J-W, Lee S, Yun A, Kim H, Bae D, et al. TRRUST v2: an expanded

reference database of human and mouse transcriptional regulatory interactions.

Nucleic Acids Res. 2018 Jan 4;46(Database issue):D380–6.

56. Nelson PN, Carnegie PR, Martin J, Davari Ejtehadi H, Hooley P, Roden D, et al.

Demystified . . . Human endogenous retroviruses. Mol Pathol. 2003 Feb;56(1):11–8.

57. Blomberg J, Benachenhou F, Blikstad V, Sperber G, Mayer J. Classification and

nomenclature of endogenous retroviral sequences (ERVs): Problems and

recommendations. Gene. 2009 Dec 15;448(2):115–23.

58. Rani A, Murphy JJ. STAT5 in Cancer and Immunity. J Interferon Cytokine Res. 2015

Dec 30;36(4):226–37.

59. Cipriani C, Ricceri L, Matteucci C, De Felice A, Tartaglione AM, Argaw-Denboba A,

et al. High expression of Endogenous Retroviruses from intrauterine life to

adulthood in two mouse models of Autism Spectrum Disorders. Sci Rep. 2018 Jan

12;8(1):629.

60. Nakasato M, Shirakura Y, Ooga M, Iwatsuki M, Ito M, Kageyama S, et al.

47 Involvement of the STAT5 Signaling Pathway in the Regulation of Mouse

Preimplantation Development. Biol Reprod. 2006 Oct 1;75(4):508–17.

61. Bray SJ. Notch signalling in context. Nat Rev Mol Cell Biol. 2016 Nov;17(11):722–

35.

62. Kulic I, Robertson G, Chang L, Baker JHE, Lockwood WW, Mok W, et al. Loss of the

Notch effector RBPJ promotes tumorigenesis. J Exp Med. 2015 Jan 12;212(1):37–

52.

63. Bartels SJJ, Spruijt CG, Brinkman AB, Jansen PWTC, Vermeulen M, Stunnenberg

HG. A SILAC-Based Screen for Methyl-CpG Binding Proteins Identifies RBP-J as a

DNA Methylation and Sequence-Specific Binding Protein. PLOS ONE. 2011 Oct

3;6(10):e25884.

64. Hollister JD, Gaut BS. Epigenetic silencing of transposable elements: A trade-off

between reduced transposition and deleterious effects on neighboring gene

expression. Genome Res. 2009 Aug;19(8):1419–28.

65. Aras S, Pak O, Sommer N, Finley R, Hüttemann M, Weissmann N, et al. Oxygen-

dependent expression of cytochrome c oxidase subunit 4-2 gene expression is

mediated by transcription factors RBPJ, CXXC5 and CHCHD2. Nucleic Acids Res.

2013 Feb 1;41(4):2255–66.

66. Horváth V, Merenciano M, González J. Revisiting the Relationship between

Transposable Elements and the Eukaryotic Stress Response. Trends Genet. 2017

Nov 1;33(11):832–41.

67. Fidalgo M, Huang X, Guallar D, Sanchez-Priego C, Valdes VJ, Saunders A, et al.

Zfp281 Coordinates Opposing Functions of Tet1 and Tet2 in Pluripotent States. Cell

48 Stem Cell. 2016 Sep 1;19(3):355–69.

68. Huang X, Balmer S, Yang F, Fidalgo M, Li D, Guallar D, et al. Zfp281 is essential for

mouse epiblast maturation through transcriptional and epigenetic control of Nodal

signaling. Bronner M, editor. eLife. 2017 Nov 23;6:e33333.

69. Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The

Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res.

2018 Jan 4;46(D1):D794–801.

70. Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of

Problematic Regions of the Genome. Sci Rep. 2019 Jun 27;9(1):9354.

49 Curriculum Vitae

Corinne Sexton [email protected] Education

2020 M.S. Biology, University of Nevada Las Vegas, Las Vegas, NV. GPA 4.00 2018 B.S. Bioinformatics, Brigham Young University, Provo, UT. GPA 3.96 Minors: Computer Science, Music Experience Research 2017-2018 Computing Specialist, Fulton Supercomputing Lab, Provo, UT.

• Assist researchers from many disciplines with software installation and optimization

• Prepare introductory C++ lectures for supplement to Topics in Supercomputing class 2015-2018 Research Assistant, Ridge Lab, Provo, UT.

• Analyze performance of different mappers to identify low read depth biases in algorithms

• Research splice site variant effect predictive software and develop a simplified open source pipeline in Java

• Translate 8.8 million bacterial sequences into protein secondary structure to annotate bacteriophage genes through secondary structure based BLAST 2015-2017 Research Assistant, Chaston Lab, Provo, UT.

• Created statistical R package for performing MGWAS on mono-associated animals

• Increased efficiency of starvation and lifespan MGWAS runtime from 8 days to 3 hours by taking initiative to parallelize through the supercomputer

• Worked in wet lab, including contamination checking, making

50 food for fruit flies, associating a single bacteria type in an animal, as well as creating axenic eggs. Teaching 2018 Teaching Assistant, Topics in Supercomputing, Provo, UT.

• Taught students advanced C++ techniques such as threading, MPI, and efficient memory management. 2016-2017 Teaching Assistant, Intro to Bioinformatics, Provo, UT.

• Coached students in learning Python and understanding programming concepts such as conditionals, logic, and loops

• Developed course curriculum with professor for assignments and assessment questions Grants, Awards 2019 Graduate Research Fellowship ($46,000 for 3 years), NSF.

Awarded to research the role of transposable element transcription on gene regulation in the human genome. 2017 BYU ORCA Grant ($1500), Ridge Lab.

Awarded to research predicting bacteriophage gene function through protein secondary structure BLAST 2016 BYU ORCA Grant ($1500), Chaston Lab.

Awarded to create an R package for mono-associated gnotobiotic animal meta-genome wide association 2014 Presidential Scholarship (8 semesters full tuition, half tuition stipend), BYU. Presentations 2016 A Simple Pipeline for Meta-Genome Wide Association, Beneficial Microbes. American Society for Microbiology, Seattle, WA. (poster) 2015 MAGNAMWAR - Mono-Associated Gnotobiotic Animal Meta-Genome Wide Association R Package, Biotechnology and Bioinformatics Symposium. Provo, UT. (talk)

51 Publications

Matthews MK, Wilcox H, Hughes R, Veloz M, Hammer A, Banks B, Walters A, Schneider KJ, Sexton CE, Chaston JM. “Genetic influences of the microbiota on lifespan.” Applied and Environmental Microbiology. 2020. Corinne Sexton & Mira Han. "Paired-end mappability of transposable elements in the human genome." Mobile DNA. 2019. Chung N, Jonaid GM, Quinton, Ross A, Sexton CE, Alberto A, Clymer C, Churchill D, Navarro Leija O. & Han MV. “Transcriptome analyses of tumor-adjacent somatic tissues reveal genes co-expressed with transposable elements.” Mobile DNA. 2019. Corinne Sexton, et al. "Common DNA Variants Accurately Rank an Individual of Extreme Height." International Journal of Genomics. 2018.

Corinne Sexton, et al. "Splice Site Variant Analyzer: Determining the Pathogenicity of Splice Site Variants." Journal of Biomedical Research and Practice. 2018. Corinne Sexton, et al. "MAGNAMWAR: An R package for genome-wide association studies of bacterial orthologs." Bioinformatics. 2018. Judd AM, Matthews MK, Hughes R, Veloz M, Sexton CE, Chaston JM. “Bacterial methionine metabolism genes influence Drosophila melanogaster starvation resistance.” Applied and Environmental Microbiology. 2018. Skills

Programming Languages: Python, C++, SQL, Java, C#, R, Bash, Julia Bioinformatics: ● next-generation sequence analysis (e.g. samtools, GATK etc.) ● mappers and aligners (bwa, bowtie2 etc.) ● statistical analysis in R ● supercomputer and parallel computing environment ● algorithm development

52