UPTEC X 14 029 Degree project 30 credits June 2014

Differential and co-expression of long non-coding RNAs in abdominal aortic aneurysm

Joakim Karlsson

Degree Project in Bioinformatics

Masters Programme in Molecular Biotechnology Engineering, Uppsala University School of Engineering

UPTEC X 14 029
Date of issue: 2014-11
Author: Joakim Karlsson

Title (English): Differential and co-expression of long non-coding RNAs in abdominal aortic aneurysm
Title (Swedish):

Abstract

This project concerns an exploration of the presence and interactions of long non-coding RNA transcripts in an experimental atherosclerosis mouse model with relevance for human abdominal aortic aneurysm development. In total, 187 long non-coding RNAs, two of them entirely novel, were found to be differentially expressed between angiotensin II treated (developing abdominal aortic aneurysms) and non-treated apolipoprotein E deficient mice (not developing aneurysms) harvested after the same period of time. These transcripts were also studied with regard to co-expression network connections. Eleven previously annotated and two novel long non-coding RNAs were present in two significantly disease-correlated co-expression groups that were further profiled with respect to network properties, Gene Ontology terms and MetaCore© connections.

Keywords

LncRNA, lincRNA, abdominal aortic aneurysm, differential expression, co-expression, RNA-seq, WGCNA

Supervisors: Bengt Sennblad, Karolinska Institutet
Scientific reviewer: Magnus Lundgren, Uppsala University
Project name:
Sponsors:
Language: English
Security:
Classification:
ISSN 1401-2138
Supplementary bibliographical information:
Pages: 80

Biology Education Centre Biomedical Center Husargatan 3, Uppsala Box 592, S-751 24 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 471 4687


Popular science summary

Aneurysm (dilation) of the abdominal aorta is a serious disease associated with a high risk of death. Known risk factors include, as for many other cardiovascular diseases, smoking, high blood pressure and old age. The condition is usually difficult to detect in time. Apart from risky surgical procedures, few treatment options exist. The mapping of the genetic background of the disease has become a research field of its own. The increased knowledge that this field contributes is expected to facilitate the development of new diagnostic methods and therapeutic techniques.

As part of this effort, this work has focused on the expression patterns of so-called long non-coding RNAs (lncRNAs). These are a type of RNA molecule that cannot be translated into proteins. It was previously known that non-coding RNAs can play a role in cardiovascular diseases.

The study design was based on sequenced RNA obtained from a number of mice that had been treated such that one group developed abdominal aortic aneurysms, whereas a control group did not. The aim was thus to investigate changes in lncRNA levels between these two groups, in order to find lncRNA variants of possible importance for the disease. The work thereby identified a number of lncRNA candidates whose behaviour pointed to connections with aneurysm development, and which are recommended for further research.

Degree project 30 credits, Master of Science in Engineering Programme in Molecular Biotechnology

Uppsala University, September 2014

Contents

1 Introduction

2 Background
2.1 Long non-coding RNAs
2.2 Abdominal aortic aneurysm
2.3 ApoE knockout mice
2.4 Sequencing and transcript assembly
2.4.1 RNA sequencing
2.4.2 Read mapping
2.4.3 Transcript assembly
2.5 Prediction of novel long noncoding RNAs
2.6 Functional characterisation
2.6.1 Differential expression testing
2.6.2 Co-expression network inference
2.6.3 Reference Databases

3 Methods
3.1 Read quality control
3.2 Read mapping
3.3 Transcript assembly
3.4 Identification of known long noncoding RNAs
3.5 Prediction of novel long noncoding RNAs
3.6 Functional characterisation
3.6.1 Differential expression testing
3.6.2 Co-expression network inference
3.6.3 Clustering comparison

4 Results
4.1 Read quality control
4.2 Prediction of novel long noncoding RNAs
4.3 Differential expression testing
4.4 Co-expression network modules
4.5 Gene Ontology enrichment
4.6 MetaCore© analysis
4.7 Clustering comparison

5 Discussion

6 Conclusions

7 Acknowledgements

8 References

A Alternative analysis
A.1 Methods
A.1.1 Read trimming
A.1.2 Read mapping
A.1.3 Transcript assembly
A.1.4 Differential expression testing
A.1.5 Identification of known long noncoding RNAs
A.1.6 Prediction of novel long noncoding RNAs
A.1.7 Co-expression network analysis
A.2 Results
A.2.1 Identification of long noncoding RNAs
A.2.2 Differential expression testing
A.2.3 Co-expression network modules
A.2.4 MetaCore© analysis
A.2.5 Clustering comparison
A.3 Discussion

B Candidate lncRNAs

Abbreviations

AAA  Abdominal aortic aneurysm
cDNA  Complementary DNA
FDR  False discovery rate
FPKM  Fragments per kilobase of transcript per million mapped reads
GO  Gene Ontology
lincRNA  Long intergenic noncoding RNA
lncRNA  Long noncoding RNA
MDS  Multidimensional scaling
miRNA  Micro RNA
ncRNA  Noncoding RNA
ORF  Open reading frame
PCA  Principal component analysis
PCR  Polymerase chain reaction
RABT  Reference annotation based transcriptome [assembly]
rRNA  Ribosomal RNA
ROC  Receiver operating characteristic
WGCNA  Weighted gene co-expression network analysis

1. Introduction

Long non-coding RNAs (lncRNAs) share many similarities with messenger RNAs (mRNAs): they can be differentially spliced, and a 5’-cap and a poly-A tail are often added. However, they cannot be translated into proteins (since they have either no open reading frame or one that is too short), hence “non-coding”. Non-coding does not equal nonsense, though. LncRNAs have been implicated in interfering with both the transcriptional and the translational machinery in many different ways [1]. Some lncRNAs have antisense binding properties and some are processed into several small RNAs and microRNAs that in turn may have various biological functions. In the vast majority of cases, however, their activity is simply unknown. Studying the expression patterns of lncRNAs in a particular disease may reveal novel functional connections and expand the mechanistic knowledge of the illness.

In this project a particular kind of aneurysm development was studied. Abdominal aortic aneurysm (AAA) is a dilation of a region of the aorta, which may lead to rupture. The disease is asymptomatic throughout most of its progression, yet very often has a fatal outcome. The most important factors influencing AAA appear to be smoking and hypertension [2]; genetics is also known to play a role [2, 3]. Previous studies have profiled protein-coding genes that are thought to influence the condition, and have also implicated microRNAs [3]. LncRNAs, which are known to be associated with several other diseases (among them cardiovascular ones [4]) and are a fascinating biological phenomenon in themselves, therefore appear to be a field worth exploring in the context of AAA.

At our disposal was a set of RNA-sequenced tissue samples from AAA mouse models. The task was to explore this dataset with respect to the expression of both previously known and novel lncRNAs. Any promising candidates may be subject to further functional evaluation in future RNA interference experiments.

2. Background

2.1 Long non-coding RNAs

Long non-coding RNAs (lncRNAs) are defined as RNA molecules longer than 200 bases that do not have the ability to code for any protein. A subclass of these is called “lincRNAs”, where the “i” stands for “intergenic”. LincRNAs reside in the regions between protein-coding genes, whereas lncRNAs in general may overlap such genes. Some protein-coding genes, for instance, have long transcript isoforms that are non-coding [1].

Intriguingly, lncRNAs have been found to be at least as abundant as the protein-coding genes expressed by the human genome (as well as the mouse genome) [1]. However, their expression levels are in general lower than those of protein-coding genes. Patterns of expression are also highly variable between tissues [5]. LncRNAs are spliced and processed much like regular protein-coding transcripts. Addition of poly-A tails is not uncommon. In several cases, lncRNAs are processed into small RNAs, which in turn may have their own biological functions. It also appears that lncRNA genes are generally less conserved than protein-coding ones [5].

Evidence has shown that lncRNAs can act by epigenetically altering gene expression. For instance, the lncRNAs Hotair, Kcnq1ot1 and RepA all interact with chromatin to cause gene silencing [1, 4]. According to one estimate, about a third of all lincRNAs interact with chromatin-modifying complexes in order to modify the functionality of various cellular processes [1]. It is also not uncommon for lncRNAs to be transcribed antisense to protein-coding genes [5], so one may assume that the sequence complementarity could offer a way to interact with these other genes as well.

2.2 Abdominal aortic aneurysm

Abdominal aortic aneurysm (AAA) is a severe disease in which a focal region of the aorta is dilated. The condition is highly fatal upon aneurysm rupture. Generally, no symptoms are shown leading up to the critical point of the illness. Known risk factors are smoking, hypertension, high age, male gender and Caucasian ethnicity [2]. The complex genetic aspects of the disease are an active research area. Since surgery is the only currently employed treatment [3], it would be desirable to discover new and less risky avenues of therapy.

The processes behind the formation of aneurysms can briefly be described as a complex interplay of vascular inflammation, dysregulation of the behaviour and proliferation of vascular smooth muscle cells, fragmentation of elastin and degradation of other components of the arterial wall such as the extracellular matrix [3].

With respect to lncRNAs, a long non-coding RNA known as ANRIL (“antisense non coding RNA in the INK4 locus”) has been implicated in human atherosclerosis (as well as other serious conditions such as diabetes and cancer) [4]. Similarly to lncRNAs such as Hotair, it binds to polycomb and causes epigenetic modifications. This in turn leads to altered expression of other genes (CDKN2B in this case). No homolog of ANRIL is currently known to exist in mice, however. Nevertheless, it demonstrates the potential of lncRNAs to have key roles in the pathogenesis of diseases related to AAA.

In addition, another class of non-coding RNA, microRNAs (which have also been shown to have gene regulatory functions in many cases), has been implicated in AAA formation. Two prominent examples are miR-21 and miR-29b [3]. This further supports the idea that the role of the untranslated transcriptome should not be underestimated.

2.3 ApoE knockout mice

The data used for this work was provided by a group at Stanford University. The samples had already been RNA sequenced by a company named Centrillion, using an Illumina HiSeq 2000, the EpiBio ScriptSeq v2 library preparation kit and the RiboZero rRNA depletion kit. The sequenced tissue samples were derived from a set of mice modified to lack the gene for apolipoprotein E (ApoE). Such mice are common as model animals in the study of atherosclerosis. Treatment with the blood pressure-regulating protein angiotensin II (AngII) leads to development of abdominal aortic aneurysms [6] (the cases), whereas saline treatment does not (the control group).

Twelve samples were used, from mice that were all 12 weeks old: four mice harvested three weeks and four mice harvested seven weeks after treatment with AngII infusion (the cases), and four mice treated with saline and harvested after seven weeks (the controls). The sample groups will henceforth be referred to as “A3” and “A7” for the case mice (harvested after three and seven weeks, respectively) and “S7” for the control mice.

2.4 Sequencing and transcript assembly

2.4.1 RNA sequencing

RNA sequencing is a method for quantifying gene expression through estimation of transcript abundance (unless stated otherwise, the term “expression” will refer to “transcript abundance” in this work, rather than to protein expression levels). Transcribed material contained in samples is first fragmented, converted to complementary DNA (cDNA) by reverse transcription, amplified using the polymerase chain reaction (PCR) and tagged with appropriate adapter sequences. The common term for this procedure is “library preparation”.

After this preparation, the fragment library is analysed with any of a number of available sequencing machines. In this case, an Illumina HiSeq 2000 had been used. These instruments read a certain number of nucleotides from the ends of each fragment. The output consists of millions of short nucleotide sequences known as “reads” [7]. In this case, the read length was 100 bases.


Reads can be paired or unpaired, depending on the chosen technology. Paired reads, or “paired-end reads”, result from copies of the same cDNA fragment being sequenced from each end independently. The use of paired reads is helpful when dealing with repetitive transcripts. Since the distance between the two reads in a pair can be computationally inferred, this information can be used when assigning the reads to positions on a reference genome.

Library preparation can also be done such that the directionality of the fragments is preserved for downstream analysis. This is known as stranded library preparation. Preserving strand information makes it possible to tell whether a transcript originated from the sense or the antisense strand of DNA. This information is particularly useful when dealing with noncoding fragments, since these may have regions that are partially complementary to protein-coding transcripts. This may, for instance, be one way in which they could regulate said transcripts.

2.4.2 Read mapping

In order to find out which regions of the genome were expressed, it is necessary to map the RNA sequencing reads to an appropriate reference genome. Such reference genomes are available from public databases maintained by, for instance, the National Center for Biotechnology Information (NCBI) [8], the University of California, Santa Cruz (UCSC) [9], and the European Bioinformatics Institute (EMBL-EBI) [10].

Several algorithms have been developed to perform mapping of RNA-seq reads. One of the most cited to date is TopHat, developed by Trapnell et al. [11]. The latest version of this software, TopHat2 [12], was used in this project. In short, this aligner works by performing a preliminary local alignment of reads (using the Bowtie2 short read aligner), after which consensus regions are assembled and possible splice junctions are generated. Reads are then mapped anew, taking the splice junctions into account. It is also possible to provide a reference database of known transcripts in order to first map reads to the transcriptome (the set of all known transcripts), thereby improving the accuracy of alignment.

2.4.3 Transcript assembly

After reads have been mapped to a genome, they need to be assembled into fragments in order to discover the original structure of the expressed RNA transcripts. To do this, the Cufflinks program by Trapnell et al. [13] was used. Briefly, the algorithm works by first grouping short reads into consecutive coherent units according to the positions to which they were mapped in the genome. This results in the reconstruction of a preliminary set of transcript fragments. Then, overlaps between these preliminary fragments are found in order to determine what the structure of the original transcripts may have been.

Often, this results in a number of conflicting or spurious reconstructed transcripts. Therefore, a second step is undertaken to find out which of these are most likely to be the correct transcripts. Mutually incompatible fragments are isolated and, with the help of a graph-theoretic approach, the minimal set of transcripts that can explain the particular fragment groups obtained is sought. This yields a final set of candidate transcript variants (isoforms), whose abundance levels are subsequently estimated with the use of a likelihood-based statistical model.

A reference transcriptome (set of transcripts) can be used to aid in the estimation of expression levels and to improve the accuracy of transcript reconstruction. In particular, a useful method when searching for novel isoforms is an algorithm known as Reference Annotation Based Transcriptome assembly (RABT) [14]. This approach mixes the true sequencing reads with so-called “faux reads” generated from the transcriptome reference. The faux reads are supposed to compensate for coverage gaps and provide structural information for use in the reconstruction of transcript variants.

In general, expression levels may be expressed simply in terms of the number of reads mapped to each fragment. Another common way is to normalise the read count by the length of the fragment, resulting in a measure known as “fragments per kilobase of transcript per million mapped reads” (FPKM). The argument for doing so is that RNA-seq is prone to a fragment length bias: since longer fragments generate more reads than shorter fragments, more reads will map to them and they are more easily quantified. For this reason, longer transcripts have been found to be called differentially expressed more often than short ones. Dividing by the length of the fragment is supposed to help in dealing with this problem [13].
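The FPKM normalisation can be illustrated with a minimal Python sketch. It implements only the plain definition (count divided by transcript length in kilobases and by library size in millions), not the likelihood-based abundance model used by Cufflinks; the function name and the example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Naive FPKM: fragments per kilobase of transcript per million mapped fragments.

    Illustration of the definition only; Cufflinks estimates abundances
    with a statistical model rather than this direct ratio.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("transcript length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Two hypothetical transcripts with the same abundance per unit length:
# the longer one collects five times as many fragments, yet both obtain
# the same FPKM after length normalisation.
short_fpkm = fpkm(fragment_count=200, transcript_length_bp=1_000,
                  total_mapped_fragments=20_000_000)
long_fpkm = fpkm(fragment_count=1_000, transcript_length_bp=5_000,
                 total_mapped_fragments=20_000_000)
```

Raw counts would rank the longer transcript five times higher, which is the length bias that the normalisation is intended to reduce.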

2.5 Prediction of novel long noncoding RNAs

Finding novel lncRNAs involves computationally inferring the coding potential of putative transcripts. This can be done in many ways, but the focus here was on the following filtering pipeline. First, transcripts from the novel assembly that were not found in the Ensembl gene annotation reference database [10] were extracted. Then, all transcripts shorter than 200 nucleotides were removed. After that, only candidates without very long open reading frames (ORFs) were kept. The next filtering step was based on an evolutionary inference of protein-coding potential from phylogenetic codon substitution frequencies. The transcripts remaining after that were compared against the Pfam protein family database, in order to make sure that none of them resembled known protein domains (functional regions). This strategy was adopted from Sun et al. [15], and follows their “lncRscan” pipeline.
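The first three filtering steps (novelty, minimum length and ORF length) can be sketched in Python as follows. This is a simplified illustration, not the lncRscan implementation: the ORF cutoff `max_orf_nt` is a hypothetical value, since the text does not state the threshold, only ORFs on the given strand are scanned, and the codon substitution and Pfam steps are omitted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length (nt) of the longest ATG-to-stop ORF in the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def passes_basic_filters(transcript_id, seq, known_ids,
                         min_length=200, max_orf_nt=300):
    """Keep unannotated transcripts of at least min_length nt without a long ORF.

    max_orf_nt is an assumed cutoff chosen for illustration only.
    """
    if transcript_id in known_ids:
        return False
    if len(seq) < min_length:
        return False
    return longest_orf_length(seq) <= max_orf_nt
```

A candidate that passes these checks would still need to go through the codon substitution and protein domain steps described above before being called a putative lncRNA.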

2.6 Functional characterisation

2.6.1 Differential expression testing

Detecting differential expression (significant differences in the number of RNA transcripts generated from genes when comparing two or more sample groups) requires sophisticated statistical methods. Several problems need to be dealt with: accounting for various types of bias (such as the fragment length bias discussed earlier), modelling and accounting for the biological variability of the samples (natural differences among samples of the same condition that are not due to the treatment being studied), determining the significance level of any observed differences, and so on.

Multiple tools exist for this purpose. The most popular ones at the time of writing, specifically made for RNA sequencing, include DESeq [16], edgeR [17] and Cuffdiff [18]. Cuffdiff was chosen for this project, since it was specifically designed to provide accurate estimates of differential expression at the isoform level (transcript variant, as opposed to entire gene). This is important, since lncRNA transcripts are commonly found to be non-coding isoforms of protein-coding genes. Another benefit of using Cuffdiff over competing tools was its tight integration with the other tools of the so-called Tuxedo pipeline, TopHat and Cufflinks [19].

2.6.2 Co-expression network inference

In a transcript-level co-expression network, the nodes are transcripts and the edges represent distances between them. The distance measure used is based on the similarity of the expression profile of each transcript to that of every other transcript (the profile is the set of expression levels of a given transcript across all samples). The problem can thus be seen as that of determining the distances between a number of vectors, and finding out whether any transcripts form groups (“clusters” or “modules”) of similar expression.

There are many ways of measuring the similarity between a pair of vectors, and it may not be immediately obvious which is best with regard to gene expression. Common (though simple) ways of calculating this distance are based on the Euclidean norm and the absolute Pearson correlation. There are also many methods of using these distance measures to find clusters among the vectors (transcripts). The most common are probably k-means [20] and hierarchical clustering [21], and modifications thereof. Others include Weighted Gene Co-expression Network Analysis (WGCNA) [22], non-negative matrix factorisation (NMF) [23, 24] and Markov clustering [25].

Different clustering methods can be expected to reveal different structures in a given dataset. Some of these structures may be shared between the methods, but the fact that a given structure does not appear with all methods does not mean that it is not real. It could be a true and biologically relevant feature of the data, which is only exposed by that particular model. There is, however, no guarantee that it is not just an artifact of the clustering either. Therefore, it is often beneficial to compare the results of different clustering methods on the data.

Some clustering methods are of course simply more suited to some types of data than others, and a lack of consensus between two given methods could just as well be because both methods perform badly as because one of the methods performs well and the other does not. This can be hard to tell, so clustering results need to be additionally validated using biological information gathered from the literature.
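The difference between the two simple distance measures mentioned above can be illustrated with a small Python sketch. The expression profiles are invented for illustration, and defining the correlation-based distance as one minus the absolute Pearson correlation is one common convention, assumed here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed convention: 1 - |r|, so strong (anti-)correlation gives ~0."""
    return 1.0 - abs(pearson(x, y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles across four samples.
low = [1.0, 2.0, 3.0, 4.0]        # lowly expressed transcript
high = [10.0, 20.0, 30.0, 40.0]   # same pattern, ten-fold higher level
anti = [4.0, 3.0, 2.0, 1.0]       # mirrored pattern

d_corr_low_high = correlation_distance(low, high)
d_corr_low_anti = correlation_distance(low, anti)
d_eucl_low_high = euclidean_distance(low, high)
d_eucl_low_anti = euclidean_distance(low, anti)
```

With these profiles, the correlation-based distance treats `low` as close to both `high` and `anti`, whereas the Euclidean distance places `low` far from `high` simply because of the difference in expression level. Which behaviour is preferable depends on the biological question.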

2.6.3 Reference Databases

A number of reference databases were used in this study:

NONCODE

NONCODE is a comprehensive database of noncoding RNAs, including lncRNAs, collected by literature mining and from other specialised databases. The database is developed and maintained by the Chinese Academy of Sciences. The number of entries in the database at the time of writing (version 4) was over 210 000 (all species included) [26]. The database was available to download on request, free of charge. NONCODE was used to identify previously annotated lncRNAs in the newly assembled transcriptome.


Gene Ontology

The Gene Ontology (GO) [27] defines and hierarchically organises standardised biological concepts. This enables a systematic categorisation of gene and protein functionality. Three separate main ontologies are defined by the Gene Ontology consortium (the body which created and maintains the GO): “Biological process”, “Molecular function” and “Cellular component”. Molecular function refers to the various ways in which a gene product can act biochemically. Examples include “enzyme” and “receptor ligand”. Such molecular functions may contribute to what are defined as biological processes, which are general paths towards a biological goal. Examples of biological processes are “translation” and “signal transduction”. Cellular component refers to the location of a gene product in the cell.

One of the goals of creating this structured and well-defined set of terms was to make it easier for researchers both to unambiguously assign functional annotation to any given gene and to query these terms to quickly find the genes participating in any process of interest [27]. Another useful aspect, which was utilised in this project, was the possibility of functionally annotating a given set of genes. That way, it could for instance be discovered whether co-expressed clusters of transcripts are enriched for any particular biological functions.

MetaCore©

MetaCore© [28] is a commercial tool by Thomson Reuters that is based on a curated collection of interactions between “network objects” such as genes, proteins and nucleic acids. The use of this tool was expected to be a valuable complement to Gene Ontology enrichment when assessing the biological significance of co-expression network modules.

3. Methods

3.1 Read quality control

The quality control software FastQC [29] was used to inspect sequencing reads with regard to overall sequencing quality. The version utilised was 0.10.1, with default options. The purpose of using this tool was to find out whether any overrepresented reads or large quality differences among samples were present.

RSeQC [30] (version 2.3.9) was further employed to profile the reads with respect to mapping aspects. In particular, an estimation of the “mate inner distance” was made with this program, based on preliminary alignments. The mate inner distance is defined as the distance (in nucleotides) between the two reads in a pair, with respect to where they have mapped on the genome. The value of this distance, and its standard deviation, can be specified as optional parameters during mapping with TopHat. RSeQC also has the ability to infer the experiment type, that is, stranded or unstranded sequencing.
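The definition of the mate inner distance can be made concrete with a short Python sketch. It assumes half-open genomic coordinates and unspliced pairs, and it does not reproduce how RSeQC handles spliced alignments; the coordinates are invented for illustration.

```python
from statistics import mean, stdev


def inner_distance(mate1_end, mate2_start):
    """Gap between the end of the first mate and the start of the second.

    Assumes half-open coordinates; a negative value means that the two
    mates overlap.
    """
    return mate2_start - mate1_end


# Hypothetical (mate 1 end, mate 2 start) positions for three read pairs.
pairs = [(200, 250), (1100, 1130), (5100, 5170)]
distances = [inner_distance(end, start) for end, start in pairs]

# Summary values of the kind that can be passed to the mapper.
mate_inner_dist = round(mean(distances))
mate_std_dev = round(stdev(distances))
```

In practice the estimate is based on a large number of pairs from a preliminary alignment rather than on three values.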

3.2 Read mapping

Read mapping was performed with TopHat 2.0.11. The option “--no-mixed” was used to constrain the set of accepted reads to those where both reads in a pair could be mapped. This rather strict option was used in order to minimise the number of assembled fragment artifacts in downstream analysis steps, which is of particular importance when aiming to discover novel transcripts. Another potentially helpful restriction on read mapping would have been the “--no-discordant” option, which disqualifies pairs in which the two reads map to different chromosomes. This was not used, however, since it caused the current version of TopHat to crash.

In addition, a reference annotation file in GTF format was provided (option “-G”), as taking both the genome and the known transcriptome into account when mapping can increase the accuracy of alignment [11, 12]. Library type “fr-secondstrand” was specified, as the library preparation was stranded and originated from the second strand of cDNA synthesised from the RNA fragments. The values of “mate inner distance” and “mate-std-deviation” were provided for each sample as they had been estimated by RSeQC on preliminary mapping runs (plots can be found in the supplementary material). The version of the short read aligner Bowtie used by TopHat was 2.2.2.

The reference transcript annotation provided to TopHat was the Ensembl [10] mm9 reference annotation (“NCBIM37”, accessed April 11, 2014) provided by the Illumina iGenomes project, which had been specifically prepared for use with the TopHat and Cufflinks software. The reason for using mm9 for the main analysis, rather than mm10, was that the current noncoding RNA reference database NONCODE was only available based on mm9.
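As an illustration of how the options described above fit together, the following Python sketch assembles a TopHat command line as an argument list without running it. The long-form flags for the mate inner distance and its standard deviation are `--mate-inner-dist` and `--mate-std-dev` in TopHat; all file names, the index prefix and the numeric values are hypothetical, and the exact invocation used in this project is not reproduced here.

```python
def tophat_command(index_prefix, reads_1, reads_2, gtf, out_dir,
                   inner_dist, inner_std_dev):
    """Build (but do not execute) a TopHat call with the options described above."""
    return [
        "tophat",
        "--no-mixed",                          # keep pairs only if both mates map
        "--library-type", "fr-secondstrand",   # stranded library, second strand
        "-G", gtf,                             # known transcript annotation (GTF)
        "--mate-inner-dist", str(inner_dist),  # e.g. estimated with RSeQC
        "--mate-std-dev", str(inner_std_dev),
        "-o", out_dir,
        index_prefix, reads_1, reads_2,
    ]


# Hypothetical sample; paths and numbers are placeholders.
cmd = tophat_command(
    index_prefix="mm9/genome",
    reads_1="sample_R1.fastq.gz",
    reads_2="sample_R2.fastq.gz",
    gtf="mm9/genes.gtf",
    out_dir="tophat_out/sample",
    inner_dist=50,
    inner_std_dev=20,
)
command_line = " ".join(cmd)
```

Keeping the arguments in a list makes it straightforward to pass them to a process launcher per sample, with the sample-specific distance estimates filled in.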

3.3 Transcript assembly

Transcript assembly was performed with Cufflinks 2.2.0 [13]. The “-g” option was used in order to make use of Reference Annotation Based Transcriptome (RABT) assembly, which is beneficial when searching for novel transcripts [14, 19]. The “fr-secondstrand” library type was also specified.

A run of Cufflinks in ab initio mode (default parameters, except for “--library-type=fr-secondstrand”) was also made in order to estimate the FPKM distributions of partially assembled fragments and artifacts in comparison to the distributions of assembled fragments that completely matched the reference (more on this under the section on prediction of novel lncRNAs).

Both RABT and ab initio assembly were performed on the merged set of all alignments, excluding two samples that were deemed outliers. These samples had considerably worse mapping performance and different overall FPKM distributions, and formed their own groups in principal component analysis (PCA) and multidimensional scaling (MDS) analysis. The PCA and MDS analyses were performed with the CummeRbund R package, version 2.6.1 [19], on preliminary differential expression results (not shown).

All transcriptome assemblies were annotated with the Cuffcompare [13] software included in the Cufflinks package (utilising the options “-G”, “-C” and “-r”). The reference used was the same Ensembl [10] mm9 annotation, provided by Illumina iGenomes, as during the alignments. During the comparisons, Cuffcompare assigns so-called “class codes” to the assembled fragments. These indicate the nature of any overlap with reference transcripts that may be found. The meaning of the class codes is summarised in table 3.1.

Table 3.1: Cuffcompare class codes mentioned in this work. The definitions have been cited from the Cuffcompare manual.

Class   Meaning
=       “Complete match of intron chain”
c       “Contained”
j       “Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript”
i       “A transfrag falling entirely within a reference intron”
o       “Generic exonic overlap with a reference transcript”
x       “Exonic overlap with reference on the opposite strand”
u       “Unknown, intergenic transcript”

3.4 Identification of known long noncoding RNAs

Identification of known lncRNAs was accomplished through the use of the Cuffcompare software (version 2.2.1) included in the Cufflinks package [19]. The algorithm was provided with a reference annotation file in GTF format downloaded from the NONCODE database (version 4) [26], together with the RABT assembly of transcripts (apart from this, default options were used). Comparisons were made twice: once with the NONCODE annotation as reference (“-r” flag) and the merged assembly as primary input, and once with the merged assembly as reference (“-r”) and the NONCODE annotation as primary input. The intersection between the two output mapping files was then taken. This eliminated cases where the matching was incomplete.

After this, annotated lncRNAs were extracted from the Ensembl reference annotation (the categories “3prime overlapping ncrna”, “ambiguous orf”, “antisense”, “lincRNA”, “ncrna host”, “non coding”, “processed transcript”, “retained intron”, “sense intronic” and “sense overlapping”, which according to Ensembl’s terminology denote lncRNAs), and the union of the Cuffcompare NONCODE intersection set and the Ensembl lncRNA set was taken. The result was the set of all lncRNAs present in the novel transcriptome assembly that had been previously annotated by either NONCODE or Ensembl.
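The set logic of this step (reciprocal matching followed by a union with the Ensembl lncRNA set) can be sketched in Python as follows. The representation of matches as (assembled transcript, NONCODE entry) pairs and all identifiers are assumptions made for illustration; parsing of the actual Cuffcompare output files is not shown.

```python
def known_lncrnas(assembly_vs_noncode, noncode_vs_assembly, ensembl_lncrna_ids):
    """Return assembled transcript IDs annotated as lncRNAs by NONCODE or Ensembl.

    The first two arguments are collections of
    (assembled_id, noncode_id) pairs, one per comparison direction.
    """
    # Keep only matches that were reported in both comparison directions.
    reciprocal_pairs = set(assembly_vs_noncode) & set(noncode_vs_assembly)
    noncode_supported = {assembled_id for assembled_id, _ in reciprocal_pairs}
    # Add the transcripts that Ensembl classifies as lncRNAs.
    return noncode_supported | set(ensembl_lncrna_ids)


# Hypothetical identifiers.
forward = {("T1", "N1"), ("T2", "N2"), ("T3", "N3")}
reverse = {("T1", "N1"), ("T3", "N9")}
ensembl = {"T1", "T4"}

annotated = known_lncrnas(forward, reverse, ensembl)
```

Here `T2` and `T3` are dropped because their matches were not confirmed in both directions, while `T4` enters through the Ensembl set.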

3.5 Prediction of novel long noncoding RNAs

The general outline of the search for novel long non-coding RNAs was adapted from Sun et al. [15], whose scripts included in the “lncRscan” pipeline were employed in the execution of some of the filtering steps described below.

Filtering of partially assembled transcripts

In order to reduce the risk of making false predictions, fragments that were entirely contained within known annotated transcripts were filtered away together with other fragments suspected to be mere artifacts. The method for doing so was based on the hypothesis that the overall FPKM values estimated for such fragments would be lower than those of true transcripts. This method has been used previously by, for instance, Sun et al. [15]. A custom pipeline was developed for this purpose, using Perl and R.

Inference of transcript coding potential

There are many strategies for judging how likely a transcript is to be protein coding. The software PhyloCSF (Phylogenetic Codon Substitution Frequency) uses multiple alignments between several species and the substitution frequencies of codons in a given transcript, throughout the phylogeny, to determine whether it is likely to code for a protein [31]. Another tool, Coding Potential Calculator (CPC), is based on a support vector machine (SVM) classifier that takes into account features such as alignment scores to protein databases and certain aspects of any open reading frames (ORFs) that are found within the transcripts. iSeeRNA [32], included in the Sebnif [33] lincRNA detection pipeline, is also a support vector machine based tool. PhyloCSF and iSeeRNA have been shown to perform better than several of the prominent tools, including CPC [31, 32]. For this project, PhyloCSF was used.

After coding potential has been inferred using any of the methods mentioned above, one might further validate the transcripts by searching for similarity to known protein-coding domains, in Pfam for instance. This was the strategy employed here.

3.6 Functional characterisation

If any novel lncRNAs are indicated by the data, it is of course also of interest to find out what, if anything, these transcripts may be doing. One approach to this is to see if they are differentially expressed in the studied dataset. Another, complementary, approach is to look for similarities in expression to known transcripts. This can be done by co-expression network inference. Transcript clusters obtained through the use of such network methods can additionally be mapped to Gene Ontology categories. The differential expression and network analyses are described in detail below. GO analysis was performed with the BiNGO plugin of Cytoscape 3.1.1 [34].

These approaches do not, however, give a final conclusion about what the transcripts are actually doing. Rather, the strategy should be seen as a way of generating hypotheses that can later be tested using, for instance, molecular biology techniques.

3.6.1 Differential expression testing

Transcript level differential expression testing was performed with Cuffdiff 2.2.1 [18]. The Cufflinks RABT assembly that had been performed on the merged BAM files of aligned reads from all samples (excluding outliers) was used as reference. Multi-hit correction was performed with the “-u” option, and the library type “fr-secondstrand” was specified. Isoforms were considered significantly differentially expressed if they had a p-value less than 0.05 and a q-value less than 0.05. The “getSig” function of the cummeRbund R package (version 2.6.1) [19] was used to select those fragments, using 0.05 as the value of the parameter “alpha”.
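The significance criterion above can be illustrated with a small sketch. It assumes that the q-values are Benjamini-Hochberg adjusted p-values; it does not reproduce the internal computations of Cuffdiff or cummeRbund, and the function names and example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value downwards so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def significant(p_values, alpha=0.05):
    """Indices of tests whose p-value and q-value are both below alpha."""
    q_values = benjamini_hochberg(p_values)
    return [
        i
        for i, (p, q) in enumerate(zip(p_values, q_values))
        if p < alpha and q < alpha
    ]


# Made-up p-values for five hypothetical isoforms.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(p)
selected = significant(p)
print(selected)
```

Note that the third and fourth p-values are below 0.05 on their own but are no longer selected once the multiple testing adjustment is taken into account.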

3.6.2 Co-expression network inference

Co-expression network inference was carried out with the WGCNA R package by Langfelder and Horvath [22]. This network construction method uses absolute Pearson correlations between gene profiles (defined as the expression levels of a gene across all samples). These correlations are then raised to a power β in order to emphasise high correlations more than low correlations. β is considered a soft thresholding power, an alternative to a hard cutoff threshold for forming network modules. This has been shown to aid in accomplishing a scale free topology [22], that is, a topology with a few highly connected nodes, rather than many approximately equally connected ones (a so called random topology). Scale free topologies are thought to better reflect biologically relevant information than random topologies [22]. The transformed correlation coefficients are then the basis for a hierarchical clustering of the genes, whereafter the resulting dendrogram is cut using an algorithm called “dynamic tree cut” [35], yielding clusters of genes (modules). Previous studies have applied this method mostly on microarray gene expression data [36, 37, 38], but also on RNA-seq data [37, 39, 40].

In order to run such an analysis within the resource constraints of the project, it was necessary to reduce the dataset to a reasonable level, as opposed to constructing a network based on all transcripts (over 700 000). The method employed to accomplish such a dimensionality reduction was a filtering based on the coefficient of variation of each gene in the dataset, after log-transformation of all FPKM values. It was reasoned that more variable transcripts would contribute more valuable information for co-expression network inference. A similar strategy has been used previously by Kugler et al. [40]. Genes and transcripts that do not vary at all would by definition be considered co-expressed with each other, but would not contribute any new knowledge about AAA disease in addition to what is already revealed by the differential expression analysis. Therefore, the dataset was reduced to around 16600 genes (using a coefficient of variation threshold of 0.60).

In order to determine the value of the soft thresholding power β, iterative calculations were made, assessing how scale-free the network topology was for each value of the parameter. Values from 2 to 34, with interval 2, were tested. A value of 16 was found to give a sufficiently scale-free structure, and was accordingly applied for the network construction. The scale-free property of the network for the tested values can be seen in figure 4.5 in section 4.4. For the dynamic tree-cut algorithm, a minimum module size of 30 was used, together with a “mergeCutHeight” of 0.25 and otherwise default parameters.

Furthermore, the correlation between the first principal component (the “eigengene”) of the matrix of transcript profiles of each module and a binary vector of sample treatment information (using the values 1 for the A7 group and 0 for the S7 group) was calculated. This correlation was calculated only for the mice harvested after seven weeks (the A3 group was excluded). This should indicate any relationships between the modules and the treatment.
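The filtering and soft thresholding steps described in this section can be sketched in a few lines. The example below is not the WGCNA implementation. It assumes a log2(FPKM + 1) transformation (the pseudocount is an assumption, as the exact transformation is not stated here), keeps profiles at or above the threshold, and uses hypothetical function names and made-up expression values.

```python
import math
import statistics


def log_profile(fpkm_values):
    """log2(FPKM + 1) transform; the pseudocount of 1 is an assumption."""
    return [math.log2(v + 1.0) for v in fpkm_values]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean > 0 else 0.0


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(profiles, beta=16):
    """Adjacency |cor(x_i, x_j)|^beta for every pair of profiles."""
    ids = list(profiles)
    return {
        (a, b): abs(pearson(profiles[a], profiles[b])) ** beta
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


# Made-up FPKM values for three hypothetical transcripts in four samples.
fpkm = {"t1": [0, 1, 7, 63], "t2": [0, 3, 15, 127], "t3": [31, 31, 31, 63]}
logged = {name: log_profile(values) for name, values in fpkm.items()}
kept = {
    name: values
    for name, values in logged.items()
    if coefficient_of_variation(values) >= 0.60
}
adjacency = soft_threshold_adjacency(kept, beta=16)
print(sorted(kept), adjacency)
```

Raising the absolute correlation to a high power leaves strong correlations nearly intact while pushing weak ones towards zero, which is the sense in which the thresholding is "soft".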

3.6.3 Clustering comparison

It is not trivial to estimate the performance of an unsupervised learning algorithm on a dataset where no class labels are known beforehand. Rather, a common approach is to visually inspect the clusters and judge whether they seem to make sense biologically. Another is to compare clustering results of different algorithms to see if general conclusions can be drawn from any consensus that might be found between them. One way of objectively doing the latter is to calculate an overlap matrix. This matrix would contain cells representing the number of genes that any pair of clusters (one from each method) have in common. The matrix could then be seen as a contingency table, and common statistical approaches may be used to determine whether the two variables (the clustering methods) have any significant association. One association measure that could be used is Cramér’s V. This approach yields a score between 0 and 1, where 0 represents no association whatsoever and 1 signifies perfect association (the two methods have given the exact same results, which is unlikely in practice). A good association between two clusterings would give increased support to the common co-expression patterns identified therein. There does not, however, seem to exist any universal consensus on what can be regarded as a “strong enough” association when dealing with the Cramér’s V statistic. Some publications use values around 0.10 as a cutoff, others around 0.25.

In order to get a sense of how the method would perform on two methods that yield largely the same results, a test was made where the k-means clustering method was compared to itself in two runs with k = 51 clusters and n = 1 random starting points each. Usually, one uses hundreds of random starting points to compensate for entrapment in local optima. Only one starting point was used here in order to provide some variability in the clustering results. The algorithm was applied to the coefficient of variation-filtered transcript dataset. The Cramér’s V calculated based on this comparison was 0.76, indicating (as expected) a very high association between the two runs, illustrated in figure 3.1. In the figure, one can see that each module of the first clustering run only matches one other module well in the second clustering.

[Figure 3.1 image: both axes labelled “k-means with K = 51 and 1 starts”]

Figure 3.1: The overlap of two runs of k-means clustering on the coefficient of variation filtered dataset. In both cases, k was set to 51 and n = 1 starting point of the algorithm was used. The image is a visualisation of the contingency matrix, wherein high values of a given cell are displayed brightly and low or zero overlap between two given clusters is displayed darkly shaded. (Additionally, in this image the number of transcripts in the intersection of each pair of modules has been normalised by the total size of the modules for greater clarity.)
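The overlap matrix and Cramér’s V described in this section can be sketched as follows. This is a minimal illustration using the standard definition V = sqrt(chi2 / (n (min(r, c) - 1))); the function names and the toy cluster labels are hypothetical, and the sketch is not the code that produced the value 0.76 reported above.

```python
import math


def overlap_matrix(labels_a, labels_b):
    """Contingency table counting items per (cluster in A, cluster in B) pair."""
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    table = [[0] * len(cols) for _ in rows]
    for a, b in zip(labels_a, labels_b):
        table[rows.index(a)][cols.index(b)] += 1
    return table


def cramers_v(table):
    """Cramér's V computed from the chi-squared statistic of the table."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_sums), len(col_sums))
    return math.sqrt(chi2 / (n * (k - 1)))


# Two toy clusterings of six transcripts: identical up to relabelling.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]
v_same = cramers_v(overlap_matrix(run_1, run_2))

# Two toy clusterings with no association at all.
v_none = cramers_v(overlap_matrix([0, 0, 1, 1], [0, 1, 0, 1]))
print(v_same, v_none)
```

The first comparison gives a value of (numerically) 1 because the cluster labels differ but the partitions are the same, while the second gives 0 because every pair of clusters overlaps equally.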

Besides this, a more biology-based cluster validation procedure was undertaken, whereby network modules were uploaded to the commercial knowledge mining tool MetaCore© [28] (version 6.19) in order to inspect the biological aspects of the seemingly co-expressed transcripts. Gene ontology enrichment was also performed for this reason.

4. Results

4.1 Read quality control

The FastQC [29] results of the twelve sequenced mouse samples revealed that two of them (one from the A3 group and one from the S7 group) deviated much in quality from the rest, and principal component analysis (figure not included) confirmed large inconsistencies in gene expression patterns compared to other samples in the same condition groups. These two samples were thus excluded from further analysis.

Warning flags were also raised by the program regarding overrepresented sequences. These were almost exclusively found to be mitochondrial in origin. This was initially not thought to be a problem, however (though see the discussion in section 5). Patterns of bias at the ends of reads, common in RNA-seq, were also discovered. This bias was thought to be due to a lack of sufficient randomness in the “random” hexamer primers used during library preparation [41].

Furthermore, adapter contamination was found to be present. The cause of this was likely the short average fragment size (around 170 nucleotides), which was revealed with the RSeQC [30] quality control software (see figure 4.1). A too short fragment size means that the sequencing machine will read not just the ends of the fragment, but across the entire fragment and into the adapter ligated to the opposite side. These issues might lead to fewer reads being mapped by TopHat, although it was initially not expected that they would have any serious effect on transcript assembly and downstream analysis (but see also section 5 for more on this topic).

4.2 Prediction of novel long noncoding RNAs

Filtering of partially assembled fragments

Figure 4.2a shows the (log transformed) FPKM distributions of ab initio assembled fragments that partially (“c”) and completely (“=”) match the reference annotation, respectively. The best threshold for predicting these transcript classes from FPKM values was found to be 1.03, with discriminatory performance according to the ROC curve displayed in figure 4.2b. Both the sensitivity and the specificity were found to be quite low, perhaps owing to noisy data. The chosen filtering threshold of 1.03 would therefore discard a lot of potentially true lncRNAs, while simultaneously retaining false positives. However, it was considered more important to filter out as many false positives as possible, and thus to apply this particular threshold, than to do no filtering at all (and thus retain more potentially true lncRNA candidates).
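One way of choosing such a threshold is to pick the ROC point closest to perfect classification (sensitivity 1, specificity 1). The sketch below illustrates that criterion only; it is not the custom Perl and R pipeline used in the project, and the function name, the scores and the class labels are made up.

```python
import math


def best_threshold(scores, is_complete):
    """Return (threshold, sensitivity, specificity) closest to the ideal ROC corner.

    scores: expression values, one per fragment.
    is_complete: True for completely matching ("=") fragments,
        False for partially matching ("c") fragments.
    A fragment is predicted complete when its score is at least the threshold.
    """
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_complete) if y and s >= t)
        fn = sum(1 for s, y in zip(scores, is_complete) if y and s < t)
        tn = sum(1 for s, y in zip(scores, is_complete) if not y and s < t)
        fp = sum(1 for s, y in zip(scores, is_complete) if not y and s >= t)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        # Distance to the ideal point (sensitivity = 1, specificity = 1).
        distance = math.hypot(1 - sensitivity, 1 - specificity)
        if best is None or distance < best[0]:
            best = (distance, t, sensitivity, specificity)
    return best[1:]


# Made-up scores for five "=" fragments followed by five "c" fragments.
scores = [1.5, 2.0, 0.8, 2.5, 0.3, 0.2, 0.5, 1.2, 0.1, 0.4]
labels = [True] * 5 + [False] * 5
threshold, sens, spec = best_threshold(scores, labels)
print(threshold, sens, spec)
```

With overlapping score distributions, as in the toy data, no threshold reaches both perfect sensitivity and perfect specificity, which is the trade-off discussed above.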

[Figure 4.1 image: density plot; x-axis “mRNA insert size (bp)”, y-axis “Density”; Mean = −34, SD = 59]

Figure 4.1: Inner distance of paired end reads in one of the samples. The plot was produced with RSeQC 2.3.6 from one of the three AngII treated replicates that had been harvested after seven weeks, but is representative of all samples. A negative insert size indicates that reads mapped to the genome overlap with each other.

[Figure 4.2 image: (a) density versus log10(FPKM), coloured by class_code (“=” and “c”); (b) sensitivity versus 1−specificity]

Figure 4.2: a) Log10 transformed FPKM distributions of fragments that partially (“c”) and completely (“=”) match the reference annotation. b) ROC curve showing the discriminatory performance of each threshold when classifying transcripts as either partially or completely assembled. The optimal threshold should correspond to the point closest to the upper left corner of the coordinate system (sensitivity 1, 1−specificity 0).


Identification of long noncoding RNAs

29 candidate lncRNAs were obtained subsequent to filtering with PhyloCSF and matching against Pfam. Intersecting these candidates with NONCODE v4 revealed an overlap of three transcripts. Furthermore, the genomic coordinates of these 29 candidates were translated (using the LiftOver tool from UCSC [42]) and matched (using Cuffcompare) against the most recent reference genomes (mm10) from Ensembl, RefSeq and UCSC. Only the three lncRNAs that also matched NONCODE were present in any of these updated annotations. The remainder were thus 26 potentially novel lncRNAs (table 4.1). Of these, three did not overlap any known reference transcript (class code “u”) and could thus be long intergenic non-coding RNAs (lincRNAs).

Table 4.1: Overview of the 26 novel lncRNA candidates. Nearest reference genes, transcripts and class codes were assigned by Cuffcompare, using the Ensembl mm9 annotation as reference. Isoform IDs are the internal ones assigned to the transcripts by Cufflinks.

Isoform ID      Closest gene   Nearest transcript  Class  Length  Locus
TCONS_00026197  Obfc2a         ENSMUST00000027279  j      4944    1:51522798-51535243
TCONS_00037991  Rasal2         ENSMUST00000078308  i      1775    1:159065313-159350366
TCONS_00038774  Mpzl1          ENSMUST00000111435  j      1460    1:167522311-167564672
TCONS_00053753  Pawr           ENSMUST00000095313  i      788     10:107769244-107863779
TCONS_00075759  Ebf1           ENSMUST00000081265  i      996     11:44428427-44821593
TCONS_00076304  Mgat1          ENSMUST00000101293  j      3676    11:49057692-49076532
TCONS_00084766  -              -                   u      10332   11:104790337-104801947
TCONS_00095737  Tnk1           ENSMUST00000108631  j      2348    11:69659877-69672232
TCONS_00117924  Ubxn2a         ENSMUST00000020962  j      6341    12:4881930-4914511
TCONS_00121122  Tspan13        ENSMUST00000020896  j      1100    12:36741143-36769087
TCONS_00124446  Sos2           ENSMUST00000035773  i      233     12:70684747-70782839
TCONS_00134561  Gcnt2          ENSMUST00000067778  i      591     13:40955500-41108388
TCONS_00190470  Gpihbp1        ENSMUST00000023243  j      577     15:75426921-75432461
TCONS_00210492  Lpp            ENSMUST00000038053  i      1118    16:24391697-24992662
TCONS_00275488  Gnaq           ENSMUST00000025541  i      5807    19:16207320-16478609
TCONS_00298653  Cstf3          ENSMUST00000028599  j      14027   2:104430679-104552326
TCONS_00307567  -              -                   u      615     2:171483568-171489341
TCONS_00355809  Ccnl1          ENSMUST00000154585  j      1102    3:65750072-65762171
TCONS_00385472  -              -                   u      335     4:155610115-155610534
TCONS_00445558  Ubn2           ENSMUST00000160583  j      15372   6:38383924-38474825
TCONS_00451604  Slc6a6         ENSMUST00000032185  i      1553    6:91634060-91709057
TCONS_00455818  Clec2d         ENSMUST00000032260  j      817     6:129112594-129136552
TCONS_00479043  Lsm14a         ENSMUST00000085585  i      237     7:35129664-35179798
TCONS_00492246  Lsp1           ENSMUST00000105968  i      2021    7:149646713-149701914
TCONS_00559428  Pts            ENSMUST00000034570  o      1688    9:50329722-50336746
TCONS_00583336  4930468A15Rik  ENSMUST00000114054  i      1108    X:73827948-73848263

By comparison to the NONCODE v4 database and Ensembl mm9, 30490 previously annotated long non-coding RNAs were additionally detected from the set of RABT assembled fragments. A relative lack of small non-coding RNAs (and miRNAs) was observed. This was thought to be due to the greater technological difficulties involved in the detection of short transcripts. Normally, small RNA and microRNA studies using RNA-seq require a specialised type of library preparation in order to be feasible [43].


4.3 Differential expression testing

The total number of significantly differentially expressed transcripts (both coding and non-coding) was 8801 (out of 642012 assembled transcripts in total). In the comparison between AngII treated mice and control mice harvested after seven weeks (A7 and S7, respectively), the number was 6912. Of the previously known lncRNAs (annotated in NONCODE v4 and Ensembl mm9), 246 were found to be significantly differentially expressed (p-value less than 0.05 and using a q-value cutoff of 0.05). 185 of those were between the A7 and S7 mice. Among the 26 novel lncRNA candidates, two were significantly differentially expressed between A7 and S7 (table 4.4).

The 35 overall most significantly differentially expressed genes (based on individual isoforms from those genes) between A7 and S7 are presented in table 4.2 (novel genes have been excluded). Similarly, the 35 most differentially expressed long noncoding RNAs are presented in table 4.3.

Some of the genes in table 4.3 have infinite fold change. This only means that they were exclusively expressed in one of the sample groups. One of those was Gm12245, a previously predicted pseudogene whose expression was only observed in the case mice. Given that it is 357 bp long and noncoding, it would be classified as a lncRNA. Not much else is known about it, however. Three of the most differentially expressed genes (infinite fold change) were Hist1h4j, Hist1h3i and Hist1h2ah, which all code for histone cluster proteins. Histone proteins, important in the context of chromosomal structure, tend to be up-regulated during DNA replication [44]. Higher expression of histone proteins may therefore be a sign of increased cellular proliferation.

Two other transcripts exclusively expressed in case mice were SCARNA17, which stands for Small Cajal body-specific RNA 17, and A930005H10Rik. Very little appears to be known about SCARNA17, other than the fact that Cajal body-specific RNAs seem to play a role in the spliceosomal machinery [45]. The ENSMUST00000067500 isoform of A930005H10Rik is a 309 bp long processed transcript which does not code for any protein. It is expressed antisense to the gene Dph5.

Among the genes differentially expressed with finite fold change, H19 was the one that deviated most between case and control. H19 encodes a long noncoding RNA and has been shown to interact with Igf2 (Insulin-like growth factor 2) via a chromatin remodelling based epigenetic switch [46]. See also section 4.4.

Col11a1 and Col12a1 encode two types of collagen (type XI and XII). Collagens are important constituents of the extracellular matrix. The extracellular matrix (and possible degeneration thereof) is an important part of the arterial wall and plays a large role in the mechanism of AAA disease [3].

Two of the other most differentially expressed genes were Thbs1 and Thbs4, which encode thrombospondin proteins. According to Gonzalez-Quesada et al. (2013) [47], thrombospondin interacts with the extracellular matrix and “is a prototypical matricellular protein that is not part of the normal cardiac matrix network, but is upregulated in cardiac remodeling because of myocardial infarction or pressure overload”. Thrombospondin has been found previously to be associated with AAA [48].

Itm2a (integral membrane protein 2A), in turn, encodes a transmembrane protein


Table 4.2: The overall 35 most differentially expressed genes (based on individual isoforms) between AngII treated mice harvested after seven weeks (A7) and control mice harvested after seven weeks (S7). Fold change, p- and q-values (the latter accounting for multiple testing) were calculated with Cuffdiff 2.2.1. A negative fold change indicates down-regulation in the control group (up-regulation in the AngII-treated group). “Class” refers to the class codes detailed in table 3.1.

Transcript ID       Gene           Class  log2 fold change  p-value  q-value
ENSMUST00000117266  Gm12245        =      -Inf              0.00005  0.00265555
ENSMUST00000087714  Hist1h4j       =      -Inf              0.00005  0.00265555
ENSMUST00000099704  Hist1h3i       =      -Inf              0.00005  0.00265555
ENSMUST00000091742  Hist1h2ah      =      -Inf              0.00005  0.00265555
ENSMUST00000158610  SCARNA17       =      -Inf              0.00005  0.00265555
ENSMUST00000067500  A930005H10Rik  =      -Inf              0.00025  0.01074100
ENSMUST00000158415  RNaseP_nuc     =      Inf               0.00065  0.02343650
ENSMUST00000150405  Cpxm2          =      -Inf              0.00110  0.03559290
ENSMUST00000136359  H19            =      -7.36622          0.00005  0.00265555
ENSMUST00000140716  H19            =      -7.06716          0.00005  0.00265555
ENSMUST00000071750  Col12a1        =      -6.56065          0.00015  0.00697443
ENSMUST00000022213  Thbs4          =      -6.25172          0.00005  0.00265555
ENSMUST00000123619  Col11a1        =      -6.08950          0.00030  0.01248340
ENSMUST00000039559  Thbs1          =      -5.74065          0.00030  0.01248340
ENSMUST00000033591  Itm2a          =      -5.67131          0.00005  0.00265555
ENSMUST00000103095  Tnnc2          =      -5.60156          0.00005  0.00265555
ENSMUST00000039178  Tnn            =      -5.53919          0.00005  0.00265555
ENSMUST00000006626  Actn3          =      -5.50895          0.00015  0.00697443
ENSMUST00000001547  Col1a1         =      -5.42860          0.00005  0.00265555
ENSMUST00000055770  Hist1h1a       =      -5.37595          0.00005  0.00265555
ENSMUST00000170872  Thbs2          =      -5.35743          0.00025  0.01074100
ENSMUST00000080511  Hist1h1b       =      -5.32490          0.00005  0.00265555
ENSMUST00000031565  Fscn1          =      -5.21660          0.00005  0.00265555
ENSMUST00000025497  Fbn2           =      -5.17334          0.00005  0.00265555
ENSMUST00000032800  Tyrobp         =      -5.16206          0.00005  0.00265555
ENSMUST00000081035  Mpeg1          =      -5.05190          0.00005  0.00265555
ENSMUST00000119429  Myl1           =      -5.02694          0.00005  0.00265555
ENSMUST00000110455  Hist1h2bk      =      -5.01183          0.00075  0.02627970
ENSMUST00000027885  Angptl1        =      -4.98405          0.00010  0.00490988
ENSMUST00000003643  Ckm            =      -4.96161          0.00005  0.00265555
ENSMUST00000171470  Lox            =      -4.94639          0.00060  0.02200790
ENSMUST00000138511  Col1a2         =      -4.89817          0.00160  0.04753520
ENSMUST00000021231  Abcc3          =      -4.86676          0.00095  0.03168320
ENSMUST00000021506  Serpina3n      =      -4.85405          0.00005  0.00265555
ENSMUST00000005255  Wisp1          =      -4.78107          0.00005  0.00265555
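The sign convention and the infinite entries of the log2 fold change column can be illustrated with a small sketch. It assumes that the fold change is the control group (S7) expression divided by the treated group (A7) expression, which is consistent with the table caption; the function name and the FPKM values are made up and this is not Cuffdiff's own code.

```python
import math


def log2_fold_change(fpkm_treated, fpkm_control):
    """log2(control / treated), infinite if only one group shows expression."""
    if fpkm_treated == 0 and fpkm_control == 0:
        return math.nan  # not expressed in either group
    if fpkm_control == 0:
        return -math.inf  # expressed only in the treated group
    if fpkm_treated == 0:
        return math.inf  # expressed only in the control group
    return math.log2(fpkm_control / fpkm_treated)


# Made-up FPKM values (treated, control).
up_in_treated = log2_fold_change(8.0, 1.0)
only_in_treated = log2_fold_change(3.2, 0.0)
only_in_control = log2_fold_change(0.0, 1.5)
print(up_in_treated, only_in_treated, only_in_control)
```

Under this convention, a transcript observed only in the AngII treated mice receives a value of -Inf, and one observed only in the control mice a value of Inf.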


Table 4.3: The overall 35 most differentially expressed lncRNA candidates between the case and control mice that had been harvested seven weeks after treatment (A7 and S7, respectively). Column headers as in table 4.2. Transcript IDs starting with “ENSMUST” are Ensembl IDs; those starting with “NONMMUT” refer to transcripts only annotated by NONCODE.

Transcript ID       Gene           Class  log2 fold change  p-value  q-value
ENSMUST00000117266  Gm12245        =      -Inf              0.00005  0.00265555
ENSMUST00000150405  Cpxm2          =      -Inf              0.00110  0.03559290
ENSMUST00000136359  H19            =      -7.36622          0.00005  0.00265555
ENSMUST00000140716  H19            =      -7.06716          0.00005  0.00265555
ENSMUST00000138511  Col1a2         =      -4.89817          0.00160  0.04753520
ENSMUST00000138145  Ccl6           =      -3.72700          0.00030  0.01248340
ENSMUST00000112866  Cxcl12         =      -3.22994          0.00020  0.00891552
ENSMUST00000145974  Col5a3         =      -3.11406          0.00145  0.04401550
ENSMUST00000108631  Tnk1           j      -2.98056          0.00155  0.04638690
ENSMUST00000131787  2410006H16Rik  =      -2.96827          0.00005  0.00265555
ENSMUST00000154058  Gm15639        =      -2.95309          0.00155  0.04638690
ENSMUST00000053085  Nlrc5          =      -2.92918          0.00005  0.00265555
ENSMUST00000164588  Gm17446        =      -2.89507          0.00015  0.00697443
ENSMUST00000162785  Ms4a7          =      -2.79884          0.00005  0.00265555
ENSMUST00000130179  Mtmr1          =      -2.74925          0.00030  0.01248340
ENSMUST00000147681  F630028O10Rik  =      -2.69458          0.00005  0.00265555
ENSMUST00000167285  Usf2           =      -2.66836          0.00020  0.00891552
ENSMUST00000139643  Pnpla7         =      -2.61396          0.00125  0.03928280
ENSMUST00000143673  AI662270       =      -2.59238          0.00005  0.00265555
ENSMUST00000140732  Ptov1          =      -2.57090          0.00090  0.03036040
ENSMUST00000128161  Rgs12          =      -2.49486          0.00020  0.00891552
ENSMUST00000159745  Tmem55b        =      -2.48982          0.00075  0.02627970
ENSMUST00000155083  Ppox           =      -2.47681          0.00005  0.00265555
ENSMUST00000132618  Eif1           =      -2.45017          0.00010  0.00490988
NONMMUT029052       -              =      -2.39013          0.00020  0.00891552
ENSMUST00000161573  Psme2          =      -2.38926          0.00010  0.00490988
ENSMUST00000128015  Plekhg2        =      -2.38253          0.00005  0.00265555
ENSMUST00000153077  Megf8          =      -2.37883          0.00010  0.00490988
ENSMUST00000165292  Gm17282        =      -2.36673          0.00005  0.00265555
ENSMUST00000137837  Lrrc42         =      -2.34961          0.00040  0.01584780
ENSMUST00000152710  Vgll4          =      -2.29871          0.00045  0.01747430
ENSMUST00000034997  Snhg5          =      -2.29561          0.00005  0.00265555
ENSMUST00000132340  Trem2          =      -2.28742          0.00005  0.00265555
ENSMUST00000149391  Gdi1           =      -2.27913          0.00005  0.00265555
ENSMUST00000146254  Cd300lf        =      -2.27148          0.00115  0.03682310

involved in muscle cell differentiation [49] and T-cell activation [50]. It has also been shown to interact with collagens in terms of its involvement in chondrogenic differentiation [51]. A role in AAA via effects on the extracellular matrix of the aortic wall is thus not unthinkable.

The novel differentially expressed transcripts had the IDs TCONS_00095737 and TCONS_00559428. TCONS_00095737 was assigned by Cufflinks to the gene Tnk1 with the class code “j”, signifying that the transcript shares at least one splice junction with a reference isoform. This means that it could be a novel alternative splice variant. TCONS_00559428 was closest to the Pts gene, sharing a “generic exonic overlap” (class code “o”). This simply means that the transcript overlaps at least one exon of the reference transcript, but does not share any part of the splicing structure with the reference. It may be a flanking single-exonic transcript that overlaps one of the ends of the Pts gene (though, intuitively, it is more likely to be an artifact than any “j” isoform found).

TCONS_00095737 was the only one of the isoforms of Tnk1 that was differentially expressed, as can be seen in figure 4.3. TCONS_00559428 showed a similar expression pattern to the other isoforms of Pts, only with lower overall expression (figure 4.4). This would perhaps suggest that the expression is a passive consequence of the main protein coding isoform expression, maybe due to inaccuracies of the splicing machinery. In mouse, only one isoform of this gene is described in the Ensembl annotation (the protein coding variant). In human, nine are annotated, four of them long noncoding. This raises the likelihood that this novel lncRNA could be a true transcript variant.

Table 4.4: Differentially expressed novel lncRNA candidates between the case and control mice that had been harvested seven weeks after treatment. Column headers as in table 4.2. The transcript IDs correspond to those in table 4.1.

Transcript ID   Closest gene  Class  log2 fold change  p-value  q-value
TCONS_00095737  Tnk1          j      -2.98             0.002    0.037
TCONS_00559428  Pts           o      -1.39             0.001    0.028

[Figure 4.3 image: bar plots of FPKM (y-axis) by sample_name (A7, S7) for the Tnk1 isoforms TCONS_00095737 to TCONS_00095744]

Figure 4.3: Expression of the isoforms of Tnk1, to which the novel transcript TCONS_00095737 was assigned. The blue bar shows expression in the A7 treated group and the orange bar shows expression in the S7 control group. Black dots are the expression levels from individual samples in each group and error bars are confidence intervals based on variance modelling calculated by Cuffdiff.

[Figure 4.4 image: bar plots of FPKM (y-axis) by sample_name (A7, S7) for the Pts isoforms TCONS_00559426, TCONS_00559427 and TCONS_00559428]

Figure 4.4: Expression of the isoforms of Pts, to which the novel long noncoding transcript TCONS_00559428 was assigned. TCONS_00559427 was also a novel transcript. Axis labels and figure elements are the same as in figure 4.3.


4.4 Co-expression network modules

In order to perform de novo network inference with the WGCNA method, an optimal soft thresholding power needed to be calculated, as described in section 3.6.2. The choice of this thresholding power was made based on figure 4.5. A value of 16 was found to be appropriate with respect to desired network scale independence and (low) mean connectivity.

[Figure 4.5 image: left panel “Scale independence” (Scale Free Topology Model Fit, signed R^2, versus Soft Threshold (power)); right panel “Mean connectivity” (Mean Connectivity versus Soft Threshold (power))]

Figure 4.5: Left: Fit of the WGCNA network to a model of scale free topology as a function of the soft thresholding power. Right: Mean connectivity as a function of the soft thresholding power. The red line indicates where the gain in scale-free fit begins to level out. Red numbers indicate the tested soft thresholding powers.

The network construction made with the 16689 transcripts that remained after filtering yielded 51 modules. The dendrogram obtained after hierarchical clustering is shown in figure 4.6, together with the module assignments made with the application of dynamic tree cutting. The modules were assigned colour codes in this process (as shown in figure 4.6), which from now on will be used to refer to them. The colours are merely labels and do not reveal any other information about the modules.

The first principal component of the modules (the “eigengene”, in the terminology of Langfelder and Horvath [22]) was correlated with a vector of treatment status as mentioned in section 3.6.2. The modules were subsequently ranked according to the absolute Pearson correlation coefficient obtained, also taking into account the p-value of the correlation.

Intersecting the set of significantly differentially expressed previously known lncRNAs with the transcripts in the most treatment-correlated modules revealed that 20 of those lncRNAs were present in one of those clusters (see table 4.5). Of the 26 novel lncRNA candidates, only two were left after coefficient of variation filtering. Of these, the intergenic TCONS_00385472 was assigned to the “red” cluster. The other one was TCONS_00190470, which was assigned to the “blue” module. Those of the top ranking modules that also contain interesting lncRNAs are shown in table 4.5.
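The ranking of modules by their correlation with treatment can be sketched as below. The example assumes that the module eigengenes have already been computed; it ranks by absolute Pearson correlation only and leaves out the p-value calculation. The function names, the eigengene values and the number of samples per group are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_modules(eigengenes, treatment):
    """Sort modules by absolute correlation between eigengene and treatment."""
    scored = [
        (name, pearson(values, treatment)) for name, values in eigengenes.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Binary treatment vector: 1 for A7 samples, 0 for S7 samples (toy sizes).
treatment = [1, 1, 1, 0, 0, 0]
# Made-up eigengene values for three hypothetical modules.
eigengenes = {
    "blue": [0.4, 0.5, 0.3, -0.4, -0.5, -0.3],
    "grey": [0.1, -0.3, 0.2, 0.2, -0.3, 0.1],
    "red": [-0.5, -0.2, -0.4, 0.3, 0.5, 0.3],
}
ranking = rank_modules(eigengenes, treatment)
for name, r in ranking:
    print(name, round(r, 2))
```

Because the ranking uses the absolute value, a module whose eigengene is strongly negatively correlated with treatment (the toy "red" module) ranks above a module with no correlation at all.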

[Figure 4.6 image: “Cluster Dendrogram” (y-axis “Height”, hclust with average linkage) above a band of “Module colors”]

Figure 4.6: An illustration of the WGCNA module assignment process for this dataset. The black dendrogram is obtained through hierarchical clustering of the weighted correlation based network. Modules are represented by individual colours and are the result of applying dynamic tree cutting to the dendrogram.

Given those results, two of the modules, “blue” and “red”, seem to justify further scrutiny, partly because they contain both novel and differentially expressed lncRNAs, but also because they were found to be strongly correlated with the treatment vector (see table 4.5).

Table 4.5: Ranking of modules according to Pearson correlation coefficients calculated between the eigengene of the module and a vector of treatment status (see section 3.6.2). Each module is represented with a colour code that was assigned to it during the WGCNA procedure, and which corresponds to the colours in figure 4.13. Only the most significantly correlated modules (an absolute correlation of at least 0.80 and a p-value less than or equal to 0.05) which harbour either differentially expressed lncRNAs or novel lncRNAs are shown here. “Size” refers to the number of transcript isoforms in the module. “D.e.” stands for differentially expressed.

Module     Correlation  p-value  Size  D.e. lncRNAs  Novel lncRNAs
magenta    0.95         0.0004   538   1             0
turquoise  0.95         0.001    2745  2             0
green      0.88         0.009    755   3             0
blue       0.87         0.01     2252  9             1
black      0.82         0.02     628   2             0
red        0.80         0.03     678   3             1

The “blue” module

Nine significantly differentially expressed lncRNAs were detected in the blue module. They are presented in table 4.6. Besides those, a novel lncRNA was also present. It did not, however, present any significant difference in transcript abundance among any of the sample groups. This lncRNA was denoted TCONS_00190470 in table 4.1 and appeared to be a novel noncoding isoform of the Gpihbp1 (glycosylphosphatidylinositol anchored high density lipoprotein binding protein 1) gene. The main protein otherwise encoded by this gene acts as a lipoprotein lipase transporter in capillary endothelial cells [52].

Table 4.6: Interesting lncRNAs in the “blue” module. “Class” refers to the class code of the transcript (see table 3.1). “Groups” refers to which sample groups the difference was observed in. In all cases, p-values (as obtained from Cuffdiff) were less than 0.05 and the differential expression was considered significant after applying multiple testing correction (an alpha of 0.05 was used as threshold for the “getSig” function in the cummeRbund R package).

Transcript ID       Gene           Class  log2 fold change     Groups
ENSMUST00000132313  Hmg20b         =      -2.35, -2.21         A3, S7; A7, S7
ENSMUST00000132340  Trem2          =      -2.29                A7, S7
ENSMUST00000162785  Ms4a7          =      1.56, -1.24, -2.80   A3, A7; A3, S7; A7, S7
ENSMUST00000149366  Tmem48         =      -3.00                A3, S7
ENSMUST00000134637  Cdca3          =      2.42                 A3, A7
ENSMUST00000140732  Ptov1          =      -2.57                A7, S7
ENSMUST00000136359  H19            =      -7.37                A7, S7
ENSMUST00000145974  Col5a3         =      -2.47, -3.11         A3, S7; A7, S7
ENSMUST00000152109  C430049B03Rik  =      -1.80806             A7, S7

Notable among the transcripts in table 4.6 was an isoform of H19, which had a log2 fold change of -7.37 (up-regulated in A7 compared to S7). This noncoding gene is known to influence the expression of the neighbouring gene Igf2 (insulin-like growth factor 2) [46, 53]. A transcript from Igf2 was present in the same co-expression module as the H19 isoform discussed here. The log2 fold change of the relevant Igf2 transcript was -3.38 (also up-regulated in A7 compared to S7), although in this case the difference was not statistically significant after multiple testing correction.

The second most differentially expressed lncRNA transcript in this module was a noncoding isoform (with a retained intron) of Col5a3 (collagen, type V, alpha 3). The protein encoded by this gene has been shown to be important for glucose homeostasis in mice [54]. It has also been implicated in Ehlers-Danlos syndrome, a cardiovascular disease affecting connective tissue (one symptom of which can be arterial rupture) [55]. Col5a3 has also been shown to be targeted by the AAA-related microRNA-29 family [56]. The protein coding transcript (with Ensembl transcript ID ENSMUST00000004201) was also significantly differentially expressed, both between A3 and S7 and between A7 and S7 (the log2 fold changes were -3.06 and -3.78, respectively, for those two group comparisons). See figure 4.7.

The blue module was imported into Cytoscape 3.1.1 [34], which was used to calculate the betweenness centrality and degree of each node; the result is shown in figure A.7. In the process of importing the module, connections with weights below 0.29 were removed (due to the time it would take to perform an analysis on all nodes). The calculations were thus made on 274 of the original 2252 nodes.

Highly connected and central (in terms of betweenness centrality) transcripts in the blue module came from the genes Aspm, Hhipl1, Mirg, Ltbp2, Hist1h3i, Wisp1, Fkbp11, Ndc80, Kif11, Ms4a7, Cenpe, Sdc3, Gpx7, Ptpn13, Sorcs2, Ccdc8, Plk1, Myo7a, Cmtm3, Ednra, Kif15 and Kif4. Four of those transcripts were novel isoforms of the genes Hhipl1, Mirg and Sdc3. None of them were lncRNAs, though Mirg is a gene harbouring several miRNAs.

[Figure 4.7 plot: Col5a3 isoform panels TCONS_00555880 and TCONS_00555881; x-axis sample_name (A3, A7, S7), y-axis FPKM.]

Figure 4.7: Expression levels of the isoforms of Col5a3. TCONS_00555880 is the internal cufflinks ID representing ENSMUST00000004201 (the protein coding isoform) and TCONS_00555881 represents ENSMUST00000145974 (the noncoding isoform). FPKM-values are shown on the vertical axis. Black dots show the expression values of each individual replicate. Blue bars represent the A3 group, pink the A7 group and green the S7 control group. Error bars are confidence intervals based on variance modelling calculated by CuffDiff.
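The network step described for the blue module (removing weak connections, then computing degree and betweenness centrality) can be sketched in a few lines. This is not the Cytoscape implementation: it is a plain, unweighted version of Brandes' algorithm on a hypothetical toy network, and Cytoscape may scale its centrality values differently. The function names and edge weights are illustrative assumptions.

```python
from collections import deque

def threshold_edges(weighted_edges, min_weight):
    """Keep edges whose weight is at least min_weight; return an adjacency dict."""
    adjacency = {}
    for u, v, w in weighted_edges:
        if w >= min_weight:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency

def betweenness(adjacency):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in BFS visiting order
        preds = {node: [] for node in adjacency}     # predecessors on shortest paths
        n_paths = {node: 0 for node in adjacency}    # number of shortest paths from source
        dist = {node: -1 for node in adjacency}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair is counted from both endpoints, so halve the totals.
    return {node: value / 2.0 for node, value in score.items()}

# Hypothetical weighted edges (not from the study).
edges = [("a", "b", 0.50), ("b", "c", 0.40), ("c", "d", 0.30),
         ("a", "d", 0.10), ("d", "e", 0.05)]
adjacency = threshold_edges(edges, min_weight=0.29)
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
centrality = betweenness(adjacency)
```

Node `e` loses its only edge at the 0.29 cut-off and therefore drops out of the analysed network, which mirrors how thresholding reduced the number of analysed nodes in the module.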

[Figure 4.8 network diagram: nodes are labelled with internal transcript identifiers of the form TCONS_00514717.]

Figure 4.8: The “blue” WGCNA derived module. Larger nodes indicate higher betweenness centrality. Thicker edges have more weight (the edge weights are also displayed on a colour scale from green (low) to red (high)). Degree is indicated with a colour scale: green corresponds to low connectivity and red to high connectivity. The names of the nodes are the internal identifiers assigned to each transcript by the assembly software Cufflinks. A high resolution version of the figure can be obtained from the author upon request.


The “red” module

The three significantly differentially expressed lncRNAs in the red module are summarised in table 4.7. In addition, one novel intergenic lncRNA (lincRNA) was present (TCONS_00385472, see table 4.1), though not with a statistically significant difference in transcript abundance between any of the sample groups (figure 4.12).

Table 4.7: Interesting lncRNAs in the “red” module. “Groups” refers to which sample groups the difference was observed in. Remaining column headers as in table 4.2.

TranscriptID         Gene       Class   log2 fold change   p-value   q-value   Groups
ENSMUST00000105240   Timeless   =       -3.55              0.00005   0.0027    A3, S7
ENSMUST00000148151   Flcn       =       -2.88              0.00015   0.0070    A3, S7
ENSMUST00000132193   Alkbh6     =       -1.62              0.00055   0.021     A3, S7
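The relation between the p-value and q-value columns can be illustrated with a small sketch of the Benjamini-Hochberg procedure. This assumes that the q-values are Benjamini-Hochberg adjusted p-values, which is not stated explicitly here. The p-values below are hypothetical, and the q-values in table 4.7 cannot be reproduced without the full list of tests.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical p-values for five tests (not taken from the study).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
```

In this toy input the third and fourth tests have p-values below 0.05 but adjusted values just above it, which is the situation described for transcripts that lose significance after multiple testing correction.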

One of the three differentially expressed, previously annotated lncRNAs in the red module was ENSMUST00000105240, a noncoding isoform of Timeless. The protein coding transcript from this gene was named after its involvement in the regulation of the circadian rhythm in Drosophila melanogaster, and its mouse homolog appears to be essential in that context as well [57, 58]. Recently, the gene has also been shown to play a role in coordinated apoptosis of embryonic cells [59]. However, the lncRNA was the only isoform of this gene that was (significantly) differentially expressed.

The second differentially expressed lncRNA in this module was a noncoding variant of Flcn, which otherwise codes for the protein folliculin. The main protein of the gene is believed to be a tumour suppressor [60]. The corresponding protein coding transcript (ENSMUST00000102697) was also significantly differentially expressed (both between the groups A3 and S7, and A7 and S7).

The third one was ENSMUST00000132193. The transcript is a 1112 bp long non-coding isoform of the Alkbh6 (alkylation repair homolog 6) gene, spliced with intron retention. It was the only one of the 12 assembled isoforms of this gene that was found to be significantly differentially expressed. Expression of ENSMUST00000132193 was up-regulated in the AngII-treated mice compared to the controls, though more so in the mice harvested three weeks after treatment than those harvested after seven weeks (see figure 4.11).

Cytoscape 3.1.1 [34] was used to calculate the betweenness centrality and degree of each node. The resulting network can be seen in figure 4.13. Connections with weights below 0.25 were removed (as before, this was mainly done to reduce the computational time of the analysis, but also to focus on the strongest co-expression relations). The 10 most central nodes of this module are summarised in table 4.8. Notably, one of them (TCONS_00190611) was a predicted novel fragment (although not one of the predicted novel lncRNAs). The transcript was present in neither the Ensembl nor the NONCODE annotation, and attempting to use LiftOver to identify its position on the mm10 genome failed. It is therefore either a novel fragment or an artifact of the mm9 genome build; the latter is suspected.


Figure 4.9: Expression levels of four isoforms of Timeless. The lncRNA ENSMUST00000105240 is named TCONS_00056327 here (internal identifier of cufflinks). FPKM-values are shown on the vertical axis. Black dots show the expression values of each individual replicate. Error bars are confidence intervals based on variance modelling calculated by CuffDiff. Blue bars represent the A3 group, pink the A7 group and green the S7 control group.

[Figure 4.10 plot: Flcn isoform panels TCONS_00094385 to TCONS_00094390; x-axis sample_name (A3, A7, S7), y-axis FPKM.]

Figure 4.10: Expression levels of the isoforms of Flcn. The lncRNA ENSMUST00000148151 is named TCONS_00094386 here (internal identifier of cufflinks). Axis labels and figure elements are the same as in figure 4.9.

[Figure 4.11 plot: Alkbh6 isoform panels TCONS_00478364 to TCONS_00478375; x-axis sample_name (A3, A7, S7), y-axis FPKM.]

Figure 4.11: Expression levels of the isoforms of Alkbh6. The lncRNA ENSMUST00000132193 is named TCONS_00478367 here (internal identifier of cufflinks). Axis labels and figure elements are the same as in figure 4.9.

[Figure 4.12 plot: TCONS_00385472; x-axis sample_name (A3, A7, S7), y-axis FPKM.]

Figure 4.12: Expression levels of the novel lncRNA TCONS_00385472. Axis labels and figure elements are the same as in figure 4.9.

Table 4.8: Summary of the ten most central transcripts in the “red” module. The isoform ids are internal Cufflinks identifiers. The meaning of the class codes can be found in table 3.1.

Isoform id       Gene     Closest transcript   Class code
TCONS_00079525   Rpain    ENSMUST00000018593   =
TCONS_00161510   Lrp10    ENSMUST00000022782   =
TCONS_00190611   -        -                    u
TCONS_00206829   Dhh      ENSMUST00000023737   =
TCONS_00236333   Aars2    ENSMUST00000024733   =
TCONS_00304221   Tm9sf4   ENSMUST00000089027   =
TCONS_00312467   Ccbl1    ENSMUST00000044038   =
TCONS_00426158   Sh3tc1   ENSMUST00000037959   =
TCONS_00524701   Vac14    ENSMUST00000166307   =
TCONS_00534181   Ano8     ENSMUST00000093450   =

[Figure 4.13 network diagram: nodes are labelled with internal transcript identifiers of the form TCONS_00204163.]

Figure 4.13: The “red” WGCNA derived module. Figure elements are the same as described in the legend of figure A.7. A high resolution version of the figure can be obtained from the author upon request.


4.5 Gene ontology enrichment

The “blue” module

GO enrichment analysis was performed using the BiNGO plugin of Cytoscape 3.1.1, with default settings used for multiple testing correction. All genes of the transcripts in the blue module were included. In order to interpret those results more easily, the resulting GO terms and false discovery rate (FDR) adjusted p-values were analysed with the GO interpretation tool REVIGO [61] (the software lacks version numbering), using “large” as the size of the output list of terms, “Mus musculus” as reference and otherwise default settings (as defined on 2014-08-08). This tool uses semantic analysis to group the terms in sensible ways so as to reduce redundancy.

Highly significant categories indicated by REVIGO in the “biological process” ontology were “DNA metabolism”, “regulation of immune system process”, “blood vessel morphogenesis”, “defence response”, “organelle fission”, “cell cycle”, “cellular process” and “metabolism” (figure 4.14a).

Similarly, in the “molecular function” ontology, important categories were: “calcium ion binding”, “receptor binding”, “protein kinase activity”, “binding”, “catalytic activity”, “GTPase regulator activity”, “manganese ion transmembrane transporter activity” and “cytokine receptor activity” (figure 4.14b).

In the “cellular component” category, important terms were “chromosomal part”, “cell” (though that term is rather nondescript), “intrinsic to plasma membrane”, “organelle”, “extracellular region part”, “protein-DNA complex”, “perinuclear region of cytoplasm”, “membrane” and “extracellular matrix” (figure 4.14c).
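As a rough illustration of what an enrichment p-value for a single GO term expresses, the sketch below computes an upper-tail hypergeometric probability. It assumes a hypergeometric test, a common choice for over-representation analysis; the text only states that default settings were used, so this is not necessarily the exact calculation that was performed. All counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

# Hypothetical counts (not taken from the study):
# 20 annotated genes in total, 5 of them carrying the GO term,
# and a module of 6 genes of which 4 carry the term.
p = hypergeom_upper_tail(k=4, n=6, K=5, N=20)
```

A p-value like this would be computed once per GO term and then adjusted for multiple testing before the terms are passed on for semantic grouping.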


Figure 4.14: The most prominent terms among the significantly enriched GO terms of the “blue” module; a) in the “biological process” ontology, b) in the “molecular function” ontology, c) in the “cellular component” ontology. The size of the square representing each term is proportional to the absolute log p-value of the enrichment significance of that term. Enrichment was calculated with the BiNGO plugin of Cytoscape 3.1.1 and the semantic grouping in this figure was the result of analysis with the REVIGO software, using the option “large” regarding the size of the resulting list of terms and “Mus musculus” as reference (and otherwise default settings). The log10 p-values visually presented here are the logarithms of the FDR-adjusted p-values from BiNGO. A high resolution version of the figure can be obtained from the author upon request.

[Figure 4.14, panels (a) to (c): REVIGO Gene Ontology treemaps for the “biological process”, “molecular function” and “cellular component” ontologies.]

The “red” module

GO enrichment analysis (using the BiNGO plugin of Cytoscape 3.1.1, with default settings used for multiple testing correction) was performed on the “red” module, and the results were parsed and analysed with REVIGO as described for the “blue” module above.

In the “biological process” ontology, the main terms highlighted by analysis with REVIGO were: “small molecule biosynthesis”, “ncRNA metabolism”, “response to DNA damage stimulus”, “cellular process”, “metabolism”, “embryonic digestive tract development”, “protein transport” and “biological regulation” (figure 4.15a).

Similarly, the terms highlighted in the “molecular function” ontology were: “RNA binding”, “ligase activity”, “binding”, “catalytic activity”, “identical protein binding”, “transcription regulator activity”, “cofactor binding”, “metal cluster binding” and “RNA polymerase II transcription cofactor activity” (figure 4.15b).

The highlighted “cellular component” categories were: “cytosol”, “cell”, “organelle”, “soluble fraction”, “cell fraction”, “membrane”, “plasma membrane”, “endomembrane system”, “membrane-enclosed lumen” and “macromolecular complex”, with the respective p-value significance ranking visualised in figure 4.15c (the enrichment significance is proportional to the size of the rectangle enclosing each term in the figure).


Figure 4.15: The most prominent terms among the significantly enriched GO terms of the “red” module; a) in the “biological process” ontology, b) in the “molecular function” ontology, c) in the “cellular component” ontology. For a further description of the figure elements, see the legend of figure 4.14. A high resolution version can be obtained from the author upon request.

[Figure 4.15, panels (a), (b) and (c): REVIGO Gene Ontology treemaps. The images are not reproduced in this text version.]


4.6 MetaCore© analysis

The “blue” module

The central nodes of the “blue” module were analysed with MetaCore© [28] (version 6.19). A network was built using the “shortest paths” algorithm with default settings (the “expand by one” algorithm would have expanded the “blue” network with too many connections to handle efficiently). The result is shown in figure 4.16.

Among the nodes connected to the central genes of the transcripts of the “blue” module, 7 were connected to one or more of the following relevant disease terms: “Aneurysm”; “Aortic Aneurysm”; “Aortic Aneurysm, Thoracic”; “Aortic Aneurysm, Abdominal”; “Aneurysm, Dissecting”; “Intracranial Aneurysm”; “Coronary Aneurysm”; “Hypertension”. These MetaCore© network objects were ESR1, SMAD3, TGF-beta receptor type I, TGF-beta receptor type II, TGF-beta 1, MMP-9 and a broad group of G protein-coupled receptors called “Galpha(q)-specific peptide GPCRs” (to which one of the original “blue” transcripts was assigned). Additionally, “therapeutic targets” in the network were PLK1 (one of the central genes of the “blue” module), EGFR, ErbB2, SP1 and several others overlapping with the previously mentioned ones.

Figure 4.16: MetaCore© interaction network constructed from central nodes of the “blue” module. The “shortest paths” algorithm was used with default settings. The gene identifiers were in some cases translated to network objects carrying the internal naming scheme of MetaCore©, but the original genes are indicated with blue circles. Distances in the image have no meaning.
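The general idea of connecting a set of seed genes through shortest paths can be sketched as follows. This is only a generic illustration on an undirected toy graph: MetaCore© is proprietary, its actual “shortest paths” implementation and default settings are not described here, and all node names and the function names below are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def shortest_path_subnetwork(graph, seeds):
    """Seeds plus the nodes on one shortest path between every seed pair."""
    nodes = set(seeds)
    for a, b in combinations(seeds, 2):
        path = shortest_path(graph, a, b)
        if path:
            nodes.update(path)
    return nodes


# Hypothetical interaction graph given as an adjacency dictionary.
toy_graph = {
    "seed_A": ["linker_X"],
    "linker_X": ["seed_A", "seed_B", "other_1"],
    "seed_B": ["linker_X", "linker_Y"],
    "linker_Y": ["seed_B", "seed_C"],
    "seed_C": ["linker_Y"],
    "other_1": ["linker_X", "other_2"],
    "other_2": ["other_1"],
}
subnetwork = shortest_path_subnetwork(toy_graph, ["seed_A", "seed_B", "seed_C"])
```

Only the nodes needed to link the seeds are retained, which keeps the resulting network small compared with an approach that adds every neighbour of every seed.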


The “red” module

The nine most central nodes with gene identifiers (table 4.8) of the “red” module were further profiled with MetaCore©. The network shown in figure 4.17 was built using the “expand by one” algorithm (with default options).

Known therapeutic targets among the nodes in that network were CDK2, GPC3, RELA, VDR, CTNNB1, ESR1 (also present in the MetaCore© network constructed from the central “blue” genes), ESR2, PGR, APH1A and ADAM10. Searching the network for genes related to the terms “Aortic Aneurysm”; “Aneurysm”; “Aortic Aneurysm, Abdominal”; “Aortic Aneurysm, Thoracic”; “Hypertension” yielded the following list of network objects: ESR1, ESR2, CLOCK, SMAD4, PGR, VDR, Caveolin-1 and ATM. A connected compound that might be relevant in the context of cardiovascular disease was cholesterol.

Furthermore, although not indicated by MetaCore© as relevant in the contexts of the conditions mentioned, microRNA-210 was also present in the network. This noncoding RNA has been found to be over-expressed in several cardiovascular diseases [62]. Unfortunately, this RNA could not be detected in the RNA-seq data set, perhaps because its short length (110 bp) poses challenges during sequencing (in line with the previously mentioned overall relative lack of short noncoding transcripts assembled).

Additionally, the presence of CLOCK may be notable, since the most differentially expressed lncRNA in the “red” module was a noncoding isoform of Timeless (table 4.7), another gene involved in circadian rhythm regulation, with which CLOCK is known to interact [63].

Figure 4.17: MetaCore© interaction network constructed from nine central nodes of the “red” module. The “expand by one” algorithm was used. The gene identifiers were in some cases translated to network objects carrying the internal naming scheme of MetaCore©, but the original genes are indicated with blue circles. Distances in the image have no meaning.
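A one-step neighbourhood expansion of a seed list, in the spirit of an “expand by one” operation, might be sketched as below. The sketch assumes a plain undirected adjacency dictionary and is not the MetaCore© implementation, whose details are not described here; the node names and the function name are hypothetical.

```python
def expand_by_one(graph, seeds):
    """Return the seeds together with every node directly linked to a seed."""
    expanded = set(seeds)
    for seed in seeds:
        expanded.update(graph.get(seed, ()))
    return expanded


# Hypothetical interaction graph: "hub_seed" has many direct partners.
toy_graph = {
    "hub_seed": ["partner_1", "partner_2", "partner_3", "partner_4"],
    "quiet_seed": ["partner_4"],
    "partner_1": ["hub_seed", "distant_1"],
    "partner_2": ["hub_seed"],
    "partner_3": ["hub_seed"],
    "partner_4": ["hub_seed", "quiet_seed"],
    "distant_1": ["partner_1"],
}
expanded_network = expand_by_one(toy_graph, ["hub_seed", "quiet_seed"])
```

Because every direct partner of every seed is added, the size of the result grows with the number and connectivity of the seeds, which is consistent with this approach being practical for a short list of nine nodes but unwieldy for a larger seed set.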


4.7 Clustering comparison

In an attempt to validate the WGCNA clustering, it was compared with a separate clustering obtained using the k-means method. Since the WGCNA method yielded 51 modules, k was set to 51 for the k-means method. For k-means, 200 starting points were used, which appeared to be sufficient for convergence. As can be seen in figure 4.18, where the resulting overlap matrix is visualised, very few pairs of modules share any specific overlap between the WGCNA and k-means clusterings. In accordance with this, the association score was a Cramér’s V of 0.06, which indicates no significant association between the two methods. Any overlap found is almost entirely due to chance (proportional to the sizes of the modules). For instance, module 4 in the WGCNA method shares a rather large overlap with almost all of the modules obtained from k-means. This is merely due to the large size of this module. Contrast this with the ideal result illustrated in figure 3.1, where each module only matches well with one module from the other clustering run. This is an intriguing result and exemplifies how differently two given clustering methods can perform on the same gene expression dataset.

[Figure 4.18 image: overlap matrix. Horizontal axis: WGCNA with 51 clusters. Vertical axis: k-means with k = 51 and 200 starts.]

Figure 4.18: The overlap of WGCNA module assignment and k-means clustering on the coefficient of variation filtered dataset. For k-means, k was set equal to 51 and n = 200 starting points of the algorithm were used. The image is a visualisation of the contingency matrix, wherein high values of a given cell are displayed brightly and low or zero overlap between two given clusters is displayed darkly shaded. (Additionally, in this image the number of transcripts in the intersection of each pair of modules has been normalised by the total size of the modules for greater clarity.)
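For reference, the contingency matrix and an association score of this kind can be computed as in the following minimal Python sketch. It assumes the standard, uncorrected definition V = sqrt(chi2 / (n (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r x c contingency table of n items; whether a bias-corrected variant was used for the value reported above is not stated. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def contingency(labels_a, labels_b):
    """Count the items falling in each (cluster in A, cluster in B) pair."""
    return Counter(zip(labels_a, labels_b))


def cramers_v(labels_a, labels_b):
    """Uncorrected Cramér's V between two cluster assignments."""
    n = len(labels_a)
    counts = contingency(labels_a, labels_b)
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    chi2 = 0.0
    for r, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            observed = counts.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals))
    if smaller_dim < 2:
        return 0.0  # V is undefined with a single cluster; 0 by convention here
    return sqrt(chi2 / (n * (smaller_dim - 1)))


# Hypothetical assignments of six transcripts by two methods.
method_1 = [0, 0, 1, 1, 2, 2]
method_2 = ["x", "x", "y", "y", "z", "z"]
toy_v = cramers_v(method_1, method_2)
```

A value near 1 means that the two partitions agree up to relabelling, while a value near 0, such as the 0.06 reported above, means that knowing the cluster in one partition says almost nothing about the cluster in the other.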

This does not, however, invalidate the clustering results. There may be multiple explanations for this observation: either both methods highlight biologically relevant, yet completely different, co-expression patterns in the data; or one or both of them perform particularly poorly on the data.

One reason for the differences may be that a step in the WGCNA method parses the initial cluster assignment for sub-clusters displaying certain characteristics that are taken to justify reassigning such sub-clusters to other modules. This step has no equivalent in the k-means method. Also, since the k-means method is sometimes considered somewhat of a quick and dirty method that may not give the most accurate results on complex datasets (and whose performance depends highly on the user-specified value of the k parameter [64]), one might suspect that this clustering method is simply unsuitable for this complex data, and therefore fails to highlight the same biological aspects as the WGCNA method. Further study is required, however, in order to draw any such conclusions.

This lack of a conclusive objective verification places further weight on the subjective interpretation of biological patterns within the modules as a form of clustering validation.
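To make the role of the starting points in the k-means comparison of this section concrete, the following self-contained Python sketch runs Lloyd's algorithm from several random initialisations and keeps the run with the lowest within-cluster sum of squares. It is a plain illustration of the multi-start idea, not the code used in this project; the function names, the toy points and the small values of k and n_starts are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss


def kmeans_multistart(points, k, n_starts, seed=0):
    """Best of n_starts runs, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[1])


# Hypothetical toy data: two well separated groups of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_wss = kmeans_multistart(toy_points, k=2, n_starts=10)
```

Since each run can end in a different local optimum, increasing the number of starts makes it more likely that the retained partition is close to the best one for the given k, although it does nothing to address the choice of k itself.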

5. Discussion

Challenges

Due to the expense of RNA-seq, it has been common to sequence few samples deeply rather than many samples shallowly. This can cause problems when dealing with the often large biological variability. In particular, this variability needs to be adequately modelled statistically during differential expression testing, in order to make sure that any differences in expression levels are actually caused by the treatment and are not merely a result of stochastic variation. Ideally, one should also test more than one differential expression estimation tool, in order to study the impact of differences in variance modelling on the results. This was not possible here due to time limitations, however.

Another challenge was the fact that not all the assembled fragments were true RNA transcripts expressed by the sampled organism. Some were mere artifacts caused by, for instance, non-specific mapping of reads or failures of the assembly algorithm. Likely, the quality of the library preparation and sequencing runs played a large role in producing these artifacts as well.

Many parameters could be tuned in the Cufflinks assembler, and in hindsight it may have been desirable to mask mitochondrial fragments due to the overrepresentation of these. As it stands, it is suspected that this may have been a reason for the undesirably large overlap between the FPKM distributions of the partially and completely assembled fragments used for setting the expression threshold during the novel transcript filtering. Another potential culprit may have been adapter contamination. Ideally, the potential impact of this should have been studied in more detail. Additional analyses are being performed. (Note added in proof: The results of these analyses are presented in appendix A.)

It is, in general, quite challenging to objectively validate the performance of an unsupervised learning method. At best these methods can be seen as tools for hypothesis construction. The results here showed that the WGCNA module assignment deviated very much from that of k-means clustering. One may argue, though, that this should be expected, since the WGCNA method was (after all) developed to provide an alternative clustering algorithm better suited to expression data. The poor association can have multiple explanations. Ultimately, however, it does not tell us as much about the biological data itself as it informs us of the differences between the algorithms. Had a significant association been found, it would have been much easier to justify conclusions of transcript co-expression. Perhaps k-means was not the best choice for comparison. Another clustering method such as NMF, which has been shown to perform well on biological data [65], would most likely have been a better choice.


Results

Among the 29 novel long non-coding RNAs that remained after the filtering procedure, three were already present in the noncoding RNA database NONCODE (26 thus remaining as novel). This would appear to offer some reassurance that these may indeed be actual lncRNAs, rather than mere artifacts. Due to the inability to find a well performing FPKM based artifact filtering threshold, however, it is hard to completely rule out that possibility. Regardless, actual verification with biotechnological techniques is required in order to confirm them.

In total, 248 lncRNAs were found to be differentially expressed. 187 of those were differentially expressed between the AngII treated mice harvested after seven weeks and the controls. Two of those were novel. In the comparison between the treated mice harvested after three weeks and the controls, 131 lncRNAs (one of those novel) were differentially expressed.

In several of the cases where lncRNAs were differentially expressed, they seemed to follow the expression patterns of the protein coding isoforms of the same genes. One may therefore suspect that the lncRNA expression in those cases was mainly a passive byproduct of the overall gene expression, rather than a consequence of independent functional significance, though such independent roles cannot be precluded at this stage. The exclusively lncRNA producing gene H19, on the other hand, would then appear more likely to have some role in the disease (perhaps via its connection to Igf2).

Furthermore, transcripts were clustered into co-expression groups (modules), which were subsequently correlated to treatment. The results highlighted a few of those, which appeared to display expression patterns related to the disease. Of those, a couple contained both novel and known lncRNAs. Those modules were subject to further analysis, which involved Gene Ontology enrichment and MetaCore© profiling of genes belonging to central (in terms of network properties) transcript isoforms.

The results of GO enrichment of the first (here called “blue”) module showed an enrichment of terms such as “DNA metabolism”, “regulation of immune system process”, “blood vessel morphogenesis” and “defence response”. Since immune reactions, such as vascular inflammation, are known to be involved in the disease [3], these results should not be surprising if the genes in that module were indeed related to the disease. Other interesting enriched terms were “extracellular region part” (encapsulating the term “collagen”) and “extracellular matrix”, which may be related to processes affecting the arterial wall in the progression of the disease (collagen protein producing genes were also found to be among the most significantly differentially expressed ones).

The second module analysed (called “red”) also displayed enrichment of a number of GO terms. “Small molecule biosynthesis”, “ncRNA metabolism” (processes involving tRNA seemed to be common under this term), “response to DNA damage stimulus”, “RNA binding” and “ligase activity” were among the highlighted ones. These terms may be broad, but possible connections to the disease are the terms “regulation of cell death” and “regulation of cell growth”, encapsulated by “response to DNA damage stimulus”. Enrichment of GO terms in a module would appear to strengthen claims of some degree of actual co-expression.

Extracting the genes belonging to central transcripts in these modules and performing analysis of curated interactions with MetaCore© revealed that fairly close ties between those genes could be found, in networks that also included therapeutic targets and a number of genes connected to aortic aneurysm diseases and hypertension. Network objects such as ESR1, SMAD3, TGF-beta receptor type I, TGF-beta receptor type II, TGF-beta 1, MMP-9 (in the MetaCore© network built from central genes of the “blue” module) and ESR1, ESR2, CLOCK, SMAD4, PGR, VDR, Caveolin-1 and ATM (based on the “red” module) may offer some clues as to the functionality represented in those groups of putatively co-expressed transcripts. Long noncoding RNAs included in those modules could, by guilt by association, be interesting to study further.

Of the differentially expressed previously annotated lncRNAs, 11 were present in either of these two highly treatment-correlated co-expression modules. These modules also contained one novel lncRNA each (which, however, were not significantly differentially expressed).

6. Conclusions

This work has revealed 26 novel long noncoding RNA candidates in mouse, two of which were up-regulated in the AngII treated group (which develops aneurysms) in comparison with controls. In addition, 185 previously annotated lncRNAs were indicated as differentially expressed between those sample groups. One lncRNA in particular displayed a high fold change between the groups: H19.

Protein coding isoforms that displayed large differences in expression between case and control included those of histone proteins, collagens (important parts of the extracellular matrix), thrombospondin proteins (known to interact with the extracellular matrix) and Itm2a (involved in T-cell activation and muscle cell differentiation). These observations are interesting since degradation of extracellular matrix in the arterial wall is an important part of the progression of abdominal aortic aneurysm disease, and so are immune reactions.

Furthermore, a network of transcriptional co-expression was built using weighted gene co-expression network analysis. Several lncRNAs were found to be present in a couple of co-expression modules obtained from this procedure, whose expression patterns appeared to be related to treatment. Investigating these modules further revealed ties to genes and processes that could be connected to abdominal aortic aneurysm disease. Those results suggest that the lncRNAs included in those modules could warrant further research.

Understanding the role of noncoding RNAs in abdominal aortic aneurysm may provide knowledge that can open paths to new diagnostic possibilities for a disease that rarely presents any symptoms until it is too late. In a best case scenario, new therapies that can supplement or provide alternatives to current invasive last-minute surgical procedures could be found along the way towards the goal of unraveling the complex genetic mechanisms underlying this life-threatening disease.

7. Acknowledgements

Karolinska Institutet: Bengt Sennblad, Lars Maegdefessel, Jesper Gådin, Mattias Frånberg

Stanford University: Joshua Spin

Uppsala University: Magnus Lundgren

SciLifeLab: Mikael Huss

And last, but not least: all those I forgot to include here.

8. References

[1] John L Rinn and Howard Y Chang. Genome regulation by long noncoding RNAs. Annual review of biochemistry, 81:145–66, January 2012.

[2] Sourabh Aggarwal, Arman Qamar, Vishal Sharma, and Alka Sharma. Abdominal aortic aneurysm: A comprehensive review. Experimental and clinical cardiology, 16:11–15, 2011.

[3] Lars Maegdefessel, Ronald L Dalman, and Philip S Tsao. Pathogenesis of abdominal aortic aneurysms: microRNAs, proteases, genetic associations. Annual review of medicine, 65:49–62, January 2014.

[4] Ada Congrains, Kei Kamide, Mitsuru Ohishi, and Hiromi Rakugi. ANRIL: Molecular Mechanisms and Implications in Human Health. International journal of molecular sciences, 14(1):1278–92, January 2013.

[5] Thomas Derrien, Rory Johnson, Giovanni Bussotti, Andrea Tanzer, Sarah Djebali, et al. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome research, 22(9):1775–89, September 2012.

[6] Daiana Weiss, John J Kools, and Robert W Taylor. Angiotensin II-Induced Hypertension Accelerates the Development of Atherosclerosis in ApoE-Deficient Mice. Circulation, 103(3):448–454, January 2001.

[7] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456:53–59, 2008.

[8] Kim D Pruitt, Garth R Brown, Susan M Hiatt, Françoise Thibaud-Nissen, Alexander Astashyn, et al. RefSeq: an update on mammalian reference sequences. Nucleic acids research, 42(Database issue):D756–63, January 2014.

[9] Donna Karolchik, Galt P Barber, Jonathan Casper, Hiram Clawson, Melissa S Cline, et al. The UCSC Genome Browser database: 2014 update. Nucleic acids research, 42(Database issue):D764–70, January 2014.

[10] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, et al. Ensembl 2014. Nucleic acids research, 42(Database issue):D749–55, January 2014.

[11] Cole Trapnell, Lior Pachter, and Steven L Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25(9):1105–11, May 2009.

[12] Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology, 14(4):R36, April 2013.

[13] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature biotechnology, 28(5):511–5, May 2010.

[14] Adam Roberts, Harold Pimentel, Cole Trapnell, and Lior Pachter. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics (Oxford, England), 27(17):2325–9, September 2011.

[15] Lei Sun, Zhihua Zhang, Timothy L Bailey, Andrew C Perkins, Michael R Tallack, et al. Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study. BMC bioinformatics, 13:331, January 2012.

[16] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome biology, 11(10):R106, January 2010.

[17] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England), 26(1):139–40, January 2010.

[18] Cole Trapnell, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature biotechnology, 31(1):46–53, January 2013.

[19] Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols, 7(3):562–78, March 2012.

[20] James Macqueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.

[21] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Elements, 1:337–387, 2009.

[22] Peter Langfelder and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics, 9(1):559, January 2008.

[23] Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, June 1994.

[24] Daniel D Lee and Sebastian H Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[25] Anton J Enright, Stijn Van Dongen, and Christos A Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic acids research, 30(7):1575–84, April 2002.

[26] Chaoyong Xie, Jiao Yuan, Hui Li, Ming Li, Guoguang Zhao, et al. NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic acids research, 42(Database issue):D98–103, January 2014.

[27] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics, 25(1):25–9, May 2000.

[28] Yuri Nikolsky, Eugene Kirillov, Roman Zuev, Eugene Rakhmatulin, and Tatiana Nikolskaya. Functional analysis of OMICs data and small molecule compounds in an integrated “knowledge-based” platform. Methods in molecular biology (Clifton, N.J.), 563:177–196, 2009.

[29] Simon Andrews. FastQC: A quality control tool for high throughput sequence data. [Online] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, 2010.

[30] Liguo Wang, Shengqin Wang, and Wei Li. RSeQC: Quality control of RNA-seq experiments. Bioinformatics, 28:2184–2185, 2012.

[31] Michael F Lin, Irwin Jungreis, and Manolis Kellis. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics (Oxford, England), 27(13):i275–82, July 2011.

[32] Kun Sun, Xiaona Chen, Peiyong Jiang, Xiaofeng Song, Huating Wang, et al. iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC genomics, 14 Suppl 2(Suppl 2):S7, January 2013.

[33] Kun Sun, Yu Zhao, Huating Wang, and Hao Sun. Sebnif: an integrated bioinformatics pipeline for the identification of novel large intergenic noncoding RNAs (lincRNAs)–application in human skeletal muscle cells. PloS one, 9(1):e84500, January 2014.

[34] Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research, 13:2498–2504, 2003.

[35] Peter Langfelder, Bin Zhang, and Steve Horvath. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics (Oxford, England), 24(5):719–20, March 2008.

[36] Liang Wang, Hui Tang, Venugopal Thayanithy, Subbaya Subramanian, Ann L Oberg, et al. Gene networks and microRNAs implicated in aggressive prostate cancer. Cancer research, 69(24):9490–7, December 2009.

[37] Ovidiu D Iancu, Sunita Kawane, Daniel Bottomly, Robert Searles, Robert Hitzemann, et al. Utilizing RNA-Seq data for de novo coexpression network inference. Bioinformatics (Oxford, England), 28(12):1592–7, June 2012.


[38] Steve Horvath and Jun Dong. Geometric interpretation of gene coexpression network analysis. PLoS computational biology, 4(8):e1000117, January 2008.

[39] Qi Liao, Jia Shen, Jianfa Liu, Xi Sun, Guoguang Zhao, et al. Genome-wide identification and functional annotation of Plasmodium falciparum long non-coding RNAs from RNA-seq data. Parasitology research, 113(4):1269–81, April 2014.

[40] Karl G Kugler, Gerald Siegwart, Thomas Nussbaumer, Christian Ametz, Manuel Spannagl, et al. Quantitative trait loci-dependent analysis of a gene co-expression network associated with Fusarium head blight resistance in bread wheat (Triticum aestivum L.). BMC genomics, 14(1):728, January 2013.

[41] Kasper D Hansen, Steven E Brenner, and Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids research, 38(12):e131, July 2010.

[42] Angie S Hinrichs. The UCSC Genome Browser Database: update 2006. Nucleic Acids Research, 34:D590–D598, 2006.

[43] Susanne Motameny, Stefanie Wolters, Peter Nürnberg, and Björn Schumacher. Next Generation Sequencing of miRNAs - Strategies, Resources and Methods. Genes, 1(1):70–84, January 2010.

[44] William F Marzluff, Eric J Wagner, and Robert J Duronio. Metabolism and regulation of canonical histone mRNAs: life without a poly(A) tail. Nature reviews. Genetics, 9:843–854, 2008.

[45] Kazimierz T Tycowski, Alar Aab, and Joan A Steitz. Guide RNAs with 5’ Caps and Novel Box C/D snoRNA-like Domains for Modification of snRNAs in Metazoa. Current Biology, 14:1985–1995, 2004.

[46] Adele Murrell, Sarah Heeson, and Wolf Reik. Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops. Nature genetics, 36:889–893, 2004.

[47] Carlos Gonzalez-Quesada, Michele Cavalera, Anna Biernacka, Ping Kong, Dong Wook Lee, et al. Thrombospondin-1 induction in the diabetic myocardium stabilizes the cardiac matrix in addition to promoting vascular rarefaction through angiopoietin-2 upregulation. Circulation Research, 113:1331–1344, 2013.

[48] Joseph V Moxon, Matthew P Padula, Paula Clancy, Theophilus I Emeto, Ben R Herbert, et al. Proteomic analysis of intra-arterial thrombus secretions reveals a negative association of clusterin and thrombospondin-1 with abdominal aortic aneurysm. Atherosclerosis, 219:432–439, 2011.

[49] Dave Van Den Plas and Joseph Merregaert. Constitutive overexpression of the integral membrane protein Itm2A enhances myogenic dierentiation of C2C12 cells. Cell Biology International, 28:199–207, 2004.

[50] Jacqueline Kirchner and Michael J Bevan. ITM2A is induced during thymocyte selection and T cell activation and causes downregulation of CD8 when overexpressed in CD4(+)CD8(+) double positive thymocytes. The Journal of experimental medicine, 190:217–228, 1999.

[51] Karen Pittois, Jan Wauters, Paul Bossuyt, Willy Deleersnijder, and Joseph Merregaert. Genomic organization and chromosomal localization of the Itm2a gene. Mammalian Genome, 10:54–56, 1999.

[52] Stephen G Young, Brandon S J Davies, Constance V Voss, Peter Gin, Michael M Weinstein, et al. GPIHBP1, an endothelial cell transporter for lipoprotein lipase. Journal of lipid research, 52(11):1869–84, December 2011.

[53] Anne Gabory, Marie-Anne Ripoche, Tomomi Yoshimizu, and Luisa Dandolo. The H19 gene: regulation and function of a non-coding RNA. Cytogenetic and genome research, 113(1-4):188–93, January 2006.

[54] Guorui Huang, Gaoxiang Ge, Dingyan Wang, Bagavathi Gopalakrishnan, Delana H Butz, et al. α3(V) collagen is critical for glucose homeostasis in mice due to effects in pancreatic islets and peripheral tissues. The Journal of clinical investigation, 121:769–783, 2011.

[55] Guntram Borck, Peter Beighton, Christian Wilhelm, Jürgen Kohlhase, and Christian Kubisch. Arterial rupture in classic Ehlers-Danlos syndrome with COL5A1 mutation. American Journal of Medical Genetics, Part A, 152:2090–2093, 2010.

[56] Lars Maegdefessel, Junya Azuma, and Ryuji Toh. Inhibition of microRNA-29b reduces murine abdominal aortic aneurysm development. The Journal of clinical . . . , 122(2):497–506, 2012.

[57] Amita Sehgal, Jeffrey L Price, Bernice Man, and Michael W Young. Loss of circadian behavioral rhythms and per RNA oscillations in the Drosophila mutant timeless. Science (New York, N.Y.), 263(5153):1603–6, March 1994.

[58] Keziban Unsal-Kaçmaz, Thomas E Mullen, William K Kaufmann, and Aziz Sancar. Coupling of human circadian and cell cycles by the timeless protein. Molecular and cellular biology, 25:3109–3116, 2005.

[59] Linda P O’Reilly, Simon C Watkins, and Thomas E Smithgall. An unexpected role for the clock protein timeless in developmental apoptosis. PloS one, 6(2):e17157, January 2011.

[60] Nancy F da Silva, Dean Gentle, Luke B Hesson, Dion G Morton, Farida Latif, et al. Analysis of the Birt-Hogg-Dubé (BHD) tumour suppressor gene in sporadic renal cell carcinoma and colorectal cancer. Journal of medical genetics, 40(11):820–4, November 2003.

[61] Fran Supek, Matko Bošnjak, Nives Škunca, and Tomislav Šmuc. REVIGO summarizes and visualizes long lists of gene ontology terms. PloS one, 6(7):e21800, January 2011.

[62] Cecilia Devlin, Simona Greco, Fabio Martelli, and Mircea Ivan. miR-210: More than a silent player in hypoxia. IUBMB life, 63(2):94–100, February 2011.

[63] Lauren P Shearman, Sathyanarayanan Sriram, David R Weaver, Elizabeth S Maywood, Ines Chaves, et al. Interacting molecular loops in the mammalian circadian clock. Science (New York, N.Y.), 288(5468):1013–9, May 2000.

[64] Jianhua Ruan, Angela K Dean, and Weixiong Zhang. A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC systems biology, 4:8, January 2010.

[65] Renaud Gaujoux and Cathal Seoighe. A flexible R package for nonnegative matrix factorization. BMC bioinformatics, 11:367, January 2010.

[66] Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1):10, May 2011.

[67] Thomas H Barker, Gretchen Baneyx, Marina Cardó-Vila, Gail A Workman, Matt Weaver, et al. SPARC regulates extracellular matrix organization through its modulation of integrin-linked kinase activity. The Journal of biological chemistry, 280(43):36483–93, October 2005.

[68] Elena Tomasello and Eric Vivier. KARAP/DAP12/TYROBP: three names and a multiplicity of biological functions. European journal of immunology, 35(6):1670–7, June 2005.

[69] Harini Krishnan, Jhon A Ochoa-Alvarez, Yongquan Shen, Evan Nevel, Meenakshi Lakshminarayanan, et al. Serines in the intracellular tail of podoplanin (PDPN) regulate cell motility. The Journal of biological chemistry, 288(17):12215–21, April 2013.

[70] Jillian L Astarita, Sophie E Acton, and Shannon J Turley. Podoplanin: emerging functions in development, the immune system, and cancer. Frontiers in immunology, 3:283, January 2012.

[71] Gaetano Santulli. Angiopoietin-like proteins: a comprehensive look. Frontiers in endocrinology, 5:4, January 2014.

A. Alternative analysis

A reiteration of the methods presented in the main body of this work was performed in order to examine the impact of trimming reads with respect to sequencing adapter oligonucleotides, as well as the effect of mitochondrial read overrepresentation. Additionally, some parameters of mapping and assembly were changed to alternatives that were deemed more likely to be appropriate for the purpose of this work.

A.1 Methods

A.1.1 Read trimming

Sequences representing the oligonucleotide adapters of the EpiBio Scriptseq v2 stranded sequencing protocol were removed from the ends of reads using the read trimming tool Cutadapt [66] version 1.4.1 with the “-b” option (which instructs the algorithm to match the oligonucleotide sequences against both the 3’ and 5’ ends, as well as the interior of the reads). Poly-A and poly-T tails of reads were trimmed using the same method. Low quality bases at the ends of reads (as determined using FastQC [29] base quality plots) were removed prior to adapter trimming using fastx-trimmer version 0.0.13 (16 nucleotides from the beginning and 15 nucleotides from the end of each read). Since each read needs to have a corresponding paired partner, “singleton” reads (without partner) were removed.
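The fixed-length end trimming and the singleton removal described above can be illustrated with a minimal Python sketch. It is not the code used in this work (fastx-trimmer and Cutadapt were used); the function names, the dictionary layout (read ID mapped to sequence) and the `min_len` cutoff are assumptions made for illustration only, and quality strings are left out for brevity.

```python
def trim_fixed(seq, head=16, tail=15):
    """Drop `head` bases from the start and `tail` bases from the end of a read."""
    end = len(seq) - tail
    # Reads too short to survive trimming become empty strings.
    return seq[head:end] if end > head else ""


def keep_pairs(mate1, mate2, min_len=1):
    """Keep only read IDs whose two mates both survived trimming.

    mate1 and mate2 map read ID -> trimmed sequence; min_len is an
    illustrative cutoff, not a value taken from the analysis.
    """
    kept = [rid for rid in mate1
            if rid in mate2
            and len(mate1[rid]) >= min_len
            and len(mate2[rid]) >= min_len]
    return ({rid: mate1[rid] for rid in kept},
            {rid: mate2[rid] for rid in kept})
```

Applying `trim_fixed` to every read and then `keep_pairs` to the two mate files mirrors the order of operations in the text: end trimming first, removal of reads left without a partner last.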

A.1.2 Read mapping

Mapping was performed largely as described in the methods section of the main report, with the exception of the added option “--no-mixed”, which instructs the program to only report alignments where both reads in a pair can be mapped. It was reasoned that this option would decrease the risk of assembling transcript artifacts downstream, which could otherwise result from potentially unreliable singleton read alignments. In addition, the “-r” and “--mate-std-dev” options were specified according to the results of RSeQC [30] read mapping inspection (a procedure described in the methods section of the main body of the report).

A.1.3 Transcript assembly

Transcriptome assembly was performed as described in the main methods section, with the addition of the “-M” option, whose purpose was to mask reads mapping to mitochondrial transcripts. The mask file was derived from the Ensembl [10] transcriptome annotation file that was used for mapping and assembly. Assemblies were merged with Cuffcompare [13] (using the “-M” option and “-g”, which instructs the program to run RABT assembly, as described regarding the Cufflinks runs in the main methods section). The merged assembly was annotated with the Ensembl annotation using Cuffcompare (options “-G”, “-C” and “-r”).

A.1.4 Differential expression testing

Differential expression testing was performed as described in section 3.6.1, with the exception that the “-u” option was not used (because of glitches in the Cuffdiff [18] software, and because the option did not seem to affect the results in any significant way). Version 2.2.1 of Cuffdiff encountered serious issues of unknown cause that were not seen in the previous runs (“Segmentation fault”), so version 2.1.1 was used instead. The “-b” option was also used, which inspects the ends of the reads for overrepresentation of certain patterns and tries to account for this overrepresentation by adjusting FPKM values if such bias is found. The question was raised whether this was appropriate when read ends have been trimmed (since any such biased sequence motifs would have been stripped away). However, since the algorithm only performs correction if bias is actually detected (and would otherwise do nothing at all), it was argued that using the option would not have any negative consequences.

A.1.5 Identification of known long noncoding RNAs

Identification of previously annotated lncRNAs in the set of assembled fragments was performed as previously described in section 3.4.

A.1.6 Prediction of novel long noncoding RNAs

The approach used previously to filter away potential artifacts (using the FPKM distributions of completely and partially assembled transcripts) was deemed unsatisfactory for two reasons. First, the time and system resources required for running Cufflinks on merged alignment files were very significant. Second, the performance of the previously constructed FPKM threshold classifier was very poor. Therefore a new approach was used, whereby FPKM values, confidence intervals and transcript abundance estimation status (“OK” or not) as given by Cuffdiff (using merged assemblies from Cuffmerge [19]) were used to filter out unreliable transcripts. Only transcripts whose estimated FPKM confidence interval was above zero, and whose status was “OK”, were kept. Additionally, only transcripts with FPKM greater than zero in at least two mice in each condition were retained. After thereby discarding low quality assemblies, the same filtering pipeline as described in section 3.5 was employed to discover novel lncRNA candidates.
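The three reliability criteria above can be summarised in a short Python sketch. The record layout is hypothetical and does not mirror the Cuffdiff output files; in particular, “confidence interval above zero” is interpreted here as a lower bound strictly greater than zero, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEstimate:
    """Hypothetical per-transcript summary (not a Cuffdiff file format)."""
    transcript_id: str
    status: str              # abundance estimation status, e.g. "OK" or "LOWDATA"
    conf_lo: float           # lower bound of the FPKM confidence interval
    fpkm_by_condition: dict  # condition name -> list of per-mouse FPKM values


def is_reliable(t, min_expressed=2):
    """Apply the three filters: status, confidence interval, replicate support."""
    if t.status != "OK":
        return False
    if t.conf_lo <= 0:
        return False
    # Require FPKM > 0 in at least `min_expressed` mice in every condition.
    return all(sum(1 for v in values if v > 0) >= min_expressed
               for values in t.fpkm_by_condition.values())
```

A transcript would then be passed on to the lncRNA filtering pipeline only if `is_reliable` returns True for it.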

A.1.7 Co-expression network analysis

Co-expression network analysis was performed as described in section 3.6.3, with the exception of using 12 as the soft thresholding power, setting the minimum module size to 100 and the merge cut height to 0.15. In addition, the dataset was pre-filtered in order to reduce its size to around 16000 transcripts. This filtering was composed of two steps: first, only transcripts from genes that were either in the Ensembl annotation, novel lncRNAs or NONCODE lncRNAs were kept. Then, the dataset was filtered to keep only the approximately 16000 transcripts with the highest coefficient of variation.
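The second filtering step can be sketched as follows in Python. The coefficient of variation is taken as the standard deviation divided by the mean; the use of the sample standard deviation, the handling of zero-mean transcripts and the function names are assumptions for illustration, not a description of the code used in this work.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (0.0 if the mean is zero)."""
    m = mean(values)
    return stdev(values) / m if m > 0 else 0.0


def top_by_cv(expression, n):
    """Return the IDs of the n transcripts with the highest coefficient of variation.

    expression maps transcript ID -> list of expression values (one per
    sample, at least two samples).
    """
    ranked = sorted(expression,
                    key=lambda tid: coefficient_of_variation(expression[tid]),
                    reverse=True)
    return ranked[:n]
```

With `n` set to roughly 16000 this would reproduce the size reduction described above; transcripts with flat expression across samples are the first to be discarded.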

A.2 Results

A.2.1 Identification of long noncoding RNAs

69 novel lncRNA candidates were discovered. Three of these overlapped with the NONCODE annotation (with IDs TCONS_00141617, TCONS_00722068 and TCONS_00722067 in the trimmed read based analysis), leaving 66 as truly novel. One transcript overlapped with the previously identified novel lncRNA candidates (having the ID TCONS_00129545 in the new analysis and TCONS_00084766 in the previous one). 30 of the novel lncRNAs did not have any kind of overlap with annotated genes and were thus considered intergenic (class code “u”, table A.1). The remaining 36 appeared to be novel isoforms of previously annotated transcripts (table A.2). Regarding already known transcript isoforms, 31624 long noncoding RNAs were identified from previous annotations by NONCODE (v4) and Ensembl (mm9).

Table A.1: Overview of the 30 novel intergenic lncRNA candidates. Nearest reference genes, transcripts and class codes were assigned by Cuffcompare, using the Ensembl mm9 annotation as reference. Column headers as in main text table 4.1.

Isoform id      Gene  Nearest transcript  Class  Length  Locus
TCONS_00078697  -     -                   u      1174    10:62259045-62260412
TCONS_00122305  -     -                   u      1070    11:58409413-58410669
TCONS_00126277  -     -                   u      694     11:84684559-84685378
TCONS_00129545  -     -                   u      3434    11:104796669-104801381
TCONS_00156551  -     -                   u      403     12:32068232-32068765
TCONS_00170486  -     -                   u      1074    12:113950607-113951761
TCONS_00189355  -     -                   u      1250    12:107120334-107121995
TCONS_00224111  -     -                   u      4478    13:74597747-74603043
TCONS_00305463  -     -                   u      669     15:86592532-86593366
TCONS_00328957  -     -                   u      1642    16:19892287-19893993
TCONS_00359545  -     -                   u      2538    17:95166243-95174862
TCONS_00373910  -     -                   u      2500    17:88351484-88354043
TCONS_00388566  -     -                   u      2163    18:76813155-76815764
TCONS_00452647  -     -                   u      844     2:171771282-171772189
TCONS_00496710  -     -                   u      1543    3:69572143-69573834
TCONS_00502760  -     -                   u      998     3:98061089-98062695
TCONS_00545054  -     -                   u      1839    4:3016213-3018590
TCONS_00622031  -     -                   u      511     5:28198977-28199610
TCONS_00637091  -     -                   u      904     5:114545674-114547464
TCONS_00663966  -     -                   u      517     6:135306139-135306768
TCONS_00681106  -     -                   u      1238    6:90843187-90845069
TCONS_00713893  -     -                   u      613     7:6928150-6928993
TCONS_00721729  -     -                   u      2453    7:65876002-65879125
TCONS_00722564  -     -                   u      974     7:69183232-69184816
TCONS_00738833  -     -                   u      1493    8:19706436-19708036
TCONS_00742571  -     -                   u      3569    8:43100252-43103910
TCONS_00743635  -     -                   u      943     8:48001763-48002902
TCONS_00769513  -     -                   u      2114    8:76172586-76174856
TCONS_00770093  -     -                   u      536     8:79075977-79076569
TCONS_00829084  -     -                   u      1035    X:41701449-4170340

Table A.2: Overview of the 36 novel lncRNA candidates that appeared to be new isoforms of previously known transcripts. Nearest reference genes, transcripts and class codes were assigned by Cuffcompare, using the Ensembl mm9 annotation as reference. Column headers as in main text table 4.1.

Isoform id      Gene             Nearest transcript  Class  Length  Locus
TCONS_00002246  Gm16720          ENSMUST00000130710  j      3030    1:16647632-16652359
TCONS_00044520  Als2cr12         ENSMUST00000055313  j      1832    1:58713908-58752833
TCONS_00058021  9230116N13Rik    ENSMUST00000140195  j      2077    1:138308120-138336735
TCONS_00080011  RP23-60H16.1.1   ENSMUST00000177104  j      1963    10:70773492-70777007
TCONS_00122763  Gm12273          ENSMUST00000137418  j      1592    11:61764276-61766238
TCONS_00126764  Coil             ENSMUST00000135325  j      579     11:88831107-88852927
TCONS_00191604  Gm17177          ENSMUST00000169144  j      1305    13:3120377-3122425
TCONS_00199845  Gm904            ENSMUST00000099519  j      1151    13:50738596-50741207
TCONS_00315218  Cd200r2          ENSMUST00000102805  j      1176    16:44867209-44915953
TCONS_00335531  2310005G13Rik    ENSMUST00000067173  j      3363    16:57036923-57071459
TCONS_00386270  Gm4951           ENSMUST00000031549  j      5278    18:60371224-60408364
TCONS_00416304  Slc22a28         ENSMUST00000065651  j      933     19:8136698-8206472
TCONS_00425961  Gm13293          ENSMUST00000131188  j      1719    2:11261115-11264989
TCONS_00479186  2310005A03Rik    ENSMUST00000142569  j      1815    2:154923287-154925817
TCONS_00495995  RP23-445D13.2.1  ENSMUST00000176925  j      1393    3:64251244-64264017
TCONS_00545120  Gm11784          ENSMUST00000118568  j      1210    4:3313992-3315945
TCONS_00555933  Orm3             ENSMUST00000006687  j      888     4:63017086-63020545
TCONS_00562897  Gm12789          ENSMUST00000106914  j      1489    4:101659230-101662836
TCONS_00579840  Gm12505          ENSMUST00000146041  j      743     4:55423234-55431714
TCONS_00589000  Skint10          ENSMUST00000068851  j      980     4:112383751-112447495
TCONS_00590275  Gm12886          ENSMUST00000106266  j      2330    4:121086420-121095704
TCONS_00621502  Mir3096          ENSMUST00000116685  o      2228    5:23214891-23218029
TCONS_00621825  Speer4a          ENSMUST00000079447  j      3079    5:26359122-26366046
TCONS_00640073  Gm454            ENSMUST00000160126  j      2029    5:138643516-138648896
TCONS_00650640  Vmn1r4           ENSMUST00000176838  j      845     6:56874015-56908094
TCONS_00662554  Clec2h           ENSMUST00000032518  j      2218    6:128612403-128627547
TCONS_00663622  5530400C23Rik    ENSMUST00000048459  j      1782    6:133242189-133246057
TCONS_00675723  A530053G22Rik    ENSMUST00000060147  j      2808    6:60345557-60353737
TCONS_00686793  5930416I19Rik    ENSMUST00000112152  j      1183    6:128306021-128312929
TCONS_00686826  Klrb1a           ENSMUST00000032512  j      2088    6:128559085-128573419
TCONS_00686877  Klrb1b           ENSMUST00000032472  j      3660    6:128763467-128777652
TCONS_00690369  Gm3104           ENSMUST00000171004  j      1454    7:3085852-3088038
TCONS_00714087  Nlrp4d           ENSMUST00000086269  j      637     7:10944249-10974565
TCONS_00757200  Cd209c           ENSMUST00000127592  j      1447    8:3940193-3954746
TCONS_00803647  Gm3867           ENSMUST00000086108  j      1222    9:36064625-36065962
TCONS_00823024  Enox             ENSMUST00000140767  j      3866    X:100688894-100701683

A.2.2 Differential expression testing

Among the previously known lncRNAs, 74 were significantly differentially expressed (48 of those between the A7 and S7 groups). Of the novel lncRNAs, two were differentially expressed between the case and control mice sacrificed after seven weeks. In total, 3508 transcripts were differentially expressed between A7 and S7 (8277 if all pairwise comparisons between A3, A7 and S7 are included).

The overall most significantly differentially expressed transcript isoforms (between A7 and S7) are summarised in table A.3. Similarly, the most differentially expressed lncRNAs (both novel and previously known) are presented in table A.4.

Concordant with the results presented in the main analysis in section 4.3, histone protein genes are among those displaying the largest differences in transcript abundance estimates in the comparison of the A7 and S7 groups of mouse samples (table A.3). Sparc, however, was a gene not highlighted in the original results. The protein produced by this gene takes part in the regulation of extracellular matrix formation [67]. As previously mentioned in the main report, processes affecting the composition and eventual degradation of the extracellular matrix are involved in AAA disease.

Tyrobp has roles in the immune system and takes part in inflammation “via its coupling to myeloid receptors, such as the triggering receptors expressed by myeloid cells (TREM) displayed by neutrophils, monocytes/macrophages and dendritic cells” [68]. This is interesting in the context of the inflammatory response observed during abdominal aortic aneurysm formation, in which macrophages (among other immune cells) are involved [3]. Another highly differentially expressed gene with connections to the immune system is Pdpn, which besides regulating cell motility [69] also “plays crucial roles in the biology of immune cells, including T cells and dendritic cells” [70]. An additional connection to the immune system among the genes in table A.3 is Mpeg1, which stands for “Macrophage Expressed Gene 1”.

Angptl1 stands for Angiopoietin-Like Protein 1 and is known to display anti-angiogenic properties by “inhibiting the proliferation, migration, tube formation, and adhesion of endothelial cells” [71] and has additionally “been shown to exhibit antiapoptotic activity in human endothelial cells” [71] (endothelial cells line the inside of blood vessels). Given these descriptions and its high differential expression between case and control, Angptl1 could be suspected to have a role in the disease studied here.

It should also be mentioned, although the implications are rather unclear, that besides the genes discussed above, a significant number of predicted genes and pseudogenes were also present in table A.3.

Regarding the lncRNAs, ENSMUST00000130642 was the most differentially expressed transcript (table A.4). It stems from the gene Sparc, which was discussed above. One would therefore probably suspect that its expression levels are a passive result of the overall expression of the gene (and its main protein), rather than indicative of any independent functionality for the noncoding isoform. That hypothesis remains to be tested, however.

Several NONCODE lncRNAs were differentially expressed, though these remain to be characterized and not much can be said about them. The exclusively lncRNA-producing genes 2410006H16Rik, F630028O10Rik, BC029722 and C330006A16Rik are also found among the transcripts in table A.4, though they remain unclassified. Snhg5 also remains to be characterized further (as not much information could be found about it). Similarly to table A.3, a multitude of predicted genes are also represented in table A.4.

Table A.3: The overall 35 most differentially expressed genes (based on individual isoforms) between AngII treated mice harvested after seven weeks (A7) and control mice harvested after seven weeks (S7). Column headers as in main text table 4.2.

Transcript ID       Gene           Class  log2 fold change  p-value  q-value
ENSMUST00000109141  Gm15427        =      -Inf              0.00005  0.00636546
ENSMUST00000130642  Sparc          =      -Inf              0.00005  0.00636546
ENSMUST00000087714  Hist1h4j       =      -Inf              0.00005  0.00636546
ENSMUST00000078369  Hist1h2ab      =      -Inf              0.00005  0.00636546
ENSMUST00000099704  Hist1h3i       =      -Inf              0.00005  0.00636546
ENSMUST00000091751  Hist1h2an      =      -Inf              0.00005  0.00636546
ENSMUST00000091701  Hist1h3a       =      -Inf              0.00005  0.00636546
ENSMUST00000074245  Gm10119        =      -Inf              0.00005  0.00636546
ENSMUST00000089396  Gm6723         =      -Inf              0.00005  0.00636546
ENSMUST00000088648  Gm8759         =      -Inf              0.00005  0.00636546
ENSMUST00000097682  Rpl27-ps3      =      -Inf              0.00005  0.00636546
ENSMUST00000051412  Gm5512         =      -Inf              0.00005  0.00636546
ENSMUST00000119459  H3f3c          =      -Inf              0.00005  0.00636546
ENSMUST00000169025  Gm17510        =      -Inf              0.00005  0.00636546
ENSMUST00000122336  Gm12751        =      -Inf              0.00005  0.00636546
ENSMUST00000118323  Gm11936        =      -Inf              0.00005  0.00636546
ENSMUST00000117129  Pgam1-ps2      =      -Inf              0.00005  0.00636546
ENSMUST00000121308  Gm5931         =      -Inf              0.00005  0.00636546
ENSMUST00000083987  U6             =      -Inf              0.00010  0.01118960
ENSMUST00000117045  Dynlt1-ps1     =      -Inf              0.00010  0.01118960
ENSMUST00000117533  Gm12062        =      -Inf              0.00035  0.03008360
ENSMUST00000093651  U6             =      Inf               0.00040  0.03375630
ENSMUST00000050586  5430419D17Rik  x      -Inf              0.00055  0.04222160
ENSMUST00000006626  Actn3          =      -5.58226          0.00010  0.01118960
ENSMUST00000032800  Tyrobp         =      -5.35215          0.00005  0.00636546
ENSMUST00000055770  Hist1h1a       =      -5.31370          0.00005  0.00636546
ENSMUST00000077389  Gm7536         =      -5.28799          0.00065  0.04835630
ENSMUST00000030317  Pdpn           =      -5.25626          0.00005  0.00636546
ENSMUST00000027885  Angptl1        =      -5.18182          0.00015  0.01562130
ENSMUST00000080511  Hist1h1b       =      -5.15822          0.00010  0.01118960
ENSMUST00000040359  Arsi           =      -5.11103          0.00005  0.00636546
ENSMUST00000021506  Serpina3n      =      -5.09322          0.00050  0.03911860
ENSMUST00000003643  Ckm            =      -5.08662          0.00005  0.00636546
ENSMUST00000081035  Mpeg1          =      -5.06781          0.00005  0.00636546
ENSMUST00000163360  D17H6S56E-5    =      -5.05824          0.00065  0.04835630

Table A.4: The overall 35 most differentially expressed lncRNAs (based on individual isoforms) between AngII treated mice harvested after seven weeks (A7) and control mice harvested after seven weeks (S7). Column headers as in main text table 4.2. Matches to protein coding genes may be observed, but those refer to noncoding isoforms of those genes.

Transcript ID       Gene           Class  log2 fold change  p-value  q-value
ENSMUST00000130642  Sparc          =      -Inf              0.00005  0.00636546
ENSMUST00000117533  Gm12062        =      -Inf              0.00035  0.03008360
ENSMUST00000131787  2410006H16Rik  =      -3.26653          0.00005  0.00636546
ENSMUST00000153077  Megf8          =      -2.46350          0.00025  0.02314610
TCONS_00545054      -              u      2.45603           0.00020  0.01973000
ENSMUST00000023292  Arsa           =      -2.45424          0.00005  0.00636546
ENSMUST00000034997  Snhg5          =      -2.44712          0.00005  0.00636546
ENSMUST00000137359  Gm15512        =      -2.40868          0.00060  0.04528880
ENSMUST00000147681  F630028O10Rik  =      -2.34503          0.00010  0.01118960
ENSMUST00000110601  Gpr137b-ps     =      -2.29320          0.00005  0.00636546
ENSMUST00000124586  BC029722       =      -2.25654          0.00005  0.00636546
ENSMUST00000133808  C330006A16Rik  =      -2.23649          0.00005  0.00636546
NONMMUT050162       NONMMUG031060  =      -2.21508          0.00035  0.03008360
NONMMUT022945       NONMMUG014195  =      -2.16479          0.00005  0.00636546
ENSMUST00000155507  Pgs1           =      -2.14105          0.00025  0.02314610
ENSMUST00000172805  Map3k8         =      -2.10502          0.00025  0.02314610
NONMMUT069993       NONMMUG043325  =      -2.03938          0.00005  0.00636546
ENSMUST00000155949  6530402F18Rik  =      -2.03331          0.00020  0.01973000
ENSMUST00000151488  Yipf3          =      -2.02016          0.00050  0.03911860
ENSMUST00000173269  Neu1           =      -2.00370          0.00010  0.01118960
ENSMUST00000164646  E530011L22Rik  =      -2.00007          0.00010  0.01118960
ENSMUST00000148314  Gm13889        =      -1.96165          0.00005  0.00636546
ENSMUST00000116685  Mir3096        o      -1.93738          0.00045  0.03700300
ENSMUST00000175820  A230050P20Rik  =      -1.92682          0.00005  0.00636546
NONMMUT045116       NONMMUG027834  =      -1.92268          0.00050  0.03911860
ENSMUST00000138323  Pfdn2          =      -1.76269          0.00010  0.01118960
NONMMUT032612       NONMMUG020085  =      -1.72129          0.00020  0.01973000
ENSMUST00000097503  Gm10524        =      -1.71945          0.00025  0.02314610
ENSMUST00000163793  Gm17255        =      -1.71467          0.00005  0.00636546
ENSMUST00000171743  Tmem134        =      -1.70072          0.00050  0.03911860
ENSMUST00000165292  Gm17282        =      -1.68657          0.00005  0.00636546
ENSMUST00000156380  2900053A13Rik  =      -1.65174          0.00065  0.04835630
ENSMUST00000134226  Gm12590        =      -1.62306          0.00015  0.01562130
NONMMUT072544       NONMMUG044953  =      -1.56763          0.00005  0.00636546
ENSMUST00000099459  Gm10780        =      -1.49290          0.00005  0.00636546

A.2.3 Co-expression network modules

Using WGCNA [22], 11 co-expression modules were obtained. A visualisation of the WGCNA dynamic tree cut module assignment can be seen in figure A.1. As previously, the modules were ranked based on correlation between the module eigengene and the treatment status vector. The best ranking modules are found in table A.5. As before, the modules were assigned labels in the form of arbitrary colours by the WGCNA software. From now on, the modules will be referred to by these colour names.

[Plot: cluster dendrogram (hclust, “average” linkage; y-axis: Height, 0.5 to 1.0) with a bar of module colours underneath.]

Figure A.1: The WGCNA module assignment of this dataset. The black dendrogram is obtained through hierarchical clustering of the weighted correlation based network. Modules are represented by individual colours and are the result of applying dynamic tree cutting to the dendrogram.

Table A.5: Ranking of modules according to absolute Pearson correlation between the eigengene of the module and a vector of treatment status. Each module is represented with a colour code corresponding to figure A.1. Only the most significantly correlated modules are shown here (absolute correlation greater than or equal to 0.80 and p-value less than or equal to 0.05). “Size” refers to the number of transcript isoforms in the module.

Module     Correlation  p-value  Size  D.e. lncRNAs  Novel lncRNAs
yellow     0.94         0.002    1313  2             14
pink       0.89         0.007    547   1             0
blue       0.88         0.008    3421  19            6
turquoise  0.84         0.02     4245  21            7
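The ranking criterion behind table A.5 can be illustrated with a small Python sketch. It assumes that the module eigengenes have already been computed (WGCNA was used for this in the present work) and that treatment status is coded numerically, for instance 1 for treated and 0 for control; the function names and this coding are illustrative assumptions, and p-values are not computed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_modules(eigengenes, treatment):
    """Sort modules by absolute correlation between eigengene and treatment.

    eigengenes maps module name -> eigengene values (one per sample);
    treatment holds the status of the same samples in the same order.
    """
    scores = {name: abs(pearson(values, treatment))
              for name, values in eigengenes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Taking the absolute value means that modules whose expression falls with treatment rank as highly as modules whose expression rises with it.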

The “yellow” module

As can be seen in table A.5, the transcript isoforms in the “yellow” module appear to be expressed in patterns that correlate highly with the treatment. The genes of the most central transcripts (in terms of betweenness centrality and degree, calculated with Cytoscape and visualised in figure A.2) of the yellow module were Gm16576, Ttk, Cd109, Fmod, BC034090, Trip13, Rasgrf1, 7SK, Tlr9, Ddah1, and C1qtnf3. Gm16576 is a predicted lincRNA gene with unknown function. BC034090 appears to be an unclassified protein coding gene. The expression of Gm16576, which can be seen in figure A.3, was higher in the “A7” mice than in the “S7” control, and the difference was statistically significant according to the “getSig” function of the CummeRbund R package, using an alpha cutoff of 0.05. Other significantly differentially expressed transcripts among the central nodes were Tlr9 (figure A.4), Fmod (figure A.5) (both in the comparisons A7, S7 and A3, A7) and Trip13 (figure A.6) (in the comparison A7, S7). Tlr9 stands for “Toll-like receptor 9”, Fmod stands for “fibromodulin” and Trip13 for “Thyroid Receptor Interacting Protein 13”.

In total, 34 transcripts (6 lncRNAs) were significantly differentially expressed between any two samples in the “yellow” module, 21 (2 lncRNAs) between A7 and S7. Gm16576 was the only one of these that was previously annotated in Ensembl. The others were present in NONCODE. The remaining five differentially expressed lncRNAs were novel transcripts. The overall most differentially expressed transcripts in the “yellow” module can be seen in table A.7.
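For readers unfamiliar with the two centrality measures, the following Python sketch computes degree and betweenness centrality for a small unweighted, undirected network using Brandes' algorithm. It is a simplification for illustration: the values reported here were calculated with Cytoscape, and the adjacency-dictionary representation and function names below are assumptions.

```python
from collections import deque


def degree(adj):
    """Number of neighbours of each node; adj maps node -> set of neighbours."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def betweenness(adj):
    """Betweenness centrality (Brandes' algorithm, unweighted and undirected)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of discovery from s
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                      # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):         # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {v: b / 2 for v, b in score.items()}
```

A node with high degree has many direct co-expression partners, whereas a node with high betweenness lies on many shortest paths between other nodes; the most central transcripts listed above scored highly on these measures.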

Table A.6: Differentially expressed lncRNAs in the “yellow” module. Column headers as in main text table 4.7.

Transcript ID       Gene     Class  log2 fold change  p-value  q-value     Groups
ENSMUST00000128342  Gm16576  =      -1.2              5e-05    0.00636546  A7,S7
TCONS_00722067      -        u      -2.47005          5e-05    0.00636546  A3,A7

Figure A.2: The “yellow” WGCNA derived module. Figure elements are the same as described in the legend of main text figure A.7. A high resolution version of the figure can be obtained from the author upon request.

[Bar plot: Gm16576 (TCONS_00287194); y-axis: FPKM; x-axis: sample_name (A3, A7, S7); quantification status OK for all three samples.]

Figure A.3: Expression of the lincRNA Gm16576, the most central transcript of the “yellow” WGCNA module. The blue bar corresponds to the “A3” sample, pink to “A7” and green to “S7”.

[Bar plot: Tlr9 (TCONS_00795036); y-axis: FPKM; x-axis: sample_name (A3, A7, S7); quantification status OK for all three samples.]

Figure A.4: Expression of Tlr9, one of the most central transcripts of the “yellow” WGCNA module. The blue bar corresponds to the “A3” sample, pink to “A7” and green to “S7”.

[Bar plots: Fmod, one panel each for TCONS_00023958, TCONS_00023959 and TCONS_00023960. y-axis: FPKM; x-axis: sample_name (A3, A7, S7). Status flags shown under the bars: OK for six bars, LOWDATA for three.]

Figure A.5: Expression of Fmod. TCONS_00023959 was one of the most central transcripts of the “yellow” WGCNA module. TCONS_00023958 was a novel transcript overlapping TCONS_00023959. The blue bar corresponds to the “A3” sample, pink to “A7” and green to “S7”.


[Bar plots: Trip13, one panel each for TCONS_00224031 and TCONS_00224033. y-axis: FPKM; x-axis: sample_name (A3, A7, S7).]

Figure A.6: Expression of Trip13. TCONS_00224031 was one of the most central transcripts of the “yellow” WGCNA module. TCONS_00224033 was a novel transcript isoform overlapping TCONS_00224031. The blue bar corresponds to the “A3” sample, pink to “A7” and green to “S7”.

Table A.7: Differentially expressed transcripts between A7 and S7 in the “yellow” module. Column headers as in main text table 4.2.

TranscriptID        Gene      Class  log2 fold change  p-value  q-value
ENSMUST00000087316  Capn6     =      -4.92985          0.00005  0.00636546
ENSMUST00000048183  Fmod      =      -4.25981          0.00005  0.00636546
ENSMUST00000060710  Cdc25c    =      -3.10131          0.00065  0.04835630
ENSMUST00000072119  Ccnb1     =      -2.74186          0.00020  0.01973000
ENSMUST00000029848  Col24a1   j      -2.61988          0.00020  0.01973000
ENSMUST00000076505  Pyroxd2   =      -2.57141          0.00005  0.00636546
ENSMUST00000020899  Matn3     =      -2.38223          0.00005  0.00636546
ENSMUST00000128727  Actr3b    =      -2.24899          0.00035  0.03008360
ENSMUST00000083546  Mir133b   =      -2.20249          0.00005  0.00636546
ENSMUST00000062241  Tlr9      =      -2.20233          0.00005  0.00636546
ENSMUST00000048246  Fgb       =      2.15004           0.00015  0.01562130
ENSMUST00000098382  Adamts17  j      -2.04165          0.00060  0.04528880
ENSMUST00000058825  Ccdc121   =      -1.96315          0.00005  0.00636546
ENSMUST00000032341  Art4      =      -1.82901          0.00020  0.01973000
ENSMUST00000098953  Mex3a     =      -1.74837          0.00005  0.00636546
ENSMUST00000090473  Gpr88     =      -1.72988          0.00005  0.00636546
ENSMUST00000022053  Trip13    =      -1.67750          0.00005  0.00636546
ENSMUST00000109212  Gm5431    =      -1.48573          0.00005  0.00636546
ENSMUST00000043183  Ces2g     =      -1.41782          0.00050  0.03911860
ENSMUST00000128342  Gm16576   =      -1.20000          0.00005  0.00636546
TCONS_00314402      -         u      1.06805           0.00030  0.02649240
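As a reading aid for the log2 fold change column, the following minimal Python sketch converts a log2 fold change into a linear expression ratio. The function name is illustrative and not part of the original analysis. Which group forms the numerator of the ratio follows the definition given with main text table 4.2 and is not assumed here.

```python
def linear_fold_change(log2_fold_change):
    """Convert a log2 fold change into a linear expression ratio."""
    return 2.0 ** log2_fold_change


# Capn6 in table A.7 has a log2 fold change of -4.92985, which
# corresponds to a ratio of roughly 0.033, i.e. about a 30-fold difference.
capn6_ratio = linear_fold_change(-4.92985)
```

A value of -1 thus corresponds to a halving and a value of 1 to a doubling, so the magnitudes in the table span roughly 2-fold to 30-fold changes.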


The “blue” module

The central nodes of the “blue” module (visualised in figure A.7) were transcript isoforms from the genes Gns, Pctp, Heatr5a, Wdfy2, Eny2, Ext1, Ifnar1, Dscr3, B3gnt1, Ppp3ca, Hey1, Pole, Rpn1, Dera and C230081A13Rik. Several lncRNAs were present in this module; the differentially expressed ones are listed in table A.8. The 25 most significantly differentially expressed transcript isoforms in the “blue” module are shown in table A.9. Seven of the central nodes corresponded to differentially expressed isoforms: Pctp, Wdfy2, Ifnar1, Dscr3, B3gnt1, Hey1 and Dera.

[Network figure: nodes of the “blue” module, labelled with TCONS transcript identifiers.]

Figure A.7: The “blue” WGCNA derived module. Larger nodes indicate higher betweenness centrality. Figure elements are the same as described in the legend of main text figure A.7. A high resolution version of the figure can be obtained from the author upon request.
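Since node size in the module figures encodes betweenness centrality, a minimal Python sketch of that measure may be helpful. It uses Brandes' algorithm on an undirected, unweighted graph and returns unnormalised values. The function name and the toy graph are hypothetical, and the software, edge weighting and normalisation actually used for the figures are not assumed here.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness centrality (undirected, unweighted graph).

    adjacency maps each node to an iterable of its neighbours.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(visit_order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair of end points was counted twice.
    return {node: value / 2.0 for node, value in centrality.items()}


# Toy star graph (hypothetical node names): every shortest path
# between two leaves passes through the hub.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
star_centrality = betweenness_centrality(star)
```

In the toy graph the hub lies on all three leaf-to-leaf shortest paths and the leaves on none, which mirrors how a highly central transcript would be drawn as a large node.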


Table A.8: Differentially expressed lncRNAs in the “blue” module. Column headers as in main text figure 4.7.

TranscriptID        Gene           Class  log2 fold change  p-value           q-value        Groups
ENSMUST00000131787  2410006H16Rik  =      -3.27             0.00005           0.0064         A7,S7
ENSMUST00000153077  Megf8          =      -2.46             0.00025           0.023          A7,S7
ENSMUST00000023292  Arsa           =      -1.86; -2.45      0.00055; 0.00005  0.042; 0.0064  A3,S7; A7,S7
ENSMUST00000137359  Gm15512        =      -2.41             0.00060           0.045          A7,S7
ENSMUST00000147681  F630028O10Rik  =      -2.35             0.00010           0.011          A7,S7
ENSMUST00000110601  Gpr137b-ps     =      -2.29             0.00005           0.0064         A7,S7
TCONS_00592563      -              u      -2.22             0.00035           0.030          A7,S7
NONMMUT022945       NONMMUG014195  =      -2.16             0.00005           0.0064         A7,S7
ENSMUST00000155949  6530402F18Rik  =      -2.03             0.00020           0.020          A7,S7
ENSMUST00000151488  Yipf3          =      -2.02             0.00050           0.039          A7,S7
ENSMUST00000164646  E530011L22Rik  =      -2.00             0.00010           0.011          A7,S7
ENSMUST00000175820  A230050P20Rik  =      -1.93             0.00005           0.0064         A7,S7
ENSMUST00000171743  Tmem134        =      -1.70             0.00050           0.039          A7,S7
ENSMUST00000165292  Gm17282        =      -1.50; -1.69      0.00040; 0.00005  0.034; 0.0064  A3,S7; A7,S7
ENSMUST00000134226  Gm12590        =      -1.62             0.00015           0.016          A7,S7
ENSMUST00000146416  Rbm12b         =      -1.48             0.00005           0.0064         A7,S7
ENSMUST00000168014  Gm17401        =      -1.33             0.00005           0.0064         A7,S7
ENSMUST00000105140  AW011738       =      -1.21; -1.32      0.00050; 0.00015  0.039; 0.016   A3,S7; A7,S7
ENSMUST00000131147  Gm13387        =      -1.14             0.00045           0.037          A7,S7

Table A.9: The 25 most differentially expressed transcript isoforms between A7 and S7 in the “blue” module. Column headers as in main text table 4.2.

TranscriptID        Gene       Class  log2 fold change  p-value  q-value
ENSMUST00000055770  Hist1h1a   =      -5.31370          0.00005  0.00636546
ENSMUST00000030317  Pdpn       =      -5.25626          0.00005  0.00636546
ENSMUST00000040359  Arsi       =      -5.11103          0.00005  0.00636546
ENSMUST00000081035  Mpeg1      =      -5.06781          0.00005  0.00636546
ENSMUST00000043069  Cyth4      =      -4.97811          0.00005  0.00636546
ENSMUST00000025497  Fbn2       =      -4.93516          0.00005  0.00636546
ENSMUST00000099703  Hist1h2bb  =      -4.77073          0.00005  0.00636546
ENSMUST00000024118  Clec4n     =      -4.76126          0.00005  0.00636546
ENSMUST00000021685  Hhipl1     =      -4.71655          0.00005  0.00636546
ENSMUST00000002883  Sfrp4      =      -4.68946          0.00005  0.00636546
ENSMUST00000006956  Saa3       =      -4.62616          0.00005  0.00636546
ENSMUST00000015664  Ctsk       =      -4.55248          0.00005  0.00636546
ENSMUST00000034774  Itga11     =      -4.54105          0.00005  0.00636546
ENSMUST00000024981  Hn1l       =      -4.48850          0.00005  0.00636546
ENSMUST00000018918  Cd68       =      -4.43995          0.00005  0.00636546
ENSMUST00000028897  Cpxm1      =      -4.42749          0.00005  0.00636546
ENSMUST00000060484  Clec4a1    =      -4.40459          0.00025  0.02314610
ENSMUST00000030202  Glipr2     =      -4.37610          0.00060  0.04528880
ENSMUST00000025419  Ppic       =      -4.31323          0.00005  0.00636546
ENSMUST00000087557  Tspan6     =      -4.29439          0.00005  0.00636546
ENSMUST00000003445  Fkbp11     =      -4.28323          0.00005  0.00636546
ENSMUST00000034742  Ccnb2      =      -4.27510          0.00010  0.01118960
ENSMUST00000004587  Clec11a    =      -4.21670          0.00005  0.00636546
ENSMUST00000004201  Col5a3     =      -4.20165          0.00005  0.00636546
ENSMUST00000086763  Emr1       =      -4.16421          0.00005  0.00636546


The “turquoise” module

The central nodes of the “turquoise” module (visualised in figure A.8) corresponded to the genes Tmem131, Atf6, Fig4, Tspan31, Itsn2, Arf6, Cpsf2, Ubr7, Hist1h4d, Fam120a, 2610301G19Rik, Ep300, Scaf8, Sos1, Trim56, Gm5578 and Atp6v0d1. Significantly differentially expressed isoforms among these were from the genes Tmem131, Fig4, Arf6, Hist1h4d, 2610301G19Rik, Gm5578 and Atp6v0d1. Differentially expressed lncRNAs in this module are summarised in table A.10. The overall 25 most differentially expressed transcript isoforms, both coding and noncoding, are shown in table A.11.

[Network figure: nodes of the “turquoise” module, labelled with TCONS transcript identifiers.]

Figure A.8: The “turquoise” WGCNA derived module. Figure elements are the same as described in the legend of main text figure A.7. A high resolution version of the figure can be obtained from the author upon request.


Table A.10: Differentially expressed lncRNAs in the “turquoise” module. Column headers as in main text table 4.2.

TranscriptID        Gene           Class  log2 fold change     p-value                    q-value                 Groups
ENSMUST00000148314  Gm13889        =      -3.97; -1.96; -2.00  0.00005; 0.00005; 0.00050  0.0064; 0.0064; 0.0064  A3,S7; A7,S7; A3,A7
ENSMUST00000023207  Apod           =      -3.59; -2.95         0.00010; 0.00050           0.011; 0.039            A3,S7; A3,A7
ENSMUST00000168817  Fcgr3          =      -3.49                0.00055                    0.042                   A3,S7
ENSMUST00000133808  C330006A16Rik  =      -2.91; -2.24         0.00005; 0.00005           0.0064; 0.0064          A3,S7; A7,S7
ENSMUST00000034997  Snhg5          =      -2.63; -2.45         0.00010; 0.00005           0.011; 0.0064           A3,S7; A7,S7
ENSMUST00000155507  Pgs1           =      -2.58; -2.14         0.00005; 0.00025           0.0064; 0.023           A3,S7; A7,S7
ENSMUST00000172805  Map3k8         =      -2.52; -2.11         0.00005; 0.00025           0.00637; 0.023          A3,S7; A7,S7
TCONS_00453772      -              =      -2.39                0.00045                    0.037                   A3,S7
TCONS_00789319      -              =      -2.38; -2.04         0.00005; 0.00005           0.0064; 0.0064          A3,S7; A7,S7
ENSMUST00000173269  Neu1           =      -2.00                0.00010                    0.011                   A7,S7
NONMMUT045116       NONMMUG027834  =      -1.99; -1.92         0.00045; 0.00050           0.037; 0.039            A3,S7; A7,S7
ENSMUST00000132305  Clcf1          =      -1.90                0.00045                    0.037                   A3,S7
TCONS_00386479      -              =      -1.82; -1.72         0.00015; 0.00020           0.016; 0.020            A3,S7; A7,S7
ENSMUST00000123544  Gm16516        =      -1.78                0.00020                    0.020                   A3,S7
ENSMUST00000135037  2810001G20Rik  =      -1.74                0.00060                    0.045                   A3,S7
ENSMUST00000070085  AI504432       =      -1.66                0.00005                    0.0065                  A3,S7
TCONS_00274088      -              =      -1.61; -1.28         0.00005; 0.00005           0.0064; 0.0064          A3,S7; A7,S7
ENSMUST00000164804  Gm17056        =      -1.60                0.00050                    0.039                   A3,S7
TCONS_00710288      -              =      -1.41                0.00045                    0.037                   A3,S7
ENSMUST00000070085  AI504432       =      -1.05                0.00015                    0.016                   A7,S7

Table A.11: The 25 most differentially expressed transcripts between A7 and S7 in the “turquoise” module. Column headers as in main text table 4.2.

TranscriptID        Gene           Class  log2 fold change  p-value  q-value
ENSMUST00000032800  Tyrobp         =      -5.35215          0.00005  0.00636546
ENSMUST00000021506  Serpina3n      =      -5.09322          0.00050  0.03911860
ENSMUST00000119853  Gm12174        =      -4.68687          0.00005  0.00636546
ENSMUST00000038144  Esm1           =      -4.63327          0.00005  0.00636546
ENSMUST00000005548  Hmox1          =      -4.56687          0.00005  0.00636546
ENSMUST00000116487  Lgals3         =      -4.41037          0.00005  0.00636546
ENSMUST00000118928  Gm13456        =      -4.33618          0.00005  0.00636546
ENSMUST00000033004  Il4ra          =      -4.22683          0.00005  0.00636546
ENSMUST00000025486  Lmnb1          =      -4.16269          0.00005  0.00636546
ENSMUST00000040772  Fermt3         =      -4.12244          0.00005  0.00636546
ENSMUST00000034214  Mt2            =      -4.07249          0.00005  0.00636546
ENSMUST00000071130  Alox5ap        =      -4.02396          0.00005  0.00636546
ENSMUST00000113440  Ccdc88b        =      -3.99798          0.00020  0.01973000
ENSMUST00000002678  Tgfb1          =      -3.98090          0.00065  0.04835630
ENSMUST00000100198  Bin2           =      -3.93078          0.00005  0.00636546
ENSMUST00000030651  Sh3bgrl3       =      -3.90295          0.00005  0.00636546
ENSMUST00000079957  Fcer1g         =      -3.88466          0.00005  0.00636546
ENSMUST00000021011  Ccl7           =      -3.88466          0.00020  0.01973000
ENSMUST00000069988  Xpnpep1        =      -3.86411          0.00005  0.00636546
ENSMUST00000058914  Tuba1c         =      -3.84341          0.00005  0.00636546
ENSMUST00000034215  Mt1            =      -3.69782          0.00005  0.00636546
ENSMUST00000102881  Plek           =      -3.69213          0.00005  0.00636546
ENSMUST00000034339  Cdh5           =      -3.66873          0.00005  0.00636546
ENSMUST00000111315  Adamts4        =      -3.64842          0.00050  0.03911860
ENSMUST00000021676  0610007P14Rik  =      -3.64035          0.00045  0.03700300


A.2.4 MetaCore© analysis

The “yellow” module

Constructing a network of curated relations among the most central nodes of the “yellow” module using the “expand by one” algorithm in MetaCore© gave the results shown in figure A.9. Among the genes connected to the central yellow nodes in the MetaCore© network, 16 were annotated as being relevant in the context of at least one of the following diseases: “Aneurysm, Ruptured”; “Aneurysm”; “Aortic Aneurysm”; “Aortic Aneurysm, Thoracic”; “Aortic Aneurysm, Abdominal”; “Hypertension”; “Atherosclerosis”; “Intracranial Aneurysm”.

Figure A.9: MetaCore© interaction network constructed from central nodes of the “yellow” module. The “expand by one” algorithm was used (default settings). The gene identifiers were in some cases translated to the internal naming scheme of MetaCore©, but the original genes are indicated with blue circles. A high resolution version of the figure can be obtained from the author upon request.


The “blue” module

Constructing a network of curated relations among the most central nodes of the “blue” module using the “shortest paths” algorithm (without the option “use canonical pathways”) gave the results shown in figure A.10. Among the nodes in the MetaCore© interaction network constructed from these genes, Tyk2, Calcineurin A and the androgen receptor were annotated as therapeutic targets. HEY1, 26S proteasome and the androgen receptor were related to the term “hypertension”. Tyk2, IFN-alpha/beta receptor, DNA polymerase epsilon, c-Myc, c-Jun, the androgen receptor, HEY1, Calcineurin A, Huntingtin, 26S proteasome and CREB1 were related to “cardiovascular disease” according to the MetaCore© annotation.

Figure A.10: MetaCore© interaction network constructed from central nodes of the “blue” module. The “shortest paths” algorithm was used (with “use canonical pathways” turned off). The gene identifiers were in some cases translated to the internal naming scheme of MetaCore©, but the original genes are indicated with blue circles.


The “turquoise” module

The MetaCore© network constructed from the genes belonging to the central transcripts from the “turquoise” module was built using the “shortest paths” algorithm (without the option “use canonical pathways”). Searching the network for therapeutic targets highlighted 13 nodes: H-Ras, ESR1, CDK1, TORC2, c-Src, p38 MAPK, HDAC3, HIF1A, the androgen receptor, c-Abl, EGFR, FGFR1 and VEGFR-3. Network objects related to either “Aortic Aneurysm, Abdominal”, “Aortic Aneurysm”, “Aneurysm” or “Hypertension” were ATM, GCR-beta, AHR, Sirtuin 1, ESR1, HIF1A, PGC1-alpha, p85-alpha, BMAL1 and the androgen receptor.

Figure A.11: MetaCore© interaction network constructed from central nodes of the “turquoise” module. The “shortest paths” algorithm was used (with “use canonical pathways” turned off). The gene identifiers were in some cases translated to the internal naming scheme of MetaCore©, but the original genes are indicated with blue circles.


A.2.5 Clustering comparison

Using the method mentioned in section 3.6.3, a Cramér’s V of 0.025 was obtained when testing for association between WGCNA and k-means clustering. Cramér’s V ranges from 0 (no association) to 1 (complete association), so this value is very close to the lower bound. For a visualisation of the overlap between modules (clusters), see figure A.12. These results indicate that no significant association between k-means and WGCNA clustering was present (see the discussion in section 5).

[Heat map of the contingency matrix. x-axis: WGCNA with 11 clusters; y-axis: k-means with k = 11 and 200 starts; both axes ticked from 2 to 10.]

Figure A.12: The overlap of WGCNA module assignment and k-means clustering on the coefficient of variation filtered dataset. For k-means, k was set equal to 11 and n = 200 starts of the algorithm were used. The image is a visualisation of the contingency matrix, wherein high values of a given cell are displayed brightly and low or zero overlap between two given clusters is displayed darkly shaded. (Additionally, in this image the number of transcripts in the intersection of each pair of modules has been normalised by the total size of the modules for greater clarity.)
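To illustrate the statistic reported above, the following is a minimal Python sketch of Cramér’s V computed from two cluster assignments of the same items. It uses the standard definition, V = sqrt(chi2 / (n (min(r, c) - 1))), without bias correction; whether the method of section 3.6.3 applied a correction is not assumed here, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(labels_a, labels_b):
    """Cramér's V between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # contingency matrix cells
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    # Pearson chi-squared statistic over all cells, including empty ones.
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        raise ValueError("each clustering needs at least two clusters")
    return sqrt(chi2 / (n * k))


# Toy assignments: identical clusterings versus independent ones.
identical = cramers_v([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
independent = cramers_v([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical clusterings give the maximum value and fully independent ones give zero, which is the scale against which the reported 0.025 should be read.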

A.3 Discussion

The results of the trimmed read analysis revealed a rather different picture. Other genes and long noncoding RNAs were highlighted as most likely to be important in the context of the disease. Other novel lncRNAs were also discovered (with the exception of one, which was identified in the original analysis as well). It remains hard to determine which of the two analyses is the more correct one, especially given the multitude of parameters changed between the two. The trimming procedure can be done in many ways, and there is no guarantee that the one performed here is optimal, or even better than performing no trimming at all. In addition, the effectiveness of bias correction in Cuffdiff is unclear and may need to be examined further. One concern when using trimmed reads and bias correction in combination is that, since bias correction examines the ends of reads for overrepresented patterns, information necessary for performing this correction effectively may be lost during trimming. This would mostly be a concern, however, if bias correction is indeed valuable to perform, something that was not entirely obvious throughout the analysis performed here. Further testing of the impact of such parameters is thus recommended.


Also, two different artifact filtering schemes were employed in the novel lncRNA discovery step of the two analyses, so the differences in those results need not be entirely due to the trimming performed. And since multiple parameters of mapping, assembly and differential expression testing were changed (besides the trimming of the reads) between the original analysis and the supplementary one, it is not certain which factors contributed the most to the differences observed. There is, however, one indication which makes the original analysis appear more trustworthy: the trimmed read based transcriptome assembly yielded many more “novel” RNA transcript isoforms (around 200 000 more than in the original analysis). There are two likely explanations for this. The first is that shorter reads map less specifically to the reference genome. The other, and perhaps the most important one, is that stricter read filtering leads to more coverage gaps in the structure of individual transcripts, so that spurious isoforms are discovered. Inspection of several of these “extra” novel fragments indeed revealed generally poor coverage (in comparison to annotated variants from the same genes). The vast majority of these are thus likely artifacts. It is not unlikely that poor isoform reconstruction may have negatively affected transcript abundance estimation (and therefore also differential and co-expression testing). As such, it is recommended that these alternative results be seen as a complementary set of hypotheses to the original ones, while noting that the original analysis is somewhat more likely to be correct. Finally, appendix B presents a table (B.1) comprising the top 50 long noncoding RNAs deemed most promising for further research in the context of AAA disease, weighing in results from both the original and the alternative analysis.

B. Candidate lncRNAs

Table B.1 presents 50 candidate lncRNAs recommended for further research. The list is based mainly on the results from the original (main article) analysis, with notes about differences to the alternative one (appendix A) included. The choice to focus on the original analysis was motivated by the principle of parsimony, based on the fact that the trimmed read analysis yielded far more (about 200 000 extra) novel assembled transcripts, many of which are most likely artifacts. The presence of these suspected artifacts additionally appeared to “dilute” the read coverage of many established (previously annotated) transcript isoforms, contributing to worse statistical significance in differential expression testing (and likely also affecting the co-expression analysis).

In selecting candidates for this list, intergenic lncRNAs were given higher priority than lncRNAs overlapping with protein coding genes. When such non-coding isoforms of protein coding genes appeared to closely follow the expression levels of the corresponding coding variants, these lncRNAs were excluded (if the relative abundance of the coding and non-coding variant does not change, one could suspect that the expression of the non-coding isoform is merely a passive side-effect of the expression of the protein coding variant, rather than indicative of independent lncRNA function). Additionally, differentially expressed lncRNAs were given higher priority than lncRNAs that were merely co-expressed in treatment correlated modules. Transcript isoforms with inconsistent annotation in online databases were excluded.
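The exclusion criterion for non-coding isoforms that merely track their coding counterparts can be sketched as follows. This is a hypothetical formalisation in Python: the text above describes a judgement of whether the relative abundance changes, and the function names, the dictionary layout and the tolerance of 0.1 are assumptions made for illustration only.

```python
def noncoding_fraction(noncoding_fpkm, coding_fpkm):
    """Share of a gene's expression contributed by the non-coding isoform."""
    total = noncoding_fpkm + coding_fpkm
    if total == 0:
        return None  # gene not expressed in this sample
    return noncoding_fpkm / total


def follows_coding_variant(noncoding_by_sample, coding_by_sample, tolerance=0.1):
    """True if the non-coding share stays within `tolerance` across samples.

    Both arguments map sample names (e.g. "A3", "A7", "S7") to FPKM values.
    The tolerance is an illustrative assumption, not a value from the thesis.
    """
    fractions = []
    for sample, noncoding_fpkm in noncoding_by_sample.items():
        fraction = noncoding_fraction(noncoding_fpkm, coding_by_sample[sample])
        if fraction is not None:
            fractions.append(fraction)
    if len(fractions) < 2:
        return False  # too little data to call the isoform passive
    return max(fractions) - min(fractions) <= tolerance


# Made-up FPKM values: the first isoform keeps a constant 10 % share,
# the second changes its share markedly in A7.
passive = follows_coding_variant(
    {"A3": 1.0, "A7": 2.0, "S7": 4.0}, {"A3": 9.0, "A7": 18.0, "S7": 36.0}
)
independent = follows_coding_variant(
    {"A3": 1.0, "A7": 5.0, "S7": 1.0}, {"A3": 10.0, "A7": 10.0, "S7": 10.0}
)
```

Under these assumptions the first isoform would be excluded as a likely passive by-product, while the second would remain a candidate.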


Table B.1: 50 candidate lncRNAs recommended for further research, ranked by absolute log2 fold change between the A7 and S7 groups (case and control at seven weeks) in the original analysis. “Module” refers to co-expression module assignment in the original analysis (only modules in table 4.5 taken into account). LncRNAs that were significantly differentially expressed (on transcript level) between A7 and S7 in both the original and the alternative analysis were placed on top of the list. Differences between the two analyses are noted.

TranscriptID                            Gene                      Rank in orig.  Rank in alt.  Note  Module
ENSMUST00000131787                      2410006H16Rik             10             3
ENSMUST00000147681                      F630028O10Rik             16             9
ENSMUST00000034997                      Snhg5                     32             7
ENSMUST00000133808                      C330006A16Rik             43             12
ENSMUST00000097503                      Gm10524                   69             28
ENSMUST00000175820                      A230050P20Rik             84             24
ENSMUST00000164646                      E530011L22Rik             114            21
ENSMUST00000099459                      Gm10780                   118            35
ENSMUST00000131248                      5430416O09Rik             148            166
ENSMUST00000156243                      4732414G09Rik             169            46
ENSMUST00000108969                      Gm14401                   173            49
ENSMUST00000117266                      Gm12245                   1              -             *
ENSMUST00000136359, ENSMUST00000140716  H19                       3, 4           -             **    blue
ENSMUST00000143673                      AI662270                  19             -             *
ENSMUST00000155083                      Ppox                      23             -             **
ENSMUST00000132618                      Eif1                      24             -             *
TCONS_00234018/TCONS_00348192 (novel)   17:30713409-30714566      25             -             ***
ENSMUST00000132340                      Trem2                     33             -             *     blue
ENSMUST00000145164, ENSMUST00000152828  Pisd-ps1                  36, 75         -             *
ENSMUST00000127901                      Nsun7                     38             -             *     turquoise
ENSMUST00000127652                      Tmem59                    39             -             **
ENSMUST00000145089                      Cd37                      42             -             *
ENSMUST00000129649, ENSMUST00000127981  A530020G20Rik             47, 70         -             *
ENSMUST00000118528                      Gm14293                   54             -             *
ENSMUST00000171670, ENSMUST00000166221  Snhg1                     55, 82         -             *
ENSMUST00000164690                      Gm17586                   56             -             **
ENSMUST00000169511                      Gm9917                    63             -             *
ENSMUST00000170675                      Xpo6                      66             -             **
ENSMUST00000162030                      Gm10075 (AC098736.2)      67             -             *
ENSMUST00000131275                      Hoxb3os                   68             -             **
ENSMUST00000170093                      Snhg7                     73             -             **
ENSMUST00000148900                      D4Wsu53e (Rsrp1)          74             -             **
ENSMUST00000151043                      1300002E11Rik             76             -             *
ENSMUST00000152109                      C430049B03Rik             79             -             **    blue
ENSMUST00000147076                      Rrp1b                     83             -             *     magenta
ENSMUST00000146721                      Ndufaf4                   87             -             *
ENSMUST00000123292                      Dynll1                    90             -             **
ENSMUST00000177539                      RP23-164N15.3.1 (Gm20645) 94             -             *
ENSMUST00000139026                      Vsig10                    97             -             *
ENSMUST00000132305                      Clcf1                     99             -             *
ENSMUST00000149667                      9430008C03Rik             101            -             **
ENSMUST00000063040                      Minos1                    113            -             **
ENSMUST00000170099                      D930016D06Rik             135            -             *
ENSMUST00000152147                      1810058I24Rik             136            -             *
ENSMUST00000154405                      Qrsl1                     140            -             *
ENSMUST00000145894                      Gm14703                   -              -             **    green
ENSMUST00000156068                      6330403K07Rik             -              -             **    green
ENSMUST00000105240                      Timeless                  -              -             *     red
ENSMUST00000132193                      Alkbh6                    -              -             *     red

* Significantly differentially expressed on gene level in both analyses. Only significantly differentially expressed on transcript level in the original analysis.
** Only significantly differentially expressed in the original analysis. Overall expression pattern looks very similar.
*** Novel lncRNA. Significantly differentially expressed in both the original and the alternative analysis (log2 fold changes -2.39/-2.64). Not present in table A.4 since it was removed in the pre-filtering step of the novel lncRNA detection pipeline in the alternative analysis (likely due to too low read coverage).
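The ordering principle of table B.1 (candidates significant in both analyses first, then descending absolute log2 fold change) can be expressed as a two-part sort key. The Python sketch below is illustrative only: the record fields, the function name and the three example records are made up and do not reproduce values from the table.

```python
def rank_candidates(candidates):
    """Order candidates as described for table B.1.

    Each candidate is a dict with the hypothetical keys "name",
    "log2_fold_change" and "significant_in_both".
    """
    return sorted(
        candidates,
        key=lambda c: (
            not c["significant_in_both"],  # False sorts before True
            -abs(c["log2_fold_change"]),   # larger magnitude first
        ),
    )


# Made-up example records, not taken from the thesis data.
example = [
    {"name": "lncA", "log2_fold_change": -4.0, "significant_in_both": False},
    {"name": "lncB", "log2_fold_change": -1.5, "significant_in_both": True},
    {"name": "lncC", "log2_fold_change": 2.5, "significant_in_both": True},
]
ranked_names = [c["name"] for c in rank_candidates(example)]
```

With this key, a transcript supported by both analyses outranks one with a larger fold change that is supported by the original analysis alone, which matches the layout of the table.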
