Tutorial Weighted Gene Coexpression Network Analysis BxH validation with BxD data Tova Fuller, Steve Horvath

Correspondence: [email protected], [email protected]

Contents of this tutorial:
1. Gene co-expression network construction
2. Module definition based on average linkage hierarchical clustering with the dynamic tree cut algorithm, and studying module preservation
3. Finding connectivity and gene significance measures, and studying preservation across data sets as well as relationships between these measures within each data set
4. Obtaining linear models explaining variance in weight in each data set

Abstract: Here we apply weighted gene co-expression network analysis (WGCNA) to expression and genotype data from a previously studied BxH F2 mouse intercross as well as a new BxD cross. Specifically, we use WGCNA methods to demonstrate preservation of modules, intramodular connectivity and gene significance. We also obtain linear models in both crosses using a module QTL identified in the BxH data that resides on chromosome 19.

This work is in press: Tova Fuller, Anatole Ghazalpour, Jason Aten, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome, in press.

The data are described in: Anatole Ghazalpour, Sudheer Doss, Bin Zang, Susanna Wang, Eric E. Schadt, Thomas A. Drake, Aldons J. Lusis, Steve Horvath (2006) Integrating Genetics and Network Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics

We provide the statistical code used for generating the weighted gene co-expression network results. Thus, the reader should be able to reproduce all of our findings. This document also serves as a tutorial on differential weighted gene co-expression network analysis. Some familiarity with the R software is desirable, but the document is fairly self-contained. This document and data files can be found at the following webpage: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis

More material on weighted network analysis can be found here: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/

Method Description: The data are described in the PLoS article cited above [1]. Please also refer to the citations above and below for more information regarding weighted gene co-expression network analysis (WGCNA).

Here we attempt to show that networks may be constructed from two phenotypically different subgroups of samples from a prior WGCNA experiment on mice. Here we identify 30 mice at both extremes of the weight spectrum in the BxH data and construct weighted gene co-expression networks from each.

Network Construction: In co-expression networks, network nodes correspond to genes and connection strengths are determined by the pairwise correlations between expression profiles. In contrast to unweighted networks, weighted networks use soft thresholding of the Pearson correlation matrix for determining the connection strengths between two genes. Soft thresholding of the Pearson correlation preserves the gene co-expression information and leads to weighted co-expression networks that are highly robust with respect to the construction method [2].

The network construction algorithm is described in detail elsewhere [2]. Briefly, a gene co-expression similarity measure (the absolute value of Pearson's product moment correlation) was used to relate every pair of genes. An adjacency matrix was then constructed using a "soft" power adjacency function $a_{ij} = \mathrm{Power}(s_{ij}, \beta) \equiv |s_{ij}|^{\beta}$, where $s_{ij}$ is the co-expression similarity and $a_{ij}$ is the resulting adjacency that measures the connection strength. The power $\beta$ is chosen using the scale free topology criterion proposed in Zhang and Horvath (2005). Briefly, the power was chosen such that the resulting network exhibited approximate scale free topology and a high mean number of connections. The scale free topology criterion led us to choose a power of $\beta = 6$ based on the preliminary network built from the 8000 most varying genes. However, since we are using a weighted network as opposed to an unweighted network, the biological findings are highly robust with respect to the choice of this power [2].
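As a minimal sketch of the power adjacency function described above, the following R code computes the similarity and adjacency matrices for a small simulated expression matrix. The simulated data and the names datExprToy, similarityToy and adjacencyToy are illustrative assumptions; they are not part of the tutorial's data or of NetworkFunctions.txt.

```r
# Illustrative sketch on simulated data (not the BxH/BxD data).
set.seed(1)
nSamples <- 40
nGenes <- 6
# Rows are samples and columns are genes.
datExprToy <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
colnames(datExprToy) <- paste("gene", 1:nGenes, sep = "")

betaToy <- 6                                      # soft-thresholding power
similarityToy <- abs(cor(datExprToy, use = "p"))  # s_ij = |cor(x_i, x_j)|
adjacencyToy <- similarityToy^betaToy             # a_ij = |s_ij|^beta

# Raising to a power shrinks weak correlations far more than strong ones.
shrunkToy <- c(0.2, 0.5, 0.9)^betaToy
round(shrunkToy, 4)
```

Because no correlation is set to exactly zero, the ordering of the correlations is preserved, which is one way to read the statement that soft thresholding keeps the co-expression information.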

Topological Overlap Matrix and Gene Modules: The adjacency matrix was then used to define a network distance measure, or more precisely a measure of node dissimilarity, based on the topological overlap matrix [2]. Specifically, the topological overlap matrix is given by

$$\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min\{k_i, k_j\} + 1 - a_{ij}}$$

where $l_{ij} = \sum_{u} a_{iu} a_{uj}$ denotes the number of nodes to which both i and j are connected, u indexes the nodes of the network, and $k_i$ is the connectivity of node i. The topological overlap matrix (TOM) is given by $\Omega = [\omega_{ij}]$. Each $\omega_{ij}$ is a number between 0 and 1, and the matrix is symmetric (i.e., $\omega_{ij} = \omega_{ji}$). The rationale for considering this similarity measure is that nodes that are part of highly integrated modules are expected to have high topological overlap with their neighbors.
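The topological overlap formula can be sketched directly in base R. This is only an illustration on simulated data, written under the assumption that the diagonal of the adjacency matrix is set to zero; the tutorial itself relies on TOMdist1() from NetworkFunctions.txt, whose internals are not reproduced here. All names ending in Toy are hypothetical.

```r
# Sketch of the topological overlap matrix on simulated data.
set.seed(2)
datExprToy <- matrix(rnorm(30 * 5), nrow = 30)
adjacencyToy <- abs(cor(datExprToy))^6
diag(adjacencyToy) <- 0                        # assumption: no self-connections

kToy <- rowSums(adjacencyToy)                  # connectivity k_i
lToy <- adjacencyToy %*% adjacencyToy          # l_ij = sum_u a_iu * a_uj
kminToy <- outer(kToy, kToy, pmin)             # min{k_i, k_j}
tomToy <- (lToy + adjacencyToy) / (kminToy + 1 - adjacencyToy)
diag(tomToy) <- 1                              # a node overlaps fully with itself
dissTOMToy <- 1 - tomToy                       # TOM-based dissimilarity
round(tomToy, 3)
```

Setting the diagonal to zero makes the terms u = i and u = j drop out of the sum, so the numerator can never exceed the denominator and every entry stays between 0 and 1.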

Network Module Identification. Gene "modules" are groups of nodes that have high topological overlap. Module identification was based on the topological overlap matrix $\Omega = [\omega_{ij}]$ defined above. To use it in hierarchical clustering, it was turned into a dissimilarity measure by subtracting it from one (i.e., the topological overlap based dissimilarity measure is defined by $d^{\omega}_{ij} = 1 - \omega_{ij}$). Based on the dissimilarity matrix we can use hierarchical clustering to discriminate one module from another. We used a dynamic cut-tree algorithm for automatically and precisely identifying modules in a hierarchical clustering dendrogram (the details of the algorithm can be found at http://www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut).

The algorithm takes into account an essential feature of cluster occurrence and makes use of the internal structure in a dendrogram. Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination and the process is iterated until the number of clusters becomes stable. No claim is made that our module construction method is optimal. A comparison of different module construction methods is beyond the scope of this paper.
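The clustering step can be sketched on simulated data as follows. Note the assumption: a static cut with cutree() stands in for the dynamic cut-tree algorithm, which is not reimplemented here, so this only illustrates average linkage clustering on the TOM-based dissimilarity. The two simulated modules and all names ending in Toy are hypothetical.

```r
# Two simulated modules of 10 genes each, driven by separate latent factors.
set.seed(3)
nSamples <- 50
factor1 <- rnorm(nSamples)
factor2 <- rnorm(nSamples)
datExprToy <- cbind(
  sapply(1:10, function(i) factor1 + rnorm(nSamples, sd = 0.5)),
  sapply(1:10, function(i) factor2 + rnorm(nSamples, sd = 0.5))
)

# Adjacency and topological overlap, as in the formulas above.
adjacencyToy <- abs(cor(datExprToy))^6
diag(adjacencyToy) <- 0
kToy <- rowSums(adjacencyToy)
tomToy <- (adjacencyToy %*% adjacencyToy + adjacencyToy) /
  (outer(kToy, kToy, pmin) + 1 - adjacencyToy)
diag(tomToy) <- 1

# Average linkage hierarchical clustering on 1 - TOM.
hierToy <- hclust(as.dist(1 - tomToy), method = "average")
# Static cut into two branches (stand-in for the dynamic cut-tree algorithm).
moduleToy <- cutree(hierToy, k = 2)
table(moduleToy, rep(c("module1", "module2"), each = 10))
```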

Intramodular connectivity and gene significance measures: The row sum of the adjacencies of a given gene i yields the network connectivity measure (kall): $k_i = \sum_{u \neq i} a_{iu}$. Analogously, the intramodular connectivity (kin) is found by summing the adjacencies over all genes in a particular module. Intramodular connectivity is an important concept for identifying clinically relevant genes [3]. To measure intramodular connectivity, we find it computationally convenient to define the module based connectivity, kME, as the correlation between a given gene expression profile and the module eigengene: kME(i)=|cor(x(i),ME)|. The module eigengene is defined as the first principal component of the module's expression data and can be considered the most representative gene expression profile of the module.
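The following sketch computes intramodular connectivity and kME for one simulated module. It assumes the eigengene is taken as the first left singular vector of the scaled module expression matrix; the tutorial's own code uses ModulePrinComps1() from NetworkFunctions.txt instead, and the names ending in Toy are hypothetical.

```r
# Eight genes in one simulated module; noise increases with gene index.
set.seed(4)
nSamples <- 60
latentToy <- rnorm(nSamples)
noiseSD <- seq(0.2, 1.6, length.out = 8)
datModuleToy <- sapply(noiseSD, function(s) latentToy + rnorm(nSamples, sd = s))

# Intramodular connectivity: row sums of the adjacency within the module.
adjacencyToy <- abs(cor(datModuleToy))^6
diag(adjacencyToy) <- 0
kInToy <- rowSums(adjacencyToy)

# Module eigengene: first principal component of the scaled expression data.
METoy <- svd(scale(datModuleToy))$u[, 1]
# kME(i) = |cor(x(i), ME)|
kMEToy <- as.numeric(abs(cor(METoy, datModuleToy)))

# The two connectivity measures should rank the genes similarly.
cor(kInToy, kMEToy, method = "spearman")
```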

The gene significance with respect to a specific trait is referred to as GStrait, with GStrait of the ith gene in the array equal to |cor(x(i), trait)|, where x(i) is the gene expression profile of the ith gene.
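A short sketch of the gene significance measure on simulated data follows. The trait and the three genes are hypothetical and only illustrate the definition GStrait(i) = |cor(x(i), trait)|.

```r
set.seed(5)
nSamples <- 50
traitToy <- rnorm(nSamples)                        # stand-in for a trait such as weight
datExprToy <- cbind(
  geneA = traitToy + rnorm(nSamples, sd = 0.5),    # strongly related to the trait
  geneB = -traitToy + rnorm(nSamples, sd = 1),     # negatively related
  geneC = rnorm(nSamples)                          # unrelated
)
# The absolute value makes positive and negative relationships equally significant.
GStraitToy <- as.numeric(abs(cor(datExprToy, traitToy, use = "p")))
names(GStraitToy) <- colnames(datExprToy)
GStraitToy
```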

1. Ghazalpour, A., et al., Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genet, 2006. 2(8): p. e130.
2. Zhang, B. and S. Horvath, A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 2005. 4: p. Article17.
3. Horvath, S., J. Dong, and A.M. Yip, Connectivity, Module-Conformity, and Significance: Understanding Gene Co-Expression Network Methods. UCLA Technical Report, 2006.

Statistical References

To cite this tutorial or the statistical methods please use
1. Zhang B, Horvath S (2005) A General Framework for Weighted Gene Co-Expression Network Analysis. Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17. http://www.bepress.com/sagmb/vol4/iss1/art17
For the generalized topological overlap matrix as applied to unweighted networks see
2. Yip A, Horvath S (2006) Generalized Topological Overlap Matrix and its Applications in Gene Co-expression Networks. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/GTOM/.
For some additional theoretical insights consider
3. Horvath S, Dong J, Yip A (2006) The Relationship between Intramodular Connectivity and Gene Significance. Proceedings Volume. Biocomp Conference 2006, Las Vegas. Technical report at http://www.genetics.ucla.edu/labs/horvath/ModuleConformity/
4. Horvath, Dong, Yip (2006) Using Module Eigengenes to Understand Connectivity and Other Network Concepts in Co-expression Networks. Submitted.

# Absolutely no warranty on the code. Please contact TF or SH with suggestions.

# Downloading the R software
# 1) Go to http://www.R-project.org, download R and install it on your computer.
# After installing R, you need to install several additional R library packages:
# For example, to install Hmisc, open R,
# go to menu "Packages\Install package(s) from CRAN",
# then choose Hmisc. R will automatically install the package.
# When asked "Delete downloaded files (y/N)? ", answer "y".
# Do the same for some of the other libraries mentioned below. But note that
# several libraries are already present in the software so there is no need to re-install them.

# To get this tutorial and data files, go to the following webpage:
# http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/DifferentialNetworkAnalysis
# Download the zip file containing:
# 1) R function file: "NetworkFunctions.txt", which contains several R functions
# needed for Network Analysis.

# Unzip all the files into the same directory.

## The user should copy and paste the following script into the R session.
## Text after "#" is a comment and is automatically ignored by R.

# First we read in the libraries and source code we will need.
library(MASS)
library(class)
library(Hmisc)
library(sma)
library(impute)
library(scatterplot3d)
source("/Users/TovaFuller/Documents/HorvathLab2006/NetworkFunctions/NetworkFunctions.txt")

# We set our working directory.
setwd("/Users/TovaFuller/Documents/HorvathLab2007/MouseProject2.0/")

# Reading in and Processing Expression, Clinical and SNP data

# 1. Expression data
# First we read in our BXD and BXH expression data:
dataBXD=read.csv("Gene_Expression_All_Animalsfixed_BXD.csv", header=TRUE)
dataBXH=read.csv("cnew_liver_bxh_f2female_8000mvgenes_p3600_UNIQUE_tommodules_BXH.csv", header=TRUE)

# The following denotes mapmaker id's for the genes
rid=as.character(dataBXD[,1])

# Merging old, BxH color information
# Now we would like to merge old color information, denoting module membership in the BxH data
# set, to this new data set. This will be important later on in demonstrating module preservation.
RIDs=data.frame(rid)
colnames(RIDs)="mapmaker_id"

# STEP 1: Merge to match genes in BxH and BxD data sets.
rTOlocusid=read.csv("ridTOlocusid.csv",header=T)
colnames(rTOlocusid)=c("mapmaker_id2","locus_id")
# The column is referred to by its full name rather than relying on partial matching of "$".
table(is.element(RIDs$mapmaker_id,rTOlocusid$mapmaker_id2))
# FALSE  TRUE
#  8208 16964   <- lose 8208 genes.

Merge1=merge(RIDs, rTOlocusid, by.x="mapmaker_id", by.y="mapmaker_id2",all.x=T,all.y=F)

# Now we have to get the genes in order
Morder1=match(RIDs$mapmaker_id,Merge1$mapmaker_id)
Merge1=Merge1[Morder1,]
table(Merge1$mapmaker_id==RIDs$mapmaker_id)
# TRUE
# 25172
table(Merge1$mapmaker_id==rid)
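# The reordering step above is needed because merge() returns rows sorted by the key, not in
# the original gene order. The following toy sketch illustrates this with made-up identifiers;
# the objects idsToy, annotToy and mergedToy are hypothetical and not part of the data files.

```r
idsToy <- data.frame(mapmaker_id = c("g3", "g1", "g2", "g4"),
                     stringsAsFactors = FALSE)
annotToy <- data.frame(mapmaker_id2 = c("g1", "g2", "g3"),
                       locus_id = c(101, 102, 103),
                       stringsAsFactors = FALSE)

# Left join: keep every row of idsToy, even if it has no annotation.
mergedToy <- merge(idsToy, annotToy, by.x = "mapmaker_id", by.y = "mapmaker_id2",
                   all.x = TRUE, all.y = FALSE)
mergedToy$mapmaker_id        # no longer in the order of idsToy

# Restore the original order with match(), as done for Merge1 above.
orderToy <- match(idsToy$mapmaker_id, mergedToy$mapmaker_id)
mergedToy <- mergedToy[orderToy, ]
all(mergedToy$mapmaker_id == idsToy$mapmaker_id)
```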

# STEP 2: Merge to obtain color data from old data set
# Now we merge with the old mouse data (BXH) to obtain old module definitions.
dataBXHmodule=dataBXH[,c(3,2,150)]
colnames(dataBXHmodule)
# [1] "LocusLinkID" "gene_symbol" "module"
table(is.element(Merge1$locus_id, dataBXH$LocusLinkID))
# FALSE  TRUE
# 23219  1953

# This number is so small because only the 3600 most connected genes are in dataBXHmodule, and
# there are several genes without locus link IDs in the BXH data file.

# Now we merge:
Merge2=merge(Merge1, dataBXHmodule, by.x="locus_id", by.y="LocusLinkID",all.x=T,all.y=F)

# Now we have to get the genes in order
Morder2=match(RIDs$mapmaker_id,Merge2$mapmaker_id)
Merge2=Merge2[Morder2,]
table(Merge2$mapmaker_id==RIDs$mapmaker_id)
# TRUE
# 25172
table(Merge2$mapmaker_id==rid)
# TRUE
# 15272

Merge3=merge(Merge1, dataBXH,by.x="locus_id",by.y="LocusLinkID",all.x=T,all.y=F) # Takes a while
Morder3=match(RIDs$mapmaker_id,Merge3$mapmaker_id)
Merge3=Merge3[Morder3,]
table(Merge3$mapmaker_id==rid)
table(Merge2$mapmaker_id==Merge3$mapmaker_id)
check=cbind(RIDs$mapmaker_id,Merge2)

# We name a vector color0, for the old, BxH module colors.
color0=Merge2$module
table(is.na(color0))
# FALSE  TRUE
#  1953 23219
overlap=!is.na(color0)
datExprBXH <- t(Merge3[overlap,10:144])
dim(datExprBXH)
# [1]  135 1953
BXHnamesOrdered=Merge3$substanceBXH[overlap]
colorOverlap=color0[overlap]
columnnames=colnames(dataBXD) # the names of the columns, which are samples

# Rows are genes and columns are samples. This is our BxD expression data.
datExprBXD0=dataBXD[, grep("F2",columnnames)]
colnamedata=colnames(datExprBXD0) # sample names

# We omit duplicate samples.
dup1=grep("b",colnamedata)
dup2=grep("c",colnamedata)
dups=c(dup1,dup2)
datExprBXD=t(datExprBXD0[overlap,-dups]) # 151 x 1953
BXDmice=dimnames(datExprBXD)[[1]]

# 2. Clinical Trait data
# BxH Clinical Trait data
# We now read in our BxH clinical trait data:
datClinicalTraitsBXH=read.csv("BXH_ClinicalTraits_361mice_forNewBXH.csv",header=T)
# rows are samples & columns are traits

# We order the mice so that trait file and expression file agree in BxH data:
restrictMice=is.element(datClinicalTraitsBXH$MiceID,dimnames(datExprBXH)[[1]])
table(restrictMice)
# restrictMice
# FALSE  TRUE
#   226   135
datClinicalTraitsBXH=datClinicalTraitsBXH[restrictMice,]
orderMiceTraits=order(datClinicalTraitsBXH$MiceID)
orderMiceExpr=order(dimnames(datExprBXH)[[1]])
datClinicalTraitsBXH=datClinicalTraitsBXH[orderMiceTraits,]
datExprBXH=datExprBXH[orderMiceExpr,]

# From the following table, we verify that all 135 mice are in order:
table(datClinicalTraitsBXH$MiceID==dimnames(datExprBXH)[[1]])
# TRUE
# 135

# BxD Clinical Trait data
# We also read in our BxD clinical trait data:
datClinicalTraits_SNPBXD0=read.csv("BXD_clinical_data.csv", header=T)

# We make the BxD mice names agree with datClinicalTraits sample names.
BXDmice_noF=gsub("F2_","",BXDmice)
restrictMiceBXD=is.element(datClinicalTraits_SNPBXD0$Mouse..,BXDmice_noF)
table(restrictMiceBXD)
# restrictMiceBXD
# FALSE  TRUE
#    44   113
datClinicalTraits_SNPBXD=datClinicalTraits_SNPBXD0[restrictMiceBXD,]
restrictMiceBXDa=is.element(BXDmice_noF,datClinicalTraits_SNPBXD$Mouse..)
table(restrictMiceBXDa)
datExprBXD=datExprBXD[restrictMiceBXDa,]
BXDmice_noF=BXDmice_noF[restrictMiceBXDa]
datClinicalTraits_SNPBXD$Mouse..=as.character(datClinicalTraits_SNPBXD$Mouse..)
orderMiceTraitsBXD=order(datClinicalTraits_SNPBXD$Mouse..)
orderMiceExprBXD=order(BXDmice_noF)
datClinicalTraits_SNPBXD=datClinicalTraits_SNPBXD[orderMiceTraitsBXD, ]
dim(datClinicalTraits_SNPBXD) # 132x204
datExprBXD=datExprBXD[orderMiceExprBXD,]
BXDmice_noF=BXDmice_noF[orderMiceExprBXD]
table(datClinicalTraits_SNPBXD$Mouse..==BXDmice_noF)
# TRUE
# 113
datClinicalTraitsBXD=datClinicalTraits_SNPBXD[,c(133:204)] # 113 x 72

# 3. SNP data

# BxD SNP data
# We obtain the SNP data for the BxD mice.
datSNPBXD0=datClinicalTraits_SNPBXD[,c(2:132)]

# We read in SNP info for BxD.
datSNPBXDinfo0=read.table("BXDSNPinfo.csv",sep=",",header=T)
table(is.element(dimnames(datSNPBXD0)[[2]], as.character(datSNPBXDinfo0$marker_name)))
# FALSE  TRUE
#     4   127
restrictSNPsBXD=is.element(dimnames(datSNPBXD0)[[2]], as.character(datSNPBXDinfo0$marker_name))
datSNPBXD=datSNPBXD0[,restrictSNPsBXD]
orderSNPs=order(dimnames(datSNPBXD)[[2]])
datSNPBXD=datSNPBXD[,orderSNPs]
orderSNPs2=order(as.character(datSNPBXDinfo0$marker_name))
datSNPBXDinfo=datSNPBXDinfo0[orderSNPs2,]

# We check to ensure our SNPs match our SNP info.
table(is.element(dimnames(datSNPBXD)[[2]], as.character(datSNPBXDinfo$marker_name)))
# TRUE
# 127

# Now let's get chromosome number.
chrBXD=datSNPBXDinfo$chromosome_id

# BxH SNP data
datSNPinfoBXH=read.csv("SNPMarkerLocusTranslationTable.csv",header=T)
# [1] "UCSC.Name" "Celera.Name" "UCSC.Chromosome" "UCSC.Location"
# [5] "Celera.Chromosome" "Celera.Location"

BXHindices=match(c("rs3662347","rs3714671","rs3721607","rs3676909","rs3704401","rs3658504","rs3683481","rs3691821","rs3658160"), datSNPinfoBXH$Celera.Name)
BXHsnps=datSNPinfoBXH[BXHindices,-c(5,6)]

# We read in our SNP data
dat1BXH=read.csv("BluemoduleGenesWeightandSNPs.csv",header=T)
datSNPBXH=data.frame(dat1BXH[1:9, c(9:143)])
dimnames(datSNPBXH)[[1]]=as.character(dat1BXH[1:9,1])

# Now we have to make our mouse name order match that of our expression data.
dimnames(datSNPBXH)[[2]]=gsub("F2","F2_", dimnames(datSNPBXH)[[2]])
datSNPBXH=datSNPBXH[,dimnames(datExprBXH)[[1]]]
table(dimnames(datSNPBXH)[[2]]==dimnames(datExprBXH)[[1]])
# TRUE
# 135
BXHchr=BXHsnps$UCSC.Chromosome
BXHpos=BXHsnps$UCSC.Location

# We do some pre-processing on our SNP data so that we don't run into problems later on.
tSNPBXH=data.frame(t(datSNPBXH))
for (i in c(1:dim(tSNPBXH)[[2]])) {tSNPBXH[,i]=as.numeric(as.character(tSNPBXH[,i]))}
tSNPBXH=data.frame(tSNPBXH)
dimnames(tSNPBXH)[[2]]=paste("mQTL",BXHchr,".", signif(BXHpos,3)/1e5,sep="")
tSNPBXD=data.frame(datSNPBXD)
for (i in c(1:dim(tSNPBXD)[[2]])) {tSNPBXD[,i]=as.numeric(as.character(tSNPBXD[,i]))}
tSNPBXD=data.frame(tSNPBXD)
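# A likely reason for the as.numeric(as.character(...)) idiom used above is that columns read
# from a file may be stored as factors. The toy sketch below (hypothetical genotype codes, not
# the actual SNP data) shows how converting a factor directly returns level codes instead.

```r
genotypeToy <- factor(c("2", "0", "1", "2"))

codesToy <- as.numeric(genotypeToy)                  # internal level codes
valuesToy <- as.numeric(as.character(genotypeToy))   # the intended numeric values

codesToy
valuesToy
```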

# Let's take a look at the dimensions of our data frames to make sure nothing went wrong. The number
# of samples in each data set should be constant, and BxD and BxH data sets should share the same
# number of genes, which they do.
dim(datClinicalTraitsBXD)
# [1] 113 72
dim(datExprBXD)
# [1] 113 1953
dim(tSNPBXD)
# [1] 113 127
dim(datClinicalTraitsBXH)
# [1] 135 26
dim(datExprBXH)
# [1] 135 1953
dim(tSNPBXH)
# [1] 135 9
for (i in c(1:dim(datExprBXD)[[2]])) {datExprBXD[,i]=as.numeric(as.character(datExprBXD[,i]))}
datExprBXD=data.frame(datExprBXD)
for (i in c(1:dim(datExprBXH)[[2]])) {datExprBXH[,i]=as.numeric(as.character(datExprBXH[,i]))}
datExprBXH=data.frame(datExprBXH)
for (i in c(1:dim(datClinicalTraitsBXD)[[2]])) {datClinicalTraitsBXD[,i]=as.numeric(as.character(datClinicalTraitsBXD[,i]))}
datClinicalTraitsBXD=data.frame(datClinicalTraitsBXD)
weightBXH=as.numeric(datClinicalTraitsBXH[,5])
weightBXD=as.numeric(datClinicalTraitsBXD[,1])

# Module Preservation Analysis

# Now we produce a cluster dendrogram after obtaining the adjacency matrix. We do this using only
# the genes that were colored by our previous analysis (intersecting genes between the two data sets).
beta1=6

# Dynamic Cut-Tree Algorithm

# We used a dynamic cut-tree algorithm for selecting branches of the hierarchical clustering
# dendrogram (the details of the algorithm can be found at the following link:
# www.genetics.ucla.edu/labs/horvath/binzhang/DynamicTreeCut). The algorithm takes into account an
# essential feature of cluster occurrence and makes use of the internal structure in a dendrogram.
# Specifically, the algorithm is based on an adaptive process of cluster decomposition and combination
# and the process is iterated until the number of clusters becomes stable.
myheightcutoff=0.999
mydeepSplit=FALSE # fine structure within module
myminModuleSize=20 # modules must have this minimum number of genes
# new way for identifying modules based on hierarchical clustering dendrogram

# We now obtain our hierGTOM for the BxD data.
AdjMatBXD=matrix(0, ncol=dim(datExprBXD)[[2]], nrow=dim(datExprBXD)[[2]])
AdjMatBXD<- abs(cor(datExprBXD,use="p"))^beta1
dissGTOMBXD=matrix(0, ncol=dim(datExprBXD)[[2]], nrow=dim(datExprBXD)[[2]])
dissGTOMBXD=TOMdist1(AdjMatBXD)
rm(AdjMatBXD)
hierGTOMBXD <- hclust(as.dist(dissGTOMBXD),method="average")
par(mfrow=c(1,1))
plot(hierGTOMBXD,labels=F)

# We find the modules based on the dynamic cut-tree algorithm in BxD data.
colorBXD=cutreeDynamic(hierclust=hierGTOMBXD, deepSplit=mydeepSplit,maxTreeHeight=myheightcutoff,minModuleSize=myminModuleSize)
table(colorBXD)
# colorBXD
# turquoise  blue  brown  yellow  green  red  black
#       388   356    302     199    193  153    146
# pink  magenta  grey
#  101       91    24

# Compare this with the previous, BxH module sizes in genes that were found in both data sets:
table(colorOverlap)
# colorOverlap
# black  blue  brown  cyan  green  greenyellow
#   264   370    158    65    299           80
# grey  lightcyan  lightyellow  midnightblue  purple  red
#   50         91           16            46      45  437
# salmon
#     32

# We visualize these modules:
pdf(file="pics/MG2/dendrogramPlot.pdf")
par(mar=c(1, 4, 4, 1) + 0.1,mfrow=c(3,1),cex=.9)
plot(hierGTOMBXD, main="BxD Cross Dendrogram", labels=F, xlab="", sub="");
hclustplot1(hierGTOMBXD,colorBXD, title1="Colored by BxD Modules")
hclustplot1(hierGTOMBXD,colorOverlap, title1="Colored by BxH Modules")
dev.off()
par(mar=c(5, 4, 4, 2) + 0.1)

# The top figure is a dendrogram showing the BxD data, the middle is coloring by BxD-defined
# modules, and the bottom figure shows the BxD data colored by old, BxH modules. Here we see rough
# module preservation.

# We produce a multidimensional scaling plot of the BxD data:
library(scatterplot3d)
cmd1=cmdscale(as.dist(dissGTOMBXD),4)
par(mfrow=c(1,1))
scatterplot3d(cmd1[,1:3], color=as.character(colorOverlap), main="MDS plot",xlab="Scaling Dimension 1", ylab="Scaling Dimension 2", zlab="Scaling Dimension 3",cex.axis=1.5,angle=320)

# Comparing K.ME and GSweight between data sets

# Finding k.ME
# Note: In the text of the article we refer to the principal component as "MEblue". Here, this is the
# same as "PCblue".
PCblueBXD=ModulePrinComps1(datexpr=as.matrix(datExprBXD), couleur=as.character(colorOverlap))[[1]]$PCblue
PCblueBXH=ModulePrinComps1(datexpr=as.matrix(datExprBXH), couleur=as.character(colorOverlap))[[1]]$PCblue
blueModuleIndex=(colorOverlap)=="blue"

# We find kME's for the different levels of detection thresholding:
kMEblueBXD=as.numeric(abs(cor(PCblueBXD, datExprBXD[,blueModuleIndex],use="p")))
# kME is an absolute correlation, so the sign of the eigengene does not matter here.
kMEblueBXDAll=as.numeric(abs(cor(PCblueBXD, datExprBXD,use="p")))
kMEblueBXH=as.numeric(abs(cor(PCblueBXH, datExprBXH[,blueModuleIndex],use="p")))
kMEblueBXHAll=as.numeric(abs(cor(PCblueBXH, datExprBXH,use="p")))
datExprBXHBlue=data.frame(datExprBXH[,blueModuleIndex])
datExprBXDBlue=data.frame(datExprBXD[,blueModuleIndex])

# Finding GSweight

# To protect against outliers, we replace the values of the physiological traits by their ranks.
rank1=function(x) rank(x, na.last="keep")
rankdatClinicalTraitsBXD=apply(datClinicalTraitsBXD,2,rank1)
rankdatClinicalTraitsBXH=apply(datClinicalTraitsBXH[,5:26],2,rank1)
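# A toy illustration of the rank transformation used above. The weights are made up; 95.0 plays
# the role of an outlier, and the missing value stays missing because of na.last="keep".

```r
rank1 <- function(x) rank(x, na.last = "keep")

weightsToy <- c(30.1, NA, 95.0, 28.4, 31.7)
ranksToy <- rank1(weightsToy)
ranksToy   # the outlier is only one rank step above the next largest value
```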

# Here we find gene significance, or the correlation between a gene expression profile and a physiological trait
if(exists("GSfunctionBXH")) rm(GSfunctionBXH)
GSfunctionBXH=function(x) {cor(x,rankdatClinicalTraitsBXH,use="p")}
if(exists("GSfunctionBXD")) rm(GSfunctionBXD)
GSfunctionBXD=function(x) {cor(x,rankdatClinicalTraitsBXD,use="p")}

# We also compute GeneSignificance for the data frame with omission of 1000 low detection genes in
# both data sets:
GeneSignificanceBXD=t(apply(datExprBXD,2,GSfunctionBXD))
dimnames(GeneSignificanceBXD)[[2]]=paste("GS",dimnames(rankdatClinicalTraitsBXD)[[2]],sep="")
GeneSignificanceBXD=data.frame(abs(GeneSignificanceBXD))
GSweightBlueBXD=GeneSignificanceBXD[blueModuleIndex,1]
GSweightAllBXD=GeneSignificanceBXD[,1]

GeneSignificanceBXH=t(apply(datExprBXH,2,GSfunctionBXH))
dimnames(GeneSignificanceBXH)[[2]]=paste("GS",dimnames(rankdatClinicalTraitsBXH)[[2]],sep="")
GeneSignificanceBXH=data.frame(abs(GeneSignificanceBXH))
GSweightBlueBXH=GeneSignificanceBXH[blueModuleIndex,1]
GSweightAllBXH=GeneSignificanceBXH[,1]

# Now we make plots of relationships between k.ME and GS.weight in our BxD and BxH data sets.
par(mfrow=c(2,2),mar=c(5, 5, 4, 2) + 0.2)
scatterplot1(kMEblueBXHAll, kMEblueBXDAll, xlab1="k.MEblue, BxH cross", ylab1="k.MEblue, BxD cross",col1="black")
scatterplot1(GSweightAllBXH, GSweightAllBXD,xlab1="GSweight, BxH cross", ylab1="GSweight, BxD cross",col1=colorOverlap)
scatterplot1(kMEblueBXH, GSweightBlueBXH,xlab1="k.MEblue, BxH cross", ylab1="GSweight, BxH cross",col1="blue")
scatterplot1(kMEblueBXD, GSweightBlueBXD, xlab1="k.MEblue, BxD cross", ylab1="GSweight, BxD cross",col1="blue")

# This plot demonstrates:
# 1. kMEblue (BxH versus the new cross) for all genes; this is simply the affinity to the blue module
# 2. GSweight (BxH versus new cross) for all genes
# 3. kMEblue v GSweight in the blue module in BxH cross
# 4. kMEblue v GSweight in the blue module in the new BxD cross.

# Linear Models

# Regressing weight on PCblue
# We regress weight on PCblue in the BXH data

lm1=lm(weightBXH~ PCblueBXH)
summary(lm1)

Call:
lm(formula = weightBXH ~ PCblueBXH)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1717  -3.3553   0.2933   2.6305  16.0453

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  38.0941     0.4146   91.88  < 2e-16 ***
PCblueBXH    43.5771     4.7887    9.10 1.35e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.763 on 130 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.3891, Adjusted R-squared: 0.3844
F-statistic: 82.81 on 1 and 130 DF, p-value: 1.352e-15

# We do the same with the square root of weight:
summary(lm(sqrt(weightBXH)~ PCblueBXH))

Call:
lm(formula = sqrt(weightBXH) ~ PCblueBXH)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06334 -0.26591  0.02787  0.23405  1.22891

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.15155    0.03396 181.118  < 2e-16 ***
PCblueBXH    3.66285    0.39227   9.337 3.54e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3902 on 130 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.4014, Adjusted R-squared: 0.3968
F-statistic: 87.19 on 1 and 130 DF, p-value: 3.543e-16

# We repeat the same model, except in the new cross:
lm1=lm(weightBXD~ PCblueBXD)
summary(lm1)

Call:
lm(formula = weightBXD ~ PCblueBXD)

Residuals:
     Min       1Q   Median       3Q      Max
-18.5685  -4.8604  -0.9517   5.3467  26.0681

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.5974     0.6835  44.767  < 2e-16 ***
PCblueBXD    27.8609     7.2655   3.835 0.000209 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.265 on 111 degrees of freedom
Multiple R-Squared: 0.117, Adjusted R-squared: 0.109
F-statistic: 14.7 on 1 and 111 DF, p-value: 0.0002091

SNP19BXH=tSNPBXH[,9]

# We do the same with the square root of weight:
summary(lm(sqrt(weightBXD)~ PCblueBXD))

Call:
lm(formula = sqrt(weightBXD) ~ PCblueBXD)

Residuals:
     Min       1Q   Median       3Q      Max
-2.14934 -0.41173 -0.04982  0.51667  2.04400

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.48674    0.06225   88.14  < 2e-16 ***
PCblueBXD    2.66675    0.66173    4.03 0.000103 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6617 on 111 degrees of freedom
Multiple R-Squared: 0.1276, Adjusted R-squared: 0.1198
F-statistic: 16.24 on 1 and 111 DF, p-value: 0.0001025

# Now we regress weight on SNP19 in BXH:
lm1=lm(weightBXH~SNP19BXH)
summary(lm1)

Call:
lm(formula = weightBXH ~ SNP19BXH)

Residuals:
     Min       1Q   Median       3Q      Max
-14.3126  -4.0709   0.5152   4.0541  15.8874

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  35.5570     0.8460  42.028  < 2e-16 ***
SNP19BXH      2.6556     0.6961   3.815 0.000209 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.779 on 130 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.1007, Adjusted R-squared: 0.09377
F-statistic: 14.56 on 1 and 130 DF, p-value: 0.0002095

lm1=lm(sqrt(weightBXH)~SNP19BXH)
summary(lm1)

Call:
lm(formula = sqrt(weightBXH) ~ SNP19BXH)

Residuals:
     Min       1Q   Median       3Q      Max
-1.27270 -0.32349  0.06349  0.34286  1.19381

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.94060    0.06997  84.897  < 2e-16 ***
SNP19BXH     0.22086    0.05757   3.836 0.000194 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.478 on 130 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.1017, Adjusted R-squared: 0.09478
F-statistic: 14.72 on 1 and 130 DF, p-value: 0.0001939

lm1=lm(weightBXH~SNP19BXH+PCblueBXH)
summary(lm1)

Call:
lm(formula = weightBXH ~ SNP19BXH + PCblueBXH)

Residuals:
     Min       1Q   Median       3Q      Max
-12.2280  -2.9061   0.2126   2.6703  15.9109

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.4530     0.6853  53.189  < 2e-16 ***
SNP19BXH      1.6830     0.5687   2.960  0.00367 **
PCblueBXH    40.7803     4.7469   8.591 2.43e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.627 on 129 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.428, Adjusted R-squared: 0.4191
F-statistic: 48.26 on 2 and 129 DF, p-value: 2.258e-16

lm1=lm(sqrt(weightBXH)~SNP19BXH+PCblueBXH)
summary(lm1)

Call:
lm(formula = sqrt(weightBXH) ~ SNP19BXH + PCblueBXH)

Residuals:
     Min       1Q   Median       3Q      Max
-1.06799 -0.22244  0.03819  0.20894  1.21781

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.01601    0.05611 107.220  < 2e-16 ***
SNP19BXH     0.13901    0.04656   2.986  0.00339 **
PCblueBXH    3.43184    0.38863   8.831 6.43e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3788 on 129 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-Squared: 0.4401, Adjusted R-squared: 0.4315
F-statistic: 50.71 on 2 and 129 DF, p-value: < 2.2e-16
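
# The R-squared values reported above can also be collected programmatically to compare the
# SNP-only, eigengene-only and combined models side by side. The sketch below uses simulated
# data; snpToy, eigengeneToy and weightToy are hypothetical and do not reproduce the numbers above.

```r
set.seed(11)
nToy <- 120
snpToy <- sample(0:2, nToy, replace = TRUE)
eigengeneToy <- 0.3 * snpToy + rnorm(nToy)
weightToy <- 30 + 1 * snpToy + 4 * eigengeneToy + rnorm(nToy, sd = 3)

# Proportion of variance in weight explained by each model.
r2Toy <- c(
  snpOnly = summary(lm(weightToy ~ snpToy))$r.squared,
  eigengeneOnly = summary(lm(weightToy ~ eigengeneToy))$r.squared,
  both = summary(lm(weightToy ~ snpToy + eigengeneToy))$r.squared
)
round(r2Toy, 3)
```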

# Finding SNP19 in BXD:

# Now we find SNP19 in the BXD data.
COR.PCblueBXD=rep(NA,dim(tSNPBXD)[[2]])
for (i in c(1:dim(tSNPBXD)[[2]])) {COR.PCblueBXD[i]=as.numeric(abs(cor(PCblueBXD,as.numeric(tSNPBXD[,i]),use="p")))}

# There are five SNPs on the 19th chromosome.
tSNPBXD19=tSNPBXD[,chrBXD==19]
SNP19BXD=tSNPBXD19[,which.max(COR.PCblueBXD[chrBXD==19])] # this is the 4th SNP on 19th chromosome.
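
# The marker selection above (the SNP on one chromosome with the highest absolute correlation
# with the module eigengene) can be sketched on simulated data as follows. The markers,
# chromosome labels and eigengene are hypothetical.

```r
set.seed(9)
nMice <- 80
snpToy <- data.frame(matrix(sample(0:2, nMice * 5, replace = TRUE), nrow = nMice))
names(snpToy) <- paste("marker", 1:5, sep = "")
chrToy <- c(1, 19, 19, 19, 2)

# Hypothetical eigengene that depends on marker3.
eigengeneToy <- snpToy$marker3 + rnorm(nMice, sd = 0.5)

# Absolute correlation of every marker with the eigengene.
corToy <- sapply(snpToy, function(s) abs(cor(eigengeneToy, s, use = "p")))

# Restrict to chromosome 19 and pick the marker with the largest correlation.
on19Toy <- chrToy == 19
bestToy <- names(corToy)[on19Toy][which.max(corToy[on19Toy])]
bestToy
```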

# Below is a table of the base pair positions of each of the SNPs we will analyze, based on search
# results on Ensembl:
datSNPBXDinfo[chrBXD==19,]$marker_name
# [1] d19mit41 d19mit53 d19mit63 d19mit71 d19mit8
#
# SNP Name    Label   Data Set   Basepair start   Basepair end
# rs3658160           BxH        47073456         47073456
# d19mit41    SNP.1   BxD        18743419         18743582
# d19mit53    SNP.2   BxD        45205220         45205330
# d19mit63    SNP.3   BxD        36104688         36104837
# d19mit71    SNP.4   BxD        59653090         59653225
# d19mit8     SNP.5   BxD        Not mapped by Ensembl; MGI position of 47?
SNP19BXD.1=tSNPBXD[,chrBXD==19][,1]
SNP19BXD.2=tSNPBXD[,chrBXD==19][,2]
SNP19BXD.3=tSNPBXD[,chrBXD==19][,3]
SNP19BXD.4=tSNPBXD[,chrBXD==19][,4] # this is the one with the highest cor with PCblue
SNP19BXD.5=tSNPBXD[,chrBXD==19][,5]

# We use the 4th SNP as it has the highest correlation with PCblue
# Now we regress weight on SNP19 in BXD:
lm1=lm(weightBXD~SNP19BXD.4)
summary(lm1)

Call: lm(formula = weightBXD ~ SNP19BXD.4)

Residuals: Min 1Q Median 3Q Max -21.1928 -5.4956 -0.2433 4.5150 23.9347

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 29.160 1.143 25.518 <2e-16 *** SNP19BXD.4 1.732 1.053 1.646 0.103 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.671 on 110 degrees of freedom (1 observation deleted due to missingness) Multiple R-Squared: 0.02403, Adjusted R-squared: 0.01516 F-statistic: 2.709 on 1 and 110 DF, p-value: 0.1026 lm1=lm(sqrt(weightBXD)~SNP19BXD.4) summary(lm1) Call: lm(formula = sqrt(weightBXD) ~ SNP19BXD.4)

Residuals: Min 1Q Median 3Q Max -2.39846 -0.47337 0.02073 0.44362 1.85191

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.35715 0.10477 51.131 <2e-16 *** SNP19BXD.4 0.15579 0.09651 1.614 0.109 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7033 on 110 degrees of freedom (1 observation deleted due to missingness) Multiple R-Squared: 0.02314, Adjusted R-squared: 0.01426 F-statistic: 2.606 on 1 and 110 DF, p-value: 0.1093 # doesn't improve lm1=lm(weightBXD~SNP19BXD.4+PCblueBXD) summary(lm1) Call: lm(formula = weightBXD ~ SNP19BXD + PCblueBXD)

Residuals:
    Min      1Q  Median      3Q     Max
-18.841  -5.231  -0.927   5.067  25.065

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.9286     1.1095  26.975  < 2e-16 ***
SNP19BXD      0.8336     1.0341   0.806 0.421932
PCblueBXD    26.5662     7.5487   3.519 0.000633 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.302 on 109 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-Squared: 0.1236, Adjusted R-squared: 0.1075
F-statistic: 7.687 on 2 and 109 DF, p-value: 0.0007531

summary(lm(sqrt(weightBXD)~SNP19BXD.4+PCblueBXD))
# 0.1172, p is 0.464311
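When several models are compared, as above, it can be convenient to pull the multiple R-squared and a coefficient's p-value out of each fit instead of reading them off the printed summaries. The sketch below is an illustration on simulated data: the helper `fitSummary` and the variables `snp`, `pc` and `weight` are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not part of the tutorial): collect the multiple
# R-squared and the p-value of one coefficient from a fitted linear model.
fitSummary <- function(fit, term) {
  s <- summary(fit)
  c(r.squared = s$r.squared,
    p.value = s$coefficients[term, "Pr(>|t|)"])
}

# Simulated stand-ins for a marker, a module eigengene and body weight.
set.seed(2)
n <- 120
snp <- sample(0:2, n, replace = TRUE)
pc <- 0.3 * snp + rnorm(n)
weight <- 30 + 3 * pc + rnorm(n, sd = 5)

fitSNP <- lm(weight ~ snp)        # marker only
fitBoth <- lm(weight ~ snp + pc)  # marker plus eigengene

print(fitSummary(fitSNP, "snp"))
print(fitSummary(fitBoth, "snp"))

# F test of whether adding the eigengene improves on the marker-only model.
print(anova(fitSNP, fitBoth))
```

Because the two models are nested, the R-squared of the larger model can never be smaller than that of the marker-only model; the F test from `anova` indicates whether the gain is more than expected by chance.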

# Pearson correlations between measures of interest

# In comparing the PCs with weight, we must first ensure that the vector is in the correct order:
cor.test(kMEblueBXH, GSweightBlueBXH, method="p")

Pearson's product-moment correlation
data: kMEblueBXH and GSweightBlueBXH
t = 11.4797, df = 368, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4342818 0.5848353
sample estimates:
      cor
0.5134995

cor.test(kMEblueBXD, GSweightBlueBXD, method="p")

Pearson's product-moment correlation
data: kMEblueBXD and GSweightBlueBXD
t = 13.2475, df = 368, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4949667 0.6334971
sample estimates:
      cor
0.5682448

cor.test(PCblueBXH,weightBXH, method="p")

Pearson's product-moment correlation
data: PCblueBXH and weightBXH
t = 9.1, df = 130, p-value = 1.332e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5069661 0.7181278
sample estimates:
      cor
0.6238009

# Previously the sign of PCblue did not affect anything, as we defined kME as the
# absolute value of the correlation with PCblueBXD. Here we replace PCblueBXD by its
# additive inverse, keeping in mind that the sign of a principal component is arbitrary.
PCblueBXD=-PCblueBXD
cor.test(PCblueBXD,weightBXD, method="p")

Pearson's product-moment correlation
data: PCblueBXD and weightBXD
t = 3.8347, df = 111, p-value = 0.0002091
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1679007 0.4954488
sample estimates:
      cor
0.3420223
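The sign flip of the module eigengene can be illustrated on simulated data. The sketch below is not the tutorial's code: `exprData` and `trait` are hypothetical, the first principal component is computed with a plain SVD, and orienting the component so that it correlates positively with the trait is one possible convention, stated here as an assumption.

```r
set.seed(3)
n <- 100
exprData <- matrix(rnorm(n * 10), nrow = n)  # hypothetical module expression data
trait <- exprData[, 1] + rnorm(n)            # hypothetical clinical trait

# First principal component of the scaled expression data (via SVD).
pc1 <- svd(scale(exprData))$u[, 1]

# Flipping the sign of the component flips the sign of the correlation
# but leaves its absolute value unchanged.
corOriginal <- cor(pc1, trait)
corFlipped <- cor(-pc1, trait)
print(c(original = corOriginal, flipped = corFlipped))

# Assumed convention: orient the component so that it correlates
# positively with the trait.
if (corOriginal < 0) pc1 <- -pc1
print(cor(pc1, trait))
```

Any quantity defined through the absolute correlation with the eigengene, such as kME as used here, is unaffected by this choice.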

# BXH:
cor.test(PCblueBXH,SNP19BXH, method="p")

Pearson's product-moment correlation
data: PCblueBXH and SNP19BXH
t = 2.2886, df = 133, p-value = 0.02368
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.02656463 0.35202807
sample estimates:
      cor
0.1946481

cor.test(weightBXH,SNP19BXH, method="p")

Pearson's product-moment correlation
data: weightBXH and SNP19BXH
t = 3.8151, df = 130, p-value = 0.0002095
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1548397 0.4630806
sample estimates:
      cor
0.3173167

# BXD: 4th SNP
cor.test(PCblueBXD,SNP19BXD.4, method="p")

Pearson's product-moment correlation
data: PCblueBXD and SNP19BXD.4
t = 2.6734, df = 110, p-value = 0.008653
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.0643953 0.4135993
sample estimates:
      cor
0.2469997

cor.test(weightBXD,SNP19BXD.4, method="p")

Pearson's product-moment correlation
data: weightBXD and SNP19BXD.4
t = 1.6459, df = 110, p-value = 0.1026
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03142964 0.33106245
sample estimates:
      cor
0.1550303
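The Pearson correlation test of weight against a single marker is equivalent to the simple linear regression of weight on that marker: the t statistic equals r*sqrt(df/(1-r^2)), the squared correlation equals the multiple R-squared, and the p-values agree. This is why the test of weightBXD against SNP19BXD.4 reports t = 1.6459 and p = 0.1026, the same values as the regression of weightBXD on SNP19BXD.4, and why 0.1550303^2 is approximately 0.02403. The sketch below checks this identity on simulated data; the variables `snp` and `weight` are hypothetical.

```r
set.seed(4)
n <- 112
snp <- sample(0:2, n, replace = TRUE)        # hypothetical additive marker coding
weight <- 29 + 1.5 * snp + rnorm(n, sd = 7)  # hypothetical body weight

ct <- cor.test(weight, snp, method = "pearson")
fit <- summary(lm(weight ~ snp))

r <- unname(ct$estimate)    # sample correlation
df <- unname(ct$parameter)  # degrees of freedom, n - 2

# t statistic reconstructed from the correlation coefficient.
tFromR <- r * sqrt(df / (1 - r^2))

print(c(t.cor.test = unname(ct$statistic), t.from.r = tFromR,
        t.lm = fit$coefficients["snp", "t value"]))
print(c(r.squared.from.r = r^2, r.squared.lm = fit$r.squared))
print(c(p.cor.test = ct$p.value, p.lm = fit$coefficients["snp", "Pr(>|t|)"]))
```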

# In the BxH data, B corresponds to a "0" in additive marker coding, H corresponds to
# a "1" and A corresponds to a "2".
# In the BxD data, B corresponds to a "2" in additive marker coding, and A
# corresponds to a "0".

# We can visualize the distribution of these different genotypes:
par(mfrow=c(1,2))
bxhsnplevels=SNP19BXH
bxhsnplevels[bxhsnplevels==1]="H"
bxhsnplevels[bxhsnplevels==2]="A"
bxhsnplevels[bxhsnplevels==0]="B"
bxdsnplevels=SNP19BXD.4
bxdsnplevels[bxdsnplevels==1]="H"
bxdsnplevels[bxdsnplevels==0]="A"
bxdsnplevels[bxdsnplevels==2]="B"
boxplot(weightBXH~bxhsnplevels,logical=T,notch=T,main="BxH data")
boxplot(weightBXD~bxdsnplevels,logical=T,notch=T,main="BxD data")
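An alternative to overwriting the numeric codes with letters is to build a factor with explicit levels and labels, which also fixes the order of the boxes so that both panels show the genotypes in the same order. The following is a sketch on simulated data: the marker vectors and weights are hypothetical, and the level-to-label mapping follows the coding described in the comments above (BxH: 0 = B, 1 = H, 2 = A; BxD: 2 = B, 1 = H, 0 = A).

```r
set.seed(5)
n <- 60
snpBxH <- sample(0:2, n, replace = TRUE)  # hypothetical marker in BxH coding
snpBxD <- sample(0:2, n, replace = TRUE)  # hypothetical marker in BxD coding
weight1 <- rnorm(n, mean = 35, sd = 6)    # hypothetical weights, first cross
weight2 <- rnorm(n, mean = 30, sd = 7)    # hypothetical weights, second cross

# BxH coding: 0 = B, 1 = H, 2 = A.
genoBxH <- factor(snpBxH, levels = c(0, 1, 2), labels = c("B", "H", "A"))
# BxD coding: 2 = B, 1 = H, 0 = A (levels listed so that labels line up).
genoBxD <- factor(snpBxD, levels = c(2, 1, 0), labels = c("B", "H", "A"))

# Both panels now show the genotypes in the order B, H, A.
par(mfrow = c(1, 2))
boxplot(weight1 ~ genoBxH, main = "BxH-style coding")
boxplot(weight2 ~ genoBxD, main = "BxD-style coding")
```

Missing genotypes stay `NA` in the factor and are dropped by `boxplot`.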