Gravitropic Signal Transduction: A Systems Approach to Gene Discovery

A dissertation presented to

the faculty of

the College of Arts and Sciences of Ohio University

In partial fulfillment

of the requirements for the degree

Doctor of Philosophy

Kaiyu Shen

May 2013

© 2013 Kaiyu Shen. All Rights Reserved.

This dissertation titled

Gravitropic Signal Transduction: A Systems Approach to Gene Discovery

by

KAIYU SHEN

has been approved for

the Program of Molecular and Cellular Biology

and the College of Arts and Sciences by

Sarah E. Wyatt

Professor of Environmental and Plant Biology

Robert Frank

Dean, College of Arts and Sciences

ABSTRACT

SHEN, KAIYU, Ph.D., May 2013, Molecular and Cellular Biology

Gravitropic Signal Transduction: A Systems Approach to Gene Discovery

Director of Dissertation: Sarah E. Wyatt

Gravity is an important stimulus for plants. Gravitropism, the plants’ response to gravity, can be divided into three phases: gravity perception, signal transduction and response.

Various theories have been proposed to explain the process of gravitropism, yet more genes are needed to elucidate the mechanism of gravitropic signal transduction. A transcriptome analysis, in combination with the Gravity Persistent Signal treatment, was performed to specifically study the genes involved in signal transduction. The analysis generated a list of 318 transcripts that were differentially expressed in plants that were reoriented with respect to gravity as compared to vertical controls. Based on the expression profiles and gene function annotations, five transcription factors, WRKY18, WRKY26, WRKY33, BT2 and ATAIB, were selected for further study. In addition to the standard analysis of differentially expressed genes, a systems approach was adopted to uncover more gravity-related genes. A semi-supervised learning method was developed to find additional novel genes. This learning method took as input a set of 32 known gravity genes from the literature as well as a collection of heterogeneous annotation features, such as existing protein-protein interactions and co-expression profiles. The learning classifier predicted a list of 50 genes that are functionally related to gravity signal transduction. Based on this list of genes, an interaction network was predicted using two complementary approaches: a dynamic Bayesian network and a time-lagged correlation coefficient. To increase confidence in the prediction, genes/interactions that appeared in both networks were selected. This ‘intersected’ network provided 20 hub and bottleneck genes, fourteen of which had not been previously identified as involved in gravitropism. Such an approach provides a framework to extend current research in a more comprehensive manner, and serves as a complement to the traditional mutant/gene discovery model.
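The idea of a time-lagged correlation coefficient and of an ‘intersected’ network can be illustrated with a minimal sketch. It is not the pipeline used in this work: the gene names, the lag of one time point and the 0.9 threshold are hypothetical, and the dynamic Bayesian network is represented only by an edge set assumed to have been computed elsewhere.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def time_lagged_pcc(source, target, lag=1):
    """Correlate source at time t with target at time t + lag (lag >= 1)."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return pearson(source[:-lag], target[lag:])


def lagged_edges(profiles, lag=1, threshold=0.9):
    """Directed edges (a, b) whose absolute lagged PCC reaches the threshold."""
    edges = set()
    for a, profile_a in profiles.items():
        for b, profile_b in profiles.items():
            if a != b and abs(time_lagged_pcc(profile_a, profile_b, lag)) >= threshold:
                edges.add((a, b))
    return edges


def intersect_networks(edges_a, edges_b):
    """Keep only the interactions predicted by both methods."""
    return edges_a & edges_b


# Hypothetical expression profiles: geneB repeats geneA one time point later.
profiles = {
    "geneA": [0, 1, 0, 2, 0, 3],
    "geneB": [5, 0, 1, 0, 2, 0],
}
correlation_edges = lagged_edges(profiles)
# Placeholder for edges from a dynamic Bayesian network (assumed, not computed here).
dbn_edges = {("geneA", "geneB"), ("geneB", "geneC")}
shared_edges = intersect_networks(correlation_edges, dbn_edges)
```

In this toy case only the geneA to geneB edge survives the intersection, because it is the only interaction supported by both edge sets.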

DEDICATION

To my family.

ACKNOWLEDGMENTS

I express my deep sense of gratitude to my advisor, Dr. Sarah Wyatt, who provided me with moral and intellectual guidance throughout my Ph.D. research. I would also like to thank Dr. Bunescu Razvan for his wholehearted help with my machine learning project and research. I also greatly appreciate the efforts and time of my Ph.D. committee members: Dr. Lonnie Welch, Dr. Frank Horodyski and Dr. Allan Showalter. I also thank Vijay Nadella, the director of Ohio University Facility, who has given me wonderful opportunities to gain experience in analyzing meta-scale biological data.

I thank all the previous and current Wyatt lab members for supporting my research and dissertation. Finally, I thank my family, who have always been my strongest support.

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
Abbreviations
Chapter 1: Introduction
  Gravitropism
  The perception of gravity and the starch statolith hypothesis
  Signal transduction
  Gravity response
  Arabidopsis mutants
  The gravity persistent signal (gps) mutants
Chapter 2: Transcriptome analysis of gravitropic signal transduction
  Introduction
  Methods
  Plant preparation
  GPS treatment
  Microarray experiment design
  RNA extraction
  Microarray data analysis
  Gene annotation
  qRT-PCR experiment and selection of housekeeping genes
  cDNA preparation and primer design
  Optimization of primer concentrations
  Results
  Microarray data analysis
  Enrichment analysis of the annotations
  Genes selected for future studies
  Transcription factors
  qRT-PCR results
  Discussion
Chapter 3: Mining functionally related genes using semi-supervised learning
  Introduction
  Methods
  Information sources and feature engineering
  Feature vector
  Feature selection and filtering
  Learning methods
  Selection of unlabeled data for training
  Results
  Benchmark data
  Feature collection
  Algorithm implementation and comparison
  Time dependent evaluation
  Varied composition of unlabeled genes
  Different set of negative genes
  Comparison with GeneMANIA
  Discussion
Chapter 4: The construction of a gravitropic network
  Introduction
  Methods
  Feature vector preparation
  Gravitropic microarray data feature preparation
  Putative positive data selection
  Select the putative genes
  Dynamic Bayesian network
  Time lagged correlation coefficient
  Results
  Prefilter of the seed genes
  The Gravity network
  Discussion
Chapter 5: Conclusions
References
Appendix A. Microarray data analysis pipeline
Appendix B. Differentially expressed genes

LIST OF TABLES

Table 2.1. Primer sequences designed for qPCR analysis of the five genes selected.

Table 2.2. Genes differentially expressed at the four time points.

Table 2.3. Significant GO terms (p-value < 0.005) associated with gene products targeted to the vacuoles and plastids.

Table 2.4. The transcription factors discovered in the differentially expressed genes and their transcription factor families.

Table 2.5. The expression levels of the five transcription factors across the four time points.

Table 2.6. The most enriched GO terms in the sub-networks of the five transcription factors.

Table 3.1. The AUC20 comparisons between BSVM and WSVM on original and filtered sets.

Table 3.2. The AUC20 comparisons among different sets of negative genes.

Table 3.3. Comparison of the prediction results with GeneMANIA.

Table 4.1. The seed genes that were previously identified as gravity related, specifically those involved in signal transduction.

Table 4.2. Selected genes based on the semi-supervised learning.

Table 4.3. The most significant GO slim terms associated with the 50 selected genes.

Table 4.4. The predicted interactions in the intersected network.

Table 4.5. The hub genes (degree >5) identified in the intersected network. Newly identified genes are indicated in blue.

Table B.1. Genes down-regulated at 2min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.2. Genes up-regulated at 2min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.3. Genes down-regulated at 4min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.4. Genes up-regulated at 4min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.5. Genes down-regulated at 10min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.6. Genes up-regulated at 10min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

Table B.7. Genes uniquely down-regulated at 30min.

Table B.8. Genes uniquely up-regulated at 30min.

LIST OF FIGURES

Figure 1.1. The “grey cloud” model for gravitropic signal transduction.

Figure 1.2. The “fountain” model of auxin transport in root columella cells.

Figure 2.1. The design of the microarray experiment for treatment group.

Figure 2.2. The raw signal density (the mean value of signal density from both channels) distribution across the 16 arrays.

Figure 2.3. The normalized signal density (the mean value of signal density of both channels, logarithmic 2 based) distribution across the 16 arrays.

Figure 2.4. The geNorm M values of the four housekeeping genes.

Figure 2.5. Sequence maps of the cDNA of the five genes selected: ATAIB, BT2, WRKY18, WRKY33 and WRKY26, including the location of the qPCR primers.

Figure 2.6. The overlap of differentially expressed genes between the four time points.

Figure 2.7. The distribution of Biological Process terms among the differentially expressed genes.

Figure 2.8. The distribution of Molecular Function terms among the differentially expressed genes.

Figure 2.9. A scatter plot of the expression levels at 4min and 10min of the 33 transcription factors identified in the microarray analysis.

Figure 2.10. The interactions of the five TFs in the Arabidopsis genome.

Figure 2.11. The qRT-PCR validation of ATAIB, WRKY18, WRKY33, WRKY26 and BT2.

Figure 3.1. Comparisons of performance between different algorithms, kernels and data integration.

Figure 3.2. The performance comparisons between different algorithms and kernels based on the year of 2010.

Figure 3.3. The performance comparisons between different algorithms and kernels based on the year of 2011.

Figure 3.4. The degree distribution based on the pre-filtered and filtered networks.

Figure 4.1. Diagrammatic representation of “hub” and “bottleneck” genes.

Figure 4.2. The distribution histogram of the predicted values.

Figure 4.3. A general view of the network between 80 genes and 721 interactions.

Figure 4.4. The density distribution of the time-lagged PCC among the chosen genes.

Figure 4.5. A diagrammatic representation of the network generated using the time lagged correlation co-expression method.

Figure 4.6. A diagrammatic representation of the intersected network.

Figure 4.7. The power law fit to the distribution of degrees in the network; the correlation R-square score is 0.73.

Figure 4.8. The comparisons between PCC values among three different types of correlation: continuous (red), random (blue) and transient (green).

ABBREVIATIONS

RIN - RNA Integrity Number

PIN - PIN-formed

PCC - Pearson Correlation Coefficient

PPI - Protein Protein Interaction

PPIN - Plant-Pathogen Immune Network

DBN - dynamic Bayesian network

qRT-PCR - quantitative Real Time Polymerase Chain Reaction

GO - Gene Ontology

BP - Biological Process

MF - Molecular Function

CC - Cellular Component

KEGG - Kyoto Encyclopedia of Genes and Genomes

TF - Transcription Factor

TFBS - Transcription Factor Binding Site

ABA - Abscisic Acid

BTB - Broad-complex Tramtrack and Bric a brac

ARF - Auxin Response Factor

EGF – Epidermal Growth Factor

CHAPTER 1: INTRODUCTION

Gravitropism

Gravity plays a fundamental role in plant growth and development. Generally, roots grow down into the soil for water and nutrients while shoots grow up to access the light of the sun. However, that is a simplistic view. Roots and shoots position themselves at specific angles (the Gravitropic Set-point Angle, GSA) with respect to the gravity field (Digby and Firn, 1995). These angles can be organ specific or species specific, and can change throughout development. In addition, if a plant is placed on its side, the organs will return to their original GSAs. Gravity differs from other environmental stimuli (like light and water) because it is neither visible nor touchable, yet it affects plants from germination to seed set. Charles Darwin was one of the first to record the gravitropic effect on plants (reviewed by Kiss 2006). Gravitropism is not an isolated pathway but part of a complex response network that integrates factors from numerous developmental and environmental stimuli (Rasmussen 2007), such as those underlying phototropism and hydrotropism (Correll & Kiss, 2002; Bao et al., 2004; Hopkins & Kiss, 2012). Research on the mechanism of gravitropism has expanded, especially since the inception of space flight. Studies of how plants respond to gravity could benefit human life with respect to agriculture, horticulture and even long-term space flight.

Gravitropism is a complex pathway that involves the conversion from a biophysical stimulus to biochemical signals and downstream physical responses. To study this complicated process, scientists have simplified and artificially separated the process into three continuous steps: gravity stimulus perception, signal transduction and response (see Wyatt & Kiss, 2013). Signal transduction can further be subdivided into two phases: the early events of signal transduction and auxin redistribution.

Gravity stimulus perception refers to how plants perceive the gravity vector. Perception is thought to occur by the movement of amyloplasts in shoots and roots (reviewed by Kiss, 2000). Signal transduction carries that information to the site of response, while the gravitropic response itself refers to the asymmetric growth and bending of the organ. Bending is driven by the asymmetric distribution of auxin in the elongation zone of the organ (reviewed in Morita, 2010). However, little is understood about the early events that lead to auxin redistribution. In fact, the molecular mechanisms of signal transduction remain unknown, including how amyloplast sedimentation triggers signal transduction and what causes the redistribution of auxin.

The perception of gravity and the starch statolith hypothesis

The Starch Statolith Hypothesis is one of the oldest theories relating to gravitropism and dates back to the early 1900s. It has been the dominant hypothesis governing gravity perception for over 100 years (Wyatt & Kiss 2013). This hypothesis proposes that gravity is sensed by the interaction of specialized amyloplasts (statoliths) with cytoplasmic structures of specific gravity sensing cells. Statoliths are enriched in starch and denser than the cytoplasm. They are contained in columella cells in the root and the starch sheath (sometimes referred to as the endodermis) of inflorescence stems and hypocotyls. Amyloplasts are ‘heavier’ than other organelles, and thus they are more sensitive to the gravity vector. Once the gravity vector is altered, amyloplasts move to the new physical bottom of the cell. When these organelles are removed by mechanical or genetic ablation, the organ loses its ability to sense gravity (reviewed by Morita 2010). Likewise, when statoliths are displaced by high-gradient magnetic fields, plants bend in response to the magnetic displacement of the statoliths as would be predicted for gravity (Kuznetsov & Hasenstein 1997; Perrin et al. 2005; Morita 2010). These experiments showed that amyloplasts are important in gravity perception and that their sedimentation determines the plant’s curvature.

For roots, the root cap has long been recognized as the site of gravity perception. Sack et al. (1986) first observed that amyloplast sedimentation occurs in the columella of the root cap in living corn cells. Blancaflor et al. (1998) used a laser to ablate individual columella cells and concluded that the S2 layer is most important for gravity perception. Recently, Leitz et al. (2009) adopted differential interference contrast microscopy to analyze the movements of statoliths in relation to the cortical ER in columella cells. Their results showed that the statoliths move fastest during the first 30 seconds after reorientation, and that the kinetic energy of the statolith sedimentation is transmitted to the cortical ER network, which is a putative site for signal transduction. Nevertheless, the columella cells are not the only site for gravity perception. Mullen et al. (2000) showed that gravity perception and signaling can also occur at the elongation zone, although to a lesser extent.

The perception of gravity stimulation in the hypocotyls and inflorescence stems differs from that in the root. Fukaki et al. (1998) showed that the shoot endodermis (starch sheath) is essential for gravitropism in hypocotyls and inflorescence stems by comparing wild type with the shoot gravitropism 1 (sgr1) and sgr7 mutants. These two mutants lack a normal endodermal layer in the hypocotyl and inflorescence stem, and as a result have lost the gravitropic response in those organs while root gravitropism remains normal.

Experiments on gravitropism with starchless, reduced-starch and excessive-starch mutants in Arabidopsis provided the most straightforward evidence that amyloplasts function as statoliths. Both the completely and partially starch-depleted mutants are less sensitive to gravity compared to their wild type counterparts, and the remaining sensitivity to gravity is proportional to the total mass of amyloplasts present in the cell (Kiss et al. 1996; Weise & Kiss 1999; Weise et al. 2000). In contrast, a starch excessive (sex) mutant, sex1, shows increased sensitivity to gravity in hypocotyls resulting from much larger amyloplasts compared with wild type (Vitha et al. 2007). Nevertheless, the starch statolith hypothesis remains controversial because the starchless mutants do respond to gravity, albeit much more slowly and to a lesser extent than wild type (Stanga et al. 2011). This implies that amyloplasts, although important for a full gravitropic response, are not the only player in gravity stimulus perception.

Thus an alternate model, the Gravitational Pressure Hypothesis, was proposed in the late 1990s (Staves et al. 1997). This hypothesis suggests that the entire mass of the cytoplasm, through its movement inside the cell wall, is used for gravity perception. The protoplast would exert tension on connections between the cytoplasm and the cell wall at the top of the cell, as well as compress these structures at the bottom of the cell. These oriented tensions and pressures determine the direction of gravity, therefore triggering the corresponding responses.

Another important feature of gravity perception is the “presentation time”, which is a measurement of how long it takes for a plant to “know” that an alteration in the gravity vector has occurred. The presentation time is crucial for plants to determine whether the altered gravity stimulus is only a transient perturbation, like a movement due to wind, or a more permanent reorientation. Leitz et al. (2009) estimated the time for the amyloplasts to reach the bottom of the columella cells after reorientation in roots; they suggested that 3 minutes is the minimum presentation time for Arabidopsis roots. Further, Svegzdiene et al. (2005) showed that the amyloplasts move faster in hypocotyls than in roots of cress (Lepidium sativum L.) seedlings. This may suggest that in Arabidopsis the presentation time for the inflorescence stems would be less than 3 min. This estimation of presentation time is critical to determine the starting point for the next stage of the pathway: signal transduction.

Signal transduction

Several signaling molecules have been proposed as components in gravitropic signal transduction. Calcium, one of the most common signal molecules, has been implicated in the gravitropic response pathway (Toyota et al. 2008). A number of studies suggested a relationship between auxin transport and calcium (reviewed in Baldwin et al. 2013). Kordyum (2003), in a set of microgravity and clinorotation experiments, suggested a role for Ca2+ in signal transduction. Aequorin, a Ca2+-sensitive luminescent protein, was used to monitor the calcium levels, and a biphasic calcium increase was observed (Toyota et al. 2008). The first intracellular calcium peak occurred less than 20 seconds after gravistimulation, and the second peak was observed after 40 seconds and decayed exponentially within 60 seconds in hypocotyls and petioles, which is closely related to the asymmetric distribution of auxin (Toyota et al. 2008). Thus, the calcium peaks trigger a relocalization of PIN-formed (PIN) proteins and alter auxin transport. PINs are auxin efflux regulators, and the translocation of PIN proteins via phosphorylation is involved in gravitropic signal transduction (Zhang et al. 2010). There are in total eight PINs in Arabidopsis and they can be categorized into two subfamilies (Křeček et al. 2009). Auxin upregulates most of the PIN proteins except for PIN5 (Friml et al. 2003). The recycling of PIN proteins depends on clathrin-dependent endocytosis and ARF-GEF (ADP-ribosylation factor guanine nucleotide exchange factor)-dependent exocytosis (Geldner et al. 2001). PIN1 transports auxin acropetally (away from the base) in the root stele, while PIN3 and PIN7 transport auxin laterally in the columella cells. PIN2 transports auxin basipetally in the epidermis layer (Petrásek and Friml, 2009), forming a water-fountain model of auxin transport. Phosphorylation is a key mechanism for regulating protein activity, and phosphorylation/dephosphorylation has been shown to regulate the relocation and recycling of PIN proteins in the cell (Sukumar et al. 2009).

Inositol 1,4,5-trisphosphate (IP3) is another molecule that has been implicated in numerous signaling pathways. The experiments on IP3 were first performed in maize pulvini (Perera et al. 1999). The pulvinus is a joint-like structure at the base of the internode in monocot stems that facilitates growth-independent movement, and it has been intensively studied for its gravitropic mechanism. In maize that has been reoriented, a rapid initial change in IP3 levels on both sides of the pulvini was observed, followed by a larger and persistent elevation on the lower side (Perera et al. 1999). Further, the changes in IP3 concentrations have been examined in response to gravity stimulation at 4°C. Perera et al. (2001) showed that a cold treatment prevented the gravity response in the pulvini while the short-term and long-term changes in IP3 concentrations retained the same magnitude. The authors proposed that the cold treatment interrupted the events between the IP3 signaling pathway and the asymmetric distribution of auxin (Perera et al. 2001).

Reactive Oxygen Species (ROS) are another family of molecules shown to be involved in gravitropic signal transduction. Joo et al. (2001) saturated agar with ROS and applied it to maize roots. When the agar was applied to the lower side of the root, the gravitropic response was enhanced; when applied to the upper side, it was diminished, indicating that ROS are involved in the gravitropic response (Joo et al. 2001). Treatment with hydrogen peroxide and ascorbic acid, both of which involve a rapid accumulation of ROS, resulted in bending of the maize pulvinus away from the high concentration and suggested that ROS are involved in signal transduction (Clore et al. 2008). The authors also visualized ROS increases in the maize pulvinus within 30 minutes after gravistimulation and suggested that ROS may be involved in gravitropism both before and after auxin redistribution (reviewed in Clore 2013).

Calcium, ROS and IP3 are molecules involved in ion channels. An alternative ligand-receptor model has also been proposed, suggesting that signal transduction is carried by the interaction between proteins located on the outer surface of the amyloplasts and proteins on the ER membrane (reviewed in Baldwin et al. 2013). Moreover, alkalinization and proton movement have been proposed to be involved in gravitropic signal transduction. Scott & Allen (1999) and Fasano et al. (2001) showed that the leaking of H+ out from the vacuoles in Arabidopsis roots was related to gravitropic signaling. Further, Fasano et al. (2001) applied UV irradiation to release protons in the columella cells, where they counteracted the loss of protons in the cytosol and impaired the gravitropic response of roots.

Attention was drawn to the Translocon of Outer Membrane of Chloroplast (TOC) proteins when two double mutants, arg1/toc132 and arg1/toc75, were found to be completely agravitropic (Stanga et al. 2009). These mutants showed no differences in amyloplast sedimentation as compared to wild type, suggesting that TOC is not involved in gravity perception. The function of TOCs in gravitropism is still unclear.

The cortical ER membrane has also been implicated in signal transduction in relation to another putative player: actin. Actin has been suggested to participate in signal transduction because the sedimentation of amyloplasts transmits a force to the cortical ER via actin filaments (Bisgrove 2008). Early studies examined the cytoskeleton’s role in gravitropism by applying an F-actin inhibitor to disrupt a plant’s gravitropic responses (Blancaflor & Hasenstein 1997; Friedman et al. 2003). However, these experiments do not provide solid evidence that actin is involved in gravitropic signal transduction because of the cytoskeleton’s vital function in other pathways in plants. The most controversial data came from experiments that showed that interference with the cytoskeleton enhanced the gravitropic response (Hou et al. 2004). The authors proposed that the fine cytoskeleton structure in the columella cells acts as a buffer for the sedimentation of amyloplasts. Interference with those actin filaments could initiate an exaggerated force on the cortical ER, and result in a stronger gravitropic response (Leitz et al. 2009).

Flavonoids are a family of secondary metabolites, and they are involved in the response to stimuli such as wounding and pathogen infection (Winkel-Shirley 2002). A role for flavonoids in gravitropism was discovered based on a flavonoid-deficient mutant, transparent testa4 (tt4). This mutant has a delayed gravitropic response, which can be rescued by exogenous flavonoid treatment (Buer & Muday 2004). Flavonoids were discovered to be enriched in the columella cells in the root cap and epidermal cells in the elongation zone, both of which are important sites for gravitropism (Buer & Muday 2004). Buer et al. (2006) further showed that flavonoids were involved by applying an ethylene precursor to alter the synthesis of flavonoids, which resulted in a reduced gravitropic response. The most recent data show that flavonoids direct auxin transport and modulate the rate of gravitropic responses by inhibiting PGP-4 mediated auxin distribution in the elongation zone (Lewis et al. 2007; Peer & Murphy 2007; Withers et al. 2013).

In summary, various molecules and genes have been proposed to be involved in the early events of gravitropic signal transduction. Research in this area has moved from a “black box” to a “grey cloud” model (Figure 1.1), but much is yet to be learned.

Figure 1.1. The “grey cloud” model for gravitropic signal transduction. Here the early events of signal transduction are shown as a collection of molecules, including TOC, IP3, calcium, protons, cytochrome P450, ROS, flavonoids and actin. These molecules and processes work coordinately to influence auxin transport and result in differential growth. (Picture adapted from Wyatt S.E. and Kiss J.Z. (2013), Plant tropisms: from Darwin to the International Space Station. Am. J. Bot. 100(1): 1-3. Copyright from AJB)

Gravity response

Plants reorient their organs by asymmetric growth to re-establish their positions with respect to gravity. When placed horizontally, roots and shoots respectively exhibit downward and upward curvature. The Cholodny-Went hypothesis suggests that gravitropism occurs as a result of changes in the distribution of auxin (reviewed by Gutjahr et al., 2005). The hypothesis proposes that when a plant's orientation with respect to gravity is altered, lateral auxin transport leads to the redistribution of auxin to the lower side of the gravistimulated stem, causing increased growth; this asymmetric growth results in the bending of the gravistimulated organ.

In roots, gravity perception occurs in the root cap while the response occurs in the elongation zone. In the hypocotyl and inflorescence stem, the curvature also occurs in the elongation zone, but it is also the site of perception. An increase in auxin inhibits elongation of the cells on the lower side of the root elongation zone but increases elongation in inflorescence stems, which causes the root to grow down and stems to grow up when auxin is asymmetrically distributed. Several mutants defective in auxin response or perception have revealed the importance of auxin in gravitropism. The axr1 and axr2 mutants are auxin-insensitive and lack normal gravitropic responses (Ishikawa and Evans 1997), and the aux1 mutant has been shown to be totally agravitropic (Marchant et al., 1999).

Because auxin is crucial to the gravitropic response, auxin efflux and influx carriers are also crucial to gravitropism. Auxin is polarly transported in plants from cell to cell (reviewed in Peer et al. 2011; Blancaflor 2013). It is mainly synthesized and secreted in the shoot apex, and the polar movement of auxin is facilitated by efflux and influx carriers (Muday and DeLong, 2001). Auxin flows downward in the vasculature from the shoot to the root tip and then is diverted up through the outer layers of the root. This pattern of auxin transport has been referred to as the “fountain flow” model (Swarup and Bennett 2003; Figure 1.2). AUX1, an auxin influx carrier, is involved in the gravitropic response by transporting auxin into cells in the elongation zone. Experiments have revealed that AUX1 is symmetrically localized in columella cells and not necessary for root gravitropism, but important in influencing auxin distribution in the elongation zone (Swarup et al., 2005).

Figure 1.2. The “fountain” model of auxin transport in root columella cells. In a vertical root (A), auxin is transported acropetally from the base of the plant to the root tip and into the columella cells. Auxin is laterally transported to the peripheral cells in the columella organ, and then transported basipetally back to the elongation zone. In a horizontal root (B), auxin is asymmetrically distributed to the physical bottom of the columella cells, and transported only in cells on the lower side of the root. The auxin efflux (blue circles) and influx carriers (green circles) are polarly distributed (C) in the cell membrane, resulting in the polar transportation of auxin. The auxin accumulates on the lower side and inhibits the growth of those cells. (Picture adapted from Elison B. Blancaflor and Patrick H. Masson, Plant Gravitropism. Unraveling the Ups and Downs of a Complex Process, Plant Physiology December 2003 vol. 133 no. 4 1677-1690)

Arabidopsis mutants

Arabidopsis thaliana has been recognized as a powerful genetic model and has been widely used in the dissection of the molecular mechanisms of gravitropism. Mutants defective in gravitropism in hypocotyls, roots and inflorescence stems have all been discovered, and mutants have been found that are defective individually in gravity perception, signal transduction and the response.


Back in 1989, Caspar et al. identified a starchless mutant, pgm-1, whose gravitropic response is reduced by at least 30% compared with wild type (WT) (Caspar et al., 1985). Besides the starchless pgm-1, several intermediate, reduced-starch mutants have also been discovered. Two partially starch-deficient mutants (acg20 and acg27), along with a starchless mutant, acg21 (Kiss et al., 1997), have been studied for their gravitropic responses. The results indicated that the mutants respond (as measured by the degree of curvature) in proportion to the amount of starch present in the columella cells. These studies focused on root gravitropism, but a few mutants specific for shoot gravitropism have also been discovered. For example, a series of shoot gravitropism (sgr) mutants were found to be defective in gravitropism (Fukaki et al., 1996, 1998). Two classes of sgr mutants were identified based on the organs affected. One class (sgr3, sgr5, sgr6) is defective only in the inflorescence stem, and the other (sgr1, sgr2, sgr4, sgr7) is defective in both the inflorescence stem and the hypocotyl (Fukaki et al., 1996; Yamauchi et al., 1997; Fukaki et al., 1998). Sgr1/scr lacks the amyloplast-containing shoot endodermis and is defective in stimulus perception (Fukaki et al. 1998). The failure of the amyloplasts to sediment in sgr3-1 is thought to be caused by a defect in a vacuolar transport protein (Yano et al., 2003). The SGR3 and SGR4 genes have been identified as encoding two SNARE proteins that mediate vesicle trafficking between the Golgi network and target organelles (Yamauchi et al., 1997). Arg1/rhg has normal starch content and hormone response but lacks normal root and hypocotyl gravitropism (Sedbrook et al. 1999). ARG1 was determined to be a peripheral membrane protein that is involved in the trafficking of the PIN auxin efflux carriers and therefore in early gravitropic signal transduction (Boonsirichai et al., 2003). Similarly, mar2 (modifier of arg1, arg2) had no defects in statoliths but showed impaired signaling caused by mutations in TOC complex proteins (Stanga et al., 2009).

The gravity persistent signal (gps) mutants

The Gravity Persistent Signal (GPS) treatment was designed to find mutants involved specifically in signal transduction. The GPS treatment (Wyatt et al., 2002) is a method that artificially separates gravity perception from the response, thus isolating signal transduction. In essence, plants are placed on their sides in the cold (4ºC) for one hour and then returned to vertical at room temperature (RT). During the GPS treatment, plants do not bend when reoriented in the cold. However, when returned to vertical at room temperature, plants bend according to the altered gravity stimulus they received in the cold. It seems the plants “remember” the gravity stimulus perceived in the cold, but the response is delayed until the return to RT. The cold treatment does not impede amyloplast sedimentation, but it does abolish auxin transport (Wyatt et al. 2002). Some aspect of signal transduction is blocked in the cold but resumes at RT.

By applying the GPS treatment, a new group of gravity mutants, the gravity persistent signal (gps) mutants, was identified (Wyatt et al., 2002). Originally, recessive mutations were identified at three loci. These mutants all show normal phenotypes at room temperature but an altered gravitropic response when subjected to the GPS treatment. The gps1 mutant does not bend, gps2 bends in the opposite direction compared with wild type, and gps3 overbends when returned to room temperature (Wyatt et al., 2002).

GPS1 encodes CYP705A22, a cytochrome P450 monooxygenase, and along with a closely related P450, CYP705A5, is involved in flavonoid synthesis in inflorescence stems and roots, respectively (Withers et al., 2013). In addition to the original three gps mutants, more mutants have been identified and studied. gps4 (Luesse et al., 2010) shows a no-response phenotype similar to that of gps1; it was found to be defective in Altered Response to Gravity-like 2 (ARL2). Other gps mutants are still under study. For example, gps6 presents a unique 3D phenotype in which its inflorescence stems bend perpendicular to the plane of the gravity vector.

These previous experiments shed light on gravitropism, with the traditional one mutant/one gene approach providing valuable information on the function of individual components. Yet more studies are needed. Gravitropic signal transduction is an orchestrated process involving multiple genes and other molecules. To fully understand this process, a more comprehensive method to find candidate genes was needed. Based on these considerations, the focus of this project was a genome-level, systems approach to gravitropic signal transduction.

Chapter 2 presents a transcriptome analysis that formed the foundation of the project. Chapter 3 presents a systems approach to identifying functionally related genes, and Chapter 4 culminates with the production of a gravitropic regulatory network.

CHAPTER 2: TRANSCRIPTOME ANALYSIS OF GRAVITROPIC SIGNAL TRANSDUCTION

Introduction

The existing theories and data about gravitropism have enriched our understanding of how statolith sedimentation and curvature occur. Of more interest now is how signal transduction occurs: how the biophysical changes induced by reorientation within the gravity field lead to the auxin redistribution that results in curvature.

The model of gravitropism (Figure 1.1) suggests that signal transduction occurs between stimulus perception and response. In reality, however, there is no hard line between these phases. Because this project focuses on signal transduction, perception and response must be separated from signal transduction. The Gravity Persistent Signal (GPS) treatment serves this purpose by introducing a cold stimulus, which allows perception of the gravistimulus but blocks the response. When the plants are placed on their side at 4ºC, gravity perception occurs with observable amyloplast sedimentation (Wyatt et al. 2002). According to the statolith hypothesis, once the statoliths sediment on the new bottom of the cell, signal transduction is initiated (Perrin et al. 2005). However, auxin transport, which is necessary for the gravitropic response, does not occur at 4ºC (Wyatt et al. 2002). Based on these observations, the GPS treatment separates perception from the auxin redistribution leading to the response and allows the isolation of the signal transduction phase.

The GPS experiments and other research have revealed many important genes involved in signal transduction. Most of these genes were identified using mutant screening where a mutant is identified based on an abnormal phenotype in response to reorientation with respect to the gravity vector. The mutant is then characterized, and the defective gene identified. However, these methods are not sufficient to elucidate the whole story of signal transduction. First, mutant screening is subject to bias because of its qualitative selection. Second, some mutants are pleiotropic and not specific to gravitropism, e. g. those involved in auxin transport or growth. Third, a mutation in some genes may not cause a readily visible phenotype because of compensation by alternative mechanisms, and those genes would be missed in a mutant screen.

Alternatively, transcriptome analysis can be used to identify additional genes involved in gravitropic signal transduction. One of the commonly used techniques for transcriptome analysis is the microarray experiment. Microarrays, which allow genome-level screening using short DNA oligonucleotides that represent individual genes bound to a fixed matrix, have been used for more than two decades. The approach can be traced back to Augenlicht & Kobrin (1982), who used polyadenylic acid-containing RNA to recognize a total of 378 clones differentially expressed in a mouse tumor/normal cell system. Scientists quickly realized that this probe/target hybridization method could provide unprecedented information for the evaluation of gene expression. Since then, such hybridization-based methods have been used to analyze transcription-level data in Arabidopsis and other plant species (Kimbrough et al. 2004; Kim et al. 2009; O’Rourke et al. 2009). Most microarray experiments are used to identify differentially expressed genes and variation in gene transcription.

Here, a gene expression microarray experiment in combination with the GPS treatment was performed because 1) data from previous experiments suggest that gravitropic signal transduction might be regulated at the transcriptome level, and 2) signaling pathways are generally complicated systems involving large numbers of genes.

A whole-genome transcriptome analysis would provide a snapshot of the hundreds of gene transcripts present during the process. The integration of the microarray experiment and the GPS treatment could provide unprecedented information on gravitropic signal transduction.

Methods

Plant preparation

Wild type Arabidopsis (Col-0) seeds were potted (five to ten seeds per pot) and distributed evenly within each pot. After potting, the seeds were cold stratified (4ºC) for two days and then placed in a growth chamber at 21ºC under long day conditions (16 h light/8 h dark). At 3 weeks, plants were thinned to 3-4 plants per pot. When the inflorescence stems reached 8-10 cm, the plants were ready to undergo the GPS treatment.

GPS treatment

For the GPS treatment, plants were reoriented 90º with respect to gravity (pots were placed on their side) at 4ºC for 1 h and then returned to vertical at RT. Although the GPS treatment was designed to specifically study gravitropism, two other stimuli are also at play: cold and mechanical stimuli. The cold stimulus is continuous during the cold phase of the GPS treatment, while the mechanical stimulus occurs transiently when the plants are rotated onto their sides. Both factors could change gene expression levels. To account for their effects, a control group was introduced to provide a baseline for gene expression levels. The control plants were maintained vertically in the cold, and the pots were moved back and forth briefly to simulate the movement of reorientation. Because gene expression could fluctuate greatly in response to cold, the plants were first cold acclimated at 4ºC for one hour prior to the GPS treatment. This mitigated the effects of the cold stimulus.

Microarray experiment design

Because it was not feasible to extract RNA at each minute, only critical representative time points were selected. First, the start of signal transduction was estimated from the presentation time, the time the stimulus needs to be present before the plant commits to respond to the altered gravity vector. For Arabidopsis inflorescence stems, the presentation time is approximately 3 min after reorientation (Leitz et al. 2009). Second, the response is visible by 15 min after reorientation, providing the upper limit for signal transduction. In total, four time points were selected: 2 min, 4 min, 10 min and 30 min in the cold (Figure 2.1). Three time points (2 min, 4 min and 10 min) represented the early phase. To provide a snapshot of the late phase, samples from 30 min after reorientation in the cold served as a reference.


Figure 2.1. The design of the microarray experiment for treatment group. Plants were first cold acclimated for 1h at 4ºC. Then plants were placed on their side in cold for 1h, then returned to vertical at RT. An observable curvature occurred in 15 minutes after return to RT. The striped arrow shows the direction of gravity stimulation under cold while the solid arrow shows the direction of gravity stimulation under room temperature.

RNA extraction

Four biological replicates for each time point were prepared for both treatment and control groups, 32 samples in total. One of the first challenges was the preservation of total RNA. Two methods are routinely used to preserve RNA: RNALater and freezing in liquid N2. The time lapse between sampling points is short (especially between the 2 min and 4 min time points), and it was questionable whether freezing could be completed in time. Therefore, the use of RNALater was explored because of its ease of use. RNALater can be sprayed directly on the desired organs, immediately fixing the RNA, and tissues can be collected later. It has been shown to be effective in higher plant seedlings (Kittang et al. 2010) and Arabidopsis roots (Kimbrough et al. 2004). However, RNALater had not been tested on inflorescence stems. Because of the close spacing of the time points, RNALater was thought to be a better option than freezing; however, because of the difference in cell/tissue structure between the inflorescence stem and root, a trial experiment was performed to test the effectiveness of RNALater on mature inflorescence stems.

RNALater was sprayed on the inflorescence stem, and horizontal sections of the stems were sampled after RNALater application. The effect of RNALater was measured with Trypan blue, a negative stain indicating the vital status of the cell. Trypan blue staining is proportional to the penetration efficacy of RNALater. Unfortunately, Trypan blue was not visible in the cells of the inflorescence stems until 30 seconds after incubation. In addition, because of the different thicknesses of the stem tissues, the blue color was unevenly distributed. Therefore, liquid nitrogen was chosen to ensure preservation of the RNA. The only challenge remaining was how to freeze numerous plants simultaneously and rapidly enough to be able to perform extraction at the time points selected.

Preliminary experiments indicated that each mature plant could yield approximately 80μg mRNA. Because 500μg of mRNA was required, six to seven mature plants (2~3 pots) were needed for RNA extraction for each time point.
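The plant count above follows from a simple division. A minimal sketch of that arithmetic is given below; the function name `plants_needed` is introduced here for illustration only and is not part of the original protocol.

```python
import math

def plants_needed(required_ug: float, yield_per_plant_ug: float) -> int:
    """Smallest whole number of plants whose combined yield meets the requirement."""
    return math.ceil(required_ug / yield_per_plant_ug)

# Values from the text: about 80 ug of RNA per mature plant, 500 ug required.
# 500 / 80 = 6.25, so seven plants are needed to be certain of reaching the target.
n_plants = plants_needed(500, 80)
print(n_plants)
```

Rounding up gives the upper end of the "six to seven" range quoted in the text; six plants would suffice only if the per-plant yield were slightly above the preliminary estimate.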

A wooden frame was designed to hold six 4-inch pots of Arabidopsis. This allowed the inflorescence stems from multiple pots to be submerged in liquid nitrogen so that the tissues froze simultaneously. After freezing, the top 4 cm of the inflorescence stems were excised, cauline leaves and flowers were removed, and the remaining stem tissues were stored at -80ºC.

RNA was extracted from the apical 4 cm of the inflorescence stem using a method combining Trizol© (Carlsbad, CA) and Qiagen RNeasy (Valencia, CA) columns. This combination was required because the amount of RNA exceeded the capacity of the Qiagen RNeasy column, and Trizol© alone provided RNA of lower quality than required. Trizol was therefore used initially to extract the RNA from the tissue, and the Qiagen RNeasy protocol was then used to clean the RNA to produce the high quality RNA needed for the microarray. The inflorescence stem sections were first thoroughly ground in liquid nitrogen. Ground tissue (around 50 mg) was mixed with 1.0 ml of Trizol (pre-heated to 65ºC) by vortexing. The mixture was incubated on ice for 10 min with frequent vortexing to avoid sedimentation. Then 200 μl of chloroform was added, and the mixture was shaken vigorously until it became cloudy. The mixture was then centrifuged at 13,000 rpm at 4ºC for 15 minutes, producing three layers. The top layer (300~400 μl) of clear aqueous phase contained the RNA and was transferred to a new tube. For each 200 μl of the top layer, 700 μl of Qiagen RLT buffer and 500 μl of 97% ethanol were added and mixed by vortexing. The mixture was then transferred to a Qiagen MinElute spin column and centrifuged at 10,000 rpm for 15 seconds. The flow-through was discarded, and 500 μl of Qiagen RPE buffer was added to the spin column, followed by centrifugation at 10,000 rpm for 15 seconds. The column was then washed twice with 750 μl of 80% ethanol and centrifuged at 13,000 rpm for 1 min. The residual ethanol was evaporated in open air. For each tube, the RNA was resuspended in 30 μl of RNase-free water. RNA quality was assessed using an Agilent Bioanalyzer© RNA chip. Samples with an RNA integrity number (RIN) above 8.5 were kept for further processing.

Agilent dual color 4X44k arrays for Arabidopsis (Catalogue No. G2519F-021169) were used, with treatment and control samples for the same time point labeled with Cy3 (yellow-green) and Cy5 (red). Because of possible dye biases, a dye swap design was applied. For the four replicates of each time point, two of the treatment samples were labeled with Cy3 and the paired controls with Cy5, while in the other two the dye labeling was reversed. This dye swap design compensated for the inherent dye biases of the dual color microarray. After labeling, each pair of treatment and control samples was hybridized onto an array. Each Agilent 4X44k slide contains four arrays. In total, sixteen arrays were hybridized on four slides, with the array position for the time points assigned randomly to eliminate any position effect. The four slides were scanned in an Agilent SureScan® Microarray scanner, which excites the fluorescent dyes on the slides. The raw optical intensities were captured by the feature collection software and stored as plain text files.
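The dye swap and random array placement described above can be laid out programmatically. The sketch below is illustrative only: the function name `dye_swap_layout`, the fixed random seed and the choice to shuffle all sixteen hybridizations freely across slides are assumptions made here, not details taken from the original experiment.

```python
import random

TIME_POINTS = [2, 4, 10, 30]   # minutes in the cold
N_REPLICATES = 4               # biological replicates per time point
ARRAYS_PER_SLIDE = 4           # one Agilent 4X44k slide holds four arrays

def dye_swap_layout(seed=0):
    """Return one record per hybridization with dye assignment and array position."""
    rng = random.Random(seed)  # seed is arbitrary; fixed only for reproducibility
    hybridizations = []
    for t in TIME_POINTS:
        for rep in range(1, N_REPLICATES + 1):
            # First half of the replicates: treatment Cy3 / control Cy5.
            # Second half: dyes reversed (the dye swap).
            if rep <= N_REPLICATES // 2:
                dyes = {"treatment": "Cy3", "control": "Cy5"}
            else:
                dyes = {"treatment": "Cy5", "control": "Cy3"}
            hybridizations.append({"time_min": t, "replicate": rep, **dyes})
    # Assumption: positions are randomized over all slides at once.
    rng.shuffle(hybridizations)
    layout = []
    for i, hyb in enumerate(hybridizations):
        layout.append({**hyb,
                       "slide": i // ARRAYS_PER_SLIDE + 1,
                       "array": i % ARRAYS_PER_SLIDE + 1})
    return layout

layout = dye_swap_layout()
for row in layout:
    print(row)
```

Each time point ends up with two arrays in each dye orientation, so a dye effect averages out when the replicates are combined.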

Microarray data analysis

The raw gene expression data were analyzed using a novel pipeline (Shen et al., 2012, Appendix A). First, the background noise was corrected with “normexp”, offset = 50. Then the data were normalized within arrays with “loess” and between arrays using “Aquantile”. Before pre-processing, the signal densities of the 16 arrays differed in their distributions (Figure 2.2). The normalization (within array and between arrays) brought the expression values to the same level and scale (Figure 2.3).
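To make the between-array step concrete, the following is a simplified NumPy sketch of what an “Aquantile”-style normalization does: the average log intensities (A) are quantile normalized across arrays while the log ratios (M) are left untouched. It is not the code of the original pipeline; the function names and the simulated data are assumptions for illustration, and ties are broken by position rather than averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Give every column of x (probes x arrays) the same empirical distribution."""
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)              # rank of each value within its column
    mean_sorted = np.sort(x, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]

def aquantile(log_red, log_green):
    """Quantile normalize A = (R + G) / 2 across arrays, keeping M = R - G unchanged."""
    m = log_red - log_green
    a = (log_red + log_green) / 2.0
    a_norm = quantile_normalize(a)
    return a_norm + m / 2.0, a_norm - m / 2.0

# Simulated log2 intensities for 1000 probes on 16 arrays with array-specific shifts.
rng = np.random.default_rng(0)
shift = rng.normal(0.0, 0.5, size=16)
log_red = rng.normal(8.0, 1.0, size=(1000, 16)) + shift
log_green = rng.normal(8.0, 1.0, size=(1000, 16)) + shift
norm_red, norm_green = aquantile(log_red, log_green)
print(np.median((norm_red + norm_green) / 2.0, axis=0))
```

After this step the per-array medians of A coincide, which is the behavior summarized by the box plots before and after normalization.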


Figure 2.2. The raw signal density (the mean value of signal density from both channels) distribution across the 16 arrays. Each box shows the interquartile range of the signal density for each array, from Q1 (25%) to Q3 (75%). The black line indicates the median value. The whiskers represent the values of Q3+1.5 IQR (interquartile range: Q3-Q1) and Q1-1.5 IQR, respectively.


Figure 2.3. The normalized signal density (the mean value of signal density of both channels, log2 scale) distribution across the 16 arrays. First, raw signals were normalized using loess to adjust for effects caused by the dual color channels. Then they were normalized across the 16 arrays based on the “Aquantile” method. Each box shows the interquartile range of the signal density for each array, from Q1 (25%) to Q3 (75%). The black line indicates the median value. The whiskers represent the values of Q3+1.5 IQR (interquartile range: Q3-Q1) and Q1-1.5 IQR, respectively.

Normalized data were further filtered to keep the probes with signals in the upper 75% of the signal scale in at least 12 of the 16 arrays. Finally, both parametric and nonparametric statistical analyses were applied to the filtered data and cross-validated. Genes with a significant log fold change greater than 0.5 (up-regulated) or less than -0.5 (down-regulated) as compared to control samples were selected. Significance was calculated using an empirical Bayesian model with Benjamini and Hochberg correction.
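The filtering and selection rules above can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: “upper 75%” is read here as “above the 25th percentile of each array”, the significance cutoff `alpha=0.05` is assumed, and the p-values are taken as given rather than computed with an empirical Bayes model. All function names are introduced for this example.

```python
import numpy as np

def filter_probes(intensity, min_arrays=12, lower_quantile=0.25):
    """Keep probes above each array's lower quantile in at least `min_arrays` arrays."""
    cutoffs = np.quantile(intensity, lower_quantile, axis=0)  # one cutoff per array
    return (intensity > cutoffs).sum(axis=1) >= min_arrays

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

def select_de(log_fc, pvalues, lfc_cutoff=0.5, alpha=0.05):
    """Boolean masks of up- and down-regulated genes after BH correction."""
    log_fc = np.asarray(log_fc, dtype=float)
    significant = bh_adjust(pvalues) < alpha
    return significant & (log_fc > lfc_cutoff), significant & (log_fc < -lfc_cutoff)

# Toy example with invented numbers.
up, down = select_de([1.2, -0.8, 0.3, -0.2], [0.01, 0.04, 0.03, 0.005])
print(up, down)
```

In the toy example all four adjusted p-values fall below 0.05, so the fold-change cutoff alone decides which genes are reported as up- or down-regulated.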

Gene annotation

Selected genes were annotated using the Affymetrix Arabidopsis ATH1 Genome Array annotation data (ath1121501.db) prepared in the Bioconductor package (Gentleman et al., 2004). These genes were annotated with GO and KEGG terms. The GO annotations provided information on molecular function, subcellular component and biological process. KEGG terms provided information on the biochemical pathways possibly involved. The annotations provided a way to fine-tune the gene list.

qRT-PCR experiment and selection of housekeeping genes

Genes selected for further study were first validated using qRT-PCR. SYBR green labeling kits were used in the qRT-PCR experiments. The master mixture in this kit contains SYBR green fluorescent dye, dNTP and Taq DNA polymerase. All time point samples were tested (2, 4, 10 and 30 min, control and treatment). Six replicates of each sample were analyzed, 48 samples in total for each gene (4*2*6). PP2A was used as the control (see discussion below). The Ct values of each sample were collected, and qBasePLUS software was used to analyze the data.

Although qRT-PCR provides robust results on RNA expression, it requires a stable ‘housekeeping gene’ to serve as a reference point. A housekeeping gene is one that is stably expressed independent of exogenous treatment and conditions (Schmittgen and Zakrajsek 2000). The most frequently used housekeeping genes in the study of Arabidopsis include glyceraldehyde-3-phosphate dehydrogenase (GAPDH), polyubiquitin (UBQ), actin genes and the 18s rRNA gene (Hannah et al., 2005; Kim et al., 2003). According to Czechowski et al. (2005), the commonly used housekeeping genes need to be scrutinized carefully when applied to qPCR experiments. They proposed five genes as novel housekeeping genes and compared their expression levels with those of five traditional housekeeping genes across the ATH1 platform microarray. Their results suggested candidate housekeeping genes for our experiments. Four housekeeping genes suggested by Czechowski et al. (2005) were tested: UBQ10 (AT4G05320), PP2A (AT1G10430), TIP4 (At2g25810) and UBC (At5g25760). The geNorm algorithm (Vandesompele et al., 2002) was used to normalize the Ct values of these housekeeping genes so that their stabilities were comparable. Each candidate gene was subjected to qPCR with ten replicates, and the Ct values of the four genes were denoted $A_{\mathrm{TIP4},1}, \ldots, A_{\mathrm{TIP4},10}$; $A_{\mathrm{UBC},1}, \ldots, A_{\mathrm{UBC},10}$; $A_{\mathrm{UBQ10},1}, \ldots, A_{\mathrm{UBQ10},10}$; and $A_{\mathrm{PP2A},1}, \ldots, A_{\mathrm{PP2A},10}$. The pairwise ratios were calculated as

$$S_{\mathrm{TIP4/UBC}} = \left\{ \log_2\frac{A_{\mathrm{TIP4},1}}{A_{\mathrm{UBC},1}}, \ldots, \log_2\frac{A_{\mathrm{TIP4},10}}{A_{\mathrm{UBC},10}} \right\}$$

and the standard deviation of the pairwise ratios was calculated as

$$\sigma_{\mathrm{TIP4/UBC}} = \mathrm{st.dev}\left(S_{\mathrm{TIP4/UBC}}\right).$$

The M value of a gene is the arithmetic average of its standard deviations against the other three genes:

$$M_{\mathrm{TIP4}} = \frac{\sigma_{\mathrm{TIP4/UBC}} + \sigma_{\mathrm{TIP4/PP2A}} + \sigma_{\mathrm{TIP4/UBQ10}}}{3}.$$

Based on the results, PP2A was the most stable housekeeping gene, with the lowest M value (Figure 2.4), and was selected for use.
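The M value calculation described above can be written out in a few lines. The sketch below follows the formulas in the text; the function name `genorm_m`, the use of the sample standard deviation (`ddof=1`) and the toy measurements are assumptions made here and are not data from the experiment.

```python
import numpy as np

def genorm_m(values):
    """Compute an M value for each gene.

    values: dict mapping gene name -> per-replicate measurements (equal lengths).
    For gene j, M_j is the mean, over every other gene k, of the standard
    deviation of log2(values[j] / values[k]) across replicates.
    """
    genes = list(values)
    m = {}
    for j in genes:
        sds = []
        for k in genes:
            if k == j:
                continue
            ratios = np.log2(np.asarray(values[j], dtype=float) /
                             np.asarray(values[k], dtype=float))
            sds.append(np.std(ratios, ddof=1))  # assumption: sample standard deviation
        m[j] = float(np.mean(sds))
    return m

# Invented toy measurements for three hypothetical genes.
toy = {"A": [1, 2, 4], "B": [2, 4, 8], "C": [1, 1, 1]}
m_values = genorm_m(toy)
print(m_values)
print("most stable:", min(m_values, key=m_values.get))
```

With four candidate genes, each M value averages three pairwise standard deviations, and the gene with the lowest M is taken as the most stable reference.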


Figure 2.4. The geNorm M values of the four housekeeping genes. The x axis shows the four candidate housekeeping genes and the y axis shows the normalized M value. PP2A obtained the lowest M value and was therefore the most stable housekeeping gene.

cDNA preparation and primer design

The Invitrogen SuperScript® One-step RT-PCR (Carlsbad, CA) system was used for the reverse transcription reactions with 0.2 μM of the oligo-dT primer and 200 μM of dNTPs. Because the total RNA contains mRNA, tRNA, rRNA and small RNAs, oligo-dT primers were used to selectively reverse transcribe mRNA. The thermocycler was set for 1 cycle of 55°C for 30 minutes and 94ºC for 2 minutes. Complementary DNAs (cDNAs) were synthesized in this step.

To avoid any interference from residual gDNA, qPCR primers were designed to bridge exon-exon junctions so that they would not amplify gDNA, because the introns in the gDNA prevent the primers from binding.


cDNA sequences were retrieved from the NCBI database and saved in the GenBank format (.gb), and primers were designed in Geneious®. Primer3 (Rozen and Skaletsky 2000) was used to design primers within the following restrictions: Tm within 50ºC~60ºC, primer length within 19~25 bp, GC percentage within 40%~60% (Figure 2.5; Table 2.1).
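The length and GC restrictions can be checked with a few lines of code. The sketch below is a hypothetical helper, not part of Primer3 or Geneious, and it deliberately omits the Tm restriction because melting temperature depends on the thermodynamic model and reaction conditions used by the design software. The two sequences are the ATAIB primers listed in Table 2.1.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def passes_basic_checks(seq, min_len=19, max_len=25, min_gc=40.0, max_gc=60.0):
    """Check primer length (bp) and GC percentage against the stated limits."""
    return min_len <= len(seq) <= max_len and min_gc <= gc_percent(seq) <= max_gc

ATAIB_FORWARD = "TTGAACCATGTGGAAGCTGAGAG"
ATAIB_REVERSE = "TGTGAATCCAAAGGCGAGATTAC"

for name, primer in [("forward", ATAIB_FORWARD), ("reverse", ATAIB_REVERSE)]:
    print(name, len(primer), round(gc_percent(primer), 1), passes_basic_checks(primer))
```

Both primers are 23 bp long with GC content between 40% and 60%, so they satisfy the two restrictions checked here.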

[Figure 2.5 panels: ATAIB, BT2, WRKY18, WRKY26, WRKY33]

Figure 2.5. Sequence maps of the cDNA of the five genes selected (ATAIB, BT2, WRKY18, WRKY33 and WRKY26), including the locations of the qPCR primers.

Table 2.1. Primer sequences designed for qPCR analysis of the five genes selected.

Gene Name   Forward Primer             Reverse Primer
WRKY18      TCGGACACAAGCTTGACAGTTAA    TTGACCCACCCTGGCTTGTA  A
BT2         CGATGACGCCGAATCGAGGAAG     CCGTATGCAAGAGGAGGAATAA  CG
ATAIB       TTGAACCATGTGGAAGCTGAGAG    TGTGAATCCAAAGGCGAGATTAC
WRKY33      AGCAAAGAGATGGAAAGGGGAC     TGTGATTACTGCTCTCATGTCGT  AA GT
WRKY26      GGCCAAGAGATGGAAAAGAGAA     TAAAACTGGACCTCTTCTTGGGG  G

Optimization of primer concentrations

Four concentrations of primers were tested: 10 μM, 50 μM, 100 μM and 500 μM. The Ct value at each primer concentration was checked, and the lowest concentration that gave a Ct value below 33 was chosen. Ct is the threshold cycle number, the cycle at which the signal becomes strong enough to be reported. The optimal concentration of primers for WRKY18, WRKY26, BT2 and WRKY33 was 50 μM, while 500 μM was needed for ATAIB.
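The selection rule for primer concentration amounts to taking the smallest tested concentration whose Ct falls below the cutoff. A minimal sketch follows; the function name and the Ct values in the example are invented for illustration and are not measurements from this study.

```python
def lowest_passing_concentration(ct_by_conc, ct_cutoff=33.0):
    """Return the lowest concentration whose Ct is below the cutoff, or None."""
    passing = [conc for conc, ct in ct_by_conc.items() if ct < ct_cutoff]
    return min(passing) if passing else None

# Hypothetical Ct values keyed by primer concentration (same units as in the text).
example_ct = {10: 36.2, 50: 31.5, 100: 29.8, 500: 27.0}
print(lowest_passing_concentration(example_ct))
```

Returning `None` when no concentration passes makes it explicit that a primer pair would need to be redesigned or tested at other concentrations.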

Results

A dual color, gene expression microarray was performed across a time course of the GPS response to identify genes involved in the early events of gravitropic signal transduction.

Briefly, plants were placed at 4ºC and reoriented with respect to gravity for 1 h, then returned to vertical at RT (Figure 2.1). RNA was extracted at four time points during the cold reorientation and analyzed on an Agilent Arabidopsis expression array.

Microarray data analysis

In total, 32 samples (two treatments (control and reoriented), four time points, 4 replicates each) were collected.

Analysis of the expression data returned a list of 318 genes differentially expressed between plants reoriented with respect to gravity and the vertical controls (Table 2.2; Appendix 2, Tables 1-8). Fifteen genes were differentially expressed at more than one time point (Figure 2.6).


Table 2.2. Differentially expressed genes at the four time points.

Time Point (min)   Up regulated   Down regulated
2                  3              5
4                  66             25
10                 35             142
30                 26             32

Figure 2.6. The overlap of differentially expressed genes between the four time points. The number in each oval represents the number of genes differentially expressed at a specific time point. The numbers within overlapping regions indicate the number of genes that were differentially expressed at each of the overlapping time points.

Enrichment analysis of the annotations

Because of the high false positive rate associated with a microarray experiment, further filtering and analysis were needed. Filtering was performed based on the functional annotations of the identified genes. The functional terms associated with Arabidopsis genes were categorized as gene ontology (GO) terms. Initially, a gene set enrichment analysis was performed on the differentially expressed genes identified. The enrichment analysis focused on GO terms (Figure 2.7, Figure 2.8). The GO terms were selected based on the following criteria:

1. GO terms with fewer than 1000 total gene members were selected. GO terms with more than 1000 genes, e.g. metabolic process (GO:0008152), were too broad to be meaningful.

2. GO terms denoting at least 10 genes were selected. The count size of a GO term was measured by the number of genes in the differentially expressed gene list that were annotated with the term. Genes annotated with categories with fewer than 10 genes were grouped as ‘other’.

3. GO terms showing statistical significance (p-value<0.05) were selected. Although the counts of the GO terms provided a straightforward way of visualizing the enrichment of the annotations, they did not reflect any statistical significance. In fact, the abundance of a feature does not imply statistical significance for the feature because of the size differences among GO term categories. The statistical analysis was based on the Fisher test, as implemented in the GOstats package (Falcon & Gentleman 2007).
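The three criteria above can be combined in a short sketch. The enrichment p-value is computed here as a hypergeometric upper tail, which is equivalent to a one-sided Fisher exact test; this is a stand-alone illustration, not the GOstats implementation, and the function names and the gene universe size used in the example are assumptions.

```python
from math import comb

def hypergeom_enrichment_p(universe, term_size, selected, hits):
    """P(X >= hits) when `selected` genes are drawn from `universe` genes,
    of which `term_size` are annotated with the GO term."""
    upper = min(term_size, selected)
    total = comb(universe, selected)
    tail = sum(comb(term_size, k) * comb(universe - term_size, selected - k)
               for k in range(hits, upper + 1))
    return tail / total

def keep_term(universe, term_size, selected, hits,
              max_term_size=1000, min_hits=10, alpha=0.05):
    """Apply the three criteria: term not too broad, enough hits, significant."""
    if term_size >= max_term_size or hits < min_hits:
        return False
    return hypergeom_enrichment_p(universe, term_size, selected, hits) < alpha

# Small worked example: 20 genes in total, 5 carry the term, 4 selected, 2 hits.
print(hypergeom_enrichment_p(20, 5, 4, 2))

# Hypothetical term of 300 genes in an assumed universe of 20000 genes,
# with 20 of the 318 differentially expressed genes annotated to it.
print(keep_term(20000, 300, 318, 20))
```

Because the test accounts for the size of each GO category, a large category needs many more hits than a small one to reach the same significance, which is the point made in criterion 3.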

[Figure 2.7 pie chart. Categories shown: response to stimulus; response to stress; response to abiotic stimulus; lipid metabolic process; oxidation-reduction process; response to temperature stimulus; response to salt stress; cellular component morphogenesis; post-embryonic organ development; response to cold; external encapsulating structure organization; developmental cell growth; cell morphogenesis involved in differentiation; response to water stimulus; response to auxin stimulus]

Figure 2.7. The distribution of Biological Process terms among the differentially expressed genes. The Biological Process terms were selected from all the differentially expressed genes with a p-value below 0.05, and the 15 most abundant GO terms are presented.

[Figure 2.8 pie chart. Categories shown include: hydrolase activity; oxidoreductase activity; protein dimerization activity; carboxylic ester hydrolase activity; transferase activity; oxygen binding; antioxidant activity]

Figure 2.8. The distribution of Molecular Function terms among the differentially expressed genes. The Molecular Function terms were selected from all the differentially expressed genes with a p-value below 0.05, and the 9 most abundant GO terms are presented.

The majority of differentially expressed genes identified in the microarray experiment were categorized as ‘stress response’, especially ‘abiotic stress response’, which was not unexpected (Figure 2.7 and Figure 2.8). The Biological Process and Molecular Function annotations provided a clue as to the function of the genes.

In addition to the Molecular Function and Biological Process GO term analyses, genes were categorized based on their Cellular Component GO terms. Because previous research implicated plastids and vacuoles in the events of signal transduction, genes were selected based on the subcellular location, and associated Biological Process GO terms were analyzed (Falcon and Gentleman, 2007).

Table 2.3. Significant GO terms (p-value < 0.005) associated with gene products targeted to the vacuoles and plastids.

CC        P-value   GO term
plastid   0.001     cellular lipid metabolic process
plastid   0.001     pollen development
plastid   0.001     lipid metabolic process
plastid   0.001     response to abiotic stimulus
plastid   0.001     monocarboxylic acid metabolic process
plastid   0.002     response to water deprivation
plastid   0.002     response to water
plastid   0.002     catabolic process
plastid   0.002     cellular catabolic process
plastid   0.002     response to stress
vacuole   0.001     nucleoside metabolic process
vacuole   0.001     chalcone metabolic process
vacuole   0.001     chalcone biosynthetic process
vacuole   0.001     ketone biosynthetic process
vacuole   0.002     regulation of defense response to virus by host
vacuole   0.003     regulation of immune effector process
vacuole   0.003     regulation of defense response to virus
vacuole   0.004     pigment biosynthetic process

Genes selected for future studies

The GO term annotations, specifically the Biological Process and Molecular Function terms, provided more information on gene function. Thus the differentially expressed genes could be further filtered based on these functional annotations. An initial set of genes was selected for further study using the following criteria:

1). Genes were filtered based on potential for involvement in gravitropic signaling. As shown in Figure 2.7, Figure 2.8 and Table 2.3, many genes fell into annotation terms, such as storage, flowering and development, that were not likely to be directly involved in gravitropism.

2). Because transcription factors tend to be switches for gene regulation for a variety of response stimuli, of particular interest were genes annotated with Molecular Functions as transcription factors.

3). Genes at the early time points (4 and 10 minutes) were given preference.

Transcription factors

Transcription factors are interesting because they are key elements in signal transduction and regulation and have large-scale downstream effects on other genes. For this reason, only the transcriptome profiles of the transcription factors were studied in this experiment. Thirty-three transcription factors were identified in the set of differentially expressed genes in the microarray data (Table 2.4).

Table 2.4. The transcription factors discovered in the differentially expressed genes and their transcription factor families.

TAIR ID     Family        TAIR ID     Family             TAIR ID     Family
AT5G37250   RING/U-box    AT3G10595   homeodomain-like   AT2G17180   DUO1
AT4G16430   bHLH          AT3G56970   bHLH               AT3G12145   MADS-box
AT2G31210   bHLH          AT2G36270   bZIP               AT5G10140   MADS-box
AT2G46510   bHLH          AT2G16910   bHLH               AT3G02940   MYB-like
AT1G25550   Myb-like      AT5G65330   AGAMOUS-like       AT5G40360   MYB-like
AT3G57600   DREB          AT3G54340   MADS-box           AT4G37780   MYB-like
AT1G28050   B-box         AT1G69120   MADS-box           AT1G71030   MYB-like
AT1G17310   MADS-box      AT1G13600   bZIP               AT5G20240   MADS-box
AT1G10610   bHLH          AT4G34530   bHLH               AT5G15800   MADS-box
AT2G38470   WRKY          AT3G60460   Myb-like           AT5G53200   Homeodomain-like
AT3G48360   BTB           AT4G31800   WRKY               AT5G07100   WRKY

These transcription factors were evaluated based on their interactions in the Arabidopsis interactome via the STRING database (Franceschini et al. 2013). Based on the expression levels of these transcription factors (Figure 2.9), especially the profiles at the 2 and 4 min time points, five genes were selected for further study: BT2, ATAIB, WRKY26, WRKY33 and WRKY18 (Table 2.5).


Figure 2.9. A scatter plot of the expression levels at 4min and 10min of the 33 transcription factors identified in the microarray analysis. The five transcription factors selected (WRKY18, WRKY26, WRKY33, BT2 and ATAIB) are indicated with blue dots.

Table 2.5. The expression levels of the five transcription factors across the four time points (LogFC = log fold change).

                       2 min             4 min             10 min            30 min
Gene     Locus ID      LogFC   P-value   LogFC   P-value   LogFC   P-value   LogFC   P-value
WRKY18   AT5G07100      1.35   0.11      -1.76   0         -0.53   0.31      -0.94   0.09
WRKY26   AT2G46510      0.5    0.16      -1.03   0.04      -0.16   0.67      -0.24   0.32
WRKY33   AT2G38470     -0.34   0.18      -1      0.02      -0.01   0.99       0.13   0.76
ATAIB    AT3G48360     -0.41   0.74      -1.41   0.1       -1.59   0.01      -1.26   0.08
BT2      AT4G31800     -0.11   0.86      -1.08   0.05      -1.83   0.02      -1.24   0.03

Among these five, BT2 and ATAIB were chosen based on their expression values as the most differentially expressed genes at the 4 and 10 minute time points, respectively. The three WRKY genes were selected because WRKY family genes function in plant stress response, cold response and auxin signaling (Kim et al. 2009; Eulgem & Somssich 2007; van Verk et al. 2011; Wang et al. 2006). To better understand the effect of these five transcription factors on other genes, the interactions between these genes and others were retrieved based on multiple lines of evidence (co-expression, physical concurrence, text mining, affinity experiment and homology) (Figure 2.10). The GO terms most enriched among the interconnected genes were selected and categorized (Table 2.6); the p-values were calculated with a Fisher test and adjusted for multiple testing using the Benjamini and Hochberg method.

Table 2.6. The most enriched GO terms in the sub-networks of the five transcription factors.

GO term                             P-value       Adjusted p-value
transcription factor activity       2.8449E-12    7.6802E-11
transcription regulator activity    2.9674E-12    7.6802E-11
response to stress                  7.4228E-11    1.0763E-9
oxygen binding                      4.0727E-10    4.7244E-9
DNA binding                         1.6184E-9     1.5645E-8
response to biotic stimulus         6.3887E-9     4.6318E-8
response to endogenous stimulus     3.6613E-6     2.1235E-5
nucleic acid binding                1.2150E-5     6.4064E-5
response to abiotic stimulus        3.4941E-4     1.5589E-3
secondary metabolic process         3.7100E-3     1.3449E-2
signal transduction                 1.3534E-2     4.6174E-2
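The enrichment test and the multiple-testing adjustment reported in Table 2.6 can be illustrated with a minimal sketch. It assumes a one-sided Fisher exact test (the hypergeometric upper tail) and the standard Benjamini and Hochberg step-up adjustment; the text does not state whether the test was one- or two-sided, and the function names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value, P(X >= k), for GO term enrichment.

    k: genes in the sub-network annotated with the GO term
    n: genes in the sub-network
    K: genes in the background annotated with the GO term
    N: genes in the background
    (One-sidedness is an assumption of this sketch.)
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Tied adjusted values, such as the first two rows of Table 2.6, arise naturally from the monotonicity step.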

Figure 2.10. The interactions of the five TFs (ATAIB, BT2, WRKY18, WRKY26 and WRKY33) in the Arabidopsis genome. Each of the red spheres represents a gene of interest, and the lines between different genes represent interactions inferred from various information: co-expression (dark green), concurrence (blue), text mining (light green), affinity experiment (pink) and homology (light blue).

qRT-PCR results

Although microarrays provide large-scale screening, they also generate a high rate of false positive signals. qRT-PCR is a quantitative method to independently validate the expression values of the genes. It is performed on a small scale compared with a microarray, but provides more precise and accurate results. qRT-PCR reports the relative/absolute abundance of a gene based on the fluorescence signal of a specific dye that only recognizes and binds to double-stranded DNA. The expression level of a gene is proportional to the fluorescence signal and negatively related to the number of PCR cycles (Ct).

Our analyses reduced the gene list to five potential genes of interest, and the expression profiles of these five genes needed to be validated using qRT-PCR prior to further experimentation. The qRT-PCR results were normalized against the expression value of the 2 min control group (Figure 2.11).
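The inverse relationship between Ct and expression, and the normalization against the 2 min control, can be illustrated with a short sketch. It assumes the commonly used 2^(-ΔΔCt) calculation with a reference (housekeeping) gene and perfect amplification efficiency; the text does not state which calculation or reference gene was used, so the function and its arguments are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Relative expression by the 2^(-ddCt) calculation (assumed here).

    ct_target, ct_reference: Ct of the gene of interest and of a reference
        gene in the sample being measured.
    ct_target_cal, ct_reference_cal: the same two Ct values in the calibrator
        sample (here, the 2 min control group).
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_cal - ct_reference_cal
    # A lower Ct means more template, hence the negative exponent.
    return 2.0 ** -(delta_sample - delta_calibrator)
```

By construction the calibrator itself has a relative expression of 1, and a sample that crosses the threshold one cycle earlier than the calibrator has a relative expression of 2.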

[Figure 2.11 panels: ATAIB, WRKY26, WRKY33, WRKY18 and BT2. Each panel plots Relative Expression Level against Time point (min) at 2, 4, 10 and 30 min for the Control and Gravity groups.]

Figure 2.11. The qRT-PCR validation of ATAIB, WRKY18, WRKY33, WRKY26 and BT2. Red = the treatment group; blue = the control group. All values were normalized against the expression value of the control group at 2 min. * indicates values with a significant difference between the control and treatment groups based on Student's two-sample t-test (p-value < 0.05).

The results from the qRT-PCR experiments confirmed the expression values indicated in the microarray experiment for the most part. Both methods indicate that the genes were differentially expressed during the GPS treatment, especially during the first 10 minutes, with BT2 as the only exception. WRKY33 and BT2 not only show differential expression at 4 min but also show the most dramatic differential expression at 30 min after reorientation.

Discussion

A microarray was performed to specifically study the gravity signal transduction phase of the gravitropic pathway. Initially, our analyses uncovered five transcription factors, ATAIB, BT2, WRKY18, WRKY26 and WRKY33, specifically involved in gravitropic signaling (Table 2.5). qRT-PCR confirmed the expression of three of these, ATAIB, WRKY18 and WRKY26, in the earliest time points (Fig. 2.11). BT2 and WRKY33 were also differentially expressed at 4 min, but both also showed strong differential expression at 30 min after reorientation. Information on the gene structures, protein sequences, gene families and their possible roles in gravity response and signal transduction was specifically examined.

ATAIB encodes a bHLH (basic helix-loop-helix) domain protein targeted to the nucleus. ATAIB has been previously shown to regulate ABA responses, and ATAIB mutants show reduced sensitivity to ABA signaling (Li et al., 2007). Beyond that, other functions of ATAIB have not yet been studied. However, the bHLH family is one of the largest transcription factor families, including at least 147 proteins in 21 subfamilies. ATAIB belongs to the 8th subfamily, which also includes AT4G00870, AT4G16430, AT1G01260, AT5G46760, AT4G17880 (AtMYC4), AT1G32640 (AtMYC2) and AT5G46830. AtMYC2 has been implicated in drought and salt stress with ABA signaling pathways (Abe et al., 2003).

The WRKY family is another transcription factor family, with up to 100 genes in the Arabidopsis genome (Eulgem et al., 2000). All contain a conserved domain with WRKYGQK at the N-terminus. A WRKY domain recognizes a sequence-specific motif and binds a target gene. WRKY genes function in biotic stress, abiotic stress, seed development, seed dormancy and germination, senescence and development (Ulker and Somssich, 2004; Zhang and Wang, 2005; Rushton et al., 2010). Besides the traditional transcriptional regulation models in which WRKYs regulate downstream gene expression, WRKY proteins can also repress and de-repress transcription. The WRKY signaling network includes MAPK, MAPKK, and calmodulin (Rushton et al., 2010). Among the three WRKY genes in our list, WRKY18 has been implicated in ABA response and abiotic stress (Chen et al., 2010) and in pathogen defense (Pandey et al., 2010). The important role of WRKY33 in pathogen infections was further validated by Birkenbihl et al. (2012) and Zheng et al. (2006), where over-expression of WRKY33 increased plant defense against fungal infections. WRKY33 has also been studied in the regulation of a response to drought (Wang et al., 2013). WRKY33 and WRKY26 have also been found to be involved in coordinating the induction of plant thermotolerance (Li et al., 2011).

BT2 encodes a protein with a BTB domain (near the N-terminus of a zinc finger transcription factor). The BTB domain has been shown to mediate protein oligomerization, which inhibits high-affinity DNA binding. In 2003, Ahmad et al. successfully predicted the crystallized structure of the BTB domain and showed that BTB domains can form both homodimers and heterodimers with other proteins. Recently, BT2 has been found to be involved in telomerase regulation (Ren et al., 2007) and in multiple plant responses to exogenous stimulations and stresses (Mandadi et al., 2009). BT2 expression is regulated by the circadian clock and is suggested to be linked with photosynthesis. Mandadi et al. (2009) also showed that bt2 mutants displayed a hypersensitive response to ABA. In sum, BT2 is believed to be a key player in multiple signal transduction pathways, and we now provide evidence for a role in gravitropic signal transduction.

Based on existing information, these five genes are all involved in abiotic/biotic stress responses and signaling pathways. However, these five genes have not been studied in relation to gravity stimulation or gravitropic responses. We wish to reveal the functions of these genes in gravitropism, which requires phenotypic analysis. In addition, eight genes were expressed uniquely at 2 min after reorientation (Table 2.2). For these eight genes, it will be interesting to determine how they perform in the gravity signal transduction process.

CHAPTER 3: MINING FUNCTIONALLY RELATED GENES USING SEMI-SUPERVISED LEARNING

Chapter 3 within the thesis document serves as a prepublication manuscript. This manuscript has been formatted to meet the guidelines set forth by Thesis and Dissertation Services at Ohio University.

This chapter represents a prepublication manuscript submitted to Bioinformatics (Oct. 2013), which has been adapted slightly to conform to Ohio University’s thesis format. Authors are Kaiyu Shen (Environmental and Plant Biology, Ohio University, Athens, OH), Razvan Bunescu (Electrical Engineering and Computer Science, Ohio University, Athens, OH) and Sarah Wyatt (Environmental and Plant Biology, Ohio University, Athens, OH).

Introduction

A common problem for biologists is how to find genes/proteins related to a specific process. This question has different meanings depending on the interpretation of “related”. Genes could be related in terms of sharing gene ontology (GO) terms, being involved in the same pathway, or they could directly interact. Many in silico approaches have been proposed to find functionally related components based on the principle of “guilt-by-association”. Functional relationships are measured by the similarity of biological features, and such biological features include but are not limited to sequence alignment, secondary and tertiary structure comparison (Pawson and Nash, 2003; Martin et al., 2005; Lewis et al., 2006; Shen et al., 2007; Suzek et al., 2007; Pitre et al., 2008), gene co-expression profiles (De Bodt et al., 2009; Wren, 2009), phylogenetic information for non-model organisms (De Bodt et al., 2009; Muller et al., 2010; Gaudet et al., 2011) and topological structure of the protein-protein interactome (PPI) (Przulj et al., 2004). These features are utilized not only individually, but also integrated as a collection of heterogeneous data for predictions (Jansen et al., 2003; Costello et al., 2009; Fortney et al., 2010). In essence, the “guilt-by-association” strategy incrementally expands the functional group by adding genes similar to the known genes. The underlying biological problem is to identify positive genes from a set of unlabeled ones, given a set of known positive genes. Such a combination of positive labeled and unlabeled genes leads to a semi-supervised machine learning problem where only positive samples are known.

Methods for learning from positive and unlabeled data are usually based on various versions of standard learning algorithms, such as Support Vector Machines (Liu et al., 2003; Elkan and Noto, 2008), Naive Bayes (Denis et al., 2002; Calvo et al., 2007), logistic regression (Lee and Liu, 2003), label propagation (Kashima et al., 2009) or more sophisticated models such as a dynamic empirical Bayesian model (Djebbari and Quackenbush, 2008). One of the programs, GeneMANIA (Mostafavi et al., 2008), which applies multiple kernel learning with linear regression, performed well in most of the evaluations. More recently, in a similar critical assessment, up to 54 methods were tested on their predictions of unlabeled genes (Radivojac et al., 2013), and it again showed the robustness of machine learning methods. Such programs produce useful results in assigning GO terms to unknown genes. However, in many cases, biologists wish to find functionally related genes, rather than assign specific GO terms to the unknown.

Biological systems are scale-free networks (Kholodenko et al., 2012; Roy, 2012), which render GO terms inadequate for inferring complete functional relationships. For example, gravitropism, a fundamental biological process, has been studied since Darwin, and aspects of it still remain a mystery (Wyatt and Kiss, 2013). Only three GO terms are annotated as related to gravitropism, and only a third of the experimentally identified “gravity genes” were annotated with these three GO terms. Over 77% of the GO terms that were assigned to the “gravity genes” were shared by fewer than 10% of those genes. This example calls into question the accuracy of using GO terms to define “functionally related”.

We therefore decided to use learning algorithms to determine whether an unknown gene is really “related” or not. Among all the existing learning classifiers, the SVM algorithm is a popular choice, especially for gene regulatory networks and function prediction (Chen et al., 2009; Cerulo et al., 2010; Bhardwaj et al., 2010; Wang et al., 2011; Wass et al., 2012; Mordelet and Vert, 2013). The application of SVM algorithms to heterogeneous networks has also been explored for almost a decade. Lanckriet et al. (2004) first introduced a semidefinite programming approach along with the SVM (SDP/SVM) to find optimal weights for a set of kernels. Tsuda et al. (2005) suggested an algorithm with linear complexity and further indicated a condition under which kernels with optimized weights do not always outperform those with fixed weights. Zhao et al. (2008) applied a recursive way of expanding negative sets in combination with one-class and two-class SVMs to predict the unlabeled genes. Based on these previous experiments, we performed a thorough exploration of the existing methods and implemented two algorithms based on the standard SVM: the biased SVM formulation (Liu et al., 2003) and the weighted samples formulation from Elkan and Noto (2008). Moreover, we generated a rich collection of heterogeneous data and utilized different integration techniques with the SVM to reach the best performance. Our approach was evaluated using an Arabidopsis thaliana (hereafter Arabidopsis) benchmark dataset, and the results led to a ranking of genes that provided over 75% accuracy for the top 5% of unknown genes. Most importantly, our method provides a strategy that could be utilized for different organisms and any set of seed genes.

Methods

Information sources and feature engineering

To most efficiently apply the classifier for prediction, we obtained a comprehensive collection of heterogeneous annotation data.

Protein Protein Interactions

Given an arbitrary gene p and a seed gene s, the PPI feature φPPI(p, s) is defined to have the maximum value of 1.0 for a direct connection, with a penalty of 0.1 for each additional edge on the path. Given an input gene p, the corresponding vector of PPI features consists of all PPI features between this protein and all seed proteins in S, i.e., ΦPPI(p) = {φPPI(p, s)|s ∈ S}. The same feature computed on PPIs from the ortholog organisms is denoted ΦOL(p).
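The path-based PPI feature can be sketched as follows. The sketch assumes that the path in question is the shortest path in an undirected interaction graph, that the penalty is subtractive with a floor at zero, and that unreachable genes score 0.0; these details, like the function names, are assumptions rather than the implementation used here.

```python
from collections import deque

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def ppi_feature(graph, p, s, penalty=0.1):
    """1.0 for a direct interaction, minus `penalty` per additional edge."""
    d = shortest_path_length(graph, p, s)
    if d is None:
        return 0.0  # assumption: no path means no evidence
    return max(0.0, 1.0 - penalty * max(d - 1, 0))

def ppi_vector(graph, p, seeds):
    """Vector of PPI features between gene p and every seed gene."""
    return [ppi_feature(graph, p, s) for s in seeds]
```

Here `graph` is a hypothetical dictionary that maps each gene to the set of its interaction partners.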

Literature data mining

The literature information was used to generate three distinct types of features.

A. Information Retrieval: For each gene p in the evaluation dataset D, all aliases and systematic names were submitted to the Textpresso search engine (Muller et al., 2004) to extract sentences in PubMed that contained the gene, generating a text document T(p). For every word w in the vocabulary V that appears in this document, the standard tf/idf feature weight φIR(p, w) is calculated as tf(w, p) ∗ idf(w, p) (Baeza-Yates and Ribeiro-Neto, 1999). The term frequency factor tf(w, p) represents the number of times word w appears in the extracted text T(p), whereas the inverse document frequency idf(w, p) is computed based on the number of different genes in the evaluation dataset D for which the extracted text contains the word w, as shown in Equation 1 below.
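The tf/idf weighting described above can be sketched as follows. The sketch assumes the common logarithmic form idf(w) = log(|D| / n(w)), where n(w) is the number of genes whose extracted text contains w; the exact form of Equation 1 may differ. The input format and function name are hypothetical.

```python
import math
from collections import Counter

def tfidf_features(texts):
    """Compute tf*idf weights per gene.

    texts: dict mapping each gene p to the list of word tokens in T(p).
    Returns a dict mapping each gene to {word: weight}.
    """
    n_genes = len(texts)
    # Number of genes whose extracted text contains each word.
    gene_freq = Counter()
    for tokens in texts.values():
        gene_freq.update(set(tokens))
    features = {}
    for gene, tokens in texts.items():
        tf = Counter(tokens)
        features[gene] = {
            w: count * math.log(n_genes / gene_freq[w])
            for w, count in tf.items()
        }
    return features
```

A word that appears in the text of every gene receives a weight of zero under this form, so only words that discriminate between genes contribute to the feature vector.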

Given an input gene p, the corresponding vector of IR features consists of the IR features computed for all words in the vocabulary V, i.e., ΦIR(p) = {φIR(p, w)|w ∈ V}. Because the high dimensionality of the IR feature vector could negatively affect the reliability of the learning model, we used the classifier stacking technique as follows: the seed genes were used as positive examples, while a large set of negative examples was sampled at random from the Arabidopsis genome. An SVM classifier was then trained on this noisy dataset. For a gene p, the margin value output by the SVM classifier was used to create a new singleton IR feature φIR(p).

B. Shared Articles: For every gene p in the dataset, we also created a set PID(p) containing the PubMed identifiers of all articles found by Textpresso to contain the gene name or any of its synonyms. Given a gene p, the individual PID feature values for all seed genes are aggregated into a vector of PID features ΦPID(p) = {φPID(p, s)|s ∈ S}.


C. Automatic Relation Extraction: For a gene p and any given seed s, we further identified the PubMed sentences that mentioned both p and s by running a relation extraction system (Bunescu and Mooney, 2005) on all sentences. The RE feature is represented as ΦRE(p) = {φRE(p, s)|s ∈ S}.

Co-expression profiles

The co-expression features φCO(p, s) are computed based on the mutual rank formula originally introduced in Obayashi et al. (2007), i.e., φCO(p, s) = sqrt(rank(p, s) ∗ rank(s, p)). Given an input gene p, the co-expression feature vector is ΦCO(p) = {φCO(p, s)|s ∈ S}.
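The mutual rank formula can be sketched as follows. The sketch assumes that rank(a, b) is the position of gene b when all other genes are sorted by decreasing correlation with gene a (rank 1 being the most strongly co-expressed); the nested-dictionary input and the function names are hypothetical.

```python
import math

def rank_of(corr, a, b):
    """Rank of gene b among the co-expression partners of gene a (1 = strongest)."""
    partners = sorted(
        (g for g in corr[a] if g != a),
        key=lambda g: corr[a][g],
        reverse=True,
    )
    return partners.index(b) + 1

def mutual_rank(corr, p, s):
    """Geometric mean of the two directional ranks."""
    return math.sqrt(rank_of(corr, p, s) * rank_of(corr, s, p))
```

Because the two directional ranks generally differ, the geometric mean gives a symmetric score; smaller values indicate stronger mutual co-expression.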

Shared Annotations

GO evidence codes were restricted to EXP (inferred from experiment), IDA (inferred from direct assay), and IPI (inferred from physical interaction). The GO set GO(p) is all terms associated with gene p ∈ D . For a gene pair (p, s), the set of shared GO annotations is computed as GO(p, s) = GO(p) ∩GO(s). A new feature φGO(p, s) is then computed, based on the number of shared GO annotations as follows:

N is the total number of genes in the Arabidopsis genome and N(g) is the number of genes that are associated with the GO term g. Given an input gene p, the overall GO feature is ΦGO(p) = {φGO(p, s)|s ∈ S}. A similar procedure is used to compute vectors of shared annotations in KEGG and AraCyc, i.e., ΦKEGG(p) and ΦAra(p).

Transcription Factors and Binding Sites

TF is defined as a binary feature φTF(p, tf) that is set to 1 whenever a gene p encodes a transcription factor tf ∈ TF. We also used the set TFBS of transcription factor binding sites to define a binary feature φTFBS(p, tfbs) that is set to 1 whenever gene p contains a TF binding site tfbs ∈ TFBS.

Feature vector

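For any given gene p, the features described above were aggregated by concatenating the eleven feature types into a single vector:

Φ(p) = [ΦPPI(p), ΦOL(p), φIR(p), ΦPID(p), ΦRE(p), ΦCO(p), ΦGO(p), ΦKEGG(p), ΦAra(p), ΦTF(p), ΦTFBS(p)]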

Feature selection and filtering

Without selection

All eleven types of features were assessed with the χ2 test, and we adopted all heterogeneous features into the classifier. To avoid numerical problems caused by using different scales of feature values, all features were normalized to lie in the [0, 1] interval.
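The normalization step can be illustrated with a per-feature min-max rescaling. This is one common way to map values into the [0, 1] interval and is an assumption here, since the text does not specify the scaling method; the function name is hypothetical.

```python
def min_max_normalize(rows):
    """Rescale each feature (column) of `rows` to the [0, 1] interval.

    rows: list of equal-length feature vectors, one per gene.
    Constant features are mapped to 0.0 to avoid division by zero.
    """
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    scaled = []
    for r in rows:
        scaled.append([
            (r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ])
    return scaled
```

Without such rescaling, features with large raw ranges (for example mutual ranks) could dominate bounded features such as the PPI score in the kernel computation.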

With selection

Although all types of features passed the univariate χ2 test, this test overlooks the dependencies between the features and their interactions with the classifier. Thus we adopted a “wrapper” method that integrates the feature selection with the classifier. Inspired by Weston et al. (2003), the selection of features also involved the weighting of different features, and we initialized a weight vector γ inside the classifier training:

1. Set γ = [1, 1, ..., 1], where γi is the weight for feature i.

2. Solve the classifier and obtain the weight vector w that appears in the hyperplane formulation wTφ(x) + b.

3. Update γ = γ ∗ w (element-wise). If γ has not converged, set the smallest γi to zero, i.e. eliminate that feature, and go back to step 2. If γ has converged, the final set of weighted features is represented as φi = φi ∗ γi.

Learning methods

Two-class SVM

The two-class SVM relies on a heuristic approach for selecting unlabeled samples as negative examples. The SVMlight package (Joachims, 2002) was adopted.

Transductive SVM

Transductive SVM (TSVM) (Joachims, 1999) used the same training data as in 2.4.1, but in a transductive setting. Since the biological problem posed a situation where all the data, including labeled and unlabeled were explicit, a transductive learning classifier could be adopted.

Laplacian SVM

The Laplacian SVM is a semi-supervised method developed by Melacci and Belkin (2011). Because the distribution of all data was known, the Laplacian SVM used the same setting as in TSVM, but with only positive and unlabeled samples.

Label Propagation

Label propagation (Zhou et al., 2004) also required both positive and negative samples, with the labels propagated along the data pattern. We modified the affinity matrix W so that it integrated the weighted heterogeneous data.

Where is obtained:

∑ ∑

and matrix A is defined where Am,n is the dot product of ΦmU and ΦnU for all unlabeled genes U.

The Biased SVM

In the Biased SVM formulation (Lee and Liu, 2003; Liu et al., 2003), all unlabeled examples are considered to be negative, and the decision function f(x) = wTφ(x) + b is learned using the standard soft-margin SVM formulation shown below:

minimize (1/2)||w||² + CP ∑i∈P ξi + CU ∑i∈U ξi
subject to yi (wTφ(xi) + b) ≥ 1 − ξi and ξi ≥ 0 for all i.

The capacity parameters CP and CU control the penalty for errors on positive examples vs. errors on unlabeled examples. To find the best capacity parameters to use during training, the Biased SVM approach runs a grid search on a separate development dataset. This search was aimed at finding values for the parameters CP and CU that maximize pr, the product between precision p = p(y = 1|f = 1) and recall r = p(f = 1|y = 1). Lee and Liu (2003) showed that maximizing the pr criterion is equivalent to maximizing the objective r²/p(f = 1), where both r = p(f = 1|y = 1) and p(f = 1) can be estimated using the trained decision function f(x) on the tuning dataset. We implemented the Biased SVM approach on top of the binary SVMlight package, in which the CP and CU parameters of the Biased SVM were tuned through the c and j parameters (c = CU and j = CP/CU). The naive SVM tunes these parameters to optimize the accuracy with respect to the noisy label s(x). The BSVM tunes the c, j parameters to maximize an estimate of the F1 score with respect to the true label y(x). When training with the Gaussian kernel, σ is also included in the grid search optimization.
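The tuning criterion r²/p(f = 1) and the grid search over c and j can be sketched as follows. Recall is estimated on the labeled positives of the tuning set and p(f = 1) as the fraction of all tuning examples predicted positive. The `evaluate` callback, which would train a classifier for a given (c, j) pair and return its score on the tuning set, is hypothetical and stands in for the actual SVMlight runs.

```python
from itertools import product

def pu_score(predicted, labeled_positive):
    """Estimate r^2 / p(f = 1) from predictions on a tuning set.

    predicted: list of 0/1 predictions f(x).
    labeled_positive: list of 0/1 flags, 1 for known positives (s = 1).
    """
    n = len(predicted)
    n_pos = sum(labeled_positive)
    n_pred_pos = sum(1 for f in predicted if f == 1)
    if n_pos == 0 or n_pred_pos == 0:
        return 0.0
    hits = sum(1 for f, s in zip(predicted, labeled_positive) if s and f == 1)
    recall = hits / n_pos
    return recall ** 2 / (n_pred_pos / n)

def grid_search(evaluate, c_values, j_values):
    """Return the (c, j) pair with the highest score from evaluate(c, j)."""
    return max(product(c_values, j_values), key=lambda cj: evaluate(cj[0], cj[1]))
```

The criterion rewards classifiers that recover the known positives while labeling few examples positive overall, which is why it can be computed without any true negative labels.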

Weighted Samples SVM

First, the dataset P ∪ U is split into a training set and a validation set, and a classifier g(x) is trained on the labeling s to approximate the label distribution i.e. g(x) = p(s = 1|x).

The validation set is then used to estimate p(s = 1|y = 1) as follows:


The second and final classifier f (x) was trained on a dataset of weighted examples that were sampled from the original training set as follows:

• Each positive example x ∈ P is copied as a positive example in the new training set with a weight p(y =1|x, s = 1) = 1.

• Each unlabeled example x ∈ U is duplicated into two training examples in the new dataset: a positive example with the weight p(y = 1|x, s = 0) and a negative example with the weight p(y = −1|x, s = 0) = 1 − p(y =1|x, s = 0).
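The weighting scheme in the two steps above can be sketched as follows, using the formulas from Elkan and Noto (2008). The sketch assumes that p(s = 1|y = 1) is estimated as the average of g(x) over the labeled positives in the validation set, which is one of the estimators proposed in that paper and may not be the one used here. The weight for an unlabeled example is p(y = 1|x, s = 0) = ((1 − c)/c) ∗ g(x)/(1 − g(x)), with c = p(s = 1|y = 1). Function names are hypothetical.

```python
def estimate_label_frequency(g_on_validation_positives):
    """Estimate c = p(s = 1 | y = 1) as the mean of g(x) over labeled positives."""
    return sum(g_on_validation_positives) / len(g_on_validation_positives)

def unlabeled_positive_weight(g_x, c):
    """p(y = 1 | x, s = 0) for an unlabeled example, clipped to [0, 1]."""
    if g_x >= 1.0:
        return 1.0
    w = (1.0 - c) / c * g_x / (1.0 - g_x)
    return min(max(w, 0.0), 1.0)

def build_weighted_training_set(positives, unlabeled, g, c):
    """Return (example, label, weight) triples for the second classifier."""
    weighted = [(x, 1, 1.0) for x in positives]
    for x in unlabeled:
        w = unlabeled_positive_weight(g(x), c)
        weighted.append((x, 1, w))          # copy treated as positive
        weighted.append((x, -1, 1.0 - w))   # copy treated as negative
    return weighted
```

The clipping is a numerical safeguard added in this sketch: in theory g(x) does not exceed c, but a calibrated classifier can return slightly larger values.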

The output of the first classifier g(x) was used to approximate the probability p(s = 1|x), whereas p(s = 1|y = 1) was estimated on the validation set. We trained the two classifiers g and f using the libSVM (Chang and Lin, 2011) implementation of SVMs. Platt scaling was used with the first classifier to obtain the probability estimates g(x) = p(s = 1|x), which were then converted into weights and used during the training of the second classifier. Because the WSVM generated P(y = 1|x, s = 0) and P(y = −1|x, s = 0), we used a leave-one-out assessment to obtain the best parameters. Let the training set be S = [s1, s2, ..., sn], with each si = [φi, ti], where φi is the feature vector and ti is its label. We then created n leave-one-out sets S[i] = [s1, s2, ..., si−1, si+1, ..., sn], trained on each S[i], and predicted on the held-out si to obtain a value t̂i. The cross-entropy H between the labels ti and the predictions t̂i was then computed.

The optimal parameters were those for which H reached its minimum in a grid search. When applying the Gaussian kernel, σ was also optimized.
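The leave-one-out assessment can be sketched as follows. The sketch assumes the standard binary cross-entropy, H = −∑ [ti log t̂i + (1 − ti) log(1 − t̂i)], with labels ti in [0, 1]; the exact expression used in this work may differ. The `fit_predict` callback, which would train on the reduced set and return a probability for the held-out example, is hypothetical.

```python
import math

def leave_one_out_cross_entropy(samples, fit_predict, eps=1e-12):
    """Sum of cross-entropy terms over leave-one-out predictions.

    samples: list of (features, label) pairs with label in [0, 1].
    fit_predict(train, features): trains on `train` and returns the
        predicted probability that the held-out label is 1.
    """
    h = 0.0
    for i, (phi, t) in enumerate(samples):
        held_out_set = samples[:i] + samples[i + 1:]
        y_hat = fit_predict(held_out_set, phi)
        # Clip to avoid log(0) for over-confident predictions.
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        h -= t * math.log(y_hat) + (1.0 - t) * math.log(1.0 - y_hat)
    return h
```

In a grid search, this function would be called once per parameter setting, and the setting with the smallest H would be retained.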

Selection of unlabeled data for training

Random selection

The most straightforward way to select unlabeled data for training is to randomly select genes as negative. We selected 100 genes randomly from the Arabidopsis genome (not excluding genes in the PPIN network) as negative samples. Besides the random selection methods, we also introduced two approaches to select negative seeds based on GO terms.

GOP was defined as the GO terms associated with all the positive samples, and GOg those associated with an unlabeled sample g.

GO Semantic based

The semantic similarities S(goi, goj ) between two GO terms goi and goj were calculated based on GSESAME (Du et al., 2009). Then the aggregated similarity of any unlabeled gene g to the positive set P is

The unlabeled genes were ranked in ascending order of this similarity, and the top 100 genes were selected as negatives in the training stage.

GO tree structure based

We defined the distance between two GO terms D(goi, goj ) as the shortest path between goi and goj in the GO tree. Si is labeled as negative iff:

Based on training with the tuning files, Dmin = 3 and Dmax = 8 gave the best predictions. The number of genes selected varied based on this selection criterion.

Results

Benchmark data

The benchmark data was adopted from the Arabidopsis Plant-Pathogen Immune Network (PPIN) (Mukhtar et al., 2011). The dataset provided an excellent candidate for the benchmark experiment because all these genes could be grouped based on their involvement in the same biological process. This benchmark set provides a prototypical example of a network of genes driven by shared functionality. To replicate a scenario in which a biologist wants to find more genes starting from a small set of seed genes, we first randomly selected genes P from the PPIN network to serve as positive examples. Then we randomly sampled genes from the Arabidopsis genome, along with some of the genes from the PPIN network, as unlabeled examples U. The resulting dataset of P and U examples was split into halves: one half represented the test data, and the other half was further split into a training set (2/3) and a tuning set (1/3). The aim of the program was to identify more “positive genes” from U.
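The split described above can be sketched as follows. The sketch shuffles the combined P and U examples with a fixed seed and does not stratify by label; both choices, like the function name, are assumptions, since the text does not describe how the random split was drawn.

```python
import random

def split_benchmark(examples, seed=0):
    """Split examples into training (1/3), tuning (1/6) and test (1/2) sets.

    Half of the data is held out for testing; the other half is divided
    into a training set (2/3 of that half) and a tuning set (1/3).
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    test = shuffled[:half]
    rest = shuffled[half:]
    n_train = (2 * len(rest)) // 3
    train = rest[:n_train]
    tune = rest[n_train:]
    return train, tune, test
```

With 60 examples this yields 20 training, 10 tuning and 30 test examples.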

Feature collection

A comprehensive feature set was acquired from public databases and available experiments. PPIs were extracted from The Arabidopsis Information Resource (TAIR) (Rhee, 2003), IntAct (Kerrien et al., 2012), BioGRID (Stark et al., 2011) and the Predicted Arabidopsis Interactome Resource (PAIR) (Lin et al., 2011). Only experimentally validated PPIs were obtained, generating a total of 2322, 10680 and 15926 interactions based on the databases with timestamps of 2010, 2011 and 2012, respectively. Ortholog information was also retrieved from OrthoMCL (Li et al., 2003) and InParanoid (Ostlund et al., 2010). A total of 14,583 PPIs from the ortholog species (S. cerevisiae, C. elegans, D. melanogaster, and Z. mays) were projected onto the Arabidopsis genome. Moreover, information from the literature was retrieved from PubMed, resulting in a total of 2,923,734 sentences that contain the genes in the dataset. These information sources were used to generate three separate features: the co-occurring literature IDs of gene pairs, the term frequency-inverse document frequency (tf-idf) (Řehůřek and Sojka, 2010) for each single gene, and the relation extraction information for each gene (Bunescu and Mooney, 2005). Ranked co-expression profiles were calculated based on 703 distinct microarray experiments collected from TAIR (Rhee, 2003) and the Nottingham Arabidopsis Stock Center. Furthermore, 9043 GO, 3925 KEGG and 446 AraCyc (Mueller et al., 2003) terms were retrieved corresponding to the test genes. Lastly, 16 families of transcription factors (TF) and putative transcription factor binding sites were collected from AGRIS (Palaniswamy et al., 2006) and AthaMap (Bulow et al., 2006).

Algorithm implementation and comparison

For all kernel-based algorithms, both linear and tuned Gaussian kernels were tested, and the best performing kernel was selected. BSVM obtained the best performance with a linear kernel, while the other algorithms obtained the best performance with a Gaussian kernel. These comparisons were conducted with the feature vectors collected in 2012 with 30 known positive samples. Results showed that BSVM and WSVM were the top two algorithms, with the WSVM with the Gaussian kernel providing the best performance. Laplacian SVM also performed better than the baseline, while label propagation and TSVM failed to reach the same result as the baseline (Fig. 3.1a). Based on this comparison, we further studied the BSVM and WSVM on different sets of features. First, the features were retrieved as the “original feature vector”. Then we filtered the features using a “wrapper” method with the classifier, yielding the “new feature vector”. Because the new feature vector varies based on the number of given positive samples, we compared the performance on the original and new feature vectors given 30, 50 and 100 positive samples. Both the WSVM (Fig. 3.1b) and the BSVM (Fig. 3.1c) performed worse with the new feature vector given 30 positive samples, but better given 50 and 100 positive samples. Then we compared the area under the curve for the top 20% (AUC20) for both the WSVM and BSVM on the original and filtered feature vectors (Table 3.1).


Figure 3.1. Comparisons of performance between different algorithms, kernels and data integration. Analyses were based on the data acquisition from 2012. (a) The precision/recall curve of the top 20% for the different algorithms, each with its best performing kernel. The algorithms include BSVM (linear kernel), WSVM (Gaussian kernel), label propagation, transductive SVM and Laplacian SVM (Gaussian kernel). (b) The comparison between the original feature set (dashed line) and the new feature set (solid line) after filtering, based on WSVM with a Gaussian kernel. (c) The comparison between the original feature set (dashed line) and the new feature set (solid line) after filtering, based on BSVM with a linear kernel.

Table 3.1. The AUC20 comparisons between BSVM and WSVM on original and filtered sets.

# of Seed Genes    WSVM Original    WSVM Filtered    BSVM Original    BSVM Filtered
30                 0.69             0.68             0.68             0.673
50                 0.711            0.723            0.706            0.716
100                0.733            0.758            0.739            0.748

Time dependent evaluation

To test our methodology, we ran multiple experiments in which the extraction of the input features was limited to content that appeared before a predetermined year. Similar comparisons of the different choices of kernels were carried out on data from 2010 (Fig. 3.2) and 2011 (Fig. 3.3). The WSVM trained with a Gaussian kernel performed best in both years. The prediction became more precise as the feature volume grew over the three years, yet the performance in 2010 was still comparable to the performance in 2012.

Figure 3.2. Performance comparisons between different algorithms and kernels based on data from 2010. The algorithms include WSVM (blue), BSVM (red) and traditional SVM (green), with Gaussian kernels (solid lines) and linear kernels (dashed lines).

Figure 3.3. Performance comparisons between different algorithms and kernels based on data from 2011. The algorithms include WSVM (blue), BSVM (red) and traditional SVM (green), with Gaussian kernels (solid lines) and linear kernels (dashed lines).

Varied composition of unlabeled genes

One of the most influential variables in the experimental setting is the composition of the unlabeled genes. The more potential “positive” genes in the unlabeled pool, the more precise the prediction. Usually a biologist starts with a list of genes that have already passed some selection criteria, ranging in size from fewer than a hundred to several thousand. Here we varied the percentage of “positive” genes among the unlabeled ones from 1% to 40% and compared the AUC20 of WSVM with a Gaussian kernel against random prediction.

Different sets of negative genes

In the training session, some unlabeled genes from the Arabidopsis genome were randomly chosen as negative. In addition to random sampling, two other selection methods were applied, based on GO tree distance and GO semantic distance. Random selection and GO semantic based selection provided a fixed number of negative genes (100 genes in previous cases). However, the GO tree based method selected negative gene sets of varying sizes. To make the comparisons fair, we equalized the size of the negative gene sets. First, we observed lower performance given fewer negative genes in the training set, e.g. the AUC20 was 0.138 given 30 positive and 100 negative samples in training, while the AUC20 was 0.134 given the same positive but 47 negative samples. Second, we found that GO semantic distance based selection obtained the best AUC20 (Table 3.2).

Table 3.2. The AUC20 comparisons among different sets of negative genes.

Positive #    Negative #    Random    Semantic    Tree
30            47            0.670     0.685       0.675
50            80            0.690     0.710       0.685
100           96            0.715     0.720       0.710

To test the robustness of our model, we performed two modifications: 1) introduction of random noise features and 2) repetition of a random feature. Thirty independent random noise features were added to the feature collection for each gene. Separately, a random feature was selected and repeated thirty times. Analysis of the performance on the three data sets indicated no significant differences between the original model and the modified ones, with an AUC20 of 0.138 (original), 0.131 (noised) and 0.132 (repeated).
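The two perturbations can be illustrated with a minimal sketch. It assumes each gene is represented by a plain list of numeric features and that the noise is drawn uniformly from [0, 1); the function names, the noise distribution and the toy vectors are illustrative assumptions, not the implementation used in this work.

```python
import random

def add_noise_features(vectors, n_noise, rng):
    """Append n_noise independent random values to every feature vector."""
    return [row + [rng.random() for _ in range(n_noise)] for row in vectors]

def repeat_random_feature(vectors, n_repeats, rng):
    """Pick one existing feature column at random and append n_repeats copies of it."""
    column = rng.randrange(len(vectors[0]))
    return [row + [row[column]] * n_repeats for row in vectors]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [[0.2, 0.0, 1.0], [0.9, 0.4, 0.0]]  # hypothetical feature vectors
noised = add_noise_features(original, 30, rng)
repeated = repeat_random_feature(original, 30, rng)
```

Each modified set would then be passed through the same training and evaluation procedure as the original set, so that the resulting AUC20 values are directly comparable.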

Comparison with GeneMANIA

To further evaluate our approach, we compared our method with one of the state-of-the-art methods, GeneMANIA (Mostafavi et al., 2008). GeneMANIA uses PPI, protein domain, physical interaction, co-localization, pathway and co-expression data to construct a gene network and predict relevant genes. The only input for both algorithms was the positive samples, so we provided both with 30, 50 and 100 positive samples, each randomly repeated 30 times. The comparison has two facets: 1. The accuracy of the prediction, measured by the averaged precision@N, where N is the number of genes returned by GeneMANIA, and 2. The “unique” true negative genes discovered over all 30 repetitions. Results suggested that our method showed high accuracy as well as low “redundancy” (Table 3.3).

Table 3.3. The prediction results comparisons with GeneMANIA.

# of Positive                         30                          50                          100
                                      Redundant   Non-redundant   Redundant   Non-redundant   Redundant   Non-redundant
GeneMANIA                             21/101      4/23            117/690     10/87           236/803     127/400
WSVM Gaussian                         66/101      16/23           480/690     226/423         619/803     201/332
WSVM Gaussian (GeneMANIA Features)    37/176      14/23           430/690     204/389         529/803     173/291
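The averaged precision@N used in this comparison can be sketched as follows. The sketch assumes that each repetition yields a ranked list of predicted genes together with the set of held-out positive genes for that repetition; the gene identifiers and function names are hypothetical.

```python
def precision_at_n(ranked_genes, held_out_positives, n):
    """Fraction of the top-n predictions that are held-out positive genes."""
    top = ranked_genes[:n]
    hits = sum(1 for gene in top if gene in held_out_positives)
    return hits / n

def averaged_precision_at_n(runs, n):
    """Mean precision@n over repeated runs; each run is (ranking, positives)."""
    return sum(precision_at_n(ranking, positives, n) for ranking, positives in runs) / len(runs)

# Two hypothetical repetitions with placeholder gene identifiers
runs = [
    (["g1", "g2", "g3", "g4"], {"g1", "g3"}),
    (["g5", "g6", "g7", "g8"], {"g6"}),
]
average = averaged_precision_at_n(runs, n=2)
```

Fixing N to the number of genes returned by GeneMANIA, as described above, keeps the two methods comparable on lists of equal length.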

Discussion

BSVM (biased SVM) and WSVM (weighted sample SVM) showed their potential in mining functionally related genes. However, label propagation and transductive SVM failed to outperform the baseline SVMlight algorithm (Fig. 3.1). High-degree features might be the cause of the poor performance. Several of our features were “sparse”, so the total feature matrix was no longer positive definite and the prediction might encounter local minima or convergence problems. This raised the question of how to deal with heterogeneous features whose distributions and densities differ. To address this problem, we proposed a feature engineering method in which the features were filtered by “wrapping” with the classifier. According to Saeys et al. (2007), univariate selection of features, which is what we did without feature selection, overlooks feature interactions and ignores the interaction with the classifier. The “wrapper” technique, on the other hand, selects and assigns weights to features while training the classifier.

During each cycle of “embedded” training, the lowest weighted feature was assigned a weight of 0. The final set of features was obtained once the weights associated with each feature had converged. The results showed that given 30 positive samples, the performance on the original features was better. With more positive samples (50 and 100), the filtered feature set outperformed the original features (Table 3.1). To examine how the feature selection technique affected the feature vectors, we compared the composition of the 11 types of features in the two feature sets. The comparison revealed that no single type of feature was favored by the filtering; each type of feature was filtered at a similar ratio. Furthermore, we studied the topological structure of the feature network, which was built upon the feature vectors, with a feature $\varphi_p(i, j)$ representing an edge of type $p$ between genes $i$ and $j$. We further compared the distribution of degrees in the two networks (Fig. 3.4). The degree distribution of a biological network should follow the power-law formulation $y = \beta x^{\alpha}$, or equivalently $\log(y) = \alpha \log(x) + \log(\beta)$, where $y$ is the frequency of degree $x$. The fit of the biological networks to the theoretical formulation was measured by the $R^2$ value. For the 30 positive sample condition, both networks fit poorly due to the limited number of features (Fig. 3.4a and Fig. 3.4b). However, the $R^2$ values under 50 and 100 positive samples were significantly improved by filtering the feature set, indicating that the filtered network more closely resembles a real biological network (Fig. 3.4c-f). Based on these observations, to obtain the best prediction performance, the feature space should also follow a scale-free network structure.
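The power-law fit and its $R^2$ value can be illustrated with an ordinary least-squares regression in log-log space. This is a minimal sketch under the assumption that the degree distribution is given as a mapping from degree to frequency; the function name and the toy distribution are illustrative only.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares fit of log(y) = alpha * log(x) + log(beta).

    degree_counts maps a degree x (> 0) to its frequency y (> 0).
    Returns (alpha, beta, r_squared).
    """
    xs = [math.log(x) for x in degree_counts]
    ys = [math.log(y) for y in degree_counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    log_beta = mean_y - alpha * mean_x
    ss_res = sum((y - (alpha * x + log_beta)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return alpha, math.exp(log_beta), 1.0 - ss_res / ss_tot

# Hypothetical degree distribution generated as frequency = 100 * degree^-2
toy = {x: 100.0 * x ** -2 for x in (1, 2, 4, 5, 10)}
alpha, beta, r_squared = fit_power_law(toy)
```

For an exact power law such as the toy input, the fitted exponent recovers the generating value and $R^2$ approaches 1; real degree distributions deviate from the line, which lowers $R^2$.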

In the context of a biological problem, “negative samples” are hard to define and sometimes do not even exist, but it is still necessary to identify “negative ones” in the training process. Because only a small portion of genes are involved in a single biological pathway, it is reasonable to apply random sampling to select genes as “negative samples”. However, random sampling does not provide any discriminative power regarding the closeness of connectivity, so we adopted two other methods based on GO terms and compared the results (Fig. 3.3). The semantic based method performed better than random selection, but the difference was not statistically significant (p-value > 0.2, paired t-test against random sampling). Interestingly, the GO based selection chose more false negatives than random sampling. This again indicated that GO terms alone may not be sufficient to understand a biological process.

In summary, our experiments provided an approach to obtain the best “candidate genes” from a pool of unlabeled ones. We have collected a rich volume of heterogeneous features to use in conjunction with well-tuned semi-supervised algorithms. Beyond that, we also introduced different conditions appropriate to a biological question and proposed ways of handling these different conditions by optimizing our algorithms. These benchmark experiments could enlighten biologists who wish to study a specific biological process and data mine functionally related genes.

Figure 3.4. The degree distribution based on the pre-filtered and filtered networks. (a) 30 positive, original features, (b) 30 positive, filtered features, (c) 50 positive, original features, (d) 50 positive, filtered features, (e) 100 positive, original features, (f) 100 positive, filtered features. R2 scores are given for each distribution.

CHAPTER 4: THE CONSTRUCTION OF A GRAVITROPIC NETWORK

Introduction

A transcriptome analysis was performed based on the Gravity Persistent Signal treatment. The analysis identified the differentially expressed genes, five of which were selected for further functional validation (Chapter 2). Moreover, a data mining method was developed to find functionally related genes (Chapter 3). By combining the two approaches, more gravity related genes could be identified. Beyond that, the possible interactions and relationships between the gravity related genes might be inferred from a network analysis. These relationships could provide some information on the regulatory network of gravitropic signaling. Such a system, including the genes and their interactions, is a biological network. Building such a network provides a model that shows how genes are regulated and interact with each other. In a network, a gene is denoted by a node (point) in the graph and the interaction between genes is denoted by an undirected edge. If the network can precisely predict the influence of one gene on another, then the edge becomes a directed one, indicating the relationships between upstream and downstream genes. A biological network can also grow dynamically as new components and interactions are added.

Network construction is also known as network inference. A biological network is so complex that the interactions between genes may be present only under certain circumstances. Because the network changes with conditions, the inference of a biological network needs not only generalized data but also treatment specific data. Two types of data were used to infer the gravity specific network: time-course experimental data and treatment-control metadata. The time-course data consist of gene expression data from a series of time points after stimulation and are crucial for determining the temporal relationships between genes. The treatment-control metadata were used to extract regulation patterns of genes under the influence of other inhibitors/promoters.

Network inference is complicated. According to an assessment by Marbach et al. (2010), 11 out of 29 teams that participated in a network inference contest generated predictions that were not significantly better than those generated by a random network. Marbach et al. (2010) further introduced common algorithms for network inference, which include correlation based, information-theoretic and Bayesian network methods. Ma et al. (2007) proposed a graphical Gaussian model to predict an Arabidopsis interactome based on microarray data profiles. Their results showed that the network follows a truncated power-law distribution with the power of k between 1 and 11 (Ma et al., 2007). In a recent assessment of network inference algorithms on human ovarian cancer cells, eight state-of-the-art methods were compared on simulated and experimental data, and all methods performed better on simulated data (Madhamshettiwar et al., 2012). Most importantly, the comparisons showed that the area under the curve (AUC) is significantly better for small networks (50 genes) than for large networks (200 genes) for E. coli where, they argued, specific data types were crucial for the accuracy of prediction.

The time-course expression profiles provide a correlation between different genes. Two inference methods are most often proposed to predict the interactions: a time-lag correlation coefficient and a dynamic Bayesian network (DBN).

The time-lag correlation coefficient calculates the Pearson Correlation Coefficient (PCC) values with a time lag parameter. The choice of the time lag parameter depends on an estimation of the temporal proximity of the regulation. A heuristic approach is used to test all possible time lags and select the one with the strongest correlation.

On the other hand, a dynamic Bayesian network assumes that the expression of a regulatee depends only on the expression of the regulator at a previous time point:

$$P\left(G_t \mid \mathrm{Pa}(G)_{t-1}\right)$$

where $G_t$ represents the expression of the gene G at time point t, and $\mathrm{Pa}(G)_{t-1}$ represents the expression profiles of the parent genes (regulators of G) at time point t-1. This assumption decreases the search space dramatically, but it is arguable whether the “previous time point” is t-1 or an earlier time point.

A biological network is a scale-free network (reviewed by Roy 2012) in which connections are not evenly distributed: the network contains “important” genes as well as “trivial” genes. An “important gene” is often termed a “hub” gene when it has a high degree of connections with other genes (Fig. 4.1a). In some cases, an “important gene” can also be a “bottleneck” gene, which provides a unique or limited number of connections between two subsets of genes, a property measured by betweenness centrality (Yu et al., 2007) (Fig. 4.1b). A “trivial” gene is one with fewer connections. The degree of a node is measured by its number of edges, and the betweenness centrality is measured as (Brandes 2001):

$$C_b(n) = \sum_{s \neq n \neq t} \frac{\sigma_{st}(n)}{\sigma_{st}}$$

where s and t are nodes different from n, $\sigma_{st}$ denotes the number of shortest paths from s to t, and $\sigma_{st}(n)$ denotes the number of shortest paths from s to t with n on them.

Figure 4.1. Diagrammatic representation of a “hub” gene and a “bottleneck” gene. (a) Gene F is a hub gene because it connects with most neighboring genes. (b) Gene L is a bottleneck because it provides the only path between one subset of genes (A, B, D) and another subset of genes (C, G, E, H).
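Degree and betweenness centrality can be computed on a small undirected graph as sketched below, using the accumulation scheme of Brandes (2001) for unweighted graphs. The toy graph is a hypothetical example of two triangles joined through a single node L; it is not the graph drawn in Figure 4.1, and the function names are illustrative.

```python
from collections import deque

def degree(graph):
    """Number of edges attached to each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph (Brandes 2001)."""
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        predecessors = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}  # shortest-path counts from source
        dist = {node: -1 for node in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    predecessors[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in predecessors[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # every unordered pair (s, t) was visited from both ends
    return {node: value / 2.0 for node, value in centrality.items()}

# Hypothetical graph: triangle A-B-D and triangle C-E-G joined only through L
graph = {
    "A": ["B", "D"], "B": ["A", "D"], "D": ["A", "B", "L"],
    "L": ["D", "C"],
    "C": ["L", "E", "G"], "E": ["C", "G"], "G": ["C", "E"],
}
degrees = degree(graph)
bc = betweenness(graph)
```

In this toy graph L has only two edges, yet all nine shortest paths between the two triangles pass through it, so it has the highest betweenness: a bottleneck need not be a hub.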

There are two major steps to building a biological network based on experimental data. First, find as many genes as possible. Gravity related genes were identified by the transcriptome analysis: a list of differentially expressed genes was generated from the gene expression microarray. Second, a list of known gravity related genes was selected from the primary literature to serve as seeds for the network.

However, the total number of differentially expressed genes derived from the microarray experiment is not necessarily a good starting data set. The number of possible interactions quadruples as the number of genes doubles. The current gene set from the microarray experiment contains around 350 genes, so the number of possible interactions is around 122,500. A network G(V, E) represents a collection of genes V and a collection of interactions E. For a given number of genes, the number of possible networks is exponential in the number of possible edges, because each interaction either exists or does not. The number of possible networks built upon |V| genes is therefore $2^{|V|(|V|-1)/2}$. Even if the network consists of only 10 genes, the number of possible networks is about 3.5E13. Exhaustive search is not computationally feasible, so alternative approaches must be adopted to decrease the scale of the network.
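The growth of the search space can be checked with a few lines of arithmetic. The sketch below counts unordered gene pairs, as in the formula for the number of possible networks; the figure of around 122,500 quoted for 350 genes corresponds to 350 squared, i.e. ordered pairs including self-pairs, whereas the unordered count is about half of that. The function names are illustrative.

```python
def possible_edges(n_genes):
    """Number of unordered gene pairs, |V| * (|V| - 1) / 2."""
    return n_genes * (n_genes - 1) // 2

def possible_networks(n_genes):
    """Number of undirected networks: each possible edge is present or absent."""
    return 2 ** possible_edges(n_genes)

edges_10 = possible_edges(10)        # 45
networks_10 = possible_networks(10)  # 35184372088832, about 3.5E13
edges_350 = possible_edges(350)      # 61075
ordered_pairs_350 = 350 ** 2         # 122500
```

Either way of counting pairs grows quadratically with the number of genes, while the number of candidate networks grows exponentially in the number of pairs.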

The only variation between the application on gravitropism and the application on the benchmark data is that we introduced our own data as a feature.

Methods

Feature vector preparation

The features were prepared as described in the feature preparation section of Chapter 3. In short, the features include protein-protein interactions from public databases, literature text mining, gene functional annotations (GO/KEGG/AraCyc) and TF/TFBS information. These annotations were prepared for each Arabidopsis gene.

Gravitropic microarray data feature preparation

Based on the transcriptome experiment described in Chapter 2, the gene expression profiles were summarized. Using these expression profiles at each time point, the correlation coefficient (here the Pearson Correlation Coefficient, PCC) was calculated for each gene pair. Any gene pair with a PCC value less than 0.5 was discarded because of low statistical robustness. In addition to the PCC, we also calculated the mutual rank between each gene pair. Because mutual rank captures the high ranking pairs even if the baseline of the correlation is low, it can serve as a complementary measure to the PCC. The mutual rank is calculated as:

$$MR_{ij} = \sqrt{R_{ij} \times R_{ji}}$$

where $R_{ij}$ is the rank of the PCC value of gene j among all pair-wise PCC values to gene i, and $MR_{ij}$ is the mutual rank of gene j to gene i.

These two features (PCC and mutual rank) were added into the feature vector prepared.
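As an illustration of how the two co-expression features relate, the sketch below computes pair-wise PCC values and the mutual rank for a handful of genes. It assumes rank 1 is assigned to the most highly correlated partner and that ties are broken by list order; the gene names, expression values and function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pcc_and_mutual_rank(profiles):
    """profiles maps gene -> expression values over time points.

    Returns (pcc, mutual_rank), both keyed by ordered gene pairs (i, j), i != j.
    """
    genes = list(profiles)
    pcc = {(i, j): pearson(profiles[i], profiles[j])
           for i in genes for j in genes if i != j}
    rank = {}
    for i in genes:
        # rank 1 = partner with the highest PCC to gene i (assumed convention)
        partners = sorted((j for j in genes if j != i),
                          key=lambda j: pcc[(i, j)], reverse=True)
        for position, j in enumerate(partners, start=1):
            rank[(i, j)] = position
    mutual_rank = {(i, j): math.sqrt(rank[(i, j)] * rank[(j, i)]) for (i, j) in pcc}
    return pcc, mutual_rank

# Hypothetical expression profiles over four time points
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.0, 2.0, 3.0, 5.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
pcc, mutual_rank = pcc_and_mutual_rank(profiles)
# Pairs below the PCC cutoff of 0.5 described above are dropped
kept_pairs = {pair for pair, value in pcc.items() if value >= 0.5}
```

Because the mutual rank depends only on the ordering of the correlations, a pair can rank highly even when the absolute PCC values in its neighborhood are modest.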

Putative positive data selection

The differentially expressed genes were further filtered to retain genes associated with at least five types of features. To make the best use of the semi-supervised learning, a list of “irrelevant” genes was also added to the training set to serve as the “negative” samples. These genes were randomly sampled from the Arabidopsis genome. We then supplied the semi-supervised classifier with the known positive genes and predicted the labels of the filtered genes.

Select the putative genes

Here, the positive genes along with the negative genes were prepared as two separate files: training and development files. Following the same tuning stages, we obtained the optimal classifier. The semi-supervised learning method uses the marginal value returned by the equation of:

∑ ∑ ( )

Here t is the marginal value, $x_j$ is the feature vector of the j-th labeled gene, and $y_j$ is the label of the j-th labeled gene. $x_i$ is the feature vector of the i-th unlabeled gene, while $K(x_j, x_i)$ is the kernel value. Here we used the Gaussian kernel, with its parameter values optimized using the development dataset. The values were tested from 1E-6 to 1, increasing by a factor of 10 at each step.
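The parameter sweep can be pictured with the following sketch. The text does not name the Gaussian kernel parameter, so the sketch assumes the common parameterization $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ and treats $\gamma$ as the tuned value; the symbol, the function name and the example vectors are assumptions.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2); gamma is an assumed parameter name."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Candidate values from 1E-6 to 1, each ten times the previous one
gamma_grid = [10.0 ** exponent for exponent in range(-6, 1)]

# Kernel values for one hypothetical pair of feature vectors across the grid
x_labeled = [1.0, 0.0, 2.0]
x_unlabeled = [0.0, 1.0, 2.0]
kernel_values = [gaussian_kernel(x_labeled, x_unlabeled, g) for g in gamma_grid]
```

In a tuning loop, each candidate value would be used to train the classifier on the training file and scored on the development file, keeping the value with the best development performance.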

Dynamic Bayesian network

We used the DBN model proposed by Hill et al. (2012), where the model is represented as:

Figure 4.2 The background equation of the DBN where X is the total data and is the parent genes of gene Xi (Adopted from Hill et al., (2012)).

Additionally, we prepared prior knowledge on PPIs to better infer the coefficients. The PPI information was the same as that collected in the unlabeled data prediction phase (Chapter 3, feature collections).

In sum, the data were prepared using the following files:

1. Prior_graph: An 82*82 matrix, where the 82 genes consist of 32 known mutant genes and 50 newly found genes.

2. Co-expression profile: An 82*8 matrix, where there are 82 genes across 8 different sample time points.

The parameter for the DBN was set to find all “full possible” interactions between the genes.

Time lagged correlation coefficient

Here the time-lagged correlation coefficient was calculated following the equation of Schmitt et al. (2004):

$$r_{ij}(\tau) = \frac{S_{ij}(\tau)}{\sqrt{S_{ii} \cdot S_{jj}}}$$

$$S_{ij}(\tau) = \sum_{t} \left(x_i(t) - \bar{x}_i\right)\left(x_j(t + \tau) - \bar{x}_j\right)$$

However, we tweaked the equation for calculating $S_{ij}(\tau)$ because the number of time points in our experiment was limited to 4. Here $r_{ij}(\tau)$ is the correlation between protein i and protein j with a time lag $\tau$, and $\bar{x}_i$ is the averaged expression value of protein i across all time points. The $\tau$ that gives the maximum value of $r_{ij}(\tau)$ is used as the lagged time for protein i and protein j.

Results

Gravitropism has been studied for more than a century, and many genes have been identified as gravity related. Genes related to gravity perception, gravity response and signal transduction have been studied separately. After a thorough literature review, a list of 27 genes was selected, including ADK1 (Young et al., 2006), AGR1 (Chen et al., 1998), ARG1 (Sedbrook et al., 1999), ARL2 (Harrison and Masson 2008), AUX1 (Bennett et al., 1996), AXR1, AXR2 (Leymaire et al., 1996), ARF9 (Roberts et al., 2007), IFL1 (Zhong and Ye, 2001), MAR1, MAR2 (Stanga et al., 2009), MDR1 (Noh et al., 2003), PGP4 (Terasaka et al., 2005), PIN1/PIN2/PIN3 (reviewed by Baldwin et al., 2012), PKS4 (Goyal et al., 2013), RCN1 (Smith et al., 2012), SGR1/2/3/4/5/7/8 (reviewed by Hashiguchi et al., 2012), TIR1 (Palme and Galweiler, 1999) and TT4 (Brown et al., 2001). To expand the scope of the functionally related genes, the five selected differentially expressed genes (WRKY18, WRKY26, WRKY33, ATAIB and BT2) were also introduced into the seed selection in addition to the 27 gravity related genes (summarized in Table 4.1). These 32 genes served as the “seed” samples for input to the semi-supervised classifier.

Prefilter of the seed genes

The classifier, on the other hand, aims to find functionally related genes from a pool of “unknown” genes, i.e. genes whose functions in gravitropism have not yet been studied. Such a pool is believed to contain gravity related genes. The microarray experiment described in Chapter 2 fits this requirement: its 350 differentially expressed genes can serve as the “unknown” genes.

The semi-supervised classifier accepts the “gravity seed” genes as inputs to search for functionally related genes among the differentially expressed genes. The classifier builds similarity matrices based on the feature dimensions, including PPI, literature text mining (shared PubMed ID, tf-idf information and relationship extraction), shared function annotations (Gene Ontology: molecular function, biological process and cellular component; KEGG and AraCyc), co-expression profiles (based on public microarray profiles), transcription factor family and transcription factor binding sites. The similarity matrices measure the functional closeness between a gene from the “unknown genes” and a gene from the “gravity seed genes”. The semi-supervised learning classifier returns the list of “unknown” genes, each associated with a similarity value. The higher the similarity value, the more likely the gene is functionally related to gravitropism.

One pre-filtering step was performed before applying the classifier to the “unknown” genes. The pre-filtering selected the genes with more than 8 types of features from the 350 gene pool, because predictions on genes with few features could have low precision. In total, 213 of the differentially expressed genes were selected and submitted to the classifier. The top 50 genes from the classifier were tagged as novel gravity related genes (Table 4.2). To elucidate the functional terms of these 50 selected genes, gene set enrichment analysis was also performed, limited to GO slim terms. GO slim terms are cut-down versions of GO terms, providing more general terms than the fine grained ones (Table 4.3). To avoid selecting overly general GO terms, terms associated with less than half of the set were selected (with p-value < 0.05).

Table 4.1. The seed genes that were previously identified as gravity related, specifically those involved in signal transduction.

Locus ID    Name    Locus ID    Name    Locus ID    Name
AT3G09820   ADK1    AT2G16640   MAR2    AT5G07100   WRKY26
AT5G57090   AGR1    AT3G28860   MDR1    AT2G38470   WRKY33
AT1G68370   ARG1    AT2G47000   PGP4    AT3G48360   BT2
AT1G59980   ARL2    AT1G73590   PIN1    AT4G37650   SGR7
AT2G38120   AUX1    AT5G57090   PIN2    AT2G26890   SGR8
AT1G05180   AXR1    AT1G70940   PIN3    AT3G62980   TIR1
AT1G54990   AXR2    AT5G04190   PKS4    AT5G46860   SGR3
AT4G23980   ARF9    AT1G25490   RCN1    AT5G39510   SGR4
AT5G60690   IFL1    AT3G54220   SGR1    AT2G01940   SGR5
AT3G46740   MAR1    AT1G31480   SGR2    AT4G31800   WRKY18
AT5G13930   TT4     AT2G46510   ATAIB

Table 4.2. Selected genes based on the semi-supervised learning.

Locus ID    Gene Name
AT2G39020   NATA2
AT1G31800   CYTOCHROME P450 (CYP97A3)
AT5G23020   2-ISOPROPYLMALATE SYNTHASE 2 (IMS2)
AT4G24770   31-KDA RNA BINDING PROTEIN (RBP31)
AT5G60600   4-HYDROXY-3-METHYLBUT-2-ENYL DIPHOSPHATE SYNTHASE (HDS)
AT4G33680   ABERRANT GROWTH AND DEATH 2 (AGD2)
AT4G25050   ACYL CARRIER PROTEIN 4 (ACP4)
AT5G48300   ADP GLUCOSE PYROPHOSPHORYLASE 1 (ADG1)
AT3G63410   ALBINO OR PALE GREEN MUTANT 1 (APG1)
AT4G30950   FATTY ACID DESATURASE 6 (FAD6)
AT4G27030   FATTY ACID DESATURASE A (FADA)
AT5G22500   FATTY ACID REDUCTASE 1 (FAR1)
AT1G09420   GLUCOSE-6-PHOSPHATE DEHYDROGENASE 4 (G6PD4)
AT1G42970   GLYCERALDEHYDE-3-PHOSPHATE DEHYDROGENASE B SUBUNIT (GAPB)
AT1G80600   HOPW1-1-INTERACTING 1 (WIN1)
AT1G68010   HYDROXYPYRUVATE REDUCTASE (HPR)
AT5G17420   IRREGULAR XYLEM 3 (IRX3)
AT1G80560   ISOPROPYLMALATE DEHYDROGENASE 2 (IMD2)
AT5G14760   L-ASPARTATE OXIDASE (AO)
AT1G03630   PROTOCHLOROPHYLLIDE OXIDOREDUCTASE C (POR C)
AT1G01090   PYRUVATE DEHYDROGENASE E1 ALPHA (PDH-E1 ALPHA)
AT2G01350   QUINOLINATE PHOSPHORIBOSYLTRANSFERASE (QPT)
AT2G26670   REVERSAL OF THE DET PHENOTYPE 4 (TED4)
AT4G34620   SMALL SUBUNIT RIBOSOMAL PROTEIN 16 (SSR16)
AT3G01180   STARCH SYNTHASE 2 (SS2)
AT4G35000   ASCORBATE PEROXIDASE 3 (APX3)
AT4G15560   CLOROPLASTOS ALTERADOS 1 (CLA1)
AT1G12520   COPPER CHAPERONE FOR SOD1 (CCS)
AT3G03630   CYSTEINE SYNTHASE 26 (CS26)
AT5G16710   DEHYDROASCORBATE REDUCTASE 1 (DHAR3)
AT5G51070   EARLY RESPONSIVE TO DEHYDRATION 1 (ERD1)
AT4G26300   EMBRYO DEFECTIVE 1027 (emb1027)
AT1G78630   EMBRYO DEFECTIVE 1473 (emb1473)
AT5G18570   EMBRYO DEFECTIVE 269 (EMB269)
AT1G74960   FATTY ACID BIOSYNTHESIS 1 (FAB1)
AT1G77590   LONG CHAIN ACYL-COA SYNTHETASE 9 (LACS9)
AT1G79230   MERCAPTOPYRUVATE SULFURTRANSFERASE 1 (MST1)
AT3G01120   METHIONINE OVERACCUMULATION 1 (MTO1)
AT2G05990   MOSAIC DEATH 1 (MOD1)
AT1G21640   NAD KINASE 2 (NADK2)
AT3G47450   NO ASSOCIATED 1 (NOA1)
AT2G28900   OUTER PLASTID ENVELOPE PROTEIN 16-1 (OEP16-1)
AT4G11830   PHOSPHOLIPASE D GAMMA 2 (PLDGAMMA2)
AT4G27440   PROTOCHLOROPHYLLIDE OXIDOREDUCTASE B (PORB)
AT2G24820   TRANSLOCON AT THE INNER ENVELOPE MEMBRANE OF CHLOROPLASTS 55-II (TIC55-II)
AT1G14610   TWIN 2 (TWN2)
AT5G42270   VARIEGATED 1 (VAR1)
AT2G30950   VARIEGATED 2 (VAR2)

Table 4.3. The most significant GO slim terms associated with the 50 selected genes.

GO Slim Term                            P-value     # of Genes
thylakoid                               4.77E-07    8
response to stress                      3.64E-02    8
carbohydrate metabolic process          3.91E-05    9
response to abiotic stimulus            7.03E-04    9
multicellular organismal development    6.73E-03    9
lipid metabolic process                 3.69E-07    10
membrane                                4.24E-02    12
transferase activity                    5.34E-05    15

The Gravity network

After the fifty new genes were identified by the classifier, possible relationships between genes could be inferred from the experimental data. The existing features only reflect knowledge of the relationships between gene pairs that have already been studied. To gain new insights into potential relationships between genes, expression profiles were used to infer the relationships. The expression profiles were retrieved from the gravity signal transduction microarray experiment. Besides the expression profiles, existing PPIs were also included in the relationship inference. Two networks were generated by two approaches: dynamic Bayesian network inference and time-lagged correlation coefficient.

Dynamic Bayesian network analysis

The dynamic Bayesian network inference method returned an n*n matrix (where n is the number of genes). Each element in the matrix represented the confidence of the predicted interaction between gene i and gene j. The higher the confidence value, the more robust the relationship. The distribution of the confidence values followed a normal distribution (Figure 4.2). Because some interactions between the selected genes were already known, the mean confidence value of these known interactions was calculated and was approximately 0.9. Based on this observation, the threshold for an interaction was set to 0.9. In total, the DBN analysis predicted 721 interactions between the 82 genes at a threshold of 0.9 (Figure 4.3).
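The thresholding step can be sketched as follows. The sketch assumes the confidence values are held in a square list-of-lists matrix indexed by gene position, that known interactions are given as index pairs, and that an edge is kept when its confidence is greater than or equal to the threshold; the matrix values, the pair list and the function names are hypothetical.

```python
def threshold_from_known(confidence, known_pairs):
    """Mean confidence over the interactions that are already known."""
    values = [confidence[i][j] for i, j in known_pairs]
    return sum(values) / len(values)

def select_edges(confidence, threshold):
    """Keep gene pairs (i, j), i != j, whose confidence reaches the threshold."""
    n = len(confidence)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and confidence[i][j] >= threshold]

# Hypothetical 3 x 3 confidence matrix and two known interactions
confidence = [
    [0.00, 0.95, 0.20],
    [0.10, 0.00, 0.85],
    [0.97, 0.30, 0.00],
]
known_pairs = [(0, 1), (1, 2)]
threshold = threshold_from_known(confidence, known_pairs)
edges = select_edges(confidence, threshold)
```

Setting the cutoff at the mean of the known interactions necessarily discards the known interactions that score below that mean, as the toy pair (1, 2) shows.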

Figure 4.2. The distribution histogram of the predicted values. A cutoff at 0.9 was chosen based on the mean confidence value of known interactions.

Figure 4.3. A general view of the network between 80 genes and 721 interactions.

Time-lagged correlation

The time-lagged correlation generated a network with 824 edges among 82 genes (Figure 4.5). For each gene pair, the lag of the regulation was taken as the time lag with the highest correlation score. The time lag varies from the closest time point to the largest time point. The distribution of the largest PCC values is shown in Figure 4.4. The threshold was set to 0.73, following the same logic used to choose the threshold for the dynamic Bayesian network.

Figure 4.4. The density distribution of the time-lagged PCC among the chosen genes.

Figure 4.5 A diagrammatic representation of the network generated using the time lagged correlation co-expression method.

Intersected network

The networks returned by the two methods overlapped by 76 genes and 182 edges (Table 4.4, Figure 4.6). The topology of the intersected network shows an ideal scale-free network property (Figure 4.6). A scale-free network follows the power-law distribution $Y = b X^{-a}$, where X is the degree and Y is the frequency of nodes with that degree. On a logarithmic scale, the degree distribution of an ideal scale-free network falls on a straight regression line, and the coefficient of determination measures how well the data fit the model.
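Intersecting two inferred networks amounts to a set operation on their edge lists. The sketch below assumes that edges are compared as undirected pairs, so that (a, b) and (b, a) count as the same interaction; the gene labels are placeholders and the function names are illustrative.

```python
def as_edge_set(edges):
    """Represent each edge as an unordered pair so direction is ignored."""
    return {frozenset(edge) for edge in edges}

def intersect_networks(edges_a, edges_b):
    """Return the shared edges and the genes that take part in them."""
    shared = as_edge_set(edges_a) & as_edge_set(edges_b)
    nodes = set()
    for edge in shared:
        nodes.update(edge)
    return shared, nodes

# Placeholder edge lists standing in for the two inferred networks
dbn_edges = [("geneA", "geneB"), ("geneC", "geneD"), ("geneE", "geneF")]
lagged_edges = [("geneB", "geneA"), ("geneD", "geneC"), ("geneG", "geneH")]
shared_edges, shared_genes = intersect_networks(dbn_edges, lagged_edges)
```

The node count of the intersected network is taken from the shared edges, so a gene present in both networks but without a shared edge does not appear in it.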

Figure 4.6. A diagrammatic representation of the intersected network. The colors of nodes represent the time points at which the genes were differentially expressed with red (2min), blue (4min), green (10min), yellow (30min) and grey (no significant changes at any time points).

Figure 4.7. The power-law fit to the distribution of degrees in the network. The R-squared score is 0.73.

113 Table 4.4. The predicted interactions in the intersected network. Interactor 1 Interactor 2 Interactor 1 (Gene Interactor 2 (TAIR ID) (TAIR ID) Name) (Gene Name) AT3G62980 AT3G46740 TIR1 TOC75-III AT5G48300 AT3G62980 ADG1 TIR1 AT4G15560 AT2G01940 CLA1 SGR5 AT4G11830 AT2G26670 PLDGAMMA2 TED4 AT5G42270 AT2G01350 VAR1 QPT AT4G25050 AT2G05990 ACP4 MOD1 AT5G13930 AT1G05180 TT4 AXR1 AT1G31480 AT1G70940 SGR2 PIN3 AT4G25050 AT4G34620 ACP4 SSR16 AT5G42270 AT5G16710 VAR1 DHAR3 AT1G80600 AT5G45930 WIN1 CHLI2 AT1G70940 AT1G80600 PIN3 WIN1 AT1G25490 AT5G42270 RCN1 VAR1 AT4G11830 AT1G79230 PLDGAMMA2 MST1 AT2G05990 AT2G16640 MOD1 TOC132 AT5G18570 AT5G16710 EMB269 DHAR3 AT1G70940 AT5G60600 PIN3 HDS AT5G18570 AT5G42270 EMB269 VAR1 AT1G80600 AT2G01350 WIN1 QPT AT4G23980 AT1G80600 ARF9 WIN1 AT1G31800 AT5G60690 CYP97A3 REV AT3G28860 AT2G16640 ABCB19 TOC132 AT1G80600 AT2G30950 WIN1 VAR2 AT5G46860 AT2G30950 VAM3 VAR2 AT4G35000 AT5G13930 APX3 TT4 AT4G27030 AT2G26890 FADA GRV2 AT5G42270 AT5G18570 VAR1 EMB269 AT5G17420 AT5G46860 IRX3 VAM3 AT4G37650 AT3G01180 SHR SS2 AT3G28860 AT1G05180 ABCB19 AXR1 AT1G25490 AT4G26300 RCN1 emb1027 AT3G62980 AT1G80560 TIR1 IMD2 AT1G25490 AT1G21640 RCN1 NADK2 AT1G80600 AT1G14610 WIN1 TWN2 AT4G30950 AT1G03630 FAD6 POR C AT5G16710 AT4G11830 DHAR3 PLDGAMMA2 AT2G05990 AT3G03630 MOD1 CS26 AT1G74960 AT2G39020 FAB1 NATA2 AT1G73590 AT5G46860 PIN1 VAM3 AT5G18570 AT3G62980 EMB269 TIR1 AT5G51070 AT4G30950 ERD1 FAD6

114 Table 4.4. Continued Interactor 1 Interactor 2 Interactor 1 (Gene Interactor 2 (TAIR ID) (TAIR ID) Name) (Gene Name) AT1G68010 AT1G80560 HPR IMD2 AT4G25050 AT3G62980 ACP4 TIR1 AT4G34620 AT5G48300 SSR16 ADG1 AT5G13930 AT3G47450 TT4 NOA1 AT4G15560 AT5G17420 CLA1 IRX3 AT5G48300 AT2G39020 ADG1 NATA2 AT3G47450 AT1G05180 NOA1 AXR1 AT1G25490 AT1G74960 RCN1 FAB1 AT1G01090 AT4G33680 PDH-E1 ALPHA AGD2 AT4G34620 AT1G80560 SSR16 IMD2 AT2G39020 AT5G18570 NATA2 EMB269 AT5G13930 AT2G16640 TT4 TOC132 AT4G23980 AT5G18570 ARF9 EMB269 AT1G70940 AT5G42270 PIN3 VAR1 AT1G79230 AT2G39020 MST1 NATA2 AT5G60690 AT1G80600 REV WIN1 AT1G74960 AT1G80560 FAB1 IMD2 AT1G54990 AT4G37650 AXR4 SHR AT4G25050 AT2G39020 ACP4 NATA2 AT2G05990 AT4G33680 MOD1 AGD2 AT2G38470 AT3G28860 WRKY33 ABCB19 AT1G12520 AT2G30950 CCS VAR2 AT1G79230 AT5G60600 MST1 HDS AT1G03630 AT5G13930 POR C TT4 AT4G30950 AT1G59980 FAD6 ARL2 AT1G68370 AT1G31800 ARG1 CYP97A3 AT4G27440 AT3G47450 PORB NOA1 AT2G24820 AT4G33680 TIC55-II AGD2 AT4G34620 AT2G39020 SSR16 NATA2 AT4G37650 AT5G42270 SHR VAR1 AT5G16710 AT1G59980 DHAR3 ARL2 AT1G73590 AT1G21640 PIN1 NADK2 AT1G68370 AT2G01350 ARG1 QPT AT4G25050 AT5G48300 ACP4 ADG1 AT2G26670 AT3G62980 TED4 TIR1 AT4G11830 AT5G39510 PLDGAMMA2 SGR4 AT1G54990 AT1G79230 AXR4 MST1 AT2G01940 AT5G16710 SGR5 DHAR3 AT2G05990 AT1G05180 MOD1 AXR1

Table 4.4. Continued

Interactor 1  Interactor 2  Interactor 1  Interactor 2
(TAIR ID)     (TAIR ID)     (Gene Name)   (Gene Name)
AT4G34620     AT2G05990     SSR16         MOD1
AT3G28860     AT1G78630     ABCB19        emb1473
AT1G59980     AT1G74960     ARL2          FAB1
AT5G13930     AT3G03630     TT4           CS26
AT4G30950     AT1G80600     FAD6          WIN1
AT3G47450     AT5G13930     NOA1          TT4
AT1G21640     AT1G25490     NADK2         RCN1
AT4G25050     AT1G79230     ACP4          MST1
AT1G68370     AT3G28860     ARG1          ABCB19
AT2G05990     AT3G47450     MOD1          NOA1
AT1G68010     AT1G09420     HPR           G6PD4
AT4G37650     AT1G21640     SHR           NADK2
AT1G59980     AT2G05990     ARL2          MOD1
AT3G47450     AT3G62980     NOA1          TIR1
AT5G60690     AT2G26890     REV           GRV2
AT1G68370     AT5G14760     ARG1          AO
AT1G31480     AT1G74960     SGR2          FAB1
AT1G70940     AT1G79230     PIN3          MST1
AT4G30950     AT3G28860     FAD6          ABCB19
AT2G39020     AT1G73590     NATA2         PIN1
AT2G47000     AT2G30950     ABCB4         VAR2
AT3G48360     AT1G74960     bt2           FAB1
AT5G17420     AT1G25490     IRX3          RCN1
AT3G01180     AT3G62980     SS2           TIR1
AT1G25490     AT5G46860     RCN1          VAM3
AT4G30950     AT5G45930     FAD6          CHLI2
AT1G73590     AT5G16710     PIN1          DHAR3
AT1G74960     AT5G18570     FAB1          EMB269
AT4G26300     AT2G39020     emb1027       NATA2
AT3G01120     AT5G60600     MTO1          HDS
AT1G74960     AT5G16710     FAB1          DHAR3
AT5G16710     AT4G30950     DHAR3         FAD6
AT4G26300     AT5G42270     emb1027       VAR1
AT3G01180     AT5G13930     SS2           TT4
AT5G46860     AT4G26300     VAM3          emb1027
AT1G59980     AT1G80560     ARL2          IMD2
AT5G60690     AT1G31800     REV           CYP97A3
AT4G34620     AT3G62980     SSR16         TIR1
AT3G63410     AT2G46510     APG1          AIB

Table 4.4. Continued

Interactor 1  Interactor 2  Interactor 1  Interactor 2
(TAIR ID)     (TAIR ID)     (Gene Name)   (Gene Name)
AT1G80600     AT5G60600     WIN1          HDS
AT1G59980     AT1G80600     ARL2          WIN1
AT4G27030     AT5G17420     FADA          IRX3
AT2G05990     AT4G23980     MOD1          ARF9
AT2G01940     AT4G26300     SGR5          emb1027
AT1G21640     AT4G26300     NADK2         emb1027
AT1G54990     AT1G03630     AXR4          POR C
AT2G01940     AT5G46860     SGR5          VAM3
AT2G01350     AT1G05180     QPT           AXR1
AT2G01350     AT3G01180     QPT           SS2
AT1G80600     AT5G18570     WIN1          EMB269
AT2G01940     AT1G25490     SGR5          RCN1
AT1G21640     AT1G73590     NADK2         PIN1
AT4G15560     AT1G74960     CLA1          FAB1
AT1G21640     AT5G16710     NADK2         DHAR3
AT5G16710     AT4G26300     DHAR3         emb1027
AT1G05180     AT3G09820     AXR1          ADK1
AT2G05990     AT5G13930     MOD1          TT4
AT3G47450     AT3G03630     NOA1          CS26
AT1G80600     AT2G39020     WIN1          NATA2
AT1G59980     AT5G42270     ARL2          VAR1
AT5G46860     AT1G59980     VAM3          ARL2
AT4G24770     AT1G78630     RBP31         emb1473
AT5G60690     AT1G80560     REV           IMD2
AT4G30950     AT2G30950     FAD6          VAR2
AT5G46860     AT5G17420     VAM3          IRX3
AT2G38470     AT2G28900     WRKY33        OEP16-1
AT1G03630     AT4G35000     POR C         APX3
AT1G25490     AT5G16710     RCN1          DHAR3
AT4G15560     AT5G42270     CLA1          VAR1
AT1G70940     AT5G51070     PIN3          ERD1
AT1G31480     AT5G18570     SGR2          EMB269
AT1G03630     AT3G01180     POR C         SS2
AT4G24770     AT2G30950     RBP31         VAR2
AT2G16640     AT1G05180     TOC132        AXR1
AT2G01940     AT5G42270     SGR5          VAR1
AT4G37650     AT1G54990     SHR           AXR4
AT1G73590     AT4G11830     PIN1          PLDGAMMA2
AT3G28860     AT1G80560     ABCB19        IMD2

Table 4.4. Continued

Interactor 1  Interactor 2  Interactor 1  Interactor 2
(TAIR ID)     (TAIR ID)     (Gene Name)   (Gene Name)
AT5G46860     AT1G79230     VAM3          MST1
AT4G34620     AT3G47450     SSR16         NOA1
AT3G48360     AT1G59980     bt2           ARL2
AT1G21640     AT5G46860     NADK2         VAM3
AT1G70940     AT4G11830     PIN3          PLDGAMMA2
AT3G09820     AT1G80600     ADK1          WIN1
AT4G25050     AT4G27440     ACP4          PORB
AT1G25490     AT1G14610     RCN1          TWN2
AT2G05990     AT4G27440     MOD1          PORB
AT2G16640     AT3G47450     TOC132        NOA1
AT1G31480     AT5G42270     SGR2          VAR1
AT3G48360     AT4G26300     bt2           emb1027
AT1G03630     AT4G27440     POR C         PORB
AT1G54990     AT1G74960     AXR4          FAB1
AT5G46860     AT1G03630     VAM3          POR C

Hub and bottleneck genes

As described in Figure 4.1, two kinds of key players could be identified from a connected network: hub genes and bottleneck genes. The degree and betweenness centrality of every gene in the intersected network were calculated, and the top 20 most connected genes were selected (Table 4.5).

Table 4.5. The hub genes (degree ≥ 5) identified in the intersected network. Newly identified genes are indicated in blue.

Locus ID     Betweenness Centrality  Degree  Gene Name
AT5G42270    0.2336871               11      VAR1
AT1G80600    0.19555796              10      WIN1
AT2G05990    0.22423402              9       MOD1
AT1G68370    0.15092206              8       ARG1
AT1G74960    0.09396796              8       FAB1
AT4G23980    0.17525851              8       ARF9
AT2G39020    0.08350465              8       NATA2
AT1G70940    0.11237588              8       PIN3
AT3G01180    0.1197905               7       SS2
AT1G59980    0.21746269              7       ARL2
AT1G25490    0.03984455              6       RCN1
AT5G18570    0.12342811              6       EMB269
AT5G13930    0.05400621              6       TT4
AT5G16710    0.02385875              5       DHAR3
AT4G26300    0.03717202              5       emb1027
AT5G45930    0.06326749              5       CHLI2
AT5G46860    0.01337181              5       SGR3
AT3G47450    0.0367181               5       NOA1
AT1G79230    0.01898546              5       MST1
AT1G21640    0.0041773               5       NADK2
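As an illustration of how the two measures can be obtained, the following Python sketch computes degree and normalised betweenness centrality for an unweighted, undirected graph using the shortest-path accumulation of Brandes (2001). It is a minimal sketch rather than the software used for Table 4.5: the six edges are a small subset of Table 4.4 chosen for illustration, edge direction is ignored, and the normalisation by the number of node pairs is an assumption, so the values will not reproduce those in the table.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map, merging duplicate and reversed pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def betweenness(adj):
    """Betweenness centrality (Brandes 2001) for an unweighted, undirected graph.

    Assumption: scores are normalised by the number of node pairs,
    (n - 1)(n - 2) / 2, so they fall between 0 and 1.
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies in order of decreasing distance.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Every undirected pair is visited from both ends, hence no factor of 2 here.
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# Illustrative subset of Table 4.4 (gene names only).
edges = [
    ("SGR2", "PIN3"), ("PIN3", "WIN1"), ("PIN3", "HDS"),
    ("WIN1", "CHLI2"), ("WIN1", "QPT"), ("VAR1", "QPT"),
]
adj = build_graph(edges)
degree = {gene: len(neighbours) for gene, neighbours in adj.items()}
bc = betweenness(adj)

for gene in sorted(adj, key=lambda g: (-degree[g], -bc[g])):
    print(f"{gene:6s} degree={degree[gene]} betweenness={bc[gene]:.3f}")
```

In this toy graph PIN3 and WIN1 have the same degree but different betweenness, which is the distinction between hub and bottleneck genes drawn above.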

Discussion

The experiment and analysis began with the discovery of additional candidate gravity related genes and were then extended to identify possible relationships among the newly found genes and those already identified. Two features were prepared from the microarray experiment: the Pearson correlation coefficient (PCC) and the mutual rank. Both features describe the correlation pattern between two genes, but they suit different applications. Co-expression has been reported to be a good indicator of protein function and PPI (Jansen et al. 2002; Jansen et al. 2003) and to be useful for network prediction (Zhang & Horvath 2005). Although mRNA expression level is often positively linked to protein levels, Ostlund & Sonnhammer (2012) showed that using a subset of high quality mRNA improves the correlation between expression levels of mRNA and proteins. On the other hand, constantly correlated pairs of genes/proteins always show high correlation, whereas transiently correlated pairs do not (Jansen et al. 2002). For example, the co-expression pattern among ribosomal genes, which are believed to be consistently correlated with one another, differs from the co-expression pattern between kinases and their substrates (Fig 4.8).

Figure 4.8. The comparisons between PCC values among three different types of correlation: continuous (red), random (blue) and transient (green). The continuous PCC values were calculated from groups of ribosomal RNAs. The random PCC values were calculated based on 100 pairs of genes. The transient PCC values were retrieved from pairs of kinases and their substrates. All PCC values were calculated based on the database from TAIR.

Thus, both features were included in the final feature vector.
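To make the two co-expression features concrete, the sketch below computes the PCC between two expression profiles and a mutual rank for the same pair. The mutual rank is taken here to be the geometric mean of the two directional PCC ranks, as defined for the ATTED-II database (Obayashi et al. 2007); whether exactly this variant was used for the feature vector is an assumption. The gene names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Note: a constant profile has zero variance and would raise ZeroDivisionError.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mutual_rank(expr, gene_a, gene_b):
    """Geometric mean of the rank of b in a's list and of a in b's list.

    Rank 1 means the most strongly (positively) correlated partner.
    """
    def rank_of(target, query):
        scores = {g: pearson(expr[query], expr[g]) for g in expr if g != query}
        ordered = sorted(scores, key=scores.get, reverse=True)
        return ordered.index(target) + 1

    return math.sqrt(rank_of(gene_b, gene_a) * rank_of(gene_a, gene_b))


# Hypothetical expression profiles (five conditions per gene).
expr = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10.5],
    "g3": [5, 3, 4, 1, 2],
    "g4": [1, 3, 2, 5, 4],
}

pcc = pearson(expr["g1"], expr["g2"])
mr = mutual_rank(expr, "g1", "g2")
print(f"PCC(g1, g2) = {pcc:.3f}, mutual rank = {mr:.1f}")
```

Because the mutual rank depends on how a pair ranks relative to all other genes, it is less sensitive than the raw PCC to genes that correlate moderately with everything.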

Genes selected by the semi-supervised learning classifier were believed to be gravity related. These genes were chosen because they were “functionally similar” to those already identified as gravity related. However, the known gravity related genes cover the whole gravitropism process, including perception, signal transduction and response. Because the scope of this experiment was restricted to gravity signal transduction, the newly found genes might form too broad a set for signal transduction alone.

Nevertheless, a gene set enrichment analysis was performed on the newly found genes to look for potential patterns among them (Table 4.3). Out of 50 genes, 49 were associated with a GO term. As can be seen from the enriched set, “thylakoid” and “membrane” are the two main subcellular locations where these selected genes reside. The enriched Biological Process GO terms, on the other hand, were related to response to stress and abiotic stimuli. One of the most enriched Molecular Function terms was “transferase activity”, which indicates the involvement of enzyme activity. Besides the GO terms shown in Table 4.3, several other terms were significant (p-value < 0.05) but more generally defined. Interestingly, all genes were targeted to plastids and the cytoplasm.
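Over-representation of a GO term in a gene list is commonly assessed with a hypergeometric test, which is the approach implemented in GOstats (Falcon & Gentleman 2007). The sketch below shows that calculation under the assumption that such a test underlies the p-values quoted here. The background size and annotation counts are hypothetical; only the list size of 50 comes from the text.

```python
from math import comb


def enrichment_p_value(background, annotated, drawn, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    background: number of genes in the reference set
    annotated:  number of background genes carrying the GO term
    drawn:      size of the gene list being tested
    hits:       number of genes in the list carrying the GO term
    """
    upper = min(annotated, drawn)
    total = comb(background, drawn)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 400 of them annotated with
# the term, and 8 of the 50 predicted genes carrying it.
p_value = enrichment_p_value(background=20000, annotated=400, drawn=50, hits=8)
print(f"p = {p_value:.2e}")
```

When many GO terms are tested at once, a multiple-testing correction would normally be applied to these p-values before terms are called significant.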

These enriched GO terms could inform future research goals by indicating which categories of genes might be interesting to study. The number of genes discovered during this data mining stage varied with the selection criteria. There is still no agreed method or equation for determining the optimal size of the predicted list. However, the number of selected genes does affect the downstream step of building the network: it affects not only the network construction itself, but also has a strong impact on performance and computational cost.

A complex network was inferred using two methods: the DBN and the time-lagged correlation coefficient. The two methods have different strengths and weaknesses. The advantage of the DBN is that it uses prior information from existing PPIs to estimate its parameters, whereas the lagged correlation depends only on the expression profiles. On the other hand, the DBN assumes that the parent gene set of a gene acts only at the previous time step, and so does not capture regulation over longer time spans.

The DBN only accepts expression profiles with non-zero values; zeros cause a problem in the SVD (singular value decomposition). A pseudo-value of 0.01 was therefore assigned to all zero values to avoid this numerical problem. Training the DBN took over 54,000 seconds (more than 15 hours) on a quad-core (3.3 GHz) CPU. A comparative analysis with only 20 genes in the test set took only 45 seconds on the same computer. It follows that the time cost grows hyper-exponentially with the number of genes in the test set, because the number of possible “parent sets” of each node grows hyper-exponentially with the number of genes. For each of the 80 genes, the number of possible parent sets is up to 6E23 combinations. Although the program itself eliminates false combinations, the computational cost is still huge. For this reason, only a selected list of genes was submitted to the interaction prediction. By a rough estimate, the computation for a list of 350 genes (the size of our initial set of differentially expressed genes) could take up to 85 months!
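The figure of 6E23 is consistent with treating every subset of the other 79 genes as a candidate parent set for a node, since 2^79 is approximately 6.0E23. The sketch below checks that arithmetic together with the conversion of the training time. The subset-counting formula is an inference from the quoted number, not something stated in the text, and the function name is hypothetical.

```python
def parent_set_count(n_genes):
    """Candidate parent sets for one node: every subset of the other n - 1 genes.

    Assumption: no limit is placed on the number of parents per node.
    """
    return 2 ** (n_genes - 1)


sets_80 = parent_set_count(80)   # 80-gene network
sets_20 = parent_set_count(20)   # 20-gene comparison run

train_hours = 54_000 / 3600      # reported training time in hours
slowdown = 54_000 / 45           # ratio of the two reported running times

print(f"parent sets per node, 80 genes: {sets_80:.2e}")
print(f"parent sets per node, 20 genes: {sets_20:.2e}")
print(f"training time: {train_hours:.0f} h, slowdown vs 20 genes: {slowdown:.0f}x")
```

The observed slowdown between the 20-gene and 80-gene runs is far smaller than the growth in the raw count of parent sets, which fits the remark that the program discards many combinations before evaluating them.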

The time-lagged correlation is fast and easy to calculate. However, the co-expression calculation takes the time lag to be the one that yields the highest correlation coefficient. The time lag therefore varies from one pair of genes to another, which may need further examination.
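The lag selection described above can be sketched as follows: for each candidate lag, one profile is shifted against the other, the PCC is computed on the overlapping time points, and the lag with the highest coefficient is kept. This is a minimal sketch, not the implementation used in this work; the two profiles are hypothetical, the signed coefficient is maximised (using its absolute value would also admit negative regulation), and the minimum of three overlapping points is an arbitrary choice.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_lag(x, y, max_lag, min_points=3):
    """Return (lag, r) maximising the PCC between x[t] and y[t + lag]."""
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        xs, ys = x[:len(x) - lag], y[lag:]
        if len(xs) < min_points:
            break  # too few overlapping time points to correlate
        r = pearson(xs, ys)
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical profiles: the target repeats the regulator one step later.
regulator = [0, 1, 0, -1, 0, 1, 0, -1]
target = [-1, 0, 1, 0, -1, 0, 1, 0]

lag, r = best_lag(regulator, target, max_lag=3)
print(f"best lag = {lag}, r = {r:.3f}")
```

With only a few time points, each additional lag removes a point from the overlap, so correlations at longer lags rest on less data and are correspondingly less reliable.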

By comparing the results returned by the time-lagged correlation co-expression and the DBN analysis, we generated a comprehensive network from the two approaches; the intersected interactions, those returned by both, are the most interesting for further investigation. The interaction network not only suggested the topological structure of the interactions among genes, but also provided information on key players in the network.
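The intersection step amounts to keeping only the edges reported by both methods. A minimal sketch follows, with placeholder gene names and hypothetical edge lists. Whether edge direction is retained when the two networks are intersected is not stated in the text, so it is exposed here as an option.

```python
def intersect_networks(edges_a, edges_b, directed=False):
    """Return the set of edges present in both edge lists.

    With directed=False, (a, b) and (b, a) are treated as the same edge.
    """
    key = tuple if directed else frozenset
    return {key(e) for e in edges_a} & {key(e) for e in edges_b}


# Hypothetical predictions from the two methods (placeholder gene names).
dbn_edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD")]
lag_edges = [("geneB", "geneA"), ("geneC", "geneD"), ("geneA", "geneD")]

shared_undirected = intersect_networks(dbn_edges, lag_edges)
shared_directed = intersect_networks(dbn_edges, lag_edges, directed=True)

print(len(shared_undirected), "shared edges ignoring direction")
print(len(shared_directed), "shared edges keeping direction")
```

Requiring agreement between two methods with different assumptions trades coverage for confidence: edges found by only one method are dropped even if they are real.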

Among the 20 most important genes, seven were previously known to be gravity related: ARG1, ARF9, PIN3, ARL2, RCN1, TT4 and SGR3. The other thirteen genes were newly identified. VAR1 contains a conserved ATPase motif and is located in the thylakoid membrane, which is related to a putative site of gravitropism. Interestingly, VAR1 has also been identified in a related analysis of the GPS treatment (Schenck et al., 2013). WIN1 encodes an ethylene response factor that regulates wax/cutin biosynthesis (Kannangara et al., 2007). It includes an AP2 domain similar to that of RAP2.4, which has been shown to negatively regulate gravitropism (Lin et al., 2008). FAB1, on the other hand, has been identified as involved in auxin transporter recycling, which is tightly linked to the gravitropic response (Hirano and Sato 2011). Much more information and detailed annotation are still needed to fully understand these selected genes. However, this is a promising direction for elucidating more gravity related genes and possible interactions among them.

CHAPTER 5: CONCLUSIONS

This dissertation aimed to answer two questions: What genes are involved in gravitropic signal transduction, and how do those genes interact? I have focused on three methodologies: a transcriptome analysis of gravitropic signal transduction, a semi-supervised learning method for gene identification, and a network analysis.

These methodologies generated three sets of results. (A) The transcriptome analysis of the GPS response generated a list of 318 genes that were differentially expressed between plants reoriented with respect to gravity and the vertical controls, across four time points. Eight genes were uniquely differentially expressed at 2 min. The remaining genes were differentially expressed at 4 min, 10 min and 30 min. Because I was interested primarily in the early signaling events, I focused on the genes differentially expressed at 4 and 10 min, assessing their functional annotation terms and possible interactions. Based on their expression profiles and GO terms, five transcription factors, WRKY18, WRKY26, WRKY33, BT2 and ATAIB, were selected for further study, including phenotypic analysis of mutants defective in these genes and functional characterization of the role of the genes during GPS treatment. (B) In addition to the transcriptome analysis, a systems approach was adopted to identify more genes involved in gravitropic signal transduction. A semi-supervised learning method was developed and used to predict related genes. A collection of heterogeneous annotation features and genes identified as important for gravitropic signal transduction were used to identify genes that are ‘functionally similar’ to the known genes, and thus may be involved in gravitropic signaling. Using this method, fifty novel genes were found. (C) Finally, a gravity specific network was generated by incorporating data from both the transcriptome analysis and the semi-supervised learning method. The network was constructed using two complementary prediction methods, a dynamic Bayesian network and a time-lagged correlation coefficient. To increase confidence in the prediction strength of the network, a final network was constructed from the interactions predicted by both methods. This intersected network not only provided the interaction information but, more importantly, identified 20 hub and bottleneck genes. Out of these 20 genes, 13 had not been previously identified as involved in gravitropic signal transduction.

Combined, these data provide a launching point for further study, including phenotypic analysis of plants defective in the genes identified, an in-depth characterization of the predicted genes and their specific roles in gravitropic signaling, and experimental validation of the predicted interactions. The merit of the project is that it provides a dynamic platform that can absorb additional information as it becomes available to fine-tune the results. For example, during the semi-supervised learning, new information could serve as a new feature to improve the selection of genes. Moreover, the interactome prediction could incorporate other expression profiles, e.g. other transcriptome and proteomic profiles, to calculate the correlation of gene pairs.

More analysis and research are needed to unravel the mystery of gravitropic signal transduction, and this project provides a framework that can generate hypotheses to help refine the study of this biological process.

REFERENCES

Abe H, Urao T, Ito T, Seki M, Shinozaki K, Yamaguchi-Shinozaki K (2003) Arabidopsis AtMYC2 (bHLH) and AtMYB2 (MYB) function as transcriptional activators in abscisic acid signaling. The Plant Cell 15: 63–78.

Ahmad KF, Melnick A, Lax S, Bouchard D, Liu J, Kiang C-L, Mayer S, Takahashi S, Licht JD, Privé GG (2003) Mechanism of SMRT corepressor recruitment by the BCL6 BTB domain. Molecular Cell 12: 1551–1564.

Alonso JM, Stepanova AN, Leisse TJ, Kim CJ, Chen H, Shinn P, Stevenson DK, Zimmerman J, Barajas P, Cheuk R, et al (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301: 653–657.

Augenlicht LH, Kobrin D (1982) Cloning and screening of sequences expressed in a mouse colon tumor. Cancer Research 42: 1088–1093.

Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval. ACM Press, New York.

Baldwin KL, Strohm AK, Masson PH (2013) Gravity sensing and signal transduction in vascular plant primary roots. American Journal of Botany 100: 126–142.

Bao F, Shen J, Brady SR, Muday GK, Asami T, Yang Z (2004) Brassinosteroids interact with auxin to promote lateral root development in Arabidopsis. Plant Physiology 134: 1624–1631.

Bhardwaj N, Gerstein M, Lu H (2010) Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique. BMC Bioinformatics 11: S6.

Birkenbihl RP, Diezel C, Somssich IE (2012) Arabidopsis WRKY33 is a key transcriptional regulator of hormonal and metabolic responses toward Botrytis cinerea infection. Plant Physiology 159: 266–285.

Bisgrove SR (2008) The roles of microtubules in tropisms. Plant Science 175: 747–755.

Blancaflor EB, Hasenstein KH (1997) The organization of the actin cytoskeleton in vertical and graviresponding primary roots of maize. Plant Physiology 113: 1447–1455.

Blancaflor E, Fasano J, Gilroy S (1998) Mapping the Functional Roles of Cap Cells in the Response of Arabidopsis Primary Roots to Gravity. Plant Physiology 116: 213–222.

De Bodt S, Proost S, Vandepoele K, Rouzé P, Van De Peer Y (2009) Predicting protein-protein interactions in Arabidopsis thaliana through integration of orthology, gene ontology and co-expression. BMC Genomics 10: 288.

Boonsirichai K, Sedbrook JC, Chen RJ, Gilroy S, Masson H (2003) Altered response to gravity is a peripheral membrane protein that modulates gravity-induced cytoplasmic alkalinization and lateral auxin transport in plant statocytes. Plant Cell 15: 2612–2625.

Brandes U (2001) A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25: 163–177.

Breitling R, Armengaud P, Amtmann A, Herzyk P (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Letters 573: 83–92.

Brown KR, Jurisica I (2007) Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biology 8: R95.

Buer CS, Muday GK (2004) The transparent testa4 mutation prevents flavonoid synthesis and alters auxin transport and the response of Arabidopsis roots to gravity and light. The Plant Cell 16: 1191–205.

Buer CS, Sukumar P, Muday GK (2006) Ethylene modulates flavonoid accumulation and gravitropic responses in roots of Arabidopsis. Plant Physiology 140: 1384–1396.

Bülow L, Steffens NO, Galuschka C, Schindler M, Hehl R (2006) AthaMap: from in silico data to real transcription factor binding sites. In Silico Biology 6: 243–252.

Bunescu RC, Mooney RJ (2005) Subsequence Kernels for Relation Extraction. In Advances in neural information processing systems, pp. 171-178.

Calvo B, Larrañaga P, Lozano JA (2007) Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters 28: 2375–2384.

Caspar T, Huber SC, Somerville C (1985) Alterations in growth, photosynthesis, and respiration in a starchless mutant of Arabidopsis thaliana (L.) deficient in chloroplast phosphoglucomutase Activity. Plant Physiology 79: 11–17.

Cerulo L, Elkan C, Ceccarelli M (2010) Learning gene regulatory networks from only positive and unlabeled data. BMC Bioinformatics 11: 228.

Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3): 27.

Chen H, Lai Z, Shi J, Xiao Y, Chen Z, Xu X (2010) Roles of Arabidopsis WRKY18, WRKY40 and WRKY60 transcription factors in plant responses to abscisic acid and abiotic stress. BMC Plant Biology 10: 281.

Chen JY, Youn E, Mooney SD (2009) Connecting protein interaction data, mutations, and disease using bioinformatics. In Computational Systems Biology, pp. 449-461, Humana Press.

Clore AM (2013) Cereal grass pulvini: agronomically significant models for studying gravitropism signaling and tissue polarity. American Journal of Botany 100: 101–110.

Clore AM, Doore SM, Tinnirello SMN (2008) Increased levels of reactive oxygen species and expression of a cytoplasmic aconitase/iron regulatory protein 1 homolog during the early response of maize pulvini to gravistimulation. Plant Cell and Environment 31: 144–158.

Correll MJ, Kiss JZ (2002) Interactions between gravitropism and phototropism in plants. Journal of Plant Growth Regulation 21: 89–101.

Costello JC, Dalkilic MM, Beason SM, Gehlhausen JR, Patwardhan R, Middha S, Eads BD, Andrews JR (2009) Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biology 10: R97.

Cozzetto D, Buchan DW, Bryson K, Jones DT (2013) Protein function prediction by massive integration of evolutionary analyses and multiple data sources. BMC Bioinformatics 14 Suppl 3: S1.

Czechowski T, Stitt M, Altmann T, Udvardi MK, Scheible W-R (2005) Genome-wide identification and testing of superior reference genes for transcript normalization in Arabidopsis. Plant Physiology 139: 5–17.

Denis F, Gilleron R, Laurent A, Tommasi M (2003). Text classification and co-training from positive and unlabeled examples. In Proceedings of the ICML 2003 workshop: the continuum from labeled to unlabeled data, pp. 80-87.

Djebbari A, Quackenbush J (2008) Seeded Bayesian Networks: constructing genetic networks from microarray data. BMC Systems Biology 2: 57.

Du Z, Li L, Chen C-F, Yu PS, Wang JZ (2009) G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery. Nucleic Acids Research 37: W345–349.

Elkan C, Noto K (2008) Learning classifiers from only positive and unlabeled data. Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220.

Eulgem T, Rushton PJ, Robatzek S, Somssich IE (2000) The WRKY superfamily of plant transcription factors. Trends in Plant Science 5: 199–206.

Eulgem T, Somssich IE (2007) Networks of WRKY transcription factors in defense signaling. Current Opinion in Plant Biology 10: 366–371.

Falcon S, Gentleman R (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23: 257–258.

Fasano JM, Swanson SJ, Blancaflor EB, Dowd PE, Kao TH, Gilroy S (2001) Changes in root cap pH are required for the gravity response of the Arabidopsis root. The Plant Cell 13: 907–921.

Fortney K, Kotlyar M, Jurisica I (2010) Inferring the functions of longevity genes with modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome Biology 11: R13.

Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, et al (2013) STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research 41: D808–815.

Friedman H, Vos JW, Hepler PK, Meir S, Halevy AH, Philosoph-Hadas S (2003) The role of actin filaments in the gravitropic response of snapdragon flowering shoots. Planta 216: 1034–1042.

Friedman N, Geiger D, Goldszmidt M (1997) Bayesian Network Classifiers. Machine Learning 29: 131–163.

Friml J, Vieten A, Sauer M, Weijers D, Schwarz H, Hamann T, Offringa R, Jürgens G (2003) Efflux-dependent auxin gradients establish the apical-basal axis of Arabidopsis. Nature 426: 147–153.

Fukaki H, Fujisawa H, Tasaka M (1996) SGR1, SGR2, SGR3: novel genetic loci involved in shoot gravitropism in Arabidopsis thaliana. Plant Physiology 110: 945–955.

Fukaki H, Wysocka-Diller J, Kato T, Fujisawa H, Benfey PN, Tasaka M (1998) Genetic evidence that the endodermis is essential for shoot gravitropism in Arabidopsis thaliana. The Plant Journal 14: 425–430.

Furney SJ, Calvo B, Larrañaga P, Lozano JA, Lopez-Bigas N (2008) Prioritization of candidate cancer genes--an aid to oncogenomic studies. Nucleic Acids Research 36: e115.

Galland P, Pazur A (2005) Magnetoreception in plants. Journal of Plant Research 118: 371–389.

Gaudet P, Livstone MS, Lewis SE, Thomas PD (2011) Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics 12: 449–462.

Geldner N, Friml J, Stierhof YD, Jürgens G, Palme K (2001) Auxin transport inhibitors block PIN1 cycling and vesicle trafficking. Nature 413: 425–428.

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge YC, Gentry J, et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.

Grant M, Boyd S (2008) Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pp. 95-110, Springer.

Gutjahr C, Riemann M, Müller A, Düchting P, Weiler EW, Nick P (2005) Cholodny-Went revisited: a role for jasmonate in gravitropism of rice coleoptiles. Planta 222: 575–585.

Hannah MA, Heyer AG, Hincha DK (2005) A global survey of gene regulation during cold acclimation in Arabidopsis thaliana. PLoS Genetics 1: e26.

Hepler PK (2005) Calcium: a central regulator of plant growth and development. The Plant Cell 17: 2142–2155.

Hopkins JA, Kiss JZ (2012) Phototropism and gravitropism in transgenic lines of Arabidopsis altered in the phytochrome pathway. Physiologia Plantarum 145: 461–473.

Hou GC, Kramer VL, Wang YS, Chen RJ, Perbal G, Gilroy S, Blancaflor EB (2004) The promotion of gravitropism in Arabidopsis roots upon actin disruption is coupled with the extended alkalinization of the columella cytoplasm and a persistent lateral auxin gradient. Plant Journal 39: 113–125.

Ideker T, Dutkowski J, Hood L (2011) Boosting signal-to-noise in complex biology: prior knowledge is power. Cell 144: 860–863.

Ishikawa H, Evans ML (1997) Novel software for analysis of root gravitropism: comparative response patterns of Arabidopsis wild-type and axr1 seedlings. Plant, Cell and Environment 20: 919–928.

Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302: 449–53.

Joachims T (1999) Transductive inference for text classification using support vector machines. In ICML, 99: 200-209.

Joachims T (2002) Optimizing search engines using click through data. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 133.

Joo JH, Bae YS, Lee JS (2001) Role of auxin-induced reactive oxygen species in root gravitropism. Plant Physiology 126: 1055–1060.

Kashima H, Yamanishi Y, Kato T, Sugiyama M, Tsuda K (2009) Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach. Bioinformatics 25: 2962–2968.

Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U, et al (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Research 40: D841–846.

Kholodenko B, Yaffe MB, Kolch W (2012) Computational approaches for analyzing information flow in biological networks. Science Signaling 5: re1.

Kim BR, Nam HY, Kim SU, Kim SI, Chang YJ (2003) Normalization of reverse transcription quantitative-PCR with housekeeping genes in rice. Biotechnology Letters 25: 1869–1872.

Kim SH, Bhat PR, Cui X, Walia H, Xu J, Wanamaker S, Ismail AM, Wilson C, Close TJ (2009) Detection and validation of single feature polymorphisms using RNA expression data from a rice genome array. BMC Plant Biology 9: 65.

Kimbrough JM, Salinas-Mondragon R, Boss WF, Brown CS, Sederoff HW (2004) The fast and transient transcriptional network of gravity and mechanical stimulation in the Arabidopsis root apex. Plant Physiology 136: 2790–2805.

Kiss JZ (2000) Mechanisms of the early phases of plant gravitropism. Critical Reviews in Plant Sciences 19: 551–573.

Kiss JZ, Guisinger MM, Miller AJ, Stackhouse KS (1997) Reduced gravitropism in hypocotyls of starch-deficient mutants of Arabidopsis. Plant & Cell Physiology 38: 518–525.

Kiss JZ, Wright JB, Caspar T (1996) Gravitropism in roots of intermediate-starch mutants of Arabidopsis. Physiologia Plantarum 97: 237–244.

Kittang AI, Kvaløy B, Winge P, Iversen TH (2010) Ground testing of Arabidopsis preservation protocol for the microarray analysis to be used in the ISS EMCS Multigen-2 experiment. Advances in Space Research 46: 1249–1256.

Kleine-Vehn J, Ding Z, Jones AR, Tasaka M, Morita MT, Friml J (2010) Gravity-induced PIN transcytosis for polarization of auxin fluxes in gravity-sensing root cells. PNAS 107: 22344–22349.

Kordyum E (2003) A role for the cytoskeleton in plant cell gravisensitivity and Ca2+-signaling in microgravity. Cell Biology International 27: 219–221.

Lanckriet GRG, Deng M, Cristianini N, Jordan MI, Noble WS (2004) Kernel-based data fusion and its application to protein function prediction in yeast. Pacific Symposium on Biocomputing, pp. 300–311.

Lee WS, Liu B (2003) Learning with positive and unlabeled examples using weighted logistic regression. Proceedings of the twentieth international conference on machine learning (ICML). Washington, DC, pp. 448–455.

Leitz G, Kang B-H, Schoenwaelder MEA, Staehelin LA (2009) Statolith sedimentation kinetics and force transduction to the cortical endoplasmic reticulum in gravity-sensing Arabidopsis columella cells. The Plant Cell 21: 843–860.

Lewis DP, Jebara T, Noble WS (2006) Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics 22: 2753–2760.

Lewis DR, Miller ND, Splitt BL, Wu G, Spalding EP (2007) Separating the roles of acropetal and basipetal auxin transport on gravitropism with mutations in two Arabidopsis multidrug resistance-like ABC transporter genes. The Plant Cell 19: 1838–1850.

Li H, Sun J, Xu Y, Jiang H, Wu X, Li C (2007) The bHLH-type transcription factor AtAIB positively regulates ABA response in Arabidopsis. Plant Molecular Biology 65: 655–665.

Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research 13: 2178–2189.

Lin M, Shen X, Chen X (2011) PAIR: the predicted Arabidopsis interactome resource. Nucleic Acids Research 39: D1134–1140.

Liu B, Dai Y, Li X, Lee WS, Yu PS (2003) Building text classifiers using positive and unlabeled examples. Data Mining, Third IEEE International Conference on, ICDM, pp. 179–186.

Liu H, Torii M, Xu G, Hu Z, Goll J (2010) Learning from positive and unlabeled documents for retrieval of bacterial protein-protein interaction literature. In Linking Literature, Information, and Knowledge for Biology, pp. 62-70.

Luesse DR, Schenck CA, Berner BK, Justus B, Wyatt SE (2011) GPS4 is allelic to ARL2: implications for gravitropic signal transduction. Gravitational and Space Biology, 23(2).

Mandadi KK, Misra A, Ren S, McKnight TD (2009) BT2, a BTB protein, mediates multiple responses to nutrients, stresses, and hormones in Arabidopsis. Plant Physiology 150: 1930–1939.

Marchant A, Kargul J, May ST, Muller P, Delbarre A, Perrot-Rechenmann C, Bennett MJ (1999) AUX1 regulates root gravitropism in Arabidopsis by facilitating auxin uptake within root apical tissues. The EMBO Journal 18: 2066–2073.

Martin S, Roe D, Faulon J-L (2005) Predicting protein-protein interactions using signature products. Bioinformatics 21: 218–226.

Melacci S, Belkin M (2011) Laplacian support vector machines trained in the primal.

Journal of Machine Learning Research 12: 1149–1184.

Mordelet F, Vert J-P (2013) Supervised Inference of Gene Regulatory Networks from

Positive and Unlabeled Examples. Data Mining for Systems Biology, pp. 47–58, Springer.

Morita MT (2010) Directional gravity sensing in gravitropism. Annual Review of Plant

Biology 61: 705–720.

Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q (2008). GeneMANIA: a real- time multiple association network integration algorithm for predicting gene function.

Genome Biology, 9(Suppl 1): S4.

Muday GK, DeLong A (2001) Polar auxin transport: controlling where and how much. Trends in Plant Science 6: 535–542.

Mueller LA, Zhang P, Rhee SY (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiology 132: 453–460.

Mukhtar MS, Carvunis A-R, Dreze M, Epple P, Steinbrenner J, Moore J, Tasan M, Galli M, Hao T, Nishimura MT (2011) Independently evolved virulence effectors converge onto hubs in a plant immune system network. Science 333: 596–601.

Mullen JL, Wolverton C, Ishikawa H, Evans ML (2000) Kinetics of constant gravitropic stimulus responses in Arabidopsis roots using a feedback system. Plant Physiology 123: 665–670.

Müller H-M, Kenny EE, Sternberg PW (2004) Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2: e309.

Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, von Mering C, Doerks T, Jensen LJ (2010) eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Research 38: D190–195.

Noble WS (2004) Support vector machine applications in computational biology. Kernel Methods in Computational Biology, pp. 71–92.

Noto K, Saier M, Elkan C (2008) Learning to Find Relevant Biological Articles without Negative Training Examples. Advances in Artificial Intelligence, pp. 202–213, Springer.

O’Rourke JA, Nelson RT, Grant D, Schmutz J, Grimwood J, Cannon S, Vance CP, Graham MA, Shoemaker RC (2009) Integrating microarray analysis and the soybean genome to understand the soybeans iron deficiency response. BMC Genomics 10: 376.

Obayashi T, Kinoshita K, Nakai K, Shibaoka M, Hayashi S, Saeki M, Shibata D, Saito K, Ohta H (2007) ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis. Nucleic Acids Research 35: D863–869.

Östlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, Frings O, Sonnhammer ELL (2010) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Research 38: D196–D203.

Palaniswamy SK, James S, Sun H, Lamb RS, Davuluri RV, Grotewold E (2006) AGRIS and AtRegNet. A platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiology 140: 818–829.

Pandey SP, Roccaro M, Schön M, Logemann E, Somssich IE (2010) Transcriptional reprogramming regulated by WRKY18 and WRKY40 facilitates powdery mildew infection of Arabidopsis. The Plant Journal 64: 912–923.

Pawson T, Nash P (2003) Assembly of cell regulatory systems through protein interaction domains. Science 300: 445–452.

Peer WA, Murphy AS (2007) Flavonoids and auxin transport: modulators or regulators? Trends in Plant Science 12: 556–563.

Peña-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kim WK (2008) A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biology 9 Suppl 1: S2.

Perera IY, Heilmann I, Boss WF (1999) Transient and sustained increases in inositol 1,4,5-trisphosphate precede the differential growth response in gravistimulated maize pulvini. PNAS 96: 5838–5843.

Perera IY, Heilmann I, Chang SC, Boss WF, Kaufman PB (2001) A role for inositol 1,4,5-trisphosphate in gravitropic signaling and the retention of cold-perceived gravistimulation of oat shoot pulvini. Plant Physiology 125: 1499–1507.

Perrin RM, Young L-S, Murthy UMN, Harrison BR, Wang Y, Will JL, Masson PH (2005) Gravity signal transduction in primary roots. Annals of Botany 96: 737–743.

Petrásek J, Friml J (2009) Auxin transport routes in plant development. Development 136: 2675–2688.

Pitre S, North C, Alamgir M, Jessulat M, Chan A, Luo X, Green JR, Dumontier M, Dehne F, Golshani A (2008) Global investigation of protein–protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences. Nucleic Acids Research 36: 4286–4294.

Pnueli L, Abu-Abeid M, Zamir D, Nacken W, Schwarz-Sommer Z, Lifschitz E (1991) The MADS box gene family in tomato: temporal expression during floral development, conserved secondary structures and homology with homeotic genes from Antirrhinum and Arabidopsis. The Plant Journal 1: 255–266.

Przulj N, Wigle DA, Jurisica I (2004) Functional topology in a network of protein interactions. Bioinformatics 20: 340–348.

Řehůřek R, Sojka P (2010) Software framework for topic modeling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, ELRA, Valletta, Malta.

Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A (2013) A large-scale evaluation of computational protein function prediction. Nature Methods 10: 221–227.

Rasmussen H (2007) Intracellular Signaling: from Simplicity to Complexity. Gravitational and Space Biology 8: 2.

Ratcliffe OJ (2003) Analysis of the Arabidopsis MADS affecting flowering gene family: MAF2 prevents vernalization by short periods of cold. The Plant Cell 15: 1159–1169.

Ren S, Mandadi KK, Boedeker AL, Rathore KS, McKnight TD (2007) Regulation of telomerase in Arabidopsis by BT2, an apparent target of telomerase activator1. The Plant Cell 19: 23–31.

Rhee SY (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research 31: 224–228.

Rhee SY, Zhang P, Foerster H, Tissier C (2006) AraCyc: overview of an Arabidopsis metabolism database and its applications for plant research. In Plant Metabolomics, pp. 141–154, Springer.

Roy S (2012) Systems biology beyond degree, hubs and scale-free networks: the case for multiple metrics in complex networks. Systems and Synthetic Biology 6: 31–34.

Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology 132: 365–386.

Rushton PJ, Somssich IE, Ringler P, Shen QJ (2010) WRKY transcription factors. Trends in Plant Science 15: 247–258.

Saeed AI, Bhagabati NK, Braisted JC, Liang W, Sharov V, Howe EA, Li J, Thiagarajan M, White JA, Quackenbush J (2006) TM4 microarray software suite. Methods in Enzymology 411: 134–193.

Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23: 2507–2517.

Sánchez-Fernández R, Davies TG, Coleman JO, Rea PA (2001) The Arabidopsis thaliana ABC protein superfamily, a complete inventory. The Journal of Biological Chemistry 276: 30231–30244.

Scott AC, Allen NS (1999) Changes in cytosolic pH within Arabidopsis root columella cells play a key role in the early signaling pathway for root gravitropism. Plant Physiology 121: 1291–1298.

Sedbrook JC, Chen R, Masson PH (1999) ARG1 (altered response to gravity) encodes a DnaJ-like protein that potentially interacts with the cytoskeleton. PNAS 96: 1140–1145.

Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Molecular Systems Biology 3: 88.

Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H (2007) Predicting protein-protein interactions based only on sequences information. PNAS 104: 4337–4341.

Shen K, Wyatt SE, Nadella V (2012) ArrayOU: a web application for microarray data analysis and visualization. Journal of Biomolecular Techniques 23: 37–39.

Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, Su Z, Chu T-M, Goodsaid FM, Pusztai L, et al (2010) The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology 28: 827–838.

Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, et al (2006) The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 24: 1151–1161.

Shibasaki K, Uemura M, Tsurumi S, Rahman A (2009) Auxin response in Arabidopsis under cold stress: underlying molecular mechanisms. The Plant Cell 21: 3823–3838.

Sinclair W, Trewavas AJ (1997) Calcium in gravitropism. A re-examination. Planta 203: S85–90.

Sindhwani V, Niyogi P (2005) Beyond the point cloud: from transductive to semi-supervised learning. In ICML, pp. 824–831, ACM.

Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3: Article 3.

Stanga J, Strohm A, Masson PH (2011) Studying starch content and sedimentation of amyloplast statoliths in Arabidopsis roots. Methods in Molecular Biology 774: 103–111.

Stanga JP, Boonsirichai K, Sedbrook JC, Otegui MS, Masson PH (2009) A role for the TOC complex in Arabidopsis root gravitropism. Plant Physiology 149: 1896–1905.

Stark C, Breitkreutz B-J, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, Nixon J, Van Auken K, Wang X, Shi X, et al (2011) The BioGRID Interaction Database: 2011 update. Nucleic Acids Research 39: D698–704.

Staves MP, Wayne R, Leopold AC (1997) The effect of the external medium on the gravitropic curvature of rice (Oryza sativa, Poaceae) roots. American Journal of Botany 84: 1522–1529.

Sukumar P, Edwards KS, Rahman A, Delong A, Muday GK (2009) PINOID kinase regulates root gravitropism through modulation of PIN2-dependent basipetal auxin transport in Arabidopsis. Plant Physiology 150: 722–735.

Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23: 1282–1288.

Svegzdiene D, Rakleviciene D, Gaina V (2005) Kinetics of gravity-induced amyloplast sedimentation in statocytes of cress roots grown under fast clinorotation, 1g and after 180° inversion. Advances in Space Research 36: 1277–1283.

Swarup R, Bennett M (2003) Auxin transport: the fountain of life in plants? Developmental Cell 5: 824–826.

Swarup R, Kramer EM, Perry P, Knox K, Leyser HMO, Haseloff J, Beemster GTS, Bhalerao R, Bennett MJ (2005) Root gravitropism requires lateral root cap and epidermal cells for transport and response to a mobile auxin signal. Nature Cell Biology 7: 1057–1065.

Tew KL, Li X-L, Tan S-H (2007) Functional centrality: detecting lethality of proteins in protein interaction networks. In Genome Informatics. International Conference on Genome Informatics 19: 166–177.

Toyota M, Furuichi T, Tatsumi H, Sokabe M (2008) Cytoplasmic calcium increases in response to changes in the gravity vector in hypocotyls and petioles of Arabidopsis seedlings. Plant Physiology 146: 505–514.

Tsuda K, Shin H, Schölkopf B (2005) Fast protein classification with multiple networks. Bioinformatics 21: ii59–ii65.

Tsujitani M, Tanaka Y (2011) Cross-validation, bootstrap, and support vector machines. Advances in Artificial Neural Systems: 2.

Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F (2002) Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology 3: research0034.1–research0034.11.

Vitha S, Yang M, Sack FD, Kiss JZ (2007) Gravitropism in the starch excess mutant of Arabidopsis thaliana. American Journal of Botany 94: 590–598.

Wang X, Du B, Liu M, Sun N, Qi X (2013) Arabidopsis transcription factor WRKY33 is involved in drought by directly regulating the expression of CesA8. American Journal of Plant Sciences 04: 21–27.

Wang X, Rak R, Restificar A, Nobata C, Rupp CJ, Batista-Navarro RTB, Nawaz R, Ananiadou S (2011) Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature. BMC Bioinformatics 12: S11.

Wass MN, Barton G, Sternberg MJE (2012) CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Research 40: W466–W470.

Weise SE, Kuznetsov OA, Hasenstein KH, Kiss JZ (2000) Curvature in Arabidopsis Inflorescence Stems Is Limited to the Region of Amyloplast Displacement. Plant and Cell Physiology 41: 702–709.

Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero norm with linear models and kernel methods. The Journal of Machine Learning Research 3: 1439–1461.

Winkel-Shirley B (2002) Biosynthesis of flavonoids and effects of stress. Current Opinion in Plant Biology 5: 218–223.

Withers JC, Shipp MJ, Rupasinghe SG, Sukumar P, Schuler MA, Muday GK, Wyatt SE (2013) Gravity Persistent Signal 1 (GPS1) reveals novel cytochrome P450s involved in gravitropism. American Journal of Botany 100: 183–193.

Witten D, Tibshirani R (2007) A comparison of fold-change and the t-statistic for microarray data analysis. Department of Statistics, Stanford University technical report.

Wren JD (2009) A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide. Bioinformatics 25: 1694–1701.

Wyatt SE, Kiss JZ (2013) Plant tropisms: from Darwin to the International Space Station. American Journal of Botany 100: 1–3.

Wyatt SE, Rashotte AM, Shipp MJ, Robertson D, Muday GK (2002) Mutations in the gravity persistence signal loci in Arabidopsis disrupt the perception and/or signal transduction of gravitropic stimuli. Plant Physiology 130: 1426–1435.

Yamauchi Y, Fukaki H, Fujisawa H, Tasaka M (1997) Mutations in the SGR4, SGR5 and SGR6 loci of Arabidopsis thaliana alter the shoot gravitropism. Plant & Cell Physiology 38: 530–535.

Yano D, Sato M, Saito C, Sato MH, Morita MT, Tasaka M (2003) A SNARE complex containing SGR3/AtVAM3 and ZIG/VTI11 in gravity-sensing cells is important for Arabidopsis shoot gravitropism. PNAS 100: 8589–8594.

Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M (2007) The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Computational Biology 3: e59.

Zhang J, Nodzynski T, Pencík A, Rolcík J, Friml J (2010) PIN phosphorylation is sufficient to mediate PIN polarity and direct auxin transport. PNAS 107: 918–922.

Zhang Y, Wang L (2005) The WRKY transcription factor superfamily: its origin in eukaryotes and expansion in plants. BMC Evolutionary Biology 5: 1.

Zhao X-M, Wang Y, Chen L, Aihara K (2008) Gene function prediction using labeled and unlabeled data. BMC Bioinformatics 9: 57.

Zheng Z, Qamar SA, Chen Z, Mengiste T (2006) Arabidopsis WRKY33 transcription factor is required for resistance to necrotrophic fungal pathogens. The Plant Journal 48: 592–605.

Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. Advances in Neural Information Processing Systems 16: 321–328.

APPENDIX A. MICROARRAY DATA ANALYSIS PIPELINE

APPENDIX B. DIFFERENTIALLY EXPRESSED GENES

Table B.1. Genes down-regulated at 2 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           2 min             4 min             10 min            30 min
           Log fold          Log fold          Log fold          Log fold
Gene       change   P-value  change   P-value  change   P-value  change   P-value
AT3G16180  -1.29    0.02     -0.39    0.64     0.27     0.52     0.2      0.37
AT3G28340  -1.09    0.04     0.13     0.72     -0.63    0.39     0.46     0.37
AT4G34770  -1.06    0        0.19     0.61     0.1      0.7      -0.24    0.56
AT1G11545  -0.85    0.04     0.18     0.52     0.13     0.72     -0.23    0.57
AT4G34530  -0.81    0.03     -0.2     0.55     -0.18    0.52     -0.11    0.7

Table B.2. Genes up-regulated at 2 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           2 min             4 min             10 min            30 min
           Log fold          Log fold          Log fold          Log fold
Gene       change   P-value  change   P-value  change   P-value  change   P-value
AT1G23130  0.96     0.05     0.24     0.78     -0.75    0.18     -0.99    0.24
AT3G10340  0.82     0.04     -0.67    0.22     0.28     0.65     0.62     0.21
AT2G03090  0.73     0.01     -0.02    0.95     1.18     0.12     1.29     0.07

Table B.3. Genes down-regulated at 4 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           4 min             10 min            30 min
           Log fold          Log fold          Log fold
Gene       change   P-value  change   P-value  change   P-value
AT5G07100  -1.76    0        -0.53    0.31     -0.94    0.09
AT4G31800  -1.41    0.1      -1.59    0.01     -1.26    0.08
AT1G09440  -1.18    0.04     0.12     0.9      1.06     0.02
AT3G48360  -1.03    0.04     -0.16    0.67     -0.24    0.32
AT2G38470  -1       0.03     -0.01    0.99     0.13     0.76
AT2G46510  -0.82    0.05     0.05     0.98     0.63     0.61
AT4G05380  -0.81    0.01     -1       0.01     0.03     0.86
AT1G11050  -0.8     0.05     -0.59    0.38     -0.19    0.35
AT1G15520  -0.8     0.02     0.44     0.18     0.24     0.3
AT3G58330  -0.8     0.01     -0.28    0.47     0.02     0.95
AT4G30140  -0.8     0.04     0.42     0.27     -0.43    0.31
AT3G07820  -0.77    0.01     -0.19    0.47     -0.25    0.4
AT1G17310  -0.76    0.02     -0.12    0.75     0.49     0.1
AT1G75520  -0.76    0.01     -0.07    0.84     -0.3     0.35
AT3G58060  -0.76    0.04     0.18     0.56     -0.12    0.66
AT1G51990  -0.75    0.03     0.23     0.7      0.38     0.39
AT3G16210  -0.75    0.05     -0.54    0.4      -0.17    0.5
AT3G51740  -0.75    0.01     -0.66    0.25     0.03     0.96
AT2G17180  -0.74    0.02     -0.25    0.4      -0.13    0.53
AT5G17830  -0.74    0.05     0.37     0.26     0.19     0.65
AT4G27550  -0.73    0.02     -0.89    0.09     0.13     0.47
AT5G01130  -0.73    0.05     -0.51    0.25     0.02     0.95
AT5G02580  -0.73    0.02     -0.23    0.66     0.1      0.79
AT3G09790  -0.72    0.04     -0.9     0.02     -0.28    0.24
AT4G01470  -0.72    0.03     -0.26    0.44     -0.78    0.26
AT4G25980  -0.72    0.05     -0.76    0.08     0.06     0.71
AT4G35800  -0.72    0.05     -0.04    0.97     0.46     0.66
AT1G22760  -0.71    0.02     -0.17    0.69     -0.54    0.2
AT3G18810  -0.71    0.02     -0.96    0.03     -0.21    0.41
AT4G28010  -0.71    0.04     0.03     0.94     -0.27    0.37
AT1G03940  -0.7     0.02     0.09     0.78     -0.39    0.12
AT5G18720  -0.7     0.04     -0.08    0.82     -0.11    0.6
AT2G48150  -0.69    0.05     0.15     0.72     0.38     0.15
AT4G37780  -0.69    0.02     -0.54    0.15     0.32     0.29
AT5G10660  -0.69    0.03     -0.24    0.43     0.37     0.12
AT1G09970  -0.68    0.03     0.1      0.87     0.16     0.42
AT2G33205  -0.68    0.04     0        0.99     -0.2     0.31
AT3G14415  -0.68    0.04     0.28     0.48     0.22     0.5
AT1G21810  -0.67    0.04     -0.13    0.74     -0.23    0.28
AT3G10320  -0.67    0.03     0.26     0.53     -0.12    0.77
AT3G44115  -0.67    0.04     0.08     0.8      -0.38    0.24
AT3G05310  -0.66    0.05     0.06     0.88     0.02     0.93
AT3G08810  -0.66    0.03     -0.31    0.37     -0.45    0.19
AT4G24110  -0.66    0.05     -0.53    0.15     -1       0.06
AT3G26810  -0.65    0.05     -0.05    0.91     -0.43    0.09
AT3G48230  -0.64    0.04     0.08     0.85     -0.03    0.9
AT2G30250  -0.63    0.05     0.8      0.02     0.37     0.16
AT3G22730  -0.63    0.04     -0.38    0.32     0.55     0.17
AT4G20770  -0.63    0.04     0.44     0.17     -0.33    0.25
AT4G28088  -0.63    0.03     -0.26    0.48     -0.38    0.28
AT1G31550  -0.62    0.03     -0.54    0.16     -0.21    0.39
AT3G45640  -0.62    0.03     -0.2     0.58     -0.24    0.6
AT3G52400  -0.62    0.05     -0.37    0.35     0.35     0.57
AT5G10695  -0.62    0.05     -0.11    0.66     0.2      0.57
AT1G28050  -0.61    0.05     -0.27    0.67     -0.41    0.36
AT1G69630  -0.61    0.04     0.22     0.51     0.13     0.6
AT2G24800  -0.61    0.05     -0.29    0.31     -0.66    0.16
AT3G56970  -0.61    0.03     0.12     0.67     0.1      0.65
AT3G09530  -0.58    0.03     -0.05    0.83     -0.05    0.86
AT1G48500  -0.57    0.04     0.11     0.69     0.23     0.43
AT2G33070  -0.57    0.04     0.03     0.91     0.1      0.75
AT4G04510  -0.57    0.05     -0.08    0.83     -0.16    0.45
AT3G45680  -0.56    0.05     0.22     0.56     -0.06    0.81
AT4G37670  -0.54    0.04     0.05     0.88     -0.39    0.14
AT1G06710  -0.5     0.05     0.31     0.55     -0.35    0.27
AT4G35985  -0.5     0.05     0.15     0.6      0.7      0.08

Table B.4. Genes up-regulated at 4 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           4 min             10 min            30 min
           Log fold          Log fold          Log fold
Gene       change   P-value  change   P-value  change   P-value
AT1G18550  0.62     0.05     0.67     0.12     0.02     0.93
AT5G38565  0.61     0.03     -0.27    0.6      0        0.99
AT1G04950  0.61     0.03     0.2      0.48     0.25     0.43
AT1G25550  0.61     0.04     -0.26    0.6      -0.27    0.35
AT5G37490  0.61     0.05     -0.46    0.22     0.51     0.1
AT5G10230  0.6      0.04     0.31     0.46     0.04     0.84
AT1G80270  0.6      0.04     -0.11    0.69     -0.34    0.28
AT4G24320  0.6      0.04     0.45     0.15     0.41     0.27
AT4G26510  0.6      0.05     0.33     0.53     0.09     0.67
AT3G15980  0.59     0.04     0.36     0.35     0.05     0.81
AT4G16260  0.59     0.05     -0.07    0.85     -0.26    0.32
AT4G35090  0.59     0.03     0.1      0.87     0.12     0.73
AT5G25960  0.59     0.04     -0.29    0.31     0.52     0.18
AT1G14460  0.58     0.05     0.69     0.22     -0.04    0.89
AT4G35970  0.58     0.05     0.46     0.22     0.27     0.21
AT3G13770  0.58     0.04     0.92     0.12     -0.52    0.23
AT4G34350  0.57     0.05     0.37     0.49     -0.24    0.3
AT5G53200  0.57     0.05     0.04     0.95     0.26     0.6
AT1G23300  0.56     0.05     -0.25    0.4      -0.08    0.83
AT5G50950  0.56     0.03     0.37     0.34     -0.12    0.72
AT4G09160  0.56     0.05     0        0.99     0.12     0.5
AT5G37590  0.55     0.04     0.2      0.49     0.2      0.29
AT1G32270  0.54     0.05     0.25     0.43     -0.08    0.68
AT4G33730  0.53     0.04     0.2      0.72     0.04     0.89
AT3G11325  0.52     0.05     -0.06    0.87     0.66     0.03

Table B.5. Genes down-regulated at 10 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           10 min            30 min
           Log fold          Log fold
Gene       change   P-value  change   P-value
AT1G47730  -2.27    0.01     0.26     0.44
AT1G72260  -2.14    0        -1.99    0.03
AT2G33880  -2       0.02     -1.07    0.09
AT4G34850  -1.93    0        -1.1     0.22
AT3G48360  -1.88    0        -0.88    0.22
AT4G31800  -1.83    0.02     -1.24    0.03
AT4G37980  -1.64    0.02     -1.21    0.14
AT2G19070  -1.6     0.01     -1.04    0.08
AT1G75940  -1.6     0.02     -1.23    0.15
AT5G16960  -1.57    0        -1.01    0.1
AT5G20240  -1.57    0.03     -1.55    0.18
AT2G18550  -1.52    0.01     -1.36    0.08
AT3G08770  -1.49    0        -1.27    0.08
AT3G42960  -1.47    0        -0.92    0.09
AT1G02050  -1.46    0        -0.53    0.2
AT1G62940  -1.45    0.01     -1.22    0.18
AT4G16270  -1.44    0        -1.01    0.2
AT3G02480  -1.43    0.01     -1.43    0.06
AT3G51590  -1.42    0.04     -1.08    0.28
AT3G29635  -1.41    0.04     -0.35    0.4
AT4G33870  -1.41    0.01     -0.06    0.73
AT1G08065  -1.4     0.04     -0.28    0.29
AT1G01280  -1.38    0.01     -1.14    0.13
AT3G59510  -1.38    0.05     -0.53    0.26
AT5G24770  -1.36    0.03     -1.73    0.02
AT2G16910  -1.36    0        -0.81    0.12
AT3G12203  -1.33    0.01     -0.15    0.53
AT1G73290  -1.33    0.01     0.29     0.52
AT5G65330  -1.32    0.04     -0.11    0.66
AT3G12145  -1.3     0.01     -0.51    0.19
AT3G04150  -1.29    0.02     -0.09    0.64
AT1G44970  -1.28    0.03     -1.1     0.21
AT1G75790  -1.27    0.01     -1.05    0.09
AT3G13220  -1.26    0.02     -0.87    0.24
AT5G18600  -1.26    0.03     -0.86    0.21
AT2G42990  -1.26    0.02     -0.58    0.15
AT5G08250  -1.25    0.02     -0.29    0.35
AT2G31180  -1.25    0.03     0.59     0.13
AT5G46940  -1.21    0.02     -0.35    0.4
AT4G35420  -1.2     0.02     -0.12    0.74
AT3G59530  -1.2     0.01     -0.53    0.21
AT2G34810  -1.19    0.03     -0.92    0.1
AT5G38030  -1.18    0.04     0.05     0.84
AT4G23600  -1.18    0.02     -0.83    0.26
AT5G55080  -1.17    0.03     0.23     0.37
AT2G04570  -1.17    0.02     -0.94    0.15
AT4G21590  -1.17    0        -0.76    0.17
AT2G36270  -1.16    0.02     -0.85    0.1
AT2G15090  -1.15    0.05     -0.61    0.05
AT5G15800  -1.14    0.01     -1.18    0.17
AT5G24820  -1.13    0.05     -0.42    0.23
AT4G15440  -1.13    0.02     -0.91    0.15
AT3G45140  -1.11    0.02     -0.6     0.13
AT3G25770  -1.1     0.01     -0.31    0.44
AT5G48550  -1.1     0.03     -0.04    0.86
AT2G46880  -1.09    0.02     -0.23    0.42
AT3G57620  -1.07    0.05     -0.92    0.03
AT2G31210  -1.05    0.02     -0.73    0.21
AT5G56820  -1.05    0.02     0.45     0.11
AT1G69120  -1.04    0.01     -0.68    0.22
AT1G78390  -1.03    0.05     0.17     0.55
AT1G17200  -1.02    0.03     -0.52    0.29
AT1G34510  -1.01    0.02     -0.05    0.86
AT5G49070  -1.01    0.01     0.1      0.82
AT4G15210  -1.01    0.02     -0.67    0.16
AT4G05380  -1       0.01     0.03     0.86
AT1G06250  -1       0.03     -0.56    0.24
AT1G74540  -1       0.05     0.04     0.95
AT1G05660  -1       0.02     0.23     0.23
AT1G71160  -0.99    0.04     -1.09    0.06
AT2G04032  -0.97    0.02     0.41     0.56
AT1G75030  -0.97    0.02     -0.6     0.1
AT4G19380  -0.97    0.02     -0.37    0.15
AT1G51460  -0.97    0.03     -0.51    0.16
AT1G80660  -0.96    0.05     -1.06    0.17
AT1G75290  -0.96    0.02     0.17     0.38
AT3G18810  -0.96    0.03     -0.21    0.41
AT1G10610  -0.95    0.01     -0.36    0.17
AT1G26480  -0.94    0.04     -0.77    0.1
AT3G06100  -0.94    0.03     -0.4     0.1
AT3G54340  -0.94    0.03     -1.07    0.15
AT3G11180  -0.93    0.05     0.33     0.3
AT5G11160  -0.93    0.02     -0.58    0.13
AT2G17060  -0.93    0.03     0.8      0.03
AT1G51420  -0.93    0.05     0.1      0.67
AT1G71030  -0.92    0.05     -0.32    0.33
AT3G23840  -0.92    0.02     -0.54    0.17
AT3G09790  -0.9     0.02     -0.28    0.24
AT4G29610  -0.9     0.03     -0.99    0.05
AT1G19790  -0.89    0.04     -0.52    0.29
AT2G09970  -0.89    0.03     0.14     0.52
AT2G37700  -0.88    0.04     -0.18    0.43
AT2G24560  -0.87    0.01     -0.79    0.06
AT5G07475  -0.87    0.04     -0.48    0.33
AT4G03510  -0.86    0.02     -0.53    0.05
AT3G26540  -0.86    0.04     -0.08    0.73
AT1G13710  -0.86    0.05     -0.41    0.38
AT5G28210  -0.85    0.05     -0.12    0.54
AT2G32780  -0.84    0.05     0.32     0.18
AT5G50790  -0.83    0.03     -1.03    0.15
AT5G15860  -0.82    0.03     -0.14    0.51
AT5G54010  -0.82    0.05     -0.64    0.04
AT4G28650  -0.81    0.02     -0.31    0.38
AT2G07020  -0.81    0.04     0.16     0.48
AT4G04770  -0.81    0.02     0.01     0.96
AT1G33100  -0.81    0.03     -0.06    0.77
AT3G60460  -0.81    0.02     -0.08    0.85
AT2G39330  -0.81    0.03     -0.59    0.34
AT1G21460  -0.81    0.02     -0.72    0.06
AT3G13960  -0.81    0.03     -0.29    0.36
AT5G22260  -0.71    0.03     0.1      0.7
AT5G40320  -0.67    0.04     0.16     0.44

Table B.6. Genes up-regulated at 10 min in the GPS microarray and their expression across the remaining time points. All significant values are highlighted in gray.

           10 min            30 min
           Log fold          Log fold
Gene       change   P-value  change   P-value
AT1G59725  1.86     0.03     0.18     0.60
AT1G17150  1.47     0        0.42     0.19
AT1G80390  1.4      0.03     -0.09    0.69
AT1G13600  1.37     0.03     0.7      0.07
AT1G54095  1.27     0.05     -0.09    0.87
AT1G10585  1.22     0.04     -0.1     0.62
AT1G01310  1.22     0        -0.62    0.18
AT1G18490  1.21     0.02     0.1      0.57
AT3G02940  1.11     0        -0.12    0.72
AT3G47400  1.1      0.04     0.27     0.24
AT1G16120  1.08     0.03     -0.11    0.58
AT1G44180  1.05     0.03     -0.04    0.86
AT1G70850  1.02     0.01     0.52     0.36
AT1G12600  1.02     0.02     -0.25    0.25
AT3G02970  1.02     0.01     0.54     0.04
AT1G34650  1.01     0.05     0.2      0.44
AT4G17680  1.01     0.02     -0.32    0.45
AT4G09420  1        0.02     0.17     0.41
AT3G52370  0.99     0.03     0.66     0.17
AT1G55910  0.98     0.02     0.61     0.05
AT3G20100  0.94     0.04     0.35     0.16
AT1G74310  0.93     0.05     0.62     0.03
AT3G03670  0.92     0.01     0.7      0.26
AT1G50870  0.88     0.01     0.01     0.98
AT5G25370  0.88     0.02     0.45     0.16
AT5G26320  0.87     0.03     -0.37    0.12
AT3G28470  0.85     0.03     0.08     0.72
AT3G27490  0.85     0.04     0        0.99
AT1G11740  0.85     0.03     0.68     0.12
AT1G78990  0.82     0.02     -0.15    0.63
AT3G05975  0.81     0.04     -0.31    0.37
AT5G57920  0.81     0.02     0.34     0.41
AT2G30250  0.8      0.02     0.37     0.16

Table B.7. Genes uniquely down-regulated at 30 min

           30 min
           Log fold
Gene       change   P-value
AT5G24780  -2.07    0.01
AT1G72260  -1.99    0.03
AT5G24770  -1.73    0.02
AT2G39690  -1.54    0.05
AT1G13140  -1.37    0.02
AT4G13790  -1.29    0.02
AT3G14210  -1.24    0.01
AT5G24470  -1.19    0.01
AT1G52030  -1.18    0.03
AT3G17790  -1.14    0.05
AT5G40360  -1.14    0.04
AT3G22142  -1.12    0.02
AT1G35260  -1.07    0.03
AT4G29610  -0.99    0.05
AT1G30280  -0.95    0.04
AT3G51240  -0.95    0.03
AT5G08030  -0.95    0.04
AT5G13930  -0.93    0.04
AT3G57620  -0.92    0.03
AT3G57600  -0.92    0.02
AT5G42230  -0.9     0.02
AT3G09940  -0.89    0.03
AT1G66160  -0.89    0.02

Table B.8. Genes uniquely up-regulated at 30 min

           30 min
           Log fold
Gene       change   P-value
AT5G58770  1.71     0.05
AT5G39160  1.5      0.03
AT2G03090  1.29     0.04
AT2G43050  1.2      0.04
AT1G78370  1.13     0.05
AT1G09440  1.06     0.02
AT5G10770  1.05     0.04
AT5G22490  1        0.04
AT5G07430  1        0.03
AT4G16430  0.96     0.03
AT5G14200  0.95     0.03
AT4G23400  0.93     0.02
AT1G27920  0.92     0.03
AT1G06830  0.92     0.04
AT4G16045  0.91     0.03
AT4G02520  0.88     0.03
AT3G54260  0.88     0.04
AT3G20710  0.87     0.03
AT1G01190  0.86     0.04
AT4G20110  0.85     0.03
AT5G26640  0.85     0.05
AT5G10140  0.85     0.02
AT1G73140  0.85     0.04
AT5G39770  0.84     0.02
AT4G36950  0.81     0.05
AT3G02500  0.81     0.05
AT3G11325  0.66     0.03
AT1G74310  0.62     0.03
AT1G55910  0.61     0.05
AT3G02970  0.54     0.04
AT5G66970  0.5      0.05
