ANALYSIS

Creation and implications of a phenome-genome network

Atul J Butte1 & Isaac S Kohane2

1Stanford Medical Informatics, Departments of Medicine and Pediatrics, Stanford University School of Medicine, 251 Campus Drive, Room X-215, Stanford, California 94305-5479 USA. 2Informatics Program and Division of Endocrinology, Children’s Hospital Boston and Harvard Medical School, 300 Longwood Avenue, Boston, Massachusetts 02115 USA. Correspondence should be addressed to A.J.B. ([email protected]).

Published online 10 January 2006; doi:10.1038/nbt1150

© 2006 Nature Publishing Group http://www.nature.com/naturebiotechnology

Although gene and protein measurements are increasing in quantity and comprehensiveness, they do not characterize a sample’s entire phenotype in an environmental or experimental context. Here we comprehensively consider associations between components of phenotype, genotype and environment to identify genes that may govern phenotype and responses to the environment. Context from the annotations of gene expression data sets in the Gene Expression Omnibus is represented using the Unified Medical Language System, a compendium of biomedical vocabularies with nearly 1 million concepts. After showing how data sets can be clustered by annotative concepts, we find a network of relations between phenotypic, disease, environmental and experimental contexts as well as genes with differential expression associated with these concepts. We identify novel genes related to concepts such as aging. Comprehensively identifying genes related to phenotype and environment is a step toward the Human Phenome Project5.

In analyzing a cancer sample, such as one extracted from a lung tumor, a plethora of factors have to be considered besides the more basic aspects of its gene expression and proteomic pattern: phenotype and clinical history (for example, chief complaint of hemoptysis, family history or tumor size), environmental exposures (for example, duration of exposure to asbestos or cigarette smoke) and experimental conditions (for example, anesthesia or sample preparation). Though these snapshots of genomic and physiological states have been used to determine therapeutic action1, they cannot by themselves represent either the entire ‘envirome’ or the ‘phenome’ of the sample and organism. The envirome, in an extended version of the initial definition by Anthony et al. for the special case of mental disorders, is the totality of equivalent environmental influences contributing to all disorders and organisms5; the phenome is the physical totality of all traits of an organism, as defined by Mahner and Kary5–7.

Relations between enviromic concepts and phenomic concepts have been invaluable to medicine. For example, one such relation is the association of environmental exposure to cigarette smoke with the phenotype of lung cancer development. Comprehensively relating specific concepts in the envirome and phenome to specific genes could thus lead to the identification of new disease-associated genes5. Though some phenomic data are available8, these are greatly overshadowed in size by the >60,000 microarray measurements in repositories such as the Gene Expression Omnibus (GEO)9. Even for microarray data stored using standards like Minimum Information About a Microarray Experiment (MIAME) and Microarray Markup Language (MAGE-ML)10,11, contextual annotations are represented by unstructured narrative text; determining the phenotype and environmental context is no longer a tractable manual process. A question we have sought to answer is whether prior investments in biomedical ontologies can provide leverage in finding phenome-genome and envirome-genome relations.

We show here that a large set of phenome-genome and envirome-genome relations can be found within a public repository of transcriptome measurements, if the phenotypes and environmental context can be ascertained for each experiment, along with the expression measurements. We accomplished this by creating a system that extracts contextual concepts from the sample annotations in GEO, represents these concepts using the Unified Medical Language System (UMLS), unifies the gene expression measurements across data sets using NCBI Gene identifiers and finally relates the gene expression measurements to contextual concepts (Fig. 1). UMLS is the largest available compendium of biomedical vocabularies, containing >130 biomedical vocabularies with ~1 million interrelated concepts12. UMLS already unifies vocabularies used in molecular biology and genomics, such as the Medical Subject Headings (MeSH), NCBI Taxonomy and the Gene Ontology, with medical vocabularies including the International Classification of Diseases and SNOMED International9,13,14.

Establishing a phenome-genome network

After manual elimination of incorrectly assigned concepts (Methods and Supplementary Note online), mappings to 4,127 UMLS concepts remained (from 296,843 mappings to 5,115 strings). Concepts were from 18 source vocabularies, with MeSH (23%), Read Codes (17%) and SNOMED International (14%) contributing the most. The GEO series description annotation was the most information rich, as it provided unique concepts (Supplementary Table 1 online). This was likely because GEO series descriptions are often dissimilar to each other, compared to sample descriptions, which are often repeated. As expected, the concepts mapping to the most annotations are cells and RNA (Table 1).

Parsing failed on annotations that were too short, containing only laboratory identifiers and few recognizable words, or too long for parsing to complete. Regardless, over 99% of GEO samples were successfully directly

NATURE BIOTECHNOLOGY VOLUME 24 NUMBER 1 JANUARY 2006 55 ANALYSIS

mapped to at least one UMLS concept; the remainder were subordinate to a GEO series with annotations mapping to concepts. Thus, every GEO sample can be mapped either directly to concepts, or indirectly through its parent GEO series. Similarly, every GEO data set could be mapped to concepts, either directly through the GEO data set annotations, or indirectly through subordinate series and sample annotations.

We created a taxonomy of eleven types of errors seen in determining concepts from annotations, in three broad categories: GEO annotations, UMLS vocabularies or text processing (Table 2). Three significant errors are listed here. Several important concepts in molecular biology experimentation are missing in UMLS, including “phosphate-buffered saline,” “transgenic” and “growth medium.” Company names and protocol details falsely map to concepts. For example, the phrase “Axon Instruments (Foster City, California)” maps to Axon and Fostering, related to foster homes. Many annotations include the entire MIAME checklist10, literally beginning with “The MIAME Checklist…,” so the salient details cannot be parsed.

To verify that data sets with similar context could now be considered similar mathematically, we clustered the data sets using the directly and indirectly mapped concepts (Fig. 2a). Qualitative examination of the tree revealed multiple branches where similar concepts were used in annotations from a single submitter, whereas other branches indicated successful clustering of similar data sets from multiple submitters. For example, muscle expression data used to study inflammatory myopathy, muscular dystrophy, aging and dermatomyositis clustered together, even though these were from three different submitters (Fig. 2b).

We then sought relationships between specific genes and annotative concepts, where a gene shows a statistically significant difference in expression level between data sets annotated with that concept and data sets not annotated with the concept. We controlled for multiple comparison testing by calculating Q values for each relation, or the proportion of false-positive relations expected if the given relation were significant15. We used two separate methods to reduce the noise associated with errors in concept assignment: (i) keeping only those gene-concept relations demonstrated in two or more species, and (ii) keeping only those relations with a very low Q value.

Our first method to heighten stringency was to consider only those gene-concept relations where a second relation was present between the same concept and an orthologous gene, assuming functional identity for orthologous genes. This resulted in relations between 444 genes and 46 concepts, which were graphically visualized as a phenome-genome network (Fig. 3a). Four of the top five connected concepts were related to muscle, such as Skeletal Muscle, with 128 genes (38 human, 40 mouse and 50 rat) showing significantly increased expression level in data sets annotated with these concepts and 40 genes (6 human, 18 mouse and 16 rat) showing decreased expression level (Fig. 3b). Many of these genes are known to be uniquely expressed in muscle, including TNNI1, MYBPC1, Mybph and Pgam2; Pnliprp1, Chad and Fgb are expressed in a few tissues including muscle. Human PDLIM3 shows the greatest differential expression between the eight data sets annotated with the concept Muscle Cells (average normalized expression level 0.968) and the 34 data sets measuring this transcript without such an annotation (expression level 0.340), a highly significant difference by t-test with P < 1.2 × 10^-16, with Q < 0.0002 across 100 permutations of mappings between data sets and the annotated concept (Fig. 3c). A similar pattern is seen for mouse Pdlim3 (Fig. 3d, P < 6 × 10^-8, Q = 0.0025). Pdlim3 is thought to play a role in skeletal muscle development16. Mouse Pdlim3 is most highly expressed in skeletal muscle, compared to 21 other tissues with EST count data in the NCBI UniGene9, and 60 other tissues in the GNF SymAtlas panel of microarray expression measurements by Su et al.17.

To quantify the sensitivity and validate the significance of these relations, we compared the list of genes related to the four muscle-related concepts with two lists of transcripts previously measured through sequencing or microarrays as being expressed in muscle. Of the 40 mouse transcripts found to have increased expression in data sets associated with the muscle concepts, 27 were also highly expressed transcripts in muscle in microarray measurements in the GNF SymAtlas17, and 18 were sequenced from the mouse skeletal muscle library dbEST 8902 (ref. 9). In total, at least 31 (78%) of the 40 mouse transcripts related to muscle could be independently validated.

However, considering only relations between concepts and ortholog families can be too restrictive. Few diseases have been studied in both humans and model animals, and microarray data may not yet be available for both. Thus, we separately used a second approach to heighten stringency by considering only the most reliable gene-concept relations, defined as having Q ≤ 1%. This resulted in 64,003 relations between 281 biomedical concepts and 7,466 genes.

We explain here several of these relations involving phenotypic or environmental concepts. These relations were manually validated to ensure each of these concepts was mapped correctly to an annotation.

We found 11 genes related to aging. The mean normalized expression level of H6PD, the H form of glucose-6-phosphate dehydrogenase, drops from 0.90 in the four data sets annotated with aging to 0.71 in the 35 data sets without the annotation (Supplementary Fig. 1a online, P = 1.2 × 10^-6, Q = 0.01). Glucose-6-phosphate dehydrogenase activity is known

Figure 1 The method of extracting and relating genome, phenome and envirome data from GEO data sets. Step 1: seven fields of annotations representing the phenotype, environmental and experimental context from GEO samples, series and data sets are parsed and mapped to UMLS concepts. Step 2: GEO platforms are manually related to NCBI Gene identifiers, allowing the same genes to be related across platforms. Step 3: gene expression measurements are rank normalized within each GEO sample, then averaged across each GEO series. Step 4: mean expression measurements for each gene in each GEO data set were related to the concepts mapped from each GEO data set.
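Steps 3 and 4 of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the toy expression values, the use of Welch's t statistic and the label-shuffling permutation test are all assumptions introduced here. The text specifies only rank normalization within each sample, averaging, a t-test and 100 permutations, and the empirical P value computed below is not the Q value reported in the text.

```python
import math
import random
import statistics


def rank_normalize(sample):
    """Map each gene's value in one sample to rank / n, in (0, 1].

    `sample` is a dict of gene -> raw measurement. Tied values share
    the mean of the ranks they occupy.
    """
    genes = sorted(sample, key=sample.get)
    n = len(genes)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and sample[genes[j + 1]] == sample[genes[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[genes[k]] = mean_rank / n
        i = j + 1
    return ranks


def mean_profile(samples):
    """Average rank-normalized values per gene across a list of samples."""
    normalized = [rank_normalize(s) for s in samples]
    shared = set.intersection(*(set(s) for s in normalized))
    return {g: statistics.mean(s[g] for s in normalized) for g in shared}


def t_statistic(a, b):
    """Welch's t statistic for two groups of per-data-set mean values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    diff = statistics.mean(a) - statistics.mean(b)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(float("inf"), diff)
    return diff / se


def permutation_p(annotated, unannotated, n_perm=100, seed=0):
    """Empirical two-sided P value from shuffling the concept labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(annotated, unannotated))
    pooled = list(annotated) + list(unannotated)
    k = len(annotated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:k], pooled[k:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-data-set means for one gene, with and without a concept.
with_concept = [0.95, 0.97, 0.96, 0.99]
without_concept = [0.30, 0.35, 0.33, 0.36, 0.31, 0.40]
print(permutation_p(with_concept, without_concept))
```

Rank normalization makes values comparable across platforms with different intensity scales, which is presumably why it precedes the averaging step. With only 100 permutations the smallest attainable empirical P value here is 1/101.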


Table 1 The top 50 UMLS concepts mapped to GEO annotations, after gross mapping errors were removed

CUI | Concept name | Count | Explanation
C0007634 | Cells | 3,275 |
C0035668 | RNA | 2,591 |
C0042153 | Utilization | 2,383 | Correct. From the words ‘use’ and ‘using’
C0025914 | House mice | 1,738 | Too specific. From the word ‘mouse’
C0025929 | Laboratory mice | 1,738 | Too specific. From the word ‘mouse’
C0441621 | Sampling - surgical action | 1,657 | Incorrect. A surgical procedure to remove, from the word ‘sample’
C0870078 | Sampling | 1,657 |
C0332307 | With type | 1,559 | Incorrect. From the phrases ‘type’ and ‘wild-type’
C0439810 | Total | 1,403 |
C0445392 | Wild | 1,388 | Incorrect. Should have mapped to Wild-type genetics (C0678926)
C0037585 | Computer software | 1,307 |
C1167624 | Labeling | 1,231 | Incorrect. Stigmatizing an individual; should have mapped to Staining and Labeling (C0886517)
C0439227 | Hour | 1,190 | Often correct. Sometimes incorrectly mapped for abbreviations such as ‘Protein H’ and ‘H. pylori’
C0439228 | Day | 1,171 | Often correct. Sometimes incorrectly mapped for abbreviations such as ‘DS domain’ and ‘oligo d(T)’
C0205397 | Seen | 1,126 | Correct. From the word ‘saw’
C0042789 | Vision | 1,124 | Imprecise. From the word ‘saw’
C0205173 | Duplicate | 1,002 |
C0333052 | Version, NOS | 986 | Incorrect. A type of surgical manipulation; the most appropriate concept would be Editions (C0441792)
C0439232 | Minute of time | 953 |
C0441633 | Scanning | 946 |
C0243148 | Control | 918 | Incorrect. An attribute, like image control or volume control; should have mapped to Control Groups (C0009932)
C0086418 | Homo sapiens | 903 |
C0020114 | Human | 877 |
C0439242 | ml | 868 |
C0337051 | Pool, NOS | 849 | Incorrect. A body of water; the most appropriate concept would be Combination (C0205195)
C0080194 | Muscle strain | 848 | Incorrect. From the word ‘strain’
C0040300 | Tissues | 825 |
C0020202 | Hybridization, genetic | 802 |
C0449945 | Strain typing | 793 |
C0681814 | Experiment | 793 |
C0017337 | Genes | 790 |
C0026809 | Mus | 786 |
C0596988 | Mutant | 786 |
C0205307 | Normal | 743 |
C0205409 | Isolated | 733 | Incorrect. Meaning a finding by itself, from phrases such as ‘we isolated DNA’; the most appropriate concept would be Purification (C0243114)
C0001779 | Age, NOS | 731 |
C0026845 | Muscle | 728 | Often correct. Sometimes incorrectly mapped for ‘Mus’
C0596981 | Muscle cells | 727 |
C0185117 | Expression, NOS | 722 | Incorrect. A subtype of surgical manipulation; should have mapped to Gene Expression (C0017262)
C0024554 | Male gender | 717 |
C0683312 | Categories | 678 |
C0040223 | Time | 629 |
C0002778 | Analysis of substances | 610 |
C0936012 | Analysis | 610 |
C0004561 | B-Lymphocytes | 601 |
C0443050 | Robinson | 598 | Incorrect. A named strain of organism; from the author ‘Mark D. Robinson’
C0010453 | Anthropological culture | 593 | Incorrect. From the word ‘culture’
C0220814 | Cultural | 593 | Incorrect. From the word ‘culture’
C0430400 | Laboratory culture | 593 |
C0009253 | Coitus | 587 | Incorrect. From the word ‘sex’

Column 3 indicates the number of mappings between a GEO annotation and the concept. Column 4 indicates the reason why several concepts were incorrectly mapped. Blanks indicate correctly chosen concepts.
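A ranking like the one in Table 1 amounts to counting annotation-to-concept mappings after a curator has rejected the grossest errors. The sketch below is a hypothetical illustration of that tally. The annotation identifiers, the `mappings` list and the `rejected` set are invented for the example; only the CUIs themselves are taken from Table 1.

```python
from collections import Counter


def top_concepts(mappings, rejected, n=50):
    """Count mappings per CUI, skipping CUIs a curator has rejected.

    `mappings` is an iterable of (annotation_id, cui) pairs;
    `rejected` is a set of CUIs judged to be gross mapping errors.
    """
    counts = Counter(cui for _annotation, cui in mappings if cui not in rejected)
    return counts.most_common(n)


# Hypothetical mappings; annotation identifiers are made up.
mappings = [
    ("sample-a/title", "C0007634"),        # Cells
    ("sample-a/description", "C0007634"),  # Cells
    ("sample-b/source", "C0007634"),       # Cells
    ("sample-b/description", "C0035668"),  # RNA
    ("series-x/description", "C0035668"),  # RNA
    ("sample-c/description", "C0009253"),  # Coitus, from the word 'sex'
]
# Rejecting Coitus is purely illustrative; Table 1 itself still lists it.
print(top_concepts(mappings, rejected={"C0009253"}))
```

Which mappings count as gross errors is a manual judgment in the text, so the `rejected` set stands in for that curation step rather than automating it.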


Table 2 Taxonomy of mistakes in concept mapping

Error | Representative example

Errors in GEO text
Poor text formatting | GEO series 91 description lists “…gene expression.<li>Brain:”, yet the GEO file format does not specify that descriptions written using the Hypertext Markup Language are allowed. ‘<li>’ maps to the concept Lithium
Unfortunate choice of experimental identifiers | GEO sample 6,751 title lists ‘Ova A/J 3.’ which maps to the concept Ovum (C0029974), whereas the investigators used ‘Ova’ to abbreviate ‘ovalbumin’
Irrelevant text in GEO | Over seventy GEO samples reproduce the entire MIAME checklist in their descriptions, literally beginning with “The MIAME Checklist Experiment Design…”; these are too long to be parsed by our software
Spelling errors in GEO text | GEO sample 4,100 source lists “murine subcontaneous adipose” instead of ‘subcutaneous’

Errors in UMLS
Missing concepts in UMLS | ‘PBS’ maps to the concepts Lead (the element), Lead (a homeopathic remedy) and Peripheral Blood, but there is no available concept for phosphate-buffered saline
Missing synonym variants in UMLS | The term ‘wild type’ (without a dash) is used in over 400 GEO samples, series and data sets. The most appropriate concept is wild-type genetics, but there is no listed synonym for ‘wild type,’ so this term instead maps to Wild, a subtype of serotype typing

Errors in parsing
Mapping to unlisted concepts | MetaMap created 21,867 mappings using 465 string unique identifiers that were not present in our subset of UMLS
Abbreviation errors | The term ‘CGS,’ from the compounds CGS 23425, a thyroid hormone analog, and CGS 21680, a selective agonist for an adenosine receptor, maps to colloid goiter
Mistaken identity from experimental description | GEO sample 12011 description lists “…GenePix software analysis… Axon Instruments…” which maps to the concept Axon
Mismatch of semantic-type | Over 500 sample descriptions use the phrase “spot quality assessment,” which maps to the disease concept Exanthema, or ‘spots’
Inappropriate lexical variant processing | GEO sample 51 description lists “…Gene expression profile in developing mouse cerebellum at postnatal day 7,” which internally maps to the abbreviation ‘PND,’ which then maps to Paroxysmal Dyspnea (commonly also abbreviated PND)
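Two rows of Table 2 (the ‘Ova’ identifier and the ‘Axon Instruments’ vendor name) share a simple mechanism: a word in the annotation matches a concept name regardless of what the author meant. The toy lookup below reproduces that failure mode and a manual rejection step. It is a deliberately naive sketch and not MetaMap; the lexicon, the function names and the rejection set are all hypothetical.

```python
import re

# Hypothetical three-entry lexicon of lowercase token -> concept name.
TOY_LEXICON = {"ova": "Ovum", "axon": "Axon", "muscle": "Muscle"}


def naive_map(text, lexicon=TOY_LEXICON):
    """Return a concept for every token that appears in the lexicon."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [lexicon[t] for t in tokens if t in lexicon]


def curated_map(text, rejected, lexicon=TOY_LEXICON):
    """Same lookup, minus concepts a curator has rejected for this corpus."""
    return [c for c in naive_map(text, lexicon) if c not in rejected]


# 'Ova' abbreviates ovalbumin in the sample title, yet matches Ovum.
print(naive_map("Ova A/J 3."))
# A vendor name in a protocol description matches Axon.
print(naive_map("GenePix software analysis, Axon Instruments"))
# Rejecting Axon keeps the muscle concept and drops the vendor match.
print(curated_map("skeletal muscle, Axon Instruments", rejected={"Axon"}))
```

Rejecting a concept for the whole corpus is a blunt instrument: it would also discard any annotation that really concerns axons, which is one reason such errors are hard to remove automatically.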

to increase with age in rat brain and other tissues18,19. Some individuals with G6PDH deficiency have been noted to have reduced mortality from cardiovascular disease and have increased longevity, though this is associated with the X-linked gene20.

BDNF was expressed at a lower level in the four data sets annotated with aging compared with 47 nonannotated data sets (Supplementary Fig. 1b online, P = 7.3 × 10^-10, Q < 0.0001); BDNF has been previously shown to have a significant drop in expression in human skin fibroblasts associated with advancing age21. The other nine genes, including TNNT1, SYNJ1, TADA2L, SLC7A2, MGAT2 and KCTD2, have no previously established association with aging, though Tnnt1 and Mgat2 have been associated with nemaline myopathy and type 2a congenital disorder

[Figure 2a: dendrogram of GEO data sets clustered by their mapped UMLS concepts. The distance axis runs from 0.0 to 1.0. Each leaf is labeled with a data set title followed by a number, for example “Inflammatory myopathy: 198”, “Muscle function in old age: 123” and “Aging in yeast: 362”. The remaining leaf labels are omitted here.]
cdc15 blockrelease time course: 124 time course: cdc15 blockrelease Cell cycle, Cardiac development, maturation and aging: 40 and aging: maturation Cardiac development, Pathogen exposure and immune response: 260 response: and immune exposure Pathogen Transacting regulatory variation (dyeswap): 465 regulatory (dyeswap): variation Transacting Thyroid hormone effect on cardiomyocytes: 468 on cardiomyocytes: hormone effect Thyroid Ligand screen in B cells: prostaglandin E2: 316 prostaglandin E2: Ligand screen in B cells: Ligand screen in B cells: formylMetLeuPhe: 302 formylMetLeuPhe: Ligand screen in B cells: 305 gamma: interferon Ligand screen in B cells: Circadian rhythm in Clock and Cry mutants: 413 and Cry in Clock mutants: Circadian rhythm Intraspecific variation Canton S (dyeswap): 439 Canton S (dyeswap): variation Intraspecific Congenital heart disease E12.5 (Mu11KA): 384 Congenital heart disease E12.5 (Mu11KA): 385 Congenital heart disease E12.5 (Mu11KB): Evolution of gene expression (Nov2001) (I): 209 (I): (Nov2001) of gene expression Evolution 211 (I): (July2001) of gene expression Evolution Homeobox transcription factor Otx2 targets: 475 Otx2 targets: transcription factor Homeobox Wing imaginal disc spatial gene expression: 192 Wing imaginal disc spatial gene expression: Progesterone effects on uterus time course: 186 on uterus time course: Progesterone effects Stem cell self renewal and Bmi1 (MGU74A): 428 and Bmi1 (MGU74A): Stem cell self renewal 429 and Bmi1 (MGU74B): Stem cell self renewal Evolution of gene expression (Nov2001) (II): 210 (II): (Nov2001) of gene expression Evolution 212 (II): (July2001) of gene expression Evolution Stem cell self renewal and Bmi1 (MGU74C): 430 and Bmi1 (MGU74C): Stem cell self renewal Interstitial cystitis and antiproliferative factor: 391 factor: Interstitial cystitis and antiproliferative Arthritis synoviocyte response to TNF alpha: 386 Arthritis response to TNF alpha: synoviocyte Ligand screen in B cells: 
lipopolysaccharide: 310 lipopolysaccharide: Ligand screen in B cells: Macrophages infected with Salmonella (SHZ): 76 with Salmonella (SHZ): Macrophages infected Uterine fibroid and normal myometrial tissue: 484 Uterine fibroid and normal tissue: myometrial Melanoma, cutaneous malignant, classification: 2 Melanoma, cutaneous malignant, classification: Macrophages infected with Salmonella (SHV): 79 with Salmonella (SHV): Macrophages infected Macrophage response to lipopolysaccharide: 275 Macrophage response to lipopolysaccharide: Heat shock from various temperatures to 37C: 15 to 37C: temperatures from various Heat shock Endothelial cells, renal glomerular 185 Endothelial cells, and aortic: Duchenne muscular dystrophy (MuscleChip): 214 (MuscleChip): dystrophy Duchenne muscular Hepatocyte growth factortreated kidney cells: 406 cells: factortreated kidney Hepatocyte growth Ligand screen in B cells: nerve growth factor: 313 factor: nerve growth Ligand screen in B cells: Homeobox otd/Otx2 transcription factor action: 23 action: otd/Otx2 transcription factor Homeobox Muscle function and aging male (HGU133A): 287 Muscle function and aging male (HGU133A): 288 Muscle function and aging male (HGU133B): Dystrophindeficient mdx muscle regeneration: 236 regeneration: Dystrophindeficient mdx muscle Allergic response to ragweed in lung (U74Av1): 42 in lung (U74Av1): Allergic response to ragweed 58 in lung (U74Bv2): Allergic response to ragweed 56 in lung (U74Av2): Allergic response to ragweed 13 in lung (U74Bv1): Allergic response to ragweed Allergic response to ragweed in lung (U74Cv2): 60 in lung (U74Cv2): Allergic response to ragweed 14 in lung (U74Cv1): Allergic response to ragweed Microarray technology comparison (HGU95A): 223 technology comparison (HGU95A): Microarray Muscle response to acute resistance exercise: 278 Muscle response to acute resistance exercise: Macrophages infected with Salmonella (SHAI2): 78 with Salmonella (SHAI2): Macrophages infected Macrophages 
infected with Salmonella (SHS)(I): 80 with Salmonella (SHS)(I): Macrophages infected Disease resistance in callose synthase mutant: 417 Disease resistance in callose synthase mutant: Alpha thalassaemia myelodysplasia syndrome: 357 syndrome: Alpha thalassaemia myelodysplasia Cell cycle, alphafactor blockrelease time course: 38 time course: blockrelease alphafactor Cell cycle, Intraspecific variation Zimbabwe53 (dyeswap): 441 (dyeswap): Zimbabwe53 variation Intraspecific Ligand screen in B cells: Bcell activating factor: 293 factor: Bcell activating Ligand screen in B cells: Acute lymphoblastic leukemia relapse analysis: 363 relapse analysis: leukemia Acute lymphoblastic Ligand screen in B cells: lysophosphatidic acid: 309 lysophosphatidic acid: Ligand screen in B cells: Muscle function and aging female (HGU133A): 472 (HGU133A): Muscle function and aging female 473 (HGU133B): Muscle function and aging female Kaposi’s sarcoma, AIDSrelated, chemotherapy: 190 sarcoma, AIDSrelated, chemotherapy: Kaposi’s Macrophages infected with Salmonella (SHS)(II): 81 with Salmonella (SHS)(II): Macrophages infected and adenine starvation time course: 115 Amino acid and adenine starvation time course: Macrophages infected with Salmonella (SHAD2): 77 with Salmonella (SHAD2): Macrophages infected Midgut metamorphosis and ecdysone signaling: 445 Midgut metamorphosis and ecdysone signaling: Leaf morphogenesis control by JAW microRNA: 418 microRNA: Leaf morphogenesis JAW control by Ionizing radiation effect on lymphoblastoid cells: 479 cells: on lymphoblastoid effect Ionizing radiation Variation in gene expression among individuals: 331 among individuals: in gene expression Variation Skeletal repair in nonunion fractures (HGU95A): 367 (HGU95A): fractures repair in nonunion Skeletal 368 (HGU95B): fractures repair in nonunion Skeletal Skeletal repair in nonunion fractures (HGU95C): 369 (HGU95C): fractures repair in nonunion Skeletal Severe combined immunodeficiency (HGU95A): 421 (HGU95A): 
combined immunodeficiency Severe Dendritic cells and lipopolysaccharide response: 380 Dendritic cells and lipopolysaccharide response: Largescale analysis of the mouse transcriptome: 182 Largescale analysis of the mouse transcriptome: Largescale analysis of the human transcriptome: 181 Largescale analysis of the human transcriptome: Macrophage response to IL6, and SOCS3 effect: 325 Macrophage response to IL6, and SOCS3 effect: MHC class II transactivator targets (MGU74Av2): 431 targets (MGU74Av2): MHC class II transactivator 432 targets (MGU74Av1): MHC class II transactivator Allergic asthma model (A/J, ovalbumin sensitive): 349 sensitive): ovalbumin Allergic asthma model (A/J, Type 2 diabetes and insulin resistance (Hu35kB): 160 2 diabetes and insulin resistance (Hu35kB): Type 158 2 diabetes and insulin resistance (Hu35kA): Type Severe combined immunodeficiency (HGU133A): 420 (HGU133A): combined immunodeficiency Severe Pharmacogenomic effect of corticosteroid in liver: 253 of corticosteroidPharmacogenomic effect in liver: Cardiac hypertrophy induced by pressureoverload: 41 pressureoverload: induced by Cardiac hypertrophy Type 2 diabetes and insulin resistance (Hu35kD): 162 2 diabetes and insulin resistance (Hu35kD): Type 161 2 diabetes and insulin resistance (Hu35kC): Type Ligand screen in B cells: platelet activating factor: 315 factor: platelet activating Ligand screen in B cells: Ligand screen in B cells: sphingosine1phosphate: 317 sphingosine1phosphate: Ligand screen in B cells: mRNA processing factors and splicing (dyeswap): 141 and splicing (dyeswap): mRNA processing factors Circadian oscillations and cardiovascular function: 404 function: Circadian oscillations and cardiovascular Allergic asthma model (C3H, ovalbumin resistant): 347 resistant): Allergic asthma model (C3H, ovalbumin Plant PK12 LAMMER kinase overexpression (9K): 208 (9K): Plant PK12 LAMMER kinase overexpression Serum stimulation with cycloheximide time course: 145 time course: Serum with 
cycloheximide stimulation Colon carcinoma response to butyrate and aspirin: 332 and aspirin: Colon carcinoma response to butyrate SBFMBF genomic distribution (intergenic_v1.0) (I): 138 (intergenic_v1.0) (I): SBFMBF genomic distribution Energy balance perturbations effect on liver tissue: 334 tissue: Energy balance perturbations on liver effect Microarray technology comparison (4Hs version 2): 221 2): technology comparison (4Hs version Microarray Ligand screen in B cells: insulinlike growth factor 1: 306 1: factor growth insulinlike Ligand screen in B cells: Kidney: maintenance of sodium and water balance: 474 balance: maintenance of sodium and water Kidney: SBFMBF genomic distribution (intergenic_v1.0) (II): 139 (intergenic_v1.0) (II): SBFMBF genomic distribution Cardiac hypertrophy and phosphoinositide 3kinase: 446 and phosphoinositide 3kinase: Cardiac hypertrophy Ras/cAMP signal transduction pathway (dye swap): 457 swap): (dye pathway Ras/cAMP signal transduction Plant PK12 LAMMER kinase overexpression (12K): 207 (12K): Plant PK12 LAMMER kinase overexpression Ligand screen in B cells: CpGoligodeoxynucleotide: 299 CpGoligodeoxynucleotide: Ligand screen in B cells: Acute lymphoblastic leukemia treatment responses: 330 treatment responses: leukemia Acute lymphoblastic Tumor suppressor protein p53 gene dosage effects: 170 suppressor protein p53 gene dosage effects: Tumor Pulmonary fibrosis model (A/J, bleomycin resistant): 350 resistant): Pulmonary bleomycin fibrosis model (A/J, Type 2 diabetes and insulin resistance (HuGeneFL): 157 2 diabetes and insulin resistance (HuGeneFL): Type Ligand screen in B cells: tumor necrosis factoralpha: 322 tumor necrosis factoralpha: Ligand screen in B cells: Retinal pigment epithelium cells: native and cultured: 482 and cultured: native Retinal pigment epithelium cells: Ligand screen in B cells: stromal cell derived factor1: 318 factor1: stromal cell derived Ligand screen in B cells: Mefloquine treatment of NG108 neuroblastoma 
cells: 199 cells: Mefloquine treatment of NG108 neuroblastoma Amyotrophic lateral sclerosis (Lou Gehrig’s Disease): 412 Disease): sclerosis (Lou Gehrig’s lateral Amyotrophic Embryonic stem cell cholesterol metabolism mutants: 434 Embryonic stem cell cholesterol metabolism mutants: Normal human tissue expression profiling (HGU95B): 423 profiling (HGU95B): Normal human tissue expression 426 profiling (HGU95E): Normal human tissue expression 422 profiling (HGU95A): Normal human tissue expression Normal human tissue expression profiling (HGU95D): 425 profiling (HGU95D): Normal human tissue expression 424 profiling (HGU95C): Normal human tissue expression Juvenile dermatomyositis muscle profile (HuGeneFL): 258 profile (HuGeneFL): muscle dermatomyositis Juvenile Heart failure and left ventricular assist device support: 462 support: Heart assist device and left ventricular failure CD8+ effector and central memory T cells (MGU74A): 433 memory and central T cells (MGU74A): CD8+ effector 435 memory and central T cells (MGU74B): CD8+ effector CD8+ effector and central memory T cells (MGU74C): 436 memory and central T cells (MGU74C): CD8+ effector Heart failure induced by artificial myocardial infarction: 466 artificial infarction: Heart induced by myocardial failure Toxicosis induced by endophyte infected fescue seed: 491 seed: fescue infected endophyte induced by Toxicosis Microarray technology comparison (3OPHs version 1): 222 1): technology comparison (3OPHs version Microarray Thyroid hormone effect on cardiomyocytes (dyeswap): 469 (dyeswap): on cardiomyocytes hormone effect Thyroid Oxygenregulated gene expression in proteobacterium: 290 in proteobacterium: Oxygenregulated gene expression Juvenile dermatomyositis muscle profile (MuscleChip): 215 profile (MuscleChip): muscle dermatomyositis Juvenile Allergic asthma model (C57BL/6, ovalbumin resistant): 348 resistant): Allergic asthma model (C57BL/6, ovalbumin Tissuespecific and developmentregulated genes in maize: 4 genes 
in maize: Tissuespecific and developmentregulated Ligand screen in B cells: Blymphocyte chemoattractant: 294 Blymphocyte chemoattractant: Ligand screen in B cells: CGH logarithmic and stationary phase comparison, LT2: 230 CGH logarithmic and stationary phase comparison, LT2: Pulmonary fibrosis model (129/SV, bleomycin sensitive): 351 sensitive): bleomycin Pulmonary fibrosis model (129/SV, Ligand screen in B cells: transforming growth factorbeta: 321 factorbeta: growth transforming Ligand screen in B cells: SBFMBF genomic distribution (ORF_intergenic_v1.0) (I): 136 (ORF_intergenic_v1.0) (I): SBFMBF genomic distribution Age and retinal pigmented epithelium/choroid (Atlas 1.2): 396 Age and retinal pigmented epithelium/choroid (Atlas 1.2): Endometrial stromal cell differentiation induced by cAMP: 286 cAMP: induced by Endometrial stromal cell differentiation SBFMBF genomic distribution (ORF_intergenic_v1.0) (II): 137 (ORF_intergenic_v1.0) (II): SBFMBF genomic distribution Bovine functional genomics microarray evaluation lymph: 373 lymph: evaluation functional genomics microarray Bovine Crossspecies hybridization and RNA concentration effect: 342 effect: and RNA concentration Crossspecies hybridization Comparative analysis of human and great ape fibroblasts: 340 ape fibroblasts: analysis of human and great Comparative Spinal cord injury and regeneration time course (RGU34A): 63 Spinal cord injury time course (RGU34A): and regeneration CGH logarithmic 229 and stationary phase comparison, CT18: Serotonergic hallucinogen effect in somatosensory cortex: 346 in somatosensorySerotonergic hallucinogen effect cortex: Cancer Genome Anatomy Project SAGE library collection: 217 library collection: Project SAGE Cancer Genome Anatomy Cochlear hair cell line differentiation time course (Mu11KA): 45 time course (Mu11KA): Cochlear hair cell line differentiation 51 time course (Mu11KB): Cochlear hair cell line differentiation Cystic fibrosis pathology and 4phenylbutyrate (HGU133A): 493 
(HGU133A): Cystic fibrosis pathology and 4phenylbutyrate 494 (HGU133B): Cystic fibrosis pathology and 4phenylbutyrate Age and retinal pigmented epithelium/choroid (Atlas 1.2 II): 397 Age and retinal pigmented epithelium/choroid (Atlas 1.2 II): Retinal gene expression in retinitis pigmentosa 1 knockout: 333 in retinitis pigmentosa 1 knockout: Retinal gene expression Bovine functional genomics microarray evaluation adrenal: 372 adrenal: evaluation functional genomics microarray Bovine Spinal cord injury and regeneration time course (RGU34B): 259 Spinal cord injury time course (RGU34B): and regeneration Spinal cord injury and regeneration time course (RGU34C): 339 Spinal cord injury time course (RGU34C): and regeneration Pharmacogenomic effect of corticosteroid in skeletal muscle: 256 of corticosteroidPharmacogenomic muscle: effect in skeletal Ligand screen in B cells: macrophage inflammatory protein3: 312 macrophage inflammatory protein3: Ligand screen in B cells: Ligand screen in B cells: growth hormone releasing hormone: 303 hormone growth releasing hormone: Ligand screen in B cells: Ligand screen in B cells: Epstein Barr virusinduced molecule1: 301 Epstein Barr virusinduced molecule1: Ligand screen in B cells: Testis somatic cell developmentallyregulated gene expression: 338 gene expression: somatic cell developmentallyregulated Testis Ligand screen in B cells: secondary lymphoidorgan chemokine: 319 secondary lymphoidorgan chemokine: Ligand screen in B cells: Ligand screen in B cells: CGS21680 hydrochloride (adenosine): 298 (adenosine): CGS21680 hydrochloride Ligand screen in B cells: Platform type/amplification method cross comparison (U74Av2): 227 type/amplification method cross comparisonPlatform (U74Av2): Cardiac hypertrophy: physiologically and pathologically induced: 489 and pathologically induced: physiologically Cardiac hypertrophy: Heart failure induced by artificial myocardial infarction (dyeswap): 467 (dyeswap): artificial infarction Heart induced by 
myocardial failure Extracellular matrixmodulation of malignant phenotype in bladder: 408 matrixmodulationExtracellular of malignant phenotype in bladder: Platform type/amplification method cross comparison (16K Oligo): 226 type/amplification method cross comparisonPlatform (16K Oligo): Hypothalamus response to lipopolysaccharide and restraint stress: 324 response to lipopolysaccharide stress: Hypothalamus and restraint Dietary effects of long chain polyunsaturated fatty acids (Mu11KA): 168 acids (Mu11KA): fatty of long chain polyunsaturated Dietary effects 169 acids (Mu11KB): fatty of long chain polyunsaturated Dietary effects Cancer Genome Anatomy Project SAGE library collection (mouse): 381 library collection (mouse): Project SAGE Cancer Genome Anatomy Platform type/amplification method cross comparison (16K cDNA)(I): 224 type/amplification method cross comparisonPlatform (16K cDNA)(I): Platform type/amplification method cross comparison (16K cDNA)(II): 225 type/amplification method cross comparisonPlatform (16K cDNA)(II): GEO Data Sets

    Figure 2 Hierarchical clustering of 448 GEO data sets by context, created by treating each data set as a vector representing the presence or absence of a mapping from that data set to each UMLS concept, then calculating binary distance between data sets and clustering using complete linkage. (a) A total of 6,612 samples are represented in these data sets. The numeral 1 indicates a cluster with 33 data sets, all from the Alliance for Cellular Signaling, with almost identical experimental methods. (b) A magnified view of the cluster indicated by numeral 2 is shown. Numbers to the right of data set titles indicate GDS accession numbers. Within the cluster, muscle samples studied with respect to aging cluster together. Samples from inflammatory myopathy, muscular dystrophy and dermatomyositis also cluster together. The 19 data sets in this cluster were contributed to GEO by three different submitters.

    [Figure 2b graphic omitted: dendrogram with a distance axis running from 0.0 to 1.0. Leaf labels, in order (title: GDS accession number): Muscle function and aging female (HGU133A): 472; Muscle function and aging female (HGU133B): 473; Muscle function and aging (HGU95A): 156; Muscle function in old age: 123; Muscle function and aging male (HGU133A): 287; Muscle function and aging male (HGU133B): 288; Muscle profiles: 276; Muscle, extraocular, critical period: 336; Inflammatory myopathy: 198; Muscle response to acute resistance exercise: 278; Muscle, normal extraocular, profile: 254; Juvenile dermatomyositis muscle profile (MuscleChip): 215; Juvenile dermatomyositis muscle profile (HuGeneFL): 258; Duchenne muscular dystrophy (MuscleChip): 214; Duchenne muscular dystrophy (HGU95B): 270; Duchenne muscular dystrophy (HGU95E): 265; Duchenne muscular dystrophy (HGU95D): 264; Duchenne muscular dystrophy (HGU95A): 262; Duchenne muscular dystrophy (HGU95C): 263.]
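    The clustering procedure named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that "binary distance" means the proportion of concepts present in only one of two data sets among the concepts present in at least one of them; the data set names and presence/absence vectors below are hypothetical and are not taken from GEO, and the naive merging loop is an illustration rather than the software used in the study.

```python
from itertools import combinations


def binary_distance(a, b):
    """Fraction of concepts present in exactly one vector, among those present in either."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0  # two data sets with no concepts at all are treated as identical
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return differ / either


def complete_linkage(vectors, n_clusters):
    """Naive agglomerative clustering; cluster distance is the largest pairwise distance."""
    clusters = [[name] for name in vectors]

    def cluster_distance(c1, c2):
        return max(binary_distance(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical concept columns: Muscle, Aging, Dystrophy, B-Lymphocytes
vectors = {
    "aging_1": [1, 1, 0, 0],
    "aging_2": [1, 1, 0, 0],
    "dystrophy_1": [1, 0, 1, 0],
    "dystrophy_2": [1, 0, 1, 0],
    "bcell_1": [0, 0, 0, 1],
}
clusters = complete_linkage(vectors, 3)
print(sorted(sorted(c) for c in clusters))
```

    In this toy input the two aging data sets and the two dystrophy data sets share all of their concepts, so each pair merges first, while the B cell data set shares no concept with the others and remains on its own.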

    58 VOLUME 24 NUMBER 1 JANUARY 2006 NATURE BIOTECHNOLOGY ANALYSIS

    of glycosylation, respectively. A decrease in TNNT1 and an increase in SYNJ1 have been found to be associated with differentiation from human embryonic and hematopoietic stem cells, respectively22,23.

    We found gene-concept relations for other phenotypic concepts, including diseases. The only gene relating to leukemia is DDX24; mean normalized expression of DDX24 drops from 0.71 in the six data sets measuring this gene not associated with leukemia to 0.44 in the four data sets with the annotation (Supplementary Fig. 1c online, P = 0.007, Q = 0.01). Interestingly, this gene was first cloned from a leukemia cDNA library24.

    The concept of injury represents an environmental annotation and was related only to GPX3 (plasma glutathione peroxidase) and MAPK14 (mitogen-activated protein kinase 14). Both demonstrate a significant increase in mean normalized expression levels in the four data sets associated with injury compared with over 100 other data sets measuring these genes without this annotation (Supplementary Fig. 1d,e online, GPX3 increases from 0.56 to 0.97; MAPK14 increases from 0.52 to 0.89; for both, P < 1 × 10^−15 and Q < 0.0002). Both genes have been shown to be related to injury of various forms. Increased expression of plasma

    [Figure 3 graphic omitted. Panels a and b: network of UMLS concept nodes and gene nodes, each gene labeled with its symbol and a numeric identifier in parentheses; concept labels legible in panel b include C0596981:Muscle Cells, C0026845:Muscle and C0242695:Muscle, Skeletal. Panels c and d: bar charts of the mean normalized expression level (0 to 1) of PDLIM3 (c) and Pdlim3 (d) by GEO data set number; legend: GDS annotated with one or more of four muscle concepts, GDS annotated with none of four muscle concepts.]

    Figure 3 Network of relations between 46 biomedical concepts extracted from the annotations of data sets in Gene Expression Omnibus and 444 genes with differential expression associated with the presence or absence of the concept. (a) Light blue nodes are UMLS concepts. Pink nodes are genes with higher expression levels in data sets annotated with their related concept; light green nodes are genes with lower expression levels in annotated data sets. Pink and green nodes are contained within gray squares indicating ortholog families. Edges (dashed) between an ortholog family and concept indicate statistically significant relations between that concept and each included gene. The remaining edges (solid arrows) indicate existing hierarchical relations between UMLS concepts. (b) Muscle Cells and three related concepts were among the most highly connected concepts, and relate to increased expression of MYH6, MYBPC1, Mybph, TNNI1 and other genes. (c) Of the human genes related to Muscle Cells, PDLIM3 shows the greatest differential expression. There is a significant increase in the normalized expression of PDLIM3 in the eight data sets annotated with Muscle Cells (dark shaded bars) compared with the 34 data sets without (light shaded bars). X-axis labels indicate GEO data set numbers. (d) A similar significant pattern of association with Muscle Cells is seen with mouse Pdlim3.
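    The comparison shown in panels c and d, mean normalized expression in data sets annotated with a concept versus data sets without it, can be sketched as follows. The statistical test used in the study is not described in this section, so the one-sided exact permutation test below is a swapped-in illustration and an assumption, as are the function names and the expression values, which are hypothetical.

```python
from itertools import combinations
from statistics import mean


def mean_difference(annotated, unannotated):
    """Mean normalized expression with the annotation minus the mean without it."""
    return mean(annotated) - mean(unannotated)


def exact_permutation_p(annotated, unannotated):
    """One-sided exact permutation P value for an increase in the annotated group.

    Enumerates every way of relabeling the pooled data sets, so it is only
    practical for small groups such as the toy example below.
    """
    pooled = list(annotated) + list(unannotated)
    observed = mean_difference(annotated, unannotated)
    k = len(annotated)
    total = 0
    at_least_as_large = 0
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        group = [pooled[i] for i in idx]
        rest = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if mean_difference(group, rest) >= observed - 1e-12:
            at_least_as_large += 1
    return at_least_as_large / total


# Hypothetical per-data-set mean normalized expression levels for one gene.
with_concept = [0.90, 0.80, 0.85]
without_concept = [0.20, 0.30, 0.25, 0.10, 0.40]

difference = mean_difference(with_concept, without_concept)
p_value = exact_permutation_p(with_concept, without_concept)
print(round(difference, 2), p_value)
```

    With the group sizes in panel c (8 annotated and 34 unannotated data sets) exhaustive enumeration would be too slow, and a sampled permutation test or a parametric test would be the more practical choice.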


    glutathione peroxidase has been shown to be protective against toxic Outlook injury to the liver25, presumably by preventing the damaging effects of As we can now show that UMLS concepts can be used to represent the reactive oxygen species. Plasma glutathione peroxidase activity signifi- experimental context of a genomics experiment, even if only at a coarse cantly drops after burn injury, and rises after spinal cord injury26,27. The resolution, we now have an opportunity to organize our research com- mitogen-activated protein kinase pathway is activated in a variety of munity to start the call for submitters of genomic data to describe their processes, including in responses to environmental stresses and injury. experiments using UMLS concepts. This could be done in a manner MAPK14 is activated during wound healing and increases after ischemic similar to how journals require that microarray data described in manu- myocardial injury28,29. scripts be available in public repositories in standard formats10,11,35. Software tools could become available to assist authors in choosing Conclusions the correct concepts, allowing mappings to UMLS to be more accurate We have created and validated a system that identifies and repre- than those determined through automation. Finally, future successful sents phenotypic, environmental and experimental context for every phenome-genome and envirome-genome studies are assured, but the microarray sample and data set stored in GEO by mapping annota- ease and accuracy of automating inferences across data are crucially tion phrases to biomedical concepts in UMLS. 
In addition to merely dependent on the accuracy and consistency of the human annotation holding series of gene measurements, data sets can now be considered process, which will only happen when every investigator has a better by their phenotypic, environmental and experimental labels and even prospective understanding of the long-term value of the time invested clustered based on their shared annotative concepts (Fig. 2), similar to in improving annotations. how transcriptome measurements are commonly clustered based on expression measurements. METHODS After finding annotative concepts in UMLS for every GEO data set, Source code for software and all relevant data sets are available at http://genotext. we have shown a network of relations between phenotype (e.g., aging), stanford.edu. disease (e.g., leukemia), environmental (e.g., injury) and experimental context (e.g., muscle cells) and genes with differential expression asso- Gene Expression Omnibus annotations. The Gene Expression Omnibus (GEO) is an international repository for gene expression data, developed http://www.nature.com/naturebiotechnology ciated with these concepts. Some of these relations exist even across 9 orthologous genes in two or more species, and we have validated several and maintained by the National Library of Medicine . GEO samples (abbre- viated GSM) relate expression measurements of multiple RNA transcripts of these phenome-genome and envirome-genome relations (Fig. 3). It with platform-specific identifiers. Though GEO holds samples from many is traditionally difficult to find genes associated with components of types of parallel expression measurement systems, hereafter we will use the phenotype and environment. 
The example relations we have shown here were found because they were seen across multiple data sets (and types of microarrays) that were originally created to study many different processes, and only by integrating the data sets across the phenome and envirome were we able to find these associations. These relations immediately suggest further targeted cellular, genetic and epidemiological study of how these genes may influence or be influenced by phenotype and environment. For example, mice that are homozygous null for Slc7a2 show a reduction in nitric oxide production in macrophages and fibroblasts, important in immune response and wound healing30,31. Mice homozygous null for Mgat2 show early postnatal lethality and a number of other defects32. Decreases in TNNT1 and increases in SYNJ1 are markers for stem cell differentiation22,23. Our relations link these four genes and others to aging; it is possible that subtle changes in expression of these genes may affect aging of an organism. Further studies should be carried out to elucidate the roles of these genes in aging.

In addition, our findings suggest that UMLS is missing concepts important in molecular and cell biology experimentation, but is sufficient to represent many of the phenotypic, environmental and experimental concepts held in the text-based annotations of transcriptome data. Though the representation of molecular biology concepts in UMLS could be improved and the parsing made more robust, improving the quality of the annotations is now the crucial step toward the automated determination of context. As more journals call for microarray data to be made publicly available, there is an increasing level of detail being placed in the annotations, with the goal of aiding others in reproducing the findings. In general, as more text is entered having little to do with the experimental design, or as more words are abbreviated, the potential for errors in understanding rises, whether during manual or automated text processing. Findings have already been published in journals, which were later judged as inaccurate because of incorrect annotations33,34. Parsimony and precision in annotations will lead to better manual and computational understanding.

term “microarray” interchangeably with “sample.” The mappings between gene identifiers used in the sample and external gene identifiers, names and symbols are stored in GEO platforms (abbreviated GPL). Each GEO series (abbreviated GSE), usually corresponding to a single experiment, relates to multiple GSM. A subset of the GSE has previously been manually validated as containing internally comparable data; these are represented as GEO data sets (abbreviated GDS). The relationship between GSM, GSE, GPL and GDS is illustrated in Figure 1.

We accessed the Gene Expression Omnibus site on March 24, 2004, and downloaded each GEO series. At the time of downloading, the GEO FTP site contained 8,519 GSM in 524 GSE measured using 195 GPL, as well as 448 GDS. We wrote software in PERL that extracts and stores within a MySQL relational database seven types of annotations from the Gene Expression Omnibus: GEO sample title, description, source and keyword; GEO series title and description; and GEO data set title.

Unified Medical Language System concepts represent phenotype. To accurately model the context of a sample, we need a vocabulary that spans the domains to be joined, including terms describing genes and proteins and their functions, organism phenotypes, species, medical conditions and the experiments themselves. The interlinked set of controlled vocabularies that best serves this role is the Unified Medical Language System (UMLS). UMLS is currently freely available to academic researchers, who need only a signed license agreement. UMLS has three components. The metathesaurus contains a catalog of unified biomedical concepts, relations between concepts and text strings mapped to each concept. The semantic network contains a catalog of 135 higher level categories for all concepts in the metathesaurus, as well as relations between these categories. The specialist lexicon and other resources provide data and tools for processing the text strings associated with UMLS concepts.

The UMLS metathesaurus relates source vocabularies by creating concepts that span the vocabularies. Each unique concept relates to one or more source vocabularies. A concept may have multiple listed synonyms and terms; each term is uniquely specified in the metathesaurus. Two types of relations between concepts are stored in the metathesaurus: asserted structural or hierarchical relations, and statistical relations (determined by co-occurrence of concepts in records). An example of the basic relations in the metathesaurus is shown in the Supplementary Figure 2 online.
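The metathesaurus organization described above (concepts that span source vocabularies, several terms per concept, and two kinds of concept-to-concept relations) might be pictured with the following minimal Python sketch. It is an illustration only and does not reproduce the UMLS file layout: the class names, field names, vocabulary labels (`VOCAB_A`, `VOCAB_B`), the synonym and the hierarchical link are all assumptions made for the example. The two concept identifiers are ones quoted elsewhere in this paper's Methods.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One unified concept; field names are illustrative, not UMLS columns."""
    concept_id: str
    terms: set = field(default_factory=set)    # synonyms and other terms
    sources: set = field(default_factory=set)  # contributing source vocabularies


@dataclass(frozen=True)
class Relation:
    """A directed link between two concepts."""
    from_id: str
    to_id: str
    kind: str  # "hierarchical" (asserted) or "statistical" (co-occurrence)


class Metathesaurus:
    def __init__(self):
        self.concepts = {}    # concept_id -> Concept
        self.term_index = {}  # lower-cased term -> set of concept_ids
        self.relations = []

    def add_term(self, concept_id, term, source):
        """Attach a term from one source vocabulary to a concept."""
        concept = self.concepts.setdefault(concept_id, Concept(concept_id))
        concept.terms.add(term)
        concept.sources.add(source)
        self.term_index.setdefault(term.lower(), set()).add(concept_id)

    def relate(self, from_id, to_id, kind):
        if kind not in ("hierarchical", "statistical"):
            raise ValueError(f"unknown relation kind: {kind}")
        self.relations.append(Relation(from_id, to_id, kind))

    def lookup(self, text):
        """Return the concept ids whose terms match the text (case-insensitive)."""
        return sorted(self.term_index.get(text.lower(), set()))

    def neighbors(self, concept_id, kind):
        return [r.to_id for r in self.relations
                if r.from_id == concept_id and r.kind == kind]


meta = Metathesaurus()
# Hypothetical content: only the two concept identifiers come from the paper.
meta.add_term("C0242695", "Skeletal Muscle", "VOCAB_A")
meta.add_term("C0242695", "Muscle, Skeletal", "VOCAB_B")
meta.add_term("C0026845", "Muscle", "VOCAB_A")
meta.relate("C0242695", "C0026845", "hierarchical")
```

In this sketch a single concept collects terms from two vocabularies, which is the sense in which a concept spans vocabularies; a text string is resolved to concepts through the term index rather than by scanning every concept.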

    60 VOLUME 24 NUMBER 1 JANUARY 2006 NATURE BIOTECHNOLOGY ANALYSIS

The 2003AC release of UMLS was obtained through the National Library of Medicine (http://www.nlm.nih.gov/research/umls/). Using the METAMORPHOSYS tool, we created a subset, dropping vocabularies considered less relevant, including translations of terms into alternate languages. All subsequent analysis was performed using the remaining subset, which contains 878,496 concepts described by 1,724,070 text strings. A table of all mappings between annotations and concepts is available at the website given above.

Parsing annotations to find UMLS concepts. MetaMap is a software program written to take free text and generate a list of potentially matching concepts from the UMLS metathesaurus36. Using the MetaMap programming libraries, we created a software system called GENOTEXT (GENOmic conTEXT) in Java that processes each of the seven types of GEO annotations and stores in a relational database the UMLS string unique identifier (SUI) of all candidate matches, the score of the match and the phrase of original text in which the string was found. As concept unique identifiers (CUI) are needed in subsequent analyses, they are automatically determined using the SUI to CUI relations in the UMLS concepts table. We define gross mapping errors as those caused by the incorrect interpretation of abbreviations leading to multiple strings, such that it is highly unlikely that there is any proper reference for these mapped strings. We wrote a program that takes many manually specified SUI and text fragments (described as regular expressions) and eliminates these mappings (additional details may be found in the Supplementary Note online).

Clustering data sets by context. The value of clustering samples from multiple expression data sets by expression values has been previously demonstrated37. Here we show how gene expression data sets themselves can be clustered by phenotypic, environmental and experimental context, when this context is represented using a standardized vocabulary. Each data set is considered as a binary vector reflecting the presence or absence of a mapping from that data set or its subordinate GEO series and samples to a UMLS concept. Comprehensive pairwise binary distances are then computed between vectors. Hierarchical clustering is performed with complete linkage in R38.

Creation of phenomic-genomic relations. Coexpression networks have been previously shown to relate genes by similar function39,40. Here we show how we have comprehensively related gene expression, the transcriptome, to the rest of the envirome and phenome that remains external to these values in an automated manner. A mapping was manually created from 40% of the nearly 2.5 million GEO platform identifiers to LocusLink identifiers, allowing nearly 55-million expression measurements to be referenced using LocusLink. Mappings to genes were created using GEO platform files, and not by parsing annotations. Each gene expression measurement from each GEO sample is rank normalized to between 0 and 1, depending on the relative ranking of the gene’s expression level compared to other genes measured from the sample. A mean rank-normalized expression level is calculated for each gene across every subordinate GEO sample within a GEO data set, and the average normalized measurements are assigned to that GEO data set. Thus, each GEO data set has a panel of genes measured in its samples and a single average rank-normalized expression measurement for each gene.

We then iterate over the set of 55,619 measured genes, G, defined by NCBI Gene identifiers. For each gene g, we determine the GEO data sets in which the gene was measured, D, and then determine the UMLS concepts assigned to any of those GEO data sets, C. For each concept c in C, we determine the subset of D annotated with c, called d1, and those data sets within D not annotated with c, called d2. Concept c is only considered if subsets d1 and d2 have a minimum of four data sets each. We then determine whether the rank-normalized expression measurements of g are significantly different between d1 and d2. Significance is determined using a Student’s t-test with unpaired values. An f-test for significantly different variances is first performed for each comparison; if positive (α ≤ 0.005), the t-test is performed with Welch correction for unequal variance41. The threshold P value for determining significance is determined using 100 random permutations: for each gene g and concept c, the assignment of GEO data sets D between d1 and d2 is randomly shuffled and the t-test is repeated. A q-value is then calculated for each relation between gene g and concept c, based on the proportion of false-positive concepts in C we would expect mapped to the gene if concept c were significant. The table of all gene-concept relations (without additional filtering) is available at the website given above.

After all gene-concept relations are determined, those with Q values >10% are discarded. Two independent methods are then used to study the remaining relations. In the first method, we study only those relations between gene g and concept c if there is another gene g′ mapped to the same concept c, such that gene g and g′ are in different species yet are in the same homology family, as defined by Homologene9. In other words, a concept has to be strongly related to orthologous genes in two species to be included. These relations are made into a graph, additional UMLS hierarchical relations between concepts are added, and the graph is formatted using the yEd Java Graph Editor (yWorks GmbH). In the second method, we study only the most reliable relations, defined as those with Q values <0.01. With both methods, we manually verify the concept assignments to each annotation before interpretation.

Biological validation of relations between gene and muscle concepts. Two sources were used to biologically validate the genes we find related to concepts representing Skeletal (C0521324), Skeletal Muscle (C0242695), Muscle Cells (C0596981) and Muscle (C0026845). First, a symbol list was obtained from the Genomics Institute of the Novartis Research Foundation Mouse SymAtlas GNF1M for genes that were expressed in a muscle cDNA library (obtained from pooled male and female 6- to 11-week-old C57BL/6J mice) at two or more times the median expression level across all measured tissues17. Second, a list of UniGene identifiers was obtained from UniGene for clusters of cDNAs sequenced from dbEST library 8902, the adult mouse skeletal muscle library from the University of Texas Southwestern Medical Center9. Both of these lists were compared to the list of mouse genes with increased expression in GEO data sets containing annotations mapped to any of the four concepts. Neither of these data sets was included in the version of the Gene Expression Omnibus we used.

Note: Supplementary information is available on the Nature Biotechnology website.

ACKNOWLEDGMENTS
We thank Tarangini Deshpande for critical comments on and suggestions for the manuscript. The authors thank Partners Healthcare Research Computing for use of and assistance with the Linux High Performance Computing Cluster. The work was supported by grants from the Lucille Packard Foundation for Children’s Health, National Institutes of Health National Center for Biomedical Computing (U54 LM008748), the National Library of Medicine (K22 LM008261), National Institute of Diabetes and Digestive and Kidney Diseases (K12 DK63696, R01 DK62948 and R01 DK060837), the Harvard-MIT Division of Health Sciences and Technology and the Lawson Wilkins Pediatric Endocrine Society.

COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.

Published online at http://www.nature.com/naturebiotechnology/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/

1. Carson, J.P. et al. Pharmacogenomic identification of targets for adjuvant therapy with the topoisomerase poison camptothecin. Cancer Res. 64, 2096–2104 (2004).
2. Zhukov, T.A., Johanson, R.A., Cantor, A.B., Clark, R.A. & Tockman, M.S. Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 40, 267–279 (2003).
3. Bhattacharjee, A. et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA 98, 13790–13795 (2001).
4. Yanagisawa, K. et al. Proteomic patterns of tumour subsets in non-small-cell lung cancer. Lancet 362, 433–439 (2003).
5. Anthony, J.C., Eaton, W.W. & Henderson, A.S. Looking to the future in psychiatric epidemiology. Epidemiol. Rev. 17, 240–242 (1995).
6. Freimer, N. & Sabatti, C. The human phenome project. Nat. Genet. 34, 15–21 (2003).
7. Mahner, M. & Kary, M. What exactly are genomes, genotypes and phenotypes? And what about phenomes? J. Theor. Biol. 186, 55–63 (1997).
8. Stoll, M. et al. A genomic-systems biology map for cardiovascular function. Science 294, 1723–1726 (2001).
9. Wheeler, D.L. et al. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 32 Database issue, D35–40 (2004).
10. Brazma, A. et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 29, 365–371 (2001).
11. Spellman, P.T. et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3, RESEARCH0046 (2002).
12. Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32 Database issue, D267–70 (2004).


13. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
14. International Classification of Diseases. Clinical Modification (ICD-9-CM), 9th revision, (Centers for Medicare & Medicaid Services, Washington DC, 2003).
15. Storey, J.D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 100, 9440–9445 (2003).
16. Jo, K., Rutten, B., Bunn, R.C. & Bredt, D.S. Actinin-associated LIM protein-deficient mice maintain normal development and structure of skeletal muscle. Mol. Cell. Biol. 21, 1682–1687 (2001).
17. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl Acad. Sci. USA 101, 6062–6067 (2004).
18. Rikans, L.E. & Moore, D.R. Influence of aging on rat liver involved in glutathione synthesis and degradation. Arch. Gerontol. Geriatr. 13, 263–270 (1991).
19. Ninfali, P., Aluigi, G. & Pompella, A. Postnatal expression of glucose-6-phosphate dehydrogenase in different brain areas. Neurochem. Res. 23, 1197–1204 (1998).
20. Cocco, P. et al. Mortality in a cohort of men expressing the glucose-6-phosphate dehydrogenase deficiency. Blood 91, 706–709 (1998).
21. Kyng, K.J., May, A., Kolvraa, S. & Bohr, V.A. Gene expression profiling in Werner syndrome closely resembles that of normal aging. Proc. Natl Acad. Sci. USA 100, 12259–12264 (2003).
22. Muschen, M. et al. Molecular portraits of B cell lineage commitment. Proc. Natl Acad. Sci. USA 99, 10014–10019 (2002).
23. Bhattacharya, B. et al. Gene expression in human embryonic stem cell lines: unique molecular signature. Blood 103, 2956–2964 (2004).
24. Zhao, Y. et al. Cloning and characterization of human DDX24 and mouse Ddx24, two novel putative DEAD-Box proteins, and mapping DDX24 to human 14q32. Genomics 67, 351–355 (2000).
25. Mirochnitchenko, O. et al. Acetaminophen toxicity. Opposite effects of two forms of glutathione peroxidase. J. Biol. Chem. 274, 10349–10355 (1999).
26. Sandre, C. et al. Early evolution of selenium status and oxidative stress parameters in rat models of thermal injury. J. Trace Elem. Med. Biol. 17, 313–318 (2004).
27. Topsakal, C. et al. Effects of prostaglandin E1, melatonin, and oxytetracycline on lipid peroxidation, antioxidant defense system, paraoxonase (PON1) activities, and homocysteine levels in an animal model of spinal cord injury. Spine 28, 1643–1652 (2003).
28. Sharma, G.D., He, J. & Bazan, H.E. p38 and ERK1/2 coordinate cellular migration and proliferation in epithelial wound healing: evidence of cross-talk activation between MAP kinase cascades. J. Biol. Chem. 278, 21989–21997 (2003).
29. Schulz, R. et al. Ischemic preconditioning preserves connexin 43 phosphorylation during sustained ischemia in pig hearts in vivo. FASEB J. 17, 1355–1357 (2003).
30. Nicholson, B., Manner, C.K. & MacLeod, C.L. Cat2 L-arginine transporter-deficient fibroblasts can sustain nitric oxide production. Nitric Oxide 7, 236–243 (2002).
31. Nicholson, B., Manner, C.K., Kleeman, J. & MacLeod, C.L. Sustained nitric oxide production in macrophages requires the arginine transporter CAT2. J. Biol. Chem. 276, 15881–15885 (2001).
32. Wang, Y. et al. Modeling human congenital disorder of glycosylation type IIa in the mouse: conservation of asparagine-linked glycan-dependent functions in mammalian physiology and insights into disease pathogenesis. Glycobiology 11, 1051–1070 (2001).
33. Gilbertson, R.J. & Clifford, S.C. PDGFRB is overexpressed in metastatic medulloblastoma. Nat. Genet. 35, 197–198 (2003).
34. Tettelin, H. & Parkhill, J. The use of genome annotation data and its impact on biological conclusions. Nat. Genet. 36, 1028–1029 (2004).
35. Perou, C.M. Show me the data! Nat. Genet. 29, 373 (2001).
36. Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc. AMIA Symp., 17–21 (2001).
37. Segal, E., Friedman, N., Koller, D. & Regev, A. A module map showing conditional activity of expression modules in cancer. Nat. Genet. 36, 1090–1098 (2004).
38. Ihaka, R. & Gentleman, R.R. A language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996).
39. Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R. & Kohane, I.S. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA 97, 12182–12186 (2000).
40. Lee, H.K., Hsu, A.K., Sajdak, J., Qin, J. & Pavlidis, P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 14, 1085–1094 (2004).
41. Press, W.H., Teukolsky, S.A., Vetterling, W.T. & Flannery, B.P. Numerical Recipes in C: The Art of Scientific Computing (Cambridge University Press, Cambridge, UK, 1993).
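As an illustration of the Methods subsection “Creation of phenomic-genomic relations”, the per-gene, per-concept comparison might be sketched in Python roughly as follows. This is a simplified sketch under stated assumptions, not the software used for the analysis. All function and parameter names are hypothetical, and ties in rank normalization are broken by input order, which the text does not specify. The sketch uses only the pooled-variance Student statistic, so the f-test and the Welch branch are omitted. It also reports an empirical permutation P value in place of the permutation-derived threshold, and the q-value step is left out.

```python
import math
import random
import statistics


def rank_normalize(sample):
    """Map {gene: raw value} to {gene: rank scaled to [0, 1]} within one sample."""
    ordered = sorted(sample, key=sample.get)  # ties keep input order (assumption)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {gene: i / (n - 1) for i, gene in enumerate(ordered)}


def dataset_profile(samples):
    """Mean rank-normalized level per gene across the samples of one data set."""
    totals, counts = {}, {}
    for sample in samples:
        for gene, rank in rank_normalize(sample).items():
            totals[gene] = totals.get(gene, 0.0) + rank
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: totals[gene] / counts[gene] for gene in totals}


def student_t(a, b):
    """Unpaired Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


def gene_concept_association(values, annotated, n_perm=100, min_sets=4, seed=0):
    """Compare one gene between data sets with (d1) and without (d2) a concept.

    values: {data_set_id: mean rank-normalized level of the gene}
    annotated: set of data_set_ids annotated with the concept
    Returns None when either group is too small, else (t, empirical_p).
    """
    d1 = [v for ds, v in values.items() if ds in annotated]
    d2 = [v for ds, v in values.items() if ds not in annotated]
    if len(d1) < min_sets or len(d2) < min_sets:
        return None
    observed = student_t(d1, d2)
    rng = random.Random(seed)
    pool = d1 + d2
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)  # reshuffle the d1/d2 assignment
        if abs(student_t(pool[:len(d1)], pool[len(d1):])) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A call such as `gene_concept_association(levels, {"GDS_a", "GDS_b", "GDS_c", "GDS_d"})`, with hypothetical data set identifiers, returns `None` unless both groups contain at least four data sets, mirroring the minimum group size stated in the Methods.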
