
The Structure, Function and Evolution of the Extracellular Matrix: A Systems-Level Analysis

by

Graham L. Cromar

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Department of Molecular Genetics University of Toronto

© Copyright by Graham L. Cromar 2014

The Structure, Function and Evolution of the Extracellular Matrix: A Systems-Level Analysis

Graham L. Cromar

Doctor of Philosophy

Department of Molecular Genetics University of Toronto

2014

Abstract

The extracellular matrix (ECM) is a three-dimensional meshwork of proteins, proteoglycans and polysaccharides imparting structure and mechanical stability to tissues. ECM dysfunction has been implicated in a number of debilitating conditions including cancer, atherosclerosis, asthma, fibrosis and arthritis. Identifying the components that comprise the ECM and understanding how they are organised within the matrix is key to uncovering its role in health and disease. This study defines a rigorous protocol for the rapid categorization of proteins comprising a biological system. Beginning with over 2000 candidate extracellular proteins, 357 core ECM and 524 functionally related (non-ECM) genes are identified. A network of high quality protein-protein interactions constructed from these core genes reveals the ECM is organised into biologically relevant functional modules whose components exhibit a mosaic of expression and conservation patterns. This suggests module innovations were widespread and evolved in parallel to convey tissue specific functionality on otherwise broadly expressed modules. Phylogenetic profiles of ECM proteins highlight components restricted and/or expanded in metazoans, vertebrates and mammals, indicating taxon-specific tissue innovations.

Modules enriched for medical subject headings illustrate the potential for systems based analyses to predict new functional and disease associations on the basis of network topology. This study also explores the evolutionary forces that guided the development of the ECM. Analyses of domain conservation patterns in ECM proteins, including the use of a novel framework for identifying non-contiguous, conserved arrangements of domains, show most are of pre-deuterostome origin. Many participate in novel domain arrangements in vertebrates, suggesting the sampling of new domain combinations was an important mechanism leading to neofunctionalization of paralogous ECM genes. Distinct types of proteins and/or the biological systems in which they operate may have influenced the types of evolutionary forces that drive protein innovation. This emphasizes the need for rigorously defined systems to address questions of evolution that focus on specific systems of interacting proteins such as the ECM. Finally, overviewing the current state of our knowledge of the ECM, this study addresses important gaps and highlights areas worthy of further investigation.


Acknowledgments

This project would not have been possible without the loving support and understanding of many family and friends, most notably my wife Judith Moses. The opportunity to pursue a career in research is a privilege and I am very grateful for the many ways that I experienced support along the way from so many of you. In particular, I would like to acknowledge my father-in-law, Nelu Moses, who sadly could not be here to celebrate this achievement. I know he would have been among the first to do so. His enthusiasm will be long cherished.

I thank the members of my supervisory committee including Gary Bader, Andrew Emili and Johanna Rommens. I am grateful for their kind support and advice at committee meetings. A more thoughtful and helpful project committee has surely never been struck! These accolades extend obviously to my supervisor, John Parkinson, who was brave enough to take on a student older than he was and I am glad to count him a colleague and friend.

Particular thanks go to Fred Keeley for his ardent support and infectious enthusiasm. I thank Sylvie Richard-Blum for kindly hosting me in her laboratory in Lyon, France while working on the manuscript corresponding to the first data chapter. I also thank Emilie Chautard for the generous contribution of her time to the first paper and Megan Miao who kindly provided the recombinant used in the SPR analysis. Thanks to Shoshanna Wodak, Andrei Turinski, Shuye Pu, Brian Turner and other members of the Wodak lab for their comments and contributions during joint lab meetings and other discussions. Zhaolei Zhang provided knowledgeable advice on the analysis of sequences. Ka-chun Wong contributed significantly to the analysis of sequential patterns used in the second data chapter on domain architectures.

James Wasmuth was a source of welcome distraction and the most likeable critic anyone could ask for. He, along with Xuejian Xiong, Hongyan (Bill) Song, Tuan On and Noeleen Loughran collaborated extensively on the development of pipelines, databases and programming. Most of the amazing things I learned to do with spreadsheets were the magic of Stacy Hung. I thank David He and Gabe Musso for influential advice in the early stages of my studies.

It is with pleasure that I dedicate this thesis to a former science teacher, D. Witucki who reminded me at a formative stage of my education never to count myself out. This has turned out to be a most inspirational mantra!

Table of Contents

Acknowledgments...... iv

Table of Contents ...... v

List of Tables ...... x

List of Figures ...... xi

List of Appendices ...... xiii

List of Supporting Data Files ...... xiv

List of Abbreviations ...... xvi

Chapter 1 The Extracellular Matrix ...... 1

1 Background ...... 1

1.1 Overview ...... 1

1.2 Discovery and Early History of the Matrix...... 2

1.3 Defining Matrix Components ...... 7

1.3.1 Collagens ...... 8

1.3.2 Elastin and Elastic Fibres ...... 8

1.3.3 Proteoglycans ...... 9

1.3.4 Glycoproteins ...... 9

1.3.5 Cell surface receptors ...... 10

1.3.6 ECM associated growth factors ...... 11

1.3.7 Modifiers of ECM structure and function ...... 11

1.3.8 Identifying additional matrix proteins ...... 11

1.4 Evolution of the ECM and its components ...... 12

1.5 Structure / Function...... 16

1.5.1 Self Assembly ...... 16

1.5.1.1 Collagen Fibre Assembly ...... 17

1.5.1.2 Elastic Fibre Assembly ...... 17

1.5.2 Tissue-specific Expression...... 20

1.5.3 Post Translational Modifications ...... 22

1.5.3.1 Effects on solubility ...... 22

1.5.3.2 Biomineralization ...... 22

1.5.3.3 Activation/Inactivation by cleavage ...... 23

1.5.3.4 Modification of GAGs ...... 23

1.5.3.5 Bioactive Fragments ...... 23

1.5.3.6 Discovery of novel post-translational modifications ...... 24

1.5.4 Role in Development ...... 24

1.6 The ECM in Health and Disease ...... 27

1.7 Systems Biology ...... 30

1.7.1 Annotation...... 31

1.7.2 Interactions ...... 32

1.7.2.1 Co-immunoprecipitation ...... 33

1.7.2.2 Yeast Two-Hybrid ...... 33

1.7.2.3 Luminescence based mammalian interactome mapping procedure ...... 34

1.7.2.4 Affinity Purification and Mass Spectrometry ...... 34

1.7.2.5 Functional protein microarrays ...... 35

1.7.3 Network approaches...... 35

1.8 Goals and Rationale ...... 37

Chapter 2 Surveying the Extracellular Matrix: Towards a Systems Level Understanding of its Structure, Function and Evolution...... 41

2 Survey of the Extracellular Matrix ...... 42

2.1 Introduction ...... 42

2.2 Materials and Methods ...... 43

2.2.1 Identification of Extracellular Proteins ...... 43

2.2.2 Classification of Proteins ...... 44

2.2.3 Source for Protein-Protein Interactions ...... 44

2.2.4 Network Construction and Analysis ...... 45

2.2.5 Sources for Meta-data ...... 45

2.2.5.1 Expression ...... 45

2.2.5.2 Conservation ...... 46

2.2.5.3 Functional annotation ...... 46

2.2.5.4 Disease terms ...... 47

2.2.6 Quality Assessment ...... 47

2.2.7 Surface Plasmon resonance...... 48

2.3 Results ...... 49

2.3.1 Systematic classification of extracellular proteins reveals the core ECM network consists of 357 proteins...... 49

2.3.2 The annotated list of extracellular proteins is consistent with SignalP and subcellular location predictions ...... 60

2.3.3 Experimentally derived protein-protein interactions connect 181 ECM core genes and 192 functionally-related neighbours into a scale-free network enriched for relevant functional terms ...... 62

2.3.4 The collagen subnetwork reveals anomalies in experimentally derived PPIs ...... 66

2.3.5 The search for biologically relevant functional modules in the ECM highlights the heterogeneous nature of current annotations ...... 68

2.3.6 Network topological measures identify major organising components of the ECM ...... 74

2.3.7 Gene expression patterns predict that modules are broadly expressed but that tissue specific functionality is coordinated by a limited number of components ...... 79

2.3.8 Almost two thirds of ECM proteins are not conserved outside the deuterostomes ...... 82

2.3.9 Integration of MeSH annotations identifies modules associated with disease ...... 87

2.3.10 Literature curation of elastin interactions resulted in doubling the number of known binding partners...... 88

2.3.11 SPRi experiments detect distinct binding characteristics of recombinant elastin fragments...... 90

2.4 Discussion ...... 93

Chapter 3 Novel domain architectures and promiscuous hubs contributed to the organisation and evolution of the ECM...... 100

3 Extracellular Matrix Domain Architecture ...... 101

3.1 Introduction ...... 101

3.2 Materials and Methods ...... 103

3.2.1 Source for Proteins ...... 103

3.2.2 Source for Domains ...... 103

3.2.3 Domain Enrichment ...... 103

3.2.4 Conservation of Domains and Domain Pairs ...... 104

3.2.5 Conservation of Domain Architecture ...... 104

3.2.6 Tandem Repeats ...... 104

3.2.7 Domain Alignment...... 105

3.2.8 Domain Adjacency Network ...... 105

3.2.9 Domain promiscuity...... 106

3.2.10 Higher Order Domain Patterns ...... 107

3.2.11 Simulated Proteomes ...... 108

3.3 Results ...... 108

3.3.1 Evolution of the ECM is driven in part by the invention of novel domains ...... 108

3.3.2 Organisation of the ECM is mediated by a relatively small number of highly promiscuous domains...... 112

3.3.3 Domain gain is a major driving force for ECM innovation in the human lineage...... 117

3.3.4 Novel ECM architectures are largely age-independent ...... 120

3.3.5 Network analyses of domain adjacency reveal domain-based functional ‘modules’ that display clade-specific rewiring...... 122

3.3.6 Patterns of ECM domain usage extend to conserved higher-order architectures ...... 128

3.4 Discussion ...... 130

Chapter 4 Summary and Future Directions ...... 136

4 Conclusions ...... 136

4.1 Summary ...... 136

4.2 Future Directions ...... 137

4.2.1 Predicting ECM Proteins ...... 137

4.2.2 Literature Curation of ECM Interactions ...... 138

4.2.3 Experimental determination of additional ECM interactions via SPR of proteins and recombinant fragments ...... 139

4.2.4 Expansion of functional modules...... 140

4.2.5 Role of carbohydrates ...... 140

4.2.6 Visualizing proteins, multimers and fragments ...... 141

4.2.7 Assessing the global importance of higher order domain patterns ...... 142

4.2.8 Repeats and Motifs ...... 143

4.2.9 Tools for visualizing paths in domain adjacency networks ...... 144

4.2.10 Domain adjacency as a method to predict novel ECM-like proteins i.e. synthetic ECM proteins...... 145

References ...... 146

Appendices ...... 170


List of Tables

Table 1-1: Early history of connective tissue ...... 3

Table 1-2: Collagen and elastin – important discoveries 1930-1975 ...... 5

Table 2-1: True positive and false negative ECM proteins ...... 52

Table 2-2: ‘seed’ terms enriched in gold standard ECM proteins...... 56

Table 2-3: domains enriched in the ECM network ...... 64

Table 2-4: UniProt biological processes enriched in ECM clusters ...... 71

Table 2-5: Elastin interactors ...... 91

Table 2-6: Interactions of elastin and elastin-like peptides as determined by SPRi array ...... 94

Table 3-1: Pfam-A domains found exclusively in ECM proteins...... 110

Table 3-2: Top 30 promiscuous domains in the human proteome...... 114


List of Figures

Figure 1-1: Early observations that adhesion mediates tissue association in invertebrates ...... 6

Figure 1-2: Evolution of ECM proteins ...... 15

Figure 1-3: Biosynthetic route to collagen fibres ...... 18

Figure 1-4: Schematic diagram of microfibril and elastic fibre / elastin assembly ...... 20

Figure 1-5: ECM dynamics determine epithelial branch patterning in vertebrate organs ...... 26

Figure 1-6: Small-world network illustrating some common network parameters ...... 36

Figure 2-1: A workflow for the functional assignment of extracellular proteins ...... 51

Figure 2-2: Characterization of available annotation and interaction data sets ...... 59

Figure 2-3: Subcellular location predictions for ECM proteins ...... 62

Figure 2-4: Collagen subnetwork...... 67

Figure 2-5: Distribution of cluster sizes...... 69

Figure 2-6: A human ECM network based on experimental PPI evidence ...... 70

Figure 2-7: Relative frequencies of UniProt biological process annotations ...... 73

Figure 2-8: A Subnetwork consisting of Elastin and its nearest and next nearest neighbours .... 75

Figure 2-9: Correlation between network connectivity and number of annotations ...... 77

Figure 2-10: Network attributes ...... 78

Figure 2-11: An expression profile for ECM core proteins ...... 80

Figure 2-12: Patterns of correlated coexpression for network core proteins ...... 83

Figure 2-13: Conservation profiles for network core genes ...... 86

Figure 2-14: Conservation of network ECM proteins ...... 87

Figure 2-15: Enrichment of MeSH disease terms by cluster ...... 89

Figure 2-16: Summary of binding sites on elastin ...... 97

Figure 3-1: Conservation of ECM domains ...... 112

Figure 3-2: Comparison of domain frequency vs. domain promiscuity ...... 115

Figure 3-3: Domain promiscuity cutoffs for human Pfam A domains at each percentile ...... 116

Figure 3-4: Distribution of promiscuity scores ...... 116

Figure 3-5: Conservation of ECM architectures ...... 118

Figure 3-6: Sample domain architectures ...... 121

Figure 3-7: Proportion of Pfam A domains found in single and multi-domain contexts ...... 123

Figure 3-8: Conservation of ECM domain pairs ...... 124

Figure 3-9: Origin of Vertebrate specific domain pairs ...... 125

Figure 3-10: Domain adjacency...... 127

Figure 3-11: Conservation of ECM domain pairs (all species) ...... 129

Figure 3-12: Higher order domain patterns ...... 131


List of Appendices

Appendix 1: List of organisms and genomic data sources ...... 170

Appendix 2: Phylogenetic ordering of species in protein/domain conservation heatmaps ...... 173

Appendix 3: Shannon information indices across a range of MCL inflation values ...... 176

Appendix 4: SignalP and TMHMM predictions ...... 177

Appendix 5: Sequences for recombinant elastin peptides ...... 178

Appendix 6: A systematically derived list of Gene Ontology (GO) terms ...... 180

Appendix 7: GO terms significantly enriched in the ECM network ...... 185

Appendix 8: MeSH terms significantly enriched in the ECM network ...... 194

Appendix 9: Pipeline for automated domain analysis ...... 198

Appendix 10: Descriptions of programs included on the accompanying CD ...... 199

Appendix 11: Partial view of an ECM network rendered as a hypergraph ...... 201

Appendix 12: Known and potential elastin interactions ...... 202


List of Supporting Data Files

File: SF1 Interactions MS Excel 37KB

File: SF2 Shannon Information Indices MS Excel 163KB

File: SF3 Protein Conservation MS Excel 162KB

File: SF4 GO Evidence MS Excel 25KB

File: SF5 ECM Summary MS Excel 1938KB

File: SF6 SignalP and TMHMM Predictions MS Excel 20KB

File: SF7 Concept Enrichment List MS Excel 3048KB

File: SF8 Cluster Analysis MS Excel 2782KB

File: SF9 GO P-values MS Excel 54KB

File: SF10 UniProt Mart Export Human MS Excel 48KB

File: SF11 Expression mappings MS Excel 181KB

File: SF12 SymAtlas Pearson group mappings MS Excel 1921KB

File: SF13 MeSH P-values MS Excel 142KB

File: SF14 ECM Network Cytoscape 138KB

File: SF15 List of Human Pfam-A Domains MS Excel 85KB

File: SF16 List of ECM associated Pfam-A Domains MS Excel 11KB

File: SF17 List of ECM enriched Pfam-A Domains MS Excel 18KB

File: SF18 List of ECM exclusive Pfam-A Domains MS Excel 12KB

File: SF19 ECM Domain Conservation Heatmap MS Excel 70KB

File: SF20 Domain Promiscuity Analysis MS Excel 257KB

File: SF21 Comparison of Promiscuity by Age MS Excel 181KB

File: SF22 ECM Domain Promiscuity 90th Percentile MS Excel 12KB

File: SF23 Hub Analysis MS Excel 39KB

File: SF24 Domain Architecture Conservation MS Excel 222KB

File: SF25 Quantifying Gain and Loss MS Excel 106KB

File: SF26 Sample Architecture FBLN2 MS Excel 19KB

File: SF27 Sample Architecture HGFAC MS Excel 13KB

File: SF28 Sample Architecture HSPG2 MS Excel 24KB

File: SF29 Sample Architecture MMP2 MS Excel 15KB

File: SF30 Sample Architecture VCAN MS Excel 14KB

File: SF31 Human ECM Domain Pair Conservation MS Excel 97KB

File: SF32 Vertebrate Specific Domain Pairs MS Excel 14KB

File: SF33 Domain Pair Origins MS Excel 15KB

File: SF34 Domain Pair Conservation All Species MS Excel 211KB

File: SF35 Higher Order Patterns MS Excel 40KB

File: SF36 Highly Significant Patterns MS Excel 18KB

File: SF37 Higher Order Conservation MS Excel 29KB

File: SF38 Patterns in Clusters MS Excel 8KB

File: SF39 Proteins in Clusters MS Excel 15KB

File: SF40 Domains in Clusters MS Excel 11KB

File: SF41 Enrichment Map Groups MS Excel 53KB

File: SF42 Pattern to Protein Mapping MS Excel 64KB

List of Abbreviations

ADAM A disintegrin and metalloproteinase

ADAMTS A disintegrin and metalloproteinase with thrombospondin repeats

AmiGO Gene ontology web user interface

AP-MS Affinity Purification and Mass Spectrometry

BioGPS Biology gene portal services

BioGRID Biological general repository for interaction datasets

BioHarvester Bioinformatic harvester

BioMart Ensembl download and query interface

BIND Biomolecular interaction network database

BMP Bone morphogenic protein

BP Biological process

C-terminal Carboxy terminal

CC Cellular component

CD Compact disk

CE Cello, a protein localization prediction method

cDNA Complementary deoxyribonucleic acid

CL Cutis Laxa

COMP Cartilage oligomeric matrix protein

DAG Directed acyclic graph

DDR Discoidin domain receptor

DIP Database of interacting proteins

EBI European bioinformatics institute

ECM Extracellular matrix

ELN Elastin

ER Endoplasmic reticulum

EST Expressed sequence tag

FACIT Fibril associated collagens with interrupted triple helices

FDR False discovery rate

FGF Fibroblast growth factor

FN Fibronectin

GAG Glycosaminoglycan

Genecards The human gene compendium

GEO Gene expression omnibus

GO The gene ontology

GOA Gene ontology annotation

GOOSE Gene Ontology Online SQL Environment

GOSlim GO subset consisting of terms at an intermediate level of specificity

GNF Genomics Institute of the Novartis Research Foundation

GPVI Glycoprotein VI a.k.a. GP6 involved in the aggregation of platelets

HGNC HUGO gene nomenclature committee

HMM Hidden markov model

HPRD Human protein reference database

HSPG Heparan sulphate proteoglycan

HTE Human tropoelastin

HUGO Human genome organization

iHOP Information hyperlinked over proteins

IntAct Interaction database hosted at EMBL-EBI

IPI International protein index

kDa Kilodalton

LCMS/MS Liquid chromatography and tandem mass spectrometry

LOCATE Mammalian protein localization database

Lox Lysyl oxidase

LTBP Latent transforming growth factor binding protein

LRR Leucine rich repeat

LUMIER Luminescence based mammalian Interactome mapping procedure

MAGP Microfibril-associated glycoprotein; MAGP1 is synonymous with MFAP2

MatrixDB Extracellular matrix interactions database

MCL Markov clustering, a graph clustering algorithm

MeSH Medical subject headings

MF Molecular function

MFS Marfan syndrome

μg Microgram

MIMIx Minimum information about a molecular interaction experiment (standard)

MINT Molecular interaction database

miRs MicroRNAs

MMP Matrix metalloproteinase

ML MultiLoc, a protein localization prediction method

MYTH Membrane yeast two-hybrid

NAS Non-traceable author statement

NCBI National center for biotechnology information

N-terminal Amino terminal

nm Nanometre

OMIM Online Mendelian inheritance in man

PA Proteome analyst, a protein localization prediction method

PANTHER Protein analysis through evolutionary relationships (classification system)

PCC Pearson correlation coefficient

PCR Polymerase chain reaction

Perl Practical Extraction and Report Language

Pfam A database of protein families

PostgreSQL An open source relational database management system

PPI Protein-protein interaction

PT Ptarget, a protein localization prediction method

PTC Premature termination codon

PubMed A search engine for accessing the MEDLINE database

RefSeq NCBI reference sequence database

RGD Arg-Gly-Asp motif involved in cell adhesion

RT-PCR Real-time PCR

Serpin Serine protease inhibitor

SignalP Signal peptide prediction algorithm

SPM Sequential pattern mining

SPR Surface plasmon resonance

SPRi Surface plasmon resonance imaging

SQL Structured query language

SVAS Supravalvular aortic stenosis

TAAD Thoracic aortic aneurysms and dissections

TGF Transforming growth factor

TIMP Tissue inhibitor of metalloproteinase

TMHMM Transmembrane prediction algorithm based on hidden Markov models

tYNA Topnet-like Yale Network Analyzer

UniHI Unified human interactome

VEGF Vascular endothelial growth factor

vWF von Willebrand factor

Wnt Wingless-related integration site

WO WoLF PSort, a protein localization prediction method

Y2H Yeast two-hybrid

Chapter 1 The Extracellular Matrix

1 Background

1.1 Overview

The ECM in consists of a meshwork of proteins, glycoproteins and proteoglycans that organise cells into complex tissues and allow them to sense and react dynamically to mechanical forces. Assembled matrices provide structural integrity to the skeletal system (bones, teeth, tendons, ligaments, and cartilage), vasculature (blood vessels), hollow organs (bladder, lung) and skin. In addition, they play a key role in establishing the blueprint of complex and diverse multicellular organisms, acting as a scaffold for cell growth and migration during development. Many ECM components bind growth factors and act as a reservoir controlling their spatio/temporal distribution and bioavailability. ECM assembly and degradation play important roles in immune and inflammatory responses and wound healing. As a result, disruption of the ECM whether as a result of mutation or disturbances in the homeostatic balance of its assembled components leads to a number of inborn and acquired diseases.

This chapter traces human discovery of the ECM from a pre-historic interest in hide and bone to our current understanding of its dynamic role in health and disease. Our knowledge of individual matrix components has increased from a few to a few hundred proteins. The major families are briefly reviewed here followed by notes on ECM evolution, structure and function. This is followed by sections illustrating the broad implications of the ECM in development and disease. The recent availability of sequence-based annotation resources, powerful computing platforms and publicly available databases of e.g. protein-protein interactions and expression datasets represents an opportunity to develop a systems-level framework to integrate current knowledge of individual ECM components and derive new insights into their organisation and evolution. These resources are briefly discussed followed by an overview of the project goals and key contributions.

1.2 Discovery and Early History of the Matrix

Collagen was the first matrix molecule to be identified and isolated. As noted in a review by Uitto1, “Collagen was initially recognized as a tissue component which when boiled, produced glue”. This property was mentioned as early as 50 A.D. by the Roman, Pliny, who wrote, “glue is cooked from the hides of bulls”. The name ‘collagen’ is in fact derived from the Greek for ‘glue-producing’. Thus, as Piez2 later pointed out, the impetus to study collagen at least partly originated in commerce. Certainly, early humans had long recognized the useful properties of hide and bone (see Table 1-1).

In the Middle Ages, it was believed that living organisms were composed of different kinds and arrangements of ‘fibres’ and that life arose spontaneously from these under certain conditions2. Overlooking the reference to spontaneous generation, this ‘fibre theory’ could be construed as extraordinarily insightful given what we know about the importance of the ECM today. In fact, the belief that fibres were the basis of animal life persisted for some time. As reviewed by Borel et al.3, the correct mechanism of blood circulation was not worked out until the 17th century (by Harvey). His successors were 18th and early 19th century anatomists who, working with cadavers prior to the emergence of cell theory, first defined the concept of ‘tissues’. These structures joined parts of the body, providing form and pathways for the movement of fluids, gases and secretions2. When these tissues were viewed under early microscopes they revealed fibres.

As detailed in a review by Meikle4, John Hunter began his research on bones and teeth in 1754 at his brother’s anatomical school in Covent Garden, later establishing an experimental farm at Earl’s Court (in 1764) where he used the technique of vital staining developed by John Belchier (published in 1763) to study bone growth. By inserting lead shot into the bones of chickens and pigs, Hunter was able to reproduce an earlier observation (by Duhamel) that long bones grew in length only at the extremities. Together with his own observations of remodeling in the human mandible, Hunter was able to conclude that the growth of bone entailed two processes: deposition and absorption (now ‘resorption’). Hunter was, however, according to Meikle4, unable to provide a mechanism for two reasons: the microscope did not come into widespread use for another 60 years; and cell theory was not established by Schleiden and Schwann until the period 1839-1842.

Table 1-1: Early history of connective tissue

This table is reproduced from Piez2 and is used with permission.

Date | Discovery | Credit
Prehistory | Utilitarian value of animal skin, bone and other tissues; leather technology | Arose independently many times.
~1700 | ‘Fiber’ theory: tissue composed of fibers which give rise to life by spontaneous generation |
1809 | Life derived only from life | Lorenz Oken
1830 | ‘Connective Tissue (Bindegewebe)’ coined | Johannes Müller
1830 | Cells found in connective tissue |
1855 | ‘Cellular theory’; cells derived only from cells; intercellular substances made by cells | Rudolf Virchow
1865 | ‘Collagen’ defined as protein that produces gelatin on boiling (Greek: glue producing) |
1865 | ‘Ground substance (Grundsubstanz)’ coined for homogeneous intercellular material |

According to Piez2, the fibre theory was eventually supplanted when it became clear that connective tissue (a term coined by Müller in 1830) contained cells and that tissues could be explained as a collection of cells and their products. Two key principles we take for granted today, 1) that cells arise only from other cells and 2) that intercellular substances are made by cells, are credited to Rudolf Virchow, a histologist working in the mid-1800s. Classical histological methods continued to dominate the study of connective tissues until about 1930 when the terms ‘intercellular matrix’ and ‘extracellular matrix’ came into use.

In France, in the late 1920s and early 1930s, Nageotte and others were able to solubilize and reconstitute collagen fibrils, recognizing the reversible nature of this interaction using light microscopy2,5. At the same time, studies of the composition of polysaccharides and glycoconjugates isolated from articular cartilage, synovial fluid, eye tissues, blood vessels and skin were being undertaken by medical biochemists3. These studies suffered from the low solubility of the samples, which had to be isolated using harsh methods suspected of creating artifacts3.

Consequently, the successfully purified components consisted of polysaccharides and glycoproteins and this became an intense focus of ECM work in both Europe and the Americas. Progress in the study of ECM carbohydrates, polysaccharides and glycoproteins is discussed in more detail in Borel’s review3. Notably, it was only after the 1960s that the structures of collagen and elastic fibres were elucidated and it was shown that some ECM components (e.g. elastin and serum albumin) were not glycoproteins.

Electron microscopy and X-ray diffraction, which appeared in the 1930s, led to detailed structural studies of collagen. The important findings from 1930 to 1975 are reproduced here from the more comprehensive review by Piez2 (Table 1-2). Elastin was first observed as an insoluble residue following hydrolytic removal of crosslinked collagens. Its study was greatly facilitated by the discovery, in 1949, of an elastin-specific proteolytic enzyme, pancreatic elastase5,6, which subsequently allowed the study of smaller, more soluble fragments. The existence of a corresponding collagen-specific cleavage enzyme, ‘collagenase’, was theorized on the basis that collagen was known to undergo rapid turnover. The enzyme was discovered in 1962 and became the first in what is now a large family of matrix metalloproteinases2. Further technological advances including the emergence of amino acid analysers, ultracentrifugation, electrophoresis, radioactive tracers and sequencers ensured rapid progress in the description of connective tissue macromolecules in the years that followed3.

Once cell theory emerged and it was known that the ECM was not the basis of life, some scientists, notably those outside connective tissue research (though not exclusively, notes Piez2), acquired the view that the matrix was “inanimate, unreactive and purely structural – a passive environment that cells create as a place to live”. These views persisted for some time. According to Meikle4, as late as 1987 the ECM of bone was regarded by many as an inert substance composed of collagen, minerals and structural glycoproteins.

In spite of this, evidence of the importance of cell-cell and cell-matrix interactions continued to accumulate from multiple sources. Early tissue culture studies showed that most tissues were not a syncytium, but were composed instead of individual cells, suggesting that the presence of tissues required cell adhesion7.

Table 1-2: Collagen and elastin – important discoveries 1930-1975

This table is reproduced from Piez2 and is used with permission.

Date | Discovery | Credit
1930 | Collagen solubilized and reconstituted | Nageotte
1942 | Collagen D period by X-ray diffraction | Bear
1942 | Collagen D period by EM | Schmitt
1948 | Amino acid composition of collagen | Bowes and Kenton
1953 | Segment-Long-Spacing aggregate | Schmitt and Gross
1954 | Tropocollagen ‘particle’ | Gross and Schmitt
1955 | Collagen triple helix by X-ray diffraction | Rich, Ramachandran
1955 | Components of denatured collagen | Orekovich
1956 | Characterization of the collagen molecule | Boedtker and Doty
1962 | Animal collagenase | Gross
1963 | Collagen α chains and β components | Piez
1964 | Aldehydes in collagen | Gallop
1964 | Elastin crosslinks | Partridge
1966 | Collagen aldol crosslinks; molecular basis of lathyrism | Bornstein and Piez
1966 | Chain structure of collagen | Kang and Nagai
1967 | Amino acid sequencing of collagen chains | Kang and Bornstein
1968 | Collagen aldimine crosslinks | Bailey, Tanzer
1969 | Collagen types II and III | Miller and Piez
1971 | Procollagen | Bornstein, Martin
1973 | Hydroxyproline stabilizes collagen helix | Berg and Prockop, Rosenbloom


As reviewed in Horwitz7, in 1900, Herbst showed that sea urchin blastomeres fall apart when calcium is removed and then reassociate in normal seawater to form embryo-like structures. Similar experiments by Wilson showed that sponges could similarly be dissociated into single cells and then reassociate to form a small differentiated sponge (Figure 1-1) and that mixtures of cells from different coloured sponges (different genera) could re-sort themselves7,8.

The adhesion-based self-organisation of embryonic tissues was convincingly demonstrated by Townes and Holtfreter9 who described the dissociation and reassociation of embryonic amphibian tissues. As reviewed by Horwitz7, follow-up studies using a variety of vertebrate species and tissues led to the notion that tissues self-organise through differential adhesion at the level of individual cells. The subsequent discovery of many molecules involved in cell-cell (e.g. Cadherins) and cell-matrix adhesions (esp. Fibronectin) led rapidly to the recognition of the importance of cell-matrix interactions and to the identification of receptors mediating such interactions, integrins, the elastin- receptor and some others.

Figure 1-1: Early observations that adhesion mediates tissue association in invertebrates. a. Sponges can be triggered to dissociate by removing calcium, and then their reaggregation into minisponges is observed. Freshly dissociated cells that were pressed out through bolting cloth are shown. b. The morphogenesis of aggregates in open sea water. Images are from the 1907 paper by Wilson8 as reproduced by Horwitz7 and are used with permission.

At the close of the 20th century Robert5 summarized this progress:

It became thus clear at the end of this century that there are many matrix components, their nature and relative quantities define the structure and properties of tissues, as much or even more than its cellular components.

Matrix biology and cell biology are intimately linked. Most types of cells secrete matrix proteins and respond to matrix-related signals. This realization has provided and continues to provide new and original approaches to several topics in biology, especially development, aging and a number of related pathologies10.

1.3 Defining Matrix Components

The ECM in animals consists of an intertwined network of proteins, glycoproteins and proteoglycans which fall broadly into the following functional categories: 1) structural components such as fibril-forming collagens and elastin, 2) matricellular proteins mediating cell-ECM interactions, growth factor and protease activity, 3) proteinases and their inhibitors affecting ECM homeostasis and remodelling and 4) ancillary proteins interacting with the ECM to perform both structural and signaling roles. With such a diverse set of biological functions and structures, identifying an inclusive set of ECM system components is not straightforward and necessitates considering more than the set of proteins constituting the structural elements of the fibres alone.

Most of what is known about the composition of extracellular matrices has been derived from traditional experiments spanning several decades. Over this time, the number of matrix constituents has increased rapidly from just collagen, elastin and mucopolysaccharides up to the middle of the last century to perhaps several hundred, although, as Borel noted in his 2012 review3, these numbers have not been precisely evaluated. Subsequently, the major categories of matrix proteins were summarized in an excellent review by Hynes and Naba11 representing work complementary to Chapter 2. A brief outline of these is provided here, followed by a short review of two recent approaches to discover additional matrix components.

1.3.1 Collagens

The collagen superfamily, recently reviewed by Ricard-Blum12, comprises 28 members in vertebrates. Consisting of three polypeptide chains called α chains, all collagens contain a triple helical structure which encompasses between 10% (collagen XII) and 96% (collagen I) of their overall length. Different molecular species are formed by the hybridization of different α chains belonging to different collagen types. Collagens can form a variety of supramolecular assemblies and are divided into subfamilies based on their ability to form fibrils, beaded filaments, anchoring fibrils and networks contributing to the architecture of a variety of different matrices. Fibrillar collagens are the most abundant and provide, for example, tensile strength to skin and resistance to traction in ligaments. However, even low abundance collagens are functionally important. Collagen VII constitutes only about 0.001% of total collagens in skin but is crucial for skin integrity. Though they may be considered prototypical structural proteins, collagens are not restricted to structural roles. Through their interaction with cellular receptors, collagens regulate cell growth, differentiation and migration. Proteolytic cleavage of several collagen types results in the release of bioactive fragments with unique biological activities. As well, conformational changes in collagens induced by their interaction with other matrix components, their assembly, degradation or mechanical forces can result in the exposure of functional, cryptic sites.

1.3.2 Elastin and Elastic Fibres

Elastin is the most abundant protein in elastic fibres, which play an important role in tissues requiring the ability to stretch. These include e.g. the bladder, lung parenchyma, vasculature and skin. The thoracic aorta is composed of 50% elastin by dry weight13. Elastic fibres comprise a 10-12 nm outer sheath of microfibrils, consisting mainly of fibrillins, surrounding a core of cross-linked elastin, making up approximately 90% of the fibre14,15. Their assembly is described in more detail in a later section (see section 1.5.1 Self Assembly). At the tissue level, the orientation of elastic fibres is specific to their mechanical requirements. In arteries, which require the maintenance of uniform elastic pressure, elastic fibres form concentric rings of elastic lamellae alternating with smooth muscle16, whereas in lung, elastic fibres form a latticework17. Elastic cartilage requires a balance between stability and flexibility and here elastic fibres are arranged as three dimensional honeycomb arrays18.

Elastin is secreted by fibroblasts, smooth muscle cells and auricular chondrocytes as a soluble precursor, tropoelastin (60-70kDa). Tropoelastin is a multidomain protein with alternating hydrophobic and cross-linking domains, an architecture which is thought to aid in the initial alignment of adjacent molecules to facilitate cross-linking19-21. Since the formation of elastic fibres in the human heart typically occurs only during late fetal and early neonatal development with little or no turnover in adults, it has recently been hypothesized that sequence variations leading to small defects in elastic fibre assembly may play a role in late onset cardiovascular disease via the cumulative weakening of vascular tissues over the course of millions of cycles of stretch and recoil20,22. Loss of function mutations in elastin (or a deletion in the case of Williams Beuren Syndrome23) can result in supravalvular aortic stenosis (SVAS)24 and other vascular defects including aortic and arterial dilation. The latter sometimes also accompany cutis laxa (CL), a condition associated with loose, inelastic skin and sometimes pulmonary emphysema25,26.

1.3.3 Proteoglycans

The distinction between proteoglycans and glycoproteins (see below) is imprecise. In general, proteoglycans are considered to have a significant fraction of their total mass made up by glycosaminoglycans (GAGs), repeating polymers of disaccharides with additional carboxyl and sulfate groups. These confer upon proteoglycans a highly negative charge resulting in an extended, space-filling structure that is able to retain water and divalent cations such as calcium. In addition to this ‘packing’ function, proteoglycans function in lubrication and, through their attached GAGs, bind growth factors and other secreted molecules into the ECM. Hynes and Naba11 report approximately three dozen ECM proteoglycans are encoded in mammalian genomes. Notable ECM proteoglycans include perlecan (HSPG2), which is a core component of basement membranes, and LRR repeat proteins including biglycan and decorin, which are thought to play roles in matrix assembly. Hyalectins (aggrecan, brevican, neurocan, and versican) bind and other ECM glycoproteins and hyaluronic acid to regulate ECM protein complexes.

1.3.4 Glycoproteins

ECM glycoproteins are more numerous than proteoglycans (approximately 200 proteins compared to a few dozen) and their functions vary widely. There are groups of matrix glycoproteins enriched in: basement membrane (e.g. ), elastic fibres (e.g. fibrillins, ), nervous system (e.g. agrin, netrins), vascular system (e.g. vitronectin, , von Willebrand factor), skeletal system (ameloblastin, dentin), and growth factor binding (e.g. insulin-like growth-factor binding proteins, latent transforming growth-factor β-binding proteins) to name a few.

Some of the most well-studied of these glycoproteins are laminins, trimeric molecules comprised of one α, one β and one γ chain which are major structural components of basement membranes. As with collagen, different molecular species are formed by the hybridization of different chains. In vertebrates, there are five α, three β and three γ chains encoded by 11 different genes. Also well studied, fibronectin exists as a dimer whose two identical subunits are encoded by a single gene with multiple splice variants. A vertebrate-specific protein, fibronectin exists in two forms: as a major (soluble) component of blood plasma and in an insoluble form, assembled pericellularly into a matrix which binds cell-surface integrin receptors mediating cell adhesion, growth and migration. Laminins and fibronectin exemplify what is considered to be the typical extended multimeric structure of ECM proteins with multi-domain repeats.
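The combinatorial potential of laminin chain hybridization can be illustrated with a short calculation. The sketch below simply enumerates every possible one-α, one-β, one-γ combination from the chain counts given above; the chain labels are placeholders invented for illustration, and the result is a theoretical upper bound only, since the text does not state how many of these combinations are actually assembled in vivo.

```python
from itertools import product

def laminin_trimers(n_alpha=5, n_beta=3, n_gamma=3):
    """Enumerate all theoretical alpha/beta/gamma laminin trimers.

    The chain counts default to the vertebrate figures quoted in the text.
    The labels (alpha1, beta1, ...) are illustrative placeholders only.
    """
    alphas = [f"alpha{i}" for i in range(1, n_alpha + 1)]
    betas = [f"beta{i}" for i in range(1, n_beta + 1)]
    gammas = [f"gamma{i}" for i in range(1, n_gamma + 1)]
    # Each trimer takes exactly one chain of each class.
    return list(product(alphas, betas, gammas))

trimers = laminin_trimers()
print(f"genes: {5 + 3 + 3}, theoretical trimers: {len(trimers)}")
```

Under these assumptions, 11 genes could in principle yield 5 × 3 × 3 = 45 distinct trimers, which shows how chain hybridization can generate molecular diversity from a small number of genes.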

1.3.5 Cell surface receptors

Communication between cells and the matrix requires cell surface receptors recognizing a variety of matrix proteins. The major receptors are integrins, comprising 24 αβ heterodimers in mammals. As a class, integrins are widely conserved from sponges to humans with RGD and laminin binding receptors being the most ancient11. Additional clade-specific subclasses have evolved in more complex organisms including several chordate-specific collagen receptors and leukocyte-specific receptors. Integrins and their ligands play key roles in development, hemostasis, leukocyte trafficking and immune responses and are involved in a number of receptor tyrosine kinase mediated signaling pathways. In addition, they are the target of effective therapeutic drugs against and thrombosis27. Although integrins represent the dominant class of ECM receptors and are present on most cells, a number of other receptors are expressed on specific cell types. These include e.g. dystroglycan, which binds to laminin, agrin and perlecan in basement membranes, GPVI on platelets and the discoidin domain receptor (DDR) tyrosine kinases which bind collagens, among others. Importantly, in addition to binding extracellular targets, ECM receptors provide for two-way signaling between the ECM and the cytoskeleton. The cytoplasmic domains of ECM receptors mediate the assembly of large protein complexes regulating the assembly of the cytoskeleton and activate many intracellular signaling cascades28.

1.3.6 ECM associated growth factors

Many growth factors bind to ECM components and are therefore an important part of the ECM as a biological system if not a bona fide part of extracellular matrices themselves29-31. In binding these factors the matrix acts as a reservoir of important developmental signals (e.g. VEGFs, Wnts, BMPs and FGFs). The spatial distribution of these signals is crucial for pattern formation and, in at least a few cases, it has become clear that the gradients of these signals are affected by ECM binding32. The interactions between latent transforming growth factor binding proteins (LTBPs) and TGF-β have significant consequences for the regulation of TGF-β function in Marfan syndrome and other genetic diseases14.

1.3.7 Modifiers of ECM structure and function

A number of enzymes mediate the lifecycle of matrix proteins, from their maturation and assembly to their degradation. The polymerization of collagen, for example, requires the preprocessing of procollagen by specific propeptidases. Collagens are then cross-linked by disulfide bonding, transglutaminase cross-linking and the action of lysyl oxidase and hydroxylases into larger assemblies12. The processes of collagen and elastin assembly are described in more detail in a later section (see section 1.5.1 Self Assembly). Laminins and other basement membrane proteins are also cross-linked by disulfide bonding33. The degradation of collagens and other ECM proteins is managed by a large family of matrix metalloproteases (MMPs), ADAMs and ADAMTS and a host of other proteolytic enzymes including elastases, cathepsins and serine proteases. These in turn are kept in check by protease inhibitors including TIMPs and Serpins. This homeostatic balance between matrix deposition and destruction is crucial to maintaining normal matrix function.

1.3.8 Identifying additional matrix proteins

While the absolute number of matrix proteins is unknown, the upper estimate for ECM-encoding genes in the mammalian genome has been placed at about 400 genes34. Investigators have recently turned to computational approaches to identify additional matrix proteins. Jung et al.35 defined a set of 109 gold standard ECM proteins supported by Swiss-Prot annotations and experiments in the literature. Using a machine learning approach, they analyzed various sequence features and defined 13 informative classifiers they used to predict novel ECM proteins from unannotated human genes in Swiss-Prot. Of the 20 supposedly novel genes identified, half were already annotated as ‘Extracellular region’ in the Gene Ontology (GO). The other half had no GO annotation, but a review of the literature revealed several had prior evidence supporting their role as ECM proteins.

Using a more integrated approach, Manabe et al.34 combined computational screening of the mouse transcriptome with functional assays including matrix assembly, binding of known ECM molecules, glycosaminoglycan attachment and promotion of cell-substrate adhesion. They identified 16 novel ECM proteins.

While these studies draw attention to the fact there are likely to be additional novel ECM proteins waiting to be identified, they also underscore the need to develop rigorous, systematic approaches to identifying and cataloguing ECM proteins including the ability to recognize those already supported by literature evidence.

1.4 Evolution of the ECM and its components

The shift from single celled to multicellular organisms represented a profound change in the approach to dealing with the world outside the cell. A new opportunity arose, not only to exploit economies of scale and share labour, but to utilize new internal spaces exterior to cells and within the enclosure of the larger organism. Although multicellularity has been independently acquired in more than 20 eukaryotic lineages including animals, plants, fungi, slime molds and algae36,37 metazoans have evolved the most elaborate biological system, the ECM, governing this niche.

The acquisition of multicellularity in metazoans occurred >200 million years ago. While the direct ancestors of these animals can no longer be observed, studies of our closest living relatives have provided some insights into the origin of the ECM in this clade. Choanoflagellates, the closest unicellular relatives of metazoans, can form multicellular colonies by post-division adhesion38, suggesting that metazoan multicellularity may have evolved in the common ancestor of choanoflagellates and metazoans by similar means. On the other hand, experiments in yeast demonstrate that a rudimentary form of multicellularity can emerge under selective conditions, indicating that the basic components of cell adhesion may be quite ancient39.

The first adhesion molecules to be described were involved in cell-cell and cell-matrix interactions in the marine sponges Geodia cydonium and Microciona prolifera 40,41. A simple system consisting of an adhesion factor/receptor pair interacting with galectin, it was highly specific and accounted for earlier observations of spontaneous cell sorting in sponges7,8. The conservation and diversification of cell-cell and cell-matrix adhesion systems have given rise to a rich literature detailed in recent reviews42,43.

Of recent note, Sebe-Pedros et al.44 have described a highly regulated, multicellular lifecycle stage in the protist Capsaspora owczarzaki, involving the upregulation of orthologues of the integrin adhesome along with several other proteins containing FN3 and LamininG domains commonly found in metazoan ECM proteins. C. owczarzaki is a filose amoeboid symbiont of the pulmonate snail Biomphalaria glabrata and occupies a pivotal phylogenetic position between choanoflagellate protists (the closest relatives of Metazoa), and other opisthokonts (such as nucleariids and Fungi)45,46. This discovery is perhaps the best evidence so far that at least some multicellular properties of metazoans stem from features of the aggregation behaviour in an ancestral protist. As an aside, it has been speculated that integrins in unicellular organisms have pre-ECM roles in the signaling of cytoskeletal rearrangements relating to prey capture/uptake; metazoan integrins have a role in phagocytosis in addition to their function in cell-cell and cell-matrix interactions47.

The relatively recent sequencing of several key metazoans including representative sponges, worms, echinoderms, and vertebrates in comparison with choanoflagellates has revealed the existence of approximately ten genes found across the bilaterians that are thought to be essential to the formation of basement membranes11,48. This ‘toolkit’ (Figure 1-2) may have formed the basis for complex tissues, differentiating younger clades from their simpler ancestors and providing a foundation around which ever more intricate matrices emerged.

It appears that most of the major protein families comprising the human ECM originated during the evolution of Protostomes and that, as the genome of the invertebrate deuterostome Ciona intestinalis was recently used to demonstrate, vertebrate complexity has mostly arisen through the duplication and subsequent modification of retained, pre-existing ECM genes49. However, without a complete catalogue of ECM proteins, the relative importance of these two contributing evolutionary phases has remained unclear.

Deuterostome and, particularly, vertebrate ECMs are distinct from their counterparts in protostomes and basal metazoans. They include novel ECM components such as , fibronectin and FACIT collagens49-52, novel splice variants (e.g. Agrin), as well as expansions in gene families such as laminins, thrombospondins, integrins and ADAMTS proteases53-56. The latter may well have been fueled by gene duplication as it is now well-accepted that two whole genome duplications occurred in the early vertebrate lineage49,57,58 and a third in fish50,59,60. It is thought that subfunctionalization of retained paralogues played an important role in the evolution of vertebrates. As Lu et al.61 point out, even small changes in the selective binding of ECM components and their receptors to growth factor ligands allow a creative and expanded use of an otherwise limited number of signaling pathways and this may have been important for the development of diverse tissues and organs during metazoan evolution.

It is, ultimately, at the level of proteins and their domains that genomes expand their functional repertoire. In this regard, considerable progress has been made in defining protein domains and understanding their role in the evolution of protein families. Relevant to the study of ECM proteins are observations such as the higher occurrence of multi-domain proteins in eukaryotes62, the evolution of multi-domain architectures and their shuffling via gene fusion/fission63,64, factors influencing the emergence of novel domains and their incorporation into proteins65,66 and the nature and importance of domain promiscuity, the property of some domains to be found next to a large variety of other domains in proteins67-69.

Such global studies of protein evolution, however, are necessarily dominated by cellular proteins. The unique biological role of the matrix as a dynamic structure exterior to cells raises the question: how similar or dissimilar is matrix evolution from the consensus model, and to what degree are conclusions about the relative importance of various mechanisms generalizable across all proteins? In particular, functionally related sub-classes of proteins with unique biological roles, such as the ECM, could conceivably exhibit nuances that are not captured in what is, necessarily, an averaged view. This hypothesis is explored in Chapter 3.

Figure 1-2: Evolution of ECM proteins The figure outlines the main phylogenetic lineages (although the branch lengths are not drawn to scale), and illustrates the evolution of complexity of the matrisome and ECM during evolution. The inferred basal bilaterian had a core of ECM proteins including the basement membrane toolkit and some other ECM proteins (not all of which are shown) that have been retained in later-developing taxa, including the two main branches of metazoa (protostomes and deuterostomes). More primitive taxa had some, but not all, of these ECM proteins. During evolution of protostomes, there was modest expansion of the number of ECM genes/proteins mostly comprising taxon-specific expansions of ECM protein families by gene duplication and divergence, with some exon shuffling. A similar modest expansion occurred during evolution of the deuterostome lineage—first known acquisitions of novel ECM proteins of interest are noted in green. During evolution of the vertebrate subphylum, there was a major increase in ECM protein diversity, probably related to two whole genome duplications that occurred in that lineage. This expansion included expansion and diversification of preexisting ECM protein families, and also the development of novel protein architectures by shuffling of domains and the inclusion of novel domains (e.g., FN1, FN2, LINK). Some examples of such novel ECM proteins are indicated. As discussed in the text, this large expansion and diversification of the matrisome in vertebrates is presumably linked to novel structures such as neural crest and endothelial-lined vasculature as well as connective tissues such as cartilage, bones, and teeth, and also the development of more complex nervous and immune systems. Figure reproduced from Hynes and Naba11 used with permission.


1.5 Structure / Function

The interactions between ECM components and cells integrate them into functional assemblies. Several types of cells are particularly known for their specialization in secreting and maintaining different types of matrices including fibroblasts (stroma), osteoblasts (bone), and chondroblasts (cartilage). In the brain, endothelial cells lining the capillary beds along with surrounding astrocytes secrete specialized basement membrane components contributing to the blood brain barrier70. Neurons and surrounding glial cells are responsible for perineuronal nets stabilizing synapses and contributing to long term potentiation in post-synaptic dendrites71. ECM proteins within matrices are cross-linked to one another and to other proteins as well as to glycosaminoglycans such as and chondroitin sulfate. In this section, the nature and importance of matrix assembly is explored, focusing on models of collagen fibre and elastic fibre assembly. This is followed by notes on the differential expression of matrix components, their post-translational modifications and the biological roles played by matrix fragments upon degradation of larger assemblies. This culminates in a discussion of the role of ECM components in development.

1.5.1 Self Assembly

Collagen is the most abundant protein in animals and comprises three polypeptide chains (α-chains) which form a unique triple-helical structure. This structure contains a repeating motif, Gly-Xaa-Yaa, in which glycine appears at every third residue along each chain. Xaa and Yaa can be any amino acid but are frequently proline and hydroxyproline, which form side chain interactions that stabilize the helix72,73. The collagen superfamily includes 28 members in vertebrates and all contain a triple helix structure that makes up between 10% and over 90% of their structure. However, there is no common definition for a collagen. There are triple helical proteins that are called collagens and there are proteins with triple helical domains that are not regarded as collagens73. Nevertheless, collagens all have functions in tissue assembly or maintenance.
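The Gly-Xaa-Yaa repeat lends itself to a simple computational check. The following is a minimal sketch, not a method used in this study: it reads a one-letter amino acid string in triplets and reports the fraction of triplets that begin with glycine. The function names and the toy sequences are hypothetical, and because hydroxyproline has no standard one-letter code, the toy sequences use proline (P) throughout.

```python
def glycine_repeat_fraction(seq: str, frame: int = 0) -> float:
    """Fraction of complete triplets, read from `frame`, that start with Gly (G)."""
    seq = seq.upper()
    # Step through the sequence three residues at a time, keeping only full triplets.
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not triplets:
        return 0.0
    return sum(t[0] == "G" for t in triplets) / len(triplets)

def best_frame(seq: str) -> int:
    """Reading frame (0, 1 or 2) with the highest Gly-Xaa-Yaa periodicity."""
    return max(range(3), key=lambda f: glycine_repeat_fraction(seq, f))

toy_helix = "GPPGAPGPKGPP"  # invented example: four perfect Gly-Xaa-Yaa triplets
print(glycine_repeat_fraction(toy_helix))  # fraction in frame 0
print(best_frame("A" + toy_helix))         # repeat shifted by one residue
```

A real triple-helical region would score close to 1.0 in one frame and near 0 in the other two, whereas a globular domain would show no such periodicity. A score of this kind says nothing about whether a protein is classified as a collagen, which, as noted above, has no common definition.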

Vertebrate collagens are classified by function and domain homology and these classes participate in the formation of a varied array of structures in the ECM e.g. fibrils, networks, beaded filaments and hexagonal networks. Collagen types, their distribution, composition, and pathology have been summarized in recent reviews12,72. For illustrative purposes this review first focuses on the fibrillar assembly of type I collagen (Figure 1-3) which has been studied in the most detail.

1.5.1.1 Collagen Fibre Assembly

There are two stages of self-assembly: nucleation and fibre growth. of collagen results in protocollagen strands containing carboxy-terminal (C-terminal) and amino-terminal (N-terminal) propeptides. The C-terminal propeptides direct the association of individual chains into the triple helical structure of procollagen leading to nucleation and folding of the triple helical region in a zipper-like manner from the carboxy to the amino-terminus. This process occurs intracellularly in the rough endoplasmic reticulum. C-terminal propeptides also play a role in ensuring the solubility of the procollagen molecule during transit to the ECM whereas the N-terminal propeptide plays an eventual role in the control of fibril shape74-76.

Following or during secretion into the ECM, propeptides are cleaved by specific procollagen proteinases, generating tropocollagen monomers and triggering fibril self-assembly, probably via the interaction of C-terminal telopeptides with specific binding sites on triple-helical monomers77. Lysine side chains in the telopeptides are then cross-linked subsequent to fibril assembly, forming stable hydroxylysyl pyridinoline and lysyl pyridinoline cross-links with the aid of lysyl oxidase. The assembled fibril is resistant to protease degradation and has great tensile strength and resilience. To appreciate the scale of the assembly involved, an individual triple helix in type I collagen is approximately 300 nm long and less than 2 nm in diameter whereas assembled fibrils in tendon are up to 1 cm in length and up to approximately 500 nm in diameter72.
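The difference in scale between a single triple helix and an assembled fibril can be made concrete with some rough arithmetic. The sketch below uses only the approximate dimensions quoted above and treats them as exact upper values, which is an assumption made purely for illustration. The length ratio is a simple quotient and does not imply that monomers are packed end to end in a fibril.

```python
NM_PER_CM = 10_000_000  # 1 cm expressed in nanometres

def scale_ratios(monomer_length_nm=300, monomer_diameter_nm=2,
                 fibril_length_cm=1, fibril_diameter_nm=500):
    """Return (length ratio, diameter ratio) of fibril to triple-helical monomer.

    Defaults are the approximate figures quoted in the text; the monomer
    diameter is given there as 'less than 2 nm', so the diameter ratio
    computed here is a lower bound.
    """
    fibril_length_nm = fibril_length_cm * NM_PER_CM
    length_ratio = fibril_length_nm / monomer_length_nm
    diameter_ratio = fibril_diameter_nm / monomer_diameter_nm
    return length_ratio, diameter_ratio

length_ratio, diameter_ratio = scale_ratios()
print(f"fibril is ~{length_ratio:,.0f}x longer and >= {diameter_ratio:.0f}x wider")
```

On these figures, a tendon fibril can span tens of thousands of monomer lengths and hundreds of monomer diameters.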

1.5.1.2 Elastic Fibre Assembly

Elastic fibres, recently reviewed by Baldwin et al.15, are very extensible and able to undergo millions of cycles of stretch and recoil without failure. They comprise an inner core of elastin, ensheathed within fibrillin microfibrils. While elastin and fibrillin are the main components, there are several additional proteins that contribute to elastic fibre assembly and function. The axial and lateral assembly of fibrillin into microfibrils involves the pre-processing of secreted pro-fibrillin by furin, which cleaves it at both the amino- and carboxy-terminus.


Figure 1-3: Biosynthetic route to collagen fibres, which are the major component of skin. Size and complexity is increased by posttranslational modifications and self-assembly. Oxidation of lysine side chains leads to the spontaneous formation of hydroxylysyl pyridinoline and lysyl pyridinoline cross-links. Figure reproduced from Shoulders and Raines72 used with permission.

Assembly proceeds pericellularly in a complex process that requires fibronectin, integrins and heparan sulphate proteoglycans (Figure 1-4(a)). Tropoelastin, secreted at the cell surface, forms globules within which individual tropoelastin molecules are thought to line up, facilitating their cross-linking by lysyl oxidase (Figure 1-4(b)). The latter is activated locally by Bmp1, presented by fibronectin in association with membrane bound integrins. Fibulin-4 and -5 are required in the formation of micro-aggregates of elastin which are then targeted to assembled fibrillin microfibres via the Lox pro-domain. Deposition of elastin micro-aggregates is aided by Fibulin-5 (Baldwin et al.15).

It is evident from these examples that the assembly and maintenance of extracellular matrices is a highly coordinated effort involving control of gene expression, post-translational modifications and numerous interactions at various levels of assembly. These are considered in more detail in the following sections.


Figure 1-4: Schematic diagram of microfibril and elastic fibre / elastin assembly. a. Microfibril assembly occurs pericellularly, and requires fibronectin, integrins and heparan sulphate proteoglycans (HSPG). Fibrillin molecules are secreted and, after processing N- and C-terminally by furin, interact homotypically at N- and C-termini leading to axial and lateral assembly to form microfibrils. Beads may arise from folding of terminal regions. Microfibrils may be stabilized by transglutaminase cross-links. The reason why fibronectin is needed for microfibril deposition is unclear, but it may act as a template for assembly and/or it may stimulate cytoskeletal tension through the α5β1 integrin, thereby facilitating assembly at fibrillar adhesions. Fibrillin-1 also interacts with α5β1, αvβ3 and αvβ6 integrins; however, it is not known whether these interactions are essential for microfibril assembly. Heparin inhibits microfibril assembly, and HSPGs may contribute by facilitating cell surface fibrillin-1 interactions. b. Elastin assembly occurs pericellularly on ‘microassembly’ and on microfibrils ‘macroaggregates’. Secreted tropoelastin forms globules at the cell surface which become cross-linked by lysyl oxidase; this process may involve αvβ3 integrin interactions with tropoelastin, and integrin interactions with heparan sulphate cross-linking by lysyl oxidase, and probably direct the deposition of elastin globules onto preformed fibrillin microfibrils, to form elastic fibres. Microfibrils and elastic fibres are important matrix storage sites for BMPs and latent TGFβ1. Figure reproduced from Baldwin et al.15 and used with permission.

1.5.2 Tissue-specific Expression

The differential expression of matrix components is a primary factor determining the tissue specificity of co-occurring matrix components. Mature elastic fibres, for example, have tissue-specific architectural arrangements that reflect different elastic requirements. Arterial elastic fibres form concentric lamellar layers supporting vascular elastic recoil. Dermal elasticity is based on integrated networks of thick reticular elastic fibres and thin fibres in the papillary dermis, and alveolar elastic fibres form fine networks that allow respiratory expansion and contraction78.

ECM expression is very dynamic during early limb development in vertebrates. Mechanisms for side branching are different from tip branching but are known to involve the differential expression of several MMPs61. In epithelial branching morphogenesis, fibronectin is specifically expressed at the future site where epithelium invaginates to split the epithelial tip79. In the brain, which has a distinct ECM expression profile with respect to matrix proteins80, several chondroitin sulfate proteoglycans known as lecticans (aggrecan, brevican, versican and neurocan) form ternary complexes with hyaluronan and Tenascin-N in perineuronal nets. Their respective core proteins interact with varying numbers of chondroitin sulphate chains and it is thought that by varying the population of these four lecticans, e.g. via a developmental shift in their relative expression, the stiffness of the matrix is controlled81,82.

Gene duplication (see section 1.4 Evolution) may have contributed to differential expression among specific family members such that modern paralogues are expressed under control of different regulatory pathways in distinct cell types or tissues51,83. However, transcription is not the only determining factor affecting the concentration of a given protein at a particular time and place. Differential splicing of ECM proteins leads to a population of heterogeneous gene products and in many cases the potential functional implications of these splice variants are acknowledged but not well understood84-86.

MicroRNAs (miRs) regulate biologic processes by suppression of translation or induction of degradation of mRNAs. The miR-29 family has been found to suppress e.g. collagens, elastin, and fibrillins87, and overexpression of miR-29b was found to be associated with aging vasculature87 as well as aneurysm development in a mouse model of Marfan syndrome (MFS)88. Notably, inhibiting miR-29b prevented early aneurysm development both in this model and in a mouse elastase infusion model of abdominal aortic aneurysm89, whereas antagonizing miR-29 led to upregulation of these ECM proteins87.

Intriguingly, extracellular matrices both affect and are affected by changes in gene expression. The nucleus can be regarded as a mechanosensor. The nuclear envelope and chromatin are connected to the ECM via integrin complexes and the cytoskeleton. Artificially applied forces may change the nuclear architecture, impacting chromatin organisation and, ultimately, gene expression, and this can occur much more rapidly than conventional signal transduction cascades involving the binding of growth factors90,91.

1.5.3 Post Translational Modifications

To become functional components of the ECM, many ECM proteins require complex levels of transcriptional, translational and post-translational control. Collagen synthesis, as reviewed above (see section 1.5.1 Self Assembly), involves several post-translational modifications, including hydroxylation of proline and lysine residues, pro-peptide cleavage and covalent crosslinking. Known post-translational modifications of ECM proteins have been recently reviewed92 and include: glycosylation, lysine hydroxylation, proline hydroxylation, phosphorylation, proteolytic processing, disulphide bonding, sulphation, ubiquitin-like conjugation, and covalent crosslinking. Here, the fundamental nature of post-translational changes to ECM molecules is further illustrated by several additional examples of major post-translational events.

1.5.3.1 Effects on solubility

Fibronectin is a glycoprotein with several splice variants that is involved in wound healing, migration, adhesion and tissue patterning. It circulates in plasma as a soluble form but also appears as an insoluble form that is necessary in collagen fibrillogenesis. The polymerization of fibronectin is necessary for the incorporation of collagen, fibrillin and thrombospondin into the ECM by lysyl oxidases93-95. In this insoluble form, RGD motifs are exposed via mechanical strain and bind membrane receptors (integrins α5β1 and αvβ3), localizing fibronectin at the cell surface. Although not a classic post-translational modification, this type of structural change is nevertheless highly thematic to the function of ECM96 and illustrates how alterations in one molecule in an assembly can induce changes in a very complex structure.

1.5.3.2 Biomineralization

Similarly, biomineralization of matrix to produce the vertebrate skeleton is an important modification. This appears to involve careful alignment of collagen in staggered arrays to facilitate the formation of channels large enough to accommodate nanocrystals97. During mineralization, the ionic composition in the extracellular space is controlled through the action of Ca²⁺ and PO₄³⁻ ion pumps and mediated by the presence of non-collagenous matrix proteins. Ultimately, all of the intrafibrillar spaces are filled with mineral, resulting in a flexing of collagen molecules away from the fibre axis97.

1.5.3.3 Activation/Inactivation by cleavage

Many matrix components, including modifying enzymes themselves, have multiple cleavage sites. As a consequence, molecules that start off being very similar acquire differences in structure and function. MMPs and , for example, exist as precursors that are enzymatically inactive until they are processed into mature proteinases. For MMPs, this is accomplished by including a prodomain at the amino terminus, which when present masks the catalytic Zn-binding motif98. This provides for a complex system of regulatory controls: MMP3 and 10 can cleave precursors of MMP1, 8, and 13 to activate them. Likewise, the membrane-type MMPs (MMP14, 16, 24, and 25) can activate pro-MMP2, and MMP14 can activate pro-MMP1398,99.

1.5.3.4 Modification of GAGs

In addition to protein targets, various glycosaminoglycans (GAGs) attached to proteoglycans and glycoproteins can be modified or removed by specific enzymes. SULF1 and SULF2, for example, are extracellular enzymes that remove 6-O-sulfates from heparan sulfate proteoglycans, altering WNT, vascular endothelial growth factor (VEGF), platelet-derived growth factor (PDGF), fibroblast growth factor (FGF), and other signaling events100.

1.5.3.5 Bioactive Fragments

The ECM participates in signal transduction, directly initiating signaling events upon the release of biologically active polypeptide fragments that arise by post-translational cleavage of parent proteins101. This is a normal process. A large group of fragments including endostatin, tumstatin, canstatin, hexastatin and arresten are derived from collagens IV and XVIII, and have both positive and negative effects on angiogenesis102. The NC1 fragment of collagen IV has been shown to be required for cell proliferation in epithelial branching morphogenesis of the submandibular gland103. Versican, on the other hand, is targeted by ADAMTS and its fragments induce cell death and promote regression of interdigital webbing during mouse limb development104. ECM fragments therefore have a wide range of activities including the ability to induce apoptosis.

It has been speculated that growth factor-like domains present in laminins, tenascins and thrombospondins could function as bioactive fragments if they were released by MMP-mediated proteolysis30. MMPs do cleave precursor proteins including pro-MMP precursors, and many MMPs have displayed essential functions unrelated to their proteinase activity98,99. MMPs also generate fragments promoting cell proliferation in tip epithelial cells which supply the building blocks to sustain epithelial branching61.

ECM fragments are chemotactic and can attract endothelial and inflammatory cells into areas of active tumor cell growth102. Collagen I and its fragments are often up-regulated in cancers and this correlates with increased numbers of macrophages and neutrophils at those sites. Conceivably, the dysregulation of ECM remodeling could allow mutant cells to evade apoptosis via the pro- or anti-apoptotic effects of the resulting ECM fragments61.

1.5.3.6 Discovery of novel post-translational modifications

In 2009, the discovery of a sulphilimine bond in collagen IV, a bond never before observed in biomolecules105, hinted that our knowledge of ECM post-translational modifications may still be incomplete. Post-translational modifications regulate the life-cycle of ECM proteins, including changes in their solubility, assembly and disassembly, activation and inactivation, and overall function. The activation/inactivation of matrix components via cleavage events not only modifies the activity of the target molecule but can also result in the production of biologically active fragments. As discussed in the next section, many of these post-translational modifications are fundamental to the role of the ECM in the developmental process.

1.5.4 Role in Development

The development of a complex multicellular organism such as a vertebrate requires close adherence to an array of temporal-spatial cues orchestrating cell growth, cell migration, and differentiation. In some cases patterning, e.g. regression of interdigital webbing in the mouse limb bud104, even involves programmed cell death. Having co-evolved alongside one another, the developmental process and the ECM are inextricably linked. This section will briefly explore the intimate nature of this relationship, leading up to a related discussion on disease.

The ECM takes part in most basic cell behaviours. ECM receptors such as integrins, discoidin domain receptors (DDRs), syndecans, and CD44, immobilize cells and anchor them to the matrix. This attachment is essential for epithelial cells, including adult stem cells, to maintain tissue polarity, organisation, and function106. Cell-matrix attachments lead to a two-way communication, known as dynamic reciprocity, in which gene expression is modified, either indirectly through intracellular signaling pathways or directly through changes in cell architecture affecting chromatin organisation91,107. In turn, changes in gene expression can alter or remodel matrix structures.

In cell migration, the ECM plays dynamic and opposing roles. On the one hand, basement membrane, a densely woven network of fibrillar proteins, acts as a barrier to cell movement31,108. On the other, the orientation, cleavage and remodeling of ECM components, such as collagen fibres, can profoundly influence the directed movement of cells, forming ‘superhighways’ on which cells can readily migrate109,110. This may occur by potentiating growth factor receptor signaling or by mechanical reinforcement of cell migration30,108.

The ECM selectively binds growth factors and is essential in shaping concentration gradients involving e.g. bone morphogenetic proteins (BMPs), fibroblast growth factors (FGFs), Hedgehogs, and Wnts30,31. Furthermore, the ECM mediates the storage and release of FGF, transforming growth factor beta (TGF-β) and vascular endothelial growth factor (VEGF) among others29,111.

ECM biomechanical properties have great impact on basic cell behaviours and developmental processes. Most cell types have mechanosensors at the cell surface and inside the cell28,42,112-114. Different cell types have different requirements for matrix rigidity; soft in adipose tissue or the brain, somewhat compliant in muscles, and very stiff and rigid in bones. Remodeling is therefore an important mechanism whereby cell differentiation can be regulated and this affects processes such as the establishment and maintenance of stem cell niches, branching morphogenesis, angiogenesis, bone remodeling, and wound repair61. ECM elasticity is known to drive cell lineage specification of mesenchymal stem cells into neurons, myoblasts, and osteoblasts115-117.

Radical remodeling of the ECM occurs during metamorphosis of insects and amphibians61. In vertebrates, significant remodeling takes place in the formation of adult bone, and in neural crest migration, angiogenesis, tooth and skeletal development, maturation of synapses, and in the nervous system98,118-120.

Branching morphogenesis is an essential part of the ontogeny of many vertebrate organs including the lung, kidney and mammary gland121-123 (Figure 1-5).

Figure 1-5: ECM dynamics determine epithelial branch patterning in vertebrate organs. The ECM is dynamic and plays essential roles in various steps during vertebrate epithelial branching morphogenesis. Deposition of newly synthesized ECM (green solid line) including fibronectin and laminin is required for splitting the epithelial bud and primary branching (1). In contrast, partial degradation of the ECM (gray dotted line) by MMP is necessary for epithelial cells to sprout from the side of the duct and undergo side branching (2). MMP activities are also required at the invasion front to maintain a constant ECM remodeling process that is essential for collective epithelial migration (3). MMP activities also generate functional ECM fragments to promote cell proliferation in the tip epithelial cells and thus are essential for supplying the necessary building blocks to sustain the rapid progress of epithelial branching. Interestingly, newly synthesized ECM is also deposited around the ‘neck’ of the branching tip (4). ECM deposition at this place may be important for the ductal remodeling process that has been observed in kidney epithelial branching. Figure was reprinted as it appears in Lu et al.61 revised from the original version created by Mark Sternlicht124. It is used with permission of the original copyright holder.

Mice carrying gain or loss of function alleles of MMPs or TIMPs show defects in branched organs, implicating ECM dynamics in this process98,124,125. Similarly, loss of fibronectin reduces branch number, suggesting that localized fibronectin deposition participates in epithelial bifurcation79. Collagen and laminin also appear to play a role in epithelial branching103,126.

Throughout the developmental process and beyond, the ECM presents a wide range of chemical, mechanical and topographical cues which cells may factor in when calculating a coherent response to their surrounding milieu92. Consequently, while the advantages of a robust and richly detailed system of communication are clear, cellular decision making is nevertheless vulnerable to misinformation in the wake of the dysregulation or mutation of ECM components. This is reflected in a broad range of pathological conditions. Examples of these are presented in the following section.

1.6 The ECM in Health and Disease

The ECM’s extraordinary range of function is revealed by the occurrence of numerous genetic and acquired connective tissue disorders. Several examples are reviewed here. The focus is less on the etiology of particular conditions than on gaining a general understanding of the range of ECM disorders, the common patterns broadly underlying ECM pathology, and their implications for understanding the ECM’s systemic strengths and vulnerabilities.

Beginning with the inherited disorders, many mutations have been characterized in ECM structural components and in the enzymes involved in their folding and post-translational processing. ECM mutations exert their deleterious effects both inside and outside the cell and result in disease. The molecular basis of how these mutations cause the myriad of connective tissue disorders depends on the function of the affected gene product, its tissue distribution and the nature of the mutation127.

A long-held view has been that mutations cause ECM dysfunction by two mechanisms. The first mechanism involves a quantitative reduction in ECM components by mutations affecting synthesis (e.g. haploinsufficiency), or by structural mutations causing cellular retention and/or aberrant degradation. Mutations in collagen trimerization domains, the C-terminal propeptide or the Gly–X–Y repeat sequence, for example, result in delayed or aberrant folding and secretion of mis-assembled collagen127. Alternatively, in the second mechanism, secretion of mutant protein disturbs the ECM qualitatively, compromising crucial interactions, structure and stability.

Loss-of-function mutations resulting from the introduction of premature termination codons (PTCs) in the genes of ECM structural components often reveal haploinsufficiency with dominant inheritance. This occurs, for example, in collagen II in Stickler syndrome128,129 and collagen VI in Bethlem myopathy130. On the other hand, even if the affected gene product is haplosufficient, incomplete nonsense-mediated decay may lead to dominant negative or gain-of-function effects when small quantities of the affected mRNA escape degradation, resulting in the presence of between 5 and 20% mutant truncated protein. In terms of severity, dominant loss-of-function mutations generally have a milder clinical phenotype than structural gain-of-function mutations in the same gene127.

Loss-of-function mutations in genes whose products function in ECM protein processing, folding and post-translational modification can also result in connective tissue disease. For example, mutations in ADAMTS2, the enzyme that removes the collagen I N-propeptide before fibril formation (see section 1.5.1 Self Assembly), cause the recessive form of Ehlers–Danlos syndrome131. Mutations affecting sulphation of glycosaminoglycans can cause abnormal cartilage formation (chondrodysplasias) of which there are over 200 different types127.

Studies of dominant structural mutations in ECM components have led to the widely accepted model that tissue pathology results from the effects of mutant proteins on the ECM. In this model, the deleterious effects on the ECM are thought to result from reduced protein levels owing to intracellular degradation of the mutant polypeptide and/or secretion of aberrant matrix protein that disrupts the organisation of the ECM. Many inherited diseases of elastic fibres fit this model. Aberrant elastic fibre formation and/or altered homoeostasis cause disease phenotypes ranging from mild (e.g. loose skin) to severe and potentially life threatening vascular defects (e.g. supravalvular aortic stenosis (SVAS), Marfan syndrome (MFS)). An updated review of elastic fibre assembly and the heritable and acquired disorders associated with aberrant assembly and disruption of elastic fibres was recently provided by Baldwin et al.15.

Describing elastic fibre diseases and current therapies, the review notes how recent approaches such as the creation of induced pluripotent stem cells from SVAS patients may lead to a better understanding of the pathogenesis of elastic fibre diseases, and how treating Marfan syndrome and related fibrillinopathies with losartan and other therapies that target TGFβ activity has led to improved patient outcomes. In addition, engineering of vascular constructs based on elastin and elastic fibre components offers great promise for the repair of damaged elastic tissues. However, conclude the authors, a greater understanding of the biology of elastic fibres, their assembly and influence on the bioavailability of TGFβ and interaction with inflammatory mechanisms is urgently needed to advance therapeutic prospects for elastic fibre diseases15.

Additional mechanisms in the molecular pathology of ECM disorders may include misfolded mutant ECM proteins such as COMP, collagens and matrilin-3 which have been shown to induce significant endoplasmic reticulum (ER) stress and trigger the unfolded protein response. This is an adaptive response by the cell to attempt to balance the load of misfolded proteins in the ER. Many studies have shown that elevated ER stress and its consequences can contribute significantly to a number of human disorders including many heritable ECM disorders127.

Turning to acquired conditions, the inappropriate remodeling of the ECM and the consequent release of bioactive fragments96 generated through the cleavage of structural proteins and associated glycosaminoglycans (see section 1.5.3.5 Bioactive Fragments) has been implicated in the dysregulation of growth factors and other signals30. Consequently, there is an increasing appreciation that the ECM plays a key role in many complex and multifactorial diseases30,132,133.

Changes in ECM stiffness are often associated with ageing, injury or pathological conditions as a consequence of altered ECM composition and organisation134-136. In particular, diseased tissues have markedly different elasticity than healthy ones, being unusually stiff and rich in ECM components137,138. It has even been shown that implantation of metal into normal tissue can cause tissue fibrosis and, in some cases, tumor development139,140, suggesting that changes in tissue stiffness actually play a causative role in disease.

Abnormal ECM dynamics have been linked to tissue fibrosis of many organs134, chronic inflammation141 and play an essential role in cancer progression by promoting cancer cell proliferation, loss of cell differentiation, cancer cell invasion, and failure of cell death61,113,136,142. Stiffness of the arterial wall is a sensitive and early marker of atherosclerosis143. In addition, changes in ECM thickness accompany the progression of obstructive bladder disease96 and airway remodeling in asthma144.

Key mediators of ECM dynamics, the matrix metalloproteinases (MMPs) have been shown to play an important role in brain injury after ischemic stroke as well as being responsible for tumor invasion and infiltration145,146. In a study by Cuadrado et al.146, MMP-10 was notably increased in neurons of the ischemic brain but not in healthy areas. In the skeletal system, where bone matrix is continuously remodeled, an imbalance between resorption and deposition underlies diseases wherein bone density is altered, including osteopetrosis, and more commonly, osteopenia and osteoporosis61.

In the wake of considerable evidence implicating ECM dysregulation and remodelling in disease, approaches aimed at integrating this information with an understanding of ECM architecture will be helpful to elucidate new targets for intervention. Factors that affect ECM organisation in the context of health and disease include developmental and tissue specific expression of the components, their post-translational modification and their interactions147. The next section explores the availability of data and tools suitable to undertake such an analysis.

1.7 Systems Biology

Complex systems, including systems comprised of multiple biological parts, often exhibit unexpected behaviours based on underlying organisational patterns that are not easily detected at the level of the individual components. Examples include how traffic flow from point A to point B in a network of roads changes given a random detour, or how the flow of information through a biological signaling network changes given the up- or down-regulation of certain components. Broadly, systems biology attempts to address such prediction gaps in biological systems by considering the organisation of those systems as a whole. This necessarily requires a full list of the system's components as well as information about how those components relate to one another. As a corollary, the resulting model creates a rational framework for the organisation and interpretation of associated metadata. In a biological system this metadata could include e.g. spatio-temporal gene expression under various conditions, known disease associations, phylogenetic conservation, or other functional annotations. Importantly, systems biology differs from (and is complementary to) traditional experimental approaches to knowledge discovery in that it is data intensive, data driven and, rather than testing hypotheses, is primarily (and intentionally) hypothesis generating.

The recent availability of sequence-based resources, including large scale databases cataloging the functional annotation of genes, their products148-152 and interactions153-159, has facilitated systems approaches geared at understanding the organisation of biological systems at a grand scale160-167. At the same time, the increasing availability of genome sequences for a variety of eukaryotic organisms (>130 species) allows the comparative analysis of gene conservation across species to gain insights into clade-specific innovations168. Such studies have revealed e.g. that biological systems are highly interconnected, modular structures that are surprisingly robust, being resistant to the noisy signals and random perturbations that are commonplace in living systems163,169. They have also enabled the prediction of protein complexes165,166, functions for unannotated proteins153 and the building of computational tools to infer putative missing interactions between biological components based on other similarities170-172.

However, the assembly and analysis of this data is not trivial. The data tend to be heterogeneous; derived from a variety of experimental methodologies, varying in quality and stored in a variety of formats, representing a mixture of traditional, small-scale laboratory and so-called ‘high throughput’ experiments. In this section, the importance and limitations of various annotation resources and interaction data sets available at the time of this study are briefly reviewed, followed by a short primer on network analysis.

1.7.1 Annotation

The scientific literature represents a deep reservoir of knowledge spanning many decades. Despite recent progress in making this resource as widely accessible as possible through e.g. searchable abstracts and online cataloging and retrieval, it is difficult to extract large volumes of discrete information (e.g. all interactions for a given molecule) from literature sources in a timely fashion. Whereas humans tend to excel at the slow extraction of unformatted (or at least unstandardized) information from such sources, computers work best given precise rules. Recent attempts at text mining173-176 seek to circumvent such limitations but manual curation remains the highest standard of quality assurance, though even this is not flawless.

The need for more efficient access to summary information has spawned a variety of databases hosting combinations of curated and computer-assisted information assemblages as well as secondary source databases which consolidate information across these servers177-179. Nevertheless, even expert curation of the literature would be useless without standards supporting equivalent comparisons (apples to apples), and this need has given rise to several structured vocabularies, or ontologies, for describing biological entities and their functions.

The Gene Ontology (GO)151 is comprised of three separate ontologies describing (1) cellular components, (2) molecular functions and (3) biological processes. Terms within these vocabularies are structured as a directed acyclic graph (DAG) relating parent and child terms. Member databases of the GO Consortium use these terms in the functional annotation of proteins, tracking the evidence and reference(s) supporting the annotation. Online Mendelian Inheritance in Man (OMIM)180 maintains standardized terms and disease annotations. However, these are limited to heritable diseases. Finally, less structured but useful nonetheless for establishing functional associations are controlled lists of keywords such as the biological function keywords maintained by UniProtKB149 and Medical Subject Headings (MeSH)181 which, while less standardized, are associated with a large coverage of the literature.
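The DAG structure described above means that a term can have more than one parent, and that an annotation to a child term is conventionally taken to imply annotation to all of its ancestors. The following is a minimal sketch of that propagation, assuming a toy child-to-parents map; the term names and the helper functions `ancestors` and `propagate` are illustrative only and do not reproduce the actual GO hierarchy or any GO Consortium tool.

```python
from collections import deque

# Toy child -> parents map. Term names are illustrative assumptions and
# do not reproduce the real GO hierarchy.
PARENTS = {
    "cellular_component": [],
    "extracellular_region": ["cellular_component"],
    "extracellular_matrix": ["extracellular_region"],
    "supramolecular_fibre": ["cellular_component"],
    # A DAG (unlike a tree) allows a term to have several parents.
    "collagen_fibril": ["extracellular_matrix", "supramolecular_fibre"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents[term])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents[current])
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand each protein's direct annotations with all ancestor terms."""
    expanded = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded
```

Propagating annotations in this way allows proteins annotated at different levels of detail to be compared against a common, more general term.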

In addition to curated functional annotations, supplementary information including large published datasets are often now made available online and/or ported into publicly available repositories that support standardized data and exchange formats (e.g. MIMIx for molecular interactions182). However, in some cases access to information is limited to transactional requests i.e. a single result or web page of information at a time or caps on the number of records returned per transaction. Thus, while the development of web-based tools has tended to facilitate look-ups by researchers interested in a single or small group of related proteins, assembling full datasets for large-scale analysis can still present challenges. Finally, there may not be a single definitive source for a given type of information. More commonly, a variety of databases serve as repositories for similar types of information. For example, protein-protein interaction databases contain unique subsets of the total interactome and the record formats for these differ from one another such that data integration remains a significant challenge for the bioinformatician. The methods by which protein-protein interactions are derived are the subject of the next section.

1.7.2 Interactions

Proteins do not act alone but in combination to perform cellular and extracellular functions. These include participation in stable protein complexes forming intricate macromolecular machines, as the building blocks of large scale assemblages such as ECM fibres and other structures as well as more transient interactions involved in dynamic signaling processes. Protein-protein interactions (PPIs) and, to a degree, the interactions involving other molecular species (e.g. carbohydrates, lipids and small molecules), define the functional organisation of biological systems. Although carbohydrates are known to be essential components of some extracellular matrices e.g. hyaluronan is a major component of articular cartilage183, PPIs make up the vast majority of the currently catalogued interactions in the ECM, outnumbering interactions involving carbohydrates by a factor of nearly 20:1159. Consequently this study focuses on PPIs as a means to investigate the overall architecture of the ECM. Here, the most popular experimental methods for determining PPIs are briefly reviewed with comments on their advantages and disadvantages.

1.7.2.1 Co-immunoprecipitation

There are more than 60 low-throughput experimental methods that have been used to characterize protein-protein interactions in mammals184. One popular technique, co-immunoprecipitation, uses an antibody as a means to precipitate a target protein and its interacting proteins. This technique is well-regarded and easily implemented on a small scale. However, the direct relationships among any given pair of proteins in a complex are not known, so assumptions must be made when converting such data into binary interactions. Two models are commonly used. One is to assume that the bait interacts with each prey but that the latter do not interact directly with one another; the ‘spoke’ model. The other is to assume that all proteins in the complex interact with every other molecule in the complex; the ‘matrix’ model.
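The difference between the spoke and matrix models can be made concrete with a short sketch. The function names `spoke_model` and `matrix_model` and the protein labels in the usage note are assumptions chosen for illustration; they are not taken from any particular interaction database or toolkit.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Pair the bait with each prey; preys are not paired with each other."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Pair every member of the complex with every other member."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}
```

For a hypothetical bait `"BAIT"` recovered with preys `"P1"`, `"P2"` and `"P3"`, the spoke model yields three binary interactions and the matrix model yields six. In general a complex of n preys gives n pairs under the spoke model and n(n+1)/2 under the matrix model, so the matrix model is the more permissive of the two and is more likely to include pairs that are not in direct contact.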

1.7.2.2 Yeast Two-Hybrid

The yeast two-hybrid (Y2H) system is a fragment complementation assay that was designed to directly identify binary interacting partners. Bait and prey proteins are introduced via plasmids into yeast that are deficient for a transcription factor. The bait protein is fused to the binding domain of the missing transcription factor (which binds the upstream activation sequence of a downstream reporter gene) while the prey is fused to the corresponding activation domain. If there is a stable interaction between the bait and prey, this reconstitutes the functional transcription factor, allowing the transcription of the reporter gene. An advantage of this approach is that bait proteins can be rapidly screened against a library consisting of thousands of potential prey proteins in a high-throughput experiment.

Early Y2H data sets were criticized for their high false positive rate185. However, subsequent assays introduced additional quality controls and data filters, such as testing for autoactivation of the reporter, to substantially improve the reliability of the reported interactions186. Variations on the Y2H fragment complementation assay have since been developed. For example, the membrane yeast two-hybrid assay (MYTH)187 uses a split ubiquitin system to target membrane-specific interactions.

1.7.2.3 Luminescence based mammalian interactome mapping procedure

From the point of view of discovering mammalian protein-protein interactions it is seen as a potential disadvantage that interactions in the Y2H system must occur in the nucleus to be detectable. In addition, yeast cells lack the ability to carry out many of the post-translational modifications that occur in mammalian cells, which could affect some interactions. Alternative methods have therefore been developed. One such example is the luminescence based mammalian interactome mapping procedure (LUMIER)188. Performed in mammalian tissue culture, this system measures the relative co-purification of an affinity-tagged protein with a second protein tagged with luciferase. However, while highly sensitive, this system does not scale up as readily as Y2H.

1.7.2.4 Affinity Purification and Mass Spectrometry

Affinity Purification and Mass Spectrometry (AP-MS)165 has become a popular experimental technique for the individual purification and identification of protein complexes. In a typical experiment, a fusion protein (bait) containing an affinity purification tag is expressed in cells where it forms complexes with endogenous proteins. Cells are lysed and the bait protein (and any attached prey proteins) is purified using an antibody or affinity matrix that recognizes the affinity tag. The proteins in the co-purified complexes are then identified using mass spectrometry (MS). This technique is similar to co-immunoprecipitation except that the use of affinity tags coupled with MS facilitates high-throughput experimentation. Like the earlier technique, it produces interactions of complex type, requiring either the spoke or matrix model to interpret the results as binary interactions. To ensure the quality of the interactions, reciprocal binding experiments are usually performed (i.e. with bait and prey switched) and frequent, non-specific interactors are removed.
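The difference between the spoke and matrix models can be illustrated with a minimal sketch. The function names and the toy purification below are hypothetical and are not taken from any of the cited pipelines; the sketch only shows how one co-purified complex yields different sets of binary interactions under each model.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Pair the bait with each co-purified prey (bait-prey edges only)."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}

def matrix_model(bait, preys):
    """Pair every member of the purified complex with every other member."""
    members = sorted({bait, *preys})
    return set(combinations(members, 2))

# Hypothetical purification: bait "A" co-purifies with preys "B", "C" and "D".
spoke_edges = spoke_model("A", ["B", "C", "D"])
matrix_edges = matrix_model("A", ["B", "C", "D"])
print(len(spoke_edges), "spoke edges;", len(matrix_edges), "matrix edges")
```

For a complex of n members the spoke model yields n - 1 binary interactions and the matrix model n(n - 1)/2, so the matrix model is the more permissive interpretation and is more likely to introduce pairs that do not contact each other directly.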

1.7.2.5 Functional protein microarrays

Functional protein microarrays are a derivative of technology first developed for expression microarrays. An array of capture proteins is bound to a support consisting of a glass slide, nitrocellulose membrane, bead, or microtitre plate. Probe molecules, typically labeled with a fluorescent dye, are then added. Any reaction between the probe and the immobilised protein emits a fluorescent signal. While fluorescence labeling is the most common detection method, other labels can be used including affinity, photochemical or radioisotope tags189. Since these labels are attached to the probe, they can interfere with the probe-target protein reaction. Therefore, a number of label-free detection methods are available, such as surface plasmon resonance (SPR). The latter is an optical biosensor technique which measures the change in the angle of reflection of light interacting with oscillating electrons (the plasmon) on a thin metal surface. The angle corresponding to the minimum intensity of the reflected light is known as the SPR angle and is directly related to the amount of molecules bound at the metal surface. The resulting sensorgram can be used to determine quantitative information about the binding kinetics of proteins and their interactors.

1.7.3 Network approaches

A number of computational tools exist to aid in the assembly and analysis of PPIs as well as other types of interactions using data from traditional and high-throughput experiments190. Represented as a series of vertices (sometimes referred to as nodes) and edges (connections which may be either directed or undirected), the resulting networks or ‘graphs’ are amenable to formal mathematical description. A number of useful metrics have been developed to describe and compare network topologies across various kinds of systems (Figure 1-6). Many of these have been developed to describe the organisation of non-biological networks such as social networks but, interestingly, the same principles apply to all small-world networks whose distribution of edges obeys a power law169.

In these systems, there are a large number of nodes with few connections and a small number of nodes with many connections. Most nodes are not direct neighbours, but randomly selected pairs of nodes are typically connected to each other by routes that traverse a small number of edges; the route requiring the fewest edges is referred to as the shortest path. Clusters of densely connected nodes are a basic feature of small-world networks and in biological PPI networks have been shown to correlate with functional similarity among proteins169. Indeed, this property can be exploited to predict protein function through ‘guilt by association’191. Compared to random networks, information flow in biological networks has been shown to be robust to the chance addition or removal of edges (noise). While hub proteins with many connections (high degree) near the centre of the network tend to be essential, their vulnerability is somewhat mitigated by their lower frequency192-195.
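The network parameters mentioned above (degree, shortest path and local clustering) can be computed on a small undirected graph as in the following sketch. The toy edge list and all function names are illustrative assumptions rather than part of any tool cited in this work.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree(graph, node):
    """Number of edges incident on the node."""
    return len(graph[node])

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def clustering_coefficient(graph, node):
    """Fraction of possible edges among a node's neighbours that are present."""
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in graph[neighbours[i]]
    )
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy network with a hub "D" and a chain leading away from it.
toy = build_graph([("D", "A"), ("D", "B"), ("D", "C"), ("D", "E"),
                   ("A", "B"), ("E", "F"), ("F", "G")])
print(degree(toy, "D"), shortest_path_length(toy, "A", "G"))
```

In this toy graph the hub has the highest degree but a low clustering coefficient, because most of its neighbours are not connected to one another.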

Network approaches have been very informative in deriving a basic understanding of the principles underlying the organisation of a number of biological systems including the proteomes of a number of model organisms163,165,167 and there is considerable interest in the assembly and analysis of subnetworks of functionally related proteins196-198. In particular, networks provide a valuable framework for the organisation of various metadata e.g. gene expression, disease association etc. that aid in interpreting the biological significance of interacting proteins.

Figure 1-6: Small-world network illustrating some common network parameters. Vertices (v) are shaded to correspond to network parameters they exemplify. For example, vertex ‘D’ has the highest degree whereas vertex ‘H’ has a high betweenness. Since proteins do not act alone, supposing that vertices correspond to proteins and the edges between them indicate protein-protein interactions, one may infer that proteins within the cluster surrounding ‘D’ are likely to share a common function.

Previous studies focusing on networks of disease associated genes have shown that genes associated with similar disorders tend to have similar expression profiles and are more likely to physically interact, suggesting the existence of disease-specific functional modules199. This has led to the suggestion that disruption of network architecture (as opposed to functional changes in specific proteins themselves) is a related factor in human disease169. The ECM, with components known to participate in many complex diseases, represents an interesting network in which to explore this prediction. Furthermore, if this is true, it may also be true that subtle variations in the binding characteristics of splice variants and mutant alleles could adversely affect network dynamics leading to long term health implications, an idea further explored in a later section (see section 4.2 Future Directions).

1.8 Goals and Rationale

Decades of careful research have gradually revealed the major families of ECM proteins, their primary functions and their role in specific inherited disorders. In addition, the importance of ECM dynamics in a wide range of complex, acquired diseases is now appreciated. However, lacking a complete list of ECM family members and their interactions, the potentially important role played by ECM network perturbation in disease has not yet been examined. Examining how components of the ECM are organised and operate will enable a better understanding of the role that ECM architecture plays in health and disease, and may lead to the discovery of new therapeutic targets. In addition, this systems approach facilitates function prediction for unannotated proteins including putative novel ECM proteins and their neighbours. Factors affecting ECM organisation include the temporal and spatial expression of the components, their post-translational modifications and their interactions. Together, these provide a powerful contextual framework for interpreting related meta-data such as disease association and the evolutionary conservation of modular components.

The ECM is characterized by large, multi-domain proteins whose unique properties contributed to a variety of specialized tissues and structural features in vertebrates. However, beyond their origins, the evolutionary forces that have since guided the development of this system have remained largely unexplored. Defined as a system, the ECM represents an ideal resource to explore how mechanisms underpinning patterns of domain evolution contributed to their functional organisation and whether, by virtue of their unique biological role, the evolutionary forces driving the evolution of ECM proteins are similar to or different from those acting on other proteins. The goals of this project are:

1. To define the set of human ECM (and related) proteins and their high-confidence protein-protein interactions.

2. To predict functional modules and provide a framework for organising and interpreting associated metadata.

3. To determine the organisation and evolutionary conservation of biologically relevant modules and their disease associations.

4. To analyze the evolutionary forces influencing the emergence of ECM proteins.

As a first step in this analysis, Chapter 2 presents an ECM parts list. The definition of such functionally related subsystems of proteins, given incomplete annotations, is challenging but of considerable recent interest (see section 1.7.3 Network approaches). Here, ECM components were identified using a systematic method, leveraging annotations from several secondary source databases. This method can be applied generally to define components relating to any group of functionally related proteins in a fraction of the time needed for a more traditional curation approach.

Having defined the set of human ECM (and related) proteins and their high-confidence protein-protein interactions (Goal #1; Chapter 2), the topology of the network was analyzed to predict functional modules and provide a framework for organising and interpreting associated metadata (Goal #2; Chapter 2). This analysis revealed the organisation and evolutionary conservation of biologically relevant matrix modules as well as their disease associations (Goal #3; Chapter 2) and illustrates the potential for systems based analyses to predict new functional and disease associations on the basis of network topology.

Interactions of a critical ECM component, elastin, were further investigated using a combination of curation and surface plasmon resonance imaging array (SPRi) experiments in a pilot study (extending Goal #1; Chapter 2). In addition to identifying potential new interactors of elastin, the SPRi experiments demonstrate the potential of this method to differentiate the binding characteristics of related protein fragments. SPR methods could, in the future, assist in determining binding residues, which are currently unknown for the majority of matrix proteins, and perhaps lead to a better understanding of how networks are perturbed by subtle mutations in the components.

An analysis of protein conservation showed that approximately two thirds of ECM proteins in humans are unique to vertebrates. The evolutionary forces driving this innovation are relevant to understanding how changes in ECM components and their organization affect their function. Proteins are composed of smaller building blocks consisting of conserved sequences known as domains. The composition and arrangement of domains in a protein determine both its function and interactions. Chapter 3 describes a novel pipeline to detect the occurrence of ECM-associated domains across the eukaryotic phylogeny. In addition to confirming the importance of domain gain, selective loss and tandem repeats observed previously for general vertebrate proteins, this work has contributed to a greater understanding of the evolutionary forces influencing the emergence of ECM proteins (Goal #4). As well, this study uncovers general and clade-specific evolutionary patterns in the usage and recombination of domains resulting in the domain architectures observed in the ECM proteins of humans and other metazoans.

The construction and analysis of a network of domain adjacency has revealed modules of domains – neighbourhoods which reflect domain patterns comprising the functional units of multi-domain ECM proteins. It is speculated, in future directions, that these domain usage patterns may be useful in developing a model for predicting novel domain architectures with desirable functions on which synthetic ECM proteins could be based. These evolutionary patterns were refined using a novel application of Sequential Pattern Analysis to define higher order patterns that recur within ECM domain architectures and whose evolutionary trajectories provide useful insights into ECM domain evolution. It is proposed that this technique may be applied more broadly to other systems or even at the genomic level to better define ‘supra-domains’, the evolutionary units of domain architectures. Finally, gaps in current annotations and PPI data, as revealed by this study, suggest a number of opportunities and these future directions are discussed in Chapter 4.

This study represents a framework for the systematic definition and analysis of biological systems. In presenting an overview of the current state of our knowledge of the ECM, the analysis contributes to our understanding of the structure, function and evolution of the ECM as a unique system and highlights important opportunities for further investigation which will enable a greater understanding of its role in health and disease.

Chapter 2 Surveying the Extracellular Matrix: Towards a Systems Level Understanding of its Structure, Function and Evolution

Portions of this chapter have been reprinted or adapted from Cromar et al.200 and are used with permission.

I conceived the study and contributed to the development and verification of the software pipelines used herein. In addition, I generated the figures and performed all analyses, curations and interpretations except where specifically noted below.

Emilie Chautard contributed to the analysis of the expression data, in particular the raw data assembly for Figure 2-11(a). Xuejian Xiong performed the statistical analysis of the co-expression data, using MATLAB to produce Figure 2-12(a) and 2-12(c). Hongan Song, Xuejian Xiong, James Wasmuth, Noeleen Loughran and Tuan On were involved in various aspects of the development and maintenance of the PhyloPro pipeline which forms the basis for the analysis of orthologues that I performed.

SPRi experiments were conducted under the direction of Dr. Sylvie Ricard-Blum at the Institut de Biologie et de Chimie des Protéines (IBPC) in Lyon, France as part of a collaborative traineeship in accordance with the requirements of the UofT Collaborative Graduate Program in Genome Biology and Bioinformatics (CGPGBB). Recombinant elastin peptides were supplied by Megan Miao under the direction of Dr. Fred Keeley. I performed the supporting literature review, assisted in the chip design and fabrication, performed the SPR binding experiments with the assistance of Romain Salza and interpreted the results.


2 Survey of the Extracellular Matrix

2.1 Introduction

A first step in understanding the inter-relationships among ECM proteins is defining the components that make up the ECM. Several generic resources exist that collate expert knowledge for a vast range of proteins and these may be usefully mined to extract sets of proteins implicated in a biological system of interest. However, definitions and ontologies used by these resources are not consistent. For example, two of the more renowned resources, PANTHER201 and the Gene Ontology (GO) resource151, differ markedly in their definition of ECM proteins. To circumvent this lack of standardization, community efforts have focused on more detailed attempts to define components of a biological system202,203. Of relevance here, Chautard and co-workers recently began assembling a database of known components of the ECM together with physical interactions obtained from a variety of public resources and from manual curation158,203. Here, these efforts are complemented through the development of a protocol that systematically defines the components of any biological system of interest within a relatively short time frame. This protocol is applied to extend our knowledge of the components of the ECM and their interactions. In addition to validating previous ECM annotations, the derivation of ECM components and collation of experimentally determined protein-protein interactions allows the construction of a network of ECM interactions that serves as a valuable platform for the integration and interpretation of additional meta-datasets detailing protein expression, function, disease associations, domains and conservation patterns. In addition to demonstrating the utility of this approach, these analyses yield insights into the organisation, function and evolution of the ECM.

Our knowledge of ECM proteins and their interactions is, however, far from complete. There are gaps in both the literature (i.e. knowledge itself) and in the representation of that information in public databases (i.e. knowledge transfer). This initial survey of the ECM mainly addresses the latter problem and in so doing provides a framework for the organisation and interpretation of new information, the remedy for the former. In the case of ECM proteins, new information is not necessarily easily acquired. Many ECM proteins are insoluble, requiring harsh chemical conditions to isolate and rendering them unsuitable for study by popular high-throughput methods for determining interactions such as co-immunoprecipitation followed by mass spectrometry. The yeast two-hybrid approach, in turn, involves expression in a single-celled organism under conditions that are dissimilar to the extracellular milieu of multi-cellular metazoans. Alternatively, surface plasmon resonance (SPR) is an optical biosensor technique allowing direct monitoring of interactions. It has been used to characterize a wide variety of interactions, including antibody-antigen, ligand-receptor, and proteins with oligonucleotides or carbohydrates. SPR imaging (SPRi) is an array format useful for screening potential hits whereas classical SPR is low throughput but provides the ability to determine binding kinetics. These characteristics made SPR an attractive method for the pilot study described herein, which explores the differential binding characteristics of human tropoelastin and several recombinant elastin-like peptides, paving the way for future studies to expand available high-quality protein-protein interaction data for ECM proteins.

2.2 Materials and Methods

2.2.1 Identification of Extracellular Proteins

A list of ‘gold standard’ ECM proteins was curated from primary literature sources. Initially the search began with a list of neighbours and next nearest neighbours of elastin (ELN) as defined by protein-protein interactions deposited in BioGRID157. Literature curation efforts subsequently defined bona fide ECM proteins based on supporting experimental evidence (i.e. non-ECM proteins were removed). The online tool AmiGO and the GO online SQL environment (GOOSE) were used to assemble lists of terms associated with gold standard proteins including the parents and children of these terms151. Uninformative terms were removed using the following ad hoc criteria: remove child terms containing (for our purpose) irrelevant levels of detail (e.g. keep GO:0042476 ‘odontogenesis’, remove GO:0042481 ‘regulation of odontogenesis’, GO:0042482 ‘positive regulation of odontogenesis’ and GO:0042483 ‘negative regulation of odontogenesis’); remove obvious false positives (e.g. discard GO:0003735 ‘structural constituent of ribosome’, GO:0005199 ‘structural constituent of cell wall’); remove non-mammalian terms (e.g. discard GO:0008011 ‘structural constituent of pupal chitin-based cuticle’); remove overly general terms (e.g. discard GO:0001705 ‘ectoderm formation’, GO:0001822 ‘kidney development’); keep terms meeting the above criteria containing words with well-established association to ECM proteins (e.g. collagen, cartilage, fibril, basement membrane, basal lamina, extracellular, cell adhesion, cell migration, remodeling, wound healing). GO annotations for the complete human, mouse and rat proteomes (human CVS version 1.47, mouse CVS version 1.660, rat CVS version 1.64) were obtained from downloaded GO annotation (GOA) files for each species.
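The term-filtering criteria above were applied by manual review. Purely as an illustration, a first-pass triage along the same lines might look like the following sketch. The function name, the three-way outcome and the blanket rule for ‘regulation of’ terms are assumptions introduced here; only the keywords and GO terms are taken from the text.

```python
# Keywords with a well-established association to ECM proteins (from the text).
ECM_KEYWORDS = (
    "collagen", "cartilage", "fibril", "basement membrane", "basal lamina",
    "extracellular", "cell adhesion", "cell migration", "remodeling",
    "wound healing",
)

# Terms the text discards as false positives, non-mammalian or overly general.
DISCARDED_IDS = {
    "GO:0003735", "GO:0005199", "GO:0008011", "GO:0001705", "GO:0001822",
}

def triage_term(go_id, name):
    """Return 'drop', 'keep' or 'review' for a candidate GO term.

    Hypothetical helper: 'review' marks terms that still need manual curation.
    """
    lowered = name.lower()
    if go_id in DISCARDED_IDS:
        return "drop"
    if "regulation of" in lowered:  # assumed proxy for overly detailed child terms
        return "drop"
    if any(keyword in lowered for keyword in ECM_KEYWORDS):
        return "keep"
    return "review"

print(triage_term("GO:0005604", "basement membrane"))
print(triage_term("GO:0042476", "odontogenesis"))
```

A term such as ‘odontogenesis’ carries none of the keywords and so falls into the ‘review’ category, reflecting the fact that decisions of this kind were ultimately made by a curator rather than by rule.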

2.2.2 Classification of Proteins

The reviewed evidence included publicly available protein annotations observed in the secondary source databases GeneCards179, BioHarvester178 and iHOP204 including the International Protein Index (IPI)205 and functional descriptions from UniProt149 and Entrez152 as well as the literature. The HUGO Gene Nomenclature Committee (HGNC) database206 and Synergizer207 were used to cross-reference genes with alternate gene symbols. All categorizations were summarized to the level of the Ensembl gene identifier (i.e. categories reflect whether the gene has any products so annotated). Pre-computed human-mouse and human-rat orthologues were obtained from Ensembl using BioMart148,208.

2.2.3 Source for Protein-Protein Interactions

Physical protein-protein interactions (PPIs) were obtained from the Unified Human Interactome209 (UniHI), an assemblage of interaction data from several sources including BioGRID157, IntAct210, DIP155, MINT156, BIND154, HPRD211 and two large-scale, yeast two-hybrid studies163,164. These data were updated to include current interaction data from BioGRID, HPRD, IntAct, and MatrixDB as of Dec 2010. Datasets based on automated text mining173, interologues212-214 and interactions of complex type215 were retained, but excluded from our analysis (except where the latter were present in the included datasets as expanded binary interactions) on the basis that such data are more likely to contain false positives than binary interactions determined through physical experiments. For MatrixDB, where complexes represent truly native trimeric collagens, laminins, thrombospondins 1 and 2, or truly native dimeric receptors such as integrins, interactions of curated complexes were expanded into binary representations using the ‘matrix’ model (i.e. binary interactions were created between each member of the complex). Where the species for the interacting proteins was indicated in the underlying database, we used this information to exclude interactions not known to occur between the human orthologues. Because extracellular proteins are not optimally expressed in the nucleus and tend to be ‘sticky’, interactions supported only by Y2H evidence216 were excluded. Interactions generated from a recent high throughput affinity purification study were also excluded as many of the identified interactions remain to be further validated217. Furthermore, the focus on recombinant peptides from four proteins limits the number of additional interactions that would be included. Interactions are summarized in the supplementary Excel spreadsheet data file “SF1” which is also available on the accompanying CD.
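The exclusion of interactions supported only by Y2H evidence can be sketched as follows. The record layout, the method labels and the placeholder protein names are hypothetical; the source databases use their own identifiers and evidence vocabularies.

```python
def filter_interactions(records, excluded_only=frozenset({"Y2H"})):
    """Keep protein pairs with at least one evidence type outside excluded_only.

    records: iterable of (protein_a, protein_b, method) tuples.
    Returns a dict mapping each retained pair to its set of evidence types.
    """
    evidence = {}
    for a, b, method in records:
        pair = tuple(sorted((a, b)))  # treat A-B and B-A as the same interaction
        evidence.setdefault(pair, set()).add(method)
    return {pair: methods for pair, methods in evidence.items()
            if methods - excluded_only}

# Hypothetical records with placeholder protein names.
records = [
    ("P1", "P2", "SPR"),
    ("P2", "P1", "Y2H"),
    ("P1", "P3", "Y2H"),
]
kept = filter_interactions(records)
print(sorted(kept))
```

Grouping by unordered pair before filtering means that an interaction reported by Y2H is still retained when an independent method also supports it.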

2.2.4 Network Construction and Analysis

The goal was to create a network containing as many ECM and functionally related proteins as possible, linked by high confidence protein-protein interactions. The network was assembled using experimentally determined, physical protein-protein interactions as defined above between ECM proteins and their nearest neighbours. To restrict the incidence of false positives, we specifically excluded predicted interactions and extracellular proteins that were not identified as functionally related in the manual review. To do this, the initial network was created using interactions only among bona fide ECM proteins. Then, a ‘layer’ of functionally related proteins linked to ECM proteins through high confidence protein-protein interactions was added. Next, another layer of functionally related proteins was added where they were linked to the first layer through PPIs. This pattern continued iteratively until no more functionally related proteins could be added through PPIs. Protein interactions were visualized using Cytoscape218 and network statistics were calculated using the Network Analyzer plug-in219 and the Topnet-like Yale Network Analyzer (tYNA)220. The network was clustered using Markov Clustering (MCL)221 to produce a list of putative functional modules. The clustering algorithm was run over a range of inflation values from 1.2 to 5.0 and it was determined that a value of 2.2 resulted in clusters with the lowest heterogeneity of GO biological functions according to the method of Loganantharaj et al.222. A comparison of the resulting Shannon information indices across the range of MCL inflation values for both real and random networks is provided in Appendix 3. Supporting information for this figure can be found in the supplementary Excel spreadsheet data file labeled “SF2” which is also available on the accompanying CD.
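The layer-by-layer assembly described above amounts to a breadth-first expansion from the core ECM proteins that is restricted to functionally related proteins. The following is a minimal sketch under that reading; the function name and the toy identifiers are hypothetical, and the actual network was assembled from the curated interaction set described in section 2.2.3.

```python
def grow_network(core, related, interactions):
    """Add functionally related proteins to a core set, one layer at a time.

    core: set of bona fide ECM proteins.
    related: set of proteins judged functionally related in manual review.
    interactions: iterable of (a, b) high confidence PPIs.
    Returns (all included proteins, list of layers in order of addition).
    """
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    included = set(core)
    frontier = set(core)
    layers = []
    while frontier:
        next_layer = {
            q
            for p in frontier
            for q in neighbours.get(p, ())
            if q in related and q not in included
        }
        if not next_layer:
            break  # no more related proteins reachable through PPIs
        included |= next_layer
        layers.append(next_layer)
        frontier = next_layer
    return included, layers

# Hypothetical example: R3 is only reachable through X1, which is not related.
included, layers = grow_network(
    core={"E1", "E2"},
    related={"R1", "R2", "R3"},
    interactions=[("E1", "E2"), ("E1", "R1"), ("R1", "R2"),
                  ("E2", "X1"), ("X1", "R3")],
)
print(sorted(included), layers)
```

Because expansion proceeds only through proteins already accepted into the network, a functionally related protein that is linked solely via an excluded protein is never added.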

2.2.5 Sources for Meta-data

2.2.5.1 Expression

A cDNA microarray survey by Su et al.223 which explores gene expression in 84 samples corresponding to 79 human tissues was obtained from the BioGPS website224, the successor to GNF's SymAtlas website. The corresponding human U133A chip annotation file was downloaded from the Affymetrix website (August 2010 release). An additional cDNA microarray survey by Shyamsundar et al.147 which explores gene expression in 35 normal human tissues was obtained from the Gene Expression Omnibus (GEO)225. Microarray data were filtered using Perl scripts to keep only ECM gene annotations. Of the 357 core genes, 275 genes had an expression profile (478 profiles including multiple probes) in the Su et al. dataset. Gene expression data based on expressed sequence tags (ESTs), profiles which reflect approximate expression patterns in tissues, were obtained from UniGene226. Perl scripts were written to create expression profiles of ECM genes from UniGene data files (March 2011 release). For the UniGene dataset, 348 core genes (360 profiles including multiple probes) of the total list of 357 had an expression profile in 45 different normal tissues. All expression datasets were organised by two-dimensional hierarchical clustering (Cluster 3.0 using Spearman Rank correlation, complete linkage) and visualized as heatmaps in Java TreeView227. For Su et al., Pearson correlation coefficients were calculated for all pairs of network proteins using a Perl script developed in house, organised using hierarchical clustering in MATLAB228, visualized as a heatmap and exported to Microsoft Excel for further analysis.
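The pairwise co-expression calculation can be illustrated with a short sketch. This is not the in-house Perl script; it is a plain Python rendering of the Pearson correlation coefficient applied to every pair of expression profiles, with hypothetical gene names and values.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant profile
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(profiles):
    """Correlate every pair of genes; profiles maps gene -> list of values."""
    return {
        (g1, g2): pearson(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }

# Hypothetical expression values across four tissues.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(profiles)
print(correlations)
```

The resulting dictionary is the upper triangle of the correlation matrix, which is the input that would then be clustered and displayed as a heatmap.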

2.2.5.2 Conservation

Sequence datasets for 117 published eukaryotic genomes were derived from a variety of sources (see Appendix 1 and Appendix 2). Phylogenetic profiles were generated using Inparanoid229 as previously described230. Hierarchical clustering was done using Cluster 3.0 software. City block was used for the similarity metric with complete linkage as the clustering method. Supporting information is contained in the supplementary Excel spreadsheet data file labeled “SF3” which is also available on the accompanying CD.
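For reference, the city block (Manhattan) distance between two phylogenetic profiles is the sum of the absolute differences at each genome. The sketch below uses hypothetical presence/absence profiles over four genomes; it illustrates the metric only and is not the Cluster 3.0 implementation.

```python
def city_block(profile_a, profile_b):
    """Sum of absolute differences between two equal-length profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(abs(a - b) for a, b in zip(profile_a, profile_b))

def distance_matrix(profiles):
    """All-against-all city block distances; profiles maps name -> vector."""
    names = sorted(profiles)
    return {(p, q): city_block(profiles[p], profiles[q])
            for p in names for q in names}

# Hypothetical profiles: 1 = orthologue detected in that genome, 0 = absent.
phylo_profiles = {
    "prot_x": [1, 1, 0, 0],
    "prot_y": [1, 0, 0, 1],
    "prot_z": [1, 1, 0, 0],
}
distances = distance_matrix(phylo_profiles)
print(distances[("prot_x", "prot_y")])
```

Proteins with identical conservation patterns have a distance of zero and are therefore joined first during hierarchical clustering.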

2.2.5.3 Functional annotation

Enrichment for Pfam231 domains and GO terms (network vs. whole proteome) was determined by hypergeometric test with false discovery rate (FDR) correction as implemented in ConceptGEN232. To assess enrichment within functional modules, annotation files were downloaded directly from the Gene Ontology. UniProt keywords were downloaded from the UniProtKB database at the EBI using BioMart208. To assess module enrichment, we used bootstrap resampling to compare the frequency of terms within clusters with their occurrence in 10,000 clusters of the same size drawn from the network at random. Assignment of putative function to clusters was based on the occurrence within modules of high frequency UniProtKB biological process keywords. These frequencies were calculated and visualized using the WordCloud plug-in for Cytoscape233. Where coverage of keywords was insufficient to determine function, the functional descriptions were supplemented (where possible) using clarifying information obtained during the classification of extracellular proteins by manual review.
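The resampling comparison for module enrichment can be sketched as an empirical p-value calculation. The function below is an illustration only: its name, the choice to draw random clusters without replacement and the add-one correction are assumptions, since the exact resampling scheme is not detailed here.

```python
import random

def empirical_enrichment(cluster, annotations, term, network_nodes,
                         n_resamples=10000, seed=0):
    """Compare a term's count in a cluster with counts in random clusters.

    annotations maps each node to the set of terms assigned to it.
    Returns (observed count, empirical p-value).
    """
    rng = random.Random(seed)

    def count(nodes):
        return sum(1 for node in nodes if term in annotations.get(node, ()))

    observed = count(cluster)
    size = len(cluster)
    nodes = list(network_nodes)
    hits = 0
    for _ in range(n_resamples):
        # Assumption: random clusters are drawn without replacement.
        if count(rng.sample(nodes, size)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_resamples + 1)

# Hypothetical network of 20 nodes in which four carry the term "t".
nodes = [f"n{i}" for i in range(20)]
node_terms = {f"n{i}": {"t"} for i in range(4)}
observed, p_value = empirical_enrichment(nodes[:4], node_terms, "t", nodes,
                                         n_resamples=2000)
print(observed, p_value)
```

Fixing the cluster size in each resample controls for the fact that larger modules are more likely to contain any given term by chance.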

2.2.5.4 Disease terms

Medical Subject Headings (MeSH terms) were derived from Genopedia234. Additional disease terms were obtained from the morbid map at Online Mendelian Inheritance in Man (OMIM)180. Enrichment was determined by hypergeometric test with FDR correction as implemented in ConceptGEN232. Module enrichment was assessed as described above for functional annotations, comparing the occurrence of terms within clusters with their occurrence in 10,000 random clusters (bootstrap resampling).
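For readers unfamiliar with the test, the hypergeometric enrichment p-value and a false discovery rate adjustment can be written out as below. This is a generic sketch and not the ConceptGEN implementation; in particular, the use of the Benjamini-Hochberg procedure for the FDR step is an assumption, and the input values are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n items are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four terms tested in one module.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print(hypergeom_pvalue(2, 2, 2, 4), adjusted)
```

Here k is the number of annotated proteins in the set of interest, n the size of that set, K the number of annotated proteins in the background and N the background size.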

2.2.6 Quality Assessment

Protein subcellular location predictions were obtained from LOCATE235 for five prediction methods: Proteome Analyst236, CELLO237, MultiLoc238, pTarget239, and WoLFPSORT240. The raw XML data file was parsed using a Perl script developed in house. Predictions for putative extracellular proteins (obtained on the basis of their GO annotations) were grouped by corresponding Ensembl gene identifier and summed for each method for eight location categories: Cytoplasm, Nucleus, Mitochondrion, Golgi, Plasma Membrane, Endoplasmic Reticulum, Extracellular and Other. Observations were clustered in two dimensions (Cluster 3.0, Spearman Rank Correlation, Average Linkage) and visualized using Java TreeView227 to assess the overall consistency of the predictions. Signal peptide (SignalP)241 and Transmembrane (TMHMM)242 predictions were obtained from Ensembl using BioMart. A Perl script developed in house was used to calculate the percentage of proteins containing SignalP and TMHMM predictions for several groups of proteins: extracellular, ECM, network ECM (these are ECM proteins included in the network), network neighbours and whole proteome. Since the original classification of these proteins was based on GO annotations whose evidence might include SignalP and TMHMM predictions, the analysis was repeated removing any annotations which themselves rely on these. The former are referred to as the dependent set and the latter as the independent set (see Appendix 4).

2.2.7 Surface Plasmon Resonance

SPR is a biosensor technique allowing direct monitoring of interactions and consumes only small amounts of sample, typically less than 1 µg/injection. It has been used to characterize a wide variety of interactions, including antibody-antigen, ligand-receptor, and the interactions between proteins and oligonucleotides, or proteins and carbohydrates. SPR imaging (SPRi) arrays were performed using a Biacore Flexchip system (GE Healthcare), an array platform capable of analyzing one analyte against 400 target spots at a time. Recombinantly expressed elastin and elastin-like peptides (detailed in Appendix 5) were injected as analytes over protein and glycosaminoglycan arrays, each ligand being spotted in triplicate on the arrays. The arrays included the following recombinant proteins: Endostatin, vWF1, α1 chain of collagen VI, NC1(XVIII) and NC1(XVIII) lacking the heparin . Other ligands were from commercial sources. Proteins or glycosaminoglycans were printed directly onto the gold surface of a Gold Affinity chip (GE Healthcare) using a non-contact PiezoArray spotter (PerkinElmer Life Sciences). The spotted matrix (15 x 12) comprised 174 spots. Proteins were spotted at concentrations varying from 30 to 1000 μg/ml and glycosaminoglycans at 1 and 2 mg/ml. Six drops of 330 pl each were delivered to the surface of the chip (total spotted volume, 2.2 nl; spot diameter, 250–300 μm; spotted amount, 0.0066 – 2.2 ng/spot for proteins, 2.2 - 4.4 ng/spot for glycosaminoglycans). The chips were then dried at room temperature and stored under vacuum at 4 °C until their insertion into the Biacore Flexchip. The regions of interest of the chip were defined when the chip was dry. Each region of interest had four associated reference spots that were used to correct bulk refractive index changes as well as nonspecific binding of the analyte to the chip. The chip was blocked with a buffer containing mammalian proteins (Superblock, Pierce) five times for 5 min each. The blocked chip was then equilibrated with phosphate-buffered saline, 0.05% Tween 20 at 500 μl/min for 90 min. The analyte was flowed over the chip surface at 25 °C at a concentration of 500 nM for 25 min at the same flow rate. The dissociation was monitored during injection of phosphate-buffered saline, 0.05% Tween for 40 min. Injected proteins were diluted in phosphate-buffered saline, 0.05% Tween. Data collected from reference spots (gold surface) and buffer spots were subtracted from those collected on spotted proteins or glycosaminoglycans to obtain specific binding curves.
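The reference subtraction used to obtain specific binding curves can be illustrated as follows. This is a simplified sketch, not the instrument software: it assumes each spot yields a signal trace sampled at the same time points and that the associated reference traces are averaged before subtraction. Buffer spots could be handled in the same way. All names and values are hypothetical.

```python
def specific_binding(ligand_signal, reference_signals):
    """Subtract the mean reference trace from a ligand spot trace.

    ligand_signal: list of response values over time for one spotted ligand.
    reference_signals: list of traces from the associated reference spots.
    """
    n_refs = len(reference_signals)
    corrected = []
    for t, value in enumerate(ligand_signal):
        reference_mean = sum(trace[t] for trace in reference_signals) / n_refs
        corrected.append(value - reference_mean)
    return corrected

# Hypothetical traces: one ligand spot and its four reference spots.
ligand = [10.0, 25.0, 40.0]
references = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [3.0, 4.0, 5.0]]
curve = specific_binding(ligand, references)
print(curve)
```

Averaging several reference spots reduces the influence of any single noisy spot on the corrected curve.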

2.3 Results

2.3.1 Systematic classification of extracellular proteins reveals the core ECM network consists of 357 proteins

Recently there has been much interest in the construction and analysis of networks defining functionally distinct biochemical systems196-199,202,230. The initial task of systematically and comprehensively defining the components of a system is not trivial. The Gene Ontology (GO) resource was developed with this task in mind151. However, for the ECM, annotation coverage is not comprehensive and may further suffer from a lack of standardized annotation assignments. For example, at the beginning of this study, an AmiGO search for proteins annotated to the cell component term GO:0031012 ‘extracellular matrix’ and its associated child terms captured less than half (151/324) of the ECM proteins identified here. Those missed include obvious true positives such as decorin and fibronectin. Subsequent GO updates have substantially improved coverage, while improvements to query tools (i.e. AmiGO) have circumvented the pervasive problem of recovering proteins with gaps in their annotations (for example, TIMP2 and NTN4 are annotated as GO:0005604 ‘basement membrane’ but not explicitly annotated to the parent term GO:0031012 ‘extracellular matrix’). However, the gaps in those annotations remain and many ECM and related proteins (defined here as sharing localization and biological function) remain poorly annotated. For example, LOXL3 and LOXL2 are not identified as ECM proteins despite correct assignment of LOXL1; NCAN is not returned despite the fact that many other chondroitin sulfate proteoglycans are included. The inclusion of additional mappings to external annotations (e.g. UniProt Subcellular Location vocabulary to GO terms) is a further recent improvement243. However, the initial lack of standard annotations associated with a system of interest suggests a need for new methods that efficiently exploit the wealth of annotation data provided by GO. It is worth noting that Chautard et al.158,203, using UniProt as the starting point, also required supplementation by literature curation.

Here the construction of an ECM interactome is used as an example of a general strategy for defining and constructing a biological system of interest that exploits multiple sources of functional descriptions. The strategy begins by defining a ‘gold standard’ set of proteins. An initial seed set is constructed by selecting one or more well characterized proteins associated with the system (here elastin (ELN) was used). Next, protein-interaction data are used to identify neighbours and next-nearest neighbours of the seeds, which together form a list of candidate proteins. Literature searches are then applied to prune this list for bona fide members of the system on the basis of supporting experimental data. From this defined gold standard, a list of enriched GO terms is derived, together with all parent and child terms. Uninformative terms are removed using a series of reasonable pre-defined criteria (see Section 2.2.1). These GO terms are then used to identify additional members of the system, both from the organism of interest as well as any closely related organisms. This latter step is used to translate orthologue annotations that may be missing from the target organism. Enriched GO and UniProt keywords associated with the gold standard are then used to search functional descriptions from secondary sources (e.g. GeneCards179, BioHarvester178 and iHOP204) to identify and remove inappropriate proteins (see Figure 2-1 for an example workflow). Note this step also has the benefit of classifying proteins into useful subcategories (e.g. membrane versus soluble). Finally, additional annotation resources such as PANTHER201 can be applied to provide independent validation (here MatrixDB158,203, an expert ECM knowledgebase, was used). From the defined set of proteins, a protein interaction network may be constructed as a framework to organise and interpret additional metadata. 
In the next section, this process is illustrated by presenting a formalized scheme for identifying subsets of proteins related to a specific functional category using a keyword search based on 49 GO and UniProt terms enriched in a gold standard set of ECM proteins (Figure 2-1).
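The candidate-selection step (seeds, their neighbours and their next-nearest neighbours) amounts to a breadth-first search limited to two steps. The following is a minimal sketch under stated assumptions: interactions are supplied as undirected pairs of identifiers, and the function name and the placeholder identifiers in the test data are hypothetical rather than taken from BioGRID.

```python
from collections import deque


def candidates_within(interactions, seeds, max_steps=2):
    """Return {protein: distance} for proteins within max_steps of any seed."""
    adjacency = {}
    for a, b in interactions:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not walk beyond next-nearest neighbours
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance
```

With `max_steps=2` the result contains the seed (distance 0), direct neighbours (distance 1) and neighbours of neighbours (distance 2); literature-based pruning of this candidate list remains a manual step.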

To derive a parts list of the human ECM and their interactors, an initial literature search was conducted to identify a set of gold standard ECM proteins. The definition of a gold standard allows the benchmarking of GO annotations to identify ECM proteins. With local expert knowledge and due to its central role in the ECM of many tissues, elastin and the proteins with which it interacts (as defined by the BioGRID resource157) were the initial focus. Both direct and indirect neighbours (i.e. neighbours of neighbours) were included. From an initial list of 55 proteins, only 34 were supported by literature evidence and therefore define our gold standard (Table 2-1). Of these 34, only 20 could be defined as ECM proteins based on their GO annotations. In the majority of cases it was found that false negatives would have been captured using more generic parent terms than ‘extracellular matrix’ GO:0031012, such as ‘extracellular region’ or ‘extracellular region part’ (GO:0005576, GO:0044421), albeit at the expense of including a considerable number of obvious false positives (e.g. members of extracellular


Figure 2-1: A workflow for the functional assignment of extracellular proteins. Putative extracellular proteins were assigned a functional profile based on manual review of descriptive text, subcellular location predictions, and associated small-scale evidence as summarized in secondary sources using the precedence rules indicated. Proteins were grouped by Ensembl gene identifier and scored true or false for the following pre-defined, non-mutually exclusive categories: (a) extracellular, (b) ECM, (c) membrane associated, (d) structural, (e) functionally related (i.e., functional annotation overlaps those of ECM proteins) based on the appearance of one or more of the indicated keywords. A contradiction is defined as a text entry which associates incompatible functions with a protein (e.g., nuclear protein with a signal peptide prediction). Support is defined as a text entry which associates the protein with a key term in the keyword list. Keywords were selected from GO terms enriched in our ‘gold standard’ ECM proteins and their nearest network neighbours using a hypergeometric test with Bonferroni correction as implemented in the Cytoscape plug-in, BinGO. Network neighbours were based on experimentally determined protein-protein interactions deposited in BioGRID.

Table 2-1: True positive and false negative ECM proteins resulting from a search using GO terms containing the words ‘extracellular matrix’

Shown here are 34 ECM proteins comprising a ‘gold standard’ (highlighted in yellow) based on elastin and its nearest and next nearest neighbours in BioGRID along with the final categorization of these proteins according to the classification scheme outlined in Figure 2-1. Total found refers to the number of gold standard proteins in the initial list, identified by the GO terms: ‘extracellular matrix’, ‘extracellular matrix part’ or ‘proteinaceous extracellular matrix’.

Elastin neighbour network           Final functional profile
Gene      Found in initial ECM      Extracellular  ECM    Membrane  Functionally
          protein list                                              related
FBN1                                True           True   False     True
COL14A1   No                        True           True   False     True
FBLN1                               True           True   False     True
MFAP2     No                        True           True   False     True
COL4A6                              True           True   False     True
NID       No                        True           True   False     True
LYZ       No                        True           False  False     False
HLN2      No                        N/A            N/A    N/A       N/A
TGFB1                               True           False  False     True
APOB      No                        True           False  False     False
FGB       No                        True           False  False     True
CSPG2                               True           True   False     True
MYOC      No                        True           False  False     True
COL1A2                              True           True   False     True
DCN       No                        True           True   False     True
NID2                                True           True   False     True
MFAP5     No                        True           True   False     True
ASS       No                        N/A            N/A    N/A       N/A
LAMA5                               True           True   False     True
LAMA1                               True           True   False     True
DPT       No                        True           True   False     True
TNF       No                        True           False  False     False
COL13A1   No                        True           True   True      True
BCAN      No                        True           True   False     True
DHDDS     No                        N/A            N/A    N/A       N/A
HSPG2     No                        True           True   False     True
COL4A5                              True           True   False     True
ELN                                 True           True   False     True
LOX       No                        True           False  False     True
SPINK1    No                        True           False  False     False
COL4A4                              True           True   False     True
FCN1      No                        True           False  False     True
NOV       No                        True           False  False     False
BGN       No                        True           True   False     True
COL4A2                              True           True   False     True
COL2A1                              True           True   False     True
APP       No                        True           False  True      False
FBLN2                               True           True   False     True
COL4A3                              True           True   False     True
SGCA                                True           False  True      True
FN1       No                        True           True   False     True
COL18A1   No                        True           True   False     True
LAMC1                               True           True   False     True
JPH3      No                        N/A            N/A    N/A       N/A
AGC1      No                        True           True   False     True
CALR      No                        False          False  False     False
FKBP10    No                        N/A            N/A    N/A       N/A
PRTN3     No                        True           False  False     True
COL1A1                              True           True   False     True
COL4A1                              True           True   False     True
FBN2                                True           True   False     True
PRELP                               True           True   False     True
ELA2      No                        True           True   False     True
LGALS3    No                        True           False  False     True
MATN2                               True           False  False     False

Total gold standard proteins: 34

Total found: 20

organelles such as the prominosome (PROM2) or extracellular vesicular exosome (SLC2A4, TFRC)).

The initial goal was to identify a subset of the human proteome containing as many ECM associated proteins as possible, including secreted, soluble, and membrane proteins as well as proteins such as growth factors. Therefore the system is not limited to only components of the ECM (defined by GO as “a structure lying external to one or more cells, which provides structural support for cells or tissues”) but also includes proteins defined here as functionally related (i.e. extracellular and membrane-associated proteins sharing biological process annotations and related keywords with ECM proteins). It is clear from the definition above that membrane proteins do not represent components of the ECM. Nevertheless they include many proteins which play critical roles in the organisation and operation of the ECM (e.g. integrins, syndecans). Since the aim of this study is to understand these relationships, it was important to include such functionally related proteins in the analysis. At the gene level these necessarily include any genes with at least one extracellular product related to the ECM (i.e. it is possible to include genes associated with a cytoplasmic product, but only if it also has a product associated with the ECM).

The search criteria were therefore expanded to include more generic terms as follows: first, an initial list of 27 seed terms was defined spanning the three GO categories (cell component, biological process and molecular function) which were enriched in our gold standard ECM proteins (hypergeometric test with FDR correction, p < 0.05 – Table 2-2); next, these GO terms were expanded to include their parents and all child terms (4575 terms in total). This list was reduced using a series of simple ad hoc rules (see section 2.2 Methods), resulting in a more focused list of 103 terms similar to a custom GOSlim (for a complete list of these terms see Appendix 6). It is worth noting that this list was defined prior to the availability of automated tools such as GOSlimmer244, which may further accelerate the process of defining filters for additional sets of functionally related proteins. Using this list of 103 terms, 1165 human genes were retrieved from the GO resource with secreted, extracellular or membrane-associated products that putatively encode ECM proteins and potential interacting partners.
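The expansion of seed terms to their parents and child terms can be sketched as a traversal of the ontology graph in both directions. This is an illustrative sketch only: it assumes the ontology is available as a mapping from each term to its direct parents, it interprets "parents" as all ancestors (the text does not say whether only direct parents were taken), and the function name and toy identifiers are hypothetical.

```python
def expand_terms(seed_terms, parents):
    """Return the seeds plus all their ancestors and descendants.

    parents maps each term to the set of its direct parent terms.
    """
    # Invert the parent map to obtain direct children of every term.
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)

    def reachable(start, edges):
        seen, stack = set(), [start]
        while stack:
            term = stack.pop()
            for nxt in edges.get(term, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    expanded = set(seed_terms)
    for term in seed_terms:
        expanded |= reachable(term, parents)   # all ancestors
        expanded |= reachable(term, children)  # all descendants
    return expanded
```

Because ancestors of a specific term quickly become very general, an expansion of this kind inflates the term list considerably, which is consistent with the need for a subsequent filtering step.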

Since different groups are responsible for providing GO annotations for each model organism, this raised the possibility that additional human proteins might be identified through orthologues annotated as extracellular proteins in other mammals (rat and mouse). Considerable variation in the evidence used to annotate extracellular proteins was observed across the three species, suggesting this approach benefits from leveraging a wider variety of evidence. In particular, human annotations had a higher frequency of non-traceable author statements (NAS) (Figure 2-2(a)). Additional supporting information is available in the supplementary Excel spreadsheet data file labeled “SF4” which is also on the accompanying CD.

Using this approach an additional 1001 genes were added through orthologous relationships to the 1165 human genes identified above, resulting in a total of 2166 human genes encoding potential components and interactors of the ECM (compared to 1932 from rat and 2055 for mouse – Figure 2-2(b)). An additional 22 potential interactors were added by including a set of

Table 2-2: Gene ontology ‘seed’ terms enriched in gold standard ECM proteins

These terms were enriched in the gold standard set of ECM proteins. CC = Cell Component; BP = Biological Process; MF = Molecular Function.

Identifier  Description                                               CC  BP  MF
GO:0031012  Extracellular matrix                                      X
GO:0044420  Extracellular matrix part                                 X
GO:0048196  Middle lamella-containing extracellular                   X
GO:0005578  Proteinaceous extracellular matrix                        X
GO:0005576  Extracellular region                                      X
GO:0044421  Extracellular region part                                 X
GO:0005615  Extracellular space                                       X
GO:0043655  Extracellular space of host                               X
GO:0005604  Basement membrane                                         X
GO:0022617  Extracellular matrix disassembly                              X
GO:0030198  Extracellular matrix organization and biogenesis              X
GO:0021939  Extracellular matrix-granule involved in regulation           X
            of granule cell precursor proliferation
GO:0021820  Organization of extracellular matrix in the marginal          X
            zone involved in cerebral cortex glial-mediated
            radial cell migration
GO:0032836  Glomerular basement membrane development                      X
GO:0045226  Extracellular polysaccharide biosynthetic process             X
GO:0046379  Extracellular polysaccharide metabolic process                X
GO:0043062  Extracellular structure organization and biogenesis           X
GO:0006858  Extracellular transport                                       X
GO:0008624  Induction of apoptosis by extracellular signals               X
GO:0050839  Cell adhesion molecule binding                                    X
GO:0050840  Extracellular matrix binding                                      X
GO:0030023  Extracellular matrix constituent conferring elasticity            X
GO:0030197  Extracellular matrix constituent, lubricant activity              X
GO:0005201  Extracellular matrix structural constituent                       X
GO:0030021  Extracellular matrix structural constituent conferring            X
            compression resistance
GO:0030020  Extracellular matrix structural constituent conferring            X
            tensile strength
GO:0030022  Adhesive extracellular matrix constituent                         X

recent transcriptome-based ECM protein predictions34 and by later addition of false negatives identified through their physical interactions with matrix proteins and subsequent curation. The final set of 2188 human genes was classified into five non-mutually exclusive categories through manual review of protein descriptions in secondary sources (GeneCards179, UniProtKB149, PubMed152, Bioinformatic harvester245 and iHOP177) for the appearance of keywords derived from 49 GO and UniProt terms statistically overrepresented in our ‘gold standard’ ECM proteins (p < 0.05, hypergeometric function with Bonferroni correction as implemented by the Cytoscape plug-in, BinGO246). Proteins were scored true or false in each category according to the hierarchical workflow presented in Figure 2-1.
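For readers unfamiliar with the enrichment statistic, the following sketch shows a one-sided hypergeometric test followed by a Bonferroni correction using only the Python standard library. It is not the BinGO implementation; the function names and argument order are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

A term would be retained as a keyword source when its corrected p-value falls below the chosen threshold (0.05 in the text above).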

In brief, if a protein is annotated with one of the defined keywords in the source, it is assigned the term functionally related. Note that some keywords do not define strict functional relationships but may capture, for example, structural relationships. A protein may receive additional category annotations with the presence of specific keywords. Keywords associated with the ECM subcategory identified a protein as extracellular matrix. Similarly, keywords associated with the 'Structural' subcategory identified a protein as structural. Presence of membrane features in the protein annotation (e.g. presence of the words 'membrane protein') identified a protein as membrane associated. This category included transmembrane proteins (e.g. collagen XVII). However, if the protein was annotated as being on the internal membrane, it was discarded as a false positive. Finally, a protein would be categorized as extracellular if the annotation indicated that it is extracellular (e.g. by subcellular location information and/or the presence of a signal


Figure 2-2: Characterization of available annotation and interaction data sets A: Distribution of gene ontology annotations for genes with extracellular products in human, mouse, and rat by evidence code category. Annotation counts (*) are normalized relative to human. B: Overlap of genes encoding extracellular proteins among human, mouse and rat as determined by precomputed Ensembl orthologues. C: Number of genes with products falling into various categories as assigned by manual review. Genes selected to seed the network were made up of an intersecting subset of membrane, structural and soluble proteins as indicated in italics. If the gene encoded multiple products with different annotations, the gene was counted in only one category according to the following prioritization: membrane, insoluble, soluble. The other categories were not mutually exclusive; that is, a protein can be both membrane and functionally related. D: Overlap of interactions highlighting unique interactions among several publicly available datasets used in our study. E: Node degree distribution in our ECM network plotted on a log-log scale obeys a power law (r = 0.957).

peptide). As an example, integrins were classified as extracellular (false), ECM (false), membrane-associated (true), structural (false), functionally related (true). Where necessary primary literature sources were used to clarify classifications and resolve conflicting evidence. These labels seemed to form natural categories reflecting relevant subcellular location and function. From a practical point of view they allowed the tracking of false positives and therefore an evaluation of the effectiveness of the upstream filter as well as to distinguish between plausible (functionally related = true) and implausible (functionally related = false) interactors of matrix proteins thus acting as a high confidence threshold for potential network neighbours. Non-exclusive categories were chosen because, for example, all ECM proteins are extracellular but not all extracellular proteins are deposited into the ECM. The network centers on ECM-related (functionally related = true) proteins of three types: structural (i.e. ECM), soluble, and membrane-associated proteins and includes matricellular proteins which function as both soluble and insoluble proteins247.
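The keyword-based scoring into non-mutually exclusive categories can be sketched as follows. The actual classification relied on manual review and on keywords derived from 49 enriched GO and UniProt terms, which are not reproduced here; the keyword lists, the function names and the rule that any ECM hit also implies extracellular are hypothetical simplifications for illustration.

```python
# Hypothetical keyword lists for illustration only; the lists used in the
# study were derived from enriched GO and UniProt terms.
FUNCTIONAL = ("cell adhesion", "extracellular matrix", "basement membrane",
              "structural constituent")
ECM = ("extracellular matrix", "basement membrane")
STRUCTURAL = ("structural constituent",)
MEMBRANE = ("membrane protein", "transmembrane")
EXTRACELLULAR = ("secreted", "signal peptide")
INTERNAL_MEMBRANE = ("internal membrane",)


def has_any(text, keywords):
    return any(keyword in text for keyword in keywords)


def classify(description):
    """Return a true/false profile, or None if the entry is discarded."""
    text = description.lower()
    if has_any(text, INTERNAL_MEMBRANE):
        return None  # internal-membrane proteins are treated as false positives
    ecm = has_any(text, ECM)
    return {
        # every ECM protein is extracellular, but not the reverse
        "extracellular": ecm or has_any(text, EXTRACELLULAR),
        "ecm": ecm,
        "membrane_associated": has_any(text, MEMBRANE),
        "structural": has_any(text, STRUCTURAL),
        "functionally_related": has_any(text, FUNCTIONAL),
    }
```

A description such as "Transmembrane receptor mediating cell adhesion" yields the same profile as the integrin example above: membrane-associated and functionally related are true and the remaining categories are false. Simple substring matching of this kind cannot resolve contradictions in the source text, which is why manual review and primary literature were needed.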

Through these analyses, 168 of the 2188 initial gene symbols were either associated exclusively with non-extracellular products and may therefore be considered as false positives, or were not annotated with sufficient information to confirm their classification. The remaining 2020 genes were classified as encoding membrane-associated proteins (347 genes) and bona fide extracellular proteins (1673 genes), the latter comprising soluble proteins (1472 genes) and insoluble, non-membrane proteins assigned to the structural subcategory (201 genes). Of these 2020 genes, 357 define ECM and membrane-associated proteins such as integrins and transmembrane collagens which were defined as the core of the ECM network. In addition, 524 non-ECM genes were classified as having potentially functionally related products including 103 membrane-associated proteins (Figure 2-2(c)). A full summary of the curation results with notes on various collected metadata is contained within the supplementary Excel spreadsheet data file “SF5” which is also on the accompanying CD. At the time of this work, of 325 ECM proteins annotated in MatrixDB (http://matrixdb.ibcp.fr/cgi-bin/download) 324 were confirmed here. These include 11 proteins not originally described in MatrixDB, but which now appear after communications with the database administrators (11/15/2010). One false positive, CD4, a T-cell surface glycoprotein involved in the formation of membrane lipid rafts, was also identified (and re-assessed by MatrixDB curators). The extensive overlap between these sets despite the differences in approach suggests that these data represent a robust list of ECM proteins. The 1673 genes with extracellular products defined through our methodology approach predictions that such genes account for approximately 10% of the mammalian genome (2025 according to LOCATE235).

2.3.2 The annotated list of extracellular proteins is consistent with SignalP and subcellular location predictions

Due to the presence of readily detectable signal peptides, secreted proteins (including extracellular proteins) are among those whose subcellular location predictions are known to be the most accurate248. Therefore, as a quality check, the list of ECM and related proteins was matched against subcellular location predictions from five methods captured by the LOCATE database235. Hierarchical clustering of location predictions revealed a high degree of consistency between prediction methods and this list (Figure 2-3). Interestingly, many of the 168 genes which did not make the final list of 2020 ECM and related genes lacked consistency in subcellular location predictions. However, the level of consistency alone is not necessarily diagnostic for false positives, as a certain level of noise is expected in the presence of alternative transcripts and multi-functional proteins, which may legitimately occur in two or more subcellular locations. For example, SMC3 encodes a nuclear protein involved in organisation, which when post-translationally modified gives rise to the basement membrane proteoglycan, bamacan249,250. On the other hand, many full-length collagens are predicted (likely erroneously) to be located in the nucleus as well as extracellular. These include the following chains: COL2A1, COL3A1, COL4A2, COL4A3, COL4A4, COL4A5, COL8A2, COL9A1, COL9A2, COL9A3, COL11A2, COL13A1, and COL19A1. We note that some of these collagens are found in the nucleus pulposus of the intervertebral disc, which could be a source of errors for electronic annotations, as could suspected artifacts in cell preparations251. Additional


Figure 2-3: Subcellular location predictions for ECM proteins Hierarchical clustering (2D) of sub-cellular predictions for five prediction methods (CE-Cello; PA-Proteome Analyst; ML-MultiLoc; PT-Ptarget; WO-WoLF PSort). Intensity of red signal denotes presence of independent predictions on multiple transcripts.

comparisons to SignalP predictions found that ~90% of the identified putative extracellular proteins were predicted to contain a secretory sequence compared to 20% for the entire proteome whereas TMHMM predictions revealed no apparent enrichment for transmembrane-containing proteins (see Appendix 4). Supporting information for Appendix 4 is contained in the supplemental excel spreadsheet data file “SF6” which is also on the accompanying CD.

2.3.3 Experimentally derived protein-protein interactions connect 181 ECM core genes and 192 functionally-related neighbours into a scale-free network enriched for relevant functional terms

To explore the organisation of the proteins corresponding to the 357 genes defined as the core nodes, a network was constructed using experimentally derived physical protein-protein interaction data obtained through the Unified Human Interactome resource (UniHI) and updated with interactions found in BioGRID, HPRD, IntAct and MatrixDB as of December 2010. These interactions are detailed in supplemental excel spreadsheet data file “SF1” also available on the accompanying CD. Interaction data from these datasets showed little overlap except between BioGRID and HPRD (Figure 2-2(d)), which is expected due to BioGRID having imported data from HPRD. Lack of overlap is consistent with previous observations reporting under-sampling of the interactome by dissimilar methods and further exacerbated by several databases including IntAct, BioGRID, and MatrixDB sharing curation efforts to avoid unnecessary redundancy252,253.

To obtain the highest possible coverage of experimentally determined interactions, the network was therefore based on a union of the data169. However, in order to achieve a high level of quality, we applied selective criteria to exclude certain interactions based on, for example, the type of evidence and whether the resulting neighbours shared relevant biological function annotations with known ECM-associated genes (see section 2.2 Methods). Data from a recent affinity purification study217 were excluded on the grounds that these high throughput results have not been independently confirmed and consist largely of interactions of recombinant fragments, which on the one hand would not add very many interactions to our dataset and on the other would not be readily interpretable in the context of a gene-centric network. An initial network was constructed by selecting all interactions associated with the products of 209 core genes for which interaction data exist. Nodes were removed if they were not contained either in our list of core genes or the defined list of 524 functionally related, non-ECM genes. The network was then iteratively expanded by including additional interactors, provided they belonged to the set of 881 genes (core + functionally related genes) defined in the curation, until no more neighbours could be identified. The final network consists of 1120 protein-protein interactions between 373 nodes representing 181 core genes + 192 functionally related neighbours. The network is scale-free (power law fit r = 0.957, Figure 2-2(e)) with an average node degree of 6.01, shortest path length of 3.86 and a diameter of 12. The network connects approximately 50% (181/357) of the identified core genes based on the interactions of their products.
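The construction procedure described above (seed with interactions of core genes, keep only curated genes, then expand until no new neighbours appear) can be sketched as follows. This is a simplified illustration under stated assumptions: interactions are undirected gene pairs, self-interactions are dropped for simplicity, evidence-based filtering is omitted, and the function names are hypothetical.

```python
def build_network(interactions, core, related):
    """Grow a network from core genes, restricted to curated genes."""
    core = set(core)
    allowed = core | set(related)
    # Keep only interactions whose two partners are both curated genes.
    usable = {
        frozenset((a, b))
        for a, b in interactions
        if a != b and a in allowed and b in allowed
    }
    # Initial network: interactions involving at least one core gene.
    edges = {edge for edge in usable if edge & core}
    nodes = set().union(*edges) if edges else set()
    # Iteratively add interactions that touch the current network.
    while True:
        new_edges = {edge for edge in usable if edge & nodes} - edges
        if not new_edges:
            break
        edges |= new_edges
        nodes |= set().union(*new_edges)
    return nodes, edges


def average_degree(n_nodes, n_edges):
    """Mean number of interactions per node in an undirected network."""
    return 2 * n_edges / n_nodes if n_nodes else 0.0
```

Applied to the figures reported above, `average_degree(373, 1120)` gives 2 × 1120 / 373 ≈ 6.01, in agreement with the stated average node degree.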

Given the initial criteria for selecting ECM and functionally related proteins for the network, an associated enrichment of functionally-related annotations is expected (e.g. domains, disease terms etc.). Using the hypergeometric test for gene set enrichment as implemented by ConceptGen232, significant enrichment (p<0.01) was observed for 210 GO terms. A full list of these terms is included as Appendix 7. These included terms related to those used in the keyword search and those known to be enriched in gold standard ECM proteins. A large number of overlapping MeSH terms were also identified (Appendix 8). Moreover, the MeSH terms include 87 highly enriched (p<0.01) disease terms (MeSH category ‘C’). Finally, significant enrichment for 47 Pfam domains was noted, suggestive of ECM-specific domain family expansions (Table 2-3). These domains include those common to multidomain glycoproteins found in basement membranes (i.e. laminin and thrombospondin domains), proteases and their inhibitors (Trypsin, Serpin, Kunitz/Bovine pancreatic trypsin inhibitor, Hemopexin, Kringle domains), plasma proteins, growth factors and signaling (IGFBP, PDGF, TGFβ like, Wnt family) and two domains (EGF-like, Calcium binding EGF) which are the most abundant in ECM proteins (e.g. laminins, fibrillins, fibulins, nidogens, perlecan, tenascins, LTBPs, versican, aggrecan, neurocan, brevican). There are 443 EGF or EGF-like domains in 43 ECM proteins. A full listing of enriched ‘concepts’ appears in supplemental Excel spreadsheet data file “SF7” also on the accompanying CD.

Table 2-3: Pfam domains enriched in the ECM network

Concept Name                                                  Gene List  Overlap  P-Value   Q-Value
                                                              Size
EGF-like domain                                               145        48       2.06E-39  1.59E-36
Collagen triple helix repeat (20 copies)                      74         31       5.88E-29  2.27E-26
Fibroblast growth factor                                      22         16       4.92E-20  1.26E-17
Matrixin                                                      24         16       4.24E-19  8.16E-17
Calcium binding EGF domain                                    76         23       8.06E-18  1.24E-15
Hemopexin                                                     23         15       1.12E-17  1.44E-15
Trypsin                                                       111        25       4.44E-16  4.89E-14
Laminin EGF-like (Domains III and V)                          32         14       2.31E-13  2.23E-11
Thyroglobulin type-1 repeat                                   17         11       7.7E-13   6.59E-11
Insulin-like growth factor binding protein                    13         10       9.05E-13  6.79E-11
Laminin G domain                                              35         14       9.71E-13  6.79E-11
Laminin N-terminal (Domain VI)                                11         9        7.54E-12  4.48E-10
von Willebrand factor type A domain                           40         14       7.56E-12  4.48E-10
Serpin ( inhibitor)                                           36         13       3.36E-11  1.85E-09
Laminin B (Domain IV)                                         8          8        5.44E-11  2.80E-09
Vitamin K-dependent /gamma-carboxyglutamic (GLA) domain       14         9        2.02E-10  9.7E-09
Fibrillar collagen C-terminal domain                          10         8        2.4E-10   1.09E-08
TGF-beta propeptide                                           21         10       5.76E-10  2.46E-08
Transforming growth factor beta like domain                   36         12       6.61E-10  2.68E-08
Kringle domain                                                17         9        1.88E-09  6.92E-08
Platelet-derived growth factor (PDGF)                         8          7        1.89E-09  6.92E-08
von Willebrand factor type C domain                           25         10       4.09E-09  1.43E-07
C-terminal tandem repeated domain in type 4 procollagen       6          6        6.39E-08  2.14E-06
Kunitz/Bovine pancreatic trypsin inhibitor domain             16         7        1.13E-06  3.63E-05
Laminin Domain II                                             5          5        2.1E-06   6.46E-05
Astacin (Peptidase family M12A)                               6          5        2.1E-06   6.46E-05
Laminin G domain                                              6          5        2.1E-06   6.46E-05
Laminin Domain I                                              5          5        2.1E-06   6.46E-05
Extracellular link domain                                     12         6        4.42E-06  1.17E-04
TB domain                                                     7          5        6.16E-06  1.58E-04
Leucine rich repeat N-terminal domain                         45         9        1.38E-05  3.43E-04
Small (intecrine/chemokine), interleukin-8 like               46         9        1.64E-05  0.000396
Sushi domain (SCR repeat)                                     50         9        3.15E-05  7.34E-04
Thrombospondin type 3 repeat                                  5          4        6.59E-05  1.49E-03
Thrombospondin C-terminal region                              5          4        6.59E-05  1.49E-03
Thrombospondin type 1 domain                                  58         9        9.64E-05  2.06E-03
Fibronectin type II domain                                    12         5        0.000122  2.55E-03
Fibrinogen beta and gamma chains, C-terminal globular domain  25         6        0.000309  6.26E-03
Integrin beta tail domain                                     7          4        0.000317  0.006264
Reprolysin (M12B) family zinc metalloprotease                 39         7        0.000379  0.0073
Integrin, beta chain                                          8          4        0.000545  1.02E-02
PAN domain                                                    8          4        0.000545  1.02E-02
CUB domain                                                    44         7        0.000752  1.35E-02
wnt family                                                    19         5        0.000984  0.017225
EMI domain                                                    11         4        0.001763  3.02E-02
Disintegrin                                                   22         5        0.001812  3.03E-02
Reprolysin family propeptide                                  37         6        0.002128  3.49E-02

2.3.4 The collagen subnetwork reveals anomalies in experimentally derived PPIs

Collagens are the most abundant proteins in mammals. They are major components of cartilage, skin, bone and tendon. They represent a diverse family of proteins consisting of 28 (COL29A1=COL6A5) members as reviewed by Kadler et al.73 and Ricard-Blum12. Collagens can be divided into sub-families based on the macromolecular assemblies they form. They include fibre-forming and associated collagens, as well as those that form beaded filaments (collagen VI), networks (collagen IV in basement membranes, collagens VIII and X forming hexagonal networks) and anchoring fibrils (collagen VII). The supramolecular assembly of collagen IV in basement membrane is well defined: two collagen IV molecules interact via their C-terminus and four molecules interact via their N-terminus. Little is known about the supramolecular assembly formed, if any, by collagen XVIII, although it is found in some basement membranes. Other collagens that do not form supramolecular assemblies alone are defined according to their domain organisation and their sequence similarity, e.g. FACITs (fibril-associated collagens with interrupted triple helices) and multiplexins (collagens XV and XVIII). Several collagens are transmembrane proteins. Here, a sub-network was constructed showing the interactions of collagens and their neighbours (Figure 2-4). Collagens share a number of interacting partners involved in cell adhesion and fibre assembly (e.g. BGN, DCN, FN1, MATN2, MFAP2, FBLN1, HSPG2). In addition, some collagens belonging to particular sub-families have additional binding partners in common which are not shared across sub-families. For example, several MMPs (MMP8, MMP13, MMP15, MMP16) appear to be specialized for the fibre-forming collagens whereas the network-forming collagens contact several plasma proteins associated with inflammatory processes and the thrombotic response (e.g. SAA1, SAA2, SAA4, HABP2, SERPINE2). 
Despite the fact that interaction data are available for 27 of the 28 collagens, most data sources do not accurately reflect the known supramolecular organisation of collagens in tissues. This is revealed in a number of network anomalies. 67

Figure 2-4: Collagen subnetwork Collagens are grouped and coloured by recognized sub-types. Their neighbours (uncoloured nodes) are grouped according to their interaction patterns with the former. Interactions are depicted as deposited in the raw PPI data sources without regard to larger assembly patterns (i.e. individual chains are shown separately).

For example, COL1A1 and COL1A2 have distinct interactions (attributed to only one subunit) whereas the native protein exists in tissues as an assembled fibre. Many known interactions between components of collagens IV, V, VI and IX are absent. While some experiments have been performed with isolated α1 and α2 chains of collagen I and some with the native, trimeric proteins containing both chain types, MatrixDB is the only database so far to perform curation with native trimers referenced as complexes with an EBI identifier, recognizing that isolated monomers do not exist within tissues. This curation issue highlights the need to organise and display ECM data in a way that better reflects what happens in the ECM once the proteins are secreted and assembled. Here, MatrixDB complexes are not used to correct underlying issues in the binary data, which is beyond the scope of this study. Rather, the binary data is used to draw attention to an important issue in the investigation of ECM fibre organisation that is not well-addressed at the level of gene-centric networks.

2.3.5 The search for biologically relevant functional modules in the ECM highlights the heterogeneous nature of current annotations

Clustering PPI network graphs allows functionally related proteins to be grouped into modules representing, for example, protein complexes or biochemical pathways254,255. Applying the MCL clustering algorithm221 to the ECM network resulted in the definition of 100 clusters of which the largest 50 contained at least three members. The latter are defined here as putative functional modules (Figure 2-5). Additional information supporting the choice of MCL inflation value for the prediction of modules as described in Section 2.2.4 appears in Appendix 3 and the supplemental data file “SF2” also available on the accompanying CD. The modules consist of a combination of ECM and functionally-related non-ECM proteins with membrane components relatively sparsely and evenly distributed among them (Figure 2-6). Detailed descriptions of each module and associated metadata appear in the supplemental excel spreadsheet data file “SF8” also available on the accompanying CD. Since membrane proteins represent a critical functional interface for the ECM the coverage of key classes of membrane proteins in the network, such as integrins, is of particular interest. Among the 18 alpha and 8 beta integrins known in mammals the network contains 8 integrin subunits (α2, αIIb, α7, α8, β1, β2, β4 and β7). Also, 2 out of 4 syndecans (SDC1, SDC2) are present. However, available interactions do not link the discoidin domain receptors (DDR1, DDR2) to the network at this time. The distribution of membrane proteins in the network suggests they do not form distinct modules but perhaps serve as adaptors bringing together larger modular components in a context dependent manner. Integrins are known to mix and match to organise context dependent subnetworks involved in specialized cell:matrix adhesions42,256. This model may apply to other membrane proteins involved in matrix functions.
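The clustering step described above can be illustrated with a minimal, pure-Python sketch of the Markov Clustering loop (expansion followed by inflation). This is not the MCL program used in this study; the function name `mcl`, the convergence tolerance and the toy edge list are assumptions made for illustration only. The inflation value of 2.2 and the size cutoff of three members follow the text.

```python
def mcl(edges, inflation=2.2, max_iter=100, tol=1e-6):
    """Cluster an undirected graph with a basic Markov Clustering loop."""
    nodes = sorted({n for edge in edges for n in edge})
    idx = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    # Adjacency matrix with self-loops, as is customary for MCL.
    m = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for a, b in edges:
        m[idx[a]][idx[b]] = m[idx[b]][idx[a]] = 1.0

    def normalise(mat):
        # Rescale every column so that it sums to 1 (column-stochastic).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total
        return mat

    m = normalise(m)
    for _ in range(max_iter):
        # Expansion: squaring the matrix spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(size))
                     for j in range(size)] for i in range(size)]
        # Inflation: an element-wise power favours strong flows over weak ones.
        inflated = normalise([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(size) for j in range(size))
        m = inflated
        if change < tol:
            break

    # Rows that retain flow are attractors; the columns they attract form a cluster.
    clusters = set()
    for i in range(size):
        members = frozenset(nodes[j] for j in range(size) if m[i][j] > 1e-3)
        if members:
            clusters.add(members)
    return sorted((sorted(c) for c in clusters), key=lambda c: (-len(c), c))


# Toy network (hypothetical node names): two triangles joined by one edge.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
clusters = mcl(toy_edges)
# Clusters with at least three members are treated as putative modules.
modules = [c for c in clusters if len(c) >= 3]
```

Higher inflation values generally yield more, smaller clusters, which is why the choice of inflation value needs separate justification for a given network.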

Next, the biological relevance of the modules was assessed. Although no significant enrichment for GO biological process terms was found at the level of modules, several modules were identified as enriched for UniProt biological process keywords when compared to randomized modules (Table 2-4; methods are described in Section 2.2.5.3). Further supporting information is included in the supplementary Excel spreadsheet data files “SF9” and “SF10” which are also available on the accompanying CD. Adopting the WordCloud plug-in for Cytoscape233, modules were annotated with their most commonly occurring keyword(s) (Figure 2-6 and Figure 2-7). Where there was poor coverage of keywords, alternative annotations were derived from manual inspection of UniProt and gene descriptions. In addition to describing common biological processes, module annotations highlight several other biological features including tissue distribution (e.g. modules 4, 12 and 19), macromolecular assemblages (e.g. modules 9 and 23) and sequence similarity (e.g. modules 42 and 76). These latter modules consist of members derived from the same gene family and may reflect conserved sequence features between paralogues that maintain similar interactions257.
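The comparison against randomized modules can be sketched as a simple permutation test. The exact procedure used here is the one described in Section 2.2.5.3; the version below is only an assumed, simplified variant in which size-matched random modules are drawn from all annotated proteins. The function name `keyword_enrichment`, the protein identifiers and the annotation table are hypothetical.

```python
import random


def keyword_enrichment(module, keyword, annotations, n_random=10000, seed=0):
    """Empirical p-value for a keyword in a module versus size-matched random modules."""
    rng = random.Random(seed)
    proteins = sorted(annotations)
    # Number of module members carrying the keyword.
    observed = sum(keyword in annotations[p] for p in module)
    hits = 0
    for _ in range(n_random):
        # A randomized module: same size, members drawn without replacement.
        sample = rng.sample(proteins, len(module))
        if sum(keyword in annotations[p] for p in sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_random + 1)


# Toy annotation table (hypothetical proteins p0..p39, one keyword each).
toy_annotations = {f"p{i}": {"Blood"} if i < 4 else {"Cell adhesion"}
                   for i in range(40)}
count, p_value = keyword_enrichment(["p0", "p1", "p2", "p3"], "Blood",
                                    toy_annotations, n_random=2000)
```

Because many keywords and modules are tested, p-values obtained this way would normally need a multiple-testing correction before being interpreted.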

Figure 2-5: Distribution of cluster sizes Using an inflation value of 2.2 the ECM network was clustered using Markov Clustering (MCL) resulting in the distribution of cluster sizes shown here. Inset pie charts illustrate the proportion of clusters greater than or equal to size 3 (considered the cutoff for module prediction) and the proportion of proteins within clusters at or above this cutoff.

Figure 2-6: A human ECM network based on experimental PPI evidence Main figure (A): Putative functional modules based on MCL analysis. Members of each cluster have been assigned a common colour. Protein–protein interactions are indicated by gray edges (deemphasized for clarity of the figure). Biological processes were assigned using the most frequent UniProt keywords associated with the proteins in each cluster as visualized using WordCloud (a Cytoscape plug-in). Where coverage of keywords was sparse or absent a suggested or clarifying annotation has been substituted from the reviewed gene descriptions (bracketed text). Where clusters suggested an association with known tissue distributions these were noted on the figure. Inset (B): An examination of cross-talk between clusters. Each node represents one cluster with edge weight proportional to the number of interactions between proteins in each pair of connected clusters. Inset (C): Betweenness centrality (red/orange = high, green/blue = low) and average shortest path length (large = low, small = high) predict modules with high information flow.

Table 2-4: UniProt biological processes enriched in ECM clusters

Cluster   UniProt Keyword        P-value
1         Blood                  0.0001
2         Cell adhesion          0.0079
3         Differentiation        0.01
3         Angiogenesis           0.0232
7         Collagen degradation   0.0134
8         Cell adhesion          0.01
10        Chondrogenesis         0.0145
14        Angiogenesis           0.0041
14        Differentiation        0.0253
17        Differentiation        0.0177
18        Biomineralization      0.0214
19        Cell adhesion          0.0041
20        Collagen degradation   0.0181
24        Chondrogenesis         0.0003
24        Osteogenesis           0.0013
24        Differentiation        0.01
25        Chemotaxis             0.0063
48        Neurogenesis           0.0317
71        Cell adhesion          0.0243
89        Cell adhesion          0.0432

As an aside, the association of the meprins (MEP1A and MEP1B; multi-domain zinc proteases) from module 76 with proteins involved in the regulation of blood pressure implicates a new role for these proteins.

Focusing on biological process annotations, many module annotations were found to reflect the hierarchical structure of the underlying ontology. For example, proteins associated with wound healing are organised into modules associated with angiogenesis (module 14), hemostasis (modules 1, 6 and 33), cell growth (modules 21 and 28), immune/inflammatory responses (modules 25 and 58) and matrix remodelling (modules 7, 13, 20 and 27) and their attendant signaling processes (modules 3, 11, 15 and 50). Other module annotations required additional querying of secondary source descriptions and supporting literature to resolve the biological meaning behind their organisation. Proteins involved in blood coagulation, for example, are divided into two large modules with putative pro-coagulant (module 1) and anti-coagulant (module 6) functions, which interact both directly and indirectly (via module 52). Similarly, bone morphogenesis is divided among several interlinked modules (modules 10, 24, 41, 47, 61 and 68) with module 10 playing a central organisational role. Notably in this latter example, the basis for the separation of proteins into these modules is not immediately obvious: the proposed functions of proteins associated with modules 10, 24 and 41 are indistinguishable. Nevertheless, modules 47 and 61 appear involved in neuronal morphogenesis whereas proteins in module 68 have been implicated in ovarian folliculogenesis and maturation. Taken together, these examples highlight the heterogeneous nature of current annotations and underline the need to supplement categorical terminology with secondary source descriptions and supporting literature in order to uncover the underlying biology.

Figure 2-7: UniProt biological process annotations Relative frequencies of UniProt biological process annotations associated with putative ECM functional modules (numbered in superscripts) represented as an assemblage of ‘WordClouds’.

2.3.6 Network topological measures identify major organising components of the ECM

To reveal how modules are organised within the context of the global ECM, a new network based on module interactions was constructed and two key topological properties analyzed - average shortest path length (which provides a measure of how close a node is to every other node in the network) and betweenness centrality (which reflects the amount of control exerted by a given module over the interactions between other modules in the network) (Figure 2-6(b) and 2-6(c)). Structural matrix components forming basement membrane (modules 5, 12, 19, 23 and 31) and fibres (modules 2, 4 and 9) are separated into several modules. Together with the laminin complex (module 23), these modules mediate central roles within the network (low shortest path length/high betweenness centrality). A notable exception is module 31 (ANTXR2, COL4A3, COL4A4 and USH2A) which has a relatively low betweenness centrality, presumably reflecting its role in the specialized basement membranes of the kidney, inner ear and eye.
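The two topological measures can be computed as in the following sketch, which uses breadth-first search for shortest paths and Brandes' accumulation scheme for betweenness. It is an illustrative stand-in for the network analysis tools used in this study, not a reproduction of them; the function name `topology` and the toy module graph are assumptions, and betweenness is left unnormalised.

```python
from collections import deque


def topology(adj):
    """Return (average shortest path length, betweenness centrality) per node."""
    aspl = {}
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths to each node.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Mean distance from s to every node it can reach.
        reachable = [d for node, d in dist.items() if node != s]
        aspl[s] = sum(reachable) / len(reachable) if reachable else float("inf")
        # Back-propagate dependencies from the farthest nodes towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph each pair is visited from both endpoints.
    return aspl, {v: b / 2 for v, b in bc.items()}


# Toy module graph (hypothetical labels): one hub module linked to four others.
toy_graph = {"hub": ["m1", "m2", "m3", "m4"],
             "m1": ["hub"], "m2": ["hub"], "m3": ["hub"], "m4": ["hub"]}
path_length, betweenness = topology(toy_graph)
```

In this toy graph the hub has the lowest average shortest path length and the only non-zero betweenness, which is the signature used in the text to identify central modules.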

Further modules which occupy central roles within the network include modules 7 and 13 (involved in matrix remodeling). This highlights their important organisational roles in the coordination and recruitment of many ECM functions. Modules 1 and 6 (involved in blood coagulation) are highly central though they act upstream of inflammatory and wound healing events and as such their influence on ECM organisation is indirect. Modules involved in elastic fibres and their assembly (modules 34, 36, 69 and 74) are less prominent within the network, reflecting either the specialized nature of these fibres or the lack of data concerning their interactions in interaction databases. For elastin, there is experimental evidence for 16 direct interactions; however elastin is also indirectly linked with many more neighbours that represent a functionally diverse collection of network modules (Figure 2-8).

The diverse sampling of the network provided through its neighbours illustrates the suitability of elastin as our initial seed to define our ECM gold standard. These findings are also consistent with a previous suggestion that ECM proteins may represent a highly integrated, solid phase system of ligands for signal coordination30. The importance of signaling is highlighted by the presence of a number of modules dedicated to this function e.g. IGF (module 15), TGFβ (module 11), PDGF (module 50), FGF (module 3), BMP (modules 10, 24, 41, 47, 61 and 68) and Wnt (module 17), some of which are quite central (e.g. modules 10, 15 and 24).

Figure 2-8: Elastin Subnetwork A subnetwork consisting of elastin and its nearest and next nearest neighbours.

The periphery of the network is occupied by modules of small size and fewer interactions, for which it is difficult to assign function. However, the occurrence of a few modules dominated by disintegrins or proteases (20, 27) as well as one containing salivary proteins (38) suggests that some peripheral modules are occupied by proteins with highly specialized functions. The remainder may simply reflect a variety of proteins whose interactions have not been characterized. As an aside, a common criticism leveled at global network analyses is that of study bias (i.e. that well-studied proteins appear to have more interactions simply because they are well-studied or curated). In plots of degree versus number of publications as collected from two independent sources, Genopedia234 and iHOP177 as well as degree versus number of annotations (GO and MeSH) no particular bias was found with respect to the proteins included in the ECM network although FN1 is a significant outlier (Figure 2-9).

The topological properties of individual proteins were also examined to identify those that may mediate important organisational roles within the network (Figure 2-10). In general, most modules contain one or two proteins of low average shortest path length and high betweenness centrality, suggesting that they may represent the organisational centers of their respective modules. Examples include coagulation factor II (F2 - module 1), the first coagulation factor in the blood clotting cascade, and plasminogen (PLG - module 6), which degrades fibrin in blood clots; fibroblast growth factor 2 and associated receptors (FGF2, FGFR1 and FGFR2 - module 3), which mediate signaling cascades involved in mitogenesis and differentiation; perlecan (HSPG2 - module 12), which plays a significant role in cell adhesion via its structural contribution to basement membranes; and others involved in cell adhesion such as biglycan (BGN - module 18, also a proteoglycan) and thrombospondin (THBS1 - module 2, a glycoprotein). Fibrillar collagens and fibronectin also appear as organisational centers (e.g. FN1 and COL1A1/COL1A2 - modules 2 and 9 respectively) as do two major matrix metalloproteinases (MMP2 and MMP9 - modules 7 and 13 respectively) and one integrin chain (ITGB1 - module 8) which is common to a series of heterodimeric, modular integrin receptors for collagen, fibronectin, fibrinogen, laminin and others. A noteworthy connection between collagen II (COL2A1) and bone morphogenetic protein 2 (BMP2), which appears as a high betweenness edge between two high betweenness nodes, highlights the possible importance of type IIA procollagen in chondrogenesis as shown previously258. FURIN (module 22) is a proprotein convertase that processes latent precursor proteins into their biologically active products. Its presence within a module involved in differentiation and fertilization highlights the important role of biologically active matrix components in numerous developmental pathways. Other central nodes that are not components of large modules may act as module coordinators, organising the diverse contributions of specialized modules to effect broader, more complex functions. Examples include vitronectin (VTN - module 54), a multi-functional adhesive glycoprotein that promotes cell adhesion and interacts with module organisers F2 and FGFR2 and other proteins to connect major signaling pathways (IGF, TGFβ, FGF, EGF, VEGF)259. Decorin (DCN - module 66) also fits in this category of module-independent organisers, being a proteoglycan related in structure to biglycan and well known for its role in matrix assembly.

Figure 2-9: Correlation between network connectivity and number of annotations Annotations clockwise from the upper left are: Gene Ontology Terms (all categories), number of PubMed citations according to iHOP, number of PubMed citations according to Genopedia, number of associated medical subject headings (MeSH terms).

Figure 2-10: Network attributes Nodes and edges are coloured according to their betweenness centrality, on a gradient from low (green) to high (red) with node size inversely proportional to the average shortest path length. Nodes with the highest betweenness and lowest shortest path length (shown in the inset table) are labeled in the main figure. The graph (inset, boxed in red) illustrates the relative scoring of these nodes to the distribution

2.3.7 Gene expression patterns predict that modules are broadly expressed but that tissue specific functionality is coordinated by a limited number of components

The ECM network presented here represents a global ‘pan-tissue’ network of limited physiological relevance. Therefore spatiotemporal gene expression data was examined to identify patterns of protein co-occurrence and hence the ability of proteins to interact for any given tissue. Expression data obtained from Su et al.223 reveal that many ECM genes are widely (though not necessarily highly) expressed. Nonetheless clustering expression data identified several subsets of ECM genes with tissue-specific patterns of expression (Figure 2-11). For example, genes in groups 1 and 2 (Figure 2-11, red and orange respectively) were highly expressed across all tissues in the data set whereas genes in groups 3 and 5 (Figure 2-11, yellow and blue respectively) appear to be specific for smooth muscle, colon and intestinal cells. As previously noted, clustering the expression patterns of ECM genes resulted in related tissues being grouped together (e.g. ‘circulatory system’- tissues such as cells of the , blood and heart). Supporting information for Figure 2-11 is included in the supplementary Excel spreadsheet data file “SF11” also on the accompanying CD. To examine if expression patterns correlate with proposed functional modules, the expression groupings were projected onto the network (Figure 2-11(b)). Note that not all of the ECM core genes with expression data have corresponding interactions and vice versa. Similarly, some genes lie outside of the defined expression groups. In addition, gene names often encapsulate multiple products, and duplication of gene names across colour groups indicates the presence of multiple probes yielding differing expression patterns for the same gene. These variations may be indicative of differences in expressed isoforms, differences in the probes' ability to detect a single gene product, or sampling error (see supporting information in the accompanying spreadsheet data file “SF11” also on the CD).

Figure 2-11: An expression profile for ECM core proteins A: Expression of network core genes across 84 samples corresponding to 79 human tissues as measured by a previously published microarray. Values are fold-expression on a log2 pseudo-coloured scale. The midpoint (black) was chosen to emphasize different expression patterns within the sample relative to one another and corresponds to a raw expression value of 15. The array has been hierarchically clustered in two dimensions to group genes and tissues by similar expression profiles (see section 2.2 Materials and Methods). Expression patterns are identified with colour blocks for the purpose of mapping them onto the network diagram. B: Network with nodes coloured according to expression blocks. Where more than one isoform of a protein was present in the array and mapped to a separate colour block, the corresponding node was divided and coloured accordingly.

Within the network several proteins were identified that, while appearing widely expressed, are represented by several probes that detect more restricted expression patterns260. Among the various modules, several were identified in which components have similar patterns of tissue expression (e.g. modules 27, 36 and 66) as well as those in which components are differentially expressed (e.g. modules 2, 9 and 14). These latter modules likely represent entities in which a core conserved function is adapted in various tissues through the inclusion of specific components. For example, module 9, which is largely composed of fibrillar collagens has three components with distinct tissue expression profiles: COL1A2 (underexpressed in brain and circulatory cells but found to be expressed in ganglia and other tissues supported by at least one of three probes for COL1A1 with which it is always associated in tissues - green); COL5A1 (preferentially expressed in smooth muscle, colon and intestine - yellow) and COL5A3 (preferentially expressed in brain - mauve). The inclusion of each component in this module likely contributes to the tissue-specific properties of their fibrillar matrices.

To more generally investigate the relationships among expression pattern, structure/function and network topology, Pearson correlation coefficients (PCC) of expression were calculated for all pairs of proteins in the network. As expected, interacting pairs (PPI pairs in the network) had a higher average correlated expression than sets of random pairs (background) (Figure 2-12(c)) indicating that interacting proteins are more likely to be co-expressed. Hierarchical clustering allows the definition of groups of genes with similar patterns of co-expression, as opposed to modules previously defined by topological network characteristics (Figure 2-12(a)). Additional supporting information for this figure appears in supplemental spreadsheet data file “SF12” also available on the accompanying CD. The majority of the network core proteins appear to be organised into blocks sharing distinct patterns of highly coordinated gene expression. Surprisingly, based on the frequency of associated UniProt biological process keywords (visualized using WordCloud) these expression groups are involved mainly in cell adhesion; the exception being group 6 whose primary associated function is angiogenesis followed by cell adhesion (Figure 2-12(d)). The basis for their separation into unique expression pattern blocks is not clear but may reflect associated secondary functions or in some cases (e.g. block 11, laminin complex) the stoichiometric requirements of components of distinct macromolecular assemblies. Importantly, this does not suggest that the primary function of all ECM network proteins is cell adhesion. Rather, these results highlight that many network proteins participate in cell adhesion as one of several roles and suggest that cell adhesion requires a relatively high degree of coordinated gene expression among distinct assemblies of proteins operating in a context dependent manner. Co-expression does not appear to correlate well with modules defined using network topology. Possible exceptions include the tissue specific examples mentioned earlier and perhaps fibrillar collagen which, despite containing components with distinct tissue expression profiles, nevertheless includes a majority of components which are co-expressed.
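The comparison of interacting pairs with random background pairs can be sketched as follows. This is a simplified illustration rather than the sampling scheme behind Figure 2-12(c): the function names, the number of random samplings and the toy expression vectors are all assumptions, and the sketch compares mean correlations rather than full frequency distributions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pcc(pairs, expr):
    """Average co-expression over a list of gene pairs."""
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)


def compare_to_background(ppi_pairs, expr, n_samples=1000, seed=0):
    """Mean PCC of interacting pairs versus size-matched sets of random pairs."""
    rng = random.Random(seed)
    genes = sorted(expr)
    observed = mean_pcc(ppi_pairs, expr)
    background = []
    for _ in range(n_samples):
        # Draw as many random gene pairs as there are interacting pairs.
        random_pairs = [tuple(rng.sample(genes, 2)) for _ in ppi_pairs]
        background.append(mean_pcc(random_pairs, expr))
    return observed, sum(background) / len(background)


# Toy expression profiles across four tissues (hypothetical genes and values).
toy_expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],
            "g3": [4, 3, 2, 1], "g4": [8, 6, 4, 2]}
toy_ppi = [("g1", "g2"), ("g3", "g4")]
observed_mean, background_mean = compare_to_background(toy_ppi, toy_expr)
```

Genes with constant expression across all samples have zero variance and would need to be filtered out before calling `pearson`.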

Patterns revealed through the analysis of network topology are not necessarily made less interesting as a result. Rather, they emphasize the importance of an additional level of detail. For the most part ECM modules appear to consist of components exhibiting varied tissue-specific responses and it is speculated that this context-dependent expression tailors the activity and/or composition of otherwise canonical modules.

2.3.8 Almost two thirds of human ECM proteins are not conserved outside the deuterostomes

One of the fundamental processes associated with the emergence of multi-cellular life was the ability for cells to stably associate via components of the ECM. This key innovation subsequently allowed the development of distinct tissues and organs facilitating the compartmentalization of specialized biological functions. With increasing numbers of metazoan genome sequences now available, it is possible to place the list of human ECM components in an evolutionary context to examine the likely timing of their emergence. Applying the Inparanoid algorithm229, orthologues of ECM proteins were predicted across 117 eukaryotic genomes. These were used to generate phylogenetic profiles which were clustered into nine groups with distinct patterns of conservation (Figure 2-13). Additional supporting information for this figure is contained in supplemental Excel spreadsheet data file “SF3” also on the accompanying CD.

Figure 2-12: Patterns of correlated coexpression for network core proteins A: Pearson correlations plotted as a symmetrical gene by gene heatmap. Groups of proteins were broken into 11 colour groups by cutting the clustered gene tree (red line). B: Network with colour groups mapped onto clusters. Multiple colours for a node occur where isoforms present in the array map to different colour groups. C: Frequency distribution of Pearson correlated coexpression values for proteins drawn from 10,000 random samplings of two populations; ECM-related PPI pairs versus random (background) pairs. D: WordClouds of UniProt biological process keywords associated with colour groups.
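The grouping of phylogenetic profiles described in this section can be sketched with a small agglomerative clustering routine using city block (Manhattan) distance and complete linkage, the combination named in the caption of Figure 2-13. Orthologue prediction itself (Inparanoid) is not reproduced; the profile encoding (orthologue counts per genome), the gene labels and the function names are assumptions for illustration.

```python
def city_block(p, q):
    """City block (Manhattan) distance between two phylogenetic profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def complete_linkage(profiles, k):
    """Merge genes bottom-up until k clusters remain, using complete linkage."""
    clusters = [[gene] for gene in sorted(profiles)]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(city_block(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


# Toy profiles: orthologue counts in seven hypothetical genomes, ordered
# plant, fungus, fly, worm, fish, mouse, primate (0 = no orthologue detected).
toy_profiles = {
    "universal_1": [1, 1, 1, 1, 1, 1, 1],
    "universal_2": [1, 1, 1, 1, 1, 1, 1],
    "metazoan_1": [0, 0, 1, 1, 1, 1, 1],
    "metazoan_2": [0, 0, 1, 1, 2, 1, 1],
    "vertebrate_1": [0, 0, 0, 0, 1, 1, 1],
    "vertebrate_2": [0, 0, 0, 0, 1, 1, 2],
}
groups = complete_linkage(toy_profiles, k=3)
```

Encoding counts rather than presence or absence lets lineage-specific expansions separate from one-to-one orthologues, at the cost of sensitivity to splice variants when peptide datasets are used.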

Consistent with previous studies, very few proteins had detectable orthologues among plants, protists or fungi48,261. Indeed, only two proteins appear conserved across nearly all species: SMC3 is a nuclear protein involved in chromosome organisation262 which has also been shown to have a role in basement membrane250; and MFAP1 usually known for its association with elastic fibres, has been shown to be required for pre-mRNA processing in Drosophila263,264. Approximately one-third of human ECM genes appear to be conserved with protostomes (Figure 2-13, groups 3, 6, 7 and 8) and may therefore represent orthologues of the earliest ECM genes. Of these, approximately half (group 7) appear to have expanded during the emergence of the vertebrate lineage.

The vertebrate lineage is also associated with several groups of orthologues (groups 2, 4, 5, 9) which were either absent in the common ancestor of vertebrates and the basal chordates/protostomes or represent components that have substantially diverged from their protostome and basal chordate orthologues. No less than two genome duplication events are generally accepted to have occurred during early vertebrate evolution and it has been proposed that these events account in large part for the increase in both genomic and morphological complexity among vertebrates265. These observations support this hypothesis as well as the observation that the vertebrates have also undergone their own lineage specific expansions266.

Intriguingly, a number of components have also been identified here that while appearing reasonably well conserved either across metazoa (groups 6 and 8) or vertebrates (groups 2 and 9) nonetheless appear to have undergone primate specific expansions. Within the vertebrate specific genes are clusters of mammalian-specific genes (Figure 2-13(b) and 2-13(c)). Again many members appear to represent primate specific expansions including genes involved in bone mineralization and osteoblast development (DMP1, MEPE, SPP1), matrix organization (MFAP5), or cell matrix attachment (LAD1, SPP1, KISS1, ANTXR2). It is not clear why these may be expanded in primates. However it is worth noting that DMP1, once thought to be dentin-specific267, has subsequently been shown by RT-PCR to be expressed in non-mineralized soft tissues including mouse liver, muscle, brain, pancreas, and kidney where it is assumed that DMP1 has novel functions268. Interestingly, TUFT1 (also group 2), thought to play an important role during the development and mineralization of enamel, has recently been shown to be expressed in non-mineralizing soft tissues suggesting it too may play a more universal role that remains to be discovered269,270.

Figure 2-13: Conservation profiles for network core genes Depicted are human ECM + functionally related genes (vertical axis) and their orthologues in 117 eukaryotic genomes (horizontal axis). Colouration of tiles indicates the type of orthologous relationship of the ECM gene relative to the (human) reference sequence. Coloured circles correspond to the colours used in Figure 2-14 (below) and are provided as an aid to cross-referencing the figures. Groups of genes with similar conservation profiles are indicated on the right and discussed further in the main text. Boxed in red are putative mammalian-specific genes which are (B) conserved in mammals with later expansion in humans and other primates, (C) conserved largely 1:1 across all mammals. The white vertical line marks the transition to metazoans. The genomes are grouped and ordered on the basis of known phylogenetic relationships (see Appendix 2). Note that here peptide datasets are used for the comparator genomes and therefore matches may include various splice variants in addition to genuine gene family expansions. Genes were clustered based on the city block method using complete linkage.

Based on these observations and the broad occurrence of ESTs for most ECM proteins, it is expected that many of these proteins will turn out to be multifunctional and that these lesser known functions will explain their expansion in primate and human lineages. Mammalian-specific genes not expanded in primates (Figure 2-13(c)) include several associated with dentition (AMBN, AMTN, ENAM), lung surfactants (SFTPB, SFTPC), the zona pellucida (ZP2) and several genes whose products are implicated in specialized matrix related functions (MGP, thought to be an inhibitor of bone formation; MMRN1, a binding protein for integrins and platelet factor V; OPTC, involved in collagen organisation in the vitreous of the eye).

The genome of the invertebrate deuterostome Ciona intestinalis was recently used to demonstrate that vertebrate complexity has mostly arisen through the duplication and subsequent modification of retained, pre-existing ECM genes49. Mapping of conservation patterns onto the ECM network reveals that much of the basement membrane, including collagenous and non-collagenous components (i.e. laminins) and core structural modules were in place in the protostomes as were modules associated with biomineralization and collagen degradation (Figure 2-14). Signaling pathways are generally present in protostomes though in much simplified form. In fact, the latter observation has previously been suggested as a justification to study otherwise complex pathways in simpler organisms such as the , C. elegans271.

These same modules were enhanced in vertebrates with the addition of paralogues (dark blue) and nascent proteins (shades of red). Likewise, mammalian-specific proteins (green circles) are distributed as enhancements among existing network modules (Figure 2-14). These observations support a model of ECM evolution in which additional complexity and function arose at least partly out of additions to pre-existing modules.

Figure 2-14: Conservation of network ECM proteins Conservation groups from the previous figure (repeated here as an inset) are overlaid as coloured nodes on the ECM network. Neighbour proteins (non-ECM) are retained as small, greyed-out nodes to maintain a reference to the original, modular network structure. Mammal specific ECM proteins have been highlighted by a green border.

2.3.9 Integration of MeSH annotations identifies modules associated with disease

The biological relevance of putative functional modules was further examined by exploring the coherence of disease annotations, reasoning that, through associating a module with a specific disease, it may be possible to predict new disease associations for other members of that module. Exploiting MeSH disease terms derived from Genopedia234, 85 disease terms were identified that were significantly enriched in 21 separate modules (Figure 2-15). Specific examples include Osteoporosis (p = 0.0078, module 3), Thrombosis (p = 0.0094, module 1), and Diabetes Mellitus, Type 2 (p = 0.0349, module 5). This network approach illustrates the power of topological analysis to overcome limitations in annotation coverage. For example, 19/23 genes in module 1 are annotated by MeSH with the term ‘Venous Thrombosis’. The remaining four are not. is a powerful antithrombotic agent owing to its ability to bind dependent coagulation factors272; Multimerin is involved in thrombus formation through its binding of platelet factor V/Va273; SERPINA5 expression has been correlated to known sex-specific differences in clotting in mice274; F10 is a well-studied clotting factor. Given strong experimental evidence together with their association with other members of module 1, the evidence suggests that the MeSH term ‘Venous Thrombosis’ should also be applied to the remaining four proteins. Supporting information for this figure is included in supplementary Excel spreadsheet data file “SF13” and also available on the accompanying CD.

This ‘guilt by association’ can be similarly exploited to infer new disease relationships where experimental data may be more ambiguous. For example, two out of the five proteins that comprise module 23 (LAMB3 and LAMC2) are associated with the MeSH disease term ‘Macular degeneration’ (p = 0.0039 – “SF9”) implying that the other proteins in this cluster (LAMA3, BMP1 and COL7A1) may also be involved in this process. LAMA3 represents the alpha subunit of laminin-332 (previously laminin 5), which also comprises LAMB3 and LAMC2. BMP1 has recently been shown to have antiangiogenic effects275 which may have a role in controlling the angiogenic invasion of blood vessels from the choroid behind the retina associated with severe forms of macular degeneration. COL7A1 is the major component of anchoring fibrils and is best known for its role in causing dystrophic epidermolysis bullosa (a detachment of the epidermis from the basal lamina resulting in blistering of the skin). While COL7A1 is known to be expressed in the retina, the function of the protein in this context remains unknown. These types of observations illustrate the potential for systems based analyses to predict new functional and disease associations on the basis of network topology.
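The ‘guilt by association’ reasoning above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in this study: the function name `propose_candidates` and the returned dictionary layout are hypothetical, and the only data used are the five members of module 23 and the two MeSH-annotated proteins named in the text.

```python
# Hypothetical sketch: list module members that lack a disease term
# already carried by other members of the same module.

def propose_candidates(module_members, annotated):
    """Split a module into term carriers and unannotated candidates."""
    members = set(module_members)
    carriers = members & set(annotated)
    candidates = members - carriers
    return {
        "carriers": sorted(carriers),
        "candidates": sorted(candidates),
        "fraction_annotated": len(carriers) / len(members),
    }

# Module 23 and its 'Macular degeneration' annotations, as given in the text.
module_23 = ["LAMB3", "LAMC2", "LAMA3", "BMP1", "COL7A1"]
macular_degeneration = {"LAMB3", "LAMC2"}

result = propose_candidates(module_23, macular_degeneration)
print(result["candidates"])
```

Candidates produced this way are hypotheses only; as with LAMA3, BMP1 and COL7A1 above, each still needs supporting biological evidence.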

2.3.10 Literature curation of elastin interactions resulted in doubling the number of known binding partners

ECM interactions are known to be underrepresented in public databases. A thorough review of the literature and various public interaction datasets was conducted in July 2011 to update known information on elastin PPIs collected in Dec 2010. The focus of this review was to inform the SPRi study of elastin and elastin-like peptides (see Section 2.3.11 SPRi experiments detect distinct binding characteristics of recombinant elastin fragments).

Figure 2-15: Enrichment of MeSH disease terms by cluster. Numbers in cells indicate the number of proteins within the cluster associated with the given MeSH disease term.

Surprisingly, the total number of elastin interactors was found to be 35 as compared to the 18 identified six months earlier based on public databases alone (Table 2-5). Note that self interactions were not included in the network and therefore, elastin was included in the network but not counted in the original list of elastin interactors. Curation is an ongoing process and this example illustrates the degree to which it is a moving target. Many of the experimentally supported interactions found in this sweep were published well prior to the information collected in Dec 2010 suggesting that a serious lag in knowledge translation exists between the literature and public interaction databases. These interactions have subsequently been communicated to MatrixDB curators for expert evaluation. To this total the SPRi screen added 4 possible new interactors bringing the possible number of elastin interactors to 39. Subsequent studies of the ECM or its subnetworks would be well-advised to conduct supplementary literature searches to account for the inevitable delay between discovery and deposition of information in public databases.

2.3.11 SPRi experiments detect distinct binding characteristics of recombinant elastin fragments

Elastic fibres are specialized matrix components and are a major structural constituent of large arteries, where they provide essential properties of elastic recoil and resilience. They are largely composed of elastin. Mutations in the hydrophobic and cross-linking regions of elastin have been shown to result in alterations in the coacervation of tropoelastin and structural weakness in assembled elastin fibres20. Subsequently it has been hypothesized that natural genetic variations affecting the initial assembly and organisation of elastic fibres may impact their long-term durability, possibly predisposing an individual to premature failure of elastic tissues such as arteries that, over the course of a lifetime, must endure billions of cycles of stretch and elastic recoil.

In addition to elastin, a number of ancillary proteins including fibrillins, EMILINs, fibulins and LTBPs play important roles in elastic fibre assembly and organisation, mediating interactions between the fibre and its environment. Disruption of network architecture in the form of ablated interactions or more subtle alterations in the binding characteristics resulting in changes in the weighting of these relationships could conceivably contribute to premature failure. However, whether it was feasible to detect such changes was unknown. Therefore, under the auspices of the collaborative graduate program in genome biology and bioinformatics (CGPGBB) a collaborative traineeship was undertaken, under the supervision of Dr. Sylvie Ricard-Blum, to investigate the differential binding of tropoelastin and several recombinant elastin fragments using the Biacore Flexchip Surface Plasmon Resonance (SPR) array at the Institut de Biologie et de Chimie des Protéines (IBCP) in Lyon, France.

Table 2-5: Elastin interactors

This list of 35 interactors was curated from literature sources as of July 2011. The 18 interactors (not including elastin itself) which were included in the ECM network are bolded for comparison.

| Symbol | Description | PMID | Date of first publication |
|---|---|---|---|
| FCN1 | ficolin | (8947836)276 | Oct-96 |
| ASS1 | argininosuccinate synthase 1 | (1372742)277 | Jan-92 |
| SPINK1 | serine peptidase inhibitor, Kazal type 1 | (2093478)278 | Dec-90 |
| DCN | decorin | (11723132)279 | Feb-02 |
| ELANE | elastase, neutrophil expressed | (10471600)280 | Sep-99 |
| ELN | elastin | (9336802)281, (6376098)282 | Apr-84 |
| LGALS3 | lectin, galactoside-binding, soluble, 3 | (10536372)283 | Dec-99 |
| BGN | biglycan | (11723132)279 | Feb-02 |
| FBLN1 | fibulin 1 | (11394650)284, (10544250)285 | Oct-99 |
| PRTN3 | proteinase 3 | (11867344)286 | Mar-02 |
| LOX | lysyl oxidase | (9336802)281, (2576848)287 | Sep-89 |
| FBN1 | fibrillin 1 | (10825173)288 | Aug-00 |
| FBN2 | fibrillin 2 | (10825173)288 | Aug-00 |
| MMP12 | matrix metallopeptidase 12 (macrophage elastase) | (19236151)289, (9835614)290, (20345904)291 | Dec-98 |
| MFAP2 | microfibrillar-associated protein 2 | (15233806)292 | Jul-04 |
| LOXL3 | lysyl oxidase-like 3 | (16251195)293 | Dec-05 |
| LYZ | lysozyme | (9745729)294 | Jun-98 |
| FBLN2 | fibulin 2 | (10544250)285 | Oct-99 |
| CELA1 | chymotrypsin-like elastase family, member 1 | (9175736)295, (10620133)296, 297 | Dec-84 |
| FKBP10 | FK506 binding protein 10, 65 kDa | (11071917)298 | Nov-00 |
| LOXL1 | lysyl oxidase-like 1 | (16251195)293 | Dec-05 |
| NID2 | nidogen 2 (osteonidogen) | (10544250)285 | Oct-99 |
| NEU1 | sialidase 1 (lysosomal sialidase) | (16314420)299 | Feb-06 |
| CTSA | cathepsin A | (16314420)299 | Feb-06 |
| GRID2 | glutamate receptor, ionotropic, delta 2 | (20537373)300 | Jun-10 |
| GLB1 | galactosidase, beta 1 | (16314420)299 | Feb-06 |
| IGHG1 | immunoglobulin heavy constant gamma 1 (G1m marker) | (15174051)301 | May-04 |
| MMP7 | matrix metallopeptidase 7 (matrilysin, uterine) | (20884320)302, (15808264)303, (20345904)291 | Apr-05 |
| MMP9 | matrix metallopeptidase 9 | (15808264)303, (20345904)291 | Apr-05 |
| NID1 | nidogen 1 | (10544250)285 | Oct-99 |
| HSPG2 | heparan sulfate proteoglycan 2 | (10544250)285 | Oct-99 |
| FN1 | fibronectin 1 | (10544250)285 | Oct-99 |
| FBLN4 | fibulin 4 | (19570982)304 | Sep-09 |
| FBLN5 | fibulin 5 | (19570982)304 | Sep-09 |

High-throughput SPR array imaging experiments (SPRi) identified 4 putative, novel interactors of elastin: Aggrecan, Brevican, Neuroglycan and Osteonectin (Table 2-6). Perhaps more importantly, these preliminary data revealed differences in the interactions of several of the recombinantly expressed elastin peptides. For example, biglycan does not interact with elastin fragments consisting of exons 20 and 24, suggesting that neither of these exons contains the biglycan binding site. Known binding sites on elastin, with their literature support, were summarized as an aid to interpreting the SPR experiments (Figure 2-16). Consistent with these results, biglycan has previously been suggested279 to bind adjacent to the MAGP-1 binding site at exons 29 to 36.

These experiments demonstrate the power of SPR arrays to rapidly screen for altered binding characteristics of sets of related peptides. As discussed in Section 4.2.3, further work will be needed to verify positive interactions from the high throughput screen and determine whether the binding characteristics of known interactors such as Biglycan were altered as a result of the differing exon structures of interacting elastin-like peptides.
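As an aid to reading the screen results, the interaction calls reported in Table 2-6 can be encoded and grouped by binding profile. The sketch below is illustrative only: the names `PEPTIDES`, `BINDING`, `binders` and `group_by_profile` are hypothetical, and the ‘+’/‘-’ strings simply transcribe the table rows in column order.

```python
# Columns of Table 2-6 (elastin peptides used as analytes), in order.
PEPTIDES = ["HTE", "HTEΔ36", "HTEΔ8-14", "EP20-24-24", "EP20-24-24-36"]

# One '+'/'-' call per peptide, transcribed from Table 2-6.
BINDING = {
    "Aggrecan":    "+----",
    "Biglycan":    "+++--",
    "Brevican":    "+----",
    "Neuroglycan": "+++--",
    "Osteonectin": "+++--",
}

def binders(ligand):
    """Peptides scored as interacting with the given ligand."""
    return [p for p, call in zip(PEPTIDES, BINDING[ligand]) if call == "+"]

def group_by_profile(binding):
    """Group ligands that share an identical pattern of calls."""
    groups = {}
    for ligand, calls in binding.items():
        groups.setdefault(calls, []).append(ligand)
    return groups

for profile, ligands in group_by_profile(BINDING).items():
    print(profile, ligands)
```

Grouping by profile makes it easy to see which ligands lose binding on the same peptides, which is the kind of differential pattern the follow-up work would need to verify.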

2.4 Discussion

This study represents a systematic and comprehensive survey of ECM proteins and their organisation within a global protein interaction network. While current datasets represent only a fraction of likely ECM interactions, integration of additional metadatasets nevertheless reveals modules of functionally related proteins displaying heterogeneous expression and conservation patterns. Such patterns suggest that modules are composed of core elements conserved across the Metazoa, which have incorporated deuterostome, mammalian or even primate-specific elements that likely mediate lineage and/or tissue-specific functions.

The importance of protein-carbohydrate interactions in matrix biology has not been specifically addressed but deserves mention. Well-known examples include the interaction of numerous ECM proteins with glycosaminoglycans such as heparan sulfate and chondroitin sulfate. In so far as the ECM represents an interface and perhaps even a transition between the protein-centric view of biology and the exciting and expanding field of glycobiology, the number of known protein-carbohydrate interactions at this time is comparatively few (the number of expected GAG-protein interactions is unknown). MatrixDB, which collates information on relevant protein-carbohydrate interactions as well as proteins, lists 119 protein-carbohydrate compared with 1836 protein-protein interactions at the time of this writing203. It is worth noting, however, that the sequestration of several growth factors into the ECM has long been known to be via the matrix-associated carbohydrates rather than the proteins themselves29. A proper understanding of the contribution of these interactions to overall matrix biology as well as further insights into proteoglycan evolution await future study.

Table 2-6: Interactions of elastin and elastin-like peptides as determined by SPRi array experiments

Asterisks indicate putative novel interactors. Interactions are denoted with a ‘+’ sign whereas non-interactions are marked ‘-’. Peptide symbols are: Human tropoelastin (HTE), HTE with exon 36 deletion (HTEΔ36), HTE with exon 8-14 deletion (HTEΔ8-14), recombinant elastin peptide consisting of exons 20, 24, 24 (EP20-24-24), recombinant elastin peptide consisting of exons 20, 24, 24, 36 (EP20-24-24-36). Rows are interactors (ligands); columns are elastin peptides (analytes).

| Interactor (Ligand) | HTE | HTEΔ36 | HTEΔ8-14 | EP20-24-24 | EP20-24-24-36 |
|---|---|---|---|---|---|
| Aggrecan* | + | - | - | - | - |
| Biglycan | + | + | + | - | - |
| Brevican* | + | - | - | - | - |
| Neuroglycan* | + | + | + | - | - |
| Osteonectin* | + | + | + | - | - |

The SPRi screen used in this study should not be relied upon as definitive confirmation of protein interactions inasmuch as the Biacore Flexchip (SPRi) is not as sensitive as systems based on classical SPR (e.g. Biacore 3000, T100 or T200). There are some differences in binding level between, for example, HTE and HTEΔ36 (which gives a lower signal), but because SPRi is sensitive to the mass of the injected proteins, it is difficult to discriminate true differences in binding levels from mass variations. In addition, the binding of some elastin fragments to selectively desulfated heparins depends on the sulfate group removed. Nevertheless, backed up by classic SPR techniques, SPRi clearly has potential to expand the ECM interactome as evidenced by the detection of several new candidate interactors of elastin. As demonstrated here, the potential to rapidly screen for differential binding of related, mutant peptides will likely aid in the phenotypic characterization of natural genetic variations of elastin and other contributors to elastic-fibre assembly.

Comparative approaches rely on evolutionary conservation of proteins to infer function from the study of model organisms as well as to address the origin and significance of emergence and expansion across various taxa. Recent studies of basal-metazoan genomes suggest that an ancestral fibrillar collagen gene arose early in Metazoa, before the divergence of sponge and eumetazoan lineages305. Here it has been shown that a large proportion of ECM proteins either emerged or significantly diverged sometime after the split between the vertebrates and the basal-chordates and that only two appear conserved across Eukarya. Two groups of mammal-specific proteins are also evident: a group that maintained one to one orthologous relationships since their appearance and a second that diverged into multiple isoforms late in the mammalian lineage, some in primates and others in humans. The known functions of the latter proteins do not suggest an obvious reason for their late divergence. However, the occurrence of ESTs for ECM proteins in many tissues combined with the observation that several proteins are known or suspected to have quite diverse functions suggests that unknown functions may hold the answer.

In undertaking these analyses, this work highlights both the incompleteness of datasets involving ECM proteins as well as inconsistencies in the way in which the annotation of ECM proteins is performed. For example, considerable variation was noted in the evidence associated with functional annotations of ECM proteins in different species within the GO resource.

Figure 2-16: Summary of binding sites on elastin. The figure depicts a general domain layout for tropoelastin (not to scale). This is a composite view of all possible exons not intended to represent elastin in any one species. Tracks mark exons mentioned specifically in the literature with respect to the binding of various ligands (depicted in blue blocks) and are organised by reference (black blocks). Ligands which have no literature support for a binding location are omitted. Ligands thought to bind the N-terminus or C-terminus at non-specific locations are arranged accordingly. The occurrence of question marks indicates failure to identify a specific binding site. Also shown in separate tracks are the composition of the recombinant elastin peptides explored in SPRi experiments (blue = present; red = absent).

Further, those annotations that are present in GO are not sufficient for uniquely identifying ECM proteins. Additional sources such as the LOCATE database reveal experimental evidence for the subcellular location of ECM proteins to be sparse, while computational predictions are relatively noisy. Recent attempts to extend our knowledge of ECM proteins include a systematic study combining transcriptome analysis with functional assays, which identified 16 previously uncharacterized ECM proteins34, as well as a machine learning study in which 13 informative classifiers were used to predict 20 novel ECM proteins from unannotated human genes in UniProtKB35. In this latter study, half of the 20 genes identified as novel were found to be annotated as ‘Extracellular region’ in GO while the other half were found to be supported by literature evidence. Here considerable relevant biological knowledge was found to be present in the literature and associated secondary source descriptions that do not yet appear in the structured ontologies. This impedes the effort to identify ECM proteins (and other subsets of functionally related proteins) through simple screening approaches and makes it difficult to evaluate the novelty of the resulting predictions. The structured approach proposed here to identify subsets of functionally related proteins has general applicability in such cases because it rapidly leverages the information in readily available sources while maintaining accuracy through mutual reinforcement. Significantly, this approach identified a list of 324 ECM proteins that was found to correlate very well with the independent, manually curated resource, MatrixDB158,203. Searching UniProt with both ‘extracellular matrix’ as a keyword and ‘human’ as species returns 244 reviewed proteins.

In terms of meta-annotations, disease associations for ECM proteins are sporadic in OMIM and while MeSH terms appear more useful they lack precision. Functional knowledgebases are similarly compromised. For example, GO suffers from a lack of coverage that precluded the association of functional terms to individual ECM modules. On the other hand, while UniProt and Entrez provide greater coverage, they lack the well-defined structure and depth associated with GO. While the presence of recognizable keywords could be discerned within these definitions for the purposes of identifying and categorizing extracellular proteins these descriptions are not readily machine-interpretable in assigning biological functions to modules. Clearly there is a need for improved ontologies associated with ECM proteins. In terms of expression data, few studies were found that focus explicitly on ECM proteins. While such proteins are intrinsically targeted in global expression studies, some ECM proteins have relatively narrow temporal windows of expression. For example, elastin is typically expressed in the aorta relatively early in development after which there is little additional turnover306,307. Hence the importance of the ECM in development is not well reflected in human expression studies that largely focus on expression in adult tissues. In contrast, considerable developmental data has been accumulated for the mouse model and future studies would do well to consider this resource as a way of bringing developmentally important modules into the analysis.

Protein interaction data appears equally sparse. While such datasets are becoming increasingly important to understand and link biological processes, they are clearly biased against certain classes of proteins such as membrane-associated and insoluble proteins (including ECM) in which this network is enriched187,308. In addition, previous estimates suggest only 10% of the human interactome is represented in current datasets compared with 50% for our knowledge of yeast interactions309. Therefore, it is likely that the current ECM network represents only a skeleton of the complete ECM interactome and may explain the occurrence of many smaller modules for which no function could be assigned. Nevertheless, significant enrichment was found for disease terms within predicted functional modules, demonstrating their biological relevance. The non-overlap among interaction datasets is somewhat accentuated in this study and reflects the relative paucity of interactions with network neighbours that could confidently be determined through curation to be biologically relevant to the ECM. From this it should be inferred that additional, experimentally determined interactions exist within these datasets that connect low confidence nodes. This offers considerable opportunity to explore the expansion of the network outwards from the high confidence network presented here. In addition, low quality (predicted) interactions involving proteins that may be of functional relevance to the ECM also exist and are not utilized here. The limited coverage and non-overlap of PPI data sets highlight the need to develop and apply additional methods and experiments to expand ECM network coverage. Currently several approaches are being pioneered in attempts to capture interactions that are likely to be relevant to this model. These include: affinity capture LC-MS/MS analysis based on recombinant ECM polypeptides217; a modified yeast two-hybrid method for membrane proteins (MYTH)187; an avidity based screen that relies on the production of recombinant protein ectodomains308; and protein and glycosaminoglycan arrays probed by surface plasmon resonance imaging which have recently been applied to study the interaction network of extracellular proteins or matricryptins216,310,311. Many of these techniques rely on the time consuming production of recombinant proteins. The map presented in this study may therefore serve to focus these efforts on areas of the network which are currently poorly represented by existing interaction datasets.

Confounding the construction of a single global ECM interaction network is the recognition that it is inherently dynamic and is expected to display considerable temporal and spatial variation across tissues. At present such variation can only be assessed through indirect methods such as the integration of expression and localization data. However, these datasets are themselves subject to error and reflect only a fraction of the conditions under which biological systems operate. Nevertheless, using these data has revealed tissue-specific patterns of expression of ECM related genes. For example, ECM gene expression in brain appears distinct from that in other adult tissues. In addition, approximately half of the genes demonstrate low or negligible levels of expression implying either that there are large numbers of low abundance proteins or, perhaps more likely, that these proteins are expressed in a temporal/spatial fashion distinct from the adult tissues surveyed.

Chapter 3 Novel domain architectures and promiscuous hubs contributed to the organisation and evolution of the ECM.

Portions of this chapter have been adapted from Cromar et al.312 (submitted).

I performed all analyses except as noted below and contributed to the development and verification of the software pipelines used. I interpreted the results, generated the figures and designed and programmed the domain pair propagation simulation which forms the statistical basis for the domain pair and higher order domain pattern analyses.

Ka-Chun Wong suggested the sequential pattern mining approach and ran PrefixSpan to define frequent sequential patterns. Xuejian Xiong provided constructive comments on the design of the statistical analysis for domain hubs. Tuan On wrote an early prototype of the domain analysis software which formed a conceptual basis for Figure 3-5. Hongan Song assisted in defining data mappings between PhyloPro and the downstream domain analysis pipeline. Noeleen Loughran ran PfamScan to generate raw domain information. Dr. Zhaolei Zhang provided helpful advice on the analysis of domain patterns.

3 Extracellular Matrix Domain Architecture

3.1 Introduction

The emergence of complex, multicellular organisms required innovation of biological processes facilitating a variety of new structures and functions. In vertebrates distinguishing features include specialized matrices such as cartilage, tendons, bones and teeth as well as a pressurized vascular system and a complex developmental program including neural crest migration. Technological advances in genetics and proteomics begun over the past decade continue to provide insights into the composition and organisation of these processes313-315. At the same time, the recent availability of metazoan genome sequences is beginning to shed light on the evolutionary forces driving their genesis and subsequent refinement41,48,49,265. Comparative studies of basal metazoans and their relatives, for example, suggest that homologues of many of the genes that drove the innovation of multicellularity arose in free living ancestors of metazoans48. These genes include several members of the ECM, a fundamental metazoan innovation that has come to play central roles in a variety of diverse functions including: respiration, feeding, reproduction, locomotion, osmoregulation, hemostasis and cognition47,316,317. However, beyond their origins, the subsequent evolutionary forces that have since guided the development of systems such as the ECM have remained largely unexplored.

The human ECM comprises ~324 proteins that self-organise into a complex array of fibres and ancillary proteins, providing essential scaffolding properties for arranging cells into tissues11,200,203. Despite comprising less than 2% of human protein-coding genes, ECM proteins make up between 25% and 35% of the total protein content in human tissues317. In addition to its structural properties, the ECM is also involved in morphogenesis through the mechanical regulation of cells and intercellular junction positioning318, as well as acting as a sink for a variety of growth factors that allows the system to rapidly respond to spatially relevant signals29. As shown in the previous chapter, while one third of human ECM genes are shared with other metazoans, the remaining two thirds appear to be recent vertebrate-specific innovations. What remains unclear is which mechanisms underpin these innovations and how the organisation of protein domains, as fundamental units of selection, contributed to ECM evolution.

Protein domains may be defined as conserved segments of proteins with distinct function. Often, they are autonomously folding. Although proteins containing only a single domain are in the minority, vertebrate proteins, particularly those targeted to the extracellular milieu, are enriched for multidomain architectures319-321. Through combining domains with complementary roles (e.g. catalysis, binding etc.) the emergence of novel combinations of domains represents a potentially rapid evolutionary vehicle for driving innovation. Such combinations may arise through a variety of mechanisms including gene fusion/fission, duplication, recombination, exon extension and retrotransposition63,65. At the whole proteome level, studies suggest that domain gain and loss events dominate with transposition, exon shuffling and recombination appearing to play only minor roles63,69,322,323. On the other hand, it is not clear if the relative contribution of these events is consistent across the entire proteome or whether distinct types of proteins and/or the biological systems in which they function influence the types of evolutionary forces that drive protein innovation. It has been assumed, for example, that exon shuffling is a predominant factor in ECM protein evolution due to the fact that many ECM domains are encoded as exonic units316. Furthermore, previous investigations of domain architectures have been limited to the study of adjacent domain pairs or triplets, neglecting the possible conservation of higher order domain architectures.

Characterized by large proteins whose multiple domains confer distinct physical and functional characteristics, the ECM represents an ideal system to investigate how patterns of domain evolution may vary for a given class of functionally related proteins. This chapter presents a systematic study of domain gain, loss and rearrangement events that have contributed to human ECM innovation. Sequential pattern mining324 was applied to reveal conserved patterns of higher order domain combinations and a domain-adjacency network approach was adopted to yield insights into the evolutionary trajectory of ECM domain combinations. Placed in the context of ECM modules previously defined on the basis of protein-protein interactions (PPIs), examples highlight how the evolution of domain architectures has resulted in the emergence of metazoan innovations such as a biomineralized skeleton and pressurized vascular system.

3.2 Materials and Methods

3.2.1 Source for Proteins

Protein sequences for 131 published eukaryotic genomes were obtained as previously described168. Conservation of proteins was determined using the longest peptide associated with each human ECM gene (termed the ‘reference’). Orthologues and paralogues (collectively referred to as the ‘targets’) were detected using the Inparanoid algorithm as in previous studies168,229. Protein conservation profiles were clustered using Cluster 3.0 to group ECM proteins with similar conservation patterns.

3.2.2 Source for Domains

Domain predictions were performed on the above sequences across all 131 species on a parallel computing platform using profile hidden Markov models (HMMER 3.0 with default parameters as implemented in PfamScan325). Data flow was handled in a data processing pipeline written in house using Perl and results were stored and manipulated using PostgreSQL. Reliance on the use of Pfam defined domains325, while subject to biases in the choice of organisms to generate seed alignments for the definition of domains, nonetheless provides a well-established framework to study domain evolution. The analysis included curated Pfam-A ‘domains’ and ‘families’ where a domain is defined by Pfam as a “structural unit” and a family is defined as “a collection of related protein regions”. Note: ‘motifs’ and ‘repeats’ as defined in Pfam were excluded because they did not meet the criteria for domains as independent folding units.

3.2.3 Domain Enrichment

The frequency of each domain in the human proteome and in the ECM subset was calculated using Perl scripts developed in house based on the PfamScan domain predictions described above. Domain architectures were pre-processed to remove large tandem duplications of domains (i.e. greater than two domains) which would otherwise inappropriately skew domain frequencies. This was done by iteratively removing duplicated domains where they occurred until only a single (A-A) pair remained. Proteins were then classified as either single or multi-domain depending on their domain content. Domains were classified as appearing in single or multi-domain proteins or both. Domain enrichment was determined using the hypergeometric test with FDR correction as implemented in MATLAB.
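The pre-processing and enrichment steps described above can be sketched as follows. This is a minimal Python illustration, not the Perl and MATLAB code used in the study. The function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption, since the text states only that an FDR correction was applied.

```python
from itertools import groupby
from math import comb

def collapse_tandem(arch):
    """Trim every run of identical adjacent domains to at most two copies,
    so that a single A-A pair survives from any longer tandem array."""
    out = []
    for name, run in groupby(arch):
        out.extend([name] * min(sum(1 for _ in run), 2))
    return out

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N items are drawn without replacement from a
    population of M items, n of which carry the domain of interest."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(collapse_tandem(list("AAAAB")))
print(hypergeom_sf(1, 10, 5, 2))
```

In this framing, M would be the number of domain occurrences in the human proteome, n the occurrences of one domain, N the occurrences in the ECM subset and k the occurrences of that domain within the ECM subset; that mapping is an interpretation of the text rather than a documented detail.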

3.2.4 Conservation of Domains and Domain Pairs

Domains found in human ECM reference sequences were compared with domains representing the full proteome of each species. As above, domain architectures were pre-processed to remove tandem duplications of domains which would otherwise inappropriately weight domain frequencies. A domain or domain pair was considered to be conserved if it appeared in at least one of the proteins in a given species. Domain pairs comprised adjacent domains defined in the N-terminal to C-terminal orientation. Reverse orientations were considered to be unique (i.e. A-B ≠ B-A). Patterns of domains and domain pair conservation were hierarchically clustered using Cluster 3.0.
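A minimal sketch of the domain-pair conservation rule follows, assuming each architecture is an N-to-C ordered list of domain names. The function names and the toy proteomes in the usage note are hypothetical and serve only to show that orientation matters.

```python
def adjacent_pairs(arch):
    """Ordered (N-terminal, C-terminal) pairs of neighbouring domains."""
    return set(zip(arch, arch[1:]))

def conservation_profile(reference_arch, proteomes):
    """For each pair in the reference architecture, record whether it
    occurs in at least one protein of each species."""
    profile = {}
    for pair in sorted(adjacent_pairs(reference_arch)):
        profile[pair] = {
            species: any(pair in adjacent_pairs(arch) for arch in archs)
            for species, archs in proteomes.items()
        }
    return profile

# Toy data: letters stand in for Pfam domains, 'sp1' and 'sp2' for species.
proteomes = {"sp1": [["A", "B"], ["C"]], "sp2": [["B", "A", "C"]]}
profile = conservation_profile(["A", "B", "C"], proteomes)
print(profile[("A", "B")])
```

Here the pair A-B is conserved in `sp1` but not in `sp2`, where only the reverse orientation B-A is present.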

3.2.5 Conservation of Domain Architecture

The domain arrangement of each reference sequence was compared with each of its corresponding target sequences. Tandem duplications of domains were included in this analysis. A Perl program, developed in house, was used to call domain gains, losses, gains and losses of domain repeats, and complex rearrangements based on the most parsimonious change. In order to render these results in a two-dimensional heatmap, the resulting change categories were colour-coded and the plot was confined to the longest target sequence in each species (the presumed orthologue of the reference). The analysis therefore reflects the diversity of domain architectures among the most similar homologues across the phylogeny. Domain architectures were hierarchically clustered using Cluster 3.0.
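The decision rules used by the in-house Perl program are not given here, so the following Python sketch is only one plausible way to assign the change categories named above. The rules, the category labels and the choice to describe changes in the target relative to the reference are all assumptions made for illustration.

```python
from collections import Counter

def classify_change(reference, target):
    """Label how a target architecture differs from the reference.
    The rules below are illustrative assumptions, not the thesis method."""
    reference, target = list(reference), list(target)
    if reference == target:
        return "identical"
    ref, tgt = Counter(reference), Counter(target)
    gained = set(tgt) - set(ref)  # domain types present only in the target
    lost = set(ref) - set(tgt)    # domain types present only in the reference
    if gained and lost:
        return "complex rearrangement"
    if gained:
        return "domain gain"
    if lost:
        return "domain loss"
    # Same domain types on both sides: compare copy numbers.
    diffs = [tgt[d] - ref[d] for d in ref]
    if all(x >= 0 for x in diffs) and any(x > 0 for x in diffs):
        return "repeat gain"
    if all(x <= 0 for x in diffs) and any(x < 0 for x in diffs):
        return "repeat loss"
    return "complex rearrangement"  # e.g. same content, different order

print(classify_change("AB", "AAB"))
```

A reordering with unchanged content, such as AB versus BA, falls through to the complex rearrangement category under these rules.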

3.2.6 Tandem Repeats

The total number of Pfam-A domains in each human ECM protein and the relative contribution of unique and non-unique domains were assessed using a simple counting method. The first occurrence of a domain was counted as unique and any subsequent occurrence of the same domain as non-unique. By definition the occurrence of non-unique domains is highly correlated with tandem repeats and for these purposes it was assumed they are equivalent with the caveat that some non-unique domains could be the result of two or more occurrences of the same domain separated by one or more other domains. For example, an alternating pattern such as ABABA would be recorded as having three non-unique domains and algorithmically this would not be distinguishable from AAABB, which has bona fide tandem repeats. To mitigate this source of error, only proteins with a positive ratio of non-unique domains to unique domains were considered to contain tandem domain repeats. To assess the significance of the observed association between late domain gains and non-unique domains, proteins were classified according to their domain architecture conservation patterns as having undergone pre-vertebrate (‘early’) or post-vertebrate (‘late’) domain gains and whether or not those gains were associated with repetitive domains (i.e. where the majority of domains in the human protein were classified as non-unique domains). The enrichment for late domain gains among proteins with large numbers of non-unique domains was then assessed using the hypergeometric test with Bonferroni correction.
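The counting method can be illustrated with a short Python sketch. The function names are hypothetical. The two example architectures are the ones discussed in the text and, as noted there, they give identical counts.

```python
def count_unique_nonunique(arch):
    """First occurrence of a domain counts as unique; every later
    occurrence of the same domain counts as non-unique."""
    seen = set()
    unique = non_unique = 0
    for domain in arch:
        if domain in seen:
            non_unique += 1
        else:
            seen.add(domain)
            unique += 1
    return unique, non_unique

def mostly_repetitive(arch):
    """True when the majority of domains are non-unique."""
    unique, non_unique = count_unique_nonunique(arch)
    return non_unique > unique

print(count_unique_nonunique("ABABA"))
print(count_unique_nonunique("AAABB"))
```

Both calls return two unique and three non-unique domains, which is exactly the ambiguity between interleaved and tandem repeats that the caveat above describes.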

3.2.7 Domain Alignment

Domain alignments were performed on unprocessed domain architectures (i.e. including domain repeats). An E-value cutoff of 0.01 was applied for inclusion of domains. Alignments were performed as follows. For each gene, each homologue was tokenized as a sequence of domains where domains were assigned an arbitrary letter code in sequence. For example, the first encountered domain was assigned the letter ‘A’ and the next unique domain the letter ‘B’ and so on. Thereafter, recurrences of domains within the same protein or within the same set of paralogues were assigned the corresponding letter code (e.g. ABCAA). The resulting short strings of letters were rapidly aligned using a custom Perl script.
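The tokenization step can be illustrated with a minimal Python sketch. The function name and the 26-letter limit are assumptions of this example; the custom Perl alignment script itself is not reproduced.

```python
import string

def tokenize_family(architectures):
    """Encode each architecture in a set of homologues as a string of letters.

    Letters are assigned in order of first appearance across the whole set,
    so a recurring domain receives the same letter in every homologue.
    """
    codes = {}
    tokens = []
    for architecture in architectures:
        letters = []
        for domain in architecture:
            if domain not in codes:
                if len(codes) >= len(string.ascii_uppercase):
                    raise ValueError("more than 26 distinct domains in one set")
                codes[domain] = string.ascii_uppercase[len(codes)]
            letters.append(codes[domain])
        tokens.append("".join(letters))
    return tokens, codes

tokens, codes = tokenize_family([
    ["EGF", "VWA", "TB", "EGF", "EGF"],  # domain names used for illustration
    ["VWA", "EGF"],
])
print(tokens)  # ['ABCAA', 'BA']
```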

3.2.8 Domain Adjacency Network

Using the ordered (N-terminal to C-terminal) architecture of domains occurring in the reference proteins, a list of domain pairs occurring in the human ECM and their orthologues from each species was constructed. These pairs were then used to construct a network of domain adjacency. The statistical significance of domain pairs was determined by comparing the frequency of each pair in the real human proteome with that of 10,000 simulated proteomes (see below). The reported p-value represents the number of times out of ten thousand simulations that a given pair was found as frequently as or more frequently than in the real proteome by random chance alone. The corresponding Z-scores were used to weight the edges of the subnetwork of all human pairs and clustered using MCL221 with a default inflation value of 2.1 to predict putative domain modules. To avoid the inclusion of negative edge weights not handled by MCL, the set of Z-scores was transformed by addition of a small positive value such that the lowest value became zero. Pfam to GO mappings326 were used to associate functional annotations with each domain. Modules were then annotated with the term(s) with the highest enrichment among the domains within the module as visualized in WordCloud233. Overlap between domain modules and PPI based protein modules (the latter defined and annotated in section 2.3.5) was accomplished by converting proteins within modules to their corresponding domain representation. Protein modules and domain modules with the largest number of overlapping domains were matched for the purpose of transferring annotations. To assess the significance of domain overlaps between domain and PPI-based modules the occurrence of domain overlaps in ‘real’ pairs were compared with the overlaps of the highest overlapping module in each of 10,000 randomized networks constructed with the same domain distribution. Networks were visualized using Cytoscape218.
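Two of the steps above, extracting ordered adjacent pairs and shifting Z-scores so that no edge weight is negative, can be sketched in Python as follows. Function names are hypothetical, and collapsing tandem repeats before pairing is an assumption based on the description of pre-processed architectures in section 3.2.11.

```python
def domain_pairs(architecture):
    """Return the set of ordered adjacent domain pairs in one architecture.

    Tandem repeats are collapsed first so that a domain is never paired
    with itself (an assumption of this sketch).
    """
    collapsed = [
        d for i, d in enumerate(architecture)
        if i == 0 or d != architecture[i - 1]
    ]
    return set(zip(collapsed, collapsed[1:]))

def shift_to_nonnegative(z_scores):
    """Add a constant to every Z-score so that the lowest value becomes zero."""
    lowest = min(z_scores.values())
    offset = -lowest if lowest < 0 else 0.0
    return {pair: z + offset for pair, z in z_scores.items()}

print(domain_pairs(["A", "A", "B", "C"]))
weights = shift_to_nonnegative({("A", "B"): -1.5, ("B", "C"): 2.0})
print(weights)  # {('A', 'B'): 0.0, ('B', 'C'): 3.5}
```

The shifted weights could then be written out as a weighted edge list for a clustering tool; the file format and command-line options are not shown because they are not specified here.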

3.2.9 Domain promiscuity

The weighted bigram frequency used by Basu67 was adopted as a measure of domain promiscuity (πi). For convenience the formula is reproduced here. This was originally derived from the Kullback-Leibler information gain formula:

$$\pi_i = \beta_i \log\left(\frac{\beta_i}{f_i}\right) \qquad (1)$$

The bi-gram frequency βi is:

$$\beta_i = \frac{T_i}{\frac{1}{2}\sum_{j=1}^{t} T_j} \qquad (2)$$

Where t is the number of distinct domain types. Ti is the number of unique domain neighbours of domain i and fi is the frequency of domain i in the genome, calculated as ni/N, where ni is the total count of domain i and N is the total number of domains detected in the given genome:

$$f_i = \frac{n_i}{N} \qquad (3)$$

Note πi is influenced by the number of network neighbours as well as by the number of detected domains. Since this precludes direct comparison of promiscuity scores between studies with different underlying domain sets, promiscuity scores were validated through rank comparisons with a previously generated set67.
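As a worked illustration, the score can be computed from a list of domain architectures with the short Python sketch below. It assumes the weighted bigram form πi = βi log(βi / fi), with βi = Ti / (½ Σ Tj) and fi = ni/N, as understood from the cited source. The use of the natural logarithm, the exclusion of self-adjacent (tandem) neighbours and the function name are further assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict

def promiscuity_scores(architectures):
    """Weighted bigram frequency for every domain with at least one neighbour."""
    counts = Counter()              # n_i: total occurrences of each domain
    neighbours = defaultdict(set)   # T_i = len(neighbours[i])
    for architecture in architectures:
        counts.update(architecture)
        for left, right in zip(architecture, architecture[1:]):
            if left != right:       # assumption: a tandem copy is not a neighbour
                neighbours[left].add(right)
                neighbours[right].add(left)
    total_domains = sum(counts.values())                           # N
    half_sum_t = sum(len(s) for s in neighbours.values()) / 2
    scores = {}
    for domain, partner_set in neighbours.items():
        beta = len(partner_set) / half_sum_t
        freq = counts[domain] / total_domains
        scores[domain] = beta * math.log(beta / freq)
    return scores

# Toy input: A neighbours both B and C, so it scores highest.
scores = promiscuity_scores([["A", "B"], ["A", "C"]])
print(sorted(scores.items()))
```

Domains that occur only in single-domain proteins have no neighbours and are omitted, which mirrors the restriction of the analysis to domains appearing in multi-domain proteins.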

3.2.10 Higher Order Domain Patterns

A frequent sequential pattern can be defined as an ordered set of domains which can be found in at least n proteins (support = n). For example, the sequential pattern (A,B,C) can be found in proteins with domain architectures: (A,B,C,D), (X,A,B,C), (Y,A,Y,B,C), (X,Y,A,Z,B,B,C). It should be noted that the pattern can be discontinuous as long as the ordering is preserved (in the example, A is followed by B which is then followed by C). PrefixSpan324 was used to find frequent sequential patterns in human ECM reference proteins and their orthologues in nine species representing basal metazoa / metazoa: M. brevicollis, A. queenslandica, H. magnipapillata, C. elegans, D. melanogaster, C. intestinalis, D. rerio, X. tropicalis, and G. gallus. Input files consisted of unprocessed domain architectures (i.e. including domain repeats) representing the presumed orthologues of the reference sequence (longest inparalogues). Since the presence of highly related sequences would tend to inflate the occurrences of patterns found in e.g. similar splice variants, the sequences were pre-filtered to remove redundant sequences. Calculation of percent similarity was based on BLAST output:

(4)

Sequences above 90% similarity were removed prior to pattern analysis. In implementing PrefixSpan, the only parameter, support, was set to 3 such that a frequent sequential pattern is defined as a sequential pattern found in at least 3 tuples (i.e. proteins). Since the inclusion of domain repeats led to exponential memory usage, for practical purposes output was limited to patterns involving four or fewer domains.
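The definition of support can be illustrated with a brute-force Python sketch. This is not PrefixSpan: it simply enumerates every ordered subsequence up to a maximum length, which is only practical for small inputs. Function names are hypothetical. The example uses the four architectures given earlier in this section.

```python
from collections import Counter
from itertools import combinations

def is_subsequence(pattern, architecture):
    """True if the pattern occurs in order, allowing gaps between domains."""
    remaining = iter(architecture)
    return all(domain in remaining for domain in pattern)

def frequent_patterns(architectures, support=3, max_len=4):
    """Return {pattern: number of proteins containing it} for patterns that
    reach the requested support. Each protein is counted once per pattern."""
    counts = Counter()
    for architecture in architectures:
        found = set()
        for length in range(1, max_len + 1):
            # combinations() preserves input order, so each tuple is an
            # ordered (possibly discontinuous) subsequence of the protein.
            found.update(combinations(architecture, length))
        counts.update(found)
    return {p: n for p, n in counts.items() if n >= support}

proteins = [list("ABCD"), list("XABC"), list("YAYBC"), list("XYAZBBC")]
patterns = frequent_patterns(proteins, support=4, max_len=3)
print(patterns[("A", "B", "C")])  # 4
```

A prefix-projection algorithm such as PrefixSpan avoids enumerating subsequences that cannot reach the support threshold, which is why it scales to larger inputs than this sketch.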

Domain patterns were hierarchically clustered using Cluster 3.0 grouping similar conservation patterns. The frequency of domains within conservation groups was visualized using WordCloud233. Domain patterns were organised using the Enrichment Map plug-in for Cytoscape327 into clusters representing related patterns with overlapping sets of domains. Statistical significance of domain patterns was determined by comparing the frequency of each pattern in the real human proteome with that of 10,000 simulated proteomes (see section 3.2.11 Simulated Proteomes). The reported p-value represents the number of times out of ten thousand simulations that a given pattern was found as frequently as or more frequently than in the real proteome by random chance alone.
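The empirical p-value described here, and a Z-score against the same simulated distribution, can be sketched in Python as follows. Function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def empirical_p_value(observed, simulated):
    """Fraction of simulations in which the count is at least the observed one."""
    hits = sum(1 for value in simulated if value >= observed)
    return hits / len(simulated)

def z_score(observed, simulated):
    """Standardize the observed count against the simulated distribution."""
    spread = pstdev(simulated)
    if spread == 0:
        raise ValueError("simulated counts have zero variance")
    return (observed - mean(simulated)) / spread

simulated_counts = [0, 1, 2, 3, 4]   # illustrative values, not thesis data
print(empirical_p_value(3, simulated_counts))  # 0.4
```

With 10,000 simulations the smallest non-zero p-value this procedure can report is 1/10,000.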

3.2.11 Simulated Proteomes

Simulated proteomes were created on a parallel computing platform to assess the significance of observed domain pairs and patterns relative to their occurrence at random. The proteomes were generated as follows. First, using Pfam-A domain predictions for the complete human proteome, domain frequencies and domain distributions (number of domains in each protein) were pre-calculated in the real proteome. To populate each simulated proteome, a set of ‘pseudo-proteins’ was constructed by randomly selecting domains (without replacement) from a pool reflecting the domain frequencies of the real human proteome. As domain pairs were created in the growing pseudo-proteins, selection of individual domains was paused and the pair was propagated across eligible pseudo-proteins a random number of times before individual domain selection resumed. Individual domains propagated as pairs continued to be removed from the domain pool during this process. If the availability of either domain in the pair was exhausted in the domain pool or if the random propagation limit for that pair was reached, the propagation of that pair ceased and individual domain selection was resumed. This process was continued until all domains in the pool were exhausted. For domain pairs, simulated proteomes were constructed using domain frequencies corresponding to the pre-processed domain architectures of human ECM proteins (i.e. without domain repeats) and the random placement of domains was constrained so as to prevent the random creation of tandem domain repeats. Random domain pairs resulting from these simulations therefore reflected the conditions used to evaluate domain pairs in the real proteome. As sequential pattern mining was performed on unprocessed domain architectures, simulated proteomes created for assessment of higher order domain patterns were not constrained with respect to the generation of tandem domain repeats.
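A much simplified version of this procedure is sketched below in Python. It preserves the overall domain frequencies and the number of domains per protein by drawing without replacement from a shuffled pool, and it can optionally avoid tandem repeats. The pair-propagation step described above is deliberately omitted, so this sketch is not equivalent to the simulation actually used. The function name and parameters are hypothetical.

```python
import random

def simulate_proteome(architectures, rng, forbid_tandem=False):
    """Build pseudo-proteins with the same lengths and domain pool as the input.

    `rng` is a random.Random instance, so runs are reproducible.
    """
    pool = [domain for architecture in architectures for domain in architecture]
    rng.shuffle(pool)
    simulated = []
    for architecture in architectures:
        protein = []
        for _ in architecture:
            index = len(pool) - 1   # default: next domain from the shuffled pool
            if forbid_tandem and protein:
                # Prefer the nearest pooled domain that differs from the last
                # one placed; fall back to a repeat if none is available.
                for j in range(len(pool) - 1, -1, -1):
                    if pool[j] != protein[-1]:
                        index = j
                        break
            protein.append(pool.pop(index))
        simulated.append(protein)
    return simulated

real = [["A", "B", "C"], ["A", "D"], ["E"]]
print(simulate_proteome(real, random.Random(0), forbid_tandem=True))
```

Repeating the call with different seeds would give the set of simulated proteomes against which observed pair or pattern frequencies are compared.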

3.3 Results

3.3.1 Evolution of the ECM is driven in part by the invention of novel domains

As a first step towards understanding the contribution of domains to the evolution of the ECM, domain distribution was examined. Here domain definitions were adopted as provided by the curated Pfam-A resource (see section 3.2 Methods). From the 4243 domains in the human proteome, 144 (~3.4%) are associated with ECM proteins, of which 101 are significantly ECM-enriched (p < 0.05, Hypergeometric test with FDR correction). Among the enriched domains, 35 are exclusive to ECM proteins (Table 3-1, Figure 3-1(a) and supplementary excel spreadsheet data files “SF15-18” on the accompanying CD). To assess the origins of these human ECM-enriched and ECM-specific domains, the occurrence of these 144 domains within the genomes of 131 fully sequenced eukaryotes was examined (Figure 3-1(b) and supplementary excel spreadsheet data files S5-S6 on the accompanying CD). Three distinct groups are apparent, corresponding to domains with eukaryotic, metazoan and vertebrate origins. Approximately four out of every five ECM domains (79.2%) are conserved but specific to choanoflagellates and metazoans, suggesting that the emergence and subsequent evolution of the ECM involved both the recruitment of a limited number of ancestral, pre-metazoan domains (20.8%) and the innovation of a larger number of novel domains, prior to the branching of the various metazoan lineages.

Among the ancestral, pre-metazoan domains recruited to the ECM (Figure 3-1(b)) are several that are ECM-enriched as well as one that is exclusive to the ECM in humans; MFAP1_C (PF06991) is a conserved C-terminal domain of human MFAP1 (microfibrillar-associated protein), an important component of elastin-associated microfibrils. Interestingly, the yeast orthologue PRP19 which also carries the domain plays a role in pre-mRNA splicing, highlighting the ability of apparently established domains to transpose their functionality. Note that the first observed occurrences of ECM-enriched and ECM-exclusive domains were distributed across all three evolutionary periods suggesting that domains arise through a continual process yielding opportunities to develop novel functionalities.

Table 3-1: Pfam-A domains found exclusively in ECM proteins (human proteome)

Pfam ID   Domain Name       Description
PF03146   NtA               Agrin NtA domain
PF05270   AbfB              Alpha-L-arabinofuranosidase B (ABFB)
PF05111   Amelin            Ameloblastin precursor (Amelin)
PF02948   Amelogenin
PF11598   COMP              Cartilage oligomeric matrix protein
PF06482   Endostatin        Collagenase NC10 and Endostatin
PF01413   C4                C-terminal tandem repeated domain in type 4 procollagen
PF07263   DMP1              Dentin matrix protein 1
PF11857   DUF3377           Domain of unknown function (DUF3377)
PF11918   DUF3436           Domain of unknown function (DUF3436)
PF06121   DUF959            Domain of Unknown Function (DUF959)
PF05782   ECM1              Extracellular matrix protein 1 (ECM1)
PF07474   G2F               G2F domain
PF08685   GON               GON domain
PF00396   Granulin          Granulin
PF00052   Laminin_B         Laminin B (Domain IV)
PF06008   Laminin_I         Laminin Domain I
PF06009   Laminin_II        Laminin Domain II
PF00055   Laminin_N         Laminin N-terminal (Domain VI)
PF09006   Surfac_D-trimer   Lung surfactant protein D coiled-coil trimerisation
PF00413   Peptidase_M10     Matrixin
PF05507   MAGP              Microfibril-associated glycoprotein (MAGP)
PF00865   Osteopontin
PF07175   Osteoregulin      Osteoregulin
PF03572   Peptidase_S41     Peptidase family S41
PF01471   PG_binding_1      Putative peptidoglycan binding domain
PF01549   ShK               ShK domain-like
PF06991   Prp19_bind        Splicing factor, Prp19-binding domain
PF06468   Spond_N           Spondin_N
PF08999   SP_C-Propep       Surfactant , N terminal propeptide
PF00683   TB                TB domain
PF05735   TSP_C             Thrombospondin C-terminal region
PF00965   TIMP              Tissue inhibitor of metalloproteinase
PF10511   Cementoin         Trappin protein transglutaminase binding domain
PF03762   VOMI              Vitelline membrane outer layer protein I (VOMI)


Figure 3-1: Conservation of ECM domains A – Numbers of ECM, ECM-enriched and ECM-exclusive domains relative to the total number of Pfam-A domains detected in humans. B – Domains occurring in ECM proteins across 131 species are represented as coloured tiles (present = yellow tiles; absent = olive tiles). Species are arranged (with plants on the left) according to established phylogenetic relationships168 (see Appendix 2) and domains were hierarchically clustered (city block, average linkage) into groups representing similar conservation patterns. Three broad groupings corresponding to domains of eukaryotic, metazoan and vertebrate origin are apparent (labeled to the left). The track to the right of the heatmap indicates domains that are significantly enriched in ECM proteins (P<0.05 by hypergeometric test with FDR correction) and those which are exclusive to ECM proteins (grey vs. black). Note that most of the ECM-enriched domains are of early metazoan origin (i.e. conserved with protostomes) despite the fact that two thirds of human ECM proteins do not have detectable orthologues outside the deuterostomes200, suggesting that novel domains do not account for the majority of novel proteins in vertebrates. A cluster of ECM-exclusive domains (boxed red) may have been associated with skeletal reorganisation during the fish-tetrapod transition. A second colour track indicates domain participation in single/both/multi-domain arrangements (green/yellow/red).

For example, the vertebrate-specific sub-group is enriched in ECM-exclusive domains associated with proteins involved in biomineralization (e.g. DMP1; PF07263, Osteoregulin; PF07175)328,329, enamel (e.g. Amelin; PF05111 and Amelogenin; PF02948)330,331, and bone remodeling (Osteopontin; PF00865)332. The process of skeletonization began relatively early in the vertebrate lineage corresponding to the transition from jawless to jawed vertebrates and continued with gradual increasing complexity and rearrangements in the four skeletal tissues (bone, dentin, enamel and cartilage)333,334. Given the role of these tissues in feeding, respiration and locomotion, the emergence of these novel domains after the split between fish and tetrapods is at least partly responsible for the transition to a land-based lifestyle335,336.

Given that protein domains appear to be continually recruited during evolution of the ECM, it is reasonable to hypothesize that more ancient domains are likely to play a dominant role in its organisation. This is consistent with previous studies that have shown that older domains tend to be present at higher frequencies than younger domains and that frequency is a strong predictor of domain promiscuity (i.e. the propensity for domains to form stable combinations with other domains)67. In the next section, the relationships between domain age and promiscuity and their role in the previously defined network of ECM protein interactions (from Chapter 2) are investigated.

3.3.2 Organisation of the ECM is mediated by a relatively small number of highly promiscuous domains

Previous studies have found extracellular proteins possess a relatively high incidence of promiscuous domains67,68. To examine if this is also a general feature of ECM proteins, a weighted bi-gram frequency metric67, which normalizes for domain frequency, was used to define the relative promiscuity of the 2282 (of 4243) Pfam-A domains that appear in multi-domain proteins in humans (Table 3-2 and supplementary excel spreadsheet data file “SF21” on the accompanying CD). Domains were grouped into three age categories based on their conservation across the Eukarya (E), Metazoa (M – including transitional metazoan and choanoflagellates) or Vertebrata (V). Of the three age categories, vertebrate-derived domains were characterized by domains of low frequency and promiscuity while metazoan and eukaryotic domains tended to encompass larger numbers of domains with higher frequency and/or higher promiscuity (Figure 3-2 and supplementary excel spreadsheet data file “SF22” on the accompanying CD).

On the basis of the top 10% of promiscuity scores (i.e. the 90th percentile), a weighted bi-gram frequency of 0.0021 was defined as a cutoff for ‘high promiscuity’ (Figure 3-3 and supplementary excel spreadsheet data file “SF21” on the accompanying CD). Of the 124 ECM-associated domains that appear in multi-domain proteins, 38 (30.6%) could be defined as highly promiscuous, which represents a significant enrichment compared to non-ECM associated domains (p < 1.5 × 10^-4, Hypergeometric test with Bonferroni correction) (supplementary excel spreadsheet data file “SF23” on the accompanying CD). There was no statistically significant correlation between promiscuity and domain age (p>0.05, Fisher’s Exact Test, supplementary excel spreadsheet data file “SF22” on the accompanying CD). Furthermore, given the sigmoidal distribution of promiscuity scores (Figure 3-4), it was concluded that promiscuity is not a general characteristic of ECM domains, but rather is restricted to a small subset of ECM domains.

With roles in protein binding, domains play a fundamental role in the organisation of protein-interaction networks337,338. Moreover, multi-domain proteins, with the capacity to form interactions with multiple partners, typically represent ‘hubs’ within networks. The relationship between promiscuous domains and their role in the organisation of a previously defined network of ECM proteins was therefore examined next. Within the 173 ECM network proteins with domain architectures, structural proteins were found to be significantly enriched in promiscuous domains (31/60, p < 0.005, Bootstrap sampling) (supplementary excel spreadsheet data file “SF24” on the accompanying CD). Furthermore, network hubs (defined as having a node degree ≥ 5) were significantly enriched in structural proteins (38/92 hubs, p < 0.05, Bootstrap sampling).

Table 3-2: Top 30 promiscuous domains in the human proteome

These are based on the weighted bigram frequency. For comparison, their rank in a similar list of 215 highly promiscuous domains in eukaryotes is included (mean promiscuity (π) value over 28 species67).

Domain       Name          Direct Neighbours   Co-Occurrence   Weighted Bigram Frequency   Found in ECM?   Enriched in ECM   Rank
PF00595.18   PDZ           52                  71              0.021795689                 Y                                 9
PF00169.23   PH            50                  77              0.014169779                                                   2
PF00018.22   SH3_1         39                  56              0.011726788                                                   3
PF00627.25   UBA           18                  26              0.011594278                                                   38
PF00533.20   BRCT          15                  26              0.010918809                                                   21
PF00397.20   WW            20                  24              0.010536495                                                   34
PF00628.23   PHD           27                  42              0.01048959                                                    4
PF00004.23   AAA           20                  22              0.009510768                                                   1
PF12796.1    Ank_2         52                  74              0.009320291                                                   7
PF00008.21   EGF           32                  51              0.009110883                 Y               Y                 28
PF00641.12   zf-RanBP      13                  16              0.00850045                                                    No match
PF07653.11   SH3_2         23                  37              0.007659358                                                   No match
PF07699.7    GCC2_GCC3     10                  17              0.007279206                                                   No match
PF00536.24   SAM_1         20                  28              0.007075622                                                   24
PF00093.12   VWC           14                  20              0.006970266                 Y               Y                 63
PF07647.11   SAM_2         15                  26              0.006860156                                                   No match
PF07648.9    Kazal_2       15                  28              0.006860156                 Y               Y                 98
PF00130.16   C1_1          17                  25              0.006780639                                                   8
PF00788.17   RA            15                  26              0.006563627                                                   31
PF07714.1    Pkinase_Tyr   27                  43              0.006378603                                                   20
PF00013.23   KH_1          14                  17              0.006332245                                                   147
PF00620.21   RhoGAP        19                  27              0.006246943                                                   18
PF00791.14   ZU5           9                   10              0.006195204                                                   159
PF00787.18   PX            16                  23              0.006141415                                                   14
PF00226.25   DnaJ          16                  17              0.006141415                                                   30
PF00610.15   DEP           11                  18              0.006089229                                                   43
PF07645.9    EGF_CA        32                  49              0.005995208                 Y               Y                 25
PF00629.17   MAM           10                  16              0.005971257                 Y                                 68
PF01585.17   G-patch       12                  14              0.005902413                                                   80
PF00621.14   RhoGEF        19                  32              0.005827127                                                   No match
PF00092.22   VWA           21                  31              0.005791992                 Y               Y                 67

Figure 3-2: Comparison of domain frequency vs. domain promiscuity For human ECM domains of Eukaryotic (blue), Early Metazoan (red) and Vertebrate origin (green) this scatter of domain frequency vs. domain promiscuity (weighted bigram frequency) illustrates that high promiscuity is limited to a small subset of domains.

Figure 3-3: Domain promiscuity cutoffs for human Pfam A domains at each percentile The cutoff score corresponding to the 90th percentile (top 10% of domains ranked by promiscuity scores) corresponding to a weighted bigram frequency > 0.002 was used to classify the threshold for ‘high promiscuity’ ECM domains (38 of the 124 ECM domains found in multi-domain architectures).


Figure 3-4: Distribution of promiscuity scores Weighted bi-gram frequency for 124 ECM domains appearing in multi-domain architectures.


Not surprisingly, hubs were significantly enriched in highly promiscuous domains (p < 0.05, Bootstrap sampling). Consistent with the proposed role of domains in network connectivity, the top three high promiscuity domains within the ECM as seen in Table 3-2 (PDZ, EGF and VWC; PF00595, PF00008 and PF00093 respectively), are all highly conserved and facilitate protein-protein interactions, functioning in both signaling and structural contexts. On this basis, it was hypothesized that the distribution of promiscuous domains (in particular their preferential appearance in hubs) is a critical determinant of PPI network topologies; a property that coincides with the tendency for hubs to be structural in the specific case of the ECM. Note that the appearance of PDZ as an ECM domain is unusual and due to the inclusion of ERBB2IP, a protein annotated in GO151 as a component of basement membrane. However, although likely to be functionally related, a review of the cited reference (Borg et al.339) suggests the annotation should be amended to ‘basolateral membrane’. Given the importance of promiscuous domains and hence multi-domain architectures in organisation of the ECM protein interaction network, the evolutionary dynamics of multi-domain arrangements and the contribution of domain gain, loss and rearrangement events on lineage specific innovations were examined next.

3.3.3 Domain gain is a major driving force for ECM innovation in the human lineage.

To obtain a global overview of domain gain and loss events, domain architectures were generated for orthologues of ECM proteins identified in 131 published eukaryotic genomes (Figure 3-5 and supplementary excel spreadsheet data file “SF25” on the accompanying CD). Across 33 deuterostome genomes, 62.8% of ECM orthologues had identical domain architectures, suggesting selective pressure to maintain architecture. Where changes were observed in human ECM protein domain architecture, domain gain was found to be more common than domain loss (28.9% vs. 5.2% of orthologues) (supplementary excel spreadsheet data file “SF26” on the accompanying CD).

Domain rearrangements, representing the shuffling of otherwise identical domain complements, were very rare (0.2%), and more complex changes (involving combinations of gain and loss events) were also uncommon (2.8% of orthologues). This is in keeping with conclusions from previous global domain studies65. As expected, within the protostomes a drop in conservation of domain architectures was noted (34.3% of orthologues) (supplementary excel spreadsheet data file “SF26” on the accompanying CD). This resulted from an increase in both domain gains and losses in human ECM proteins relative to their protostome orthologues (44.2% and 7.5% respectively), together with a relatively high number of more complex changes (13.6%) (supplementary excel spreadsheet data file “SF26” on the accompanying CD). Thus, as for other systems66,341, domain losses are more likely to be deleterious to ECM function than domain gains, the latter having the potential to provide additional lineage-specific adaptations.

Figure 3-5: Conservation of ECM architectures Each coloured tile in the heatmap (centre) represents the domain composition of a protein in a given species relative to the corresponding human reference orthologue. Differences in the domain composition or arrangement have been colour coded with e.g. fully conserved architectures in yellow (see key). Domain gains can be inferred here from a shift from red to yellow tiles across the phylogeny. Determination of domain composition is based on detection of Pfam-A families across 131 published eukaryotic genomes. Species are arranged (with plants on the left) according to established phylogenetic relationships168 (see Appendix 2). Proteins were hierarchically clustered (city block method, average linkage as implemented in Cluster 3.0340) into groups representing similar conservation profiles, numbered to the left (vertically). Orthology was determined using a previously published Inparanoid based method168 using the longest peptide sequence associated with the corresponding gene. Domain composition was based on the highest scoring orthologue to the human reference sequence. For each protein the total number of domains is plotted as a stacked bar graph (right) where the number of unique domains is shown in black and the number of repeated domains in red.

From Figure 3-5, in addition to instances of apparent lineage-specific gene losses within the protostomes (e.g. Group 3), there are instances where domain architectures are less complex than their deuterostome orthologues. For example orthologues of MMP14, 15, 16, and 24 contain fewer domains than are observed in vertebrates (Group 4). Conversely, within the protostomes, there are a few instances of lineage-specific domain gains (e.g. orthologues of WNT5A and WNT5B in Group 3, as well as SMOC1 and SMOC2 in Group 4). Outside metazoans, only a few instances of domain architecture conservation are found (CALR (Group 9), CHI3L1, MFAP1 and SMC3 (Group 11)).

Intriguingly, within the deuterostome lineage several groups were identified in which primates display a gain in domains relative to all other species (represented by a red to yellow tile transition in Figure 3-5; examples include top of Group 4, bottom of Group 5, Group 7, Group 8 and top of Group 9). To explore the timing of these events, domain alignments were constructed based on each ECM protein for all homologues detected across the 131 species (e.g. supplementary excel spreadsheet data files “SF27-31” on the accompanying CD). An alignment based on MMP2, for example, shows the relatively consistent architecture conserved across the deuterostomes, with only a sporadic loss of domains within certain species (Figure 3-6). At the same time, the conservation of the PG_binding domain (PF01471) with the peptidase domain in plants likely indicates an example of convergent evolution.

Focusing on proteins with a relatively high number of non-unique domains (Figures 3-5 and 3-6, and supplementary excel spreadsheet data files “SF27-31”), rather than domain gains occurring solely at the divergence of primates from other mammals, gains (as well as losses) occur throughout the deuterostome lineage. For example, perlecan (HSPG2) is composed of a conserved core of Laminin B and Laminin EGF domains, supplemented with increasing numbers of I-set domains. Fibulin 2 (FBLN2), an important element of elastic fibres, has acquired an ANATO domain (PF01821) initially detected in fish and subsequently duplicated in mammals. Finally, hepatocyte growth factor activator (HGFAC) demonstrates a mosaic of domain gains and losses throughout the deuterostome lineage.

To conclude, ongoing gain and/or loss of domains occurs throughout the deuterostome lineage rather than the sudden acquisition of novel domains in the primate lineage as might otherwise be inferred for groups 4, 7, 8 and 9 in Figure 3-5. Consistent with previous studies of domain evolution, domain gain events during ECM evolution appear to be more important in driving innovation than domain loss events. Compared to all human proteins in which vertebrate specific domains have been estimated at 12.3% (426 of 3465)66 it was found that 24.3% (35 of 144) of domains in ECM proteins are vertebrate specific. Nevertheless, the innovation of a proportionally larger number of vertebrate-specific ECM proteins (two thirds of human ECM proteins) suggests the involvement of additional mechanisms. In the next section the contribution of novel domain architectures on ECM evolution is investigated.

3.3.4 Novel ECM protein domain architectures are largely age-independent

The recruitment of additional domains to the ECM, in addition to providing intrinsic functionality, offers the potential to derive new functions through combination with other domains. Interestingly, compared to other human protein domains, ECM domains are significantly associated with multi-domain proteins (P<0.01 Chi square test; Figure 3-7). Of the 144 ECM domains, 62 (43.0%) are found exclusively in multi-domain proteins while an additional 67 (46.5%) are found both in single and multiple-domain proteins. This compares with 1665 (39.2%) and 775 (18.3%) respectively, for non-ECM domains. In general, ECM domains rarely occur in single-domain architectures and, further, where they are observed, they tend to be associated with more recent origins (Figure 3-1). Together these findings suggest that when new domains do occur, they tend to be integrated into multi-domain architectures.

Figure 3-6: Sample domain architectures Shown are the domain arrangements of human HSPG2, FBLN2, HGFAC, MMP2 and VCAN based on Pfam-A (upper figure) and the corresponding domain based alignments of homologous proteins detected across 131 species (lower figure). For the latter, homologues are arranged in phylogenetic order with humans at the top of the y-axis and domain architectures arranged along the x-axis. Note that the scale is arbitrary and that the number of species shown on the y-axis for each protein is different depending on the species conservation. Colours are randomly assigned to domains for visualization purposes and are consistent within proteins and not necessarily between proteins.

To explore this further, the incidence of human ECM domain pairs across 131 eukaryotes was examined (Figure 3-8 and supplementary excel spreadsheet data file “SF32” on the accompanying CD). While 28 of 144 domains precede the emergence of metazoans, with a single exception involving SMC3, all domain pairs appear unique to metazoans. SMC3 is a central component of the cohesin complex where it is involved in spindle pole assembly, perhaps its original function. However, post-translational addition of chondroitin sulfate gives rise to an alternate function as the secreted proteoglycan bamacan, an abundant basement membrane protein249,250. Also, 127 of 205 (62%) domain pairs found in humans were restricted to deuterostomes (and vertebrates in particular, Figure 3-8). Within this group, a highly conserved set of domain pairs (Group 5b) was observed, together with domain pairs that are less widely conserved (Groups 4 and 5a). This implies that in addition to a core-conserved matrix, flexibility in domain pairings may account for lineage specific innovations.

Defining domains as eukaryotic (E), metazoan (M) or vertebrate (V) in origin, the source of domains driving new combinations was examined (Figure 3-1). In general, the frequency of the observed pair combinations closely matched the expected frequency of a binomial distribution (p > 0.05 Chi-square Goodness of Fit test). Of the 90 vertebrate-specific domain pairs 34 (37.8%) involved at least one vertebrate domain, with only 7 (7.8%) involving two vertebrate domains (V:V) (Figure 3-9 and supplementary excel spreadsheet data files “SF33-34” on the accompanying CD). These findings suggest that the emergence of novel domain combinations associated with the human ECM is not dependent on domains of recent origin, but arises through sampling of existing domains. Notably, eukaryotic (E:E) domain pair combinations were significantly enriched (p < 0.0005, Chi-square goodness of fit test), highlighting the capacity of even ancient (and presumably already well-sampled) domains to contribute to new functional contexts for the evolution of the ECM. From these observations it was concluded that the generation of novel domain combinations is an important factor driving lineage specific innovations in the ECM and is not related to domain age. The next section examines whether related domain architectures comprise functionally relevant modules within the ECM network.

3.3.5 Network analyses of domain adjacency reveal domain-based functional ‘modules’ that display clade-specific rewiring.

To visualize functional relationships between domain pairs in the human ECM, a domain adjacency network was constructed, comprising 117 nodes (domains) connected by 201 edges (Figure 3-10). The domain architecture of each human ECM protein is represented as a directional walk on this graph, which shows domains that are frequently adjacent in proteins as densely connected neighbourhoods.
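As a rough illustration of how such a network can be assembled, the sketch below turns each domain architecture into a walk of directed N-terminal to C-terminal edges and counts how often each edge occurs. The protein names and architectures are hypothetical, and counting every occurrence of a pair (rather than once per protein) is an assumption of this example.

```python
from collections import Counter


def adjacency_edges(architectures):
    """Count directed (N-terminal, C-terminal) adjacent domain pairs."""
    edges = Counter()
    for domains in architectures.values():
        # Each consecutive pair of domains is one step of the walk.
        for left, right in zip(domains, domains[1:]):
            edges[(left, right)] += 1
    return edges


# Hypothetical architectures, listed N-terminal to C-terminal.
architectures = {
    "protA": ["EGF", "EGF", "Laminin_G"],
    "protB": ["VWD", "C8", "TIL"],
    "protC": ["EGF", "Laminin_G"],
}

edges = adjacency_edges(architectures)
nodes = {domain for pair in edges for domain in pair}
```

The resulting edge counts could be loaded into any graph library for clustering and layout; tandem repeats such as EGF followed by EGF appear as self-loops.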

Clustering of the network using the Markov clustering algorithm221 revealed 15 putative domain modules consisting of 3 or more domains. Exploiting annotations previously generated for functional modules based on protein-protein interaction data from the previous chapter, together with Gene Ontology mappings, several modules were revealed to be enriched for specific biological processes (p < 0.005 – see section 3.2 Materials and Methods). These included for example, calcium ion binding (module 2), cell adhesion (module 3) and activity (module 4). However, in general statistically enriched functional annotations were limited, likely due to: 1) limited overlap between domain-based and protein-based modules; 2) limited annotation coverage associated with ECM proteins; and 3) biases in functional annotation schemes towards protein-based rather than domain-based annotations342. Despite these challenges, domain modules nevertheless offer a visualization of common architectural motifs that clearly cluster (albeit broadly) around matrix-associated functional themes.

Figure 3-7: Proportion of Pfam A domains found in single and multi-domain contexts. Human ECM proteins vs. all human proteins.

Figure 3-8: Conservation of ECM domain pairs. The occurrence of yellow tiles indicates the presence of a specific domain pair in the given species whereas absence is denoted by a blue tile. Domain pairs occurring in ECM proteins across 131 species are shown, excluding pairs not conserved in humans (species specific pairs are shown in Figure 3-11 and are detailed in a supplementary excel spreadsheet data file “SF34” on the accompanying CD). Species are arranged (with plants on the left) according to established phylogenetic relationships168 (see Appendix 2) and domain pairs are ordered by hierarchical clustering (city block method, average linkage as implemented in Cluster 3.0340) according to their conservation pattern.

Domain pairs of early metazoan / protostome origin (groups 1, 2 and 3) are easily distinguished from pairs of deuterostome / vertebrate origin (group 4 and 5) and ancient pairs (group 6). The majority of domain pairs are found significantly more frequently than in a randomized model of domain pair propagation (right hand colour track).

Figure 3-9: Origin of Vertebrate specific domain pairs Relative frequency of domain pairs in humans comprised of domains with various combinations of domain origin (blue) versus the expected frequency based on the binomial distribution (red) given the frequency of individual domains in each age category. The asterisk indicates that domain pairs consisting of two eukaryotic domains were observed more frequently than expected and this difference was statistically significant. Vertebrate specific domain pairs consisted of more non-vertebrate than vertebrate domains (inset).

Within these functionally themed modules, an expansion of vertebrate-specific pairings was observed that extends the variety of ECM domain architectures that first appeared in protostomes. For example, module 3 (cell adhesion) transitioned from a core comprised of EGF and several laminin domains to a larger module in which the EGF domain serves as a central ‘hub’ for a variety of ECM-based architectures. Vertebrate-specific domains such as FN1, FN2, and COLF serve to form further connections with additional modules (1, 6, and 9) with functions in matrix remodeling and protein binding. In addition, vertebrate-specific domain pairings were responsible for the emergence of new modules, such as modules 9, 11 and 12.

In addition to novel links, the network also displays re-wiring of domain relationships across evolution. For example, modules 2, 7, 8, 11 and 14 are generally well-conserved with arthropods but many domain relationships are not present in basal metazoans or . Conversely, several domain pair relationships appear to have been lost in arthropods (e.g. module 13 and between modules 1 and 2), indicating either sequence divergence, loss of the function provided by the domain combination, or recruitment of additional proteins to replace the function. Apparent losses include a number of domain combinations involving Kazal, Laminin and CUB domains. Importantly, it is the unique combination of these domains that has been lost rather than the domains themselves. Serine Protease Inhibitors (Serpins), in which Kazal domains are found, are involved in protection against autophagy in metazoan digestive systems343. While homologues have been found in insects, they appear to be highly specialized and in some cases structurally diverse344,345.

Among orthologues of human ECM proteins, a large number of poorly conserved, species-specific domain pairs were observed (305/510), suggesting the recruitment or shuffling of domains (Figure 3-11 and supplementary excel spreadsheet data file “SF35” on the accompanying CD). This indicates that domain shuffling, while random and widespread, nevertheless created in any given species only a fraction of the possible number of functional domain pairs.

These evolutionary analyses suggest complex patterns of rewiring contributed to clade specific differences in the usage of otherwise conserved domain pairs. However, it is becoming clear that conservation is not limited to pairs of domains but may extend to higher order domain patterns (e.g. triplets or quartets of domains)346-348. The next section considers the potential for higher order architectures to comprise units of selection.


Figure 3-10: Domain adjacency. A – Network of domain pairs where edges represent adjacency of domains in the indicated N-terminal to C-terminal orientation (arrows). The statistical significance of each domain pair was determined through comparisons with randomly constructed proteomes (see section 3.2 Materials and Methods) and used to weight each edge by Z-scores. Note that despite the appearance of thinner edges (due to scaling), the majority of real domain pairs occur significantly more frequently than in randomized simulations. Edges are coloured according to domain pair conservation groups defined in Figure 3-8 (upper inset). Node colours correspond to domain age categories as defined in Figure 3-1(b) (lower inset). MCL clusters representing putative domain modules are numbered and encircled for emphasis. Node size is proportional to betweenness centrality.

3.3.6 Patterns of ECM domain usage extend to conserved higher-order architectures

To reveal significant, repeated higher-order patterns within the human ECM, PrefixSpan324, a sequential pattern mining algorithm, was run on the domain architectures of ECM proteins to identify frequent sequential patterns (see section 3.2 Materials and Methods). Here, a frequent sequential pattern is defined as an ordered (although potentially discontinuous) set of domains identifiable in at least three proteins. For example, the sequential pattern (A,B,C) can be found in proteins with domain architectures: (A,B,C,D), (X,A,B,C), (Y,A,Y,B,C), (X,Y,A,Z,B,B,C). There were 589 patterns identified, of which 510 were determined to be significant in human ECM proteins (p < 0.05, Bootstrap resampling; supplementary excel spreadsheet data file “SF36” on the accompanying CD). Further analyses focused on the 490 patterns which were most significant (p < 0.005), of which 256 were comprised of four domains while 150 were comprised of three domains (supplementary excel spreadsheet data file “SF37” on the accompanying CD). For each pattern, its conservation across 10 representative metazoans was examined (Figure 3-12(a) and supplementary excel spreadsheet data file “SF38” on the accompanying CD).
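The matching criterion in the example above (ordered, but not necessarily contiguous) can be sketched as follows. This is not an implementation of PrefixSpan itself; it only checks whether a given pattern occurs in an architecture and counts its support. The function names and the extra non-matching architecture are hypothetical.

```python
def contains_pattern(architecture, pattern):
    """True if `pattern` occurs in `architecture` in order, gaps allowed."""
    remaining = iter(architecture)
    # `domain in remaining` advances the iterator past the first match,
    # so later pattern elements must be found further along the protein.
    return all(domain in remaining for domain in pattern)


def support(pattern, architectures):
    """Number of architectures that contain the pattern."""
    return sum(contains_pattern(arch, pattern) for arch in architectures)


# The four architectures from the text, plus one hypothetical non-match.
architectures = [
    ("A", "B", "C", "D"),
    ("X", "A", "B", "C"),
    ("Y", "A", "Y", "B", "C"),
    ("X", "Y", "A", "Z", "B", "B", "C"),
    ("C", "B", "A"),
]
pattern = ("A", "B", "C")

pattern_support = support(pattern, architectures)
# Frequency threshold as defined in the text: at least three proteins.
is_frequent = pattern_support >= 3
```

A full pattern miner would enumerate candidate patterns rather than test a single one, which is the part that PrefixSpan handles efficiently.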

Surprisingly, while there were lineage-specific pattern losses in worm, only a single fly-specific loss was observed. Amongst the former are patterns involving domains TIL, C8 and VWD, which occur in the human proteins: VWF, TECTA, OTOG and two mucins, MUC5A and MUC6. Consistent with the missing domain patterns, despite worm possessing homologues for at least some of these proteins, the C8 domain appears to have been lost within the nematode lineage (Figure 3-1). Interestingly, patterns based on these domains are similarly missing in fish; this is related to an inability to detect orthologues of VWF, TECTA, OTOG and MUC6 in this lineage. However, since VWF is an essential clotting factor known to be present in teleosts349 it is likely that this protein is present but divergent from the human reference sequence.

Figure 3-11: Conservation of ECM domain pairs (all species). A – Domain pairs were broadly categorized as conserved in vertebrates (purple), early metazoa (red) or lineage specific (green). Note that only domain pairs found in humans were included in the statistical simulation to determine the significance of domain pairs. B – Directed network of domain pairs with edge thickness indicating total frequency of domain pairs (all proteins and species) and numbered edges indicating the number of species in which the domain pair occurs. Edges are coloured according to the conservation groups defined in part A.

Orthologues of TECTA were detected in other fish lineages, again suggesting difficulty in detecting the Danio rerio orthologue.

To examine potential overlap in patterns arising from similar domain architectures, an enrichment map was constructed in which nodes represent discrete patterns and links connecting nodes indicate that the patterns contain common domains (Figure 3-12(b) and supplementary excel spreadsheet data files “SF39-42” on the accompanying CD). There were 13 such pattern groups identified, sharing similar domain composition. Rather than pattern groups being composed of a mosaic of all conservation groups, each group displays a limited set of conservation patterns. For example, groups 1, 5 and 12 are composed of patterns associated with various groups of vertebrates, while group 8 patterns are associated with protostomes and deuterostomes. In addition, we also observe four large groups associated with a larger range of conservation patterns, suggesting an expansion of patterns for these groups within distinct lineages. For example, pattern group 2 contains conservation patterns associated with basal metazoans, together with those that later emerged in the vertebrate lineage, highlighting the ability of apparently fixed domain architectures to acquire new domains that may help drive lineage-specific adaptations.
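The grouping of patterns by shared domains can be approximated with a short sketch: patterns are linked when they have domains in common, and connected components of the resulting graph are treated as pattern groups. The threshold of one shared domain, the function name and the example patterns are assumptions made for illustration and may differ from the settings used to build the enrichment map.

```python
from itertools import combinations


def pattern_groups(patterns, min_shared=1):
    """Group patterns into connected components of a shared-domain graph."""
    neighbours = {p: set() for p in patterns}
    for p, q in combinations(patterns, 2):
        # Link two patterns if they share enough distinct domains.
        if len(set(p) & set(q)) >= min_shared:
            neighbours[p].add(q)
            neighbours[q].add(p)

    groups, seen = [], set()
    for start in patterns:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical patterns (tuples of domain names).
patterns = [
    ("TIL", "C8", "VWD"),
    ("VWD", "C8"),
    ("EGF", "Laminin_G"),
    ("EGF", "EGF", "FN1"),
]
groups = pattern_groups(patterns)
```

Raising `min_shared` makes the grouping stricter and tends to split large groups into smaller ones.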

3.4 Discussion

The ECM is a defining feature of metazoans consisting of secreted proteins that self-assemble into a complex meshwork of fibres. Thus connected, they provide essential structural properties, as well as a platform around which to organise and translate mechanical and chemical signals into a complex body plan. As outlined in Chapter 2, out of 357 genes comprising the core ECM network, approximately two thirds of the components of the ECM were found to represent recent vertebrate-specific innovations200. Since domains represent independently folding three dimensional units of selection, the contribution of domain architecture to ECM innovation was specifically examined in this chapter. The creation of multi-domain proteins has accelerated in the metazoan lineage resulting in a rich diversity of domain architectures350. Compared to other human proteins, the present analysis reveals that ECM proteins are significantly enriched in multidomain architectures, highlighting the importance of domains in driving the evolution of the ECM. Previous studies of multidomain proteins suggest domain arrangements occur largely through gene fusion, repeat expansion and subsequent domain loss that preferentially occurs at the termini63. With the availability of the well-curated dataset derived in Chapter 2, the intention was to examine whether such a model extends to ECM proteins, or if instead other factors have shaped the evolution of the ECM.

Figure 3-12: Higher order domain patterns. A – Conservation of higher order domain patterns across 10 representative species. Domain patterns were ordered by hierarchical clustering along the y-axis (Euclidean method, Average linkage as implemented in Cluster 3.0340) to group patterns with similar conservation profiles (visualized by the colour bar on the left). Species were arranged (with plants on the left) according to established phylogenetic relationships168 (see Appendix 2). Sequential patterns were defined using the PrefixSpan algorithm and represent combinations of up to four domains occurring in three or more proteins. In these patterns domains need not be contiguous (see section 3.2 Materials and Methods). B – Clusters of related domain patterns. Nodes represent patterns and edges represent the co-occurrence of domains in adjacent patterns. Node colours relate these groups to their conservation profile (A) and to specific domains (C) whose relative frequency of occurrence in higher order domain patterns within groups is shown as a series of WordClouds233.

Based on the occurrence of domains within orthologous proteins across 131 eukaryotic species it was inferred that the emergence of the ECM involved the recruitment of extant domains together with the innovation of new domains and new domain combinations. Subsequent evolution of ECM proteins was driven through a process of domain gain with evidence of rare, clade-specific losses and rarer still domain rearrangements. The recruitment of domains to the ECM appears to have been highly selective, with 109 of 144 being significantly enriched in ECM proteins, including 35 exclusive to the ECM. Interestingly, these 109 domains appear to have arisen throughout the evolution of eukaryotes, suggesting an ongoing recruitment of domains and associated accumulation of novel functions. The general rarity of domains of vertebrate origin compared to those of more ancient origin is consistent with previous reports of domain age in which only 12.3% (426 of 3465 domains) of all human domains were considered to be vertebrate in origin66. These investigations therefore indicate that ECM proteins are enriched in vertebrate-specific domains, at 24.3% (35 of 144). While it must be acknowledged that these findings may be impacted by a reliance on domain detection algorithms that may fail to identify divergent members (e.g. not detected here were insect serpins, which are known to be highly divergent from their human counterparts)344,345, there is no indication that ECM domains display a wider spectrum of diversity than other domains and hence comparisons between ECM and non-ECM proteins remain valid. It can therefore be concluded that novel domains were important in establishing new ECM functions.

In any given generation, the pool of existing domains has the potential to form novel domain architectures through rearrangement and recruitment of existing and novel domains. Recent studies probing the mechanisms of de novo domain creation and recombination suggest that the majority of vertebrate specific (i.e. new) domains first emerged as single-domain proteins66.

However, most de novo domain gains in ECM proteins were herein found to take place in the context of existing proteins or as fusions rather than as singleton domains in new genes. This suggests that, given the highly interconnected nature of the ECM, in which many proteins physically interact, the emergence and subsequent recruitment of new domains to the ECM occurs under unique selective pressures that drive their integration into existing multi-domain architectures.

Proteins involved in extracellular structures were previously found to be enriched in promiscuous domains (i.e. those occurring in the context of a wide number of other domains)67. However, given that ECM proteins are mainly composed of domains either enriched for or specific to ECM proteins, it is unlikely that they are highly promiscuous. Indeed the majority of ECM-associated domains were found here to be characterized by low frequency and promiscuity, with only a small fraction of ECM domains displaying high promiscuity. Among these latter domains were the PDZ and EGF domains, which have previously been suggested to drive the formation of so-called ‘hub-proteins’; proteins which form large numbers of interactions with other proteins and hence mediate pivotal roles in network organisation337. Consistent with this, ECM hubs were found to be significantly enriched in highly promiscuous domains. This suggests that while physical, developmental and tissue specific properties of the matrix may be influenced by a wide variety of domains, the organisation of the matrix is mediated by a relatively small number of highly promiscuous domains that have emerged as the basis for network hubs.

The emergence of novel domains played an essential role in the establishment of the matrix. However, within the 144 ECM associated domains, only 35 (24.3%) were determined to be of vertebrate origin. Hence, given that two thirds of human ECM proteins lack detectable homologues outside vertebrates, subsequent evolution of the ECM was likely driven through mechanisms other than domain innovation. Focusing on domain co-occurrence, approximately two thirds of ECM-domain pairwise combinations were found to be unique to vertebrates. Of these, 38% (34/90) involve at least one vertebrate-specific domain, while 75% (67/90) involve at least one domain of metazoan origin. This suggests that ECM innovation in vertebrates was largely driven through the generation of novel domain combinations. Furthermore, despite the apparent enrichment of ECM for vertebrate specific domains, pairwise patterns of domain age were consistent with a model of random domain propagation in which new domains participate in a continuous process of random domain assortment from which new functions arise as much from novel domain combinations as from new domains themselves342. As an aside, it was noted that older domain combinations composed of two domains of pre-metazoan origin, were the only pairs statistically over-represented in ECM proteins (p < 0.0003). Such preferential recruitment might be explained through the potential impact of novel and consequently disordered domains disrupting existing biological functions; established domains, by their very nature, being less likely to cause such disruptions323.

The dependency on new domains to provide novel innovations in vertebrates is further minimized through the reuse of existing domains either through tandem duplications, or reordering resulting in new functions347. For example, tandem repeats accounted for the most frequent ECM domain pairs. Domain repeats are often expanded through duplications of several domains at a time, a process facilitated through most ECM domains being encoded in single exons316,351. Consistent with previous studies of general vertebrate proteins352, ECM proteins comprised of a majority of non-unique domains were shown to be enriched for recent (i.e. after the emergence of vertebrates) domain gains (P < 0.05, hypergeometric test with Bonferroni correction); pre-vertebrate domain gains corresponded to low numbers of domain repeats (i.e. are associated with the gain of novel domains). Furthermore, for ECM proteins, repeats appeared to be preferentially enriched within structural proteins (P < 0.005, Chi Square goodness of fit test) supporting previous suggestions that domain repeats are driven by large structural complexes62. Beyond tandem duplications, ECM domain pairs appear bi-directionally (in both a forward (A-B) and reverse (B-A) orientation) more frequently than expected. Excluding 41 identical pairs, 12.8% (21 of 164) of human ECM domain pairs were bi-directional compared to previous estimates of 3-6% for all proteins346. Interestingly, previous studies347,348 suggest the forward and reverse domain arrangements result in different functions, supporting the suggestion that such arrangements reduce the reliance on generating novel domains to drive innovation.
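A minimal sketch of how bi-directional pairs might be identified from a list of directed domain pairs is given below. Identical pairs (A-A) are excluded, as in the text, but the counting convention (each orientation counted as a separate directed pair) is an assumption of this example and may differ from the one behind the 12.8% figure. The function name and pairs are hypothetical.

```python
def bidirectional_counts(pairs):
    """Return (bi-directional directed pairs, all non-identical directed pairs)."""
    # Drop tandem repeats of the same domain and any duplicate entries.
    distinct = {(a, b) for a, b in pairs if a != b}
    # A pair is bi-directional if its reverse orientation is also present.
    both_ways = {pair for pair in distinct if (pair[1], pair[0]) in distinct}
    return len(both_ways), len(distinct)


# Hypothetical directed pairs (N-terminal domain, C-terminal domain).
pairs = [("A", "B"), ("B", "A"), ("A", "A"), ("C", "D"), ("D", "E")]

n_bidirectional, n_pairs = bidirectional_counts(pairs)
fraction = n_bidirectional / n_pairs
```

Counting each A-B/B-A couple once instead of twice would halve the numerator, so the convention should be stated alongside any reported percentage.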

To further elucidate origins of domain pairings, a network-based approach was applied to identify domains of eukaryotic and metazoan origin recruited into vertebrate-specific combinations. Of 205 domain pairs found in humans, 74 appeared to be conserved across metazoans. Over 100 domain pairs are recent additions in vertebrates with a further 25 specific to mammals. This surge of vertebrate innovation, driven by the acquisition of novel domain arrangements, may have been facilitated by whole genome duplications within the vertebrate lineage57. While many aspects of vertebrate skeletal evolution remain unclear333,334, the identification of several domain combinations appearing after the split between teleosts and tetrapods suggests a potential role in the evolution of skeletal tissues during the transition to life on land. Also identified were an additional 305 ECM domain pairs, absent from humans and poorly conserved elsewhere, indicating that other metazoan lineages have acquired their own complements of novel domain combinations (Figure 3-11 and supplementary excel spreadsheet data file “SF34” on the accompanying CD).

An important contribution of this study is the application of sequential pattern mining as a method to investigate higher order domain architecture patterns and their conservation. Vogel et al.353 have previously suggested that contiguous two and three domain combinations can result in evolutionarily conserved three-dimensional structures, termed ‘supra-domains’. Subsequent studies of domain rearrangements further showed that domains do not necessarily need to be contiguous in order to contribute to a conserved three-dimensional fold63,354,355. The analyses herein revealed a gradually increasing pattern of complexity of ECM domain architectures across orthologues wherein higher order patterns (inclusive of domain pairs) tend to accumulate, increasing the complexity of domain architectures; a general process some have termed ‘accretion’356. These are accompanied by clade specific losses suggesting that, as for domains and domain pairs, higher order patterns represent units driving evolutionary change. For example, loss of patterns involving TIL, C8 and VWD domains in nematodes correlates with fewer paralogues of VWF, MUC5A and TECTA proteins, together with the absence or divergence of OTOG and MUC6 proteins in this lineage.

In summary, the emergence of metazoan life involved the innovation of a large number of novel ECM domains. Vertebrates subsequently exploited these domains through the generation of novel domain combinations to yield systems such as a biomineralized skeleton, a network of elastic fibres and a variety of organ systems supported by an array of specialized matrices.

Chapter 4 Summary and Future Directions

4 Conclusions

4.1 Summary

Through the development of a systematic protocol to leverage functional annotations from secondary source databases this work facilitated the definition of the ECM, a biological system of interest and relevance to many critical tissue functions. An overview and analysis of the ECM interactome was presented, derived from publicly available datasets and supported by experimental evidence. This survey, enumerating known ECM proteins, their interactions, associated meta-data and domain relationships, highlights the current state of knowledge of this system and brings up new questions.

The ECM network includes 357 core genes, together with an additional 524 genes that mediate related functional roles. Over 30 functional modules were identified that provide insights into the organisation and operation of the ECM as a system and which may also be exploited for transferring both functional and disease-based annotations to previously uncharacterized genes. Evolutionary analyses revealed that approximately two thirds of the components of the ECM are recent vertebrate-specific innovations. By integrating evolutionary and protein expression datasets it was revealed that modules appear to be constructed of proteins displaying a mosaic of evolutionary trajectories, suggesting that module innovations were widespread and evolved in parallel to convey tissue specific functionality on otherwise broadly expressed modules.

Consistent with the current consensus model of domain evolution in which the accretion of domains and domain combinations and selective losses lead to increasingly complex, multi-domain architectures, this study has shown that the major driving force for human ECM evolution has been the innovation of novel domain combinations, rather than novel domains. These domain combinations, which we have extended to include higher order patterns, have evolved to support the unique requirements of the ECM in its dual role as both a supporting structure and dynamic signaling component in complex living systems. Specific domains of eukaryotic, metazoan and vertebrate origin were identified which, independent of their age, gave rise to clade-specific domain combinations. However, the prevalence of older domain pairs among the large number of vertebrate-specific pairs suggests the mechanism of novel domain acquisition by ECM proteins may be different than for other proteins; dominated by domain fusion and the recruitment of additional domains to existing architectures rather than the de novo creation of independent domains. This study revealed that the organisation of the matrix is mediated by a relatively small number of highly promiscuous domains which are enriched in structural proteins that have emerged as the basis for network hubs. Together these results emphasize the importance of validating models derived from global domain analyses, through focusing on specific biological processes and/or specific classes of proteins.

The ECM, once considered a static structure, has undergone a radical re-thinking. By crystallizing our current knowledge of the ECM, this study provides a valuable platform to drive future initiatives seeking to unravel the multitude of inter-relationships in this dynamic and complex system.

4.2 Future Directions

4.2.1 Predicting ECM Proteins

Prior to the initiation of the present study a complete list of known ECM proteins had not been assembled. To circumvent the incompleteness and inconsistencies of individual datasets with respect to ECM proteins and their annotations, this study used a structured approach to leverage information across a number of readily available sources while maintaining accuracy through mutual reinforcement. The resulting list of 324 ECM proteins closely agrees with two contemporary and independently derived lists11,158,159. The absolute number of ECM proteins in the human genome is unknown. However, one recent estimate has placed the upper bound at 400 ECM-encoding genes in the mammalian genome34. If this estimate is accurate, a considerable number of ECM proteins, perhaps on the order of 76, may remain undiscovered. The numbers of functionally related, non-ECM proteins could be much higher.

Recent attempts to expand the list of known ECM proteins have included 1) a large-scale functional screen based on the RIKEN mouse full-length cDNA collection which identified 16 novel ECM proteins34 and 2) a machine learning approach in which 13 “informative” classifiers were used to predict 20 novel ECM proteins from unannotated human genes in UniProtKB35. However, in the latter case, half of the supposedly novel genes identified were already supported by literature evidence and the others were known to be extracellular.

Using a priori knowledge of ECM-associated domains, the Matrisome Project11 was able to classify a comparable set of ECM proteins to those presented here, which suggests that a novel approach utilizing domain relationships may be of use in searching for additional candidate ECM proteins. Of note, Pu and co-workers recently defined a metric, the co-occurrence score CS(i,j) = -log(Pij), which is a function of the p-value of the domain pair, and used a machine learning approach to predict 379 novel genes participating in chromatin modification (CM)357. Since equivalent p-values have already been calculated here for ECM domain pairs, a similar machine learning approach could be applied to the prediction of additional ECM proteins. The availability of suitable computing resources and collaborators makes this an attractive follow-up project, although the importance of being able to functionally validate predictions cannot be overemphasized.
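As a simple illustration of the co-occurrence score, the sketch below converts domain pair p-values into CS(i,j) = -log(Pij) and ranks the pairs. The base of the logarithm is not specified above, so the natural logarithm is assumed; the lower bound on p-values, the function name and the example pairs and p-values are likewise hypothetical.

```python
import math


def cooccurrence_score(p_value, floor=1e-300):
    """CS(i,j) = -log(P_ij); natural log assumed.

    `floor` is a hypothetical safeguard that avoids log(0) when a
    p-value has underflowed to zero.
    """
    return -math.log(max(p_value, floor))


# Hypothetical domain pair p-values (not taken from this study).
pair_p_values = {
    ("EGF", "Laminin_G"): 1e-6,
    ("VWD", "C8"): 1e-3,
    ("FN1", "FN2"): 0.2,
}

scores = {pair: cooccurrence_score(p) for pair, p in pair_p_values.items()}
# Most significant pairs first: a smaller p-value gives a larger score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Scores of this kind could serve as input features for a classifier, with one feature per domain pair present in a candidate protein.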

4.2.2 Literature Curation of ECM Interactions

While literature curation is a laborious process, given the relatively large number of known interactions for elastin that were not reflected in public interaction databases (see section 2.3.10 Literature curation of elastin interactions resulted in doubling the number of known binding partners), there remains potential to address the sparseness of the current ECM network through additional curation efforts. This task is currently being undertaken by MatrixDB158,159 curators (Ricard-Blum, personal communication). However, there appears to be ample room to supplement this timely effort with novel methods such as improved text mining approaches.

In the short term, manual curation could be focused on smaller, important subsets such as the collagen fibre network to resolve remaining inconsistencies. For example, COL1A1 and COL1A2 have distinct interactions (attributed to only one subunit) whereas the native protein exists in tissues as an assembled fibre (see section 2.3.4 The collagen subnetwork reveals anomalies in experimentally derived PPIs). The anticipated increase in connectivity from these combined approaches would enable better resolution of predicted ECM functional modules and their associated annotations. Furthermore, it would be highly beneficial to undertake additional, supplemental curation before proceeding with planned SPRi experiments on a subset of elastic fibre proteins given the substantial number of literature supported interactions for elastin (i.e. nearly half of all elastin interactions) that have not been captured in public PPI databases (see section 4.2.3 below).

4.2.3 Experimental determination of additional ECM interactions via SPR of proteins and recombinant fragments

The promising results of the SPR pilot study (section 2.3.11) have inspired several follow-up projects. First, it has been hypothesized that, in the absence of several different genes (as is the case in elastin), alternative splicing allows formation of elastic fibres with different architectures in different tissues. This presumably reflects that matrix associated proteins that contribute to the architecture of the elastic matrix interact specifically with distinct splice variants. The Keeley lab has now produced a set of recombinant full-length human tropoelastins that mimic the five most common splice variants (Fred Keeley, personal communication). SPR provides the capability to detect the differential binding of these molecules to e.g. fibrillins, fibulins, MAGPs, and others. This work will better define elastic fibre interactions both in general and in disease models and may help delineate tissue specific elements.

Second, in one such disease model of thoracic aortic aneurysms and dissections (TAAD), the SPR pilot project laid the groundwork to examine the role of TAAD associated elastin sequence variations (SNPs) on elastic fibre interactions. Unlike most other methods of PPI detection, especially those used in high-throughput screens, classical SPR experiments yield information about binding kinetics (note: SPRi screens, which are also high-throughput, yield less reliable binding kinetics, and candidate interactors from these screens are typically re-tested using low-throughput SPR experiments to establish the binding kinetics). SPR is a particularly sensitive method requiring very small quantities of protein (typically 1 μg). This is an ideal method to test the hypothesis that disruption of network architecture is an important contributor in disease processes since even subtle changes affecting the weighting of interactions in a PPI network can be revealed. A successful outcome from these experiments would be the identification of TAAD-associated variants that show disturbance or perturbation of interaction behaviour. These could form the basis of future investigations that utilize tissue culture or animal models to examine the role of the variants in the ECM.

Further to the above investigations, there is the potential to conduct SPR experiments to find additional interactors of elastic fibre components, beginning with (but not necessarily limited to) elastin. As a starting point, a number of predicted elastin interactors were identified from FunCoup171 and STRING358 which, along with similar predictions for other elastic fibre components, could form the basis for a larger SPRi screen. Many potential interacting proteins are available from commercial sources in recombinant form and remain untested (see Appendix 12).

4.2.4 Expansion of functional modules

In addition to reproducible, experimentally determined PPIs, a considerable amount of ‘low’ confidence interaction data has been captured in the course of this study. These data derive from a combination of unconfirmed experiments (e.g. Cain et al.217) and interactions predicted on the basis of text mining, interologues and other similarity measures171,358. Functionally related neighbours, as defined by the curation effort, are more likely to be true interactors than proteins that are not known to participate in ECM-related functions. Low confidence interactions at the periphery of more strongly defined functional modules should therefore be prioritized for experiments designed to validate their inclusion. Additional prediction-based data sets such as FunCoup171 and STRING172, and tools such as GeneMania170,359, may also be useful to identify and prioritize candidate genes. In addition, functionally unannotated proteins that are highly connected to defined functional modules could be targeted for further analysis.
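The prioritization rule described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from protein to functional module (`module_of`) and a list of low confidence interactions; the three-level score is an illustrative choice, not a scheme used in this study.

```python
def prioritize(low_conf_edges, module_of):
    """Rank low confidence interactions for validation.

    Score 2: both partners belong to the same functional module.
    Score 1: at least one partner belongs to some module.
    Score 0: neither partner is annotated to a module.
    The scoring levels are an assumption made for illustration.
    """
    def score(edge):
        a, b = edge
        module_a, module_b = module_of.get(a), module_of.get(b)
        if module_a is not None and module_a == module_b:
            return 2
        if module_a is not None or module_b is not None:
            return 1
        return 0

    # sorted() is stable, so ties keep their input order
    return sorted(low_conf_edges, key=score, reverse=True)


# Hypothetical identifiers, not taken from the curated data set
example_modules = {"P1": "elastic fibre", "P2": "elastic fibre", "P3": "basement membrane"}
example_edges = [("P4", "P5"), ("P1", "P4"), ("P1", "P2")]
ranked = prioritize(example_edges, example_modules)
```

In practice the score could be extended with the confidence values supplied by the prediction resources, but that weighting would need to be chosen and validated separately.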

4.2.5 Role of carbohydrates

It is not immediately clear how the relatively small number of distinct GAG interactions should be analyzed in the context of an interaction network dominated by protein interactions, since bi-partite networks (a network representation which employs nodes of different types to represent categorically different entities such as proteins vs. GAGs) do not lend themselves to analysis via standard graph analysis algorithms. GAGs are, however, functionally critical, as the sequestration of several growth factors into the ECM has long been known to involve matrix-associated carbohydrates rather than the proteins themselves29.

As was highlighted in November 2012 at the first joint meeting of the American Society for Matrix Biology and the Society for Glycobiology (San Diego, U.S.A.), advanced toolkits such as bioorthogonal chemistries and conditional knock-outs in various glycosylation pathways are now available to probe the functional relationships of carbohydrates in biological systems. As additional molecules are highlighted, the contribution of their interactions to overall matrix biology and their implications for proteoglycan evolution will need to be addressed. This could be accomplished, e.g., using SPR arrays, which are capable of discriminating interactions between various molecular species, including the glycosylated forms of proteins and carbohydrates. These approaches could potentially create a new layer of information with dramatic implications for the understanding of matrix biology at a systems level.

4.2.6 Visualizing proteins, multimers and fragments

ECM proteins often exist in tissues as multimers (e.g. fibres) or are cleaved to produce polypeptides whose biological activities360 (i.e. physical protein-protein interactions) differ from those of their precursors. Such associations are difficult to represent using traditional, gene-centric networks. Bi-partite representations, also called bi-graphs, can include different types of vertices and edges in a single graph, which is useful for capturing interactions involving other types of biomolecules such as carbohydrates, lipids and ions, as is done, for example, in MatrixDB158. However, in representing proteins, their supramolecular assemblages and related fragments as independent nodes, bi-graphs do nothing to improve the visual representation of these hierarchical relationships. Moreover, the proliferation of node types fails to address the oft-levelled criticism of networks as “fuzzballs” of unfathomable complexity361.

An alternative approach, hypergraphs, may address both of these problems. Hypergraphs may be defined very simply as graphs whose edges can connect arbitrary sets of nodes. Within this framework, nested ‘hyper’ nodes may be joined by hyper edges, each of which connects multiple nodes with a single edge. This allows a biologically relevant rendering of matrix molecule relationships (fragments within proteins within multimers) and their individual interactions at all levels of assembly. It also facilitates the consolidation of edges, a process that is analogous to lossless data compression: the same information is portrayed with fewer edges. An implementation of hypergraphs is available as a Cytoscape plug-in, CyOoG362 (Cytoscape itself does not have native support for hypergraphs). In this implementation the authors use an algorithm based on network motif-finding to collapse certain common network patterns (e.g. cliques and bi-cliques) into ‘power nodes’ and ‘power edges’. This algorithm is implemented as a stand-alone component to generate an input file whose contents are separately rendered within the Cytoscape plug-in.
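The idea of nested nodes and consolidated edges can be made concrete with a minimal sketch. The class and function names below are hypothetical and do not reflect the CyOoG implementation or its file format; the sketch only shows that a single power edge between two node sets, or from a node set to itself, expands without loss into the pairwise edges of a bi-clique or clique.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PowerNode:
    """A named set whose members are plain node names or other PowerNodes."""
    name: str
    members: frozenset

    def leaves(self):
        """Return every plain node reachable through the nesting."""
        found = set()
        for member in self.members:
            if isinstance(member, PowerNode):
                found |= member.leaves()
            else:
                found.add(member)
        return found


def leaves_of(item):
    """Treat a plain node name as a set containing only itself."""
    return item.leaves() if isinstance(item, PowerNode) else {item}


def expand(power_edges):
    """Expand power edges into the ordinary undirected edges they encode."""
    edges = set()
    for source, target in power_edges:
        for u, v in product(leaves_of(source), leaves_of(target)):
            if u != v:
                edges.add(frozenset((u, v)))
    return edges


# Hypothetical example: three fragments that each bind two proteins (a bi-clique)
fragments = PowerNode("fragments", frozenset({"f1", "f2", "f3"}))
binders = PowerNode("binders", frozenset({"p1", "p2"}))
multimer = PowerNode("multimer", frozenset({fragments, "p3"}))

biclique_edges = expand([(fragments, binders)])   # one power edge, six pairwise edges
clique_edges = expand([(fragments, fragments)])   # a self power edge encodes a clique
```

Because `expand` recovers exactly the pairwise edges, the compression is lossless in the sense described above: the drawing needs one edge where a standard graph needs six.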

Using this plugin, a proof of concept was created for rendering the hierarchical relationships of ECM biomolecules. A custom program was written to translate the MatrixDB ‘biomolecules’ file into CyOoG’s input file format (.bbl), bypassing the CyOoG command line tool for power graph analysis (see program listing in Appendix 10). Hierarchical relationships were represented as concentric rings (nodes within nodes) and depicted the full range of interactions e.g. multimer to multimer (edge connecting power nodes), fragment to multimer (edge connects node to power node), fragment to fragment (edge connects nodes) and so on, accurately rendering the biological relationships among ECM proteins. As stoichiometry was present within the biomolecules file, it was also possible to represent these relationships (multiple nodes within a power node). A sample network layout is shown in Appendix 11.

While this rudimentary attempt faced some limitations in terms of the ability to modify the default layout and rendering of the network, it demonstrated the feasibility of using hypergraphs to represent complex ECM relationships. The approach would benefit from a simple, stand-alone rendering engine for .bbl files which is free of the constraints imposed by CyOoG. These constraints, such as the inability to select and drag power nodes as nested structures while maintaining their hyper edge connections, exist partly because of CyOoG’s specific support of the power graph implementation and partly because of limitations imposed on it by Cytoscape, which does not itself support hypergraphs in this form. Since bi-partite graphs can be represented as hypergraphs, consideration should be given to expanding the functionality of the tool to allow automatic conversion between bi-partite, hypergraph and standard network representations. Conversion to standard networks would facilitate interoperability with existing tools such as Cytoscape, including the large number of analysis tools and plug-ins which operate on standard graphs, until suitable statistical methods for the analysis of hypergraphs are developed.
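The conversions suggested here can be sketched in a few lines. The sketch assumes that a bi-partite graph is supplied as (entity, group) membership pairs, for example a protein and the multimer that contains it; all function names are hypothetical. The conversion to a standard graph uses clique expansion, which is one common choice and is lossy, since group identity cannot be recovered from the pairwise edges.

```python
from collections import defaultdict
from itertools import combinations


def bipartite_to_hypergraph(memberships):
    """Turn (entity, group) pairs into {group: set of entities} hyper edges."""
    hyper = defaultdict(set)
    for entity, group in memberships:
        hyper[group].add(entity)
    return dict(hyper)


def hypergraph_to_bipartite(hyper):
    """Inverse of bipartite_to_hypergraph; no information is lost."""
    return {(entity, group) for group, entities in hyper.items() for entity in entities}


def hypergraph_to_graph(hyper):
    """Clique expansion: connect every pair of entities sharing a hyper edge."""
    edges = set()
    for entities in hyper.values():
        for u, v in combinations(sorted(entities), 2):
            edges.add((u, v))
    return edges


# Hypothetical memberships: two assemblies sharing one component
example_memberships = {("a", "M1"), ("b", "M1"), ("c", "M1"), ("c", "M2"), ("d", "M2")}
example_hyper = bipartite_to_hypergraph(example_memberships)
example_graph = hypergraph_to_graph(example_hyper)
```

The round trip between the bi-partite and hypergraph forms is exact, whereas the standard graph keeps only pairwise co-membership, which is what existing graph-based tools expect.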

4.2.7 Assessing the global importance of higher order domain patterns

The idea that multiple domains could behave as distinct evolutionary units was first suggested by Vogel et al.353 Referred to as ‘supra-domains’, these were defined as, and limited to, contiguous two- and three-domain combinations resulting in overall conserved three-dimensional structures. Furthering this concept, a framework for identifying potentially important, non-contiguous, conserved arrangements of domains was presented herein (see section 3.2.10 Higher Order Domain Patterns), along with a practical method for establishing the statistical significance of domain patterns using domain pair propagation to simulate random proteomes (see Appendix 10).

Within the ECM, representing a limited set of proteins, higher order patterns were found at frequencies significantly higher than expected by random chance suggesting they are functionally relevant. Related sets of patterns defined by their common domain usage were found in particular protein families and exhibited clade specific gains and losses analogous to individual domains and domain pairs. To better determine the global importance of higher order domain patterns in the evolution of multidomain proteins it would be beneficial, within practical computational limits, to execute pattern searches across larger numbers of proteins, potentially on a genome-wide scale, documenting their conservation patterns as well as the distribution of relevant patterns and sizes across functional categories.
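To illustrate what a genome-wide pattern search involves, the following minimal sketch tests for ordered, non-contiguous domain patterns and estimates an empirical p-value. It is not the method of Appendix 10: the null model here simply redistributes all domains across proteins while keeping protein lengths, a swapped-in simplification of the domain pair propagation approach, and every name is hypothetical.

```python
import random


def contains_pattern(architecture, pattern):
    """True if the pattern's domains occur in order, with gaps allowed."""
    remaining = iter(architecture)
    # 'domain in remaining' advances the iterator, which enforces the order
    return all(domain in remaining for domain in pattern)


def count_matches(proteome, pattern):
    """Number of proteins whose architecture contains the pattern."""
    return sum(contains_pattern(arch, pattern) for arch in proteome)


def shuffled_proteome(proteome, rng):
    """Simplified null: pool all domains, shuffle, and re-deal by length."""
    pool = [domain for arch in proteome for domain in arch]
    rng.shuffle(pool)
    randomized, start = [], 0
    for arch in proteome:
        randomized.append(pool[start:start + len(arch)])
        start += len(arch)
    return randomized


def empirical_p(proteome, pattern, n_random=1000, seed=0):
    """Observed count and the fraction of random proteomes matching it or better."""
    rng = random.Random(seed)
    observed = count_matches(proteome, pattern)
    as_extreme = sum(
        count_matches(shuffled_proteome(proteome, rng), pattern) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate above zero
    return observed, (as_extreme + 1) / (n_random + 1)


# Hypothetical architectures written as lists of domain labels
example_proteome = [["A", "B", "C"], ["A", "C"], ["B"]]
observed_count, p_value = empirical_p(example_proteome, ["A", "C"], n_random=50)
```

The cost grows with the number of proteins, the number of candidate patterns and the number of random proteomes, which is the practical computational limit referred to above.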

4.2.8 Repeats and Motifs

The domain analysis conducted in this study focused on Pfam A ‘families’ and ‘domains’. Families are based on sequence conservation (i.e. sequence relatedness) and are somewhat analogous to domains which represent independently folding units defined by rigorous structural methods. This is not necessarily the case for ‘repeat’ or ‘motif’ peptides. Nevertheless, Pfam HMM models exist for these sub-domain features and the software developed herein (see Appendix 10) is capable of detecting and analyzing these features as well as domain information from PfamB, the latter being based on computational predictions which were not considered here.

Insofar as repeats play important roles in matrix structures (e.g. collagen triple helix repeats), it may be worthwhile to consider how they can be integrated into domain adjacency networks and/or higher order domain patterns. Pfam defines a repeat as a short unit that is unstable in isolation but forms a stable structure when multiple copies are present231. This immediately suggests that, to be on an equal footing with domains, occurrences of repeats in proteins would need to be grouped into structurally relevant units. Extending the earlier concept by Vogel et al.353, these might be considered ‘supra-repeats’.

Motifs are defined as short units residing outside of globular domains231. Unlike repeats, motifs do function as individual units and are associated with some secondary structure. They are found individually or in tandemly duplicated arrangements. Distinguished by their (on average) smaller size, these ‘mini-domains’ appear to have similar features to domains but they may not be able to fold independently. However, since motifs have not been included in most global studies of domains, perhaps because of the higher likelihood of detecting them by chance, it may be pertinent to first establish whether they exhibit other domain-like qualities before attempting to treat them on an equal footing. For example, do they propagate either as pairs of motifs or in combination with domains in a manner similar to that observed for domains? Are there promiscuous motifs? A global study focusing on motifs might answer these questions in a way that would permit the inclusion of these features in the domain-centric model of protein evolution.

4.2.9 Tools for visualizing paths in domain adjacency networks

Domain adjacency networks provide valuable information about domain neighbourhoods, revealing evolved working relationships in the form of functional multi-domain proteins. While the domain architecture of every multi-domain protein can be represented as a walk on such a network, it is also possible to trace paths (longer than two domains) that do not correspond to any existing protein. The biological interpretation of such networks would be significantly empowered by the development of a tool for distinguishing between realized and unrealized paths. The latter are particularly interesting when considering the possibility of predicting novel functional architectures for use in synthetic ECM proteins (see section 4.2.10).
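The distinction between realized and unrealized paths can be shown with a minimal sketch. It assumes that the network is directed from N- to C-terminus and that a path counts as realized when it appears as a contiguous run of domains in at least one protein; both are assumptions made for illustration, and the function names are hypothetical.

```python
def build_adjacency(proteome):
    """Directed edges between domains that are adjacent in some protein."""
    return {(a, b) for arch in proteome for a, b in zip(arch, arch[1:])}


def is_walk(path, edges):
    """True if every consecutive pair in the path is an edge of the network."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))


def is_realized(path, proteome):
    """True if the path occurs as a contiguous run in at least one protein."""
    n = len(path)
    return any(
        list(arch[i:i + n]) == list(path)
        for arch in proteome
        for i in range(len(arch) - n + 1)
    )


# Hypothetical architectures: A-B and B-C exist, A-B-C does not
example_proteome = [["A", "B"], ["B", "C"], ["A", "B", "D"]]
example_edges = build_adjacency(example_proteome)

path_abc = ["A", "B", "C"]
abc_is_walk = is_walk(path_abc, example_edges)         # traceable on the network
abc_is_realized = is_realized(path_abc, example_proteome)  # but found in no protein
```

Here A-B-C is a valid walk that no protein realizes, which is exactly the kind of path a visualization tool would need to flag.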

Cytoscape218 has evolved as an open source community platform for the manipulation and analysis of networks. Its plug-in architecture is particularly supportive of third-party tool development. In fact, a recent review highlights no fewer than 152 publicly available plug-in applications for this workbench363. However, a search of the available offerings using the search term ‘domain’ revealed only two applications. DomainGraph364 is a feature-rich application for the analysis of splice variants, whereas OrthoNets365 focuses on the side-by-side comparison of PPI networks from two species with domain visibility. Neither tool appears to support the functionality required to distinguish paths representing existing domain architectures.

The minimal requirements for a proposed ‘discriminator’ plug-in are that it should enable the user to 1) highlight a series of nodes representing a domain architecture of interest and return a list of proteins from a designated proteome, or subset thereof, that contain the highlighted architecture and 2) select a series of proteins and automatically highlight the path(s) corresponding to the selected proteins. Enhancements to the basic functionality could include the use of wildcards in the search to derive lists of proteins corresponding to a variety of domain patterns (e.g. AB*D) or the ability to perform a domain alignment of the matching proteins. Additional options include the ability to 1) toggle the inclusion of families, domains, repeats and motifs for all functions, 2) automatically obtain BLAST output for a selected protein or 3) call up visualizations of domain and protein conservation through integration with PhyloPro168.
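A wildcard search of the kind proposed (e.g. AB*D) might be sketched as follows. The semantics are an assumption: here ‘*’ stands for zero or more intervening domains and the pattern must account for the whole architecture. The function names and the example proteome are hypothetical.

```python
def matches(architecture, pattern):
    """Match a domain architecture against a pattern with '*' wildcards.

    Assumed semantics: '*' matches zero or more domains, and the pattern
    must cover the architecture from its first to its last domain.
    """
    def match_from(i, j):
        if j == len(pattern):
            return i == len(architecture)
        if pattern[j] == "*":
            # Let the wildcard absorb 0..n of the remaining domains
            return any(match_from(k, j + 1) for k in range(i, len(architecture) + 1))
        return (
            i < len(architecture)
            and architecture[i] == pattern[j]
            and match_from(i + 1, j + 1)
        )

    return match_from(0, 0)


def search(proteome, pattern):
    """Return the names of proteins whose architecture matches the pattern."""
    return sorted(name for name, arch in proteome.items() if matches(arch, pattern))


# Hypothetical proteome keyed by protein name
example_proteome = {
    "prot1": ["A", "B", "C", "D"],
    "prot2": ["A", "B", "D"],
    "prot3": ["A", "C", "D"],
    "prot4": ["A", "B", "C", "C", "D"],
}
hits = search(example_proteome, ["A", "B", "*", "D"])
```

The list returned by `search` corresponds to the first minimal requirement (proteins containing a highlighted architecture); highlighting the matching path on the network would be handled by the plug-in's rendering layer.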

Development of such a plugin would leverage and better integrate code already written for the ECM network which would then readily support the analysis of additional subsets of proteins (e.g. proteins involved in chromatin modification, the cytoskeleton, or others).

4.2.10 Domain adjacency as a method to predict novel ECM-like proteins, i.e. synthetic ECM proteins

The use of ECM proteins to construct new substrates and biomaterials for tissue engineering is of considerable commercial and scientific interest. Recently, for example, the design of artificial ECM proteins involving the fusion of selected, functionally characterized, laminin-derived sequences produced matrices with desired migration-inhibitory, anti-angiogenic and cell-adhesive properties366. While the construction of novel ECM proteins from choice sequence elements looks promising on the surface, the number of possible combinations is undoubtedly much larger than the number of viable ones.

Fortunately, through millions of years of evolution, nature has already pre-filtered many possibilities. Using the domain adjacency network as a starting point, it should be possible to develop a statistical framework for predicting viable, alternative ECM protein architectures which can then be tested in the wet lab for novel functions. A bank of ECM domains and facilities for characterizing the function of ECM proteins exists at the Koch Institute for Cancer Research at MIT (Alexandra Naba, personal communication) and could be applied to create and test the functionality of synthetic ECM proteins predicted by such a model.

References

1. Uitto, J., Olsen, D.R. & Fazio, M.J. Extracellular matrix of the skin: 50 years of progress. J Invest Dermatol 92, 61S-77S (1989).

2. Piez, K.A. History of extracellular matrix: a personal view. Matrix Biol 16, 85-92 (1997).

3. Borel, J.P., Maquart, F.X., Robert, A.M., Labat-Robert, J. & Robert, L. Celebration of the 50th anniversary of the foundation of the French society for connective tissue research. Its short history in the frame of the origin and development of this discipline. Pathol Biol (Paris) 60, 2-6 (2012).

4. Meikle, M.C. Control mechanisms in bone resorption: 240 years after John Hunter. Ann R Coll Surg Engl 79, 20-7 (1997).

5. Robert, L. Matrix biology: past, present and future. Pathol Biol (Paris) 49, 279-83 (2001).

6. Balo, J. & Banga, I. Elastase and elastase-inhibitor. Nature 164, 491 (1949).

7. Horwitz, A.R. The origins of the molecular era of adhesion research. Nat Rev Mol Cell Biol 13, 805-11 (2012).

8. Wilson, H.V. On some phenomena of coalescence and regeneration in sponges. J. Exp. Zool. (1907).

9. Townes, P.L. & Holtfreter, J. Directed movements and selective adhesion of embryonic amphibian cells. J. Exp. Zool. (1955).

10. Labat-Robert, J. & Robert, L. Introduction: matrix biology in the 21st century. From a static-rheological role to a dynamic-signaling function. Pathol Biol (Paris) 53, 369-71 (2005).

11. Hynes, R.O. & Naba, A. Overview of the matrisome--an inventory of extracellular matrix constituents and functions. Cold Spring Harb Perspect Biol 4, a004903 (2012).

12. Ricard-Blum, S. The collagen family. Cold Spring Harb Perspect Biol 3, a004978 (2011).

13. Uitto, J. Biochemistry of the elastic fibers in normal connective tissues and its alterations in diseases. J Invest Dermatol 72, 1-10 (1979).

14. Ramirez, F. & Dietz, H.C. Extracellular microfibrils in vertebrate development and disease processes. J Biol Chem 284, 14677-81 (2009).

15. Baldwin, A.K., Simpson, A., Steer, R., Cain, S.A. & Kielty, C.M. Elastic fibres in health and disease. Expert Rev Mol Med 15, e8 (2013).

16. Li, D.Y. et al. Elastin is an essential determinant of arterial morphogenesis. Nature 393, 276-80 (1998).

17. Starcher, B.C. Lung elastin and matrix. Chest 117, 229S-34S (2000).

18. Dietz, H.C., Ramirez, F. & Sakai, L.Y. Marfan's syndrome and other microfibrillar diseases. Adv Hum Genet 22, 153-86 (1994).

19. Cirulis, J.T. & Keeley, F.W. Kinetics and morphology of self-assembly of an elastin-like polypeptide based on the alternating domain arrangement of human tropoelastin. Biochemistry 49, 5726-33 (2010).

20. He, D. et al. Polymorphisms in the human tropoelastin gene modify in vitro self-assembly and mechanical properties of elastin-like polypeptides. PLoS One 7, e46130 (2012).

21. Song, H. & Parkinson, J. Modelling the self-assembly of elastomeric proteins provides insights into the evolution of their domain architectures. PLoS Comput Biol 8, e1002406 (2012).

22. Visconti, R.P., Barth, J.L., Keeley, F.W. & Little, C.D. Codistribution analysis of elastin and related fibrillar proteins in early vertebrate development. Matrix Biol 22, 109-21 (2003).

23. Bhattacharjee, Y. Friendly faces and unusual minds. Science 310, 802-4 (2005).

24. Eisenberg, R., Young, D., Jacobson, B. & Boito, A. Familial Supravalvular Aortic Stenosis. Am J Dis Child 108, 341-7 (1964).

25. Weir, E.K., Joffe, H.S., Blaufuss, A.H. & Beighton, P. Cardiovascular abnormalities in cutis laxa. Eur J Cardiol 5, 255-61 (1977).

26. Tassabehji, M. et al. An elastin gene mutation producing abnormal tropoelastin and abnormal elastic fibres in a patient with autosomal dominant cutis laxa. Hum Mol Genet 7, 1021-8 (1998).

27. Hynes, R.O. Integrins: bidirectional, allosteric signaling machines. Cell 110, 673-87 (2002).

28. Geiger, B. & Yamada, K.M. Molecular architecture and function of matrix adhesions. Cold Spring Harb Perspect Biol 3 (2011).

29. Taipale, J. & Keski-Oja, J. Growth factors in the extracellular matrix. FASEB J. 11, 9 (1997).

30. Hynes, R.O. The extracellular matrix: not just pretty fibrils. Science 326, 1216-9 (2009).

31. Rozario, T. & DeSimone, D.W. The extracellular matrix in development and morphogenesis: a dynamic view. Dev Biol 341, 126-40 (2010).

32. Yan, D. & Lin, X. Shaping morphogen gradients by proteoglycans. Cold Spring Harb Perspect Biol 1, a002493 (2009).

33. Yurchenco, P.D. Basement membranes: cell scaffoldings and signaling platforms. Cold Spring Harb Perspect Biol 3 (2011).

34. Manabe, R. et al. Transcriptome-based systematic identification of extracellular matrix proteins. Proc Natl Acad Sci U S A 105, 12849-54 (2008).

35. Jung, J., Ryu, T., Hwang, Y., Lee, E. & Lee, D. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. J Comput Biol 17, 97-105 (2010).

36. King, N. The unicellular ancestry of animal development. Dev Cell 7, 313-25 (2004).

37. Parfrey, L.W. & Lahr, D.J. Multicellularity arose several times in the evolution of eukaryotes (response to DOI 10.1002/bies.201100187). Bioessays 35, 339-47 (2013).

38. Fairclough, S.R., Dayel, M.J. & King, N. Multicellular development in a choanoflagellate. Curr Biol 20, R875-6 (2010).

39. Ratcliff, W.C., Denison, R.F., Borrello, M. & Travisano, M. Experimental evolution of multicellularity. Proc Natl Acad Sci U S A 109, 1595-600 (2012).

40. Varner, J.A. Isolation of a sponge-derived extracellular matrix adhesion protein. J Biol Chem 271, 16119-25 (1996).

41. Muller, W.E. Origin of Metazoa: sponges as living fossils. Naturwissenschaften 85, 11-25 (1998).

42. Berrier, A.L. & Yamada, K.M. Cell-matrix adhesion. J Cell Physiol 213, 565-73 (2007).

43. Hynes, R.O. & Zhao, Q. The evolution of cell adhesion. J Cell Biol 150, F89-96 (2000).

44. Sebe-Pedros, A. et al. Regulated aggregative multicellularity in a close unicellular relative of metazoa. Elife 2, e01287 (2013).

45. Ruiz-Trillo, I., Lane, C.E., Archibald, J.M. & Roger, A.J. Insights into the evolutionary origin and genome architecture of the unicellular opisthokonts Capsaspora owczarzaki and Sphaeroforma arctica. J Eukaryot Microbiol 53, 379-84 (2006).

46. Ruiz-Trillo, I., Roger, A.J., Burger, G., Gray, M.W. & Lang, B.F. A phylogenomic investigation into the origin of metazoa. Mol Biol Evol 25, 664-72 (2008).

47. Ozbek, S., Balasubramanian, P.G., Chiquet-Ehrismann, R., Tucker, R.P. & Adams, J.C. The evolution of extracellular matrix. Mol Biol Cell 21, 4300-5 (2010).

48. King, N. et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451, 783-8 (2008).

49. Huxley-Jones, J., Robertson, D.L. & Boot-Handford, R.P. On the origins of the extracellular matrix in vertebrates. Matrix Biol 26, 2-11 (2007).

50. Tucker, R.P. et al. Phylogenetic analysis of the tenascin gene family: evidence of origin early in the chordate lineage. BMC Evol Biol 6, 60 (2006).

51. Tucker, R.P. & Chiquet-Ehrismann, R. The regulation of tenascin expression by tissue microenvironments. Biochimica et biophysica acta 1793, 888-92 (2009).

52. Katsube, K., Sakamoto, K., Tamamura, Y. & Yamaguchi, A. Role of CCN, a vertebrate specific gene family, in development. Dev Growth Differ 51, 55-67 (2009).

53. Tzu, J. & Marinkovich, M.P. Bridging structure with function: structural, regulatory, and developmental role of laminins. The international journal of biochemistry & cell biology 40, 199-214 (2008).

54. Huhtala, M., Heino, J., Casciari, D., de Luise, A. & Johnson, M.S. Integrin evolution: insights from ascidian and teleost fish genomes. Matrix Biol 24, 83-95 (2005).

55. Nicholson, A.C., Malik, S.B., Logsdon, J.M., Jr. & Van Meir, E.G. Functional evolution of ADAMTS genes: evidence from analyses of phylogeny and gene organization. BMC Evol Biol 5, 11 (2005).

56. McKenzie, P., Chadalavada, S.C., Bohrer, J. & Adams, J.C. Phylogenomic analysis of vertebrate thrombospondins reveals fish-specific paralogues, ancestral gene relationships and a tetrapod innovation. BMC Evol Biol 6, 33 (2006).

57. Dehal, P. & Boore, J.L. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol 3, e314 (2005).

58. Nakatani, Y., Takeda, H., Kohara, Y. & Morishita, S. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res 17, 1254-65 (2007).

59. Kuraku, S. & Meyer, A. The evolution and maintenance of Hox gene clusters in vertebrates and the teleost-specific genome duplication. Int J Dev Biol 53, 765-73 (2009).

60. Chakravarti, R. & Adams, J.C. Comparative genomics of the syndecans defines an ancestral genomic context associated with matrilins in vertebrates. BMC Genomics 7, 83 (2006).

61. Lu, P., Takai, K., Weaver, V.M. & Werb, Z. Extracellular matrix degradation and remodeling in development and disease. Cold Spring Harb Perspect Biol 3 (2011).

62. Apic, G., Gough, J. & Teichmann, S.A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311-25 (2001).

63. Bornberg-Bauer, E. & Alba, M.M. Dynamics and adaptive benefits of modular protein evolution. Curr Opin Struct Biol 23, 459-66 (2013).

64. Bornberg-Bauer, E., Beaussart, F., Kummerfeld, S.K., Teichmann, S.A. & Weiner, J., 3rd. The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62, 435-45 (2005).

65. Marsh, J.A. & Teichmann, S.A. How do proteins gain new domains? Genome Biol 11, 126 (2010).

66. Toll-Riera, M. & Alba, M.M. Emergence of novel domains in proteins. BMC Evol Biol 13, 47 (2013).

67. Basu, M.K., Carmel, L., Rogozin, I.B. & Koonin, E.V. Evolution of protein domain promiscuity in eukaryotes. Genome Res 18, 449-61 (2008).

68. Basu, M.K., Poliakov, E. & Rogozin, I.B. Domain mobility in proteins: functional and evolutionary implications. Brief Bioinform 10, 205-16 (2009).

69. Apic, G., Huber, W. & Teichmann, S.A. Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4, 67-78 (2003).

70. Baeten, K.M. & Akassoglou, K. Extracellular matrix and matrix receptors in blood-brain barrier formation and stroke. Dev Neurobiol 71, 1018-39 (2011).

71. Dityatev, A., Schachner, M. & Sonderegger, P. The dual role of the extracellular matrix in synaptic plasticity and homeostasis. Nat Rev Neurosci 11, 735-46 (2010).

72. Shoulders, M.D. & Raines, R.T. Collagen structure and stability. Annu Rev Biochem 78, 929-58 (2009).

73. Kadler, K.E., Baldock, C., Bella, J. & Boot-Handford, R.P. Collagens at a glance. Journal of cell science 120, 1955-8 (2007).

74. Kadler, K.E., Holmes, D.F., Trotter, J.A. & Chapman, J.A. Collagen fibril formation. The Biochemical journal 316 ( Pt 1), 1-11 (1996).

75. Hulmes, D.J. Building collagen molecules, fibrils, and suprafibrillar structures. Journal of structural biology 137, 2-10 (2002).

76. Hulmes, D.J. et al. Pleomorphism in type I collagen fibrils produced by persistence of the procollagen N-propeptide. J Mol Biol 210, 337-45 (1989).

77. Prockop, D.J. & Fertala, A. Inhibition of the self-assembly of collagen I into fibrils with synthetic peptides. Demonstration that assembly is driven by specific binding sites on the monomers. J Biol Chem 273, 15598-604 (1998).

78. Kielty, C.M. Elastic fibres in health and disease. Expert Rev Mol Med 8, 1-23 (2006).

79. Sakai, T., Larsen, M. & Yamada, K.M. Fibronectin requirement in branching morphogenesis. Nature 423, 876-81 (2003).

80. Ruoslahti, E. Brain extracellular matrix. Glycobiology 6, 489-92 (1996).

81. Yamaguchi, Y. Lecticans: organizers of the brain extracellular matrix. Cell Mol Life Sci 57, 276-89 (2000).

82. Bonneh-Barkay, D. & Wiley, C.A. Brain extracellular matrix in neurodegeneration. Brain Pathol 19, 573-85 (2009).

83. Sun, L. et al. Identification and characterization of a second fibronectin gene in zebrafish. Matrix Biol 24, 69-77 (2005).

84. White, E.S., Baralle, F.E. & Muro, A.F. New insights into form and function of fibronectin splice variants. J Pathol 216, 1-14 (2008).

85. Dang, C. et al. Tenascin-C patterns and splice variants in actinic keratosis and cutaneous squamous cell carcinoma. Br J Dermatol 155, 763-70 (2006).

86. Muriel, J.M., Xu, X., Kramer, J.M. & Vogel, B.E. Selective assembly of fibulin-1 splice variants reveals distinct extracellular matrix networks and novel functions for perlecan/UNC-52 splice variants. Dev Dyn 235, 2632-40 (2006).

87. Boon, R.A. et al. MicroRNA-29 in aortic dilation: implications for aneurysm formation. Circulation research 109, 1115-9 (2011).

88. Merk, D.R. et al. miR-29b participates in early aneurysm development in Marfan syndrome. Circulation research 110, 312-24 (2012).

89. Maegdefessel, L. et al. Inhibition of microRNA-29b reduces murine abdominal aortic aneurysm development. J Clin Invest 122, 497-506 (2012).

90. Butcher, D.T., Alliston, T. & Weaver, V.M. A tense situation: forcing tumour progression. Nat Rev Cancer 9, 108-22 (2009).

91. Spencer, V.A., Xu, R. & Bissell, M.J. Gene expression in the third dimension: the ECM-nucleus connection. J Mammary Gland Biol Neoplasia 15, 65-71 (2010).

92. Byron, A., Humphries, J.D. & Humphries, M.J. Defining the extracellular matrix using proteomics. Int J Exp Pathol (2013).

93. Sottile, J. & Hocking, D.C. Fibronectin polymerization regulates the composition and stability of extracellular matrix fibrils and cell-matrix adhesions. Mol Biol Cell 13, 3546-59 (2002).

94. Kinsey, R. et al. Fibrillin-1 microfibril deposition is dependent on fibronectin assembly. Journal of cell science 121, 2696-704 (2008).

95. Myllyharju, J. & Kivirikko, K.I. Collagens, modifying enzymes and their mutations in humans, flies and worms. Trends in genetics : TIG 20, 33-43 (2004).

96. Aitken, K.J. & Bagli, D.J. The bladder extracellular matrix. Part I: architecture, development and disease. Nat Rev Urol 6, 596-611 (2009).

97. Golub, E.E. Biomineralization and matrix vesicles in biology and pathology. Semin Immunopathol 33, 409-17 (2011).

98. Page-McCaw, A., Ewald, A.J. & Werb, Z. Matrix metalloproteinases and the regulation of tissue remodelling. Nat Rev Mol Cell Biol 8, 221-33 (2007).

99. Kessenbrock, K., Plaks, V. & Werb, Z. Matrix metalloproteinases: regulators of the tumor microenvironment. Cell 141, 52-67 (2010).

100. Rosen, S.D. & Lemjabbar-Alaoui, H. Sulf-2: an extracellular modulator of cell signaling and a cancer target candidate. Expert Opin Ther Targets 14, 935-49 (2010).

101. Rhodes, J.M. & Simons, M. The extracellular matrix and blood vessel formation: not just a scaffold. J Cell Mol Med 11, 176-205 (2007).

102. Mott, J.D. & Werb, Z. Regulation of matrix biology by matrix metalloproteinases. Current opinion in cell biology 16, 558-64 (2004).

103. Rebustini, I.T. et al. Laminin alpha5 is necessary for submandibular gland epithelial morphogenesis and influences FGFR expression through beta1 integrin signaling. Dev Biol 308, 15-29 (2007).

104. McCulloch, D.R. et al. ADAMTS metalloproteases generate active versican fragments that regulate interdigital web regression. Dev Cell 17, 687-98 (2009).

105. Vanacore, R. et al. A sulfilimine bond identified in collagen IV. Science 325, 1230-4 (2009).

106. Li, L. & Xie, T. Stem cell niche: structure and function. Annu Rev Cell Dev Biol 21, 605-31 (2005).

107. Bissell, M.J., Hall, H.G. & Parry, G. How does the extracellular matrix direct gene expression? J Theor Biol 99, 31-68 (1982).

108. Egeblad, M., Rasch, M.G. & Weaver, V.M. Dynamic interplay between the collagen scaffold and tumor evolution. Current opinion in cell biology 22, 697-706 (2010).

109. Condeelis, J. & Segall, J.E. Intravital imaging of cell movement in tumours. Nat Rev Cancer 3, 921-30 (2003).

110. Wyckoff, J.B. et al. Direct visualization of macrophage-assisted tumor cell intravasation in mammary tumors. Cancer Res 67, 2649-56 (2007).

111. Aszodi, A., Legate, K.R., Nakchbandi, I. & Fassler, R. What mouse mutants teach us about extracellular matrix function. Annu Rev Cell Dev Biol 22, 591-621 (2006).

112. Paszek, M.J. et al. Tensional homeostasis and the malignant phenotype. Cancer Cell 8, 241-54 (2005).

113. Levental, K.R. et al. Matrix crosslinking forces tumor progression by enhancing integrin signaling. Cell 139, 891-906 (2009).

114. Schwartz, M.A. Integrins and extracellular matrix in mechanotransduction. Cold Spring Harb Perspect Biol 2, a005066 (2010).

115. Lopez, J.I., Mouw, J.K. & Weaver, V.M. Biomechanical regulation of cell orientation and fate. Oncogene 27, 6981-93 (2008).

116. Reilly, G.C. & Engler, A.J. Intrinsic extracellular matrix properties regulate stem cell differentiation. J Biomech 43, 55-62 (2010).

117. Engler, A.J., Sen, S., Sweeney, H.L. & Discher, D.E. Matrix elasticity directs stem cell lineage specification. Cell 126, 677-89 (2006).

118. Berardi, N., Pizzorusso, T. & Maffei, L. Extracellular matrix and visual cortical plasticity: freeing the synapse. Neuron 44, 905-8 (2004).

119. Fukumoto, S. & Yamada, Y. Review: extracellular matrix regulates tooth morphogenesis. Connective tissue research 46, 220-6 (2005).

120. Zimmermann, D.R. & Dours-Zimmermann, M.T. Extracellular matrix of the central nervous system: from neglect to challenge. Histochem Cell Biol 130, 635-53 (2008).

121. Lu, P., Sternlicht, M.D. & Werb, Z. Comparative mechanisms of branching morphogenesis in diverse systems. J Mammary Gland Biol Neoplasia 11, 213-28 (2006).

122. Affolter, M. & Caussinus, E. Tracheal branching morphogenesis in Drosophila: new insights into cell behaviour and organ architecture. Development 135, 2055-64 (2008).

123. Andrew, D.J. & Ewald, A.J. Morphogenesis of epithelial tubes: Insights into tube formation, elongation, and elaboration. Dev Biol 341, 34-55 (2010).

124. Sternlicht, M.D., Kouros-Mehr, H., Lu, P. & Werb, Z. Hormonal and local control of mammary branching morphogenesis. Differentiation 74, 365-81 (2006).

125. Fata, J.E., Werb, Z. & Bissell, M.J. Regulation of mammary gland branching morphogenesis by the extracellular matrix and its remodeling enzymes. Breast Cancer Res 6, 1-11 (2004).

126. Fukuda, Y. et al. The role of interstitial collagens in cleft formation of mouse embryonic submandibular gland during initial branching. Development 103, 259-67 (1988).

127. Bateman, J.F., Boot-Handford, R.P. & Lamande, S.R. Genetic diseases of connective tissues: cellular and extracellular effects of ECM mutations. Nat Rev Genet 10, 173-183 (2009).

128. Richards, A.J. et al. High efficiency of mutation detection in type 1 Stickler syndrome using a two-stage approach: vitreoretinal assessment coupled with exon sequencing for screening COL2A1. Hum Mutat 27, 696-704 (2006).

129. Snead, M.P. & Yates, J.R. Clinical and molecular genetics of Stickler syndrome. J Med Genet 36, 353-9 (1999).

130. Lamande, S.R. et al. Reduced collagen VI causes Bethlem myopathy: a heterozygous COL6A1 nonsense mutation results in mRNA decay and functional haploinsufficiency. Hum Mol Genet 7, 981-9 (1998).

131. Colige, A. et al. Novel types of mutation responsible for the dermatosparactic type of Ehlers-Danlos syndrome (Type VIIC) and common polymorphisms in the ADAMTS2 gene. J Invest Dermatol 123, 656-63 (2004).

132. Werb, Z. The extracellular matrix and disease: An interview with Zena Werb. Interviewed by Kristin H. Kain. Dis Model Mech 3, 513-6 (2010).

133. Nelson, C.M. & Bissell, M.J. Of extracellular matrix, scaffolds, and signaling: tissue architecture regulates development, homeostasis, and cancer. Annu Rev Cell Dev Biol 22, 287-309 (2006).

134. Frantz, C., Stewart, K.M. & Weaver, V.M. The extracellular matrix at a glance. Journal of cell science 123, 4195-200 (2010).

135. DuFort, C.C., Paszek, M.J. & Weaver, V.M. Balancing forces: architectural control of mechanotransduction. Nat Rev Mol Cell Biol 12, 308-19 (2011).

136. Lu, P., Weaver, V.M. & Werb, Z. The extracellular matrix: a dynamic niche in cancer progression. J Cell Biol 196, 395-406 (2012).

137. Wolfe, J.N. Risk for breast cancer development determined by mammographic parenchymal pattern. Cancer 37, 2486-92 (1976).

138. Lieber, M.M. Towards an understanding of the role of forces in carcinogenesis: a perspective with therapeutic implications. Riv Biol 99, 131-60 (2006).

139. Bischoff, F. & Bryson, G. Carcinogenesis through Solid State Surfaces. Prog Exp Tumor Res 5, 85-133 (1964).

140. Hahn, F.F., Guilmette, R.A. & Hoover, M.D. Implanted depleted uranium fragments cause soft tissue sarcomas in the muscles of rats. Environ Health Perspect 110, 51-9 (2002).

141. Sorokin, L. The impact of the extracellular matrix on inflammation. Nat Rev Immunol 10, 712-23 (2010).

142. Barker, H.E., Cox, T.R. & Erler, J.T. The rationale for targeting the LOX family in cancer. Nat Rev Cancer 12, 540-52 (2012).

143. Claridge, M.W. et al. Measurement of arterial stiffness in subjects with vascular disease: Are vessel wall changes more sensitive than increase in intima-media thickness? Atherosclerosis 205, 477-80 (2009).

144. Munakata, M. Airway remodeling and airway smooth muscle in asthma. Allergol Int 55, 235-43 (2006).

145. Thorns, V., Walter, G.F. & Thorns, C. Expression of MMP-2, MMP-7, MMP-9, MMP-10 and MMP-11 in human astrocytic and oligodendroglial gliomas. Anticancer Res 23, 3937-44 (2003).

146. Cuadrado, E. et al. Vascular MMP-9/TIMP-2 and neuronal MMP-10 up-regulation in human brain after stroke: a combined laser microdissection and protein array study. J Proteome Res 8, 3191-7 (2009).

147. Shyamsundar, R. et al. A DNA microarray survey of gene expression in normal human tissues. Genome Biol 6, R22 (2005).

148. Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res 37, D690-7 (2009).

149. Consortium, U. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 37, D169-74 (2009).

150. Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61-5 (2007).

151. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-9 (2000).

152. Maglott, D., Ostell, J., Pruitt, K.D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 35, D26-31 (2007).

153. Turner, B. et al. iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database (Oxford) 2010, baq023 (2010).

154. Bader, G.D., Betel, D. & Hogue, C.W. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31, 248-50 (2003).

155. Xenarios, I. et al. DIP: the database of interacting proteins. Nucleic Acids Res 28, 289-291 (2000).

156. Chatr-aryamontri, A. et al. MINT: the Molecular INTeraction database. Nucleic Acids Res 35, D572-4 (2007).

157. Stark, C. et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34, D535-9 (2006).

158. Chautard, E., Ballut, L., Thierry-Mieg, N. & Ricard-Blum, S. MatrixDB, a database focused on extracellular protein-protein and protein-carbohydrate interactions. Bioinformatics 25, 690-1 (2009).

159. Chautard, E., Fatoux-Ardore, M., Ballut, L., Thierry-Mieg, N. & Ricard-Blum, S. MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res 39, D235-40 (2011).

160. Barabasi, A.L. & Oltvai, Z.N. Network biology: understanding the cell's functional organization. Nat Rev Genet 5, 101-13 (2004).

161. Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X. & Gerstein, M. Genomic analysis of essentiality within protein networks. Trends in genetics : TIG 20, 227-31 (2004).

162. Bork, P. et al. Protein interaction networks from yeast to human. Curr Opin Struct Biol 14, 292-9 (2004).

163. Rual, J.F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173-8 (2005).

164. Stelzl, U. et al. A human protein-protein interaction network: a resource for annotating the proteome. Cell 122, 957-68 (2005).

165. Krogan, N.J. et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440, 637-43 (2006).

166. Wang, H. et al. A complex-based reconstruction of the Saccharomyces cerevisiae interactome. Mol Cell Proteomics 8, 1361-81 (2009).

167. Peregrin-Alvarez, J.M., Xiong, X., Su, C. & Parkinson, J. The Modular Organization of Protein Interactions in Escherichia coli. PLoS Comput Biol 5, e1000523 (2009).

168. Xiong, X. et al. PhyloPro: a web-based tool for the generation and visualization of phylogenetic profiles across Eukarya. Bioinformatics 27, 877-8 (2011).

169. Zhu, X., Gerstein, M. & Snyder, M. Getting connected: analysis and principles of biological networks. Genes Dev 21, 1010-1024 (2007).

170. Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol 9 Suppl 1, S4 (2008).

171. Alexeyenko, A. & Sonnhammer, E.L. Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome research 19, 1107-16 (2009).

172. Szklarczyk, D. et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic acids research 39, D561-8 (2011).

173. Ramani, A.K., Bunescu, R.C., Mooney, R.J. & Marcotte, E.M. Consolidating the set of known human protein-protein interactions in preparation for large-scale mapping of the human interactome. Genome Biol 6, R40 (2005).

174. Olsen, L., Johan Kudahl, U., Winther, O. & Brusic, V. Literature classification for semi-automated updating of biological knowledgebases. BMC Genomics 14 Suppl 5, S14 (2013).

175. Liu, W. et al. Extracting rate changes in transcriptional regulation from MEDLINE abstracts. BMC Bioinformatics 15 Suppl 2, S4 (2014).

176. Wu, C., Schwartz, J.M. & Nenadic, G. PathNER: a tool for systematic identification of biological pathway mentions in the literature. BMC Syst Biol 7, S2 (2013).

177. Hoffmann, R. Using the iHOP information resource to mine the biomedical literature on genes, proteins, and chemical compounds. Curr Protoc Bioinformatics Chapter 1, Unit1 16 (2007).

178. Liebel, U., Kindler, B. & Pepperkok, R. Bioinformatic "Harvester": a search engine for genome-wide human, mouse, and rat protein resources. Methods Enzymol 404, 19-26 (2005).

179. Safran, M. et al. Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res 31, 142-6 (2003).

180. McKusick, V.A. Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80, 588-604 (2007).

181. Lipscomb, C.E. Medical Subject Headings (MeSH). Bull Med Libr Assoc 88, 265-6 (2000).

182. Orchard, S. et al. The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25, 894-8 (2007).

183. Holmes, M.W., Bayliss, M.T. & Muir, H. Hyaluronic acid in human articular cartilage. Age-related changes in content and size. Biochem J 250, 435-41 (1988).

184. Ruepp, A. et al. CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Res 38, D497-501 (2010).

185. von Mering, C. et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403 (2002).

186. Dreze, M. et al. High-quality binary interactome mapping. Methods Enzymol 470, 281-315 (2010).

187. Kittanakom, S. et al. Analysis of membrane protein complexes using the split-ubiquitin membrane yeast two-hybrid (MYTH) system. Methods Mol Biol 548, 247-71 (2009).

188. Barrios-Rodiles, M. et al. High-throughput mapping of a dynamic signaling network in mammalian cells. Science 307, 1621-5 (2005).

189. Ray, S., Mehta, G. & Srivastava, S. Label-free detection techniques for protein microarrays: prospects, merits and challenges. Proteomics 10, 731-48 (2010).

190. Liu, H., Beck, T.N., Golemis, E.A. & Serebriiskii, I.G. Integrating in silico resources to map a signaling network. Methods Mol Biol 1101, 197-245 (2014).

191. Tian, W. et al. Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biol 9 Suppl 1, S7 (2008).

192. Barabasi, A.L. & Albert, R. Emergence of scaling in random networks. Science 286, 509-12 (1999).

193. Jeong, H., Mason, S.P., Barabasi, A.L. & Oltvai, Z.N. Lethality and centrality in protein networks. Nature 411, 41-2 (2001).

194. He, X. & Zhang, J. Why do hubs tend to be essential in protein networks? PLoS Genet 2, e88 (2006).

195. Fadhal, E., Gamieldien, J. & Mwambene, E.C. Protein interaction networks as metric spaces: a novel perspective on distribution of hubs. BMC Syst Biol 8, 6 (2014).

196. Calvano, S. et al. A network-based analysis of systemic inflammation in humans. Nature 438 (2005).

197. Rajagopala, S. et al. The protein network of bacterial motility. Mol Syst Biol 3 (2007).

198. Stuart, L.M. et al. A systems biology analysis of the Drosophila phagosome. Nature 445, 7 (2007).

199. Goh, K.-I. et al. The human disease network. PNAS 104, 8685-8690 (2007).

200. Cromar, G.L., Xiong, X., Chautard, E., Ricard-Blum, S. & Parkinson, J. Toward a systems level view of the ECM and related proteins: a framework for the systematic definition and analysis of biological systems. Proteins 80, 1522-44 (2012).

201. Thomas, P.D. et al. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13, 2129-41 (2003).

202. Lemay, D.G. et al. The bovine lactation genome: insights into the evolution of mammalian milk. Genome Biol 10, R43 (2009).

203. Chautard, E., Fatoux-Ardore, M., Ballut, L., Thierry-Mieg, N. & Ricard-Blum, S. MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res 39, D235-40 (2010).

204. Hoffmann, R. & Valencia, A. A gene network for navigating the literature. Nat Genet 36, 664 (2004).

205. Kersey, P.J. et al. The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985-8 (2004).

206. Bruford, E.A. et al. The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res 36, D445-8 (2008).

207. Berriz, G.F. & Roth, F.P. The Synergizer service for translating gene, protein and other biological identifiers. Bioinformatics 24, 2272-3 (2008).

208. Smedley, D. et al. BioMart - biological queries made easy. BMC Genomics 10, 22 (2009).

209. Chaurasia, G. et al. UniHI: an entry gate to the human protein interactome. Nucleic Acids Res 35 (2007).

210. Hermjakob, H. et al. IntAct: an open source molecular interaction database. Nucleic Acids Res 32 (2004).

211. Peri, S. et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res 32 (2004).

212. Lehner, B. & Fraser, A.G. A first-draft human protein-interaction map. Genome Biol 5, R63 (2004).

213. Persico, M. et al. HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms. BMC Bioinformatics 6 Suppl 4, S21 (2005).

214. Brown, K.R. & Jurisica, I. Online predicted human interaction database. Bioinformatics 21, 2076-82 (2005).

215. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 37, D619-22 (2009).

216. Faye, C., Chautard, E., Olsen, B.R. & Ricard-Blum, S. The first draft of the endostatin interaction network. J Biol Chem 284, 22041-7 (2009).

217. Cain, S.A. et al. Defining elastic fiber interactions by molecular fishing: an affinity purification and mass spectrometry approach. Mol Cell Proteomics 8, 2715-32 (2009).

218. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498-504 (2003).

219. Assenov, Y., Ramirez, F., Schelhorn, S.E., Lengauer, T. & Albrecht, M. Computing topological parameters of biological networks. Bioinformatics 24, 282-4 (2008).

220. Yip, K.Y., Yu, H., Kim, P.M., Schultz, M. & Gerstein, M. The tYNA platform for comparative interactomics: a web tool for managing, comparing and mining multiple networks. Bioinformatics 22, 2968-70 (2006).

221. Enright, A.J., Van Dongen, S. & Ouzounis, C.A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30, 1575-84 (2002).

222. Loganantharaj, R., Cheepala, S. & Clifford, J. Metric for measuring the effectiveness of clustering of DNA microarray expression. BMC Bioinformatics 7 Suppl 2, S5 (2006).

223. Su, A.I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, 6062-7 (2004).

224. Wu, C. et al. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 10, R130 (2009).

225. Barrett, T. & Edgar, R. Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol 411, 352-69 (2006).

226. Liu, X., Yu, X., Zack, D.J., Zhu, H. & Qian, J. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 9, 271 (2008).

227. Saldanha, A.J. Java Treeview--extensible visualization of microarray data. Bioinformatics 20, 3246-8 (2004).

228. MATLAB, Software Package. Ver. 2009b edn (Natick, MA, 2009).

229. Berglund, A.C., Sjolund, E., Ostlund, G. & Sonnhammer, E.L. InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Res 36, D263-6 (2008).

230. On, T. et al. The evolutionary landscape of the chromatin modification machinery reveals lineage specific gains, expansions, and losses. Proteins 78, 2075-89 (2010).

231. Finn, R.D. et al. The Pfam protein families database. Nucleic Acids Res 38, D211-22 (2010).

232. Sartor, M.A. et al. ConceptGen: a gene set enrichment and gene set relation mapping tool. Bioinformatics 26, 456-63 (2010).

233. Oesper, L., Merico, D., Isserlin, R. & Bader, G.D. WordCloud: a Cytoscape plugin to create a visual semantic summary of networks. Source Code Biol Med 6, 7 (2011).

234. Yu, W., Clyne, M., Khoury, M.J. & Gwinn, M. Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations. Bioinformatics 26, 145-6 (2009).

235. Sprenger, J. et al. LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Res 36, D230-3 (2008).

236. Szafron, D. et al. Proteome Analyst: custom predictions with explanations in a web-based tool for high-throughput proteome annotations. Nucleic Acids Res 32, W365-71 (2004).

237. Yu, C.S., Chen, Y.C., Lu, C.H. & Hwang, J.K. Prediction of protein subcellular localization. Proteins 64, 643-51 (2006).

238. Hoglund, A., Donnes, P., Blum, T., Adolph, H.W. & Kohlbacher, O. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 22, 1158-65 (2006).

239. Guda, C. pTARGET: a web server for predicting protein subcellular localization. Nucleic Acids Res 34, W210-3 (2006).

240. Horton, P. et al. WoLF PSORT: protein localization predictor. Nucleic Acids Res 35, W585-7 (2007).

241. Bendtsen, J.D., Nielsen, H., von Heijne, G. & Brunak, S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340, 783-95 (2004).

242. Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E.L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305, 567-80 (2001).

243. Barrell, D. et al. The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37, D396-403 (2009).

244. Carbon, S. et al. AmiGO: online access to ontology and annotation data. Bioinformatics 25, 288-9 (2009).

245. Liebel, U., Kindler, B. & Pepperkok, R. 'Harvester': a fast meta search engine of human protein resources. Bioinformatics 20, 1962-3 (2004).

246. Maere, S., Heymans, K. & Kuiper, M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448-9 (2005).

247. Murphy-Ullrich, J.E. The de-adhesive activity of matricellular proteins: is intermediate cell adhesion an adaptive state? J Clin Invest 107, 785-90 (2001).

248. Sprenger, J., Fink, J.L. & Teasdale, R.D. Evaluation and comparison of mammalian subcellular localization prediction methods. BMC Bioinformatics 7 Suppl 5, S3 (2006).

249. Shimizu, K., Shirataki, H., Honda, T., Minami, S. & Takai, Y. Complex formation of SMAP/KAP3, a KIF3A/B ATPase motor-associated protein, with a human chromosome-associated polypeptide. J Biol Chem 273, 6591-4 (1998).

250. Ghiselli, G., Siracusa, L.D. & Iozzo, R.V. Complete cDNA cloning, genomic organization, chromosomal assignment, functional characterization of the promoter, and expression of the murine Bamacan gene. J Biol Chem 274, 17384-93 (1999).

251. Gay, S., Martin, G.R., Muller, P.K., Timpl, R. & Kuhn, K. Simultaneous synthesis of types I and III collagen by fibroblasts in culture. Proc Natl Acad Sci U S A 73, 4037-40 (1976).

252. Salwinski, L. et al. Recurated protein interaction datasets. Nat Methods 6, 860-1 (2009).

253. Gerstein, M., Lan, N. & Jansen, R. PROTEOMICS: Enhanced: Integrating Interactomes. Science 295, 284-287 (2002).

254. Pereira-Leal, J.B., Enright, A.J. & Ouzounis, C.A. Detection of functional modules from protein interaction networks. Proteins 54, 49-57 (2004).

255. Collins, S.R. et al. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 6, 439-50 (2007).

256. Paris, L. & Bazzoni, G. The protein interaction network of the epithelial junctional complex: a system-level analysis. Mol Biol Cell 19, 5409-21 (2008).

257. Gibson, T.A. & Goldberg, D.S. Questioning the ubiquity of neofunctionalization. PLoS Comput Biol 5, e1000252 (2009).

258. Zhu, Y., Oganesian, A., Keene, D.R. & Sandell, L.J. Type IIA procollagen containing the cysteine-rich amino propeptide is deposited in the extracellular matrix of prechondrogenic tissue and binds to TGF-beta1 and BMP-2. J Cell Biol 144, 1069-80 (1999).

259. Schoppet, M., Chavakis, T., Al-Fakhri, N., Kanse, S.M. & Preissner, K.T. Molecular interactions and functional interference between vitronectin and transforming growth factor-beta. Lab Invest 82, 37-46 (2002).

260. Blencowe, B.J. Alternative splicing: new insights from global analyses. Cell 126, 37-47 (2006).

261. Putnam, N.H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86-94 (2007).

262. Gregson, H.C. et al. A potential role for human cohesin in mitotic spindle aster assembly. J Biol Chem 276, 47575-82 (2001).

263. Jurica, M.S. & Moore, M.J. Pre-mRNA splicing: awash in a sea of proteins. Mol Cell 12, 5-14 (2003).

264. Andersen, D.S. & Tapon, N. Drosophila MFAP1 is required for pre-mRNA processing and G2/M progression. J Biol Chem 283, 31256-67 (2008).

265. Huxley-Jones, J., Pinney, J.W., Archer, J., Robertson, D.L. & Boot-Handford, R.P. Back to basics--how the evolution of the extracellular matrix underpinned vertebrate evolution. Int J Exp Pathol 90, 95-100 (2009).

266. Hughes, A.L., da Silva, J. & Friedman, R. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11, 771-80 (2001).

267. Qin, C., D'Souza, R. & Feng, J.Q. Dentin matrix protein 1 (DMP1): new and important roles for biomineralization and phosphate homeostasis. J Dent Res 86, 1134-41 (2007).

268. Terasawa, M. et al. Expression of dentin matrix protein 1 (DMP1) in nonmineralized tissues. J Bone Miner Metab 22, 430-8 (2004).

269. Mao, Z. et al. The human tuftelin gene: cloning and characterization. Gene 279, 181-96 (2001).

270. Leiser, Y. et al. Localization, quantification, and characterization of tuftelin in soft tissues. Anat Rec (Hoboken) 290, 449-54 (2007).

271. Polanska, U.M., Fernig, D.G. & Kinnunen, T. Extracellular interactome of the FGF receptor-ligand system: complexities and the relative simplicity of the worm. Dev Dyn 238, 277-93 (2009).

272. Kuwabara, K. et al. Calreticulin, an antithrombotic agent which binds to vitamin K-dependent coagulation factors, stimulates endothelial nitric oxide production, and limits thrombosis in canine coronary arteries. J Biol Chem 270, 8179-87 (1995).

273. Reheman, A., Tasneem, S., Ni, H. & Hayward, C.P. Mice with deleted multimerin 1 and alpha-synuclein genes have impaired platelet adhesion and impaired thrombus formation that is corrected by multimerin 1. Thromb Res 125, e177-83 (2010).

274. Wong, J.H. et al. Sex differences in thrombosis in mice are mediated by sex-specific growth hormone secretion patterns. J Clin Invest 118, 2969-78 (2008).

275. Ge, G., Fernandez, C.A., Moses, M.A. & Greenspan, D.S. Bone morphogenetic protein 1 processes prolactin to a 17-kDa antiangiogenic factor. Proc Natl Acad Sci U S A 104, 10010-5 (2007).

276. Harumiya, S. et al. Characterization of ficolins as novel elastin-binding proteins and molecular cloning of human ficolin-1. J Biochem 120, 745-51 (1996).

277. Wakui, H. et al. Renal argininosuccinate synthetase: purification, immunohistochemical localization, and elastin-binding property. Ren Physiol Biochem 15, 1-9 (1992).

278. Freeman, T.C., Davies, R. & Calam, J. Interactions of pancreatic secretory trypsin inhibitor in small intestinal juice: its hydrolysis and protection by intraluminal factors. Clin Chim Acta 195, 27-39 (1990).

279. Reinboth, B., Hanssen, E., Cleary, E.G. & Gibson, M.A. Molecular interactions of biglycan and decorin with elastic fiber components: biglycan forms a ternary complex with tropoelastin and microfibril-associated glycoprotein 1. J Biol Chem 277, 3950-7 (2002).

280. Fujita, J. et al. Modulation of elastase binding to elastin by human alveolar macrophage-derived lipids. Am J Respir Crit Care Med 160, 802-7 (1999).

281. Martins, C. et al. Menkes' kinky hair syndrome: ultrastructural cutaneous alterations of the elastic fibers. Pediatr Dermatol 14, 347-50 (1997).

282. Rucker, R.B. & Dubick, M.A. Elastin metabolism and chemistry: potential roles in lung development and structure. Environ Health Perspect 55, 179-91 (1984).

283. Ochieng, J., Warfield, P., Green-Jarvis, B. & Fentie, I. Galectin-3 regulates the adhesive interaction between breast carcinoma cells and elastin. J Cell Biochem 75, 505-14 (1999).

284. Dubuisson, L. et al. Expression and cellular localization of fibrillin-1 in normal and pathological human liver. J Hepatol 34, 514-22 (2001).

285. Sasaki, T. et al. Tropoelastin binding to fibulins, nidogen-2 and other extracellular matrix proteins. FEBS letters 460, 280-4 (1999).

286. Ying, Q.L. & Simon, S.R. Elastolysis by proteinase 3 and its inhibition by alpha(1)-proteinase inhibitor: a mechanism for the incomplete inhibition of ongoing elastolysis. Am J Respir Cell Mol Biol 26, 356-61 (2002).

287. Baccarani-Contri, M., Vincenzi, D., Quaglino, D., Jr., Mori, G. & Pasquali-Ronchetti, I. Localization of human placenta lysyl oxidase on human placenta, skin and aorta by immunoelectronmicroscopy. Matrix 9, 428-36 (1989).

288. Trask, T.M. et al. Interaction of tropoelastin with the amino-terminal domains of fibrillin-1 and fibrillin-2 suggests a role for the fibrillins in elastic fiber assembly. J Biol Chem 275, 24400-6 (2000).

289. Lagente, V., Le Quement, C. & Boichot, E. Macrophage metalloelastase (MMP-12) as a target for inflammatory respiratory diseases. Expert Opin Ther Targets 13, 287-95 (2009).

290. Curci, J.A., Liao, S., Huffman, M.D., Shapiro, S.D. & Thompson, R.W. Expression and localization of macrophage elastase (matrix metalloproteinase-12) in abdominal aortic aneurysms. J Clin Invest 102, 1900-10 (1998).

291. Heinz, A. et al. Degradation of tropoelastin by matrix metalloproteinases--cleavage site specificities and release of matrikines. FEBS J 277, 1939-56 (2010).

292. Clarke, A.W. & Weiss, A.S. Microfibril-associated glycoprotein-1 binding to tropoelastin: multiple binding sites and the role of divalent cations. Eur J Biochem 271, 3085-90 (2004).

293. Thomassin, L. et al. The Pro-regions of lysyl oxidase and lysyl oxidase-like 1 are required for deposition onto elastic fibers. J Biol Chem 280, 42848-55 (2005).

294. Seite, S. et al. Mexoryl SX: a broad absorption UVA filter protects human skin from the effects of repeated suberythemal doses of UVA. J Photochem Photobiol B 44, 69-76 (1998).

295. Rose, S.D. & MacDonald, R.J. Evolutionary silencing of the human elastase I gene (ELA1). Hum Mol Genet 6, 897-903 (1997).

296. Talas, U., Dunlop, J., Khalaf, S., Leigh, I.M. & Kelsell, D.P. Human elastase 1: evidence for expression in the skin and the identification of a frequent frameshift polymorphism. J Invest Dermatol 114, 165-70 (2000).

297. Rabaud, M., Dabadie, P., Lefebvre, F., Desgranges, C. & Bricaud, H. Purification of human alpha 1 antiprotease-pancreatic elastase complex. Interaction with homologous elastin. Connective tissue research 12, 165-74 (1984).

298. Patterson, C.E., Schaub, T., Coleman, E.J. & Davis, E.C. Developmental regulation of FKBP65. An ER-localized extracellular matrix binding-protein. Mol Biol Cell 11, 3925-35 (2000).

299. Hinek, A., Pshezhetsky, A.V., von Itzstein, M. & Starcher, B. Lysosomal sialidase (neuraminidase-1) is targeted to the cell surface in a multiprotein complex that facilitates elastic fiber assembly. J Biol Chem 281, 3698-710 (2006).

300. Uemura, T. et al. Trans-synaptic interaction of GluRdelta2 and Neurexin through Cbln1 mediates synapse formation in the cerebellum. Cell 141, 1068-79 (2010).

301. Zhou, M. et al. An investigation into the human serum "interactome". Electrophoresis 25, 1289-98 (2004).

302. Heinz, A., Taddese, S., Sippl, W., Neubert, R.H. & Schmelzer, C.E. Insights into the degradation of human elastin by matrilysin-1. Biochimie 93, 187-94 (2011).

303. Schlotzer-Schrehardt, U. et al. The pathogenesis of floppy eyelid syndrome: involvement of matrix metalloproteinases in elastic fiber degradation. Ophthalmology 112, 694-704 (2005).

304. Choudhury, R. et al. Differential regulation of elastic fiber formation by fibulin-4 and -5. J Biol Chem 284, 24553-67 (2009).

305. Exposito, J.Y. et al. Demosponge and sea anemone fibrillar collagen diversity reveals the early emergence of A/C clades and the maintenance of the modular structure of type V/XI collagens from sponge to human. J Biol Chem 283, 28226-35 (2008).

306. Mariani, T.J., Sandefur, S. & Pierce, R.A. Elastin in lung development. Exp Lung Res 23, 131-45 (1997).

307. Shapiro, S.D., Endicott, S.K., Province, M.A., Pierce, J.A. & Campbell, E.J. Marked longevity of human lung parenchymal elastic fibers deduced from prevalence of D-aspartate and nuclear weapons-related radiocarbon. J Clin Invest 87, 1828-34 (1991).

308. Bushell, K.M., Sollner, C., Schuster-Boeckler, B., Bateman, A. & Wright, G.J. Large-scale screening for novel low-affinity extracellular protein interactions. Genome Res 18, 622-30 (2008).

309. Hart, G.T., Ramani, A.K. & Marcotte, E.M. How complete are current yeast and human protein-interaction networks? Genome Biol 7, 120 (2006).

310. Faye, C. et al. Transglutaminase-2: a new endostatin partner in the extracellular matrix of endothelial cells. Biochem J 427, 467-75 (2010).

311. Symoens, S. et al. Identification of binding partners interacting with the alpha1-N-propeptide of type V collagen. Biochem J 433, 371-81 (2010).

312. Cromar, G.L., Wong, K., Loughran, N., On, T., Song, H., Xiong, X., Zhang, Z. & Parkinson, J. New tricks for 'old' domains: How novel architectures and promiscuous hubs contributed to the organization and evolution of the ECM. GBE (Submitted) (2014).

313. Onnerfjord, P., Khabut, A., Reinholt, F.P., Svensson, O. & Heinegard, D. Quantitative proteomic analysis of eight cartilaginous tissues reveals characteristic differences as well as similarities between subgroups. J Biol Chem 287, 18913-24 (2012).

314. Gore, A.V., Monzo, K., Cha, Y.R., Pan, W. & Weinstein, B.M. Vascular development in the zebrafish. Cold Spring Harb Perspect Med 2, a006684 (2012).

315. Simoes-Costa, M. & Bronner, M.E. Insights into neural crest development and evolution from genomic analysis. Genome Res 23, 1069-80 (2013).

316. Hynes, R.O. The evolution of metazoan extracellular matrix. J Cell Biol 196, 671-9 (2012).

317. Di Lullo, G.A., Sweeney, S.M., Korkko, J., Ala-Kokko, L. & San Antonio, J.D. Mapping the ligand-binding sites and disease-associated mutations on the most abundant protein in the human, type I collagen. J Biol Chem 277, 4223-31 (2002).

318. Tseng, Q. et al. Spatial organization of the extracellular matrix regulates cell-cell junction positioning. Proc Natl Acad Sci U S A 109, 1506-11 (2012).

319. Zmasek, C.M. & Godzik, A. This Deja vu feeling--analysis of multidomain protein evolution in eukaryotic genomes. PLoS Comput Biol 8, e1002701 (2012).

320. Apic, G., Huber, W. & Teichmann, S.A. Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4, 67-78 (2003).

321. Buljan, M., Frankish, A. & Bateman, A. Quantifying the mechanisms of domain gain in animal proteins. Genome Biol 11, R74 (2010).

322. Moore, A.D., Grath, S., Schuler, A., Huylmans, A.K. & Bornberg-Bauer, E. Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree. Biochim Biophys Acta 1834, 898-907 (2013).

323. Moore, A.D. & Bornberg-Bauer, E. The dynamics and evolutionary potential of domain loss and emergence. Mol Biol Evol 29, 787-96 (2012).

324. Pei, J. et al. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. in Data Engineering, 2001. Proceedings. 17th International Conference 215-224 (2001).

325. Punta, M. et al. The Pfam protein families database. Nucleic Acids Res 40, D290-301 (2012).

326. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-5 (2009).

327. Merico, D., Isserlin, R., Stueker, O., Emili, A. & Bader, G.D. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. PLoS One 5, e13984 (2010).

328. Narayanan, K. et al. Dual functional roles of dentin matrix protein 1. Implications in biomineralization and gene transcription by activation of intracellular Ca2+ store. J Biol Chem 278, 17500-8 (2003).

329. Quarles, L.D. FGF23, PHEX, and MEPE regulation of phosphate homeostasis and skeletal mineralization. Am J Physiol Endocrinol Metab 285, E1-9 (2003).

330. Shintani, S. et al. Identification and characterization of ameloblastin gene in a reptile. Gene 283, 245-54 (2002).

331. Li, W., Gibson, C.W., Abrams, W.R., Andrews, D.W. & DenBesten, P.K. Reduced hydrolysis of amelogenin may result in X-linked amelogenesis imperfecta. Matrix Biol 19, 755-60 (2001).

332. Reinholt, F.P., Hultenby, K., Oldberg, A. & Heinegard, D. Osteopontin--a possible anchor of osteoclasts to bone. Proc Natl Acad Sci U S A 87, 4473-5 (1990).

333. Donoghue, P.C. & Sansom, I.J. Origin and early evolution of vertebrate skeletonization. Microsc Res Tech 59, 352-72 (2002).

334. Donoghue, P.C., Sansom, I.J. & Downs, J.P. Early evolution of vertebrate skeletal tissues and cellular interactions, and the canalization of skeletal development. J Exp Zool B Mol Dev Evol 306, 278-94 (2006).

335. Downs, J.P., Daeschler, E.B., Jenkins, F.A., Jr. & Shubin, N.H. The cranial endoskeleton of Tiktaalik roseae. Nature 455, 925-9 (2008).

336. Ahlberg, P.E., Clack, J.A., Luksevics, E., Blom, H. & Zupins, I. Ventastega curonica and the origin of tetrapod morphology. Nature 453, 1199-204 (2008).

337. Patil, A., Kinoshita, K. & Nakamura, H. Domain distribution and intrinsic disorder in hubs in the human protein-protein interaction network. Protein Sci 19, 1461-8 (2010).

338. Patil, A., Kinoshita, K. & Nakamura, H. Hub promiscuity in protein-protein interaction networks. Int J Mol Sci 11, 1930-43 (2010).

339. Borg, J.P. et al. ERBIN: a basolateral PDZ protein that interacts with the mammalian ERBB2/HER2 receptor. Nat Cell Biol 2, 407-14 (2000).

340. de Hoon, M.J., Imoto, S., Nolan, J. & Miyano, S. Open source clustering software. Bioinformatics 20, 1453-4 (2004).

341. Bornberg-Bauer, E., Huylmans, A.K. & Sikosek, T. How do new proteins arise? Curr Opin Struct Biol 20, 390-6 (2010).

342. Vogel, C., Teichmann, S.A. & Pereira-Leal, J. The relationship between domain duplication and recombination. J Mol Biol 346, 355-65 (2005).

343. Chera, S. et al. Silencing of the hydra serine protease inhibitor Kazal1 gene mimics the human SPINK1 pancreatic phenotype. J Cell Sci 119, 846-57 (2006).

344. van Hoef, V. et al. Functional analysis of a pancreatic secretory trypsin inhibitor-like protein in insects: silencing effects resemble the human pancreatic autodigestion phenotype. Insect Biochem Mol Biol 41, 688-95 (2011).

345. Nirmala, X., Kodrik, D., Zurovec, M. & Sehnal, F. Insect silk contains both a Kunitz-type and a unique Kazal-type proteinase inhibitor. Eur J Biochem 268, 2064-73 (2001).

346. Kummerfeld, S.K. & Teichmann, S.A. Protein domain organisation: adding order. BMC Bioinformatics 10, 39 (2009).

347. Bashton, M. & Chothia, C. The geometry of domain combination in proteins. J Mol Biol 315, 927-39 (2002).

348. Todd, A.E., Orengo, C.A. & Thornton, J.M. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 307, 1113-43 (2001).

349. Martins, M.L. et al. Characterization of the acute inflammatory response in the hybrid tambacu (Piaractus mesopotamicus male x Colossoma macropomum female) (Osteichthyes). Braz J Biol 69, 957-62 (2009).

350. Ekman, D., Bjorklund, A.K. & Elofsson, A. Quantification of the elevated rate of domain rearrangements in metazoa. J Mol Biol 372, 1337-48 (2007).

351. Bjorklund, A.K., Ekman, D. & Elofsson, A. Expansion of protein domain repeats. PLoS Comput Biol 2, e114 (2006).

352. Ekman, D., Bjorklund, A.K., Frey-Skott, J. & Elofsson, A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol 348, 231-43 (2005).

353. Vogel, C., Berzuini, C., Bashton, M., Gough, J. & Teichmann, S.A. Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336, 809-23 (2004).

354. Uliel, S., Fliess, A. & Unger, R. Naturally occurring circular permutations in proteins. Protein Eng 14, 533-42 (2001).

355. Fliess, A., Motro, B. & Unger, R. Swaps in protein sequences. Proteins 48, 377-87 (2002).

356. Koonin, E.V. How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet 1, 99-116 (2000).

357. Pu, S. et al. Expanding the landscape of chromatin modification (CM)-related functional domains and genes in human. PLoS One 5, e14122 (2010).

358. von Mering, C. et al. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 31, 258-61 (2003).

359. Warde-Farley, D. et al. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38, W214-20 (2010).

360. Barker, T.H. The role of ECM proteins and protein fragments in guiding cell behavior in regenerative medicine. Biomaterials 32, 4211-4 (2011).

361. Huang, S. Systems biology of stem cells: three useful perspectives to help overcome the paradigm of linear pathways. Philos Trans R Soc Lond B Biol Sci 366, 2247-59 (2011).

362. Royer, L., Reimann, M., Andreopoulos, B. & Schroeder, M. Unraveling protein networks with power graph analysis. PLoS Comput Biol 4, e1000108 (2008).

363. Saito, R. et al. A travel guide to Cytoscape plugins. Nat Methods 9, 1069-76 (2012).

364. Emig, D. et al. AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Res 38, W755-62 (2010).

365. Hao, Y. et al. OrthoNets: simultaneous visual analysis of orthologs and their interaction neighborhoods across different organisms. Bioinformatics 27, 883-4 (2011).

366. Nakamura, M., Mie, M., Nakamura, M. & Kobatake, E. Construction of multi-functional extracellular matrix proteins that inhibits migration and tube formation of endothelial cells. Biotechnol Lett 34, 1571-7 (2012).


Appendices

Appendix 1: List of organisms and genomic data sources

Organism | Source
Arabidopsis thaliana | PlantGDB: v.173(260809)
Populus balsamifera | JGI: v1.0(240310)
Vitis vinifera | PlantGDB: v.173(260809)
Oryza sativa japonica | EMBL: 241207
Oryza sativa indica | NCBI: 60709
Chlamydomonas reinhardtii | JGI: v1.0(240310)
Cyanidioschyzon merolae | Cyanidioschyzon merolae Genome Project: 180108
Ostreococcus lucimarinus CCE9901 | JGI: v1.0(240310)
Ostreococcus tauri | JGI: v1.0(240310)
Phytophthora ramorum | JGI: v1.0(240310)
Phytophthora sojae | JGI: v1.0(240310)
Phytophthora infestans | Broad Institute: 70709
Thalassiosira pseudonana CCMP1335 | JGI: v1.0(240310)
Paramecium tetraurelia | ParameciumDB: v.1.15(020108)
Plasmodium berghei | PlasmoDB: v.5.4(240907)
Plasmodium chabaudi | PlasmoDB: v.5.4(240907)
Plasmodium falciparum 3D7 | PlasmoDB: v.5.4(240907)
Plasmodium knowlesi strain H | PlasmoDB: v.5.4(240907)
Plasmodium vivax | PlasmoDB: v.5.4(240907)
Plasmodium yoelii yoelii | PlasmoDB: v.5.4(240907)
Theileria annulata Ankara clone C9 | GenBank: 220108
Theileria parva strain Muguga | GenBank: 220108
Toxoplasma gondii ME49 | ToxoDB: v.4.3(011107)
Cryptosporidium hominis TU502 | CryptoDB: 131107
Cryptosporidium parvum Iowa II | CryptoDB: 131107
Leishmania braziliensis | EMBL: 241207
Leishmania infantum | EMBL: 241207
Leishmania major strain Friedlin | EMBL: 241207
Trypanosoma brucei | SANGER: 110506
Giardia lamblia | GiardiaDB: 92707
Guillardia theta | EMBL: 241207
Trichomonas vaginalis | TrichDB: v.1.1(250609)
Entamoeba histolytica HM-1:IMSS | SANGER: 110506
Dictyostelium discoideum | dictyBase: 41009
Encephalitozoon cuniculi GB-M1 | EMBL: 241207
Neurospora crassa | Broad Institute: 70709
Magnaporthe grisea 70-15 | Broad Institute: 70709
Fusarium graminearum (Gibberella zeae PH-1) | Broad Institute: 70709
Aspergillus flavus | Broad Institute: 70709
Aspergillus oryzae RIB40 | Jason: 210905
Aspergillus terreus NIH2624 | Broad Institute: 70709
Aspergillus niger | Broad Institute: 70709
Aspergillus fischeri | Broad Institute: 70709
Aspergillus fumigatus Af293 | EMBL: 241207
Aspergillus clavatus | Broad Institute: 70709
Aspergillus (Emericella) nidulans FGSC A4 | Broad Institute: 70709
Pichia stipitis | EMBL: 241207
Vanderwaltozyma polyspora DSM 70294 | EMBL: 241207
Debaryomyces hansenii CBS767 | Jason: 210905
Pichia pastoris | GenBank: 27808
Candida albicans WO-1 | Broad Institute: 70709
Saccharomyces cerevisiae | SGD: 121207
Candida glabrata CBS 138 | EMBL: 241207
Ashbya gossypii ATCC 10895 | EMBL: 241207
Kluyveromyces lactis NRRL Y-1140 | EMBL: 241207
Yarrowia lipolytica CLIB122 | EMBL: 241207
Schizosaccharomyces pombe | SANGER: 110506
Phanerochaete chrysosporium RP-78 | JGI: v1.0(240310)
Cryptococcus neoformans var. grubii H99 | Broad Institute: 70709
Cryptococcus neoformans var. neoformans B-3501A | EMBL: 241207
Cryptococcus neoformans var. neoformans JEC21 | EMBL: 241207
Postia placenta MAD 698-R | JGI: v1.0(20090305)
Ustilago maydis 521 | Broad Institute: 70709
Nematostella vectensis | JGI: v1.0(240310)
Monosiga brevicollis | JGI: v1.0(240310)
Amphimedon queenslandica | JGI: v1.0(240310)
Trichoplax adhaerens | JGI: v1.0(240310)
Hydra magnipapillata | GenBank: 529188
Caenorhabditis angaria | WormBase: WS205(300709)
Caenorhabditis briggsae | WormBase: WS205(300709)
Caenorhabditis elegans | WormBase: WS205(300709)
Caenorhabditis brenneri | WormBase: WS205(300709)
Caenorhabditis japonica | WormBase: WS205(300709)
Caenorhabditis remanei | WormBase: WS205(300709)
Pristionchus pacificus | WormBase: WS205(300709)
Bursaphelenchus xylophilus | WormBase: WS238(112511)
Meloidogyne hapla | WormBase: WS238(081611)
Meloidogyne incognita | INRA: 29690
Ascaris suum | WormBase: WS238(081611)
Brugia malayi | WormBase: WS205(300709)
Trichinella spiralis | WormBase: WS238(081611)
Aedes aegypti | VectorBase: V.AaegL1.48
Anopheles gambiae str. PEST | ENSEMBL: 231107
Culex pipiens | Broad Institute: 70709
Apis mellifera DH4 | BeeBase: PreRelease2
Bombyx mori | SilkDB: 2004
Drosophila ananassae | FlyBase: v.1.3(250609)
Drosophila melanogaster | FlyBase: v.1.3(250609)
Drosophila sechellia | FlyBase: v.1.3(250609)
Drosophila simulans | FlyBase: v.1.3(250609)
Drosophila yakuba | FlyBase: v.1.3(250609)
Drosophila erecta | FlyBase: v.1.3(250609)
Drosophila persimilis | FlyBase: v.1.3(250609)
Drosophila pseudoobscura | FlyBase: v.1.3(250609)
Drosophila willistoni | FlyBase: v.1.3(250609)
Drosophila virilis | FlyBase: v.1.3(250609)
Drosophila mojavensis | FlyBase: v.1.3(250609)
Drosophila grimshawi | FlyBase: v.1.3(250609)
Ciona intestinalis | ENSEMBL: 231107
Ciona savignyi | ENSEMBL: 231107
Oryzias latipes | ENSEMBL: 231107
Takifugu rubripes | ENSEMBL: 231107
Danio rerio | ENSEMBL: 231107
Tetraodon nigroviridis | ENSEMBL: 231107
Gasterosteus aculeatus | ENSEMBL: 231107
Xenopus tropicalis | ENSEMBL: 231107
Gallus gallus | ENSEMBL: 231107
Ornithorhynchus anatinus | ENSEMBL: 231107
Loxodonta africana | ENSEMBL: 231107
Dasypus novemcinctus | ENSEMBL: 231107
Echinops telfairi | ENSEMBL: 231107
Erinaceus europaeus | ENSEMBL: 231107
Sorex araneus | ENSEMBL: 231107
Tupaia belangeri | ENSEMBL: 231107
Myotis lucifugus | ENSEMBL: 231107
Equus caballus | Broad Institute: 70709
Bos taurus | ENSEMBL: 231107
Felis catus | ENSEMBL: 231107
Canis lupus familiaris | ENSEMBL: 231107
Oryctolagus cuniculus | ENSEMBL: 231107
Cavia porcellus | ENSEMBL: 231107
Ochotona princeps | ENSEMBL: 231107
Spermophilus tridecemlineatus | ENSEMBL: 231107
Rattus norvegicus | ENSEMBL: 231107
Monodelphis domestica | ENSEMBL: 231107
Mus musculus | ENSEMBL: 231107
Otolemur garnettii | ENSEMBL: 231107
Microcebus murinus | ENSEMBL: 231107
Macaca mulatta | ENSEMBL: 231107
Pan troglodytes | ENSEMBL: 231107
Homo sapiens | ENSEMBL: 231107


Appendix 2: Phylogenetic ordering of species in protein/domain conservation heatmaps

Display order (Proteins) | Display order (Domains) | Species | Common Name / Association
1 | 1 | A. thaliana | Thale cress
2 | 2 | P. trichocarpa | Western balsam poplar
3 | 3 | V. vinifera | Common grape vine
4 | 4 | O. sativa_japonica | Asian rice
- | 5 | O. sativa_indica | Asian rice
5 | 6 | C. reinhardtii | Green alga
6 | 7 | C. merolae | Green alga
7 | 8 | O. lucimarinus | Green alga
8 | 9 | O. tauri | Green alga
9 | 10 | P. ramorum | Oomycete (Sudden oak death)
10 | 11 | P. sojae | Oomycete (Soybean pathogen)
- | 12 | P. infestans | Oomycete (Irish potato famine)
11 | 13 | T. pseudonana | Diatom
12 | - | T. thermophila* | Ciliate protozoan (Alveolate)
13 | 14 | P. tetraurelia | Ciliate protozoan (Alveolate)
14 | 15 | P. berghei | Apicomplexan (Rabbit malaria)
15 | 16 | P. chabaudi | Apicomplexan (Rodent malaria)
16 | 17 | P. falciparum | Apicomplexan (Human malaria)
17 | 18 | P. knowlesi | Apicomplexan (Primate malaria)
18 | 19 | P. vivax | Apicomplexan (Human malaria)
19 | 20 | P. yoelii | Apicomplexan (Rodent malaria)
20 | 21 | T. annulata | Apicomplexan (Cattle pathogen)
21 | 22 | T. parva | Apicomplexan (African east coast fever)
22 | 23 | T. gondii | Apicomplexan (Mammal pathogen)
23 | 24 | C. hominis | Apicomplexan (Human pathogen)
24 | 25 | C. parvum | Apicomplexan (Human pathogen)
25 | 26 | L. braziliensis | Parasitic protozoan (Leishmaniasis)
26 | 27 | L. infantum | Parasitic protozoan (Visceral leishmaniasis)
27 | 28 | L. major | Parasitic protozoan (Cutaneous leishmaniasis)
28 | 29 | T. brucei | Parasitic protozoan (Sleeping sickness)
29 | 30 | G. lamblia | Parasitic protozoan (Giardiasis)
30 | 31 | G. theta | Cryptomonad
- | 32 | T. vaginalis | Flagellated protozoan (Trichomoniasis)
31 | 33 | E. histolytica | Amoebozoan (Amoebic dysentery)
32 | 34 | D. discoideum | Amoebozoan (Slime mould)
33 | 35 | E. cuniculi | Microsporidian fungus
34 | 36 | N. crassa | Bread mould (Ascomycete)
35 | 37 | M. grisea | Rice disease (Sordariomycete)
36 | 38 | F. graminearum | Wheat crown rot (Sordariomycete)
37 | 39 | A. flavus | Filamentous fungi (Eurotiomycete)
38 | 40 | A. oryzae | Filamentous fungi (Eurotiomycete)
39 | 41 | A. terreus | Filamentous fungi (Eurotiomycete)
40 | 42 | A. niger | Filamentous fungi (Eurotiomycete)
41 | 43 | A. fischeri | Filamentous fungi (Eurotiomycete)
42 | 44 | A. fumigatus | Filamentous fungi (Eurotiomycete)
43 | 45 | A. clavatus | Filamentous fungi (Eurotiomycete)
44 | 46 | A. nidulans | Filamentous fungi (Eurotiomycete)
45 | 47 | P. stipitis | Fission yeast (Saccharomycete)
46 | 48 | V. polyspora | Fission yeast (Saccharomycete)
47 | 49 | D. hansenii | Fission yeast (Saccharomycete)
- | 50 | P. pastoris | Fission yeast (Saccharomycete)
48 | 51 | C. albicans | Fission yeast (Saccharomycete)
49 | 52 | S. cerevisiae | Fission yeast (Saccharomycete)
50 | 53 | C. glabrata | Fission yeast (Saccharomycete)
51 | 54 | A. gossypii | Fission yeast (Saccharomycete)
52 | 55 | K. lactis | Fission yeast (Saccharomycete)
53 | 56 | Y. lipolytica | Fission yeast (Saccharomycete)
54 | 57 | S. pombe | Fission yeast (Schizosaccharomycete)
55 | 58 | P. chrysosporium | Chrysosporium (Basidiomycete fungus)
56 | 59 | C. neoformans_AH99 | Cryptococcus (Basidiomycete fungus)
57 | 60 | C. neoformans_B3501A | Cryptococcus (Basidiomycete fungus)
58 | 61 | C. neoformans_JEC21 | Cryptococcus (Basidiomycete fungus)
- | 62 | P. placenta | Basidiomycete (Brown rot fungus)
59 | 63 | U. maydis | Corn smut (fungus)
60 | 64 | M. brevicollis | Choanoflagellate
- | 65 | A. queenslandica | Sponge
61 | 66 | T. adhaerens | Metazoan (placozoa)
- | 67 | H. magnipapillata | Cnidarian (hydra)
62 | 68 | N. vectensis | Starlet sea anemone
- | 69 | C. angaria | Nematode (roundworm)
63 | 70 | C. briggsae | Nematode (roundworm)
64 | 71 | C. elegans | Nematode (roundworm)
- | 72 | C. brenneri | Nematode (roundworm)
- | 73 | C. japonica | Nematode (roundworm)
- | 74 | C. remanei | Nematode (roundworm)
- | 75 | P. pacificus | Nematode (roundworm)
- | 76 | B. xylophilus | Nematode (pine wilt)
- | 77 | M. hapla | Nematode (northern root knot)
- | 78 | M. incognita | Nematode (root knot)
- | 79 | A. suum | Nematode (ascariasis roundworm)
65 | 80 | B. malayi | Nematode (roundworm)
- | 81 | T. spiralis | Nematode (trichinosis roundworm)
66 | - | D. pulex* | Daphnia (water flea)
67 | 82 | A. aegypti | Mosquito
68 | 83 | A. gambiae | Mosquito
- | 84 | C. pipiens | Mosquito
69 | 85 | A. mellifera | Western honey bee
70 | 86 | B. mori | Silkworm
71 | 87 | D. ananassae | Fruit fly
72 | 88 | D. melanogaster | Fruit fly
73 | 89 | D. sechellia | Fruit fly
74 | 90 | D. simulans | Fruit fly
75 | 91 | D. yakuba | Fruit fly
76 | 92 | D. erecta | Fruit fly
77 | 93 | D. persimilis | Fruit fly
78 | 94 | D. pseudoobscura | Fruit fly
79 | 95 | D. willistoni | Fruit fly
80 | 96 | D. virilis | Fruit fly
81 | 97 | D. mojavensis | Fruit fly
82 | 98 | D. grimshawi | Fruit fly
83 | - | L. gigantea* | Owl limpet (gastropod)
84 | - | B. floridae* | Lancelet (marine chordate)
85 | 99 | C. intestinalis | Vase tunicate (sea squirt)
86 | 100 | C. savignyi | Solitary sea squirt
87 | 101 | O. latipes | Japanese rice fish
88 | 102 | T. rubripes | Fugu (pufferfish)
89 | 103 | D. rerio | Zebrafish
90 | 104 | T. nigroviridis | Green-spotted pufferfish
91 | 105 | G. aculeatus | Three-spined stickleback (fish)
92 | 106 | X. tropicalis | Western clawed frog
93 | 107 | G. gallus | Red jungle fowl (chicken)
94 | 108 | O. anatinus | Platypus
95 | 109 | L. africana | African bush elephant
96 | 110 | D. novemcinctus | Nine-banded armadillo
97 | 111 | E. telfairi | Lesser hedgehog tenrec
98 | 112 | E. europaeus | European (common) hedgehog
99 | 113 | S. araneus | Common shrew
100 | 114 | T. belangeri | Northern treeshrew
101 | 115 | M. lucifugus | Little brown bat
102 | 116 | E. caballus | Horse
103 | 117 | B. taurus | Cattle
104 | 118 | F. catus | Domestic cat
105 | 119 | C. familiaris | Domestic dog
106 | 120 | O. cuniculus | European (common) rabbit
107 | 121 | C. porcellus | Guinea pig
108 | 122 | O. princeps | American pika
109 | 123 | S. tridecemlineatus | Thirteen-lined ground squirrel
110 | 124 | R. norvegicus | Norway rat
111 | 125 | M. domestica | Gray short-tailed opossum
112 | 126 | M. musculus | House mouse
113 | 127 | O. garnettii | Northern greater galago
114 | 128 | M. murinus | Gray mouse lemur
115 | 129 | M. mulatta | Rhesus macaque
116 | 130 | P. troglodytes | Chimpanzee
117 | 131 | H. sapiens | Human

A dash indicates that the species is absent from that ordering. Shading indicates species additions; asterisks indicate species not yet present in PhyloPro.

Appendix 3: Shannon information indices across a range of MCL inflation values for the ECM and 100 random networks with the same degree distribution.

As the MCL inflation value increases, clusters become more granular and a larger number of proteins (singletons) are removed from the Shannon calculation. A local minimum of the Shannon index (the sum of the Bir and Pir components) is reached over the same MCL inflation range in both figures. However, the index value reflects a larger proportion of the proteins in the real network than in the random network.
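As a rough illustration of why singleton removal matters for an entropy-style index, the sketch below computes a generic Shannon entropy over the cluster-size distribution of one clustering. It is only an assumption-laden example: it does not reproduce the Bir and Pir components used in this thesis, and the function name `shannon_index`, the `drop_singletons` flag and the toy protein labels are hypothetical.

```python
import math
from collections import Counter

def shannon_index(cluster_of, drop_singletons=True):
    """Generic Shannon entropy (in nats) of a cluster-size distribution.

    cluster_of maps each protein to its cluster label. This is NOT the
    Bir + Pir index of the thesis; it only illustrates the entropy idea.
    """
    sizes = Counter(cluster_of.values())
    if drop_singletons:
        # Proteins left alone in a cluster are excluded from the calculation.
        sizes = {label: n for label, n in sizes.items() if n > 1}
    total = sum(sizes.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in sizes.values())

# Toy clustering (hypothetical labels): two clusters of two and one singleton.
toy = {"p1": "c1", "p2": "c1", "p3": "c2", "p4": "c2", "p5": "c3"}
index_without_singletons = shannon_index(toy)
index_with_singletons = shannon_index(toy, drop_singletons=False)
```

Evaluating such a function once per MCL inflation value, for the real network and for each randomised network, would give curves of the kind this appendix describes, under the stated assumptions.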

Appendix 4: SignalP and TMHMM predictions

Signal peptide (SignalP) and transmembrane (TMHMM) predictions for several groupings of ECM and non-ECM proteins. The whole human genome (red) is compared with: 1) all proteins classified as extracellular (purple/blue); 2) the subset classified as ECM proteins (green/light green); 3) all network proteins, which are a subset of ECM proteins combined with their functionally related neighbours (yellow/tan); 4) the subset of ECM proteins with interactions, and therefore included in the network (black/grey); and 5) the subset of non-ECM proteins included in the network as network neighbours (Net-N; orange/light orange). Each group is further subdivided into dependent (Dep) and independent (Indep) proteins. Dependent proteins are those whose GO annotations included SignalP- and/or transmembrane-based predictions used to classify the proteins. Independent proteins are those whose annotations were exclusive of such predictions.


Appendix 5: Sequences for recombinant elastin peptides

Name: hTE
Molecular weight (Da): 59959.9
Sequence:
GVPGAIPGGVPGGVFYPGAGLGALGGGALGPGGKPLKPV
PGGLAGAGLGAGLGAFPAVTFPGALVPGGVADAAAAYKA
AKAGAGLGGVPGVGGLGVSAGAVVPQPGAGVKPGKVPGV
GLPGVYPGGVLPGARFPGVGVLPGVPTGAGVKPKAPGVG
GAFAGIPGVGPFGGPQPGVPLGYPIKAPKLPGGYGLPYT
TGKLPYGYGPGGVAGAAGKAGYPTGTGVGPQAAAAAAAK
AAAKFGAGAAGVLPGVGGAGVPGVPGAIPGIGGIAGVGT
PAAAAAAAAAAKAAKYGAAAGLVPGGPGFGPGVVGVPGA
GVPGVGVPGAGIPVVPGAGIPGAAVPGVVSPEAAAKAAA
KAAKYGARPGVGVGGIPTYGVGAGGFPGFGVGVGGIPGV
AGVPSVGGVPGVGGVPGVGISPEAQAAAAAKAAKYGVGT
PAAAAAKAAAKAAQFGLVPGVGVAPGVGVAPGVGVAPGV
GLAPGVGVAPGVGVAPGVGVAPGIGPGGVAAAAKSAAKV
AAKAQLRAAAGLGAGIPGLGVGVGVPGLGVGAGVPGLGV
GAGVPGFGAVPGALAAAKAAKYGAAVPGVLGGLGALGGV
GIPGGVVGAGPAAAAAAAKAAAKAAQFGLVGAAGLGGLG
VGGLGVPGVGGLGGIPPAAAAKAAKYGAAGLGGVLGGAG
QFPLGGVAARPGFGLSPIFPGGACLGKACGRKRK

Name: hTE∆36
Molecular weight (Da): 58573.2
Sequence:
GVPGAIPGGVPGGVFYPGAGLGALGGGALGPGGKPLKPV
PGGLAGAGLGAGLGAFPAVTFPGALVPGGVADAAAAYKA
AKAGAGLGGVPGVGGLGVSAGAVVPQPGAGVKPGKVPGV
GLPGVYPGGVLPGARFPGVGVLPGVPTGAGVKPKAPGVG
GAFAGIPGVGPFGGPQPGVPLGYPIKAPKLPGGYGLPYT
TGKLPYGYGPGGVAGAAGKAGYPTGTGVGPQAAAAAAAK
AAAKFGAGAAGVLPGVGGAGVPGVPGAIPGIGGIAGVGT
PAAAAAAAAAAKAAKYGAAAGLVPGGPGFGPGVVGVPGA
GVPGVGVPGAGIPVVPGAGIPGAAVPGVVSPEAAAKAAA
KAAKYGARPGVGVGGIPTYGVGAGGFPGFGVGVGGIPGV
AGVPSVGGVPGVGGVPGVGISPEAQAAAAAKAAKYGVGT
PAAAAAKAAAKAAQFGLVPGVGVAPGVGVAPGVGVAPGV
GLAPGVGVAPGVGVAPGVGVAPGIGPGGVAAAAKSAAKV
AAKAQLRAAAGLGAGIPGLGVGVGVPGLGVGAGVPGLGV
GAGVPGFGAVPGALAAAKAAKYGAAVPGVLGGLGALGGV
GIPGGVVGAGPAAAAAAAKAAAKAAQFGLVGAAGLGGLG
VGGLGVPGVGGLGGIPPAAAAKAAKYGAAGLGGVLGGAG
QFPLGGVAARPGFGLSPIFP

Name: hTE∆(8-14)
Molecular weight (Da): 48632.6
Sequence:
GVPGAIPGGVPGGVFYPGAGLGALGGGALGPGGKPLKPV
PGGLAGAGLGAGLGAFPAVTFPGALVPGGVADAAAAYKA
AKAGAGLGGVPGVGGLGVSAGVGPQAAAAAAAKAAAKFG
AGAAGVLPGVGGAGVPGVPGAIPGIGGIAGVGTPAAAAA
AAAAAKAAKYGAAAGLVPGGPGFGPGVVGVPGAGVPGVG
VPGAGIPVVPGAGIPGAAVPGVVSPEAAAKAAAKAAKYG
ARPGVGVGGIPTYGVGAGGFPGFGVGVGGIPGVAGVPSV
GGVPGVGGVPGVGISPEAQAAAAAKAAKYGVGTPAAAAA
KAAAKAAQFGLVPGVGVAPGVGVAPGVGVAPGVGLAPGV
GVAPGVGVAPGVGVAPGIGPGGVAAAAKSAAKVAAKAQL
RAAAGLGAGIPGLGVGVGVPGLGVGAGVPGLGVGAGVPG
FGAVPGALAAAKAAKYGAAVPGVLGGLGALGGVGIPGGV
VGAGPAAAAAAAKAAAKAAQFGLVGAAGLGGLGVGGLGV
PGVGGLGGIPPAAAAKAAKYGAAGLGGVLGGAGQFPLGG
VAARPGFGLSPIFPGGACLGKACGRKRK

Name: EP20-24-24
Molecular weight (Da): 16991.8
Sequence:
FPGFGVGVGGIPGVAGVPGVGGVPGVGGVPGVGISPEAQ
AAAAAKAAKYGVGTPAAAAAKAAAKAAQFGLVPGVGVAP
GVGVAPGVGVAPGVGLAPGVGVAPGVGVAPGVGVAPAIG
PEAQAAAAAKAAKYGVGTPAAAAAKAAAKAAQFGLVPGV
GVAPGVGVAPGVGVAPGVGLAPGVGVAPGVGVAPGVGVA
PAIGP

Name: EP20-24-24/36
Molecular weight (Da): 18378.5
Sequence:
FPGFGVGVGGIPGVAGVPGVGGVPGVGGVPGVGISPEAQ
AAAAAKAAKYGVGTPAAAAAKAAAKAAQFGLVPGVGVAP
GVGVAPGVGVAPGVGLAPGVGVAPGVGVAPGVGVAPAIG
PEAQAAAAAKAAKYGVGTPAAAAAKAAAKAAQFGLVPGV
GVAPGVGVAPGVGVAPGVGLAPGVGVAPGVGVAPGVGVA
PAIGP

Names are abbreviated as follows: Human tropoelastin (hTE); Human tropoelastin with deletion of exon 36 (hTE∆36); Human tropoelastin with deletion of exons 8 through 14 [hTE∆(8-14)]; Elastin-like peptide consisting of exons 20+24+24 (EP20-24-24); Elastin-like peptide consisting of exons 20+24+24+36 (EP20-24-24/36). Where applicable, exons are in N-terminal to C-terminal order.
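For readers who wish to relate a listed sequence to an approximate molecular weight, the following minimal sketch sums standard average residue masses and adds one water molecule for the free termini. The appendix does not state how the listed weights were obtained, so this is an assumption and need not reproduce the listed values exactly (tags, modifications or a different mass table would change the result). The names `RESIDUE_MASS`, `WATER_MASS` and `average_mass` are illustrative only.

```python
# Standard average residue masses in Da (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one H2O for the free N- and C-termini

def average_mass(sequence):
    """Approximate average mass (Da) of an unmodified linear peptide."""
    # Remove the spaces and line breaks used to wrap long sequences.
    seq = "".join(sequence.split()).upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unrecognised residue codes: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS

# Usage: wrapped and unwrapped input give the same result.
mass_wrapped = average_mass("FPGFG VGVGG\nIPGVA")
mass_plain = average_mass("FPGFGVGVGGIPGVA")
```

Because only residue composition enters the sum, the result is independent of residue order and of how the sequence is wrapped across lines.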

Appendix 6: A systematically derived list of Gene Ontology (GO) terms

This list of 103 terms was used to identify candidate ECM proteins in human, mouse and rat. CC = Cell Component; BP = Biological Process; MF = Molecular Function.

Identifier Description CC BP MF

GO:0001527 microfibril x

GO:0005576 extracellular region x

GO:0005577 fibrinogen complex x

GO:0005578 proteinaceous extracellular matrix x

GO:0005581 collagen x

GO:0005582 collagen type XV x

GO:0005583 fibrillar collagen x

GO:0005584 collagen type I x

GO:0005585 collagen type II x

GO:0005586 collagen type III x

GO:0005587 collagen type IV x

GO:0005588 collagen type V x

GO:0005589 collagen type VI x

GO:0005590 collagen type VII x

GO:0005591 collagen type VIII x

GO:0005592 collagen type XI x

GO:0005593 FACIT collagen x

GO:0005594 collagen type IX x

GO:0005595 collagen type XII x

GO:0005596 collagen type XIV x

GO:0005597 collagen type XVI x

GO:0005598 short-chain collagen x

GO:0005599 collagen type X x

GO:0005600 collagen type XIII x

GO:0005604 basement membrane x

GO:0005605 basal lamina x

GO:0005606 laminin-1 complex x

GO:0005607 laminin-2 complex x

GO:0005608 laminin-3 complex x

GO:0005609 laminin-4 complex x

GO:0005610 laminin-5 complex x

GO:0005611 laminin-6 complex x

GO:0005612 laminin-7 complex x

GO:0005614 interstitial matrix x

GO:0005615 extracellular space x

GO:0008002 lamina lucida x

GO:0008003 lamina densa x

GO:0008004 lamina reticularis x

GO:0009519 middle lamella x

GO:0016010 dystrophin-associated glycoprotein complex x

GO:0016011 dystroglycan complex x

GO:0030934 anchoring collagen x

GO:0030935 sheet-forming collagen x

GO:0030936 transmembrane collagen x

GO:0030937 collagen type XVII x

GO:0030938 collagen type XVIII x

GO:0031012 extracellular matrix x

GO:0032579 apical lamina of hyaline layer x

GO:0033165 interphotoreceptor matrix x

GO:0033166 hyaline layer x

GO:0043205 fibril x

GO:0043256 laminin complex x

GO:0043257 laminin-8 complex x

GO:0043258 laminin-9 complex x

GO:0043259 laminin-10 complex x

GO:0043260 laminin-11 complex x

GO:0043261 laminin-12 complex x

GO:0043655 extracellular space of host x

GO:0044420 extracellular matrix part x

GO:0044421 extracellular region part x

GO:0048196 middle lamella-containing extracellular matrix x

GO:0060102 collagen and cuticulin-based cuticle extracellular matrix x

GO:0060103 collagen and cuticulin-based cuticle extracellular matrix part x

GO:0060104 surface coat of collagen and cuticulin-based cuticle extracellular matrix x

GO:0060105 epicuticle of collagen and cuticulin-based cuticle extracellular matrix x

GO:0060106 cortical layer of collagen and cuticulin-based cuticle extracellular matrix x

GO:0060107 annuli extracellular matrix x

GO:0060108 annular furrow extracellular matrix x

GO:0060109 medial layer of collagen and cuticulin-based cuticle extracellular matrix x

GO:0060110 basal layer of collagen and cuticulin-based cuticle extracellular matrix x

GO:0060111 alae of collagen and cuticulin-based cuticle extracellular matrix x

GO:0021820 organization of extracellular matrix in the marginal zone involved in cerebral cortex radial glia guided migration x

GO:0021832 cell-cell adhesion involved in cerebral cortex tangential migration using cell-cell interactions x

GO:0021833 cell-substrate adhesion involved in tangential migration using cell-cell interactions x

GO:0021939 extracellular matrix-granule cell signaling involved in regulation of granule cell precursor proliferation x

GO:0022608 multicellular organism adhesion x

GO:0022609 multicellular organism adhesion to substrate x

GO:0022617 extracellular matrix disassembly x

GO:0030198 extracellular matrix organization and biogenesis x

GO:0030199 collagen fibril organization x

GO:0032836 glomerular basement membrane development x

GO:0040002 collagen and cuticulin-based cuticle development x

GO:0040004 collagen and cuticulin-based cuticle attachment to epithelium x

GO:0040006 protein-based cuticle attachment to epithelium x

GO:0042074 cell migration involved in gastrulation x

GO:0043062 extracellular structure organization and biogenesis x

GO:0043206 fibril organization and biogenesis x

GO:0046849 bone remodeling x

GO:0048251 elastic fiber assembly x

GO:0048771 tissue remodeling x

GO:0050817 coagulation x

GO:0051216 cartilage development x

GO:0060055 angiogenesis involved in wound healing x

GO:0005201 extracellular matrix structural constituent x

GO:0030020 extracellular matrix structural constituent conferring tensile strength x

GO:0030021 extracellular matrix structural constituent conferring compression resistance x

GO:0030022 adhesive extracellular matrix constituent x

GO:0030023 extracellular matrix constituent conferring elasticity x

GO:0030197 extracellular matrix constituent, lubricant activity x

GO:0042329 structural constituent of collagen and cuticulin-based cuticle x

GO:0046810 host cell extracellular matrix binding x

GO:0050839 cell adhesion molecule binding x

GO:0050840 extracellular matrix binding x
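The way a term list of this kind can be used to flag candidate proteins is sketched below: a protein is kept if at least one of its GO annotations appears in the list. This is a minimal illustration under stated assumptions, not the protocol used in this thesis. The protein identifiers and their annotations are invented, `candidate_proteins` is a hypothetical helper, and only three of the 103 terms are shown.

```python
# Three identifiers taken from the list above; the full list has 103 terms.
ECM_GO_TERMS = {
    "GO:0005578",  # proteinaceous extracellular matrix
    "GO:0031012",  # extracellular matrix
    "GO:0005201",  # extracellular matrix structural constituent
}

def candidate_proteins(annotations, go_terms):
    """Return proteins annotated with at least one term in go_terms.

    annotations maps a protein identifier to an iterable of GO identifiers.
    """
    return sorted(
        protein
        for protein, terms in annotations.items()
        if go_terms & set(terms)
    )

# Invented annotations for three placeholder proteins.
example_annotations = {
    "protA": ["GO:0005578"],
    "protB": ["GO:0005634"],                # nucleus only, so not selected
    "protC": ["GO:0005201", "GO:0005634"],
}
candidates = candidate_proteins(example_annotations, ECM_GO_TERMS)
```

A set intersection keeps the check independent of annotation order and scales to the full term list without changes.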


Appendix 7: GO terms significantly enriched in the ECM network

Concept Name Concept Type Name P-Value Q-Value

Extracellular region part GO Cellular Component 1.02E-100 1.52E-98

Proteinaceous extracellular matrix GO Cellular Component 1.02E-100 1.52E-98

Extracellular matrix GO Cellular Component 1.02E-100 1.52E-98

Extracellular space GO Cellular Component 1.43E-64 1.59E-62

Extracellular matrix part GO Cellular Component 4.96E-61 4.42E-59

Extracellular matrix structural constituent GO Molecular Function 2.79E-60 3.00E-57

Cell adhesion GO Biological Process 1.22E-44 3.02E-41

Biological adhesion GO Biological Process 1.22E-44 3.02E-41

Glycosaminoglycan binding GO Molecular Function 1.74E-42 9.36E-40

Polysaccharide binding GO Molecular Function 9.02E-42 3.23E-39

Blood coagulation GO Biological Process 1.53E-39 1.27E-36

Pattern binding GO Molecular Function 2.20E-39 5.92E-37

Coagulation GO Biological Process 2.72E-39 1.68E-36

Receptor binding GO Molecular Function 6.25E-39 1.34E-36

Wound healing GO Biological Process 6.69E-39 3.31E-36

Hemostasis GO Biological Process 2.47E-38 1.02E-35

Response to external stimulus GO Biological Process 1.65E-37 5.84E-35

Growth factor activity GO Molecular Function 1.83E-37 3.28E-35

Basement membrane GO Cellular Component 2.17E-37 1.62E-35

Response to wounding GO Biological Process 7.09E-37 2.19E-34

Regulation of body fluid levels GO Biological Process 1.52E-36 4.17E-34

Heparin binding GO Molecular Function 2.43E-34 3.72E-32

Collagen GO Cellular Component 3.36E-31 2.14E-29

Skeletal development GO Biological Process 2.90E-30 7.19E-28

Carbohydrate binding GO Molecular Function 1.40E-28 1.88E-26

Phosphate transport GO Biological Process 1.34E-27 3.01E-25

Endopeptidase activity GO Molecular Function 4.42E-27 5.28E-25

Regulation of biological quality GO Biological Process 1.58E-25 3.25E-23

Structural molecule activity GO Molecular Function 2.46E-25 2.65E-23

Endopeptidase inhibitor activity GO Molecular Function 2.77E-24 2.70E-22

Calcium ion binding GO Molecular Function 7.59E-24 6.80E-22

Protease inhibitor activity GO Molecular Function 1.03E-23 8.51E-22

Tissue remodeling GO Biological Process 4.13E-23 7.86E-21

Organ morphogenesis GO Biological Process 1.50E-22 2.66E-20

Anatomical structure morphogenesis GO Biological Process 2.91E-22 4.81E-20

Tissue development GO Biological Process 1.83E-21 2.83E-19

Metalloendopeptidase activity GO Molecular Function 2E-20 1.53E-18

Cell motility GO Biological Process 3.06E-20 4.46E-18

Localization of cell GO Biological Process 3.06E-20 4.46E-18

Peptidase activity GO Molecular Function 3.22E-20 2.31E-18

Cell proliferation GO Biological Process 4.21E-20 5.49E-18

Inorganic anion transport GO Biological Process 8.48E-20 1.05E-17

Biomineral formation GO Biological Process 1.52E-19 1.80E-17

Ossification GO Biological Process 1.52E-19 1.80E-17

Cell migration GO Biological Process 4.33E-19 4.66E-17

Bone remodeling GO Biological Process 1.68E-18 1.73E-16

Blood vessel development GO Biological Process 4.08E-18 4.04E-16

Anatomical structure formation GO Biological Process 6.51E-18 6.21E-16

Vasculature development GO Biological Process 7.21E-18 6.61E-16

Proteolysis GO Biological Process 1.24E-17 1.10E-15

Extracellular structure organization and biogenesis GO Biological Process 2.19E-17 1.87E-15

Extracellular matrix organization and biogenesis GO Biological Process 2.44E-17 1.95E-15

Enzyme linked receptor protein signaling pathway GO Biological Process 2.47E-17 1.95E-15

Serine-type endopeptidase inhibitor activity GO Molecular Function 2.47E-17 1.66E-15

Anion transport GO Biological Process 2.52E-17 1.95E-15

Cytokine activity GO Molecular Function 3.4E-17 2.15E-15

Enzyme inhibitor activity GO Molecular Function 5.91E-17 3.53E-15

Blood vessel morphogenesis GO Biological Process 1.16E-16 8.70E-15

Metallopeptidase activity GO Molecular Function 2.2E-16 1.24E-14

Angiogenesis GO Biological Process 2.76E-16 2.01E-14

Regulation of cell proliferation GO Biological Process 3.95E-16 2.79E-14

Cell-substrate adhesion GO Biological Process 7.58E-16 5.22E-14

Collagen metabolic process GO Biological Process 1.31E-15 8.79E-14

Positive regulation of cell proliferation GO Biological Process 2.66E-15 1.73E-13

Cell-matrix adhesion GO Biological Process 3.24E-15 2.06E-13

Serine-type endopeptidase activity GO Molecular Function 5.92E-15 3.18E-13

Serine activity GO Molecular Function 9.7E-15 4.96E-13

Collagen catabolic process GO Biological Process 1.2E-14 7.4E-13

Multicellular organismal macromolecule catabolic process GO Biological Process 1.2E-14 7.4E-13

Protein digestion GO Biological Process 1.2E-14 7.4E-13

Multicellular organismal macromolecule metabolic process GO Biological Process 1.2E-14 7.4E-13

Multicellular organismal protein catabolic process GO Biological Process 1.2E-14 7.4E-13

Multicellular organismal protein metabolic process GO Biological Process 1.2E-14 7.4E-13

Growth GO Biological Process 4.85E-14 2.61E-12

Serine-type peptidase activity GO Molecular Function 6.15E-14 3.01E-12

Multicellular organismal catabolic process GO Biological Process 6.62E-14 3.49E-12

Cartilage development GO Biological Process 1.83E-13 9.42E-12

Collagen binding GO Molecular Function 1.84E-13 8.61E-12

Multicellular organismal metabolic process GO Biological Process 2.89E-13 1.46E-11

Basal lamina GO Cellular Component 3.49E-13 1.94E-11

Growth factor binding GO Molecular Function 5.54E-13 2.48E-11

Cell-cell signaling GO Biological Process 2.59E-12 1.28E-10

Regulation of developmental process GO Biological Process 4.46E-12 2.17E-10

Positive regulation of cellular process GO Biological Process 7.91E-12 3.77E-10

Chemotaxis GO Biological Process 1.19E-11 5.54E-10

Taxis GO Biological Process 1.19E-11 5.54E-10

Laminin complex GO Cellular Component 2.98E-11 1.48E-09

Integrin binding GO Molecular Function 6.18E-11 2.66E-09

Insulin-like growth factor binding GO Molecular Function 1.22E-10 5.04E-09

Fibrillar collagen GO Cellular Component 1.44E-10 6.44E-09

Acute-phase response GO Biological Process 2.2E-10 9.91E-09

Integrin-mediated signaling pathway GO Biological Process 3.29E-10 1.46E-08

Bone mineralization GO Biological Process 9.16E-10 3.98E-08

Locomotory behavior GO Biological Process 1.34E-09 5.71E-08

Inflammatory response GO Biological Process 1.41E-09 5.92E-08

Transmembrane receptor protein serine/threonine kinase signaling pathway GO Biological Process 1.59E-09 6.57E-08

Cell growth GO Biological Process 2.89E-09 1.17E-07

Digestion GO Biological Process 4.41E-09 1.76E-07

Regulation of cell size GO Biological Process 4.91E-09 1.93E-07

Behavior GO Biological Process 5.42E-09 2.1E-07

Transmembrane receptor protein tyrosine kinase signaling pathway GO Biological Process 7.95E-09 3.03E-07

Regulation of blood coagulation GO Biological Process 8.77E-09 3.29E-07

Negative regulation of multicellular organismal process GO Biological Process 1.2E-08 4.42E-07

Lung development GO Biological Process 1.47E-08 5.36E-07

Laminin-1 complex GO Cellular Component 1.74E-08 7.06E-07

Respiratory tube development GO Biological Process 1.94E-08 6.98E-07

Negative regulation of blood coagulation GO Biological Process 2.36E-08 8.35E-07

Regulation of coagulation GO Biological Process 2.59E-08 9.04E-07

Regulation of cell growth GO Biological Process 6.25E-08 2.15E-06

Leukocyte chemotaxis GO Biological Process 6.63E-08 2.25E-06

Anchoring collagen GO Cellular Component 6.75E-08 2.51E-06

Leukocyte migration GO Biological Process 7.69E-08 2.57E-06

Negative regulation of coagulation GO Biological Process 8.42E-08 2.78E-06

Regulation of multicellular organismal process GO Biological Process 9.07E-08 2.96E-06

Regulation of growth GO Biological Process 1.15E-07 3.68E-06

Regulation of cell adhesion GO Biological Process 1.67E-07 5.24E-06

Tube development GO Biological Process 1.67E-07 5.24E-06

Embryonic development GO Biological Process 1.86E-07 5.77E-06

Epithelial cell proliferation GO Biological Process 1.99E-07 6.10E-06

Vascular endothelial growth factor receptor signaling pathway GO Biological Process 2.3E-07 6.94E-06

Positive regulation of epithelial cell proliferation GO Biological Process 2.42E-07 7.22E-06

Sheet-forming collagen GO Cellular Component 4.07E-07 1.40E-05

Protein complex binding GO Molecular Function 6E-07 2.39E-05

Regulation of cell differentiation GO Biological Process 7.67E-07 2.26E-05

Fibrinolysis GO Biological Process 1.03E-06 2.99E-05

Acute inflammatory response GO Biological Process 1.07E-06 3.08E-05

Regulation of epithelial cell proliferation GO Biological Process 1.14E-06 3.25E-05

Platelet activation GO Biological Process 1.31E-06 3.68E-05

Defense response GO Biological Process 1.72E-06 4.78E-05

Metalloendopeptidase inhibitor activity GO Molecular Function 1.9E-06 7.3E-05

Morphogenesis of a branching structure GO Biological Process 2E-06 5.49E-05

Regulation of response to external stimulus GO Biological Process 2.45E-06 6.68E-05

Odontogenesis GO Biological Process 2.62E-06 7.07E-05

Regulation of cell migration GO Biological Process 2.78E-06 7.40E-05

Astacin activity GO Molecular Function 2.93E-06 1.09E-04

Extracellular matrix structural constituent conferring compression resistance GO Molecular Function 2.93E-06 1.09E-04

Regulation of vascular endothelial growth factor receptor signaling pathway GO Biological Process 3.37E-06 8.87E-05

Regulation of embryonic development GO Biological Process 3.37E-06 8.87E-05

Enzyme regulator activity GO Molecular Function 4.02E-06 1.40E-04

Regulation of ossification GO Biological Process 5.31E-06 1.37E-04

Regulation of chemotaxis GO Biological Process 7.88E-06 2.01E-04

Muscle development GO Biological Process 8.55E-06 2.16E-04

Collagen type IV GO Cellular Component 9.23E-06 2.94E-04

Nervous system development GO Biological Process 1.02E-05 2.56E-04

Regulation of angiogenesis GO Biological Process 1.04E-05 0.000257

Transforming growth factor beta receptor signaling pathway GO Biological Process 1.12E-05 2.76E-04

Regulation of cell motility GO Biological Process 1.18E-05 2.86E-04

Regulation of locomotion GO Biological Process 1.8E-05 4.33E-04

Locomotion GO Biological Process 2.06E-05 4.91E-04

Regulation of behavior GO Biological Process 2.09E-05 4.94E-04

Negative regulation of cell proliferation GO Biological Process 2.56E-05 5.98E-04

Response to chemical stimulus GO Biological Process 2.67E-05 6.17E-04

Regulation of bone remodeling GO Biological Process 2.83E-05 6.49E-04

Regulation of tissue remodeling GO Biological Process 2.83E-05 6.49E-04

Fibroblast growth factor receptor signaling pathway GO Biological Process 3.18E-05 7.16E-04

G-protein-coupled receptor binding GO Molecular Function 3.3E-05 0.00111

Chemokine activity GO Molecular Function 4.15E-05 1.35E-03

Chemokine receptor binding GO Molecular Function 4.87E-05 1.54E-03

Hyaluronic acid binding GO Molecular Function 5.61E-05 1.72E-03

Negative regulation of developmental process GO Biological Process 5.62E-05 1.25E-03

Osteoblast differentiation GO Biological Process 6.24E-05 1.38E-03

Integrin complex GO Cellular Component 7.21E-05 2.14E-03

Structural constituent of GO Molecular Function 8.48E-05 2.53E-03

Collagenase activity GO Molecular Function 8.48E-05 2.53E-03

Plasminogen activator activity GO Molecular Function 8.48E-05 2.53E-03

Negative regulation of angiogenesis GO Biological Process 9.15E-05 2.01E-03

BMP signaling pathway GO Biological Process 9.15E-05 2.01E-03

Collagen fibril organization GO Biological Process 0.000126 2.71E-03

Cartilage condensation GO Biological Process 0.000126 2.71E-03

Positive regulation of behavior GO Biological Process 0.000126 2.71E-03

Positive regulation of chemotaxis GO Biological Process 0.000126 2.71E-03

Regulation of bone mineralization GO Biological Process 0.000126 2.71E-03

Blood circulation GO Biological Process 0.000161 3.31E-03

Circulatory system process GO Biological Process 0.000161 3.31E-03

Neutrophil chemotaxis GO Biological Process 0.000193 3.93E-03

Muscle cell proliferation GO Biological Process 0.000273 5.51E-03

Regulation of response to stimulus GO Biological Process 0.000315 6.29E-03

Soluble fraction GO Cellular Component 0.000377 0.010521

Ectoderm development GO Biological Process 0.000384 0.007609

Metanephros development GO Biological Process 0.000431 8.48E-03

Positive regulation of response to external stimulus GO Biological Process 0.000431 8.48E-03

Transmembrane receptor protein kinase activity GO Molecular Function 0.000455 1.25E-02

Fibril GO Cellular Component 0.000486 1.28E-02

FACIT collagen GO Cellular Component 0.000486 1.28E-02

Negative regulation of response to stimulus GO Biological Process 0.000649 0.012563

Cell-cell adhesion GO Biological Process 0.000768 0.014686

Induction of positive chemotaxis GO Biological Process 0.000771 1.47E-02

Reproductive process GO Biological Process 0.000874 1.65E-02

Branching morphogenesis of a tube GO Biological Process 0.000941 1.77E-02

Regulation of phagocytosis GO Biological Process 0.001207 0.022423

Odontogenesis of dentine-containing teeth GO Biological Process 0.001216 0.022423

Muscle fiber development GO Biological Process 0.001222 2.24E-02

Skeletal muscle fiber development GO Biological Process 0.001222 2.24E-02

Neurogenesis GO Biological Process 0.001489 2.69E-02

Ureteric bud development GO Biological Process 0.001528 2.74E-02

Regulation of proteolysis GO Biological Process 0.001528 2.74E-02

Positive regulation of positive chemotaxis GO Biological Process 0.001772 0.031344

Positive chemotaxis GO Biological Process 0.001772 0.031344

Regulation of positive chemotaxis GO Biological Process 0.001772 0.031344

Wnt receptor signaling pathway, calcium modulating pathway GO Biological Process 0.001892 3.28E-02

Muscle cell differentiation GO Biological Process 0.001981 3.41E-02

Phagocytosis GO Biological Process 0.002085 3.56E-02

Epidermis development GO Biological Process 0.002208 3.75E-02

Tube morphogenesis GO Biological Process 0.002318 0.038889

Sensory organ development GO Biological Process 0.002318 0.038889

Cell activation GO Biological Process 0.002339 3.89E-02

Regulation of osteoblast differentiation GO Biological Process 0.002477 4.09E-02

Epithelial to mesenchymal transition GO Biological Process 0.002477 4.09E-02

Gliogenesis GO Biological Process 0.002748 4.48E-02

Wnt receptor signaling pathway GO Biological Process 0.002823 4.57E-02

Appendix 8: MeSH terms significantly enriched in the ECM network

Concept Name Gene List Size Overlap P-Value Q-Value

Thrombosis 37 20 7.02E-29 1.26E-26

Venous Thrombosis 22 16 5.38E-26 7.58E-24

Myocardial Infarction 53 19 3.06E-23 3.25E-21

Thrombophilia 20 14 2.77E-22 2.53E-20

Stroke 43 17 1.35E-21 1.13E-19

Thromboembolism 16 12 1.45E-19 1.02E-17

Neovascularization, Pathologic 50 16 1.21E-18 7.76E-17

Wound Healing 17 11 8.37E-17 4.85E-15

Blood Coagulation Disorders 9 9 9.86E-16 4.90E-14

Disease Progression 34 12 1.74E-14 7.37E-13

Cell Line, Tumor 174 20 2.25E-14 9.33E-13

Brain Ischemia 26 11 3.15E-14 1.27E-12

Neoplasm Metastasis 50 13 7.16E-14 2.79E-12

Cardiovascular Diseases 45 12 6.23E-13 2.27E-11

Coronary Disease 47 12 1.06E-12 3.75E-11

Osteoarthritis 13 8 8.35E-12 2.59E-10

Osteochondrodysplasias 28 9 2.07E-10 5.72E-09

Pre-Eclampsia 23 8 1.65E-09 4.05E-08

Abortion, Habitual 16 7 5.07E-09 1.19E-07

Aortic Aneurysm, Abdominal 9 6 5.93E-09 1.37E-07

Kidney Failure, Chronic 19 7 1.83E-08 3.98E-07

Pulmonary Disease, Chronic Obstructive 20 7 2.66E-08 5.59E-07

Glioma 22 7 5.22E-08 1.04E-06

Collagen Diseases 5 5 5.33E-08 1.05E-06

Epidermolysis Bullosa 6 5 5.33E-08 1.05E-06

Arterial Occlusive Diseases 6 5 5.33E-08 1.05E-06

Matrix Metalloproteinase 16 6 5 5.33E-08 1.05E-06

Epidermolysis Bullosa, Junctional 6 5 5.33E-08 1.05E-06

Ehlers-Danlos Syndrome 13 6 8.11E-08 1.51E-06

Carcinoma, Squamous Cell 60 9 1.56E-07 2.75E-06

Antiphospholipid Syndrome 7 5 1.59E-07 2.79E-06

Pregnancy Complications, Hematologic 7 5 1.59E-07 2.79E-06

Coronary Artery Disease 44 8 2.59E-07 4.45E-06

Kidney Diseases 17 6 4.33E-07 7.16E-06

Premature Birth 48 8 4.88E-07 8.05E-06

Arteriosclerosis 31 7 5.29E-07 8.67E-06

Cerebral Hemorrhage 9 5 7.28E-07 1.17E-05

Nephritis, Hereditary 9 5 7.28E-07 1.17E-05

Prostatic Neoplasms 85 9 2.45E-06 3.66E-05

Breast Neoplasms 115 10 2.79E-06 4.15E-05

Fibrosis 12 5 3.35E-06 4.95E-05

Pulmonary Emphysema 5 4 4.21E-06 6E-05

Fetal Growth Retardation 13 5 4.98E-06 7.06E-05

Amelogenesis Imperfecta 6 4 1.04E-05 0.000135

Inflammation 103 9 1.05E-05 1.35E-04

Lymphatic Metastasis 30 6 1.05E-05 1.35E-04

Diabetic Nephropathies 33 6 1.74E-05 2.22E-04

Pregnancy Complications, Cardiovascular 7 4 2.07E-05 0.000261

Adenocarcinoma 57 7 2.31E-05 2.86E-04

Stomach Neoplasms 58 7 2.56E-05 3.16E-04

Osteoporosis 19 5 2.94E-05 3.61E-04

Amyloidosis 19 5 2.94E-05 3.61E-04

Hemophilia A 9 4 5.71E-05 6.63E-04

Osteoarthritis, Knee 9 4 5.71E-05 6.63E-04

Alzheimer Disease 68 7 6.47E-05 7.41E-04

Glioblastoma 23 5 6.79E-05 7.77E-04

Carotid Artery Diseases 10 4 8.5E-05 9.68E-04

Hemorrhage 10 4 8.5E-05 9.68E-04

Muscular Dystrophies 46 6 9.49E-05 1.07E-03

Chronic Disease 26 5 0.000115 1.29E-03

Liver Neoplasms 33 5 0.000308 3.24E-03

Retinal Vein Occlusion 5 3 0.000617 0.00614

Protein C Deficiency 5 3 0.000617 0.00614

Amenorrhea 5 3 0.000617 0.00614

Ischemia 5 3 0.000617 0.00614

Abnormalities, Multiple 68 6 0.000624 6.14E-03

Pulmonary Embolism 6 3 0.001021 9.79E-03

Ovarian Neoplasms 47 5 0.001249 1.16E-02

Arthritis, Rheumatoid 48 5 0.001355 1.25E-02

Insulin Resistance 50 5 0.001584 1.42E-02

Brain Neoplasms 25 4 0.001828 1.63E-02

Multiple Myeloma 26 4 0.002062 0.018149

Abortion, Spontaneous 8 3 0.002115 0.018149

Gastritis 8 3 0.002115 0.018149

Fibrosarcoma 8 3 0.002115 0.018149

Intervertebral Disk Displacement 8 3 0.002115 0.018149

Atherosclerosis 28 4 0.002583 2.20E-02

Infection 9 3 0.002801 2.37E-02

Tooth Abnormalities 9 3 0.002801 2.37E-02

Hypertension 60 5 0.003141 2.60E-02

Colitis, Ulcerative 30 4 0.003179 2.62E-02

Sarcoma, Kaposi 10 3 0.003578 2.93E-02

Proteinuria 10 3 0.003578 2.93E-02

Albuminuria 12 3 0.005392 4.21E-02

Calcinosis 12 3 0.005392 4.21E-02

Colonic Neoplasms 36 4 0.005442 4.21E-02

Hypogonadism 13 3 0.006427 4.92E-02

Appendix 9: Pipeline for automated domain analysis

Appendix 10: Descriptions of programs included on the accompanying CD

File name: ultradomainsummary.pl

Purpose: The analysis of domain architecture conservation.

Description: Generates a network for domain architectures and a summary heatmap of domain changes in inparalogues across all eukaryotic genomes.

File names: (1) DCC_job_script.bsh (2) DCC_command_script_generator.pl (3) DCC_parallel_preprocessor.pl (4) DCC_parallel_networkgenerator.pl (5) DCC_parallel_summarystats.pl

Purpose: The generation of simulated proteomes using domain pair propagation

Descriptions: (1) Job_script.bsh is a MOAB/Torque submission script for multiple, dynamically-run serial jobs on SciNet GPC.

(2) Command_script_generator generates x number of scripts to run individual Perl commands with appropriate parameters for the parallel run. Note that the generated scripts must then be made executable with chmod +x.

(3) Parallel_preprocessor calculates network properties for the real network

(4) Parallel_networkgenerator generates a random network that matches the characteristics of the real network (statistics precalculated by DCC_parallel_preprocessor.pl) and outputs the domain pair frequencies of the random network.

(5) Parallel_summarystats calculates final summary statistics on generated proteomes.
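The script-generation step described in (2) can be illustrated with a minimal Python sketch. This is not the author's Perl program: the function name `write_command_scripts`, the `run_<i>.sh` file naming and the `--run` parameter in the usage example are all hypothetical, chosen only to show the pattern of writing one small command script per parallel run and then setting the executable bit (the equivalent of chmod +x).

```python
import os
import stat
import tempfile
from pathlib import Path


def write_command_scripts(out_dir, n_scripts, command_template):
    """Write n_scripts shell scripts, one per parallel run, and make each executable.

    command_template is a hypothetical format string in which {i} is
    replaced by the run index (1..n_scripts).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(1, n_scripts + 1):
        path = out_dir / f"run_{i}.sh"  # hypothetical naming scheme
        path.write_text("#!/bin/bash\n" + command_template.format(i=i) + "\n")
        # Equivalent of `chmod +x`: add execute bits to the current mode.
        mode = path.stat().st_mode
        path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The --run flag is an assumption made for illustration only.
        scripts = write_command_scripts(
            tmp, 3, "perl DCC_parallel_networkgenerator.pl --run {i}"
        )
        for script in scripts:
            print(script.name, os.access(script, os.X_OK))
```

Setting the mode inside the generator removes the separate manual chmod step, at the cost of assuming a POSIX file system.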

File name: aligndomains.pl

Purpose: Rapid domain alignment

Description: Given a list of peptide identifiers, the program retrieves the orthologue hits across all available species, performs a multiple sequence alignment of each set of orthologues and prints out their domain arrangements.

File name: matrixdb_2_bbl.pl

Purpose: Generation of hypergraphs

Description: Reads a MatrixDB Molecule Description file and constructs NODE and SET components in .bbl format for input into CyOOG, a Cytoscape Power Graph plugin (Cytoscape Ver 2.6.0). After outputting these lines, it constructs the corresponding EDGE component using the interaction data from MatrixDB.

Appendix 11: Partial view of an ECM network rendered as a hypergraph

This snapshot demonstrates how a hypergraph can accurately depict the interactions between assembled ECM complexes, single proteins and various cleavage products while maintaining a sense of the hierarchical relationships between them. Note the concentric circles encapsulating individual proteins and fragments, each of which has distinct interactions. It also demonstrates the current limitations of the rendering. For example, embedded power nodes are not named and cannot be moved, which makes it difficult to “clean up” the graph. Note that PFRAG_7_human and PFRAG_8_human are embedded in unnamed power nodes representing their parent protein. These proteins, together with P02462_a, make up a larger multimer represented by a surrounding power node (also unnamed) with distinct interactors. The underscore_letter extension on the UniProt identifier is a work-around for cases where a protein occurs in more than one distinct multimer (power nodes cannot overlap).
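The underscore_letter work-around mentioned in this caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to produce the figure: the function name `disambiguate_members`, the multimer names and the identifiers X00001 and X00002 are hypothetical placeholders, and only P02462 is taken from the caption. The sketch gives each protein that belongs to more than one multimer a distinct suffix per multimer, so that the corresponding power nodes no longer share a member.

```python
from collections import defaultdict
from string import ascii_lowercase


def disambiguate_members(multimers):
    """Suffix proteins that occur in more than one multimer with _a, _b, ...

    multimers maps a multimer name to a list of protein identifiers.
    Proteins found in a single multimer keep their original identifier.
    """
    # Record, in input order, which multimers contain each protein.
    membership = defaultdict(list)
    for name, members in multimers.items():
        for protein in set(members):
            membership[protein].append(name)

    renamed = {}
    for name, members in multimers.items():
        out = []
        for protein in members:
            owners = membership[protein]
            if len(owners) > 1:
                position = owners.index(name)
                if position >= len(ascii_lowercase):
                    raise ValueError(f"{protein} occurs in more than 26 multimers")
                out.append(f"{protein}_{ascii_lowercase[position]}")
            else:
                out.append(protein)
        renamed[name] = out
    return renamed


if __name__ == "__main__":
    # Hypothetical multimers; X00001 and X00002 are placeholder identifiers.
    example = {
        "multimer_1": ["P02462", "X00001"],
        "multimer_2": ["P02462", "X00002"],
    }
    print(disambiguate_members(example))
```

Because the suffix is derived from the order in which multimers are supplied, the same input always yields the same labels, which keeps identifiers stable between renderings.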

Appendix 12: Known and potential elastin interactions

A network summarizing known (green) and putative interactors of elastin, the latter based on both indirect evidence (blue) and literature evidence of potential interologues (pink). Interactors predicted by STRING that are available from commercial suppliers are also indicated (amber) to support planning of a future SPRi screen.