Structure-Based Subfamily Classification of Homeodomains

by

Jennifer Ming-Jiun Tsai

A thesis submitted in conformity with the requirements

for the degree of Master of Science

Graduate Department of Molecular Genetics

University of Toronto

© Copyright by Jennifer Ming-Jiun Tsai 2008


Abstract

Eukaryotic DNA-binding proteins mediate many important steps in embryonic development and gene regulation. Consequently, a better understanding of these proteins should allow a more complete picture of gene regulation to be determined. In this study, a structure-based subfamily classification of the homeodomain family of DNA-binding proteins was undertaken to determine whether sub-groupings of a family could be identified that correspond to differences in specific function. Subfamily-determining residues were then identified in order to gain insight into functional differences via analysis of the residue properties. Subfamilies appear to have different specific DNA binding properties, according to DNA profiles obtained from TRANSFAC [1] and other sources in the literature. Subfamily-specific residues appear to be frequently associated with the protein-DNA interface and may influence DNA binding via interactions with the DNA phosphate backbone; these residues form a conserved profile uniquely identifying each subfamily.


Acknowledgements

First and foremost, I would like to thank my husband Christopher, my parents David and Virginia, and my sister Margaret for their unfailing love and support that has enabled me to maintain my focus on my studies and research.

I would especially like to thank my supervisor, Dr. Shoshana Wodak, for her guidance and support during my graduate studies, and for developing my knowledge of what science is all about. The opportunities to attend one of the most well-known bioinformatics conferences, to give poster presentations and lectures, and to act as a teaching assistant, together with her constructive criticism, advice, and ideas, all contributed to my personal development as a scientist and researcher. I am most grateful to Dr. Boris Steipe, for recognizing the potential in a young undergraduate student and nurturing my interest in science, especially computational biology, and for his enthusiastic support and constructive criticism as a member of my supervisory committee. I would like to thank Dr. John Parkinson for his enlightening contributions during joint meetings held with his laboratory members, for his constructive suggestions and support as a member of my supervisory committee, and for his generous access to the PartiGeneDB EST database. I would also like to thank Dr. Jack Greenblatt for his insightful contributions and support as a member of my supervisory committee, and his willingness to share his knowledge about DNA-binding proteins.

I would like to thank all my colleagues, friends, and mentors who have challenged me and supported me through this journey, and who have made my academic journey at the University of Toronto such a rewarding one. Special thanks go to my fellow lab colleagues past and present at the Centre for Computational Biology at the Hospital for Sick Children, especially Miguel Santos for his assistance with the structural analysis, my fellow students Gerald Quon, Torben Broemstrup, and Jim Vlasblom, and the members of the Parkinson lab, especially Chris Sanford and David He, for insightful and entertaining discussions over coffee breaks and inter-lab lunches, and for creating a friendly and supportive working environment.


Table of Contents

Structure-Based Subfamily Classification of Homeodomains
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Appendices
Chapter 1 – Introduction
1.1 Importance of Studying Eukaryotic DNA-Binding Proteins
1.1.1 Regulation of Gene Expression: From DNA to RNA to Protein
1.1.2 DNA Binding Proteins: An Overview
1.1.3 Homeodomains: A DNA Binding Domain Family
1.2 Sequence Conservation and Inferences
1.3 Principles of Protein Classification
1.3.1 Importance of Sequence Alignment
1.3.2 Phylogenetic Representation Methods
1.3.3 and Subfamily Classifications
1.4 Current Strategies for Determining Functional Residues in Proteins
1.5 A Requirement for More Structural Insight: A Case Study
Chapter 2 – Experimental Protocol
2.1 Outline of the Protocol
2.2 Curation of Protein Structures
2.3 Curation of Protein Sequences
2.4 Structure-Based Sequence Alignment
2.4.1 MALECON Multiple Structure Alignment
2.4.2 ClustalX Sequence-to-Profile Alignment
2.5 Identifying the Subfamilies and Subfamily-Determining Residues
2.5.1 Bête Subfamily Classification
2.5.2 SDPpred: Subfamily Determining Residue Identification
2.6 Validation of Bête Neighbour-Joined Tree
2.7 Validation of Subfamily Integrity
2.8 Structural Analysis of Subfamily-Determining Residues
2.9 Subfamily DNA Binding Profile Analysis
Chapter 3 – Results
3.1 Subfamily Classification
3.2 Subfamily Determining Residues
3.3 Validation of Results
3.3.1 Quality of Obtained Subfamilies
3.3.2 Comparison of Bête Neighbour-Joined Tree with PHYLIP
3.4 Analysis of Protein-DNA Interaction: Physical Characteristics of Subfamily-Determining Residues
3.4.1 Contribution to the Protein-DNA Interface
3.4.2 Interactions Made by Interface Residues
3.5 Potential role of specificity determining residues in modulating protein-DNA interaction
3.6 Subfamily Cognate DNA Sequences
Chapter 4 – Discussion
Chapter 5 – Conclusion
Appendices


List of Tables

Table 1: Homeodomains and availability of protein structures in the PDB. Subfamily assignments are given (as applicable) according to the Interpro homeodomain subfamily classification illustrated in Figure 18.
Table 2: Homeodomain structures from the Protein Data Bank which were used in the experimental procedures.
Table 3: All 27 homeodomain structures from the Protein Data Bank which were used in the creation of the homeodomain structure alignment.
Table 4: Mapping of subfamily-determining residue positions and conserved residue positions from the structure-based sequence alignment to the conventional homeodomain literature numbering.
Table 5: Contribution of specificity determining residues (9 of 89 residues, as per...
Table 6: Chi-1 angle conformations for residue position 36 (as according to Figure 20) for representative homeodomain structures from all homeodomain subfamilies which are structurally represented in the PDB.
Table 7: Contribution of subfamily determining residues (SDRs) to the protein-DNA interface in HOX 1ig7A and Paired-box 1fjlA.
Table 8: Relative contribution to the protein-DNA interface area by each structural unit of the homeodomain.
Table 9: Specificity determining residues (S) and conserved residues (C) in the HOX, Engrailed, Paired-box, POU, and MATa1 structural representatives, and the DNA nucleotides that they contact.


List of Figures

Figure 1: Organization of DNA into nucleosomes and heterochromatin.
Figure 2: Formation of the pre-initiation complex in RNA polymerase II mediated transcription.
Figure 3: Illustration of a leucine zipper dimer in contact with DNA (PDB ID 1ysa).
Figure 4: Illustration of a helix-loop-helix protein dimer in contact with DNA (PDB ID 1an4).
Figure 5: Illustration of a (Cys)2(His)2 in contact with DNA (PDB ID 1aay)
Figure 6: Illustration of a winged-helix domain in contact with DNA (PDB ID 2c6y)
Figure 7: Structure of a “typical” homeodomain in contact with DNA.
Figure 8: Orthologous and paralogous gene relationships
Figure 9: Representation of hierarchical clustering.
Figure 10: Representation of k-means clustering, with k=3 in this example
Figure 11: Representation of the triangle inequality
Figure 12: A representation of the Evolutionary Trace method of subfamily partitioning.
Figure 13: Flowchart of experimental protocol, including subfamily determination, validation, identification of subfamily-specific residues, and analysis steps.
Figure 14: Part of a hypothetical multiple sequence alignment in which pseudocounts would be used if more data could not be acquired
Figure 15: Chi-1 angle conformations of amino acid sidechains
Figure 16: MALECON homeodomain structural alignment (with HNF1 (PDB ID 2lfb) insertion between helix 2 and helix 3 excised).
Figure 17: Entire Bête homeodomain subfamily classification derived from the structure-based sequence alignment.
Figure 18: Manually curated homeodomain subfamily classification from Interpro.
Figure 19: A) Bête homeodomain subfamily classification, including only typical 60 amino acid long homeodomains for comparison with the manually curated Interpro homeodomain subfamilies. Classification is otherwise as in Figure 17.
Figure 20: SDPpred alignment of major homeodomain subfamilies, including the largest plant homeodomain subfamily
Figure 21: Sequence to profile similarity for actual subfamilies vs. profiles of shuffled sequences with the same size distribution as the subfamilies, after removing singleton and doubleton subfamilies and unclassifiable sequences.
Figure 22: Bête leave-one-out cross-validation “supertree”, collapsed to major branchpoints
Figure 23: Neighbour-joined tree produced using the PHYLIP phylogenetic package, using the same set of sequences as for the Bête-produced tree.
Figure 24: Stereogram for POU homeodomain (1cqtB) in contact with DNA; the conserved Leu at position 22 (helix 1) makes hydrophobic-hydrophobic contact with the conserved Trp at position 74 (helix 3).
Figure 25: Helix N-cap formed by Asn residue at position 36 (according to alignment in ...
Figure 26: Complete set of SDRs (blue) and conserved residues (red) for the HOX subfamily representative 1ig7A.
Figure 27: HOX subfamily structural representative (PDB 1ig7A) and contacts made by SDRs and conserved residues.
Figure 28: Structural alignment of 1ig7A (HOX) and 1fjlA (paired-box) based on superposition of helix 3.
Figure 29: Protein-DNA adaptation as illustrated by HOX 1ig7A, Paired-box 1fjlA, and Engrailed 3hddA.
Figure 30: Superposition of the MATa1 (1akhA), MATα2 (1akhB), and HOX (1ig7A) homeodomains, taking as reference the structural alignment of helix 3 in the MATa1 and MATα2 structures relative to helix 3 of the HOX structure.
Figure 31: Homeodomain DNA recognition sequences for selected homeodomains.


List of Appendices

Appendix A: Homeodomain Subfamily Classification Performed by Bête and Structure-Based Sequence Alignment
Appendix B: Individual Unclassifiable Groupings from Leave-One-Out Cross-Validation of the Bête Tree
Appendix C: Chi-1 Angle Conformations at Subfamily-Determining Residue Positions
Appendix D: Contribution of Subfamily-Determining Residues to the Protein-DNA Interface: Changes in Solvent Accessibility
Appendix E: List of Interactions with DNA at distance ≤ 4.5 Å


Chapter 1 – Introduction

1.1 Importance of Studying Eukaryotic DNA-Binding Proteins

The living cell is an intricate mixture of proteins and nucleic acids, orchestrating their seemingly choreographed dance of regulatory systems and networks. The cell forms the smallest self-sustaining unit of life [2], and contains within it all the information that is required to specify all components required for cellular function [3]. Cellular adaptability relies on the dynamic interplay between cell and environment that allows the cell to react to external stimuli [4,5]. These adjustments often involve a complex, highly regulated, and specific interaction between various proteins, ribonucleic acids (RNA), and deoxyribonucleic acid (DNA) [5,6].

The basic function of DNA is to serve as a template for protein production [7,8]. This production of proteins is highly dependent on modulation by certain types of DNA-binding proteins [6]. There is an enormous diversity of DNA-binding proteins, including transcription factors, polymerases, histones, helicases, and nucleases. Histones allow DNA to be wound into an ultra-compacted form within the cell [9,10]. Helicases separate double-stranded DNA [9,10], while nucleases cleave DNA [9,10]. Transcription factors cause the stimulation or repression of protein production [9,10]. DNA polymerases assist in the duplication of DNA [9,10], whereas RNA polymerases serve as transcribers, providing the intermediate template required between DNA and protein synthesis [9,10]. Many of these DNA-binding proteins are involved in various steps of the protein production assembly line. We study eukaryotic DNA-binding proteins in order to achieve a better understanding of protein synthesis and regulation of gene expression in eukaryotes; eukaryotes are organisms whose cells contain a special compartment containing DNA called the nucleus [9,10], and these organisms include plants, yeast, worms, flies, mice, and humans.


1.1.1 Regulation of Gene Expression: From DNA to RNA to Protein

Gene expression is the means whereby the information encoded in DNA is transformed into functional protein products [9,10]. The process by which a DNA template specifying a protein is converted to an RNA template is known as transcription [9,10]; the process by which that RNA template is converted to the protein which it specifies is known as translation [9,10].

There are multiple points at which eukaryotic gene expression can be regulated. These points include: chromatin structure, transcription initiation, transcript processing and modification, RNA transport, transcript stability, translational initiation, post-translational modification, protein transport, and control of protein stability [9,10].

DNA exists in a highly condensed form within the nucleus, with multiple levels of packing in order to form the finished structure. At the first level of packing, DNA is wound around structural units called nucleosomes. Each nucleosome is built on an octameric core, which is composed of two copies each of four different histone proteins: histones H3, H4, H2A, and H2B [11-14]. At this level of packing, DNA is considered to be in extended form or “beads on a string”, and this extended form, known as euchromatin, is transcriptionally active (Figure 1A) [15]. At the second level of packing, the nucleosome-wound DNA is coiled in on itself to form a solenoid structure containing about six nucleosomes per turn; this form is known as the 30 nm condensed chromatin fiber (Figure 1B) [12], and is associated with the presence of histone H1 [14]. In this condensed form, known as heterochromatin, DNA is transcriptionally inactive [11,13,15].


[Figure 1 panel labels: A) histone core, DNA, nucleosome, “beads on a string” euchromatin structure; B) solenoid heterochromatin structure]

Figure 1: Organization of DNA into nucleosomes and heterochromatin. A) DNA with nucleosomes in “beads on a string” or extended form, also known as euchromatin; euchromatin can be transcriptionally active. A nucleosome consists of two full turns of DNA wound around an octameric histone core. B) Nucleosome-wound DNA coiled in on itself to form a solenoid heterochromatin structure; heterochromatin is transcriptionally inactive. Typically there are six nucleosomes per turn.

A significant amount of eukaryotic gene regulation revolves around control of transcription initiation [9], with an alternative mechanism for gene regulation being transcriptional attenuation [16]. Transcription initiation begins with the assembly of particular transcription factors at a promoter sequence in order to recruit RNA polymerase to the promoter. There are three eukaryotic RNA polymerases: these are RNA polymerases I, II, and III [9,10,17]. The majority of eukaryotic genes are transcribed by RNA polymerase II, as its function is to produce mRNAs; promoter sequences to which RNA polymerase II is recruited sometimes contain an AT-rich sequence known as the TATA box [17]. Many general transcription factors and complexes are involved in RNA polymerase II mediated transcription; these transcription factor complexes include TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH [17]. In the first step of RNA polymerase II transcription initiation (illustrated in Figure 2), the TATA box is bound by the TATA-box binding protein (TBP), which is contained within the transcription factor complex TFIID [17]. TBP interacts with the minor groove in DNA to cause helix bending. If no TATA box is present, an assortment of TBP-associated factors (TAFs) contained within TFIID will bind DNA sequence-specifically, followed by association of TBP with DNA [17]. Following the recruitment of TBP to the TATA-box, TFIIA stabilizes the interaction of TFIID with DNA [17,18]. TFIIB then associates between TFIID and the location where RNA polymerase II will later bind DNA [17]. TFIIB contacts TBP and DNA at its C-terminal end, stabilizing the TATA-TBP interaction, whereas its N-terminus faces the transcription start site where TFIIF will later bind [17]. TFIIF and RNA polymerase II then bind the growing protein-DNA complex; TFIIF promotes transcription elongation [17]. TFIIE and TFIIH are then recruited to the complex, with TFIIE forming a docking site for TFIIH [17]. The helicase activity of TFIIH is then required to unwind the DNA strands at the start site in order for transcription to proceed [17]. A large multiprotein complex known as Mediator also binds to RNA polymerase II, and mediates between the core transcriptional machinery and transcriptional activators [17]. Transcriptional activator proteins contain a DNA-binding domain linked to at least one activation domain, frequently via a flexible linker [9]. Different models for activator activity exist. Activators can bind cooperatively to enhancer sequences to form an enhanceosome, triggering local deformations of the DNA that make promoter sequences more accessible [9,19].
Alternatively, a “billboard” enhancer sequence may contain an ensemble of functional regions that can be bound by both activators and repressors, with the level of transcriptional activity of the gene as a readout [19]. Co-activator protein complexes such as Mediator exert an effect on transcriptional activation, not by binding DNA directly but by coordinating between different activator proteins and core transcriptional machinery [9,17]; other co-activator complexes include the chromatin remodeling complex, which is believed to transiently dissociate DNA from nucleosomes in order to promote the unfolding of heterochromatin [9], and histone acetylase complexes, which acetylate the N-terminal tails of histones [9]. After transcription initiation, precursor messenger RNA (pre-mRNA) is produced from the DNA template [9].

[Figure 2 diagram: stepwise assembly at the TATA box, in the order TFIID (containing TBP), TFIIA, TFIIB, TFIIF with RNAP II, then TFIIE and TFIIH]

Figure 2: Formation of the pre-initiation complex in RNA polymerase II mediated transcription. TBP refers to TATA-binding protein; RNAP II represents RNA polymerase II; transcription factor complexes associated with RNA polymerase II mediated transcription are labelled TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. Initiation of transcription requires initial recognition of the TATA box by the TATA-binding protein (a subunit of TFIID), followed by recruitment of other TFII transcription factor complexes, and association of RNA polymerase II and TFIIH (which has helicase activity).
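The recruitment order described above can be written down as a small Python sketch. This is only an illustrative, strictly linear model that assumes each factor is recruited only after all earlier ones have bound; the names `ASSEMBLY_ORDER` and `assemble` are hypothetical, and real pre-initiation complex assembly is not necessarily this rigid.

```python
# Recruitment order as described in the text and in Figure 2 (TBP is part of TFIID).
ASSEMBLY_ORDER = ["TFIID", "TFIIA", "TFIIB", "TFIIF/RNAP II", "TFIIE", "TFIIH"]


def assemble(available):
    """Return the factors recruited, in order, stopping at the first missing one.

    Assumption: a strictly sequential model in which a missing factor blocks
    every later recruitment step.
    """
    recruited = []
    for factor in ASSEMBLY_ORDER:
        if factor not in available:
            break
        recruited.append(factor)
    return recruited


complete = assemble(set(ASSEMBLY_ORDER))
stalled = assemble(set(ASSEMBLY_ORDER) - {"TFIIB"})
```

In this toy model, removing TFIIB leaves only TFIID and TFIIA on the promoter, which mirrors the dependency of RNA polymerase II binding on the earlier steps.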

Transcriptional attenuation is the premature termination of transcription after initiation by RNA polymerase. It was first noted as a mechanism for controlling prokaryotic gene expression, as with regulation of the trp operon in bacteria depending on the amount of tryptophan bound to Trp tRNA [20]. In this example of transcriptional attenuation, low levels of Trp cause the ribosome to stall at two Trp codons, allowing an antiterminator hairpin to form in the mRNA sequence and preventing the terminator hairpin from forming; this allows RNA polymerase to read through the region and transcribe the trp operon [20]. Formation of the terminator hairpin, an RNA polymerase termination signal, results in transcriptional termination and release of RNA polymerase [20]. In eukaryotes, transcriptional attenuation has been observed to occur for cellular genes such as c-myc; attenuation appears also to be RNA secondary structure dependent [16].

Co-transcriptional and post-transcriptional modifications to pre-mRNA result in the production of mature messenger RNA (mRNA), which will serve as the template for translation into a polypeptide. These modifications include addition of a cap at the 5’ end of the mRNA [21,22], polyadenylation of the 3’ end of the mRNA [23,24], and RNA splicing [25]. 5’ end capping and polyadenylation of the 3’ end of the mRNA prevent exonuclease degradation and regulate nuclear export of the mRNA [21-24], while RNA splicing removes introns and joins together exons to form the mature mRNA [25]. Differential joining of exons, also known as alternative splicing, allows multiple protein products to be produced from one original DNA template [25].
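As a minimal illustration of how splicing can yield more than one mature mRNA from a single pre-mRNA, the following Python sketch joins retained segments (exons) and discards the intervening segments (introns). The sequence, the exon coordinates, and the function name `splice` are all hypothetical and are not taken from any real gene.

```python
def splice(pre_mrna, exons):
    """Join the given exon intervals (0-based, end-exclusive) in the order listed."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy pre-mRNA: three exons (upper case) separated by two introns (lower case).
pre_mrna = "AUGGCC" + "guaagu" + "UUUAAA" + "guccag" + "GGGUGA"
exon1, exon2, exon3 = (0, 6), (12, 18), (24, 30)

# Constitutive splicing keeps all three exons.
full_length = splice(pre_mrna, [exon1, exon2, exon3])

# One alternative splicing outcome (exon skipping) omits the middle exon.
exon_skipped = splice(pre_mrna, [exon1, exon3])
```

Here `full_length` and `exon_skipped` are two different mature transcripts derived from the same template, which is the essence of alternative splicing.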

Most mRNAs are transported from the nucleus to the cytosol for translation. Some mRNAs, including those which encode secreted proteins or proteins located within organelles, are transported to ribosomes located on the cytosolic face of the rough endoplasmic reticulum (ER) [26]. Translation of the mRNA template results in the production of the polypeptide product specified by the mRNA [9,10]. Translational initiation in eukaryotes frequently begins with detection of the 5’ end-cap on the mRNA [27,28], although cap-independent translation initiation can occur via detection of an internal ribosome entry site (IRES) [28]. These elements recruit ribosomal subunits to the mRNA; in cap-dependent translation initiation, the pre-initiation complex scans for the start codon [27]. Messenger RNA sequences consist of three-nucleotide groupings known as codons; each codon specifies a particular amino acid [7-10]. The start codon, usually AUG, specifies methionine, and indicates the start of the protein-coding sequence [27]. In IRES-dependent translation initiation, the ribosome is recruited directly to the internal ribosome entry site for translation [29]. The stability of the resulting transcript affects how many polypeptide copies can be made from one transcript, and therefore influences how persistent the corresponding protein products are within the cell [30].
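The scanning and codon-reading steps described above can be sketched in Python. This is a simplified model that assumes translation starts at the first AUG encountered from the 5’ end and stops at the first in-frame stop codon; the codon table is deliberately partial (only the codons used in the example, taken from the standard genetic code), and the function name and test sequence are hypothetical.

```python
# Partial codon table: only the codons needed for this example.
CODON_TABLE = {
    "AUG": "M", "GCC": "A", "UUU": "F", "AAA": "K", "GGG": "G",
    "UGA": "*", "UAA": "*", "UAG": "*",  # stop codons
}


def translate_from_first_aug(mrna):
    """Scan for the first AUG, then read codons in frame until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" marks codons outside the partial table
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


# Toy mRNA: a short 5' leader, a coding region, and two trailing nucleotides.
peptide = translate_from_first_aug("GGCACCAUGGCCUUUAAAGGGUGAUU")
```

The sketch ignores IRES-dependent initiation and sequence context around the start codon, both of which influence start-site selection in real transcripts.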

One of the later steps in protein biosynthesis for many proteins is post-translational modification. Various post-translational modifications include acetylation, alkylation, phosphorylation, deamidation, ubiquitination, glycosylation, and disulfide bond formation [9,10,31]. These post-translational modifications involve addition of a functional group, such as an alkyl group, or addition of a protein, such as ubiquitin; chemical change to an amino acid, such as conversion of an amide group to a carboxylic acid; or structural change to the protein, such as formation of internal disulfide bridges [9,10,31]. Post-translational modifications may occur concurrently with protein folding [32], which is the folding of a polypeptide into the biologically active three-dimensional form of the mature protein; while the information required to fold a mature protein is contained in the amino acid sequence [33], protein folding sometimes requires the assistance of molecular helpers called chaperones [34].

After the mature folded form of the protein is produced, the protein must be delivered to the proper location in the cell. Subcellular localization signals on the protein direct the cellular transport machinery to transport the protein to one of various cellular compartments, including the endoplasmic reticulum, nucleus, mitochondrial matrix, Golgi complex, or peroxisome [35].

Control of protein stability influences how long a protein remains active in the cell. When a protein is targeted for degradation, ubiquitin ligases attach ubiquitin polymers to a lysine residue on the substrate protein [36,37]. This engages the cellular degradation machinery, the proteasome, to degrade the polypeptide chain into short peptides [36,37], typically between three and twenty-four amino acids in length [9].

At each of these various points in the life-cycle of a protein, it is possible to regulate either the expression of a given gene, or the activity of its protein products. The majority of eukaryotic gene regulation revolves around control of transcription initiation, and thus around the ability of various transcription factors to activate or repress gene expression. Therefore, transcriptional control is a key feature in regulating many cellular processes, and has consequently been intensely studied. This provides a wealth of structural and sequence data that can be integrated, in order to investigate basic properties governing the interaction between protein regulators and their target DNA sequences using a bioinformatics approach.

1.1.2 DNA Binding Proteins: An Overview

Control of transcription initiation is frequently mediated by DNA-binding transcription factors [38]. DNA-binding transcription factors bind to transcription-control elements such as enhancers, upstream activating sequences, or repressor-binding sites; in doing so, they activate or repress transcription of a given gene [38]. Different classes of DNA-binding transcription factors include leucine zipper proteins, basic helix-loop-helix proteins, zinc finger proteins, and helix-turn-helix proteins [10,39]. The DNA binding ability of transcription factors typically occurs through interactions between the DNA-binding domain of the protein and the major groove of DNA [40-42], although interactions with the minor groove or the phosphodiester backbone of DNA may contribute to DNA binding as well [9].


Leucine zipper proteins (illustrated in Figure 3) are characterized by the presence of a hydrophobic amino acid, frequently leucine, at every seventh position in the dimerization region of the amino acid sequence [9]. These proteins can be homodimeric or heterodimeric, with each monomer consisting of an alpha-helix gripping two adjacent major grooves on either side of the DNA, much like a pair of pliers [9,10]. They dimerize via a coiled-coil (“zipper”) dimerization region at the C-terminal ends of the helices, and interact with the DNA via positively charged N-terminal DNA binding domains [9].

Figure 3: Illustration of a leucine zipper dimer in contact with DNA (PDB ID 1ysa). The alpha-helices interact with each other via coiled-coil regions at their C-termini, while they interact with adjacent major grooves of DNA via positively charged DNA-binding regions at their N-termini.
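The heptad spacing that characterizes the dimerization region can be checked with a short Python sketch. It assumes a simplified rule that looks only for leucine (not other hydrophobic residues) recurring at every seventh position; the function name, the `min_repeats` threshold, and the toy sequence are hypothetical.

```python
def heptad_leucine_starts(sequence, min_repeats=4):
    """Return 0-based positions where 'L' recurs every 7 residues min_repeats times."""
    span = 7 * (min_repeats - 1)  # distance from the first to the last leucine checked
    hits = []
    for start in range(len(sequence) - span):
        if all(sequence[start + 7 * k] == "L" for k in range(min_repeats)):
            hits.append(start)
    return hits


# Toy sequence: a two-residue prefix followed by four heptads that each begin with leucine.
toy_sequence = "MK" + "LAEKQRS" * 4
starts = heptad_leucine_starts(toy_sequence)
```

A real zipper prediction would also allow other hydrophobic residues at the repeat position and would consider coiled-coil propensity, so this check is best read as an illustration of the spacing rule only.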

Basic helix-loop-helix proteins (illustrated in Figure 4) are similar to leucine zipper proteins in their mode of interaction with DNA, except that instead of a coiled-coil dimerization region and a DNA-binding region on the same alpha-helix, two alpha-helices are joined by a connecting loop, with one helix binding DNA and the other helix interacting with another helix-loop-helix protein [9,10]. Like leucine-zipper proteins, helix-loop-helix proteins are able to homo-dimerize or hetero-dimerize. Some helix-loop-helix proteins lack the DNA binding region; when heterodimerized with a full-length protein containing the DNA binding region, only the helix from the full-length protein is able to form contacts with DNA, and thus the heterodimer only binds weakly to DNA [10].


Figure 4: Illustration of a helix-loop-helix protein dimer in contact with DNA (PDB ID 1an4). One pair of alpha-helices forms the dimerization interface, while the other pair of alpha-helices interacts with adjacent major grooves of DNA via positively charged DNA-binding regions.

Zinc-finger proteins contain a DNA-binding domain coordinated around at least one zinc ion; the zinc is considered to be important for stability of the domain [9,10,43]. There are many classes of zinc finger motifs, including (Cys)2(His)2 zinc fingers, (Cys)4 zinc fingers, Cys-Cys-His-Cys zinc fingers, and (Cys)6 zinc fingers [43]. Proteins that contain a zinc-finger motif include the LIM domain (commonly associated with homeodomains), the GAL4 DNA-binding domain, steroid hormone receptors, and the transcription factor TFIIIA [43]. An example of a (Cys)2(His)2 zinc finger is shown in Figure 5.


Figure 5: Illustration of a (Cys)2(His)2 zinc finger in contact with DNA (PDB ID 1aay); the structure consists of a pair of antiparallel beta-strands and an alpha-helix, with coordination around a zinc ion for stability. A) Two cysteine residues and two histidine residues coordinate around a zinc ion; B) the alpha helix fits into the major groove of DNA.


The helix-turn-helix DNA-binding motif consists of a pair of alpha-helices connected by a turn; one of the alpha-helices, known as the recognition helix, makes a number of contacts in the major groove of DNA [9,10]. This motif was originally identified in bacterial proteins, but has been identified in eukaryotic proteins as well [10]. Two classes of helix-turn-helix domains found in eukaryotes include the winged-helix and the homeodomain [39]. A winged-helix (illustrated in Figure 6) consists of four helices, two of which form a helix-turn-helix motif, and a two-strand beta-sheet which forms a “wing” structure [44]. Homeodomains have an alpha-helix joined by a loop to a helix-turn-helix motif, and can bind as either monomers or dimers [9,10,39]; these proteins will be discussed in more detail in the next section.


Figure 6: Illustration of a winged-helix domain in contact with DNA (PDB ID 2c6y); the structure consists of a pair of beta-strands forming a “wing” structure and four alpha-helices, two of which form a helix-turn-helix motif. A) Recognition helix in the major groove of DNA; B) top view of the winged-helix protein illustrating the “wing” formed by the beta-strands.

1.1.3 Homeodomains: A DNA Binding Domain Family

Homeodomains are transcription factors that play important biological roles in many eukaryotic organisms. They are members of the helix-turn-helix structural class of DNA-binding proteins, which exert their DNA-binding capability via an alpha-helix known as the recognition helix [9,10].


HOX genes are involved in limb and body plan development in mice and humans [45]; homeodomain-containing proteins are involved in body plan development in Drosophila and C. elegans (e.g. Antp, Scr, lin-39, mab-5) [46] and in mating type switching in S. cerevisiae (i.e. MATa, MATα) [47]. LIM homeodomains play a role in the vertebrate nervous system and are required for axon guidance and neuronal differentiation [48,49]. The atypical homeodomain HNF1 plays a role in glucose homeostasis in mammals [50,51].

Plant homeodomain subfamilies include plant leucine-zippers (HD-Zip), WUSCHEL, zinc-finger associated with a homeodomain (ZF-HD), Bell (associated with a Bell domain), and Knotted related homeobox (KNOX) [52,53]. Unlike non-plant homeodomains, none of the plant homeodomains have been observed to exert any homeotic effect [53]. Plant HD-Zip proteins play multiple physiological roles, including response to environmental conditions (e.g. drought tolerance), meristem regulation, and organ and vascular development [53]. WUSCHEL regulates stem cell fate in the apical meristem [52,54]. Plant ZF-HD homeodomains are expressed in floral tissues and are believed to be involved in regulation of floral development [53,55]. Bell homeodomains interact with KNOX homeodomains to regulate early internode patterning events in the shoot apical meristem during inflorescence (a collection of flowers that make up a flower head) development in Arabidopsis [56].

Table 1 below summarizes both plant and non-plant homeodomains and indicates whether protein structures are available for them.

Subfamily / Protein Name                                                          Protein Structure Available (Y/N)
Paired-like homeobox (Interpro subfamily)                                         Y
SIX/SINE homeobox (Interpro subfamily)                                            N
LIM homeobox (Interpro subfamily)                                                 Y
‘Homeobox’ engrailed-type protein (Interpro subfamily)                            Y
Homeobox protein, antennapedia type (Interpro subfamily, includes Antp and HOX)   Y
POU homeobox (Interpro subfamily)                                                 Y
CUT homeobox (Interpro subfamily)                                                 Y
Plant homeobox leucine-zipper (HD-Zip)                                            N
Plant WUSCHEL subfamily                                                           N
Plant zinc-finger/homeodomain (ZF-HD)                                             Y
Plant Bell subfamily                                                              N
Plant KNOX subfamily                                                              N
Scr (Drosophila)                                                                  N
MATa homeodomain (S. cerevisiae)                                                  Y
MATα homeodomain (S. cerevisiae)                                                  Y
HNF1 homeodomain (mammalian)                                                      Y
Homeobox leucine-zipper (non-plant)                                               Y
lin-39 (C. elegans)                                                               N
mab-5 (C. elegans)                                                                N

Table 1: Homeodomains and availability of protein structures in the PDB. Subfamily assignments are given (as applicable) according to the Interpro homeodomain subfamily classification illustrated in Figure 18.

A “typical” homeodomain (illustrated in Figure 7) is defined as a 60 amino acid long “all-alpha” protein domain with three alpha helices, which binds specifically to the major groove of DNA [45]; the first helix is connected to the second helix by a loop [45], while the second and third helices form a helix-turn-helix motif [45]. Residues at the N-terminal end of the first helix contact the minor groove of DNA [45], while DNA binding specificity is governed by residues in the third helix or “recognition helix” of the homeodomain [45,57]. Examples of “typical” homeodomains include Engrailed and Antennapedia from Drosophila [58].

Figure 7: Structure of a “typical” homeodomain in contact with DNA. Helix 1 (the most N-terminal of the three helices) is highlighted in green, helix 2 in red, and helix 3 (recognition helix) in cyan. The recognition helix makes specific contacts with the major groove of DNA.


Classes of “atypical” homeodomains also exist, which contain insertions relative to the 60 amino acid length of the “typical” homeodomain. These include Three Amino acid Loop Extension (TALE) type homeodomains, which are characterized by a 3 amino acid long insertion in the loop between the first and second helices of the homeodomain [59]. Subclasses of TALE-class homeodomains include plant homeodomains such as KNOX/MEINOX [60], the S. cerevisiae mating-type specific proteins MATa and MATα [47], and the proteins Pbx and Exd, which participate in arthropod and vertebrate developmental programs [61,62]. Other atypical homeodomains include plant leucine-zipper homeodomains [63], and a vertebrate-specific homeobox leucine zipper called Homez [64]. The atypical homeodomain Hepatocyte Nuclear Factor 1 (HNF1), which is expressed in mammalian liver, kidney, pancreas and intestine, contains a 21 amino acid long extension in the second helix of the homeodomain [50,51].

Most homeodomains bind to DNA sequences containing a core ATTA motif with high affinity [58], although class-specific homeodomain DNA binding sites vary. For example, the TTF1 homeodomain binds to a consensus sequence of CACTCAAG, where the bolded and underlined motif corresponds to the conserved ATTA motif [58], while Antp (Antennapedia) homeodomain binds to the sequence AGCCATTAGA [58]. Homeodomains may bind to their target DNA sequences either as a monomer [65,66], as a homodimer or heterodimer [67,68], or in conjunction with other domains or protein partners [43,67-70].

An often-cited example homeodomain is the Antennapedia homeodomain, a member of the HOX subfamily of homeodomains [71]. It binds to DNA as a monomer, making specific contacts with DNA via residues in the recognition helix: positions 47, 50, 51, and 54 in the conventional homeodomain numbering [65,66]. In the case of the Antennapedia homeodomain, Ile47 makes hydrophobic interactions with two nucleotides in the homeodomain core motif, on the reverse strand of DNA: TAAT [65]. Gln50 hydrogen bonds with two nucleotides on the forward strand N-terminal to the core motif, and an adjacent nucleotide on the reverse strand: CCATT / GGTAA [65]. Asn51 hydrogen bonds with an invariant adenine in the homeodomain core motif: ATTA; the amino acid at this position is conserved in most homeodomains [65]. Met54 makes hydrophobic interactions with a nucleotide on the forward strand N-terminal to the core motif: CCATT [65]. These recognition helix positions are generally considered to be important for homeodomain DNA binding [65]. Antennapedia forms part of the HOM-C homeotic complex in Drosophila, and loss-of-function mutations result in transformation of the epidermal pattern for certain segments in the thorax, in a case of mistaken identity [72].

The LIM protein family is named for three LIM family members: Lin-11, Isl-1, and Mec-3 [67]. LIM proteins contain a characteristic DNA-binding homeodomain which is C-terminal to two zinc-finger domains known as LIM domains [43,67,69]. The role of LIM-homeodomain proteins is modulated by interactions with different regulatory partners such as NLI, which mediates homo- and heterodimerization [69], or RLIM, a ubiquitin protein ligase which acts with Sin3A to repress LIM homeodomain transcription factor activity [70]. LIM-homeodomain proteins also appear to regulate transcription of target genes via association with other transcription factors such as Pit-1 (a member of the POU family) [69]. LIM homeodomains have been shown to be important for the development of tissue-specific cells in the nervous system, skeletal muscle, heart, kidneys, and endocrine glands of mammals [69].

The POU family of DNA-binding proteins is named for three types of transcription factors that contain a POU domain: the pituitary protein Pit-1, the octamer proteins Oct-1, Oct-2, and Oct-3, and the nematode cell lineage gene unc-86 [73]. The POU DNA-binding motif consists of the POU homeodomain joined by a flexible linker to a POU-specific domain; both domains are required for binding to DNA with high affinity [73,74]. Members of the POU family recognize the consensus octamer DNA sequence 5’-TATGCAAAT-3’, with the POU-specific domain regulating specificity at the 5’ end (TATGCAAAT) and the homeodomain regulating specificity at the 3’ end (TATGCAAAT) [75]; both domains contact the central CA nucleotides [75]. Some members of the POU family appear to regulate neuronal development; unc-86 regulates cell lineage and sensory neuron development in C. elegans, while Oct-3 regulates the differentiation of the nerve trunk in mice [73].

The MATα2 homeodomain exists in two different regulatory conformations (MATα2 and MCM1 heterodimer bound to DNA, and MATα2 and MATa1 heterodimer bound to DNA). In haploid yeast cells, MATα2 interacts with MCM1 to repress “a” mating-factor specific genes; in diploid yeast cells, MATα2 interacts with MATa1 to repress haploid-specific genes [68]. These two complexes have different DNA site preferences [76]. The MATa1 homeodomain has limited affinity for DNA in isolation [77]; however, when associated with MATα2, the heterodimer binds with high affinity [78]. Dimerization of MATa1 with MATα2 is mediated via the C-terminal tail of MATα2 binding to the TALE (Three Amino-acid Loop Extension) region between helices 1 and 2 of MATa1 [79]. Association of MATα2 with MCM1 increases the affinity of MATα2 for the specific operator DNA sequence by 50- to 500-fold [68]; dimerization is effected by the N-terminal extension of the MATα2 homeodomain, which interacts with a hydrophobic groove on the surface of MCM1 [68]. These interactions between MATa1, MATα2, and MCM1 control the expression of mating-type specific genes in yeast.

Homeodomains exist in eukaryotes ranging from S. cerevisiae to C. elegans to Arabidopsis to Drosophila to humans. They play important roles in body plan development, neuronal development, meristem development in plants, and mating-type specificity in yeast; consequently, the homeodomain family of proteins is a prime focus for transcription factor research.

1.2 Sequence Conservation and Inferences

Homology refers to similarity due to shared common ancestry [80], and inference based on homology is one of the great success stories of bioinformatics; if a protein has no experimentally determined function, its putative function is frequently established via identification of homologous sequences using programs such as BLAST [81]. If two proteins share a common ancestor, one would expect them to have similar functions [82,83]. If two proteins have diverged from a common structure, one would expect them to share a common structural fold [84]. Furthermore, since the proteins have diverged from a common sequence, one would expect them to have similar sequences [85,86]. Normally, a group of proteins under study is considered to be homologous on the basis of the available evidence from protein sequence and structure. Once homology has been concluded, it provides an explanation for the similarity of the proteins, and putative function may be derived from previously characterized proteins within the homologous family.


When examining the relationships between different proteins, the type of homology that two proteins share can be described as orthologous, paralogous, or xenologous. In the case of orthologous proteins, sequence divergence occurs after a speciation event [80], whereas paralogous proteins diverge from each other after a gene duplication event [80] (illustrated in Figure 8); xenology occurs when genetic material is transferred between species via horizontal gene transfer [80,87]. Horizontal gene transfer occurs primarily in prokaryotic organisms [88], but has also been observed to occur in some unicellular eukaryotes [89] and some multicellular organisms [90].

[Figure 8 diagram labels: common ancestral gene A; speciation, giving Chimpanzee Gene A and Human Gene A (orthologues); gene duplication, giving Human Gene A and Human Gene B (paralogues).]

Figure 8: Orthologous and paralogous gene relationships. Orthologues refer to gene pairs that arise as a result of speciation; paralogues refer to gene pairs that arise as a result of gene duplication within the same species.

Many systems have been developed that attempt to quantify the degree of similarity between proteins [91-93]. These systems are known as scoring matrices, and represent a particular model of evolution based on observed substitution frequencies in proteins, similarity of particular structural or chemical amino acid properties, and frequency of evolutionary events (insertions and deletions) [91,92]. Using a scoring matrix, it is possible to quantify the degree of similarity between proteins and to align those regions of the proteins with greatest similarity, based on amino acid properties, number of gaps, and number of identical residues.


Some commonly used scoring matrices include the Point Accepted Mutation (PAM) matrices [92] and the BLOcks SUbstitution Matrix (BLOSUM) matrices [91]. When determining the degree of relatedness between two proteins, reciprocal best-match BLAST has been used as a criterion to determine putative orthology [94].

Sequence motifs which occur in multiple proteins may occur by chance, as a result of common ancestry, or due to functional significance of the motif [95]; these are not necessarily mutually exclusive. Motifs which are conserved for functional reasons may play a role in protein structure [96,97], protein-ligand interactions [98,99], or enzymatic function [97,100], depending on the protein, and are typically under purifying selection [101]. Purifying selection refers to the removal of deleterious mutations from a gene pool under strong selective pressures [102]. Hence, one would expect motifs conserved for functional reasons to be well conserved; one example might be the conserved disulfide bridge in immunoglobulin variable domains [103].

Since the majority of protein classifications are driven by sequence similarity [104-108], most protein classification is based upon the principle of homology.

1.3 Principles of Protein Classification

How can proteins be compared to each other? What are the criteria that can be used? Proteins can be classified according to structure, according to sequence, according to their relatedness, or according to previously characterized biological function. In many cases, these classifications may overlap, since the goal of any protein classification is to organize information about proteins of interest in order to improve our understanding of protein function.

Some structural elements which are used to classify proteins include protein folds, structural domains, domain architecture and topology, and secondary structure elements [84,109]. Structural similarity over a large grouping of proteins is often considered to be due to homology; these groupings have been referred to as superfamilies or homologous superfamilies [84,109]. Structural classification may be performed in an automated computational fashion [109,110] or primarily via manual curation [84].

When proteins are classified on the basis of sequence, this is frequently done via identification of sequence motifs or sequence domains; these may be presented as consensus sequences [107], Hidden Markov Models [104], multiple alignments [104,107,111], sequence patterns [112], or sequence profiles [112]. Overall sequence identity between proteins may also be incorporated into a distance measure when clustering proteins according to sequence similarity [113,114]. Clustering algorithms may be used to group proteins according to a specified quantifiable criterion for similarity. A commonly used clustering algorithm is hierarchical clustering, in which a tree-like hierarchy of clusters is formed with a single shared root and every sequence as its own leaf on the tree (illustrated in Figure 9); the sequences which are most similar according to the chosen metric are grouped next to each other in the hierarchy. Hierarchical clustering produces only those partitions of sequence space which represent groupings of a node and all of its descendants; not every conceivable partition of sequence space can be obtained [115,116]. Another commonly used clustering algorithm is k-means clustering (illustrated in Figure 10), in which k points are randomly chosen as centroids, and all other points are partitioned into clusters associated with the nearest centroid; new centroids are then calculated for each of the clusters, and the process is repeated until the clusters no longer change.

Figure 9: Representation of hierarchical clustering. “abcdef” is the shared root, and at varying levels of the tree, there are potential clusters “abc”, “def”, “ab”, “ef”, and individual leaves “a”, “b”, “c”, “d”, “e”, and “f”. The clusters derived from hierarchical clustering depend on the level at which the tree is cut.
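As a minimal sketch of the agglomerative form of hierarchical clustering described above, the following Python fragment repeatedly merges the two closest clusters and records each merge. The single-linkage rule, the function name single_linkage, and the one-dimensional positions used as a stand-in for sequence distances are all assumptions made for illustration; they are not taken from the methods used in this study.

```python
from itertools import combinations

def single_linkage(labels, dist):
    """Agglomerative single-linkage clustering.

    labels: leaf names; dist: dict mapping frozenset({a, b}) to a distance.
    Returns the clusters created by each successive merge, in order.
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def linkage(pair):
        # Single linkage: the distance between two clusters is the smallest
        # leaf-to-leaf distance across them.
        c1, c2 = pair
        return min(dist[frozenset([a, b])] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append("".join(sorted(c1 | c2)))
    return merges

# Hypothetical one-dimensional positions standing in for six sequences.
pos = {"a": 0, "b": 1, "c": 4, "d": 12, "e": 17, "f": 19}
dist = {frozenset([x, y]): abs(pos[x] - pos[y]) for x, y in combinations(pos, 2)}
merges = single_linkage(list(pos), dist)
print(merges)
```

With these toy positions the merges reproduce the nested clusters of Figure 9 (“ab”, “ef”, “abc”, “def”, and finally “abcdef”); cutting the hierarchy after a given merge yields the partition at that level.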

Figure 10: Representation of k-means clustering, with k=3 in this example: A) initial selection of cluster centroids; B) partitioning of points to the closest centroid; C) re-calculation of the centroids of the assigned clusters (note that the centroid of the top-right hand corner cluster in green has shifted slightly); D) re-partitioning of points to the closest centroid (note that partition boundaries have shifted slightly due to the re-assignment of the green centroid).
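The four panels of Figure 10 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name kmeans, the two-dimensional toy points, and the explicitly supplied starting centroids (used here instead of a random choice so that the run is reproducible) are assumptions, not part of the original analysis.

```python
def kmeans(points, centroids, max_iter=100):
    """k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Partition step: assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # the clusters no longer change
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignment, centroids

# Three well-separated toy groups of points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
# Deliberately poor starting centroids (two of them in the same group).
assignment, centroids = kmeans(points, [(0, 0), (0, 1), (20, 0)])
print(assignment)
```

Because the result depends on the starting centroids, practical implementations usually repeat the procedure from several random starts and keep the best partition.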

For protein sequences, similarity is frequently expressed as “nearness” in terms of distance, with similar sequences being closer to each other; distances are frequently scored using substitution matrices such as BLOSUM [91] or PAM [92], which encode observed or extrapolated frequencies of substitution of amino acids in related sequences (see the following section 1.3.1 for a discussion of substitution matrices). Another measure of sequence similarity that is often chosen is relative entropy, which measures the degree to which one distribution can be used to represent the information content of another distribution [117]. For a measure D to be considered a distance metric (distance function), it must satisfy the following conditions [118]:

1. non-negativity: the distance between any two elements x and y can never be negative
   D(x,y) ≥ 0
2. identity: the distance from an element x to itself equals zero
   D(x,x) = 0
3. symmetry: the distance between elements x and y is the same as the distance between elements y and x
   D(x,y) = D(y,x)
4. triangle inequality: the distance between elements x and y plus the distance between elements y and z must always be greater than or equal to the distance between elements x and z
   D(x,y) + D(y,z) ≥ D(x,z)

The triangle inequality can be alternatively visualized as a set of three vectors forming a triangle with vertices x, y, and z, whose sides have lengths D(x,y), D(y,z), and D(x,z) (illustrated in Figure 11).

Figure 11: Representation of the triangle inequality, in which the sum of the lengths of two sides of a triangle must always be greater than or equal to the length of the third side
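The four conditions can be checked mechanically for any candidate distance on a finite set of elements. The sketch below is illustrative only; the function name is_metric and the two toy distance functions are assumptions introduced here, not measures used in this thesis.

```python
from itertools import product

def is_metric(items, d, tol=1e-12):
    """Check the four metric conditions for d on a finite set of items."""
    for x, y in product(items, repeat=2):
        if d(x, y) < -tol:                      # non-negativity
            return False
        if x == y and abs(d(x, y)) > tol:       # zero distance to itself
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
    for x, y, z in product(items, repeat=3):
        if d(x, y) + d(y, z) < d(x, z) - tol:   # triangle inequality
            return False
    return True

# Absolute difference satisfies all four conditions.
print(is_metric([0, 1, 2, 5], lambda a, b: abs(a - b)))
# Squared difference fails the triangle inequality: 1 + 1 < 4 for 0, 1, 2.
print(is_metric([0, 1, 2], lambda a, b: (a - b) ** 2))
```

A check of this kind can only refute the metric property on the elements supplied; passing it on a sample does not prove that a measure is a metric in general.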

It should be noted that while a distance based on amino acid substitution score is a distance metric, relative entropy is not a distance metric because it is asymmetric; that is, the relative entropy of amino acid distribution x relative to amino acid distribution y, D(x||y), is not the same as the relative entropy of distribution y relative to distribution x, D(y||x) (see section 2.5.1 for an explanation of relative entropy, especially as applied to profiles or distributions of amino acids) [117]. Furthermore, relative entropy does not satisfy the triangle inequality [117]. Distances can be measured between two amino acid sequences, between an amino acid sequence and a profile of amino acids, or between two profiles of amino acids [119]. Distances between two amino acid sequences perform best when all sequences under analysis have similar rates of amino acid substitution, and perform poorly if two or more lineages exhibit a more rapid rate of mutation than all other lineages, a phenomenon known as long-branch attraction [120]; sequence-to-sequence distance is the measure used by the multiple sequence alignment program CLUSTALW [121]. Profile-to-profile distances are likely to be more robust against long-branch attraction, but assume homogeneity among all sequences represented by a profile [119]. Sequence-to-profile distances are intermediate between the other two distance evaluation methods; the sequence information from comparing individual sequences is retained, while the position-specific composition of amino acid residues provided by a profile is maintained [119].
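The asymmetry of relative entropy is easy to see numerically. The sketch below assumes the standard Kullback-Leibler form with base-2 logarithms, D(p||q) = sum over i of p_i log2(p_i / q_i); the definition given in section 2.5.1 may differ in detail, and the two-letter distributions are hypothetical values chosen only for illustration.

```python
import math

def relative_entropy(p, q):
    """D(p||q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical residue distributions at one alignment column (two letters).
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = relative_entropy(p, q)
d_qp = relative_entropy(q, p)
print(d_pq, d_qp)  # the two directions give different values
```

Symmetrized variants (for example, averaging the two directions) are sometimes used when a symmetric score is required, although symmetry alone does not guarantee the triangle inequality.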

Proteins may also be classified according to homology or to their relative level of relatedness. The Clusters of Orthologous Groups of proteins (COGs) database contains clusters of putative orthologues according to the criterion of reciprocal best-match BLAST [94], whereas others classify proteins according to their family [104] or subfamily [106]. Reciprocal best-match BLAST is a typical criterion used to assess orthology, but can fail to identify true orthologues if paralogues arise after speciation due to duplication of one of the orthologues [122]. Relationships among protein family members, including paralogy and orthology, can be inferred via analysis of the phylogenetic tree for the protein family (for an illustration of paralogy and orthology, see Figure 8).
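The reciprocal best-match criterion can be sketched as follows. The nested dictionaries stand in for parsed BLAST results (query, then subject, then score); the protein names, the scores, and the function names best_hit and reciprocal_best_hits are hypothetical and are not drawn from the COGs pipeline.

```python
def best_hit(scores):
    """Map each query to its highest-scoring subject."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hit(a_vs_b)
    best_ba = best_hit(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical similarity scores between proteins of genome A and genome B.
a_vs_b = {"A1": {"B1": 200, "B2": 150},
          "A2": {"B1": 90, "B2": 80}}
b_vs_a = {"B1": {"A1": 198, "A2": 88},
          "B2": {"A1": 149, "A2": 79}}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case only A1 and B1 are reported as putative orthologues; A2 and B2 each have a best hit, but neither hit is reciprocated, which is the kind of situation in which duplication after speciation can hide a true orthologue.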

In order to achieve an improved understanding of protein function, ideally, one would classify proteins according to their structural similarity, since structure is normally the most conserved feature in proteins [82,83] and a structure-based comparison therefore improves the quality of the alignment and subsequent classification. However, since only limited numbers of protein structures are available, such an ideal classification is not always possible. When an ideal classification based solely upon protein structure is not possible, it is necessary to base the classification upon other measures of similar protein function. Since similar sequences often share similar structure [85,86], it may be possible to achieve a functional classification based at least partially on sequence similarity. This simplification implies that structurally similar proteins with similar function, which are different in sequence, as is the case with analogous proteins under the influence of convergent evolution, would not be correctly classified. Conversely, proteins related by common ancestry will likely share similar sequence and structure, and will therefore be assorted into similar categories, even if they do not necessarily share a common function [123], as in the case of divergent evolution of proteins after gene duplication.

1.3.1 Importance of Sequence Alignment

The creation of a sequence alignment is frequently the first step in identifying regions of similarity among proteins; these regions may be attributable to homology, required for function, or preserved as essential structural elements [85,86,96-100]. Since it serves as the input step in many different types of analyses, the quality and type of the sequence alignment influences the usability and interpretation of the obtained results. A structure-based alignment gives a good picture of sequence variability at each position, and is well suited for determining differences in specificity for subfamily classification.

Different types of sequence alignments exist, and each type is normally used for a different purpose. Pairwise alignments are performed between pairs of sequences to identify primary similarity between sequences [124], whereas multiple alignments are performed when the overall best alignment is desired for a group of sequences, in order to identify regions of similarity common to all of the proteins [124]. Two categories of pairwise alignments exist, local and global; local sequence alignments align short stretches of very similar sequence [125] and are useful for identifying short sequence motifs or domains within proteins, whereas global sequence alignments align over the entire length of the input sequences [125] and may be used to compare overall sequence similarity between proteins and the overall relationship of sequence elements.

Optimal alignment algorithms maximize an alignment score derived using a scoring matrix, notably the PAM [92] and BLOSUM [91] matrices, and are guaranteed to find a computationally optimal alignment solution with the best possible score. A commonly used optimal pairwise global alignment algorithm is the Needleman-Wunsch alignment algorithm [126], while a commonly used optimal pairwise local alignment algorithm is the Smith-Waterman alignment algorithm [127]. Heuristic alignment algorithms yield rapid determinations of sequence similarity, and do not necessarily find computationally optimal alignment solutions; these include the algorithm used by BLAST [81]. Due to the computationally intensive nature of multiple sequence alignment, most multiple sequence alignment programs employ heuristic methods rather than global optimization [121,128,129]. Commonly used multiple sequence alignment programs include CLUSTALW [121], T-COFFEE [128], and MUSCLE [129].
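A minimal sketch of optimal global alignment by dynamic programming, in the spirit of the Needleman-Wunsch algorithm, is given below. It returns only the optimal score, and it assumes a simple scheme (match +1, mismatch -1, linear gap penalty -1) in place of a PAM or BLOSUM matrix; the function name and parameter values are illustrative assumptions.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s and t under a simple scoring scheme."""
    n, m = len(s), len(t)
    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,       # gap in t
                          F[i][j - 1] + gap)       # gap in s
    return F[n][m]

score = needleman_wunsch_score("GATTACA", "GATACA")
print(score)
```

For these two toy sequences the best alignment pairs six identical positions and opens one gap, giving a score of 5. Replacing the match and mismatch values with a lookup in a substitution matrix, and storing the choice made at each cell, would give the alignment itself rather than only its score.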

Scoring matrices incorporate biological or statistical data about known sequences, and reflect a particular model of protein evolution. The PAM matrices are based on number of substitutions per hundred residues [92], and are derived from the PAM1 matrix, which is based on highly similar proteins with 1 substitution per hundred residues, or 1% divergence between proteins [92]. To extrapolate to greater levels of divergence, the PAM1 matrix is multiplied by itself; the PAM250 matrix is equivalent to the PAM1 matrix raised to the 250th power. The BLOSUM matrices are derived from blocks of aligned sequences from evolutionarily diverged proteins [91,111]. Each of the BLOSUM matrices is calculated from sequences with a specified maximal sequence identity; for example, the BLOSUM62 matrix is calculated from sequences with 62% sequence identity or less [91], and the BLOSUM100 matrix is calculated from the whole of the BLOCKS database, since all the proteins have at most 100% sequence identity to one another [91,111].
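The extrapolation from PAM1 to PAM250 is a repeated matrix multiplication of mutation probabilities. The sketch below uses a hypothetical two-letter alphabet with a 1% chance of change per step purely for illustration; the real PAM1 matrix is 20 × 20 with residue-specific probabilities, and the published scoring matrices are log-odds values derived from the extrapolated probabilities.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(m, n):
    """m multiplied together n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy "PAM1": each letter is unchanged with probability 0.99 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250)
```

After 250 steps the toy matrix is close to uniform (each entry near 0.5), which illustrates why matrices extrapolated to large evolutionary distances are used for comparing distantly related sequences.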

Multiple sequence alignments are required for analyzing relationships among protein families and determining their phylogeny [130]; phylogenetic trees are constructed from evolutionary distances between proteins, generally calculated from a multiple alignment of the proteins [131]. Conserved sequence motifs within protein families may be identified via examination of the multiple sequence alignment for the protein family. Some protein classification methods which utilize the calculation of evolutionary distance or divergence require a multiple sequence alignment as the basis for the analysis; these include subfamily determination methods [117,119], especially if a guide tree is utilized in the creation of the subfamilies; these also include some graph clustering methods when employed in protein classification, if a distance matrix calculated based on similarity between proteins is given as input [132]. Due to their shared property of evaluating the degree of relatedness between different sequences, multiple sequence alignments and phylogenetic trees are usually evaluated in tandem.

1.3.2 Phylogenetic Representation Methods

Phylogenetic trees are graphical representations of the evolutionary relationships between different proteins. Reconstructing the putative evolutionary history of a protein family may allow conclusions to be drawn about orthology or paralogy of certain proteins; this information may allow inferences to be drawn about the putative function of the proteins based on their relationship.

There are many different types of phylogenetic trees, depending on what is known about the group of proteins being studied; each type requires certain assumptions, such as knowledge of ancestral relationships or specifications of evolutionary rate. A cladogram is a tree that presents only the putative relationships between species, without making specific reference to time [133]; whereas a chronogram is a tree in which the amount of distance between branchpoints is explicitly correlated with the amount of evolutionary time that is perceived to have passed [134,135]. When constructing an evolutionary tree, the number of character changes between species is often correlated with the amount of evolutionary time that has passed, and the assumption is made that there is a constant rate of substitution on every branch of the tree; this is known as the Molecular Clock hypothesis. In the Molecular Clock hypothesis, it is assumed that the rate of evolutionary change of a specified protein remains constant over time and regardless of which species the protein has been identified in [136]. This hypothesis fails to hold when different lineages for a specified protein are evolving at different rates. These differences in evolutionary rates between lineages can be due to differences in generation time between lineages, or differences in metabolic rates. Shorter generation times allow amino acid substitutions to accrue more rapidly in the germline, assuming a constant rate of mutation per germ-cell division or germline DNA replication cycle [137]. Increased metabolic rates result in increased numbers of mutagenic oxygen radicals [138] and increased rates of DNA synthesis and nucleotide replacement [139]. Trees may be rooted or unrooted. A rooted tree makes the assumption of specifying the location of the last common ancestor of all the proteins being analyzed and uses this putative unknown last common ancestor to root the tree, whereas an unrooted tree depicts the relationship among the proteins without making assumptions about ancestry [140].

Several different algorithms have been used in the identification of phylogenetic trees that represent the evolutionary relationships among all taxa being studied. One of the most commonly used algorithms is known as neighbour-joining [141,142]. This method requires knowledge of the distance between each pair of taxa in the tree [141,142]. It is a greedy algorithm that merges, at each step, the pair of nodes whose joining minimizes the estimated total branch length of the tree (a criterion based on pairwise distances corrected for each node's overall distance to all other nodes), producing an unrooted tree [141,142]. In maximum parsimony analyses, all potential trees that the given taxa can form are considered, with the tree that contains the overall smallest number of changes among all taxa being chosen as the most parsimonious [124]. In the maximum likelihood method, all possible scenarios which give rise to the observed data must be taken into consideration, and a likelihood estimate given to each possible scenario based on some model of evolution [124]. This is repeated for all possible trees, with the tree having the highest probability of giving rise to the observed data being selected as the actual tree [124]. Both the maximum parsimony and maximum likelihood methods are computationally intensive and do not scale well to large numbers of sequences [143,144]; hence, search strategies that avoid enumerating all possible trees, such as simulated annealing, genetic algorithms, branch-and-bound, and nearest-neighbour interchange, are more frequently utilized [145-149]. Long branch attraction is another problem that can occur when a few sequences in a large dataset appear to have higher rates of mutation than the rest of the sequences [150]; these rapidly evolving lineages can be mistakenly identified as being closely related to each other, regardless of the actual evolutionary relationship between the proteins [150]. Long branch attraction can be a problem for maximum parsimony methods [150] and for neighbour-joining methods as well [150].
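The pair-selection step of neighbour-joining can be sketched as follows. In the standard formulation the pair to be joined minimizes Q(i,j) = (n − 2)·d(i,j) − Σk d(i,k) − Σk d(j,k), where n is the current number of nodes; this corrects the raw distance for each node's total distance to all others. The function name nj_pair_to_join and the five-taxon distance table are hypothetical, and only the selection step is shown, not the branch-length calculation or the distance update.

```python
from itertools import combinations

def nj_pair_to_join(names, dist):
    """Return the pair of nodes minimizing the neighbour-joining Q criterion."""
    n = len(names)

    def d(i, j):
        return 0 if i == j else dist[frozenset((i, j))]

    # Total distance from each node to all other nodes.
    total = {i: sum(d(i, k) for k in names) for i in names}
    return min(combinations(names, 2),
               key=lambda p: (n - 2) * d(*p) - total[p[0]] - total[p[1]])

# Hypothetical pairwise distances between five taxa.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(k): v for k, v in raw.items()}
names = ["a", "b", "c", "d", "e"]

pair = nj_pair_to_join(names, dist)
closest = min(dist, key=dist.get)
print(pair, sorted(closest))
```

In this toy table the pair selected by the Q criterion (a, b) is not the pair with the smallest raw distance (d, e), which shows how the correction term can change which nodes are joined first.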

Various methods exist to determine the robustness of the tree given small fluctuations in the dataset, including bootstrapping, jackknifing, and leave-one-out cross-validation. In bootstrapping, the columns of the input multiple sequence alignment are randomly sampled with replacement to create pseudoreplicate datasets of the same size as the original multiple sequence alignment [151]. In jackknifing, a fixed number of columns is randomly deleted from the alignment to create each pseudoreplicate dataset [151]. In leave-one-out cross-validation, each successive sequence is left out of the dataset when producing pseudoreplicate datasets [152]. In all of these methods, each of the generated datasets is then used to generate a tree in the same way that the original tree was generated [151]. The percentage of times that a given node in the actual phylogenetic tree is supported by the trees generated from the pseudoreplicate datasets is given as the support at each node [151].
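The construction of a single bootstrap pseudoreplicate can be sketched as below. The toy alignment, the function name bootstrap_alignment, and the fixed random seed are assumptions made for illustration; tree building on each pseudoreplicate and the counting of node support are not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    length = len(alignment[0])
    # Draw column indices with replacement; some columns repeat, some drop out.
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Hypothetical aligned fragments (one string per sequence, equal lengths).
alignment = ["RRKRT",
             "RRGRQ",
             "KRKRT"]

replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

Every column of the pseudoreplicate is a complete column of the original alignment, so residues from different sequences at the same position always stay together; repeating the call with different random states gives the set of pseudoreplicates from which node support is computed.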

Phylogenetic trees, or variations thereof, have been used in protein subfamily determination [117] if a multiple sequence alignment of the protein family is given as input. Consequently, the quality of the phylogenetic tree generated during subfamily determination influences the reliability of the subfamilies thus obtained.

1.3.3 Protein Family and Subfamily Classifications

A protein family is generally considered to be a group of homologous proteins [84,104,105,112], while a protein subfamily is generally considered to be a subset of a protein family, separated according to its specific function [153]. Different databases exist in which proteins are classified according to their structure or according to sequence similarity, in the desire to organize proteins in such a way as to better understand their function.

Databases which assort proteins according to their structural similarity include the Structural Classification of Proteins (SCOP) [84] and CATH [109]. SCOP arranges protein domains in a hierarchy of class, fold, superfamily, and family; its classes are defined mainly by secondary structure content, with additional classes for multi-domain proteins and for membrane and cell-surface proteins. CATH assorts structural domains into categories according to Class (secondary structure composition), Architecture (overall shape and orientation of secondary structure), Topology (similar protein fold), and Homology (common ancestry). Both SCOP and CATH draw their source data from the Protein Data Bank [154], a repository for protein structure information.

The desire to classify proteins together that share similar functions led to the construction of protein family databases such as Pfam [104], the Clusters of Orthologous Groups (COGs) [108], Interpro [105], and Prosite [112]. Pfam is a collection of protein families and domains, and consists of two parts: Pfam-A and Pfam-B. Pfam-A is manually curated; Pfam-B is automatically generated from PRODOM [107], a protein domain family database containing sequences from the UniProt [155] sequence repositories. The Clusters of Orthologous Groups database classifies proteins into the same category if they are considered to be orthologues according to the reciprocal best-match BLAST criterion [108]. Interpro is a collection of protein families; some of these families, such as the homeobox family of proteins, have also been classified into subfamilies [105]. Interpro family and subfamily classifications are derived from two curated sources: Pfam and PANTHER (discussed later in this section) [156]. Prosite consists of sequence patterns and profiles which identify protein domains, families, and functional sites [112].

Sequence-driven methods for the identification of functionally important residues frequently depend on a multiple sequence alignment for a protein family of interest; some of these methods further subdivide the classification of protein families into subfamilies. Many of these subfamily-determination methods, such as Bête, ProteinKeys, TIPS, or Evolutionary Trace, rely on a phylogenetic tree or guide-tree based protocol [117,119,157,158]. The correctness and robustness of the tree affects the correctness and robustness of the resulting subfamilies, with three important assumptions being made: the same rate of amino acid substitution is assumed for all lineages, and thus all lineages are equally distant from the common ancestor; all positions in all proteins mutate at the same rate; and the substitution matrix used in the creation of the multiple sequence alignment specifies a likelihood model for conserved substitutions, regardless of structural constraints [117]. For example, glycine is a conserved residue in the first position of a type II beta-hairpin [159]; substitution of the glycine by alanine would likely destabilize the hairpin due to alanine’s relative lack of conformational freedom, but substitution matrices would not take this position-specific information into account. A discussion of the tree-based subfamily classification methods Bête, TIPS, and Evolutionary Trace follows.

The profile-based method Bête selects for subfamily groupings (profiles) that minimize the total relative entropy (TRE) of all columns in a multiple sequence alignment; the TRE is calculated stepwise while creating a hierarchical agglomerative tree with nearest-neighbour heuristics (for a detailed description of Bête, please see section 2.5.1). The advantage of Bête is that it does not require a user-selected cutoff for the creation of partitions; as well, by using a profile-based method to evaluate similarity, it avoids the assumption that all positions in a protein have the same rate of mutation. However, while the distance measure chosen, relative entropy, quantifies the degree to which the subfamily profile conveys the information content of all of its members, it is not a true distance metric (as mentioned in section 1.3). The fact that relative entropy is not symmetric and does not obey the triangle inequality [117] makes it difficult to quantify the degree of similarity between two subfamilies b and c, even if it is known that subfamily a is similar to subfamily b, and that subfamily a is also similar to subfamily c.

A recently developed program, ProteinKeys, which was not available during the completion of this thesis work, uses a protocol similar to that of Bête in identifying subfamilies; however, it reports specificity determining residues as well [158]. Like Bête, ProteinKeys employs a hierarchical agglomerative nearest-neighbour tree in order to identify possible subfamily partitions; however, it minimizes total combinatorial entropy instead of total relative entropy when evaluating the subfamily contrast function [158]. The combinatorial entropy S is the sum of the log of total amino acid permutations at a given alignment position i of a subfamily partition k, summed over all partitions [158]:

Equation 1: Combinatorial entropy S

\[ S = \sum_{i} \ln S_i \]

where the combinatorial entropy for a given column i in the multiple sequence alignment is

\[ S_i = \sum_{k} \ln Z_{i,k} \]

and the total number of permutations in a column i of a subfamily k is

\[ Z_{i,k} = \frac{N_k!}{\prod_{\alpha = 1,\ldots,21} N_{\alpha,i,k}!} \]

given that amino acid types α range from 1 to 21 (21 being equivalent to a gap), i is the column in the multiple sequence alignment, and k is the subfamily.
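As an illustration of the per-column quantities S_i and Z_{i,k}, the following Python sketch evaluates them for a single alignment column. It is not the ProteinKeys implementation. The function names are hypothetical, and it assumes that N_k is the number of sequences in subfamily k and that N_{α,i,k} is the number of residues of type α in column i of subfamily k. Log-factorials are computed with `math.lgamma` to avoid overflow for large subfamilies.

```python
from collections import Counter
from math import lgamma

def ln_factorial(n):
    """ln(n!) computed via the log-gamma function."""
    return lgamma(n + 1)

def ln_permutations(residues):
    """ln Z_{i,k} = ln(N_k! / prod_alpha N_{alpha,i,k}!) for one column of one subfamily."""
    counts = Counter(residues)  # a gap character simply counts as the 21st symbol
    return ln_factorial(len(residues)) - sum(ln_factorial(c) for c in counts.values())

def column_entropy(column, labels):
    """S_i = sum over subfamilies k of ln Z_{i,k} for one alignment column."""
    by_subfamily = {}
    for residue, label in zip(column, labels):
        by_subfamily.setdefault(label, []).append(residue)
    return sum(ln_permutations(r) for r in by_subfamily.values())

# Hypothetical column "KKRR": a partition that separates K from R gives
# homogeneous subfamilies, while a partition that mixes them does not.
separated = column_entropy("KKRR", [0, 0, 1, 1])
mixed = column_entropy("KKRR", [0, 1, 0, 1])
```

In this toy case the partition that separates K from R yields the lower column entropy, which is the behaviour a subfamily contrast function based on this quantity would reward.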

Using combinatorial entropy as a fitness function, instead of relative entropy, sidesteps the need for a distance metric or evaluation of similarity between clusters. Instead, the set of partitions that maximizes the entropy difference between the observed distribution of amino acids in each column and the distribution of amino acids when shuffled is chosen as the set of subfamilies.

The method TIPS performs subfamily classification based on family-level phylogeny, by extraction of monophyletic clades from a phylogenetic tree [119]. Like Bête, it creates a hierarchical agglomerative tree with the “most similar” clusters being joined at each step. However, unlike Bête, it does not employ a profile-profile similarity scoring method, but scores each individual sequence against each profile when calculating subfamily similarity; a Hidden Markov Model (HMM) is created for each profile which represents the probability of emitting certain amino acids at certain positions in the multiple sequence alignment. However, deriving the amino acid probability distribution for each position in the multiple sequence alignment (MSA) assumes that each amino acid position or column in the MSA is independent of all other columns of the MSA. The similarity score assigned to a pair of putative subfamilies K and M is the lesser of two probabilities: the probability that the HMM of subfamily K could produce the amino acid sequences of subfamily M, and vice versa [119].

The Evolutionary Trace procedure (illustrated in Figure 12) partitions the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) tree of the protein family using either a user-selected distance cutoff or specified sequence identity cutoffs, followed by identification of invariant residues separating the resultant subfamilies [157]. The UPGMA method constructs a rooted tree, built stepwise using hierarchical clustering, differing from neighbour-joining in that distances between pairs of sequences are averaged between the lineages; a UPGMA tree assumes that the Molecular Clock hypothesis holds for the proteins being studied [130]. This method has the advantage that sequence identity is a distance metric and thus knowledge about the degree of similarity between different sequences can be used to infer similarity of more distantly related sequences; however, the distances specified by the generated tree may not be valid if different lineages for a specified protein evolve at different rates, as described in section 1.3.2. A rooted tree also assumes a particular point of divergence from a last common ancestor, enforcing subsequent relationships among sequences relative to the putative last common ancestor. The choice of the distance cutoff at which to partition the subfamilies is also a subjective criterion.

Figure 12: A representation of the Evolutionary Trace method of subfamily partitioning. A sequence identity distance cutoff is chosen, which partitions the tree into subfamilies. In this case, the cutoff chosen classifies sequences A and B into one subfamily, sequences D and E into a second subfamily, and sequences C and F are subfamilies consisting of one sequence each.
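The partitioning shown in Figure 12 can be sketched in Python as average-linkage (UPGMA-style) clustering that stops merging once the closest pair of clusters is farther apart than the chosen cutoff. This is an illustrative sketch only, not the Evolutionary Trace implementation. The function name, the distance values, and the cutoff are all hypothetical, and distances are assumed to be supplied for every pair of sequences.

```python
from itertools import combinations

def upgma_partition(names, dist, cutoff):
    """Merge clusters by average pairwise distance until the closest pair exceeds the cutoff."""
    clusters = [frozenset([n]) for n in names]

    def average_distance(a, b):
        # Unweighted average over all pairs of original sequences in the two clusters.
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        if average_distance(a, b) > cutoff:
            break  # every remaining merge would join clusters above the cutoff
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical distances chosen to mimic Figure 12: A/B and D/E are close,
# every other pair is distant.
names = list("ABCDEF")
dist = {frozenset(pair): 0.60 for pair in combinations(names, 2)}
dist[frozenset("AB")] = 0.10
dist[frozenset("DE")] = 0.15

subfamilies = upgma_partition(names, dist, cutoff=0.30)
```

With these assumed distances the cutoff groups A with B and D with E, leaving C and F as single-sequence subfamilies. Raising the cutoff above 0.60 would collapse everything into one cluster, which illustrates why the choice of cutoff is subjective.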

Not all sequence-driven methods rely upon phylogenetic trees or guide trees in order to generate subfamilies or detect putative functional residues. For example, the Sequence Space program [160] involves principal component analysis (PCA), taking each vector in a high-dimensional sequence space to be a clustering of proteins corresponding to a subfamily; a similar procedure, involving the residue plot as opposed to the protein plot, is used to identify functional residues within subfamilies. While this method does not suffer from the limitations inherent to tree-building subfamily determination protocols, the number of principal components retained and the selection of cluster boundaries appear to be ad hoc [160,161].

Two databases focus specifically on protein subfamilies, notably FunShift [106] and PANTHER [153]. These databases consist of classifications based on sequence information. FunShift uses the subfamily classification program Bête to automatically sub-classify the protein domain families within Pfam [106]; this has the advantage of being a completely automated process run on a large database of known protein families (Pfam), but depends on the accuracy and robustness of Bête in assigning subfamilies. PANTHER draws its source data from the GenBank nonredundant (nr) protein database and produces family clusters by using BLASTP on selected “seed” proteins, followed by 3 iterations of PSI-BLAST with a cutoff of 10^-5. A multiple sequence alignment is generated for the family, and the program TIPS is employed to create a partitioning of the phylogenetic tree for the family. Subfamilies are assigned based on GO annotations and identification of monophyletic subtrees of the main protein family tree, and all families and subfamilies are then reviewed by curators [155]. Hidden Markov Models (HMMs) are then created for each curated family or subfamily, with the goal being to use the HMMs to accurately classify novel protein sequences. This implicitly assumes that all protein sequences are classifiable into existing families or subfamilies, with the initial accuracy of the family or subfamily assignment being dependent on the number of classes present in the ontology. The PANTHER creators attempt to address the latter point via curator review and by scoring new sequences against both the family HMM and the subfamily HMMs, identifying a new subfamily when a sequence scores more highly against the family HMM than any subfamily HMM. One notable gap in the existing subfamily classifications is that none of them draws upon current structural knowledge about proteins when attempting to define subfamilies based upon differences in protein function.

1.4 Current Strategies for Determining Functional Residues in Proteins

Many bioinformatics strategies exist to determine putative functionally important residues in proteins; these include purely structure-based methods, purely sequence-based methods, or potentially a combination of the two. Classification of proteins into subfamilies may allow us to determine which characteristics of the proteins may be functionally important, via comparison of subfamilies [162]; however, the integration of structural information with existing sequence information about subfamilies has hitherto not been explored.

When a protein structure is available, the structure may be transformed into a graph representation with each amino acid as a node, and interactions represented as edges; such a representation, combined with solvent accessibility information, may facilitate a priori identification of active site residues [163]. Detection of regions of local similarity between protein structures, followed by cross-correlation with known functional annotations, may also assist in active site residue determination [164,165]. Manual examination of a protein structure, assisted by secondary structure information, cleft size, and calculated solvent accessibility, may also yield insights on functional binding site regions of a protein [166].

Various programs exist to extract specificity-determining positions from a multiple sequence alignment partitioned into subfamilies. SDPpred extracts specificity-determining positions (SDPs) from the multiple sequence alignment of a protein family separated into known specificity groupings [162]. These positions are selected based upon an information-theoretic criterion, mutual information: it describes how well amino acids in columns of the multiple sequence alignment correlate with the groupings of proteins according to specificity [162]. An alternate protocol by Truong and Ikura [167] employs Hidden Markov Models (HMMs) as a means to distinguish between subfamilies based on their underlying characteristics and to identify subfamily-specific residues. An HMM is created for each sliding window of twenty residues in the multiple sequence alignment, and the numbers of matches to each HMM are concatenated to form an HMM histogram [167]; peaks in the histogram are chosen as subfamily signature residues. This method requires that subfamily signatures consist of consecutive amino acid residues. The subfamily determining program Sequence Space, as described in the previous section, can also be used to resolve specificity-determining residues given an appropriate choice of axes in the amino acid residue plot [160]. The most recently developed specificity determining residue programs are ProteinKeys [158] and Multi-RELIEF [168]; however, these programs were not available during the completion of this thesis work. ProteinKeys, as described earlier, classifies proteins into subfamilies, and identifies specificity-determining residues as those positions which maximize the entropy difference between the observed distribution of amino acids in the multiple sequence alignment, and the distribution when the alignment is randomized [158]. Multi-RELIEF employs a machine-learning method to identify specificity determining residues given a set of subfamilies [168]. A feature vector is created, equivalent to the length of the multiple sequence alignment; each feature is assigned an initial weight of zero. Multi-RELIEF then iterates through every protein sequence, and re-assigns weights based on how well each position separates the protein sequence from its nearest neighbour in all of the other subfamilies. The positions with the highest weighting in the feature vector are identified as subfamily-determining positions.
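The mutual information criterion described above for SDPpred can be illustrated for a single alignment column with a short Python sketch. It covers only the basic information-theoretic quantity and should not be read as the SDPpred algorithm, which may apply further statistical treatment not reproduced here. The function name and the toy columns are hypothetical, and logarithms are assumed to be base 2.

```python
from collections import Counter
from math import log2

def column_mutual_information(column, groups):
    """Mutual information (in bits) between residue identity and specificity group."""
    n = len(column)
    joint = Counter(zip(column, groups))   # counts of (residue, group) pairs
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    mi = 0.0
    for (residue, group), count in joint.items():
        p_xy = count / n
        p_x = residue_counts[residue] / n
        p_y = group_counts[group] / n
        mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical columns for four sequences in two specificity groups.
groups = [0, 0, 1, 1]
informative = column_mutual_information("KKRR", groups)    # residue tracks the group
uninformative = column_mutual_information("KRKR", groups)  # residue ignores the group
```

A column whose residues track the grouping perfectly scores one bit in this two-group example, while a column whose residues are unrelated to the grouping scores zero.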

To date, subfamily determination has been performed based exclusively on sequence information, while putative functionally important residues have been determined through either structural analysis or sequence comparison. Our goal is to inform subfamily determination and specificity residue determination via the retention of known structural information about a protein family, while simultaneously taking sequence information into account. This goal is based on the assumption that protein structure is more strongly correlated with biological function than is protein sequence; we make this assumption based on conservation of the protein fold even after evolutionary divergence [169], and on structural similarity between proteins where sequence identity is low and homology is not assumed [169]. It should be noted that evolution fails to conserve the function of related proteins in cases of divergent evolution of paralogues after gene duplication, when one copy of the protein retains the original function and the other paralogue is not held under selective pressure [170].

1.5 A Requirement for More Structural Insight: A Case Study

Protein structures are quite robust with regard to amino acid changes, and are therefore more conserved across evolutionary time than the protein sequence [171]. Since evolution conserves function, structure is conserved as a surrogate for function [82,83]. Ideally, differences in protein structures would guide classification of protein families into subfamilies with conserved functional specialization. However, the lack of available protein structures prevents representation of all subfamilies by known structures. By using structural information to guide existing sequence information in subfamily classification, we hope to achieve better correspondence between protein functional specialization and the resulting protein subfamily classification.

We define a protein domain to be a unit of structure and evolution; a polypeptide chain or parts of polypeptide chains that fold together and are co-inherited throughout evolutionary history. We define a domain family to be a group of domains sharing a common function (identified by common Pfam annotation) and sharing a common domain structure. A common domain structure may be identified by any of the following:

• structurally, if the structure is known (less than 1 Å root mean squared deviation (RMSD) after optimal superposition – this is an outer bound on the range where protein crystallographic resolution linearly correlates with RMSD [172])
• by common Pfam or SMART [173] annotation over the entire domain (i.e. what is identified is a complete domain and not a domain fragment)
• through shared common domain sequence signatures (e.g. active site residues)

We define a domain subfamily to be a cluster of domains within a domain family which is identified as "similar" with a functional and/or structural sub-specialization, and/or carrying a sequence signature unique to the subfamily, relative to the domain family.

This work focuses on the homeodomain family of transcription factors, since it is a well-studied transcription factor family with available protein structures [154] and known specificity groupings or subfamilies [105]. Homeodomains orchestrate the development of the body plan in eukaryotic organisms, and function in mating-type recognition in yeast. Constraining the problem of protein subfamily classification to include just the protein domain allows us to focus on that part of the protein which is conserved between members of the same protein family, and which is most likely to confer biological function.

The objective of this work will be to obtain a better understanding of how homeodomains are able to specifically bind their target DNA sequences, based on structure, sequence, and experimental data. The approach outlined in this thesis work will encompass three parts: a) the development of an objective and automatable procedure to identify putative residues required for specific DNA recognition using homeodomain structure and sequence data; b) the application of this procedure to determine candidate residues that may affect the binding specificity of homeodomains; and c) validation of these findings via comparison with published data and examination of the structural context of the candidate residues.

Using our structure-sequence based protocol for subfamily classification, we have obtained subfamilies that approximately correspond to those derived from the Interpro functional classification of homeodomains. This classification system will allow us to identify subcategories of proteins within the homeodomain family that share common properties. We have also identified residues in the recognition helix of the homeodomains that appear to distinguish between subfamilies, and are conserved within subfamilies. As this knowledge was not given as a priori knowledge to the subfamily classification protocol, we conclude that our structure-based sequence subfamily classification is able to identify functionally important residues, notably those involved in DNA binding specificity.

Chapter 2 – Experimental Protocol

2.1 Outline of the Protocol

The key steps in the protocol for protein domain subfamily classification include:
- selection of the protein domain family (homeodomains)
- retrieval and curation of protein structures from the Protein Data Bank [154]
- delineation of the protein domain region using Pfam homeobox HMM [104]
- alignment of the protein structures using MALECON [174]
- retrieval and curation of protein sequences from Interpro [105] and PartiGeneDB [175,176]
- production of structure-driven sequence alignment using ClustalX [177]
- classification into subfamilies using Bête [117]
- retrieval of subfamily-determining residues using SDPpred [162]
- verification of the robustness of the Bête neighbour-joined tree via leave-one-out cross-validation
- confirmation of subfamily integrity and robustness via sequence-to-profile similarity validation
- examination of subfamily-determining residues in the context of protein structure and function

Figure 13: Flowchart of experimental protocol, including subfamily determination, validation, identification of subfamily-specific residues, and analysis steps. [Flowchart boxes: Selection of Protein Domain Family; Protein Data Bank (Protein Structure Database); Alignment of Protein Structures; UniProt (Protein Sequence Database); Produce structure-driven MSA; Subfamily Classification (Bête); PartiGene DB; Enrichment of subfamilies by additional protein sequences; Objective determination of subfamily-specific residues; Objective validation of subfamily classification; Assessing reliability of neighbour-joined subfamily tree (leave-k-out cross-validation); Analysis: Examine subfamily-specific residues in context of protein structures.]

2.2 Curation of Protein Structures

Homeodomain protein structures were retrieved from the Protein Data Bank (PDB) [154]. Duplicate copies of proteins from the same species were removed in order to create a nonredundant data set (i.e. if there were three copies of the same HOX gene from mouse and two from human, two of the mouse copies would be removed, and one of the human copies as well). This was done with the exception of MATα2, where two copies of the same protein were maintained, but in two different regulatory conformations (MATα2 and MCM1 heterodimer bound to DNA, and MATα2 and MATa1 heterodimer bound to DNA). These two complexes have different DNA site preferences [76].

Initial domain boundaries for the homeodomains were assigned according to Hidden Markov Model boundaries from Pfam [104]. While Pfam produces a conservative estimate of where the true domain boundaries lie, screening using Pfam also ensures that only homologous residues from the homeodomain regions of the proteins are taken into consideration. Protein structures that did not contain a homeodomain region according to Pfam were also removed from the data set. This left us with 27 homeodomain structures out of an initial 60 (Table 2).

PDB ID | Protein description | Organism
------ | ------------------- | --------
1b8iA | Ultrabithorax homeotic protein IV | Drosophila melanogaster
1b8iB | Homeobox protein extradenticle | Drosophila melanogaster
1fjlA | Paired protein | Drosophila melanogaster
1ftz | Fushi tarazu protein | Drosophila melanogaster
1hom | Antennapedia protein | Drosophila melanogaster
1jggA | Segmentation protein even-skipped | Drosophila melanogaster
1nk2P | Homeobox protein VND | Drosophila melanogaster
1wjhA | Homeobox leucine zipper protein Homez | Drosophila melanogaster
3hddA | Engrailed homeodomain | Drosophila melanogaster
1b72A | Homeobox protein HOX-B1 | Homo sapiens
1b72B | PBX1 | Homo sapiens
1cqtB | POU domain, class 2, transcription factor 1 (also known as OCT-1) | Homo sapiens
1hdp | OCT-2 POU homeodomain | Homo sapiens
1wi3A | DNA-binding protein SATB2 - CUT containing protein | Homo sapiens
1ig7A | Homeotic protein MSX-1 | Mus musculus
1lfuP | Homeobox protein PBX1 | Mus musculus
1ocp | OCT-3 | Mus musculus
1pufA | Homeobox protein HOX-A9 | Mus musculus
1s7eA | Hepatocyte nuclear factor 6 (HNF6) | Mus musculus
1uhsA | Homeodomain Only Protein (HOP) | Mus musculus
1au7A | PIT-1 | Rattus norvegicus
1bw5 | Insulin gene enhancer (ISL-1) | Rattus norvegicus
1ftt | Thyroid transcription factor 1 homeodomain | Rattus norvegicus
2lfb | LFB1/HNF1 transcription factor | Rattus rattus
1akhA | Mating-type protein A-1 | Saccharomyces cerevisiae
1akhB | Mating-type protein α2 | Saccharomyces cerevisiae
1mnmC | MATα2 transcriptional repressor | Saccharomyces cerevisiae

Table 2: Homeodomain structures from the Protein Data Bank which were used in the experimental procedures.

2.3 Curation of Protein Sequences

175 homeodomain sequences which matched the Pfam homeobox HMM domain signature were retrieved from Interpro (Interpro accession IPR001356) [105], along with their annotations. Following structure-based sequence alignment of the 27 homeodomain structures and the 175 homeodomain sequences, this alignment was used as the initial profile for performing PSI-BLAST [178] against the PartiGene expressed sequence tag (EST) database (PartiGeneDB) [175,176]. One round of PSI-BLAST was performed with this profile using the default E-value cut-off of 10.0, retrieving 254 homeodomain sequences from PartiGeneDB. The N-terminus and C-terminus of the homeodomain portion of the sequences were determined to be the start and end points of the PSI-BLAST sequence hit. Annotations for the kingdom, phylum, and species of the retrieved ESTs from PartiGeneDB were kindly provided by Dr. John Parkinson (personal communication). A total of 27 homeodomain structures and 429 homeodomain sequences were used in this analysis.

Subsequently, 38 “unclassifiable” homeodomain sequences (singleton and doubleton homeodomain subfamilies, as well as those sequences that caused major evolutionary tree branchpoints to lack robustness) were removed from the data set, resulting in a total of 418 homeodomains. Sequences which caused major evolutionary tree branchpoints to lack robustness were identified as single sequences that were classified by Bête as belonging to a particular subfamily, where the node connecting the subfamily (including the sequence) to the cross-validated consensus Bête tree had cross-validation node support of less than 70% (typical values were closer to 50%; see Appendix), but excluding the sequence would increase cross-validation node support for the subfamily node on the consensus Bête tree to over 90%. This was done in order to work around the limitations of the SDPpred program which was used to identify specificity-determining residues (data sets which included singleton and doubleton subfamilies were rejected as input by SDPpred), as well as to reduce the amount of noise in the Bête-derived subfamily classification; however, mis-classification of sequences to subfamilies by Bête may also be masked by the removal of these sequences. The entire protocol, including all alignment and classification steps, as well as identification of subfamily-determining residues, was carried out again on the revised data set with “unclassifiable” homeodomain sequences removed; the results based on this data set are presented in this thesis work.
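The two removal criteria described above can be written as simple predicates. The following Python sketch is only an illustration of the stated thresholds (node support below 70% with the sequence, above 90% without it, and subfamilies of fewer than three members). The function names and the example data are hypothetical, and node support values are assumed to be given as percentages.

```python
from collections import Counter

def destabilizes_node(support_with, support_without, low=70.0, high=90.0):
    """True if including the sequence drops node support below `low` percent
    while excluding it raises node support above `high` percent."""
    return support_with < low and support_without > high

def undersized_subfamilies(assignments, min_size=3):
    """Return the subfamilies with fewer than `min_size` members
    (singletons and doubletons for the default of 3)."""
    sizes = Counter(assignments.values())
    return {name for name, size in sizes.items() if size < min_size}

# Hypothetical subfamily assignments: sequence identifier -> subfamily label.
assignments = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "C", "s6": "C"}
too_small = undersized_subfamilies(assignments)
flagged = destabilizes_node(support_with=50.0, support_without=95.0)
```

Sequences in an undersized subfamily, or flagged by the node-support test, would be the candidates for removal before the protocol is repeated.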

2.4 Structure-Based Sequence Alignment

Ideally, differences in protein structures would guide classification of protein families into subfamilies with conserved functional specialization. However, the lack of available protein structures prevents representation of all subfamilies by known structures. By using structural information to guide existing sequence information in subfamily classification, we hope to achieve better correspondence between protein functional specialization and the resulting protein subfamily classification.

2.4.1 MALECON Multiple Structure Alignment

Twenty-seven homeodomain structures were aligned using MALECON [174] to produce a structural alignment; this structural alignment serves as the “profile” for ClustalX sequence-to-profile alignment in the next step.

MALECON is a progressive combinatorial multiple structure alignment procedure. All possible optimal pairwise superpositions between structures are constructed, followed by identification of all multiple alignments containing at least three structures. Spatially equivalent residues between structures are identified, and these form the basis of the overall structure alignment.

The advantage of MALECON over pairwise structure superposition programs such as DaliLite [179] lies in MALECON’s ability to create an overall structure alignment, rather than superposing everything against a specified reference structure. Such a reference structure would have to be carefully selected according to various criteria in order to avoid bias, for example, minimum RMSD on average from all other structures.

2.4.2 ClustalX Sequence-to-Profile Alignment

The multiple structure alignment from MALECON and 429 homeodomain sequences from Interpro and PartiGeneDB were input to ClustalX. ClustalX takes each individual sequence and aligns it against a profile in a “sequence-to-profile” alignment, maximizing the alignment score between the profile and the individual sequence using a specified scoring matrix. The default alignment parameters were used: gap opening penalty of 10.00, gap extension penalty of 0.10, and the Gonnet 250 [180] scoring matrix (based upon the PAM series of scoring matrices). The resulting alignment was converted to FASTA format [181] using Readseq [182] for input to Bête subfamily classification.

2.5 Identifying the Subfamilies and Subfamily-Determining Residues

2.5.1 Bête Subfamily Classification

The statistical basis of our subfamily determination protocol lies in the determination of important residues and the separation into functional subfamilies performed by the program Bête [117]. To understand how these procedures are performed, a brief introduction to the statistical models and concepts used is required.

The input to Bête is a multiple sequence alignment of the domain family under study. Bête splits each sequence of the multiple sequence alignment into individual nodes with a profile containing one sequence, and at each step, merges the two nodes with the shortest “distance” from each other into a single node with a profile containing the profiles of the original two nodes. The distance measure used is the Total Relative Entropy (TRE), or symmetrized relative entropy, between the two profiles of the original two nodes i and j:

Equation 2

\[ \mathrm{TRE} = \sum_{c} \left( D(i_c \,\|\, j_c) + D(j_c \,\|\, i_c) \right) \]

where $i_c$ and $j_c$ are the probability distributions for profiles i and j at amino acid position c in the multiple sequence alignment, $D(i_c \| j_c)$ is the relative entropy between $i_c$ and $j_c$ (see Equation 3), and $D(j_c \| i_c)$ is the relative entropy between $j_c$ and $i_c$ (see Equation 3).

The relative entropy D between two distributions p and q is defined as:

Equation 3

\[ D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \]

where x is the variable being summed over (as applied to Equation 2, this would be the amino acid types at position c in the multiple sequence alignment), and p(x) and q(x) are two probability distributions over x (these would correspond to the probability distributions $i_c$ and $j_c$ in Equation 2).
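Equations 2 and 3 can be evaluated with a few lines of Python. This is a minimal sketch rather than the Bête implementation. The function names and example distributions are hypothetical, each profile is assumed to be a list of per-column dictionaries mapping amino acids to probabilities, and logarithms are assumed to be base 2. The second distribution must be non-zero wherever the first is non-zero, which is one reason smoothed profiles are needed in practice.

```python
from math import log2

def relative_entropy(p, q):
    """Equation 3: D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    Terms with p(x) = 0 contribute nothing; q(x) must be > 0 where p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

def total_relative_entropy(profile_i, profile_j):
    """Equation 2: symmetrized relative entropy summed over alignment columns."""
    return sum(relative_entropy(ic, jc) + relative_entropy(jc, ic)
               for ic, jc in zip(profile_i, profile_j))

# Hypothetical single-column profiles over two residue types.
profile_a = [{"K": 0.5, "R": 0.5}]
profile_b = [{"K": 0.9, "R": 0.1}]

forward = relative_entropy(profile_a[0], profile_b[0])
backward = relative_entropy(profile_b[0], profile_a[0])
tre_ab = total_relative_entropy(profile_a, profile_b)
tre_ba = total_relative_entropy(profile_b, profile_a)
```

In this example `forward` and `backward` differ, reflecting the asymmetry of relative entropy, while `tre_ab` and `tre_ba` agree because Equation 2 sums both directions.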

As Bête merges each pair of nodes, it builds a hierarchical agglomerative tree using nearest-neighbour heuristics. In order to identify the subfamilies, Bête partitions the tree by first evaluating all potential “cuts” of the tree that do not conflict with the node assignments identified during agglomeration. Optimally, the obtained subfamilies should maximize the number of columns in their profiles which distinguish each subfamily from all other subfamilies; however, taken to the extreme case, this would result in each profile containing exactly one sequence. Bête balances the requirement for subfamily integrity with the need to minimize the number of subfamilies by computing an “encoding cost” for each potential partitioning into subfamilies:

Equation 4

\[ \mathrm{EncodingCost} = N \log_2 S - \sum_{c} \log_2 \mathrm{Prob}(n_{c,1}, \ldots, n_{c,S} \mid \Theta) \]

where N is the number of sequences in the multiple sequence alignment, S is the number of subfamilies, $n_{c,i}$ is a vector representing the observed amino acids in subfamily i at column c in the multiple sequence alignment, and Θ represents the components of the Dirichlet mixture prior which was used as a prior probability model of the input multiple sequence alignment.

The Dirichlet mixture prior that Bête employs is a collection of prior probability distributions of amino acids, or “components”, drawn from protein alignments from the BLOCKS database. The components specify the probability of each amino acid in a given alignment from the BLOCKS database. Those components that best describe the input alignment are used to model the input alignment, and this model is taken as the “true” alignment by Bête, with the input alignment taken to be a sample of the “true” alignment [91,111,117]. This is done in order to correct for potential problems with small input datasets, as the model assigns a probability of occurrence for all 20 amino acids, including rarely occurring ones.

When a data set is small, the likelihood that a rare (but not nonexistent) event will not be present in the dataset is high. For example, there are twenty amino acids, each of which has a certain probability of being in a given column in a protein multiple sequence alignment (MSA). However, if there are only 5 sequences in the MSA, at most only 5 amino acids will be represented in a given column of the MSA (see Figure 14). In these cases, if it is not possible to add more data, “pseudocounts” may be added to the probability table to make the probability of the occurrence of the other amino acids greater than 0 when modeling the data.

... YTRYQLLELEKEF ...
... FTAEQLQRLKAEF ...
... FTSYQLEELERAF ...
... FKERTRSLLREWY ...
... IEVNVRGALESHF ...

Figure 14: Part of a hypothetical multiple sequence alignment in which pseudocounts would be used if more data could not be acquired. In the selected column of the multiple sequence alignment, only the amino acids K, A, R, E, and S are represented with a probability of 1/5, or 0.2, and all other amino acids are not present and therefore have a probability of 0. If these data were to be modeled in order to find other similar sequences, the amino acids that were not seen in the alignment would have to be given a small, non-zero probability in order to discover similar sequences that do not have K, A, R, E, or S at that position.
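The effect of pseudocounts on the column described in Figure 14 can be sketched in a few lines of Python. The uniform pseudocount of 0.1 per amino acid is an assumed, illustrative value; Bête derives its pseudocounts from a Dirichlet mixture prior rather than from a flat constant.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_probabilities(column, pseudocount=0.0):
    """Probability of each amino acid in one alignment column, optionally smoothed."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

column = "KARES"  # residues in the selected column of Figure 14

raw = column_probabilities(column)
smoothed = column_probabilities(column, pseudocount=0.1)  # 0.1 is an assumed value

print(raw["K"], raw["W"])            # observed residue vs. unobserved residue (zero)
print(smoothed["K"], smoothed["W"])  # unobserved residue now has a small non-zero probability
```

With the raw counts, any sequence carrying an unobserved residue at this position would be assigned a probability of zero; after smoothing it is merely penalized.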

For each column in the multiple sequence alignment (MSA), the probability that each component has produced that column in the input MSA is calculated, and their log probabilities are added together to give a final log probability that the amino acids in the given column were produced by the prior probability model. In the ideal case, each component in the model would correspond to a subfamily. Referring to Equation 4, the first half of the equation, $N \log_2 S$, equals the number of sequences multiplied by the log of the number of subfamilies; the second half of the equation, $\sum_{c} \log_2 \mathrm{Prob}(n_{c,1}, \ldots, n_{c,S} \mid \Theta)$, refers to the sum of the log-probabilities of all components in the Dirichlet mixture prior, for all columns in the multiple sequence alignment. If the components correspond exactly to subfamilies, the encoding cost would be zero. The tree partitioning which results in the lowest encoding cost is selected as the one which partitions the tree into subfamilies.


2.5.2 SDPpred: Subfamily Determining Residue Identification

Since Bête does not report the amino acid positions on which subfamily partitioning was determined, and the source code of Bête was unavailable to us, we needed to recapture those subfamily-determining positions post hoc from the subfamily division generated by Bête. This was done using the program SDPpred [162].

The input to SDPpred is a multiple sequence alignment with “specificity groups” specified; these correspond to the subfamilies identified by Bête. SDPpred evaluates the degree of association between the distribution of amino acids in the columns in the multiple sequence alignment and the specified division into subfamilies. The association of the amino acid distribution and the subfamily grouping is obtained via the mutual information at position p, designated Ip:

Equation 5

$$I_p = \sum_{i=1}^{N} \sum_{\alpha=1}^{20} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$

where N refers to the number of subfamilies; α = 1 … 20 refers to the 20 standard proteinogenic amino acids; $f_p(\alpha, i)$ is the ratio of the number of occurrences of residue α in subfamily i at position p, to the number of sequences in the complete multiple sequence alignment; $f_p(\alpha)$ is the frequency of residue α in the whole alignment column; f(i) is the proportion of proteins belonging to subfamily i.
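As an illustration of Equation 5, the following Python sketch computes the mutual information of a single alignment column from raw (unsmoothed) frequencies. The function name, the toy columns, and the use of the natural logarithm are assumptions made for the example; this is not SDPpred's own code.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """I_p for one column, given the subfamily label of every sequence (same order)."""
    total = len(column)
    joint = Counter(zip(column, groups))  # counts of (residue, subfamily) pairs
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    info = 0.0
    for (residue, group), n in joint.items():
        f_joint = n / total                       # f_p(alpha, i)
        f_residue = residue_counts[residue] / total  # f_p(alpha)
        f_group = group_counts[group] / total        # f(i)
        info += f_joint * math.log(f_joint / (f_residue * f_group))
    return info

groups = "AAABBB"  # two hypothetical subfamilies of three sequences each
print(mutual_information("KKKEEE", groups))  # residue tracks subfamily: log(2)
print(mutual_information("LLLLLL", groups))  # fully conserved column: 0.0
```

A column whose residues follow the subfamily division exactly reaches the maximum value, while a column that is conserved across the whole family carries no information about subfamily membership.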

To address oversampling issues due to small sample size, SDPpred introduces pseudocounts into the data (see Section 2.5.1 for an explanation of when pseudocounts may be used), based on BLOSUM substitution matrices corresponding to the average evolutionary distance between the groups in the multiple sequence alignment. The added pseudocounts are proportional to the square root of the sample size:

Equation 6

$$\tilde{f}(\alpha, i) = \frac{n(\alpha, i) + \kappa \left( \sum_{\beta=1}^{20} n(\beta, i)\, m(\beta \to \alpha) \right) / \sqrt{n(i)}}{n(i) + \kappa \sqrt{n(i)}}$$

where n(α,i) is the number of occurrences of residue α in group i; n(i) is the size of group i; m(β → α) is the probability of amino acid substitution β → α according to the matrix chosen for group i; 0 ≤ κ ≤ 1 is a smoothing parameter. Kalinina et al. report that the obtained specificity determining positions are robust with respect to the exact choice of κ, but their results are reported with κ = 0.5 [162].
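The smoothing step can be sketched as follows, assuming the pseudocount weight scales with the square root of the group size as described in the text. The three-letter alphabet and its substitution probabilities are hypothetical toy values standing in for a BLOSUM-derived matrix over all 20 amino acids.

```python
import math

def smoothed_frequencies(counts, substitution, kappa=0.5):
    """Smoothed residue frequencies for one group at one position.

    counts: dict residue -> n(alpha, i)
    substitution: substitution[beta][alpha] = m(beta -> alpha), each row summing to 1
    """
    n_i = sum(counts.values())
    root = math.sqrt(n_i)
    alphabet = list(substitution)
    smoothed = {}
    for alpha in alphabet:
        pseudo = sum(counts.get(beta, 0) * substitution[beta][alpha] for beta in alphabet) / root
        smoothed[alpha] = (counts.get(alpha, 0) + kappa * pseudo) / (n_i + kappa * root)
    return smoothed

# Hypothetical substitution probabilities over a toy alphabet (not real BLOSUM values)
toy_matrix = {
    "K": {"K": 0.7, "R": 0.2, "E": 0.1},
    "R": {"K": 0.2, "R": 0.7, "E": 0.1},
    "E": {"K": 0.1, "R": 0.1, "E": 0.8},
}

freqs = smoothed_frequencies({"K": 4}, toy_matrix, kappa=0.5)
print(freqs)
```

In this toy case only lysine is observed, yet arginine receives more pseudocount mass than glutamate because the matrix treats K to R as the more likely substitution; the smoothed frequencies still sum to one.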

The set of columns in the multiple sequence alignment (MSA) which can best distinguish between the subfamilies are considered to be “specificity determining positions” or SDPs. Mutual information is the mutual dependence of two variables on each other; the set of columns (amino acid positions) in the MSA which maximizes the mutual information or co-dependence between 1) the amino acid composition in the columns of the MSA (variable 1), and 2) the partitioning of the MSA into subfamilies (variable 2), should be identified as SDPs. For each column of the MSA, the significance of the association between the above-identified variables is determined by normalizing the derived mutual information values against a background distribution that keeps the number of subfamilies and number of sequences in each family constant, while shuffling the amino acids in each column of the MSA so that they should no longer be correlated with the subfamily partitions. This is done by calculating the Z-score for each column of the multiple sequence alignment (Equation 7):

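Equation 7: Mutual information Z-score

$$Z_i = \frac{I_i - I_i^{\mathrm{exp}}}{\sigma(I_i^{\mathrm{exp}})}$$

where the expected mutual information $I_i^{\mathrm{exp}}$ is calculated from 10,000 shuffles of the original column i in the MSA, $\sigma(I_i^{\mathrm{exp}})$ is the standard deviation of the mutual information over those shuffles, and $I_i$ is the observed mutual information in the i-th column.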
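The shuffling procedure behind the Z-score can be sketched in Python as below. The column is permuted while the subfamily labels stay fixed, so subfamily number and sizes are preserved. The function names, the default of 1,000 shuffles (chosen to keep the example fast), and the convention of returning 0.0 when the shuffled values have no variance are all assumptions for illustration.

```python
import math
import random
import statistics
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residues in a column and subfamily labels."""
    total = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    return sum(
        (n / total) * math.log((n / total) / ((residue_counts[r] / total) * (group_counts[g] / total)))
        for (r, g), n in joint.items()
    )

def mi_z_score(column, groups, shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled versions of the column."""
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    residues = list(column)
    shuffled_values = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # breaks any residue/subfamily association
        shuffled_values.append(mutual_information(residues, groups))
    mean = statistics.fmean(shuffled_values)
    sd = statistics.pstdev(shuffled_values)
    return (observed - mean) / sd if sd > 0 else 0.0

groups = "AAAABBBB"
print(mi_z_score("KKKKEEEE", groups))  # strongly positive: residue tracks subfamily
print(mi_z_score("LLLLLLLL", groups))  # 0.0: a conserved column is uninformative
```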


Kalinina et al. used a metric known as the Bernoulli estimator or B-cutoff (Equation 8) to determine the number of significant positions in the multiple sequence alignment [162]; significant positions refer to positions in the multiple sequence alignment with amino acid distributions that are least likely to arise by chance. As a null hypothesis, the previously obtained Z-scores are assumed to come from a Gaussian distribution; the obtained Z-scores are ranked in descending order. The number of Z-scores (k) is chosen in order to minimize the probability that there are at least k observed Z-scores which are greater than or equal to the kth Z-score. Basically, the number of significant positions k is chosen such that fewer than k positions are expected to be significant (in fact, the minimum number of positions expected to be significant), and thus maximize the probability that the selected k amino acid positions in the MSA are significant. The threshold for determining the number of specificity determining positions k* is calculated as follows:

Equation 8

$$k^{*} = \arg\min_{k} P(\text{there are at least } k \text{ observed Z-scores } Z \ge Z_k) = \arg\min_{k} \left( 1 - \sum_{i=n-k+1}^{n} \binom{n}{i} q^{i} p^{\,n-i} \right)$$

where n is the total number of considered positions, $p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} \exp(-Z^2/2)\, dZ$, and q = 1 – p.
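A minimal Python sketch of the cutoff selection is given below, assuming a standard normal null distribution for the Z-scores. The probability of at least k exceedances is computed directly as a binomial upper tail, which is algebraically the same quantity as the "one minus a sum" form of Equation 8 but avoids subtracting two nearly equal numbers. The function names and the example Z-scores are hypothetical.

```python
import math

def upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def prob_at_least_k(n, k, p):
    """P(at least k of n independent Z-scores exceed the threshold), each with probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

def b_cutoff(z_scores):
    """Return k* and the k* highest Z-scores."""
    ranked = sorted(z_scores, reverse=True)
    n = len(ranked)
    best_k, best_prob = 1, float("inf")
    for k in range(1, n + 1):
        prob = prob_at_least_k(n, k, upper_tail(ranked[k - 1]))
        if prob < best_prob:
            best_k, best_prob = k, prob
    return best_k, ranked[:best_k]

# Hypothetical Z-scores: three clear outliers among ten positions
z_scores = [9.0, 8.5, 8.0, 1.0, 0.5, 0.2, 0.0, -0.5, -1.0, -1.5]
k_star, selected = b_cutoff(z_scores)
print(k_star, selected)
```

In this toy input the probability keeps falling while the outlying scores are added and rises sharply once an ordinary score is included, so the three outliers are selected.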

The Z-scores of the obtained subfamily-determining positions correspond to the top tail of the Gaussian distribution of Z-scores. Thus, SDPpred identifies those positions which have maximal mutual information between the amino acid composition and the subfamily division; these correspond to the positions that Bête uses in its partitioning of the multiple alignment into subfamilies.

Sequence logos showing the degree of conservation of the residues in the Bête-assigned subfamily classification were generated using WebLogo [183], and subfamily-specific residue positions were mapped onto the sequence logos.


2.6 Validation of Bête Neighbour-Joined Tree

Following the classification of the homeodomains into subfamilies, the hierarchical agglomerative nearest-neighbour tree produced by Bête was compared with a neighbour-joined tree produced using the PHYLIP phylogenetic suite, bootstrapped 100 times [130]. The robustness of the tree produced by Bête was also evaluated using leave-one-out cross-validation. Each sequence in the structure-based sequence alignment was systematically excluded, and Bête run on the resulting set. The Newick-formatted [130] tree generated by Bête was converted to Nexus format [184] using a Perl script written in-house. The consensus tree or “supertree”, with weights based on clade support, was generated from the concatenation of the Newick tree-files using QualiTree.pl [185,186] and visualized using TreeView [187].

2.7 Validation of Subfamily Integrity

Subfamily integrity and robustness of the procedure were validated by taking each individual sequence and scoring it for similarity against each subfamily; each subfamily was treated as an individual profile. The sequence vs. profile scoring function [188] uses the BLOSUM62 matrix [91] to score similarity between amino acids:

Equation 9

$$s = \sum_{i} \sum_{y} f_{iy}\, S(x_i, y)$$

where $f_{iy}$ is the frequency of amino acid y at position i in the amino acid sequence of the query protein, $0 \le f_{iy} \le 1$; $x_i$ is the amino acid at position i in the target sequence (in the profile); $S(x_i, y)$ is the score given to the amino acid substitution by the BLOSUM62 matrix. The most similar subfamily profile was identified for each query homeodomain protein; the proportion of sequences that were identified as being most similar to their Bête-assigned subfamily was also calculated using an in-house Perl script.
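The scoring scheme of Equation 9 can be sketched in Python as follows. This is not the in-house Perl script; it is an illustrative reimplementation under several assumptions: the frequencies are taken from the subfamily profile and $x_i$ from the single sequence being scored, a toy identity-based score (+4 match, -1 mismatch) stands in for BLOSUM62, and the subfamily names and five-residue sequences are hypothetical.

```python
from collections import Counter

def build_profile(sequences):
    """Per-column residue frequencies for a set of aligned, equal-length sequences."""
    length = len(sequences[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        profile.append({res: n / len(sequences) for res, n in counts.items()})
    return profile

def toy_score(a, b):
    """Stand-in for a BLOSUM62 lookup (assumed values, not real matrix entries)."""
    return 4 if a == b else -1

def profile_score(sequence, profile, score=toy_score):
    """s = sum over positions i and residues y of f_iy * S(x_i, y)."""
    return sum(
        freq * score(x, y)
        for x, column in zip(sequence, profile)
        for y, freq in column.items()
    )

def best_subfamily(sequence, profiles):
    """Name of the subfamily profile that the sequence scores highest against."""
    return max(profiles, key=lambda name: profile_score(sequence, profiles[name]))

# Hypothetical subfamilies built from short aligned fragments
profiles = {
    "hox_like": build_profile(["YTRYQ", "YTRYQ", "FTRYQ"]),
    "engrailed_like": build_profile(["FTAEQ", "FTAEQ"]),
}
print(best_subfamily("YTRYQ", profiles))
```

Repeating `best_subfamily` for every sequence and counting how often the result matches the assigned subfamily gives the fraction reported for each subfamily.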

Sequence vs. subfamily profile scoring of the actual subfamilies was also compared against profile scoring of a shuffled data set using the original homeodomains, maintaining the subfamily size and number. Randomizing the sequence order but maintaining subfamily size and number allowed amino acid composition of each homeodomain sequence and each position in the multiple sequence alignment to be maintained, while creating randomized subfamilies to give the baseline likelihood for sequence to profile similarity.

2.8 Structural Analysis of Subfamily-Determining Residues

To identify residues that are part of the protein-DNA interface, accessible surface area calculations were performed on structures of homeodomain-DNA complexes from the PDB. More particularly, the GETAREA 1.1 program [189] with a water probe radius of 1.4 Å, was used to compute the change in accessible surface area (ASA) upon complex formation as follows:

Equation 10

$$\%\,\Delta\mathrm{ASA} = (\mathrm{ASA}_{\mathrm{DNA}} + \mathrm{ASA}_{\mathrm{protein}}) - \mathrm{ASA}_{\mathrm{protein+DNA}}$$

Residues with ΔASA > 0 are considered to be part of the interface. For those homeodomains for which a structure of the protein-DNA complex was not available in the PDB, DaliLite [179] was used to identify the most similar structure in the PDB which was crystallized in the presence of DNA. After rotation and translation of the structure without DNA, DNA from the PDB file of the most similar structure was added to the structure file.

Hydrogen bonds between protein chains and DNA were determined using MOLMOL [190], with a maximum distance of 2.4 Å between donor hydrogen and acceptor heavy atom (or 3.5 Å between the two heavy atoms), and a maximum angle of 90°. Van der Waals clashes were also determined using MOLMOL, with a cutoff of 0.5 Å.

Sidechain directionality of subfamily-determining residues was also examined. Structures were optimally superposed using DaliLite [179], using the median homeodomain structure 1fjlA (paired-box) as reference. Chi-1 angles for subfamily determining residues in each helix in the homeodomain structures were determined using WHAT IF; chi-1 angles (see Figure 15) refer to the dihedral angle defined by the N, Cα, Cβ, and Cγ atoms, that is, the angle between the Cγ atom on the amino acid sidechain and the N atom on the peptide backbone when viewed along the Cα-Cβ bond, and are used as a measure of the direction of the amino acid sidechain. Differing chi-1 angles for the same amino acid are also referred to as amino acid rotamers.

[Figure 15 panels: A) gauche+ (g+) conformation, chi-1 angle = -60 degrees; B) gauche- (g-) conformation, chi-1 angle = +60 degrees; C) trans conformation, chi-1 angle = 180 degrees.]

Figure 15: Chi-1 angle conformations of amino acid sidechains. The chi-1 angle refers to the angle between the Cγ or Oγ atom of the amino acid sidechain, and the N-atom of the peptide backbone. The three conformations of chi-1 angles are A) gauche+ or g+ (chi-1 angle approximately -60 degrees from the Cγ atom to the N atom); B) gauche- or g- (chi-1 angle approximately +60 degrees from the Cγ atom to the N atom); and C) trans (chi-1 angle approximately 180 degrees from the Cγ atom to the N atom).

Chi-1 angles for subfamily determining residues were calculated for the three subfamilies with the greatest number of structures (HOX, POU, and PBX), due to insufficient number of homeodomain structures for the remaining subfamilies; there were seven HOX structures, four POU structures, and three PBX structures. Chi-1 angles were binned as follows: +151° to +180° and -180° to -151° as trans, -90° to -31° as g+, +31° to +90° as g-, and -150° to -91°, -30° to +30°, and +91° to +150° as maxima. Ala and Pro were classified separately, as well as indeterminate chi-1 angles (labeled as 999.9). Sidechains were then visualized using MOLMOL.
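The binning scheme can be written as a small Python function. Rounding each angle to the nearest whole degree before binning is an assumption made here so that the integer bin boundaries leave no gaps; the function name, the three-letter residue codes, and the returned labels are likewise illustrative.

```python
def classify_chi1(angle, residue=None):
    """Assign a chi-1 angle (in degrees) to a conformation bin.

    Ala and Pro are reported separately, and 999.9 marks an indeterminate angle.
    """
    if residue in ("ALA", "PRO"):
        return "Ala/Pro"
    if angle is None or angle == 999.9:
        return "indeterminate"
    a = round(angle)  # assumption: angles are binned to the nearest degree
    if not -180 <= a <= 180:
        raise ValueError(f"chi-1 angle out of range: {angle}")
    if a >= 151 or a <= -151:
        return "trans"
    if -90 <= a <= -31:
        return "g+"
    if 31 <= a <= 90:
        return "g-"
    return "maxima"  # -150 to -91, -30 to +30, and +91 to +150

for value in (-60.0, 60.0, 180.0, -175.0, 0.0, 120.0, 999.9):
    print(value, classify_chi1(value))
```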

The most frequently occurring amino acids at all subfamily-determining residue positions over the HOX, POU, and PBX structures combined were identified, and correlated with the most preferred chi-1 angle conformations for those residues from the Dunbrack library [191]; deviations from the most preferred conformations were noted.

2.9 Subfamily DNA Binding Profile Analysis

DNA binding proteins may bind to a variety of DNA sequences, including operator sequences, promoter sequences, and transcription factor binding sites. DNA-binding transcription factors, such as homeodomains, may exhibit some variability in their transcription factor binding site sequences due to variations in their specific function. Members of a protein subfamily which have similar specific function likely bind similar DNA sequences, which can be assorted into a profile that more clearly shows important DNA bases.

DNA binding specificities for homeodomain subfamily members from the various HOX, CUT, POU, and LIM homeodomain subfamilies were curated from the literature and binding profiles were retrieved from TRANSFAC [1]; all DNA sequences were assorted into profiles according to subfamily. Sequence alignment was according to alignment specified in the literature or TRANSFAC profile, where available; the rest of the sequences were manually aligned according to key motifs. Profiles were converted into sequence logos using WebLogo [183] for visualization.


Chapter 3 – Results

3.1 Subfamily Classification

Twenty-seven homeodomain structures were selected from the Protein Data Bank (PDB) in order to perform a structural alignment to inform the subfamily classification. Duplicate structures for the same protein from the same organism were removed to avoid biasing the alignment toward identical structures, except when different DNA binding site preferences were illustrated. The highest resolution structures and structures containing DNA were selected whenever possible. The homeodomain region itself was identified via Pfam HMM. Most of the domain structures exhibited the characteristics of the classic 60 amino acid long homeodomain, with a helix-loop-helix-turn-helix structure. Exceptions include HNF1 (2lfb), which incorporates a 21 amino acid long insertion between helix 2 and helix 3, and the Three Amino-acid Loop Extension (TALE) type homeodomains, such as MATa (1akhA), MATα (1akhB, 1mnmC), and PBX (1b72B), which contain a 3 amino acid long insertion in the loop between helices 1 and 2.


PDB ID | Protein description | Organism
1b8iA | Ultrabithorax homeotic protein IV | Drosophila melanogaster
1b8iB | Homeobox protein extradenticle | Drosophila melanogaster
1fjlA | Paired protein | Drosophila melanogaster
1ftz | Fushi tarazu protein | Drosophila melanogaster
1hom | Antennapedia protein | Drosophila melanogaster
1jggA | Segmentation protein even-skipped | Drosophila melanogaster
1nk2P | Homeobox protein VND | Drosophila melanogaster
1wjhA | Homeobox leucine zipper protein Homez | Drosophila melanogaster
3hddA | Engrailed homeodomain | Drosophila melanogaster
1b72A | Homeobox protein HOX-B1 | Homo sapiens
1b72B | PBX1 | Homo sapiens
1cqtB | POU domain, class 2, transcription factor 1 (also known as OCT-1) | Homo sapiens
1hdp | OCT-2 POU homeodomain | Homo sapiens
1wi3A | DNA-binding protein SATB2 - CUT containing protein | Homo sapiens
1ig7A | Homeotic protein MSX-1 | Mus musculus
1lfuP | Homeobox protein PBX1 | Mus musculus
1ocp | OCT-3 | Mus musculus
1pufA | Homeobox protein HOX-A9 | Mus musculus
1s7eA | Hepatocyte nuclear factor 6 (HNF6) | Mus musculus
1uhsA | Homeodomain Only Protein (HOP) | Mus musculus
1au7A | PIT-1 | Rattus norvegicus
1bw5 | Insulin gene enhancer (ISL-1) | Rattus norvegicus
1ftt | Thyroid transcription factor 1 homeodomain | Rattus norvegicus
2lfb | LFB1/HNF1 transcription factor | Rattus rattus
1akhA | Mating-type protein A-1 | Saccharomyces cerevisiae
1akhB | Mating-type protein α2 | Saccharomyces cerevisiae
1mnmC | MATα2 transcriptional repressor | Saccharomyces cerevisiae

Table 3: All 27 homeodomain structures from the Protein Data Bank which were used in the creation of the homeodomain structure alignment.

These structures were structurally aligned using MALECON. MALECON is sensitive toward the presence of large indels, such as the one in HNF1 (2lfb), and will not align the proteins at all; the TALE-type homeodomains did not pose any problems, because the indel in this case is only 3 amino acids long and could be aligned easily. Consequently, the 21 amino acid long insertion was removed when performing the MALECON structural alignment, and re-inserted post-alignment in the same position, gapping all other structures. The structure-based sequence alignment was converted to FASTA format for input to ClustalX sequence-to-profile alignment.


1ftz   -KRTRQTYTRYQTLELEKEF---HFNRYITRRRRIDIANALSLSERQIKIWFQNRRMKSKK-----
3hddA  ----RTAFSSEQLARLKREF---NENRYLTERRRQQLSSELGLNEAQIKIWFQNKRAKIKK-----
1fjlA  -RRSRTTFSASQLDELERAF---ERTQYPDIYTREELAQRTNLTEARIQVWFQNRRARLRK-----
1pufA  -RKKRCPYTKHQTLELEKEF---LFNMYLTRDRRYEVARLLNLTERQVKIWFQNRRMKMKK-----
1bw5   TTRVRTVLNEKQLHTLRTCY---AANPRPDALMKEQLVEMTGLSPRVIRVWFQNKRCKDKK-----
1nk2P  -RKRRVLFTKAQTYELERRF---RQQRYLSAPEREHLASLIRLTPTQVKIWFQNHRYKTKR-----
1mnmC  ---RGHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNTSLSRIQIKNWVSNRRRKEKT-----
1wi3A  ----RTKISLEALGILQSFI--HDVGLYPDQEAIHTLSAQLDLPKHTIIKFFQNQRYHVK------
1jggA  --RYRTAFTRDQLGRLEKEF---YKENYVSRPRRCELAAQLNLPESTIKVWFQNRRMKDKR-----
1akhB  ---RGHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNTSLSRIQIKNWVSNRRRKEKT-----
1b8iA  ----RQTYTRYQTLELEKEF---HTNHYLTRRRRIEMAHALSLTERQIKIWFQNRRMKLKK-----
1au7A  --KRRTTISIAAKDALERHF---GEHSKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKR-----
1ig7A  -RKPRTPFTTAQLLALERKF---RQKQYLSIAERAEFSSSLSLTETQVKIWFQNRRAKAKR-----
1hom   -KRGRQTYTRYQTLELEKEF---HFNRYLTRRRRIEIAHALCLTERQIKIWFQNRRMKWKK-----
1b72A  ---LRTNFTTRQLTELEKEF---HFNKYLSRARRVEIAATLELNETQVKIWFQNRRMKQKK-----
1b72B  --RKRRNFNKQATEILNEYFYSHLSNPYPSEEAKEELAKKCGITVSQVSNWFGNKRIRYKK-----
1uhsA  ----AATMTEDQVEILEYNF--NKVNKHPDPTTLCLIAAEAGLTEEQTQKWFKQRLAEWRR-----
1akhA  ------ISPQARAFLEEVF---RRKQSLNSKEKEEVAKKCGITPLQVRVWFINKRMRS------
1wjhA  ------RKTKEQLAILKSFF---LQCQWARREDYQKLEQITGLPRPEIIQWFGDTRYALKHGQLKW
1b8iB  ----RRNFSKQASEILNEYFYSHLSNPYPSEEAKEELARKCGITVSQVSNWFGNKRIRYKK-----
1ftt   -RKRRVLFSQAQVYELERRF---KQQKYLSAPEREHLASMIHLTPTQVKIWFQNHRYKMKR-----
1ocp   -KRKRTSIENRVRWSLETMF---LKCPKPSLQQITHIANQLGLEKDVVRVWFCNRRQKGKR-----
1lfuP  -RRKRRNFNKQATEILNEYF---YSHPYPSEEAKEELAKKSGITVSQVSNWFGNKRIRYKK-----
1s7eA  -KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQLGLELSTVSNFFMNAR------
1hdp   -RKKRTSIETNVRFALEKSF---LANQKPTSEEILLIAEQLHMEKEVIRVWFCNRRQKEKR-----
2lfb   -RRNRFKWGPASQQILFQAY---ERQKNPSKEERETLVEECNVTEVRVYNWFANRRKEE------
1cqtB  -RKKRTSIETNIRVALEKSF----LENQKTSEEITMIADQLNMEKEVIRVWFCNRRQKEKR-----

Figure 16: MALECON homeodomain structural alignment (with HNF1 (PDB ID 2lfb) insertion between helix 2 and helix 3 excised). Helices 1, 2, and 3 have been highlighted in cyan, and the “Three Amino-acid Loop Extension” region between the first and second helices has been highlighted in yellow.

A total of 429 homeodomain sequences, including 175 sequences from Interpro and 254 sequences from PartiGeneDB, were aligned, one at a time, to the homeodomain structure-based multiple sequence alignment using ClustalX. The output alignment from ClustalX was then converted to FASTA format for subfamily classification using Bête. The input homeodomain alignment for Bête subfamily classification can be found in the Appendix.


Bête classifies the homeodomains into 41 subfamilies (Figure 17), 12 of which correspond to subfamilies or parts of subfamilies within the pre-existing curated Interpro subfamily classification (Figure 19); 25 of the identified subfamilies contain at least three sequences. However, it should be noted that the Interpro classification (Figure 18) does not include any atypical homeodomain classes or any plant homeodomain subfamilies. Many plant homeodomain sequences were included from the PartiGeneDB sequences, and some of the homeodomain structures included atypical homeodomains such as TALE-type homeodomains. Of the subfamilies containing typical 60 amino acid long homeodomain sequences (Figure 19), the majority of them correspond with the Interpro subfamily classification (Figure 18); however, two of the Interpro subfamilies (CUT and LIM) have been subdivided into finer levels of distinction, with three subfamilies each in the Bête-derived structure-based classification. The identified CUT homeodomain subfamilies appear to correspond with the number of CUT subunits associated with the homeodomain [192], while the identified LIM homeobox subfamilies appear to correlate with known paralogous groups [193,194]; Lhx6, Lhx7, and Lhx8 are known to form a mouse-specific duplicated subgroup of LIM-hd proteins [194], and Lhx1 and Lhx5 are both expressed in the ventral thalamus in the embryonic mouse brain [194]. This indicates that the classification method appears to select subfamilies that correspond to clusters of proteins with similar biological characteristics.


Full Bête Homeodomain Subfamily Classification

Figure 17: Entire Bête homeodomain subfamily classification derived from the structure-based sequence alignment. This classification includes a number of plant subfamilies, as well as “atypical” homeodomains which have insertions relative to the typical 60-amino acid long homeodomain, but which fold into a homeodomain structure. The number of sequences in each subfamily is given in brackets after the subfamily name, unless the subfamily is a singleton subfamily. Terminal nodes are shown in blue, non-terminal nodes are shown in black, and the root node is shown in red.


The illustration of the Bête homeodomain subfamily classification below includes only those subfamilies composed of 60 amino acid long typical homeodomain sequences, in order to allow for a more like-to-like comparison between the subfamilies derived using Bête and the Interpro homeodomain subfamily classification. In both of the following classifications, the following homeodomain subfamilies are present: Paired-box (or paired-like) homeobox, SIX/SINE homeobox, Engrailed (‘Homeobox’ engrailed-type protein), and HOX (Homeobox protein, antennapedia type). The Cut homeobox subfamily in the Interpro subfamily classification has been subdivided into the Onecut, SATB, and CDP/Cux subfamilies in the Bête-derived classification; the POU homeobox subfamily in the Interpro subfamily classification has been subdivided into the PIT-1 and POU/OCT subfamilies in the Bête-derived classification; and the LIM homeobox subfamily in the Interpro subfamily classification has been subdivided into the Lhx1/Lhx3/Lhx5 subfamily, the Lhx2/Lhx6/Lhx8 subfamily, and the Insulin Gene Enhancer (ISL) subfamily.

Interpro Homeodomain Subfamily Classification

Figure 18: Manually curated homeodomain subfamily classification from Interpro. This classification only includes “typical” 60 amino-acid long homeodomains from non-plant eukaryotes. The LIM homeobox, POU homeobox, and CUT homeobox subfamilies are each subdivided by the Bête subfamily classification into multiple different subfamilies.


A) Bête Homeodomain Subfamily Classification with typical 60- amino acid long homeodomains

B) Bête Homeodomain Subfamily Classification with Interpro subfamily subdivisions highlighted

Figure 19: A) Bête homeodomain subfamily classification, including only typical 60 amino acid long homeodomains for comparison with the manually curated Interpro homeodomain subfamilies. Classification is otherwise as in Figure 17. B) Same classification as in Figure 19A, but with Interpro subfamilies that were subdivided into multiple subfamilies highlighted. Sub-classifications of the Interpro LIM homeobox subfamily are enclosed in red boxes; sub-classifications of the Interpro POU homeobox subfamily are enclosed in blue boxes; sub-classifications of the Interpro CUT homeobox subfamily are enclosed in green boxes.

3.2 Subfamily Determining Residues

Following classification into subfamilies, subfamily-determining residues were identified using SDPpred (Figure 20). Nine subfamily-determining residues were identified, of which two residues were identified in helix 1, four residues were identified in helix 2, and three residues were identified in helix 3 (recognition helix). The consensus pattern formed by these subfamily-determining residues appears to uniquely identify each subfamily, being conserved within a given subfamily and differing between subfamilies, and thus these positions may be important for specific function. Figure 20 illustrates the subfamily determining residues and residues conserved throughout the homeodomain family, mapped onto the “majority-rules” (most common residue at each position) sequences for major homeodomain subfamilies. For those more familiar with the residue numbering as given in the homeodomain literature [66], a mapping from the alignment residue number to the homeodomain literature residue number for subfamily determining residues and conserved residues is given by Table 4. All subsequent results will refer to the positions given by the alignment in Figure 20.

Position: 1 11 21 31 41 51 61 71 81

Subfamilies shown: Plants, ISL, Lhx2/6/8, Lhx1/3/5, SATB, Onecut, CDP/Cux, PIT-1, POU, SIX/SINE, Paired-box, Engrailed, HOX/Antp

------RVRTVLNFKQLHTLRTCY---NANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK-----
------LTVEQVQFLEKSF---EVENKLEPERKTQLAKEL----GLQP------RQVAIWFQNRRARWK------
------RTKISVEALGILQSFIQ--DVGLYPDEEAIQTLSAQL----DLPK------YTIIKFFQNQRYYLK------
------KRMRTSFKHHQLRTMKSYF---AINHNPDAKDLKQLAQKT----GLPK------RVLQVWFQNARAKFRR-----
------RGPRTTIKAKQLETLKAAF---AATPKPARHVREQLAAET----GLDM------RVIQVWFQNRRSKERR-----
------KKKRTSIAAAAKDALERHF---AEQPKPSSEEIARIAEEL----DLEK------EVVRVWFCNRRQREKR-----
------KKPRLVFTDVQRRTLHAIF---KFNKRPSKEMQITISQQL----GLEL------STVSNFFMNARRRSR------
------KKPRVVLAPEEKEALKRAY---QLKPYPSPKTIEELATQL----NLKT------STVINWFHNYRSRIRR
------RKKRTSIEVNVRGALESHF---LKCPKPSAQEITHIADQL----QLEK------EVVRVWFCNRRQKEKR-----
------FKERTRSLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------
------RRNRTTFTSYQLEELERAF---EKTHYPDVYAREELAMKI----NLPE------ARVQVWFQNRRAKWRK-----
------KRPRTAFTAEQLQRLKAEF---QENRYLTEQRRQSLAQEL----GLNE------SQIKIWFQNKRAKIKK-----
------KRARTAYTRYQLLELEKEF---HFNRYLTRRRRIEIAHAL----NLTE------RQVKIWFQNRRMKWKK-----

Figure 20: Alignment of major homeodomain subfamily “majority-rules” sequences (most frequently occurring residue in the subfamily at each position); the largest plant subfamily is the one labelled “Plants” in the alignment. Helix regions are highlighted in yellow, subfamily-determining residue positions identified by SDPpred are highlighted in cyan (positions 14, 19, 34, 35, 36, 40, 70, 72, and 73 in this alignment), and positions conserved at 90% sequence identity or greater throughout the homeodomain family are highlighted in red (positions 22, 74, 75, 77, and 79). The residue at each position in the alignment is the most conserved at that position for the identified subfamily. The full alignment of homeodomain subfamilies is included in the Appendix.

Figure 20: SDPpred alignment of major homeodomain subfamilies, including the largest plant homeodomain subfamily


Subfamily determining / Conserved residue | Helix | Alignment residue position | Homeodomain literature residue position (HOX amino acid)
Subfamily determining | 1 | 14 | Tyr8
Subfamily determining | 1 | 19 | Thr13
Conserved | 1 | 22 | Leu16
Subfamily determining | 2 | 34 | Tyr25
Subfamily determining | 2 | 35 | Leu26
Subfamily determining | 2 | 36 | Thr27
Subfamily determining | 2 | 40 | Arg31
Subfamily determining | 3 | 70 | Gln44
Subfamily determining | 3 | 72 | Lys46
Subfamily determining | 3 | 73 | Ile47
Conserved | 3 | 74 | Trp48
Conserved | 3 | 75 | Phe49
Conserved | 3 | 77 | Asn51
Conserved | 3 | 79 | Arg53

Table 4: Mapping of subfamily-determining residue positions and conserved residue positions from the structure-based sequence alignment to the conventional homeodomain literature numbering. Subfamily-determining residues are coloured blue, while conserved residues are coloured red. The most closely associated helix for each residue is also given by the second column; the amino acid associated with a HOX (Antennapedia) homeodomain at each residue position is given alongside the conventional homeodomain literature numbering, as this subfamily is frequently referred to as a reference in the literature.


3.3 Validation of Results

Various methods were used to validate the Bête structure-based classification: quality assessment of the subfamilies was determined via sequence-to-subfamily profile similarity, the robustness of the subfamilies was assessed via leave-one-out cross-validation, and the accuracy of the nearest-neighbour tree produced by Bête was validated using the “neighbor” program in the PHYLIP phylogenetic package. Homeodomain sequences which appeared to be unclassifiable according to either validation scheme were identified.

3.3.1 Quality of Obtained Subfamilies

The quality of the obtained subfamilies was validated using a sequence to profile similarity scoring method, as illustrated in Figure 21. Ideally, if a subfamily is a group of sequences that are more similar to each other than to any other sequences, then each homeodomain sequence should score highest against the profile of its own subfamily. As a control, the homeodomain subfamily data set was shuffled to maintain the size and distribution of subfamilies while randomizing the sequences within each profile. Non-classifiable sequences such as singleton subfamilies were removed from the data set. Similarity between sequence and subfamily was higher for the actual subfamilies than for the randomized subfamilies, as expected. When examining the integrity of the obtained subfamilies after removing non-classifiable sequences, all but one subfamily exhibited sequence-to-subfamily profile similarity of 100%.


[Figure 21 plot: x-axis, fraction of sequences matched to subfamily (0 to 1); y-axis, number of subfamilies (0 to 30); series, Actual Subfamilies and Shuffled Sequences.]

Figure 21: Sequence to profile similarity for actual subfamilies vs. profiles of shuffled sequences with the same size distribution as the subfamilies, after removing singleton and doubleton subfamilies and unclassifiable sequences. The fraction of sequences matched with the original subfamily profile is significantly higher for the actual subfamilies vs. the shuffled subfamilies.
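The validation idea in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the scoring function used in the thesis is not specified here, so a simple per-column log-frequency profile with pseudocounts is assumed, and the names `build_profile`, `score`, `fraction_matched` and `shuffle_membership`, as well as the toy sequences, are hypothetical. Note also that each sequence is scored against a profile that includes itself.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus gap (assumed alphabet)

def build_profile(seqs, pseudo=1.0):
    """Per-column residue frequencies with pseudocounts (assumed scoring scheme)."""
    total = len(seqs) + pseudo * len(ALPHABET)
    profile = []
    for c in range(len(seqs[0])):
        counts = Counter(s[c] for s in seqs)
        profile.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return profile

def score(seq, profile):
    """Log-likelihood of an aligned sequence under a profile."""
    return sum(math.log(col[a]) for a, col in zip(seq, profile))

def fraction_matched(subfamilies):
    """For each subfamily, the fraction of members scoring highest on their own profile."""
    profiles = {name: build_profile(seqs) for name, seqs in subfamilies.items()}
    result = {}
    for name, seqs in subfamilies.items():
        hits = sum(
            1 for s in seqs
            if max(profiles, key=lambda n: score(s, profiles[n])) == name
        )
        result[name] = hits / len(seqs)
    return result

def shuffle_membership(subfamilies, rng):
    """Control: reassign sequences at random while keeping subfamily sizes."""
    pool = [s for seqs in subfamilies.values() for s in seqs]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for name, seqs in subfamilies.items():
        shuffled[name] = pool[start:start + len(seqs)]
        start += len(seqs)
    return shuffled

# Toy aligned sequences, invented purely for illustration.
toy = {
    "A": ["ACDE", "ACDE", "ACDF"],
    "B": ["WYKR", "WYKR", "WYKH"],
}
actual = fraction_matched(toy)
control = fraction_matched(shuffle_membership(toy, random.Random(0)))
```

Comparing `actual` with `control` over many shuffles would give the kind of contrast summarised in Figure 21, under the stated assumptions about the scoring scheme.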

3.3.2 Comparison of Bête Neighbour-Joined Tree with PHYLIP

To validate the Bête neighbour-joined tree, the Bête tree (Figure 22) was compared with a bootstrapped, neighbour-joined tree produced using the PHYLIP phylogenetic package (Figure 23). The Bête neighbour-joined tree was also validated for robustness using leave-one-out cross-validation. The PHYLIP tree (Figure 23) had very poor node support after bootstrapping, while the Bête tree (Figure 22) was generally quite well supported after cross-validation. It should be noted that bootstrapping and cross-validation are very different measures of robustness: bootstrapping draws its pseudorandom samples from a random selection of columns in the multiple sequence alignment, while leave-one-out cross-validation systematically excludes each sequence in turn. Hence, if the relationship among the input protein sequences is reflected by differences in subfamily-determining residue positions, failure to sample all subfamily-determining positions in the pseudoreplicate datasets when bootstrapping will lead to poor bootstrap values and difficulty in recapturing the original tree. As well, building robust phylogenies using short sequences (<60 residues) is often problematic. Thus, the difference in node support between the PHYLIP and Bête trees primarily reflects the difference in the method used to validate robustness; however, this difference also supports the hypothesis that subfamily-determining residues identify important differences in members of the homeodomain family.
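The contrast between the two resampling schemes can be made concrete with a short Python sketch. It is illustrative only: the toy alignment, the choice of `key_positions`, and the function names are hypothetical, and neither PHYLIP nor Bête is invoked.

```python
import random

def bootstrap_columns(alignment, rng):
    """One bootstrap pseudoreplicate: sample alignment columns with replacement."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    replicate = ["".join(seq[c] for c in cols) for seq in alignment]
    return replicate, cols

def leave_one_out(alignment):
    """Yield one replicate per sequence, each omitting that sequence but keeping all columns."""
    for i in range(len(alignment)):
        yield alignment[:i] + alignment[i + 1:]

def prob_column_missed(n_cols):
    """Chance that a given column is absent from one bootstrap pseudoreplicate."""
    return (1.0 - 1.0 / n_cols) ** n_cols

# Toy alignment, invented for illustration; columns 0 and 4 separate two groups.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "TCDEYGHIKL", "TCDEYGHIRL"]
key_positions = {0, 4}  # hypothetical subfamily-determining columns

rng = random.Random(0)
replicate, sampled_cols = bootstrap_columns(toy_alignment, rng)
missed = key_positions - set(sampled_cols)  # informative columns lost in this replicate
loo_replicates = list(leave_one_out(toy_alignment))
p_missed_89 = prob_column_missed(89)  # 89 columns, the alignment length quoted in Table 5
```

For an alignment of 89 columns, `prob_column_missed` gives roughly 0.37, so any single informative column is absent from about a third of bootstrap pseudoreplicates, whereas every leave-one-out replicate retains all columns.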

The PHYLIP tree (Figure 23) groups the HOX and Engrailed subfamilies into one cluster, albeit with extremely low support. The CUT homeobox subfamilies (Onecut, SATB, CDP/Cux) were grouped with each other in the PHYLIP tree (suggesting a high degree of relatedness) whereas they were segregated from each other in the Bête tree (Figure 22); similarly, the LIM homeobox subfamilies (Lhx1/Lhx3/Lhx5, Lhx2/Lhx6/Lhx8, and Insulin Gene Enhancer (ISL)) were grouped together in the PHYLIP tree and separated in the Bête tree. The critical difference between the protocol used to produce the Bête tree and the protocol used to generate the PHYLIP tree lies in the measure of similarity between sequences; the protocol used to generate the PHYLIP tree employs a distance metric based strictly on scoring amino acid substitutions [130], while the protocol used in generating the Bête tree assigns each sequence to a prior probability distribution (for the most similar grouping of sequences in the BLOCKS database) and calculates the relative entropy between each pair of distributions [117]. This suggests that the clustering produced by Bête is able to encapsulate more of the functional or structural characteristics shared by members of the same subfamily while differing between subfamilies, rather than evaluating clusters of similar proteins based solely upon closeness of shared common ancestry.
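As a rough illustration of the relative entropy measure mentioned above, the following Python sketch computes the standard Kullback-Leibler divergence between two discrete distributions. It is a generic textbook formula, not Bête's implementation; how Bête derives its distributions from the BLOCKS priors and combines per-column values is not reproduced here, and the example distributions are invented.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes p and q are distributions over the same outcomes and that
    q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Invented two-outcome distributions for illustration.
p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = relative_entropy(p, q)
d_qp = relative_entropy(q, p)
d_pp = relative_entropy(p, p)
```

The measure is zero for identical distributions and is not symmetric (`d_pq` differs from `d_qp`), which is one way it differs from a substitution-score distance.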

Figure 22: Bête leave-one-out cross-validation “supertree”, collapsed to major branchpoints. Node support is given as fraction of trees that support the specified phylogenetic relationship. Individual unclassifiable groupings have been included in the Appendix.


Figure 23: Neighbour-joined tree produced using the PHYLIP phylogenetic package, using the same set of sequences as for the Bête-produced tree. Node-support is for number of trees out of 100 bootstrap runs.


3.4 Analysis of Protein-DNA Interaction: Physical Characteristics of Subfamily-Determining Residues

In order to determine the role that specificity determining residues play in protein-DNA interactions, the structures of protein-DNA complexes belonging to representative proteins from each subfamily (whenever available) were analyzed and visualized. In particular, we investigated the contribution of subfamily specificity determining residues (SDRs), and of conserved residues to the protein-DNA interface and analyzed the number and types of interactions that these residues make with the DNA.

3.4.1 Contribution to the Protein-DNA Interface

Homeodomain structures that were part of a protein-DNA complex were analyzed directly. Uncomplexed structures of homeodomains in a given family were optimally superimposed onto the protein moiety of a closely related relative for which a structure of the complex was available. From the atomic coordinates of each protein-DNA complex, the contribution of a given residue i to the protein-DNA interface was computed as the change in accessible surface area [195] of the residue upon complex formation:

Equation 11

$$\Delta\mathrm{ASA}_i = \mathrm{ASA}_i(\text{free protein}) - \mathrm{ASA}_i(\text{protein in complex with DNA})$$

where ΔASAi is the change in accessible surface area of residue i upon association with the DNA, and the two terms on the right hand side are the residue surface areas in the free and complexed protein respectively. Then the contribution of a given residue to the protein-DNA interface is computed as the fraction of the total interface area:

Equation 12

$$f_{\Delta\mathrm{ASA}_i} = \frac{\Delta\mathrm{ASA}_i}{\sum_{k=1}^{n} \Delta\mathrm{ASA}_k}$$

where the sum in the denominator is over the n residues whose accessible surface area changes upon association, and is the total interface area of the protein moiety.
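Equations 11 and 12 can be sketched in Python as follows. This is a minimal sketch that assumes per-residue accessible surface areas for the free and complexed protein have already been computed by some external tool; the function name `interface_fractions` and the example values are hypothetical and are not taken from the thesis data.

```python
def interface_fractions(asa_free, asa_complex):
    """Per-residue fraction of the total protein-DNA interface area.

    asa_free, asa_complex: dicts mapping residue label -> accessible
    surface area (in square angstroms) in the free and DNA-bound protein.
    """
    # Equation 11: change in accessible surface area upon complex formation.
    delta = {res: asa_free[res] - asa_complex[res] for res in asa_free}
    # Keep only residues whose accessible surface area changes (is buried).
    buried = {res: d for res, d in delta.items() if d > 0.0}
    total = sum(buried.values())  # total interface area of the protein moiety
    if total == 0.0:
        return {}
    # Equation 12: residue contribution as a fraction of the total.
    return {res: d / total for res, d in buried.items()}

# Invented values for illustration only.
asa_free = {"Arg5": 120.0, "Tyr25": 80.0, "Leu16": 10.0}
asa_complex = {"Arg5": 60.0, "Tyr25": 60.0, "Leu16": 10.0}
fractions = interface_fractions(asa_free, asa_complex)
```

Summing the returned fractions over a chosen set of positions (for example, the specificity determining positions) gives the kind of percentage reported per structure in Table 5.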


The contributions of individual residues to the interface area (in %) are listed in the Appendix (Changes in Solvent Accessibility: Tables A-N); the proportions of the interface that are contributed by 1) all the specificity determining residues, 2) all the conserved residues and 3) both categories of residues together are given in Table 5.

We can see from Table 5 that, in general, the specificity determining residues contribute significantly to the area buried in the protein-DNA complexes. However, the contribution of the SDRs to the protein-DNA interface ranges widely, from 12.72% for the POU structure (1cqtB) to 53.20% for the non-plant homeobox leucine zipper structure (1wjhA), with an average of 28.34%. These large differences result from the fact that structures of complexes sometimes include additional segments (e.g. TALE-type homeodomains include a three-amino-acid loop extension), or are lacking segments (e.g. the MATa1 homeodomain structure lacks the N-terminal domain that binds the minor groove).

In all of the examined structures, the bulk of the interface is formed through association with the major groove of DNA. The largest contribution to the interface comes from interactions with helix 3, but significant additional contributions are provided by residues in helices 1 and 2 and the loop connecting them. The completely conserved residues are located in the center of helix 3, deep in the major groove of DNA, whereas the specificity determining residues are located on the helices and the loop between helices 1 and 2. The N-terminal tail of the homeodomain interacts primarily with the minor groove of DNA and contains none of the specificity determining or the conserved residues, although two of the residues in the N-terminal tail form a significant part of the protein-DNA interface.


Percent Contribution to Protein-DNA Interface (% ∆ASA)

Subfamily (PDB ID)                           SDPs only   Conserved residues only   Both SDPs and conserved residues
HOX (1ig7A)                                  25.54%      13.95%                    39.49%
Engrailed (3hddA)                            32.05%      15.83%                    47.88%
Paired-box (1fjlA)                           23.26%      12.03%                    35.29%
POU (1cqtB)                                  12.72%      11.13%                    23.85%
TTF1/VND (1nk2P)                             14.35%      11.45%                    25.79%
PBX/Exd (1b8iB)                              25.96%      12.65%                    38.61%
MATa1 (1akhA)                                44.27%      14.20%                    58.47%
MATα2 (1akhB)                                22.67%      14.86%                    37.53%
ISL (1bw5)¹                                  41.94%      19.77%                    61.71%
SATB (1wi3A)²                                16.76%      12.56%                    29.31%
Onecut (1s7eA)¹                              20.85%      10.07%                    30.92%
HOP (1uhsA)³                                 36.19%      20.45%                    56.64%
Homez (1wjhA)¹                               53.20%      22.87%                    76.07%
HNF1 (2lfb)³                                 27.03%      18.96%                    45.99%
Average (unmapped DNA only)                  25.10%      13.26%                    38.36%
Average (mapped DNA only)                    32.66%      17.45%                    50.11%
Average (all listed structures)              28.34%      15.06%                    43.40%
Number of amino acid positions (out of 89)   9           5                         14
% of total alignment                         10.1%       5.6%                      15.7%

SDPs: Specificity Determining Positions.

Table 5: Contribution of specificity determining residues (9 of 89 residues, as per Figure 20) and conserved residues (5 of 89 residues, as per Figure 20) to the protein-DNA interface in structurally represented homeodomain subfamilies. Contribution to the protein-DNA interface refers to the change in solvent accessibility at the interface. The DNA used for the ASA calculations for the first 8 structures (HOX 1ig7A through MATα2 1akhB) comes from the same PDB file as the protein chain. For the remaining 6 protein structures that do not have a corresponding DNA template, the DNA from the most similar homeodomain structure was mapped onto the structure after optimal superposition of the protein structures.

¹ DNA mapped from PDB ID 3hdd
² DNA mapped from PDB ID 1jgg
³ DNA mapped from PDB ID 1fjl
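The three average rows of Table 5 can be reproduced from the per-structure values with a short Python check. The numbers below are copied from Table 5; the `rows` layout, the `"own"`/`"mapped"` labels and the helper `column_means` are illustrative choices rather than part of the original analysis.

```python
from statistics import mean

# (subfamily, PDB ID, SDP %, conserved %, DNA source), values from Table 5.
rows = [
    ("HOX", "1ig7A", 25.54, 13.95, "own"),
    ("Engrailed", "3hddA", 32.05, 15.83, "own"),
    ("Paired-box", "1fjlA", 23.26, 12.03, "own"),
    ("POU", "1cqtB", 12.72, 11.13, "own"),
    ("TTF1/VND", "1nk2P", 14.35, 11.45, "own"),
    ("PBX/Exd", "1b8iB", 25.96, 12.65, "own"),
    ("MATa1", "1akhA", 44.27, 14.20, "own"),
    ("MATalpha2", "1akhB", 22.67, 14.86, "own"),
    ("ISL", "1bw5", 41.94, 19.77, "mapped"),
    ("SATB", "1wi3A", 16.76, 12.56, "mapped"),
    ("Onecut", "1s7eA", 20.85, 10.07, "mapped"),
    ("HOP", "1uhsA", 36.19, 20.45, "mapped"),
    ("Homez", "1wjhA", 53.20, 22.87, "mapped"),
    ("HNF1", "2lfb", 27.03, 18.96, "mapped"),
]

def column_means(rows, source=None):
    """Mean SDP and conserved contributions, optionally filtered by DNA source."""
    selected = [r for r in rows if source is None or r[4] == source]
    return mean(r[2] for r in selected), mean(r[3] for r in selected)

avg_own = column_means(rows, "own")        # DNA from the same PDB file
avg_mapped = column_means(rows, "mapped")  # DNA mapped from a related structure
avg_all = column_means(rows)

for label, (sdp, cons) in [("own", avg_own), ("mapped", avg_mapped), ("all", avg_all)]:
    print(f"{label}: SDP {sdp:.2f}%, conserved {cons:.2f}%")
```

The structures with mapped DNA show higher average contributions than those with their own DNA, which is worth keeping in mind when comparing subfamilies across the two groups.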


3.4.2 Interactions Made by Interface Residues

The particular types of contacts (H-bond, hydrophobic contact, hydrophobic-hydrophilic) formed by each of the interface residues of the analyzed structures (HOX, Engrailed, Paired-box, POU) with DNA groups are given in the Appendix (List of Interactions with DNA at distance ≤ 4.5Å). These protein-DNA contacts were derived using the Contacts of Structural Units web server [196]. Interface residues make on average 28-29 hydrogen bonds with DNA, of which 20% are contributed by residues at specificity determining positions (SDPs); 4-5 hydrophobic-hydrophobic contacts, of which 50% are contributed by SDPs; and 21-22 hydrophilic-hydrophobic contacts, of which 28% are contributed by SDPs.

Amino acid positions which were 90% identical (or more conserved) among the homeodomain family were identified and evaluated in conjunction with subfamily-determining residues (see Figure 20). In all subfamilies that were analyzed structurally, a conserved Leu (position 22) in helix 1 and a conserved Trp (position 74) in helix 3 make hydrophobic-hydrophobic contacts; an example of this is illustrated using the POU homeodomain 1cqtB in Figure 24. This suggests that conserved residues not directly involved in protein-DNA interactions may be involved in protein structure and stability.

Figure 24: Stereogram for POU homeodomain (1cqtB) in contact with DNA; the conserved Leu at position 22 (helix 1) makes hydrophobic-hydrophobic contact with the conserved Trp at position 74 (helix 3). Subfamily determining residues are illustrated in yellow, while homeodomain family conserved residues are illustrated in green. Residues are numbered according to the residue numbering in the PDB file, while positions refer to the residue positions of the alignment in Figure 20.

The most common residues for each subfamily-determining position were identified for the representative structures from all homeodomain subfamilies structurally represented in the PDB [154] (see Appendix: Chi-1 Angle Conformations at Subfamily-Determining Residue Positions). Their expected chi-angle conformations, taken from the Dunbrack backbone-independent rotamer library [191], were compared with the actual chi-angle conformations calculated using WHAT IF [197]. In general, the actual chi-angle conformations of the residues correlated with the expected chi-angle conformations; only residue position 36 (according to Figure 20) exhibited least preferred chi-angle conformations for the amino acids at that position of the homeodomain multiple alignment, and specifically for Asp/Asn residues (Table 6). Those subfamilies with an unusual Asp/Asn chi-1 angle conformation at this position are MATa1, MATα2, Insulin Gene Enhancer (ISL), Paired-box, and Homeodomain Only Protein (HOP). The residues involved in this unusual chi-angle conformation appear to stabilize a helix N-cap at the N-terminal end of helix 2; an example of the helix N-cap in the MATa1 structural representative 1akhA is illustrated in Figure 25.

Residue (conformation)   Number of instances   Amino acid chi-1 angle conformation preference
Asn (g+)                 1                     g+ least preferred, trans moderate preference, g- most preferred
Asp (g+)                 4                     g+ least preferred, trans moderate preference, g- most preferred
Ser (g+)                 4                     g+ most preferred, trans least preferred, g- moderate preference
Thr (g+)                 1                     g- and g+ equally preferred, trans least preferred
Ser (g-)                 1                     g+ most preferred, trans least preferred, g- moderate preference
Asp (g-)                 1                     g+ least preferred, trans moderate preference, g- most preferred
Arg (g-)                 1                     g+ least preferred, trans moderate preference, g- most preferred

Table 6: Chi-1 angle conformations for residue position 36 (according to Figure 20) for representative homeodomain structures from all homeodomain subfamilies which are structurally represented in the PDB [154]. Asp/Asn residues at this position occur in non-preferred conformations that appear to stabilize the helix 2 N-cap. Residue chi-1 angle conformation preferences are drawn from the Dunbrack backbone-independent rotamer library [191]. Cells shaded in yellow indicate unusual (least preferred) chi-1 angle conformation. The representative structures (PDB IDs) used for this analysis are: HOX (1b72A), Engrailed (3hddA), POU (1au7A), PBX/Extradenticle (1b72B), Insulin Gene Enhancer/ISL (1bw5), Paired-box (1fjlA), Homeodomain Only Protein/HOP (1uhsA), HNF1 (2lfb), MATa1 (1akhA), MATα2 (1akhB), TTF1/VND (1nk2P), Onecut (1s7eA), SATB (1wi3A), Homeobox leucine zipper/Homez (1wjhA).

Figure 25: Helix N-cap formed by Asn residue at position 36 (according to alignment in Figure 20) of MATa1 structural representative 1akhA. Position 36 is marked in yellow.


3.5 Potential role of specificity determining residues in modulating protein-DNA interaction

Subfamily determining residues, or specificity determining residues (SDRs), have been identified for the homeodomain family of proteins on the basis of the information content in the multiple sequence alignment. A detailed analysis of how the SDRs accomplish their role requires a systematic study of all structures of each subfamily. The amount of work involved makes this impractical to develop fully within the present Master's thesis; it would instead be a natural continuation of the work. Here, an attempt is made to outline the steps for such an analysis, and to derive a model of the mechanism by which the SDRs modulate the protein-DNA interactions in different subfamilies. The analysis at this stage was entirely manual, and involved combining visual inspection of the structures with information from the interface area calculations and contact tables. Therefore, this part of the analysis could only be carried out on a limited number of structures. The following section describes the results obtained from analyzing a total of 5 representative structures, one from each of the HOX, paired-box, engrailed, MATa1, and MATα2 subfamilies.

Qualitatively, both SDRs and conserved residues are located in the same positions across all subfamilies. More specifically, the majority of the conserved residues are located at the C-terminus of the recognition helix of the homeodomain; some appear to skirt the rim of the pocket of the major groove of DNA, while others point into the major groove. In contrast, the specificity determining residues establish a more variable pattern of contacts between the protein and the DNA. The majority of these contacts are made with the phosphate groups of the DNA backbone; however, a few of the SDRs also appear to make contacts within the major groove.

One example of a frequently observed pattern of SDR contacts with DNA is illustrated by the HOX subfamily 1ig7A complex (Figure 27). The first residue of helix 1 (Phe108, position 14 according to the alignment in Figure 20) appears to contact the DNA between the minor and major grooves. The fourth residue of helix 2 (Arg131, position 40) and the sixth residue of helix 3 (Lys146, position 72) are usually charged residues, which, together with a polar residue in the loop connecting helix 1 and helix 2 (Tyr125, position 34), tend to make hydrogen bonds with the DNA backbone in the major groove.

Figure 26 shows the complete set of SDRs in blue and the conserved residues in red for the HOX subfamily representative 1ig7A. Not every residue in the set takes part in the protein-DNA interface; moreover, the protein-DNA interface also contains additional residues that are neither specificity determining nor conserved. Starting from the N-terminus, the 14 residues which are either specificity determining or conserved are: Position 14 (Phe108 – SDR), at the end of the initial coil and start of helix 1; Position 19 (Leu113 – SDR) and Position 22 (Leu116 – conserved) in helix 1; Positions 34, 35, and 36 (Tyr125, Leu126, Ser127 – all SDRs), three residues in the loop between helix 1 and helix 2; Position 40 (Arg131 – SDR) in helix 2; Positions 70, 72, and 73 (Gln144, Lys146, and Ile147), three SDRs at the N-terminal end of helix 3; and Positions 74, 75, 77, and 79 (Trp148, Phe149, Asn151, and Arg153), four conserved residues in helix 3. Position numbering is assigned according to the alignment in Figure 20, while residue numbers were obtained from the PDB file for 1ig7A.


Figure 26: Complete set of SDRs (blue) and conserved residues (red) for the HOX subfamily representative 1ig7A. Not all residues shown take part in the interface with DNA, e.g. Leu116 (Position 22), Ser127 (Position 36), or Phe149 (Position 75).

Indeed, the information determining the specific function of each subfamily is not contained in the regularity of the observed patterns, but in the particular divergences with which each homeodomain subfamily implements them. Differences in the amino acid type at SDR positions and/or the amino acid conformation seem to correlate with the different subfamilies. The distinct amino acid composition seen in the loop between helices 1 and 2 seems to affect the relative orientation of helix 1 and helix 2, and that of helix 2 and helix 3. In addition, even when there is little overall sequence difference between structural representatives of different subfamilies, the relative contribution of the SDRs to the interface area varies significantly due to subtle differences in amino acid conformation and interactions.


However, it seems too simplistic to expect the mechanism for SDR function to rest upon recognition of particular bases in the DNA sequence. Recognition of DNA appears to be accomplished by the N-terminal coil, which lies within the minor groove, and the conserved residues in helix 3 which lie within the major groove: the standard model by which homeodomain family members bind to DNA. Instead, a mechanism that involves particular patterns of contacts between sets of SDRs and the DNA may mediate the affinity of a homeodomain subfamily member for particular DNA sequences. It is possible that each of these patterns might act as a functional unit, with the combination of these units and their interplay determining the specific structure, local geometry, and dynamics of the complex, thereby targeting the function of each subfamily. As an example, two patterns are observed to span the major groove in HOX 1ig7A: the first being a geometric ordering across the major groove involving SDR positions 14 (Phe108), 19 (Leu113), and 34 (Tyr125), and the second appearing to form a recurrent clamp-like pattern about the DNA, involving SDR positions 40 (Arg131) and 72 (Lys146). Differences in the conformation of the clamp residues and their interactions with the DNA phosphate groups, or occasions when one of the residues at these SDR positions fails to secure the clamping interaction, appear likely to affect protein-DNA specificity. Both patterns appear to span only three base-pairs each on opposite sides of the major groove. The gap between these two sets of base-pairs is determined by local deformation of the DNA helix. The latter would affect the extent to which any given helix would interact with the DNA; however, the reverse may also be the case, in that the particular setup of the contact patterns might also influence the local deformation of the DNA. 
It has been suggested in the literature that in a case such as the paired-box homeodomain, where the DNA helix is significantly bent and the homeodomain interacts with much higher affinity as a dimer, the interaction of a single monomer is expected to induce the local deformation of the DNA that facilitates the binding of the dimer partner [198].

Figure 27 shows different views of the HOX subfamily representative 1ig7A. It illustrates the general conformation of the homeodomain relative to the DNA, with helix 3 or the recognition helix sitting well within the major groove. The top view shows SDR position 14 (Phe) and 19 (Leu) above the backbone of the 5’-3’ forward strand, and SDR position 34 (Tyr) close to the backbone of the reverse strand. As can be seen from the middle view, these residues, together with the conserved Arg at position 79, are almost perfectly aligned across the major groove. Position 35 (Leu) is pointing toward the hydrophobic core of the homeodomain formed by the three helices and the loop between helices 1 and 2. Next along the sequence are the polar residue (Thr) at position 36, acting as the cap of helix 2 (see middle view), and position 40 (Arg), the first half of the clamp pattern that interacts with the reverse strand of the DNA. Residues at position 70 (Gln) and position 73 (Ile) form a contact pattern with three nucleotides on the forward strand; in between them is the residue at position 72 (Lys), the second half of the clamp pattern. The bottom view of Figure 27 illustrates the conserved residues in helix 3. The residue at position 74 (Trp) lies under helix 1 and appears to play a structural role in the homeodomain itself. It also appears to contact the backbone of the forward strand of DNA. A similar function may be posited for the residue at position 75 (Phe – not illustrated in the diagram), although this residue does not make contact with the DNA. Finally, the residues at positions 77 (Asn) and 79 (Arg) appear to contact DNA, on the forward and reverse strands respectively. The latter residue contacts primarily the same nucleotides as the Tyr at position 34 (see middle view); these appear to be aligned together with the residues at positions 14 (Phe) and 19 (Leu) across the major groove, which suggests the hypothesis that they determine a kind of support axis around which the homeodomain fold may swing helix 3 and adapt to the local deformations of the DNA helix. The SDRs contribute approximately 26% to the total interface area, and together with the conserved residues represent approximately 40% of the protein-DNA interface. The N-terminal coil sits within the minor groove and contains two (non-SDR) residues that contribute at least 10% each to the interface area.

Figure 27: HOX subfamily structural representative (PDB 1ig7A) and contacts made by SDRs and conserved residues. A) Back view of helix 3 and N-terminal coil of the protein; B) side view of helix 3 and loop between helix 1 and helix 2; C) front view of the C-terminal end of helix 3. Specificity determining residues and conserved residues are illustrated in stick mode; the latter are coloured blue. Highlighted in ball-and-stick mode are residues that contribute at least 10% to the total protein-DNA interface.

Residues labelled in panel A: Trp (74), Leu (19), Arg (79), Phe (14), Gln (70), Asn (77), Ile (73), Lys (72). Panel B: Leu (35), Trp (74), Thr (36), Arg (40), Tyr (34). Panel C: Arg (40), Phe (14), Trp (74), Lys (72), Arg (79), Asn (77).


Figure 28 shows analogous views of the superposition of HOX subfamily representative 1ig7A with paired-box subfamily representative 1fjlA. This superposition was obtained by first superposing helix 3 of both structures and applying the transformation matrix to the rest of the structure. Both proteins show the same relative orientation of helix 2 and helix 3, although helix 1 in 1fjlA lies somewhat closer to helix 2 than in 1ig7A, leaving the SDR Leu13 (position 19) of 1fjlA more exposed to solvent. Consequently, position 19 in the paired-box representative 1fjlA exhibits a weaker interaction (0.3% of the total interface) with the DNA than that of equivalent position Leu113 (position 19) in the HOX representative 1ig7A (0.92%). The sequence of the loop between helices 1 and 2 is YPD (positions 34-36 according to the alignment in Figure 20) in the paired-box representative 1fjlA, and YLS (positions 34-36) in the HOX representative 1ig7A (contributions to the protein-DNA interface given in Table 7), with the last residue in each loop being fully exposed to solvent in both structures.

A significant difference is observed in the clamp pattern between the two structures (1ig7A versus 1fjlA), as shown in Figure 28. There is a change in composition from Lys146 (position 72) in the HOX structure 1ig7A to Gln46 (position 72) in the paired-box structure 1fjlA. As well, there is a slight change in the pitch (approximately 1-1.5Å) of the major groove of DNA in the paired-box 1fjlA complex relative to that for the HOX 1ig7A complex; the shift in conformation of the clamp residues in 1fjlA (positions 40 and 72 according to the alignment in Figure 20) correlates with the shift in pitch. This change has a cost in terms of the contribution to the protein-DNA interface area. In the HOX structure 1ig7A, the contribution of both clamp residues (Arg131/position 40 and Lys146/position 72) adds up to 9.17% of the total interface area, whereas in the paired-box structure 1fjlA, these residues contribute only 7.26% of the total interface area. The reduction comes primarily from residue Gln46 (position 72) of paired-box 1fjlA that becomes more exposed to solvent. Although Arg31 (position 40) of paired-box 1fjlA is closer to the DNA than Arg131 (position 40) of HOX 1ig7A, with a 2.9% versus 2% contribution to the total interface area, it does not compensate for the change in Gln46 (position 72) of paired-box 1fjlA (illustrated in Table 7).

Differences in the amino acid composition between HOX 1ig7A and paired-box 1fjlA at SDR positions appear to result in subfamily-specific differences in DNA contact (illustrated in Table 7). Gln144 (position 70 according to the alignment in Figure 20) in HOX 1ig7A points directly towards the solvent, making a 1.12% contribution to the total protein-DNA interface area, whereas Arg44 at the same position in paired-box 1fjlA is clearly bent towards the major groove, thus making more contact with the DNA (2.42% contribution to the total interface area) and forming a 2.8Å hydrogen bond with the phosphate group on the adenine at 5D (see Appendix: List of Interactions with DNA at distance ≤ 4.5 Å). By contrast, Gln144 in HOX 1ig7A is 4.4Å away from the DNA phosphate backbone. Another difference in amino acid composition at an SDR position which appears to affect protein-DNA contact is at position 73 of HOX 1ig7A and paired-box 1fjlA. Ile147 (position 73) of HOX 1ig7A contributes 6.01% to the interface region, while Val47 (position 73) of paired-box 1fjlA contributes only 4.36% to the protein-DNA interface. This difference is solely due to the difference in residue size, as both sidechain conformations almost completely overlap.

Position #   PDB residue (HOX 1ig7A)   % of protein-DNA interface   PDB residue (Paired 1fjlA)   % of protein-DNA interface
14           Phe108                    3.04%                        Phe8                         3.26%
19           Leu113                    0.92%                        Leu13                        0.30%
34           Tyr125                    4.15%                        Tyr25                        5.49%
35           Leu126                    1.03%                        Pro26                        0.17%
40           Arg131                    2.01%                        Arg31                        2.90%
70           Gln144                    1.12%                        Arg44                        2.42%
72           Lys146                    7.16%                        Gln46                        4.36%
73           Ile147                    6.01%                        Val47                        4.36%

Table 7: Contribution of subfamily determining residues (SDRs) to the protein-DNA interface in HOX 1ig7A and Paired-box 1fjlA. Amino acid positions are given according to the structure-sequence homeodomain alignment illustrated in Figure 20. Residues highlighted in light grey are SDRs in the loop between helices 1 and 2 of the homeodomain. Residues highlighted in cyan are the clamp residues at positions 40 and 72. Residues highlighted in yellow are other SDRs which appear to differ in their amino acid composition between HOX 1ig7A and Paired-box 1fjlA, and thus in their contribution to the protein-DNA interface.

Overall, the SDRs in the paired-box homeodomain-DNA complex contribute 23% of the protein-DNA interface area; when conserved residues are also taken into account, the two types of residues contribute 35% of the protein-DNA interface region. As well, there is a difference in the conformation of the clamp between the HOX 1ig7A structure and the paired-box 1fjlA structure which parallels the small local difference in pitch between the corresponding DNA chains (at the 5’ end), illustrated in the top view of Figure 29.

Figure 28: Structural alignment of 1ig7A (HOX) and 1fjlA (paired-box) based on superposition of helix 3. The 1fjlA complex is shown in red, with conserved residues in 1fjlA coloured orange. Specificity determining residues and conserved residues are illustrated in stick mode; the latter are coloured blue. Highlighted in ball-and-stick mode are residues that contribute at least 10% to the total protein-DNA interface. Amino acid residues are given for the HOX structural representative 1ig7A.

Residues labelled in panel A: Trp (74), Leu (19), Arg (79), Phe (14), Gln (70), Asn (77), Ile (73), Lys (72). Panel B: Leu (35), Thr (36), Trp (74), Arg (40), Tyr (34). Panel C: Arg (40), Lys (72), Trp (74), Phe (14), Arg (79), Asn (77).


The amino acid sequence of the engrailed subfamily structure representative (3hddA) is almost identical to that of the HOX subfamily structure representative 1ig7A. However, there are slight differences in conformation in the engrailed homeodomain structure 3hddA relative to the HOX homeodomain structure 1ig7A. These differences result in an increase in the contribution to the protein-DNA interface made by SDRs in the engrailed homeodomain structure 3hddA relative to the HOX homeodomain structure: the SDRs contribute 32% of the total interface area, and if conserved residues are also included, they collectively contribute 48% of the total interface area. By contrast, the SDRs in the HOX subfamily representative 1ig7A contribute only 26% of the total interface area, and the SDRs and conserved residues together represent approximately 40% of the HOX subfamily homeodomain-DNA interface. Half of the increase in the SDR contribution to the interface in engrailed 3hddA versus HOX 1ig7A is due to the different conformation of Ile47 (position 73 according to the alignment in Figure 20) in engrailed 3hddA compared to that of Ile147 (position 73) in HOX 1ig7A; the other half of the increase is mostly due to a slight shift in the tyrosine (position 34) within the loop between helix 1 and helix 2 and differences in the sidechain directionality of the positively charged residues forming the recurring clamp pattern about the DNA at amino acid positions 40 and 72 (Figure 29).


Figure 29: Protein-DNA adaptation as illustrated by HOX 1ig7A, Paired-box 1fjlA, and Engrailed 3hddA. Top (A): superposition of 1fjlA (paired-box) and 1ig7A (HOX); 1fjlA illustrated in green. Middle (B): superposition of 3hddA (engrailed) and 1ig7A (HOX). Bottom (C): superposition of 3hddA (engrailed) and 1fjlA (paired-box). In these two last cases, 3hddA is shown in green. The DNA strands are shown in backbone mode. All views are from the N-terminus of the homeodomain. Only the most relevant SDRs are shown. Specificity determining residues and conserved residues are illustrated in stick mode. Amino acid residues are given for the HOX structural representative 1ig7A for parts A) and B), and for the Engrailed structural representative 3hddA for part C).

Residues labelled in panels A and B: Leu (19), Arg (40), Tyr (34), Phe (14), Gln (70), Asn (77), Lys (72). Panel C: Thr (36), Leu (19), Phe (14), Arg (40), Lys (72), Gln (70).

Figure 30 compares the structural changes of the MATa1 homeodomain (1akhA) and the MATα2 homeodomain (1akhB) with respect to the HOX homeodomain (1ig7A). In this case, the different roles of the MATa1 and MATα2 homeodomains appear to involve structural differences in their interaction with the DNA, both relative to each other and in comparison to another homeodomain such as HOX 1ig7A. The MATα2 homeodomain (1akhB) does not have the full clamp in its usual orientation on either side of the DNA strand: the arginine clamp residue in helix 2 (position 40 according to the alignment in Figure 20) is substituted by leucine in 1akhB (Leu162), and this residue does not form part of the protein-DNA interface. There is also a reduction in the contribution to the interface by the other clamp residue, in helix 3 (position 72): Lys146 in HOX 1ig7A contributes 7.16% of the protein-DNA interface, while Lys177 in MATα2 1akhB contributes only 6.10%. Together, these changes result in a decrease in the overall contribution of the SDRs in MATα2 1akhB to the protein-DNA interface (22.67%) relative to that of HOX 1ig7A (25.54%). The MATa1 homeodomain substitutes Ser (Ser94) for Tyr at position 34 in the loop region between helices 1 and 2, which also decreases the contribution to the protein-DNA interface at that position (1.20% for Ser94 in MATa1 1akhA versus 4.25% for Tyr125 in HOX 1ig7A). However, this is more than compensated for by the changes in the clamp residues of MATa1 1akhA: the change in the helix 2 clamp residue from Arg (Arg131 in HOX 1ig7A) to Lys (Lys100 in MATa1 1akhA) and the change in the helix 3 clamp residue from Lys (Lys146 in HOX 1ig7A) to Arg (Arg115 in MATa1 1akhA) result in an increase in SDR contribution to the overall protein-DNA interface of 12.97% relative to HOX 1ig7A. In addition, the positions of helix 1 and helix 2 in both the MATa1 and MATα2 homeodomains are shifted relative to those in the HOX homeodomain and relative to each other.

Figure 30: Superposition of the MATa1 (1akhA), MATα2 (1akhB), and HOX (1ig7A) homeodomains, taking as reference the structural alignment of helix 3 in the MATa1 and MATα2 structures relative to helix 3 of the HOX structure. MATa1 is coloured green, while MATα2 is coloured yellow. The conserved residues in the HOX structure (1ig7A) are coloured blue, while the conserved residues in the MATa1 structure (1akhA) are coloured orange, and the conserved residues in the MATα2 structure (1akhB) are coloured cyan. The top diagram (A) is a top view showing the relative orientation of the homeodomain helices and the ordering of the different contacts. The remaining four views (B, C, D, and E) represent a 270° counter-clockwise rotation with respect to the second diagram. Amino acid residues are given for the HOX structural representative 1ig7A.

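Residues labelled in panel A: Tyr (34), Phe (75), Trp (74), Phe (14).
Residues labelled in panel B: Trp (74), Phe (75), Arg (79), Arg (40), Phe (14), Gln (70), Asn (77), Lys (72), Ile (73).
Residues labelled in panel C: Leu (35), Arg (40), Tyr (34), Lys (72).
Residues labelled in panel D: Leu (35), Tyr (34), Asn (77), Ile (73), Lys (72).
Residues labelled in panel E: Leu (19), Gln (70), Trp (74), Ile (73), Asn (77).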

Table 8 shows the relative contribution to the protein-DNA interface, split among each structural unit of the homeodomain: N-terminal coil, helix 1, the loop between helices 1 and 2, helix 2, and helix 3. The contribution of the SDRs in each unit is shown in yellow. The contribution of the conserved residues comes only from helix 3. The combined contribution of the SDRs and the conserved residues relative to the interface spanned by helix 3 is shown in column 9 as a ratio (between 0 and 1).

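| Subfamily (structure) | # residues | Protein-DNA interface (Å²) | N-terminal coil (%) | H1 (%) | H1-H2 loop (%) | H2 (%), total / SDR | H3 (%), total / SDR | SDR+Cons / H3 | Total SDR (%) | Total conserved (%) | Total SDR + Cons (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Engrailed (3hddA) | 55 | 772.2 | 27.53 | 4.82 | 7.29 | 4.41 / 3.25 | 55.93 / 16.69 | 0.58 | 32.05 | 15.83 | 47.88 |
| HOX (1ig7A) | 58 | 1031.3 | 41.47 | 3.96 | 5.28 | 4.67 / 2.01 | 43.92 / 14.29 | 0.64 | 25.54 | 13.95 | 39.49 |
| MATα2 (1akhB) | 63 | 1094.9 | 44.82 | 4.72 | 0 | 5.58 / 4.42 | 44.88 / 13.53 | 0.63 | 22.67 | 14.86 | 37.53 |
| Paired-box (1fjlA) | 65 | 1339.6 | 43.98 | 3.56 | 5.66 | 3.93 / 2.90 | 42.86 / 14.04 | 0.61 | 23.26 | 12.03 | 35.29 |
| POU (1cqtB) | 59 | 918.6 | 49.76 | 4.93 | 0 | 0.35 | 44.95 / 7.78 | 0.42 | 12.72 | 11.13 | 23.85 |
| MATa1 (1akhA) | 49 | 783.6 | 0 | 15.70 | 4.94 | 4.74 | 74.65 / 22.11 | - | 44.27 | 14.20 | 58.47 |

For H2 and H3, a cell with two values gives the total contribution of the unit followed by the SDR contribution (highlighted in colour in the original table).

Variability row (last row of the original table; the assignment of values to columns is not recoverable from the flattened text): 4.6 ± 2.7 ± 0.8 4.6 ± 2.1 40 ± 6 42 ± 8 4.3 2.5 1.6 ± 13 ± 3 ±0.6 1.4

Table 8: Relative contribution to the protein-DNA interface area by each structural unit of the homeodomain. The relative contribution of the SDRs is given in red. The last row shows the relative variability of each column (excluding MATa1).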
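The ratio column of Table 8 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the second of the two stacked H3 values is the SDR share of helix 3, and it relies on the statement above that conserved residues contribute only through helix 3. The dictionary and function names are hypothetical.

```python
# Values transcribed from Table 8:
# (H3 total %, SDR share within H3 %, total conserved %, reported SDR+Cons/H3 ratio).
# Assumption: the second stacked H3 value in the table is the SDR share of helix 3.
TABLE8_H3 = {
    "Engrailed (3hddA)": (55.93, 16.69, 15.83, 0.58),
    "HOX (1ig7A)": (43.92, 14.29, 13.95, 0.64),
    "MATalpha2 (1akhB)": (44.88, 13.53, 14.86, 0.63),
    "Paired-box (1fjlA)": (42.86, 14.04, 12.03, 0.61),
    "POU (1cqtB)": (44.95, 7.78, 11.13, 0.42),
}


def h3_ratio(h3_total, h3_sdr, conserved):
    """Fraction of the helix 3 interface covered by SDRs plus conserved residues."""
    return (h3_sdr + conserved) / h3_total


for name, (h3_total, h3_sdr, conserved, reported) in TABLE8_H3.items():
    computed = h3_ratio(h3_total, h3_sdr, conserved)
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}")
```

Under this reading, each computed ratio agrees with the reported value to two decimal places.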

The difference in interface area between engrailed 3hddA and HOX 1ig7A is due to three missing residues in the N-terminal coil of the engrailed structure. In HOX 1ig7A, these residues contribute over 200 Å² to the protein-DNA interface; if these residues were present in the engrailed structure, this would result in a reduction in all relative contributions by a factor of 0.75. Thus, the current difference between these two structures in Table 8 seems to be an artifact of the crystal structure, and both the engrailed and HOX structures would actually have the same distribution of interface area among the first three structural units. In helix 2, the contribution of the single (clamp) residue for engrailed 3hddA would be slightly larger than that for HOX 1ig7A, but the total contribution of helix 2 to the interface would be smaller. In other words, residues other than SDRs in helix 2 of the engrailed homeodomain structure are contributing less to the interface than those in the HOX structure. For helix 3, both the total contribution and that of the SDRs alone would be the same for engrailed as for HOX.
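The rescaling argument above can be made concrete with a short calculation. This is a minimal sketch under one explicit assumption: the missing N-terminal residues are taken to account for the entire difference in interface area between the two structures in Table 8 (about 259 Å², consistent with "over 200 Å²"). The function name is hypothetical.

```python
# Interface areas and engrailed percentages transcribed from Table 8.
ENGRAILED_AREA = 772.2   # Å², engrailed 3hddA
HOX_AREA = 1031.3        # Å², HOX 1ig7A
ENGRAILED_PCT = {
    "N-terminal coil": 27.53,
    "H1": 4.82,
    "H1-H2 loop": 7.29,
    "H2": 4.41,
    "H3": 55.93,
}


def rescale_with_added_coil(pct, area, added_area, coil="N-terminal coil"):
    """Recompute percentage contributions after adding interface area to the coil."""
    new_area = area + added_area
    absolute = {unit: p / 100 * area for unit, p in pct.items()}
    absolute[coil] += added_area
    rescaled = {unit: 100 * a / new_area for unit, a in absolute.items()}
    return rescaled, area / new_area


# Assumption: the missing residues supply exactly the HOX-engrailed area difference.
added = HOX_AREA - ENGRAILED_AREA
rescaled, factor = rescale_with_added_coil(ENGRAILED_PCT, ENGRAILED_AREA, added)

print(f"scaling factor: {factor:.2f}")
for unit, value in rescaled.items():
    print(f"{unit}: {value:.2f}%")
```

The scaling factor comes out at about 0.75, and the rescaled engrailed percentages can be compared directly with the HOX row of Table 8.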

If MATa1 is excluded from the analysis, some common trends can be seen. First, the contributions of the N-terminal coil and helix 3 add up to more than 80% of the interface with DNA. This supports the standard model of homeodomain binding to DNA, which identifies these two units as the ones that characterize the whole domain family. Second, of the other three structural units, the loop between helix 1 and helix 2 and helix 2 itself show the highest variability, both in the relative contribution of the unit to the interface and in the contribution to the interface by SDRs.

The large total contribution of SDRs and conserved residues to the protein-DNA interface found in the MATa1 structure appears to be an artifact due to the missing N-terminal coil in the crystal structure.

In summary, protein-DNA specificity may be controlled by the interaction of SDRs with the DNA backbone and by adaptation to the local geometrical deformations of the DNA. Of course, the former may determine the latter: the contacts with the DNA backbone at both strands may give rise to the local strain responsible for the DNA deformation. Differences in strain on opposite sides of the DNA helix may generate differences in tension across the helix “surface”, leading to deformation of its geometry. Further analysis would be required to characterize the local deformation of the DNA more quantitatively by comparing the different cases. Afterwards, the present model should be confirmed by analyzing intra-subfamily cases.

3.6 Subfamily Cognate DNA Sequences

Subfamily-determining residues (SDRs) and conserved residues from representative homeodomain structures of the HOX (1ig7A), Engrailed (3hddA), Paired-box (1fjlA), POU (1cqtB), and MATa1 (1akhA) subfamilies were analyzed with respect to the contacts that they make with DNA, and their potential influence on DNA recognition specificity.

In Table 9, for each subfamily structural representative, the amino acid residue at each SDR position (S) or conserved residue position (C) is matched up with the DNA nucleotides that it contacts. The position numbering is that given by the alignment in Figure 20.

| Position number | HOX (1ig7A) | Engrailed (3hddA) | Paired-box (1fjlA) | POU (1cqtB) | MATa1 (1akhA) |
|---|---|---|---|---|---|
| 14 (S) | F108: 8B (A) | F8: 212C (A) | F8: 4D (A) | I608: 709O (A) | I77: 25C (G), 26C (A) |
| 19 (S) | L113: 8B (A) | L13: 212C (A) | L13 | R613: 708O (A) | R82: 24C (T), 25C (G) |
| 22 (C) | L116 | L16 | L16 | L616 | L85 |
| 34 (S) | Y125: 21C (T), 22C (C) | Y25: 327D (G), 328D (T) | Y25: 7E (C), 8E (T) | Q624 | S94: 15C (T) |
| 35 (S) | L126: 21C (T) | L26: 327D (G) | P26 | K625 | L95 |
| 36 (S) | S127 | T27 | D27 | T627 | N96 |
| 40 (S) | R131: 20C (T) | R31: 326D (G) | R31: 6E (T) | I631 | K100: 14C (T), 15C (T) |
| 70 (S) | Q144: 9B (A) | Q44: 213C (A) | R44: 5D (A) | V644: 709O (A), 710O (A) | Q113: 26C (A) |
| 72 (S) | K146: 20C (T), 21C (T) | K46: 327D (G) | Q46: 6E (T), 7E (C) | R646 | R115: 14C (T), 15C (T) |
| 73 (S) | I147: 8B (A), 9B (A), 10B (T) | I47: 213C (A), 214C (T) | V47: 4D (A), 5D (A), 6D (T) | V647: 709O (A), 710O (A) | V116: 25C (G), 26C (A), 27C (T) |
| 74 (C) | W148: 8B (A) | W48: 212C (A) | W48: 4D (A) | W648: 709O (A) | W117: 25C (G) |
| 75 (C) | F149 | F49 | F49 | F649 | F118 |
| 77 (C) | N151: 8B (A), 9B (A), 10B (T), 24C (A) | N51: 212C (A), 213C (A), 330D (A) | N51: 4D (A), 5D (A) | N651: 709O (A), 710O (A), 711O (T), 721P (A) | N120: 25C (G), 26C (A) |
| 79 (C) | R153: 21C (T), 22C (C) | R53: 327D (G), 328D (T) | R53: 7E (C), 8E (A) | R653: 719P (T) | R122: 15C (T), 16C (A) |

A residue shown without nucleotides has no DNA contact listed.

Table 9: Specificity determining residues (S) and conserved residues (C) in the HOX, Engrailed, Paired-box, POU, and MATa1 structural representatives, and the DNA nucleotides that they contact. Amino acids are given as one-letter amino acid code and PDB residue number, e.g. F108. DNA nucleotides are given as PDB residue position and chain, with the nucleotide A, C, T, or G in brackets. Forward strand nucleotides are coloured magenta, while reverse strand nucleotides are coloured blue.

Figure 31 illustrates the way in which the DNA is bound by the homeodomain. The most commonly seen pattern of DNA binding is a 7 base-pair region of DNA contacted by SDRs in two “motifs” of three nucleotides, one on either strand of DNA, with one base pair between the two motifs. In the case of the POU homeodomain (1cqtB), only the motif on the forward strand is contacted by SDRs; alternatively, in the case of the MATa1 homeodomain (1akhA), the motifs on the forward and reverse strands are four and two nucleotides long respectively, separated by a pair of nucleotides.
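The binding pattern described above (two three-nucleotide motifs separated by one base pair, spanning seven base pairs) can be derived mechanically from the contact data. The following sketch does this for the HOX representative only, using SDR contacts transcribed from Table 9 and the base pairing read from Figure 31 (B8 pairs with C26, so paired residue numbers sum to 34). The names and the pairing constant are illustrative assumptions, not part of the original analysis pipeline.

```python
# SDR contacts for HOX (1ig7A), transcribed from Table 9:
# alignment position -> list of (DNA residue number, chain).
HOX_SDR_CONTACTS = {
    14: [(8, "B")],
    19: [(8, "B")],
    34: [(21, "C"), (22, "C")],
    35: [(21, "C")],
    40: [(20, "C")],
    70: [(9, "B")],
    72: [(20, "C"), (21, "C")],
    73: [(8, "B"), (9, "B"), (10, "B")],
}
FORWARD_CHAIN, REVERSE_CHAIN = "B", "C"
PAIR_SUM = 34  # read from Figure 31: B8 pairs with C26, B14 pairs with C20


def contacted(contacts, chain):
    """Sorted residue numbers on one DNA chain that are touched by any SDR."""
    return sorted({num for hits in contacts.values() for num, ch in hits if ch == chain})


def half_site_layout(contacts):
    """Return both half-sites in forward-strand numbering, the gap, and the span."""
    fwd = contacted(contacts, FORWARD_CHAIN)
    rev_as_fwd = sorted(PAIR_SUM - n for n in contacted(contacts, REVERSE_CHAIN))
    gap = rev_as_fwd[0] - fwd[-1] - 1     # base pairs between the two motifs
    span = rev_as_fwd[-1] - fwd[0] + 1    # total base pairs covered
    return fwd, rev_as_fwd, gap, span


fwd, rev_as_fwd, gap, span = half_site_layout(HOX_SDR_CONTACTS)
print(fwd, rev_as_fwd, gap, span)
```

For HOX this gives a forward-strand motif at B8 to B10, a reverse-strand motif opposite B12 to B14, one intervening base pair, and a seven base-pair footprint.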

From Figure 31, it appears that the SDRs may play a role in determining the affinity of a homeodomain in a given subfamily toward DNA sequences containing a homeodomain recognition sequence. Of the two half-sites on the DNA, the half-site on the forward strand is frequently AAT, while the half-site on the reverse strand is more variable; the particular conformation or amino acid sequence of the SDRs binding the variable half-site may result in greater specificity for a particular sequence in the variable DNA half-site. The conserved half-site on the forward strand is likely part of the consensus homeodomain binding sequence; from the literature, TAAT is known to be the consensus homeodomain binding sequence [58]. One of the conserved residues (position 77) sometimes contacts both the forward and the reverse strand of DNA in and around the conserved DNA half-site, which may affect the affinity of the homeodomain for DNA. The POU homeodomain appears to bind only the conserved DNA half-site, which may reduce its affinity for the DNA as a monomer; however, the dimer may be able to bind two half-sites. Interestingly, the half-site that the POU homeodomain does bind is bound off-centre (the sequence AAA is bound, rather than AAT, with the full DNA sequence being AAAT). The MATa1 homeodomain binds over the whole consensus homeodomain binding sequence region; however, the sequence recognized is TGAT instead of TAAT.

A) HOX (1ig7A)
B     8   9  10  11  12  13  14
   5’-A---A---T---T---G---A---A-3’
   3’-T---T---A---A---C---T---T-5’
C    26  25  24  23  22  21  20
Contacting positions, top strand: 14S, 19S, 73S, 70S, 74C, 73S, 73S, 77C, 77C, 77C
Contacting positions, bottom strand: 77C, 34S, 34S, 40S, 79C, 35S, 72S, 72S, 79C

B) Engrailed (3hddA)
C   212 213 214 215 216 217 218
   5’-A---A---T---T---A---C---C-3’
   3’-T---T---A---A---T---G---G-5’
D   332 331 330 329 328 327 326
Contacting positions, top strand: 14S, 19S, 70S, 74C, 73S, 77C, 77C, 73S
Contacting positions, bottom strand: 77C, 34S, 34S, 40S, 79C, 35S, 72S, 79C

C) Paired-box (1fjlA)
D     4   5   6   7   8   9  10
   5’-A---A---T---C---A---G---A-3’
   3’-T---T---A---G---T---C---T-5’
E    12  11  10   9   8   7   6
Contacting positions, top strand: 14S, 73S, 70S, 74C, 73S, 77C, 77C, 73S
Contacting positions, bottom strand: 34S, 34S, 40S, 79C, 72S, 72S, 79C

D) POU (1cqtB)
O   708 709 710 711 712 713 714
   5’-A---A---A---T---A---A---G-3’
   3’-T---T---T---A---T---T---C-5’
P   724 723 722 721 720 719 718
Contacting positions, top strand: 14S, 70S, 73S, 70S, 74C, 73S, 19S, 77C, 77C, 77C
Contacting positions, bottom strand: 77C, 79C

E) MATa1 (1akhA)
C    24  25  26  27  28  29  30  31
   5’-T---G---A---T---G---T---A---A-3’
   3’-A---C---T---A---C---A---T---T-5’
C    21  20  19  18  17  16  15  14
Contacting positions, top strand: 14S, 19S, 14S, 73S, 70S, 74C, 73S, 19S, 77C, 77C, 73S
Contacting positions, bottom strand: 79C, 34S, 40S, 40S, 72S, 72S, 79C

Figure 31: Homeodomain DNA recognition sequences for selected homeodomains. The nucleotides highlighted in cyan are contacted by subfamily determining residues (SDRs); these nucleotides appear to form two half-sites on the DNA, with the top-strand site being more conserved and the bottom-strand site being more variable. On the left hand side is a schematic of the DNA strands, with chain and residue number labeled. On the right hand side is a schematic of the same DNA strands, with contacting SDR (S) and conserved residue (C) positions labeled (e.g. 14S is the SDR at position 14 of the alignment, 22C is the conserved residue at position 22 of the alignment). SDRs are coloured blue, while conserved residues are coloured red.

Chapter 4 – Discussion

Protein function is a complex phenomenon which can be described at three main levels: 1) molecular function, which includes binding sites, binding residues, affinity and specificity properties, and catalytic mechanisms of a protein; 2) cellular function, which refers to the physiological role of a protein within the cell, including participation in metabolic pathways, signal transduction, and regulatory pathways; and 3) phenotype, which refers to the protein’s effect on growth, cell fate, or any other observable characteristic of an organism. In this thesis work, homeodomains were classified into subfamilies based only on their sequence properties, on the basis that these sequence properties might contain signals related to the molecular function of the protein, possibly reflecting affinity and specificity for promoter regions, or other binding partners, including domains and other proteins. Alternatively, common ancestry between proteins may also conserve sequence signatures. The specific aspect of function which was examined in this thesis work was specificity of protein-DNA binding. Depending on the sequence properties of a given homeodomain, it may recognize a set of DNA sequences with different affinities; at a given concentration, the homeodomain may regulate a different set of genes than another homeodomain with a slightly different sequence. Amino acid residue positions were identified which distinguish between subfamilies and might therefore confer differences in specific function; this was further explored through analysis of structural characteristics at these positions, including interaction at the protein-DNA interface.

A sequence signature reflects both functional and evolutionary constraints. Evolution acts on the overall fitness of an organism, which is impacted by the contribution to fitness of a given protein. Amino acid sequence is conserved as a result of evolution acting on the function of a protein, which is reflected in the protein structure. Structural constraints are shared among all members of a family, due to the requirement to maintain a common structural fold. Functional constraints are specific for individual subfamilies, due either to neo-functionalization (gain-of-function) or sub-functionalization (partial loss-of-function) of the given subfamily relative to the common ancestor of the protein family [199]. Residues which are common within a given subfamily and differ between subfamilies may confer differences in specific function between subfamilies. Members of different subfamilies may exhibit different physiological roles due to gain-of-function mutations in the protein-coding region relative to the ancestral gene, and thus representatives from different subfamilies bind different target DNA sites [199]. Common elements may be retained by proteins exhibiting divergent evolution, if they are neutral for fitness or if the functions of the proteins complement one another. Complementarity of functions can arise from sub-functionalization of the regulatory regions of the ancestral protein; loss-of-function mutations in these regions may cause subfamilies to acquire different subfunctions of the common ancestral protein due to differences in gene regulation [199]; this was not explored within the scope of this thesis work.

The subfamily determination protocol presented in this thesis utilizes a structure-based sequence alignment and is automatable, with the exception of the protein structure curation. Manual curation of subfamilies is generally considered superior to automated subfamily determination due to expert subject knowledge, but is time-consuming to perform. Automation of the subfamily determination protocol allows many more protein sequences to be processed relative to a fully manually curated process, and allows researchers to examine subfamilies without making as many assumptions a priori, which can be revealing with regard to previously unknown functional subcategories.

A structure-based sequence alignment allows indels relative to the “canonical” 60 amino acid homeodomain sequence to be identified, and interesting residues to be mapped onto the protein structure for functional analysis. It also incorporates both protein sequence and protein structure information into the same alignment, maintaining the structural positioning of the amino acids and allowing the structural elements to form the basis of the segregation into subfamilies. The limiting factor of a structure-based protocol is the number of good quality protein structures to work from and the speed with which those structures can be curated; however, the strength of a structure-based protocol is the ability to extrapolate functional characteristics based on the input protein structures. Using MALECON to perform a multiple structure alignment allows all pairwise structure superpositions to be run automatically, with the median structure chosen after all superpositions are known and sets of spatially equivalent residues have been determined between all groups of three structures in the input structure data set [174]. By contrast, the utilization of a pairwise structure superposition program such as DaliLite would have required the user to define the reference protein structure; the resultant “alignment” would be a set of structure superpositions against the user-defined reference structure, which must be chosen carefully to avoid bias [179].

The homeodomain subfamily classification presented in this thesis incorporates not only “canonical” 60 amino acid long homeodomains, but also atypical homeodomains and plant homeodomains. This extends the previous homeodomain classification provided by Interpro [105], and classifies more finely as well; some subfamilies in the Interpro homeodomain subfamily classification (CUT, LIM) have been subdivided in the presented subfamily classification into multiple subfamilies. The Interpro CUT homeodomain subfamily was subdivided into three subfamilies which correlated with the number of CUT domains associated with the homeodomain, while the Interpro LIM homeodomain subfamily was subdivided into three subfamilies which were associated with paralogous groups of LIM proteins. Both paralogy and association with other protein domains may be associated with differences in specific protein domain function.

When the neighbor-joined tree produced using Bête is compared with the neighbor-joined tree produced by PHYLIP, the two trees group proteins somewhat differently: the PHYLIP tree groups proteins according to closeness of ancestry, whereas the Bête tree groups proteins together according to their relative entropy versus all other partitions created by the hierarchical agglomerative protocol. While the heuristic used to produce the trees is neighbor-joining in both cases, Bête employs the total relative entropy between groups as a measure for determining similarity between sequences, whereas PHYLIP measures distance between pairs of sequences based on all amino acid differences weighted via a scoring matrix [117,130]. As discussed in sections 1.3.2 and 1.3.3, while the PHYLIP tree (or any distance-based tree-building method, such as neighbor-joining or UPGMA) assumes that all positions in a protein have the same rate of mutation, Bête avoids making this assumption by evaluating “distance” between amino acid composition in profiles of sequences, instead of pairwise sequence identity comparison [117,130]. The relative entropy distance measure employed by Bête groups together sequences within a partition that are as homogeneous as possible, thus tending to maintain the homogeneity of those residues believed to be important for function within a partition [117]. However, relative entropy is not a true distance metric, whereas difference in sequence identity score is a true distance metric; the degree of similarity between two subfamilies is therefore easily judged in the PHYLIP tree but not in the Bête tree. Since the full space of all possible subfamily partitions is computationally intractable to enumerate, employing a tree-building algorithm to produce subfamily partitions reduces the scope of possible partitions to a tractable number [158]. However, tree-building partition algorithms do so at the cost of assuming that partitions have to be evolutionarily consistent, that is, sequences with similar function should also be related. Tree validation methods such as bootstrapping (applied to the PHYLIP tree) or leave-one-out cross-validation (applied to the Bête tree) also suffer from various drawbacks. Bootstrapping does not perform well in situations where continuous blocks of amino acid sequence are very short (as in the case of the homeodomain, where the sequence is only around 60 amino acids long) or where separations between groups depend on only a few important residues, which may not be represented in the pseudoreplicate datasets produced when bootstrapping [151]. Leave-one-out cross-validation may overfit the data set and make results seem better supported than one might expect, since every protein sequence forms part of the training set n-1 times, where n is the number of proteins in the data set [152]. Using vector-based methods such as Sequence Space [160] or clustering-based methods such as k-means clustering to produce subfamily groupings would avoid problems caused by tree dependence in the future, and allow measurement of distance between multiple clusters; both methods, however, require user-selected criteria when choosing putative subfamily groupings.
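The observation that relative entropy is not a true distance metric can be illustrated with a toy calculation: the relative entropy from one profile column to another generally differs from the value in the reverse direction. The sketch below uses two hypothetical residue frequency distributions; it is not Bête's implementation, which works on whole profiles with prior amino acid distributions.

```python
from math import log2


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits for two distributions over the same residues."""
    return sum(p[a] * log2(p[a] / q[a]) for a in p if p[a] > 0)


# Hypothetical residue frequencies at one alignment column in two sequence profiles.
PROFILE_1 = {"A": 0.5, "L": 0.5}
PROFILE_2 = {"A": 0.9, "L": 0.1}

d12 = relative_entropy(PROFILE_1, PROFILE_2)
d21 = relative_entropy(PROFILE_2, PROFILE_1)
print(f"D(1 || 2) = {d12:.3f} bits, D(2 || 1) = {d21:.3f} bits")
```

Because the two directions give different values, the measure is not symmetric, and symmetry is one of the requirements of a distance metric.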

Leave-one-out cross-validation was used to validate the robustness of the subfamilies produced by Bête. The support level using leave-one-out cross-validation for Bête subfamilies varies widely according to subfamily. For example, among the CUT homeodomain subfamilies, while two subfamilies are quite well supported, the third is not: the Onecut subfamily is supported in 100% of the leave-one-out cross-validated trees produced by Bête, and the CDP/Cux subfamily is supported by 84.6% of the Bête trees, but the SATB subfamily is supported by only 53.8% of the Bête trees. For the LIM homeodomain subfamilies, the support is extremely poor; the Lhx1/Lhx3/Lhx5 subfamily is supported by just 37.1% of the Bête trees, the Lhx2/Lhx6/Lhx8 subfamily by 50.8% of the Bête trees, and the Insulin Gene Enhancer subfamily by 38.0% of the Bête trees. Other subfamilies, such as the HOX cluster (97.4% support), the SIX/SINE cluster (85.5% support), the PIT-1 and POU/OCT clusters (96.9% and 99.6% support respectively), and most of the plant homeodomain subfamilies (all but one have better than 80% support) are well supported. The significant discrepancies in subfamily support suggest that the method employed by Bête, despite attempts to correct for small sample size by assuming prior probability distributions of amino acids, may not be sufficiently robust regarding small sample size input data sets, and thus cast doubt on the reliability of poorly supported subfamily divisions. It should also be emphasized, however, that any objective method for subfamily classification would be expected to yield less robust results than for family classification, since each subfamily consists of significantly fewer sequences than the entire protein family, often only a few sequences. In order to improve the robustness of subfamily predictions, more sequences could be added to the data set, thus increasing the overall sample size; however, this would not necessarily increase the per-subfamily sample size significantly. Alternatively, an iterative approach that converges on centroids in sequence space instead of relying on tree topology, such as k-means clustering, could be implemented; robustness could be validated by repeated random selection of the initial k centroids, examining how often the solution converges and how often the same clusters are created. However, implementation of a clustering-based subfamily classification method would require analysis of how the choice of k depends on the size and properties of the input sequence set.
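The restart-based robustness check suggested above can be sketched as follows. This is a minimal illustration rather than a proposed implementation: it runs plain Lloyd-style k-means on hypothetical two-dimensional points (standing in for sequences already embedded in a vector space), draws the initial k centroids at random from the data, and counts how often each final partition recurs. All names and data are assumptions made for the example.

```python
import random
from collections import Counter


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; returns the final partition as a frozenset of index sets."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the run has converged
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [points[i] for i, a in enumerate(assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    clusters = {}
    for i, a in enumerate(assignment):
        clusters.setdefault(a, set()).add(i)
    return frozenset(frozenset(c) for c in clusters.values())


def restart_stability(points, k, n_runs, seed=0):
    """Count how often each partition is reached from random initial centroids."""
    rng = random.Random(seed)
    return Counter(kmeans(points, rng.sample(points, k)) for _ in range(n_runs))


# Hypothetical embedded sequences forming three well-separated groups.
POINTS = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.0),
          (10.0, 0.0), (10.1, 0.2), (10.2, 0.0)]

counts = restart_stability(POINTS, k=3, n_runs=50)
for partition, n in counts.most_common():
    print(n, sorted(sorted(c) for c in partition))
```

The fraction of runs that return the same partition gives a simple support value for each cluster, and the number of distinct partitions observed indicates how sensitive the result is to initialization.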

The PHYLIP tree treats all LIM homeodomain subfamilies as a single group and all CUT homeodomain subfamilies as a single group, and merges the HOX and Engrailed subfamilies into one cluster, whereas the Bête tree treats all of these as separate groupings. Bootstrapping was used to validate the robustness of the PHYLIP tree. However, the bootstrap score (out of 100) for all of these groupings is extremely poor; the LIM cluster has a bootstrap score of 25, the CUT cluster has a bootstrap score of 10, and the HOX/Engrailed combined cluster has a bootstrap score of 2. In general, bootstrapping did not perform well when used as a means to validate the homeodomain family tree, possibly due either to short sequence length (~ 60 residues) or to specificity determining residues governing separation into subfamilies, as specificity determining residues form a relatively small proportion (less than 10%) of the overall alignment length.
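The effect of a small number of decisive columns on bootstrapping can be quantified with a short calculation. In a bootstrap pseudoreplicate, alignment columns are resampled with replacement, so a particular column is absent with probability (1 - 1/L)^L for an alignment of L columns. The sketch below assumes an illustrative alignment length of 60 columns (taken from the approximate homeodomain length; the actual alignment is longer) and the nine SDR positions reported in this thesis.

```python
def prob_column_absent(n_columns):
    """Probability that one given column is missing from a bootstrap pseudoreplicate."""
    return (1 - 1 / n_columns) ** n_columns


def expected_missing(n_columns, n_key_columns):
    """Expected number of key columns absent from a single pseudoreplicate."""
    return n_key_columns * prob_column_absent(n_columns)


ALIGNMENT_COLUMNS = 60  # illustrative assumption, not the actual alignment length
N_SDR_POSITIONS = 9     # number of SDR positions identified by SDPpred

p_absent = prob_column_absent(ALIGNMENT_COLUMNS)
missing = expected_missing(ALIGNMENT_COLUMNS, N_SDR_POSITIONS)
print(f"P(column absent) = {p_absent:.3f}; expected SDR columns missing = {missing:.2f}")
```

Under these assumptions, roughly a third of the SDR columns would be expected to be missing from any single pseudoreplicate, which is consistent with the poor bootstrap support observed for groupings that depend on those positions.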

Sequence-to-profile validation was performed in order to evaluate whether sequences within a subfamily are more similar to other sequences within the same subfamily, or whether they may be more similar to some sequences in other subfamilies. For all but one subfamily, all sequences within subfamilies appear to be more similar to each other than to sequences in any other subfamilies; the control using randomized sequences but maintaining subfamily size and number does not exhibit this property.

SDPpred was used to identify subfamily determining residue (SDR) positions after subfamily classification was performed on the homeodomain family of proteins; nine SDR positions were identified. The advantage of using SDPpred to identify these positions is that it utilizes maximal mutual information between the amino acid composition of the columns in the multiple sequence alignment and the partitioning of the sequences into subfamilies; this captures the residues that correlate best with separation into subfamilies [162]. However, SDPpred will not identify residues that are unique only to one subfamily (e.g. HOX-specific or POU-specific residues), only residues which distinguish all subfamilies from each other. Subfamily determining residue positions identified by SDPpred appear, collectively, to be conserved within subfamilies and to differ between subfamilies, constituting a unique profile for each subfamily. All but one of the SDRs are observed to be at the protein-DNA interface; the remaining SDR appears to form an N-cap with helix 2. Additionally, a majority of residues at the protein-DNA interface (as determined by solvent accessibility) are subfamily-determining residues, although they bind the DNA backbone and do not make specific contacts with the DNA bases. Subfamily-determining residues appear to contact two half-sites in the homeodomain recognition sequence, one of which corresponds to the consensus homeodomain recognition motif TAAT [65], and the other of which is more variable and may correspond to differences in DNA specificity mediated by subfamily-determining residues. These differences support the belief that there are differences in specific DNA-binding between subfamilies. Alternatively, while this was not explored within the scope of this thesis, some differences between subfamilies at subfamily determining positions may be due to changes in protein-protein interactions with other transcription factors.
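The mutual information criterion mentioned above can be illustrated on a toy alignment. The sketch below computes the mutual information between the residues in one alignment column and the subfamily labels; it is a simplified illustration only and does not reproduce SDPpred's statistical treatment. The input uses the HOX and Engrailed SDR signatures quoted later in this chapter, duplicated to form a hypothetical four-sequence alignment.

```python
from collections import Counter
from math import log2


def column_mutual_information(column, labels):
    """Mutual information (bits) between residues in a column and subfamily labels."""
    n = len(column)
    joint = Counter(zip(column, labels))
    residues = Counter(column)
    groups = Counter(labels)
    return sum(
        (count / n) * log2((count / n) / ((residues[res] / n) * (groups[grp] / n)))
        for (res, grp), count in joint.items()
    )


# Hypothetical toy alignment built from the two SDR signatures (not the thesis data set).
SEQUENCES = ["YLYLTRQKI", "YLYLTRQKI", "FLYLTRQKI", "FLYLTRQKI"]
LABELS = ["HOX", "HOX", "Engrailed", "Engrailed"]

scores = [
    column_mutual_information([seq[i] for seq in SEQUENCES], LABELS)
    for i in range(len(SEQUENCES[0]))
]
print([round(s, 2) for s in scores])
```

Only the first column, where the residue tracks the subfamily label, carries information about the partition; fully conserved columns score zero.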

Subfamily determining residues (SDRs) determined using the approach outlined in this thesis may influence specific differences in DNA recognition, either by directly affecting the structural integrity of the domain or by exerting a specific influence on how the protein recognizes the three-dimensional structure of the DNA, which is determined by its sequence. One approach to experimentally validate these hypotheses involves selective mutagenesis of SDRs from the pattern of one subfamily to that of another. Such mutagenesis could lead to different experimental outcomes: the overall structure of the domain could be compromised, or a change in affinity for cognate DNA sequences could be observed, depending on the SDRs being mutated. For example, the HOX signature (YLYLTRQKI) and the Engrailed signature (FLYLTRQKI) differ only at the first SDR position, in helix 1; mutation from Tyr to Phe at this position is normally considered to be a conservative amino acid substitution. If structural disruption is observed via circular dichroism (CD) spectra, then interaction at this SDR position with other structurally important residues in the homeodomain can be posited. If structural disruption is not observed, then the cognate DNA sequences for the HOX and Engrailed homeodomains can be synthesized, and the affinity of the wild-type HOX homeodomain, the mutated HOX homeodomain (with Engrailed SDR signature), and the wild-type Engrailed homeodomain for each of the DNA sequences can be measured. If the subfamily determining residues affect binding affinity for the cognate DNA sequence, one would expect the mutated HOX homeodomain to have higher affinity for the Engrailed cognate DNA sequence than for the HOX sequence. It would also be interesting to perform the same experiment on homeodomain subfamilies with more differences between their SDR signatures, screening all possible mutants with respect to DNA binding affinity and structural stability.


Chapter 5 – Conclusion

The subfamily determination protocol presented in this thesis integrates structural knowledge about a protein family with sequence information in order to determine homeodomain subfamilies. Protein structure knowledge has not previously been integrated with protein sequence knowledge in an automatable subfamily determination protocol, and the addition of structural knowledge allows conclusions about the resulting subfamilies to be drawn from the structural information. Ideally, the subfamily determination would be drawn from protein structures alone, as functional conservation preserves protein structure better than it preserves protein sequence; however, too few structures are available for a given DNA-binding protein family to make this possible. Sequence properties of homeodomains, in particular the patterns of sequence variability at individual positions along the sequence, were examined in order to determine whether these properties yield any clues about the molecular function of homeodomains without making specific assumptions about evolution. Homeodomain subfamilies have been identified that appear to correspond to differences in specific function, and a set of residues has been determined that appears to uniquely identify subfamilies and to be involved in the specificity of protein-DNA binding; all but one of the nine identified residues map to the protein-DNA interface, with the remaining residue being identified as a helix-capping residue. Additionally, a putative subfamily-specific DNA binding site has been identified, in conjunction with the homeodomain DNA consensus binding sequence TAAT that has been previously described in the literature [65]. It is possible that other sequence signals are also contained within the identified subfamily-determining residues, including residues involved in interactions with other domains or other proteins; however, this was not examined within the scope of this thesis work.

The key advantage of this protocol is that subfamily classification and specificity-determining residues are evaluated according to non-subjective criteria, following clearly stated rules and thresholds. Except for the curation of protein structures, the protocol is scalable to high throughput and can be applied to any protein domain family with enough protein structures to form a structure profile. The protocol is entirely modular, and thus any one component program (structure alignment, subfamily classification, specificity residue determination) can be substituted as better programs become available, or in order to remedy the flaws of the current protocol. A partially curated database of homeodomains, including “atypical” homeodomains that are not 60 amino acids long as well as those from plants and fungi, is now available, and a subfamily classification is now available for these homeodomains in addition to the existing canonical homeodomain classification.


Appendices

Appendix A: Homeodomain Subfamily Classification Performed by Bête and Structure-Based Sequence Alignment

Bête subfamily classification is performed using the original ClustalX structure-based sequence alignment as input. Subfamilies identified by Bête are labeled “%subfamily XXX”, where XXX is the identifier for the subfamily. If the subfamily labels are removed, the output reverts to the original ClustalX multi-FASTA formatted multiple sequence alignment, with the order of the protein sequences shuffled relative to the input order but all other alignment characteristics maintained. Where possible, sequences have been annotated with identifying information from the Protein Data Bank, InterPro, or PartiGeneDB, using a Perl script.
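The labeled format described above can be read back into per-subfamily groups with a few lines of code. The following Python sketch is illustrative only and is not the Perl script used in this work: the function names `parse_bete_output` and `strip_subfamily_labels` are hypothetical, and the code assumes that each “%subfamily” label, each “>” header, and each aligned sequence occupies its own line in the original file. The example records are copied from this appendix as printed, so their gap runs may not sum to the full alignment width.

```python
from typing import Dict, List, Tuple


def parse_bete_output(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (header, aligned_sequence) records by subfamily identifier."""
    groups: Dict[str, List[List[str]]] = {}
    current_id = None
    have_record = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("%subfamily"):
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError("subfamily label without an identifier")
            current_id = parts[1]
            groups.setdefault(current_id, [])
            have_record = False
        elif line.startswith(">"):
            if current_id is None:
                raise ValueError("FASTA header before any subfamily label")
            groups[current_id].append([line[1:], ""])
            have_record = True
        else:
            if not have_record:
                raise ValueError("sequence line without a FASTA header")
            # Sequences may be wrapped over several lines; concatenate them.
            groups[current_id][-1][1] += line
    return {sid: [(h, s) for h, s in recs] for sid, recs in groups.items()}


def strip_subfamily_labels(text: str) -> str:
    """Drop the label lines, leaving a plain multi-FASTA alignment."""
    kept = [l for l in text.splitlines() if not l.startswith("%subfamily")]
    return "\n".join(kept)


EXAMPLE = """%subfamily N923
>1akhA 89 bp MATING-TYPE PROTEIN A-1 HOMEODOMAIN
------ISPQARAFLEEVF---RRKQSLNSKEKEEVAKKC----GITP------LQVRVWFINKRMRS------
%subfamily N863
>PTE05656_1 89 bp Viridiplantae-Streptophyta-Poncirus
------TPLQAKALLKFY---SEEKYPTKREMEGLAAAL----DLTY------KQVRTWFIEKRRRDK------
>GAP02607_1 89 bp
------FSESQMSALVQRF---SVQRYLTPAEMKNLAKMT----GLTX------QQVKTWFQNRR------
"""

subfamilies = parse_bete_output(EXAMPLE)
for subfamily_id, records in subfamilies.items():
    print(subfamily_id, [header.split()[0] for header, _ in records])
```

Keeping the labels on separate lines means that removing them, as `strip_subfamily_labels` does, leaves the FASTA records themselves untouched.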

%subfamily N923
>1akhA 89 bp MATING-TYPE PROTEIN A-1 HOMEODOMAIN
------ISPQARAFLEEVF---RRKQSLNSKEKEEVAKKC----GITP------LQVRVWFINKRMRS------
%subfamily N922
>2lfb 89 bp LFB1/HNF1 TRANSCRIPTION FACTOR
------RRNRFKWGPASQQILFQAY---ERQKNPSKEERETLVEEC----NVTE------VRVYNWFANRRKEE------
%subfamily N921
>HJP06140_1 89 bp
------FAPHVEKRLEEYY---QKQPFPNDSESAFLASNL----GIEP------FHLGLWFHHRRER------
%subfamily N920
>CND11255_1 89 bp Fungi-Basidiomycota-Filobasidiellaneoformans
------FTKRELEALEVLW---SIAKSPSKYERQRLGAWL----GVKT------KHITVWFQKRRQEEKR-----
%subfamily N919
>HGP09547_1 89 bp
------ISSDQLDKLENVF---EIRKHLNASEQGRLGKAI----GLTE------EQVNEWFEQRRNRWR------
%subfamily N866
>INP10932_1 89 bp
------FGEDAIKRLNEAF---KENHYPKRNVKESLAREL----GLTL------RQVDKWFGNSR------
>sp|Q22812|HM39_CAEEL 89 bp Homeobox protein ceh-39
------RIRRFTFTQTQLDSLHTVF---QQQDRPNREMQQALSATL----KLNR------STVGNFFMNARRRLPK-----
>sp|Q22811|HM21_CAEEL 89 bp Homeobox protein ceh-21
------RLTFTETQLKSLQKSF---QQNHRPTREMRQKLSATL----ELDF------STVGNFFMNSRRRLR------
%subfamily N863
>PTE05656_1 89 bp Viridiplantae-Streptophyta-Poncirus
------TPLQAKALLKFY---SEEKYPTKREMEGLAAAL----DLTY------KQVRTWFIEKRRRDK------
>GAP02607_1 89 bp
------FSESQMSALVQRF---SVQRYLTPAEMKNLAKMT----GLTX------QQVKTWFQNRR------
%subfamily N870
>IPD11218_1 89 bp Metazoa-Chordata-Ictalurus
------FTKEHLELLRMAF---NVDPYPGISVRESLSQAT----GLPE------SRIQVWFQNKRAR------
>IPD20139_1 89 bp Metazoa-Chordata-Ictalurus
------FTKEHLELLRMAF---NVDPYPGISVRESLSQAT----GLPE------SRIQVWFQNKRAR------
>TVE05086_1 89 bp Parabasalidea-Trichomonada-Trichomonas
------FSDSQRSILMHWLKNHQSNPYPTSSEKQELIEKT----GLNR------DQINVWFTNNR--VRH-----
>PYD04598_1 89 bp Rhodophyta-Bangiophyceae-Porphyra
------TESQRRLLSATF---AQNPYPDVTTKNLLAEQL----GVNR------PRVSKWFQHRRQRARR-----
>ENP07708_1 89 bp
------YSPEDYAILEAEY---QRNPKPDKISRASIVSRV----SLGE------KEVQIWFQNRRQNDRR-----
>ACF08129_1 89 bp Viridiplantae-Streptophyta-Allium
------YTPDXNQVLEEFY---CRNPHPDANDRKQLGKAL----GFTD------DRIKYWFQNRRAVDRR-----
%subfamily N856
>PBD05144_1 89 bp Fungi-Ascomycota-Paracoccidioides
------LTKEQVDTLEAQF---QTHPKPNSNVKRQLATQT----NLSL------PRVANWFQNRRAKAK------
>CPG01541_1 89 bp Fungi-Ascomycota-Coccidioides
------LTKEQVETLEAQF---RAQPKPTSNVKRQLAMQT----NLTL------PRVANWFQNRRAKEK------
>CND18854_1 89 bp Fungi-Basidiomycota-Filobasidiellaneoformans
------TTPEQLKVLEFWY---DINPKPDNQLREQLAAQL----GMTK------RNVQVWFQNRRAKMK------
>PBD14388_1 89 bp Fungi-Ascomycota-Paracoccidioides
------ATQDQLATLEMEF---NKNPTPTAAVRERIAEEI----NMTE------RSVQIWFQNR------
>HJP02417_1 89 bp
------ATQDQLTTLEMEF---NKNPTPTASVRDRIAEEI----NMTE------RSVQIWFQNRRAKIK------
%subfamily N858
>BNP16314_1 89 bp
------HTTDQIRHMEALF---KETPHPDEKQRQQLSKQL----GLAP------RQVKFWFQNRRTQIK------
>GHP06126_1 89 bp


------TADQIREMEALF---KESPHPDEKQRQQLSKQL----GLAP------RQVKFWFQNRRTQIK------>EHP01599_1 89 bp ------IPKHALQTLEQVF---KDDKFPSVETRKNLAAEL----RVTP------RQVQVWFQNKRQR------>CND11316_1 89 bp Fungi-Basidiomycota-Filobasidiellaneoformans ------TNDVQLAMLSDVF---QRTQYPSTEERDELARQL----GMTS------RSVQIWFQNRRRAVK------%subfamily N918 >TVE09682_1 89 bp Parabasalidea-Trichomonada-Trichomonas ------FNKEQKIKLERIF---KINPTPKIRQRDEIAREL----NIPL------KSVTYWFQNRRS------%subfamily N860 >sp|P07548|DFD_DROME 89 bp Homeotic protein deformed ------KRQRTAYTRHQILELEKEF---HYNRYLTRRRRIEIAHTL----VLSE------RQIKIWFQNRRMKWKK----- >sp|O57374|HXD4A_BRARE 89 bp Homeobox protein Hox-D4a ------KRSRTAYTRQQVLELEKEF---HFNRYLTRRRRIESAHTL----SLSE------RQIKIWFQNRRMKWKK----- >AMF15721_1 89 bp Metazoa-Chordata-Ambystoma ------YTRQQVLELEKEF---HYNRYLTRRRRIEIAHSL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09017|HXC4_HUMAN 89 bp Homeobox protein Hox-C4 ------KRSRAAYTRQQVLELEKEF---HYNRYLTRRRRIEIAHSL----CLSE------RQIKIWFQNRRMKWKK----- >sp|O13074|HXB4_FUGRU 89 bp Homeobox protein Hox-B4 ------KRSRTAYTRQQVLELEKEF---HYNRYLTRRRRVEIAHTL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P06798|HXA4_MOUSE 89 bp Homeobox protein Hox-A4 ------KRSRTAYTRQQVLELEKEF---HFNRYLTRRRRIEIAHTL----CLSE------RQVKIWFQNRRMKWKK----- >IPD19513_1 89 bp Metazoa-Chordata-Ictalurus ------YTRQQVLELEKEF---HFNRYLTRRRRIEIAHTL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09016|HXD4_HUMAN 89 bp Homeobox protein Hox-D4 ------KRSRTAYTRQQVLELEKEF---HFNRYLTRRRRIEIAHTL----CLSE------RQIKIWFQNRRMKWKK----- >1b8iA 89 bp ULTRABITHORAX ------RQTYTRYQTLELEKEF---HTNHYLTRRRRIEMAHAL----SLTE------RQIKIWFQNRRMKLKK----- >1ftz 89 bp FUSHI TARAZU PROTEIN ------KRTRQTYTRYQTLELEKEF---HFNRYITRRRRIDIANAL----SLSE------RQIKIWFQNRRMKSKK----- >sp|P09020|HXC5_XENLA 89 bp Homeobox protein Hox-C5 ------KRSRTSYTRYQTLELEKEF---HFNRYLTRRRRIEIANNL----CLNE------RQIKIWFQNRRMKWKK----- >sp|P02832|HXC6_XENLA 89 bp Homeobox protein Hox-C6 
------RRGRQIYSRYQTLELEKEF---HFNRYLTRRRRIEIANAL----CLTE------RQIKIWFQNRRMKWKK----- >sp|P09019|HXB5_XENLA 89 bp Homeobox protein Hox-B5 ------KRARTAYTRYQTLELEKEF---HFNRYLTRRRRIEIAHTL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09013|HXB5B_BRARE 89 bp Homeobox protein Hox-B5b ------KRARTAYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09021|HXA5_MOUSE 89 bp Homeobox protein Hox-A5 ------KRARTAYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09014|HXB5A_BRARE 89 bp Homeobox protein Hox-B5a ------KRARTAYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLSE------RQIKIWFQNRRMKWKK----- >sp|P09023|HXB6_MOUSE 89 bp Homeobox protein Hox-B6 ------RRGRQTYTRYQTLELEKEF---HYNRYLTRRRRIEIAHAL----CLTE------RQIKIWFQNRRMKWKK----- >1hom 89 bp ANTENNAPEDIA PROTEIN ------KRGRQTYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLTE------RQIKIWFQNRRMKWKK----- >sp|P02833|ANTP_DROME 89 bp Homeotic protein antennapedia ------KRGRQTYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLTE------RQIKIWFQNRRMKWKK----- >sp|P02830|HXA7_MOUSE 89 bp Homeobox protein Hox-A7 ------KRGRQTYTRYQTLELEKEF---HFNRYLTRRRRIEIAHAL----CLTE------RQIKIWFQNRRMKWKK----- >sp|P09024|HXB7_MOUSE 89 bp Homeobox protein Hox-B7 ------KRGRQTYTRYQTLELEKEF---HYNRYLTRRRRIEIAHTL----CLTE------RQIKIWFQNRRMKWKK----- >sp|P04476|HXB7B_XENLA 89 bp Homeobox protein Hox-B7 B ------KRGRQTYTRYQTLELEKEF---HFNRYLTRRRRIEIAHTL----CLTE------RQIKIWFQNRRMKWKK----- >GMD03350_1 89 bp Metazoa-Arthropoda-Glossinamorsitans ------FSSFQRKGLEIQF---QQQKYITKPDRRKLAARL----NLTD------AQVKVWFQNRRMKWR------>GAP01770_1 89 bp ------FTNHQIYELEKRF---LYQKYLSPADRDQIAQQL----GLTN------AQVITWFQNRRAKLKR----- >SSF35481_1 89 bp Metazoa-Chordata-Salmo ------FTDHQLAQLERSF---ERQKYLSVQDRMELAASL----NLTD------TQVKTWYQNRR------>BMD57173_1 89 bp Metazoa-Arthropoda-Bombyx ------FTDHQLQTLEKSF---ERQKYLSVQDRMELAAKL----GLTD------TQVKTWYQNRRTKWKR----- >SSF01710_1 89 bp Metazoa-Chordata-Salmo ------FTELQLMGLEKRF---EKQKYLSTPDRIDLAECL----DLSQ------LQVKTWYQNRRMKWKK----- 
>MCP03763_1 89 bp ------FSDQQLQGLEQRF---NGQKYLSTPERISLAESL----HLSE------TQVKTWFQNRRMK------>CVP02277_1 89 bp ------FSDQQLNGLEKRF---EAQRYLSTPERVELANQL----SLSE------TQVKTWFQNRRMKHKK----- >BMD56378_1 89 bp Metazoa-Arthropoda-Bombyx ------FTSEQLLELEREF---HAKKYLSLTERSQIAAAL----KLSE------VQVKIWFQNRRAKWKR----- >ECD01305_1 89 bp Metazoa-Chordata-Equus ------FSHTQVXELERKF---XRQKYLSAPERAHLAXNL----KLTE------XQVKIWFQNRRYKTKR----- >IPD17949_1 89 bp Metazoa-Chordata-Ictalurus ------FTHLQVLELEKKF---SRQRYLSAPERAHLASAL----RLTE------TQVKIWFQNRRYKTKR----- >CCE05724_1 89 bp Metazoa-Chordata-Cyprinus ------FTTSQLLVLERKF---LQKQYLSIAERAEFSNSL----NLTE------TQVKIWFSNTRAKAKR----- >BME03454_1 89 bp Metazoa-Arthropoda-Boophilus ------FTTQQLLALERKF---RVKQYLSIAERAEFSSSL----NLTE------TQVKIWFQNRRAKEKR----- >1ig7A 89 bp null ------RKPRTPFTTAQLLALERKF---RQKQYLSIAERAEFSSSL----SLTE------TQVKIWFQNRRAKAKR----- >SPE19093_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FSGRQIFELEKQF---EVKKYLSASERAELASLL----NVTD------TQVKIWFQNRRTKWKK----- >SSF54974_1 89 bp Metazoa-Chordata-Salmo ------FSKRQIFQLESTF---DMKRYLSSAERACLASSL----QLTE------TQVKIWFQNRRNKLKR----- >SSP03100_1 89 bp ------FSRHQVSQLEMTF---DMKRYLSSQERAHLASNL----QLTE------TQVKIWFQNRRNKWKR----- >SPE27538_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FSRSQVFQLESTF---EVKRYLSSSERAGLAANL----HLTE------TQVKIWFQNRRNKWKR----- >SSF22556_1 89 bp Metazoa-Chordata-Salmo ------FSRVQICELEKRF---HRQKYLASAERATLAKSL----KMTD------AQVKTWFQNRRTKWRR----- >SPE11478_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FSNDQTMELEKKF---ENQKYLSPPERKKLAKVL----QLSE------RQVKTWFQNRRAKWRR----- >SPE27764_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTREQIGRLEKEF---ARENYVSRPKRCELATAL----NLPE------TTIKVWFQNRRMKDKR----- >1jggA 89 bp Segmentation protein even-skipped ------RYRTAFTRDQLGRLEKEF---YKENYVSRPRRCELAAQL----NLPE------STIKVWFQNRRMKDKR-----


>CSH13076_1 89 bp Metazoa-Chordata-Ciona ------FTHEQVRQLELDF---SENHYLTRLRRYELSLKL----SLTE------RQIKVWFQNRRMKLKR----- >SPE28572_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTKEQIRELENEF---NHHNYLTRLRRYEIAVTL----NLTE------RQVKVWFQNRRMKWKR----- >CCE07325_1 89 bp Metazoa-Chordata-Cyprinus ------FTKEQIRELESEF---AHHNYLTRLRRYEIAVNL----DLTE------RQVKVWFQNRRMKWKR----- >TCF01014_1 89 bp Metazoa-Arthropoda-Tribolium ------YTSAQLVELEREF---HHGKYLSRPRRIQIAENL----NLSE------RQIKIWFQNRRMKHKK----- >sp|O42365|HXA2B_BRARE 89 bp Homeobox protein Hox-A2b ------RRLRTAYTNTQLLELEKEF---HFNKYLCRPRRVEIAALL----DLTE------RQVKVWFQNRRMKHKR----- >sp|O43364|HXA2_HUMAN 89 bp Homeobox protein Hox-A2 ------RRLRTAYTNTQLLELEKEF---HFNKYLCRPRRVEIAALL----DLTE------RQVKVWFQNRRMKHKR----- >sp|O42367|HXB2A_BRARE 89 bp Homeobox protein Hox-B2a ------RRLRTAYTNTQLLELEKEF---HFNKYLCRPRRVEIAALL----DLTE------RQVKVWFQNRRMKHKR----- >sp|O42368|HXB3A_BRARE 89 bp Homeobox protein Hox-B3a ------KRARTAYTSAQLVELEKEF---HFNRYLCRPRRVEMANLL----NLSE------RQIKIWFQNRRMKYKK----- >sp|O43365|HXA3_HUMAN 89 bp Homeobox protein Hox-A3 ------KRARTAYTSAQLVELEKEF---HFNRYLCRPRRVEMANLL----NLTE------RQIKIWFQNRRMKYKK----- >sp|P02831|HXA3_MOUSE 89 bp Homeobox protein Hox-A3 ------KRARTAYTSAQLVELEKEF---HFNRYLCRPRRVEMANLL----NLTE------RQIKIWFQNRRMKYKK----- >sp|O93353|HXD3_CHICK 89 bp Homeobox protein Hox-D3 ------KRARTAYTSAQLVELEKEF---HFNRYLCRPRRVEMANLL----NLTE------RQIKIWFQNRRMKYKK----- >SPE27756_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------YSKLQIYELEKEF---TTNMYLTRDRRSKLSQAL----DLTE------RQVKIWFQNRRMKMKK----- >DIP01793_1 89 bp ------YTKYQTLELEKEF---LYNTYVSKQKRYELAKNL----YLSE------RQVKIWFQNRRMKDKK----- >SSF11272_1 89 bp Metazoa-Chordata-Salmo ------YSRAQLLELEKEF---LFNKYISRPRRYELATTL----KLTE------RHIKIWFQNRRMKWKK----- >SSF44037_1 89 bp Metazoa-Chordata-Salmo ------YSRAQLLELEKEF---LFNKYISRPRRYELATTL----NLTE------RHIKIWFQNRRMKWKK----- >ASP07310_1 89 bp ------YTRNQVLELEKEF---HFNKYLTRKRRIEIAHSL----MLTE------HQVKIWFQ------>AMF13795_1 
89 bp Metazoa-Chordata-Ambystoma ------YSRYQTLELEKEF---LFNPYLTRKRRIEVSHAL----GLTE------RQVKIWFQNRRMKWKK----- >CCE05499_1 89 bp Metazoa-Chordata-Cyprinus ------YTKHRTLELEKEF---LFNMYLTPERRLEIXKSI----NLTD------RQVKIWFQNRRMKLKK----- >1pufA 89 bp Homeobox protein Hox-A9 ------RKKRCPYTKHQTLELEKEF---LFNMYLTRDRRYEVARLL----NLTE------RQVKIWFQNRRMKMKK----- >BMD62527_1 89 bp Metazoa-Arthropoda-Bombyx ------FTGTQLLELEREF---SMNMYLSRLRRIEIASRL----KLSE------KQVKIWFQNRKVKLXK----- >SRP01584_1 89 bp ------FTTHQLTELEKEY---YTSKYLDRSRRREIAKQL----ALNE------TQVKIWFQNRRMKEKK----- >sp|P09022|HXA1_MOUSE 89 bp Homeobox protein Hox-A1 ------RTNFTTKQLTELEKEF---HFNKYLTRARRVEIAASL----QLNE------TQVKIWFQNRRMKQKK----- >1b72A 89 bp HOMEOBOX PROTEIN HOX-B1 ------LRTNFTTRQLTELEKEF---HFNKYLSRARRVEIAATL----ELNE------TQVKIWFQNRRMKQKK----- >CSH18746_1 89 bp Metazoa-Chordata-Ciona ------YTKFQLAELEREF---TANEFISREMREEIATRV----GLND------RQVKIWFQNRRMKKKR----- >AMF15085_1 89 bp Metazoa-Chordata-Ambystoma ------YTGRQLAELEKEF---RTNRYITSGRKEELVASL----GLTN------RQVKIWFQNRRAKEKR----- >BFP09676_1 89 bp ------YSDHQRLELEKEF---YSNKYITIKRKVQLANEL----GLSE------RQVKIWFQNRRAKQRK----- >AMF10710_1 89 bp Metazoa-Chordata-Ambystoma ------YTDHQRLELEKEF---HYNRYITITRKAQLAANL----RLSE------RQIKIWFQNRRAKERK----- >AMF10171_1 89 bp Metazoa-Chordata-Ambystoma ------YTDHQRLELEKEF---HYSRYITIRRKAELAMAL----GLSE------RQVKIWFQNRRAKERK----- >SSF11513_1 89 bp Metazoa-Chordata-Salmo ------YTDHQRLELEKEF---HYSRYITIRRKAELATSL----SLSE------RQVKIWFQNRRAKERK----- >HCG02656_1 89 bp Metazoa-Arthropoda-Homalodisca ------YTDHQRLELEKEF---HYSRYITIRRKAELATNL----GLSE------RQVKIWFQNRRAKERK----- >PPP02701_1 89 bp ------YSDYIRLELEKEF---HMNQFINADRKADLATKL----NLTE------RQIKIWFQNRRAKKRR----- >ASP08402_1 89 bp ------YTDYQRLELEKEF---RTTQFINSERKSQLSSEL----QLTE------RQIKIWFQNRRAKDRR----- >AMF17136_1 89 bp Metazoa-Chordata-Ambystoma ------YSKGQLRELEKEY---ASSKFITKDRRRQIATDT----NLSE------RQITIWFQNRRVKEK------>MCP00368_1 89 bp 
------YGKGQTIDLETEY---CTSPYVTKQRRYELSRKL----GLTE------RQVKIWFQNRRMKTKK----- >HRP00014_1 89 bp ------FTPEQLERLEREF---LKQQYMVGTERFYLAKEL----NLGE------AQVKVWFQNRRIKWRK----- >ASP20371_1 89 bp ------FTPTQADTLEKEY---LTDQYMPRTRRILIAESL----GLSE------GQVKTWFQNRRAKEKR----- >TCP00488_1 89 bp ------FTPAQADTLEKEY---LTDQYMPRTRRILIAESL----GLNE------GQVKTWFQNRRAKEKR----- >BMD20907_1 89 bp Metazoa-Arthropoda-Bombyx ------FTGDQQLRLEQTL---EKTQYINGTDRRELAQKW----GIGE------KGIKIWFQNRRMKNKR----- >BMD49473_1 89 bp Metazoa-Arthropoda-Bombyx ------FTTEQINYLENEF---KKSHYISAVQRKEIANIV----NVPE------KVIKIWFQNRRMREKK----- %subfamily N775 >sp|P31535|HMENA_MYXGL 89 bp Homeobox protein engrailed-like A ------ADQLARLRAEF---QANRYLTEERRQNLAREL----SLNE------AQIKIWFQNKRAKIKK----- >sp|P52730|HME2B_XENLA 89 bp Homeobox protein engrailed-2-B ------KRPRTAFTAEQLQRLKAEF---QTNRYLTEQRRQSLAQEL----GLNE------SQIKIWFQNKRAKIKK----- >sp|P09015|HME2A_BRARE 89 bp Homeobox protein engrailed-2a ------KRPRTAFTAEQLQRLKAEF---QTNRYLTEQRRQSLAQEL----GLNE------SQIKIWFQNKRAKIKK----- >sp|P52729|HME2A_XENLA 89 bp Homeobox protein engrailed-2-A ------KRPRTAFTADQLQRLKAEF---QTNRYLTEQRRQSLAQEL----SLNE------SQIKIWFQNKRAKIKK----- >sp|P19622|HME2_HUMAN 89 bp Homeobox protein engrailed-2 ------KRPRTAFTAEQLQRLKAEF---QTNRYLTEQRRQSLAQEL----SLNE------SQIKIWFQNKRAKIKK----- >sp|P09066|HME2_MOUSE 89 bp Homeobox protein engrailed-2 ------KRPRTAFTAEQLQRLKAEF---QTNRYLTEQRRQSLAQEL----SLNE------SQIKIWFQNKRAKIKK----- >sp|Q05916|HME1_CHICK 89 bp Homeobox protein engrailed-1 ------KRPRTAFTAEQLQRLKAEF---QANRYITEQRRQSLAQEL----SLNE------SRVKIWFQNKRAKIKK----- >sp|P31537|HME1A_XENLA 89 bp Homeobox protein engrailed-1-A ------AEQLQRLKAEF---QANRYITEQRRQSLAQEL----SLNE------SQIKIWFQNKRAKIKK-----


>sp|P31538|HME1B_XENLA 89 bp Homeobox protein engrailed-1-B ------KRPRTAFTAEQLQRLKAEF---QANRYITEQRRQTLAQEL----SLNE------SQIKIWFQNKRAKIKK----- >sp|P09065|HME1_MOUSE 89 bp Homeobox protein engrailed-1 ------KRPRTAFTAEQLQRLKAEF---QANRYITEQRRQTLAQEL----SLNE------SQIKIWFQNKRAKIKK----- >sp|Q04896|HME1A_BRARE 89 bp Homeobox protein engrailed-1a ------KRPRTAFTAEQLQRLKAEF---QTSRYITEQRRQALAREL----GLNE------SQIKIWFQNKRAKIKK----- >sp|P31533|HME2B_BRARE 89 bp Homeobox protein engrailed-2b ------KRPRTAFTAEQLQRLKNEF---QNNRYLTEQRRQALAQEL----GLNE------SQIKIWFQNKRAKIKK----- >sp|Q05640|HMEN_ARTSF 89 bp Homeobox protein engrailed ------KRPRTAFTAEQLSRLKHEF---NENRYLTERRRQDLAREL----GLHE------NQIKIWFQNNRAKLKK----- >sp|O02491|HMEN_ANOGA 89 bp Segmentation polarity homeobox protein engrailed ------KRPRTAFSNAQLQRLKNEF---NENRYLTEKRRQTLSAEL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P05527|HMIN_DROME 89 bp Homeobox protein invected ------KRPRTAFSGTQLARLKHEF---NENRYLTEKRRQQLSGEL----GLNE------AQIKIWFQNKRAKLKK----- >3hddA 89 bp ENGRAILED HOMEODOMAIN ------RTAFSSEQLARLKREF---NENRYLTERRRQQLSSEL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P09145|HMEN_DROVI 89 bp Segmentation polarity homeobox protein engrailed ------KRPRTAFSSEQLARLKREF---NENRYLTERRRQQLSSEL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P02836|HMEN_DROME 89 bp Segmentation polarity homeobox protein engrailed ------KRPRTAFSSEQLARLKREF---NENRYLTERRRQQLSSEL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P09076|HME30_APIME 89 bp Homeobox protein E30 ------KRPRTAFSAEQLARLKREF---AENRYLTERRRQQLSRDL----GLTE------AQIKIWFQNKRAKIKK----- >sp|P09075|HME60_APIME 89 bp Homeobox protein E60 ------KRPRTAFSGEQLARLKREF---AENRYLTERRRQQLSRDL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P14150|HMEN_SCHAM 89 bp Segmentation polarity homeobox protein engrailed ------KRPRTAFSGEQLARLKHEF---TENRYLTERRRQELAREL----GLNE------AQIKIWFQNKRAKIKK----- >sp|P27610|HMIN_BOMMO 89 bp Homeobox protein invected ------KRPRTAFSGPQLARLKHEF---AENRYLTERRRQSLAAEL----GLAE------AQIKIWFQNKRAKIKK----- 
>sp|P27609|HMEN_BOMMO 89 bp Segmentation polarity homeobox protein engrailed ------KRPRTAFSGAQLARLKHEF---AENRYLTERRRQSLAAEL----GLAE------AQIKIWFQNKRAKIKK----- >sp|P09532|HMEN_TRIGR 89 bp Homeobox protein engrailed ------KRPRTAFSASQLQRLKQEF---QQSNYLTEQRRRSLAKEL----TLSE------SQIKIWFQNKRAKIKK----- >sp|P23397|HMEN_HELTR 89 bp Homeobox protein Ht-En ------KRPRTAFTGDQLARLKREF---SENKYLTEQRRTCLAKEL----NLNE------SQIKIWFQNKRAKMKK----- >sp|P34326|HM16_CAEEL 89 bp Homeobox protein engrailed-like ceh-16 ------KRPRTAFTGDQLDRLKTEF---RESRYLTEKRRQELAHEL----GLNE------SQIKIWFQNKRAKLKK----- %subfamily N864 >PTP01283_1 89 bp ------FNDTQLDELEKCF---KMCQYPDVSLREKLSKEI----NLPE------ARIQVWFKNRCAKHRR----- >sp|O15266|SHOX_HUMAN 89 bp Short stature homeobox protein ------RRSRTNFTLEQLNELERLF---DETHYPDAFMREELSQRL----GLSE------ARVQVWFQNRRAKCRK----- >sp|O60902|SHOX2_HUMAN 89 bp Short stature homeobox protein 2 ------RRSRTNFTLEQLNELERLF---DETHYPDAFMREELSQRL----GLSE------ARVQVWFQNRRAKCRK----- >sp|O35750|SHOX2_RAT 89 bp Short stature homeobox protein 2 ------RRSRTNFTLEQLNELERLF---DETHYPDAFMREELSQRL----GLSE------ARVQVWFQNRRAKCRK----- >OOP03311_1 89 bp ------FNRQQLEVLETLF---EATQYPDVFTREKVAEQI----QLQE------SRIQVWFKNRRAKHR------>SPE28460_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTRAQLDVLETLF---SRTRYPDIFMREEVAMKI----NLPE------SRVQVWFKNRRAKCR------>SPE17901_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTRAQLDVLETLF---SRTRYPDIFMREEVAMKI----NLPE------SRVQVWFKNRRAKCR------>sp|O54751|CRX_MOUSE 89 bp Cone-rod homeobox protein ------RRERTTFTRSQLEELEALF---AKTQYPDVYAREEVALKI----NLPE------SRVQVWFKNRRAKCR------>sp|O43186|CRX_HUMAN 89 bp Cone-rod homeobox protein ------RRERTTFTRSQLEELEALF---AKTQYPDVYAREEVALKI----NLPE------SRVQVWFKNRRAKCR------>SPE27868_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTTFQLHQLERAF---DMTQYPDVFMREELALRL----DLSE------SRVQVWFQNRRAKWRK----- >BMD62598_1 89 bp Metazoa-Arthropoda-Bombyx ------FTPQQLSELESLF---QKTHYPDVFLREEVALRI----SLSE------ARVQVWXQXRRAKWRK----- 
>sp|O09113|OTP_MOUSE 89 bp Homeobox protein orthopedia ------KRHRTRFTPAQLNELERSF---AKTHYPDIFMREELALRI----GLTE------SRVQVWFQNRRAKWKK----- >sp|O35690|PHX2B_MOUSE 89 bp Paired mesoderm homeobox protein 2B ------RRIRTTFTSAQLKELERVF---AETHYPDIYTREELALKI----DLTE------ARVQVWFQNRRAKFRK----- >sp|O14813|PHX2A_HUMAN 89 bp Paired mesoderm homeobox protein 2A ------RRIRTTFTSAQLKELERVF---AETHYPDIYTREELALKI----DLTE------ARVQVWFQNRRAKFRK----- >sp|O42115|ARX_BRARE 89 bp Aristaless-related homeobox protein ------RRYRTTFTSYQLEELERAF---QKTHYPDVFTREELAMRL----DLTE------ARVQVWFQNRRAKWRK----- >sp|O35085|ARX_MOUSE 89 bp Homeobox protein ARX ------RRYRTTFTSYQLEELERAF---QKTHYPDVFTREELAMRL----DLTE------ARVQVWFQNRRAKWRK----- >sp|O70137|ALX3_MOUSE 89 bp Homeobox protein aristaless-like 3 ------RRNRTTFSTFQLEELEKVF---QKTHYPDVYAREQLALRT----DLTE------ARVQVWFQNRRAKWRK----- >sp|O35137|ALX4_MOUSE 89 bp Homeobox protein aristaless-like 4 ------RRNRTTFTSYQLEELEKVF---QKTHYPDVYAREQLAMRT----DLTE------ARVQVWFQNRRAKWRK----- >AYP03228_1 89 bp ------FSAYQLDELEKVF---ARTHYPDVFTREELAQRV----TLTE------ARVQVWFQNRRAKFRK----- >1fjlA 89 bp PAIRED PROTEIN ------RRSRTTFSASQLDELERAF---ERTQYPDIYTREELAQRT----NLTE------ARIQVWFQNRRARLRK----- >sp|O42477|VSX2_BRARE 89 bp Visual system homeobox 2 ------RRHRTIFTSYELEELEKAF---NEAHYPDVYAREMLAMKT----ELPE------DRIQVWFQNRRAKWRK----- >sp|O42250|VSX1_BRARE 89 bp Visual system homeobox 1 ------RRHRTVFTSHQLEELEKAF---NEAHYPDVYAREMLAMKT----ELPE------DRIQVWFQNRRAKWRK----- >BFP10457_1 89 bp ------YSQWQLEELEKAF---ETTQYPDIFMREALALRL----DLIE------ARVQVWFQNRRAKLRR----- >AME09631_1 89 bp Metazoa-Arthropoda-Apis ------FNSWQLEELERAF---LSSHYPDVFMREALAVRL----ELKE------SRVAVWFQNRRAKWRK----- >sp|O73917|PAX6_ORYLA 89 bp Paired box protein Pax-6 ------RNRTSFTQEQIEALEKEF---ERTHYPDVFARERLAAKI----DLPE------ARIQVWFSNRRAKWRR----- >ASP02539_1 89 bp ------FTTFQLHELEQAF---EKCHYPDVYARELLAQKV----KLPE------VRVQVWFQNRRAKWRR----- >sp|O35602|RX_MOUSE 89 bp Retinal homeobox protein Rx 
------RRNRTTFTTYQLHELERAF---EKSHYPDVYSREELAGKV----NLPE------VRVQVWFQNRRAKWRR----- >sp|O42358|RX3_BRARE 89 bp Retinal homeobox protein Rx3 ------RRNRTTFTTFQLHELERAF---EKSHYPDVYSREELALKV----NLPE------VRVQVWFQNRRAKWRR----- >sp|O42567|RX2_XENLA 89 bp ------RRNRTTFTTYQLHELERAF---EKSHYPDVYSREELAMKV----NLPE------VRVQVWFQNRRAKWRR-----


>sp|O42357|RX2_BRARE 89 bp Retinal homeobox protein Rx2 ------RRNRTTFTTYQLHELERAF---EKSHYPDVYSREELAMKV----NLPE------VRVQVWFQNRRAKWRR----- >sp|O42356|RX1_BRARE 89 bp Retinal homeobox protein Rx1 ------RRNRTTFTTYQLHELERAF---EKSHYPDVYSREELAMKV----NLPE------VRVQVWFQNRRAKWRR----- >sp|O42201|RX1_XENLA 89 bp ------RRNRTTFTTYQLHELERAF---EKSHYPDVYSREELAMKV----NLPE------VRVQVWFQNRRAKWRR----- >SRP04407_1 89 bp ------FTQEQLAELDNAF---QKSHYPDIYVREELARIT----KLNE------ARIQVWFQNRRAKHRK----- >AYP02578_1 89 bp ------FTQEQLAELDSAF---QKSHYPDIYVREELARIT----KLNE------ARIQVWFQNRRAKHRK----- >sp|O35160|PITX3_MOUSE 89 bp Pituitary homeobox 3 ------RRQRTHFTSQQLQELEATF---QRNRYPDMSTREEIAVWT----NLTE------ARVRVWFKNRRAKWRK----- >sp|O18381|PAX6_DROME 89 bp Paired box protein Pax-6 ------RRQRTHFTSQQLQELEHTF---SRNRYPDMSTREEIAMWT----NLTE------ARVRVWFKNRRAKWRK----- >sp|O15499|GSCL_HUMAN 89 bp Homeobox protein goosecoid-like ------RRHRTIFSEEQLQALEALF---VQNQYPDVSTRERLAGRI----RLRE------ERVEVWFKNRRAKWR------>SSF27381_1 89 bp Metazoa-Chordata-Salmo ------FPPSQVEELEKVF---LETHYPDVHIRDKLASRL----QLTE------GRVQIWFQNRRAKWRK----- >SSF10042_1 89 bp Metazoa-Chordata-Salmo ------FTSAQLEVLERFF---QESQYPDIHSRELLASQT----QLSE------ARVQIWFQNRRVKWRK----- >MIP05948_1 89 bp ------FSEEQICYLEDFFG--NTCHYPDSYQKEEIARRL----NITT------DRITVWFQNRRSKFRK----- >TDP00719_1 89 bp ------FTEAQSLLLEEAF---QESHYPDQTAKKDMAEKL----DIPE------DRITVWFQNRRAKWRR----- >sp|O43316|PAX4_HUMAN 89 bp Paired box protein Pax-4 ------RNRTIFSPSQAEALEKEF---QRGQYPDSVARGKLATAT----SLPE------DTVRVWFSNRRAKWRR----- %subfamily N917 >SPE02053_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FNLAQINALERIF---LDVEYPDGYLKAKLANRL----EVDE------NTVQIWFQNRRAKKKR----- %subfamily N847 >IPD18117_1 89 bp Metazoa-Chordata-Ictalurus ------FSQAQVFELERRF---KQQRYLSAPEREHLANTL----KLTS------TQVKIWFQNRRYKCKR----- >CSG02378_1 89 bp Metazoa-Arthropoda-Callinectes ------FSQAQVYELERRF---KQQRYLSAPEREHLAGLL----KLTS------TQVKIWFQNRRYKCKR----- >PBP05845_1 89 bp 
------FSKSQTFELERRF---KQARYLSAPEREHLASMI----NLTP------TQVKIWFQNHRYKTKR----- >SSF08630_1 89 bp Metazoa-Chordata-Salmo ------FSQAQVYELERRF---KQQKYLSAPEREHLASMI----HLSP------TQVKIWFQNHRYKMKR----- >1ftt 89 bp THYROID TRANSCRIPTION FACTOR 1 HOMEODOMAIN ------RKRRVLFSQAQVYELERRF---KQQKYLSAPEREHLASMI----HLTP------TQVKIWFQNHRYKMKR----- >SPE35398_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FSKAQTYELERRF---RQQRYLSAPEREHLASII----RLSP------TQVKIWFQNHRYKLKR----- >1nk2P 89 bp HOMEOBOX PROTEIN VND ------RKRRVLFTKAQTYELERRF---RQQRYLSAPEREHLASLI----RLTP------TQVKIWFQNHRYKTKR----- >AMF10979_1 89 bp Metazoa-Chordata-Ambystoma ------FSKAQTLQLERRF---RQQRYLSAPERDHLAHLL----HLTP------TQVKIWFQNHRYKMKR----- >IFP07589_1 89 bp ------FSKAQTYELERRF---RQQRYLSAQEREQLAHLL----RLTP------TQVKIWFQNHRYKMKR----- >MIP05372_1 89 bp ------FSQQQVCELEKMF---RQKKYLNAPERESLAQAI----GLKP------TQVKIWFQNHRYKCKR----- >WBP00763_1 89 bp ------FTQSQVNELEERF---KLQRYVNAAERERLAVTL----GLTS------TQVKIWFQNRRYKCKR----- >SPE07312_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FTRNQVYSMERRF---DQQRYLTSVERKEFSNSI----GLDD------HHVKIWFQNRRSKLKK----- >MIP04826_1 89 bp ------FTQNQVTRLEYMF---SMKQYLSAQEREQISSEI----GLKP------NQVKIWFQNHRYKIKR----- >MIP04433_1 89 bp ------FTQNQVTRLEYMF---SIKHYLSAQEREQISSEI----GLKP------NQVKIWFQNHRYKIKR----- %subfamily N829 >sp|O88706|LHX6_MOUSE 89 bp LIM/homeobox protein Lhx6 ------KRARTSFTAEQLQVMQAQF---AQDNNPDAQTLQKLADMT----GLSR------RVIQVWFQNCRARHKK----- >sp|O35652|LHX8_MOUSE 89 bp LIM/homeobox protein Lhx8 ------KRARTSFTADQLQVMQAQF---AQDNNPDAQTLQKLAERT----GLSR------RVIQVWFQNCRARHKK----- >AME10017_1 89 bp Metazoa-Arthropoda-Apis ------FKHHQLRTMKSYF---AINHNPDAKDLKQLSQKT----GLPK------RVLQVWFQNARAKWRR----- >sp|P29673|APTE_DROME 89 bp Protein apterous ------KRMRTSFKHHQLRTMKSYF---AINHNPDAKDLKQLSQKT----GLPK------RVLQVWFQNARAKWRR----- >sp|P50458|LHX2_HUMAN 89 bp LIM/homeobox protein Lhx2 ------KRMRTSFKHHQLRTMKSYF---AINHNPDAKDLKQLAQKT----GLTK------RVLQVWFQNARAKFRR----- >sp|P36198|LHX2_RAT 89 
bp LIM/homeobox protein Lhx2 ------KRMRTSFKHHQLRTMKSYF---AINHNPDAKDLKQLAQKT----GLTK------RVLQVWFQNARAKFRR----- %subfamily N848 >PPG46627_1 89 bp Viridiplantae-Streptophyta-Physcomitrella ------ASQTEVLERAY---AVEKYPSEATRQKLVRDL----DLSD------KQLQIWFTHRRYKDRR----- >BNP10524_1 89 bp ------KTPFQLQTLEEVY---AEETYPSEATRAELSEKL----DLSD------RQLQMWFCHRRLKDKK----- >PCE03163_1 89 bp Viridiplantae-Streptophyta-Phaseolus ------KTPFQLETLEKAY---AVDNYPSETMRGELSEKL----GLSD------RQLQMWFCHRRLKDKK----- >CSE06316_1 89 bp Viridiplantae-Streptophyta-Citrus ------KTPAQVMALEKFY---NEHKYPTEEMKSQVAEQI----GLTE------KQVSGWFCHRRLKEKR----- >PGD06527_1 89 bp Viridiplantae-Streptophyta-Picea ------KTPQQVEGLESFY---AEHKYPSEAMKAQLSEEL----GLTE------KQVQGWFCHRRLKDKR----- %subfamily N851 >AOP00376_1 89 bp ------LSGEQVRALEKNF---EVENKLEPERKARLAQEL----GLQP------RQVAVWFQNRRARWK------>PSE12995_1 89 bp Viridiplantae-Streptophyta-Picea ------LSLQQVRSLEKTF---EVENKLEPERKLQLAQEL----GLQP------RQIAVWFQNRRARWK------>PPG03907_1 89 bp Viridiplantae-Streptophyta-Physcomitrella ------LSLEQVRSLERNF---EVENKLEPERKMQLAKEL----GLQP------RQVAVWFQNRRARWK------>BRE05356_1 89 bp Viridiplantae-Streptophyta-Brassicarapa ------LSVVQVKALEKNF---EIDNKLXPERKVKLAQEL----GLQP------RQVAIWFQNRRARWK------>GRD00608_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSVDQVKALEKNF---EVENKLEPDRKSKLAQEL----GLQP------RQVAVWFQNRRARWK------>JRP00287_1 89 bp ------LSVDQVKALEKNF---EVENKLEPERKVKLAQEL----GLQP------RQVAVWFQNRIARWK------


>PAD00461_1 89 bp Viridiplantae-Streptophyta-Prunus ------LSVEQVKALEKNF---EVENKLEPERKVKLAQEL----GLQP------RQVAVWFQNRRARWK------>PTE03183_1 89 bp Viridiplantae-Streptophyta-Poncirus ------LSVDQVKALEKNF---EVDNKLEPERKVKLAQEL----GLQP------RQVAVWFQNRRARWK------>RPP00443_1 89 bp ------LSVDQVKALEKNF---EVENKLEPDRKVKLAQEL----GLQP------RQVAVWFQNRRARWK------>BVP03504_1 89 bp ------LSVDQVKALERNF---EVENKLEPERKVKLAQEL----GLQP------RQVAVWFQNRRARWK------>GRD03861_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSVDQVKALEKNF---EVENKLEPERKVKLAQEL----GLQP------RQVAVWFQNRRARWK------>HAD32146_1 89 bp Viridiplantae-Streptophyta-Helianthus ------LSTEQVRMLERSF---EEENKLEPERKTELAKKL----GLQP------RQVAVWFQNRRARW------>SSD00939_1 89 bp Viridiplantae-Streptophyta-Saccharum ------LTPEQVHLLERSF---EEENKLEPERKTELARKL----GLQP------RQVAVWFQNRRARWK------>SPD03504_1 89 bp Viridiplantae-Streptophyta-Sorghum ------LTPEQVHLLERSF---EEENKLEPERKTELARKL----GLQP------RQVAVWFQNRRARWK------>SMP00439_1 89 bp ------LTPEQVHMLEKSL---EAENKLEPERKTQLAKKL----NLQP------RQVAVWFQNRRARWK------>JRP01502_1 89 bp ------LTTEQVHLLEKSF---EAENKLEPDRKTVLAKKL----GLQP------RQVAVWFQNRRARWK------>GHP09300_1 89 bp ------LTSEQVYLLEKSF---EAENKLEPERKSQLAKKL----GLQP------RQVAVWFQNRRARWK------>BNP01381_1 89 bp ------LTSQQVHLLEKSF---ETENKLEPERKTQLANKL----GLQP------RQVAVWFQNRRARWK------>ACF06894_1 89 bp Viridiplantae-Streptophyta-Allium ------LTAEQVNMLEKSF---ESENKLEPERKGELAKKL----GLQP------RQVAVWFQNRRARWK------>LUP01625_1 89 bp ------LTAEQVNLLEKSF---EAENKLEPERKTELAKKL----GLQP------RQVAVWFQNRRARCK------>GRD07392_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQVQFLEKSF---EAENKLEPDRKVQLAKDL----GLQS------RQVAIWFQNRRARWK------>BVP01068_1 89 bp ------LSVEQVQFLEKSF---EVENKLEPERKVQLAKDL----GLQP------RQVAIWFQNRRARWK------>ACF08945_1 89 bp Viridiplantae-Streptophyta-Allium ------LSITQVQFLEKSF---EVENKLEPERKVQLAKEI----GLQP------RQVAIWFQNRRARWK------>ACF08359_1 89 bp Viridiplantae-Streptophyta-Allium 
------LSVTQVQFLEKSF---EVENKLEPERKVQLAKEL----GLQP------RQVAIWFQNRRARWK------>GRD05488_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQVQFLEKSF---EVENKLEPERKTQLAKEL----GLQP------RQVAIWFQNRRARWK------>GRD15804_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQVQFLEKSF---EVENKLEPERKTQLAKEL----GLQP------RQVAIWFQNRRARWK------>GRD06140_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQVQFLEKSF---EVENKLEPERKTQLAKEL----GLQP------RQVAIWFQNRRARWK------>GRD01531_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQVQFLEKSF---EVENKLEPERKTQLAKEL----GLQP------RQVAIWFQNRRARWK------>GRD06591_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTVDQIQFLEKSF---EVDNKLEPERKIQLAKDL----GLQP------RQVAIWFQNRRARWK------>CSE02860_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTVDQVQFLEKSF---EVENKLEPERKIQLAKDL----GLQP------RQVAIWFQNRRARWK------>PTE00785_1 89 bp Viridiplantae-Streptophyta-Poncirus ------LTVDQVQFLEKSF---EVENKLEPERKIQLAKDL----GLQP------RQVAIWFQNRRARWK------>PPE01139_1 89 bp Viridiplantae-Streptophyta-Prunus ------LTVDQVQFLEKSFD--MEN-KLEPERKILLAKDL----GLQP------RQVAIWFQNRRARWK------>PED06136_1 89 bp Viridiplantae-Streptophyta-Populus ------LTVDQVQFLEKSF--ELEN-KLEPERKIQLAKDL----GLQP------RQVAIWFQNRRARWK------>RPP01355_1 89 bp ------LSVEQVHFLEKSF---EEENKLEPERKTRLAKEL----GLQP------RQVAIWFQNRRARWK------>MSE04106_1 89 bp Viridiplantae-Streptophyta-Medicago ------LSXDQVQFLEKSF---EEDSKLEPERKTKLAKDL----GLQP------RQVAIWFQNRRARWK------>JRP02068_1 89 bp ------LTADQVQFLEKSF---EVDNKLEPERKSQLAKDL----GLQP------RQVAIWFQNRRARWK------>HAD01492_1 89 bp Viridiplantae-Streptophyta-Helianthus ------LTTDQVQFLEKSF---DENNKLEPERKVHLAKEL----NLQP------RQVAIWFQNRR------>SMP04043_1 89 bp ------LTAEQVQFLEKSF---AVENKLEPDRKNELAKKL----GLQP------RQVAIWFQNRRARSK------>GRD02024_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTATQVEFLERSF---EVENKLESDRKLRLAKEL----GLQP------RQVAIWFQNRRARSK------>GRD05141_1 89 bp Viridiplantae-Streptophyta-Gossypium 
------LTATQVEFLERSF---EVENKLESDRKLRLAKEL----GLQP------RQVAIWFQNRRARSK------>GGP01641_1 89 bp ------LTPEQVRSLETSF---ESENKLEPDRKLKLAQQL----GLQP------RQVAVWFQNRRARWK------>BNP18445_1 89 bp ------LTSGQLASLERSF---QEDIKLDSDRKLKLSREL----GLQP------RQIAVWFQNRRARWK------>CSE16705_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTSDQLESLERSF---QEEIKLDPDRKMKLAREL----GLQP------RQIAVWFQNRRARWK------>SMP06236_1 89 bp ------LNQEQVRLLEASF---DAGKKLEPERKFQLARDL----GVPP------RQIAIWYQNKRARWK------>TVD00827_1 89 bp Viridiplantae-Streptophyta-Triphysaria ------LNNEQVRTLEKNF---ELGNKLEPERKIELARAL----GLQP------RQIAIWFQNRRARWK------>AAF03294_1 89 bp Viridiplantae-Streptophyta-Acorus ------LSVEQVRILEKNF---ELGNKLEPERKMQLARAL----GLQP------RQIAIWFQNRRARWK------>ACF04417_1 89 bp Viridiplantae-Streptophyta-Allium ------LTIEQVRTLEKSF---EVGNKLEPERKMQLARAL----GLQP------RQVAIWFQNRRARWK------>CRP03497_1 89 bp ------LTAEQVNFLEMSF---NIDLKLEPERKALLAKKL----GIRP------RQVAIWFQNRRARWK------>CRP01704_1 89 bp ------LTAEQVNFLETSF---SMDLKLEPERKAHLAKQL----GIQP------RQVAIWFQNRRARWK------>CSE16783_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTAEQAELLEMHF---GNEHKLESDRKDKLAAEL----GLDP------RQVAVWFQNRRARDK------>CSE07054_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTAEQAELLEMHF---GNEHKLESDRKDKLAAEL----GLDP------RQVAVWFQNRRARDK------>BNP04867_1 89 bp ------LSDEQVRMLEMSF---GLEHKLESERKNRLASEL----GLDS------RQVAVWFQNRRARWK------>GAD16259_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTQEQVDLLELNF---GNEHKLESERKDRLASEL----GLDP------RQVPVWFQNRRARWK------>TMD04719_1 89 bp Viridiplantae-Streptophyta-Triticum


------LTAEQAALLEKSF---RAHNVLSHGEKHDLAEQL----GLKP------SKWKVWFQNRRARTK------>SRD01034_1 89 bp Viridiplantae-Streptophyta-Stevia ------FTDKQISFLEYMF---ETQSRPELRMKHQLAHKL----GLHP------RQVAIWFQNKRARSK------>CSE25906_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTQDQVRLLETCF---NANQKLQVDRKLELARRL----GLPP------RQIAVWYQNRRAREK------%subfamily N758 >GHP07947_1 89 bp ------FSDEQIKSLELMF---ESETRLEPGKKLEVAKEL----GLHP------RQVAIWFQNKRARWK------>BNP12498_1 89 bp ------FSEEQIKSLEMMF---ESETRLEPRKKVQLARGL----SLQP------RQVAIWFQNKRARWK------>CSE03377_1 89 bp Viridiplantae-Streptophyta-Citrus ------FSDEQIRSLELMF---ENETRLEPRKKLQLAKEL----GWQP------RQVAIWFQNKRARWK------>GRD15275_1 89 bp Viridiplantae-Streptophyta-Gossypium ------FSDEQIKSLELMF---ESETRLEPRKKLQVAKEL----GLQP------RQVAIWFQNKRARWK------>BNP04177_1 89 bp ------FSDEQIKSLEMMF---ESETKLEPRKKVQLAREL----GLQP------RQVAIWFQNKRARWK------>PED03264_1 89 bp Viridiplantae-Streptophyta-Populus ------FSDEQIKSLETMF---ESETRLEPRKKMQLAREL----GLQP------RQVAIWFQNKRARWK------>PTG02918_1 89 bp Viridiplantae-Streptophyta-Populus ------FSDEQIKSLETMF---ESETRLEPRKKMQLAREL----GLQP------RQVAIWFQNKRARWK------>LUP01153_1 89 bp ------FTDEQIKSLESIF---ESETRLEPRKKVQLAKEL----GLQP------RQVAIWFQNKRARWK------>PPE01722_1 89 bp Viridiplantae-Streptophyta-Prunus ------FSDEQIRSLESLF---ESESRLEPRKKMQLAKEL----GLQP------RQVAIWFQNKRARWK------>PAD01753_1 89 bp Viridiplantae-Streptophyta-Prunus ------FSDEQIRSLESLF---ESESRLEPRKKMQLAKEL----GLQP------RQVAIWFQNKRARWK------>INP04052_1 89 bp ------FSNEQVKSLETIFK--LET-KLETKKKLQVARDL----GLQP------RQVAIWFQNKRARWK------>CSE04691_1 89 bp Viridiplantae-Streptophyta-Citrus ------FSDEQIRLLESIFE--SESTKLEPRKKMQVATEL----GLQP------RQVAIWFQNKRARWK------>GRD28327_1 89 bp Viridiplantae-Streptophyta-Gossypium ------FSDEQIRLLESIF---ESETKLEPRKKMQLAREL----GLQP------RQVAIWFQNRRARWK------%subfamily N719 >PCE03158_1 89 bp Viridiplantae-Streptophyta-Phaseolus ------EQVEALERLY---HECPKPSSLRRQQLKKECPILCNIEP------KQIKVWFQNRRCREK------>ACF08938_1 89 bp 
Viridiplantae-Streptophyta-Allium ------TAEQVEALERVY---AECPKPSSMRRQQLVRDCPILSNIEP------KQIKVWFQNRRCREK------>ACF10646_1 89 bp Viridiplantae-Streptophyta-Allium ------YTPEQVEALERVY---SECPKPSSIRRQQIIRECPILSNIEP------KQIKVWFQNRRCREK------>ACF11510_1 89 bp Viridiplantae-Streptophyta-Allium ------YTPEQVEALERVY---SECPKPSSIRRQQLIRECPILSNIEP------KQIKVWFQNRRCREK------>BVP09757_1 89 bp ------YTPEQVEALERLY---HDCPKPSSLRRQQLIRECPILSNIEP------KQIKVWFQNRRCREK------>CSE21677_1 89 bp Viridiplantae-Streptophyta-Citrus ------YTPEQVEALERLY---HECPKPSSMRRQQLIRECPILSNIEP------KQIKVWFQNRRCREK------>GRD21692_1 89 bp Viridiplantae-Streptophyta-Gossypium ------YTPEQVEALERLY---HECPKPSSMRRQQLIRECPILSNIEP------KQIKVWFQNRRCREK------>PTE00704_1 89 bp Viridiplantae-Streptophyta-Poncirus ------YTPEQVEALERLY---HECPKPSSMRRQQLIRECPILSNIEP------KQIKVWFQNRRCREK------%subfamily N800 >GAP02710_1 89 bp ------KTKEQLDVLKQHF---LRCQWPKSEDYTELVKLT----NLPR------ADVIQWFGDTRYAVKN----- >1wjhA 89 bp Homeobox-leucine zipper protein Homez ------RKTKEQLAILKSFF---LQCQWARREDYQKLEQIT----GLPR------PEIIQWFGDTRYALKHGQLKW %subfamily N916 >PPD17649_1 89 bp Metazoa-Chordata-Pongo ------SHEQLSALKGSF---CRNQFPGQSEVEHLTKVT----GLST------REVRKWFSDRRYHCRR----- %subfamily N515 >1akhB 89 bp MATING-TYPE PROTEIN ALPHA-2 HOMEODOMAIN ------RGHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNT----SLSR------IQIKNWVSNRRRKEKT----- >1mnmC 89 bp MAT ALPHA-2 TRANSCRIPTIONAL REPRESSOR ------RGHRFTKENVRILESWFAKNIENPYLDTKGLENLMKNT----SLSR------IQIKNWVSNRRRKEKT----- %subfamily N795 >HSP00500_1 89 bp ------LNENQLRILKQTY---QGNQRPDTNTKEQLVEMT----GLNA------RVIRVWFQNKRCKDKK----- >HGP07414_1 89 bp ------LNENQLRILKQTY---QGNQRPDTNTKEQLVKMT----GLNA------RVIRVWFQNKRCKDKK----- >sp|P53407|ISL3_BRARE 89 bp Insulin gene enhancer protein isl-3 ------RVRTVLNEKQLHTLRTCY---NANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- >sp|P53406|ISL2_BRARE 89 bp Insulin gene enhancer protein isl-2 ------RVRTVLNEKQLHTLRTCY---NANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- 
>sp|P53405|ISL1_BRARE 89 bp Insulin gene enhancer protein isl-1 ------RVRTVLNEKQLHTLRTCY---NANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- >sp|P50212|ISL2B_ONCTS 89 bp Insulin gene enhancer protein ISL-2B ------RVRTVLNEKQLHTLRTCY---NANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- >1bw5 89 bp INSULIN GENE ENHANCER PROTEIN ISL-1 ------TTRVRTVLNEKQLHTLRTCY---AANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- >sp|P50480|ISL2_RAT 89 bp Insulin gene enhancer protein ISL-2 ------RVRTVLNEKQLHTLRTCY---AANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- >sp|P50211|ISL1_CHICK 89 bp Insulin gene enhancer protein ISL-1 ------RVRTVLNEKQLHTLRTCY---AANPRPDALMKEQLVEMT----GLSP------RVIRVWFQNKRCKDKK----- %subfamily N865 >sp|O88609|LMX1B_MOUSE 89 bp LIM homeobox transcription factor 1 beta ------KRPRTILTTQQRRAFKASF---EVSSKPCRKVRETLAAET----GLSV------RVVQVWFQNQRAKMKK----- >sp|O60663|LMX1B_HUMAN 89 bp LIM homeobox transcription factor 1 beta ------KRPRTILTTQQRRAFKASF---EVSSKPCRKVRETLAAET----GLSV------RVVQVWFQNQRAKMKK----- >sp|P34765|MEC3_CAEVU 89 bp Mechanosensory protein 3 ------RGPRTTIKQNQLDVLNEMF---SNTPKPSKHARAKKALET----GLSM------RVIQVWFQNRRSKERR----- >sp|P34764|MEC3_CAEBR 89 bp Mechanosensory protein 3 ------RGPRTTIRQNQLDVLNEMF---SNTPKPSKHARAKLALET----GLSM------RVIQVWFQNRRSKERR----- >sp|P09088|MEC3_CAEEL 89 bp Mechanosensory protein 3 ------RGPRTTIKQNQLDVLNEMF---SNTPKPSKHARAKLALET----GLSM------RVIQVWFQNRRSKERR----- >sp|P20154|LIN11_CAEEL 89 bp Protein lin-11 ------RGPRTTIKAKQLETLKNAF---AATPKPTRHIREQLAAET----GLNM------RVIQVWFQNRRSKERR----- >sp|P48742|LHX1_HUMAN 89 bp LIM/homeobox protein Lhx1


------RGPRTTIKAKQLETLKAAF---AATPKPTRHIREQLAQET----GLNM------RVIQVWFQNRRSKERR----- >sp|P29674|LHX1_XENLA 89 bp LIM/homeobox protein Lhx1 ------RGPRTTIKAKQLETLKAAF---AATPKPTRHIREQLAQET----GLNM------RVIQVWFQNRRSKERR----- >sp|P52889|LHX5_BRARE 89 bp LIM/homeobox protein Lhx5 ------RGPRTTIKAKQLETLKAAF---VATPKPTRHIREQLAQET----GLNM------RVIQVWFQNRRSKERR----- >sp|P37137|LHX5_XENLA 89 bp LIM/homeobox protein Lhx5 ------RGPRTTIKAKQLETLKAAF---IATPKPTRHIREQLAQET----GLNM------RVIQVWFQNRRSKERR----- >sp|P36200|LHX3_XENLA 89 bp LIM/homeobox protein Lhx3 ------KRPRTTITAKQLETLKNAY---NNSPKPARHVREQLSSET----GLDM------RVVQVWFQNRRAKEKR----- >sp|P50481|LHX3_MOUSE 89 bp LIM/homeobox protein Lhx3 ------KRPRTTITAKQLETLKSAY---NTSPKPARHVREQLSSET----GLDM------RVVQVWFQNRRAKEKR----- >sp|O97581|LHX3_PIG 89 bp LIM/homeobox protein Lhx3 ------KRPRTTITAKQLETLKSAY---NTSPKPARHVREQLSSET----GLDM------RVVQVWFQNRRAKEKR----- >WBP00326_1 89 bp ------ISAKSLETLKQAY---QASSKPARHVREQLAADT----GLDM------RVVQVWFQNRRAKEKR----- >sp|P20271|HM14_CAEEL 89 bp Homeobox protein ceh-14 ------KRPRTTISAKSLETLKQAY---QTSSKPARHVREQLASET----GLDM------RVVQVWFQNRRAKEKR----- %subfamily N846 >sp|P31362|PO2F3_MOUSE 89 bp POU domain, class 2, transcription factor 3 ------RKKRTSIETNIRLTLEKRF---QDNPKPSSEEISMIAEQL----SMEK------EVVRVWFCNRRQKEKR----- >SSF51768_1 89 bp Metazoa-Chordata-Salmo ------IDTNIRVALEKSFL--QQNQKPSSDEISLIADQL----NMEK------EVIRVWFCNRRQKEKR----- >sp|P09086|PO2F2_HUMAN 89 bp POU domain, class 2, transcription factor 2 ------RKKRTSIETNVRFALEKSF---LANQKPTSEEILLIAEQL----HMEK------EVIRVWFCNRRQKEKR----- >1hdp 89 bp OCT-2 POU HOMEODOMAIN ------RKKRTSIETNVRFALEKSF---LANQKPTSEEILLIAEQL----HMEK------EVIRVWFCNRRQKEKR----- >1cqtB 89 bp OCTAMER-BINDING TRANSCRIPTION FACTOR 1 ------RKKRTSIETNIRVALEKSF----LENQKTSEEITMIADQL----NMEK------EVIRVWFCNRRQKEKR----- >sp|P25425|PO2F1_MOUSE 89 bp POU domain, class 2, transcription factor 1 ------RKKRTSIETNIRVALEKSF---MENQKPTSEDITLIAEQL----NMEK------EVIRVWFCNRRQKEKR----- >sp|P14859|PO2F1_HUMAN 89 
bp POU domain, class 2, transcription factor 1 ------RKKRTSIETNIRVALEKSF---LENQKPTSEEITMIADQL----NMEK------EVIRVWFCNRRQKEKR----- >sp|P16143|PO2F1_XENLA 89 bp POU domain, class 2, transcription factor 1 ------RKKRTSIETNIRVALEKSF---LENQKPTSEEITMIADQL----NMEK------EVIRVWFCNRRQKEKR----- >sp|P15143|PO2F1_CHICK 89 bp POU domain, class 2, transcription factor 1 ------RKKRTSIETNIRVALEKSF---LENQKPTSEEITMIADQL----NMEK------EVIRVWFCNRRQKEKR----- >CCE07653_1 89 bp Metazoa-Chordata-Cyprinus ------LEGTVRSALESYF---VKCPKPNTLEITHISDDL----GLER------DVVRVWFCNRRQKGKR----- >GAP00667_1 89 bp ------LEGAVRSALESYF---IKCPKPNTQEITHISDDL----GLER------DVVRVWFCNRRQKGKR----- >sp|P20263|PO5F1_MOUSE 89 bp POU domain, class 5, transcription factor 1 ------KRKRTSIENRVRWSLETMF---LKCPKPSLQQITHIANQL----GLEK------DVVRVWFCNRRQKGKR----- >1ocp 89 bp OCT-3 ------KRKRTSIENRVRWSLETMF---LKCPKPSLQQITHIANQL----GLEK------DVVRVWFCNRRQKGKR----- >ECD00006_1 89 bp Metazoa-Chordata-Equus ------IENRVRGNLENMF---LQCPKPTLQQISHIAQQL----GLEK------DVVRVWFCNRRQKGKR----- >sp|O97552|PO5F1_BOVIN 89 bp POU domain, class 5, transcription factor 1 ------KRKRTSIENRVRGNLESMF---LQCPKPTLQQISHIAQQL----GLEK------DVVRVWFCNRRQKGKR----- >IPD13553_1 89 bp Metazoa-Chordata-Ictalurus ------IRVSVKGALESHF---LKCPKTSAQEISTLADTL----QLGEG------RVVRVWVCNRRQKE------>sp|P16241|CF1A_DROME 89 bp POU-domain protein CF1A ------RKKRTSIEVSVKGALEQHF---HKQPKPSAQEITSLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P31364|POU2_XENLA 89 bp Transcription factor POU2 ------RKKRTSIEVSVKGVLETHF---LKCPKPAALEITSLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P20264|PO3F3_HUMAN 89 bp POU domain, class 3, transcription factor 3 ------RKKRTSIEVSVKGALESHF---LKCPKPSAQEITNLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P31361|PO3F3_MOUSE 89 bp POU domain, class 3, transcription factor 3 ------RKKRTSIEVSVKGALESHF---LKCPKPSAQEITNLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P20265|PO3F2_HUMAN 89 bp POU domain, class 3, transcription factor 2 
------RKKRTSIEVSVKGALESHF---LKCPKPSAQEITSLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P31360|PO3F2_MOUSE 89 bp POU domain, class 3, transcription factor 2 ------RKKRTSIEVSVKGALESHF---LKCPKPSAQEITSLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P31363|POU1_XENLA 89 bp Transcription factor POU1 ------RKKRTSIEVGVKGALENHF---LKCPKPSAHEITSLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P20267|PO3F1_RAT 89 bp POU domain, class 3, transcription factor 1 ------RKKRTSIEVGVKGALESHF---LKCPKPSAHEITGLADSL----QLEK------EVVRVWFCNRRQKEKR----- >sp|P21952|PO3F1_MOUSE 89 bp POU domain, class 3, transcription factor 1 ------RKKRTSIEVGVKGALESHF---LKCPKPSAHEITGLADSL----QLEK------EVVRVWFCNRRQKEKR----- >DJP03285_1 89 bp ------IEANVKSILESSF---MKLSKPSAQDISSLAEKL----SLEK------EVVRVWFCNRRQ------>sp|P20268|HM06_CAEEL 89 bp Homeobox protein ceh-6 ------RKKRTSIEVNVKSRLEFHF---QSNQKPNAQEITQVAMEL----QLEK------EVVRVWFCNRRQKEKR----- %subfamily N834 >sp|P10037|PIT1_RAT 89 bp Pituitary-specific positive transcription factor 1 ------RKRRTTISIAAKDALERHFG---EHSKPSSQEIMRMAEEL----NLEK------EVVRVWFCNRRQREKR----- >1au7A 89 bp PIT-1 ------KRRTTISIAAKDALERHF---GEHSKPSSQEIMRMAEEL----NLEK------EVVRVWFCNRRQREKR----- >sp|P10036|PIT1_BOVIN 89 bp Pituitary-specific positive transcription factor 1 ------RKRRTTISIAAKDALERHF---GEQNKPSSQEILRMAEEL----NLEK------EVVRVWFCNRRQREKR----- >sp|P28069|PIT1_HUMAN 89 bp Pituitary-specific positive transcription factor 1 ------RKRRTTISIAAKDALERHF---GEQNKPSSQEIMRMAEEL----NLEK------EVVRVWFCNRRQREKR----- >sp|P13528|UNC86_CAEEL 89 bp Transcription factor unc-86 ------KRKRTSIAAPEKRELEQFF---KQQPRPSGERIASIADRL----DLKK------NVVRVWFCNQRQKQKR----- >sp|P24350|IPOU_DROME 89 bp Inhibitory POU protein ------KKRTSIAAPEKRSLEAYF---AVQPRPSGEKIAAIAEKL----DLKK------NVVRVWFCNQRQKQKR----- >sp|P20266|PO4F1_RAT 89 bp POU domain, class 4, transcription factor 1 ------KRKRTSIAAPEKRSLEAYF---AVQPRPSSEKIAAIAEKL----DLKK------NVVRVWFCN------>sp|P17208|PO4F1_MOUSE 89 bp POU domain, class 4, transcription factor 1 
------KRKRTSIAAPEKRSLEAYF---AVQPRPSSEKIAAIAEKL----DLKK------NVVRVWFCNQRQKQKR----- %subfamily N915 >MIP03533_1 89 bp ------LDTAQKLSLDTFF---RIDPRPDNARMVEIATLL----DLDH------DVVRVWFCNRRQKLRK----- %subfamily N914 >CCD00425_1 89 bp Fungi-Zygomycota-Conidiobolus ------GNSWQINRLFEFF---ERCPRPTRAQVHSLSVEL----EMPI------KSIRIWFQNHRSKQ------


%subfamily N778 >GAD15382_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKDQSAILEENF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>GRD09869_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKDQSAILEENF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>RHD00015_1 89 bp Viridiplantae-Streptophyta-Rosa ------LSKDQSAILEESF---KDHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>GRD08825_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKDQSAILEESF---KEHNTLNPKQKMALAKQL----GLRP------RQVEVWFQNRRARTK------>GAD01591_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKDQSAILEESF---KEHNTLNPKQKMALAKQL----GLRP------RQVEVWFQNRRARTK------>CSE04725_1 89 bp Viridiplantae-Streptophyta-Citrus ------LSKDQSAILEESF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>PAD00275_1 89 bp Viridiplantae-Streptophyta-Prunus ------LSKDQSAILEESF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>PPE01018_1 89 bp Viridiplantae-Streptophyta-Prunus ------LSKDQSAILEESF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>AAF02115_1 89 bp Viridiplantae-Streptophyta-Acorus ------LSKDQSAVLEESF---KEHSTLNPKQKLALAKQL----NLRP------RQVEVWFQNRRARTK------>CSE14614_1 89 bp Viridiplantae-Streptophyta-Citrus ------LSKEQSAFLEESF---KEHNTLNPKQKLALAKQL----NLRP------RQVEVWFQNRRARTK------>PED07486_1 89 bp Viridiplantae-Streptophyta-Populus ------LSKDQSAFLEESF---KEHNTLTPKQKLALAKEL----NLRP------RQVEVWFQNRRARTK------>GRD29631_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKDQSAILEECF---KEHNTLNPKQKLALAKQL----GLRP------RQVEVWFQNRRARTK------>RPP00177_1 89 bp ------LSKDQSIILEESF---KEHNTLNPKQKSALAKQL----GLRA------RQVEVWFQNRRARTK------>ACF08172_1 89 bp Viridiplantae-Streptophyta-Allium ------LSKDQSAILEESF---KGHNTLNPKQKQALAKQL----NLRP------RQVEVWFQNRRARTK------>GRD21285_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LSKEQSLVLEETF---KEHSTLNPKQKLALAMQL----NLRP------RQVEVWFQNRRARTK------>HAD05132_1 89 bp Viridiplantae-Streptophyta-Helianthus 
------LSKEQSAYLEETF---KEHNTLNPKQKLALAEQL----HLRP------RQVEVWFQNRRARH------>HAD09917_1 89 bp Viridiplantae-Streptophyta-Helianthus ------LSKEQSAYLEETF---KEHNTLNPKQKLALAEQL----HLRP------RQVEVWFQNRRARTK------>INP02125_1 89 bp ------LTKQQSIVLEDSF---KQHTTLNSKQKQELARRL----NLRP------RQVEVWFQNRRARTK------>PPE02653_1 89 bp Viridiplantae-Streptophyta-Prunus ------LSKEQSATLEDSF---REHTTLNPKQKQDLARKL----NLRP------RQVEVWFQNRRARTK------>ACF09585_1 89 bp Viridiplantae-Streptophyta-Allium ------LSKDQSRLLEEKF---KEHSTLNPKQKQALAKHL----NLQP------RQVEVWFQNRRARTK------>ZEP07385_1 89 bp ------LTKEQSAFLEDSF---KEHSTLNPKQKQALAKQL----NLRP------RQVEVWFQNRRARTK------>GRD07374_1 89 bp Viridiplantae-Streptophyta-Gossypium ------LTKDQSALLEESF---KQHSTLNPKQKQALAKQL----NLRP------RQVEVWFQNRRARTK------>CSE21246_1 89 bp Viridiplantae-Streptophyta-Citrus ------LTKEQSALLEESF---KQHSTLNPKQKQALARQL----NLRP------RQVEVWFPNRRARTK------>JRP01754_1 89 bp ------LTKEQSALLEESF---KQHSTLNPKQKQALARQL----NLRP------RQVEVWFQNRRARTK------>PSE04668_1 89 bp Viridiplantae-Streptophyta-Picea ------LSKEQSALLEESF---LEHSTLNPKQKNALAKEL----NLQP------RQVEVWFQNRRARTK------>CJP00479_1 89 bp ------LSKEQSALLEESF---REHSTLNPKQKNALAKQL----NLRP------RQVEVWFQNRRARTK------>PPG05037_1 89 bp Viridiplantae-Streptophyta-Physcomitrella ------LSKEQSALLEESF---KEHSTLNPKQKNALAKQL----GLRP------RQVEVWFQNRRARTK------>BVP05281_1 89 bp ------LTKDQSALLEESF---KQQSTLNPKQKQALADRL----NLRP------RQVEVWFQNRRARTK------>PSE02887_1 89 bp Viridiplantae-Streptophyta-Picea ------LSKEQSALLEESF---KENSSLNPKQKQALAKRL----NLRP------RQVEVWFQNRRARTK------>PGD09699_1 89 bp Viridiplantae-Streptophyta-Picea ------LSKDQSSLLEESF---REHSALSPKHKSALACKL----NLQP------RQVEVWFQNRRARTK------>AOP03183_1 89 bp ------LSKEQSRLLEESF---RQNHTLNPTQKEALASRL----RLKP------RQVEVWFQNRRARTK------>HVP89597_1 89 bp ------LSKDQAAVLEECF---KTHSTLNPKQKRALANRL----GLRP------RQVEVWFQNRRARTK------%subfamily N845 >AME08554_1 89 bp Metazoa-Arthropoda-Apis 
------FSEEQKEALRLAF---ALDPYPNVATIEFLAGEL----ALSS------RTITNWFHNHRMRLK------>sp|P10180|CUT_DROME 89 bp Homeobox protein cut ------KKQRVLFSEEQKEALRLAF---ALDPYPNVGTIEFLANEL----GLAT------RTITNWFHNHRMRLK------>sp|P39880|CUTL1_HUMAN 89 bp Homeobox protein cut-like 1 ------KKPRVVLAPEEKEALKRAY---QQKPYPSPKTIEDLATQL----NLKT------STVINWFHNYRSRIRR----- >sp|P39881|CUTL1_CANFA 89 bp Homeobox protein cut-like 1 ------KKPRVVLAPEEKEALKRAY---QQKPYPSPKTIEELATQL----NLKT------STVINWFHNYRSRIRR----- >sp|P53565|CUTL1_RAT 89 bp Homeobox protein cut-like 1 ------KKPRVVLAPEEKEALKRAY---QQKPYPSPKTIEELATQL----NLKT------STVINWFHNYRSRIRR----- >sp|P53564|CUTL1_MOUSE 89 bp Homeobox protein cut-like 1 ------KKPRVVLAPEEKEALKRAY---QQKPYPSPKTIEELATQL----NLKT------STVINWFHNYRSRIRR----- >sp|P70298|CUTL2_MOUSE 89 bp Homeobox protein cut-like 2 ------KKPRVVLAPAEKEALRKAY---QLEPYPSQQTIELLSFQL----NLKT------NTVINWFHNYRSRMRR----- >sp|O14529|CUTL2_HUMAN 89 bp Homeobox protein cut-like 2 ------KKPRVVLAPEEKEALRKAY---QLEPYPSQQTIELLSFQL----NLKT------NTVINWFHNYRSRMRR----- %subfamily N913 >HGP08389_1 89 bp ------FRPVLLRVLETYF---QKCPFPDIAKRVEIANACNAHLQVDKRGVQLMPKEVVTPQVIANWFANKRKEMRR----- %subfamily N658 >GRD08656_1 89 bp Viridiplantae-Streptophyta-Gossypium ------PTPLQLQILENIY--EQGTGTPSKQKIKEIASELAQHGQISE------TNVYNWFQNRRARSKR----- >CSE07550_1 89 bp Viridiplantae-Streptophyta-Citrus ------PTPVQLQILESIFD--QGTGTPSKQKIKEITVELSQHGQISE------TNVYNWFQNRRARSKR----- >GRD14308_1 89 bp Viridiplantae-Streptophyta-Gossypium ------PTPVQLQILERIFD--QGTGTPSKQKIKEITSELSQHGQISE------TNVYNWFQNRRARSKR----- >GAD10806_1 89 bp Viridiplantae-Streptophyta-Gossypium ------PTPVQLQILERIFD--QGTGTPSKQKIKEITSELSQHGQISE------TNVYNWFQNRRARSKR----- >HAD01349_1 89 bp Viridiplantae-Streptophyta-Helianthus


------PTPVQLQILERLFE--QGNGTPSKQKIKEITSELSQHGQISE------TNVYNWFQNRRARSKR----- >PPE04959_1 89 bp Viridiplantae-Streptophyta-Prunus ------PTPVQLQILERLFE--QGNGTPSKQKIKEITSELSQHGQISE------TNVYNWFQNRRARSKR----- %subfamily N912 >tr|Q61JC4|Q61JC4_CAEBR 89 bp Hypothetical protein CBG09849 ------RTVITDYQKDVLRFVF---VNELHPTNEMIEQIATKL----EMSL------RTVQNWFHNHRTRSK------%subfamily N667 >1wi3A 89 bp DNA-binding protein SATB2 ------RTKISLEALGILQSFI--HDVGLYPDQEAIHTLSAQL----DLPK------HTIIKFFQNQRYHVK------>AMF20019_1 89 bp Metazoa-Chordata-Ambystoma ------ISVEALGILQSFIQ--DVGLYPDEEAIHTLSAQL----DLPK------YTIIKFFQNQRY------>sp|Q01826|SATB1_HUMAN 89 bp DNA-binding protein SATB1 ------RPRTKISVEALGILQSFIQ--DVGLYPDEEAIQTLSAQL----DLPK------YTIIKFFQNQRYYLK------>tr|Q5U2Y2|Q5U2Y2_RAT 89 bp Special AT-rich sequence binding protein 1 ------RPRTKISVEALGILQSFIQ--DVGLYPDEEAIQTLSAQL----DLPK------YTIIKFFQNQRYYLK------>sp|Q60611|SATB1_MOUSE 89 bp DNA-binding protein SATB1 ------RPRTKISVEALGILQSFIQ--DVGLYPDEEAIQTLSAQL----DLPK------YTIIKFFQNQRYYLK------%subfamily N911 >tr|O16218|O16218_CAEEL 89 bp Hypothetical protein F17A9.6 ------RARCLLSQDQKSQLSIFF---ETNPRPDSLEMKQLGSTL----NLCK------STIINYFTNMRRR------%subfamily N862 >AME03197_1 89 bp Metazoa-Arthropoda-Apis ------LTTDQEAVLQEQFN--RWPRAPHTADIVLLAAET----GLSE------ADVEAWYSIRLAQWRK----- >PTJ02837_1 89 bp Metazoa-Chordata-Pantroglodytes ------TEDQVEILEYNFN--KVDKHPDSTTLCLIAAEA----GLSE------EETQKWFKQRLAKWRR----- >1uhsA 89 bp homeodomain only protein ------AATMTEDQVEILEYNF--NKVNKHPDPTTLCLIAAEA----GLTE------EQTQKWFKQRLAEWRR----- >ECD02963_1 89 bp Metazoa-Chordata-Equus ------TEDQVEILEYNFN--KVNKHPDPTTLCLIAAEA----GLSE------EETQKWFKQRLAQWRR----- >ECD08347_1 89 bp Metazoa-Chordata-Equus ------TEDQVEILEYNFN--KVNKHPDPTTLCLIAAEA----GLSE------EETQKWFKQRLAQWRR----- %subfamily N861 >1lfuP 89 bp PBX, PRE-B-CELL LEUKEMIA TRANSCRIPTION FACTOR-1 ------RRKRRNFNKQATEILNEYF---YSHPYPSEEAKEELAKKS----GITV------SQVSNWFGNKRIRYKK----- >GMD02533_1 89 bp 
Metazoa-Arthropoda-Glossinamorsitans ------FSKQASEILNEYFYSHLSNPYPSEEAKEELARKC----GITV------SQVSNWFGNKRIRYKK----- >1b8iB 89 bp EXTRADENTICLE ------RRNFSKQASEILNEYFYSHLSNPYPSEEAKEELARKC----GITV------SQVSNWFGNKRIRYKK----- >IPD21151_1 89 bp Metazoa-Chordata-Ictalurus ------FSKQATEVLNEYFYSHLSNPYPSEEAKEELAKQC----GITV------SQVSNWFGNKRIRYKK----- >PPD21046_1 89 bp Metazoa-Chordata-Pongo ------FNKQVTEILNEYFYSHLSNPYPSEEAKEELAKKC----GITV------SQVSNWFGNKRIRYKK----- >1b72B 89 bp PBX1 ------RKRRNFNKQATEILNEYFYSHLSNPYPSEEAKEELAKKC----GITV------SQVSNWFGNKRIRYKK----- >CPG05237_1 89 bp Fungi-Ascomycota-Coccidioides ------LPKPTTDILRAWFYEHLDHPYPSEQDKQMFMTRT----GLTI------SQISNWFINARRRH------>SPE20651_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FLKSATNIMRAWLFQHLTHPYPSEEQKKQLAQDT----GLTI------LQVNNWFINAR---RR----- >PPD00896_1 89 bp Metazoa-Chordata-Pongo ------FPKVATNIMRAWLFQHLTHPYPSEEQKKQLAQDT----GLTI------LQVNNWFINAR---RR----- >SSF35702_1 89 bp Metazoa-Chordata-Salmo ------FPKVATNIMRAWLFQHLTHPYPSEEQKKQLAQDT----GLTI------LQVNNWFINAR---RR----- %subfamily N802 >SSP00594_1 89 bp ------FSDVQKRTLQAIF---KETERPSKEMQQTIAEHL----GLDP------STVSNYFMNARRRSR------>sp|Q19720|HM38_CAEEL 89 bp Homeobox protein ceh-38 ------KRPRLVFTDIQKRTLQAIF---KETQRPSREMQQTIAEHL----RLDL------STVANFFMNARRRSR------>HCP08366_1 89 bp ------FTDIQKRTLXAIF---KETQRPSREMQQTIAEHL----RLDL------STVANFFMNARRRSR------>tr|Q61GJ0|Q61GJ0_CAEBR 89 bp Hypothetical protein CBG11192 ------KRPRLVFTDIQKRTLQAIF---KETQRPSREMQQTIAEHL----RLDL------STVANFFMNARRRSR------>HPP00771_1 89 bp ------FTDLQRRTLHAIF---KENKRPSKEMQITIAQQL----GLEL------STVSNFFMNAR---RR----- >tr|Q5MD20|Q5MD20_BRARE 89 bp Onecut-1 isoform a ------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------ATVSNFFMNGRRR------>tr|Q5MD19|Q5MD19_BRARE 89 bp Onecut-1 isoform b ------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------ATVSNFFMNGRRR------>sp|P70512|HNF6_RAT 89 bp Hepatocyte nuclear factor 6 
------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------STVSNFFMNARRR------>sp|O08755|HNF6_MOUSE 89 bp Hepatocyte nuclear factor 6 ------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------STVSNFFMNARRR------>sp|Q9UBC0|HNF6_HUMAN 89 bp Hepatocyte nuclear factor 6 ------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------STVSNFFMNARRR------>1s7eA 89 bp Hepatocyte nuclear factor 6 ------KKPRLVFTDVQRRTLHAIF---KENKRPSKELQITISQQL----GLEL------STVSNFFMNAR------>tr|O60422|O60422_HUMAN 89 bp Fos37502_1 ------KKQRLVFTDLQRRTLIAIF---KENKRPSKEMQVTISQQL----GLEL------NTVSNFFMNARRRC------>sp|O95948|ONEC2_HUMAN 89 bp One cut domain family member 2 ------KKSRLVFTDLQRRTLFAIF---KENKRPSKEMQITISQQL----GLEL------TTVSNFFMNARRR------>sp|Q9NJB5|ONEC_DROME 89 bp Homeobox protein onecut ------KKPRLVFTDLQRRTLQAIF---KETKRPSKEMQVTIARQL----GLEP------TTVGNFFMNARRR------>tr|O45080|O45080_CAEEL 89 bp Hypothetical protein ------KKTRLVFSDIQRRTLQAIF---RETKRPSREMQITISQQL----NLDP------TTVANFFMNARRR------>AYP01413_1 89 bp ------FTDIQRRTLQAIF---KETKRPSREMQLTISQQL----GLDP------TTVANFFMNAR---RR----- %subfamily N859 >sp|Q23175|HM32_CAEEL 89 bp Homeobox protein ceh-32 ------FKERTRSLLREWY---LKDPYPNPPKKKELANAT----GLTQ------MQVGNWFKNRRQRDR------>sp|O73916|SIX3_ORYLA 89 bp Homeobox protein SIX3 ------FKERTRGLLREWY---LQDPYPNPGKKRELAHAT----GLTP------TQVGNWFKNRRQRDR------>tr|O73709|O73709_BRARE 89 bp Homeobox protein Six6 ------FKERTRGLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>tr|Q5TYZ2|Q5TYZ2_BRARE 89 bp Novel protein similar to vertebrate sine oculis homeobox homolog 6 ------FKERTRHLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>tr|Q5M8S8|Q5M8S8_HUMAN 89 bp SIX6 protein


------FKERTRHLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|Q9QZ28|SIX6_MOUSE 89 bp Homeobox protein SIX6 ------FKERTRHLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|O93307|SIX6_CHICK 89 bp Homeobox protein SIX6 ------FKERTRHLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>tr|O93282|O93282_BRARE 89 bp Homeobox protein Six7 ------FKERTRSLLREWY---LQDPYPNPSRKRHLAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|O95475|SIX6_HUMAN 89 bp Homeobox protein SIX6 ------FKERTRNLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>SPE28381_1 89 bp Metazoa-Echinodermata-Strongylocentrotus ------FKERTRSLLREWY---LQDPYPNPTKKRELAQAT----GLTP------TQVGNWFKN------>tr|Q5TV95|Q5TV95_ANOGA 89 bp ENSANGP00000026008 PRTIWDGEQKTHCFKERTRSLLREWY---LQDPYPNPTKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>tr|O73708|O73708_BRARE 89 bp Homeobox protein Six3 ------FKERTRSLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|Q62233|SIX3_MOUSE 89 bp Homeobox protein SIX3 ------FKERTRSLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|O95343|SIX3_HUMAN 89 bp Homeobox protein SIX3 ------FKERTRSLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|O42406|SIX3_CHICK 89 bp Homeobox protein SIX3 ------FKERTRSLLREWY---LQDPYPNPSKKRELAQAT----GLTP------TQVGNWFKNRRQRDR------>sp|Q94166|HM33_CAEEL 89 bp Homeobox protein ceh-33 ------FRDKSRVLLRDWY---CRNSYPSPREKRELAEKT----HLTV------TQVSNWFKNRRQRDR------>sp|Q27350|SO_DROME 89 bp Protein sine oculis ------FKEKSRSVLRDWY---SHNPYPSPREKRDLAEAT----GLTT------TQVSNWFKNRRQRDR------>sp|Q9NPC8|SIX2_HUMAN 89 bp Homeobox protein SIX2 ------FKEKSRSVLREWY---AHNPYPSPREKRELAEAT----GLTT------TQVSNWFKNRRQRDR------>sp|Q62232|SIX2_MOUSE 89 bp Homeobox protein SIX2 ------FKEKSRSVLREWY---AHNPYPSPREKRELAEAT----GLTT------TQVSNWFKNRRQRDR------>sp|Q62231|SIX1_MOUSE 89 bp Homeobox protein SIX1 ------FKEKSRGVLREWY---AHNPYPSPREKRELAEAT----GLTT------TQVSNWFKNRRQRDR------>sp|Q15475|SIX1_HUMAN 89 bp Homeobox 
protein SIX1 ------FKEKSRGVLREWY---AHNPYPSPREKRELAEAT----GLTT------TQVSNWFKNRRQRDR------>sp|Q94165|HM34_CAEEL 89 bp Homeobox protein ceh-34 ------FKSKSRNVLRDAY---KKCQYPSVEDKRRLAQQT----ELSI------IQVSNWFKNKRQRER------>sp|Q9UIU6|SIX4_HUMAN 89 bp Homeobox protein SIX4 ------FKEKSRNALKELY---KQNRYPSPAEKRHLAKIT----GLSL------TQVSNWFKNRRQRDRN----- >sp|Q61321|SIX4_MOUSE 89 bp Homeobox protein SIX4 ------FKEKSRNALKELY---KQNRYPSPAEKRHLAKIT----GLSL------TQVSNWFKNRRQRDRN----- >sp|Q8N196|SIX5_HUMAN 89 bp Homeobox protein SIX5 ------FKERSRAALKACY---RGNRYPTPDEKRRLATLT----GLSL------TQVSNWFKNRRQRDR------>sp|P70178|SIX5_MOUSE 89 bp Homeobox protein SIX5 ------FKERSRAALKACY---RGNRYPTPDEKRRLATLT----GLSL------TQVSNWFKNRRQRDR------


Appendix B: Individual Unclassifiable Groupings from Leave-One-Out Cross-Validation of the Bête Tree

Figure A: SATB subfamily from leave-one-out cross-validation. All internal nodes within the subfamily have support of 1.000, except the join with 1wi3A.

Figure B: Insulin Gene Enhancer (ISL) subfamily from leave-one-out cross-validation. All internal nodes have support of 0.987 or better.

Figure C: Plant subfamily with 8 members from leave-one-out cross-validation. All internal nodes have support of 0.996 or better.


Figure D: Plant subfamily with 5 members from leave-one-out cross-validation. All internal nodes have support of 0.996 or better.

Figure E: LIM homeobox subfamily (Lhx2, Lhx6, Lhx8) from leave-one-out cross-validation. All internal nodes have support of 0.996 or better.

Figure F: TTF1/VND subfamily from leave-one-out cross-validation. All internal nodes have support of 0.925 or better.


Figure G: Paired-box subfamily from leave-one-out cross-validation. All internal nodes have support of 0.934 or better.


Figure H: LIM homeobox subfamily (Lhx1, Lhx3, Lhx5) from leave-one-out cross-validation. All internal nodes have support of 0.991 or better.


Appendix C: Chi-1 Angle Conformations at Subfamily-Determining Residue Positions

[Bar chart. x-axis: Type of Amino Acid (Ile, Phe, Met, Lys, Trp); y-axis: Number of instances (scale 0 to 6).]

Figure A: Amino acids at position 14 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.

[Bar chart. x-axis: Type of Amino Acid (Val, Lys, Arg, Leu); y-axis: Number of instances (scale 0 to 2.5).]

Figure B: Amino acids at position 19 of the structure-sequence based homeodomain classification (Figure 20) in the trans chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Arg, Leu, Thr, Gln); y-axis: Number of instances (scale 0 to 4.5).]

Figure C: Amino acids at position 19 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.

[Bar chart. x-axis: Type of Amino Acid (Tyr, His, Trp); y-axis: Number of instances (scale 0 to 8).]

Figure D: Amino acids at position 34 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Leu in g-; Pro; Ala; Other); y-axis: Number of instances (scale 0 to 9).]

Figure E: Amino acids at position 35 of the structure-sequence based homeodomain classification (Figure 20) in all available chi-angle conformations (conformation is labeled, where applicable, along with the column).

[Bar chart. x-axis: Type of Amino Acid (Asn, Asp, Ser, Thr); y-axis: Number of instances (scale 0 to 4.5).]

Figure F: Amino acids at position 36 of the structure-sequence based homeodomain classification (Figure 20) in the g+ chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Lys, Leu, Arg); y-axis: Number of instances (scale 0 to 3.5).]

Figure G: Amino acids at position 40 of the structure-sequence based homeodomain classification (Figure 20) in the trans chi-angle conformation.

[Bar chart. x-axis: Type of Amino Acid (Ile, Lys, Arg, Tyr); y-axis: Number of instances (scale 0 to 2.5).]

Figure H: Amino acids at position 40 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Val in trans; Gln in g-; Arg in g-; Thr in g-); y-axis: Number of instances (scale 0 to 8).]

Figure I: Amino acids at position 70 of the structure-sequence based homeodomain classification (Figure 20) in all available chi-angle conformations (conformation is labeled along with the column).

[Bar chart. x-axis: Type of Amino Acid (Arg, Lys, Gln, Tyr); y-axis: Number of instances (scale 0 to 3.5).]

Figure J: Amino acids at position 72 of the structure-sequence based homeodomain classification (Figure 20) in the trans chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Lys, Ser, Ile); y-axis: Number of instances (scale 0 to 2.5).]

Figure K: Amino acids at position 72 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.

[Bar chart. x-axis: Type of Amino Acid (Val, Asn, Gln); y-axis: Number of instances (scale 0 to 2.5).]

Figure L: Amino acids at position 73 of the structure-sequence based homeodomain classification (Figure 20) in the trans chi-angle conformation.


[Bar chart. x-axis: Type of Amino Acid (Asn, Ile, Lys); y-axis: Number of instances (scale 0 to 3.5).]

Figure M: Amino acids at position 73 of the structure-sequence based homeodomain classification (Figure 20) in the g- chi-angle conformation.


Appendix D: Contribution of Subfamily-Determining Residues to the Protein-DNA Interface: Changes in Solvent Accessibility

The following tables (A–N) detail the per-residue change in accessible surface area (ASA) for a representative protein structure from each structurally represented homeodomain subfamily. The DNA used for the ASA calculations is from the original PDB file unless otherwise noted. For brevity, only residues whose solvent accessibility changes upon removal of the DNA are listed. Position numbering follows the alignment in Figure 20. Subfamily-determining residues are highlighted in yellow, and conserved residues (at 90% identity throughout all homeodomains) in cyan.
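The relationship between the two ASA columns and the interface percentage can be sketched in Python. This is a minimal illustration, not the procedure used in the thesis: it assumes that each residue's share of the interface is its ASA change (protein only minus protein + DNA) divided by the summed ASA change of all listed residues. The function name `interface_percentages` is hypothetical. With the Table A values this assumption reproduces several of the printed percentages (for example Arg102, Arg105 and Lys146), but it may not match every printed entry.

```python
# Rows taken from Table A (HOX subfamily, PDB ID 1ig7A):
# (residue, ASA in the protein + DNA complex, ASA of the protein alone)
TABLE_A = [
    ("Arg102", 123.87, 273.64), ("Lys103", 117.66, 159.17),
    ("Pro104", 120.62, 129.25), ("Arg105", 43.36, 219.88),
    ("Thr106", 23.41, 45.35), ("Phe108", 17.61, 46.81),
    ("Leu113", 62.95, 71.78), ("Tyr125", 91.88, 132.72),
    ("Leu126", 16.46, 26.39), ("Ile128", 99.94, 125.46),
    ("Arg131", 19.28, 38.57), ("Thr143", 77.67, 80.11),
    ("Gln144", 32.17, 42.89), ("Lys146", 29.68, 98.44),
    ("Ile147", 10.83, 68.51), ("Trp148", 8.53, 24.90),
    ("Gln150", 9.09, 89.52), ("Asn151", 16.28, 85.68),
    ("Arg153", 18.71, 66.96), ("Ala154", 43.97, 50.69),
    ("Lys155", 88.83, 117.02), ("Lys157", 82.54, 94.57),
    ("Arg158", 176.67, 204.10),
]


def interface_percentages(rows):
    """Return {residue: percent of the summed ASA change}.

    ASSUMPTION: the interface is taken to be the sum of the per-residue
    ASA changes of the listed residues; the thesis does not spell out
    its normalisation.
    """
    # ASA buried by DNA = ASA of the free protein minus ASA in the complex.
    delta = {res: alone - bound for res, bound, alone in rows}
    total = sum(delta.values())
    return {res: 100.0 * d / total for res, d in delta.items()}


pct = interface_percentages(TABLE_A)
for res in ("Arg102", "Arg105", "Lys146"):
    print(f"{res}: {pct[res]:.2f}%")
```

Summing the resulting percentages over a chosen set of positions would give a combined contribution analogous to the summary lines under each table, provided the set of highlighted positions is known.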

A) HOX subfamily (PDB ID 1ig7A)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg102       8           123.87               273.64              15.59%
Lys103       9           117.66               159.17              4.32%
Pro104       10          120.62               129.25              0.90%
Arg105       11          43.36                219.88              18.38%
Thr106       12          23.41                45.35               2.28%
Phe108       14          17.61                46.81               3.04%
Leu113       19          62.95                71.78               0.92%
Tyr125       34          91.88                132.72              4.25%
Leu126       35          16.46                26.39               1.03%
Ile128       37          99.94                125.46              2.66%
Arg131       40          19.28                38.57               2.01%
Thr143       69          77.67                80.11               0.25%
Gln144       70          32.17                42.89               1.12%
Lys146       72          29.68                98.44               7.16%
Ile147       73          10.83                68.51               6.01%
Trp148       74          8.53                 24.90               1.70%
Gln150       76          9.09                 89.52               4.33%
Asn151       77          16.28                85.68               8.37%
Arg153       79          18.71                66.96               7.23%
Ala154       80          43.97                50.69               0.70%
Lys155       81          88.83                117.02              2.94%
Lys157       83          82.54                94.57               1.25%
Arg158       84          176.67               204.10              2.86%

Contribution of specificity determining positions to interface: 25.54%
Contribution of conserved residues to interface: 13.95%
Contribution of SDPs and conserved residues to interface: 39.49%

B) Engrailed subfamily (PDB ID 3hddA)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg5         11          103.86               280.20              23.76%
Thr6         12          66.46                89.25               3.07%
Ala7         13          96.31                101.54              0.70%
Phe8         14          15.85                43.60               3.74%
Leu13        19          52.21                60.20               1.08%
Tyr25        34          81.17                122.33              5.55%
Leu26        35          28.66                41.61               1.74%
Glu28        37          114.07               122.71              1.16%
Arg31        40          28.18                52.31               3.25%
Gln44        70          20.23                36.57               2.20%
Lys46        72          52.47                89.76               5.02%
Ile47        73          17.82                88.14               9.47%
Trp48        74          11.88                29.87               2.42%
Gln50        76          14.04                73.51               8.01%
Asn51        77          22.48                83.65               8.24%
Arg53        79          19.61                57.93               5.16%
Ala54        80          25.78                45.73               2.69%
Lys55        81          112.66               139.15              3.57%
Lys57        83          138.97               185.91              6.32%
Lys58        84          177.83               198.85              2.83%

Contribution of specificity determining positions to interface: 32.05%
Contribution of conserved residues to interface: 15.83%
Contribution of SDPs and conserved residues to interface: 47.88%

C) Paired-box subfamily (PDB ID 1fjlA)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg2         8           105.92               276.48              17.53%
Arg3         9           133.48               166.51              3.39%
Ser4         10          92.56                97.74               0.53%
Arg5         11          23.31                214.54              19.65%
Thr6         12          33.50                52.35               1.94%
Thr7         13          112.16               121.30              0.94%
Phe8         14          20.92                52.67               3.26%
Leu13        19          73.52                76.43               0.30%
Tyr25        34          103.48               156.90              5.49%
Pro26        35          4.75                 6.44                0.17%
Ile28        37          68.17                78.15               1.03%
Arg31        40          6.00                 34.20               2.90%
Arg44        70          25.65                49.15               2.42%
Gln46        72          29.89                72.31               4.36%
Val47        73          16.56                58.98               4.36%
Trp48        74          7.44                 19.59               1.25%
Gln50        76          28.42                81.15               5.42%
Asn51        77          12.98                83.13               7.21%
Arg53        79          4.69                 39.46               3.57%
Ala54        80          37.16                42.45               0.54%
Arg55        81          88.36                146.15              5.94%
Arg57        83          98.45                174.21              7.79%

Contribution of specificity determining positions to interface: 23.26%
Contribution of conserved residues to interface: 12.03%
Contribution of SDPs and conserved residues to interface: 35.29%

D) POU subfamily (PDB ID 1cqtB)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg602       8           134.12               248.07              12.78%
Lys603       9           65.32                110.26              5.04%
Lys604       10          132.52               167.52              3.93%
Arg605       11          22.35                216.79              21.81%
Thr606       12          47.68                58.44               1.21%
Ser607       13          35.14                79.61               4.99%
Ile608       14          24.61                43.19               2.08%
Arg613       19          61.69                87.11               2.85%
Ser628       37          84.27                87.43               0.35%
Glu643       69          45.22                50.69               0.61%
Val644       70          1.66                 16.54               1.67%
Arg646       72          115.77               116.49              0.08%
Val647       73          6.40                 60.19               6.03%
Trp648       74          0.42                 24.08               2.65%
Cys650       76          18.22                62.90               5.01%
Asn651       77          7.67                 69.34               6.92%
Arg652       78          16.21                46.91               3.44%
Arg653       79          59.10                72.99               1.56%
Gln654       80          19.73                101.95              9.22%
Lys655       81          70.31                92.33               2.47%
Lys657       83          145.21               179.61              3.86%
Arg658       84          179.23               191.99              1.43%

Contribution of specificity determining positions to interface: 12.72%
Contribution of conserved residues to interface: 11.13%
Contribution of SDPs and conserved residues to interface: 23.85%

E) TTF1/VND subfamily (PDB ID 1nk2P)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg110       8           236.12               268.44              3.24%
Lys111       9           16.34                191.66              17.59%
Arg112       10          154.54               197.39              4.30%
Arg113       11          23.22                196.11              17.35%
Val114       12          63.47                96.97               3.36%
Leu115       13          86.51                122.04              3.57%
Phe116       14          11.72                31.89               2.02%
Thr121       19          33.91                36.32               0.24%
Tyr133       34          113.71               143.68              3.01%
Arg139       40          28.58                32.11               0.35%
Gln152       70          38.77                46.99               0.82%
Lys154       72          32.01                57.91               2.60%
Ile155       73          25.72                78.49               5.30%
Trp156       74          20.59                41.23               2.07%
Gln158       76          18.69                81.06               6.26%
Asn159       77          28.38                92.25               6.41%
Arg161       79          19.04                48.60               2.97%
Tyr162       80          30.61                127.77              9.75%
Lys163       81          90.55                128.07              3.76%
Lys165       83          139.67               148.46              0.88%
Arg166       84          158.76               200.07              4.15%

Contribution of specificity determining positions to interface: 14.35%
Contribution of conserved residues to interface: 11.45%
Contribution of SDPs and conserved residues to interface: 25.79%

F) PBX/Extradenticle subfamily (PDB ID 1b8iB)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg205       11          101.91               258.23              20.67%
Arg206       12          161.59               168.36              0.90%
Asn207       13          69.65                118.56              6.47%
Phe208       14          26.63                58.44               4.21%
Ser213       19          19.08                21.74               0.35%
Tyr228       34          81.66                133.67              6.88%
Pro229       35          2.82                 4.59                0.23%
Glu231       37          92.21                139.15              6.21%
Lys234       40          8.28                 25.05               2.22%
Gln247       70          41.23                66.31               3.32%
Ser249       72          41.65                44.01               0.31%
Asn250       73          27.18                91.07               8.45%
Trp251       74          14.16                34.06               2.63%
Gly253       76          20.63                26.12               0.73%
Asn254       77          15.03                65.96               6.73%
Arg256       79          17.56                42.39               3.28%
Ile257       80          17.78                95.98               10.34%
Arg258       81          61.12                151.77              11.99%
Lys260       83          132.57               153.91              2.82%
Lys261       84          166.59               176.20              1.27%

Contribution of specificity determining positions to interface: 25.96%
Contribution of conserved residues to interface: 12.65%
Contribution of SDPs and conserved residues to interface: 38.61%

G) MATa1 subfamily (PDB ID 1akhA)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Ile77        14          42.00                85.84               6.00%
Arg82        19          72.23                134.36              8.50%
Ser94        34          57.05                65.84               1.20%
Leu95        35          17.69                24.16               0.89%
Asn96        36          78.22                84.38               0.84%
Ser97        37          63.31                86.74               3.21%
Lys100       40          14.01                48.65               4.74%
Leu112       69          88.35                99.97               1.59%
Gln113       70          39.94                63.24               3.19%
Arg115       72          35.62                116.50              11.07%
Val116       73          8.68                 66.03               7.85%
Trp117       74          17.64                33.22               2.13%
Phe118       75          5.23                 5.50                0.04%
Ile119       76          5.95                 83.79               10.65%
Asn120       77          9.76                 45.55               4.90%
Arg122       79          41.40                93.57               7.14%
Met123       80          53.41                156.29              14.08%
Arg124       81          98.45                186.20              12.01%

Contribution of specificity determining positions to interface: 44.27%
Contribution of conserved residues to interface: 14.20%
Contribution of SDPs and conserved residues to interface: 58.47%

H) MATα2 subfamily (PDB ID 1akhB)

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg132       10          59.99                276.88              21.07%
Gly133       11          0.00                 61.13               5.94%
His134       12          136.29               165.87              2.87%
Arg135       13          34.82                188.54              14.94%
Phe136       14          21.59                55.92               3.34%
Lys138       16          175.63               187.58              1.16%
Val141       19          42.50                53.64               1.08%
Tyr156       34          81.54                125.88              4.31%
Leu157       35          19.44                23.70               0.41%
Ile174       69          120.26               126.33              0.59%
Gln175       70          38.56                64.70               2.54%
Lys177       72          39.41                102.18              6.10%
Asn178       73          26.44                76.81               4.89%
Trp179       74          18.45                39.41               2.04%
Ser181       76          15.32                45.67               2.95%
Asn182       77          13.64                72.92               5.76%
Arg184       79          13.75                56.07               4.11%
Arg185       80          53.90                146.11              8.96%
Lys186       81          101.94               137.41              3.45%
Lys188       83          138.45               174.41              3.49%

Contribution of specificity determining positions to interface: 22.67%
Contribution of conserved residues to interface: 14.86%
Contribution of SDPs and conserved residues to interface: 37.53%

I) Insulin Gene Enhancer (ISL) subfamily (PDB ID 1bw5); DNA mapped from PDB ID 3hdd

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Thr8         12          48.74                58.05               1.47%
Leu10        14          34.35                77.46               6.79%
Leu15        19          46.87                56.81               1.57%
Arg27        34          157.63               189.98              5.09%
Pro28        35          13.83                24.22               1.64%
Asp29        36          66.53                69.16               0.41%
Ala30        37          68.56                75.41               1.08%
Lys33        40          44.98                78.48               5.28%
Val46        70          10.04                36.38               4.15%
Arg48        72          119.80               162.40              6.71%
Val49        73          4.78                 70.21               10.30%
Trp50        74          10.60                28.47               2.81%
Phe51        75          24.92                30.60               0.89%
Gln52        76          12.46                108.17              15.07%
Asn53        77          17.47                79.83               9.82%
Arg55        79          95.05                134.67              6.24%
Cys56        80          39.44                94.32               8.64%
Lys57        81          93.97                170.34              12.03%

Contribution of specificity determining positions to interface: 41.94%
Contribution of conserved residues to interface: 19.77%
Contribution of SDPs and conserved residues to interface: 61.71%

J) SATB subfamily (PDB ID 1wi3A); DNA mapped from PDB ID 1jgg

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg11        11          75.03                269.84              29.19%
Thr12        12          23.83                67.92               6.61%
Lys13        13          160.19               197.24              5.55%
Leu19        19          44.78                51.55               1.01%
Thr51        70          12.72                25.68               1.94%
Lys54        73          33.53                125.62              13.80%
Phe55        74          10.08                28.33               2.73%
Gln57        76          65.65                89.26               3.54%
Asn58        77          5.41                 70.94               9.82%
Tyr61        80          54.41                161.94              16.11%
His62        81          95.01                114.90              2.98%
Lys64        83          182.74               227.44              6.70%

Contribution of specificity determining positions to interface: 16.76%
Contribution of conserved residues to interface: 12.56%
Contribution of SDPs and conserved residues to interface: 29.31%

K) Onecut subfamily (PDB ID 1s7eA); DNA mapped from PDB ID 3hdd

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Lys102       9           154.95               184.22              5.56%
Pro103       10          36.73                82.55               8.70%
Arg104       11          157.58               164.97              1.40%
Leu105       12          159.32               171.57              2.33%
Phe107       14          74.91                116.63              7.92%
Arg112       19          55.95                64.69               1.66%
Pro125       35          77.18                89.42               2.33%
Lys127       37          101.37               158.69              10.89%
Ser142       69          0.11                 12.60               2.37%
Ser145       72          22.93                50.49               5.24%
Asn146       73          0.00                 19.52               3.71%
Phe148       75          27.12                41.85               2.80%
Met149       76          53.19                132.48              15.06%
Asn150       77          56.90                95.19               7.27%
Arg152       79          77.52                197.34              22.76%

Contribution of specificity determining positions to interface: 20.85%
Contribution of conserved residues to interface: 10.07%
Contribution of SDPs and conserved residues to interface: 30.92%

L) Homeodomain Only Protein (HOP) subfamily (PDB ID 1uhsA); DNA mapped from PDB ID 1fjl

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Ala6         12          145.20               152.48              1.14%
Thr7         13          126.65               130.77              0.64%
Met8         14          45.06                71.13               4.08%
His26        34          61.61                138.64              12.06%
Pro27        35          12.98                16.77               0.59%
Pro29        37          71.71                102.33              4.79%
Gln45        70          59.54                76.97               2.73%
Gln47        72          55.27                95.52               6.30%
Lys48        73          42.40                109.05              10.43%
Trp49        74          6.21                 24.76               2.90%
Lys51        76          28.13                137.74              17.16%
Gln52        77          12.20                119.04              16.72%
Arg53        78          61.06                98.53               5.87%
Leu54        79          6.72                 11.95               0.82%
Ala55        80          65.27                68.26               0.47%
Glu56        81          115.23               130.95              2.46%
Arg58        83          92.73                161.95              10.83%

Contribution of specificity determining positions to interface: 36.19%
Contribution of conserved residues to interface: 20.45%
Contribution of SDPs and conserved residues to interface: 56.64%

M) Non-plant homeobox leucine zipper (Homez) subfamily (PDB ID 1wjhA); DNA mapped from PDB ID 3hdd

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg9         13          214.74               265.01              8.56%
Lys10        14          29.59                59.31               5.06%
Leu15        19          35.45                47.98               2.13%
Trp27        34          127.55               186.72              10.08%
Ala28        35          0.33                 10.83               1.79%
Arg29        36          109.74               171.19              10.46%
Arg30        37          95.71                130.15              5.86%
Tyr33        40          3.34                 10.68               1.25%
Glu46        70          32.28                57.22               4.25%
Ile48        72          39.36                64.56               4.29%
Gln49        73          36.32                117.85              13.88%
Trp50        74          9.12                 26.34               2.93%
Gly52        76          25.94                36.64               1.82%
Asp53        77          28.37                93.50               11.09%
Arg55        79          22.80                74.76               8.85%
Tyr56        80          118.35               134.23              2.70%
Ala57        81          14.89                16.54               0.28%
Lys59        83          135.11               150.13              2.56%
Gln62        86          103.52               116.10              2.14%

Contribution of specificity determining positions to interface: 53.20%
Contribution of conserved residues to interface: 22.87%
Contribution of SDPs and conserved residues to interface: 76.07%

N) HNF1 subfamily (PDB ID 2lfb); DNA mapped from PDB ID 1fjl

PDB Residue  Position #  ASA (Protein + DNA)  ASA (Protein only)  % of protein-DNA interface
Arg10        8           10.10                230.24              23.28%
Arg11        9           133.05               157.01              2.53%
Asn12        10          90.84                137.83              4.97%
Phe14        12          12.70                17.51               0.51%
Lys15        13          84.95                86.51               0.16%
Trp16        14          11.25                38.76               2.91%
Gln21        19          73.98                77.09               0.33%
Leu24        22          10.21                10.45               0.03%
Pro34        35          24.46                56.63               3.40%
Arg39        40          19.54                33.75               1.50%
Glu71        55          86.08                90.27               0.44%
Val72        69          78.33                80.96               0.28%
Arg73        70          16.59                41.35               2.62%
Tyr75        72          37.97                132.58              10.00%
Asn76        73          6.26                 65.41               6.26%
Trp77        74          18.10                28.59               1.11%
Ala79        76          21.79                47.49               2.72%
Asn80        77          7.57                 27.42               2.10%
Arg82        79          40.43                189.12              15.72%
Lys83        80          28.10                164.95              14.47%
Glu84        81          86.65                115.87              3.09%
Glu85        82          165.71               180.43              1.56%

Contribution of specificity determining positions to interface: 27.03%
Contribution of conserved residues to interface: 18.96%
Contribution of SDPs and conserved residues to interface: 45.99%

Appendix E: List of Interactions with DNA at distance ≤ 4.5 Å

A) HOX subfamily (PDB ID 1ig7A)

Amino    Position  DNA Res. # and   Distance  Contact Surface  H-bond  Aromatic-Aromatic  Hydrophobic  Hydrophobic-Hydrophilic
Acid     #         Chain (A/C/T/G)  (Å)       Area (Å²)                contact            contact      contact
Arg102   8         8B (A)           4.4       4.0              +       -                  -            -
Arg102   8         9B (A)           3.3       33.9             +       -                  -            +
Arg102   8         10B (T)          3.2       46.6             +       -                  -            -
Arg102   8         26C (T)          3.4       41.8             +       -                  -            -
Arg102   8         27C (A)          3.8       27.4             +       -                  -            -
Lys103   9         9B (A)           3.6       17.5             -       -                  -            +
Lys103   9         10B (T)          2.7       33.9             -       -                  -            +
Pro104   10        9B (A)           3.5       10.7             +       -                  -            +
Arg105   11        7B (T)           2.6       56.2             +       -                  -            -
Arg105   11        8B (A)           3.3       47.6             +       -                  -            +
Arg105   11        9B (A)           3.9       5.6              -       -                  -            -
Arg105   11        27C (A)          3.8       6.1              +       -                  -            -
Arg105   11        28C (G)          3.5       61.4             +       -                  -            +
Thr106   12        8B (A)           3.0       25.4             +       -                  -            +
Thr106   12        9B (A)           2.4       24.3             +       -                  -            -
Phe108   14        8B (A)           3.2       47.3             -       -                  -            -
Leu113   19        8B (A)           3.9       12.8             -       -                  -            +
Tyr125   34        21C (T)          3.6       30.3             +       -                  -            +
Tyr125   34        22C (C)          2.5       22.0             +       -                  -            -
Leu126   35        21C (T)          4.4       11.9             +       -                  -            -
Ile128   37        20C (T)          4.0       22.0             -       -                  -            +
Arg131   40        20C (T)          2.8       38.8             +       -                  -            -
Gln144   70        9B (A)           4.4       15.6             +       -                  -            +
Lys146   72        20C (T)          2.6       49.3             +       -                  -            +
Lys146   72        21C (T)          3.4       41.4             +       -                  -            +
Ile147   73        8B (A)           4.5       1.6              -       -                  +            -
Ile147   73        9B (A)           3.7       37.7             -       -                  +            +
Ile147   73        10B (T)          4.0       21.1             -       -                  +            +
Trp148   74        8B (A)           3.4       32.7             +       -                  -            -
Gln150   76        10B (T)          3.7       21.0             +       -                  -            +
Gln150   76        11B (T)          4.0       4.0              +       -                  -            -
Gln150   76        21C (T)          3.7       15.9             +       -                  -            +
Gln150   76        22C (C)          2.8       45.4             +       -                  +            +
Gln150   76        23C (A)          3.4       20.6             -       -                  -            +
Asn151   77        8B (A)           3.2       33.0             +       -                  -            +
Asn151   77        9B (A)           2.7       50.7             +       -                  -            +
Asn151   77        10B (T)          4.4       2.9              -       -                  -            +
Asn151   77        24C (A)          4.5       2.8              +       -                  -            -
Arg153   79        21C (T)          3.7       31.7             +       -                  -            -
Arg153   79        22C (C)          2.8       47.5             +       -                  -            +
Lys155   81        7B (T)           3.2       29.5             +       -                  -            +
Arg158   84        7B (T)           3.7       28.5             +       -                  -            +

B) Engrailed subfamily (PDB ID 3hddA)

Amino    Position  DNA Res. # and   Distance  Contact Surface  H-bond  Aromatic-Aromatic  Hydrophobic  Hydrophobic-Hydrophilic
Acid     #         Chain (A/C/T/G)  (Å)       Area (Å²)                contact            contact      contact
Arg5     11        210C (G)         3.7       9.7              -       -                  -            +
Arg5     11        211C (T)         2.5       55.7             +       -                  -            -
Arg5     11        212C (A)         3.2       43.3             +       -                  -            -
Arg5     11        213C (A)         4.2       19.7             -       -                  -            -
Arg5     11        333D (A)         3.5       11.9             +       -                  -            -
Arg5     11        334D (C)         3.9       27.0             +       -                  -            +
Thr6     12        212C (A)         3.3       28.1             +       -                  -            +
Thr6     12        213C (A)         3.1       15.4             +       -                  -            +
Phe8     14        212C (A)         3.4       35.4             +       -                  -            -
Leu13    19        212C (A)         4.4       7.2              -       -                  -            +
Tyr25    34        327D (G)         3.9       25.6             +       -                  -            +
Tyr25    34        328D (T)         2.8       22.4             +       -                  -            -
Leu26    35        327D (G)         4.3       15.2             +       -                  -            +
Arg31    40        326D (G)         3.5       36.1             +       -                  -            -
Gln44    70        213C (A)         4.0       21.4             +       -                  -            +
Lys46    72        327D (G)         4.4       13.4             +       -                  -            +
Ile47    73        213C (A)         3.6       42.8             -       -                  +            +
Ile47    73        214C (T)         4.0       26.5             -       -                  +            +
Trp48    74        212C (A)         3.5       24.0             +       -                  -            -
Gln50    76        327D (G)         4.0       12.0             +       -                  -            +
Gln50    76        328D (T)         3.6       36.4             +       -                  +            +
Asn51    77        212C (A)         3.8       25.4             +       -                  -            +
Asn51    77        213C (A)         3.1       45.5             +       -                  -            +
Asn51    77        330D (A)         4.5       5.2              +       -                  -            -
Arg53    79        327D (G)         3.6       23.2             +       -                  -            -
Arg53    79        328D (T)         3.0       36.6             +       -                  -            +
Lys55    81        211C (T)         3.2       25.8             +       -                  -            +
Lys57    83        328D (T)         3.9       8.0              +       -                  -            -
Lys57    83        329D (A)         2.9       42.6             +       -                  -            +

C) Paired-box subfamily (PDB ID 1fjlA)

Amino    Position  DNA Res. # and   Distance  Contact Surface  H-bond  Aromatic-Aromatic  Hydrophobic  Hydrophobic-Hydrophilic
Acid     #         Chain (A/C/T/G)  (Å)       Area (Å²)                contact            contact      contact
Arg2     8         4D (A)           4.0       4.0              -       -                  -            -
Arg2     8         5D (A)           2.9       47.8             +       -                  -            +
Arg2     8         6D (T)           3.4       51.5             +       -                  -            -
Arg2     8         12E (T)          3.0       31.0             +       -                  -            -
Arg2     8         13E (A)          3.9       30.0             +       -                  -            -
Arg3     9         5D (A)           3.4       38.4             -       -                  -            +
Arg3     9         6D (T)           2.9       19.5             +       -                  -            +
Ser4     10        5D (A)           3.3       11.8             +       -                  -            -
Arg5     11        2D (A)           4.0       4.9              +       -                  -            -
Arg5     11        3D (T)           2.7       51.7             +       -                  -            -
Arg5     11        4D (A)           3.3       45.9             +       -                  -            +
Arg5     11        5D (A)           4.1       6.3              -       -                  -            -
Arg5     11        13E (A)          3.4       19.0             -       -                  -            -
Arg5     11        14E (T)          3.4       40.2             +       -                  -            +
Arg5     11        1F (T)           3.2       34.9             -       -                  -            -
Thr6     12        4D (A)           2.9       36.3             +       -                  -            +
Thr6     12        5D (A)           2.5       23.7             +       -                  -            -
Phe8     14        4D (A)           3.4       50.5             -       -                  -            -
Tyr25    34        7E (C)           3.4       39.7             +       -                  -            +
Tyr25    34        8E (T)           2.4       26.1             +       -                  -            -
Arg31    40        6E (T)           3.0       31.5             +       -                  -            -
Arg44    70        5D (A)           2.8       59.8             +       -                  -            +
Gln46    72        6E (T)           3.6       31.6             +       -                  -            +
Gln46    72        7E (C)           3.7       21.0             +       -                  -            +
Val47    73        4D (A)           4.5       1.3              -       -                  -            -
Val47    73        5D (A)           3.8       32.1             -       -                  +            +
Val47    73        6D (T)           4.3       12.8             -       -                  +            +
Trp48    74        4D (A)           3.8       18.0             +       -                  -            -
Gln50    76        6D (T)           4.1       20.6             +       -                  -            +
Gln50    76        8E (T)           4.2       25.4             +       -                  +            +
Asn51    77        4D (A)           2.9       51.7             +       -                  -            +
Asn51    77        5D (A)           3.1       33.6             +       -                  -            +
Arg53    79        7E (C)           3.2       33.1             +       -                  -            -
Arg53    79        8E (A)           2.8       27.8             +       -                  -            +
Arg55    81        3D (T)           3.1       56.4             +       -                  -            +
Arg55    81        4D (A)           4.2       16.0             +       -                  -            -
Arg57    83        8E (A)           2.9       68.1             +       -                  -            +
Arg57    83        9E (G)           3.3       32.9             +       -                  -            +

D) POU subfamily (PDB ID 1cqtB)

Amino    Position  DNA Res. # and   Distance  Contact Surface  H-bond  Aromatic-Aromatic  Hydrophobic  Hydrophobic-Hydrophilic
Acid     #         Chain (A/C/T/G)  (Å)       Area (Å²)                contact            contact      contact
Arg602   8         710O (A)         3.6       17.0             +       -                  -            -
Arg602   8         711O (T)         3.5       42.9             +       -                  -            +
Arg602   8         724P (T)         3.1       48.8             +       -                  -            -
Arg602   8         725P (G)         3.6       14.8             +       -                  -            -
Lys603   9         710O (A)         3.1       52.2             +       -                  -            +
Lys603   9         711O (T)         2.9       28.7             +       -                  -            +
Lys604   10        710O (A)         2.7       24.9             +       -                  -            -
Lys604   10        726P (C)         3.3       33.9             +       -                  -            +
Lys605   11        707O (C)         4.1       0.2              +       -                  -            -
Lys605   11        708O (A)         2.6       45.5             +       -                  -            -
Lys605   11        709O (A)         3.0       49.4             +       -                  -            +
Lys605   11        710O (A)         3.3       13.8             -       -                  -            -
Lys605   11        725P (G)         3.2       44.4             -       -                  -            +
Lys605   11        726P (C)         3.7       53.0             +       -                  -            +
Thr606   12        709O (A)         3.2       17.8             +       -                  -            +
Thr606   12        710O (A)         2.9       21.6             +       -                  -            +
Ser607   13        726P (C)         3.0       13.9             +       -                  -            -
Ser607   13        727P (A)         3.0       33.0             +       -                  -            -
Ile608   14        709O (A)         4.0       32.7             +       -                  -            +
Arg613   19        708O (A)         2.7       34.4             +       -                  -            -
Arg613   19        708O (A)         4.4       11.4             +       -                  -            -
Glu643   69        710O (A)         4.4       2.9              +       -                  -            +
Val644   70        709O (A)         4.1       8.1              -       -                  -            +
Val644   70        710O (A)         3.1       18.6             -       -                  -            +
Val647   73        709O (A)         4.5       10.8             -       -                  +            -
Val647   73        710O (A)         3.7       35.7             -       -                  +            +
Trp648   74        709O (A)         2.9       46.1             +       -                  -            +
Cys650   76        719P (T)         4.2       30.5             -       -                  +            -
Asn651   77        709O (A)         3.4       19.0             +       -                  +            +
Asn651   77        710O (A)         3.0       45.9             +       -                  -            +
Asn651   77        711O (T)         4.1       2.6              +       -                  -            +
Asn651   77        721P (A)         3.9       6.2              +       -                  -            -
Arg652   78        708O (A)         2.7       41.8             +       -                  -            +
Arg653   79        719P (T)         4.4       12.0             +       -                  -            -
Gln654   80        720P (T)         3.2       33.9             -       -                  +            +
Gln654   80        721P (A)         3.0       28.0             +       -                  +            +
Gln654   80        722P (T)         3.4       26.6             +       -                  +            +
Lys655   81        708O (A)         2.7       28.2             +       -                  +            +
Lys657   83        720P (T)         3.2       39.8             +       -                  -            -
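As an illustration of the 4.5 Å criterion used to compile the tables in this appendix, the following Python sketch lists protein-DNA residue pairs whose closest atoms lie within the cutoff. It is only a sketch under stated assumptions: the coordinates are invented for the example, the function name `contacts_within` is hypothetical, and the distance reported for each residue pair is assumed to be the minimum inter-atomic distance. The contact-type and surface-area columns of the tables are not reproduced.

```python
import math

CUTOFF = 4.5  # Å, the distance threshold named in the appendix title


def contacts_within(protein_atoms, dna_atoms, cutoff=CUTOFF):
    """Return sorted (protein residue, DNA residue, minimum distance) tuples.

    Each atom is given as (residue label, (x, y, z)). A residue pair is
    reported once, with the shortest atom-atom distance found, provided
    that distance does not exceed the cutoff.
    """
    best = {}
    for p_res, p_xyz in protein_atoms:
        for d_res, d_xyz in dna_atoms:
            dist = math.dist(p_xyz, d_xyz)  # Euclidean distance (Python 3.8+)
            if dist <= cutoff:
                key = (p_res, d_res)
                if key not in best or dist < best[key]:
                    best[key] = dist
    return sorted((p, d, round(v, 1)) for (p, d), v in best.items())


# Hypothetical coordinates, invented for illustration only.
protein_atoms = [
    ("Arg5", (0.0, 0.0, 0.0)),
    ("Arg5", (1.0, 0.0, 0.0)),
    ("Leu13", (10.0, 0.0, 0.0)),
]
dna_atoms = [
    ("211C", (3.0, 0.0, 0.0)),
    ("212C", (0.0, 4.0, 0.0)),
    ("213C", (20.0, 0.0, 0.0)),
]

result = contacts_within(protein_atoms, dna_atoms)
for protein_res, dna_res, distance in result:
    print(protein_res, dna_res, distance)
```

In practice the atom lists would be read from the PDB entries named in the table headings; the all-pairs loop is adequate for a single homeodomain-DNA complex but would be replaced by a spatial index for larger structures.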

References

1. Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A et al.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 2006, 34: D108-D110.

2. Luisi PL, Ferri F, Stano P: Approaches to semi-synthetic minimal cells: a review. Naturwissenschaften 2006, 93: 1-13.

3. Galas DJ: Sequence interpretation. Making sense of the sequence. Science 2001, 291: 1257-1260.

4. Madhani HD: Interplay of intrinsic and extrinsic signals in yeast differentiation. Proc Natl Acad Sci U S A 2000, 97: 13461-13463.

5. Li Y, Alvarez OA, Gutteling EW, Tijsterman M, Fu J, Riksen JA et al.: Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet 2006, 2: e222.

6. Maciag K, Altschuler SJ, Slack MD, Krogan NJ, Emili A, Greenblatt JF et al.: Systems-level analyses identify extensive coupling among gene expression machines. Mol Syst Biol 2006, 2: 2006.

7. Nirenberg MW, Matthaei JH: The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc Natl Acad Sci U S A 1961, 47: 1588-1602.

8. Falaschi A, Adler J, Khorana HG: Chemically synthesized deoxypolynucleotides as templates for ribonucleic acid polymerase. J Biol Chem 1963, 238: 3080-3085.

9. Lodish H, Berk A, Matsudaira P, Kaiser CA, Krieger M, Scott MP et al.: Molecular Cell Biology, 5th edn. W. H. Freeman; 2003.

10. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD: Molecular Biology of the Cell, 3rd edn. New York: Garland Publishing; 1994.

11. Uffenbeck SR, Krebs JE: The role of chromatin structure in regulating stress-induced transcription in Saccharomyces cerevisiae. Biochem Cell Biol 2006, 84: 477-489.

12. Zheng C, Hayes JJ: Structures and interactions of the core histone tail domains. Biopolymers 2003, 68: 539-546.

13. Turner BM: Cellular memory and the histone code. Cell 2002, 111: 285-291.

14. Luger K, Mader AW, Richmond RK, Sargent DF, Richmond TJ: Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature 1997, 389: 251-260.

15. Richards EJ, Elgin SC: Epigenetic codes for heterochromatin formation and silencing: rounding up the usual suspects. Cell 2002, 108: 489-500.

16. Wright S: Regulation of eukaryotic gene expression by transcriptional attenuation. Mol Biol Cell 1993, 4: 661-668.

17. Hampsey M: Molecular genetics of the RNA polymerase II general transcriptional machinery. Microbiol Mol Biol Rev 1998, 62: 465-503.

18. Robinson MM, Yatherajam G, Ranallo RT, Bric A, Paule MR, Stargell LA: Mapping and functional characterization of the TAF11 interaction with TFIIA. Mol Cell Biol 2005, 25: 945-957.

19. Arnosti DN, Kulkarni MM: Transcriptional enhancers: Intelligent enhanceosomes or flexible billboards? J Cell Biochem 2005, 94: 890-898.

20. Babitzke P, Gollnick P: Posttranscription initiation control of tryptophan metabolism in Bacillus subtilis by the trp RNA-binding attenuation protein (TRAP), anti-TRAP, and RNA structure. J Bacteriol 2001, 183: 5795-5802.

21. Cheng H, Dufu K, Lee CS, Hsu JL, Dias A, Reed R: Human mRNA export machinery recruited to the 5' end of mRNA. Cell 2006, 127: 1389-1400.

22. Muhlrad D, Decker CJ, Parker R: Deadenylation of the unstable mRNA encoded by the yeast MFA2 gene leads to decapping followed by 5'-->3' digestion of the transcript. Genes Dev 1994, 8: 855-866.

23. Dower K, Rosbash M: T7 RNA polymerase-directed transcripts are processed in yeast and link 3' end formation to mRNA nuclear export. RNA 2002, 8: 686-697.

24. Anderson JS, Parker RP: The 3' to 5' degradation of yeast mRNAs is a general mechanism for mRNA turnover that requires the SKI2 DEVH box protein and 3' to 5' exonucleases of the exosome complex. EMBO J 1998, 17: 1497-1506.

25. Black DL: Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem 2003, 72: 291-336.

26. Crofts AJ, Washida H, Okita TW, Ogawa M, Kumamaru T, Satoh H: Targeting of proteins to endoplasmic reticulum-derived compartments in plants. The importance of RNA localization. Plant Physiol 2004, 136: 3414-3419.

27. Algire MA, Lorsch JR: Where to begin? The mechanism of translation initiation codon selection in eukaryotes. Curr Opin Chem Biol 2006, 10: 480-486.

28. Svitkin YV, Herdy B, Costa-Mattioli M, Gingras AC, Raught B, Sonenberg N: Eukaryotic translation initiation factor 4E availability controls the switch between cap-dependent and internal ribosomal entry site-mediated translation. Mol Cell Biol 2005, 25: 10556-10565.

29. Pfingsten JS, Costantino DA, Kieft JS: Structural basis for ribosome recruitment and manipulation by a viral IRES RNA. Science 2006, 314: 1450-1454.

30. Smolke CD, Keasling JD: Effect of copy number and mRNA processing and stabilization on transcript and protein levels from an engineered dual-gene operon. Biotechnol Bioeng 2002, 78: 412-424.

31. Walsh G, Jefferis R: Post-translational modifications in the context of therapeutic proteins. Nat Biotechnol 2006, 24: 1241-1252.

32. Utsumi T, Sato M, Nakano K, Takemura D, Iwata H, Ishisaka R: Amino acid residue penultimate to the amino-terminal gly residue strongly affects two cotranslational protein modifications, N-myristoylation and N-acetylation. J Biol Chem 2001, 276: 10505-10513.

33. Anfinsen CB: Principles that govern the folding of protein chains. Science 1973, 181: 223-230.

34. Fink AL: Chaperone-mediated protein folding. Physiol Rev 1999, 79: 425-449.

35. Guda C, Subramaniam S: pTARGET [corrected] a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics 2005, 21: 3963-3969.

36. Thornton BR, Toczyski DP: Precise destruction: an emerging picture of the APC. Genes Dev 2006, 20: 3069-3078.

37. Reinstein E, Ciechanover A: Narrative review: protein degradation and human diseases: the ubiquitin connection. Ann Intern Med 2006, 145: 676-684.

38. Beckett D: Regulated assembly of transcription factors and control of transcription initiation. J Mol Biol 2001, 314: 335-352.

39. Narlikar L, Hartemink AJ: Sequence features of DNA binding sites reveal structural class of associated transcription factor. Bioinformatics 2006, 22: 157-163.

40. Bremer RE, Wurtz NR, Szewczyk JW, Dervan PB: Inhibition of major groove DNA binding bZIP proteins by positive patch polyamides. Bioorg Med Chem 2001, 9: 2093-2103.

41. Dragan AI, Li Z, Makeyeva EN, Milgotina EI, Liu Y, Crane-Robinson C et al.: Forces driving the binding of homeodomains to DNA. Biochemistry 2006, 45: 141-151.

42. Nguyen-Hackley DH, Ramm E, Taylor CM, Joung JK, Dervan PB, Pabo CO: Allosteric inhibition of zinc-finger binding in the major groove of DNA by minor-groove binding ligands. Biochemistry 2004, 43: 3880-3890.

43. Green A, Parker M, Conte D, Sarkar B: Zinc finger proteins: A bridge between transition metals and gene regulation. The Journal of Trace Elements in Experimental Medicine 1998, 11: 103-118.

44. Aravind L, Anantharaman V, Balaji S, Babu MM, Iyer LM: The many faces of the helix-turn-helix domain: transcription regulation and beyond. FEMS Microbiol Rev 2005, 29: 231-262.

45. Banerjee-Basu S, Baxevanis AD: Molecular evolution of the homeodomain family of transcription factors. Nucleic Acids Res 2001, 29: 3258-3269.

46. Hunter CP, Kenyon C: Specification of anteroposterior cell fates in Caenorhabditis elegans by Drosophila Hox proteins. Nature 1995, 377: 229-232.

47. Mathias JR, Zhong H, Jin Y, Vershon AK: Altering the DNA-binding specificity of the yeast Matalpha 2 homeodomain protein. J Biol Chem 2001, 276: 32696-32703.

48. van Meyel DJ, O'Keefe DD, Thor S, Jurata LW, Gill GN, Thomas JB: Chip is an essential cofactor for apterous in the regulation of axon guidance in Drosophila. Development 2000, 127: 1823-1831.

49. Bachy I, Vernier P, Retaux S: The LIM-homeodomain gene family in the developing Xenopus brain: conservation and divergences with the mouse related to the evolution of the forebrain. J Neurosci 2001, 21: 7620-7629.

50. Pontoglio M, Prie D, Cheret C, Doyen A, Leroy C, Froguel P et al.: HNF1alpha controls renal glucose reabsorption in mouse and man. EMBO Rep 2000, 1: 359-365.

51. Ceska TA, Lamers M, Monaci P, Nicosia A, Cortese R, Suck D: The X-ray structure of an atypical homeodomain present in the rat liver transcription factor LFB1/HNF1 and implications for DNA binding. EMBO J 1993, 12: 1805-1810.

52. Nagasaki H, Matsuoka M, Sato Y: Members of TALE and WUS subfamilies of homeodomain proteins with potentially important functions in development form dimers within each subfamily in rice. Genes Genet Syst 2005, 80: 261-267.

53. Ariel FD, Manavella PA, Dezar CA, Chan RL: The true story of the HD-Zip family. Trends Plant Sci 2007, 12: 419-426.

54. Leibfried A, To JP, Busch W, Stehling S, Kehle A, Demar M et al.: WUSCHEL controls meristem function by direct regulation of cytokinin-inducible response regulators. Nature 2005, 438: 1172-1175.

55. Tan QK, Irish VF: The Arabidopsis zinc finger-homeodomain genes encode proteins with unique biochemical properties that are coordinately expressed during floral development. Plant Physiol 2006, 140: 1095-1108.

56. Kanrar S, Onguka O, Smith HM: Arabidopsis inflorescence architecture requires the activities of KNOX-BELL homeodomain heterodimers. Planta 2006, 224: 1163-1173.

57. Gehring WJ, Qian YQ, Billeter M, Furukubo-Tokunaga K, Schier AF, Resendez-Perez D et al.: Homeodomain-DNA recognition. Cell 1994, 78: 211-223.

58. Gehring WJ, Affolter M, Burglin T: Homeodomain proteins. Annu Rev Biochem 1994, 63: 487-526.

59. Wolberger C, Vershon AK, Liu B, Johnson AD, Pabo CO: Crystal structure of a MAT alpha 2 homeodomain-operator complex suggests a general model for homeodomain-DNA interactions. Cell 1991, 67: 517-528.

60. Hake S, Smith HM, Holtan H, Magnani E, Mele G, Ramirez J: The role of knox genes in plant development. Annu Rev Cell Dev Biol 2004, 20: 125-151.

61. LaRonde-LeBlanc NA, Wolberger C: Structure of HoxA9 and Pbx1 bound to DNA: Hox hexapeptide and DNA recognition anterior to posterior. Genes Dev 2003, 17: 2060-2072.

62. Passner JM, Ryoo HD, Shen L, Mann RS, Aggarwal AK: Structure of a DNA-bound Ultrabithorax-Extradenticle homeodomain complex. Nature 1999, 397: 714-719.

63. Johannesson H, Wang Y, Engstrom P: DNA-binding and dimerization preferences of Arabidopsis homeodomain-leucine zipper transcription factors in vitro. Plant Mol Biol 2001, 45: 63-73.

64. Bayarsaihan D, Enkhmandakh B, Makeyev A, Greally JM, Leckman JF, Ruddle FH: Homez, a homeobox leucine zipper gene specific to the vertebrate lineage. Proc Natl Acad Sci U S A 2003, 100: 10358-10363.

65. Billeter M: Homeodomain-type DNA recognition. Prog Biophys Mol Biol 1996, 66: 211-225.

66. Qian YQ, Billeter M, Otting G, Muller M, Gehring WJ, Wuthrich K: The structure of the Antennapedia homeodomain determined by NMR spectroscopy in solution: comparison with prokaryotic repressors. Cell 1989, 59: 573-580.

67. Ippel H, Larsson G, Behravan G, Zdunek J, Lundqvist M, Schleucher J et al.: The solution structure of the homeodomain of the rat insulin-gene enhancer protein isl-1. Comparison with other homeodomains. J Mol Biol 1999, 288: 689-703.

68. Tan S, Richmond TJ: Crystal structure of the yeast MATalpha2/MCM1/DNA ternary complex. Nature 1998, 391: 660-666.

69. Hunter CS, Rhodes SJ: LIM-homeodomain genes in mammalian development and human disease. Mol Biol Rep 2005, 32: 67-77.

70. Ostendorff HP, Bossenz M, Mincheva A, Copeland NG, Gilbert DJ, Jenkins NA et al.: Functional characterization of the gene encoding RLIM, the corepressor of LIM homeodomain factors. Genomics 2000, 69: 120-130.

71. Billeter M, Qian Y, Otting G, Muller M, Gehring WJ, Wuthrich K: Determination of the three-dimensional structure of the Antennapedia homeodomain from Drosophila in solution by 1H nuclear magnetic resonance spectroscopy. J Mol Biol 1990, 214: 183-197.

72. McGinnis W, Krumlauf R: Homeobox genes and axial patterning. Cell 1992, 68: 283-302.

73. Morita EH, Shirakawa M, Hayashi F, Imagawa M, Kyogoku Y: Structure of the Oct-3 POU-homeodomain in solution, as determined by triple resonance heteronuclear multidimensional NMR spectroscopy. Protein Sci 1995, 4: 729-739.

74. Sivaraja M, Botfield MC, Mueller M, Jancso A, Weiss MA: Solution structure of a POU-specific homeodomain: 3D-NMR studies of human B-cell transcription factor Oct-2. Biochemistry 1994, 33: 9845-9855.

75. Botfield MC, Jancso A, Weiss MA: An invariant asparagine in the POU-specific homeodomain regulates the specificity of the Oct-2 POU motif. Biochemistry 1994, 33: 8113-8121.

76. Pabo CO, Sauer RT: Transcription factors: structural families and principles of DNA recognition. Annu Rev Biochem 1992, 61: 1053-1095.

77. Ke A, Wolberger C: Insights into binding cooperativity of MATa1/MATalpha2 from the crystal structure of a MATa1 homeodomain-maltose binding protein chimera. Protein Sci 2003, 12: 306-312.

78. Goutte C, Johnson AD: Yeast a1 and alpha 2 homeodomain proteins form a DNA-binding activity with properties distinct from those of either protein. J Mol Biol 1993, 233: 359-371.

79. Li T, Jin Y, Vershon AK, Wolberger C: Crystal structure of the MATa1/MATalpha2 homeodomain heterodimer in complex with DNA containing an A-tract. Nucleic Acids Res 1998, 26: 5707-5718.

80. Fitch WM: Homology: a personal view on some of the problems. Trends Genet 2000, 16: 227-231.

81. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403-410.

82. Teichmann SA, Murzin AG, Chothia C: Determination of protein function, evolution and interactions by structural genomics. Curr Opin Struct Biol 2001, 11: 354-363.

83. Stebbings LA, Mizuguchi K: HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Res 2004, 32: D203-D207.

84. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536-540.

85. Gan HH, Perlow RA, Roy S, Ko J, Wu M, Huang J et al.: Analysis of protein sequence/structure similarity relationships. Biophys J 2002, 83: 2781-2791.

86. Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5: 823-826.

87. Gray GS, Fitch WM: Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Mol Biol Evol 1983, 1: 57-66.

88. Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: the complexity hypothesis. Proc Natl Acad Sci U S A 1999, 96: 3801-3806.

89. Bapteste E, Susko E, Leigh J, MacLeod D, Charlebois RL, Doolittle WF: Do orthologous gene phylogenies really support tree-thinking? BMC Evol Biol 2005, 5: 33.

90. Hotopp JC, Clark ME, Oliveira DC, Foster JM, Fischer P, Torres MC et al.: Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science 2007, 317: 1753-1756.

91. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89: 10915-10919.

92. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. Washington: National Biomedical Research Foundation; 1978:345-352.

93. Altschul SF: Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 1991, 219: 555-565.

94. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 2000, 28: 33-36.

95. Doolittle RF: Similar amino acid sequences: chance or common ancestry? Science 1981, 214: 149-159.

96. Cappello AR, Curcio R, Valeria MD, Stipani I, Robinson AJ, Kunji ER et al.: Functional and structural role of amino acid residues in the even-numbered transmembrane alpha-helices of the bovine mitochondrial oxoglutarate carrier. J Mol Biol 2006, 363: 51-62.

97. Chen D, Frey PA, Lepore BW, Ringe D, Ruzicka FJ: Identification of structural and catalytic classes of highly conserved amino acid residues in lysine 2,3-aminomutase. Biochemistry 2006, 45: 12647-12653.

98. Reese ML, Dakoji S, Bredt DS, Dotsch V: The guanylate kinase domain of the MAGUK PSD-95 binds dynamically to a conserved motif in MAP1a. Nat Struct Mol Biol 2007.

99. Zappulla DC, Maharaj AS, Connelly JJ, Jockusch RA, Sternglanz R: Rtt107/Esc4 binds silent chromatin and DNA repair proteins using different BRCT motifs. BMC Mol Biol 2006, 7: 40.

100. Lee LC, Lee YL, Leu RJ, Shaw JF: Functional role of catalytic triad and oxyanion hole-forming residues on enzyme activity of Escherichia coli thioesterase I/protease I/phospholipase L1. Biochem J 2006, 397: 69-76.

101. Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB: Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol 2003, 3: 19.

102. de la Chaux N, Messer PW, Arndt PF: DNA indels in coding regions reveal selective constraints on protein evolution in the human lineage. BMC Evol Biol 2007, 7: 191.

103. Wirtz P, Steipe B: Intrabody construction and expression III: engineering hyperstable V(H) domains. Protein Sci 1999, 8: 2245-2250.

104. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S et al.: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138-D141.

105. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D et al.: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33: D201-D205.

106. Abhiman S, Sonnhammer EL: FunShift: a database of function shift analysis on protein subfamilies. Nucleic Acids Res 2005, 33: D197-D200.

107. Bru C, Courcelle E, Carrere S, Beausse Y, Dalmar S, Kahn D: The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 2005, 33: D212-D215.

108. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278: 631-637.

109. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures. Structure 1997, 5: 1093-1108.

110. Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein structure families with common folding motifs. Protein Sci 1992, 1: 1691-1698.

111. Henikoff S, Henikoff JG, Pietrokovski S: Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics 1999, 15: 471-479.

112. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K et al.: The PROSITE database, its status in 2002. Nucleic Acids Res 2002, 30: 235-238.

113. Chen Y, Reilly KD, Sprague AP, Guan Z: SEQOPTICS: a protein sequence clustering system. BMC Bioinformatics 2006, 7 Suppl 4: S10.

114. Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R: ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2002, 18 Suppl 2: S182-S191.

115. Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 1999, 37: 360-378.

116. Sasson O, Linial N, Linial M: The metric space of proteins--comparative study of clustering algorithms. Bioinformatics 2002, 18 Suppl 1: S14-S21.

117. Sjolander K: Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc Int Conf Intell Syst Mol Biol 1998, 6: 165-174.

118. Cover TM, Thomas JA: Elements of information theory. New York: John Wiley & Sons; 1991.

119. Lazareva-Ulitsky B, Diemer K, Thomas PD: On the quality of tree-based protein classification. Bioinformatics 2005, 21: 1876-1890.

120. Bergsten J: A review of long-branch attraction. Cladistics 2005, 21: 163-193.

121. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG et al.: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497-3500.

122. Staub E, Rosenthal A, Hinzmann B: Systematic identification of immunoreceptor tyrosine-based inhibitory motifs in the human proteome. Cell Signal 2004, 16: 435-456.

123. Panchenko AR, Kondrashov F, Bryant S: Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci 2004, 13: 884-892.

124. Phillips A, Janies D, Wheeler W: Multiple sequence alignment in phylogenetic analysis. Mol Phylogenet Evol 2000, 16: 317-330.

125. Vingron M, Waterman MS: Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol 1994, 235: 1-12.

126. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443-453.

127. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195-197.

128. Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302: 205-217.

129. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792-1797.

130. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164-166.

131. Otu HH, Sayood K: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003, 19: 2122-2130.

132. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30: 1575-1584.

133. Sereno PC: The logical basis of phylogenetic taxonomy. Syst Biol 2005, 54: 595-619.

134. Brammer CA, von Dohlen CD: Evolutionary history of Stratiomyidae (Insecta: Diptera): The molecular phylogeny of a diverse family of flies. Mol Phylogenet Evol 2006.

135. Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 2003, 19: 301-302.

136. Ho SY, Larson G: Molecular clocks: when times are a-changin'. Trends Genet 2006, 22: 79-83.

137. Li WH: So, what about the molecular clock hypothesis? Curr Opin Genet Dev 1993, 3: 896-901.

138. Lanfear R, Thomas JA, Welch JJ, Brey T, Bromham L: Metabolic rate does not calibrate the molecular clock. Proc Natl Acad Sci U S A 2007, 104: 15388-15393.

139. Martin AP, Palumbi SR: Body size, metabolic rate, generation time, and the molecular clock. Proc Natl Acad Sci U S A 1993, 90: 4087-4091.

140. Yap VB, Speed T: Rooting a phylogenetic tree with nonreversible substitution models. BMC Evol Biol 2005, 5: 2.

141. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406-425.

142. Gascuel O, Steel M: Neighbor-joining revealed. Mol Biol Evol 2006, 23: 1997-2000.

143. Stamatakis A, Ludwig T, Meier H: RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 2005, 21: 456-463.

144. Jin G, Nakhleh L, Snir S, Tuller T: Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics 2007, 23: e123-e128.

145. Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24: 1596-1599.

146. Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18: 502-504.

147. Guindon S, Lethiec F, Duroux P, Gascuel O: PHYML Online--a web server for fast maximum likelihood-based phylogenetic inference. Nucleic Acids Res 2005, 33: W557-W559.

148. Salter LA, Pearl DK: Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst Biol 2001, 50: 7-17.

149. Lemmon AR, Milinkovitch MC: The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation. Proc Natl Acad Sci U S A 2002, 99: 10516-10521.

150. Susko E, Inagaki Y, Roger AJ: On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Mol Biol Evol 2004, 21: 1629-1642.

151. Huelsenbeck J, Rannala B: Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst Biol 2004, 53: 904-913.

152. Kearns M, Ron D: Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Comput 1999, 11: 1427-1453.

153. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S et al.: The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 2005, 33: D284-D288.

154. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al.: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235-242.

155. UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res 2007, 35: D193-D197.

156. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D et al.: New developments in the InterPro database. Nucleic Acids Res 2007, 35: D224-D228.

157. Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257: 342-358.

158. Reva B, Antipin Y, Sander C: Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol 2007, 8: R232.

159. Sibanda BL, Blundell TL, Thornton JM: Conformation of beta-hairpins in protein structures. A systematic classification with applications to modelling by homology, electron density fitting and protein engineering. J Mol Biol 1989, 206: 759-777.

160. Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol 1995, 2: 171-178.

161. Yeung KY, Ruzzo WL: Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17: 763-774.

162. Kalinina OV, Mironov AA, Gelfand MS, Rakhmaninova AB: Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Sci 2004, 13: 443-456.

163. Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I et al.: Network analysis of protein structures identifies functional residues. J Mol Biol 2004, 344: 1135-1146.

164. Ausiello G, Zanzoni A, Peluso D, Via A, Helmer-Citterich M: pdbFun: mass selection and fast comparison of annotated PDB residues. Nucleic Acids Res 2005, 33: W133-W137.

165. Ferre F, Ausiello G, Zanzoni A, Helmer-Citterich M: SURFACE: a database of protein surface regions for functional annotation. Nucleic Acids Res 2004, 32: D240-D244.

166. Young T, Abel R, Kim B, Berne BJ, Friesner RA: Motifs for molecular recognition exploiting hydrophobic enclosure in protein-ligand binding. Proc Natl Acad Sci U S A 2007, 104: 808-813.

167. Truong K, Ikura M: Identification and characterization of subfamily-specific signatures in a large protein superfamily by a hidden Markov model approach. BMC Bioinformatics 2002, 3: 1.

168. Ye K, Anton FK, Heringa J, Ijzerman AP, Marchiori E: Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting. Bioinformatics 2008, 24: 18-25.

169. Russell RB, Saqi MA, Sayle RA, Bates PA, Sternberg MJ: Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol 1997, 269: 423-439.

170. Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004, 428: 617-624.

171. Taylor WR, Orengo CA: Protein structure alignment. J Mol Biol 1989, 208: 1-22.

172. Carugo O: How root-mean-square distance (r.m.s.d.) values depend on the resolution of protein structures that are compared. Journal of Applied Crystallography 2003.

173. Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998, 95: 5857-5864.

174. Ochagavia ME, Wodak S: Progressive combinatorial algorithm for multiple structural alignments: application to distantly related proteins. Proteins 2004, 55: 436-454.

175. Peregrin-Alvarez JM, Yam A, Sivakumar G, Parkinson J: PartiGeneDB--collating partial genomes. Nucleic Acids Res 2005, 33: D303-D307.

176. Parkinson J, Anthony A, Wasmuth J, Schmid R, Hedley A, Blaxter M: PartiGene--constructing partial genomes. Bioinformatics 2004, 20: 1398-1404.

177. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 25: 4876-4882.

178. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389-3402.

179. Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16: 566-567.

180. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256: 1443-1445.

181. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85: 2444-2448.

182. Gilbert D: Readseq. http://iubio.bio.indiana.edu/soft/molbio/readseq/java/ 2005.

183. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14: 1188-1190.

184. Maddison DR, Swofford DL, Maddison WP: NEXUS: an extensible file format for systematic information. Syst Biol 1997, 46: 590-621.

185. Bininda-Emonds OR: Novel versus unsupported clades: assessing the qualitative support for clades in MRP supertrees. Syst Biol 2003, 52: 839-848.

186. Price SA, Bininda-Emonds OR, Gittleman JL: A complete phylogeny of the whales, dolphins and even-toed hoofed mammals (Cetartiodactyla). Biol Rev Camb Philos Soc 2005, 80: 445-473.

187. Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci 1996, 12: 357-358.

188. Jaramillo A, Wernisch L, Hery S, Wodak SJ: Folding free energy function selects native-like protein sequences in the core but not on the surface. Proc Natl Acad Sci U S A 2002, 99: 13554-13559.

189. Fraczkiewicz R, Braun W: Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. Journal of Computational Chemistry 1998, 19: 319-333.

190. Koradi R, Billeter M, Wuthrich K: MOLMOL: a program for display and analysis of macromolecular structures. J Mol Graph 1996, 14: 51-55, 29-32.

191. Dunbrack RL, Jr., Cohen FE: Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci 1997, 6: 1661-1681.

192. Burglin TR, Cassata G: Loss and gain of domains during evolution of cut superclass homeobox genes. Int J Dev Biol 2002, 46: 115-123.

193. Retaux S, Rogard M, Bach I, Failli V, Besson MJ: Lhx9: a novel LIM-homeodomain gene expressed in the developing forebrain. J Neurosci 1999, 19: 783-793.

194. Bachy I, Failli V, Retaux S: A LIM-homeodomain code for development and evolution of forebrain connectivity. Neuroreport 2002, 13: A23-A27.

195. Lee B, Richards FM: The interpretation of protein structures: estimation of static accessibility. J Mol Biol 1971, 55: 379-400.

196. Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M: Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15: 327-332.

197. Vriend G: WHAT IF: a molecular modeling and drug design program. J Mol Graph 1990, 8: 52-6, 29.

198. Wilson DS, Guenther B, Desplan C, Kuriyan J: High resolution crystal structure of a paired (Pax) class cooperative homeodomain dimer on DNA. Cell 1995, 82: 709-719.

199. Moore RC, Purugganan MD: The evolutionary dynamics of plant duplicate genes. Curr Opin Plant Biol 2005, 8: 122-128.