The chromatin insulator CTCF and the emergence of metazoan diversity

Peter Heger, Birger Marin, Marek Bartkuhn, Einhard Schierenberg, Thomas Wiehe

Supporting Information

[Figure S1: phylogenetic tree (image not reproduced). Leaf labels give species, taxon, and accession number. Labeled groups: Bilaterian CTCF-clade, with Deuterostomia (Echinodermata, Hemichordata), Deuterostomia (Chordata), Protostomia (Lophotrochozoa), and Protostomia (Ecdysozoa); Outgroup (crooked legs). The Brachionus plicatilis branch is reduced by 50 %.]

Sequence data: 275 aligned amino acid positions of 57 CTCF proteins from bilaterians and 105 non-CTCF ZF proteins from:
- Trichoplax adhaerens (Placozoa; genome)
- Pleurobrachia pileus (Ctenophora; 51,042 ESTs)
- Amphimedon queenslandica (Porifera; genome)
- Nematostella vectensis and Hydra magnipapillata (Cnidaria; genome)

Tree topology: maximum likelihood (RAxML; model: WAG+Γ)

Significances at branches: maximum likelihood (bootstrap values >50) / MrBayes (posterior probabilities >0.80)

Fig. S 1: Existence of a bilaterian CTCF-clade. The phylogenetic tree is identical to that shown in Fig. 1, but additionally displays 105 candidates from early branching metazoans that are omitted in Fig. 1. None of these enters the CTCF cluster. Branch labels indicate origin and accession number of a sequence.

[Figure S2: schematic metazoan phylogeny with a logarithmic bar graph (image not reproduced). Bar axis: number of ESTs (ticks at 10^3, 10^5, 10^7). Labels in the tree: Ctenophora, Porifera, Cnidaria, Anthozoa, Medusozoa, Echinodermata, Hemichordata, Deuterostomia, Xenoturbella (?), Chordata, Nematoda, Nematomorpha, Kinorhyncha, Ecdysozoa, Priapulida, Tardigrada, Onychophora, Protostomia, Arthropoda, Platyhelminthes, Rotifera, Lophotrochozoa, Mollusca, Phoronida, Brachiopoda, Nemertea, Hirudinea, Oligochaeta, Echiura, Polychaeta, Annelida, Sipuncula.]

Fig. S 2: Mapping of current EST data onto a metazoan phylogeny. A schematic representation of the animal kingdom is shown (as in ref. 1). The position of each major animal group is indicated by a green dot. Arthropods are collapsed into one branch. A question mark indicates the uncertain phylogenetic placement of Xenoturbella. The number of ESTs deposited for a given taxon at NCBI (http://www.ncbi.nlm.nih.gov; reference day is May 7, 2010) is shown as a logarithmic bar graph. CTCF was not found in early branching metazoans despite the availability of large numbers of ESTs. In Bilateria, detection of CTCF correlates with the amount of EST data available for a given phylum. Platyhelminthes are the only exception: here the failure to detect CTCF cannot be explained by a paucity of sequence data, as two complete genomes (Schmidtea mediterranea, Schistosoma mansoni) and more than 800,000 ESTs exist.

ZF1 (ruler: 10, 20) and ZF7 (ruler: 180, 190, 200)
Clade  Species          ZF1                           ZF7
D      Strongylocentr.  ......                        HKCDHCDTTFGRYADMKTHVRKMHTA.GEP
D      Saccoglossus     ......                        HKCEFCDVTFGRIADMKAHIRKMHTP.GDP
D      Amphioxus        ......                        YVCDICGTALTRKSDLKSHVRKLHTG.DKL
D      Ciona            YQCRECSFYSHRHSNLVRHMKIHTDERP  FKCEICRTLLGRKSDLNVHMRKQHAFQEAP
D      Molgula          YQCQECAFYSHRHSNLIRHMKIHTDERP  YECEICHTHLGRKSDLNVHMRKQHAYQDTP
D      Gasterosteus     ......LDRHMKSHTDERP           FHCPHCDTVIARKSDLGVHLRKQHSFLEKG
D      Homo             FQCELCSYTCPRRSNLDRHMKSHTDERP  FHCPHCDTVIARKSDLGVHLRKQHSYIEQG
L      Capitella        LMCNYCNYTSPKRYLLTRHMKTHSEERP  YQCELCPTTCGRRTDLKIHVQKLHTS.DKP
L      Hirudo           ......                        FKCELCPSTCTRKTDLRNHVLRIHTS.DKP
L      Helobdella       LKCNYCSYMTTKRFQLIRHLRSHSDDRP  FRCELCPSTCTRKTDLRNHVLRIHTS.DKP
L      Lottia           MMCSYCNYTSPKRYLLTRHMKTHSEDRP  FQCEFCPTTCGRRTDLKIHVQKLHNS.EHP
L      Aplysia          LMCDYCSYTSPKRYLLTRHMKSHSEERP  FQCNLCPTTCGRKTDLKIHFNKLHTV.GSP
L      Brachionus       ......                        VDDQGSPIQNAIKSNKKNSIKK.....EAK
E      Milnesium        ......                        FQCDFCPVTCGRKNDLKIHVQKQHTS.LEP
E      Trichinella      YQCEFCPYTNHKRYLLLRHMKSHSEERP  FQCKFCPSSCGRKTDLRIHVQKLHTA.SAP
E      Tetranychus      HMCNYCNYASNKRYLLSQHMKSHSEERP  FKCDQCPTTCGRKTDLRIHVQKLHTS.DKP
E      Daphnia          HMCNYCNYTSPKRYLLSRHMKCHSEERP  FQCELCPTTCGRKTDLRIHVQKLHTS.EKP
E      Laupala          ......                        FQCELCPTTCGRKTDLRIHVQKLHTS.DKP
E      Acyrthosiphon    HMCNYCNYTTGKRYLLSRHMKSHSKERP  FKCDHCPATCGRKTDLRIHVQKLHTS.DKP
E      Nasonia          HMCNYCNYTSPKRYLLSRHMKSHSEERP  FQCELCPTTCGRKTDLRIHVQKLHTS.DKP
E      Tribolium        HMCNYCNYTSPKRYLLSRHMKSHSEERP  FHCEFCPTTCGRKTDLRIHVQKLHTS.DKP
E      Anopheles        YMCNYCNYTSNKLFLLSRHLKTHSEDRP  FQCKLCPTTCGRKTDLRIHVQNLHTA.DKP
E      Drosophila       YSCPHCPYTASKKFLITRHSRSHDVEPS  FQCNYCPTTCGRKADLRVHIKHMHTS.DVP

ZF2 (ruler: 30, 40, 50) and ZF8 (ruler: 210, 220)
Clade  Species          ZF2                           ZF8
D      Strongylocentr.  HECHLCGRIFRTSTLLRNHENTHSGTKP  MICKICENAFTDRFTYMQHVRGHRGEKI
D      Saccoglossus     HECDICGRTFRTSTLLRNHHNTHTGTKP  LICKVCENGFSDRFSYMQHIKTHRGEKI
D      Amphioxus        ......NSHNGVKP                LTCKYCDSAFPDKYNLTKHLKTHQGEKR
D      Ciona            YKCHLCERSFRTNTLLRNHINTHTGVKP  TQCRYCDELFHDRWSLMQHQRTHRSCGQ
D      Molgula          YQCHLCERAFRTNTLLRNHINTHTGVKP  MQCKYCDELFHDRW......
D      Gasterosteus     HKCHLCGRAFRTVTLLRNHLNTHTGTRP  KKCRYCDAVFHERYALIQHQKTHKNEKR
D      Homo             HKCHLCGRAFRTVTLLRNHLNTHTGTRP  KKCRYCDAVFHERYALIQHQKSHKNEKR
L      Capitella        HKCFICNRGFKTMPSLQNHINTHTGVRP  IQCKKCGKHFPDRYTYKIHVKTHEGEKC
L      Hirudo           ......                        LLCRKCGKNFPDRYTYRIHSRTHDGQKC
L      Helobdella       FKCEFCALAFKSRVTLNNHTNTHKGVRP  LLCKKCGKNFPDRYTY.IHSRTHDGQKC
L      Lottia           HKCHICGRGFKTLASLTNHVNTHTGVRP  LLCRKCNQTFPDRYTYKIHLKSHEGEKC
L      Aplysia          HKCSECNRGFKTPASLLNHVNTHTGTRP  LECKKCGKVFSDRYSYKQHVRSHDGDRC
L      Brachionus       ......                        YYSEEKSKLSSFILTKSDAFINSNTDDF
E      Milnesium        ......                        VPCKVCDRSFPDRHSYRQHAKLDNCGTK
E      Trichinella      FKCTVCERCFKTNSSLQNHINTHTGTRP  IKCKKCDRTFTDRYTFKLHCKEHDGERC
E      Tetranychus      HKCGICERGFKTIASLQNHVNTHTGVRP  LTCKRCDQSFSDRYSFKIHLKSHEGEKC
E      Daphnia          HKCSVCDRGFKTLASLQNHVNTHTGTKP  LKCKRCGKSFPDRYTFKLHSKTHEGEKC
E      Laupala          ......                        LKCKRCGKSFPDRYTYKTHSKSHEGEKC
E      Acyrthosiphon    HKCTVCERGFKTLTSLQNHVNTHTGTKP  IKCKRCGDAFPDRYQYKVHCKSHEGEKC
E      Nasonia          HKCSVCERGFKTLASLQNHVNTHTGTKP  LKCKRCGKTFPDRYSYKLHSKTHEGEKC
E      Tribolium        HKCSVCERGFKTLASLQNHVNTHTGTKP  LKCKRCGKSFPDRYSYKVHNKTHEGEKC
E      Anopheles        HKCVVCERGFKTLASLQNHVNTHTGTKP  IKCKRCDSTFPDRYSYKMHAKTHEGEKC
E      Drosophila       FKCSICERSFRSNVGLQNHINTHMGNKP  MTCRRCGQQLPDRYQYKLHVKSHEGEKC

ZF3 (ruler: 60, 70, 80) and ZF9 (ruler: 230, 240, 250)
Clade  Species          ZF3                              ZF9
D      Strongylocentr.  YKC..ELCPKAFGTSGELGRHMKYMHTHEKP  YKCGECGYSAPQKRHLVIHMRVHTGERP
D      Saccoglossus     YKC..ELCERAFGTSGELARHTKYIHTHEKP  YKCGQCGYSCPQKRHLLTHMRVHTGERP
D      Amphioxus        HKC..DQCPMSFVTSGELMRHRRYKHTHEKP  FRCEDCNYCCTQERHLINHKRCHTGEKP
D      Ciona            YKCTVDGCVMAFVTSGELTRHTRYIHTHEKP  RFRSDDGVSSKRR......
D      Molgula          YKCMAEGCTMAFVTSGELTRHNRYKHTHEKP  ......
D      Gasterosteus     HKC..TDCDMAFVTSGELVRHRRYKHTHEKP  FKCEQCDYCCRQERHMVMHKRTHTGEKP
D      Homo             HKC..PDCDMAFVTSGELVRHRRYKHTHEKP  FKCDQCDYACRQERHMIMHKRTHTGEKP
L      Capitella        HKC..KECNASFTTSGELVRHVRYKHTFEKP  YKCELCPYSALSQRHLESHILTHTGEKP
L      Hirudo           ......                           FKCELCPYSASVQRHLEHHIMTHTGERP
L      Helobdella       NKC..KECEACFTTTGELIRHVRYRHTYEKP  FKCELCPYSASIQRHLENHIMTHTGERP
L      Lottia           HKC..KLCESAFTTSGELVRHVRYKHTFEKP  FKCDICGHAAISQRHLETHQLIHSGEKP
L      Aplysia          HKC..KTCDAAFTTSGELVRHIRYRHTFEKP  FKCEDCDFIANTERALDQHAAIHSGGKR
L      Brachionus       ......DSFFGTKSEMKRHIKYKHTMEKP    FTCDQCFMRFLTKDLLIEHKVKHTGERP
E      Milnesium        ......                           K......
E      Trichinella      HQC..KGCELAFTTSGELIRHIRYKHTLEKP  YQCHLCPYSAMAQRHLEAHTLLHHSDKP
E      Tetranychus      HGC..KYCDAAFTTSGELVRHVRYRHTHEKP  FKCDLCNYASVSARHLESHMLIHTDQKP
E      Daphnia          HRC..KHCHSSFTTSGELVRHVRYRHTHEKP  FKCDLCAYASISQRHLESHMLIHTDQKP
E      Laupala          ......                           FKCDLCPYASISQRHLESHMLIHTDQKP
E      Acyrthosiphon    YQC..RSCPSAFTTSGELVRHVRYKHTHEKP  YKCELCPYASMSQRHLETHMLIHTDEKP
E      Nasonia          HNC..KFCDSAFTTSGELVRHVRYRHTHEKP  YKCDLCPYASISARHLESHMLIHTDQKP
E      Tribolium        HSC..KFCDSAFTTSGELVRHVRYKHTHEKP  YKCDLCPYASISARHLESHMLIHTDQKP
E      Anopheles        HRC..KHCDNCFTTSGELIRHIRYRHTHERP  YRCEYCPYASISMRHLESHLLLHTDQKP
E      Drosophila       HKC..KLCESAFTTSGELVRHTRYKHTKEKP  YSCKLCSYASVTQRHLASHMLIHLDEKP

ZF4 (ruler: 90, 100, 110) and ZF10 (ruler: 260, 270, 280)
Clade  Species          ZF4                           ZF10
D      Strongylocentr.  HKCPLCDYLSVEASKIKRHMRSHTGEKP  YECEECHE....TFKHKQTLINHQRSKHNLIQE
D      Saccoglossus     HKCPLCDYISVESSKIKRHMRSHTGEKP  YACKDCGE....SFKHKPTLIHHEKTKHNPEIE
D      Amphioxus        HKCTMCDYASVEISKLKRHMRSHTGERP  FVCVQC......
D      Ciona            FRCTLCDYASVEISKLRRHFRSHTGERP  ......
D      Molgula          FKCTLCDYASVEISKLRRHFRSHTGERP  ......
D      Gasterosteus     FKCSMCDYCSVEVSKLKRHIRSHTGERP  FACSQCDK....TFRQKQLLDMHFKRYHDPNFV
D      Homo             FKCSMCDYASVEVSKLKRHIRSHTGERP  YACSHCDK....TFRQKQLLDMHFKRYHDPNFV
L      Capitella        HRCPNCDYASVELSKLKRHLRSHTGERP  FRCEECDQ....TFRQKQLLKRHLNLYHTPDYV
L      Hirudo           ......                        YKCNLCHQ....AFRQKTLLRKHYTTSHNANSE
L      Helobdella       HRCTVCNYASVELSKLRRHVRSHTGERP  FKCHICQQ....GFRQKTLLRRHFAISHEGKSE
L      Lottia           HKCTLCEYASVELSKLKRHMRSHTGERP  YECESCDH....SFRQKQLLKRHQSLHHSAGE.
L      Aplysia          HRCPKCDYASVELSKLKRHMRSHTGERP  FECNVCHA....SFNLRQMLEKHKQW.CTLDLE
L      Brachionus       HKCTYCDYSTVELSKLKRHVRVHTDERP  FECQICGMKFSHKFALRAHLLSHD......
E      Milnesium        ......                        ......
E      Trichinella      HKCTECSYASVELSKLKRHIRSHTGERP  YKCVDCNL....SFKQVSLLKRHVESTHAAANQ
E      Tetranychus      HRCPECDYASVELSKLKRHMRCHTGERP  FECDECEQ....SFRQKQLLKRHKNLYHNPEYI
E      Daphnia          HRCTECDYASVELSKLKRHMRCHTGERP  YQCDQCDQ....CFRQKQLLRRHQNLYHNPDYV
E      Laupala          ......                        YQCDQWDQ....SFRQKQLLRRHQNLYHNPNYV
E      Acyrthosiphon    HKCTICDYASVELSKMRNHMRCHTGERP  FHCKLCDQ....SFRQKQLLRRHHNLYHNPAYI
E      Nasonia          HKCNDCDYASVELSKLKRHIRCHTGERP  YQCDHCYQ....SFRQKQLLKRHCNLYHNPNYV
E      Tribolium        HKCPECEYASVELSKLKRHIRCHTGERP  YHCDQCDQ....SFRQKQLLKRHQNLYHNPDYI
E      Anopheles        HKCTECDYASVELSKLKRHIRTHTGEKP  YKCDQCAQ....TFRQKQLLKRHMNYYHNPDYV
E      Drosophila       HKCTECTYASVELTKLRRHMTCHTGERP  FHCDQCPQ....AFRQRQLLRRHMNLVHNEEYQ

ZF5 (ruler: 120, 130, 140) and ZF11 (ruler: 290, 300, 310)
Clade  Species          ZF5                           ZF11
D      Strongylocentr.  YKCTLCEYASTDNYKLKRHMRVHTGERP  ......
D      Saccoglossus     FKCQLCAYASTDNYKLKRHMRVHTGEKP  GEENEEGSFQ.CDKCDKTFSRTSSLKNHQKKHE
D      Amphioxus        FQCGMCSYASPDSYKLKRHMRTHTGEKP  ......
D      Ciona            YSCEECGKAFADSFHLKRHRMSHTGEKP  ......
D      Molgula          YGCNECGKAFADSFHLKRHQMSHTGEKP  ......
D      Gasterosteus     FQCSLCSYASRDTYKLKRHMRTHSGEKP  P.....TAFV.CSKCNKTFTPQNTCFRIRKLLG
D      Homo             FQCSLCSYASRDTYKLKRHMRTHSGEKP  P.....AAFV.CSKCGKTFTRRNTMARHADNCA
L      Capitella        YQCPHCTYASPDTYKLKRHLRIHTGEKP  PPNPREKVHE.CPECHKIFAHYGNLVRHLVSHD
L      Hirudo           ......                        LALKQRAKIHVCPHCQRAFANEGSMTKHIKRLH
L      Helobdella       YKCPECPYASPDTFKLKRHLRVHTGERP  ......
L      Lottia           YQCPHCTYASPDTYKLKRHLRIHTGEKP  ......
L      Aplysia          YKCPHCPYASPDTYKLKRHLRIHTGEKP  ......
L      Brachionus       YLCHLCDYASRDTFRLKRHLRTHTGEKP  ......
E      Milnesium        ......EKP                     ......
E      Trichinella      YHCPHCSYASPDTYKLKRHLRVHTGEKP  L......
E      Tetranychus      YQCPHCTYASPDTYKLKRHLRIHTGEKP  PPQPKEKTHE.CPHCTQAFRHKVIL......
E      Daphnia          YQCPHCTYASPDTFKLKRHLRIHTGEKP  PPPPGEKRHE.CPECGKPFRHKGNLIRHLALHD
E      Laupala          ......                        PPPPQEKTHE.CPECGRAFRHKGNLIRHMAVHG
E      Acyrthosiphon    YQCPHCTYASPDTFKLKRHLRIHTGEKP  PPPPHAKTHQ.CNTCQRSFRHKGNLLRHMEVHM
E      Nasonia          YQCPHCTYASPDTFKLKRHLRIHTGEKP  PPPPQEKTHQ.CPECERPFRHKGNLIRHMAVHD
E      Tribolium        YQCPHCTYASPDTFKLKRHLRIHTGEKP  PPPPREKTHE.CPECSRAFRHKGNLIRHMAAHD
E      Anopheles        FQCPHCTYASPDKFKLTRHMRIHTGEKP  APTPKAKTHI.CPTCKRPFRHKGNLIRHMAMHD
E      Drosophila       YQCPHCTYASQDMFKLKRHMVIHTGEKK  PPEPREKLHK.CPSCPREFTHKGNLMRHMETHD

ZF6 (ruler: 150, 160, 170)
Clade  Species          ZF6
D      Strongylocentr.  FNCSQCDQSFSQKSSLKEHEWK.H.VGNRPS
D      Saccoglossus     FKCPDCDMAFSQKSSLKEHMWK.H.SGRRPT
D      Amphioxus        YECSVCLATFTQSGSLKMHMQR.H.LGTAPS
D      Ciona            YECPECNQRFTQRGSVKMHIMQQH.TKTAPK
D      Molgula          YECPHCQQRFTQRGSVKMHIMQQH.LKTAPK
D      Gasterosteus     YECYICHARFTQSGTMKMHILQKH.TENVAK
D      Homo             YECYICHARFTQSGTMKMHILQKH.TENVAK
L      Capitella        YKCDICNTRFTQSNSLKAHKLI.H.TGNKPI
L      Hirudo           ......TQSNSLKSHRLI.H.TGARPI
L      Helobdella       YVCDICEMRFTQSNSLKSHRLI.H.TGARPI
L      Lottia           YECMVCHARFTQSNSLKAHKLI.H.TGTKPI
L      Aplysia          YECDVCHSRFTQSNSLKAHKLI.H.TGNKPV
L      Brachionus       YECPICKNKFSQSNGLKAHIKTFH.SEVPII
E      Milnesium        YECDICHQKFTQSNSMKAHRLI.H.LDSKPS
E      Trichinella      YQCEVCNQRFTQSNSLKAHKLI.H.SGSRPV
E      Tetranychus      YECDVCHARFTQSNSLKAHKLI.H.SGNKPV
E      Daphnia          YECDICHARFTQSNSLKAHKLI.H.TGNKPI
E      Laupala          ......I.HSAGDKPV
E      Acyrthosiphon    YECDICFSRFTQSNSLKTHRLI.H.SGEKPV
E      Nasonia          YECDICNARFTQSNSLKAHKLV.HNVGDKPV
E      Tribolium        YECDICKTKFTQSNSLKTHKLT.HNIGDKPI
E      Anopheles        YSCDVCFARFTQSNSLKAHKMI.HQVGNKPV
E      Drosophila       YQCDICKSRFTQSNSLKAHKLI.HSVVDKPV

Fig. S 3: Multiple sequence alignment of representative CTCF orthologs. The entire ZF region of 23 CTCFs from Deuterostomia (D), Lophotrochozoa (L), and Ecdysozoa (E) is shown. Cysteine and histidine residues, important for Zn2+ complexation, are highlighted (red). Amino acid positions −1, 2, 3, and 6, which are used for base recognition in ZF proteins (2, 3), are indicated at the bottom and colored yellow if conserved. Dots indicate missing sequence information or alignment gaps. Numbering is according to the D. melanogaster ortholog. Despite strong overall conservation, the alignment reveals several lineage-specific modifications: (i) an insertion of two amino acids in ZF3 of ascidians; (ii) heterogeneous spacing of histidines in ZF6; (iii) absence of ZF7 and ZF8 in the rotifer Brachionus plicatilis; (iv) duplication of ZF4 in Pediculus humanus corporis (not shown); (v) absence of ZF9–11 in ascidians; (vi) absence of ZF11 in nematodes, echinoderms, and molluscs; (vii) a [...]-specific deletion of five amino acids in the linker between ZF10 and ZF11; (viii) a vertebrate or [...]-specific change in ZF11 of the second zinc-coordinating histidine to cysteine (blue).

Fig. S 4: Lineage-specific synapomorphies of CTCF. Shown are synapomorphies of two well-sampled bilaterian clades, the Chordata and the Arthropoda, mapped upon the alignment of five ZF domains (ZF1, 4, 6, 8, 9). Synapomorphies are shown as change from ancestral character state. The existence of clade-specific signatures in bilaterian CTCF sequences suggests (i) a single origin of the CTCF gene family in the common ancestor of bilaterian Metazoa, and (ii) vertical inheritance of CTCF genes during bilaterian cladogenesis, without horizontal gene transfer events.
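The clade-specific signatures described in the Fig. S 4 legend can be screened for in a simple way. The following is a minimal sketch, not the authors' procedure: it assumes that a column counts as clade-specific when the residues seen inside the clade share nothing with those seen outside it. The function name `clade_specific_columns` is hypothetical; the six ZF4 sequences are copied from the Fig. S 4 alignment.

```python
from typing import List, Sequence, Tuple


def clade_specific_columns(
    ingroup: Sequence[str], outgroup: Sequence[str]
) -> List[Tuple[int, str, str]]:
    """Return (1-based column, ingroup residues, outgroup residues) for every
    alignment column whose ingroup and outgroup residue sets are disjoint.

    Assumption of this sketch: all sequences are pre-aligned and equally long;
    gap characters are treated like any other symbol.
    """
    if not ingroup or not outgroup:
        raise ValueError("both groups need at least one sequence")
    sequences = list(ingroup) + list(outgroup)
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("sequences must have equal length")

    hits = []
    for col in range(width):
        inside = {seq[col] for seq in ingroup}
        outside = {seq[col] for seq in outgroup}
        if inside.isdisjoint(outside):
            hits.append((col + 1, "".join(sorted(inside)), "".join(sorted(outside))))
    return hits


# ZF4 sequences from the Fig. S 4 alignment.
ARTHROPODA_ZF4 = [
    "HRCTECDYASVELSKLKRHMRCH",  # Daphnia
    "HKCTECTYASVELTKLRRHMTCH",  # Drosophila
    "HKCTECDYASVELSKLKRHIRTH",  # Anopheles
]
OTHER_ZF4 = [
    "FKCSMCDYASVEVSKLKRHIRSH",  # Homo
    "HRCPNCDYASVELSKLKRHLRSH",  # Capitella
    "HKCTECSYASVELSKLKRHIRSH",  # Trichinella
]

print(clade_specific_columns(ARTHROPODA_ZF4, OTHER_ZF4))
```

With these six sequences the only reported column is the one where arthropods carry C (T in Anopheles) and the other taxa carry S, which matches the ZF4 annotation in the figure. A strict "one shared residue" criterion would miss it because of the Anopheles exception.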

Unique synapomorphies of Chordata: ZF1: L N; ZF6: indel of Q; ZF8: E M V Y S G

Domains per row, separated by -/-: ZF1, ZF4, ZF6, ZF8, ZF9

Branchiostoma_Cephalochordata ------/-HKCTMCDYASVEISKLKRHMRSH-/-YECSVCLATFTQSGSLKMH-MQRH-/-LTCKYCDSAFPDKYNLTKHLKTH-/-FRCEDCNYCCTQERHLINH
Molgula_Tunicata YQCQECAFYSHRHSNLIRHMKIH-/-FKCTLCDYASVEISKLRRHFRSH-/-YECPHCQQRFTQRGSVKMHIMQQH-/-MQCKYCDELFHDRW------
Halocynthia_Tunicata ------FKCTLCDYASVEISKLRRHFRSH-/-YECPDCNQRFTQRGSVKMHIMQQH-/-MQCRYCDELFHDRWSLMQHQRTH-/-RQPRPGME------
Ciona_Tunicata YQCRECSFYSHRHSNLVRHMKIH-/-FRCTLCDYASVEISKLRRHFRSH-/-YECPECNQRFTQRGSVKMHIMQQH-/-TQCRYCDELFHDRWSLMQHQRTH------
Petromyzon_Vertebrata_Agnatha FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH------
Gasterosteus_Vertebrata_Teleostei ------LDRHMKSH-/-FKCSMCDYCSVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKTH-/-FKCEQCDYCCRQERHMVMH
Oncorhynchus_Vertebrata_Teleostei ------FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-RKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYCCRQERHMIMH
Pimephales_Vertebrata_Teleostei FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-RKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMVMH
Danio_Vertebrata_Teleostei FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-RKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMVMH
Gallus_Vertebrata_Aves FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMVMH
Macropus_Mammalia_Metatheria FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH
Pogona_Lepidosauria_Squamata FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH
Bos_Mammalia_Ruminantia FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH
Homo_Mammalia_Primates FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH
Mus_Mammalia_Rodentia FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH
Ornithorhynchus_Mammalia_Monotremata FQCELCSYTCPRRSNLDRHMKSH-/-FKCSMCDYASVEVSKLKRHIRSH-/-YECYICHARFTQSGTMKMHILQKH-/-KKCRYCDAVFHERYALIQHQKSH-/-FKCDQCDYACRQERHMIMH

Saccoglossus_Hemichordata ------HKCPLCDYISVESSKIKRHMRSH-/-FKCPDCDMAFSQKSSLKEHMW-KH-/-LICKVCENGFSDRFSYMQHIKTH-/-YKCGQCGYSCPQKRHLLTH
Paracentrotus_Echinodermata ------HKCPLCDYLSVEASKIKRHMRSH-/-FSCSQCDQAFSQKSSLKEHEW-KH-/-MICKICENAFTDRFTYMQHVRGH-/-YKCGECGYSAPQKRHLVIH
Strongylocentrotus_Echinodermata ------HKCPLCDYLSVEASKIKRHMRSH-/-FNCSQCDQSFSQKSSLKEHEW-KH-/-MICKICENAFTDRFTYMQHVRGH-/-YKCGECGYSAPQKRHLVIH
Trichinella_Nematoda YQCEFCPYTNHKRYLLLRHMKSH-/-HKCTECSYASVELSKLKRHIRSH-/-YQCEVCNQRFTQSNSLKAHKL-IH-/-IKCKKCDRTFTDRYTFKLHCKEH-/-YQCHLCPYSAMAQRHLEAH
Lottia_Mollusca_Gastropoda MMCSYCNYTSPKRYLLTRHMKTH-/-HKCTLCEYASVELSKLKRHMRSH-/-YECMVCHARFTQSNSLKAHKL-IH-/-LLCRKCNQTFPDRYTYKIHLKSH-/-FKCDICGHAAISQRHLETH
Aplysia_Mollusca_Gastropoda LMCDYCSYTSPKRYLLTRHMKSH-/-HRCPKCDYASVELSKLKRHMRSH-/-YECDVCHSRFTQSNSLKAHKL-IH-/-LECKKCGKVFSDRYSYKQHVRSH-/-FKCEDCDFIANTERALDQH
Capitella_Annelida_Polychaeta LMCNYCNYTSPKRYLLTRHMKTH-/-HRCPNCDYASVELSKLKRHLRSH-/-YKCDICNTRFTQSNSLKAHKL-IH-/-IQCKKCGKHFPDRYTYKIHVKTH-/-YKCELCPYSALSQRHLESH
Helobdella_Annelida_Hirudinea LKCNYCSYMTTKRFQLIRHLRSH-/-HRCTVCNYASVELSKLRRHVRSH-/-YVCDICEMRFTQSNSLKSHRL-IH-/-LLCKKCGKNFPDRYTYI-HSRTH-/-FKCELCPYSASIQRHLENH
Tetranychus_Arachnida_Acari HMCNYCNYASNKRYLLSQHMKSH-/-HRCPECDYASVELSKLKRHMRCH-/-YECDVCHARFTQSNSLKAHKL-IH-/-LTCKRCDQSFSDRYSFKIHLKSH-/-FKCDLCNYASVSARHLESH
Ixodes_Arachnida_Acari HMCNYCNYTSPKRYLLSRHMKSH-/-HRCTECDYASVELSKLKRHMRCH-/-YECDVCHARFTQSNSLKAHKL-IH-/-LKCKRCGKSFPDRYTYKVHVKSH-/-FKCDLCPYASISQRHLESH
Lepeophtheirus_Crustacea_Copepoda ------/-HKCSECGYKSVELSKLKRHMRCH-/-YECDICHARFTQSNSLKAHRL-IH-/-MHCKMCGKMFPDRYTLKVHKKTH-/-FKCDLCPYSSISQRHLESH
Callinectes_Crustacea_Malacostraca ------KLKRHMRCH-/-YECEICRARFTQSNSLKAHKL-IH-/-LKCKKCGKAFPDRYTFKQHNKSH-/-FKCDLCPYASTSARHLESH
Daphnia_Crustacea_Branchiopoda HMCNYCNYTSPKRYLLSRHMKCH-/-HRCTECDYASVELSKLKRHMRCH-/-YECDICHARFTQSNSLKAHKL-IH-/-LKCKRCGKSFPDRYTFKLHSKTH-/-FKCDLCAYASISQRHLESH
Acyrthosiphon_Insecta_Hemiptera HMCNYCNYTTGKRYLLSRHMKSH-/-HKCTICDYASVELSKMRNHMRCH-/-YECDICFSRFTQSNSLKTHRL-IH-/-IKCKRCGDAFPDRYQYKVHCKSH-/-YKCELCPYASMSQRHLETH
Pediculus_Insecta_Phthiraptera HMCNYCNYTSPKRYLLSRHMKSH-/-HRCPECDYASVELSKLKRHIRCH-/-YECEICHSRFTQSNSLKAHRL-IH-/-LKCKRCGKMLPDRYTYKVHNKTH-/-YRCDLCPYASISARHLESH
Tribolium_Insecta_Coleoptera HMCNYCNYTSPKRYLLSRHMKSH-/-HKCPECEYASVELSKLKRHIRCH-/-YECDICKTKFTQSNSLKTHKL-TH-/-LKCKRCGKSFPDRYSYKVHNKTH-/-YKCDLCPYASISARHLESH
Nasonia_Insecta_Hymenoptera HMCNYCNYTSPKRYLLSRHMKSH-/-HKCNDCDYASVELSKLKRHIRCH-/-YECDICNARFTQSNSLKAHKL-VH-/-LKCKRCGKTFPDRYSYKLHSKTH-/-YKCDLCPYASISARHLESH
Anopheles_Insecta_Diptera YMCNYCNYTSNKLFLLSRHLKTH-/-HKCTECDYASVELSKLKRHIRTH-/-YSCDVCFARFTQSNSLKAHKM-IH-/-IKCKRCDSTFPDRYSYKMHAKTH-/-YRCEYCPYASISMRHLESH
Aedes_Insecta_Diptera HMCNYCNYTTNKRFLLARHMKSH-/-HKCPECDYASVELSKLKRHIRCH-/-YECDVCHARFTQSNSLKAHKL-IH-/-LKCRRCNKIYQDRYTYKMHTKTH-/-YKCDLCPYASISARHLESH
Culex_Insecta_Diptera HMCNYCNYTTNKRFLLARHMKSH-/-HKCPECDYASVELSKLKRHIRCH-/-YECDVCHARFTQSNSLKAHKM-IH-/-LKCRRCSKLYHDRYTYKMHTKTH-/-YKCDLCPYASISARHLESH
Drosophila_Insecta_Diptera YSCPHCPYTASKKFLITRHSRSH-/-HKCTECTYASVELTKLRRHMTCH-/-YQCDICKSRFTQSNSLKAHKL-IH-/-MTCRRCGQQLPDRYQYKLHVKSH-/-YSCKLCSYASVTQRHLASH

Unique synapomorphies of Arthropoda: ZF4: S C (exception: T in Anopheles); ZF9: A S
Legend: → derived

A: Original matrix

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   175  20   8   1  11   0 246   4   0   9 159   7  17   9  25
C    24 166 245   0 455 475 224 226 473   5   8 146 115   8  26
G   211 286   0 462   2   0   3   2   1   0 259 269  24 448 376
T    65   3 222  12   7   0   2 243   1 461  49  53 319  10  48

B: Corrupted matrix SH1

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   175  20   8   1  11   9 246   4   0   9 159   7  17   0  25
C    24 166 245   0 455   8 224 226 473   5   8 146 115 475  26
G   211 286   0 462   2 448   3   2   1   0 259 269  24   0 376
T    65   3 222  12   7  10   2 243   1 461  49  53 319   0  48

C: Corrupted matrix SH2

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   175  20   8  11   1   0 246   4   0   9 159   7  17   9  25
C    24 166 245 455   0 475 224 226 473   5   8 146 115   8  26
G   211 286   0   2 462   0   3   2   1   0 259 269  24 448 376
T    65   3 222   7  12   0   2 243   1 461  49  53 319  10  48

D: Corrupted matrix SH3

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   175  20   8   1  11   0 246   4   9   9 159   7  17   0  25
C    24 166 245   0 455 475 224 226   8   5   8 146 115 473  26
G   211 286   0 462   2   0   3   2 448   0 259 269  24   1 376
T    65   3 222  12   7   0   2 243  10 461  49  53 319   1  48

E: Corrupted matrix SH4

Pos   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   175  20   8   1  11   0 246   4   9   0 159   7  17   9  25
C    24 166 245   0 455 475 224 226   5 473   8 146 115   8  26
G   211 286   0 462   2   0   3   2   0   1 259 269  24 448 376
T    65   3 222  12   7   0   2 243 461   1  49  53 319  10  48

Fig. S 5: Matrices used for the prediction of CTCF-binding sites. The nucleotide frequencies for 475 CTCF-binding motifs from Drosophila and Homo at 15 positions in the intact (A) and four corrupted matrices (B–E) are shown. Corrupted matrices were generated by reciprocal exchange of two highly conserved nucleotide positions (red).
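The reciprocal exchange described in the legend amounts to swapping two columns of the count matrix. Below is a minimal sketch of that operation. The helper name `swap_positions` is hypothetical, and the swapped position pairs (6/14, 4/5, 9/14, 9/10) are not stated in the legend; they are inferred here by comparing panels B–E with panel A.

```python
from typing import Dict, List

Matrix = Dict[str, List[int]]

# Panel A: nucleotide counts at 15 positions (475 motifs).
ORIGINAL: Matrix = {
    "A": [175, 20, 8, 1, 11, 0, 246, 4, 0, 9, 159, 7, 17, 9, 25],
    "C": [24, 166, 245, 0, 455, 475, 224, 226, 473, 5, 8, 146, 115, 8, 26],
    "G": [211, 286, 0, 462, 2, 0, 3, 2, 1, 0, 259, 269, 24, 448, 376],
    "T": [65, 3, 222, 12, 7, 0, 2, 243, 1, 461, 49, 53, 319, 10, 48],
}


def swap_positions(matrix: Matrix, pos_a: int, pos_b: int) -> Matrix:
    """Return a copy of the matrix with two 1-based positions exchanged."""
    i, j = pos_a - 1, pos_b - 1
    swapped: Matrix = {}
    for base, counts in matrix.items():
        row = list(counts)  # copy, so the original stays intact
        row[i], row[j] = row[j], row[i]
        swapped[base] = row
    return swapped


# Position pairs inferred from panels B-E (an assumption of this sketch).
SWAPS = {"SH1": (6, 14), "SH2": (4, 5), "SH3": (9, 14), "SH4": (9, 10)}
corrupted = {name: swap_positions(ORIGINAL, a, b) for name, (a, b) in SWAPS.items()}

for name, matrix in corrupted.items():
    print(name, matrix["C"])
```

Because only whole columns move, every column of a corrupted matrix still sums to 475 and the overall nucleotide composition is unchanged, which is what makes these matrices usable as controls.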

[Figure S6: horizontal bar chart (image not reproduced). x-axis: motif count/megabase (0 to 14). Legend, CTCF matrix: intact, shuffle1, shuffle2, shuffle3, shuffle4. Genomes, top to bottom:
Bilateria, with CTCF: Vertebrata (H. sapiens), Vertebrata (M. musculus), Echinodermata (S. purpuratus), Hemichordata (S. kowalevskii), Arthropoda (D. melanogaster), Arthropoda (D. pulex), Nematoda (T. spiralis)
Bilateria, w/o CTCF: Nematoda (C. elegans), Platyhelminthes (S. mediterranea)
Bilateria, with CTCF: Mollusca (L. gigantea), Annelida (Capitella sp.)
w/o CTCF: Hydrozoa (H. magnipapillata), Anthozoa (N. vectensis), Porifera (A. queenslandica), Placozoa (T. adhaerens), Protozoa (P. tetraurelia), Fungi (S. cerevisiae), Planta (A. thaliana)]

Fig. S 6: Enrichment of CTCF-binding sites in bilaterian genomes (STORM algorithm). The relative abundance of predicted CTCF-binding sites in 18 different genomes is shown. Genome-wide motif counts were adjusted to the relative number of instances (counts/megabase). Black bars represent the number of binding sites resulting from the intact CTCF matrix. Colored bars indicate the results of four corrupted matrices (shuffle1–4). These differ from the original matrix by a reciprocal nucleotide exchange at two conserved positions (Supplementary Fig. 5). The affiliation of an organism to the Bilateria (parentheses) and the distribution of CTCF (gray background) are indicated.
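The counts/megabase adjustment mentioned in the legend is a simple length normalization. The sketch below illustrates it with invented numbers; the function names are hypothetical, and the fold-enrichment ratio (intact density divided by the mean density of the four shuffled matrices) is an assumption added for illustration, not a quantity reported in the source.

```python
from typing import Sequence


def motifs_per_megabase(motif_count: int, genome_length_bp: int) -> float:
    """Normalize a genome-wide motif count to counts per 1,000,000 bp."""
    if genome_length_bp <= 0:
        raise ValueError("genome length must be positive")
    return motif_count / (genome_length_bp / 1_000_000)


def fold_enrichment(intact_density: float, control_densities: Sequence[float]) -> float:
    """Ratio of the intact-matrix density to the mean control density
    (an illustrative summary, assumed here)."""
    if not control_densities:
        raise ValueError("need at least one control density")
    mean_control = sum(control_densities) / len(control_densities)
    return intact_density / mean_control


# Hypothetical example: 1,200 predicted sites in a 120-Mb genome.
intact = motifs_per_megabase(1200, 120_000_000)
print(intact)                                         # 10.0
print(fold_enrichment(intact, [2.0, 3.0, 2.5, 2.5]))  # 4.0
```

Normalizing by genome length is what makes the bars comparable across genomes whose sizes differ by orders of magnitude.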

[Figure S7: Drosophila melanogaster, Antennapedia complex (image not reproduced).
Top: Chr3R, 2,500,000 to 2,800,000; genes lab, pb, bcd, dfd, scr, ftz, antp.
Middle: CTCF ChIP-chip signals numbered 1 to 19.
Bottom: conservation scores (y-axis 0.0 to 1.0; x-axis −50 to +50) for ten sites, listed as site number (coordinate): 3 (2,485,207), 4 (2,509,560), 7 (2,586,366), 8 (2,594,408), 11 (2,648,097), 12 (2,680,437), 13 (2,696,172), 15 (2,738,118), 16 (2,797,217), 19 (2,855,346).]

Fig. S 7: Conservation of CTCF-binding sites in the Drosophila Antennapedia Hox complex. (Top) Schematic view of the D. melanogaster Antp-C with the genes labial (lab), proboscipedia (pb), bicoid (bcd), deformed (dfd), sexcombs-reduced (scr), fushi tarazu (ftz), and antennapedia (antp), drawn to scale. (Middle) Position of 19 CTCF ChIP-chip signals relative to the D. melanogaster Antp-C (data from ref. 4). (Bottom) Conservation of 10 ChIP-positive CTCF sites in 12 Drosophila species. Each plot shows the PHASTCONS score of the site indicated on top, ± 50 bp. The 15-bp target sequence is highlighted in gray. Of the 19 ChIP peaks, 13 are associated with a predicted CTCF target site, and eight of these sites display a local conservation maximum.

[Fig. S 8 plot: panels A and B; x axis: position (−50 to +70); y axis: conservation sums (0.0–1.0).]

Fig. S 8: Phylogenetic conservation of predicted CTCF sites. The D. melanogaster Bithorax Hox complex and the human HoxD complex were scanned with PATSER using the original and four corrupted matrices. Thresholds for control matrices were adjusted to return the same number of predictions as obtained with the original matrix, 16 (BX-C) and 36 (HoxD), respectively. The PHASTCONS conservation score of a 115-bp window surrounding each predicted site was computed. Plots show the relative sum of conservation scores along the 115-bp window in the Bithorax (panel A) and HoxD complex (panel B). Results from the original matrix are shown as a black line. Results from the corrupted matrices were averaged (red line), and dashed lines denote their standard deviation. The predicted 15-bp target sequence is highlighted in gray. In both Hox complexes, predictions with the original matrix show a conservation peak while predictions with control matrices do not.
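The threshold adjustment and the position-wise conservation profile described in this legend can be sketched as follows. This is a minimal illustration, not the original analysis code: the function names and toy values are hypothetical, and "relative sum" is assumed to mean the per-position sum of scores divided by the number of sites.

```python
from typing import List, Sequence


def threshold_for_n_predictions(scores: Sequence[float], n: int) -> float:
    """Return the score cutoff that keeps the n best-scoring sites
    (assuming there are no ties at the cutoff)."""
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of scores")
    return sorted(scores, reverse=True)[n - 1]


def relative_conservation_profile(windows: Sequence[Sequence[float]]) -> List[float]:
    """Sum conservation scores position by position over equally long
    windows and divide by the number of windows (assumed normalization)."""
    if not windows:
        raise ValueError("at least one window is required")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    return [sum(w[i] for w in windows) / len(windows) for i in range(length)]


# Toy values: scores of a control matrix and two 3-position windows.
control_scores = [5.0, 9.0, 7.0, 3.0]
cutoff = threshold_for_n_predictions(control_scores, 2)
profile = relative_conservation_profile([[0.0, 1.0, 0.5], [1.0, 1.0, 0.5]])
print(cutoff, profile)
```

Matching the number of predictions rather than reusing one score threshold keeps the original and control profiles comparable, because each is then built from the same number of windows.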

[Fig. S 9 plot: x axis: position (−100 to +100); y axis: conservation sums (0.2–0.8).]

Fig. S 9: Phylogenetic conservation of ChIP-positive CTCF sites throughout the Drosophila genome. CTCF-binding sites were predicted with PATSER (threshold 7) and their overlap with CTCF ChIP-chip peaks (data from ref. 4) was determined. The PHASTCONS conservation score of a 200-bp window surrounding each overlapping site was computed. Plots show the average relative conservation of binding sites within the Drosophila Bithorax complex (26 sites; black line) and in the entire genome (2,474 sites; red line). The predicted 15-bp target sequence is highlighted as a gray bar. Sites in the Bithorax Hox complex as well as sites across the whole genome show a conservation peak.
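The overlap step in this legend, keeping only predicted sites that coincide with a ChIP-chip peak and then taking a 200-bp window around each, might look like the sketch below. Coordinates are treated as half-open intervals on a single chromosome; the function names and the toy coordinates are hypothetical, and centering the window on the site midpoint is an assumption.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, same chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def chip_positive_sites(sites: List[Interval], peaks: List[Interval]) -> List[Interval]:
    """Keep predicted sites that overlap at least one ChIP peak."""
    return [s for s in sites if any(overlaps(s, p) for p in peaks)]


def window_around(site: Interval, window: int = 200) -> Interval:
    """Return a window of the given length centered on the site midpoint.
    The start may become negative for sites near a chromosome start."""
    mid = (site[0] + site[1]) // 2
    start = mid - window // 2
    return (start, start + window)


# Toy coordinates: three predicted 15-bp sites and two ChIP peaks.
predicted = [(100, 115), (500, 515), (900, 915)]
peaks = [(90, 130), (880, 905)]
positives = chip_positive_sites(predicted, peaks)
print(positives, window_around(positives[0]))
```

The pairwise comparison is quadratic in the number of intervals, which is acceptable for a sketch; a genome-wide scan would normally sort both lists first or use an interval tree.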

[Fig. S 10 plot: Saccoglossus kowalevskii, Hox gene scaffolds. Scaffold16417 (scale 20–80 kb): ANT (PG2), ANT (PG1). Scaffold1507 (scale 60–360 kb): POST (PG9), POST (PG9), CENT (PG7), CENT (PG7), CENT (PG5), CENT (PG4).]

Fig. S 10: Hox gene arrangement in the hemichordate Saccoglossus kowalevskii. D. melanogaster Hox genes were used as queries to search via BLAST for putative Hox gene orthologs in the genome of Saccoglossus kowalevskii (obtained from the Baylor College of Medicine Human Genome Sequencing Center website at http://hgsc.bcm.tmc.edu/). Scaffolds 1,507 and 16,417 returned multiple hits. We analyzed the corresponding open reading frames using HoxPred (http://cege.vub.ac.be/hoxpred/). Homeodomain-containing reading frames that could be classified as Hox genes with high confidence are shown above (highlighted in gray). Mapped onto scaffolds 1,507 and 16,417, they indicate the existence of an intact Hox gene cluster in Saccoglossus kowalevskii. Red bar: approximate position of a Hox gene homeodomain. ANT, CENT, POST: anterior, central, and posterior Hox gene classes. PG: paralogy group.

Table S 1: Complete or nearly complete plant, fungi, and protist genomes without CTCF.

Kingdom          Organism                     Sequences (×10^6)
Plants (8)       Glycine max                   2.0
                 Medicago truncatula           1.1
                 Oryza sativa                 35.7
                 Populus trichocarpa          10.2
                 Solanum lycopersicum          0.6
                 Sorghum bicolor               9.4
                 Vitis vinifera               23.0
                 Zea mays                      7.8
Green algae (4)  Chlamydomonas reinhardtii     5.4
                 Micromonas sp.                7.4
                 Ostreococcus tauri            2.6
                 Volvox carteri                6.8
Fungi (13)       Ashbya gossypii               1.9
                 Aspergillus nidulans          5.5
                 Candida glabrata              2.2
                 Cryptococcus neoformans       4.7
                 Debaryomyces hansenii         2.6
                 Encephalitozoon cuniculi      1.1
                 Kluyveromyces lactis          2.0
                 Lachancea thermotolerans      2.0
                 Magnaporthe oryzae            4.6
                 Pichia pastoris               2.0
                 Schizosaccharomyces pombe     2.2
                 Yarrowia lipolytica           2.5
                 Zygosaccharomyces rouxii      2.0
Protists (15)    Babesia bovis                 1.6
                 Cryptosporidium hominis       5.5
                 Dictyostelium discoideum      6.0
                 Eimeria tenella               0.1
                 Entamoeba histolytica         2.7
                 Giardia intestinalis          7.5
                 Leishmania sp.               12.9
                 Monosiga brevicollis          4.6
                 Phytophthora ramorum          6.2
                 Plasmodium falciparum         6.1
                 Tetrahymena thermophila      13.6
                 Theileria annulata            1.8
                 Toxoplasma gondii            12.6
                 Trichomonas sp.              11.9
                 Trypanosoma brucei            3.5

Table S 2: Statistical analysis of CTCF-binding site predictions on human chromosome 2 (hg19). For explanations see the legend below.

                                                      original PWM           corrupted PWM
Threshold  ChIP data            Sites                 FP     TP     FDR      FP     TP     FDR
Patser 9   single cell type     all (normal)          7079   1844   0.793    6109   123    0.980
           single cell type     conserved (bold)       474    887   0.348     196    16    0.925
           multiple cell types  all (normal)          5779   3144   0.648    5843   389    0.938
           multiple cell types  conserved (bold)       295   1066   0.217     148    64    0.698
Patser 10  single cell type     all (normal)          3450   1518   0.694    2775    61    0.978
           single cell type     conserved (bold)       295    788   0.272      99     8    0.925
           multiple cell types  all (normal)          2653   2315   0.534    2641   195    0.931
           multiple cell types  conserved (bold)       139    884   0.136      77    30    0.746
Patser 13  single cell type     all (normal)           284    637   0.308      87     7    0.926
           single cell type     conserved (bold)        36    347   0.094      19     2    0.905
           multiple cell types  all (normal)           160    761   0.174      64    30    0.681
           multiple cell types  conserved (bold)        13    370   0.034      13     8    0.619

The number of false positive (FP) and true positive (TP) CTCF-binding site predictions (generated by PATSER v.3b) and the false discovery rate (FDR) are shown for the original and a corrupted (SH1) position weight matrix at three different PATSER thresholds (option "-l"). Normal face: all available binding sites were analyzed. Bold face: only binding sites in highly conserved regions were analyzed. Conservation was determined by PHASTCONS from a comparison of the human genome sequence to 45 other placental mammals (downloaded from the UCSC genome browser). It was considered high if the conservation score exceeded an average of 66 % over a predicted 15-bp motif. For instance, at a PATSER threshold of 13, as used for the genome scans in Fig. 2, FDR is reduced from 30.8 % to 17.4 % when ChIP data from multiple cell lines are analyzed instead of data from a single cell line. When the analysis is restricted to conserved binding sites, the corresponding reduction is from 9.4 % to 3.4 %. Thus, the amount of available experimental data and the presence of conservation have a strong impact on measured false discovery rates.

To obtain these numbers, we downloaded 139 human CTCF ChIP-seq datasets for 31 different cell lines from the University of California, Santa Cruz (UCSC), genome browser repository (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwTfbs/). In a first step, we filtered out signals that were unique to only one experimental replicate. We merged the resulting signals and analyzed the overlap of the ChIP peaks with PHASTCONS conservation scores and PATSER predictions on human chr2 (∼ 8 % of the genome).
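As a worked check of the percentages quoted in this legend, the sketch below recomputes the false discovery rate from the FP and TP counts of Table S 2 at PATSER threshold 13 (original matrix). It assumes the standard definition FDR = FP / (FP + TP), which the legend does not state explicitly; the function name is hypothetical.

```python
def false_discovery_rate(fp: int, tp: int) -> float:
    """FDR as the fraction of predictions that are false positives
    (standard definition, assumed here)."""
    if fp + tp == 0:
        raise ValueError("no predictions")
    return fp / (fp + tp)


# FP and TP counts from Table S 2, PATSER threshold 13, original PWM.
all_single = false_discovery_rate(284, 637)          # all sites, single cell type
all_multiple = false_discovery_rate(160, 761)        # all sites, multiple cell types
conserved_single = false_discovery_rate(36, 347)     # conserved sites, single cell type
conserved_multiple = false_discovery_rate(13, 370)   # conserved sites, multiple cell types

for label, value in [("all, single", all_single),
                     ("all, multiple", all_multiple),
                     ("conserved, single", conserved_single),
                     ("conserved, multiple", conserved_multiple)]:
    print(f"{label}: {value:.3f}")
```

Under this definition the four values round to 0.308, 0.174, 0.094, and 0.034, matching the percentages given in the legend.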

Table S 3: State of Hox gene clustering in different animal phyla (References for Fig. 4).

Phylum                   Species∗                                       Hox cluster                                          Ref.
Placozoa                 Trichoplax adhaerens                           absent; 5 Hox genes                                  (5)
Porifera                 Amphimedon queenslandica                       absent; no Hox genes                                 (6)
Cnidaria                 Nematostella vectensis, Hydra magnipapillata   absent; no central Hox genes                         (7, 8)
Echinodermata            Strongylocentrotus purpuratus                  present; 11 Hox genes                                (9)
Hemichordata             Saccoglossus kowalevskii                       present; ≥ 8 Hox genes; ≥ 6 on one genomic scaffold  (this study)
Chordata/Cephalochordata Branchiostoma floridae                         present; 14 Hox genes                                (10)
Chordata/Tunicata        Oikopleura dioica, Ciona intestinalis          absent; 9 Hox genes, no central genes                (11, 12)
Chordata/Vertebrata      Homo sapiens                                   present; 39 Hox genes in 4 clusters                  (13)
Nematoda                 Caenorhabditis elegans                         absent; 5 Hox genes                                  (14)
Arthropoda/Insecta       Tribolium castaneum                            present; 10 Hox genes                                (15)
Arthropoda/Crustacea     Sacculina carcini, Daphnia pulex               present; 10 Hox genes                                (16, 17)
Arthropoda/Chelicerata   Cupiennius salei                               present; 10 Hox genes                                (18)
Platyhelminthes          Schistosoma mansoni                            absent; 7–9 Hox genes                                (19, 20)
Mollusca                 Lottia gigantea                                present; 11 Hox genes                                (17)
Annelida                 Capitella sp. I                                present; 11 Hox genes                                (17, 21)

∗ With the exception of , insects, and nematodes, the list contains, to our knowledge, all studied species of the respective phyla. “Present” (third column) includes organized, disorganized, and split Hox gene clusters (see ref. 13).

Table S 4: Download location of proteomes from early branching metazoans and other eukaryotes.

Organism                  Download Location
Amphimedon queenslandica  ftp://ftp.jgi-psf.org/pub/JGI_data/Amphimedon_queenslandica/annotation/Aqi1.pep.fa.gz
Arabidopsis thaliana      http://www.plantgdb.org/download/Download/xGDB/AtGDB/ATpepTAIR9
Ctenophora                http://www.ncbi.nlm.nih.gov/nucest?term=Ctenophora
                          51,042 Entrez records for EST were downloaded.
Hydra magnipapillata      http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6085
                          17,563 Entrez records for Protein were downloaded.
Nematostella vectensis    ftp://ftp.jgi-psf.org/pub/JGI_data/Nematostella_vectensis/v1.0/annotation/proteins.Nemve1FilteredModels1.fasta.gz
Paramaecium tetraurelia   http://paramecium.cgm.cnrs-gif.fr/download/fasta/Ptetraurelia_peptides_cur.fasta
Saccharomyces cerevisiae  http://downloads.yeastgenome.org/sequence/genomic_sequence/orf_protein/orf_trans_all.fasta.gz
Trichoplax adhaerens      ftp://ftp.jgi-psf.org/pub/JGI_data/Trichoplax_adhaerens_Grell-BS-1999/annotation/v1.0/Triad1_all_proteins.fasta.gz

Table S 5: Download location of genomes used for CTCF-binding site prediction.

Organism                       Download Location
Amphimedon queenslandica       ftp://ftp.jgi-psf.org/pub/JGI_data/Amphimedon_queenslandica/assembly/AMPQU.finalScaffolds.fa.gz
Arabidopsis thaliana           ftp://ftp.plantgdb.org/download/Genomes/AtGDB/ATgenomeTAIR9.171
Caenorhabditis elegans         ftp://ftp.wormbase.org/pub/wormbase/genomes/c_elegans/sequences/dna/c_elegans.WS221.dna.fa.gz
Capitella teleta               ftp://ftp.jgi-psf.org/pub/JGI_data/Capitella/v1.0/Capitella_spI.allmasked.gz
Daphnia pulex                  http://genome.jgi-psf.org/Dappu1/download/Daphnia_pulex.allmasked.gz
Drosophila melanogaster        ftp://ftp.fruitfly.org/pub/download/compressed/na_whole-genome_genomic_dmel_RELEASE3.FASTA.gz
Homo sapiens                   http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFaMasked.tar.gz
Hydra magnipapillata           http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=ABRM01
Lottia gigantea                ftp://ftp.jgi-psf.org/pub/JGI_data/Lottia_gigantea/v1.0/Lotgi1_assembly_scaffolds_repeatmasked.fasta.gz
Mus musculus                   http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFaMasked.tar.gz
Nematostella vectensis         ftp://ftp.jgi-psf.org/pub/JGI_data/Nematostella_vectensis/v1.0/assembly/Nemve1.allmasked.gz
Paramaecium tetraurelia        http://paramecium.cgm.cnrs-gif.fr/download/fasta/Ptetraurelia_assembly_v1.fasta
Saccharomyces cerevisiae       http://downloads.yeastgenome.org/sequence/genomic_sequence/chromosomes/fasta
Saccoglossus kowalevskii       ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Skowalevskii/fasta/Skow_1.1/LinearScaffolds/Skow20090630-genome_scaffolds.fa.gz
Schmidtea mediterranea         http://genome.wustl.edu/pub/organism/Invertebrates/Schmidtea_mediterannea/assembly/Schmidtea_mediterranea-3.1/output/supercontigs.fa.gz
Strongylocentrotus purpuratus  http://www.spbase.org/SpBase/download/Spur2.5.AGP.linearScaffold.fa.gz
Trichinella spiralis           http://genome.wustl.edu/pub/organism/Invertebrates/Trichinella_spiralis/assembly/Trichinella_spiralis-1.0/output/
Trichoplax adhaerens           ftp://ftp.jgi-psf.org/pub/JGI_data/Trichoplax_adhaerens_Grell-BS-1999/assembly/v1.0/Triad1_masked_genomic_scaffolds.fasta.gz

SI Materials and Methods

Multiple sequence alignment and phylogenetic analysis

The sequence dataset for phylogenetic analysis consisted of 170 sequences (20 control sequences and 150 CTCF candidate sequences from early branching metazoans and bilaterians). Indels and unalignable regions were excluded from the data before analysis. Likewise, the last ZF domain (ZF11) was excluded due to missing/ambiguous data for the majority of sequences. Phylogenetic trees were computed under the maximum-likelihood criterion, using RAXML 7.0.4 with 100 bootstrap resamplings (22). As an optimal model of sequence evolution we used the WAG+Γ model as selected by PROTTEST 2.4 (23). For Bayesian analyses, two Markov chain Monte Carlo chains with 700,000 generations were produced, and the first 100,000 generations were discarded as “burn-in”. To determine Bayesian posterior probabilities, an 80 % majority-rule consensus of the remaining 600,000 generations was calculated. To identify uniquely derived positions in the CTCF gene family, all non-CTCF zinc-finger sequences were ignored, and the remaining 57 CTCF sequences were used for synapomorphy analysis, targeting only some of the major lineages, e.g., the Arthropoda (24).
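The burn-in and consensus steps of the Bayesian analysis can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the text names neither the Bayesian software nor the sampling frequency, so each sampled tree is represented here simply as a list of clades (sets of taxon names), and all names and values are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Tuple

Clade = FrozenSet[str]
Sample = Tuple[int, List[Clade]]  # (generation, clades in the sampled tree)


def clade_support(samples: Iterable[Sample], burn_in: int) -> Dict[Clade, float]:
    """Discard samples up to and including the burn-in generation and return
    the frequency of each clade among the remaining samples."""
    kept = [clades for generation, clades in samples if generation > burn_in]
    if not kept:
        raise ValueError("no samples left after burn-in")
    counts = Counter(clade for clades in kept for clade in set(clades))
    return {clade: n / len(kept) for clade, n in counts.items()}


def consensus_clades(support: Dict[Clade, float], cutoff: float = 0.8) -> Dict[Clade, float]:
    """Keep clades whose frequency reaches the majority-rule cutoff."""
    return {clade: p for clade, p in support.items() if p >= cutoff}


# Toy chain: five sampled trees, the first one falls into the burn-in.
ab, cd = frozenset({"a", "b"}), frozenset({"c", "d"})
toy_samples = [(100, [cd]), (200, [ab, cd]), (300, [ab, cd]),
               (400, [ab, cd]), (500, [ab])]
support = clade_support(toy_samples, burn_in=100)
consensus = consensus_clades(support, cutoff=0.8)
print(support[ab], support[cd], list(consensus))
```

In the toy chain, clade {a, b} occurs in all four retained samples and is kept, while clade {c, d} occurs in three of four (0.75) and falls below the 80 % cutoff.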

References

[1] Dunn CW, et al. (2008) Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452(7188):745–749.

[2] Suzuki M, Gerstein M, Yagi N (1994) Stereochemical basis of DNA recognition by Zn fingers. Nucleic Acids Res, 22(16):3397–3405.

[3] Renda M, et al. (2007) Critical DNA binding interactions of the insulator protein CTCF: a small number of zinc fingers mediate strong binding, and a single finger-DNA interaction controls binding at imprinted loci. J Biol Chem, 282(46):33336–33345.

[4] Roy S, et al.; modENCODE Consortium (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012):1787–1797.

[5] Monteiro AS, Schierwater B, Dellaporta SL, Holland PW (2006) A low diversity of ANTP class homeobox genes in Placozoa. Evol Dev, 8(2):174–182.

[6] Larroux C, et al. (2007) The NK homeobox gene cluster predates the origin of Hox genes. Curr Biol, 17(8):706–710.

[7] Chourrout D, et al. (2006) Minimal ProtoHox cluster inferred from bilaterian and cnidarian Hox complements. Nature, 442(7103):684–687.

[8] Kamm K, Schierwater B, Jakob W, Dellaporta SL, Miller DJ (2006) Axial patterning and diversification in the cnidaria predate the Hox system. Curr Biol, 16(9):920–926.

[9] Cameron RA, et al. (2006) Unusual gene order and organization of the sea urchin hox cluster. J Exp Zool B Mol Dev Evol, 306(1):45–58.

[10] Amemiya CT, et al. (2008) The amphioxus Hox cluster: characterization, comparative genomics, and evolution. J Exp Zool B Mol Dev Evol, 310(5):465–477.

[11] Seo HC, et al. (2004) Hox cluster disintegration with persistent anteroposterior order of expression in Oikopleura dioica. Nature, 431(7004):67–71.

[12] Ikuta T, Yoshida N, Satoh N, Saiga H (2004) Ciona intestinalis Hox gene cluster: Its dispersed structure and residual colinear expression in development. Proc Natl Acad Sci U S A, 101(42):15118–15123.

[13] Duboule D (2007) The rise and fall of Hox gene clusters. Development, 134(14):2549–2560.

[14] Aboobaker AA, Blaxter ML (2003) Hox Gene Loss during Dynamic Evolution of the Nematode Cluster. Curr Biol, 13(1):37–40.

[15] Shippy TD, et al. (2008) Analysis of the Tribolium homeotic complex: insights into mechanisms constraining insect Hox clusters. Dev Genes Evol, 218(3-4):127–139.

[16] Géant E, Mouchel-Vielh E, Coutanceau JP, Ozouf-Costaz C, Deutsch JS (2006) Are Cirripedia hopeful monsters? Cytogenetic approach and evidence for a Hox gene cluster in the cirripede crustacean Sacculina carcini. Dev Genes Evol, 216(7-8):443–449.

[17] Thomas-Chollier M, Ledent V, Leyns L, Vervoort M (2010) A non-tree-based comprehensive study of metazoan Hox and ParaHox genes prompts new insights into their origin and evolution. BMC Evol Biol, 10:73.

[18] Schwager EE, Schoppmeier M, Pechmann M, Damen WG (2007) Duplicated Hox genes in the spider Cupiennius salei. Front Zool, 4:10.

[19] Pierce RJ, et al. (2005) Evidence for a dispersed Hox gene cluster in the platyhelminth parasite Schistosoma mansoni. Mol Biol Evol, 22(12):2491–2503.

[20] Koziol U, Lalanne AI, Castillo E (2009) Hox genes in the parasitic platyhelminthes Mesocestoides corti, Echinococcus multilocularis, and Schistosoma mansoni: evidence for a reduced Hox complement. Biochem Genet, 47(1-2):100–116.

[21] Fröbius AC, Matus DQ, Seaver EC (2008) Genomic organization and expression demonstrate spatial and temporal Hox gene colinearity in the lophotrochozoan Capitella sp. I. PLoS One, 3(12):e4004.

[22] Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22(21):2688–2690.

[23] Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics, 21(9):2104–2105.

[24] Marin B, Nowack EC, Melkonian M (2005) A plastid in the making: evidence for a second primary endosymbiosis. Protist, 156(4):425–432.
