Supporting Information

Jékely 10.1073/pnas.1221833110

SI Methods

pNPs were retrieved using a combination of strategies. UniProt sequences annotated with the Gene Ontology (GO) term GO:0007218 (neuropeptide signaling pathway) were collected. Transmembrane proteins and proteins lacking a signal peptide (SP; predicted by SignalP4) were removed. National Center for Biotechnology Information (NCBI) sequences were retrieved with the query ‘neuropeptide NOT receptor’. Nonneuropeptide sequences (e.g., neuropeptide processing enzymes) were removed. UniProt proteins with a SP and containing the repetitive motif [K|R][K|R]\w{3,10}G[K|R][K|R]\w{3,10}G[K|R][K|R]\w{3,10}G[K|R][K|R] were also retrieved. Manually curated lists of pNPs from B. floridae, S. kowalevskii, T. adhaerens, Capitella teleta, Helobdella robusta, and L. gigantea were also created, either by species-specific searches or based on the literature. pNP lists from mass spectrometry studies were also added. Sequences were clustered using CLANS2 to identify all major pNP families. Members from each cluster were used as queries in PSI-BLAST searches in the NR database at the NCBI using varying e-value cutoffs (1 to 1e-5) and either the BLOSUM62 or the PAM30 matrix. Newly detected sequences were examined and false-positive matches were removed. The NCBI Expressed Sequence Tag (EST) collection est.other (excluding human and mouse) was also searched; ESTs were translated using ESTScan and screened for the presence of a SP. Sequences without a SP, spurious matches, toxins, and antimicrobial peptides were removed. Adiponectins were also removed because the collagen domain showed spurious matches to repetitive pNPs. Redundancy was reduced to 95% identity using CD-HIT.

Class-A GPCRs with the InterPro domains IPR019427 or IPR000276 or IPR017452 were downloaded from UniProt and used to search the C. teleta, H. robusta, and L. gigantea predicted proteins (e-value 1e-20). The combined set was reduced to 75% redundancy. The resulting 16,123 GPCRs were clustered with CLANS2. Linkage clustering (minimum 10 links at e-value 1e-20; minimum 10 sequences) was performed to identify coherent clusters, of which neuropeptide receptor clusters were manually selected. These sequences were filtered with HMMTOP and only sequences with seven transmembrane domains and an extracellular N terminus were retained. Class-B sequences with the domains IPR017981 or IPR001879 or IPR000832 or IPR017983 from UniProt were enriched with C. teleta, H. robusta, and L. gigantea sequences, filtered with HMMTOP, and clustered with CLANS to select neuropeptide receptors. For the final clustering 1,465 class-A and 547 class-B receptors were used.

All sequences were annotated with the full classification, retrieved based on the NCBI Taxonomy identifier (taxid), using a BioPerl script. A custom Perl script was used to annotate pNPs with the three last amino acids of the amidated peptides preceding a G[K|R][K|R] motif. The length of the predicted amidated peptides flanked by dibasic cleavage sites was also included in the description for repetitive pNPs. At least two peptides flanked by dibasic cleavage sites had to have the same length.

Sequences were clustered with CLANS2. CLANS performs all-against-all BLAST and represents sequences by nodes in a graph, placed randomly in a 3D space. Clustering is performed using attractive forces proportional to the negative logarithm of the BLAST P values, and a uniform repulsive force. pNPs and GPCRs were clustered with a P value cutoff of 1e-5 and 1e-40, respectively. Clustering was first performed in 3D and then the maps were collapsed to 2D for easier representation. Taxonomy, amidated motifs, and the length of the neuropeptide repeats were mapped on the cluster maps using the sequence groups tool. To read the CLANS files (Datasets S1, S2, and S3), install CLANS and run the following on the command line: java -Xmx4000m -jar /your_install_directory/CLANS.jar -load Clans_file.

Multiple alignments were generated by ClustalW, Muscle, or Cobalt. Motifs were identified with MEME.
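The force-directed clustering principle described in SI Methods (random 3D start, attraction proportional to the negative logarithm of the BLAST P value, one repulsion strength shared by all pairs) can be illustrated with a toy sketch. This is not the CLANS implementation. The function name, the input dictionary of pairwise P values, the force constants, the 1/distance decay of the repulsion, and the capped step size are all assumptions made for this illustration.

```python
import math
import random


def force_layout(pvalues, n, cutoff=1e-5, iterations=300,
                 attract=0.01, repel=0.05, step=0.1, max_move=0.1, seed=0):
    """Toy 3D layout: attraction ~ -log10(P) on retained pairs, equal repulsion on all pairs.

    pvalues maps (i, j) index pairs to pairwise P values (hypothetical input format).
    """
    rng = random.Random(seed)
    # Nodes start at random positions in a 3D cube.
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n)]

    # Keep only pairs at or below the cutoff and weight them by -log10(P).
    weights = {}
    for (i, j), p in pvalues.items():
        if i != j and p <= cutoff:
            weights[(min(i, j), max(i, j))] = -math.log10(max(p, 1e-300))

    for _ in range(iterations):
        force = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                delta = [pos[j][k] - pos[i][k] for k in range(3)]
                d = max(math.sqrt(sum(x * x for x in delta)), 1e-9)
                # Positive f pulls i and j together; negative f pushes them apart.
                # The repulsion coefficient is the same for every pair (assumed 1/d decay).
                f = attract * weights.get((i, j), 0.0) * d - repel / d
                for k in range(3):
                    force[i][k] += f * delta[k] / d
                    force[j][k] -= f * delta[k] / d
        for i in range(n):
            move = [step * x for x in force[i]]
            size = math.sqrt(sum(x * x for x in move))
            # Cap the displacement per iteration to keep the toy simulation stable.
            scale = min(1.0, max_move / size) if size > 0 else 0.0
            for k in range(3):
                pos[i][k] += scale * move[k]
    return pos


if __name__ == "__main__":
    # Invented P values: only the pair (0, 1) passes the 1e-5 cutoff.
    toy = {(0, 1): 1e-50, (0, 2): 1e-3, (1, 2): 0.5}
    coords = force_layout(toy, n=3)
    print(math.dist(coords[0], coords[1]), math.dist(coords[0], coords[2]))
```

In this sketch the strongly connected pair settles close together while the unconnected sequence drifts away, which is the qualitative behavior that makes cohesive families appear as dense groups on a cluster map.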

Jékely www.pnas.org/cgi/content/short/1221833110

Fig. S1. Large cohesive sequence clusters, repeat length, and distribution of R[FY]amides (Arg-[Phe/Tyr]-NH2) and Wamides (Trp-NH2) in proneuropeptide (pNP) CLANS maps. (A) Individual clusters in the BLOSUM62 map were determined by linkage clustering (minimum three linkages) and are shown in different colors. Only clusters with more than 30 sequences are shown. The central cluster is shown in red. (B) The largest cluster in the BLOSUM62 map was defined by linkage clustering (minimum three linkages). This subset of sequences was further optimized and color-coded for taxonomy. (C) Different repeat lengths of pNPs are indicated in different colors on the PAM30 CLANS cluster map. Only those pNPs that had at least two amidated peptides of the same length flanked by dibasic cleavage sites were colored. The color code for the different repetitive peptide lengths is shown. (D) PAM30 clustering showing the central cluster with the mapping of RFamide (Arg-Phe-NH2), RYamide (Arg-Tyr-NH2), and Wamide terminal motifs.
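The annotations mapped in panels C and D (repeat length and amidated terminus) follow rules stated in SI Methods, and a minimal Python sketch of those rules is given here. It is not the custom Perl script used in the study. The function names, the splitting of the precursor at runs of two or more K/R residues, and the toy precursor sequence are assumptions introduced for illustration. The character class [KR] stands in for the [K|R] notation of SI Methods.

```python
import re
from collections import Counter

# Repetitive motif from SI Methods: four dibasic sites separated by three
# stretches of 3-10 residues that each end in Gly.
REPEAT_MOTIF = re.compile(
    r"[KR][KR]\w{3,10}G[KR][KR]\w{3,10}G[KR][KR]\w{3,10}G[KR][KR]"
)


def amidated_peptides(seq):
    """Return predicted amidated peptides flanked by dibasic sites on both sides."""
    # Assumption: any run of two or more K/R residues counts as one cleavage site.
    fragments = re.split(r"[KR]{2,}", seq.upper())
    # Interior fragments are flanked by sites on both sides; a C-terminal Gly
    # marks the amidation signal and is not part of the mature peptide.
    return [f[:-1] for f in fragments[1:-1] if len(f) > 1 and f.endswith("G")]


def terminal_class(peptide):
    """Classify the amidated terminus into the motif groups shown in panel D."""
    if peptide.endswith("RF"):
        return "RFamide"
    if peptide.endswith("RY"):
        return "RYamide"
    if peptide.endswith("W"):
        return "Wamide"
    return "other"


def annotate(seq):
    """Collect the last three residues and the repeat lengths of a pNP."""
    peptides = amidated_peptides(seq)
    counts = Counter(len(p) for p in peptides)
    return {
        "repetitive_motif": bool(REPEAT_MOTIF.search(seq.upper())),
        "termini": [p[-3:] for p in peptides],
        # A length is reported only if at least two peptides share it.
        "repeat_lengths": sorted(n for n, c in counts.items() if c >= 2),
        "classes": [terminal_class(p) for p in peptides],
    }


if __name__ == "__main__":
    # Invented toy precursor, not a real pNP sequence.
    toy = "MKLLSKRAAPLRFGKRSSPLRFGRRDDQWGKKEEL"
    print(annotate(toy))
```

Splitting at dibasic sites rather than scanning with a single pattern keeps overlapping peptides simple to handle, at the cost of ignoring monobasic cleavage, which the source does not discuss.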

Fig. S2. Cluster analysis of pNPs and class-B neuropeptide G protein-coupled receptors (GPCRs). (A) A BLOSUM62 cluster map of pNPs was colored to highlight the indicated amidated termini in the mature neuropeptides. The individual amidated termini and the family they belong to are listed in the table. (B) BLOSUM62 cluster map of class-B neuropeptide GPCRs. Nodes correspond to class-B GPCR sequences and are colored based on taxonomy. Edges represent BLAST connections of P value > 1e-50. (C) BLOSUM62 cluster map of prokineticin/astakine/colipase, (D) prothoracicotropic hormone (PTTH)/trunk/noggin, and (E) neuroparsin/insulin-like growth factor-binding protein (IGFBP) domains. Representatives of the indicated families were clustered and colored based on taxonomy. Edges represent BLAST connections of P value > 1e-5.

Fig. S3. Phyletic distribution of metazoan pNP and neuropeptide GPCR families. (A) Phyletic distribution of metazoan pNP families. The families that are part of the central cluster (CC) are shown in red. Based on GPCR distribution, pigment-dispersing factor (Pdf), leucokinin, thyrotropin-releasing hormone (TRH), and parathyroid hormone (PTH) may be ancestral bilaterian, whereas motilin, melanin-concentrating hormone (MCH), and endothelin may be ancestral chordate. (B) Phyletic distribution of metazoan class A and class B neuropeptide GPCR families. Class B GPCRs are indicated as (B). Ancestral bilaterian (*), protostome (+), deuterostome (o), and chordate (-) families are indicated.

Fig. S4. Structure of placozoan pNPs and lophotrochozoan opioid pNPs. (A) Schematic structure of pNPs from the placozoan Trichoplax adhaerens. (B) Schematic structure of the Platynereis dumerilii (annelid) and Lottia gigantea and Haliotis asinina (mollusks) opioid pNPs. Signal peptides are shown in blue, peptides with a C-terminal Gly in green, dibasic cleavage sites in red, and Cys residues in yellow. The sequence logos show the conservation of residues in the predicted mature peptides.

Dataset S1. CLANS file of the 6,225 pNPs analyzed

The file contains all sequences, annotations, and the BLAST P value matrix; the sequences and the matrix each sit between their own pair of delimiter lines. The cluster map can be visualized with CLANS (http://134.34.129.6/programs/clans/index.php) using the command: java -Xmx4000m -jar /your_install_directory/CLANS.jar -load Clans_file.

Dataset S2. CLANS file of the 1,465 class-A neuropeptide GPCRs analyzed

The file contains all sequences, annotations, and the BLAST P value matrix; the sequences and the matrix each sit between their own pair of delimiter lines. The cluster map can be visualized with CLANS using the command: java -Xmx4000m -jar /your_install_directory/CLANS.jar -load Clans_file.

Dataset S3. CLANS file of the 547 class-B neuropeptide GPCRs analyzed

The file contains all sequences, annotations, and the BLAST P value matrix; the sequences and the matrix each sit between their own pair of delimiter lines. The cluster map can be visualized with CLANS using the command: java -Xmx4000m -jar /your_install_directory/CLANS.jar -load Clans_file.

Dataset S4. Multiple alignments of pNP families

The multiple alignments for the selected pNP families were generated with either Muscle or Cobalt. GenBank/SwissProt or JGI identifiers and full species names are shown. The multiple alignments were visualized with Jalview. The sequences are colored according to the ClustalX color scheme, using varying conservation cutoffs. Short motifs were identified using MEME.

Dataset S5. pNPs identified from Trichoplax adhaerens, Branchiostoma floridae, Saccoglossus kowalevskii, and Petromyzon marinus
