Thesis

Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes

KOUA, Dominique Kadio

Abstract

Current pharmaceutical research is actively exploring the field of natural peptides. Venomics addresses this issue through the study of toxins. The concomitant development of sequencing techniques is opening new perspectives for understanding biological mechanisms. Transcriptome sequencing of specific tissues is undertaken to better understand and characterize the context of gene expression. In this framework, the transcriptomic data made available require automated processing workflows and user-friendly interfaces for data exploitation and comprehension. We present TATools, a bioinformatic platform that provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. TATools is validated in the context of venomics. This thesis reports the genesis of the design of TATools as presented in two published articles and a manuscript (at this stage under [...]

Reference

KOUA, Dominique Kadio. Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes. Thèse de doctorat : Univ. Genève, 2012, no. Sc. 4471

URN : urn:nbn:ch:unige-239511 DOI : 10.13097/archive-ouverte/unige:23951

Available at: http://archive-ouverte.unige.ch/unige:23951


UNIVERSITY OF GENEVA, FACULTY OF SCIENCES
Department of Computer Science: Professor Ron D. Appel
Swiss Institute of Bioinformatics: Dr. Frédérique Lisacek
ATHERIS LABORATORIES: Dr. Reto Stöcklin

Bioinformatics tools to assist drug candidate discovery in venom gland transcriptomes.

THESIS

presented to the Faculty of Sciences of the University of Geneva to obtain the degree of Doctor of Sciences, specialization in Bioinformatics

by

Dominique Kadio Koua from Bouaké (Côte d'Ivoire)

Thesis No. 4471

Geneva, Centre d'impression UNIGE, 1 October 2012

Thesis

Bioinformatics tools to assist potential drug candidate discovery in venom gland transcriptomes.

KOUA, Dominique Kadio

Abstract

Current pharmaceutical research is actively exploring the field of natural peptides. Venomics addresses this issue through the study of toxins. The concomitant development of sequencing techniques is opening new perspectives for understanding biological mechanisms. Transcriptome sequencing of specific tissues is undertaken to better understand and characterize the context of gene expression. In this framework, the transcriptomic data made available require automated processing workflows and user-friendly interfaces for data exploitation and comprehension. We present TATools, a bioinformatic platform that provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. TATools is validated in the context of venomics. This thesis reports the genesis of the design of TATools as presented in two published articles and a manuscript (at this stage under revision), and it describes the final outcome of this work with the support of a submitted manuscript detailing the analysis workflow. The use of TATools is illustrated with the study of the Conus consors venom gland transcriptome and subsequent conopeptide identification and classification. Other applications of parts of the TATools platform are shown in two further published articles.


Acknowledgements

My heartfelt thanks to

Professor Ron D. Appel of the University of Geneva, president of the jury,

Professor Amos Bairoch of the University of Geneva, internal examiner,

Professor Oliver Hartley of the University of Geneva, internal examiner,

Professor Jordi Molgo of the CNRS in France, external examiner,

Doctor Frédérique Lisacek of the Swiss Institute of Bioinformatics, co-supervisor,

Doctor Reto Stöcklin of Atheris Laboratories in Geneva, co-supervisor,

the honourable members of the jury, who agreed to devote their time to evaluating and improving the work I presented.

I would like to thank everyone whose trust, support and assistance made this thesis work possible.

Many thanks to Sylvie and Reto Stöcklin for giving me the opportunity to carry out my thesis work at Atheris Laboratories. Thank you to Frédérique Lisacek, who welcomed me into the PIG group of the Swiss Institute of Bioinformatics and who always had confidence in me. Thank you to Reto and to all the partners of the CONCO project for the rewarding international collaborations in which I had the chance to take part. Thank you to my thesis co-supervisors for always being present and available to supervise this work. Many thanks for the friendship you continue to show me.

I would also like to thank Philippe Favreau (Philou) for his advice and guidance and for his unfailing availability.

I would like to thank all my colleagues at Atheris Laboratories. It was a daily pleasure to work in a team as competent as it is friendly. May Estelle B., Roman M., Coralie D., Aude V., Cecile C., Nicolas H., Frederic P., Francine A., Florence B., Xavier S., Daniel B., Vera O., Hadrien G., Alain C., Sebastien D., Florence A. and all the others find here my gratitude for the excellent moments we shared.

I also thank my parents and my brothers and sisters for their affection and their unwavering support despite the distance. Dearest André K., Joséphine B., Suzanne K., Jean-Baptiste A., Jacques A., Florence K., Eugène G., Innocent K., Vincent K. and Denis K., I tried to draw from your courage and optimism the strength to carry forward the tasks that fell to me. Thank you for always being there.

My gratitude also goes to the management of the Résidence Universitaire de Champel for the very friendly, family-like atmosphere and for their ever-attentive support. Thank you in particular to Joachim H., Manuel L., Alfred F., Lukas W., Philippe M., Hans F., Albert O., Albert M., Carlos S. and Peter R., as well as to all the residents I had such great pleasure in meeting.

Many thanks to Véronique M., Jocelyne B., Dolnide D., Laure V., Sylvie S. and Gabrielle de B. for their invaluable help with administrative matters.

Thank you to all my colleagues at the SIB for their friendship and the stimulating example of their scientific quality. Thank you in particular to Patricia P., Christian S., Béatrice C., Lorenzo C., Markus M., Fréderic N. and Erik A. I would also like to extend my sincere thanks to everyone at the Swiss Institute of Bioinformatics and the University of Geneva for their availability and ever-cordial assistance.

Thank you to all my friends, here and elsewhere, for their support.

MANY THANKS TO YOU ALL.

Deo Omnis Gloria!

General presentation

Modern pharmaceutical research relies essentially on high-throughput screening of candidate molecules, with a view to selecting them as active ingredients that specifically target the biological receptors involved in the diseases to be treated. Over the last fifteen years or so, however, the number of new molecules put forward by the pharmaceutical industry has been in constant decline. It is therefore essential to consider exploring new sources of bioactive compounds. In this context, natural peptides occupy an increasingly important place in research programmes. In particular, animal venoms, known to be cocktails of highly active and specific compounds, have been widely studied and have already revealed a large part of their richness. With the emergence of new high-throughput sequencing techniques, exploring the full range of proteins being expressed by 'reading' the messenger RNAs (the transcriptome) of venom glands has become possible and economically accessible. This new approach has the advantage of allowing a more detailed exploration of the potential of the venom apparatus. However, improvements in sequencing techniques lead to the production of ever larger transcriptomes made up of millions of reads. Bioinformatic analysis of transcriptomes to identify peptides of potential interest therefore appears to be a crucial step in pharmaceutical research based on transcriptome exploration. A review of the classical approaches to bioinformatic transcriptome analysis brought to light several problems for which the present study proposes a solution. Indeed, because of the growing volume of transcriptomic data produced and the variety of existing analysis tools, the practical exploitation of transcriptomes remains very limited.
Four main problems were identified in the present work:
1- The current analysis methodology is not optimal and, above all, too costly in computing hours.
2- Identifying peptides of interest and annotating their potential functions remains a long and tedious task that requires an experienced biologist able to explore and manually compile heterogeneous results coming from, among others, sequence similarity searches, conserved domain searches and links to ontologies.
3- The quantity of results to be validated manually, as well as most of the commonly used bioinformatics tools, do not allow the discovery of truly novel compounds.
4- The whole exploration and discovery process is heavily hampered by the lack of adequate visualization tools.

The subject of this work is a computing environment that assists the discovery of peptides of pharmaceutical interest from the analysis of venom gland transcriptomes. This thesis proposes and describes TATools, an efficient and user-friendly solution addressing the four concerns raised. The use of TATools and of the new analysis scheme in the context of the European project CONCO (1) contributed satisfactorily to several research questions, both fundamental and applied. As regards pharmaceutical research in particular, the new approach made it possible to identify and/or confirm the existence, within the venom of Conus consors, of analogues of XEP-018, a compound previously isolated from the venom of this predatory gastropod. In addition, the specific modelling of conopeptides (cone snail peptides) allowed the identification of numerous interesting compounds from the analysis of the transcriptome of a single specimen of Conus adamsonii.

Origin of the questions

Venomous animals are found at almost every latitude and in many phyla: snakes and other reptiles, marine molluscs, fish, amphibians, insects, arachnids, myriapods and even a few mammals. They possess highly specialized exocrine glands coupled with a sometimes very sophisticated system (fangs, stingers, harpoons) for delivering the secreted venom. Venomous animals and their venoms have been the subject of scientific study for many years, especially because envenomation is a relatively significant cause of death and/or disability worldwide. Moreover, venoms are highly complex mixtures of peptides and proteins whose interest for pharmaceutical research has grown steadily in recent years. The appeal of venoms lies in the extreme specialization and impressive efficacy of the peptides and small proteins that compose them. Studies have shown that these compounds are active on a broad spectrum of molecular targets. Thus, several drugs derived from venom peptides or their derivatives are already on the market (Capoten/Captopril; Integrilin/Eptifibatide; Aggrastat/Tirofiban; Prialt/Ziconotide; ...), while many other peptides are at various stages of validation or approval.

1 Applied venomics of the species Conus consors for the accelerated, cheaper, safer and more ethical production of innovative biomedical drugs (http://www.conco.eu/)

Biological analyses

The elucidation of venom composition is tied to the evolution of analytical techniques. Progress in separation methods based on gel electrophoresis (SDS-PAGE) or liquid chromatography (HPLC), advances in mass spectrometry and nuclear magnetic resonance (NMR) spectroscopy, and the miniaturization of biological assays were very quickly applied to the study of venoms (venomics). Venom proteomics has produced an increasingly complete inventory of the peptides and proteins that venoms contain. In addition, recent advances in sequencing have opened new horizons for understanding venom systems. Venom gland transcriptomics is increasingly used to complement proteomics in identifying compounds of potential interest for pharmaceutical research. In this context, bioinformatic analysis is an essential step because, thanks to the many tools available, it makes it possible to better exploit the richness of transcriptomes. Without these bioinformatics tools, analysing transcriptomic data would be slow and tedious work. Their development, on the contrary, facilitates and speeds up interpretation and guides it towards the formulation of new hypotheses.

Problem statement

The questions raised above call for the ability to recognize peptides of interest and to optimize both the quality of identification and the speed of detection of those peptides. Which method should be used to scan transcriptome data as quickly and efficiently as possible? How can sequences identical or similar to the peptide of interest be detected? How can a peptide or a peptide family be characterized so as to speed up the detection of homologous sequences? Secondly, the exploitation of transcriptomic data must extend beyond a single family or a single compound. Identifying and classifying sequences classically involves public databases that list known sequences together with automatic and/or manual annotations. This approach presupposes the ability to access annotations available online in order to infer those of the transcriptomic data produced. The major question remains what conclusion to draw when no external annotation is available for part of the transcriptome. In that case, a more thorough analysis must be carried out, since it may lead to the discovery of potentially unknown compounds.

There remains the very rarely addressed question of how to (re)present the results obtained. How can the results of the different analyses performed on the transcriptome be displayed both concisely and precisely? How can interactivity with users be ensured? Visualizing data and analysis results is a challenge in itself, all the more so because the quality of the conclusions and interpretations drawn by users may depend on the quality of the visualization.

Methodology

Given the range of problems to be addressed, the solution we propose integrates a relational database and robust, high-performance analysis tools, all running in an interactive and user-friendly web environment. TATools (a bioinformatic environment for transcriptome analysis) allows novices and specialists alike to make the most of the immense potential of transcriptomes, which new generations of sequencing instruments and techniques make accessible at ever more affordable costs. The aim of the analyses is to optimize and facilitate the identification and detection of sequences that may be of interest for pharmacological research. To this end, TATools includes classical analysis tools such as BLAST for sequence similarity searches in databases, Gene Ontology (GO) for annotation transfer and the prediction of function or activity, SignalP for signal sequence prediction and MAFFT for multiple alignments (all these tools are briefly described in Appendix 1). From a methodological point of view, the great strength of the proposed environment lies in the fruitful combination of classical and complementary analyses based on probabilistic models whose effectiveness has been widely demonstrated. Two of the articles published as a result of our work address precisely the use of these methods and demonstrate their benefits over previously used methods, notably BLAST. Thus, for the analysis of venom gland transcriptomes, generalized profiles (PSSMs) and hidden Markov models (HMMs) were built for the known toxin families. These models are used to detect analogues, and their applicability on a large scale is a major asset for identifying proteins of interest.

TATools is therefore the result of merging, within a single environment, the results of classical tools and of motif search methods for the analysis of large quantities of data.
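
As a purely illustrative sketch of how a position-specific scoring matrix can flag analogues of a known family, the Python snippet below builds log-odds scores from a few aligned motif instances and slides them along a translated transcript. The toy sequences, the uniform background frequencies, the pseudocount value and the function names `build_pssm` and `best_window` are all assumptions made for the example; they do not reproduce the TATools implementation or its actual profiles.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    # Uniform background frequency: a simplifying assumption for this sketch.
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm, sequence):
    """Slide the PSSM along the sequence; return (best_score, start) or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(aa not in AMINO_ACIDS for aa in window):
            continue  # skip windows containing non-standard residues
        score = sum(pssm[i][aa] for i, aa in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Invented motif instances (hypothetical; not real toxin sequences).
family = ["CCNPAC", "CCHPAC", "CCSPAC", "CCNPSC"]
pssm = build_pssm(family)
hit = best_window(pssm, "MKLTGCCNPACGRW")   # contains a family-like motif
miss = best_window(pssm, "MKLTGAAAAAAGRW")  # contains no such motif
print(hit, miss)
```

In practice a score threshold would decide which windows count as matches, and a profile HMM would additionally model insertions and deletions, which this gap-free sketch ignores.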

Application to conopeptides

TATools was developed within the European project CONCO (www.conco.eu), coordinated by Atheris Laboratories. It enabled the detection of interesting analogues of XEP-018, the flagship peptide of the project. Furthermore, the complete analysis of the Conus consors and Conus adamsonii transcriptomes allowed the characterization of many peptides from numerous conopeptide families. The CONCO project was also an opportunity to highlight, once again, the close relationship between proteomics and transcriptomics. Proteomic studies (especially chromatographic fractionation and mass spectrometry analysis) carried out in parallel with the transcriptomic studies confirmed the presence, in venoms collected from live or dissected animals, of mature peptides whose precursors had been detected by bioinformatic analysis of the transcriptome. Likewise, the transcriptome helped resolve some ambiguities encountered during the proteomic analysis of the venom. It should also be noted that progress in proteomic analysis and sequencing methods has gone hand in hand with progress in the chemical production of proteins. Venoms are generally available in small quantities, which makes isolating and purifying the peptides of interest relatively difficult. However, sequencing and chemical synthesis already make it possible to circumvent the resolution problems encountered in the proteomic analysis of small quantities of venom. Moreover, bioinformatic analysis of transcriptomes is a relatively fast and efficient way to detect the compounds sought, which can then be chemically synthesized and purified in order to validate their pharmacological activity.

Thesis outline

This manuscript has three parts. Since the aim of the study is to analyse and exploit transcriptomes in order to discover sequences with characteristics of interest for a protein target, Chapter I of the first part presents general background on the stages of new drug development. In this process, the discovery of the active compound remains the central and critical point around which the validation and optimization steps are carried out. Transcriptomics offers a promising alternative route to identifying candidate molecules. This approach and its technological context are therefore briefly described in that chapter. Chapter II of the first part presents venomous animals and their venoms, with emphasis on the characteristics of venoms that make them potential sources of candidates for the discovery of tomorrow's drugs. Chapter III is devoted to the bioinformatics activities carried out within the CONCO project, which is surely one of the most accomplished venomics initiatives because it brought together specialists from various fields around the study of Conus consors, a venomous marine mollusc. As these activities mostly concerned cone snail toxins (conotoxins), the chapter presents the challenges of their classification as well as the approaches developed during the project to identify them within transcriptomes.

The second part presents the problems linked to the exploitation of transcriptomic data and the methodology proposed to solve them (Chapter IV). Applying this methodology led to the development of a bioinformatic environment for transcriptome analysis named TATools. The outline of the implementation of the transcriptome analysis platform is described first, in particular the modules included in the data analysis process and the structure of the underlying databases (Chapter V). A detailed user guide for future users then presents the main interfaces of the software and the main analysis activities that can be carried out with it (Chapter VI).

Finally, the third part presents the results of applying the proposed methodology. Chapter VII illustrates the use of the platform and gives some results from example applications. The manuscript ends with Chapter VIII, devoted to a discussion of the performance of the proposed platform. Prospects are also outlined for developments that could be implemented to improve the level of exploration and exploitation of transcriptomes. The appendices include, among other things, articles presenting intermediate results obtained while solving the problems identified during the evaluation of classical schemes, as well as the results obtained using the platform.

Keywords: Next-generation sequencing, transcriptome, venom, toxin, peptide, protein, pharmacological activity, bioinformatics, protein family, generalized profiles, hidden Markov models, sequence analysis, automatic annotation, integrated environment, web tool.

Publications

Identification and classification of conopeptides using profile Hidden Markov Models. Laht S., Koua D., Kaplinski L., Lisacek F., Stöcklin R., Remm M. 2012. Biochim Biophys Acta. 1824(3):488-92. Epub 2011 Dec 30.

Position-Specific Scoring Matrix and Hidden Markov Model complements each other for the prediction of conopeptide superfamilies. Koua D., Laht S., Kaplinski L., Stöcklin R., Remm M., Favreau P., Lisacek F. Revised manuscript submitted to Biochim Biophys Acta.

ConoDictor: a tool for prediction of conopeptide superfamilies. Koua D., Brauer A., Laht S., Kaplinski L., Favreau P., Remm M., Lisacek F., Stöcklin R. 2012. Nucleic Acids Research. 40(Web Server issue):W238-41. Epub 2012 May 31. doi: 10.1093/nar/gks337.

TATools, a bioinformatic environment for transcriptomes analysis. Koua D., Mylonas R., Favreau P., Stöcklin R. and Lisacek F. Manuscript ready for submission to BMC Bioinformatics.

Pattern Searches in Protein Sequences. Koua D. and Lisacek F. 2012. In: eLS 2012, John Wiley & Sons, Ltd: Chichester. http://www.els.net/ [DOI: 10.1002/9780470015902.a0006222.pub2]

Table of contents

Acknowledgements
General presentation
Publications
Table of contents
List of figures
List of appendices
Introduction
Thesis overview
Chapter I. Brief overview on drug discovery and Next-generation sequencing
  1- Drug discovery: classical approach of a lead compound
    1.1- The target-based approach
    1.2- In-silico support in the drug discovery process
  2- Next-generation sequencing techniques: opening promising opportunities
    2.1- Sequencing techniques
    2.2- Transcripts analysis and interpretation
  Concluding remarks
Chapter II. Venomics: discovery platform for tomorrow's drug candidates
  Introduction
  1- Venomous animals and their venom
  2- Venom component as drug candidate
    2.1- Attractive characteristics of venoms
    2.2- Venomics: a flourishing field
  3- Venom Proteomics
    3.1- Venomics co-evolved with analysis techniques
    3.2- Venom compounds characterisation
  4- Transcriptomics of venom glands
  Concluding remarks
Chapter III. Bioinformatics for CONCO: conopeptide classification marathon
  1- Project overview
  2- Conus consors description
  3- [...]: nomenclature, classification and pharmacological interest
  4- Conotoxins: bioinformatics classification tools
  5- Concluding remarks
Chapter IV. Study methodology: needs for an improved analysis platform
  1- Overview of classical analysis workflow
  2- Evaluation of the classical analysis workflow
    2.1- Problems opened by the classical workflow
    2.2- T-ACE, classical transcriptome analysis and organization platform
  3- Methodology: a drug-discovery oriented analysis workflow
    3.1- Problem A: Time consuming analysis workflow
    3.2- Problem B: Highlighting sequences of interest
    3.3- Problem C: Cross-validation of bioinformatics results
  4- Concluding remarks
Chapter V. TATools implementation
  Introduction
  1- Platform use cases
  2- TATools methods
  3- View and exploit results
  Concluding remarks
Chapter VI. Interfaces
  Introduction
  1- Login page
  2- TATools home page
    2.1- Enter a new profile
    2.2- Enter a new transcriptome
    2.3- Run analysis
  3- Transcriptome viewer
  4- List viewers
    4.1- Global results for BLAST, model match and signal detection
    4.2- Simplified list view
  5- Specialized viewers
    5.1- Compiled results of a translated transcript
    5.2- TATools contig viewer
    5.3- Cluster summary
    5.4- Clipboard
  6- Anticipate biologists' needs
    6.1- Enriched BLAST viewer
    6.2- Pseudo-precursor detection
    6.3- Multiple sequence alignment manager
    6.4- Additional tools to assist drug discovery
Chapter VII. Transcriptome analysis: a step forward in venomics
  1- First case study: alpha conotoxins from Conus adamsonii
    1.1- Importance of alpha [...]
    1.2- Presentation of Conus adamsonii
    1.3- Transcriptome map of Conus adamsonii
    1.4- Alpha conopeptides from Conus adamsonii
  2- Second case study: identification of analogues for the XEP-018
    2.1- Conopeptide distribution in Conus consors venom gland transcriptome
    2.2- Presentation of XEP-018
    2.3- XEP-018 analogues detection in venom gland transcriptome
  3- Concluding remarks
Chapter VIII. Discussion and Prospects
  1- Too many sequences in the bin
  2- More analysis approaches, more matches, more confidence
  3- Family distribution in transcriptomes
  4- Bringing out novelty
  5- Comparative transcriptomics
  6- Results annotation
Conclusion
Bibliography

List of figures

Figure 1: Thesis graphical situation
Figure 2: Target-based drug discovery pipeline
Figure 3: Computational interventions in target-based drug discovery
Figure 4: A Conus consors
Figure 5: Typical regions of a conopeptide precursor
Figure 6: Summary of classical transcriptome analysis operations
Figure 7: Simplified and efficient BLAST-based annotation workflow
Figure 8: Automated cross-validation of results is summarized into a "Transcriptome map"
Figure 9: Relational database schema for the newly proposed analysis workflow
Figure 10: Complete improved workflow for transcriptome analysis
Figure 11: TATools use cases diagram
Figure 12: Sequence extraction activity diagram
Figure 13: TATools homepage
Figure 14: New transcriptome submission interface
Figure 15: Interface for setting analysis parameters
Figure 16: General result page. Example from Conus adamsonii transcriptome analysis
Figure 17: Viewer for transcripts belonging to a class of the transcriptome map
Figure 18: Viewer for transcripts associated to a GO term
Figure 19: TATools list viewer with annotation interface
Figure 20: TATools transcript viewer
Figure 21: TATools contig viewer displays reads used to construct a given contig
Figure 22: TATools contig cluster viewer
Figure 23: TATools clipboard helps to manage user selection
Figure 24: Conus consors Transcriptome map
Figure 25: Distribution of matches obtained for conopeptide superfamilies by searching the Conus consors transcriptome with conopeptide pHMMs and PSSMs
Figure 26: New isoforms of mu-conotoxin identified from the Conus consors venom gland transcriptome

List of appendices

Appendix 1. Presentation of main bioinformatics tools referred to in the manuscript
Appendix 2. Conopeptide superfamilies characteristics
Appendix 3. Identification and classification of conopeptides using pHMM
Appendix 4. PSSM and HMM complement each other for conopeptide prediction
Appendix 5. ConoDictor: a tool for prediction of conopeptide superfamilies
Appendix 6. TATools, a bioinformatic environment for transcriptomes analysis
Appendix 7. Pattern Searches in Protein Sequences
Appendix 8. Molecular phylogeny of conopeptides

Introduction

Main motivation

The present work is a bioinformatics contribution that was designed to meet the requirements emerging from three different sources. First, this work is bound to pharmacology and the discovery of new drugs from natural compounds. As drug discovery is now increasingly influenced by the expansion of –omics technologies, this work is secondly related to large-scale sequencing initiatives and the current upsurge in generating vast amounts of sequences. Thirdly, this work was motivated by recent developments in venom gland studies now covered by the topic of venomics. Indeed, natural compounds for drug discovery are expected to be found in venoms.

In this context, a large collection of bioinformatics tools is available, but these tools are often unconnected, while analytical pipelines are very much in demand, especially in the field of venomics. Consequently, this thesis is an attempt to select and integrate a variety of concepts from different origins into a single computer environment. As shown in Figure 1, the three fields of drug discovery, venomics and Next-Generation Sequencing (NGS) are related to one another. It is worth noting that most of the overlap between these fields is mediated by bioinformatics.

Figure 1: Thesis graphical situation

This work is at the confluence of Venomics, Next-Generation sequencing and Drug discovery.

Venomics and drug discovery

Millions of years of evolution have allowed nature to patiently select and propagate favourable gene products. It was recently recognised that venoms are among the most interesting of these products. Made of a complex and concentrated mixture of peptides, proteins and smaller organic molecules, these cocktails have been shaped by natural selection and adapted to specific species and environments, making any venom potentially unique (Mebs, 2002). Venomous species must survive in a large diversity of environments and have therefore developed highly specific mechanisms of attack and/or defence, often using highly efficient cocktails of molecules that intimidate, paralyse, kill and/or eventually pre-digest their prey or their attackers.

Venoms have been actively subjected to proteomics screening, and technical advances have led to an ever more complete and precise elucidation of their peptide content. Progress in analytical chemistry and biochemistry has led to the marketing of a number of drugs derived from venom peptides or their derivatives. Drugs derived from venoms can have many advantages for therapeutic applications. Through millions of years of evolution, nature has developed highly stable, bridged peptides with high activity and specificity. These properties mean that only small amounts are necessary to obtain the desired effect and, consequently, production costs are decreased. These peptides are often highly soluble in water, which gives them a low toxicity because they can be easily cleared through the kidneys. Moreover, because of their small size, they exhibit poor immunogenicity.

Moreover, we are now realising that the genetic information contained in the various apparatuses evolved to produce and deliver venoms can be an extraordinary resource of knowledge. Millions of years of evolutionary information are potentially available to us in transcriptomes.

NGS, venomics and drug discovery

The advent of Next-Generation Sequencing (NGS) is giving us the tools to take full advantage of the proteins patiently selected by nature. Transcriptomics has emerged as an alternative approach to bioassay-guided screening of venoms. Screening fractionated venom for bioactive compounds can be a time-consuming and expensive process. In the transcriptomic approach, however, lead discovery and optimisation have to some extent already been achieved by nature. On the other hand, being able to properly identify a conserved functional domain in a protein is key to obtaining clues about the protein's function and/or to identifying related homologues in other organisms. Various identification and classification approaches have been proposed for protein characterisation. Most of them are based on common patterns or sequence similarity searches to characterise protein families. The results of automated searches for such patterns are used to qualify protein structure and function and to explore evolutionary relationships.

Back to bioinformatics

Considering the increasing number of DNA and protein sequences generated by high throughput technologies, bioinformatics tools must follow the pace of data generation and support the interpretation of transcriptomes. Up to now, only one tool has been proposed for the organisation and visualization of full transcriptome annotation projects (Philipp et al., 2012).

Recent work in venomics includes various attempts at bioinformatic analysis of venom gland transcriptomes. These analyses mostly rely on sequence similarity searches, scanning for known domains and manual annotation by experts.

In this study, we present TATools, a web-based user-friendly platform for transcriptome data analysis and visualization. TATools is based on open-source languages and tools. It offers HTML interfaces for data visualisation and exploitation; simplified forms are also provided for results annotation. The development of TATools was guided all throughout by a concern for usability and applicability to solve the questions raised by experimentalists.

Thesis overview

Modern drug discovery research is essentially based on high-throughput screening of candidate molecules to identify those that specifically target a pathology-related biological mechanism. However, despite the conceptual and practical advantages of the target-based approach, the number of new active principles reaching the clinical stage or the market is dramatically low and continuously decreasing. This situation has led to the exploration of new sources of biologicals. Among others, animal venoms have attracted interest because of the richness and efficiency of their compounds. The intensive exploration of venoms by classical proteomics approaches has already proved the effectiveness of the venom-based drug discovery effort. In addition, thanks to the development of sequencing techniques, it is now possible to obtain transcriptomes of very good quality at an affordable price. The transcriptome is the set of all RNA transcripts expressed in an organ at a particular time of its life. The transcriptomics approach allows a deeper and more detailed exploration of the peptides potentially produced by the venomous apparatus. However, improvements in sequencing techniques produce ever larger numbers of reads. The bioinformatic analysis of these transcriptomes to highlight peptides of interest for pharmacological applications is therefore increasingly challenging. The evaluation of classical transcriptome analysis workflows revealed a number of limitations to be addressed:

1- The current analysis methodology is not optimized and is excessively time consuming.
2- Data interpretation and results validation, as well as putative function annotation, remain time consuming and require expert intervention to point out sequences of interest.
3- The amount of data to be analyzed and the heterogeneity of bioinformatic analysis outputs make it tedious to cross-link and cross-validate the results in order to reliably select interesting and novel sequences.
4- The lack of data visualization interfaces increases the difficulty of the whole data validation and interpretation process.

The purpose of this work is to provide a bioinformatic environment that assists the discovery of peptides of pharmaceutical interest from animal venoms. This thesis proposes and describes TATools, an efficient and convenient bioinformatic solution addressing the four main concerns presented above. The use of TATools and the application of the newly proposed analysis workflow in the framework of the European project CONCO contributed to resolving both theoretical and applied problems. The new analysis workflow led to the identification and/or confirmation of analogues of XEP-018 in the venom gland transcriptome of Conus consors. XEP-018 is an active compound previously isolated from the venom of this predatory cone snail. In addition, we constructed specific models for conopeptides (cone snail peptides). The model-based search allowed the identification of novel conopeptides from the venom gland transcriptome of a single specimen of Conus adamsonii.

Origin of questions

Venomous animals are found at almost all latitudes and in multiple phyla: snakes and other reptiles, marine mollusks, fish, amphibians, insects, arachnids, centipedes and even some mammals. They have highly specialized exocrine glands coupled with a sometimes very sophisticated delivery system (hooks, darts, spears) to inject the secreted venom. Venomous animals and their venoms have been the subject of scientific studies, especially because envenomations constitute a relatively important cause of death and/or disability worldwide. In addition, venoms are highly complex mixtures of peptides and proteins whose interest for pharmaceutical research has grown steadily in recent years. The attraction of venoms is due to the extreme specialization and the impressive efficacy of the peptides and small proteins they contain. A large number of studies have highlighted the efficacy of these compounds on a broad spectrum of molecular targets. Thus, several drugs derived from venom peptides or their derivatives are already marketed (Capoten / Captopril; Integrilin / Eptifibatide; Aggrastat / Tirofiban; Prialt / Ziconotide; ...) while many other peptides are in various stages of validation or approval.

Biological analyses of venoms

The elucidation of the composition of venoms is tied to the development of analytical techniques. Advances in separation methods such as gel electrophoresis (SDS-PAGE) and liquid chromatography (HPLC), advances in mass spectrometry and nuclear magnetic resonance spectroscopy (NMR), and the miniaturization of biological assays have rapidly been applied to the study of venoms (venomics). Proteomics has made it possible to elucidate an increasing number of the peptides and proteins that compose venom. Moreover, recent advances in sequencing have opened new horizons for the understanding of venomous systems. Venom gland transcriptomics is becoming a common complement to proteomics for identifying compounds of potential interest in pharmaceutical research. In this context, bioinformatic analysis is a must, since reliable software helps to better exploit the wealth of transcriptomes. In the absence of such bioinformatics tools, the analysis of transcriptome data would be slow and tedious work. The development of these tools speeds up and eases the work of interpretation and guides researchers in the formulation of new hypotheses.

Problem statement

Rapid and reliable identification of the proteins encoded in transcriptomes plays a pivotal role in the interpretation of next-generation data. This requires the ability to recognize peptides of interest while optimizing both the quality and the speed of their detection. This raises several questions: What method should be used to browse transcriptome data as quickly and efficiently as possible? How can sequences identical or similar to the peptide of interest be detected? How can a peptide family or a single peptide be characterized so as to speed up the detection of homologous sequences? On closer inspection, it appears that the exploitation of transcriptomic data must be extended beyond a single family or a single compound. Identifying and classifying sequences typically relies on public databases that list known sequences with associated automatic and/or manual annotations. The tricky question is what conclusion to draw when no external annotation is available for part of the transcriptome. In this case, a more careful analysis must be conducted, since it can reveal potentially new and unknown compounds. Finally, the question of data display and result visualization must also be solved. How can all the results of the analyses performed on a transcriptome be displayed at once in a concise and precise way? How can interactivity with users be provided? Visualization is a challenge in its own right, especially because the quality of the interpretations and conclusions drawn by users may depend on the quality of data visualization.

Methodology

Given the range of questions to be addressed, our solution integrates a relational database with classical analysis tools and powerful model matching strategies running in an interactive and user-friendly web environment. TATools (Bioinformatic Environment for Transcriptome Analysis) allows both novices and experts to get the most out of the immense potential of transcriptomes. The purpose of this platform is to optimize and facilitate the identification and/or detection of sequences that may be relevant to pharmacological research. It integrates conventional analysis tools such as BLAST for similarity searches in sequence databases, Gene Ontology (GO) to transfer annotation about potential function or activity, SignalP for predicting signal sequences and MAFFT for producing multiple sequence alignments. A brief description of these tools, and of other tools mentioned throughout the text, is given in Appendix 1. The main feature of the methodology implemented in this new environment is the successful combination of classical analysis tools with probabilistic modelling methods whose effectiveness has been widely demonstrated. Two published articles describing our work demonstrate the benefits of model-based classification methods compared to commonly used sequence similarity search methods such as BLAST. Thus the transcriptome analysis of Conus spp. venom glands was enhanced by using generalized profiles (PSSM) and Hidden Markov Models (HMM) prepared for families of known toxins. These models led to the detection of analogues, and their applicability to large-scale studies appeared to be a major asset in identifying peptides of interest.
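To make the idea of a model-based search more concrete, the following Python sketch builds a very simple position-specific scoring matrix (PSSM) from a small ungapped alignment and slides it along a query sequence. It is only an illustration under stated assumptions: the function names, the toy alignment, the uniform background frequencies and the pseudocount are hypothetical, and the generalized profiles and HMMs used in this work are richer models (for instance, they handle insertions and deletions).

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def best_hit(pssm, sequence):
    """Slide the profile along `sequence`; return (best score, start offset)."""
    width = len(pssm)
    scores = [
        (sum(col[aa] for col, aa in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


# Hypothetical ungapped alignment of a cysteine-rich motif (not real toxin data).
family = ["CCGKC", "CCSRC", "CCGRC", "CCNKC"]
pssm = build_pssm(family)
score, start = best_hit(pssm, "MKLTCCGRCAAV")
print(start, round(score, 2))
```

Unlike a pairwise similarity search, the score here rewards residues that are conserved across the whole family (the cysteines in this toy example) more strongly than residues at variable positions.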

Application to conopeptides

TATools was developed in the framework of a European project named CONCO (www.conco.eu) coordinated by Atheris Laboratories. It allowed the detection of analogues of interest for XEP-018, the lead peptide of the project. Moreover, the full analysis of the transcriptomes of Conus consors and Conus adamsonii allowed the characterization of many peptides from various conopeptide families. The CONCO project was an opportunity to highlight the close collaboration between proteomics and transcriptomics. In parallel with the transcriptomic analysis, proteomic studies (especially fractionation by liquid chromatography and analysis by mass spectrometry) were carried out on the venom collected from living or dissected animals. Proteomics confirmed the presence in the venom of mature peptides whose precursors were detected by bioinformatic analysis of the transcriptome. Similarly, the transcriptome made it possible to resolve ambiguities encountered in the proteomic analysis of venoms.

Organisation of the thesis

This manuscript consists of three parts. Since the motivation of our work is the analysis and exploration of transcriptomes to identify sequences with interesting features given a pathology-related molecular target, the first part begins with the main steps of drug discovery. In this process, the discovery of the active compound is the central and critical point that guides validation and optimization procedures. Transcriptomics offers a promising alternative for accelerating drug candidate discovery. This approach, as well as some technological background, is discussed in Chapter I. In Chapter II, we present the interest of venoms and the main results of previous studies conducted on this subject. Chapter III is devoted to the bioinformatics activities carried out during the CONCO project, which united specialists from various fields of biology, biochemistry and bioinformatics around the study of Conus consors, a venomous marine snail. These activities mainly focused on analyzing cone snail toxins (conotoxins). This analysis is complicated by open classification issues. The chapter covers the approaches developed during the project for conotoxin identification in the transcriptome.

The second part is devoted to the presentation of our bioinformatic environment for transcriptome analysis, named TATools. The new methodological approach proposed for transcriptome analysis is described in Chapter IV. Then, the main aspects of the implementation of the transcriptome analysis platform are described, in particular the modules included in the data analysis process and the structure of the underlying databases (Chapter V). Finally, a more detailed user guide presents some user interfaces and explains how to perform analyses on the platform (Chapter VI).

The third part presents results obtained with the new transcriptome analysis environment and opens a discussion on the future of TATools. Chapter VII illustrates the use of the platform and describes some results obtained with two Conus species. Chapter VIII discusses the performance of the proposed platform. Prospects are also given for further features that could be implemented to improve the level of exploration and exploitation of transcriptomes. Appendices include articles presenting the methodological aspects of conopeptide classification and related applications, as well as a manuscript describing TATools.

Keywords: Next-generation sequencing techniques, transcriptome analysis, venom, toxin, peptide, protein, drug discovery, bioinformatics, protein family, generalised profiles, hidden Markov models, sequence annotation, web-based tool, integrated platform.

Part 1:

Background and Thesis Context

Chapter I. Brief overview on drug discovery and Next-generation sequencing

1- Drug discovery: classical approach of a lead compound

In the early days, drug research was purely empirical. The starting point was a remedy known to be effective in human disease but whose mechanism was not understood. The goal was then to elucidate the mechanism and to use this knowledge to improve the therapeutic properties of the active principle. In the last fifteen years, target-based drug discovery has reversed the overall trajectory of research in the pharmaceutical industry. Drug targets are now chosen on the basis of a hypothesis about the pathophysiology of the disease. Initial pharmacological tests of these hypotheses in man are not undertaken until after years of preparatory work (Hurko, 2012). These preparatory research activities involve various scientific specialities and are time consuming and expensive. The whole process is referred to as drug discovery. Drug discovery can be conducted at three different levels: mechanism, function and physiology (Drews, 2003).

1- The physiology-based approach seeks to induce a therapeutic effect by reducing disease-specific symptoms or physiological changes. The screening is usually conducted in isolated organ systems or in whole animals. The physiology-based approach was the first drug discovery paradigm and has resulted in many effective treatments. It is still used extensively but suffers from a very low screening capacity and from the difficulty of identifying the mode of action of compounds.
2- The function-based approach seeks to induce a therapeutic effect by normalizing a disease-specific functional abnormality. Functional parameters represent a higher level of organism complexity because function requires the integrated action of many mechanisms. The screening capacity of these first two approaches is low, so they cannot be used for library screening.
3- The mechanism-based approach, which corresponds to the target-based approach, seeks to produce a therapeutic effect by targeting a specific mechanism. It screens for compounds with a specific mode of action. It is the most commonly used strategy because of its ability to screen huge compound libraries (Sams-Dodd, 2005). The present work relates to this case.

1.1- The target-based approach

In the target-based approach (Figure 2), novel mechanisms are identified based on fundamental research in biology and on clinical findings. These mechanisms are validated based on expression patterns and knock-out mice. After target selection, a high-throughput screening (HTS) in vitro assay is developed to measure the selectivity of compounds for the target. HTS normally results in several compounds, preferably belonging to different chemical classes, with medium to high affinity for the target. In the lead identification phase, small-scale screens of analogues around these structural classes are performed to determine the feasibility of reaching a selective compound with appropriate drug-like properties (Sams-Dodd, 2006). The lead structures can be tested in a disease model to determine whether the targeted mechanism has therapeutic potential and, if the outcome is positive, the lead optimisation programme begins. This programme mainly consists in elucidating structure-activity relationships. During this step, a large number of analogues are produced around the lead structures and are screened for target selectivity and for pharmacokinetic and metabolic properties. At the end of the lead optimisation phase, suitable compounds are tested in an in vivo disease model for proof-of-principle and, if the study is positive, the compound is selected for development. The lead optimisation and validation steps are long (2-4 years) and expensive. It is estimated that a marketable drug results from the systematic study of 10^5 to 10^6 molecules over almost 12 years, for an assumed cost of at least a billion dollars. However, although the target-based approach is highly advantageous from a scientific and practical viewpoint, it does not translate into a high success rate for novel targets. Indeed, there has been a steady decline in the number of new molecules and biologicals that enter clinical development and reach the market (Chanda and Caldwell, 2003; Van den Haak et al., 2004).

Figure 2: Target-based drug discovery pipeline. (figure adapted from http://sydney.edu.au)

1.2- In-silico support in the drug discovery process

All steps of target-based discovery are assisted by computers. For instance, computational genomics and protein crystal structure determination are used to improve target identification and validation. By modeling both target and protein active conformations, these computational techniques help to deduce useful interactions. This strategy could be considered as in silico systems biology. In the same manner, the prediction of protein-ligand structures guides analogue preparation and screening. Finally, at the early clinical trial stage, biosimulation can be used to evaluate adverse effect rates according to drug interactions and the genetic background of the targeted population. In silico activities are summarized in Figure 3.

Figure 3: Computational interventions in target-based drug discovery.

Concluding remarks

To counter the decline in the number of drugs and/or biologicals entering the market, natural products are actively explored. Improvements in throughput and quality, together with the decreasing cost of next-generation sequencing, offer drug discovery a new source of compounds to be screened.

The opening of the sequencing era offers new perspectives to target-based drug discovery. Indeed, high-throughput transcriptome data provide hundreds of putative drug candidates to be tested. In the next section, we briefly present sequencing approaches and introduce the importance of bioinformatics for transcriptome data analysis.

2- Next-generation sequencing techniques: opening promising opportunities

Genome and transcriptome analyses have become unavoidable in elucidating biological processes. Today, sequencing is not only providing long reads of good quality but is also relatively affordable. High quality and accessibility of current techniques, combined with extensive computational capabilities have given genome and transcriptome analyses a prevalent role in biological studies.

2.1- Sequencing techniques

Sequencing techniques have evolved very rapidly in the past decades. Major improvements in terms of throughput, speed and cost led to major reappraisals at distinct time points. This explains why sequencing is qualified with “generations”.

The Sanger method, known as the first generation sequencing technique, has been the most widely adopted and used sequencing technology, probably because of its very low error rate (see footnote 2). This method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleotide triphosphates (dNTPs) that are incorporated in the newly synthesized strand in a cycle reaction, and modified nucleotides (dideoxyNTPs) that terminate DNA strand elongation. When the mixture obtained with a dT terminator is fractionated by electrophoresis on denaturing acrylamide gels, the pattern of bands shows the distribution of dTs in the newly synthesized DNA. By using analogous terminators for the other nucleotides in separate incubations and running the samples in parallel on the gel, a pattern of bands is obtained from which the sequence can be read off (Sanger et al., 1975; Sanger et al., 1977). A typical Sanger sequencing reaction yields sequences of up to 700–800 bp, after which the quality of the sequences decreases (Casals et al., 2012).

The second generation of sequencing techniques, also called high-throughput or Next Generation Sequencing (NGS) technologies, has exponentially increased the quantity of sequences generated, producing up to several billion bases (gigabases, Gb) in a single run. The first and critical step in all NGS technologies is library preparation. Library preparation is broadly the same in all cases and consists in DNA/RNA purification and random fragmentation by physical or enzymatic reactions to generate fragments of the desired average size. The resulting fragments are then ligated to short DNA fragments called adaptors. DNA/RNA is then amplified by PCR, and the sequencing reaction is performed. Sequencing techniques include the pyrosequencing technique proposed by Roche/454, the Illumina approach (originally developed by Solexa), the sequencing by ligation system used by Applied Biosystems' SOLiD technology and the Ion Torrent semiconductor sequencing methodology proposed by Life Technologies. Table 1 summarizes the main features of the most commonly used sequencing platforms. A detailed summary of each of these major sequencing techniques is added in Appendix 3. Other recently available second-generation technologies include Polonator G.007 (Shendure et al., 2005) and the nanoball sequencing from Complete Genomics (Drmanac et al., 2010).

Footnote 2: Historically, the first DNA sequencing method was proposed by Maxam and Gilbert. This first method was a chemical procedure that partially breaks a terminally labeled DNA molecule at each repetition of a base. Four different reactions were proposed to preferentially cleave DNA at specific nucleotides. The DNA required radioactive labelling at one 5' end by a kinase reaction using gamma-32P ATP. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The initial DNA sequence was reconstituted from the migration pattern of radioactive bands obtained by electrophoresis on a polyacrylamide gel (Maxam and Gilbert, 1977).

Table 1: Comparison of different sequencing platforms.

(This table was originally published by Thudi et al., 2012)

The main novelty of third generation (also called single molecule) sequencing technologies is that they are able to detect the light emitted from a single molecule (Braslavsky et al., 2003; Harris et al., 2008). Helicos Biosciences, the first company to present a third generation sequencing technology, developed Helicos' True Single Molecule Sequencing (tSMS). Sequencing is performed by synthesis, with four fluorescently labelled nucleotides added one at a time. A laser makes the nucleotides emit light that is detected by the sequencer. Pacific Biosciences has also produced a third generation system, following a real-time sequencing by synthesis method in which the sequencing is not halted, resulting in very short run times and longer reads. Single DNA polymerase molecules are attached to “zero-mode waveguides”, nanophotonic structures able to measure the fluorescence of labelled nucleotides in real time in reduced volumes, enabling parallelization (Levene et al., 2003; Eid et al., 2009; Metzker, 2009). The main current limitation of single molecule sequencing technologies is their higher error rate (Schadt et al., 2010). Other third generation technologies, not yet available, include the detection of individual DNA bases as they pass through a nanopore, or microscopy techniques for direct imaging of single DNA molecules.

2.2- Transcripts analysis and interpretation

Data produced by sequencing platforms are raw reads that need to be assembled to reconstruct the original genetic information expressed in the studied sample. Most assembly software produces contiguous nucleotide sequences named contigs. Other software is also available for contig annotation.

In the case of model organisms such as human, mouse or yeast, complete genomes are available. Transcriptome analysis of model organisms therefore mainly consists in mapping operations. This has been fully described and discussed in the literature and still constitutes an interesting challenge. However, assembly for non-model organisms remains even more challenging.

In the case of non-model organisms, limited a priori sequence information exists. Assembly must be performed without the aid of a reference genome. Two main approaches are proposed for de novo assembly: overlap graphs and de Bruijn graphs. In the first case, the overlap between each pair of reads is computed and compiled into a graph. Each node of this graph represents a single sequence read and each edge represents an overlap between two reads. The consensus is computed by following the overlap graph. This algorithm is computationally intensive and most effective when assembling fewer reads with a high degree of overlap. De Bruijn graphs, on the other hand, break reads into smaller sequences called k-mers (usually 25-50 bp). These k-mers are linked to one another based on the conservation of k-1 bases to create contigs. The use of k-mers, which are shorter than the reads, reduces the computational intensity of this method.
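As a rough illustration of the de Bruijn approach described above, the following Python sketch breaks a few toy reads into k-mers, links (k-1)-mers that follow one another, and walks the unambiguous path to spell out a contig. The function names, the toy reads and the choice of k = 4 are assumptions made for illustration only; real assemblers additionally handle sequencing errors, reverse complements, branching paths and coverage.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow unambiguous edges from `start` and spell the resulting contig."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop if the walk would loop back on itself
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Toy reads (hypothetical) sampled from the sequence "ATGGCGTGCA".
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

Note that the reads are never compared with one another directly: the overlaps emerge from shared (k-1)-mers, which is why this strategy scales better than the pairwise comparisons required by overlap graphs.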

After de novo assembly, the analysis of contigs relies on comparison with annotated genes or gene products of other organisms. Elucidating the role of a specific sequence requires a comparative analysis with annotated sequences from different organisms described in a wide range of databases (NCBI nr, UniProtKB/SwissProt, Gene Ontology (GO), KEGG, InterPro and others). Searching these databases generates rich and voluminous outputs that also need to be evaluated by scientists. The management of the results produced by bioinformatic analyses of assembled transcriptomes also constitutes a bottleneck for transcriptome interpretation. This aspect will be discussed later in Chapter IV.

Concluding remarks

The combination of target-based drug discovery and high-throughput transcriptome sequencing has opened the era of transcriptome-based drug discovery. In this context, the present work focuses on the detection of novel sequences inside transcriptomes after the reads have been assembled.

The next chapter presents the CONCO project, a successful initiative to unravel the transcriptome of a non-model organism in order to discover new drug candidates.

Chapter II- Venomics: discovery platform for tomorrow's drug candidates

Introduction

The term “venomics” was introduced to embrace techniques and methods intended to understand and characterize venom and venom glands toxin contents (Ménez et al., 2006). The venomics approach currently encompasses transcriptomic, proteomic, peptidomic and/or glycomic studies of venom and venom glands (de Graaf et al., 2009). The recent development of venomics has made this field a must for identifying tomorrow's drug candidates.

1- Venomous animals and their venom

In any habitat there is competition for resources, and every ecosystem on Earth supporting life contains poisonous or venomous organisms. One of the most fascinating techniques of capturing prey or defending oneself is the use of poison or venoms. Venom represents an adaptive trait and an example of convergent evolution (Fry, 2008). The animal kingdom includes more than 100,000 venomous species spread through major phyla such as chordates (reptiles, fishes, amphibians, mammals), echinoderms (starfishes, sea urchins), mollusks (cone snails, octopi), annelids (leeches), nemertines, arthropods (arachnids, insects, myriapods) and cnidarians (sea anemones, jellyfish, corals) (Mebs, 2002). Venomous animals typically possess venom-producing exocrine glands coupled to a delivery system including barbs, beaks, fangs, harpoons, pincers, proboscises, spines, spurs and stingers (Fry, 2009). The ecological advantages conferred by a venom system are evident from the extraordinarily diverse range of animals that have evolved venoms for the purposes of predation, defense or competitor deterrence (King, 2011).

Venoms are deadly cocktails, each comprising a unique mixture of peptides and proteins tailored by natural selection to act on vital systems of the prey or victim. Venom toxins disturb the activity of critical enzymes, receptors, or ion channels, thus disrupting the central and peripheral nervous systems, the cardiovascular and neuromuscular systems, blood coagulation and homeostasis (Ménez et al., 2002). Venoms often include protease inhibitors and stabilizing agents that protect them from internal and external (high temperature) detrimental effects, and hence preserve them in the glands for weeks. It is estimated that each venom is composed of a mixture of 200 to 1000 peptides and proteins, most of which have not been characterised (Ménez et al., 2006; Biass et al., 2009). Multiplying the number of potential venom components by the number of venomous species makes it easy to understand what a vast natural resource of bioactive compounds venoms represent (Escoubas and King, 2009).

The ‘venome’ is the sum of all natural venomous substances produced in the animal kingdom. The venome is of interest for at least two reasons. First, it is a source of basic tools to study complex physiological systems. Second, the venome is a source of drug leads, approved drugs and diagnostic tools (Ménez et al., 2005).

2- Venom component as drug candidate

2.1- Attractive characteristics of venoms

The medicinal value of venoms has been known since ancient times. The snake is a symbol of medicine due to its association with Asclepius, the Greek god of medicine (Koh et al., 2006; Harvey et al., 1998). For decades, venomous animals and their venoms (especially snakes, scorpions, cone snails and spiders) have attracted scientific interest, mainly in the context of developing antidotes against envenomation. Researchers in the area of antivenom have manipulated the immune system of vertebrates, where immunoglobulins play a fundamental role (Gutiérrez et al., 2007). By analogy with venomics, the term “antivenomics” was coined for the identification, using proteomic techniques, of venom proteins bearing epitopes recognized by antivenom. Antivenomics is based on the immunodepletion of toxins upon incubation of whole venom with antivenom followed by the addition of a secondary antibody (Calvete et al., 2009).

Most venoms comprise a highly complex mixture of peptides, often with diverse and selective pharmacologies. Despite their diversity, venom peptides seem to have evolved from a relatively small number of structural frameworks that are particularly well suited to addressing the crucial issues of potency and stability. Indeed, five of the seven extant venom-derived drugs target the cardiovascular system, and four of these are snake proteins or mimetics thereof (King, 2011). One can consider the chemical space available from natural sources as the culmination of a billion-year drug discovery program with unlimited resources (Vetter et al., 2011). This evolved biodiversity makes venom peptides a unique source of leads and structural templates from which new therapeutic agents might be developed (Lewis and Garcia, 2003). Venom-derived compounds currently undergoing clinical trials or in preclinical development target a much greater range of pathophysiological conditions, including chronic pain, autoimmune disease, wound healing, HIV and cancer, and they are derived from a much wider range of venomous animals, including bats, cone snails, sea anemones, scorpions, snakes and spiders (King, 2011). It is well established that venom peptides have good affinity and selectivity towards a wide variety of membrane-bound proteins, especially ion channels, receptors, and transporters (Lewis and Garcia, 2003; Fry et al., 2009). For example, five of the seven pharmacological sites on vertebrate voltage-gated sodium (NaV) channels are defined by toxins from a wide range of venomous animals (King et al., 2008), and the alpha conotoxins from marine cone snails are particularly well adapted tools for discriminating between the various subtypes of nicotinic acetylcholine receptors (nAChRs) (Janes, 2005). In addition to selectivity, another interesting characteristic of venom compounds is their richness in disulfide bonds, which enhances molecular stability and protease resistance (Fry et al., 2008). The eight venom-derived drugs currently in clinical trials all contain between 1 and 14 disulfide bonds. Venom proteins are typically injected into the blood or soft tissue of prey (Mebs, 2002) and therefore must be able to withstand proteolysis and, in some cases, penetrate anatomical barriers such as the blood-brain barrier (Tedford et al., 2004). Extensive cysteine cross-linking is therefore a generalized feature of venom proteins, and some disulfide-rich scaffolds such as the ICK motif show extraordinary variation in primary structure (and therefore pharmacological activity) without perturbing the 3D fold of the protein (Sollod et al., 2005). From a therapeutic perspective, there has been a gradual realization that these disulfide-rich scaffolds might provide a greater opportunity for the development of orally active venom-derived drugs, or at least drugs with a longer systemic half-life when delivered parenterally (King, 2011).

2.2- Venomics: a flourishing field

The recent expansion of the venomics area is mainly due to the development of more sophisticated venom fractionation techniques (Vetter et al., 2011), advances in mass spectrometry (Favreau et al., 2006; Escoubas et al., 2008) and NMR spectroscopy (Schroeder et al., 2008), miniaturization of functional assays, and the ability to directly analyze toxin transcripts from venom-gland cDNA libraries (Wagstaff et al., 2009). These improvements have greatly facilitated the structural and functional characterization of venom components from animals that provide only minuscule amounts of venom (Pimenta et al., 2005; King et al., 2008; Liang, 2008; Escoubas and King, 2009; King, 2011). Moreover, the recent technological advances that facilitate high-throughput screening (HTS) and structural characterization of venoms and venom peptides promise to accelerate the venom-based drug discovery pipeline. The typical discovery program includes the following key elements (Vetter et al., 2011):

i. a robust high-throughput screen, to rapidly identify venoms with desired activity and to allow subsequent isolation of bioactive molecules;

ii. an efficient toxin production system, not only to produce sufficient toxin for functional and structural characterization, but also to facilitate structure–activity relationship (SAR) studies;

iii. a structural characterization by NMR.

This has traditionally been the assay-directed development platform. Other approaches, such as the polymerase chain reaction (PCR) and chemical approaches (for example, mass spectrometry), are increasingly being used to overcome the bias associated with assay-directed methods, which identify the major activity (potency × quantity) at the expense of less abundant components. Finally, chemical synthesis of the peptide allows confirmation of activity and further characterization across multiple targets, both in vitro and in vivo (Lewis and Garcia, 2003).

The high potency and specificity of many venom-derived peptides, their ease of chemical synthesis and/or recombinant production, and the resistance of many disulfide-rich peptides to proteolytic degradation, are attributes that have made them attractive drug leads (Harvey, 1995; Olivera, 2006). As a consequence of their high selectivity, venom peptides have proved particularly useful for in vitro and in vivo proof-of-concept studies. However, for therapeutic applications, a number of issues associated with safety, pharmacokinetics and delivery need to be addressed (Lewis and Garcia, 2003).

3- Venom Proteomics

3.1- Venomics co-evolved with analysis techniques

The elucidation of venom peptide composition has evolved in parallel with the techniques used for its study. Studies involving 1D SDS-PAGE separations of venoms showed a limited number of protein bands, and the underlying protein complexity was not revealed. The transition to liquid chromatographic separation, even with low resolving power such as gel filtration, revealed a far greater number of molecular species. The advent of HPLC, and especially reversed-phase (RP) methodologies, addressed this limitation by separating many more molecular species in a single run with better-resolved chromatographic peaks (Escoubas, 2006). The advent of soft ionization processes, including electrospray ionization (ESI) and matrix-assisted laser desorption ionization (MALDI), raised the possibility of direct observation of molecules in crude venoms. MALDI-TOF MS is generally regarded as superior to ESI–MS for analyzing complex samples, whereas ESI, being liquid based, is more amenable to interfacing with online separation techniques such as rpHPLC. MALDI-TOF MS of crude venoms typically results in relatively low mass counts due to ion suppression effects that prevent ionization of all molecules, particularly minor components (Escoubas et al., 2008). This problem can be largely obviated by combining one or more separation techniques with MS analysis either online (LC–ESI–MS) or offline (rpHPLC prior to MALDI-TOF MS) (Pimenta et al., 2001; Fry et al., 2003; Escoubas et al., 2006; Davis et al., 2009). From a study of Conus consors venom that directly compared both analytical methods, the authors concluded that the techniques are in fact complementary, with only 21% of the masses being common to both data sets (Biass et al., 2009). The use of high-end MS techniques such as Fourier-transform ion cyclotron resonance (FT-ICR) LC/MS/MS is also starting to revolutionize the field (Quinton et al., 2006). The high resolution and mass accuracy afforded by such instruments offer yet another separation dimension, sometimes bypassing the initial crude venom fractionation step. This was very elegantly demonstrated by Quinton et al. (2006) in a study of Conus sp. venoms. More recently, Conus venom has been analysed by ultra-high pressure liquid chromatography (UHPLC) coupled with QTOF MS-MS (Eugster et al., 2012). This allowed an impressive separation capacity of the venom compounds as well as online peptide deconvolution. Peptide deconvolution is part of the discovery platform. It consists of additional separation steps applied to a venom fraction in order to isolate an active compound more precisely. When a potentially pure fraction is obtained, the peptide amino acid sequence is determined.

3.2- Venom compounds characterisation

Most well-studied venoms (i.e., those from cone snails, scorpions, spiders and snakes) are a heterogeneous mixture of inorganic salts, small organic molecules (< 1 kDa), polypeptides (2 - 9 kDa), and high-molecular-weight proteins including enzymes (> 10 kDa) (Calvete et al., 2007; Biass et al., 2009; Schroeder et al., 2008; Vassilevski et al., 2009).

Once a mass list of venom compounds has been obtained, further analysis of the structure and pharmacological profile of these components requires their amino acid sequences. To this end, de novo MS-sequencing strategies are used. Early de novo sequencing work conducted on wasp venom peptides used post-source decay (PSD) fragmentation (Hisada et al., 2000). Approaches involving collision-induced dissociation (CID) were then widely used and allowed the characterization of conotoxins from Conus monile and Conus virgo (Sudarslal, 2004), small snake peptides such as BPPs (Soares et al., 2005), poly-His and poly-Gly peptides (Favreau et al., 2007) and sarafotoxins (Quinton et al., 2006). Electron-transfer dissociation (ETD), when coupled with a targeted chemical derivatization, was shown to greatly facilitate long peptide sequencing, and it permitted determination of the full sequence and post-translational modifications of 31 Conus textile peptides (Ueberheide et al., 2009). In traditional proteomics studies, tandem MS spectra are searched against a database of proteins in a ‘bottom-up’ approach. However, in organisms without any reference genome, true de novo peptide sequencing is necessary in a ‘top-down’ approach (Escoubas and King, 2009).

Injected venoms as well as milked venoms have been analysed in depth using the most advanced proteomics methods. The number of detected peptides and the capacity for de novo protein sequencing have increased remarkably. Recently, next generation sequencing technology has been used to unravel the complete venom gland transcriptomes of various venomous animals, so that interesting peptides can be identified based on transcriptomic data.

4- Transcriptomics of venom glands

Transcriptomics of venom glands is increasingly being used to complement proteomics studies.

Total RNA is extracted from the venom gland, then a directional full-length cDNA library is constructed. Sequencing is conducted following standard protocols. The resulting sequencing signals are processed to construct reads that are assembled into longer contigs. Contig annotation relies on comparative analysis with annotated genes or protein domains of other organisms. The main steps of venom gland transcriptome sequencing and exploitation are described in chapter 1.

As attested by the amount of recently published work, a proposed approach to lead optimisation is the use of transcriptomic data (Lluisma et al., 2012; Ma et al., 2012; Durban et al., 2011; Prosdocimi et al., 2011; Vaiyapuri et al., 2011). A transcriptome therefore provides the scientist with a unique data set of sequences, some of which have been screened during the lead discovery process, and others that have not because they were not found in the extracted venom. Through post-translational modifications and various cleavage sites, a single contig found in a transcriptome can give up to six different sequences (Terrat et al., 2012).

Concluding remarks

The idea of combining scientific fields for the comprehension and elucidation of living mechanisms has been present since the early beginnings of venomics. Today, combined proteomic and transcriptomic approaches have demonstrated great advantages in bioactive peptide discovery (Violette et al., 2012). A very interesting success story in the context of venomics is the CONCO project, in which proteomics, transcriptomics, bioinformatics and fundamental biology were associated to draw a complete map of the venomous system of a marine cone snail. The present study was conducted within the framework of the CONCO project and is intended to set up bioinformatics tools to assist the discovery of innovative peptides from the Conus consors venome and venom gland transcriptome. The bioinformatic approach applied to the discovery of interesting peptides from transcriptomes was based on model search strategies. This is described in the next chapter.

Chapter III. Bioinformatics for CONCO: conopeptide classification marathon.

1- Project overview

The CONCO project (http://www.conco.eu/) focuses on the discovery and development of new therapeutically relevant molecules from the venomous marine cone snail species Conus consors (Figure 4). Through the deep and exhaustive investigation of the animal's biodiversity, its genome, its venom gland transcriptome and its venom proteome, CONCO aimed to exploit in a sustainable way the great richness offered by these animals to discover the drugs of tomorrow (Dutertre et al., 2010; Kauferstein et al., 2011; Terrat et al., 2012; Violette et al., 2012; Favreau et al., 2012). The international project CONCO started in February 2007 and will end in 2012. It involves 20 partners from 13 countries: laboratories from universities, institutes, private companies and non-profit foundations. CONCO also established partnerships with the "Institut de Recherche pour le Développement" in New-Caledonia, the "Muséum National d'Histoire Naturelle" in France and the governments of New-Caledonia and French Polynesia. The project was fully integrated into the ambitious "Venomics" genome project initiated by the International Society on Toxinology (IST), dedicated to the understanding of the function and evolution of venomous systems in various phyla (Ménez et al., 2006). The CONCO project was proposed within the 6th Framework Programme and received funding of €10.7 million from the European Union over a five-year period.

2- Conus consors description

Cone snails are found in all major tropical and subtropical oceanic regions, including the East Atlantic and Mediterranean, East Pacific, South African, West Atlantic and Caribbean, and Indo-Pacific regions (Ekman, 1953; Briggs, 1974, 1995). Conus is especially abundant on coral reefs throughout the vast Indo-Pacific region (Kohn et al., 1960). The families Conidae, Turridae and Terebridae constitute the superfamily Conoidea, members of which are characterized by the possession of a venom apparatus. There are about 600 different cone snails, which are classified according to their feeding behavior into piscivorous, molluscivorous and vermivorous species (Kohn, 1983). The natural prey of cones consists of polychaete worms, other gastropods, pelecypods, octopuses and small fish. Cone snails feed by envenomating prey by means of a disposable harpoon-like radular tooth (Olivera et al., 1991). Prior to feeding, one tooth found in the radula sac is engaged in the proboscis via the pharynx. The cone snail is then armed, and the proboscis is extended out of the mouth and violently projected to bury the tooth into the prey. Simultaneously, the venom bulb contracts and facilitates the venom ejection through the proboscis. Finally, the venom enters the extremity of the tooth fixed to the proboscis and oozes out through the hollow tooth into the prey's tissues. The tooth is lost by the Conus and is replaced by a new one from the radula sac (Le Gall et al., 1999). In the same manner as described for Conus striatus, another fish hunter, in a Conus consors attack the fish typically jerks suddenly after being struck but remains tethered through the proboscis. A good strike causes the fish to be immobilized within 1 or 2 seconds, unable to use its major fins. Total paralysis is effected a few seconds later, but often the fish has been engulfed by the snail into its distensible buccal cavity even before this has occurred (Olivera et al., 1991).

Figure 4: A Conus consors.

Conus consors is a marine fish-hunter gastropod which sticks its prey with a harpoon and injects a powerful venom.

3- Conotoxins: nomenclature, classification and pharmacological interest

The biologically active agents in Conus venoms are unusually small peptides, 10-30 amino acids in length. Most peptides are multiply disulfide-bonded; small loops of 1-6 amino acids are interspersed between the disulfide-bonded Cys residues (Olivera et al., 1991). These highly structured peptides are called conopeptides and they show an extensive diversity of pharmacological activities (Favreau and Stöcklin, 2009; Lewis et al., 2012).

For over 30 years (Endean et al., 1974; Gray et al., 1981; Olivera and Cruz, 2001), cone snail toxins, or conopeptides, have stimulated interest in their remarkable molecular diversity and capacity to target neuroreceptors, ion channels and transporters, with both potency and specificity (Terlau and Olivera, 2004; Janes, 2005; Olivera et al., 2008; Halai and Craig, 2009). Conopeptides serve as valuable probes for neurophysiological studies (Olivera, 1997; Olivera and Cruz, 2001; Dutton and Craik, 2001; Lewis, 2009), and they provide lead compounds for drug discovery (Adams et al., 1999; Livett et al., 2004, 2006; Terlau and Olivera, 2004; Olivera, 2006; Craik and Adams, 2007; Vincler and McIntosh, 2007; Twede et al., 2009). For instance, the conopeptide MVIIA (Olivera et al., 1985) is used clinically, under the name “Prialt”, for the treatment of neuropathic pain (Miljanich, 2004). Xen2174, an analog of the conopeptide MrIA from Conus marmoreus (McIntosh et al., 2000; Sharpe et al., 2001), entered Phase II clinical trials for the treatment of acute pain in September 2008 (Xenome Ltd, http://www.xenome.com).

Conopeptide precursor sequences comprise a signal sequence region, a proregion, and a mature peptide region (Figure 5). The latter is excised via proteolytic processing during the maturation process (Woodward et al., 1990; Olivera, 2002). The signal sequences comprise the initial 18–22 amino acids of the leader and tend to be highly conserved among toxins having identical Cys-frameworks. These features have been used to define 16 superfamilies of conotoxins, namely A, D, I1, I2, I3, J, L, M, O1, O2, O3, P, S, T, V and Y (Terlau and Olivera, 2004; Loughnan et al., 2009). Conopeptides also typically contain multiple post-translational modifications (PTMs) that may be modulated by sequences within the propeptide region (Buczek et al., 2005). The three regions, signal sequence, pro-region and mature peptide region, have distinct levels of conservation, the signal sequence being relatively well conserved, the pro-region more divergent, and the mature peptide region highly diverse (Woodward et al., 1990; Conticello et al., 2001; Olivera, 2002; Yuan et al., 2007).

Figure 5: Typical regions of a conopeptide precursor.

Conopeptides have been categorized in the literature using several classification schemes. To date, the ConoServer database (Kaas et al., 2008; http://www.conoserver.org) is positioned as the central and most popular repository of conopeptides, with numerous hits (Kaas et al., 2011). ConoServer relies on three types of classification commonly accepted in the conopeptide research community and summarized in appendix 2, namely:

i. the gene superfamilies classification, based on similarities in conopeptide precursor sequences (mainly the signal peptide),

ii. the cysteine framework classification, based on patterns of cysteines in the mature peptide domain,

iii. the pharmacological families classification that categorizes conopeptides according to their activity.
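The cysteine framework criterion (item ii) lends itself to a short illustration. The sketch below is not the ConoServer implementation: the function names are hypothetical, the input sequences are toy strings, and the lookup table contains only two patterns as commonly listed for frameworks I and VI/VII, the complete table being the one summarized in appendix 2.

```python
import re

# Partial, illustrative lookup (assumption): only two frameworks are listed here.
KNOWN_FRAMEWORKS = {
    "CC-C-C": "I",
    "C-C-CC-C-C": "VI/VII",
}


def cysteine_pattern(mature_peptide: str) -> str:
    """Collapse a mature peptide into its cysteine pattern, e.g. 'CC-C-C'."""
    seq = mature_peptide.upper()
    first, last = seq.find("C"), seq.rfind("C")
    if first == -1:
        return ""  # no cysteine at all, hence no framework to report
    core = seq[first:last + 1]
    # Each run of non-cysteine residues between two cysteines becomes one dash.
    return re.sub(r"[^C]+", "-", core)


def framework_of(mature_peptide: str) -> str:
    """Return the framework label for a known pattern, or 'unknown'."""
    return KNOWN_FRAMEWORKS.get(cysteine_pattern(mature_peptide), "unknown")


if __name__ == "__main__":
    # Toy sequences used only to exercise the functions.
    for toy in ("GCCSDPRCAWRC", "ACKGKCSRLDCCTGSCRSGKCG"):
        print(toy, cysteine_pattern(toy), framework_of(toy))
```

Loop lengths between cysteines are deliberately ignored here; a stricter classifier would also need to take them into account.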

Recent work by Puillandre et al. (2012, submitted) validated this classification by molecular phylogeny, but also demonstrated that signal sequences of cysteine-poor conopeptides do not cluster separately from those of the cysteine-rich ones. In fact, contryphans and conomarphins (Cys-poor) share highly similar signals with known superfamilies (O2 and M, respectively). Consequently, exclusion of contryphans or conomarphins from the superfamily classification is not phylogenetically justified. Two additional superfamilies are proposed, B and C, for conantokins and contulakins respectively, one of which (C) has been suggested previously (Jimenez et al., 2007).

A standard toxin naming system, extensively relying on conopeptide classification schemes, was proposed early in the history of the field to set an unambiguous and descriptive denomination of disulfide-rich conopeptides (Gray et al., 1988). It consists of the following sequence of symbols:

i. one or two letters indicating the Conus species,

ii. a Roman numeral indicating the cysteine framework category,

iii. an upper case letter denoting the order of discovery.

For example, GVIA is the first conotoxin (A) with a VI framework extracted from Conus geographus (G). Peptides that have not been characterized pharmacologically are given Arabic, instead of Roman, numerals.
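As an illustration of this naming rule, the following sketch splits a disulfide-rich conotoxin name into its three symbols. It is a minimal reading of the convention described above, not a reference parser: the regular expression and the names `ConotoxinName` and `parse_name` are assumptions, and prefixes such as Greek letters or names lacking the final letter are not handled.

```python
import re
from typing import NamedTuple, Optional


class ConotoxinName(NamedTuple):
    species_code: str      # one or two letters, e.g. "G" or "Mr"
    framework: int         # cysteine framework category as an integer
    discovery_letter: str  # order of discovery, e.g. "A"


# Species code, Roman numeral, upper-case letter (hypothetical pattern).
NAME_RE = re.compile(r"^([A-Z][a-z]?)([IVX]+)([A-Z])$")
ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral made of I, V and X to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller symbol before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def parse_name(name: str) -> Optional[ConotoxinName]:
    """Return the parsed name, or None when the name does not fit the scheme."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    species, numeral, letter = match.groups()
    return ConotoxinName(species, roman_to_int(numeral), letter)


if __name__ == "__main__":
    for name in ("GVIA", "MVIIA", "MrIA"):
        print(name, parse_name(name))
```

Because the discovery letter and the Roman numeral share the symbols I, V and X, some names are inherently ambiguous for a purely syntactic parser; the greedy match used here favours the longest numeral.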

The nomenclature of disulfide-poor conopeptides is distinct from that of disulfide-rich conopeptides. Most disulfide-poor conopeptides have been named by combining a class name, a hyphen, and one or two letters describing the species (e.g. conantokin-G from C. geographus). A number is added to the name when several conopeptides are found within the same species (e.g. conolysin-Mt1 from Conus mustelinus) (Kaas et al., 2010).

To name cDNA clones, the favored nomenclature uses one or two letters to indicate the Conus species; an Arabic numeral to indicate the cysteine framework category; and a second number, separated by a decimal, to indicate the order of discovery (Walker et al., 1999) (e.g. Pu1.1, isolated from Conus pulicarius).
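The two remaining conventions, for disulfide-poor conopeptides and for cDNA clones, can be sketched in the same way. The patterns and the function name `parse_other_name` below are illustrative assumptions derived only from the examples quoted in the text.

```python
import re
from typing import Dict, Optional

# class name, hyphen, species code, optional number (e.g. conolysin-Mt1)
POOR_RE = re.compile(r"^([a-z]+)-([A-Z][a-z]?)(\d*)$")
# species code, framework number, decimal point, order of discovery (e.g. Pu1.1)
CLONE_RE = re.compile(r"^([A-Z][a-z]?)(\d+)\.(\d+)$")


def parse_other_name(name: str) -> Optional[Dict[str, object]]:
    """Parse a disulfide-poor conopeptide name or a cDNA clone name."""
    match = POOR_RE.match(name)
    if match:
        cls, species, number = match.groups()
        return {
            "kind": "disulfide-poor",
            "class": cls,
            "species_code": species,
            "number": int(number) if number else None,
        }
    match = CLONE_RE.match(name)
    if match:
        species, framework, order = match.groups()
        return {
            "kind": "cDNA clone",
            "species_code": species,
            "framework": int(framework),
            "order": int(order),
        }
    return None  # the name follows neither convention


if __name__ == "__main__":
    for name in ("conantokin-G", "conolysin-Mt1", "Pu1.1"):
        print(name, parse_other_name(name))
```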

Given the purported diversity of Conus venoms (an estimated 100 unique conopeptides per species for a genus of ca. 500 venomous species), they present a unique opportunity for studying the evolution of large variable gene families. Speculations over conopeptide evolution have emerged (Olivera et al., 1999), but only a few authors have addressed this topic in a quantitative manner. Previous results suggested that venom-derived gene families, including conopeptides, are undergoing accelerated evolution (Ohno et al., 1998; Duda and Palumbi, 1999; Froy et al., 1999). Conticello et al. (2001) suggest that in the case of venom-derived conopeptides, both a hypervariability-generating molecular mechanism and diversifying selection have contributed to the evolution of these large and hypervariable gene families. The striking positional conservation of cysteine codons in the mature region is the molecular signature of a protecting mechanism developed to preserve structurally crucial cysteine residues.

4- Conotoxins: bioinformatics classification tools

Recent studies have estimated that the number of different conopeptides detected in the venom of a single species can exceed 1,000 (Adam et al., 2009; Biass et al., 2009). Despite the huge molecular diversity of potentially existing conopeptides, only approximately 1000 conopeptides have been described so far (Kaas, 2011).

Taking into account the huge potential that remains to be offered by conopeptides and given the fact that novel sequencing techniques provide a vast amount of sequence data, there is a need for an automated process for identification and annotation of new conopeptide sequences from large datasets (Laht et al., 2012).

Since conopeptides are classified based on the well-conserved signal peptides and cysteine frameworks, the natural inclination is to develop bioinformatics algorithms that focus on these criteria. This approach is proposed by ConoServer (Kaas, 2011) through the CONOPREC tool that outputs for each submitted precursor:

i. the identification of sequence regions,

ii. the classification according to the three classification schemes (detailed in section 3),

iii. the identification of the most similar sequences in ConoServer,

iv. the predictions of potential post-translational modifications of the mature conopeptide.

On the other hand, different approaches for conopeptide superfamily prediction have been published over the years. Several pattern detection methods mentioned in Chapter II have been applied to conotoxins. Support vector machines (SVMs) have been defined (Mondal et al., 2006; Zaki et al., 2011) but are not adapted to high-throughput set-ups. Other pattern matching approaches, such as that of Lin and Li (2007), are not of practical use. In contrast, hidden Markov models (HMMs) as described by Laht et al. (2012) turned out to be the most appropriate for further automation.

The most recent study demonstrated the reliability of a classification based on the three regions of the precursor by means of profile hidden Markov models (pHMMs) (Laht et al., 2012). We proposed an improved strategy combining pHMMs and generalised profiles for conotoxin classification (Koua et al., under revision).

Furthermore, we built a web tool that implements this combined predictive strategy, called ConoDictor (http://conco.ebc.ee/index.php?pid=17). It is publicly available and classifies conopeptides into their superfamilies (Koua et al., 2012). ConoDictor was successfully tested on all known conotoxin data. Although initially developed for conotoxin prediction and classification, it was designed to be generalised to any type of protein family profile search, and could be used to classify other protein families provided that suitable profiles are prepared.
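To give a feel for profile-based classification in general, the sketch below builds a simple position-specific scoring matrix (PSSM) per family from ungapped aligned sequences and assigns a query to the best-scoring family. It is only a didactic simplification and does not reproduce ConoDictor, pHMMs or generalised profiles: the uniform background, the pseudocount, the function names and the two toy families (which are not real conopeptide signal sequences) are all assumptions.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background frequency (assumption)

Pssm = List[Dict[str, float]]


def build_pssm(aligned: List[str], pseudocount: float = 1.0) -> Pssm:
    """Build log-odds scores per column from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    denom = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in aligned]
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / denom) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm: Pssm, query: str) -> float:
    """Sum column scores position by position (ungapped); unknown residues score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, query))


def classify(query: str, profiles: Dict[str, Pssm]) -> str:
    """Return the name of the profile giving the highest score."""
    return max(profiles, key=lambda name: score(profiles[name], query))


if __name__ == "__main__":
    # Toy families, invented for illustration only.
    profiles = {
        "toy_family_1": build_pssm(["MKLTCV", "MKLTCL", "MKLSCV"]),
        "toy_family_2": build_pssm(["MGRWAA", "MGKWAA", "MGRWSA"]),
    }
    print(classify("MKLTCV", profiles))
```

In practice one profile would be built per superfamily and per precursor region, and a score threshold would be needed to reject sequences that belong to none of the modelled families.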

5- Concluding remarks

The CONCO project has achieved many of its original goals. Among others, it established that conopeptide classification can reliably be based on the propeptide and mature peptide regions, and not only on the highly conserved signal sequence. To this end, profile-based approaches have been applied. On the one hand, phylogenetic studies confirmed the signal-based classification and enriched the classification perspectives by broadening the concept of conotoxin superfamilies (Puillandre et al., 2012). In particular, the distinction between “cysteine-poor” and “cysteine-rich” appeared phylogenetically irrelevant. On the other hand, the combination of HMM-based and PSSM-based approaches provided innovative and reliable methodologies to the conopeptide research community. The application of this strategy has led to the development of a publicly accessible web tool that effectively takes advantage of the performance of both techniques. The two methodological papers that established the efficiency of the combined model-based strategies for conopeptide classification, as well as the ConoDictor paper, are included in appendix 4 and appendix 5 respectively.

Finally, transcriptomic data annotation and comprehension are supported by a new integrated bioinformatics environment for transcriptome analysis named TATools. This platform links together a wide range of bioinformatics tools needed to analyse and interpret sequences generated by assembly programs. The detailed description of TATools is given in the second part of this manuscript.

Part 2:

Problems and Methodology

Chapter IV. Study methodology: needs for an improved analysis platform.

1- Overview of classical analysis workflow

As indicated previously, the final step of the transcriptomic approach is data analysis, which is intended to make sense of the sequences produced by assembly programs. In general, two of the following five components are included in the transcriptome analysis workflow (Cantacessi et al., 2010):

i. Assembly. Assembly is performed de novo in the case of venomous animals, since a fully sequenced model organism is not available for every clade that includes at least one venomous species.

ii. Similarity search. This is mostly based on BLAST (a local alignment method detailed in the next chapter and briefly presented in appendix 1). The goal is to select sequences of other organisms that resemble the newly sequenced ones. The function of new sequences is inferred from the description line of similar sequence entries.

iii. Functional prediction and annotation of gene products. In many cases, this is achieved using InterProScan (method described in appendix 1) or any other engine searching protein family profiles based on models of known domains. This hints at the activity and/or function of new sequences. It is assumed that the presence of the same domains will lead to the same activity.

iv. In silico subtraction. The subtraction approach compares expression levels and infers lineage relationship between the sequenced tissue and other tissues (Wilke et al., 2010). This is useful to detect and establish qualitative but not quantitative differences between or among samples.

v. Probabilistic functional networking of protein-encoding genes and drug target prediction. This consists in comparing InterPro domains and GO term mapping to Enzyme Commission (EC) numbers. This hints at the tissue activity in terms of biological pathways, protein-protein interaction, …

These analyses are undertaken in the majority of published transcriptome papers. Figure 6 summarizes the classical transcriptome analysis workflow.
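As an illustration of the similarity search component (item ii), the sketch below keeps the best hit per query from a BLAST report. It assumes that the search was exported in the standard 12-column tabular format, which is an assumption and not something stated in this chapter; the E-value cutoff, the function name `best_hits` and the identifiers in the usage note are hypothetical.

```python
import csv
import io
from typing import Dict

# Column names of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, dict]:
    """Return, for each query, the hit with the highest bit score under the cutoff."""
    best: Dict[str, dict] = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        # Skip blank lines, comment lines and incomplete rows.
        if not row or row[0].startswith("#") or len(row) < len(COLUMNS):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # not significant enough to support an annotation
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Called on a report containing, for instance, two hits for a contig named `contig1` and a weak hit for `contig2`, the function returns a dictionary with a single entry for `contig1`. The annotation of a new sequence would then be inferred from the description of the retained subject entry, which is exactly the step where expert validation is still required.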

Figure 6: Summary of classical transcriptome analysis operations.

2- Evaluation of the classical analysis workflow

The constant improvement of sequencing techniques challenges data analysis. Indeed, sequencers are producing ever more reads of better quality at impressively high throughput. In addition, in most classical data computing strategies, a translated transcriptome is screened independently, as many times as there are analysis types included in the workflow. This leads to an ever growing amount of data that needs to be manually cross-linked in order to highlight sequences of interest (as shown by the position of the expert in Figure 6). Moreover, low sequencing costs have created an upsurge of transcriptome data from a wider range of organisms. Expert analysis of results towards validation and cross-linking will therefore constitute a bottleneck, if it does not already.

2.1- Problems opened by the classical workflow

In the context of transcriptome-oriented drug discovery, the classical analysis workflow raises a series of problems that need to be addressed:

A - Time consuming analysis workflow. The manual submission of a transcriptome to each independent analysis, the determination of optimal values for each parameter of each analysis and the manual management of the generated result files quickly become a long and tedious series of operations.

B - Difficult selection of interesting sequences. Bioinformatics methods usually provide extended transcript annotations. These results require additional inspection to reliably point out sequences of interest that match criteria such as: occurrence of a conserved domain, membership of a given protein family, presence of an innovative cysteine motif, etc. In this context, expert intervention is critical and variable, as it depends on the research goal.

C - Tedious result cross-linking. Each analysis in the workflow generates a specifically formatted output. The management of these heterogeneous files requires expert intervention and is time consuming. Result cross-linking and cross-validation is one of the most important steps of data interpretation, since agreement between various independent tools increases confidence in the bioinformatic annotation.

D - Inefficient data visualization. Visual interfaces are needed since results are explored by biologists who are often not conversant with a wide range of special file formats and yet have to open and process results independently prior to a final merge. Exploration in a single, user-friendly environment is therefore of great interest.

2.2- T-ACE, classical transcriptome analysis and organization platform

The Transcriptome Analysis and Comparison Explorer, aka T-ACE (Philipp et al., 2012), is, as far as we know, the first and only publicly available tool designed for the organization and analysis of large sequencing datasets in a single environment. T-ACE is described as a complete tool with many useful features to analyze, organize and explore transcriptome data. However, this environment does not satisfactorily overcome the limitations presented in the previous section. Indeed, T-ACE tools are neither related to nor optimized for drug discovery. Each task in the pipeline, such as matching sequences against the NCBI non-redundant database, detecting family membership with InterProScan or mapping sequences to KEGG pathways, is time consuming.

T-ACE nevertheless offers a number of features that could be of interest for drug discovery. Since our methodology was developed in parallel with T-ACE to address the same issues, some features are common to both environments, such as pipelined sequence and domain similarity searches. Our methodology is described in detail in the next section; suffice it to say here that we focus on a model-based discovery/classification approach combined with automated cross-validation of results in order to specifically address key questions in drug discovery.

3- Methodology: a drug-discovery oriented analysis workflow

3.1- Problem A: Time consuming analysis workflow

The slowness of the classical analysis approach is mainly due to the multiple submissions of the whole transcriptome to various separate bioinformatics analyses. Another time consuming operation is the cross-validation of the results obtained.

Identified problem: The BLAST analysis appears to be the most critical step. We estimated that a standard BLAST search of a transcriptome containing 65,000 contigs against the whole NCBI database would take nearly a week on an 8-core personal computer.

Proposed solution: Instead of submitting the full transcriptome directly to BLAST against NCBI, we propose an intermediate clustering step and BLAST search against UniProtKB/SwissProt.

Implementation: Contigs obtained after the transcriptome assembly are clustered using CD-HIT (Li and Godzik, 2006; see also appendix 1 for a short description of CD-HIT) with a similarity threshold of 95%. Then, only cluster representatives (the longest contig of each cluster) are submitted to a BLAST search against UniProtKB/Swiss-Prot with an e-value of 10e-5. Only the best BLAST hit is considered. BLAST hit files are automatically parsed and GO annotations are extracted. The obtained BLAST annotation is transferred to the other transcripts of the cluster when they overlap with the region of the representative corresponding to the BLAST hit. Transcripts are annotated after in silico translation and the correct frame is chosen according to the BLAST result. The new similarity search strategy is summarized in Figure 7.

Figure 7: Simplified and efficient BLAST-based annotation workflow.

Contig clustering reduces the number of sequences to be sent to the BLAST search. A smaller and properly annotated database is used for the BLAST search. Special scripts are written to extract annotations from BLAST hits. This annotation is then transferred to the corresponding translated sequence of each cluster member whose representative has a hit.
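The representative selection and annotation transfer can be illustrated with a minimal sketch. It is written in Python for brevity, not in the Perl used for the in-house scripts, and does not reproduce them. The `Contig` fields, the function names and the example hit description are hypothetical. In particular, the sketch assumes that the region of the representative covered by each cluster member is already known from the clustering output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Contig:
    name: str
    length: int
    # Region of the representative covered by this contig (1-based, inclusive).
    # Hypothetical fields: assumed to be derived from the clustering output.
    rep_start: int
    rep_end: int
    annotation: Optional[str] = None


def pick_representative(cluster: List[Contig]) -> Contig:
    """Return the longest contig of the cluster."""
    return max(cluster, key=lambda c: c.length)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def transfer_annotation(cluster: List[Contig], hit_start: int, hit_end: int,
                        hit_description: str) -> Contig:
    """Annotate the representative, then every member overlapping the hit region."""
    rep = pick_representative(cluster)
    rep.annotation = hit_description
    for contig in cluster:
        if contig is rep:
            continue
        if overlaps(contig.rep_start, contig.rep_end, hit_start, hit_end):
            contig.annotation = hit_description
    return rep


# Toy cluster: c1 is the longest contig, c2 overlaps the hit region, c3 does not.
cluster = [
    Contig("c1", 900, 1, 900),
    Contig("c2", 400, 1, 400),
    Contig("c3", 300, 601, 900),
]
rep = transfer_annotation(cluster, 100, 350, "hypothetical best hit description")
```

Only `c2` inherits the annotation here, because `c3` covers a part of the representative that lies outside the BLAST hit.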

3.2- Problem B: Highlighting sequences of interest

Identified problem: The classical approach described previously is based on conserved domain detection using the InterPro database. In the context of drug candidate discovery, InterProScan proved to be a time consuming process that produces non-specific and even useless results, since the model database as a whole lacks specific venom-related models.

Proposed solution: To overcome these limitations, we developed specific hidden Markov models (HMMs) and position specific scoring matrices (PSSMs) to properly classify conopeptides. To this end, separate models were constructed for the signal sequence, the propeptide region and the mature peptide. We first tested the efficiency of each modelling approach separately and established that they complement each other for the classification of closely related conopeptide superfamilies.

In addition to the model-based approaches to detect innovative drug candidates in transcriptomes, we propose another strategy based on signal sequence detection. Venom peptides are secreted compounds and therefore include a signal peptide in their precursors.

Implementation: The methodology of model construction and validation, as well as the web tools that use these models to assign new conopeptides to their corresponding superfamilies, were published (Laht et al., 2011; Koua et al., 2012 and appendices).

For the strategy based on signal sequence detection, detecting a signal peptide and extending the sequence to the right, down to the first stop codon immediately downstream of the identified signal, may lead to the discovery of innovative sequences. The SignalP program was used for this identification. An in-house Perl script was written to perform the sequence extension in order to detect full precursors.

The whole transcriptome is submitted to model-based conopeptide identification/prediction and also to the signal peptide detection.
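The sequence extension step can be sketched as follows. This is an illustrative Python sketch, not the in-house Perl script. It assumes that the transcript has already been translated, that stop codons are written as `*`, and that the start position of the signal is supplied by the signal detection step. The function name and the toy sequence are hypothetical.

```python
def extend_to_stop(translated: str, signal_start: int, stop_symbol: str = "*") -> str:
    """Return the sequence from signal_start (0-based) up to, but not including,
    the first stop downstream, or up to the end of the transcript if none is found."""
    stop = translated.find(stop_symbol, signal_start)
    end = len(translated) if stop == -1 else stop
    return translated[signal_start:end]


# Made-up translated transcript; the signal is assumed to start at index 3.
toy = "AG*MKLTCVLIVAVLLLTACQ*RS"
pseudo_precursor = extend_to_stop(toy, 3)
```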

3.3- Problem C: Cross-validation of bioinformatics results

Identified problem: Cross-validation is tedious work, initially reserved for experienced analysts. The variety of result file formats as well as the amount of data to be manually processed make cross-validation a bottleneck in result interpretation.

Proposed solution: Our methodology proposes an automated cross-validation of results issued from BLAST searches, model-based classification/prediction and discovery as well as signal sequence search. For each transcript, annotations obtained from the different analyses are cross-linked and stored in a relational MySQL database. Cross-validation is now automatically achieved and accessible via a Venn diagram.

Implementation: Cross-validation supports the creation of a “Transcriptome map” (Figure 8) as it leads to the classification of transcripts into classes defined by the type of output generated for each considered transcript. External proteomics information can also be stored in the transcriptome database.

The transcriptome map makes it easy to focus on a transcript potentially representing a previously unidentified or new peptide. For instance, a precursor with a divergent mature peptide matched by a PSSM/HMM in the absence of a BLAST hit is more likely to be interesting for drug discovery than a precursor with only a BLAST hit. Conversely, transcripts returned by two or three searches are more reliable.
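The classification behind the transcriptome map can be illustrated with a short Python sketch. It assumes that each of the three automated analyses (BLAST, model matching, signal detection) yields a yes/no outcome per transcript, so that the classes correspond to the eight possible combinations, including the absence of any hit. The class labels and function names are hypothetical and are not those used by the platform.

```python
from collections import Counter
from typing import Dict, Tuple

ANALYSES = ("blast", "model", "signal")


def map_class(blast: bool, model: bool, signal: bool) -> str:
    """Name the class of a transcript after the analyses that returned it."""
    hits = [name for name, flag in zip(ANALYSES, (blast, model, signal)) if flag]
    return "+".join(hits) if hits else "none"


def transcriptome_map(results: Dict[str, Tuple[bool, bool, bool]]) -> Counter:
    """Count transcripts per class; results maps a transcript id to its three flags."""
    return Counter(map_class(*flags) for flags in results.values())


# Toy results for four transcripts.
results = {
    "t1": (True, True, True),     # returned by all three analyses
    "t2": (False, True, False),   # model match without BLAST hit
    "t3": (False, False, False),  # no output at all
    "t4": (False, True, False),
}
counts = transcriptome_map(results)
```

In this toy example the `model` class, which holds model matches without a BLAST hit, is the kind of region an analyst looking for novel peptides would inspect first.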

The underlying relational storage organisation comprises two groups of tables (Figure 9):
i. shared tables, which store data that can be accessed during the analysis and exploitation of all transcriptomes submitted to the environment. Shared tables include user management tables, bioinformatic model information and a local dump of the Gene Ontology annotation database.
ii. tables specific to each analysed transcriptome, which store analysis results. These tables are organised in a separate database for each transcriptome. The central element of the schema of a transcriptome is the table of translated transcripts. These translated transcripts are the ones that are analysed according to user needs. The other tables store the detailed results of each analysis and hold a foreign key to the translated transcript table.
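The central role of the translated transcript table can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module and an in-memory database purely for self-containment, whereas the platform itself relies on MySQL. All table names, column names and values are hypothetical placeholders and do not reproduce the schema of Figure 9.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request

conn.executescript("""
CREATE TABLE translated_transcript (
    id          INTEGER PRIMARY KEY,
    contig_name TEXT NOT NULL,
    frame       INTEGER NOT NULL,
    sequence    TEXT NOT NULL
);
-- Each result table points back to the central transcript table.
CREATE TABLE blast_hit (
    id            INTEGER PRIMARY KEY,
    transcript_id INTEGER NOT NULL REFERENCES translated_transcript(id),
    accession     TEXT,
    evalue        REAL
);
CREATE TABLE model_match (
    id            INTEGER PRIMARY KEY,
    transcript_id INTEGER NOT NULL REFERENCES translated_transcript(id),
    model_name    TEXT,
    match_start   INTEGER,
    match_end     INTEGER
);
""")

# Placeholder rows: one transcript and one hit attached to it.
conn.execute("INSERT INTO translated_transcript VALUES (1, 'contig_0001', 2, 'MKLTCV')")
conn.execute(
    "INSERT INTO blast_hit (transcript_id, accession, evalue) VALUES (1, 'PLACEHOLDER', 1e-20)"
)

# A result row that references an unknown transcript is rejected.
try:
    conn.execute(
        "INSERT INTO blast_hit (transcript_id, accession, evalue) VALUES (99, 'PLACEHOLDER', 1e-3)"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

hit_count = conn.execute("SELECT COUNT(*) FROM blast_hit").fetchone()[0]
```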

Figure 8: Automated cross-validation results are summarized in a "Transcriptome map".

BLAST, SignalP and model matching represent the three analyses directly proposed by the methodology. The results of these analyses are automatically cross-related and dispatched among eight classes. The underlying database can also manage proteomic results and cross-link them to the results of the three automated analyses. The transcriptome map therefore represents a global distribution of transcripts according to the results obtained for the combination of four complementary methods.

Figure 9: Relational database schema for the newly proposed analysis workflow.

A new specific database will be created for each analysed transcriptome.

4- Concluding remarks

The new methodology proposed for transcriptome data analysis consists of an automated workflow whose results are automatically cross-validated and stored in a dedicated relational database. This automated workflow is summarized in Figure 10. Compared to classical approaches (Figure 6), the proposed analysis method is improved both qualitatively and quantitatively. Analyses are selected for their relevance to the problem and their number is limited so as to concentrate on a manageable amount of results. Users' work is greatly facilitated by the focus on the most reliable results.

The solution to the visualization question (Problem D) resulted in the development of a fully integrated analysis platform described in the following two chapters.

Figure 10: Complete improved workflow for transcriptome analysis.

Chapter V. TATools implementation

Introduction

One of the most important goals that guided the TATools implementation was to facilitate the identification and extraction of specific protein family members and homologous sequences from a transcriptome. To this end, TATools includes the possibility of searching a transcriptome with specifically tuned profiles and provides organised results centred on detected families. Up to now, transcriptome analysis has been restricted to specialists as it involves the manual validation of outputs of BLAST or FASTA similarity searches. We propose an automated and integrated approach that combines transcript clustering, BLAST search, GO-based annotations, pattern matching with prior models (Hidden Markov Models, HMM and Position Specific Scoring Matrices, PSSM) and signal sequence detection. Each analyzed transcriptome along with its related analysis results is stored in a separate database. The transcriptome is then displayed as the result of the merge of the 3 pattern matching strategies (HMM, PSSM and signal). A single intuitive and interactive web-based interface has been developed to facilitate the exploration and exploitation of these combined results.

1- Platform use cases

Three hierarchical user categories have been defined: guest (default), annotator and administrator. Every user can submit a transcriptome, run the analysis, and visualise and exploit the corresponding results. In addition, annotators can add more models to the platform, write comments on available results and launch complementary analyses on the available transcriptomes. Administrators (super-users) are in charge of the platform management activities: granting rights and updating and managing databases. The use-case diagram provides details on these activities (Figure 11).

Figure 11: TATools use cases diagram.

2- TATools methods

The transcriptome analysis workflow adopted in TATools consists of a combination of (i) a BLAST-based similarity search, (ii) a model-based matching using specific PSSMs and HMMs and (iii) a signal sequence detection using SignalP (Figure 10). The database schema (Figure 9) as well as the “Transcriptome map” were previously described. TATools has been developed as a web-based interface and was mostly written in Perl. Various CPAN libraries were used. Among others, CGI and DBI were omnipresent, since they manage the web part of the project and the connection to the underlying MySQL databases, respectively. Specific new Perl objects were developed to manage transcriptome data. User interactivity was managed with Java, JavaScript and Ajax.

The platform is composed of two layers: an analysis workflow layer and a data exploitation layer.

Transcriptome cDNA files are submitted to the analysis workflow layer in FASTA format via a PHP-based user interface. The submission interface also allows the selection of models (PSSM and/or HMM) to be searched against the new transcriptome. The user can choose among available models or upload new ones. It is also possible to set BLAST search parameters as well as parameters related to signal detection in transcripts. The core script is the workflow manager: it makes system calls and relies on in-house object-oriented Perl classes for result parsing and data storage. The modular organisation of the workflow components makes it possible to launch all the analyses in one run or to run each step separately via an 'update interface'. The analysis layer of TATools was developed in Perl. BLAST results are parsed using the BioPerl package. Results are stored in the specific MySQL database using the Perl DBI interface.

The data exploitation layer provides viewers and various tools for data visualization, understanding and exploitation. Once a transcriptome is analysed, a web-based viewer allows results visualization, the key element being the “Transcriptome map”. Transcripts belonging to each match class are displayed in a table view by simply clicking on the corresponding region of the “Transcriptome map”. It is also possible to display transcripts with a BLAST hit sorted by associated GO terms and to view transcripts matching bioinformatics models sorted by model name or type. Furthermore, an entry viewer summarizes results and information related to each transcript (transcript sheet). The visualization layer is based on PHP/MySQL and Perl CGI scripts, with JavaScript and Ajax requests for interactivity. The contig assembly viewer proposed in the transcript sheet is a Java applet. Useful analysis tools are also available for data manipulation. They include a text search, a local BLAST search against all the transcriptomes already analysed on the platform, and an alignment tool with improved functionalities such as alignment colouring, trimming and export.

3- View and exploit results

The first overview of the results is the transcriptome map, a Venn diagram automatically annotated after grouping translated transcripts as described in section 3 and Figure 8. In addition, various tabulated views were prepared and will be described in the next chapter. TATools has been specifically optimized for protein family identification and extraction from transcriptomes. It is therefore possible to export sequences matched by models considered separately or in combination. The activity diagram (Figure 12) shows how to export sequences of interest to the user. In particular, it describes the internal operations underlying the user-friendly experience. The extraction feature has been used intensively for Conus transcriptome analysis. The flexibility of the extraction parameters makes it a very convenient tool. For example, if a protein family contains three different modelled domains, it is possible to extract, align and/or annotate transcripts where one, two or all three domains are present. A search tool is also available to query the database directly with a portion of an amino acid sequence or with a word to be searched in BLAST or GO annotations.

Figure 12: Sequence extraction activity diagram.

Concluding remarks

The TATools implementation follows common user requirements and the most common workflow described in the transcriptomics community. The bioinformatics analyses provided on the platform cover the most important steps described in chapter IV, section 3. A transcriptome analysis can be updated by running another BLAST search with a different BLAST database or any other parameter change, or by searching for matches using newly added profiles. All results relative to a transcriptome are stored in a dedicated database, and visualisation is based on prepared queries managed by Perl CGI and/or PHP/MySQL scripts. Multiple data exploitation tools have also been made available, such as an alignment viewer, a BLAST search viewer, a global search tool and a signal detection tool.

Chapter VI. Interfaces

Introduction

This chapter provides a user tour of the Transcriptome Analysis Platform. The main activities are presented together with the related interfaces.

1- Login page:

This page allows registered users to enter the platform and get access to their previous analysis. An interface is available where new users can register. When logged on, the user name is indicated. It is possible to log out at any moment.

2- TATools home page

The home page (Figure 13) displays a list of all transcriptomes available on the platform. For each transcriptome, a brief description is available, indicating the sequenced animal and tissue, the date of data deposition (creation date) and the date of the last update. A simple click on a transcriptome name loads the related data and available analysis results. The home page is also the place to submit a new transcriptome and/or to add new profile models.

Figure 13: TATools homepage.

A list of transcriptomes previously deposited on the platform is displayed on the homepage. One can simply click on a transcriptome DB name to load and explore the results related to this transcriptome.

2.1- Enter a new profile

This page allows uploading a model file (either PSSM or HMM) together with a short description indicating the modelled family or target (domain). The model type (PSSM or HMM) is automatically detected. A single file can contain multiple models, but they should all be of the same type (either PSSMs or HMMs). The file is parsed and the model(s) are saved individually under the indicated family or target. Submitted models are automatically available for transcriptome analysis.

2.2- Enter a new transcriptome

Any user of the platform can deposit a transcriptome. TATools requires a file of assembled contigs in FASTA format as well as a text file indicating the coverage of each contig present in the transcriptome file. Currently coverage files from 454-Illumina and Ion-Torrent are supported. Programmatic access has also been implemented to allow direct submission from the assembly computer to TATools. The cDNA file must be submitted along with additional information concerning the organism name, the analysed tissue, the read alignment file and the assembly report. The submission interface is illustrated in Figure 14. Submitted files are parsed and stored in dedicated databases. The new transcriptome is automatically available for analysis. One can even set analysis parameters at transcriptome submission time (Figure 15). When the new transcriptome is submitted, the home page list of available transcriptomes is updated with a new line (Figure 13). If an analysis has been requested, the transcriptome status is set to “Running”. When all requested analyses are completed, the status is updated to “Done”. A “Refresh” button allows users to manually update the home page. A log file is also generated and indicates the progress of the running analyses. The “Running” button displays the content of the progress report.

Figure 14: New transcriptome submission interface.

Figure 15: Interface for setting analysis parameters.

2.3- Run analysis

Any user can run the initial analysis on the new transcriptome at submission time. One has to select the analyses that should be run, set the search parameters in the case of BLAST and, if PSSM/HMM searches have to be performed, select from the available lists which models are to be searched against the transcriptome (Figure 15). Models are displayed under the category they were assigned to when submitted. When a model category is clicked, the category name is highlighted and the list of available models is displayed to allow their selection. When a category is hidden, items remain selected even if not visible. cDNA sequences are translated in silico and analyses are performed on the amino acid version of the transcriptome. Users with annotation rights (annotators) are able to add models to the platform and update the transcriptome analysis with different parameters (BLAST e-value, list of models, …). Because the BLAST search proved to be CPU and time consuming, a clustering step was introduced. Contigs are clustered using the CD-HIT program (Li and Godzik, 2006) with a similarity threshold set by the user (the threshold defaults to 95%). Only cluster representatives are sent to BLAST. Therefore, for other cluster members, BLAST and Gene Ontology annotations are inferred by similarity from the result obtained on the cluster representative.

3- Transcriptome viewer

When a transcriptome analysis is completed, a summary of the results can be obtained by loading the transcriptome viewer page. This is done by simply clicking the desired transcriptome name on the home page. The transcriptome viewer displays the global results obtained for the completed analysis. For BLAST search and model matches, an active link provides an overview of results sorted by GO annotations and by models. In addition, the 'Transcriptome map' indicates the number of translated contigs characterised in each of the performed analyses (Figure 16). The regions of the 'Transcriptome map' are clickable. From the global transcriptome viewer page, one can display different tabulated views of results limited to specific requirements.

Figure 16: General result page. Example from Conus adamsonii transcriptome analysis.

4- List viewers

Tabulated list views are produced for a set of translated transcripts: i. representing output in a given analysis, ii. belonging to a class of the 'Transcriptome map', iii. matching a search criterion.

4.1- Global results for BLAST, model match and signal detection

These lists display all translated transcripts output by an analysis. Each row of the resulting table describes one transcript: entry identifier, position in the transcriptome map (indicating the range of matches as explained in Chapter V section 1), the transcript sequence and, when applicable, a list of models that matched this transcript. For example, Figure 17 shows the display obtained when the 'Class 1+2+3' region is clicked on the Transcriptome map. Results can also be sorted by GO terms as indicated in Figure 18. In this case, matches are grouped by GO category and, for each GO term of the category, the number of matched sequences is indicated. Different actions are available from the list interface:
i. display a detailed description of a single transcript by clicking on its sequence,
ii. export a FASTA formatted list of selected transcripts,
iii. produce a multiple sequence alignment of selected transcripts,
iv. view a simple description of some selected transcripts.

Users of the second level (annotator) can add annotations to selected sequences. In addition, when only the number of grouped sequences is displayed, it is possible to generate a tabulated list by clicking on the “View list” button (Figure 17).

Figure 17: Viewer for transcripts belonging to a class of the transcriptome map.

Figure 18: Viewer for transcripts associated to a GO term.

4.2- Simplified list view

These lists are produced as a search result or when clicking on the “View list” button of the Global result page after a set of translated transcripts is selected. The simplified table displays the entry identifier, the sequence coverage and a yes/no status for each bioinformatic analysis. When a transcript was output by an analysis, the number of matched models or the number of signal-like fragments is displayed between brackets (Figure 19). As before, sequences can be selected and exported in FASTA format or aligned. In addition, selected sequences can be annotated with results of proteomic studies (Add Proteomic results) or with free text (Add comment).

Figure 19: TATools list viewer with annotation interface.

5- Specialized viewers

5.1- Compiled results of a translated transcript

The transcript viewer gives a detailed description of all data related to a translated transcript (Figure 20). The displayed information consists of:
i. the cluster the sequence belongs to,
ii. the underlying read coverage and assembly quality,
iii. a detailed view of the best BLAST hits, aligned and related to GO annotations,
iv. a summary of model searches indicating, for each matched model, the corresponding sequence area. A global toxin precursor is estimated as the longest matched region that compiles all matched fragments, an extension to the left up to a methionine immediately upstream of the leftmost matched model, and an extension to the right down to the first stop codon immediately downstream of the rightmost matched model,
v. a list of signal peptide-like sequences. For each signal, a pseudo-precursor is generated by extending the sequence to the right until a stop or the end of the transcript is reached,
vi. a list of user annotations when available.

In the transcript viewer, the cDNA contig sequence as well as the translated sequence can directly be sent to BLAST.
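The global toxin precursor estimation described in item iv can be sketched in a few lines of Python. This is an illustrative sketch, not the platform code. It assumes a translated sequence in which stops are written as `*`, model matches given as 0-based half-open intervals, and a leftward search that does not cross an upstream stop. The last point is an added assumption that is not stated in the text, and the function name and toy sequence are hypothetical.

```python
from typing import List, Tuple


def estimate_precursor(translated: str, matches: List[Tuple[int, int]],
                       stop: str = "*") -> Tuple[int, int, str]:
    """Return (start, end, sequence) of the estimated precursor.

    matches holds the (start, end) intervals of all model matches on the transcript.
    """
    left = min(start for start, _ in matches)
    right = max(end for _, end in matches)

    # Extend left to the closest methionine upstream of the leftmost match.
    # Assumption: the search stays within the current open reading frame.
    frame_start = translated.rfind(stop, 0, left) + 1
    met = translated.rfind("M", frame_start, left)
    if met != -1:
        left = met

    # Extend right to the first stop downstream of the rightmost match.
    next_stop = translated.find(stop, right)
    right = len(translated) if next_stop == -1 else next_stop

    return left, right, translated[left:right]


# Made-up translated transcript with two model matches.
toy = "RS*AMKLTAVLQEDRCCPGKNCR*GG"
start, end, precursor = estimate_precursor(toy, [(6, 10), (14, 21)])
```

When no methionine is found upstream of the leftmost match, the sketch simply keeps the left boundary of that match.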

Figure 20: TATools transcript viewer.

5.2- TATools contig viewer

The contig viewer is a Java applet adapted from the Artemis open source project (Rutherford et al., 2000). It shows the read alignment underlying a single contig and allows checking the assembly of a contig of interest. The viewer (Figure 21) provides the six-frame translation of the viewed contig. One can also select a region in the nucleotide sequence: the corresponding amino acid sequence is highlighted and, conversely, if an amino acid sequence is selected, the corresponding nucleotide region is highlighted. It is therefore easy to check for assembly problems on a contig of interest. The contig viewer is particularly convenient for detecting silent mutations as well as assembly errors.
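For readers unfamiliar with six-frame translation, the following Python sketch shows the principle using the standard genetic code. It is independent of the Artemis-based applet; the function names and the frame labels (`+1` to `+3`, `-1` to `-3`) are choices made for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons are written as X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the three forward and three reverse translations of a contig."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGAAATGA")
```

With this layout, amino acid `k` (0-based) of forward frame `+f` corresponds to nucleotides `3k + f - 1` to `3k + f + 1` of the contig, which is the kind of mapping needed to highlight one sequence from a selection made in the other.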

5.3- Cluster summary

The cluster viewer page (Figure 22) summarizes results of the clustering step for cDNA sequences. For the active transcript, the viewer indicates the number of cDNA sequences belonging to the same cluster and allows a FASTA export of these cDNA sequences. The “Generate cluster FASTA file” button creates the FASTA file and becomes “Download cluster FASTA file”, for saving the generated file. The cluster representative sequence is provided along with its percentage of similarity with the active contig sequence. Finally, cDNA sequences can be sent to a BLAST search directly from the cluster viewer.

5.4- Clipboard

Finally, a buffer manager was implemented to allow manipulation of the union and intersection of sub-category matches when analysing results for GO-based annotations and model-based matches. The clipboard allows the export and analysis of sequences that simultaneously match two or more models. This is very convenient for further manipulation or annotation of sequences sharing the same match conditions, and also for taking advantage of the combined predictive ability of models. Figure 23 shows how to export all full-length precursor sequences of the A superfamily by combining the matches of the signal-, propeptide- and mature-based models.

Figure 21: TATools contig viewer displays reads used to construct a given contig.

Figure 22: TATools contig cluster viewer.

Figure 23: TATools clipboard helps to manage user selection.

Sequences matched by the selected models are retrieved from the database. All the distinct sequences (in this case 1090) can be exported, annotated or aligned. However, it is also possible to manage only the subset of sequences matching all the selected models (here 202). This allows model-based predictions to be combined.
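The two selections offered by the clipboard correspond to a set union and a set intersection. A minimal Python sketch follows, with hypothetical model names and transcript identifiers; it does not reflect how the platform queries its database.

```python
from typing import Dict, Set, Tuple


def clipboard(matches_by_model: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (union, intersection) of the transcripts matched by the selected models."""
    sets = [set(ids) for ids in matches_by_model.values()]
    union = set().union(*sets)
    intersection = set.intersection(*sets) if sets else set()
    return union, intersection


# Hypothetical matches for three models of one superfamily.
matches = {
    "A_signal": {"t1", "t2", "t3"},
    "A_propeptide": {"t2", "t3"},
    "A_mature": {"t2", "t3", "t4"},
}
any_model, all_models = clipboard(matches)
```

Here `any_model` plays the role of the distinct sequences matched by at least one selected model, and `all_models` the role of the subset matched by every selected model.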

6- Anticipating biologists' needs

Additional features have been integrated in TATools to facilitate transcriptome exploration work.

6.1- Enriched BLAST viewer

Since only the best BLAST hit from UniProtKB/Swiss-Prot is provided in the contig viewer, the user can submit the translated transcript as well as the initial contig nucleotide sequence to a local BLAST search against the NCBI non-redundant database and/or other transcriptomes available on the platform. The latter is especially useful to detect homologous sequences from non-public transcriptomes of related or distant organisms. TATools provides an enriched BLAST result viewer. The most important BLAST hit information is displayed in a tabulated view. It is possible to select a set of hits and align or export them for further analysis. Different export formats are available: comma separated values (csv), FASTA and Excel.

6.2- Pseudo-precursor detection

TATools also integrates a local version of SignalP that allows a graphical interpretation of the signal detection. We have developed a pseudo-precursor detection procedure. When a signal is detected for a transcript, a pseudo-precursor is defined as the extension of the signal peptide until the next stop codon is reached. In theory, in the case of conopeptides, the signal sequence should be followed by a propeptide region and a mature peptide. Our pseudo-precursor detection script therefore tries to reconstruct a full-length precursor starting with the potential signal sequence detected by SignalP.

6.3- Multiple sequence alignment manager

In addition to the possibility of producing a multiple sequence alignment from tabulated results obtained by querying databases of the platform, it is also possible to align any set of sequences in FASTA format, either pasted into the alignment tool or uploaded from a file. This functionality is particularly useful when sequences of a previously exported FASTA file have to be aligned and also when external sequences have to be aligned with sequences of the platform. The resulting alignment is computed by MAFFT version 6.847b. A brief presentation of MAFFT is given in Appendix 1. A special parser was added for analysing alignments. The alignment manager allows:
i. the computation of a new alignment using a subset of the current alignment,
ii. the extraction of a given range of columns from the current alignment and subsequent realignment,
iii. the exclusion of stop-containing sequences from the alignment.
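The three operations of the alignment manager can be sketched on an alignment held as a dictionary of gapped sequences. This Python sketch only prepares the sequences that would then be passed to the aligner; it does not call MAFFT. It assumes `-` as the gap character, `*` as the stop symbol and 1-based inclusive column numbers, and all function names are hypothetical.

```python
from typing import Dict, Iterable


def subset(alignment: Dict[str, str], names: Iterable[str]) -> Dict[str, str]:
    """Ungapped sequences of the selected entries, ready for a new alignment."""
    return {name: alignment[name].replace("-", "") for name in names}


def extract_columns(alignment: Dict[str, str], first: int, last: int) -> Dict[str, str]:
    """Ungapped content of columns first..last, ready for realignment."""
    return {name: row[first - 1:last].replace("-", "") for name, row in alignment.items()}


def drop_stop_sequences(alignment: Dict[str, str], stop: str = "*") -> Dict[str, str]:
    """Keep only the sequences that contain no stop symbol."""
    return {name: row for name, row in alignment.items() if stop not in row}


# Made-up alignment of three sequences.
alignment = {
    "seq1": "MKLT-CVV*",
    "seq2": "MKLTACVVL",
    "seq3": "M-LTACVIL",
}
without_stops = drop_stop_sequences(alignment)
columns_2_to_5 = extract_columns(alignment, 2, 5)
selected = subset(alignment, ["seq1", "seq3"])
```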

6.4- Additional tools to assist drug discovery

A number of bioinformatics tools were developed to satisfy specific needs expressed by users during result exploration and exploitation. They range from simple scripts that translate DNA sequences to more elaborate programs that assist protein deconvolution. Some of these tools apply to protein sequences and others are specific to mass spectrometry data analysis, such as peak detection, identification of protein modifications, comparison of mass lists, etc. For each one, a CGI/PHP web-based interface was developed according to users’ needs. Maintaining these tools in-house ensures that they run properly and are available when needed, limiting dependency on external servers. An even more important reason for maintaining them internally is to prevent potentially confidential data from being exposed on external web servers.

Part 3:

Main applications, results and discussion

Chapter VII. Transcriptome analysis: a step forward in venomics.

1- First case study: alpha conotoxins from Conus adamsonii

The first case study is the identification of alpha conotoxins from the transcriptome of a single specimen of Conus adamsonii. Conus adamsonii is a rare deep-sea cone snail. Up to now, little was known about its venom. We present here the first results of the venom gland transcriptome analysis. This study demonstrates the value of the TATools platform for the transcriptomic study of a single venom gland in terms of result consistency.

1.1- Importance of alpha conotoxin

Inhibitors of nicotinic acetylcholine receptors (nAChRs) are detected in all Conus spp. venoms. At least one conopeptide of this kind has been found in every venom investigated (McIntosh et al., 1999; Dutertre et al., 2007). Overall, seven different families of conotoxins are known to target nAChRs: alpha-conotoxins, alpha-C-conotoxins, alpha-D-conotoxins, psi-conotoxins, alpha-S-conotoxins, alpha-L-conotoxins, and alpha-J-conotoxins (Lewis et al., 2012). The alpha-conotoxins are selective antagonists of the muscle-type (3/5) and neuronal-type (4/7, 4/4, and 4/3) nAChRs and probably represent the largest group of characterized Conus spp. peptides (McIntosh et al., 1999). Several alpha-conotoxins appear to be potential candidates for the development of treatments for neuropathic pain because of their demonstrated analgesic effects (Vincler and McIntosh, 2007; Callaghan et al., 2008; Klimis et al., 2011). Alpha-conotoxins are relatively small peptides. The recent development of a cyclized analog of Vc1.1 (Clark et al., 2010) revealed that the peptide could relieve signs of neuropathic pain when administered orally.

1.2- Presentation of Conus adamsonii

Conus adamsonii is a predatory sea gastropod of the genus Conus belonging to the family Conidae (Textilia clade). It is a rare species living in French Polynesia. Its rarity is probably explained by its inner-reef habitat. C. adamsonii is a fish-hunting cone snail.

A single specimen was found during the collection expedition organized for the CONCO project, the CONPOL-I campaign at Nuku-Hiva (Marquesas archipelago) in November 2007. This specimen was dissected and the venom gland was sequenced by 454 pyrosequencing. Proteomic analysis was conducted in parallel on the extracted venom using mass spectrometry.

We reveal here the first overview of conopeptide content of a Conus adamsonii venom gland transcriptome.

1.3- Transcriptome map of Conus adamsonii

The venom gland pyrosequencing yielded 213560 reads of 216 base pairs average length. De novo MIRA-based read assembly led to 23095 contigs. MIRA is described in Appendix 1. From the assembled contigs, 10157 clusters were obtained using CD-HIT (Li and Godzik, 2006; also described in Appendix 1). The translated transcripts were then submitted to BLAST (e-value: 10^-4), SignalP (HMM and NN) and conopeptide model search (HMM and PSSM built for each superfamily). The intersections of these three analyses were used to organise the results as follows, according to the detailed explanations given in Chapter IV sections 3.2 and 3.3 and Chapter VI sections 3 and 4 (Figure 16):

i. transcripts that did not match anything in any search: Class 0 = 127220;
ii. transcripts with matches for only one search:
- BLAST only: Class 1 = 830;
- PSSM/HMM only: Class 2 = 1719;
- SIGNAL only: Class 3 = 7673;
iii. transcripts with matches for two searches:
- BLAST and PSSM/HMM: Class 1+2 = 375;
- BLAST and SIGNAL: Class 1+3 = 64;
- PSSM/HMM and SIGNAL: Class 2+3 = 216;
iv. transcripts with matches for all three searches: Class 1+2+3 = 473.

The total number of matches for each analysis is:
- for BLAST: 1742 hits (Classes 1; 1+2; 1+3; 1+2+3),
- for HMM/PSSM matching: 2783 hits (Classes 2; 1+2; 2+3; 1+2+3),
- for SignalP search: 8426 hits (Classes 3; 1+3; 2+3; 1+2+3).

From the 138570 translated sequences, 11350 individual sequences have been matched and automatically annotated using GO terms associated with BLAST hits and conopeptide superfamilies indicated in the description of matched models.
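The class labels used in this map follow directly from set membership in the three result sets. The Python sketch below illustrates the principle on hypothetical transcript identifiers; the function name and the toy data are assumptions made for illustration, and the sketch does not reproduce the TATools implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def classify(all_ids: Iterable[str], blast: Set[str], model: Set[str],
             signal: Set[str]) -> Dict[str, str]:
    """Label each transcript with the searches that matched it (1, 2, 3) or '0'."""
    labels = {}
    for seq_id in all_ids:
        parts = []
        if seq_id in blast:
            parts.append("1")  # BLAST
        if seq_id in model:
            parts.append("2")  # PSSM/HMM
        if seq_id in signal:
            parts.append("3")  # SignalP
        labels[seq_id] = "+".join(parts) if parts else "0"
    return labels


# Hypothetical identifiers and hit sets.
ids = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = classify(ids, blast={"t1", "t2", "t6"}, model={"t2", "t3", "t6"},
                  signal={"t4", "t6"})
class_sizes = Counter(labels.values())
print(labels)
print(class_sizes)
```

With this labelling, the total number of hits of one method is the sum of the sizes of all classes whose label contains its number, which is how the per-method totals relate to the class counts.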

1.4- Alpha conopeptides from Conus adamsonii

Among the 2783 hits of the conopeptide model searches, 1306 (46.93%) were retrieved by the models associated with the A superfamily. As detailed in Chapter III section 3, conopeptides contain three key regions, illustrated in Figure 5. Among the 1306 sequences, 525 matched the signal peptide model, 631 were detected by the propeptide region model and 772 matches were reported by the mature region model.

Even if library preparation for sequencing tends to cut cDNA sequences randomly, we present results involving full-length precursors of A-superfamily conopeptides to increase specialists' confidence in the results. From the Conus adamsonii transcriptome, 298 peptide sequences could be considered as full-length precursors with signal sequence, propeptide region and mature peptide. According to the cysteine framework models (cf. Appendix 2), 280 full-length precursors were identified as framework 4 and 18 sequences belonged to framework 1. Regarding the mature peptides, the 18 framework 1 sequences comprised 5 unique isoforms and the 280 remaining sequences represented 19 unique isoforms.

Altogether, 24 unique isoforms of A superfamily conopeptides were identified from a single specimen of Conus adamsonii. Isoform identification required sequence alignments and manual checking of these alignments. The detection of these unique isoforms took about one hour. Model search, BLAST matching and signal detection were performed overnight on the transcriptome in a fully automatic mode.

2- Second case study: identification of analogues of XEP-018

The second case study concerns the identification of analogues of XEP-018. The model-based approach led to the identification of new analogues in the venom gland transcriptome. During this analysis, TATools proved useful because it limits human intervention essentially to result validation. The methodology and the main results of the Conus consors transcriptome analysis are detailed in a submitted manuscript (Appendix 6). The main results are presented hereafter.

2.1- Conopeptide distribution in Conus consors venom gland transcriptome

The venom gland pyrosequencing yielded 213561 reads of 218 base pairs average length. De novo MIRA-based read assembly led to 65536 contigs, from which 49086 clusters were obtained using CD-HIT (Li and Godzik, 2006) with a similarity threshold set to 0.95. The translated transcripts were then submitted to BLAST (e-value: 10^-4), SignalP (HMM and NN) and model search (HMMs and PSSMs built for conopeptide families and superfamilies). The distribution of contigs after the analysis step is provided in Figure 24. The 65536 contigs from the Conus consors venom gland transcriptome were translated in silico into 393216 protein sequences and searched with the 96 models built for the 16 known superfamilies. This led to the identification of 5210 different hits: 1403 matches were obtained for A superfamily models, 1650 for M superfamily models, 1356 for O1 superfamily models, 593 for T superfamily models, and 74 and 19 for P superfamily and S superfamily models respectively. The models from the other superfamilies returned 115 matches. Figure 25 summarizes the matches obtained with the model-based strategy.

2.2- Presentation of XEP-018

XEP-018 is also known as CnIIIC. It is a μ-conotoxin identified in and isolated from the Conus consors venom (Benoit et al., 2008). The μ-conopeptide family is defined by its ability to block voltage-gated sodium channels (VGSCs), a property that can be used for the development of myorelaxants and analgesics. μ-CnIIIC potently blocks VGSCs in skeletal muscle and nerve, and hence is applicable to myorelaxation. Its new atypical pharmacological profile suggests some common structural features between VGSCs and nAChR channels (Favreau et al., 2012).

2.3- XEP-018 analogues detection in venom gland transcriptome

Transcriptome analysis was carried out on TATools to detect analogues and/or variants of CnIIIC. The translated transcriptome was indexed using the formatdb tool from the NCBI BLAST package and searched by BLASTP. In addition, the transcriptome was searched using PSSMs built for the M superfamily.

A BLASTP search (e-value: 10^-4) of the CnIIIC sequence against the C. consors transcriptome indicated perfect matches with 11 contigs. Ten of the matching contigs completely covered the initial sequence and one contig was detected as containing a sequencing error. At this point, no variants or analogues were identified.

Another BLASTP search against UniProtKB/Swiss-Prot indicated that 12 mu-conopeptides were publicly available.

The PSSM representing the conopeptide M-superfamily mature peptide matched 630 distinct contigs. Regions of these contigs matching this specific model were isolated and aligned using the pfsearch command. Duplicate sequences, incomplete mature peptides and entries with sequencing errors were manually removed from the alignment. This led to a set of 29 contigs. The removed sequences were mostly duplicates, either of the original XEP-018 or of the retained variants. Out of the 29 remaining contigs, only 10 were identified as full-length precursors with signal, propeptide and mature peptide. Out of these 10 sequences, 5 complete precursors were considered as analogues of CnIIIC (Figure 26). The other 5 contigs had weak coverage and are likely to constitute new M-type conopeptides.
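In this study, duplicates were removed manually. As an illustration only, the grouping of identical matched regions could be prepared automatically before the manual check, along the lines of the following Python sketch. The function name and the sequences are hypothetical, and the sketch makes no attempt to detect incomplete peptides or sequencing errors.

```python
from typing import Dict, List


def collapse_duplicates(regions: Dict[str, str]) -> Dict[str, List[str]]:
    """Group contig identifiers by identical matched region (case-insensitive)."""
    groups: Dict[str, List[str]] = {}
    for contig_id, seq in regions.items():
        groups.setdefault(seq.upper(), []).append(contig_id)
    return groups


# Hypothetical matched regions extracted from three contigs.
regions = {"c1": "ACCDEFCGHC", "c2": "accdefcghc", "c3": "ACCDKFCGHC"}
groups = collapse_duplicates(regions)
for seq, contigs in groups.items():
    print(seq, contigs)
```

Keeping the list of contigs behind each unique sequence preserves the information on how many contigs support a given variant.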

Of the five new analogues determined from the transcriptome analysis, two were actually identified at protein level by mass spectrometry analysis of the milked venom and a third one was also identified during the proteomic analysis of both milked and dissected venom.

The alignment of the new analogues detected for XEP-018 is given in Figure 26.

Figure 24: Conus consors Transcriptome map.

Figure 25: Distribution of matches obtained for conopeptide superfamilies by searching the Conus consors transcriptome with conopeptide pHMMs and PSSMs.

Figure 26: New isoforms of mu-conotoxin identified from the Conus consors venom gland transcriptome.

3- Concluding remarks

Test users of TATools were satisfied with the platform for several reasons:

− It is web-based and made of familiar HTML pages with tables and cgi/php forms for entering and submitting information.

− Navigation was judged smooth, which is expected given the php-mysql interaction between the user interfaces and the underlying databases, as well as the system calls to server-side scripts.

One of the most important features of TATools is the transcriptome map that provides a complete overview of matches obtained with a complete analysis. This map guides users and ensures structured data exploitation. Data representation in a Venn diagram facilitates subsequent annotations. In the same manner, the simplified interactive annotation system greatly improves data exploitation work.

Adding specific models to the platform provides a very useful personalization aspect of the transcriptome analysis and improves the discovery rate of interesting sequences. Finally the signal sequence search sheds light on a part of the transcriptome that is usually ignored.

Chapter VIII. Discussion and Prospects

1- Too many sequences in the bin

A point rarely discussed in published transcriptome analyses is the relatively large number of sequences that remain uncharacterised. In the two examples discussed in this study, the proportion of unmatched sequences reaches 90.89% and 91.81% for Conus consors and Conus adamsonii respectively. These uncharacterised sequences represent (i) potentially new sequences with innovative cysteine frameworks and/or activities and (ii) noisy data corresponding to transcripts that are too short, sequencing errors or non-existent sequences resulting from the systematic translation of the transcriptome.

The presence of short transcripts and sequencing errors leading to unusable sequences can only be addressed during sample preparation and sequencing. This constitutes a residual noise that the bioinformatic analysis must deal with. On the other hand, the sequence analysis strategy itself produces additional noise. Since sequence searches and annotations are based on protein sequences, the full transcriptome has to be translated into protein in all six reading frames. If we consider that only one reading frame represents an active and useful protein sequence, the five (5) other translations of each contig end up in the set of uncharacterised sequences. Sequence filters are obviously needed. A first filter could apply to contigs matched at the nucleotide level. This is possible with BLAST, but not with PSSM or HMM searches or signal detection. Since the translation into protein cannot be avoided, another straightforward solution would be to remove the five alternative translations of a contig that was matched in one of the reading frames. Such a filter is usually too drastic because of possible RNA editing and frequent frame shifts. It appeared more informative to keep all translated transcripts, since the first match is not always the best and the manual check of the various matches of the same transcript still allows the identification of interesting sequences.
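The origin of this translation noise can be made concrete with a minimal Python sketch of a six-frame translation using the standard genetic code. It is an illustration written for this discussion, not the translation script used by TATools; the function names and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> list:
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


# Hypothetical 9-nucleotide contig.
frames = six_frames("ATGAAATGA")
print(frames)
```

Six translations per contig are consistent with the counts reported in Chapter VII: 23095 contigs give 138570 translated sequences for Conus adamsonii, and 65536 contigs give 393216 for Conus consors.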

2- More analysis approaches, more matches, more confidence

As described previously, the combination of analyses increases confidence in observed matches. High-confidence sequences are more likely to be detected by every method. Conversely, a sequence not matched by any method can be regarded as useless and ignored. The situation of truly innovative, never-seen-before sequences may be considered as part of the worst cases. We therefore consider that the tool combination proposed in TATools is optimal to detect and identify known facts as well as novel ones. For instance, the addition of signal sequence detection led to a reduction of uncharacterised sequences. However, the pitfall when combining methods is the increasing error rate. In particular, SignalP shows higher detection error rates. Sequences of class 3 (only a signal detected) were therefore given less weight than those of class 3+ {1,2,4}. Because of the strong predictive abilities of mathematical models, model-based matches were considered more reliable. Note that KEGG mapping is not implemented since it is poorly documented for venom gland transcriptomic studies. KEGG-based annotations will be added to the next version.

3- Family distribution in transcriptomes

Some peptide families appeared clearly over-represented in transcriptomes. The distribution of the number of transcripts by family showed strong biases, for example towards the A superfamily in the Conus transcriptomes. The same was observed regarding ribosomal RNA sequences. Since the transcriptome is a snapshot of the tissue in a given state at a given time, this situation cannot be avoided. At most, the cDNA library can be normalised before sequencing to avoid over-representation of a sequence type. From the bioinformatics analysis point of view, TATools provides an automatic grouping of sequences by family, by GO annotation and by matched models. This spares users the tedious cleaning of highly repeated data. In addition, the alignment manager was implemented with additional features to remove redundancy, trim alignments, exclude sequences having a stop codon and produce a separate alignment for a subset of selected sequences.

4- Bringing out novelty

TATools undeniably allows the rapid and reliable identification of new analogues. Various novelty levels have been made available through the Transcriptome map, which divides transcripts according to the combination of methods by which they were selected. Even if manual validation cannot be discarded, TATools does facilitate the discovery of novel sequences by providing a range of viewers and exploratory tools. However, part of the novelty still remains hidden in the noise of uncharacterised sequences. In addition to signal peptide detection, a cysteine motif detection method was implemented and will be added to the standard workflow.
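As an indication of what a cysteine motif detection can look like, the Python sketch below reduces a mature peptide to its cysteine pattern. It is an assumption made for illustration and not the implementation mentioned above; the function name and the example peptide are hypothetical.

```python
import re


def cysteine_pattern(mature: str) -> str:
    """Collapse each run of non-cysteine residues between cysteines into '-'."""
    seq = mature.upper()
    first, last = seq.find("C"), seq.rfind("C")
    if first == -1:
        return ""  # no cysteine at all
    # Residues before the first and after the last cysteine are ignored.
    core = seq[first:last + 1]
    return re.sub(r"[^C]+", "-", core)


# Hypothetical mature peptide with four cysteines.
pattern = cysteine_pattern("GCCSDPRCAWRC")
print(pattern)
```

The resulting pattern can then be compared with the cysteine framework models of Appendix 2; in the usual conotoxin nomenclature, for example, the CC-C-C arrangement corresponds to framework I.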

5- Comparative transcriptomics

The current version of TATools is mainly oriented towards the deep exploration and exploitation of individual transcriptomes. Many transcriptomes can be analysed and managed on the same platform, but comparative possibilities are limited at this stage. Even if it is already possible to run a BLAST search or a text-based sequence search on two or more transcriptomes on TATools, the transversal analysis of various transcriptomes remains to be implemented. For now, only deep horizontal or parallel explorations are possible. The main outstanding question pertains to data visualisation, since the databases of the different submitted transcriptomes are already accessible on the platform. The addition of transcriptome comparison constitutes the main improvement planned for the next version of TATools.

6- Results annotation

Practically all results displayed on TATools can be annotated by the end user. Annotated entries are automatically assigned to a special category on the 'Transcriptome map'. We are convinced that despite the efficiency of automated annotation, human input will remain necessary. In the current version, TATools is limited to managing manual annotations. We plan to add a parser that will automatically read and create annotations from proteomic data obtained with tandem mass spectrometry (MS/MS) analysis tools such as Phenyx or other identification platforms. In particular, Phenyx is a bioinformatic tool that allows MS/MS spectra analysis for protein identification in a given transcriptome. Currently, users need to validate Phenyx matching results and manually annotate the corresponding entries on TATools. We will work to automate the latter step.

7- CONCO project extensions

Two venomics projects, namely VenX and SpidX, have recently received funding. SpidX is a 12-month pilot study initiating the first project on spider venomics. Among other goals, SpidX aims to sequence the venom glands of five model spiders and to analyse these transcriptomes in order to pave the way for a broad range of future fundamental and industrial investigations. TATools is already positioned as a key platform in this project. The VenX project, on the other hand, plans to develop and combine innovative technologies in transcriptomics, bioinformatics and peptide synthesis to overcome the technical challenges that currently limit further growth. In particular, TATools will be used to process the transcriptome data in order to generate in silico collections (Virtual Venom Libraries) containing tens of thousands of putative venom-derived miniproteins.

Conclusion

Transcriptome analysis has become a common practice to understand biological mechanisms and activities. In the field of venomics, the increasing number of recently published papers including transcriptomic studies reveals the importance of this approach as a complement to proteomic studies. A transcriptome analysis platform like TATools therefore comes in handy to assist the exploration and exploitation of forthcoming transcriptomes. TATools proposes an innovative modular approach to manage, analyse and exploit transcriptomic data. The innovative aspect of this platform is not related to the introduction of new algorithms or new implementations of existing tools: TATools is mostly a successful integration of existing tools and probabilistic model-based approaches that have individually demonstrated their ability to reveal novelty. We tried to combine the respective strengths of common tools and made a special effort to overcome, or at least reduce, the accumulation of errors from the various tools used. TATools offers researchers direct access to results that are internally cross-validated. Results from the different identification approaches are compiled to annotate each submitted sequence. This greatly reduces the tedious work of visually validating large amounts of data. Transcriptome analysis is therefore making a step towards democratization. In addition, since an effort has been made to match and even anticipate biologists' needs, the platform offers user-friendly interfaces for data visualisation. User forms for data annotation are directly accessible from the results display page. The whole platform is therefore relatively easy to use and its adoption by the research community should not be a problem. The validation of the analysis methodology as well as its practical application to Conus sp. transcriptomes have allowed and/or facilitated the identification of interesting peptides from various transcriptomes.
The next version of TATools will include support for transcriptome comparison and transversal family detection and annotation. In addition, new cysteine framework detection and automated annotation from proteomic platforms are to be implemented next.

Bibliography

Adams J., Jones A., Lewis R.J. 2009. Remarkable inter- and intra-species complexity of conotoxins revealed by LC/MS. Peptides 30: 1222–1227.
Ahmadian A., Ehn M., Hober S. 2006. Pyrosequencing: history, biochemistry and future. Clin Chim Acta 363(1-2):83-94.
Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. 1990. A basic local alignment search tool. J Mol Biol. 215:403-410.
Altschul S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 219:555-565.
Altschul S.F. 1993. A protein alignment scoring system sensitive at all evolutionary distances. J Mol Evol. 36:290-300.
Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402.
Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., Harris M.A., Hill D.P., Issel-Tarver L., Kasarskis A., Lewis S., Matese J.C., Richardson J.E., Ringwald M., Rubin G.M., Sherlock G. 2000. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
Au K.F., Jiang H., Lin L., Xing Y. & Wong, W.H. 2010. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578.
Benoit E., Favreau P., Schlumberger S., Cordova M.A., Tytgat J., Stocklin R. et al., 2008. A new mu-conotoxin from Conus consors that atypically targets sodium channels in unmyelinated and myelinated nerve fibers. Abstract Book, 16th European Section Meeting of the International Society on Toxinology (2008).
Biass D., Dutertre S., Gerbault A., Menou J.L., Offord R., Favreau P., Stöcklin R. 2009. Comparative proteomic study of the venom of the piscivorous cone snail Conus consors. J Proteomics 72:210–218.
Braslavsky I., Hebert B., Kartalov E., Quake S.R. 2003. Sequence information can be obtained from single DNA molecules. Proc. Natl. Acad. Sci. U. S. A. 100:3960–3964.
Briggs J.C. 1974. Marine Zoogeography. McGraw-Hill, New York.
Briggs J.C. 1995. Global Biogeography. Elsevier, Amsterdam.
Butler J. et al., 2008. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820.
Callaghan B., Haythornthwaite A., Berecki G., Clark R.J., Craik D.J., Adams D.J. 2008. Analgesic α-conotoxins Vc1.1 and Rg1A inhibit N-type calcium channels in rat sensory neurons via GABAB receptor activation. J Neurosci 28:10943–10951.
Calvete J.J., Juarez P., Sanz L. 2007. Snake venomics. Strategy and applications. J. Mass Spectrom. 42(11):1405–1414.
Calvete J.J., Sanz L., Angulo Y., Lomonte B., Gutiérrez J.M. 2009. Venoms, venomics, antivenomics. FEBS Lett. 583(11):1736-43.

Cantacessi C., Jex A.R., Hall R.S., Young N.D., Campbell B.E., Joachim A., Nolan M.J., Abubucker S., Sternberg P.W., Ranganathan S., Mitreva M., Gasser R.B. 2010. A practical bioinformatic workflow system for large data sets generated by next-generation sequencing. Nucleic Acids Research, 38(17):e171.
Casals F., Idaghdour Y., Hussin J., Awadalla P. 2012. Next-generation sequencing approaches for genetic mapping of complex diseases, J. Neuroimmunol., doi:10.1016/j.jneuroim.2011.12.017.
Chanda S.K. and Caldwell J.S. 2003. Fulfilling the promise: drug discovery in the postgenomic era. Drug Discov. Today 8, 168–174.
Chevreux B., Pfisterer T., Drescher B., Driesel A.J., Müller W.E., Wetter T., Suhai S. 2004. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 14(6):1147-59.
Clark R.J., Jensen J., Nevin S.T., Callaghan B.P., Adams D.J., Craik D.J. 2010. The engineering of an orally active conotoxin for the treatment of neuropathic pain. Angew Chem Int Ed Engl 49:6545–6548.
Cocquet J., Chong, A., Zhang, G. and Veitia, R. A. 2006. Reverse transcriptase template switching and false alternative transcripts. Genomics 88, 127–131.
Collins J.F., Coulson A.F.W., Lyall A. 1988. The significance of protein sequence similarities. Comput Appl Biosci 4:67-71.
Compeau P.E.C., Pevzner P.A., Tesler G. 2011. How to apply de Bruijn graphs to genome assembly? Nature Biotechnology 29:987–991.
Conesa A., et al., 2005. BLAST2GO: a universal tool for annotation, visualization and analysis in functional genomics research, Bioinformatics, 21, 3674-3676.
Conticello S.G., Gilad Y., Avidan N., Ben-Asher E., Levy Z., Fainzilber M. 2001. Mechanisms for evolving hypervariability: the case of conopeptides. Mol Biol Evol. 18(2):120-31.
Craik D.J., Adams D.J., 2007. Chemical modification of conotoxins to improve stability and activity. ACS Chem. Biol. 2: 457–468.
Davis J., Jones A., Lewis R.J. 2009. Remarkable inter- and intraspecies complexity of conotoxins revealed by LC/MS. Peptides 30:1222–1227.
De Graaf D.C., Aerts M., Danneels E., Devreese B. 2009. Bee, wasp and ant venomics pave the way for a component-resolved diagnosis of sting allergy. J Proteomics 72:145-154.
Drmanac R., Sparks A.B., Callow M.J., Halpern A.L., Burns N.L., Kermani B.G., Carnevali P., Nazarenko I., Nilsen G.B., Yeung G., Dahl F., Fernandez A., Staker B., Pant K.P., Baccash J., Borcherding A.P., Brownley A., Cedeno R., Chen L., Chernikoff D., Cheung A., Chirita R., Curson B., Ebert J.C., Hacker C.R., Hartlage R., Hauser B., Huang S., Jiang Y., Karpinchyk V., Koenig M., Kong C., Landers T., Le C., Liu J., McBride C.E., Morenzoni M., Morey R.E., Mutch K., Perazich H., Perry K., Peters B.A., Peterson J., Pethiyagoda C.L., Pothuraju K., Richter C., Rosenbaum A.M., Roy S., Shafto J., Sharanhovich U., Shannon K.W., Sheppy C.G., Sun M., Thakuria J.V., Tran A., Vu D., Zaranek A.W., Wu X., Drmanac S., Oliphant A.R., Banyai W.C., Martin B., Ballinger D.G., Church G.M., Reid C.A. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327: 78–81.
Durban J., Juarez P., Angulo Y., Lomonte B., Flores-Diaz M., Alape-Giron A., Sasa M., Sanz L., Gutierrez J. M., Dopazo J., Conesa A., Calvete J. J. 2011. Profiling the venom gland transcriptomes of Costa Rican snakes by 454 pyrosequencing. BMC Genomics 12: 259.
Dutertre S., Ulens C., Büttner R., Fish A., van Elk R., Kendel Y., Hopping G., Alewood P.F., Schroeder C., Nicke A., et al. 2007. AChBP-targeted α-conotoxin correlates distinct binding orientations with nAChR subtype selectivity. EMBO J 26:3858–3867.

Dutertre S., Biass D., Stöcklin R., Favreau P. 2010. Dramatic intraspecimen variations within the injected venom of Conus consors: an unsuspected contribution to venom diversity. Toxicon 55:1453–1462.
Dutton J.L., Craik, D.J. 2001. alpha-Conotoxins: nicotinic acetylcholine receptor antagonists as pharmacological tools and potential drug leads. Curr. Med. Chem. 8: 327–344.
Drews J. 2003. Strategic trends in the drug industry. Drug Discov. Today 8, 411–420.
Eid J., Fehr A., Gray J., Luong K., Lyle J., Otto G., Peluso P., Rank D., Baybayan P., Bettman B., Bibillo A., Bjornson K., Chaudhuri B., Christians F., Cicero R., Clark S., Dalal R., Dewinter A., Dixon J., Foquet M., Gaertner A., Hardenbol P., Heiner C., Hester K., Holden D., Kearns G., Kong X., Kuse R., Lacroix Y., Lin S., Lundquist P., Ma C., Marks P., Maxham M., Murphy D., Park I., Pham T., Phillips M., Roy J., Sebra R., Shen G., Sorenson J., Tomaney A., Travers K., Trulson M., Vieceli J., Wegener J., Wu D., Yang A., Zaccarin D., Zhao P., Zhong F., Korlach J., Turner S. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323: 133–138.
Ekman S.P. 1953. Zoogeography of the Sea. Sidgwick and Jackson, London.
Emrich S.J., Barbazuk W.B., Li L., Schnable P.S. 2007. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 17: 69–73.
Endean R., Parrish G., Gyr P. 1974. Pharmacology of the venom of Conus geographus. Toxicon 12: 131.
Escoubas P., Sollod B., King G.F. 2006. Venom landscapes: mining the complexity of spider venoms via a combined cDNA and mass spectrometric approach. Toxicon 47: 650.
Escoubas P. 2006a. Mass spectrometry in toxinology: a 21st-century technology for the study of biopolymers from venoms. Toxicon 47(6):609-13.
Escoubas P., Quinton L., Nicholson G.M. 2008. Venomics: unravelling the complexity of animal venoms with mass spectrometry. J Mass Spectrom. 43(3):279-95.
Escoubas P., King G.F. 2009. Venomics as a drug discovery platform. Expert Rev Proteomics 6:221–224.
Eugster P.J., Biass D., Guillarme D., Favreau P., Stöcklin R., Wolfender J.L. 2012. Peak capacity optimisation for high resolution peptide profiling in complex mixtures by liquid chromatography coupled to time-of-flight mass spectrometry: Application to the Conus consors cone snail venom. J Chromatogr A. 2012 May 14. PMID: 22658136.
Ewing B., Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194.
Ewing B., Hillier L., Wendl M.C., Green P., 1998a. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.
Favreau P., Menin L., Michalet S., Perret F., Cheneval O., Stöcklin M., Bulet P., Stöcklin R. 2006. Mass spectrometry strategies for venom mapping and peptide sequencing from crude venoms: case applications with single arthropod specimen. Toxicon 47:676-87.
Favreau P., Cheneval O., Menin L., Michalet S., Gaertner H., Principaud F., Thai R., Ménez A., Bulet P., Stöcklin R. 2007. The venom of the snake genus Atheris contains a new class of peptides with clusters of histidine and glycine residues. Rapid Commun Mass Spectrom. 21(3):406-12.
Favreau P., Stöcklin R. 2009. Marine snail venoms: use and trends in receptor and channel neuropharmacology. Current Opinion in Pharmacology.

Favreau P., Benoit E., Hocking H.G., Carlier L., D'hoedt D., Leipold E., Markgraf R., Schlumberger S., Córdova M.A., Gaertner H., Paolini-Bertrand M., Hartley O., Tytgat J., Heinemann S.H., Bertrand D., Boelens R., Stöcklin R., Molgó J. 2012. Pharmacological characterization of a novel μ-conopeptide, CnIIIC, indicates potent and preferential inhibition of sodium channel subtypes (Na(V) 1.2/1.4) and reveals unusual activity on neuronal nicotinic acetylcholine receptors. Br J Pharmacol. 2012. doi: 10.1111/j.1476-5381.2012.01837.x.
Finn R.D., Mistry J., Tate J., Coggill P., Heger A., Pollington J.E., Gavin O.L., Gunasekaran P., Ceric G., Forslund K., Holm L., Sonnhammer E.L., Eddy S.R., Bateman A. 2010. The Pfam protein families database. Nucleic Acids Res. 38(Database issue):D211-22.
Fox J.W., Ma L., Nelson K., Sherman N.E., Serrano S.M. 2006. Comparison of indirect and direct approaches using ion-trap and Fourier transform ion cyclotron resonance mass spectrometry for exploring viperid venom proteomes. Toxicon 47(6):700-14.
Froy O., Sagiv T., Poreh M., Urbach D., Zilberberg N., Gurevitz M. 1999. Dynamic diversification from a putative common ancestor of scorpion toxins affecting sodium, potassium, and chloride channels. J. Mol. Evol. 48:187–196.
Fry B.G., Wuster W., Ryan Ramjan S.F., Jackson T., Martelli P., Kini R.M. 2003. Analysis of Colubroidea snake venoms by liquid chromatography with mass spectrometry: evolutionary and toxinological implications. Rapid Communications in Mass Spectrometry 17: 2047.
Fry B.G., Scheib H., van der Weerd L., Young B., McNaughtan J., Ramjan S.F., Vidal N., Poelmann R.E., Norman J.A. 2008. Evolution of an arsenal: structural and functional diversification of the venom system in the advanced snakes (Caenophidia). Mol. Cell Proteomics, 7: 215–246.
Fry B.G., Roelants K., Champagne D.E., Scheib H., Tyndall J.D., King G.F., Nevalainen T.J., Norman J.A., Lewis R.J., Norton R.S., Renjifo C., de la Vega R.C. 2009. The toxicogenomic multiverse: convergent recruitment of proteins into animal venoms. Annu Rev Genomics Hum Genet. 10:483-511.
Gish W. and States D.J. 1993. Identification of protein coding regions by database similarity search. Nat. Genet. 3: 266–272.
Gotoh O. 1990. Optimal sequence alignment allowing for long gaps. Bull. Math. Biol. 52: 359–373.
Gotoh O. 2000. Homology-based gene structure prediction: Simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics 16: 190–202.
Grabherr M.G. et al. 2011. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotech. 29, 644–652.
Gray W.R., Luque A., Olivera B.M., Barrett J., Cruz L.J. 1981. Peptide toxins from Conus geographus venom. J. Biol. Chem. 256: 4734–4740.
Gray W.R., Olivera B.M., Cruz L.J. 1988. Peptide toxins from venomous Conus snails. Annu. Rev. Biochem. 57: 665–700.
Guo ZR. 2009. [Strategy of molecular drug design: dual-target drug design]. Yao Xue Xue Bao. 2009 Mar;44(3):209-18.
Gutiérrez J.M., Higashi H.G., Wen F.H., Burnouf T. 2007. Strengthening antivenom production in Central and South American public laboratories: report of a workshop. Toxicon, 49: 30–35.
Guttman M. et al. 2010. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotech. 28, 503–510.
Halai R. and Craik D.J. 2009. Conotoxins: natural product drug leads. Nat. Prod. Rep., 26: 526–536.

93 Harris T.D., Buzby P.R., Babcock H., Beer E., Bowers J., Braslavsky I., CauseyM., Colonell J., Dimeo J., Efcavitch J.W., Giladi E., Gill J., Healy J., JaroszM., Lapen D., Moulton K., Quake S.R., Steinmann K., Thayer E., Tyurina A.,Ward R.,Weiss H., Xie Z. 2008. Single-molecule DNA sequencing of a viral genome. Science 320: 106–109. Harvey A.L. 1995. From venoms to toxins to drugs. Chem Ind 22:914–916. Harvey A.L., Bradley K.N., Cochran S.A., Rowan E.G., Pratt J.A., Quillfeldt J.A. and Jerusalinsky D.A. 1998. What can toxins tell us for drug discovery? Toxicon 36: 1635–1640. Hedges D.J., Guettouche T., Yang S., Bademci G., Diaz A., Andersen A., Hulme W.F., Linker S., Mehta A., Edwards Y.J., Beecham G.W., Martin E.R., Pericak-Vance M.A., Zuchner S., Vance J.M., Gilbert J.R. 2011. Comparison of three targeted enrichment strategies on the SOLiD sequencing platform.PLoS One;6(4):e18595. Hisada M., Konno K., Itagaki Y., Naoki H., Nakajima T. 2000. Advantages of using nested collision induced dissociation/post-source decay with matrix-assisted laser desorption/ionization time-of- flight mass spectrometry: sequencing of novel peptides from wasp venom. Rapid Commun. Mass Spectrom.14(19):1828–1834. Huang X. and Madan A. 1999. CAP3: A DNA sequence assembly program. Genome Research, 9, 868 – 877. Hurko O. 2012. Target-based drug discovery, genetic diseases, and biologics. Neurochem Int. 2012 Jan 27. [Epub ahead of print] Hyman E.D. 1988. A new method of sequencing DNA. Anal Biochem, 174(2):423-36. Janes R.W., 2005. alpha-Conotoxins as selective probes for nicotinic acetylcholine receptor subclasses. Curr. Opin. Pharmacol. 5 :280–292. Jimenez E.C., Olivera B.M., Teichert R.W. 2007. αC-conotoxin PrXA: a new family of nicotinic acetylcholine receptor antagonists. Biochemistry 46:8717-8724. Johnson M.S, Overington J.P. 1993. A structural basis for sequence comparisons. An evaluation of scoring methodologies. J Mol Biol. 233:716-738. 
Kaas Q., Westermann J.-C., Halai R., Wang C.K.L., Craik D.J. 2008. ConoServer, a database for conopeptide sequences and structures. Bioinformatics, 24: 445–446. Kaas Q., Westermann J.-C.,Craik D.J. 2010. Conopeptide characterization and classifications: An analysis using ConoServer. Toxicon 55: 1491–1509. Kaas Q., Yu R., Jin A.-H. Dutertre S., Craik D.J. 2011. ConoServer: updated content, knowledge, and discovery tools in the conopeptide database. Nucleic Acids Research, 2011: 1–6. Kanehisa M. and Goto S. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27-30. Karplus K., Barrett C. and Hughey R. 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14: 846–856. Kauferstein S., Porth C., Kendel Y., Wunder C., Nicke A., Kordis D., Favreau P., Koua D., Stöcklin R., Mebs D. 2011. Venomic study on cone snails from South Africa. Toxicon 57: 28–34. Kent W. J. 2002. BLAT - the BLAST-like alignment tool. Genome Res. 12, 656–664. King G.F., Escoubas P., Nicholson G.M. 2008a. Peptide toxins that selectively target insect NaV and CaV channels. Channels 2:100-16. King G.F., Gentz M.C., Escoubas P., Nicholson G.M. 2008b. A rational nomenclature for naming peptide toxins from spiders and other venomous animals. Toxicon 52:264-76.

94 King G.F. 2011. Venoms as a platform for human drugs: translating toxins into therapeutics. Expert Opin. Biol. Ther. 11(11):1469-1484. Kinsella M., Harismendy O., Nakano M., Frazer K.A. & Bafna V. 2011. Sensitive gene fusion detection using ambiguously mapping RNA-seq read pairs. Bioinformatics 27, 1068–1075. Klimis H., Adams D.J., Callaghan B., Nevin S., Alewood P.F., Vaughan C.W., Mozar C.A., Christie M.J. 2011. A novel mechanism of inhibition of high-voltage activated calcium channels by -conotoxins contributes to relief of nerve injury-induced neuropathic pain. Pain 152:259–266. Kloft C.; Poggesi I. 2010. Current and future directions of pharmacokinetic and pharmacokinetic- pharmacodynamic modelling and simulation: population approach group in Europe 19th annual meeting. Expert Opin Drug Metab Toxicol. 2010 Dec;6(12):1599-604. Epub 2010 Oct 24. Koh D.C.I., Armugam A. and Jeyaseelan K. 2006. Snake venom components and their applications in biomedicine. Cell Mol. Life Sci. 63: 3030–3041. Kohn A.J., Saunders P.R., Wiener S. 1960. Preliminary studies on the venom of the marine snail Conus. Ann N Y Acad Sci. 90:706-25. Kohn A.J. 1983. Feeding biology of gastropods. , 5 : 1–63. Koski L., et al. 2005. AutoFACT: An Automatic Functional Annotation and Classification Tool, BMC Bioinformatics, 6, 151. Koua D., Cerutti L., Falquet L., Sigrist C.J.A., Theiler G., Hulo N. and Dunand C. 2009. PeroxiBase: a database with new tools for peroxidase family classification. Nucleic Acids Res. 37(Database issue): D261–D266. Kumar S. and Blaxter M. 2010. Comparing de novo assemblers for 454 transcriptome data. BMC genomics, 11, 571. Laht S., Koua D., Kaplinski L., Lisacek F., Stöcklin R., Remm M. 2012. Identification and classification of conopeptides using profile Hidden Markov Models. Biochim Biophys Acta. 1824(3):488-92. Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., Wootton J.C. 1993. 
Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131):208- 14. Le Gall F., Favreau P., Richard G., Letourneux Y., Molgó J. 1999. The strategy used by some piscivorous cone snails to capture their prey: the effects of their venoms on vertebrates and on isolated neuromuscular preparations. Toxicon. 37(7):985-98. Levene M.J., Korlach J., Turner S.W., Foquet M., Craighead H.G., Webb W.W. 2003. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299: 682–686. Lewis R.J., Garcia M.L. 2003. Therapeutic potential of venom peptides. Nat Rev Drug Discov. 2(10):790-802. Lewis R.J. 2009. Conotoxins: molecular and therapeutic targets. Prog.Mol. Subcell. Biol. 46: 45– 65. Lewis R.J., Dutertre S., Vetter I., Christie M.J. 2012. Conus venom Peptide pharmacology. Pharmacol Rev. 64(2):259-98. Li W. and Godzik A. 2006. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22:1658-1659. Liang S. 2008. Proteome and peptidome profiling of spider venoms. Expert Rev Proteomics 5:731- 46.

95 Lin H., Li Q.Z. 2007. Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochem. Biophys. Res. Commun., 354: 548– 551. Livett B.G., Gayler K.R., Khalil Z. 2004. Drugs from the sea: conopeptides as potential therapeutics. Curr. Med. Chem. 11: 1715–1723. Lluisma A. O., Milash B. A., Moore B., Olivera B. M., Bandyopadhyay P. K. 2012. Novel venom peptides from the cone snail Conus pulicarius discovered through next-generation sequencing of its venom duct transcriptome. Mar. Genomics 5: 43-51. Loughnan M.L., Nicke A., Lawrence A., Lewis R.J. 2009. Novel aD-conopeptides and their precursors identified by cDNA cloning define the D-conotoxin superfamily, Biochemistry, 48: 3717–3729. Lunter G., Goodson M. 2011. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21:936–939. Ma Y., He Y., Zhao R., Wu Y., Li W., Cao Z. 2012. Extreme diversity of scorpion venom peptides and proteins revealed by transcriptomic analysis: Implication for proteome evolution of scorpion venom arsenal. J. Proteomics. 75 (5): 1563-1576. Mardis E.R. 2008. Next-Generation DNA Sequencing Methods. Annu. Rev. Genom. Human Genet. 9:387-402. Martin J., Bruno V.M., Fang Z., Meng X., Blow M., Zhang T., Sherlock G., Snyder M., Wang Z. 2010. Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads. BMC Genomics. 11:663. Martin J.A. and Wang Z. 2011. Next-generation transcriptome assembly. Nat Rev Genet., 671-82. doi: 10.1038/nrg3068. Maxam A.M. and Gilbert W. February 1977. A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74 (2): 560–4. McIntosh J.M., Santos A.D. and Olivera B.M. 1999. Conus peptides targeted to specific nicotinic acetylcholine receptor subtypes. Annu Rev Biochem 68:59–88. McIntosh J.M., Corpuz G.O., Layer R.T., Garrett J.E.,Wagstaff J.D., Bulaj G., Vyazovkina A., Yoshikami D., Cruz L.J., Olivera B.M. 2000. 
Isolation and characterization of a novel Conus peptide with apparent antinociceptive activity. J. Biol. Chem. 275: 32391–32397. McIntosh J.M., Jones R.M. 2001. Cone venom--from accidental stings to deliberate injection. Toxicon. 39(10):1447-51. Mebs D., 2002. Venomous and Poisonous Animals. Medpharm,Stuttgart. Menez A., Servent D., Gasparini S. 2002. The sites by which animal toxins bind their targets involve two components: a clue for selectivity, evolution and design proteins. A. Ménez (Ed.), Perspectives in Molecular Toxinology, Wiley, Chichester. 175–202. Menez A., Gillet D., Grishin E. 2005. Toxins: threats and benefits. Gillet D., Johannes L. (Eds.). Recent Research Developments on Toxins from Bacteria and Other Organisms. Research Signpost, Trivandrum, India. Menez A., Stöcklin R., Mebs D. 2006. “Venomics” or: the venomous systems genome project. Toxicon, 47: 255–259 Metzker M.L. 2009. Sequencing in real time. Nat. Biotechnol. 27 150–151. Miljanich G.P. 2004. Ziconotide: neuronal calcium channel blocker for treating severe chronic pain. Curr. Med. Chem. 11: 3029–3040.

96 Miller J.R., Koren S., Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95:315–327. Mondal S., Bhavna R., Babu M.R., Ramakumar S. 2006. Pseudo amino acid composition and multi- class support vector machines approach for conotoxin superfamily classification. J. Theor. Biol., 243: 252–260. Needleman S.B., Wunsch C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53. Nicole R. 2011. Torrents of sequence. Nature Methods (8): 44 doi:10.1038/nmeth.f.330. Ning Z., Cox A.J., and Mullikin J.C. 2001. SSAHA: A fast search method for large DNA databases. Genome Res. 11: 1725–1729. Ohno M., Ménez R., Ogawa T., Danse J.M., Shimohigashi Y., Fromen C., Ducancel F., Zinn-Justin S., Le Du M.H., Boulain J.C., Tamiya T., Ménez A. 1998. Molecular evolution of snake toxins: is the functional diversity of snake toxins associated with a mechanism of accelerated evolution? Prog. Nucleic Acid Res. Mol. Biol. 59: 307–364. Olivera B.M., Gray W.R., Zeikus R., McIntosh J.M., Varga J., Rivier J., de Santos V., Cruz L.J. 1985. Peptide neurotoxins from fish-hunting cone snails. Science 230: 1338–1343. Olivera B.M., Rivier J., Scott J.K., Hillyard D.R., Cruz L.J. 1991. Conotoxins. J Biol Chem. 266(33):22067-70. Olivera B.M., 1997. E.E. Just Lecture 1996. Conus venom peptides, receptor and ion channel targets, and drug design: 50 million years of neuropharmacology. Mol. Biol. Cell. 8: 2101–2109. Olivera B.M., Cruz L.J. 2001. Conotoxins, in retrospect. Toxicon 39: 7–14. Olivera B.M. 2002. Conus venom peptides: reflections from the biology of clades and species. Annu. Rev. Ecol. Syst. 33: 25–47. Olivera B.M. 2006. Conus peptides: biodiversity-based discovery and exogenomics. J Biol Chem 281:31173–31177. Olivera B.M., Quik M., Vincler M., McIntosh J.M. 2008. Subtype-selective conopeptides targeted to nicotinic receptors: concerted discovery and biomedical applications. 
Channels 2: 143–152. Parkinson J. et al. 2004. PartiGene—constructing partial genomes, Bioinformatics, 20, 1398-1404. Pearson W.R., Lipman D.J. 1988. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85(8):2444-8. Pearson W.R. and Miller W. 1992. Dynamic programming algorithms for biological sequence comparison. Methods Enzymol. 210:575-601. Pearson W.R. 1995. Comparison of methods for searching protein sequence databases. Protein Sci. 4(6):1145-60. Perkel J. 2011. Making Contact with Sequencing's Fourth Generation. BioTechniques, 2(50):93–95. Pevzner P. A., Tang H. and Waterman M.S. 2001. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753. Philipp E.E.R., Kraemer L., Mountfort D., M. Schilhabel; Schreiber S., Rosenstiel P. 2012.The Transcriptome Analysis and Comparison Explorer - T-ACE: a platform-independent, graphical tool to process large RNAseq data sets of non-model organisms. Bioinformatics 2012; doi: 10.1093/bioinformatics/bts056 .

97 Pimenta A.M., Stöcklin R., Favreau P., Bougis P.E., Martin-Eauclaire M.F. 2001. Moving pieces in a proteomic puzzle: mass fingerprinting of toxic fractions from the venom of Tityus serrulatus (Scorpiones, Buthidae). Rapid Communications in Mass Spectrometry 15: 1562. Pimenta A.M., De Lima M.E. 2005. Small peptides, big world: biotechnological potential in neglected bioactive peptides from arthropod venoms. J Pept Sci 11:670-6. Prosdocimi F., Bittencourt D., da Silva F.R., Kirst M., Motta P.C., Rech E.L. 2011. Spinning gland transcriptomics from two main clades of spiders (order: Araneae)--insights on their molecular, anatomical and behavioral evolution. PLoS. One. 6 (6): e21634. Puillandre N., Koua D., Favreau P. Olivera B.M., Stöcklin R. 2012.Molecular Phylogeny, Classification and Evolution of Conopeptides.J Mol Evol. 2012 Jul 4. PMID: 22760645 Quinton L., LeCaer J.P., Vinh J., Gilles N., Chamot-Rooke J. 2006. Fourier transform mass spectrometry: a powerful tool for toxin analysis. Toxicon 47: 715. Rigoutsos I., Floratos A., Parida L., Gao Y., Platt D. 2000. The emergence of pattern discovery techniques in computational biology.Metab Eng. 2(3):159-77. Robertson G. et al. 2010. De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909– 912. Ronaghi M., Uhlén M. and Nyrén P. 1998. A Sequencing Method Based on Real-Time Pyrophosphate. Science 17 July 1998: 281 (5375), 363-365. [DOI:10.1126/science.281.5375.363] . Rothberg J.M., Hinz W., Rearick T.M., Schultz J., Mileski W., Davey M., Leamon J.H., Johnson K., Milgrew M.J., Edwards M., Hoon J., Simons J.F., Marran D., Myers J.W., Davidson J.F., Branting A., Nobile J.R., Puc B.P., Light D., Clark T.A., Huber M., Branciforte J.T., Stoner I.B., Cawley S.E., Lyons M., Fu Y., Homer N., Sedova M., Miao X., Reed B., Sabina J., Feierstein E., Schorn M., Alanjary M., Dimalanta E., Dressman D., Kasinskas R., Sokolsky T., Fidanza J.A., Namsaraev E., McKernan K.J., Williams A., Roth G.T., Bustillo J. 2011. 
An integrated semiconductor device enabling non-optical genome sequencing. Nature 475(7356):348-52. doi: 10.1038/nature10242. Rutherford K., Parkhill J., Crook J., Horsnell T., Rice P., Rajandream M.A. and Barrell B., 2000. Artemis: sequence visualization and annotation. Bioinformatics (Oxford, England) 16(10):944-5. Sams-Dodd F. 2005. Target-based drug discovery: is something wrong?Drug Discovery Today, Volume 10, Issue 2, 15 January 2005, Pages 139–147 Sanger F., Coulson A.R. May 1975. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94 (3): 441–8. doi:10.1016/0022-2836(75)90213-2. PMID 1100841. Sanger F., Nicklen S., Coulson A.R. December 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74 (12): 5463–7. Schadt E.E., Turner S., Kasarskis A. 2010. A window into third-generation sequencing. Hum. Mol. Genet. 19: R227–R240. Schmid R. and Blaxter M.L. 2008. annot8r: rapid assignment of GO, EC and KEGG annotations, BMC Bioinformatics. 2008 Apr 09, 9, 180. Schroeder F.C., Taggi A.E., Gronquist M., Malik R.U., Grant J.B., Eisner T., Meinwald J. 2008. NMR-spectroscopic screening of spider venom reveals sulfated nucleosides as major components for the brown recluse and related species. Proc Natl Acad Sci USA 105:14283-7. Sharpe I.A., Gehrmann J., Loughnan M.L., Thomas L., Adams D.A., Atkins A., Palant E., Craik D.J., Adams D.J., Alewood P.F., Lewis R.J. 2001. Two new classes of conopeptides inhibit the alpha1-adrenoceptor and noradrenaline transporter. Nat.Neurosci. 4: 902–907.

98 Shendure J., Porreca G.J., Reppas N.B., Lin X., McCutcheon J.P., Rosenbaum A.M., Wang M.D., Zhang K., Mitra R.D., Church G.M. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732. Shu N., Zhou T., Hovmöller S. 2008. Prediction of zinc-binding sites in proteins from sequence. Bioinformatics 24(6):775-82. Sigrist C.J., Cerutti L., de Castro E., Langendijk-Genevaux P.S., Bulliard V., Bairoch A., Hulo N. 2010. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38 (Database issue):D161-6. Simpson J.T. et al. 2009. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123. Smith T.F., Waterman M.S. 1981. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7. Soares M.R., Oliveira-Carvalho A.L., Wermelinger L.S., Zingali R.B., Ho P.L., Junqueira-de- Azevedo I.L., Diniz M.R. 2005. Identification of novel bradykinin-potentiating peptides and C-type natriuretic peptide from Lachesis muta venom. Toxicon 46(1):31-8. Sollod B.L., Wilson D., Zhaxybayeva O., Gogarten J.P., Drinkwater R., King G.F. 2005. Were arachnids the first to use combinatorial peptide libraries? Peptides 2005;26:131-9. States D.J. and Gish W. 1994. Combined use of sequence similarity and codon bias for coding region identification. J. Comput. Biol. 1: 39–50. Sudarslal S., Singaravadivelan G., Ramasamy P., Ananda K., Sarma S.P., Sikdar S.K., Krishnan K.S., Balaram P. 2004. A novel 13 residue acyclic peptide from the marine snail, Conus monile, targets potassium channels. Biochem Biophys Res Commun. 317(3):682-8. Surget-Groba Y. and Montoya-Burgos J.I. 2010. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 20, 1432–1440. Tedford H.W., Sollod B.L., Maggio F., King G.F. 2004. Australian funnel-web spiders: master insecticide chemists. Toxicon 43:601-18. Terlau H., Shon K.J., Grilley M., Stocker M., Stühmer W., Olivera B.M. 
1996. Strategy for rapid immobilization of prey by a fish-hunting marine snail. Nature 381: 148–151. Terlau H., Olivera B.M., 2004. Conus venoms: a rich source of novel ion channel-targeted peptides. Physiol. Rev. 84: 41–68. Terrat Y., Biass D., Dutertre S., Favreau P., Remm M., Stöcklin R., Piquemal D., Ducancel F. 2012. High-resolution picture of a venom gland transcriptome: case study with the marine snail Conus consors. Toxicon. 59(1):34-46. Trapnell C., Salzberg S.L. 2009a. How to map billions of short reads onto genomes. Nat. Biotechnol. 27:455–457. Trapnell C., Pachter L. and Salzberg S. L. 2009b. TopHat: discovering splice junctions with RNA- seq. Bioinformatics 25: 1105–1111. Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5. Twede V.D., Miljanich G., Olivera B.M., Bulaj G., 2009. Neuroprotective and cardioprotective conopeptides: an emerging class of drug leads. Curr. Opin. Drug Discov. Devel. 12: 231–239.

99 Ueberheide B.M., Fenyö D., Alewood P.F., Chait B.T. 2009. Rapid sensitive analysis of cysteine rich peptide venom components. Proc. Natl Acad. Sci. USA 106(17): 6910–6915.Van den Haak et al. (2004) Industry Success Rates 2004. CMR Report 04-234R Vaiyapuri S., Wagstaff S.C., Harrison R.A., Gibbins J.M., Hutchinson E.G. 2011. Evolutionary analysis of novel serine proteases in the venom gland transcriptome of Bitis gabonica rhinoceros. PLoS. One. 6 (6): e21532. Valouev A., Ichikawa J., Tonthat T., Stuart J., Ranade S., Peckham H., Zeng K., Malek J.A., Costa G., McKernan K., Sidow A., Fire A., Johnson S.M. July 2008. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res. 18 (7): 1051–63. doi:10.1101/gr.076463.108. PMC 2493394. PMID 18477713. Vassilevski A.A., Kozlov S.A., Grishin E.V. 2009. Molecular diversity of spider venom. Biochemistry (Mosc) 74:1505-34. Vetter I., Davis J.L., Rash L.D., Anangi R., Mobli M., Alewood P.F., Lewis R.J., King G.F. 2011. Venomics: a new paradigm for natural products-based drug discovery. Amino Acids 40:15–28 Vincler M. and McIntosh J.M. 2007. Targeting the alpha-9 alpha-10 nicotinic acetylcholine receptor to treat severe pain. Expert Opin Ther Targets 11:891–897. Vingron M., Waterman M. 1994. Sequence alignment and penalty choice. J Mol Biol. 235:1-12. Violette A., Biass D., Dutertre S., Koua D., Piquemal D., Pierrat F., Stöcklin R., Favreau P. 2012. Large-scale discovery of conopeptides and conoproteins in the injectable venom of a fish-hunting cone snail using a combined proteomic and transcriptomic approach. J Proteomics. PMID: 22705119. Violette A., Leonardi A., Piquemal D., Terrat Y., Biass D., Dutertre S., Noguier F., Ducancel F., Stocklin R., Krizaj I., Favreau P. 2012a. Recruitment of glycosyl hydrolase proteins in a cone snail venomous arsenal: further insights into biomolecular features of Conus venoms. Marine Drugs 10: 258-280. 
Wagstaff S.C., Sanz L., Juárez P., Harrison R.A., Calvete J.J. 2009. Combined snake venomics and venom gland transcriptomic analysis of the ocellated carpet viper, Echis ocellatus. J. Proteomics, 71:609–623. Walker C.S., Steel D., Jacobsen R.B., Lirazan M.B., Cruz L.J., Hooper D., Shetty R., DelaCruz R.C., Nielsen J.S., Zhou L.M., Bandyopadhyay P., Craig A.G., Olivera B.M. 1999. The T- superfamily of conotoxins. J. Biol. Chem. 274: 30664–30671. Wang K., Singh D., Zeng Z., Coleman S.J., Huang Y., Savich G.L., He X., Mieczkowski P., Grimm S.A., Perou C.M., MacLeod J.N., Chiang D.Y., Prins J.F., Liu J. 2010. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 38, e178. Wasmuth J. and Blaxter M. 2004. prot4EST: Translating Expressed Sequence Tags from neglected genomes, BMC Bioinformatics, 5, 187. Wilke G., Steinhauser G., Grün J., Berek C. 2010. In silico subtraction approach reveals a close lineage relationship between follicular dendritic cells and BP3(hi) stromal cells isolated from SCID mice. Eur J Immunol., 40(8):2165-73. Wishart D.S. 2007. Improving early drug discovery through ADME modelling: an overview.Drugs R D.;8(6):349-62. Woodward S.R., Cruz L.J., Olivera B.M., Hillyard D.R. 1990. Constant and hypervariable regions in conotoxin propeptides. EMBO J. 9: 1015–1020. Wu T.D. and Nacu S. 2010. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881.

Yuan D.D., Liu L., Shao X.X., Peng C., Chi C.W., Guo Z.Y. 2008. Isolation and cloning of a conotoxin with a novel cysteine pattern from Conus caracteristicus. Peptides 29: 1521–1525.

Zaki N., Wolfsheimer S., Nuel G., Khuri S. 2011. Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinforma., 12: 217. Zerbino D. and Birney E. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, 18, 821 – 829. Zhang Z., Schwartz S., Wagner L., Miller W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7: 203–214.

Appendices

Appendix 1. Presentation of the main bioinformatics tools referred to in the manuscript.

- BLAST. The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. Different implementations are provided in what is commonly called the BLAST suite: i. BLASTP, PSI-BLAST, PHI-BLAST and DELTA-BLAST search a protein database using a protein query; ii. BLASTX searches a protein database using a translated nucleotide query; iii. TBLASTN searches a translated nucleotide database using a protein query; iv. TBLASTX searches a translated nucleotide database using a translated nucleotide query. BLAST analyses can be run freely at http://blast.ncbi.nlm.nih.gov/Blast.cgi.
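As a reading aid, the four query/database combinations listed above can be restated as a small lookup table. The sketch below is illustrative only: the function name `choose_blast_program` and the labels "protein" and "nucleotide" are assumptions made for this example, not part of any BLAST interface, and only the comparisons named in this appendix are covered.

```python
# Lookup table restating the appendix text: (query type, database type) -> program.
# All four comparisons are made at the protein level, so nucleotide
# sequences are translated before being compared.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
    ("nucleotide", "nucleotide"): "tblastx",
}


def choose_blast_program(query_type, database_type):
    """Return the program name for a query/database pair (hypothetical helper)."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    # A transcriptome contig (nucleotide) searched against a protein database.
    print(choose_blast_program("nucleotide", "protein"))
```

In a transcriptome setting the typical case is the second entry: assembled nucleotide contigs compared against a protein database.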

- CD-HIT. CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT is very fast and can handle extremely large databases. It significantly reduces the computational and manual effort in many sequence analysis tasks, and helps to understand the structure of a dataset and to correct its bias. Clustering a sequence database requires all-by-all comparisons and is therefore very time-consuming. Many methods use BLAST to compute the all-versus-all similarities, which makes it very difficult for them to cluster large databases. CD-HIT, by contrast, avoids many pairwise sequence alignments by applying a short-word filter. CD-HIT uses a greedy incremental clustering algorithm. Briefly, sequences are first sorted in order of decreasing length. The longest one becomes the representative of the first cluster. Then, each remaining sequence is compared to the representatives of existing clusters. If the similarity with any representative is above a given threshold, the sequence is grouped into that cluster. Otherwise, a new cluster is defined with that sequence as the representative. A full user manual can be found at http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide .
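The greedy incremental procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CD-HIT implementation: the similarity measure is a stand-in based on Python's `difflib`, the short-word filter is omitted, and the function names and the 0.9 threshold are chosen for this example only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy similarity in [0, 1]; a stand-in for an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_incremental_clustering(sequences, threshold=0.9):
    """Cluster sequences following the steps described in the text.

    Returns a list of (representative, members) pairs.
    """
    # Step 1: sort by decreasing length (ties keep their input order).
    ordered = sorted(sequences, key=len, reverse=True)
    clusters = []
    for seq in ordered:
        # Step 2: compare the sequence to existing representatives only.
        for representative, members in clusters:
            # Assumption: "above a given threshold" is read as >= here.
            if similarity(seq, representative) >= threshold:
                members.append(seq)
                break
        else:
            # Step 3: no representative is similar enough, open a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    toy = ["GCCSDPRCAWRC", "MKTLLLTLVVVTIVCLDLGY", "MKTLLLTLVVVTIVCLDLGYT"]
    for representative, members in greedy_incremental_clustering(toy):
        print(representative, len(members))
```

Because each sequence is compared only to cluster representatives, the number of comparisons grows with the number of clusters rather than with the square of the number of sequences.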

- ConoServer. ConoServer is a database specialized in the sequences and structures of conopeptides, which are peptides expressed by carnivorous marine cone snails. A fascinating feature of these peptides is their high specificity and affinity towards human ion channels, receptors and transporters of the nervous system. This makes conopeptides an interesting resource for physiological studies of neuroreceptors and promising drug leads. Conopeptides are described in the database, which also provides a selection of recent reviews on the subject. Conopeptides are classified into disulfide-rich peptides (conotoxins) and several classes of disulfide-poor peptides. The three classification schemes used in ConoServer (gene superfamilies, cysteine frameworks and pharmacological families) are described and analyzed. (Kaas Q., Yu R., Jin A.H., Dutertre S. and Craik D.J. ConoServer: updated content, knowledge, and discovery tools in the conopeptide database. Nucleic Acids Res. 2012.)

- GO: Gene Ontology. The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members, as well as tools to access and process these data. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis; and biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms. The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity. The Gene Ontology web page (http://www.geneontology.org) allows the GO database to be searched. (Ashburner M., et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 2000;25:25-29.)
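Querying "at different levels" can be illustrated with a toy hierarchy. The sketch below is an assumption-laden example and not the GO data model or any GO tool: the three term names are simplified labels, the gene product names are hypothetical, and `PARENTS`, `ANNOTATIONS`, `ancestors` and `products_for` are invented for this illustration.

```python
# Toy hierarchy: each term maps to its parent terms (a term may have several).
PARENTS = {
    "signal transduction": [],
    "cell surface receptor signaling": ["signal transduction"],
    "receptor tyrosine kinase signaling": ["cell surface receptor signaling"],
}

# Hypothetical gene products, each annotated at a different depth.
ANNOTATIONS = {
    "geneA": {"receptor tyrosine kinase signaling"},
    "geneB": {"cell surface receptor signaling"},
    "geneC": {"signal transduction"},
}


def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def products_for(term):
    """Gene products annotated to the term itself or to any more specific term."""
    hits = set()
    for product, terms in ANNOTATIONS.items():
        for annotated in terms:
            if annotated == term or term in ancestors(annotated):
                hits.add(product)
    return sorted(hits)


if __name__ == "__main__":
    print(products_for("signal transduction"))
    print(products_for("receptor tyrosine kinase signaling"))
```

A query on the general term returns every product annotated anywhere beneath it, whereas a query on the specific term returns only the products annotated at that depth.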

Blast2GO is a research tool designed with the main purpose of enabling GO-based data mining on sequence data for which no GO annotation is yet available. Blast2GO combines in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable platform for functional genomics research in non-model species. (Conesa A., et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005;21:3674-3676.)

- HMM profiles for proteins. In a Markovian sequence, the character appearing at position t depends only on the k preceding characters, k being the order of the Markov chain. Hence, a Markov chain is fully defined by the set of probabilities of each character given the past of the sequence in a k-long window: the transition matrix. In the hidden Markov model, the transition matrix can change along the sequence. The choice of the transition matrix is governed by another Markovian process, usually called the hidden process. Hidden Markov models are thus particularly useful to represent sequence heterogeneity. These models can be used in predictive approaches: algorithms such as the Viterbi algorithm and the forward-backward procedure make it possible to recover which transition matrix was used along the observed sequence. (from Martin et al., 2005. conference proceedings: http://conferences.telecom-bretagne.eu/asmda2005/IMG/pdf/proceedings/180.pdf). HMM-profile methods allow variable conservation and insertions/deletions to be dealt with in a fairly robust way. Modelling of complete domains should facilitate more biologically meaningful sequence annotation and, in some cases, more sensitive detection. (Sonnhammer et al., Pfam: Multiple sequence alignments and HMM-profiles of protein domains. Nucl. Acids Res. (1998) 26 (1): 320-322. doi: 10.1093/nar/26.1.320)
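To make the decoding step concrete, the following sketch implements the Viterbi algorithm for a deliberately simplified model. It assumes that each hidden state emits characters independently (order 0) instead of selecting a k-order transition matrix as in the general description above, and the two states, the two-letter alphabet and all probabilities are invented for illustration.

```python
import math

# Toy model (all values are assumptions for this example).
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
EMIT = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a non-empty sequence.

    Works in log space; all probabilities are assumed to be non-zero.
    """
    # best[t][s]: log-probability of the best path ending in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print("".join(viterbi("xxxyyy", STATES, START, TRANS, EMIT)))
```

On the toy sequence the decoded path switches state where the composition of the sequence changes, which is the sense in which hidden Markov models capture sequence heterogeneity.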

- InterProScan. The InterPro database is an integrated resource of predictive protein signatures. These signatures use a range of computational methods to infer potential structure, function and/or evolutionary relationships for a query sequence. Equivalent signatures are grouped together in the same InterPro entry, and each entry contains information about the proteins matched by these signatures, including manual annotation, and links to related resources to provide enhanced biological context. Each InterPro entry is assigned a type depending on what the entry describes: family (a group of proteins with a common evolutionary origin), domain (a distinct functional, structural or sequence unit), site (which may be further subdivided into active site, binding site, conserved site or post-translational modification) and repeat (full definitions of InterPro entry types are given in the user documentation available at: http://www.ebi.ac.uk/interpro/). InterProScan [E. M. Zdobnov and R. Apweiler (2001) Bioinformatics, 17, 847–848] is a tool that combines the different protein signature recognition methods of the InterPro [N. J. Mulder, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bradley, P. Bork, P. Bucher, L. Cerutti et al. (2005) Nucleic Acids Res., 33, D201–D205] consortium member databases into one resource.

- KEGG: Kyoto Encyclopedia of Genes and Genomes is a knowledge base for systematic analysis of gene functions, linking genomic information with higher order functional information. The genomic information is stored in the GENES database, which is a collection of gene catalogs for all the completely sequenced genomes and some partial genomes with up-to-date annotation of gene functions. The higher order functional information is stored in the PATHWAY database, which contains graphical representations of cellular processes, such as metabolism, membrane transport, signal transduction and cell cycle. The PATHWAY database is supplemented by a set of ortholog group tables for the information about conserved subpathways (pathway motifs), which are often encoded by positionally coupled genes on the chromosome and which are especially useful in predicting gene functions. A third database in KEGG is LIGAND for the information about chemical compounds, enzyme molecules and enzymatic reactions. KEGG provides Java graphics tools for browsing genome maps, comparing two genome maps and manipulating expression maps, as well as computational tools for sequence comparison, graph comparison and path computation. The KEGG databases are updated daily and made freely available (http://www.genome.ad.jp/kegg/). (Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. 2000;28:27-30.)

- MAFFT: Multiple alignment program for amino acid or nucleotide sequences. MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods. L-INS-i is one of the most accurate multiple sequence alignment methods currently available. L-INS-i is suitable for aligning 10-100 protein sequences, because of an objective function combining the WSP and consistency scores. FFT-NS-2 and other progressive methods can align many and/or long DNA/protein sequences, because of an FFT approximation and a linear-space DP algorithm. The scoring system was designed to allow large gaps. Thus MAFFT is suitable for LSU rRNA and SSU rRNA alignments that sometimes have variable loop regions. Staggered gaps are also allowed. MAFFT is freely accessible at http://mafft.cbrc.jp/alignment/server/.

- MIRA: MIRA is an EST sequence assembler that specializes in reconstruction of pristine mRNA transcripts, while at the same time detecting and classifying single nucleotide polymorphisms (SNPs) occurring in different variations thereof. The assembler uses iterative multipass strategies centered on high-confidence regions within sequences and has a fallback strategy for using low-confidence regions when needed. It features special functions to assemble high numbers of highly similar sequences without prior masking, an automatic editor that edits and analyzes alignments by inspecting the underlying traces, and detection and classification of sequence properties like SNPs with a high specificity and a sensitivity down to one mutation per sequence. In addition, it includes possibilities to use incorrectly preprocessed sequences, routines to make use of additional sequencing information such as base-error probabilities, template insert sizes, strain information, etc., and functions to detect and resolve possible misassemblies. (Chevreux B., et al. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. 2004;14:1147-1159.)

- NCBI. The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. The NCBI provides a variety of databases and tools for bioinformatics. In particular, running a BLAST search "against NCBI" implicitly refers to the non-redundant (nr) sequence databases provided for use with the stand-alone BLAST programs.

- Pfam. Pfam is a database of protein domain families (Bateman et al., 2004). Each family is represented by multiple sequence alignments and profile hidden Markov models (HMMs). In addition, each family has associated annotation, literature references, and links to other databases. The entries in Pfam are freely available via the web and in flatfile format at http://www.sanger.ac.uk/Software/Pfam/. Pfam is a founding member database of InterPro and is therefore also available via the InterPro site at http://ebi.ac.uk/interpro. The use of Pfam by molecular biologists as a protein information resource and analysis tool is widespread. The multiple sequence alignments around which Pfam families are built are important for understanding both protein structure and function. The alignments are also the basis for techniques such as secondary structure prediction, fold recognition, and phylogenetic analysis and can guide mutation design. In addition to the identification of domains in novel protein sequences, Pfam can be used in conjunction with the Wise2 package to predict genes and annotate genomic DNA.

- Phenyx. Developed in collaboration with the Swiss Institute of Bioinformatics (SIB), Phenyx is GeneBio's renowned software platform for the identification, characterization and quantitation of proteins and peptides from mass spectrometry data. Specifically designed to meet the concurrent demands of high-throughput MS data analysis and dynamic results assessment, it offers a highly flexible user interface and an adaptable architecture that help instill confidence in results assessment. Phenyx enables proteomics scientists to 1) submit MS/MS data and identify peptides and proteins 2) visualize and evaluate results using various dynamic views, 3) manually validate results and compare Phenyx runs and those of other database search engines (like Mascot, SEQUEST and X!Tandem), 4) perform quantitative analysis with the Phenyx quantitation module, and 5) export results and generate reports into various formats. Phenyx can be used to match MS/MS data against a provided transcriptome.

Phenyx webpage: http://www.genebio.com/products/phenyx/

- PSSM: Position-Specific Scoring Matrix, also known as a generalised profile. It is a type of scoring matrix used in protein BLAST searches in which amino acid substitution scores are given separately for each position in a protein multiple sequence alignment. Thus, a Tyr-Trp substitution at position A of an alignment may receive a very different score than the same substitution at position B. This is in contrast to position-independent matrices such as the PAM and BLOSUM matrices (http://www.ncbi.nlm.nih.gov/books/NBK21106/def-item/app15/), in which the Tyr-Trp substitution receives the same score no matter at what position it occurs.

PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions.
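The way position-specific scores arise from an alignment can be sketched in a few lines of Python. This is an illustrative toy, not the procedure used by BLAST: the four-letter alphabet, the uniform background frequencies, the pseudocount of 1 and the scaling to rounded half-bit integers are all assumptions made for the example.

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: integer log-odds score} dict per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            # Observed frequency, smoothed so that unseen residues keep a finite score.
            freq = (counts[residue] + pseudocount) / total
            # Log-odds against the background, in half-bits, rounded to an integer.
            scores[residue] = round(2 * math.log2(freq / background[residue]))
        pssm.append(scores)
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific scores of an ungapped peptide."""
    return sum(column[residue] for column, residue in zip(pssm, peptide))


# Hypothetical gap-free toy alignment over a reduced alphabet.
toy_alignment = ["CGA", "CGK", "CAA", "CGA"]
toy_background = {"A": 0.25, "C": 0.25, "G": 0.25, "K": 0.25}
toy_pssm = build_pssm(toy_alignment, toy_background)
```

In this toy matrix the fully conserved cysteine of the first column receives a positive score, while residues never seen at that position receive a negative one, which mirrors the interpretation of positive and negative scores given above.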

- SignalP: predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. A public server can be accessed at http://www.cbs.dtu.dk/services/SignalP/ .

- UniProtKB/Swiss-Prot. UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high quality annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. Since 2002, it has been maintained by the UniProt consortium (http://www.uniprot.org/help/about) and is accessible via the UniProt website (http://www.uniprot.org/).

Appendix 2. Conopeptides superfamilies characteristics.

Summary table from McIntosh and Jones, 2001.

Appendix 3. Identification and classification of conopeptides using pHMM.

Biochimica et Biophysica Acta 1824 (2012) 488–492


Identification and classification of conopeptides using profile Hidden Markov Models

Silja Laht a,⁎, Dominique Koua b,c, Lauris Kaplinski a, Frédérique Lisacek c, Reto Stöcklin b, Maido Remm a

a Estonian Biocentre, Riia 23, 51010, Tartu, Estonia
b Atheris Laboratories, Case Postale 314, CH-1233 Bernex-Geneva, Switzerland
c Swiss Institute of Bioinformatics, Proteome Informatics Group, Rue Michel-Servet 1, 1211, Geneva, Switzerland

Article history: Received 15 June 2011. Received in revised form 13 December 2011. Accepted 19 December 2011. Available online 30 December 2011.

Keywords: Conotoxin; Conopeptide; Hidden Markov Model; Conopeptide superfamilies; Protein prediction

Abstract

Conopeptides are small toxins produced by predatory marine snails of the genus Conus. They are studied with increasing intensity due to their potential in neurosciences and pharmacology. The number of existing conopeptides is estimated to be 1 million, but only about 1000 have been described to date. Thanks to new high-throughput sequencing technologies the number of known conopeptides is likely to increase exponentially in the near future. There is therefore a need for a fast and accurate computational method for identification and classification of the novel conopeptides in large data sets. 62 profile Hidden Markov Models (pHMMs) were built for prediction and classification of all described conopeptide superfamilies and families, based on the different parts of the corresponding protein sequences. These models showed very high specificity in detection of new peptides. 56 out of 62 models do not give a single false positive in a test with the entire UniProtKB/Swiss-Prot protein sequence database. Our study demonstrates the usefulness of mature peptide models for automatic classification with accuracy of 96% for the mature peptide models and 100% for the pro- and signal peptide models. Our conopeptide profile HMMs can be used for finding and annotation of new conopeptides from large datasets generated by transcriptome or genome sequencing. To our knowledge this is the first time this kind of computational method has been applied to predict all known conopeptide superfamilies and some conopeptide families.

© 2012 Elsevier B.V. All rights reserved.

⁎ Corresponding author. Tel.: +372 737 5001, +372 527 6487 (mobile); fax: +372 742 0286. E-mail address: [email protected] (S. Laht).

1570-9639/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.bbapap.2011.12.004

1. Introduction

Conopeptides are small, usually cysteine-rich, peptides that are found in the venom of the marine snails from genus Conus. Cone snails are predator mollusks, hunting for either worms, snails or fish, with a few species being harmful to humans. Conopeptides are used as valuable probes in neurophysiological studies due to their exceptional specificity for different isoforms of ion channels, receptors and transporters [1] and provide lead compounds for drug discovery [2,3].

Each conopeptide precursor (with a few exceptions) consists of three parts: a signal peptide at the N-terminus (typically 20–25 amino acids in length), a pro-peptide (for most conopeptides 30–60 amino acids in length) and a mature peptide at the C-terminus (8 to >40 amino acids, usually 12–30 amino acids) (Fig. 1). During maturation in the venom gland, the signal peptides and the pro-peptides are cleaved, correct disulphide crosslinks are formed and often some amino acids are modified. The mature peptides act as toxins when they are injected into a prey [4].

Fig. 1. Conopeptide precursor structure. Panel A: a schematic representation of the conopeptide precursor sequence (signal peptide, propeptide, mature peptide). Panels B and C present sequence logos [25] for the A superfamily and the M superfamily sequences, respectively, to illustrate the level of sequence conservation of different precursor parts within given superfamilies.

Conopeptides have been classified into 16 superfamilies defined by a common signal sequence (one will refer to the A, D, M, O or T superfamily for example). A superfamily does not reflect the biological activity. Indeed, a single mutation in a mature sequence can drastically change its pharmacological properties and members of different superfamilies can reveal similar biological activities [5]. Additionally, some 30 conopeptide families have been described based on a typical structural pattern (such as the cysteine motif) coupled to a given biological activity on a specific subtype of ion channel or receptor. For example, one will refer to the alpha-A, delta, mu, mu-O, conantokin or conopressin family. In this paper superfamilies and families together will be referred to as classes.

The most recent studies have estimated that the number of different conopeptides detected in the venom of a single species can exceed 1,000 [6,7]. The number of Conus species is currently estimated to reach 800, thus suggesting a putative natural library in the range of 1 million biomolecules, mostly bioactive peptides and mini-proteins. However, despite such a huge molecular diversity, only approximately 1000 conopeptides have been described so far [8,9]. Taking into account the huge potential that remains to be offered by conopeptides and given the fact that novel sequencing techniques provide a vast amount of sequence data, there is a need for an automated process for identification and annotation of new conopeptide sequences from large datasets. For this purpose we have selected profile Hidden Markov Models (pHMMs) built from pre-existing data.

Hidden Markov Models (HMMs) are a class of probabilistic models that are generally applicable to time series or linear sequences. Profile HMMs (pHMMs) are a type of HMMs that are designed to represent profiles of multiple sequence alignments [10]. pHMMs are widely used to predict and find members of protein families; for example they set the basis of the Pfam database [11].

Several approaches for conopeptide superfamily prediction have been published over the last years [12–14]. Their main focus has been the prediction of the conopeptide superfamily based on a mature peptide sequence only, excluding superfamilies where only a few sequences had been described. We aimed at building a set of models that can be used to annotate all conopeptide superfamilies and families that have been described so far, even from partial sequences (mass-spec data, next generation sequencing data, etc.).

2. Materials and methods

2.1. Building of pHMMs for all conopeptide superfamilies and families

Previously described conopeptide sequences were downloaded from ConoServer [8] that has become the reference database for conopeptides. The sequences were grouped into 24 classes: 16 superfamilies (defined by signal region) and 8 families (defined by other patterns) by classification provided in the ConoServer (Table 1). Data redundancy was removed within each class using the CD-HIT program with 100% identity cutoff. With that step identical sequences and sequences contained within other sequences were removed but similar sequences, even with just one amino acid difference, were kept. CD-HIT is a program for clustering large sequence databases at high sequence identity thresholds [15]. Only full-length precursor sequences consisting of signal, pro- and mature peptides were used for 8 superfamilies that contained at least 10 sequences. For smaller classes all available sequences were used. Sequences of each class were further subdivided into their signal, pro- and mature peptide parts according to the positions available in ConoServer. Each part was aligned with MAFFT version 6.707b using the L-INS-i method. MAFFT L-INS-i is one of the most accurate multiple sequence alignment methods currently available. L-INS-i is in particular suitable for alignment of 10–100 protein sequences [16,17].

A pHMM was built for each subset using hmmbuild from the HMMER 3.0 package [18]. Hmmpress from the same package was used to construct binary compressed data files for hmmscan.

Table 1
Conopeptide superfamilies and families that were modeled and the number of sequences used for pHMM training. Only full-length precursor sequences consisting of signal, pro- and mature peptide were used, if not otherwise stated.

No.   Superfamily or family   No. of sequences in training set
1     A                       142
2     D                       18
3     I1                      9
4     I2                      35
5     I3                      7
6     J                       6
7     L                       7
8     M                       75
9     O1                      396
10    O2                      44
11    O3                      21
12    P                       6
13    S (a)                   8
14    T                       129
15    V                       2
16    Y                       1
17    Conantokin              7
18    Conkunitzin (b)         2
19    Conolysin (b)           2
20    Conophan (b)            2
21    Conopressin (b)         6
22    Conorfamide (b)         2
23    Contryphan              9
24    Contulakin              3

(a) 7 full-length precursors and one mature peptide in the training set.
(b) Only a mature peptide sequence has been described for this conopeptide class.

2.2. Determination of how the number of sequences used for pHMM training affects sensitivity and specificity of classification

The 3 largest conopeptide superfamilies (A, O1, T; each containing at least 130 sequences) were randomly divided into one test set (50 sequences) and several training sets consisting of 2, 3, 5, 10, 20, 30, 40, 50, 60, 70 or 80 sequences. The training sequences were aligned with the MAFFT L-INS-i program, and pHMMs were built for each set using hmmbuild. These pHMMs were formatted with hmmpress and then used to scan for full-length precursor sequences from the test sets with hmmscan (HMMER 3.0 package) with the default settings. The number of matches found with each model within the test set of the same class was recorded as the true positives. The same models were also scanned against a negative test set that contained all other conopeptide classes, except for the one that was used for training, with the same default parameters. The number of matches found from the negative test set was recorded as the false positives for each model.

The random division, sequence alignment, model building and testing were repeated 10 times; the average number of matches and the standard deviation were calculated based on those iterations.

2.3. Determination of specificity of conopeptide pHMMs on UniProtKB/Swiss-Prot protein database

In order to determine the ability of conopeptide pHMMs to distinguish between conopeptides and other proteins all conopeptide pHMMs were scanned against the UniProtKB/Swiss-Prot protein database (downloaded on 17.08.2011, containing 531,473 protein sequences) [19,20] using hmmscan from the HMMER 3.0 package with different E-value cutoffs. HMMER3 only does local alignment, so there was no need to divide the protein sequences tested into different domains, when looking for matches with the pHMMs. All matches from Conus sp. were considered true positive. The true positives were manually revised for non-conopeptides, but all sequences retrieved with pHMMs that were from Conus sp. were indeed conopeptides. All sequences from other organisms were considered false positives.
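The random division of Section 2.2 and the model-building steps of Section 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used in the study: the text does not say whether the training sets were nested or drawn independently, so the sketch draws each one independently from the sequences left after removing the test set; all file names are hypothetical; and the commands are only assembled as argument lists, not executed. The options shown (`-c 1.0` for CD-HIT, `--localpair --maxiterate 1000` for MAFFT L-INS-i, `--tblout` and `-E` for hmmscan) are standard options of those programs.

```python
import random

TRAINING_SIZES = (2, 3, 5, 10, 20, 30, 40, 50, 60, 70, 80)


def split_test_and_training(sequence_ids, test_size=50,
                            training_sizes=TRAINING_SIZES, seed=None):
    """Pick one random test set, then draw each training set from the remainder."""
    rng = random.Random(seed)
    shuffled = list(sequence_ids)
    rng.shuffle(shuffled)
    test_set, remainder = shuffled[:test_size], shuffled[test_size:]
    # Assumption: training sets are sampled independently of each other.
    training_sets = {n: rng.sample(remainder, n) for n in training_sizes}
    return test_set, training_sets


def model_building_commands(class_name, part, evalue=None):
    """Return argv lists for one class and precursor part (hypothetical file names)."""
    subset = f"{class_name}.{part}"
    hmmscan = ["hmmscan", "--tblout", "hits.tbl"]
    if evalue is not None:
        hmmscan += ["-E", str(evalue)]  # otherwise hmmscan keeps its default cutoff
    return [
        # Remove identical and fully contained sequences within the class.
        ["cd-hit", "-i", f"{class_name}.fasta", "-o", f"{class_name}.nr.fasta", "-c", "1.0"],
        # Splitting precursors into signal, pro- and mature parts is not shown here.
        # L-INS-i alignment; MAFFT writes to standard output, to be saved as <subset>.aln.fasta.
        ["mafft", "--localpair", "--maxiterate", "1000", f"{subset}.fasta"],
        ["hmmbuild", f"{subset}.hmm", f"{subset}.aln.fasta"],
        # All models are assumed to be concatenated into conopeptides.hmm beforehand.
        ["hmmpress", "conopeptides.hmm"],
        hmmscan + ["conopeptides.hmm", "queries.fasta"],
    ]


example_test, example_training = split_test_and_training(
    [f"seq{i}" for i in range(130)], seed=1
)
example_commands = model_building_commands("A", "signal", evalue="1e-4")
```

Repeating `split_test_and_training` with ten different seeds would reproduce the ten iterations described above; each argument list could then be passed to `subprocess.run` once the corresponding input files exist.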

To calculate the specificity of the mature peptide models, the hmmscan. The number of sequences found was recorded for signal, number of true and false positive matches found with only mature pro- and mature peptide models. The random division into test and peptide models was determined. training sets and the following search were repeated 10 times for Specificity was calculated as [true negative / (true negative+false each superfamily. positive)]. The results are given in Fig. 2, which clearly demonstrates that even with as little as 2 sequences used for training the conopeptide 2.4. Classification of newly discovered conopeptides using the pHMMs pHMMs, a sensitivity of 100% could be obtained for signal peptide models. The maximum sensitivity for propeptide models was reached A set of 53 conopeptides discovered from Conus consors venom with 10–20 training sequences. For mature peptide models more se- duct transcriptome has been recently described by Terrat et al. [21]. quences were required for maximum sensitivity. These 53 full-length or almost full-length conopeptide precursor se- The specificity of all models, even those with 2 training sequences, quences (kindly provided by Y. Terrat) were scanned with hmmscan was nearly always 100% with a few random false positive matches with default parameters using 19 propeptide pHMMs and 24 mature from other conopeptide classes. The only exception was O1 super- peptide pHMMs. If a single sequence matched signal, pro- or mature family mature peptide model, where specificity was reduced from peptide models from multiple classes, this sequence was classified 99.9% to 98.3% when more sequences were included in the model. according to the model with the lowest E-value. The lower specificity of the O1 superfamily mature peptide model Sensitivity was calculated as [true positive/(true positive+false will be discussed in Section 3.4. negative)]. 
Results of this experiment assure that, even when the training of Accuracy was calculated as [correctly classified/(correctly classified+ several conopeptide pHMMs was undertaken with b10 sequences, incorrectly classified)]. specificity can be trusted. The sensitivity of models with a small num- ber of training sequences can be suboptimal. When more sequences 2.5. Classification of putative conotoxins from Conus bullatus using will be added to databases and pHMMs, it will be possible to improve conopeptide pHMMs the sensitivity up to 70–80%, even when only mature peptide models are used and up to 100%, if propeptide models are used. For classes Conus bullatus venom duct transcriptome has been recently se- with only mature peptide sequences currently available, addition of quenced [22]. A set of 2410 putative conotoxin contigs discovered just 1–2 full precursor sequences is expected to improve the sensitiv- in this study was downloaded from http://derringer.genetics.utah. ity significantly, since signal and propeptide models show much edu/conus/data/conotoxin/. These contigs were translated in 6 higher sensitivity with small number of training sequences than ma- frames and scanned with all 62 conopeptide pHMMs using hmmscan ture peptide model. with E-value cutoff 1e-4. 3.3. When tested on UniProtKB/Swiss-Prot protein database, the specificity 3. Results and discussion of conopeptide pHMMs is nearly 100%

3.1. pHMMs construction for each described conopeptide superfamily To assess the specificity of the pHMMs in distinguishing between and family conopeptides versus other proteins, the conopeptide pHMMs were tested against the UniProtKB/Swiss-Prot protein database. All A total of 62 models for the 24 described conopeptide classes (16 matches from Conus sp. were considered as true positive. All se- superfamilies and 8 families) were built. Three separate models for quences from other organisms were considered false positives. signal peptide, propeptide and mature peptide were defined for With an E-value cutoff 1e-5, only 111 false-positive predictions each class, whenever possible. Five disulfide-poor conopeptide fami- were obtained from the 531473 proteins in the database. 57 out of lies have been described only at the mature peptide sequence level, 62 pHMMs had no false positive matches in the entire SwissProt da- and no signal or propeptide model could be built for these families. tabase. The 5 less specific pHMMs were: the mature peptide models Altogether, 24 models for mature conopeptide regions, 19 models for the I1, I3, O1, O2 superfamilies with several false positive matches for signal peptides and 19 models for propeptides were built. All the and the mature peptide model for the P superfamily with one false conopeptide classes and the number of sequences used for pHMM positive match. All of the false positives were cysteine-rich peptides training are listed in Table 1. The training set contained 964 se- (toxins, trypsin inhibitors) containing a knottin domain, that is struc- quences from 60 different Conus species. The training set sequences turally very similar to the O and I superfamily mature conopeptides. are available in fasta format as supplementary data. Also the mini- The specificity of conopeptide pHMMs obtained from the results mum and maximum E-values for each model obtained with training was 99.98–100% with E-value cutoffs 1e-5–10 (Table 2). 
set sequences and with novel conopeptides from Conus consors tran- Since almost all novel conopeptides are first described as mature scriptome (Section 3.4) are listed in Supplementary Table 1. peptides from venoms, it would be very useful to be able to classify those new peptide sequences. The dissected venom may also contain 3.2. Conopeptide pHMMs show good sensitivity and specificity, even partially processed conopeptide precursors therefore the specificity when trained on a small number of sequences of 19 propeptide pHMMs along with 23 mature peptide pHMMs was determined with UniProtKB/Swiss-Prot. The specificity of mature Since many of the conopeptide superfamilies contain only a small peptide models was 99.80–99.98% with E-value cutoffs 1e-5–10. The number of sequences (8 out of 16 conotoxin superfamilies and all specificity of propeptide models was 99.96–100% (Table 2). With modeled families contain less than 10 precursor sequences; the least stringent cutoff the propeptide models are even more Table 1), it is important to estimate the minimum number of se- specific than signal peptide models, retrieving 214 and 303 false quences necessary for pHMM training to obtain a model capable of positives, respectively. classifying conopeptides into a given class with good sensitivity and specificity. For this purpose, the 3 superfamilies containing >100 pre- 3.4. The propeptide and mature peptide pHMMs show high specificity cursor sequences (A, O1 and T) were divided randomly into 13 parts— and sensitivity for classification of newly discovered conopeptides one test set containing 50 sequences, and from the remainder 2, 3, 5, 10, 20, 30, 40, 50, 60, 70 or 80 sequences were used for training of the In order to understand whether pHMMs are suitable for automatic pHMMs. The test set was then searched with each model using classification of conopeptides, it was necessary to determine whether S. Laht et al. / Biochimica et Biophysica Acta 1824 (2012) 488–492 491

A superfamily O1 superfamily

100 100 90 90 80 80 70 70 60 60 50 50 40 40 Sensitivity % 30 Sensitivity % 30 20 20 10 10 0 0 2 3 5 1020304050607080 2 3 5 1020304050607080 Number of sequences used for pHMM training Number of sequences used for pHMM training

T superfamily

100 90 80 Signal peptide 70 60 Propeptide 50 40 Mature peptide

Sensitivity % 30 20 10 0 2 3 5 10 20 30 40 50 60 70 80 Number of sequences used for pHMM training

Fig. 2. Sensitivity of the conopeptide pHMMs trained on a different number of sequences. Each point represents an average number of hits of 10 random tests done with the pHMMs trained on the number of sequences indicated on the x-axis. The error bars show the standard deviation for each point. The red circles mark the number of hits with the signal peptide model, the green triangles with the propeptide model and the blue squares with the mature peptide model.

the best match (with lowest E-value) is obtained with the pHMMs of This data set was used to test how well can propeptide and mature the correct class or occasionally models of other classes can give the peptide pHMMs classify novel conopeptides. Signal peptide pHMMs lowest E-value. Classification is especially important for sequences were not included since previous analysis shows that, if signal pep- lacking the signal peptide sequence. Classification accuracy could tide is available, the classification is 100% correct (with 100% specific- not be tested on UniProtKB/SwissProt data since: a) not all sequences ity and sensitivity, Section 3.2). Only propeptide and mature peptide in UniProtKB/SwissProt have been classified into a superfamily or a pHMMs were used to retrieve matches (using hmmscan) from among family b) most of the sequences in UniProtKB/SwissProt had been these 53 conopeptide precursors. used to build the conopeptide pHMMs. Three sequences were not classified with propeptide pHMMs. Therefore the conopeptide pHMMs were tested on novel conopep- These were 2 conkunitzins and 1 conopressin. For these families no tides discovered in Conus consors venom duct transcriptome by Terrat propeptide model was built, since no precursor sequences had been et al. [21]. These authors have recently characterized a set of 53 cono- described so far. No false classifications were obtained with propep- peptide precursors (6 of which had been previously characterized in tide pHMMs from this set of sequences. C. consors)byhighthroughputsequencingofC. consors venom duct Mature peptide pHMMs were able to classify correctly 79% of the transcriptome. This set contains conopeptides from 10 different classes: tested conopeptides. Conopeptide mature peptide sequences are highly 6 superfamilies (A, O, M, T, S, P) and 4 families (conkunitzin, contulakin, divergent, and a similar sensitivity was also observed in the test with conopressin, conantokin). the larger superfamilies. 
In addition, two O superfamily sequences were incorrectly classified into I3 superfamily, when the mature pep- tide models were used. This can be explained by the fact that O and I su- perfamily mature peptides share common cysteine patterns (http:// Table 2 fi Specificity of conopeptide pHMMs tested on UniProtKB/Swiss-Prot. www.conoserver.org/?page=classi cation&type=genesuperfamilies) and as illustrated in Fig. 1B and C, the cysteines are the most conserved E-value Signal peptide Propeptide Mature peptide residues in the mature conopeptides of any class. The accuracy of classi- cutoff pHMMs pHMMs pHMMs fication using mature peptide pHMMs was 96%. 10 (default) 99.94% 99.96% 99.80% This example of classification of a set of conopeptides identified 1 99.96% 99.97% 99.82% 0.1 99.97% 99.97% 99.85% using massive parallel sequencing of cone snail venom duct transcrip- 0.01 99.98% 99.97% 99.88% tome shows that the pHMMs described in this work can be successfully 1e-3 99.99% 99.98% 99.91% used to annotate newly discovered conopeptides. It is important to also 1e-4 100% 100% 99.94% include the smaller classes with only a few sequences described. In 1e-5 100% 100% 99.98% Conus consors 12 isoforms from those small classes were composing 492 S. Laht et al. / Biochimica et Biophysica Acta 1824 (2012) 488–492 nearly 3% of conopeptide transcripts (Terrat et al., 2011), indicating that Conflicts of interest statement they are relevant components of the C. consors’ venom. The authors declare that there are no conflicts of interest.

3.5. Classification of putative conopeptide transcripts using pair-wise homology to conopeptide signal sequences can be improved by using conopeptide pHMMs

In a recent publication Hu et al. [22] characterize Conus bullatus genome and venom-duct transcriptome using next-generation sequencing. The authors have identified a set of 2410 putative conotoxins using homology search with BLASTX. They have then used the signal sequence homology to assign each putative conotoxin contig into a superfamily. In total they were able to assign 543 putative conotoxins (23%) to a unique conopeptide superfamily. The reason for relatively low classification rate is most likely the fact that the putative conopeptide contigs do not contain full-length precursor sequences and many of them lack the signal peptide sequence that the classification was based on.

These putative conotoxins were also classified using the conopeptide pHMMs described in this work. We were able to classify 1188 (49%) putative conotoxins into a superfamily or a family. 766 of the putative conotoxin contigs from Conus bullatus were classified without the signal peptide model.

The example of classification of C. bullatus putative conotoxins identified using next-generation sequencing shows that annotation using pair-wise homology to known conopeptide signal sequences can be improved using pHMMs. In many cases the coverage of sequencing is not sufficient to allow assembly of full-length conopeptide precursors from transcriptome data and sequences of pro- and mature peptides are not classified.

3.6. Conclusions

New, affordable and high throughput sequencing technologies now enable scientists to conduct proteome-, transcriptome- or genome-wide sequence analysis of tissues or organisms. This is producing vast amounts of data in short time and calls for convenient tools for extraction of information relevant to the field of research. In this context, the identification of new conopeptides from cone snail genomes or transcriptomes appears to be an important step in finding novel venom peptides and mini-proteins. These biomolecules provide useful compounds for neurosciences, and some of them are even in development or have reached market for clinical use [1,23]. The pHMMs presented in this work provide an adequate tool for such investigations and complement conventional tools like BLAST. Importantly, the models tested here had both high specificity and sensitivity, making the pHMMs an accurate and fast tool for the annotation of new conopeptides from large datasets produced with modern sequencing technologies. Using separate models for the signal, pro- and mature peptide should make it easier to annotate genome sequences where conopeptide precursors are most likely coded by different exons [24]. They are also applicable to the classification of new conopeptides discovered and sequenced as peptides with no precursor sequence available.

Supplementary materials related to this article can be found online at doi:10.1016/j.bbapap.2011.12.004.

Acknowledgements

We thank Tõnu Margus (University of Tartu, Tartu, Estonia) for expert advice on protein family modeling, and Philippe Favreau (Atheris Laboratories, Geneva, Switzerland) for discussions and manuscript revision. This work has been supported by a grant of the European Commission: CONCO, the cone snail genome project for health. Integrated Project ref. LSHB-CT-2007-037592 (http://www.conco.eu). S. Laht, L. Kaplinski and M. Remm were also supported by the EU through the European Regional Development Fund through the Estonian Centre of Excellence in Genomics.

References

[1] R.J. Lewis, Conotoxin venom peptide therapeutics, Adv. Exp. Med. Biol. 655 (2009) 44–48.
[2] H. Terlau, Conus venoms: a rich source of novel ion channel-targeted peptides, Physiol. Rev. 84 (2004) 41–68.
[3] G.P. Miljanich, Ziconotide: neuronal calcium channel blocker for treating severe chronic pain, Curr. Med. Chem. 11 (2004) 3029–3040.
[4] B.M. Olivera, J.S. Imperial, G. Bulaj, Biosynthesis of conopeptides, in: A. Menez (Ed.), Perspectives in Molecular Toxinology, John Wiley & Sons, Ltd, Chichester, 2002, pp. 143–158.
[5] P. Favreau, R. Stocklin, Marine snail venoms: use and trends in receptor and channel neuropharmacology, Curr. Opin. Pharmacol. 9 (2009) 594–601.
[6] J. Davis, A. Jones, R.J. Lewis, Remarkable inter- and intra-species complexity of conotoxins revealed by LC/MS, Peptides 30 (2009) 1222–1227.
[7] D. Biass, S. Dutertre, A. Gerbault, J.-L. Menou, R. Offord, P. Favreau, R. Stöcklin, Comparative proteomic study of the venom of the piscivorous cone snail Conus consors, J. Proteomics 72 (2009) 210–218.
[8] http://www.conoserver.org/ Last accessed September 8th 2011.
[9] Q. Kaas, J.C. Westermann, R. Halai, C.K.L. Wang, D.J. Craik, ConoServer, a database for conopeptide sequences and structures, Bioinformatics 24 (2008) 445–446.
[10] S.R. Eddy, Profile hidden Markov models, Bioinformatics 14 (1998) 755–763.
[11] R.D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J.E. Pollington, O.L. Gavin, P. Gunasekaran, G. Ceric, K. Forslund, L. Holm, E.L.L. Sonnhammer, S.R. Eddy, A. Bateman, The Pfam protein families database, Nucleic Acids Res. 38 (2009) D211–D222.
[12] S. Mondal, R. Bhavna, R. Mohan Babu, S. Ramakumar, Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification, J. Theor. Biol. 243 (2006) 252–260.
[13] N. Zaki, S. Wolfsheimer, G. Nuel, S. Khuri, Conotoxin protein classification using free scores of words and support vector machines, BMC Bioinforma. 12 (2011) 217.
[14] H. Lin, Q.Z. Li, Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant, Biochem. Biophys. Res. Commun. 354 (2007) 548–551.
[15] W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659.
[16] K. Katoh, H. Toh, Recent developments in the MAFFT multiple sequence alignment program, Brief. Bioinform. 9 (2008) 286–298.
[17] http://mafft.cbrc.jp/alignment/software/ Last accessed August 9th 2011.
[18] http://hmmer.janelia.org/ Last accessed August 8th 2011.
[19] http://www.uniprot.org/downloads/ Last accessed August 17th 2011.
[20] The Uniprot Consortium, The Universal Protein Resource (UniProt) in 2010, Nucleic Acids Res. 38 (2009) D142–D148.
[21] Y. Terrat, D. Biass, S. Dutertre, P. Favreau, M. Remm, R. Stocklin, D. Piquemal, F. Ducancel, High-resolution picture of a venom gland transcriptome: case study with the marine snail Conus consors, Toxicon (2011) [Epub ahead of print].
[22] H. Hu, P.K. Bandyopadhyay, B.M. Olivera, M. Yandell, Characterization of the Conus bullatus genome and its venom-duct transcriptome, BMC Genomics 12 (2011) 60.
[23] A. Schmidtko, J. Lotsch, R. Freynhagen, G. Geisslinger, Ziconotide for treatment of severe chronic pain, Lancet 375 (2010) 1569–1577.
[24] B.M. Olivera, C. Walker, G.E. Cartier, D. Hooper, A.D. Santos, R. Schoenfeld, R. Shetty, M. Watkins, P. Bandyopadhyay, D.R. Hillyard, Speciation of cone snails and interspecific hyperdivergence of their venom peptides. Potential evolutionary significance of introns, Ann. N. Y. Acad. Sci. 870 (1999) 223–237.
[25] G.E. Crooks, G. Hon, J.M. Chandonia, S.E. Brenner, WebLogo: a sequence logo generator, Genome Res. 14 (2004) 1188–1190.

Appendix 4. PSSM and HMM complement each other for conopeptide prediction.

Position-Specific Scoring Matrix and Hidden Markov Model complement each other for the prediction of conopeptide superfamilies

Dominique Koua1,3,§, Silja Laht2, Lauris Kaplinski2, Reto Stöcklin1, Maido Remm2, Philippe Favreau1, Frédérique Lisacek3

1 Atheris Laboratories, Case postale 314, CH-1233 Bernex-Geneva, Switzerland
2 Estonian Biocentre, Riia str. 23, Tartu 51010, Estonia
3 Swiss Institute of Bioinformatics, Proteome Informatics Group, Rue Michel Servet 1, CH-1206 Geneva, Switzerland

§ Corresponding author: Dominique Koua. Phone: +41228500585, Fax: +41228500586


Abstract

Classified into 16 superfamilies, conopeptides are the main components of cone snail venoms and attract growing interest in pharmacology and drug discovery. Currently, conopeptide superfamily attribution is based on a consensus signal sequence from the precursor. While this information is available at the genomic or transcriptomic levels, it is not present in amino acid sequences of mature bioactives generated by proteomic studies. As the number of conopeptide sequences is increasing exponentially with improvement in sequencing techniques, there is a growing need for automating superfamily elucidation. To face this challenge we have created, for the 14 superfamilies represented by more than 5 members, one specific Position-Specific Scoring Matrix (PSSM, also named generalized profile) and one specific Hidden Markov Model (HMM) matching the corresponding signal sequence, propeptide region and mature peptide. Further taking into account disulfide patterns, a total of 50 PSSMs and 47 HMM profiles were created. We confirm that propeptide and mature regions can be used to efficiently classify conopeptides lacking the signal sequence information. Furthermore, the combination of all three region models demonstrated important improvement in the total classification rates. In addition, we establish that PSSM and HMM approaches complement each other for superfamily determination. The 97 proposed models have been validated and were found to exhibit sufficient specificity and offer a straightforward approach for application to large sequence data.

Keywords

Conopeptide; Position Specific Scoring Matrix; Hidden Markov Models; Combined predictions; Protein family prediction

1. Background

1.1. Cone snails, their venoms and conopeptides

Cone snails are carnivorous marine gastropods that have evolved potent venoms to capture their prey, to defend themselves from predators and to deter competitors. Conopeptides, the main components of Conus venoms, represent a unique arsenal of neuropharmacologically active molecules that have been evolutionarily tailored to afford unprecedented and exquisite selectivity for a wide variety of ion-channel subtypes and neuronal receptors [1-5]. Conopeptides are expressed as protein precursors by epithelial cells from the cone snail venom duct [6]. The precursors are conventionally classified into "gene superfamilies" based on their signal sequence similarity. Currently there are 16 major superfamilies, namely: A, D, I1, I2, I3, J, L, M, O1, O2, O3, P, S, T, V and Y [7]. The precursors are generally composed of an N-terminal signal sequence, a central propeptide region and a C-terminal hypervariable mature toxin [7,8]. However, exceptions may occur, such as the I2-superfamily, where the propeptide region is located downstream of the mature toxin [9]. The precursor maturation, by the cleavage of the signal and propeptide from the mature region, leads to a biologically active conopeptide. The peptide maturation process also implies post-translational modifications of some residues, including disulfide bond formation, amidation, or pyroglutamate formation. The number and position of disulfide bonds refine the classification of conopeptides into structural families, and the mode of action of the proteins defines pharmacological or functional families [6,7].

1.2. Conopeptides and their classification

Recently, the popularity of conopeptides has increased due to their potential to treat chronic pain, epilepsy, cardiovascular disease, psychiatric disorders, cancer and stroke [2,3,5,8,10-13]. Consequently, protein sequence naming and classification have become a critical issue, as further studies towards function determination are based on this initial information. The Conoserver database [6,7,14] is a reference conopeptide repository, thanks to an impressive effort in update and annotation. However, due to the improvement of sequencing techniques, the number of published conopeptide sequences is continuously increasing. Crude and milked venom diversity analysis [15-18] and venom gland transcriptome studies [19,20] lead to a challenging amount of precursor and mature peptide datasets. This justifies the design of automatic tools for assigning a new sequence to a conopeptide superfamily. The Conoserver prosequence analyser [14] addresses this question by searching for the signal peptide of a submitted precursor. However, since this prosequence analyser is based on FASTA alignment of signal peptides, it fails to match truncated sequences missing the signal peptide and highly divergent precursors.

A closer look at mature conopeptide alignment in each superfamily reveals both the wide evolutionary diversity of sequences and the occurrence of strongly conserved key residues, mostly cysteines defining a framework with characteristic spacing [1,5]. Consequently, a straightforward BLAST search with an individual mature sequence cannot pull out all sequences already known in the corresponding family. In many cases, conopeptide BLAST-based analysis requires manual processing of free text annotation and of weak matches with key residues in constrained positions. In fact, it has long been known that BLAST-processing of large amounts of sequences generated in high throughput set-ups is not a sensitive approach. This issue was identified and addressed many years ago through the design of automatic classification tools of protein sequences matching functional and/or structural domains. Since then, the accuracy of automatic classification tools has gradually improved, especially in the field of model-to-sequence alignments [21-26].

Computational approaches applicable to the automatic annotation of large datasets rely on protein signatures or

profiles. Thousands of protein sequence patterns and Hidden Markov Models (HMM) are publicly available in databases such as PROSITE [21], PRINTS [22] or PFam [23] allowing protein sequence classification in known families. These databases do contain a few available conopeptide superfamily patterns and profiles but these are incomplete or ambiguous. Up to now, profiling conopeptides was limited to the Conoserver prosequence analyser but this option is not easily amenable to the treatment of large datasets and may not work when the signal sequence is missing, which is often the case in newly identified peptides from venoms. Moreover, indisputable evidence for the relevance of consensus sequences of propeptides and cysteine frameworks supports the claim that the classification of new sequences in conopeptide superfamilies cannot be only based on the signal peptide [27].

Superfamily-specific profiles have so far been defined neither for mature regions nor for propeptides, even in the few available PROSITE patterns and profiles. Furthermore, the construction of hidden Markov models [24-26] and generalized profile searching algorithms [28,29] were shown to be particularly useful for the discrimination of closely related protein families. On the one hand, HMM profiles assign a position-specific scoring system to substitutions, insertions, and deletions. Compared to most sequence alignment algorithms, HMMER is significantly more accurate and detects remote homologues thanks to the strength of the underlying probability models, improved in the latest version [23-26]. On the other hand, generalized profiles rely on building position-specific scoring matrices (PSSMs). The most recent methodology is based on AMSA (Annotated Multiple Sequence Alignment) [28] and increases the discriminative power for better separation of very close protein families. The discriminative ability of PSSMs is also improved by a post-processing competition step in which the profile that produces the best matching score for each query sequence is selected [29]. Other bioinformatics approaches based on support vector machine (SVM) [30,31], and amino acid composition [32] have been proposed for conopeptide classification and were shown to improve classification results compared to standard evolutionary or sequence/alignment based approaches. However, the implementation of SVM methods is not suited to application/usage in high throughput set-ups and cannot realistically be undertaken for protein classification and family prediction in routine analysis. Model-sequence matches remain the most practical choice despite the challenge of combining different approaches of profile-based classification.

This study aims at combining HMMs and generalized profiles (PSSM) for conopeptide superfamily classification based on different precursor regions and its potential application to large sequence datasets. Models were thus built for the 16 known conopeptide superfamilies. This work introduces a methodology and shows subsequent results demonstrating the ability of HMMs and generalized profiles to accurately and reliably predict conopeptide superfamily, not only from complete precursors but also from mature and propeptide regions. In addition, the combination of HMMs and generalized profiles also confirms the quality of model-based superfamily classification. This approach can easily be included in a pipeline for processing large amounts of data and saves long manual analysis of BLAST matches that incidentally also requires the input of experienced researchers.

2. Materials and Methods

2.1. Preparation of model construction data set

Sequences used for generating the models were obtained from the Conoserver database in XML format and parsed using in-house Perl scripts (conoserver_protein.xml, 2012-06-28) [6]. Only complete full-length precursor sequences with gene superfamily annotation were considered. In addition, only the 14 superfamilies having at least 5 full precursor sequences were used in this study: the V superfamily (2 sequences) and Y superfamily (1 sequence) were therefore not considered. The training set consisted of 967 distinct sequences from 14 superfamilies. These 967 sequences were randomly picked from the whole precursor dataset containing 1364 sequences; the training set therefore represents 2/3 of this dataset. Each sequence was divided into 3 parts stored separately: signal sequence, propeptide region and mature peptide. For A, O1 and O2 superfamilies, subsets were created according to the number and/or disposition of cysteines in the mature peptide. This led to the creation of A_4 and A_6 for sequences from the A superfamily having 4 and 6 cysteines respectively; the same applies to O1_6, O1_8, O2_6 and O2_8 for precursors from O1 and O2 superfamilies with six and eight cysteines respectively. Separate files were created for each superfamily and region, resulting in 51 subsets. The sequences were then aligned using MAFFT version 6.707b software [33]. The resulting alignments were manually refined using the JALVIEW 2.5 software [34] in order to reduce sequence redundancy. See Table 1 for the distribution of sequences in training sets after redundancy reduction. The resulting 51 alignments were used to build the models: in FASTA format for PSSM construction and in STOCKHOLM format for HMM. A FASTA file of the aligned sequences used for model training is available as additional file 1.

2.2. Hidden Markov models

Profile HMMs were built for each of the 51 alignments with the hmmbuild program from the HMMER 3.0 package using default parameters. Matches between HMMs and sequence data sets were searched using the hmmsearch program. When searching a sequence set, the E-value significance level was set to 0.1. In-house Perl scripts were implemented to facilitate result visualization and exploitation.

2.3. Generalized profiles

PROSITE generalized profiles were constructed using the pftools package version 2.3. Each model was built directly from the superfamily alignments as done for HMM construction. The generalized profiles were generated using

apsimake (from the pftools package) in a semi-global mode after a weighting of the alignments. The profiles were then calibrated using a randomized version of the UniProtKB/Swiss-Prot database and cutoff values were tuned manually to avoid false positive matches. Finally, special "compete lines" were added in view of the post-processing competition. This competition step allows returning only the profile that produced the best score when more than one profile matched the same protein. The pfsearch and ps_scan scripts were used to perform alignments between profiles and sequence sets. In-house Perl scripts were implemented to facilitate result visualization and exploitation.

2.4. Quality of model-based classification of known conopeptides

The test set was constructed with publicly known conopeptide sequences that were extracted from the Conoserver database (conoserver_protein.xml, 2012-06-28). The final testing set contained 397 full precursor sequences not used for model construction. Training and test sets are completely independent. The test set represents one third (1/3) of the total full annotated precursors extracted from Conoserver; 2/3 having been used for model training. Conopeptide precursor sequences from the test set were also split into 3 parts resulting in a FASTA file containing 397 signal sequences, 397 propeptide regions and 397 mature peptides, leading to a test set consisting of 1191 sequences. Each sequence was annotated with the superfamily it belongs to as well as the corresponding precursor region it represents. The same test set was then suitable to assess selectivity and sensitivity. For example, since all mature sequences for all families are present in the same file, it is possible to check if a model built for a given family is able to discriminate between mature peptide sequences of relatively close families. Moreover, it is also easy to verify that signal-based profiles match neither propeptide regions nor mature peptides.

The selectivity and sensitivity of each model were evaluated using the following formulas:

    Sensitivity = TP/(TP+FN)
    Selectivity = TP/(TP+FP)

where TP: true positive, FN: false negative, FP: false positive.

A FASTA file of the split sequences used for model testing is available as additional file 2. Model matches as well as related sensitivity and selectivity are given in Supplementary file 1.

2.5. Merging predictions from HMM and generalized profiles

2.5.1. HMM-based classification

For HMM-based classification, we adopted the product of signal, pro- and mature peptide region model E-values as final score for each superfamily.

    pHMM_Score(sequence i, superfamilyX) =
        E-value(i, pHMM_X_sig) *
        E-value(i, pHMM_X_pro) *
        E-value(i, pHMM_X_mat),

where each factor is included only when the corresponding E-value was produced.

The final HMM prediction for each sequence was the superfamily producing the lowest HMM score when the scores were significantly different. Otherwise the list of superfamilies having non-significantly different scores was returned.

2.5.2. PSSM-based classification

For generalized profile predictions, we considered the number of matches obtained with models from the same superfamily (signal, pro- or mature region-based model). Each sequence was therefore predicted to belong to the superfamily with the highest PSSM score when the scores were significantly different. Otherwise the list of superfamilies having non-significantly different scores was returned.

    PSSM_Score(sequence i, superfamilyX) =
        HasMatch(i, PSSM_X_sig) +
        HasMatch(i, PSSM_X_pro) +
        HasMatch(i, PSSM_X_mat),

where the boolean function HasMatch(sequence, model) returns 1 if the sequence matched the considered model, or 0 otherwise.

The global combined prediction of a given sequence was the superfamily returned with the highest frequency in the union of the two prediction lists. When no match was reported for a given sequence for both methods, the sequence was tagged as "UNKNOWN". When the union contained more than one superfamily with the same highest prediction frequency, the sequence was tagged as "CONFLICT". Sequences tagged "UNKNOWN" or "CONFLICT" are considered to be wrong predictions in the merging graphical representation (Figure 3).

2.6. BLAST-based superfamily classification of mature peptides

The 397 mature peptide sequences from the testing set were sent to a BLAST search against the NCBI non-redundant database. BLASTP version 2.2.24 [Aug-08-2010] was used with the options -m 7 -I T -P 3 -v 3 -b 3 -a 6 to return an XML containing the first 3 hits for each sequence (the first hit having a high probability of being the submitted sequence). For each submitted sequence, a superfamily was manually attributed based on the hit description of the first BLAST match not being itself: hits with 100% sequence similarity were excluded since the goal is to attribute a classification to new sequences, based on the description of the closest BLAST hit. Results obtained after manual BLAST result checking were compared to the

automatic attribution obtained after the combined HMM+PSSM approach.

3. Results

3.1. HMM and PSSM to extensively cover conopeptide superfamilies

For each of the 14 known conopeptide superfamilies, three separate models based on signal sequences, propeptide regions or mature peptides were built. A total of 97 models were generated: 50 generalized profiles and 47 profile HMMs (Table 2). The models were named according to the superfamily and the region of the precursor they targeted. A number indicating the number of cysteines in the mature peptide was also added in the name when necessary to discriminate between models from the same superfamily. For example, the model built from and for the A superfamily signal region was named A_SIG, while A_4_MAT corresponds to the model for mature peptides of the A superfamily having four cysteines in the mature peptide (typically CC-C-C) and the model A_6_MAT is designed to target the mature region of A superfamily peptides that have six cysteines (Cctx-like peptides). The resulting models were validated on the one hand by searching a randomized database and, on the other hand, were confirmed to match the internal variability of the training set.

3.2. Extension of superfamily classification criteria

The 97 models were applied to classify 397 publicly known conopeptides extracted from the Conoserver database and divided into signal, propeptide and mature sequences.

The quality of the matches was assessed in terms of sensitivity (percentage of true positives in the whole data set of 1191 conopeptide parts) and selectivity (percentage of true positives among all sequences matched with that given model). For each model, all matched sequences are considered and used to evaluate sensitivity and selectivity (Table 2).

For HMMs as well as for generalized profiles, in nearly all superfamilies, models based on propeptide regions and mature peptides showed excellent classification abilities (Table 2 and Figure 1).

As expected, the signal-based models were better predictors than models based on mature peptides and propeptide regions, even if the two latter demonstrated good prediction ability. Since the initial classification rule is based on the signal sequence, the interesting result is that propeptide-based and mature-based models also appeared extremely useful for the classification of sequences missing the signal sequence. This was established previously for profile HMMs [27]; this study confirmed the result for generalized profiles as well.

For instance, regarding the A-superfamily, among the 1191 sequences of the data set, only 43 represented a signal peptide. There were also 86 precursor fragments, some of them for the propeptide region (43) and the others consisting only of the mature peptide (43 sequences). The A_4_MAT generalized profile correctly classified 39 of these sequences with only 2 false positives while the corresponding HMM correctly classified 34 sequences without any mistake. In the case of the M-superfamily, 73 out of the 99 signal-free sequences were successfully classified by both the M_6_MAT generalized profile and the profile HMM. A final example of superfamily classification based on the mature region is the O1 superfamily: among the 133 sequences without any signal sequence, 128 were correctly assigned by the O1_6_MAT generalized profile. The HMM version correctly classified 129 of these mature sequences of O1-conopeptides but with some false positive hits.

Similar examples regarding the ability of classification based on the propeptide region are reported in Table 2.

Globally, the selectivity and sensitivity of generalized profiles (99.42% and 92.81%, respectively) were slightly higher than those of HMM profiles (87.18% selectivity and 89.65% sensitivity). For superfamilies where sensitivity and/or selectivity were not equal, Figures 1a and 1b represent the classification performance of the generalized profiles compared to that of HMMs. Matches obtained from the test set suggested that either the signal peptide, the propeptide or the mature peptide could be used to reliably classify conopeptide sequences into superfamilies (Supplementary file 3).

3.3. Classification based on combination of precursor region models

In the previous section, it was established that models built from the different parts of the precursor were individually suitable to classify superfamilies. It also appeared that for both HMM and generalized profile methods the combined use of the three models (signal, propeptide and mature region) of a given superfamily increased the superfamily assignment efficiency. For example, there were many cases where the propeptide-based model fished out and properly classified sequences that were not matched by the mature peptide based model. This suggests that the combination of these two models can be used to classify conopeptides found in proteomic studies of venom.

In the A-superfamily, when considering fragments not having a signal sequence, the combination of the 2 HMM models (propeptide and mature region based models) could properly classify all the 43 proteins of the test set (34 found by the mature-based model and 41 found by the propeptide-based model). The classification based on the combination of A-superfamily generalized profiles also reliably classified all testing set sequences. Other interesting results of this kind are provided as supplementary material (file 3).

However, it must be noted that most of the complete precursors were matched simultaneously by the 3 models. This also highlights the possibility of recovering some

precursors with a divergent signal, propeptide or mature region. For example, the UniProtKB/TrEMBL entry [Uniprot: B3SVF1], which cannot be classified by the Conoserver prosequence analyser, was realistically classified as an M-superfamily conopeptide.

Some sequences of the testing set remained unclassified or misclassified by either HMMs or generalized profiles. Most of them were mature sequences sharing the same cysteine framework (for instance, A-conotoxins were classified as T). As expected, very close cysteine frameworks remained a decisive issue when performing a classification based on mature peptides.

3.4. Merging predictions from HMM and generalized profiles

After providing confirmation of the ability of conotoxin classification based on signal, propeptide and mature regions (section 3.2), we also established the classification improvement gained when merging separate predictions obtained by the region-based models (section 3.3). The latter led to a global HMM-based and a global PSSM-based prediction/classification.

In addition, the study gives evidence that the global classification/prediction of conopeptides into superfamilies was significantly improved after merging HMM and PSSM based classifications. The matches obtained from the separate combination of the three HMMs on the one hand and, on the other hand, from the combination of the three generalized profiles for each superfamily were processed together as indicated in the "methods" section. The merge of HMM and generalized profile predictions significantly increased the number of distinct proteins correctly predicted/classified for each superfamily.

Different situations occurred when combining the results from the two approaches. Figure 2 illustrates the combined classification of mature peptides from the testing set. In addition to the 3 sequences matched in common, the I1-superfamily HMM profile identified one sequence not classified by generalized profiles whereas the opposite was observed for the A-superfamily. In I2-, O1- and O2-superfamilies, HMMs and generalized profiles detected some sequences in a mutually exclusive manner. After global merging, the number of distinct sequences correctly classified was considerably improved on the testing set. In the M-superfamily, 89% of correct classification was achieved while individual use was 87.9% and 73.7% for generalized profile and for HMM-based classification, respectively. In the O2- and I2-superfamily, the combination of predictions resulted in a correct classification rate of 100% of sequences in the testing set. The PSSM approach proved more predictive for highly variable motifs. For instance, for the T-superfamily, the global HMM prediction only classified 45% of the testing set while the global PSSM prediction resolved the classification of 85% of the submitted sequences. Something similar occurred for the hypervariable M-superfamily mature peptide region. However, the PSSM models failed to classify D-superfamily sequences, which were also detected with nearly the same scores by the M-superfamily models, resulting in a conflict. The combination of HMM and PSSM allows disambiguation since the HMM properly classifies these sequences as D-superfamily conopeptides.

Interestingly, a few models identified some conopeptide families that were not previously classified in any superfamily. Conomarphin sequences were fished out by the M-superfamily signal model, the contryphans were identified by the M-superfamily propeptide model and the bromosleeper sequence was matched by the O3-superfamily mature model. At least for conomarphins, results clearly indicate that this conopeptide family is derived from the M-superfamily.

3.5. Assessing model specificity on the UniProtKB and random database

In view of their application to large datasets that may not be restricted to conopeptide-related sequences, the models were tested on the complete UniProtKB database in order to assess specificity. Results indicated that the models are suitable for conopeptide superfamily identification. For most of the profiles, the matching list contained only conopeptides from the targeted superfamily. In some generalized profiles, cutoff values required some adjustment to exclude false positives with weak scores since the best scores always corresponded to the targeted superfamily. Some of the results suggested interesting sequence similarities between conopeptides and other categories of proteins. These cases remain to be clarified. Finally, when searched against a completely randomized version of UniProtKB (UniProt sequences shuffled using a window of 5 amino acids), the models demonstrated very good specificity since they did not significantly match any random sequence.

3.6. Comparison to BLAST-based superfamily prediction

As the BLAST tool is a standard approach that can be used for superfamily classification, a comparison was made with our models. The subset of 253 mature sequences from the testing set was submitted to BLAST against the non-redundant NCBI database. Manual superfamily attribution based on the BLAST results was compared to the automatic prediction obtained with our model-based approach. It appeared that BLAST-based superfamily attribution rapidly became a tedious task that could only be carried out by experienced users. The hit description section of the BLAST XML output is a free text section without any controlled vocabulary. It is therefore frequent to find lots of synonyms for a given superfamily. The hit annotation gives either gene superfamily related information, or scaffold related information with roman or arabic annotation, as well as pharmacological family information. For example, "mu-O-conotoxin", "scaffold VI/VII", "6.1", "delta-conotoxin", "omega-conotoxin", "O-superfamily" and "superfamily O" were found to all stand for the O-superfamily. Some more

complicated cases were found, like “gi|12619395|gb|AAG60359.1|AF214931_1 conotoxin scaffold III/IV precursor”, where scaffold III sequences belong to the A-superfamily and scaffold IV precursors belong to the M-superfamily.

It is not always easy to deduce the right superfamily except when the BLAST hit description contains the precise assignment to the correct superfamily. In most cases, the superfamily assignment was imprecise (for instance, I-superfamily instead of I1-, I2- or I3-, and O-superfamily instead of the precise O1-, O2- or O3-superfamilies). However, when evident, superfamilies were deduced from annotation not containing the precise superfamily. For instance “alpha-conotoxin” was interpreted as A superfamily and “mu-conotoxin” was accepted as M superfamily. In a few cases, the BLAST hit annotation only allowed the detection of a conopeptide, with no indication of the superfamily membership. Finally, in very rare cases, the best BLAST hit did not belong to a cone snail sequence.

The BLAST-based classification is therefore not easily applicable for automated annotation/prediction of large datasets. Automated BLAST result parsing would require handling too many exception cases. Additionally, BLAST-based superfamily prediction appeared less efficient than the model-based strategy. In most cases, the combined HMM/PSSM strategy allowed precise annotation of O1-, O2- and I1-superfamilies while the corresponding BLAST hits only allowed O- or I-superfamilies to be deduced. Our new approach was also able to classify some sequences for which no BLAST match was found with the default e-value (see supplementary data file 4).

4. Conclusions

This study established that conopeptide superfamily classification and identification can reliably be achieved based on the mature and propeptide regions, and that access to the signal peptide is not a prerequisite. The combination of hidden Markov models and generalized profiles appeared as an efficient approach to perform an extensive classification and/or prediction of conopeptide sequences into superfamilies or families. A specificity test on a large sequence dataset and a final comparison with a BLAST-based approach indicated the usefulness of the designed models. Each model built in this study demonstrated very high discriminative abilities, with high sensitivity and selectivity in superfamily classification. We obtained very high specificity when searching the whole UniProtKB database as well as excellent selectivity and sensitivity for closely related conopeptide superfamilies. For the first time, a method combining PSSM and HMM profiles built on signal, propeptide or mature sequences has been developed and validated for correct superfamily prediction of conopeptides, thus allowing the superfamily classification to be extended to signal-free sequences. In view of the increasing number of sequences deriving from genomic, transcriptomic and proteomic studies, this combined prediction approach opens up new prospects for the annotation of conopeptides and other toxin families. This combined approach has been implemented as a web tool for conopeptide classification (35).

Authors' contributions

DK prepared the sequence alignment including manual editing, built and calibrated PSSMs, wrote scripts for data analysis, interpreted the results, drafted and corrected the manuscript. SL contributed to data acquisition, designed HMMs, analysed HMM matching results and was involved in the manuscript drafting. LK contributed to data acquisition and HMM results analysis. PF annotated sequence data, contributed to results interpretation and revised the manuscript. RS, MR and FL contributed to the analysis, conception and design, critical manuscript revisions and final approval.

Acknowledgements

We are most grateful to Estelle Bianchi, Daniel Biass, Nicolas Hulo and Christian Sigrist for expert assistance.

Funding: This work was supported by the European Commission: CONCO, the cone snail genome project for health (LSHB-CT-2007-037592; www.conco.eu) and the European Regional Development Fund (Estonian Centre of Excellence in Genomics).

References

[1] B.M. Olivera, L.J. Cruz, Conotoxins, in retrospect. Toxicon 39 (2001) 7-14.
[2] H. Terlau, B. Olivera, Conus venoms: a rich source of novel ion channel-targeted peptides. Physiol. Rev. 84 (2004) 41-68.
[3] P. Favreau, R. Stöcklin, Marine snail venoms: use and trends in receptor and channel neuropharmacology. Curr. Opin. Pharmacol. (2009) 594-601.
[4] J.W. Blunt, B.R. Copp, R.A. Keyzers, M.H. Munro, M.R. Prinsep, Marine natural products. Nat. Prod. Rep. 29 (2012) 144-222.
[5] R.J. Lewis, S. Dutertre, I. Vetter, M.J. Christie, Conus venom Peptide pharmacology. Pharmacol. Rev. 64 (2012) 259-298.
[6] Q. Kaas, J. Westermann, R. Halai, C. Wang, D. Craik, ConoServer, a database for conopeptide sequences and structures. Bioinformatics 24 (2008) 445-446.
[7] Q. Kaas, J. Westermann, D.J. Craik, Conopeptide characterization and classifications: an analysis using ConoServer. Toxicon 55 (2010) 1491-1509.
[8] R. Jones, G. Bulaj, Conotoxins - new vistas for peptide therapeutics. Curr. Pharm. Des 6 (2000) 1249-1285.
[9] M. Brown, G. Begley, E. Czerwiec, L. Stenberg, M. Jacobs, D. Kalume, P. Roepstorff, J. Stenflo, B. Furie, Precursors of novel Gla-containing conotoxins contain a carboxy-terminal recognition site that directs gamma-carboxylation. Biochemistry 44 (2005) 9150-9159.
[10] R. Halai, D.J. Craik, Conotoxins: natural product drug leads. Nat. Prod. Rep. 26 (2009) 526-536.

[11] A.L. Harvey, R. Stöcklin, From venoms to drugs: Introduction. Toxicon 59 (2012) 433.
[12] G.F. King, Venoms as a platform for human drugs: translating toxins into therapeutics. Expert Opin. Biol. Ther. 11 (2011) 1469-1484.
[13] T.S. Han, R.W. Teichert, B.M. Olivera, G. Bulaj, Conus venoms - a rich source of peptide-based therapeutics. Curr. Pharm. Des. 14 (2008) 2462-2479.
[14] Q. Kaas, R. Yu, A.H. Jin, S. Dutertre, D.J. Craik, ConoServer: updated content, knowledge, and discovery tools in the conopeptide database. Nucleic Acids Res. 40 (2012) D325-D330.
[15] B.M. Ueberheide, D. Fenyö, P.F. Alewood, B.T. Chait, Rapid sensitive analysis of cysteine rich peptide venom components. Proc. Natl. Acad. Sci. USA 106 (2009) 6910-6915.
[16] D. Biass, S. Dutertre, A. Gerbault, J.-L. Menou, R. Offord, P. Favreau, R. Stöcklin, Comparative proteomic study of the venom of the piscivorous cone snail Conus consors. J. Proteomics 72 (2009) 210-218.
[17] S. Dutertre, D. Biass, R. Stöcklin, P. Favreau, Dramatic intraspecimen variations within the injected venom of Conus consors: an unsuspected contribution to venom diversity. Toxicon 55 (2010) 1453-1462.
[18] H. Safavi-Hemami, W.A. Siero, D.G. Gorasia, N.D. Young, D. Macmillan, N.A. Williamson, A.W. Purcell, Specialisation of the venom gland proteome in predatory cone snails reveals functional diversification of the conotoxin biosynthetic pathway. J. Proteome Res. 10 (2011) 3904-3919.
[19] H. Hu, P.K. Bandyopadhyay, B.M. Olivera, M. Yandell, Characterization of the Conus bullatus genome and its venom-duct transcriptome. BMC Genomics 12 (2011) 60.
[20] Y. Terrat, D. Biass, S. Dutertre, P. Favreau, M. Remm, R. Stöcklin, D. Piquemal, F. Ducancel, High-resolution picture of a venom gland transcriptome: case study with the marine snail Conus consors. Toxicon 59 (2012) 34-46.
[21] C.J.A. Sigrist, L. Cerutti, E. de Castro, P.S. Langendijk-Genevaux, V. Bulliard, A. Bairoch, N. Hulo, PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38 (2010) D161-D166.
[22] T.K. Attwood, P. Bradley, D.R. Flower, A. Gaulton, N. Maudling, A.L. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, C. Zygouri, PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31 (2003) 400-402.
[23] M. Punta, P.C. Coggill, R.Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E.L. Sonnhammer, S.R. Eddy, A. Bateman, R.D. Finn, The Pfam protein families database. Nucleic Acids Res. 40 (2012) D290-301.
[24] L.S. Johnson, S.R. Eddy, E. Portugaly, Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11 (2010) 431.
[25] S.R. Eddy, Accelerated Profile HMM Searches. PLoS Comput. Biol. 10 (2011) e1002195.
[26] R. Durbin, S.R. Eddy, A. Krogh, G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press (1998).
[27] S. Laht, D. Koua, L. Kaplinski, F. Lisacek, R. Stöcklin, M. Remm, Identification and classification of conopeptides using profile Hidden Markov Models. Biochim. Biophys. Acta. 1824 (2012) 488-492.
[28] D. Koua, L. Cerutti, L. Falquet, C.J.A. Sigrist, G. Theiler, N. Hulo, C. Dunand, PeroxiBase: a database with new tools for peroxidase family classification. Nucleic Acids Res. 37 (2009) D261-D266.
[29] M. Oliva, G. Theiler, M. Zamocky, D. Koua, M. Margis-Pinheiro, F. Passardi, C. Dunand, PeroxiBase: a powerful tool to collect and analyse peroxidase sequences from Viridiplantae. J. Exp. Bot. 60 (2009) 453-459.
[30] S. Mondal, R. Bhavna, R.M. Babu, S. Ramakumar, Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. Journal of Theoretical Biology. 243 (2006) 252-260.
[31] N. Zaki, S. Wolfsheimer, G. Nuel, S. Khuri, Conotoxin protein classification using free scores of words and support vector machines, BMC Bioinforma. 12 (2011) 217.
[32] H. Lin, Q.Z. Li, Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant, Biochem. Biophys. Res. Commun. 354 (2007) 548–551.
[33] K. Katoh, K. Ichi Kuma, T. Miyata, H. Toh, Improvement in the accuracy of multiple sequence alignment program MAFFT. Genome Inform. 16 (2005) 22-33.
[34] A.M. Waterhouse, J.B. Procter, D.M.A. Martin, M. Clamp, G.J. Barton, Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25 (2009) 1189-1191.
[35] D. Koua, A. Brauer, S. Laht, L. Kaplinski, P. Favreau, M. Remm, F. Lisacek, R. Stöcklin, ConoDictor: a tool for prediction of conopeptide superfamilies, Nucleic Acids Res. 2012 Jul;40(Web Server issue):W238-41. Epub 2012 May 31.

Figures: titles (first lines) and footnote descriptions

Figure 1A - Sensitivity comparison of built PSSMs and HMMs for conopeptide precursor regions. Each bar represents the percentage of the testing set sequences that was matched by the model of the considered superfamily. Sensitivity of the models not shown is available in the supplementary materials.

Figure 1B - Selectivity comparison of built PSSMs and HMMs for conopeptide precursor regions. Each bar represents the percentage of true positives among the sequences returned by the considered model. Selectivity of the models not shown is available in the supplementary materials.
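The two measures plotted in Figures 1A and 1B can be written down directly from their captions. The Python sketch below is only an illustration of those definitions: the function names, the use of sequence identifiers as plain strings and the example identifiers are assumptions, not part of the published pipeline.

```python
def sensitivity(matched_ids, family_test_ids):
    """Percentage of the superfamily's test sequences matched by its model."""
    family = set(family_test_ids)
    if not family:
        raise ValueError("the test set of this superfamily is empty")
    return 100.0 * len(family & set(matched_ids)) / len(family)


def selectivity(matched_ids, family_test_ids):
    """Percentage of true positives among the sequences returned by the model."""
    matched = set(matched_ids)
    if not matched:
        return None  # undefined when the model returns no sequence at all
    return 100.0 * len(matched & set(family_test_ids)) / len(matched)


# Hypothetical example: four test sequences of one superfamily, and a model
# that returns all four plus one sequence from another superfamily.
family_ids = ["t1", "t2", "t3", "t4"]
returned_ids = ["t1", "t2", "t3", "t4", "x9"]
sens = sensitivity(returned_ids, family_ids)
sel = selectivity(returned_ids, family_ids)
```

In this example the model is fully sensitive but not fully selective, which is the situation the cut-off tuning of generalized profiles is meant to limit.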

Figure 2 - Merging of prediction from HMMs and generalized profiles for the classification of mature peptide of conopeptides from the testing set. For almost all superfamilies, combining HMM-based and generalized profile-based predictions led to a correct classification of nearly 95% of submitted sequences. The green+blue+yellow region indicates the domain of correct predictions for the combined approach: at least one approach is correctly predicting the superfamily. The red region corresponds to bad predictions or conflicting predictions.
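The regions of Figure 2 amount to a per-sequence tally of which approach recovered the expected superfamily. A minimal Python sketch follows; the function names, the dictionary inputs and the region labels (`both`, `hmm_only`, `pssm_only`, `neither`) are illustrative assumptions, and no mapping to the figure's colours is implied beyond `neither` standing for the red region. The example reproduces the I1 mature-peptide case described in the text (four test sequences, three matched by both approaches, one by the HMM only); the sequence identifiers are hypothetical.

```python
from collections import Counter


def tally_regions(truth, hmm_pred, pssm_pred):
    """Count test sequences by which approach predicted the expected superfamily."""
    counts = Counter()
    for seq_id, expected in truth.items():
        hmm_ok = hmm_pred.get(seq_id) == expected
        pssm_ok = pssm_pred.get(seq_id) == expected
        if hmm_ok and pssm_ok:
            counts["both"] += 1
        elif hmm_ok:
            counts["hmm_only"] += 1
        elif pssm_ok:
            counts["pssm_only"] += 1
        else:
            counts["neither"] += 1  # wrong, conflicting or missing predictions
    return counts


def combined_correct_rate(counts):
    """Percentage of sequences correctly predicted by at least one approach."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences were tallied")
    return 100.0 * (total - counts["neither"]) / total


truth = {"s1": "I1", "s2": "I1", "s3": "I1", "s4": "I1"}
hmm = {"s1": "I1", "s2": "I1", "s3": "I1", "s4": "I1"}
pssm = {"s1": "I1", "s2": "I1", "s3": "I1"}  # s4 not classified by the profile
regions = tally_regions(truth, hmm, pssm)
rate = combined_correct_rate(regions)
```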

Tables

Table 1. Distribution of sequences in the training and test sets.

               Signal sequence     Propeptide region   Mature peptide
Superfamilies  Training   Test     Training   Test     Training   Test
A_4                  26     43           83     43           81     43
A_6                  22     10           18     10           16     10
D                     6     10           17     10           18     10
I1                    6      4            6      4            6      4
I2                   46     15           46     15           46     15
I3                    3      2            3      2            3      2
J                     8      4            8      4            8      4
L                     5      2            5      2            5      2
M                   194     99          194     99          194     99
O1_6                259    133          259    133          259    133
O1_8                  5      2            5      2            5      2
O2_6                 30     17           30     17           30     17
O2_8                  6      4            6      4            6      4
O3                   16      8           16      8           16      8
P                     5      2            5      2            5      2
S                     5      2            5      2            5      2
T                    80     40           80     40           80     40
Total               722    397          765    397          742    397

Table 2. Distribution of matches in the test set.

Signal sequence Propeptide region Mature peptide

Superfamilies HMM PSSM HMM PSSM HMM PSSM

A_4 A_SIG* A_SIG* A_PRO* 2 A_6_PRO 34 A_4_MAT 1 O1_6_MAT 43 A_4_SIG 39 A_4_SIG 10 A_6_PRO 42 A_4_PRO 39 A_4_MAT 10 A_6_SIG 6 A_6_SIG 41 A_4_PRO 1 A_6_MAT A_6 - - - 8 A_6_PRO 8 A_6_MAT 8 A_6_MAT D 1 M_6_SIG 10 D_10_SIG 7 D_10_PRO 10 D_10_PRO 10 D_10_MAT 10 D_10_MAT 1 I1_8_SIG 10 D_10_SIG I1 4 I1_8_SIG 4 I1_8_SIG 1 I1_8_PRO 1 I1_8_PRO 4 I1_8_MAT 3 I1_8_MAT 18 O1_6_MAT 5 O2_6_MAT 8 I2_8_MAT 2 I3_8_MAT I2 15 I2_8_SIG 15 I2_8_SIG 4 I2_8_PRO 12 I2_8_PRO 2 I1_8_MAT 14 I2_8_MAT 2 O1_6_PRO 43 O1_6_MAT 7 O2_6_MAT 13 I2_8_MAT 2 I3_8_MAT I3 1 I1_8_SIG 2 I3_8_SIG 2 I3_8_PRO 2 I3_8_PRO 2 I1_8_MAT 2 I3_8_MAT 2 I3_8_SIG 2 O1_6_MAT 3 I2_8_MAT 2 I3_8_MAT J 4 J_4_SIG 4 J_4_SIG 4 J_4_PRO 4 J_4_PRO 4 J_4_MAT 4 J_4_MAT

L 1 I1_8_SIG 2 L_4_SIG 1 L_4_PRO 2 L_4_PRO 1 L_4_MAT 1 L_4_MAT 2 O1_6_SIG 2 L_4_SIG M 99 M_6_SIG 99 M_6_SIG 97 M_6_PRO 97 M_6_PRO 72 M_6_MAT 73 M_6_MAT 1 M_6_MAT 1 M_6_MAT

O1_6 O1_SIG* 133 O1_6_SIG 128 O1_6_PRO 131 O1_6_PRO 129 O1_6_MAT 128 O1_6_MAT 2 O1_8_SIG 1 O2_8_MAT 133 O1_6_SIG 7 O2_6_MAT 3 I2_8_MAT O1_8 - 2 O1_8_SIG 2 O1_8_PRO 2 O1_8_PRO 2 O1_8_MAT 2 O1_8_MAT O2_6 O2_SIG* 17 O2_6_SIG 17 O2_6_PRO 17 O2_6_PRO 24 O1_6_MAT 17 O2_6_MAT 4 O2_8_SIG 1 O2_8_MAT 17 O2_6_SIG 17 O2_6_MAT 7 I2_8_MAT O2_8 - 4 O2_8_SIG 4 O2_8_PRO 4 O2_8_PRO 9 O1_6_MAT 4 O2_8_MAT 4 O2_8_MAT 2 I2_8_MAT O3 1 M_6_SIG 8 O3_6_SIG 7 O3_6_PRO 7 O3_6_PRO 7 O3_6_MAT 7 O3_6_MAT 8 O3_6_SIG P 2 P_6_SIG 2 P_6_SIG 2 P_6_PRO 2 P_6_PRO 1 P_6_MAT 1 P_6_MAT S 34 M_6_SIG 2 S_10_SIG 2 S_10_PRO 2 S_10_PRO 2 S_10_MAT 2 S_10_MAT 2 S_10_SIG T 40 T_4_SIG 40 T_4_SIG 1 T_4_MAT 2 T_4_MAT 18 T_4_MAT 33 T_4_MAT 37 T_4_PRO 37 T_4_PRO

*: For the superfamilies concerned, a single model was built for the corresponding region.

Additional files

Additional file 1 – FASTA file of the training set (bba_training_aln.fas).

Additional file 2 – FASTA file of the split sequences used for model testing (conoserv_parts_testset.fas)

Additional file 3 – Detailed matching results for conopeptide sequences from the test set against the newly built HMMs and PSSMs. Also contains the data used to produce the published figures: classification of test set conopeptides with the built profiles, number of sequences matched after HMM and generalized profile matching, sensitivity and selectivity evaluation. (Stats_test_conotox_bba.xls)

Additional file 4 – BLAST-based superfamily prediction for conopeptide sequences described at the mature peptide level only, compared to the prediction obtained with the new approach (testset_blast_nr_Res.xls)
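The manual attribution summarised in Additional file 4 relied on reading free-text BLAST hit descriptions. The Python sketch below shows how the synonyms quoted in section 3.6 could be normalised automatically, and why ambiguous descriptions still need manual review. The pattern table and the function name are hypothetical and cover only the examples given in the text; this is not the procedure the authors used, and bare labels such as “6.1” are deliberately left out because they cannot be matched safely.

```python
import re

# Hypothetical synonym table built only from the examples quoted in section 3.6.
SYNONYMS = [
    (r"\bmu-o-conotoxin\b", "O"),
    (r"\bscaffold vi/vii\b", "O"),
    (r"\b(delta|omega)-conotoxin\b", "O"),
    (r"\bo-superfamily\b|\bsuperfamily o\b", "O"),
    (r"\balpha-conotoxin\b", "A"),
    (r"\bmu-conotoxin\b", "M"),
]


def superfamily_from_description(description):
    """Return a superfamily letter, or None when the text is silent or ambiguous."""
    text = description.lower()
    found = {family for pattern, family in SYNONYMS if re.search(pattern, text)}
    if len(found) == 1:
        return found.pop()
    return None  # no known synonym, or contradictory ones: leave for manual review


labels = [
    superfamily_from_description(text)
    for text in (
        "mu-O-conotoxin",
        "superfamily O",
        "alpha-conotoxin",
        "conotoxin scaffold III/IV precursor",
    )
]
```

The last description is the mixed scaffold III/IV case from the text; the sketch returns `None` for it rather than guessing between the A- and M-superfamilies.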

Appendix 5. ConoDictor: a tool for prediction of conopeptide superfamilies.

Nucleic Acids Research Advance Access published May 31, 2012
Nucleic Acids Research, 2012, 1–4
doi:10.1093/nar/gks337

ConoDictor: a tool for prediction of conopeptide superfamilies

Dominique Koua1,2,*, Age Brauer3, Silja Laht3, Lauris Kaplinski3, Philippe Favreau1, Maido Remm3, Frédérique Lisacek2 and Reto Stöcklin1

1Atheris Laboratories, Case postale 314, CH-1233 Bernex-Geneva, Switzerland, 2Proteome Informatics Group, Swiss Institute of Bioinformatics, CH-1211 Geneva, Switzerland and 3Bioinformatics Workgroup, Estonian Biocentre, EE-51010 Tartu, Estonia

Received January 27, 2012; Revised March 27, 2012; Accepted April 4, 2012

ABSTRACT

ConoDictor is a tool that enables fast and accurate classification of conopeptides into superfamilies based on their amino acid sequence. ConoDictor combines predictions from two complementary approaches: profile hidden Markov models and generalized profiles. Results appear in a browser as tables that can be downloaded in various formats. This application is particularly valuable in view of the exponentially increasing number of conopeptides that are being identified. ConoDictor was written in Perl using the common gateway interface module with a php submission page. Sequence matching is performed with hmmsearch from HMMER 3 and ps_scan.pl from the pftools 2.3 package. ConoDictor is freely accessible at http://conco.ebc.ee.

INTRODUCTION

Conopeptides are the main bioactive component of cone snail venom. These marine animals produce complex venoms that contain hundreds of peptides and proteins. Recently, conopeptides have attracted a great deal of interest as a result of their selectivity for, and potent effects on, ion channels and receptors (1,2). Most are cysteine-knotted peptides that have been classified into superfamilies and families based on their structural or functional features (3,4). To date, >1500 non-redundant conopeptide sequences are stored in public databases and this number is increasing exponentially. Conopeptides are classified into ‘gene superfamilies’ based on their signal sequence. Currently, there are 16 major superfamilies, namely: A, D, I1, I2, I3, J, L, M, O1, O2, O3, P, S, T, V and Y. The precursors generally contain an N-terminal signal sequence, a central propeptide region and a C-terminal hypervariable mature toxin (4,5).

The Conoserver database (http://www.conoserver.org/) is a repository of nucleic acid and protein sequences, and of structural information on conopeptides (4). The naming and classification of new conopeptide protein sequences has become an important issue because of the sharp increase in the number of new conopeptides being identified, and because studies to determine the peptide’s functional characteristics are based on this classification. The Conoserver prosequence analyzer (ConoPrec) is the most specific web tool available for elucidation of conopeptide class. It provides hints based on the signal peptide sequence of the submitted precursor (6). However, this tool does not work when the signal sequence is missing, which is often the case with conopeptides identified by proteomic and mass spectrometry studies of toxins identified as mature bioactive peptides in venom or dissected venom gland. As data generated by spreading venom high-throughput omics is notoriously incomplete, the classification of new sequences into conotoxin superfamilies should not be restricted to the signal peptide sequence. There is indisputable evidence for the relevance of consensus sequences of propeptides and cysteine frameworks in conopeptide sequences. Thus, the inclusion of these criteria should also be considered for conopeptide classification. We recently demonstrated the reliability of conopeptide family prediction and classification based on profile hidden Markov models (pHMM) of propeptides and mature peptides (7).

ConoDictor has been developed in the context of the CONCO project (www.conco.eu) and is a web-based tool that exploits pHMMs and position-specific scoring matrix (PSSM, also known as generalized profiles) to classify conopeptide into superfamilies based on their amino acid sequence. ConoDictor is a user-friendly tool that meets users’ demands for an easy-to-use environment for sequence classification and superfamily prediction. As a fully automated tool, ConoDictor provides classification results that must be checked by users.

*To whom correspondence should be addressed. Tel: +41 228500585; Fax: +41 228500586; Email: [email protected]

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

MATERIALS AND METHODS

Preparation of the model data set

Sequences used for generating the models were obtained from Conoserver. Only precursor sequences with gene superfamily annotation were considered. The training set consisted of 933 sequences. Each sequence was manually annotated with the gene superfamily classification after checking the classification provided by Conoserver. Each sequence was divided into three parts, which were stored separately: signal, propeptide and mature peptide. Separate files were also created for each of the 16 superfamilies. The sequences were then aligned using the MAFFT version 6.707b software. The alignments were manually refined when necessary using the JALVIEW 2.5 software, and the resulting 48 alignments were used to build the models.

Hidden Markov models

We previously described pHMM ability for conopeptide classification (7). We constructed pHMMs for each of the 48 alignments using the hmmbuild script from the HMMER 3.0 package (8,9). Matches between pHMMs and the sequence data set were searched using the hmmsearch script with an e-value significance level set to 0.1.

Generalized profiles (PSSM)

Generalized profiles were constructed using the pftool package, version 2.3. The most recent methodology based on annotated multiple sequence alignment (AMSA) was used. The generalized profiles were generated using apsimake in a semi-global mode after weighing of alignments. The resulting models were calibrated against randomized sequences and cut-off values tuned manually. These approaches have already been validated for classification of other proteins (10,11).

Testing of models on known conopeptides

The test set was constructed from publicly available conopeptide sequences extracted from the NCBI Protein database and UniProtKB (release 2010_11). The test set contained 1225 manually curated sequences. Sequences were manually annotated and assigned to the relevant superfamily according to UniProtKB annotations, cysteine frameworks and sequence similarity. Sequences not belonging to any superfamily were added to the test set as negative controls.

(i) For pHMM-based classification, we adopted the product of E-values as final score:

$$\mathrm{pHMM\ Score}(\text{sequence } i, \text{superfamily } X) = \mathrm{Evalue}(i, \mathrm{pHMM}\ X_{\mathrm{sig}}) \times \mathrm{Evalue}(i, \mathrm{pHMM}\ X_{\mathrm{pro}}) \times \mathrm{Evalue}(i, \mathrm{pHMM}\ X_{\mathrm{mat}}),$$

provided that the corresponding E-value exists. A sequence was predicted to belong to the superfamily with the smallest pHMM score when this score was at least one hundred times lower than that of any other superfamily. If the difference was smaller, the sequence was predicted as ‘CONFLICT’ for the pHMM. When no score was generated for any superfamily, the sequence was tagged ‘UNKNOWN’.

(ii) For generalized profile predictions, it is not possible to compare and merge scores obtained from separate profiles. The PSSM prediction score for a sequence is the number of models of one superfamily (1–3) that match the sequence:

$$\mathrm{PSSM\ Score}(\text{sequence } i, \text{superfamily } X) = \mathrm{HasMatch}(i, \mathrm{PSSM}\ X_{\mathrm{sig}}) + \mathrm{HasMatch}(i, \mathrm{PSSM}\ X_{\mathrm{pro}}) + \mathrm{HasMatch}(i, \mathrm{PSSM}\ X_{\mathrm{mat}}),$$

where the boolean function HasMatch(sequence, model) returns 1 if the sequence matched the considered model, or 0 otherwise. The sequence is predicted to belong to the superfamily with the highest score. If two or more superfamilies have the same score, the sequence is tagged as ‘CONFLICT’, and the list of conflicting families is returned. When no match is reported for a given sequence, the sequence is tagged ‘UNKNOWN’.

Match lists of pHMMs and PSSMs are merged, and each prediction is weighted according to its frequency. The combined prediction is the superfamily with the highest frequency. When the highest frequency is linked to more than one superfamily, the sequence is tagged ‘CONFLICT’. When no match is reported for either method, the sequence is tagged ‘UNKNOWN’. Even if HMM and PSSM are very robust classification approaches, the reduced size of learning set in some families and/or the underlying scoring system can justify rare cases of misclassification. The ‘CONFLICT’ and ‘UNKNOWN’ tag can therefore represent not modelled families (may be new ones) or divergent sequences from an existing family. In any case, all classifications have to be validated by users before being used for further studies.
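The three decision rules described above (product of E-values with a hundredfold margin, count of matching profiles, and frequency-based merging) can be sketched in a few lines of Python. This is an illustrative reading of the text, not ConoDictor's Perl code: the function names, the input structures and the treatment of the merged match list as one superfamily label per matching model are all assumptions.

```python
from collections import Counter
from math import prod


def phmm_predict(evalues, ratio=100.0):
    """evalues maps a superfamily to the E-values of its matching sig/pro/mat pHMMs."""
    scores = {fam: prod(vals) for fam, vals in evalues.items() if vals}
    if not scores:
        return "UNKNOWN"
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_family, best_score = ranked[0]
    if len(ranked) > 1:
        runner_up = ranked[1][1]
        # The best score must be at least `ratio` times lower than any other.
        if runner_up == best_score or runner_up < best_score * ratio:
            return "CONFLICT"
    return best_family


def pssm_predict(matches):
    """matches maps a superfamily to the set of its regions ('sig', 'pro', 'mat') that matched."""
    scores = {fam: len(regions) for fam, regions in matches.items() if regions}
    if not scores:
        return "UNKNOWN", []
    top = max(scores.values())
    winners = sorted(fam for fam, score in scores.items() if score == top)
    if len(winners) > 1:
        return "CONFLICT", winners  # the conflicting families are reported
    return winners[0], winners


def combined_predict(phmm_hits, pssm_hits):
    """Each argument lists one superfamily label per matching model (assumed format)."""
    frequency = Counter(phmm_hits) + Counter(pssm_hits)
    if not frequency:
        return "UNKNOWN"
    top = max(frequency.values())
    winners = [fam for fam, count in frequency.items() if count == top]
    return winners[0] if len(winners) == 1 else "CONFLICT"


# Hypothetical E-values and matches for a single query sequence.
phmm_clear = phmm_predict({"M": [1e-10, 1e-5], "T": [1e-3]})
phmm_close = phmm_predict({"O1": [1e-6], "O2": [1e-5]})
pssm_clear = pssm_predict({"A": {"sig", "pro"}, "T": {"mat"}})
pssm_tied = pssm_predict({"I1": {"mat"}, "I2": {"mat"}})
merged = combined_predict(["M", "M"], ["M", "T"])
```

Because hmmsearch is run with an E-value threshold of 0.1, every retained E-value is below 1, so a superfamily matched by more region models tends to obtain a smaller product; the margin parameter `ratio` mirrors the hundredfold rule stated in the text.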

ConoDictor implementation

Input sequences are first classified using pHMMs and PSSMs separately. pHMM models of signal (X_sig), propeptide (X_pro) and mature peptide (X_mat) are used in parallel and corresponding predictions are combined. The same process is applied with PSSM models. Resulting pHMM and PSSM classifications are merged to produce a global combined classification.

RESULTS

Conopeptide models

For each of the 16 known conotoxin superfamilies, three separate models based on signal, pro- and mature peptides were built, providing a total of 48 hidden Markov models and 48 generalized profiles. The models were named according to the superfamily and the region of the precursor that they targeted. Each model demonstrated very
good discriminative abilities, with high sensitivity (95%) and selectivity (99%) [(7) and Koua et al., unpublished data]. When tested using known conopeptide sequences, these models enabled extensive and reliable classification even between superfamilies containing mature peptides with high sequence similarities. The models provided good evidence of complementarity between signal, pro- and mature peptide sequences for superfamily determination, as well as complementarity between pHMMs and generalized profiles (Koua et al., submitted).

ConoDictor input interface

ConoDictor accepts amino acid sequences in FASTA format as input. The sequences can be pasted in the prepared field or uploaded as a file from the user’s computer (Figure 1). Sequences can be annotated with a predicted superfamily in the header between sharps (#), otherwise they are considered as ‘UNKNOWN’. By default, the models built in the framework of the CONCO project are used to analyse the input sequences. However, users can also upload their own PSSMs and/or pHMMs. An annotated testing set (attached to a ‘LOAD TEST DATA’ button) is also available from the input interface.

Figure 1. ConoDictor input (background) and output (foreground) interfaces. The input interface provides a text area for amino acid sequence in FASTA format and areas for users to upload their own models. A test set is also provided and can be loaded via a simple click. The output interface provides detailed, self-explanatory tables grouped by analysis type. The combined prediction/classification is summarized under the ‘General result’ tab.

Visualization interface

The ConoDictor output interface offers user-friendly tab views of matching outputs and predictions (Figure 1). The main tab provides combined prediction, as well as a summary of pHMM- and PSSM-based prediction. Detailed result tabs for pHMM- and PSSM-based predictions provide the number of sequence matches for each model, the position for each sequence/model match, and the related e-value and score of individual model match. Tab headers and table column names explain the results displayed. The result page is automatically updated until analysis results are available. An Excel file (.xls) and raw text versions (.txt) of all results can be downloaded. A session identifier is also provided, and the results can be accessed and visualized on the server for up to 3 weeks after submission or last viewing. A detailed help page provides clear explanations and screen shots of the most important tables of the analysis (http://conco.ebc.ee/ConoDictor_help.html).

CONCLUSION

ConoDictor is a web-based application, based on preliminary studies that established PSSM and pHMM complement each other for conopeptide identification and classification. Thanks to a user-friendly interface, ConoDictor provides an easy-to-use environment for classification of conopeptides into superfamilies based on their amino acid sequence. In view of the rapidly increasing number of new conopeptides being discovered by next-generation transcriptomic platforms, ConoDictor is a valuable bioinformatics tool for their classification and serves as a starting point for investigation of their functional characteristics.

ACKNOWLEDGEMENTS

We thank Estelle Bianchi, Daniel Biass, Nicolas Hulo and Christian Sigrist for expert assistance.

FUNDING

CONCO project [LSHB-CT-2007-037592, in part]: www.conco.eu funded by EU 6th Framework Programme (LIFESCIHEALTH); the European Regional Development Fund (Estonian Center of Excellence in Genomics); the Atheris Laboratories. Funding for open access charge: LIFESCIHEALTH [LSHB-CT-2007-037592].

Conflict of interest statement. None declared.

REFERENCES

1. Norton,R. and Olivera,B. (2006) Conotoxins down under. Toxicon, 48, 780–798.
2. Stöcklin,R. and Vorherr,T. (2010) Venoms - a natural source for mini-protein drugs. Pharmanufacturing Int. Peptide Rev., (Sept. 2010), 44–46.
3. Olivera,B. and Cruz,L. (2001) Conotoxins, in retrospect. Toxicon, 39, 7–14.
4. Kaas,Q., Westermann,J. and Craik,D. (2010) Conopeptide characterization and classifications: an analysis using ConoServer. Toxicon, 55, 1491–1509.
5. Jones,R. and Bulaj,G. (2000) Conotoxins - new vistas for peptide therapeutics. Curr. Pharm. Des., 6, 1249–1285.
6. Kaas,Q., Yu,R., Jin,A.-H., Dutertre,S. and Craik,D.J. (2012) Conoserver: updated content, knowledge, and discovery tools in the conopeptide database. Nucleic Acids Res., 40, D325–30.
7. Laht,S., Koua,D., Kaplinski,L., Remm,M. and Stöcklin,R. (2012) Identification and classification of conopeptides using hidden Markov models. Biochim. Biophys. Acta., 1824, 488–49.
8. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, UK.
9. Johnson,L.S., Eddy,S.R. and Portugaly,E. (2010) Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics, 11, 431.
10. Koua,D., Cerutti,L., Falquet,L., Sigrist,C.J.A., Theiler,G., Hulo,N. and Dunand,C. (2009) PeroxiBase: a database with new tools for peroxidase family classification. Nucleic Acids Res., 37, D261–D266.
11. Oliva,M., Theiler,G., Zamocky,M., Koua,D., Margis-Pinheiro,M., Passardi,F. and Dunand,C. (2009) PeroxiBase: a powerful tool to collect and analyse peroxidase sequences from Viridiplantae. J. Exp. Bot., 60, 453–459.

Appendix 6. TATools, a bioinformatic environment for transcriptomes analysis.

Original paper

TATools, a bioinformatic environment for transcriptomes analysis

Dominique Koua1,2,*, Roman Mylonas2, Philippe Favreau1, Reto Stöcklin1 and Frédérique Lisacek2

1Atheris Laboratories, case postale 314, CH-1233 Bernex-Geneva, Switzerland. 2Swiss Institute of Bioinformatics, 1, rue Michel Servet, CH-1206 Geneva, Switzerland.

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Associate Editor: XXXXXXX

ABSTRACT

Motivation: New sequencing techniques yield increasing amounts of transcriptomic data. Transcriptome sequencing of specific tissues is undertaken to better understand and characterize the context of gene expression. In this framework, the transcriptomic data made available require automated processing workflows and user-friendly interfaces for data exploitation and comprehension. These tools are essential when genome data are missing and cannot support sequence annotation.

Results. TATools provides a unique management environment for understanding transcriptome data by merging the results of diverse classical sequence analyses. Additional features and dedicated viewer pages make TATools a valuable solution for highlighting novelty in a single transcriptome as well as for cross-analysis of several transcriptomes in the same environment. As a use case, the in-depth analysis of a venom gland transcriptome is presented.

Availability and Implementation. We have developed TATools, an integrated bioinformatic environment for transcriptome analysis and visualization. TATools is an automated web-based platform that provides an intuitive interface for easy interpretation of combined analysis results. Transcriptome data are submitted to BLAST search, HMM/PSSM search and signal sequence detection. The output results are merged and stored in a dedicated mysql relational database. A user-friendly interactive web-platform allows data submission, visualization and annotation.

1 INTRODUCTION

Since the Sanger method (Sanger and Coulson, 1975; Sanger et al., 1977), sequencing methods have constantly improved in terms of read length, read quality and amount of data produced (Mardis, 2008). The application of recent technological advances to venomics has led to the genome sequencing of only a handful of venomous animals. Next-generation sequencing technologies produce large EST datasets representing the whole mRNA content expressed in venom glands at a very reasonable cost. Venom gland transcriptomes consist of cDNA libraries constituted from short reads that are assembled into contigs. The number of reads, their length and the error rate of the sequencing method still challenge de novo assembly algorithms and parameter settings. In addition to the effort invested in sample preparation (towards more purity for extracted and amplified DNA/RNA), sequencers are constantly evolving and improving (multiplexing, increase in detection sensitivity,...) and so is assembly software (assembly of short reads,...) (Kumar and Blaxter, 2010; Martin and Wang, 2011). In the absence of a genome sequence, venom transcriptomes represent a valuable information source for the elucidation of the pharmacological potential of venoms. This explains the multiple venom gland transcriptomes recently investigated (Lluisma et al., 2012; Ma et al., 2012; Terrat et al., 2012; Durban et al., 2011; Prosdocimi et al., 2011; Vaiyapuri et al., 2011). These transcriptomic surveys were associated with proteomic studies that allowed an in-depth exploration of the molecular diversity of venoms and the validation of a significant number of potentially active peptides. But so far, no automated workflow has been described for high-throughput analysis of full transcriptome data. The analysis of the sequences obtained after sequencing and assembly relies on comparative analysis with annotated genes or protein domains of other organisms. A multimodal comparison with sequences from different organisms and databases, e.g. NCBI, UniProtKB, GO (Ashburner et al., 2000), KEGG (Kanehisa and Goto, 2000), commonly supports a broad investigation of the putative biological role of a specific sequence. This approach however increases the quantity of information per sequence, adds to the already large amount of data and calls for a new strategy (Philipp et al., 2012). In principle, two of the following five components are included in the transcriptome analysis workflow according to Cantacessi et al. (2010): (i) assembly (mostly de novo in the case of venomous animals), (ii) similarity searching (BLAST), (iii) prediction and annotation of peptides (InterProScan), (iv) in-silico subtraction that highlights qualitative but not quantitative differences between or among samples, and (v) probabilistic functional networking of protein-encoding genes and drug target prediction. The Transcriptome Analysis and Comparison Explorer (Philipp et al., 2012) constitutes one of the rare initiatives for a tool designed for organizing and analysing large sequencing datasets.

We have developed TATools, an integrated bioinformatic environment for transcriptome analysis and visualisation. TATools was implemented as a web-based platform that provides intuitive visualisation for easy interpretation of processed data. One of the most important goals that directed TATools implementation was to facilitate the identification and extraction of specific protein family

members and homologous sequences from the transcriptome. TATools was therefore applied to venom gland transcriptomes. Organisms adapt to their environment through complex biological regulation and control of gene transcription. In particular, venomous animals have developed very efficient venoms for self-defense and prey capture (Mebs, 2002; Fry et al., 2008). Venoms have long attracted interest, especially because of the relatively large number of envenomations worldwide (caused by snake bites and insect stings). Even if several studies still focus on understanding the venom activity in the human body and on developing antivenoms, there is an increasing interest and on-going initiatives to take advantage of the selective power of venom compounds for human health (King, 2011). In that context, numerous animal venoms are being submitted to high-throughput screening for different biological receptors (Fry et al., 2009). More recently, the venomics initiative proposed a new paradigm: better exploit venom power by understanding the underlying genomic basis of the venom machinery.

2 METHODS

The transcriptome analysis workflow adopted in TATools consists of a combination of (i) a BLAST-based similarity search, (ii) a model-based matching using both specific PSSMs and HMMs, (iii) a signal sequence detection using SignalP and (iv) an in-house cysteine-rich domain detection. The submitted transcriptome cDNA sequences are first translated in-silico in 6 frames. These transcripts are then sent in parallel via a multi-threaded analysis process to BLAST search against the UniprotKB/Swiss-Prot protein database, to model matching and to signal and cysteine-rich domain detection (Fig. 1).

Translated transcripts displaying similarity to documented proteins are identified with a blastp search against UniprotKB/Swiss-Prot (e-value threshold: 10e-4). Close relationship to known proteins corresponds in the best cases to the identification of new variants of known proteins in the same organism or in distant species. This group of transcripts generates limited novelty. However, these natural variants can be of great interest since they represent new members of an evolutionarily conserved active peptide family. Gene ontology annotations of transcripts are inferred from the UniprotKB annotation of the first BLAST hit.

PSSMs and HMMs are specifically designed to match protein families, subfamilies or domains of interest. PSSM and HMM matching has proved to be a better approach than BLAST to identify distant relatives in protein families or subfamilies (personal communication, submitted). In addition, PSSM and HMM models complement each other in prediction and/or identification of distant members in protein families and can be combined to better discriminate between closely related protein families (personal communication, submitted).

Finally, signal sequence detection is a common strategy for identifying new precursors. In venomics applications, signal detection is coupled with a search for cysteine-rich domains as this feature is shared by many toxin sequences. The results of all the conducted searches are parsed and stored in a dedicated relational database. In TATools, search results are cross-linked to provide an internal validation labeling for each transcript. This cross-validation supports the creation of a “Transcriptome map” as it allows the classification of transcripts into 8 classes. These classes are defined by the type of output generated for each considered transcript: transcripts that match sequences as the result of only one search (BLAST only, class '1'; PSSM/HMM only, class '2'; SIGNAL only, class '3'), of two searches (BLAST and PSSM/HMM, class '1+2'; BLAST and SIGNAL, class '1+3'; PSSM/HMM and SIGNAL, class '2+3'), of three searches (BLAST and PSSM/HMM and SIGNAL, class '1+2+3') or with no match in any of the searches (class '0').

A transcript potentially leading to a previously unidentified peptide is therefore easy to spot. For instance, a precursor with a divergent mature peptide matched by PSSM/HMM in the absence of a BLAST hit is more likely to be interesting for drug discovery than a precursor with only one BLAST hit. Conversely, transcript matches confirmed in two or three searches are more reliable.

The platform is composed of 2 layers. Transcriptome cDNA files are submitted to the TATools analysis workflow layer in FASTA format via a php-based user interface. The submission interface also allows the selection of models (PSSM and/or HMM) to be searched against the new transcriptome. The user can choose among available models or upload new ones. It is also possible to set BLAST search parameters as well as parameters related to signal detection in transcripts. The core script is the workflow manager: it makes system calls and requires in-house object-oriented perl classes for result parsing and data storage. The modular organisation of the workflow components makes it possible to launch the whole analysis in a single run or to separately run each step of the analysis via an 'update interface'. The analysis layer of TATools was developed in perl. BLAST results are parsed using the BioPerl package. Results are stored in the specific mysql database using the perl-DBI interface.

The data exploitation layer provides viewers and various tools for data visualization, understanding and exploitation. Once a transcriptome is analysed, a web-based viewer allows results visualization, the key element being the “Transcriptome map”. Transcripts belonging to each match class are displayed in a table view by simply clicking on the corresponding region of the “Transcriptome map”. It is also possible to display transcripts with BLAST hits sorted by associated GO terms and to view transcripts matching bioinformatics models sorted by model name or type. Furthermore, an entry viewer summarizes results and information related to each transcript analysis (transcript sheet). The visualization layer was based on php-mysql and perl-cgi scripts including javascript and Ajax requests for interactivity. The contig assembly viewer proposed in the transcript sheet is a java applet. Useful analysis tools are also available for data manipulation. They include a text search, a local blast search against all the transcriptomes already analysed on the platform and an alignment tool with improved functionalities such as alignment colouring, trimming and export.

3 RESULTS

3.1 An integrated user-friendly platform that meets biologists' needs

3.1.1 Transcriptome relational database schema

The underlying relational storage organization comprises two groups of tables (Fig. 2):

(1) shared tables which store data that could be accessed during the analysis and exploitation of all transcriptomes submitted to the environment. Shared tables include user management tables, bioinformatic model information and a local dump of the Gene Ontology annotation database.

(2) tables specific to each analysed transcriptome that store analysis results. These tables are organised in a separate database for each transcriptome. The central element of the schema of a transcriptome is the table of translated transcripts. These translated transcripts serve for the requested analysis. The other tables are used to store the detailed results of each analysis and require a foreign key from the translated transcript table.

3.1.2 TATools use case diagram

A number of operations can be achieved on the platform, including transcriptome deposition, analysis runs and updates, data visualization, annotation and exportation. Different user levels have been implemented in view of public deployment of the platform. Figure 3 summarizes the main use cases.

3.1.3 Anticipating biologists' needs

The first representation of results obtained with TATools is the “Transcriptome map”. The Transcriptome map is a four-class Venn diagram automatically annotated after grouping translated transcripts according to the analysis for which a positive result was obtained. In addition to the class distribution map, tabulated views are provided to display sequences corresponding to a class or the result of a search concerning either an annotation or an amino acid sequence fragment with the following details:

− translated transcript identifier,
− cDNA sequence coverage,
− translation frame,
− summary of the results obtained for each analysis run.

Results can also be sorted by GO terms or according to the models they matched. Whenever a tabulated list is proposed, one can select sequences of interest and export them in FASTA format or produce a multiple sequence alignment. It is also possible to align any set of sequences submitted in FASTA format, either pasted into the proposed alignment tool or uploaded from a file. The resulting alignment is computed by mafft version 6.847b. A special parser has been added for alignment result exploitation. This functionality is useful for comparing sequences of a previously exported FASTA file or for aligning external sequences with those of the platform.

TATools was specially optimized for protein family identification and extraction from transcriptomes. It is therefore possible to export sequences matched by models considered separately or in combination. For example, if a protein family contains three different modeled domains, it is possible to extract, align and/or annotate transcripts where one, two or all three domains are present.

Additional features have been integrated in TATools to facilitate transcriptome annotation work:

(1) The annotation platform allows users to add comments to any transcript. It is therefore possible to manually input external results such as experimental information or proteomic analysis results (e.g. molecular mass, post-translational modification). It is possible to annotate a single sequence as well as a set of selected sequences.

(2) A search tool also allows querying the database directly by giving a portion of amino acid sequence or by searching a given word in BLAST or GO annotation.

(3) A contig viewer displays reads used by the assembly program for contig creation. It is therefore easy to track assembly problems on a contig of interest.

(4) Since only the best BLAST hit from UniprotKB is provided in the contig viewer, the user can submit the translated transcript as well as the initial contig nucleotide sequence to a local BLAST against the NCBI non redundant database and/or other transcriptomes available on the platform. The latter is especially useful to detect homologous sequences from non public transcriptomes of related or distant organisms.

(5) TATools also integrates a local version of signalP that generates a graphical interpretation of the signal detection.

(6) Finally, a clipboard was implemented to handle union and intersection of sub-category matches when analysing GO-based annotations and model-based matches. For each sub-category, the number of involved sequences (matching each single model or having the given GO annotation) is extracted from the database and displayed in front of the sub-category. The clipboard allows export and analysis of sequences that simultaneously match two or more models (e.g. signal model, propeptide model and mature peptide model of a given conopeptide superfamily).

3.2 Case study: identification of XEP-018 analogs in the transcriptome of Conus consors venom gland

The venom gland of Conus consors has been sequenced in the frame of the CONCO project (www.conco.eu). A deep analysis of the venom gland conopeptide content has already been described (Terrat et al., 2012). We propose here a TATools-style view of this transcriptome.

3.2.1 Conus consors Transcriptome map

The venom gland pyro-sequencing yielded 213561 reads of 218 base pairs average length. De-novo MIRA-based read assembly led to 65536 contigs from which 49086 clusters were obtained using CD-HIT (Huang et al., 2010) with a similarity threshold set to 0.95. The translated transcripts were then submitted to BLAST (e-value: 10e-4), SignalP (HMM and NN) and model search (HMMs and PSSMs built for conopeptide families and superfamilies). The distribution of contigs after the analysis step is provided in Fig. 4.

3.2.2 Conopeptide identification is improved by the model-based analysis of the transcriptome

The 65536 contigs from the Conus consors venom gland transcriptome were translated in-silico into 393216 protein sequences and searched with the 96 models built for the 16 known superfamilies. This led to the identification of 5210 different hits. 1403 matches were obtained for A superfamily models, 1650 for M superfamily models, 1356 for O1 superfamily models, 593 for T superfamily models, and 74 and 19 for P superfamily and S superfamily models respectively. The models from the other superfamilies returned 115 matches. Figure 5 summarizes the matches obtained with the model-based strategy.

On the other hand, the BLAST of these contigs against UniprotKB/Swiss-Prot returned 1380 hits among which 191 were related to toxin sequences from Conus species.

Compared to the BLAST-based strategy, our model-based approach significantly improves identification of conopeptide-related transcripts. More interestingly, identified transcripts are directly related to a superfamily (the family from which the model was built) and a further tedious manual analysis of BLAST results is avoided. In particular, the model prediction highlights new isoforms of mu-conotoxin.

3.2.3 Presentation of XEP-018

XEP-018 is also known as CnIIIC. It is a µ-conotoxin identified and isolated from Conus consors venom (Benoit et al., 2008). The µ-conopeptide family is defined by its ability to block voltage-gated sodium channels (VGSCs), a property that can be used for the development of myorelaxants and analgesics. µ-CnIIIC potently blocks VGSCs in skeletal muscle and nerve, and hence is applicable to myorelaxation. Its new atypical pharmacological profile suggests some common structural features between VGSCs and nAChR channels (Favreau et al., 2012).

3.2.4 CnIIIC analogues from transcriptome

Transcriptome analysis was carried out on TATools to detect analogues and/or variants of CnIIIC. The translated transcriptome was indexed using the formatdb script from the NCBI BLAST package and searched by BLASTP. In addition the transcriptome was searched using PSSMs built for the M superfamily.

A BLASTP search (e-value: 10e-4) of the CnIIIC sequence against the C. consors transcriptome indicated perfect matches with 11 contigs. Ten of the matching contigs completely covered the initial sequence and one contig was detected as containing a sequencing error. At this point, no variants or analogues were identified. Another BLASTP search against UniprotKB/Swiss-Prot indicated that 12 mu-conopeptides were publicly available.

The PSSM representing the conopeptide M-superfamily mature peptide matched 630 distinct contigs. Regions of these contigs matching this specific model were isolated and aligned using the pfsearch command. Duplicate sequences, incomplete mature peptides as well as entries with sequencing errors were manually removed from the alignment. This led to a set of 29 contigs, the removed sequences mostly being either duplicates of the original XEP or duplicates of the kept variants. Out of the 29 remaining contigs, only 10 were identified to be full-length precursors with signal, propeptide and mature peptide. Out of these 10 sequences, 5 complete precursors were considered as analogues of CnIIIC (Fig. 6). The other 5 contigs had weak coverage and probably constitute new M-type conopeptides.

Of the five new analogues determined from the transcriptome analysis, two were actually identified at protein level by mass spectrometry analysis of the milked venom and a third one was also identified during the proteomic analysis of both milked and dissected venom.

3.3 Conclusion and perspectives

The most important innovation of TATools lies in the integration of in-house and specially tuned protein family models in a transcriptome analysis workflow. The platform was initially implemented to facilitate identification of analogs of a given lead compound and/or other members of a protein family in a venomics context. The visualization and annotation interface classically integrated BLAST results and GO annotations. The additional model-based results, signal and cysteine framework detections are of great added value when focusing on highly folded and secreted peptides coming from animal venoms. The user-friendly, intuitive and interactive web-based interfaces allow transcriptome analysis by non-expert biologists since the combination of results obtained from the different analyses makes the validation easier. This beta version is an adequate solution for a reliable analysis of high-throughput transcriptome data. The efficiency of model-based analysis and the easy-to-import features constitute a major asset in the context of the identification of lead compound analogs and the selection of candidates for chemical synthesis. Further development of TATools will focus on improving the comparison between transcriptomes.

ACKNOWLEDGEMENTS

We are most grateful to Estelle Bianchi, Daniel Biass, Aude Violette, Nathalie Lembrez and Xavier Sprüngli for expert assistance, transcriptome sequence acquisition and platform tests.

Funding: This work was supported by the European Commission: CONCO, the cone snail genome project for health (LSHB-CT-2007-037592; www.conco.eu).

REFERENCES

Ashburner,M. et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29.
Benoit,E. et al. (2008) A new mu-conotoxin from Conus consors that atypically targets sodium channels in unmyelinated and myelinated nerve fibers. Abstract Book, 16th European Section Meeting of the International Society on Toxinology (2008).
Callaghan,B. et al. (2008) Analgesic α-conotoxins Vc1.1 and Rg1A inhibit N-type calcium channels in rat sensory neurons via GABAB receptor activation. J Neurosci 28, 10943–10951.
Cantacessi,C. et al. (2010) A practical bioinformatic workflow system for large data sets generated by next-generation sequencing. Nucleic Acids Research, 38(17), e171.
Clark,R.J. et al. (2010) The engineering of an orally active conotoxin for the treatment of neuropathic pain. Angew Chem Int Ed Engl 49, 6545–6548.
Durban,J. et al. (2011) Profiling the venom gland transcriptomes of Costa Rican snakes by 454 pyrosequencing. BMC. Genomics 12, 259.
Dutertre,S. et al. (2007) AChBP-targeted α-conotoxin correlates distinct binding orientations with nAChR subtype selectivity. EMBO J 26, 3858–3867.
Favreau,P. et al. (2012) Pharmacological characterization of a novel μ-conopeptide, CnIIIC, indicates potent and preferential inhibition of sodium channel subtypes (Na(V) 1.2/1.4) and reveals unusual activity on neuronal nicotinic acetylcholine receptors. Br J Pharmacol. 2012. doi: 10.1111/j.1476-5381.2012.01837.x.
Fry,B.G. et al. (2008) Evolution of an arsenal: structural and functional diversification of the venom system in the advanced snakes (Caenophidia). Mol. Cell. Proteomics, 7, 215–246.
Fry,B.G. et al. (2009) The toxicogenomic multiverse: convergent recruitment of proteins into animal venoms. Annu. Rev. Genomics Hum. Genet. 10, 483-511.
Huang,Y. et al. (2010) CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26, 680–682.
Kanehisa,M. and Goto,S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27-30.
King,G.F. (2011) Venoms as a platform for human drugs: translating toxins into therapeutics. Expert Opin. Biol. Ther. 11(11), 1469-1484.
Klimis,H. et al. (2011) A novel mechanism of inhibition of high-voltage activated calcium channels by conotoxins contributes to relief of nerve injury-induced neuropathic pain. Pain 152, 259–266.
Kumar,S. and Blaxter,M. (2010) Comparing de novo assemblers for 454 transcriptome data. BMC genomics, 11, 571.
Lewis,R.J. et al. (2012) Conus venom Peptide pharmacology. Pharmacol Rev. 64(2), 259-98.
Lluisma,A.O. et al. (2012) Novel venom peptides from the cone snail Conus pulicarius discovered through next-generation sequencing of its venom duct transcriptome. Mar. Genomics 5, 43-51.
Ma,Y. et al. (2012) Extreme diversity of scorpion venom peptides and proteins revealed by transcriptomic analysis: Implication for proteome evolution of scorpion venom arsenal. J. Proteomics. 75 (5), 1563-1576.
Mardis,E.R. (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9, 387–402.
Martin,J.A. and Wang,Z. (2011) Next-generation transcriptome assembly. Nat Rev Genet., 671-82.
McIntosh,J.M., Santos,A.D. and Olivera,B.M. (1999) Conus peptides targeted to specific nicotinic acetylcholine receptor subtypes. Annu Rev Biochem 68, 59–88.
Mebs,D. (2002) Venomous and Poisonous Animals. Medpharm, Stuttgart Germany.
Philipp,E.E.R. et al. (2012) The Transcriptome Analysis and Comparison Explorer - T-ACE: a platform-independent, graphical tool to process large RNAseq data sets of non-model organisms. Bioinformatics 2012; doi: 10.1093/bioinformatics/bts056
Prosdocimi,F. et al. (2011) Spinning gland transcriptomics from two main clades of spiders (order: Araneae)--insights on their molecular, anatomical and behavioral evolution. PLoS. One. 6 (6): e21634.
Sanger,F. and Coulson,A.R. (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94 (3), 441–448.
Sanger,F., Nicklen,S., Coulson,A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74 (12), 5463–7.
Terrat,Y. et al. (2012) High-resolution picture of a venom gland transcriptome: Case study with the marine snail Conus consors. Toxicon 59 (1), 34-46.
Vaiyapuri,S. et al. (2011) Evolutionary analysis of novel serine proteases in the venom gland transcriptome of Bitis gabonica rhinoceros. PLoS. One. 6(6), e21532.
Vincler,M. and McIntosh,J.M. (2007) Targeting the alpha-9 alpha-10 nicotinic acetylcholine receptor to treat severe pain. Expert Opin Ther Targets 11, 891–897.

© Oxford University Press 2012
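The eight Transcriptome map classes defined in the Methods of the TATools paper (class '0' through class '1+2+3') follow mechanically from three yes/no search outcomes per translated transcript. The following Python sketch illustrates that labelling rule only; it is not the TATools implementation (which the paper describes as perl and mysql based), and the function names, the tuple layout and the toy identifiers are assumptions made for the example.

```python
from collections import Counter

# Search numbering follows the text: 1 = BLAST, 2 = PSSM/HMM, 3 = SIGNAL.
SEARCHES = ("1", "2", "3")


def map_class(blast_hit, model_hit, signal_hit):
    """Return the Transcriptome map class label for one translated transcript."""
    hits = (blast_hit, model_hit, signal_hit)
    labels = [name for name, hit in zip(SEARCHES, hits) if hit]
    # No positive search at all corresponds to class '0'.
    return "+".join(labels) if labels else "0"


def class_distribution(results):
    """Count transcripts per class.

    `results` maps a transcript identifier to a (blast, model, signal)
    tuple of booleans; this layout is an assumption of the example.
    """
    return Counter(map_class(*flags) for flags in results.values())


# Hypothetical toy input, not data from the Conus consors transcriptome.
toy_results = {
    "contig_0001_frame1": (True, False, False),
    "contig_0002_frame3": (False, True, True),
    "contig_0003_frame2": (True, True, True),
    "contig_0004_frame5": (False, False, False),
}

for label, count in sorted(class_distribution(toy_results).items()):
    print(label, count)
```

Under this rule a transcript labelled '2' or '2+3' (model match without BLAST hit) is the kind of candidate the paper singles out as potentially novel, while labels combining two or three searches mark the more reliable matches.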

List of figures.

Fig. 1. Transcriptome analysis workflow adopted in TATools.

Fig. 2. Relational database schema of TATools. A separate database is created for each new transcriptome deposited on the platform.

Fig. 3. TATools use case diagram.

Fig. 4. Transcriptome map of Conus consors venom gland transcriptome.

Fig. 5. Distribution of matches obtained for the main superfamilies by searching the Conus consors transcriptome with conotoxin pHMMs and PSSMs.

Fig. 6. New isoforms of mu-conotoxin identified from the Conus consors venom gland transcriptome.

Appendix 7. Pattern Searches in Protein Sequences.

Pattern Searches in Protein Sequences

Advanced article

Dominique Koua, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
Frédérique Lisacek, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland

Article Contents
. Introduction
. Different Perspectives
. Pattern Detection
. Pattern Discovery
. Application

Online posting date: 15th June 2012

eLS subject area: Cell Biology

How to cite: Koua, Dominique; and Lisacek, Frédérique (June 2012) Pattern Searches in Protein Sequences. In: eLS. John Wiley & Sons, Ltd: Chichester. DOI: 10.1002/9780470015902.a0006222.pub2

Common amino acid patterns characterise protein families. The results of automated searches for such patterns are used to qualify protein structure and function and to explore evolutionary relationships. Considering the increasing number of deoxyribonucleic acid (DNA) and protein sequences generated by high-throughput technologies, pattern search is commonly undertaken in the identification of new protein function or the elucidation of biological processes. A wide array of pattern matching methods has been implemented. They aim at identifying the constraints governing the occurrence of amino acids in protein regions. These constraints are expressed as probabilities or as templates or both to set the basis of automated search.

Introduction

Most of the time, sequences are compared by means of alignment. Overall sequence similarities have led to the grouping of proteins that are alike into families. Furthermore, local amino acid similarities have yielded the definition of so-called domains, which span limited regions of proteins. These domains are characterised by patterns of regular occurrences of amino acids. A pattern may contain several motifs. Constraints in patterns are positional and most detection methods rely on rules governing the presence/absence of a particular amino acid at a particular position. Such rules are either generated automatically or inbuilt in the description. Various description schemes have given rise to a diversity of detection modes.

To perform the detection, patterns are translated into computer-readable descriptors, often simply called signatures or fingerprints, and which can be:
. deterministic, involving the implementation of pattern matching procedures;
. probabilistic, involving probability calculations (Figure 1).

Deterministic patterns

The simplest deterministic pattern is a substring pattern also called a consensus sequence, corresponding to a fragment of a polypeptide chain. Symbols representing the pattern are the 20 amino acid one-letter codes.

Regular expressions are the most common deterministic patterns. They are defined over an extended vocabulary of symbols including the 20 amino acids, X as an unspecified residue and various delineation symbols such as numbers or brackets. Regular expressions express positional constraints within the pattern. For example: C-[AG]-X(2,5)-{DE}-X(3)-[LIVM] describes a cysteine followed by either alanine or glycine, from two to five unspecified residues, any residue except aspartic or glutamic acid, three unspecified residues and either a leucine, an isoleucine, a valine or a methionine. This type of expression is used in the PROSITE database (see Web Links) though less extensively now than in the early releases of this database.

Probabilistic patterns

Probabilistic patterns are characterised by weight matrices, which associate each amino acid residue in a pattern with a position-specific score. The matrix columns correspond to each position in the pattern, and the rows to each of the 20 amino acid residues. Weight matrices can be constructed by a variety of techniques. The classical method requires (1) a multiple sequence alignment between a set of known sequences of a protein family and (2) a scoring scheme to convert residue frequency distributions into weights.

Profiles are weight matrices with position-specific gap-penalties assigned to insertions and deletions. Profile-hidden Markov models (Profile-HMMs; often simply referred to as hidden Markov models or HMMs) are a

eLS © 2012, John Wiley & Sons, Ltd. www.els.net
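The PROSITE-style expression discussed under "Deterministic patterns" can be turned into an ordinary regular expression and run against a sequence. The sketch below is a minimal illustration in Python, assuming the exclusion element of the example is written {DE} in PROSITE notation (matching the description "any residue but neither aspartic nor glutamic acids"). The function name `prosite_to_regex` and the test peptide are hypothetical, and the sketch ignores the anchoring symbols (< and >) that full PROSITE syntax also allows.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Handled elements: single residues, X (any residue), [..] (allowed set),
    {..} (excluded set) and repetition suffixes such as (3) or (2,5).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repetition suffix.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        spec, low, high = match.group(1), match.group(2), match.group(3)
        if spec.upper() == "X":
            token = "."                      # unspecified residue
        elif spec.startswith("{") and spec.endswith("}"):
            token = "[^" + spec[1:-1] + "]"  # excluded residues
        else:
            token = spec                     # single residue or [..] set
        if low and high:
            token += "{" + low + "," + high + "}"
        elif low:
            token += "{" + low + "}"
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("C-[AG]-X(2,5)-{DE}-X(3)-[LIVM]")
print(regex)

# Hypothetical peptide used only to exercise the pattern.
hit = re.search(regex, "MKCAQQKRRRLT")
print(hit.group() if hit else "no match")
```

The translation is purely syntactic: each positional constraint of the pattern maps to one regular expression token, which is why deterministic patterns of this kind report a match or no match rather than a score.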

Protein motif descriptors

Deterministic, e.g.,
- Substring patterns
- Regular expressions
- Rules (if... then...)
Qualitative positional constraints

Probabilistic, e.g.,
- Position specific scoring matrices
- Neural networks
- Hidden Markov models
Quantitative positional constraints

Figure 1 Motifs are representative of protein families and domains. They are translated into motif descriptors for automatic detection. The various methods described in the text are split into two main categories: deterministic and probabilistic. As a general trend deterministic methods tend to reflect the constraints on the occurrence of amino acids (qualitative), whereas probabilistic methods mainly rely on frequency calculations (quantitative).

more complex type of profile, whose scoring schemes are based on probability calculations (see the later discussion). PROSITE generalised profiles are position weighted matrices built from enriched multiple sequence alignments (MSA) (see the later discussion). See also: Profile Searching

Different Perspectives

Given a set of proteins, two issues are distinguished: (1) the detection of a known pattern and (2) the discovery of a new pattern. A known pattern can be searched in a single sequence, whereas a new pattern can only be discovered as a common characteristic in a set of sequences. Supplementary information (e.g. structure or chemical properties of amino acids) is most of the time required to refine the interpretation of newly identified patterns.

In reality, proteins may contain a variety of patterns.

Representation and detection

Regular expressions, weight matrices and profiles are the most frequently used descriptors. The major difference between the deterministic and probabilistic approaches resides in the outcome of the detection procedure. The selection of a deterministic pattern is based on a yes/no outcome. A probabilistic pattern, however, is selected when the score is above a given threshold.

Bucher et al. (1996) take particular care in specifying the relationships between the various definitions of motifs as well as the basis of a flexible pattern search. Motif descriptors are shown to fall into four categories depending on all possible combinations of two criteria: qualitative/quantitative and variable/fixed length.

Alternative ways of representation were suggested for sorting signals (Bannai et al., 2002), C-terminal glycosylphosphatidylinositol (GPI) anchoring signals (Eisenhaber et al., 1999) or lipoprotein signals (Gonnet and Lisacek, 2002). Patterns are described not only in terms of specific sites characterised by residue frequency vectors, but also as a combination of distinct features such as charge, hydrophobicity, etc. These approaches are formalised further by defining a set of functions or rules weighting the distinct features as the core of a matching and scoring procedure.

Alignment-free methods

As hinted in the introduction, the vast majority of patterns are defined from an initial sequence alignment. However, it should be noted that several authors have proposed pattern search methods bypassing this initial alignment. Pattern definition is then based on the regularity of N-mer composition of sequences (Didier et al., 2006). Shared N-mer patterns can be identified and used to classify sequences on the basis of statistically significant N-mer content (e.g. Corel et al., 2010; Maetschke et al., 2010).

Efficient pattern matching tools

Regular expressions, such as PROSITE patterns, are recognised by finite automata, which have been defined and studied since the mid-twentieth century. The outcome of decades of research in pattern matching has benefited molecular biology, as illustrated in the collection of fast and efficient algorithms described in Gusfield (1997). Whether exact or approximate, for single or multiple searches, various pattern matching techniques are commonly implemented. The success of applying these algorithms to process proteins lies in the obvious analogy between searching words in texts and searching patterns in sequences. Most alignment tools are inspired by related algorithms developed for matching substrings with strings. Long-standing though imperative discussions on the topic of scoring aligned sequences are summarised and extended in Altschul et al. (2001).
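The contrast between a deterministic (yes/no) and a probabilistic (thresholded score) detection can be illustrated with a minimal Python sketch. It is not the implementation of ScanPROSITE or of any profile tool: the function names `prosite_to_regex`, `build_pssm` and `scan_pssm` are hypothetical, the translator handles only a small subset of the PROSITE pattern syntax (residues, `x`, `[...]`, `{...}` and repeat counts), and the log-odds weights assume a uniform background frequency of 1/20 with a pseudocount of 1, which are simplifying assumptions made for the example.

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Supported subset (assumption): 'x' wildcard, [ABC] alternatives,
    {ABC} exclusions and repeat counts such as x(3) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds weight matrix from gap-free aligned sequences.

    Returns one dict per alignment column, mapping residue -> log2 odds
    against a uniform background (an assumption of this sketch).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan_pssm(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Deterministic detection: the pattern either matches or it does not.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVM]")
deterministic_hit = re.search(regex, "AACGGCAKTLAA")

# Probabilistic detection: a window is selected when its score passes a threshold.
pssm = build_pssm(["CAKL", "CGKL", "CAKI", "CARL"])
probabilistic_hits = scan_pssm("MMCAKLMM", pssm, threshold=5.0)
```

In this sketch the regular expression gives no indication of how close a non-matching sequence came to the pattern, whereas the weight matrix returns a score for every window and leaves the decision to the chosen threshold.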


A natural interpretation of searching patterns as words in a text is to consider the language possibly represented in the text. Formal language theory seems to provide a range of models and tools applicable to the genomic text. However, practical issues do not easily meet formal expressions. As proven in many instances, amino acids do not occur randomly and grammatical rules were tentatively defined. However, such attempts to use formal grammars to express regularities in sequences did not help elucidate the interrelations between patterns. Indeed, grammars capture some of the features of protein sequences but do not provide an unambiguous characterisation of protein families. So far, the predictive power of grammatical methods seems limited by the complexity of context-sensitive grammars (Sakakibara, 2005). Rules governing the occurrence of symbols in sequences are likely to be context-sensitive. The recognition of such a language is computationally intractable in general (at least NP-hard). (In computational complexity theory, NP, or nondeterministic polynomial time, is the class of problems whose solutions can be verified in polynomial time; no polynomial-time algorithm is known for the hardest problems in this class, so no scalable algorithmic solution is currently available for this category of problems.)

Pattern Detection

Early approaches

Most efforts have been put into refining score calculations given descriptors defined over the alphabet of amino acids. The PROSITE database associated with the ScanPROSITE program is the oldest reference for general protein pattern detection (Bairoch, 1991). Early PROSITE descriptors were regular expressions over the alphabet of the amino acids and a wild card X. In this case, rules are an integral part of the description and detection is based on a pattern-matching procedure, yielding a purely qualitative result, that is, exact match or no match. A contemporary variation on the theme was introduced in the PRINTS database in which patterns are split into short signatures. Partial matching is accepted when some signatures are missing, thus expanding the result of detection to match, partial match, no match (Attwood and Beck, 1994).

The absence of quantifiable evaluation led to the definition of motifs in terms of frequency vectors in later versions of PROSITE (Hofmann et al., 1999) where a probabilistic result is returned for each match. This shift from deterministic to probabilistic highlights the importance of the scoring functions associated with the detection method.

In most probabilistic approaches, the automatic identification of constraints in a pattern is based on inductive reasoning known as 'Learning'. Examples of a pattern are first gathered. The frequency and the nature of the amino acid are almost always the chosen initial descriptors.

Learning patterns for prediction

A Learning phase has become almost unavoidable to identify regular features in a collection of examples. Neural networks or HMMs are the main references (see the later discussion). In fact, Learning provides means of reformulating a problem. Once a correlation is inductively brought to the foreground, it is used to define a filtering method, whether in the form of a metric or a scoring function, associated with threshold values. Successful Learning is assessed by the quality of the outcome in any of these cases.

Inductive methods require:
- a set of representative examples (a training set);
- a set to validate the knowledge acquired from analysing examples (a test set).

Optionally, they may include: (1) a discrimination criterion between examples and counterexamples and (2) a set of counterexamples.

Neural networks

Neural networks are used to express a complex correlation between an input and an output. They are composed of three or more interconnected layers of units called neurons. A weight is associated with a connection between neurons. The stronger the connection, the stronger its associated weight. Examples of inputs bound to a known output are given to calculate weights of connections, in an attempt to minimise the error between the expected and the obtained output. Amino acids within a protein sequence are most probably not linearly correlated, justifying the use of models where discontinuous functions can be approximated. See also: Neural Networks

Neural networks are useful tools considering the limitations of human eyesight and the amount of sequences to be scanned for search purposes. However, the 'black box' setup of a neural net prevents rationalising an automatic decision made by the program. Whatever pattern is supposed to be recognised, the resulting score attributed by a network has no known biological meaning. As such, neural networks do not generate much substance for explicit rules. Many examples of prediction methods based on neural networks can be found on the server of the Center for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/).

These algorithms discriminate quite well between the positive and the negative training sets. This is partly because of too much emphasis put on amino acid positional constraints.

HMMs

HMMs introduced a new hypothesis as well as a needed change of representation. Within this framework, states are not static symbols but symbol transformations such as 'delete', 'insert' and 'match' as a hidden mechanism constraining the occurrence of symbols, that is, amino acids. Such a dynamical description increases the chances of rationalising mutation phenomena. Sequences are considered as the observable part of such a hidden mechanism, which supposedly corresponds to a succession of states. A

mapping between the observation and the hidden mechanism levels is defined; the sequential change of states is governed by transition rules. These rules are first weighted using a training set. An inbuilt optimisation algorithm guarantees the fitting of data to the model. See also: Hidden Markov Models

Despite a large set of parameters that need to be managed by an HMM and may appear cumbersome to naïve users, various applications are developed in molecular biology, such as the generation of the PFAM database of protein families (see Web Links). As mentioned earlier, the scoring system is essential and in constant improvement to allow a faster and more accurate selection of searched patterns (Eddy, 2009; Finn et al., 2011).

AMSA-based PROSITE generalised profiles

PROSITE generalised profiles are built from weighted MSA. The annotated multiple sequence alignment (AMSA) format was introduced to fine-tune profiles through associating specific annotation with multiple alignments. Annotation is added at the sequence level or at the column (position) or the residue level. The format also supports structural information and allows the application of distinct substitution matrices for specific domains or positions in the global profile. The new MSA format coupled with a new profile building strategy has led to generalised profiles selecting not only distant homologues but also detecting and classifying subfamily sequences. The addition of this human-assisted annotation layer has substantially improved the sensitivity of generalised profiles compared to that of fully automated approaches like HMMs and Neural Networks.

In addition to PROSITE profiles, the PRORULE database contains rules used to trigger annotations for matched sequences. Applying these rules involves checking specific conditions and generating relevant annotations for matched domains or active sites (Sigrist et al., 2010).

Support Vector Machine

Support Vector Machines (SVMs) are mathematical tools for positioning linear as well as nonlinear separation surfaces based on similarities and distances between objects to discriminate. The application of SVMs to biological data has demonstrated the applicability of the method to perform valuable protein classification, also for sequences belonging to quite close subfamilies (e.g. Shu et al., 2008). However, the complexity of the method makes its application difficult for naïve users.

Recent trends

The performance of the predictive methods listed above is often compared. The common conclusion is a wide overlap in terms of sensitivity and specificity between HMMs, Position Specific Scoring Matrices (PSSMs) or SVMs. Nevertheless, some patterns missed by one method are detected by another, and vice versa. For that reason, combining the predictive power of two or more methods has attracted some attention and is increasingly considered (e.g. Pierleoni et al., 2008).

Pattern Discovery

A number of methods were designed for locating and describing patterns common to a set of proteins. Various approaches rely on searching either the space of all possible locations of a pattern in a sequence or the space of all possible combinations of amino acids (see Rigoutsos et al. (2000) for a review and a description of TEIRESIAS and MUSCA).

Probabilistic approach

The expectation maximisation (EM) algorithm is the basis of the probabilistic approach to pattern discovery. The MEME method is a direct implementation of this algorithm. The Gibbs sampling method (Lawrence et al., 1993) is presented as a stochastic analogue of EM. A predictive (pattern selection) and a sampling (location determination) step are iteratively alternated until a stable set of patterns is found. The outcome is the most probable location of a selected pattern in each of the initial set of proteins. The Gibbs sampler has been used in various applications. See also: Gibbs Sampling and Bayesian Inference

PSI-BLAST is an iterative database search taking advantage of the PSSM prediction ability. A PSSM is built from the initial BLAST alignment (iteration 1) and used to search the database. After each iteration, a new PSSM is built and used for the next iteration until convergence is reached. At the end of the process, sequences related to the query sequence are output by PSI-BLAST. These sequences can directly be used for pattern construction (Li et al., 2011).

Combinatorial approach

With a goal of discovery, combinatorial pattern discovery methods are used for identifying new patterns as described in the review of Brazma et al. (1998). Although they are restricted to identifying regular expressions, most methods do not require sequences to be aligned for the identification of motifs. Qualitative patterns of variable length are generated. Two approaches for the discovery of new patterns are detailed: a so-called pattern-driven approach, which is based on the enumeration of possible patterns and choosing the fittest, and a sequence-driven approach based on a search for common parts in sequences. The two strategies may actually be combined. A variation on the theme of searching all possible amino acid combinations consists in considering a given family and applying a so-called enumeration strategy as in EMOTIF (Nevill-Manning et al., 1998).

After 20 years, the topic remains actively investigated and pattern-driven approaches tend to prevail. The trend has shifted towards identifying short and potentially


[Figure 2 content: query sequence submitted to a motif database search and to a protein database search]

>sp|P50731|YPBE_BACSU Uncharacterized protein ypbE OS=Bacillus subtilis GN=ypbE PE=4 SV=1
MTNMSRVERRKAQNLYEDQNAALADDYVDDGESLPTRQSVKNQREQKKKQGKTKTPLFTV
LAVIFVFVPVIVLVTLFYLKSHPDNHDDYEDVFIDSSQSKYEVVPKSEDKNDTADTKETA
LQKESKKEPEDSKPKEQTAADKKQTAVAEDSPNKEEATAAAASSSQSTVQQQEQPAEP
VQNVPNRVVKHTVQKKETLYRISMKYYKSRTGEEKIRAYNHLNGNDVYTGQVLDIPLMDE

Motif database search: significant Pfam-A match to the LysM domain (Pfam:PF01476), thought to be a general peptidoglycan-binding module; the LysM-like clan contains three member families (LysM, OapA, Phage tail X).
Protein database search: no similarity to any known sequence.

Figure 2 Use of motif detection for proteome annotation. The ypbE gene product of Bacillus subtilis strain 168 is of unknown function. It does not look like any other protein sequence besides a few found in other B. subtilis strains; however, it is found to belong to the Lysin motif protein family (accession number PF01476 in the PFAM database) since positions 191–236 delineate a pattern known as the Lysin motif. Common features of proteins in this family can provide hints for further understanding of the ypbE protein.

regulatory motifs in proteins. Two examples of such methods are NestedMICA, which uses a Monte Carlo inference strategy called Nested Sampling (Doğruel et al., 2008) and SLiMFinder that estimates the statistical significance of motifs through the implementation of a series of filters and statistical scoring schemes (Davey et al., 2010).

Application

Motifs are representative of protein families and domains, and a range of sequence analysis methods can be used to detect known motifs in newly sequenced deoxyribonucleic acid (DNA) that has just been translated. The computational tools described above are used to determine whether such a new sequence belongs to a known family of proteins or whether it contains an interpretable structural or functional domain. Positive answers may lead to assigning a function to a protein sequence, as illustrated in Figure 2.

See also: Gibbs Sampling and Bayesian Inference; Hidden Markov Models; Neural Networks; Profile Searching; Protein Families: Evolution

References

Altschul SF, Bundschuh R, Olsen R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research 29(2): 351–361.
Attwood TK and Beck ME (1994) PRINTS – a protein motif fingerprint database. Protein Engineering 7(7): 841–848.
Bairoch A (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 19(suppl.): 2241–2245.


Bannai H, Tamada Y, Maruyama O, Nakai K and Miyano S (2002) Extensive feature detection of N-terminal protein sorting signals. Bioinformatics 18(2): 298–305.
Brazma A, Jonassen I, Ukkonen E and Vilo J (1998) Approaches to the automatic discovery of patterns in biosequences. Journal of Comparative Biology 5(2): 279–305.
Bucher P, Karplus K, Moeri N and Hofmann K (1996) A flexible motif search technique based on generalized profiles. Computers in Chemistry 20(1): 3–23.
Corel E, Pitschi F, Laprevotte I et al. (2010) MS4 – multi-scale selector of sequence signatures: an alignment-free method for classification of biological sequences. BMC Bioinformatics 11: 406–420.
Davey NE, Haslam NJ, Shields DC and Edwards RJ (2010) SLiMFinder: a web server to find novel, significantly over-represented, short protein motifs. Nucleic Acids Research 38: W534–W539.
Didier G, Laprevotte I, Pupin M, Hénaut A (2006) Local decoding of sequences and alignment-free comparison. Journal of Computational Biology 13(8): 1465–1476.
Doğruel M, Down TA and Hubbard TJ (2008) NestedMICA as an ab initio protein motif discovery tool. BMC Bioinformatics 9: 19.
Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Information 23: 205–211.
Eisenhaber B, Bork P and Eisenhaber F (1999) Prediction of potential GPI-modification sites in protein sequences. Journal of Molecular Biology 292: 741–758.
Finn RD, Clements J and Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39: W29–W37.
Gonnet P and Lisacek F (2002) Probabilistic alignment of motifs with sequences. Bioinformatics 18: 1091–1101.
Gusfield D (1997) Algorithms on Strings, Trees and Sequences. Cambridge, UK: Cambridge University Press.
Hofmann K, Bucher P, Falquet L and Bairoch A (1999) The PROSITE database: its status in 1999. Nucleic Acids Research 17(27): 215–219.
Lawrence CE, Altschul SF, Boguski MS et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262: 208–214.
Li Y, Chia N, Lauria M and Bundschuh R (2011) A performance enhanced PSI-BLAST based on hybrid alignment. Bioinformatics 27: 31–37.
Maetschke SR, Kassahn KS, Dunn JA et al. (2010) A visual framework for sequence analysis using n-grams and spectral rearrangement. Bioinformatics 26(6): 737–744.
Nevill-Manning CG, Nu TD, Brutlag DL et al. (1998) Highly specific protein sequence motifs for genome analysis. Proceedings of the National Academy of Sciences of the USA 95: 5865–5871.
Pierleoni A, Martelli PL and Casadio R (2008) PredGPI: a GPI-anchor predictor. BMC Bioinformatics 9: 392.
Rigoutsos I, Floratos A, Parida L, Gao Y and Platt D (2000) The emergence of pattern discovery techniques in computational biology. Metabolic Engineering 2(3): 159–177.
Sakakibara Y (2005) Grammatical inference in bioinformatics. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7): 1051–1062.
Shu N, Zhou T and Hovmöller S (2008) Prediction of zinc-binding sites in proteins from sequence. Bioinformatics 24(6): 775–782.
Sigrist CJ, Cerutti L, de Castro E et al. (2010) PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Research 38(Database issue): D161–D166.

Further Reading

Baldi P and Brunak S (1998) Bioinformatics: The Learning Approach. Cambridge, MA: MIT Press.
Durbin R, Eddy SR, Krogh A and Mitchson G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press.

Web Links

Gibbs sampler http://bayesweb.wadsworth.org/gibbs/gibbs.html
HMMER http://hmmer.janelia.org/
MEME http://meme.sdsc.edu/meme/
Nested MICA http://www.sanger.ac.uk/Software/analysis/nmica/
PFAM (database of protein families) http://www.sanger.ac.uk/Pfam/
PROSITE http://prosite.expasy.org/
SLiMFinder http://bioware.ucd.ie/slimfinder.html
TEIRESIAS and MUSCA http://cbcsrv.watson.ibm.com/Ttwpd.html

Appendix 8. Molecular phylogeny of conopeptides.

J Mol Evol DOI 10.1007/s00239-012-9507-2

Molecular Phylogeny, Classification and Evolution of Conopeptides

N. Puillandre • D. Koua • P. Favreau • B. M. Olivera • R. Stöcklin

Received: 7 February 2012 / Accepted: 12 June 2012 © Springer Science+Business Media, LLC 2012

Abstract Conopeptides are toxins expressed in the venom duct of cone snails (Conoidea, Conus). These are mostly well-structured peptides and mini-proteins with high potency and selectivity for a broad range of cellular targets. In view of these properties, they are widely used as pharmacological tools and many are candidates for innovative drugs. The conopeptides are primarily classified into superfamilies according to their peptide signal sequence, a classification that is thought to reflect the evolution of the multigenic system. However, this hypothesis has never been thoroughly tested. Here we present a phylogenetic analysis of 1,364 conopeptide signal sequences extracted from GenBank. The results validate the current conopeptide superfamily classification, but also reveal several important new features. The so-called "cysteine-poor" conopeptides are revealed to be closely related to "cysteine-rich" conopeptides, with some of them sharing very similar signal sequences, suggesting that a distinction based on cysteine content and configuration is not phylogenetically relevant and does not reflect the evolutionary history of conopeptides. A given cysteine pattern or pharmacological activity can be found across different superfamilies. Furthermore, a few conopeptides from GenBank do not cluster in any of the known superfamilies, and could represent yet-undefined superfamilies. A clear phylogenetically based classification should help to disentangle the diversity of conopeptides, and could also serve as a rationale to understand the evolution of the toxins in the numerous other species of conoideans and venomous animals at large.

Keywords Cone snails · Conus · Conoidea · Cys-pattern · Venom · Molecular evolution

Electronic supplementary material The online version of this article (doi:10.1007/s00239-012-9507-2) contains supplementary material, which is available to authorized users.

N. Puillandre (corresponding author) · D. Koua · P. Favreau · R. Stöcklin
Atheris Laboratories, Case postale 314, Bernex, 1233 Geneva, Switzerland
e-mail: [email protected]

N. Puillandre · B. M. Olivera
Department of Biology, University of Utah, 257 South 1400 East, Salt Lake City, UT 84112, USA

D. Koua
Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland

Introduction

Cone snails of the genus Conus are predatory venomous marine mollusks feeding on fish, worms or snails. After decades of biological prospecting, conopeptides expressed in their venom duct have emerged as one of the richest and most promising marine sources of natural products (Blunt et al. 2012). The analysis of cone snail venoms has revealed a complex exogenome that is characterized by an extremely high level of diversity. With more than 600 described Conus species, each producing an estimated 100–200 venom components, the ensemble of cone snails was, until recently, estimated to produce between 50,000 and 100,000 different toxins (Menez et al. 2006; Olivera 2006). Recent studies, however, clearly demonstrate that this figure is an underestimation, probably by a factor of ten or so, with several new species described every year, more venom components detected in each sample using evolving technologies such as mass spectrometry (Biass et al. 2009; Ueberheide et al. 2009; unpublished results) and NextGen sequencing (Hu et al. 2011; Terrat et al.

2011) or combinations thereof (Violette et al. 2012), and marked intra-species and even intra-specimen variations in venom composition (Davis et al. 2009; Dutertre et al. 2010; Jakubowski et al. 2005). It is now estimated that the number of cone snail venom components exceeds one million.

An important characteristic of conopeptides which makes them attractive for drug development is their high selectivity for molecular targets that span a broad range of therapeutic applications (Gayler et al. 2005; Leary et al. 2009; Molinski et al. 2009). So far, the conopeptide MVIIA (SNX-111, Prialt, or Ziconotide) from Conus magus (the magician cone) that selectively blocks Cav2.2 N-type voltage-gated calcium channels has been approved for the treatment of severe chronic pain (McGivern 2007; Miljanich 2004) and there are more promising drug candidates in the pipeline (e.g., see Favreau et al. 2012; Han et al. 2008a; Lewis 2012). The potential of this rich source of pharmacological products has stimulated a race for the discovery of new toxins. From the traditional bioactivity-guided identification, lead discovery efforts have evolved toward modern structure-driven characterization (venom peptidomics and proteomics, venom gland transcriptomics, targeted genomics, structure–function studies) and biocomputing-assisted analyses (proprietary databases and bioinformatic tools) (Daly and Craik 2009; Favreau and Stöcklin 2009; Koua et al. 2012; Laht et al. 2011). In addition, phylogenetic approaches have recently emerged as an effective way to quickly identify divergent lineages that are likely to have evolved with different functional characteristics. This approach to identify these previously uncharacterized conopeptides is referred to as concerted discovery (Conticello et al. 2001; Duda and Remigio 2008; Olivera 2006; Puillandre and Holford 2010).

However, despite the effectiveness of phylogenetic approaches in concerted discovery, the technique is rarely used for the classification of conopeptides (but see Aguilar et al. 2009; Conticello et al. 2001; Wang et al. 2008; Zhangsun et al. 2006). Several statistical methods for conopeptide classification, such as Mahalanobis (Lin and Li 2007) or BLAST and Euclidian distances among others (Mondal et al. 2006) have been described; however, most of these approaches are primarily designed for classification of new sequences rather than for testing the current classification (i.e., checking the validity of each known group by a blind-exploratory approach). Conopeptide precursors are characterized by a typical structural organization consisting of a highly conserved signal region, followed by a more variable pro-region and a hyper-variable mature toxin containing a few conserved amino acids such as the cysteine residues required for disulfide bonds. Conopeptides are mainly named and classified according to three properties: first, they are characterized by their signal sequence; this short sequence (~20 amino acids) is highly conserved, and has been used to define superfamilies; second, mature toxin structural families are characterized depending on their pattern of cysteines (the Cys-pattern): for example, the mature toxin can include a variable number of cysteines (most commonly 4 or 6), and their respective position can vary (4 cysteines can be organized as C–C–C–C or CC–C–C where "–" represents a variable number of amino acids); finally, several conopeptides have also been characterized according to their molecular targets, referred to hereafter as "functional families," and also previously termed "pharmacological families."

In a recent paper, Kaas et al. (2010) reviewed the structure, function, and diversity of conopeptides on the ConoServer database (www.conoserver.org). In particular, they proposed that "the 'gene superfamily' classification scheme focuses on evolutionary relationships between conopeptides", while the two other classification schemes (cysteine framework and function) do not. Their underlying hypothesis was that similarities in the Cys-pattern or function might have arisen by convergence. While we fully agree with this statement, we also argue that it could serve as a rationale to assess the congruence between the current gene superfamily classification and the evolution of the corresponding multigenic system, and to accurately demonstrate that convergence phenomena are common in conopeptide structure and function.

Here, we review the current superfamily classification of conopeptides by analyzing all the signal sequences available in GenBank using a phylogenetic approach to check: (i) if all the defined superfamilies correspond to homogeneous groups; and (ii) if all the GenBank signal sequences belong to a known superfamily. This study seeks to provide a "rationale" for a phylogenetic classification of conopeptides and to clarify their current classification, thus complementing the work initiated by Kaas et al. (2010).

Materials and Methods

Sequences from GenBank

Since the signal sequences used for phylogenetic analyses (see below) are only found on complete nucleotide precursors and are not known for conopeptides discovered using proteomic approaches, all the nucleotide sequences associated with the genus Conus were downloaded from GenBank (www.ncbi.nlm.nih.gov). The sequences corresponding to non-coding regions, ribosomal genes, mitochondrial genes, and genes with a function that did not relate to toxin activity were removed from the dataset, thus keeping only coding genes with a potential toxin activity. Only sequences obtained from Conus species belonging to the large major clade (Duda and Kohn 2005) were conserved, as a large number of the conopeptides found in species from other clades (e.g., C. californicus) are highly divergent and do not match with any of the currently known superfamilies (Biggs et al. 2010; www.conoserver.org). Consequently, the classification in the present analysis is relevant only for conopeptides of the large major clade species. Conopeptide superfamilies are defined by a conserved signal sequence, thus we used the SignalP 3.0 server (Bendtsen et al. 2004) to identify the signal sequence; all sequences that did not include at least 50 % of the signal region were removed, together with sequences including a stop codon. Only the signal region was used for phylogenetic analyses, as only this part of the conopeptides can be aligned within and, to some extent, between superfamilies.

Phylogenetic Analysis

Aligning signal sequences between highly divergent conopeptides (i.e., belonging to different superfamilies) is arduous, and homology hypotheses are doubtful. Thus, sequences were translated to amino acids and automatically aligned using two different algorithms: Muscle (Edgar 2004; www.ebi.ac.uk/Tools/msa/muscle) and ClustalW (http://clustalw.ddbj.nig.ac.jp/top-e.html). The best model of evolution for these two datasets was selected using Modelgenerator V.85 (Keane et al. 2006) following the corrected Akaike Information Criterion (with four discrete gamma categories) and used to reconstruct phylogenetic trees. The best model of evolution identified by Modelgenerator was JTT + G (Jones Taylor Thornton model, implemented under the name "Jones model" in MrBayes; Jones et al. 1992) for both datasets.

between the cone snails and other conoideans. Consequently, no outgroup was included in the analysis. This absence of an outgroup did not allow us to infer ancestor/descendant relationships.

Results

A total of 1,364 sequences potentially corresponding to conopeptides and with a signal sequence were downloaded from GenBank (performed on 1st of July, 2011). Alignments were 34 and 30 amino acids long with Muscle and Clustal W, respectively. To limit the time of calculation for phylogenetic analysis, only one sequence per amino-acid haplotype was kept; finally, 585 sequences were retained. Overall, the phylogenetic trees obtained from the Muscle and Clustal alignments were congruent; discrepancies were not supported (posterior probabilities <0.90) and concerned phylogenetic relationships between the main clades and the position of a few highly divergent sequences (see details below). For clarity, only the phylogenetic tree based on the Clustal alignment is presented (Fig. 1) but the results obtained from the Muscle alignment, when different, are discussed.

Using information from GenBank and the literature, it was possible to link the clades defined with the Bayesian analysis to known superfamilies. Most of the defined superfamilies (A, D, I1, I2, I3, J, L, O1, O3, P, S, T, V) corresponded to monophyletic groups, with some highly supported (Fig. 1). With the Muscle alignment, the O2 superfamily was included within the O1 superfamily; the superfamily Y was represented by a single sequence, and corresponded to a unique lineage in the tree. However,
Bayesian analyses some superfamilies did not correspond to a monophyletic were performed by running two parallel analyses in group, as they included other conopeptides (e.g., O2 MrBayes (Huelsenbeck et al. 2001), each consisting of included sequences of contryphans, and M included con- eight Markov chains of 30,000,000 generations each with a omarphin—a result already discussed by Han et al. 2008b). sampling frequency of one tree every ten thousand gener- Several conopeptides from GenBank did not cluster in any ations. The number of swaps was set to 5, and the chain of the known superfamilies. These corresponded to known temperature at 0.02. A neighbor-joining tree obtained with cysteine-poor conopeptides, contulakin, and conantokin, MEGA5 (Tamura et al. 2011) was used as starting tree. shown in Fig. 1 as the B and C superfamilies, respectively Convergence of the parameters was evaluated using Tracer (the C superfamily has been previously defined by Jimenez 1.4.1 (Rambaut and Drummond 2007), and analyses were et al. (2007)); two conoCAP sequences (FN868446.1 and terminated when ESS values were all superior to 200. A FN868447.1—named X1 in the Fig. 1 and appendix 1) consensus tree was then calculated after omitting the first described by Mo¨ller et al. (2010); and sequences putatively 25 % trees as burn-in. annotated (FJ237364.1, named X2) or without annotation As is the case for most multigenic families, the identi- in GenBank (DQ359922.1, EF493183.1/EF493184.1 and fication of an outgroup was highly problematic. No gene DQ359921.1, named respectively X3, X4, and X5). In phylogenetically related to, and proven to be an outgroup the Clustal alignment, two other groups of sequences, for conopeptides has been described. 
Furthermore, the use FJ375238.1/FJ375239.1/FJ375240.1 and EF208033.1 of toxins from other conoidean species was not possible, as clustered in the superfamily A and O1, respectively with it would require that the toxins from cone snails all arose long branches, but corresponded to the independent lin- from duplication events that took place after the divergence eages in the Muscle alignment (X6 and X7, respectively). 123 J Mol Evol
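The reduction to one sequence per amino-acid haplotype described in the Results can be illustrated with a minimal Python sketch. It is not the authors' script: the function names, the `*` stop symbol, and the toy identifiers and sequences are assumptions made for illustration only.

```python
from collections import OrderedDict


def has_stop(protein, stop_symbol="*"):
    """True if the translated sequence contains a stop symbol (assumed to be '*')."""
    return stop_symbol in protein


def collapse_haplotypes(records):
    """Keep one representative per distinct amino-acid sequence.

    records: iterable of (identifier, protein_sequence) pairs.
    Returns a list of (representative_id, sequence, member_ids) tuples,
    ordered by first appearance of each haplotype.
    """
    groups = OrderedDict()
    for identifier, protein in records:
        if has_stop(protein):
            continue  # sequences including a stop codon are discarded
        groups.setdefault(protein.upper(), []).append(identifier)
    return [(ids[0], seq, ids) for seq, ids in groups.items()]


# Toy input: identifiers and sequences are invented for illustration.
toy = [
    ("seq1", "MKLTCVVIVAVLLLTACQLITA"),
    ("seq2", "MKLTCVVIVAVLLLTACQLITA"),  # same haplotype as seq1
    ("seq3", "MGMRMMFTVFLLVVLATTVVS"),
    ("seq4", "MKLTC*VIVAV"),             # contains a stop symbol, dropped
]
unique = collapse_haplotypes(toy)
print(len(unique), [rep for rep, _, _ in unique])
```

Keeping the list of member identifiers for each haplotype makes it possible to map a tip of the tree back to every GenBank record that shares the same signal sequence.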

Fig. 1 Bayesian phylogenetic tree (midpoint rooting) obtained from the Clustal alignment of the signal sequences of conopeptides from GenBank. Posterior probabilities (when >0.9) are provided for each node. Gray boxes are used to visualize the superfamilies. The B and C superfamilies correspond to the conantokins and contulakins, respectively. The lineages X1–X7 potentially correspond to previously unrecognized superfamilies (see details in the text)

Function and cysteine pattern were not clade-specific; conopeptides with the same function or cysteine pattern were found in different clades. In addition, sixteen new (i.e., not numbered with Roman numerals) cysteine patterns were identified; however, most of them certainly correspond to anecdotal mutations of the canonical framework in a given family (i.e., C–CC–C–C, C–C–CC–C–CC, and C–CC–C–C, found in the O1 superfamily, differ from the pattern VI/VII by only one mutation), while others may represent a new Cys-pattern number (e.g., the Cys-pattern C–C–C–CC–C, found in the three members of the X6 group). The results are summarized in Table 1 (full details are provided in Appendix 1).

Table 2 lists the number of conopeptides found in each superfamily and their distribution among the 71 Conus species. The superfamilies A, M, and O1 were the largest, each found in at least 39 species, followed by the superfamilies T and I2. Conus caracteristicus, C. imperialis, and C. litteratus each express conopeptides belonging to more than 10 different superfamilies in their venom; however, it was difficult to know if this result reflects a higher conopeptide diversity in comparison to other species, or is due to a greater sampling effort in these species. All the superfamilies present in more than 10 Conus species (A, B, I2, M, O1, O2, and T) were found in mollusk-, worm-, and fish-hunting species.

Discussion

An Updated Classification of Conopeptides

Overall, the molecular phylogeny, based on more than 1,300 conopeptide signal sequences extracted from GenBank, strongly supports the current superfamily classification based on phenetic resemblances, as established in ConoServer. However, this relative congruency between phylogenetic and phenetic classifications is not surprising given the relative conservation of the signal sequence within superfamilies compared with between superfamilies, and

123 J Mol Evol

Table 1 Number of sequences found in each superfamily, with list of cysteine patterns identified and known function in each superfamily

Superfamily | No. of sequences (superfamily) | Cysteine framework ID | Cysteine pattern | No. of sequences (pattern) | Known function
A | 153 | I | CC–C–C | 119 | α, κ, ρ
 | | II | CCC–C–C–C | 3 |
 | | IV | CC–C–C–C–C | 25 |
 | | VI/VII | C–C–CC–C–C | 1 |
 | | XIV | C–C–C–C | 3 |
 | | | C | 1 |
 | | | CC–C–C–C | 1 |
B | 41 | 0 | | 38 | Conantokin
 | | | C–C | 3 |
C | 4 | 0 | | 1 | Contulakin
 | | | C–C | 3 |
D | 13 | XX | C–CC–C–CC–C–C–C–C | 5 | α
 | | | C–CC–C–CC–C–C–C | 1 |
 | | | C–C–C–CC–C–C–C–C–C | 7 |
I1 | 6 | XI | C–C–CC–CC–C–C | 6 | ι
I2 | 45 | XI | C–C–CC–CC–C–C | 35 | κ
 | | XII | C–C–C–C–CC–C–C | 9 |
 | | | C–C–CC–CC–C | 1 |
I3 | 7 | XI | C–C–CC–CC–C–C | 7 |
J | 12 | XIV | C–C–C–C | 12 | α + κ
L | 4 | XIV | C–C–C–C | 3 | α
 | | | C–C–C | 1 |
M | 193 | 0 | | 3 | α, κ, μ, conomarphin
 | | II | CCC–C–C–C | 1 |
 | | III | CC–C–C–CC | 172 |
 | | IV | CC–C–C–C–C | 4 |
 | | IX | C–C–C–C–C–C | 1 |
 | | XVI | C–C–CC | 1 |
 | | XIX | C–C–C–CCC–C–C–C–C | 1 |
 | | | C | 1 |
 | | | C–C | 2 |
 | | | CC–C–C–C | 1 |
 | | | CC–C–C–CC–C | 2 |
 | | | C–CC–C–C–C | 4 |
O1 | 625 | 0 | | 4 | δ, κ, μ, ω
 | | VI/VII | C–C–CC–C–C | 613 |
 | | | C–C–C | 1 |
 | | | C–C–CC–C | 1 |
 | | | C–CC–C–C | 4 |
 | | | C–C–C–C–C | 1 |
 | | | C–C–CC–C–CC | 1 |
O2 | 67 | VI/VII | C–C–CC–C–C | 51 | γ, contryphan
 | | | C–C | 7 |
 | | XV | C–C–CC–C–C–C–C | 9 |
O3 | 25 | VI/VII | C–C–CC–C–C | 25 | Bromosleeper
P | 7 | XIV | C–C–C–C | 2 |
 | | IX | C–C–C–C–C–C | 5 |
S | 7 | VIII | C–C–C–C–C–C–C–C–C–C | 7 | σ, α
T | 140 | 0 | | 12 | ε, χ, τ
 | | X | CC–CXPC | 4 |
 | | V | CC–CC | 121 |
 | | | C–C | 2 |
 | | | CC–CCC | 1 |
V | 2 | XV | C–C–CC–C–C–C–C | 2 |
X1 | 2 | | C–C–C–C–C–C–C | 2 | conoCAP
X2 | 1 | III | CC–C–C–CC | 1 |
X3 | 1 | 0 | | 1 |
X4 | 2 | | C–C–C–C–C–C–C–C–C–CC–C–C–C–C–C–C–C–C–C–C | 2 |
X5 | 1 | VIII | C–C–C–C–C–C–C–C–C–C | 1 |
X6 | 3 | | C–C–C–CC–C | 3 |
X7 | 1 | VIII | C–C–C–C–C–C–C–C–C–C | 1 |

Y | 1 | XVII | C–C–CC–C–CC–C | 1 |

the phylogenetic tree reflects these differences. However, the phylogenetic approach also revealed several new features, the most striking of which is the presence of deeply divergent lineages that, until now, were not included in the conotoxin superfamily classification. There are two main explanations for this result. First, the conopeptide superfamily classification reviewed by Kaas et al. (2010) includes only what is traditionally referred to as “cysteine-rich” conotoxins [i.e., conopeptides with at least two disulfide bridges in the mature sequence as defined by Norton and Olivera (2006)], thus excluding the conopeptides with two cysteines and linear conopeptides also broadly present in the venom (unpublished results). However, although the authors noted that “in future, all disulfide-poor conopeptides will probably have to be attributed to a superfamily,” they refrained from doing so because of the low number of cysteine-poor conopeptides with precursor sequences in ConoServer (21). In GenBank, we identified more than 50 such sequences and included them in the current analysis. The signal sequences of cysteine-poor conopeptides do not cluster separately from the conotoxins; some of them share highly similar signals with known superfamilies (contryphan with O2 and conomarphin with M); therefore, their exclusion from the superfamily classification is not phylogenetically justified. We identified two additional superfamilies, B and C, for conantokins and contulakins, respectively, one of which (C) has been proposed previously (Jimenez et al. 2007). Second, including non-annotated sequences from GenBank in the dataset helped to identify several independent lineages in the tree (X1–X7). The level of divergence of their respective signal sequences with the signals of other superfamilies was equivalent to the level of divergence between known superfamilies, and they thus deserve recognition as new superfamilies. However, as these independent lineages are represented by only one, two or three sequences, and because some of them may not exhibit toxin activity (even if they were all found in venom ducts

Table 2 Number of conopeptides in each superfamily and species

Species Prey A B C D I1 I2 I3 J L M O1 O2 O3 P S T V X1 X2 X3 X4 X5 X6 X7 Y Occurrence

achatinus F4 12 aurisiacus F2 51 3 bullatus F 4 10 8 3 catus F4 29 2 circumcisus F4 14 3 consors F81 310 4 ermineus F4 14 3 geographus F 351 25 12 7 lynceus F1 1 magus F5 5131 4 monachus F4 3 2 obscurus F35 2 ochroleucus F2 1 parius F1 1 purpurascens F4 19 1 4 radiatus F63 2 43 1 1 7 stercusmuscarum F6 36 3 striatus F1013501 5 striolatus F1 8 2 sulcatus F82 2 tulipa F3 22 3 ammiralis M2 5 3 3 aulicus M2 3 3 3 aureus M2 1 bandanus M2 1 dalli M38 2 episcopatus M1 1 3 2 4 gloriamaris M 2 361 1 4 6 marmoreus M 4 2 14155 12 2 7 omaria M46 2 pennaceus M4 6 1332 16 6 textile M 5 2 152924 1 1 18 8 victoriae M3 1 abbreviatus W98 1 arenatus W1 2318 8 5

aristophanes W8 1 betulinus W8 11 1012 1 7

capitaneus W 1 122 4 caracteristicus W25 3 6311 35 1111 coronatus W17 2 distans W1 2 5 3 ebraeus W29 1 eburneus W212 4 1162 1 9 emaciatus W311 3 ferrugineus W21 2 figulinus W5 1 flavidus W1 1 generalis W22 2 imperialis W43 126 271 1 2 1 1 3 13 judaeus W2 1 leopardus W8 5 9 4 4 litteratus W4227 61331185 3 16 1 1 15 lividus W3 18421 3 6 miles W1 1 1 6 2 1 6 miliaris W218 2 musicus W1 1 mustelinus W2 1 planorbis W41 2 pulicarius W4 3 3 6 6 5 quercinus W6 3 4 3 2 5 rattus W24 2 regius W 1 1 sponsalis W2 14 2 spurius W1018 3 tessulatus W 14624 11 5 ventricosus W 6 12 9 5 15 5 vexillum W 2 2 183 5 villepinii W 21 viola W5 1 virgo W22 4 532 41 8 vitulinus W31 222 1 6 Occurrence 39164541435250511795521211121111

Feeding types: F fish-hunting species, M mollusc-hunting species, W worm-hunting species

of cone snails), we refrained from proposing new superfamily names, and only provided temporary names (X1–X7). It should also be borne in mind that many other conopeptides have been described in the literature, some of which have been given formal names (conkunitzin, conolysin, conomap, conophysin, conopressin, conorfamide, and conorphan). Because their signal sequences are not represented as nucleotides in GenBank, they were not included in the analysis. However, a search in the protein database of GenBank retrieved two complete precursors of conkunitzin, with highly similar signal sequences (P0C1X2.1 and P0CY85.1), and a local BLAST search (performed using BioEdit; Hall 1999) of the dataset used for the phylogenetic analyses revealed that the conkunitzin signals were unique, and probably represent a new superfamily. Finally, although most of the superfamily-level clades are highly supported, most of the inter-superfamily nodes are not, preventing any reliable conclusion concerning the phylogenetic relationships at this level.

The original results presented herein raise several issues concerning the classification and nomenclature of the conopeptides and, more generally, of the genes that belong to multigenic systems. The updated classification system we propose is based on a phylogenetic reconstruction that guarantees the identification of sequence clusters that share a common ancestor. However, such phylogenetic trees cannot help in deciding which clades deserve a superfamily-level ranking and which ones do not. One common solution is to rely on a threshold of genetic distances, but the analyses of the genetic distances (calculated as the number of differences) between all the conopeptide signal sequences revealed that the distribution of genetic distances within superfamilies of conopeptides largely overlaps with the distribution of genetic distances between superfamilies (Fig. 2). This overlap can be linked to the high level of homoplasy found in conopeptides, which can result in two conopeptides from different clades having, by chance, a relatively low genetic distance, or to the fact that two previously defined superfamilies would actually correspond to only one. This is the case of the L and I3 superfamilies, separated by genetic distances between 0.38 and 0.69 that would, in most cases, correspond to within-superfamily genetic distances.

Consequently, it is not possible to rely only on a genetic threshold to define superfamilies for conotoxins. A threshold of 0.6, roughly corresponding to the gap between the two distributions of genetic distances (Fig. 2), would lead to the division of the M-superfamily into numerous superfamilies (indeed, Wang et al. 2008 proposed to divide the M-superfamily into M1 and M2), and to the grouping of the superfamilies I1, I3, and L into a single one. However, our approach is aimed at offering complementary guidance to help decide, in the future, whether a conotoxin or a group of conotoxins deserves a superfamily name: (i) since the minimum genetic distance between superfamilies is 0.32, this distance should be the minimum distance between the potential new superfamily(ies) and all the others; (ii) the new superfamily(ies) should correspond to an independent lineage, i.e., it should not cluster in any of the superfamily clades previously defined; (iii) the molecular target of the new conotoxin(s) should ideally be identified, to avoid naming conopeptides that would not be functional; (iv) the structure (cysteine pattern) and/or function should be different from the most closely related superfamilies in terms of genetic distances and/or phylogenetic relationships. All these criteria apply to the B and C superfamilies (genetic distances with other superfamilies >0.3, these two lineages are independent and monophyletic, their molecular targets are identified [Mena et al. 1990; Craig et al. 1999], and their cysteine frameworks are different from their respective sister-groups), justifying the attribution of new superfamily names. We followed the traditional nomenclature of conopeptide superfamilies, i.e., a Roman capital letter. As the number of Roman letters is

Fig. 2 Pairwise distribution of genetic distances (p distances) calculated with MEGA5 using the Clustal alignment. Genetic distances between sequences from the same superfamily are shown in gray, genetic distances between sequences from different superfamilies in black
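The comparison summarized in Fig. 2 can be sketched in a few lines of Python. This is an illustrative reimplementation, not the MEGA5 computation used in the study: the pairwise treatment of gaps, the function names, and the toy aligned sequences are assumptions.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites among positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumed pairwise deletion of gapped sites
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def split_distances(aligned):
    """aligned: list of (superfamily, aligned_sequence) pairs.

    Returns two lists of p-distances: within superfamilies and between superfamilies.
    """
    within, between = [], []
    for (fam_a, seq_a), (fam_b, seq_b) in combinations(aligned, 2):
        distance = p_distance(seq_a, seq_b)
        (within if fam_a == fam_b else between).append(distance)
    return within, between


# Toy alignment: superfamily labels and sequences are invented for illustration.
toy_alignment = [
    ("A", "MKLTC"),
    ("A", "MKLTV"),
    ("M", "MRLSC"),
    ("M", "M-LSC"),
]
within, between = split_distances(toy_alignment)
print("max within:", max(within), "min between:", min(between))
```

On real data, the two distributions overlap whenever the largest within-superfamily distance exceeds the smallest between-superfamily distance, which is the situation the text reports for the conopeptide signal sequences.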

limited, some superfamilies have been named with a Roman letter followed by an Arabic number (e.g., I1, I2, I3, O1, O2, and O3) when several superfamilies share a common cysteine framework or molecular target. Because of the potentially high number of unknown superfamilies of conopeptides, we have no doubt that the nomenclature based on both Roman letters and Arabic numbers will become the reference rule.

The first and fourth criteria also apply to the seven “X” lineages (Fig. 1), but the second applies to only 5 of them (two clustered within the A and O1 superfamilies with the Muscle alignment) and the third to none of them. We propose to name such potential superfamilies of conopeptides, which currently do not meet all the criteria but could in the future, with the Roman letter X followed by an Arabic number, until each is either fully recognized as a separate superfamily or shown to belong to an existing one.

Evolution of the Conopeptides

The phylogenetic analysis clearly confirms that most of the defined superfamilies include conopeptides with different cysteine frameworks and functions. Conversely, similar cysteine frameworks and functions are found in different superfamilies, suggesting that a given cysteine framework or function can appear several times independently, probably as a result of convergent evolution. The multiple appearances of the same framework and function during conotoxin evolution are probably linked to the extremely rapid diversification of the genes. Several molecular mechanisms have been proposed as being responsible for this high rate of diversification. Pi et al. (2006) suggested that alternative splicing, unequal crossing-over or exon shuffling could explain this diversity. Olivera et al. (1999) proposed two other mechanisms: the lack of a mismatch repair system, at least in the hypervariable part of the sequence (the mature toxin); and recombination mechanisms. Several other hypotheses, such as a high rate of duplication, followed by a strong diversifying selection on the newly created gene copies that could lead to the rapid appearance of several structurally and functionally highly divergent genes, have also been proposed and tested by different authors (Duda and Palumbi 1999, 2000; Conticello et al. 2000, 2001; Espiritu et al. 2001; Duda and Remigio 2008; Chang and Duda 2012). All these molecular mechanisms, together with observed differences in the expression pattern between species, possibly linked to episodes of gene silencing and reactivation (“Lazarotoxins”, Conticello et al. 2001; Duda and Palumbi 2004; Duda 2008), could favor the rapid diversification of Conus species, by allowing them to envenomate and feed on new prey and thus colonize new niches (Duda and Lee 2009).

A phylogenetic approach could be very useful to identify divergent conopeptides with potentially different functions, even if they share a common structural framework. For example, the cysteine framework IV, found in the A-superfamily, is already linked to two different functions (αA, Hopkins et al. 1995; and κA, Craig et al. 1998). However, conotoxins described by Conticello et al. (2001), with the same framework, belong to the M-superfamily, suggesting that these IV-conotoxins, which are structurally convergent with the IV-conotoxins in a different superfamily, could exhibit a completely different function. A similar strategy could also apply within each superfamily, where not only the signal sequence, but also the propeptide and mature regions can be aligned, and could reveal divergent lineages with as yet uncharacterized functions (e.g., see Aguilar et al. 2009; Puillandre et al. 2010; Wang et al. 2008; Zhangsun et al. 2006).

Furthermore, our identification of numerous new cysteine frameworks among the GenBank sequences was also surprising. Even if some of them may be non-functional genes (pseudogenes), others could correspond to novel protein structures. A few publications demonstrated that even toxins with odd numbers of cysteines can be functional, for example with two 5-Cys toxins forming a functional dimer or bioactive polymers of the 13-Cys “Con-ikot-ikot” peptide from Conus striatus (Quinton et al. 2009; Walker et al. 2009). Our findings challenge the traditional view where conotoxins are characterized by a limited number of cysteine frameworks: by exploring new evolutionary pathways, the appearance of novel cysteine frameworks may also participate in the hyper-diversification of the conotoxins. In addition, this raises the question of the total number of cysteine patterns one could expect to find among cone snail toxins. It is possible to predict the theoretical number of cysteine patterns that could exist. If we limit the exercise to the 2, 4, and 6 cysteine patterns and exclude those with more than two consecutive cysteines, 20 different frameworks can be proposed (C–C*, CC, CC–C–C*, CC–CC*, C–CC–C, C–C–CC*, C–C–C–C*, CC–CC–CC, CC–CC–C–C, CC–C–CC–C, CC–C–C–CC*, CC–C–C–C–C*, C–CC–CC–C, C–CC–C–CC, C–CC–C–C–C*, C–C–CC–CC, C–C–CC–C–C*, C–C–C–CC–C*, C–C–C–C–CC, C–C–C–C–C–C*). Ten of these frameworks (marked with an *) can be found in GenBank. Given the extreme capacity of the conopeptides to evolve and the apparent lack of evolutionary constraints (as illustrated by the multiple appearances of identical frameworks during their evolution), there is no reason that all these theoretical patterns will not be found in the future. It could be argued that mechanical constraints would prevent the existence of some cysteine patterns; for example, it could be unfavorable to have a disulfide bridge between two adjacent cysteines.
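The count of 20 theoretical frameworks can be checked by enumeration: a framework with n cysteines and no run longer than two is an ordered split of n into runs of length 1 or 2, which gives 2, 5, and 13 patterns for n = 2, 4, and 6. The Python sketch below is only an illustration; the function names are hypothetical, and the example string passed to `cysteine_pattern` is a toy input.

```python
import re


def run_compositions(n_cys, max_run=2):
    """All ordered splits of n_cys cysteines into runs of 1..max_run adjacent residues."""
    if n_cys == 0:
        return [()]
    result = []
    for first in range(1, min(max_run, n_cys) + 1):
        for rest in run_compositions(n_cys - first, max_run):
            result.append((first,) + rest)
    return result


def to_pattern(runs, sep="–"):
    """Render run lengths as a framework string, e.g. (2, 1, 1) -> 'CC–C–C'."""
    return sep.join("C" * run for run in runs)


def cysteine_pattern(mature, sep="–"):
    """Cysteine framework of a mature peptide: runs of adjacent C joined by sep."""
    return sep.join(re.findall(r"C+", mature.upper()))


# Frameworks with 2, 4, or 6 cysteines and at most two consecutive cysteines.
all_frameworks = [
    to_pattern(runs) for n in (2, 4, 6) for runs in run_compositions(n)
]
print(len(all_frameworks))
print(cysteine_pattern("GCCSNPVCHLEHSNLC"))  # toy mature sequence
```

Raising `max_run` or adding other cysteine counts to the tuple `(2, 4, 6)` extends the enumeration to patterns outside the restricted set discussed in the text, such as framework II (CCC–C–C–C).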
However, despite this we found a short mature toxin in the venom of one cone snail with a disulfide bridge between adjacent cysteines (unpublished results). The peptide has been reproduced by protein synthesis, confirming this finding.

Conus and Conoidea Toxin Diversity

The diversity of conotoxins in the venom of several Conus species (Table 2) confirms that most species are able to express a variety of conotoxins, as widely reported in the literature (e.g., Olivera 2002). Furthermore, our results also suggest that Conus diet (fish, mollusk, and worm) is not correlated with differences in venom composition at the superfamily level. If differences exist, as suggested in the literature (e.g., Conticello et al. 2001; Kaas et al. 2010), they most likely occur at the species and intra-superfamily levels. Furthermore, phylogenetic analyses suggest that, at least, the worm- and fish-hunting species are not monophyletic, as these two diets appeared independently several times during Conus evolution (Duda and Palumbi 2004; Espiritu et al. 2001; Kraus et al. 2011). Thus, differences in the venom composition should not be sought between the three diet groups, but between the monophyletic clades defined within these three groups (Duda and Palumbi 2004).

Diversity of the marine snail toxins is not limited to species included in the large major clade of Conus. Recent analyses in other conoidean taxa suggest that toxin hyperdiversity is not the privilege of the Conus large major clade. C. californicus, which is highly divergent from all the other Conus species (Duda and Kohn 2005), showed a high diversity of toxins in its venom and several of them were thought to correspond to new superfamilies (Biggs et al. 2010; www.conoserver.org/?page=classification&type=genesuperfamilies). To a lesser extent, species in the small major clade of Conus may also contain several novel conotoxins, as suggested by an original Cys-pattern (XIII) found in the species C. delessertii (Aguilar et al. 2005). In addition to the family Conidae, original toxins have already been reported in several other species of Conoidea, such as Polystira albida (Lopez-Vera et al. 2004; Rojas et al. 2008), Gemmula periscelida (Lopez-Vera et al. 2004), G. speciosa, G. sogodensis, G. diomedea, G. kieneri (Heralde et al. 2008), Lophiotoma olangoensis (Watkins et al. 2006), Terebra subulata (Imperial et al. 2003), Hastula hectica (Imperial et al. 2007) and Crassispira cerithina (Cabang et al. 2011). Furthermore, taxonomic surveys (Bouchet et al. 2009) and phylogenetic analyses (Puillandre et al. 2011) suggest that the superfamily Conoidea actually comprises a number of deeply divergent clades, whose species diversity is currently largely underestimated. Presently, around 4,500 species have been described, but the group is believed to include more than 10,000 species (Bouchet et al. 2009). Even if the venom apparatus has been lost in several lineages of Conoidea (e.g., Fedosov 2007; Fedosov and Kantor 2008; Holford et al. 2009; Medinskaya and Sysoev 2003), these findings suggest that the conotoxin diversity characterized so far represents only a small part of the total. If the level of diversity across all conoidean species is similar to that found in those already investigated, the number of toxins produced by this single superfamily could be as high as ten million.

Acknowledgments We are grateful to the European Commission for financial support. This study has been performed as a part of the CONCO cone snail genome project for health (www.conco.eu) within the 6th Framework Program (LIFESCIHEALTH-6 Integrated Project LSHB-CT-2007, contract number 037592). We are also grateful to Frédérique Lisacek from the Swiss Institute of Bioinformatics for ongoing help. We would like to thank Dr Ron Hogg of OmniScience SA for editorial support.

Conflict of interest The authors declare that they have no conflict of interest.

References

Aguilar MB, Lopez-Vera E, Ortiz E, Becerril B, Possani LD, Olivera BM, de la Heimer Cotera EP (2005) A novel conotoxin from Conus delessertii with posttranslationally modified lysine residues. Biochemistry 44:11130–11136
Aguilar MB, Chan de la Rosa RA, Falcon A, Olivera BM, de la Heimer Cotera EP (2009) Peptide pal9a from the venom of the turrid snail Polystira albida from the Gulf of Mexico: purification, characterization, and comparison with P-conotoxin-like (framework IX) conoidean peptides. Peptides 30:467–476
Bendtsen JD, Nielsen H, von Heijne G, Brunak S (2004) Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340(4):783–795
Biass D, Dutertre S, Gerbault A, Menou J-L, Offord R, Favreau P, Stöcklin R (2009) Comparative proteomic study of the venom of the piscivorous cone snail Conus consors. J Proteomics 72:210–218
Biggs JS, Watkins M, Puillandre N, Ownby JP, Lopez-Vera E, Christensen S, Moreno KJ, Bernaldez J, Licea-Navarro A, Showers Corneli P, Olivera BM (2010) Evolution of Conus peptide toxins: analysis of Conus californicus Reeve, 1844. Mol Phylogenet Evol 56:1–12
Blunt JW, Copp BR, Keyzers RA, Munro MH, Prinsep MR (2012) Marine natural products. Nat Prod Rep 29:144–222
Bouchet P, Lozouet P, Sysoev AV (2009) An inordinate fondness for turrids. Deep Sea Res II 56:1724–1731
Cabang AP, Imperial JS, Gajewiak J, Watkins M, Showers Corneli P, Olivera BM, Concepcion GP (2011) Characterization of a venom peptide from a crassispirid gastropod. Toxicon 58:672–680
Chang C, Duda TF (2012) Extensive and continuous duplication facilitates rapid evolution and diversification of gene families. Mol Biol Evol. Advance access
Conticello SG, Pilpel Y, Glusman G, Fainzilber M (2000) Position-specific codon conservation in hypervariable gene families. Trends Genet 16:57–59
Conticello SG, Gilad Y, Avidan N, Ben-Asher E, Levy Z, Fainzilber M (2001) Mechanisms for evolving hypervariability: the case of conopeptides. Mol Biol Evol 18:120–131

Craig AG, Zafaralla G, Cruz LJ, Santos AD, Hillyard DR, Dykert J, Rivier J, Gray WR, Imperial J, DelaCruz RG, Sporning A, Terlau H, West PJ, Yoshikami D, Olivera BM (1998) An O-glycosylated neuroexcitatory Conus peptide. Biochemistry 37:16019–16025
Craig AG, Norberg T, Griffin D, Hoeger C, Akhtar M, Schmidt K, Low W, Dykert J, Richelsoni E, Navarro V, Mazella J, Watkins M, Hillyard DR, Imperial J, Cruz LJ, Olivera BM (1999) Contulakin-G, an O-glycosylated invertebrate neurotensin. J Biol Chem 274:13752–13759
Daly NL, Craik DJ (2009) Structural studies of conotoxins. IUBMB Life 61:144–150
Davis J, Jones A, Lewis RJ (2009) Remarkable inter- and intra-species complexity of conotoxins revealed by LC/MS. Peptides 30:1222–1227
Duda TF (2008) Differentiation of venoms of predatory marine gastropods: divergence of orthologous toxin genes of closely related Conus species with different dietary specializations. J Mol Evol 67:315–321
Duda TF, Kohn AJ (2005) Species-level phylogeography and evolutionary history of the hyperdiverse marine gastropod genus Conus. Mol Phylogenet Evol 34:257–272
Duda JTF, Lee T (2009) Ecological release and venom evolution of a predatory marine snail at Easter Island. PLoS One 4:e5558
Duda TF, Palumbi SR (1999) Molecular genetics of ecological diversification: duplication and rapid evolution of toxin genes of the venomous gastropod Conus. Proc Natl Acad Sci 96:6820–6823
Duda TF, Palumbi SR (2000) Evolutionary diversification of multigene families: allelic selection of toxins in predatory cone snails. Mol Biol Evol 17:1286–1293
Duda TF, Palumbi SR (2004) Gene expression and feeding ecology: evolution of piscivory in the venomous gastropod genus Conus. Proc Royal Soc B 271:1165–1174
Duda TF, Remigio A (2008) Variation and evolution of toxin gene expression patterns of six closely related venomous marine snails. Mol Ecol 17:3018–3032
Dutertre S, Biass D, Stöcklin R, Favreau P (2010) Dramatic intraspecimen variations within the injected venom of Conus consors: an unsuspected contribution to venom diversity. Toxicon 55:1453–1462
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Espiritu DJD, Watkins M, Dia-Monje V, Cartier GE, Cruz LE, Olivera BM (2001) Venomous cone snails: molecular phylogeny and the generation of toxin diversity. Toxicon 39:1899–1916
Favreau P, Stöcklin R (2009) Marine snail venoms: use and trends in receptor and channel neuropharmacology. Curr Opin Pharmacol 9:594–601
Favreau P, Benoit E, Hocking E, Carlier L, D’hoedt D, Leipold E, Markgraf D, Schlumberger S, Cordova M, Gaertner H, Paolini-Bertrand M, Hartley O, Tytgat J, Heinemann S, Bertrand D, Boelens R, Stöcklin R, Molgo J (2012) A novel mu-conopeptide, CnIIIC, exerts potent and preferential inhibition of NaV1.2/1.4 channels and blocks neuronal nicotinic acetylcholine receptors. Br J Pharmacol (in press)
Fedosov AE (2007) Anatomy of accessory rhynchodeal organs of Veprecula vepratica and Tritonoturris subrissoides: new types of foregut morphology in Raphitominae (Conoidea). Ruthenica 17:33–41
Fedosov A, Kantor Y (2008) Toxoglossan gastropods of the subfamily Crassispirinae (Turridae) lacking a radula, and a discussion of the status of the subfamily Zemaciinae. J Mollusc Stud 74:27–35
Gayler K, Sandall D, Greening D, Keays D, Polidano M, Livett B, Down J, Satkunanathan N, Khalil Z (2005) Molecular prospecting for drugs from the sea. IEEE Eng Med Biol Mag 24:79–84
Hall TA (1999) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41:95–98
Han TS, Teichert RW, Olivera BM, Bulaj G (2008a) Conus venoms: a rich source of peptide-based therapeutics. Curr Pharm Des 14:2462–2479
Han Y, Huang F, Jiang H, Liu L, Wang Q, Wang Y, Shao X, Chi C, Du W, Wang C (2008b) Purification and structural characterization of a d-amino acid-containing conopeptide, conomarphin, from Conus marmoreus. FEBS J 275:1976–1987
Heralde FM, Imperial J, Bandyopadhyay P, Olivera BM, Concepcion GP, Santos AD (2008) A rapidly diverging superfamily of peptide toxins in venomous Gemmula species. Toxicon 51:890–897
Holford M, Puillandre N, Terryn Y, Cruaud C, Olivera BM, Bouchet P (2009) Evolution of the Toxoglossa venom apparatus as inferred by molecular phylogeny of the Terebridae. Mol Biol Evol 26:15–25
Hopkins C, Grilley M, Miller C, Shon K-J, Cruz LJ, Gray WR, Dykert J, Rivier J, Yoshikami D, Olivera BM (1995) A new family of Conus peptides targeted to the nicotinic acetylcholine receptor. J Biol Chem 270:22361–22367
Hu H, Bandyopadhyay PK, Olivera BM, Yandell M (2011) Characterization of the Conus bullatus genome and its venom-duct transcriptome. BMC Genomics 12:60
Huelsenbeck JP, Ronquist F, Hall B (2001) MrBayes: Bayesian inference of phylogeny. Bioinformatics 17:754–755
Imperial JS, Watkins M, Chen P, Hillyard DR, Cruz LJ, Olivera BM (2003) The augertoxins: biochemical characterization of venom components from the toxoglossate gastropod Terebra subulata. Toxicon 42:391–398
Imperial JS, Kantor Y, Watkins M, Heralde FM, Stevenson B, Chen P, Hansson K, Stenflo J, Ownby J-P, Bouchet P, Olivera BM (2007) Venomous auger snail Hastula (Impages) hectica (Linnaeus 1758): molecular phylogeny, foregut anatomy and comparative toxinology. J Exp Zool 308B:744–756
Jakubowski JA, Kelley WP, Sweedler JV, Gilly WF, Schulz JR (2005) Intraspecific variation of venom injected by fish-hunting Conus snails. J Exp Biol 208:2873–2883
Jimenez EC, Olivera BM, Teichert RW (2007) αC-conotoxin PrXA: a new family of nicotinic acetylcholine receptor antagonists. Biochemistry 46:8717–8724
Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275–282
Kaas Q, Westermann JC, Craik DJ (2010) Conopeptide characterization and classifications: an analysis using ConoServer. Toxicon 55:1491–1509
Keane TM, Creevey CJ, Pentony MM, Naughton TJ, McInerney JO (2006) Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol 6:1–17
Koua D, Brauer A, Laht S, Kaplinski L, Favreau P, Remm M, Lisacek F, Stöcklin R (2012) ConoDictor: a tool for prediction of conopeptide superfamilies. Nucleic Acids Res (in press)
Kraus NJ, Showers Corneli P, Watkins M, Bandyopadhyay PK, Seger J, Olivera BM (2011) Against expectation: a short sequence with high signal elucidates cone snail phylogeny. Mol Phylogenet Evol 58:383–389
Laht S, Koua D, Kaplinski L, Lisacek F, Stöcklin R, Remm M (2011) Identification and classification of conopeptides using profile Hidden Markov models. Biochim Biophys Acta 1824:488–492


Leary D, Vierros M, Hamon G, Arico S, Monagle C (2009) Marine genetic resources: a review of scientific and commercial interest. Mar Policy 33:183–194
Lewis RJ (2012) Discovery and development of the χ-conopeptide class of analgesic peptides. Toxicon 59(4):524–528
Lin H, Li Q-Z (2007) Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochem Biophys Res Commun 354:548–551
Lopez-Vera E, de la Heimer Cotera EP, Maillo M, Riesgo-Escovar JR, Olivera BM, Aguilar MB (2004) A novel structure class of toxins: the methionine-rich peptides from the venoms of turrid marine snails (Mollusca, Conoidea). Toxicon 43:365–374
McGivern JG (2007) Ziconotide: a review of its pharmacology and use in the treatment of pain. Neuropsychiatr Dis Treat 3:69–85
Medinskaya AI, Sysoev A (2003) The anatomy of Zemacies excelsa, with a description of a new subfamily of Turridae (Gastropoda, Conoidea). Ruthenica 13:81–87
Mena EE, Gullak MF, Pagnozzi MJ, Richter KE, Rivier J, Cruz LJ, Olivera BM (1990) Conantokin-G: a novel peptide antagonist to the N-methyl-D-aspartic acid (NMDA) receptor. Neurosci Lett 118:241–244
Menez A, Stöcklin R, Mebs D (2006) ‘Venomics’ or: the venomous systems genome project. Toxicon 47:255–259
Miljanich GP (2004) Ziconotide: neuronal calcium channel blocker for treating severe chronic pain. Curr Med Chem 11:3029–3040
Molinski TF, Dalisay DS, Lievens SL, Saludes JP (2009) Drug development from marine natural products. Nat Rev Drug Discov 8:69–85
Möller C, Melaun C, Castillo C, Díaz ME, Renzelman CM, Estrada O, Kuch U, Lokey S, Marí F (2010) Functional hypervariability and gene diversity of cardioactive neuropeptides. J Biol Chem 285:40673–40680
Mondal S, Bhavna R, Babu RM, Ramakumar S (2006) Pseudo amino acid composition and multi-class support vector machines approach for conotoxin superfamily classification. J Theor Biol 243:252–260
Norton RS, Olivera BM (2006) Conotoxins down under. Toxicon 48:780–798
Olivera BM (2002) Conus venom peptides: reflections from the biology of clades and species. Annu Rev Ecol Syst 33:25–47
Olivera BM (2006) Conus peptides: biodiversity-based discovery and exogenomics. J Biol Chem 281:31173–31177
Olivera BM, Walker C, Cartier GE, Hooper D, Santos AD, Schoenfeld R, Shetty R, Watkins M, Bandyopadhyay PK, Hillyard DR (1999) Speciation of cone snails and interspecific hyperdivergence of their venom peptides. Potential evolutionary significance of introns. Ann NY Acad Sci 870:223–237
Pi C, Liu J, Peng C, Liu Y, Jiang X, Zhao Y, Tang S, Wang L, Dong M, Chen S, Xu A (2006) Diversity and evolution of conotoxins based on gene expression profiling of Conus litteratus. Genomics 88:809–819
Puillandre N, Holford M (2010) The Terebridae and teretoxins: combining phylogeny and anatomy for concerted discovery of bioactive compounds. BMC Chem Biol 10:7
Puillandre N, Watkins M, Olivera BM (2010) Evolution of Conus peptide genes: duplication and positive selection in the A-superfamily. J Mol Evol 70:190–202
Puillandre N, Kantor Y, Sysoev A, Couloux A, Meyer C, Rawlings T, Todd JA, Bouchet P (2011) The dragon tamed? A molecular phylogeny of the Conoidea (Mollusca, ). J Mollusc Stud 77:259–272
Quinton L, Gilles N, De Pauw E (2009) TxXIIIA, an atypical homodimeric conotoxin found in the Conus textile venom. J Proteomics 72:219–226
Rambaut A, Drummond AJ (2007) Tracer v1.4. Available from http://beast.bio.ed.ac.uk/Tracer
Rojas A, Feregrino A, Ibarra-Alvarado C, Aguilar MB, Falcon A, de la Heimer Cotera EP (2008) Pharmacological characterization of venoms obtained from Mexican toxoglossate gastropods on isolated guinea pig ileum. J Venom Anim Toxins Incl Trop Dis 14:497–513
Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28:2731–2739
Terrat Y, Biass D, Dutertre S, Favreau P, Remm M, Stöcklin R, Piquemal D, Ducancel F (2011) High-resolution picture of a venom gland transcriptome: case study with the marine snail Conus consors. Toxicon 59:34–46
Ueberheide BM, Fenyo D, Alewood PF, Chait BT (2009) Rapid sensitive analysis of cysteine rich peptide venom components. Proc Natl Acad Sci 106:6910–6915
Violette A, Leonardi A, Piquemal D, Terrat Y, Biass D, Dutertre S, Noguier F, Ducancel F, Stöcklin R, Križaj I, Favreau P (2012) Recruitment of glycosyl hydrolase proteins in a cone snail venomous arsenal: further insights into biomolecular features of Conus venoms. Mar Drugs 10:258–280
Walker CS, Jensen S, Ellison M, Matta JA, Lee WY, Imperial JS, Duclos N, Brockie PJ, Madsen DM, Isaac JT, Olivera BM, Maricq AV (2009) A novel Conus snail polypeptide causes excitotoxicity by blocking desensitization of AMPA receptors. Curr Biol 19:900–908
Wang Q, Jiang H, Hana Y-H, Yuan DD, Chi C-W (2008) Two different groups of signal sequence in M-superfamily conotoxins. Toxicon 51:813–822
Watkins M, Hillyard DR, Olivera BM (2006) Genes expressed in a Turrid venom duct: divergence and similarity to conotoxins. J Mol Evol 62:247–256
Zhangsun D, Luo S, Wu Y, Xiaopeng Z, Hu Y, Xie L (2006) Novel O-superfamily conotoxins identified by cDNA cloning from three vermivorous Conus species. Chem Biol Drug Des 68:256–265
