ISBRA 2012 SHORT ABSTRACTS
8TH INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS RESEARCH AND APPLICATIONS
May 21-23, 2012 University of Texas at Dallas, Dallas, TX
http://www.cs.gsu.edu/isbra12/
Symposium Organizers
Steering Committee: Dan Gusfield, University of California, Davis; Ion Mandoiu, University of Connecticut; Yi Pan, Georgia State University; Marie-France Sagot, INRIA; Alex Zelikovsky, Georgia State University
General Chairs: Ovidiu Daescu, University of Texas at Dallas; Raj Sunderraman, Georgia State University
Program Chairs: Leonidas Bleris, University of Texas at Dallas; Ion Mandoiu, University of Connecticut; Russell Schwartz, Carnegie Mellon University; Jianxin Wang, Central South University
Publicity Chair: Sahar Al Seesi, University of Connecticut
Finance Chairs: Anu Bourgeois, Georgia State University; Raj Sunderraman, Georgia State University
Web Master, Web Design: Piyaphol Phoungphol; J. Steven Kirtzic
Sponsors
National Science Foundation; Department of Computer Science, Georgia State University; Department of Computer Science, University of Texas at Dallas
Program Committee Members
Srinivas Aluru, Iowa State University; Danny Barash, Ben-Gurion University; Robert Beiko, Dalhousie University; Anne Bergeron, Universite du Quebec a Montreal; Daniel Berrar, University of Ulster; Paola Bonizzoni, Universita' Degli Studi di Milano-Bicocca; Daniel Brown, University of Waterloo; Doina Caragea, Kansas State University; Tien-Hao Chang, National Cheng Kung University; Chien-Yu Chen, National Taiwan University; Matteo Comin, University of Padova; Bhaskar DasGupta, University of Illinois at Chicago; Jorge Duitama, University of Connecticut; Oliver Eulenstein, Iowa State University; Guillaume Fertin, University of Nantes; Vladimir Filkov, University of California Davis; Jean Gao, University of Texas at Arlington; Katia Guimaraes, Federal University of Pernambuco; Jiong Guo, Saarland University; Robert Harrison, Georgia State University; Jieyue He, Southeast University; Steffen Heber, North Carolina State University
Allen Holder, Rose-Hulman Institute of Technology; Jinling Huang, Eastern Carolina University; Lars Kaderali, University of Heidelberg; Iyad Kanj, DePaul University; Ming-Yang Kao, Northwestern University; Yury Khudyakov, CDC; Danny Krizanc, Wesleyan University; Jing Li, Case Western Reserve University; Fenglou Mao, University of Georgia; Osamu Maruyama, Kyushu University; Li Min, Georgia State University; Ion Moraru, University of Connecticut Health Center; Axel Mosig, University of Leipzig; Giri Narasimhan, Florida International University; Yi Pan, Georgia State University; Laxmi Parida, IBM; Bogdan Pasaniuc, Harvard University; Andrei Paun, Louisiana Tech University; Itsik Pe'er, Columbia University; Weiqun Peng, George Washington University; Nadia Pisanti, University of Pisa; Maria Poptsova, University of Connecticut; Teresa Przytycka, NCBI; Sven Rahmann, Technical University Dortmund; Shoba Ranganathan, Macquarie University
S. Cenk Sahinalp, Simon Fraser University; David Sankoff, University of Ottawa; Russell Schwartz, Carnegie Mellon University; Joao Setubal, Virginia Bioinformatics Institute; Mona Singh, Princeton University; Ileana Streinu, Smith College; Wing-Kin Sung, National University of Singapore; Sing-Hoi Sze, Texas A&M University; Ilias Tagkopoulos, University of California; Marcel Turcotte, University of Ottawa; Gabriel Valiente, Technical University of Catalonia; Stéphane Vialette, Université Paris-Est Marne-la-Vallée; Li-San Wang, University of Pennsylvania; Lusheng Wang, City University of Hong Kong; Xiaowo Wang, Tsinghua University; Fangxiang Wu, University of Saskatchewan; Yufeng Wu, University of Connecticut; Zhen Xie, Massachusetts Institute of Technology; Jinbo Xu, Toyota Technological Institute at Chicago; Zhenyu Xuan, University of Texas at Dallas; Alex Zelikovsky, Georgia State University; Fa Zhang, Chinese Academy of Science; Yanqing Zhang, Georgia State University; Leming Zhou, University of Pittsburgh
ISBRA 2012 Short Abstracts
AutoPipe: A Toolbox for Systems Biology Workflow Query Synthesis, Hasan Jamil
A new method to predict linear B-cell epitope using support vector machine, Bo Yao, Lin Zhang and Chi Zhang
Asymptotic properties of a median tree under the coalescent model, Liang Liu
Comparison of RNA-Seq with Microarray Analysis of the Transcriptional Response in HT-29 Colon Cancer Cells to 5-aza-deoxycytidine, Xiao Xu, Erica Antinoiou, W. Richard McCombie, Jennie Williams, Asia Brown, Wei Zhu, Song Wu and Ellen Li
CPAM: Effective Composite Regulatory Pattern Miner for Genome Sequences, Dan He
Pattern Characterization and Functional Mapping for Biomedical Signal Sets, Anish Nair and Kamran Kiasaleh
MapBase: A Virtual Biological ID Map Database, Hasan Jamil
Investigations on Elastic Network Models of Coarse-Grained Membrane Proteins, Kannan Sankar, Michael T. Zimmermann and Robert L. Jernigan
De novo Genome and Transcriptome Sequencing of Social Paper Wasps: Application to Understanding Parasite Manipulation of Host Behavior, Ruolin Liu, Daniel Standage and Amy Toth
Genome sequencing, assembly, annotation and comparative analysis of Pseudomonas fluorescens NCIMB 11764 bacterium, Claudia Vilo, Michael Benedik, Daniel Kunz and Qunfeng Dong
Statistical Evaluation of Dynamic Brain Cell Calcium Activity, Kinsey Cotton, Mark Decoster, Katie Evans, Richard Idowu and Mihaela Paun
Lineage Specific Expansion of Protein Families in Malaria Parasites, Hong Cai, Jianying Gu and Yufeng Wang
A Mean Shift Clustering Based Algorithm for Multiple Alignment of LC-MS Data, Minh Nguyen and Jean X. Gao
A new algorithm for the molecular distance geometry problem with inaccurate distance data, Michael Souza, Carlile Lavor, Albert Muritiba and Nelson Maculan
Identification of highly synchronized regulatory subnetwork with gene expression and interaction dynamics, Shouguo Gao and Xujing Wang
MGC: Gene calling in metagenomic sequences, Achraf El Allali and John Rose
Structural Motif Discovery Algorithms: Classification and Benchmarks, Isra Al-Turaiki, Ghada Badr and Hassan Mathkour
Enumerating Maximal Frequent Subtrees, Akshay Deepak and David Fernández-Baca
Bioinformatics: Desktop Applications to Peta-Scale Architectures with Web-Based Portals, Bhanu Rekepalli, Paul Giblock and Christopher Reardon
A Web-based multi-Genome Synteny Viewer for Customized Data, Kashi Revanna, Chi-Chen Chiu, Daniel Munro, Alvin Gao and Qunfeng Dong
Subgingival plaque microbiota in patients with type 2 diabetes, Mi Zhou, Ruichen Rong, Daniel Munro, Qi Zhang and Qunfeng Dong
Automatic Analysis of Dendritic Territory for Neuronal Images, Santosh Lamichhane and Jie Zhou
A Neural Network Approach to Pre-filtering MS/MS spectra, James Cleveland and John Rose
Statistical software and business productivity applications: workflows for communication and efficiency, Marie Vendettuoli, Heike Hofmann and David Siev
Development of a Detailed Model for the FcRn-mediated IgG Homeostasis, Venkat Pannala, Dilip Kumar Challa, Sally Ward and Leonidas Bleris
Querying Evolutionary Relationships in Phylogenetic Databases, Grant Brammer and Tiffani Williams
Gene Expression Resources Available from MaizeGDB, Wimalanathan Kokulapalan, Jack Gardiner, Bremen Braun, Ethalinda Cannon, Mary Schaeffer, Lisa Harper, Carson Andorf, Darwin Campbell, Scott Birkett, Taner Sen, Nicholas Provart and Carolyn Lawrence
Nocardia spp. Identification Using a Bioinformatics Approach, Dhundy Kiran Bastola, Scott McGrath, Ishwor Thapa and Peter Iwen
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads, Serghei Mangul, Adrian Caciula, Nicholas Mancuso, Olga Glebova, Ion Mandoiu and Alex Zelikovsky
Distributions of Palindromic Proportional Content in Bacteria, Oliver Bonham-Carter, Lotfollah Najjar, Ishwor Thapa and Dhundy Kiran Bastola
GREDSTAT: Genome-wide Restriction Enzyme Digestion STatistical Analysis Tool, Maga Rowicka and Norbert Dojer
Scaffolding Large Genomes using Integer Linear Programming, James Lindsay, Hamed Salooti, Alex Zelikovsky and Ion Mandoiu
Inference of allele specific expression levels from RNA-Seq data, Sahar Al Seesi and Ion Mandoiu
Monitoring of Human body tissues at Molecular Level using FOTI Systems, G.S. Uthayakumar and A. Sivasubramanian
Multi-Commodity Flow Methods for Quasispecies Spectrum Reconstruction Given Amplicon Reads, Nicholas Mancuso, Bassam Tork, Pavel Skums, Ion Mandoiu and Alex Zelikovsky
Quasispecies frequency reconstruction using multicommodity flows, Pavel Skums, Alexander Artyomenko, Alex Zelikovsky and Yury Khudyakov
AutoPipe: A Toolbox for Systems Biology Workflow Query Synthesis*
Hasan M. Jamil
Department of Computer Science, Wayne State University, USA
Abstract. Prohibitive software implementation costs are a major barrier for biologists in testing potentially insightful hypotheses. An intriguing question that remains open is whether a biologist can conceptually state an arbitrary computational biology process and have it mapped to a workflow query over a network of distributed resources. The major hurdle is how to map the concepts into concrete and semantically equivalent computational artifacts. In this paper, we propose a novel model for ad hoc systems biology workflow query synthesis from stored descriptions of hierarchical computational components. We leverage ontological concept structures and concept transformation relationships in our system AutoPipe, which allows users to explore tentative implementations of workflow queries and select the most fitting ones. The synthesized queries are expressed in declarative languages such as BioFlow, which can be executed in our LifeDB database management system.
1 Introduction
Although most concepts in biology are well understood and have defined meanings, stitching them together conceptually in a coherent sequence does not always lead to a computable query that can be executed to generate an expected response. For example, consider a gene expression analysis that involves identifying a novel set of small regulatory relationships among gene products with another set of known genes for which prior knowledge is available. In other words, we are interested in finding new regulatory relationships, with high enough confidence, for an already known pathway. Presumably, the data includes the expression profiles of the genes in the known pathway P along with other genes of interest. Conceptually, the computational process may be expressed in the following way. The real question is how to implement this pipeline to generate the expected response.
1. Select top n differentially expressed genes including the genes in P.
2. Reverse engineer gene regulatory network to extract a candidate network.
3. Find ranked evidence of new regulatory relationships in known interaction networks such as pathways, protein-protein interaction networks, etc.
4. Display top k networks in order of relevance.
* Research supported in part by National Science Foundation grant IIS 0612203.
The choice of artifacts to implement the above pipeline is researcher specific, based mostly on her familiarity with the tools or on their popularity. For example, step 1 could be implemented using the GSEA algorithm [12] followed by a selection. In step 2, she can use RNN [7] to induce the candidate regulatory network, then find the top-k matching networks using the TraM algorithm [2] in step 3, and finally display the graph using a suitable tool. However, at each step she could make alternate choices as well. For example, she can choose the enhanced GSEA method in [5], or a simpler algorithm that comes standard with Bioconductor/R [10]. For regulatory network generation, she can select algorithms such as Genie3 [4] or BicAT-Plus [1]. Finally, for network matching, she could have selected TALE [13] or other similar graph matching algorithms. However, these choices would depend largely upon her expectation of the overall query semantics and the input-output behavior of the components, the compatibility of the data and the application tools, and the complexity of these artifacts and her familiarity with them. In some cases, she may need to write small pieces of glue code to patch disparities, or apply format conversions to make the components compatible. The more complex and diverse the choices are, the more expensive the application will tend to be, and the more unlikely it will be for her to develop the pipeline herself, thereby introducing communication hurdles across domain boundaries (computational experts and biologists) and substantially increasing the cost. This prompts the question: could she potentially map this description into a pipeline using a tool that could autonomously and judiciously stitch the available resources, only minimally involving the user? An even more intriguing question is: could a user just say, display top k regulatory networks from the input data set, and let the system fill in the blanks?
2 AutoPipe Model
The query construction approach in AutoPipe is substantially different from logic programming approaches [9], where the final goal is considered a proof construction problem and all needed components are pre-defined as rules. It is also significantly different from program synthesis research [8] in software engineering, where rigorous description regimes are required for apparently simple computational needs. Our approach is also distinct from query synthesis research in databases [11], where the query is synthesized from a pair of input set and a view through reverse engineering, whereas we do so from only the input data set. Since we require substantially less information from the user, we compensate for the loss of essential information needed for successful reverse engineering by augmenting the database with a resource template hierarchy R, a concept hierarchy C, a coupling relationship ⪯ among the resource templates in R, and a symmetric mapping µ of the form µ : C ↔ R.

We allow two types of resources – tools and data. Tools are of three types – analyzers, converters and visualizers. Analyzers transform data resources to produce data resources, while converters change the representation or formats of data resources. Visualizers, on the other hand, accept data resources and display them in specific ways, and thus are treated as terminating transformers. A data resource, in contrast, is an initiating transformer. All resources have defined input-output behaviors described using concepts in C (figure 1), which are always enforced. A coupling relation between templates t1 and t2 may be implicit or explicit. In both cases, t1 ⪯ t2 exists only if there is an injective mapping from the input descriptions of t2 to the output descriptions of t1, i.e., to be compatible t1 must supply all inputs needed by t2.

While for data type resources an instance is a table description where the attributes are described using concepts in C, for every tool type resource template an instance is a well defined description of a computational procedure from which a BioFlow [6], or any other declarative language, procedure can be constructed. BioFlow supports a declarative statement, called define function, for desktop or internet tool application, which is capable of resolving schema mismatch and automatic extraction of needed information using autonomous wrapper generation. It also supports declarative sequencing of predefined procedures for powerful process graph implementation. We use BioFlow as our target language in AutoPipe in the remainder of this paper.

Fig. 1. Examples of templates (visualizer, analyzer, converter and data, each shown with its inputs and/or outputs).
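The compatibility condition behind the coupling relation can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: concepts are compared by exact name (the concept hierarchy C is ignored), so the injective mapping reduces to a subset test. The `Template` class, the `couples` function and all concept names are hypothetical and are not part of AutoPipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical resource template with concept-labelled inputs and outputs."""
    name: str
    inputs: frozenset
    outputs: frozenset


def couples(t1: Template, t2: Template) -> bool:
    """Return True if t1 can feed t2.

    Assumption: concepts match by exact name, so "an injective mapping
    from the inputs of t2 to the outputs of t1" becomes a subset test.
    A template without inputs (a data resource) is never a successor.
    """
    return bool(t2.inputs) and t2.inputs <= t1.outputs


# A data resource only has outputs, a visualizer only has inputs.
expr_table = Template("expr_table", frozenset(),
                      frozenset({"GeneId", "ExpressionProfile"}))
selector = Template("top_n_selector",
                    frozenset({"GeneId", "ExpressionProfile"}),
                    frozenset({"GeneSet"}))
viewer = Template("graph_viewer", frozenset({"Network"}), frozenset())

can_select = couples(expr_table, selector)  # all inputs of selector are supplied
can_view = couples(expr_table, viewer)      # "Network" is not supplied
```

A fuller treatment would replace the exact-name comparison with a lookup in the concept hierarchy, so that a more specific output concept could satisfy a more general input concept.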
2.1 Synthesizing Workflow Queries
Given the coupling relationship ⪯, a set of input tables R = {r1, . . . , rn}, a target concept c, and a given k, the synthesis of workflow templates produces a set of m ≤ k shortest possible directed acyclic graphs constructed from ⪯ such that the initial or root nodes are data resources in R, and the final node is a tool resource t such that µ(t) = c. The target workflow queries are then essentially instantiations of the workflow templates with concrete data types, tools and display functions as a set of declarative workflow queries. While any declarative language can be used by developing language converters of choice, the current implementation of AutoPipe supports only BioFlow queries, for execution in the LifeDB database management system [3].
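As a rough illustration of this synthesis step, the following Python sketch searches a coupling graph breadth-first and returns up to k shortest chains that start at an input table and end at a tool mapped to the target concept. It is a simplification, not the AutoPipe algorithm: it enumerates linear chains rather than general directed acyclic graphs, and the function name, the example couplings between the tools mentioned earlier, and the concept name `RankedNetworks` are all assumptions made for the example.

```python
from collections import deque


def synthesize(coupling, roots, concept_of, target, k):
    """Return up to k shortest chains from a root to a tool mapped to target.

    coupling   -- dict: resource name -> list of resources it can feed
    roots      -- names of the input data resources
    concept_of -- dict: tool name -> concept it is mapped to
    Breadth-first search visits chains in order of increasing length,
    so the first k hits are the shortest ones.
    """
    found = []
    queue = deque([root] for root in roots)
    while queue and len(found) < k:
        path = queue.popleft()
        last = path[-1]
        if concept_of.get(last) == target:
            found.append(path)
            continue
        for successor in coupling.get(last, ()):
            if successor not in path:  # keep the chain acyclic
                queue.append(path + [successor])
    return found


# Hypothetical coupling graph; the edges are assumed for illustration only.
coupling = {
    "expr": ["GSEA"],
    "GSEA": ["RNN", "Genie3"],
    "RNN": ["TraM"],
    "Genie3": ["TraM"],
}
concept_of = {"TraM": "RankedNetworks"}

templates = synthesize(coupling, ["expr"], concept_of, "RankedNetworks", k=2)
```

Here `templates` holds the two alternative chains through RNN and through Genie3, which a user could then inspect and choose between.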
A Declarative Language for Generating Workflow Queries. The linguistic constructs of AutoPipe leverage its shortest path graph construction capability, which is based on the definition of coupling, and the concept of isomorphic subgraph matching to retrieve possible implementations of a workflow query from resource instances and coupling relationships. AutoPipe supports the construct statement shown below for the extraction of workflow queries.
construct any | all | distinct | top k concept for relations through templates;
display constructExpression with template;
convert constructExpression to targetLanguage;
In the construct statement, concept is a term in C, relations are instances of data templates, and templates are names in the resource template hierarchy. The statement returns all directed acyclic graphs induced by the relation ⪯ that originate in a relation and end in a concept via nodes in templates in the specified (not necessarily consecutive) order. Since the construct statement requires a concept, and display templates do not map to any specific concept, we support the last two statements to display the computation and to convert the workflow graphs into executable queries, respectively. In the last two statements, constructExpression is any valid AutoPipe construct statement, template is an instance of a display type tool template, and targetLanguage is a declarative workflow query language, such as BioFlow. Finally, create view or insert into type statements may be used directly to save views computed by the construct statements.
References
1. F. Alakwaa, N. Solouma, and Y. Kadah. Construction of gene regulatory networks using biclustering and bayesian networks. Theoretical Biology and Medical Modelling, 8(1):39+, Oct. 2011.
2. S. Amin, J. Russell L. Finley, and H. M. Jamil. Top-k similar graph matching using tram in biological networks. ACM/IEEE TCBB, 2012. Accepted.
3. A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. M. Jamil, and L. Lipovich. On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In DEXA, 2009.
4. V. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using tree-based methods. PloS one, 5(9):e12776+, 2010.
5. R. A. Irizarry, C. Wang, Y. Zhou, and T. P. Speed. Gene set enrichment analysis made simple. Statistical Methods in Medical Research, 18(6):565–575, Dec. 2009.
6. H. M. Jamil, A. Islam, and S. Hossain. A declarative language and toolkit for scientific workflow implementation and execution. IJBPIM, 5(1):3–17, 2010.
7. M. Kabir, N. Noman, and H. Iba. Reverse engineering gene regulatory network from microarray data using linear time-variant model. BMC Bioinformatics, 11(S-1):56, 2010.
8. V. Kuncak, M. Mayer, R. Piskac, and P. Suter. Software synthesis procedures. Communications of the ACM, 55(2):103–111, 2012.
9. D. Nardi and R. Rosati. Deductive synthesis of programs for query answering. In International Workshop on Logic Program Synthesis and Transformation, pages 15–29. Springer-Verlag, 1992.
10. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.
11. A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing view definitions from data. In ICDT, pages 89–103, 2010.
12. A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43):15545–15550, 2005.
13. Y. Tian and J. M. Patel. Tale: A tool for approximate large graph matching. In International Conference on Data Engineering, pages 963–972, 2008.
A new method to predict linear B-cell epitope using support vector machine
Bo Yao1, Lin Zhang2, Shide Liang3* and Chi Zhang1*
1 School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, 68588, USA
2 Department of Statistics, University of Nebraska, Lincoln, NE, 68588, USA
3 Systems Immunology Lab, Immunology Frontier Research Center, Osaka University, Suita, Osaka, 565-0871, Japan
*Corresponding author
Abstract. Identifying protein surface regions preferentially recognizable by antibodies (antigenic epitopes) is at the heart of new immuno-diagnostic reagent discovery and vaccine design, and computational prediction is a crucial means to this end. Many linear B-cell epitope prediction methods, such as BepiPred, ABCPred, AAP, and BCPred, have been developed towards this goal. However, effective immunological research demands higher accuracy and more robust performance than the current algorithms provide. In this work, we developed a new method to predict antigenic epitopes from sequence input. The method uses a Support Vector Machine (SVM) that combines tri-peptide similarity and propensity scores. In a leave-one-out test, our method achieved an accuracy of 77.75% and a specificity of 81.99%.
Asymptotic properties of a median tree under the coalescent model
Liang Liu
University of Georgia
Abstract. Accurately estimating the evolutionary history of species (species tree) is one of the most important problems in biology. In this paper, I investigate the statistical properties of the sample median tree under the coalescent model and show that the sample median tree is a statistically consistent estimate of the species tree. This result provides a consistent method for accurately estimating species trees.
Keywords: species tree; median tree; coalescent model.
1 Introduction
As molecular sequence data become increasingly available, phylogenetic studies have found significant evidence that the history of a single gene (gene tree) may differ from the history of species, due to a variety of biological phenomena including deep coalescence, horizontal gene transfer, and gene duplication/loss [3]. Many probabilistic models have been proposed to explain the relationship between gene trees and the species tree. Most commonly, gene trees are viewed as a random sample generated from a coalescence process occurring along the lineages of the species tree [4]. A broad class of distance methods attempts to estimate the species tree by a median tree, that is, the tree with minimum distance to the gene trees, but studies on the statistical properties of the median tree under the coalescent model [5] are limited. This paper investigates the asymptotic properties of the median tree under the coalescent model. It can be shown that under the coalescent model, the median tree is a statistically consistent estimate of the species tree.
2 Assumptions and notations
Gene trees and the species tree are binary rooted trees on the same set of taxa. An N-taxon rooted tree is characterized by a set of $\binom{N}{3}$ rooted triples (Fig. 1). Thus, a rooted binary tree can be represented by a vector of rooted triples. Consider a rooted binary tree of taxa A, B, C, and D (Fig. 1). This tree has four rooted triples: $T_{ABC}$, $T_{ABD}$, $T_{ACD}$, and $T_{BCD}$. The topology of a rooted triple is indicated by an indicator vector $[I_1, I_2, I_3]$. For example, [1,0,0] implies that the topology of the rooted triple $T_{ABC}$ is AB|C (A and B are grouped together), while [0,1,0] and [0,0,1] indicate that the topology of $T_{ABC}$ is AC|B and BC|A, respectively. A rooted binary tree is uniquely represented by a vector of indicators in which the value is 1 if the corresponding topology of the rooted triple is present and 0 if the topology is not present in the binary tree. For example, the vector representation of the four-taxon tree in Figure 1 is [1,0,0,1,0,0,0,0,1,0,0,1]. Note that each triplet in the vector has exactly one 1 and two 0s. Thus the sum of the vector is equal to the number of rooted triples in the tree, i.e., $\binom{N}{3}$. The length of the vector is $3 \times \binom{N}{3}$. The triplet distance between two rooted trees $T_1$ and $T_2$ is

$$d(T_1, T_2) = \sum_{i=1}^{w} \left| v^i_{T_1} - v^i_{T_2} \right| \qquad (1)$$

Note that $w = 3 \times \binom{N}{3}$ is the length of the indicator vector (the number of elements in each vector) and $v^i_T$ is the $i$th element in the indicator vector of tree $T$. This distance function counts the number of rooted triples that appear in either tree $T_1$ or tree $T_2$, but not in both. The value of $d(T_1, T_2)$ is always an even number because it counts either 0 or 2 for each rooted triple.
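The indicator-vector representation and the triplet distance of Eq. (1) can be sketched in Python as follows. This is an illustrative implementation, not the author's code: trees are written as nested tuples of taxon names, triples are enumerated in alphabetical order of their taxa, and the function names are chosen for this example only.

```python
from itertools import combinations


def leaves(tree):
    """Set of taxon names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets of all internal nodes of the tree."""
    if isinstance(tree, str):
        return []
    result = [leaves(tree)]
    for child in tree:
        result.extend(clades(child))
    return result


def triple_vector(tree):
    """Indicator vector with three entries (xy|z, xz|y, yz|x) per triple."""
    internal = clades(tree)
    vector = []
    for x, y, z in combinations(sorted(leaves(tree)), 3):
        for (a, b), out in (((x, y), z), ((x, z), y), ((y, z), x)):
            # a and b are grouped if some clade holds both but not the third taxon
            grouped = any(a in c and b in c and out not in c for c in internal)
            vector.append(1 if grouped else 0)
    return vector


def triplet_distance(t1, t2):
    """Eq. (1): L1 distance between indicator vectors (same taxon set assumed)."""
    return sum(abs(u - v) for u, v in zip(triple_vector(t1), triple_vector(t2)))


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
v = triple_vector(balanced)                   # [1,0,0,1,0,0,0,0,1,0,0,1]
d = triplet_distance(balanced, caterpillar)   # 4: triples ACD and BCD differ
```

The vector obtained for the balanced tree agrees with the four-taxon example in the text, and the distance is even, as expected, because each differing triple contributes 2.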
Fig. 1. The triples in a rooted tree. The rooted tree on taxa A, B, C, D contains four triples, on the taxon sets {A, B, C}, {A, B, D}, {A, C, D} and {B, C, D}.
3 Asymptotic distribution of the sample median tree
Let $\{v_{g_1}, \ldots, v_{g_k}\}$ be the vector representations of gene trees $\{g_1, \ldots, g_k\}$ and $\bar{v}$ be the mean vector, i.e., $\bar{v} = \sum_{j=1}^{k} v_{g_j} / k$. By the law of large numbers, the $i$th element of $\bar{v}$ converges to the probability that the corresponding topology of the rooted triple is present in the gene tree generated from a probability distribution $P(g|s)$, i.e., as $k \to \infty$,

$$\bar{v}^i \to p_i \quad \forall i \qquad (2)$$

The asymptotic probability distribution of the mean vector is a multivariate normal distribution $MVN(p, \Sigma)$ in which $p$ is the vector $(p_1, \ldots, p_w)$ and $\Sigma$ is the covariance matrix. A sample median tree $\tilde{T}$ minimizes the sum of the triplet distances to all gene trees,

$$\tilde{T} = \arg\min_{T} \sum_{j=1}^{k} d(g_j, T) \qquad (3)$$

and

$$\sum_{j=1}^{k} d(g_j, T) = \sum_{j=1}^{k} \sum_{i=1}^{w} \left| v^i_{g_j} - v^i_T \right| = \sum_{i=1}^{w} \sum_{j=1}^{k} \left| v^i_{g_j} - v^i_T \right| = k \sum_{i=1}^{w} \left| \bar{v}^i - v^i_T \right| \qquad (4)$$

where the last equality holds because every element of the indicator vectors is either 0 or 1. It implies that the sample median tree has minimum distance to the mean vector $\bar{v}$. As the mean vector $\bar{v}$ has an asymptotic multivariate normal distribution, the asymptotic distribution of the sample median tree can be obtained by calculating the trees with minimum distance to the vectors generated from the multivariate normal distribution with $p$ and $\Sigma$ being replaced by their unbiased estimators, the sample mean vector $\bar{v}$ and the sample covariance matrix $\hat{\Sigma}$. The population median tree $\tilde{T}_p$ is defined as the tree that minimizes the sum of distances to gene trees with respect to the probability distribution $P(g|s)$ of the gene tree given the species tree, i.e.,

$$\tilde{T}_p = \arg\min_{T} \sum_{g} d(g, T) \times P(g|s) \qquad (5)$$

The solution to (5) is unique under the coalescent model (the proof will be given shortly). By the law of large numbers, the sample median tree $\tilde{T}$ converges to the population median tree in probability as the number of gene trees increases, i.e., as $k \to \infty$,

$$\tilde{T} \xrightarrow{\;p\;} \tilde{T}_p \qquad (6)$$
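The definition of the sample median tree in Eq. (3), and its link to the mean vector in Eq. (4), can be checked numerically with a small Python sketch. It is a brute-force illustration under stated assumptions: trees are given directly as 0/1 indicator vectors, the candidate set is supplied by the caller, and the three-taxon sample below is made up for the example.

```python
def l1(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def sample_median(gene_vectors, candidates):
    """Eq. (3): the candidate minimizing the summed distance to all gene trees."""
    return min(candidates, key=lambda t: sum(l1(g, t) for g in gene_vectors))


def mean_vector(gene_vectors):
    """Element-wise mean of the gene tree indicator vectors."""
    k = len(gene_vectors)
    return [sum(column) / k for column in zip(*gene_vectors)]


# Three taxa: the candidate trees are the three rooted triples themselves.
AB_C, AC_B, BC_A = (1, 0, 0), (0, 1, 0), (0, 0, 1)
candidates = [AB_C, AC_B, BC_A]
genes = [AB_C, AB_C, AB_C, AC_B, BC_A]   # hypothetical sample, k = 5

median = sample_median(genes, candidates)
total = sum(l1(g, median) for g in genes)               # left side of Eq. (4)
via_mean = len(genes) * l1(mean_vector(genes), median)  # right side of Eq. (4)
```

For larger taxon sets the candidate space grows super-exponentially, so an exhaustive search like this one is only practical for very small examples.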
4 Consistency of the median tree under the coalescent model
Under the coalescent model, the probability distribution of the gene tree (topology and branch lengths) given the species tree topology, branch lengths, and population sizes was derived by Rannala and Yang [1]. Degnan and Salter [2] later derived the probability distribution P(g|s) of the gene tree topology given the species tree (topology and branch lengths in coalescence units) by integrating out the branch lengths of the gene tree.
Lemma 1. If the most probable triple in the gene trees is consistent with the triple in the true species tree s, the sample median tree is a statistically consistent estimator of the true species tree s.
Proof. By (2), (4), and (6), the population median tree $\tilde{T}_p$ minimizes the distance to the vector $p$,

$$\sum_{i=1}^{w} \left| p_i - v^i_{\tilde{T}_p} \right| \qquad (7)$$

Consider an arbitrary vector $p$. Because the three elements of $p$ that belong to the same rooted triple are the probabilities of its three possible topologies, they sum to 1. The elements of the vector $v_{\tilde{T}_p}$ are either 1 or 0. The distance between the two vectors is minimized when, for every rooted triple, $v^i_{\tilde{T}_p} = 1$ for the element $p_i$ that has the largest probability among the three possible topologies. It indicates that (7) is minimized when the triples in the species tree are consistent with the most probable triples in the gene trees. By assumption, the triples in the true species tree s are consistent with the most probable triples in the gene trees, which indicates that the true species tree is identical to the population median tree $\tilde{T}_p$. It follows from (6) that the sample median tree is a statistically consistent estimator of the true species tree s.

Theorem 1. Under the coalescent model, the sample median tree based on the triplet metric is statistically consistent.

Proof. By Lemma 1, it suffices to show that the most probable triple in the gene trees generated from the probability distribution P(g|s) is consistent with the corresponding triple in the true species tree s. Consider an arbitrary triple $T_{ABC}$ in s. Without loss of generality, the topology of $T_{ABC}$ is AB|C. According to coalescent theory, the probabilities of the three topologies of the gene tree triple are $P(AB|C) = 1 - \frac{2}{3}e^{-b}$, $P(AC|B) = \frac{1}{3}e^{-b}$, and $P(BC|A) = \frac{1}{3}e^{-b}$, where $b$ is the length of the internal branch of the species tree triple $T_{ABC}$. Since $b > 0$ implies $e^{-b} < 1$, we have $1 - \frac{2}{3}e^{-b} > \frac{1}{3}e^{-b}$, so the most probable gene tree triple matches the species tree triple.
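The three topology probabilities used in the proof of Theorem 1 are easy to evaluate numerically. The following Python sketch (the function name is chosen for this example) computes them for a given internal branch length b in coalescent units and can be used to confirm that they sum to 1 and that the concordant topology AB|C is the most probable one whenever b > 0.

```python
import math


def triple_probabilities(b):
    """Gene tree triple probabilities for a species triple AB|C.

    b is the internal branch length in coalescent units (b >= 0).
    """
    discordant = math.exp(-b) / 3.0
    return {
        "AB|C": 1.0 - 2.0 * discordant,  # concordant with the species tree
        "AC|B": discordant,
        "BC|A": discordant,
    }


probs = triple_probabilities(0.5)
most_probable = max(probs, key=probs.get)
```

At b = 0 the three topologies are equally likely, and as b grows the concordant probability approaches 1, which is why short internal branches make the species triple harder to recover from few gene trees.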
5 Discussion
The sample median tree can be used to consistently estimate species trees. The variance of the sample median tree can be estimated through a bootstrap technique. Specifically, the original dataset is resampled to generate bootstrap samples. A sample median tree is built for each bootstrap sample. Then the median trees are summarized by a consensus tree which indicates the variation among the median trees.
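The bootstrap procedure can be sketched in Python for the simplest case of three taxa. This is a toy illustration with several assumptions: gene trees are resampled directly (rather than sites of the original sequence data), each gene tree is represented by its topology label, and the consensus step is replaced by a tally of how often each topology is the median. With three taxa the triplet-distance median is the most frequent topology, because the summed distance to a topology t equals 2 × (k − count of t). All names and the sample data are hypothetical.

```python
import random
from collections import Counter


def median_topology(gene_topologies):
    """Three-taxon median tree: the most frequent gene tree topology."""
    return Counter(gene_topologies).most_common(1)[0][0]


def bootstrap_support(gene_topologies, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which each topology is the median."""
    rng = random.Random(seed)
    k = len(gene_topologies)
    tally = Counter()
    for _ in range(replicates):
        resample = [rng.choice(gene_topologies) for _ in range(k)]
        tally[median_topology(resample)] += 1
    return {topology: count / replicates for topology, count in tally.items()}


# Hypothetical sample of 100 gene trees on taxa A, B, C.
genes = ["AB|C"] * 60 + ["AC|B"] * 20 + ["BC|A"] * 20
support = bootstrap_support(genes)
best = max(support, key=support.get)
```

For more than three taxa, each replicate would require a full median tree search, and the resulting trees would be summarized by a consensus tree as described above.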
References
1. Rannala, B., Yang, Z.H.: Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656 (2003)
2. Degnan, J.H., Salter, L.A.: Gene tree distributions under the coalescent process. Evolution 59, 24–37 (2005)
3. Maddison, W.P., Knowles, L.L.: Inferring phylogeny despite incomplete lineage sorting. Syst Biol 55, 21–30 (2006)
4. Liu, L., Pearl, D.K.: Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol 56, 504–514 (2007)
5. Wakeley, J.: Coalescent Theory: An Introduction. Roberts & Company Publishers (2008)
Comparison of RNA-Seq with Microarray Analysis of the Transcriptional Response in HT-29 Colon Cancer Cells to 5-aza-deoxycytidine
Xiao Xu1, Jennie Williams1, Erica Antinoiou2, W. Richard McCombie2, Wei Zhu3, Song Wu3, Asia Brown1, Paula Denoya1, Ellen Li1
1School of Medicine, Stony Brook University, Stony Brook, NY, USA; 2Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; 3Department of Applied Mathematics, Stony Brook University, Stony Brook, NY, USA
Abstract – In this study we compared parallel datasets generated by 1) paired-end Illumina RNA sequencing (RNA-Seq) and 2) Affymetrix Human U133 Plus 2.0 arrays on HT-29 colon cancer cells treated with two different levels (5 µM and 10 µM) of 5-aza-deoxycytidine (a demethylation agent), and further compared them to control cells treated with vehicle (dimethylsulfoxide) alone. This study aims to enhance our understanding of the pros and cons of RNA-Seq technology in the specific context of induced epigenetic intervention at different dosage levels. The RNA-Seq experiment detected a total of 18109 genes and the Affymetrix experiment detected a total of 12467 genes in at least one of the experimental groups. The average Spearman correlation coefficient R was 0.72 for the expression levels measured for the 11930 genes that were detected by both experiments. We then selected the genes that were differentially expressed (fold change ≥ 2, FDR < 0.05) after treatment with 5-aza-deoxycytidine by using the Cufflinks/Cuffdiff program for the RNA-Seq data and the Significance Analysis of Microarray (SAM) program for the Affymetrix data. There was considerable overlap between the pathways selected by Ingenuity Pathway Analysis (IPA) of the up-regulated genes in the RNA-Seq and Affymetrix datasets. Both experiments confirmed that STAT1, STAT2 and FES gene expression was up-regulated after treatment with 5-aza-deoxycytidine. Moreover, the RNA-Seq results on differentially expressed genes seem to suggest that 10 µM 5-aza-deoxycytidine has a certain level of toxic effect on HT-29 cells, shutting down the expression of more genes than the 5 µM dosage. Overall, our study provides a comprehensive characterization of the transcriptomic response of HT-29 colorectal cancer cells to 5-aza-deoxycytidine treatment on two parallel experimental platforms.

Keywords: RNA-Seq, microarray, colon cancer, 5-aza-deoxycytidine, transcriptomic analysis.

I Introduction

High-throughput sequencing is emerging as an attractive alternative to microarrays for measuring global mRNA expression [1]. The goal of our study is to validate the use of RNA-Seq by comparison with an established commercial microarray platform. We chose to investigate the global transcriptomic responses of HT-29 colorectal cancer cells to two concentrations of 5-aza-deoxycytidine (demethylation agent) treatment. Finally, the significant biological pathways derived from the differentially expressed genes are compared in order to explore the difference between the two platforms at the functional level.

II Experimental Design

Three groups (each containing 3 biological replicates) of colon cancer HT-29 samples were treated with: (A) 5 µM 5-aza-2’-deoxycytidine each day for five days, (B) 10 µM 5-aza-2’-deoxycytidine each day for five days and (C) vehicle (DMSO) alone, respectively. The same set of samples was used on both the RNA-Seq and microarray platforms (Table 1). A pair-wise comparison between (A) and (B) as well as (A) vs (C) was also carried out. Differentially expressed genes were selected and compared in the two settings. Subsequently, a by-sample correlation analysis was performed between RNA-Seq mapped data and Affymetrix probe levels using the Spearman correlation method.

Table 1 Experimental design (3 dosage levels and two platforms).

5-aza-2’-deoxycytidine    Experimental Platforms
Treatment                 RNA-Seq                 Microarray
5 µM                      HT-29 × 3 (Group A)     HT-29 × 3 (Group A’)
10 µM                     HT-29 × 3 (Group B)     HT-29 × 3 (Group B’)
DMSO controls             HT-29 × 3 (Group C)     HT-29 × 3 (Group C’)

II Method

The experimental and analytic procedures were based on popular RNA-Seq and microarray algorithms featuring a complete preprocessing, filtering, normalization and statistical analysis pipeline (Figure 1). Several key steps are explained as follows:

Figure 1: Preprocessing and analytic pipeline of colorectal cancer HT-29 samples. I. Correlation analysis between all 9 samples based on common detectable genes. II. Comparison of differentially expressed genes found on both platforms. *A filtering procedure was applied to remove gene/probe entries which do not have 3 present values in at least one of the three groups.

1) Illumina® high-throughput sequencing experiment

The TruSeq RNA Sample Preparation Kit (Illumina Inc., CA) was used to prepare the sequencing libraries by implementing the following steps. mRNA was purified and double-stranded cDNA was made using random primers for the first and second strand synthesis. The next step converted the overhangs of the DNA into phosphorylated blunt ends. Adaptors were ligated to the DNA fragments. A size selection was performed using AMPure XP beads (Beckman Coulter) to remove excess adaptors and isolate DNA templates 320 bp long on average. Finally, PCR was performed to enrich the adaptor-modified DNA fragments, since only the DNA fragments with adaptors at both ends will amplify. A sequencing flow cell was prepared at 10 nM loading concentration and sequenced on an Illumina HiSeq 2000 instrument. The sequences were filtered by the Illumina software to remove bad quality sequences (the first 3 nucleotides), since the Phred scores of these sites are below 30 in at least 5 out of 9 samples. The resulting fastq files were provided for subsequent analysis.

2) TopHat mapping

A popular RNA-Seq read mapping tool, TopHat (v1.4.1) [2], was used with default settings to map the millions of short reads to the reference Ensembl human genome 19 (ftp://ftp.ensembl.org/pub/current_gtf/). The mapped reads were summarized and exported into SAM/BAM files.

3) Cufflinks/Cuffdiff Analysis

The Cufflinks program (version 1.3.0) [3] is a popular tool for transcript assembly and abundance estimation with RNA-Seq samples. It uses the Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value to quantify gene transcript abundance. Its associated program Cuffdiff was specially designed to test for differentially expressed genes, exons, coding sequences, splicing events, and promoter use. We used the Cufflinks program to assemble transcripts and estimate their abundance for each gene before correlating to the microarray data. The Cuffdiff program was used in paired comparison settings (A vs C and B vs C) in our search for differentially expressed genes.

4) SAM analysis

The Significance Analysis of Microarray (SAM) is a popular statistical tool for identifying differentially expressed genes based on a permutation t-like test [4]. A false discovery rate (FDR) cutoff of 0.05 was used in our analysis to control type I error. This method was applied only to the Affymetrix microarray data, and an additional fold change (FC) cutoff of 2 (FC <= 0.5 or FC >= 2 on group means) was also used in this analysis to select differentially expressed genes.

5) Pathway Analysis

Pathway analysis was performed using Ingenuity software (Redwood City, CA). The enrichment rates of differentially expressed genes were evaluated against canonical signaling pathways in human cells. Significant pathways were picked based on P value <= 0.05.

III Results
A. Correlation with microarray data
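The by-sample correlation described in the experimental design can be illustrated with a small sketch. It assumes each platform's measurements for one sample are held in a dictionary keyed by gene symbol; the function names and the toy values are hypothetical and are not taken from the study. Only genes detected on both platforms enter the Spearman calculation, which is computed here as a Pearson correlation on tie-averaged ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_on_common_genes(rnaseq, microarray):
    """Spearman correlation restricted to genes present on both platforms."""
    common = sorted(set(rnaseq) & set(microarray))
    x = [rnaseq[g] for g in common]
    y = [microarray[g] for g in common]
    return pearson(average_ranks(x), average_ranks(y)), common


# Hypothetical toy values for a single sample (not data from the study).
rnaseq_fpkm = {"STAT1": 35.0, "STAT2": 12.5, "FES": 3.1, "GENE_X": 0.4}
array_signal = {"STAT1": 980.0, "STAT2": 410.0, "FES": 150.0, "GENE_Y": 75.0}
rho, genes = spearman_on_common_genes(rnaseq_fpkm, array_signal)
```

Using ranks rather than raw values is what makes FPKM values and probe intensities comparable here, since the two platforms report expression on different scales.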
After the filtering and gene symbol conversion step, the Affymetrix® hgu133plus2 microarray detected 12267 independent gene entries while the Illumina RNA-Seq experiment found 18109 present gene transcripts. An overlap of 11775 genes was used in the subsequent correlation analysis (Figure II) between samples. We also found that the 492 microarray-exclusive genes have lower average expression levels (down by 2~4 fold) than those also detected in the RNA-Seq experiment (11775), suggesting a dubious signal quality for these probes. The estimated transcript levels from the RNA-Seq experiment showed a high per-sample correlation between RNA-Seq and microarray results using Spearman correlation (average r = 0.72, P value << 1×10^-10). An additional correlation analysis based on group fold changes (A vs C and B vs C) for each transcript was also performed, resulting in a high correlation (r = 0.83, P value << 1×10^-10) for the A vs C comparison and a relatively lower yet significant correlation (r = 0.70, P value << 1×10^-10) for the B vs C comparison.

Figure II: Venn diagram of detectable genes from the two platforms, showing the overlapping and exclusive gene sets from the RNA-Seq and microarray platforms. *Among the 6122 genes which are exclusively identified on RNA-Seq, 4218 are not revealed in the microarray experiment because their abundances are too low and were thus filtered from the final microarray data.

B. Differentially expressed genes in demethylation treatment group vs controls

We performed significant gene detection analysis for each dosage compared to DMSO controls. Using the Cuffdiff test, we found 1774 and 767 differentially expressed genes in the (A) vs (C) and (B) vs (C) comparisons from the RNA-Seq experiment, which are relatively more than we observed from the microarray results (930 in A vs C and 767 in B vs C) based on a P value cutoff of 0.05 and a fold change cutoff of 2. Specifically, the number of up-regulated genes (from control to demethylation) is generally higher than that of the down-regulated genes, and this tendency is more explicit in the lower dosage group compared with controls (A vs C). We also noticed that the overlap rates of differentially expressed genes from the two platforms are higher in the A vs C comparison than in the B vs C comparison (Table 2). A further analysis comparing the numbers of significant genes between the two demethylation levels and control indicated a high overlap rate between the two dosage-vs-control comparisons on both platforms, except for down-regulated genes in the microarray study (Table 3).

Table 2: Number of differentially expressed genes in each comparison category.

Comparisons          A > C    A < C    B > C    B < C
RNA-Seq              1522     252      574      193
Microarray           584      346      429      338
Intersection         458      112      216      67
Intersection Total   570 (A vs C)      283 (B vs C)

Table 3: Number of overlapped genes and rates between two demethylation-vs-control comparisons on both RNA-Seq and Microarray platforms.

Intersection of A vs C and B vs C    A > C & B > C        A < C & B < C
RNA-Seq                              494/(1522 & 574)     93/(252 & 193)
Overlap Rate*                        86.1%                48.2%
Microarray                           290/(584 & 429)      115/(346 & 338)
Overlap Rate*                        67.6%                34.0%

*Overlap rate is calculated as the number of common genes divided by the number of genes in the smaller parental set (the smaller of the two counts shown in parentheses), which reflects the maximum possible overlap rate independent of the size of the parental sets.

C. The Ingenuity Pathway Analysis

In the final pathway enrichment analysis, we primarily focused on the up-regulated genes since they are assumed to be directly affected by the demethylation treatment. The Ingenuity Pathway Analysis (IPA) identified 35 (RNA-Seq) and 27 (Microarray) significant signaling pathways using differentially up-regulated genes (treatment level > control) from the A vs C comparisons (Table 4). Among these identified pathways, 11 were found on both experimental platforms. In the B vs C comparisons (up-regulated genes only), the IPA program identified 40 (RNA-Seq) and 27 (Microarray) significant signaling pathways, among which 13 were found in common.

IV Discussion

In our study, 5-aza-deoxycytidine treatment of HT-29 cells resulted in a significant transcriptional response reflected on both the RNA-Seq and microarray platforms. Both experimental platforms confirmed a number of genes reported to be up-regulated, such as STAT1, STAT2 [5] and FES [6]. However, the SPARC gene reported by Cheetham et al [7] was not detected in our RNA-Seq or microarray analysis due to its low abundance in our experiment. The study showed that the high-throughput Illumina RNA sequencing technology is more sensitive to low-abundance transcripts than the Affymetrix microarray, considering that 4218 genes below the detection criterion in the microarray experiment are revealed as present on the RNA-Seq platform. Compared to previous studies in similar settings such as Su et al [8], our between-platform Spearman correlation (r = 0.72) is slightly lower yet in the same range as their result (r ~0.80). In another experiment conducted by Marioni et al [1], the Spearman correlation (r = 0.73~0.75) is much closer to our findings, suggesting a high consistency between our study and their reports. A direct comparison of the differentially expressed gene sets between (A) vs (C) and (B) vs (C) seems to confirm that the absolute majority of genes that are differentially expressed in the higher dosage 5-aza-deoxycytidine treatment (B) vs control (C) are also differentially expressed in the lower dosage treatment (A) vs control (C). To some extent, we may consider 10 µM 5-aza-deoxycytidine (B) a toxic dosage for HT-29 cells, one which actually turned off the expression of many genes that are activated upon the 5 µM demethylation treatment (Group A). Moreover, the observation that the 492 microarray-exclusive genes have lower expression profiles (2~4 fold) than the common 11775 ones seems to indicate a certain level of unreliability in these probe readings, which partly explains why they were not picked up in the RNA-Seq experiment. Lastly, the IPA analysis indicated that while the RNA-Seq experiment revealed more differentially expressed genes than the microarray experiment between the demethylation-treated groups and controls, many of the additional genes were assigned to pathways that are already identified by the microarray platform.

V Acknowledgement

The author would like to thank Molly Hammel from Cold Spring Harbor Laboratory for her valuable suggestions in the RNA-Seq data analysis process. The author also wants to express his sincere gratitude to all the lab technicians from both Stony Brook University Health Science Center and Cold Spring Harbor Laboratory who were involved in the relevant experiments of this study.
VI References

1. Marioni, J.C., et al., RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 2008. 18(9): p. 1509-17.
2. Trapnell, C., L. Pachter, and S.L. Salzberg, TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 2009. 25(9): p. 1105-11.
3. Trapnell, C., et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 2010. 28(5): p. 511-5.
4. Efron, B. and R. Tibshirani, Empirical bayes methods and false discovery rates for microarrays. Genet Epidemiol, 2002. 23(1): p. 70-86.
5. Karpf, A.R., et al., Inhibition of DNA methyltransferase stimulates the expression of signal transducer and activator of transcription 1, 2, and 3 genes in colon tumor cells. Proc Natl Acad Sci U S A, 1999. 96(24): p. 14007-12.
6. Shaffer, J.M. and T.E. Smithgall, Promoter methylation blocks FES protein-tyrosine kinase gene expression in colorectal cancer. Genes Chromosomes Cancer, 2009. 48(3): p. 272-84.
7. Cheetham, S., et al., SPARC promoter hypermethylation in colorectal cancers can be reversed by 5-Aza-2'deoxycytidine to increase SPARC expression and improve therapy response. Br J Cancer, 2008. 98(11): p. 1810-9.
8. Su, Z., et al., Comparing next-generation sequencing and microarray technologies in a toxicological study of the effects of aristolochic acid on rat kidneys. Chem Res Toxicol, 2011. 24(9): p. 1486-93.
CPAM: Effective Composite Regulatory Pattern Miner for Genome Sequences
Dan He
Computer Science Dept., Univ. of California, Los Angeles, CA, 90095-1596, USA [email protected]
Abstract. Finding repetitive patterns in DNA sequences is a fundamental problem in computational biology. There are many different types of repetitive patterns. The composite regulatory pattern mining problem is to find an l-mer, or length-l consecutive sequence, in a set of sample sequences, such that the l-mer has at least k occurrences in the sample sequences, where each occurrence has at most d mismatches to the l-mer. The problem is also known as the (l,d)-pattern mining problem, or the (l,d)-challenging problem. It has been studied extensively. However, current methods for solving the problem do not handle relatively long patterns efficiently and generally do not scale to long sample sequences. In this work, we propose an algorithm, CPAM, which first seeks short seeds for the patterns and then extends the seeds into full-length patterns. We also propose an iterative version of the algorithm, ICPAM, which recursively reduces the problem to easier problems. Our experiments show that our algorithms scale to both long patterns and long sample sequences, and that they are very efficient compared with the state-of-the-art methods.
1 Introduction
Finding repetitive patterns in DNA sequences is a fundamental problem in computational biology since a remarkable fraction of the genomes of complex organisms consists of repetitive patterns. These repetitive patterns play an important role in the identification of novel functional units. Various techniques such as combinatorial methods, statistical modelling, and suffix trees have been applied to the problem, and various forms of repetitive patterns have been studied, such as exact maximal repeats, approximate maximal repeats, repeats with minimum frequency, elementary repeats, and repeat families [6, 2, 9, 10, 8, 19]. In this work, we study the problem of mining composite regulatory patterns in sample genome sequences. DNA sequences are subject to mutations, and therefore the repetitive patterns often occur with some mismatches from the consensus motif. The consensus motif can be represented as an l-mer, which is a contiguous string of length l. The (l,d)-neighborhood of an l-mer P is the set of all possible l-mers with up to d mismatches compared to P. For DNA sequences, whose alphabet size is 4, the size of the (l,d)-neighborhood of any l-mer is $\sum_{i=0}^{d} \binom{l}{i} 3^{i}$. We call each occurrence of P in the sample sequence an instance. An l-mer is a valid occurrence of pattern P if it has at most d mismatches to P. The problem of mining composite regulatory patterns (also called (l, d)-k patterns, or motifs; for now we use "pattern" and "motif" interchangeably) is defined as follows: given a set of sequences S, find all l-mers that occur, with up to d mismatches, at least k times in S. The problem is very challenging because the search space can be very big for some (l, d) configurations. For example, a typical sample is a set of 20 length-600 sequences in which we look for (15,5)-20 patterns, where 20 is a typical setting for k. As the (15,5)-neighborhood is of size 853,570, for any 15-mer the expected number of (15,5) occurrences in the sample sequence is $\frac{853{,}570 \times 600 \times 20}{4^{15}} = 9.54$. Notice that when we compute the expected occurrences of the patterns, for simplicity we consider the occurrences in 600 × 20 = 12000 possible positions. It is therefore hard to distinguish the true pattern from all $4^{15}$ possible 15-mers without enumerating and validating them all. The (l, d)-k pattern mining problem is well studied and numerous algorithms have been proposed, including both optimal and approximate algorithms [18] [17] [16] [19] [5] [1]. These algorithms solve the problem efficiently when l and d are relatively small and the sample sequence is relatively short. But for relatively large l with respect to d, or for long sample sequences, previous algorithms are usually not efficient. This is because, for a fixed l, the number of possible occurrences of (l, d)-k patterns increases exponentially with d. Meanwhile, a long sample sequence makes it costly to check the valid occurrences of a pattern.
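The neighborhood size and the expected number of chance occurrences quoted above can be checked with a short calculation. This is a minimal sketch; the function names are illustrative and not part of the paper, and the expectation uses the same simplification as the text, namely that every one of the 600 × 20 positions is treated as a possible start.

```python
from math import comb


def neighborhood_size(l, d, alphabet=4):
    """Number of l-mers within Hamming distance d of a fixed l-mer."""
    # Choose i mismatching positions, each with (alphabet - 1) alternatives.
    return sum(comb(l, i) * (alphabet - 1) ** i for i in range(d + 1))


def expected_occurrences(l, d, n_sequences, seq_length, alphabet=4):
    """Expected count of (l, d) occurrences of one l-mer in random sequences."""
    positions = n_sequences * seq_length
    return neighborhood_size(l, d, alphabet) * positions / alphabet ** l


size_15_5 = neighborhood_size(15, 5)
expected_15_5 = expected_occurrences(15, 5, n_sequences=20, seq_length=600)
```

With the paper's setting, `size_15_5` evaluates to 853,570 and `expected_15_5` to roughly 9.54, so a random 15-mer already collects several spurious (15,5) occurrences by chance alone.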
In this work, we propose an algorithm, CPAM (Composite Regulatory Pattern Miner), which deploys the following two ideas: 1. Instead of searching for the pattern directly, we start by searching for seeds. The seeds are usually short and contain fewer mismatches, so it is much easier to find all of their occurrences. 2. The occurrences of the seeds are extended into full-length strings, and random projection [18] is applied to the extended full-length strings to recover the candidate patterns. These candidate patterns are then validated against the sample sequence to find the true pattern. We show that these two ideas significantly improve the efficiency of mining composite regulatory patterns, especially for long sample sequences, compared with the current state-of-the-art algorithms. Moreover, for long motifs we conduct the above process iteratively: seeds that are too short cannot be used for long motifs, so we treat the discovery of the seeds as a pattern mining problem in itself and solve it using even shorter seeds, repeating until the seeds are short enough for fast processing. We show in our experiments that our algorithm scales to both long patterns and long sample sequences.

2 CPAM algorithm

We propose the CPAM algorithm based on the observation that we sometimes do not need all of the information on the motif occurrences: we may not need to find all k occurrences of the motif in the sequence and can still recover the motif.
2.1 Workflow

We show the workflow of our CPAM algorithm in Figure 1. To identify (l, d)-k patterns, we first identify the occurrences of (l′, d′) patterns, which serve as seeds. We typically set l′ and d′ to half of l and d, respectively. We then find all occurrences of the (l′, d′) seeds using MITRA-count, which is very efficient for short seeds. The number of occurrences is usually large because l′ and d′ are small. Next we extend all occurrences of the seeds to length-l strings, and apply random projection to the extended length-(l − l′) substrings. For example, assume l′ = 3 and l = 6, the sample sequence is “AGCTCTAGCTATCAATAGCTAT” and the seed is “AGC”. For illustration purposes, assume d = 0. Then there are three occurrences of “AGC” in the sample sequence. We extend the three occurrences to length-6 strings and obtain “AGCTCT”, “AGCTAT” and “AGCTAT”. We next apply random projection to the extended length-3 substrings, namely “TCT”, “TAT” and “TAT”. Assuming we randomly project on two positions in the length-3 substrings, we obtain a pattern “AGCT−T” with 3 occurrences, where “−” marks a position that is not projected. Assuming that after random projection we obtain k′ occurrences of the extended pattern (l, d), we can compute the probability of observing k′ occurrences out of k occurrences of the pattern (l, d) such that all these occurrences contain an (l′, d′) seed. If the probability is too small, for example less than 0.001, we ignore the pattern. Otherwise we recover the positions that are not projected using consensus characters. In the above example, we obtain the consensus pattern “AGCTAT” because there are two occurrences of “TAT” but only one occurrence of “TCT”. Finally, we check the occurrences of the consensus patterns in the sample sequence and select the ones which have at least k occurrences.
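The worked example in this subsection can be traced with a small sketch. It makes several simplifying assumptions that are not the paper's implementation: seeds are located by exact matching (the d = 0 case of the example) instead of MITRA-count, the projected positions are fixed rather than drawn at random so that the run is reproducible, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def find_exact_occurrences(sequence, seed):
    """Start positions of exact seed matches (stand-in for MITRA-count, d = 0)."""
    return [i for i in range(len(sequence) - len(seed) + 1)
            if sequence[i:i + len(seed)] == seed]


def extend_occurrences(sequence, positions, full_len):
    """Extend each seed occurrence to a length-`full_len` string."""
    return [sequence[p:p + full_len] for p in positions
            if p + full_len <= len(sequence)]


def project(strings, seed_len, kept_positions):
    """Bucket extended strings by the characters at the kept positions
    of the part that follows the seed."""
    buckets = defaultdict(list)
    for s in strings:
        extension = s[seed_len:]
        key = tuple(extension[i] for i in kept_positions)
        buckets[key].append(s)
    return buckets


def consensus(strings):
    """Column-wise majority character over equal-length strings."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*strings))


sample = "AGCTCTAGCTATCAATAGCTAT"
positions = find_exact_occurrences(sample, "AGC")
extended = extend_occurrences(sample, positions, full_len=6)
# Keep positions 0 and 2 of the 3-character extension ("T-T" in the text).
buckets = project(extended, seed_len=3, kept_positions=(0, 2))
candidate = consensus(buckets[("T", "T")])
```

In this run all three extended strings fall into the same bucket, and the consensus step fills the unprojected position with the majority character, giving the candidate that would then be validated against the sample sequence.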
2.2 Iteration for Long Patterns

As we show later in the experiments, our algorithm is fast for (l, d) motifs such as (15,4) and (17,5), for which (7,2) sub-motifs are used as seeds. When the motifs are long, it is no longer feasible to use (7,2) sub-motifs as seeds, since the random projection would need to be conducted on the long remaining substrings, which is both time consuming and inaccurate. Therefore we need to use relatively long sub-motifs as seeds. However, as l′ and d′ increase for the (l′, d′) sub-motifs, the running time to identify the seed occurrences increases dramatically. We therefore propose an iterative algorithm, ICPAM, which identifies the seeds recursively until the problem is easy enough for our CPAM algorithm. For example, for motifs such as (35,9)-20 or (35,10)-20, we first use the sub-motif (25,7) as seeds. To solve the problem for sub-motif (25,7), we further use sub-motif (15,4), and then (7,2), for which CPAM is very fast. We show later that ICPAM effectively reduces the motif mining problem to smaller problems and is thus able to handle relatively long patterns.
[Figure 1 (flowchart). Stages: Identify Seeds (identify all occurrences) → Extend Seeds (extend all occurrences) → Identify Candidate (apply random projection) → Validate Candidate. Pattern parameters along the flow: (l, d)-k → (l′, d′) → (l, d) → (l, d)-k′ → (l, d)-k.]

Fig. 1. Workflow of CPAM.
3 Experimental Results
We first tried long motifs. As MITRA-count and MITRA-graph cannot handle such long motifs, we compare our algorithm only with the graph-based algorithm [21], which is able to solve problems with more challenging parameters. The results are shown in Table 1 (left). For the (19,6)-20 and (19,7)-20 problems, we ran CPAM with l′ = 7, d′ = 2. Our algorithm clearly finished much faster. For the (21,7)-20 and (21,8)-20 problems, since the motifs are relatively long, we ran ICPAM with seeds (15,4), then seeds (7,2), recursively. Again our algorithm outperforms the graph-based algorithm. We also tried the (35,9)-20 problem, running ICPAM with seeds (25,5), then seeds (15,4), then seeds (7,2), recursively. The problem is solved efficiently. The graph-based algorithm is superior to other algorithms in that it scales to long sample sequences. We show that our algorithm also scales to long sample sequences. This is because our algorithm uses relatively short motifs as seeds, whose occurrences are easy to find even in long sample sequences. In Table 1 (right), we show the running time of CPAM on sample sequences of length 600, 800, 1000 and 1200 for the (15,4)-20 problem. The running time of CPAM grows steadily with the sequence length, so it is able to handle very long sample sequences. As a comparison, we also show the running time of the graph-based algorithm, which likewise grows with the length of the sample sequence. However, our algorithm is more efficient at every sequence length tested. Finally, note that both CPAM and ICPAM are approximate algorithms since random projection is applied. It is possible that one run of CPAM or ICPAM does not find the real motif. However, since the probability of missing the real motif is very small, running the algorithms twice usually recovers it. In our experiments, our algorithms never missed the real motif twice.
pattern      CPAM    Graph-based        n       CPAM    Graph-based
(19,6)-20    153     1599               600     69      698
(19,7)-20    632     2141               800     119     1081
(21,7)-20    70      698                1000    207     1599
(21,8)-20    1038    1081               1200    354     2141
(35,9)-20    73      -

Table 1. (Left) Execution time (sec.) for algorithms CPAM and the graph-based algorithm [21] on different (l, d)-k problems. “−” indicates inability to solve the problem. (Right) Execution time (sec.) for (15, 4)-20 problem for algorithms CPAM and the graph-based algorithm [21] for different sample sequence length n.
4 Discussion
In this work, we proposed an algorithm, CPAM, for the composite regulatory pattern mining problem. Our algorithm efficiently seeks short seeds for the pattern using MITRA. The seed occurrences are then extended, and the candidate patterns are recovered by random projection. The candidate patterns are then evaluated against the sample sequence to find the true pattern. We also proposed an iterative algorithm, ICPAM, which aims to handle long patterns. ICPAM seeks shorter seeds recursively until the seeds are short enough, and is therefore able to handle long patterns. We also showed that our method scales to long sample sequences, since identifying the occurrences of short seeds is relatively easy.
References
1. A. L. Price, N.C. Jones and P.A. Pevzner. De novo identification of repeat families in large genomes. Bioinformatics, 21:i351-i358, 2004.
2. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
3. D. He. Using suffix tree to discover complex repetitive patterns in DNA sequences. In Proc. of 28th Annual International Conference IEEE Engineering in Medicine and Biology Society (EMBC’06), pp. 3474-3477, New York, NY, 2006.
4. D. He and X. Wu. An Efficient Algorithm for Finding Approximate Complex Repetitive Patterns. In Proceedings of the International Conference on Computational and Systems Biology (CASB 2006), Dallas, Texas, 2006.
5. E. Eskin and P.A. Pevzner. Finding Composite Regulatory Patterns in DNA Sequences. Bioinformatics, 1(1):1-9, 2002.
6. E. F. Adebiyi, T. Jiang, M. Kaufmann. An efficient algorithm for finding short approximate non-tandem repeats. Bioinformatics, Vol. 17, suppl. 1, pp. S5-S12, 2001.
7. M. Katti, R. Sami-Subbu, P. Ranjekar and V. Gupta. Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Science, 9(6):1203-1209, 2000.
8. S. Kurtz and C. Schleiermacher. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5), pp. 426-427, 1999.
9. S. Kurtz, E. Ohlebusch, C. Schleiermacher, J. Stoye and R. Giegerich. Computation and visualization of degenerate repeats in complete genomes. In Proc. of the 8th International Conf. on Intelligent Systems for Molecular (ISMB) 2000.
10. S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29(22), pp. 4633-4642, 2001.
11. M.S. Waterman. Introduction to computational biology. Chapman & Hall, 1995.
12. X. Zhu and X. Wu. Mining Complex Patterns across Sequences with Gap Requirements. In Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pp. 2934-2940, Hyderabad, India, 2007.
13. PAM250 Amino Acid Scoring Matrix: http://prowl.rockefeller.edu/aainfo/pam250.htm.
14. NCBI Basic Local Alignment Search Tool: http://blast.ncbi.nlm.nih.gov/Blast.cgi.
15. Research Collaboratory for Structural Bioinformatics (RCSB): Protein Data Bank. http://www.rcsb.org/pdb/home/home.do.
16. Waterman, M., Arratia, R. and Galas, D. Pattern recognition in several sequences: consensus and alignment. Bulletin of Mathematical Biology, 46, 515-527.
17. Pevzner, P. A. and Sze, S. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. pp. 269-278.
18. J. Buhler and M. Tompa. Finding Motifs Using Random Projections. Journal of Computational Biology, vol. 9, no. 2, 2002, 225-242.
19. Sagot, M. Spelling approximate or repeated motifs using a suffix tree. Lecture Notes in Computer Science, 1380, 111-127.
20. Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. Proc 20th Int Conf Very Large Data Bases VLDB, pp 487–499, 1994.
21. Geraci, F., Pellegrini, M., Renda, M.E. An Efficient Combinatorial Approach for Solving the DNA Motif Finding Problem. In ISDA, pp 335-340, 2009.
Pattern Detection and Functional Mapping for Biomedical Signal Sets
Anish Nair, Kamran Kiasaleh
Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA
Abstract. In recent years numerous studies have been carried out on effective means of pattern detection and characterization for biomedical signal sets. This paper presents an attempt to characterize and map ECG signal sets from the MIT-BIH database for atrial fibrillation scenarios. The objective here is two-fold. First, we present models to characterize the inherent chaotic patterns in the ECG dataset. Second, we present an estimate of the probability density functions of the ECG time series.

Keywords: chaotic, ECG, mapping
1 Introduction
Atrial fibrillation is caused by an imbalanced impulse gradient in ventricles [1], increasing the possibility of cardiac arrest and myocardial infarction, with episode durations ranging from hours to days. In this paper, the time series under consideration is the RR interval. Widespread investigations [2] have been carried out into possible chaotic behavior of RR interval time series, with substantiating results. This paper aims to go beyond merely detecting a chaotic footprint in the series and to obtain a function-mapped model based on the database of available chaotic functions. The primary reason for seeking a functional mapping for the time series pattern is that, if we know the function behind the observed series, it becomes easier to predict future iterative state values. This paper introduces a primary-level modeling tool which forms the basis for higher-level analysis, including future trajectory and probability estimation. Effective estimation of the probability density function can be considered the requisite tool for further analysis. Constraints of invariance can be imposed on the distribution in order to provide a priori information to a Markovian model for predicting future probability estimates. However, the analysis in this paper is limited to deciphering the functional mapping from the ECG data sets. The relative confidence in the accuracy of the probability density estimates relies solely on the effectiveness of the functional mapping of the ECG signal sets to known chaotic functions.
2 Time series analysis
Readings taken from the MIT-BIH PhysioNet database [3] run for a duration of 10 hours, sampled at a rate of 250 samples/second. The RR interval forms the root time series, which is tested for chaotic nature using the Lyapunov exponent test. The exponents derived in the test characterize the rate of divergence and how sensitive the time series is to initial conditions. A positive exponent implies chaotic nature, and the greater the value of the exponent, the higher the extent of chaos, while a negative exponent implies a dissipative series. From the atrial fibrillation database [4], the subset of ECG readings considered is (04048, 04043 and 05091), and from the normal rhythmic behavior database [5], reading 16272 is considered. All analyses and tests are carried out for a sample count of 5000. Once the chaotic nature is established, it is extremely important to narrow the search in terms of dimensionality. The correlation dimension algorithm is used to calculate the dimensional estimate of each signal set that has tested positive for chaotic nature. The algorithm takes into account the ratio of the number of data points falling within a distance ε of each other to the total number of data points of the set. This ratio is termed the correlation function. The slope of the log-log plot of the correlation function against ε, with ε ranging from 0 to 0.085, provides the dimensional estimate for the particular signal set under consideration. Estimates are shown in Table 1.
Table 1. Correlational dimensional estimates and polarity of largest Lyapunov exponents.

ECG reading   Correlation dimension   Polarity of Lyapunov exponent
04048         1.232                   Positive
04043         1.261                   Positive
05091         0.4962                  Positive
16272         0.3352                  Negative
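The two diagnostics reported in Table 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the Lyapunov exponent is computed here for a known map (the Logistic map) as the orbit average of log|f'(x)|, whereas estimating it from measured RR intervals needs a time-series estimator that the abstract does not name. The correlation sum is taken over distinct point pairs, which is one common reading of the ratio described above. All function names and default parameters are assumptions made for illustration.

```python
import numpy as np

def logistic_lyapunov(r, x0=0.3, n=20000, burn_in=1000):
    """Lyapunov exponent of x -> r*x*(1-x), estimated as the orbit average
    of log|f'(x)| with f'(x) = r*(1 - 2x). Assumed illustration for a known map."""
    x = x0
    total = 0.0
    for t in range(burn_in + n):
        if t >= burn_in:
            total += np.log(abs(r * (1.0 - 2.0 * x)))
        x = r * x * (1.0 - x)
    return total / n

def correlation_sum(points, eps):
    """Fraction of distinct point pairs whose Euclidean distance is below eps."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:                      # scalar series -> column of 1-D points
        pts = pts[:, None]
    diff = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(len(pts), k=1)]   # each pair counted once
    return np.count_nonzero(upper < eps) / upper.size

def correlation_dimension(points, eps_values):
    """Slope of log C(eps) against log eps, fitted by least squares."""
    eps_values = np.asarray(eps_values, dtype=float)
    c = np.array([correlation_sum(points, e) for e in eps_values])
    slope, _intercept = np.polyfit(np.log(eps_values), np.log(c), 1)
    return slope
```

A positive value from `logistic_lyapunov(4.0)` and a negative one from `logistic_lyapunov(2.5)` mirror the positive and negative polarities in the table. The ε values passed to `correlation_dimension` must be strictly positive, so a range that nominally starts at 0 has to begin at a small positive value in practice.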
Readings 04048, 04043 and 05091 are analyzed using phase space embedded plots with the embedding dimension set to 3 and a delay of one sample. Reading 16272 is excluded because it tested negative for chaotic behavior. In Fig. 1 and Fig. 2, x(t) represents the time series under consideration. The circled regions highlight the structural similarity between the recurring lag profiles of the ECG data sets and those of standard chaotic mapping functions. From the phase space plots and the correlation dimension estimates, readings 04048 and 04043 have the Henon map as their mapping function and reading 05091 has the Logistic map as its parent mapping function: the standard range of the correlation dimension is 1.23 to 1.27 for the Henon map and 0.495 to 0.505 for the Logistic map. The Logistic map is a one-dimensional map, whereas the Henon map is a two-dimensional map. Hence, the second step of characterization should focus on determining the dimension associated with the data set. The next section characterizes the subset time series in terms of probability density estimates.
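The embedding step and the two reference maps can be sketched as follows. This is an illustrative sketch only: the parameter values (r = 4 for the Logistic map, a = 1.4 and b = 0.3 for the Henon map) and the initial conditions are the commonly used textbook choices and are assumptions here, since the abstract does not state which values were used.

```python
import numpy as np

def delay_embed(x, dim=3, lag=1):
    """Rows are [x(t), x(t-lag), ..., x(t-(dim-1)*lag)] for every valid t."""
    x = np.asarray(x, dtype=float)
    start = (dim - 1) * lag
    cols = [x[start - k * lag: len(x) - k * lag] for k in range(dim)]
    return np.column_stack(cols)

def logistic_series(n, r=4.0, x0=0.3):
    """Iterates of the Logistic map x -> r*x*(1-x) (assumed parameters)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(1, n):
        x[t] = r * x[t - 1] * (1.0 - x[t - 1])
    return x

def henon_series(n, a=1.4, b=0.3, x0=0.0, y0=0.0):
    """x-coordinate of the Henon map (x, y) -> (1 - a*x^2 + y, b*x)
    (assumed parameters)."""
    x = np.empty(n)
    y = np.empty(n)
    x[0], y[0] = x0, y0
    for t in range(1, n):
        x[t] = 1.0 - a * x[t - 1] ** 2 + y[t - 1]
        y[t] = b * x[t - 1]
    return x
```

Calling `delay_embed(series, dim=3, lag=1)` on an RR-interval series and on `logistic_series(5000)` or `henon_series(5000)` gives the point clouds whose three columns correspond to the x(t), x(t-1) and x(t-2) axes of the phase plots.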
Fig. 1. Phase plot for reading 05091 (left) and Logistic map (right). Axes in both panels: x(t), x(t-1), x(t-2).
Fig. 2. Phase plot for reading 04048 (left) and Henon map (right). Axes in both panels: x(t), x(t-1), x(t-2).
3 Characterization
Kernel Density estimation algorithm [6] is used for calculating requisite pdfs, defined in (1)
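$$ \hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\, h_d} \exp\left\{ -\frac{1}{2} \left( \frac{x_d - x_{i,d}}{h_d} \right)^{2} \right\} \qquad (1) $$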
In this equation, D denotes the dimensionality associated with the mapping function, N is the total number of data points, and h_d is the bandwidth (bin size) under consideration. As the equation shows, the kernel is Gaussian, and the only variable factor is the bandwidth, whose optimal value minimizes the MSE. In this paper the optimal value is taken as in (2)
$$ h_{opt} = \sigma \left( \frac{4}{3N} \right)^{1/5} \qquad (2) $$
where σ is the standard deviation of the data set and N is set to 5000 consecutive data points, obtained from MIT BIH. For single dimensional scenarios the probability density function can be considered as the summation across multiple Gaussian kernels on a single axis, and for higher dimensional cases the probability density function is the product across individual axes. The estimates from kernel density algorithm for one dimensional and two dimensional scenarios are shown in Fig. 3.
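A product-of-Gaussians kernel density estimate of the kind described here can be sketched in a few lines. The sketch makes two assumptions that the text does not confirm: the same Gaussian kernel is used on every axis, and the bandwidth helper uses the common normal-reference rule of thumb, σ(4/(3N))^(1/5), applied per axis. The function names are hypothetical.

```python
import numpy as np

def _as_2d(a):
    """Treat a 1-D array as N one-dimensional points."""
    a = np.asarray(a, dtype=float)
    return a[:, None] if a.ndim == 1 else a

def normal_reference_bandwidth(data):
    """Per-axis bandwidth sigma * (4 / (3N))**(1/5).
    Assumption: rule-of-thumb choice, not necessarily the authors' formula."""
    data = _as_2d(data)
    n = data.shape[0]
    return data.std(axis=0, ddof=1) * (4.0 / (3.0 * n)) ** 0.2

def gaussian_product_kde(data, query, h):
    """Density at each query point: mean over data points of the product,
    across axes, of one-dimensional Gaussian kernels.
    data: (N,) or (N, D); query: (M,) for D = 1 or (M, D); h: scalar or (D,)."""
    data, query = _as_2d(data), _as_2d(query)
    h = np.broadcast_to(np.asarray(h, dtype=float), (data.shape[1],))
    z = (query[:, None, :] - data[None, :, :]) / h          # shape (M, N, D)
    kern = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * h)
    return kern.prod(axis=2).mean(axis=1)                   # shape (M,)
```

For a one-dimensional series the product has a single factor, so the estimate reduces to a sum of Gaussian bumps along one axis; for two-dimensional points each data point contributes the product of its two per-axis kernels.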
Fig. 3. Probability density estimate for reading 05091 (left) and density estimate for reading 04048 (right). Left panel axes: range of values for Logistic Map against probability density function. Right panel axes: first axis value range, second axis value range, density function.
References
1. Timothy A. Denton, George A. Diamond, Richard H. Helfant, Steven Khan, Hrayr Karagueuzian, "Fascinating rhythm: A primer on chaos theory and its application to cardiology", American Heart Journal, vol. 120, no. 6, pp. 1419-1440, Dec. 1990.
2. Kaifu Wang, Yi Zhao, Xiaoran Sun, Tongfeng Weng, "A simple way of distinguishing chaotic characteristics in ECG signals", Biomedical Engineering and Informatics (BMEI), 2010 3rd International Conference on, vol. 2, pp. 713-716, 16-18 Oct. 2010.
3. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/cqi/content/full/101/23/e215]; 2000 (June 13). PMID: 10851218; doi: 10.1161/01.CIR.101.23.e215
4. Database for atrial fibrillation; http://www.physionet.org/physiobank/database/afdb/
5. Database for normal http://www.physionet.org/physiobank/database/nsrdb/
6. Epanechnikov, V.A., "Non-parametric estimation of a multivariate probability density", Theory of Probability and its Applications, vol. 14, pp. 153-158, 1969.
MapBase: A Virtual Biological ID Map Database
Hasan M. Jamil
Department of Computer Science Wayne State University, USA [email protected]
Abstract. Traditionally, IDs are used to describe, cross reference and link objects in biological databases and applications. Specialized ID converters are then used to map objects of different types to establish correspondence. Studies show that the quality of ID conversion varies widely, producing inaccurate ID correspondence. Authoritative databases such as GeneCards, PIR and BioMart try to compensate for this shortcoming by maintaining ID relationships for specific sets of ID types. Unfortunately, users of these resources are often forced to settle for incomplete or incorrect set of mappings when they do not have intimate knowledge of these resources. In this paper, we introduce a new lazy and on demand ID mapping database, called MapBase, which allows arbitrary mapping of biological IDs. MapBase materializes ID correspondence in real time from other databases and converters when queried, and maintains the materialized view using a negative provenance protocol. Thereby, MapBase guarantees maximum possible accuracy and currency by using the best possible resources and by prioritizing resources based on quality.
1 Introduction
Object identifiers, also called IDs, are widely used to represent biological entities of arbitrary conformations. However, the global nature of life sciences research and the distributed authorities that generate them contribute to an often chaotic but seemingly unavoidable environment where an object is assigned numerous IDs by various groups and databases, and is linked to other objects using these IDs. Consider, for example, the gene symbol SMCR (Smith-Magenis syndrome chromosome region). This gene has been assigned multiple IDs by various databases and authorities. For example, the HUGO Gene Nomenclature Committee has assigned it an HGNC ID: 11113 and Entrez assigned it ID: 11113. However, the GeneCards database lists SMCR's Entrez ID as 6600 and RAI1 and SMS as its two aliases, while HGNC lists its Entrez ID as 11113. Furthermore, HGNC notes that the symbol SMCR has been withdrawn, while GeneCards notes that Gene ID: 6600 was discontinued on 3-Aug-2010 and replaced with Gene ID: 10743. Since objects spread across various databases are cross referenced and linked mostly through IDs, accuracy is a paramount factor in ensuring quality information processing. Biologists have been trying to ensure the accuracy of ID conversion and mapping for quite some time, with limited success. The two main
approaches are to design ID conversion tools such as GeneID Converter [2] and IdBean [7], and to maintain ID correspondence in authoritative databases such as GeneCards, BioMart and SWISS-PROT. Recent studies [5, 1] show that, despite significant efforts, progress toward ensuring conversion accuracy has been truly limited, largely because most converters and mapping databases are designed for a particular type of ID, and they usually rely on polling the required information from other databases, which in turn rely on yet another resource and thereby compound the inaccuracy and complexity.

In this paper, our goal is to develop a universal online interface, called MapBase, as an ID mapping service for users to map IDs of arbitrary types to the highest possible level of accuracy. As we describe in section 3, we do not rely on a specific database or conversion tool to poll our information. We dynamically decide on the resource to use for a specific mapping based on a priority order of converters and databases. Since IDs may be related to one another with 1-1, 1-M, M-1 and M-M cardinalities, we recognize three querying options: any, unique and all. With option "any", an arbitrary set of mappings will be produced for each ID in the query set without any guarantee of completeness. The option "unique" will produce a single 1-1 mapping if it exists, and reject mappings that violate this constraint. Finally, option "all" produces all possible 1-M, M-1, and M-M mappings from all sources. The MapBase interface we present also safeguards against obsolete IDs and mappings produced by other resources whenever such information is available online, as discussed earlier in the context of the gene SMCR in the HGNC and GeneCards databases.
2 Query Language for MapBase
To query online resources for ID maps, we have developed a simple declarative query language in [6] with two basic functions – converting an arbitrary type of IDs to a set of arbitrary type of IDs, and to cross reference objects with different IDs using an operation similar to join in relational databases. For brevity, we will briefly discuss only the convert statement syntax and semantics below without any technical details. We use the implementation of this statement as the core engine for MapBase interface described in section 3.
convert r into t1 [any|unique|all], . . . , tk [any|unique|all] [using c1, . . . , cn];
In this statement, r is a unary relation over domain D_t, t1, . . . , tk are type names, and the c_i are online converters. The result is a relation r′ ⊆ D_t × D_t1 × ... × D_tk. The options [any|unique|all] map the elements of r to any available ID, exactly one ID, or all possible IDs, respectively, as noted earlier. If no option is specified, any is assumed. Furthermore, if a using clause is given, mapping is attempted only from the converters listed in that clause.
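The effect of the three options on a set of candidate mappings can be sketched as below. This is an illustrative reading of the semantics, not the MapBase implementation: "any" is taken to return the first mapping seen for each source ID, and "unique" is taken to keep a pair only when the source has exactly one target and that target has exactly one source. The function name and the IDs in the usage note are hypothetical.

```python
from collections import defaultdict

def apply_option(pairs, option="any"):
    """Filter candidate (source_id, target_id) pairs by a convert option.
    pairs are assumed to arrive in the order the converters returned them."""
    by_source = defaultdict(list)   # source -> distinct targets, in arrival order
    by_target = defaultdict(set)    # target -> set of sources mapping to it
    for src, dst in pairs:
        if dst not in by_source[src]:
            by_source[src].append(dst)
        by_target[dst].add(src)

    if option == "all":
        return [(s, d) for s, targets in by_source.items() for d in targets]
    if option == "any":
        # Assumption: "any" keeps the first mapping found for each source ID.
        return [(s, targets[0]) for s, targets in by_source.items()]
    if option == "unique":
        # Keep only strict 1-1 mappings; reject 1-M, M-1 and M-M cases.
        return [(s, targets[0]) for s, targets in by_source.items()
                if len(targets) == 1 and len(by_target[targets[0]]) == 1]
    raise ValueError(f"unknown option: {option!r}")
```

With the made-up candidates `[("g1", "p1"), ("g1", "p2"), ("g2", "p3"), ("g3", "p4"), ("g4", "p4")]`, option all returns every pair, any returns one pair per source, and unique returns only `("g2", "p3")`.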
3 MapBase ID Conversion Database
MapBase is a lazy, on demand and incremental materialized view [4] database in which the view is maintained using a negative provenance model [8]. As shown
in figure 1, it has four main components: (i) an ontology, (ii) a materialized view of ID mappings, (iii) a provenance manager, and (iv) a query processor.
Ontology: The ontology consists of three components: an index of converters, a priority relation over them, and a specialization hierarchy of ID types. The index is a user updatable hash structure that lists all online ID conversion resources such as tools and databases that can be queried to map IDs in which the only user update allowed is insertion of new ID converters. For each such resource, it stores the name of the converter, the type of IDs it can convert to what type of IDs, called conversion pairs, URL of the resource, its status as an authoritative converter for an ID type, and its update broadcast policy. The authority status of a converter of a type of ID allows its mappings to arbitrate over conflicting or incorrect ID mappings and forces its mapping to be final. On the other hand, its update broadcast policy helps correct mapping errors and view maintenance against provenance queries. The ontology also includes a priority relation ⪯ of ID converters based on conversion pairs in the form of a partial order. The index can be queried to collect a ranked list of converters that can convert a specific ID type to another ID type according to the priority relation. The specialization hierarchy groups IDs according to their types with an universal identifier, e.g., HGNC is a gene symbol ID, and UniProt is a protein ID. These type symbols are used in all queries, descriptions and database schema, as appropriate. Accordingly, MapBase can only convert IDs of types included in this hierarchy.

Fig. 1. MapBase architecture and components. (Diagram labels: Ontology, Hash Index of ID Converters, Updates, ID Hierarchy, Query Processor, Differential Map Generator, Online Query Processor, Provenance Manager, Validator, Integrator, Materialized View of ID Maps, Graphical User Interface, Map Queries, Response, Online ID Converters, Online Queries.)
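One way to picture the converter index and its priority relation is a lookup that returns converters for a conversion pair in an order consistent with a partial order. The sketch below is an assumption about how such a lookup could work, not a description of MapBase internals; the data layout, function name and converter names are hypothetical, and incomparable converters may appear in any relative order.

```python
from graphlib import TopologicalSorter

def ranked_converters(index, prefers, src_type, dst_type):
    """Return converters supporting (src_type, dst_type), ordered so that a
    preferred converter always precedes the ones it outranks.

    index:   converter name -> set of (source type, target type) pairs
    prefers: conversion pair -> list of (better, worse) converter names
    Both structures are hypothetical stand-ins for the ontology."""
    pair = (src_type, dst_type)
    names = [name for name, pairs in index.items() if pair in pairs]
    sorter = TopologicalSorter({name: set() for name in names})
    for better, worse in prefers.get(pair, []):
        if better in names and worse in names:
            sorter.add(worse, better)   # 'better' must be emitted before 'worse'
    return list(sorter.static_order())
```

Because the priority relation is only a partial order, a topological sort is a natural fit: it yields one valid ranking without forcing a preference between converters the ontology does not compare.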
Materialized ID Map View: The materialized map view database is a partitioned set of quadruples ⟨i1, i2, c, o⟩, where i1 is mapped to i2 using converter c with the convert statement option o ∈ {any, unique, all}. This view is partitioned into sets for easy lookup based on the type pairs ⟨t1, t2⟩, where t1 is the type of ID i1 and t2 is the type of i2 (e.g., HGNC to NetAffy), as described in the ID hierarchy. Whenever a map query is processed and responses are generated, these responses are materialized in the appropriate partition for future use.
Provenance Manager: The provenance engine has two main components – a validator and an integrator. Once a query is submitted, the query relation is joined with the map view relation to compute the set of IDs that already exists
in the view and is removed from the query relation. The validator then checks to see if the mappings are still valid by running a provenance query against the online source converters. If the mappings are still valid, the query relation is forwarded to the query processor for execution. Otherwise, the failed IDs are again added to the query relation and the map view entries are removed by the integrator.
Query Processor and Query Interface: The heart of MapBase is its query processor which drives all computations. It has two major components - the differential map generator and online query processor. User queries pipe through the differential map generator that attempts to generate the set of IDs that actually require online computation by isolating the subset of IDs that are already in the view database. From the set of mapping type information in the convert statement, it identifies the online converters that can potentially generate the mappings by consulting the hash index and the ID hierarchy. It then analyzes the submitted query with the help of provenance manager to determine the subset of IDs needing computation. The query is then transformed into a set of web queries and submitted to the online converters. Appropriate schema matching and wrapping functions are used to match the remote converter form schema and extract returned responses. The pre-computed response from the map view and the computed response are then returned as a single response to the user.

The MapBase query interface allows users to query the web to find and study new converters as well. The users are also allowed to add new ID converters to the MapBase database by simply supplying the URL to the system. Since MapBase uses LifeDB data integration system [3] and its query language BioFlow as its implementation and execution platform, the querying and inclusion of new converters are transparent to users and do not require any additional process.
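The lookup, validation and online-fetch cycle described for the provenance manager and the query processor can be sketched as follows. This is a simplified illustration under stated assumptions: the view is a dictionary partitioned by type pair, and `is_still_valid` and `convert_online` are hypothetical callables standing in for the provenance query and for the wrapped web queries to online converters.

```python
def answer_query(ids, pair, view, is_still_valid, convert_online):
    """Answer a map query for one type pair using a materialized view.

    view:            dict mapping a type pair to {source ID: stored mapping}
    is_still_valid:  callable(pair, source_id, stored) -> bool (hypothetical)
    convert_online:  callable(pair, ids) -> {source ID: mapping} (hypothetical)
    """
    partition = view.setdefault(pair, {})
    cached, pending = {}, []
    for source_id in ids:
        stored = partition.get(source_id)
        if stored is not None and is_still_valid(pair, source_id, stored):
            cached[source_id] = stored          # reuse the materialized mapping
        else:
            partition.pop(source_id, None)      # drop a stale entry, if any
            pending.append(source_id)           # needs online computation
    fresh = convert_online(pair, pending) if pending else {}
    partition.update(fresh)                     # materialize for future queries
    return {**cached, **fresh}                  # single combined response
```

Only the IDs in `pending` reach the online converters, which is the differential behaviour the text describes; repeated queries for the same IDs are served from the view as long as validation succeeds.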
References
1. Diego Forero Blog. http://www.scribd.com/doc/18966500/Id-Converters-Test.
2. A. Alibés, P. Yankilevich, A. Cañada, and R. Díaz-Uriarte. Idconverter and idclight: Conversion and annotation of gene and protein ids. BMC Bioinformatics, 8, 2007.
3. A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. M. Jamil, and L. Lipovich. On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In DEXA, 2009.
4. S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In VLDB, pages 577–589, 1991.
5. S. Draghici, S. Sellamuthu, and P. Khatri. Babel's tower revisited: a universal resource for cross-referencing across annotation databases. Bioinformatics, 22(23):2934–2939, 2006.
6. H. M. Jamil. Improving integration effectiveness through id mapping based record linkage in biological databases. Technical report. Under review, IEEE BIBM 2012.
7. S. Lee, B. Kim, H. Kim, H. Lee, and U. Yu. IdBean: a java GUI application for conversion of biological identifiers. BMB reports, 44(2):107–112, Feb. 2011.
8. A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. Why so? or why no? functional causality for explaining query answers. In MUD, pages 3–17, 2010.
Investigations on Elastic Network Models of Coarse-Grained Membrane Proteins
Kannan Sankar1, Michael T. Zimmermann2, 3 and Robert L. Jernigan1, 2, 3
1Bioinformatics and Computational Biology Graduate Program 2Department of Biochemistry, Biophysics and Molecular Biology 3L. H. Baker Center for Bioinformatics and Biological Statistics Iowa State University, Ames IA 50011, USA [email protected], [email protected], [email protected]
Abstract. Despite their overwhelming importance, still relatively few structures of membrane proteins have been experimentally solved. Given that the majority of drugs target membrane proteins, insights from structural and functional analysis of existing structures with computational tools can be extremely useful. The dynamics and function of membrane proteins relate closely to the membrane in which they are embedded. Here we use anisotropic elastic network models (ANMs) to investigate the motions of a G-protein coupled receptor (GPCR), the β-2 adrenergic receptor, in the absence and presence of membranes, using surrounding membrane patches of various shapes and sizes as well as different ANM parameters. Our results indicate that the normal modes of the protein are significantly modified by the presence of the membrane. The extent of membrane-induced modifications and the membrane's impact upon proposed functional motions are investigated.
Keywords: Membrane proteins, beta-2 adrenergic receptor, elastic network model, coarse-grained model, normal mode analysis
1 Introduction
Membrane proteins play a crucial role in cells by performing diverse functions ranging from signal transduction and cell adhesion to small molecule transport and catalysis. They are also the largest class of protein drug targets [1]. However, the structures of only a few membrane proteins have been solved experimentally, owing to difficulties in expressing and crystallizing them [2]. This makes computational approaches and simulations particularly important for understanding their structure-function relationships.
Computational analysis of membrane proteins is complicated by their large size and also by the fact that the membrane itself could play a significant role in modulating their effective dynamics. Elastic Network Models (ENMs) including Gaussian (GNMs) and Anisotropic Network models (ANMs) offer a
fast and convenient way for analyzing such large systems [3,4,5]. By modeling complex systems as a set of particles that are interconnected by springs (if two particles are within a particular distance cutoff Rc) and applying normal mode analysis, ENMs can capture the most important collective motions of the parts of a system. Experimental studies early in protein science demonstrated that substantial structural fluctuations occur in proteins, and that these fluctuations are essential to protein function [6,7]. Previous studies have shown that the mean square fluctuations of atoms obtained from ENMs correlate well with experimentally determined temperature factors and NMR ensembles of structures, as well as with the results of principal component analysis of ensembles of independently determined structures [8,9]. Customarily, the few slowest normal modes capture the functional motions for a wide range of proteins [4,5].
Our study focuses on the human β-2 adrenergic receptor (ADRB2) which is a G-protein coupled receptor (GPCR) involved in response to adrenaline-mediated smooth muscle relaxation. We investigate differences in the normal modes obtained from ANMs of the protein in the presence and absence of membrane, using membrane models of different shapes (cubic and cylindrical) and different sizes and also by using different ANM parameters.
2 Methods
The X-ray crystal structure of ADRB2 was obtained from the Protein Data Bank (PDB ID: 2RH1) [10]. The cubic POPC (1-palmitoyl-2-oleoyl phosphatidyl choline) membrane with sides of length 100 Å was built using Membrane Builder in the VMD 1.9 (Visual Molecular Dynamics) [11] package. Cylindrical membranes of radii from 27 Å to 41 Å (in steps of 2 Å) were built after embedding the protein, by retaining only the POPC molecules within the given radius. The protein was coarse-grained to only its Cα atoms, and the POPC atoms were 'vertically' coarse-grained (along the length of POPC) to retain the atoms N, P1 and O21 in the polar head group and C32, C24, C28, C212, C216, C36, C310 and C314 in the hydrophobic tails. This provides a membrane with somewhat more detail than the protein, so we further utilized a spherical coarse-grained membrane, obtained by iteratively removing atoms within a 5 Å cutoff to ensure uniform density. ENMs were generated using a subset of the atomic coordinates (the coarse-grained structures) connected by harmonic springs of unit stiffness γ = 1 kcal/(mol·Å²) wherever two points are within a cutoff radius of Rc = 13 Å (unless otherwise stated). Similarities between modes of motion from different models are measured in terms of overlap (O), cumulative overlap (CO) and root mean-square inner product (RMSIP) between the normal mode vectors, as described in detail elsewhere [12, 13].
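The model construction and the mode comparison described above follow the standard ANM recipe, which can be sketched with NumPy. This is a generic illustration, not the authors' code: it uses a single spring constant and cutoff for all node pairs (defaults taken from the values quoted in the text), assumes a connected, non-degenerate structure so that exactly six rigid-body modes have zero eigenvalue, and uses hypothetical function names.

```python
import numpy as np

def anm_hessian(coords, cutoff=13.0, gamma=1.0):
    """3N x 3N ANM Hessian for nodes at coords (N x 3, in Angstrom).
    Every pair closer than the cutoff is joined by a spring of stiffness gamma."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue
            block = -gamma * np.outer(d, d) / r2      # off-diagonal super-element
            hess[3*i:3*i+3, 3*j:3*j+3] = block
            hess[3*j:3*j+3, 3*i:3*i+3] = block
            hess[3*i:3*i+3, 3*i:3*i+3] -= block       # diagonal = -sum of off-diagonals
            hess[3*j:3*j+3, 3*j:3*j+3] -= block
    return hess

def normal_modes(hess, n_modes=10):
    """Eigenvalues and eigenvectors of the slowest non-rigid modes.
    Assumes the first six eigenvalues are the zero rigid-body modes."""
    vals, vecs = np.linalg.eigh(hess)
    return vals[6:6 + n_modes], vecs[:, 6:6 + n_modes]

def overlap(u, v):
    """Absolute cosine between two mode vectors."""
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rmsip(modes_a, modes_b):
    """Root mean-square inner product of two sets of orthonormal modes (columns)."""
    k = modes_a.shape[1]
    return np.sqrt(np.sum((modes_a.T @ modes_b) ** 2) / k)
```

To compare a protein-only model with a protein-plus-membrane model, the membrane model's eigenvectors would first have to be restricted to the protein's degrees of freedom; how that restriction was done is not stated in the abstract, so it is left out of this sketch.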
3 Results and Discussions
The first 10 normal modes of the free ADRB2 protein show little overlap with the first 10 modes of the protein from models where the membrane (cubic or cylindrical) is included. The models which include the membrane yield higher mean-square fluctuations in various regions of the protein, especially the loop regions (Fig. 1). On visualizing the modes, we find that, in the presence of the membrane, the motions exhibited by the free protein are highly damped. Also, there is only a moderate overlap between the modes generated using cubic and cylindrical membranes (Fig. 2a), perhaps indicating that the local membrane environment around the protein can have a major impact on the protein's functional motions that may affect, for example, the formation of membrane rafts. In addition, cylindrical membranes of increasing radii yield modes of decreasing overlap with the modes of the free protein. Although spherical coarse-graining (c-g) of the membrane yields similar motions, some specific individual modes in the vertical c-g are absent in the spherical c-g motions (Fig. 2b). We varied the values of γ and Rc for the protein Cα atoms, as well as for the head and tail atoms of the lipid molecules in the bilayer. Similar behaviors are observed between the modes obtained with such models and the ones reported here, demonstrating insensitivity to these details (results not shown). Our results, however, suggest that the membrane does play an important role in affecting the functional motions of a membrane protein.
Fig. 1. Mean square fluctuations (MSFs) from ANMs built (a) with and (b) without membrane are mapped onto the Cα backbone of ADRB2 structure in a spectral coloring scheme. Most of the significant differences are in the extra- and intra-cellular loops and the N- and C-termini. Red represents regions with high MSF while blue represents regions with low MSF.
Fig. 2. (a) Overlaps between the first 10 normal modes of the models with cubic membrane and cylindrical membrane of radius 26.5Å are only moderate (gray squares), indicating a significant influence of these details on the motion. (b) Overlap between the first 10 normal modes of the model with cylindrically c-g membrane and vertically c-g membrane is significantly high (dark squares), showing that the motions are similar. The gray-scale reflects the extent of overlap in the directions of the motions between modes, as shown in the legend bar on the right.
4 References
1. Terstappen, G.C., Reggiani, A.: In silico research in drug discovery. Trends Pharm. Sci. 22, 23–26 (2001)
2. Ostermeier, C., Michel, H.: Crystallization of membrane proteins. Curr. Opin. Str. Biol. 7, 697-701 (1997)
3. Tirion M.M.: Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905–1908 (1996)
4. Bahar I, Atilgan AR, Erman B.: Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Fold. Des. 2, 173–181 (1997)
5. Atilgan AR, Durell SR, Jernigan RL, Demirel MC, Keskin O, Bahar I.: Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505-515 (2001)
6. Careri, G., Fasella, P., Gratton, E.: Enzyme Dynamics: The Statistical Physics Approach. Ann. Rev. Biophys. Bioeng. 8, 69-97 (1979)
7. Weber G.: Energetics of ligand binding to proteins. Adv. Protein Chem. 29, 1-83 (1975)
8. Yang, L., Song, G., Carriquiry, A., Jernigan, R. L.: Close Correspondence between the motions from principal component analysis of multiple HIV-1 protease structures and elastic network modes. Structure 16, 321-330 (2008)
9. Bakan, A., Bahar, I.: Computational generation inhibitor-bound conformers of p38 map kinase and comparison with experiments. Pac. Symp. Biocomput. 181-192 (2011)
10. Cherezov et al.: High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science 318, 1258-1265 (2007)
11. Humphrey, W., Dalke, A., Schulten, K.: VMD - Visual Molecular Dynamics. J. Molec. Graphics 14.1, 33-38 (1996)
12. Tama F., Sanejouand YH.: Conformational change of proteins arising from normal mode calculations. Protein Eng. 14, 1–6 (2001)
13. Leo-Macias A, Lopez-Romero P, Lupyan D, Zerbino D, Ortiz AR.: An analysis of core deformations in protein superfamilies. Biophys. J. 88, 1291–1299 (2005)
De novo Genome and Transcriptome Sequencing of Social Paper Wasps: Application to Understanding Parasite Manipulation of Host Behavior
Ruolin Liu (1), Daniel Standage (1,2) and Amy L. Toth (1, 3, 4)
(1) Program in Bioinformatics and Computational Biology, Iowa State University (2) Genetics, Development & Cell Biology Department, Iowa State University (3) Department of Ecology, Evolution, and Organismal Biology, Iowa State University (4) Department of Entomology, Iowa State University
In the recent history of biology, next-generation sequencing has substantially widened the scope of studying the genetics and evolution of almost any trait of interest in any organism. We are developing genomic resources for studying the evolution of social behavior of paper wasps in the genus Polistes, a group of social insects. These wasps form small societies containing queens and altruistic workers. Although they cooperate to form a "eusocial" colony, they are considered to be "primitively eusocial" because workers have the ability to become queens, and there is a substantial amount of conflict and aggression among females for opportunities to reproduce. These characteristics make Polistes an ideal system for testing hypotheses about the genetic basis of the evolution of altruistic behavior. Genomic tools are playing a critical role in helping us understand this emerging model system. We believe that a complete Polistes genome sequence can greatly enhance our ability to study the genetics of social behavior via comparative genomic and transcriptomic analyses, and can greatly facilitate the identification of regulatory regions and epigenetic modifications affecting sociality. Using multiple Illumina/Solexa libraries derived from the genome of a single haploid male, we rapidly and efficiently sequenced and de novo assembled a draft genome sequence of Polistes dominulus. The genome is approximately 300 Mb, and the assembly represents over 100X coverage of the genome. We also generated ABI SOLiD RNA-sequence data from brains of the same species, which are being used to feed the MAKER annotation pipeline. We are using the draft genome and RNA-seq data to study a fascinating aspect of P. dominulus behavior: an aberrant nest-desertion behavior displayed by workers after infection by the strepsipteran endoparasite Xenos vesparum. 
Parasitized workers lose altruistic behavior and do not help at the nest, but instead sit in aggregations (typical of overwintering behavior by queens) in nearby vegetation. Using the ABI SOLiD RNA sequence data, we are quantifying brain transcriptomic differences among 3 different samples (normal aggregating queens, normal workers and parasitized workers). In doing so, we are investigating whether the parasite “manipulates” the brain gene expression of the host, and predict that the parasite shifts gene expression patterns of workers to mimic those of aggregating queens.
Keywords: Paper wasp, Genome sequencing, Transcriptomics, Social behavior
Genome sequencing, assembly, annotation and comparative analysis of Pseudomonas fluorescens NCIMB11764 bacterium
Claudia Vilo1, Michael Benedik2 , Daniel Kunz1* and Qunfeng Dong1,3* 1 University of North Texas, Department of Biological sciences. 2 Texas A&M University, Department of Biology. 3 University of North Texas, Department of Computer Science and Engineering. *Corresponding authors.
Abstract
The bacterium Pseudomonas fluorescens NCIMB 11764 (Pf11764) has been found to be capable of utilizing cyanide as its sole nitrogen source. Cyanide is a potent poison that can be found naturally in different environments. Therefore, its biodegradation should have evolved within species exposed to it. Cyanide metabolism in this bacterium depends on the induction of an enzyme described as Cyanide oxidase (CNO) [1,2], which is made of four protein components: NADH oxidase (Nox), NADH peroxidase (Npx), Cyanide nitrilase (CNN), and Carbonic anhydrase (CA) [3,4]. The complete molecular properties of this enzyme and the genetic basis of cyanide utilization by Pf11764 are not well understood. Therefore, to learn more about the unique genetic potential of Pf11764 for adaptation to cyanide, we characterized the bacterium's genome using next-generation sequencing technology. Specifically, the genome was sequenced using Illumina/Solexa technology with paired-end libraries. The number of reads was 16,174,118, with a total length of 841,054,136 bp, which gave a coverage of 120x. The reads, obtained in FASTQ format, were 52 bp long on average. We then reconstructed the genome by assembling those reads. The assembly was done using the SOAPdenovo software (http://soap.genomics.org.cn/soapdenovo.html), which is based on the de Bruijn graph algorithm. SOAPdenovo uses a .config file that specifies the assembly. We tried several parameter settings for the assembly procedure, but for the following steps we chose a maximum read length of 50 bp, average insert sizes of 200 bp and 2000 bp, and a k-mer size of 31 bp. The assembly yielded 2,751 contigs with an average length of 2,551 bp, and 150 scaffolds with an average length of 46,263 bp. Once the reads were assembled into longer consensus sequences (scaffolds), we used Genemark.hmm (http://exon.gatech.edu/) for gene prediction. 
By using a ribosomal binding site (RBS) model we found 6,432 ORFs. Our first aim was to find the proteins that make up the CNO enzyme. We used the Blast2GO tool (http://www.blast2go.com) to characterize the predicted genes. Blast2GO is Java-based software that allows the comparison of unknown protein sequences with reference sequences from GenBank. We then searched the gene annotations for putative CNO enzyme components.
Table 1. Potential genes coding CNO enzyme components

CNO component   Predicted genes
Nox             Six predicted genes were identified as flavin oxidoreductase NADH oxidase.
Npx             One predicted gene candidate.
CNN             Eleven genes annotated as nitrilases were identified.
CA              Three genes were identified as carbonic anhydrases.
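The sequencing figures quoted in the abstract can be cross-checked with simple arithmetic, and the k-mer decomposition underlying de Bruijn graph assembly can be illustrated in a few lines. This is a sketch for orientation only: it is not how SOAPdenovo is implemented, the genome size it prints is merely the value implied by dividing total bases by the stated coverage, and the function names are hypothetical.

```python
def mean_read_length(total_bases, n_reads):
    """Average read length in bp."""
    return total_bases / n_reads

def implied_genome_size(total_bases, coverage):
    """Genome size implied by coverage = total bases / genome size."""
    return total_bases / coverage

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(reads, k):
    """Edges of a de Bruijn graph: each k-mer links its (k-1)-mer prefix
    to its (k-1)-mer suffix. Toy illustration without error correction."""
    edges = set()
    for read in reads:
        for kmer in kmers(read, k):
            edges.add((kmer[:-1], kmer[1:]))
    return edges

if __name__ == "__main__":
    total_bases, n_reads = 841_054_136, 16_174_118   # values from the text
    print(mean_read_length(total_bases, n_reads))    # average read length, bp
    print(implied_genome_size(total_bases, 120))     # bp, assuming 120x coverage
    print(len(kmers("A" * 52, 31)))                  # 31-mers in one 52 bp read
```

The read count and total length reproduce the stated 52 bp average exactly, and at 120x coverage they imply a genome of roughly 7 Mb; a 52 bp read contributes 22 overlapping 31-mers to the graph.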
We also used the Blast2go annotations against the NCBI non-redundant database to characterize all the predicted genes of the Pf11764 genome. In general, several metabolic pathways in Pf11764 were similar to those found in other Pseudomonas species. Consistent with Pseudomonas carbohydrate metabolism, we found no 6-phosphofructokinase, which indicates that Pf11764 does not use the Embden-Meyerhof pathway. A substantial proportion of the predicted genes showed hydrolase and transferase activity, in accordance with the metabolism of the soil and plant compounds present in the usual Pseudomonas environment. Additionally, predicted genes with transcription factor and nucleotide/nucleic acid binding activity account for the high degree of regulation at the genome level that is expected for large genomes. More than 1,400 predicted genes were similar to hypothetical proteins from previous Pseudomonas genome projects. Also, more than 1,200 predicted genes were similar to transporter proteins, more than 500 to membrane proteins and more than 180 to ion transporter proteins, which corresponds to the capacity for soil and plant surface colonization. Interestingly, 39 predicted genes were similar to sigma factors, including anti-sigma factors and flagellar factors. Catabolic capabilities were also present in this genome, with predicted genes similar to proteases, lipases and aminotransferases. Housekeeping genes that are used for phylogenetic purposes were also found in the genome: gyrB, gyrA, rpoA, rpoB, rpoD, recA, gltA and gapA. In the environment, cyanide often binds metal ions such as iron, cobalt, copper, nickel and zinc; we therefore searched for related proteins. Several siderophore receptors were identified, which indicates a capacity for iron acquisition. Several predicted genes were also similar to metal transporter proteins, including nickel, copper, zinc, iron and cobalt transporters.
To understand the mechanism that allows this bacterium to adapt to cyanide environments, and what makes it different from closely related species, we compared Pf11764 with three Pseudomonas fluorescens strains. We downloaded the complete genomes of P. fluorescens SBW25, P. fluorescens Pf0-1 and P. fluorescens Pf-5 from GenBank. We used the Genemark.hmm program to annotate their genes, as we did for the genome of Pf11764, and a Perl script to compute the number of bases and the GC content. The general features showed that the assembled Pf11764 genome had almost the same size as the reference genomes. The number of predicted genes was also very similar to that of the reference Pseudomonas genomes.
Table 2. Genome comparison of Pf11764 with P. fluorescens Pf0-1, Pf-5 and SBW25.

                      Pf11764     Pf0-1       Pf-5        SBW25
# of bases            6,939,480   6,438,405   7,074,893   6,722,539
%GC                   56.8        60.5        63.3        60.5
16S rRNA              2           6           5           5
ORFs (Genemark.hmm)   6,432       5,815       6,370       6,117
tRNAs (tRNAScanSE)    40          73          71          66
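The base counts and GC percentages in Table 2 were obtained with a Perl script that is not shown. As a rough illustration only, the following Python sketch computes the same two quantities from FASTA-formatted text; the function name and the toy sequences are hypothetical, and ambiguous bases such as N are simply counted in the total.

```python
def base_count_and_gc(fasta_text):
    """Return (number of bases, GC percentage) for FASTA-formatted text."""
    total = 0
    gc = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and sequence headers
        seq = line.upper()
        total += len(seq)                      # every character counts as a base
        gc += seq.count("G") + seq.count("C")  # G and C only
    gc_percent = 100.0 * gc / total if total else 0.0
    return total, gc_percent


# Toy input with two hypothetical scaffolds.
example = ">scaffold1\nATGCGC\nGGAT\n>scaffold2\nCCTA\n"
n_bases, gc_percent = base_count_and_gc(example)
print(n_bases, round(gc_percent, 1))
```

For a real genome the text would be read from the assembled scaffold file instead of the toy string.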
Our second goal was to compare the predicted genes of the sequenced Pf11764 genome with those of the reference P. fluorescens strains. We used standalone BLAST (ftp://ftp.ncbi.nih.gov/blast/) to compare the genes of Pf11764 with those of Pf0-1, Pf-5 and SBW25.
Table 3. Comparison of the predicted genes from the Pf11764 genome (6,432 predicted genes) with P. fluorescens Pf0-1, Pf-5 and SBW25, using BLAST with e-value 1e-20.

                ORFs with hit                        ORFs without hit
Species         Number   Avg. length (nt)   GC %     Number   Avg. length (nt)   GC %
Pf0-1           5,064    1,046              60%      1,368    685                59%
Pf-5            4,752    1,047              60%      1,680    747                56%
SBW25           4,697    1,060              60%      1,735    723                56%
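The split into ORFs with and without a hit can be sketched as follows. This is a minimal illustration, not the pipeline used for Table 3: it assumes 12-column tab-separated BLAST output (query id in the first column, e-value in the eleventh), and the gene and subject identifiers are hypothetical.

```python
def split_by_blast_hit(all_gene_ids, blast_tabular_lines, max_evalue=1e-20):
    """Return (genes with a hit, genes without a hit) at the e-value cutoff."""
    with_hit = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id = fields[0]
        evalue = float(fields[10])  # column 11 of the tabular format
        if evalue <= max_evalue:
            with_hit.add(query_id)
    all_ids = set(all_gene_ids)
    with_hit &= all_ids
    return with_hit, all_ids - with_hit


# Hypothetical example: g1 has a strong hit, g2 only a weak one, g3 none.
genes = ["g1", "g2", "g3"]
blast_lines = [
    "g1\tref_0001\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "g2\tref_0417\t35.0\t80\t52\t0\t1\t80\t5\t84\t2e-05\t40.1",
]
hits, orphans = split_by_blast_hit(genes, blast_lines)
print(sorted(hits), sorted(orphans))
```

The counts in Table 3 are consistent with this kind of split: for each reference strain, the ORFs with and without a hit add up to the 6,432 predicted genes.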
To learn about the functions of the orphan genes of Pf11764 (the predicted genes without a BLAST hit in the reference strains), we re-annotated them using the Blast2go tool. Interestingly, the orphan genes showed a high proportion of metal-binding proteins.
In addition, we performed a differential analysis of the orphan genes using the Blast2go tool, with Pseudomonas fluorescens Pf-5 and Pseudomonas fluorescens SBW25. The analysis showed that metal-binding genes were overrepresented in Pf11764. Differential analysis with the three Pseudomonas fluorescens strains also showed an overrepresentation of transport and localization genes.
Figure 1. Characterization of orphan genes detected after comparison of Pf11764 with Pf-5 using BLAST. Molecular function of the annotated genes using Blast2go.
Our results indicate that the presence of the metal ion binding proteins could be directly related to cyanide metabolism in Pseudomonas fluorescens Pf11764. The utilization of the CNO enzyme for cyanide degradation might be part of a specific pathway that also involves metal ion binding proteins and transporter proteins for the initial uptake of cyanide from the environment.
Additionally, a large number of genome rearrangements were observed in Pf11764 when compared with the reference Pseudomonas fluorescens genomes.
Conclusions
Potential genes encoding putative enzymatic components shown earlier [3] to be necessary for oxidative cyanide metabolism by Pf11764 were identified. Further research is necessary before assigning specific genes to previously identified enzymes. Differential analysis of orphan genes following a comparison of the Pf11764 genome with the related Pf0-1, Pf-5 and SBW25 strains revealed an over-representation of metal-binding, transport and localization genes in Pf11764. It is well known that cyanide binds metals as a ligand, and such metal-complexed species are generally much less toxic than cyanide itself. The high incidence of metal ion binding and transporter proteins in Pf11764 could indicate that such genes play important roles in cyanide detoxification and transport. A large number of genome rearrangements were observed when comparing the structure of the Pf11764 genome with that of Pf0-1, Pf-5 and SBW25. These differences, we conclude, reflect possible genetic events, such as horizontal gene transfer, that could have led Pf11764 to acquire the unique capacity to degrade cyanide and assimilate it as a nitrogen source.
References
1. Harris R and Knowles C (1983). The conversion of cyanide to ammonia by extracts of a strain of Pseudomonas fluorescens that utilizes cyanide as a source of nitrogen for growth. FEMS Microbiol. Lett. 20:337-341.
2. Kunz D, Nagappan O, Silva-Avalos J and Delong G (1992). Utilization of cyanide as a nitrogenous substrate by Pseudomonas fluorescens NCIMB 11764: evidence for multiple pathways of metabolic conversion. Appl. Environ. Microbiol. 58(6):2022-2029.
3. Kunz D, Wang C and Chen J (1994). Alternative routes of enzymic cyanide metabolism in Pseudomonas fluorescens NCIMB 11764. Microbiology 140:1705-1712.
4. Fernandez R and Kunz D (2005). Bacterial cyanide oxygenase is a suite of enzymes catalyzing the scavenging and adventitious utilization of cyanide as a nitrogenous growth substrate. J. Bacteriol. 187(18):6396-6402.
5. Paulsen et al. (2005). Complete genome sequence of the plant commensal Pseudomonas fluorescens Pf-5. Nature Biotechnology 23(7):873-878.
Statistical Evaluation of Dynamic Brain Cell Calcium Activity
Kinsey R. Cotton, Mark DeCoster, Katie A. Evans, Richard A. Idowu, and Mihaela Paun
Louisiana Tech University, College of Engineering and Science, P.O. Box 10348, Ruston, LA 71272
Abstract. Calcium in its ionic form is very dynamic, especially in excitable cells such as muscle and brain cells, moving from the high-concentration exterior of the cell to the much lower concentrations inside the cell, where calcium is used as a second messenger. In brain cells, and neurons especially, calcium is a key signaling ion involved in memory and learning, with excitatory neurotransmitters such as glutamate turning neurons “on”. Glutamate (Glu) excites neurons in part by causing large and dynamic increases in intracellular calcium concentration ([Ca2+]i). While these [Ca2+]i dynamics are essential for normal signaling in the brain, excessive and sustained elevations in neuronal [Ca2+]i are related to neuronal injury [1], including long-term neurodegenerative processes [2]. Helping to regulate these dynamics in the brain are the glial cells known as astrocytes. Astrocytes express glutamate transporters [3] and in this way diminish the time that neurons are exposed to glutamate, thus also shaping the [Ca2+]i dynamics in neurons. Here we describe an in vitro cell culture system composed of rat brain cortical neurons with different densities of astrocytes, which we have used to statistically analyze the [Ca2+]i dynamics in individual neurons. This work follows our long-standing interest in brain cell [Ca2+]i dynamics [4], but with the proposed applied statistical and mathematical tools we now provide a system for predicting: 1) whether the order of repeated glutamate stimulation alters neuronal [Ca2+]i dynamics and 2) how the presence of different densities of astrocytes modulates neuronal [Ca2+]i dynamics. We anticipate that this combined experimental/analytical approach will also have utility in understanding additional brain diseases such as brain tumors [5].
Materials and Methods
2.1 Primary Cortical Culture Preparation

Cortical cells were obtained by performing cervical disarticulation of outbred Sprague-Dawley newborn rats (age ≤ 48 h) using methods as described [6]. After three days in vitro, the cell culture plates were split in half, with one half of the cultures treated with a 100x dilution of cytosine arabinoside (Ara C, 1 mM; Sigma-Aldrich) to deplete glial cells. Three culture sets were created in total (n = 21 rats and approximately 48 wells per culture type, i.e., co-culture and neurons).

2.2 Calcium Fluorescence Imaging

The cortical cultures were imaged at 8 to 9 days in vitro after incubating the cells for 45 minutes in a loading solution of Pluronic acid (20% wt in Dimethylsiloxane, Sigma-Aldrich) at a 1000x dilution and Fluo-3/AM (Invitrogen) at a 500x dilution in Locke’s solution [4]. Cells were then washed and recovered in Locke’s solution and re-incubated for 30 minutes. Cells were imaged in real time on an Olympus CKX41 inverted microscope with a 488 nm excitation filter, at one frame every 4 s, using Intracellular Imaging software. A baseline (Treatment 0, i.e., cell recording before treatment) was recorded for 60 s; Glu was then added at predetermined times (60, 240, and 500 s) without washing out the medium between additions.

2.3 Measurement and Analysis of Fluorescence Intensity

Intracellular Imaging software (InCytIm1™, Version 5.26, Intracellular Imaging Inc., Cincinnati, OH) was used to create regions of interest (ROIs) around every cell in the data set after the experiment. ROIs were used to measure fluorescence intensity over time. The data were imported into Excel, and each ROI trace was normalized so that its starting value equals one, which allows ROIs to be compared with one another.

2.4 Statistical and Applied Mathematical Analysis

A one-way analysis of variance (ANOVA) was used to examine the effect of the independent variable Treatment, with four levels, on the dependent variables “Number of spikes” and “Area under the curve.” To determine which pairs of Treatment groups differ, a Tukey honestly significant difference (Tukey HSD) test was applied.
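The normalization and the one-way ANOVA described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the analysis actually performed: the trapezoidal rule for the area under the curve, the 4 s sampling interval as default, and all function names and toy numbers are ours. Spike counting, p-values and the Tukey HSD step are not covered.

```python
def normalize_trace(trace):
    """Scale a fluorescence trace so that its first value equals 1."""
    start = trace[0]
    return [value / start for value in trace]


def area_under_curve(trace, dt=4.0):
    """Trapezoidal area under a trace sampled every dt seconds (assumed rule)."""
    return sum((a + b) / 2.0 * dt for a, b in zip(trace, trace[1:]))


def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group (treatment) and within-group (error) sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy ROI trace (arbitrary intensity units) and toy treatment groups.
normalized = normalize_trace([200.0, 300.0, 250.0])
auc = area_under_curve(normalized)
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(normalized, auc, f_stat, df_b, df_w)
```

In practice the F statistic would be compared against the F distribution with the returned degrees of freedom, and a post hoc test would follow only if the Treatment effect is significant.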
Results and Discussion
(a) Testing calcium dynamics: Three sets of submaximal glutamate stimuli were successively added to primary rat cortical neurons and [Ca2+]i dynamics were recorded as described in Methods. Once added, the glutamate remained on the cells; as can be seen in Figure 1, each stimulation was therefore sub-maximal in the sense that cells recovered completely, or to a large extent, to baseline levels before the next stimulus.
Fig. 1. Successive treatment of rat brain cortical neurons with 250, 500, and 750 nM Glu, as indicated by arrows, elicits transient increases in [Ca2+]i as indicated by fluorescence intensity (Y-axis). Each tracing represents an individual neuron tracked over time (X-axis, 4 s/frame). Six representative neurons from over 40 cells are shown. See text for mathematical analysis of all cells treated.
(b) Spiking activity and area under the curve analysis: Using the one-way ANOVAs, the amount of variability in the response variable (sum of square error-treatment) for “Number of spikes” and “Area under the curve” was reported as 534.7 and 6647020, respectively. Our preliminary analysis shows a significant Treatment effect for both variables considered. The Tukey HSD for “Number of spikes” reveals that Treatment 2 (with the highest number of spikes/mean spike) differed highly significantly from the other Treatments. Similarly, for “Area under the curve”, Treatments 2 and 3 were significantly different according to the corresponding Tukey HSD. Although the glutamate concentration increased with each of the three successive stimuli, the most spiking activity was unexpectedly observed in Treatment 2, the intermediate concentration (Figure 2a). We hypothesize that the highest concentration (Treatment 3) leads to fewer spikes due to synchrony of neuronal activity. This is supported by the “Area under the curve” result, where the highest glutamate concentration indeed resulted in the largest calcium load, i.e., the strongest [Ca2+]i load (Figure 2b).
[Figure 2: two box plots. Left panel: “Number of Spikes by Treatment” (y-axis: Spikes; x-axis: Treatments 0, 1, 2, 3). Right panel: “Area under the curve by Treatment” (y-axis: Area under the curve; x-axis: Treatments 0, 1, 2, 3).]

Fig. 2. (a, left) Box plot of "Number of spikes" by Treatment group. (b, right) Box plot of "Area under the curve" by Treatment group. For each, the box portion of the box-and-whisker plot includes 50% of the data; whiskers depict the minimum and maximum data values. The edges of the box show the lower (Q1) and upper (Q3) quartiles, and the dark, thick line represents the median of the data.
Reference List

1. Lazarewicz JW. Calcium transients in brain ischemia: role in neuronal injury. Acta Neurobiol Exp (Wars) 1996; 56(1): 299-311.
2. Marambaud P, Dreses-Werringloer U, Vingtdeux V. Calcium signaling in neurodegeneration. Mol Neurodegener 2009; 4: 20.
3. Anderson CM, Swanson RA. Astrocyte glutamate transport: review of properties, regulation, and physiological functions. Glia 2000; 32(1): 1-14.
4. DeCoster MA, Koenig ML, Hunter JC, Tortella FC. Calcium dynamics in neurons treated with toxic and non-toxic concentrations of glutamate. Neuroreport 1992; 3(9): 773-77
5. Lyons SA, Chung WJ, Weaver AK, Ogunrinu T, Sontheimer H. Autocrine glutamate signaling promotes glioma cell invasion. Cancer Res 2007; 67(19): 9463-9471.
6. Daniel B, DeCoster MA. Quantification of sPLA2-induced early and late apoptosis changes in neuronal cell cultures using combined TUNEL and DAPI staining. Brain Res Protoc 2004; 13(3): 144-150.
Lineage Specific Expansion of Protein Families in Malaria Parasites
Hong Cai1, Jianying Gu2,*, Yufeng Wang1,*

1 Department of Biology, South Texas Center for Emerging Infectious Diseases, University of Texas at San Antonio, San Antonio, TX 78249, USA
2 Department of Biology, College of Staten Island, City University of New York, Staten Island, NY 10314, USA
* Corresponding author
Abstract. Malaria is a devastating global infectious disease caused by fast-evolving parasites in the genus Plasmodium. The development of new drugs and therapies relies on a better understanding of the parasite biology. In this study, we explored the protein families that have been specifically expanded in one or several lineages of six evolutionarily related Plasmodium species. These proteins with lineage specific expansions (LSEs) involve genes that are associated with pathogenesis and virulence as well as fundamental cellular processes in malaria parasites.
Keywords: malaria, protein family, comparative genomics, network
1 Introduction
Malaria is a vector-borne infectious disease responsible for about 1-2 million deaths worldwide every year. The causative agents of malaria belong to a group of parasites in the genus Plasmodium, and the most life-threatening form of malaria is caused by P. falciparum. The disease was once controlled by effective medicines, but it is re-emerging due to the increasing resistance of the parasites to available drugs. The development of new drugs and therapies relies on a better understanding of the parasite biology. The availability of genome sequences of human malaria parasites and other closely related species has enabled the study of genome evolution [1-6]. Previously, we investigated the distribution of core genome components in six completed Plasmodium genomes [7], which represent the minimum and common requirement to sustain a life cycle encompassing a vertebrate host and a mosquito vector. These six sibling species, however, have unique host specificities and epidemiological profiles. P. falciparum and P. vivax mainly infect humans; the former is most prevalent in sub-Saharan Africa, while the latter is the most widely distributed, commonly found in Latin America, United States, and some areas of Africa. P. knowlesi serves as a model organism for primate malaria, as its natural hosts are long-tailed macaques, but it can infect humans as well; it is prevalent in southeast Asia. P. yoelii yoelii, P. berghei, and P. chabaudi infect rodents and serve as rodent models for studying parasite infection under laboratory conditions. In this study, we further explored the protein families that have been expanded in specific lineage(s). These strain/species-specific proteins may be associated with pathogenesis, virulence, and other adaptive traits related to their ecological niches.
2 Data and Methods
2.1 Cluster of gene families and functional classification analysis
The complete genomes of six Plasmodium species were downloaded from PlasmoDB, the all-in-one portal of Plasmodium genome resources (http://www.plasmodb.org) [8]. The nucleotide, protein, annotation, and expression data were also downloaded. OrthoMCL, a Markov cluster algorithm, was used to group genes into clusters [9], which include the orthologous and paralogous genes from different genomes. Multiple alignments of each cluster were derived with ClustalX and T-Coffee, followed by manual editing. Phylogenetic trees were inferred by the neighbor-joining, maximum likelihood, and maximum parsimony methods, using MEGA5 (http://www.megasoftware.net/).
2.2 Protein-protein association analysis
The protein-protein associations for P. falciparum were downloaded from the STRING database [10]. A confidence score (S), ranging from 0.15 to 0.999, was assigned to each association based on evidence from sequence similarity, pathway assignment according to the KEGG and PlasmoCyc metabolic pathway databases [11], chromosome synteny and genome neighborhood analysis, phylogenetic inference, and literature analysis.
3 Results and Discussion
The OrthoMCL analysis identified abundant duplicate genes in Plasmodium. Approximately 5-9% of the whole genomes consist of genes that are expanded in one or several lineage(s). These protein families showed two distinct lineage-specific expansion (LSE) patterns: (1) lineage-unique LSE, which includes protein families that are present in only one genome, without orthologs in any of the other five genomes; and (2) typical LSE, which includes protein families that are expanded in more than one genome. As shown in Fig. 1, two rodent parasites, P. berghei and P. chabaudi, possess the most abundant LSE protein families in both categories; this is likely because these two genomes contain more open reading frames (ORFs) and proteins.
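The two LSE patterns can be illustrated with a small classification sketch. It is not the procedure used in this study: we assume that each family is given as a list of "taxon|gene" identifiers, and that a family counts as expanded in a genome when it has at least `min_copies` members there (a hypothetical threshold). The taxon codes and gene names are made up.

```python
from collections import Counter


def classify_family(members, min_copies=2):
    """Classify one protein family by its per-genome copy numbers."""
    # Count family members per genome (the part before the "|").
    counts = Counter(member.split("|", 1)[0] for member in members)
    if len(counts) == 1:
        # Present in a single genome only, with no orthologs elsewhere.
        n_copies = next(iter(counts.values()))
        return "lineage-unique LSE" if n_copies >= min_copies else "other"
    expanded = [taxon for taxon, n in counts.items() if n >= min_copies]
    if len(expanded) > 1:
        return "typical LSE"  # expanded in more than one genome
    return "other"


# Hypothetical families.
families = {
    "fam1": ["pfal|a", "pfal|b", "pfal|c"],
    "fam2": ["pber|a", "pber|b", "pcha|a", "pcha|b", "pviv|a"],
    "fam3": ["pfal|x", "pviv|x", "pkno|x"],
}
labels = {name: classify_family(members) for name, members in families.items()}
print(labels)
```

Counting the labels per genome would give the kind of distribution summarized in Fig. 1.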
Fig. 1. Distribution of LSE protein families in six Plasmodium species.
Fig. 2. Protein-protein associations with ring-infected erythrocyte surface antigen (RESA) PFA0110w in P. falciparum.
Very little is known about these LSE protein families, as over 60% of the ORFs in P. falciparum, the best-studied malaria genome, were predicted as hypothetical proteins with unknown functions. Nevertheless, several LSE proteins may be associated with pathogenesis and virulence. Strain-specific surface antigen families are present in each species: rifin and erythrocyte membrane protein (EMP) are the two largest protein families found in P. falciparum, and they are implicated in antigenic variation, cell adhesion, and invasion; P. vivax possesses the Vir protein family of variant antigens, while the SICAvar-like antigen, a simian-specific surface antigen, is present in P. knowlesi. Other potentially important protein families that are expanded in specific lineages include kinases, heat shock proteins, and various metabolic enzymes. Protein-protein interaction analysis showed that these protein families with LSEs are involved in versatile cellular activities. As shown in Fig. 2, PFA0110w is a putative protein in the ring-infected erythrocyte surface antigen (RESA) protein family. It was predicted to be associated with merozoite surface protein 2 (PfMSP2 or MSA2) and merozoite surface protein 9 (PfMSP9 or ABRA), both of which may be involved in merozoite invasion of the host red blood cell; with an actin (PFL2215w) and a skeleton-binding protein (PfSBP1); with two proteases (a proteasome subunit β1 (PFE0915c), important for protein turnover, and falcilysin, critical for globin digestion); and with several hypothetical proteins. A better understanding of the origin, divergence, function, and network of these protein families with lineage specific expansion will offer new insights into the mechanisms of parasite adaptation and evolution.
Acknowledgments. This work is supported by NIH grants AI067543, GM081068 and AI080579 to YW, and the PSC-CUNY Research Award PSCREG-39-497 to JG.
References
1. Carlton, J.: The Plasmodium vivax genome sequencing project. Trends Parasitol 19, 227-231 (2003)
2. Carlton, J., Silva, J., Hall, N.: The genome of model malaria parasites, and comparative genomics. Curr Issues Mol Biol 7, 23-37 (2005)
3. Carlton, J.M., Adams, J.H., Silva, J.C., et al.: Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature 455, 757-763 (2008)
4. Carlton, J.M., Angiuoli, S.V., Suh, B.B., et al.: Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 419, 512-519 (2002)
5. Pain, A., Bohme, U., Berry, A.E., et al.: The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature 455, 799-803 (2008)
6. Gardner, M.J., Hall, N., Fung, E., et al.: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498-511 (2002)
7. Cai, H., Gu, J., Wang, Y.: Core genome components and lineage specific expansions in malaria parasites Plasmodium. BMC Genomics 11 Suppl 3, S13 (2010)
8. Aurrecoechea, C., Brestelli, J., Brunk, B.P., et al.: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res 37, D539-543 (2009)
9. Li, L., Stoeckert, C.J., Jr., Roos, D.S.: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178-2189 (2003)
10. Szklarczyk, D., Franceschini, A., Kuhn, M., et al.: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39, D561-568 (2010)
11. Yeh, I., Hanekamp, T., Tsoka, S., et al.: Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res 14, 917-924 (2004)
A Mean Shift Clustering Based Algorithm for Multiple Alignment of LC-MS Data
Minh Nguyen and Jean X. Gao
Department of Computer Science and Engineering, The University of Texas at Arlington, TX, USA
Abstract. Alignment of multiple liquid chromatography - mass spectrometry (LC-MS) maps is a crucial step in preprocessing of LC-MS data due to its unavoidable variations in retention time (rt) dimension of technical repeats. In this paper, we propose a novel algorithm for aligning multiple LC-MS feature maps simultaneously without choosing any specific map as a reference. Features are first matched across all maps using Gaussian blurring mean shift clustering, and then nonlinear rt shifts in each map are corrected independently by applying locally weighted scatterplot smoothing (LOESS). The bandwidth of Gaussian kernel in the clustering algorithm and the span used in the LOESS are selected automatically. Experimental results on real datasets show that the performance of our proposed method is comparable to or better than that of six alternative approaches commonly used in the research community.
Keywords: liquid chromatography - mass spectrometry, multiple alignment, clustering, mean shift
1 Introduction
Liquid chromatography - mass spectrometry (LC-MS) is a technology for analysis of complex protein mixtures. Due to variations in mass-to-charge (m/z) ratios and rt dimensions, even in technical replications of the same experiment, one needs to align LC-MS maps before carrying out a quantitative analysis [9]. While m/z variations are relatively small and are related to the accuracy of mass spectrometers, the shifts in rt dimension between different LC-MS experiments can be fairly large [4, 7]. LC-MS alignment algorithms can be roughly divided into two categories: profile-based and feature-based approaches [9]. Profile-based approaches usually take raw data as input, while feature-based approaches use features, which are peaks on LC-MS maps extracted by a feature detection step and represented by m/z, rt, and intensity [4], for alignment. For complete coverage of recent alignment methods, please refer to the survey in [9]. In feature-based multiple alignment approaches [7, 8], the authors first find well-behaved feature groups across all LC-MS maps using kernel density estimation based clustering [8] or hierarchical clustering [7] and then correct rt shifts for all features based on these groups. The advantage of these algorithms is that a reference map is not required. However, choosing an appropriate kernel bandwidth for kernel density estimation or a cutoff value for hierarchical clustering is non-intuitive, especially for a new dataset. Another limitation of existing methods, e.g., [6, 8], is that numerous user-defined parameters are required.
These drawbacks motivate us to propose a feature-based algorithm that is capable of: (i) aligning multiple LC-MS maps simultaneously; (ii) matching corresponding features across all maps using Gaussian blurring mean shift clustering, which has proven its superiority in clustering applications [2]; (iii) using data-driven kernel bandwidth selection, which accordingly adapts to data density, for Gaussian kernels in the clustering algorithm; and (iv) requiring few parameters with clear physical meaning which can be chosen from observations of LC-MS maps.
2 Methods
The proposed algorithm for multiple alignment of LC-MS feature maps is primarily based on two phases: (1) grouping features whose m/z and rt values are close to each other into clusters, using
Gaussian blurring mean shift clustering [2]. These feature groups, so-called consensus features [5], are highly likely to be associated with the same peptides across all maps and can be used as references for rt alignment; (2) correcting rt shifts for each map based on the reference features by locally weighted regression (LOESS) [3]. These two phases can optionally be repeated several times to detect more likely candidate groups for increasingly accurate alignment. The proposed algorithm is summarized as follows:

Map combination. Features from all LC-MS maps are combined and sorted with respect to m/z values.

Bandwidth estimation. A suitable bandwidth for the Gaussian kernel in the Gaussian blurring mean shift clustering algorithm [2] is estimated based on the distribution of features along the rt dimension, using the solution of the k-stage direct plug-in bandwidth selector proposed in [1], which uses a fixed-point algorithm and the discrete cosine transform:

$$\delta = \xi\,\gamma^{[k]}(\delta), \tag{1}$$

where $\gamma^{[k]}(\delta) = \underbrace{\gamma_1(\cdots\gamma_{k-1}(\gamma_k(\delta))\cdots)}_{k\ \text{times}}$, $k \ge 1$, and $\xi \approx 0.90$.

Binning. For computational efficiency, LC-MS maps are divided into m/z bins whose width is selected on the basis of mass accuracy. This step aims to group features with close m/z values into the same bin.

Feature matching. For matching features in each m/z bin, we use the fast Gaussian blurring mean shift algorithm proposed in [2]. After each iteration, a data point $x_m$ in the dataset $X = \{x_1, \ldots, x_N\}$ moves to a new data point $y_m$, and thus the new dataset is a blurred version of X. Data points quickly move towards their local modes and collapse into clusters after the first few iterations. The stopping criterion proposed in [2] terminates the algorithm at this phase and yields the clustering results. In Algorithm 1, the mean shift iteration is expressed in posterior probability form.
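Algorithm 1 Gaussian blurring mean-shift (GBMS) algorithm

repeat
    for $m \in \{1, \ldots, N\}$ do
        $\forall n:\ p(n \mid x_m) \leftarrow \dfrac{\exp\left(-\frac{1}{2}\lVert (x_m - x_n)/\sigma \rVert^2\right)}{\sum_{n'=1}^{N} \exp\left(-\frac{1}{2}\lVert (x_m - x_{n'})/\sigma \rVert^2\right)}$    (2)
        $y_m \leftarrow \sum_{n=1}^{N} p(n \mid x_m)\, x_n$    (3)
    end for
    $\forall m:\ x_m \leftarrow y_m$
until stop
connected-components($\{x_n\}_{n=1}^{N}$, min_diff)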
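Algorithm 1 can be sketched in plain Python as follows. This is an illustrative implementation under simplifying assumptions, not the fast algorithm of [2]: the loop stops when the largest point displacement falls below a tolerance `tol` (a stand-in for the stopping criterion of [2]), and `max_iter`, `tol` and the toy rt values are our own choices.

```python
import math


def gbms_step(points, sigma):
    """One blurring iteration: every point moves to its Gaussian-weighted mean."""
    new_points = []
    for xm in points:
        # Unnormalized p(n | x_m) for all n (numerator of Eq. 2).
        weights = [
            math.exp(-0.5 * sum(((a - b) / sigma) ** 2 for a, b in zip(xm, xn)))
            for xn in points
        ]
        total = sum(weights)
        # Eq. 3: y_m is the posterior-weighted average of all points.
        new_points.append(tuple(
            sum(w * xn[d] for w, xn in zip(weights, points)) / total
            for d in range(len(xm))
        ))
    return new_points


def connected_components(points, min_diff):
    """Give the same label to points linked by distances below min_diff."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) < min_diff:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def gbms_cluster(points, sigma, min_diff=1e-3, max_iter=100, tol=1e-6):
    """Run GBMS until points stop moving (simplified criterion), then label."""
    points = [tuple(p) for p in points]
    for _ in range(max_iter):
        updated = gbms_step(points, sigma)
        shift = max(math.dist(p, q) for p, q in zip(points, updated))
        points = updated
        if shift < tol:
            break
    return connected_components(points, min_diff)


# Toy retention times (one dimension) from one m/z bin.
rt_values = [(10.0,), (10.2,), (9.9,), (50.0,), (50.3,)]
labels = gbms_cluster(rt_values, sigma=1.0)
print(labels)
```

Because a Gaussian kernel has infinite support, clusters that are only a few bandwidths apart keep drifting towards each other, which is why the choice of stopping rule matters in practice.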
To prevent potential groups from being separated by bin boundaries, we use successive bins that overlap by half, as in [8]. A postprocessing step is thus required to filter out features in overlapping areas that appear in different clusters. The postprocessing step deals with: (1) the number of samples not contributing features to each cluster and (2) features from the same sample appearing in the same cluster. Since we focus on multiple alignment and the resulting features are used as references for rt correction, well-behaved groups have to contain features from at least some fraction of the total number of samples. In addition, well-behaved groups should contain at most one feature from each sample. When a cluster contains several features from the same sample, only the feature with the highest intensity is kept.

Retention time correction. For each cluster, the median rt is computed, together with the deviation from this median of each map's feature in the cluster. In general, well-behaved feature groups are evenly distributed over substantial parts of the retention time dimension [8]. The features of each map that appear in different well-behaved groups can be used as references for correcting the rt shifts of all features of that map. We apply LOESS to the pairs of rt and rt deviation in each map separately, and the fitted curve is then used to correct the rt variations of all features in the map.
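The retention time correction step can be sketched as below. This is a minimal local linear regression with tricube weights, offered as an illustration rather than the LOESS implementation used in the experiments: the sign convention (deviation = rt minus cluster median, corrected rt = rt minus fitted deviation), the default span, the slight bandwidth inflation and all names are assumptions.

```python
import math
import statistics


def cluster_deviations(cluster_rts):
    """Map id -> deviation of that map's feature rt from the cluster median."""
    median_rt = statistics.median(cluster_rts.values())
    return {map_id: rt - median_rt for map_id, rt in cluster_rts.items()}


def loess_fit(x, y, x_eval, span=0.5):
    """Local linear fit with tricube weights, evaluated at each x in x_eval."""
    n = len(x)
    k = min(n, max(2, math.ceil(span * n)))  # neighbours used per local fit
    fitted = []
    for x0 in x_eval:
        dists = sorted(abs(xi - x0) for xi in x)
        # Bandwidth: distance to the k-th neighbour, slightly inflated so that
        # this neighbour keeps a small positive weight.
        h = dists[k - 1] * 1.001 + 1e-12
        w = [(1.0 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
        mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


def correct_map(ref_rts, ref_deviations, all_rts, span=0.5):
    """Subtract the fitted deviation from every feature rt of one map."""
    estimated = loess_fit(ref_rts, ref_deviations, all_rts, span)
    return [rt - dev for rt, dev in zip(all_rts, estimated)]


# Toy data: one cluster, then one map whose deviation grows linearly with rt.
deviations = cluster_deviations({"A": 100.0, "B": 102.0, "C": 101.0})
ref_rts = [float(r) for r in range(0, 100, 10)]
ref_devs = [0.02 * r + 1.0 for r in ref_rts]
corrected = correct_map(ref_rts, ref_devs, [25.0, 55.0])
print(deviations, corrected)
```

The span would normally be chosen by cross validation, which this sketch does not attempt.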
3 Results and Discussion
To evaluate the performance of the proposed method, we conducted experiments on two metabolic datasets, M1 and M2, from [5], which consist of 44 and 24 LC-MS feature maps, respectively. The alignment ground truth is composed of consensus features, which are feature groups with high confidence that are reproducible over at least four samples and exhibit small deviations in retention time across samples. For more detailed information on the datasets and ground truth, please refer to [5]. Our algorithm needs only two user-defined parameters: (1) the m/z bin width, which is related to the accuracy of the mass spectrometer, and (2) the number of mesh points along the rt dimension used to estimate the kernel bandwidth, which can be selected based on observations of feature densities on LC-MS maps. In the experiments, an m/z bin width of 0.1 Da was used for the M1 dataset and 0.04 Da for the M2 dataset. The number of mesh points was set to 28 for both datasets. To capture variations along the m/z dimension, we split the LC-MS maps into 4 m/z segments and estimated the bandwidth for each segment separately. The experiments show that there is not much loss in alignment performance compared with using more segments. To choose a proper span for the LOESS, we performed 5-fold cross validation. Figure 1 illustrates the rt alignment curves fitted by LOESS for samples 2 and 36 from dataset M1. The resulting consensus features were used for performance evaluation. We employed the alignment measures proposed in [5], precision and recall, to evaluate the performance of our method. In addition, we computed the F-score to assess how well the alignment methods trade off precision against recall.
\[ \mathrm{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap confeat_i|}{|confeat_i|}, \tag{4} \]

\[ \mathrm{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap confeat_i|}{|M_i| \cdot |gt_i|}, \tag{5} \]

\[ \text{F-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{6} \]

where $|gt_i \cap confeat_i|$ is the number of features in the $i$-th consensus feature of the ground truth detected by the algorithm; $|confeat_i|$ is the total number of features in all consensus features detected by the algorithm corresponding to the query on the $i$-th consensus feature of the ground truth; $|gt_i|$ is the number of features in the $i$-th consensus feature of the ground truth; and $|M_i|$ is the number of consensus features into which the algorithm splits the $i$-th consensus feature of the ground truth.
The experimental results on the M1 and M2 datasets are given in Table 1. We compared our method with the five alignment methods evaluated in [5] (msInspect, MZmine, OpenMS, XAlign, and XCMS with rt correction) as well as with the recently developed RANSAC aligner in the software package MZmine 2 [6]. For the M1 dataset, the recall of our method is comparable to that of XCMS, which is the best one, while the precision of our method and of MZmine 2 (RANSAC aligner) is the highest. For the M2 dataset, the proposed method obtains the best recall and precision values. With respect to the F-score, which combines recall and precision, our algorithm also achieves the best result. The performance of MZmine 2 (RANSAC aligner) is comparable to that of our method. However, the RANSAC aligner requires four user-defined parameters for two 2-D (m/z and rt) windows, the RANSAC window and the alignment window, while our method needs only two parameters related to m/z and rt. The proposed method outperforms the alignment algorithm in the XCMS package, which uses kernel density estimation based clustering with a fixed kernel bandwidth, on both datasets with regard to the F-score.
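To make the three measures concrete, the following sketch computes them for feature groups represented as Python sets of feature identifiers. It is only one reading of the definitions above: the function name `alignment_scores`, the set-based representation, and the assumption that detected consensus features are disjoint are ours and not part of the evaluation protocol of [5].

```python
def alignment_scores(ground_truth, detected):
    """Precision, recall and F-score of an alignment.

    ground_truth: list of sets, one per ground-truth consensus feature (gt_i).
    detected:     list of disjoint sets, the consensus features produced by
                  the algorithm (hypothetical representation).
    """
    n = len(ground_truth)
    precision = recall = 0.0
    for gt in ground_truth:
        # Detected groups that contain at least one feature of gt_i;
        # their count plays the role of |M_i|.
        hits = [group for group in detected if gt & group]
        if not hits:
            continue  # nothing recovered: contributes 0 to both sums
        overlap = sum(len(gt & group) for group in hits)   # |gt_i ∩ confeat_i|
        total = sum(len(group) for group in hits)          # |confeat_i|
        precision += overlap / total
        recall += overlap / (len(hits) * len(gt))
    precision /= n
    recall /= n
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom > 0.0 else 0.0
    return precision, recall, f_score
```

Note that recall is penalized when the algorithm splits one ground-truth consensus feature into several groups, even if every feature is recovered.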
[Figure 1: retention time deviation (y-axis) versus retention time (x-axis) with the fitted alignment curve, in two panels: (a) Sample 2 (LC-MS map 2) and (b) Sample 36 (LC-MS map 36).]
Fig. 1. Retention time alignment curves of two samples from M1 dataset.
Table 1. Comparison of alignment performance of the proposed method with six alternative approaches
Data  Measure    msInspect  MZmine  OpenMS  XAlign  XCMS  MZmine2  Proposed
M1    Recall     0.27       0.89    0.87    0.88    0.94  0.91     0.93
M1    Precision  0.46       0.74    0.69    0.70    0.70  0.74     0.74
M1    F-score    0.34       0.81    0.77    0.78    0.80  0.82     0.82
M2    Recall     0.23       0.98    0.93    0.93    0.98  0.98     0.99
M2    Precision  0.47       0.84    0.79    0.79    0.78  0.83     0.84
M2    F-score    0.31       0.90    0.85    0.85    0.87  0.90     0.91
References
1. Botev, Z., Grotowski, J., Kroese, D.: Kernel density estimation via diffusion. The Annals of Statistics 38(5), 2916–2957 (2010)
2. Carreira-Perpiñán, M.: Fast nonparametric clustering with Gaussian blurring mean-shift. pp. 153–160. ACM (2006)
3. Cleveland, W.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, pp. 829–836 (1979)
4. Lange, E., Gröpl, C., Schulz-Trieglaff, O., Leinenbach, A., Huber, C., Reinert, K.: A geometric approach for the alignment of liquid chromatography - mass spectrometry data. Bioinformatics 23(13), I273–I281 (2007)
5. Lange, E., Tautenhahn, R., Neumann, S., Gröpl, C.: Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics 9(1), 375 (2008)
6. Pluskal, T., Castillo, S., Villar-Briones, A., Oresic, M.: MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, – (2010)
7. Podwojski, K., Fritsch, A., Chamrad, D.C., Paul, W., Sitek, B., Stuhler, K., Mutzel, P., Stephan, C., Meyer, H.E., Urfer, W., Ickstadt, K., Rahnenfuehrer, J.: Retention time alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics 25(6), 758–764 (2009)
8. Smith, C.A., Want, E.J., O’Maille, G., Abagyan, R., Siuzdak, G.: XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry 78(3), 779–787 (2006)
9. Vandenbogaert, M., Li-Thiao-Te, S., Kaltenbach, H.M., Zhang, R.X., Aittokallio, T., Schwikowski, B.: Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 8(4), 650–672 (2008)
A new algorithm for the molecular distance geometry problem with inaccurate distance data
Michael Souza1, Carlile Lavor2, Albert Muritiba1, and Nelson Maculan3
1 Federal University of Ceará, Ceará, Brazil, {michael,einstein}@ufc.br
2 State University of Campinas (IMECC-UNICAMP), Campinas, Brazil, [email protected]
3 Federal University of Rio de Janeiro (COPPE-UFRJ), Rio de Janeiro, Brazil, [email protected]
Abstract. We present a new algorithm for the molecular distance geometry problem with inaccurate and sparse data, based on the solution of linear systems, a heuristic to find big cliques, and the minimization of a nonlinear least-squares function. Computational results are presented in order to validate our approach.
Keywords: molecular geometry, heuristic, nonlinear least-squares, cliques, linear systems
1 Introduction
The molecular distance geometry problem (MDGP) can be defined as the problem of finding Cartesian coordinates $x_1, \dots, x_n \in \mathbb{R}^3$ of the atoms of a molecule such that $l_{ij} \le \|x_i - x_j\| \le u_{ij}$, $\forall (i,j) \in E$, where the bounds $l_{ij}$ and $u_{ij}$ for the Euclidean distances of pairs of atoms $(i,j) \in E$ are given a priori [1]. An overview of methods applied to the MDGP is given in [2].
An $n \times n$ symmetric matrix $D = (d_{ij})$ with nonnegative elements and a zero diagonal is said to be a Euclidean distance matrix (EDM) if there exist points $x_1, \dots, x_n \in \mathbb{R}^k$ such that $d_{ij} = \|x_i - x_j\|$, $i, j = 1, 2, \dots, n$. The smallest value of $k$ is called the embedding dimension of $D$. Assuming that $D = (d_{ij})$ is an EDM with embedding dimension $k = 3$ and singular value decomposition $U \Sigma U^t = D$, then $x = U \Sigma^{1/2}$ is a solution of the exact MDGP defined by $l_{ij} = u_{ij} = d_{ij}$ [1].
If just some exact distances are known, we can use an iterative algorithm called geometric buildup [3]. First, this algorithm initializes a set $\mathcal{B}$ (base) with four points (indices) whose pairwise distances are all known. Then, the coordinates of the points in $\mathcal{B}$ are set using the singular value decomposition of the EDM $D$ restricted to the base $\mathcal{B}$, and the remaining unset coordinates are calculated by solving the linear system

\[ \langle x_i, x_j \rangle = \frac{d_{i,1}^2 - d_{i,j}^2 + d_{j,1}^2}{2}, \quad i \in \mathcal{B} = \{i_1, i_2, i_3, i_4\}, \tag{1} \]
where $d_{ij} = \|x_j - x_i\|$. The indices $i_1, i_2, i_3, i_4$ can be chosen arbitrarily, allowing us to choose another base subset when calculating the coordinate $x_j$. However, when only inaccurate data ($l_{ij} < u_{ij}$) are available, neither the singular value decomposition nor the buildup algorithm can be applied directly, because both are designed to deal with exact distances. Our contribution is to extend the buildup algorithm to handle inaccurate distance data, based on simple ideas: generate an approximate distance matrix $D$, take as base a clique in the graph that has $D$ as its connectivity matrix, solve the system (1), and refine the solution using a nonlinear least-squares method.
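For exact distances, a single buildup step based on system (1) can be sketched as below. The sketch assumes that the first base point is used as the origin (so that $\|x_i\| = d_{i,1}$ and $\|x_j\| = d_{j,1}$), that the four base points are not coplanar, and that plain Gaussian elimination is adequate; the helper names `solve3` and `place_point` are hypothetical and are not taken from [3].

```python
import math


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]   # fails if the base points are coplanar
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - s) / m[r][r]
    return x


def place_point(base, dists):
    """Coordinates of a new point from its exact distances to four base points.

    base:  four 3-D points with known coordinates; base[0] acts as point 1.
    dists: distances from the new point j to the base points, in the same order.
    """
    origin = base[0]
    rows, rhs = [], []
    for p, d_ij in zip(base[1:], dists[1:]):
        xi = [p[k] - origin[k] for k in range(3)]      # x_i relative to point 1
        d_i1_sq = sum(v * v for v in xi)               # d_{i,1}^2
        rows.append(xi)
        rhs.append((d_i1_sq - d_ij ** 2 + dists[0] ** 2) / 2.0)
    xj = solve3(rows, rhs)
    return [xj[k] + origin[k] for k in range(3)]
```

With inexact distances the same system is solved with approximate entries, which is why a refinement step is needed afterwards.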
2 The new method
The set $E$ of pairs $(i, j)$ and the set of indices $V = \{1, 2, \dots, n\}$ can be considered as the set of edges and the set of vertices of a graph $G = (V, E)$, respectively. One may decide to use as base the biggest complete subgraph of $G$. However, the problem of finding the biggest complete subgraph is NP-hard. Hence, we decided to use a simple heuristic that just looks for big complete subgraphs. Once we have obtained the base $\mathcal{B}$ associated with a complete subgraph, we need to set its coordinates. In order to generate an approximate EDM restricted to the points in the base, we define a matrix $D(t) = [d(t_{ij})]$, where

\[ d_{ij} = d(t_{ij}) = (1 - t_{ij})\, l_{ij} + t_{ij}\, u_{ij} \tag{2} \]

for some $t_{ij} \in [0, 1]$. With this choice, we have $l_{ij} \le d_{ij} \le u_{ij}$, but $D$ may not be an EDM with the appropriate embedding dimension ($k = 3$). This may happen because the entries $d_{ij}$ violate the triangle inequality $d_{ij} \le d_{ik} + d_{jk}$ for some indices $i, j, k$, or because the rank of $D$ is greater than 3. With this in mind, instead of using the solution given by the singular value decomposition directly, we take the columns (eigenvectors) of $U$ associated with the 3 largest eigenvalues, obtaining the best rank-3 approximation of the solution to $x x^t = D(t)$ [4]. We should not expect great precision in $x$, because the matrix $D(t)$ is just an approximation. We therefore refine it by minimizing the nonlinear function
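\[ \min_x \phi_{\lambda,\tau}(x) = \sum_{(i,j) \in E :\, i,j \in \mathcal{B}} \phi^{ij}_{\lambda,\tau}(x, l, u), \tag{3} \]

where

\[ \phi^{ij}_{\lambda,\tau}(x, l, u) = \lambda (l_{ij} - u_{ij}) + \theta^{ij}_{\lambda,\tau}(x, l) + \theta^{ij}_{\lambda,\tau}(x, u), \tag{4} \]

\[ \theta^{ij}_{\lambda,\tau}(x, c) = \sqrt{\lambda^2 \left( c_{ij} - \sqrt{\|x_i - x_j\|^2 + \tau^2} \right)^2 + \tau^2}, \tag{5} \]

with $\lambda > 0$, $\tau > 0$.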
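The behaviour of the penalty is easy to check numerically. The sketch below evaluates equations (4) and (5) for a single pair of points; the function names `theta` and `phi_pair` are ours, and the default values λ = 1.0 and τ = 0.01 are the ones reported in the numerical experiments. For small τ the pair term is close to zero when the distance lies inside [l, u] and grows roughly like 2λ times the violation outside it.

```python
import math


def theta(dist, c, lam=1.0, tau=0.01):
    """Smoothed version of lam * |c - dist|, following eq. (5)."""
    smooth_dist = math.sqrt(dist ** 2 + tau ** 2)
    return math.sqrt(lam ** 2 * (c - smooth_dist) ** 2 + tau ** 2)


def phi_pair(xi, xj, lower, upper, lam=1.0, tau=0.01):
    """Penalty of one constraint lower <= ||xi - xj|| <= upper, following eq. (4)."""
    dist = math.dist(xi, xj)
    return (lam * (lower - upper)
            + theta(dist, lower, lam, tau)
            + theta(dist, upper, lam, tau))


# Distance 1.5 satisfies the bounds [1, 2]; distance 3.0 violates the upper bound.
inside = phi_pair((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), 1.0, 2.0)
outside = phi_pair((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), 1.0, 2.0)
```

Summing `phi_pair` over the constrained pairs in the base gives the objective of eq. (3), which can then be passed to any smooth unconstrained optimizer.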
The function $\phi_{\lambda,\tau}$ is infinitely differentiable with respect to $x$, and therefore allows the application of classical optimization methods. It is a variation of the hyperbolic penalty technique used in [5,6]. Once we have refined the coordinates of the points in the base $\mathcal{B}$, we start to set the remaining (free) points. We begin with the points that have at least four constraints with the points in the base. In order to set the coordinate $x_j$, we use all constraints involving the index $j$ and the indices in the base. For example, to set the coordinate $x_j$, we use the approximate distance matrix $D(t)$ for some $t \in [0, 1]^{|E|}$, solve the linear system

\[ \langle x_i, x_j \rangle = \frac{d_{i,1}^2 - d_{i,j}^2 + d_{j,1}^2}{2}, \quad i \in \mathcal{B}, \tag{6} \]

and then refine the solution by minimizing the function $\phi_{\lambda,\tau}(x)$ restricted to the index $j$ and to the indices in the base (see eq. (3)). Each newly calculated coordinate is included in the base. In the end, some points may not be fixed because they have fewer than four constraints involving the points in the base. In this case, we just position these points by solving an underdetermined system defined by their constraints with points in the base.
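The order in which free points enter the base can be illustrated with a short sketch. It only tracks which points become placeable (at least four constraints with the current base) and which are left over; the coordinate computation and refinement are omitted. The function name `buildup_order` and the simple repeated sweep over the vertices are assumptions of this example, not a description of the authors' Matlab code.

```python
def buildup_order(n, edges, base):
    """Order in which free points can be added to the base.

    n:     number of points, indexed 0..n-1.
    edges: iterable of pairs (i, j) with a distance constraint.
    base:  indices of the initial base (a clique found beforehand).
    Returns (order, leftover): points placed in sequence, and points that
    never reach four constraints with the base.
    """
    neighbours = {v: set() for v in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    placed = set(base)
    order = []
    progress = True
    while progress:
        progress = False
        for v in range(n):
            if v not in placed and len(neighbours[v] & placed) >= 4:
                placed.add(v)          # v would be solved via eq. (6) and refined here
                order.append(v)
                progress = True
    leftover = [v for v in range(n) if v not in placed]
    return order, leftover
```

Because every placed point joins the base, a point that is initially unreachable may become placeable in a later sweep.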
3 Numerical experiments
We implemented our algorithm in Matlab and tested it on a set of model problems on an Intel Core 2 Quad Q9550 CPU at 2.83 GHz with 4 GB of RAM, running 32-bit Linux. The distance data were derived from real structural data from the Protein Data Bank (PDB). For each protein, only a subset of the distances was considered. Formally, we kept only the distances lower than $R = 6$ Å. The bounds were given by the equations

\[ l_{ij} = d^*_{ij} \max(0, 1 - |\bar{\epsilon}_{ij}|), \quad u_{ij} = d^*_{ij} (1 + |\epsilon_{ij}|), \tag{7} \]

where $d^*_{ij}$ is the true distance between atom $i$ and atom $j$, and $\bar{\epsilon}_{ij}, \epsilon_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$ (normal distribution). These instances were proposed by Biswas et al. in [4]. We used the function

\[ \mathrm{LDME} = \left( \frac{1}{|E|} \sum_{(i,j) \in E} \left( \max\{ l_{ij} - \|x_i - x_j\|,\ \|x_i - x_j\| - u_{ij},\ 0 \} \right)^2 \right)^{1/2} \tag{8} \]

to measure the precision of the solution with respect to the constraints alone, without using any information about the original structure $x^*$. We also measured the deviation between the solutions generated by our algorithm and the original structures in the PDB files, using the function

\[ \mathrm{RMSD} = \frac{1}{\sqrt{n}} \min \{ \|x^* - Q(x - h)\|_F : h \in \mathbb{R}^{n \times 3} \text{ and } Q \in \mathbb{R}^{3 \times 3}, \text{ orthogonal} \}. \tag{9} \]
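As an illustration of equations (7) and (8), the following sketch generates noisy bounds for one true distance and evaluates the LDME of a candidate structure. The names `make_bounds` and `ldme`, the dictionary layout for the bounds, and the use of Python's `random.gauss` (with σ interpreted as the standard deviation) are assumptions of this example; the RMSD of eq. (9) is not covered because it requires an orthogonal Procrustes alignment.

```python
import math
import random


def make_bounds(d_true, sigma=0.05, rng=random):
    """Lower and upper bounds for one true distance, following eq. (7)."""
    lower = d_true * max(0.0, 1.0 - abs(rng.gauss(0.0, sigma)))
    upper = d_true * (1.0 + abs(rng.gauss(0.0, sigma)))
    return lower, upper


def ldme(coords, bounds):
    """Root mean squared bound violation over all constraints, following eq. (8).

    coords: list of 3-D points.
    bounds: dict mapping (i, j) to (lower, upper); hypothetical layout.
    """
    total = 0.0
    for (i, j), (lower, upper) in bounds.items():
        d = math.dist(coords[i], coords[j])
        total += max(lower - d, d - upper, 0.0) ** 2
    return math.sqrt(total / len(bounds))
```

A structure that satisfies every bound has LDME equal to zero, regardless of how far it is from the original PDB structure, which is why the RMSD is reported alongside it.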
Table 1. Results for 70% of distances below 6 Å and σij = 0.05.

PDB ID  n     |E|    LDME      RMSD      |B0|  CPU time
1PTQ    402   5025   2.02E-04  3.47E-03  8     5.28
1HOE    558   7103   1.94E-04  3.39E-03  7     7.20
1LFB    641   8104   1.97E-04  4.23E-03  8     8.17
1PHT    811   12351  1.83E-04  9.35E-03  9     17.64
1POA    914   11805  7.60E-02  2.73E-01  8     13.50
1AX8    1003  13002  2.00E-04  3.46E-03  8     15.09
1F39    1534  19804  1.99E-04  7.99E-02  8     30.70
1RGS    2015  26774  1.91E-04  3.13E-02  8     47.32
1KDH    2846  38725  1.88E-04  1.00E-02  8     80.47
1BPM    3671  52548  2.10E-02  1.03E-01  8     108.97
1RHJ    3740  53850  1.88E-04  7.49E-03  8     126.98
1HQQ    3944  54571  1.48E-01  1.42E+00  8     137.98
1TOA    4292  60216  1.92E-04  6.12E-02  8     155.07
1MQQ    5681  84153  1.87E-04  3.03E-03  8     277.09
In all experiments the parameters of the function $\phi_{\lambda,\tau}$ were set to $\lambda = 1.0$ and $\tau = 0.01$. Table 1 shows that our approach is efficient even when the bounds $l_{ij}$ and $u_{ij}$ are not so close ($\sigma_{ij} = 0.05$) and just 70% of the constraints are considered. In all instances the LDME was low and the RMSD was lower than 3.5 Å, which means that the protein structures are very similar [7]. In this table, $n$ is the number of atoms in the instance, $|E|$ is the number of constraints $l_{ij} \le \|x_i - x_j\| \le u_{ij}$, $|\mathcal{B}_0|$ indicates the size of the initial base, and CPU time is given in seconds.
References
1. Crippen, G., Havel, T.: Distance geometry and molecular conformation. Volume 15. Research Studies Press Taunton, Somerset (1988)
2. Liberti, L., Lavor, C., Maculan, N.: Molecular distance geometry methods: from continuous to discrete. International Transactions in Operational Research 18 (2010) 33–51
3. Wu, D., Wu, Z.: An updated geometric build-up algorithm for solving the molecular distance geometry problems with sparse distance data. Journal of Global Optimization 37(4) (2007) 661–673
4. Biswas, P., Toh, K.C., Ye, Y.: A distributed SDP approach for large-scale noisy anchor-free graph realization with applications to molecular conformation. SIAM Journal on Scientific Computing 30(3) (2008) 1251–1277
5. Souza, M., Xavier, A., Lavor, C., Maculan, N.: Hyperbolic smoothing and penalty techniques applied to molecular structure determination. Operations Research Letters 39 (2011) 461–465
6. Xavier, A.E.: Hyperbolic penalty: A new method for nonlinear programming with inequalities. International Transactions in Operational Research 8 (2001) 659–671
7. Schlick, T.: Molecular modeling and simulation: an interdisciplinary guide. Volume 21. Springer Verlag (2010)
Identification of highly synchronized regulatory subnetwork with gene expression and interaction dynamics
Shouguo Gao, Xujing Wang
Department of Physics & The Comprehensive Diabetes Center, University of Alabama at Birmingham, Birmingham, AL, 35294
Abstract. There has been growing interest in combining PPI (protein-protein interaction) data with gene expression data. However, the interaction dynamics of biological processes have not been sufficiently considered previously. Here we propose a topological phase locking (TopoPL) based scoring method with a simulated annealing search for identifying differentially expressed PPI subnetworks from time series data. First, the phase locking index is used to represent the interaction strength under a given biological process. Next, we perform a simulated annealing search to identify the subnetwork with the maximum score in the whole PPI network. Applications to simulated data and to yeast cell cycle data show that the TopoPL method identifies biologically meaningful subnetworks more sensitively than static topological and additive scoring methods.
1 Introduction
Although a number of computational methods have been developed to integrate gene expression and protein interaction [1], most ignore the dynamics of interaction and do not fully utilize network topology. We regard active subnetworks as those made up of deregulated genes with high expression synchronization between them. Specifically, we approximate protein activity by gene expression significance and all their synchronized interactions. Subnetworks of genes with high significance and interactions with high phase locking indexes are regarded as synchronized regulated subnetworks [2].
To represent interaction dynamics, Guo et al. proposed a method to identify condition-responsive subnetworks from a PPI network, where only protein-protein interactions with high coexpression between the corresponding genes are considered [3]. This assumption is reasonable, as many studies have found that not all protein interactions occur in a specific tissue and at a specific time [3, 4]. Zhiping Liu also used expression correlation to represent interaction dynamics, assessing the statistical significance of the differential expression of two nodes and of their correlation [1]. However, it has been shown that correlation metrics have limitations when applied to time course data [5, 6]. Not utilizing the inter-time-point dependence not only loses sensitivity in detecting interactions but could also lead to erroneous predictions. Phase locking captures the dynamic interaction structure. When compared with simple correlation, we found that the phase locking metric identifies interacting gene pairs more efficiently [6]. To capture the dynamic network topological characteristics in representing the activity of a subnetwork, we integrate phase locking analysis with the Pathway Connectivity Index (PCI) that we previously developed. PCI utilizes information on all genes and network topological properties [7].
Using both simulated and real data, we demonstrate the performance of the TopoPL based method.
2 Datasets and Method
2.1 Simulation study
The simulation was based on the example expression data gal80R in Cytoscape. We randomly selected n_predefined (40, 60, 80) connected genes as the responsive subnetworks, in which m% (80%, 90%, 100%) of the genes are considered active. The significance values of active genes were assigned with top _ % significance values in gal80R. The phase locking indexes λ of the responsive subnetworks were sampled from 0.8, 0.5 , while the indexes for the remaining edges were sampled from 0.4, 0.3 . The F score is a measure of a test's accuracy. It considers both the precision and the sensitivity of the test to compute the score.