ISBRA 2012 SHORT ABSTRACTS

8TH INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS RESEARCH AND APPLICATIONS

May 21-23, 2012, University of Texas at Dallas, Dallas, TX

http://www.cs.gsu.edu/isbra12/

Symposium Organizers

Steering Committee
, University of California, Davis
Ion Mandoiu, University of Connecticut
Yi Pan, Georgia State University
Marie-France Sagot, INRIA
Alex Zelikovsky, Georgia State University

General Chairs
Ovidiu Daescu, University of Texas at Dallas
Raj Sunderraman, Georgia State University

Program Chairs
Leonidas Bleris, University of Texas at Dallas
Ion Mandoiu, University of Connecticut
Russell Schwartz, Carnegie Mellon University
Jianxin Wang, Central South University

Publicity Chair
Sahar Al Seesi, University of Connecticut

Finance Chairs
Anu Bourgeois, Georgia State University
Raj Sunderraman, Georgia State University

Web Master Piyaphol Phoungphol
Web Design J. Steven Kirtzic

Sponsors

NATIONAL SCIENCE FOUNDATION
DEPARTMENT OF COMPUTER SCIENCE, GEORGIA STATE UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF TEXAS AT DALLAS


Program Committee Members

Srinivas Aluru, Iowa State University
Danny Barash, Ben-Gurion University
Robert Beiko, Dalhousie University
Anne Bergeron, Université du Québec à Montréal
Daniel Berrar, University of Ulster
Paola Bonizzoni, Università degli Studi di Milano-Bicocca
Daniel Brown, University of Waterloo
Doina Caragea, Kansas State University
Tien-Hao Chang, National Cheng Kung University
Chien-Yu Chen, National Taiwan University
Matteo Comin, University of Padova
Bhaskar DasGupta, University of Illinois at Chicago
Jorge Duitama, University of Connecticut
Oliver Eulenstein, Iowa State University
Guillaume Fertin, University of Nantes
Vladimir Filkov, University of California, Davis
Jean Gao, University of Texas at Arlington
Katia Guimaraes, Federal University of Pernambuco
Jiong Guo, Saarland University
Robert Harrison, Georgia State University
Jieyue He, Southeast University
Steffen Heber, North Carolina State University
Allen Holder, Rose-Hulman Institute of Technology
Jinling Huang, East Carolina University
Lars Kaderali, University of Heidelberg
Iyad Kanj, DePaul University
Ming-Yang Kao, Northwestern University
Yury Khudyakov, CDC
Danny Krizanc, Wesleyan University
Jing Li, Case Western Reserve University
Li Min, Georgia State University
Fenglou Mao, University of Georgia
Osamu Maruyama, Kyushu University
Ion Moraru, University of Connecticut Health Center
Axel Mosig, University of Leipzig
Giri Narasimhan, Florida International University
Yi Pan, Georgia State University
, IBM
Bogdan Pasaniuc, Harvard University
Andrei Paun, Louisiana Tech University
Itsik Pe'er, Columbia University
Weiqun Peng, George Washington University
Nadia Pisanti, University of Pisa
Maria Poptsova, University of Connecticut
Teresa Przytycka, NCBI
Sven Rahmann, Technical University Dortmund
Shoba Ranganathan, Macquarie University
S. Cenk Sahinalp, Simon Fraser University
Russell Schwartz, Carnegie Mellon University
Joao Setubal, Virginia Bioinformatics Institute
, Princeton University
Ileana Streinu, Smith College
Wing-Kin Sung, National University of Singapore
Sing-Hoi Sze, Texas A&M University
Ilias Tagkopoulos, University of California, Davis
Marcel Turcotte, University of Ottawa
Gabriel Valiente, Technical University of Catalonia
Stéphane Vialette, Université Paris-Est Marne-la-Vallée
Li-San Wang, University of Pennsylvania
Lusheng Wang, City University of Hong Kong
Xiaowo Wang, Tsinghua University
Fangxiang Wu, University of Saskatchewan
Yufeng Wu, University of Connecticut
Zhen Xie, Massachusetts Institute of Technology
Jinbo Xu, Toyota Technological Institute at Chicago
Zhenyu Xuan, University of Texas at Dallas
Alex Zelikovsky, Georgia State University
Fa Zhang, Chinese Academy of Sciences
Yanqing Zhang, Georgia State University
Leming Zhou, University of Pittsburgh

ISBRA 2012 Short Abstracts

AutoPipe: A Toolbox for Systems Biology Workflow Query Synthesis, Hasan Jamil 1

A new method to predict linear B-cell epitope using support vector machine, Bo Yao, Lin Zhang and Chi Zhang 5

Asymptotic properties of a median tree under the coalescent model, Liang Liu 6

Comparison of RNA-Seq with Microarray Analysis of the Transcriptional Response in HT-29 Colon Cancer Cells to 5-aza-deoxycytidine, Xiao Xu, Erica Antinoiou, W. Richard McCombie, Jennie Williams, Asia Brown, Wei Zhu, Song Wu and Ellen Li 10

CPAM: Effective Composite Regulatory Pattern Miner for Genome Sequences, Dan He 14

Pattern Characterization and Functional Mapping for Biomedical Signal Sets, Anish Nair and Kamran Kiasaleh 18

MapBase: A Virtual Biological ID Map Database, Hasan Jamil 22

Investigations on Elastic Network Models of Coarse-Grained Membrane , Kannan Sankar, Michael T. Zimmermann and Robert L. Jernigan 26

De novo Genome and Transcriptome Sequencing of Social Paper Wasps: Application to Understanding Parasite Manipulation of Host Behavior, Ruolin Liu, Daniel Standage and Amy Toth 30

Genome sequencing, assembly, annotation and comparative analysis of Pseudomonas fluorescens NCIMB 11764 bacterium, Claudia Vilo, Michael Benedik, Daniel Kunz and Qunfeng Dong 31

Statistical Evaluation of Dynamic Brain Cell Calcium Activity, Kinsey Cotton, Mark Decoster, Katie Evans, Richard Idowu, and Mihaela Paun 35

Lineage Specific Expansion of Families in Malaria Parasites, Hong Cai, Jianying Gu and Yufeng Wang 39

A Mean Shift Clustering Based for Multiple Alignment of LC-MS Data, Minh Nguyen and Jean X. Gao 43

A new algorithm for the molecular distance geometry problem with inaccurate distance data, Michael Souza, Carlile Lavor, Albert Muritiba and Nelson Maculan 47

Identification of highly synchronized regulatory subnetwork with expression and interaction dynamics, Shouguo Gao and Xujing Wang 51

MGC: Gene calling in metagenomic sequences, Achraf El Allali and John Rose 55

Structural Motif Discovery: Classification and Benchmarks, Isra Al-Turaiki, Ghada Badr and Hassan Mathkour 59

Enumerating Maximal Frequent Subtrees, Akshay Deepak and David Fernández-Baca 65

Bioinformatics: Desktop Applications to Peta-Scale Architectures with Web-Based Portals, Bhanu Rekepalli, Paul Giblock and Christopher Reardon 69

A Web-based multi-Genome Synteny Viewer for Customized Data, Kashi Revanna, Chi-Chen Chiu, Daniel Munro, Alvin Gao and Qunfeng Dong 70

Subgingival plaque microbiota in patients with type 2 diabetes, Mi Zhou, Ruichen Rong, Daniel Munro, Qi Zhang and Qunfeng Dong 74

Automatic Analysis of Dendritic Territory for Neuronal Images, Santosh Lamichhane and Jie Zhou 78

A Neural Network Approach to Pre-filtering MS/MS spectra, James Cleveland and John Rose 82

Statistical software and business productivity applications: workflows for communication and efficiency, Marie Vendettuoli, Heike Hofmann and David Siev 85

Development of a Detailed Model for the FcRn-mediated IgG Homeostasis, Venkat Pannala, Dilip Kumar Challa, Sally Ward and Leonidas Bleris 89

Querying Evolutionary Relationships in Phylogenetic Databases, Grant Brammer and Tiffani Williams 101

Gene Expression Resources Available from MaizeGDB, Wimalanathan Kokulapalan, Jack Gardiner, Bremen Braun, Ethalinda Cannon, Mary Schaeffer, Lisa Harper, Carson Andorf, Darwin Campbell, Scott Birkett, Taner Sen, Nicholas Provart and Carolyn Lawrence 105

Nocardia spp. Identification Using a Bioinformatics Approach, Dhundy Kiran Bastola, Scott McGrath, Ishwor Thapa and Peter Iwen 106

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads, Serghei Mangul, Adrian Caciula, Nicholas Mancuso, Olga Glebova, Ion Mandoiu and Alex Zelikovsky 110

Distributions of Palindromic Proportional Content in Bacteria, Oliver Bonham-Carter, Lotfollah Najjar, Ishwor Thapa and Dhundy Kiran Bastola 114

GREDSTAT: Genome-wide Restriction Enzyme Digestion STatistical Analysis Tool, Maga Rowicka and Norbert Dojer 118

Scaffolding Large Genomes using Integer Linear Programming, James Lindsay, Hamed Salooti, Alex Zelikovsky and Ion Mandoiu 122

Inference of allele specific expression levels from RNA-Seq data, Sahar Al Seesi and Ion Mandoiu 128

Monitoring of Human body tissues at Molecular Level using FOTI Systems, G.S. Uthayakumar and A. Sivasubramanian 133

Multi-Commodity Flow Methods for Quasispecies Spectrum Reconstruction Given Amplicon Reads, Nicholas Mancuso, Bassam Tork, Pavel Skums, Ion Mandoiu and Alex Zelikovsky 148

Quasispecies frequency reconstruction using multicommodity flows, Pavel Skums, Alexander Artyomenko, Alex Zelikovsky and Yury Khudyakov 153


AutoPipe: A Toolbox for Systems Biology Workflow Query Synthesis*

Hasan M. Jamil

Department of Computer Science Wayne State University, USA [email protected]

Abstract. Prohibitive software implementation costs are a major barrier for biologists in testing potentially insightful hypotheses. An intriguing issue that is still outstanding is whether it is possible for a biologist to conceptually state an arbitrary process and map it to a workflow query over a network of distributed resources. The major hurdle is how to map conceptual constructs into concrete and semantically equivalent computational artifacts. In this paper, we propose a novel model for ad hoc systems biology workflow query synthesis from stored descriptions of hierarchical computational components. We leverage ontological concept structures and concept transformation relationships in our system AutoPipe, which allows users to explore tentative implementations of workflow queries and select the most fitting ones. Synthesized queries are expressed using declarative languages such as BioFlow and can be executed in our LifeDB database management system.

1 Introduction

Although most concepts in biology are well understood and have defined meanings, stitching them conceptually into a coherent sequence does not always lead to a computable query that can be executed to generate an expected response. For example, consider a gene expression analysis that involves identifying a novel set of small regulatory relationships among gene products together with another set of known relationships for which prior knowledge is available. In other words, we are interested in finding new regulatory relationships with high enough confidence for an already known pathway. Presumably, the data include the expression profiles of the genes in the known pathway P along with other genes of interest. Conceptually, the computational process may be expressed in the following way; the real question is how to implement this pipeline to generate the expected response.

1. Select the top n differentially expressed genes, including the genes in P.
2. Reverse engineer a gene regulatory network to extract a candidate network.
3. Find ranked evidence of new regulatory relationships in known interaction networks such as pathways, protein-protein interaction networks, etc.
4. Display the top k networks in order of relevance.
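The four steps above can be sketched as a chain of plain functions. Everything below is illustrative only: the function names, the same-sign stand-in for network inference, and the toy ranking are assumptions, not AutoPipe's actual API or the cited tools.

```python
# Illustrative sketch of the four-step pipeline as a chain of functions.
# All names and the toy logic are placeholders, not AutoPipe's API.

def select_top_differential(expression, pathway_genes, n):
    """Step 1: keep the n most differentially expressed genes,
    always retaining the genes of the known pathway P."""
    ranked = sorted(expression, key=lambda g: -abs(expression[g]))
    keep = set(ranked[:n]) | set(pathway_genes)
    return {g: expression[g] for g in keep}

def reverse_engineer_network(profiles):
    """Step 2: stand-in for a network-inference tool (GSEA/RNN in the
    text); here genes with same-sign fold changes are linked."""
    genes = sorted(profiles)
    return [(a, b) for a in genes for b in genes
            if a < b and profiles[a] * profiles[b] > 0]

def rank_against_known(candidate_edges, known_edges):
    """Step 3: rank candidate edges, known interactions first."""
    known = set(known_edges)
    return sorted(candidate_edges, key=lambda e: e not in known)

def top_k(ranked_edges, k):
    """Step 4: the k most relevant results, for display."""
    return ranked_edges[:k]
```

In a real instantiation each function would be replaced by one of the concrete tools discussed below; the point is only that the pipeline is a composition of typed steps.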

* Research supported in part by National Science Foundation grant IIS 0612203.


The choice of artifacts to implement the above pipeline is researcher specific, based mostly on her familiarity with, or the popularity of, the tools. For example, step 1 could be implemented using the GSEA algorithm [12] followed by a selection. In step 2, she can use RNN [7] to induce the candidate regulatory network, then find the top-k matching networks using algorithm TraM [2] in step 3, and finally display the graph using a suitable tool. However, at each step she could make alternate choices as well. For example, she can choose the enhanced GSEA method in [5], or a simpler algorithm that comes standard with Bioconductor/R [10]. For regulatory network generation, she can select algorithms such as Genie3 [4] or BicAT-Plus [1]. Finally, for network matching, she could have selected TALE [13] or other similar graph matching algorithms. These choices depend largely upon her expectation of the overall query semantics and the input-output behavior of the components, the compatibility of the data and the application tools, and the complexity of, and her familiarity with, these artifacts. In some cases, she may need to write small pieces of glue code to patch disparities, or apply format conversions to make the components compatible. The more complex and diverse the choices are, the more expensive the application will tend to be, and the less likely it is that she will develop the pipeline herself, thereby introducing communication hurdles across domain boundaries (computational experts and biologists) and substantially increasing the cost. This prompts the question: could she map this description to a pipeline using a tool that autonomously and judiciously stitches the available resources together, involving the user only minimally? An even more intriguing question is: could a user simply say, "display the top k regulatory networks from the input data set," and let the system fill in the blanks?

2 AutoPipe Model

The query construction approach in AutoPipe is substantially different from logic programming approaches [9], where the final goal is considered a proof construction problem and all needed components are pre-defined as rules. It is also significantly different from program synthesis research [8] in software engineering, where rigorous description regimes are required for apparently simple computational needs. Our approach is also distinct from query synthesis research in databases [11], where the query is synthesized from a pair of an input set and a view through reverse engineering, whereas we do so from only the input data set. Since we require substantially less information from the user, we compensate for the loss of essential information needed for successful reverse engineering by augmenting the database with a resource template hierarchy R, a concept hierarchy C, a coupling relationship ⪯ among the resource templates in R, and a symmetric mapping µ of the form µ : C ↔ R.

We allow two types of resources – tools and data. Tools are of three types – analyzers, converters and visualizers. Analyzers transform data resources to produce data resources, while converters change the representation or formats of data resources. Visualizers on the other hand accept data resources and display


them in specific ways, and thus are treated as terminating transformers. A data resource, on the other hand, is an initiating transformer. All resources have defined input-output behaviors described using concepts in C (Fig. 1), which are always enforced. A coupling relation between templates t1 and t2 may be implicit or explicit. In both cases, t1 ⪯ t2 exists only if there is an injective mapping from the input descriptions of t2 to the output descriptions of t1, i.e., to be compatible t1 must supply all inputs needed by t2. For data type resources, an instance is a table description whose attributes are described using concepts in C; for every tool type resource template, an instance is a well-defined description of a computational procedure from which a BioFlow [6], or any other declarative language, procedure can be constructed.

Fig. 1. Examples of templates.

BioFlow supports a declarative statement, called define function, for desktop or internet tool application, which is capable of resolving schema mismatches and of automatically extracting needed information using autonomous wrapper generation. It also supports declarative sequencing of predefined procedures for powerful process graph implementation. We use BioFlow as our target language in AutoPipe in the remainder of this paper.

2.1 Synthesizing Workflow Queries

Given the coupling relationship ⪯, a set of input tables R = {r1, . . . , rn}, a target concept c, and a given k, synthesis of a workflow template yields a set of m ≤ k shortest possible directed acyclic graphs constructed from ⪯ such that the initial or root nodes are data resources in R, and the final node is a tool resource t such that µ(t) = c. The target workflow queries are then essentially instantiations of the workflow templates with concrete data types, tools and display functions, expressed as a set of declarative workflow queries. While any declarative language can be used by developing language converters of choice, in the current implementation of AutoPipe only BioFlow queries are supported for execution in the LifeDB database management system [3].
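As a rough illustration of the synthesis step, the sketch below treats templates as (inputs, outputs) concept sets, reduces the injective-mapping coupling test to set containment, and enumerates shortest chains by breadth-first search. All names, the toy template catalog, and the simplified coupling test are assumptions for illustration, not AutoPipe's implementation.

```python
from collections import deque

# Minimal sketch of workflow template synthesis as shortest-path search.
# Assumption: the injective input/output mapping of the coupling test is
# simplified to set containment (inputs(t2) ⊆ outputs(t1)).

def synthesize(templates, start, target_concept, k):
    """Return up to k shortest template chains from the data resource
    `start` to any template whose outputs include `target_concept`.
    `templates` maps a name to a pair (input_concepts, output_concepts)."""
    found, queue = [], deque([[start]])
    while queue and len(found) < k:
        path = queue.popleft()
        _, outs = templates[path[-1]]
        if target_concept in outs:
            found.append(path)
            continue
        for name, (t_ins, _) in templates.items():
            if name not in path and t_ins <= outs:  # coupling test t1 ⪯ t2
                queue.append(path + [name])
    return found

# Toy resource templates: a data resource feeding analyzers and a visualizer.
TEMPLATES = {
    'expr_table': (set(), {'expression'}),
    'gsea':       ({'expression'}, {'gene_set'}),
    'rnn':        ({'gene_set'}, {'network'}),
    'viewer':     ({'network'}, {'display'}),
}
```

Because breadth-first search dequeues paths in nondecreasing length, the first k hits are shortest chains, mirroring the "shortest possible DAGs" requirement (the sketch handles only linear chains, not general DAGs).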

A Declarative Language for Generating Workflow Queries The linguistic constructs of AutoPipe leverage its shortest-path graph construction capability, based on the definition of coupling, and isomorphic subgraph matching to retrieve possible implementations of a workflow query from resource instances and coupling relationships. AutoPipe supports the construct statement shown below for the extraction of workflow queries.

construct any | all | distinct | top k concept for relations through templates;
display constructExpression with template;
convert constructExpression to targetLanguage;


In the construct statement, concept is a term in C, relations are instances of data templates, and templates are names in the resource template hierarchy. The statement returns all directed acyclic graphs induced by the relation ⪯ such that they originate in a relation and end in a concept via nodes in templates in the specified (not necessarily consecutive) order. Since the construct statement requires a concept, and display templates do not map to any specific concept, we support the last two statements to display the computation and to convert the workflow graphs into executable queries, respectively. In the last two statements, constructExpression is any valid AutoPipe construct statement, template is an instance of a display type tool template, and targetLanguage is a declarative workflow query language, such as BioFlow. Finally, create view or insert into type statements may be used directly to save views computed by the construct statements.

References

1. F. Alakwaa, N. Solouma, and Y. Kadah. Construction of gene regulatory networks using biclustering and bayesian networks. Theoretical Biology and Medical Modelling, 8(1):39+, Oct. 2011.
2. S. Amin, R. L. Finley Jr., and H. M. Jamil. Top-k similar graph matching using TraM in biological networks. ACM/IEEE TCBB, 2012. Accepted.
3. A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. M. Jamil, and L. Lipovich. On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In DEXA, 2009.
4. V. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9):e12776+, 2010.
5. R. A. Irizarry, C. Wang, Y. Zhou, and T. P. Speed. Gene set enrichment analysis made simple. Statistical Methods in Medical Research, 18(6):565–575, Dec. 2009.
6. H. M. Jamil, A. Islam, and S. Hossain. A declarative language and toolkit for scientific workflow implementation and execution. IJBPIM, 5(1):3–17, 2010.
7. M. Kabir, N. Noman, and H. Iba. Reverse engineering gene regulatory network from microarray data using linear time-variant model. BMC Bioinformatics, 11(S-1):56, 2010.
8. V. Kuncak, M. Mayer, R. Piskac, and P. Suter. Software synthesis procedures. Communications of the ACM, 55(2):103–111, 2012.
9. D. Nardi and R. Rosati. Deductive synthesis of programs for query answering. In International Workshop on Logic Program Synthesis and Transformation, pages 15–29. Springer-Verlag, 1992.
10. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.
11. A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing view definitions from data. In ICDT, pages 89–103, 2010.
12. A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43):15545–15550, 2005.
13. Y. Tian and J. M. Patel. TALE: A tool for approximate large graph matching. In International Conference on Data Engineering, pages 963–972, 2008.


A new method to predict linear B-cell epitope using support vector machine

Bo Yao1, Lin Zhang2, Shide Liang3* and Chi Zhang1*

1School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, NE, 68588, USA
2Department of Statistics, University of Nebraska, Lincoln, NE, 68588, USA
3Systems Immunology Lab, Immunology Frontier Research Center, Osaka University, Suita, Osaka, 565-0871,

*Corresponding author

Email addresses: Bo Yao: [email protected] Lin Zhang: [email protected] Shide Liang: [email protected] Chi Zhang: [email protected]

Abstract Identifying protein surface regions preferentially recognizable by antibodies (antigenic epitopes) is at the heart of new immuno-diagnostic reagent discovery and vaccine design, and computer prediction provides a crucial means toward this goal. Many linear B-cell epitope prediction methods have been developed, such as BepiPred, ABCPred, AAP, and BCPred. However, effective immunological research demands higher accuracy and more robust performance than the current algorithms can provide. In this work, we developed a new method to predict antigenic epitopes from sequence input. A Support Vector Machine (SVM) is utilized, combining tri-peptide similarity and propensity scores. In a leave-one-out test, our method achieved an accuracy of 77.75% and a specificity of 81.99%.
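For readers unfamiliar with the reported metrics, the following sketch shows how leave-one-out accuracy and specificity are computed. The `train_and_predict` argument is a stand-in for the authors' SVM; the tri-peptide similarity and propensity-score features are not reproduced here.

```python
# Sketch of a leave-one-out evaluation producing accuracy and
# specificity figures like those quoted above. `train_and_predict`
# is a stand-in for the trained classifier.

def leave_one_out(examples, labels, train_and_predict):
    tp = tn = fp = fn = 0
    data = list(zip(examples, labels))
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]          # train on all but one
        pred = train_and_predict(rest, x)       # test on the held-out item
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(data)
    specificity = tn / (tn + fp)                # fraction of negatives kept
    return accuracy, specificity
```

With a trivial threshold predictor on toy data this harness reproduces the textbook definitions: accuracy = (TP + TN) / N and specificity = TN / (TN + FP).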


Asymptotic properties of a median tree under the coalescent model

Liang Liu

University of Georgia

Abstract. Accurately estimating the evolutionary history of species (species tree) is one of the most important problems in biology. In this paper, I investigate the statistical properties of the sample median tree under the coalescent model and show that the sample median tree is a statistically consistent estimate of the species tree. This result provides a consistent method for accurately estimating species trees.

Keywords: species tree; median tree; coalescent model.

1 Introduction

As molecular sequence data become increasingly available, phylogenetic studies have found significant evidence that the history of a single gene (gene tree) may differ from the history of species, due to a variety of biological phenomena including deep coalescence, horizontal gene transfer, and gene duplication/loss [3]. Many probabilistic models have been proposed to explain the relationship between gene trees and the species tree. Most commonly, gene trees are viewed as a random sample generated from a coalescence process occurring along the lineages of the species tree [4]. A broad class of distance methods attempt to estimate the species tree by a median tree - the tree with minimum distance to gene trees, but studies on the statistical properties of the median tree under the coalescent model [5] are limited. This paper investigates the asymptotic properties of the median tree under the coalescent model. It can be shown that under the coalescent model, the median tree is a statistically consistent estimate of the species tree.

2 Assumptions and notations

Gene trees and the species tree are binary rooted trees on the same set of taxa. An $N$-taxon rooted tree is characterized by its set of $\binom{N}{3}$ rooted triples (Fig. 1). Thus, a rooted binary tree can be represented by a vector of rooted triples. Consider a rooted binary tree on taxa A, B, C, and D (Fig. 1). This tree has four rooted triples: $T_{ABC}$, $T_{ABD}$, $T_{ACD}$, and $T_{BCD}$. The topology of a rooted triple is indicated by an indicator vector $[I_1, I_2, I_3]$. For example, $[1, 0, 0]$ implies that the topology of the rooted triple $T_{ABC}$ is AB|C (A and B are grouped together), while $[0,1,0]$ and $[0,0,1]$ indicate that the topology of $T_{ABC}$ is AC|B and BC|A, respectively. A


rooted binary tree is uniquely represented by a vector of indicators in which the value is 1 if the corresponding topology of the rooted triple is present in the binary tree and 0 if it is not. For example, the vector representation of the four-taxon tree in Figure 1 is $[1,0,0,1,0,0,0,0,1,0,0,1]$. Note that each triplet in the vector has exactly one 1 and two 0s. Thus the sum of the vector is equal to the number of rooted triples in the tree, i.e., $\binom{N}{3}$, and the length of the vector is $3 \binom{N}{3}$. The triplet distance between two rooted trees $T_1$ and $T_2$ is

$$d(T_1, T_2) = \sum_{i=1}^{w} \left| v^i_{T_1} - v^i_{T_2} \right| \qquad (1)$$

where $w = 3 \binom{N}{3}$ is the length of the indicator vector (the number of elements in each vector) and $v^i_T$ is the $i$th element of the indicator vector of tree $T$. This distance function counts the number of rooted triples that appear in either tree $T_1$ or tree $T_2$, but not in both. The value of $d(T_1, T_2)$ is always an even number because each rooted triple contributes either 0 or 2.


Fig. 1. The triples in a rooted tree. The rooted tree contains four triples.
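The triplet distance of Eq. (1) can also be computed directly from the trees, without building the full indicator vector, by comparing how each trio of taxa resolves. A minimal sketch (trees as nested pairs, e.g. `(('A','B'),('C','D'))`), not the paper's code:

```python
from itertools import combinations

# Sketch of the triplet distance d(T1, T2) of Eq. (1): each trio of taxa
# that resolves differently in the two trees contributes 2 to the
# indicator-vector distance. Trees are nested pairs of leaf names.

def leaves(t):
    return {t} if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def resolve(t, trio):
    """Return the pair of `trio` grouped together in tree t."""
    left, right = leaves(t[0]) & set(trio), leaves(t[1]) & set(trio)
    if len(left) == 3:
        return resolve(t[0], trio)   # whole trio in one subtree: recurse
    if len(right) == 3:
        return resolve(t[1], trio)
    return frozenset(left if len(left) == 2 else right)

def triplet_distance(t1, t2):
    taxa = sorted(leaves(t1))
    return 2 * sum(resolve(t1, trio) != resolve(t2, trio)
                   for trio in combinations(taxa, 3))
```

On a four-taxon tree this enumerates the four triples $T_{ABC}, T_{ABD}, T_{ACD}, T_{BCD}$ of the text, and the result is always even, as noted above.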

3 Asymptotic distribution of the sample median tree

Let $\{v_{g_1}, \ldots, v_{g_k}\}$ be the vector representations of gene trees $\{g_1, \ldots, g_k\}$ and let $\bar{v}$ be the mean vector, i.e., $\bar{v} = \sum_{i=1}^{k} v_{g_i}/k$. By the law of large numbers, the $i$th element of $\bar{v}$ converges to the probability that the corresponding topology of the rooted triple is present in a gene tree generated from the probability distribution $P(g|s)$, i.e., as $k \to \infty$,

$$\bar{v}^i \to p^i \quad \forall i \qquad (2)$$

The asymptotic probability distribution of the mean vector is a multivariate normal distribution $MVN(p, \Sigma)$, in which $p$ is the vector $(p^1, \ldots, p^w)$ and $\Sigma$ is the covariance matrix. A sample median tree $\tilde{T}$ minimizes the sum of the triplet distances to all gene trees,

$$\tilde{T} = \operatorname*{arg\,min}_{T} \sum_{j=1}^{k} d(g_j, T) \qquad (3)$$


and

$$\sum_{j=1}^{k} d(g_j, T) = \sum_{j=1}^{k} \sum_{i=1}^{w} \left| v^i_{g_j} - v^i_{T} \right| = \sum_{i=1}^{w} \sum_{j=1}^{k} \left| v^i_{g_j} - v^i_{T} \right| = k \sum_{i=1}^{w} \left| \bar{v}^i - v^i_{T} \right| \qquad (4)$$

This implies that the sample median tree has minimum distance to the mean vector $\bar{v}$. As the mean vector $\bar{v}$ has an asymptotic multivariate normal distribution, the asymptotic distribution of the sample median tree can be obtained by calculating the trees with minimum distance to the vectors generated from the multivariate normal distribution, with $p$ and $\Sigma$ replaced by their unbiased estimators, the sample mean vector $\bar{v}$ and the sample covariance matrix $\hat{\Sigma}$. The population median tree $\tilde{T}_p$ is defined as the tree that minimizes the sum of distances to gene trees with respect to the probability distribution $P(g|s)$ of the gene tree given the species tree, i.e.,

$$\tilde{T}_p = \operatorname*{arg\,min}_{T} \sum_{g} d(g, T) \times P(g|s) \qquad (5)$$

The solution to (5) is unique under the coalescent model (the proof will be given shortly). By the law of large numbers, the sample median tree $\tilde{T}$ converges to the population median tree in probability as the number of gene trees increases, i.e., as $k \to \infty$,

$$\tilde{T} \xrightarrow{\;p\;} \tilde{T}_p \qquad (6)$$
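A hypothetical illustration of the majority principle behind Eqs. (3)-(4): resolving each rooted triple by its most frequent topology among the gene trees minimizes the summed triplet distance. Gene trees are represented here directly as mappings from a trio to its grouped topology; this is a simplification of the tree-valued optimization in Eq. (3), since assembling the per-triple winners back into a single tree is not attempted.

```python
from collections import Counter

# Majority vote per rooted triple, as a sketch of the median criterion.

def majority_resolutions(gene_trees):
    """For each trio, return the topology most frequent among gene trees."""
    trios = gene_trees[0].keys()
    return {trio: Counter(g[trio] for g in gene_trees).most_common(1)[0][0]
            for trio in trios}

def total_distance(resolution, gene_trees):
    # 2 per mismatching triple, as in the triplet metric d of Eq. (1)
    return sum(2 * (g[t] != p)
               for g in gene_trees for t, p in resolution.items())
```

Because the summed distance of Eq. (4) decomposes over triples, the per-triple majority choice cannot be beaten by any other assignment, which is the intuition behind the consistency argument that follows.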

4 Consistency of the median tree under the coalescent model

Under the coalescent model, the probability distribution of the gene tree (topology and branch lengths) given the species tree topology, branch lengths, and population sizes was derived by Rannala and Yang [1]. Degnan and Salter [2] later derived the probability distribution $P(g|s)$ of the gene tree topology given the species tree (topology and branch lengths in coalescent units) by integrating out the branch lengths of the gene tree.

Lemma 1. If, for every trio of taxa, the most probable triple topology in the gene trees is consistent with the corresponding triple in the true species tree s, then the sample median tree is a statistically consistent estimator of the true species tree s.

Proof. By (2), (4), and (6), the population median tree $\tilde{T}_p$ minimizes the distance to the vector $p$,

$$\sum_{i=1}^{w} \left| p^i - v^i_{\tilde{T}_p} \right| \qquad (7)$$



Consider an arbitrary triplet of elements of the vector $p$. Because these elements are the probabilities of the three possible topologies of a rooted triple, they sum to 1. The elements of the vector $v_{\tilde{T}_p}$ are either 1 or 0. The distance between the two vectors is therefore minimized when, for each triple, $v^i_{\tilde{T}_p} = 1$ for the topology whose $p^i$ is largest among the three possibilities. This indicates that (7) is minimized when the triples in the species tree are consistent with the most probable triples in the gene trees. By assumption, the triples in the true species tree s are consistent with the most probable triples in the gene trees, so the true species tree is identical to the population median tree $\tilde{T}_p$. It follows from (6) that the sample median tree is a statistically consistent estimator of the true species tree s.

Theorem 1. Under the coalescent model, the sample median tree based on the triplet metric is statistically consistent.

Proof. By Lemma 1, it suffices to show that the most probable triple in the gene trees generated from the probability distribution $P(g|s)$ is consistent with the corresponding triple in the true species tree s. Consider an arbitrary triple $T_{ABC}$ in s. Without loss of generality, let the topology of $T_{ABC}$ be AB|C. According to coalescent theory, the probabilities of the three topologies of the gene tree triple are $P(AB|C) = 1 - \frac{2}{3}e^{-b}$, $P(AC|B) = \frac{1}{3}e^{-b}$, and $P(BC|A) = \frac{1}{3}e^{-b}$, where $b$ is the length of the internal branch of the species tree triple $T_{ABC}$. The most probable gene tree triple therefore matches the species tree triple.
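The inequality used in the proof is easy to check numerically. The following sanity check restates the standard coalescent triple probabilities from the text; it introduces no new theory.

```python
import math

# Numeric sanity check of the triple probabilities in the proof: for any
# internal branch length b > 0 (in coalescent units), the concordant
# topology AB|C is strictly the most probable, and the three
# probabilities sum to 1.

def triple_probs(b):
    discordant = math.exp(-b) / 3.0
    return {'AB|C': 1.0 - 2.0 * discordant,
            'AC|B': discordant,
            'BC|A': discordant}

for b in (0.01, 0.5, 2.0, 10.0):
    p = triple_probs(b)
    assert abs(sum(p.values()) - 1.0) < 1e-12
    assert p['AB|C'] > p['AC|B'] == p['BC|A']
```

Even for a very short internal branch (b = 0.01) the concordant topology retains a strict, if small, edge over each discordant one, which is exactly what Lemma 1 requires.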

5 Discussion

The sample median tree can be used to consistently estimate species trees. The variance of the sample median tree can be estimated through a bootstrap technique. Specifically, the original dataset is resampled to generate bootstrap samples. A sample median tree is built for each bootstrap sample. The median trees are then summarized by a consensus tree, which indicates the variation among the median trees.

References

1. Rannala, B., Yang, Z.H.: Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164, 1645–1656 (2003)
2. Degnan, J.H., Salter, L.A.: Gene tree distributions under the coalescent process. Evolution 59, 24–37 (2005)
3. Maddison, W.P., Knowles, L.L.: Inferring phylogeny despite incomplete lineage sorting. Syst Biol 55, 21–30 (2006)
4. Liu, L., Pearl, D.K.: Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol 56, 504–514 (2005)
5. Wakeley, J.: Coalescent Theory: An Introduction. Roberts & Company Publishers (2008)


Comparison of RNA-Seq with Microarray Analysis of the Transcriptional Response in HT-29 Colon Cancer Cells to 5-aza-deoxycytidine

Xiao Xu1, Jennie Williams1, Erica Antinoiou2, W. Richard McCombie2, Wei Zhu3, Song Wu3, Asia Brown1, Paula Denoya1, Ellen Li1

1School of Medicine, Stony Brook University, Stony Brook, NY, USA; 2Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; 3Department of Applied Mathematics, Stony Brook University, Stony Brook, NY, USA

Abstract – In this study we compared parallel datasets generated by (1) paired-end Illumina RNA sequencing (RNA-Seq) and (2) Affymetrix Human U133 Plus 2.0 arrays on HT-29 colon cancer cells treated with two different levels (5 µM and 10 µM) of 5-aza-deoxycytidine (a demethylation agent), and further compared to control cells treated with vehicle (dimethylsulfoxide) alone. This study aims to enhance our understanding of the pros and cons of RNA-Seq technology in the specific context of induced epigenetic intervention at different dosage levels. The RNA-Seq experiment detected a total of 18109 genes and the Affymetrix experiment detected a total of 12467 genes in at least one of the experimental groups. The average Spearman correlation coefficient R was 0.72 for the expression levels of the 11930 genes that were detected by both experiments. We then selected the genes that were differentially expressed (fold change ≥ 2, FDR < 0.05) after treatment with 5-aza-deoxycytidine, using the Cufflinks/Cuffdiff program for the RNA-Seq data and the Significance Analysis of Microarrays (SAM) program for the Affymetrix data. There was considerable overlap between the pathways selected by Ingenuity Pathway Analysis (IPA) from the up-regulated genes of the RNA-Seq and Affymetrix datasets, respectively. Both experiments confirmed that STAT1, STAT2 and FES were up-regulated after treatment with 5-aza-deoxycytidine. Moreover, the RNA-Seq results on differentially expressed genes suggest that 10 µM 5-aza-deoxycytidine has a certain toxic effect on HT-29 cells, shutting down the expression of more genes than the 5 µM dosage. Overall, our study featured a comprehensive characterization of the transcriptomic response of HT-29 colorectal cancer cells to 5-aza-deoxycytidine treatment on two parallel experimental platforms.

Keywords: RNA-Seq, microarray, colon cancer, 5-aza-deoxycytidine, transcriptomic analysis.

I Introduction

High-throughput sequencing is emerging as an attractive alternative to microarrays for measuring global mRNA expression [1]. The goal of our study is to validate the use of RNA-Seq by comparison with an established commercial microarray platform. We chose to investigate the global transcriptomic responses of HT-29 colorectal cancer cells to two concentrations of 5-aza-deoxycytidine (a demethylation agent). Finally, the significant biological pathways derived from the differentially expressed genes are compared in order to explore the differences between the two platforms at the functional level.

II Experimental Design

Three groups (each containing 3 biological replicates) of colon cancer HT-29 samples were treated with (A) 5 µM 5-aza-2'-deoxycytidine each day for five days, (B) 10 µM 5-aza-2'-deoxycytidine each day for five days, and (C) vehicle (DMSO) alone, respectively. The same set of samples was used on both the RNA-Seq and microarray platforms (Table 1). Pair-wise comparisons of (A) vs (C) as well as (B) vs (C) were also carried out. Differentially expressed genes were selected and compared in the two settings. Subsequently, a by-sample correlation analysis was performed between the RNA-Seq mapped data and the Affymetrix probe levels using the Spearman correlation method.

Table 1: Experimental design (3 dosage levels and two platforms).

5-aza-2'-deoxycytidine treatment   RNA-Seq               Microarray
5 µM                               HT-29 × 3 (Group A)   HT-29 × 3 (Group A')
10 µM                              HT-29 × 3 (Group B)   HT-29 × 3 (Group B')
DMSO controls                      HT-29 × 3 (Group C)   HT-29 × 3 (Group C')

II Method

The experimental and analytic procedures were based on popular RNA-Seq and microarray algorithms, forming a complete preprocessing → filtering → normalization →


statistical analysis pipeline (Figure 1). Several key steps are explained in the following.

Figure 1: Preprocessing and analytic pipeline of colorectal cancer HT-29 samples. I. Correlation analysis between all 9 samples based on common detectable genes. II. Comparison of differentially expressed genes found on both platforms. *A filtering procedure was applied to remove gene/probe entries that do not have 3 present values in at least one of the three groups.

1) Illumina high-throughput sequencing experiment

The TruSeq RNA Sample Preparation Kit (Illumina Inc., CA) was used to prepare the sequencing libraries by implementing the following steps. mRNA was purified and double-stranded cDNA was made using random primers for the first- and second-strand synthesis. The next step converted the overhangs of the DNA into phosphorylated blunt ends. Adaptors were then ligated to the DNA fragments. A size selection was performed using AMPure XP beads (Beckman Coulter) to remove excess adaptors and isolate DNA templates of 320 bp average length. Finally, PCR was performed to enrich the adapter-modified DNA fragments, since only the DNA fragments with adaptors at both ends will amplify. A sequencing flow cell was prepared at 10 nM loading concentration and sequenced on an Illumina HiSeq 2000 instrument. The sequences were filtered by the Illumina software to remove bad-quality bases (the first 3 nucleotides), since the Phred scores of these sites were below 30 in at least 5 out of 9 samples. The resulting fastq files were provided for subsequent analysis.

2) Tophat mapping

A popular RNA-Seq read mapping tool, Tophat (v1.4.1) [2], was used to map the millions of short reads to the Ensembl 19 reference (ftp://ftp.ensembl.org/pub/current_gtf/) following default settings. The mapped reads were summarized and exported into SAM/BAM files.

3) Cufflink/Cuffdiff Analysis

The cufflink program (version 1.3.0) [3] is a popular tool for transcript assembly and abundance estimation from RNA-Seq samples. It uses the Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value to quantify gene transcript abundance. Its associated program cuffdiff was specially designed to test for differentially expressed genes, exons, coding sequences, splicing events, and promoter use. We used the cufflink program to assemble and estimate transcript abundance for each gene before correlating to the microarray data. The cuffdiff program was used in paired comparison settings (A vs C and B vs C) in our search for differentially expressed genes.

4) SAM analysis

Significance Analysis of Microarrays (SAM) is a popular statistical tool for identifying differentially expressed genes based on a permutation t-like test [4]. A false discovery rate (FDR) cutoff of 0.05 was used in our analysis to control type I error. This method was applied only to the Affymetrix microarray data, and an additional fold change (FC) cutoff of 2 (FC <= 0.5 or FC >= 2 on group means) was also used to select differentially expressed genes.

5) Pathway Analysis

Pathway analysis was performed using Ingenuity software (Redwood City, CA). The enrichment rates of differentially expressed genes were evaluated against canonical signaling pathways in human cells. Significant pathways were picked based on P value <= 0.05.

III Results

A. Correlation with microarray data


After the filtering and gene symbol conversion steps, the Affymetrix® hgu133plus2 microarray detected 12267 independent gene entries while the Illumina RNA-Seq experiment found 18109 present gene transcripts. The overlap of 11775 genes was used in the subsequent between-sample correlation analysis (Figure II). We also found that the 492 microarray-exclusive genes have lower average expression levels (down by 2~4 fold) than the genes also detected in the RNA-Seq experiment (11775), suggesting a dubious signal quality of these probes. The transcript levels estimated from the RNA-Seq experiment showed a high per-sample correlation with the microarray results using Spearman correlation (average r = 0.72, P value << 1×10^-10). An additional correlation analysis based on group fold changes (A vs C & B vs C) for each transcript was also performed, resulting in a high correlation (r = 0.83, P value << 1×10^-10) for the A vs C comparison and a relatively lower yet significant correlation (r = 0.70, P value << 1×10^-10) for the B vs C comparison.

Figure II: Venn diagram of detectable genes from the two platforms, showing the overlapping and exclusive gene sets of the RNA-Seq and microarray platforms. *Among the 6122 genes exclusively identified by RNA-Seq, 4218 are not revealed in the microarray experiment because their abundances are too low and they were thus filtered from the final microarray data.

B. Differentially expressed genes in demethylation treatment group vs controls

We performed significant gene detection analysis for both dosages compared to the DMSO controls. Using the cuffdiff test, we found 1774 and 767 differentially expressed genes in the (A) vs (C) and (B) vs (C) comparisons of the RNA-Seq experiment, relatively more than we observed in the microarray results (930 in A vs C and 767 in B vs C) based on a P value cutoff of 0.05 and a fold change cutoff of 2. Specifically, the number of up-regulated genes (from control to demethylation) is generally higher than that of the down-regulated genes, and this tendency is more explicit in the lower dosage group compared with controls (A vs C). We also noticed that the overlap rates of differentially expressed genes from the two platforms are higher in the A vs C comparison as opposed to the B vs C experiment (Table 2). A further analysis comparing the numbers of significant genes between the two demethylation levels (vs control) indicated a high overlap rate between the two dosage comparisons on both platforms, except for the down-regulated genes in the microarray study (Table 3).

Table 2: Number of differentially expressed genes in each comparison category.

Comparisons          A > C   A < C   B > C   B < C
RNA-Seq              1522    252     574     193
Microarray           584     346     429     338
Intersection         458     112     216     67
Intersection Total   570 (A vs C)    283 (B vs C)

Table 3: Number and rate of overlapped genes between the two demethylation-vs-control comparisons on both the RNA-Seq and microarray platforms.

Intersection of A vs C and B vs C   A > C & B > C      A < C & B < C
RNA-Seq                             494/(1522 & 574)   93/(252 & 193)
Overlap Rate                        86.1%              48.2%
Microarray                          290/(584 & 429)    115/(346 & 338)
Overlap Rate*                       67.6%              34.0%

*The overlap rate is calculated as the number of common genes divided by the number of genes in the smaller parental set (underscored in the table), which reflects the maximum possible overlap rate independent of the sizes of the parental sets.

C. The Ingenuity Pathway Analysis

In the final pathway enrichment analysis, we primarily focused on the up-regulated genes, since they are assumed to be directly affected by the demethylation treatment. The Ingenuity Pathway Analysis (IPA) identified 35 (RNA-Seq) and 27 (microarray) significant signaling pathways using the differentially up-regulated genes (treatment level > control) from the A vs C comparisons (Table 4). Among these identified pathways, 11 were found in overlap between the two experimental platforms. In the B vs C comparisons (up-regulated genes only), the IPA program identified 40 (RNA-Seq) and 27 (microarray) significant signaling pathways, among which 13 were found in common.

IV Discussion

In our study, 5-aza-deoxycytidine treatment of HT-29 cells resulted in a significant transcriptional response reflected on both the RNA-Seq and microarray platforms. Both experimental platforms confirmed a number of genes reported to be up-regulated, such as STAT1, STAT2 [5] and FES [6]. However, the SPARC gene reported by Cheetham et al. [7] was not detected in our RNA-Seq or microarray analysis due to its low abundance in our experiment. The study showed that the high-throughput Illumina RNA sequencing technology is more sensitive to low-abundance transcripts than the Affymetrix microarray, considering that 4218 genes below the detection criterion in the microarray experiment are revealed as present on the RNA-Seq platform. When comparing to previous studies in similar settings, such as Su et al. [8], our between-platform Spearman correlation (r = 0.72) is slightly less than yet in the same range as their result (r ~ 0.80). In another experiment conducted by Marioni et al. [1], the Spearman correlation (r = 0.73~0.75) is much closer to our findings, suggesting a high consistency between our study and their reports. A direct comparison of the differentially expressed gene sets between (A) vs (C) and (B) vs (C) seems to validate that the absolute majority of genes that are differentially expressed in the higher dosage 5-aza-deoxycytidine treatment (B) vs control (C) are also differentially expressed in the lower dosage treatment (A) vs control (C). To some extent, we may consider the introduction of 10 µM 5-aza-deoxycytidine (B) to HT-29 cells a toxic dosage which actually turned off the expression of many genes that are activated upon the 5 µM demethylation treatment (Group A). Moreover, the observation that the 492 microarray-exclusive genes have lower expression profiles (2~4 fold) than the common 11775 genes seems to indicate a certain level of unreliability of these probe readings, which partly explains why they were not picked up in the RNA-Seq experiment. Lastly, the IPA analysis indicated that while the RNA-Seq experiment revealed more differentially expressed genes than the microarray experiment between the demethylation-treated groups and controls, many of the additional genes were assigned to pathways already identified by the microarray platform.

V Acknowledgement

The author would like to thank Molly Hammel from Cold Spring Harbor Laboratory for her valuable suggestions in the RNA-Seq data analysis process. The author also wants to express his sincere gratitude to all the lab technicians from both the Stony Brook University Health Science Center and Cold Spring Harbor Laboratory who were involved in the relevant experiments of this study.

VI References

1. Marioni, J.C., et al., RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 2008. 18(9): p. 1509-17.
2. Trapnell, C., L. Pachter, and S.L. Salzberg, TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 2009. 25(9): p. 1105-11.
3. Trapnell, C., et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 2010. 28(5): p. 511-5.
4. Efron, B. and R. Tibshirani, Empirical bayes methods and false discovery rates for microarrays. Genet Epidemiol, 2002. 23(1): p. 70-86.
5. Karpf, A.R., et al., Inhibition of DNA methyltransferase stimulates the expression of signal transducer and activator of transcription 1, 2, and 3 genes in colon tumor cells. Proc Natl Acad Sci U S A, 1999. 96(24): p. 14007-12.
6. Shaffer, J.M. and T.E. Smithgall, Promoter methylation blocks FES protein-tyrosine kinase gene expression in colorectal cancer. Genes Chromosomes Cancer, 2009. 48(3): p. 272-84.
7. Cheetham, S., et al., SPARC promoter hypermethylation in colorectal cancers can be reversed by 5-Aza-2'deoxycytidine to increase SPARC expression and improve therapy response. Br J Cancer, 2008. 98(11): p. 1810-9.
8. Su, Z., et al., Comparing next-generation sequencing and microarray technologies in a toxicological study of the effects of aristolochic acid on rat kidneys. Chem Res Toxicol, 2011. 24(9): p. 1486-93.


CPAM: Effective Composite Regulatory Pattern Miner for Genome Sequences

Dan He

Computer Science Dept., Univ. of California, Los Angeles, CA, 90095-1596, USA [email protected]

Abstract. Finding repetitive patterns in DNA sequences is a fundamental problem in computational biology. There are many different types of repetitive patterns. The composite regulatory pattern mining problem is to find an l-mer, or length-l consecutive sequence, in a set of sample sequences, such that the l-mer has at least k occurrences in the sample sequence, where each occurrence has at most d mismatches to the l-mer. The problem is also known as the (l, d)-pattern mining problem, or the (l, d)-challenge problem. It has been studied extensively. However, the current methods to solve the problem are not efficient for relatively long patterns and are generally not scalable to long sample sequences. In this work, we propose an algorithm, CPAM, which first seeks short seeds for the patterns and then extends the seeds into full-length patterns. We also propose an iterative version of the algorithm, ICPAM, which recursively reduces the problem to easier problems. Our experiments show that our algorithms are scalable both to long patterns and to long sample sequences. Our algorithms are also very efficient compared with the state-of-the-art methods.

1 Introduction

Finding repetitive patterns in DNA sequences is a fundamental problem in computational biology, since a remarkable fraction of the genomes of complex organisms consists of repetitive patterns. These repetitive patterns play an important role in the identification of novel functional units. Various techniques such as combinatorial methods, statistical modelling, and suffix trees have been applied to the problem, and various forms of repetitive patterns have been studied, such as exact maximal repeats, approximate maximal repeats, repeats with minimum frequency, elementary repeats, and repeat families [6, 2, 9, 10, 8, 19]. In this work, we study the problem of mining composite regulatory patterns in sample genome sequences. DNA sequences are subject to mutations, and therefore the repetitive patterns often occur with some mismatches from the consensus motif. The consensus motif can be represented as an l-mer, a contiguous string of length l. The (l, d)-neighborhood of an l-mer P consists of all possible l-mers with up to d mismatches compared to P. For DNA sequences, whose alphabet size is 4, the size of the (l, d)-neighborhood of any l-mer is Σ_{i=0}^{d} C(l, i)·3^i. We call each occurrence of P in the sample sequence an instance. An l-mer is a valid occurrence of pattern P if the l-mer has at most d mismatches to P. The problem of mining composite regulatory patterns (also called (l, d)-k patterns, or motifs; we will use “pattern” and “motif” interchangeably) is defined as follows: given a set of sequences S, find all l-mers that occur with up to d mismatches at least k times in S. The problem is very challenging because the search space can be very big for some (l, d) configurations. For example, a typical sample is a set of 20 length-600 sequences, and we look for (15,5)-20 patterns, where 20 is a typical setting for k.
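The definition of a valid occurrence above (an l-mer within Hamming distance d of P) lends itself to a direct brute-force check. The following is a minimal sketch; the function and variable names are our own, not from the paper:

```python
def count_valid_occurrences(pattern, sequences, d):
    """Count the l-mers across all sample sequences that lie within
    Hamming distance d of `pattern`, i.e. its valid occurrences."""
    l = len(pattern)
    total = 0
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            mismatches = sum(a != b for a, b in zip(pattern, seq[i:i + l]))
            if mismatches <= d:
                total += 1
    return total

def is_ld_k_pattern(pattern, sequences, d, k):
    """An (l, d)-k pattern has at least k valid occurrences in S."""
    return count_valid_occurrences(pattern, sequences, d) >= k

# toy check: "ACGT" occurs exactly 3 times, and once more with 1 mismatch
sequences = ["ACGTACGTAA", "TTACGAACGT"]
```

Note that this only validates a given candidate; a full miner must also enumerate candidate l-mers (e.g., the neighborhoods of observed l-mers), which is exactly the blow-up the algorithms below try to avoid.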
As the (15, 5)-neighborhood is of size 853,570 for any 15-mer, the expected number of (15,5) occurrences in the sample sequence is 853,570 × 600 × 20 / 4^15 ≈ 9.54. Note that when we compute the expected occurrences of the patterns, for simplicity we consider occurrences at the 600 × 20 = 12000 possible positions. Therefore it is hard to distinguish the true pattern among all 4^15 possible 15-mers without enumerating and validating them all. The (l, d)-k pattern mining problem is well studied and numerous algorithms have been proposed, including both optimal and approximate algorithms [18] [17] [16] [19] [5] [1]. These algorithms solve the problem efficiently when l and d are relatively small and the sample sequence is relatively short. But for relatively large l with respect to d, or long sample sequences, all previous algorithms are usually inefficient. This is because for a fixed l, the number of possible occurrences of (l, d)-k patterns increases exponentially with d. Meanwhile, a long sample sequence leads to a high cost of checking the valid occurrences of a pattern.
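The neighborhood-size arithmetic above can be checked directly. A small sketch (`neighborhood_size` is our own helper name):

```python
from math import comb

def neighborhood_size(l, d, alphabet=4):
    """Number of l-mers within Hamming distance d of a fixed l-mer:
    choose i of the l positions to change, with 3 alternative letters each."""
    return sum(comb(l, i) * (alphabet - 1) ** i for i in range(d + 1))

size = neighborhood_size(15, 5)        # 853,570, as in the text
# 20 sequences of length 600, counted as 12000 positions for simplicity
expected = size * 600 * 20 / 4 ** 15   # expected (15,5) occurrences per 15-mer
```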



In this work, we propose an algorithm CPAM (Composite Regulatory Pattern Miner), which deploys the following two ideas: 1. Instead of searching for the pattern directly, we start by searching for some seeds. The seeds are usually short and contain fewer mismatches, so it is much easier to find all occurrences of these seeds. 2. The occurrences of the seeds are extended into full-length strings, and random projection [18] is applied on the extended full-length strings to recover the candidate patterns. These candidate patterns are then validated against the sample sequence to find the true pattern. We show that these two ideas are able to improve the efficiency of mining composite regulatory patterns significantly, especially for long sample sequences, compared with the current state-of-the-art algorithms. Moreover, for long motifs, we conduct the above process in an iterative manner: we keep reducing the length of the seeds, since for long motifs we cannot use too-short seeds. Thus we consider the discovery of the seeds as a pattern mining process itself and solve it using even shorter seeds, iteratively, until the seeds are short enough for fast processing. We show in our experiments that our algorithm is scalable to both long patterns and long sample sequences.
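The random projection step borrowed from [18] can be sketched as follows: l-mers are hashed by the letters at a random subset of positions, so instances of a motif that differ only at unprojected positions land in the same bucket. This is a toy sketch; all names are illustrative, not the paper's implementation:

```python
import random
from collections import defaultdict

def project_into_buckets(lmers, num_positions, rng):
    """Hash each l-mer by the letters at `num_positions` randomly chosen
    positions; l-mers agreeing on those positions share a bucket."""
    l = len(lmers[0])
    positions = sorted(rng.sample(range(l), num_positions))
    buckets = defaultdict(list)
    for lmer in lmers:
        buckets["".join(lmer[p] for p in positions)].append(lmer)
    return positions, buckets

rng = random.Random(0)
# two identical motif instances, one noisy instance, one unrelated l-mer
lmers = ["AGCTAT", "AGCTAT", "AGCTCT", "TTTTTT"]
positions, buckets = project_into_buckets(lmers, 4, rng)
```

Buckets that collect unexpectedly many l-mers then point at candidate motifs, which still must be validated against the sample sequence.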

2 CPAM algorithm

We propose an algorithm CPAM based on the observation that we sometimes do not need all of the information on the motif occurrences; namely, we may not need to find all k occurrences of the motif in the sequence and can still recover the motif.

2.1 Workflow

We show the workflow of our CPAM algorithm in Figure 1. To identify (l, d)-k patterns, we first identify the occurrences of (l0, d0) patterns, which are considered as seeds. We typically set l0 and d0 to half of l and d, respectively. We then try to find all occurrences of the (l0, d0) seeds using MITRA-count, which is very efficient for short seeds. The number of occurrences is usually big due to the small values of l0 and d0. Next we extend all occurrences of the seeds to length-l strings, and apply random projection on the extended length-(l − l0) substrings. For example, assume l0 = 3 and l = 6, the sample sequence is “AGCTCTAGCTATCAATAGCTAT”, and the seed is “AGC”. For illustration purposes, assume d = 0. Then there are three occurrences of “AGC” in the sample sequence. We extend the three occurrences to length-6 strings, obtaining “AGCTCT”, “AGCTAT” and “AGCTAT”. We next apply random projection on the extended length-3 substrings, namely “TCT”, “TAT” and “TAT”. Assuming we randomly project on two bits of the length-3 substrings, we obtain a pattern “AGCT−T” with 3 occurrences, where “−” marks positions that are not projected. Assuming that after random projection we obtain k0 occurrences of the extended pattern (l, d), we can compute the probability of observing k0 occurrences out of k occurrences of the pattern (l, d) such that all of these occurrences contain an (l0, d0) seed. If the probability is too small, for example less than 0.001, we ignore the pattern. Otherwise we recover the positions that were not projected using consensus bits. In the above example, we obtain a consensus pattern “AGCTAT”, because there are two occurrences of “TAT” but only one occurrence of “TCT”. Finally, for the consensus patterns we check their occurrences in the sample sequence and select the ones which have at least k occurrences.
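The worked example above (exact seed “AGC”, d = 0, l = 6) can be reproduced with a short sketch; the consensus step here is a simple per-position majority vote, and the names are our own:

```python
def seed_extend_consensus(seq, seed, l):
    """Find exact seed occurrences, extend each to length l, and take
    the per-position majority letter as the consensus pattern.
    Seeds too close to the end of `seq` to extend to length l are skipped."""
    occs = [seq[i:i + l] for i in range(len(seq) - l + 1)
            if seq[i:i + len(seed)] == seed]
    consensus = "".join(max(set(col), key=col.count) for col in zip(*occs))
    return occs, consensus

seq = "AGCTCTAGCTATCAATAGCTAT"
occs, consensus = seed_extend_consensus(seq, "AGC", 6)
# occs: ["AGCTCT", "AGCTAT", "AGCTAT"]; consensus: "AGCTAT"
```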

2.2 Iteration for Long Patterns

As we will show later in the experiments, our algorithm is fast for (l, d) motifs such as (15,4) and (17,5), which use (7,2) as seeds. When the motifs are long, it is no longer feasible to use (7,2) sub-motifs as seeds, since the random projection would need to be conducted on the remaining long substrings, which is both time consuming and inaccurate. Therefore we need to use relatively long sub-motifs as seeds. However, as l0 and d0 increase for the (l0, d0) sub-motifs, the running time to identify the seed occurrences increases dramatically. Thus we propose an iterative algorithm ICPAM, where we identify the seeds recursively until the problem is easy enough for our CPAM algorithm. For example, for motifs such as (35,9)-20 or (35,10)-20, we first use sub-motif (25,7) as seeds. To solve the problem for sub-motif (25,7), we further use sub-motif (15,4), and then (7,2), where CPAM is very fast. We show later that ICPAM is able to reduce the motif mining problem to smaller problems effectively and thus can handle relatively long patterns.



Fig. 1. Workflow of CPAM: identify the seeds (l', d') by finding all of their occurrences; extend all seed occurrences to length l; identify candidate (l, d) patterns by applying random projection; validate the candidates against the sample sequence to obtain the final (l, d)-k patterns.

3 Experimental Results

We first tried long motifs. As MITRA-count and MITRA-graph cannot handle such long motifs, we compare our algorithm only with the graph-based algorithm [21], which is able to solve problems with more challenging parameters. The results are shown in Table 1 (left). For the (19,6)-20 and (19,7)-20 problems, we ran CPAM with l0 = 7, d0 = 2. It is obvious that our algorithm finished much faster. For the (21,7)-20 and (21,8)-20 problems, since they are relatively long, we ran ICPAM with seeds (15,4), then seeds (7,2), recursively. Again our algorithm outperforms the graph-based algorithm. We also tried the (35,9)-20 problem; we ran ICPAM with seeds (25,5), then seeds (15,4), then seeds (7,2), recursively, and the problem was solved efficiently. The graph-based algorithm is superior to other algorithms in that it is scalable to long sample sequences. We show that our algorithm is also scalable to long sample sequences. This is because our algorithm uses relatively short motifs as seeds, whose occurrences are easy to find even in long sample sequences. In Table 1 (right), we show the running time of CPAM on sample sequences of length 600, 800, 1000, and 1200, for the (15,4)-20 problem. As we can see, the running time of our algorithm CPAM increases linearly, and thus it is able to handle very long sample sequences. As a comparison, we show the running time of the graph-based algorithm as well, whose running time also increases linearly with the length of the sample sequence. However, our algorithm is more efficient for all sample sequence lengths. The last thing to notice is that both CPAM and ICPAM are approximate algorithms, since random projection is applied. It is possible that one run of CPAM or ICPAM does not find the real motif. However, since the probability of missing the real motif is very small, running the algorithms twice usually will not miss it; in our experiments, we never saw our algorithms miss the real motif twice.

Table 1. (Left) Execution time (sec.) of CPAM and the graph-based algorithm [21] on different (l, d)-k problems; “−” indicates inability to solve the problem. (Right) Execution time (sec.) on the (15, 4)-20 problem for CPAM and the graph-based algorithm [21] for different sample sequence lengths n.

pattern     CPAM   Graph-based
(19,6)-20   153    1599
(19,7)-20   632    2141
(21,7)-20   70     698
(21,8)-20   1038   1081
(35,9)-20   73     −

n      CPAM   Graph-based
600    69     698
800    119    1081
1000   207    1599
1200   354    2141

4 Discussion

In this work, we proposed an algorithm, CPAM, for the composite regulatory pattern mining problem. Our algorithm efficiently seeks short seeds for the pattern using MITRA. The seed occurrences are then extended, and the candidate patterns are recovered by random projection. The candidate patterns are then evaluated against the sample sequence to find the true pattern. We also proposed an iterative algorithm, ICPAM, which aims to handle long patterns: it seeks short seeds recursively until the seeds are short enough, and is therefore able to handle long patterns. We also show that our method is scalable to long sample sequences, since identifying the occurrences of short seeds is relatively easy.



References

1. A. L. Price, N.C. Jones and P.A. Pevzner. De novo identification of repeat families in large genomes. Bioinformatics, 21:i351-i358, 2004.
2. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
3. D. He. Using suffix tree to discover complex repetitive patterns in DNA sequences. In Proc. of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'06), pp. 3474-3477, New York, NY, 2006.
4. D. He and X. Wu. An Efficient Algorithm for Finding Approximate Complex Repetitive Patterns. In Proceedings of the International Conference on Computational and Systems Biology (CASB 2006), Dallas, Texas, 2006.
5. E. Eskin and P.A. Pevzner. Finding Composite Regulatory Patterns in DNA Sequences. Bioinformatics, 1(1):1-9, 2002.
6. E. F. Adebiyi, T. Jiang, M. Kaufmann. An efficient algorithm for finding short approximate non-tandem repeats. Bioinformatics, Vol. 17, suppl. 1, pp. S5-S12, 2001.
7. M. Katti, R. Sami-Subbu, P. Ranjekar and V. Gupta. Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Science, 9(6):1203-1209, 2000.
8. S. Kurtz and C. Schleiermacher. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5), pp. 426-427, 1999.
9. S. Kurtz, E. Ohlebusch, C. Schleiermacher, J. Stoye and R. Giegerich. Computation and visualization of degenerate repeats in complete genomes. In Proc. of the 8th International Conf. on Intelligent Systems for Molecular Biology (ISMB), 2000.
10. S. Kurtz, J. V. Choudhuri, E. Ohlebusch, C. Schleiermacher, J. Stoye, and R. Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 29(22), pp. 4633-4642, 2001.
11. M.S. Waterman. Introduction to computational biology. Chapman & Hall, 1995.
12. X. Zhu and X. Wu. Mining Complex Patterns across Sequences with Gap Requirements. In Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pp. 2934-2940, Hyderabad, India, 2007.
13. PAM250 Amino Acid Scoring Matrix: http://prowl.rockefeller.edu/aainfo/pam250.htm
14. NCBI Basic Local Alignment Search Tool: http://blast.ncbi.nlm.nih.gov/Blast.cgi
15. Research Collaboratory for Structural Bioinformatics (RCSB): Protein Data Bank. http://www.rcsb.org/pdb/home/home.do
16. Waterman, M., Arratia, R. and Galas, D. Pattern recognition in several sequences: consensus and alignment. Bulletin of Mathematical Biology, 46, 515-527.
17. Pevzner, P. A. and Sze, S. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 269-278.
18. J. Buhler and M. Tompa. Finding Motifs Using Random Projections. Journal of Computational Biology, vol. 9, no. 2, 2002, 225-242.
19. Sagot, M. Spelling approximate or repeated motifs using a suffix tree. Lecture Notes in Computer Science, 1380, 111-127.
20. Agrawal, R. and Srikant, R. Fast algorithms for mining association rules. Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
21. Geraci, F., Pellegrini, M., Renda, M.E. An Efficient Combinatorial Approach for Solving the DNA Motif Finding Problem. In ISDA, pp. 335-340, 2009.


Pattern Detection and Functional Mapping for Biomedical Signal Sets

Anish Nair, Kamran Kiasaleh

Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA

Abstract. In recent years numerous studies have been carried out on effective means of pattern detection and characterization for biomedical signal sets. This paper presents an attempt to characterize and map ECG signal sets from the MIT-BIH database for atrial fibrillation scenarios. The objective here is two-fold. First, we present models to characterize the inherent chaotic patterns in the ECG dataset. Second, we present an estimate of the probability density functions of the ECG time series. Keywords: chaotic, ECG, mapping

1 Introduction

Atrial fibrillation is caused by an imbalanced impulse gradient in the ventricles [1], increasing the possibility of cardiac arrest and myocardial infarction, with episode durations ranging from hours to days. In this paper, the time series under consideration is the RR interval. Widespread investigations [2] have been carried out on possible chaotic implications of RR interval time series, with substantiating results. This paper aims to go beyond merely detecting a chaotic footprint in the series, to a function-mapped model based on a database of available chaotic functions. The primary reason for seeking a functional mapping of the time series pattern is that if we know the function behind the observed series, it becomes easier to predict future iterative state values. In this paper we intend to introduce a primary-level modeling tool which forms the basis for higher-level analysis, including future trajectory and probability estimation. Effective estimation of the probability density function can be considered the requisite tool for further analysis. Constraints of invariance can be imposed on the distribution in order to provide a priori information to the Markovian model for predicting future probability estimates. However, the analysis in this paper is limited to deciphering the functional mapping from the ECG data sets. The relative confidence in the accuracy of the probability density estimates relies solely on the effectiveness of the functional mapping of the ECG signal sets to known chaotic functions.


2 Time series analysis

Readings taken from the MIT-BIH PhysioNet database [3] run to a duration of 10 hours, sampled at a rate of 250 samples/second. The RR interval forms the root time series, which is tested for chaotic behavior using the Lyapunov exponent test. The exponents derived in the test characterize the rate of divergence, i.e., how sensitive the time series is to initial conditions. Positive polarity of the exponents implies a chaotic nature (the greater the value of the exponent, the higher the extent of chaos), while negative polarity implies a dissipative series. From the atrial fibrillation database [4] the subset of ECG readings considered is (04048, 04043 and 05091), and from the normal rhythmic behavior database [5] reading 16272 is considered. All analyses and tests are carried out on a sample count of 5000. Once the chaotic nature is established, it is extremely important to narrow the search in terms of dimensionality. The correlation dimension algorithm is used to calculate a dimensional estimate of a signal set which has tested positive for chaotic behavior. The algorithm takes into account the ratio of the number of pairs of data points falling within a distance ε of each other to the total number of pairs in the set; this ratio is termed the correlation function. The slope of the log-log plot of the correlation function against ε, with ε ranging from 0 to 0.085, provides the dimensional estimate for the particular signal set under consideration. Estimates are shown in Table 1.

Table 1. Correlational dimensional estimates and polarity of largest Lyapunov exponents.

ECG reading   Correlation dimension   Polarity of Lyapunov exponent
04048         1.232                   Positive
04043         1.261                   Positive
05091         0.4962                  Positive
16272         0.3352                  Negative

Readings 04048, 04043 and 05091 are analyzed using phase-space embedded plots with the embedding dimension set to 3, delaying the series by one sample. Reading 16272 is ruled out because it tested negative for chaotic behavior. In Figs. 1 and 2, x(t) represents the time series under consideration. The circled regions highlight the structural similarity between the recurring lag profiles of the ECG data sets and standard chaotic mapping functions. From the phase-space plots and correlation dimension estimates, readings 04048 and 04043 have the Henon map as their mapping function, while for reading 05091 the Logistic map is the parent mapping function: the standard range of the correlation dimension is 1.23 to 1.27 for the Henon map and 0.495 to 0.505 for the Logistic map. The Logistic map is one-dimensional, whereas the Henon map is two-dimensional; hence the second step of characterization should focus on determining the dimension associated with the data set. The ensuing section characterizes the subset time series in terms of probability density estimates.
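For reference, the delay embedding and the two candidate maps can be generated as below. This is a sketch using conventional parameter values (r = 4 for the Logistic map; a = 1.4, b = 0.3 for the Henon map); the abstract does not state the values used.

```python
import numpy as np

def delay_embed(x, dim=3, tau=1):
    """Embed a scalar series into dim-dimensional phase space with lag tau."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def logistic_map(n, r=4.0, x0=0.4):
    """One-dimensional chaotic reference map: x_{k+1} = r x_k (1 - x_k)."""
    x = np.empty(n)
    x[0] = x0
    for k in range(n - 1):
        x[k + 1] = r * x[k] * (1.0 - x[k])
    return x

def henon_map(n, a=1.4, b=0.3):
    """Two-dimensional chaotic reference map; returns the x component."""
    x, y = np.empty(n), np.empty(n)
    x[0], y[0] = 0.1, 0.1
    for k in range(n - 1):
        x[k + 1] = 1.0 - a * x[k] ** 2 + y[k]
        y[k + 1] = b * x[k]
    return x
```

Embedding either map with dim = 3 and tau = 1 reproduces the right-hand panels of Figs. 1 and 2.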


Fig. 1. Phase plot for reading 05091 (left) and Logistic map (right).

Fig. 2. Phase plot for reading 04048 (left) and Henon map (right).

3 Characterization

The kernel density estimation algorithm [6] is used to calculate the requisite pdfs, as defined in (1):

\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\, h_d} \exp\left( -\frac{(x_d - x_{n,d})^2}{2 h_d^2} \right)   (1)

In this equation, D denotes the dimensionality of the mapping function, N is the total number of data points, and h_d is the bandwidth (bin size) under consideration. As the equation shows, the kernel is Gaussian; the variable factor is the bandwidth, whose optimal value minimizes the mean squared error (MSE). In this paper the optimal value is taken as in (2):


h_d = \sigma \left( \frac{4}{3N} \right)^{1/5}   (2)

where σ is the standard deviation of the data set and N is set to 5000 consecutive data points obtained from MIT-BIH. For the one-dimensional case, the probability density function is a summation of Gaussian kernels along a single axis; for higher-dimensional cases, it is the product across individual axes. The kernel density estimates for the one- and two-dimensional scenarios are shown in Fig. 3.
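A one-dimensional version of (1) and (2) can be sketched as follows; this is our own illustrative code, with the bandwidth following the rule-of-thumb form of (2).

```python
import numpy as np

def bandwidth(x):
    """Rule-of-thumb optimal bandwidth h = sigma * (4 / (3N))**(1/5), as in (2)."""
    return x.std(ddof=1) * (4.0 / (3.0 * len(x))) ** 0.2

def kde_1d(x, grid):
    """Gaussian kernel density estimate: a sum of kernels on a single axis, as in (1)."""
    h = bandwidth(x)
    u = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))
```

For D dimensions the density is the product of such axis-wise kernels, per (1).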

Fig. 3. Probability density estimate for reading 05091 (left) and density estimate for reading 04048 (right).

References

1. T. A. Denton, G. A. Diamond, R. H. Helfant, S. Khan, H. Karagueuzian, "Fascinating rhythm: A primer on chaos theory and its application to cardiology", American Heart Journal, vol. 120, no. 6, pp. 1419-1440, Dec 1990.
2. K. Wang, Y. Zhao, X. Sun, T. Weng, "A simple way of distinguishing chaotic characteristics in ECG signals", 3rd International Conference on Biomedical Engineering and Informatics (BMEI), vol. 2, pp. 713-716, 16-18 Oct. 2010.
3. A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals", Circulation, 101(23):e215-e220, June 2000. [http://circ.ahajournals.org/cgi/content/full/101/23/e215]. PMID: 10851218; doi: 10.1161/01.CIR.101.23.e215.
4. Atrial fibrillation database: http://www.physionet.org/physiobank/database/afdb/
5. Normal sinus rhythm database: http://www.physionet.org/physiobank/database/nsrdb/
6. V. A. Epanechnikov, "Non-parametric estimation of a multivariate probability density", Theory of Probability and its Applications, vol. 14, pp. 153-158, 1969.


MapBase: A Virtual Biological ID Map Database

Hasan M. Jamil

Department of Computer Science Wayne State University, USA [email protected]

Abstract. Traditionally, IDs are used to describe, cross-reference and link objects in biological databases and applications. Specialized ID converters are then used to map objects of different types to establish correspondence. Studies show that the quality of ID conversion varies widely, producing inaccurate ID correspondences. Authoritative databases such as GeneCards, PIR and BioMart try to compensate for this shortcoming by maintaining ID relationships for specific sets of ID types. Unfortunately, users of these resources are often forced to settle for incomplete or incorrect sets of mappings when they do not have intimate knowledge of these resources. In this paper, we introduce a new lazy, on-demand ID mapping database, called MapBase, which allows arbitrary mapping of biological IDs. MapBase materializes ID correspondences in real time from other databases and converters when queried, and maintains the materialized view using a negative provenance protocol. Thereby, MapBase guarantees maximum possible accuracy and currency by using the best possible resources and by prioritizing resources based on quality.

1 Introduction

Object identifiers, also called IDs, are widely used to represent biological entities of arbitrary conformations. However, the global nature of life sciences research and the distributed authorities that generate them contribute to an often chaotic but seemingly unavoidable environment in which an object is assigned numerous IDs by various groups and databases, and is linked to other objects using these IDs. Consider, for example, the gene symbol SMCR (Smith-Magenis syndrome region). This gene has been assigned multiple IDs by various databases and authorities. For example, the HUGO Gene Nomenclature Committee has assigned it HGNC ID: 11113. However, the GeneCards database lists SMCR's Entrez ID as 6600, with RAI1 and SMS as its two aliases, while HGNC lists its Entrez ID as 11113. Furthermore, HGNC notes that the symbol SMCR has been withdrawn, while GeneCards notes that Gene ID: 6600 was discontinued on 3-Aug-2010 and replaced with Gene ID: 10743. Since cross-referencing and linking objects spread across various databases are done mostly through IDs, accuracy is paramount to ensuring quality information processing. Biologists have been trying to ensure the accuracy of ID conversion and mapping for quite some time with limited success. The two main


approaches are to design ID conversion tools such as GeneID Converter [2] and IdBean [7], and to maintain ID correspondences in authoritative databases such as GeneCards, BioMart and SWISS-PROT. Recent studies [5, 1] show that, despite significant efforts, progress toward ensuring conversion accuracy has been limited, largely because most converters and mapping databases are designed for a particular type of ID, and because they usually rely on polling the required information from other databases, which in turn rely on other resources, thereby compounding the inaccuracy and complexity. In this paper, our goal is to develop a universal online interface, called MapBase, as an ID mapping service that lets users map IDs of arbitrary types with the highest possible accuracy. As we describe in section 3, we do not rely on a specific database or conversion tool to poll our information. We dynamically decide which resource to use for a specific mapping based on a priority order of converters and databases. Since IDs may be related to one another with 1-1, 1-M, M-1 and M-M cardinalities, we recognize three querying options: any, unique and all. With option "any", an arbitrary set of mappings is produced for each ID in the query set without any guarantee of completeness. The option "unique" produces a single 1-1 mapping if it exists, and rejects mappings that violate this constraint. Finally, option "all" produces all possible 1-M, M-1, and M-M mappings from all sources. The MapBase interface we present also safeguards against obsolete IDs and mappings produced by other resources whenever such information is available online, as discussed earlier in the context of the gene SMCR in the HGNC and GeneCards databases.

2 Query Language for MapBase

To query online resources for ID maps, we have developed a simple declarative query language in [6] with two basic functions: converting an arbitrary type of ID to a set of IDs of arbitrary types, and cross-referencing objects with different IDs using an operation similar to the join in relational databases. For brevity, we discuss only the convert statement syntax and semantics below, without technical details. We use the implementation of this statement as the core engine for the MapBase interface described in section 3.

convert r into t1 [any|unique|all], . . . , tk [any|unique|all] [using c1, . . . , cn];

In this statement, r is a unary relation of domain Dt, t1, . . . , tk are type names, and the ci's are online converters. The result is a relation r′ ⊆ Dt × Dt1 × . . . × Dtk. The options [any|unique|all] allow mapping elements in r to any available ID, exactly one, or all possible IDs, respectively, as noted earlier. If no option is specified, any is assumed. Furthermore, if the using clause is present, mapping is attempted only from the list of converters in this clause.
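The any/unique/all semantics can be illustrated with a small sketch. This is a simplification of our own: a converter is modeled as a plain dictionary, and "unique" is treated as requiring exactly one target (ignoring the M-1 direction of the 1-1 check).

```python
def convert(ids, mapping, option="any"):
    """Resolve each source ID through a converter table (ID -> list of target IDs).

    'any'    -> one arbitrary target per ID (no completeness guarantee)
    'unique' -> a target only when exactly one exists; otherwise rejected
    'all'    -> every known target
    """
    out = {}
    for i in ids:
        targets = mapping.get(i, [])
        if option == "all":
            out[i] = list(targets)
        elif option == "unique":
            out[i] = list(targets) if len(targets) == 1 else []
        else:  # "any" is the default, as in the convert statement
            out[i] = targets[:1]
    return out
```

With a hypothetical table {"A": ["x"], "B": ["y", "z"]}, option "unique" maps A but rejects B, while "all" returns both of B's targets.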

3 MapBase ID Conversion Database

MapBase is a lazy, on demand and incremental materialized view [4] database in which the view is maintained using a negative provenance model [8]. As shown


in figure 1, it has four main components: (i) an ontology, (ii) a materialized view of ID mappings, (iii) a provenance manager, and (iv) a query processor.

Ontology: The ontology consists of three components: an index of converters, a priority relation over them, and a specialization hierarchy of ID types. The index is a user-updatable hash structure that lists all online ID conversion resources, such as tools and databases, that can be queried to map IDs; the only user update allowed is the insertion of new ID converters. For each such resource, it stores the name of the converter, the types of IDs it can convert from and to (called conversion pairs), the URL of the resource, its status as an authoritative converter for an ID type, and its update broadcast policy. The authority status of a converter for a type of ID allows its mappings to arbitrate over conflicting or incorrect ID mappings and makes its mapping final. Its update broadcast policy, on the other hand, helps correct mapping errors and supports view maintenance against provenance queries. The ontology also includes a priority relation ⪯ of ID converters based on conversion pairs, in the form of a partial order. The index can be queried to collect a ranked list of converters that can convert a specific ID type to another according to the priority relation. The specialization hierarchy groups IDs according to their types under a universal identifier, e.g., HGNC is a gene symbol ID, and UniProt is a protein ID. These type symbols are used in all queries, descriptions and database schemas, as appropriate. Accordingly, MapBase can only convert IDs of types included in this hierarchy.

Fig. 1. MapBase architecture and components.

Materialized ID Map View: The materialized map view database is a partitioned set of quadruples ⟨i1, i2, c, o⟩, where i1 is mapped to i2 using converter c with convert statement option o ∈ {any, unique, all}. The view is partitioned into sets for easy lookup based on the type pairs ⟨t1, t2⟩, where t1 is the type of ID i1 and t2 the type of i2 (e.g., HGNC to NetAffx), as described in the ID hierarchy. Whenever a map query is processed and responses are generated, those responses are materialized in the appropriate partition for future use.
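A minimal sketch of the partitioned view (the class and field names are ours, not MapBase's):

```python
from collections import defaultdict

class MapView:
    """Materialized view of quadruples (i1, i2, c, o), partitioned by type pair (t1, t2)."""

    def __init__(self):
        self.partitions = defaultdict(set)

    def add(self, t1, t2, i1, i2, converter, option):
        """Materialize one mapping in the partition for its type pair."""
        self.partitions[(t1, t2)].add((i1, i2, converter, option))

    def lookup(self, t1, t2, i1):
        """Return all materialized targets of i1 within the (t1, t2) partition."""
        return [q for q in self.partitions[(t1, t2)] if q[0] == i1]
```

Partitioning by type pair keeps each lookup confined to one small set instead of the whole view.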

Provenance Manager: The provenance engine has two main components: a validator and an integrator. Once a query is submitted, the query relation is joined with the map view relation to compute the set of IDs that already exist


in the view; these are removed from the query relation. The validator then checks whether the mappings are still valid by running a provenance query against the online source converters. If the mappings are still valid, the query relation is forwarded to the query processor for execution. Otherwise, the failed IDs are added back to the query relation and the corresponding map view entries are removed by the integrator.

Query Processor and Query Interface: The heart of MapBase is its query processor, which drives all computations. It has two major components: the differential map generator and the online query processor. User queries pipe through the differential map generator, which determines the set of IDs that actually require online computation by isolating the subset of IDs already in the view database. From the mapping type information in the convert statement, it identifies the online converters that can potentially generate the mappings by consulting the hash index and the ID hierarchy. It then analyzes the submitted query with the help of the provenance manager to determine the subset of IDs needing computation. The query is then transformed into a set of web queries and submitted to the online converters. Appropriate schema matching and wrapping functions are used to match the remote converter form schemas and extract the returned responses. The pre-computed response from the map view and the computed response are then returned as a single response to the user. The MapBase query interface also allows users to query the web to find and study new converters. Users may likewise add new ID converters to the MapBase database simply by supplying the URL to the system. Since MapBase uses the LifeDB data integration system [3] and its query language BioFlow as its implementation and execution platform, the querying and inclusion of new converters are transparent to users and require no additional process.

References

1. Diego Forero Blog. http://www.scribd.com/doc/18966500/Id-Converters-Test.
2. A. Alibés, P. Yankilevich, A. Cañada, and R. Díaz-Uriarte. IDconverter and IDClight: Conversion and annotation of gene and protein IDs. BMC Bioinformatics, 8, 2007.
3. A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. M. Jamil, and L. Lipovich. On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In DEXA, 2009.
4. S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In VLDB, pages 577–589, 1991.
5. S. Draghici, S. Sellamuthu, and P. Khatri. Babel's tower revisited: a universal resource for cross-referencing across annotation databases. Bioinformatics, 22(23):2934–2939, 2006.
6. H. M. Jamil. Improving integration effectiveness through ID mapping based record linkage in biological databases. Technical report. Under review, IEEE BIBM 2012.
7. S. Lee, B. Kim, H. Kim, H. Lee, and U. Yu. IdBean: a Java GUI application for conversion of biological identifiers. BMB Reports, 44(2):107–112, Feb. 2011.
8. A. Meliou, W. Gatterbauer, K. F. Moore, and D. Suciu. Why so? or why no? Functional causality for explaining query answers. In MUD, pages 3–17, 2010.


Investigations on Elastic Network Models of Coarse- Grained Membrane Proteins

Kannan Sankar1, Michael T. Zimmermann2, 3 and Robert L. Jernigan1, 2, 3

1Bioinformatics and Computational Biology Graduate Program 2Department of Biochemistry, Biophysics and Molecular Biology 3L. H. Baker Center for Bioinformatics and Biological Statistics Iowa State University, Ames IA 50011, USA [email protected], [email protected], [email protected]

Abstract. Despite their overwhelming importance, relatively few structures of membrane proteins have been experimentally solved. Given that the majority of drugs target membrane proteins, insights from structural and functional analysis of existing structures with computational tools can be extremely useful. The dynamics and function of membrane proteins relate closely to the membrane in which they are embedded. Here we use anisotropic elastic network models (ANMs) to investigate the motions of a G-protein coupled receptor (GPCR), the β-2 adrenergic receptor, in the absence and presence of membranes, where the surrounding patch of membrane has various shapes and sizes, and also using different parameters of the ANM. Our results indicate that the normal modes of the protein are significantly modified by the presence of the membrane. The extent of membrane-induced modifications and the membrane's impact upon proposed functional motions are investigated.

Keywords: Membrane proteins, beta-2 adrenergic receptor, elastic network model, coarse-grained model, normal mode analysis

1 Introduction

Membrane proteins play a crucial role in cells, performing diverse functions ranging from signal transduction and cell adhesion to small-molecule transport and catalysis. They are also the largest class of protein drug targets [1]. However, the structures of only a few membrane proteins have been solved experimentally, due to difficulties in expressing and crystallizing them [2]. This makes computational approaches and simulations particularly important for understanding their structure-function relationships.

Computational analysis of membrane proteins is complicated by their large size and also by the fact that the membrane itself can play a significant role in modulating their effective dynamics. Elastic Network Models (ENMs), including Gaussian (GNMs) and Anisotropic Network Models (ANMs), offer a


fast and convenient way to analyze such large systems [3,4,5]. By modeling complex systems as sets of particles interconnected by springs (whenever two particles lie within a distance cutoff Rc) and applying normal mode analysis, ENMs can capture the most important collective motions of the parts of a system. Experimental studies early in protein science demonstrated that substantial structural fluctuations occur in proteins, and that these fluctuations are essential to protein function [6,7]. Previous studies have shown that the mean-square fluctuations of atoms obtained from ENMs correlate well with experimentally determined temperature factors and NMR ensembles of structures, as well as with the results of principal component analysis of ensembles of independently determined structures [8,9]. Customarily, the first few slowest normal modes capture the functional motions for a wide range of proteins [4,5].

Our study focuses on the human β-2 adrenergic receptor (ADRB2), a G-protein coupled receptor (GPCR) involved in the response to adrenaline-mediated smooth muscle relaxation. We investigate differences in the normal modes obtained from ANMs of the protein in the presence and absence of membrane, using membrane models of different shapes (cubic and cylindrical) and sizes, and also using different ANM parameters.

2 Methods

The X-ray crystal structure of ADRB2 was obtained from the Protein Data Bank (PDB ID: 2RH1) [10]. A cubic POPC (1-palmitoyl-2-oleoyl phosphatidylcholine) membrane with sides of length 100 Å is built using Membrane Builder in the VMD 1.9 (Visual Molecular Dynamics) package [11]. Cylindrical membranes with radii from 27 Å to 41 Å (in steps of 2 Å) are built after embedding the protein, by retaining only the POPCs within the particular radius. The protein is coarse-grained to Cα atoms only, and the POPC molecules are 'vertically' coarse-grained (along the length of the POPC) to retain the atoms N, P1 and O21 in the polar head group and C32, C24, C28, C212, C216, C36, C310 and C314 in the hydrophobic tails. This provides a membrane with somewhat more detail than the protein, so we further utilized a spherical coarse-grained membrane, obtained by iteratively removing atoms within a 5 Å cutoff to ensure uniform density. ENMs are generated from a subset of the atomic coordinates (the coarse-grained structures) connected by harmonic springs with unit stiffness γ = 1 kcal/(mol·Å²) between points within a cutoff radius Rc = 13 Å (unless otherwise stated). Similarities between modes of motion from different models are measured in terms of overlap (O), cumulative overlap (CO) and root mean-square inner product (RMSIP) between the normal mode vectors, as described in detail elsewhere [12, 13].
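As a rough illustration of the spring-network construction and the RMSIP measure, the sketch below uses a GNM-style scalar network rather than the full 3N-dimensional ANM of the study; γ and Rc follow the values in the text, and all names are ours.

```python
import numpy as np

def kirchhoff_matrix(coords, rc=13.0, gamma=1.0):
    """Spring network: nodes within cutoff rc are joined by springs of stiffness gamma."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    k = -gamma * ((d < rc) & (d > 0.0)).astype(float)
    np.fill_diagonal(k, -k.sum(axis=1))  # rows sum to zero, as for a graph Laplacian
    return k

def slow_modes(k, n_modes=10):
    """Lowest non-trivial normal modes of the spring network."""
    w, v = np.linalg.eigh(k)
    return w[1:n_modes + 1], v[:, 1:n_modes + 1]  # skip the zero (rigid-body) mode

def rmsip(a, b):
    """Root mean-square inner product between two sets of orthonormal mode vectors."""
    return np.sqrt(((a.T @ b) ** 2).sum() / a.shape[1])
```

Applied to two models of the same structure (e.g., with and without membrane nodes), rmsip close to 1 indicates similar mode subspaces, and values well below 1 indicate the membrane has reshaped the slow motions.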


3 Results and Discussions

The first 10 normal modes of the free ADRB2 protein show little overlap with the first 10 modes of the protein in models where the membrane (cubic or cylindrical) is included. The models that include the membrane yield higher mean-square fluctuations in various regions of the protein, especially the loop regions (Fig. 1). On visualizing the modes, we find that, in the presence of the membrane, the motions exhibited by the free protein are highly damped. Also, there is only moderate overlap between the modes generated using cubic and cylindrical membranes (Fig. 2a), perhaps indicating that the local membrane environment around the protein can have a major impact on the protein's functional motions, which may affect, for example, the formation of membrane rafts. Also, cylindrical membranes of increasing radii yield modes of decreasing overlap with the modes of the free protein. Although spherical coarse-graining (c-g) of the membrane yields similar motions, some specific individual modes of the vertical c-g are absent in the spherical c-g motions (Fig. 2b). We varied the values of γ and Rc for the protein Cα atoms, as well as for the head and tail atoms of the lipid molecules in the bilayer. Similar behaviors are observed between the modes obtained with such models and the ones reported here, demonstrating insensitivity to these details (results not shown). Our results, however, suggest that the membrane does play an important role in affecting the functional motions of a membrane protein.

Fig. 1. Mean square fluctuations (MSFs) from ANMs built (a) with and (b) without membrane are mapped onto the Cα backbone of ADRB2 structure in a spectral coloring scheme. Most of the significant differences are in the extra- and intra-cellular loops and the N- and C-termini. Red represents regions with high MSF while blue represents regions with low MSF.


Fig. 2. (a) Overlaps between the first 10 normal modes of the model with cubic membranes and cylindrical membrane of radius 26.5Å is only moderate (gray squares) indicating a significant influence of these details on the motion. (b) Overlap between the first 10 normal modes of the model with cylindrically c-g membrane and vertically c-g membrane is significantly high (dark squares) showing that the motions are similar. The gray-scale reflects the extent of overlap in the directions of the motions between modes as shown in the legend bar on right.

4 References

1. Terstappen, G.C., Reggiani, A.: In silico research in drug discovery. Trends Pharm. Sci. 22, 23–26 (2001)
2. Ostermeier, C., Michel, H.: Crystallization of membrane proteins. Curr. Opin. Str. Biol. 7, 697–701 (1997)
3. Tirion, M.M.: Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys. Rev. Lett. 77, 1905–1908 (1996)
4. Bahar, I., Atilgan, A.R., Erman, B.: Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Fold. Des. 2, 173–181 (1997)
5. Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O., Bahar, I.: Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505–515 (2001)
6. Careri, G., Fasella, P., Gratton, E.: Enzyme dynamics: The statistical physics approach. Annu. Rev. Biophys. Bioeng. 8, 69–97 (1979)
7. Weber, G.: Energetics of ligand binding to proteins. Adv. Protein Chem. 29, 1–83 (1975)
8. Yang, L., Song, G., Carriquiry, A., Jernigan, R.L.: Close correspondence between the motions from principal component analysis of multiple HIV-1 protease structures and elastic network modes. Structure 16, 321–330 (2008)
9. Bakan, A., Bahar, I.: Computational generation of inhibitor-bound conformers of p38 MAP kinase and comparison with experiments. Pac. Symp. Biocomput. 181–192 (2011)
10. Cherezov et al.: High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science 318, 1258–1265 (2007)
11. Humphrey, W., Dalke, A., Schulten, K.: VMD - Visual Molecular Dynamics. J. Molec. Graphics 14.1, 33–38 (1996)
12. Tama, F., Sanejouand, Y.H.: Conformational change of proteins arising from normal mode calculations. Protein Eng. 14, 1–6 (2001)
13. Leo-Macias, A., Lopez-Romero, P., Lupyan, D., Zerbino, D., Ortiz, A.R.: An analysis of core deformations in protein superfamilies. Biophys. J. 88, 1291–1299 (2005)


De novo Genome and Transcriptome Sequencing of Social Paper Wasps: Application to Understanding Parasite Manipulation of Host Behavior

Ruolin Liu (1), Daniel Standage (1,2) and Amy L. Toth (1, 3, 4)

(1) Program in Bioinformatics and Computational Biology, Iowa State University (2) Genetics, Development & Cell Biology Department, Iowa State University (3) Department of Ecology, Evolution, and Organismal Biology, Iowa State University (4) Department of Entomology, Iowa State University

In the recent history of biology, next-generation sequencing has substantially widened the scope for studying the genetics and evolution of almost any trait of interest in any organism. We are developing genomic resources for studying the evolution of social behavior in paper wasps of the genus Polistes, a group of social insects. These wasps form small "primitively eusocial" societies containing queens and altruistic workers. Although they cooperate to form a "eusocial" colony, they are considered "primitively eusocial" because workers retain the ability to become queens, and there is a substantial amount of conflict and aggression among females over opportunities to reproduce. These characteristics make Polistes an ideal system for testing hypotheses about the genetic basis of the evolution of altruistic behavior. Genomic tools are playing a critical role in helping us understand this emerging model system. We believe that a complete Polistes genome sequence can greatly enhance our ability to study the genetics of social behavior via comparative genomic and transcriptomic analyses, and greatly facilitate the identification of regulatory regions and epigenetic modifications affecting sociality. Using multiple Illumina/Solexa libraries derived from the genome of a single haploid male, we rapidly and efficiently sequenced and de novo assembled a draft genome sequence of Polistes dominulus. The genome is approximately 300 Mb, and the assembly represents over 100X coverage of the genome. We also generated ABI SOLiD RNA-sequence data from brains of the same species, which is being used to feed the MAKER annotation pipeline. We are using the draft genome and RNA-seq data to study a fascinating aspect of P. dominulus behavior: an aberrant nest-desertion behavior displayed by workers after infection by the strepsipteran endoparasite Xenos vesparum.
Parasitized workers lose altruistic behavior and do not help at the nest; instead they sit in aggregations (typical of overwintering behavior by queens) in nearby vegetation. Using the ABI SOLiD RNA-sequence data, we are quantifying brain transcriptomic differences among three different samples (normal aggregating queens, normal workers and parasitized workers). In doing so, we are investigating whether the parasite "manipulates" the brain gene expression of the host, and we predict that the parasite shifts the gene expression patterns of workers to mimic those of aggregating queens.

Keywords: Paper wasp, Genome sequencing, Transcriptomics, Social behavior


Genome sequencing, assembly, annotation and comparative analysis of Pseudomonas fluorescens NCIMB11764 bacterium

Claudia Vilo1, Michael Benedik2, Daniel Kunz1* and Qunfeng Dong1,3* 1 University of North Texas, Department of Biological Sciences. 2 Texas A&M University, Department of Biology. 3 University of North Texas, Department of Computer Science and Engineering. *Corresponding authors.

Abstract

The bacterium Pseudomonas fluorescens NCIMB 11764 (Pf11764) has been found capable of utilizing cyanide as its sole nitrogen source. Cyanide is a potent poison found naturally in various environments; its biodegradation should therefore have evolved within species exposed to it. Cyanide metabolism in this bacterium depends on the induction of an enzyme described as cyanide oxidase (CNO) [1,2], which is made up of four protein components: NADH oxidase (Nox), NADH peroxidase (Npx), cyanide nitrilase (CNN), and carbonic anhydrase (CA) [3,4]. The complete molecular properties of this enzyme and the genetic basis of cyanide utilization by Pf11764 are not well understood. Therefore, to learn more about the unique genetic potential of Pf11764 for adaptation to cyanide, we characterized the bacterium's genome using next-generation sequencing technology. Specifically, the genome was sequenced using Illumina/Solexa technology with paired-end libraries. The number of reads was 16,174,118, with a total length of 841,054,136 bp, giving a coverage of 120x. The reads, 52 bp long on average, were obtained in FASTQ format. We then reconstructed the genome by assembling those reads using the SOAPdenovo software (http://soap.genomics.org.cn/soapdenovo.html), which is based on the de Bruijn graph algorithm. SOAPdenovo uses a .config file to specify the assembly parameters. We tried several parameter settings, but for the following steps we chose a maximum read length of 50 bp, average insert sizes of 200 bp and 2000 bp, and a k-mer size of 31 bp. The assembly yielded 2,751 contigs with an average length of 2,551 bp, and 150 scaffolds with an average length of 46,263 bp. Once the reads were assembled into longer consensus sequences (scaffolds), we used GeneMark.hmm (http://exon.gatech.edu/) for gene prediction.
Using a ribosomal binding site (RBS) model, we found 6,432 ORFs. Our first aim was to find the proteins composing the CNO enzyme. We used the Blast2GO tool (http://www.blast2go.com) to characterize the predicted genes. Blast2GO is Java-based software that allows the comparison of unknown protein sequences with reference sequences from GenBank. We then searched the gene annotations for putative CNO enzyme components.
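The basic sequencing arithmetic above can be checked with a small sketch; the ~7 Mb genome size used below is our own inference from the reported 120x coverage, not a figure stated in the abstract, and N50 is a standard assembly summary the abstract does not report.

```python
def fold_coverage(total_bases, genome_size):
    """Sequencing depth: total bases sequenced divided by genome length."""
    return total_bases / genome_size

def mean_read_length(total_bases, n_reads):
    """Average read length from the reported totals."""
    return total_bases / n_reads

def n50(lengths):
    """N50: the contig length at which half of the assembly lies in contigs at least this long."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

With the reported 16,174,118 reads totaling 841,054,136 bp, the mean read length works out to exactly 52 bp, and a ~7 Mb Pseudomonas genome gives roughly the 120x coverage cited above.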

Table 1. Potential genes coding for CNO enzyme components.

CNO component   Predicted genes
Nox             Six predicted genes identified as flavin oxidoreductase NADH oxidases
Npx             One predicted gene candidate
CNN             Eleven genes annotated as nitrilases
CA              Three genes identified as carbonic anhydrases

31 ISBRA 2012 Short Abstracts

We also used the Blast2go tool annotations with the Non-Redundant database of NCBI to characterize all the predicted genes of the Pf11764 genome. In general, several metabolic pathways in Pf11764 were similar to those found in other Pseudomonas species. Consistent with Pseudomonas carbohydrate metabolism, we found no presence of 6-phosphofructokinase, which indicates that Pf11764 does not perform the Embden-Meyerhof pathway. An important proportion of the predicted genes showed hydrolase and transferase activity, which is in accordance with the metabolism of the soil and plant components of the usual Pseudomonas environment. Additionally, predicted genes with transcription factor and nucleotide/nucleic acid binding activity account for the high regulation at the genome level, which is expected for large genomes. More than 1,400 predicted genes were similar to hypothetical proteins from previous Pseudomonas genome projects. Also, more than 1,200 predicted genes were similar to transporter proteins, more than 500 were similar to membrane proteins, and more than 180 were similar to ion transporter proteins, consistent with the capacity for soil and plant surface colonization. Interestingly, 39 predicted genes were similar to sigma factors, including anti-sigma factors and flagella factors. Catabolic capabilities were also present in this genome, with predicted genes similar to proteases, lipases and aminotransferases. Housekeeping genes that are used for phylogeny purposes were also found in the genome: gyrB, gyrA, rpoA, rpoB, rpoD, recA, gltA and gapA. Cyanide often binds in the environment with metal ions such as iron, cobalt, copper, nickel and zinc; therefore it was interesting to search for related proteins. Several siderophore receptors were identified, which indicates an iron acquisition capacity. Also, several predicted genes were similar to metal transporter proteins, including nickel, copper, zinc, iron and cobalt transporter proteins.

In order to understand the mechanism that allows this bacterium to adapt to cyanide environments, and what makes it different from closely related species, we compared Pf11764 with three Pseudomonas fluorescens strains. We downloaded from GenBank the entire genomes of P. fluorescens SBW25, P. fluorescens Pf0-1 and P. fluorescens Pf-5. We used the Genemark.hmm program to annotate their genes, as we did with the genome of Pf11764, and a Perl script to compute the number of bases and GC content. The general features showed that Pf11764 had almost the same genome size as estimated at the assembly stage. Also, the number of genes predicted was very similar to the other reference Pseudomonas genomes.
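The base-count and GC-content computation can be sketched as follows (shown in Python rather than Perl for illustration; the FASTA file name passed in is hypothetical):

```python
def gc_stats(fasta_path):
    """Return (total_bases, GC_percent) over all sequences in a FASTA file."""
    total = gc = 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):      # skip FASTA header lines
                continue
            seq = line.strip().upper()
            total += len(seq)
            gc += seq.count("G") + seq.count("C")
    return total, 100.0 * gc / total if total else 0.0
```

Running it on an assembled scaffold file, e.g. `gc_stats("pf11764_scaffolds.fa")`, would yield the genome size and %GC reported in Table 2.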

Table 2. Genome comparison of Pf11764 with P. fluorescens Pf0-1, Pf-5 and SBW25.

                        Pf11764     Pf0-1       Pf-5        SBW25
# of bases              6,939,480   6,438,405   7,074,893   6,722,539
%GC                     56.8        60.5        63.3        60.5
16S rRNA                2           6           5           5
ORFs (Genemark.hmm)     6,432       5,815       6,370       6,117
tRNAs (tRNAScanSE)      40          73          71          66

Our second goal was to compare the predicted genes of the Pf11764 sequenced genome with the predicted genes of the known reference P. fluorescens strains. We used standalone BLAST (ftp://ftp.ncbi.nih.gov/blast/) to compare the genes of Pf11764 with those of Pf0-1, Pf-5 and SBW25.
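Assuming the BLAST searches were run with tabular output (BLAST+ `-outfmt 6` or the legacy `-m 8` format), the split into ORFs with and without hits shown in Table 3 could be computed along these lines (an illustrative sketch, not the authors' script):

```python
def split_by_hit(query_ids, blast_tab_lines, max_evalue=1e-20):
    """Partition query IDs into those with at least one BLAST hit at or
    below max_evalue and those without (orphan genes)."""
    with_hit = set()
    for line in blast_tab_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue                      # skip malformed lines
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            with_hit.add(query)
    orphans = [q for q in query_ids if q not in with_hit]
    return sorted(with_hit), orphans
```

The orphan list produced here is what feeds the re-annotation with Blast2go described below.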


Table 3. Comparison of the predicted genes from the Pf11764 genome (6,432 ORFs) with P. fluorescens Pf0-1, Pf-5 and SBW25, using BLAST with e-value 1e-20.

                ORFs with hit                          ORFs without hit
Species         Number   Avg. length (nt)   GC %       Number   Avg. length (nt)   GC %
Pf0-1           5,064    1,046              60%        1,368    685                59%
Pf-5            4,752    1,047              60%        1,680    747                56%
SBW25           4,697    1,060              60%        1,735    723                56%

To investigate the function of the orphan genes of Pf11764, we performed a new gene annotation of them using the Blast2go tool. Interestingly, the orphan genes showed a high proportion of metal binding proteins.

In addition, we performed a differential analysis of the orphan genes using the Blast2go tool, against Pseudomonas fluorescens Pf-5 and Pseudomonas fluorescens SBW25. The analysis showed that metal binding genes were overrepresented in Pf11764. Differential analysis against the three Pseudomonas fluorescens strains also showed an overrepresentation of transport and localization genes.

Figure 1. Characterization of orphan genes detected after comparison of Pf11764 with Pf-5 using BLAST. Molecular function of the annotated genes using Blast2go.

Our results indicate that the presence of the metal ion binding proteins could be directly related to cyanide metabolism in Pseudomonas fluorescens Pf11764. The utilization of the CNO enzyme for cyanide degradation might be part of a specific pathway that also involves metal ion binding proteins and transporter proteins for the initial uptake of cyanide from the environment.


Additionally, a large number of genome rearrangements were observed in Pf11764 when compared with the reference Pseudomonas fluorescens genomes.

Conclusions

Potential genes encoding putative enzymatic components shown earlier [3] to be necessary for oxidative cyanide metabolism by Pf11764 were identified. Further research is necessary before assigning specific genes to previously identified enzymes. Differential analysis of orphan genes following a comparison of the Pf11764 genome with the related Pf0-1, Pf-5 and SBW25 strains revealed an over-representation of metal-binding, transport and localization genes in Pf11764. It is well known that cyanide binds metals as a ligand, and such metal-complexed species are generally much less toxic than cyanide itself. The high incidence of metal ion binding and transporter proteins in Pf11764 could indicate that such genes play important roles in cyanide detoxification and transport. A large number of genome rearrangements were observed when comparing the structure of the Pf11764 genome with that of Pf0-1, Pf-5 and SBW25. These differences, we conclude, reflect possible genetic events such as horizontal gene transfer that could have led Pf11764 to acquire the unique capacity for cyanide degradation and nutritional assimilation as a nitrogen source.

References

1. Harris R and Knowles C (1983). The conversion of cyanide to ammonia by extracts of a strain of Pseudomonas fluorescens that utilizes cyanide as a source of nitrogen for growth. FEMS Microbiol. Lett. 20:337-341.
2. Kunz D, Nagappan O, Silva-Avalos J and Delong G (1992). Utilization of cyanide as a nitrogenous substrate by Pseudomonas fluorescens NCIMB 11764: evidence for multiple pathways of metabolic conversion. Appl. Environ. Microbiol. 58(6):2022-2029.
3. Kunz D, Wang C and Chen J (1994). Alternative routes of enzymic cyanide metabolism in Pseudomonas fluorescens NCIMB 11764. Microbiology 140:1705-1712.
4. Fernandez R and Kunz D (2005). Bacterial cyanide oxygenase is a suite of enzymes catalyzing the scavenging and adventitious utilization of cyanide as a nitrogenous growth substrate. J. Bacteriol. 187(18):6396-6402.
5. Paulsen et al. (2005). Complete genome sequence of the plant commensal Pseudomonas fluorescens Pf-5. Nature Biotechnology 23(7):873-878.


Statistical Evaluation of Dynamic Brain Cell Calcium Activity

Kinsey R. Cotton, Mark DeCoster, Katie A. Evans, Richard A. Idowu, and Mihaela Paun

Louisiana Tech University, College of Engineering and Science, P.O.Box 10348, Ruston, LA 71272 [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract. Calcium in its ionic form is very dynamic, especially in excitable cells such as muscle and brain cells, moving from the high-concentration exterior of the cell to the much lower concentrations inside the cell, where calcium is used as a second messenger. In brain cells, and neurons especially, calcium is a key signaling ion involved in memory and learning, with excitatory neurotransmitters such as glutamate turning neurons "on". Glutamate (Glu) excites neurons in part by causing large, dynamic increases in intracellular calcium concentration ([Ca2+]i). While these [Ca2+]i dynamics are essential for normal signaling in the brain, excessive and sustained elevations in neuronal [Ca2+]i are related to neuronal injury [1], including long-term neurodegenerative processes [2]. Helping to regulate these dynamics in the brain are the glial cells known as astrocytes. Astrocytes express glutamate transporters [3] and in this way diminish the time that neurons are exposed to glutamate, thus also shaping the [Ca2+]i dynamics in neurons. Here we describe an in vitro cell culture system composed of rat brain cortical neurons with different densities of astrocytes, which we have used to statistically analyze the [Ca2+]i dynamics in individual neurons. This work follows our long-standing interest in brain cell [Ca2+]i dynamics [4], but with the proposed applied statistical and mathematical tools we now provide a system for predicting: 1) whether the order of repeated glutamate stimulation alters neuronal [Ca2+]i dynamics and 2) how the presence of different densities of astrocytes modulates neuronal [Ca2+]i dynamics. We anticipate that this combined experimental/analytical approach will also have utility in understanding additional brain diseases such as brain tumors [5].

Materials and Methods

2.1 Primary Cortical Culture Preparation
Cortical cells were obtained by performing cervical disarticulation of outbred Sprague-Dawley newborn rats (age ≤ 48 hrs) using methods as described [6]. After three days in vitro, the cell culture plates were split in half, with one half of the culture treated with a 100x dilution of Cytosine Arabinoside ([Ara C] 1 mM, Sigma-Aldrich) to deplete glial cells from the cultures. Three culture sets were created in total (n = 21 rats and approximately 48 wells per culture type, co-culture and neurons).

2.2 Calcium Fluorescence Imaging
The cortical cultures were imaged 8 to 9 days in vitro by incubating cells for 45 minutes in a loading solution of Pluronic acid (20% wt in Dimethylsiloxane, Sigma-Aldrich) at a 1000x dilution and Fluo-3/AM (Invitrogen) at a 500x dilution in Locke's solution [4]. Cells were then washed, recovered in Locke's solution, and re-incubated for 30 minutes. Cells were imaged with an Olympus CKX41 inverted microscope with a 488 nm excitation wavelength filter in real time at a 4 s frame rate with Intracellular Imaging software. A baseline (Treatment 0, i.e., cell recording before treatment) was obtained for 60 s; Glu concentrations were then added to the experiment at predetermined intervals (60, 240, and 500 s) without washing out the media between additions.

2.3 Measurement and Analysis of Fluorescence Intensity
Intracellular Imaging software (InCytIm1™, Version 5.26, Intracellular Imaging Inc., Cincinnati, OH) was utilized to create regions of interest (ROIs) around every cell in the data set after each experiment. ROIs were used to measure fluorescence intensity over time. The data were imported into Excel and analyzed by normalizing each ROI's starting value to one, which allows one ROI to be compared to another.

2.4 Statistical and Applied Mathematical Analysis
A one-way Analysis of Variance (ANOVA) was considered to examine the effect of the independent variable Treatment, with four levels, on the dependent variables "Number of spikes" and "Area under the curve." To determine which pairs of Treatment groups differ, a Tukey honestly significant difference (Tukey HSD) test was explored.
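The one-way ANOVA step above can be sketched in a few lines of plain Python (a minimal illustration; the group values used in the example are hypothetical, not this study's data):

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic for a list of groups,
    each group being a list of numeric observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares (treatment effect)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares (error)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F relative to the F distribution with (k−1, n−k) degrees of freedom indicates a significant Treatment effect; the Tukey HSD test would then compare pairs of group means.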

Results and Discussion

(a) Testing calcium dynamics. Three sets of submaximal glutamate stimuli were successively added to primary rat cortical neurons, and [Ca2+]i dynamics were measured as described in Methods. Once glutamate was added to the neurons, it remained on the cells; nonetheless, as can be seen in Figure 1, each stimulation was sub-maximal in the sense that cells recovered completely, or to a large extent, to baseline levels before the next stimulus.

Fig. 1. Successive treatment of rat brain cortical neurons with 250, 500, and 750 nM Glu, as indicated by arrows, elicits transient increases in [Ca2+]i as indicated by fluorescence intensity (y-axis). Each tracing represents an individual neuron tracked over time (4 s/frame, x-axis). Six representative neurons from over 40 cells are shown. See text for mathematical analysis of all cells treated.

(b) Spiking activity and area under the curve analysis: Using the one-way ANOVAs, the amount of variability in the response variable (treatment sum of squares) for "Number of spikes" and "Area under the curve" was 534.7 and 6,647,020, respectively. Our preliminary analysis shows a significant Treatment effect for both variables considered. The Tukey HSD for "Number of spikes" reveals that Treatment 2 (with the highest number of spikes/mean spike) was highly significant when compared to the other Treatments. Similarly, for "Area under the curve," Treatments 2 and 3 were significantly different according to the corresponding Tukey HSD. While each of the three successive stimuli increased in glutamate concentration, unexpectedly, the most spiking activity was observed in Treatment 2, which was an intermediate concentration (Figure 2a). We hypothesized that the highest-concentration Treatment 3 leads to fewer spikes due to synchrony of neuronal activity. This is supported by the "Area under the curve" result, where indeed the highest glutamate concentration resulted in the largest calcium load. This result is consistent with the highest glutamate stimulation producing the strongest [Ca2+]i load (Figure 2b).

[Figure 2 appeared here as two box plots: "Number of spikes by Treatment" and "Area under the curve by Treatment".]

Fig. 2. (a, left) Box plot of "Number of spikes" by Treatment group (Treatments 0, 1, 2, 3). (b, right) Box plot of "Area under the curve" by Treatment group (Treatments 0, 1, 2, 3). For each, the box portion of the box-and-whisker plot includes 50% of the data; whiskers depict the minimum and maximum data values. The edges of the box show the lower (Q1) and upper (Q3) quartiles, and the dark, thick line represents the median of the data.

Reference List
1. Lazarewicz JW. Calcium transients in brain ischemia: role in neuronal injury. Acta Neurobiol Exp (Wars) 1996; 56(1): 299-311.
2. Marambaud P, Dreses-Werringloer U, Vingtdeux V. Calcium signaling in neurodegeneration. Mol Neurodegener 2009; 4: 20.
3. Anderson CM, Swanson RA. Astrocyte glutamate transport: review of properties, regulation, and physiological functions. Glia 2000; 32(1): 1-14.
4. DeCoster MA, Koenig ML, Hunter JC, Tortella FC. Calcium dynamics in neurons treated with toxic and non-toxic concentrations of glutamate. Neuroreport 1992; 3(9): 773-77.
5. Lyons SA, Chung WJ, Weaver AK, Ogunrinu T, Sontheimer H. Autocrine glutamate signaling promotes glioma cell invasion. Cancer Res 2007; 67(19): 9463-9471.
6. Daniel B, DeCoster MA. Quantification of sPLA2-induced early and late apoptosis changes in neuronal cell cultures using combined TUNEL and DAPI staining. Brain Res Protoc 2004; 13(3): 144-150.


Lineage Specific Expansion of Protein Families in Malaria Parasites

Hong Cai1, Jianying Gu2,*, Yufeng Wang1,*

1 Department of Biology, South Texas Center for Emerging Infectious Diseases University of Texas at San Antonio, San Antonio, TX 78249, USA [email protected], [email protected] (*corresponding author) 2 Department of Biology, College of Staten Island City University of New York, Staten Island, NY 10314, USA [email protected] (*corresponding author)

Abstract. Malaria is a devastating global infectious disease caused by fast-evolving parasites in the genus Plasmodium. The development of new drugs and therapies relies on a better understanding of the parasite biology. In this study, we explored the protein families that have been specifically expanded in one or several unique lineages of six evolutionarily related Plasmodium strains. These proteins with lineage specific expansions (LSEs) include genes that are associated with pathogenesis and virulence as well as fundamental cellular processes in malaria parasites.

Keywords: malaria, protein family, comparative genomics, network

1 Introduction

Malaria is a vector-borne infectious disease; about 1-2 million deaths every year worldwide are due to malaria infection. The causative agents of malaria belong to a group of parasites in the genus Plasmodium, and the most life-threatening form of malaria is caused by P. falciparum. The disease was once controlled by effective medicines, but it is reemerging due to the increasing resistance of the parasites to available drugs. The development of new drugs and therapies relies on a better understanding of the parasite biology. The availability of genome sequences of human malaria parasites and other closely related species has enabled the study of genome evolution [1-6]. Previously, we investigated the distribution of core genome components in six completed Plasmodium genomes [7], which represent the minimum and common requirement to sustain a life cycle encompassing a vertebrate host and a mosquito vector. These six sibling species, however, have their own host specificities and epidemiological profiles: P. falciparum and P. vivax mainly infect humans; the former is mostly prevalent in sub-Saharan Africa, while the latter is the most widely distributed, commonly found in Latin America, the United States, and some areas of Africa. P. knowlesi serves as a model organism for primate malaria, as its natural hosts are long-tailed macaques, but it can infect humans as well; it is prevalent in Southeast Asia. P. yoelii yoelii, P. berghei, and P. chabaudi infect rodents and serve as rodent models to study parasite infection under laboratory conditions.

In this study, we further explored the protein families that have been expanded in specific lineage(s). These strain/species-specific proteins may be associated with pathogenesis, virulence, and other adaptive traits related to their ecological niches.

2 Data and Methods

2.1 Cluster of gene families and functional classification analysis

The complete genomes of six Plasmodium species were downloaded from PlasmoDB, the all-in-one portal of Plasmodium genome resources (http://www.plasmodb.org) [8]; the nucleotide, protein, annotation, and expression data were also downloaded. OrthoMCL, a Markov cluster algorithm, was used to group genes into clusters [9], which include the orthologous and paralogous genes from the different genomes. Multiple alignments of each cluster were derived with ClustalX and T-Coffee, followed by manual editing. Phylogenetic trees were inferred by the neighbor-joining, maximum likelihood, and maximum parsimony methods, using MEGA5 (http://www.megasoftware.net/).
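The Markov clustering idea underlying OrthoMCL — alternating expansion (matrix squaring) and inflation (elementwise powering) of a column-stochastic similarity matrix — can be sketched as follows (a minimal illustration on a toy symmetric matrix, not the OrthoMCL implementation itself):

```python
def mcl(adj, inflation=2.0, max_iter=100, tol=1e-8):
    """Minimal Markov clustering on a symmetric adjacency matrix (list of lists)."""
    n = len(adj)
    # add self-loops, then column-normalize to a stochastic matrix
    m = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    def normalize(mat):
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat
    m = normalize(m)
    for _ in range(max_iter):
        # expansion: matrix squaring spreads flow along longer paths
        exp = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
        # inflation: elementwise powering strengthens strong flows
        infl = normalize([[exp[i][j] ** inflation for j in range(n)]
                          for i in range(n)])
        diff = max(abs(infl[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = infl
        if diff < tol:
            break
    # rows with non-negligible mass define the clusters
    clusters = set()
    for i in range(n):
        members = tuple(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(clusters)
```

On two disconnected cliques the iteration converges so that each clique emerges as one cluster, mirroring how OrthoMCL groups genes connected by strong BLAST similarity.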

2.2 Protein-protein association analysis

The protein-protein associations for P. falciparum were downloaded from the STRING database [10]. Confidence scores (S), ranging from 0.15 to 0.999, were assigned based on evidence from sequence similarity, pathway assignment according to the KEGG and PlasmoCyc metabolic pathway databases [11], chromosome synteny and genome neighborhood analysis, phylogenetic inference, and literature analysis.

3 Results and Discussion

The OrthoMCL analysis identified abundant duplicate genes in Plasmodium. Approximately 5-9% of each genome is comprised of genes that are expanded in one or several lineage(s). These protein families showed two distinct lineage-specific expansion (LSE) patterns: (1) lineage-unique LSE, which includes protein families that are uniquely present in one genome, without orthologs in any of the other five genomes; and (2) typical LSE, which includes protein families that are expanded in more than one genome. As shown in Fig. 1, the two rodent parasites P. berghei and P. chabaudi possess the most abundant LSE protein families in both categories; this is likely due to the fact that these two genomes contain more open reading frames (ORFs) and proteins.


Fig. 1. Distribution of LSE protein families in six Plasmodium species.

Fig. 2. Protein-protein associations with ring-infected erythrocyte surface antigen (RESA) PFA0110w in P. falciparum.

Very little is known about these LSE protein families, as over 60% of the ORFs in P. falciparum, the best-studied malaria genome, are predicted to be hypothetical proteins with unknown functions. Nevertheless, several LSE proteins may be associated with pathogenesis and virulence. Strain-specific surface antigen families are present in each species: rifin and erythrocyte membrane protein (EMP) are the two largest protein families found in P. falciparum, implicated in antigenic variation, cell adhesion, and invasion; P. vivax possesses the Vir protein family of variant antigens, while the SICAvar-like antigen, a simian-specific surface antigen, is present in P. knowlesi. Other potentially important protein families that are expanded in specific lineages include kinases, heat shock proteins, and various metabolic enzymes. Protein-protein interaction analysis showed that these protein families with LSEs are involved in versatile cellular activities. As shown in Fig. 2, PFA0110w is a putative protein in the ring-infected erythrocyte surface antigen (RESA) protein family. It was predicted to be associated with merozoite surface protein 2 (PfMSP2 or MSA2) and merozoite surface protein 9 (PfMSP9 or ABRA), both of which may be involved in merozoite invasion of the host red blood cell; an actin (PFL2215w) and a skeleton-binding protein (PfSBP1); two proteases (a proteasome subunit β1 (PFE0915c) important for protein turnover, and falcilysin, critical for globin digestion); and several hypothetical proteins. A better understanding of the origin, divergence, function, and network of these protein families with lineage specific expansion will offer new insights into the mechanisms of parasite adaptation and evolution.

Acknowledgments. This work is supported by NIH grants AI067543, GM081068 and AI080579 to YW, and the PSC-CUNY Research Award PSCREG-39-497 to JG.

References

1. Carlton, J.: The Plasmodium vivax genome sequencing project. Trends Parasitol 19, 227-231 (2003)
2. Carlton, J., Silva, J., Hall, N.: The genome of model malaria parasites, and . Curr Issues Mol Biol 7, 23-37 (2005)
3. Carlton, J.M., Adams, J.H., Silva, J.C., et al.: Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature 455, 757-763 (2008)
4. Carlton, J.M., Angiuoli, S.V., Suh, B.B., et al.: Genome sequence and comparative analysis of the model rodent malaria parasite Plasmodium yoelii yoelii. Nature 419, 512-519 (2002)
5. Pain, A., Bohme, U., Berry, A.E., et al.: The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature 455, 799-803 (2008)
6. Gardner, M.J., Hall, N., Fung, E., et al.: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498-511 (2002)
7. Cai, H., Gu, J., Wang, Y.: Core genome components and lineage specific expansions in malaria parasites plasmodium. BMC Genomics 11 Suppl 3, S13 (2010)
8. Aurrecoechea, C., Brestelli, J., Brunk, B.P., et al.: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res 37, D539-543 (2009)
9. Li, L., Stoeckert, C.J., Jr., Roos, D.S.: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13, 2178-2189 (2003)
10. Szklarczyk, D., Franceschini, A., Kuhn, M., et al.: The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res 39, D561-568 (2010)
11. Yeh, I., Hanekamp, T., Tsoka, S., et al.: Computational analysis of Plasmodium falciparum metabolism: organizing genomic information to facilitate drug discovery. Genome Res 14, 917-924 (2004)


A Mean Shift Clustering Based Algorithm for Multiple Alignment of LC-MS Data

Minh Nguyen and Jean X. Gao

Department of Computer Science and Engineering, The University of Texas at Arlington, TX, USA

Abstract. Alignment of multiple liquid chromatography - mass spectrometry (LC-MS) maps is a crucial preprocessing step for LC-MS data due to unavoidable variations in the retention time (rt) dimension across technical repeats. In this paper, we propose a novel algorithm for aligning multiple LC-MS feature maps simultaneously without choosing any specific map as a reference. Features are first matched across all maps using Gaussian blurring mean shift clustering, and nonlinear rt shifts in each map are then corrected independently by applying locally weighted scatterplot smoothing (LOESS). The bandwidth of the Gaussian kernel in the clustering algorithm and the span used in the LOESS are selected automatically. Experimental results on real datasets show that the performance of our proposed method is comparable to or better than that of six alternative approaches commonly used in the research community.

Keywords: liquid chromatography - mass spectrometry, multiple alignment, clustering, mean shift

1 Introduction

Liquid chromatography - mass spectrometry (LC-MS) is a technology for the analysis of complex protein mixtures. Due to variations in mass-to-charge (m/z) ratios and rt dimensions, even in technical replications of the same experiment, one needs to align LC-MS maps before carrying out a quantitative analysis [9]. While m/z variations are relatively small and are related to the accuracy of mass spectrometers, the shifts in the rt dimension between different LC-MS experiments can be fairly large [4, 7]. LC-MS alignment algorithms can be roughly divided into two categories: profile-based and feature-based approaches [9]. Profile-based approaches usually take raw data as input, while feature-based approaches align features, which are peaks on LC-MS maps extracted by a feature detection step and represented by m/z, rt, and intensity [4]. For complete coverage of recent alignment methods, please refer to the survey in [9]. In feature-based multiple alignment approaches [7, 8], the authors first find well-behaved feature groups across all LC-MS maps using kernel density estimation based clustering [8] or hierarchical clustering [7], and then correct rt shifts for all features based on these groups. The advantage of these algorithms is that a reference map is not required. However, choosing an appropriate kernel bandwidth for kernel density estimation or a cutoff value for hierarchical clustering is non-intuitive, especially for a new dataset. Another limitation of existing methods, e.g., [6, 8], is that numerous user-defined parameters are required.
These drawbacks motivate us to propose a feature-based algorithm that is capable of: (i) aligning multiple LC-MS maps simultaneously; (ii) matching corresponding features across all maps using Gaussian blurring mean shift clustering, which has proven its superiority in clustering applications [2]; (iii) using data-driven kernel bandwidth selection, which accordingly adapts to data density, for Gaussian kernels in the clustering algorithm; and (iv) requiring few parameters with clear physical meaning which can be chosen from observations of LC-MS maps.

2 Methods

The proposed algorithm for multiple alignment of LC-MS feature maps is primarily based on two phases: (1) grouping features whose m/z and rt values are close to each other into clusters, using Gaussian blurring mean shift clustering [2]. These feature groups, so-called consensus features [5], are highly likely to be associated with the same peptides across all maps and can be used as references for rt alignment; (2) correcting rt shifts for each map based on the reference features by locally weighted regression (LOESS) [3]. These two phases can optionally be repeated several times to detect more likely candidate groups for increasingly accurate alignment. The proposed algorithm is summarized as follows.

Map combination. Features from all LC-MS maps are combined and sorted with respect to m/z values.

Bandwidth estimation. A suitable bandwidth for the Gaussian kernel in the Gaussian blurring mean shift clustering algorithm [2] is estimated based on the distribution of features along the rt dimension, using the k-stage direct plug-in bandwidth selector proposed in [1], which uses a fixed-point algorithm and the discrete cosine transform:

    δ = ξ γ^[k](δ),    (1)

where γ^[k](δ) = γ_1(γ_2(· · · γ_{k−1}(γ_k(δ)) · · · )) (k nested applications), k ≥ 1, and ξ ≈ 0.90.

Binning. For the purpose of computational efficiency, LC-MS maps are divided into m/z bins whose width is selected on the basis of mass accuracy. This step aims to group features with close m/z values into the same bin.

Feature matching. To match features in each m/z bin, we use the fast Gaussian blurring mean shift algorithm proposed in [2]. After each iteration, a data point x_m in the dataset X = {x_1, ..., x_N} moves to a new data point y_m, and the new dataset is thus a blurred version of X. Data points quickly move towards their local modes and collapse into clusters after the first few iterations. The stopping criterion proposed in [2] terminates the algorithm at this stage to obtain the clustering results. In Algorithm 1, the mean shift iteration is expressed in posterior probability form.

Algorithm 1 Gaussian blurring mean-shift (GBMS)

repeat
    for m ∈ {1, ..., N} do
        ∀n: p(n|x_m) ← exp(−(1/2) ‖(x_m − x_n)/σ‖²) / Σ_{n′=1}^{N} exp(−(1/2) ‖(x_m − x_n′)/σ‖²)    (2)
        y_m ← Σ_{n=1}^{N} p(n|x_m) x_n    (3)
    end for
    ∀m: x_m ← y_m
until stop
return connected-components({x_n}_{n=1}^{N}, min_diff)
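Algorithm 1 can be sketched in one dimension (e.g., rt values within a single m/z bin) as follows; this is an illustrative reimplementation, not the authors' code, and it uses a simple shift-based stopping rule in place of the criterion from [2]:

```python
import math

def gbms_1d(points, sigma, min_diff=1e-3, max_iter=200):
    """Gaussian blurring mean shift on 1-D data, followed by
    connected-components extraction of the collapsed clusters."""
    x = list(points)
    n = len(x)
    for _ in range(max_iter):
        y = []
        for m in range(n):
            # posterior p(n|x_m) proportional to a Gaussian kernel, Eq. (2)
            w = [math.exp(-0.5 * ((x[m] - x[k]) / sigma) ** 2) for k in range(n)]
            s = sum(w)
            # blurred update y_m = sum_n p(n|x_m) x_n, Eq. (3)
            y.append(sum(wk * xk for wk, xk in zip(w, x)) / s)
        shift = max(abs(a - b) for a, b in zip(x, y))
        x = y
        if shift < min_diff / 10:   # simple stand-in for the stopping rule of [2]
            break
    # connected components: split sorted points where the gap exceeds min_diff
    order = sorted(range(n), key=lambda i: x[i])
    clusters, current = [], [order[0]]
    for prev, cur in zip(order, order[1:]):
        if x[cur] - x[prev] > min_diff:
            clusters.append(current)
            current = []
        current.append(cur)
    clusters.append(current)
    return clusters
```

Points sharing a mode collapse onto (numerically) the same value after a few iterations, so the gap-based connected-components step recovers the clusters.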

To prevent potential groups from being separated by bin boundaries, we use successive bins overlapping by half, as in [8]. A postprocessing step is therefore required to filter out features in overlapping areas that appear in different clusters. The postprocessing step considers: (1) the number of samples not contributing features to each cluster and (2) features from the same sample present in the same cluster. Since we focus on multiple alignment and the resulting features are used as references for rt correction, well-behaved groups have to contain features from at least some fraction of the total number of samples. In addition, well-behaved groups should consist of at most one feature from each sample; when a cluster contains several features from the same sample, only the feature with the highest intensity is kept.

Retention time correction. For each cluster, the median rt and the corresponding deviation of each map's feature in the cluster from that median are computed. In general, well-behaved feature groups are evenly distributed over substantial parts of the retention time dimension [8]. The features of each map present in different well-behaved groups can be used as references for correcting the rt shifts of all features of that map. We apply LOESS to the pairs of rt and rt deviation in each map separately, and the fitted curve is then employed to correct the rt variations of all features in the map.
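The rt-correction step can be illustrated with a minimal locally weighted linear regression (a simplified stand-in for full LOESS: tricube weights over the nearest `span` fraction of reference points, with no robustness iterations):

```python
def loess_fit(xs, ys, x0, span=0.5):
    """Locally weighted linear fit around x0; returns the fitted value.
    xs/ys are the reference rt values and their rt deviations."""
    n = len(xs)
    k = max(2, int(round(span * n)))
    # the k-th smallest distance to x0 sets the local bandwidth
    d = sorted(abs(x - x0) for x in xs)
    h = d[k - 1] or 1e-12
    # tricube weights; points beyond the bandwidth get weight zero
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    # weighted least squares for y = a + b*x
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, xs))
    swy = sum(wi * yi for wi, yi in zip(w, ys))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        return swy / sw
    b = (sw * swxy - swx * swy) / denom
    a = (swy - b * swx) / sw
    return a + b * x0
```

Each feature's rt would then be corrected as rt − loess_fit(reference_rts, rt_deviations, rt).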

3 Results and Discussion

To evaluate the performance of the proposed method, we conducted experiments on two metabolic datasets, M1 and M2, from [5], which consist of 44 and 24 LC-MS feature maps, respectively. The alignment ground truth is composed of consensus features, which are feature groups with high confidence that are reproducible over at least four samples and exhibit small deviation in retention time across samples. For more detailed information on the datasets and ground truth, please refer to [5]. Our algorithm needs only two user-defined parameters: (1) the m/z bin width, which is related to the accuracy of the mass spectrometer, and (2) the number of mesh points along the rt dimension used to estimate the kernel bandwidth, which can be selected based on observations of feature densities on LC-MS maps. In the experiments, an m/z bin of 0.1 Da was used for the M1 dataset and 0.04 Da for the M2 dataset. The number of mesh points was set to 2^8 for both datasets. To capture variations along the m/z dimension, we split the LC-MS maps into 4 m/z segments and estimated the bandwidth for each segment separately. The experiments show that there is not much loss in alignment performance when using more segments. To choose a proper span for the LOESS, we performed 5-fold cross validation. Figure 1 illustrates the rt alignment curves fitted by the LOESS for samples 2 and 36 from dataset M1. The resulting consensus features were used for performance evaluation. We employed the alignment measures proposed in [5], precision and recall, to evaluate the performance of our method. In addition, we computed the F-score in our evaluation to thoroughly assess the efficiency of alignment methods in the tradeoff between precision and recall.
\[
\text{Precision} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap confeat_i|}{|confeat_i|}, \tag{4}
\]

\[
\text{Recall} = \frac{1}{N} \sum_{i=1}^{N} \frac{|gt_i \cap confeat_i|}{|M_i| \cdot |gt_i|}, \tag{5}
\]

\[
\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{6}
\]

where $|gt_i \cap confeat_i|$ is the number of features in the $i$-th consensus feature of the ground truth detected by the algorithm; $|confeat_i|$ is the total number of features in all consensus features detected by the algorithm corresponding to the query on the $i$-th consensus feature of the ground truth; $|gt_i|$ is the number of features in the $i$-th consensus feature of the ground truth; and $|M_i|$ is the number of consensus features into which the algorithm splits the $i$-th consensus feature of the ground truth.

The experimental results on the M1 and M2 datasets are given in Table 1. We compared our method with the five alignment methods evaluated in [5] (msInspect, MZmine, OpenMS, XAlign, and XCMS with rt correction) as well as the recently developed RANSAC aligner in the software package MZmine 2 [6]. For the M1 dataset, the recall of our method is comparable to that of XCMS, which is the best; our method and MZmine 2 (RANSAC aligner) achieve the best precision. For the M2 dataset, the proposed method obtains the best recall and precision. With respect to the F-score, which combines recall and precision, our algorithm also achieves the best result. The performance of MZmine 2 (RANSAC aligner) is comparable to that of our method; however, the RANSAC aligner requires four user-defined parameters for two 2-D (m/z and rt) windows, the RANSAC window and the alignment window, while our method needs only two parameters related to m/z and rt. The proposed method outperforms the alignment algorithm in the XCMS package, which uses kernel density estimation based clustering with a fixed kernel bandwidth, on both datasets with regard to the F-score.
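Measures (4)-(6) can be computed directly from the ground-truth consensus features and the algorithm's output. A small Python sketch (the data layout, with each consensus feature as a set of feature ids, is our assumption):

```python
def alignment_scores(ground_truth, detected):
    """ground_truth: ground-truth consensus features, one set of feature
    ids per entry.  detected: for each ground-truth entry, the list of
    consensus features the algorithm reports for that query."""
    N = len(ground_truth)
    precision = recall = 0.0
    for gt, cons in zip(ground_truth, detected):
        hit = sum(len(gt & c) for c in cons)   # |gt_i ∩ confeat_i|
        total = sum(len(c) for c in cons)      # |confeat_i|
        M = len(cons)                          # |M_i|: pieces gt_i was split into
        precision += hit / total if total else 0.0
        recall += hit / (M * len(gt)) if M else 0.0
    precision /= N
    recall /= N
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```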

45 ISBRA 2012 Short Abstracts

[Figure: two panels, (a) Sample 2 (LC-MS map 2) and (b) Sample 36 (LC-MS map 36), plotting retention time deviation against retention time.]

Fig. 1. Retention time alignment curves of two samples from the M1 dataset.

Table 1. Comparison of alignment performance of the proposed method with six alternative approaches

Data  Measure    msInspect  MZmine  OpenMS  XAlign  XCMS  MZmine2  Proposed
M1    Recall     0.27       0.89    0.87    0.88    0.94  0.91     0.93
      Precision  0.46       0.74    0.69    0.70    0.70  0.74     0.74
      F-score    0.34       0.81    0.77    0.78    0.80  0.82     0.82
M2    Recall     0.23       0.98    0.93    0.93    0.98  0.98     0.99
      Precision  0.47       0.84    0.79    0.79    0.78  0.83     0.84
      F-score    0.31       0.90    0.85    0.85    0.87  0.90     0.91

References

1. Botev, Z., Grotowski, J., Kroese, D.: Kernel density estimation via diffusion. The Annals of Statistics 38(5), 2916–2957 (2010)
2. Carreira-Perpiñán, M.: Fast nonparametric clustering with Gaussian blurring mean-shift. pp. 153–160. ACM (2006)
3. Cleveland, W.: Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, pp. 829–836 (1979)
4. Lange, E., Gröpl, C., Schulz-Trieglaff, O., Leinenbach, A., Huber, C., Reinert, K.: A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics 23(13), I273–I281 (2007)
5. Lange, E., Tautenhahn, R., Neumann, S., Gröpl, C.: Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics 9(1), 375 (2008)
6. Pluskal, T., Castillo, S., Villar-Briones, A., Oresic, M.: MZmine 2: Modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11 (2010)
7. Podwojski, K., Fritsch, A., Chamrad, D.C., Paul, W., Sitek, B., Stuhler, K., Mutzel, P., Stephan, C., Meyer, H.E., Urfer, W., Ickstadt, K., Rahnenfuehrer, J.: Retention time alignment algorithms for LC/MS data must consider non-linear shifts. Bioinformatics 25(6), 758–764 (2009)
8. Smith, C.A., Want, E.J., O'Maille, G., Abagyan, R., Siuzdak, G.: XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry 78(3), 779–787 (2006)
9. Vandenbogaert, M., Li-Thiao-Te, S., Kaltenbach, H.M., Zhang, R.X., Aittokallio, T., Schwikowski, B.: Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 8(4), 650–672 (2008)


A new algorithm for the molecular distance geometry problem with inaccurate distance data

Michael Souza1, Carlile Lavor2, Albert Muritiba1, and Nelson Maculan3

1 Federal University of Ceará, Ceará, Brazil, {michael,einstein}@ufc.br
2 State University of Campinas (IMECC-UNICAMP), Campinas, Brazil, [email protected]
3 Federal University of Rio de Janeiro (COPPE-UFRJ), Rio de Janeiro, Brazil, [email protected]

Abstract. We present a new algorithm for the molecular distance geometry problem with inaccurate and sparse data, based on the solution of linear systems, a heuristic to find large cliques, and the minimization of a nonlinear least-squares function. Computational results are presented in order to validate our approach.

Keywords: molecular geometry, heuristic, nonlinear least-squares, cliques, linear systems

1 Introduction

The molecular distance geometry problem (MDGP) can be defined as the problem of finding Cartesian coordinates $x_1, \dots, x_n \in \mathbb{R}^3$ of the atoms of a molecule such that

\[
l_{ij} \le \|x_i - x_j\| \le u_{ij}, \quad \forall (i,j) \in E,
\]

where the bounds $l_{ij}$ and $u_{ij}$ for the Euclidean distances of the pairs of atoms $(i,j) \in E$ are given a priori [1]. An overview of methods applied to the MDGP is given in [2].

An $n \times n$ symmetric matrix $D = (d_{ij})$ with nonnegative elements and a zero diagonal is said to be a Euclidean distance matrix (EDM) if there exist points $x_1, \dots, x_n \in \mathbb{R}^k$ such that $d_{ij} = \|x_i - x_j\|$, $i, j = 1, 2, \dots, n$. The smallest such $k$ is called the embedding dimension of $D$. Assuming that $D = (d_{ij})$ is an EDM with embedding dimension $k = 3$ and singular value decomposition $U \Sigma U^t = D$, then $x = U \Sigma^{1/2}$ is a solution of the exact MDGP defined by $l_{ij} = u_{ij} = d_{ij}$ [1].

If only some exact distances are known, we can use an iterative algorithm called geometric buildup [3]. First, this algorithm initializes a base set $\mathcal{B}$ of four points (indices) whose pairwise distances are all known. Then, the coordinates of the points in $\mathcal{B}$ are set using the singular value decomposition of the EDM $D$ restricted to the base $\mathcal{B}$, and the remaining unset coordinates are calculated by solving the linear system

\[
\langle x_i, x_j \rangle = \frac{d_{i,1}^2 - d_{i,j}^2 + d_{j,1}^2}{2}, \qquad i \in \mathcal{B} = \{i_1, i_2, i_3, i_4\}, \tag{1}
\]


where $d_{ij} = \|x_j - x_i\|$. The indices $i_1, i_2, i_3, i_4$ can be chosen in an arbitrary way, which allows us to choose another base subset when calculating the coordinate $x_j$. However, when only inaccurate data ($l_{ij} < u_{ij}$) are available, neither the singular value decomposition nor the buildup algorithm can be applied directly, because both are designed to deal with exact distances. Our contribution is to extend the buildup algorithm to handle inaccurate distance data, based on simple ideas: generate an approximate distance matrix $D$, take as base a clique in the graph that has $D$ as a connectivity matrix, solve the system (1), and refine the solution using a nonlinear least-squares method.
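One buildup step via eq. (1) can be sketched in numpy (a hypothetical helper of ours; the frame is chosen so that the first base point plays the role of point 1 in the equation):

```python
import numpy as np

def buildup_step(base_xyz, d_base_j):
    """Solve the linear system (1) for a new point x_j, given the base
    coordinates and the distances from each base point to j.
    base_xyz[0] plays the role of point 1 in eq. (1)."""
    base_xyz = np.asarray(base_xyz, float)
    d_base_j = np.asarray(d_base_j, float)
    A = base_xyz - base_xyz[0]                  # base coords, point 1 at origin
    d1 = np.linalg.norm(A, axis=1)              # d_{i,1}
    dj1 = d_base_j[0]                           # d_{j,1}
    b = (d1 ** 2 - d_base_j ** 2 + dj1 ** 2) / 2.0   # <x_i, x_j> of eq. (1)
    xj_rel, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xj_rel + base_xyz[0]                 # back to the original frame
```

With exact distances this recovers the new point exactly; with inaccurate data the least-squares solution is only a starting point for the refinement described below.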

2 The new method

The set $E$ of pairs $(i,j)$ and the set of indices $V = \{1, 2, \dots, n\}$ can be considered as the edge set and the vertex set of a graph $G = (V, E)$, respectively. One may wish to use as base the largest complete subgraph of $G$; however, finding the largest complete subgraph is an NP-hard problem. Hence, we use a simple heuristic that merely looks for large complete subgraphs. Once we have obtained the base $\mathcal{B}$ associated with a complete subgraph, we need to set its coordinates. In order to generate an approximate EDM restricted to the points in the base, we define a matrix $D(t) = [d(t_{ij})]$, where

\[
d_{ij} = d(t_{ij}) = (1 - t_{ij})\, l_{ij} + t_{ij}\, u_{ij} \tag{2}
\]

for some $t_{ij} \in [0, 1]$. With this choice we have $l_{ij} \le d_{ij} \le u_{ij}$, but $D$ may not be an EDM with the appropriate embedding dimension ($k = 3$). This may happen because the entries $d_{ij}$ can violate the triangle inequality $d_{ij} \le d_{ik} + d_{jk}$ for some indices $i, j, k$, or because the rank of $D$ is greater than 3. With this in mind, instead of taking the solution given by the singular value decomposition directly, we take the columns (eigenvectors) of $U$ associated with the 3 largest eigenvalues, obtaining the best rank-3 approximation of the solution to $x x^t = D(t)$ [4]. We should not expect great precision in $x$, because the matrix $D(t)$ is just an approximation. We therefore refine it by minimizing the nonlinear function

\[
\min_x \; \varphi_{\lambda,\tau}(x) = \sum_{(i,j) \in E:\, i,j \in \mathcal{B}} \varphi^{ij}_{\tau,\lambda}(x, l, u), \tag{3}
\]

where

\[
\varphi^{ij}_{\tau,\lambda}(x, l, u) = \lambda\,(l_{ij} - u_{ij}) + \theta^{ij}_{\lambda,\tau}(x, l) + \theta^{ij}_{\lambda,\tau}(x, u), \tag{4}
\]

\[
\theta^{ij}_{\tau,\lambda}(x, c) = \sqrt{\lambda^2 \left( c_{ij} - \sqrt{\|x_i - x_j\|^2 + \tau^2} \right)^2 + \tau^2}, \tag{5}
\]

with $\lambda > 0$, $\tau > 0$.
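A small numpy/scipy sketch of the smoothed objective (3)-(5) and the refinement step (the function names and the choice of optimizer are ours; the authors work in Matlab):

```python
import numpy as np
from scipy.optimize import minimize

def theta(x, i, j, c, lam, tau):
    """theta^{ij}_{tau,lam}(x, c) of eq. (5): a smooth penalty that is
    smallest when ||x_i - x_j|| is close to the bound c."""
    r = np.sqrt(np.sum((x[i] - x[j]) ** 2) + tau ** 2)
    return np.sqrt(lam ** 2 * (c - r) ** 2 + tau ** 2)

def phi(xflat, n, edges, l, u, lam, tau):
    """Objective of eq. (3), summed over the given edges."""
    x = xflat.reshape(n, 3)
    return sum(lam * (l[k] - u[k])
               + theta(x, i, j, l[k], lam, tau)
               + theta(x, i, j, u[k], lam, tau)
               for k, (i, j) in enumerate(edges))

def refine(x0, edges, l, u, lam=1.0, tau=0.01):
    """Refine approximate coordinates by minimizing phi
    (L-BFGS-B with numerical gradients, for brevity)."""
    n = x0.shape[0]
    res = minimize(phi, x0.ravel(), args=(n, edges, l, u, lam, tau),
                   method="L-BFGS-B")
    return res.x.reshape(n, 3)
```

Because the objective is infinitely differentiable, any classical gradient-based optimizer can be used here.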



The function $\varphi_{\tau,\lambda}$ is infinitely differentiable with respect to $x$, and therefore allows the application of classical optimization methods; it is a variation of the hyperbolic penalty technique used in [5,6]. Once we have refined the coordinates of the points in the base $\mathcal{B}$, we start to set the remaining (free) points, beginning with the points that have at least four constraints with the points in the base. In order to set the coordinate $x_j$, we use all constraints involving the index $j$ and the indices in the base: we use the approximate distance matrix $D(t)$ for some $t \in [0, 1]^{|E|}$, solve the linear system

\[
\langle x_i, x_j \rangle = \frac{d_{i,1}^2 - d_{i,j}^2 + d_{j,1}^2}{2}, \qquad i \in \mathcal{B}, \tag{6}
\]

and then refine the solution by minimizing the function $\varphi_{\lambda,\tau}(x)$ restricted to the index $j$ and to the indices in the base (see eq. (3)). Each newly calculated coordinate is included in the base. In the end, some points may remain unset because they have fewer than four constraints involving the points in the base; in this case, we simply position these points by solving the underdetermined system defined by their constraints with points in the base.
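The base-setting step (build $D(t)$ between the bounds, then take the best rank-3 embedding) can be sketched with classical multidimensional scaling, a standard stand-in for the SVD step described above (names are ours):

```python
import numpy as np

def approx_edm(L, U, t=0.5):
    """D(t) = (1 - t) L + t U elementwise, as in eq. (2)."""
    return (1 - t) * L + t * U

def embed3(D):
    """Best rank-3 embedding of an approximate distance matrix D:
    double-center the squared distances to get a Gram matrix G = X X^T,
    then keep the 3 largest eigenpairs."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(G)                 # ascending eigenvalues
    idx = np.argsort(w)[::-1][:3]            # three largest
    w3 = np.clip(w[idx], 0.0, None)          # guard tiny negative values
    return V[:, idx] * np.sqrt(w3)
```

When $D$ is an exact 3-D EDM this recovers the configuration up to a rigid motion; for an approximate $D(t)$ it yields the starting point that the penalty minimization then refines.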

3 Numerical experiments

We have implemented our algorithm in Matlab and tested it on a set of model problems, using an Intel Core 2 Quad CPU Q9550 at 2.83 GHz with 4 GB of RAM running 32-bit Linux. The distance data were derived from real structural data from the Protein Data Bank (PDB). For each protein, only a subset of the distances was considered: formally, we kept only the distances smaller than $R = 6$ Å. The bounds were given by the equations

\[
l_{ij} = d^*_{ij} \max(0,\, 1 - |\bar\epsilon_{ij}|), \qquad u_{ij} = d^*_{ij}\,(1 + |\epsilon_{ij}|), \tag{7}
\]

where $d^*_{ij}$ is the true distance between atom $i$ and atom $j$, and $\bar\epsilon_{ij}, \epsilon_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2)$ (normal distribution). These instances were proposed by Biswas et al. in [4]. We used the function

\[
\mathrm{LDME} = \left( \frac{1}{|E|} \sum_{(i,j) \in E} \max\{\, l_{ij} - \|x_i - x_j\|,\; \|x_i - x_j\| - u_{ij},\; 0 \,\}^2 \right)^{1/2} \tag{8}
\]

in order to measure the precision of the solution with respect to the constraints alone, without using any information about the original structure $x^*$, and we also measured the deviation between the solutions generated by our algorithm and the original ones in the PDB files, using the function

\[
\mathrm{RMSD} = \frac{1}{\sqrt{n}} \min \left\{ \|x^* - Q(x - h)\|_F \;:\; h \in \mathbb{R}^{n \times 3},\ Q \in \mathbb{R}^{3 \times 3} \text{ orthogonal} \right\}. \tag{9}
\]
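Both quality measures can be sketched in numpy (the orthogonal alignment for eq. (9) is computed via an SVD-based Procrustes fit; function names are ours):

```python
import numpy as np

def ldme(x, edges, l, u):
    """Eq. (8): root-mean-square violation of the distance bounds."""
    viol = [max(l[k] - np.linalg.norm(x[i] - x[j]),
                np.linalg.norm(x[i] - x[j]) - u[k], 0.0) ** 2
            for k, (i, j) in enumerate(edges)]
    return float(np.sqrt(np.mean(viol)))

def rmsd(x_ref, x):
    """Eq. (9): RMSD after optimal translation and orthogonal alignment."""
    a = x_ref - x_ref.mean(axis=0)           # remove translations
    b = x - x.mean(axis=0)
    U, _, Vt = np.linalg.svd(b.T @ a)        # orthogonal Procrustes
    Q = U @ Vt
    return float(np.linalg.norm(a - b @ Q) / np.sqrt(len(x)))
```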



Table 1. Results for 70% of the distances below 6 Å and $\sigma_{ij} = 0.05$.

PDB ID     n    |E|   LDME      RMSD      |B0|  CPU time (s)
1PTQ     402   5025  2.02E-04  3.47E-03   8       5.28
1HOE     558   7103  1.94E-04  3.39E-03   7       7.20
1LFB     641   8104  1.97E-04  4.23E-03   8       8.17
1PHT     811  12351  1.83E-04  9.35E-03   9      17.64
1POA     914  11805  7.60E-02  2.73E-01   8      13.50
1AX8    1003  13002  2.00E-04  3.46E-03   8      15.09
1F39    1534  19804  1.99E-04  7.99E-02   8      30.70
1RGS    2015  26774  1.91E-04  3.13E-02   8      47.32
1KDH    2846  38725  1.88E-04  1.00E-02   8      80.47
1BPM    3671  52548  2.10E-02  1.03E-01   8     108.97
1RHJ    3740  53850  1.88E-04  7.49E-03   8     126.98
1HQQ    3944  54571  1.48E-01  1.42E+00   8     137.98
1TOA    4292  60216  1.92E-04  6.12E-02   8     155.07
1MQQ    5681  84153  1.87E-04  3.03E-03   8     277.09

In all experiments the parameters of the function $\varphi_{\lambda,\tau}$ were set to $\lambda = 1.0$ and $\tau = 0.01$. Table 1 shows that our approach is efficient even when the bounds $l_{ij}$ and $u_{ij}$ are not very close ($\sigma_{ij} = 0.05$) and only 70% of the constraints are considered. In all instances the LDME was low and the RMSD was lower than 3.5 Å, which means that the protein structures are very similar [7]. In this table, $n$ is the number of atoms in the instance, $|E|$ is the number of constraints $l_{ij} \le \|x_i - x_j\| \le u_{ij}$, $|\mathcal{B}_0|$ indicates the size of the initial base, and the CPU time is given in seconds.

References

1. Crippen, G., Havel, T.: Distance Geometry and Molecular Conformation. Volume 15. Research Studies Press, Taunton, Somerset (1988)
2. Liberti, L., Lavor, C., Maculan, N.: Molecular distance geometry methods: from continuous to discrete. International Transactions in Operational Research 18, 33–51 (2010)
3. Wu, D., Wu, Z.: An updated geometric build-up algorithm for solving the molecular distance geometry problems with sparse distance data. Journal of Global Optimization 37(4), 661–673 (2007)
4. Biswas, P., Toh, K.C., Ye, Y.: A distributed SDP approach for large-scale noisy anchor-free graph realization with applications to molecular conformation. SIAM Journal on Scientific Computing 30(3), 1251–1277 (2008)
5. Souza, M., Xavier, A., Lavor, C., Maculan, N.: Hyperbolic smoothing and penalty techniques applied to molecular structure determination. Operations Research Letters 39, 461–465 (2011)
6. Xavier, A.E.: Hyperbolic penalty: A new method for nonlinear programming with inequalities. International Transactions in Operational Research 8, 659–671 (2001)
7. Schlick, T.: Molecular Modeling and Simulation: An Interdisciplinary Guide. Volume 21. Springer Verlag (2010)


Identification of highly synchronized regulatory subnetwork with gene expression and interaction dynamics

Shouguo Gao, Xujing Wang

Department of Physics & The Comprehensive Diabetes Center, University of Alabama at Birmingham, Birmingham, AL, 35294

Abstract. There has been a growing interest in combining PPI (protein-protein interaction) data with gene expression data. However, the interaction dynamics of biological processes have not been sufficiently considered previously. Here we propose a topological phase locking (TopoPL) based scoring method with a simulated annealing search for identifying differentially expressed PPI subnetworks from time series data. First, a phase locking index is used to represent the interaction strength during a given biological process. Next, we perform a simulated annealing search to identify the subnetwork with the maximum score in the whole PPI network. Applications to simulated data and the yeast cell cycle data show that the TopoPL method can identify biologically meaningful subnetworks more sensitively than static topological and additive scoring methods.

1 Introduction

Although a number of computational methods have been developed to integrate gene expression and protein interaction data [1], most ignore the dynamics of interaction and do not fully utilize network topology. We regard active subnetworks as those containing deregulated genes with high expression synchronization among them. Specifically, we approximate protein activity by gene expression significance together with all the synchronized interactions. Subnetworks of genes with high significance and interactions with high phase locking indexes are regarded as synchronized regulated subnetworks [2].

To represent interaction dynamics, Guo et al. proposed a method to identify condition-responsive subnetworks from a PPI network, where only protein-protein interactions with high coexpression between the corresponding genes are considered [3]. This assumption is reasonable, as many studies have found that not all protein interactions occur in a specific tissue and at a specific time [3, 4]. Liu et al. also utilized expression correlation to represent interaction dynamics; they assessed the statistical significance of the differential expression of two nodes together with their correlation [1]. However, it has been shown that correlation metrics have limitations when applied to time course data [5, 6]. Not utilizing the inter-time-point dependence not only loses sensitivity toward detecting interactions but can also lead to erroneous predictions. Phase locking captures the dynamic interaction structure; compared with simple correlation, we found that the phase locking metric can identify gene pairs that interact with each other more efficiently [6]. To grasp the dynamic network topological characteristics in representing the activity of a subnetwork, we integrate phase locking analysis with the Pathway Connectivity Index (PCI) that we previously developed; PCI utilizes information from all genes and network topological properties [7].
With both simulated and real data, we demonstrate the performance of the TopoPL-based method.

2 Datasets and Method

2.1 Simulation study

The simulation was based on the example expression data gal80R in Cytoscape. We randomly selected n_predefined (40, 60, 80) connected genes as the responsive subnetworks, in which m% (80%, 90%, 100%) of the genes are considered active. The significance values of the active genes were assigned the top significance values in gal80R. The phase locking indexes λ of the responsive subnetworks were sampled from {0.8, 0.5}, while the indexes for the remaining edges were sampled from {0.4, 0.3}. The F-score is a measure of a test's accuracy; it considers both the precision and the sensitivity of the test.



2.2 Expression data and protein-protein interaction data

The yeast cell cycle data was downloaded from the EMBL Huber group (http://www.ebi.ac.uk/huber-srv/scercycle/) [8]. The yeast interaction data was downloaded from YEASTRACT.

2.3 Phase locking index based scoring methods and search algorithms

The details of the definition and calculation of phase locking are described in [6]. The EDGE software [9] was used to evaluate the significance of gene expression changes in the time course microarray datasets, and each gene's significance is converted to a z-score through the inverse normal CDF. To utilize the topology of a subnetwork, we take the adjacency matrix of the subnetwork extracted from the PPI network and define the overall activity of the subnetwork as a sum, over all pairs of adjacent genes, of terms combining the two genes' z-scores with the phase locking index of their interaction. This score captures the topological properties of the pathway well, since hub genes contribute more to the metric. Each edge term can be regarded as an "activity measurement" of the interaction: only interactions in which both genes have high z-scores contribute substantially to the activity of the subnetwork. The score obviously increases with the number of nodes and edges, so we normalize it by the number of edges; randomly created subnetworks with the same nodes tend to have similar numbers of edges in most cases when the sampling set is large enough (>200).

The searching procedure is based on a simulated annealing approach:

Input: the entire PPI network; simulated annealing parameters: start temperature, end temperature, and number of iterations N.
Output: the connected subnetwork with the highest score.
1. Initialize by setting each node to its significance from EDGE and selecting the largest connected component (subnetwork).
2. Calculate the score of this largest connected component.
3. For i = 1 to N:
   (a) calculate the current temperature T_i;
   (b) randomly pick a node n: if n is in the current subnetwork, remove it, otherwise add it;
   (c) calculate the score of the largest connected component of the modified subnetwork;
   (d) compute the score change Δ; if Δ > 0, accept the move, otherwise accept it with probability e^(Δ/T_i).
4. Output the connected subnetwork with the highest score encountered.
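A minimal Python sketch of this annealing loop (the scoring function is a stand-in: here the subnetwork score is simply a caller-supplied function of the node set, not the full TopoPL score, and all names are illustrative):

```python
import math
import random
from collections import deque

def largest_cc(nodes, adj):
    """Largest connected component of the subgraph induced by `nodes`."""
    nodes, best = set(nodes), set()
    while nodes:
        start = next(iter(nodes))
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj.get(v, ()):
                if w in nodes and w not in comp:
                    comp.add(w)
                    queue.append(w)
        nodes -= comp
        if len(comp) > len(best):
            best = comp
    return best

def anneal(adj, score, t_start=1.0, t_end=0.01, n_iter=2000, seed=0):
    """Toggle one node per iteration, score the largest connected
    component, and accept worse moves with probability exp(dS / T)."""
    rng = random.Random(seed)
    all_nodes = list(adj)
    cur = set(all_nodes)                        # start from the full network
    s_cur = score(largest_cc(cur, adj))
    best, s_best = set(cur), s_cur
    for i in range(n_iter):
        T = t_start * (t_end / t_start) ** (i / n_iter)   # geometric cooling
        cand = cur ^ {rng.choice(all_nodes)}    # toggle one node's membership
        if not cand:
            continue
        s_new = score(largest_cc(cand, adj))
        dS = s_new - s_cur
        if dS > 0 or rng.random() < math.exp(dS / T):
            cur, s_cur = cand, s_new
            if s_cur > s_best:
                best, s_best = largest_cc(cur, adj), s_cur
    return best, s_best
```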

3 Results

3.1 Simulation study

We compared our TopoPL-based scoring method with two other methods using the F-score. The first is the commonly used scoring method based on summing gene significance levels (hereafter referred to as the additive scoring method).

52 ISBRA 2012 Short Abstracts

The second is the topology-based scoring method proposed by Gao et al. (hereafter referred to as the TAPPA-based scoring method) [7]. The first method does not use any network information, and the second uses only static network information. We evaluated the performance of these three methods and compared the percentage of genes with a higher phase locking index for all interactions in the subnetworks. The F measure showed that TopoPL performs better than the TAPPA-based and additive methods when there is higher synchronization in the predefined responsive subnetworks.

Fig. 1. Performance of the TopoPL-based, topological and additive scoring approaches.

3.2 Yeast cell cycle dataset

A simulated annealing search with our TopoPL-based scoring function identified a subnetwork of 454 genes in the EMBL alpha cell cycle data [8]. We performed GO term enrichment analysis with topGO to investigate how well the identified subnetwork represents functional modules [10]. We calculated the significance level for the number of proteins in the identified subnetwork only in the category of "biological process"; the most significant GO terms for the identified subnetwork are shown in Table 1.

Table 1. Top 10 significant GO biological processes enriched in the identified subnetwork for the yeast cell cycle

GO ID       GO term                               p-value
GO:0042254  ribosome biogenesis                   1.49E-17
GO:0007049  cell cycle                            1.33E-16
GO:0022613  ribonucleoprotein complex biogenesis  2.09E-16
GO:0000278  mitotic cell cycle                    2.96E-15
GO:0000280  nuclear division                      1.43E-12
GO:0022402  cell cycle process                    4.01E-12
GO:0044085  cellular component biogenesis         4.29E-12
GO:0051301  cell division                         5.21E-12
GO:0048285  organelle fission                     7.33E-12
GO:0006364  rRNA processing                       2.38E-11

We investigated the distribution of the phase locking index within the identified subnetwork. Clearly the identified subnetwork contains higher phase locking indexes, showing that TopoPL tends to find highly synchronized subnetworks (Fig. 2). Fig. 3 depicts the network formed by the top 30 high-degree and high-betweenness nodes from the identified subnetwork. We hypothesize that together they constitute the biological process of the yeast cell cycle and provide a holistic picture of its primary molecular basis. Many of the genes are annotated with GO:0007049 (cell cycle), denoted by round rectangles. There is no 'gold standard' to evaluate the biological relevance of network modeling algorithms, so we investigated the functional enrichment of the proteins in the identified subnetworks [3]. The p-values of the top 2 terms are 4.44E-17 and 8.64E-16 with TAPPA and 4.36E-12 and 4.47E-12 with the additive method. The TAPPA p-values are slightly higher than those of the TopoPL-based method, but the additive method gave much higher

53 ISBRA 2012 Short Abstracts

p-values. This indicates that including interaction information, and especially its dynamics, helps to identify more biologically meaningful modules.

3.3 Agreement between the datasets

A good algorithm should generate consistent results on different datasets. We identified 484 genes with dataset cdc28 and 524 genes with dataset alpha, with 156 genes overlapping between them (Fisher test, p < 0.00001). In contrast, there are only 87 overlapping genes with the additive method (alpha: 501 genes; cdc28: 509 genes) and 145 overlapping genes with TAPPA (alpha: 499 genes; cdc28: 503 genes), indicating that integrating dynamics and network structure generates robust results.

Fig. 2. Boxplot of the phase locking index of all interactions, for all genes and for subnetwork genes.

4 Conclusion

A TopoPL-based scoring method with a simulated annealing search was proposed for integrating PPI data with expression data in order to identify active subnetworks. Our method considers both the network structure and the dynamic interactions. When applied to the simulated data and the yeast cell cycle data, the TopoPL-based method tends to yield larger numbers of highly coordinated proteins than the two other scoring methods. Furthermore, the TopoPL-based method tended to identify more consistent subnetworks between different datasets or different objects.

Fig. 3. Core of the identified subnetworks.

References

1. Zhi-Ping, L., et al.: Dynamically dysfunctional protein interactions in the development of Alzheimer's disease. In: IEEE International Conference on Systems, Man and Cybernetics (SMC 2009) (2009)
2. Keskin, O., et al.: Protein-protein interactions: organization, cooperativity and mapping in a bottom-up Systems Biology approach. Phys. Biol. 2(2), S24–S35 (2005)
3. Guo, Z., et al.: Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics 23(16), 2121–2128 (2007)
4. Han, J.-D.J., et al.: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430(6995), 88–93 (2004)
5. Burton, P., Gurrin, L., Sly, P.: Clustered data: extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modelling. Tutorials in Biostatistics, pp. 1–33. John Wiley & Sons (2005)
6. Gao, S., et al.: Global analysis of phase locking in gene expression during cell cycle: the potential in network modeling. BMC Systems Biology 4(1), 167 (2010)
7. Gao, S., Wang, X.: TAPPA: topological analysis of pathway phenotype association. Bioinformatics 23(22), 3100–3102 (2007)
8. Granovskaia, M., et al.: High-resolution transcription atlas of the mitotic cell cycle in budding yeast. Genome Biology 11(3), R24 (2010)
9. Leek, J.T., et al.: EDGE: extraction and analysis of differential gene expression. Bioinformatics 22(4), 507–508 (2006)
10. Alexa, A., Rahnenführer, J., Lengauer, T.: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13), 1600–1607 (2006)

MGC: Gene calling in metagenomic sequences

Achraf El Allali, John R. Rose

Abstract—Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the ORFs, by-passing other challenging tasks such as the assembly of the metagenome. In this abstract we introduce a metagenomics gene caller (MGC) which improves over the state-of-the-art prediction algorithm Orphelia [1]. Orphelia uses a two-stage machine learning approach and computes a model that classifies extracted ORFs from fragmented sequences. We hypothesise that sequences need separate models based on their local GC-content in order to avoid the noise introduced to a single model computed with sequences from the entire GC spectrum. We have also added two amino acid features based on the benefit of amino acid usage shown in our previous research [2]. Direct comparison between our method and the original algorithm supports our hypotheses and sets the ground for further investigation.

Keywords—Gene Finding, Metagenomic fragments, Coding Non-coding Classification.

A. El Allali and J. R. Rose are with the Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208. email: [email protected], [email protected]

I. INTRODUCTION

In cultured microbes, the shotgun sequences that result from sequencing the full genome come from a single clone, which makes the assembly and annotation of the genome manageable. In metagenomics, the uncultured microbes are sampled directly from their environment. The next generation sequencing (NGS) used in metagenomics results in a much larger amount of data than traditional sequencing due to the rapid and low cost of sequencing. However, the resulting sequences are noisy, partial and, most importantly, may come from thousands of different species. Therefore, the assembly and annotation of the large metagenomics data present more challenges. Several methods have shown promising results and efficiency in assembling metagenomic data [3], [4]. However, they are designed for single genomes, which is never the case in environmental samples, and therefore face the most difficult challenge in metagenomics, which is separating the data by species. One way to deal with these difficulties is to bypass assembly and go directly to finding genes.

New methods are being developed to predict genes specifically in metagenomics. The best known methods in this field are MetaGene [5] and Orphelia [1]. MetaGene uses a similar approach to GeneMark.hmm [6], which takes into account the GC-content sensitive monocodon and dicodon models computed from fully annotated genomes. Once MetaGene extracts all the possible open reading frames (ORFs) present in the fragments, it uses statistical models computed from fully annotated genomes to score the fragments. The next step uses a dynamic programming algorithm that combines the previous score with the ORF length, the distance between the ORF and its neighbor, and the distance between the translation initiation start (TIS) and the left-most start codon. The goal of the algorithm is to select the final set of coding ORFs by resolving the overlap between ORFs.

Orphelia obtains better performance than MetaGene by using a two-stage machine learning approach. The first stage builds linear discriminants for monocodon and dicodon usage as well as the TIS features extracted from the ORFs. This step linearly extracts features from the high-dimensional features obtained from the codon usage and the TIS information and reduces each usage to a single feature. The next stage combines the features obtained from the linear discriminants as well as length and GC content features using a non-linear neural network which produces the probability that a given ORF encodes a protein. Finally, Orphelia deploys a post-processing algorithm which uses the probabilities obtained from its scoring scheme in order to resolve the overlap.

II. METHODS

A. The MGC Algorithm

MGC is a metagenomic gene caller based on a two-stage machine learning approach similar to that of Orphelia. MGC learns separate models for several pre-defined GC-content regions, as opposed to the single-model approach used by Orphelia, and applies the appropriate model to each fragment based on its GC content. Chan and Stolfo [7] investigated model combination for machine learning classification and showed that models learned from disjoint partitions of a dataset outperform a single model learned from the entire dataset. Separating the training data by GC content provides MGC with mutually exclusive partitions of the data with which to train multiple models.

First, all possible complete and incomplete ORFs are extracted from the input fragment. For comparison purposes we consider the same possible ORFs described by Hoff et al. [8]. Complete ORFs begin with a start codon (ATG, CGT, GTG, or TTG) and end with an in-frame stop codon (TGA, TAG, or TAA). Incomplete ORFs are missing the upstream end, the downstream end, or both ends of the ORF. Both complete and incomplete ORFs must be at least 60 bp in length. Once we have all the ORFs from the input fragment, we extract input features using the same linear discriminants used on the training examples. Based on the GC content of the fragment containing the input ORF, the corresponding neural network model is used to score the ORF. The output of the neural network is the approximation of the posterior probability that the ORF is coding. Once all input ORFs are scored by the neural network, a greedy algorithm is deployed to resolve the overlap between all candidate ORFs that have a probability greater than 0.5.

The neural network models are trained using the same 131 fully sequenced prokaryotic genomes used by Hoff et al. [8]. Fragments of 700 bp are randomly excised from these genomes and used to train both stages of the MGC algorithm. First, we compute linear discriminants for the mono/dicodon, monoamino/diamino-acid and translation initiation start usages. For each usage, several linear discriminants are built, one for each GC range, using all the training examples from the same GC range. The linear discriminants are used to extract features from the training examples by linearly reducing the high-dimensional feature space computed from each usage into a single feature. We also use the same length and GC content features used in Orphelia in the training of our models. The resulting nine features for all the training examples in each GC range are combined in a non-linear fashion using a neural network. The output of each network is the posterior probability of an ORF encoding a protein.

B. GC Content Sensitive Models

Our use of GC content to partition the training dataset is inspired by the causal relationship between nucleotide bias and amino acid composition. Singer and Hickey [9] demonstrated that nucleotide bias can have a dramatic effect on the amino acid composition of the encoded proteins; they showed that GC-poor genomes have proteins that are rich in the FYMINK amino acids and GC-rich genomes have proteins that are rich in the GARP amino acids. This effect is not only present in complete genomes but is also valid for individual genes. Singer and Hickey [9] identified genes common between a GC-poor genome (B. burgdorferi) and a GC-rich genome (M. tuberculosis) and measured the synonymous nucleotide frequencies and amino acid contents of each gene. GC-poor genes tend to be the shortest [11]. The longer the gene is, the more candidate TIS codons the ribosome encounters. Unlike the ribosome, models find it hard to pick the correct TIS from a large number of candidates, especially when they are close to each other. In addition to the number of candidate TIS codons, these candidates share most of the TIS window used to compute the features. Having separate models for genes that have a large number of start codons will ensure that the subtle differences between the candidates are learned by the non-linear neural networks.

For each GC range we obtain a model using features computed from all the sequences in the training dataset that have GC content within the GC range. The same GC ranges used to compute the linear discriminants are used to build the neural network models. Different splits by GC content are used to study the effect of the GC range size on the performance of MGC. In this paper we show results of MGC models trained by breaking down the training data into 10%, 5% and 2.5% ranges and show how combining the models gives better results. For the remainder of this paper, we simply refer to these ranges as the 10%, 5% and 2.5% GC ranges.

We also combine the probability output from all models that include a given test fragment. Instead of using a single neural network, empirically the average over an ensemble of neural networks improves the overall classification performance. This empirical observation is supported by Hansen and Krogh [12], who showed that the error of an ensemble is the average error of the ensemble members minus a measure of the disagreement between the members, proving that the ensemble is always better than the average individual performance. The most common ensemble method is to combine the ensemble members by weighted summation of the output; this method is called the linear average predictor (LAP).
While there we show results from the simplest LAP approach where the was no overlap in the synonymous GC contents of these two average of probabilities from all three GC ranges is used as genomes, some overlap in the amino acid proportions of the the combined probability. encoded proteins exists. However, no overlap in the amino acid proportions of the encoded proteins in the common genes III.EXPERIMENTAL RESULTS was found, the GARP/FYMINK ratio in the M. tuberculosis homolog was higher than that of the corresponding gene in B. The performance of MGC is measured using the sensitivity burgdorferi. Separating the models by GC content can ensure and specificity measures which evaluate the capability of that both compositions are accounted for instead of combining detecting annotated genes and the reliability of the gene them into one model. predictions respectively, these measures can be found in GC content influences codon usage which in turns influ- Hoff et al. [8]. The performance measures are computed for ences the amino acid usage. Lightfield et al. [10] have shown predicted genes in fragments with length 700 bp from ten that across bacterial Phyla, distantly-related genomes with bacterial and three archaeal genomes based on their GenBank similar genomic GC content have similar patterns of amino [13] annotations. Table I shows the sensitivity, specificity acid usage. They examined codon usage patterns and were and harmonic mean scores of MGC predictions based on able to predict protein amino acid content as a function of models built from 10%, 5% and 2.5% GC ranges respectively, genomic GC content. Lightfield et al. [10] demonstrated that in addition to the predictions from the LAP approach. The use of amino acids encoded by GC-rich codons increased harmonic mean score is a composite measure of sensitivity by approximately 1% for each 10% increase in genomic and specificity [8]. 
Models built from the 10% GC ranges GC content, the opposite was also true for GC-poor codons. have an average harmonic mean of 87.58% with 7.2% standard Separating GC contents into several GC ranges will ensure deviation. The 5% and 2.5% GC ranges acquire a small that the different linear discriminants can separate the codon improvement over the 10% ranges. Their average harmonic and amino acid usage more precisely. means are 88.42% and 87.92% respectively, and the standard Another effect of GC content is its link to the length of the deviations are 6.69%. and 6.89% respectively. Table I also genes, GC-rich genes in prokaryotes tend to be the longest shows the improvements in both sensitivity and specificity
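The ORF-extraction step described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scans the forward strand only and handles complete ORFs only; the codon sets and the 60 bp minimum come from the text.

```python
# Sketch of the candidate-ORF extraction step (forward strand, complete
# ORFs only; incomplete ORFs, which may lack either end, are omitted).
STARTS = {"ATG", "CTG", "GTG", "TTG"}  # start codons listed in the text
STOPS = {"TGA", "TAG", "TAA"}          # in-frame stop codons
MIN_LEN = 60                           # minimum ORF length in bp

def complete_orfs(fragment):
    """Return (begin, end) intervals of complete ORFs in all three frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(fragment) - 2, 3):
            codon = fragment[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                end = i + 3  # interval includes the stop codon
                if end - start >= MIN_LEN:
                    orfs.append((start, end))
                start = None
    return orfs
```

For example, a 60 bp fragment consisting of ATG, eighteen filler codons, and TAA yields exactly one complete ORF spanning the whole fragment.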

56 ISBRA 2012 Short Abstracts

TABLE I
Gene prediction performance of MGC using different models. Sensitivity (Sn), specificity (Sp), and harmonic mean (H.M.) scores are measured on 700 bp randomly excised fragments from each test genome to 5-fold coverage.

Model            10% Ranges            5% Ranges             2.5% Ranges           LAP
Genomes          Sp     Sn     H.M     Sp     Sn     H.M     Sp     Sn     H.M     Sp     Sn     H.M
M. jannaschii    99.42  92.84  96.01   99.37  92.98  96.06   99.38  92.95  96.06   99.41  92.97  96.08
A. fulgidus      95.40  79.14  86.51   96.51  78.18  86.39   96.13  78.41  86.37   97.18  81.54  88.68
B. subtilis      89.00  56.13  68.84   90.07  57.63  70.29   89.64  57.21  69.84   90.11  58.34  70.72
B. aphidicola    98.55  91.32  94.80   98.51  90.77  94.48   98.68  90.54  94.43   98.82  91.26  94.89
W. endosymbiont  95.46  87.14  91.11   95.41  89.50  92.36   95.56  89.03  92.18   96.12  90.07  92.99
N. pharaonis     95.84  79.17  86.71   95.16  79.40  86.57   95.41  77.39  85.46   96.32  81.29  88.17
E. coli          95.45  78.56  86.19   97.01  80.70  88.11   96.59  79.74  87.36   97.56  83.65  90.07
H. pylori        95.12  83.51  88.94   97.44  88.48  92.74   97.19  88.36  92.56   97.47  89.06  93.07
P. aeruginosa    96.80  87.15  91.72   95.94  86.94  91.22   95.99  85.57  90.48   96.82  87.82  92.10
C. tepidum       94.07  68.03  78.96   95.59  71.80  81.99   95.24  69.95  80.66   96.12  73.60  83.37
B. pseudomallei  95.59  85.91  90.45   94.50  85.44  89.74   94.81  84.55  89.39   95.46  86.42  90.72
C. jeikeium      95.12  77.80  85.62   95.27  79.78  86.84   95.54  77.90  85.83   96.38  82.00  88.61
P. marinus       98.06  87.70  92.59   97.90  87.88  92.62   98.07  87.23  92.33   98.40  88.43  93.15
Average          95.69  81.11  87.58   96.05  82.27  88.42   96.02  81.45  87.92   96.63  83.57  89.44
S.D.             2.54   10.03  7.20    2.31   9.54   6.69    2.38   9.76   6.89    2.28   9.21   6.51
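The per-range model selection and the LAP combination discussed above can be sketched as follows. This is a hypothetical illustration: the `models` mapping stands in for the trained per-range neural networks, and the binning arithmetic is an assumption consistent with the 10%, 5%, and 2.5% splits.

```python
# Sketch of per-range model selection plus the LAP combination. The
# `models` mapping stands in for the trained per-range neural networks;
# the bin arithmetic is an assumption consistent with the 10%/5%/2.5% splits.
def gc_content(fragment):
    frag = fragment.upper()
    return 100.0 * sum(frag.count(base) for base in "GC") / len(frag)

def gc_bin(gc, range_size):
    """Index of the GC range containing `gc` (e.g. gc=43.1, size=10 -> 4)."""
    return min(int(gc // range_size), int(100 // range_size) - 1)

def lap_probability(fragment, models):
    """Average the per-range model outputs for one fragment (the LAP)."""
    gc = gc_content(fragment)
    probs = [models[(size, gc_bin(gc, size))](fragment)
             for size in (10, 5, 2.5)]
    return sum(probs) / len(probs)
```

The design point is that each fragment is routed to the model trained on its own GC bin, so the three range schemes give three (generally different) probabilities that the LAP then averages.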

when combining the models using the LAP approach. This means that sensitivity is not sacrificed for specificity or vice versa, although the sensitivity gain is higher than the gain in specificity. Combining the overlapping models improved their overall predictive ability rather than favoring one class over the other.

Table II shows a comparison between MGC and Orphelia using the same bacterial and archaeal genomes as before. The MGC algorithm is run using the LAP approach. The average harmonic mean for MGC is 89.44% with a standard deviation of 6.51%, while the average harmonic mean for Orphelia is 85.95% with a standard deviation of 7.19%. We observe that the gain is mostly in the sensitivity measures (6.23%), while a small loss occurs in specificity (0.78%). Orphelia was originally compared to MetaGene in Hoff et al. [1], where it was shown that Orphelia has an average 4.6% specificity gain and a 3.8% sensitivity loss compared to MetaGene based on the same test species used in our comparison. However, the overall performance measured by the harmonic mean was very similar between the two approaches.

TABLE II
Gene prediction performance comparison between MGC and Orphelia [1].

Methods          MGC                   Orphelia
Genomes          Sp     Sn     H.M     Sp     Sn     H.M
M. jannaschii    99.41  92.97  96.08   99.04  90.68  94.67
A. fulgidus      97.18  81.54  88.68   98.47  80.60  88.64
B. subtilis      90.11  58.34  70.72   91.33  62.44  74.17
B. aphidicola    98.82  91.26  94.89   98.99  89.50  94.01
W. endosymbiont  96.12  90.07  92.99   98.14  84.44  90.78
N. pharaonis     96.32  81.29  88.17   97.12  69.49  81.01
E. coli          97.56  83.65  90.07   98.51  81.19  89.01
H. pylori        97.47  89.06  93.07   98.92  89.21  93.81
P. aeruginosa    96.82  87.82  92.10   95.85  68.48  79.89
C. tepidum       96.12  73.60  83.37   97.39  67.01  79.39
B. pseudomallei  95.46  86.42  90.72   95.64  62.85  75.85
C. jeikeium      96.38  82.00  88.61   97.80  74.50  84.57
P. marinus       98.40  88.43  93.15   99.11  85.03  91.53
Average          96.63  83.57  89.44   97.41  77.34  85.95
S.D.             2.28   9.21   6.51    2.16   10.36  7.19
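The harmonic mean reported in Tables I and II is the standard composite of specificity (Sp) and sensitivity (Sn); a quick sketch, checked loosely against a table row (small discrepancies from rounding in the published values are expected):

```python
# Harmonic mean of specificity (Sp) and sensitivity (Sn), the composite
# score reported in Tables I and II (definition following Hoff et al. [8]).
def harmonic_mean(sp, sn):
    return 2.0 * sp * sn / (sp + sn)

# Spot-check against the M. jannaschii row of Table I (10% ranges):
# Sp = 99.42, Sn = 92.84 gives roughly 96.01.
```

Note that applying the function to the Average rows does not reproduce the Average H.M. columns, since the published averages are taken over the per-genome harmonic means rather than over the averaged Sp and Sn.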
In our case, the harmonic mean shows an improvement of 3.49% on average over the results of Orphelia.

IV. CONCLUSION

The experimental results show the improvement of MGC over Orphelia's performance. We hypothesized that learning separate models for several pre-defined GC content regions, as opposed to the single-model approach used by Orphelia, should improve the performance of the neural network, and the current results support this claim. The ensemble technique used to combine models from different GC ranges also showed that the ensemble has better average performance than the individual classifiers.

Predicting the correct TIS is very important and challenging in conventional as well as metagenomic gene finding. The correct TIS is crucial to the subsequent experimental steps in the metagenomic pipeline. MGC employs linear discriminant TIS models in order to identify the correct TIS. The accuracy of TIS prediction can be measured using the TIS correctness measure described by Hoff et al. We are currently carrying out a complete comparison between MGC and Orphelia.

In the future we plan to build more models based on different GC ranges in order to investigate and find the ideal division of the GC spectrum. We also plan to use logarithmic opinion pool (LOP) based ensemble techniques to combine the probabilities from the different models. The LOP should further improve the overall performance, following the claim of Hansen and Krogh [12] that LOP-based methods outperform the more common LAP approach.

REFERENCES

[1] K. J. Hoff, T. Lingner, P. Meinicke, and M. Tech, "Orphelia: predicting genes in metagenomic sequencing reads," Nucleic Acids Research, vol. 37, no. Web Server issue, pp. W101–W105, Jul. 2009.
[2] A. El Allali and J. R. Rose, "MIM: a species independent approach for classifying coding and non-coding DNA sequences in bacterial and archaeal genomes," Engineering and Technology, pp. 411–418, 2010.
[3] M. J. Chaisson and P. A. Pevzner, "Short read fragment assembly of bacterial genomes," Genome Research, vol. 18, no. 2, pp. 324–330, 2008.


[4] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe, "ALLPATHS: de novo assembly of whole-genome shotgun microreads," Genome Research, vol. 18, no. 5, pp. 810–820, 2008.
[5] H. Noguchi, J. Park, and T. Takagi, "MetaGene: prokaryotic gene finding from environmental genome shotgun sequences," Nucleic Acids Research, vol. 34, no. 19, pp. 5623–5630, 2006.
[6] M. Borodovsky, R. Mills, J. Besemer, and A. Lomsadze, "Prokaryotic gene prediction using GeneMark and GeneMark.hmm," Current Protocols in Bioinformatics, Chapter 4, Unit 4.5, 2003.
[7] P. K. Chan and S. J. Stolfo, "A comparative evaluation of voting and meta-learning on partitioned data," in Proc. 12th International Conference on Machine Learning. Morgan Kaufmann, 1995, pp. 90–98.
[8] K. J. Hoff, M. Tech, T. Lingner, R. Daniel, B. Morgenstern, and P. Meinicke, "Gene prediction in metagenomic fragments: a large scale machine learning approach," BMC Bioinformatics, vol. 9, p. 217, Jan. 2008.
[9] G. A. Singer and D. A. Hickey, "Nucleotide bias causes a genomewide bias in the amino acid composition of proteins," Molecular Biology and Evolution, vol. 17, no. 11, pp. 1581–1588, 2000.
[10] J. Lightfield, N. R. Fram, and B. Ely, "Across bacterial phyla, distantly-related genomes with similar genomic GC content have similar patterns of amino acid usage," PLoS ONE, vol. 6, no. 3, p. 12, 2011.
[11] J. L. Oliver and A. Marín, "A relationship between GC content and coding-sequence length," Journal of Molecular Evolution, vol. 43, no. 3, pp. 216–223, 1996.
[12] J. V. Hansen and A. Krogh, "A general method for combining predictors tested on protein secondary structure prediction," in Proc. of Artificial Neural Networks in Medicine and Biology, 2000, pp. 259–264.
[13] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, "GenBank: update," Nucleic Acids Research, vol. 32, no. Database issue, pp. D23–D26, 2004.


Structural Motif Discovery Algorithms: Classification and Benchmarks

Isra Al-Turaiki, Ghada Badr, and Hassan Mathkour

King Saud University, College of Computer and Information Sciences, Riyadh, Kingdom of Saudi Arabia {ialturaiki,mathkour}@ksu.edu.sa,[email protected]

Abstract. Motif discovery is the problem of finding recurring patterns in biological sequences. Patterns can be sequential, when discovering DNA motifs, or structural, when discovering RNA motifs. Finding common structural patterns helps better understand the mechanism of post-transcriptional regulation. Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. Some sequential algorithms have been adapted to the structural case based on the assumption that a common structure is a direct consequence of a common sequence. Other work is able to extract common structures from different sequences. In this paper, we survey and classify the different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. In addition, we propose benchmark datasets for evaluating the available structural motif discovery approaches.

Keywords: bioinformatics, motif, RNA, structural

1 Introduction

Finding recurring patterns, motifs, in biological data gives an indication of important functional or structural roles. Motifs can be either sequential or structural. Motifs are represented as sequences when they represent repeated patterns in DNA sequences. Motifs are structural when they represent patterns for RNA secondary structures [2, 3]. Knowing structural motifs in RNA leads to a better understanding of its function. Unlike DNA motifs, which are sequentially conserved, RNA motifs may share a common structure even in the case of low sequence similarity. Many algorithms have been devised to solve the structural motif discovery problem; some of them require pre-aligned input sequences and are based on the assumption that similar sequences share similar structures. These kinds of algorithms are sensitive to the quality of the sequence alignment. A survey of different structural RNA motif discovery algorithms can be found


in [6]. In this paper, we survey up-to-date structural motif discovery algorithms, where structures are discovered in sequences that may be different. The structural motif discovery problem can be formulated as follows: given a set of co-regulated RNA sequences S, find the common secondary structures that are responsible for their function or regulation. This should not be confused with two closely related problems: RNA structure prediction and RNA consensus structure prediction. In the former, it is required to predict the secondary structure of a single RNA sequence, while in the latter, it is required to predict the consensus secondary structure of a set of related RNA sequences.
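The secondary structures compared by the tools surveyed below are commonly exchanged in dot-bracket notation; a minimal parser recovering the base pairs (an illustrative sketch, not part of any surveyed tool):

```python
def base_pairs(structure):
    """Return base pairs (i, j) from a dot-bracket string; '.' is unpaired."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)          # open a pair, remember the 5' position
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))  # close the innermost open pair
    if stack:
        raise ValueError("unbalanced structure")
    return pairs
```

For example, the hairpin "((..))" yields the nested pairs (1, 4) and (0, 5).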

2 Classification of RNA Motif Discovery Algorithms

There are many approaches for solving the structural motif discovery problem. Based on how the search space is explored, we can classify the approaches into two main classes: enumerative approaches (EN) and heuristic approaches (HU). In enumerative approaches, the search space is exhaustively explored in order to discover overrepresented motifs. Algorithms in this class can be further divided into dynamic programming-based, data structure-based, and graph-based approaches. In heuristic approaches, only promising regions of the search space are explored. Examples of algorithms in this class include expectation maximization and evolutionary algorithms. In addition, there are other heuristics specifically designed to tackle the motif discovery problem. Table 1 summarizes different structural motif discovery algorithms.

FOLDALIGN [7] EN http://foldalign.ku.dk/software/index.html
It is a dynamic programming approach based on the Sankoff algorithm [18]. It maximizes alignment similarity and the number of base pairs formed in two aligned sequences.

SLASH [8] EN tool not available
It uses a dynamic programming approach, FOLDALIGN, to find local alignments in RNA sequences. It then uses COVE [5] to build an SCFG model from the local alignments.

Dynalign [13] Multilign [20] EN http://rna.urmc.rochester.edu/RNAstructure.html
Dynalign restricts the maximum distance allowed between aligned nucleotides in the sequences using dynamic programming. Multilign uses Dynalign progressively to construct a common structure.

Mauri and Pavesi [14] EN tool not available
Uses affix tree data structures for the discovery of hairpins, bulges, and internal loops in RNA. Substrings of a certain length appearing in at least q sequences are found and expanded.



Seed [1] EN http://bio.site.uottawa.ca/software/seed/
Uses suffix array data structures to induce motifs from one sequence, the seed. The data structures are used to store the seed sequence, its reverse, and the input sequences.

comRNA [12] EN http://stormo.wustl.edu/comRNA/
Uses an n-partite undirected weighted connectivity graph to represent stems and their similarity. The problem of finding motifs is mapped to finding a set of maximum cliques. A graph technique similar to topological sort is applied to find the best assemblies of stems.

RNAmine [10] EN http://software.ncrna.org Uses a graph mining algorithm to find conserved stems.

RNAGA [4] HU tool not available
A genetic algorithm is applied at different levels: first on each sequence to obtain a set of stable structures, and then again to the resulting set of stable structures.

GPRM [11] HU tool not available
GPRM uses genetic programming. It requires two sets of inputs: a positive set and a negative set. Individuals are evaluated based on F-score using the two input sets.

GeRNAMo [15] HU tool not available
GeRNAMo applies genetic programming to the output of RNAsubopt.

CMfinder [21] HU http://bio.cs.washington.edu/yzizhen/CMfinder/
It is based on expectation maximization to simultaneously align and fold sequences using the covariance model of RNA motifs.

RNAProfile [16] HU http://159.149.109.9/modtools/downloads/rnaprofile.html
It uses a heuristic to extract a set of candidate regions from each sequence. It then groups regions to find similar motifs.

RNASampler [19] HU http://stormo.wustl.edu/~xingxu/RNASampler/
It applies a probabilistic sampling approach and combines intra-sequence base pairing probabilities with inter-sequence base alignment probabilities.

RNAPromo [17] HU http://genie.weizmann.ac.il/pubs/rnamotifs08/index.html
It initially looks for structural elements that are common to the input RNAs, and then uses expectation maximization to refine the resulting probabilistic model.



Table 1: Structural motif discovery algorithms. EN: enumerative approaches; HU: heuristic approaches.

3 Advantages and Limitations

There are two main challenges facing motif discovery algorithms: the ability to discover complex structures (including pseudoknots) and the ability to deal with large datasets (scalability). Many algorithms (e.g., FOLDALIGN [7], SLASH [8], RNAProfile [16], and RNASampler [19]) do not allow branching structures; they are limited to highly conserved stem-loops and are only suitable for small datasets. Some algorithms, such as Dynalign [13] and Seed [1], can discover multi-loops. Multilign [20] can deal with large datasets. However, due to the prefiltering steps used in Multilign, genuine base pairs may be excluded. Thus, it should be used with caution for families with diverse structures. Graph-theoretical algorithms (e.g., comRNA [12] and RNAmine [10]) can find pseudoknots. However, they have non-polynomial time complexity because they are mapped to the clique finding problem, which is well known to be NP-complete. Evolutionary algorithms can deal with complex secondary structures. Unfortunately, they are computationally demanding for large datasets.
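The clique reduction used by the graph-theoretical methods can be illustrated with the classical Bron-Kerbosch enumeration over a toy compatibility graph. The vertices stand in for stems and the edges for pairwise compatibility; the graph itself is hypothetical, not taken from comRNA or RNAmine.

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Enumerate maximal cliques (classical Bron-Kerbosch, no pivoting).
    r: current clique, p: candidate vertices, x: already-processed vertices."""
    if not p and not x:
        cliques.append(sorted(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p.remove(v)
        x.add(v)

# Toy compatibility graph over four "stems": a triangle {0,1,2} plus edge {2,3}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = {v: set() for v in range(4)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
# The maximal cliques are {0, 1, 2} and {2, 3}.
```

The exponential worst case of this enumeration is exactly why the surveyed graph-based tools have non-polynomial running times.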

4 Proposed Benchmarks

Motivated by the lack of a "gold standard" benchmark, we propose benchmarks that can be used to assess the performance of structural motif discovery tools. The benchmarks are specifically designed to highlight the different challenges of the motif discovery problem. Based on the complexity of RNA secondary structures, the benchmarks can be divided into the following datasets, ranging from low-complexity to high-complexity structures:

– Ref.1 contains three RNA families: IRE, Histone3, and SECIS I. These families have simple secondary structures that are composed of a small number of loops and loop types (1 to 2).
– Ref.2 contains three RNA families: FMN, glmS, and Lysine. These families have more complex secondary structures that are composed of a higher number of loops (5 to 7) and loop types (2 to 3).
– Ref.3 contains three RNA families: EnteroOriR, Metazoa SRP, and RNasePbact a. These families have complex secondary structures that are composed of a large number of loops (8 to 17) and loop types (4 to 5).

The datasets can be retrieved from the Rfam database [9] 1. For each family, 50 seed sequences with 200 nucleotides of flanking regions, randomly distributed between the 5' and 3' ends, are considered. Scalability can be measured by changing the length of the flanking regions.

1 http://rfam.sanger.ac.uk/
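The construction of a benchmark instance described above (a seed sequence embedded in 200 nt of flanking sequence, split randomly between the 5' and 3' ends) can be sketched as follows; the function and parameter names are ours, not part of the proposed benchmark:

```python
import random

def embed_with_flanks(seed_seq, flank_total=200, alphabet="ACGU", rng=None):
    """Embed a seed sequence in random flanking regions whose combined
    length is flank_total, split randomly between the 5' and 3' ends."""
    rng = rng or random.Random(0)   # fixed seed for reproducible benchmarks
    left_len = rng.randint(0, flank_total)
    make_flank = lambda n: "".join(rng.choice(alphabet) for _ in range(n))
    return make_flank(left_len) + seed_seq + make_flank(flank_total - left_len)
```

Varying `flank_total` corresponds to the scalability test mentioned in the text, since longer flanks enlarge the search space without changing the planted motif.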



5 Conclusion

The discovery of structural motifs is of great interest to bioinformaticians. Recently, many algorithms have been proposed to tackle this problem. Based on the search method, we devised a classification of the available structural motif discovery algorithms that are used when the underlying sequences may be different. Current algorithms face two challenges: the discovery of complex secondary structures and scalability. Motivated by the lack of a "gold standard" benchmark to evaluate the different algorithms, we proposed benchmarks to tackle these issues. The proposed benchmarks and the corresponding analysis tools will soon be made available to other researchers on our website2. Currently we are running experiments using the proposed benchmarks to compare all the tools surveyed in this paper. The results will be submitted for publication soon.

References

1. Anwar, M., Nguyen, T., Turcotte, M.: Identification of consensus RNA secondary structures using suffix arrays. BMC Bioinformatics 7(1), 244 (May 2006)
2. Badr, G., Turcotte, M.: Component-based matching for multiple interacting RNA sequences. In: Proceedings of the 7th International Conference on Bioinformatics Research and Applications, pp. 73–86. ISBRA'11, Springer-Verlag, Berlin, Heidelberg (2011)
3. Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(2), 126–140 (Apr 2006)
4. Chen, J., Le, S., Maizel, J.V.: Prediction of common secondary structures of RNAs: a genetic algorithm approach. Nucleic Acids Research 28(4), 991–999 (Feb 2000)
5. Eddy lab: Software, http://selab.janelia.org/software.html
6. George, A., Tenenbaum, S.: Informatic resources for identifying and annotating structural RNA motifs. Molecular Biotechnology 41(2), 180–193 (Feb 2009)
7. Gorodkin, J., Heyer, L.J., Stormo, G.D.: Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research 25(18), 3724–3732 (Sep 1997)
8. Gorodkin, J., Stricklin, S.L., Stormo, G.D.: Discovering common stem-loop motifs in unaligned RNA sequences. Nucleic Acids Research 29(10), 2135–2144 (May 2001)
9. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., Bateman, A.: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research 33(Database Issue), D121–D124 (Jan 2005)
10. Hamada, M., Tsuda, K., Kudo, T., Kin, T., Asai, K.: Mining frequent stem patterns from unaligned RNA sequences. Bioinformatics 22(20), 2480–2487 (Oct 2006)
11. Hu, Y.: Prediction of consensus structural motifs in a family of coregulated RNA sequences. Nucleic Acids Research 30(17), 3886–3893 (Sep 2002)
12. Ji, Y., Xu, X., Stormo, G.D.: A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics 20(10), 1591–1602 (Jul 2004)

2 The bioinformatics group website at KSU http://bioksu.wordpress.com/



13. Mathews, D.H., Turner, D.H.: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology 317(2), 191–203 (Mar 2002)
14. Mauri, G., Pavesi, G.: Algorithms for pattern matching and discovery in RNA secondary structure. Theoretical Computer Science 335(1), 29–51 (2005)
15. Michal, S., Ivry, T., Cohen, O., Sipper, M., Barash, D.: Finding a common motif of RNA sequences using genetic programming: the GeRNAMo system. IEEE/ACM Transactions on Computational Biology and Bioinformatics 4, 596–610 (Dec 2007)
16. Pavesi, G., Mauri, G., Stefani, M., Pesole, G.: RNAProfile: an algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucleic Acids Research 32(10), 3258–3269 (Jan 2004)
17. Rabani, M., Kertesz, M., Segal, E.: Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes. Proceedings of the National Academy of Sciences (Sep 2008)
18. Sankoff, D.: Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics 45(5), 810 (1985)
19. Xu, X., Ji, Y., Stormo, G.D.: RNA Sampler: a new sampling based algorithm for common RNA secondary structure prediction and structural alignment. Bioinformatics 23(15), 1883–1891 (Aug 2007)
20. Xu, Z., Mathews, D.H.: Multilign: an algorithm to predict secondary structures conserved in multiple RNA sequences. Bioinformatics (Dec 2010)
21. Yao, Z., Weinberg, Z., Ruzzo, W.L.: CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics 22(4), 445–452 (Feb 2006)


Enumerating Maximal Frequent Subtrees

Akshay Deepak and David Fernández-Baca

Department of Computer Science, Iowa State University, Ames, Iowa, USA

1 Introduction

A common problem in phylogenetic analysis is to identify common patterns in a collection of phylogenetic trees (i.e., rooted trees whose leaves are in one-to-one correspondence with a set of species). Roughly speaking, the goal is to find a subset of the species (taxa) on which all or some significant subset of the trees agree. Here we give algorithms and experimental results for two approaches, one based on agreement subtrees, the other on frequent subtrees. An agreement subtree for a collection of phylogenies on a common leaf set is a subtree homeomorphically included in all of the input trees. A maximal agreement subtree (MXST) is an agreement subtree that is not a subtree of any other agreement subtree. An MXST is a maximum agreement subtree (MAST) if it has the largest number of leaves [6]. MASTs are used, among other things, as a metric for comparing phylogenetic trees [7], for computing their congruence index [4], and for identifying horizontal gene transfer events [3]. The MAST problem is polynomially solvable for two trees, but is NP-hard for three or more input trees if their degree is unbounded [1]. An MXST can reveal shared phylogenetic information not displayed by any of the MASTs (see Fig. 1). Even more common substructure can be uncovered if we relax

Fig. 1. (a) A collection of two trees and their (b) MAST. (c) An MXST that has fewer leaves than the MAST but is not displayed by it.

the requirement that the subtree returned has to be supported by all the input trees. Let f be a number in the interval (1/2, 1]. An f-frequent subtree, or simply a frequent subtree (FST), for a collection of m leaf-labeled trees on a common leaf set, is a subtree homeomorphically included in at least f · m of the input trees. A maximal FST (MFST) is an FST that is not a subtree of any other FST. Thus, an MXST is an MFST with f = 1. A well-supported MFST can have more leaves and be more resolved than an MAST; see, for instance, the experimental results in Section 3. In the more general setting where


the leaf sets of the input trees have little overlap, the gap between the size of an MFST and that of an MAST would be even wider. Indeed, in this case any agreement tree would tend to be quite small. Motivated by this utility, our current work also enumerates MFSTs on a collection of trees with partially overlapping leaf sets. Further, an MFST can be more resolved than the majority rule tree, which can be greatly affected by "rogue" taxa; that is, taxa whose positions vary widely within the input collection [8]. See Fig. 2.


Fig. 2. (a) Three input trees. (b) Their MAST, which is star-like. (c) Two MFSTs with f = 2/3, each fully resolved and larger than the MAST. (d) The majority rule tree, which is also star-like.

The set of all MFSTs is a compact, non-redundant summary of the set of all FSTs: every FST is a subtree of some MFST, but no MFST is a subtree of any other FST. Thus, every MFST reveals some unique phylogenetic information that is not displayed by any other FST. To our knowledge, the enumeration of MFSTs has not been studied before. Here we sketch a new algorithm, MFSTMINER, for this task. We compare MFSTMINER with Phylominer [9], an algorithm for enumerating all FSTs, and show that enumerating MFSTs can be orders of magnitude faster than enumerating all FSTs. Our current implementation can be downloaded from http://goo.gl/jzzUk; it works for up to 250 leaves and 10,000 trees. Further details on MFSTMINER and additional experimental results can be found in [5].

2 Outline of the Enumeration Algorithm

We enumerate all MFSTs from the solution space of all FSTs by exploiting the fact that a k-leaf tree is an FST only if all k of its (k − 1)-leaf subtrees are FSTs. In fact, we can show that every k-leaf FST can be enumerated by combining two unique (k − 1)-leaf FSTs. We call the k-leaf tree a join of the two smaller (k − 1)-leaf trees. Further, we enumerate only those FSTs that can potentially lead to MFSTs. To do so efficiently, three main issues must be addressed: non-redundant enumeration of the MFSTs, efficient frequency counting for a particular join to classify it as an FST, and limiting the combinatorial explosion due to the number of FSTs. To avoid generating multiple isomorphic copies of the same tree, we enumerate subtrees in “canonical form” [9]. To enumerate every canonical representation once,


we define a parent-child relationship over the space of all FSTs. This induces an enumeration tree over the solution space, where each node represents a collection of FSTs grouped together via an equivalence relation. Leaf nodes represent potential MFSTs, and each MFST belongs to a unique leaf node. This scheme is motivated by the reverse search technique for enumeration [2]. An FST is enumerated by combining two smaller FSTs. However, the two smaller FSTs can exhibit more than one type of join across the input trees. Thus, an FST can arise out of their combination only if the fraction of the input trees in which the two FSTs exhibit a common type of join is at least f. Determining this involves identifying the types of joins the smaller FSTs exhibit across the input trees. For this we use a fast, one-time, least-common-ancestor-based preprocessing step that can identify the join type in each input tree in constant time. One way to enumerate all MFSTs is to visit all the leaf nodes representing potential MFSTs. However, this requires traversing the complete enumeration tree and can lead to a combinatorial explosion due to the number of FSTs. To limit this, we prune a branch at a node in the enumeration tree, without traversing it, if none of its leaf descendants contain an MFST.
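The frequency test underlying the FST classification can be sketched in a few lines: a candidate subtree is f-frequent if restricting each input tree to the candidate's leaf set (suppressing degree-2 nodes) yields the candidate's topology in at least f · m of the m input trees. This is an illustrative, quadratic-time sketch, not MFSTMINER's constant-time LCA-based counting; the nested-tuple tree encoding and helper names are ours.

```python
# Trees are nested tuples; leaves are strings. Illustrative encoding only.

def restrict(tree, keep):
    """Restrict a tree to the leaf set `keep`, suppressing unary nodes."""
    if isinstance(tree, str):                       # leaf
        return tree if tree in keep else None
    kids = [r for c in tree if (r := restrict(c, keep)) is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def canon(tree):
    """Canonical form: sort children so isomorphic trees compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(c) for c in tree), key=repr))

def leaves(tree):
    return {tree} if isinstance(tree, str) else {x for c in tree for x in leaves(c)}

def frequency(candidate, input_trees):
    """Fraction of input trees that homeomorphically include `candidate`."""
    target, L = canon(candidate), leaves(candidate)
    hits = 0
    for t in input_trees:
        r = restrict(t, L)
        if r is not None and canon(r) == target:
            hits += 1
    return hits / len(input_trees)

trees = [(("a", "b"), ("c", "d")),
         (("a", "b"), ("c", "d")),
         (("a", "c"), ("b", "d"))]
print(frequency((("a", "b"), ("c", "d")), trees))  # 2 of the 3 trees display this topology
```

With f = 1/2 the candidate above would qualify as an FST (frequency 2/3); with f = 1 it would not.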

3 Experiments and Results

We derived 17 datasets from a set of bootstrapped trees analyzed in a previous study [8] on majority rule trees. These trees were constructed from single-gene and multi-gene datasets. We order the sets of trees by increasing number of sequences and refer to the datasets as A–Q. To extract datasets with different numbers of leaves and trees, we randomly selected the required number of trees and restricted them to a random set of leaves of the required size. We conducted three experiments. The first, done on a set of 100 trees on 50 leaves from each of the datasets, compared the size of the MAST with the size of the largest MFST using f = 0.51 (Fig. 3a). In some cases, the largest MFST had more than twice as many taxa as the corresponding MAST. The second experiment compared MFSTMINER with Phylominer [9], an algorithm that enumerates all FSTs (Fig. 3b). Enumerating MFSTs was orders of magnitude faster than enumerating all FSTs. This experiment was done on dataset A with 100 trees for each leaf set size. The graph has missing entries in the cases where Phylominer exceeded its 4 GB memory limit. This limitation arises because that program keeps all the FSTs in memory. MFSTMINER does not have this memory limitation, since it traverses the enumeration tree using depth-first search, keeping only the FSTs along a branch. The third experiment evaluated the scalability of MFSTMINER with respect to the size of the leaf set (Fig. 3c). This experiment was done on dataset Q. For each leaf set size, 100 trees were extracted from the corresponding dataset.

References

1. Amir, A., Keselman, D.: Maximum agreement subtree in a set of evolutionary trees. SIAM Journal on Computing 26, 758–769 (1994)


Fig. 3. Experiments. (a) MFSTs can have more leaves than MASTs (f = 0.51): number of leaves of the MAST versus the largest MFST for datasets A–Q. (b) MFSTMINER vs. Phylominer: runtime in seconds versus size of the leaf set (f = 0.95). (c) Scalability of MFSTMINER: runtime in seconds versus number of leaves (f = 0.95).

2. Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Applied Mathematics 65(1), 21–46 (1996)
3. Daubin, V., Gouy, M., Perrière, G.: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Research 12(7), 1080–1090 (2002)
4. De Vienne, D., Giraud, T., Martin, O.: A congruence index for testing topological similarity between trees. Bioinformatics 23(23), 3119–3124 (2007)
5. Deepak, A., Fernández-Baca, D.: Enumerating all maximal frequent subtrees. Tech. Rep. 12-01, Dept. of Computer Science, Iowa State University (2012), http://goo.gl/Tve0t
6. Finden, C., Gordon, A.: Obtaining common pruned trees. Journal of Classification 2(1), 255–276 (1985)
7. Goddard, W., Kubicka, E., Kubicki, G., McMorris, F.: The agreement metric for labeled binary trees. Mathematical Biosciences 123(2), 215–226 (1994)
8. Pattengale, N., Aberer, A., Swenson, K., Stamatakis, A., Moret, B.: Uncovering hidden phylogenetic consensus in large datasets. IEEE/ACM Trans. Comput. Biol. Bioinformatics 8(4) (2011)
9. Zhang, S., Wang, J.: Discovering frequent agreement subtrees from phylogenetic data. IEEE Trans. Knowl. Data Eng. 20(1), 68–82 (2008)


Bioinformatics: Desktop Applications to Petascale Architectures with Web-Based Portals

Bhanu Rekepalli, Paul Giblock, and Christopher Reardon

Joint Institute for Computational Sciences, The University of Tennessee-Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN, 37831 {brekapal, pgiblock, creardon}@utk.edu

Abstract. Systems biology research spans multiple biological scales, from atoms to ecosystems, and its complexity warrants a significant computational component for the underlying research. Genome sequencing technology and accompanying high-throughput approaches, such as proteomics, drive data-intensive bioinformatics applications aimed at deriving information on the scale of molecules and pathways. The level of computing that is now required to handle current and future systems biology applications can be found only in specialized national centers and cannot be achieved by a single lab or a university alone. Thus, the Joint Institute for Computational Sciences, part of the University of Tennessee and the Oak Ridge National Laboratory, is developing massively parallel applications to analyze the plethora of biological data to derive novel knowledge at a highly rapid rate. The discoveries from these analyses will have direct effects on human health and the environment. The efforts from National Science Foundation (NSF) projects laid the groundwork to generate modules for widely used, parallel bioinformatics applications on the Kraken supercomputer (the world’s fastest academic supercomputer). These tools are used by a limited number of labs due to the complexity of job submission on High Performance Computing (HPC) resources, along with biologists’ limited familiarity with the command line on these UNIX-based systems. Most researchers must depend on their graduate students and postdocs to submit jobs and retrieve results. Thus, we are developing science gateways for systems biology applications that will allow easy access to HPC resources. Access to the science gateways will be through a web interface that allows scientists to more easily submit jobs and retrieve results.
This presentation will discuss the various developmental stages involved in taking a bioinformatics application from the desktop to supercomputers and into web-based portals for large-scale data analysis in the life sciences.

Keywords: Bioinformatics, computational genomics, proteomics, systems biology, Kraken supercomputer, High Performance Computing, parallel applications, web-based portals


A Web-based multi-Genome Synteny Viewer for Customized Data

Kashi V. Revanna1, Chi-Chen Chiu2, Daniel Munro1, Alvin Gao3, and Qunfeng Dong1,2*

1Department of Biological Sciences, 2Department of Computer Science and Engineering, 3The Texas Academy of Mathematics and Science, University of North Texas, Denton, Texas 76203, USA. *To whom correspondence should be addressed.

Email addresses: KR: [email protected] CC: [email protected] DM: [email protected] AG: [email protected] QD: [email protected]

Keywords: synteny, genome browser, visualization, bioinformatics

Abstract Web-based synteny visualization tools are important for sharing data and revealing patterns of complicated genome conservation and rearrangements. Such tools should allow biologists to upload genomic data for their own analysis. Recently, we published a web-based synteny viewer, GSV, which was designed to satisfy the above requirement [1]. However, GSV can only compare two genomes at a given time. Extending the functionality of GSV to visualize multiple genomes is important to meet the increasing demand of the research community. We have developed a multi-Genome Synteny Viewer (mGSV). Similar to GSV, mGSV is a web-based tool that allows users to upload their own genomic data files for visualization. Multiple genomes can be presented in a single integrated view with an enhanced user interface. Users can navigate through all selected genomes to examine conserved genomic regions as well as the accompanying genome annotations. A web server hosting mGSV is provided at http://cas-bioinfo.cas.unt.edu/mgsv.

Background Since patterns of genome conservation and rearrangements can be complicated, visualization tools are critical to reveal those patterns. A variety of web-based synteny visualization tools exist for this purpose (e.g., SynBrowse [2] and CoGe [3]). Compared to standalone bioinformatics software, those web-based analysis tools are more convenient for users since no local software installation or maintenance is necessary. However, some of these tools only allow users to analyze a small number of pre-selected genome sequences available at those web resources. This limitation is becoming a serious issue since biologists often need to examine synteny for their own sequences of interest, which are typically not available at those web resources.


Figure 1 - Screenshot of overview page with descriptions


Figure 2 - Screenshot of interface with descriptions


Design and Implementation The synteny data file allows users to specify the genomic location of each conserved region in each pair of genomic sequences. An optional genome annotation file can also be submitted to list the accompanying genomic features (e.g., genes) to be displayed as annotation tracks along with the reference genomes. Upon submission, an overview page, shown in Figure 1, gives a general idea of the genome synteny, and allows the user to select the initial ordering scheme of the genomes. The main interface, shown in Figure 2, is then displayed. Buttons along the top control which genomes are displayed as well as their order. Control panels for each synteny window and annotation track allow for customization of the view.
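As an illustration of how such a submission might be consumed server-side, the sketch below parses a hypothetical tab-delimited synteny file in which each line names one conserved region in a pair of sequences. The column layout and function name are invented for illustration and are not mGSV's actual file specification.

```python
# Hypothetical layout (NOT the actual mGSV format): one synteny block per
# tab-separated line:  genome1  start1  end1  genome2  start2  end2
def parse_synteny(lines):
    """Parse tab-delimited synteny blocks, skipping blanks and comments."""
    blocks = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        g1, s1, e1, g2, s2, e2 = line.rstrip("\n").split("\t")
        blocks.append((g1, int(s1), int(e1), g2, int(s2), int(e2)))
    return blocks

sample = ["# comment line",
          "genomeA\t100\t5000\tgenomeB\t200\t5100",
          "genomeB\t6000\t9000\tgenomeC\t50\t3000"]
print(parse_synteny(sample))
```

Because every block names a pair of sequences, an arbitrary number of genomes can be described in one file, which matches mGSV's pairwise-region submission model described above.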

Discussion Although embedding sequence comparison software might be convenient for users, we have chosen not to do so in mGSV, mainly for three reasons: (1) Sequence comparison among large genomes is often not practical on a web server due to heavy computational demands. (2) It is unrealistic for a centralized web server to decide which software or methods users should use for their data set. (3) Sequence comparison is not the only means of synteny identification. Other types of data (e.g., genetic mapping) may also provide synteny information.

Conclusions mGSV is a web-based synteny visualization tool that enhances the original functionalities of GSV by allowing biologists to upload their own data sets and visualize the synteny among multiple genomes simultaneously in a single integrated view. The novel design and the implementation of mGSV provide the research community with an important alternative to currently available tools.

References
1. Revanna KV, Chiu CC, Bierschank E, Dong Q: GSV: a web-based genome synteny viewer for customized data. BMC Bioinformatics 2011, 12:316.
2. Pan X, Stein L, Brendel V: SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics 2005, 21(17):3461-3468.
3. Lisch D et al: Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol 2008, 148(4):1772-1781.


Subgingival plaque microbiota in patients with type 2 diabetes Mi Zhou1*, Ruichen Rong2*, Daniel Munro2, Qi Zhang1§, and Qunfeng Dong2, 3§ 1School & Hospital of Stomatology, Wuhan University, Wuhan, Hubei China, 2Department of Biological Sciences, 3Department of Computer Science and Engineering, University of North Texas, Denton, Texas USA *These authors contributed equally to this work §Corresponding authors

Abstract Diabetes mellitus has become a major public health problem, estimated to affect over 92.4 million adults in China. Periodontitis is recognized as the sixth most common complication of diabetes mellitus, suggesting that the microbiota in the subgingival plaque of type 2 diabetic patients may differ from that of systemically healthy subjects. To investigate the bacterial diversity in the subgingival plaque of diabetic patients, and to test the hypothesis that the subgingival plaque microbiota in patients with diabetes differs from that of others, we applied 454 pyrosequencing to subgingival samples isolated from 31 volunteers in China. Sequencing was performed on a 454 Life Sciences Genome Sequencer FLX Titanium instrument to explore the microbial diversity by targeting the 16S rRNA hypervariable V1-V3 region. In total, 121 genera and 1149 species-level operational taxonomic units (OTUs) were identified from 116,218 16S rDNA sequences. After bioinformatic analysis, we found that periodontally healthy subjects with type 2 diabetes (P-D+) differ substantially from chronic periodontitis subjects with or without diabetes (P+D- and P+D+), while P+D- and P+D+ are also dissimilar.

Introduction Diabetes mellitus has become a major public health problem that is estimated to affect over 92.4 million adults in China. The prevalence of total diabetes was estimated at 9.7% in Chinese adults, with an additional 15.5% having prediabetes. Type 2 diabetes is the most common type, affecting more than 90% of diabetic individuals (1). Periodontitis is recognized as the sixth most common complication of diabetes mellitus (2); it results in loss of connective tissue and bone support and is the major cause of tooth loss in adults. Advanced chronic periodontitis often coexists with poorly controlled diabetes (3), and even impaired fasting glucose (IFG) may worsen periodontal disease (4); diabetes is thus considered a risk factor for chronic periodontitis. Conversely, an effect of periodontitis on diabetes control has been proposed. Many studies indicate that well-controlled periodontitis benefits diabetic patients by improving glycemic control (5), although the biological mechanisms of the association between diabetes and periodontitis have not been fully elucidated (6, 7, 8). The widely accepted causal factor of periodontitis is a pathogenic microflora. The mechanisms by which diabetes induces or aggravates periodontitis are still unclear, but may include increased glucose concentrations in gingival crevicular fluid and saliva. The condition of hyperglycemia in patients with diabetes changes the oral environment (9, 10). These changes in the microenvironment may disrupt the subtle dynamic balance between potential pathogens and beneficial bacterial species, altering the subgingival microbial composition; hence, periodontitis occurs or is aggravated. Although both bacterial plaque and diabetes play important roles in the pathogenesis of periodontitis, associations between diabetes and subgingival bacterial species or consortia have not been well elucidated.
Advances in high-throughput sequencing and bioinformatics analyses now allow for comprehensive culture-independent assay of host-associated microenvironment microbes. Many recent investigations have used sequencing technology, such as 454 pyrosequencing, which detects nucleotide incorporation by DNA polymerase via a chemiluminescent enzyme reaction, to enhance our understanding of how bacterial diversity affects health and disease (11, 12, 13). These newer culture-independent technologies quickly produce a great amount of sequence data and are used to characterize the environmental microbiome. The purposes of this study were as follows: (1) to utilize 454 pyrosequencing to determine the subgingival plaque bacterial diversity in type 2 diabetics with periodontitis; (2) to determine whether type 2 diabetes affects the composition of the subgingival plaque bacterial community.

Materials and Methods
Ethics statement The study protocol was approved by the Ethics Committee of the Faculty of Medicine for Human Studies, School and Hospital of Stomatology, Wuhan University (protocol number 2011029), and all patients signed informed consent.

Participant Selection The study inclusion criteria for all subjects were: aged 30-65 years, not pregnant, no HIV, no antibiotics, NSAIDs, or smoking within the past 6 months, at least 20 teeth, no clinical signs of oral mucosal disease or root caries, no prior periodontal therapy, and no history of periodontal surgery. Inclusion criteria for type 2 diabetics with or without periodontitis: HbA1c ≥ 6.5%, or fasting plasma glucose ≥ 7.0 mmol/L, or OGTT 2-hour glucose ≥ 11.1 mmol/L, and a diagnosis of type 2 diabetes for at least 1 year. Periodontal examinations were also conducted for participants. Inclusion criteria for patients with periodontitis, with or without diabetes: at least 30% of sites having probing depth and attachment loss, and more than 4 sites fulfilling the following criteria: probing depth ≥ 5 mm, clinical attachment loss ≥ 4 mm.

Sample collection, preparation, pooling, and sequencing Samples were collected in the morning between 9 am and 10 am. After removal of supragingival plaque, a subgingival plaque sample was removed from the 4 deepest sites of the molars using sterile Gracey curettes and transferred into tubes containing 200 μl of phosphate-buffered saline (PBS). The samples were immediately frozen at −70 °C until DNA extraction. DNA was isolated with a QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA) using the tissue protocol according to the manufacturer’s instructions. Amplicon DNA concentrations were measured using the Quant-iT PicoGreen dsDNA reagent and kit (Invitrogen). DNA samples were diluted in 30 μL 1X TE; an equal volume of 2X PicoGreen working solution was added for a total reaction volume of 60 μL in a minicell cuvette. Fluorescence was measured on a Turner Biosystems TBS-380 Fluorometer using the 465-485/515-575 nm excitation/emission filter pair. Following quantitation, cleaned amplicons were combined in equimolar ratios into a single tube. Pyrosequencing was carried out on a 454 Life Sciences Genome Sequencer FLX Titanium instrument (Roche).

Statistical Analysis and Feature Selection Genus level Weighted Unifrac (Fig. 1), Bray-Curtis NMDS and LDA show that periodontally healthy subjects with type 2 diabetes (P-D+) have large differences from

chronic periodontitis subjects with or without diabetes (P+D- and P+D+), while P+D- and P+D+ are also dissimilar. OTU-level analysis also shows this trend (Fig. 2). Other multivariate analyses, including ANOSIM and PERMANOVA, produce a p-value smaller than 0.05 for each comparison.
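The Bray-Curtis dissimilarity used in the ordination above has a simple closed form, BC(x, y) = Σ|xᵢ − yᵢ| / Σ(xᵢ + yᵢ), over two abundance profiles on the same set of taxa. A minimal numpy sketch follows; the genus counts are invented for illustration, and the NMDS/UniFrac steps themselves are not shown.

```python
import numpy as np

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance profiles
    (0 = identical composition, 1 = no shared taxa)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.abs(x - y).sum() / (x + y).sum()

a = [10, 0, 5]   # per-genus counts, sample A (illustrative values)
b = [6, 4, 5]    # per-genus counts, sample B
print(bray_curtis(a, b))  # (4 + 4 + 0) / 30 = 0.2666...
```

A matrix of such pairwise dissimilarities is the usual input to NMDS and to tests such as ANOSIM and PERMANOVA.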

Figure 1: Genus-level weighted UniFrac. Figure 2: OTU-level weighted UniFrac.

(Red: D-P- Green: D+P+ Blue: D+P- Yellow: D-P+)

In order to find indicator bacteria separating the 4 groups, statistical tests (Student’s t-test/Mann–Whitney U-test, Fisher’s exact test, Tukey HSD/NDWD test, and Indicator Species Analysis) and machine learning methods (including Boruta (random forest), SVM-RFE, elastic net, and Metadistance (KNN/SVM)) were used. Table 1 shows the important genera for separating the 4 different groups.

Table 1: Candidate bacterial genera separating the 4 groups

P+D- vs. P-D-: Actinomyces, Cardiobacterium, Corynebacterium, Neisseria, Treponema, Aggregatibacter, Streptococcus, Pyramidobacter, Tessaracoccus, Haemophilus, Gemella
P+D- vs. P+D+: Actinomyces, Aggregatibacter, Cardiobacterium, Corynebacterium, Haemophilus, Kingella, Tessaracoccus, Propionivibrio, Pseudomonas
P-D+ vs. P-D-: Actinomyces, Dialister, Schwartzia, Tannerella, Porphyromonas, Neisseria, Prevotella, Pseudomonas
P-D+ vs. P+D+: Treponema, Haemophilus, Schwartzia, Porphyromonas, Veillonella, Actinomyces, Eubacterium, Tannerella, Mycoplasma, Rothia, Streptococcus, Propionivibrio, Dialister, Prevotella

After feature selection, SVM and LDA provide good classification accuracy and can be used to build a model with high predictive power (Fig. 3). Also, based on the heatmap for these indicator genera (Fig. 4), we found that Eubacterium and Treponema are correlated with the P+ group; Gemella and Streptococcus are correlated with the P- group; Actinomyces and Pseudomonas are correlated with the D+ group; Aggregatibacter, Cardiobacterium, Corynebacterium, Tessaracoccus, and Kingella are anti-correlated with P+D-; Dialister, Porphyromonas, Prevotella, and Schwartzia are anti-correlated with P-D+; Neisseria and Haemophilus are correlated with P-D+ while anti-correlated with P+D-; Mycoplasma, Tannerella, and Propionivibrio are correlated with P+D+; Veillonella and Rothia are correlated with P-D+; Dialister is highly correlated with P+D- while anti-correlated with P-D+.


Figure 3: LDA classification comparison. Figure 4: Indicator genus heatmap.

References:
1. Yang W, Lu J, Weng J, Jia W, et al. Prevalence of diabetes among men and women in China. N Engl J Med. 2010 Mar 25;362(12):1090-101.
2. Löe H. Periodontal disease: the sixth complication of diabetes mellitus. Diabetes Care 1993;16(1):329-334.
3. Taylor GW. Bidirectional interrelationships between diabetes and periodontal diseases: an epidemiologic perspective. Ann Periodontol. 2001;6(1):99-112.
4. Choi YH, McKeown RE, Mayer-Davis EJ, Liese AD, et al. Association between periodontitis and impaired fasting glucose and diabetes. Diabetes Care. 2011 Feb;34(2):381-6. Epub 2011 Jan 7.
5. Taylor GW. The effects of periodontal treatment on diabetes. J Am Dent Assoc. 2003 Oct;134 Spec No:41S-48S.
6. Iacopino AM. Periodontitis and diabetes interrelationships: role of inflammation. Ann Periodontol. 2001 Dec;6(1):125-37.
7. Nishimura F, Iwamoto Y, Soga Y. The periodontal host response with diabetes. Periodontol 2000. 2007;43:245-53.
8. Graves DT, Liu R, Oates TW. Diabetes-enhanced inflammation and apoptosis: impact on periodontal pathosis. Periodontol 2000. 2007;45:128-37.
9. Strauss SM, Wheeler AJ, Russell SL, Brodsky A, et al. The potential use of gingival crevicular blood for measuring glucose to screen for diabetes: an examination based on characteristics of the blood collection site. J Periodontol. 2009 Jun;80(6):907-14.
10. Kardeşler L, Buduneli N, Biyikoğlu B, Cetinkalp S, Kütükçüler N. Gingival crevicular fluid PGE2, IL-1beta, t-PA, PAI-2 levels in type 2 diabetes and relationship with periodontal disease. Clin Biochem. 2008 Jul;41(10-11):863-8.
11. Grice EA, et al. Topographical and temporal diversity of the human skin microbiome. Science. 2009 May 29;324(5931):1190-2.
12. Eckburg PB, et al. Diversity of the human intestinal microbial flora. Science. 2005;308(5728):1635-8.
13. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010 Mar 4;464(7285):59-65.


Automatic Analysis of Dendritic Territory for Neuronal Images

Santosh Lamichhane, Jie Zhou

Department of Computer Science, Northern Illinois University, USA [email protected], [email protected]

Abstract. The territory occupied by the dendritic tree of a neuron is an important measure of neuron morphology. Robust and scalable image analysis methods for automatic territory quantification are critical for making large-scale screening possible, as well as for accurate differentiation of wild-type versus mutant morphology patterns. In this project, we developed an automatic pipeline that quantifies the territory occupied by the dendritic arbor of the lobula plate tangential cells (LPTCs) in the adult Drosophila brain, of both wild type and mutant type, using confocal microscopic images. Preliminary results show that our algorithm can overcome noise, artifacts from staining, as well as the property variations between wild-type and mutant images.

Keywords: automatic neuronal image analysis; confocal microscopy; lobula plate tangential cells; dendritic territory; image classification and annotation

1 Introduction

Intricate morphology is a striking feature of neurons and plays an important role in functional analysis and quantification of neuronal systems [1]. Advanced multi-dimensional microscopic imaging in an in vivo system, together with genetic manipulation of single neurons, has emerged as a powerful tool for studying neuronal morphology and related mechanisms. However, current methods for analyzing high-dimensional confocal images of neurons remain qualitative and largely manual, especially for complex neurons such as the lobula plate tangential cells (LPTCs) in the brain of the fruit fly Drosophila melanogaster. Robust and scalable image analysis methods are critical for filling the performance gap to make large-scale screening possible, as well as for accurate differentiation of wild-type versus mutant morphology patterns. Among neuronal morphometrics, the territory occupied by the dendritic tree is an important measure of neuron morphology that not only identifies a defective shape potentially linked to neuronal function but also serves as an important parameter for further quantitative analysis (e.g., synapse distribution) in mutant screening [2]. In this project, we developed an automatic high-throughput pipeline that quantifies the territory occupied by the dendritic arbor of the LPTC neurons in the adult Drosophila brain of both wild and mutant types.



The neuronal images, obtained via in vivo confocal microscopic imaging with fluorescent protein labeling, present several research challenges, including a) high variation of image properties (e.g., contrast) between healthy and mutant types; b) staining artifacts from labeling; and c) other noise, such as that from the multi-image stitching needed to obtain the complete image of a single LPTC neuron. To overcome these problems, our automatic image analysis algorithms improve on traditional methods by incorporating novel machine-learning-based methods for segmenting the soma and axon from the neuron, then extracting and quantifying the territory occupied by the dendritic arbor. An ImageJ plugin has also been developed to provide neuroscientists with a portable tool for automatic territory analysis.

2 Method

The GFP-stained samples of the lobula plate tangential cells (LPTCs) in the brain of the adult Drosophila melanogaster were imaged on a Leica SP5 laser-scanning confocal system. Separate images were stitched together to obtain the complete image of a single neuron; see Figure 2 for examples of Vertical System (VS) LPTC images of both wild type and mutant. Pre-processing removes noise by despeckling with a 3×3 median filter, followed by locally adaptive histogram equalization to enhance image contrast. To extract the territory, we first automatically remove the soma and axon from the image. For the images we work with, simple threshold-based removal or methods based on the Hough transform alone are not sufficient. Instead, we start with two guide images: an edge mask (created using Canny edge detection) and a binary mask. We detect soma seed points in the Hough space. A morphological dilation is performed on the selected seed until the soma-axon junction point [3]. The two guide masks ensure that the dilation stays on the foreground and within edges. To remove the axon, machine-learning-based segmentation is used to increase reliability and overcome image variations. We built a trained model to detect axon seed points in the image. We used a training set consisting of 4 categories of image regions extracted from two neuron images, one wild type and one mutant. The training set included 10 axon, 50 branch, 7 soma, and 9 background image regions. We made use of our in-house bioimage annotation tool (BioCAT), which performed automatic comparisons of various combinations of extractors, selectors, and classifiers, and selected Haar wavelets as features, the Fisher criterion as selector, and a Support Vector Machine as the classifier. Once the model was built, it was applied to other LPTC images to automatically annotate image regions and export the regions identified as axon candidates.
The best candidate is chosen using length and orientation, assuming the axon is a long tubular structure, and dilation is performed until the axon-dendrite junction, again guided by the two mask images. Once the soma and axon are removed from the neuron, we perform a two-round rolling-ball algorithm to subtract the background and extract the foreground of the dendritic tree. Finally, the extracted foreground is binarized to quantify the size of the territory of the dendritic tree. Figure 1 summarizes the above algorithm flow.
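The pre-processing step described above (3×3 median despeckling followed by histogram equalization) can be sketched with plain numpy. This is an illustrative stand-in, not the authors' implementation: it uses global rather than locally adaptive equalization, and the helper names and parameters are ours.

```python
import numpy as np

def median3x3(img):
    """3x3 median despeckling, edge-padded; a plain-numpy stand-in
    for a median filter such as ImageJ's Despeckle."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    stack = np.stack([p[i:i + h, j:j + w] for i in range(3) for j in range(3)])
    return np.median(stack, axis=0)

def preprocess(img):
    """Despeckle, then equalize the intensity histogram to enhance
    contrast. Global equalization is shown for brevity; the paper
    uses a locally adaptive variant."""
    img = median3x3(img)
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum() / hist.sum()           # cumulative distribution
    return (cdf[img.astype(np.uint8)] * 255).astype(np.uint8)

noisy = np.random.default_rng(0).integers(0, 256, (64, 64))
print(preprocess(noisy).shape)
```

After this step, the resulting contrast-enhanced image feeds the Canny edge mask and binary mask used to guide soma and axon removal.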


Fig. 1. Workflow of the territory extraction process.

3 Experimental Results and Tools

Figure 2 demonstrates the process and results of the territory extraction. Figure 3 summarizes the results on nine wild-type LPTC VS1 neuronal images and three images of the L1 mutant. An ImageJ [4] plug-in, “Neuron Territory Analyzer”, has been developed for the project; it provides a portable and freely available tool for neuron territory quantification from confocal images.

Fig. 2. Various stages of processing for wild type (upper) and L1 mutant type neurons (lower). (A) Original image (B) Preprocessed image (C) Extracted soma (D) Extracted axon (E) Extracted territory overlaid on original image.


Fig. 3. Box plot of territory size for wild-type and mutant neurons, in microns.

4 Conclusion

Preliminary results show that our algorithm can overcome noise, artifacts from staining, as well as the property variations between wild-type and mutant images, such as inconsistent image brightness, contrast, and noise levels. The results demonstrate a reduction in the territory occupied in the mutant images compared to that of the healthy type. Experiments will be conducted to test the algorithms on more mutant neuronal images.

5 Acknowledgements

Dr. Bing Ye of the Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, provided the images. The NIU Graduate School Great Journey Assistantship provided funding support for the project.

References

1. Grueber WB, Yang CH, Ye B JY. The development of neuronal morphology in insects. Current Biology. 2005;(15):730–738.
2. Meseke M, Evers JF, Duch C. Developmental changes in dendritic shape and synapse location tune single-neuron computations to changing behavioral functions. Journal of Neurophysiology. 2009;102(1):41-58.
3. Xiong G, Xing L, Taylor C. Automatic Junction Detection for Tubular Structures. Insight. 2009. Available at: http://hdl.handle.net/1926/1534.
4. Abramoff MD, Magalhaes PJ, Ram SJ. Image Processing with ImageJ. Biophotonics International. 2004;11(7):36-42. Available at: http://imagej.nih.gov/ij/.

A Neural Network Approach to Pre-filtering MS/MS Spectra

James P. Cleveland and John R. Rose
Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208
email: [email protected], [email protected]

Abstract—The effectiveness of any de novo peptide sequencing algorithm depends on the quality of MS/MS spectra. Since most of the peaks in a spectrum are uninterpretable 'noise' peaks, it is necessary to carefully pre-filter the spectra to select the 'signal' peaks that likely correspond to b-/y-ions. Selecting the optimal set of peaks for candidate peptide generation is essential for obtaining accurate results; however, this step is often glossed over in most published de novo peptide sequencing algorithms. A careful balance must be maintained between the precision and recall of the peaks that are selected for further processing and candidate peptide generation. If too many peaks are selected, the search space will be too large and the problem becomes intractable. If too few peaks are selected, cleavage sites will be missed, the candidate peptides will have large gaps, and sequencing results will be poor. For this reason, pre-filtering of MS/MS spectra and accurate selection of peaks for candidate generation is essential to any de novo peptide sequencing algorithm. Here we present a novel neural network approach for the selection of b-/y-ions utilizing known fragmentation characteristics, and leveraging neural network probability estimates of flanking and complementary ions.

Keywords—Tandem mass spectrometry, proteomics, de novo, peptide sequencing, MS/MS preprocessing, b-/y-ion selection

I. INTRODUCTION

Tandem mass spectrometry (MS/MS) is the most important tool in high-throughput proteomics, of which the primary application is peptide identification. For MS/MS spectra originating from proteins that are not present in a sequence database, researchers must use de novo peptide sequencing algorithms to sequence the peptide. The goal of de novo peptide sequencing is to compute the peptide whose fragmentation produced the experimental spectrum. De novo peptide sequencing follows a general formulation: peak selection (often described as preprocessing), followed by peptide candidate generation, and then candidate scoring. In the peak selection step the objective is to identify a subset of peaks in the spectrum that likely correspond to b-/y-ion ladders. In the candidate generation step the selected peaks are used to generate a set of candidate peptides that could have produced the spectrum. It is during candidate generation that the effects of inadequate peak selection become problematic. In the candidate scoring step a scoring function is used to rank the candidates and choose the most probable peptide that produced the spectrum. Often candidate scoring is incorporated into candidate generation. The quality of the candidates that are generated and scored depends on the ability of the program to initially select peaks that will allow for the correct (or best) candidate peptide to exist in the search space, so it is important that peak selection be done in an optimal way.

The general approach used by other prominent de novo peptide sequencing algorithms depends primarily on relative peak intensity. PepNovo uses a sliding window of width 56 across the spectrum and keeps any peaks that are in the top 3 when ranked by intensity [1]. MSNovo selects peaks using a sliding window of width 100 and selects the top 6 peaks from each window [2]. PILOT keeps only the top 125 peaks of highest intensity in the spectrum [3]. pNovo selects the top 100 peaks by intensity [4].

In our experiments we found that selecting peaks based on relative intensity alone could miss a nontrivial portion of b-/y-ions. If the complex dynamics of peptide fragmentation, including relative peak intensity, can be modeled and incorporated into a predictive ion-type classifier, then the accuracy of peak selection will be superior to that of a peak classifier that uses peak intensity alone. We demonstrate that this superior approach can be implemented via a neural network. A neural network approach was used because it allows us to construct a predictive model that does not require a complete understanding of the complex dynamics of peptide fragmentation. The Leveraged Neural Network (LNN) ion classifier described below selects peaks with higher precision and recall than other de novo peptide sequencing algorithms. Increasing recall leads to better candidates in the candidate peptide search space. If recall is held fixed and precision is increased, the result is a significantly smaller candidate peptide search space, without sacrificing the best candidate contained in the search space. Given the computational limits that all de novo algorithms face, low precision can render any de novo algorithm prohibitively impractical. Concurrently, low recall can cause an exploding combinatorial search space and lowered candidate accuracy due to large gaps from missing peaks. It is clear that a careful balance of improved precision and recall is important for peptide candidate generation.

II. METHODS

The dataset used in this study is composed of doubly charged tryptic peptides produced by low energy LC/MS/MS. Of the original dataset containing 8610 mass spectra we kept 3373 spectra of unique peptides which had an Xcorr score greater than 2.5 and a mass between 600 and 3000 Da. Our data came from the PNNL Salmonella Typhimurium dataset [5], which is publicly available on the web for download (http://omics.pnl.gov/view/dataset 80292.html). The dataset (D) was divided into a training dataset (DT, 2873 spectra) and a small testing dataset (DE, 500 spectra).

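The networks in Section II are trained against one-hot targets over the three ion classes (b-, y-, u-ion) with a cross-entropy error function. A minimal NumPy sketch of that objective, with illustrative probability values:

```python
import numpy as np

def cross_entropy(target, output):
    """Summed cross-entropy over the three ion classes (b, y, u).
    `target` is a binary one-hot vector; `output` holds the
    network's posterior probability estimates."""
    t = np.asarray(target, dtype=float)
    o = np.asarray(output, dtype=float)
    return float(-np.sum(t * np.log(o) + (1 - t) * np.log(1 - o)))

# A b-ion peak (target [1,0,0]) scored confidently vs. incorrectly;
# the numbers are illustrative, not taken from the paper.
err_good = cross_entropy([1, 0, 0], [0.9, 0.05, 0.05])
err_bad  = cross_entropy([1, 0, 0], [0.2, 0.7, 0.1])
# err_bad > err_good: misclassification is penalized more heavily.
```

Training stops, as described below, when this error on the validation spectra starts to increase.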

For each spectrum in the training dataset we first removed peaks with intensity below an experimentally derived threshold, in this case 15 Da, which dramatically sped up the training of the neural network without sacrificing performance. It should be noted that the intensity threshold was only applied to the training dataset, and not to the testing dataset; the precision and recall statistics presented below were computed on the raw spectra from the testing set.

Before the neural network can be trained, DT must be transformed. Each peak in DT is assigned its correct class label (target vector), either b-ion, y-ion, or u-ion (unknown ion), each of which is a binary vector of length three. For each peak in D a feature vector (pattern) is generated that is later presented to the input layer of the neural network for training and classification. The features used are described in Table I. DT is divided again such that 2538 spectra are set aside for backpropagation (DTB) and 317 spectra for validation (stopping criterion) (DTV). The peaks in DTB are then filtered so that there are equal numbers of b-, y-, and u-ions.

The training process of the neural network requires the use of an objective error function. In our implementation the output (o) of the neural network represents an estimate of the posterior probability that the input pattern belongs to the respective class in the target vector (t). When interpreting the outputs as probabilities it is appropriate to use the cross entropy error function [6]:

    network error = - Σ_{i=0}^{2} [ t_i log(o_i) + (1 - t_i) log(1 - o_i) ]

The neural network trains on all of the patterns in DTB numerous times (epochs) until the network performance no longer improves. This is determined by classifying the patterns in DTV after each epoch until the error on DTV begins to increase, at which point training terminates.

In our classifier two neural networks are used in succession for peak classification, which we refer to as a leveraged neural network. Each is trained in the manner described above except for differences in the feature vector used. The structure of each neural network consists of an input layer with as many nodes as features in the pattern, a single hidden layer with twice as many nodes as the input layer, and an output layer with three nodes corresponding to the three possible classes. A general formulation for training the neural networks is given in Algorithm 1. In the first neural network (net1) the peak features are computed from data in the spectrum alone, as described below and in Table I. In the second neural network (net2) the outputs from net1 are leveraged as additional features, also described in Table I.

Algorithm 1 Leveraged Neural Network Training and Classification
    net1 ← train(DTB, DTV)
    DT ← classify(DT, net1)    {peaks in DT now have b-/y-/u-ion probability estimates}
    net2 ← train(DTB, DTV)
    DE ← classify(DE, net1)
    DE ← classify(DE, net2)    {results presented in Table II}

TABLE I
PATTERN FEATURES: N DENOTES A NORMALIZED VALUE, D A DISCRETIZED VALUE, B A BINARY VALUE, AND P A PROBABILITY ESTIMATE.

    Feature                       net1 pattern    net2 pattern
    intensity                     N, D            N, D
    strong peak                   B               B
    local intensity rank          N               N
    global intensity rank         N               N
    relative cleavage position    N, D            N, D
    random peak hypothesis        P               P
    principal isotope             B               B
    isotopologue                  B               B
    complement                    N, D            Pnet1
    H2O neutral loss              N, D            N, D
    NH3 neutral loss              N, D            N, D
    H2O-H2O neutral loss          N, D            N, D
    H2O-NH3 neutral loss          N, D            N, D
    CO neutral loss (a-ion)       N, D            N, D
    y2-ion                        N, D            N, D
    N-term flanking ion           -               Pnet1
    C-term flanking ion           -               Pnet1

The pattern for net2 modifies the complement ion feature by replacing the normalized relative intensity of the complementary peak with the maximum of the b-/y-ion probability estimates in the output from net1 for the complementary peak. The pattern for net2 also adds two additional features corresponding to flanking residues on the N and C terminal sides of the current peak (the peak for which the feature vector is being computed). The N terminal flanking residue feature is computed by taking the maximum b-/y-ion probability (as estimated by net1) of any peak with a mass offset from the current peak equivalent to the mass of an amino acid. The C terminal flanking residue feature is computed similarly. The reasoning for these 'leveraged' features is that if the current peak has a complement or flanking peak with a high probability of being a b-/y-ion, then the current peak has an increased probability of being a b-/y-ion itself. Our experiments show that leveraging the output from net1 to train a second neural network in this way yields a higher recall than does classification with net1 alone.

Description of features. The intensity feature is simply the normalized and discretized relative peak intensity of the current peak. Normalized intensities are computed by dividing each peak intensity by the maximum peak intensity in the spectrum. Normalized and discretized intensities are rounded up to either 0.05, 0.10, 0.20, 0.40, 0.80, or 1.00. The strong peak feature is a binary value that indicates whether or not the current peak was selected as a 'strong peak' using a sliding window method; in this case the top three peaks were selected in a sliding window of width 56 Da. The local and global intensity ranks give the normalized rank by intensity of the current peak within a 'local' window, or globally. The relative cleavage position gives the normalized and discretized position of the current peak relative to the parent ion mass. The random peak hypothesis estimates the probability that the current peak is a random peak rather than an ion of interest; this feature is modeled after the random peak hypothesis described in [1]. The principal isotope feature is a binary value that indicates whether or not the current peak appears to be a principal isotope, and the isotopologue feature indicates the converse. The complement feature in the net1 pattern is the normalized and discretized intensity of any peak found at the expected complement mass position. In the case of the net2 pattern the complement feature gives the maximum b-/y-ion probability estimate using net1 of any peak found at the expected complement mass position. The H2O, NH3, H2O-H2O, H2O-NH3, and CO neutral loss features all give the normalized and discretized intensity of peaks found at their respective offsets from the current peak. The y2-ion feature gives the normalized and discretized intensity of any peak existing at the offset from the current peak where a doubly charged y-ion is expected. The N-term and C-term flanking ion features are the maximum b-/y-ion probability estimates using net1 for any peaks found at a mass offset from the current peak corresponding to the mass of a single amino acid.

Fig. 1. Visualization of net1 (input layer: the net1 pattern features of Table I; a single hidden layer; output layer: b-ion, y-ion, u-ion).

III. RESULTS

Preliminary results comparing the precision and recall of the leveraged neural network (LNN) peak selection with other prominent de novo peptide sequencing algorithms are given in Table II. The window method selects peaks by choosing the 3 most intense peaks in a window of width 56 Da. This is the method Frank described in the original PepNovo publication [1]; however, the actual performance of PepNovo shown in Table II is substantially better with respect to recall. The actual performance of PepNovo was determined by modifying the source code to output the peaks from the raw spectrum that are used to construct the spectrum graph.

TABLE II
RESULTS COMPARING DIFFERENT PEAK SELECTION ALGORITHMS.

                     Precision    Recall
    Window Method    0.153        0.551
    MSNovo           0.135        0.573
    pNovo            0.093        0.683
    PILOT            0.078        0.721
    PepNovo          0.122        0.746
    LNN              0.183        0.751

Note that the LNN precision is 50% greater than that of PepNovo. The effect of this difference can be seen when we generate a search space of candidate peptide sequences using the peaks selected by the two algorithms. We implemented a basic spectrum graph approach as described in [7] to generate candidate peptides using the peaks selected by PepNovo and the LNN. The best candidate in the search space generated by the LNN was better on average than the best candidate produced by PepNovo by 7% when doing a sequence comparison to the correct peptide. More importantly, the size of the search space generated by the de novo algorithm using PepNovo's peak selection was on average 144% larger than the size of the search space generated by the de novo algorithm using LNN peak selection.

REFERENCES

[1] A. Frank and P. Pevzner, "PepNovo: de novo peptide sequencing via probabilistic network modeling," Analytical Chemistry, vol. 77, no. 4, pp. 964-973, Feb. 2005.
[2] L. Mo, D. Dutta, Y. Wan, and T. Chen, "MSNovo: a dynamic programming algorithm for de novo peptide sequencing via tandem mass spectrometry," Analytical Chemistry, vol. 79, no. 13, pp. 4870-4878, 2007.
[3] P. A. DiMaggio and C. A. Floudas, "De novo peptide identification via tandem mass spectrometry and integer linear optimization," Analytical Chemistry, vol. 79, no. 4, pp. 1433-1446, Feb. 2007.
[4] H. Chi, R. Sun, B. Yang, C. Song, et al., "pNovo: de novo peptide sequencing and identification using HCD spectra," Journal of Proteome Research, pp. 2713-2724, 2010.
[5] C. Ansong, N. Tolić, S. O. Purvine, S. Porwollik, M. Jones, H. Yoon, S. H. Payne, J. L. Martin, M. C. Burnet, M. E. Monroe, P. Venepally, R. D. Smith, S. N. Peterson, F. Heffron, M. McClelland, and J. N. Adkins, "Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium," BMC Genomics, vol. 12, no. 1, p. 433, Jan. 2011.
[6] L. Silva and J. M. de Sá, "Data classification with multilayer perceptrons using a generalized error function," Neural Networks, vol. 21, no. 9, pp. 1302-1310, Nov. 2008.
[7] B. Lu and T. Chen, "A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry," Journal of Computational Biology, vol. 10, no. 1, pp. 1-12, Jan. 2003.


Statistical software and business productivity applications: workflows for communication and efficiency

Marie Vendettuoli 1,2,4, David Siev 4 and Heike Hofmann 1,2,3

1 Bioinformatics and Computational Biology Program
2 Human Computer Interaction Program
3 Department of Statistics, Iowa State University, Ames IA 50010
4 Statistics Section, Center for Veterinary Biologics, APHIS/USDA, 1920 N Dayton Ave, Ames IA 50010

Abstract. We present a case study of the data challenge facing statisticians of USDA APHIS when evaluating submissions for product licensing. Specifically, we examine the impact on productivity that arises after introducing a workflow that relies on functionality of off-the-shelf business productivity software, R, Sweave, and LaTeX to facilitate information transfer between stakeholders. To address the challenge imposed by having no common data format enforced for submissions, we created a workflow for rapid development and deployment of robust tools rooted in the fields of data visualization, reproducible research and human-computer interaction. This workflow is a platform facilitating the development of technical solutions, the implementation of which both reduces overall turnaround times and increases submission quality. Application extends beyond the immediate needs of the current user group and may be leveraged to create multidisciplinary just-in-time tools that meet the fluid demands existing at the interface between statistics and business productivity audiences.

1 Introduction

One mission of the Center for Veterinary Biologics (CVB) within the United States Department of Agriculture (USDA) Animal and Plant Health Inspection Service (APHIS) is the evaluation of veterinary biologics for licensure in a manner compliant with the provisions of the Virus-Serum-Toxin Act (VST). Title 9 of the Code of Federal Regulations (CFR) specifically stipulates that products falling under licensing requirements receive approval only after a successful evaluation of records and methods used to demonstrate the validity of claims. No mention exists of submission format at a data management level. Practically, this policy means statisticians in CVB must accept submissions in a wide variety of formats, including output from software the section does not have access to and, occasionally, hard-copy formats. It is outside the scope of this paper to discuss mandating a rigid electronic submission scheme; any data formatting guidelines under current policy must be readily accessible to firms varying greatly in size and available resources. In this paper we examine changes in the CVB Statistics Section workflow over the past twelve months that integrate new tools developed for business productivity and statistical analysis environments.



2 Background: Borrowing From Other Disciplines

Reproducible Research. The role of a statistician when evaluating a firm submission is similar to that of a researcher developing an in-depth understanding of an analysis performed by authors in the academic publication setting, where the submitting firm is in the role of author. As a first phase, this process involves knowing exactly what actions the author took to elucidate final conclusions from raw data. Secondly, the statistician is interested in extending the initial work performed by that author through different comparisons or the application of previously unexplored models. These steps match exactly the process that the reproducible research movement attempts to address [3], [10]. In the working environment of R, creating a reproducible compendium for distribution is achieved by using Sweave [4] and LaTeX.

Regulated Clinical Trial Environments. 21 CFR Part 11 contains the guidelines for electronic records and signatures in organizations regulated by the Food and Drug Administration (FDA). Although the FDA has a distinctly different mission from CVB, aspects of this guidance are relevant to both applications. Both R and SAS are tools for statistical analysis that fit into this framework. Specific examples include: generating records in eye-readable and electronic formats that are suitable for inspection and review, storing records using external infrastructure that meets compliance expectations, respecting the user access granted by an operating system, functions to time-stamp records to create audit trails, error-checking methods to enforce predetermined workflows, the ability to implement checks on data quality and/or completeness, and version development and documentation tools [8]. Additionally, tools that support reproducible research are consistent with the validation framework of regulation.
User-centered design. While it is tempting to focus strictly on technical solutions to maximizing productivity, ultimately the submission process is one of communication between groups of people with diverse backgrounds of expertise, and a successful tool must facilitate this process. Simple choices, such as selecting an interface that most users are already familiar with instead of requiring individuals to learn a new one, allow all stakeholders to focus on the submission at hand rather than on learning new interactions. Some principles that originate in the field of human-computer interaction include Gestalt theory and reduction of cognitive load [6]. By paying careful attention to the structure of the interface environment, it is possible to develop tools which increase the quality of data submissions by reducing the effort required of firms to understand the formatting needs of CVB Statistics. Additionally, providing immediate (automated) feedback to users reduces the amount of time spent amending data submissions due to inconsistencies and incompleteness.

3 A Solution

To support the diversity of working environments and expertise, we propose a workflow that uses tools optimized for each user group. Statisticians perform analysis and develop reports from within R, submitting firms are presented with format templates that use Excel as an interface, and report end-users get a PDF output. Once data is in a consistent format, the R language environment is an ideal platform to implement the goals described above. The ease with which CVB Statistics is able to develop custom approaches to analysis that leverage existing published methods integrates well with the expectations of reproducibility and documentation that the regulatory role of CVB requires.



The visible work-product of statisticians is a report which may be viewed in hard-copy and/or electronic formats. This report consists of the statisticians' final interpretation of the submission, supported with graphics and summary data displays that support those conclusions. While any word processing program is sufficient for the text, the other elements are generated from within the statistical analysis environment, R. One challenge of compiling these elements into a single document is the manual manipulation required. For the statistician to invest any significant amount of time in typesetting is not a priority use of that individual's expertise. Additionally, due to the interactive nature of analysis, it is not possible to absolutely confirm that all figures and summaries were created under the same computing conditions, which reduces the reliability of any recommendations. A solution from the field of reproducible research is the use of Sweave files. These text files intersperse the R code required to generate statistical elements with the accompanying explanatory text. With one click from the statistician, all code is evaluated in a single (new) session and the report is generated in PDF format via LaTeX. The use of the LaTeX typesetting system allows statisticians to focus on content instead of display characteristics by using styles that are consistent for the entire group.

Since all statisticians are expected to analyze a variety of submission types, one way to increase productivity is by sharing commonly used R functions. Sharing reduces the amount of reimplementation that individual statisticians must code and acts as an expert user group to validate the robustness of any particular function. To ensure that all members of the group are using the same version of a particular function, we use the package structure that R supports, creating an internal repository with documented functions. As part of the package creation process, test cases for each function are developed which ensure that updates do not change the results under specific situations.

While statisticians within CVB Statistics are familiar with the R language environment, it may be a significant obstacle for individual firms to develop such expertise. A much more accessible platform is the business productivity application Microsoft Excel. We developed several templates for common data submission types. Macros in this tool not only guide the data entry, but provide immediate feedback when data is inconsistent with guidelines. When data is free of errors, a simple button-click is all that is necessary for exporting in the correct structure.

The anticipated workflow is that a firm will use the Excel templates as a data-entry tool. Once data files are submitted to CVB Statistics, functions in the dataFormats package are used for import and generation of the most common statistical functions associated with the submission type. Statisticians can perform additional analysis or generate a report with the initial results. Because incoming data is highly structured, this initial overview is wrapped into a few short lines of code:

    require(dataFormats)
    dft <- importFromELISATables(file.choose())  # use dialog box to choose file
    createReport(type = 'potency', data = dft, mapcolor = mapcolor,
                 base = 1.75, platesinFig = platesinFig[[1]], startdil = 50)
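The immediate-feedback idea behind the Excel template macros can be sketched generically as a validation pass that returns human-readable problems. This is a hedged illustration only: the field names (`plate`, `dilution`, `od`) are hypothetical, not CVB's actual template layout, and the real macros run inside Excel, not Python.

```python
def validate_submission(rows, required=("plate", "dilution", "od")):
    """Return a list of human-readable problems for a tabular
    submission (list of dicts).  Field names are hypothetical."""
    problems = []
    for n, row in enumerate(rows, start=1):
        for field in required:
            if field not in row or row[field] in ("", None):
                problems.append(f"row {n}: missing value for '{field}'")
        # optical-density readings must be numeric
        if row.get("od") not in ("", None):
            try:
                float(row["od"])
            except (TypeError, ValueError):
                problems.append(f"row {n}: 'od' is not numeric")
    return problems

rows = [{"plate": 1, "dilution": 50, "od": 0.42},
        {"plate": 1, "dilution": "", "od": "high"}]
problems = validate_submission(rows)   # two problems, both in row 2
```

Surfacing these messages at data-entry time, rather than after submission, is what reduces the back-and-forth over inconsistent or incomplete data described above.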

4 Concluding Remarks

The implementation of communication tools such as examples, templates and instructions represents a shift in the operational paradigm for CVB Statistics. While the group has historically provided extensive guidance and informal support to requesting firms, the data format tools are an approach that addresses two overlooked areas. First, the dependence of statistical analysis and daily business operations on electronic tools, which allows statisticians to focus efforts on value-added tasks that leverage individual expertise. Second is the need to communicate in a consistent manner. With the data formats, all firms are subject to data formatting guidelines which are based in a single framework and easily accessible by firms of all sizes, without the need to acquire specialized software or technical expertise.

Fig. 1: (a) Sample of data entry. Embedded buttons and a dialog box provide the user with instructions as needed and quick access to macros. (b) A page from the sample report output, using the code above. This page shows the dilution, OD values and well contents for a 96-well format plate. Wells of the same color act in the same role under the experimental design (e.g. positive control, blank, serial).

References

1. Anatharaman D (in press) Comparing self-regulation and statutory regulation: Evidence from the accounting profession. Accounting, Organizations and Society.
2. Anderson AR, Russell EO (2011) Self-regulation: A strategic alternative for small firms? J Business Strategy, 32(4), pp 42-47.
3. Gentleman R, Lang DT (2007) Statistical Analyses and Reproducible Research. J of Computational and Graphical Statistics, 16(1), pp 1-23.
4. Leisch F (2002) Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 - Proceedings in Computational Statistics, pp 575-580. Physica Verlag, Heidelberg.
5. Lundholt BK, Scudder KM, Pagliaro L (2003) A Simple Technique for Reducing Edge Effect in Cell-based Assays. J Biomolecular Screening, 8(5), pp 566-570.
6. Oviatt S (2006) Human-centered design meets cognitive load theory: designing interfaces that help people think. Proceedings of the 14th Annual ACM International Conference on Multimedia. ACM, New York, NY, pp 871-880.
7. R Development Core Team (2011) R: A Language and Environment for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org
8. R Foundation (2008) R: Regulatory compliance and validation issues. A guidance document for the use of R in regulated clinical trial environments. http://www.r-project.org/doc/R-FDA.pdf
9. Romero-Fernandez MM, Royo-Bordonada MA, Rodriguez-Artalejo F (2009) Compliance with self-regulation of television food and beverage advertising aimed at children in Spain. Public Health Nutrition, 13(7), pp 1013-1021.
10. Schwab M, Karrenbach M, Claerbout J (2000) Making scientific computations reproducible. Computing in Science & Engineering, 2(6), pp 61-67.
11. Siev D (2012) Statistics Section Update. Animal Health Institute - Veterinary Services Meeting.


Development of a Detailed Model for the FcRn-mediated IgG Homeostasis

Venkat Reddy Pannala 1, Dilip Kumar Challa 2, E. Sally Ward 2,* and Leonidas Bleris 1,*

1 University of Texas at Dallas, Richardson, USA {vrp110030, bleris}@utdallas.edu
2 University of Texas Southwestern Medical Center, Dallas, USA {dilip.challa, sally.ward}@utsouthwestern.edu

Abstract. The long serum half-lives of immunoglobulin G (IgG) antibodies have been attributed to a protection mechanism mediated by the neonatal Fc receptor, FcRn. Although several experimental studies have implicated the FcRn trafficking pathway in maintaining IgG homeostasis, a detailed model representing this pathway is still missing. In this study, we present a detailed mechanistic model of IgG recycling by FcRn that explicitly takes into account receptor synthesis, receptor-mediated IgG trafficking inside FcRn-expressing cells for IgG recycling, and eventual degradation of unbound IgG. The detailed model is used to derive reduced models which retain a mechanistic interpretation. We find that IgG homeostasis can be described by an FcRn-mediated saturating Michaelis-Menten type of process. Using the quasi-steady-state assumption for the IgG distribution in the receptor system, we find that a reduced model with fast kinetics for receptor synthesis and degradation describes the detailed model predictions equally well. Our approach can be readily extended to any general system and finds applications in the development of physiologically-based pharmacokinetic models for the determination of drug pharmacokinetic parameters.

Keywords: Detailed model, IgG homeostasis, FcRn, Michaelis-Menten, Pharmacokinetics


1 Introduction

In recent years, therapeutic macromolecules, particularly monoclonal antibodies (MAbs), have become a major focus of research and development in academia and the pharmaceutical industry [1]. Approximately 30 monoclonal antibodies in several therapeutic areas have been approved by the US Food and Drug Administration and many are in clinical trials [2, 3]. The vast majority of approved MAbs are of the immunoglobulin G1 (IgG1) isotype. IgG antibodies possess two unique properties: (i) a selective prenatal transepithelial transport across the placenta in humans, or a postnatal transport across the intestinal epithelium in rodents, and (ii) a long serum half-life relative to other serum proteins. These unique properties of IgG were first observed by Brambell and his coworkers, who postulated the presence of a saturable receptor (FcRn) responsible for both biological functions [4, 5]. Subsequent studies on mice lacking the neonatal Fc receptor (FcRn), an MHC class I-like protein, revealed that FcRn was indeed responsible for the observed extended IgG half-lives [6-9].

The FcRn receptor binds IgG and IgG-based monoclonal antibodies with high affinity at acidic pH (~6.0) in the endosomes, while no observable binding is detected at neutral physiological pH (~7.4) [10-12]. The current understanding of IgG protection by FcRn is as follows: IgG enters the early endosomes by fluid-phase pinocytosis, and FcRn binds IgG strongly as the early endosomes convert into acidic endosomes [13-15]. FcRn-IgG complexes are then routed away from the lysosomal pathway, eventually returning either to the systemic circulation or being transported to the opposite side of the cell [16-20]. Vesicles containing receptor-bound IgG molecules then fuse with the plasma membrane (at pH > 7.0), releasing the IgG into circulation [18, 21]. In contrast, without the salvage function of FcRn, antibodies or antibody-based drugs, like other proteins, are taken into cells by pinocytosis and catabolized rapidly upon fusion of the endosome with the lysosome [15, 18]. Thus, the protective function of FcRn significantly extends the circulating half-lives of endogenous IgG and IgG-based therapeutics compared to other systemic proteins [22]. In some cases, this protective mechanism can be blocked by therapeutic proteins with greater affinity for FcRn in order to shorten the half-lives of disease-causing endogenous IgG in autoimmune diseases [23-26].

Several empirical/compartmental models have been proposed to describe the mechanism of IgG protection by FcRn and have subsequently been used to explain IgG/antibody drug pharmacokinetics [27-29]. However, none of them includes many of the experimentally observed details of the FcRn trafficking pathway. Instead, these models have been developed by assuming Michaelis-Menten type FcRn saturation kinetics for IgG recycling. Although the developed models were able to describe some antibody pharmacokinetic properties, being empirical in nature they do not provide a mechanistic understanding of how the different processes of receptor trafficking contribute to the overall pharmacokinetic profile, which would aid in designing more efficient dosing regimens.


The objective of this article is to build a detailed mechanistic model of the IgG recycling pathway that takes into account the most relevant processes of IgG uptake and receptor-mediated trafficking inside FcRn-expressing cells. The specific aims are: (i) to develop a detailed FcRn-mediated trafficking model; (ii) to derive reduced models of the IgG recycling pathway that retain a mechanistic interpretation with only a few parameters; and (iii) to analyze the impact of different processes (such as IgG uptake and synthesis) on the extent of IgG recycling and degradation. Our approach has potential applications for several receptor systems in which binding of the ligand to the receptor on the cell surface leads to receptor-mediated endocytosis and subsequent recycling/elimination. For simplicity of analysis, we neglect the weak FcRn-mediated IgG endocytosis and the experimentally observed 1:2 ratio of IgG-FcRn interaction in our model. We assume the simple case of IgG uptake into FcRn-expressing cells by fluid-phase pinocytosis, followed by the FcRn-mediated recycling pathway.

2 Model Development

Development of a Detailed Model (Model A)

We consider the following detailed model of the IgG recycling pathway, shown schematically in Fig. 1(a). Here, IgG is synthesized and delivered to the plasma, the extracellular phase at neutral pH. Subsequently, IgG is internalized into FcRn-expressing cells with a nonspecific pinocytosis rate constant kp. The internalized IgG binds to FcRn in the acidic early endosomes (pH ~6), forming the [IgG-FcRn]_SE complex. The bound IgG complex in the sorting endosomes is recycled to the membrane ([IgG-FcRn]_m) with rate constant krec, and its subsequent fusion with the plasma membrane releases IgG into the circulation with rate constant krp. Unbound IgG and some of the bound [IgG-FcRn]_SE complex are degraded in the lysosomes with rate constants kd and kd_c, respectively. Inside the cell, the receptor FcRn is synthesized at a rate Ks_f and degraded in the lysosomes with rate constant kd_f. Free intracellular FcRn_SE is recycled to the membrane with rate constant krec_f, and free membrane FcRn_m is internalized with rate constant kfc.

Based on the law of mass action, the rates of change of the various molecular species are given by the following ordinary differential equations (ODEs):

d() IgG E K  k  IgG  k [ IgG  FcRn ] (1) dt s_ i p E rp m


Figure 1: Models of the FcRn-mediated IgG trafficking pathway: (a) detailed Model A; (b) reduced models with saturable distribution into the receptor system: Model B, and Model C (Ks_f = kd_f = kd_c1 = 0)

d() IgG SE k  IgG  k  IgG  FcRn  k  K [ IgG  FcRn ] - k  IgG (2) dt p Ef SE SEfD SEd SE d( FcRn ) m (3) krec_ f  FcRn SE  k fc  FcRn mkrp [] IgG  FcRn m dt

d( FcRnSE ) Ks__ f k rec f  FcRn SE  k fc  FcRn m k f  K D [] IgG  FcRn SE dt (4)

kkfIgGSE FcRn SE  d_ f  FcRn SE

d([ IgG FcRn ] ) m (5) krec [][]IgG  FcRnSE  k rp  IgG  FcRn m dt

d([ IgG FcRn ]SE ) kkff IgGSE  FcRn SE  K D [] IgG  FcRn SE dt (6)

kkrec[]]IgG FcRnSE  d_ c [IgG  FcRn SE

In the above equations, all variables are expressed in molar concentrations. All parameters are first-order rate constants with units [day^-1], except the synthesis rates (Ks_i, Ks_f), which are zero-order rate constants with units [M/day], and kf, which is a second-order rate constant with units [1/(M·day)].


Reduced Models of the IgG Recycling Pathway

Considering the time scales of interest in pharmacokinetics, the above equations of the detailed model are not suitable for pharmacokinetic parameter estimation in clinical trials; they can be reduced by applying a quasi-steady-state assumption with respect to the receptor system (see Fig. 1(b)). With respect to FcRn, the above system contains the following processes: (i) FcRn synthesis and degradation; (ii) distribution of FcRn within and between the cytosolic space and the membrane; and (iii) IgG-FcRn interaction. Based on the time scales of FcRn receptor synthesis and degradation, the following two reduced models are developed: (i) the time scales of FcRn synthesis and degradation are very fast compared to IgG disposition (Model B); (ii) the time scales of FcRn synthesis and degradation are slow compared to IgG disposition, leading to Ks_f = kd_f = kd_c = 0 (Model C). As a consequence, the total concentration of FcRn remains constant in the system, which is the primary assumption in several previously developed models.

The idea behind deriving a reduced model is the quasi-steady-state assumption for the FcRn system. This transforms differential equations (3)-(6) into algebraic equations for the free and bound receptors (FcRn_m, FcRn_SE, [IgG-FcRn]_m and [IgG-FcRn]_SE). These algebraic equations can be solved explicitly for a given endosomal IgG concentration (IgG_SE), which allows us to compute the total concentration of IgG in the receptor system of the cell for both fast and slow FcRn kinetics (IgG_rs = [IgG-FcRn]_m + [IgG-FcRn]_SE). Based on IgG_rs, the quasi-steady-state concentration of the [IgG-FcRn]_m complex on the membrane can be calculated, which determines the extent of IgG recycling into the circulation. The total IgG concentration in the sorting endosomes is defined as IgG_ST = IgG_SE + IgG_rs.

Model B describes the recycling of IgG in the FcRn-expressing cells by the following two ordinary differential equations:

d() IgG E K  k  IgG  k  IgG (7) dt s_1 i p E rp rs

d() IgG ST k  IgG  k  IgG  k  IgG k  IgG (8) dt p E rp1 rs d _ c 1 rs d SE

Fmax  IgGSE IgGrs  (9) Km IgG SE

1 IgG IgGFKFKK  ( IgG  )42   IgG  (10) SE2  STmaxm ST max m ST m 

The above equations contain four parameters: the maximal IgG binding capacity of the receptor system, Fmax (in molar units); the endosomal IgG concentration (IgG_SE) corresponding to half-maximal capacity, Km (in molar units); the rate of recycling of IgG in the receptor system, krp1 (in units 1/day); and the degradation rate kd_c1 (in units 1/day). A similar analysis can be performed for the case of slow FcRn kinetics by imposing the conditions Ks_f = kd_f = kd_c = 0 on the detailed model to obtain Model C. The resulting reduced Model C consists of two ordinary differential equations similar to equations (7) and (8), except that the third term in equation (8) is zero; it comprises only the first three of the parameters described above.
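Equations (9) and (10) can be sketched in a few lines; the function names and the numerical values below are our illustration, not fitted parameters.

```python
# Sketch: recovering the endosomal IgG concentration from the total
# endosomal IgG via Eqs. (9)-(10). Fmax/Km values here are placeholders.
import math

def igg_se(igg_st, fmax, km):
    """Positive root of IgG_ST = IgG_SE + Fmax*IgG_SE/(Km + IgG_SE), Eq. (10)."""
    b = igg_st - fmax - km
    return 0.5 * (b + math.sqrt(b * b + 4.0 * km * igg_st))

def igg_rs(igg_se_val, fmax, km):
    """Michaelis-Menten type saturable binding, Eq. (9)."""
    return fmax * igg_se_val / (km + igg_se_val)

# Consistency check: IgG_ST must equal IgG_SE + IgG_rs by construction.
fmax, km, st = 2.0e-6, 5.0e-7, 1.0e-6
se = igg_se(st, fmax, km)
assert 0 < se < st
assert abs(se + igg_rs(se, fmax, km) - st) < 1e-12
```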

For the two scenarios of fast and slow FcRn kinetics, the functional relations between the parameters Fmax, Km, krp1 and kd_c1 and the parameters of the detailed model of the IgG recycling pathway can be determined. In the case of fast FcRn synthesis and degradation (Model B):

K k k k k k k K  s__ f rp rec d f dc_ rec f D (11) Fmax  ; K   k k m k  k  d__ c rp  f  d c 

krp k rec k rp k d_ c kkrp1; d _ c 1 (12) ()()krp k rec k rp k rec

In the case of slow FcRn synthesis and degradation, the relations between the parameters are:

Fmax = Ft·(1 + krec/krp) / (1 + krec/krp + krec/kfc);
Km = (KD + krec/kf)·(1 + krec_f/kfc) / (1 + krec/krp + krec/kfc)    (13)

krp1 = krp·krec / (krp + krec)    (14)

where Ft is the total concentration of FcRn in the system, determined as the sum of the free and bound receptor concentrations at steady state.
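The parameter mappings (11)-(14) are straightforward to implement; the sketch below (our naming, not the authors' code) computes the reduced-model parameters from the detailed-model ones, with demo values taken from Table 1.

```python
# Sketch: mapping detailed-model parameters onto the reduced-model
# parameters via Eqs. (11)-(14). Function names are ours; demo values
# are from Table 1.

def fast_kinetics(Ks_f, krp, krec, kd_f, kd_c, kf, KD):
    """Model B (fast FcRn synthesis/degradation): Eqs. (11)-(12)."""
    Fmax = Ks_f * (krp + krec) / (kd_c * krp)
    Km = kd_f * (kd_c + krec + kf * KD) / (kf * kd_c)
    krp1 = krp * krec / (krp + krec)
    kd_c1 = krp * kd_c / (krp + krec)
    return Fmax, Km, krp1, kd_c1

def slow_kinetics(Ft, krp, krec, krec_f, kfc, kf, KD):
    """Model C (constant total FcRn concentration Ft): Eqs. (13)-(14)."""
    denom = 1.0 + krec / krp + krec / kfc
    Fmax = Ft * (1.0 + krec / krp) / denom
    Km = (KD + krec / kf) * (1.0 + krec_f / kfc) / denom
    krp1 = krp * krec / (krp + krec)
    return Fmax, Km, krp1

# Demo with Table 1 values: the reduced rates are bounded by the originals.
Fmax, Km, krp1, kd_c1 = fast_kinetics(1.36e-6, 1000.0, 3.82, 0.4, 0.40,
                                      2.5e10, 4.8e-9)
assert 0 < krp1 < 3.82 and 0 < kd_c1 < 0.40
```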

We solved the equations of Models A, B and C numerically using built-in MATLAB ODE solvers (The MathWorks, Inc.). Parameter values for the detailed model were obtained from the literature [28]. The unknown parameter values were estimated by simulating the model equations to attain the steady-state concentration of IgG in the plasma. To calculate the newly introduced synthesis rate of the receptor FcRn, we used the quasi-steady-state assumption for the receptor system and back-calculation to determine the steady-state concentrations of the individual species. The FcRn synthesis rate in the cells can then be calculated at steady state as Ks_f = kd_c×[IgG-FcRn]_SE + kd_f×FcRn_SE. Parameter values for the reduced models were determined from those of the detailed model using the respective equations (11)-(14). All parameter values are tabulated in Table 1. The initial concentrations of the individual species were set equal to the steady-state concentrations obtained using the steady-state concentration of extracellular IgG in the plasma, which is set to 15.34 µM for the mouse system [28].

Table 1. Parameter values for the IgG recycling pathway

Parameter   Value                      Source
kp          1.03 day^-1                [28]
kd_f        0.4 day^-1                 fitted
KD          4.8 nM                     [28]
kd          0.43 day^-1                [28]
krec        3.82 day^-1                fitted
kf          2.5×10^10 M^-1 day^-1      [30]
krp         1000 day^-1                fitted
kd_c        0.40 day^-1                fitted
krec_f      2.06 day^-1                fitted
kfc         9.27 day^-1                fitted
Ks_f        1.36×10^-6 M day^-1        calculated
Ks_i        2.7×10^-6 M day^-1         [28]

3 Results and Discussion

We demonstrate the approximation quality of the two reduced models for predicting concentration-time profiles of the extracellular IgG in the plasma (IgG_E) in comparison to the detailed model. The initial steady-state concentration of IgG is 15.34 µM. In Fig. 2(a), the prediction of the extracellular IgG concentration (IgG_E) is shown for the three Models A, B and C. All models result in very similar concentration-time profiles and attain steady-state IgG concentrations due to the dynamic balance between synthesis of IgG, recycling by the receptor system, and degradation of unbound IgG. We used Model B to calculate the amount of IgG in the receptor system at steady state using equation (9); the model predicts that more than 50% of the total endocytosed IgG resides in the receptor system and is recycled to the plasma by the saturable receptor system.

To check the consistency of the reduced model predictions against the detailed model, we simulated the models further after increasing the pinocytosis rate parameter (kp) by 2-fold. The pinocytosis rate determines the extent to which IgG enters the cells and its availability for both the recycling and degradation processes. In this case we would expect an increase in the overall IgG elimination capacity with increasing pinocytosis rate. Fig. 2(b) shows the simulation results for Models A, B and C. Due to the increased pinocytosis of extracellular IgG into the cells at a constant rate, the initial IgG concentration drops quickly, reaching a new steady-state value of 7.66 µM. The observed reduction in steady-state concentration is due to the higher availability of IgG in the FcRn-expressing cells while the corresponding receptor concentration remains unchanged, leading to an increased availability of free IgG for degradation. The initial extracellular IgG concentration (15.34 µM) is thus reduced by 50% for a 2-fold increase in the pinocytosis rate. The simulation results show that the detailed model is well approximated by Models B and C using only the apparent saturable recycling process.

We used the models further to study the impact of both the saturating receptor approximation and the slow versus fast receptor kinetics of the reduced models on the quality of extracellular IgG predictions compared to the detailed model under perturbations of crucial parameters. We artificially increased the rate of degradation (kd) and the rate of recycling (krec) of IgG by 10-fold, independently for each simulation. All other parameters of the detailed Model A, including the initial extracellular concentration of IgG, were kept identical. Parameters for Models B and C were recalculated according to equations (11)-(14), resulting in a varied maximal receptor binding capacity Fmax. The predictions of the concentration-time profile of the extracellular IgG concentration (IgG_E) based on the three Models A, B and C are shown in Fig. 3(a) and (b). While Models A and B give almost identical results, the prediction based on Model C differs significantly in both cases. In the case of an increased rate of IgG degradation (kd), Model C under-predicts the IgG elimination, although the magnitude of the deviation is small compared to the other models. In the case of an increased rate constant of recycling (krec), Model C over-predicts the IgG recycling, with steady state reached much faster; here the Michaelis-Menten type saturation process given by equation (9) saturates faster due to the slow receptor kinetics assumption of Model C. Similarly, for a 10-fold decreased IgG synthesis rate (Ks_i), Models A and B show similar results, whereas Model C considerably under-predicts the extracellular IgG elimination (see Fig. 3(c)).

Figure 2: Concentration-time profiles of the extracellular IgG_E concentration for Models A, B and C. (a) Parameter values according to Table 1. (b) Same as Table 1, but kp increased 2-fold.


Figure 3: Comparison of Models A, B and C for perturbations of various parameters. Concentration-time profiles of the extracellular IgG_E concentration for Models A, B and C: (a) rate of degradation of IgG, kd, increased 10-fold; (b) rate of recycling of [IgG-FcRn]_SE, krec, increased 10-fold; (c) rate of IgG synthesis, Ks_i, decreased 10-fold. All other parameters were kept constant in each individual case.

Calculation of the corresponding IgG concentration in the receptor system (IgG_rs) by equation (9) shows that the magnitude of the ratio Fmax/(Km + IgG_SE) is significantly higher for Model C than for Model B. Thus, the assumption of slow receptor kinetics, which results in a constant total FcRn concentration in the system, leads to reduced elimination of IgG.

In summary, the detailed Model A represents the endogenous IgG recycling pathway in terms of a system of biochemical reactions (1)-(6), including fluid-phase pinocytosis of IgG, receptor binding, receptor-mediated recycling, synthesis, and eventual degradation. Two reduced models have been developed under the assumption that the receptor system is at quasi-steady state. With respect to endogenous IgG and/or external therapeutic proteins, two aspects of the IgG recycling pathway are of particular importance: (i) binding of IgG molecules to the receptor in the endosomes and subsequent recycling to the circulation, and (ii) the process of IgG internalization and eventual degradation. Both Models B and C explicitly take into account the concentration of IgG in the receptor system as a Michaelis-Menten type saturating process specified in terms of Fmax and Km. To finally derive the reduced models, we have to make an additional assumption on the time scales of receptor synthesis and degradation, resulting in fast receptor kinetics (Model B) and slow receptor kinetics (Model C). The analysis of the three Models A, B and C suggests that the extracellular IgG predictions of the reduced Model B are almost identical to those of the detailed Model A across the different scenarios examined. In contrast, the assumption of slow receptor kinetics (constant total FcRn concentration) in Model C, which either under- or over-predicts the concentration-time profiles, may not be a good approximation for determining IgG pharmacokinetics, particularly when the elimination of IgG is the important process.

Modeling the mechanism of IgG protection by FcRn is crucial both for understanding the pathways involved in the process and for the design of mutant antibodies with altered affinities to FcRn. Experimental studies have shown that mutated human antibodies (Abdegs) with higher affinity for FcRn at both neutral and acidic pH can be used to block the receptor molecules, leading to increased endogenous IgG degradation [14]. Such mutant antibodies, having high affinity for the receptor at the membrane surface, undergo receptor-mediated endocytosis, resulting in highly non-linear pharmacokinetic properties. The detailed modeling strategy developed here can be readily extended to include more details of the IgG-FcRn interactions. Parameters from these detailed models can be used to derive reduced models for incorporation into larger compartmental models to determine both drug pharmacokinetic properties and tissue elimination profiles.

References

1. Baumann A: Early development of therapeutic biologics - pharmacokinetics. Current Drug Metabolism 2006, 7(1):15-21.
2. Chames P, Baty D: Bispecific antibodies for cancer therapy: The light at the end of the tunnel? mAbs 2009, 1(6):539-547.
3. Zhou H, Mascelli MA: Mechanisms of monoclonal antibody-drug interactions. Annual Review of Pharmacology and Toxicology 2011, 51:359-372.
4. Brambell FW: The transmission of immunity from mother to young and the catabolism of immunoglobulins. Lancet 1966, 2(7473):1087-1093.
5. Brambell FWR, Hemmings WA, Morris IG: A theoretical model of γ-globulin catabolism. Nature 1964, 203(4952):1352-1355.
6. Ghetie V, Hubbard JG, Kim JK, Tsen MF, Lee Y, Ward ES: Abnormally short serum half-lives of IgG in β2-microglobulin-deficient mice. European Journal of Immunology 1996, 26(3):690-696.
7. Israel EJ, Wilsker DF, Hayes KC, Schoenfeld D, Simister NE: Increased clearance of IgG in mice that lack β2-microglobulin: Possible protective role of FcRn. Immunology 1996, 89(4):573-578.
8. Junghans RP: Finally! The Brambell receptor (FcRB). Immunologic Research 1997, 16(1):29-57.
9. Junghans RP, Anderson CL: The protection receptor for IgG catabolism is the β2-microglobulin-containing neonatal intestinal transport receptor. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(11):5512-5516.


10. Jones EA, Waldmann TA: The mechanism of intestinal uptake and transcellular transport of IgG in the neonatal rat. Journal of Clinical Investigation 1972, 51(11):2916-2927.
11. Rodewald R: pH-dependent binding of immunoglobulins to intestinal cells of the neonatal rat. Journal of Cell Biology 1976, 71(2):666-670.
12. Simister NE, Rees AR: Isolation and characterization of an Fc receptor from neonatal rat small intestine. European Journal of Immunology 1985, 15(7):733-738.
13. Raghavan M, Bonagura VR, Morrison SL, Bjorkman PJ: Analysis of the pH dependence of the neonatal Fc receptor/immunoglobulin G interaction using antibody and receptor variants. Biochemistry 1995, 34(45):14649-14657.
14. Vaccaro C, Zhou J, Ober RJ, Ward ES: Engineering the Fc region of immunoglobulin G to modulate in vivo antibody levels. Nature Biotechnology 2005, 23(10):1283-1288.
15. Ward ES, Zhou J, Ghetie V, Ober RJ: Evidence to support the cellular mechanism involved in serum IgG homeostasis in humans. International Immunology 2003, 15(2):187-195.
16. Newton EE, Wu Z, Simister NE: Characterization of basolateral-targeting signals in the neonatal Fc receptor. Journal of Cell Science 2005, 118(11):2461-2469.
17. Ober RJ, Martinez C, Lai X, Zhou J, Ward ES: Exocytosis of IgG as mediated by the receptor, FcRn: An analysis at the single-molecule level. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(30):11076-11081.
18. Ober RJ, Martinez C, Vaccaro C, Zhou J, Ward ES: Visualizing the site and dynamics of IgG salvage by the MHC class I-related receptor, FcRn. Journal of Immunology 2004, 172(4):2021-2029.
19. Spiekermann GM, Finn PW, Sally Ward E, Dumont J, Dickinson BL, Blumberg RS, Lencer WI: Receptor-mediated immunoglobulin G transport across mucosal barriers in adult life: Functional expression of FcRn in the mammalian lung. Journal of Experimental Medicine 2002, 196(3):303-310.
20. Ward ES, Martinez C, Vaccaro C, Zhou J, Tang Q, Ober RJ: From sorting endosomes to exocytosis: Association of Rab4 and Rab11 GTPases with the Fc receptor, FcRn, during recycling. Molecular Biology of the Cell 2005, 16(4):2028-2038.
21. Lencer WI, Blumberg RS: A passionate kiss, then run: Exocytosis and recycling of IgG by FcRn. Trends in Cell Biology 2005, 15(1):5-9.
22. Datta-Mannan A, Witcher DR, Tang Y, Watkins J, Jiang W, Wroblewski VJ: Humanized IgG1 variants with differential binding properties to the neonatal Fc receptor: Relationship to pharmacokinetics in mice and primates. Drug Metabolism and Disposition 2007, 35(1):86-94.
23. Crow AR, Song S, Semple JW, Freedman J, Lazarus AH, Hansen RJ, Balthasar JP: IVIG induces dose-dependent amelioration of ITP in rodent models. Blood 2003, 101(4):1658-1659.


24. Hansen RJ, Balthasar JP: Effects of intravenous immunoglobulin on platelet count and antiplatelet antibody disposition in a rat model of immune thrombocytopenia. Blood 2002, 100(6):2087-2093.
25. Patel DA, Puig-Canto A, Challa DK, Montoyo HP, Ober RJ, Ward ES: Neonatal Fc receptor blockade by Fc engineering ameliorates arthritis in a murine model. Journal of Immunology 2011, 187(2):1015-1022.
26. Sesarman A, Vidarsson G, Sitaru C: The neonatal Fc receptor as therapeutic target in IgG-mediated autoimmune diseases. Cellular and Molecular Life Sciences 2010, 67(15):2533-2550.
27. Bleeker WK, Teeling JL, Erik Hack C: Accelerated autoantibody clearance by intravenous immunoglobulin therapy: Studies in experimental models to determine the magnitude and time course of the effect. Blood 2001, 98(10):3136-3142.
28. Hansen RJ, Balthasar JP: Pharmacokinetic/pharmacodynamic modeling of the effects of intravenous immunoglobulin on the disposition of antiplatelet antibodies in a rat model of immune thrombocytopenia. Journal of Pharmaceutical Sciences 2003, 92(6):1206-1215.
29. Kim J, Hayton WL, Robinson JM, Anderson CL: Kinetics of FcRn-mediated recycling of IgG and albumin in human: Pathophysiology and therapeutic implications using a simplified mechanism-based model. Clinical Immunology 2007, 122(2):146-155.
30. Popov S, Hubbard JG, Kim JK, Ober B, Ghetie V, Ward ES: The stoichiometry and affinity of the interaction of murine Fc fragments with the MHC class I-related receptor, FcRn. Molecular Immunology 1996, 33(6):521-530.


Querying Evolutionary Relationships in Phylogenetic Databases

Grant Brammer and Tiffani L. Williams

Department of Computer Science and Engineering Texas A&M University {grb,tlw}@cse.tamu.edu

1 Introduction

The ability to search through massive amounts of data has had a transformative impact on both science and computing. While there has been work on querying algorithms for phylogenetic trees, the main focus has been on working with the trees in large databases such as TreeBASE [1]. In TreeBASE, which contains phylogenetic trees with diverse taxa sets, the interesting question for search is often, "Which of these trees contain the taxa I am interested in?" However, if we wanted to search the tens to hundreds of thousands of trees that potentially result from a Bayesian analysis, we would ask different questions of the data and need different tools. In this situation, the researcher may want to know how many of the trees supported a certain hypothesis about the relationship between taxa. To answer this type of question we need a structural search for phylogenetic trees. We have developed a novel search algorithm, implemented in a package called TreeHouse, that allows users to query a set of trees based on their structure. This short abstract introduces one facet of the program: bipartition-based searching. While the program also supports other methods of searching (by taxa, by subtree, and by quartet), bipartition-based searching is a novel and powerful approach for querying phylogenetic databases.

2 Phylogenetic Trees and Bipartitions

A phylogenetic tree consists of n taxa (or organisms) of interest. In Figure 1, we show a rooted phylogenetic tree consisting of 6 taxa. A phylogenetic tree can be uniquely defined by its set of bipartitions (or edges). When an edge is removed, a bipartition splits the taxa into two sets. In Figure 1, consider bipartition B1. Breaking the tree at this point would partition the taxa into two groups: Lion, Leopard, and Jaguar on one side, and Tiger, Snow Leopard, and Clouded Leopard on the other. We represent this bipartition as the set [Lion, Leopard, Jaguar | Tiger, S. Leopard, C. Leopard]. In this paper, we use the term bipartition to refer only to the non-trivial bipartitions in a tree. There is a trivial bipartition that separates each single taxon from the rest of a tree. These bipartitions contain no information about the



structure of the tree; hence, we consider only non-trivial bipartitions (internal edges). A binary tree has exactly n − 3 non-trivial bipartitions.

Fig. 1. Phylogeny of the genus Panthera with bipartitions marked in red.

3 Searching for Bipartitions in a Collection of Trees

To demonstrate bipartition-based querying, we use the set of trees shown in Figure 2. For instance, Query 1 would return the set of trees [T1, T2, T3, T7, T9, T10, T13, T14], as each of those trees contains an edge with Lion, Leopard, and Jaguar on one side and Tiger, Snow Leopard, and Clouded Leopard on the other.

Fig. 2. Phylogenetic hypotheses of the genus Panthera [2].



By default, our system assumes that any taxa not included in the bipartition could appear on either side of the relationship, or not at all. In practice this means that we can leave taxa out to relax our search parameters. For instance, by removing Clouded Leopard from Query 1 we get Query 2. This search returns the set of trees [T1, T2, T3, T7, T8, T9, T10, T13, T14], which includes tree T8; T8 contains Cougar and Cheetah, but not Clouded Leopard. Similarly, Query 3, which is the same as Query 1 but with Tiger removed, returns the set [T1, T2, T3, T5, T6, T7, T9, T10, T13, T14]. The results of Query 3 are a superset of the Query 1 results, additionally including trees T5 and T6.

SEARCH([Lion, Leopard, Jaguar | Tiger, S. Leopard, C. Leopard]) (1)

SEARCH([Lion, Leopard, Jaguar | Tiger, S. Leopard]) (2)

SEARCH([Lion, Leopard, Jaguar | S. Leopard, C. Leopard]) (3)
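The default matching semantics above (omitted taxa may fall on either side or be absent) can be sketched as subset tests; the data structures and names below are our illustration, not TreeHouse's implementation.

```python
# Sketch of the relaxed matching rule: a tree bipartition (left, right)
# satisfies a query (ql, qr) if one side contains all of ql and the other
# contains all of qr; taxa omitted from the query are unconstrained.

def matches(query, bipartition):
    ql, qr = map(frozenset, query)
    left, right = bipartition
    return (ql <= left and qr <= right) or (ql <= right and qr <= left)

def search(query, trees):
    """Return ids of trees containing a bipartition compatible with `query`."""
    return {tid for tid, biparts in trees.items()
            if any(matches(query, bp) for bp in biparts)}

# Toy database: each tree is a set of (left, right) bipartitions.
trees = {
    'T1': {(frozenset({'Lion', 'Leopard', 'Jaguar'}),
            frozenset({'Tiger', 'S.Leopard', 'C.Leopard'}))},
    'T8': {(frozenset({'Lion', 'Leopard', 'Jaguar'}),
            frozenset({'Tiger', 'S.Leopard', 'Cougar', 'Cheetah'}))},
}
strict = search((['Lion', 'Leopard', 'Jaguar'],
                 ['Tiger', 'S.Leopard', 'C.Leopard']), trees)
relaxed = search((['Lion', 'Leopard', 'Jaguar'],
                  ['Tiger', 'S.Leopard']), trees)
assert strict == {'T1'} and relaxed == {'T1', 'T8'}
```

Dropping C. Leopard from the query admits T8, mirroring the Query 1 versus Query 2 behavior described in the text.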

By combining queries, we can restrict our search even further. For instance, we could search for the subtree relationship highlighted in blue in Figure 3. This subtree as a Newick string is ((Lion), (Leopard, Jaguar)) and is defined by two bipartitions. (Newick is the most common format for representing phylogenetic trees.) To translate this subtree into a bipartition-based search, we first search for the bipartition that separates the subtree from the rest of the tree; here, that is bipartition B1. Next, we search for every bipartition contained within the subtree, which here is bipartition B2. We then take the intersection of these search results as our answer. Using this method, we can represent a search for the ((Lion), (Leopard, Jaguar)) subtree in our system as the search terms shown in Query 4. This search returns only [T14], since it is the only tree with both bipartitions that make up the ((Lion), (Leopard, Jaguar)) clade. Any relationship that can be represented as a subtree can be translated into a series of bipartition queries that provide the same results. Since it can be easier to write a Newick string than a series of bipartitions, we have implemented a parser that translates Newick strings into bipartitions, so the user can seamlessly use either mode.

SEARCH([Lion, Leopard, Jaguar | Tiger, S. Leopard, C. Leopard]) AND SEARCH([Leopard, Jaguar | Lion, Tiger, S. Leopard, C. Leopard]) (4)
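The AND-combination in Query 4 amounts to intersecting the per-bipartition result sets; a minimal sketch, using the result sets reported in the text for bipartitions B1 and B2:

```python
# Sketch: a subtree query as the intersection of per-bipartition results.
# A tree matches a subtree iff it contains every bipartition defining it.

def search_and(*result_sets):
    """Intersect the result sets of the individual bipartition queries."""
    return set.intersection(*result_sets)

hits_b1 = {'T1', 'T2', 'T3', 'T7', 'T9', 'T10', 'T13', 'T14'}  # contain B1
hits_b2 = {'T14'}                                              # contain B2
assert search_and(hits_b1, hits_b2) == {'T14'}                 # Query 4
```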

Fig. 3. Phylogenetic tree with subtree ((Lion), (Leopard, Jaguar)) highlighted in blue.

With the ability to quickly search a large set of trees, biologists can now ask questions that used to be difficult to answer. Our TreeHouse software makes it easy to determine how many of the trees in a dataset support a given hypothesis. With that, it now becomes possible to explore the minority opinion in a set of trees. Previously, viewing the relationships that appear in the majority of trees has been relatively simple, as those are the relationships that appear in the consensus tree. However, it has not been as easy to explore the relationships that appear in fewer than half the trees. Since all the trees in a Bayesian search, after burn-in is removed, have some probability of being the true tree, it can be useful to explore those results. For instance, if a researcher has a hypothesis of the true tree that does not appear in the consensus, they may be curious to find out whether that hypothesis appeared in any of the trees explored by the search. Such a hypothesis might appear in a minority of the trees, or it might not appear at all. This may be because the hypothesis is incorrect and was rejected by the phylogenetic reconstruction algorithm, or because the algorithm never tested the hypothesis. Without robust search tools, it is difficult to explore these important and interesting possibilities.

4 Conclusions

This method for searching a set of trees is both fast and robust. Since we store each tree as a set of edges, we can compute structural searches very quickly. Furthermore, this method of storage is memory efficient, allowing us to handle large numbers of trees.

References

[1] Vos, R.A., Lapp, H., Piel, W.H., Tannen, V.: TreeBASE2: Rise of the Machines. Nature Precedings (713) (2010)
[2] Davis, B.W., Li, G., Murphy, W.J.: Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae). Molecular Phylogenetics and Evolution 56(1) (2010) 64-76


Gene Expression Resources Available from MaizeGDB

Wimalanathan Kokulapalan1, Jack Gardiner4,5, Bremen Braun2, Ethalinda KS Cannon4, Mary Schaeffer3,8, Lisa Harper6,7, Carson Andorf2, Darwin Campbell2, Scott Birkett4, Taner Sen1,2,4, Nicholas Provart9, and Carolyn Lawrence1,2,4

1Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011; 2USDA‐ARS Corn Insects and Crop Genetics Research Unit, Iowa State University, Ames, IA 50011; 3Division of Plant Sciences, Department of Agronomy, University of Missouri, Columbia, MO 65211; 4Department of Genetics Development and Cell Biology, Iowa State University, Ames, IA 50011; 5School of Plant Sciences, University of Arizona, Tucson, AZ 85721‐0036; 6USDA‐ARS Plant Gene Expression Center, Albany, CA 94710; 7Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720; 8USDA‐ARS Plant Genetics Research Unit, University of Missouri, Columbia, MO 65211; 9Department of Cell & Systems Biology, University of Toronto, Ontario M5S 3G5

The completion of the maize genome sequence in 2009 created both significant challenges and opportunities for maize researchers. The opportunities for understanding the cellular processes underlying maize's phenomenal productivity have never been greater, but this opportunity can only be seized if functional genomics software tools (FGSTs) are available to reduce the complexity of multimillion-point data sets into manageable images and/or concepts. Currently, MaizeGDB hosts numerous large gene expression data sets, and indications from currently funded NSF Plant Genome Research Projects are that much more data will be deposited at MaizeGDB in the near future. Fortunately for maize researchers, free public-domain FGSTs have been developed for other biological systems, and their implementation at MaizeGDB can be accomplished with a moderate amount of effort. In this poster, we describe current efforts at MaizeGDB that focus on leveraging two of these FGSTs, the eFP browser1 and MapMan2, by creating strategic linkages from MaizeGDB to the sites where these FGSTs are deployed. The eFP browser projects gene expression data onto a series of pictures (pictographs) representing the original plant tissues from which the expression data were derived. Each pictograph is colored according to the level of expression of the gene of interest. The MapMan software suite allows the visualization of a variety of functional genomics datasets in the context of well-characterized biochemical processes and metabolic pathways. Our initial efforts focus on the 60 tissues within the B73 Maize Gene Atlas3 developed by the Kaeppler laboratory at the University of Wisconsin, with the expectation that additional expression data sets characterizing meristem and kernel development will be added in the near future.


Nocardia spp. Identification Using a Bioinformatics Approach

Dhundy R. Bastola1, Scott McGrath, Ishwor Thapa, and Peter C. Iwen2

1School of Interdisciplinary Informatics, University of Nebraska at Omaha, Omaha, Nebraska 68182.

2Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, Nebraska 68198.

Abstract. Nocardiosis is an opportunistic infection caused by pathogenic bacteria in the genus Nocardia. Approximately 50% of catalogued species of Nocardia can infect humans, with a mortality ranging from 14% to 40%. However, if the infection spreads to the central nervous system, the mortality rate increases to 75-90%. Currently, the culture-based methods of laboratory testing for the presence of Nocardia take 2-3 weeks, which can significantly delay treatment and increase the risk of systemic infection. DNA-sequence based identification methods can reduce this time to less than three days. Using an algorithmic approach to distill target sequences into categorized clusters and discern the differences between gene targets, 34 unique molecular target sequences from 84 different Nocardia species have been collected. The results showed that the sensitivity of molecular identification is highly dependent on the choice of sequence target used in the analysis. Overall, this algorithmic approach to data collection is expected not only to assist in the development of DNA-sequence based molecular identification methods for Nocardia and other species, but also to enhance rapid identification to optimize patient management.

1 Introduction

Often mistaken for fungi, the organisms within the genus Nocardia are bacteria, which are known to cause serious and even fatal infections in humans. These aerobic, partially acid-fast, saprophytic actinomycetes are found worldwide in organic-rich soil, aquatic environments, and animal tissue. Nocardia are difficult to identify using traditional culture-based methods in the laboratory [1]. There are 80 known species within the genus Nocardia, and 50% of them have been documented as pathogens for both animals and humans [2, 3]. Due to nocardiae's low virulence, they are considered opportunistic pathogens. The infections caused by these species are clinically referred to as nocardiosis, with approximately 85% of the cases in immunocompromised patients [4]. Transplant patients, especially lung, kidney, heart, and liver transplant recipients, are at greater risk for nocardiosis than the general population [5, 6]. Infections in these cases are serious, with documented mortality rates between 14% and 40% [4]. Approximately 20% to 40% of all cases of nocardiosis eventually spread to the central nervous system (CNS) [7]. However, infection that spreads to the brain leads to a dramatic increase in reported mortality rates, which could be as high as 75% to 90% [8]. Early identification of Nocardia infections is critical to prevent systemic disease and CNS involvement. Delay in treatment significantly impedes patient recovery [9]. However, a large hurdle preventing early treatment has been the difficulty and time associated with identifying Nocardia in a diagnostic laboratory. Cultures take 48-72 hours to grow on most nonselective laboratory media, with up to 14 days required for identification [10]. The difficulty in identification of different Nocardia species in the laboratory has also been associated with poor quality of documentation of the identification results [11]. Depending on the choice of method used for verification, completion of an identification process can take several weeks. Given the significant risk posed by nocardiosis in immunocompromised and transplant patients, there is a growing need to develop a more efficient method to reliably identify Nocardia and to confirm a diagnosis of nocardiosis. To address this problem and speed up the process of diagnosis, several approaches have been investigated, including biochemical, chemotaxonomic and serological methods [1]. While all these approaches have yielded encouraging results, molecular identification appears to hold the most promise by reducing both the difficulty and time involved in identification of species within the genus Nocardia.

The biggest breakthroughs in molecular identification have occurred through the advancement of DNA sequencing technology. For example, using the MicroSeq 500 system, the average time required for identification dropped from the conventional 2 to 3 weeks to 1 to 3 days [12]. The 16S rRNA gene is currently the principal target sequence used in DNA sequence based identifications. This is due to the high level of sequence conservation in Nocardia species. For this reason, this region of the genome is abundantly found in public sequence repositories. The National Center for Biotechnology Information (NCBI) contains over 90,000 16S rRNA sequences in its databases for a wide range of bacteria [13]. This provides laboratories a very large database with which to compare results. Upon evaluating the 16S rRNA gene for Nocardia, a variable region appears within the first 500 bp [14]. Having such a consistent variable region in the gene has shown this 500 bp region to be an effective target for partial sequencing with reliable results [1]. At this time, no other gene has been found to be as helpful as 16S rRNA in Nocardia species identification, but secA1 and hsp65 are being investigated as possible alternatives [1]. An additional sequence, which has the promise of being a viable substitute for the 16S rRNA gene, is the internal transcribed spacer (ITS) region sequence located between the 16S rRNA and the 23S rRNA genes [15]. ITS is an attractive candidate due to a high level of expression and to the greater sequence variability among species and strains [16]. The DNA sequence based molecular identification of Nocardia provides a significant improvement in diagnostic time (hsp65 PCR and 16S rRNA restriction enzyme analysis recognizes >90% of Nocardia species [1]). Unfortunately, 16S rRNA analysis has proven to be a poor method for resolving sub-populations within species.


With increased use of these molecular techniques for species identification there is a need for additional support infrastructure. Specifically, there is a need for reliable databases and accompanying algorithms for sequence alignment analysis [17]. GenBank contains over 80 billion nucleotide bases from more than 76 million individual sequences, and has doubled in size about every 18 months [18]. The staggering size of the database highlights the need for developing tools to aid in more efficient database searches with a higher degree of precision. In the current study our focus was to create an algorithm that would compile molecular sequence targets for the Nocardia genus and determine the sensitivity of molecular targets to identify different strains within the known Nocardia species complexes.

2 Methods

The web-based 'tgclassfier' application was developed with PHP and JavaScript at the frontend. Perl (Bioperl [19]) and Java, along with a MySQL relational database, were used for the backend. The application stores all of a user's previous work in a workspace. The application provides two major functions: running a new project and viewing existing projects. Each new project requires a name and the name of the organism for which an up-to-date list of molecular target sequences is wanted. Following "submit" on the graphical user interface, a multi-process pipeline is triggered, which begins with extraction of the organism-specific sequences from a public repository using BioSQL. These sequences for the specific organism are subsequently used in a BLAST-based search for other similar sequences. Clustering of the data occurs using previously developed methods providing homogeneity and separation of the clusters [20]. The system state is saved during this pipeline execution. The application allows the user to log out and log in at a later time to view the results. The results section has several output files, viz. the sequence file, homogeneity/separation graph, clusters, and cluster descriptions.

3 Results

There exists a need for more effective use of DNA-based methods of pathogen detection in a clinical laboratory setting. Technologies exist that offer rapid results with high sensitivity and specificity at relatively low cost. The computational approach presented here to parse the sequence data from public sequence repositories such as GenBank is the first step in developing a reliable DNA-based method of pathogen detection. Employing our algorithmic approach showed there were 1263 different sequences of Nocardia in our BioSQL database. This database was previously populated with the July 1 release of non-redundant sequences from GenBank. These sequences contained 84 different species. By categorizing these sequences into separate groups, 34 different molecular targets were identified. Over 30 of the original 84 species have been identified as medically important to humans [1]. Of these, four subpopulations were identified: the Nocardia asteroides complex, Nocardia transvalensis complex, Nocardia brevicatena complex, and Nocardia nova complex. Each of these contains 3-4 different Nocardia species, which are difficult to differentiate from each other using culture-based methods. Table 1 lists the species contained in each complex and the number of target sequences available for these species. These listed species are difficult to identify using available laboratory tests and thus they became the focus of this study. All available sequence targets (16S, ITS, HSP, and gyrB) for these species were determined, along with the sensitivity of using these for identification purposes.

Currently, within the bacteria, a species is identified that shares 70% or more of its genomic DNA with a comparable species [21]. Species that fall below the 70% level, but still match >20%, are labeled as a different species within the same genus [22]. A 16S rRNA score with a similarity value <98% will have a DNA re-association value of no more than 60. Thus, any 16S rRNA sequence showing <98% sequence similarity can be considered a separate species [22]. While the results showed 16S rRNA as the most prevalent molecular sequence publicly available, the target area is insensitive for resolving sub-populations within the species complexes. Therefore, one of the early goals of this project was to identify all known DNA target sequences for each species of Nocardia within the GenBank database. The result showed multiple DNA target sequences. Among these, sequences for 16S rRNA, the 16S rRNA to 23S rRNA intergenic spacer (ITS) region, heat shock protein (HSP), and gyrB were most prevalent. Detection of the HSP illustrated the difficulty of detecting target sequences using keyword-based search of sequence data annotation. The lack of standard naming conventions in the annotation table led to this problem. This was a particular problem found in the annotations for HSP and 16S rRNA, as several forms were used. Our use of a sequence-based search method helped to overcome this limitation of non-standard annotations in the GenBank sequence records.

To evaluate the strength of the DNA-based method to differentiate species within each Nocardia complex, the sensitivity of each of the target sequences was determined. Sequences that were able to draw clearer distinctions between species in these groups are deemed better performers than those with overlapping results. To accomplish this goal, the software program MEGA5 was used to align all of the sequences for each complex (using ClustalW) and then build a neighbor-joining tree and a UPGMA tree for each targeted sequence. Selected trees are included to highlight the findings of this method. When building the trees, the default MEGA5 parameters were used. For the Neighbor-Joining trees, the bootstrap method was employed for phylogeny; no phylogeny tests were used for the UPGMA trees. Both trees evaluated nucleotide substitution using the Maximum Composite Likelihood method. Both tests also included transition and transversion substitutions. 16S rRNA, ITS, hsp, and gyrB were used to evaluate species in each complex, with results compared.

Although the literature [13] states that 16S rRNA is a good candidate for effectively splitting the complexes into general groups, our results with the Nocardia complexes showed a few overlapping elements which make it difficult to definitively split some of the species apart (Fig 1). For example, sequencing of the 16S rRNA gene differentiated multiple species from the Nocardia nova complex, whereas Nocardia veterana, Nocardia kruczakiae, and Nocardia nova were not clearly defined. Incidents where the trees failed to distinguish between species are highlighted by inverted triangles in Fig 1. Additionally, ITS, hsp and gyrB target sequences were used to determine their ability to differentiate individual species within each complex. The


discriminatory power of the ITS region to differentiate among species was better than the results with the 16S gene, although not as robust as needed. A clearer distinction among Nocardia veterana, Nocardia africana, and Nocardia nova was seen. Unfortunately, the downside of using ITS was the lack of available sequences: sequences of the 16S gene greatly outnumber those of the ITS regions. This was also highlighted by the lack of a Nocardia kruczakiae ITS sequence to compare to the others. Hence, the ITS region sequences appear to provide better resolution among species, but the number of ITS sequences available limits development of the region as a diagnostic tool for the identification of Nocardia species.

Evaluation of the hsp genomic target as a differential tool yielded an incident where two species overlapped (Nocardia nova and Nocardia africana). As observed with ITS, without additional sequences the ability to use hsp as a diagnostic tool is limited. Finally, the use of gyrB (Fig 2) in differentiating among species within a complex was more promising. There was a clear distinction for each clade and the species they represent. However, the small number of available sequences erodes the ability to support using gyrB as a more effective means than 16S to identify Nocardia species.

Fig. 2. Neighbor-Joining method for comparison of Nocardia nova complex species using the gyrB gene target. The evolutionary history was inferred using the Neighbor-Joining method. The optimal tree with the sum of branch length = 0.09524501 is shown. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Maximum Composite Likelihood method and are in units of the number of base substitutions per site. The analysis involved 9 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 1186 positions in the final dataset. Evolutionary analyses were conducted in MEGA5.

Other trials, with the Nocardia asteroides complex, Nocardia transvalensis complex, and Nocardia brevicatena complex, also fell into a pattern similar to that seen with the Nocardia nova complex test case. Evidence showed that results may be more conclusive using alternatives to the 16S rRNA gene target, but further testing is necessary with additional sequences before different regions can be used to identify Nocardia species.
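The homogeneity/separation scores used to assess the sequence clusters (Section 2, [20]) can be illustrated with a toy sketch. The per-position identity measure below is a stand-in assumption for illustration only, not the similarity measure of the actual BLAST-based pipeline:

```python
# Illustrative sketch of cluster homogeneity and separation scoring.
# A good clustering has high within-cluster similarity (homogeneity)
# and low between-cluster similarity (separation).
from itertools import combinations

def identity(a, b):
    """Naive per-position identity of two equal-length sequences (placeholder)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def homogeneity(clusters):
    """Average similarity over all pairs drawn from the same cluster."""
    pairs = [(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)

def separation(clusters):
    """Average similarity over all pairs drawn from different clusters."""
    pairs = [(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)

clusters = [["ACGT", "ACGA"], ["TTTT", "TTTA"]]
print(homogeneity(clusters))  # 0.75: members of a cluster are similar
print(separation(clusters))   # 0.125: members of different clusters are not
```

The same two numbers, plotted against each other, give the homogeneity/separation graph produced by the 'tgclassfier' results section.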

4 Discussion

Fig. 1. Neighbor-Joining method for comparison of Nocardia nova complex species using the 16S rRNA gene target sequence. The evolutionary history was inferred using the Neighbor-Joining method [23]. The optimal tree with the sum of branch length = 0.09469911 is shown. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. The evolutionary distances were computed using the Maximum Composite Likelihood method [24] and are in units of the number of base substitutions per site. The analysis involved 65 nucleotide sequences. Codon positions included were 1st+2nd+3rd+Noncoding. All positions containing gaps and missing data were eliminated. There were a total of 318 positions in the final dataset. Evolutionary analyses were conducted in MEGA5 [25].

Utilization of DNA-based methods for the identification of microbial pathogens derives from the premise that each species carries unique DNA or RNA sequences for differentiation from other organisms. The first step in the development of any DNA-based identification system is to obtain a comprehensive and current list of sequence data. The 16S rRNA gene has been shown to be a popular region for identifying bacteria [13]. When attempting to differentiate between related species within a complex, however, the 16S gene does not always prove to be useful. This issue of subtyping closely related species has been successfully resolved with the use of a highly conserved 16S-23S rRNA ITS region [26, 27]. Due to the difficulty of identifying Nocardia species, implementing a genomic target that would yield a higher degree of differentiation would be desirable. Clinically, there are cases that would benefit greatly from the ability to identify species reliably. Nocardiosis is already a difficult disease to diagnose, and the characteristics of species within the Nocardia complexes add an additional layer of difficulty. Each of the species in these Nocardia complexes has a different antibiotic susceptibility pattern, illustrating the importance of a quick and precise way to differentiate between them in order to provide targeted drug treatments. Advances in genomics are beginning to drive the discovery of novel diagnostics, drug targets, and vaccines [28]. The ability


to rapidly identify Nocardia species will enhance patient management so that the most effective therapy is used. The results of this study did follow what has been found in prior research: 16S rRNA was able to provide a degree of separation between the species in these complexes, but the overlapping elements could confuse the diagnosis between two related species. The ITS region target does seem promising for helping in the identification of species within the complexes. However, the lack of a sufficient number of available sequences in databases was a limiting factor in its adoption. Additionally, the hsp and gyrB genes provided promising results. However, these targets suffer from the same fate as the ITS region, with limited availability of sequences for comparison purposes. Despite the limited data, however, the latter group (ITS, hsp, and gyrB) provides better alternatives to 16S and is more sensitive in differentiating among the species within a complex.

In conclusion, this project described a tool that can be used to simply and efficiently collect and collate all known sequences for a requested species. Species within the Nocardia genus were selected in this case due to the difficulty associated with properly identifying the pathogen. However, the 'tgclassfier' application described here can be used to scan other species and yield similar results. Given the rapid proliferation of DNA sequencing technology, important discoveries are being lost in the flood of information. Tools like 'tgclassfier' can be used to go back and efficiently update records with the most current data set for the organism under study. Additional sequences are, however, needed to properly explore the benefit of using sequences within the ITS region or the gyrB gene and to determine whether these areas are more reliable than the 16S rRNA gene target. Knowing which sequences to focus on for medical diagnostic purposes will be a crucial element for future research. The 16S rRNA gene sequence was able to distinguish among Nocardia species, but the edge cases involved with the four complexes add delays. Further review of ITS, gyrB, and hsp is warranted given the promising initial results of this study.

References

1. Brown-Elliott, B.A., et al., Clinical and laboratory features of the Nocardia spp. based on current molecular taxonomy. Clin Microbiol Rev, 2006. 19(2): p. 259-82.
2. Brown, J.M., M.M. McNeil, and B.A. Lasker, Nocardia, Rhodococcus, Gordonia, Actinomadura, Streptomyces, and other actinomycetes, in Manual of clinical microbiology, 8th ed., P.R. Murray, et al., Editors. 2003, ASM Press: Washington, D.C. p. 370-398.
3. Euzèby, J.P., List of bacterial names with standing in nomenclature - Genus Nocardia. 2009; Available from: http://www.bacterio.cict.fr/n/nocardia.html.
4. Martinez, R., S. Reyes, and R. Menendez, Pulmonary nocardiosis: risk factors, clinical features, diagnosis and prognosis. Curr Opin Pulm Med, 2008. 14(3): p. 219-27.
5. Peleg, A.Y., et al., Risk factors, clinical characteristics, and outcome of Nocardia infection in organ transplant recipients: a matched case-control study. Clin Infect Dis, 2007. 44(10): p. 1307-14.
6. Nocardia infections. Am J Transplant, 2004. 4 Suppl 10: p. 47-50.
7. Presant, C.A., P.H. Wiernik, and A.A. Serpick, Factors affecting survival in nocardiosis. Am Rev Respir Dis, 1973. 108(6): p. 1444-8.
8. Herkes, G.K., et al., Cerebral nocardiosis--clinical and pathological findings in three patients. Aust N Z J Med, 1989. 19(5): p. 475-8.
9. Tachezy, M., et al., Abscess of adrenal gland caused by disseminated subacute Nocardia farcinica pneumonia. A case report and mini-review of the literature. BMC Infect Dis, 2009. 9: p. 194.
10. Goodfellow, M., The family Nocardiaceae, in The prokaryotes, 2nd ed., A. Balows, et al., Editors. 1992, Springer-Verlag: New York, N.Y.
11. Wauters, G., et al., Distribution of Nocardia species in clinical samples and their routine rapid identification in the laboratory. J Clin Microbiol, 2005. 43(6): p. 2624-8.
12. Cloud, J.L., et al., Evaluation of partial 16S ribosomal DNA sequencing for identification of Nocardia species by using the MicroSeq 500 system with an expanded database. J Clin Microbiol, 2004. 42(2): p. 578-84.
13. Clarridge, J.E., 3rd, Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol Rev, 2004. 17(4): p. 840-62.
14. Conville, P.S., et al., Identification of Nocardia species by restriction endonuclease analysis of an amplified portion of the 16S rRNA gene. J Clin Microbiol, 2000. 38(1): p. 158-64.
15. Kono, T., et al., Sequencing of 16S-23S rRNA internal transcribed spacer and its application in the identification of Nocardia seriolae by polymerase chain reaction. Aquaculture Research, 2002. 33(14): p. 1195-1197.
16. Mohamed, A.M., et al., Computational approach involving use of the internal transcribed spacer 1 region for identification of Mycobacterium species. J Clin Microbiol, 2005. 43(8): p. 3811-7.
17. Bork, P. and A. Bairoch, Go hunting in sequence databases but watch out for the traps. Trends Genet, 1996. 12(10): p. 425-7.
18. Benson, R.F., P.W. Tang, and B.S. Fields, Evaluation of the Binax and Biotest urinary antigen kits for detection of Legionnaires' disease due to multiple serogroups and species of Legionella. J Clin Microbiol, 2000. 38(7): p. 2763-5.
19. Stajich, J.E., et al., The Bioperl toolkit: Perl modules for the life sciences. Genome Res, 2002. 12(10): p. 1611-8.
20. Gat-Viks, I., R. Sharan, and R. Shamir, Scoring clustering solutions by their biological relevance. Bioinformatics, 2003. 19(18): p. 2381-9.
21. Wayne, L.G., International Committee on Systematic Bacteriology: announcement of the report of the ad hoc Committee on Reconciliation of Approaches to Bacterial Systematics. Zentralbl Bakteriol Mikrobiol Hyg A, 1988. 268(4): p. 433-4.
22. Johnson, J.L., Nucleic acids in bacterial classification, in Bergey's manual of systematic bacteriology, N.R. Krieg and J.G. Holt, Editors. 1984, Williams & Wilkins: Baltimore. p. 8-11.
23. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
24. Tamura, K., M. Nei, and S. Kumar, Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc Natl Acad Sci U S A, 2004. 101(30): p. 11030-5.
25. Tamura, K., et al., MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol, 2011. 28(10): p. 2731-9.
26. Maeda, T., et al., Structural variation in the 16S-23S rRNA intergenic spacers of Vibrio parahaemolyticus. FEMS Microbiol Lett, 2000. 192(1): p. 73-7.
27. Lee, S.K., et al., Analysis of the 16S-23S rDNA intergenic spacers (IGSs) of marine vibrios for species-specific signature DNA sequences. Mar Pollut Bull, 2002. 44(5): p. 412-20.
28. Medini, D., et al., Microbiology in the post-genomic era. Nat Rev Microbiol, 2008. 6(6): p. 419-30.


Short Abstract: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads

Serghei Mangul1, Adrian Caciula1, Nicholas Mancuso1, Olga Glebova1, Ion Mandoiu2, and Alex Zelikovsky1

1 Department of Computer Science, Georgia State University, Atlanta, GA 30303 Email: {serghei, acaciula, nmancuso, oglebova, alexz}@cs.gsu.edu 2 Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269 Email : [email protected]

Recent advances in DNA sequencing have made it possible to sequence the whole transcriptome by massively parallel sequencing, commonly referred to as high-throughput RNA sequencing (RNA-Seq) [1]. RNA-Seq is becoming the technology of choice for transcriptome analyses [2], as it reduces sequencing cost and significantly increases data throughput, but it is computationally challenging to use such data for reconstructing full-length transcripts and accurately estimating their abundances across all cell types. The common applications of RNA-Seq are gene expression level estimation (GE), transcript expression level estimation (IE) [3], and novel transcript reconstruction (TR). A variety of new methods and tools have recently been developed to tackle these problems. In this work, we propose a novel statistical "genome-guided" method called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates the fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based on exact annotation of exon boundaries and RNA-Seq reads. A splice graph is a directed acyclic graph (DAG) whose vertices represent exons and whose edges represent splicing events. We enumerate all maximal paths in the splice graph using a depth-first search (DFS) algorithm. These paths correspond to putative transcripts and are the input for the TRIP algorithm. To solve the transcriptome reconstruction problem we must select a set of putative transcripts with the highest support from the RNA-Seq reads. We formulate this problem as an integer program. The objective is to select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution empirically determined during library preparation and the fragment lengths implied by mapping read pairs to selected transcripts.
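The maximal-path enumeration step can be sketched as follows; the splice graph below is an invented toy example, not data from the paper:

```python
# Sketch of the candidate-transcript step: enumerate all maximal (source-to-sink)
# paths of a splice graph, a DAG whose vertices are exons and edges splicing events.

def maximal_paths(graph):
    """Yield all source-to-sink paths of a DAG given as {vertex: [successors]}."""
    targets = {v for succs in graph.values() for v in succs}
    sources = [v for v in graph if v not in targets]

    def dfs(v, path):
        succs = graph.get(v, [])
        if not succs:                    # sink reached: path is maximal
            yield path
        for w in succs:
            yield from dfs(w, path + [w])

    for s in sources:
        yield from dfs(s, [s])

# Toy graph: exons e1..e4, where e2 may be skipped (alternative splicing).
splice_graph = {"e1": ["e2", "e3"], "e2": ["e3"], "e3": ["e4"]}
for p in maximal_paths(splice_graph):
    print("-".join(p))   # e1-e2-e3-e4 and e1-e3-e4
```

Each emitted path is a putative transcript; these form the candidate set over which the integer program below selects.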
The following parameters are used in the proposed mathematical model:



Symbol: Description

N: total number of paired-end reads
pj: paired-end read with index j, 1 ≤ j ≤ N
i: index of the standard-deviation band, 0 ≤ i ≤ 4
tk: candidate transcript with index k, 1 ≤ k ≤ K, where K is the number of candidate transcripts
Ti(pj): set of candidate transcripts onto which paired-end read pj can be mapped with a fragment length between i − 1 and i standard deviations from the mean, 0 ≤ i ≤ 3
T4(pj): set of candidate transcripts onto which pj maps with a fragment length more than 3 standard deviations from the mean
y(tk): 1 if candidate transcript tk is selected, and 0 otherwise
xi(pj): 1 iff read pj is mapped with a fragment length between i − 1 and i standard deviations, and 0 otherwise

The objective function of this model is to minimize the total number of selected candidate transcripts, as shown in equation (1).

(1)  $\min \sum_{t \in T} y(t)$

subject to

(2)  $\sum_{t \in T_i(p)} y(t) \ge x_i(p) \quad \forall p,\ i = 1, \dots, 4$

(3)  $N \, (n(s_i) - \epsilon) \le \sum_{p} x_i(p) \le N \, (n(s_i) + \epsilon)$

(4)  $\sum_{i} x_i(p) = 1 \quad \forall p$

Equation (2) implies that for each paired-end read pj with a non-empty set Ti(pj), at least one transcript is selected (this first constraint allows multiple transcripts to be selected for the same read). Note that y(t) need only be considered for transcripts in the non-empty sets Ti(pj) of that particular paired-end read pj; for any other transcript t, Ti(p) would be empty. Similarly, xi(pj) can only be 1 if pj actually maps within standard-deviation band i; otherwise Ti(pj) is empty and pj is not considered for that band, since the constraint ranges only over reads p with non-empty Ti(p). In the worst case, read pj maps with standard deviation 4 (i.e., x4(p) = 1), which ensures that at least one transcript is selected for read pj, even if with a high standard deviation.
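A brute-force toy version of objective (1) together with constraints (2) and (4) can make the selection concrete; constraint (3), the fragment-length fit, is omitted for brevity, and the instance below is invented:

```python
# Toy illustration of the TRIP selection idea: find the smallest transcript set
# such that every read maps, in some standard-deviation band, to at least one
# selected transcript. (The fragment-length-fit constraint (3) is omitted here.)
from itertools import combinations

def smallest_transcript_set(transcripts, bands):
    """bands[p] = list of candidate-transcript sets T_i(p) for read p."""
    for k in range(1, len(transcripts) + 1):
        for subset in combinations(transcripts, k):
            chosen = set(subset)
            # every read must hit a chosen transcript in at least one band
            if all(any(chosen & Ti for Ti in bands[p]) for p in bands):
                return chosen
    return set(transcripts)

transcripts = ["t1", "t2", "t3"]
bands = {                       # hypothetical T_i(p) sets
    "p1": [{"t1"}, {"t2"}],     # p1 maps to t1 in band 1, to t2 in band 2
    "p2": [{"t2", "t3"}],       # p2 maps to t2 or t3 in band 1
}
print(smallest_transcript_set(transcripts, bands))  # {'t2'} covers both reads
```

In the actual method this selection is solved as an integer program rather than by enumeration; the sketch only shows what the optimum means.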

Equation (3) requires that the number of paired-end reads mapped within standard-deviation band i equals, within a tolerance ±ε, the total number of paired-end reads expected in that band. If the fragment length distribution is approximately normal, then about 68% of the fragments lie within one standard deviation of the mean (mathematically, µ ± s, where µ is the arithmetic mean), about 95% are within
two standard deviations (µ ± 2s), and about 99.7% lie within three standard deviations (µ ± 3s). This is known as the 68-95-99.7 rule, or the empirical rule. Let s1, s2, and s3 be the expected proportions within one, two, and three standard deviations of the mean (s1 = 0.68, s2 = 0.95, and s3 = 0.997). The number of paired-end reads mapped within standard-deviation band i should match this expected value up to the tolerance ε (which varies between 0.01 and 0.05), since mapping errors can place the same read on different transcripts within the same band. Because paired-end reads are short and may map to several transcripts, equation (4) ensures that each paired-end read p is assigned to exactly one standard-deviation band (i.e., the bands are mutually exclusive).
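As a concrete illustration of objective (1) and constraints (2) and (4), the following is a minimal pure-Python sketch on hypothetical toy data (three reads, three candidate transcripts, names invented for illustration); it brute-forces the smallest feasible selection in place of an ILP solver and omits the distribution-fit constraint (3):

```python
from itertools import combinations

# Hypothetical toy instance (not from the paper's experiments).
# T[i][p] is the set T_i(p): candidate transcripts onto which paired read p
# maps with a fragment length between i-1 and i standard deviations.
T = {
    1: {"p1": {"t1"}, "p2": {"t1", "t2"}},  # mapped within 1 sd
    2: {"p3": {"t2"}},                      # mapped between 1 and 2 sd
}
reads = ["p1", "p2", "p3"]
transcripts = {"t1", "t2", "t3"}

def feasible(selected, reads, T):
    # Constraints (2) and (4): each read must be assignable to some sd band
    # whose candidate set contains at least one selected transcript.
    return all(
        any(T[i].get(p, set()) & selected for i in T) for p in reads
    )

def smallest_selection(transcripts, reads, T):
    # Objective (1): minimize the number of selected transcripts,
    # here by exhaustive search over subsets of increasing size.
    for k in range(1, len(transcripts) + 1):
        for subset in combinations(sorted(transcripts), k):
            if feasible(set(subset), reads, T):
                return set(subset)
    return None

print(sorted(smallest_selection(transcripts, reads, T)))  # ['t1', 't2']
```

No single transcript covers all three reads, so the minimal selection is {t1, t2}; an ILP solver would reach the same answer without enumerating subsets.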

For our simulations we used the UCSC annotations of the human genome and GNF-Atlas2 gene expression levels, with uniform/geometric expression of gene transcripts. The fragment lengths follow a normal distribution with a mean of 500 and a standard deviation of 50.


Fig. 1. Distribution of transcript lengths (a) and gene cluster sizes (b) in the UCSC dataset

Preliminary experimental results on synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP has increased transcriptome reconstruction accuracy for genes with fewer than 4 transcripts compared to previous methods that ignore fragment length distribution information.

Following [4], we use sensitivity and Positive Predictive Value (PPV) to evaluate the performance of different methods. Sensitivity is defined as the portion of annotated transcript sequences captured by candidate transcript sequences:

Sens = TP / (TP + FN)


PPV is defined as the portion of annotated transcript sequences among the candidate sequences:

PPV = TP / (TP + FP)
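The two measures follow directly from the confusion counts; a small sketch with hypothetical counts chosen only for illustration:

```python
def sensitivity(tp, fn):
    # Sens = TP / (TP + FN): fraction of annotated transcripts recovered.
    return tp / (tp + fn)

def ppv(tp, fp):
    # PPV = TP / (TP + FP): fraction of candidate transcripts that are correct.
    return tp / (tp + fp)

# Hypothetical counts: 80 true positives, 20 missed annotated transcripts,
# 40 spurious candidates.
print(sensitivity(80, 20))     # 0.8
print(round(ppv(80, 40), 3))   # 0.667
```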

[Figure: sensitivity and PPV of TRIP, Cufflinks, and the splice-graph method, plotted against the number of transcripts per gene (1-8).]

Fig. 2. Comparison of TRIP, Cufflinks, and the splice-graph method: (a) Positive Predictive Value (PPV) and (b) Sensitivity

Acknowledgments. This work has been partially supported by NSF awards IIS-0916401 and IIS-0916948, Agriculture and Food Research Initiative Competitive Grant no. 201167016-30331 from the USDA National Institute of Food and Agriculture, and a Second Century Initiative Bioinformatics University Doctoral Fellowship.

References

1. A. Mortazavi, B. Williams, K. McCue, L. Schaeffer, and B. Wold, "Mapping and quantifying mammalian transcriptomes by RNA-Seq." Nature Methods, 2008. [Online]. Available: http://dx.doi.org/10.1038/nmeth.1226
2. Z. Wang, M. Gerstein, and M. Snyder, "RNA-Seq: a revolutionary tool for transcriptomics." Nat. Rev. Genet., vol. 10, no. 1, pp. 57-63, 2009. [Online]. Available: http://dx.doi.org/10.1038/nrg2484
3. M. Nicolae, S. Mangul, I. Mandoiu, and A. Zelikovsky, "Estimation of alternative splicing isoform frequencies from RNA-Seq data," Algorithms for Molecular Biology, vol. 6:9, 2011. [Online]. Available: http://www.almob.org/content/6/1/9
4. I. Astrovskaya, B. Tork, S. Mangul, K. Westbrooks, I. Mandoiu, P. Balfe, and A. Zelikovsky, "Inferring viral quasispecies spectra from 454 pyrosequencing reads," BMC Bioinformatics, vol. 12, no. Suppl 6, p. S1, 2011. [Online]. Available: http://www.biomedcentral.com/1471-2105/12/S6/S1


Distributions of Palindromic Proportional Content in Bacteria

Oliver Bonham-Carter1, Lotfollah Najjar2, Ishwor Thapa1 and Dhundy Bastola1

University of Nebraska at Omaha, Omaha, NE 68182, USA, {obonhamcarter, lnajjar, ithapa, dkbastola}@unomaha.edu

Abstract. DNA palindromes, reversed and complemented genetic words, read the same in the 3' to 5' direction as in the 5' to 3' direction and can form unique restriction sites (RSs) where enzymes are able to cut DNA. Several studies have confirmed that short palindromes, behaving as active RSs, are few when compared to statistically expected values in bacterial genomes. These studies suggest that palindromes bring potential instability to intolerant coding regions of the genome, which appears to alter their concentrations. While this palindrome-avoidance phenomenon has been observed in bacteria, the exact location in the genome where palindromes are most rare has not been investigated. In this paper, we provide evidence to suggest where the palindromic content is the least by comparing the content in coding and non-coding regions of bacterial DNA. We study the exhaustive lists of palindromes (lengths 4, 6, 8, and 10) and conclude that at least half of the motifs of each set (and sometimes nearly all of the motifs of a set) show similar trends of reduced presence in the coding regions, when compared to the non-coding regions of bacteria.

1 Introduction

A DNA palindrome (here called a palindrome) is a word which is equivalent to itself in its reversed and base-complemented form. Palindromes have been shown to be key actors in bacterial auto-immune defense systems, as they often form the restriction sites for type II restriction endonucleases: highly specific restriction enzymes which cleave the DNA at these sites [2, 6, 8]. In palindromic avoidance studies across several bacterial groups, Gelfand and Koonin [3] found that recognition sites of type II restriction-modification enzymes tended to be under-represented when compared to expected levels. Since it is conceivable that natural restriction sites can on occasion fail to be methylated (and thus are unprotected from enzymes), the authors explain that avoidance is likely an evolved damage-control system.

Because short palindromes are thought to introduce instability into the genome by their nature [1, 4], a palindrome can be considered a dangerous commodity for a region of DNA which codes for proteins. In this paper, we provide evidence that palindromes reside in the non-coding regions of the genome, which are likely tolerant of this instability. Since details concerning the location of palindromic rarity are absent from the current literature, we provide statistical evidence to suggest that the exhaustive lists of palindromes of lengths {4, 6, 8, 10} in bacterial DNA conform to skewed distribution patterns, and may reflect properties of their functions as RSs.

2 Materials and Methods

The data for this study were drawn from common bacterial chromosomal DNA downloaded from GenBank [7]. We developed a software tool written in Python, employing Biopython version
1.58, that for each sequence calculates the GC content, isolates the coding and non-coding regions of sequence material from the input genomes, and determines an exhaustive list of palindromes, which are then parsed in each preprocessed region of the input sequences. Finally, the results are organized for Mann-Whitney non-parametric statistical tests (discussed later) to determine the final motif distributions of the genomes.

To obtain the coding and non-coding datasets, the protein-coding segments of each genome were found based on the CDS features given in the organism's GenBank record. All the segments associated with CDS regions were joined together to create a unified and continuous string for each organism. We secured all the non-coding material for each organism in a similar way: starting with a complete genome, we removed all the CDS regions, and the remaining sequence was the non-coding material for the organism.

Our genera were divided into two groups based on the GC content of the genomes. The GC-rich group was made up of sequences with more than 60% GC content; the genera in this group are Bifidobacterium, Burkholderia, Caulobacter, Desulfovibrio, Geobacter, and Xanthomonas. The other group, GC-poor, contained Agrobacterium, Bifidobacterium, Brucella, Chloroflexus, Corynebacterium, Erwinia, Geobacter, and Pantoea.

The palindromes for our study were prepared by first creating an exhaustive list of all possible DNA words of lengths {4, 6, 8, 10}. The complement of a base is the one found on the opposite strand of the helix (i.e., A ⇔ T, C ⇔ G). Each word w in the list was tested for palindromy by determining whether w == reversed[complemented(w)] was true. By the nature of this function, only even palindromes, where length(w) mod 2 ≡ 0, are considered in this study. There are 4 ∗ 4 ∗ 1 ∗ 1 = 16 possible palindromic words of length-4.
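The palindromy test described above can be sketched in a few lines of Python (a simplified stand-in for our Biopython-based tool, using only the standard library):

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complemented(w):
    # Base-complement each position of the word.
    return "".join(COMPLEMENT[b] for b in w)

def is_palindrome(w):
    # The paper's test: w == reversed[complemented(w)].
    return w == complemented(w)[::-1]

def exhaustive_palindromes(length):
    # All DNA palindromes of a given even length, by exhaustive enumeration.
    return ["".join(t) for t in product("ACGT", repeat=length)
            if is_palindrome("".join(t))]

# Counts match n ** (L/2) with alphabet size n = 4 (length 10 gives 1024):
for L in (4, 6, 8):
    print(L, len(exhaustive_palindromes(L)))  # 4 16 / 6 64 / 8 256
```

The first half of an even-length palindrome determines the second half, which is why the counts are 4^(L/2).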
Expressed mathematically, the number of possible palindromes of length Lp is n^(Lp/2), where n is the size of the alphabet.

Across each organism's coding and non-coding material, we determined the proportion of sequence code made up by each palindrome. We use proportions, not frequencies, in our study of palindromic content because proportions are naturally normalized and facilitate comparison of content between regions. For these readings, there is no overlap between palindromes in the sequences and we do not consider nested palindromes. The proportion is given by the following equation:

P = count(mi) · |mi| / |SL|,

where mi is a motif, SL is the sequence space, count(mi) is the number of occurrences of mi in SL, and |mi| and |SL| are the lengths of the motif and the sequence, respectively. This equation determines how much of the coding or non-coding sequence is actually composed of the current motif: the number of occurrences is multiplied by the motif length and divided by the length of the region. The higher the value of the proportion, the greater the content of the motif in the region.

Alternative Hypothesis 1. A palindrome has higher proportions in the non-coding regions than in the coding regions of all evaluated sequence material.

The Mann-Whitney non-parametric test was selected to determine which of the two regions had more content for each palindrome. This test is appropriate for our data since it does not require the data to be normally distributed. A significant test outcome indicates that Alternative Hypothesis 1 was satisfied for the particular palindrome under evaluation (i.e., a higher proportion of the palindrome in the non-coding region than in the coding region, by evidence in all evaluated genomes). The approach we chose was designed to be straightforward, reproducible, and fast. Other methods capable of performing a similar study, such as those involving dynamic programming (i.e., methods from global alignment and the like), are more computationally expensive. For this reason, we opted for an efficient statistical approach inspired by information theory.
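The proportion measure can likewise be computed directly; the toy "coding" and "non-coding" strings below are hypothetical, chosen so that GAATTC appears once in the first and twice in the second:

```python
def proportion(motif, seq):
    # Fraction of the sequence covered by non-overlapping occurrences of
    # the motif: count(m) * |m| / |S|. str.count is non-overlapping.
    return seq.count(motif) * len(motif) / len(seq)

# Hypothetical toy sequences for illustration only:
coding = "ATGGAATTCGGCATGACG"         # 18 nt, one GAATTC site
noncoding = "TTGAATTCAATTGAATTCAA"    # 20 nt, two GAATTC sites

print(round(proportion("GAATTC", coding), 3))     # 0.333
print(round(proportion("GAATTC", noncoding), 3))  # 0.6
```

Collecting such proportions per genome, one list for coding regions and one for non-coding regions, yields the two samples compared by the Mann-Whitney test.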


3 Results and Discussion

We used the Mann-Whitney tests to determine which palindromes of lengths {4, 6, 8, 10} had significantly greater concentrations in the non-coding regions than in the coding regions. Our working alternative hypothesis, that there is more short palindromic content in the non-coding regions than in the coding regions, was further motivated by the observation that long palindromes, and sequences behaving like long palindromes, are often found in the non-coding regions of mitochondrial DNA [5].

                Motif Length      4      %      6      %      8      %     10      %
GC Rich         p < 0.01         14   87.5     54   84.4    183   71.5    431   42.1
                p < 0.05 only    --     --      4    6.3     18   7.03    123   12.0
GC Poor         p < 0.01         13   81.3     43   67.2    118   46.1    501   48.9
                p < 0.05 only    --     --     10   15.6     43   16.8    166   16.2
Size of Exhaustive List          16            64           256          1024

Table 1: The percentage of the exhaustive lists of all possible palindromes (lengths 4, 6, 8, and 10) found in higher proportions in the non-coding regions than in the coding regions, according to their significant p-values (Mann-Whitney tests). The row "p < 0.05 only" excludes the set from p < 0.01 and indicates that these palindromes were not as significant as the α = 0.01 group.

Table 1 shows the results of the Mann-Whitney tests for the GC-rich and GC-poor datasets with palindrome lengths of 4, 6, 8, and 10 bases. Two α values are given for both the GC-rich and GC-poor sequence data. In each column (length) and row (significance), the number of palindromes out of the total (i.e., the exhaustive set) satisfying our alternative hypothesis is given. The "Size of Exhaustive List" row gives the total number of possible palindromes of each length, against which the counts of palindromes passing the Mann-Whitney test (i.e., having larger proportions in the non-coding data than in the coding data) are compared. The "p < 0.05 only" row indicates the counts significant at α = 0.05 but not at the α = 0.01 level. A percentage is also given to describe how much of the total number of palindromes of each length satisfied our alternative hypothesis. By the nature of the Mann-Whitney test, we only learn whether the proportion of a particular palindrome is greater in the non-coding data than in the coding data; it could be that the proportions were low in both areas, but less so in the non-coding regions.

3.1 Length-{4, 6, 8, 10} Palindromes

From the GC-rich sequences we note that 14 of 16 length-4 palindromes (87.5%) are abundant in the non-coding regions at the α = 0.01 significance level (and hence also at the 0.05 level). In the GC-poor set, 13 of the total 16 (81.3%) were found in greater proportions in the non-coding regions, all significant at the α = 0.01 level.

A large percentage of the exhaustive list of length-6 palindromes was found to have higher proportions in the non-CDS regions: 54 of 64 (84.4%) at α = 0.01 in the GC-rich sequences, and 43 of the total 64 (67.2%) at the same α in the GC-poor data. Four palindromes were significant only at the α = 0.05 level in the GC-rich dataset, and 10 in the GC-poor.
The majority of the possible length-8 palindromes are still found in abundance in the non-coding regions of the GC-rich dataset: 183 of 256 (71.5%) at α = 0.01. In the GC-poor set, 118 palindromes of the total 256 (46.1%) were significant at the same α level. Palindromes of length-10 are almost too long to be called
"short" palindromes, and since the typical RS is on average of length 6, we now expect to see some changes in the general trends of palindromic abundance in the non-coding regions. For example, in the GC-rich set, 431 of the total 1024 (42.1%) were significant at α = 0.01. For the GC-poor set, we have 501 of the total 1024 (48.9%) at α = 0.01.

4 Conclusions

Short palindromic sequences play important roles as restriction sites for cleaving enzymes. Various studies have provided evidence that these palindromes occur in reduced numbers along the bacterial genome, but they do not provide evidence about where palindromic avoidance is actually happening. In this study, we hypothesized that avoidance of short palindromes (lengths {4, 6, 8, 10}) is concentrated in the coding regions, which are thought to be less tolerant of palindromic instability [3]. Our argument was further motivated by observations in the literature that longer palindromes have been found performing structural duties in the non-coding regions [5].

The results described in this paper can be used to devise strategies for finding and studying biological mechanisms which depend on palindromic involvement, such as auto-immune function, restriction enzyme activity, and methylation systems. More importantly, a sequence property such as the one observed here, obtained from the analysis of complete genomes, would be very useful in whole-genome sequence assembly and annotation. Similar to pieces of sky in jigsaw puzzles, reads belonging to certain regions in a genome are difficult to position correctly. In particular, current assembly algorithms use common overlapping substrings of letters as a basis to assemble sequence reads, presuming that two overlapping reads likely originated from the same chromosomal region in the genome; consequently, most assembly algorithms are greedy or graph based.
Incorporation of sequence-specific features observed from biological samples is expected to overcome the limitations that arise during sequence assembly. In the future, we will study the GC content of the palindromes to find their distribution properties. In greater detail, we plan to analyze the role of sequence-specific features in the development of sequence assemblers.

References

1. Brewer B.J., Payen C., Raghuraman M.K., Dunham M.J. (2011) Origin-Dependent Inverted-Repeat Amplification: A Replication-Based Model for Generating Palindromic Amplicons. PLoS Genet. 2011 March; 7(3).
2. Park C.K., Joshi H.K., Agrawal A., Ghare M.I., Little E.J., Dunten P.W., Bitinaite J., Horton N.C. (2010) Domain swapping in allosteric modulation of DNA specificity. PLoS Biol. 2010 Dec 7;8(12).
3. Gelfand M.S., Koonin E.V. (1997) Avoidance of palindromic words in bacterial and archaeal genomes: a close connection with restriction enzymes. Nucleic Acids Res. 1997 Jun 15;25(12):2430-2439.
4. Darmon E., Eykelenboom J.K., Lincker F., Jones L.H., White M., Okely E., Blackwood J.K., Leach D.R. (2010) E. coli SbcCD and RecA control chromosomal rearrangement induced by an interrupted palindrome. Mol Cell. 2010 Jul 9;39(1):59-70.
5. Lukić-Bilela L., Brandt D., Pojskić N., Wiens M., Gamulin V., Müller W.E. (2008) Mitochondrial genome of Suberites domuncula: palindromes and inverted repeats are abundant in non-coding regions. Gene. 2008 Apr 15;412(1-2):1-11.
6. Perona J.J. (2002) Type II restriction endonucleases. Methods. 2002 Nov;28(3):353-364.
7. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. (2011) GenBank. Nucleic Acids Res. 2011 Jan;39 (Database issue).
8. Castronovo M., Radovic S., Grunwald C., Casalis L., Morgante M., Scoles G. (2008) Control of steric hindrance on restriction enzyme reactions with surface-bound DNA nanostructures. Nano Lett. 2008 Dec;8(12):4140-4145.


GREDSTAT: Genome-wide Restriction Enzyme Digestion STatistical Analysis Tool

Norbert Dojer1,2 and Maga Rowicka1,3,4

1 Institute for Translational Sciences, University of Texas Medical Branch at Galveston, 301 University Boulevard, Galveston, TX 77555, USA 2 Institute of Informatics, University of Warsaw 3 Department of Biochemistry and Molecular Biology, University of Texas Medical Branch at Galveston, 301 University Boulevard, Galveston, TX 77555, USA 4 Sealy Center for Molecular Medicine, University of Texas Medical Branch at Galveston, 301 University Boulevard, Galveston, TX 77555, USA

Abstract. GREDSTAT is an online tool for in silico whole-genome digestion with restriction enzymes. It is designed specifically for application in planning next-generation sequencing experiments, such as reduced representation sequencing. GREDSTAT digests, within minutes, genomes several orders of magnitude bigger than the maximal input size accepted by earlier tools. GREDSTAT is also a powerful tool for restriction enzyme selection and analysis: it allows the user to choose a restriction enzyme from a database based on its cutting pattern in a genome of interest. Such functionality is not provided by any publicly available tool and is crucial for optimal design of some next-generation sequencing experiments, such as reduced representation sequencing. Availability: GREDSTAT is available at http://gredstat.utmb.edu.

Introduction

Restriction enzymes, also known as restriction endonucleases, recognize short DNA sequences (recognition sites) and cut DNA at specific sites within or adjacent to these sequences. The cut can either result in blunt ends (a cut at the same place on both DNA strands) or in overhangs (a cut at a different place on each strand). We introduce notation to contain this information in a single line. With this notation the recognition site of AflII reads C↓TTAA↑G, meaning the enzyme cuts as follows (↓ denotes the place of the cut):

5' - C↓TTAAG - 3'
3' - GAATT↓C - 5'

Restriction enzymes can be of four types (Types I, II, III, and IV) based on their composition and enzyme cofactor requirements, the nature of their target sequence, and the position of their DNA cleavage site relative to the target sequence:

– Type I enzymes (EC 3.1.21.3) cleave at sites remote from the recognition site (> 1000 bp).


– Type II enzymes (EC 3.1.21.4) cleave within or at short specific distances from the recognition site; the restriction site is typically 4-8 bp and often palindromic.
– Type III enzymes (EC 3.1.21.5) recognize two separate non-palindromic sequences that are inversely oriented and cut DNA about 20-30 base pairs after the recognition site.
– Type IV enzymes target modified (e.g., methylated) DNA.

Our tool is based on the REBASE database, which currently includes over 800 commercially available restriction enzymes of Type I, II, or III recognizing over 250 DNA sequences [1].

Discussion and Results

Until recently, restriction enzymes have been used mainly to aid in the insertion of genes into plasmid vectors for gene cloning and protein expression experiments. Restriction enzymes can also be used to distinguish gene alleles, provided these alleles differ in the number of recognition sites for the restriction enzyme (restriction mapping). There are many tools designed to help with such studies: Webcutter 2.0 [2], WatCut [3], TACG2 [4], Restriction Enzyme Picker [5], NEBcutter [6], In silico restriction digest of complete genomes [7], and Sequence Extractor [8]. Unlike GREDSTAT, all of these tools restrict the input FASTA file size to between 200 kb and 5 Mb, so whole-genome digestion with them is possible only for the simplest prokaryotic genomes, and not even for the yeast S. cerevisiae.

With the advent of the Human Genome Project, the seminal idea of reduced representation sequencing was introduced by Eric Lander and colleagues [9]. Briefly, the reduced representation method is used whenever the desired coverage of sequencing is impractical or too costly to achieve. For example, SNP and methylation studies require very high depth of sequencing, making them relatively expensive for large genomes, such as those of mammals. On the other hand, many questions can be answered by querying only a fraction of these genomes, but this fraction has to be chosen in a repeatable manner to allow comparisons between different samples and experiments. This goal can be achieved by the reduced representation method: digestion of DNA with a restriction enzyme followed by size selection of the resulting DNA fragments (using gels or an automated method). The gain in coverage of sequencing is the inverse of the fraction of the genome covered. Reduced representation sequencing has the additional advantage of reducing the number of hypotheses tested (by interrogating a smaller fraction of the genome), thus providing an additional gain in statistical power.
GREDSTAT is designed specifically to aid in such experiments. In planning a reduced representation sequencing experiment it is crucial to know what increase in depth of sequencing will be achieved by choosing a given restriction enzyme. Restriction enzymes vary considerably in how often they cut a genome. Generally, enzymes with longer restriction sites are less frequent cutters. This is, however, a very loose guideline. For instance, although the length
of the restriction site of MluI is 6 while that of SrfI is 8, the two enzymes have roughly the same number of restriction sites in yeast.

The frequency of cutting in the whole human genome also cannot be approximated well by the frequency of cutting in a shorter region of the length that the existing tools can handle (up to 5 Mb). For instance, an attempt to estimate the total number of NotI cutting sites in the human genome based on 1/516 of the human genome sequence results in more than a 6-fold overestimation of the frequency of cutting, while for SmaI it is almost a 4-fold overestimation. Moreover, using a smaller genome for such estimation also does not yield satisfactory results: NotI has 40 restriction sites in S. cerevisiae and SmaI has 287 (7.2 times more), yet NotI cuts the human genome 9,674 times and SmaI cuts it 374,946 times (38.8 times more).

GREDSTAT helps a researcher choose optimal restriction enzymes by in silico whole-genome digestion, followed by statistical analysis of the obtained fragments. In silico digestion may be performed with one or more enzymes from REBASE [1]. Moreover, there is also an option of a user-provided restriction site, to allow using GREDSTAT with enzymes not currently included in REBASE. Available genomes include H. sapiens, S. cerevisiae, A. thaliana, C. elegans, and D. melanogaster. The user also has the option of submitting another input DNA sequence; however, since sending large files via the internet is likely to be impractical, the list of genomes available in GREDSTAT will be extended soon.

GREDSTAT also automatically computes the percentage of the genome covered by user-selected fragment lengths, a key parameter in reduced representation experiments. This number is crucial, because only fragments with lengths in the selected interval are retained for sequencing. It is important to use the exact number that GREDSTAT provides, as, e.g., the total number of occurrences of a recognition site in a genome is not a good proxy for estimating the number of fragments in a chosen interval (Figure 1).
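A simplified sketch of the kind of computation involved (not the actual GREDSTAT implementation: the cut is placed at the start of each recognition site rather than at the enzyme-specific offset, and the genome is a hypothetical toy string):

```python
def digest(genome, site):
    # In silico digestion: cut the genome at every occurrence of the
    # recognition site and return the resulting fragment lengths.
    # Simplification: cuts at the start of each site.
    cuts, start = [], 0
    while True:
        pos = genome.find(site, start)
        if pos == -1:
            break
        cuts.append(pos)
        start = pos + 1
    bounds = [0] + cuts + [len(genome)]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]

def fraction_covered(fragments, lo, hi):
    # Fraction of the genome covered by fragments with length in [lo, hi],
    # the key reduced-representation parameter; the gain in sequencing
    # depth is the inverse of this fraction.
    return sum(f for f in fragments if lo <= f <= hi) / sum(fragments)

# Hypothetical 92-nt toy genome with four CCCGGG sites:
genome = "AAACCCGGGTTT" * 3 + "CCCGGG" + "A" * 50
frags = digest(genome, "CCCGGG")
print(frags)                            # [3, 12, 12, 9, 56]
print(round(fraction_covered(frags, 5, 20), 3))  # 0.359
```

Here retaining fragments of 5-20 nt covers 33/92 of the toy genome, so sequencing only that size fraction would increase effective depth roughly 2.8-fold.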

Fig. 1. Histograms of fragment lengths in the range 0-1000 nt for restriction enzymes AciI and Tsp45I. Even though these restriction enzymes have almost identical numbers of recognition sites in the human genome (3,976,945 and 3,939,921, respectively), the number of cut fragments of a given length varies considerably.


Conclusions

GREDSTAT dramatically simplifies the planning stage of reduced representation sequencing experiments. Using GREDSTAT, one can immediately select all commercially available Type I, II, and III restriction enzymes providing the required increase in sequencing depth. To allow visual inspection of regions of particular interest, GREDSTAT generates UCSC Genome Browser [10] tracks with recognition sites for the selected enzymes.

GREDSTAT can also work with a mixture of restriction enzymes, retaining all of the described functionality. With all these options we expect that GREDSTAT will not only substantially facilitate planning reduced representation sequencing experiments, but will also contribute to improving the quality of their results.

Acknowledgments

We are grateful to Andrzej Kudlicki and Abhishek Mitra for helpful discussions. This study was supported in part by grant 1UL1RR029876-01 from the National Center for Research Resources, National Institutes of Health.

References

1. Roberts, R.J., Vincze, T., Posfai, J., Macelis, D.: REBASE: a database for DNA restriction and modification: enzymes, genes and genomes. Nucleic Acids Res 38(Database issue) (Jan 2010) D234-D236
2. Heiman, M.: Webcutter. http://rna.lundberg.gu.se/cutter2/
3. Palmer, M.: WatCut: an on-line tool for restriction analysis, silent mutation scanning, and SNP-RFLP analysis. http://watcut.uwaterloo.ca/watcut/watcut/template.php
4. Mangalam, H.J.: tacg: a grep for DNA. BMC Bioinformatics 3 (2002) 8
5. Collins, R.E., Rocap, G.: REPK: an analytical web server to select restriction endonucleases for terminal restriction fragment length polymorphism analysis. Nucleic Acids Res 35(Web Server issue) (Jul 2007) W58-W62
6. Vincze, T., Posfai, J., Roberts, R.J.: NEBcutter: a program to cleave DNA with restriction enzymes. Nucleic Acids Res 31(13) (Jul 2003) 3688-3691
7. Bikandi, J., San Millan, R., Rementeria, A., Garaizar, J.: In silico analysis of complete bacterial genomes: PCR, AFLP-PCR and endonuclease restriction. Bioinformatics 20(5) (Mar 2004) 798-799
8. Stothard, P.: Sequence extractor. http://www.bioinformatics.org/seqext/index.html
9. Altshuler, D., Pollara, V.J., Cowles, C.R., Van Etten, W.J., Baldwin, J., Linton, L., Lander, E.S.: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407(6803) (Sep 2000) 513-516
10. Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., Haussler, D.: The human genome browser at UCSC. Genome Res 12(6) (Jun 2002) 996-1006


Scaffolding Large Genomes using Integer Linear Programming

James Lindsay1, Hamed Salooti2, Alex Zelikovsky2, Ion M˘andoiu1

1Computer Science & Engineering Department, University of Connecticut 371 Fairfield Way, Storrs, CT 06269 {james.lindsay,ion}@engr.uconn.edu 2Computer Science, Georgia State University 34 Peachtree Street, Atlanta, GA 30303 [email protected], [email protected]

Abstract. The precipitous drop in sequencing costs has generated much enthusiasm for very large scale genome sequencing initiatives, such as the Genome 10K (G10K) project. Despite this renewed interest, the assembly of large genomes from short reads is still an extremely resource-intensive process. This work presents a scalable algorithm to create scaffolds, or ordered and oriented sets of assembled contigs, which is one part of a practical assembly. This is accomplished using integer linear programming (ILP). In order to process large mammalian genomes we employ non-serial dynamic programming (NSDP) and a hierarchical strategy. Finally, novel quantitative metrics are introduced in order to compare scaffolding tools and gain deeper insight into the challenges of scaffolding.

1 Introduction

Short contig lengths, typically between 2-4 Kb, are characteristic of draft genome assemblies generated over the past decade from low-coverage (2x) Sanger or (10x) HTS reads. To increase the utility of such fragmented assemblies, additional long-range linkage information is used to orient contigs relative to one another and order them into larger structures referred to as scaffolds. Unfortunately, linkage information provided by HTS pairs is noisy, due to both chimeric pairs resulting from library preparation artifacts and erroneous mapping of reads originating from repeats. These difficulties, along with the sheer number of HTS pairs and contigs that must be handled, render scaffolding methods developed for Sanger pairs such as [5, 9] ineffective on HTS data. While recent algorithmic advances [1, 2, 7, 10] have improved scaffolding accuracy from such noisy HTS paired reads, scaling these methods to datasets consisting of hundreds of thousands to millions of contigs and hundreds of millions of read pairs, as expected for a vertebrate genome, remains a significant challenge.

Our approach utilizes a pure ILP to find the scaffolds which are most consistent with the supplied linkage information. The presented model is equally or more accurate than the leading methods, and is able to solve large instances
where all others fail. This is achieved through the non-serial dynamic programming (NSDP) paradigm, which divides the problem into smaller subproblems that are solved and combined into an optimal solution. When genomes are very large and contig quality is low, a hierarchical approach is employed to achieve scalability. This approach utilizes the highest quality data first, and scaffolds found in earlier rounds serve as a filter for later rounds.

2 Scaffolding

Our approach breaks the scaffolding problem into two steps: first, a compatible orientation and pairwise order is found using a robust Integer Linear Program (ILP); then a complete "golden path", or scaffold, is computed using weighted bipartite matching.

2.1 ILP Model

The scaffolding ILP model is best described using the scaffolding graph G = (V, E), where vertices represent contigs and edges are derived from linkage information provided by paired HTS reads. Paired sequencing libraries are constructed such that the order and orientation of the reads relative to their originating fragment is known. For this work we use the mate-pair style of reads, where the pairs come from the same strand and are in the same orientation. Since each read in a pair can map to the 5' or 3' strand of a contig, there are 4 possible orientations of two contigs i, j connected by a paired read. We define several boolean variables to assist in creating the model. First, S_i in {0, 1} for every contig i indicates the orientation of each contig; S_i = 0 indicates the contig's orientation does not change. Then S_ij in {0, 1} for all (i, j) in E tells whether the contigs have the same orientation (S_ij = 0) or one is flipped (S_ij = 1). Four state variables A_ij, B_ij, C_ij, D_ij in {0, 1} for all (i, j) in E represent the 4 possible orders and orientations of adjacent contigs; they are mutually exclusive. Each state has an associated weight A^w_ij, B^w_ij, C^w_ij, D^w_ij, which is the sum of the weights of the corresponding pairs. The objective is to maximize the number of concordant edges:

    Max  sum over (i,j) in E of  A^w_ij * A_ij + B^w_ij * B_ij + C^w_ij * C_ij + D^w_ij * D_ij

The following constraints enforce the behavior of the orientation variables.

    S_ij <= S_i + S_j            A_ij + D_ij <= 1 - S_ij
    S_ij >= S_j - S_i            B_ij + C_ij <= S_ij
    S_ij >= S_i - S_j
    S_ij <= 2 - S_i - S_j

An additional 8 constraints forbid two- and three-cycles in all instances where 3 contigs are connected. The solution to this ILP gives the orientation of every


contig, which induces a partial ordering of the scaffolding graph. We can construct a directed scaffold graph DG = (V, E) where the vertex set is the same as in the previous graph and directed edges are induced by adjacent vertices with compatible orientations. This graph still does not indicate the linear path that is the scaffold.
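As an aside, the S_ij constraints in the ILP are a standard linearization of S_ij = S_i XOR S_j. The brute-force check below (an illustrative sketch of ours, not part of the authors' implementation) confirms that for every orientation pair exactly one value of S_ij is feasible, and that it equals the XOR:

```python
from itertools import product

def xor_constraints_ok(si, sj, sij):
    """The four linear constraints that force sij = si XOR sj."""
    return (sij <= si + sj and
            sij >= sj - si and
            sij >= si - sj and
            sij <= 2 - si - sj)

# For every orientation pair (si, sj), exactly one value of sij
# satisfies all four constraints, and it equals si XOR sj.
for si, sj in product((0, 1), repeat=2):
    feasible = [sij for sij in (0, 1) if xor_constraints_ok(si, sj, sij)]
    assert feasible == [si ^ sj]
print("XOR linearization verified")
```

The same pattern extends to the state constraints: A_ij and D_ij can only be active when S_ij = 0, and B_ij and C_ij only when S_ij = 1.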

2.2 Path Finding

A complete ordering of the directed scaffold graph will yield the desired scaffold. To compute it we use maximum weighted bipartite matching. We define a bipartite graph B = (V1 ∪ V2, E), where each vertex in V1 corresponds to the 5' end of a contig and each vertex in V2 to the 3' end of a contig. Edges are defined by the directed scaffolding graph and are weighted by the same weight as the chosen state indicator. The Hungarian algorithm can be used to solve this matching in O(|V|^3).
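For intuition, the matching step can be sketched on a toy instance. The exhaustive search below (our own illustrative code; in practice the Hungarian algorithm finds the same optimum in O(|V|^3) rather than O(n!)) enumerates all perfect matchings of a small weight matrix:

```python
from itertools import permutations

def max_weight_matching(weights):
    """Brute-force maximum-weight perfect matching on a small bipartite
    graph given as a square weight matrix (missing edges have weight 0).
    Illustrative only; the Hungarian algorithm is the scalable choice."""
    n = len(weights)
    best_score, best = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best = score, perm
    return best_score, best

# Rows: 3' ends of contigs; columns: 5' ends of contigs; each entry is
# the weight of the chosen state indicator for that adjacency.
w = [[0, 5, 1],
     [2, 0, 4],
     [3, 2, 0]]
print(max_weight_matching(w))  # (12, (1, 2, 0))
```

In the scaffolding setting any cycles appearing in the matching must additionally be broken so that the result is a set of linear paths.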

3 Scaling the Algorithm

It is impractical to solve a single scaffolding ILP for a large mammalian genome; the number of variables and constraints is simply too large. To overcome this hurdle and still solve the problem optimally, we adopted the non-serial dynamic programming (NSDP) paradigm. This optimization technique exploits the sparsity of the scaffolding graph, which should be a bounded-width graph [3], to compute the solution in stages, each stage using the results of a previous one to efficiently solve the problem.

3.1 NSDP

A key concept in NSDP is the notion of an interaction graph, which models the relationship between variables and constraints. An interaction graph I = (V, E) contains a vertex for each variable, and an edge is added between vertices if they appear in the same constraint or component of the objective function. NSDP is a process which eliminates variables in such a way that adjacent variables can be merged together [11]. We note that our scaffolding graph is equivalent to the interaction graph. The first step in applying NSDP is identifying the independent and weakly independent components of the interaction graph. Completely independent components have no influence on other components; a weakly independent component, however, may share one or two nodes with another component. We use efficient algorithms to find the bi- and tri-connected components of our interaction graph. Then an elimination order must be found so that each component can be solved independently, in such a way that the solutions can be merged to find the global solution. This order takes the shape of a tree; for bi-connected components the tree is found using DFS from an arbitrary node. The decomposition order for tri-connected components is given by the SPQR-tree [4] data structure.
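The single shared nodes of weakly independent components are exactly the cut (articulation) vertices of the interaction graph. A minimal sketch of finding them with the classic DFS low-link method (our own illustrative code, not the authors' implementation) follows:

```python
def articulation_points(adj):
    """Articulation (cut) vertices of an undirected graph, found with
    the classic DFS low-link method.  adj maps vertices 0..n-1 to
    neighbor lists.  These are the nodes along which weakly independent
    components can be split and their solutions later merged."""
    n = len(adj)
    disc = [0] * n          # discovery times (0 = unvisited)
    low = [0] * n
    cuts = set()
    timer = [1]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if disc[v]:                      # back edge
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent != -1 and low[v] >= disc[u]:
                    cuts.add(u)              # u separates v's subtree
        if parent == -1 and children > 1:
            cuts.add(u)                      # root with >1 DFS subtree
    for s in range(n):
        if not disc[s]:
            dfs(s, -1)
    return cuts

# Two triangles sharing vertex 2: vertex 2 is the only cut vertex.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(articulation_points(adj))  # {2}
```

Tri-connected components, used for the SPQR-tree decomposition, require the more involved Hopcroft-Tarjan algorithm [4] and are not sketched here.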


The solution to each component of the interaction graph is found using a bottom-up traversal. In general, during the traversal the ILP for each component is solved 2 or 4 times, once for each possible orientation of the common nodes. The objective value of each case is encoded in the objective of the component's parent. After solving all components, a top-down DFS starting from the same root is performed to apply the chosen solution for each component.

3.2 Hierarchical Strategy

It was observed that the number of paired edges that span contigs is a parameter, called p, that has the biggest effect on the accuracy and scalability of the scaffolding program. In our work we attempted to solve the scaffolding problem without setting this parameter explicitly, but in complex genomes it was observed that occasionally the ILP would be too large even after the complete decomposition procedure. In order to address this, we developed a hierarchical strategy for solving the scaffolding problem. The problem is first solved with high confidence pairs by requiring p > 1, then p is gradually decreased. After each solution, utilized pairs are removed from consideration, and any edge in the scaffolding graph that is not compatible with a previous stage's scaffold is removed.

4 Results

The input to a scaffolding program is simply a set of contigs (vertices) and some number of paired reads (directed edges). A scaffolded genome can be thought of as a linear directed graph, and the act of scaffolding is the attempt to predict directed edges between adjacent contigs. We treat scaffolding as a binary classification test in which methods attempt to predict true adjacencies in the test dataset. The traditional measure of a scaffold, however, is the N50, a weighted median statistic such that 50% of the entire assembly is contained in scaffolds equal to or larger than this value. Unfortunately this measure does not reflect the accuracy of the scaffolds and rewards aggressive merging. Therefore we utilize the true positive N50 (TP-N50), which breaks scaffolds wherever an adjacency was falsely predicted.
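The TP-N50 computation can be sketched as follows (an illustrative toy implementation under a data layout of our own choosing; gap lengths between contigs are ignored):

```python
def n50(lengths):
    """Weighted median: the largest length L such that scaffolds of
    length >= L together cover at least half the total assembly span."""
    total = sum(lengths)
    run = 0
    for length in sorted(lengths, reverse=True):
        run += length
        if run * 2 >= total:
            return length

def tp_n50(scaffolds, true_adjacencies):
    """Break each scaffold at every falsely predicted adjacency, then
    take the N50 of the resulting pieces.  A scaffold is a list of
    (contig_id, length) in order; true_adjacencies is a set of
    (id_a, id_b) pairs known to be adjacent in the reference."""
    pieces = []
    for scaf in scaffolds:
        cur = scaf[0][1]
        for (a, _), (b, lb) in zip(scaf, scaf[1:]):
            if (a, b) in true_adjacencies:
                cur += lb           # correct join: keep extending
            else:
                pieces.append(cur)  # false join: break the scaffold
                cur = lb
        pieces.append(cur)
    return n50(pieces)

scaffolds = [[('c1', 5), ('c2', 5), ('c3', 10)]]
truth = {('c1', 'c2')}              # the c2-c3 join was false
print(n50([20]), tp_n50(scaffolds, truth))  # 20 10
```

A correct 20bp scaffold would score N50 = 20; breaking it at the one false adjacency yields a TP-N50 of 10, illustrating how the metric penalizes aggressive merging.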

4.1 Experimental Setup

The test datasets were derived from a simulated de novo draft assembly of an individual human [6]. The set of Sanger-style reads used to create this finished version of the draft genome was subsampled to about 4x base coverage. A total of 11,200,000 reads were given to the Celera 6.1 assembler [8] and assembled using the recommended parameters for large mammalian genomes. The assembler generated 422,837 contigs with an N50 of 7704bp. These contigs were then mapped against the finished version of the genome using a gapped alignment tool; only non-overlapping contigs with 95% identity were utilized. From this alignment a reference scaffold was created to serve as the test case. Smaller test cases were generated by randomly choosing a percentage of scaffolds from each chromosome.


Accuracy. The primary metric we use to demonstrate the accuracy of our algorithm is the TP-N50. We see that in all test cases our algorithm makes longer correct scaffolds than existing tools. In addition to this number we provide the two standard measures of accuracy in a binary classification test, sensitivity and positive predictive value. Using these measures we see that we are essentially equivalent to OPERA.
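For reference, the two classification measures are computed from true positive (TP), false positive (FP), and false negative (FN) adjacency counts as follows (a trivial sketch of ours):

```python
def classification_metrics(tp, fp, fn):
    """Sensitivity (recall) and positive predictive value (precision)
    for adjacency prediction treated as a binary classification test."""
    sensitivity = tp / (tp + fn)   # fraction of true adjacencies found
    ppv = tp / (tp + fp)           # fraction of predictions that are true
    return sensitivity, ppv

print(classification_metrics(80, 20, 20))  # (0.8, 0.8)
```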

Scalability. The obvious measure of the scalability of an algorithm is runtime. We compared the runtime of all three tools on different-sized test cases. The objective of this test was to discover on which test case each method failed to produce a result within 120 hours. We chose 120 hours, or 5 days, as our threshold because we believe it is a sufficient amount of time for each tool. Longer times would represent an unreasonable amount of time to wait for a scaffold, given that the assembly itself took approximately 10 days.

Acknowledgment

This work has been supported in part by awards IIS-0916948 and IIS-0916401 from the NSF and Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture.


References

1. Adel Dayarian, Todd P. Michael, and Anirvan M. Sengupta. SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics, 11:345, 2010.
2. Song Gao, Niranjan Nagarajan, and Wing-Kin Sung. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. In Proc. 15th Annual International Conference on Research in Computational Molecular Biology, pages 437-451, Berlin, Heidelberg, 2011. Springer-Verlag.
3. Song Gao, Niranjan Nagarajan, and Wing-Kin Sung. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. In Proc. 15th Annual International Conference on Research in Computational Molecular Biology, pages 437-451, 2011.
4. J. E. Hopcroft and R. E. Tarjan. Dividing a graph into triconnected components. SIAM Journal on Computing, 2(3):135-158, 1973.
5. Daniel H. Huson, Knut Reinert, and Eugene W. Myers. The greedy path-merging algorithm for contig scaffolding. J. ACM, 49(5):603-615, 2002.
6. S. Levy et al. The diploid genome sequence of an individual human. PLoS Biology, 5(10):e254+, 2007.
7. J. Lindsay, J. Zhang, T. Farnham, Y. Wu, I. Mandoiu, R. O'Neill, H. Salooti, E. Bullwinkel, and A. Zelikovsky. Poster: Scaffolding draft genomes using paired sequencing data. In Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on, page 252, Feb. 2011.
8. E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. Venter. A whole-genome assembly of Drosophila. Science, 287(5461):2196-2204, March 2000.
9. Mihai Pop, Daniel S. Kosack, and Steven L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research, 14(1):149-159, 2004.
10. Leena Salmela, Veli Mäkinen, Niko Välimäki, Johannes Ylinen, and Esko Ukkonen. Fast scaffolding with small independent mixed integer programs. Bioinformatics, 27(23):3259-3265, December 2011.
11. Oleg Shcherbina. Nonserial dynamic programming and tree decomposition in discrete optimization. In OR, pages 155-160, 2006.


Inference of allele specific expression levels from RNA-Seq data

Sahar Al Seesi and Ion Măndoiu

Computer Science and Engineering, University of Connecticut
{sahar,ion}@engr.uconn.edu

Abstract. Accurate allele specific expression estimation requires the availability of a diploid transcriptome, which makes it a challenging problem. Most existing methods rely on simple counting of allele coverage at heterozygous Single Nucleotide Polymorphic sites. In this work, we present RNA-PhASE, a pipeline for Allele Specific gene and isoform Expression estimation from RNA-Seq reads. The pipeline integrates methods for SNV detection and phasing with a new diploid version of an Expectation Maximization algorithm for gene/isoform expression estimation. Within this pipeline, we couple an existing phasing algorithm with a novel method for coverage-based phasing.

1 Introduction

Most current methods for estimating gene/isoform expression levels from high-throughput whole transcriptome sequencing (RNA-Seq) data rely on mapping the reads to a reference genome and/or transcriptome and do not consider the difference between the two parental alleles (the diploid transcriptome). The diploid transcriptome can be easily inferred when a diploid genome is available, as in recent studies of cis- and trans-regulation [8] and parent-of-origin effects [5] that use hybrids of inbred species or strains. However, reconstructing the diploid genome of human subjects remains a difficult task [3]. Hence, existing studies of allele-specific gene expression rely on simple allele coverage analysis for heterozygous Single Nucleotide Polymorphic (SNP) sites within transcripts. Such approaches typically do not allow inference of allele-specific expression of individual gene isoforms, result in less robust estimates since they use only RNA-Seq reads that overlap heterozygous SNP sites, and are affected by systematic read mapping biases toward reference alleles [1, 6]. In this work, we integrate a recent method for SNV detection and genotyping from RNA-Seq data [4] with the scalable haplotype reconstruction method [2] and a diploid version of the Expectation Maximization (EM) algorithm for isoform expression estimation of [9] into a pipeline for estimation of allele-specific isoform expression levels. Our pipeline, RNA-PhASE, does not require genome sequencing data, but can incorporate such data when available. Inferring the two haplotypes and re-mapping the reads against the diploid transcriptome resolves


the above-mentioned bias towards reference alleles, while the EM model improves inference accuracy by using all reads, including those that map to more than one isoform, incorporating additional sources of disambiguation information such as the distribution of RNA-Seq fragment lengths, and correcting biases introduced by library preparation and sequencing protocols. Preliminary results show the ability of the proposed pipeline to accurately infer allele specific isoform expression levels for synthetic hybrids with varying levels of heterozygosity, generated by pooling whole brain RNA-Seq reads of different mouse strains studied as part of the Sanger Institute Mouse Genomes Project [7].

2 Methods

The RNA-PhASE pipeline, depicted in Figure 1, starts by mapping the RNA-Seq reads against a haploid reference transcriptome and reference genome. Alignments from both mappings are merged together according to a set of rules described in [5], and the resulting set of alignments is used to call SNVs. The merging method, referred to as HardMerge, keeps a read if it aligns uniquely to the genome only, uniquely to the transcriptome only, or to both, provided that the two alignments agree. Results have shown that this hybrid method results in calling SNVs with very high confidence. We introduce a local alignment version of HardMerge that works at the base level, discarding read bases mapped to multiple locations. It then generates alignments from contiguous stretches of non-ambiguously mapped bases. This modification enables HardMerge to handle local alignments of long RNA-Seq reads generated by technologies like 454 and Ion Torrent. SNVs are then called using SNVQ [4], which uses Bayes' rule to call the genotype with the highest probability while taking base quality scores into account. For haplotyping, we couple an efficient Single Individual Haplotyping algorithm, RefHap [2], with a novel method for coverage-based phasing. Our new method merges phased blocks in the RefHap output, and it phases called SNVs that were not phased by RefHap because they are not in close proximity to other SNVs and consequently there is no read evidence that can be used to phase them. In coverage-based phasing, for two successive heterozygous SNVs i and j, i's allele with the highest coverage is paired with j's allele with the highest coverage in the same haplotype, and similarly the lowest-coverage alleles are paired in the other haplotype. When one or both SNVs have equal coverage for the two alleles, phasing is done arbitrarily. i and j can be two SNVs for which the phase was not resolved by RefHap.
Alternatively, j can be the first SNV in a phased block and i the last SNV in the closest preceding block. Allele Specific Expression (ASE) levels are estimated by realigning the reads against the diploid transcriptome and feeding the mapping results into a diploid version of IsoEM [9], an EM algorithm that makes use of information such as insert size, quality scores, and read pairing, if available, to handle read



Fig. 1. RNA-PhASE: pipeline for Allele Specific Expression inference from RNA-Seq data through calling and phasing expressed Single Nucleotide Variations.

mapping ambiguities. Finally, allelic expression imbalance is inferred by applying Fisher's Exact test.
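The coverage-based phasing rule described above can be sketched for a single pair of successive SNVs (our own illustrative code with a hypothetical data layout, mapping each allele to its read coverage):

```python
def coverage_phase(snv_i, snv_j):
    """Phase two heterozygous SNVs by coverage: the higher-coverage
    allele of i goes on the same haplotype as the higher-coverage
    allele of j, and the lower-coverage alleles pair up on the other
    haplotype (ties would be broken arbitrarily).  Each SNV is a dict
    mapping allele -> read coverage.  Illustrative sketch only."""
    hi_i, lo_i = sorted(snv_i, key=snv_i.get, reverse=True)
    hi_j, lo_j = sorted(snv_j, key=snv_j.get, reverse=True)
    hap1 = (hi_i, hi_j)   # high-coverage alleles together
    hap2 = (lo_i, lo_j)   # low-coverage alleles together
    return hap1, hap2

snv_i = {'A': 30, 'G': 12}
snv_j = {'C': 25, 'T': 9}
print(coverage_phase(snv_i, snv_j))  # (('A', 'C'), ('G', 'T'))
```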

3 Experimental Results

We test RNA-PhASE on synthetic hybrid data created by merging whole brain RNA-Seq reads from the Sanger Institute Mouse Genomes Project. Four synthetic hybrid data sets were created by merging equal numbers of reads from C57BL/6NJ with each of the following strains: BALB/cJ, A/J, CAST/EiJ, and SPRET/EiJ. The four strains were selected to test RNA-PhASE's performance with varying levels of heterozygosity. As a measure of strain variation compared to C57BL/6NJ, and thus of the heterozygosity level of the synthetic hybrids, we use the number of genomic SNVs reported in [7]. The strains are listed here in increasing order of variation compared to C57BL/6NJ. Testing is done on two levels. First, we test the ability of the diploid IsoEM to accurately estimate ASE given the diploid transcriptome. This is done by creating diploid transcriptomes for the hybrids using the SNVs reported in [7]. The inferred expression level for each allele of an isoform or gene is compared with the expression level of that isoform/gene estimated from the corresponding strain reads when processed separately. We measure the Pearson correlation coefficient, error fractions (EF), and median percent errors (MPE). EF at a certain threshold t is the percentage of isoforms (or genes) with relative error larger than t, where the relative error is calculated as the difference in estimated


to actual expression levels divided by the actual expression level. MPE is the threshold t for which EF is 50%. Figure 2 and Tables 1 and 2 show these results.
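The EF and MPE metrics can be sketched in a few lines (our own illustrative code; variable names are not from the paper):

```python
def error_fraction(est, act, t):
    """EF(t): fraction of isoforms/genes whose relative error
    |estimated - actual| / actual exceeds threshold t."""
    rel = [abs(e - a) / a for e, a in zip(est, act) if a > 0]
    return sum(r > t for r in rel) / len(rel)

def median_percent_error(est, act):
    """MPE: the threshold t at which EF(t) = 50%, i.e. the median
    relative error."""
    rel = sorted(abs(e - a) / a for e, a in zip(est, act) if a > 0)
    n = len(rel)
    return rel[n // 2] if n % 2 else (rel[n // 2 - 1] + rel[n // 2]) / 2

act = [10.0, 20.0, 40.0, 80.0]
est = [11.0, 15.0, 42.0, 40.0]
# relative errors: 0.10, 0.25, 0.05, 0.50
print(error_fraction(est, act, 0.2))               # 0.5
print(round(median_percent_error(est, act), 3))    # 0.175
```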

Fig. 2. Isoform and gene error fractions. Error fractions at different threshold values for expression levels estimated for strains in synthetic hybrids vs. the corresponding separate strains.

The second level of testing is for the whole pipeline, starting from the synthetic hybrid reads and the haploid reference. In this case, a direct comparison of the ASE estimates from the hybrids against the corresponding separate-strain expression levels is not feasible. Accuracy will instead be determined by comparing which isoforms and/or genes are detected to have allelic imbalance in the hybrid vs. the corresponding separate strains. The allelic imbalance will be determined using Fisher's Exact test. These results are currently being generated.

Table 1. Pearson correlation coefficient for gene and isoform expression levels esti- mated for strains in synthetic hybrids vs. corresponding separate strains. IE: Isoform Expression; GE: Gene Expression

Hybrid C57BLxStrain   C57BL IE   Strain IE   C57BL GE   Strain GE
C57BLxBALBc           0.705      0.675       0.706      0.675
C57BLxAJ              0.855      0.902       0.856      0.903
C57BLxCAST            0.872      0.824       0.924      0.882
C57BLxSPRET           0.952      0.726       0.951      0.725

Acknowledgments. This project was supported in part by awards IIS-0546457 and IIS-0916948 from NSF, and Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture.



Table 2. MPE for isoform expression levels estimated for strains in synthetic hybrids vs. corresponding separate strains.

Hybrid C57BLxStrain   C57BL    Strain
C57BLxBALBc           0.3874   0.9075
C57BLxAJ              0.6281   0.4339
C57BLxCAST            0.2276   0.1840
C57BLxSPRET           0.1871   0.1753

References

1. J.F. Degner, J.C. Marioni, A.A. Pai, J.K. Pickrell, E. Nkadori, Y. Gilad and J.K. Pritchard. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics, 25(24):3207-3212, 2009.
2. J. Duitama, T. Huebsch, G. McEwen, E. Suk, and M.R. Hoehe. ReFHap: A reliable and fast algorithm for Single Individual Haplotyping. BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, 160-169, 2010.
3. J. Duitama, G.K. McEwen, T. Huebsch, S. Palczewski, S. Schulz, K. Verstrepen, E-K. Suk and M.R. Hoehe. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic Acids Research, to appear, 2012.
4. J. Duitama, P.K. Srivastava and I.I. Mandoiu. Towards accurate detection and genotyping of expressed variants from Whole Transcriptome Sequencing data. BMC Genomics, to appear, 2012.
5. C. Gregg, J. Zhang, J.E. Butler, D. Haig, and C. Dulac. Sex-specific parent-of-origin allelic expression in the mouse brain. Science 239:682-685, 2010.
6. G.A. Heap, J.H.M. Yang, K. Downes, B.C. Healy, et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Human Molecular Genetics, 19(1):122-134, 2010.
7. T.M. Keane, L. Goodstadt, P. Danecek, et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature, 477(7364):289-294, 2011.
8. C.J. McManus, J.D. Coolon, M.O. Duff, J. Eipper-Mains, B.R. Graveley, and P.J. Wittkopp. Regulatory divergence in Drosophila revealed by mRNA-seq. Genome Research, 20:816-825, 2010.
9. M. Nicolae, S. Mangul, I.I. Mandoiu, A. Zelikovsky. Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms for Molecular Biology, 6:9, 2011.


Monitoring of human body tissues at molecular level using FOTI Systems

G. S. Uthayakumar¹, A. Sivasubramanian²
Department of Electronics and Communication Engineering, St. Joseph's College of Engineering, Chennai-119, Tamil Nadu, India
[email protected], [email protected]

Abstract—There is an ongoing need for improvements in non-invasive techniques for the diagnosis and prognosis of gastric problems. Such technologies would allow for accurate results over huge numbers of patients. In India, there are two types of Medical Treatment Systems (MTS) using drugs, namely the English Medical Treatment System (EMTS) and the Ayurveda, Unani and Siddha Treatment System (AUSTS). Due to modern food habits, 70% of the population suffers from gastric problems. To find the characteristics and properties of the drugs given for the treatment of gastric problems by both types of physicians, the samples were examined using a Fourier transform infrared spectrometer. In this method the infrared spectrum originates from the vibrational motion of the molecules; this property is used for the characterization of biological compounds. Spectral analysis revealed differences between the two treatment systems, and these were monitored by an FOTI system for several reactions with metabolic components such as lipids, proteins, glucose, and carboxylate present in the human body. The spectral analysis indicated that the specific functional groups of the drug materials have almost the same chemical characteristics but different reactions in the human body.

Keywords—FOTI: Fiber Optic Transillumination Imaging; MTS: Medical Treatment System; AUSTS: Ayurveda, Unani and Siddha Treatment System; KBr: Potassium Bromide; FT: Fourier Transform.

1 Introduction

Plants have been used in traditional medicine for several thousand years. Medicinal plants as a group comprise approximately species and account for about 50% of all the higher flowering plant species in India. The knowledge of medicinal plants has been accumulated over the course of many centuries, based on different medicinal systems such as Ayurveda, Unani and Siddha. In a large number of countries, the human population depends on medicinal plants both for treating various illnesses and as a source of livelihood. The World Health Organization (WHO) estimates that 80% of the population of developing countries relies on traditional medicines, mostly plant drugs, for their primary health care needs. The objective of this research is to identify the various chemical groups present in five important medicinal drugs given for gastric problems by both types of physicians. The number of cases reported for gastric problems is steadily increasing in both industrialized and developing countries. Significant progress has been achieved in the identification of gastric problems at the molecular level using FTIR spectroscopy and other advanced technologies of


bio-optics. Spectroscopic investigations of pharmaceutical samples are of importance at present, and vibrational spectral studies of many pharmaceutical drugs have been carried out extensively by many scientists. So far, however, nobody has made an attempt to study the chemical characteristics of the brown and green samples.

In FTIR, the design of the optical pathway produces a pattern called an interferogram. The interferogram is a complex signal, but its wave-like pattern contains all the frequencies that make up the infrared spectrum. A mathematical operation known as the Fourier transform (FT) can separate the individual absorption frequencies from the interferogram, producing a spectrum virtually identical to that obtained with a dispersive spectrometer. The advantage of an FTIR instrument is that it acquires the interferogram in less than a second. It is thus possible to collect dozens of interferograms of the same sample and accumulate them in the memory of a computer. When a Fourier transform is performed on the sum of the accumulated interferograms, a spectrum with a better SNR can be plotted. An FTIR instrument is capable of greater speed and sensitivity than a dispersive instrument. There are three basic spectrometer components in an FTIR system: radiation source, interferometer, and detector. The same types of radiation sources are used for both dispersive and Fourier transform spectrometers, though the source is more often water-cooled in FTIR instruments to provide better power and stability. In contrast, a completely different approach is taken in an FTIR spectrometer to differentiate and measure the absorption at component frequencies: the monochromator is replaced by an interferometer, which divides radiant beams, generates an optical path difference between the beams, then recombines them in order to produce repetitive interference signals measured as a function of optical path difference by a detector.

1.1 Experimental Setup

As its name implies, the interferometer produces interference signals, which contain infrared spectral information generated after passing through a sample. The most commonly used interferometer is a Michelson interferometer. It consists of three active components: a moving mirror, a fixed mirror, and a beam splitter. The two mirrors are perpendicular to each other. The beam splitter is a semi-reflecting device and is often made by depositing a thin film of germanium onto a flat KBr substrate. Radiation from the broadband IR source is collimated and directed into the interferometer, and impinges on the beam splitter. At the beam splitter, half the IR beam is transmitted to the fixed mirror and the remaining half is reflected to the moving mirror. After the divided beams are reflected from the two mirrors, they are recombined at the beam splitter. Due to changes in the relative position of the moving mirror with respect to the fixed mirror, an interference pattern is generated [5]. The resulting beam then passes through the sample and is eventually focused on the detector. For an easier explanation, the detector response for a single-frequency component from the


IR source is first considered [8]. This simulates an idealized situation where the source is monochromatic, such as a laser source. As previously described, differences in the optical paths between the two split beams are created by varying the relative position of the moving mirror with respect to the fixed mirror. If the two arms of the interferometer are of equal length, the two split beams travel through exactly the same path length [9]. The two beams are totally in phase with each other; thus, they interfere constructively and lead to a maximum in the detector response. This position of the moving mirror is called the point of zero path difference (ZPD). When the moving mirror travels in either direction by the distance λ/4, the optical path (beam splitter-mirror-beam splitter) is changed by 2(λ/4), or λ/2. The two beams are 180° out of phase with each other, and thus interfere destructively [10]. As the moving mirror travels another λ/4, the optical path difference is now 2(λ/2), or λ. The two beams are again in phase with each other and result in another constructive interference [11].
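The path-difference arithmetic above can be checked numerically with a small sketch of ours (the intensity formula assumes an ideal lossless two-beam interferometer):

```python
import math

def interference_intensity(mirror_shift, wavelength):
    """Relative detector intensity for a monochromatic source when the
    moving mirror is displaced by mirror_shift from ZPD.  The optical
    path difference is twice the mirror shift (the beam travels to the
    mirror and back), giving a phase of 2*pi*OPD/wavelength."""
    opd = 2 * mirror_shift
    phase = 2 * math.pi * opd / wavelength
    return (1 + math.cos(phase)) / 2   # 1 = constructive, 0 = destructive

lam = 1.0
print(interference_intensity(0.0, lam))                # 1.0 (ZPD, in phase)
print(round(interference_intensity(lam / 4, lam), 6))  # 0.0 (OPD = lambda/2)
print(round(interference_intensity(lam / 2, lam), 6))  # 1.0 (OPD = lambda)
```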

When the mirror is moved at a constant velocity, the intensity of radiation reaching the detector varies in a sinusoidal manner to produce the interferogram output. The interferogram is the record of the interference signal. It is actually a time-domain spectrum and records the detector response changes versus time within the mirror scan [12]. If the sample happens to absorb at this frequency, the amplitude of the sinusoidal wave is reduced by an amount proportional to the amount of sample in the beam. Extension of the same process to three component frequencies results in a more complex interferogram, which is the summation of three individual modulated waves. In contrast to this simple, symmetric interferogram, the interferogram produced with a broadband IR source displays extensive interference patterns. It is a complex summation of superimposed sinusoidal waves, each wave corresponding to a single frequency. When this IR beam is directed through the sample, the amplitudes of a set of waves are reduced by absorption if the frequency of this set of waves is the same as one of the characteristic frequencies of the sample. The interferogram contains information over the entire IR region to which the detector is responsive. A mathematical operation known as the Fourier transform converts the interferogram (a time-domain spectrum displaying intensity versus time within the mirror scan) to the final IR spectrum, which is the familiar frequency-domain spectrum showing intensity versus frequency. The detector signal is sampled at small, precise intervals during the mirror scan. The sampling rate is controlled by an internal, independent reference: a modulated monochromatic beam from a helium-neon (HeNe) laser focused on a separate detector.
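The conversion from interferogram to spectrum can be illustrated with a toy discrete Fourier transform (a pure-Python sketch of ours; a real instrument applies an FFT to many accumulated scans):

```python
import math

def dft_magnitudes(signal):
    """Discrete Fourier transform magnitudes (naive O(n^2) version, for
    illustration).  Converts a time-domain interferogram into a
    frequency-domain spectrum."""
    n = len(signal)
    mags = []
    for k in range(n // 2):
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# Idealized interferogram: two superimposed cosines, the second one
# attenuated as if the sample absorbed at that frequency.
n = 64
interferogram = [math.cos(2 * math.pi * 5 * t / n) +
                 0.4 * math.cos(2 * math.pi * 12 * t / n) for t in range(n)]
spectrum = dft_magnitudes(interferogram)
peaks = sorted(range(len(spectrum)), key=spectrum.__getitem__, reverse=True)[:2]
print(sorted(peaks))  # [5, 12] -- the two component frequencies recovered
```

The weaker peak at bin 12 has 40% of the amplitude of the peak at bin 5, mirroring how sample absorption at one frequency reduces the corresponding wave in the interferogram.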

2 Materials and methods

For the five samples, the FTIR spectra were recorded using a Perkin Elmer Spectrum instrument over the region 4000 cm-1 to 400 cm-1 at the Sophisticated Analytical Instrumentation Facility (SAIF), IIT, Chennai, India. In FTIR, only those molecules can absorb IR radiation in which such absorption produces some change in the electric dipole of the molecule. A dipole moment is a quantitative measure of this charge distribution: the less symmetrical the distribution, the greater the numerical


magnitude of the moment. In the vibration of a molecule, the continual fluctuation of the charge distribution sets up an alternating electric field which interacts with the electrical vector of the impinging radiation, provided the molecule has a natural vibration which involves a periodic variation of the dipole moment, and provided the frequency of the impinging light is the same as this natural molecular frequency. The light will strongly excite those specific vibrations, lose energy in the process, and an absorption band will be observed. The molecule absorbs the incoming energy by changing its amplitude and its electrical dipole moment.

The intensity of absorption is proportional to the square of the rate of change of the dipole. It follows that only those vibrations which involve an oscillation of the dipole moment will record themselves in the infrared spectrum. For example, no change in the dipole moment is produced by the C=C double-bond stretching of a symmetrical molecule. Since there is no change in the dipole moment, the bond does not absorb infrared radiation. On the other hand, substitution of a bromine for a hydrogen atom to form bromoethylene destroys the symmetry around the double bond. When such a molecule is placed in an electric field, a force is exerted on the molecule which increases or decreases the separation between the opposite charges present in the molecule. Due to this change in spacing between the charged atoms, there occurs a change in the dipole of the molecule. Because of this periodic change in dipole, the charged atoms of the molecule absorb some IR radiation from the radiation source as they vibrate. If the rate of change in vibration is fast, the absorption of radiation is intense, resulting in an intense IR band. Since the resultant dipole moment for symmetrical diatomic molecules such as O2 and N2 is zero, these molecules will not give IR absorption spectra. Potassium bromide pellets were used to obtain the Fourier transform infrared spectra of solids such as the Fibril-SF, Festner, Blue and Brown samples. KBr is an inert, infrared-transparent material, and acts as a support and a diluent for the sample. There are two steps to follow for preparing successful KBr pellets. First, the sample and the KBr must be ground to reduce the particle size to less than 2 microns in diameter. Grinding is traditionally performed with an agate mortar and pestle, but a Wig-L-Bug may also be used. A gram or so of KBr should be placed in the mortar.
It should be ground until crystallites can no longer be seen and it becomes somewhat "pasty" and sticks to the sides of the mortar. The KBr and the sample should be ground separately to avoid possible chemical interactions; the heat and pressure generated in the mortar may cause the KBr to react with the sample, in which case the spectrum obtained may be that of the product of this reaction rather than of the original sample [1-5]. After grinding the sample and KBr, the sample is diluted to about 1% in the ground KBr. About 1 to 10 mg of sample can be used. The amounts of material to use can be "eyeballed" with some success, but actually weighing out the amounts will give more reproducible results. It is important that the sample be well dispersed in the KBr; mixing the sample and KBr in the mortar for a minute using a spatula is usually sufficient to ensure good dispersion. The sample/KBr mixture is then placed in a die press and squeezed to produce a transparent pellet; several tons of pressure may be necessary. If the pellet is inside a die, the pellet/die combination is placed directly in the FTIR beam. KBr is a hygroscopic material, which means it will absorb water directly from the atmosphere. It is critical that the KBr used in making pellets be kept warm and dry, preferably in an oven at >100°C. Cloudy regions in a pellet indicate that the pellet has absorbed water, which will give bands around 3900 and 1630 cm-1 [6].

Table 1 shows the frequencies and band assignments of the samples taken for the experiment; the similarities of some chemical compounds are highlighted in blue. From Table 2, the bands exhibited in the region around 3400-3390 cm-1 can be assigned to N-H stretching for all five samples, and the vibrational frequencies exhibited from 1600-1655 cm-1 are considered to be due to C-H stretching of the compound. The relative intensity in both regions is strong. In the region 1000-1200 cm-1, the vibrational frequencies are assigned to C-O-C stretching due to polysaccharide functional groups. N-H wagging appears for all samples at 750 to 775 cm-1. The N-H stretching absorption is less sensitive to hydrogen bonding than O-H absorptions. In the gas phase and in dilute CCl4 solution, free N-H absorption is observed in the region from 3400 to 3500 cm-1. C-N stretching absorptions are found at 1200 to 1350 cm-1 for aromatic amines and at 1000-1200 cm-1 for aliphatic amines.

3 Calculation

It is possible to calculate the value of the stretching vibrational frequency of a bond by use of Hooke's law, which may be represented as

$$\bar{\nu} = \frac{\nu}{c} = \frac{1}{2\pi c}\sqrt{\frac{k}{\mu}}, \qquad \text{where } \mu = \frac{m_1 m_2}{m_1 + m_2}.$$

Here $\mu$ is the reduced mass, m1 and m2 are the masses of the atoms forming the bond, and k is the force constant, which is regarded as a measure of the stiffness of the bond. For a single bond it is approximately 5 x 10^5 dynes/cm; its value roughly doubles and triples for a double and a triple bond, respectively. If we consider a diatomic molecule having a resultant dipole moment, the vibratory motion of its nuclei may be treated as that of a linear harmonic oscillator. If the bond between the two nuclei of the diatomic molecule is distorted from its equilibrium length L0 to a new length L, then the restoring forces on the two atoms of the molecule are given by

$$m_1 \frac{d^2 L_1}{dt^2} = -k(L - L_0) \qquad (1)$$

$$m_2 \frac{d^2 L_2}{dt^2} = -k(L - L_0) \qquad (2)$$

where L1 and L2 are the positions of atoms 1 and 2 relative to the center of gravity of the molecule. Hence,

$$L_1 = \frac{m_2}{m_1 + m_2}\, L \qquad (3)$$

$$L_2 = \frac{m_1}{m_1 + m_2}\, L \qquad (4)$$

From equations (1) and (3) we get

$$\frac{m_1 m_2}{m_1 + m_2} \cdot \frac{d^2 L}{dt^2} = -k(L - L_0) \qquad (5)$$

Since $L_0$ is constant, $dL_0/dt = 0$, and hence

$$\frac{m_1 m_2}{m_1 + m_2} \cdot \frac{d^2 (L - L_0)}{dt^2} = -k(L - L_0) \qquad (6)$$

Substituting $L - L_0 = x$ and $\frac{m_1 m_2}{m_1 + m_2} = \mu$ in equation (6) gives

$$\mu \frac{d^2 x}{dt^2} = -kx \quad \text{or} \quad \frac{d^2 x}{dt^2} + \frac{k}{\mu}\, x = 0.$$

With $\omega = \sqrt{k/\mu}$, this becomes

$$\frac{d^2 x}{dt^2} + \omega^2 x = 0,$$

the equation of a simple harmonic oscillator.
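The Hooke's-law estimate above is easy to evaluate numerically. The sketch below (illustrative only, in CGS units, using the rule-of-thumb force constants quoted in the text: single bond ≈ 5 x 10^5 dynes/cm, doubled for a double bond) estimates the C=O stretching wavenumber.

```python
import math

C = 2.99792458e10        # speed of light, cm/s
AMU = 1.6605390e-24      # atomic mass unit, g

def stretch_wavenumber(m1_amu, m2_amu, k_dyn_per_cm):
    """Hooke's-law estimate of a stretching frequency in cm^-1:
    nu_bar = (1 / (2*pi*c)) * sqrt(k / mu), with mu = m1*m2 / (m1 + m2)."""
    m1, m2 = m1_amu * AMU, m2_amu * AMU
    mu = m1 * m2 / (m1 + m2)                 # reduced mass, g
    return math.sqrt(k_dyn_per_cm / mu) / (2 * math.pi * C)

# C=O stretch with the rule of thumb above: double-bond k ~ 2 x 5e5 dyn/cm.
print(round(stretch_wavenumber(12.0, 16.0, 1.0e6)))   # about 1573 cm^-1
```

The result is of the right order of magnitude for a carbonyl stretch; the observed band near 1728 cm-1 corresponds to a somewhat stiffer force constant than the rule of thumb assumes.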


Table 1. Vibrational band assignment for the Orange sample

WN* (cm-1) | Band assignment | Intensity*
3821 | - | -
3434 | O-H (H-bonded), usually broad | S
2037 | C=C asymmetric stretch | S
1728 | C=O (saturated aldehyde) | S
1680, 1674, 1669, 1642 | C=C (symmetry reduces intensity) | V
1396, 1391, 1336 | O-H bending (in-plane) | M
1297, 1286, 1217 | O-C (3 peaks) | M-S
1178, 1149, 1119 | C-N | M
973 | C-O | S
911, 901 | =C-H & =CH2 | S
856 | C-H bending & ring puckering | M-S
781 | Out-of-plane bending | M
773, 755, 746, 730, 705 | C-H bending & ring puckering | M-S
676, 658, 631, 610 | C-H deformation | S
545, 491 | S-S disulfide | W

Table 2. Vibrational band assignment for the Brown sample

WN (cm-1) | Band assignment | Intensity
3416, 3404 | N-H (1°-amines), 2 bands | W
2985, 2918, 2850 | CH3, CH2 & CH, 3 bands | S
2139 | C≡C (symmetry reduces intensity) | V
1698, 1681 | C=O (amide I band) | S
1616, 1609 | N-H (1°-amide) II band | M
1451, 1356 | CH2 & CH3 deformation | M
1207, 1177, 1159, 1105, 1079, 1027 | C-N | M
895 | =C-H & =CH2 | S
872, 854, 814, 796, 764, 743, 730, 712 | C-H bending & ring puckering | S-M
685, 676 | NH2 & N-H wagging (shifts on H-bonding) | V
616, 600 | C-H deformation | S
545, 492 | S-S disulfide | W


Table 3. Vibrational band assignment for the Blue sample

WN (cm-1) | Band assignment | Intensity
3418, 3403 | N-H stretch, free (2°-amine) | S
2924, 2854 | CH3, CH2 & CH, 2 or 3 bands | S
2138 | -N=C=O, -N=C=S, -N=C=N-, -N3 | M
1724 | C=O (saturated aldehyde) | S
1669 | C=O (amide I band) | S
1617 | NH2 scissoring (1°-amines) | M-S
1541, 1534, 1521, 1523, 1503 | N-H (2°-amide) II band | M
1451 | CH2 & CH3 deformation | M
1339 | O-H bending (in-plane) | M
1212, 1166, 1106, 1058, 1041, 1034 | C-N stretching | M
926 | =C-H & =CH2 | S
825, 814, 796 | Out-of-plane bending | M
762, 742, 706, 694 | NH2 & N-H wagging (shifts on H-bonding) | V
632 | C-H deformation | M
493 | S-S disulfide | W

Table 4. Vibrational band assignment for the Yellow sample

WN (cm-1) | Band assignment | Intensity
3472 | N-H (1°-amines), 2 bands | W
3010, 2942, 2854 | CH3, CH2 & CH, 2 or 3 bands | S
2681 | O-H (very broad) | S
2347 | P-H phosphine | M & sharp
1746 | C=O | S
1655 | C=O (amide I band) | S
1462, 1418, 1377 | CH2 & CH3 deformation | M
1235, 1163, 1119, 1098 | C-N | M
968, 915 | =C-H & =CH2 | S
722 | CH2 rocking | W


Table 5. Vibrational band assignment for the Green sample

WN (cm-1) | Band assignment | Intensity
3391 | N-H (2°-amines) | W
2932 | CH3, CH2 & CH, 2 or 3 bands | S
2140, 1654 | C≡C (symmetry reduces intensity) | V
1458 | CH2 & CH3 deformation | M
1089 | C-N | M
1047 | C-N | M
861, 849, 833, 818, 802, 773, 752, 740, 697 | C-H bending & ring puckering | S-M
655, 603, 593 | C-H deformation | S

*W: weak; S: strong; M: medium; S-M / M-S: strong to medium; V: variable. WN: wave number.

4 Results and Discussions

The IR spectrum of a compound is the superposition of the absorption bands of specific functional groups. As such, the IR spectrum can be used as a fingerprint for identification of unknowns by comparison with previously recorded reference spectra. By observing the position, shape and relative intensities of the vibrational bands in the FTIR spectra of the drugs, a satisfactory vibrational band assignment has been made. The FTIR spectrum of the orange drug sample is presented in Fig. 1. The vibrational band assignments of the drugs are summarized in Table 1 and discussed as follows. The bands exhibited in the region around 3000 cm-1 can be immediately assigned to aromatic C-H stretching [15]. In this view, the vibrational frequency exhibited at 3434 cm-1 in the FTIR spectrum is considered to be due to C-H stretching vibrations of the compound. The C-C ring stretching vibrations occur in the region 1642-1579 cm-1 in the FTIR spectra.


Fig.1 Spectrum for brown sample

Fig.2 Spectrum for Green sample


Fig.3 Spectrum for Yellow sample

Fig.4 Spectrum for Blue sample


Fig.5 Spectrum for Orange sample

Fig.6 Spectrum for all samples


As solids or liquids, primary aliphatic amines absorb in the region 3434-3450 cm-1 and exhibit a broad band of medium intensity. In dilute solution in non-polar solvents, two bands are observed for primary amines due to N-H asymmetric and symmetric vibrations in the range 3550-3250 cm-1. The relative intensity of the band due to the hydroxyl stretching decreases with increasing concentration, with additional broader bands appearing at lower frequencies, 3580-3200 cm-1. In aminobenzoic acid the hydroxyl stretching occurs at 3434 cm-1 in the FTIR spectra [16]. From the above band assignments, the sample under experiment shows a very strong band at 3434 cm-1 in the FTIR spectrum due to N-H and O-H stretching. The carbonyl group exhibits a strong absorption band due to the C=O stretching vibration at 1728 cm-1. In fluorouracil, the band observed at 1642 cm-1 in FTIR is assigned to the C=O asymmetric stretching vibration [17]. A strong band at 1728 cm-1 is assigned to C=O carbonyl stretching of nalidixic acid [18]. The band at 1642 cm-1, which is close to the literature range, is assigned to C=O stretching [19] in benzocaine. Keeping this in mind, the sharp band present in the expected region at 1728 cm-1 in the FTIR spectrum is attributed to the C=O stretching vibration. The C-N stretching absorption of primary aliphatic amines is weak and occurs in the region 1119-973 cm-1. Secondary aliphatic amines have bands of medium intensity at 1178-1140 cm-1. The band in the region of 1217 cm-1 in FTIR has been assigned to C-N symmetric stretching of the compound fluorouracil [17]. By this analogy, the band at 1217 cm-1 in the FTIR spectrum of the drug sample is assigned to C-N vibrations. The bands due to C-O stretching vibrations are strong and occur in the region 1217-1119 cm-1.

In aminobenzoic acid, the strong band occurring near 1119 cm-1 in the FTIR spectra is assigned to the C-O stretching vibration [16]. In benzocaine, the very strong and sharp peak at 1217 cm-1 has been assigned to C-O stretching. Following the above band assignments, the bands at 1119 cm-1 and 973 cm-1 in the FTIR spectrum of the sample under experiment are assigned to C-O vibrations. A number of C-H in-plane deformation bands occur in the region 973-901 cm-1, the bands being sharp but of weak to medium intensity. These bands are not normally of importance for interpretation purposes, although they can be used. The aromatic C-H out-of-plane deformation bands occur below 700 cm-1; bending vibrations are generally found at lower wave numbers. The frequencies observed at 775 cm-1, 705 cm-1, 676 cm-1 and 592-491 cm-1 are assigned to O=C-C, O=C-N, C=C-N and C=C-C bending of the pyrimidine ring in the FTIR spectra of xanthine, and C-N-C bending vibrations are assigned at 498 cm-1 and 428 cm-1 [19,20]. Using the above analogy, the bands at 901-775 cm-1 are due to C-H in-plane deformation, and the band at 676 cm-1 is due to C-H out-of-plane deformation / C-C=O deformation.


5 Conclusions

The orange sample is an antibiotic for gastric trouble given by physicians in the English medical treatment system, and the yellow sample is given for liver function problems. The remaining three drugs, the green, blue and brown samples, are given for gastric problems by the Ayurveda, Siddha and Unani treatment systems. An attempt has been made in this work to study the vibrations of the functional derivatives of all five samples. By observing the position, shape and relative intensities of the vibrational bands in the FTIR spectra of all the drugs, a satisfactory vibrational band assignment has been made. FTIR analysis was conducted to compare the chemical bonds of the English medical drug with those of the Ayurveda, Siddha and Unani drugs. The spectral analysis indicated that the specific functional groups of both types of drugs have almost the same chemical characteristics, and the studies suggest that no molecular interaction occurred that could alter the chemical structure of the drugs. From these results, it is concluded that the drugs of both types of medical treatment systems existing in India have the same chemical bonds and characteristics but different reactions at the molecular level in the human body.

Acknowledgment

I would like to acknowledge Mr. Shankar, Technical Assistant at the Sophisticated Analytical Instrument Facility (SAIF), IIT, Chennai, for providing the FTIR equipment for the complete analysis of both types of samples.

References

1. A. Bright, T.S. Renuga Devi and S. Gunasekaran, Spectroscopical vibrational band assignment and qualitative analysis of biomedical compounds with cardiovascular activity, International Journal of ChemTech Research, Vol. 2, No. 1, pp. 379-388, Jan-Mar 2010.
2. Munusamy Chamundeeswari, S.S. Liji Sobhana et al., Preparation, characterization and evaluation of a biopolymeric gold nanocomposite with antimicrobial activity, Biotechnol. Appl. Biochem. 55 (2010), 29-35.
3. M. Chamundeeswari, V. Senthil, M. Kanagavel, S.M. Chandramohan et al., Preparation and characterization of nanobiocomposites containing iron nanoparticles prepared from blood and coated with chitosan and gelatin, Materials Research Bulletin 46 (2011), 901-904.
4. R. Davis and L.J. Mauer, Fourier transform infrared spectroscopy: A rapid tool for detection and analysis of foodborne pathogenic bacteria, FORMATEX 2010.
5. Cecile Caubet, Michel Simon et al., A new amyloidosis caused by fibrillar aggregates of mutated corneodesmosin, The FASEB Journal, Vol. 24, Sep 2010.
6. Rui Chen, Chen Huang, Xiumei Mo et al., Preparation and characterization of coaxial electrospun thermoplastic polyurethane/collagen compound nanofibers for tissue engineering applications, Colloids and Surfaces B: Biointerfaces 79 (2010), 315-325.


7. Thomas D. Wang, George Triadafilopoulos et al., Detection of endogenous biomolecules in Barrett's esophagus by FTIR, Vol. 104, No. 40, pp. 15864-15869.
8. J.N. Miller and J.C. Miller, Statistics and Chemometrics for Analytical Chemistry, 4th Edition, Prentice Hall, 2000.
9. Monograph Committee, Malaysian Herbal Monograph, Volume 2, Malaysian Monograph Committee, National Pharmaceutical Control Bureau, Ministry of Health Malaysia, 2001.
10. O.S. Chew, M.R. Hamdan, Z. Ismail and M.N. Ahmad, 19th Annual Seminar & Workshop of the Malaysia Natural Products Society, Faculty of Science, University of Malaya, Malaysia, 13-16 October 2003.
11. E. David-Vaudey, A. Burghardt, K. Keshari, A. Brouchet, M. Ries and S. Majumdar, FTIRI of human osteoarthritic cartilage, European Cells and Materials, Vol. 10, 2005, pp. 51-60.
12. Natalia Irishina, Miguel Moscoso and Oliver Dorn, Microwave imaging for early breast cancer detection using a shape-based strategy, IEEE Trans. Biomed. Eng., Vol. 56, No. 4, April 2009.
13. Natalia Irishina, Miguel Moscoso and Oliver Dorn, Microwave imaging for early breast cancer detection using a shape-based strategy, IEEE Trans. Biomed. Eng., Vol. 56, No. 4, April 2009.
14. Gunasekaran S. and Abitha P., Indian J. Pure and Applied Physics, 43, 329 (2005).
15. Gunasekaran S., Ponnambalam U., Muthu S. and Kumaresan S., Indian J. Physics, 78(10), 1141 (2004).
16. Gunasekaran S. and Radithika R., Indian J. Physics, 41, 503 (2005).
17. Renuga Devi T.S., Spectroscopic analysis of lipid disorders and study of the quality and efficacy of statins and fibrates, Ph.D. thesis, University of Madras, Feb. 2007.
18. Gunasekaran S. and Sankari G., Spectrochim. Acta Part A, 6, 117 (2005).
19. Dhanikula Anand and P. Ramesh, Fluorescence anisotropy, FTIR spectra and 31-P NMR studies on the interaction of paclitaxel with lipid bilayers, Lipids, Vol. 43, June 2008, 569-579.
20. Silverstein R.M., Bassler G.C. and Morrill T., Spectrometric Identification of Organic Compounds, 4th Edition, New York, John Wiley (1981).

G.S. Uthayakumar received the B.E. in ECE from Madras University, Chennai in 1990, the M.E. degree in Medical Electronics from the College of Engineering, Anna University, Chennai, and the M.B.A. degree from IGNOU. He is currently working with St. Joseph's College of Engineering in the Department of Electronics and Communication Engineering, affiliated to Anna University, Chennai, India. He has over 21 years of experience in industry and various engineering colleges, and has attended many workshops in the area of biomedical electronics. He has taught Medical Electronics, Optical Communication and related subjects for a number of semesters. His research interests include biomedical optical engineering and bio-optical spectroscopy.

A. Sivasubramanian received the B.E. in ECE from the University of Madras in 1990, the M.E. in Applied Electronics from Bharathiar University in 1995, and the Ph.D. degree in Optical Communication from Anna University, Chennai in 2008. He is currently working as Professor & Head of the Department of Electronics and Communication Engineering at St. Joseph's College of Engineering, Chennai, India. He has 20 years of experience in teaching and guiding projects for undergraduate and postgraduate students, and has ten international and national publications to his credit. He is a recognized supervisor for the doctoral degree programme at Anna University, Chennai and Sathyabama University, Chennai. His areas of interest include optical communication, optical networks, bio-optical engineering, and wireless sensor and computer networks. He is a member of ISTE, IETE, IEEE, and OSA.


Multi-Commodity Flow Methods for Quasispecies Spectrum Reconstruction Given Amplicon Reads

Nicholas Mancuso1 ∗ †, Bassam Tork1 ∗ †, Pavel Skums2, Ion Măndoiu3 ∗, and Alex Zelikovsky1 ∗

1 Department of Computer Science Georgia State University Atlanta, Georgia 30302-3994 email: {nmancuso, btork, alexz}@cs.gsu.edu 2 Centers for Disease Control and Prevention 1600 Clifton Road NE Atlanta, Georgia 30322 email: [email protected] 3 Department of Computer Science & Engineering University of Connecticut Storrs, CT 06269 email: [email protected]

Keywords: Next-generation sequencing. Viral quasispecies. Network flows.

RNA viruses depend on error-prone reverse transcriptase for replication within an infected host. These errors lead to a high mutation rate which creates a diverse population of closely related variants [1]. This viral population is known as a quasispecies. As breakthroughs in next-generation sequencing have allowed researchers to apply sequencing to new areas, studying the genomes of viral quasispecies is now realizable. By understanding the quasispecies, more effective drugs and vaccines can be manufactured, and cost-saving metrics for infected patients can be implemented [2]. Given a collection of (shotgun or amplicon) next-generation sequencing reads generated from a viral sample, the quasispecies reconstruction problem is defined as: reconstruct the quasispecies spectrum, i.e., the set of sequences and respective frequencies of the sample population. Reconstructing the quasispecies spectrum is difficult for several reasons. The actual number of variants may be obfuscated by conserved regions in the genome that extend beyond the maximum read length. Additionally, the number of possible assignments of reads to variants in overlapping segments grows quickly. Furthermore, we are required to rank the variants by frequency. Previous approaches have utilized min-cost flows, probabilistic methods, shortest paths, and population diversity for the quasispecies spectrum assembly problem [3-6].

∗ This work has been partially supported by NSF award IIS-0916401, NSF award IIS-0916948, Agriculture and Food Research Initiative Competitive Grant no. 201167016-30331 from the USDA National Institute of Food and Agriculture † This work has been partially supported by Georgia State University Molecular Basis of Disease research fellowship.


This work extends the maximum bandwidth method of [7] by including an exact multi-commodity flow method using integer linear programming (ILP). Despite ILP being NP-hard in general, read graphs built from viral amplicon data tend to be small enough to solve quickly. An amplicon A is a multiset of reads such that each read r ∈ A has the same predefined starting and ending positions in the genome (denoted startA and endA).

Two amplicons A1 and A2 are said to overlap if and only if startA2 < endA1. A set of amplicons A = {A1, ..., Am} is said to be overlapping if and only if Ai and Ai+1 overlap for i = 1 ... m-1. Given an overlapping set A = {A1, ..., Am}, we define a partial order on reads of overlapping amplicons: r ≺ r′ for r ∈ Ai, r′ ∈ Ai+1 if and only if the suffix of r starting at startAi+1 is the same sequence as the prefix of r′ ending at endAi.

Given an overlapping set A = {A1, ..., Am}, an m-staged directed read-graph is defined as G = (V = V1 ∪ ... ∪ Vm ∪ {s, t}, E, c), where each v ∈ Vi, 1 ≤ i ≤ m, corresponds to a distinct read in amplicon Ai. An edge (u, v) ∈ E if and only if readu ≺ readv for u, v ∉ {s, t}, or u = s and v ∈ V1, or v = t and u ∈ Vm. Additionally, c : V → N is the count of the read represented by v ∈ Vi in amplicon Ai.
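The edge rule between consecutive amplicon stages can be sketched as follows (the sequences, positions, and counts below are hypothetical): two reads are joined exactly when they agree on the overlap region shared by the two amplicons.

```python
# Minimal sketch (hypothetical data) of the staged read-graph construction:
# read u in amplicon i and read v in amplicon i+1 are joined by an edge when
# the suffix of u inside the overlap equals the prefix of v inside the overlap.

# Each amplicon: (start, end, {read sequence: count}), 0-based positions.
amplicons = [
    (0, 6, {"ACGTAC": 4, "ACGTTT": 2}),
    (4, 10, {"ACGGCA": 3, "TTGGCA": 2}),
]

def build_edges(amplicons):
    edges = []
    for i in range(len(amplicons) - 1):
        s1, e1, reads1 = amplicons[i]
        s2, e2, reads2 = amplicons[i + 1]
        for u in reads1:
            for v in reads2:
                # overlap region is [s2, e1): compare u's suffix with v's prefix
                if u[s2 - s1:] == v[:e1 - s2]:
                    edges.append(((i, u), (i + 1, v)))
    return edges

print(build_edges(amplicons))
```

In this toy instance the overlap is two bases wide, so "ACGTAC" (ending "AC") links to "ACGGCA" and "ACGTTT" (ending "TT") links to "TTGGCA"; the source s and sink t of the definition would then be attached to the first and last stages.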

Lemma 1. Each consistent overlap in amplicons Ai, Ai+1 corresponds to a unique bipartite clique in G.

Proof. Suppose the contrary: let v, v′ ∈ Ai and u, u′ ∈ Ai+1 with v ≺ u, v′ ≺ u′ and v′ ≺ u, but with v and u′ not comparable. Since v′ precedes both u and u′, the prefixes of u and u′ inside the overlap coincide; but since v and u are comparable while v and u′ are not, those prefixes must differ. This is a contradiction. □

Using this simple fact, we construct a new "forked" read-graph: an m-staged directed read-graph can be represented by a (2m-1)-staged "forked" read-graph. Given an i × j bipartite clique Ki,j in G, create an (i + j)-leaf star graph Si+j with a new "fork" vertex as the internal node. Repeating this for all bipartite cliques over Vk, Vk+1 produces a new "fork" stage Fk; repeating again for all neighboring stages yields m - 1 new fork stages. Lastly, we denote by c : E → N the count function for edges. This reduces the number of edges, at the cost of additional vertices, when the graph has sufficiently dense bipartite cliques. Given an edge (f, u) or (u, f), where f is a fork and u is a read vertex, let c(f, u) = c(u, f) = c(u); this is useful for the flow formulations. Figure 1 illustrates the transformation. Given a forked read-graph, the quasispecies reconstruction problem may be restated as a network flow problem. In a k-multi-commodity flow problem, we are given k (si, ti) pairs and either minimize or maximize the total flow f = Σ_{i=1..k} f^i subject to capacity and demand constraints. For the quasispecies reconstruction problem we wish to minimize the total of the k flows such that each read is fully covered. Additionally, we force each flow to be unsplittable, i.e., each flow is a simple s-t path in the read-graph, where si = s, ti = t, 1 ≤ i ≤ k.

Fig. 1: Creating a "forked" read-graph from the original directed read-graph.

The resulting ILP is:

$$\text{Min: } \sum_{i=1}^{k} \sum_{(s,u) \in E} f^i_{s,u}$$

subject to

$$\sum_{i=1}^{k} g^i_{u,v} \ge c_{u,v} \quad \forall (u,v) \in E$$

$$\sum_{u \in \mathrm{pred}(v)} g^i_{u,v} = \sum_{u \in \mathrm{succ}(v)} g^i_{v,u} \quad \forall v \in V,\ i = 1 \dots k$$

$$\sum_{u \in \mathrm{succ}(v)} f^i_{v,u} = 1 \quad \forall v \in V,\ i = 1 \dots k$$

$$f^i_{u,v} \ge g^i_{u,v} \quad \forall (u,v) \in E,\ i = 1 \dots k$$

$$f^i_{u,v} \in \{0,1\}, \quad g^i_{u,v} \in [0,1] \quad \forall (u,v) \in E,\ i = 1 \dots k$$

The method was run on data simulated from the E1E2 region of 44 HCV strains. Variants for each data set were produced from a uniform, geometric, or skewed distribution. Cross-validation was done by using the Jensen-Shannon divergence (JSD) to measure the quality of the frequency assignment. JSD is defined as

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)$$

where

$$D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} P(i) \log \frac{P(i)}{Q(i)}$$

and M = (P + Q)/2. We also evaluate the quality of the assembled quasispecies using sensitivity and positive predictive value (PPV). Sensitivity measures the correctly assembled quasispecies out of the true population, while PPV measures the correctly assembled quasispecies out of the assembled population. They are defined as

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{PPV} = \frac{TP}{TP + FP}.$$
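The evaluation measures above are easy to state in code. The sketch below is illustrative (not the authors' implementation): it computes the Jensen-Shannon divergence via the Kullback-Leibler divergence, plus sensitivity and PPV from TP/FP/FN counts.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(P||Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrized KL against M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sensitivity(tp, fn):
    """Fraction of the true population that was correctly assembled."""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Fraction of the assembled population that is correct."""
    return tp / (tp + fp)

p = [0.5, 0.3, 0.2]        # hypothetical true frequencies
q = [0.4, 0.4, 0.2]        # hypothetical estimated frequencies
print(jsd(p, q))           # small positive value; 0 only when P == Q
print(sensitivity(8, 2))   # 0.8
print(ppv(8, 4))           # about 0.667
```

Unlike plain KL, JSD is symmetric and finite even when an estimated frequency is zero, which is why it is preferred for comparing frequency assignments.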


Fig. 2: Sensitivity of Multi-Commodity Flow and Maximum Bandwidth

Fig. 3: PPV of Multi-Commodity Flow and Maximum Bandwidth

Under all three measures, the multi-commodity flow formulation performed competitively with Maximum Bandwidth. The flow formulation produced fewer variants than Maximum Bandwidth for all three quasispecies distributions. This led to higher PPV (Fig. 3) but slightly lower sensitivity (Fig. 2). The divergence is slightly higher than Maximum Bandwidth's due to skewing of the frequencies caused by the lower sensitivity (Fig. 4). While the current model performs quite well overall, we expect that further improvements to the flow model will lead to more accurate assemblies.

References

1. Duarte EA, Novella IS, Weaver SC, Domingo E, Wain-Hobson S, Clarke DK, Moya A, Elena SF, de la Torre JC, Holland JJ: RNA Virus Quasispecies: Significance for Viral Disease and Epidemiology. Infectious Agents and Disease 3(4) (1994) 201-214
2. Skums Pavel, Dimitrova Zoya, Campo David S., Vaughan Gilberto, Rossi Livia, Forbi Joseph C, Yokosawa Jonny, Zelikovsky Alex, Khudyakov Yury: Efficient error correction for next-generation sequencing of viral amplicons. In: International Symposium on Bioinformatics Research and Applications. (2011)


Fig. 4: Jensen-Shannon Divergence of Multi-Commodity Flow and Maximum Bandwidth

3. Westbrooks K, Astrovskaya I, Campo D, Khudyakov Y, Berman P, Zelikovsky A: HCV Quasispecies Assembly using Network Flows. In: Proc. International Symposium on Bioinformatics Research and Applications. (2008) 159-170
4. Zagordi O, Klein R, Daumer M, Beerenwinkel N: Error Correction of Next-Generation Sequencing Data and Reliable Estimation of HIV Quasispecies. Nucleic Acids Research 38(21) (2010) 7400-7409
5. Astrovskaya I, Tork B, Mangul S, Westbrooks K, Măndoiu I, Balfe P, Zelikovsky A: Inferring Viral Quasispecies Spectra from 454 Pyrosequencing Reads. BMC Bioinformatics 12 (2011)
6. Prosperi MC, Prosperi L, Bruselles A, Abbate I, Rozera G, Vincenti D, Solmone MC, Capobianchi MR, Ulivi G: Combinatorial Analysis and Algorithms for Quasispecies Reconstruction using Next-Generation Sequencing. BMC Bioinformatics 12 (2011)
7. N. Mancuso, B. Tork, P. Skums, I. Mandoiu and A. Zelikovsky: Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing Reads. In: Proc. 1st Workshop on Computational Advances in Molecular Epidemiology. (November 12, 2011) 94-101


Quasispecies frequency reconstruction using multicommodity flows

Pavel Skums1, Alexander Artyomenko2, Alex Zelikovsky3 and Yury Khudyakov1

1 Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30329 Atlanta, GA, USA 2 Mechanics and Mathematics Department, Belarus State University, Nezavisimosti av., 4, 220030, Minsk, Belarus 3 Department of Computer Science, Georgia State University, 34 Peachtree str., 30303, Atlanta, GA, USA

RNA viruses, such as HIV and HCV, exist in infected hosts as a population of genetically close variants known as quasispecies [1]. Intra-host viral genetic heterogeneity is associated with immune escape and drug resistance. Understanding this association is important for developing vaccines against viral infections as well as for the therapeutic treatment of patients. Next-generation sequencing allows for analyzing a large number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. The problem of reconstructing the consensus full-length genome from sequencing reads is well studied, and many algorithms for its solution have been developed. However, the problem of reconstructing individual intra-host variants of the full-length viral genome remains underappreciated. It was recently shown that intra-host HCV variants have differential sensitivity to therapeutic treatment with interferon [3]. Analysis of HCV quasispecies frequency dynamics provides an opportunity for the detection of HCV drug resistance before the initiation of therapy. The quality of the detection algorithm highly depends on the accuracy of the quasispecies frequency estimation [3]. The problem of reconstructing quasispecies and their frequencies is called the quasispecies spectrum reconstruction problem. This work extends the algorithm presented in [4]. The method proposed in [4] consists of two stages: generation of candidate quasispecies sequences from reads and estimation of their frequencies using the Expectation Maximization (EM) algorithm. We present a method for quasispecies frequency estimation based on multicommodity flows (MCF), which significantly outperforms the EM-based algorithm. The new method, when combined with the candidate sequence generation algorithm from [4], presents a novel framework for the reliable reconstruction of the quasispecies spectrum.
The input is a set of reads R with frequencies (fv : v ∈ R) and a set of candidate sequences Q = {q1, ..., qn}. A directed read graph G = (V, E) is constructed as follows:

1) vertices of G correspond to reads from R aligned with the reference sequence; the consensus of the candidate sequences can be used as a reference;


2) the directed edge (u, v) belongs to E if and only if a suffix of u overlaps with a prefix of v and they agree inside the overlap;

3) for each candidate sequence qi ∈ Q, add a source si and a sink ti, and add an edge (si, v) (respectively, (v, ti)) for each vertex v ∈ R such that v coincides with a prefix (suffix) of qi.

Let $p_v^i$ be the probability that read v was obtained from the candidate qi. As in [4], we estimate this probability as $p_v^i = \binom{l}{j} \epsilon^j (1-\epsilon)^{l-j}$, where l is the length of the read v, j is the number of mismatches between v and qi, and $\epsilon$ is the genotyping error rate. The quasispecies frequency estimation problem can then be formulated as the multicommodity flow problem (MCF):

\[
\begin{aligned}
&\max \sum_{v \in V}\sum_{i=1}^{n} p_v^i\, x_v^i && \text{(1)}\\
&\sum_{i=1}^{n} x_v^i \le f_v, \quad v \in V && \text{(2)}\\
&\sum_{(u,v)\in E} x^i_{(u,v)} = \sum_{(v,w)\in E} x^i_{(v,w)}, \quad v \in V \setminus \{s_i, t_i\} && \text{(3)}\\
&x_v^i = \sum_{(u,v)\in E} x^i_{(u,v)}, \quad v \in V && \text{(4)}\\
&x^i_{(u,v)} \ge 0, \quad (u,v) \in E && \text{(5)}
\end{aligned}
\]
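As an illustration, the objective (1) together with the capacity constraints (2) and nonnegativity constraints (5) can be handed to an off-the-shelf LP solver. The sketch below uses SciPy's `linprog` on a made-up toy instance; the flow-conservation constraints (3)-(4) are omitted for brevity (they require the full read-graph structure), so this is a simplified illustration of the LP setup, not the authors' implementation. The read lengths, mismatch counts, error rate and frequencies are all assumed values.

```python
from math import comb

import numpy as np
from scipy.optimize import linprog

def read_prob(l, j, eps):
    """Binomial estimate p_v^i = C(l, j) * eps^j * (1 - eps)^(l - j)."""
    return comb(l, j) * eps**j * (1 - eps) ** (l - j)

# Toy instance: 3 read vertices, 2 candidate sequences (made-up numbers).
# p[v][i] is the probability that read v originated from candidate q_i,
# here for 100 bp reads with the listed mismatch counts and eps = 0.01.
p = np.array([[read_prob(100, 0, 0.01), read_prob(100, 5, 0.01)],
              [read_prob(100, 4, 0.01), read_prob(100, 1, 0.01)],
              [read_prob(100, 1, 0.01), read_prob(100, 3, 0.01)]])
f = np.array([10.0, 6.0, 4.0])  # observed read frequencies f_v

n_v, n_q = p.shape
c = -p.flatten()  # linprog minimizes, so negate objective (1)
# Constraint (2): sum_i x[v, i] <= f_v for every read vertex v.
A_ub = np.zeros((n_v, n_v * n_q))
for v in range(n_v):
    A_ub[v, v * n_q:(v + 1) * n_q] = 1.0
# Constraint (5): x >= 0 via the default lower bound in `bounds`.
res = linprog(c, A_ub=A_ub, b_ub=f, bounds=(0, None), method="highs")
x = res.x.reshape(n_v, n_q)
```

In this degenerate instance the optimum assigns each read's full frequency to its most likely candidate; the full MCF adds constraints (3)-(4) so that the per-candidate assignments also form consistent source-to-sink flows in the read graph.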

The frequency $f_i$ of the candidate $q_i$ can then be estimated by the normalized value of the flow $x^i$, i.e.

\[
f_i = \frac{\sum_{(s_i,v)\in E} x^i_{(s_i,v)}}{\sum_{i=1}^{n} \sum_{(s_i,v)\in E} x^i_{(s_i,v)}}. \tag{6}
\]

The algorithm was tested and compared with the EM algorithm [4] on data generated by FlowSim [7]. Following [6] and [4], we use the Kullback-Leibler divergence [5] as a measure of the quality of the quasispecies frequency predictions. The results are presented in Figure 1. The MCF algorithm significantly outperforms the EM algorithm; moreover, unlike the EM algorithm, the solution quality of the MCF algorithm depends very little on the number of reads.
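The normalization in equation (6) and the Kullback-Leibler quality measure from [5] can be sketched in a few lines. The per-candidate source flows and the "true" spectrum below are made-up numbers, not data from the paper.

```python
import numpy as np

# Hypothetical total flow out of each source s_i, i.e. the numerator
# sum over edges (s_i, v) of x^i_{(s_i, v)} in equation (6).
source_flow = np.array([12.0, 6.0, 2.0])
f_hat = source_flow / source_flow.sum()  # estimated frequencies f_i

# Kullback-Leibler divergence KL(f_true || f_hat): the measure used to
# compare predicted and true quasispecies spectra (lower is better).
f_true = np.array([0.55, 0.35, 0.10])
kl = float(np.sum(f_true * np.log(f_true / f_hat)))
```

By construction the estimated frequencies sum to one, and the divergence is zero exactly when the estimated spectrum matches the true one.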


Fig. 1. Comparison of EM and MCF algorithms

References

1. E. Domingo, "Biological significance of viral quasispecies," Viral Hepatitis Rev. 2, 1996, pp. 247-261.
2. N. Mancuso, B. Tork, P. Skums, I. Mandoiu and A. Zelikovsky, "Viral quasispecies reconstruction from amplicon 454 pyrosequencing reads," In: Proc. 1st Workshop on Computational Advances in Molecular Epidemiology, 2011, pp. 94-101.
3. P. Skums, D.S. Campo, Z. Dimitrova, G. Vaughan, D.T. Lau and Y. Khudyakov, "Modelling differential interferon resistance of HCV quasispecies," In: Proc. 1st Workshop on Computational Advances in Molecular Epidemiology, 2011.
4. I. Astrovskaya, B. Tork, S. Mangul, K. Westbrooks, I. Mandoiu, P. Balfe and A. Zelikovsky, "Inferring viral quasispecies spectra from 454 pyrosequencing reads," BMC Bioinformatics 12(Suppl 6):S1, 2011.
5. S. Kullback and R.A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics 22(1), 1951, pp. 79-86.
6. N. Eriksson, L. Pachter, Y. Mitsuya, S.Y. Rhee, C. Wang et al., "Viral population estimation using pyrosequencing," PLoS Comput Biol 4:e1000074, 2008.
7. S. Balzer, K. Malde, A. Lanzen, A. Sharma and I. Jonassen, "Characteristics of 454 pyrosequencing data-enabling realistic simulation with FlowSim," Bioinformatics 26:i420-5, 2010.
