How to carry out bioinformatic studies from a computational point of view

Qin Ma, Ph.D.

Bioinformatics and Mathematical Biosciences Lab

04/29/2016

Bioinformatics and Mathematical Biosciences Lab @ SDSU

Bioinformatics

• "This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes." – Temple Smith, Current Topics in Computational Molecular Biology (2002)

[Figure: "Omics data" diagram with labels Systems, Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics, …, Bioinformatics]

Some Concepts in Bioinformatics

• Models

• Algorithms

• Programs/Tools

Problem: to find a path through the nodes that would cross each edge once and only once.

[Figure: graph with nodes A, B, C and D]

The answer is NO!

Seven Bridges of Königsberg

The problem was to find a walk through the city that would cross each bridge once and only once.
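Whether such a walk exists can be decided by counting the bridges that touch each land mass: a connected graph has a walk using every edge exactly once only if zero or two nodes have odd degree, and a closed tour only if none do. The following is a minimal Python sketch of that check. The edge list assumes a hypothetical labeling in which A is the central island; the labeling in the slide's figure may differ.

```python
from collections import Counter

# Seven bridges as an edge list (a multigraph: parallel bridges appear twice).
# Assumption: A is the central island; B, C and D are the other land masses.
KONIGSBERG = [("A", "B"), ("A", "B"), ("A", "D"), ("A", "D"),
              ("A", "C"), ("B", "C"), ("C", "D")]


def odd_degree_nodes(edges):
    """Return the sorted list of nodes touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def is_connected(edges):
    """Check that every node with at least one edge is reachable from any other."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)


def has_euler_tour(edges):
    """Closed walk using every edge once: connected and no odd-degree node."""
    return is_connected(edges) and not odd_degree_nodes(edges)


def has_euler_walk(edges):
    """Open or closed walk using every edge once: zero or two odd-degree nodes."""
    return is_connected(edges) and len(odd_degree_nodes(edges)) in (0, 2)


if __name__ == "__main__":
    print(odd_degree_nodes(KONIGSBERG))
    print(has_euler_walk(KONIGSBERG))
```

With this edge list all four land masses have odd degree, so neither kind of walk exists.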

[Figure: the bridges of Königsberg and the corresponding graph with nodes A, B, C and D]

Its negative solution by Leonhard Euler in 1735 laid the foundations of graph theory.

Model: Euler tour ⇔ no nodes of odd degree

Mathematical models in Bioinformatics

Algorithm Design

Problem: to move the smallest number of matches to make the formula correct

[Figure: matchstick formula I + XI = X, with the matches labeled a to j]

Famous algorithms in Bioinformatics

Smith–Waterman algorithm

BLAST
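To make the first of these concrete, here is a minimal Python sketch of Smith–Waterman local alignment by dynamic programming. It assumes a simple linear gap penalty; the scoring values (match=2, mismatch=-1, gap=-2) are illustrative choices, not values taken from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the two sequence lengths. That cost is one reason heuristic tools such as BLAST are used for database-scale searches.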

• "Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems." --Temple Smith

Michael Waterman

• Programs/Tools

1. C/C++, Perl, R, Python, PHP, Java, etc.

Development of bioinformatic tools in BMBL

1. RNA-seq analysis pipeline

Funded by Scholarly Excellence Funds of SDSU

RNA-sequencing reads

• Read quality check: FastQC, FastX-FastQ
• Qualified reads mapping: HISAT2
• Replicate sample quality check
• Gene assembly: Cufflinks
• Differential expression analysis: edgeR
• cis-regulatory motif identification: DMINDA
• Pathway enrichment analysis
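The command-line part of the pipeline above (quality check, mapping, assembly) could be chained roughly as follows. This is only a sketch under several assumptions: single-end reads, a prebuilt HISAT2 index, and a `samtools sort` step that does not appear on the slide but is needed here because Cufflinks expects coordinate-sorted alignments. The function name `build_commands` and all file names are hypothetical. The sketch only builds the command lines and does not run them.

```python
from pathlib import PurePosixPath


def build_commands(reads, index_prefix, out_dir, threads=4):
    """Build (but do not execute) the commands for one single-end sample."""
    out = PurePosixPath(out_dir)
    sample = PurePosixPath(reads).name.split(".")[0]
    sam = str(out / f"{sample}.sam")
    bam = str(out / f"{sample}.sorted.bam")
    return [
        # Read quality check; the output directory must already exist.
        ["fastqc", str(reads), "-o", str(out / "fastqc")],
        # Spliced mapping; --dta-cufflinks tailors the alignments for Cufflinks.
        ["hisat2", "-p", str(threads), "--dta-cufflinks",
         "-x", index_prefix, "-U", str(reads), "-S", sam],
        # Assumed extra step: sort alignments by coordinate.
        ["samtools", "sort", "-o", bam, sam],
        # Transcript assembly.
        ["cufflinks", "-p", str(threads), "-o", str(out / "cufflinks"), bam],
    ]


if __name__ == "__main__":
    for command in build_commands("sample1.fastq", "genome_index", "results"):
        print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are installed. The edgeR and DMINDA steps are left out because they run in R and on a web server, respectively.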

2. Bi-clustering for identification of co-expressed genes under some conditions

[Figure: gene expression matrix (genes by conditions) before and after bi-clustering]
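To illustrate what a bicluster is (a subset of genes that behave alike under a subset of conditions), here is a toy Python sketch. It is not the QUBIC algorithm: the rank-based discretization, the greedy growth from a seed pair, and the names `discretize_row`, `shared_conditions` and `grow_bicluster` are all simplifying assumptions made for illustration.

```python
def discretize_row(values, k):
    """Label the k lowest values -1, the k highest +1 and the rest 0."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for i in order[:k]:
        labels[i] = -1
    for i in order[-k:]:
        labels[i] = 1
    return labels


def shared_conditions(disc, genes):
    """Conditions under which all given genes carry the same nonzero label."""
    columns = []
    for j in range(len(disc[genes[0]])):
        states = {disc[g][j] for g in genes}
        if len(states) == 1 and 0 not in states:
            columns.append(j)
    return columns


def grow_bicluster(disc, seed, min_cols=2):
    """Greedily add genes that keep at least min_cols shared conditions."""
    genes = list(seed)
    columns = shared_conditions(disc, genes)
    for g in range(len(disc)):
        if g in genes:
            continue
        candidate = shared_conditions(disc, genes + [g])
        if len(candidate) >= min_cols:
            genes.append(g)
            columns = candidate
    return genes, columns


# Made-up expression matrix: 5 genes (rows) by 6 conditions (columns).
EXPRESSION = [
    [9.0, 8.5, 1.0, 1.2, 0.9, 1.1],
    [7.0, 7.5, 2.0, 2.1, 1.9, 2.2],
    [1.0, 1.1, 0.9, 6.0, 1.2, 5.5],
    [8.0, 9.0, 3.0, 3.1, 2.9, 3.2],
    [2.0, 2.1, 2.2, 1.9, 2.0, 2.3],
]

if __name__ == "__main__":
    disc = [discretize_row(row, k=2) for row in EXPRESSION]
    print(grow_bicluster(disc, seed=[0, 1]))
```

Because the labels depend only on each gene's own ranking, genes measured on different scales can still fall into the same bicluster. The downside of this toy version is that a flat gene (the last row) also receives high and low labels.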

QUBIC: an R/Bioconductor package of qualitative biclustering for gene co-expression analysis

Juan Xie1, Qin Ma1,2,3

1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 2 Department of Plant Science, South Dakota State University, Brookings, SD, 3 BioSNTR, Brookings, SD.

1. Introduction

• Biclustering can discover the underlying structure of gene expression data (Fig. 1); it is a successful approach to conducting gene co-expression analysis.
• QUBIC has been reviewed as one of the best biclustering algorithms (Eren et al., 2013).
• A web server was developed to facilitate common users (Zhou et al., 2012).
• This R package (QUBIC-R) provides an efficient and optimized implementation of QUBIC, with significantly improved efficiency and comprehensive functions.
• QUBIC-R is freely available online at http://bioconductor.org/packages/QUBIC/

Fig. 1. Heatmap visualization of gene expression matrix before and after biclustering.

2. Implementation

• The original QUBIC program was written in GNU C, which is limited in portability. Memory leaks are another concern.
• In QUBIC-R, the C code was refactored and transformed into C++, data structures were changed and C pointers were replaced by STL containers.
• Core function structures were optimized to facilitate package updates and further development.
• Consequently, the efficiency of the program has been significantly increased (Fig. 2A).
• The output format of QUBIC-R can be used by other network analysis software, such as Cytoscape (Smoot et al., 2011).

3. Functions

Six functions are included in QUBIC-R:

• qudiscretize creates a discrete matrix for a given matrix.
• BCQU and BCQUD perform biclustering for continuous and discretized gene expression data, respectively.
• quheatmap draws a heatmap for any single predicted bicluster or for two biclusters (Fig. 2B, C).
• qunetwork creates co-expression networks based on the identified biclusters (Fig. 2D, E).
• qunet2xml can convert the constructed networks into XGMML format for further analysis in Cytoscape (Fig. 2F, G), Biomax and Jnets.

Fig. 2. (A) Comparison of CPU running time between QUBIC-R and QUBIC; (B) Heatmap visualization for a single bicluster; (C) Heatmap visualization for two biclusters; (D) Co-expression network for a single bicluster; (E) Co-expression network for two biclusters; (F) network for a single bicluster regenerated by Cytoscape; and (G) networks for two biclusters regenerated by Cytoscape.

4. Conclusion

• QUBIC-R implements the well-cited biclustering algorithm, QUBIC.
• It efficiently optimized the source code, improving the original efficiency by 44%.
• It also provides integrated functions to visualize the identified biclusters and corresponding co-expression networks.
• It offers output for further advanced analysis.
• QUBIC-R can be a powerful tool for gene expression data mining and co-expression network modeling.

5. References

• Eren, K., M. Deveci, O. Küçüktunç and Ü. V. Çatalyürek (2013). "A comparative analysis of biclustering algorithms for gene expression data." Briefings in Bioinformatics 14(3): 279-292.
• Li, G., Q. Ma, H. Tang, A. H. Paterson and Y. Xu (2009). "QUBIC: a qualitative biclustering algorithm for analyses of gene expression data." Nucleic Acids Research 37(15): e101.
• Zhou, F., Q. Ma, G. Li and Y. Xu (2012). "QServer: a biclustering server for prediction and assessment of co-expressed gene clusters."
• Smoot, M. E., K. Ono, J. Ruscheinski, P.-L. Wang and T. Ideker (2011). "Cytoscape 2.8: new features for data integration and network visualization." Bioinformatics 27(3): 431-432.

Acknowledgement

It is supported by the State of South Dakota Research Innovation Center and the Agriculture Experiment Station of South Dakota State University.

3. DNA motif identification and analyses

Development of Computational Tools in DNA motif identification and analyses

Jinyu Yang1, Qin Ma1,2,3

1 Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA, 2 Department of Plant Science, South Dakota State University, Brookings, SD, USA, 3 BioSNTR, Brookings, SD, USA

Background

Regulation of gene transcription

Transcription initiation is regulated through interactions between transcription factors (TFs) and their binding sites (motifs).

[Figure: TF and RNA polymerase at a promoter. Key: TF. Key holes: cis-regulatory elements.]

Computational Identification of cis-regulatory motif

The essence of our algorithms: assessing the possibility for each nucleotide in a given promoter to be in a motif.
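One simple way to picture a per-nucleotide assessment is a vote count: each predicted binding site adds one vote to every position it covers, and stretches with many votes become motif candidates. The Python sketch below illustrates that idea only. The function names, the (start, length) representation of sites and the threshold are assumptions, not the scoring actually used by the tools described here.

```python
def nucleotide_votes(promoter_length, predicted_sites):
    """Count, for every promoter position, how many predicted sites cover it.

    predicted_sites holds (start, length) pairs with 0-based start positions,
    for example pooled from several motif finders.
    """
    votes = [0] * promoter_length
    for start, length in predicted_sites:
        for pos in range(max(0, start), min(promoter_length, start + length)):
            votes[pos] += 1
    return votes


def candidate_regions(votes, min_votes):
    """Return half-open (start, end) intervals where votes >= min_votes."""
    regions, start = [], None
    for pos, count in enumerate(votes):
        if count >= min_votes and start is None:
            start = pos
        elif count < min_votes and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(votes)))
    return regions


if __name__ == "__main__":
    # Hypothetical predictions on a 20-nucleotide promoter.
    sites = [(3, 6), (4, 6), (5, 5), (14, 4)]
    votes = nucleotide_votes(20, sites)
    print(votes)
    print(candidate_regions(votes, min_votes=2))
```

Raising `min_votes` trades sensitivity for specificity: with the sites above, a threshold of 2 keeps positions 4 to 9, while a threshold of 3 keeps only positions 5 to 8.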

MP3: phylogenetic footprinting

A phylogenetic footprinting framework (MP3) for prokaryotes based on a new orthologous data preparation procedure and a novel promoter scoring and pruning method.

[Figure: MP3 workflow. Panel labels: Promoters; Orthologous operons; Collection of ortholog promoters; Motif voting (BOBRO, BioProspector, CONSENSUS, CUBIC, MEME, MDscan); Curve: scores on each nucleotide; Curve fitting; Graph model to cluster binding sites; Predicted motifs]

DOOR 2.0: operon database

Complete and reliable operon database covering 2,072 bacteria genomes and with overall accuracy of ~90% evaluated by Brouwer (2008) on Briefings in Bioinformatics.

Regulon Prediction

A new computational framework and a novel graph model integrating the motif comparison and clustering for regulon prediction.

[Figure: An outline of the regulon prediction framework. (A) Phylogenetic footprinting motif finding: orthologous operon promoters, motifs of operon A, motifs of operon B; (B) Motif Similarity Evaluation, with pairwise scores ω1,1, …, ωm,n (ωmax = ω1,2 in the example) and a co-regulation score CRS(A, B) defined from ωmax, ω and σ; (C) Construction of Co-regulation Graph; (D) Vertex blow-up and Clustering: Meta-Cluster, Cluster 1, Cluster 2]

References

1. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics, under review.
2. Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses, B Liu, C Zhou, H Zhang, G Li, Q Liu, Q Ma, Scientific Reports, 2016.
3. DMINDA: an integrated web server for DNA motif identification and analyses, Q Ma, H Zhang, X Mao, C Zhou, B Liu, X Chen, Y Xu, Nucleic Acids Research, 42, W12-19, 2014.
4. DOOR 2.0: presenting operons and their functions through dynamic and integrated views, X Mao, Q Ma, C Zhou, X Chen, H Zhang, J Yang, F Mao, W Lai, Y Xu, Nucleic Acids Research 42 (D1), D654-D659, 2014.
5. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale, Q Ma*, B Liu*, C Zhou, Y Yin, G Li, Y Xu, Bioinformatics 29 (18), 2261-2268, 2012.