Biclustering Algorithms


Biclustering Algorithms
Sara C. Madeira, 24/03/2011

Roadmap
• Clustering (just a very brief overview!)
  – Hierarchical
  – Partitional
• Biclustering
• Biclustering Gene Expression Time Series

Clustering
[Figures: introductory examples of clustering gene expression data.]
• Distances used when clustering expression data are related to:
  – Absolute differences (Euclidean distance, …)
  – Trends (Pearson correlation, …)
• The Homogeneity and Separation principles should be preserved:
  – Homogeneity: genes/conditions within a cluster are close to/correlated with each other.
  – Separation: genes/conditions in different clusters are far apart from/uncorrelated with each other.
• Clustering is not an easy task!

Clustering Techniques
• Agglomerative: start with every gene/condition in its own cluster, and iteratively join clusters together.
• Divisive: start with one cluster and iteratively divide it into smaller clusters.
• Hierarchical: organize elements into a tree; leaves represent genes, and the length of the paths between leaves represents the distances between genes/conditions. Similar genes/conditions lie within the same subtrees.
• Partitional: partition the genes/conditions into a specified number of groups.

Hierarchical Clustering
[Figures: hierarchical clustering examples.]

Partitional Clustering
[Figures: partitional clustering examples.]

Clustering in Babelomics
• Algorithms: UPGMA, k-Means, SOTA (Dopazo and Carazo, 1997; Herrero et al., 2001)
• Webpage: http://babelomics.bioinfo.cipf.es/
• Tutorial: http://bioinfo.cipf.es/babelomicstutorial/clustering/
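The two families of distance measures above (absolute differences vs trends) behave very differently; a minimal sketch, with hypothetical expression profiles:

```python
# Illustrative sketch (not from the slides): two genes with the same trend
# but different magnitudes. Euclidean distance sees them as far apart,
# while Pearson correlation captures that they rise and fall together.
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene1 = [1.0, 2.0, 3.0, 4.0]      # hypothetical expression profile
gene2 = [10.0, 20.0, 30.0, 40.0]  # same trend, 10x the magnitude

print(euclidean(gene1, gene2))  # large: ~49.3
print(pearson(gene1, gene2))    # 1.0: perfectly correlated
```

This is why trend-based measures such as Pearson correlation are often preferred when co-regulation, rather than absolute expression level, is what matters.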
What is Biclustering?
• Simultaneous clustering of both rows and columns of a data matrix.
  – Biclustering identifies groups of genes with similar/coherent expression patterns under a specific subset of the conditions.
  – Clustering identifies groups of genes/conditions that show similar activity patterns under the whole set of conditions/the whole set of genes under analysis.
• |R|-by-|C| data matrix A = (R, C):
  – R = {r1, …, r|R|} = set of |R| rows.
  – C = {c1, …, c|C|} = set of |C| columns.
  – aij = relation between row i and column j.
• Gene expression matrices:
  – R = set of genes; C = set of conditions.
  – aij = expression level of gene i under condition j (quantity of mRNA).

Biclustering vs Clustering
• Consider a 6-by-10 expression matrix A = (aij) with rows R = {G1, …, G6} and columns C = {C1, …, C10}, and take I = {G2, G3, G4} and J = {C4, C5, C6}:
  – Cluster of genes: (I, C) = ({G2, G3, G4}, C), a subset of rows over all columns.
  – Cluster of conditions: (R, J) = (R, {C4, C5, C6}), a subset of columns over all rows.
  – Bicluster: (I, J) = ({G2, G3, G4}, {C4, C5, C6}), a subset of rows over a subset of columns.

Example – Yeast Cell Cycle
[Figures: yeast cell cycle expression data; one panel highlights a GENE CLUSTER and two panels highlight BICLUSTERs.]

Why Biclustering and not just Clustering?
• When clustering algorithms are used:
  – Each gene in a given gene cluster is defined using all the conditions.
  – Each condition in a condition cluster is characterized by the activity of all the genes.
  (Global model)
• When biclustering algorithms are used:
  – Each gene in a bicluster is selected using only a subset of the conditions.
  – Each condition in a bicluster is selected using only a subset of the genes.
  (Local model)
• Unlike clustering, biclustering identifies groups of genes that show similar activity patterns under a specific subset of the experimental conditions.
• Biclustering is the key technique to use when:
  1. Only a small set of the genes participates in a cellular process of interest.
  2. An interesting cellular process is active only in a subset of the conditions.
  3. A single gene may participate in multiple pathways that may or may not be co-active under all conditions.

Bicluster Types
1. Biclusters with constant values.
2. Biclusters with constant values on rows or columns.
3. Biclusters with coherent values.
4. Biclusters with coherent evolutions.

Constant Values
• Perfect constant bicluster: a sub-matrix (I, J) where all values within the bicluster are equal, for all i ∈ I and j ∈ J:
  aij = µ
• Example (µ = 1.0):
  1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0
  1.0 1.0 1.0 1.0
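The local model and the constant-value type can be sketched as plain submatrix selection plus an equality check; the index sets follow the G2–G4/C4–C6 example from the slides, while the helper name and matrix entries are my own illustration:

```python
# A bicluster (I, J) is a submatrix: a subset of rows restricted to a subset
# of columns. Index sets follow the slides' example (I = {G2, G3, G4},
# J = {C4, C5, C6}); the entries themselves are hypothetical.
matrix = [[1.0] * 10 for _ in range(6)]  # 6 genes x 10 conditions, all mu = 1.0

I = [1, 2, 3]  # 0-based rows for G2, G3, G4
J = [3, 4, 5]  # 0-based columns for C4, C5, C6

bicluster = [[matrix[i][j] for j in J] for i in I]

def is_constant_bicluster(sub):
    """True if every entry equals a single typical value mu (a_ij = mu)."""
    mu = sub[0][0]
    return all(v == mu for row in sub for v in row)

print(len(bicluster), len(bicluster[0]))    # 3 3
print(is_constant_bicluster(bicluster))     # True
print(is_constant_bicluster([[1.0, 2.0]]))  # False
```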
Constant Values on Rows or Columns
• Perfect bicluster with constant rows: a sub-matrix (I, J) where all the values within the bicluster can be obtained using
  aij = µ + αi  (additive)  or  aij = µ × αi  (multiplicative),
  where µ is the typical value within the bicluster and αi is the adjustment for row i ∈ I.
• Perfect bicluster with constant columns: a sub-matrix (I, J) where all the values within the bicluster can be obtained using
  aij = µ + βj  (additive)  or  aij = µ × βj  (multiplicative),
  where βj is the adjustment for column j ∈ J.
• The adjustment can be obtained either in an additive or a multiplicative way.
• Examples:
  Constant rows:      Constant columns:
  1.0 1.0 1.0 1.0     1.0 2.0 3.0 4.0
  2.0 2.0 2.0 2.0     1.0 2.0 3.0 4.0
  3.0 3.0 3.0 3.0     1.0 2.0 3.0 4.0
  4.0 4.0 4.0 4.0     1.0 2.0 3.0 4.0

Coherent Values
• Perfect bicluster with an additive/multiplicative model: a subset of rows and a subset of columns whose values aij can be predicted using
  aij = µ + αi + βj  or  aij = µ × αi × βj,
  where µ is the typical value within the bicluster, αi is the adjustment for row i ∈ I and βj is the adjustment for column j ∈ J.
• Examples:
  Additive model:     Multiplicative model:
  1.0 2.0 5.0 0.0     1.0 2.0 0.5 1.5
  2.0 3.0 6.0 1.0     2.0 4.0 1.0 3.0
  4.0 5.0 8.0 3.0     4.0 8.0 2.0 6.0
  5.0 6.0 9.0 4.0     3.0 6.0 1.5 4.5
• The "plaid models" (Lazzeroni and Owen) consider a generalization of the additive model: the general additive model.
• For every element aij:
  – The general additive model represents a sum of models.
  – Each model represents the contribution of a bicluster Bk to the value of aij when row i and column j belong to Bk.
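The additive-model example matrix above can be regenerated from aij = µ + αi + βj; the particular (µ, α, β) below is one consistent choice (the decomposition is not unique, since a constant can be moved between µ and the adjustments):

```python
# Rebuild the slides' additive-model bicluster from a_ij = mu + alpha_i + beta_j.
# This (mu, alpha, beta) is one valid decomposition; adding a constant to mu
# and subtracting it from every alpha_i gives another.
mu = 1.0
alpha = [0.0, 1.0, 3.0, 4.0]   # row adjustments
beta = [0.0, 1.0, 4.0, -1.0]   # column adjustments

bicluster = [[mu + a + b for b in beta] for a in alpha]
for row in bicluster:
    print(row)
# [1.0, 2.0, 5.0, 0.0]
# [2.0, 3.0, 6.0, 1.0]
# [4.0, 5.0, 8.0, 3.0]
# [5.0, 6.0, 9.0, 4.0]
```

Swapping the sums for products (aij = µ × αi × βj) regenerates the multiplicative example in the same way.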
Coherent Values – General Additive Model
• Every element is modelled as a sum of bicluster contributions:
  aij = Σk=1..K θijk ρik κjk,
  where K is the number of biclusters.
• θijk specifies the contribution of bicluster k, and can be one of the following expressions, representing different types of biclusters:
  – µk: constant biclusters.
  – µk + αik: biclusters with constant rows.
  – µk + βjk: biclusters with constant columns.
  – µk + αik + βjk: biclusters with an additive model.
• ρik and κjk are binary values that represent memberships:
  – ρik is the membership of row i in bicluster k.
  – κjk is the membership of column j in bicluster k.
• A general multiplicative model can also be assumed.
• [Figures: two worked matrices in which pairs of overlapping biclusters (constant values with constant biclusters, and constant rows with constant columns) are summed; in the overlapping cells the contributions of the two biclusters add up.]
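The overlap behaviour of the general additive model can be sketched with two constant biclusters (θijk = µk); the matrix size and the membership sets below are my own illustration, chosen so the overlap shows the contributions adding up:

```python
# Sketch of the general additive model: a_ij = sum_k theta_ijk * rho_ik * kappa_jk,
# here with two constant biclusters (theta_ijk = mu_k). Memberships rho/kappa are
# encoded as the row/column index sets of each bicluster; all values hypothetical.
n_rows, n_cols = 5, 6
biclusters = [
    {"mu": 1.0, "rows": {0, 1, 2}, "cols": {0, 1, 2, 3}},
    {"mu": 2.0, "rows": {2, 3, 4}, "cols": {2, 3, 4, 5}},
]

A = [[sum(b["mu"] for b in biclusters
          if i in b["rows"] and j in b["cols"])
      for j in range(n_cols)] for i in range(n_rows)]

for row in A:
    print(row)
# Row 2, columns 2-3 lie in both biclusters, so those entries are 1.0 + 2.0 = 3.0.
```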