To Study Correlation Between Gene Expression Profiles

Term Paper for CSCi 689 Data Mining

Submitted by

Ranjana Sharma

To

Dr. William Perrizo

Department of Computer Science, North Dakota State University

1.0 INTRODUCTION

Gene expression is the process by which the information in a gene is used to synthesize a functional gene product, which is generally a protein. For non-protein-coding genes, such as rRNA or tRNA genes, the product is a functional RNA. The main steps of gene expression are transcription and translation. Even the smallest change in the DNA can result in major changes in cell function. The activity of many genes can be measured at once to determine cellular function; this is called gene expression profiling. Profiling makes it possible to distinguish between cells, for example cells that are dividing and cells that are responding to a particular treatment, and to find out how cells respond to that treatment. With gene expression profiling, the activity of every gene in the cell or genome can be measured. The level of expression can have a profound effect on the function of the gene in the organism. The expression of a gene is commonly described as turned on or turned off.

The human genome contains approximately 20,000 genes. At any given moment, each of our cells has some combination of these genes turned on, while the others are turned off. Gene expression profiling is done using a technique called microarray analysis. A DNA microarray consists of rows and columns of thousands of spots of DNA oligonucleotides, so that an experiment on many genes can be performed at the same time. Each spot on a microarray contains multiple identical strands of DNA representing one gene. Microarray analysis involves breaking open a cell, isolating its genetic contents, identifying all the genes that are turned on in that particular cell, and generating a list of those genes.

Gene expression experiments have been widely used in functional genomics. The experimental results can be related to functional and other types of information, such as information on proteins, to test whether groups of gene expression experiments, including time course experiments, show a relationship with respect to properties that define sets of genes [LEI2006], [DAT2006].

Gene expression data can be related to functional information only when the functional annotations of the genes are already known. However, a particular function of a gene is often known from only a single set of experiments, and such a result is not reliable on its own. Hence, it is useful to find out whether any relationship exists between the experiments.

2.0 RELATED WORK

A study of the correlation between gene expression profiles and protein-protein interactions within and across the genomes of yeast, mouse, human and E. coli indicates that in E. coli the expression of interacting pairs is highly correlated in comparison to random pairs, while in the other three species the correlation of expression of interacting pairs is only slightly stronger than that of random pairs [BHA2005].

A study of the genes in the zebrafish genome, relating their distance to their transcriptional expression pattern using available microarray data and gene annotation, shows that neighboring genes are significantly coexpressed and that the coexpression level is influenced by the intergenic distance and transcription orientation [KAO2009].

Gene products that are biologically and functionally related would be expected to maintain this similarity both in their expression profiles and in their Gene Ontology (GO) annotation. A study of the relationship between gene expression, gene function, and gene annotation shows that there is a correlation between semantic similarity in the GO annotation and gene expression for the different GO ontologies [SEV2005].

3.0 PROPOSED WORK (“KILLER IDEA”)

Groups of genes may act together or behave similarly during a biological process, and genes in the same cluster may be functionally correlated. This can be revealed by cluster analysis, the most commonly performed procedure for this purpose. It is generally done by calculating a distance or dissimilarity between the expression vectors of each pair of genes and placing genes with similar profiles in the same cluster, on the expectation that they share similar functions. Most microarray analyses have used standard hierarchical clustering, UPGMA (Unweighted Pair Group Method with Arithmetic mean) [HUA2008], with one minus the Pearson correlation coefficient as the measure of dissimilarity. However, the results are often not accurate: genes that do not seem related may be similar in one or more functions, so further studies have to be carried out to confirm the clusters.

A statistical clustering algorithm places a pair of genes in the same cluster if their expression profiles are similar as judged by the distance measure employed. The exact details of achieving this goal vary from one algorithm to the next. More complex and relatively modern algorithms offer more tuning parameter choices, which result in varied groupings [DAT2006]. However, selecting the best algorithm or parameter setting remains a difficult problem. A good clustering algorithm should ideally produce groups with distinct, non-overlapping boundaries, although a perfect separation cannot typically be achieved in practice [DAT2006]. Various algorithms have been reported for finding the correlation between the expression profiles of different genes. However, each has drawbacks of its own in completely determining the expression profile.
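The dissimilarity-based procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the cited studies; the function names (pearson, dissimilarity, upgma) and the four toy expression profiles are hypothetical and chosen only to show the mechanics of average-linkage clustering with one minus the Pearson correlation coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dissimilarity(x, y):
    """One minus the Pearson correlation: 0 for identical shape, 2 for opposite."""
    return 1.0 - pearson(x, y)


def upgma(profiles, n_clusters):
    """Average-linkage agglomeration of genes until n_clusters remain.

    profiles maps a gene name to its expression vector.
    """
    clusters = [[name] for name in profiles]

    def avg_dist(a, b):
        # UPGMA distance: mean of all pairwise dissimilarities between members.
        total = sum(dissimilarity(profiles[g], profiles[h]) for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = avg_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Hypothetical expression profiles over four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
clusters = upgma(profiles, n_clusters=2)
print(clusters)
```

In this toy data the rising profiles (g1, g2) and the falling profiles (g3, g4) end up in separate clusters, because the dissimilarity depends on the shape of the profile rather than on its magnitude.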

3.1 Clustering by K-means:

The K-means algorithm is a partition clustering algorithm that clusters n objects, based on their attributes, into k partitions such that k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both determine the centers of natural clusters in the data. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, or the squared error function

$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where there are k clusters $S_i$, $i = 1, 2, \ldots, k$, and $\mu_i$ is the centroid or mean point of all the points $x_j \in S_i$.

[wiki]

The number of clusters has to be fixed in advance in this type of algorithm, and it cannot be fixed arbitrarily. Usually another clustering algorithm, such as hierarchical clustering, has to be run first to determine the number of clusters to use in the K-means algorithm. The algorithm then assigns the observations to clusters so as to minimize the squared error; the solution it reaches is a local minimum. The major drawback of this algorithm is that the number of clusters has to be determined in advance, which may not always be possible. It also does not work well with overlapping clusters and is not suitable for discovering clusters with non-convex shapes [wiki]. A number of studies on various modifications of K-means clustering have been carried out. Studies of six and of ten clustering algorithms have also been carried out to see whether genes in the same cluster are functionally correlated, given that seemingly unrelated genes may have one or more functions in common [DAT2006]. It was found that no single algorithm was best suited for clustering genes into functional groups via expression profiles, and that the best algorithm in each case depends on the validation measure employed.
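The assignment and update steps of K-means can be sketched as below. This is an illustrative plain-Python version of the standard iteration, assuming squared Euclidean distance and a naive initialization that takes the first k points as centroids; the names kmeans and sq_dist and the sample points are hypothetical, and a practical analysis would use a tested library and a better initialization.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Cluster points into k groups and return (assignment, centroids, sse).

    Assumption: the first k points serve as initial centroids, so the result
    is deterministic but may be a poor local minimum on real data.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Squared error function: total intra-cluster variance.
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse


# Hypothetical two-dimensional data with two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centroids, sse = kmeans(points, k=2)
print(assignment, centroids, sse)
```

Because the outcome depends on the starting centroids, different initializations can converge to different local minima, which is why the choice of k and of the starting points matters in practice.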

3.2 Clustering by CLICK:

The CLICK (CLuster Identification via Connectivity Kernels) algorithm uses graph theory and statistical techniques to find genes that are highly similar to each other, and groups them together in the same cluster using heuristic approaches. The input data is considered as a weighted graph in which the vertices represent the elements and the weighted edges represent pair-wise similarity between the elements. The CLICK algorithm recursively follows these steps. In every step, it tries to handle some connected component of the subgraph induced by the unclustered elements. If the component satisfies the stopping criterion, it is declared a kernel. Otherwise, a minimum weight cut is calculated and the component is divided into its two most loosely connected pieces. After this, singletons whose vectors are highly similar to the mean vector of a kernel are added to that kernel. Finally, a merging procedure is used to merge similar clusters [SHA2003]. For gene expression data, the genes are the elements and the vector of each gene contains its expression levels under different conditions. Similarity between genes can be measured by the correlation coefficient between their vectors.
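Of the steps above, the adoption of singletons into kernels is the simplest to illustrate. The sketch below is not the CLICK implementation: it assumes the kernels have already been found, uses the Pearson correlation with the kernel mean vector as the similarity, computes each mean vector only once, and applies a similarity threshold of 0.9 that is an arbitrary assumption. The names adopt_singletons and mean_vector and all data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def adopt_singletons(kernels, singletons, profiles, threshold=0.9):
    """Add each singleton to the kernel whose mean vector it resembles most.

    A singleton is adopted only if its correlation with that mean vector
    reaches the (assumed) threshold; otherwise it stays a singleton.
    """
    kernels = [list(k) for k in kernels]
    means = [mean_vector([profiles[g] for g in k]) for k in kernels]
    remaining = []
    for gene in singletons:
        sims = [pearson(profiles[gene], m) for m in means]
        best = max(range(len(kernels)), key=lambda i: sims[i])
        if sims[best] >= threshold:
            kernels[best].append(gene)
        else:
            remaining.append(gene)
    return kernels, remaining


# Hypothetical profiles: two kernels and two unclustered genes.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
    "s1": [1.0, 2.0, 3.0, 5.0],
    "s2": [1.0, 5.0, 1.0, 5.0],
}
kernels, outliers = adopt_singletons(
    [["g1", "g2"], ["g3", "g4"]], ["s1", "s2"], profiles
)
print(kernels, outliers)
```

Genes that resemble no kernel closely enough remain singletons, which corresponds to the way outliers are kept out of every cluster.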

3.3 Gene Set Enrichment Analysis

Gene Set Enrichment Analysis is a powerful analytical method for interpreting gene expression data. The method focuses on gene sets, that is, groups of genes that share a common biological function, chromosomal location, or regulation. The gene sets have already been annotated by their common functions in databases, so the biological interpretation of a list of significant gene sets is clear. Gene Set Enrichment Analysis has identified significantly differentially expressed sets of genes even where the average difference in expression between two phenotypes is only 20% for the genes in the set. Gene Set Enrichment Analysis and related methods are complementary to conventional single-gene methods, and are likely to be more powerful than them for studying the large number of common diseases in which many genes each make subtle contributions. It is a tool that deserves to be in the toolbox of bioinformatics practitioners [SUB2005].

Gene Set Enrichment Analysis evaluates microarray data at the level of gene sets. The gene sets are defined based on prior biological knowledge, e.g., published information about their biochemical pathways or co-expression in previous experiments. Common features between data sets can be revealed more effectively by gene-set analysis than by single-gene analysis. Extensions of GSEA have also been reported [JIAXXXX]. Gene Set Enrichment Analysis addresses the problem of overlapping item sets or gene sets, a problem that cannot be addressed by K-means clustering.
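The core statistic of Gene Set Enrichment Analysis can be illustrated with a short sketch. It computes the unweighted running-sum enrichment score, which is the Kolmogorov-Smirnov-like special case of the score described in [SUB2005]; the correlation weighting and the permutation-based significance test of the full method are omitted. The function name enrichment_score and the gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted enrichment score of gene_set in a ranked gene list.

    Walk down the ranked list, stepping up at each gene that belongs to the
    set and down at each gene that does not; the score is the running sum
    with the largest absolute deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking of six genes, most up-regulated first.
ranked = ["a", "b", "c", "d", "e", "f"]
print(enrichment_score(ranked, {"a", "b"}))  # set concentrated at the top
print(enrichment_score(ranked, {"e", "f"}))  # set concentrated at the bottom
print(enrichment_score(ranked, {"a", "d"}))  # set spread through the list
```

A score near +1 or -1 indicates that the set is concentrated at one end of the ranking, while a set spread evenly through the list gives a score closer to zero.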

4.0 PROOF:

The gene expression data can be related to functional information. However, sometimes there may not be enough information about the function of a particular gene as the information could be from one set of experiments only. In such cases, the two experiments should be tested for an existing relationship between them. Experimental results are normally given as continuous values. These values or attributes could be grouped together and can be represented as multi-dimensional vectors.

The gene expression data sets from different experiments on yeast or any other suitable microorganism can be used. These data sets can be freely downloaded from the internet. Even if the experiments are of different types they must show some similarity. Thus, there may be experiments that show the gene expression that is cell cycle related and time-dependent. These experiments can be considered for all the genes of yeast that may participate in cell cycle related experiments so that a comparison can be made between them. The genes that can show convincing results can be grouped together.

The experimental results of correlation can be shown as scatter plots. If a correlation is observed, the scatter plot will look like Fig 1, whereas if no correlation is observed, as in Fig 2, the genes appear scattered.

Fig 1 Fig 2

Thus if the genes show correlation, it can be deduced that the expression profile of the central gene is shared with those of the neighboring genes, which may differ from the expression profiles of the distant genes. Hence the central gene is the most important gene, and the experiments can be assumed to be related with respect to the gene at the center. Many genes that are unrelated to the biological process under consideration may give rise to noise or outliers, but even the presence of outliers should not prevent the correlation from being exhibited. If an unrelated gene is used as the central gene, no correlation is observed. It is assumed that all of the genes used for the experimentation show cell-cycle related, time-dependent gene expression. Only those genes for which all experiments are reported will be considered, which may reduce the number of genes considered out of the total. Since all the genes exhibit cell cycle activity, it can be expected that cell cycle genes should exhibit a relationship between experiments. Also, since only a small number of genes may be involved in the cell cycle experiment, ROC analysis can be used to evaluate the result. A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance. It is a graphical plot of the sensitivity vs. (1 - specificity).

Fig 3

The ROC curves can be used in the above study to find the correlation between the genes. If the curve shows a high initial slope, it can be deduced that there is a correlation between experiments. An AUC (area under the ROC curve) of 1 represents a perfectly accurate test, and a lower value indicates a weaker correlation. Since an AUC of 0.5 corresponds to random performance, it can be assumed that no correlation exists between the genes if the AUC is 0.5 or less.
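The construction of a ROC curve and its AUC can be sketched as follows. The code assumes that each gene has a numeric score (for example, its similarity to the central gene) and a binary label marking known cell cycle genes; both the scores and the labels below are made up for illustration, and the names roc_points and auc are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points as (1 - specificity, sensitivity) pairs.

    The threshold is swept from the highest score to the lowest; genes with
    tied scores are added together so that ties produce a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores; label 1 marks a known cell cycle gene.
scores = [0.9, 0.8, 0.7, 0.6]
perfect = roc_points(scores, [1, 1, 0, 0])
mixed = roc_points(scores, [1, 0, 1, 0])
print(auc(perfect), auc(mixed))
```

A ranking that places every positive gene ahead of every negative gene rises straight to a sensitivity of 1 before any false positive appears, which is the high initial slope referred to above.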

Clustering can also be used to find the correlation between the gene sets, and the resulting clusters can be tested for overlap using gene set enrichment approaches. The clustering approaches that can be used are K-means clustering and CLICK clustering, and the results can be obtained for comparison. Since K-means clustering requires an initial value of K, any arbitrary value could be entered, provided that it is feasible. For example, if 20 genes have to be clustered, the value of K has to be such that each cluster contains several genes, so that the distance between them can be found. K should never be so large that each cluster consists of only one gene or none. Hence, for 20 genes, K should ideally not exceed 5.

Another clustering method is CLICK, which is a graph-based method. This type of clustering overcomes drawbacks of K-means clustering: the number of clusters need not be known initially, and whereas K-means does not handle outliers, CLICK separates the outliers as singletons and does not place them in any of the clusters. These singletons are not used in the correlation studies. The number of clusters formed may differ for each set of experiments. However, unlike K-means, where every run may produce clusters of different sizes and place some of the genes in different clusters, this algorithm always places the genes in the same cluster. Hence, for one set of experiments, the number of clusters formed always remains the same and the genes in those clusters are also always the same.

Gene Set Enrichment Analysis studies were carried out on all the sets of experiments of the yeast dataset, namely alpha, cdc15, cdc28 and elu, clustered by K-means as against those clustered by CLICK. All the experiments were compared to each other, and the corresponding Gene Ontology terms were also found. It was found that some clusters found to be enriched by K-means were also found to be enriched by CLICK. The functional classes found to be enriched in both sets of experiments are: GO:0042254 (Ribosome biogenesis and assembly), GO:0016070 (RNA metabolic process), GO:0006396 (RNA processing), GO:0006139 (Nucleobase and nucleic acid metabolic process), GO:0005515 (Protein binding), GO:0016491 (Oxidoreductase activity), GO:0007049 (Cell cycle), GO:0007109 (Cytokinesis, completion of separation), GO:0016072 (rRNA metabolic process).

Fig 4 Enrichment in K-Means Fig 5 Enrichment in CLICK

Thus it was observed that clusters from both the K-means and CLICK clustering algorithms show enrichment in all the experiments. Common functional groups were identified between the same sets of experiments, and also between different sets of experiments, when different clustering algorithms were used. A few common functional groups have been identified by both types of clustering methods on the same set of experiments. Thus the same set of genes is expressed over the same functional groups in different experiments, indicating the correlation between the experiments. Two vectors can therefore be compared directly and the distance between them can be found. The clusters resulting from different clustering algorithms over vectors showing enrichment of the same functional classes are correlated.

5.0 CONCLUSIONS:

K-means clustering has been used over a long period of time due to its speed and its simplicity of implementation. However, due to its drawbacks, there have been studies on modifications of it, which have been compared to traditional K-means clustering. Most of these studies show an enhancement in performance over the traditional K-means algorithm. The algorithm has also been applied to gene expression data, including in combination with Genetic Algorithms. The K-means clustering algorithm is not a good choice for finding the cell cycle genes, as they may appear at the overlap or intersection of clusters, and K-means does not support overlapping clusters.

CLICK clustering is a novel graph-based approach to clustering. It has the advantage over traditional K-means clustering that the number of clusters does not need to be specified a priori. It also handles outliers, which are not handled by traditional K-means. If a gene that is an outlier becomes the central gene, no correlation is found.

Gene Set Enrichment Analysis is a relatively new technique that can find the genes at the intersection or overlap of the clusters. It can find the enrichment in the clusters found by the different clustering algorithms. Common functional groups can be identified between the same sets of experiments, as well as between different sets of experiments, when different clustering algorithms are used. Thus, in combination with different clustering algorithms, Gene Set Enrichment Analysis can be used to find the correlation between different genes.

6.0 REFERENCES

[wiki] http://en.wikipedia.org/wiki/K-means_algorithm

[STE1956] Sur la division des corps matériels en parties, H. Steinhaus, Bull. Acad. Polon. Sci., Cl. III, vol. IV: 801–804, 1956.

[KAO2009] Positive correlation between gene coexpression and positional clustering in the zebrafish genome, Yen Kaow Ng, Wei Wu and Louxin Zhang, BMC Genomics 2009, 10:42.

[SEV2005] Correlation between Gene Expression and GO Semantic Similarity, Jose L. Sevilla, Victor Segura, Adam Podhorski, Elizabeth Guruceaga, Jose M. Mato, Luis A. Martinez-Cruz, Fernando J. Corrales and Angel Rubio, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 2, Issue 4 (October 2005), Pages 330–338.

[LEI2006] An Improved K-means Algorithm for Clustering Categorical Data, Ming Lei, Pilian He and Zhichao Li, Journal of Communication and Computer, Aug. 2006, Volume 3, No. 8 (Serial No. 21).

[BHA2005] Correlation between gene expression profiles and protein–protein interactions within and across genomes, Nitin Bhardwaj and Hui Lu, Bioinformatics 2005, 21(11):2730–2738.

[HUA2008] A practical comparison of two K-Means clustering algorithms, Gregory A. Wilkin and Xiuzhen Huang, BMC Bioinformatics 2008, 9(Suppl 6):S19.

[DAT2006] Evaluation of clustering algorithms for gene expression data, Susmita Datta and Somnath Datta, BMC Bioinformatics 2006, 7(Suppl 4):S17.

[DAT2006] Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes, Susmita Datta and Somnath Datta, BMC Bioinformatics 2006, 7:397, doi:10.1186/1471-2105-7-397.

[SUB2005] Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander and Jill P. Mesirov, PNAS, October 25, 2005, vol. 102, no. 43, 15545–15550.

[JIAXXXX] Extensions to gene set enrichment, Zhen Jiang and Robert Gentleman, Volume 23, Number 3, Pp. 306-313. (Reference lacked information)

[SHA2003] CLICK: A Clustering Algorithm for Gene Expression Analysis, Ron Shamir and Roded Sharan, 2003. (Reference lacked information)
