BIOINFORMATICS Pages 1–7

Vol. 00 no. 00 2010 BIOINFORMATICS Pages 1–7 Integrative classification and analysis of multiple arrayCGH datasets with probe alignment Ze Tian and Rui Kuang∗ Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, USA Received on XXXXX; revised on XXXXX; accepted on XXXXX Associate Editor: XXXXXXX ABSTRACT 2009). Chromosome copy number variations can be measured by Motivation: Array comparative genomic hybridization (ArrayCGH) comparative genomic hybridization (CGH), which compares the is widely used to measure DNA copy numbers in cancer research. copy number of a differentially labeled case sample with a reference ArrayCGH data report log-ratio intensities of thousands of probes DNA from a normal individual. ArrayCGH technology based on sampled along the chromosomes. Typically, the choices of the DNA microarray can currently allow genome-wide identification locations and the lengths of the probes vary in different experiments. of regions with copy number variations at different resolutions This discrepancy in choosing probes poses a challenge in integrated (Carter, 2007). The arrayCGH data was used to discriminate healthy classification or analysis across multiple arrayCGH datasets. We patients from cancer patients and classify patients of different cancer propose an alignment based framework to integrate arrayCGH subtypes. Thus, arrayCGH data is considered as a new source of samples generated from different probe sets. The alignment biomarkers that provide important information of candidate cancer framework seeks an optimal alignment between the probe series loci for the classification of patients and discovery of molecular of one arrayCGH sample and the probe series of another sample, mechanisms of cancers (Sykes et al., 2009). intended to find the maximum possible overlap of DNA copy number ArrayCGH allows rapid mapping of DNA copy-number variations between the two measured chromosomes. An alignment variations of a tumor sample by a locus-by-locus measure. kernel is introduced for integrative patient sample classification and a In an arrayCGH experiment, probes (short DNA segments on multiple alignment algorithm is also introduced for identifying common chromosomes) of different lengths and locations are chosen for regions with copy number aberrations. comparative genomic hybridization. The CNV information at a Results: The probe alignment kernel and the multiple probe probe location is reported by the log-ratio of the probe intensity alignment algorithm were experimented to integrate three bladder between a case and a reference. The sizes and the number of the cancer datasets as well as artificial datasets. In the experiments, probe determines the resolution of an arrayCGH experiment. The by integrating arrayCGH samples from multiple datasets, the probe length of the DNA probes used by current platforms ranges from alignment kernel used with support vector machines significantly 100 bp to 5 kb, and in a typical study, the number of probes ranges improved patient sample classification accuracy over other baseline from several hundreds to tens of thousands. For example, a Human kernels. The experiments also demonstrated that the multiple probe array 2.0 chromium surface array can consist of only 2,464 probes alignment algorithm can find common DNA aberrations that cannot at 1.5 Mb resolution, while a high-resolution tiling array provides be identified with the standard interpolation method. Furthermore, measurements of 36,288 probes. the multiple probe alignment algorithm also identified many known The large difference in the sizes and numbers of probes bladder cancer DNA aberrations containing four known bladder makes integration of multiple arrayCGH datasets used for similar cancer genes, three of which cannot be detected by interpolation. studies a challenging problem. From a data analysis perspective, Availability: http://www.cs.umn.edu/compbio/ProbeAlign the arrayCGH data are series of real values labeled by their Contact: [email protected] chromosomal positions. The traditional approach is an interpolation between the two series of probe locations from two platforms. As described in Fig 1, in an interpolation probe locations on one 1 INTRODUCTION platform are added (interpolated) to the target platform and the corresponding CNV intensities are guessed by intensities of the It has been confirmed in many recent studies that aberrations most nearby probes in the target platform. However, when the in chromosome copy number, rearrangement and structures two platforms have large variations in the number of probes, the et al. have association with disease (Feuk , 2006). Among interpolation is not an appropriate way to integrate the two series these chromosomal aberrations, DNA copy-number variations of probes. Specifically, when the two series are both sparse, the (CNVs), the events of amplification or deletion of a large DNA interpolated points might give misleading or wrong information. For segment on chromosomes, are believed to play an important example, in Fig 1, some of the probes are interpolated to locations et al. role in tumorigenesis (Redon , 2006; Shlien and Malkin, close to probes with opposite CNV. When the probes are sparsely located, this might introduce false positives. Moreover, some of the ∗to whom correspondence should be addressed c Oxford University Press 2010. 1 Tian and Kuang ? ? amplification deletion Interpolation ? interpolated amplification gap gap interpolated Probe deletion Alignment interpolated neutral Fig. 1. A hypothetical example of interpolation and probe alignment. Interpolation and probe alignment are applied to compare the same two chromosomes. The probe locations are marked with vertically striped blocks (amplifications) or horizontally striped blocks (deletions). The interpolation interpolates each probe to its corresponding position on the chromosome and add a new probe in the other series. The CNV value of the new probe is guessed by checking the neighboring probes in the same series. The probe alignment aligns nearby probe pairs in the two series with similar CNVs. The matched probe pairs are connected by arrows in the alignment. The interpolated probes marked with “?” are possibly wrongly labeled in the interpolation. interpolated probes are far from other probes. In these cases (i.e. the types of data. It is worth mentioning that Aach and Church (2001) interpolated probes marked by “?” in the figure), it is ambiguous to proposed to use an alignment algorithm to align time series of gene decide the CNV for the interpolated new probes. expressions. Although motivated by another application, the central In this article, we propose to integrate arrayCGH datasets philosophy of finding a synchronization of two series to replace with probe alignment. As shown in Fig 1, two series of probes simple interpolation is closely related. There are several previous from two arrayCGH samples are aligned based on both their approaches for segmentation and CNV detection from array CGH chromosomal locations and their CNV log-ratios. The probes in data. For example, Guha et al. (2008) proposed a Bayesian hidden one series are matched to the probes in the other series. The Markov model to model relation between neighboring probes. alignment representing an approximate positional matching of the Because these approaches assumes discrete copy number states two compared chromosomes with the maximum possible overlap of without quantifying the distance between probes, they are not CNV events. The probe alignment seeking the maximum possible directly applicable in the multi-platform scenario. CNV overlap is motivated by the fact that two tumor samples are likely to share some common aberration regions related to tumor 2 METHODS development and progression. If a probe in one series shares the In this section, we first describe how to align two probe series. We next same CNV with another nearby probe in the other series, it is likely introduce the probe alignment kernels for classification of CNV samples that the two probes capture the same common tumor aberration and a multiple alignment algorithm for identifying common disease CNV in the two samples. Thus, the two probes should be aligned to regions. Finally, the time complexity of probe alignment is analyzed. enhance the signal. Based on this assumption, a probe alignment is more capable of detecting weak common CNV signals between 2.1 Pairwise alignment of probe series two chromosomes if only sparse information is available. The We denote the series of arrayCGH probes on a chromosome as a problem of finding the best alignment of two probe series can be finite sequence of tuples (x1; l1); (x2; l2);::: (xi; li);:::, where each solved by a variation of the standard global sequence alignment xi denotes the log-ratio intensity of the ith probe and li denotes the algorithm (Durbin et al., 1998). Compared with interpolation, the location of the probe on the chromosome by kilo base pairs (kb). Given two such sequences U = (u ; a ); (u ; a );::: (u ; a ) and V = main advantage of probe alignment is that the copy number of 1 1 2 2 n n (v1; b1); (v2; b2);::: (vm; bm) of length n and m respectively, we define the chromosomal regions between probes are inferred based on the three functions, S(i; j), L(i; j) and R(i; j), for computing the alignment information of all the probes, instead of only the nearby ones, and between the subsequences (u1; a1);::: (ui; ai) in U and the subsequences the optimal global alignment provides the best guess of the CNVs (v1; b1);::: (vj ; bj ) in V . S(i; j) is the optimal alignment score when in these regions. (ui; ai) is aligned with (vj ; bj ); L(i; j) is the optimal alignment score Based on probe alignment, a probe alignment kernel derived from when (ui; ai) is aligned with a gap, and R(i; j) is the optimal score when the probe alignment scores is introduced for classification of patient (vj ; bj ) is aligned with a gap. Finally, a function M(i; j) defined as the samples from multiple datasets. In the classification task, a binary max of S(i; j), L(i; j) and R(i; j) gives the optimal alignment score up to classifier is trained to predict the tumor grade or cancer stages of a position i in U and position j in V .

BIOINFORMATICS Pages 1–7

Analyses of Allele-Specific Gene Expression in Highly Divergent

Genome-Wide Profiling of RNA Editing Sites in Sheep Yuanyuan Zhang1,2, Deping Han1, Xianggui Dong1, Jiankui Wang1, Jianfei Chen1, Yanzhu Yao1, Hesham Y

Chromosome 20 Shows Linkage with DSM-IV Nicotine Dependence in Finnish Adult Smokers

Detailed Characterization of Human Induced Pluripotent Stem Cells Manufactured for Therapeutic Applications

Mclean, Chelsea.Pdf

High-Throughput Biochemical Analysis of In-Vivo Location Data Reveals Novel Distinct Classes of POU5F1(Oct4)/DNA Complexes. ABST

The DNA Sequence and Comparative Analysis of Human Chromosome 20

Uncovering Cancer Gene Regulation by Accurate Regulatory Network Inference from Uninformative Data

RNA Over-Editing of BLCAP Contributes To

And Tissue-Specific Imprinting of a Tumour Suppressor Gene

Evolutionarily Conserved Human Targets of Adenosine to Inosine RNA Editing Erez Y

BLCAP (NM 001167820) Human Tagged ORF Clone Product Data