HEREDITARY COLORECTAL CANCER:

INFORMATION-BASED APPROACH

By

Elena Manilich

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Adviser: Dr. Z. Meral Özsoyoğlu

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2010

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Elena Manilich candidate for the Doctor of Philosophy degree *.

(signed) Z. Meral Özsoyoğlu ______

(chair of the committee)

H. Andy Podgurski ______

Jing Li ______

Tomas Radivoyevitch ______

Gultekin Özsoyoğlu ______

(date) April 7, 2009

*We also certify that written approval has been obtained for any proprietary material contained therein.

Hereditary Colorectal Cancer: Information-Based Approach

Abstract

By

ELENA MANILICH

This project is about computational techniques applied to the practice of medicine, an information-based approach to a medical problem, and the ability to manage medical information in new ways. Using a set of computational tools, we demonstrate how computing allows medical professionals to manage and analyze data for families with inherited colorectal cancer syndromes and identify distinct clinical and genetic patterns suggestive of an inherited disease. Our aim is to predict the likelihood of a family member developing cancer, on the basis of family history, clinical criteria, and genetic data. Toward this end, we have developed a medical knowledge acquisition tool for clinicians that integrates clinical, genetic, and pedigree data and has demonstrated its effectiveness as a diagnostic and therapeutic platform. This system serves as a platform for analytical and data mining methods that can reason over available phenotypic and genome-scale measurements to provide a better understanding of hereditary colorectal cancer. The importance of the framework is illustrated in the context of solving a complex classification problem through a collaborative effort of colorectal surgeons and biologists.

Acknowledgements

It seems only a very short while ago that I stepped off my flight at JFK International Airport together with my three-year-old daughter and the rest of my family to start a new life in the US. Although that was now almost 14 years ago, I realize that I could not have made this kind of progress in my personal life, let alone in my professional and academic career, without all of my friends, colleagues, and collaborators. At the start of this project the task seemed daunting and monumental, and beyond our technical research goals, the major objective was always to find means to improve patient care and to build a practical tool for the clinical setting. The project has turned out to be a success and is now used by a number of institutions worldwide.

This endeavor could not have reached such a favorable outcome without all of the help, support, and guidance that I have received from the academic staff of Case Western Reserve University and the many renowned surgeons and clinicians at the Cleveland Clinic.

I would like to express my utmost gratitude to my thesis advisor, Prof. Meral Özsoyoğlu, who has been nothing but supportive during the entire course of the project. Her vast knowledge and expertise in the field of computer science seemed almost intimidating at times, and I am deeply grateful to her for the insights, advice, and guidance I have received.

Another group of individuals without whom this project could not have succeeded are the talented and esteemed surgeons and physicians at the Cleveland Clinic. My colleagues at the Clinic were nothing short of supportive and understanding throughout. It would not have been possible for me to carry out a multidisciplinary project without their knowledge and clinical expertise. I would like to especially thank Dr. Victor Fazio, Dr. Feza Remzi, and Dr. James Church for their continued support, advice, and guidance over the course of the entire project. Their help and contributions made this work a much more useful tool in the clinical setting, and my heartfelt gratitude goes out to them and to the others who allowed me to take part in patient care.

I want to express my gratitude to all of my academic colleagues and to my friends and family, too numerous to mention here, for their support and friendship; all of them contributed in some way to this work.

Contents

1 INTRODUCTION ...... 7
1.1 CONTRIBUTIONS ...... 7
1.2 DISSERTATION ORGANIZATION ...... 9
2 HEREDITARY COLORECTAL CANCER: MEDICAL AND COMPUTATIONAL FOUNDATION ...... 11
2.1 STATEMENT OF A MEDICAL PROBLEM ...... 11
2.2 EXISTING INFORMATION-BASED FRAMEWORKS FOR HEREDITARY CANCER ...... 15
2.3 INTERDISCIPLINARY APPROACH ...... 19
3 DATA MODEL ...... 22
3.1 REQUIREMENTS AND DESIGN DECISIONS ...... 22
3.2 THE DAG PEDIGREE DATA MODEL ...... 23
3.3 IMPLEMENTATION OF A RELATIONAL SCHEMA ...... 24
4 COLOGENE ...... 27
4.1 HISTORY OF COLOGENE ...... 27
4.2 ARCHITECTURE OF COLOGENE ...... 29
4.3 DESIGNING FOR EFFECTIVENESS ...... 30
4.4 COLOGENE – PEDIGREE EDITOR ...... 31
4.5 COLOGENE – USAGE ANALYSIS ...... 39
5 PEDIGREE EXPLORER ...... 41
5.1 PEDIGREE EXPLORER INTERFACE ...... 42
5.2 EXPRESSIVE POWER OF PEDIGREE EXPLORER ...... 51
5.3 QUERY SETS ...... 56
5.4 PEDIGREE EXPLORER ARCHITECTURE ...... 59
5.5 PEDIGREE QUERY INTERFACE AND NODECODES ...... 61
6 KNOWLEDGE DISCOVERY TECHNIQUES IN HEREDITARY CANCER DATA ...... 69
6.1 MOTIVATIONAL EXAMPLE: CAN DESMOIDS BE PREDICTED? ...... 69
6.2 FAP FAMILY DATA ...... 71
6.3 CLUSTERING DESMOID PATIENTS ...... 73
6.4 CLASSIFICATION BASED ON ASSOCIATION ...... 74
6.5 CLASSIFICATION MODELS FOR DESMOIDS ...... 76
6.6 MEDICAL IMPLICATION OF KNOWLEDGE DISCOVERY ...... 78
7 MINING MICROARRAY DATA - RANDOM FOREST ...... 80
7.1 MOLECULAR SIGNATURE FOR COLON CANCER PATIENTS ...... 81
7.2 RANDOM FOREST PREDICTION RESULTS ...... 82
7.3 RANDOM FOREST CLASSIFIER FOR MICROARRAY DATA ...... 84
7.4 RANDOM FOREST CLASSIFIER - DECISION TREE ALGORITHM ...... 86
7.5 PREVIOUS WORK - OPTIMIZATION OF DECISION TREE ALGORITHMS ...... 91
7.6 SCALABLE IMPLEMENTATION OF RANDOM FOREST ...... 94
7.7 FROM RANDOM FOREST TO CLINICALLY FEASIBLE MOLECULAR SIGNATURE ...... 105
8 CONCLUSIONS ...... 108
8.1 SUMMARY ...... 108

Table of Figures

Figure 2.1 Pedigree of a family with hereditary colorectal cancer syndrome ...... 14
Figure 3.1 The DAG pedigree data model ...... 24
Figure 3.2 Partial view of the Cologene database diagram ...... 25
Figure 4.1 Pedigree Symbols and Relationship Lines ...... 33
Figure 4.2 Snapshot of the Pedigree Editor GUI displaying the list of family members (left) and their pedigree in graph form (right) ...... 35
Figure 4.3 Data panel for adding a new family member ...... 36
Figure 4.4 The Polyp Details interface for calculating severity of duodenal polyps ...... 38
Figure 5.1 Pedigree Explorer ...... 45
Figure 5.2 Top query panel displaying the top level of a query tree ...... 46
Figure 5.3 Top query panel displaying Example 1 parameters ...... 47
Figure 5.4 SQL statement for Example 1 ...... 48
Figure 5.5 Right upper panel displaying query results ...... 49
Figure 5.6 Pedigree Editor displaying the selected family; affected members are highlighted ...... 50
Figure 5.7 Pedigree Explorer displaying the query results ...... 51
Figure 5.8 Query Tree ...... 53
Figure 5.9 Example hierarchy of Pedigree Explorer Data ...... 57
Figure 5.10 XML schema ...... 58
Figure 5.11 Architecture of a tree-structure query interface for pedigree data ...... 60
Figure 5.12 Pedigree Explorer ...... 61
Figure 5.13 Nodecodes labeling on a sample pedigree graph ...... 64
Figure 6.1 Partial view of the pedigree for a family with an FAP syndrome and a subset of clinical findings and pedigree relations used in the data mining analysis ...... 72
Figure 6.2 Rule-Based Classifier: Association Rule 1 ...... 76
Figure 6.3 Rule-Based Classifier: Association Rule 2 ...... 77
Figure 6.4 Rule-Based Classifier: Association Rule 3 ...... 77
Figure 6.5 Rule-Based Classifier: Association Rule 4 ...... 78
Figure 7.1 Kaplan-Meier survival plots for patients with differentially expressed genes selected by the random forest classification algorithm ...... 83
Figure 7.2 Decision tree (right) for the concept of cancer recurrence ...... 90
Figure 7.3 Representation of decision tree in the form of a graph ...... 90
Figure 7.4 Random forest algorithm for inducing an ensemble of decision trees from training samples ...... 98
Figure 7.5 Sample microarray expression data ...... 101
Figure 7.6 List of sorted indices and hash table data structures used in the algorithm for the samples data in Figure 16 ...... 101
Figure 7.7 A performance evaluation comparing optimized implementation of random forest with the original implementation in Weka with constant number of attributes ...... 103
Figure 7.8 A performance evaluation comparing optimized implementation of random forest with the original implementation in Weka with constant number of trees ...... 103
Figure 7.9 Kaplan-Meier survival analysis on 96 frozen tumor samples using the 10 top prognostic genes selected by random forest ...... 106

Table 1 Cologene's Data Statistics ...... 40
Table 2 Sample Queries ...... 44
Table 3 Sample nodecodes relational table ...... 65
Table 4 Sample pedigree queries used to evaluate performance of the nodecode technique ...... 67
Table 5 Desmoid clusters: dissimilarity measures for probands ...... 74
Table 6 Desmoid clusters: dissimilarity measures for all family members ...... 74
Table 7 The top ten genes selected by the random forest classifier ...... 105


1 Introduction

As medicine becomes increasingly information intensive, the need to understand, create, and apply new methods to model, manage, and acquire information has never been greater. Most of this thesis takes a biomedical informatics point of view to find new methods that will enable scientists and clinicians to access and apply discipline-specific knowledge, to learn from clinical and experimental data, and to advance the practice of medicine as a result. In this work, we present new ways to acquire, represent, and process data and knowledge related to hereditary colorectal cancer. The hope is that this will lead to a better understanding of the disease and new ways to treat it. We present a series of novel computational techniques in a cross-disciplinary approach that combines expertise in the areas of data modeling and management, data mining, medical genetics, surgical techniques, and bioinformatics.

1.1 Contributions

The main contributions of this thesis are the following:

• A data model that describes concepts and relationships that are important in the hereditary colorectal cancer domain is presented. The knowledge-based data model captures complex pedigree relationships and clinical findings and allows specialty specific pedigree annotations. It also provides support for diagnostic, analytical and data mining tasks.

• A computerized system called Cologene that provides genetic counselors and clinicians with the domain-friendly support needed to create pedigree models and enter data. The Cologene software system takes advantage of structured information to simplify the knowledge acquisition process. The viability of this software is demonstrated by the growth of the international community of users who add to Cologene's knowledge base and thus direct its further evolution.

• We underline the effectiveness and importance of the above-mentioned knowledge-based data model and advanced computational tools by applying knowledge discovery methods to a large pedigree database. Using a novel data mining method, we estimate the risk of desmoid disease, a deadly condition, for patients with a hereditary syndrome.

• Microarray technology, which can measure expression levels of thousands of genes, is the most promising tool available to researchers for conducting phenotype-genotype association studies linking clinical outcomes of patients with hereditary diseases to their genomic profiles. However, this technology poses a computational challenge to classical data mining and machine learning techniques, as the amount of data generated by these experiments is tremendous. In this thesis, we propose a new framework in which an optimized implementation of a random forest decision tree algorithm produces excellent classification results for high-dimensional microarray data. The scalable implementation of the random forest approach is used to look for differences between the genomes of patients with recurrent colon cancer and those without.

The implications of this project are significant. The onset and associated risk of occurrence of a particular condition can be predicted even for related individuals who have never sought treatment. Based on our computational techniques, relevant information will provide clinicians with easily interpretable results. Our computational system and techniques allow early detection, facilitate diagnoses, and identify high-risk individuals based on their genetic profile. This should relieve some strain on limited hospital resources and lower costs by allowing treatments at early stages, or perhaps even before onset (e.g. through chemoprevention). We hope that our computational system will serve as a useful model not only for colorectal conditions but also for numerous other diseases in need of improved diagnoses, identification of those at risk, and thus prevention of their genetically-linked anticipated conditions.

1.2 Dissertation Organization

This thesis can be roughly divided into three parts. The first part, which presents the problem from a medical perspective and describes existing computational tools available to clinicians and researchers, begins with Chapter 2. In this part, we discuss the lack of a comprehensive computational framework and stress the importance of our interdisciplinary approach to this problem. A new approach to data management that integrates novel computing technologies and medical science may have a significant impact on cancer prevention and morbidity reduction. We illustrate how the new framework and series of novel computational techniques we have developed have improved our understanding of the disease [18].

The second part introduces Cologene, a software application for clinicians that includes a domain-specific, knowledge-based data model (Chapter 3); Pedigree Editor, a set of graphical tools to manipulate and edit pedigrees (Chapter 4); and Pedigree Explorer, an advanced query interface (Chapter 5). One of our aims, stated in broad and general terms, is to provide clinicians with the tools needed to gather and maintain a detailed database of a patient's medical history, family history, and relevant genetic information. Variations of parts of this work have been published elsewhere [32,50].

The third part deals with data mining techniques that help advance the understanding of a hereditary disease (Chapter 6) and an optimized implementation of an advanced decision tree algorithm that is suitable for mining high-dimensional genome-scale measurements (Chapter 7). Our analytical tools, techniques, and methods, described later in detail, help identify individuals at a certain level of risk for a particular type of condition. With these tools in hand, a clinician will be able to make more informed decisions about how to proceed with treatment or establish a diagnosis. As a result, individuals at high risk for developing a particular condition can be identified, and prophylactic treatment can begin at a much earlier stage. It is also possible to identify individuals related to a patient who are at high risk for a genetically-linked condition and who have not yet sought any type of treatment. These individuals can then be treated at an early stage of the disease or even before its onset; when risks are extremely high, preventative measures are warranted in addition to regular appropriate screenings. Parts of this work have been discussed in other publications [30,44,51].

2 Hereditary Colorectal Cancer: Medical and Computational Foundation

2.1 Statement of a Medical Problem

The annual incidence of colorectal cancer in the United States is approximately 148,300 cases, of which 56,000 result in death [49]. While the lifetime risk of colorectal cancer in the general population is approximately 5 to 6 percent, patients with a familial risk, defined as those who have two or more first- or second-degree relatives with colorectal cancer, make up approximately 20 percent of all patients with colorectal cancer. The two main dominantly-inherited syndromes of colorectal cancer are familial adenomatous polyposis (FAP) and hereditary nonpolyposis colorectal cancer (HNPCC). Although these syndromes account for only 2 to 4 percent of all colorectal cancers, they are significant to the extent that they reveal mechanisms of colorectal tumorigenesis and cancer risks in afflicted patients [58].
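The familial-risk criterion above (two or more affected first- or second-degree relatives) is directly computable from pedigree data. A minimal sketch, assuming simple parent and child mappings; all names, structures, and the `familial_risk` helper are hypothetical illustrations, not part of any system described in this thesis:

```python
def first_second_degree(parents, children, person):
    """First- and second-degree blood relatives of `person`, given
    dicts mapping individual -> parents and individual -> children."""
    P = set(parents.get(person, []))                          # parents
    C = set(children.get(person, []))                         # children
    sibs = {s for p in P for s in children.get(p, [])} - {person}
    first = P | C | sibs
    gparents = {g for p in P for g in parents.get(p, [])}     # grandparents
    gchildren = {g for c in C for g in children.get(c, [])}   # grandchildren
    aunts = {a for g in gparents for a in children.get(g, [])} - P
    nieces = {n for s in sibs for n in children.get(s, [])}
    second = (gparents | gchildren | aunts | nieces) - first - {person}
    return first, second

def familial_risk(parents, children, affected, person):
    """Two or more affected first- or second-degree relatives."""
    first, second = first_second_degree(parents, children, person)
    return len((first | second) & affected) >= 2

# Hypothetical family: grandparents -> father; father + mother -> ego, sib
parents = {"father": ["gf", "gm"], "ego": ["father", "mother"],
           "sib": ["father", "mother"], "niece": ["sib"]}
children = {"gf": ["father"], "gm": ["father"], "father": ["ego", "sib"],
            "mother": ["ego", "sib"], "sib": ["niece"]}
print(familial_risk(parents, children, {"father", "gf"}, "ego"))  # True
```

This toy covers only the common relative classes (siblings, grandparents, grandchildren, aunts/uncles, nieces/nephews); half-relatives and more exotic structures would need a general kinship computation.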

Familial adenomatous polyposis (FAP) is inherited as an autosomal dominant trait. FAP is characterized by the early development of hundreds to thousands of colorectal adenomas (polyps) and a variety of extra-colonic manifestations. If left untreated, nearly all affected family members die from colorectal cancer by about 40-50 years of age. More than 800 mutations of the APC gene are associated with FAP; the majority are located in exon 15, result in 100 or more colorectal adenomas, and confer a confirmed risk of colorectal cancer, by an average age of 40 years, in individuals who remain surgically untreated [6,8,41]. Patients who have an APC gene mutation or one or more first-degree relatives with FAP are at high risk.

The other major dominantly-inherited syndrome is hereditary nonpolyposis colorectal cancer (HNPCC). This syndrome is the result of a mutation in one of the mismatch-repair genes, typically MLH1 or MSH2, which account for most of the identified mutations. HNPCC is the most common form of hereditary colorectal cancer; it is characterized by multiple occurrences of colorectal cancer in the same family, often with synchronous colorectal tumors, along with a high association of extra-colonic cancers. A particular pattern of primary cancers within the pedigree, such as colonic and endometrial, helps to identify affected family members.

A positive family history of colorectal cancer is a sign of risk for colorectal neoplasia. The significance of a family history of colorectal neoplasia is also widely recognized. Recent studies estimated the degree of risk for colorectal cancer associated with various strengths of family history [48]. While a family history of colorectal cancer is a clinically significant risk factor, a positive history does not necessarily imply that everyone within this group is at high risk. Familial clustering of colorectal cancer may be due to chance, and this accounts for most instances of a positive family history in which only a single relative older than 60 years is affected. In addition, environmental factors such as a cancer-promoting diet or workplace exposures may also contribute to familial clustering. Therefore, a carefully compiled family history with detailed information about family members is necessary to suggest a hereditary colorectal cancer syndrome in the family.

The most important step leading to the diagnosis and appropriate treatment of a hereditary cancer syndrome is the compilation of a thorough family history of cancer in the form of a pedigree [49]. Aspects of a family history that may help determine an individual's risk include the number of affected relatives and generations, the closeness of the relationship to affected relatives, the age at which the affected relatives were diagnosed, and the presence of an adenoma. A large body of previous work demonstrates the advantage of seeking an extensive family history that includes clinical findings suggestive of hereditary colorectal cancer [17,45,49]. A Japanese study on the natural history of hereditary nonpolyposis colorectal cancer based on chronological family history is one such example [43]. In this study the analysis was based on changes in the family pedigree over time. The study stressed the importance of checking the family pedigree periodically; as a result, successive family histories of hereditary colorectal cancers have become a primary means of studying changes in colorectal cancer incidence.

Practical solutions to the management of patients with hereditary colorectal cancer syndromes start with family pedigree development to secure a definitive hereditary cancer syndrome diagnosis. Pedigree information should focus on the identification of cancer types and sites and, in particular, patterns of primary cancer segregation within the pedigree, age at the onset of cancer, associated phenotypic features that may be related to cancer, and pathological findings. This information frequently identifies a hereditary colorectal syndrome in the family. Once a hereditary colorectal syndrome is indicated by the family history, a determination is made as to who in the family requires DNA testing and which of them require intensive follow-up if their DNA tests are positive for mutations. Further molecular genetic testing then provides verification of the diagnosis.

Figure 2.1 Pedigree of a family with hereditary colorectal cancer syndrome.

Although much has been discovered about FAP, the rarity of the syndrome has limited the role of data analysis in its further investigation. An appropriate analysis of large amounts of data would enable risk assessment for certain conditions related to FAP. For instance, the risk of developing desmoid disease is an important factor in determining the timing of abdominal surgery in patients with FAP. If desmoid-prone patients can be identified, an opportunity for desmoid chemoprevention may exist before the disease is clinically evident. Duodenal cancer is the second most common cause of death in patients with FAP; therefore, the ability to define the risk of serious upper gastrointestinal neoplasia would allow surveillance efforts to be concentrated on patients at high risk and minimized where risk is low. Another important problem is to examine relationships between the phenotypic expression of FAP and the site of mutation in the APC gene. A set of phenotypic features can be identified to predict a specific region of APC mutations; conversely, the problem can be translated to predicting clinical outcomes based on a mutation site. These important questions can be addressed by appropriate analysis of clinical and genetic data.

This interdisciplinary approach aims at exploiting novel data management and data mining techniques for an integrated analysis and management of pedigree data, patient data, and the data on the disease itself. The information it provides, in turn, addresses some of the aforementioned questions in hereditary colorectal cancer for the purposes of early diagnosis and better treatment.

2.2 Existing Information-Based Frameworks for Hereditary Cancer

Church and colleagues [20] reported on a survey concerning the presence of registries or centers for hereditary colorectal cancer families across the United States. The survey included details about the database management systems and pedigree management software used by hereditary colorectal cancer centers in the United States. The 26 centers that responded to the survey reported having 1,396 FAP families, 2,058 HNPCC families, and 258 families with other hereditary syndromes. Seven of the 26 responding centers used Microsoft Access as a relational database, one used FileMaker Pro, and one used Oracle. Cyrillic [15] was the pedigree drawing software in six registries, Ped Draw [52] in one, and Progeny [60], with an underlying customizable relational database, in 17. Software used by these inherited colorectal cancer centers generally falls into one of three categories: software that provides pedigree drawing tools and a limited set of data fields to record medical findings (Cyrillic, Ped Draw); software that offers a pedigree drawing interface with an underlying customizable database (Progeny, Cyrillic v2); and data management systems, typically relational database systems, that do not directly support the graphical display of a family tree. Commercial packages such as Cyrillic, Progeny, and Ped Draw implement excellent pedigree-drawing functions; however, these applications are designed for geneticists and do not meet the requirements for clinical assessment of families with hereditary colorectal syndromes.

RAGs (Risk Assessment in Genetics), a modified version of Cyrillic, is a decision support application for evaluating the genetic risk of breast and colorectal cancer [25].

RAGs’ design centers on the ease with which general practitioners can evaluate patients’ genetic risk of cancer on the basis of family history, and thus make appropriate referral decisions. Development of the program is based on the assumption that pedigrees being entered are likely to be fairly simple. The system has been designed to minimize initial learning costs to clinicians during real-time consultation sessions. However, more detailed information eventually has to be provided by the system to cater not only to the needs of general practitioners but also to genetic counselors and clinicians interested in hereditary cancer research.

PedHunter, another software package for the construction of large genealogies [1], provides tools to organize genealogical information in a relational database and a query facility to access the database. The package introduces the concept of a relation for the representation of all types of family relationships using simple tables. Furthermore, it supports functions for finding relatives as well as functions for testing the closeness of the relationship between two individuals. The queries featured in this software include options to extract the most recent common ancestors of a given set of people, to search for a pedigree within a large genealogy that contains all shortest paths between one common ancestor and a set of individuals, and to find all descendants of a given individual. Simple queries are implemented in Structured Query Language (SQL), while path queries are answered programmatically with graph-theory algorithms. PedHunter does not provide pedigree visualization tools, but it does produce pedigree files that can be read by other pedigree maintenance programs such as Cyrillic [15].
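To make the ancestor-query functionality concrete, here is a small sketch (not PedHunter's actual code; names and data are hypothetical) of computing the common ancestors of a set of individuals. The most recent common ancestors are then the members of this set that have no descendant also in the set:

```python
from collections import deque

def ancestors(parents, person):
    """All ancestors of `person`, given a dict mapping each
    individual to a list of their parents (BFS up the pedigree)."""
    seen, queue = set(), deque(parents.get(person, []))
    while queue:
        p = queue.popleft()
        if p not in seen:
            seen.add(p)
            queue.extend(parents.get(p, []))
    return seen

def common_ancestors(parents, people):
    """Ancestors shared by every individual in `people`."""
    sets = [ancestors(parents, p) for p in people]
    return set.intersection(*sets) if sets else set()

# Hypothetical pedigree: gp1, gp2 -> f; f + m -> c1, c2
parents = {"f": ["gp1", "gp2"], "c1": ["f", "m"], "c2": ["f", "m"]}
print(sorted(common_ancestors(parents, ["c1", "c2"])))  # ['f', 'gp1', 'gp2', 'm']
```

Here the most recent common ancestors of c1 and c2 would be f and m, since gp1 and gp2 have a descendant (f) that is itself a common ancestor.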

MEGADATS, one of the earliest pedigree plotting systems [35], was used to conduct a study of Huntington's disease, a hereditary disorder of the central nervous system. The software provides a range of functionality that includes structured representation of information, retrieval of genealogical knowledge, integration of expert system components, and a system for pedigree drawing and symbolic representation of information. It uses a pedigree plotting system designed to display a human pedigree on a graphical device. Major advantages of this plotting algorithm are that it can plot families with multiple mates and multi-rooted trees, rather than only a single tree. MEGADATS uses a pointer scheme to drive the pedigree input process; the content and layout of a pedigree are defined by the user at the time the plot is drawn.

The PViN (Pedigree Visualization and Navigation) system enables the visualization and printing of extensive pedigrees [71]. The system incorporates techniques from the field of information visualization for rendering and printing pedigrees [54]. The effectiveness of its layout techniques was demonstrated on a 40,000-record database and was compared to MEGADATS [5], the legacy system for pedigree drawing. PViN provides a user interface to display multiple pedigree trees rendered from different databases and database servers and seamlessly shifts focus between them. Navigation methods allow users to change the zoom factor of the focus pedigree window, reposition the focus to the selected portion of the tree, and map individual nodes to corresponding printout pages to locate individuals in the tree. An essential property of this system is its use of effective techniques to visualize very large pedigrees.

As of now, pedigree management frameworks do not support efficient representation and evaluation of complex pedigree queries. The most interesting queries described by PediTree [68] and PViN [71] retrieve all ancestors or descendants of an individual by recursively constructing an SQL query. PViN focuses on visualization, printing, and navigation methods for very large pedigrees stored in relational databases; it supports calculations such as inbreeding estimation. Other academic work, PedHunter [1], focuses on the tasks of pedigree verification, relative identification, and determination of how sets of individuals are related. This platform does not provide efficient evaluation of long path queries.
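As an illustration of the recursive-SQL style of descendant query mentioned above, here is a sketch using SQLite's recursive common table expressions. The schema and data are hypothetical, not the schema of any of the surveyed systems:

```python
import sqlite3

# Minimal pedigree schema (hypothetical): one row per parent-child edge.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person(id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE parent_of(parent_id TEXT, child_id TEXT);
    INSERT INTO person VALUES ('p1','Ann'),('p2','Ben'),('p3','Cara'),('p4','Dan');
    INSERT INTO parent_of VALUES ('p1','p2'),('p1','p3'),('p3','p4');
""")

# All descendants of p1 via a recursive CTE -- the declarative
# counterpart of recursively constructing an SQL query in code.
rows = db.execute("""
    WITH RECURSIVE desc(id) AS (
        SELECT child_id FROM parent_of WHERE parent_id = 'p1'
        UNION
        SELECT p.child_id FROM parent_of p JOIN desc d ON p.parent_id = d.id
    )
    SELECT name FROM person JOIN desc USING (id) ORDER BY name;
""").fetchall()
print([r[0] for r in rows])  # ['Ben', 'Cara', 'Dan']
```

The `UNION` (rather than `UNION ALL`) deduplicates individuals reachable along multiple paths, which matters for pedigrees that are DAGs rather than trees.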

Most of the existing pedigree management frameworks do not incorporate domain-specific knowledge such as that acquired by physicians, colorectal surgeons, and genetic counselors who are involved in the treatment and follow-up of patients with colorectal cancer. A promising direction for managing information about families with hereditary cancer syndromes, and for using this information effectively, is a data management system that integrates in-depth, specific medical knowledge.


2.3 Interdisciplinary Approach

New approaches to data management that integrate novel computing technologies and medical science may have a significant impact on cancer prevention and morbidity reduction. In this work, we discuss properties of pedigree data that are highly complex and dynamic and the challenges this presents to pedigree data management in applications that attempt to reach a fundamental understanding of disease. We propose a series of novel computational techniques that can be applied to a large patient base with a rare inherited disease. This is a cross-disciplinary approach that combines expertise in the areas of data modeling and management, data mining, medical genetics, surgical techniques and bioinformatics.

We introduce Cologene, an integrated software tool for storing, visualizing and analyzing pedigree data. The core of this software is a knowledge-based data model for hereditary cancer that effectively manages familial relationships in the form of a graphical pedigree structure that also represents clinical and genetic findings. The data model supports analytical, diagnostic and data mining tasks. The novel features of the system include: graph-structured pedigree data integrated with other clinical and genetic data; database design, implementation and graphical tools to visualize and query pedigree data; and an implementation that allows the expression of complex queries on patterns of familial relationships involving a large number of pedigrees. Through computerized data acquisition and management, physicians and geneticists can thoroughly analyze pedigree data and correlate the findings with genomic data. The system is designed to make

familial risk of colorectal cancer more readily quantifiable, make risk assessment easier, and correlate many combinations of family history with risk status.

Cologene has two basic subsystems: Pedigree Editor, a tool that allows visualization and editing of pedigree data, and Pedigree Explorer, a system that implements advanced query capabilities. Pedigree Explorer allows users to specify pedigree queries in an intuitive and dynamic manner and displays results in both tabular and graphical formats. Consider the query “identify families and family members with at least one first degree relative affected with either duodenal, thyroid or pancreatic adenoma” or “identify families and family members whose ascendants had been diagnosed with multiple primary cancers”. At present, no mainstream pedigree management system allows users to specify such structural queries in a simple way.

The Pedigree Explorer tool was inspired by the Pathway Explorer [56] and uses NodeCodes, an encoding scheme for efficient evaluation of queries on pedigree graphs proposed by Elliott et al. [31].

The usefulness of the proposed data model and pedigree management framework also lies in the ability to apply novel data mining techniques to large amounts of data, to refine information into clinical knowledge, and to link discovered clinical patterns to biological data. An example of this is given in our recent work [51], presented to the American Society of Colon and Rectal Surgeons. Using a very large clinical database, we demonstrated how computing allows us to analyze data for families with hereditary colorectal cancer syndromes and identify distinct clinical patterns suggestive of desmoid disease, a life-threatening complication associated with hereditary cancer.

To conduct genotype-phenotype association studies based on large microarray expression data, we propose a scalable implementation of Random Forest, a classification algorithm well suited to microarray data. Its purpose is to develop an accurate genetic predictor of disease recurrence for early-stage colon cancers using entire genome expression profiles. We demonstrate the excellent performance of Random Forest in classification tasks and propose a fast, scalable implementation of the algorithm.
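The bagging idea behind Random Forest can be sketched in a few lines. The toy version below is an illustration only, not the scalable implementation proposed in this work: it uses decision stumps rather than full trees as base learners, but it shows the two sources of randomness the algorithm combines, bootstrap sampling of patients and a random subset of roughly sqrt(d) features (genes) per tree.

```python
import random
from collections import Counter

def best_stump(X, y, features):
    """Pick the (feature, threshold, sign) single-split rule with the
    fewest training errors; sign -1 inverts the prediction."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            pred = [1 if row[f] > t else 0 for row in X]
            direct = sum(p != yi for p, yi in zip(pred, y))
            err, sign = (direct, 1) if direct <= len(y) - direct else (len(y) - direct, -1)
            if best is None or err < best[0]:
                best = (err, f, t, sign)
    return best[1:]

def stump_predict(stump, row):
    f, t, sign = stump
    p = 1 if row[f] > t else 0
    return p if sign == 1 else 1 - p

def random_forest_fit(X, y, n_trees=25, seed=0):
    """Each base learner sees a bootstrap sample of the rows and a random
    feature subset, as in Random Forest."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    m = max(1, int(d ** 0.5))                       # features per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap rows
        feats = rng.sample(range(d), m)             # random feature subset
        forest.append(best_stump([X[i] for i in idx],
                                 [y[i] for i in idx], feats))
    return forest

def random_forest_predict(forest, row):
    """Majority vote over the ensemble."""
    return Counter(stump_predict(s, row) for s in forest).most_common(1)[0][0]
```

A production version for expression data would replace the stumps with full trees, train them in parallel, and report out-of-bag error, which is where the scalability work lies.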

3 Data Model

3.1 Requirements and Design Decisions

A knowledge-based data model for hereditary cancer should effectively manage familial relationships that are best presented as a graphical pedigree structure, but also capture clinical and genetic findings that are complex and temporal in nature. It should support analytical and diagnostic tasks, data mining and queries. Pedigree data is used by physicians and geneticists to record and display patterns of the disease within a family. It allows calculation of risk and is a key step in establishing a diagnosis. The purpose of this data collection should be the identification of particular patterns of primary cancer, age at onset, patterns of extra-colonic manifestations, distribution of colonic adenomas, genetic mutations, and other clinically important features. Because of the complexity of hereditary colorectal syndromes, including the impact of various medical and genetic conditions on the risk of cancer and the various treatment options available to patients, a data model for hereditary cancer is challenged by the requirement to reflect all of this information. Since physicians, surgeons, genetic counselors, and biologists may be interested in different aspects of the disease, the data model should cover a wide spectrum of data, including complete family history, detailed endoscopic evaluations, extra-colonic manifestations, surgical procedures, and results of genetic testing.

Based on these requirements, we are developing methods to store and manage a large collection of pedigree data in a Relational Database Management System, because relational database query engines remain the most efficient and effective among back-end database systems, providing: (i) sophisticated search facilities and query languages with high expressive power; (ii) extensive support for statistical analysis of large data sets; (iii) multi-user capabilities; and (iv) a robust, scalable and effective means of data recovery and backup. A rich body of medical knowledge, including pedigree relationships, is stored in a relational database whose main objects encapsulate information about families and family members. A specialized graph-based data model, a Directed Acyclic Graph (DAG), is used to represent familial relationships. The relational schema uses a straightforward implementation of this model. The next section discusses the underlying DAG model and how it is implemented as a relational schema.

3.2 The DAG Pedigree Data Model

For purposes of this proposal, a pedigree is a visual depiction of a family, usually constructed with circles and squares, in which patterns of a specific disease can be identified visually. Formally, a pedigree can be defined as “a simplified diagram of a family's genealogy that shows family members' relationships to each other and how a particular trait or disease has been inherited [55].” We present a pedigree as a directed graph composed of nodes and edges, where each individual is represented by a node and nodes are connected by directed edges from parents to children. The graph-based pedigree data model is a directed acyclic graph (DAG) with special properties: nodes with no incoming edges, that is, with an in-degree of 0, are called progenitors, and all other nodes have an in-degree of 2 because, from a genealogical perspective, each individual has exactly two biological parents. In this model, each node is associated with a set of NodeCodes that result from depth-first traversals of the family tree and the assignment of codes to each node [43]. The NodeCodes labeling method, an annotation for each node, is used to encode the hierarchical structure of the pedigree and to evaluate structural pedigree queries efficiently. The proposed implementation of the NodeCodes technique will be described in the next section. The principal elements of the DAG pedigree model are depicted in Figure 3.1.

Figure 3.1 The DAG pedigree data model.
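To make the labeling idea concrete, the sketch below assigns prefix codes to a small pedigree DAG. This is a simplified illustration of the NodeCodes idea, not the exact scheme of [43]: each progenitor receives a root code, every child appends a child index to each of its parents' codes (so a node with two parents carries codes from both lines), and "a is an ancestor of b" reduces to a string-prefix test. This naive version does not control the number of codes per node, which the published schemes do.

```python
from collections import defaultdict

def assign_nodecodes(parents):
    """parents maps each individual to its list of parents ([] for
    progenitors). Returns individual -> set of dot-terminated codes;
    a child inherits every parent code extended by a child index."""
    children = defaultdict(list)
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)
    codes = defaultdict(set)
    roots = sorted(n for n, ps in parents.items() if not ps)
    for i, r in enumerate(roots):
        codes[r].add(f"{i}.")

    def propagate(node):
        # Push all codes of this node down to each child, then recurse.
        for j, child in enumerate(sorted(set(children[node]))):
            for code in codes[node]:
                codes[child].add(f"{code}{j}.")
            propagate(child)

    for r in roots:
        propagate(r)
    return dict(codes)

def is_ancestor(codes, a, b):
    """a is a proper ancestor of b iff some code of b extends a code of a."""
    return any(cb != ca and cb.startswith(ca)
               for ca in codes[a] for cb in codes[b])
```

The dot after every component keeps the prefix test unambiguous (code "1." is not a prefix of "12."), which is why structural queries can be answered by string comparison alone.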

3.3 Implementation of a Relational Schema

This section describes a conceptual view of the database and the implementation of the DAG pedigree data model. The Cologene data model uses a structured knowledge representation that conceptualizes information about medical conditions relevant to hereditary cancer. It effectively manages familial relationships in the form of a graph. A partial view of the database schema is depicted in Figure 3.2. At the core of the knowledge-based system is a suite of entities that formally represent medical conditions.


Figure 3.2 Partial view of the Cologene database diagram.

The data model uses a distinct representation in which the relations that encapsulate medical data are designed to be problem-specific, that is, related to a particular aspect of the disease. It organizes patient data chronologically, emphasizing changes over time and capturing the progression of the disease.

• The polyps relation models characteristics of gastrointestinal polyps. It includes the date and method of diagnosis as well as values for the size, shape and pathology of the polyps.

• The cancer relation lists cancer sites, histological findings, cancer stage, size of each tumor, and treatment details.

• The extra-colonic manifestations relation provides a structure to specify a wide spectrum of extra-colonic benign and malignant anatomic lesions that correlate with sites of genetic mutations/lesions in the DNA [19].

The Cologene data model is built on a knowledge-based system that allows quantification of familial risk based on a specific scoring system, such as the one introduced by Church [17], or calculation of the severity of duodenal polyps, to give two meaningful and clinically important examples. The data model can support a sophisticated summarization strategy, such as detection of significant changes or aggregation of abnormal observations into a summary diagnostic statement. It supports efficient implementation of pedigree queries that involve an ancestor-descendant relationship.
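As an illustration of how such a scoring system plugs into the model, the sketch below computes a familial risk score from a list of relatives. The weights are hypothetical placeholders, not the values published by Church [17]: first-degree relatives with colorectal cancer contribute more than second-degree relatives, and early onset doubles the contribution.

```python
def familial_risk_score(relatives, young_age=50):
    """relatives: iterable of (degree, affected, age_at_diagnosis) tuples,
    with degree 1 for first-degree and 2 for second-degree relatives and
    age_at_diagnosis None if unknown. Weights are illustrative only."""
    score = 0
    for degree, affected, age in relatives:
        if not affected:
            continue
        points = 2 if degree == 1 else 1       # first-degree counts double
        if age is not None and age < young_age:
            points *= 2                        # early onset raises the score
        score += points
    return score
```

A structural query over the pedigree would gather the relative tuples for each family member and feed them to such a function, which is exactly the pattern the ancestor-descendant queries above are meant to support.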

• The family relation contains data about the family syndrome and genetic mutations found in the family.

• The family members relation models the pedigree structure, which is encoded by an individual's mother and father; it includes demographic data, the hereditary status of an individual, and genetic tests. Special annotation is used for dizygotic and monozygotic twins or multiples, i.e., more than one offspring from the same pregnancy.

• The node-code relation implements a novel node-code labeling technique for pedigree data by Elliott et al. [32], where each family member is identified by a distinct label. The node-codes system is used to answer interesting and complex pedigree queries without multiple costly joins and traversals of the entire graph. Implementation and use of the node-code system are described in the next section.

• The marriage relation allows representation of multiple marriages and reflects different types of relationships between parents, including informal, divorced and consanguineous (the union of individuals having a common ancestor).
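A minimal relational rendering of these ideas is sketched below using SQLite; the table and column names are illustrative, not the actual Cologene schema. The family member relation encodes the DAG through mother/father foreign keys, and the node-code relation turns a descendant query into a single prefix join instead of a recursive traversal of the parent links.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family (
    family_id INTEGER PRIMARY KEY,
    syndrome  TEXT,
    mutation  TEXT);
CREATE TABLE family_member (
    member_id INTEGER PRIMARY KEY,
    family_id INTEGER REFERENCES family,
    name      TEXT,
    sex       TEXT CHECK (sex IN ('M', 'F')),
    mother_id INTEGER REFERENCES family_member,
    father_id INTEGER REFERENCES family_member);
CREATE TABLE node_code (              -- one row per NodeCode of a member
    member_id INTEGER REFERENCES family_member,
    code      TEXT);
""")
conn.execute("INSERT INTO family VALUES (1, 'FAP', 'APC')")
conn.executemany("INSERT INTO family_member VALUES (?,?,?,?,?,?)", [
    (1, 1, 'Grandma', 'F', None, None),
    (2, 1, 'Grandpa', 'M', None, None),
    (3, 1, 'Mother',  'F', 1, 2),
    (4, 1, 'Child',   'M', 3, None)])
conn.executemany("INSERT INTO node_code VALUES (?,?)", [
    (1, '0.'), (2, '1.'), (3, '0.0.'), (3, '1.0.'),
    (4, '0.0.0.'), (4, '1.0.0.')])

# All descendants of Grandma (member 1): a prefix join on the codes,
# with no recursion over mother_id/father_id.
descendants = sorted(name for (name,) in conn.execute("""
    SELECT DISTINCT fm.name
    FROM node_code anc
    JOIN node_code des ON des.code LIKE anc.code || '%'
                      AND des.code <> anc.code
    JOIN family_member fm ON fm.member_id = des.member_id
    WHERE anc.member_id = 1"""))
print(descendants)  # ['Child', 'Mother']
```

The same join pattern, filtered or aggregated differently, answers the ancestor, first-degree and second-degree queries discussed later.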

4 Cologene

4.1 History of Cologene

In this work, we describe novel features of Cologene [53], an integrated software tool for storing, visualizing and analyzing pedigree data. Recently introduced, it now supports hereditary registries at major university hospitals in the United States, South America and Israel. Cologene is a modernized version of the Cleveland Clinic's computerized Familial Polyposis System (FAMPOLYP), originally designed by a colorectal surgeon, Dr. David Jagelman, in 1979. That system was designed for the VMS operating system and used a file system to store registry data. The purpose of the computerized registry was to aid in the prevention of colorectal cancer in patients with a genetically determined risk and to provide educational data to better understand the transmission of the disease. It would also gather information on a large number of patients on an ongoing basis in order to track the natural history, treatment and overall management of the disease. The developers of FAMPOLYP considered the following important properties of the system:

• Ease of data entry and access to individual and family records.

• Ease of tracking cases, management and patient history.

• Automatic generation of customized follow-up letters.

• Ability to generate and save reports.

• Ability to plot family trees.

The system assigned unique identifiers to each individual and family to allow easy access to information in the registry. Data from medical records could be entered interactively via menu-driven software. The system included a family tree program that could extract demographic and medical information and plot detailed pedigrees for as many as five generations. The program identified twins, children adopted or born out of wedlock, stillborn or miscarried babies, and multiple marriages. It also automatically generated follow-up letters and reminders for clinical exams to patients in the registry. In addition to scheduling letters, the system could define other events, such as scheduling an interview or identifying future medical exams. A word processor, Mass-11, was linked to the system to customize any form letter to a specific patient. Extensive medical information was maintained on each patient, including types of surgery performed, medication administered, pathologies, and results of tests and examinations. The system also included a set of standard reports, such as family members determined to be at risk, weekly pending events, and a medical history report for an individual patient.

An important stage in Cologene development was the re-engineering, or transformation, of the legacy software that represented years of accumulated experience. Although the system was considered obsolete, it was essential to recognize the intellectual property embedded in the legacy software that had served the needs of a large hereditary colorectal registry for years, and not to discard existing knowledge. Computationally, there was an urgent need to improve the usability and management of the software. Modern software development required the use of component-based design and object-oriented programming to move the software development forward. Transformation of data from the file system into an advanced data management system was an essential part of the Cologene development process. The data transformation step helped us understand the work-flow of registry coordinators and key aspects of the collected information.

4.2 Architecture of Cologene

Cologene is implemented in C# and runs on Windows as a client in communication with a remote server; the software system also interfaces with a relational database. The database was created in MS SQL Server 2000, and the database functions are implemented in Transact-SQL. When Cologene is launched, the user can create a new pedigree, open an existing pedigree, or import a pedigree from a different format. Cologene is able to create a user profile that records display-related information and maintains the user's preferences and customization. Once a pedigree is opened, the user works with it in the Pedigree Editor. Cologene's core functionality as pedigree editing software is extended with more advanced capabilities such as import/export, validation, and visualization of large pedigrees. Import and export of pedigree data is a critical feature of pedigree editing software because there are many formats for storing and analyzing pedigrees. Cologene offers several modules for importing and exporting pedigrees in different formats, including plain text and GEDCOM, a specification for exchanging genealogical data between different genealogy software [41].
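For illustration, a parser for a tiny subset of GEDCOM might look as follows. This sketch handles only INDI records (NAME, SEX) and FAM records (HUSB, WIFE, CHIL); the real specification covers far more tags, deeper nesting levels, and character-set rules, which an import module must honor.

```python
def parse_gedcom(text):
    """Parse a minimal GEDCOM subset into two dicts keyed by xref id:
    individuals (name, sex) and families (husb, wife, chil)."""
    individuals, families, current = {}, {}, None
    for line in text.strip().splitlines():
        parts = line.strip().split(" ", 2)
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "0":                     # record header: "0 @I1@ INDI"
            if value == "INDI":
                current = individuals.setdefault(tag, {"name": None, "sex": None})
            elif value == "FAM":
                current = families.setdefault(
                    tag, {"husb": None, "wife": None, "chil": []})
            else:                            # HEAD, TRLR, and other records
                current = None
        elif current is not None:
            if tag == "NAME":
                current["name"] = value.replace("/", "").strip()
            elif tag == "SEX":
                current["sex"] = value
            elif tag in ("HUSB", "WIFE"):
                current[tag.lower()] = value
            elif tag == "CHIL":
                current["chil"].append(value)
    return individuals, families
```

An importer built this way maps the FAM records onto the marriage and family member relations, with HUSB/WIFE becoming the parent links of each CHIL.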


4.3 Effectiveness

A system's success depends not only on whether it meets informational needs but also on how it interacts with users. Computer-based information systems for a hereditary registry should be integrated into the daily routines of genetic counselors, physicians and surgeons. Cologene's users were involved in all phases of system design and development to guarantee the success of the application. In developing Cologene and its particular realization for managing hereditary registry data, we have adhered to a set of guiding principles and practices. Our first priority was to implement a good interface design based on a thorough understanding of work practices. Next, it was important for the GUI to support the interaction between the human and the computer. It was also imperative for the information presented to the user to be structured in a logical and consistent manner. Automatic checks are performed to ensure the consistency and validity of the data.

The following key parameters of computer systems were discussed with users and considered during system design [65]:

• Quality and style of interface: Interfaces must have clear presentations, avoid unnecessary detail, and provide consistent interaction.
  o Cologene's GUI uses consistent and attractive menus, graphics, and colors throughout the application.

• Convenience: Users must have convenient access to the system.
  o Cologene is accessible in the physician's as well as the genetic counselor's office; multiple users can work with the same pedigree simultaneously.

• Speed and response: The software must allow users timely access to the data in the form they need.
  o Cologene implements a number of efficient optimization techniques when displaying large pedigrees to the user and retrieving data from the database.

• Reliability and security: The confidentiality of sensitive medical and genetic data is an important issue in the design of an information system.
  o The system is only accessible by authorized personnel, and certain operations are restricted to particular users. It also allows patient information to be de-identified when data is released for statistical analysis or a pedigree is presented at a research forum.

• Integration: The integration of the software with a laboratory system and a pathology system provides a single view of the patient's data and also reduces the need for redundant data entry, a major source of errors.

4.4 Cologene – Pedigree Editor

Cologene uses domain-specific concepts and ideas in order to present clinicians and genetic counselors with carefully designed components for structured data entry. The graphical tool, Pedigree Editor, is used to manipulate pedigrees in a user-friendly way. More specifically, it provides an environment for drawing pedigrees and editing medical information, for dynamic exploration, and for browsing family trees. The Pedigree Editor uses a symbolic language, a well-defined set of markers and annotations that represent the medical conditions of family members in a pedigree [7]. The Editor provides a graphical view of relationships among individuals. Pedigree data is inherently structured in the form of a tree; individuals are shown as nodes, circles or squares, and edges represent relationships between them. Some common pedigree symbols, definitions and annotations specific to colorectal pedigrees are shown in Figure 4.1. The ability to review an accurately constructed family pedigree with standardized annotations aids clinicians in establishing a diagnosis and the pattern of inheritance, and assists in identifying individuals at risk. Our software is specifically designed to provide this kind of support. The importance of this is underscored by the need for correct interpretation of family pedigrees in human genetic research and by the challenges of studying large families and collaborating with researchers and clinicians.


Figure 4.1 Pedigree Symbols and Relationship Lines.

The Pedigree Editor provides a set of operations needed to maintain a pedigree: adding, changing and deleting individuals and the relationships between them. In addition, the graphical structure captures multiple relationships, twins and multiples. The Pedigree Editor graphical user interface (GUI), shown in Figure 4.2, is organized into separate panels that provide different views into the contents of the pedigree. This view enables the user to navigate the pedigree as a zoomable graph. The right side of the visualization panel displays a pedigree as a graph; individuals are shown as nodes and relationships between them are shown as arcs. Each node is identified by the name of the individual with the date of birth beneath it. Users can select the type of information to be displayed on the graph. When designing the Pedigree Editor GUI, we kept in mind the importance of displaying this information so as to facilitate interpretation of symbols and analysis. For example, it is important to know whether a family member has had a genetic test performed. In this case, the Editor stores detailed information about the genetic test in the database, such as the date and type of test and the specific genetic mutation. At the same time, it limits the display to a specific set of information: a letter G next to an individual's symbol indicating a negative or positive test result. The leftmost panel of the Pedigree Editor GUI, depicted in Figure 4.2, displays a list of family members and key information about them. This list can be sorted by last name or by the result of a genetic test to facilitate searches. When a family member is selected from the list, the node that represents this family member in the graph is highlighted and moved to the center of the screen. Easy extraction of specific data from a pool of information makes this feature very useful for very large pedigrees.


Figure 4.2 Snapshot of the Pedigree Editor GUI displaying the list of family members (left) and their pedigree in graph form (right).

The user initializes the drawing of a new pedigree with a single node representing a proband, the first affected family member who seeks medical attention for a genetic disorder. The next logical candidates for drawing are the nodes one hop away or adjacent to the selected node, namely, parents, a spouse, children, or siblings (Figure 4.3).

Simplicity and a user-friendly drawing environment are the most important considerations when designing a pedigree drawing tool. A genetic counselor should be able to create or update a pedigree containing key medical information while interviewing a patient in the office, or to request data from a medical office while talking to a referring physician on the phone.


Figure 4.3 Data panel for adding a new family member.

The drawing algorithm incorporates knowledge of both the graph structure and the associated medical data. It integrates and visually displays patient data, including the type of relationship for spouses, multiples, genetic markers, and disease status. Several useful pedigree drawing rules and validations are enforced in the drawing algorithm. For instance, the algorithm assumes that a partner has to be added to an individual before one or more biological children can be added to that individual in a pedigree, and that a partner has to be of the opposite gender. On the other hand, the drawing algorithm is flexible in that it allows the user to disconnect individuals from any part of a pedigree or to create a new relationship between an existing pair of individuals. However, it will not allow a direct ascendant of an individual to become his child, and it will not allow two half siblings to become twins. Since some pedigrees can span multiple generations and include hundreds of individuals, it is imperative to have a set of validation rules incorporated into the pedigree drawing algorithm. To understand the capacity of domain-specific knowledge in the pedigree drawing environment, one can observe its behavior when a new individual is added to the pedigree. In this situation, the presence or absence of specific clinical conditions related to hereditary colorectal disease has to be recorded. This may include the presence or absence of extra-colonic manifestations, or a previous history of colorectal or other cancers and colonic polyps. A special set of markers is created to record this information, which may assist in identifying individuals at high risk or those who need frequent screening. The user can navigate the pedigree by double-clicking on a symbol representing a family member in the graph, which produces a panel for recording clinical outcomes and follow-up data for the family member.
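The no-cycle rule can be sketched as a simple check over the parent links. The representation below, a dict from individual to parent list, is illustrative rather than Cologene's internal one: before the editor rewires a child under new parents, it verifies that neither prospective parent is the individual itself or one of the individual's own descendants.

```python
def ancestors(parents, node):
    """All transitive ancestors of node under the parent links."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, []))
    return seen

def can_set_parents(parents, child, mother, father):
    """Reject an edit that would make an individual a child of itself or
    of one of its own descendants (which would create a cycle)."""
    return all(p != child and child not in ancestors(parents, p)
               for p in (mother, father))
```

The other rules (partner present before children, opposite-gender partners, half siblings never twins) are enforced the same way: each is a cheap predicate evaluated before the edit is committed to the graph.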

Graphical user interface design of the Pedigree Editor supports the following aesthetic criteria, defined in the literature [72], for optimal visualization of trees:

• Nodes should not overlap.

• Straight lines from children to their respective parents should not cross.

• Nodes in the same generation should be placed on a straight line, and the lines should be parallel.

• A sub-tree should be drawn the same way regardless of its position.
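The third criterion, members of one generation sharing a horizontal line, can be met by first assigning each individual a generation row. The sketch below is illustrative, not the editor's actual layout code: progenitors are placed on row 0 and every child one row below its deepest parent.

```python
def generation_rows(parents):
    """parents maps individual -> list of parents ([] for progenitors).
    Returns individual -> row number for a layered drawing."""
    memo = {}
    def row(n):
        if n not in memo:
            ps = parents.get(n, [])
            # A progenitor sits on row 0; a child one row below its
            # deepest parent, so spouses of different depths still
            # place their children consistently.
            memo[n] = 0 if not ps else 1 + max(row(p) for p in ps)
        return memo[n]
    for n in parents:
        row(n)
    return memo
```

Horizontal positions within each row are then chosen to satisfy the remaining criteria (no overlaps, no crossing parent-child lines).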

The Pedigree Editor is linked to additional user interface elements. User interaction with a pedigree graph can trigger functionality in other software components. For example, double-clicking on a node produces a preview of associated medical conditions and demographic information. The software application organizes patient data according to the time when they were collected, thus emphasizing changes over time. Views of associated medical records for a given patient are designed to be problem-specific, related to a particular aspect of the disease, and to highlight important findings in a clinical summary. For example, the program allows calculation of the severity of duodenal polyps based on the Spigelman staging system [8]. Only a small proportion of duodenal adenomas will ever become malignant, but this event is particularly lethal. Clinicians can use the polyp module (Figure 4.4) to judge the severity of duodenal polyps according to their number, size and histology. The severity score identifies a patient as a candidate for more aggressive treatment.

Figure 4.4 The Polyp Details interface for calculating severity of duodenal polyps.
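A severity calculator of this kind can be sketched as follows, using the point ranges commonly quoted in the literature for the Spigelman system: polyp count, size, histology and dysplasia each contribute 1-3 points, and the total maps to stage I-IV. The exact thresholds should be verified against the cited staging system [8] before any clinical use.

```python
def spigelman_stage(polyp_count, max_size_mm, histology, dysplasia):
    """Stage duodenal polyposis from four factors, each worth 1-3 points.
    Point ranges follow commonly quoted Spigelman criteria (verify
    against the cited system); returns '0' when no polyps are present."""
    if polyp_count == 0:
        return "0"
    points = (1 if polyp_count <= 4 else 2 if polyp_count <= 20 else 3) \
           + (1 if max_size_mm <= 4 else 2 if max_size_mm <= 10 else 3) \
           + {"tubular": 1, "tubulovillous": 2, "villous": 3}[histology] \
           + {"mild": 1, "moderate": 2, "severe": 3}[dysplasia]
    if points <= 4:
        return "I"
    if points <= 6:
        return "II"
    if points <= 8:
        return "III"
    return "IV"
```

In the polyp module, the inputs come straight from the structured polyp records, so the score can be recomputed automatically whenever a new endoscopy is entered.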

The software records not only a family history but the whole spectrum of symptoms and presentations related to hereditary cancer syndromes, including results of genetic testing, endoscopic surveillance, extra-colonic manifestations, cancer sites and surgical treatments. The information is organized in a structured and logical way appropriate for analytical processing and for assisting in medical decision making.

4.5 Cologene – Usage Analysis

Since its initial deployment at the Cleveland Clinic in early 2005, Cologene has been adopted by ten other registries at major hospitals in North and South America. Because the Cleveland Clinic's hereditary colorectal registry is the largest in the United States, we chose it for the usage analysis of Cologene. During this period, a total of approximately 900 families with a rare hereditary syndrome, familial adenomatous polyposis (FAP), were registered in the Cleveland Clinic's database, which includes about 14,000 family members. The distribution of related medical information is presented in Table 1. These statistics signify the importance of Cologene as a tool to support patient care activities. Researchers and scientists have used the database as an important source of information. A large number of clinical and translational research projects that produced important publications in the field of colorectal hereditary disease utilized the registry's data, which would not have been possible without an organized database.


Table 1 Cologene's Data Statistics

5 Pedigree Explorer

Hereditary colorectal cancer registries must provide effective access to their contents for a wide variety of users. In this section we focus on user interface techniques to improve access to digital registries and describe our work in developing a graphical user interface to the Cologene database. To specify powerful pedigree queries, we have designed an advanced pedigree query interface, Pedigree Explorer, which allows users to specify queries dynamically. Path, or structural, pedigree queries are integrated within the advanced query interface, giving users the ability to specify queries on familial relationships dynamically and in an intuitive manner. Consider the query “find family members at high risk of colorectal cancer, defined as those with at least one second degree relative affected with colorectal cancer and at least one first degree relative affected with colorectal cancer before 60 years of age,” or the query “find family members with first degree relatives affected by adenomas before 35 years of age.” At present, there are no pedigree management systems that provide such powerful search functionality and allow medical professionals to specify structural queries, compute risk scores for each family member based on the pedigree information, or visualize the results of a query on the pedigree graph. The Pedigree Explorer was inspired by the tree-structured interface of Pathway Explorer [56] and uses an encoding scheme, NodeCodes, for efficient evaluation of queries on pedigree graphs [31].


5.1 Pedigree Explorer Interface

Before describing the tool itself, it is important to understand how pedigree and clinical data are used. Pedigree data is utilized in many fields, such as clinical science, genetics, and bioinformatics. The requirements of these user groups differ with respect to the presentation of data, search facilities, and export options, and cannot be satisfied by a single search interface. Therefore, different levels of searching and presentation should be provided. A hereditary colorectal cancer database may contain many attributes associated with each family and family member; therefore, clinicians and researchers with limited knowledge of a formal query language may find it difficult to query this information. A tool tailored to the needs and skills of all potential users is necessary.

Design of a graphical query user interface presents challenges. First, due to the diversity of information that needs to be accessed, the interface must be general enough to cover heterogeneous data, yet specific enough to enable users to find the information they are looking for. Second, the interface should be acceptable to different groups of users who are diverse with respect to their knowledge, expertise, and research interests. The query interface should assist users who are not proficient in composing queries in a formal language based on first-order logic.

We distinguish three functional areas of a query tool: search, visualization, and export of query results. The following open-ended questions concerning these areas were posed to clinicians and researchers:

1. What kind of information will be requested by the user?

2. What are some examples of frequent queries?

3. In what format should search results be displayed?

Researchers and clinicians put special emphasis on the ability to query the underlying database, browse data in both tabular and graphical presentations, export data for collaborative projects and statistical analysis, and use pedigree drawings as images for papers and presentations. Typical queries may focus on pedigree structure and family relationships or on associated clinical data. The sample queries in Table 2 include questions about clinical, genetic or pathologic characteristics of families, family members, or a disease process, or involve relationships among family members. Queries 1-3 are independent of pedigree relationships: Query 1 calls for an aggregate count of families with specific clinical characteristics, Query 2 requires graphical presentation, and Query 3 retrieves specific data points from multiple entities in the database. Queries 4-6 are structural queries that involve pedigree relationships: Query 4 requires output in tabular format, Query 5 requires calculations based on family relationships, and Query 6 requires visual presentation on a pedigree graph.


Query 1: Find families with familial polyposis syndrome where at least one family member was diagnosed with colon cancer. What is the total number of such families?

Query 2: Within a given family, find family members who were diagnosed with a desmoid tumor before 20 years of age, have had gastric polyps, and had a positive genetic test. Highlight these individuals on the pedigree graph.

Query 3: Retrieve pathologic details of polyps, genetic test results, and cancer sites of those individuals who underwent major colonic surgery before 35 years of age.

Query 4: Retrieve cancer sites and pathologic stage for family members whose first degree relatives (mother, father, siblings) have been diagnosed with osteoma.

Query 5: Calculate risk of colorectal cancer based on the clinically defined scoring system that takes into account the number of first and second degree relatives diagnosed with colorectal cancer at a young age (before 50 years of age) and the number of family members diagnosed with colorectal cancer on each side of the family.

Query 6: Highlight all descendants of a given individual on the pedigree graph.

Table 2 Sample Queries.

The Pedigree Explorer provides a user-friendly environment to explore data in the Cologene database. The tool presents a query tree in which the user can specify the entities and attributes included in the output and add parameters to the query. A sample query is displayed in the left panel of the Pedigree Explorer in Figure 5.1.


Figure 5.1 Pedigree Explorer.

Notice that the query tool is a set of nodes organized in a tree, where each node represents a relation. In our example, the sample query tree consists of eight nodes organized into three levels. The Family node is the root of the tree, representing the first level; level two represents the relation Family Members; and level three includes a set of nodes representing various concepts associated with a family member, such as Cancer, Polyps, First Degree Relatives, Second Degree Relatives, Ascendants, and Descendants. This tree-like view of the database schema expresses the existing semantic relationships between nodes. In other words, a Family consists of multiple individuals, and each Family Member may have multiple Cancers or multiple First Degree Relatives. The tree-structured query interface provides metadata information, that is, data about data. This allows the user to compose queries in a more meaningful way and makes the query output easier to understand.

Initially, the user is presented with the top level of a query tree (Figure 5.2).

Figure 5.2 Top query panel displaying the top level of a query tree.

The top level of the sample query tree includes the parent node, the Family relation, and the corresponding child attributes of the parent node. Check boxes next to the relation name, Family, and the attribute names, such as Family Syndrome or Mutation Type, are used to select attributes to be included in the query results. The check box for the parent node, Family, allows one to select or unselect all child attributes. Dropdown and text boxes next to each attribute display a list of allowable constraints and predicates.

Notice that in the graphical query interface, the conditions provided for data selection depend on the data type and semantics of the data. For categorical or discrete data, the user can select a value from a dropdown list, while for non-categorical string values, the user can select from a list of predicates that provide string pattern matching (=, like: prefix, contains, suffix). For numerical and date values, the user can select a comparison operator (<, <=, =, <>, >=, >) from the dropdown list. The "or" buttons next to the attributes are used to specify additional constraints for an attribute, where the specified values are combined with the logical OR operator.
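The mapping from attribute types to allowable conditions can be sketched as follows. This is an illustrative reconstruction only, not the actual Pedigree Explorer code; the function names (`build_predicate`, `combine`) are ours.

```python
# Illustrative sketch (not the actual Pedigree Explorer implementation) of how
# type-dependent constraints could be turned into SQL WHERE fragments.

def build_predicate(column, kind, op, value):
    """Return one SQL condition for a single attribute constraint."""
    if kind == "categorical":
        # Dropdown selection: always an equality test.
        return f"{column} = '{value}'"
    if kind == "string":
        # Pattern matching via LIKE: prefix, contains, or suffix.
        patterns = {"prefix": f"{value}%", "contains": f"%{value}%",
                    "suffix": f"%{value}"}
        if op == "=":
            return f"{column} = '{value}'"
        return f"{column} LIKE '{patterns[op]}'"
    if kind in ("numeric", "date"):
        # Comparison operator chosen from a dropdown list.
        assert op in ("<", "<=", "=", "<>", ">=", ">")
        return f"{column} {op} {value}"
    raise ValueError(f"unknown attribute kind: {kind}")

def combine(predicates_per_attribute):
    """The 'or' button ORs values of one attribute; attributes are ANDed."""
    groups = ["(" + " OR ".join(ps) + ")" for ps in predicates_per_attribute]
    return " AND ".join(groups)

where = combine([
    [build_predicate("fldFamilySyndrome", "categorical", "=", "FAP"),
     build_predicate("fldFamilySyndrome", "categorical", "=", "AFAP")],
    [build_predicate("fldAge", "numeric", ">", 21)],
])
print(where)
# (fldFamilySyndrome = 'FAP' OR fldFamilySyndrome = 'AFAP') AND (fldAge > 21)
```

The key design point mirrored here is that the interface, not the user, decides which predicates are legal for each attribute, so only well-formed conditions can reach the generated SQL.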

Example 1: Suppose that the data relevant to a specific research project consists of a list of families with FAP syndrome and a genetic mutation on the APC gene. A researcher would like to know which such families were registered in the database between the years 2001 and 2009.

The user identifies Family ID, Family Syndrome, Mutation Type, and Registered Date as attributes to be included in the query output by selecting the appropriate check boxes. He or she then specifies query parameters in dropdown boxes, as shown in Figure 5.3. Every time a dropdown box is opened or a check box is clicked, a corresponding query is dynamically generated. Since the Family Syndrome attribute takes on categorical values, the user can select search parameters from the dropdown box. The Mutation Type attribute is not well categorized and can take on any string value; thus, the user can specify the LIKE condition for this attribute.

Figure 5.3 Top query panel displaying Example 1 parameters.

The top panel of the interface displays the parameters of the query to the user in a simplified SQL format that includes only the WHERE clause of the dynamically generated SQL statement in Figure 5.4.


Figure 5.4 SQL statement for Example 1.

Query results can be visualized in tabulated form. In the right upper panel of the Pedigree Explorer, each row represents a relation included in the query. The first column displays the name of the relation; the second column provides a link which, when activated, displays the relation in tabular format in the panel below; and the third column contains another link that allows data export in a simple text format (Figure 5.5). The last column in the table shows summary information, more specifically, the number of records in the given relation that match the query criteria. If multiple relations are included in a query, the panel includes links and summary information for each of the corresponding relations. This representation is similar to a spreadsheet.


Figure 5.5 Right upper panel displaying query results.

Query results can also be presented graphically, using different colors to highlight family members who meet the search criteria on a pedigree graph. Visualization with pedigree graphs is popular in genetic counseling, as it allows for quick analysis of the relations among affected family members as well as the formation of interesting hypotheses. The lower panel in Figure 5.5 displays the list of families and/or family members that meet all of the specified constraints. A family can be selected from the list with a mouse click. This action activates the Pedigree Editor, where the selected pedigree is displayed and the family members who meet the search criteria are highlighted in different colors, as in Figure 5.6.


Figure 5.6 Pedigree Editor displaying the selected family; affected members are highlighted.

Example 2: A genetic counselor who is studying the association of a specific APC mutation site with phenotypic characteristics may be interested in all family members who had duodenal or rectal polyps and who have also been diagnosed with osteoma.

The appropriate query should contain the Family Member, Polyps, and Extra-Colonic relations and should specify the criteria for the related attributes. As previously mentioned, the check boxes determine the attributes included in the output, and the dropdown boxes display predicate variables that are instantiated for the relevant attributes. The right upper panel in Figure 5.7 shows all relations included in the query, and the lower panel shows a detailed view of the Polyps relation, displayed by clicking the Polyps link.


Figure 5.7 Pedigree Explorer displaying the query results.

5.2 Expressive Power of Pedigree Explorer

The Pedigree Explorer system is an advanced query interface, the most important component of which is a query tree that allows exploration of the schema of the underlying database and the formulation of queries. It is important to note that the goal of the Pedigree Explorer is to let a user who has no knowledge of the underlying database schema or of query language primitives construct a large number of possible queries. The query construction step is carried out using visual actions only, so the user does not need to be aware of the query language; the system composes queries and outputs query results based on the end-user's actions. Visual actions introduce the variables and parameters of a query.

The query tree in Figure 5.8 will be the running example that illustrates the basic points of generating an SQL statement from user actions. The state of the query tree has been produced by the following user actions:

• selected the "Family Syndrome" node to be included in the query output;
• placed the condition on "Family Syndrome" as being "FAP" or "AFAP";
• placed the condition on "Age", a child attribute of the "Family Member" node, as being more than 21;
• selected the "Cancer" node to include all attributes of the Cancer relation in the query output.

The end-user's intention is to find all individuals within families with FAP or AFAP syndromes who are older than 21 years of age and to display family syndrome and cancer details for these family members. Visual actions introduce variables and parameters in the condition controls. The final step of the query formulation process passes the SQL query as input to the database management system.

Note that the attribute nodes, or leaf nodes, of a query tree, such as Family Mutation or Age, can be in one of the following states:

(1) selected (projected) as a query output, with no parameters specified;
(2) parameters for the attribute have been instantiated, but the attribute is not selected as a query output;
(3) selected as a query output, with parameters for the attribute instantiated;
(4) the attribute is not considered in the query.

The user can project a single attribute by clicking the check box next to the attribute name, or project all attributes in a particular relation by clicking the check box next to the relation name. Only those attributes that are specified by the user are projected in the query output. Instantiation of parameters for an attribute is considered selection of that attribute. The specified parameter can be a substring of any matching value of that attribute. The available operators for substring matching, prefix, suffix, and contains, are expressed with the LIKE operator in an SQL statement. Other supported conditions for numeric and date values are =, >, <, <=, >=. These conditions are specified by preceding each dropdown or text box with a dropdown box containing a list of allowable operators. The user can include more than one instance of an attribute by pressing the "or" button.

A relation node, such as Family or Family Members, can be in one of the following states:

• INSTANTIATED: at least one child attribute node is in state (2) or (3);
• PROJECTED: at least one child attribute is in state (1);
• IGNORED: all child attributes are in state (4).
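The state rules for attributes and relation nodes can be sketched in code. This is our own illustrative naming, not the actual implementation; the join-type mapping follows the rule described later in this section (inner join for INSTANTIATED children, left join for PROJECTED ones).

```python
# Sketch of the node-state rules (our own code, not the Pedigree Explorer
# implementation). Attribute states: 1 = projected only, 2 = instantiated
# only, 3 = projected and instantiated, 4 = ignored.

def attribute_state(projected, instantiated):
    if projected and instantiated:
        return 3
    if instantiated:
        return 2
    if projected:
        return 1
    return 4

def relation_state(child_states):
    """Derive a relation node's state from its child attribute states."""
    if any(s in (2, 3) for s in child_states):
        return "INSTANTIATED"
    if any(s == 1 for s in child_states):
        return "PROJECTED"
    return "IGNORED"

def join_type(child_relation_state):
    """Inner join when INSTANTIATED, left join when PROJECTED,
    omitted from the query when IGNORED."""
    return {"INSTANTIATED": "INNER JOIN",
            "PROJECTED": "LEFT JOIN",
            "IGNORED": None}[child_relation_state]

# Running example: Family Member has Age instantiated (state 2),
# Cancer has all of its attributes projected (state 1).
print(relation_state([attribute_state(False, True)]))  # INSTANTIATED
print(join_type("PROJECTED"))                          # LEFT JOIN
```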

Figure 5.8 Query Tree.

The query tree in Figure 5.8 reflects an explicit child-parent relationship between nodes that represent relations, such as Family, Family Member, and Cancer, and nodes that represent attributes, such as Family Mutation, Date of Birth, or Primary Stage. The Cancer node is the child of the Family Member node, and the Family Member node is the child of the Family node; the Tumor Stage attribute is thus a descendant of both the Family Member and Family nodes. Equijoins are implicitly performed by the Pedigree Explorer whenever a node is instantiated as a child of another object. The equijoins occur between the primary and foreign keys that are specified in the XML query set document. Parent and child nodes representing relations are joined with an inner join if the child node is in the state INSTANTIATED; they are joined with a left join if the child node is in the state PROJECTED. Otherwise, the child relation node is not included in the query. In our running example, three relations are joined: Cancer, Family Member, and Family. An equijoin is implicitly performed between the primary key family_id of the Family relation and the foreign key family_id of the Family Member relation. Note that the user is not aware of the join parameters and never explicitly specifies relationships in the underlying database. Constraints on attributes or child nodes are combined with conjunction (the AND operator). Parameters for a single attribute are combined with an operator defined in the query tree graphical interface; for instance, the values of a single categorical attribute are combined with disjunction (the OR operator), while numeric and string attributes take on a set of meaningful predicates. The following SQL statement is produced by the query interface and submitted to the database management system:

SELECT DISTINCT
  tblSTSyndromes.fldDescription AS fldFamilySyndrome,
  fldAge = CASE tblFamilyMembers.fldPersonVitalStatus
             WHEN 0 THEN DATEDIFF(year, tblFamilyMembers.fldPersonDOB, getdate())
             WHEN 1 THEN DATEDIFF(year, tblFamilyMembers.fldPersonDOB, tblFamilyMembers.fldPersonDOD)
           END,
  tblCancer.fldDateDiagnosed,
  DATEDIFF(year, tblFamilyMembers.fldPersonDOB, tblCancer.fldDateDiagnosed) AS fldAgeDiagnosed,
  tblSTCancerSite.fldDescription AS fldCancerPrimarySite,
  tblCancer.fldMedicalTreatment,
  tblCancer.fldSurgicalTreatment,
  tblCancer.fldTumorSize,
  tblCancer.fldStageT,
  tblCancer.fldStageN,
  tblCancer.fldStageM,
  tblCancer.fldStageTNM,
  tblSTCancerDifferentiation.fldDifferentiationDescription AS fldDifferentiation,
  tblSTCancerCellTypes.fldCellTypeDescription AS fldHistopathologicType
FROM tblFamily
  LEFT OUTER JOIN tblSTSyndromes ON tblFamily.fldFamilySyndrome = tblSTSyndromes.fldID
  LEFT OUTER JOIN tblFamilyMembers ON tblFamily.fldFamilyID = tblFamilyMembers.fldFamilyID
  LEFT OUTER JOIN tblCancer ON tblFamilyMembers.fldPersonID = tblCancer.fldPersonID
    AND tblFamilyMembers.fldFamilyID = tblCancer.fldFamilyID
  LEFT OUTER JOIN tblSTCancerSite ON tblCancer.fldCancerPrimarySite = tblSTCancerSite.fldID
  LEFT OUTER JOIN tblSTCancerDifferentiation ON tblCancer.fldDifferentiation = tblSTCancerDifferentiation.fldDifferentiationID
  LEFT OUTER JOIN tblSTCancerCellTypes ON tblCancer.fldHistopathologicType = tblSTCancerCellTypes.fldCellTypeID
WHERE (tblFamily.fldFamilySyndrome = 9 OR tblFamily.fldFamilySyndrome = 1)
  AND (CASE tblFamilyMembers.fldPersonVitalStatus
         WHEN 0 THEN DATEDIFF(year, tblFamilyMembers.fldPersonDOB, getdate())
         WHEN 1 THEN DATEDIFF(year, tblFamilyMembers.fldPersonDOB, tblFamilyMembers.fldPersonDOD)
       END > 21)

A user-friendly presentation of the WHERE clause is displayed to the user on the top panel of the Pedigree Explorer interface.

The expressive power of the Pedigree Explorer is limited to selections, projections, and equijoins. We recognize this limitation; however, a key benefit of the Pedigree Explorer is that it enables an intuitive representation of the underlying data model and easy generation of complex SQL statements. The query tree interface encodes a large number of possible queries that can be compiled with little difficulty and minimal training by users with no prior knowledge of the system.


5.3 Query Sets

The Pedigree Explorer interface provides a user-friendly environment for interactive data analysis, allowing the user to communicate interactively with the database and direct the exploration process. It is not realistic, however, to expect such a query interface to answer all possible questions over a large database.

Although it may sound appealing to have a comprehensive query system, in practice such a system would be overwhelmingly complex. The intricacy of such an interface may easily surpass that of the syntax of a formal query language. Furthermore, many types of queries would be irrelevant to the user's data exploration task. Thus, it is not realistic to translate a full-power formal query language into a user-friendly query interface.

A more realistic scenario is to expect that users can communicate with the database using query sets designed to facilitate efficient data exploration [59]. Such query sets include specifications of the portion of the database or the set of data in which the user is interested and how query results should be presented and visualized. Query sets provide a foundation on which the Pedigree Explorer graphical interface is built.

There may be more than one query set, based on different users' viewpoints and security constraints. For example, a clinical researcher who is interested in the pathologic characteristics of polyps at different colon sites may prefer a query set that includes detailed specifications of polyps. Another researcher, however, may prefer easy access to follow-up details organized with respect to the different types of tests performed for each family member. The specification of the relevant attributes and relations is targeted toward a user group, which determines what attributes of interest will appear in the Pedigree Explorer. The Pedigree Explorer interface is able to provide different hierarchical views with different types of predicates. Figure 5.9 illustrates one of the possible hierarchical views.

Figure 5.9 Example hierarchy of Pedigree Explorer Data.

For different user groups, the Pedigree Explorer dynamically constructs a tree-structured query interface based on a query set specification document. The query set specification document, an XML document, contains a set of allowable constraints and output parameters and concisely describes the tree-structured query interface presented to the user. An example XML query set file is shown in Figure 5.10. The XML query set provides the description of the building blocks for the tree-structured query interface; it also specifies the mapping to database tables and attributes and the relational links between the tables. The query set controls which queries can be submitted and which attributes can appear in the query results.


Figure 5.10 XML schema.

The XML query set file in Figure 5.10 contains the name of the query set as the root of the document and one or more query tags representing specifications. The set of tags includes the following: Table, ColumnName, and TableLinks. A TableLink tag specifies the relational link between two tables included in the query set; it contains the names of the tables and their primary and foreign keys. These specifications are used to perform equijoins.

A Table tag corresponds to a node in a query tree, such as Family, Family Member, or Cancer, which also represents a table, or rather a view of a table, in the database. The Hidden property of the top-level node, Family, is set to false, so the initial state of the node is expanded and all of its child attributes can be viewed by the user. For a compact representation of the query tree on the query interface, the Hidden property of all other nodes is set to true; these tables are initially presented to the user in collapsed mode. The hidden nodes are subtrees of the query tree that can be viewed by clicking the "+" sign next to the name of a node. Each Table tag contains an associated Name, the actual name of the table in the database, and an optional DisplayName, the name of the table displayed in the user interface.

Table tags contain ColumnName tags that may correspond to an actual column in the table, such as Family Syndrome, or to a calculated value, such as Age at Diagnosis. The ColumnType within a ColumnName tag specifies one of the following properties of the attribute: it can be of type 'criteria' or 'calculated', an attribute displayed in the query tree and used for projection or selection in an SQL statement, or it can be of type 'select', which informs the Pedigree Explorer that the attribute is required to link the Table to its parent. A 'calculated' attribute is an expression that does not correspond to an actual column in the table. The ColumnName tag also contains an optional DataLength value that determines the length of the text box in which a substring of any matching value of that attribute can be entered. The optional PrimaryKey value in the ColumnName tag specifies whether an attribute is a primary key of the relation. The DisplayName value is the name displayed in the user interface. DisplayType determines the format in which the user will enter the parameters of the query: it can be a dropdown box with possible values for categorical attributes, retrieved from a lookup table; a dropdown box preceded by a list of possible predicates for numeric and date attributes; or a simple text box. LookupName specifies the name of the table in the database where all possible values of a categorical attribute are stored; these values are displayed in a dropdown box next to the attribute name.
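Since the query set is an ordinary XML document, building the query tree amounts to walking its Table, ColumnName, and TableLinks tags. The following sketch is ours; the tag and attribute names follow the description above, but the actual Cologene query set file may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hypothetical query-set fragment (not the actual Cologene file); tag names
# follow the description in the text.
QUERY_SET = """
<QuerySet Name="PedigreeExplorerDefault">
  <Table Name="tblFamily" DisplayName="Family" Hidden="false">
    <ColumnName Name="fldFamilyID" ColumnType="select" PrimaryKey="true"/>
    <ColumnName Name="fldFamilySyndrome" ColumnType="criteria"
                DisplayName="Family Syndrome" DisplayType="dropdown"
                LookupName="tblSTSyndromes"/>
  </Table>
  <Table Name="tblFamilyMembers" DisplayName="Family Member" Hidden="true">
    <ColumnName Name="fldAge" ColumnType="calculated" DisplayName="Age"/>
  </Table>
  <TableLinks>
    <TableLink Parent="tblFamily" Child="tblFamilyMembers"
               PrimaryKey="fldFamilyID" ForeignKey="fldFamilyID"/>
  </TableLinks>
</QuerySet>
"""

root = ET.fromstring(QUERY_SET)

# Build the query-tree nodes: one node per Table, one leaf per ColumnName,
# preferring DisplayName over the physical column name.
tree = {t.get("Name"): [c.get("DisplayName") or c.get("Name")
                        for c in t.findall("ColumnName")]
        for t in root.findall("Table")}

# Extract the equijoin specifications from the TableLink tags.
joins = [(l.get("Parent"), l.get("Child"),
          l.get("PrimaryKey"), l.get("ForeignKey"))
         for l in root.find("TableLinks").findall("TableLink")]

print(tree["tblFamily"])  # ['fldFamilyID', 'Family Syndrome']
print(joins[0])  # ('tblFamily', 'tblFamilyMembers', 'fldFamilyID', 'fldFamilyID')
```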

5.4 Pedigree Explorer Architecture

Figure 5.11 models the architecture of the Pedigree Explorer. An XML schema specifies a query set, a set of allowable attributes and constraints, and concisely describes the tree-structured query interface presented to the user. Queries are designed by choosing constraints and selecting which relations and attributes to display. Since a traditional relational database is used in the background, the queries specified by the user are translated into SQL queries. Output is then provided in both tabular and graphical forms.

Figure 5.11 Architecture of a tree-structure query interface for pedigree data.

Figure 5.12 depicts the Pedigree Explorer as a query interface for querying tree-structured views of a pedigree relational database. The query interface follows the schema of an XML document that contains specific nodes along with attributes and lists of possible values as children. The structure of the interface shown in Figure 5.12 contains a family node with an associated family member node. The children of a family member node include a cancer node and a polyps node, along with a first degree relatives node, a second degree relatives node, and descendants and ascendants nodes. The tree-structured interface of the Pedigree Explorer in this example provides a basic search by family mutation, syndrome, primary cancer site, etc. It also allows users to specify structural pedigree queries that involve relationships between family members, e.g., "Find family members whose first degree relatives have been diagnosed with colon cancer".

Because these queries require graph traversal and are recursive in nature, we use the NodeCodes technique for scalability.

Figure 5.12 Pedigree Explorer.

5.5 Pedigree Query Interface and NodeCodes

There is considerable interest in quantifying familial colorectal cancer risks and obtaining precise estimates of familial risk according to the nature of the family history [17,45]. This may require calculating the average age at diagnosis among first or second degree relatives, or the number of relatives on one side of the family diagnosed with a colonic adenoma before a certain age. It is also important to trace a family history by generations and determine whether consecutive generations in the family are affected by cancer. Because these types of queries require graph traversal and are recursive in nature, they are very costly. To reduce the computational cost of graph queries, we use the NodeCodes technique.

NodeCodes is an encoding system that was originally proposed for encoding a directed acyclic graph with a single source node [62] and later adapted to encode pedigree graphs [31,32]. The NodeCode system in [62], based on Huffman codes [40], is used to represent the path information of a multimedia presentation graph in compressed form. In a directed acyclic presentation graph, the source node has the NodeCode 1, and all other nodes are descendants of the source node with NodeCodes of the form 1(01*)+. The algorithm that assigns NodeCodes to a presentation graph implements a depth-first search of the graph starting from the source node. The NodeCode scheme not only uniquely identifies a node in a presentation graph but also specifies a distinct path between two presentation nodes. Given the NodeCodes of two nodes, all the existing paths connecting the two nodes can be derived without graph traversal.

The NodeCodes labeling method, an annotation for each node, was adapted to encode the hierarchical structure of a pedigree graph and to evaluate structural pedigree queries efficiently. Since a pedigree graph may have several progenitors, nodes with in-degree 0, a virtual source node is defined; all the progenitors in a pedigree become children of this virtual node and are labeled first. Each node in the pedigree graph is then assigned a set of labels that result from a depth-first-search traversal of the graph. The NodeCodes, or set of labels, are permanently assigned to each individual based on their position in the pedigree.

The NodeCodes of a node form a set of strings that contain sequences of integers and delimiters, representing individuals and generations respectively. The integers denote the sibling order, and the delimiters denote the generations as well as indicating the gender of the node: the delimiter "*" can be ".", "," or ";", denoting female, male, or unknown gender respectively. The virtual source node is not actually stored and is not assigned a nodecode; the progenitors are assigned nodecodes i*, where i = 0, 1, … and * indicates gender. Then, for each node u in the pedigree graph, nodecodes(u) is assigned using a depth-first-search traversal starting from the source node as follows:

• if u is the source node, then nodecodes(u) = {the empty string};
• let u be a node with a nodecode x, and let v0, v1, …, vk be the children of u in sibling order; then nodecodes(vi) includes xi*, where 0 ≤ i ≤ k.

Figure 5.13 shows a pedigree graph labeled with NodeCodes. Using the NodeCodes method, unique codes are assigned to each individual in a pedigree. Given the NodeCodes of two nodes, one can determine the degree of relatedness between the two individuals by string comparison. Ancestors, descendants, parents, or children of an individual can be identified by string manipulation over NodeCodes, which translates directly into simple SQL queries.


Figure 5.13 Nodecodes labeling on a sample pedigree graph

Experimental results demonstrate that NodeCodes provide a good alternative for query evaluation on pedigree data. The hierarchical structure of pedigree data is encoded using a labeling method for efficient evaluation of queries that require structural pattern matching; otherwise, the evaluation of structural queries involves traversal of a pedigree graph, which may be costly.

For example, a query involving all male ancestors of a given individual may require several joins to evaluate the parent information stored with each individual. The labeling method instead allows the evaluation of basic relationships on pedigree graphs with a single label comparison step: given the NodeCodes of two individuals, their relatedness can be determined.

Queries on pedigree graphs are expressed as SQL statements and evaluated over the relational table representation of the NodeCodes assigned to each individual, as shown in Table 3. Finding the children, parents, ancestors, or descendants of an individual requires simple string comparison operations on the nodecode structure.

Table 3 Sample nodecodes relational table.

Query 1: Descendants of an Individual

Node d is a descendant of node n if and only if nodecodes(d) includes nc + s, where s is a nonempty string and nc is a nodecode in nodecodes(n). The following is an SQL representation of the query:

(1) SELECT d.IndividualID
(2) FROM nodecodes n, nodecodes d
(3) WHERE n.IndividualID = k
(4) AND n.PedigreeID = d.PedigreeID
(5) AND d.nodecode LIKE n.nodecode + '_%'

Query 2: Parents of an Individual

Similarly, node d is a parent of node n if and only if nodecodes(n) includes dc + s, where s is a string of exactly two characters and dc is a nodecode in nodecodes(d). Line 5 of the SQL statement of Query 1 becomes:

(5) AND n.nodecode LIKE d.nodecode + '__'

Query 3: Ascendants of an Individual

Node a is an ascendant of node n if and only if nodecodes(n) includes ac + s, where s is a nonempty string and ac is a nodecode in nodecodes(a). The corresponding SQL statement is:

(1) SELECT a.IndividualID
(2) FROM nodecodes n, nodecodes a
(3) WHERE n.IndividualID = k
(4) AND n.PedigreeID = a.PedigreeID
(5) AND n.nodecode LIKE a.nodecode + '_%'

Query 4: Children of an Individual

Node a is a child of node n if and only if nodecodes(a) includes nc + s, where s is a string of exactly two characters and nc is a nodecode in nodecodes(n). Line 5 of the SQL statement of Query 3 becomes:

(5) AND a.nodecode LIKE n.nodecode + '__'

Query 5: Siblings of an Individual

Node b is a sibling of node n if and only if b has at least one nodecode that differs from a nodecode of n only in the last two characters. The corresponding SQL statement must exclude node n from the result:

(1) SELECT b.IndividualID
(2) FROM nodecodes n, nodecodes b
(3) WHERE n.IndividualID = k
(4) AND b.nodecode LIKE SUBSTRING(n.nodecode, 1, LEN(n.nodecode) - 2) + '__'
(5) AND b.nodecode <> n.nodecode
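The string comparisons behind Queries 1 through 5 can be sketched directly over nodecode strings. This is our own illustrative code mirroring the intent of the SQL above, under the simplified nodecode representation of one code per path.

```python
# String-comparison versions of the structural relationship tests
# (a sketch, not the Cologene implementation).

def is_descendant(d_codes, n_codes):
    """d is a descendant of n: some code of d = a code of n + nonempty suffix."""
    return any(d.startswith(n) and len(d) > len(n)
               for d in d_codes for n in n_codes)

def is_child(c_codes, n_codes):
    """c is a child of n: some code of c = a code of n + exactly two
    characters (sibling number + gender delimiter)."""
    return any(c.startswith(n) and len(c) == len(n) + 2
               for c in c_codes for n in n_codes)

def is_sibling(b_codes, n_codes):
    """b is a sibling of n: codes agree except for the last two characters."""
    return any(b[:-2] == n[:-2] and b != n
               for b in b_codes for n in n_codes)

# Parents and ascendants are the same tests with the roles swapped:
# is_child(n_codes, d_codes) tests whether d is a parent of n, and
# is_descendant(n_codes, a_codes) tests whether a is an ascendant of n.

codes = {"A": ["0,"], "B": ["1."],
         "C": ["0,0.", "1.0."], "D": ["0,1,", "1.1,"]}
print(is_descendant(codes["C"], codes["A"]))  # True
print(is_child(codes["C"], codes["A"]))       # True
print(is_sibling(codes["C"], codes["D"]))     # True
```

Note that each test is a pure string operation: no traversal of the pedigree graph is needed, which is exactly why these queries translate into single LIKE comparisons in SQL.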

We showed the effectiveness of the NodeCodes technique for structural query evaluation by comparing it with the common use of SQL in existing pedigree management systems [1,68,71]. Experimental results using data stored in the Cologene database show significant (>850%) performance improvements for complex queries over naïve evaluation [32]. The experiment was performed with 655 pedigrees containing 8,381 individuals. The largest pedigree contained 118 individuals, with an average pedigree size of 12 individuals. The maximum number of generations in a single pedigree was 8, the maximum number of progenitors was 67, and the maximum number of children for an individual was 15. We implemented the NodeCodes labeling algorithm and used strings to store the NodeCodes, with sibling numbers encoded in a base-64 representation. The majority of the integer-delimiter sequences could be represented by two characters (one encoding the sibling number, one for the delimiter), with the exception of pedigrees with a large number of progenitors. The experiment was performed for four queries that involve structural relationships. The sample pedigree queries used to evaluate the performance of the NodeCodes technique, together with the performance of the iterative query implementation vs. string comparison using the NodeCodes technique, are shown in Table 4.

Query 1: Find all ancestors of an individual. Iterative: 34 ms; NodeCodes: 34 ms.

Query 2: Find all first and second degree relatives of a proband who were diagnosed with cancer before 50 years of age (first degree relatives: mother, father, children, siblings; second degree relatives: uncles, aunts, grandparents, grandchildren). Iterative: 119 ms; NodeCodes: 84 ms.

Query 3: Find all individuals with three or more ancestors on the paternal side of the family, but not on the maternal side, who were diagnosed with cancer. Iterative: 20656 ms; NodeCodes: 11473 ms.

Query 4: Find all probands with at least two descendants diagnosed with at least three polyps before 35 years of age. Iterative: 14002 ms; NodeCodes: 1663 ms.

Table 4 Sample pedigree queries used to evaluate performance of the nodecode technique.

Experimental results show that the use of NodeCodes provides a good alternative for queries involving structural relationships, with significant improvements over iterative evaluation methods. A new encoding scheme, Family NodeCodes, proposed in [33], is further optimized for pedigree graphs. Family NodeCodes speeds up the execution of structural queries by up to a factor of 77 while using 91% less space than regular NodeCodes. Family NodeCodes decrease the size and storage overhead of NodeCodes by introducing a family-level pedigree graph, in which the nodes represent families instead of individuals. A family is defined as a unique set of two parents, a mother and a father, and a set of children. A directed edge connects two family nodes if they share an individual, for example, when the same individual is a child in one node and a parent in the other. Experimental results in [33] show that Family NodeCodes scale significantly better than the original NodeCodes, with a more than 74% reduction in the number of NodeCodes; thus, the space requirements of the family-level graph scale well for very large pedigrees. Family NodeCodes were experimentally implemented and evaluated for calculating the inbreeding coefficient of a given individual. Future work may include investigating the use of Family NodeCodes for other types of complex pedigree queries and for the visual construction of such queries.

6 Knowledge Discovery Techniques in Hereditary Cancer Data

Computerized data acquisition, management, and analysis provide physicians with a more comprehensive view of patient data and with tools to understand these data. Using a very large hereditary cancer database, we demonstrated how the pedigree management software described in the previous sections allows large-scale analysis of data for families with inherited colorectal cancer syndromes. This project applied the concepts and techniques of data mining, also popularly referred to as knowledge discovery in databases, to automatically extract patterns representing knowledge from a large clinical database [43]. This chapter is concerned with the intelligent interpretation of patient data, the discovery of medical knowledge from data, and the presentation of such knowledge in a symbolic form. Computational algorithms were applied to a large database of FAP patients and family members to evaluate possible, multiple relationships between desmoids, pedigree structure, and the clinical presentation of the disease. The goal is to use computing power to transform information into a meaningful base of knowledge, a set of predictive models, and to associate interesting and sometimes unexpected patterns, rules, and clusters with the risk of developing desmoids.

6.1 Motivational Example: Can Desmoids Be Predicted?

Desmoid tumors are part of the spectrum of extracolonic manifestations that represent an important cause of morbidity and mortality for patients with familial adenomatous polyposis (FAP), an inherited disease due to mutations in the APC gene [22]. Desmoid tumors occur in 12 to 15 percent of FAP patients and are the second most common cause of death after colorectal cancer. Although the etiology of desmoid disease is not clearly identified, some authors have found a relationship between specific APC mutations and an increased risk for desmoids [12,69]. Other risk factors for desmoids in FAP include a positive family history of the disease, female gender, and the presence of extracolonic manifestations that characterize Gardner’s syndrome (osteoma and epidermoid cysts) [4,9,16,22,23].

Due to the lack of effective treatment for desmoid tumors, morbidity remains high [23]. Therefore, estimating the risk of developing desmoid disease is an important factor in determining the timing of abdominal surgery in patients with FAP; surgery in desmoid-prone patients should be delayed [66]. For instance, pregnancy improves the course of desmoid disease in FAP [21], and surgery can ideally be delayed in desmoid-prone women until after at least one pregnancy. Furthermore, if desmoid-prone patients can be identified, an opportunity for desmoid chemoprevention may exist if patients are treated before the disease is clinically evident [67].

Toward this end, we designed a study to generate a set of models for the prediction of desmoids in FAP patients. Here, analyzing the relation between desmoids, family history, and different clinical features was our primary concern. We identified distinct clinical patterns suggestive of desmoid disease, a life-threatening complication of FAP syndrome. On the basis of family history and clinical criteria, how likely is a family member to develop desmoids? Using a knowledge discovery approach, we constructed clinically driven predictive models that aid in the prediction of desmoid disease, contribute to the understanding of the disease, and assist in developing new surgical strategies. We identified distinct clusters of desmoid-prone patients and built a rule-based classifier that searched for patterns in the database and represented them in the form of association rules.

6.2 FAP Family Data

We performed this study to see if knowledge discovery could estimate the risk of developing desmoids. The process required data integration, merging data from multiple sources into a coherent data store, and transformation of the data into a form appropriate for mining. The long history of data collection and development of information technology at the hereditary colorectal cancer registry at the Cleveland Clinic resulted in the accumulation of heterogeneous sources of information containing data on patients with FAP. Among these sources were flat text files, relational databases, and pedigree graphs. All historical data for FAP families collected since 1979 was integrated into a single pedigree management system, Cologene [53].

The Cologene database comprises nineteen relational tables describing various aspects of the clinical presentation of FAP, such as desmoids, extra-colonic manifestations, test results, polyps, surgeries, genetic testing, and pedigree relationships.

Knowledge discovery techniques were applied to an inherited colorectal cancer database that included 557 FAP families with 8,889 family members. There were 196 family members diagnosed with desmoid tumors: 111 probands and 85 other family members.

Analysis was performed at two levels of granularity: for all FAP family members and for family probands only (the family member through whom the pedigree is ascertained). The reasoning is that detailed and reliable clinical findings were recorded for probands, but potentially valuable clinical information may not be available for all family members.

We analyzed pedigree relationships, data related to the clinical presentation of FAP, test results, and follow-up. The complete data set was represented in a format that can be used by the knowledge discovery process: 143 clinical and pedigree variables for each family member were investigated in relation to the risk of desmoid tumors.
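The mining-ready representation described above can be sketched as follows; the field names and the simple encoding rule are illustrative assumptions, not the actual Cologene schema or the real 143-variable layout.

```python
# Hypothetical sketch: one family member's record flattened into a
# numeric feature vector for mining. Field names are invented.
record = {
    "desmoid_age": 44,
    "gastric_polyps": True,
    "osteoma": True,
    "epidermoid_cysts": True,
    "prophylactic_colectomy": True,
    "colon_cancer": False,
    "family_history_desmoids": True,
}

def to_vector(rec, fields):
    """Booleans become 0/1, numerics pass through, missing values map to 0.0."""
    return [float(rec.get(f) or 0) for f in fields]

fields = sorted(record)          # fixed variable order across all members
x = to_vector(record, fields)    # e.g. [0.0, 44.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

A fixed, sorted field order ensures every family member maps to the same vector layout, which is what clustering and rule-mining algorithms expect.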

A partial view of the pedigree for a family with FAP syndrome, together with a subset of the clinical findings and pedigree relationships used in the data mining analysis, is shown in Figure 6.1. Displayed in the figure is a small subset of clinical and pedigree variables for a proband that may be suggestive of desmoids. For example, the proband in this family, the first affected family member who sought medical attention for a genetic disorder, was diagnosed with a desmoid tumor at the age of 44 years and had gastric polyps, osteoma, epidermoid cysts, and a prophylactic colectomy, but no colon cancer. Pedigree data indicate that the proband had at least one family member with desmoids and osteoma, first- and second-degree relatives with colorectal cancer, and a parent with duodenal polyps. Considering the exponential number of combinations and permutations of all clinical characteristics, the goal of the discovery process is to find a particular pattern that will help identify family members at high risk.

Figure 6.1 Partial view of the pedigree for a family with an FAP syndrome and a subset of clinical findings and pedigree relations used in the data mining analysis.


6.3 Clustering Desmoid Patients

Clustering algorithms can be used to search patient records and identify patient clusters, to categorize patients with similar outcomes, and to gain insight into a particular disease. By clustering, we can discover interesting correlations among data attributes, that is, group data into classes so that patient records within the same cluster are highly similar to one another but dissimilar to records in other clusters. Analysis of this sort can be used to observe the distribution of data within one cluster and to focus on a specific set of variables for further analysis. We applied the K-means algorithm, one of the most popular clustering methods, to the family data [42]. This clustering analysis of clinical and pedigree variables identified patterns in the data set and partitioned family records into classes such that family members assigned to one cluster are as similar as possible. The algorithm identified clusters of family members with a high probability of desmoids and highlighted the distinct features of these clusters.

In the first phase of the analysis, the clustering algorithm examined 557 proband records and identified distinct groups with common characteristics. The first group included 98 female patients with a family history of desmoids, osteoma, gastric polyps, and epidermoid cysts and was considered a high-risk cluster. Probands with other characteristics, such as those with no family history of desmoids or no osteoma, fell into a low-probability cluster. Of the 98 probands in the first cluster, 42 (43%) had desmoids, whereas desmoids occurred in only 69 (15%) of the 459 probands in the second cluster (Table 5).


Cluster 1: 43% desmoids (N = 42/98)
  Gender: Female
  Gastric polyps: Yes
  Osteoma: Yes
  Epidermoid cysts: Yes

Cluster 2: 15% desmoids (N = 69/459)

Table 5 Desmoid clusters: dissimilarity measures for probands.
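The partitioning behind these clusters can be illustrated with a minimal K-means (Lloyd's algorithm) sketch; the synthetic binary profiles, the five indicator features, and the centroid seeding are all invented for illustration and do not come from the registry data.

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Minimal Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid recomputation until the centroids stop moving."""
    for _ in range(iters):
        # squared Euclidean distance from every record to every centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Synthetic stand-in for 557 proband records over 5 binary indicators
# (e.g., female, gastric polyps, osteoma, epidermoid cysts, family
# history of desmoids); the real analysis used 143 variables.
rng = np.random.default_rng(0)
high = (rng.random((98, 5)) < 0.8).astype(float)   # mostly-positive profiles
low = (rng.random((459, 5)) < 0.2).astype(float)   # mostly-negative profiles
X = np.vstack([high, low])

# Seed one centroid in each block (illustrative choice, not k-means++)
labels, centers = kmeans(X, centers=X[[0, 300]].copy())
```

On such well-separated data the two recovered clusters line up with the high- and low-risk blocks, mirroring how the study's clusters concentrated desmoid-prone probands.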

In the next stage of the analysis, we applied the same clustering algorithm to 8,889 family members, searching for combinations of factors and grouping patient records according to the factors their families have in common. The occurrence of specific extra-colonic manifestations and desmoids in the family was related to the risk of desmoids: 79 (4.5%) patients diagnosed with desmoid tumors belonged to families with a history of desmoids, osteoma, congenital hypertrophy of the retinal pigment epithelium (CHRPE), and epidermoid cysts. Family members meeting the above criteria formed one cluster; all other family members formed the “no desmoid” cluster, where the incidence of desmoid tumors was 1.6% (Table 6).

Cluster 1: 4.5% desmoids (N = 79/1,749)
  Family history of desmoids: Yes
  Family history of osteoma: Yes
  Family history of CHRPE: Yes
  Family history of epidermoid cysts: Yes

Cluster 2: 1.6% desmoids (N = 110/7,088)

Table 6 Desmoid clusters: dissimilarity measures for all family members.

6.4 Classification Based on Association

Two important data mining techniques, classification rule mining and association rule mining, were used to discover interesting or useful patterns with respect to desmoid disease in the FAP database. Association rule mining finds all the rules existing in the database, while classification rule mining reduces the large number of generated rules to a small set with special properties. An example rule can be expressed in the following if-then form:

IF Number of Adenomas > 100 THEN FAP

This rule assigns an FAP diagnosis to any patient with more than 100 synchronous adenomas. Interestingness measures are used to evaluate the discovered patterns and to separate uninteresting patterns from knowledge. In data mining terms, a discovered rule may be considered interesting if it has a high level of confidence, such as the percentage of patients diagnosed with desmoid tumors, or if it contributes new information to existing medical knowledge or differs from what is expected.

Starting from the complete set of discovered rules, association rule mining retains those that satisfy the confidence constraints.

Classification Based on Association (CBA) is a data mining technique that integrates classification and association rule mining [47]. Classification rule mining discovers a set of rules in the database to form an accurate classifier for a predetermined target, while association rule mining finds all interesting rules in the database. The integration of the two mining techniques produces a classifier that is more accurate than the state-of-the-art classification system C4.5 and comparable to a high-performance Support Vector Machine (SVM) classifier. CBA achieves high classification performance through the use of association rule mining. It produces a set of significant rules using the Apriori algorithm, where the right-hand side of each rule contains a target attribute, in this case an attribute indicating whether or not a patient has a desmoid tumor. The set of selected rules is sorted by significance, which is defined by comparing the confidence and support of each rule. The best classifier is built by adding rules from the sorted list until the accuracy of the classifier no longer improves.
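The support/confidence machinery that CBA builds on can be sketched on a toy transaction set; the item names, thresholds, and records below are invented stand-ins for the clinical attributes, and only rules of length one or two are enumerated rather than a full Apriori pass.

```python
from itertools import combinations

# Toy transaction-style records; item and target names are invented
# stand-ins for attributes in the FAP database.
records = [
    {"ecm>=3", "ecm_surgery", "colectomy", "desmoid"},
    {"ecm>=3", "ecm_surgery", "dental", "desmoid"},
    {"ecm>=3", "fundic_polyps", "desmoid"},
    {"female", "duodenal_adenoma"},
    {"female", "ecm_surgery"},
    {"colectomy"},
]

def support(itemset, rows):
    """Fraction of records containing every item in itemset."""
    return sum(itemset <= r for r in rows) / len(rows)

def mine_rules(rows, target="desmoid", min_sup=0.15, min_conf=0.6):
    """Enumerate short rules 'lhs -> target', keep those meeting the
    support and confidence thresholds, and sort them CBA-style:
    higher confidence first, ties broken by higher support."""
    items = set().union(*rows) - {target}
    rules = []
    for k in (1, 2):
        for lhs in combinations(sorted(items), k):
            lhs = frozenset(lhs)
            sup_lhs = support(lhs, rows)
            if sup_lhs == 0:
                continue
            sup_rule = support(lhs | {target}, rows)
            conf = sup_rule / sup_lhs
            if sup_rule >= min_sup and conf >= min_conf:
                rules.append((conf, sup_rule, lhs))
    return sorted(rules, key=lambda t: (-t[0], -t[1]))

rules = mine_rules(records)
```

In this toy set the single-item rule {ecm>=3} -> desmoid holds with confidence 1.0 and the highest support, so it sorts to the top, mimicking how CBA orders its candidate rule list.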

6.5 Classification Models for Desmoids

In this project, the target of the discovery process is desmoid disease, and we focus on rules that relate to desmoids. Many of the rules with a desmoid target value discovered by the rule-based classifier refer to extra-colonic manifestations or to surgery for one or more extra-colonic manifestations. The first association rule, with a high level of confidence, indicates that 100% of patients with three or more extra-colonic manifestations, surgery for one of these, and a prophylactic colectomy will develop desmoids (Figure 6.2).

if 3 or more extra-colonic manifestations
   and surgery for extra-colonic manifestations
   and prophylactic colectomy
then desmoids: 100%

Figure 6.2 Rule-Based Classifier: Association Rule 1.

The next rule implies that patients with at least three extra-colonic manifestations, of which dental abnormality is one, and surgery for an extra-colonic manifestation are also at high risk of developing desmoids (Figure 6.3).


if 3 or more extra-colonic manifestations
   and surgery for extra-colonic manifestations
   and dental abnormalities
then desmoids: 100%

Figure 6.3 Rule-Based Classifier: Association Rule 2.

Another rule with a high level of confidence indicates that 100% of patients with three or more extra-colonic manifestations, a surgery for one of these, and fundic gland polyps will have desmoids (Figure 6.4).

if 3 or more extra-colonic manifestations
   and surgery for extra-colonic manifestations
   and fundic gland polyps
then desmoids: 100%

Figure 6.4 Rule-Based Classifier: Association Rule 3.

Figure 6.5 shows that there is 90% confidence that female FAP patients who have at least three extra-colonic manifestations or surgery for an extra-colonic manifestation, together with duodenal adenomas, will be diagnosed with desmoids.


if gender: female
   and (3 or more extra-colonic manifestations
        or surgery for extra-colonic manifestations)
   and duodenal adenoma
then desmoids: 90%

Figure 6.5 Rule-Based Classifier: Association Rule 4.

Most patterns link to extra-colonic manifestations, such as the presence of three or more manifestations, surgery, family history of extra-colonic manifestations, and family history of desmoids.

6.6 Medical Implication of Knowledge Discovery

Using a very large clinical database, we demonstrated how computing allows us to analyze data for families with inherited colorectal cancer syndromes and identify distinct clinical patterns suggestive of desmoid disease. On the basis of family history and clinical criteria, the likelihood of a family member developing desmoids was estimated. The knowledge discovery approach allowed us to construct clinically driven predictive models that aid in the prediction of desmoid disease, contribute to the understanding of the disease, and support the development of new surgical strategies.

The approach used to estimate risk for desmoid disease differs from that used in other recent studies. Most authors look for differences in phenotype and genotype between patients who develop desmoids and those who do not, using uni- and multivariable analysis to look for significant determinants of risk. Speake et al. found that genotype was a strong predictor of desmoid risk, with 16 of 22 patients with APC mutations 3’ of codon 1399 developing postoperative desmoids [66]. Durno et al. also found desmoid disease to be more common in patients with an APC mutation 3’ of codon 1399 [47]. They noted that women were more at risk, especially if they had colectomy at an early age; family history was not assessable due to lack of data. Sturt et al. found that genotype, family history, and female gender were all associated with risk for desmoid disease [67], as did Bertario et al. [9]. Bertario also found osteomas and epidermoid cysts to be associated with an increased risk of desmoids. On the other hand, Nieuwenhuis et al. did not find genotype to be of help in predicting desmoid risk, nor were female gender or pregnancy predictive; they did find, however, that a family history of desmoids was associated with an over fourfold increase in risk [29]. We have used a different approach, involving data mining techniques to look for informative patterns of phenotype. We did not use mutational data in this study because it was lacking in many families in our database. The cluster analysis gives a broad overview of desmoid risk, supporting a role for female gender and a Gardner’s syndrome phenotype (and therefore genotype), which is consistent with the findings of other studies. The rule-based classifiers provide more detailed snapshots of patients at risk for desmoid disease, with rules one and two pointing particularly at Gardner’s syndrome. Rules three and four introduce gastric and duodenal polyps as part of a desmoid-predicting phenotype. This is surprising, as duodenal adenomas and fundic gland polyps can be found in the majority of patients with FAP [57]. The next step is to interface genomic and genetic data with various types of clinical data, the largest source of phenotypes, and to apply data mining techniques to genetic information and clinical data to study desmoid disease. We will look for a predictive model that can be used for an individual patient contemplating a new diagnosis of FAP and the probability of prophylactic colectomy.

7 Mining Microarray Data - Random Forest

The most important diagnostic method for a hereditary cancer syndrome is the compilation of a family history of cancer and of phenotypic features that may be related to cancer. Previous chapters demonstrated new approaches to the management of clinical and pedigree data for colon cancer patients and the use of novel data mining technologies to analyze these data. In addition to these clinical diagnostic methods, another diagnostic tool exists: microarray technology. It is one of the most promising tools available to researchers and generates an immense quantity of biological data. Typically, the final result of a microarray experiment is a set of numbers associated with the expression levels of various genes. In order to understand biological phenomena, the expression levels of hundreds or thousands of genes need to be compared among individuals with different phenotypic characteristics or at different time points. The keys to understanding colorectal cancer may be found in these numbers. Clearly, powerful analysis techniques and algorithms are essential tools in the mining and interpretation of these biological data and their integration with clinical outcomes. To conduct genotype-phenotype association studies based on large microarray expression data, we propose a scalable implementation of random forest, a classification algorithm that is well suited for microarray data. Compared with other classification methods, random forest shows high predictive accuracy for high-dimensional whole-genome data [27,63]. The goal is to develop an accurate genetic predictor of disease recurrence for early stage colon cancers using the entire genome expression profile. In the next sections, we explain the diagnostic challenges of early stage colon cancer and the importance of a high-performance classification model that can identify patients with poor outcomes, demonstrate the excellent performance of random forest in classification tasks, describe the random forest technique, and propose a fast, scalable implementation of the algorithm.

7.1 Molecular Signature for Colon Cancer Patients

The clinical outcome for patients with colorectal cancer depends on numerous clinical and genetic factors. Despite broad groupings of patients based on cancer stage to determine prognosis, individual patients may respond to their disease in different ways. For example, approximately 10% to 20% of patients with early stage colorectal cancer will develop recurrent cancer, despite the best available treatment appropriate for this stage. Currently, no accurate means exist to identify this subgroup of patients, which would present an opportunity for intervention. However, the recent development of large, high-throughput assay technology has allowed the development of predictive models based on gene expression profiling for various cancers.

More specifically, the 5-year survival rate for patients with Stage II colon cancer is approximately 75%, and no clinical test is available to identify the 25% of patients at high risk of recurrence. There is clearly a pressing need to identify new prognostic factors that determine which Stage II colon cancer patients are likely to relapse, to help guide their treatment. This information would allow better planning of treatment by identifying patients who would possibly benefit from adjuvant therapy. We hypothesize that colon cancer outcomes may be defined by an as yet undetermined set of genes that drives tumor biology.

Using microarray expression analysis, our goal is to find genes that exhibit consistently different behavior between the two conditions, good and poor outcome. The study aims to identify a molecular signature that is highly informative in identifying patients with recurrence and that predicts an individual's risk of recurrence. If the prognostic signature could be developed into a clinically feasible test, an accurate predictive model could then determine which patients with early stage colorectal cancer would benefit from radiation or chemotherapy.

7.2 Random Forest Prediction Results

The aim of this study is to reliably identify relevant predictors from a large set of candidate genes. To explore whether the tissue microarray data can be used to predict recurrence of early stage colorectal cancer, random forest classifications were carried out.

Frozen tumor specimens from 96 Stage II colon cancer patients were obtained from the Cleveland Clinic Foundation under an Institutional Review Board approved protocol. These samples were taken between 1980 and 2001 from patients who had a median follow-up time of 123 months (range 35 to 270 months). Patient eligibility criteria included colon primary Stage II adenocarcinoma, primary treatment of surgery only without adjuvant or neoadjuvant therapy, and at least three years of follow-up, except for patients who relapsed before that time. All frozen tumor tissues were processed using published methods (Affymetrix, Santa Clara, CA) and hybridized to the Affymetrix U133a GeneChip, which contains 22,000 transcripts. RNA samples from 96 patients with Stage II colon cancer were analyzed. Seventeen patients relapsed, whereas 79 remained disease-free for more than 5 years after surgery. The goal of the study was to identify a small subset of genes that could serve as a molecular signature exhibiting consistently different behavior between the two conditions.

The random forest classification algorithm was used to identify gene markers that best discriminate between patients who recurred and patients who remained disease-free. Gene expression profiling identified a small group of genes among the top-ranked genes selected by random forest that indicate a significant difference in survival. Expression levels of genes selected for the random forest classifier were strongly correlated with disease-free time. For example, Figure 7.1 illustrates that higher expression of Hp (haptoglobin), one of the “important” genes selected by the random forest classifier, adversely affects outcome. The Kaplan-Meier plot compares the disease-free survival of patients with high and low levels of Hp expression. Patients were first placed into one of two subgroups based on the expression of haptoglobin (below or above the average expression level), and the log-rank test was used to compare survival in the two groups [3].
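The product-limit (Kaplan-Meier) estimate behind such plots can be computed directly; the follow-up times and event indicators below are invented illustrative values, not the study data.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of disease-free survival.
    times: follow-up in months; events: 1 = recurrence, 0 = censored."""
    order = np.argsort(times)
    t, e = np.asarray(times)[order], np.asarray(events)[order]
    surv, s = [], 1.0
    at_risk = len(t)
    for ti in np.unique(t):
        d = int(((t == ti) & (e == 1)).sum())    # recurrences at time ti
        if d:
            s *= 1 - d / at_risk                 # multiply in this step's factor
        surv.append((ti, s))
        at_risk -= int((t == ti).sum())          # events and censored leave risk set
    return surv

# Hypothetical split by haptoglobin expression (illustrative values only)
high_hp = kaplan_meier([12, 20, 34, 40, 60], [1, 1, 0, 1, 0])
low_hp = kaplan_meier([50, 80, 120, 200, 270], [0, 1, 0, 0, 0])
```

Here the high-expression group's curve drops to 0.30 while the low-expression group stays at 0.75, the kind of separation that a log-rank test would then assess for significance.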

Figure 7.1 Kaplan-Meier survival plots for patients with differentially expressed genes selected by random forest classification algorithm.

The ultimate goal is to develop the prognostic signature into a clinically feasible test. Microarrays are excellent tools for studying the effects of many genes but cannot necessarily substitute for other tools available to biologists. Conclusions obtained with microarrays should be validated using different techniques. For instance, genes found to be differentially regulated using microarrays can be confirmed with alternative assays such as quantitative real-time polymerase chain reaction (Q-RT-PCR) and further biological experiments [53]. Another requirement for acceptance of a molecular signature is validation of the assay's performance on a truly independent patient population. Our current study with the Pathology Department at the Cleveland Clinic aims to test the signature in a totally independent group of patients. Such a validation study would support further development of this prognostic gene signature into a clinically useful diagnostic test.

7.3 Random Forest Classifier for Microarray Data

Random forest, developed by Leo Breiman, is a classification algorithm that uses an ensemble of classification trees [10]. Each of the classification trees is built using a random sample of the data (bootstrap aggregation) and random variable selection, and uses the gini index to decide the best splitting criterion. In this section we highlight the good performance and suitability of random forest for the analysis of microarray data; a detailed description of the random forest algorithm is given in Section 7.4.

The clinical application of genomics in the diagnosis and management of cancer has been proposed. As more studies are published, there has been an increasing appreciation of the challenges facing the analysis of expression data. The fact that a microarray can measure the expression levels of thousands of genes in parallel is one of the features that has made this technology popular; however, this characteristic is also a challenge. The main challenge to such genomics-based discovery comes from both the extremely large number of predictors and the complex interactions between them. An additional challenge in the analysis of a microarray-based assay is that typical data sets often contain only a comparatively small number of samples. As a result, standard statistical methods, such as logistic regression, cannot handle large numbers of variables without requiring a prohibitively large sample size. Many traditional statistical methods and data mining techniques cannot readily be applied to microarray data. If microarrays are used to determine a specific condition [2], an important question is whether the expression profile differs in a significant way among the groups considered. The classical statistical methods designed to answer such questions (e.g., chi-square tests) cannot be applied directly, because the number of variables in microarray experiments greatly outnumbers the number of experiments conducted. Therefore, novel techniques are necessary for further data interpretation [28]. Random forest is one such novel method.

Random forest has received increased attention from scientists in bioinformatics and has become a major data analysis tool. It has been applied to large-scale tissue microarray data [27,63], genome-wide association studies for complex diseases [3,63], and microarray data [36,39]. Random forest performs well in comparison with many classical statistical methods [27,38], especially in problems in genomics.

Diaz-Uriarte and Alvarez de Andres [27] demonstrated that random forest is well suited for microarray data, with high predictive accuracy comparable to other classification methods. The predictive performance of random forest was compared to Diagonal Linear Discriminant Analysis (DLDA), K Nearest Neighbor (KNN), Support Vector Machines (SVM), and Shrunken Centroids (SC) using real microarray data. Random forest shows excellent performance in classification tasks and exhibits several characteristics that make it ideal for microarray data:

- Handles problems with a small number of samples and a large number of attributes, and does not require pre-selecting relevant variables.
- Produces a single measure of importance for each predictor variable.
- Provides unbiased variable selection, identifying relevant predictors from a large set of variables.
- Can be applied to problems that involve complex interaction effects.
- Shows excellent performance without the need to fine-tune parameters (the number of attributes examined at each node and the number of trees).
- Can be used for supervised learning, such as classification and prediction, or unsupervised analysis, such as clustering.
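As a concrete sketch of this workflow, the following uses scikit-learn's RandomForestClassifier on synthetic microarray-shaped data (96 samples, 2,000 genes, 10 of them made informative); the data and parameter choices are illustrative assumptions, not the dissertation's implementation or the real expression values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Microarray-shaped toy data: 96 samples x 2,000 gene-expression values,
# with 10 genes made informative (illustrative, not real patient data).
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 2000))
y = np.zeros(96, dtype=int)
y[:17] = 1                       # 17 "recurrence" cases, as in the study
X[y == 1, :10] += 1.5            # shift 10 informative genes upward

# Out-of-bag samples give an internal error estimate without a held-out set
rf = RandomForestClassifier(n_estimators=500, random_state=0, oob_score=True)
rf.fit(X, y)

# Rank genes by the forest's importance measure
top = np.argsort(rf.feature_importances_)[::-1][:10]
```

Despite many more predictors than samples, no variable pre-selection is needed, and the importance ranking surfaces the planted informative genes, illustrating the list of properties above.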

7.4 Random Forest Classifier - Decision Tree Algorithm

This section outlines the context of the ensemble classification ideas and the architecture of the random forest algorithm that we used for classification of microarray data.

Classification refers to a form of data analysis that extracts models describing important data classes and predicts unclassified samples. Data classification involves a two-step process. In the first step, a model is constructed by analyzing training data, a randomly selected subset of the sample population; each training sample is described by a number of attributes and labeled with the class attribute. In the second step, the predictive accuracy of the model is estimated and the model is used to classify future samples for which the class label is not known. Typically, the learned model is represented in the form of classification rules or decision trees.

A decision tree is a special type of classifier: a directed acyclic graph in the form of a tree. The root of the tree has no incoming edges; all other nodes have exactly one incoming edge and may have two or more outgoing edges. Nodes without outgoing edges are called leaf nodes; the rest are called internal nodes. Every internal node is labeled with one predictor attribute, called the splitting attribute, and each leaf node is labeled with one class label. Each edge connecting an internal node to one of its children has a predicate associated with it; the combined information of the splitting attribute and splitting predicate of an internal node comprises the splitting criterion of that node. In this work, the binary tree structured classifier is considered. While multi-way splits into more than two groups can sometimes be useful, they are not a good general strategy: multi-way splits fragment the data too quickly, and the resulting tree becomes harder to interpret [37].

The basic algorithm for decision tree induction constructs trees in a top-down recursive manner. The tree starts as a single root node representing the training samples. At the root node, the database is examined and the best splitting criterion is computed: the algorithm uses a split selection method, a heuristic, to select the attribute that will best separate the samples into individual classes. The samples are partitioned according to the splitting criterion, and the algorithm recursively builds a decision tree for each partition. The recursive partitioning stops when the samples at a node all belong to the same class or when there are no remaining attributes on which the samples may be further partitioned. Each leaf node of the final decision tree is labeled with the majority class among its samples [11].
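The induction procedure described above can be sketched as a minimal recursive builder with gini-based split selection; the toy two-gene data set is invented for illustration, and the sketch omits the per-node random feature sampling that random forest adds on top.

```python
import numpy as np

def gini(y):
    """Impurity 1 - sum_j p_j^2 over the class frequencies in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - (p ** 2).sum()

def best_split(X, y):
    """Exhaustive search over features and midpoint thresholds for the
    split minimizing the weighted gini impurity of the two children."""
    best_f, best_t, best_g = None, None, gini(y)
    for f in range(X.shape[1]):
        u = np.unique(X[:, f])
        for t in (u[:-1] + u[1:]) / 2:           # midpoints between values
            left = X[:, f] <= t
            g = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if g < best_g:
                best_f, best_t, best_g = f, t, g
    return best_f, best_t

def grow(X, y):
    """Top-down recursive induction; stops when a node is pure or no
    split reduces impurity (no pruning, as in random forest)."""
    if len(np.unique(y)) == 1:
        return {"label": int(y[0])}
    f, t = best_split(X, y)
    if f is None:
        return {"label": int(np.bincount(y).argmax())}
    left = X[:, f] <= t
    return {"feat": f, "thr": t,
            "left": grow(X[left], y[left]), "right": grow(X[~left], y[~left])}

def predict(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
    return node["label"]

# Toy "expression" data: recurrence driven by gene 0 (illustrative only)
X = np.array([[0.2, 1.1], [0.4, 0.3], [1.8, 0.9], [2.5, 0.2]])
y = np.array([0, 0, 1, 1])
tree = grow(X, y)
```

As in the text, samples satisfying the node's condition go to the left branch, and the recursion terminates at pure leaves.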

To further illustrate, a decision tree for the concept cancer recurrence, indicating whether or not a patient who underwent surgical treatment for cancer is likely to develop recurrent cancer based on the expression levels of n genes, is shown in Figure 7.2. The algorithm chooses the variable and split point that achieve the best fit. The root of the tree represents the full data set. Samples that satisfy the condition at each node are assigned to the left branch; all others are assigned to the right branch. The tree stratifies the training samples into strata of high and low risk of recurrence on the basis of gene expression data. The key advantage of the recursive binary tree is its interpretability: a tree classifier provides a natural way of understanding the structure of the problem and is very popular among scientists and doctors.

Three elements need to be considered during the construction of a tree: the selection of the splits, the decision of when to stop growing the tree, and the assignment of each internal node to a class. The first problem in tree construction is how to determine the splits. The fundamental idea of a split selection method is to select each split of a subset so that the data in each of the descendant subsets are “purer” than the data in the parent subset. Several splitting methods, or measures of the goodness of splits, have been proposed [70]. The random forest algorithm uses the gini index of diversity as its attribute selection measure [11]. The gini index is an impurity-based split selection method that calculates the splitting criterion. The gini index for a data set T with j distinct class labels is defined as

gini(T) = 1 - Σj (pj)^2

where pj is the relative frequency of class j in T. Thus, the gini index does not use the plurality rule but uses the estimated probability of misclassification instead.

At each node, all predictor attributes are examined and the impurity of the best split is calculated. In order to evaluate the impurity index for an ordered or numeric attribute, training samples are sorted based on the values of the attribute, and the split point is evaluated at the midpoint between consecutive distinct data values. The number of searches for the best split point is proportional to the number of samples in the training dataset. Splits for a categorical attribute are evaluated over all possible subsets of the attribute's values. To calculate the gini index, the algorithm has to perform exhaustive subset searches. When splitting an attribute having n possible categorical values, there are 2^(n-1) - 1 possible partitions of the n values into two nonempty groups, and the computations become prohibitive for large n. At each node of the tree the algorithm searches through the variables one by one, finding the best split for each variable. The final split is chosen so that the value of the impurity index is minimized.
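As a sketch of the numeric-split search just described (one sort per attribute, candidate splits at midpoints between consecutive distinct values, weighted gini evaluated per candidate), the following Java is our own illustration; the class and method names are assumptions, not the dissertation's code:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: find the best split point for one numeric attribute by sorting the
// samples on that attribute and evaluating the weighted gini index at the
// midpoint between each pair of consecutive distinct values.
public class NumericSplit {
    static double gini(int[] counts, int total) {
        if (total == 0) return 0.0;
        double s = 0.0;
        for (int c : counts) { double p = (double) c / total; s += p * p; }
        return 1.0 - s;
    }

    /** Returns {bestSplitValue, bestWeightedGini} for labels in {0..numClasses-1}. */
    static double[] bestSplit(double[] values, int[] labels, int numClasses) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i])); // one sort per attribute
        int[] left = new int[numClasses], right = new int[numClasses];
        for (int lab : labels) right[lab]++;
        double bestGini = Double.MAX_VALUE, bestPoint = Double.NaN;
        for (int i = 0; i < n - 1; i++) {
            int lab = labels[idx[i]];
            left[lab]++; right[lab]--;                          // move sample to the left partition
            if (values[idx[i]] == values[idx[i + 1]]) continue; // split only between distinct values
            double w = (double) (i + 1) / n;
            double g = w * gini(left, i + 1) + (1 - w) * gini(right, n - i - 1);
            if (g < bestGini) {
                bestGini = g;
                bestPoint = (values[idx[i]] + values[idx[i + 1]]) / 2.0; // midpoint
            }
        }
        return new double[]{bestPoint, bestGini};
    }

    public static void main(String[] args) {
        double[] expr = {5.5, 8.6, 10.5, 9.7, 12.5};  // one gene's expression levels
        int[] cls = {0, 1, 0, 0, 1};
        double[] best = bestSplit(expr, cls, 2);
        System.out.println("split at " + best[0] + ", weighted gini " + best[1]);
    }
}
```

The incremental class counts make each candidate evaluation O(number of classes), so the sort dominates, which is exactly why the scalable variants discussed below try to avoid re-sorting.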

When training data have only numeric variables, another way of looking at the tree construction procedure is as a recursive partitioning of the data space into rectangles.

An equivalent view of the tree diagram is shown in Figure 7.3. In this representation, the tree procedure recursively partitions the unit square into rectangles such that the samples within each rectangle become more and more homogeneous.

The problem of knowing when to stop splitting and which class to assign is simple with respect to the random forest classifier. Random forest does not employ any

pruning method. Therefore, each tree is grown to the largest extent possible. The splitting procedure continues until each terminal node contains samples of one class. The assignment of each terminal node to a class is based on the plurality class of the training samples in the node.

Figure 7.2 Decision tree for the concept of cancer recurrence.

Figure 7.3 Representation of a decision tree in the form of a graph.


7.5 Previous Work - Optimization of Decision Tree Algorithms

Current implementations of random forest in the machine learning and statistics community require that the entire dataset remain permanently in memory. These include Breiman's original implementation in Fortran [10]; the implementation in R, a language and integrated suite of software for statistical computing and graphics [14]; and Weka [73], a collection of machine learning algorithms for data mining tasks implemented in Java. Such implementations limit the algorithm's usability for mining over large databases. The optimization of

decision tree algorithms for a single tree classifier where performance is measured

against the number of records has been widely investigated. However, the optimization of classification tree construction algorithms involving large numbers of attributes has received little attention in the database literature. In this section, we

investigate several frameworks that address scalability requirements of classification

construction algorithms that build a single tree classifier in the context of an expression

dataset, a dataset with large number of numeric attributes.

SLIQ, one of the first decision tree algorithms for disk-resident datasets, is a

scalable data access method for CART, a decision tree classifier that uses the gini index

as an impurity function, the same splitting criterion as random forest. For numeric

attributes, sorting time is the dominant factor when finding the best split at a decision tree node as the impurity function has to be evaluated at all possible split points, at each node of the tree [13]. To select the best splitting attribute, the algorithm requires sequential access to all numeric attributes in sorted order. SLIQ uses a presorting technique and maintenance of sort orders in the tree growth phase to avoid re-sorting at each node of the

tree and to reduce the cost of evaluating numeric attributes. It does so by sorting ordered attributes and separating the input dataset into attribute lists at the beginning of the algorithm. In addition, it requires a data structure called the class list, whose size is proportional to the number of records in the dataset. An entry in the class list contains a class label and a reference to a leaf node of the decision tree. Thus, the class list can identify the tree node to which an example belongs. An entry in an attribute list contains an index into the class list and an attribute value. Therefore, the class list is accessed frequently and randomly and must be kept in main memory. The algorithm uses a breadth-first strategy to simultaneously evaluate splits for all the leaves of the current tree in one pass over the data.

SLIQ does require that some data per record stay in main memory, limiting the number of input records. SPRINT, introduced by Shafer et al. [61], removes this limitation. SPRINT also avoids sorting at each node of the classification tree by creating sorted attribute lists at the beginning of the algorithm, but it uses a different data structure.

The initial sorted lists are associated with the root of the classification tree. As the tree is grown, the attribute lists are partitioned and distributed among the children. Partitioned lists never require resorting, as the order of the records in a list is preserved. SPRINT does not require a class list but instead uses a hash table whose size is proportional to the number of records in a node of the decision tree. The connection between the vertically separated parts of a record is made through a record identifier by performing a hash-join. SPRINT assumes that a minimum amount of main memory, corresponding to the size of a hash table, is available; otherwise several scans of the training dataset are required. Partitioning of attribute lists at each node of the decision tree may have a significant overhead of

rewriting a disk-resident dataset with a large number of attributes; it also triples the size of the training dataset. SPRINT was not designed to outperform SLIQ on datasets where a class list can fit in memory. Both SLIQ and SPRINT were developed by an IBM research group and show similar performance for datasets with about 1.5 million records.

The main contribution of RainForest is a generic scalable algorithm that can be

specialized to most classification and regression tree construction algorithms [34]. The

design goal of the RainForest framework was not to outperform SPRINT but to provide certain optimizations over it, based on assumptions about the relationship between the size of main memory and the size of the aggregated data for a predictor attribute. The main observation of this work is that most split selection methods used in the induction of decision trees follow a generic top-down classification schema; that is, a possible splitting attribute is examined independently of the other attributes. The sufficient statistics for evaluating all possible split points for an attribute are the class label distribution of each distinct value of that attribute. RainForest offers significant performance improvements over the SPRINT classification algorithm by introducing

AVC-sets (Attribute-Value, Classlabel) for each predictor attribute where counts of the individual class labels are aggregated. AVC-sets contain aggregate information for predictor attributes and can be considered as a compressed version of the attribute lists used in SPRINT. The size of the AVC-set for a categorical attribute is proportional to the number of distinct values of the attribute and is highly likely to fit in main memory. The main difference between SPRINT and the top-down RainForest schema is that the latter introduces an important component, the AVC-set. If the training database is stored in a database system, the AVC-set of a predictor attribute can be retrieved through a simple

SQL query. The RainForest optimization technique is based on the expectation that the AVC-sets of the root node, containing aggregate information on all attributes in the training database, or at least each single AVC-set of the root node, will fit in memory. However, the size of the AVC-set of an attribute with continuous numeric values will be equal to that of an attribute list in SPRINT. To take advantage of the optimization technique proposed in RainForest, numeric attributes that may impact classification results have to be discretized. However, scalability results on the number of attributes are only available for SLIQ, for a maximum of 700 attributes, demonstrating that classification time increases linearly due to the domination of I/O costs. SPRINT and RainForest are the fastest scalable classification tree construction algorithms proposed to date. Neither modifies the results of a classification method; both focus on scalability issues only.
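For illustration, an AVC-set can be sketched as a simple group-by aggregation over one attribute; the Java below is our own sketch (names assumed), not RainForest's actual code:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of an AVC-set: for one predictor attribute, the class-label
// distribution of each distinct attribute value -- the sufficient statistics
// for evaluating every split point on that attribute.
public class AvcSet {
    static Map<Double, int[]> build(double[] values, int[] labels, int numClasses) {
        Map<Double, int[]> avc = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            int[] counts = avc.computeIfAbsent(values[i], v -> new int[numClasses]);
            counts[labels[i]]++;   // aggregate count of each class label per value
        }
        // Equivalent to: SELECT attr, class, COUNT(*) FROM T GROUP BY attr, class
        return avc;
    }

    public static void main(String[] args) {
        double[] vals = {1.0, 1.0, 2.0};
        int[] labels = {0, 1, 0};
        Map<Double, int[]> avc = build(vals, labels, 2);
        System.out.println(avc.get(1.0)[0] + " " + avc.get(1.0)[1]); // value 1.0: one of each class
    }
}
```

The sketch also makes the scalability caveat above concrete: for a continuous attribute whose values are all distinct, the map degenerates to one entry per record, the same size as a SPRINT attribute list.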

7.6 Scalable Implementation of Random Forest

Efficiency and scalability become issues of concern when the random forest classifier is applied to the mining of very large real-world databases. In this section, we address the scalability aspects of the random forest algorithm, restricting our attention to a classification problem for large microarray expression datasets with thousands of numeric predictors, expression levels of genes, and a categorical dependent attribute. We present a framework for scaling up random forest to larger datasets, and describe the design and detailed performance analysis of the framework. We also provide a description of data structures and methods to optimize computation. The goal is to improve learning time of the algorithm without loss of accuracy and to allow classification to be performed on a

large disk-resident training data without imposing restrictions on the amount of training data or the number of attributes in an example.

The main insight, based on the analysis of existing data mining frameworks [10,11,14,73], is that none of the current implementations of random forest is suitable for large disk-resident datasets because they require that all training data be memory resident; they also require multiple sorting passes over the data and its subsets. The presently available statistical packages were designed for small datasets. In response to this limitation, a preprocessing step, some form of dimensionality reduction, is applied to complex datasets before the data are submitted to a decision tree classifier for analysis. Known drawbacks of present multivariate reduction tools include changes to the resulting classification model, reduced accuracy, and loss of potentially important or interesting predictors. The only other previously known scalable algorithms for a decision tree classifier, SPRINT and RainForest, describe optimization techniques for a single tree model but do not address the specific properties of datasets with a very large number of numeric predictors.

The Intel group provides an analysis of the computational resources required by the algorithm to construct an ensemble of classification trees. The analysis investigates the computational complexity of the algorithm, provides the distribution of execution time across specific operations, and identifies hot, or computationally expensive, operations. The main computational cost is due to two operations: sorting of training data on each attribute selected for the split, and random selection of both examples and subsets of variables. At every step, a decision tree uses an exhaustive search of all possible combinations of variables and split points to achieve the global minimum in impurity.

The procedure of sorting, which requires a large number of comparisons, and the procedure of sampling, which is characterized by intensive memory access, together take up almost 80 percent of the execution time.

A new framework is presented in the context of a classification problem of expression

data. One can visualize expression data as a large matrix with samples represented in

rows and genes represented in columns. What makes expression data interesting is not

only its size but also its complexity. When addressing scalability of the random forest algorithm for microarray datasets, several specific characteristics of training data have to be considered:

1. high-dimensional data that contain expression levels of thousands of genes; that is, a very large number of predictor attributes (more than 30,000)

2. relatively small number of training examples (hundreds or thousands)

3. predictor attributes have continuous numeric values

4. categorical outcome variable or a class label with distinct categorical values

We propose a new framework, the scalable implementation of random forest, which addresses specific properties of microarray data, takes the computational complexity of a decision tree classifier into consideration, and does not change the quality of the resulting model. The goal of the optimization is to speed up the tree growth phase, the most computationally expensive part of the algorithm. Next, we provide a brief overview of random forest classification, followed by a description of the new scalable framework.

The random forest algorithm, summarized in Figure 7.4, builds an ensemble of classification trees in the following way: given a specific training set T, it forms a bootstrap set Ti by sampling from T with replacement (Ti contains about two-thirds of the distinct examples in T) and grows a single tree on Ti.

The algorithm repeats this procedure k times to build an ensemble of k trees. The number of trees in an ensemble is a user-specified criterion. Random forest builds trees independently of each other on a randomly selected subset of training data, a bootstrap sample, and predicts by a majority vote. Outputs from all k trees in the ensemble vote for the most popular class. A single decision tree uses a top-down approach; it starts as a single node representing the bootstrap set Ti. At the root node, the bootstrap set is

examined and the best splitting criterion is computed. The algorithm selects the splitting

attribute and the splitting predicate of that attribute. Once decided on the splitting

criterion, the algorithm is recursively applied to each of the children. At every step, a

decision tree uses exhaustive search by trying all combinations of m randomly selected

attributes and all possible split points to achieve the maximum reduction in impurity.

Random forest does not implement a pruning method but uses fully grown trees. The

recursive partitioning stops when all samples for a given node belong to the same class.

Algorithm: Random Forest. Generate a forest of decision trees from the given

training set

Input: training database D, number of trees k, number of attributes m
Output: ensemble of decision trees for D
Method:
Build_Forest (training dataset D, number of trees k, number of attributes m)
(1) for (i = 1; i <= k; i++)
(2)   draw bootstrap set Ti from D
(3)   Build_Tree (node n, bootstrap set Ti, number of attributes m)
(4) end for

Build_Tree (node n, bootstrap set T, number of attributes m)
(1) randomly select m attributes
(2) apply gini impurity function to find the best splitting attribute among m and the splitting predicate
(3) use the best split to partition T into T1 and T2, create two children n1 and n2 of n
(4) Build_Tree (n1, T1, m)
(5) Build_Tree (n2, T2, m)

Figure 7.4 Random forest algorithm for inducing an ensemble of decision trees from training samples.
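The ensemble loop of Figure 7.4 might be sketched in Java as follows; the interfaces and names are our assumptions, with tree induction abstracted behind an interface rather than spelled out:

```java
import java.util.Random;

// Sketch of the ensemble loop: draw k bootstrap sets with replacement, grow
// one tree per set, and predict by majority vote over all trees.
public class ForestSketch {
    interface Tree { int classify(double[] sample); }
    interface TreeBuilder { Tree build(int[] bootstrapRowIds, int mTry); }

    static Tree[] buildForest(int nRows, int k, int mTry, TreeBuilder builder, long seed) {
        Random rnd = new Random(seed);
        Tree[] forest = new Tree[k];
        for (int t = 0; t < k; t++) {
            int[] boot = new int[nRows];          // bootstrap set Ti: nRows draws with
            for (int i = 0; i < nRows; i++)       // replacement; about one-third of the
                boot[i] = rnd.nextInt(nRows);     // rows are left out-of-bag
            forest[t] = builder.build(boot, mTry);
        }
        return forest;
    }

    static int majorityVote(Tree[] forest, double[] sample, int numClasses) {
        int[] votes = new int[numClasses];
        for (Tree t : forest) votes[t.classify(sample)]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++) if (votes[c] > votes[best]) best = c;
        return best;
    }

    public static void main(String[] args) {
        // Degenerate builder for illustration: every "tree" predicts class 1.
        Tree[] f = buildForest(96, 10, 150, (rows, m) -> s -> 1, 42L);
        System.out.println(majorityVote(f, new double[]{0.0}, 2)); // prints 1
    }
}
```

Because each tree depends only on its own bootstrap set and random attribute choices, the loop body is embarrassingly parallel, which is one practical attraction of the algorithm.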

Our scalable method for the random forest classifier employs a decision tree algorithm that adopts a middle ground between SPRINT and RainForest. The algorithm implements a greedy top-down approach for constructing a decision tree by repeatedly splitting training data into descendant subsets. The fundamental idea is to select each split of a subset so that the purity of each of the descendant nodes is better than that of the parent node. To find the best split, the algorithm needs to evaluate the impurity function at all possible split points for each randomly selected attribute at a node. In order to decide on the splitting criterion among continuous numeric attributes, the algorithm requires access to each of m randomly selected attributes in sorted order. The computational burden of each node includes a sort of each attribute and an evaluation of splits for each value. Similar to SPRINT and SLIQ, the algorithm avoids sorting at each node by using a pre-processing step, a one-time sorting, but uses different data structures.

The sorting technique is integrated with a breadth-first tree growing strategy which requires scanning the data.

The algorithm employs disk-resident lists of sorted indices and a small memory-resident data structure. The algorithm makes one scan over the database and constructs lists of sorted indices; a list is constructed for each attribute in the training database.

Entries in a list of sorted indices contain record identifiers of the training database sorted by the value of the corresponding attribute. The lists of sorted indices can be stored in one sequential file and can be used in the construction of all trees in a random forest.

Lists of sorted indices allow sequential access to an ordered attribute in sorted order without re-sorting all attributes at each node of the tree. The idea of pre-sorting attribute values carries over from SLIQ, but we use different structures to maintain sorted order.

We also introduce an additional memory-resident data structure, a hash table, similar to the class list proposed in SLIQ. A hash table, proportional in size to the number of records, is used to reference the class label of each record. We assume that this table always fits into memory because microarray data have a relatively small number of records. A hash-join is necessary to make a connection between the record identifiers in a list of sorted indices and the class labels in the hash table. During construction of a decision tree, the hash table is augmented with n bits corresponding to the number of nodes in the tree, a pointer to a node of the classification tree. The hash table contains a reference to a leaf node that is initially set to the root node. We assume that there is enough memory to keep the hash table in memory. Lists of sorted indices are written to disk, if necessary. With a current average memory size of 1 GB, the assumption is that the hash table and at least one list of sorted indices will fit entirely in main memory. Depending on the amount of main memory, the number of lists of sorted indices that will fit in main memory can easily be estimated at the beginning of the algorithm as a pre-processing step. The idea is to

estimate a block of lists of sorted indices that can entirely fit into main memory and to bring in one block at a time to calculate the best splitting criterion for each attribute in the block. Note that at a node n, the m randomly selected attributes are examined as possible splitting attributes; each attribute is evaluated independently of the other predictor attributes. We denote the m randomly selected attributes at a node n as a set of attributes and the m lists of sorted indices corresponding to these attributes as a set of sorted indices. All sets of sorted indices at one level of a tree are denoted as a group of sorted indices. The lists of sorted indices of each predictor attribute are needed in main memory, one at a time, to be given as an argument to the gini impurity function. Thus, the total main memory required to evaluate the gini index at a node is the maximum size of any single set of sorted indices.

For most microarray datasets, we expect that the whole set of sorted indices of the root

node will fit in main memory. If not, it is highly likely that at least a list of sorted indices

of each individual predictor attribute fits in main memory. The assumption that the set of

sorted indices of the root node fits in memory does not imply that the training database

fits in memory, since the random forest classifier selects only m random attributes at each node of a tree.
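The two data structures just described can be sketched in Java as follows (the names and in-memory representation are our assumptions; real lists of sorted indices would be disk-resident):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: one list of sorted record identifiers per attribute, built in a
// single pre-processing pass, plus a memory-resident table mapping each
// record id to its class label and current leaf node.
public class SortedIndices {
    /** sortedIds[a] holds record ids ordered by the value of attribute a. */
    static int[][] buildSortedIndices(double[][] data) { // data[row][attribute]
        int nRows = data.length, nAttrs = data[0].length;
        int[][] sortedIds = new int[nAttrs][nRows];
        for (int a = 0; a < nAttrs; a++) {
            final int attr = a;
            Integer[] ids = new Integer[nRows];
            for (int r = 0; r < nRows; r++) ids[r] = r;
            Arrays.sort(ids, Comparator.comparingDouble(r -> data[r][attr])); // one-time sort
            for (int r = 0; r < nRows; r++) sortedIds[a][r] = ids[r];
        }
        return sortedIds;
    }

    // Hash-table entry per record: class label plus a pointer to the leaf the
    // record currently belongs to (all records start at the root, node 0).
    static int[] classLabel, leafOf;

    public static void main(String[] args) {
        double[][] data = {{5.51, 4.03}, {8.58, 11.02}, {10.52, 10.17}};
        classLabel = new int[]{1, 2, 1};
        leafOf = new int[]{0, 0, 0};
        int[][] idx = buildSortedIndices(data);
        System.out.println(Arrays.toString(idx[1])); // record ids by Gene2 value: [0, 2, 1]
    }
}
```

During tree growth, scanning one list in order and probing the table for each record id yields, per leaf, the class distributions needed by the gini function without any re-sorting.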

The lists of sorted indices and hash table generated by the algorithm for the sample data of Figure 7.5 are shown in Figure 7.6. Each list of sorted indices is associated with an attribute; entries in each list consist of record identifiers sorted by value of the corresponding attribute. To compute the gini index, lists of sorted indices are processed one at a time. For each record identifier in the list of sorted indices for current attribute A, the corresponding class label and the leaf node are retrieved from the hash table. Thus, in one scan of a list of sorted indices, the best split using this attribute is computed for all

the leaf nodes. The best overall split for all of the leaf nodes is computed with one

traversal of all of the lists in a group of sorted indices.

RID   Gene1         Gene2         Gene3         Gene4         ...   Class
1     5.513948602   4.02552882    7.636228629   11.76987481   ...   1
2     8.576949738   11.01545934   11.91476442   12.42854319   ...   2
3     10.52380288   10.16997957   10.80256615   10.97484746   ...   1
4     9.717920514   7.65131205    12.04368221   9.889963467   ...   1
5     12.5336826    12.05824593   9.762327431   4.774056523   ...   2
...   ...           ...           ...           ...           ...   ...

Figure 7.5 Sample microarray expression data.

Figure 7.6 Lists of sorted indices and hash table data structures used in the algorithm for the sample data in Figure 7.5.

To evaluate performance of our algorithm, we compared classification time of the

random forest algorithm implemented in Weka [73], a collection of machine learning

algorithms in Java, with our optimized implementation of random forest in Java within

the same framework. Since Weka classifiers handle only datasets that fit in memory, the

comparison used the in-memory implementation of our framework. The performance evaluation used classification time as the primary and only metric, since the original implementation of the random forest classifier in Weka and our optimized version produce exactly the same classification models. To conduct the experiments, we used a small real-life microarray expression set with 22,000 predictor attributes, 96 records, and a binary categorical outcome variable. The running time of the original implementation of random forest in Weka is affected by sorting time at each node of a classification tree, whereas our implementation is dominated by expensive hash-join operations between lists of sorted indices and class labels. Our experiment was performed on an Intel Core 2

Duo with a 2.2 GHz processor running Mac OS version 10.4.11 with 2 GB of main memory. Figure 7.7 shows the performance of the optimized implementation of random forest as the number of trees generated by random forest increases from 10 to 100. The number of attributes in this experiment was fixed at 150. The next experiment evaluates the performance of our algorithm as the number of attributes increases. The smallest dataset holds only 150 attributes and the largest contains 20,000 attributes. The number of trees in this experiment was fixed at 10. The two experiments show that our algorithm achieves good performance on memory-resident data along two dimensions: the number of trees and the number of attributes (Figure 7.8).


Figure 7.7 A performance evaluation comparing the optimized implementation of random forest with the original implementation in Weka with a constant number of attributes.

Figure 7.8 A performance evaluation comparing the optimized implementation of random forest with the original implementation in Weka with a constant number of trees.

Past experience has shown that the use of multiple trees improves classification accuracy and helps physicians overcome generalization biases. Random forest greatly increases prediction accuracy as compared to individual classification trees. Using this algorithm, the user builds a forest of decision trees through a randomized process. This ensemble of trees is then used to combine outputs from multiple predictors. The class having the majority vote over all the trees in the forest is chosen as the winning class. Multiple trees are created independently using different subsets of data. Each tree is grown on a bootstrap sample of the training set, a random selection of about two-thirds of the original dataset. Samples that are not in the training set are referred to as out-of-bag samples. For every tree grown, about one-third of the original samples are out-of-bag, left out of the bootstrap sample, and are used to get an unbiased estimate of the test set error.

Therefore, there is no need for cross-validation or a separate test set. At each node, a subset of attributes is selected at random out of all attributes, and the best split is the best split on these randomly selected attributes. The highest accuracy is achieved when the trees are grown to maximum depth using random subsets of samples. To classify a new sample with an unknown class label, the sample is classified by each tree and the forest chooses the classification having the majority vote. The only adjustable parameters in random forest are the number of randomly selected attributes to be searched through at each node and the number of trees in the forest.

Besides accurate prediction, random forest offers the user more information about the data. This information includes variable importance measures, the effect of variables on predictions, proximity between samples, and clustering. To calculate the variable importance of an attribute m, out-of-bag samples are classified by each tree in the forest and the number of votes for the correct class is counted. Next, the values of attribute m in the out-of-bag samples are randomly permuted and classified once again. The importance of variable m is the number of correctly classified samples in the original out-of-bag data minus the number of votes for the correct class in the permuted out-of-bag data. Thus, the variable importance measure is the decrease in prediction accuracy after permuting the variable. Random forest can also calculate the proximity among samples. Since an individual tree is unpruned, the terminal or leaf nodes contain a small number of samples.
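The permutation-based importance measure described above can be sketched as follows in Java (the interface and names are our assumptions, with a single tree standing in for the forest):

```java
import java.util.Random;

// Sketch of permutation importance: importance of attribute m = correct OOB
// votes before permutation minus correct OOB votes after permuting column m.
public class PermutationImportance {
    interface Classifier { int classify(double[] sample); }

    static int correctVotes(Classifier tree, double[][] oob, int[] labels) {
        int correct = 0;
        for (int i = 0; i < oob.length; i++)
            if (tree.classify(oob[i]) == labels[i]) correct++;
        return correct;
    }

    static int importance(Classifier tree, double[][] oob, int[] labels, int m, long seed) {
        int before = correctVotes(tree, oob, labels);
        double[][] permuted = new double[oob.length][];
        for (int i = 0; i < oob.length; i++) permuted[i] = oob[i].clone();
        Random rnd = new Random(seed);
        for (int i = permuted.length - 1; i > 0; i--) {   // Fisher-Yates shuffle of column m
            int j = rnd.nextInt(i + 1);
            double tmp = permuted[i][m]; permuted[i][m] = permuted[j][m]; permuted[j][m] = tmp;
        }
        return before - correctVotes(tree, permuted, labels);
    }

    public static void main(String[] args) {
        // Toy tree that thresholds attribute 0; attribute 0 is therefore informative.
        Classifier stump = s -> s[0] > 10.0 ? 1 : 0;
        double[][] oob = {{5.0, 1.0}, {12.0, 1.0}, {8.0, 1.0}, {14.0, 1.0}};
        int[] labels = {0, 1, 0, 1};
        System.out.println(importance(stump, oob, labels, 0, 7L));
        System.out.println(importance(stump, oob, labels, 1, 7L)); // attribute 1 is ignored: 0
    }
}
```

Permuting an attribute the tree never consults leaves every prediction unchanged, so its importance is zero; permuting an attribute the tree relies on tends to break correct predictions.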

If two samples end up in the same leaf node, the similarity between them is increased by

one. At the end of the forest construction, the similarities are divided by the number of

trees. The similarity between a sample and itself is set equal to one. The random forest

similarity measurements can be used as a set of points in a Euclidean space such that the Euclidean distances between these points are approximately equal to the similarities [26].

The random forest similarities can be used as an input for a clustering procedure.
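The proximity computation can be sketched as follows in Java (the leaf-membership representation is our assumption):

```java
// Sketch: random forest proximities accumulated from leaf co-membership,
// then normalized by the number of trees.
public class Proximity {
    /** leafOf[t][i] = leaf node that sample i falls into in tree t. */
    static double[][] proximities(int[][] leafOf, int nSamples) {
        int nTrees = leafOf.length;
        double[][] prox = new double[nSamples][nSamples];
        for (int t = 0; t < nTrees; t++)
            for (int i = 0; i < nSamples; i++)
                for (int j = 0; j < nSamples; j++)
                    if (leafOf[t][i] == leafOf[t][j]) prox[i][j] += 1.0; // same leaf: +1
        for (int i = 0; i < nSamples; i++)
            for (int j = 0; j < nSamples; j++)
                prox[i][j] /= nTrees;   // divide by number of trees; prox[i][i] == 1
        return prox;
    }

    public static void main(String[] args) {
        int[][] leafOf = {{0, 0, 1}, {2, 3, 3}};  // two trees, three samples
        double[][] p = proximities(leafOf, 3);
        System.out.println(p[0][1]); // samples 0 and 1 share a leaf in 1 of 2 trees: 0.5
        System.out.println(p[0][0]); // a sample's proximity to itself: 1.0
    }
}
```

The resulting matrix (or 1 − proximity, as a dissimilarity) can be fed directly to a clustering procedure, as noted above.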

7.7 From Random Forest to Clinically Feasible Molecular Signature

There is clearly a pressing need for new prognostic factors to identify Stage II colon cancer patients who are likely to have a recurrence [24,44,46,64,74]. By identifying patients who are more likely to benefit from adjuvant therapy, such prognostic information would allow for a personalized plan of treatment. However, there is no clinical test available to

provide such prognostic information. Using microarray analysis, we describe the discovery and initial validation of a 10-gene prognostic signature for Stage II colon cancer patients.

The present study selected a small set of genes that were highly informative in identifying patients with distant metastases. The 10 genes are listed in Table 7 with their GenBank IDs and Affymetrix U133a IDs.

GenBank ID   Affymetrix U133a ID   Gene Name
HP           206697_s_at           haptoglobin
ECH1         200789_at             enoyl Coenzyme A hydratase 1, peroxisomal
ACTG2        202274_at             actin, gamma 2, smooth muscle, enteric
GPSM2        205240_at             G-protein signalling modulator 2 (AGS3-like, C. elegans)
BACE1        217904_s_at           beta-secretase
MRPL48       218281_at             mitochondrial ribosomal protein L48
RUNX1        209359_x_at           runt-related transcription factor 1
FXR2         35265_at              fragile X mental retardation, autosomal homolog 2
CALR         214316_x_at           calreticulin
APBA3        205146_x_at           amyloid beta (A4) precursor protein-binding, family A, member 3 (X11-like 2)

Table 7 The top ten genes selected by the random forest classifier.


Figure 7.9 Kaplan-Meier survival analysis on 96 frozen tumor samples using the 10 top prognostic genes selected by random forest.

The main conclusion of the present study is the usefulness of the random forest classifier for building a prognostic signature based on gene-expression profiles.

Random forest shows excellent performance for microarray datasets, which typically contain a large set of variables and a much smaller set of observations. We took advantage of the measure of variable importance returned by the random forest classifier as part of the algorithm. The main objective is to identify relevant genes for subsequent analysis, with emphasis on the biological interpretability of the output obtained. Our results clearly show the prognostic value of the predefined gene signature for Stage II colon cancer patients. At this stage of the study, we are developing this prognostic signature into a clinically feasible test, a real-time quantitative PCR assay of the 10-gene prognostic signature. The performance of the predictor remains to be validated on large series of patients before proposing a clinical application. The ability to identify colon

cancer patients with an unfavorable outcome may help those at high risk for recurrence to seek more aggressive treatment.

8 Conclusions

8.1 Summary

Mendelian genetics, although crude and overly simplistic by today's standards, presents a very enticing model in which genotypic information is used to predict phenotypes, and vice versa, almost flawlessly and with relative ease. As biologists, geneticists, biochemists, and other researchers make advances and discoveries in this field, the amount of information that can be collected to make a prediction is staggering. Further, it becomes increasingly difficult to discern relevant information, work with incomplete information, and take into account environmental factors whose effects are not always known and completely understood. The model presented in this thesis, which is already being validated by institutions at home and abroad, addresses these issues through a multidisciplinary approach: it provides a data model, uses a structured knowledge-based representation, and, ultimately and most importantly, provides a quantitative risk assessment of an individual's likelihood to develop hereditary colorectal cancer. Our model serves as a tool to advise and guide clinicians toward a more appropriate diagnosis and course of treatment, and it can predict colorectal cancer risk for family members who are not even being treated at the time.

The framework of this project consists of software that makes it possible to construct, query, and analyze pedigree data for families with hereditary colorectal cancer.

Our computerized system, Cologene, is a platform that provides clinicians and genetic counselors with an ideal, domain-friendly environment to capture complex pedigree relationships and clinical findings. This knowledge-based data model further provides

support for diagnostic, analytical, and data mining tasks and at the same time takes advantage of structured information to simplify the overall knowledge acquisition

advantage of structured information to simplify the overall knowledge acquisition

process. In-depth specific medical knowledge is integrated into the system and through computerized data acquisition and management; physicians and geneticists can

thoroughly analyze pedigree data and correlate these findings with genomic data. To underline the effectiveness and importance of this knowledge-based data model, we applied knowledge discovery methods to a large pedigree database and using a novel data mining method, we estimated the risk of a deadly condition, desmoid disease, for patients with a hereditary syndrome.

The two components of Cologene, Pedigree Editor and Pedigree Explorer, are intended for clinicians and geneticists and provide a user-friendly, intuitive, and easy-to-use

environment. Pedigree Editor's hallmark lies in its ability to edit, analyze, and graphically

visualize pedigree data. The graphical user interface allows one to draw a pedigree, edit

relevant medical information, and to dynamically explore and browse family trees since

this subsystem interfaces with a relational database. Pedigree Explorer is an advanced

query interface and is capable of specifying powerful pedigree queries dynamically.

Users have the ability to specify queries on familial relationships dynamically and in an

intuitive manner since path or structural pedigree queries are integrated within the advanced query interface.

To answer queries in Pedigree Explorer that require graph traversal and are recursive by nature, we have implemented an encoding scheme, the NodeCodes technique, which plays an essential role in implementing the advanced query interface. This labeling method annotates each node to encode the hierarchical structure of the pedigree, so that structural pedigree queries can be evaluated efficiently. Experimental results demonstrate that NodeCodes provides a good alternative for query evaluation on pedigree data.
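As an illustration of the idea behind such an encoding, here is a minimal Python sketch. It is hypothetical (the function names and code format are ours, not the actual NodeCodes implementation): each individual is labeled with path codes derived from the founders, and an ancestor query reduces to a prefix test on those codes rather than a recursive traversal at query time.

```python
# Hypothetical sketch of a NodeCodes-style labeling (illustrative only).
# Each individual receives a set of path codes from the founders; the query
# "is A an ancestor of B?" becomes a prefix test on codes, with no recursive
# graph traversal at query time.

def assign_codes(parents):
    """parents: dict mapping child -> list of parents.
    Returns dict mapping person -> set of path codes (tuples)."""
    codes = {}

    def codes_for(person):
        if person in codes:
            return codes[person]
        ps = parents.get(person, [])
        if not ps:                        # founder: one root code
            result = {(person,)}
        else:                             # extend each parent's codes by one step
            result = set()
            for i, p in enumerate(ps):
                for c in codes_for(p):
                    result.add(c + (i,))
        codes[person] = result
        return result

    everyone = set(parents) | {p for ps in parents.values() for p in ps}
    for person in everyone:
        codes_for(person)
    return codes

def is_ancestor(codes, a, b):
    """True if some code of b extends (has as a prefix) some code of a."""
    return any(cb[:len(ca)] == ca for ca in codes[a] for cb in codes[b])

# Three generations: grandparent -> parent -> child (child has two parents).
pedigree = {"parent": ["grandparent"], "child": ["parent", "spouse"]}
codes = assign_codes(pedigree)
print(is_ancestor(codes, "grandparent", "child"))   # True
print(is_ancestor(codes, "child", "grandparent"))   # False
```

Stored as per-node annotations, such codes let a structural query like "all descendants of X" be answered with prefix comparisons instead of repeated recursive joins.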

Further, a specialized graph-based data model within Pedigree Explorer, a directed acyclic graph, is used to represent familial relationships. This model uses a structured knowledge-based representation to view medical conditions related to hereditary colorectal cancer conceptually. It serves as an effective platform for pedigree data storage, retrieval, and query processing; its usefulness lies in the information and knowledge it yields, such as quantified familial risk factors and correlations between many combinations of family history and risk status.
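For illustration, the following Python sketch (hypothetical names and structure, not the system's actual schema) represents a pedigree as a DAG of parent links and derives one simple familial-risk feature: the number of affected first-degree relatives of a proband.

```python
# Illustrative sketch only: a pedigree as a DAG of child -> parents links,
# from which a simple family-history risk feature is computed.

from collections import defaultdict

def first_degree_relatives(parents, person):
    """parents: dict mapping child -> list of parents."""
    children = defaultdict(list)
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    rel = set(parents.get(person, []))              # parents
    rel |= set(children.get(person, []))            # children
    for p in parents.get(person, []):               # siblings share a parent
        rel |= {c for c in children[p] if c != person}
    return rel

def affected_first_degree(parents, affected, person):
    """Count affected first-degree relatives of `person`."""
    return sum(1 for r in first_degree_relatives(parents, person) if r in affected)

pedigree = {"proband": ["mother", "father"],
            "sister": ["mother", "father"],
            "son": ["proband", "partner"]}
affected = {"mother", "sister"}
print(affected_first_degree(pedigree, affected, "proband"))  # 2
```

Features of this kind, computed over the stored DAG, are what allow combinations of family history to be correlated with risk status.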

We have effectively correlated clinical findings collected in Cologene with genetic data based on microarray technology. A microarray assay can measure the expression levels of thousands of genes and is among the most promising tools available to researchers seeking a genetic basis for a disease; however, the amount of data generated by such an experiment is tremendous. We conducted a phenotype-genotype association study linking clinical outcomes of patients with hereditary diseases to their genomic profiles acquired by microarray assays. We devised a new framework and an optimized implementation of a decision tree ensemble algorithm, random forest, which produces excellent classification results for high-dimensional microarray data. This scalable implementation of random forest was used to look for differences between the genomes of patients with recurrent colon cancer and those without, and it allowed us to find genetic markers that were not previously correlated with colorectal cancers.
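To make the algorithm's core idea concrete, here is a toy pure-Python random forest built from bagged one-level trees on a tiny synthetic "expression" dataset. It is only a sketch of the bootstrap-and-random-feature idea, not the scalable implementation described in this thesis.

```python
# Toy pure-Python random forest (bagged decision stumps); illustrative only.
# Each tree is trained on a bootstrap sample using a random subset of features;
# predictions are made by majority vote over the trees.

import random

def stump_fit(X, y, feature_subset):
    """Choose the (feature, threshold) split minimizing misclassifications."""
    def majority(labels):
        return max(set(labels), key=labels.count) if labels else 0
    best = None
    for f in feature_subset:
        values = sorted({row[f] for row in X})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in thresholds:
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            lm, rm = majority(left), majority(right)
            err = sum(l != lm for l in left) + sum(r != rm for r in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    return best[1:]                        # (feature, threshold, left_label, right_label)

def forest_fit(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    m = max(1, int(d ** 0.5))              # features tried per tree (sqrt rule)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        trees.append(stump_fit(Xb, yb, rng.sample(range(d), m)))
    return trees

def forest_predict(trees, row):
    votes = [ll if row[f] <= t else rl for f, t, ll, rl in trees]
    return max(set(votes), key=votes.count)

# Tiny synthetic "expression" data: both features separate the two classes.
X = [[0.1, 1.0], [0.2, 1.1], [0.9, 3.0], [0.8, 3.1]]
y = [0, 0, 1, 1]
trees = forest_fit(X, y)
print([forest_predict(trees, row) for row in X])
```

In the real setting each row has thousands of gene-expression features rather than two, which is why a scalable implementation and careful feature subsampling matter.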

During the course of this project we have collaborated with clinicians, geneticists, and basic scientists, taking a multidisciplinary approach to practical tasks in hopes of improving patient care. The tools of computer science were indispensable throughout, and we have seen that an integrative, unifying approach can span numerous disciplines. Our project accomplished what it set out to do: we have devised a system that is now an invaluable tool in research and in the management of patient data.

References

1. Agarwala, R., L. G. Biesecker, K. A. Hopkins, C. A. Francomano, and A. A. Schaffer. Software for constructing and verifying pedigrees within large genealogies and an application to the Old Order Amish of Lancaster County. Genome Research 8, 3 (Mar 1998), 211-21.

2. Alizadeh, A. A., M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 6769 (Feb 2000), 503-11.

3. Altman, D. G. Practical statistics for medical research. 1st ed. Chapman and Hall, London ; New York, 1991.

4. Arvanitis, M. L., D. G. Jagelman, V. W. Fazio, I. C. Lavery, and E. McGannon. Mortality in patients with familial adenomatous polyposis. Diseases of the Colon and Rectum 33, 8 (Aug 1990), 639-42.

5. Beidleman, K., and J. Gersting. Plotting human pedigrees. Journal of Medical Systems 9, 3 (Jun 1985), 97-108.

6. Belchetz, L. A., T. Berk, B. V. Bapat, Z. Cohen, and S. Gallinger. Changing causes of mortality in patients with familial adenomatous polyposis. Diseases of the Colon & Rectum. 39, 4 (Apr 1996), 384-7.

7. Bennett, R. L., K. A. Steinhaus, S. B. Uhrich, C. K. O'Sullivan, R. G. Resta, D. Lochner-Doyle, D. S. Markel, V. Vincent, and J. Hamanishi. Recommendations for standardized human pedigree nomenclature. pedigree standardization task force of the national society of genetic counselors. American Journal of Human Genetics 56, 3 (Mar 1995), 745-52.

8. Bertario, L. Causes of death and postsurgical survival in familial adenomatous polyposis: Results from the Italian registry. Italian Registry of Familial Polyposis Writing Committee 10, 3 (1994), 225-34.

9. Bertario, L., A. Russo, P. Sala, M. Eboli, M. Giarola, F. D'amico, V. Gismondi, et al. Genotype and phenotype factors as determinants of desmoid tumors in patients with familial adenomatous polyposis. International Journal of Cancer.Journal International Du Cancer 95, 2 (Mar 2001), 102-7.

10. Breiman, L. Random forests. Machine Learning 45, (2001), 5-32.

11. Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. Chapman & Hall, New York, 1984.

12. Caspari, R., S. Olschwang, W. Friedl, M. Mandl, C. Boisson, T. Boker, A. Augustin, M. Kadmon, G. Moslein, and G. Thomas. Familial adenomatous polyposis: Desmoid tumours and lack of ophthalmic lesions (CHRPE) associated with APC mutations beyond codon 1444. Human Molecular Genetics 4, 3 (Mar 1995), 337-40.

13. Catlett, J. Megainduction: Machine learning on very large databases. Ph.D. thesis, University of Sydney, 1991.

14. Chambers, J., and the R Development Core Team. 2008. http://www.r-project.org/.

15. Chapman, C. Cyrillic software. 2000. http://www.cyrillicsoftware.com/.

16. Church, J. M. Desmoid tumours in patients with familial adenomatous polyposis. Seminars in Colon and Rectal Cancer 6 (1995), 29-32.

17. Church, J. M. A scoring system for the strength of a family history of colorectal cancer. Diseases of the Colon & Rectum. 48, 5 (2005), 889-96.

18. Church, J. M., M. Bondardi, K. Bova, L. LaGuardia, and E. Manilich. Is there something about a new mutation that causes severe disease in FAP? A comparative study of probands stratified by family history. Hereditary Cancer in Clinical Practice 5, 1 (2007), 31-50.

19. Church, J. M., G. Casey. Molecular genetics and colorectal neoplasia: A primer for the clinician. [Developments in Oncology.]. 2nd ed. Kluwer Academic Publishers, Massachusetts, 2004.

20. Church, J. M., R. Kiringoda, and L. LaGuardia. Inherited colorectal cancer registries in the united states. Diseases of the Colon & Rectum. 47, 5 (May 2004), 674-8.

21. Church, J. M., and E. McGannon. Prior pregnancy ameliorates the course of intra- abdominal desmoid tumors in patients with familial adenomatous polyposis. Diseases of the Colon and Rectum 43, 4 (2000), 445-50.

22. Church, J. M., E. McGannon, S. Hull-Boiner, M. V. Sivak, R. Van Stolk, D. G. Jagelman, V. W. Fazio, J. R. Oakley, I. C. Lavery, and J. W. Milsom. Gastroduodenal polyps in patients with familial adenomatous polyposis. Diseases of the Colon & Rectum. 35, 12 (1992), 1170-3.

23. Clark, S. K., and R. K. Phillips. Desmoids in familial adenomatous polyposis. The British Journal of Surgery 83, 11 (Nov 1996), 1494-504.

24. Compton, C. C., L. P. Fielding, L. J. Burgart, B. Conley, H. S. Cooper, S. R. Hamilton, M. E. Hammond, et al. Prognostic factors in colorectal cancer. College of American Pathologists consensus statement 1999. Archives of Pathology & Laboratory Medicine 124, 7 (Jul 2000), 979-94.

25. Coulson, A. S., D. W. Glasspool, J. Fox, and J. Emery. RAGs: A novel approach to computerized genetic risk assessment and decision support from pedigrees. Methods of Information in Medicine 40, 4 (2001), 315-22.

26. Cox, T. F., A. A. Cox. Multidimensional scaling. Monographs on statistics and applied probability. 2nd ed. Chapman & Hall/CRC, Boca Raton, 2001.

27. Diaz-Uriarte, R., and S. Alvarez de Andres. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7, 3 (Jan 2006).

28. Drăghici, S. Data analysis tools for DNA microarrays. Mathematical biology and medicine series. Chapman & Hall/CRC, Boca Raton, 2003.

29. Durno, C., N. Monga, B. Bapat, T. Berk, Z. Cohen, and S. Gallinger. Does early colectomy increase desmoid risk in familial adenomatous polyposis? Clinical Gastroenterology and Hepatology : The Official Clinical Practice Journal of the American Gastroenterological Association 5, 10 (Oct 2007), 1190-4.

30. Elayi, E., E. Manilich, and J. Church. Polishing the crystal ball: Knowing genotype improves ability to predict desmoid disease. American Society of Colon and Rectal Surgeons, Boston. 2006.

31. Elliot, B., S. F. Akgul, S. Hayes, and Z. M. Ozsoyoglu. Efficient evaluation of inbreeding queries on pedigree data. 19th International Conference on Scientific and Statistical Database Management, 2007.

32. Elliot, B., S. F. Akgul, Z. M. Ozsoyoglu, and E. Manilich. A framework for querying pedigree data. 18th International Conference on Scientific and Statistical Database Management, Vienna, Austria. 2006. 71-80.

33. Elliot, B., E. Cheng, S. Mayes, and Z. M. Ozsoyoglu. Efficiently calculating inbreeding on large pedigree databases. Information Systems Journal (2007).

34. Gehrke, J., R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. Very Large Data Bases (1998), 416-27.

35. Gersting, J. M., P. M. Conneally, and K. Beidelman. Huntington's disease research roster support with a microcomputer database management system. Computer Applications in Medical Care. Proceedings (1983), 746-9.

36. Gunther, E. C., D. J. Stone, R. W. Gerwien, P. Bento, and M. P. Heyes. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences of the United States of America 100, 16 (Aug 2003), 9608-13.

37. Hastie, T., R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer, New York, 2001.

38. Heidema, A. G., J. M. Boer, N. Nagelkerke, E. C. Mariman, A. D. L. van der, and E. J. Feskens. The challenge for genetic epidemiologists: How to analyze large numbers of SNPs in relation to complex diseases. BMC Genetics 7, (Apr 2006), 23.

39. Huang, X., W. Pan, S. Grindle, X. Han, Y. Chen, S. J. Park, L. W. Miller, and J. Hall. A comparative study of discriminating human heart failure etiology using gene expression profiles. BMC Bioinformatics 6, (Aug 2005), 205.

40. Huffman, D. A. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers 40, 9 (Sep 1952), 1098-101.

41. Iwama, T., Y. Mishima, and J. Utsunomiya. The impact of familial adenomatous polyposis on the tumorigenesis and mortality at the several organs. its rational treatment. Annals of Surgery 217, 2 (1993), 101-8.

42. Jain, A. K., M. N. Murty, and P. J. Flynn. Data clustering: A survey. ACM Computing Surveys 31, 3 (Sep 1999), 264-323.

43. Jass, J. R., D. S. Cottier, P. Jeeveratnam, V. Pokos, and P. J. Browett. Pathology of hereditary nonpolyposis colorectal cancer with clinical and molecular genetic correlations. In New strategies for treatment of hereditary colorectal cancer, ed. S. Baba. Churchill Livingstone, Tokyo, Japan, 1996.

44. Jiang, Y., G. Casey, I. C. Lavery, Y. Zhang, D. Talantov, E. Manilich, M. Martin- McGreevy, et al. Development of a clinically feasible molecular assay to predict recurrence of stage II colon cancer. The Journal of Molecular Diagnostics : JMD 10, 4 (Jul 2008), 346-54.

45. Johns, L. E., and R. S. Houlston. A systematic review and meta-analysis of familial colorectal cancer risk. American Journal of Gastroenterology 96, 10 (2001), 2992-3003.

46. Johnston, P. G. Stage II colorectal cancer: To treat or not to treat. The Oncologist 10, 5 (May 2005), 332-4.

47. Liu, B., W. Hsu, and Y. Ma. Integrating classification and association rule mining. Knowledge Discovery and Data Mining (Aug 1998), 80-6.

48. Lovett, E. Family studies in cancer of the colon and rectum. The British Journal of Surgery 63, 1 (Jan 1976), 13-8.

49. Lynch, H. T., and A. de la Chapelle. Hereditary colorectal cancer. New England Journal of Medicine 348, 10 (2003), 919-32.

50. Manilich, E. A pedigree management framework for families with hereditary colorectal cancer syndromes. Collaborative Group of the Americas on Inherited Colorectal Cancer, Santiago, Chile. 2008.

51. Manilich, E., J. M. Church, L. LaGuardia, K. Bova, S. F. Akgul, J. Young, and Z. M. Ozsoyoglu. Can desmoids be predicted? The use of knowledge discovery techniques in familial adenomatous polyposis. American Society of Colorectal Surgeons Annual Meeting, Seattle. 2006.

52. Medgen. PED 5 pedigree drawing software. Feb 2005. http://www.medgen.de/ped5.

53. Medical Software Innovations. Cologene software. Cleveland, 2001. http://colorectal.ccf.org.

54. Munzner, T., F. Guimbretiere, S. Tasiran, L. Zhang, and Y. Zhou. TreeJuxtaposer: Scalable tree comparison using Focus+Context with guaranteed visibility. ACM Transactions on Graphics 22, 3 (2003), 453-62.

55. National Human Genome Research Institute. Glossary of genetic terms. April 2008. http://www.genome.gov/glossary.cfm.

56. Newman, S. O. A tree-structured query interface for querying semi-structured data. 16th International Conference on Scientific and Statistical Database Management, Santorini Island, Greece. 2004. 127.

57. Nieuwenhuis, M. H., W. De Vos tot Nederveen Cappel, A. Botma, F. M. Nagengast, J. H. Kleibeuker, E. M. Mathus-Vliegen, E. Dekker, J. Dees, J. Wijnen, and H. F. Vasen. Desmoid tumors in a Dutch cohort of patients with familial adenomatous polyposis. Clinical Gastroenterology and Hepatology : The Official Clinical Practice Journal of the American Gastroenterological Association 6, 2 (Feb 2008), 215-9.

58. Parkin, D. M., P. Pisani, and J. Ferlay. Estimates of the worldwide incidence of eighteen major cancers in 1985. International Journal of Cancer 54, 4 (1993), 594-606.

59. Petropoulos, M., A. Deutsch, and Y. Papakonstantinou. Query set specification language (QSSL). In Proc. of WebDB, University of California. 2003. 99-104.

60. Progeny Software. Progeny clinical data management software. 2008. http://www.progenygenetics.com/clinical/.

61. Shafer, J., R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. Very Large Data Bases (Sep 1996), 544-55.

62. Sheng, L., Z. M. Ozsoyoglu, and G. Ozsoyoglu. A graph query language and its query processing. 1999.

63. Shi, T., D. Seligson, A. S. Belldegrun, A. Palotie, and S. Horvath. Tumor classification by tissue microarray profiling: Random forest clustering applied to renal cell carcinoma. Modern Pathology : An Official Journal of the United States and Canadian Academy of Pathology, Inc 18, 4 (Apr 2005), 547-57.

64. Shibata, D., M. A. Reale, P. Lavin, M. Silverman, E. R. Fearon, G. Steele Jr, J. M. Jessup, M. Loda, and I. C. Summerhayes. The DCC protein and prognosis in colorectal cancer. The New England Journal of Medicine 335, 23 (Dec 1996), 1727-32.

65. Shortliffe, E. H., J. J. Cimino. Computer applications in health care and biomedicine. Biomedical informatics. 3rd ed. Springer, 2006.

66. Speake, D., D. G. Evans, F. Lalloo, N. A. Scott, and J. Hill. Desmoid tumours in patients with familial adenomatous polyposis and desmoid region adenomatous polyposis coli mutations. The British Journal of Surgery 94, 8 (Aug 2007), 1009-13.

67. Sturt, N. J., M. C. Gallagher, P. Bassett, C. R. Philp, K. F. Neale, I. P. Tomlinson, A. R. Silver, and R. K. Phillips. Evidence for genetic predisposition to desmoid tumours in familial adenomatous polyposis independent of the germline APC mutation. Gut 53, 12 (Dec 2004), 1832-6.

68. van Berloo, R., and R. C. Hutten. Peditree: Pedigree database analysis and visualization for breeding and science. The Journal of Heredity 96, 4 (Jul-Aug 2005), 465-8.

69. Wallis, Y. L., D. G. Morton, C. M. McKeown, and F. Macdonald. Molecular analysis of the APC gene in 205 families: Extended genotype-phenotype correlations in FAP and evidence for the role of APC amino acid changes in colorectal cancer predisposition. Journal of Medical Genetics 36, 1 (Jan 1999), 14-20.

70. Weiss, S. M., C. A. Kulikowski. Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning, and expert systems. Machine learning series. Morgan Kaufmann, San Francisco, CA, 1991.

71. Wernert, E. A., and J. Lakshmipathy. PViN: A scalable and flexible system for visualizing pedigree databases. Proceedings of the 2005 ACM symposium on Applied computing, Santa Fe, New Mexico. 2005. 115-122.

72. Wetherell, C., and A. Shannon. Tidy drawings of trees. IEEE Transactions on Software Engineering 5, 5 (Sep 1979), 514-20.

73. Witten, I. H., E. Frank. Data mining: Practical machine learning tools and techniques. 2nd ed. Morgan Kaufmann, San Francisco, CA, 2005.

74. Zhou, W., S. N. Goodman, G. Galizia, E. Lieto, F. Ferraraccio, C. Pignatelli, C. A. Purdie, et al. Counting alleles to predict recurrence of early-stage colorectal cancers. Lancet 359, 9302 (Jan 2002), 219-25.
