Affymetrix® Data Mining Tool

User’s Guide

Version 3.0

For Research Use Only. Not for use in diagnostic procedures. Affymetrix Confidential

700233 Rev. 3

Trademarks

Affymetrix®, GeneChip®, EASI™, ™,, ™, HuSNP™, GenFlex™, Jaguar™, MicroDB™, 417™, 418™, 427™, 428™, Pin-and-Ring™, Flying Objective™, NetAffx™ and CustomExpress™ are trademarks owned or used by Affymetrix, Inc. Microsoft® is a registered trademark of Microsoft Corporation. Oracle® is a registered trademark of Oracle Corporation.

Limited License

PROBE ARRAYS, INSTRUMENTS, SOFTWARE AND REAGENTS ARE LICENSED FOR RESEARCH USE ONLY AND NOT FOR USE IN DIAGNOSTIC PROCEDURES. NO RIGHT TO MAKE, HAVE MADE, OFFER TO SELL, SELL, OR IMPORT OLIGONUCLEOTIDE PROBE ARRAYS OR ANY OTHER PRODUCT IN WHICH AFFYMETRIX HAS PATENT RIGHTS IS CONVEYED BY THE SALE OF PROBE ARRAYS, INSTRUMENTS, SOFTWARE, OR REAGENTS HEREUNDER. THIS LIMITED LICENSE PERMITS ONLY THE USE OF THE PARTICULAR PRODUCT(S) THAT THE USER HAS PURCHASED FROM AFFYMETRIX.

Patents

Software products may be covered by one or more of the following patents: U.S. Patent Nos. 5,733,729; 5,795,716; 5,974,164; 6,066,454; 6,090,555; 6,185,561 and 6,188,783; and other U.S. or foreign patents.

Copyright ©1999, 2001 Affymetrix, Inc. All rights reserved.

Contents

CHAPTER 1 Welcome 3

Data Mining Tool User’s Guide 3 What’s New in DMT 3.0 3 Conventions Used 4

On-line Documentation 5

Technical Support 6

Your Feedback is Welcome 6

CHAPTER 2 Installing Data Mining Tool 3.0 9

Before You Begin 9 Microsoft® SQL Server LIMS Users 9 Oracle® LIMS Users 9 MicroDB™ Users 9

Installing Data Mining Tool 10

Creating an Oracle® Alias 17 Oracle 8.1.7 Alias Configuration 17

CHAPTER 3 Affymetrix® Data Mining Tool Overview 25

Access Data 25 Affymetrix Publish 25 Affymetrix® Analysis Data Model 26


DMT Windowpanes 27

Query Data 30 Building and Running a Query 30

Viewing Query Results 33 Tables 33 Graphs 38

Analyze Query Results 43 Statistical Analyses 43 Cluster Analysis 43 Matrix Analysis 46

CHAPTER 4 Getting Started 49

Starting DMT 49

Managing Database Connections 50 Registering a Database 51 Unregistering a Database 52 Selecting a Database 53

Specifying the Default Directory 54

CHAPTER 5 Building and Running a Query 59

Building a Query 59 Starting a New Query 59 Specifying the Filters 61 Query Builder 68 Selecting Analyses for the Query 70 Specifying Analysis Filters 70

Running a Query 79

Normalizing GeneChip® Signal Data 79 Choosing Normalization Before a Query or Pivot 80 Choosing Normalization After a Query or Pivot 81 Normalization Options 81

CHAPTER 6 Managing Queries 87

Saving a Query 87 Using the Save As Command 88

Opening a Previously Saved Query 89

Deleting a Query 90

CHAPTER 7 Query Results Tables 93

Experiment Information Table 93 GeneChip® Data Mode 94 Spot Data Mode 95

Query Table 96

Pivot Data Table 97 Selecting Results for the Pivot Table 99 Running the Pivot Operation 101 Including Probe Descriptions in the Pivot Table 102 Including Annotations in the Pivot Table 102 Sorting Pivot Table Columns 103 Pivot Options 104

Working with Tables 106 Finding Probes 106 Viewing Descriptions & Obtaining Further Gene Information 107 Annotating Probes 108 Adding Probes to the Filter Grid 109

Copying Tables 110 Exporting Data 111 Expanding the Results Pane 111 Clearing the Results Pane 112

CHAPTER 8 Annotations 115

Annotating Probes 115 Loading Annotations 116 Querying Annotations 118 Adding Probes to the Filter Grid 121 Deleting Annotations 122

CHAPTER 9 Probe Lists 127

Creating Probe Lists 127 Creating a Probe List from the Query or Pivot Table 128 Creating a Probe List from Cluster Analysis 130 Creating a Probe List from Search Array Descriptions 131 Creating a Probe List from Filter 132 Creating a Probe List by Combining Existing Lists 132

Loading a Probe List 134 Specifying Probe List Members 134 Specifying an Input File 135

Using Probe Lists 137 Adding a Probe List to the Filter Grid 137 Displaying Selected Probe List Members 138

Managing Probe Lists 140 Viewing and Editing Probe List Members 140 Combining Probe Lists 142

Exporting a Probe List 143 Deleting a Probe List 144

CHAPTER 10 Array Sets 149

Creating an Array Set 149

Working with Array Sets 151 Viewing Array Sets 151

Managing Array Sets 152 Editing an Array Set 152 Deleting an Array Set 153

CHAPTER 11 Graphing Results 157

Scatter Graph 158 Plotting the Scatter Graph 158 Working with the Scatter Graph 161 Scatter Graph Options 168

Fold Change Graph 171 Plotting the Fold Change Graph 173 Working with the Fold Change Graph 176 Fold Change Graph Options 183

Series Graph 185 Plotting the Series Graph 186 Working with the Series Graph 188 Series Graph Options 191

Histogram 193 Plotting the Histogram 193 Working with the Histogram 195 Histogram Options 199

Other Graphing Features 202 Enlarging the Graph Pane 202 Changing Graph Colors 202 Copying and Clearing Graphs 204 Printing Graphs 204

CHAPTER 12 Statistical Analyses 209

Selecting an Operator 209

Average, Median, Standard Deviation or Inter-Quartile Range 210

Fold Change 212

T-Test 214

Mann-Whitney Test 216

Count & Percentage 218

CHAPTER 13 Matrix Analysis 223

Overview 223 Population Size 224

Running a Matrix Analysis 225

CHAPTER 14 Cluster Analysis 231

Self Organizing Map (SOM) Algorithm 231 Running a SOM Cluster Analysis 232 Saving a Probe List 237 SOM Filters 238 SOM Parameters 239

Correlation Coefficient Clustering Algorithm 240 Running the Correlation Coefficient Cluster 241 Correlation Coefficient Clustering Options 244 Effect of Changing Algorithm Parameters 246 Saving and Importing Seed Patterns 248

Saving a Probe List 251

CHAPTER 15 DMT Tutorial 255

Introduction 255 Step 1: Restoring the MicroDB™ Database 256 Step 2: Starting DMT 256 Step 3: Registering the Database 256 Step 4: Selecting the Tutorial Database 258 Step 5: Opening the DMT Session 258

Lesson 1: Identifying Highly Expressed Genes 259 Step 1: Specifying a Filter 259 Step 2: Selecting Analyses for the Query 260 Step 3: Pivoting on Signal & Detection Call 260 Step 4: Querying and Pivoting the Data 262 Step 5: Sorting the Pivot Table by Signal 263 Step 6: Saving a Probe List 263 Step 7: Plotting the Series Line Graph 264 Lesson 1 Summary 268 Suggested Exercise 269

Lesson 2: Calculating Averages of Replicates 270 Step 1: Specifying a Probe List for the Filter 270 Step 2: Selecting Analyses for the Query 272 Step 3: Pivoting on Signal 273 Step 4: Query and Pivot the Data 274 Step 5: Selecting Average & Standard Deviation Operators 276 Step 6: Sorting the Pivot Table 279

Step 7: Displaying Probe Set Descriptions 280 Lesson 2 Summary 281 Suggested Exercise 281

Lesson 3: Summarizing Qualitative Data 282 Step 1: Pivoting on Detection Call 282 Step 2: Performing Count & Percentage Analysis 284 Step 3: Sorting Pivot Table Results 286 Step 4: Saving a Probe List 287 Step 5: Annotating Probe List Members 287 Lesson 3 Summary 288 Suggested Exercise 288

Lesson 4: Evaluating Difference Between Two Tissues 289 Step 1: Pivoting on Signal 290 Step 2: Mann-Whitney Test 292 Step 3: Annotating Probe Sets 295 Step 4: Saving a Probe List 295 Lesson 4 Summary 296 Suggested Exercise 296

Lesson 5: Evaluating Change Call Consistency 297 Step 1: Clearing the Filter Grid & Selecting Comparison Analyses 299 Step 2: Pivoting on Difference Call 300 Step 3: Comparison Ranking 301 Step 4: Annotating Probe Sets 303 Step 5: Saving a Probe List 304 Lesson 5 Summary 304 Suggested Exercise 304

Lesson 6: Self Organizing Map (SOM) Cluster Analysis 305 Step 1: Clearing the Filter Grid & Selecting Analyses 306 Step 2: Pivoting on Signal 307 Step 3: Computing Average Signal 308 Step 4: SOM Cluster Analysis 310

Step 5: Saving & Annotating a Probe List 318 Lesson 6 Summary 318

APPENDIX A Filter Grid 323

GeneChip Data Mode 323 Statistical Expression Algorithm 323 Empirical Expression Algorithm 324

Spot Data Mode 330

APPENDIX B Working with Windows & Tables 333

Query Windowpanes 333 Expanding a Windowpane 333 Resizing a Windowpane 333 Clearing the Results or Graph Pane 334

Tables 334 Selecting the Entire Table 334 Selecting Rows 334 Resizing Columns 335 Hiding Columns 335 Reordering Columns 336

APPENDIX C Query Table Data 339

GeneChip® Data Mode 339 Statistical Expression Algorithm Metrics 339 Empirical Expression Algorithm Metrics 340

Spot Data Mode 346

APPENDIX D DMT Algorithms 349

The SOM Algorithm 349 Neighborhood 351 Learning Rate 352

The Correlation Coefficient Clustering Algorithm 353

The Matrix Algorithm 354

APPENDIX E Toolbars & Shortcuts 359

DMT Main Toolbar 359

Session Toolbar 360

Shortcut Descriptions 361

Chapter 1 Welcome

Welcome to the Affymetrix® Data Mining Tool (DMT) User’s Guide. DMT filters, queries and analyzes publish databases of GeneChip® or spotted array expression data.

Data Mining Tool User’s Guide

This manual explains how to use DMT to:

Build a query.

Display the query results in table or graph format.

Evaluate and compare replicate data using statistical analyses.

Calculate the overlap significance between two lists of GeneChip® probe sets or spot probes.

Apply cluster analysis to experimental results to help identify gene expression patterns.

This manual also includes a tutorial that demonstrates: 1) a data mining strategy to identify genes that significantly change expression level, 2) statistical analyses of replicate data, and 3) cluster analysis.

What’s New in DMT 3.0

Compatible with Microarray Suite Statistical or Empirical Expression Algorithm

DMT can query and analyze experimental results generated by the Statistical Expression algorithm (in Microarray Suite 5.0) as well as the Empirical Expression algorithm (in versions of Microarray Suite prior to 5.0). The filter includes both Statistical and Empirical metrics so that a query may specify (in “OR” fashion) both types of metrics.


Publish Database Security

Each publish database requires a login password to prevent unauthorized database access.

Conventions Used

This manual provides a detailed outline for all tasks associated with Affymetrix® Data Mining Tool. Various conventions are used throughout the manual to help illustrate the procedures described. Explanations of these conventions are provided below.

Steps

Instructions for procedures are written in a step format. Immediately following the step number is the action to be performed. On the line below the step there may be the following symbol: ⇒. This symbol marks the system response to the user action: what you see, and what has happened that you may not see. Additional information pertaining to the step may follow the response and is presented in paragraph format. For example:

9. Click Yes to continue.
⇒ The Delete task proceeds. In the lower right pane the status is displayed.
To view more information pertaining to the delete task, right-click Delete and select View Task Log from the shortcut menu.

Font Styles

Bold fonts indicate names of commands, buttons, options or titles within a dialog box. When asked to enter specific information, such input appears in italics within the procedure being outlined. For example:

1. To select another server, enter the server name in the Oracle Alias box.

2. Enter DMT_3_Tutorial in the Publish Database box, then click Register. ⇒ The tutorial database is available to DMT.

Screen Captures

The steps outlining procedures are frequently supplemented with screen captures to further illustrate the instructions given.

The screen captures depicted in this manual may not exactly match the windows displayed on your screen.

Additional Comments

Throughout the manual, text and procedures are occasionally accompanied by special notes. These additional comments and their meanings are described below.

Information presented in tips provides helpful advice or shortcuts for completing a task.

The Note format presents important information pertaining to the text or procedure being outlined.

Caution notes advise you that the consequence(s) of an action may be irreversible and/or result in lost data.

Warnings alert you to situations where physical harm to person or damage to hardware is possible.

On-line Documentation

The CD with DMT includes an electronic version of this user’s guide. The on-line documentation is in Adobe Acrobat format (a *.pdf file) and is readable with the Adobe Acrobat® Reader software, available at no charge from Adobe at http://www.adobe.com. The electronic user’s guide is printable, searchable and fully indexed. You can have it open and minimized on screen while using the DMT software.

Technical Support

Affymetrix provides technical support to all licensed users via phone or e-mail. To contact Affymetrix Technical Support:

Affymetrix Inc.
3380 Central Expressway
Santa Clara, CA 95051 USA
Tel: 1-888-362-2447 (1-888-DNA-CHIP)
Fax: 1-408-731-5441
E-mail: [email protected]

Affymetrix UK Ltd.
Voyager, Mercury Park, Wycombe Lane, Wooburn Green
High Wycombe HP10 0HH
United Kingdom
Tel: +44 (0) 1628 552550
Fax: +44 (0) 1628 552585
E-mail: [email protected]

www.affymetrix.com

Your Feedback is Welcome

Affymetrix Technical Publications is dedicated to continually improving the quality of our documentation and helping you get the information that you need. We welcome any comments or suggestions you may have regarding this manual. Please contact us at: [email protected]

Chapter 2 Installing Data Mining Tool 3.0

Installing Data Mining Tool 3.0 will uninstall any previous version of DMT. You will no longer be able to use your previous version of DMT after installing Data Mining Tool 3.0.

Before You Begin

This section guides you through the installation of Data Mining Tool 3.0. Listed below is an overview of the steps needed to complete the installation.

Microsoft® SQL Server LIMS Users

1. Obtain the name of the LIMS Server from your IT personnel if not known (this is needed during installation).

2. Install Data Mining Tool 3.0.

Oracle® LIMS Users

1. Install Oracle Client Utilities on the workstation (Oracle Client Utilities must be the same version installed on the LIMS Server).

2. Install SQL*Loader (for better performance).

3. Create an Oracle® Alias. (Refer to the section Creating an Oracle Alias, on page 17.)

4. Install Data Mining Tool 3.0.

MicroDB™ Users

Install Data Mining Tool 3.0.


Installing Data Mining Tool

The following are detailed instructions for installing DMT. Please note that the screen captures depicted in this section may not exactly match the windows displayed on your screen.

You must be logged in as administrator to install the DMT 3.0 software.

1. Log in as an administrator.

2. Insert the Affymetrix® DMT 3.0 CD-ROM.

3. If the autorun feature does not start the program:

a. Click Start → Run.

b. Type the drive letter of the CD-ROM drive followed by :\setup.exe.

c. Click OK. ⇒ The Affymetrix Software Setup window appears.

4. Click DMT 3.0 Setup. ⇒ The Welcome window appears (Figure 2.1).

Figure 2.1 Welcome window

5. Click Next.

6. Several consecutive Software License Agreement windows appear. Review the contents in each and click Yes to accept the terms of the agreement. ⇒ The Customer Information window appears (Figure 2.2).

Figure 2.2 Customer Information window

7. Enter your Name, Company and Serial Number. The serial number is located on the Affymetrix® Software Product Registration card.

If you do not have a serial number, contact Affymetrix Technical Support. If you are upgrading from a previous version, the Serial Number field populates automatically.

8. Click Next. ⇒ The Choose Destination Location window appears (Figure 2.3).

Figure 2.3 Choose Destination Location window

9. Select the destination where Data Mining Tool will be installed.

10. Click Next. ⇒ The Select Database Compatibility window appears (Figure 2.4).

Figure 2.4 Select Database Compatibility window

11. Select the type of database that DMT will connect with.

Affymetrix® LIMS - if connecting to a LIMS Server.

Affymetrix® MicroDB - if connecting to a local publish database using MicroDB™.

12. Click Next. ⇒ If connecting to a LIMS server, the Select Database Type window appears (Figure 2.5). If using MicroDB™ go to step 16.

Figure 2.5 Select Database Type window

13. Select the type of database used on the LIMS server, either SQL Server or Oracle. If you do not know the type of database you are using with the LIMS Server, please contact your IT personnel or DBA.

14. Click Next. ⇒ The Enter Information window appears (Figure 2.6).

Figure 2.6 Enter Information windows for the SQL Server database (left) or the Oracle® database (right)

15. In the Enter Information window, complete one of the following:

If SQL Server is selected, enter the SQL Server Name (usually the name of the LIMS Server).

If Oracle® is selected, enter the Oracle Alias Name.

16. Click Next. ⇒ Database connectivity is verified and the Start Copying Files window appears.

17. In the Start Copying Files window, verify the information and click Next. ⇒ Program files are copied and the system configures the registry. The Setup Complete window appears. ⇒ For Oracle systems: If a warning message regarding SQL Loader appears, continue the DMT install until complete. Then, install SQL Loader (part of Oracle) for better DMT performance. After SQL Loader is installed, re-install DMT.

18. Select Yes, I want to restart my computer now and click Finish.

Creating an Oracle® Alias

To create an Oracle alias, use the Net8 Assistant. The following steps guide you through creating an alias.

Oracle 8.1.7 Alias Configuration

1. Start → Programs → → Network Administration → Net8 Assistant. ⇒ Oracle Net8 Assistant window appears (Figure 2.7).

Figure 2.7 Oracle® Net8 Assistant window

2. Expand Local.

3. Highlight Service Naming, then from the menu bar click Edit → Create. ⇒ The Net Service Name Wizard Welcome window appears (Figure 2.8).

Figure 2.8 Net Service Name Welcome window

4. Enter the Net Service Name (which is the alias name). The name must be the same name as the local LIMS server.

If creating a remote publish server alias, Host Name must be the same as the computer name of the remote publish server.

5. Click Next. ⇒ The Networking Protocol window appears (Figure 2.9).

Figure 2.9 Networking Protocol window

6. Select TCP/IP (Internet Protocol).

7. Click Next. ⇒ The Host Name window appears (Figure 2.10).

Figure 2.10 Host Name window

The Host Name is the name of the local LIMS Server.

The Port Number is left as the default value 1521, unless it has been changed.

If creating a remote publish server alias, the Host Name must be the name of the remote publish server.

8. Click Next. ⇒ The Database SID window appears (Figure 2.11).

Figure 2.11 Database SID window

9. Select (Oracle8i) Service Name option. Enter the name of the instance on the local LIMS server.

If creating a remote publish database server alias, the Database SID name should be the instance created on the remote publish server.

10. Click Next. ⇒ The Test Service window appears (Figure 2.12).

Figure 2.12 Test Service window

11. Click Test... to test the alias created. ⇒ A Connection Test Information window appears (Figure 2.13).

Figure 2.13 Connection Test Information window

12. If the connection was successful go to step 13. If the connection was unsuccessful, follow the instructions below.

a. If the test fails, click Change Login....

Figure 2.14 Change Login window

b. Enter Username and Password, then click OK.

c. Repeat step 11.

13. Click Close.

14. Click Finish.

15. Repeat the above steps to create and test the second alias if using remote publish database server.

16. Save the configuration settings.

If your test was unsuccessful, verify that your listener is listening for your alias.

Chapter 3 Affymetrix® Data Mining Tool Overview

Affymetrix® Data Mining Tool (DMT) provides a flexible and intuitive query interface to a large number of published expression databases and helps you sift through hundreds or thousands of experimental results. This chapter provides an overview of DMT and how it interacts with publish databases. It explains the steps involved in running a query and the options available to you for viewing and analyzing results.

Access Data

DMT operates in GeneChip® data or spot data mode. It enables you to access, query and analyze data found in a publish database populated with Affymetrix GeneChip® probe array expression analysis results (*.chp) or spotted probe array intensity results (*.spt). The data mode and location of the publish database determine the DMT features available.

Affymetrix Publish Database

An Affymetrix publish database is created by an Affymetrix publishing application (see Table 3.1, on page 26). These applications import or publish analysis data (*.chp or *.spt) to a publish database located on the LIMS server or a local workstation (MicroDB™). Published data are available to DMT or other third party analysis tools, as well as database management tools such as ® 2000.


Table 3.1 Affymetrix® publishing applications

Publishing Application | Data Published | Publish Database Location
Affymetrix® LIMS | GeneChip® probe array expression analysis data (*.exp, *.cel, *.chp) | LIMS server
Affymetrix® MicroDB™ | GeneChip® probe array expression analysis data (*.chp) | Local workstation
Affymetrix® MicroDB™ | Affymetrix® Jaguar™ spotted array intensity data (*.spt) | Local workstation

You can also use DMT to query other appropriately formatted databases populated with Affymetrix® GeneChip® expression analysis results (*.chp) or spotted probe array intensity results (*.spt).

Affymetrix® Analysis Data Model

DMT is compatible with any Affymetrix® Analysis Data Model (AADM) compliant database populated with Affymetrix GeneChip® probe array expression analysis results (*.chp) or AADM-derived database populated with spotted probe array intensity results (*.spt). AADM is available at www.affymetrix.com.

DMT Windowpanes

The DMT session appears when you start a new query or previously saved query. DMT has four different panes for filtering and displaying expression data (Figure 3.1, Figure 3.2). The panes are:

Filter grid - Enables you to specify the filters and the limits the data must meet to be returned by the query.

Data tree - Displays analyses, array sets and probe lists. You can select analyses or array sets from the data tree for the query.

Graph pane - Displays graphs (scatter, fold change, series, or histogram graph) and cluster analysis results.

Results pane - Displays the experiment information, query and pivot tables.

Use the filter grid and data tree to specify query conditions. The graph and results panes display query and analysis results.

Figure 3.1 DMT display in GeneChip® data mode

Figure 3.2 DMT display in spot data mode

Query Data

A query searches a publish database to find vital experimental and expression data. User-defined filters specify the search criteria. The query returns only those records that meet the criteria or limits specified by the query filters.
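The filtering idea above can be illustrated with a minimal Python sketch. It is not DMT code: the record fields, the sample values, and the assumption that every filter must hold at once are all hypothetical and chosen only for illustration.

```python
# Hypothetical records: one dict per probe/analysis pair, as a query might return.
records = [
    {"probe": "probe_A", "analysis": "liver_1", "signal": 1520.0, "detection": "P"},
    {"probe": "probe_B", "analysis": "liver_1", "signal": 85.0, "detection": "A"},
    {"probe": "probe_A", "analysis": "brain_1", "signal": 40.0, "detection": "A"},
]


def run_query(rows, filters):
    """Return only the rows that satisfy every filter predicate."""
    return [row for row in rows if all(test(row) for test in filters)]


# Two illustrative filters: a lower limit on signal and a required detection call.
filters = [
    lambda row: row["signal"] >= 100.0,
    lambda row: row["detection"] == "P",
]

result = run_query(records, filters)
print(result)
```

With no filters, the sketch returns every record, which mirrors a query that places no limits on the data.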

Building and Running a Query To build a query:

Specify the filter conditions that the expression data must satisfy (Table 3.2 lists the type of filters for the various DMT data modes).

Select the analyses for the query from the data tree (Figure 3.1, Figure 3.2).

Analysis filters are available in GeneChip LIMS data mode. You can filter the current database so that the data tree displays only the analyses that meet user-specified criteria.

Table 3.2 DMT Filters

DMT Data Mode | Publishing Application | Analysis Filters (a) | Expression Filters (b)
GeneChip® probe array | Affymetrix® LIMS | Sample template, Experiment template, Attribute, Attribute value, Sample project, Probe Array, Sample type, Operator, Sample name | Absolute or comparison expression metrics (See Appendix A)
GeneChip® probe array | MicroDB™ | Not available | Absolute or comparison expression metrics (See Appendix A)
Spotted probe array | MicroDB™ | Not available | Intensity result, Standard deviation intensity, Pixel intensity, Background, Standard deviation Background, Ratio (See Appendix A)

a. In GeneChip LIMS mode, analysis filters interrogate the publish database and determine the analyses displayed in the data tree.

b. Filters interrogate the analyses selected in the data tree.

You can specify analysis filters in the Filter Analysis dialog box (LIMS data mode only) (Figure 3.3) that interrogate the current database and determine the analyses displayed in the data tree. To filter the analyses, select View → Analysis Filters from the menu bar.

Figure 3.3 Filter Analysis dialog box (available when connected to a publish database on the LIMS server in GeneChip® data mode)

Viewing Query Results

You can view the data retrieved from the database in both tables and graphs. This section describes the various types of tables and graphs available in DMT.

Tables

The results pane (Figure 3.1) contains three tables:

Experiment Information table
Query table
Pivot table

The query and pivot tables provide two different views of expression data.

Experiment Information Table

The experiment information table contains information about analyses or array sets selected in the data tree.

In GeneChip® data mode, the experiment information table (Figure 3.4) displays:

User-specified information such as project and experiment name.

Information automatically captured by Affymetrix® LIMS during hybridization, scanning and analysis of GeneChip probe arrays including experiment template parameters.

Values for user-modifiable expression algorithm parameters (used to calculate the expression metrics).

In spot data mode, the experiment information table (Figure 3.5) displays:

Probe array and operator name.

Parameters associated with the analysis.

To populate the experiment information table:

Select analyses or array sets in the data tree.

Click the Info toolbar button, or select Query → Experiment Information.

Figure 3.4 Experiment information table, GeneChip® data mode

Figure 3.5 Experiment information table, spot data mode

Query Table

The query table displays the expression data for probes (probe sets or spot probes) that met the query criteria (Figure 3.6 and Figure 3.7). Each row displays the probe name, analysis and expression data (for example, signal and detection) for every analysis. If the query results include the same probe from different analyses, the query table displays a separate row for each probe/analysis pair. For example, if a query returned the same probe set from four different analyses, the query table would display four rows of results for the same probe set (one row per analysis).

To populate the query table:

Select analyses or array sets in the data tree.

Specify filters (optional).

Click the Query button or select Query → Run Query from the menu bar.

Figure 3.6 Query table, GeneChip® data mode

Figure 3.7 Query table, spot data mode

Pivot Table

When a query returns the same probe from several different analyses, it is often more convenient to view the probe data (probe set or spot probe) from each analysis side by side in the same row. DMT can retrieve analyses from the database and organize the data in the pivot table so that all analysis results for the same probe are displayed in one row (Figure 3.8 and Figure 3.9). In the pivot table, the column headers display the analysis names; the columns display the expression data. The pivot table columns are available for graphing, statistical analysis, or cluster analysis.

To populate the pivot table:

Select analyses or array sets in the data tree.

Specify filters (optional).

Click the Run Pivot button or select Query → Retrieve Data from the menu bar.
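The relationship between the query table (one row per probe/analysis pair) and the pivot table (one row per probe, one column per analysis) can be sketched in Python as follows. The probe names, analysis names and signal values are hypothetical, and the sketch says nothing about how DMT stores or retrieves the data.

```python
from collections import defaultdict

# Hypothetical query-table rows: one row per probe/analysis pair.
query_rows = [
    ("probe_A", "liver_1", 1520.0),
    ("probe_A", "brain_1", 40.0),
    ("probe_B", "liver_1", 85.0),
    ("probe_B", "brain_1", 910.0),
]


def pivot(rows):
    """Group (probe, analysis, value) rows into one row per probe."""
    analyses = sorted({analysis for _, analysis, _ in rows})
    by_probe = defaultdict(dict)
    for probe, analysis, value in rows:
        by_probe[probe][analysis] = value
    # A probe missing from an analysis gets None in that column.
    table = {
        probe: [values.get(name) for name in analyses]
        for probe, values in by_probe.items()
    }
    return analyses, table


columns, table = pivot(query_rows)
print(["probe"] + columns)
for probe, values in table.items():
    print([probe] + values)
```

Each analysis name becomes a column header, so the values for one probe sit side by side and are ready for column-wise graphing or statistics.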

Figure 3.8 Pivot table, GeneChip® data mode

Figure 3.9 Pivot table, spot data mode

Graphs

DMT can display the pivot table data in graphical formats. The types of graphs available include:

Scatter graph
Fold Change graph
Series graph
Histogram graph

Each type of graph is displayed in a separate tab of the graph pane (Figure 3.1 and Figure 3.2).

The graphing functions are only available for the analyses displayed in the pivot table.

Scatter Graph

The scatter graph plots multiple pairs of user-specified numeric columns from the pivot table using a traditional scatter plot (Figure 3.10). Each point represents a probe (probe set or spot probe) common to both columns in the comparison. A point is defined by the intersection of the value on the x and y axes for the common probe.

Figure 3.10 Scatter graph, GeneChip® data mode

Fold Change Graph

The fold change graph (Figure 3.11) compares multiple pairs of user-specified numeric pivot table columns (base and comparison columns). It displays a scatter plot of the fold change of the comparison column compared to the base column. (See Appendix A for the fold change calculation.) Each point represents a probe (probe set or spot probe) that is common to the base and comparison columns. The y-axis coordinate is the average fold change for all of the base-comparison pairs that contain the probe. The x-axis coordinate is the average of the comparison column value for all of the comparison analyses that contain the probe.
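The way one plotted point is derived from several base-comparison pairs can be sketched in Python. The function `simple_fold_change` is a placeholder ratio introduced only for this illustration; it is an assumption and not the DMT fold change calculation, which this overview does not reproduce. The input values are hypothetical.

```python
def simple_fold_change(base, comparison):
    """Placeholder ratio; NOT the DMT formula, which is not reproduced here."""
    return comparison / base


def fold_change_point(pairs):
    """pairs: list of (base_value, comparison_value) for one probe.

    Returns (x, y): x is the mean comparison value, y the mean fold change.
    """
    y = sum(simple_fold_change(b, c) for b, c in pairs) / len(pairs)
    x = sum(c for _, c in pairs) / len(pairs)
    return x, y


# One hypothetical probe seen in two base-comparison pairs.
x, y = fold_change_point([(100.0, 200.0), (50.0, 200.0)])
print(x, y)
```

Only the averaging across pairs is taken from the text; swapping in a different fold change function leaves the rest of the sketch unchanged.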

Figure 3.11 Fold change graph, GeneChip® data mode

Series Graph

The series graph plots any numeric pivot table column in a line or bar graph format (Figure 3.12, Figure 3.13). The series graph is a useful way to monitor gene expression across different experiments or over a time course.

Figure 3.12 Series line graph, GeneChip® data mode

Figure 3.13 Series bar graph, GeneChip® data mode

Histogram

The histogram plots a frequency distribution of any numeric pivot table column (Figure 3.14). The histogram sorts the metric values into groups or bins (x-axis coordinate) and plots the number of probes (probe sets or spot probes) in each bin (y-axis coordinate). For example, a histogram of probe set expression signal values can help evaluate the proportion of genes expressed at different levels.
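The binning step can be illustrated with a short Python sketch. Fixed-width bins and the sample signal values are assumptions made for the example; the bin settings that DMT actually offers are covered with the histogram options later in the guide.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for value in values:
        # For non-negative values, index 0 covers [0, bin_width), index 1 the next bin, ...
        index = int(value // bin_width)
        counts[index] = counts.get(index, 0) + 1
    # Key each bin by its lower edge (x-axis); the count is the y-axis value.
    return {index * bin_width: counts[index] for index in sorted(counts)}


# Hypothetical expression signal values for six probe sets.
signals = [12.0, 48.0, 95.0, 130.0, 180.0, 410.0]
bins = histogram(signals, bin_width=100.0)
print(bins)
```

Empty bins are simply absent from the result, which keeps the sketch short; a plotting routine would normally fill them with zero.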

Figure 3.14 Histogram, expression signal data

Analyze Query Results

You can apply statistical and cluster analyses to the results displayed in the pivot table. This section describes the various types of statistical and cluster analysis available in DMT.

Statistical Analyses

DMT can apply the following statistical analyses to numeric pivot table columns:

Average
Median
Standard deviation
Inter-quartile range
Fold change
T-test
Mann-Whitney test
Count & Percentage

The pivot table displays the resulting data in new columns that are available for graphing (scatter, fold change, or series graph), clustering, or further statistical analysis.
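For orientation, a few of the listed statistics can be computed for one probe across several analyses as in the Python sketch below. It uses only the standard library; the function name is hypothetical, the sample standard deviation is an assumption, and the quartile method is Python's default, which may differ from the one DMT uses.

```python
import statistics

def summarize(values):
    """Average, median, sample standard deviation and inter-quartile range."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),
        "iqr": q3 - q1,
    }

summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(summary["average"], summary["median"])  # 5 4.5
```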

Cluster Analysis Cluster analysis finds expression profiles that have similar shapes. DMT provides two different algorithms, the Self Organizing Maps (SOM) and Correlation Coefficient algorithms, for finding those clusters.

Self Organizing Map Algorithm

The self organizing map (SOM) algorithm is designed to identify patterns in expression signals. However, any numeric pivot table column may be selected for cluster analysis. The algorithm represents the selected data of probe sets in n experiments as points in n-dimensional space. Initially, the algorithm randomly maps a grid of nodes in space, then iteratively adjusts the node positions toward collections of points until the nodes reflect clusters of probe sets with similar expression patterns. (See Appendix D for more information about the SOM algorithm.)

Figure 3.15 shows the patterns and probe set members of clusters found by the SOM algorithm.

Figure 3.15 SOM cluster results
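The general idea of a self organizing map can be sketched as follows. This is a deliberately small, one-dimensional-grid illustration and not the DMT implementation (see Appendix D for that); the function names, learning rates, neighborhood rule, and iteration count are all assumptions.

```python
import random

def nearest(nodes, point):
    """Index of the node closest to point (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], point)))

def train_som(points, n_nodes, iterations=500, seed=0):
    """Train a 1-D chain of nodes on expression profiles (lists of numbers)."""
    rng = random.Random(seed)
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    # Start with nodes placed at random inside the range of the data.
    nodes = [[rng.uniform(lo[d], hi[d]) for d in dims] for _ in range(n_nodes)]
    for t in range(iterations):
        decay = 1 - t / iterations           # updates shrink over time
        point = rng.choice(points)
        winner = nearest(nodes, point)
        for i, node in enumerate(nodes):
            grid_distance = abs(i - winner)
            if grid_distance > 1:
                continue                     # only the winner and its grid neighbors move
            rate = 0.5 * decay if grid_distance == 0 else 0.25 * decay * decay
            for d in dims:
                node[d] += rate * (point[d] - node[d])
    return nodes

profiles = [[1, 2, 3], [1.1, 2, 2.9], [3, 2, 1], [2.9, 2.1, 1]]
nodes = train_som(profiles, n_nodes=2)
labels = [nearest(nodes, p) for p in profiles]  # cluster index per probe set
print(labels)
```

Each probe set is assigned to the node it ends up closest to, so probe sets that share a node have similar profiles.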

Correlation Coefficient Algorithm

The correlation coefficient algorithm uses a nearest neighbor approach to find groups of probe sets with similar patterns. The average pattern of a group defines a cluster seed. Probe sets whose patterns closely match the seed pattern are assigned to the seed's cluster.

Figure 3.16 Correlation coefficient cluster results
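The assignment step can be pictured with the short Python sketch below, which assigns probe sets to a seed when their Pearson correlation with the seed pattern reaches a threshold. The use of Pearson correlation, the 0.9 threshold, and the function names are assumptions for illustration; seed selection itself is not shown.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return cov / (var_x * var_y) ** 0.5

def members_of_seed(seed, profiles, threshold=0.9):
    """Names of probe sets whose pattern closely matches the seed pattern."""
    return [name for name, profile in profiles.items()
            if pearson(seed, profile) >= threshold]

profiles = {"a": [10, 20, 30], "b": [3, 2, 1], "c": [2, 4, 7]}
print(members_of_seed([1, 2, 3], profiles))  # ['a', 'c']
```

Because correlation compares shape rather than magnitude, probe set "a" matches the seed even though its values are ten times larger.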

Matrix Analysis Matrix analysis enables you to compare probe lists and determine the overlap between two lists (Figure 3.17). The matrix algorithm computes the probability (P-value) that the observed overlap is expected due to random chance. The algorithm converts the P-value to an overlap significance value that is displayed in the matrix. The overlap significance value = -logP, and may range from near zero to a large number. Appendix D provides further information on the Matrix algorithm. The matrix highlights values that exceed the overlap significance threshold (pink) and values that exceed the non-overlap significance threshold (yellow).
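To make the overlap significance value concrete, the Python sketch below computes a P-value for the overlap of two probe lists and converts it with -log P. It assumes a hypergeometric model and a base-10 logarithm; both are illustrative assumptions, and the Matrix algorithm in Appendix D may differ. The function and parameter names are hypothetical.

```python
from math import comb, log10

def overlap_p_value(population, size_a, size_b, overlap):
    """Probability of seeing at least `overlap` shared probes by chance.

    Assumes list B is a random draw of size_b probes from a population
    that contains the size_a probes of list A (hypergeometric model).
    """
    total = comb(population, size_b)
    upper = min(size_a, size_b)
    tail = sum(comb(size_a, k) * comb(population - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return tail / total

def overlap_significance(population, size_a, size_b, overlap):
    """Overlap significance = -log10(P); larger means less likely by chance."""
    return -log10(overlap_p_value(population, size_a, size_b, overlap))

# Two lists of 5 probes drawn from 10 probes, sharing all 5: P = 1/252.
print(round(overlap_significance(10, 5, 5, 5), 2))  # 2.4
```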

Figure 3.17 Matrix displays the overlap significance values for two probe lists

Chapter 4 Getting Started

This chapter provides step by step instructions for completing the basic tasks that are necessary to start and run Affymetrix® Data Mining Tool (DMT).

Starting DMT

1. Click the Windows Start button, then select Start → Programs → Affymetrix → Data Mining Tool. ⇒ The Publish Database Login dialog box appears (Figure 4.1).

This dialog does not appear in MicroDB mode.

Figure 4.1 Publish Database login for LIMS mode


2. Enter the password for the publish database and click Login. ⇒ The main window appears (Figure 4.2).

Figure 4.2 DMT main window, Database02 selected

In the DMT main window, you can:

Register or unregister a publish database. Select a database for the query. Start a new DMT session. Open or delete a previously saved query.

Managing Database Connections

DMT connects with databases created using the Affymetrix® LIMS or Affymetrix® MicroDB applications (or other appropriately formatted databases). The tasks involved with managing these database connections include registering a database, selecting a database for use with DMT, or unregistering a database.

Registering a Database You must register a publish database to make it available to DMT. To register a database, use the appropriate procedure outlined below that is suited to your particular system.

Publish Database on Windows Workstation (MicroDB™ System)

1. Select Edit → Register Database from the menu bar. ⇒ The Register Database dialog box appears (Figure 4.3).

Figure 4.3 Register Database dialog box, publish database on Windows NT workstation

2. Select a database from the Publish Database drop-down list, then click Register. ⇒ The publish database is now available to DMT.

Publish Database on LIMS Server (Affymetrix® LIMS)

1. Select Edit → Register Database from the menu bar. ⇒ The Register Database dialog box appears (Figure 4.4). The Publish Database box contains a list of publish databases on the server.

Figure 4.4 Register Database dialog box, publish database on LIMS server

2. Enter the SQL server or Oracle Alias name in the Server Name box.

3. Click List Databases to display the publish databases for the server in the Publish Database drop-down list.

4. Select a database from the Publish Database drop-down list.

5. Click Register. ⇒ The Publish Database login dialog box appears (Figure 4.5).

Figure 4.5 Publish Database Login

6. Enter the database password and click Login. ⇒ The database is available to DMT.

Unregistering a Database Unregistering a database removes it from the lists of available, or registered, databases which may be queried.

1. Select Edit → Unregister Database from the menu bar. ⇒ The Unregister Database dialog box appears (Figure 4.6).

Figure 4.6 Unregister Database dialog box

2. Select a database, then click Unregister. ⇒ The database is no longer available to DMT.

Selecting a Database DMT connects to a single publish database at a time. By default, DMT connects to the most recently registered database.

Select the database of interest before opening a DMT session.

1. Close all the DMT sessions and return to the main DMT window (Figure 4.2).

2. Select Edit → Select Database from the menu bar, then select a database. ⇒ The Publish Database Login dialog box appears (Figure 4.7).

Figure 4.7 Publish Database Login

3. Enter the password, then click Login. ⇒ The status bar at the bottom of the main window displays the current database name (Figure 4.2).

If the status bar is not displayed, select View → Status Bar from the menu bar.

Specifying the Default Directory

You must specify a default directory that identifies the location of files for import (for example, when loading probe lists or annotations) or export when the data export option is selected.

1. Open a DMT session.

2. Click the Options button. Alternatively, select View → Options from the menu bar. ⇒ The Data Mining Options dialog box appears (Figure 4.8).

Figure 4.8 Data Mining Options dialog box, Default Directory tab

3. Click the Default Directory tab.

4. Click the Browse button. ⇒ The Browse for Folder dialog box appears (Figure 4.9).

Figure 4.9 Browse for Folder dialog box

5. Locate the default directory, then click OK.

Chapter 5 Building and Running a Query

A query is the key to obtaining interesting data for subsequent analysis using Affymetrix® Data Mining Tool. This chapter explains how to define a query (specify the conditions that the data must meet to be retrieved from the database) and select analyses for the query from the current database.

Building a Query

The three main steps for building a query are:

1. Open a DMT session.

2. Specify the filters.

3. Select analyses or array sets for the query.

Affymetrix® Data Mining Tool operates in GeneChip® data mode or spot data mode. The data mode and location of the publish database (LIMS server or Windows NT workstation) determine the DMT features available.

Starting a New Query To start a new query in GeneChip® data mode, select Data → New → GeneChip Mining from the menu bar. ⇒ A new DMT session starts (Figure 5.1).

To start a new query in spot data mode, select Data → New → Spotted Array Mining from the menu bar. ⇒ A new DMT session starts (Figure 5.2).

You can open more than one DMT session at a time. Select Window → Cascade, or Window → Tile from the menu bar to organize the open windows.


Figure 5.1 DMT session, GeneChip® data mode (graph pane not displayed until a graph or cluster result is generated)

Figure 5.2 DMT session, spot data mode (graph pane not displayed until a graph or cluster result is generated)

DMT Session Components

Session toolbar | Provides access to additional functions specific to the DMT session. See Appendix E for detailed toolbar information.
Filter grid | Provides a flexible interface for selecting expression metrics for filtering and entering the limits the data must meet to be returned by the query.
Data tree | Displays the analyses in the current publish database. When the database is on the LIMS server, the Filter Analysis dialog box can be used to filter the analyses displayed in the data tree. The data tree also displays array sets and probe lists. Select the analyses for the query from the data tree.
Results pane | Displays the experiment information, query, and pivot tables that contain information about the analyses or array sets selected in the data tree, and query results.
Graph pane | Displays graphs or cluster analysis results. This pane is not displayed until a graph or cluster analysis is generated.

Specifying the Filters

The filter grid (Figure 5.3, Figure 5.4) enables you to select expression metrics for filtering and specify the limits that the data must meet to be returned by the query.

Figure 5.3 Filter grid, GeneChip® data mode

Figure 5.4 Filter grid, spot data mode

Filter Grid Components

Column headers | Displays the probe set or spot probe name and expression metrics available for the filter. GeneChip® data mode: Any absolute or comparison expression analysis metric generated by the Statistical Expression algorithm or the Empirical Expression algorithm (in versions of Microarray Suite earlier than 5.0). (See the Affymetrix Microarray Suite User's Guide for more information about the expression algorithms and metrics.) Spot data mode: Intensity, intensity standard deviation, pixel count, background, background standard deviation, ratio.
Sort | Specifies a sort order (ascending, descending, or none) in the query table for a results column. Note: This sort specification does not affect the pivot table. To sort a pivot table column, right-click the column header and select a sort option from the shortcut menu.
Line 1 through n | Accommodates the entries that specify metric limits. Limits entered in two or more cells of the same row are combined in AND fashion (intersection). Limits entered in subsequent rows are combined in OR fashion (union).

Entering Limits

1. Double-click the cell of interest in Line 1 of the grid (not the Sort row). ⇒ The blinking cursor in the cell indicates DMT is ready to accept typed input (Figure 5.5).

Table 5.1 and Table 5.2 describe query operators and statements and provide example limits.

Figure 5.5 Filter grid, GeneChip® data mode

If you double-click a cell in the last row of the filter grid, DMT automatically adds another row to the grid.

2. Enter the limit, then do one of the following:

Double-click the next cell where you want to enter a limit.
Press the ENTER key to complete the entry and move the cursor to the grid cell below in line 2.
Press the TAB key to complete the entry and move the cursor to the right to the next cell in the row.

Limits may be entered in all columns and many rows of the filter grid. Limits in two or more cells in the same row are logically connected with an AND (intersection) statement. Limits entered in subsequent rows are logically connected with an OR (union) statement.

Enter limits for Statistical algorithm metrics and Empirical algorithm metrics on separate lines in the filter grid.

Figure 5.6 shows an example filter that specifies probe sets with a signal greater than 400 AND detection p-value < 0.1.

Figure 5.6 Filter grid, GeneChip® data mode
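The row and column logic of the filter grid can be modeled as in the Python sketch below: limits within a row are combined with AND, and rows are combined with OR. The data layout (one dictionary per grid row), the metric names, and the treatment of a completely empty grid as "no filter" are assumptions made for the example.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "=": operator.eq, "!=": operator.ne}

def row_matches(row, record):
    """All limits in one grid row must hold (AND, intersection)."""
    return all(OPS[op](record[metric], value)
               for metric, (op, value) in row.items())

def grid_matches(grid, record):
    """A record is returned if any non-blank row matches (OR, union)."""
    rows = [row for row in grid if row]  # blank rows have no effect
    if not rows:
        return True  # assumption: an empty grid filters nothing out
    return any(row_matches(row, record) for row in rows)

# The filter of Figure 5.6: signal > 400 AND detection p-value < 0.1,
# followed by the blank last row of the grid.
grid = [{"Signal": (">", 400), "Detection p-value": ("<", 0.1)}, {}]
print(grid_matches(grid, {"Signal": 650, "Detection p-value": 0.02}))  # True
print(grid_matches(grid, {"Signal": 650, "Detection p-value": 0.30}))  # False
```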

Use Probe Lists to quickly add a group of associated probes to the filter. Right-click the cell in the Probe Set Name column, and select Probe List from the shortcut menu. See page 137 for more information.

Entering Multiple Limits in a Single Cell Limits containing AND (intersection) or OR (union) operators may be entered in a single cell.

For example, the limit in Figure 5.7 defines the range between 500 and 1000 (the intersection of the range > 500 and the range < 1000). The query returns probe sets where: 500 < signal < 1000. Probe sets with signal ≤ 500 or ≥ 1000 are not returned.

Figure 5.7 Filter grid, GeneChip® data mode

DMT automatically adds a blank row to the bottom of the grid to accommodate another OR entry. The last row of the grid may remain blank with no effect on the query.

Editing Limits

Double-click the cell to highlight the entire limit, then do one of the following:

Enter a new limit (overwrites the old limit).
Right-click the mouse and make a selection from the shortcut menu of edit commands.
Use the mouse to select part of the limit, then enter new text.

An Oracle® database is case sensitive.

Specifying a Sort Order for the Query Table You may specify a sort order (ascending, descending, or not sorted) for the query table.

1. In the filter grid, click the cell in the Sort row (first row) for the metric you want to sort (for example, Signal in Figure 5.8). ⇒ An arrow button appears.

Figure 5.8 Filter grid, GeneChip® data mode

2. Click the arrow, and select a sort order from the drop-down list that appears (Figure 5.9).

Figure 5.9 Filter grid, sort options (GeneChip® data mode)

3. Repeat steps 1 - 2 for additional metrics you want to sort.

If a sort order is specified for two or more metrics, the sort is prioritized from left to right. For example, the sort settings in Figure 5.10 sort the query results first by descending signal, then by ascending detection p-value.

Figure 5.10 Filter grid, multiple column sort (GeneChip® data mode)
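The left-to-right priority can be reproduced with a sequence of stable sorts, as in this Python sketch. The column names and the row layout are hypothetical; only the ordering rule comes from the text above.

```python
def sort_query_rows(rows, sort_spec):
    """Sort rows by several columns, leftmost column having highest priority.

    sort_spec is a list of (column, direction) in left-to-right grid order,
    where direction is "ascending" or "descending".
    """
    ordered = list(rows)
    # Python's sort is stable, so sorting by the lowest-priority column
    # first and the highest-priority column last gives the combined order.
    for column, direction in reversed(sort_spec):
        ordered.sort(key=lambda r: r[column],
                     reverse=(direction == "descending"))
    return ordered

rows = [
    {"probe": "a", "signal": 500, "p_value": 0.05},
    {"probe": "b", "signal": 900, "p_value": 0.01},
    {"probe": "c", "signal": 500, "p_value": 0.01},
]
spec = [("signal", "descending"), ("p_value", "ascending")]
print([r["probe"] for r in sort_query_rows(rows, spec)])  # ['b', 'c', 'a']
```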

Table 5.1 Query operators and example query statements

Comparison Operators | Definition | Example Limit | Returns the Record for Metric Data...
= | Equal (number or character field) | =3 | Equal to 3
= | Equal (number or character field) | =‘P’ | Called present
> | Greater than | >5 | Greater than 5
< | Less than | <20 | Less than 20
>= | Greater than or equal to | >=6 | Greater than or equal to 6
<= | Less than or equal to | <=19 | Less than or equal to 19
!= | Not equal to (number or character field) | !=25 | Not equal to 25

Ranges | Definition | Example Limit | Returns the Record for Metric Data...
BETWEEN | Returns records with the metric value between the user-specified limits | BETWEEN 2 AND 5 | Between 2 and 5
NOT BETWEEN | Returns records where the metric value is not between the user-specified limits | NOT BETWEEN 1 AND 1.5 | Not between 1 and 1.5

Lists | Definition | Example Limit | Returns the Record for Metric Data...
IN | Returns records that match any one of the values in the list | IN (‘cre’, ‘bioB’) | cre or bioB
NOT IN | Returns records that do not match any one of the values in the list | NOT IN (‘cre’, ‘bioB’) | Not cre or bioB
LIKE | Searches character fields such as probe name and returns records that match the pattern in the LIKE statement | LIKE ‘cre’ | cre
LIKE | (as above) | LIKE ‘cr_’ | cr followed by any single character (the underscore symbol (_) is the wild card for a single character)
LIKE | (as above) | LIKE ‘cr%’ | cr followed by any string of zero or more characters (the % symbol is the wild card for any string of zero or more characters)
NOT LIKE | Searches character fields such as probe name and returns records that do not match the pattern in the NOT LIKE statement | NOT LIKE ‘cr%’ | Not cr followed by any string of zero or more characters (the % symbol is the wild card for any string of zero or more characters)

Logical Operators & Complex Statements | Definition | Example Statement | Returns the Record for Metric Data...
AND | Connects two conditions and only returns results when both conditions are true | >5 AND <6 | Greater than 5 and less than 6
OR | Connects two conditions and returns results when either condition is true | <5 OR >9 | Less than 5 or greater than 9
NOT | Negates a condition when combined with various operators. For example, NOT LIKE, NOT IN | NOT < 5000 | Not less than 5000
( ) | Used to force the order of evaluation of two or more combined conditions | (>5 AND <10) OR (>200 AND <500) | Greater than 5 and less than 10 or greater than 200 and less than 500
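The wild card behavior of LIKE can be illustrated by translating a pattern into a regular expression, as in the Python sketch below. This models only the two wild cards listed in Table 5.1 and is case-sensitive; it is not how DMT or the underlying database evaluates the statement, and the function names are hypothetical.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression.

    '_' matches exactly one character and '%' matches zero or more
    characters; every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def like(value, pattern):
    """True when the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

print(like("cre", "cr_"))   # True: cr followed by one character
print(like("cr", "cr_"))    # False: the single character is missing
print(like("cr", "cr%"))    # True: % also matches zero characters
```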

Table 5.2 Expression call search strings (GeneChip® data mode)

Absolute Call | Limit
Present | =‘P’
Marginal | =‘M’
Absent | =‘A’
No call | =‘No Call’

Difference Call | Limit
Increased | =‘I’
Marginally increased | =‘MI’
No change | =‘NC’
Marginally decreased | =‘MD’
Decreased | =‘D’

An Oracle database is case-sensitive. Use upper case letters to specify the call, except for ‘No Call’.

Query Builder

The Query Builder helps you input complex limits in the filter grid without prior knowledge of correct syntax for operators such as BETWEEN and LIKE. You need only specify text or numbers where appropriate. The Query Builder inserts the logical operators and syntactically correct limit into the user-specified cell of the filter grid.

Entering Limits

1. Right-click the cell of interest in the filter grid (do not click the Sort row).

2. Select Show Query Builder from the shortcut menu that appears. ⇒ The Build Filter dialog box appears for the chosen type of result (Figure 5.11).

Figure 5.11 Build Avg Diff Filter dialog box

3. Click an operator or statement button. See Table 5.1 for information on operators and statements.

4. Enter appropriate text to complete the limit.

Lower case text in the query builder is a place holder that must be replaced with your input. A text search string must contain single quotation marks (for example, LIKE ‘YDR154C/’).

5. Click OK or press the ENTER key to place the limit in the filter grid.

Editing Limits

1. Click Undo in the Build Filter dialog box. ⇒ The last entry is deleted.

2. Alternatively, select the text you want to edit, then make a new entry or right-click to open a shortcut menu of edit commands.

The BACKSPACE, DELETE, and arrow keys are supported during editing in the Build Filter dialog box.

Selecting Analyses for the Query An analysis includes the GeneChip® expression analysis results (*.chp) or spotted array intensity results (*.spt) derived from an experiment. An analysis is computed using particular values for user-modifiable algorithm parameters.

Selecting Analyses from the Data Tree The data tree displays analyses in the current database as well as array sets. See Chapter 10 for more information about array sets.

To select analyses for the query, click the analyses or array sets in the data tree. To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

Specifying Analysis Filters If the publish database is on the LIMS server, you may specify analysis filters that determine the analyses displayed in the data tree.

Click the Display Analysis Filters button. Alternatively, select View → Analysis Filters from the menu bar. ⇒ The Filter Analysis dialog box appears (Figure 5.12).

The Filter Analysis dialog box contains an Attribute section (top) and a Sample section (bottom). Analysis filters can be specified in the Attribute section, Sample section, or both sections. Analysis filters specified in the Attribute and Sample sections are combined in OR fashion (union).

Figure 5.12 Filter Analysis dialog box, GeneChip® data mode (publish database on the LIMS server)

Attribute Section

The Attribute section includes the template tree, attribute list, and value list (Figure 5.13). Together these components form a hierarchy that enables you to specify particular attribute values as analysis filters (see Table 5.3). The data tree displays the analyses that contain the selected attribute values when the Apply button is clicked.

Figure 5.13 Filter Analysis dialog box, Attribute section

Table 5.3 Filter Analysis dialog box, Attribute section components

Component | Displays... | Select...
Template tree | sample and experiment templates in the current database | one or more templates from the template tree to display the associated attributes in the attribute list
Attribute list | attributes associated with the templates selected in the template tree | attributes from this list to display all values for the selected attributes in the value list
Value list | all the values in the current database for the attributes selected in the attribute list | particular attribute values from this list for use as analysis filters

Selecting Analysis Filters in the Attribute Section To select adjacent items in the template tree, attribute list, or value list, press and hold the SHIFT key, then click the first and last row in the selection. To select non-adjacent items, press and hold the CTRL key, then click the desired rows.

1. Click the template names of interest in the template tree. ⇒ The Attribute list displays all attributes associated with the selected templates (Figure 5.14).

Figure 5.14 Template tree (top) and attribute list (bottom)

2. Click the attribute(s) of interest in the Attribute list (Figure 5.14). ⇒ The value list displays all values for the selected attribute(s) (Figure 5.15).

Figure 5.15 Value list displays all values for the selected attribute(s)

3. In the value list, click the attribute Value(s) you want to use as an analysis filter(s).

4. Click Clear to clear all selections from the Attribute section.

5. When finished specifying analysis filters, click Apply. ⇒ The data tree in the Query window displays the analyses selected by the filters. If analysis filters are specified in both the Attribute and Sample sections, DMT combines the filters in OR fashion (union).

If no attribute values are highlighted in the value list, then all values are selected.

Finding Templates or Attributes The Find function in the Filter Analysis dialog box searches for templates, attribute names, or attribute values. The Find button is located in the lower right corner of the Attribute section in the Filter Analysis dialog box.

1. To begin a search, click Find in the Filter Analysis dialog box (Figure 5.12). ⇒ The Find dialog box appears (Figure 5.16).

Figure 5.16 Find dialog box

2. Enter the text string for the search (up to 256 alphanumeric characters and spaces) in the Find what box.

3. Select Templates, Attribute Names, or Attribute Values from the Look in drop-down list.

4. Click Find Now. ⇒ Template search highlights templates in the template tree that contain the search text string. ⇒ Attribute name search highlights the: 1) attributes in the attribute list that contain the search text string, and 2) corresponding attribute values in the value list. ⇒ Attribute value search highlights the attribute values in the value list that contain the search text string.

5. Click Close to close the Find dialog box.

Sample Section

The Sample section of the Filter Analysis dialog box (Figure 5.17) displays the attributes that LIMS requires during sample registration and experiment setup (see Table 5.4).

Figure 5.17 Sample section of the Filter Analysis dialog box

Table 5.4 Filter Analysis dialog box, Sample section

Component | Contents
Sample Project | Projects in the current database. You can assign a sample to a project before publishing data. Several samples can be assigned to the same project for faster selection in DMT. If samples have been assigned to multiple projects, select all pertinent projects from the sample project list.
Probe Array | GeneChip® probe array types in the current database.
Sample Type | Sample types in the current database. You can organize experiments according to sample type before publishing analysis results data. The sample type may be used to create groups of results for a project. Many experiments may be associated with one sample type for faster selection in DMT. For example, experiment results may be assigned to Treated Liver or Untreated Liver sample types in the Liver project.
Operator | Logon user names of operators who created experiments.
Sample Name | Identifies the RNA source of the target hybridized to the GeneChip® probe array. You can assign the same sample name to different GeneChip probe arrays or experiments, then select the name to conveniently obtain all results for the sample from different experiments.

Selecting Filters in the Sample Section

The Sample section of the Filter Analysis dialog box (Figure 5.17) organizes attributes with increasing specificity from left to right.

1. Starting at the left in the Sample section, click the items of interest in each component list. Selected items in the same component list are combined in OR fashion (union). Selections from different component lists are combined in AND fashion (intersection).

Table 5.5 Filter Analysis dialog box, Sample section

Select This From Component List | To Display...
Sample Project | in the data tree, the analyses associated with the projects
Probe Array | in the data tree, the analyses associated with the probe arrays
Sample Type | operators and sample names associated with the sample types
Operator | sample names associated with the selected sample types AND operators
Sample Name | in the data tree, the analyses associated with the selected sample types AND operators AND sample names

If no items in a list have been highlighted, then all of the items in the list are selected by default.

2. When finished specifying analysis filters, click Apply. ⇒ The data tree in the Query window displays the analyses selected by the filters. If analysis filters are specified in both the Attribute and Sample section, DMT combines the filters in OR fashion (union).
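The selection rules for the Sample section can be summarized in a short Python sketch: items highlighted in the same component list are combined with OR, different lists are combined with AND, and a list with nothing highlighted counts as "all selected". The field names and the dictionary layout are hypothetical and do not reflect the publish database schema.

```python
def analysis_selected(analysis, selections):
    """Decide whether one analysis passes the Sample section filters.

    selections maps a component list name to the set of highlighted items.
    An empty set means nothing is highlighted, so every item is accepted.
    """
    return all(not chosen or analysis[component] in chosen
               for component, chosen in selections.items())

analyses = [
    {"name": "A1", "project": "Liver", "sample_type": "Treated Liver", "operator": "kim"},
    {"name": "A2", "project": "Liver", "sample_type": "Untreated Liver", "operator": "lee"},
    {"name": "A3", "project": "Kidney", "sample_type": "Treated Kidney", "operator": "kim"},
]
selections = {
    "project": {"Liver"},                                # AND between lists
    "sample_type": {"Treated Liver", "Untreated Liver"},  # OR within a list
    "operator": set(),                                   # nothing highlighted
}
print([a["name"] for a in analyses if analysis_selected(a, selections)])  # ['A1', 'A2']
```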

Running a Query

After specifying the filter and selecting the analyses from the data tree, the query is ready to run. To run the query, do one of the following:

Click the Query button or select Query → Run Query from the menu bar. ⇒ The query table displays the query results.

Click the Pivot button or select Query → Pivot Data from the menu bar. ⇒ The pivot table displays the query results. For more information about results tables, see Chapter 7, Query Results Tables.

Normalizing GeneChip® Signal Data

Normalization is a mathematical technique that minimizes discrepancies in results data from different experiments due to non-biological variables such as sample preparation, hybridization conditions, staining, amount of spotted probe, or GeneChip® probe array lot. Results data may be normalized prior to publishing in Affymetrix® Microarray Suite (GeneChip data) or Affymetrix® Jaguar™ (spotted array data). If GeneChip signal data were not normalized or were not normalized consistently, normalization can be performed in DMT. In DMT, you may normalize the data before or after a query or pivot operation.

The normalization option is only available in GeneChip® data mode.

Choosing Normalization Before a Query or Pivot

1. Click the Options button. ⇒ The Data Mining Options dialog box appears.

2. Click the Normalization tab. ⇒ The normalization options are displayed (Figure 5.18).

Figure 5.18 Data Mining Options dialog box, Normalization tab

3. Select the Compute Normalization option and confirm the All Probe Set Normalization algorithm is selected.

4. Click OK. ⇒ After a query, the query and pivot table display normalized signal values for each probe set.

If the pivot table does not display the normalized data column, verify that the pivot data includes Norm Signal or Norm Avg Diff (select Query → Pivot Data from the menu bar).

Choosing Normalization After a Query or Pivot

1. After a query or pivot operation is run, select Query → Normalize from the menu bar.

2. To display the normalized signal data in the query table, click the Query tab in the results pane (displays the query table), then select Query → Normalize from the menu bar.

If the Query → Normalize menu item is not available, verify that the All Probe Set Normalization algorithm is selected in the Data Mining Options dialog box (click the Options button, then click the Normalization tab).

3. To display the normalized signal data in the pivot table, click the Pivot tab in the results pane (displays the pivot table), then select Query → Normalize from the menu bar.

If the pivot table does not display the normalized data values, check to make sure the pivot data includes Norm Signal or Norm Avg Diff (select Query → Pivot Data from the menu bar).

Normalization Options

1. Click the Options button. ⇒ The Data Mining Options dialog box appears.

2. Click the Normalization tab. ⇒ The normalization options are displayed (Figure 5.19).

Figure 5.19 Data Mining Options dialog box, Normalization tab

3. Click Settings. ⇒ The All Probe Set Normalization Settings dialog box appears (Figure 5.20).

Figure 5.20 All Probe Set Normalization Settings

Target Intensity

Select this option to normalize the signal data to a user-specified target intensity (default = 5000). When selected, DMT computes the Normalization Factor (NF) for an analysis n so that:

$$\text{Target Intensity} = NF_n \times \text{average signal}_n$$

If the user-specified Target Intensity option is not selected, DMT sets the Target Intensity equal to the average signal of all analyses queried, not just the analyses returned by the query.

Intensity Threshold Select the Intensity Threshold option to specify a threshold for the signal values used to compute the average signal. When the signal of a probe set is less than the intensity threshold, DMT omits the probe set from the average signal calculation.
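The normalization factor can be sketched in Python as shown below. The sketch applies the intensity threshold and then the Low and High Percentage trimming described in the next paragraph before averaging. The function and parameter names are hypothetical, and the way the percentage cutoffs are rounded to whole probe sets is an assumption.

```python
def normalization_factor(signals, target_intensity,
                         intensity_threshold=None,
                         low_pct=2.0, high_pct=2.0):
    """Return NF such that target_intensity = NF * average signal."""
    values = sorted(signals)
    if intensity_threshold is not None:
        # Probe sets below the threshold are left out of the average.
        values = [v for v in values if v >= intensity_threshold]
    # Drop the lowest and highest percentages (rounded down: assumption).
    low_cut = int(len(values) * low_pct / 100)
    high_cut = int(len(values) * high_pct / 100)
    kept = values[low_cut:len(values) - high_cut]
    if not kept:
        raise ValueError("no signal values left to average")
    average = sum(kept) / len(kept)
    return target_intensity / average

# Values 1..100: the bottom 2% and top 2% are dropped, leaving 3..98,
# whose average is 50.5.
nf = normalization_factor(range(1, 101), target_intensity=5000)
print(round(nf * 50.5))  # 5000
```

Multiplying every signal in the analysis by the returned factor scales its trimmed average to the target intensity.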

Low and High Percentage

DMT does not include a signal value in the average signal calculation when it falls in the Low Percentage or High Percentage range. (The default values are the bottom 2% and the top 2%.) If an Intensity Threshold is specified, the Low and High Percentage range is applied to the signal values above threshold.

Chapter 6 Managing Queries

Saving and opening filters saves time when one or more complex filters are used on a regular basis. This chapter outlines the tasks of saving a query, opening previously saved queries and deleting queries.

Saving a Query

You may save the filter parameters you specify. You can apply the saved filter parameters to subsequent experimental results or use them to regenerate the current query results in a future session.

1. When a DMT session is open, select Data → Save from the menu bar. ⇒ The Save dialog box appears (Figure 6.1).

Figure 6.1 Save dialog box

2. Enter a name for the query in the Name box, then click Save. ⇒ This saves the filter parameters.


Using the Save As Command

Queries created by other users may be opened as read-only. Changes to read-only queries cannot be saved unless the query is renamed. This prevents users from modifying queries created by other users. You can also use the Save As command if you want to modify, but not overwrite, one of your own queries.

1. Select Data → Save As from the menu bar. ⇒ The Save dialog box appears (Figure 6.2).

Figure 6.2 Save dialog box

2. Enter a new name for the modified query, then click Save. ⇒ This saves the filter parameters.

Opening a Previously Saved Query

1. When a DMT session is open, select Data → Open from the menu bar. ⇒ The Open dialog box appears (Figure 6.3). The Open dialog box displays all saved queries in the default directory, unless the Only show my queries option box is selected. You may open any saved query.

Figure 6.3 Open dialog box

2. Select a query, then click Open. ⇒ The DMT session starts.

Deleting a Query

1. Select Data → Delete Query from the menu bar. ⇒ The Delete dialog box appears (Figure 6.4).

Figure 6.4 Delete dialog box

2. Select a query, then click Delete. ⇒ The selected query is permanently removed from the system.

Users (identified by the logon name) cannot delete queries created by other users.

Chapter 7 Query Results Tables

The results tables display experimental information and expression data that satisfy the query filter conditions. This chapter explains how to use these results tables.

The results tables are generated independently. Therefore, you can change the analyses displayed in one table without affecting the contents of the other tables.

Experiment Information Table

The experiment information table displays information about the analyses or array sets selected in the data tree.

1. To view experiment information for several analyses or array sets:

a. In the data tree, select the analyses or array sets you want to view. To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

b. Click the Info button or select Query → Experiment Information from the menu bar. ⇒ The selected analyses are displayed in the experiment information table (Figure 7.1, Figure 7.2).

2. To view experiment information for one analysis, right-click the analysis in the data tree and select Experiment Info from the shortcut menu.

3. If necessary, click the Experiment Info tab to view the table.


The experiment information table displays each analysis in a separate column. You can resize, reorder, or hide columns as desired (see Appendix B).

GeneChip® Data Mode

The experiment information table for GeneChip® data (Figure 7.1) displays information about analyses or array sets selected in the data tree, including:

Information entered during GeneChip® probe array experiment setup.

Data and experiment attributes automatically captured during hybridization, scanning and analysis.

Figure 7.1 Experiment information table, GeneChip® data mode

Spot Data Mode

The experiment information table for spot data (Figure 7.2) displays information about selected analyses, including:

Probe array and operator name.

Parameters associated with the analysis.

Figure 7.2 Experiment information table, spot data mode

Query Table

The query table presents query data in rows that identify an analysis and probe name (probe set or spot probe), followed by the expression metrics that met the limits specified by the filter. Appendix C describes the metrics and other types of data included in the query table. To populate the query table:

1. In the data tree, click analyses or array sets of interest.

2. Specify filters (optional).

3. Click the Query button. ⇒ The query table displays the query results (Figure 7.3, Figure 7.4). If you specified a sort order for one or more metrics in the filter grid, the corresponding query table rows are arranged accordingly (ascending, descending, or no sort). The columns may be resized, reordered, or hidden as desired (see Appendix B).

Figure 7.3 Query table, GeneChip® data mode

Figure 7.4 Query table, spot data mode

In the Spot column, the spot coordinates (in parentheses) follow the probe name.

Pivot Data Table

The results of a query frequently include the same probe (probe set or spot probe) from different analyses. The pivot operation organizes the query results so that all analysis results for a particular probe are displayed side by side in the same row of the pivot data table (Figure 7.5). The pivot table is blank until the pivot operation is run. The pivot table makes it easier to review and compare the query results from different analyses that are associated with a particular probe.

Figure 7.5 Pivot table, GeneChip® data mode

Figure 7.6 Pivot table, spot data mode
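The reorganization performed by the pivot operation can be pictured with a short Python sketch. It is an illustration only, not DMT code: the row layout, the function name pivot, the analysis names, and the signal values are all hypothetical. Each query row names an analysis and a probe; the pivot groups the rows by probe so that every analysis becomes a column.

```python
def pivot(query_rows, metric="Signal"):
    """Group query rows by probe so that each analysis becomes a column."""
    analyses = []   # column order follows the first appearance of each analysis
    table = {}      # probe name -> {analysis name: metric value}
    for row in query_rows:
        if row["analysis"] not in analyses:
            analyses.append(row["analysis"])
        table.setdefault(row["probe"], {})[row["analysis"]] = row[metric]
    return analyses, table


# Made-up query results: the same probes returned from two analyses.
query_rows = [
    {"analysis": "A1", "probe": "AFFX-BioB-5", "Signal": 120.0},
    {"analysis": "A1", "probe": "probe_2", "Signal": 45.5},
    {"analysis": "A2", "probe": "AFFX-BioB-5", "Signal": 130.0},
    {"analysis": "A2", "probe": "probe_2", "Signal": 40.0},
]

analyses, table = pivot(query_rows)
for probe, values in table.items():
    # One row per probe, one cell per analysis; None marks a missing result.
    print(probe, [values.get(name) for name in analyses])
```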

Selecting Results for the Pivot Table Before running the pivot operation, specify the type of expression metrics you want to view in the pivot table.

1. Click the Options button. ⇒ The Data Mining Options dialog box appears.

2. Click the Pivot tab. ⇒ This tab displays the results available for the pivot operation (Figure 7.7, Figure 7.8).

Figure 7.7 Data Mining Options dialog box, Pivot tab (GeneChip® data mode)

Figure 7.8 Data Mining Options dialog box, Pivot tab (spot data mode)

3. Place (or remove) a check mark next to the result you want to include (or exclude) from the pivot operation. DMT applies the data selections to the next pivot operation. The pivot table displays only the types of results selected for the pivot operation.

4. Click OK to close the Data Mining Options dialog box.

Viewing Results Selected for the Pivot Table The menu bar also shows the metrics selected for the pivot table.

1. To view Statistical algorithm results, select Query → Select Pivot Data → Statistical Algorithm Results from the menu bar. To view Empirical algorithm results, select Query → Select Pivot Data → Empirical Algorithm Results from the menu bar. ⇒ This displays a drop-down list of metrics (Figure 7.9). Check marks indicate items selected for the pivot table.

2. To include (or exclude) a result in the pivot table, click the result to add (or remove) a check mark.

Figure 7.9 Pivot data drop-down list (GeneChip® data mode), Empirical algorithm (left) and Statistical algorithm (right)

Running the Pivot Operation You can query analyses or array sets selected in the data tree and display the results in the pivot table.

1. In the data tree, click the analyses you want to query and pivot. To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

2. To view the query results in the pivot table, do one of the following:

Click the Pivot button.

Right-click a highlighted analysis or array set in the data tree and select Pivot Data from the shortcut menu.

Select Query → Pivot from the menu bar.

⇒ This displays the pivot table (Figure 7.5, Figure 7.6).

The pivot table must be populated before the scatter, fold change, or bar graph can be plotted.

Including Probe Descriptions in the Pivot Table The pivot table can include probe (probe set or spot probe) descriptions. The descriptions are derived from public databases (except for custom probe arrays).

To display probe descriptions, select Query → Pivot Descriptions from the menu bar. ⇒ This adds the Description column to the pivot table.

Including Annotations in the Pivot Table The pivot table can include several columns of annotations. For more information about annotating probes, see Chapter 8, Annotations.

1. Select Query → Pivot Annotations from the menu bar. ⇒ If there is more than one annotation type, the Select Pivot Annotation Type dialog box appears (Figure 7.10).

Figure 7.10 Select Pivot Annotation Type dialog box

2. Select an annotation type, then click OK. ⇒ This adds a column of annotations (one annotation type per column) to the pivot table (far right).

Sorting Pivot Table Columns You can specify a sort order for up to four columns in the pivot table.

1. Click the Pivot tab in the results pane.

2. Select Edit → Sort from the menu bar. Alternatively, right-click the pivot table and select Sort from the shortcut menu. ⇒ The Sort dialog box appears (Figure 7.11).

Figure 7.11 Sort dialog box

3. Click the Sort By drop-down arrow.

4. Select a pivot column from the drop-down list (Figure 7.12).

Figure 7.12 Sort dialog box

5. Select the Ascending or Descending sort order option.

6. To specify another sort order, click the next drop-down arrow in the Then By box, and repeat steps 4 and 5.

7. Click OK when finished.
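The effect of a Sort By column followed by Then By columns can be sketched as a multi-key sort. The Python below is a minimal illustration, not DMT code; the function name sort_rows, the row layout, and the values are hypothetical. It relies on the fact that Python's sort is stable, so sorting by the lowest-priority column first and the Sort By column last gives the combined order.

```python
def sort_rows(rows, sort_keys):
    """Sort rows by (column, ascending) pairs given in priority order."""
    if len(sort_keys) > 4:
        raise ValueError("a sort order can be specified for at most four columns")
    result = list(rows)
    # Apply the lowest-priority key first; stable sorting keeps earlier
    # orderings intact wherever the later, higher-priority keys tie.
    for column, ascending in reversed(sort_keys):
        result.sort(key=lambda row: row[column], reverse=not ascending)
    return result


# Made-up pivot rows with two hypothetical analysis columns.
rows = [
    {"probe": "a", "A1": 1, "A2": 2},
    {"probe": "b", "A1": 1, "A2": 3},
    {"probe": "c", "A1": 0, "A2": 1},
]

# Sort By A1 ascending, Then By A2 descending.
ordered = sort_rows(rows, [("A1", True), ("A2", False)])
print([row["probe"] for row in ordered])
```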

Pivot Options

The Data Mining Options dialog box displays the pivot options (Figure 7.13).

1. To open the Data Mining Options dialog box, do one of the following:

Click the Options button.

Right-click the pivot table and select Options from the shortcut menu.

Select View → Options from the menu bar, then click the Pivot tab.

Figure 7.13 Data Mining Options dialog box, GeneChip® data mode (left) and spot data mode (right)

Show Order Analyses Dialog

Select this option to display the Order Pivot Analyses dialog box (Figure 7.14) before the pivot operation begins. This dialog box enables you to specify an order for the analyses (columns) in the pivot table. The analysis order in the pivot table determines the order of the analyses in the series bar and histogram graphs.

Figure 7.14 Order Pivot Analyses dialog box

1. Use the drag-and-drop method to change the order of the analyses in the Order Pivot Analyses dialog box.

2. Click OK to pivot the data.

You can also reorder columns in the pivot table using the drag-and-drop method (see Appendix B).

Working with Tables

Working with results tables is the same in GeneChip® data mode (shown in this section) or spot data mode.

Finding Probes DMT can perform a text search in the query or pivot table.

1. To specify the text string for the search, do one of the following:

Click the Find button.

Right-click the query or pivot table and select Find In Results from the shortcut menu.

Select Edit → Find In Results from the menu bar.

⇒ The Find Probe dialog box appears (Figure 7.15).

Figure 7.15 Find Probe dialog box, GeneChip® data mode

2. Enter the text string for the search in the Find what box, or select a previously entered text string from the Find what drop-down list.

3. Select the Match and Direction search options, then click Find Next.

If the pivot table includes descriptions, the find function searches the probe set or spot probe name and description columns.

4. Click Find Next again to continue the search.

The Find Next command finds all strings that match the search text string. For example, using the Find Next command to search for the text string biob would find AFFX-BioB-5 as well as other occurrences of BioB (unless either the Match whole word only or Match case option is selected).
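The matching behavior described above can be sketched in Python. This is an illustration under stated assumptions, not the DMT search routine. The function name cell_matches is hypothetical, and the interpretation of Match whole word only (a word is a run of characters delimited by white space, so AFFX-BioB-5 counts as a single word) is an assumption chosen to agree with the example in the text.

```python
def cell_matches(cell, text, match_case=False, whole_word=False):
    """Return True when the search text is found in a table cell."""
    if not match_case:
        # Without Match case, the comparison ignores letter case.
        cell, text = cell.lower(), text.lower()
    if whole_word:
        # Assumption: words are delimited by white space only.
        return text in cell.split()
    return text in cell


print(cell_matches("AFFX-BioB-5", "biob"))                   # plain substring search
print(cell_matches("AFFX-BioB-5", "biob", match_case=True))  # case must agree
print(cell_matches("AFFX-BioB-5", "biob", whole_word=True))  # whole cell word required
```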

Viewing Descriptions & Obtaining Further Gene Information

The Description dialog box (Figure 7.16) is available in the query or pivot table. It enables you to:

View descriptions.

View or enter annotations for the selected probe (probe set or spot probe).

Access an Internet website for further gene information.

1. Double-click the query or pivot table row that contains the probe of interest. ⇒ The Description dialog box appears (Figure 7.16). The Description dialog box displays:

The probe name and a brief description.

The target sequence the probe set is designed to interrogate.

Annotations.

The Description dialog box is automatically updated when you click another probe in the query or pivot table.

Figure 7.16 Description dialog box, GeneChip® data mode

2. To obtain further gene information, select an Internet website from the drop-down list, then click Information. ⇒ The default Internet browser is started and automatically opens the selected website.

3. To annotate the selected probe, click Annotate.

See the following section for information about annotating probes.

Annotating Probes You can annotate probes (probe sets or spot probes) displayed in the query or pivot table.

1. Select one or more probe names in the query or pivot table. To select adjacent names, press and hold the SHIFT key while you click the first and last name in the selection. To select non-adjacent names, press and hold the CTRL key while you click the names.

2. Right-click the query or pivot table and select Annotate Probes from the shortcut menu. Alternatively, select Annotations → Annotate Probes from the menu bar. ⇒ The Annotate dialog box appears and displays the selected probe names in the Probe Set(s) box (Figure 7.17).

3. Enter an Annotation Type or make a selection from the drop-down list.

4. Enter comments in the Annotation box.

Figure 7.17 Annotate dialog box

5. Click OK to add the annotation and close the Annotate dialog box.

Adding Probes to the Filter Grid You can add all or selected probes (probe sets or spot probes) in the query or pivot table to the current filter. DMT saves the selected probes as a probe list, then adds the probe list to the filter. (See Chapter 9 for more information about probe lists.)

Adding Selected Probes

1. Select one or more probe names in the query or pivot table. To select adjacent names, press and hold the SHIFT key while you click the first and last name in the selection. To select non-adjacent names, press and hold the CTRL key while you click the names.

2. Right-click the table and select Add Selected Rows to Filter from the shortcut menu. Alternatively, select Edit → Add Selected Rows to Filter from the menu bar. ⇒ The selected probe set names are added to the Probe Set Name column (or the selected spot probe names to the Spot column) in the filter grid.

Adding All Probes

1. Right-click the query or pivot table and select Add All Rows to Filter from the shortcut menu. Alternatively, select Edit → Add All Rows to Filter from the menu bar. ⇒ All probe set names are added to the Probe Set Name column (or all spot probe names to the Spot column) in the filter grid.

If the option Always Prompt to Create List is chosen (select Edit → Lists → Always Prompt to Create List from the menu bar), DMT prompts you to create a list of the probe sets (or spot probes) you want to add to the filter. DMT adds the list name to the filter instead of the probe names. See Chapter 9 for more information about lists.

Copying Tables All or a selected portion of a results table can be copied to the system clipboard, then pasted into other applications.

1. To select the entire results table, click the upper left corner of the table (Figure 7.18).

Figure 7.18 Query table (GeneChip® data mode), all rows selected

2. To select part of a results table, do one of the following:

Click and drag the mouse to select the desired rows.

Click a row header to select the entire row. To select adjacent rows, press and hold the SHIFT key while you click the first and last row in the selection. To select non-adjacent rows, press and hold the CTRL key while you click the rows.

3. To copy the selection to the system clipboard, do one of the following:

Click the Copy Cells button.

Right-click the table and select Copy Cells from the shortcut menu.

Select Edit → Copy Cells from the menu bar.

⇒ The selected table cells are copied to the system clipboard.

4. To copy the selection to Excel, select Edit → Copy to Excel from the menu bar. ⇒ Microsoft® Excel opens with the selection pasted in.

Exporting Data The experiment information, query, or pivot table data may be exported (saved) to a tab-delimited text file (*.txt), then imported into other applications.

Hidden table columns are not exported.

1. Select Data → Export As from the menu bar. ⇒ The Export As dialog box appears (Figure 7.19).

Figure 7.19 Export As dialog box

2. Select a directory from the Save in drop-down box.

3. Enter a File name, then click Save.
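Because the export is a tab-delimited text file, it can be read by most scripting tools. The Python sketch below shows one way to load such a file with the standard csv module. It assumes that the first line of the export holds the column names, which this guide does not state. The function name, the column names, and the values in the sample are hypothetical.

```python
import csv
import io


def read_exported_table(stream):
    """Return the rows of a tab-delimited export as a list of dictionaries."""
    # Assumption: the first line of the export holds the column names.
    return list(csv.DictReader(stream, delimiter="\t"))


# Illustrative content only. A real export would be opened with
# open(path, newline="") and passed to read_exported_table().
sample = io.StringIO("Probe Set Name\tSignal\nAFFX-BioB-5\t120.5\n")
exported_rows = read_exported_table(sample)
print(exported_rows[0]["Probe Set Name"], float(exported_rows[0]["Signal"]))
```

Hidden table columns are not exported, so a script should not rely on a column being present unless it was visible at export time.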

Expanding the Results Pane When the Query window displays both the graph and results pane, you can enlarge the results pane.

1. Right-click a table in the results pane and select Expand Results from the shortcut menu. Alternatively, select View → Expand Results from the menu bar. ⇒ The graph pane is hidden and the results pane is enlarged.

2. To return the results pane to its original size, repeat step 1.

Clearing the Results Pane To clear the results pane, select Edit → Clear Results from the menu bar. ⇒ All tables in the results pane are cleared.

Chapter 8 Annotations

You can annotate probes (probe sets or spot probes) and view the annotations in the pivot table. The annotations may be queried and the query results may be added to the filter.

Creating and working with annotations is the same in GeneChip® data mode (shown in this chapter) or spot data mode.

Annotating Probes

1. Select one or more probes in the query or pivot table.

2. Right-click the query or pivot table and select Annotate Probes from the shortcut menu. Alternatively, select Annotations → Annotate Probes from the menu bar. ⇒ The Annotate dialog box appears and displays the selected probes in the Probe Set(s) box (Figure 8.1).

Figure 8.1 Annotate dialog box, GeneChip® data mode

3. Enter the Annotation Type, or select from the drop-down list.


4. Enter comments in the Annotation box, then click OK.

Loading Annotations You can add or load annotations previously saved in a text file (*.txt) to the system.

Creating a Text File Use the following procedure to create an annotation text file.

1. Create a text file (*.txt) following the tab delimited format shown in Figure 8.2.

2. In the first line, enter the column names (as defined in Table 8.1) delimited by tabs (Figure 8.2).

Table 8.1 Annotation text file, column names

Column Number    Column Name
1                Probe name
2                Type
3                Annotation

3. In the next line, enter the probe name, annotation type and the annotation delimited by tabs (Figure 8.2).

Enter only one annotation per line. Each annotation can include up to 2000 characters.

Figure 8.2 Annotations text file (*.txt), GeneChip® data mode
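If the annotations already exist in another system, the text file can be generated rather than typed. The following Python sketch builds the file content in the layout described above: a first line with the three column names from Table 8.1, then one tab-delimited annotation per line, each limited to 2000 characters. The function name and the sample annotation are hypothetical, and the exact header spelling expected by DMT is assumed to match Table 8.1.

```python
COLUMN_NAMES = ("Probe name", "Type", "Annotation")   # taken from Table 8.1
MAX_ANNOTATION_LENGTH = 2000


def format_annotation_file(annotations):
    """Build tab-delimited text from (probe, annotation type, text) tuples."""
    lines = ["\t".join(COLUMN_NAMES)]
    for probe, annotation_type, text in annotations:
        if len(text) > MAX_ANNOTATION_LENGTH:
            raise ValueError(f"annotation for {probe} exceeds 2000 characters")
        for field in (probe, annotation_type, text):
            # A tab or line break inside a field would corrupt the layout.
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or line break: {field!r}")
        lines.append("\t".join((probe, annotation_type, text)))
    return "\n".join(lines) + "\n"


# Illustrative annotation only.
content = format_annotation_file([("AFFX-BioB-5", "Control", "Example comment")])
print(content, end="")
```

The returned text can then be saved with a .txt extension, for example with pathlib.Path("annotations.txt").write_text(content).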

Loading the Annotations Use the following procedure to add an annotations text file to the system.

1. Select Annotations → Load Annotations from the menu bar. ⇒ The Open dialog box appears (Figure 8.3) and displays the contents of the default directory specified in the Data Mining Options dialog box (Default Directory tab).


Figure 8.3 Open dialog box

2. Select the text file that contains the annotations, then click Open. ⇒ The annotations are added to the GeneInfo database and are available to DMT.

Querying Annotations

The Query Annotations window (Figure 8.4) enables you to build an annotation query (top pane) and view the returned results (bottom pane).

1. Select Annotations → Query Annotations from the menu bar. ⇒ The Query Annotations window appears (Figure 8.4).

Figure 8.4 Query Annotations window

2. Click the drop-down arrow in the Field column to display a drop-down list of field types. Select none if you want to clear a previously selected field type.

3. Use the scroll bar to view the list and select a field type (or none) from the drop-down list.

4. Enter the search text string in the Search For box. ⇒ DMT combines the field type with the text string in AND fashion (intersection) (see Table 8.2).

5. To edit the text string, highlight the entry and right-click the cell. ⇒ A shortcut menu of edit commands is displayed.

Table 8.2 Field types

Field Type         Returns Annotations...
Probe              for probe set or spot probe names that contain the text string
User               created by a user whose name contains the text string
Description        for probe sets with descriptions that contain the text string
Annotation Type    of the specified type that contain the text string

6. To enter another row of criteria, click the Operation column.

7. Click the drop-down arrow, then select the AND (intersection) or OR (union) operator. ⇒ DMT automatically adds another row to the Query Annotations filter grid.

8. To specify additional query criteria, repeat step 2 through step 6.

9. To run the query, do one of the following:

Click the Query Annotations button.

Right-click the top pane and select Run Query from the shortcut menu.

Select Annotations → Run Query from the menu bar.

⇒ The bottom pane of the Query Annotations window displays the returned results (Figure 8.5).

Figure 8.5 Query Annotations window, query criteria (top) and query results (bottom)
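The way several rows of criteria combine can be sketched in Python. This is an illustration only, not the DMT query engine. It assumes that matching is a case-insensitive substring test and that rows are combined from top to bottom, neither of which is stated in this guide. The function names, record layout, and sample records are hypothetical.

```python
def matches(record, field, text):
    # Assumption: matching is a case-insensitive substring test.
    return text.lower() in record[field].lower()


def run_annotation_query(records, criteria):
    """criteria rows are (field, text, operation); the operation joins a row
    to the row that follows it (None on the last row)."""
    result = None
    pending = None
    for field, text, operation in criteria:
        hits = {i for i, rec in enumerate(records) if matches(rec, field, text)}
        if result is None:
            result = hits
        elif pending == "AND":
            result = result & hits     # intersection
        elif pending == "OR":
            result = result | hits     # union
        else:
            raise ValueError(f"unknown operation: {pending!r}")
        pending = operation
    return [records[i] for i in sorted(result or ())]


# Made-up annotation records.
records = [
    {"Probe": "AFFX-BioB-5", "User": "jsmith", "Annotation Type": "Control"},
    {"Probe": "probe_2", "User": "jsmith", "Annotation Type": "Candidate"},
    {"Probe": "probe_3", "User": "mlee", "Annotation Type": "Candidate"},
]
criteria = [
    ("User", "jsmith", "AND"),
    ("Annotation Type", "cand", "OR"),
    ("Probe", "biob", None),
]
found = run_annotation_query(records, criteria)
print([rec["Probe"] for rec in found])
```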

Annotation Query Results

Type           Annotation type selected when the annotation was created.
Annotation     Text entered by the user who created the annotation.
User           Windows NT name of the user who was logged onto the workstation when the annotation was created.
Date           Date when the annotation was created or last updated.
Description    Probe description (derived from a public database).

Copying Annotation Query Results Annotation query results may be copied to the system clipboard and pasted into other applications. The row numbers are also copied with the selected cells for reference.

1. Click the row number in the query results to select the entire row.

2. Select Annotations → Copy Cells from the menu bar. ⇒ The selection is copied to the system clipboard.

Clearing the Annotation Query or Query Results

1. To clear the annotation filter grid (top pane of the Query Annotations window), select Annotations → Clear Query from the menu bar.

2. To clear the annotation query results (bottom pane of the Query Annotations window), select Annotations → Clear Results from the menu bar.

Adding Probes to the Filter Grid Probes (probe sets or spot probes) returned by an annotation query may be added to the current filter.

1. Select one or more probes in the bottom pane of the Query Annotations window.

2. Right-click the selection, then select Add Selected Results To Filter from the shortcut menu. Alternatively, select Annotations → Add Selected Results To Filter from the menu bar. ⇒ The selected probe set names are added to the Probe Set Name column (or the selected spot probe names to the Spot column) in the filter grid.

If the option Always Prompt to Create List is selected (Edit → Lists → Always Prompt to Create List from the menu bar), DMT prompts you to create a list of the probe sets (or spot probes) you want to add to the filter. DMT adds the list name to the filter instead of the probe names. See Chapter 9 for more information about lists.

Deleting Annotations An annotation may only be removed from the database by the user who created it. The delete command permanently removes an annotation from the system.

1. Select Annotations → Query Annotations from the menu bar. ⇒ The Query Annotations window appears (Figure 8.6).

Figure 8.6 Query Annotations window, specifying search for annotations created by the user

2. Select User from the Field drop-down list.

3. Enter your user name in the Search For box.

4. Click the Query Annotations button . ⇒ All of the annotations that meet the criteria are displayed.

5. To select a row, click the row number. To select all rows, click the upper left corner of the query results pane (Figure 8.7).

Figure 8.7 Query Annotations window

6. Select Annotations → Delete Selected Annotations from the menu bar. Alternatively, right-click a selected annotation, then select Delete Selected Annotations from the shortcut menu. ⇒ The selected annotations are permanently removed.

Chapter 9 Probe Lists

A user-specified group of probes (probe sets or spot probes) can be saved as a probe list. Probe lists are displayed in the data tree and may be added to the filter grid (probe set name or spot column), or used to view specific query results. A text file (comma delimited *.txt) that specifies a probe list may also be added to the system. This section covers the methods for creating or loading probe lists and how to use and manage probe lists in Affymetrix® Data Mining Tool.

Creating and working with probe lists is the same in GeneChip® data mode (shown in this chapter) or spot data mode.

Creating Probe Lists

A probe list may be generated from probes selected from:

The query or pivot table.

Cluster analysis results.

Search array descriptions.

The filter grid.

Additionally, existing probe lists may be combined to create new lists. This section outlines the various procedures for creating probe lists.


Creating a Probe List from the Query or Pivot Table

1. Select one or more probes in the query or pivot table.

2. Right-click the table and select Create Probe List from the shortcut menu. ⇒ The Save Probe List dialog box appears (Figure 9.1).

Figure 9.1 Save Probe List dialog box

3. Enter a name for the list in the Name box, then click Save. ⇒ The Probe List Members dialog box appears and displays the members in the saved list (Figure 9.2).

Figure 9.2 Probe List Members dialog box

4. Click Close when finished viewing the probe list members. ⇒ The data tree displays the probe list in the Probe Lists directory (Figure 9.3).

5. In the data tree, click the plus sign (+) next to the probe list name to display the probe list members. For example, in Figure 9.3, the probe list L1 contains five members.

Figure 9.3 Data tree, Probe Lists Directory

Creating a Probe List from Cluster Analysis The cluster members identified by cluster analysis may be saved as a probe list. (See Chapter 14 for more information about cluster analysis.)

1. After the cluster analysis results are returned, click the cluster of interest in the Clusters tab of the graph pane. ⇒ The cluster members (probe sets or spot probes) are displayed in the Probes box (Figure 9.4).

Figure 9.4 Graph pane, Clusters tab (GeneChip® data mode)

2. Enter a Probe List Name, then click Save Selected. ⇒ A probe list that includes the members of the selected cluster is created and displayed in the data tree Probe Lists directory.

Creating a Probe List from Search Array Descriptions

1. Select Edit → Search Array Descriptions from the menu bar. ⇒ The Search Array Descriptions dialog box appears (Figure 9.5).

Figure 9.5 Search Array Descriptions dialog box

2. In the Search for box, enter the description, or partial description, then click Find. ⇒ Results for the search are displayed in the list box (Figure 9.6).

Figure 9.6 Search Array Descriptions dialog box with search results

3. Press and hold the CTRL key while you click to select the desired probe set names.

4. Click Add to Filter. ⇒ The Save Probe List dialog box appears.

5. Enter a Name for the probe list, then click Save. ⇒ The Probe List Members dialog box appears.

6. Click Close when finished. ⇒ The data tree displays the probe list.

Creating a Probe List from Filter

1. Enter the probe set names in the Probe Set Name column of the filter grid.

2. Select Edit → Probe Lists → Create Probe List from Filter from the menu bar. ⇒ The Save Probe List dialog box appears.

3. Enter a Name for the probe list, then click Save. ⇒ The Probe List Members dialog box appears.

4. Click Close when finished. ⇒ The data tree displays the probe list.

Creating a Probe List by Combining Existing Lists

1. Select Edit → Probe Lists → Combine Probe Lists. ⇒ The Combine Probe Lists dialog box appears (Figure 9.7).

Figure 9.7 Combine Probe Lists dialog box

2. Select or clear the Only show my probe lists option, as desired.

3. In the Combine Probe List drop-down box, select a probe list.

4. Select a second probe list from the lower drop-down box.

5. Select either the And or Or option. And specifies that probe names must belong to both lists to be included in the new list. Or specifies probe names belonging to either one or both lists will be included in the new list.

6. Enter a new probe list name.

Figure 9.8 Combining all probes belonging to lists Hu6800 or Like_Affx IN_cre

7. Click OK. ⇒ The Probe List Members dialog box appears and displays all probes in the new probe list.

8. Click Close when finished. ⇒ The data tree displays the new probe list.
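The And and Or options correspond to set intersection and set union. The Python sketch below illustrates the difference; it is not DMT code, and the function name and probe names are hypothetical.

```python
def combine_probe_lists(first, second, option):
    """Combine two probe lists with the And or Or option."""
    a, b = set(first), set(second)
    if option == "And":
        combined = a & b      # probes that belong to both lists
    elif option == "Or":
        combined = a | b      # probes that belong to either list or to both
    else:
        raise ValueError("option must be 'And' or 'Or'")
    return sorted(combined)


# Made-up probe lists.
list_1 = ["probe_1", "probe_2", "probe_3"]
list_2 = ["probe_2", "probe_3", "probe_4"]
print(combine_probe_lists(list_1, list_2, "And"))
print(combine_probe_lists(list_1, list_2, "Or"))
```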

Loading a Probe List

In addition to creating a probe list (described in the preceding section), a probe list may be loaded or added to the system. There are two methods available for loading a probe list:

Specify members. Select this option to manually enter the probe list members.

Specify input file. Select this option to load a previously saved text file (*.txt) that specifies the probe list members.

Specifying Probe List Members

1. Select Edit → Probe Lists → Load Probe List from the menu bar. ⇒ The Load List dialog box appears (Figure 9.9).

2. Enter a Probe List name.

3. Select the Specify members (comma delimited) option and enter the probe set or spot probe names using a comma delimited format (terminate the entry with a comma) (Figure 9.9).


Figure 9.9 Load List dialog box, Specify members option

4. Click OK. ⇒ The list is created and displayed in the data tree Probe Lists directory.

Specifying an Input File To load a probe list using the Specify input file option, you must first create the list (*.txt) so that you can select it from the Load List dialog box.

Creating the Input File

1. Create a text file (*.txt) following the comma delimited format shown in Figure 9.10.

2. Enter the probe names (probe set or spot probe) in comma delimited format (terminate the entry with a comma) (Figure 9.10).

Figure 9.10 Comma delimited probe list entries

3. Save the text file.
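The comma delimited format, including the terminating comma, can be produced and checked with a short Python sketch. The function names and probe names are hypothetical. Writing all names on a single line is an assumption, because the layout in Figure 9.10 is not reproduced here; the parser below accepts names separated by commas on one line or on several.

```python
def format_probe_list(names):
    """Return the probe names in comma delimited form, ending with a comma."""
    return "".join(f"{name}," for name in names)


def parse_probe_list(text):
    """Recover the probe names from comma delimited text."""
    # Empty items (such as the one after the terminating comma) are dropped.
    return [item.strip() for item in text.split(",") if item.strip()]


# Made-up probe names.
entry = format_probe_list(["AFFX-BioB-5", "probe_2", "probe_3"])
print(entry)
print(parse_probe_list(entry))
```

The same text can be typed into the Specify members box or saved to a .txt file for the Specify input file option.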

Selecting the Input File

1. Select Edit → Probe Lists → Load Probe List from the menu bar. ⇒ The Load List dialog box appears (Figure 9.11).

Figure 9.11 Load List dialog box

2. Enter a Probe List name.

3. Select the Specify input file option and enter the name of the text file (*.txt) that contains the list members. Alternatively,

a. Click the Browse button. ⇒ The Select List dialog box appears (Figure 9.12).

Figure 9.12 Select List dialog box

b. Select a text file, then click Open. ⇒ The Load List dialog box displays the selected input file (Figure 9.13).

Figure 9.13 Load List dialog box

4. Click OK. ⇒ The probe list is created and displayed in the data tree Probe Lists directory.

Using Probe Lists

Probe lists provide a convenient way to quickly add a group of associated probes (probe sets or spot probes) to the filter, or to highlight and view results for only selected probes.

Adding a Probe List to the Filter Grid You can add an existing probe list to the filter.

1. In the filter grid, right-click a cell in the Probe Set Name or Spot column and select Probe List... from the shortcut menu (Figure 9.14). ⇒ The Open Probe List dialog box appears (Figure 9.14).

Figure 9.14 Shortcut menu and Open Probe List dialog box

The Open Probe List dialog box displays all probe lists contained on the server, unless the Only show my probe lists option is selected.

2. From the Open dialog, select the probe list that you want to add to the filter.

3. Click Open. ⇒ The probe list is added to the filter.

Displaying Selected Probe List Members

Use probe lists to highlight probe list members in the scatter or fold change graph, or to display only the members in the pivot table or a series line graph.

Pivot the analyses of interest and plot the scatter, fold change and series line graph before highlighting a probe list(s).

1. Select one or more probe lists in the data tree. To select adjacent probe lists, press and hold the SHIFT key while you click the first and last list in the selection. To select non-adjacent probe lists, press and hold the CTRL key while you click the lists.

2. Right-click a selected probe list and select Display Selected Probes from the shortcut menu. ⇒ If the scatter or fold change graph is the active (selected) graph, the corresponding points are highlighted. If the series graph is active, only the data for the selected probe list members is displayed (Figure 9.15). ⇒ The pivot table displays only the rows for the probe list members (Figure 9.16).

3. To restore all rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.

Figure 9.15 Series line graph of probe list L5

Figure 9.16 Pivot table displaying probe list L5

Managing Probe Lists

List management is the same in GeneChip® data mode (shown in this section) or spot data mode.

Viewing and Editing Probe List Members

1. Select Edit → Lists → View Members from the menu bar. ⇒ The Probe List Members dialog box appears (Figure 9.17).

Figure 9.17 Probe List Members dialog box

2. Select a Probe List from the drop-down list. ⇒ The Probe List Members box displays the list members. If the Only show my probe lists option is selected (Figure 9.17), the Probe Lists drop-down list only displays lists created by you (identified by the logon name).

3. To add a probe set to the list, enter the probe set name in the bottom box (Figure 9.19), then click Add Member.

Figure 9.18 Probe List Members dialog box

4. To remove a probe from the list, highlight it in the Probe List Members box (Figure 9.19), then click Remove Member.

Figure 9.19 Probe List Members dialog box

5. Click Close when finished viewing or editing the list.

Combining Probe Lists

1. Select Edit → Lists → Combine Lists from the menu bar. ⇒ The Combine Probe Lists dialog box appears (Figure 9.20).

Figure 9.20 Combine Probe Lists dialog box

2. Make a selection from each of the upper and lower Combine Probe List drop-down list boxes (Figure 9.21).

Figure 9.21 Combine Probe Lists dialog box

3. Select the And (intersection) or Or (union) combination option for the lists.

4. Enter a New probe list name for the new list, then click OK. ⇒ If the Show members after saving option is selected (Figure 9.21), the Probe List Members dialog box appears and displays the new list members (Figure 9.22).

Figure 9.22 Probe List Members dialog box

5. Click Close when finished viewing or editing list members.
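The And and Or options in the procedure above correspond to set intersection and set union. The sketch below illustrates the two results on plain Python lists. The function name is hypothetical, and returning the members in sorted order is an assumption made only to keep the output predictable. It is not a description of how DMT stores or orders list members.

```python
def combine_probe_lists(first, second, mode):
    """Combine two probe lists with 'and' (intersection) or 'or' (union)."""
    a, b = set(first), set(second)
    if mode == "and":
        combined = a & b  # members present in both lists
    elif mode == "or":
        combined = a | b  # members present in either list
    else:
        raise ValueError("mode must be 'and' or 'or'")
    # Sorting is an illustrative choice, not documented DMT behavior.
    return sorted(combined)
```

With lists ["a", "b", "c"] and ["b", "c", "d"], the And option keeps only "b" and "c", while the Or option keeps all four names.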

Exporting a Probe List

A probe list may be exported as a text file (*.txt).

1. Right-click the probe list for export and select Export Probe List from the shortcut menu. Alternatively, select Probe Lists → Export Probe Lists from the menu bar. ⇒ The Save As dialog box appears (Figure 9.23).

Figure 9.23 Save As dialog box

2. Choose a directory from the Save in drop-down list.

3. Enter a name for the text file and click Save.

Deleting a Probe List

Using the Shortcut Menu

1. Right-click the probe list you want to delete and select Delete Probe List from the shortcut menu. ⇒ DMT prompts you to confirm the probe list to be deleted (Figure 9.24).

Figure 9.24 Delete probe list prompt

2. Click OK to delete the probe list.

Using the Menu Bar

1. Select Edit → Probe Lists → Delete Saved Probe List from the menu bar. ⇒ The Delete Probe List dialog box appears (Figure 9.25).

Figure 9.25 Delete Probe List dialog box

2. Select the list you want to delete and click Delete.

Chapter 10 Array Sets

An array set is a user-specified group of GeneChip® probe array analyses. An array set provides a convenient way to select a group of analyses for a query, the pivot operation, graphing, statistical analyses, or clustering.

Array sets are only available for GeneChip® probe array analyses.

Creating an Array Set

1. In the data tree, click the analyses you want to include in an array set (Figure 10.1). To select adjacent analyses, press and hold the SHIFT key while you click the first and last analysis in the selection. To select non-adjacent analyses, press and hold the CTRL key while you click the analyses.

Figure 10.1 Right-click selected analyses in data tree for the shortcut menu


2. Right-click a selected analysis, then select Create Set from the shortcut menu (Figure 10.1). Alternatively, select Edit → Sets → Create Set from the menu bar. ⇒ The Save Array Set dialog box appears (Figure 10.2).

Figure 10.2 Save Array Set dialog box

The Virtual Set option is available if the analyses selected for the array set are derived from different GeneChip® probe array types. When a virtual set is pivoted, DMT merges the analyses and displays them in a single column of the pivot table. A virtual set is a convenient way to manage the analyses from a set of multiple GeneChip probe arrays. If the same probe set occurs in more than one analysis, the pivot table displays each probe set-analysis combination in a separate row to ensure that no data are lost. For example, a control probe that is found across a set of four probe arrays generates four pivot table rows. Each row is distinguished by the probe set-analysis name in the row header.

3. Enter a Name for the array set.

4. Select the Virtual Set option if you want to merge the analyses into a single column in the pivot table.

5. Click Save. ⇒ The array set is saved and displayed in the data tree under the My Array Sets directory (Figure 10.3).

Figure 10.3 Data tree, My Array Sets directory
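The effect of the Virtual Set option on the pivot table can be pictured with a short sketch. It is an illustration under stated assumptions, not DMT's implementation: the function name and the input layout (a dictionary of analysis names mapped to probe set results) are hypothetical, and the sketch labels every row with a combined probe set-analysis name, whereas the exact row header format used by DMT is not reproduced here.

```python
def pivot_virtual_set(analyses):
    """Merge several analyses into a single pivot column.

    `analyses` maps an analysis name to {probe set name: result value}.
    Every probe set-analysis combination gets its own row, so a probe set
    that occurs in several analyses is never overwritten.
    """
    rows = {}
    for analysis_name, results in analyses.items():
        for probe_set, value in results.items():
            # Hypothetical row label: probe set name plus analysis name.
            rows[f"{probe_set}-{analysis_name}"] = value
    return rows
```

A control probe set present in four analyses therefore produces four rows in the merged column, which matches the behavior described for virtual sets in this section.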

Saved Array Sets are stored in the registry on the computer and are only available when using that specific computer.

Working with Array Sets

An array set is available for graphing (see Chapter 11), statistical analysis (see Chapter 12) and cluster analysis (see Chapter 14).

Viewing Array Sets

The results tables are displayed independently. Therefore, changing the analyses displayed in the experiment information table does not affect the query or pivot table contents.

Experiment Information Table

1. Click an array set(s) in the data tree. To select adjacent array sets, press and hold the SHIFT key while you click the first and last array set in the selection. To select non-adjacent array sets, press and hold the CTRL key while you click the array sets.

2. Right-click a highlighted array set and select Experiment Info from the shortcut menu. ⇒ The experiment information table displays information for the analyses in the array set(s).

Pivot Table

1. Select an array set(s) in the data tree.

2. Right-click a highlighted array set in the data tree and select Pivot Data from the shortcut menu. ⇒ The pivot table displays the analysis results from the selected array set(s). The pivot table displays a single column of results for a virtual array set.

Managing Array Sets

Array sets that you have created can be edited or deleted from DMT. Only array sets created by you, as identified by the logon name, are displayed in the data tree.

Editing an Array Set

1. Select an array set in the data tree.

2. Select Edit → Sets → Edit Set from the menu bar. ⇒ The Array Set Members dialog box appears and displays the selected array set and its members (Figure 10.4).

Figure 10.4 Array Set Members dialog box

3. Do one or both of the following:

Add a member to the array set: Enter the analysis name (from the current database) in the bottom box, then click Add member.

Remove a member from the array set: Select the analysis in the Array Set Members box, then click Remove member.

4. Click Close when finished editing the array set.

Deleting an Array Set

1. In the data tree, select the array set(s) you want to delete.

2. Right-click a selected array set and select Delete Sets from the shortcut menu. Alternatively, select Edit → Sets → Delete Sets from the menu bar. ⇒ DMT prompts you to confirm the array set(s) to be deleted.

3. Click OK to delete the array set(s).

Chapter 11 Graphing Results

DMT can plot user-specified columns of numeric pivot table data in a scatter, fold change, series, or histogram graph. This includes:

Analysis results

Statistical data generated using the analysis function (see Chapter 12).

The graph pane of the DMT session displays each type of graph in a separate tab (Figure 11.1).

The pivot operation must be run before the graphs can be plotted.

Figure 11.1 Graph pane, Scatter graph tab


Scatter Graph

The scatter graph (Figure 11.1) is an x-y graph that compares numeric pivot table data (from user-specified columns) using a traditional scatter plot. Multiple pivot table columns may be assigned to each axis. This enables quick comparison of the results from different experiments. Each point in the scatter graph represents a probe common to the two pivot table columns in the comparison. A point is defined by the intersection of the result value on the x and y axes for the common probe. The scatter graph displays up to eight fold change lines (four pairs) to help identify results that have changed significantly. The fold change lines are defined in pairs: y = mx and y = (1/m)x, where m = 2, 3, 5 and 10 by default. In GeneChip® data mode, average difference and fold change metrics are generally the most informative because probe sets with significant changes in expression levels can be easily identified.

Plotting the Scatter Graph

Plotting the scatter graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Scatter Graph button. Alternatively, select Graph → Scatter from the menu bar. ⇒ The Scatter Graph dialog box appears and displays the pivot table columns available for the scatter graph (Figure 11.2).

Figure 11.2 Scatter Graph dialog box (GeneChip® data mode), pivot table columns available for the scatter graph

2. Use the drag-and-drop method to select each x-axis column in the Available Columns box and place it in the Select X-Axis Column(s) box (Figure 11.3). Alternatively, select one or more columns in the Available Columns box, then click the down arrow above the Select X-Axis Column(s) box. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.

3. Use the drag-and-drop method to select each y-axis column in the Available Columns box and place it in the Select Y-Axis Column(s) box (Figure 11.3). Alternatively, select one or more columns in the Available Columns box, then click the down arrow above the Select Y-Axis Column(s) box.

The analysis results for a GeneChip® probe array set must be ordered identically because the scatter graph compares the first analysis on the x-axis to the first analysis on the y-axis (and so forth). If the analyses are not identically ordered, many probe sets will not be compared and plotted (only the common probe sets such as the controls).

Figure 11.3 Scatter Graph dialog box, GeneChip® probe data mode

4. To change the order of a column in the Select X-Axis (or Select Y-Axis) Column(s) box, use the drag-and-drop method to move the column to a new position in the list. Alternatively, select the column, then click the up or down arrow located inside the Select X-Axis or Select Y-Axis Column(s) box.

5. To change the scatter graph axes from log scale (default) to linear scale, click the Log Scale option to remove the check mark.

6. Click OK. ⇒ The graph pane displays the scatter graph (Figure 11.4). The points are color-coded using the display option colors in the Scatter Graph tab of the Data Mining Options dialog box (click the Options button).

Figure 11.4 Scatter graph, GeneChip® data mode, signal metric

Working with the Scatter Graph

Working with the scatter graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Magnifying the Graph

1. Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the graph area of interest (Figure 11.5).

2. Release the mouse button. ⇒ The area selected by the rectangle is magnified (Figure 11.6).

Figure 11.5 Scatter graph, rectangle selects an area to magnify (GeneChip® data mode)

Figure 11.6 Magnified area in the scatter graph

3. To zoom out and restore the graph, right-click the graph and select Full Out Zoom from the shortcut menu.

Locating a Probe

Select a probe in the pivot table to quickly locate it in the scatter graph.

1. Click and hold the probe name in the pivot table. ⇒ The corresponding point in the scatter graph is highlighted (Figure 11.7). The highlighting is removed when the mouse button is released.

Figure 11.7 Scatter graph highlights the probe selected in the pivot table (GeneChip® data mode)

Viewing Probe Information & Annotating Probes

1. To display probe and corresponding gene information, click a point in the scatter graph. ⇒ The probe name, analyses names, metrics from the pivot table and a brief description of the gene are displayed to the right of the graph (Figure 11.8).

Figure 11.8 Scatter graph displaying probe information (GeneChip® data mode)

2. To obtain further gene information, select an Internet website from the drop-down list, then click Information. ⇒ The default Internet browser is started and automatically opens the selected website.

3. Double-click a point in the graph or a pivot table row to display the Description dialog box (Figure 11.9).

Figure 11.9 Description dialog box

⇒ The Description dialog box displays a brief description of the probe (probe set or spot probe), the sequence that it is designed to interrogate and any annotations associated with the probe.

4. To enter an annotation, click Annotate. ⇒ The Annotate dialog box appears (Figure 11.10).

Figure 11.10 Annotate dialog box

5. Enter comments in the Annotation box, then click OK. ⇒ The annotation is added to the Description dialog box.

Selecting Points in the Graph

The lasso feature enables you to quickly select and focus on points of interest in the scatter graph by drawing a line around them (roping). The pivot table displays only the rows that correspond to the roped points (all other rows are hidden). Probes selected by roping may be conveniently annotated as a group or saved in a probe list that can be applied to the filter grid of a subsequent query.

1. Click the Lasso Points button. Alternatively, select Graph → Lasso Points from the menu bar. ⇒ The mouse pointer changes to a pair of cross hairs (+) when it is positioned over the scatter graph.

2. To rope points of interest, position the cross hairs near the group of points, then do one of the following:

Click and hold the mouse button while you draw a complete circle around the points (Figure 11.11); or

Click the mouse, move it to draw a line segment, then click the mouse again to start drawing a new line segment. Repeat until you return the cross hairs to the starting point and the line segments enclose the points of interest (Figure 11.12).

Figure 11.11 Roped points in the scatter graph

Figure 11.12 Roped points in the scatter graph

3. To terminate the roping operation, double-click the mouse or press the ESC key. ⇒ The scatter graph displays the selected points in orange (the default selected point color, which can be changed in the Options dialog box; see Changing Graph Colors on page 202). ⇒ The pivot table displays only the rows that correspond to the roped points (all other rows are hidden).

4. To restore the hidden rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.

5. To clear the selection from the graph, right-click the graph and select Clear Selection from the shortcut menu. ⇒ The roped points are deselected and all rows (probes) are restored to the pivot table.
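Conceptually, roping selects every point that falls inside the closed outline you draw. The sketch below shows one common way to make that decision, the ray-casting point-in-polygon test. It is offered only as an illustration: the function names are hypothetical, the outline is assumed to be a list of (x, y) vertices in the same coordinate space as the points, and nothing here is taken from DMT's own implementation.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` lies inside the closed `polygon`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the outline
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rope_points(points, polygon):
    """Return the names of the points enclosed by the roped outline."""
    return [name for name, xy in points.items() if point_in_polygon(xy, polygon)]
```

The names returned by rope_points play the role of the pivot table rows that remain visible after roping. All other rows would be hidden.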

Scatter Graph Options

Preferences for the scatter graph display may be set in the Data Mining Options dialog box (Figure 11.13). Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.

1. Click the Options button, then click the Scatter Graph tab. Alternatively:

Right-click the scatter graph and select Options from the shortcut menu; or

Select View → Options from the menu bar, then click the Scatter Graph tab.

⇒ The Data Mining Options dialog box appears and displays the scatter graph options (Figure 11.13).

Figure 11.13 Data Mining Options dialog box, Scatter graph tab, GeneChip® data mode (left) and spot data mode (right)

Point Options

Point size: The point size number determines the dot size for a graph point. Enter a larger point size for easier viewing. Use a smaller point size for higher resolution graphs.

Color by Absolute Call: In GeneChip® data mode, select this option to color-code the points according to the colors assigned to the absolute or detection call combination of the x- and y-axis analyses (as displayed in the Scatter Graph tab of the Data Mining Options dialog box, see Table 11.1). Note: You must pivot the absolute or detection call data.

Color by Difference Call: In GeneChip® data mode, select this option to color-code the points according to the colors assigned to the difference or change call for the x-axis analysis (as displayed in the Colors section of the Data Mining Options dialog box). There are five possible difference calls: decrease (D), marginal decrease (MD), no change (NC), marginal increase (MI) and increase (I). If the x-axis analysis does not have a difference or change call, then the difference or change call for the y-axis analysis is used. If neither the x-axis nor the y-axis analysis has a difference or change call, Point Color is used.

Use Point Color: Select this option to display all graph points using the Point Color (default is black) in the Colors section of the Data Mining Options dialog box.

Table 11.1 Absolute or detection call combinations in the scatter graph (GeneChip® data mode)

                 Absent in Y   Marginal in Y   Present in Y

Absent in X      A-A           A-M             A-P

Marginal in X    M-A           M-M             M-P

Present in X     P-A           P-M             P-P
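The two call-based coloring rules can be summarized in a few lines. This is a sketch under stated assumptions, not DMT code: the function names are hypothetical, calls are represented by their one- or two-letter abbreviations, and a missing call is represented by None. The first function builds the combination label used in Table 11.1. The second follows the fallback order described for the Color by Difference Call option (x-axis analysis first, then y-axis analysis, then the plain Point Color).

```python
ABSOLUTE_CALLS = {"A": "Absent", "M": "Marginal", "P": "Present"}


def call_combination(x_call, y_call):
    """Return the Table 11.1 label, e.g. 'A-P' for absent in X, present in Y."""
    if x_call not in ABSOLUTE_CALLS or y_call not in ABSOLUTE_CALLS:
        raise ValueError("calls must be 'A', 'M' or 'P'")
    return f"{x_call}-{y_call}"


def difference_call_for_color(x_call, y_call):
    """Pick the call that colors a point: x-axis first, then y-axis.

    Returns None when neither analysis has a call, in which case the
    plain Point Color would apply.
    """
    return x_call if x_call is not None else y_call
```

For example, a probe set called absent on the x-axis analysis and present on the y-axis analysis falls in the A-P category, and a point whose x-axis analysis has no difference call takes the color of the y-axis call.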

Colors

The colors of the absolute and difference call categories as well as other scatter graph items (graph points, graph background, selected or roped points, fold change lines) may be changed. (For further information, see Changing Graph Colors on page 202.)

Fold Change Lines

The default fold change lines are defined in four pairs: y = 2x and y = (1/2)x, y = 3x and y = (1/3)x, y = 10x and y = (1/10)x, y = 30x and y = (1/30)x.

1. To redraw the fold change lines, enter new values in the edit boxes. Only integer values may be entered.

2. Remove the check mark to turn off the display of that pair of fold change lines.
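To see how a pair of fold change lines partitions the graph, consider a point (x, y) with positive coordinates. It lies on or outside the pair for a multiplier m when y is at least m times x or at most x divided by m. The sketch below reports the largest such multiplier. The function name is hypothetical, the default multipliers are the four values listed in this section, and the code is an illustration rather than a description of how DMT draws or evaluates the lines.

```python
def fold_change_band(x, y, multipliers=(2, 3, 10, 30)):
    """Return the largest m whose line pair the point reaches, or None.

    A point reaches the pair y = m*x and y = x/m when y >= m*x or
    y <= x/m. Both coordinates must be positive.
    """
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive")
    # The larger of the two ratios measures distance from the y = x diagonal.
    ratio = max(y / x, x / y)
    reached = [m for m in multipliers if ratio >= m]
    return max(reached) if reached else None
```

A point at (100, 250) lies beyond the 2-fold lines but inside the 3-fold lines, so the function returns 2. A point at (100, 120) lies inside every pair, so it returns None.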

Fold Change Graph

The fold change graph is a scatter plot that displays the fold change for a user-specified set of base and comparison columns. (Appendix A describes the fold change calculation.) Numeric pivot table columns are available for the fold change graph. Each point in the graph represents a probe (probe set or spot probe) that is common to the base and comparison column. The y-axis coordinate of a point is the average fold change for all of the base-comparison column pairs that contain the probe. The x-axis coordinate is the average result value for all of the comparison columns that contain the probe. The fold change graph supports calculations with replicates. All pairs of replicate comparison and base columns contribute to the fold change graph. The fold change is averaged when the probe is repeated (for example, when the query returns analysis results from several different GeneChip® probe or spot arrays of the same type, or analysis results from the same probe found on different types of GeneChip probe or spot arrays).

For the example replicate data in Table 11.2, DMT calculates the average fold change values from rows 1 and 2, 3 and 4, 5 and 6, and 7 and 8 (excluding the control probes).

Table 11.2 Sample replicate data for the fold change calculation

Row   Base Column     Comparison Column

1     rep1base000A    rep1samp030A

2     rep2base000A    rep2samp030A

3     rep1base000B    rep1samp030B

4     rep2base000B    rep2samp030B

5     rep1base000C    rep1samp030C

6     rep2base000C    rep2samp030C

7     rep1base000D    rep1samp030D

8     rep2base000D    rep2samp030D
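The averaging across replicate pairs in Table 11.2 can be sketched as follows. Several assumptions apply and should be read carefully: the function names are hypothetical, base and comparison columns are paired by position (row by row, as in Table 11.2), a column that does not contain the probe is represented by None, and the per-pair fold change defaults to a plain ratio used purely as a placeholder. The actual fold change calculation is the one described in Appendix A, which can be passed in through the fold_change argument.

```python
def plain_ratio(base, comparison):
    """Placeholder fold change: NOT the Appendix A calculation."""
    return comparison / base


def fold_change_point(base_values, comparison_values, fold_change=plain_ratio):
    """Return (x, y) for one probe from paired base/comparison results.

    base_values[i] is paired with comparison_values[i]. Use None where a
    column does not contain the probe.
    """
    pairs = [(b, c) for b, c in zip(base_values, comparison_values)
             if b is not None and c is not None]
    comparisons = [c for c in comparison_values if c is not None]
    if not pairs or not comparisons:
        return None  # the probe cannot be plotted
    # y: average fold change over the pairs that contain the probe
    y = sum(fold_change(b, c) for b, c in pairs) / len(pairs)
    # x: average result value over the comparison columns that contain it
    x = sum(comparisons) / len(comparisons)
    return x, y
```

For a probe with base results 100 and 200 and comparison results 300 and 400, the placeholder ratio gives per-pair values of 3.0 and 2.0, so the point is plotted at x = 350.0, y = 2.5.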

Multiple pivot table columns may be assigned to each axis. For example, Figure 11.14 displays the fold change for two sets of base and comparison columns. N002AS-Avg Diff and N004AS-Avg Diff are the base columns. N006AS-Avg Diff and N008AS-Avg Diff are the comparison columns.

Figure 11.14 Fold change graph

Plotting the Fold Change Graph

Plotting the fold change graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Fold Change Graph button. Alternatively, select Graph → Fold Change from the menu bar. ⇒ The Fold Change Graph dialog box (Figure 11.15) appears and displays the pivot table columns available for the fold change graph.

Figure 11.15 Fold Change Graph dialog box (GeneChip® data mode), pivot table columns available for the fold change graph

2. Use the drag-and-drop method to select each base column in the Available Columns box and place it in the Select Base Column(s) box (Figure 11.16). Alternatively, select one or more base columns in the Available Columns box, then click the down arrow above the Select Base Column(s) box. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.

3. Use the drag-and-drop method to select each comparison column in the Available Columns box and place it in the Select Comparison Column(s) box (Figure 11.16). Alternatively, select one or more comparison columns in the Available Columns box, then click the down arrow above the Select Comparison Column(s) box.

Figure 11.16 Fold Change Graph dialog box, GeneChip® probe data mode

4. To change the order of a column in the Select Base (or Select Comparison) Column(s) box, use the drag-and-drop method to move the column to a new position in the list. Alternatively, select the column, then click the up or down arrow located at the inside of the Select Base (or Comparison) Column(s) box.

5. To change the fold change graph axes from log scale (default) to linear scale, click the Log Scale option to remove the check mark.

6. Click OK. ⇒ The graph pane displays the fold change graph (Figure 11.17).

Figure 11.17 Fold change graph (GeneChip® data mode)

Working with the Fold Change Graph

Working with the fold change graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Magnifying the Graph

1. Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the area of interest in the graph (Figure 11.18).

2. Release the mouse button. ⇒ The area selected by the rectangle is magnified (Figure 11.19).

Figure 11.18 Fold change graph, rectangle selects area to magnify

Figure 11.19 Magnified area in the fold change graph

3. To zoom out and restore the graph, right-click the graph and select Full Out Zoom from the shortcut menu.

Locating Probes in the Graph

Select a probe in the pivot table to quickly locate it in the fold change graph.

1. Click and hold the probe name in the pivot table. ⇒ The corresponding point is highlighted in the fold change graph (Figure 11.20). The highlighting is removed when the mouse button is released.

Figure 11.20 Click a probe name in the pivot table to highlight the corresponding point in the fold change graph

Viewing Probe Information & Annotating Probes

1. To display probe and corresponding gene information, click a point in the fold change graph. ⇒ The probe name, analyses names, results from the pivot table and a brief description of the gene are displayed to the right of the graph (Figure 11.21).

Figure 11.21 Fold change graph displaying probe information (GeneChip® data mode)

2. To obtain further gene information, select an Internet website from the drop-down list, then click Information. ⇒ The default Internet browser is started and automatically opens the selected website.

3. Double-click a point in the graph or a pivot table row to display the Description dialog box (Figure 11.22).

Figure 11.22 Description dialog

⇒ The Description dialog box appears and displays a brief description of the probe (probe set or spot probe), the sequence that it is designed to interrogate and any annotations associated with the probe.

4. To enter an annotation, click Annotate. ⇒ The Annotate dialog box appears (Figure 11.23).

Figure 11.23 Annotate dialog box

5. Enter comments in the Annotation box, then click OK. ⇒ The annotation is added to the Description dialog box.

Selecting Points in the Graph

The lasso feature enables you to quickly select and focus on points of interest in the fold change graph by drawing a line around them (roping). The pivot table displays only rows that correspond to the roped probes (all other rows are hidden). Probes selected by roping may be conveniently annotated as a group or included in a probe list that can be applied to the filter grid of a subsequent query.

1. Click the Lasso Points button. Alternatively, select Graph → Lasso Points from the menu bar. ⇒ The mouse pointer changes to a pair of cross hairs (+) when positioned over the fold change graph.

2. To rope points of interest, position the cross hairs near the group of points, then do one of the following:

Click and hold the mouse button while you draw a complete circle around the points (Figure 11.24); or

Click the mouse, move it to draw a line segment, then click the mouse again to start drawing a new line segment. Repeat until the cross hairs return to the starting point and the line segments enclose the points of interest (Figure 11.25).

Figure 11.24 Roped points in the fold change graph

Figure 11.25 Roped points in the fold change graph

3. To terminate the roping operation, double-click the mouse or press the ESC key. ⇒ The fold change graph displays the selected points in orange (the default selected point color, which can be changed in the Options dialog box; see Changing Graph Colors on page 202). ⇒ The pivot table displays only the rows that correspond to the roped points (all other rows are hidden).

4. To restore the hidden rows to the pivot table, right-click the pivot table and select Show All Pivot Rows from the shortcut menu.

5. To clear the selection from the graph, right-click the graph and select Clear Selection from the shortcut menu. ⇒ The roped graph points are deselected and all probes (rows) are restored to the pivot table.

Fold Change Graph Options

Preferences for the fold change graph display may be set in the Data Mining Options dialog box (Figure 11.26). Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.

1. Click the Options button, then click the Fold Change tab. Alternatively, do either of the following:

Right-click the fold change graph and select Options from the shortcut menu; or

Select View → Options from the menu bar, then click the Fold Change tab.

⇒ The Data Mining Options dialog box appears and displays the fold change options (Figure 11.26).

Figure 11.26 Data Mining Options dialog box, Fold Change tab

Point Options

Point size: The point size number determines the dot size for a graph point. Enter a larger point size for easier viewing, but use a smaller point size for higher resolution graphs.

Fold Change Calculation

Default Threshold: The intensity threshold value used to calculate the fold change is a function of the noise, scaling or normalization factor and the noise multiplier of the two analyses. The intensity threshold value is calculated by the expression algorithm and is stored in the Mining database. If the intensity threshold value is not found in the database, then DMT uses the default threshold value you entered for the intensity threshold. (Appendix A describes the fold change calculation.) Note: In spot data mode, set the default threshold to zero (Data Mining Options dialog box, Fold Change tab).

Y-Axis Gridlines

The fold change graph displays major and minor y-axis gridlines as horizontal lines with y-intercepts specified in the edit boxes (only integer values may be entered). A solid line labeled with the y-intercept value represents the major y-axis gridline and a dotted line represents the minor gridline.

Colors

The colors assigned to the fold change graph points, background, or selected (roped) points may be changed. (See Changing Graph Colors on page 202.)

Series Graph

The series graph displays numeric pivot table columns in a line (default) or bar graph format (Figure 11.27 and Figure 11.28). Both graph formats plot numeric pivot columns on the x-axis and the data associated with each probe (probe set or spot probe) in the column on the y-axis. The series graph is an extremely useful way to:

Monitor gene expression across different experiments or over a time course.

View probes roped in the scatter or fold change graph.

View individual data for cluster members (saved in a probe list).

Figure 11.27 Series line graph (GeneChip® data mode)

Figure 11.28 Series bar graph (GeneChip® data mode)

Plotting the Series Graph

Plotting the series graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Series Graph button. Alternatively, select Graph → Series from the menu bar. ⇒ The Series Graph dialog box appears and displays the pivot table columns available for the series graph (Figure 11.29).

Figure 11.29 Series Graph dialog box, pivot table columns available for the series graph

2. Select the pivot table columns for the series graph. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.

3. Click OK. ⇒ The graph pane displays the series graph (Figure 11.30). The line graph format is the default. (See Series Graph Options on page 191 to specify the bar graph format.) The series graph does not include probe sets that are hidden in the pivot table.

Working with the Series Graph

Working with the series graph is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Locating Probes in the Graph Select a probe in the pivot table to quickly locate it in the series graph.

1. Click and hold the probe name in the pivot table. ⇒ In the series line graph, the corresponding line in the graph is highlighted. ⇒ In the series bar graph, the portion of the graph that contains the probe is displayed. The highlighting is removed when the mouse button is released.

Figure 11.30 Series line graph, highlighted line (top) corresponds to the selected pivot table row

Viewing Probe Information & Annotating Probes

1. To display information about a probe, move the pointer over a point (or bar) in the series graph. ⇒ A pop-up tool tip displays the probe name and associated data (Figure 11.31).

Figure 11.31 Series graph

2. To view sequence information, double-click a point or bar in the series graph, or a pivot table row. ⇒ The Description dialog box appears (Figure 11.32) and displays a brief description of the gene, its sequence or the portion of the gene sequence the probe is designed to interrogate.

Figure 11.32 Description dialog box

3. To view further gene information, select an Internet website from the drop-down list, then click Information. ⇒ The default Internet browser is started and automatically opens the selected website.

4. To enter an annotation, click Annotate. ⇒ The Annotate dialog box appears (Figure 11.33).

Figure 11.33 Annotate dialog box

5. Enter comments in the Annotation box, then click OK. ⇒ The annotation is added to the Description dialog box.

Series Graph Options
Preferences for the series graph display may be set in the Data Mining Options dialog box (Figure 11.26). Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.

1. Click the Options button, then click the Series Graph tab. Alternatively, do either of the following: Right-click the series graph and select Options from the shortcut menu; or Select View → Options from the menu bar, then click the Series Graph tab. ⇒ The Data Mining Options dialog box appears and displays the Series Graph options (Figure 11.34).

Figure 11.34 Data Mining Options dialog box, Series Graph tab

Graph Type

Bar Graph or Line Graph Option
Select a format option to display information in the series graph as described in Table 11.3.

Table 11.3 Series graph formats

Axis   | Bar                                                 | Line
X-axis | Probe set or spot probe names                       | Pivot table column or probe
Y-axis | User-specified data for each probe in the analysis  | User-specified data for each probe in the column

Series Bar Graph Options

Probe Set Width (%)
Determines the width of the graph bar.

X-Axis Parameters Visible
The number of probes displayed on the x-axis in the viewable portion of the graph pane.

Series Line Graph Options

X-Axis
Select probe or pivot table columns for display on the x-axis.

Point Size
Determines the dot size for a graph point. Enter a larger point size for easier viewing, but use a smaller point size for higher resolution graphs.

X-Axis Parameters Visible
Specifies the number of probes or columns displayed on the x-axis in the viewable portion of the graph pane.

Colors Up to 25 different colors are applied to the bars or lines. If there are more than 25 bars or lines, the colors are re-used. The color of the series graph points, background, lines, or bars may be changed. (See Changing Graph Colors on page 202.)

Histogram

The histogram plots a frequency distribution of data from numeric pivot table columns. DMT sorts the data into groups or bins (x-axis coordinate) and plots the number of probes (probe sets or spot probes) per bin (y-axis coordinate) for each analysis. The resulting data distribution helps evaluate the proportion of genes expressed at a particular level.

Plotting the Histogram

Plotting the histogram is the same in GeneChip® data mode (shown in the following section) or spot data mode.

1. Click the Histogram button. Alternatively, select Graph → Histogram from the menu bar. ⇒ The Histogram dialog box (Figure 11.35) appears and displays the numeric pivot table columns available for the histogram.

Figure 11.35 Histogram dialog box

2. Select the desired columns for the histogram. To select adjacent columns, press and hold the SHIFT key while you click the first and last column in the selection. To select non-adjacent columns, press and hold the CTRL key while you click the columns.

3. Click OK. ⇒ The graph pane displays the histogram (Figure 11.36).

Figure 11.36 Histogram of average difference data (GeneChip® data mode)

Working with the Histogram

Working with the histogram is the same in GeneChip® data mode (shown in the following section) or spot data mode.

Viewing Histogram Information & Annotating Probes

1. To view information for a particular histogram bar, place the mouse pointer over that area of the histogram. ⇒ A pop-up tool tip displays the minimum and maximum value for the bin and the number of probe sets from the corresponding column in the bin (Figure 11.36).

2. To view sequence information, double-click a row in the pivot table. ⇒ The Description dialog box appears (Figure 11.37) and displays a brief description of the gene, its sequence or the portion of the gene sequence the probe is designed to interrogate.

Figure 11.37 Description dialog box

3. To view further gene information, select an Internet website from the drop-down list, then click Information. ⇒ The default Internet browser is started and automatically opens the selected website.

4. To enter an annotation, click Annotate. ⇒ The Annotate dialog box appears (Figure 11.38).

Figure 11.38 Annotate dialog box

5. Enter comments in the Annotation box, then click OK. ⇒ The annotation is added to the Description dialog box.

Adding Landmarks

One or more landmarks (Figure 11.40) may be added to the histogram to identify where a user-specified probe falls in the distribution.

1. Right-click the histogram and select Add Landmark from the shortcut menu. ⇒ The Landmarks dialog box appears (Figure 11.39).

Figure 11.39 Landmarks dialog box

2. Select one or more columns, then enter a probe name.

3. Click OK. ⇒ The histogram displays the landmark labeled with the column and probe name (Figure 11.40).

Figure 11.40 Histogram with landmark for average difference value of probe set M95787_at in analysis N004AS

4. To hide the landmark(s), right-click the histogram and select Hide Landmarks from the shortcut menu.

5. To display the hidden landmark(s), right-click the histogram and select Show Landmarks from the shortcut menu.

6. To clear all landmarks, right-click the histogram and select Remove Landmarks from the shortcut menu.

Magnifying the Histogram

1. Press and hold the SHIFT key while using the click-and-drag method to draw a rectangle over the graph area of interest (Figure 11.41).

2. Release the mouse button. ⇒ The area selected by the rectangle is magnified (Figure 11.42).

Figure 11.41 Histogram, rectangle selects area to magnify

Figure 11.42 Magnified area of the histogram

3. To zoom out and restore the graph, right-click the histogram and select Full Out Zoom from the shortcut menu.

Histogram Options
Preferences for the histogram display may be set in the Data Mining Options dialog box (Figure 11.43). Newly selected options are applied immediately to an existing graph and are retained for your subsequent sessions.

1. Click the Options button, then click the Histogram tab. Alternatively, do either of the following: Right-click the histogram and select Options from the shortcut menu; or Select View → Options from the menu bar, then click the Histogram tab. ⇒ The Data Mining Options dialog box appears (Figure 11.43).

Figure 11.43 Data Mining Options dialog box, Histogram tab

Graph Options

Combined Histogram
All of the pivot table columns in a single bin are combined into one bar (Figure 11.44). If a single column was selected for the histogram, the Combined Histogram and Separate Histograms options are identical.

Separate Histograms
Each bar in the histogram represents one pivot table column and is color-coded according to the legend at the right of the histogram (Figure 11.45). Select the Separate Histograms option to plot a separate frequency distribution for each column.

Figure 11.44 Histogram, combined histogram option

Figure 11.45 Histogram, separate histograms option

Bin Options

Range
Select this option to specify the range of values for the histogram.

Fixed Bin Size
Select this option to define the range of data values for each bin. Each bin is set to the user-specified Bin Size. The first bin begins at the low value set in the Range option or, if no range is specified, at the lowest data value. The histogram creates sufficient bins to plot all of the data using the user-specified Bin Size. If a Range is specified, the number of bins = Range / Bin Size.

Variable Bin Size
Select this option to define the number of histogram bins. Number of Bins is the number of bins plotted. First Bin Upper Limit defines the boundary value between the first and second bin. The first bin includes all values less than or equal to the first bin upper limit. The user-specified Range and Number of Bins determine the size of the remaining bins, which increases exponentially.

Use the Variable Bin Size and Range options to compare the distribution of values from one or more analyses. For example, set the Number of Bins = 10, First Bin Upper Limit = 40 and Range = 0 to 10,000. The histogram plots 10 bins that contain an increasingly larger range of values.
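The bin options can be illustrated with a short Python sketch. It is not DMT's implementation: the guide states only that the remaining bins grow exponentially, so the geometric spacing used here is an assumption, and the function names are hypothetical.

```python
import math

def fixed_bin_edges(low, high, bin_size):
    """Edges for equal-width bins that start at `low` and cover `high`."""
    count = math.ceil((high - low) / bin_size)  # number of bins = Range / Bin Size
    return [low + i * bin_size for i in range(count + 1)]

def variable_bin_edges(low, high, first_upper, number_of_bins):
    """Edges whose widths grow geometrically after the first bin (assumed rule)."""
    ratio = (high / first_upper) ** (1.0 / (number_of_bins - 1))
    edges = [low] + [first_upper * ratio ** i for i in range(number_of_bins)]
    edges[-1] = high  # pin the top edge to avoid floating-point drift
    return edges

def count_per_bin(values, edges):
    """Count values per bin; the first bin takes every value <= its upper limit."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if v <= edges[i + 1] and (i == 0 or v > edges[i]):
                counts[i] += 1
                break  # values above the top edge are not counted
    return counts

# The example from the text: 10 bins, first bin upper limit 40, range 0 to 10,000.
edges = variable_bin_edges(0, 10000, 40, 10)
```

With these settings each successive bin after the first is roughly 1.85 times wider than the previous one, which is one way to obtain bins that "contain an increasingly larger range of values."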

X-Axis Options

Ticks per label
Defines the number of graph markers or tick marks on the x-axis between the numeric labels. The numeric label shows the range for a bin. The histogram displays a tick mark for each bin.
Note: Set the Ticks per label option to 1/2 or 1/4 the Number of Bins in the Variable Bin Size option. This displays enough labels to view the histogram ranges without overloading the graph.

Color Options
The color of the histogram background, landmarks, or bars may be changed. (See Changing Graph Colors on page 202.)

Other Graphing Features

Enlarging the Graph Pane

1. Right-click the graph pane and select Expand Graph from the shortcut menu. Alternatively, select View → Expand Graph from the menu bar.

2. Repeat step 1 to restore the graph pane to its original size.

Changing Graph Colors

1. Click the Options button, then click the graph tab of interest. Alternatively, do one of the following, then click the graph tab of interest: Right-click the graph and select Options from the shortcut menu; or Select View → Options from the menu bar. ⇒ The Data Mining Options dialog box appears and displays the selected graph options (Figure 11.46).

Figure 11.46 Data Mining Options dialog box, Scatter Graph tab

2. To change the color of an item (for example, Selected Point Color in Figure 11.46), click the associated color square in the Data Mining Options dialog box. ⇒ The Color palette appears (Figure 11.47).

Figure 11.47 Color palette (expanded palette, right)

3. Click a new basic color in the palette or click Define Custom Colors to define a custom color. ⇒ The color palette expands to display the custom color field (Figure 11.47).

4. To define a custom color, use the click-and-drag method to position the cross hairs in the custom color field. In the luminosity scale to the right, adjust the color brightness by moving the arrow up or down the scale. ⇒ The Color|Solid swatch displays the custom color.

5. When finished, click Add to Custom Colors to apply the color. Color selections are saved on a per user basis.

Copying and Clearing Graphs

1. To copy a graph to the system clipboard, right-click the graph and select Copy Graph from the shortcut menu. Alternatively, select Edit → Copy Graph from the menu bar.

2. To clear a graph from the graph pane, right-click the graph and select Clear Graph from the shortcut menu.

3. To clear all graphs from the graph pane, select Edit → Clear Graphs from the menu bar.

Printing Graphs

1. In the graph pane, click the graph tab you want to print.

2. Click the Print button in the toolbar. ⇒ The Print dialog box appears (Figure 11.48).

Figure 11.48 Print dialog box

3. Confirm that the Graph option is selected.

4. Click OK.

Chapter 12 Statistical Analyses

DMT offers several types of statistical analyses to help evaluate and compare replicate data. Statistical operators can be applied to numeric pivot table columns. The resulting data are displayed in the pivot table and are available for graphing and further statistical analysis.

Selecting an Operator

Open the Analysis Function dialog box to select a statistical operator(s).

Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 12.1).

Figure 12.1 Analysis Function dialog box

Average, Median, Standard Deviation or Inter-Quartile Range

One or more of the following operators can be applied to user-specified numeric columns in the pivot table:

Average
Computes the average for the selected pivot table column(s).

Median
Computes the median (50th percentile) for the selected pivot table column(s).

Standard Deviation
Calculates the standard deviation for the selected pivot table column(s).

Inter-Quartile Range
Computes the 75th and the 25th percentile value for the selected pivot table column(s). The inter-quartile range is the 75th percentile minus the 25th percentile.
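As a rough illustration of these four operators, the following Python sketch summarizes the values of one probe across the selected columns. It uses the standard library's sample standard deviation and "inclusive" percentile interpolation; the guide does not say which conventions DMT applies, so both choices are assumptions, and the function name is hypothetical.

```python
import statistics

def summarize_row(values):
    """Summarize one probe's values across the selected pivot table columns."""
    # quantiles(n=4) returns the 25th, 50th and 75th percentiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Average": statistics.fmean(values),
        "Median": statistics.median(values),
        "Stdev": statistics.stdev(values),  # sample standard deviation (assumed)
        "IQR": q3 - q1,                     # 75th percentile minus 25th percentile
    }

# Five replicate values for a single probe (made-up numbers).
result = summarize_row([100, 200, 300, 400, 500])
```

For these made-up values the average and median are both 300 and the inter-quartile range is 400 - 200 = 200.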

1. In the Analysis Function dialog box (Figure 12.2), select one or more of the operators: Average, Median, Standard Deviation, or Inter-Quartile Range.

Figure 12.2 Analysis Function dialog box

2. Enter a name for the new column(s) of data that will be generated, then click Next. ⇒ The column selection dialog box appears (Figure 12.3).

Figure 12.3 Column selection dialog box, average and standard deviation analysis

3. Select one or more pivot table columns for the operators selected in step 1, then click Finish. ⇒ The pivot table (right side) displays the new column(s) of statistical results (Figure 12.4). The column header displays the user-specified name, followed by the type of operator. For example, in Figure 12.4, the new column names are Tumor-Average and Tumor-Stdev.

Figure 12.4 Pivot table displaying average and standard deviation results

Fold Change

The fold change (FC) operator compares user-specified pivot table columns (base and comparison analysis) and computes the fold change for each probe in the comparison. (See Appendix A for more information about the fold change calculation.)

1. In the Analysis Function dialog box, select the Fold Change operator (Figure 12.5).

Figure 12.5 Analysis Function dialog box

2. Enter a column name for the fold change results, then click Next. ⇒ The Column selection dialog box appears (Figure 12.6).

Figure 12.6 Column selection dialog box, fold change analysis

3. Select a base and comparison column, then click Finish. ⇒ The pivot table (right side) displays the fold change results (Figure 12.7).

Figure 12.7 Pivot table, fold change results

T-Test

The T-Test analyzes two groups of pivot table columns (control and experiment) and determines the significance of change of the means of the two groups as well as the direction of the change. It computes a p-value for each comparison. The p-value is the probability value that the observed difference occurred by chance. A small p-value (for example, 0.01) means it is unlikely (only a one in 100 chance) that such a mean difference would occur by chance under the assumption that the mean difference was zero. The T-Test assumes two samples of unequal variances and a normal distribution of the data. DMT uses an unpaired, one-sided T-Test and converts the p-value to a two-sided p-value. It shows the direction of change in a separate pivot table column.
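The unequal-variance (Welch) form of the test can be sketched in Python as follows. This is an illustration under assumptions, not DMT's code: the 0.05 cutoff and the "Increase" and "Decrease" labels are hypothetical (only the None call is named in this guide), and the one-sided p-value itself must come from a t-distribution routine such as SciPy's `scipy.stats.t.sf`, which is not reproduced here.

```python
import math
import statistics

def welch_t(control, experiment):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    n1, n2 = len(control), len(experiment)
    v1, v2 = statistics.variance(control), statistics.variance(experiment)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.fmean(experiment) - statistics.fmean(control)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for unequal variances.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

def two_sided(p_one_sided):
    """Convert a one-sided p-value to a two-sided p-value."""
    return 2.0 * min(p_one_sided, 1.0 - p_one_sided)

def change_call(p_value, t, p_cutoff=0.05):
    """Direction call: None when the p-value is greater than the cutoff."""
    if p_value > p_cutoff:
        return "None"
    return "Increase" if t > 0 else "Decrease"

# Three control and three experiment columns for one probe (made-up values).
t, df = welch_t([100, 110, 120], [200, 210, 220])
```

Here the mean difference is 100 with a standard error of about 8.16, giving t of about 12.25 on 4 degrees of freedom.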

1. In the Analysis Function dialog box, select the T-Test operator (Figure 12.8).

Figure 12.8 Analysis Function dialog box

2. Enter a column name for the T-Test results.

3. Confirm the default P Cutoff, or enter a new value.

If the computed p-value for a call is greater than the p Cutoff, the Change Direction call is None (no change).

4. Click Next. ⇒ The column selection dialog box appears (Figure 12.9).

Figure 12.9 Column selection dialog box, T-Test

5. Select two or more pivot columns for the Control and two or more pivot columns for the Experiment, then click Finish. ⇒ The pivot table displays two columns of T-Test results: the computed P Value and the Change Direction call (Figure 12.10).

Figure 12.10 Pivot table displaying T-Test results

Mann-Whitney Test

The Mann-Whitney test compares two groups of pivot table columns (control and experiment) to determine the significance of change as well as the direction of change. It computes a p-value for each comparison. The Mann-Whitney test is the nonparametric method for comparing two unpaired groups. It does not assume a particular distribution of the data.
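For small groups, a rank-based comparison of this kind can be computed exactly by enumerating every way of splitting the pooled ranks. The Python sketch below does that with a two-sided p-value. Whether DMT uses an exact or an approximate calculation is not stated in this guide, so treat the method and the function names as assumptions.

```python
from itertools import combinations

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_exact(control, experiment):
    """Return (control rank sum, exact two-sided p-value)."""
    pooled = list(control) + list(experiment)
    ranks = average_ranks(pooled)
    n1 = len(control)
    observed = sum(ranks[:n1])
    expected = n1 * (len(pooled) + 1) / 2  # rank sum expected under no change
    extreme = total = 0
    # Every assignment of n1 pooled ranks to the control group is equally likely.
    for combo in combinations(range(len(pooled)), n1):
        total += 1
        if abs(sum(ranks[i] for i in combo) - expected) >= abs(observed - expected) - 1e-12:
            extreme += 1
    return observed, extreme / total

rank_sum, p_value = mann_whitney_exact([1, 2, 3], [4, 5, 6])
```

With three columns per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 2/20 = 0.1. Selecting more columns per group makes smaller p-values attainable.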

1. In the Analysis Function dialog box, select the Mann-Whitney operator (Figure 12.11).

Figure 12.11 Analysis Function dialog box

2. Enter a column name for the Mann-Whitney test results.

3. Confirm the default P Cutoff, or enter a new value. If the computed p-value for a call is greater than the p Cutoff, the Change Direction call is None.

4. Click Next. ⇒ The column selection dialog box appears (Figure 12.12).

Figure 12.12 Column selection dialog box, Mann-Whitney test

5. Select two or more pivot columns for the Control and two or more pivot columns for the Experiment, then click Finish. ⇒ The pivot table displays the Mann-Whitney test results (Figure 12.13).

Figure 12.13 Pivot table, Mann-Whitney results

Count & Percentage

The Count & Percentage operator is only available in GeneChip® data mode. For each probe set in user-specified pivot table columns, it counts the number and computes the percentage of:

Absolute or detection calls (P, M, or A)
Difference or change calls (I, MI, NC, MD, D), or
Calls within a user-specified numeric range

For the Count & Percentage operator, you can specify any combination of:

Absolute call, difference call and numeric range
Detection call, change call and numeric range

A probe set must meet all conditions to be counted.

1. In the Analysis Function dialog box, select the Count & Percentage operator (Figure 12.14).

Figure 12.14 Analysis Function dialog box

2. Specify the conditions (absolute call, difference call, or numeric thresholds) a probe set must meet to be counted.

If numeric thresholds are specified, DMT counts only the values within the threshold limits. If both > and < threshold options are selected, they are combined in AND (intersection) fashion. The example in Figure 12.14 specifies a probe set must have an absolute call = P, difference call = I and expression metric >200 to be counted.

3. Enter a column name for the Count & Percentage results.

4. Click Next. ⇒ The column selection dialog box appears (Figure 12.15).

Figure 12.15 Column selection dialog box, Count & Percentage operator

5. Select the pivot columns and parameter(s) for the count and percentage analysis, then click Finish. (The Parameters box is only displayed if a numeric threshold was specified in the Analysis Function dialog box.) ⇒ The pivot table displays the Count & Percentage results (Figure 12.16). For example, in Figure 12.16, the count for probe set AB002533_at is 16 because the probe set met all conditions (absolute call = P, difference call = I, average difference > 200) in 16 of 16 columns, resulting in percentage = 100%.
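The counting rule can be expressed as a small Python sketch. The tuple layout and function name are hypothetical. The conditions mirror the example in Figure 12.14 and are combined so that a column counts only when every condition is met.

```python
def count_and_percentage(columns, abs_call="P", diff_call="I", greater_than=200.0):
    """Count the columns in which one probe set meets all conditions.

    `columns` holds one (absolute call, difference call, value) tuple
    per selected pivot table column.
    """
    count = sum(
        1
        for absolute, difference, value in columns
        if absolute == abs_call and difference == diff_call and value > greater_than
    )
    return count, 100.0 * count / len(columns)

# A probe set that meets all conditions in 16 of 16 columns (made-up values).
all_met = count_and_percentage([("P", "I", 350.0)] * 16)

# A probe set that fails the numeric threshold in one of four columns.
partial = count_and_percentage(
    [("P", "I", 350.0), ("P", "I", 420.0), ("P", "I", 150.0), ("P", "I", 510.0)]
)
```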

Figure 12.16 Pivot table, Count & Percentage results

Chapter 13 Matrix Analysis

Matrix analysis compares two probe lists, determines the probes in common (probe sets or spot probes) and computes an overlap or non-overlap significance score for the two probe lists. The matrix provides a spreadsheet framework for comparing probe lists and displays the overlap or non-overlap significance score for probe lists in the matrix. See Appendix D for further information about the matrix algorithm.

Overview

Matrix analysis uses the binomial distribution to calculate the probability that an overlap between two lists occurs by chance. (See Appendix D for more information about the binomial distribution.) The analysis compares two separate lists and calculates the significance of the overlap between them.

To illustrate how the significance is determined, consider two independent sets, probe list A and probe list Q. Probe list A has na members and probe list Q has nq members. These sets were generated from a total population of size t. (Note: the total population usually includes additional members besides those in sets A and Q.)

The expected overlap between the lists based on random chance is important in determining the overlap significance. The chance (or frequency, w) of picking a member of set Q at random from the total population is: w = nq/t. For example, if there are 10 members of Q in a total population of 100 members, then there is a ten percent chance of picking a member of Q. If we make na random picks (the number of members in set A) from this distribution, we would expect to pick a member of set Q ten percent of the time. The expected overlap between Q and A is na*w.

What we actually observe is that x members belong to both classification A and Q. How close the observed overlap, x, is to the expected overlap, na*w, determines the overlap significance. If these two values are close, then there is a high probability that the overlap is due to random chance. The algorithm uses the binomial distribution to determine this significance.

The observed overlap could be larger or smaller than the expected overlap. If the observed overlap is larger than the expected, then set A is over-represented. If the observed overlap is smaller than the expected, then set A is under-represented.
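The reasoning above can be tried out numerically with a minimal Python sketch. It computes the expected overlap na*w and a binomial tail probability for an observed overlap x. How DMT converts such a probability into its significance score is described in Appendix D and is not reproduced here, and the function names are hypothetical.

```python
from math import comb

def expected_overlap(na, nq, t):
    """Expected overlap na*w, where w = nq/t."""
    return na * (nq / t)

def binomial_tail(x, na, w, upper=True):
    """P(X >= x) if upper, else P(X <= x), for X ~ Binomial(na, w)."""
    ks = range(x, na + 1) if upper else range(0, x + 1)
    return sum(comb(na, k) * w ** k * (1 - w) ** (na - k) for k in ks)

# List A has 20 members, list Q has 10 members, total population is 100.
na, nq, t = 20, 10, 100
expected = expected_overlap(na, nq, t)   # 2 probes expected by chance
p_over = binomial_tail(6, na, nq / t)    # chance of 6 or more probes in common
```

With these made-up numbers, observing 6 probes in common when 2 are expected has a probability of roughly 0.011 under random picking, which points to an over-represented overlap.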

Population Size
The total population is an important parameter in calculating overlap significance. This is the total population from which the lists were generated. It is defined as the number of members in common between the two independent classification schemes.

The total population for clustering sets is the number of probe sets used when generating the clusters. For the SOM clustering, this value is the total number of probe sets contained in all of the clusters. For correlation coefficient clustering, the population size is either the maximum number of seeds used or the total number of probe sets in the pivot table, depending on whether all or only the seed set were used to generate the final clusters. (See Chapter 14 for a description of the clustering algorithms and parameters.)

Matrix analysis initially sets the population size as the number of unique probe sets in the row and column probe lists of the matrix. If you are using only a subset of the classification lists, or the lists do not include all of the members that were used to generate the classification, then the calculated population size is too small. In this case, change the total population to the total number of members used to generate the classification.
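The initial population size described above amounts to counting the distinct probes across all selected lists. A minimal Python sketch follows. The function name and most probe names are hypothetical, and it assumes each probe list is a simple collection of probe names.

```python
def default_population_size(row_lists, column_lists):
    """Count the unique probes across the row and column probe lists."""
    unique = set()
    for probes in list(row_lists) + list(column_lists):
        unique.update(probes)
    return len(unique)

# Two row lists and one column list (made-up contents).
rows = [["M95787_at", "AB002533_at"], ["AB002533_at", "X00001_at"]]
cols = [["M95787_at", "X00002_at", "X00003_at"]]
size = default_population_size(rows, cols)
```

Probes that appear in more than one list are counted once, which is why the default can be too small when the lists are only a subset of the original classification.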

Running a Matrix Analysis

1. Select Analyze → Matrix from the menu bar. ⇒ The Matrix opens (Figure 13.1).

Figure 13.1 Matrix

2. Click Select Rows. ⇒ The Select dialog box appears (Figure 13.2).

Figure 13.2 Select Probe Sets dialog box

3. Select the probe lists you want to include in the matrix rows, then click OK. ⇒ The matrix displays the probe list names in the row headers (Figure 13.3).

Figure 13.3 Matrix, rows specified

4. Click Select Columns. ⇒ The Select dialog box appears (Figure 13.2).

5. Select the probe lists you want to include in the matrix columns, then click OK. ⇒ The matrix displays the probe list names in the column headers (Figure 13.4).

Figure 13.4 Matrix, probe lists selected for the rows and columns

6. Confirm the Population Size, or enter a new value. The default population value is equal to the number of unique probes in the row and column probe lists. (See Population Size on page 224 for information on how to set this value.)

7. Click Calculate. ⇒ The algorithm computes the overlap (over-represented probe sets) or non-overlap (under-represented probe sets) significance score for each pair of probe lists in the matrix (Figure 13.5). The significance score increases as the overlap or lack of overlap increases between two lists (see Appendix D). To distinguish between overlap and non-overlap, the matrix highlights overlap scores that exceed the overlap significance threshold in pink and non-overlap scores that exceed the non-overlap significance threshold in yellow.

The threshold values in the Overlap and Non-overlap boxes can be changed.

Figure 13.5 Matrix displaying overlap significance scores

8. Click Print to print the matrix.

9. Click Close when finished to close the matrix.

Chapter 14 Cluster Analysis

Cluster analysis helps identify gene expression patterns (profiles) in the data and groups together probe sets or spot probes with similar gene expression patterns. DMT offers two clustering algorithms: self organizing map (SOM) and correlation coefficient clustering.

Self Organizing Map (SOM) Algorithm

The self organizing map (SOM) algorithm is designed to cluster GeneChip® average difference data (shown in this chapter). However, any numeric column in the pivot table may be selected for cluster analysis. (Appendix D describes the SOM algorithm and its user-modifiable parameters.) The algorithm considers the expression levels of n probe sets in k experiments as n points in k-dimensional space. Initially, the algorithm randomly places a grid of nodes or centroids onto the k-dimensional space. The algorithm iteratively adjusts the positions of the nodes to identify clusters in the data.
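The general idea can be sketched in Python as a generic, textbook-style self organizing map. It is not DMT's implementation: the learning rate, neighborhood radius, decay schedule, and iteration count are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def nearest_node(nodes, point):
    """Return the grid coordinate of the node closest to `point`."""
    return min(nodes, key=lambda rc: math.dist(nodes[rc], point))

def train_som(points, rows, cols, iterations=2000, seed=0):
    """Fit a rows x cols grid of nodes to n points in k-dimensional space."""
    rng = random.Random(seed)
    k = len(points[0])
    lo = [min(p[d] for p in points) for d in range(k)]
    hi = [max(p[d] for p in points) for d in range(k)]
    # Place the grid nodes at random positions inside the data range.
    nodes = {(r, c): [rng.uniform(lo[d], hi[d]) for d in range(k)]
             for r in range(rows) for c in range(cols)}
    for step in range(iterations):
        remaining = 1.0 - step / iterations
        rate = 0.5 * remaining                            # assumed decay schedule
        radius = 0.5 + (max(rows, cols) / 2.0) * remaining
        point = rng.choice(points)
        win_r, win_c = nearest_node(nodes, point)
        # Move the winning node and its grid neighbors toward the point.
        for (r, c), weights in nodes.items():
            grid_dist = math.hypot(r - win_r, c - win_c)
            if grid_dist <= radius:
                pull = rate * math.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))
                for d in range(k):
                    weights[d] += pull * (point[d] - weights[d])
    return nodes

def assign_clusters(nodes, points):
    """Group point indices by their nearest node (one cluster per node)."""
    clusters = {rc: [] for rc in nodes}
    for index, point in enumerate(points):
        clusters[nearest_node(nodes, point)].append(index)
    return clusters

# Four probe sets measured in k = 4 experiments (made-up values).
points = [(100, 110, 400, 420), (105, 115, 410, 400),
          (500, 480, 90, 100), (510, 470, 95, 105)]
nodes = train_som(points, rows=1, cols=2)
clusters = assign_clusters(nodes, points)
```

Because neighboring grid nodes are pulled together during training, nodes that sit close to one another on the grid tend to end up with similar expression patterns. A different seed can give a different arrangement.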

Running a SOM Cluster Analysis
Prior to cluster analysis, normalize GeneChip® signal data in Affymetrix® Microarray Suite or DMT. Normalize spot probe intensity data in Affymetrix® Jaguar™. (For more information, see Chapter 5.)

1. Select Analyze → SOM Clustering from the menu bar. ⇒ The Select Columns for Clustering dialog box appears (Figure 14.1).

Figure 14.1 Select Columns for Clustering dialog box

2. Select more than one pivot table column for SOM clustering, then click OK. ⇒ The SOM Clustering dialog box appears (Figure 14.2).

Figure 14.2 SOM Clustering dialog box

The section SOM Filters on page 238 describes the thresholds, row variation and row normalization filters. Filtering the data is optional, but recommended. See SOM Parameters on page 239 for a description of the user-modifiable algorithm parameters.

3. To apply threshold filtering, confirm the Thresholds values, MinVal and MaxVal, or enter new values, then click Add>. ⇒ The threshold filter is displayed in the box to the right (Figure 14.3).

4. To apply Row Variation filtering, confirm the row variation Max/Min and Max-Min defaults or enter new values, then click Add>. ⇒ The row variation filter is displayed in the box to the right (Figure 14.3).

5. Click Compute to display the number of probe sets (or spot probes) remaining after the row variation filter is applied to the data. The Compute button is a tool for quickly confirming the row variation Max/Min and Max-Min parameters. The number of rows (probe sets) that remain in the dataset after filtering is displayed as New Rows next to the Compute button (Figure 14.3).

When you click Compute, any values entered in the Row Variation edit boxes are also applied to the filters in the box on the upper right, even when the Row Variation values do not appear in the filter box.

6. To apply Row Normalization, confirm the Mean and Variance defaults, or enter new values, then click Add>. ⇒ The row normalization filter is displayed in the box to the right (Figure 14.3).

Figure 14.3 SOM Clustering dialog box, data filtering

7. To change the order of a filter, highlight the filter, then click Down or Up to move the filter to the desired position.

8. To delete a filter, highlight the filter, then click Del. To delete all filters, click Del All.

9. Confirm the defaults for Parameters, or enter new values. See SOM Parameters on page 239 for a description of the user-modifiable algorithm parameters.

10. Click Run to filter the data and perform SOM cluster analysis. ⇒ The graph pane displays the results of the cluster analysis (Figure 14.4).

Figure 14.4 SOM clusters

The rows and columns parameters generate the nodes that identify clusters. For example, in Figure 14.4 the default rows and columns (6 x 3) generate 18 clusters (click the down arrow to scroll the cluster view). The SOM algorithm maps clusters that have similar gene expression patterns near one another. As a result, in Figure 14.4, the average gene expression patterns in Cluster 1 and Cluster 2 show the greatest similarity and those in Cluster 1 and Cluster 18 are the most dissimilar.

Each cluster plot displays the cluster number followed by the number of cluster members (in parentheses). The middle (red) graph line represents the average gene expression pattern for the cluster. The two outer (blue) graph lines represent the standard deviation of expression (Figure 14.5). Cluster plot axes are not scaled identically. SOM cluster results may show run-to-run variability due to the inherent nature of the algorithm (for example, the random initialization process).

Figure 14.5 SOM cluster plot (4 pivot columns selected for clustering)

11. Click a cluster plot to view the members in the Probes box (Figure 14.4).

Saving a Probe List

Saving a Selected Cluster as a Probe List

1. Click the cluster you want to save. ⇒ The cluster members are displayed in the Probes box of the Cluster tab (Figure 14.4).

2. Enter a Probe List Name.

3. Click Save Selected. ⇒ The data tree displays the probe list name.

Saving All Clusters as a Probe List

1. Click Save All. ⇒ The Save All Clusters dialog box appears (Figure 14.6).

Figure 14.6 Save All Clusters dialog box

2. Enter a cluster root name and click Save All. ⇒ The data file tree displays the probe lists (Figure 14.7). Each probe list is named using the cluster root name followed by the cluster number.

Figure 14.7 Data file tree, Probe Lists directory

To quickly view data for the cluster members in a probe list, right-click the probe list in the data tree, then select Highlight Pivot and Graph from the shortcut menu. The pivot table displays only the rows for the probe list (cluster members). If the scatter, fold change and series line graphs were previously plotted for the clustered columns, the scatter and fold change graphs highlight the points from the probe list. The series line graph displays only the probe list.

SOM Filters

The SOM filter values are user-modifiable. The default values are intended for probe set average difference data.

Thresholds

The minimum and maximum thresholds are designed to limit the effect of outlier data. Data that exceed the maximum threshold value are changed to the maximum threshold value. Data less than the minimum threshold value are changed to the minimum threshold value.

Row Variation

The row variation filters are designed to exclude probe sets or spot probes that do not significantly change expression level across the experiments. DMT evaluates each probe set or spot probe across all selected columns and includes it in the analysis if both of the following conditions are met:

1) maximum value / minimum value > 3 (default), and
2) maximum value - minimum value > 100 (default)

Both row variation limits are user-modifiable.

Row Normalization

Row normalization rescales the data in each row to a mean of zero and a variance of one. It helps the algorithm identify clusters based on the shape of expression patterns rather than absolute expression levels.
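The three SOM filters can be read as a short pipeline applied to each pivot table row. The following Python sketch is an illustration only, not the DMT implementation. The function names, the threshold values of 20 and 20000, the use of the population variance, and the handling of a non-positive minimum are all assumptions. Only the row variation defaults (3 and 100) come from this guide.

```python
from statistics import mean, pstdev

def clamp_row(row, minimum, maximum):
    """Thresholds: set values outside [minimum, maximum] to the nearest bound."""
    return [min(max(value, minimum), maximum) for value in row]

def passes_row_variation(row, min_ratio=3.0, min_difference=100.0):
    """Row variation: keep a row only if max/min > min_ratio AND max - min > min_difference."""
    highest, lowest = max(row), min(row)
    if lowest <= 0:
        # Assumption: the ratio test is undefined for a non-positive minimum, so reject the row.
        return False
    return highest / lowest > min_ratio and highest - lowest > min_difference

def normalize_row(row):
    """Row normalization: rescale to mean 0 and variance 1 (population variance assumed)."""
    mu, sigma = mean(row), pstdev(row)
    if sigma == 0:
        return [0.0] * len(row)  # a flat row has no shape to normalize
    return [(value - mu) / sigma for value in row]

# Hypothetical thresholds of 20 and 20000; the guide does not state the defaults.
row = clamp_row([10, 150, 50000, 400], minimum=20, maximum=20000)
keep = passes_row_variation(row)
shape = normalize_row(row)
```

In this sketch the thresholds are applied first, so the row variation test always sees a positive minimum when the minimum threshold is positive.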

SOM Parameters

See Appendix D for further description of the SOM algorithm.

Rows & Columns: Specifies the rows and columns of nodes that identify clusters in the data. The number of nodes (rows x columns) determines the number of clusters generated.

Epochs: Determines the number of iterations the algorithm runs. Iterations = Epochs x Number of probe sets

Seeds: The number of times the algorithm runs through a set of iterations. The algorithm selects the result that minimizes the sum of the distances from the data points to the nodes.

Initialization: Initial placement of the nodes in k-dimensional space. The Random Vectors method randomly places the nodes in k-dimensional space. The Random Datapoints method places the nodes on randomly selected data points.

Neighborhood: Defines a distance from the target node (the node closest to the point being considered). At each iteration, nodes in the neighborhood are updated, that is, moved toward the point being considered. The bubble neighborhood is a radial distance from the target node. All nodes in the bubble neighborhood are updated the same amount. Nodes outside the bubble neighborhood are not updated. In the Gaussian neighborhood, all nodes are updated. The distance a node moves is a function of the distance of the node from the target node. The greater the distance between the node and the target node, the smaller the distance the node is moved.

Initial neighborhood size: Initial width of the bubble neighborhood (default = 5).

Final neighborhood size: Final width of the bubble neighborhood at the last iteration.

Initial learning rate: Initial distance (learning rate) a node is updated.

Final learning rate: Final learning rate at the last iteration.
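The parameter descriptions above can be illustrated with a single SOM update step. The Python sketch below is a minimal illustration under stated assumptions and is not the DMT code. It assumes that neighborhood distance is measured between node positions on the rows x columns grid, that the Gaussian neighborhood uses the common exp(-d^2 / (2 r^2)) form with a radius greater than zero, and that the learning rate is the fraction of the way a node moves toward the data point. All function names are hypothetical.

```python
import math

def number_of_clusters(rows, columns):
    """The number of nodes (rows x columns) is the number of clusters."""
    return rows * columns

def iterations(epochs, n_probe_sets):
    """Iterations = Epochs x Number of probe sets."""
    return epochs * n_probe_sets

def update_weight(kind, distance, radius, rate):
    """Fraction of the way a node moves toward the data point."""
    if kind == "bubble":
        # Same update inside the bubble, no update outside it.
        return rate if distance <= radius else 0.0
    if kind == "gaussian":
        # Every node moves; the move shrinks as distance from the target grows.
        return rate * math.exp(-(distance ** 2) / (2 * radius ** 2))
    raise ValueError(f"unknown neighborhood: {kind}")

def som_step(nodes, positions, point, kind, radius, rate):
    """Move nodes toward `point`; the target is the node closest to the point."""
    target = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], point))
    moved = []
    for i, node in enumerate(nodes):
        d = math.dist(positions[i], positions[target])  # distance on the node grid
        w = update_weight(kind, d, radius, rate)
        moved.append([n + w * (p - n) for n, p in zip(node, point)])
    return moved

nodes = [[0.0, 0.0], [10.0, 10.0]]   # node vectors in data space
positions = [(0, 0), (0, 5)]         # node positions on the grid
moved = som_step(nodes, positions, [1.0, 1.0], "bubble", radius=1, rate=0.5)
```

In a full run, the neighborhood size and learning rate would shrink from their initial to their final values over the iterations. The schedule used for that is not described in this chapter, so it is left out of the sketch.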

Correlation Coefficient Clustering Algorithm

The correlation coefficient clustering algorithm finds probe set patterns that have similar shape. The process for finding clusters of similar probe set patterns is accomplished in three steps:

Filtering - Removes patterns due mostly to noise.
Seeding - Defines the expression patterns of the clusters.
Clustering - Groups patterns which are close to the cluster shape.

First, the data set is filtered to remove probe sets with low or relatively constant expression levels across the samples (low standard deviation). The entire data set need not be included to obtain a diverse set of clusters. To the contrary, including noisy data tends to make the discovery of unique expression patterns more difficult. Filtering reduces the number of expression patterns used in the following seeding step. It has been empirically determined that 3,000 or fewer genes should be included in the seeding step.

Next, a nearest neighbor approach is used to calculate seeds with unique patterns in the data set. All probe sets whose expression patterns correlate with one another above the user-defined correlation coefficient (CC) threshold are grouped to define a seed. The expression level for each of the genes in the seed is normalized relative to its standard deviation and the mean of the normalized expression levels is calculated and defined as the seed pattern.

In the final step, the pattern of each gene is compared to the seed patterns. Those patterns that closely match the seed pattern are assigned to the seed cluster. Depending on the way the clustering parameters are defined, either all genes or just those that survived the filtering step are assigned to seed clusters. Genes may match more than one seed. Assignment to more than one cluster is allowed, or assignment to only the cluster with the highest CC may be forced.

Unlike the SOM clustering, the correlation coefficient algorithm does not pre-define the number of clusters. The seeding operation determines the final number of clusters.

The correlation coefficient clustering algorithm is designed to cluster GeneChip® expression data such as signal or average difference. In general it is best to use normalized expression values. This removes some types of sample preparation artifacts which can create spurious patterns that tend to mask the true patterns in the data. However, any column in the pivot table may be selected for cluster analysis. See Appendix D for more information about the algorithm.
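The seeding step described above can be sketched in Python. This is a simplified illustration and not the DMT algorithm: it assumes the correlation coefficient is the Pearson coefficient, and it replaces the nearest neighbor approach with a simple greedy grouping. The function names and the sample data are hypothetical. The defaults of 0.98 and 3 are the ones given in this chapter.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length patterns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def zscore(x):
    """Normalize a pattern relative to its standard deviation."""
    m = sum(x) / len(x)
    s = math.sqrt(sum((a - m) ** 2 for a in x) / len(x))
    return [(a - m) / s for a in x]

def make_seeds(patterns, cc_threshold=0.98, min_per_seed=3):
    """Greedy grouping (an assumption): each unused pattern collects the
    unused patterns that correlate with it above the threshold."""
    unused = dict(patterns)
    seeds = []
    for name, pattern in patterns.items():
        if name not in unused:
            continue
        group = [n for n, p in unused.items() if pearson(pattern, p) > cc_threshold]
        if len(group) >= min_per_seed:
            normalized = [zscore(unused[n]) for n in group]
            # Seed pattern = mean of the normalized member patterns.
            seeds.append([sum(col) / len(col) for col in zip(*normalized)])
            for n in group:
                del unused[n]
    return seeds

patterns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [10, 20, 30, 41],
    "d": [4, 3, 2, 1],
    "e": [5, 1, 4, 2],
}
seeds = make_seeds(patterns)
```

Here probe sets a, b and c rise together and form one seed, while d and e do not find enough partners to define a seed.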

Running the Correlation Coefficient Cluster

To run the correlation coefficient clustering, you must specify the data to cluster and various parameters for filtering, seeding and final clustering.

1. Select Analyze → Correlation Coefficient Clustering from the menu bar. ⇒ The Select Columns for Clustering dialog box appears (Figure 14.8).

Figure 14.8 Select Columns for Clustering dialog box

2. Select the samples for clustering.

3. Click OK when finished. ⇒ The Correlation Coefficient Clustering dialog box appears (Figure 14.9).

Figure 14.9 Correlation Coefficient Clustering dialog box

See Correlation Coefficient Clustering Options on page 244 for a description of the Filter, Seed Patterns and Cluster options and settings.

4. In GeneChip® data mode, confirm the default or enter a new value for the Maximum number of probe sets to include in seeding.

The Filter options are only available in GeneChip data mode if the absolute call or detection values have been retrieved from the database.

5. To generate seeds, choose the Generate Seeds option. The Import Seed Patterns option is described later.

6. Confirm the defaults or enter new values for the Correlation coefficient threshold and Minimum number of probe sets per seed.

7. Confirm the defaults for the Cluster options (Unique assignments to one cluster and Cluster filtered probe sets only) or choose new Cluster options.

8. Click Run to start the cluster analysis. ⇒ The Cluster tab in the graph pane displays the clusters (Figure 14.10). The pane displays the cluster number followed by the number of cluster members (in parentheses).

The cluster plot axes are not scaled identically.

Figure 14.10 Correlation coefficient cluster plot

Correlation Coefficient Clustering Options The parameters for the filtering, seed generation and clustering steps of the clustering algorithm are specified in the Correlation Coefficient Clustering dialog box (Figure 14.9). The following discusses the parameters for each step.

Filter

Many genes in a data set may not be expressed and therefore have low expression values. Noise in these values can produce spurious patterns, which the filtering step removes. The detection call (Statistical Expression algorithm) and absolute call (Empirical Expression algorithm) are used to determine whether a probe set is expressed or not. An absent call (A) indicates the gene is not expressed in the sample. Probe sets with absent calls may be excluded from the seeding process.

To filter based on the expression call, choose the Exclude probe sets with less than _% Present calls across all analyses when generating seeds option. The filter slider sets the percentage of P (present) or M (marginal) calls that are required for a given probe set to be included in the seeding step. A higher filter percentage excludes more probe sets with low expression values; a lower percentage includes more probe sets with low signal. The default is 75%.

Depending on the experiment, the filtering parameter may be set to either a high or low number. For example, suppose an experiment looks at several different tissues and only those probe sets expressed in a single tissue are of interest. In this case, lowering the filtering percentage and tolerating the noise in the rest of the samples is required to detect the one rare gene that may be expressed.

Also in the filter step, specify the Maximum number of probe sets to include in seeding. The algorithm ranks the probe sets according to the relative standard deviation of their expression intensities across the samples. Top-ranked probe sets fluctuate the most, low-ranked probe sets the least. The value determines the number of top-ranked probe sets that are included in seeding. The default is 1,000, but sometimes this value may be as small as several hundred in order to obtain meaningful clusters.
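The two filter settings can be sketched as follows. This Python example is an illustration under assumptions, not the DMT implementation: it takes relative standard deviation to mean the standard deviation divided by the mean, and the function names and sample probe sets (g1, g2, g3) are hypothetical. The defaults of 75% and 1,000 are the ones given above.

```python
import math

def percent_present(calls):
    """Percentage of Present (P) or Marginal (M) calls for one probe set."""
    return 100.0 * sum(call in ("P", "M") for call in calls) / len(calls)

def relative_std(values):
    """Standard deviation divided by the mean (assumed definition)."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return s / m if m else 0.0

def select_for_seeding(probe_sets, min_percent=75.0, max_probe_sets=1000):
    """probe_sets maps a name to (calls, values) across the selected analyses.
    Returns the names that pass the call filter, most variable first."""
    kept = [name for name, (calls, _) in probe_sets.items()
            if percent_present(calls) >= min_percent]
    kept.sort(key=lambda name: relative_std(probe_sets[name][1]), reverse=True)
    return kept[:max_probe_sets]

probe_sets = {
    "g1": (["P", "P", "P", "A"], [100, 400, 100, 400]),
    "g2": (["P", "M", "P", "P"], [500, 510, 490, 500]),
    "g3": (["A", "A", "P", "A"], [50, 900, 40, 60]),
}
default_selection = select_for_seeding(probe_sets)
relaxed_selection = select_for_seeding(probe_sets, min_percent=25.0)
```

With the default of 75%, probe set g3 (expressed in one sample only) is excluded. Lowering the percentage to 25% lets it through, which mirrors the single-tissue example above.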

Seed

Seeding is a pre-clustering process by which cluster patterns are first determined. A seed is usually a small group of genes whose expression patterns are very similar to each other. The seed’s expression pattern is calculated from the average expression pattern of this small group. Two separate parameters are used in the seeding step:

Correlation coefficient threshold
Minimum number of probe sets per seed

The correlation coefficient threshold is a numerical way of representing the relatedness of expression patterns. The correlation coefficient is the covariance between the expression patterns of two probe sets across a series of biological samples, normalized by the standard deviations of the two patterns. The value of the correlation coefficient ranges from -1 to +1, where +1 represents complete correspondence. The higher the threshold, the more similar the probe sets must be to belong to the same seed. The default value is 0.98, but can be as large as 0.999 or as small as 0.8 in order to obtain meaningful clusters.

Set the Minimum number of probe sets per seed to specify the number of genes that must be present in a seed for it to be used in clustering. A higher number is more restrictive and reduces the number of allowed patterns. A lower number allows rarer expression patterns to define seeds and then later clusters. The default is 3. If a file name is entered into the Save Seed Patterns box, the patterns will be saved as a text file.

Cluster

There are three parts to the final clustering step.

The correlation coefficient threshold is the same parameter as in the seeding step. It specifies how closely a probe set pattern must match the seed’s pattern in order to join the cluster. A lower number allows a less stringent expression relationship between the probe sets which are permitted to join the cluster. A higher number forces a more stringent relationship. Generally, it is best to use a less stringent threshold than in the seeding step in order to incorporate more unseeded probe sets into the cluster. The default is 0.90.

If the Cluster Filtered Probe Sets Only option is chosen, only those probe sets that passed the filtering steps are allowed to join the cluster. The choice will depend on factors such as the quality of the data and whether a rare expression pattern is being sought.

It is possible that a probe set correlation coefficient exceeds the threshold for two or more seeds. If the Unique assignments to one cluster option is chosen, a probe set is assigned to the cluster with the highest correlation coefficient. If this option is not chosen, the probe set is assigned to every cluster for which its correlation coefficient exceeds the threshold.
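The effect of the Unique assignments to one cluster option can be sketched in Python. The example below is an illustration only, assuming the Pearson correlation coefficient. The function names and the sample patterns are hypothetical, and the default threshold of 0.90 is the one given above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length patterns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def assign_to_clusters(patterns, seeds, cc_threshold=0.90, unique=True):
    """Assign each probe set to the seed clusters it matches.
    unique=True keeps only the cluster with the highest coefficient."""
    clusters = {i: [] for i in range(len(seeds))}
    for name, pattern in patterns.items():
        matches = [(pearson(pattern, seed), i) for i, seed in enumerate(seeds)]
        matches = [(cc, i) for cc, i in matches if cc > cc_threshold]
        if not matches:
            continue  # the probe set joins no cluster
        if unique:
            matches = [max(matches)]
        for _, i in matches:
            clusters[i].append(name)
    return clusters

seeds = [[1, 2, 3, 4], [1, 2, 3, 5]]
patterns = {"x": [2, 4, 6, 8], "y": [4, 3, 2, 1]}
unique_clusters = assign_to_clusters(patterns, seeds, unique=True)
shared_clusters = assign_to_clusters(patterns, seeds, unique=False)
```

Probe set x exceeds the threshold for both seeds, so it appears once with unique assignment and twice without it. Probe set y is anti-correlated with both seeds and joins no cluster.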

Effect of Changing Algorithm Parameters

Table 14.1 describes how changing a parameter value affects seeding and clustering.

Table 14.1 User-modifiable Correlation Coefficient algorithm parameters

Each entry below gives the Correlation Coefficient Algorithm Parameter, its Description, and the Effect of Change for each Parameter Change (Increase or Decrease).

Filter
    Description: Specifies the percentage of present and marginal detection or absolute calls that a probe set must have across all analyses in order to be considered for the seeding step, and optionally, the clustering step (default = 75%).
    Increase: Decreases the number of probe sets (with the highest relative standard deviation) used in the seeding process. An excessive number of probe sets in the seeding process generates large, less distinct clusters.
    Decrease: Increases the number of probe sets (with the highest relative standard deviation) used in the seeding process. Increases the number of seeds (representative expression profiles).

Maximum number of probe sets to include in seeding
    Description: The algorithm ranks the probe sets not excluded by the filter in order of highest standard deviation. Probe sets with the highest standard deviation are included in the seeding procedure until the Maximum number of probe sets to include in seeding is reached.
    Increase: Increases the number of probe sets (with the highest relative standard deviation) used in the seeding process. An excessive number of probe sets in the seeding process generates large, less distinct clusters.
    Decrease: Decreases the number of probe sets (with the highest relative standard deviation) used in the seeding process. Decreases the number of seeds (representative expression profile for a cluster).

Seed correlation coefficient threshold
    Description: Expression patterns of probe sets that pass the filter are compared to one another. If the correlation coefficient between two probe set profiles exceeds the threshold, they are included in the same seed.
    Increase: Increases the similarity required between the expression profiles of two probe sets in order to be included in the same seed. If the seed correlation coefficient threshold is excessively high, this prevents identification of any seeds.
    Decrease: Lowers the similarity required between the expression profiles of two probe sets in order to be included in the same seed. If the seed correlation coefficient threshold is too low, expression profiles merge and can result in a new profile that is unlike either merged profile.

Minimum number of probe sets per seed
    Description: Minimum number of probe sets (that exceed the seed correlation coefficient threshold) required to define a cluster and generate a seed.
    Increase: Decreases the number of clusters generated.
    Decrease: Increases the number of clusters generated.

Cluster correlation coefficient threshold
    Description: The expression profile of each probe set is compared to each seed. If the correlation coefficient exceeds the threshold, the probe set is assigned to the cluster.
    Increase: Decreases the number of probe sets in a cluster.
    Decrease: Increases the number of probe sets in a cluster.

Saving and Importing Seed Patterns The seeding process described above is useful for finding interesting, but unknown patterns in the data set. In cases where the pattern is known, the seeding process can be omitted and the known patterns imported instead. An example pattern would be a gene that is expressed in one tissue type, but in none of the others. To import the seed patterns:

1. Choose the Import Seed Patterns option.

2. Enter the name of the seeds data file (*.txt) that contains the patterns (Figure 14.11). Alternatively, click the Browse button and select a *.txt from the Read Seeds Data dialog box that appears. The *.txt can be a file saved from a previous clustering run (see Saving Seeds Data on page 249) or manually created (see Seed Pattern (*.txt) Format on page 250).

Figure 14.11 Correlation Coefficient Clustering dialog box

Saving Seeds Data The seeds generated by a cluster analysis may be saved in a seeds data file (*.txt).

1. In the Correlation Coefficient Clustering dialog box (Figure 14.12), select the Generate Seeds option.

Figure 14.12 Correlation Coefficient Clustering dialog box

2. Click the upper Browse button. ⇒ The Save Seeds Data dialog box appears (Figure 14.13).

Figure 14.13 Save Seeds Data dialog box

3. Select a directory for the saved file.

4. Enter a File Name for the seeds data file (*.txt), then click Save. The seed patterns are saved when the clustering algorithm is executed.

Seed Pattern (*.txt) Format

In the seed pattern text file (Figure 14.14), the first row contains the column headings. The following rows contain the patterns. Each row contains the label of the pattern and the expression values.

Figure 14.14 shows a representative pattern file. In this example, the first three rows are patterns of individual probe sets. The last three rows are user-specified patterns.

Figure 14.14 Import seed text file
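A seed pattern file with the layout described above (one heading row, then one row per pattern with a label followed by expression values) could be read with a few lines of Python. This sketch assumes the file is tab-delimited, which this guide does not state. The function name, the column headings and the pattern labels in the sample are hypothetical.

```python
import csv
import io

def read_seed_patterns(text, delimiter="\t"):
    """Parse seed pattern text: first row = column headings,
    remaining rows = pattern label followed by expression values."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    patterns = {row[0]: [float(value) for value in row[1:]] for row in body}
    return header, patterns

# Hypothetical file content: one probe set pattern and one user-specified pattern.
sample = (
    "Pattern\tT1\tT2\tT3\n"
    "probe_a\t100\t20\t30\n"
    "user_1\t1\t0\t0\n"
)
header, patterns = read_seed_patterns(sample)
```

The user_1 row shows the kind of known pattern mentioned earlier: expressed in one tissue type and in none of the others.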

Saving a Probe List

Cluster members may be saved as a probe list.

1. Click the cluster you want to save. ⇒ The cluster members are displayed in the Probes box of the Cluster tab (Figure 14.15).

2. Enter a name for the list in the Probe List Name box.

Figure 14.15 Correlation coefficient cluster plot

3. Click Save Selected. ⇒ The data tree displays the probe list name.

To quickly view data for the cluster members in a probe list, right-click the probe list in the data tree, then select Highlight Pivot and Graphs from the shortcut menu. The pivot table displays only the rows for the probe list (cluster members). If the scatter, fold change and series line graphs were previously plotted for the clustered columns, the scatter and fold change graphs highlight the points from the probe list. The series line graph displays only the probe list.

Chapter 15 DMT Tutorial

Introduction

This tutorial includes six lessons that demonstrate (in GeneChip® data mode) how to use DMT.

Lesson 1: Identify highly expressed genes
Lesson 2: Calculate summary statistics of replicates
Lesson 3: Summarize qualitative data
Lesson 4: Evaluate difference between two tissues
Lesson 5: Use comparison ranking to evaluate difference call consistency
Lesson 6: Perform cluster analysis using the self-organizing map (SOM) algorithm

The tutorial lessons use the demonstration database DMT_3_Tutorial that is provided on the Affymetrix Data Mining Tutorial and Demo Data CD (P/N 610050 Rev. 2). The database includes absolute and comparison analyses of tissues T1, T2 and T3.

LIMS users: The tutorial database name may be different from that used in this manual. Please contact your Database Administrator for the correct name.

There are six replicate absolute analyses of each tissue type (a total of 18 absolute analyses). For example, the replicates for tissue T1 are T1_r1, T1_r2, ... T1_r6 (Figure 15.1). There are 36 comparison analyses that compare tissue T1 and T2 replicates for use in Lesson 5. The number of replicates needed in your own experiments will depend on how much variability you expect to see in your system. The signal intensity data were scaled to a target intensity (TGT) of 500 using the All Probe Sets option in the Affymetrix® Microarray Suite software.


Figure 15.1 DMT_3_Tutorial database, 18 absolute analyses

Before we can start to analyze the data, we must first register and connect the database to DMT.

Step 1: Restoring the MicroDB™ Database

Refer to the Affymetrix® MicroDB™ User’s Guide, the SQL Server manual, or the Oracle® manual for instructions on how to restore the tutorial database to a workstation or server.

Step 2: Starting DMT

Refer to Chapter 2 on page 10 for more information on installing and registering DMT. Press the Windows Start menu button, then select Programs → Affymetrix → Data Mining Tool. ⇒ The DMT main window appears.

Step 3: Registering the Database

Tutorial Database on Windows NT® Workstation (MicroDB™ System)

1. Select Edit → Register Database from the menu bar. ⇒ The Register Database dialog box appears (Figure 15.2).

Figure 15.2 Register Database dialog box, publish database on Windows NT workstation

2. Select the DMT_3_Tutorial database from the Publish Database drop-down list, then click Register. ⇒ The tutorial database is now available to DMT.

Tutorial Database on LIMS Server (Affymetrix® LIMS)

Select Edit → Register Database from the menu bar. ⇒ The Register Database dialog box appears (Figure 15.3).

Figure 15.3 Register Database dialog box

Oracle® Database

1. To select another server, enter the server name or Oracle alias, then click List Databases to display the publish databases for the server in the Publish Database drop-down list.

2. Select the DMT_3_Tutorial database from the Publish Database drop-down list, then click Register. ⇒ The tutorial database is available to DMT.

Step 4: Selecting the Tutorial Database

1. Select Edit → Select Database from the menu bar.

2. Select the DMT_3_Tutorial database. ⇒ The status bar at the bottom of the main window displays the name of the current database.

If the status bar is not displayed, select View → Status Bar from the menu bar.

Step 5: Opening the DMT Session

A DMT session must be opened to begin.

Select Data → New → GeneChip Mining from the menu bar. ⇒ The DMT session opens (Figure 15.4).

Figure 15.4 DMT session, DMT_3_Tutorial database selected

Lesson 1: Identifying Highly Expressed Genes

Identifying genes that significantly change expression level can give insight into the major functional and structural cell changes that occur between two experimental conditions (for example, normal cells and cells treated with a drug). This lesson shows how to identify genes that are highly expressed in tissue T1. It then examines the expression of these same genes in tissue T2 and T3. We will use only one replicate of each tissue in this lesson. Lesson 1 includes:

Step 1: Specifying a Filter
Step 2: Selecting Analyses for the Query
Step 3: Pivoting on Signal & Detection Call
Step 4: Querying and Pivoting the Data
Step 5: Sorting the Pivot Table by Signal
Step 6: Saving a Probe List
Step 7: Plotting the Series Line Graph

Step 1: Specifying a Filter

The filter is a useful tool for selecting transcripts that exceed a certain limit or transcripts within a given expression range. For example, to find highly expressed genes, we can specify (in the filter grid) genes that are called Present with a Signal > 1000.

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.

2. In Line 1 of the filter grid, double-click the Signal cell, then enter >1000.

3. Enter ='P' in the Detection cell (Figure 15.5).

Figure 15.5 Filter grid

The query interrogates the absolute analyses selected from the data tree and returns probe sets that have a Signal greater than 1000 and a Present (P) Detection call.

To specify more complex queries, right-click a cell in the filter grid, then select Show Query Builder from the shortcut menu that appears. This opens the Build Filter dialog box for the selected cell. The Build Filter dialog box enables you to enter complex limits in the filter grid without prior knowledge of correct syntax for operators such as BETWEEN and LIKE. You need only specify text or number where appropriate.

Step 2: Selecting Analyses for the Query

In the data tree, select the absolute analyses: T1_r1, T2_r1 and T3_r1.

Step 3: Pivoting on Signal & Detection Call

1. To select results for the pivot operation, click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.6).

2. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.6).

Figure 15.6 Data Mining Options dialog box, Pivot tab

3. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal and Detection (Figure 15.6).

4. Clear the check mark from the Show order analyses dialog option. When this option is chosen, the software prompts you to confirm the order of the columns (analyses) in the pivot table prior to the pivot operation.

5. Click OK to close the Data Mining Options dialog box.

Step 4: Querying and Pivoting the Data

1. Click the Pivot toolbar button. ⇒ The data are queried using the filter specified in step 1. The pivot table displays the signal and detection call for each probe set returned by the query (Figure 15.7).

You can reorder the pivot table columns using the click-and-drag method.

Figure 15.7 Pivot table

Some fields in the pivot table are blank because the probe sets in these analyses did not satisfy the filter criteria.
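The following Python sketch shows, in simplified form, why blank fields appear. It is an illustration of the logic only and does not use any DMT interface. The record layout, the function names and the probe set names (probe_A, probe_B) are hypothetical. The filter values (Signal > 1000, Detection = P) are the ones entered in step 1.

```python
def run_filter(results, min_signal=1000.0, detection="P"):
    """Keep results with Signal > min_signal and the given Detection call."""
    return [r for r in results
            if r["signal"] > min_signal and r["detection"] == detection]

def pivot(rows, analyses):
    """One row per probe set, one column per analysis; None marks a blank field."""
    table = {}
    for r in rows:
        row = table.setdefault(r["probe_set"], {a: None for a in analyses})
        row[r["analysis"]] = r["signal"]
    return table

results = [
    {"analysis": "T1_r1", "probe_set": "probe_A", "signal": 2500.0, "detection": "P"},
    {"analysis": "T2_r1", "probe_set": "probe_A", "signal": 800.0, "detection": "P"},
    {"analysis": "T1_r1", "probe_set": "probe_B", "signal": 1500.0, "detection": "A"},
    {"analysis": "T3_r1", "probe_set": "probe_B", "signal": 1200.0, "detection": "P"},
]
table = pivot(run_filter(results), ["T1_r1", "T2_r1", "T3_r1"])
```

In this sketch probe_A passes the filter in T1_r1 only, so its T2_r1 and T3_r1 fields stay blank. A probe set appears in the table as soon as it passes the filter in at least one selected analysis.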

Step 5: Sorting the Pivot Table by Signal

In the pivot table, right-click the Signal column heading for T1_r1 and select Sort Descending from the shortcut menu that appears. ⇒ The pivot table rows are sorted in descending order of the signal values for T1_r1 (Figure 15.8).

Step 6: Saving a Probe List

1. Select the ten pivot table rows with the highest signal values for T1_r1 (Figure 15.8).

Figure 15.8 Pivot table

2. Right-click a highlighted cell and select Create Probe List from the shortcut menu that appears. ⇒ The Save Probe List dialog box appears (Figure 15.9).

Figure 15.9 Save Probe List dialog box

3. In the Name box, enter the probe list name Highly Expressed.

4. Clear the check mark from the Show members after saving option.

5. Click Save. ⇒ The probe list is saved and displayed in the data tree.

To view the probe list members, click the plus sign (+) next to the probe list in the data tree.

Step 7: Plotting the Series Line Graph

Now that we have identified genes that are highly expressed in T1_r1 and saved them in a probe list, we can plot the series line graph to examine the expression levels of these genes in T2_r1 and T3_r1 as well.

1. Right-click the filter grid and select Clear Query from the shortcut menu that appears. ⇒ The criteria in the filter grid are cleared.

2. Click the Pivot toolbar button.

Verify that analyses T1_r1, T2_r1 and T3_r1 remain selected in the data tree before running the pivot operation.

3. Right-click the Highly Expressed probe list in the data tree and select Display Selected Probes from the shortcut menu that appears. ⇒ The pivot table displays only the members of the Highly Expressed probe list (Figure 15.10).

Figure 15.10 Pivot table

4. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.11).

Figure 15.11 Data Mining Options dialog box, Series Graph tab

5. Click the Series Graph tab and verify the Line Graph option is selected.

6. Click OK.

7. Click the Series Graph toolbar button. ⇒ The Series Graph dialog box appears (Figure 15.12).

Figure 15.12 Series Graph dialog box

8. Select all three columns (T1_r1-Signal, T2_r1-Signal and T3_r1-Signal), then click OK. ⇒ The signal series line graph is plotted for the probe sets in the Highly Expressed probe list (Figure 15.13). If necessary, use the scroll bar at the bottom of the graph pane to view the entire graph.

Figure 15.13 Series line graph displaying the Highly Expressed probe list

Lesson 1 Summary

We used filters (Detection = P and Signal > 1000) to query the database and select transcripts in a given expression range. We then sorted the pivot table by signal in descending order to quickly identify the most highly expressed genes returned by the query. We saved probe sets (genes) of interest as a probe list. The probe list is a useful way to organize probe sets of interest. In the data tree, the Display Selected Probes function provided a convenient way to view pivot table results and plot graphs for the probe list members only. You can use the probe list to look at gene expression for list members across other experiments. For example, in this lesson we saved ten probe sets (with the highest signal value in T1_r1) as a probe list, then used the Display Selected Probes function to update the pivot table and plot the series line graph for analyses T1_r1, T2_r1 and T3_r1.

Suggested Exercise

Repeat lesson 1, filtering for genes that are called present and have a signal between 1000 and 2000. Generate a short probe list (five to ten members) and plot the series line graph (select Columns for the X-Axis option) for the probe list across three replicate analyses.

Lesson 2: Calculating Averages of Replicates

The analysis of replicates allows us to measure the variability in a data set and determine confidence values for these measurements. This enables us to measure small, consistent changes even when the variability in a data set is relatively high. Small changes in gene expression can be biologically very important. Using a larger number of replicates increases the probability that small changes are statistically significant. Lesson 2 shows how to compute the mean and standard deviation for the members of the Highly Expressed probe list (generated in lesson 1) across replicate analyses. This lesson includes:

Step 1: Specifying a Probe List for the Filter
Step 2: Selecting Analyses for the Query
Step 3: Pivoting on Signal
Step 4: Querying and Pivoting the Data
Step 5: Selecting the Average and Standard Deviation Operators
Step 6: Sorting the Pivot Table
Step 7: Displaying Probe Set Descriptions

Step 1: Specifying a Probe List for the Filter

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.

2. In Line 1 of the filter grid, right-click the Probe Set Name column, then select Probe List from the shortcut menu that appears. ⇒ The Open Probe List dialog box appears (Figure 15.14).

Figure 15.14 Open Probe List dialog box

3. Select the Highly Expressed probe list (generated in lesson 1), then click Open. ⇒ The selected probe list is placed in the Probe Set Name column of the filter grid (Figure 15.15). The probe list contains the probe sets we want to analyze. By loading the list we limit our analysis to these probe sets only.

Figure 15.15 Filter grid

Step 2: Selecting Analyses for the Query

1. In the data tree, select all replicate absolute analyses for tissue T1, T2, and T3 (T1_r1 through T1_r6, T2_r1 through T2_r6, and T3_r1 through T3_r6) (Figure 15.16).

Figure 15.16 Data tree, all absolute analyses (18) selected

Step 3: Pivoting on Signal

1. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.17).

2. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.17).

Figure 15.17 Data Mining Options dialog box, Pivot tab

3. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.

4. Click OK to close the Data Mining Options dialog box.

Step 4: Querying and Pivoting the Data

1. Click the Pivot toolbar button. ⇒ The pivot table displays probe sets that are members of the Highly Expressed probe list (Figure 15.18).

Figure 15.18 Pivot table

Average and Standard Deviation

The average and standard deviation statistics, or the median and inter-quartile range statistics, can be used to summarize the expression level for each probe set across a number of replicate analyses.

Select the average and standard deviation statistics if you assume a normal distribution for the data (Figure 15.19). The standard deviation provides an estimate of how much the expression level changes from one replicate to the next.

Select the median and inter-quartile range statistics if you assume the data do not have a normal distribution (Figure 15.20). The inter-quartile range is the 75th percentile minus the 25th percentile.

If you are not sure whether your data have a normal distribution, calculate both the mean and median values. If the two values differ substantially, the data probably do not have a normal distribution and it may be better to use the median value.

Figure 15.19 Normal data distribution

Figure 15.20 Skewed data distribution
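The choice between the two pairs of statistics can be illustrated outside DMT. The following is a minimal sketch in Python, not DMT's implementation. The signal values are made up, and the inter-quartile range uses the Python standard library's "inclusive" quantile method, which may differ slightly from the percentile rule DMT applies.

```python
import statistics

# Hypothetical signal values for one probe set across six replicates.
# The last replicate is an outlier, so the data are clearly skewed.
signals = [100, 110, 120, 130, 140, 600]

mean = statistics.mean(signals)
stdev = statistics.stdev(signals)      # sample standard deviation
median = statistics.median(signals)

# quantiles(..., n=4) returns the 25th, 50th and 75th percentiles.
q1, _, q3 = statistics.quantiles(signals, n=4, method="inclusive")
iqr = q3 - q1                          # 75th percentile minus 25th percentile

print(f"mean={mean}, stdev={stdev:.1f}")
print(f"median={median}, IQR={iqr}")
```

In this made-up example the mean (200) lies far from the median (125). That kind of disagreement suggests the data are not normally distributed and that the median and inter-quartile range are the safer summary.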

Step 5: Selecting the Average and Standard Deviation Operators

1. Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 15.21).

Figure 15.21 Analysis Function dialog box

2. Enter T1 in the Column Name box.

3. Select the Average and Standard Deviation operators (Figure 15.21), then click Next. ⇒ The column selection dialog box displays the available pivot table columns (Figure 15.22).

Figure 15.22 Analysis Function dialog box

4. Select all replicate T1 signal columns (T1_r1-Signal, T1_r2-Signal,... T1_r6-Signal), then click Finish. ⇒ The pivot table (far right) displays the new columns T1-Average and T1-Stdev (Figure 15.23). Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Figure 15.23 Pivot table, average and standard deviation for replicate T1 signal data

5. Repeat items 1 through 4 of Step 5 for the replicate T2 signal columns (enter T2 in the Column Name box of the Analysis Function dialog box).

6. Repeat items 1 through 4 of Step 5 for the replicate T3 signal columns (enter T3 in the Column Name box of the Analysis Function dialog box). ⇒ The pivot table (right side) displays six new columns: T1-Average and T1-Stdev, T2-Average and T2-Stdev, and T3-Average and T3-Stdev (Figure 15.24). Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Figure 15.24 Pivot table, average and standard deviation for replicate T1, T2 and T3 signal data

Step 6: Sorting the Pivot Table

We are interested in probe sets with large signal values. We can sort the pivot table to help identify these probe sets.

1. Select Edit → Sort from the menu bar. ⇒ The Sort dialog box appears (Figure 15.25).

Figure 15.25 Sort dialog box

2. Select T1-Average from the top Sort By drop-down list and select the Descending sort option.

3. Click OK. ⇒ The pivot table is sorted by descending T1-Average value (Figure 15.26).

Figure 15.26 Pivot table sorted by descending T1-Average
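For readers who want to reproduce Steps 5 and 6 on exported data, the following is a hedged Python sketch. It is not DMT code. The probe set names and signal values are hypothetical, the helper `add_average_and_stdev` is invented for illustration, and only three replicates are shown to keep the example short.

```python
import statistics

# Hypothetical pivot table: probe set name -> {column name: signal}.
pivot = {
    "probe_A": {"T1_r1-Signal": 900.0, "T1_r2-Signal": 1100.0, "T1_r3-Signal": 1000.0},
    "probe_B": {"T1_r1-Signal": 2900.0, "T1_r2-Signal": 3100.0, "T1_r3-Signal": 3000.0},
    "probe_C": {"T1_r1-Signal": 1500.0, "T1_r2-Signal": 1700.0, "T1_r3-Signal": 1600.0},
}

def add_average_and_stdev(table, column_name, source_columns):
    """Append <name>-Average and <name>-Stdev to every row of the table."""
    for row in table.values():
        values = [row[col] for col in source_columns]
        row[f"{column_name}-Average"] = statistics.mean(values)
        row[f"{column_name}-Stdev"] = statistics.stdev(values)

t1_columns = ["T1_r1-Signal", "T1_r2-Signal", "T1_r3-Signal"]
add_average_and_stdev(pivot, "T1", t1_columns)

# Sort probe sets by descending T1-Average, as in Step 6.
ranked = sorted(pivot, key=lambda name: pivot[name]["T1-Average"], reverse=True)
print(ranked)
```

The column name entered in the Analysis Function dialog box (T1) plays the role of `column_name` here, which is why the new columns are called T1-Average and T1-Stdev.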

Step 7: Displaying Probe Set Descriptions

1. Select Query → Pivot Descriptions from the menu bar. ⇒ The pivot table displays a column of probe set descriptions (Figure 15.27).

Figure 15.27 Pivot table, probe set descriptions displayed

Lesson 2 Summary

We used a probe list as a filter to focus on genes of interest across different analyses. Here we included the Highly Expressed probe list (generated in lesson 1) in the filter and queried all replicate analyses of tissue T1, T2 and T3. We computed the mean and standard deviation to help summarize the replicate signal data for tissue T1, T2 and T3, and provide a confidence measure for the data. We sorted the pivot table by the T1 average signal to help us identify probe sets with large signal values. Descriptions were displayed in the pivot table for more information about the probe sets.

Suggested Exercise

Repeat lesson 2, computing the median and inter-quartile range for the replicate analyses of tissue T1, T2 and T3.

Lesson 3: Summarizing Qualitative Data

Some transcripts may be expressed at the limit of assay detection. The more often a weakly expressed transcript is called present across multiple analyses, the more confident we are that it is actually present. (Think of this as a jury where each experiment is a juror that votes whether or not a transcript is present.) This lesson shows how to:

Use the Count & Percentage analysis to evaluate the consistency of detection calls across all replicate data.
Identify the transcripts that are present in all replicates of tissue T1, T2 and T3.
Annotate the genes that are present and generate a corresponding probe list representing potential genes of interest.

Lesson 3 includes:

Step 1: Pivoting on Detection Call
Step 2: Performing Count & Percentage Analysis
Step 3: Sorting the Pivot Table Results
Step 4: Saving a Probe List
Step 5: Annotating Probe List Members

Step 1: Pivoting on Detection Call

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.

2. In the data tree, select all 18 absolute analyses for tissue T1, T2 and T3.

3. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.28).

4. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.28).

Figure 15.28 Data Mining Options dialog box, Pivot tab

5. From the list of Absolute Expression Data for the Statistical Algorithm, select Detection. Verify that all other options are cleared.

6. Click OK to close the Data Mining Options dialog box.

7. Click the Pivot toolbar button. ⇒ The pivot table displays the detection call for each probe set in the selected analyses (Figure 15.29).

Figure 15.29 Pivot table displaying detection calls

Step 2: Performing Count & Percentage Analysis

1. Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 15.30).

Figure 15.30 Analysis Function dialog box

2. Enter T1 Present in the Column Name box.

3. Select the Count & Percentage operator, then select the P (present) option.

4. Click Next. ⇒ The column selection dialog box displays the available pivot table columns (Figure 15.31).

Figure 15.31 Analysis Function dialog box

5. Select the six replicates for tissue T1 (T1_r1, T1_r2,... T1_r6), then click Finish. ⇒ This generates the columns T1 Present-Count and T1 Present-Percent in the pivot table (Figure 15.32).

6. Repeat items 1 through 5 of Step 2 for the replicate T2 Detection columns (enter T2 Present in the Column Name box of the Analysis Function dialog box). ⇒ The pivot table (right side) displays two new columns: T2 Present-Count and T2 Present-Percent (Figure 15.32).

7. Repeat items 1 through 5 of Step 2 for the replicate T3 Detection columns (enter T3 Present in the Column Name box of the Analysis Function dialog box). ⇒ The pivot table (right side) displays two new columns: T3 Present-Count and T3 Present-Percent (Figure 15.32).

Use the horizontal scroll bar at the bottom of the results pane to view the right side of the pivot table.

Step 3: Sorting the Pivot Table Results

1. In the pivot table, right-click the T1 Present-Count column heading and select Sort Descending from the shortcut menu that appears.

Figure 15.32 Pivot table, count and percentage columns

For each probe set:

The Count column displays the number of columns (analyses) in which the detection call = Present.
The Percent column shows the corresponding percentage of columns (analyses) in which the probe set was called present.

For example, in Figure 15.32, probe set Z70759_at was called present in all replicates of tissue T1, T2 and T3, or 100% of the analyses. Sorting the T1 Present-Count column in descending order ranks the probe sets so that those with the most consistent detection calls in tissue T1 are displayed at the top of the pivot table.
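The Count and Percent columns are simple tallies, which the following Python sketch makes concrete. It is an illustration only, not DMT's implementation. The probe set names are hypothetical, and the call codes "M" and "A" (marginal and absent) are assumed alongside the "P" (present) code named in Step 2.

```python
def count_and_percentage(calls, target="P"):
    """Return (count, percent) of the calls that equal the target call."""
    count = sum(1 for call in calls if call == target)
    percent = 100.0 * count / len(calls)
    return count, percent

# Hypothetical detection calls for the six T1 replicates of three probe sets.
detection = {
    "probe_A": ["P", "P", "P", "P", "P", "P"],
    "probe_B": ["P", "A", "P", "M", "P", "A"],
    "probe_C": ["A", "A", "A", "A", "A", "A"],
}

summary = {name: count_and_percentage(calls) for name, calls in detection.items()}

# Rank by descending count, then keep the probe sets present in every replicate.
ranked = sorted(summary, key=lambda name: summary[name][0], reverse=True)
present_in_all = [name for name in ranked if summary[name][1] == 100.0]
print(ranked, present_in_all)
```

Changing `target` to a different call, or replacing the 100.0 test with a lower percentage, corresponds to choosing a different call option or a less stringent cutoff in the tutorial.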

Step 4: Saving a Probe List

Save all probe sets with T1 Present-Percent = 100% as a probe list called T1 Present 100%. (See lesson 1, step 6.)

Step 5: Annotating Probe List Members

1. In the data tree, right-click the probe list T1 Present 100% and select Display Selected Probes from the shortcut menu that appears. ⇒ The pivot table displays all the members of the T1 Present 100% probe list.

2. Select all pivot table rows, right-click a pivot table row, then select Annotate Probes from the shortcut menu. ⇒ The Annotate dialog box appears (Figure 15.33).

Figure 15.33 Annotate dialog box

3. Enter Tutorial in the Annotation Type box.

4. In the Annotation box enter: T1: Present count = 6, Percent = 100%.

5. Click OK. ⇒ The probe sets that were called present across all six T1 replicates are annotated.

Lesson 3 Summary

We used the count and percentage analysis to summarize detection calls for the tissues T1, T2 and T3. By sorting the pivot table T1 count column, we were able to identify the most consistent results. This makes it easy to annotate all probe sets that are present in all analyses (or a user-specified percentage of analyses). We saved the probe sets that were present in all six T1 replicates as a probe list and annotated the members of the probe list. In future sessions we can query the annotations (see Chapter 8, Annotations).

Suggested Exercise

Repeat lesson 3 using the count and percentage analysis to identify all genes called absent in all replicates of tissue T1, T2 and T3.

Lesson 4: Evaluating Difference Between Two Tissues

The T-Test and the Mann-Whitney test are statistical tests that enable you to determine the direction and significance of change in a transcript’s expression level between two experimental conditions with one or more replicates. These analyses are very good strategies to use if you are looking for small, consistent changes in expression levels. The use of replicates helps distinguish real change from biological and experimental noise.

The T-Test assumes the expression levels for a given transcript are normally distributed across experiments. The Mann-Whitney test is a rank-based test that makes no assumptions about the data distribution.

DMT computes a p-value for each comparison. The p-value is the probability that a difference in expression level at least as large as the one observed would occur by chance if there were no real difference between the two conditions. A small p-value (for example, 0.01) means it is unlikely (only a one in 100 chance) that such a difference would occur by chance. If the computed p-value is greater than the p-value cutoff, the change call is no change.

In this lesson, we use the Mann-Whitney test to compare the signal replicates for tissues T1 and T2 and determine whether the signal data for these two tissues show a statistically significant difference. The lesson shows how to generate change calls for signal data so we can determine the direction of change and associated p-values to estimate confidence. Lesson 4 includes:

Step 1: Pivoting on Signal
Step 2: Performing a Mann-Whitney Test
Step 3: Annotating Probe Sets
Step 4: Saving a Probe List

Step 1: Pivoting on Signal

1. Clear any entries in the filter grid. To do this, right-click the filter grid and select Clear Query from the shortcut menu that appears.

2. In the data tree, select all absolute analysis replicates for T1 and T2.

3. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.34).

4. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.34).

Figure 15.34 Data Mining Options dialog box, Pivot tab

5. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.

6. Click OK to close the Data Mining Options dialog box.

7. Click the Pivot toolbar button. ⇒ The pivot table displays the signal for each probe set returned by the query (Figure 15.35).

Figure 15.35 Pivot table displaying signal

Step 2: Performing a Mann-Whitney Test

1. Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 15.36).

Figure 15.36 Analysis Function dialog box

2. Enter T1vsT2 in the Column Name box.

3. Select the Mann-Whitney test option.

4. Click Next. ⇒ The column selection dialog box appears (Figure 15.37).

Figure 15.37 Analysis Function dialog box, select analyses for the Mann-Whitney test

5. Select the six replicate T1 signal columns in the Control Columns box (Figure 15.37).

6. Select the six replicate T2 signal columns in the Experiment Columns box (Figure 15.37). The pivot table columns selected in the Control and Experiment Columns lists define the two populations being compared.

7. Click Finish. ⇒ The pivot table displays two columns of T1vsT2-Mann-Whitney test results (Figure 15.38). The pivot table also displays the computed p-value and the direction of change (up, down, or none) for each probe set in the comparison. An Up or Down change direction call is associated with a probe set if the p-value < 0.05. If the p-value is > 0.05, the change direction call is None. An Up call for a transcript indicates the signal is higher in the Experiment group than the Control group. A Down call indicates the signal is lower in the Experiment group compared to the Control group.

8. Right-click the P Value column header and select Sort Ascending from the shortcut menu that appears. ⇒ The pivot table displays the p-values in ascending order (Figure 15.38).

Figure 15.38 Pivot table
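To make the Up, Down and None calls concrete, the following Python sketch runs an exact two-sided rank-sum (Mann-Whitney) test on one probe set. It is a sketch under stated assumptions, not DMT's algorithm: the guide does not say whether DMT uses an exact or approximate p-value or a one-sided or two-sided test, the signal values are invented, and a p-value exactly equal to the cutoff is treated here as None.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks

def mann_whitney_call(control, experiment, cutoff=0.05):
    """Return (call, p_value) from an exact two-sided rank-sum test."""
    pooled = list(control) + list(experiment)
    ranks = midranks(pooled)
    n_exp = len(experiment)
    observed = sum(ranks[len(control):])       # rank sum of the Experiment group
    expected = n_exp * (len(pooled) + 1) / 2   # rank sum expected under no change
    deviation = abs(observed - expected)

    # Enumerate every way of labelling n_exp pooled values as "experiment".
    total = extreme = 0
    for subset in combinations(range(len(pooled)), n_exp):
        total += 1
        if abs(sum(ranks[i] for i in subset) - expected) >= deviation - 1e-9:
            extreme += 1
    p_value = extreme / total

    if p_value >= cutoff:
        return "None", p_value
    return ("Up" if observed > expected else "Down"), p_value

# Hypothetical signals: T1 replicates are the Control, T2 the Experiment.
t1 = [100, 110, 120, 130, 140, 150]
t2 = [200, 210, 220, 230, 240, 250]
call, p = mann_whitney_call(t1, t2)
print(call, round(p, 4))
```

With six replicates per group there are 924 possible labellings, so the smallest two-sided p-value this exact test can produce is 2/924, or about 0.002. This is reached only when the two groups do not overlap at all, as in the example.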

Step 3: Annotating Probe Sets

1. In the pivot table, select probe sets with an Up call and p-value < 0.001.

2. Right-click a selected row and select Annotate Probes from the shortcut menu that appears. ⇒ The Annotate dialog box appears (Figure 15.39).

Figure 15.39 Annotate dialog box

3. Enter or select Tutorial in the Annotation Type box.

4. In the Annotation box, enter Signal higher in T2 than T1 with p < 0.001, then click OK. ⇒ The probe sets that showed a higher expression level in T2 compared to T1 with significance of p-value < 0.001 are annotated.

Step 4: Saving a Probe List

Save all probe sets with an Up direction call as a probe list named T2_T1_MW_T2UP. (See lesson 1, step 6.) You can now inspect or further filter the probe list as in lessons 1 and 2.

Lesson 4 Summary

When replicate analyses are available, the Mann-Whitney test helps determine whether differences in expression levels between two different groups of samples are statistically significant.

The Mann-Whitney test generates change calls (Up, Down, None) based on comparisons of one numeric metric (typically, signal). Lesson 5 shows a more stringent comparison between two sets of replicates using comparison replicates.

Suggested Exercise

Repeat lesson 4 and apply the T-Test to tissue T1 and T2 replicate signal data.

Lesson 5: Evaluating Change Call Consistency

Comparison ranking is a useful method for assessing the consistency of change calls when comparing two data sets that include replicate analyses. It is a ranking strategy that uses the change call from Microarray Suite analysis to perform the ranking. The results are typically more conservative than a standard Mann-Whitney or T-Test. To comparison rank two data sets:

Generate all possible combinations of comparison analyses for the two sets of replicate data in Affymetrix® Microarray Suite.
Pivot the change call result for all of the comparison analyses.
Run a count and percentage analysis of the change call data.
In the pivot table, sort the change call, count and percentage columns in descending order.

This arranges or ranks the probe sets with the highest count and percentage of a call at the top of the pivot table. Those with the lowest count and percentage of the call are displayed at the bottom of the table. In this format, you can conveniently evaluate the consistency of the data and the significance of a change call.

Lesson 5 shows how to comparison rank T1 and T2 change call data. The tutorial database includes comparison analyses for all possible combinations of the T1 and T2 replicates (36 total, generated in Affymetrix® Microarray Suite, see Figure 15.40 and Table 15.1). This lesson includes:

Step 1: Clearing the Filter Grid & Selecting Comparison Analyses
Step 2: Pivoting on Change Call
Step 3: Comparison Ranking
Step 4: Annotating Probe Sets
Step 5: Saving a Probe List

Figure 15.40 DMT_2_Tutorial database, 36 comparison analyses of tissue T1 and T2 replicates

Table 15.1 Comparison analyses of T1 and T2 replicate data (generated in Affymetrix® Microarray Suite)

T1 Replicate   T2 Replicate Analyses (each cell is a comparison analysis)
Analyses       T2_r1           T2_r2           T2_r3           T2_r4           T2_r5           T2_r6

T1_r1          T1_r1 v T2_r1   T1_r1 v T2_r2   T1_r1 v T2_r3   T1_r1 v T2_r4   T1_r1 v T2_r5   T1_r1 v T2_r6
T1_r2          T1_r2 v T2_r1   T1_r2 v T2_r2   T1_r2 v T2_r3   T1_r2 v T2_r4   T1_r2 v T2_r5   T1_r2 v T2_r6
T1_r3          T1_r3 v T2_r1   T1_r3 v T2_r2   T1_r3 v T2_r3   T1_r3 v T2_r4   T1_r3 v T2_r5   T1_r3 v T2_r6
T1_r4          T1_r4 v T2_r1   T1_r4 v T2_r2   T1_r4 v T2_r3   T1_r4 v T2_r4   T1_r4 v T2_r5   T1_r4 v T2_r6
T1_r5          T1_r5 v T2_r1   T1_r5 v T2_r2   T1_r5 v T2_r3   T1_r5 v T2_r4   T1_r5 v T2_r5   T1_r5 v T2_r6
T1_r6          T1_r6 v T2_r1   T1_r6 v T2_r2   T1_r6 v T2_r3   T1_r6 v T2_r4   T1_r6 v T2_r5   T1_r6 v T2_r6
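The 36 comparison analyses in Table 15.1 are every pairing of a T1 replicate with a T2 replicate. As a small illustration in Python (the analysis names follow the table; nothing here is produced by DMT or Microarray Suite), the pairings can be enumerated as follows:

```python
from itertools import product

t1_replicates = [f"T1_r{i}" for i in range(1, 7)]
t2_replicates = [f"T2_r{i}" for i in range(1, 7)]

# Every T1 replicate paired with every T2 replicate: 6 x 6 = 36 comparisons.
comparisons = [f"{t1} v {t2}" for t1, t2 in product(t1_replicates, t2_replicates)]

print(len(comparisons))
print(comparisons[:3])
```

The count grows as the product of the two replicate numbers, so adding a seventh replicate to each tissue would require 49 comparison analyses.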

Step 1: Clearing the Filter Grid & Selecting Comparison Analyses

1. To clear the filter grid, right-click the grid and select Clear Query from the shortcut menu that appears.

2. In the data tree, select all of the comparison analyses for the T1 and T2 replicate data (36 total) (Figure 15.41). (Press and hold the CTRL key while you click the analyses.)

Figure 15.41 Data tree, comparison analyses selected

Step 2: Pivoting on Change Call

1. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.42).

2. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.42).

Figure 15.42 Data Mining Options dialog box, Pivot tab

3. From the list of Relative Expression Data for the Statistical Algorithm, select Change.

4. Click OK to close the Data Mining Options dialog box.

5. Click the Pivot toolbar button. ⇒ The pivot table displays the change call for each probe set in the selected analyses (Figure 15.43).

Figure 15.43 Pivot table displaying change calls

Step 3: Comparison Ranking

1. Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 15.44).

Figure 15.44 Analysis Function dialog box

2. Enter Rank T1vsT2 in the Column Name box.

3. Select the Count & Percentage analysis option, choose the I (increase) change call option and click Next. ⇒ The column selection dialog box displays the pivot table columns available for the Count & Percentage analysis (Figure 15.45).

Figure 15.45 Analysis Function dialog box

4. Select all of the columns (comparison analyses) and click Finish. ⇒ The new pivot table columns Rank T1vsT2-Count and Rank T1vsT2-Percent are generated (Figure 15.46).

Figure 15.46 Pivot table

5. Right-click the Rank T1vsT2-Count column header and select Sort Descending from the shortcut menu that appears. ⇒ The probe sets with the highest count and percentage are arranged, or ranked, at the top of the pivot table (Figure 15.46). Those with the lowest count and percentage (least consistent data) are located at the bottom of the table.

Step 4: Annotating Probe Sets

1. Select the pivot table rows with Rank T1vsT2-Percent = 100%.

Requiring 100% concordance is a very stringent, high-confidence criterion. You can select a lower percentage, depending on your requirements.

2. Right-click a highlighted row and select Annotate Probes from the shortcut menu that appears. ⇒ The Annotate dialog box appears (Figure 15.47).

Figure 15.47 Annotate dialog box

3. Enter T1vT2: Increase with 100% concordance in the Annotation box.

4. Enter or select Tutorial in the Annotation Type box.

5. Click OK.

Step 5: Saving a Probe List

In the pivot table, select the probe sets with Rank T1vsT2-Percent = 100% and save them as a probe list. (See lesson 1, step 6.)

Lesson 5 Summary

The comparison ranking method uses the count and percentage operator to rank the increase or decrease change calls of comparison analyses between two groups of replicate samples. The method enables you to assess the consistency or concordance of change calls between the two groups. In this lesson we identified the genes that show concordance of the increase change call between T1 and T2. We annotated these genes and saved them as a probe list.

Suggested Exercise

Perform a comparison ranking using count and percentage analysis on tissue T1 and T2 Decrease and Marginal Decrease change calls.

Lesson 6: Self Organizing Map (SOM) Cluster Analysis

Cluster analysis groups probe sets with similar gene expression patterns. For example, cluster analysis can help identify transcripts that are increased after a treatment or over a period of time. Clustering can be applied to any numeric output; however, the SOM algorithm is optimized for expression signals and the algorithm defaults are set accordingly [1]. This lesson demonstrates how to:

Compute the average signal values of tissue T1, T2 and T3.
Apply SOM cluster analysis to the average signal values of T1, T2 and T3. (See Appendix D for more information about the SOM algorithm.)
Save a cluster result as a probe list.

Lesson 6 includes:

Step 1: Clearing the Filter Grid & Selecting Analyses
Step 2: Pivoting on Signal
Step 3: Computing Average Signal
Step 4: SOM Cluster Analysis
Step 5: Saving & Annotating a Probe List

1. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96:2907-2912 (1999).

Step 1: Clearing the Filter Grid & Selecting Analyses

1. To clear the filter grid, right-click the grid and select Clear Query from the shortcut menu that appears.

2. In the data tree, select the absolute analysis replicates for tissue T1, T2 and T3 (18 analyses).

3. Click the Options toolbar button. ⇒ The Data Mining Options dialog box appears (Figure 15.48).

4. Click the Pivot tab. ⇒ The absolute and relative expression data available for the pivot table are displayed (Figure 15.48).

Figure 15.48 Data Mining Options dialog box, Pivot tab

5. From the list of Absolute Expression Data for the Statistical Algorithm, select Signal. Verify that all other options are cleared.

6. Click OK to close the Data Mining Options dialog box.

Step 2: Pivoting on Signal

1. To pivot the data, click the Pivot toolbar button. ⇒ The pivot table displays the signal for each probe set in the selected analyses (Figure 15.49).

Figure 15.49 Pivot table

Step 3: Computing Average Signal

1. Select Analyze → Analysis Function from the menu bar. ⇒ The Analysis Function dialog box appears (Figure 15.50).

Figure 15.50 Analysis Function dialog box

2. Enter T1 in the Column Name box.

3. Select the Average operator, then click Next. ⇒ The column selection dialog box appears (Figure 15.51).

Figure 15.51 Analysis Function dialog box

4. Select the six replicate T1 Signal columns (absolute analyses) in the Analysis Function dialog box, then click Finish. ⇒ The new column T1-Average is generated in the pivot table (Figure 15.52).

Figure 15.52 Pivot table

5. Repeat items 1 through 4 in Step 3 for the replicate T2 signal columns (enter T2 in the Column Name box of the Analysis Function dialog box) to generate the T2-Average column in the pivot table.

6. Repeat items 1 through 4 in Step 3 for the replicate T3 signal columns (enter T3 in the Column Name box of the Analysis Function dialog box) to generate the T3-Average column in the pivot table.

Step 4: SOM Cluster Analysis

1. Select Analyze → SOM Clustering from the menu bar. ⇒ The Select Columns for Clustering dialog box appears (Figure 15.53).

Figure 15.53 Select Columns for Clustering dialog box

2. Select the T1-Average, T2-Average and T3-Average columns, then click OK. ⇒ The SOM Clustering dialog box appears (Figure 15.54).

Figure 15.54 SOM Clustering dialog box

The SOM Clustering dialog box contains two sections: SOM Filtering (top) and Parameters (bottom). The filter and parameters settings significantly affect the analysis. Click Defaults to reset the algorithm parameters to the default settings.

SOM Filtering

There are three types of SOM filters: thresholds, row variation and row normalization. The default values are appropriate for most data sets when clustering on signal, but the optimum values may differ depending on the data set and the type of information you want to extract from your data.

Thresholds

This sets the maximum and minimum values for the data set (signal in this example). The minimum and maximum threshold settings control the outliers in the data.

Figure 15.55 shows three expression profiles. In the raw data, the two outliers prevent effective normalization. Normalization is much more effective after filtering. However, filtering removes information from the data set, so filter as little as possible to obtain the optimum results.

Figure 15.55 Example expression profiles showing effects of normalization and filtering
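One common way to apply minimum and maximum thresholds is to clamp each value into the allowed range. The guide does not state whether DMT clamps out-of-range values or handles them differently, so the following Python sketch is an assumption. The function name and the threshold values 20 and 20000 are hypothetical and are not DMT defaults.

```python
def apply_thresholds(profile, minimum, maximum):
    """Clamp every value in the profile into the range [minimum, maximum]."""
    return [min(max(value, minimum), maximum) for value in profile]

# Hypothetical profile with one very low and one very high outlier.
raw = [-15.0, 250.0, 48000.0]
clamped = apply_thresholds(raw, minimum=20.0, maximum=20000.0)
print(clamped)
```

Clamping keeps the probe set in the analysis but discards how far the outliers lay beyond the thresholds, which is one way in which filtering removes information from the data set.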

Row Variation

When comparing expression patterns between different biological conditions, most genes do not significantly change expression level and are uninformative. Keeping uninformative genes in the data set in effect forms a single, large cluster that may affect our ability to cluster the expression patterns that do change. The row variation filters define the genes that are considered changed.

The Max/Min setting defines the minimum expression ratio value (maximum expression level/minimum expression level) a probe set must have across all experiments to be included in the cluster analysis. The Max/Min setting is very useful at moderate to high expression levels, but is subject to noise at low expression levels. For example, in Figure 15.56, the max/min ratio (b) for the bottom profile is much higher than max/min ratio (a) for the top profile. We need a second parameter to filter for changes at low expression levels.

Figure 15.56 Max/Min

The Max-Min setting can distinguish between changes in low expression levels and high expression levels, and can be used to filter out noise. As Figure 15.57 shows, the Max-Min value can be set to eliminate the bottom profile if desired. Take care to filter out noisy, low-level expression, because this noise is amplified by normalization (Figure 15.57).

To see how many probe sets remain after filtering, click Compute in the SOM Clustering dialog box. The Max-Min value sets a threshold for the absolute numerical difference of the clustering values. For example, if the Max/Min is set at three, changes of 30/10 and 300/100 both pass the ratio filter and will be included in the cluster analysis. By also setting the Max-Min to 100, we eliminate the inherently noisy, low numerical change values such as 30/10 (a difference of only 20), while keeping 300/100 (a difference of 200).

Figure 15.57 Max-min filter setting
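The two row variation filters can be written as a pair of comparisons. The Python sketch below is an illustration, not DMT code. It assumes both filters are inclusive (a profile exactly at the limit passes), which matches the guide's 30/10 example for Max/Min but is an assumption for Max-Min. The function name is hypothetical.

```python
def passes_row_variation(profile, max_over_min=3.0, max_minus_min=100.0):
    """Return True if the profile passes both the Max/Min and Max-Min filters."""
    low, high = min(profile), max(profile)
    if low <= 0:
        # The ratio is undefined; this sketch assumes thresholds were applied first.
        return False
    return high / low >= max_over_min and high - low >= max_minus_min

# Both profiles have a Max/Min ratio of 3, but only one changes by 100 or more.
print(passes_row_variation([10.0, 30.0]))      # ratio 3, difference 20
print(passes_row_variation([100.0, 300.0]))    # ratio 3, difference 200
print(passes_row_variation([1000.0, 1100.0]))  # ratio 1.1, difference 100
```

The first profile fails on Max-Min, the third fails on Max/Min, and only the second passes both filters.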

Row Normalization

Normalization is a technique that helps to answer the question: What probe sets have similar expression patterns? For example, we may be interested in finding all genes that increase expression under certain experimental conditions regardless of the actual level of increase. Asking the question in this way allows us to find small as well as large changes in expression levels. It also makes the technique less sensitive to experimental variation in the absolute expression levels, such as difficulty normalizing between controls and experiments.

Using filtering and normalization usually reduces the number of clusters in a data set because we ignore the actual expression levels. For example, Figure 15.58 shows three different transcripts that are expressed at very different levels. Without normalization, the three probe sets may group into three different clusters according to their absolute expression levels. However, after normalization, it is clear that their relative expression levels are the same and they should cluster together.

Figure 15.58 Raw and normalized expression profiles
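The effect shown in Figure 15.58 can be reproduced with a simple row normalization. The guide does not specify the formula DMT uses, so the Python sketch below assumes one common choice: shift each row to a mean of zero and scale it to a standard deviation of one. The signal values are invented.

```python
import statistics

def normalize_row(profile):
    """Scale a profile to mean 0 and standard deviation 1 (assumed method)."""
    mean = statistics.mean(profile)
    spread = statistics.stdev(profile)
    if spread == 0:
        return [0.0 for _ in profile]   # a flat profile has no shape to keep
    return [(value - mean) / spread for value in profile]

# Three hypothetical transcripts with the same shape at very different levels.
low = normalize_row([10.0, 20.0, 30.0])
mid = normalize_row([100.0, 200.0, 300.0])
high = normalize_row([1000.0, 2000.0, 3000.0])
print(low, mid, high)
```

After normalization the three rows are identical, so a clustering algorithm that compares profile shapes would place them in the same cluster.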

Order of Filtering

The Up and Down buttons in the SOM Clustering dialog box can be used to change the order in which the filters are applied to the data. Be very careful if you intend to change the default order. In particular, the row normalization filter changes the values of the data to which the filters are applied and will change the filter functions significantly.

Parameters

The rows and columns parameters should be considered carefully. These settings specify the grid of centroids or nodes that is applied to the data. In general, try to keep the grid square or almost square to ensure good coverage of the whole data set. For the same reason, it also helps, but is not imperative, to make one of the settings (row or column) an odd number. The exact number of clusters (rows x columns) you select depends on the size and complexity of the data set, and the type of analysis you want to perform. The default of 18 clusters (6 rows x 3 columns) is a good place to start. If you find that the analysis generates empty clusters (clusters with zero members), reduce the number of clusters.

Reducing the cluster number increases the variability of the shapes of curves grouped together in a cluster. This is indicated by an increase in the distance between the two (blue) error bars. Generating a small number of clusters summarizes the data, but may obscure rare, interesting patterns.

Increasing the cluster number reduces the variability of the patterns that are grouped together. This is indicated by a decrease in the distance between the two (blue) error bars. If the cluster number is increased too much, the algorithm generates clusters with no members and many clusters will look the same. The optimum number of clusters for a particular data set displays the narrowest possible error bars with the lowest number of empty clusters.

The remaining parameters affect the functioning of the cluster program and are intended for expert users. Do not change these parameters unless you understand their function and the effect of changing them on your data.

3. In the SOM Clustering dialog box, click the Add> button for the Thresholds, Row Variation and Row Normalization variables. ⇒ The default values for these algorithm variables are displayed in the box in the upper right corner (Figure 15.59).

Figure 15.59 SOM Clustering dialog box

4. Enter 3 rows and 2 columns in the Parameters section.

Other values may be more appropriate for your data. These are suggested values for this data set.

5. Click Run. ⇒ The SOM algorithm generates 6 clusters (3 rows x 2 columns specified in the SOM parameters) (Figure 15.60).

Your results may not be identical to the clusters in Figure 15.60. Run-to-run cluster results may vary slightly because the nodes are randomly initialized (see Appendix D).

Figure 15.60 SOM cluster results

To expand the cluster graph view, right-click the graph and select Expand Graph from the shortcut menu that appears.

Figure 15.60 shows the results of clustering the mean signal values for three tissues (six replicates each). T1 is the first point, T2 is the second point and T3 is the third point.

Step 5: Saving & Annotating a Probe List

After genes of interest are identified, we can save them as a probe list.

1. Click Cluster 3. ⇒ The probe sets in cluster 3 are displayed at the right in the Probes box (Figure 15.60).

2. Enter the name Cluster 3 in the Probe List Name box.

3. Click Save Selected. ⇒ A probe list is generated that includes the probe sets in cluster 3 and the probe list is displayed in the data tree.

4. Annotate the probe list members (see lesson 3, step 5). In future sessions the annotations may be queried and sorted (see Chapter 8, Annotations).

Lesson 6 Summary

SOM cluster analysis identifies gene expression patterns in the data. The threshold and row variation filters help focus the analysis on probe sets that have the same expression pattern. The cluster results display patterns of gene expression rather than absolute expression levels because the Row Normalization filter normalizes the signal data to a mean of zero and a variance of one (see Appendix D). Adjusting the number of nodes or centroids (rows x columns) affects the cluster number and the variability of expression patterns grouped together in a cluster. The optimum number of clusters for a particular data set displays the narrowest possible error bars with the lowest number of empty or similar clusters.

We computed the average signal for the T1, T2 and T3 replicates. We applied SOM cluster analysis to the average signal values. The SOM cluster results organize the expression data into groups of genes with similar expression patterns.

Appendix A Filter Grid

This Appendix explains the column headings in the filter grid for both GeneChip® data mode and spot data mode.

GeneChip Data Mode

The filter grid includes expression metrics generated by the Statistical Expression algorithm (in Microarray Suite 5.0) or the Empirical Expression algorithm (in Microarray Suite 4.0 or lower).

Statistical Expression Algorithm

Probe Set Name: Identifier for the probe set on a GeneChip® probe array.
Signal: A measure of the abundance of a transcript.
Detection: The call that indicates whether the transcript was present (P), absent (A), marginal (M), or no call (NC).
Detection p-value: p-value that indicates the significance of the detection call.
Stat Pairs: The number of probe pairs for a particular probe set on the array.
Stat Pairs Used: Stat Pairs Used = Pairs - Masked probe pairs - Saturated MM probe pairs. This is the number of pairs used by the Statistical Expression algorithm to make the detection call in an absolute analysis.
Signal Log Ratio: The change in expression level for a transcript between a baseline and an experiment array. This change is expressed as the log2 ratio.
Signal Log Ratio Low: The lower limit of the log ratio within a 95% confidence interval.
Signal Log Ratio High: The upper limit of the log ratio within a 95% confidence interval.
Change: The call that indicates the change in the transcript level between a baseline and an experiment array.
Change p-value: p-value that indicates the significance of the change call.
Stat Common Pairs: The intersection of the probe pairs from the baseline and experiment that are used by the Statistical Expression algorithm to make the change call in a comparison analysis.

Empirical Expression Algorithm

Probe Set Name: Identifier for the probe set on a GeneChip® probe array.
Positive: The number of probe pairs scored positive. A probe pair is positive if:

PM - MM > Statistical Difference Threshold (SDT) and PM/MM > Statistical Ratio Threshold (SRT)

where PM = perfect match intensity and MM = mismatch intensity. The SDT is a function of the noise (Q) and is calculated as:

SDT = Q * SDTmultiplier

The SDTmultiplier and the SRT are user-modifiable parameters (see Affymetrix® Microarray Suite User’s Guide). The SDTmultiplier is set at 2.0 for the standard staining protocol or 4.0 for the antibody amplification protocol. (Refer to the Affymetrix Expression Analysis Technical Manual.)

The default SRT value is 1.5. Note: Increasing the SDTmultiplier and SRT increases analysis stringency; reducing these thresholds decreases analysis stringency.

Negative: The number of probe pairs scored negative. A probe pair is negative if: MM - PM > SDT and MM/PM > SRT
Pairs: Number of probe pairs for a particular probe set on a GeneChip® probe array.
Pairs Used: Number of probe pairs per probe set used in the analysis (Empirical Expression algorithm). This may be the total number of probes per probe set on the probe array or the number of probe pairs in a pre-designated subset (for example, probe pairs specified by a probe mask file and/or a masked image). Pairs Used = total probe pairs per probe set – (probe pairs masked in a mask file) – (probe pairs masked in the image).
Pairs in Average: A trimmed probe set that excludes probes with extremely intense or weak signal from the analysis. If 8 or fewer probe pairs are used, Pairs in Avg = Pairs Used (or the number of probe pairs per probe set minus any that are masked). Superscoring is performed if more than 8 probe pairs are used. Superscoring is a process that excludes probe pairs from calculation of the Avg Diff and Log Avg Ratio if they are outside a given intensity range. Microarray Suite software calculates the mean and standard deviation of the intensity differences (PM – MM) for an entire probe set (excluding the highest and lowest values). Those values outside of a set number of standard deviations (STP) are not included in the calculation of the Avg Diff or Log Avg Ratio. The STP is a user-modifiable parameter with a default value = 3.

Log Avg (Log Avg Ratio): Describes the hybridization performance of a probe set and is determined by calculating the ratio of the PM/MM intensities for each probe pair in a probe set, taking the logs of the resulting values and averaging them for the probe set:

Log Avg = 10 x [Σ log (PM/MM)] / Pairs in Avg

Note: Log Avg = 0 indicates random cross hybridization. The higher the Log Avg, the greater the confidence that the gene transcript is present.
Average Difference: Serves as a relative indicator of the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments. The Avg Diff is calculated by taking the difference between the PM and MM of every probe pair (excluding the probe pairs where PM – MM is outside the STP standard deviation of the mean of PM – MM) in a probe set and averaging the differences for the entire probe set:

Avg Diff = Σ (PM – MM) / Pairs in Avg

Note: Avg Diff cannot be used to compare the hybridization intensity levels of two different probe sets on the same array.
Absolute Call: Each transcript in an absolute analysis has three possible Absolute Call outcomes: Present (P), Absent (A), or Marginal (M). The absolute call is derived from the Pos/Neg, Positive Fraction and Log Avg absolute call metrics. Each absolute call metric is weighted and entered into a decision matrix to determine the status of the transcript.

Increase (Inc): Number of probe pairs that increased. A probe pair is considered to increase if the intensity difference between the PM and MM probe cells in the experimental sample is significantly higher than in the baseline sample. Two criteria must be met for a probe pair to show a significant increase:

(PM – MM)exp – (PM – MM)baseline > Change Threshold (CT) and

[(PM – MM)exp – (PM – MM)baseline] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Affymetrix Microarray Suite computes the Change Threshold (CT) using the Statistical Difference Threshold of the experiment and baseline data. Alternatively, you can specify a value for the CT multiplier, which is multiplied by the noise of the baseline or experiment data (whichever is greater) to define CT. Percent Change Threshold is a user-specified value (default = 80).
Decrease (Dec): Number of probe pairs that decreased. A probe pair is considered to decrease if the intensity difference between the PM and MM probe cells in the experimental sample is significantly lower than in the baseline sample. Two criteria must be met for a probe pair to show a significant decrease:

(PM – MM) baseline – (PM – MM) exp > Change Threshold (CT) and

[(PM – MM)baseline – (PM – MM)exp] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Increase Ratio: For each transcript: # Increased probe pairs / # probe Pairs Used
Decrease Ratio: For each transcript: # Decreased probe pairs / # probe Pairs Used

Positive Change: # Positive probe pairs (exp) - # Positive probe pairs (baseline)
Negative Change: # Negative probe pairs (exp) - # Negative probe pairs (baseline)
Difference Positive - Difference Negative Ratio (DPos - DNeg Ratio): (Positive Change – Negative Change) / # probe Pairs Used. The DPos – DNeg Ratio and Log Avg Ratio Change are usually positive when a transcript changes from a very low to a relatively high expression level and are typically negative when the expression level changes from a high to a very low or undetectable level. Both metrics may have values close to zero if the transcript is present in both the baseline and experimental samples despite an increase or decrease in the level of the transcript.
Log Avg Ratio Change: The difference between the Log Avg Ratio of the baseline and experimental probe array data (in a comparison analysis) for each transcript. The Log Avg Ratios are recomputed for each probe set based on probe pairs used in both the baseline and experimental probe arrays (the recomputed values are not displayed by DMT).

Log Avg Ratio Change = Log Avg (exp) – Log Avg (base)

Difference Call: Each transcript in a comparison analysis has five possible Difference Call outcomes: (1) Increase (I), (2) Marginally Increase (MI), (3) Decrease (D), (4) Marginally Decrease (MD), and (5) No Change (NC). The difference call is derived from the comparison metrics: Max [Increase/Total, Decrease/Total], Increase/Decrease Ratio, Log Average Ratio Change, and Dpos – Dneg Ratio. Each comparison metric is weighted and entered into a decision matrix to determine the status of the transcript (see Affymetrix Microarray Suite User’s Guide).

Average Difference Change: Avg Diff (experiment) - Avg Diff (baseline)

Fold Change: Indicates the relative change in the expression levels between the experiment and baseline targets. The Fold Change (FC) for a transcript is a positive number when the expression level in the experiment increases compared to the baseline and is a negative number when the expression level in the experiment declines.

\[
FC = \frac{\mathrm{AvgDiffChange}}{\max\left[\min\left(\mathrm{AvgDiff}_{exp},\ \mathrm{AvgDiff}_{base}\right),\ Q_M \times Q_C\right]} +
\begin{cases}
+1 & \text{if } \mathrm{AvgDiff}_{exp} > \mathrm{AvgDiff}_{base} \\
-1 & \text{if } \mathrm{AvgDiff}_{exp} < \mathrm{AvgDiff}_{base}
\end{cases}
\]

where:
Q_C = max(Q_exp, Q_base)
Q_M = 2.1 for a 50 µm feature, Q_M = 2.8 for a 24 µm or smaller feature

Microarray Suite recomputes the normalized or scaled Avg Diff values in both the experimental and baseline data sets to include only probe pairs used in both the baseline and experiment arrays. This recomputation is not done in the DMT calculation. If the noise (Q) of the experiment or baseline is greater than the Avg Diff of the transcript (baseline or experiment data), the Fold Change is calculated over the noise and is an approximation (a tilde character (~) precedes the approximated Fold Change value in the *.chp file).
Sort Score: A ranking based on the Fold Change and the Avg Diff Change. The higher the Fold Change and the Avg Diff Change, the higher the Sort Score.
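The Fold Change and Increase definitions in this appendix can be illustrated with a short Python sketch. It is not the Microarray Suite or DMT code. The function names, the sample numbers, the handling of equal Avg Diff values, and the reading of "approximate" as "the noise term replaced the Avg Diff in the denominator" are all assumptions made for illustration.

```python
def fold_change(avg_diff_exp, avg_diff_base, q_exp, q_base, q_m=2.1):
    """Return (fold change, approximate?) following the Fold Change formula.

    q_m defaults to 2.1, the value given for a 50 um feature.
    """
    q_c = max(q_exp, q_base)
    noise_floor = q_m * q_c
    smaller = min(avg_diff_exp, avg_diff_base)
    denominator = max(smaller, noise_floor)
    change = avg_diff_exp - avg_diff_base  # Avg Diff Change
    if avg_diff_exp > avg_diff_base:
        sign = 1.0
    elif avg_diff_exp < avg_diff_base:
        sign = -1.0
    else:
        sign = 0.0  # equal values are not covered by the guide; assumption
    # Assumption: the value is "approximate" when the noise floor was used
    approximate = smaller < noise_floor
    return change / denominator + sign, approximate


def probe_pair_increased(diff_exp, diff_base, change_threshold, q,
                         percent_change_threshold=80.0):
    """Apply both Increase criteria to one probe pair.

    diff_exp and diff_base are the (PM - MM) values in the experiment and
    baseline; which noise value q is used is an assumption of this sketch.
    """
    delta = diff_exp - diff_base
    scale = max(q / 2.0, min(abs(diff_exp), abs(diff_base)))
    return (delta > change_threshold
            and delta / scale > percent_change_threshold / 100.0)
```

With hypothetical values, `fold_change(300, 100, 5, 5)` gives a fold change of 3.0 and `fold_change(100, 300, 5, 5)` gives -3.0. The Decrease test is the same check with the baseline and experiment values swapped.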

Spot Data Mode

Spot: Identifier for the spotted probe.
Intensity: Background-subtracted spot intensity.
Standard deviation (SD): Standard deviation of the intensity signal.
Pixels: Number of pixels in the image data file (*.tif) used to calculate the intensity for a spot.
Background: Background calculated for a spot.
Background SD: Standard deviation of the spot background.
Ratio: If channel 1 > channel 2, ratio = - channel 1/channel 2, otherwise ratio = channel 2/channel 1.

Appendix B Working with Windows & Tables

The windows and tables found in Affymetrix® Data Mining Tool can be modified to suit the individual needs of the user or data. This appendix explains the options available.

Query Windowpanes

Expanding a Windowpane

1. To expand the results pane (or the graph pane), right-click the pane and select Expand Results (or Expand Graph) from the shortcut menu. Alternatively, select View → Expand Results (or View → Expand Graph) from the main menu. ⇒ The results (or graph) pane is enlarged and the graph (or results) pane is hidden.

2. Repeat step 1 to return the pane to its original size.

Resizing a Windowpane

You may resize a windowpane using the click-and-drag method to move a border.

1. Place the mouse pointer over a border so that it changes from a single arrow to a double arrow.

2. Use the click-and-drag method to move a border in the horizontal or vertical direction and resize the windowpane.


Clearing the Results or Graph Pane

Right-click the results (or graph) pane and select Clear Results (or Clear Graph) from the shortcut menu. Alternatively, select Edit → Clear Results (or Edit → Clear Graphs) from the main menu. ⇒ When the graphs are cleared, all graphs are removed and the graph pane is hidden.

Tables

Selecting the Entire Table

Click the upper left corner of a table. ⇒ All rows in the table are selected (Figure B.1).

Figure B.1 Query table

Selecting Rows

To select adjacent rows, press and hold the SHIFT key while you click the first and last row in the selection. To select non-adjacent rows, press and hold the CTRL key while you click the rows.

Resizing Columns

1. Place the mouse pointer over the border of a column header. ⇒ The mouse pointer changes from a single arrow to a double arrow.

Figure B.2 Query table, adjusting width of Analysis Name column

2. Use the click-and-drag method to adjust the column width.

Hiding Columns

1. Right-click an analysis column header in the experiment information table or a metric column header in the query or pivot table (Figure B.3).

Figure B.3 Pivot table, shortcut menu of column commands

2. Select Hide Column from the shortcut menu. ⇒ The selected column is hidden.

3. To show hidden columns, right-click an analysis column header in the experiment information table or a metric column header in the query or pivot table, then select Show All Columns from the shortcut menu (Figure B.3).

Reordering Columns

1. Click a column header and use the click-and-drag method to move the column.

2. In the pivot table, click a column header (analysis) and use the click-and-drag method to reorder the column and its subordinate results columns.

3. In the pivot table, click a subordinate column header (results), then use the click-and-drag method to reorder the results column within the analysis.

DMT retains the column order of the results table in saved queries and as a user preference. If you open a previously saved query, DMT: 1) displays the results tables using the saved column order, and 2) unhides any hidden columns of results data. If you create a new query, DMT applies the column settings used in the previous session.

Appendix C Query Table Data

After running a query, results are presented in the Query Table. This appendix defines the column headings and explains the information found there, for both GeneChip Data Mode and Spot Data Mode.

GeneChip® Data Mode

Statistical Expression Algorithm Metrics

Probe Set Name: Identifier for the probe set on a GeneChip® probe array.
Signal: A measure of the abundance of a transcript.
Detection: The call that indicates if the transcript was detected (P) or undetected (A).
Detection p-value: p-value that indicates the significance of the detection call.
Stat Pairs: The number of probe pairs for a particular probe set on the array.
Stat Pairs Used: Stat Pairs Used = Pairs - Masked probe pairs - Saturated MM probe pairs. This is the number of pairs used by the Statistical Expression algorithm to make the detection call in an absolute analysis.
Signal Log Ratio: The change in expression level for a transcript between a baseline and an experiment array. This change is expressed as the log2 ratio.
Signal Log Ratio Low: The lower limit of the log ratio within a 95% confidence interval.
Signal Log Ratio High: The upper limit of the log ratio within a 95% confidence interval.
Change: The call that indicates the change in the transcript level between a baseline and an experiment array.
Change p-value: p-value that indicates the significance of the change call.
Stat Common Pairs: The intersection of the probe pairs from the baseline and experiment that are used by the Statistical Expression algorithm to make the change call in a comparison analysis.

Empirical Expression Algorithm Metrics

Analysis Name: Name of the experiment entered during experiment set up.
Probe Set Name: Identifier for the probe set on the array.
Positive: Number of probe pairs scored positive. A probe pair is called positive if the intensity of the PM probe cell is significantly greater than that of the corresponding MM probe cell. To evaluate intensity, the Empirical Expression algorithm calculates the ratio and difference associated with each probe pair and compares these values to the Statistical Difference Threshold (SDT) and the Statistical Ratio Threshold (SRT). A probe pair is positive if: PM - MM > SDT and PM/MM > SRT.
Negative: Number of probe pairs scored negative. A probe pair is called negative if the intensity of the MM probe cell is significantly greater than that of the corresponding PM probe cell. To evaluate intensity, the expression algorithm calculates the ratio and difference associated with each probe pair and compares these values to the Statistical Difference Threshold (SDT) and the Statistical Ratio Threshold (SRT). A probe pair is negative if: MM - PM > SDT and MM/PM > SRT. (See Affymetrix® Microarray Suite User’s Guide for further information.)
Pairs: Number of probe pairs for a particular probe set on the probe array.

Pairs Used: Number of probe pairs per probe set used in the analysis. This may be the total number of probes per probe set on the probe array or the number of probe pairs in a pre-designated subset (for example, probe pairs specified by a probe mask file or a masked image). Pairs Used = total probe pairs per probe set - (probe pairs masked in a mask file) - (probe pairs masked in the image)
Pairs in Avg: A trimmed probe set that excludes probes with extremely intense or weak signal from the analysis. If 8 or fewer probe pairs are used, Pairs in Avg = Pairs Used (or the number of probe pairs per probe set minus any that are masked). Superscoring is performed if more than 8 probe pairs are used. Superscoring is a process that excludes probe pairs from calculation of the Avg Diff and Log Avg Ratio if they are outside a given intensity range. Microarray Suite calculates the mean and standard deviation of the intensity differences (PM – MM) for an entire probe set (excluding the highest and lowest values). Those values outside of a set number of standard deviations (STP) are not included in the calculation of the Avg Diff or Log Avg Ratio. The STP is a user-modifiable parameter with a default value = 3.
Pos Fraction: Number of positive probe pairs divided by the number of probe pairs used.
Log Avg: Describes the hybridization performance of a probe set and is determined by calculating the ratio of the PM/MM intensities for each probe pair in a probe set, taking the logs of the resulting values, and averaging them for the probe set:

Log Avg = 10 x [Σ log (PM/MM)] / Pairs in Avg

Pos/Neg: Ratio of positive probe pairs to negative probe pairs in a probe set (# Positive probe pairs/# Negative probe pairs).

Avg Diff: This parameter serves as a relative indicator of the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments. Note: Avg Diff cannot be used to compare the hybridization intensity levels of two different probe sets on the same array. Avg Diff is calculated by taking the difference between the PM and MM of every probe pair (excluding the probe pairs where PM – MM is outside the STP standard deviation of the mean of PM – MM) in a probe set and averaging the differences for the entire probe set:

Avg Diff = Σ (PM – MM) / Pairs in Avg

Norm Avg Diff: Avg Diff x Normalization Factor. DMT computes the normalization factor (NF) using all probe sets on the array in an analysis, then applies any specified filters. All intensities in an analysis are multiplied by the NF.
Absolute Call: Each transcript in an absolute analysis has three possible Absolute Call outcomes: Present (P), Absent (A), or Marginal (M). The absolute call is derived from the Pos/Neg, Positive Fraction, and Log Avg absolute call metrics. Each absolute call metric is weighted and entered into a decision matrix to determine the status of the transcript. (See Affymetrix® Microarray Suite User’s Guide for further information.)

Increase (Inc): Number of probe pairs that increased. A probe pair is considered to increase if the intensity difference between the PM and MM probe cells in the experimental sample is significantly higher than in the baseline sample. Two criteria must be met for a probe pair to show a significant increase:

(PM – MM)exp – (PM – MM)baseline > Change Threshold (CT) and

[(PM – MM)exp – (PM – MM)baseline] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

Affymetrix Microarray Suite computes the Change Threshold (CT) using the Statistical Difference Threshold of the experiment and baseline data. Alternatively, you can specify a value for the CT multiplier, which is multiplied by the noise of the baseline or experiment data (whichever is greater) to define CT. Percent Change Threshold is a user-specified value (default = 80).
Decrease (Dec): Number of probe pairs that decreased. A probe pair is considered to decrease if the intensity difference between the PM and MM probe cells in the experimental sample is significantly lower than in the baseline sample. Two criteria must be met for a probe pair to show a significant decrease:

(PM – MM) baseline – (PM – MM) exp > Change Threshold (CT), and

[(PM – MM)baseline – (PM – MM)exp] / max [Q/2, min(|PM – MM|exp, |PM – MM|baseline)] > Percent Change Threshold/100

(See Affymetrix® Microarray Suite User’s Guide for further information.)
Inc Ratio: For each transcript: # Increased probe pairs / # probe Pairs Used

Dec Ratio: For each transcript: # Decreased probe pairs / # probe Pairs Used
Pos Change: # Positive probe pairs (experiment) - # Positive probe pairs (baseline)
Neg Change: # Negative probe pairs (experiment) - # Negative probe pairs (baseline)
Inc/Dec: For each transcript: # Increased probe pairs / # Decreased probe pairs
Dpos-Dneg Ratio: (Positive Change – Negative Change) / # probe Pairs Used. The Dpos – Dneg Ratio and Log Avg Ratio Change are usually positive when a transcript changes from a very low to a relatively high expression level and are typically negative when the expression level changes from a high to a very low or undetectable level. Both metrics may have values close to zero if the transcript is present in both the baseline and experimental samples despite an increase or decrease in the level of the transcript.
Log Avg Ratio Change: The difference between the Log Avg Ratio of the baseline and experimental probe array data (in a comparison analysis) for each transcript. The Log Avg Ratios are recomputed for each probe set based on probe pairs used in both the baseline and experimental probe arrays (the recomputed values are not displayed by DMT).

Log Avg Ratio Change = Log Avg (exp) – Log Avg (base)

Diff Call: Each transcript in a comparison analysis has five possible Difference Call outcomes: (1) Increase (I), (2) Marginally Increase (MI), (3) Decrease (D), (4) Marginally Decrease (MD), and (5) No Change (NC). The difference call is derived from the comparison metrics: Max [Increase/Total, Decrease/Total], Increase/Decrease Ratio, Log Average Ratio Change, and Dpos – Dneg Ratio. Each comparison metric is weighted and entered into a decision matrix to determine the status of the transcript. (See Affymetrix® Microarray Suite User’s Guide for further information.)

Avg Diff Change: Serves as a relative indicator of the change in the level of expression of a transcript. It is used to determine the change in the hybridization intensity of a given probe set between two different experiments. The Avg Diff Change is calculated as:

Avg Diff Change = Avg Diff (exp) – Avg Diff (baseline)

B=A (Baseline = Absent): An asterisk (*) in this column indicates the transcript is called absent (A) in the baseline.

Fold Change: The Fold Change indicates the relative change in the expression levels between the experiment and baseline targets. The Fold Change for a transcript is a positive number when the expression level in the experiment increases compared to the baseline and is a negative number when the expression level in the experiment declines. The Fold Change (FC) is calculated as:

\[
FC = \frac{\mathrm{AvgDiffChange}}{\max\left[\min\left(\mathrm{AvgDiff}_{exp},\ \mathrm{AvgDiff}_{base}\right),\ Q_M \times Q_C\right]} +
\begin{cases}
+1 & \text{if } \mathrm{AvgDiff}_{exp} > \mathrm{AvgDiff}_{base} \\
-1 & \text{if } \mathrm{AvgDiff}_{exp} < \mathrm{AvgDiff}_{base}
\end{cases}
\]

(See Affymetrix® Microarray Suite User’s Guide for further information.)
Approx: If the noise (Q) of the experiment or baseline array is greater than the Avg Diff of the transcript (the baseline or experimental data), the Fold Change is calculated over the noise and is an approximation (a tilde character (~) precedes the approximated Fold Change value in the *.chp file).
Sort Score: The Sort Score is a ranking based on the Fold Change and the Avg Diff Change. The higher the Fold Change and the Avg Diff Change, the higher the Sort Score.
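The absolute-analysis metrics defined earlier in this appendix (Positive, Negative, Pos Fraction, Pairs in Avg, Log Avg and Avg Diff) can be sketched as follows. This is an illustration under stated assumptions, not the Microarray Suite implementation: the function name is hypothetical, no probe pairs are masked, intensities are assumed to be greater than zero, "log" is read as log10, and superscoring uses a population standard deviation computed after dropping the single highest and lowest differences.

```python
import math

def absolute_metrics(pm, mm, q, sdt_multiplier=2.0, srt=1.5, stp=3.0):
    """Compute Empirical Expression style metrics for one probe set."""
    sdt = q * sdt_multiplier  # SDT = Q * SDTmultiplier
    pairs = list(zip(pm, mm))
    positive = sum(1 for p, m in pairs if p - m > sdt and p / m > srt)
    negative = sum(1 for p, m in pairs if m - p > sdt and m / p > srt)

    used = pairs
    if len(pairs) > 8:  # superscoring applies only above 8 probe pairs
        diffs = sorted(p - m for p, m in pairs)
        trimmed = diffs[1:-1]  # exclude the highest and lowest values
        mean = sum(trimmed) / len(trimmed)
        sd = math.sqrt(sum((d - mean) ** 2 for d in trimmed) / len(trimmed))
        used = [(p, m) for p, m in pairs if abs((p - m) - mean) <= stp * sd]

    pairs_in_avg = len(used)
    return {
        "positive": positive,
        "negative": negative,
        "pos_fraction": positive / len(pairs),
        "pairs_in_avg": pairs_in_avg,
        "avg_diff": sum(p - m for p, m in used) / pairs_in_avg,
        "log_avg": 10 * sum(math.log10(p / m) for p, m in used) / pairs_in_avg,
    }

# Hypothetical probe set with four probe pairs and noise Q = 10
result = absolute_metrics([200, 220, 180, 50], [100, 100, 100, 100], q=10)
```

For this hypothetical probe set, three pairs score positive, one scores negative, and the Avg Diff is 62.5.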

Spot Data Mode

Analysis name: Name of the experiment associated with an intensity results file (*.spt)
Spot: Identifier for the spotted probe
Intensity: Background-subtracted intensity for the selected spot
Standard Deviation: Standard deviation of the spot intensity
Pixels: Number of pixels in the image data file (*.tif) used to calculate the channel signal (intensity)
Background: Background calculated for the spot
Background SD: Standard deviation of the background for a spot
Ratio: Ratio of channel 1/channel 2 intensity data

Appendix D DMT Algorithms

This appendix provides further information on the three algorithms used in Affymetrix® Data Mining Tool: the SOM clustering algorithm, Correlation Coefficient clustering algorithm and the Matrix algorithm.

The SOM Algorithm

The self organizing map (SOM) algorithm applies cluster analysis to GeneChip® metric data to help identify gene expression patterns. The algorithm considers the expression levels of n probe sets (or the intensities of n probes) in k experiments as n points in k-dimensional space. Initially, the algorithm randomly places a grid of nodes or centroids onto the k-dimensional space. The rows and columns of nodes determine the number of clusters identified by the algorithm (rows x columns = number of clusters). Figure D.1 shows a 3 x 2 arrangement of nodes that can identify six clusters of gene expression patterns.

Figure D.1 3 x 2 arrangement of nodes


The user specifies the rows and columns of nodes as well as the initial placement of the nodes (initialization). The random vectors method randomly places the nodes in k-dimensional space. The random datapoints method places the nodes on randomly-selected points. Next, the algorithm iteratively adjusts the node positions toward clusters of points. At each iteration, it selects a data point (P) and moves (updates) the node closest to P (the target node, Np) toward P. (The data points are randomly ordered for selection and recycled as needed through the iterations.) Other nodes may also move toward P, depending on their distance from Np, the type of neighborhood selected (discussed in the following section) and time (iteration). The algorithm updates a node using the formula:

\[
f_{i+1}(N) = f_i(N) + \alpha\big(d(N, N_p),\ i\big)\,\big(P - f_i(N)\big)
\]

where:
N = the node being updated
P = the data point being considered
f_i(N) = the position of N at iteration i
Np = the target node (the node closest to P)
α = distance N moves toward P in iteration i (learning rate), which is a function of:

❥ d(N, Np), the distance between N (the node being considered) and Np (the target node) in two-dimensional space
❥ i, iteration

P - f_i(N) = distance between P and N in k-dimensional space
T = maximum number of iterations

Neighborhood

The neighborhood describes an area around the target node, Np. At each iteration, Np and all nodes in the neighborhood move toward P, the point being considered. There are two types of neighborhoods: bubble or Gaussian.

Bubble Neighborhood

The bubble neighborhood specifies a radial distance from Np (default = 5). At each iteration, all nodes in the bubble neighborhood are updated by the same amount. Nodes outside the bubble neighborhood are not updated. Neighborhood size is a user-modifiable parameter that specifies the width of the bubble neighborhood. Neighborhood size decays with time (iterations) as described by the following equation:

\[
\text{neighborhood size}(i) = \text{neighborhood size\_i} \times \left( \frac{\text{neighborhood size\_f}}{\text{neighborhood size\_i}} \right)^{i/T}
\]

where:
neighborhood size(i) = width of bubble neighborhood at iteration i
neighborhood size_i = initial width of bubble neighborhood at the first iteration
neighborhood size_f = final width of bubble neighborhood at the last iteration
T = the maximum number of iterations (iterations = epochs x number of probe sets (or probes))
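The decay equation and the bubble rule above can be tried out with a small Python sketch. The helper names are hypothetical, the final size of 0.5 and the iteration count are made-up values, and treating the neighborhood size as a Euclidean radius on the two-dimensional node grid is an assumption.

```python
import math

def decayed(initial, final, i, total):
    """Value at iteration i, moving from `initial` (i = 0) to `final` (i = total)."""
    return initial * (final / initial) ** (i / total)

def bubble_members(rows, cols, target, radius):
    """Grid positions whose distance to `target` on the node grid is within `radius`."""
    t_row, t_col = target
    return [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if math.hypot(r - t_row, c - t_col) <= radius
    ]

T = 1000  # hypothetical maximum number of iterations
sizes = [decayed(5.0, 0.5, i, T) for i in (0, T // 2, T)]
early = bubble_members(3, 2, target=(0, 0), radius=sizes[0])
late = bubble_members(3, 2, target=(0, 0), radius=sizes[2])
```

On a 3 x 2 grid with these values, every node lies inside the bubble at the first iteration, and only the target node remains inside it at the last iteration.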

Gaussian Neighborhood

In the Gaussian neighborhood, all nodes are updated at each iteration. The distance a node moves is a function of its distance from the target node (Np). The greater the distance between N and Np, the less N moves toward P.
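This guide does not state the exact weighting function DMT applies in the Gaussian neighborhood. The Python sketch below assumes a standard Gaussian kernel, exp(-d²/(2·width²)), purely to illustrate how the update shrinks as the grid distance from Np grows. The function name gaussian_weight and the width parameter are hypothetical.

```python
import math


def gaussian_weight(grid_distance, width):
    """Fraction of the full update applied to a node (assumed Gaussian kernel).

    grid_distance: distance between the node and the target node on the grid.
    width: spread of the kernel (hypothetical parameter for this sketch).
    """
    return math.exp(-(grid_distance ** 2) / (2.0 * width ** 2))


# The target node itself (distance 0) receives the full update;
# nodes farther away receive progressively smaller updates.
for d in (0, 1, 2, 3):
    print(d, round(gaussian_weight(d, 1.5), 3))
```

Under this assumption every node has a non-zero weight, which matches the statement that all nodes are updated at each iteration.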

Learning Rate

The learning rate is a user-modifiable parameter that specifies the distance a node moves toward P at each iteration. The learning rate decays with time (iteration) as described by the following equation:

$$\text{learning rate}(i) = \text{learning rate\_i} \times \left(\frac{\text{learning rate\_f}}{\text{learning rate\_i}}\right)^{i/T}$$

where:

learning rate(i) = learning rate at iteration i
learning rate_i = initial learning rate at the first iteration
learning rate_f = final value of the learning rate at the last iteration
T = the maximum number of iterations (iterations = epochs × number of probe sets (or probes))
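As a worked example with illustrative values only (these are not DMT defaults), take an initial learning rate of 0.1, a final learning rate of 0.001, and T = 10,000 iterations. Halfway through the run, at iteration 5,000, the learning rate is:

$$0.1 \times \left(\frac{0.001}{0.1}\right)^{5000/10000} = 0.1 \times 0.01^{0.5} = 0.1 \times 0.1 = 0.01$$

The rate at the midpoint is the geometric mean of the initial and final values, so it falls quickly early in the run and more slowly toward the end.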

The Correlation Coefficient Clustering Algorithm

The correlation coefficient (ρ) between two probe set expression patterns (X and Y) across all analyses is determined by the equation:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{N} \sum_{i=1}^{N} (X_i - X_m)(Y_i - Y_m)}{\sigma_X \sigma_Y}$$

where:

Cov(X,Y) = the covariance between X and Y
σX = standard deviation of X
σY = standard deviation of Y
Xm = mean Avg Diff (or normalized Avg Diff) for probe set X across all analyses
Ym = mean Avg Diff (or normalized Avg Diff) for probe set Y across all analyses
Xi = Avg Diff (or normalized Avg Diff) for probe set X from analysis i
Yi = Avg Diff (or normalized Avg Diff) for probe set Y from analysis i
N = number of analyses

Covariance increases when (Xi - Xm) and (Yi - Ym) are both positive or both negative. The covariance decreases when (Xi - Xm) is positive and (Yi - Ym) is negative, or vice versa. Each analysis is weighted equally. The order in which the analyses are used to compute ρ(X,Y) is not important because all data are compared to the mean. The value of ρ(X,Y) can range from -1 to +1: ρ = 1 indicates perfect positive correlation, ρ = 0 indicates no correlation, and ρ = -1 indicates perfect inverse correlation. The correlation coefficient clustering algorithm is designed to identify positive correlations, not negative (inverse) correlations.
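The calculation can be sketched in Python as shown below. This is an illustration rather than DMT's code: the function name correlation is hypothetical, the standard deviations are assumed to use the same 1/N normalization as the covariance, and neither expression pattern may be constant (a zero standard deviation would cause a division by zero).

```python
import math


def correlation(x, y):
    """Correlation coefficient between two equal-length expression patterns."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Covariance: mean product of the deviations from each mean.
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
    # Standard deviations with 1/N normalization (assumption, see lead-in).
    sd_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / n)
    sd_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / n)
    return cov / (sd_x * sd_y)


# Hypothetical Avg Diff values for two probe sets across four analyses.
rising = [100.0, 200.0, 300.0, 400.0]
also_rising = [20.0, 40.0, 60.0, 80.0]
falling = [80.0, 60.0, 40.0, 20.0]
print(correlation(rising, also_rising))  # close to +1
print(correlation(rising, falling))      # close to -1
```

Reordering the analyses (applying the same shuffle to both lists) leaves the result unchanged apart from floating-point rounding, which reflects the statement that the order of analyses is not important.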

The Matrix Algorithm

The matrix algorithm determines the overlap significance between two lists (the probe sets or spotted probes common to both lists). The matrix displays the overlap significance value and highlights values that exceed the overlap significance threshold (pink) or the non-overlap significance threshold (yellow). The algorithm uses the binomial distribution equation to calculate the probability (p-value) that the observed overlap between two lists is expected due to random chance. The classification algorithm computes a p-value for each overlap significance value:

$$P = \frac{n!}{(n - x)!\,x!} \cdot w^{x} \cdot (1 - w)^{n - x}$$

where:

P = probability that the observed overlap is due to random chance
n = number of probe sets (or spotted probes) in the first list (rows)
x = observed number of probe sets (or spotted probes) that overlap in the two lists
w = frequency of probe sets (or spotted probes) in the second list, w = b/t
b = number of probe sets (or spotted probes) in the second list
t = total population

The p-value may range from zero to one. A p-value of one indicates there is no relationship (overlap) between the lists and that the observed distribution of probe sets or spotted probes in the two lists is expected to occur due to random chance. A p-value close to zero indicates the observed overlap between the two lists is not expected to occur due to random chance. The algorithm computes the overlap significance score from the p-value:

Overlap significance = -log P
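A minimal Python sketch of this calculation follows. It is not DMT's implementation: the function names are hypothetical, the list sizes are made up, and a base-10 logarithm is assumed because the guide does not state the base of the logarithm.

```python
import math


def overlap_p_value(n, x, b, t):
    """Binomial probability of observing exactly x overlapping members.

    n: size of the first list, x: observed overlap,
    b: size of the second list, t: total population (so w = b / t).
    """
    w = b / t
    return math.comb(n, x) * w ** x * (1.0 - w) ** (n - x)


def overlap_significance(p):
    """Overlap significance, -log P (base 10 assumed for this sketch)."""
    return -math.log10(p)


# Hypothetical lists: 10 members in the first list, 5 of them shared with a
# second list that holds 50 of the 100 members of the population.
p = overlap_p_value(n=10, x=5, b=50, t=100)
print(p)  # 0.24609375
print(overlap_significance(p))
```

A smaller p-value yields a larger overlap significance score.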

Because the overlap significance is the negative logarithm of the p-value, higher values in the matrix indicate greater overlap or non-overlap significance between two lists. The algorithm uses the following rules to distinguish between these two possibilities:

x > wn: there is greater overlap than expected by random chance
x < wn: there is less overlap than expected by random chance

The matrix displays the overlap significance value. It highlights values that exceed the overlap significance threshold (pink) or values that exceed the non-overlap significance threshold value (yellow).

Appendix E Toolbars & Shortcuts

You can display toolbars with text labels. To display the toolbar button labels, select View → Toolbar → Text labels from the menu bar.

DMT Main Toolbar

Figure E.1 DMT main toolbar

Table E.1 DMT main toolbar button descriptions

Menu Command      Function
Data → Open       Displays the Open dialog box. Select and open a previously saved query from the Open dialog box.
Data → Save       Displays the Save dialog box so that a query may be named and saved.
Data → Print      Displays the Print dialog box.
Help → Contents   Displays DMT help contents.


Session Toolbar

Figure E.2 Session toolbar

Table E.2 Session Toolbar Button Descriptions

Menu Command                        Function
Edit → Copy Cells                   Copies the cells selected in a results table to the system clipboard.
Edit → Find in Results              Displays the Find Probe Set dialog box that enables a text search of probe sets or spotted probe names in the query or pivot table. The search includes probe or probe set descriptions when these are displayed in the pivot table.
Query → Experiment Information      Displays experiment information for the analyses selected in the data tree.
Query → Run Query                   Executes the query for the analyses selected in the data tree and populates the query table.
Query → Pivot                       Executes the query for the analyses selected in the data tree and populates the pivot table.
Annotations → Annotate Probe Sets   Displays the Annotate dialog box.
Annotations → Query Annotations     Runs the annotation query.
Graph → Scatter                     Displays the Scatter Graph dialog box.
Graph → Fold Change                 Displays the Fold Change Graph dialog box.
Graph → Series                      Displays the Series Graph dialog box.
Graph → Histogram                   Displays the Histogram dialog box.
Graph → Lasso Points                Changes the cursor to a drawing tool that can circle (lasso) points in the scatter graph.
View → Options                      Displays the Data Mining Options dialog box.
View → Analysis Filters             Displays the Filter Analysis dialog box.
View → Data Tree                    Displays or hides the data tree in the Query window.
View → Results Filters              Displays or hides the filter grid in the DMT session.

Shortcut Descriptions

Menu Bar Command                Shortcut Key
Data → Save                     CTRL + S
Data → Print                    CTRL + P
Edit → Copy Cells               CTRL + C
Edit → Copy Graph               CTRL + G
Edit → Find in Results          CTRL + F
Query → Run Query               CTRL + Q
Annotations → Annotate Probes   CTRL + A