Implementation Of Microarray Data Analysis Algorithms As A Windows Application
TABLE OF CONTENTS
LIST OF FIGURES
ACKNOWLEDGEMENTS
1. INTRODUCTION
   1.1 Introduction to microarray
   1.2 Image Analysis
   1.3 Data Analysis
2. TECHNOLOGY
   2.1 Excel Database
   2.2 Microsoft .NET
      2.2.1) Why .NET?
      2.2.2) VS.NET for Office Solutions
      2.2.3) Primary Interop Assemblies (PIA)
      2.2.4) Coding with .NET Libraries
3. IMPLEMENTATION OVERVIEW
   3.1 Application Design Overview
   3.2 Algorithm Outline
   3.3 Data description
   3.4 User Interface
      1) Parent Form
      2) Detection Calls Option
      3) T-test of Signal Log Ratios Option
      4) Set the criteria for Fold Change Option
   3.5 Why DLLs
   3.6 Class Diagram of the methods
   3.7 Description of methods used in the dlls
      3.7.1 VedmetDLL.dll – SetWorksheet ()
   3.8 Description of methods used in the Options form
      3.8.1 Class: Form 1
   3.9 Features
   3.10 Screen Shots
4. TESTING
5. ENHANCEMENTS
6. REFLECTIONS
7. CONCLUSION
8. REFERENCES
9. APPENDIX
   a) Transferring the GCOS CAB files into Excel
LIST OF FIGURES
Figure 1: Gene Chip Array
Figure 2: Hybridization of tagged and untagged probes
Figure 3: Gene Chip Technology
Figure 4: Scanning of tagged and untagged probes
Figure 5: Excel Object Model
Figure 6: Application Design Overview
Figure 7: Algorithm Overview
Figure 8: Class Diagram
ACKNOWLEDGEMENTS
I sincerely thank Dr. Dan Andresen, my major professor, for his invaluable guidance and advice throughout the whole project. I also thank him for being flexible and adjusting during the course of the project. I am grateful to Dr. Daniel Marcus who has helped me to understand the background of this project. I thank him for his valuable suggestions and being supportive. I thank Dr. Mitch Neilsen for serving in my committee and agreeing to review my report. I am indebted to my parents for their love and encouragement. Last but not least I thank all my friends including Palani and my brother Dinesh for their support.
1. INTRODUCTION

The analysis of microarray data has been a time-consuming task that involves applying different algorithms to genome databases. Previously this was done with a collection of separate software packages, each suited to a different purpose. As a step toward automating the procedure, a Windows application was developed with a biologist-friendly user interface that interacts with the genome databases (gene lists) obtained from the GeneChip Operating Software (GCOS) 1.2. GCOS is a software package used specifically for acquiring and analyzing gene array data from the Affymetrix GeneChip platform.
1.1 Introduction to microarray:

The fundamental basis of DNA microarrays is the process of hybridization. Two DNA strands hybridize (stick together) if they are complementary to each other. Complementarity reflects the Watson-Crick rule that adenine (A) binds to thymine (T) and cytosine (C) binds to guanine (G). Hybridization has been used for decades in molecular biology as the basis for techniques such as Southern blotting and Northern blotting. Where it was previously possible to run a couple of Northern or Southern blots in a day to identify a few expressed genes, DNA arrays now make it possible to run hybridizations that test for the expression of tens of thousands of genes. This has in some sense revolutionized molecular biology and medicine. Instead of studying one gene and one message at a time, experimentalists are now studying many genes and many messages at the same time. In fact, DNA arrays are often used to study all known messages for the genes of an organism. This has opened the possibility of an entirely new, systematic view of how cells react in response to certain stimuli. It is also an entirely new way to study human disease, by viewing how it affects the expression of all genes inside the cell.

The Technology behind DNA Microarrays:

A microarray is a solid support on which DNA of known sequence is deposited in a regular grid-like array. The DNA may take the form of cDNA or oligonucleotides, although other materials may be deposited as well. Typically, several nanograms (per chip) of DNA are immobilized on the surface of an array.
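The Watson-Crick pairing rule described in this section can be illustrated with a minimal C# sketch. It is not part of the application; the class and method names (Hybridization, Complement, ReverseComplement, CanHybridize) are hypothetical, and it assumes both strands are written in the 5' to 3' direction and that only perfect matches count, whereas real hybridization tolerates some mismatches.

```csharp
using System;
using System.Text;

// Illustrative only: hypothetical names, perfect-match pairing assumed.
public static class Hybridization
{
    // Watson-Crick partner of a single DNA base (A-T, C-G).
    public static char Complement(char b)
    {
        switch (char.ToUpperInvariant(b))
        {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: throw new ArgumentException("Not a DNA base: " + b);
        }
    }

    // The strand that pairs with 'strand' when both are written 5' to 3'.
    public static string ReverseComplement(string strand)
    {
        var sb = new StringBuilder(strand.Length);
        for (int i = strand.Length - 1; i >= 0; i--)
            sb.Append(Complement(strand[i]));
        return sb.ToString();
    }

    // True when the probe is perfectly complementary to the target.
    public static bool CanHybridize(string probe, string target)
    {
        return string.Equals(ReverseComplement(probe),
                             target.ToUpperInvariant(),
                             StringComparison.Ordinal);
    }
}
```

Under these assumptions, a probe ATGC pairs with the target GCAT but not with another copy of ATGC.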
Figure 1: GeneChip Array
Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
RNA is extracted from biological sources of interest, such as cell lines with or without drug treatment, tissues from wild-type or mutant organisms, or samples studied across a time course. The RNA (or mRNA) is often converted to cDNA, labeled with fluorescence or radioactivity, and hybridized to the array.
Figure 2: Hybridization of tagged probes
Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
During this hybridization, cDNAs derived from RNA molecules in the biological starting material can hybridize selectively to their corresponding nucleic acids on the microarray surface. Following washing of the microarray, image analysis and data analysis are performed to quantitate the signals that are detected. Through this process, microarray technology allows the simultaneous measurement of the expression levels of thousands of genes represented on the array.
Figure 3: GeneChip Technology
Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
Advantages of Microarrays:
1. It is fast; one can obtain data on the expression levels of over 50000 genes within one week.
2. The entire genome can be represented on a chip, and thus it is comprehensive.
3. It is flexible, because cDNAs or oligonucleotides corresponding to any gene can be represented on a chip.

Disadvantages of Microarrays:
1. Many researchers find it prohibitively expensive to perform sufficient replicates and other controls.
2. There are many artifacts associated with image analysis and data analysis. Researchers are still figuring out how to get the “best” answers from microarray experiments.
3. Microarrays alone are not enough; the results usually have to be validated using a technique such as RT-PCR.
4. There is no standard way to analyze microarray data.
5. Getting answers requires combining knowledge of biology, statistics and computers, and hence the learning curve is steep.

Applications of Microarray:
1. Studying the effects of drug treatment
2. Gene knockout effects
3. Gene cloning
4. Cancer research
5. Developmental biology (such as stem cell populations)
1.2 Image Analysis:
When the microarray chip is illuminated by a laser beam, the RNA that has been hybridized fluoresces, producing brightness proportional to the amount of hybridized RNA. This image is captured by a camera and it is then processed by a computer to get the expression levels of all the genes.
Figure 4: Scanning of tagged and untagged probes
Image courtesy of Affymetrix - http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
Background subtraction: The first step of data analysis is to correct for background across the entire array. The array is divided into equally spaced zones, and an average background is assigned to the center of each zone. The background computed for each cell establishes an intensity floor that is subtracted from all intensity values.
1.3 Data Analysis:

Data analysis starts with the normalization and/or scaling of the data, which is a built-in step when the microarray image is acquired by the GeneChip Operating Software (GCOS) 1.2. (Normalization and scaling can be performed by either the data acquisition software or the data analysis software.) The correlation between samples can be examined with methods such as cluster analysis, condition trees and principal component analysis. Once the data is acquired, it is generally exported to an Excel file where the analysis is done. The post-analysis steps include Gene Ontology (i.e., classifying the genes into different functional groups) and mapping the genes onto pathways from different databases (which can be performed in different software). A number of commercially available software packages may include the analysis steps that are implemented in this Windows application. There is no perfect method for data analysis, and it is up to the investigators to decide which steps or algorithms to follow for their data. This application is highly customized for use in the Cellular Biophysics laboratory of the Department of Anatomy and Physiology, College of Veterinary Medicine, Kansas State University.
2. TECHNOLOGY:

The front end has been implemented in Visual C# .NET as a Windows application. On the back end, the gene list obtained from GCOS is a Microsoft Excel spreadsheet, and the output that biologists expect from a tool is also an Excel spreadsheet.
2.1 Excel Database:

Excel spreadsheets are a straightforward solution for storing tabular data, and recent versions of the ubiquitous Microsoft Excel include some surprisingly sophisticated data access and manipulation functions.

Issues: There are a number of reasons why Excel is not to be preferred for data management and/or statistical analysis. Some of the simpler reasons are:
a) There is no way to record what you have done.
b) Poor statistical routines: it is impossible to view the source code that implements the statistical routines, and several Excel procedures are misleading.
c) Routines for handling missing data were incorrect in versions of Excel prior to Excel 2000. [In reference to pre-2000, "Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other." E. Goldwater, ref. (1)].

In addition, copying formulas to all the cells is a manual process, and sorting columns and datasheets can be dangerous.

However, the conventional method used by biologists for analyzing huge genomes is Microsoft Excel. Although calculating cells for the complete genome, sorting and filtering the data, and copying different results to different worksheets for subsequent manipulation is menial work, Excel is still considered an easy solution. The solution I have proposed is therefore a Windows application that aids the analysis task by interfacing with and programming against the Excel worksheet.
2.2 Microsoft .NET: 2.2.1) Why .NET?
The reasons for choosing .NET include the following:
1. .NET provides the ability to create rich clients that execute within the Common Language Runtime. These applications use a new forms processing engine called Windows Forms. Any .NET language can use Windows Forms to build Windows applications. These applications have access to the complete .NET Framework of namespaces and objects, and have all of the advantages that the Framework can offer.
2. It is object-oriented and has many programming tools that allow for faster development and more functionality.
3. All applications in .NET are "garbage-collected", which means that objects are destroyed automatically when they are no longer in use.
2.2.2) VS.NET for Office Solutions:
The key benefits of choosing VS.NET as the development environment for Office solutions are:
- The power of writing managed .NET code that executes behind Word and Excel documents.
- Developers get the full, robust advantages of the Visual Studio .NET environment.
- Developers can create applications with a more robust security model, restricting execution to code from a fully trusted corporate server.
- Code-behind .NET projects can be started in .NET with new Office documents, applied to existing Excel spreadsheets or Word documents and templates, and can even co-exist with current VBA-based logic.
- Using VS.NET facilitates language freedom, easier debugging, better memory management, and a more robust security model.

The VBA programming model still exists, and the .NET Office tools are just another choice.
2.2.3) Primary Interop Assemblies (PIA)
Microsoft provides official wrapper assemblies for writing managed code against the programmable unmanaged Microsoft Office libraries. These are called the primary interop assemblies. The Office PIAs are installed for the following reasons:
- Develop managed Office applications within the robust Visual Studio development environment
  o Create Excel and Word solutions directly from within Visual Studio
- Develop applications more quickly with less code
- Utilize Visual Studio’s vast array of tools, easy access to web services, and access to the .NET Framework
- Leverage existing Visual Studio/VB/C# experience
2.2.4) Coding with .NET Libraries:

Here is a quick walkthrough of the Excel Object Model. The complete Excel object model is somewhat complicated, but only a few objects are required for our Windows application.
Figure 5 – Excel Object Model
The Application object is the controller object for all other subsystems in the Excel application. Each application can have multiple Workbooks; there is one default workbook for each application, and it is returned through the ‘ThisWorkbook’ object variable. By default, each Workbook has three Worksheets (the actual data presentation area). Developers can present data in any of these sheets or create additional worksheets of their own.
3. IMPLEMENTATION OVERVIEW:

3.1 Application Design Overview:

The application has references to the Microsoft Excel 11.0 Object Library and to the dlls that contain the analysis modules and the user interface for those modules. All the references and their associated items can be managed with Solution Explorer, which is provided as part of the Integrated Development Environment (IDE). The solution is a container for the projects and solution items that can be built into the application. The Windows application interacts with the Excel database through the solution level, which holds all the build configurations.
Figure 6 – Application Design Overview
3.2 Algorithm Outline

The working of this application is shown in the architecture diagram below. The user imports the Excel file containing the gene list to be processed (see Appendix (a) for transferring a CAB file into Excel) and builds the analysis strategy by selecting options in the first form. All the analysis modules and their user interfaces are placed in separate dynamic link libraries, and the dlls are added as references to the Windows application. Depending on the steps selected by the user, the appropriate forms are opened by calling the dlls, and the user sets the criteria for each step there. The actions in those forms call the corresponding functions in the dlls. The modules in the dlls are interfaced with the Excel spreadsheet: the analysis module for the first option selected by the user filters the data set and returns an array list containing the row numbers of the genes that match the criteria. This array list is passed to the analysis module for the second option selected by the user, and subsequent filtering proceeds in the same way. Finally, the dataset that has been filtered by all the analysis steps is exported as an Excel worksheet. It can also be previewed in the “Preview Result” textbox of the form.
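The chaining of analysis modules outlined above can be sketched as follows. This is a minimal illustration rather than the application's code: the FilterPipeline class and its Run method are hypothetical names, and generic lists stand in for the array lists that the application passes between modules.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: hypothetical names; List<int> stands in for ArrayList.
public static class FilterPipeline
{
    // Runs the selected filters in order. Each filter receives the row
    // numbers that survived the previous step and returns the subset it keeps.
    public static List<int> Run(IEnumerable<int> allRows,
                                IEnumerable<Func<List<int>, List<int>>> filters)
    {
        var rows = new List<int>(allRows);
        foreach (var filter in filters)
            rows = filter(rows);
        return rows;
    }
}
```

For instance, passing rows 2 to 6 through a filter that keeps even row numbers and then through one that keeps row numbers above 2 leaves rows 4 and 6. Because every step has the same shape (row numbers in, row numbers out), the order and number of steps can be chosen at run time.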
Figure 7 – Algorithm Overview
3.3 Data description

The Excel spreadsheet (see Appendix (a) for transferring a GCOS CAB file into Excel) consists of the following data:
a) Affymetrix id
b) Detection (for every single and comparison analysis). The values of detection can be:
   1. “P” (denoting the Present Call)
   2. “M” (denoting the Marginal Call)
   3. “A” (denoting the Absent Call)
c) Signal Log Ratio (for every comparison analysis)
d) Change (for every comparison analysis)
e) Detection p-value (for every single and comparison analysis)
f) Description
g) Different types of annotations, such as common name, GenBank Accession number, product, function, GO biological process, GO cellular component, GO molecular function etc.

All of these fields might vary depending on the options selected by the user.
3.4 User Interface:

1) Parent Form: The user imports the file for analysis by clicking the button “Import the samples file for analysis”. The options available in the tool are listed in the left panel; the user selects an option by clicking an item in the ListBox, one at a time, and then clicking the button “Perform this Filter”. The corresponding panels open according to the options selected. After all the analysis steps have been performed, the results can be exported from this form into a separate Excel sheet by clicking the button “Export the result”, which opens a save file dialog box where the user can save the file at the desired location. There is also a “Result Preview” button, which lets the user preview the genes that have been filtered by the analysis modules.

2) Detection Calls Option: This is one of the options listed in the listbox of the parent form. The user interface for this option is located in the dll DetectionCalls.dll. Selecting this option and pressing the button “Perform this Filter” displays a panel on the right for setting the criteria for detection calls. The user must type the number of samples in each group in the textbox next to the label “Input the number of Samples in each group”. (It is assumed that there are two groups and that both groups have the same number of samples.) The next two textboxes, next to the labels “Input the number of Present calls” and “Input the number of Marginal Calls”, specify the criteria for the number of present calls and the number of marginal calls. The input in these two textboxes is validated so that their total does not exceed the number of samples entered in textBox1. On clicking the “Perform test” button, the application returns the number of genes that satisfy the criteria set by the user, and this number is displayed in the label at the end of this form.

3) T-test of Signal Log Ratios Option: This is another option listed in the left panel. The user interface for this option is in the dll Ttest.dll. The panel has controls for setting the values used to perform a t-test on the signal log ratio values. The textbox textBox1 holds the p-value criterion for obtaining significant genes. Since the p-value should always be 0.05 (as per the requirement of the users), the textbox has been made read-only. On clicking the button “Perform t-test”, the application displays the number of significant genes that pass the criterion.

4) Set the criteria for Fold Change Option: This is also one of the options in the listbox. The user interface for this option is created by the dll FoldChange.dll. The purpose of this option is to calculate the fold change of the significant genes (if significance has been evaluated prior to this step) or of all the genes, and to set the criteria for the up-regulated and down-regulated genes. Here, fold change is calculated from the median of the signal log ratios. The textboxes textBox1 and textBox2 set the cut-off fold change for determining the up-regulated and down-regulated genes respectively. Both textboxes have been made read-only, and their values are set to 2.0 and -2.0 respectively. The numbers of up-regulated and down-regulated genes are displayed on clicking the “Submit” button.
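The t-test option (item 3 above) compares a p-value against the fixed cutoff of 0.05. The sketch below shows one way such a p-value could be obtained without Excel. It rests on assumptions the report does not confirm: a one-sample, two-tailed t-test of the signal log ratios against a mean of zero, and a p-value computed from the standard finite trigonometric series for Student's t with integer degrees of freedom. The application itself relies on Excel's WorksheetFunction class, and all names here (SignalLogRatioTTest and its methods) are hypothetical.

```csharp
using System;

// Illustrative only: hypothetical names; one-sample, two-tailed test assumed.
public static class SignalLogRatioTTest
{
    // t statistic for the null hypothesis "mean == 0" (an assumption of this sketch).
    public static double TStatistic(double[] values)
    {
        int n = values.Length;
        if (n < 2) throw new ArgumentException("At least two values are required.");

        double mean = 0.0;
        foreach (double v in values) mean += v;
        mean /= n;

        double sumSq = 0.0;
        foreach (double v in values) sumSq += (v - mean) * (v - mean);
        double sd = Math.Sqrt(sumSq / (n - 1));   // sample standard deviation

        return mean / (sd / Math.Sqrt(n));
    }

    // Two-tailed p-value of Student's t for integer degrees of freedom,
    // using the finite series for P(|T| <= |t|) in theta = atan(|t| / sqrt(df)).
    public static double TwoTailedP(double t, int df)
    {
        if (df < 1) throw new ArgumentOutOfRangeException("df");
        double theta = Math.Atan(Math.Abs(t) / Math.Sqrt(df));
        double s = Math.Sin(theta), c = Math.Cos(theta);
        double inside;                            // P(|T| <= |t|)

        if (df % 2 == 1)
        {
            // Odd df: (2/pi) * (theta + sin * (cos + (2/3) cos^3 + ...)).
            double term = c, sum = (df == 1) ? 0.0 : c;
            for (int k = 3; k <= df - 2; k += 2)
            {
                term *= (k - 1.0) / k * c * c;
                sum += term;
            }
            inside = 2.0 / Math.PI * (theta + s * sum);
        }
        else
        {
            // Even df: sin * (1 + (1/2) cos^2 + (1*3)/(2*4) cos^4 + ...).
            double term = 1.0, sum = 1.0;
            for (int k = 2; k <= df - 2; k += 2)
            {
                term *= (k - 1.0) / k * c * c;
                sum += term;
            }
            inside = s * sum;
        }
        return 1.0 - inside;
    }

    // True when the ratios differ significantly from zero at level alpha.
    public static bool IsSignificant(double[] ratios, double alpha)
    {
        double p = TwoTailedP(TStatistic(ratios), ratios.Length - 1);
        return p < alpha;
    }
}
```

A row whose signal log ratios are all close to 1 (for example 1.0, 1.1, 0.9, 1.2) would pass at the 0.05 level under these assumptions, while a row whose ratios average to zero would not.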
3.5 Why DLLs:
1. A dynamic link library is shared in memory by the programs that use it, so system performance improves compared with building the same code into each application.
2. Each DLL can be built and tested separately.
3. DLLs can be loaded and unloaded at run time, which helps to improve application performance.
4. Large software products can be divided into several DLLs, which makes them easier to develop.
5. DLLs ease the creation of international versions.
A potential disadvantage to using DLLs is that the application is not self-contained; it depends on the existence of a separate DLL module.
Even though DLLs and applications are both executable program modules, they differ in several ways. To the end-user, the most obvious difference is that DLLs are not programs that can be directly executed. From the system's point of view, there are two fundamental differences between applications and DLLs:
An application can have multiple instances of itself running in the system simultaneously, whereas a DLL can have only one instance.
An application can own things such as a stack, global memory, file handles, and a message queue, but a DLL cannot.
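Loading DLLs at run time:

The .NET Framework allows an assembly to be inspected and manipulated at run time through the System.Reflection namespace and its associated classes. A sample of the code that handles the dlls at run time is given below:

using System;
using System.Reflection;

// Load the assembly from disk and create an instance of every public,
// concrete type that has a public parameterless constructor.
Assembly assembly = Assembly.LoadFrom(@"MyAssembly.dll");
foreach (Type type in assembly.GetExportedTypes())
{
    ConstructorInfo constructor = type.GetConstructor(Type.EmptyTypes);
    if (type.IsAbstract || constructor == null)
        continue;   // nothing to instantiate for this type
    object newObject = constructor.Invoke(new object[] {});
}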
This way the newObject is created from that dll.
In this application, a button with the tick mark located at the top-right corner handles this event. The filenames of the dlls that are added at run time are appended to the list box in the left panel. A dll with the same name as one already listed cannot be added; a message box pops up indicating that the dll already exists.
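The duplicate check described above could be sketched as follows. The PluginList class and TryAdd method are hypothetical names, and comparing file names without regard to case is an assumption made here because Windows file names are case-insensitive; the report does not say how the application compares them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative only: hypothetical names; case-insensitive comparison assumed.
public class PluginList
{
    private readonly HashSet<string> names =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true if the dll was added, or false if a dll with the
    // same file name is already in the list.
    public bool TryAdd(string path)
    {
        string fileName = Path.GetFileName(path);
        return names.Add(fileName);
    }

    // Number of distinct dll file names currently listed.
    public int Count
    {
        get { return names.Count; }
    }
}
```

A caller would show the "already exists" message box when TryAdd returns false, and append the file name to the list box otherwise.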
3.6 Class Diagram of the methods:
Figure 8 – Class Diagram
[Class diagram showing the classes Form1, VedmetDLL, FoldChange, DetectionCalls and Ttest.
Members shown for Form1 and VedmetDLL: ExportArrayListToExcel(), StartAppropriateForm(), listBox1_SelectionIndexChanged(), button_PerformFilter_Click(), button3_Click(), button2_Click(), button1_Click(), Form1_load(), DoFiltertest(), al1, SetWorkSheet().
Members shown for FoldChange, DetectionCalls and Ttest: Median(), test_1(), test_2(), test_3(), textBox1_TextChanged(), BuilArrayList() (twice), button_PerformTest_Click(), button3_Click_2(), button1_Click(), CreatePanel2() (twice), CreatePanel4().]
3.7 Description of methods used in the dlls: All the modules that are used in the analysis and their corresponding user interface are placed in the dlls - VedmetDLL.dll, DetectionCalls.dll, Ttest.dll and FoldChange.dll; they are invoked whenever it is necessary in the application. These dynamic link libraries are added as references in the Windows application. The dlls and their functions are described here:
3.7.1 VedmetDLL.dll – SetWorksheet (): This method creates an Excel application and opens the Excel file specified by the user. It also sets the Excel workbooks, Excel sheets and the Excel worksheet of the Excel object.

3.7.2 DetectionCalls.dll – Test_1(): This method finds the number of present calls and marginal calls in each sample group for each gene and compares them to the input provided by the user; the input is passed as parameters to this function. It is implemented by creating an Excel range object over the columns containing the detection calls of the first sample group. The find() method of the range object locates the cells containing the detection values “P”/“M”, and counters keep the count of the present and marginal calls. If the values of the counters are greater than the values entered by the user, the flag of that row is set to true and the row number is added to an arraylist. For the genes whose flag is not set to true, another range object is created for the second group and the procedure is repeated. If the selected row number is not already in the arraylist, it is added to it. At the end, the arraylist holds all the rows that satisfy the criteria in either of the sample groups, so the count of this arraylist is the number of genes that pass this test, and it is displayed in the form.
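The logic of Test_1() can be sketched without Excel as shown below. This is an illustration under stated assumptions, not the code in DetectionCalls.dll: the class and method names are hypothetical, an in-memory dictionary replaces the Excel range and its find() method, each row is assumed to list the calls of the first group followed by those of the second, and the counters are compared with "at least" (>=), since a strict "greater than" could never be met when the criterion equals the group size.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: hypothetical names; in-memory data replaces the Excel range.
public static class DetectionCallFilter
{
    // True when calls[start .. start+count) holds at least minPresent "P"
    // calls and at least minMarginal "M" calls (">=" is an assumption).
    public static bool GroupPasses(string[] calls, int start, int count,
                                   int minPresent, int minMarginal)
    {
        int present = 0, marginal = 0;
        for (int i = start; i < start + count; i++)
        {
            if (calls[i] == "P") present++;
            else if (calls[i] == "M") marginal++;
        }
        return present >= minPresent && marginal >= minMarginal;
    }

    // Returns the sorted row numbers whose calls satisfy the criteria in
    // either group. Each row holds group 1's calls followed by group 2's.
    public static List<int> Filter(IDictionary<int, string[]> callsByRow,
                                   int samplesPerGroup,
                                   int minPresent, int minMarginal)
    {
        // Same validation as the form: the criteria cannot exceed the group size.
        if (minPresent + minMarginal > samplesPerGroup)
            throw new ArgumentException("Present + marginal exceeds samples per group.");

        var kept = new List<int>();
        foreach (KeyValuePair<int, string[]> row in callsByRow)
        {
            bool first = GroupPasses(row.Value, 0, samplesPerGroup,
                                     minPresent, minMarginal);
            // The second group is examined only if the first did not pass.
            if (first || GroupPasses(row.Value, samplesPerGroup, samplesPerGroup,
                                     minPresent, minMarginal))
                kept.Add(row.Key);
        }
        kept.Sort();
        return kept;
    }
}
```

The count of the returned list corresponds to the number of genes displayed in the form.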
3.7.3 DetectionCalls.dll – CreatePanel2(): This method creates the user interface for this option. This method returns a panel object to the options form for display. It creates a panel with 3 textboxes, 5 labels and a button. The first textbox is for getting the number of samples from the user. The next two text boxes are for specifying the call criteria for selection. On clicking the button “Perform Test”, the results i.e. the number of genes that meet the criteria will be displayed in a label at the bottom.
3.7.4 DetectionCalls.dll – button_PerformTest_Click(): This method checks the validity of the user input. If the total of the number of present calls and the number of marginal calls is greater than the number of samples entered in the first textbox, the entry is considered invalid and a message box is displayed to say so. The method then invokes the test_1 () method of the DetectionCalls.dll class. An arraylist is returned from test_1 (), and the count of that arraylist gives the number of genes that have satisfied the criteria. The count is appended to the label “Number of genes that meet the criteria:”.

3.7.5 DetectionCalls.dll – textBox1_TextChanged(): This method handles the event raised when the text in textBox1 is changed or entered by the user. The value entered in textBox1 is appended next to the textboxes for entering the Present and Marginal Calls criteria.

3.7.6 Ttest.dll – test_2(): This function performs a t-test on the whole set of genes or on the genes that have already been processed by the Detection call filter, depending on the options set by the user. The first argument of this function is an arraylist holding the row numbers of the genes for which the t-test is to be done. If the t-test is the first option selected by the user, the arraylist holds the row numbers of all the genes. The second argument is the user’s input for the p-value, and the third argument is a flag denoting whether any other analysis module has been performed previously. The t-test is implemented as follows. An arraylist keeps track of all the column headers containing the string “Ratio”, which is a substring of “Signal Log Ratio”. The Application object includes a property called WorksheetFunction, which returns an instance of the WorksheetFunction class. The class provides a number of useful mathematical functions, such as mean and variance, which perform calculations on ranges and hence are used in this method. The p-value of the t-test is derived from the mean and the standard deviation of the signal log ratio values. If the calculated p-value is less than the p-value set by the user in the form, the row number is added to an arraylist, which is returned to the form.

3.7.7 Ttest.dll – CreatePanel3(): This method creates the user interface for this analysis option and returns a panel object to the options form for display. The panel has two labels, a textbox and a button. The textbox sets the criterion for the p-value. The result obtained after the button click has been handled is appended to the label at the bottom of the panel.

3.7.8 Ttest.dll - Button3_Click_2(): This method invokes the test_2 () method of the dll. The third argument (i.e., the flag) to test_2 () varies depending on whether the analysis method in form 2, i.e., the detection call filter, has been evaluated or not. The test_2 () function returns an arraylist, and the count of the arraylist is appended to the label “Number of Significant Genes”.

3.7.9 FoldChange.dll – test_3(): This function calculates the fold change of the significant genes evaluated by the t-test and compares the calculated fold change with the input from the user to determine the up-regulated and down-regulated genes. The first argument of this function is an arraylist holding the row numbers of the genes that are significant, that is, the genes that passed the t-test. The second and third arguments are the user’s input in the textboxes of the form, which set the cut-off fold change for finding the up-regulated and down-regulated genes. As in test_2 (), an arraylist is created to find the columns (from their headers) that contain the signal log ratio values. For every row, the median of the signal log ratios is calculated by a separate function, Median (), which sorts all the signal log ratios and finds the median of the sorted list. The fold change is evaluated from the median in test_3 (). If the fold change is greater than or equal to the cutoff value for up-regulated genes specified by the user, the row number is added to an arraylist. Similarly, if the fold change is less than or equal to the cutoff value for down-regulated genes specified by the user, the row number is added to another arraylist. Both arraylists, containing the up-regulated and down-regulated genes, are returned to the form.

3.7.10 FoldChange.dll – CreatePanel4(): This method creates the user interface required for this analysis option. It creates a panel with 7 labels, 2 textboxes and a button. The two textboxes set the cutoff fold change for the up-regulated and down-regulated genes. The method returns the panel object, containing all the other created controls, to the options form.

3.7.11 FoldChange.dll – Median(): The fold change is calculated from the median of the signal log ratios, so this function sorts all the signal log ratios and finds the median of the sorted list.

3.7.12 FoldChange.dll – button1_Click(): This method submits the information entered by the user in the textboxes of the form, i.e., the cutoff values for the up-regulated and down-regulated genes. It calls the test_3 () method of the dll with argument values based on whether the other analysis options were selected or not. An arraylist containing the up-regulated and down-regulated genes is returned, and the counts of those genes are appended to the labels “Up-regulated genes” and “Down-regulated genes”.
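The Median() and test_3() steps can be sketched as below. The report does not give the formula that turns a median signal log ratio into a fold change, so this sketch assumes the usual Affymetrix convention for base-2 signal log ratios: 2^SLR when SLR is zero or positive, and -(2^-SLR) when it is negative, which is consistent with the cutoffs of 2.0 and -2.0. Averaging the two middle values for an even number of ratios is also an assumption, and all names (FoldChangeFilter, Classify) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: hypothetical names; signed base-2 fold change assumed.
public static class FoldChangeFilter
{
    // Median of the values: sorts a copy and takes the middle element,
    // or the mean of the two middle elements for an even count (assumption).
    public static double Median(IEnumerable<double> values)
    {
        var sorted = new List<double>(values);
        if (sorted.Count == 0) throw new ArgumentException("No values.");
        sorted.Sort();
        int mid = sorted.Count / 2;
        return sorted.Count % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Signed fold change from a base-2 signal log ratio (assumed convention).
    public static double FoldChange(double signalLogRatio)
    {
        return signalLogRatio >= 0
            ? Math.Pow(2.0, signalLogRatio)
            : -Math.Pow(2.0, -signalLogRatio);
    }

    // Adds each row number to 'up' or 'down' according to the cutoffs.
    public static void Classify(IDictionary<int, double[]> ratiosByRow,
                                double upCutoff, double downCutoff,
                                List<int> up, List<int> down)
    {
        foreach (KeyValuePair<int, double[]> row in ratiosByRow)
        {
            double fc = FoldChange(Median(row.Value));
            if (fc >= upCutoff) up.Add(row.Key);
            else if (fc <= downCutoff) down.Add(row.Key);
        }
    }
}
```

With the fixed cutoffs of 2.0 and -2.0, a gene counts as up-regulated when its median signal log ratio is at least 1 and as down-regulated when it is at most -1.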
3.8 Description of methods used in the Options form:

3.8.1 Class: Form 1
Methods:
i. buttonImport_Click(): This method imports the Excel file containing the gene list to be processed. An open file dialog box is created and the method SetWorkSheet in the VedmetDLL dynamic link library is called to set the Excel application and the workbook.
ii. button_PerformFilter_Click(): This button is clicked after selecting the analysis option in the panel. The method calls another method, DoFiltertest(), in the same class.
iii. DoFiltertest(): Depending on the option selected by the user, this method calls StartAppropriateForm().
iv. StartAppropriateForm(): This method contains the switch-case statements that create the user interface for each option by calling the appropriate dll for that analysis step. Depending on the user's choice at run time, the other panels are made invisible. The method also adds the created panel to the set of controls of the form.
v. button_Export_Click(): This method collects the results of the analysis modules from all the forms and exports them to Excel sheets. If the analysis strategy excludes the last step, i.e., the fold change calculator, only one Excel sheet needs to be exported. If the fold change calculator is included, the result set has two arraylists, the up-regulated and the down-regulated genes, and these are exported to separate worksheets of the Excel application. The two types of export are performed by ExportArrayListToExcel() and ExportArrayListToExcelSheets().
vi. ExportArrayListToExcel(): This method exports the genes that satisfy the criteria set by the user to an Excel file. A save file dialog box is opened and the user can save the file in the desired location.
vii. ExportArrayListToTextBox(): This method gives the user a preview of the results obtained. The resulting genes are displayed in a textbox at the bottom of the form.
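The choice between a single-sheet and a multi-sheet export described for button_Export_Click() can be sketched as below. This is a simplified Python illustration, not the application's C# code. The function name plan_export and the sheet names are hypothetical, and the sketch only decides which sheets would be written, without touching Excel.

```python
def plan_export(filtered_genes, fold_change_result=None):
    """Return a mapping of sheet name -> genes to write.

    fold_change_result is None when the fold change calculator was not part of
    the analysis; otherwise it is a pair (up_regulated, down_regulated).
    Sheet names are illustrative assumptions.
    """
    if fold_change_result is None:
        # Single result list: one worksheet is enough.
        return {"Filtered genes": list(filtered_genes)}
    up_regulated, down_regulated = fold_change_result
    # Two result lists: one worksheet per list.
    return {
        "Up-regulated genes": list(up_regulated),
        "Down-regulated genes": list(down_regulated),
    }


single = plan_export(["geneA", "geneB"])
double = plan_export(["geneA", "geneB"], (["geneA"], ["geneB"]))
print(len(single), len(double))  # 1 2
```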
3.9 Features:
Some of the features of this tool are:
a) Dynamic selection of the analysis method by the user: The analysis modules are placed in dlls, so the modules corresponding to the options selected by the user are invoked dynamically. This gives the user the flexibility to determine the analysis strategy.
b) Dynamic loading of dlls: At run time, the user can load dlls dynamically. The user can load as many dlls as needed to handle the events and perform the analysis modules. This makes the application scalable.
c) Extensible: Because the user interface and the analysis modules are invoked through dlls, the application is modular, and any analysis module needed in the future can be added as a further dll.
d) Minimal user input: Most of the textboxes used in this application have been made read-only, so the user does not have to enter the same values every time an analysis is performed.
e) Implemented navigation – order of tabs: The tab order has been configured so that users can move rapidly through the user interface of the application.
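The idea behind features a) to c), locating an analysis module by name at run time and asking it for its user interface, can be sketched as follows. The sketch is in Python and uses importlib, which is an analogy and not the .NET mechanism used by the application. The module name fold_change_plugin, the entry point create_panel and the dictionary standing in for a panel are all hypothetical.

```python
import importlib
import sys
import types


def load_analysis_module(module_name, entry_point="create_panel"):
    """Import a plug-in by name and return its panel-building function."""
    module = importlib.import_module(module_name)
    entry = getattr(module, entry_point, None)
    if not callable(entry):
        raise AttributeError(f"{module_name} has no callable {entry_point}()")
    return entry


# Stand-in for a plug-in file on disk, registered so it can be imported by name.
demo = types.ModuleType("fold_change_plugin")
demo.create_panel = lambda: {"title": "Fold change", "textboxes": 2}
sys.modules["fold_change_plugin"] = demo

panel = load_analysis_module("fold_change_plugin")()
print(panel["title"])  # Fold change
```

Because the host only depends on a module name and an agreed entry point, a new analysis step can be added without changing the host code, which is the extensibility argument made in c).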
3.10 Screen Shots
Form1: This form displays the options that can be selected for the analysis.
On clicking the “Import the Samples file for analysis” button, the open file dialog box appears and the user can select the file to be processed.
The user can select from the options and the corresponding user interface appears.
On clicking the button “Perform the selected filter”, the Detection Call Summarizer panel appears and the user can set the criteria for selection.
The t-test panel appears on selecting the option in the left panel and on clicking the button “Perform the selected filter”.
The number of genes that meet the criteria is displayed.
The fold change calculator panel appears when the corresponding option is selected in the left panel.
4. TESTING:
i. Validated user input: In the Detection Call Summarizer panel, the values entered in the text boxes that specify the call criteria are checked: the sum of the present call value and the marginal call value must be less than or equal to the number of samples in each group, which the user enters in the first textbox.
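The check described in test i can be written as a single comparison. The sketch below is in Python, with a hypothetical function name. The rejection of negative values is an added assumption and is not stated in the report.

```python
def call_criteria_are_valid(samples_per_group, present_calls, marginal_calls):
    """Check the Detection Call Summarizer inputs.

    The present and marginal call counts together must not exceed the number
    of samples in each group. Rejecting negative values is an assumption.
    """
    if min(samples_per_group, present_calls, marginal_calls) < 0:
        return False
    return present_calls + marginal_calls <= samples_per_group


print(call_criteria_are_valid(4, 3, 1))  # True
print(call_criteria_are_valid(4, 3, 2))  # False
```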
ii. Difference in performance between a dll function and a functionally identical stand-alone module: The performance of an analysis module in a dll was compared with that of a stand-alone function. This test evaluates whether the procedure calls into the dll slow down the analysis. The performance testing was done in ANTS Profiler (Advanced .NET Testing System), which records the time spent in the different modules of the source code. For this test, the test_1() module of DetectionCalls.dll was compared with the same function implemented directly in the form. The results from ANTS Profiler for samples of 500 genes, 5000 genes and 30000 genes are given below.
Here is a snapshot of the recording from ANTS Profiler when analyzing a sample with 500 genes.
The event handler button5_Click() in Form 1 took ~14 seconds, whereas DetectionCalls.test_1() took ~12 seconds to produce the same result. The dll version was therefore about 2 seconds faster.
This is a snapshot of a recording from ANTS Profiler when the application was analyzing a sample file containing 5000 genes.
button5_Click() took ~122 seconds to perform the analysis, whereas DetectionCalls.test_1() took ~116 seconds to produce the same result. This is consistent with the previous result: the dll took less time than the stand-alone module.
To determine whether the results hold for bigger sample files, a sample with 30000 genes was analyzed. The results from ANTS Profiler are given below:
The difference between the two functions was approximately 2 seconds for the sample file with 500 genes, 6 seconds for 5000 genes and 8 seconds for 30000 genes. In all three cases the dll performed better than the stand-alone module in the application.
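To put the absolute differences in proportion, the approximate timings reported above for 500 and 5000 genes can be expressed as a percentage of the stand-alone time. The short Python calculation below uses only those rounded readings. The 30000-gene case is omitted because its absolute timings are not stated in the text.

```python
# Approximate timings in seconds, as read from the profiler:
# genes -> (stand-alone button5_Click, DetectionCalls.test_1 in the dll)
timings = {500: (14, 12), 5000: (122, 116)}


def relative_saving(form_seconds, dll_seconds):
    """Time saved by the dll version, as a percentage of the stand-alone time."""
    return (form_seconds - dll_seconds) / form_seconds * 100


for genes, (form_seconds, dll_seconds) in timings.items():
    saving = relative_saving(form_seconds, dll_seconds)
    print(f"{genes} genes: {form_seconds - dll_seconds} s, {saving:.1f}%")
# 500 genes: 2 s, 14.3%
# 5000 genes: 6 s, 4.9%
```

Since the inputs are rounded profiler readings, the percentages are only indicative.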
5. ENHANCEMENTS:
This application can be extended to analyze more than two groups, and groups with unequal numbers of samples. Further analysis modules, such as an ANOVA test or clustering, can be implemented in dynamic link libraries if necessary. The project can also be extended to handle the loading of malformed dlls, duplicate dlls, deleted dlls and dlls with the same name but different version numbers. It can likewise be extended to handle missing data, invalid data and duplicate records in the Excel worksheet. Further enhancements would include handling other input formats, such as comma-delimited and tab-delimited text files. A dynamic way of handling data and presentation would be to implement the data formats in XML and XSLT.
This project has been done in a very broad sense. Its design allows further options to be added easily in the future. The system is accessible to biologists and other scientists with minimal technical background.
6. REFLECTIONS:
Implementing the user interface and the analysis modules as dynamic link libraries, and coordinating all the dlls in the application, was difficult at the beginning, but I learnt a lot in the course of the project. Although developing dlls for a Windows application is neither a common nor an easy solution, I have realized that making an application modular and extensible is good design practice.
Using Excel as a back-end can be a painful experience for programmers who are used to SQL databases such as Oracle. Unlike in those databases, there is no primary key to refer to records and there is no commit or rollback operation. Cells, rows and columns are all referred to as Excel Range objects. Developing such an Office solution was an entirely new experience for me.
7. CONCLUSION:
This application reduces the burden of analyzing the Excel files manually. It also removes the need to maintain multiple Excel tabs holding the intermediate results of the procedure. It is a stand-alone application, and the executable can be installed on as many workstations as needed. It caters to the needs of individual users who work on gene array data and simplifies their process. Users can import their gene lists into this tool, set their own parameters to process them, and export the analyzed gene lists. In the tests reported in section 4, the dll-based module ran slightly faster than the equivalent stand-alone code, and the use of dlls makes the application extensible.
9. APPENDIX
Raw Affymetrix Gene Chip data are stored in files with the “CAB” extension. A suite of programs from Affymetrix is used to extract and pre-process these data. The result is a list of the genes tested in the experiment, together with the fluorescence signal strength for each probe and associated information. These pre-processed data are placed in an Excel file that is used as the input to the program developed in this report.
a) Transferring the GCOS CAB files into Excel:
The Affymetrix Data Transfer tool is one way of transferring the CAB files into GCOS. The screenshot of the welcome page is given below; select the transfer option and click “Next”.
Browse to the location of the data file, specify the type of the data and click on “Next”.
Select the CAB files to be imported and click on “Start”.
After the import is complete, click the “Finish” button. Depending on the number of samples imported, experiments are created in the GeneChip Operating System.
By clicking the analysis results of any comparison array analysis, one can view the results.
The “Analysis Options” window gives the user various options that can be viewed in the window.
One can save the results by clicking on the “Save As” button and selecting Excel file as the export type.
In this way, the user can export all the necessary comparisons to Excel, and the resulting Excel file can be used as input to the Windows application.