Max Yan Student ID

Max Yan Student ID

Master of Biostatistics Workplace Portfolio Project University of Sydney Name: Max Yan Student ID: 311298761 November 2019 1 PREFACE Project title Gene expression and survival analysis of breast cancer subtypes defined by immunohistochemistry Location and dates The project was conducted at the Systems Biology Initiative, School of Biotechnology and Biomolecular Science, University of New South Wales, between February and November 2019. Relevant BCA Units Survival Analysis (SVA) Bioinformatics (BIF) Context I am a senior staff specialist at the Department of Anatomical Pathology, NSW Health Pathology – Randwick, Prince of Wales Hospital. I am also a conjoint lecturer at the Faculty of Medicine, University of New South Wales. I am routinely involved in the diagnosis, subtyping and staging of breast cancer as part of my work. Subtyping of breast cancer by immunohistochemistry is vital for prognostication and treatment in the clinical setting. This project seeks to understand the gene expression signatures underlying the breast cancer subtypes defined by immunohistochemistry. It also investigates the effect of subtyping on survival. Dr Kenneth Beath (Department of Mathematics and Statistics, Macquarie University) is the statistical supervisor. The bioinformatic analyses were supervised by Dr Susan Corley and Prof Marc Wilkins (Systems Biology Initiative, School of Biotechnology and Biomolecular Science, University of New South Wales) 2 Student’s role I am the principal investigator, and am responsible for the conception, design and conduct of the project. This includes: - Obtaining RNA-seq and clinico-pathological data for 105 clinical breast cancer specimens from the National Cancer Institute online database, plus data entry into a format suitable for subsequent analyses. - Analysis of RNA-seq data via edgeR, limma and R studio, including drafting and execution of the required R scripts. - Analysis of survival data via STATA. - Drafting of the manuscript, including the preparation of tables, heatmaps and figures. Reflections on learning Communication Communication of statistical and bioinformatics results in a clear and concise manner was of great importance. Care was taken to ensure the language used in the results section was readily understandable by the intended target audience of clinicians and pathologists. Tables were formatted and annotated in a manner to ensure accurate and concise visual representation of data. Complex statistical data, not regarded as essential for understanding of the key findings, were referred to the appendix to facilitate interpretation of results. Work patterns/planning Time management and organizational skills were vital. Self-imposed deadlines for specific tasks were set throughout the 10 month period, to ensure that the project was completed in a timely manner. For the bioinformatics portion of the project, a large number of data files (> 150) were generated. These had to be organized into folders and labeled with the appropriate headings to ensure that they were readily accessible for bioinformatics analysis. Throughout the project I maintained regular contact with my supervisors, who provided guidance in regards to the direction of the project. They were also vital in helping me answer many of the statistical and bioinformatics questions that arose during the project. They also gave much needed assistance with the drafting of the manuscript. 3 Statistical issues For RNA-seq analysis, issues included importing, organizing, annotating, transforming, filtering and normalizing data via the edgeR package, removing heteroscedascity from count data via the voom function, and fitting linear models for comparisons of interest. This project has been invaluable in teaching me the R programming language, the use of the various bioconductor packages and the organisation of information in R via data frames and matrices. I have also gained exposure to gene set testing via online databases such as DAVID. Issues arising from the multivariable survival analysis included testing for interactions and the proportional hazards assumption, plus the identification of influential observations. This project has consolidated the STATA programming skills I have learnt during the course. Ethical considerations The project was carried out in accordance with NHMRC ethics guidelines. The data for the survival analyses were collected by Prof Sandra O’Toole and Dr Ewan Millar at the Garvan Institute, with ethics approval provided by St Vincent’s Hospital HREC. All data received were de-identified to maintain confidentiality. De-identified data for the PAM50 and RNA-seq cohorts were obtained from publicly available databases. 4 Student Declaration I declare this project is evidence of my own work, with direction and assistance provided by my project supervisors. This work has not been previously submitted for academic credit. Max Yan November 2019 Supervisor Statement I can confirm that Max conducted this work independently. This project incorporated a number of analyses both bioinformaticsand survival analysis based on the bioinformatics results. He has shown an excellent ability to construct the analyses and present them. He has also been rigorous in sticking to his time-line, on what has been a large project. Dr Kenneth Beath November 2019 5 PROJECT REPORT Title Gene expression and survival analysis of breast cancer subtypes defined by immunohistochemistry Abstract Introduction: Breast cancer in the clinical setting may be divided into four subtypes (luminal A, luminal B, HER2 and basal) by immunohistochemical (IHC) analysis. This project aims to 1) compare subtype classification by IHC vs. the PAM50 gene classifier, 2) investigate the gene expression profiles of the IHC subtypes and 3) investigate the prognostic implications of the IHC subtypes. Method: The concordance between the IHC and PAM50 classifiers was assessed in a group of 405 breast cancers used in a previous study by Breffer et al. [1]. For RNA-seq analysis, data for a group of 105 cancers subtyped by IHC were obtained from The Cancer Genome Atlas Program (TCGA), National Cancer Institute (NCI) [2]. The data were analysed with limma and edgeR to obtain the gene expression profiles for the IHC subtypes. An exploratory unsupervised hierarchical cluster analysis was performed to help understand the usefulness of these gene sets in providing information on IHC cancer subtypes. Lastly, breast cancer specific overall survival among the IHC subtypes was assessed in a cohort of 292 cancers from the Garvan Insititute. Results: The concordance between subtyping by PAM50 and IHC among the 405 tumours was 59%. For cancers classified as HER2 by IHC (HER2 IHC cancers), 94% (16 out of 17) were also classified as HER2 by PAM50. For basal IHC cancers the concordance with PAM50 was 86% (54 out of 63). For luminal A and luminal B IHC cancers, the 6 concordance with PAM50 was 59% (148 out of 251) and 28% (21 out of 74) respectively. Linear modeling of RNA-seq data showed basal IHC cancers were enriched for genes associated with tumour hypoxia, activation of the beta-catenin pathway, progression through the cell cycle and stem cell proliferation. Whereas as luminal A cancers showed down-regulation of genes associated with progression through the cell cycle and cell migration. An exploratory unsupervised hierarchical cluster analysis by the PAM50 gene set supported the usefulness of these genes in understanding the differences between the subtypes. Survival analysis using the Cox regression model showed, compared to luminal A IHC cancers, basal (HR = 4.08, 95% CI: 2.02 – 5.77), and HER2 (HR = 4.08, 95% CI: 2.14 – 7.78) cancers were associated with shorter breast cancer specific overall survival (p = 0.018). A tendency for poorer survival was seen for luminal B IHC cancers (HR = 2.04, 95% CI: 0.98 – 4.25) was observed. Conclusion: Breast cancer IHC subtypes have distinctive gene expression profiles that relate to their biological behaviour. Subtyping by IHC yields important prognostic information that may aid the clinical management of breast cancers. 7 Glossary Bioconductor package Bioconductor package provides tools for the analysis of high- throughput genomic data. It uses the R statistical programming language, is open source and allows for open development. camera Correlation Adjusted MEan RAnk gene set test. This test for enrichment of a particular gene annotation category among a gene set (composed of genes that are differentially expressed). Unlike previous methods, camera does not assume the genes within the set are independent (i.e. not correlated). cDNA microarray A cDNA microarray allows the measurement of a large number of genes simultaneously. mRNA is reverse transcribed into cDNA. The cDNA, attached with a dye, binds to a matching sequence on a spot within the array, and is visualised. It has largely been replaced by RNA-seq. Counts per million In RNA-seq analysis, cpm is the count of the sequenced fragment of (cpm) interest divided by the total number of reads times one million. DAVID Database for Annotation, Visualization and Integrated Discovery. DAVID provides a set of functional annotation tools to understand biological meaning behind large list of genes. Distant metastasis Breast cancers that have spread to distant sites, beyond the breast and local axillary lymph nodes. edgeR Empirical Analysis of Digital Gene Expression Data in R. This is a Bioconductor package used for differential expression analysis of gene expression data. Endocrine (hormone) Treatment

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    64 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us