2012 ANNUAL REPORT
CIHR STAGE Program Advisory Committee Meeting
January 24 & 25, 2013
Toronto, Ontario

PROGRESS REPORT
01 Co-Directors Message (page 1)
02 Co-principal Investigators, Co-investigators, and Collaborators (page 3)
03 Summary of Changes (page 5)
04 Progress (page 8)
05 List of Acronyms (page 17)
06 Governance Structure (page 18)

RECRUITMENT AND BUDGET
01 STAGE Competitions: Applicant-Publication Records-Comparison (page 20)
02 Projected and Actual Admissions (page 22)
03 Current Trainees (page 23)
04 Alumni (page 25)
05 Trainee Productivity (page 26)
06 Trainee Awards, Distinctions, and Honours (page 27)
07 Budget (page 29)

CURRICULUM
01 Components, Training Objectives, and Assessable Outcomes (page 30)
02 Integrative and Cross-disciplinary Courses (page 32)
03 International Speaker Seminar Series (page 34)

APPENDICES
A Summary of Grant Proposal
B List of Trainee Publications
C Representative Trainee Publications
D List of Mentors
E Mentors-Mentee Agreement
F Syllabi-Integrative Courses
G Steering Committee Meeting Minutes, Oct. 2012
H STAGE International Internship and Travel Award Programs
I Syllabi and Agendas for Professional Development Courses and Workshops
J Trainee Annual Progress Report and Exit Survey
K GAW18 Workshop - Resulting Papers and Participating STAGE Mentors and Trainees
L Mentoring Commitment Message

STAGE - PAC 2012 Annual Report Page 5 01 CO-DIRECTORS MESSAGE Drs. SHELLEY B. BULL & FRANCE GAGNON

Dear members of the Program Advisory Committee:

Thank you very much for your wise counsel and valuable guidance regarding the progress of STAGE and its future directions. Your critical evaluation of how STAGE fulfils its mission and objectives, and how it can improve, is essential. The program’s midterm review by the Canadian Institutes of Health Research (CIHR) (Strategic Training Initiative in Health Research (STIHR) Program) in November 2013 will determine funding for the last three years of operation, until 2016.

This document updates and incorporates the text of the Annual Progress Report submitted to CIHR on November 1, 2012, and adds program statistics useful to the Program Advisory Committee (PAC). The text component (pages 3-15) approximates what is expected to be submitted to CIHR for its midterm review in late 2013. CIHR is expected to evaluate how well STAGE has met its own mission and objectives and those of STIHR, listed on page 8. Most importantly, we would like to know if STAGE is on the right course to achieve its Vision to become a destination for training in genetic epidemiology and statistical genetics for Canadian and international trainees.

We ask that you carefully consider the information in this report in your assessment of the performance of STAGE, and that you prepare comments and recommendations for the January 24 & 25, 2013 PAC Meeting, including:
• how accurately the information provided reflects STAGE’s accomplishments,
• how STAGE reporting materials might be improved or modified to better reflect program successes, progress, and challenges,
• how well STAGE is fulfilling each STIHR and STAGE objective,
• your suggested specific program improvements in the context of each STIHR and STAGE objective,
• your assessment of how STAGE compares to other training programs in genetic epidemiology and statistical genetics, internationally, and
• any other relevant or useful guidance regarding STAGE that you would like to provide.

Shelley B. Bull, PhD.
Co-Director, CIHR STAGE
Senior Investigator, Samuel Lunenfeld Research Institute of Mount Sinai Hospital
Professor, Division of Biostatistics, Dalla Lana School of Public Health
University of Toronto

France Gagnon, MSc., PhD.
Co-Director, CIHR STAGE
Canada Research Chair in Genetic Epidemiology
Associate Professor, Division of Epidemiology, Dalla Lana School of Public Health
University of Toronto

02 CO-PRINCIPAL INVESTIGATORS, CO-INVESTIGATORS, AND COLLABORATORS

STIHR REPORT INSTRUCTION
List all co-principal investigators, co-investigators, primary mentors and collaborators (including international collaborators) along with their university affiliations and a description of their roles and activities. Describe briefly significant changes, if any, which have occurred during the reporting period. Include changes to the program’s team (co-investigators no longer participating, collaborations that are no longer in place, etc.), and list new co-investigators and collaborations.

STAGE RESPONSE
The current complement of 43 STAGE mentors encompasses three discipline-based mentor pools (11 mentors in Genetic and Molecular Epidemiology, 14 in Statistical Genetics and Genomics, and 18 in Bio-Medical Genetics). To date, 20 (47%) mentors have supervised STAGE trainees (9 (21%) as primary mentors and 18 (42%) as co-mentors; see Table 1, below), 17 (40%) contribute to teaching courses on the STAGE curriculum, and most mentors participate actively in the program through attendance at the ISSS, research collaborations with STAGE trainee teams, and service on the Admissions and/or Steering Committee. Appendix D lists all STAGE mentors and their involvement in the program, and Table 2, on page 4, lists new major mentor distinctions.

TABLE 1. TRAINEE-MENTOR TEAMS

| Mentor, by Research Area | # of Trainees | As Primary Mentor | As Co-mentor | Primary Mentees (Graduate, Postdoctoral Fellow, or Visiting Scholar) | Co-mentored Trainees |
|---|---|---|---|---|---|
| Genetic & Molecular Epidemiology | | | | | |
| Gagnon, France | 3 | 3 | 1 | Kung, Tabitha; Dennis, Jessica; Ladouceur, Martin | • |
| Hung, Rayjean | 3 | 2 | 1 | Brenner, Darren; Fehringer, Gord | • |
| Paterson, Andrew | 6 | 1 | 5 | Woodbury-Smith, Marc | ••••• |
| Knight, Julia | 2 | | 2 | | •• |
| Statistical Genetics & Genomics | | | | | |
| Briollais, Laurent | 2 | 1 | 1 | Wu, Yan Yan | • |
| Bull, Shelley | 4 | 3 | 1 | Faye, Laura; Chen, Zhijian; Yilmaz, Yildiz | • |
| Craiu, Radu | 1 | | 1 | | • |
| Lemire, Mathieu | 1 | | 1 | | • |
| Strug, Lisa | 4 | 2 | 2 | Weili, Li; Miller, Melissa | •• |
| Sun, Lei | 4 | 1 | 3 | Derkach, Andriy | ••• |
| Biomedical Genetics | | | | | |
| Andrulis, Irene | 1 | | 1 | | • |
| Danska, Jayne | 1 | 1 | | Jiang, Yue | |
| Kennedy, James | 1 | 1 | | Oliveira, Vanessa | |
| Liu, Geoffrey | 2 | | 2 | | •• |
| Lye, Stephen [1] | 1 | | 1 | | • |
| Pausova, Zdenka | 1 | | 1 | | • |
| Petronis, Arturas | 1 | | 1 | | • |
| Rommens, Johanna | 3 | | 3 | | ••• |
| Scherer, Steve | 1 | | 1 | | • |
| Siminovitch, Katherine [2] | 1 | | 1 | | • |

Primary mentors and mentees are shown in the same row; • = co-mentored trainee.
[1] Biomedical mentor invited to accommodate trainee-specific research interests unavailable from initial mentor-pool.
[2] Ad hoc biomedical mentor invited to accommodate trainee-specific research interests unavailable from initial mentor-pool.


Mentors: New Distinctions and Leadership Roles

TABLE 2. NEW DISTINCTIONS AND LEADERSHIP ROLES

Name: Steven Narod
Organization: Royal Society of Canada
Award Name/Distinction: Elected Fellow. Through their exceptional work, these new Fellows pursue the distinguished work of a long line of researchers and creators who have contributed to expand Canada’s intellectual, artistic and scientific resources to support Canada’s population and its international scope.
In recognition of: Steven Narod has proven that hereditary breast/ovarian cancers are preventable, and he has also found that many Ontario women with BRCA1/2 mutations are ineligible for provincially funded genetic testing. For women unwilling to undergo radical surgeries, he is pinpointing dietary options that reduce risk. His database of 12,000+ women from 30 countries supports numerous international collaborations. Author of over 550 peer-reviewed publications, Dr. Narod has an H-index of 84.

Name: James Kennedy
Organization: Royal Society of Canada
Award Name/Distinction: Elected Fellow. Through their exceptional work, these new Fellows pursue the distinguished work of a long line of researchers and creators who have contributed to expand Canada’s intellectual, artistic and scientific resources to support Canada’s population and its international scope.
In recognition of: James Kennedy’s innovative research has resulted in pioneering discoveries relating variants to psychiatric disorders, brain imaging and treatment response. He has found genetic predictors of risk for attention deficit disorder, schizophrenia, obsessive compulsive disorder, mood disorders, and medication side effects including tardive dyskinesia, drug-induced mania and weight gain. He has translated these findings into pharmacogenetic tests in clinical care, and influenced pharmacogenetic research and its application at an international level.

Name: Shelley B. Bull
Organization: International Genetic Epidemiology Society
Award Name/Distinction: Leadership Award
In recognition of: Shelley B. Bull’s contributions and for being instrumental in shaping the International Genetic Epidemiology Society

Name: France Gagnon
Organization: Human Genetics
Award Name/Distinction: Guest editor for the Human Genetics special issue on Genetic Epidemiology: Study Designs and Methods post-GWAS.
In recognition of: N.A.

Name: Lei Sun
Organization: International Genetic Epidemiology Society
Award Name/Distinction: Selected as a member of the Editorial Board for Genetic Epidemiology, the official journal of the International Genetic Epidemiology Society.
In recognition of: N.A.

03 SUMMARY OF CHANGES

STIHR REPORT INSTRUCTION
Describe any changes in design, direction, objectives, team makeup, collaborations or milestones, focusing on the reporting period of September 1, 2011 to August 31, 2012. Attach, as Appendix A, a copy of the Summary of Research Proposal from the original application.

STAGE RESPONSE

1. MENTORING POOL

APR. 2012
Dr. Jo Knight, newly recruited to the Toronto Centre for Addiction and Mental Health (CAMH) from King’s College, London, was invited and accepted to serve as a STAGE mentor in Statistical Genetics and Genomics.

DEC. 2012
Five new mentors were invited to join the STAGE team. Drs. Akbari, Bassett, Hu and Wilson have formally accepted. Confirmation from S. Lye is still pending. Please see research and expertise information in Table 3, below.

TABLE 3. MENTORS FORMALLY INVITED INTO STAGE (DEC 2012)

| Name | Research Area | Specific Expertise |
|---|---|---|
| Mohammad Akbari | Biomedical Genetics | Genetic susceptibility to breast, ovarian, esophageal, pancreatic and prostate cancers |
| Anne Bassett | Biomedical Genetics | Psychiatry |
| Howard Hu | Genetic & Molecular Epidemiology | Environmental and epigenetic epidemiology |
| Stephen Lye | Biomedical Genetics | Birth cohort, child developmental trajectories |
| Michael Wilson | Biomedical Genetics | Functional and comparative genomics |

With the goal of offering the most diverse research and mentoring opportunities to trainees, while increasing the percentage of active and committed mentors, STAGE mentors were invited to confirm their commitment to STAGE, consistent with Program mentor-mentee expectations created on recommendation of, and approved by, the Steering Committee in Dec. 2011 (see Appendix L). This process aligns trainee-research interests, as they become known, with mentoring pool expertise. Mentor responses are yet to be tallied.

STAGE is refining procedures and criteria for inviting new mentors, taking into account and addressing gaps in expertise of the initial mentor pool.

At its 2012 meeting the PAC recognized the value of STAGE mentoring trios and also recommended flexibility in mentoring arrangements, recognizing, for some students, the usefulness of an advising duo, while still allowing for interdisciplinarity in the mentoring team.

2. ADMISSIONS COMMITTEE

FEB. 2012
Population genetics mentor and Admissions Committee member Esteban Parra is on leave until Feb. 2013.

APR. 2012
Statistical genetics mentor Mathieu Lemire was invited, and agreed, to attend for Esteban Parra on the Admissions Committee.

DEC. 2012
Biomedical genetics mentor and Admissions Committee member Jayne Danska was unavailable for the Dec. 2012 Admissions Committee meeting. Biomedical genetics mentor Zdenka Pausova was invited, and agreed, to attend for Jayne Danska at the Dec. 2012 Admissions Committee meeting.

3. INITIATION OF CURRICULUM COMPONENTS
In the original STAGE timetable, professional development and integrative sessions had been planned to begin in 2011. More effective modalities for these sessions were later found and implemented, on a longer timeline, consistent with PAC 2012 recommendation, leading to highly successful offerings in 2012:

3.1 PROFESSIONAL DEVELOPMENT
STAGE improved its professional development component by organizing, collaborating with, or taking advantage of existing university and hospital network resources.

• STAGE teams with other training programs at UofT, leveraging resources to organize annual joint training sessions on topics of common interest, such as social networking tools for researchers in Feb. 2011, ethics in May 2012, and effective communication of research in Apr. 2013. Agendas for training sessions are detailed in Appendix I. Planned 2013 workshop topics include: translating knowledge to academics, stakeholders, and the public; funding research; research in the media; and privacy. Training programs at UofT invite each other’s trainees to relevant sessions, such as the workshop on academic hiring, which was held in Feb. 2012.

• Biomedical geneticist mentor Geoffrey Liu and statistical geneticist mentor Wei Xu have, in collaboration with STAGE directorship and program coordinator, committed to organize and deliver multi-session workshops on communications for successful interdisciplinary research teams and on grant writing for STAGE trainees in spring and fall 2013. Objectives include training statisticians/biostatisticians to communicate effectively with epidemiologists, clinicians, and biomedical scientists and vice versa, and teaching grant writing skills by reviewing sample sections of successful and unsuccessful grant applications and explaining the grant review process.

• The Chair of UofT’s Department of Biochemistry, Dr. Reinhart Reithmeier, now allows STAGE trainees to participate in a newly established graduate level course focused on developing the academic and professional skills required to succeed during and beyond graduate education in basic biomedical sciences. The course comprises: PI relationships and mentoring, enhancing research ability skills, problem-solving techniques, leadership, finding successful collaborations, research ethics, developing strong written and oral communication skills, further training as a postdoctoral fellow, effective networking, integrating family commitments, career transitions, CVs and resumés, career options in and out of academia, best methods of searching for and landing the job, staff management, global scientific issues, clinical applications, social implications, and maintaining career development. Syllabus detailed in Appendix I.

• Annual grant-writing workshop offered by UofT’s Faculty of Medicine. This workshop has a special focus on grant writing for CIHR and NSERC opportunities and panel discussions in basic and clinical science areas, population health, health services, and knowledge translation.

[Photo: Trainees from 15 UofT-based STIHR training programs at 2012 Cross-STIHR Research Training Day]

3.2 INTEGRATIVE SESSIONS
To maximize trainee research time in a rigorous program, the integrative sessions component of STAGE has been integrated into other program activities while preserving their purpose and outcomes. In the summer of 2012, four STAGE mentors were involved in the coordination of a local genetics analysis workshop (GAW) with the participation of 13 STAGE mentors and four trainees. In Oct. 2012, four of these mentors and two trainees travelled to Stevenson, Washington, USA to participate in the International GAW18, which emphasized international collaborations to evaluate and compare statistical genetics methods and study design options in view of current challenges in complex diseases in genetic epidemiology and statistical genetics research. Travel expenses for STAGE trainees were covered by STAGE and through matching funds provided by the McLaughlin Research Institute for Biomedical Sciences in Toronto.

3.3 LOCAL RESEARCH COMMUNITY MEETINGS WITH ISSS SPEAKERS - NEW!
In March 2012, consistent with 2012 PAC recommendation to enable further science-focussed interactions between faculty or senior trainees and junior trainees, STAGE began highly successful one-on-one/small group meetings where trainees, mentors, and community members meet with guest speakers from the program’s International Speaker Seminar Series (ISSS) to exchange scientific ideas, further enhance existing collaborations, or establish new ones.

3.4 TRAINEES’ MEET THE INVESTIGATOR SERIES - NEW!
The Meet the Investigator Series was initiated to leverage the availability of visiting guest speakers from the program’s ISSS, in response to trainees’ wishes for organized group sessions with speakers, and for increased trainee-trainee interactions, consistent with PAC 2012 recommendations.

In Oct. 2012, STAGE held its first one-hour “Meet the Investigator” to bring together STAGE trainees and ISSS speakers in a casual setting to support trainee career development by learning from the speakers’ professional experiences.

04 PROGRESS

STIHR REPORT INSTRUCTION Describe the objectives set for your Training Program and the extent to which the objectives of your grant, and the Strategic Training Initiative in Health Research (STIHR) as a whole, have been achieved (see below for a list of the STIHR objectives). You must address all of the STIHR objectives individually. In addition, attach, as Appendix B, a complete list of publications for your program (submitted, in press and published, in the usual CIHR format) since the grant began. This section should only contain publications where the authors include program trainees (including their percent contribution). Also attach, as Appendix C, a copy of up to five representative publications from the reporting period.

STAGE RESPONSE The STAGE Objectives, Mission, and Vision closely mirror the STIHR objectives:

STIHR OBJECTIVES
The overall mandate of the Strategic Training Initiative in Health Research is to increase the capacity of the Canadian health research community to produce high-quality graduates capable of addressing major health issues and/or health research challenges. The objectives of Training Grants are to:
1. Increase the capacity of the Canadian health research community, including areas where it can be demonstrated that there is a need to develop capacity.
2. Enable recruitment and retention of highly qualified individuals from Canada and abroad to undertake health research training in Canada.
3. Support the development of innovative, effective, transdisciplinary, and internationally competitive Training Programs.
4. Engage new mentors and educators in the development and evolution of training strategies.
5. Encourage programs that:
• Embrace diverse research disciplines and methodological approaches to resolve major health issues and scientific challenges;
• Integrate training and discussion on the ethical conduct of research and related ethical issues;
• Develop and measure the individual’s communication, teamwork, and leadership skills including grant writing and peer review;
• Incorporate effective research strategies that translate knowledge into practice.

STAGE OBJECTIVES
1. To implement a dynamic multi-disciplinary training network of committed mentors and engaged partners that will foster trainees’ ability to bridge laboratory and population-based investigation of common diseases;
2. To promote “hands-on” cross-disciplinary research in epidemiological, statistical, and biological settings to enable trainees to appreciate and address the challenges in study design, analysis, and molecular technologies used in genetic epidemiology studies; and
3. To prepare the ground for expanded training opportunities across Canada and abroad, by nurturing existing partnerships and forging new ones, through trainee exchange and internship programs at the local, national, and international levels.

STAGE MISSION
• To increase research capacity in genetic epidemiology and statistical genetics, with the overall goal of improving the prevention and management of common diseases through genetic epidemiologic research; and,
• To deliver a comprehensive program with breadth and depth to foster a generation of scientists and highly qualified personnel capable of conducting and leading innovative, inter-disciplinary research in genetic epidemiology and statistical genetics: rigor in study designs and methods, building on, respectively, cross-disciplinary training and strength in core disciplines.


STAGE VISION
To establish STAGE as a globally recognized destination for Canadian and international trainees in genetic epidemiology and statistical genetics.

In its first PAC reporting period (Jan. 2010-Dec. 2011), STAGE recruited a program coordinator; formalized its governance structure (PAC and Steering Committee) and curriculum (including approval to establish a new graduate-level course, “Statistical Methods for Genetics and Genomics”, led by STAGE Co-Director S.B. Bull and mentor A. Paterson); established admissions procedures; and launched its ISSS.

In its second PAC reporting period (Jan. 2012-Dec. 2012), STAGE improved its website for community building and visibility by adding sections on employment opportunities, events of interest, community announcements, and funding opportunities for trainees. STAGE also enhanced its curriculum with career development electives (including ethics and grant-writing) and integrative STAGE opportunities for trainees to participate in GAW18 (detailed in Appendix K); developed and implemented travel award and international internship programs (detailed in Appendix H); held its second annual Steering Committee meeting (minutes in Appendix G) and first annual PAC meeting; submitted its second annual report to CIHR and to the PAC; and instituted annual trainee reporting systems to inform and support the ongoing improvement and growth of STAGE (detailed in Appendix J).

The ISSS has booked 24 seminars, through 2013, made possible by the generous sponsorship of some $100,000 of cash and in-kind funding commitments from six partner organizations from 2010 through to 2016:
• CIHR Institute of Genetics (Meetings, Planning and Dissemination Grant awarded Nov. 2011)
• MITACS
• Ontario Cancer Institute of the University Health Network
• Ontario Institute for Cancer Research (OICR)
• Samuel Lunenfeld Research Institute
• The Hospital for Sick Children (SickKids, Genetics & Genome Biology Program)

STAGE trainees meet and learn from top international experts. Each seminar attracts 60-70 people locally and some 20 remotely. To date, videoconference participation has increased from 10 to 16 institutions across Canada. ISSS participation and geographical presence are shown in Figs. 3 and 4, on page 36.

STAGE has held five admission competitions. Many individuals show interest in the program, but only those few who fulfill strict program pre-requisites are encouraged to apply. Table 4 sets out competition dates and admissions, including number of applicants by competition. Fig. 1, on page 10, shows admissions by discipline.

TABLE 4 - STAGE APPLICATIONS AND ADMISSIONS

| Date | Apps. Received | Admitted | Received Applications by Type | Admitted Trainees by Type |
|---|---|---|---|---|
| Nov 2010 | 9 | 6 | 3 PhDs, 6 PDFs | 3 PhDs, 3 PDFs |
| Mar 2011 | 5 | 2 | 1 Master’s, 4 PDFs | 1 Master’s, 1 PDF |
| Oct 2011 | 6 | 4 | 2 PhDs, 3 PDFs, 1 Visiting Scholar | 2 PhDs, 1 PDF, 1 Visiting Scholar |
| Apr 2012 | 5 | 3 | 5 PDFs | 3 PDFs |
| Nov 2012 | 9 | TBD | 1 PhD, 8 PDFs | TBD, early 2013 |
| Totals | 34 | 15+ | | |

Nine (60%) admitted trainees have secured funding ($1,179,000) from various competitions (fellowships, operating grants, travel awards, etc.) before or while in STAGE, such as M. Woodbury-Smith (CIHR Institute of Genetics Clinical Investigatorship Award; ranked #1, $280,000, 2012-2014) and Y. Yilmaz (Mprime/Mitacs-Accelerate NCE Postdoc Industrial Research Project Award; $140,000, 2011-2012). For full details, see Table 5, page 27. Six trainees were attracted to or retained in Canada from other countries for STAGE training: in Y1, Z. Chen from China, V. Oliveira from Brazil, and Y. Yilmaz from Turkey, and in Y2, M. Miller from the United States, J. Yue from China, and M. Woodbury-Smith from the UK.

STIHR OBJECTIVE 1 - INCREASE THE CAPACITY OF THE CANADIAN HEALTH RESEARCH COMMUNITY, INCLUDING AREAS WHERE IT CAN BE DEMONSTRATED THAT THERE IS A NEED TO DEVELOP CAPACITY
Training and retention of individuals capable of working at the interface of genetics, genomics, biology, statistics and mathematics, and population health sciences responds to a deficit in national and international research capacity. A 2007 survey by J. Graham (Simon Fraser University) and STAGE Co-Director S.B. Bull showed that many Canadian genetic epidemiology and statistical genetics trainees seek higher-level training or work in other countries: 35% of PhDs and 40% of PDFs leave Canada to train or work elsewhere. At the recommendation of STIHR reviewers and the PAC, in Y2, STAGE focused on PhD and PDF recruitment to address this training and retention deficit. Seven trainees admitted in Y2 comprise two PhDs, four PDFs, and one visiting scientist appointed at McMaster University. These numbers fulfill numeric admission goals in the original STAGE submission, and reallocate two anticipated Master’s positions to more senior trainees, doubling to four the number of PDFs.

STAGE intends to continue to broaden trainee pool diversity by building on its demonstrated success attracting trainees from various disciplines and research areas, which have to date included: cancer, computer science, rheumatology, molecular biology, neurobiology, epidemiology, virology, psychiatry, medicine, microbiology, bioinformatics, evolutionary genetics, biostatistics, statistics, and mathematics.

[Fig. 1. Admitted trainees by discipline, 2010-2012. Legible segments: Epidemiology, 27% (4 trainees); Biomedical, 20% (3 trainees).]

STIHR OBJECTIVE 2 - ENABLE RECRUITMENT AND RETENTION OF HIGHLY QUALIFIED INDIVIDUALS FROM CANADA AND ABROAD TO UNDERTAKE HEALTH RESEARCH TRAINING IN CANADA
A key aim, which is well underway, is to establish STAGE as a globally recognized destination for Canadian and international trainees in genetic epidemiology and statistical genetics. Building on broad Y1 recruitment activities, STAGE program Co-Directors, coordinator, and mentors have continued and expanded an ongoing cycle of presentations and dissemination of print and electronic materials to partnering faculties at the University of Toronto (UofT), other universities internationally, national and international conferences (e.g. the American Society of Human Genetics Meeting, the joint Canadian Human Genetics Conference and Canadian Genetic Epidemiology and Statistical Genetics Meeting) and professional societies (International Genetic Epidemiology Society, American Society of Human Genetics).

As part of its International Trainee Exchange & Internship Program, described in its grant application, STAGE has, for some time, been preparing to consolidate and expand its existing partnership with the Brazilian Agency for Graduate and Post-Graduate Education (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, ‘CAPES’).

In this vein, on Sept. 11, 2012, STAGE representatives, including Brazilian STAGE trainee V. Oliveira, participated in a very positive high-level meeting led by DLSPH, on the subject of potential research collaborations. Event attendees included high profile delegates from three major Brazilian universities. Additional discussions to expand the CAPES partnership are planned in the coming year, with the intention of attracting excellent [...] recruitment venue. Seminar attendees and guest speakers are invited to, and frequently show interest [...]

[Photo: Audience at Jan. 2012 International Speaker Seminar Series]

All three STAGE alumni to date have secured research positions on finishing the program, including one former PhD student who is now a postdoctoral fellow at the International Agency for Research on Cancer, where he had before been a PhD-level STAGE intern.

STAGE tracks and will continue regular follow-up of current and graduating trainees, with annual and exit questionnaires, to assess STAGE experience and measure long-term retention of high-quality personnel in genetic epidemiology and statistical genetics. The questionnaires are set out in Appendix J.

STIHR OBJECTIVE 3 - SUPPORT THE DEVELOPMENT OF INNOVATIVE, EFFECTIVE, TRANSDISCIPLINARY, AND INTERNATIONALLY COMPETITIVE TRAINING PROGRAMS
STAGE continually refines its rigorous, creative curriculum of core, integrative, and cross-disciplinary courses, practica, and leadership training. Epidemiologists and statisticians are given depth and breadth in their core disciplines, and also develop fundamental knowledge in genetics and related biological principles and concepts. Biologists acquire knowledge and skills in quantitative population sciences. To complement requirements of their core program or research (i.e. epidemiology, statistics, biostatistics, or life sciences), all STAGE trainees participate in “integrative” courses on the “Fundamentals of Genetic Epidemiology,” “Statistical Genetics,” “Molecular Anthropology: Theory & Practice,” an elective, and cross-disciplinary courses to address gaps and complement training in their core disciplines, to support cross-disciplinary training needed for genetic epidemiology and statistical genetics research (pages 29-32). An “innovative trio” of mentors, including an epidemiologist, statistician, and biomedical scientist, supervises STAGE trainees. Monthly STAGE ISSS seminars expose trainees to leading edge international genetic epidemiology and statistical genetics research, while promoting the increasingly integrated STAGE mentor and trainee community to key international scientists and potential collaborators.

Trainee productivity is excellent and continues to improve, with 37 refereed papers published (18 in Y1, 19 in Y2), and five book chapters (2 in Y1 and 3 in Y2), in publications and journals in each of the three STAGE main disciplines (epidemiology, statistics, and genetics) as well as journals dedicated to public health and chronic diseases (e.g. cancer, rheumatic diseases). Trainees also delivered 65 oral and poster presentations (34 in Y1, 31 in Y2) of peer-reviewed findings and other work at national and international events.

[Photo: Mentors and trainees at 6th Annual Canadian Genetic Epidemiology & Statistical Genetics Meeting (May 2011, King City, ON)]

An objective of Strategic Training Program Grants set out in the CIHR website is “to support the development of training programs that improve the mentoring and training environment for health researchers.” In support of this objective, STAGE has adopted the following goals:
• To develop a national perspective of colleagues’ research interests and expertise
• To foster a broader sense of community

These are accomplished through STAGE trainee participation at Annual Canadian Genetic Epidemiology and Statistical Genetics Meetings where they showcase their projects through oral and poster presentations, and network with Canadian colleagues involved in genetics research. Mentors are also encouraged to attend. In 2012, two STAGE mentors were invited speakers, and five STAGE trainees were funded to attend the meeting.

(Right to left) STAGE Trainees J. Dennis, T. Kung, M. Miller, and Z. Chen and meeting participant at joint Canadian Human Genetics Conference and Canadian Genetic Epidemiology and Statistical Genetics Meeting (Apr. 2012, Niagara-On-the-Lake, ON)

The STAGE grant application set out the objective of developing and applying trainee skills in writing successful grant applications and scientific papers. STAGE trainees have exceeded these expectations. Nine (60%) trainees have secured funding in the form of travel or external competitive fellowships either before applying for or during their STAGE training (page 27). More importantly, their ability to secure quality external funding through competitive processes confirms the quality of STAGE mentorship and the excellence of trainee work. They are, in the view of the program, successful individuals, demonstrably highly qualified, and expected to succeed in their research aims. Such trainees should continue to increase the capacity of the health research community, in areas of focus for STAGE.

STAGE requires trainees to submit, on application, then annually, a list of funding applications intended for the year, until funding has been secured. Of eight trainees admitted in Y1, six obtained funding through competitive fellowship/studentship awards. Of six trainees admitted in Y2, three were likewise successful.

The STAGE annual trainee survey tracks academic and achievement awards. In Y2, Dr. T. Kung received the W. Harding LeRiche Prize (awarded annually to the DLSPH Master’s in Public Health student in the epidemiology specialization with the highest standing in course work in epidemiology), and J. Dennis was awarded a highly competitive Vanier Canada Graduate Scholarship award and a Michael Smith travel award.

STIHR OBJECTIVE 4 - ENGAGE NEW MENTORS AND EDUCATORS IN THE DEVELOPMENT OF TRAINING STRATEGIES

STAGE has capitalized on a critical mass of genetic epidemiologists and statistical geneticists, along with a high concentration of biomedical researchers in complex disease genetics at UofT and affiliated institutions, to attract 43 mentors from different and complementary disciplines. STAGE mentors were selected for excellent science, academic productivity, collaborative research approach, research training success, and a commitment to the success of STAGE. Over Y2, 18 mentors co-supervised STAGE trainees and 15 taught curriculum courses. Others participated through ISSS attendance, research collaborations with STAGE trainee teams, and service on Admissions and/or Steering Committees (see Appendix D).

Several STAGE mentors not previously engaged in teaching now lecture and develop STAGE course materials, often crossing disciplines. For example, in 2010-2011, genetic epidemiologist R. Hung taught a designated STAGE cross-disciplinary graduate course on the epidemiology of non-communicable diseases. Statistical geneticist L. Briollais teaches in a STAGE cross-disciplinary course on categorical data analysis led by statistician L. Sun. Population geneticist E. Parra and statistical geneticist L. Strug have each guest lectured for the mandatory STAGE integrative course “Fundamentals of Genetic Epidemiology.” L. Strug also developed a new curriculum on “Comparative Statistical Paradigms,” now offered as an advanced STAGE course. Newly recruited biomedical mentor M. Akbari coordinated the “Fundamentals of Genetic Epidemiology” course of STAGE Co-Director F. Gagnon during her sabbatical in fall 2012.

Strategically selected new mentors were added to the pool of mentors to provide new


research topic opportunities and expertise to STAGE trainees. See Table 3 on page 5.

STIHR OBJECTIVE 5A - ENCOURAGE PROGRAMS THAT EMBRACE DIVERSE RESEARCH DISCIPLINES AND METHODOLOGICAL APPROACHES TO RESOLVE MAJOR HEALTH ISSUES AND SCIENTIFIC CHALLENGES

STAGE mentors are leading scientists in diverse research disciplines, as illustrated by the scope of departmental affiliations, which include Epidemiology and Biostatistics, Statistics, Public Health, Pharmacology, Computer Science, Medical Sciences, Clinical Epidemiology, Molecular and Medical Genetics, Medical Biophysics, Laboratory Medicine and Pathology, Psychiatry, Cell and Molecular Biology, Health Policy Management and Evaluation, Immunology, and Rheumatology. STAGE offers diverse common disease research opportunities including cancers, cardiovascular diseases, psychiatric, neurological, metabolic and autoimmune disorders, asthma and allergies, and growth and development disorders.

STAGE trainees are developing expertise in diverse research specialties, including population, evolutionary and statistical genetics, pharmacogenomics, functional and comparative genomics, and genetic epidemiology. STAGE trainee research spans a broad spectrum of complex disorders, including epidemiological and genomic research to investigate genetics of lung, head, neck and breast cancer, rheumatoid arthritis, pain, diabetes and metabolic syndrome, addictions, mental health, venous thromboembolism, cystic fibrosis, and autism.

In Y1, STAGE trainees integrated methodologies from several disciplines to effectively address scientific challenges. For example, in Forse*, Yilmaz et al. (*co-first authors; submitted), PDF trainee Yilmaz, under mentors S.S. Bull and I. Andrulis, integrated statistical methods for cure-rate modelling to yield novel insights into the nature of the association of elevated expression of podocalyxin with tumour subtypes and clinical outcome in axillary lymph node-negative breast cancer, leading to new directions in molecular genetic and translational clinical research. In Kung et al. (submitted), Master’s trainee T. Kung, under mentors F. Gagnon and K. Siminovitch, established, from cumulative evidence, that “RFC-1 A80G is a genetic determinant of efficacy but not toxicity in the treatment of Rheumatoid Arthritis with methotrexate.” This has potential to influence not only future clinical research but also clinical decision making.

STIHR OBJECTIVE 5B - ENCOURAGE PROGRAMS THAT INTEGRATE TRAINING AND DISCUSSION ON THE ETHICAL CONDUCT OF RESEARCH AND RELATED ETHICAL ISSUES; PREPARE THEIR OWN ETHICS SUBMISSION

It is policy at UofT that all trainees prepare and submit their own ethics proposals; further, a dedicated ethics course, “Genomics, Bioethics and Public Policy,” is included as a cross-disciplinary elective on the STAGE curriculum. In Y2, STAGE implemented a mandatory requirement for all STAGE trainees to complete either the online course “Protecting Human Research Participants,” offered by the NIH Office of Extramural Research, or the “TCPS 2 Tutorial Course on Research Ethics,” offered by the Government of Canada Panel on Research Ethics, and provide STAGE with a certificate of completion. STAGE also participated in organizing and hosting a UofT based cross-STIHR annual Research Day: Ethics in Research: A scientific Lifecycle Approach (Apr. 2012), facilitated by Dr. Jaime Flamenbaum, Senior Policy Advisor, CIHR. The event attracted some 60 trainees from 15 UofT STIHRs.

STIHR OBJECTIVE 5C - ENCOURAGE PROGRAMS THAT DEVELOP AND MEASURE THE INDIVIDUAL’S COMMUNICATION, TEAMWORK, AND LEADERSHIP SKILLS INCLUDING GRANT WRITING AND PEER REVIEW

Trainees develop critical presentation, communication, and collaborative skills through weekly participation in the Statistical Methods for Genetics and Genomics research seminar and journal club (SMGG), in which they are required to present to peers and mentors, and receive their feedback. Mentors have involved trainees in joint journal peer review with them in at least 13 instances. Through collaboration with Health Care, Technology and Place CIHR Training Program (HCTP), based at UofT, STAGE trainees have participated in the HCTP-organized professional development workshop

on academic hiring (Feb. 2012, see Agenda in Appendix I), which included guidance on effective negotiations, interview, and presentation preparation. STAGE also obtained special permission for its trainees to participate in a UofT Faculty of Medicine grant-writing workshop (June, 2012), and in a Professional Development for Graduate Students course at the Department of Biochemistry (see syllabus in Appendix I). Further, STAGE trainees who can benefit from improved academic writing skills are encouraged to enroll in a UofT School of Graduate Studies course “Becoming a Better Editor of Your Own Work” for physical and life sciences students (see syllabus in Appendix I).

Planning for a UofT Cross-STIHR Annual Research day with a focus on “Effective Communication of Research” is currently underway for Apr. 2013 (see Agenda in Appendix I).

STAGE trainees are provided opportunities and are encouraged to seek and embrace leadership roles. For example, J. Dennis has been actively involved as a co-coordinator of the PhD Epidemiology Student Journal Club, as a member of the Epidemiology Curriculum Committee, and in Jan. 2013, as a reviewer on the PhD admission committee (Epidemiology Div.), at UofT’s DLSPH.

STIHR OBJECTIVE 5D - ENCOURAGE PROGRAMS THAT TRANSLATE KNOWLEDGE INTO PRACTICE

Translation and application of research strategies to ‘real world problems’ is a key STAGE outcome. In all initiatives developing statistical methods, for example, STAGE trainees must take their investigations beyond simulated data (in itself sufficient to achieve high-quality, high-impact publishable findings), to real life studies using population data. This pedagogical strength is reinforced through the involvement of several biomedical mentors and concomitant access to rich and diverse real life data sets.

PhD trainee J. Dennis (lower left) with members of Dr. David Tregouet’s team while interning at INSERM in Paris, France.

Opportunities for 2-8 month internships are integrated into the curriculum, and partnerships with industry, government, and institutional research labs are in place to support trainees in these knowledge translation initiatives at an appropriate point in their training. To date, three such placements have taken place, and a fourth has been approved as follows:

1. Internship at the International Agency for Research on Cancer/World Health Organization in Lyon, France, on the subject of Genome-wide investigations examining histology-specific lung cancer as the outcome within several studies of the International Lung Cancer Consortium (PhD trainee D. Brenner, internship from May 1 to Sept. 1, 2011).
2. Internship at INSERM UMRS 937 (Cardiovascular Genomics)/Université Pierre et Marie Curie in Paris, France, on the subject of the application of statistical methods for analyses of methylome array data on thrombosis and hemostasis related traits (PhD trainee J. Dennis, internship from May 4 to Sept. 14, 2012, funded through STAGE and B. Michael Smith Foreign Study Award).
3. Internship at the Harvard School of Public Health in Boston, USA, on the subject of applying novel methods for distinguishing disease-causing genetic variants from their highly correlated proxies in a unique next generation sequencing prostate cancer meta-analysis dataset (PhD trainee L. Faye, internship from Jun. 1 to Jul. 31, 2012).
4. The fourth placement has been approved at INSERM UMRS 937 (Cardiovascular Genomics)/Université Pierre et Marie Curie in Paris, France, to develop and compare novel statistical models for the analysis of methylome array data in multi-generational families, and association with

quantitative traits (PDF trainee M. Ladouceur, internship from Feb. 4 to Aug. 28, 2013).

In Dec. 2012, STAGE Co-Director F. Gagnon and program coordinator E. Berzunza met M. Bouvier d’Yvoire, Scientific Attaché to the Consulat Général de France à Toronto, to explore scholarship programs available to graduate students or PDFs in Canada who wish to pursue research internships in France.

STIHR REPORT INSTRUCTION
Describe the most innovative aspects of your Training Program; include any new materials, methods, tools, etc., that have been developed and put into use. Demonstrate that the program brings added value in terms of its approach to training and compare it to what would have happened in its absence.

STAGE RESPONSE
Before STAGE, limiting factors for capacity building in genetic epidemiology and statistical genetics included a lack of cross-disciplinary training opportunities and infrastructure to foster research training collaborations to address complex human disease questions, and the low visibility of the Canadian genetic epidemiology and statistical genetics community. STAGE overcomes these challenges by building cohesion and visibility in the Canadian genetic epidemiology community, by providing a comprehensive inter- and cross-disciplinary curriculum and research training - made particularly appealing through opportunities for scholarships, collaborations, mentorship, and valuable international research experience. The STAGE ISSS and its international internship program will increase the international visibility of STAGE. The monthly ISSS and the SMGG provide trainees ongoing illustrations of international excellence in the field.

Consolidation of a formalized, cross- and inter-disciplinary training program in genetic epidemiology and statistical genetics, with dedicated resources and a comprehensive curriculum, is new in Canada. Further, the ‘innovative trio’ model, in which STAGE trainees are supervised by an epidemiologist, a statistician, and a biomedical scientist, is the only example of its kind known to us (see Fig. 2). Some other programs offer co-mentorship with a biologist and statistician alone. This approach offers a biological perspective to statistician trainees (and a quantitative perspective to biomedical trainees), but fails to fully round out training in the fundamental principles and concepts of study design options and associated challenges, which is critical to plan and conduct effective, rigorous genetic epidemiology studies.

In a recent survey, participants spontaneously cited the program’s training structure as a particular strength:

“I am really enjoying the co-mentoring by my three expert supervisors…I am astonished how the ideas, critique and suggestions from all of them have enriched my project development. I’m also enjoying the list of courses that are offered by the STAGE. I think it is important to fill some gaps/deficiencies from my previous training.” PDF

“It has provided me with a unique opportunity to undertake training in Genetic Epidemiology, with a combination of taught courses and hands on research. I have been able to tailor my training to my own personal needs and I have been offered excellent supervision.” Visiting Scholar
Fig. 2 - CIHR STAGE Cross-disciplinary Training

STAGE has also supported the participation of four trainees and 14 mentors in the bi-annual GAW, which provide rich opportunities for STAGE trainees to work together, who otherwise might not, and to gain knowledge and experience with statistical genetic analysis on local teams of faculty and other trainees, and to collaborate with international colleagues. The 2012 GAW18 offered trainees the opportunity to apply and evaluate methodology for whole genome sequencing data in the context of large, complex pedigrees characterized by quantitative blood pressure measurements and hypertension status as well as GWAS data. Joint papers submitted address a range of topics including analysis of rare and common

sequence variants, modelling of pleiotropy in bivariate trait analysis, adjustment for treatment effects, multi-phase analytic strategies, comparison of linkage and association analysis in pedigrees, and use of gene annotation for variant prioritization (Appendix K).

STIHR REPORT INSTRUCTION
Describe any challenges you have faced and how you have addressed them.

STAGE RESPONSE
STAGE has attracted exceptional trainees, and met recruitment goals. It now intends to increase recruitment targets, as trainee success in securing external funding permits STAGE to support more trainees. As STAGE may not find enough excellent applicants within the Toronto area, in its Mar. and Nov. 2012 competitions, STAGE broadened recruitment efforts to international and pan-Canadian applicants.

Further, although STAGE aims to train future Highly Qualified Personnel (HQP), and not only future principal investigators, its admission criteria may favour the latter. The 2012 PAC recommendation was “that training both principal investigators and HQP is valuable and that there is no real need to distinguish between training for these two groups.” However, given the competitiveness of STAGE admissions and limited funding, the admission and training of HQP remains an issue.

STIHR REPORT INSTRUCTION
Describe the management structure, including composition and mandate of your PAC and any changes that have occurred to the composition and mandate of your PAC during the reporting period.

STAGE RESPONSE
Program Co-Directors F. Gagnon (nominated PI) and S.S. Bull (co-PI) administer STAGE. With feedback from the Steering Committee, they oversee STAGE operations, including curriculum, seminars, recruitment, admissions, and internship and practicum placements. The Steering Committee, chaired by F. Gagnon, is composed of ten STAGE mentors, the grant applicants, with representation from genetic and molecular epidemiology, statistical genetics and genomics, and biomedical genetics. The Steering Committee includes two ad hoc members, L. Palmer (OICR) and L. Strug (SickKids).

The Steering Committee meets annually and is consulted on an as-needed basis to support Program Co-Directors in setting the overall direction of STAGE. Steering Committee sub-committees are struck as needed to manage operational responsibilities for admissions, seminars, workshop, practica, and knowledge translation.

The PAC, chaired by M. Boehnke (University of Michigan), includes seven international leaders in genetic epidemiology, statistical genetics and biomedical genetics and meets annually to provide independent external evaluation of STAGE progress and direction, including long-term sustainability, to Program Co-Directors and the Steering Committee.

STAGE plans for 2013 include pursuing a Type 2 Graduate Diploma at the DLSPH and decreasing the number of competitions in the year from two to one, to optimize efficient use of limited resources and faculty, which are needed for training components of STAGE.

05 LIST OF ACRONYMS

ACRONYM | DEFINITION
CAMH | Centre for Addiction and Mental Health
CAPES | Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
CIHR | Canadian Institutes of Health Research
COM | Co-mentor
DLSPH | Dalla Lana School of Public Health, University of Toronto
HQP | Highly Qualified Personnel
ISSS | International Speaker Seminar Series
MITACS | Mathematics of Information Technology and Complex Systems
NSERC | Natural Sciences and Engineering Research Council of Canada
OICR | Ontario Institute for Cancer Research
PAC | Program Advisory Committee
PI | Principal Investigator
PM | Primary Mentor
SickKids | The Hospital for Sick Children
SLRI | Samuel Lunenfeld Research Institute
STAGE | Strategic Training for Advanced Genetic Epidemiology
STIHR | Strategic Training Initiative in Health Research
UofT | University of Toronto

06 STAGE GOVERNANCE
STEERING COMMITTEE

The Steering Committee supports the Co-Directors in setting the overall direction of CIHR STAGE, including curriculum, seminars, trainee recruitment, admissions and practicum placements.

France Gagnon, Co-Director, Chair
Associate Professor, UofT, DLSPH, Epidemiology Div.
Canada Research Chair in Genetic Epidemiology

Shelley Bull, Co-Director
Senior Investigator, SLRI
Professor, UofT, DLSPH, Biostatistics Div.

Rayjean J. Hung
Principal Investigator, SLRI
Assistant Professor, UofT, DLSPH Epidemiology Div.

Geoffrey Liu
Assistant Professor, UofT, DLSPH, Epidemiology Div.
Assistant Professor, Department of Medical Biophysics, UofT
Scientist, Ontario Cancer Institute, University Health Network, and Alan B. Brown Chair in Molecular Genomics, Princess Margaret Hospital
Visiting Scientist, Harvard School of Public Health

John R. McLaughlin
Senior Investigator, SLRI
Associate Professor, UofT, Department of Public Health Science

Steven Narod
Director, Familial Breast Cancer Research Unit, Women’s College Research Institute
Professor, UofT, DLSPH Epidemiology Div.
Canada Research Chair in Breast Cancer

Esteban J. Parra
Associate Professor, Department of Anthropology, UofT Mississauga

Andrew Paterson
Senior Scientist, Genetics & Genome Biology Program, SickKids
Associate Professor, UofT, DLSPH Biostatistics Div.
Canada Research Chair in Genetics of Complex Diseases
Program Director, Computational Genomics, McLaughlin Centre for Molecular Medicine, UofT

Steve Scherer
Director, The McLaughlin Centre for Molecular Medicine, UofT
Professor, UofT, Department of Molecular and Medical Genetics
Director, The Centre for Applied Genomics; Associate Chief, Research Institute; Senior Scientist, Genetics and Genomic Biology Program, SickKids

Lei Sun
Associate Professor, UofT, DLSPH Biostatistics Div.
Associate Professor, UofT, Department of Statistics

06 STAGE GOVERNANCE
PROGRAM ADVISORY COMMITTEE

The Program Advisory Committee is responsible for assisting the Program Co-Directors and Steering Committee to evaluate the progress of the program and to make recommendations for improvement.

Michael Boehnke, Chair
Richard G. Cornell Distinguished University Professor of Biostatistics, School of Public Health, University of Michigan
Director, Center for Statistical Genetics
Director, Genome Science Training Program

Mary Corey
Senior Scientist Emeritus, Research Institute, SickKids

Joan E. Bailey-Wilson
Co-Chief & Senior Investigator, Inherited Disease Research Branch, National Research Institute, National Institutes of Health

Florence Demenais
Director of Research, INSERM (French National Institute of Health and Medical Research)
Head, INSERM Unit U946 (Genetic Variation and Human Diseases)

Thomas J. Hudson
President and Scientific Director, Ontario Institute for Cancer Research

Erwin Schurr
Investigator, Centre for the Study of Host Resistance, McGill University
Professor, Department of Human Genetics, McGill University
Professor, Department of Medicine, Div. of Experimental Medicine, McGill University

Clarice Weinberg
Chief, Biostatistics Branch, National Institute of Environmental Health Sciences, National Institutes of Health

01 STAGE COMPETITIONS
PUBLICATION RECORDS COMPARISON OF RECEIVED PDF APPLICATIONS

[Figure: Published, Accepted, or in Press Refereed Papers by PDF Applicants and Competition(1). Bar chart, Applicants A-Z, vertical axis 0-16.]

Average by competition: 5.66 (Nov. 2010), 4.5 (Mar. 2011), 6.0 (Oct. 2011), 6.6 (Apr. 2012), 9.25 (Nov. 2012)

(1) Colour horizontal bar reflects the average number of refereed papers by competition.

[Figure: Presentations, Book Chapters and Other Research Contributions by PDF Applicants and Competition(2). Bar chart, Applicants A-Z, vertical axis 0-55.]

Average by competition: 10.16 (Nov. 2010), 5.25 (Mar. 2011), 7.66 (Oct. 2011), 15.8 (Apr. 2012), 14.87 (Nov. 2012)

(2) Colour horizontal bar reflects the average number of presentations, book chapters and other research contributions by competition.

01 STAGE COMPETITIONS
PUBLICATION RECORDS COMPARISON OF RECEIVED MASTER’S AND PHD APPLICATIONS

[Figure: Published, Accepted, or in Press Refereed Papers by Master's and PhD Applicants and Competition(1). Bar chart, Applicants PhD A, PhD B, PhD C, Master's A, PhD D, PhD E, PhD F; vertical axis 0-8.]

Average by competition: 3.5 (Nov. 2010), 4 (Mar. 2011), 6 (Oct. 2011), 4 (Nov. 2012)

(1) Colour horizontal bar reflects the average number of refereed papers by competition.

[Figure: Presentations, Book Chapters and Other Research Contributions by Master's and PhD Applicants and Competition(2). Bar chart, Applicants PhD A, PhD B, PhD C, Master's A, PhD D, PhD E, PhD F; vertical axis 0-16.]

Average by competition: 6.6 (Nov. 2010), 4 (Mar. 2011), 12.5 (Oct. 2011), 10 (Nov. 2012)

(2) Colour horizontal bar reflects the average number of presentations, book chapters and other research contributions by competition.

02 ADMISSIONS
PROJECTED AND ACTUAL RECRUITMENT

At the recommendation of the STIHR review panel, STAGE competitions prioritized recruitment to remedy PhD and PDF retention deficits in Canada. In years 4-6, funds to recruit Visiting Scholars have been reallocated for the recruitment of Postdoctoral Fellows. Recruitment projections for years 3-6 are for funded positions.

[Figure: Projected and actual recruitment by trainee category, one panel per year, horizontal axis 0-5. Panels: Year 1 (Apr 2010 - Mar 2011), Year 2 (Apr 2011 - Mar 2012), Year 3 (Apr 2012 - Mar 2013), Year 4 (Apr 2013 - Mar 2014), Year 5 (Apr 2014 - Mar 2015), Year 6 (Apr 2015 - Mar 2016). Categories in Years 1-3: Undergraduate, Master's, PhDs, Postdoctoral Fellow(s), Visiting Scholar. Categories in Years 4-6: Undergraduate, Master's, PhDs, Postdoctoral Fellow(s) or Visiting Scholar.]

03 CURRENT PHD TRAINEES
TERM, DISCIPLINE, MENTORS, AND PROJECTS

Jessica Dennis
PhD Student, UofT, DLSPH, Epidemiology Div.
Term: Jan 2012-Aug 2014
Discipline: Epidemiology
Mentors: France Gagnon (PM), Zdenka Pausova, Lisa Strug
Project: Genetic and epigenetic factors in the regulation of tissue factor pathway inhibitor (TFPI) and Factor VII plasma levels and role in venous thromboembolism

Andriy Derkach
PhD Student, UofT, Dep. of Statistics
Term: Jan 2011-Aug 2013
Discipline: Statistics
Mentors: Andrew Paterson, Johanna Rommens, Lei Sun (PM)
Project: Pooled association tests for rare genetic variants

Laura Faye
PhD Student, UofT, DLSPH, Biostatistics Div.
Term: Jan 2011-Jan 2013
Discipline: Biostatistics
Mentors: Shelley B. Bull (PM), Lei Sun
Project: Accurate and efficient estimation of genetic effect in genome-wide association and genetic sequencing studies

Weili (Liz) Li
PhD Student, UofT, DLSPH, Biostatistics Div.
Term: Jan 2012-Aug 2014
Discipline: Biostatistics
Mentors: Lisa Strug (PM), Johanna Rommens, Rayjean Hung
Project: Developing novel statistical methods under the evidential framework with a focus on analyzing sequence data

03 CURRENT PDF AND VISITING SCHOLAR TRAINEES
TERM, DISCIPLINE, MENTORS, AND PROJECTS

Zhijian (Charlie) Chen
Postdoctoral Fellow, Samuel Lunenfeld Research Institute
Term: Jan 2011-Aug 2013
Discipline: Biostatistics
Mentors: Shelley B. Bull (PM), Radu Craiu, Andrew Paterson
Project: Design and analysis of genome-wide studies of complex diseases and traits

Martin Ladouceur
Postdoctoral Fellow, UofT, DLSPH, Epidemiology Div.
Term: Nov 2012-Oct 2014
Discipline: Biostatistics
Mentors: France Gagnon (PM), Mathieu Lemire, Arturas Petronis
Project: Statistical model to characterize genome-wide DNA methylation from multigenerational families

Melissa Miller
Postdoctoral Fellow, The Hospital for Sick Children
Term: Mar 2012-Feb 2014
Discipline: Epidemiology
Mentors: France Gagnon, Lisa Strug (PM), Johanna Rommens
Project: Identifying Genetic Modifiers of Immunoreactive Trypsinogen Levels in a NBS Population in Cystic Fibrosis

Vanessa Oliveira
Postdoctoral Fellow, Centre for Addiction and Mental Health
Term: Aug 2011 - Jul 2013
Discipline: Population Genetics
Mentors: James L. Kennedy (PM), Andrew Paterson, Lei Sun
Project: Variability of the DRD4 gene in behavioural phenotypes

Jiang Yue
Postdoctoral Fellow, The Hospital for Sick Children
Term: Aug 2012-Jul 2014
Discipline: Biomedical
Mentors: Jayne Danska (PM), Andrew Paterson, Lisa Strug
Project: A systems based analytical framework for metatranscriptomic studies of type 1 diabetes

Yan Yan Wu
Postdoctoral Fellow, Samuel Lunenfeld Research Institute
Term: Aug 2012-Jul 2014
Discipline: Biostatistics
Mentors: Laurent Briollais (PM), Julia Knight, Stephen J. Lye
Project: Development of statistical methods for BMI growth curves estimation and analysis on longitudinal GWAS data in relation to metabolic traits in children

Yildiz Yilmaz
Postdoctoral Fellow, Samuel Lunenfeld Research Institute
Term: Jan 2011-Dec 2012
Discipline: Biostatistics
Mentors: Irene L. Andrulis, Shelley B. Bull (PM), Julia Knight
Project: Statistical methods for molecular genetic analysis of breast cancer

Marc Woodbury-Smith
Visiting Scholar, McMaster University
Term: Jan 2012-Dec 2012
Discipline: Biomedical
Mentors: Andrew Paterson (PM), Stephen Scherer, Lei Sun
Project: Identifying functional rare variants in autism spectrum disorders (ASD) families

04 STAGE ALUMNI
PROJECTS, MENTORS, STAGE TERM, AND CURRENT POSITION(S)

Gord Fehringer
Current Position: Scientific Associate, Samuel Lunenfeld Research Institute
STAGE Term: Jan 2011-Jan 2012
Trainee Level: PDF
Mentors: Laurent Briollais, Rayjean Hung (PM), Geoffrey Liu
Discipline: Biomedical
Project: Genome-wide and post genome-wide studies of lung cancer and head and neck cancer.

Darren Brenner
Current Position: Postdoctoral Fellow, International Agency for Research on Cancer
STAGE Term: Jan 2011-Aug 2012
Trainee Level: PhD Student
Mentors: Shelley B. Bull, Rayjean Hung (PM), Geoffrey Liu
Discipline: Epidemiology
Project: A histology specific genome-wide investigation of lung cancer.

Tabitha Kung
Current Position: Rheumatology Fellow, Mount Sinai Hospital; Research Associate, Samuel Lunenfeld Research Institute
STAGE Term: May 2011 - Jun 2012
Trainee Level: Master’s Student
Mentors: France Gagnon (PM), Kathy Siminovitch*
Discipline: Epidemiology
Project: Genetic Predictors of Response to Treatment in Early Rheumatoid Arthritis

*ad hoc STAGE Mentor.

05 TRAINEE PRODUCTIVITY
PUBLICATIONS AND PRESENTATIONS

PUBLICATIONS ARISING FROM TRAINEES 2010-2012
33 Refereed Papers - Published
4 Refereed Papers - Accepted or in Press
22 Refereed Papers - Submitted
5 Book Chapters and Other

COMMUNICATIONS ARISING FROM TRAINEES 2010-2012
27 Oral presentations in Canada
11 Oral presentations outside of Canada
21 Poster presentations in Canada
6 Poster presentations outside of Canada
14 Abstracts - Published
3 Abstracts - Accepted or in Press

PUBLICATIONS IN JOURNALS 2010-2012

Journal impact factor range | Number in range
0-1 | 10
2-3 | 13
3-4 | 7
5-10 | 6
11-20 | 0
21+ | 1

06 TRAINEE GRANTS AND AWARDS

TABLE 5. TRAINEE GRANTS AND AWARDS

Trainee Name | Award Type | Amount | Organization | Award Name | Effective Date | End Date | Project Title/Specialty
Brenner, Darren | Doctoral Student Award | $105,000 | CIHR | Frederick Banting and Charles Best Canada Graduate Scholarship | Sept. 1, 2009 | Sept. 1, 2012 | A genome wide analysis of inflammation and lung cancer: A population based study of the role of previous lung diseases and inflammatory genetic pathways in the development of lung cancer
de Oliveira, Vanessa | Postdoctoral Fellowship | $36,500 | Department of Foreign Affairs and International Trade, Canada | Postdoctoral Fellowship Research Award | Oct. 2010 | Sept. 2011 | Psychiatric Genetics
de Oliveira, Vanessa | Travel Award | $1,000 | UofT, Department of Psychiatry | Fellowship Travel Award | Sept. 13, 2012 | N.A. | GWAS of alcohol and dependence in the CATIE schizophrenia sample
Dennis, Jessica | Doctoral Student Award | $150,000 | CIHR | Vanier Canada Graduate Scholarships | Sept. 1, 2011 | Aug. 31, 2014 | Genetic architecture and determinants of plasma fibrinogen levels
Dennis, Jessica | Canada Graduate Scholarships | $6,000 | CIHR | Michael Smith Foreign Study Supplements | Sept. 1, 2011 | Aug. 31, 2014 | Epigenetic determinants of plasma fibrinogen levels
Dennis, Jessica | Doctoral Student Award | Declined ($105,000) | CIHR (October 2010 Competition) | Frederick Banting and Charles Best Canada Graduate Scholarships | N.A. | N.A. | Genetic architecture and determinants of plasma fibrinogen levels
Derkach, Andriy | Graduate Scholarship | $15,000 | Ministry of Training, Colleges and Universities Canada | The Queen Elizabeth II Graduate Scholarships in Science and Technology | Sept. 1, 2012 | Aug. 31, 2013 | N.A.
Derkach, Andriy | Graduate Scholarship | $15,000 | Ministry of Training, Colleges and Universities Canada | Reginald A. Blyth Fellowship | Sept. 1, 2011 | Aug. 31, 2012 | It is awarded to a graduate student in the Department of Mathematics/Statistics
Faye, Laura | CIHR Doctoral Research Award | $66,000 | CIHR, Institute of Genetics | CIHR Doctoral Research Award | Sept. 1, 2008 | Aug. 31, 2011 | N.A.
Faye, Laura | Bonus Research Award | $2,000 | UofT | University of Toronto Open Bonus Research Award | Sept. 1, 2018 | Aug. 31, 2011 | All graduate students who receive a competitive award equal to or greater than $15,000 receive a “bonus” from UofT Open (UTO) Fellowship funds to a maximum of $25,000 for all sources of support
Faye, Laura | Fellowship Research Award | $1,750 | UofT | University of Toronto Open Fellowship Research Award | Sept. 1, 2008 | Aug. 31, 2011 | Students who receive a competitive award and no other support, will receive a “top-up”, from the DLSPH University of Toronto Open funds, to achieve a total of $23,400
Kung, Tabitha | Postgraduate Fellowship Award | $240,000 | The Arthritis Society of Canada | UCB Canada Inc./Canadian Rheumatology Association/The Arthritis Society Rheumatology Postgraduate Fellowships Award | Jul. 1, 2010 | Jun. 30, 2012 | Genetic determinants for susceptibility, outcome, treatment response and personalized medicine in rheumatoid arthritis
Li, Weili (Liz) | Postgraduate Scholarship | $63,000 | Natural Sciences and Engineering Research Council of Canada | Postgraduate Scholarship | Sept. 1, 2011 | Aug 31, 2014 | A novel statistical method for prioritizing genome-wide association study results
Li, Weili (Liz) | Studentship | $22,750 | University of Toronto | Graduate Funding for DLSPH UofT Doctoral-Stream Programs | Sept. 1, 2010 | Aug. 31, 2015 | N.A.
Woodbury-Smith, Marc | Research Award | $280,000 | Canadian Institutes of Health Research | Institute of Genetics Clinical Investigatorship Award | Jan. 2, 2013 | Jan 1, 2015 | Identifying Causes of Autism Spectrum Disorder through Next-Generation Sequencing in Combination with Genetic Linkage
Woodbury-Smith, Marc | Research Award | $35,000 | Scottish Rite Charitable Organization | Research Grant | Oct. 15, 2010 | Oct. 15, 2012 | Identifying Causes of Autism Spectrum Disorder through Next-Generation Sequencing in Combination with Genetic Linkage
Yilmaz, Yildiz | Research Award | $140,000 | Mitacs, Canada Network of Centres of Excellence | Mprime/Mitacs-Accelerate Postdoctoral Industrial Research Project Award | Jan. 1, 2011 | Dec. 31, 2012 | N.A.

06 TRAINEE DISTINCTIONS

2010-2012 TRAINEE DISTINCTIONS

Trainee Name | Award Type | Organization | Award Name | In recognition of | Talk Title
Derkach, Andriy | Nominated for Best Oral Presentation | International Genetic Epidemiology Society | Williams Award (2012) | Nomination for best abstract/presentation by a PhD student | Combining Linear and Quadratic Tests for Rare Variants Provides a Robust Test Across Genetic Models
Derkach, Andriy | Best Poster | Biostatistics Research Day at DLSPH UofT | Best Student Poster Presentation Award (2011) | Best poster by student | Evaluation of composite statistics for association analysis of rare variants.
Faye, Laura | Best Oral Presentation | International Genetic Epidemiology Society | Williams Award (2012) | Best abstract/presentation by a PhD student | Re-Ranking Next Generation Sequencing Variants for Accurate Causal Variant
Kung, Tabitha | School Prize | Dalla Lana School of Public Health, University of Toronto | W. Harding LeRiche Prize (2012) | Highest standing in course work in Epidemiology by Masters in Public Health student in Epidemiology specialization | N.A.

07 STAGE BUDGET

Trainees FY1 (Apr 2010 - Mar 2011) FY2 (Apr 2011 - Mar 2012) FY3 (Apr 2012 - Mar 2013) FY4 (Apr 2013 - Mar 2014) FY5 (Apr 2014 - Mar 2015) FY6 (Apr 2015 - Mar 2016)

Total funding received from CIHR annually $217,333.00 $315,133.00 $321,515.00 $325,000.00 $325,000.00 $270,833.00

REVENUE Balance brought forward from previous fiscal year (unspent monies) N.A. $136,713.66 $175,454.37 $69,324.62 $20,316.85 $34,793.84

A. Funding available/fiscal year $217,333.00 $451,846.66 $496,969.37 $394,324.62 $345,316.85 $305,626.84

Yildiz Yilmaz Own funding Own funding

Gord Fehringer Own funding Own funding

Charlie Chen Own funding $45,000 $45,000

Vanessa de Oliveira Own funding $45,000 $45,000

Melissa Miller $45,000 $45,000

Yue Jiang $45,000 $45,000

Yan Yan Wu $45,000 $45,000

Martin Ladouceur $21,170 $45,000 $23,830

Trainee (Projected) $45,000 $45,000

Trainee (Projected) $45,000 $45,000

Trainee (Projected) $45,000 $45,000

Row-group labels: Postdoctoral Fellows; Trainee stipends and travel and internship programs

Trainee (Projected) $45,000 $45,000

Trainee (Projected) $45,000

Visiting Scholar $40,000

1. Stipends to support postdoctoral fellows $0 $175,000 $291,170 $225,000 $158,830 $135,000

2. Stipends to support PhD students $0 $5,505 $10,074 $20,000 $20,000 $20,000

3. Stipends to support master's students $0 $0 $0 $0 $0 $0

4. CIHR STAGE Travel and International Internships programs $0 $6,269 $25,000 $25,000 $25,000 $25,000

B. Total funds spent on trainee stipends and travel and internship programs (1 + 2 + 3 + 4) $0 $186,774 $326,244 $270,000 $203,830 $180,000

5. Other travel costs (PAC meetings and ISSS expenses) $0 $846 $10,000 $10,000 $10,000 $10,000

6. Salary and benefits program coordinator $71,362 $83,573 $86,901 $89,508 $92,193 $94,959

7. Advertising $2,929 $1,189 $1,500 $1,500 $1,500 $1,500

8. Website development, maintenance, supplies, and other miscellaneous expenses $6,328 $4,011 $3,000 $3,000 $3,000 $3,000

Row-group label: Other costs

C. Total funds spent on other costs (5 + 6 + 7 + 8 ) $80,619 $89,619 $101,401 $104,008 $106,693 $109,459

BALANCE Balance brought forward (A - (B + C)) $136,713.66 $175,454.37 $69,324.62 $20,316.85 $34,793.84 $16,168.05

Percentage budget spent on trainee stipends and trainee travel: 0.00% 41.34% 65.65% 68.47% 59.03% 58.90%

Percentage budget spent on other costs: 37.09% 19.83% 20.40% 26.38% 30.90% 35.81%

Percentage budget brought forward: 62.91% 38.83% 13.95% 5.15% 10.08% 5.29%
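The arithmetic behind rows A, B, and C and the three percentage rows can be checked mechanically. The following is a minimal sketch in Python (the report itself contains no code, so the language and every variable and function name here are illustrative assumptions). It takes the figures transcribed from the table above, computes A as CIHR funding plus the balance brought forward, computes the closing balance as A - (B + C), and expresses B, C, and the balance as percentages of A.

```python
# Budget figures transcribed from the table above, FY1 through FY6.
# All names are illustrative; they do not come from the report.
FISCAL_YEARS = ["FY1", "FY2", "FY3", "FY4", "FY5", "FY6"]
cihr_funding = [217333.00, 315133.00, 321515.00, 325000.00, 325000.00, 270833.00]
brought_forward = [0.00, 136713.66, 175454.37, 69324.62, 20316.85, 34793.84]
trainee_costs = [0, 186774, 326244, 270000, 203830, 180000]      # row B
other_costs = [80619, 89619, 101401, 104008, 106693, 109459]     # row C
reported_balance = [136713.66, 175454.37, 69324.62, 20316.85, 34793.84, 16168.05]


def reconcile(funding, carry_in, b, c):
    """Recompute row A, the closing balance, and the three percentage rows."""
    available = funding + carry_in        # A = CIHR funding + balance brought forward
    balance = available - (b + c)         # closing balance = A - (B + C)

    def share(amount):
        # Percentage of the funds available in that fiscal year (row A).
        return round(100 * amount / available, 2)

    return {
        "available": round(available, 2),
        "balance": round(balance, 2),
        "pct_trainees": share(b),
        "pct_other": share(c),
        "pct_brought_forward": share(balance),
    }


results = [
    reconcile(f, cf, b, c)
    for f, cf, b, c in zip(cihr_funding, brought_forward, trainee_costs, other_costs)
]

for year, row, reported in zip(FISCAL_YEARS, results, reported_balance):
    gap = round(reported - row["balance"], 2)
    print(year, row, "reported minus recomputed balance:", gap)
```

Under these assumptions the recomputed percentages agree with the table to two decimals, and each recomputed balance differs from the reported one by less than a dollar, which would be consistent with rows B and C being shown rounded to whole dollars.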

Legend: Committed, Projected

01 CURRICULUM COMPONENTS, TRAINING OBJECTIVES, AND ASSESSABLE OUTCOMES

Component: Co-mentorship in three disciplines
Objectives: Foster development and application of integrative thinking. Learn and experience the value of partnership. Acquire knowledge and skills in three core disciplines of genetic epidemiologic investigations.
Outcome: Trainees will have access to mentors with complementary strengths and learn about research without disciplinary boundaries.
Assessments: The co-mentor team to be established at time of admission. Trainees to demonstrate integrative thinking in the development of research protocol and papers. Trainees to demonstrate partnership skills in curriculum activities, such as the “paper project” based on STAGE integrative courses.
Comments: Implemented, with modifications to accommodate study programs for Master’s (epidemiology, biostatistics, and statistics) and PhD (statistics) students.

Component: Courses in core discipline
Objectives: Develop strength in a core discipline.
Outcome: Trainees will develop expertise in one of the three core elements of genetic epidemiologic research; for example, epidemiology, (bio)statistics, or biomedical/genetics.
Assessments: Trainees will complete (or have completed) required courses of their core academic program in a timely manner.
Comments: Implemented.

Component: Integrative courses
Objectives: Introduce trainees to the fundamental concepts and principles of genetic epidemiology, statistical genetics and population genetics as integrative disciplines.
Outcome: Trainees will think in terms of the ‘big picture’ while working on individual aspects of a problem. For example, statistics trainees will take into consideration biological information while working on specific statistical problems.
Assessments: Trainees will complete at least two integrative courses, i.e., Fundamentals of Genetic Epidemiology, Statistical Genetics, Molecular Anthropology (Population Genetics), and required seminar course.
Comments: Implemented.

Component: Courses in cross-disciplines
Objectives: Develop cross-disciplinary skills and knowledge.
Outcome: Trainees will have a broad understanding of key principles and concepts, relevant to genetic epidemiologic research, in the two other core elements. For example, biomedical/genetic trainees will take (bio)statistics and epidemiology courses.
Assessments: Trainees will successfully complete at least one course in two core disciplines outside their own. Available but not required of PDF trainees.
Comments: Implemented. Enrollment restrictions negotiated on an individual basis.

Component: Weekly Journal Club/Research Progress Seminar
Objectives: With a focus on methods and study designs: Discuss seminal papers and recent breakthroughs. Provide a casual, yet critical, environment for trainees to present their research projects. Foster a sense of community.
Outcome: Trainees will employ critical and integrative thinking through discussions with peers and faculty, and will demonstrate good oral-communication skills.
Assessments: Trainees and mentors will regularly attend the journal clubs and seminars, and will develop presentation skills prior to presenting at national or international meetings.
Comments: Implemented.

Component: Monthly International Speaker Seminar Series
Objectives: Develop an International perspective of genetic epidemiologic research. Provide networking opportunities for STAGE trainees, mentors, and national genetic epidemiology community with one another and international speakers.
Outcome: Trainees will be exposed to high-calibre international research, and will be able to discuss their research with speakers.
Assessments: Trainees will meet with international speakers in an informal setting. Ongoing promotion of STAGE to research community. Support of research network.
Comments: Implemented. Funded largely by research institutes. Applied for and secured new grant funding from Institute Community Support Program of the CIHR Institute of Genetics.

01 CURRICULUM COMPONENTS, TRAINING OBJECTIVES, AND ASSESSABLE OUTCOMES (continued)

Component: STAGE Integrative Sessions
Objectives: Develop and apply integrative thinking. Develop and apply professional skills. Develop a sense of team and community.
Outcome: Trainees and mentors will interact on a monthly basis to exchange scientific thoughts and ideas, will learn how different disciplines think and speak, and will gain a broad view of the various skills required to succeed in scientific research.
Assessments: Trainees will be engaged in the discussion during the sessions. In the course of their training, trainees will produce a “paper project”, in cooperation with their three mentors, suitable for publication, operating grant application, or provincial grant (e.g. Ontario Ministry of Research and Innovation); their work will be presented at the DLSPH Annual Research Day.
Comments: Implemented with modifications.

Component: Writing grants and papers & Peer-review activities
Objectives: Develop and apply skills in writing grants and scientific papers. Become familiar with the process of peer-review of grants and papers.
Outcome: Having obtained the appropriate authorizations, trainees will be observers on peer-review panels, and mentors and trainees will jointly act as reviewers of manuscripts from peer-reviewed scientific journals.
Assessments: All trainees will apply for external studentships/fellowships. With appropriate authorization from journals, trainees and mentors will jointly prepare critical reviews of manuscripts. Trainees will write well-structured scientific papers with impact.
Comments: Implemented with modifications.

Component: Research practicums
Objectives: Provide hands-on experience. Encourage cross-disciplinary experience. To be better prepared as HQP for genetic epidemiologic work environment.
Outcome: Trainees will have the skills to work in a broad range of genetic epidemiologic research environments.
Assessments: Trainees will exercise good judgment in addressing relevant scientific questions through analysis of existing research data using statistical genetics software.
Comments: Exploring internship options for trainees. Government and industry opportunities yet to be established.

Specific to masters level statisticians/epidemiologists/biologists:

Component: Internship program option through MITACS ACCELERATE
Objectives: Introduce trainees to real-life work experience. Develop professional relationships with Canadian industry and other non-academic organizations.
Outcome: Trainees will apply their graduate research to real-world challenges and learn about non-academic R&D processes.
Assessments: Trainees will exercise skills in a new setting, and will identify potential future employers or collaborators from non-academic settings.
Comments: We expect to implement this activity by Summer 2013.

Component: International trainee exchange & internship program
Objectives: Introduce trainees to a collaborative international experience. Encourage trainees to seek their own collaboration on a small project. Provide a medium to promote STAGE and recruit international trainees.
Outcome: Trainees will learn about approaches unavailable in their Canadian labs, or will validate their results in another sample. Trainees will learn about different research cultures, and will network at the international level.
Assessments: Trainees will share new techniques and results with other STAGE trainees and mentors, and will obtain publishable results from their internship experiences.
Comments: Three trainees have completed internships at various international locations. Actively seeking additional internship sites and funding opportunities.

Component: National Videoconference Journal Club (“JC”) & Annual Canadian Genetic Epidemiology and Statistical Genetics Mtng.
Objectives: Develop a national perspective of colleagues’ interests and expertise. Foster a broader sense of community.
Outcome: Trainees will be made aware of the work of Canadian colleagues and develop a sense of belonging to the Canadian genetic epidemiology community.
Assessments: Trainees and mentors will regularly attend videoconference and the annual meeting, and will develop a network of Canadian collaborators.
Comments: National JC discontinued by funding agency. JC replaced by national broadcasting of ISSS. 2013 Meeting planning underway (third-party event).

02 COURSES INTEGRATIVE AND CROSS-DISCIPLINARY

INTEGRATIVE

Courses that introduce trainees to the fundamental concepts and principles of genetic epidemiology, statistical genetics, and population genetics. Trainees are encouraged to complete two required integrative courses, plus one required integrative seminar in statistical methods for genetics and genomics. Trainees may take additional, elective integrative courses at their own discretion. Bold denotes STAGE Mentor

Instructor/Department | Course Code | Course Name | Course Description
Esteban Parra | ANT3440 | Molecular Anthropology: Theory & Practice | Covers fundamental principles and concepts of population genetics. This course is offered every two years. To be offered in Fall 2013.
Lei Sun (course coordinator), Wei Xu | CHL5224 | Statistical Genetics | Covers the main categories of statistical methods in genetic epidemiology.
France Gagnon (course coordinator), Mohammad Akbari*, Steven Narod | CHL5430 | Fundamentals of Genetic Epidemiology | Overview of fundamental principles and concepts underlying the design and conduct of genetic epidemiologic studies.
Shelley Bull, Andrew Paterson | CHL7001 | Statistical Methods for Genetics and Genomics (SMG) | One hour Journal Club/Research Seminar session, held two-three times per month, Sept. through May, with STAGE statistical genetics and genetic epidemiology faculty participating at the seminar. The seminar is followed by a one-hour in-depth discussion for registered trainees.
Radu Craiu, Lei Sun (course coordinator) | STA4315 | Computational Methods for Statistical Genetics | Advanced statistical genetics. Course offered every two years based on demand.

CHL5430 invited lecturers and guest speakers:
- Lisa Strug (UofT DLSPH Div. of Biostatistics)
- Lucia Mirea (Maternal-Infant Care Research Centre (MiCare))
- Sanaa Choufani (SickKids)

*Assumed teaching responsibilities while F. Gagnon on sabbatical leave, fall 2012.

CROSS-DISCIPLINARY

Courses designed to develop cross-disciplinary knowledge and skills. Trainees are encouraged to complete at least two cross-disciplinary courses, aiming for breadth outside their own core disciplines. Some courses require the instructor’s preauthorization. Suggested courses include, but are not limited to, those listed below.

Instructor/Department | Course Code | Course Name | Course Description
Boris Steipe | BCB420 | Computational Systems Biology | Current approaches to using the computer for analyzing and modeling biology as integrated molecular systems. The course complements an introductory Bioinformatics course.

02 COURSES INTEGRATIVE AND CROSS-DISCIPLINARY (continued)

Instructor/Department | Course Code | Course Name | Course Description
Halla Thorsteinsdottir, Abdallah S. Daar | CHL5121 | Genomics, Bioethics and Public Policy | The course addresses the main bioethical and public policy issues associated with genomics and health biotechnology development.
Paul Corey | CHL5201 | Biostatistics for Epidemiologists I | Intro to biostatistics for students in the Master’s in Public Health (Epidemiology) program.
Elizabeth Badley, Rayjean J. Hung (course coordinator), J. Robert Mann (course coordinator), Anthony Miller | CHL5403 | Epidemiology of Non-Communicable Diseases | The course covers the epidemiology of selected chronic diseases/health conditions and their risk factors.
Eric J. Holowaty, Anna M. Chiarelli | CHL5409 | Cancer Epidemiology | This is a seminar and guided-reading course for masters and doctoral students with a strongly focused interest or thesis topic in the area of cancer epidemiology.
Department of Computer Science | CSC260F | Department of Computer Science | For trainees with limited programming skills. Offered on the basis of instructor availability.
Department of Computer Science | CSC456-2306F | High-Performance Scientific Computing | For trainees requiring more advanced programming skills.
Peter Ray (course coordinator), Johanna Rommens (course coordinator) | MCBD1041 | Fundamentals of Human Genetics | Trainees from statistics, without prior exposure to genetic research, should consider taking this course.
Dionne Gesink (course coordinator), John McLaughlin, Nancy Kreiger | CHL5404H | Research Methods in Epidemiology I | The goal of this course is to foster a deeper understanding of how to design an epidemiologic study, including how to develop a research question, how to design a study to answer that question, and what additional data you will need to collect to patch study design holes during analysis.
Lucy R. Osborne (course coordinator), Steve Scherer (course coordinator) | MSC2010Y | Molecular Medicine in Human Genetic Disease / Advanced Concepts in Human Genetic Disease | This course should encourage students to develop an approach to the genetic analysis, investigation and treatment of human disease.
Lei Sun (course coordinator), Laurent Briollais | CHL5210 | Categorical Data Analysis | This course covers the fundamental statistical methods for analyzing categorical data, including theory and analysis of multi-dimensional contingency tables and log-linear models; comparison and contrast of different methods; model specification - choosing and assessing models; GLM; GEE.

ADVANCED COURSE IN CORE DISCIPLINE

Instructor/Department | Course Code | Course Name | Course Description
Lisa Strug (course coordinator) | CHL7001H | Introduction to the Likelihood Paradigm | This course presents an overview of the standard paradigms, along with their strengths and limitations, and covers the likelihood paradigm in detail with emphasis on the theory and applications. This course will be offered every two years.

03 INTERNATIONAL SPEAKER SEMINAR SERIES 2012-2013 INVITED GUEST SPEAKERS

Speaker | Position and Affiliation | Date | Talk Title
Clarice Weinberg | Chief, Biostatistics Branch, National Institute for Environmental Health Sciences, National Institutes of Health | Jan. 27, 2012 | Using nuclear families to find related to conditions with onset early in life
John Witte | Professor of Epidemiology & Biostatistics and Urology, University of California, San Francisco | Mar. 2, 2012 | Design and Analysis of Next-generation Genetic Epidemiological Studies
Veronica Vieland | Professor, Department of Pediatrics and Department of Statistics, The Ohio State University College of Medicine | Apr. 13, 2012 | Calibration of Statistical Evidence using Principles of Thermodynamics
Alice Whittemore | Professor of Epidemiology and Biostatistics, Department of Health Research and Policy, Stanford University School of Medicine | May 4, 2012 | Evaluating Personal Risk Models
Marjo-Ritta Jarvelin | Professor and Chair, Department of Epidemiology and Biostatistics, Imperial College London | Sept. 14, 2012 | Genetics of Early Growth
Sharon Browning | Associate Professor, Department of Biostatistics, University of Washington | Oct. 5, 2012 | Identity by descent in “unrelated” individuals

03 INTERNATIONAL SPEAKER SEMINAR SERIES 2012-2013 INVITED GUEST SPEAKERS (continued)

Speaker | Position and Affiliation | Date | Talk Title
Duncan C. Thomas | Professor and Director, Biostatistics Division, University of Southern California | Nov. 2, 2012 | Two-phase family-based designs for next generation sequencing
Howard Hu | Director and Professor, UofT, DLSPH | Dec. 7, 2012 | Looking behind the curtain: Lead Toxicity as a Case Study of Methodologic Challenges in Gene-Environment Interactions Research
David Tregouet | Research Director, UMR-S 937 Genomics of Venous Thrombosis, INSERM | Jan. 14, 2013 | Haplotypes & Imputation, Two Complementary Tools: A Case Study on GenomeWide Expression Studies
Florence Demenais | Director, UMR 946 Genetic Variation and Human Diseases, INSERM | Jan. 25, 2013 | TBC
Kimberley Siegmund | Associate Professor, Department of Preventive Medicine, University of Southern California | Mar. 1, 2013 | TBC
George Davey-Smith | Professor of Clinical Epidemiology, School of Social and Community Medicine, University of Bristol | May 3, 2013 | TBC

03 INTERNATIONAL SPEAKER SEMINAR SERIES (ISSS)

ISSS OVERALL LOCAL AND REMOTE PARTICIPATION

[Bar chart: number of individuals participating, local and remote, at each ISSS session.]

Sessions shown: Oct 5, Nov 19, and Dec 3, 2010; Jan 14, Feb 4, Mar 4, Apr 1, May 6, Oct 7, and Dec 2, 2011; Jan 6, Jan 27, Mar 2, Apr 13, May 4, Sep 14, Oct 5, Nov 2, and Dec 7, 2012.

Fig. 3, ISSS Overall Local and Remote Participation

ISSS GEOGRAPHICAL PRESENCE: 16 PARTICIPATING INSTITUTIONS ACROSS CANADA

ST. JOHN’S • Memorial University of Newfoundland
CHICOUTIMI • Université du Québec à Chicoutimi
HALIFAX • Dalhousie University
OSHAWA • Lakeridge Health
QUEBEC CITY • Université Laval
BURNABY • Simon Fraser University
MONTREAL • Jewish General Hospital • McGill University • Sainte-Justine University Hospital • Université de Montréal
WINNIPEG • Winnipeg University
GUELPH • Guelph University
HAMILTON • McMaster University
VANCOUVER • University of British Columbia
TORONTO • University of Toronto
OTTAWA • University of Ottawa

Fig. 4, ISSS Geographical Presence

APPENDIX A SUMMARY OF GRANT PROPOSAL

Gagnon & Bull Yr 1 $165,000 – Summary of Research Proposal

STAGE: An integrated program in statistical & epidemiological training for genetics with a population health impact

The global objective of STAGE (Strategic Training for Advanced Genetic Epidemiology) is to build research capacity in tomorrow’s leading chronic disease scientists capable of bridging laboratory-based and population-based research through genetic epidemiology. This new cadre of (statistical, epidemiological, and biomedical) geneticists will have strong quantitative skills and, through integrative thinking and strong teamwork skills, the ability to resolve complex issues in the design and analysis of population health studies addressing high-impact hypotheses. The mission of STAGE is to provide leadership in integrated epidemiological and statistical research training for genetics with a population health impact, in the context of a world-class training environment. New investigators will learn to apply team approaches to generate ground-breaking studies aimed at identifying and characterizing, at the population level, genetic factors implicated in chronic diseases. The vision is to develop a model of training in genetic epidemiology that will significantly increase capacity with such expertise and set the stage for future programs in Canada, as well as providing trainees with enriched international collaborative experience early on. The program’s values are creativity, scientific rigor, teamwork and partnership. 
Key specific objectives: 1) Launch the first formal training program in Genetic Epidemiology in Canada; 2) Implement a dynamic interdisciplinary training network with a focus on the development and application of novel methods and study designs to address challenging scientific issues in the investigation of chronic diseases; 3) Implement innovative enhanced training strategies promoting integrative thinking and partnership; 4) Promote “hands-on” research experience in epidemiological, statistical, and biological settings in a cross-disciplinary fashion; and 5) Lay the ground for expanded training opportunities across Canada and abroad. STAGE is hosted at the U of T Dalla Lana School of Public Health (DLSPH), with close partnerships and extensive support from affiliated institutions, and led by Gagnon (DLSPH) and Bull (Samuel Lunenfeld Research Institute). STAGE has National partnerships through Canadian Labs and the MITACS NCE, as well as International partnerships with France and Brazil. With a critical mass of 15 epidemiological and statistical geneticists, and 21 additional internationally renowned mentors from epidemiology, statistics, bioinformatics and biomedical genetics, STAGE has assembled a stellar team of mentors with strength in quantitative genetics and complementary expertise for complex chronic disease investigations with a population health impact. Graduate trainees are recruited mainly from the large pool of high quality applicants from academic Programs in Epidemiology, Biostatistics and Statistics; and post-docs directly through mentors and advertisement. In addition, STAGE provides unique opportunities to scientists from parent disciplines (e.g. epidemiology, statistics, genetics) who intend to make a career change (e.g. “epidemiology” to “genetic epidemiology”) and University Depts. to promote cross-disciplinary research of their faculty in favour of the field of genetic epidemiology. 
The STAGE curriculum is highly creative and comprehensive, including an innovative trio supervisory model (epidemiologist + statistician + biomedical scientist); monthly integrative sessions providing a forum for trainees emanating from this broad range of disciplines to learn and discuss a variety of topics including leadership and communication skills, research grant and paper writing, ethical issues in research and genetics, and career development; cross-disciplinary practicum opportunities in industry; and an international internship and exchange program. These activities aim to enrich trainees’ solid foundation built on the core academic program, complemented by cross-disciplinary and integrative courses in genetic epidemiology, statistical and population genetics. Although new and innovative, STAGE is a result of over 12 years of sustained efforts in developing the field of genetic epidemiology in Canada. Co-PIs on this application have invested their hearts and time in major community building activities here in Canada, as well as internationally through leadership roles at the International Genetic Epidemiology Society. In addition to excellence in their research, STAGE mentors, and in particular co-applicants, have made major contributions to teaching and new course development in epidemiological, statistical and population genetics. Our former trainees are successful scientists in various employment sectors, here and abroad. With STAGE, we propose an aggressive multi-level recruitment strategy and a very exciting and extensive curriculum that will allow us to significantly increase capacity in genetic epidemiology. CIHR funding of STAGE is critical in making Canadian research training in genetic epidemiology a destination for Canadian and International trainees.

APPENDIX B LIST OF TRAINEE PUBLICATIONS

APPENDIX B TRAINEE PUBLICATION RECORDS AND CONTRIBUTIONS

STIHR REPORT INSTRUCTION Attach, as Appendix B, a complete list of publications for your program (submitted, in press and published, in the usual CIHR format) since the grant began. This section should only contain publications where the authors include program trainees (including their percent contribution). Publications about the Training Program are also permitted, even if no trainees are authors. It is preferable for the publications to be grouped by trainee. STAGE RESPONSE A complete list of trainee publications and publication and presentation summaries since the grant began (Jan 2010) follows below. Trainee publications that predate admission to the training program are counted for STAGE purposes only if they were produced while trainees were under the mentorship or supervision of STAGE mentors.

Trainee Publications & Presentations (summary)

Trainee Level | Name | STAGE Training Period From (mm/yy) | To (mm/yy) | Total number of months at STAGE | Presentations | Abstracts and Book Chapters | Refereed Publications
Master’s | Kung, Tabitha | 05/11 | 05/12 | 12 | 3 | 0 | 1
PhDs | Brenner, Darren | 01/11 | 06/12 | 18 | 5 | 5 | 14
PhDs | Dennis, Jessica | 01/12 | present | 10 | 5 | 0 | 2
PhDs | Derkach, Andriy | 01/11 | present | 18 | 12 | 0 | 7
PhDs | Faye, Laura | 01/11 | present | 18 | 10 | 0 | 5
PhDs | Li, Weili (Liz) | 01/12 | present | 10 | 1 | 2 | 2
Postdoctoral Fellows | Chen, Zhijian (Charlie) | 01/11 | present | 18 | 9 | 2 | 7
Postdoctoral Fellows | Fehringer, Gord | 01/11 | 01/12 | 12 | 2 | 1 | 2
Postdoctoral Fellows | Ladouceur, Martin | 11/12 | present | 1 | 0 | 0 | 0
Postdoctoral Fellows | Miller, Melissa | 03/12 | present | 8 | 1 | 3 | 2
Postdoctoral Fellows | Oliveira, Vanessa | 08/11 | present | 14 | 6 | 0 | 1
Postdoctoral Fellows | Wu, Yan Yan | 08/12 | present | 3 | 0 | 1 | 3
Postdoctoral Fellows | Yilmaz, Yildiz | 01/11 | present | 18 | 9 | 5 | 10
Postdoctoral Fellows | Yue, Jiang | 08/12 | present | 3 | 0 | 0 | 0
Visiting Scholar/Faculty | Woodbury-Smith, Marc | 01/12 | present | 10 | 2 | 3 | 3
TOTALS | | | | | 65 | 22 | 59

APPENDIX B - STAGE PAC 2012 ANNUAL REPORT Page 1 Trainee-Refereed Publications, Abstracts, and Book Chapters (details)

Refereed Publications Total Abstracts Book Total Trainee level Name Refereed Chapters Abstracts + Accepted Accepted Published Submitted Publications Published & other Books or in press or in press Master’s Kung, Tabitha 1 1 0 Brenner, Darren 12 2 14 4 1 5 Dennis, Jessica 1 1 2 0 PhDs Derkach, Andriy 1 6 7 0 Faye, Laura 4 1 5 0 Li, Weili (Liz) 2 2 2 2 Chen, Zhijian (Charlie) 3 4 7 2 2 Ladouceur, Martin 0 0 Fehringer, Gord 2 2 1 1

Postdoctoral Miller, Melissa 2 2 3 3 Fellows Oliveira, Vanessa 1 1 0 Wu, Yan Yan 1 2 3 1 1 Yilmaz, Yildiz 6 4 10 5 5 Yue, Jiang 0 0 Visiting Scholar/Faculty Woodbury-Smith, Marc 3 3 3 3 TOTALS 33 4 22 59 14 14 5 33

Trainee-Presentations (details)

Total By type By geographic scope Invited Trainee level Name presentations Oral Poster Local National International presentations Master’s Kung, Tabitha 3 2 1 1 1 1 1 Brenner, Darren 5 2 3 2 3 Dennis, Jessica 5 3 2 2 2 1 2 PhDs Derkach, Andriy 12 9 3 6 5 1 Faye, Laura 10 5 5 2 5 3 Li, Weili (Liz) 1 1 1 Chen, Zhijian (Charlie) 9 5 4 3 5 1 Fehringer, Gord 2 2 1 1 Ladouceur, Martin

Postdoctoral Miller, Melissa 1 1 1 Fellows Oliveira, Vanessa 6 2 4 1 3 2 Wu, Yan Yan Yilmaz, Yildiz 9 7 2 5 4 3 Yue, Jiang Visiting 2 1 1 2 1 Scholar/Faculty Woodbury-Smith, Marc TOTALS 65 38 27 19 29 17 7

Legends: bold + underline = STAGE Trainee; bold = STAGE mentor; IF = Impact Factor; TC = Trainee Contribution

1. Kung, Tabitha - Master’s Student

Work for these presentations was additionally supported by a UCB Canada Inc./Canadian Rheumatology Association/The Arthritis Society Rheumatology Postgraduate Fellowship Award, and recognized by the award of the Dalla Lana School of Public Health W. Harding LeRiche Prize (awarded annually to the MPH student in the Epidemiology specialization who achieves the highest standing in course work in Epidemiology).

Submitted refereed papers

1. Kung, TN, Dennis, J, Ma, Y, Xie, G, Bykerk, V, Keystone, EC, Siminovitch, KA, Gagnon, F. RFC-1 A80G is a genetic determinant of efficacy but not of toxicity in the treatment of Rheumatoid Arthritis with methotrexate: evidence from a HuGE review and meta-analysis. Submitted to the Annals of the Rheumatic Diseases. IF= 8.72. TC=TBC.

Presentations

1. Kung, TN, Dennis, J, Ma, RY, Xie, G, Cot, S, Bykerk, V, Keystone, EC, Siminovitch, KA, Gagnon, F. RFC-1 A80G is a genetic determinant of efficacy but not of toxicity in the treatment of Rheumatoid Arthritis with methotrexate: evidence from a HuGE review and meta-analysis. Annual Ogryzlo Research Day, University of Toronto, June 20, 2012 (Podium Presenter - Selected).

2. Kung TN, R. Isserlin, G. Bader, E. Keystone, and K. Siminovitch. Personalized Medicine: Integrating Clinical and Genetic Data using a Bioinformatics Framework. Personalized Medicine Symposium, Canadian Arthritis Network (CAN) Annual Scientific Meeting, Quebec City, QC, October 29, 2011 (Symposium Presenter - Invited).

3. Kung, TN, J. Hochman, Y. Sun, L. Bessette, B. Haraoui, J. Pope, and V. Bykerk. Efficacy and Safety of Cannabinoids for Pain in Musculoskeletal Diseases: a Systematic Review and Meta-analysis. Annual European Congress of Rheumatology, EULAR 2011, London, United Kingdom, May 25–28, 2011. Annals of the Rheumatic Diseases. 2011;70 (Suppl3):566 (Poster presentation).

2. Brenner, Darren - PhD student Work for these publications was additionally supported by a Frederick Banting and Charles Best Canada Graduate Scholarship from the Canadian Institutes of Health Research.

Published refereed papers

1. Brenner, DR, Boffetta, P, Duell, EJ, Bickeböller, H, Rosenberger A, McCormack, V, Muscat, JE, Yang, P, Wichmann, E, Brueske-Hohlfeld, I, Schwartz, AG, Cote, M, Tjønneland, A, Friis, S, LeMarchand, L, Zhang, ZF, Morgenstern, H, Szeszenia-Dabrowska, N, Lissowska, J, Zaridze, D, Rudnai, P, Fabianova, E, Foretova, L, Janout, V, Bencko, V, Schejbalova, M, Brennan, P, Mates, I, Lazarus, P, Field, J, Olaide, R, McLaughlin, J, Liu, J, Wiencke, J, Neri, M, Ugolini, D, Andrew, AS, Lan, Q, Hu, W, Orlow, I, Park, BJ, Hung, RJ. (2012). International Lung Cancer Consortium: Pooled Analysis of Previous Lung Diseases and Lung Cancer Risk. American Journal of Epidemiology. Available on-line Sept. 21, 2012. IF=5.21.

2. Timofeeva, M, Hung, RJ, 12 authors, Brenner, DR, 36 authors, Brennan, P, Amos, C, Houlston, R, Landi, MT. (2012). Influence of Common Genetic Variation on Lung Cancer Risk: Meta-Analysis of 14,900 Cases and 29,485 Controls. Human Molecular Genetics. Accepted. Manuscript Number HMG-20120ASA-00545. IF=7.63

3. Campos, EA, MacLean, E, Davidson, C, Brenner, DR, Mayers, I, Vliagoftis, H, El-Sohemy, A, Cameron, L. (2012). The Single Nucleotide Polymorphism, CRTh2-6373G>A, is Associated with Allergic Asthma and Increased Expression of CRTh2. Allergy. Accepted. Manuscript Number ALL-2011-00779. IF= 6.27.

4. García-Bailo, B, Brenner, DR, Nielsen, D, Lee, HJ, Borchers, C, Badawi, A, Karmali, M and El-Sohemy, A. (2011). High-abundance plasma proteomic profiles and dietary patterns in an ethnoculturally diverse population of young Canadian adults. American Journal of Clinical Nutrition. ajcn.022657; First published online December 28, 2011. doi:10.3945/ajcn.111.022657. IF= 6.66.

5. Brenner, DR, Arora, P, Garcia-Bailo, B, Morrison, H, Karmali, M, Badawi, A. (2011). The Relationship Between Metabolic Syndrome and Markers of Cardiometabolic Disease among Canadian Adults. Journal of Diabetes and Metabolism. (S:2) S2-003. IF= 2.41.

6. Rosenberger, A, Bickeböller, H, McCormack, V, Brenner DR, Duell, EJ, Tjønneland, A, Friis, S, Muscat, JE, Yang, P, Wichmann, E, Heinrich, J, Szeszenia-Dabrowska, N, Lissowska, J, Zaridze, D, Rudnai, P, Fabianova, E, Foretova, L, Janout, V, Bencko, V, Schejbalova, M, Brennan, P, Mates, D, Schwartz, AG, Cote, M, Zhang, ZF, Lazarus, P, Field, J, Olaide, R, McLaughlin, J, Wiencke, J, LeMarchand, L, Neri, M, Bonassi, S, Andrew, AS, Lan, Q, Hu, W, Orlow, I, Park, BJ, Boffetta, P, Hung, RJ. (2011). International Lung Cancer Consortium: Pooled Analysis of Asthma and Lung Cancer Risk. Carcinogenesis. doi: 10.1093/carcin/bgr307. First published online: December 22, 2011. IF=5.40. TC=85%.

7. Brenner DR, Arora P, Garcia-Bailo B, Wolever TM, Morrison H, El-Sohemy A, Karmali M, Badawi A. Plasma vitamin D levels and risk of metabolic syndrome in Canadians. Clin Invest Med. 2011 Dec 1;34(6):E377. IF=1.16.

8. McCormack, V, Duell, EJ, Brenner, DR, Rosenberger A, Bickeböller, H, Muscat, JE, Yang, P, Wichmann, E, Schwartz, AG, Tjønneland, A, Friis, S, LeMarchand, L, Zhang, ZF, Lazarus, P, Field, J, McLaughlin, J, Wiencke, J, Neri, M, Lan, Q, Orlow, I, Park, BJ, Boffetta, P, Hung RJ. (2011). NSAIDs and lung cancer risk. Cancer Causes & Control. Accepted. IF=2.79. TC=30%.

9. Arora, P, Garcia Bailo, B, Dastani, Z, Brenner, DR, Villegas, A, Malik, S, Richards, B, El-Sohemy, A, Karmali, M, Badawi, A (2011). Genetic Polymorphisms of Innate Immunity-Related Inflammatory Pathways in Type 2 Diabetes Mellitus: Biomarkers of Early Risk Prediction and Prevention. BMC Medical Genetics. IF=2.44. TC=30%.

10. Brenner, DR, McLaughlin JR, Hung RJ (2011) Previous Lung Diseases and Lung Cancer Risk: A Systematic Review and Meta-Analysis. PLoS ONE 6(3): e17479. doi:10.1371/journal.pone.0017479. IF=4.41. TC=90%.

11. Brenner, DR, Boucher BA, Kreiger, N, Jenkins D, El-Sohemy A (2011). Dietary Patterns in an Ethnoculturally Diverse Population of Young Canadian Adults. Canadian Journal of Dietetic Research and Practice. 01/2011; 72(3):e161-8. IF=0.81. TC=90%.

12. Soskolne CL, Jhangri GS, Scott HM, Brenner, DR, Siemiatycki J, Lakhani R, et al. A population-based case-control study of occupational exposure to acids and the risk of lung cancer: evidence for specificity of association. Int J Occup Environ Health. 2011 Jan-Mar;17(1):1-8. IF=1.00. TC=40%.

Submitted refereed papers

1. Arora, P, Vasa, P, Brenner, DR, Iglar, K, McFarlane, P, Badawi, A. (2012). Prevalence estimates of chronic kidney disease in Canada: Results of a nationally representative survey. CMAJ. Manuscript Number 12-0833. IF= 8.21.

2. Brenner, DR, Brennan, P, Boffetta, P, Amos, C, Spitz, M, Chen, C, Goodman, G, Heinrich, J, E, Bickeboller, H, Rosenberger, A, Risch, A, Huley, T, McLaughlin, J, Benhamou, Bull, S, Chen, G, Witte, J, Lewinger, JP, Hung, RJ. (2012). Pathway Analysis of Inflammation Related Variants and Lung Cancer Risk: An Application of Hierarchical Modeling Using GWAS Data. Human Genetics. Manuscript Number HumGen-12-0348. IF= 5.06

Published abstracts

1. Brenner, DR, Arora, P, Garcia-Bailo, B, El-Sohemy, A, Karmali, M, Badawi, A (2011). Vitamin D in the prediction of metabolic syndrome: A target for public health intervention. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-32. IF= 3.19.

2. Brenner, DR, Arora, P, Garcia-Bailo, B, Wolever, T, El-Sohemy, A, Karmali, M, Badawi, A (2011). The Impact of the metabolic syndrome on cardiometabolic and inflammatory profiles among Canadian adults. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-33. IF= 3.19.

3. Brenner, DR, Arora, P, Garcia-Bailo, B, Wolever, T, El-Sohemy, A, Karmali, M, Badawi, A (2011). Association Between Plasma Vitamin D and Metabolic Syndrome in the Canadian Population. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-31. IF= 3.19.

4. Arora, P, Garcia Bailo, B, Dastani, Z, Brenner, DR, Villegas, A, Malik, S, Richards, B, El-Sohemy, A, Karmali, M, Badawi, A (2011). Genetic Polymorphisms of Innate Immunity-Related Inflammatory Pathways in Type 2 Diabetes Mellitus: Biomarkers of Early Risk Prediction and Prevention. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-15. IF= 3.19.

Published materials

1. Garcia-Bailo, B, Brenner, DR, Nielsen, D, Lee, H, Borchers, C, Badawi, A, Karmali, M, El-Sohemy, A. Quantitation of 55 Common Human Plasma Proteins in Healthy Young Adults and Correlation with Body Mass Index and Dietary Patterns. Available from Nature Precedings (2010).

Presentations

1. Brenner DR. Dalla Lana School of Public Health Doctoral Seminar Series in Epidemiology, Toronto, Ontario. October 6, 2011.

2. Brenner, DR, Application of Hierarchical modeling to GWAS level data: Examining inflammatory genes and cancer risk. Oral presentation at the Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute, March 18, 2011, Toronto, Ontario, Canada. STAGE-led University-wide event.

3. Brenner, DR, Arora, P, Garcia-Bailo, B, El-Sohemy, A, Karmali, M, Badawi, A (2011). Vitamin D in the prediction of metabolic syndrome: a target for public health intervention. Journal of Epidemiology and Community Health, Volume 65(S1), A228, P2-32. Poster presentation at the World Congress of Epidemiology, August 7-11, 2011, Edinburgh, Scotland. IF=2.98.

4. Brenner, DR, Arora, P, Garcia-Bailo, B, Wolever, T, El-Sohemy, A, Karmali, M, Badawi, A (2011). The Impact of the metabolic syndrome on cardiometabolic and inflammatory profiles among Canadian adults. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-33. Poster presentation at the World Congress of Epidemiology, August 7-11, 2011, Edinburgh, Scotland. IF=2.98.

5. Brenner, DR, Arora, P, Garcia-Bailo, B, Wolever, T, El-Sohemy, A, Karmali, M, Badawi, A (2011). Association Between Plasma Vitamin D and Metabolic Syndrome in the Canadian Population. Journal of Epidemiology and Community Health, Volume 65(S1), A224, P2-31. Poster presentation at the World Congress of Epidemiology, August 7-11, 2011, Edinburgh, Scotland. IF=2.98.

3. Dennis, Jessica - PhD Student Work for these publications was additionally supported by a Vanier Canada Graduate Scholarship Research Award.

Published refereed papers

1. Dennis, J, Johnson, CY, Adediran, AS, de Andrade, M, Heit, JA, Morange, PE, Tregouet, DA, Gagnon, F. 2011. The Endothelial Protein C Receptor (PROCR) Ser219Gly Variant and Risk of Common Thrombotic Disorders: A HuGE Review and Meta-Analysis of Evidence from Observational Studies. Blood. 2012 Mar 8;119(10):2392-400. doi: 10.1182/blood-2011-10-383448. Epub 2012 Jan 17. IF=10.55. TC=65%.

Submitted refereed papers

1. Kung, TN, Dennis, J, Ma, Y, Xie, G, Bykerk, V, Keystone, EC, Siminovitch, KA, Gagnon, F. RFC-1 A80G is a genetic determinant of efficacy but not of toxicity in the treatment of Rheumatoid Arthritis with methotrexate: evidence from a HuGE review and meta-analysis. Submitted to the Annals of the Rheumatic Diseases. IF= 8.72. TC=15%.

Presentations

1. Dennis, J, Preliminary Analysis of Genome-Wide DNA Methylation Patterns in Thrombosis. INSERM UMRS 937 Génétique Epidemiologique et Moleculaire des Pathologies Cardiovasculaires Journal Club. Faculté de médecine Pierre et Marie Curie, Paris, France, August 30, 2012. Invited oral presentation.

2. Dennis, J, Epigenetics: Promises, challenges, and why an epidemiologist should care. Oral presentation at the Doctoral Seminar Series in Epidemiology, Division of Epidemiology, University of Toronto, December 1, 2011, Toronto, Ontario, Canada. Invited oral presentation.

3. Dennis, J, Johnson, CY, Adediran, AS, Morange, PE, Tregouet, DA, Gagnon, F, The Endothelial Protein C Receptor (PROCR) 4600A/G Variant and Risk of Common Thrombotic Disorders: A HuGE Review and Meta-Analysis of Evidence from Observational Studies. Poster (#0066-S) presentation at the 3rd North American Congress of Epidemiology, June 21-24, 2011, Montreal, Quebec, Canada.

4. Dennis, J, Johnson, CY, Adediran, AS, Morange, PE, Tregouet, DA, Gagnon, F, The Endothelial Protein C Receptor (PROCR) 4600A/G Variant and Risk of Common Thrombotic Disorders: A HuGE Review and Meta-Analysis of Evidence from Observational Studies. Poster presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting, May 11-13, 2011, King City, Ontario, Canada.

5. Dennis, J, Epigenetics: Promises, challenges, and why an epidemiologist should care. Oral presentation at the Doctoral Seminar Series in Epidemiology, Division of Epidemiology, University of Toronto, December 1, 2011, Toronto, Ontario, Canada.

4. Derkach, Andriy - PhD Student Work for these publications was additionally supported by a Reginald A. Blyth OGSST Graduate Scholarship from the University of Toronto; The Queen Elizabeth II Graduate Scholarships in Science and Technology from the Canadian Ministry of Training, Colleges and Universities; and recognized by a nomination for the Williams Award of the IGES 2012.

Published refereed papers

1. Derkach, A, Lawless, JF, Sun, L, (2012), Robust and powerful tests for rare variants using Fisher’s method to combine evidence of association from two or more complementary tests. Genetic Epidemiology. IF=3.44. TC=60%.

Submitted refereed papers

1. Derkach, A, Chiang, T, Addis, L, Dobbins, SE, Tomlinson, I, Houlston, R, Pal, D, Strug, L, (2012), Can public NGS data be used as controls in association studies? Report of a novel statistical approach. PLoS Genetics. IF= 8.69. TC=30%.

2. Derkach A, Lawless, JF, Merico, D, Paterson, A, Sun, L, 2012. Evaluation of association tests and gene annotations for analyzing rare variants using GAW18 data. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=60%.

3. Roslin, N, Derkach, A, Strug, L, 2012. Association analysis with sequence data using publicly available controls. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=20%.

4. Xu L, Craiu R, Derkach A, Paterson, AD, Sun, L, 2012, Using a Bayesian latent variable approach to detect pleiotropy in the GAW 18 data. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=10%.

5. Naplathankalam T, Ziman R, Derkach, A, Scherer, SW, Paterson, AD, Merico D, 2012, GAW18 single nucleotide variant prioritization based on protein impact, sequence conservation and gene annotation. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=10%.

6. Derkach, A, Lawless, JF, Sun, L, 2011. A Unified Framework for Pooled Association Tests for Rare Genetic Variants (under review at the Journal of the American Statistical Association (JASA), submitted on 30/09/2011). IF=0.98.

Presentations

1. Derkach, A, Lawless, JF, Sun, L, Combining Linear and Quadratic Tests for Rare Variants Provides a Robust Test Across Genetic Models, selected for oral presentation, International Genetic Epidemiology Society Meeting, October 18-20, 2012, Stevenson, WA, USA.

2. Derkach, A, Lawless, JF, Sun, L, Combining Linear and Quadratic Tests for Rare Variants Provides a Robust Test Across Genetic Models, selected for oral presentation, the 40th Annual Meeting of Statistical Society of Canada, June 3-6, 2012, University of Guelph, Canada.

3. Derkach, A, Lawless, JF, Sun, L, Combining p-Values from Linear and Quadratic Tests for Rare Variants Provides Robust Statistics across Genetic Models, selected for oral presentation, Canadian Human and Statistical Genetics Meeting, April 29-May 2, 2012, Niagara-on-the-Lake, Canada.

4. Derkach, A, Lawless, JF, Sun, L, Robust Association Tests for Rare Genetic Variants, oral presentation, Statistics Graduate Student Research Day, April 19, 2012, Fields Institute, Toronto, Ontario, Canada.

5. Derkach, A, Lawless, JF, Sun, L, Update on methods for statistical analysis of rare variants, oral presentation, Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute, November 25, 2011, Canada. STAGE-led University-wide event.

6. Derkach, A, Lawless, JF, Sun, L, A general statistical framework for analyzing rare variants, oral presentation, Graduate Student Seminar, Department of Statistics, University of Toronto, November 24, 2011, Canada.

7. Derkach, A, Lawless, JF, Sun, L, A generalized pooled association statistic for analyzing rare variants, (Program #723F). Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

8. Derkach, A, Lawless, JF, Sun, L, A Unified Statistical Framework for Association Methods for Rare Variants, (Session #1). Oral presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting (CGESGM), May 11-13, 2011, King City, Ontario, Canada.

9. Derkach, A, Lawless, JF, Sun, L, A Unified Statistical Framework for Association Methods for Rare Variants, (Poster #1). Poster at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting (CGESGM), May 11-13, 2011, King City, Ontario, Canada.

10. Derkach, A, Lawless, JF, Sun, L, Evaluation of composite statistics for association analysis of rare variants. Poster presentation at the Biostatistics Research Day, University of Toronto Dalla Lana School of Public Health, April 29, 2011, Toronto, Ontario, Canada. 1st Prize of the Student Poster Presentation Award.

11. Derkach, A, Hardy-Weinberg Equilibrium test with complex surveys. Oral presentation at the Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute, April 8, 2011, Toronto, Ontario, Canada. STAGE-led University-wide event.

12. Derkach, A, Current Methods for Association Studies with Rare Variants. Oral presentation at the graduate student seminar, Department of Statistics, University of Toronto, March 17, 2011, Toronto, Ontario, Canada.

5. Faye, Laura - PhD Student Work for these publications was additionally supported by a Doctoral Research Award from Canadian Institutes of Health Research, an Open Fellowship Research Award, an Open Bonus Research Award from the University of Toronto; and a Williams Award from the International Genetic Epidemiology Society in 2012.

Published refereed papers

1. Faye, LL, Sun, L, Dimitromanolakis, A, Bull, SB. 2011. A flexible genome-wide bootstrap method that accounts for ranking and threshold-selection bias in GWAS interpretation and replication study design. Statistics in Medicine, 30: 1898-1912. IF=2.328. TC=75%.

2. Faye, LL, Bull, SB. 2011. Two-stage study designs combining GWAS tag SNPs and exome sequencing: accuracy of genetic effect estimates. BMC Proceedings 5 (suppl 9), S64, in press. IF=2.44. TC=85%.

3. Sun, L, Dimitromanolakis A, Faye, LL, Paterson, AD, Waggott D, the DCCT/EDIC Research Group, Bull, SB. 2011. BRsquared: a practical solution to the winner’s curse in genome-wide scans. Human Genetics 129(5): 545-552. IF=5.05. TC=15%.

4. Thomas A, Abel HJ, Di Y, Faye, LL, Jin J, Liu J, Wu Z, Paterson, AD, 2011. The impact of linkage disequilibrium on the identification of functional variants. Genetic Epidemiology, in press. IF=3.99. TC=12%.

Submitted refereed papers

1. Faye, LL, Machiela MH, Kraft, P, Bull SB, Sun L. 2012. Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification. Submitted to PLoS Genetics. IF= 8.69. TC=90%.

Presentations

1. Faye LL, Bull SB, Kraft P, Sun L. Re-ranking Next Generation Sequencing Variants for Accurate. Selected for oral presentation, International Genetic Epidemiology Society Meeting, October 18-20, 2012, Stevenson, WA, USA.

2. Faye LL. Re-ranking Low Coverage Sequencing Variants for Accurate Causal Variant Identification. Seminar presentation at Center for Statistical Genetics, University of Michigan. June 25 2012, Ann Arbor, Michigan, USA

3. Faye LL. Re-ranking Low Coverage Sequencing Variants for Accurate Causal Variant Identification. Seminar presentation at Program for Molecular and Genetic Epidemiology, Harvard School of Public Health. June 15 2012, Boston, Massachusetts, USA

4. Faye LL, Bull SB, Sun L. Re-ranking sequencing variants in the post-GWAS era. Poster presentation at the 7th Canadian Genetic Epidemiology and Statistical Genetics Workshop, April 29 - May 2, 2012, St. Catharines, Ontario, Canada.

5. Faye, LL, Bull, SB, Sun, L. Re-ranking sequencing variants in the post-GWAS era. Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

6. Faye, LL, Bull, SB, Sun, L. Re-ranking of sequencing variants improves accuracy in targeted sequencing studies. Oral presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting (CGESGM), May 11-13, 2011, King City, Ontario, Canada.

7. Faye, LL, Bull, SB, Sun, L. Re-ranking of sequencing variants improves accuracy in targeted sequencing studies. Poster presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting (CGESGM), May 11-13, 2011, King City, Ontario, Canada.

8. Faye, LL, Bull, SB, Sun, L. Accuracy of Targeted Sequencing Studies. Poster presentation at the Biostatistics Research Day, University of Toronto Dalla Lana School of Public Health, April 29, 2011, Toronto, Ontario, Canada.

9. Faye, LL, Bull, SB, Sun, L. Re-ranking of sequencing variants improves accuracy in targeted sequencing studies. Poster presentation at The Centre de Recherches Mathématiques Workshop - Computational Statistical Methods for Genomics and Systems Biology, April 18-22, 2011, Montreal, Quebec, Canada.

10. Faye, LL. Oral presentation at the Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute, March 25, 2011, Toronto, Ontario, Canada. STAGE-led University-wide event.

6. Li, Weili (Liz) - PhD Student Work for these publications was additionally supported by a Canada Graduate Scholarship Research Award from the Natural Sciences and Engineering Research Council of Canada (NSERC), and an Ontario Graduate Scholarship Research Award from the Ministry of Training, Colleges and Universities in Canada.

Published refereed papers

1. Strug LJ, Addis L, Chiang T, Baskurt Z, Li W, Clarke T, Hardison H, Kugler SL, Mandelbaum DE, Novotny EJ, Wolf SM, Pal DK. 2012. The genetics of reading disability in an often excluded sample: novel Loci suggested for reading disability in rolandic epilepsy. PLoS One. 2012; 7(7):e40696. IF= 4.09. TC=20%.

2. Sun L, Rommens JM, Corvol H, Li W, Li X, Chiang TA, Lin F, Dorfman R, Busson PF, Parekh RV, Zelenika D, Blackman SM, Corey M, Doshi VK, Henderson L, Naughton KM, O’Neal WK, Pace RG, Stonebraker JR, Wood SD, Wright FA, Zielenski J, Clement A, Drumm ML, Boëlle PY, Cutting GR, Knowles MR, Durie PR, Strug LJ. 2012. Multiple apical plasma membrane constituents are associated with susceptibility to meconium ileus in individuals with cystic fibrosis. Nat Genet. 2012 May; 44(5):562-9. IF= 35.53. TC=50%.

Accepted or in press abstracts

1. Li W, Su D, Chiang T, Li X, Miller MR, Keenan K, Corvol H, Wright FA, Blackman S, Drumm ML, Cutting GR, Knowles MR, Durie PR, Rommens JM, Sun L, Strug LJ. 2012. Do severity of early lung disease and meconium ileus in cystic fibrosis have common genetic contributors? Annual Meeting of The American Society of Human Genetics.

2. Soave D, Chiang T, Miller M, Keenan K, Li W, Ip W, Wright F, Blackman S, Corvol H, Knowles M, Cutting G, Drumm M, Sun L, Rommens J, Durie P, Strug LJ. 2012. Exocrine and Endocrine Pancreatic Damage in Cystic Fibrosis are Associated with SLC26A9. Annual Meeting of the American Society of Human Genetics, San Francisco, CA, USA.

Presentations

1. Li W. Problems associated with ranking rare variants using two-sided exact p-values. Oral presentation at the Biostatistics Seminar, Dalla Lana School of Public Health, March 6, 2011, Toronto, Ontario, Canada. TC=100%.

7. Chen, Zhijian (Charlie) - Postdoctoral Fellow

Published refereed papers

1. Chen, Z, Craiu, R, Bull, SB. Two-phase stratified sampling designs for regional sequencing. Genetic Epidemiology, 36:320–332. IF=3.44. TC=80%.

2. Chen, Z, Yi GY, Wu C. 2011. Marginal methods for correlated binary data with misclassified responses. Biometrika, 98: 647-662. IF=1.833. TC=70%.

3. Chen B, Chen, Z, Wu L, Wang L, Yi GY. 2011. Marginal analysis of population-based genetic association studies of quantitative traits with incomplete longitudinal data. Journal of the Iranian Statistical Society, in press. IF=1.10. TC=30%.

Submitted refereed papers

1. Chen, Z, Tan, KR, Bull, SB, 2012. Multi-phase analysis by linkage, quantitative transmission disequilibrium, and measured genotype: Systolic blood pressure in complex Mexican-American pedigrees. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=80%.

2. Bull, SB, Chen, Z, Tan, KR, Taleban J, 2012. An exploration of heterogeneity in genetic analysis of complex pedigrees: Linkage and association in WGS data. Genetic Analysis Workshop 18. Stevenson, WA, USA. TC=50%.

3. Chen, Z, Craiu, R, Bull, SB. Bayesian sequential analysis of two-phase designs for genetic association model inference with regional resequencing data. Statistics in Medicine. Under review. IF=1.87.

4. Chen, Z, Yi GY, Wu C. Marginal analysis of longitudinal ordinal data with misclassification in both response and covariates. Statistics in Medicine. IF=1.87.

Published abstracts

1. Chen, Z, Craiu, R, Bull, SB. Two-phase stratified sampling designs for regional sequencing. The International Genetic Epidemiology Society, Annual Meeting, September 18-20, 2011, Heidelberg, Germany.

2. Chen, Z, Paterson, AD, Canty, AJ, Sun, L, Bull, SB. Joint modelling of repeated measures and time-to-event data in genetic association analysis of type 1 diabetes. The 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

Presentations

1. Chen, Z, Craiu, R, Bull, SB. Sequential two-phase stratified designs for regional re-sequencing: A Bayesian approach. Oral presentation at the Annual Meeting of the Statistical Society of Canada, June 3-6, 2012, Guelph, Ontario, Canada. (contributed)

2. Chen, Z, Paterson, AD, Canty, AJ, Sun, L, Bull, SB. Specification and interpretation of joint phenotype models in genetic association of complex traits. Poster presentation at the Canadian Human and Statistical Genetics Meeting. April 29–May 2, 2012, Niagara-on-the-Lake, Ontario, Canada.

3. Chen, Z. Inferring genetic causal effects on survival data with associated endo-phenotypes. Oral presentation at Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute presentation, January 13, 2012, Toronto, Ontario, Canada. STAGE-led University-wide event.

4. Chen, Z, Paterson, AD, Canty, AJ, Sun, L, Bull, SB. Joint modelling of repeated measures and time-to-event data in genetic association analysis of type 1 diabetes, Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

5. Chen, Z, Craiu, R, Bull, SB. Two-phase stratified sampling designs for regional sequencing. Oral presentation at the International Genetic Epidemiology Society, Annual Meeting, September 18-20, 2011, Heidelberg, Germany.

6. Chen, Z, Craiu, R, Bull, SB. Two-phase Stratified Sampling Designs for Regional Sequencing. Poster presentation at The 6th Annual Canadian Genetic Epidemiology & Statistical Genetics Meeting, May 11-13, 2011, King City, Ontario, Canada.

7. Chen, Z, Craiu, R, Bull, SB. Two-phase Stratified Sampling Designs for Regional Sequencing. Poster presentation at the Centre de Recherche en Mathematiques (CRM) International Workshop on Computational Statistical Methods for Genomics and Systems Biology, April 18-22, 2011, Montreal, Quebec, Canada.

8. Chen, Z. Genetic Analysis of Longitudinal Data. Statistical Methods for Genetics and Genomics Research Seminar and Journal Club. Prosserman Centre of the Samuel Lunenfeld Research Institute presentation, March 11, 2011, Toronto, Ontario, Canada. STAGE-led University-wide event.

9. Chen, Z. Marginal Methods for Correlated Binary and Ordinal Data with Misclassification. CHL 5250H Biostatistics Seminar Course. University of Toronto Dalla Lana School of Public Health, March 1, 2011, Toronto, Ontario, Canada.

8. Fehringer, Gord - Postdoctoral Fellow

Published refereed papers

1. Fehringer, G, Liu, G, Briollais, L, Brennan, P, Amos, CI, Spitz, RM, Bickeböller, H, Wichmann, HE, Risch, A, Hung, RJ. Comparison of pathway analysis approaches using lung cancer GWAS data sets. PLoS ONE (submitted). IF=4.41.

2. Fehringer G, Liu G, Pintilie M, Sykes J, Cheng D, Liu N, Chen Z, Seymour L, Der SD, Shepherd FA, Tsao MS, Hung RJ. 2012 Association of lung cancer associated 15q25 variants with IREB2 gene expression in lung tumour tissue. Cancer Epidemiol Biomarkers Prev 21:1097-104. IF=3.919.

Book chapters

1. Epidemiology of Cancer. Hung, RJ, Fehringer, G, Liu, G. In: Basic Science of Oncology (Tannock, Hill, Bristow, Harrington eds); 5th Edition, In press. IF Tannock, RP Hill, RG Bristow, L Harrington eds. McGraw Hill Ryerson, Toronto, Ontario, Canada. 2012.

Presentations

1. Fehringer, G, Liu, G, Briollais, L, Brennan, P, Amos CI, Spitz, RM, Bickeböller, H, Wichmann, HE, Risch, A, Hung, RJ. Comparison of pathway analysis approaches using lung cancer GWAS data sets. Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada. (selected, presenter)

2. Fehringer, G, Liu G, Pintilie M, Sykes J, Cheng D, Liu N, Chen Z, Seymour L, Der SD, Shepherd FA, Tsao MS, Hung RJ. 2012 Association of lung cancer associated 15q25 variants with IREB2 gene expression in lung tumour tissue. Poster presentation at AACR Meeting: Chicago, Illinois, May-April 2012. (selected, presenter)

9. Miller, Melissa - Postdoctoral Fellow

Published refereed papers 1. Miller MR, Sokol RJ, Narkewicz MR, Sontag MK. 2012. Pulmonary function in individuals with cystic fibrosis from the U.S. cystic fibrosis foundation registry who had undergone liver transplant. Liver Transplantation 18(585-593). IF=3.38. TC=70%.

2. Miller MR, Pereira RI, Langefeld CD, Lorenzo C, Rotter JI, Chen Y-D, Bergman RN, Wagenknecht LE, Norris JM, Fingerlin TE. 2012. Levels of Free Fatty Acids Are Associated with Insulin Resistance but do not Explain the Relationship between Adiposity and Insulin Resistance in Hispanic Americans: The IRAS Family Study. Journal of Clinical Endocrinology and Metabolism (Epub ahead of print. doi: 10.1210/jc.2012-1318). IF=6.20. TC=70%.

Published abstracts

1. Durie PR, Soave D, Gonska T, Ip W, Keenan K, Miller MR, Lei S, Rommens J, Strug LJ. 2012. Early exocrine pancreatic damage determined by serum immunoreactive trypsinogen is a significant predictor of CF-related diabetes at a later age. North American Cystic Fibrosis Meeting. TC=5%.

2. Li W, Su D, Chiang T, Li X, Miller MR, Keenan K, Corvol H, Wright FA, Blackman S, Drumm ML, Cutting GR, Knowles MR, Durie PR, Rommens JM, Sun L, Strug LJ. 2012. Does severity of early lung disease and meconium ileus in cystic fibrosis have common genetic contributors? Poster presentation at American Society for Human Genetics, San Francisco, CA, USA. TC=5%.

3. Soave D, Chiang T, Miller MR, Su D, Keenan K, Li W, Ip W, Wright FA, Blackman S, Corvol H, Knowles, MR, Cutting GR, Drumm ML, Sun L, Rommens JM, Durie PR, Strug LJ. 2012. Exocrine and endocrine pancreatic damage in cystic fibrosis are associated with SLC26A9. Poster presentation at American Society for Human Genetics, San Francisco, CA, USA. TC=5%.

Presentations

1. Miller MR. (presenter) Pulmonary function in individuals with cystic fibrosis from the U.S. cystic fibrosis foundation registry who had undergone liver transplant. Cystic Fibrosis Research Seminar at the Hospital for Sick Children. Toronto, Ontario, December 2011.

10. Oliveira, Vanessa - Postdoctoral Fellow

Work for these publications was additionally supported by a Postdoctoral Fellowship Research Award from the Department of Foreign Affairs and International Trade Canada.

Submitted refereed papers

1. Gonçalves VF, Tiwari AK, de Luca V, Kong SL, Zai C, Tampakeras M, Mackenzie B, Sun L, Kennedy JL. (2012) DRD4 VNTR polymorphism and age at onset of severe mental illnesses. Neurosci Lett. IF=2.1. TC=75%.

Presentations

1. Gonçalves VF. DRD4 VNTR polymorphism and age at onset of severe mental illnesses. Oral presentation at the Fellowship Academic Half-Day, April 2nd, 2012. Toronto, Canada. (selected)

2. Gonçalves VF, Zai CC, Paterson A, Sun L, Kennedy JL, Knight J. Genome-wide association study of alcohol abuse and dependence in two schizophrenia samples. Poster presentation at the Canadian Human and Statistical Genetics Meeting, April 29-May 2, 2012, Niagara-on-the-Lake, Canada. (selected).

3. Gonçalves VF, Tiwari AK, de Luca V, Zai CC, Tampakeras M, Likodi O, Mackenzie B, Shaikh S, Sun L, Kennedy JL. Dopamine D4 gene 7-repeat variant and age at onset in severe mental disorders. Poster presentation at the Society of Biological Psychiatry Meeting. May 3-5, 2012, Philadelphia, USA. (selected).

4. Zai, CC, Tampakeras M, de Luca, V, Tiwari, AK, de Oliveira VFG, Freeman, N, MacKenzie, B, Kennedy, JL. Detection of a deletion in 20p in a schizophrenia patients. Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

5. de Oliveira VFG, Tiwari, A, Zai, CC, Tampakeras, M, Likodi, O, Mackenzie, B, Kennedy, JL. Survey of effects of the DRD4 7R allele on substance abuse disorder across schizophrenia, bipolar and other psychiatric phenotypes. Poster presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada.

6. de Oliveira VFG. Recovery of mitochondrial lineages of the extinct Botocudo Indians and investigation of their possible relationship with the Paleoamerindian and extant populations of the Southeast Brazil. Seminar presented at Centre for GeoGenetics/ Natural History Museum of Denmark, August 31, 2011, Copenhagen, Denmark.

11. Wu, Yan Yan - Postdoctoral Fellow

Accepted refereed papers

1. Abarin T, Wu YY, Warrington N, Lye S, Pennel C, Briollais L. 2012. The impact of breastfeeding on FTO-related BMI growth trajectories: An application to the RAINE Pregnancy cohort study. International Journal of Epidemiology. IF=6.414. TC=50%.

Submitted refereed papers

1. Kaur J, Kak I, Srivastava G, Assi J, Matta A, Wu YY, Leong I, Witterick I, Colgan TJ, MacMillan C, Briollais L, Sui KW, Ralhan R. S100A7 overexpression: Predictive marker for high risk of malignant transformation in oral dysplasia. Clinical Cancer Research. IF=7.742. TC=10%.

2. Warrington NM, Wu YY, Pennel C, Marsh JA, Beilin L, Palmer LJ, Lye SJ, Briollais L. Modelling BMI trajectories in children for genetic association studies. PLoS ONE. IF=4.411. TC=25%.

Accepted abstracts

1. Wu YY, Briollais L. Mixed-effects models for joint modelling of sequence data in longitudinal studies. Genetic Analysis Workshop 18. TC=70%.

12. Yilmaz, Yildiz - Postdoctoral Fellow

Work for these publications was additionally supported by an Mprime Network – NCE Postdoctoral Industrial Research Project from the Networks of Centres of Excellence and an Operating Grant (held by Shelley B. Bull) from the Natural Sciences and Engineering Research Council of Canada.

Published refereed papers

1. Forse, CL*, Yilmaz, YE*, Dushanthi, P, O’Malley, FP, Mulligan, AM, Bull, SB, Andrulis, IL. 2013. Elevated expression of podocalyxin is associated with lymphatic invasion, basal-like phenotype and clinical outcome in axillary lymph node-negative breast cancer. (*first author). Breast Cancer Research and Treatment. 2013 Jan 4. [Epub ahead of print]. http://www.ncbi.nlm.nih.gov/pubmed/23288345 IF=4.43. TC=40%

2. Yilmaz, YE, Lawless, JF. 2011. Likelihood ratio procedures and tests of fit in parametric and semiparametric copula models with censored data. Lifetime Data Analysis 17: 386-408. IF=0.873. TC=80%.

3. Lawless, JF, Yilmaz, YE. 2011. Comparison of semiparametric maximum likelihood estimation and two-stage semiparametric estimation in copula models. Computational Statistics & Data Analysis 55: 2446-2455. IF=0.500. TC=50%.

4. Lawless, JF, Yilmaz, YE. 2011. Semiparametric estimation in copula models for bivariate sequential survival times. Biometrical Journal 5: 779-796. IF=1.438. TC=50%.

5. Yilmaz, YE, Bull, SB. 2011. Are quantitative trait-dependent sampling designs cost effective for analysis of rare and common variants? BMC Proceedings 5 (suppl 9), S111, in press. IF=2.16. TC=70%.

6. Bailey-Wilson, JE, Brennan, JS, Bull, SB, Culverhouse, R, Kim, Y, Jiang, Y, Jung, J, Li, Q, Lamina, C, Liu, Y, Magi, R, Niu, YS, Simpson, CL, Wang, L, Yilmaz, YE, Zhang, H, Zhang, Z. 2011. Regression and data mining methods for analyses of multiple rare variants in the Genetic Analysis Workshop 17 “mini-exome” data. (To appear in Genetic Epidemiology), in press. IF=3.99. TC=5%.

Submitted refereed papers

1. Yilmaz, YE, Lawless, JF, Andrulis, IL, Bull, SB. 2012. Insights from mixture cure modeling of molecular markers for prognosis in breast cancer. Submitted to Journal of Clinical Oncology. IF=18.37. TC=75%

2. Konigorski, S, Yilmaz, YE, Bull, SB. 2012. Bivariate genetic association analysis of systolic and diastolic blood pressure by copula models. Submitted to Genetic Analysis Workshop 18. TC=45%

3. Yilmaz, YE, Lawless, JF, Bull, SB. 2012. Optimal quantitative trait-dependent sampling designs for genetic association analysis of a rare variant score. For submission to Genetic Epidemiology. IF=3.44. TC=80%

Presentations as a guest speaker

1. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Department of Community Health Sciences, University of Calgary, Calgary, Canada, Invited Talk, May 7, 2012. TC=100%

2. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Department of Biostatistics, Indiana University, Indianapolis, USA, Invited Talk, March 27, 2012. TC=100%

3. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Department of Statistics, University of Manitoba, Winnipeg, Canada, Invited Talk, January 19, 2012. TC=100%.

Published abstracts

1. Yilmaz, YE, Lawless, JF, Bull, SB. 2012. Optimal response-dependent sampling designs for genetic association analysis in next-generation sequencing data. 8th World Congress in Probability and Statistics, Istanbul, Turkey, Refereed Conference Abstract #780. TC=80%

2. Yilmaz, YE. 2012. Response-selective sampling designs for rare variant analysis in genetic association studies. ENAR International Biometric Society Spring Meeting, Washington, DC, USA, Contributed Paper. TC=100%

3. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Rare variant analysis in genetic association studies under quantitative trait-dependent sampling designs. 12th International Congress of Human Genetics and 61st Annual American Society of Human Genetics Meeting, October 11-15, 2011, Montreal, Canada, Refereed Conference Abstract #259. (Winner of the University of Toronto McLaughlin Centre Award) TC=80%

4. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Semiparametric maximum likelihood method for rare variant analysis under quantitative trait-dependent sampling designs. 20th International Genetic Epidemiology Society Conference, September 18-20, 2011, Heidelberg, Germany, Refereed Conference Abstract #68. TC=80%

5. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Evaluation of quantitative trait dependent sampling designs for association studies involving rare and common variants. 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting, May 11-13, 2011, King City, Canada, Refereed Conference Abstract #9. TC=80%

Presentations

1. Yilmaz, YE, Lawless, JF, Bull, SB. 2012. Optimal response-dependent sampling designs for genetic association analysis in next-generation sequencing data. Oral presentation at the 8th World Congress in Probability and Statistics, July 10, 2012, Istanbul, Turkey (presenter). Selected, Refereed Conference Abstract #780.

2. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Oral presentation at the Department of Community Health Sciences, University of Calgary, May 7, 2012, Calgary, Alberta, Canada (presenter). Invited Talk.

3. Yilmaz, YE. 2012. Response-selective sampling designs for rare variant analysis in genetic association studies. Oral presentation at the ENAR International Biometric Society Spring Meeting, April 2, 2012, Washington, DC, USA (presenter). Contributed Paper.

4. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Oral presentation at the Department of Biostatistics, Indiana University, March 27, 2012, Indianapolis, USA (presenter). Invited Talk.

5. Yilmaz, YE. 2012. Methods for analyzing sequential survival times with applications in cancer studies. Oral presentation at the Department of Statistics, University of Manitoba, January 19, 2012, Winnipeg, Canada (presenter). Invited Talk.

6. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Rare variant analysis in genetic association studies under quantitative trait-dependent sampling designs. Oral presentation at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada (presenter). Selected, Refereed Conference Abstract # 259 (Winner of the University of Toronto McLaughlin Centre Award).

7. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Semiparametric maximum likelihood method for rare variant analysis under quantitative trait-dependent sampling designs. Poster presentation at the 20th Annual International Genetic Epidemiology Society Conference, September 18-20, 2011, Heidelberg, Germany (presenter). Selected, Refereed Conference Abstract # 68.

8. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Evaluation of Quantitative Trait Dependent Sampling Designs for Association Studies Involving Rare and Common Variants. Oral presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting, May 11-13, 2011, King City, Ontario, Canada (presenter). Selected, Refereed Conference Abstract # 9.

9. Yilmaz, YE, Lawless, JF, Bull, SB. 2011. Evaluation of Quantitative Trait Dependent Sampling Designs for Association Studies Involving Rare and Common Variants. Poster presentation at the 6th Annual Canadian Genetic Epidemiology and Statistical Genetics Meeting, May 11-13, 2011, King City, Ontario, Canada (presenter). Selected, Refereed Conference Abstract # 9.

13. Woodbury-Smith, Marc - Visiting Scholar

Work for these publications was additionally supported by a Clinical Investigatorship Award from the Institute of Genetics of the Canadian Institutes of Health Research and a grant from the Scottish Rite Charitable Foundation. Identifying Causes of Autism Spectrum Disorder through Next-Generation Sequencing in Combination with Genetic Linkage.

Accepted refereed papers

1. Kakiashiasvilli, T., Koczkodaj, WW., Woodbury-Smith, MR (2012). Improving the medical scale predictability by the pairwise comparisons method: evidence from a clinical data study. Computer Methods and Programs in Biomedicine, 105(3): 210-216. IF=1.51. TC=50%.

2. Szatmari, P., Liu, X-Q, Goldberg, J, Zwaigenbaum, L, Paterson, AD, Woodbury-Smith, MR, Georgiades, S, and The AGP Consortium (2012). Sex differences in repetitive stereotyped behaviours in autism: implications for genetic liability. American Journal of Medical Genetics: Part B Neuropsychiatric Genetics, 159B (1): 5-12. IF=3.70. TC=20%.

3. Gorter, J-W, Stewart, D, Woodbury-Smith, MR, King, G, Wright, G, Nguyen, T, Freeman, M, and Swinton, M. Pathways Toward Positive Psychosocial Outcomes and Mental Health for Youth with Disabilities: A Knowledge Synthesis of Information on Developmental Trajectories. Canadian Journal of Community Mental Health. In Press. IF=unknown. TC=30%.

Book chapters

1. Woodbury-Smith, MR Asperger Syndrome, in Volkmar, FR (ed.) Encyclopedia of Autism Spectrum Disorders. New York: Springer. In press.

2. Woodbury-Smith, MR Criminal Behaviour, in Volkmar, FR (ed.) Encyclopedia of Autism Spectrum Disorders. New York: Springer. In press.

3. Woodbury-Smith, MR Epidemiology of Asperger Syndrome, in Volkmar, FR (ed.) Encyclopedia of Autism Spectrum Disorders. New York: Springer. In press.

Presentations

1. Woodbury-Smith, MR & Szatmari, P. (2011). The use of mixed effects modelling to identify heritable autism endophenotypes for linkage and association studies. Poster presented at the 12th International Congress of Human Genetics and 61st American Society of Human Genetics / International Congress of Human Genetics Meeting, October 11-15, 2011, Montreal, Quebec, Canada. TC=70%.

2. Woodbury-Smith, MR (2012) Psychiatric Genetics. Invited Grand Round Presentation at McMaster University, Department of Psychiatry & Behavioural Neurosciences. TC=100%.

APPENDIX C REPRESENTATIVE TRAINEE PUBLICATIONS

American Journal of Epidemiology, Vol. 176, No. 7
© The Author 2012. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: [email protected].
DOI: 10.1093/aje/kws151
Advance Access publication: September 17, 2012

Systematic Reviews and Meta- and Pooled Analyses

Previous Lung Diseases and Lung Cancer Risk: A Pooled Analysis From the International Lung Cancer Consortium

Darren R. Brenner, Paolo Boffetta, Eric J. Duell, Heike Bickeböller, Albert Rosenberger, Valerie McCormack, Joshua E. Muscat, Ping Yang, H.-Erich Wichmann, Irene Brueske-Hohlfeld, Ann G. Schwartz, Michele L. Cote, Anne Tjønneland, Søren Friis, Loic Le Marchand, Zuo-Feng Zhang, Hal Morgenstern, Neonila Szeszenia-Dabrowska, Jolanta Lissowska, David Zaridze, Peter Rudnai, Eleonora Fabianova, Lenka Foretova, Vladimir Janout, Vladimir Bencko, Miriam Schejbalova, Paul Brennan, Ioan N. Mates, Philip Lazarus, John K. Field, Olaide Raji, John R. McLaughlin, Geoffrey Liu, John Wiencke, Monica Neri, Donatella Ugolini, Angeline S. Andrew, Qing Lan, Wei Hu, Irene Orlow, Bernard J. Park, and Rayjean J. Hung*

* Correspondence to Dr. Rayjean J. Hung, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 60 Murray Street, Toronto, Ontario M5T 3L9, Canada (e-mail: [email protected]).

Initially submitted August 31, 2011; accepted for publication February 23, 2012.

To clarify the role of previous lung diseases (chronic bronchitis, emphysema, pneumonia, and tuberculosis) in the development of lung cancer, the authors conducted a pooled analysis of studies in the International Lung Cancer Consortium. Seventeen studies including 24,607 cases and 81,829 controls (noncases), mainly conducted in Europe and North America, were included (1984–2011). Using self-reported data on previous diagnoses of lung diseases, the authors derived study-specific effect estimates by means of logistic regression models or Cox proportional hazards models adjusted for age, sex, and cumulative tobacco smoking. Estimates were pooled using random-effects models. Analyses stratified by smoking status and histology were also conducted. A history of emphysema conferred a 2.44-fold increased risk of lung cancer (95% confidence interval (CI): 1.64, 3.62 (16 studies)). A history of chronic bronchitis conferred a relative risk of 1.47 (95% CI: 1.29, 1.68 (13 studies)). Tuberculosis (relative risk = 1.48, 95% CI: 1.17, 1.87 (16 studies)) and pneumonia (relative risk = 1.57, 95% CI: 1.22, 2.01 (12 studies)) were also associated with lung cancer risk. Among never smokers, elevated risks were observed for emphysema, pneumonia, and tuberculosis. These results suggest that previous lung diseases influence lung cancer risk independently of tobacco use and that these diseases are important for assessing individual risk.

bronchitis, chronic; emphysema; lung diseases; lung neoplasms; meta-analysis; pneumonia; pulmonary disease, chronic obstructive; tuberculosis

Abbreviations: CI, confidence interval; COPD, chronic obstructive pulmonary disease; RR, relative risk.

Am J Epidemiol. 2012;176(7):573–585

Lung cancer continues to be the leading cause of cancer incidence and mortality worldwide, with an estimated 1,608,800 new cases and 1,378,400 deaths in 2008 (1). Disease survival remains dismal, with 5-year survival rates of approximately 15% among developed populations (2, 3). Although tobacco smoking continues to be the primary determinant of risk, further investigation is required concerning the additional risk factors for lung cancer, particularly among never smokers (4). One particular set of risk factors that may play an important role in lung cancer development is previous lung diseases. Recent evidence suggests that inflammatory processes may play a central role in carcinogenesis (5–8).

Previous lung diseases such as chronic obstructive pulmonary disease (COPD) (including emphysema and chronic bronchitis), pneumonia, and tuberculosis are major sources of inflammation in lung tissue (9, 10). The resulting inflammation has been suggested to increase the risk of lung cancer (11–13), and these diseases may act as catalysts in the development of lung neoplasms (14, 15). The associations between COPD (emphysema and/or chronic bronchitis), pneumonia, and tuberculosis and lung cancer have been investigated previously; however, a recent meta-analysis of the literature showed that most of the studies were small-scale initiatives; 65% were identified as having fewer than 500 cases (16). In addition, the meta-analysis was not able to address the issues of standardized covariate adjustment, and data on never smokers and histologic type were limited.

To address these limitations, we conducted a pooled analysis using primary data from 17 studies included in the International Lung Cancer Consortium to examine the risk of lung cancer associated with previous lung diseases.

MATERIALS AND METHODS

Data collection

Requirements for inclusion of studies in the International Lung Cancer Consortium and other details have been previously published (17) and are available on the Consortium’s website (http://ilcco.iarc.fr). Investigators from 17 participating studies (out of 52 studies included in the

Table 1. Characteristics of Participating Studies in a Pooled Analysis of Previous Lung Diseases and Lung Cancer Risk, International Lung Cancer Consortium, 1984–2011

Continent and Study/Center | Principal Investigator | Control Source | Study Period | Location | No. of Cases | No. of Controls | Total No.

North America
Family Health Study (WSU/KCI-1) (22) | A. G. Schwartz | Population | 1984–2004 | Detroit, Michigan, US | 1,006 | 1,184 | 2,190
Study of women’s lung cancer epidemiology (WSU/KCI-2) (30) | A. G. Schwartz | Population | 2001–2007 | Detroit, Michigan, US | 576 | 575 | 1,151
University of California, Los Angeles (21) | Z. F. Zhang | Population | 1999–2004 | Los Angeles, California, US | 611 | 1,040 | 1,651
New England Lung Cancer Study (25) | E. Duell | Population | 2005–2008 | New Hampshire, US | 277 | 251 | 528
Samuel Lunenfeld Research Institute (18) | J. McLaughlin | Mixed | 1997–2002 | Toronto, Ontario, Canada | 445 | 947 | 1,392
Mayo Clinic (27) | P. Yang | Mixed | 1997–2006 | Rochester, Minnesota, US | 5,700 | 2,269 | 7,969
New York Multicenter Study (26) | J. Muscat | Hospital | 1969–1999 | New York State, US | 5,130 | 4,942 | 10,072
Moffitt Cancer Study (24) | P. Lazarus | Hospital | 1999–2003 | Florida, US | 497 | 898 | 1,395
University of California, San Francisco (29) | J. Wiencke | Population | 1999–2002 | San Francisco, California, US | 428 | 900 | 1,328
Memorial Sloan-Kettering Cancer Center (33) | I. Orlow | Hospital | 2003–2005 | New York City, US | 102 | 101 | 203
Hawaii (28) | L. Le Marchand | Population | 1992–1997 | Hawaii, US | 635 | 588 | 1,223

Europe
Liverpool Lung Project (35) | J. K. Field | Population | 1998–2006 | Liverpool, United Kingdom | 475 | 954 | 1,429
CREST Biorepository (19) | M. Neri | Mixed | 1996–ongoing | Genova, Italy | 413 | 555 | 968
Helmholtz Center Munich (39, 40, 69, 70) | E. Wichmann | Population | 2000–2004 | Germany | 4,735 | 8,178 | 12,913
Central Europe (23) | P. Boffetta | Hospital | 1998–2002 | Central/Eastern Europe | 2,633 | 2,702 | 5,335
Danish Diet, Cancer, and Health Study^a (20) | A. Tjønneland | Population-based cohort | 1993–2009 | Copenhagen, Denmark | 822 | 55,623 | 56,445

Asia
NCI-China (34) | Q. Lan | Population | 1985–1990 | Xuan Wei, People’s Republic of China | 122 | 122 | 244

Total | | | | | 24,607 | 81,829 | 106,436

Abbreviations: CREST, Cancer of the Respiratory Tract; KCI, Karmanos Cancer Institute; NCI, National Cancer Institute; US, United States; WSU, Wayne State University. a Population-based cohort included in counts as cases and controls.

Consortium) contributed data on previous lung diseases and agreed to participate in this pooled analysis (Table 1). There was 1 population-based cohort study and 16 case-control studies, of which 9 were population-based, 4 were hospital-based, and 3 had mixed controls (where both population and hospital-based controls were sampled). Eleven studies were conducted in North America, 5 in Europe, and 1 in China; the dates of the studies ranged from 1984 to 2011. The control groups in all of the case-control studies were, at a minimum, frequency-matched with cases on age and sex. Written informed consent was obtained from all study subjects, and ethics review boards at each study center approved the study protocols. The data submitted from all 17 studies were checked for missing values, inadmissible values, aberrant distributions, and inconsistencies. Queries were sent to the investigators to resolve all discrepancies and possible errors. Subjects with unknown age or sex were excluded from the analysis (n = 6). A total of 24,607 cases and 81,829 controls were available for the present investigation.

Previous lung diseases were based on self-reported status of being previously diagnosed with chronic bronchitis, emphysema, pneumonia, or tuberculosis by a physician. Two of the studies asked open-ended questions about previous lung diseases, where responses were recorded using free text (18) or were coded using International Classification of Diseases, Ninth Revision, codes (19). Dichotomous variables were created for each of the previous lung diseases. Several studies also recorded the date of diagnosis of the disease (18, 20–28). Detailed descriptions of the 17 study populations within this analysis have been published elsewhere (18–34). Four of the studies had previously reported effect estimates for prior lung diseases (18, 25, 30, 35) and were included in the previous meta-analysis (16), whereas the other 13 studies (88% of the pooled study population) represented new data and were not included in the previous meta-analysis (Table 1).

Statistical methods

The frequency distributions of demographic variables and putative risk factors for lung cancer, including age, sex, ethnicity, and smoking, were examined among cases and controls combined. The ethnicity of the subjects was categorized according to the National Institutes of Health definition as non-Hispanic white, black or African-American, Hispanic or Latino, Asian, Native Hawaiian or Pacific Islander, American Indian, or other. Former smokers were defined as smokers who had quit smoking at least 2 years before the interview or diagnosis. Never smokers were defined as persons who had smoked fewer than 100 cigarettes over their lifetime. Cumulative tobacco smoking was calculated as the product of smoking duration and intensity throughout the life course, standardized across studies and expressed as pack-years.

For those studies that recorded the date of lung disease diagnosis, indicator variables for whether the diagnosis had been made 5 years or 10 years before the date of cancer diagnosis or control interview were created. For case-control studies, we estimated odds ratios and their associated 95% confidence intervals for the relation of each previous lung disease with lung cancer, using unconditional logistic regression, adjusting for age, sex, cumulative tobacco smoking (in pack-years), and country (when the study participants were from multiple countries). For the cohort study (20), we used Cox regression (with time since study entry as the time scale) to estimate hazard ratios, adjusted for age, sex, and pack-years, and their associated 95% confidence intervals for each previous lung disease. Follow-up time at risk was calculated as the time between study entry and lung cancer diagnosis (for cases) or the last known date of query (for noncases) from the cancer registry. Although we estimated hazard ratios or odds ratios across study sites, we refer to all effect estimates henceforth as relative risks for consistency.

When information on cumulative tobacco smoking was missing (<1%), it was imputed using the median of the study-specific control population for the smoking group (never/former/current) of the individual. We estimated pooled effects across studies, employing random-effects models to account for variability between study populations. Studies in which the odds ratio could not be estimated because of small numbers in one or more of the 4 categories in the 2 × 2 table of case-control status and history of previous lung diseases were omitted. We conducted an analysis stratified by smoking status to investigate the potential modifying effects of smoking or differential etiology across smoking groups. We also compared effect estimates across histologic subtypes to search for differential effects. We adjusted estimates for other lung diseases; however, since not all studies collected data for all diseases, this limited the sample in which we could conduct such an analysis. Subgroup analyses for large cell carcinoma are omitted from the results because there were very small numbers of cases in most studies and risk measures could not be estimated across studies unless data were pooled as a single study. We estimated population attributable fractions for each of the previous lung diseases based on the pooled adjusted effect estimates and the proportion of exposed persons among the cases (36).

Heterogeneity was evaluated for each of the summary estimates based on a test of the Cochran Q statistic as well as the I² statistic (37). Where there was evidence of heterogeneity across studies, we evaluated the source of heterogeneity by means of meta-regression using control type, prevalence of ever smoking among controls, median year of the study period, and continents as predictors. If the heterogeneity could not be accounted for by the different study characteristics, we conducted an influence analysis to evaluate the source of heterogeneity from single studies using Galbraith plots (38) and Q statistics through an iterative process. Statistical analyses were conducted using SAS, version 9.1 (SAS Institute Inc., Cary, North Carolina), and STATA, version 10 (StataCorp LP, College Station, Texas).

RESULTS

The demographic distribution of the pooled data set for previous lung diseases is displayed in Table 2. The

Table 2. Demographic Characteristics of Participants in a Pooled Analysis of Previous Lung Diseases and Lung Cancer Risk, International Lung Cancer Consortium, 1984–2011

Characteristic | Cases (n = 24,607): No. | % | Mean (SD) | Controls (Noncases) (n = 81,829): No. | % | Mean (SD)

Age at diagnosis^a, years | | | 61.1 (10.9) | | | 56.4 (8.1)

Age group, years
<50 | 4,434 | 18.0 | | 12,135 | 14.8 |
51–60 | 6,713 | 27.3 | | 46,522 | 56.9 |
61–70 | 8,392 | 34.1 | | 19,156 | 23.4 |
>70 | 5,068 | 20.6 | | 4,016 | 4.9 |

Sex
Male | 15,394 | 62.6 | | 41,964 | 51.3 |
Female | 9,213 | 37.4 | | 39,865 | 48.7 |

Ethnicity
White/Caucasian | 21,030 | 85.5 | | 74,890 | 91.5 |
Black/African-American | 1,379 | 5.6 | | 1,698 | 2.1 |
Asian | 561 | 2.3 | | 574 | 0.7 |
Hispanic/Latino | 313 | 1.3 | | 629 | 0.8 |
Other/unknown | 1,324 | 5.4 | | 4,038 | 4.9 |

Smoking status
Never smoker | 2,719 | 11.0 | | 29,884 | 36.5 |
Ever smoker | 21,888 | 88.9 | | 51,945 | 63.5 |
Former smoker | 13,113 | 53.3 | | 27,022 | 52.0 |
Current smoker | 8,775 | 35.7 | | 24,923 | 48.0 |

Pack-years of smoking^b | | | 44.1 (28.0) | | | 28 (16.8)
<15 | 5,191 | 21.1 | | 41,719 | 51.0 |
15–<30 | 4,383 | 17.8 | | 13,477 | 16.5 |
30–45 | 6,179 | 25.1 | | 21,168 | 25.9 |
>45 | 8,854 | 36.0 | | 5,465 | 6.7 |

Histologic type^c
Adenocarcinoma | 6,684 | 27.1 | | | |
Squamous cell carcinoma | 4,685 | 19.0 | | | |
Small cell lung cancer | 1,810 | 7.4 | | | |
Large cell lung cancer | 824 | 3.3 | | | |

Abbreviation: SD, standard deviation. a Age at baseline in the cohort study. b Among ever smokers only. c The remaining cases had either mixed or other histologic types.

majority of cases were Caucasian, male, and over the age of 60 years. As expected, there was a much higher proportion of never smokers among the controls. Adenocarcinoma and squamous cell carcinoma were the most commonly characterized histologic subtypes among cases in the pooled population. The prevalences of the 4 lung diseases examined among cases and controls across studies/centers, smoking groups, and histology groups are shown in Table 3.

Overall, all of the 4 previous lung diseases examined were associated with increased incidence of lung cancer when adjusted estimates were examined individually. Specifically, a previous diagnosis of emphysema was associated with increased risk overall, based on 16 studies (relative risk (RR) = 2.44, 95% confidence interval (CI): 1.64, 3.62; I² = 89.37%), and when stratified according to never (RR = 2.21, 95% CI: 1.00, 4.90; I² = 88.52%) or ever (RR = 2.25, 95% CI: 1.50, 3.37; I² = 44.28%) smoking. The study-specific estimates, as well as estimates for subgroups of smoking status and histology, are shown in Figure 1. There was evidence of heterogeneity across studies that was not clearly explained by a single source (i.e., control type, proportion of ever smokers, time period, continent; all contributed (P < 0.001)). When we removed the outlying

Table 3. Prevalence of Previous Lung Disease Among Lung Cancer Cases and Controls, by Study/Center, Smoking Status, and Histologic Type, International Lung Cancer Consortium, 1984–2011

Column groups, in order: Emphysema | Chronic Bronchitis | Tuberculosis | Pneumonia. Within each group: No. of Cases Exp., No. of Cases Unexp., No. of Controls Exp., No. of Controls Unexp. (histologic type rows give cases only). Groups are listed in the order extracted; blank cells of the original table are not shown.

Study/center
UCLA (21): 71 540 7 1,033 | 71 540 59 981 | 28 583 25 1,015 | 213 398 197 843
Helmholtz Center Munich (39, 40, 69, 70): 125 4,564 88 4,528 | 929 3,759 665 6,263 | 192 4,505 204 4,399 | 1,098 3,557 781 3,808
Central Europe (23): 75 2,552 47 2,652 | 207 2,420 150 2,550 | 900 1,727 689 2,009
NCI-China (34): 10 111 2 119 | 38 84 34 87 | 12 110 1 119
Family Health Study (WSU/KCI-1) (22): 30 967 18 1,164 | 29 388 28 441 | 24 975 12 1,171 | 174 819 171 1,011
Study of women’s lung cancer epidemiology (WSU/KCI-2) (30): 87 488 12 560 | 123 450 65 507 | 19 551 16 555 | 207 365 187 384
Hawaii (28): 105 525 20 568 | 36 593 15 573 | 26 605 28 560
Samuel Lunenfeld Research Institute (18): 31 270 8 436 | 21 424 49 898 | 6 439 5 942 | 11 434 35 912
Liverpool Lung Project (35): 6 317 31 875 | 9 314 40 865 | 41 282 138 768
Mayo Clinic (27): 1,167 4,533 36 2,233 | 63 5,637 18 2,251 | 897 4,803 123 2,146
New England Lung Cancer Study (25): 46 228 10 241 | 41 235 13 238 | 3 272 2 249 | 123 152 73 178
Moffitt Cancer Study (24): 67 428 29 861 | 55 440 36 853
New York Multicenter Study (26): 299 4,831 82 4,860 | 527 4,603 201 4,741 | 50 5,080 29 4,913
CREST Biorepository (19): 10 403 4 551 | 77 336 14 541 | 7 406 3 552 | 21 392 14 541
UCSF (29): 77 349 45 853 | 20 407 19 881 | 168 258 167 733
MSKCC (33): 6 90 4 96 | 4 90 1 97
Danish Diet, Cancer, and Health Study (20): 15 807 271 55,352 | 45 777 1022 54,601 | 6 816 108 55,515 | 242 580 4,154 51,469

Smoking status
Never smoker: 44 2,514 94 28,007 | 90 1,516 459 26,312 | 70 2,588 208 27,883 | 343 1,848 1,852 24,018
Ever smoker: 2,177 19,399 616 48,879 | 1,908 11,203 1,746 44,508 | 606 20,622 453 48,751 | 3,752 11,919 4,877 40,784
Former smoker: 1,644 11,312 335 25,576 | 1,203 5,988 821 22,655 | 334 12,426 298 25,403 | 2,240 8,262 2,292 21,141
Current smoker: 533 8,087 281 23,303 | 705 5,215 925 21,853 | 272 8,196 155 23,348 | 1,512 3,657 2,585 19,643

Histologic type
Adenocarcinoma: 705 5,551 | 277 2,792 | 159 6,287 | 950 3,933
Squamous cell carcinoma: 588 3,847 | 226 1,689 | 127 4,270 | 839 2,340
Small cell lung cancer: 175 1,607 | 62 618 | 50 1,700 | 301 1,098

Total: 2,221 21,913 710 76,886 | 1,998 12,719 2,205 70,820 | 676 23,210 661 76,634 | 4,095 13,767 6,729 64,802
With removal(s)^a: 769 9,028 498 65,641 | 453 18,044 440 71,561 | 2,602 7,432 2,033 9,061

Abbreviations: CREST, Cancer of the Respiratory Tract; Exp., exposed; KCI, Karmanos Cancer Institute; MSKCC, Memorial Sloan-Kettering Cancer Center; UCLA, University of

California, Los Angeles; UCSF, University of California, San Francisco; Unexp., unexposed; WSU, Wayne State University. 577

a Removal of one or more particular studies for each previous disease as specified in the figure legends.


Figure 1. Results from a pooled analysis of emphysema as a risk factor for the development of lung cancer, International Lung Cancer Consortium, 1984–2011. The graph shows a forest plot of the association between emphysema and lung cancer risk by study center, smoking status, and histologic type. Models adjusted for age, sex, and pack-years of smoking. P values are from a test for heterogeneity across studies or across subgroups. “With removals” represents removal of the Mayo, Central Europe, HMGU, WSU/KCI-2, and UCLA studies. See Table 1 for published references. (CI, confidence interval; CREST, CREST (Cancer of the Respiratory Tract) Biorepository; Danish, Danish Diet, Cancer, and Health Study; HMGU, Helmholtz Center Munich; KCI, Karmanos Cancer Institute; Liverpool, Liverpool Lung Project; NCI, National Cancer Institute; NELCS, New England Lung Cancer Study; New York, New York Multicenter Study; OR, odds ratio; RR, relative risk; SCC, squamous cell carcinoma; SCLC, small cell lung cancer; Toronto, Samuel Lunenfeld Research Institute; UCLA, University of California, Los Angeles; UCSF, University of California, San Francisco; WSU, Wayne State University; WSU/KCI-1, Family Health Study; WSU/KCI-2, study of women’s lung cancer epidemiology).
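The pooled relative risks, I² values, and heterogeneity tests reported with these forest plots come from combining study-level estimates. As a rough illustration only, the sketch below pools log relative risks using inverse-variance weights with a DerSimonian-Laird between-study variance, and reports Cochran's Q and I². The DerSimonian-Laird estimator, the function names, and the three study estimates are assumptions made for illustration; the article states only that random-effects models were used.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Approximate SE of log(RR) from a 95% CI given on the ratio scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)

def pool_random_effects(log_rr, se):
    """Return (pooled RR, CI low, CI high, Cochran's Q, I2 in percent).

    Assumes DerSimonian-Laird estimation of the between-study variance,
    which is an illustrative choice, not a method confirmed by the article.
    """
    w = [1.0 / s ** 2 for s in se]                 # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_rr)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rr))  # Cochran's Q
    df = len(log_rr) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0               # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0         # I2 as a percentage
    w_re = [1.0 / (s ** 2 + tau2) for s in se]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_rr)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled),
            q, i2)

# Hypothetical (RR, lower, upper) triples; NOT values from the article.
example = [(2.0, 1.5, 2.7), (1.2, 0.9, 1.6), (3.1, 2.0, 4.8)]
log_rr = [math.log(rr) for rr, _, _ in example]
se = [se_from_ci(lo, hi) for _, lo, hi in example]
rr, lo, hi, q, i2 = pool_random_effects(log_rr, se)
print(f"pooled RR = {rr:.2f} (95% CI: {lo:.2f}, {hi:.2f}); Q = {q:.2f}; I2 = {i2:.1f}%")
```

Because the random-effects weights add the between-study variance to every study's own variance, large studies dominate the pooled estimate less than under fixed-effect weighting, at the cost of a wider confidence interval.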

studies (21–23, 27, 39) as indicated by the Galbraith plot (see Web Figure 1 (http://aje.oxfordjournals.org/)), we observed marginal attenuation in the pooled effect estimate (RR = 2.33, 95% CI: 1.86, 2.94; I² = 40.52%). After adjustment for other previous lung diseases, the relative risk associated with emphysema was 2.05 (95% CI: 1.33, 3.15; I² = 89.95%) (data not shown).

A previous diagnosis of chronic bronchitis was associated with increased risk overall, based on 13 studies (RR = 1.47, 95% CI: 1.29, 1.68; I² = 33.91%), and among ever smokers (RR = 1.63, 95% CI: 1.40, 1.89; I² = 40.79%) (Figure 2). There was no evidence of heterogeneity across the 13 studies (P = 0.111). After adjustment for other previous lung diseases, the risk ratio for chronic bronchitis was 1.25 (95% CI: 1.05, 1.56; I² = 60.70%) (data not shown).

When the effects of chronic bronchitis and emphysema were examined as a measure of COPD, the combined overall effect of COPD was a relative risk of 1.93 (95% CI: 1.48, 4.89; I² = 89.54%) (data not shown).

A previous diagnosis of pneumonia was associated with increased risk overall, based on 12 studies (RR = 1.57, 95% CI: 1.22, 2.01; I² = 93.00%), and when stratified according to never (RR = 1.35, 95% CI: 1.12, 1.63; I² = 23.01%) or ever (RR = 1.55, 95% CI: 1.16, 2.06; I² = 93.18%) smoking (Figure 3). There was evidence of heterogeneity across studies that was not clearly explained by a single source (P < 0.001). When we removed the outlying studies (18, 20, 27, 29, 30) as indicated by the Galbraith plot (Web Figure 2), we observed a slight attenuation in the pooled effect estimate (RR = 1.39, 95% CI: 1.27, 1.52; I² = 14.28%). After adjustment for other previous lung diseases, the relative risk for pneumonia was 1.42 (95% CI: 1.09, 1.86; I² = 93.13%) (data not shown).

A previous diagnosis of tuberculosis was associated with increased risk overall, based on 16 studies (RR = 1.48, 95% CI: 1.17, 1.87; I² = 54.27%), and among ever smokers (RR = 1.36, 95% CI: 1.05, 1.75; I² = 47.96%) (Figure 4). We also observed an elevated risk among never smokers


Figure 2. Results from a pooled analysis of chronic bronchitis as a risk factor for the development of lung cancer, International Lung Cancer Consortium, 1984–2011. The graph shows a forest plot of the association between chronic bronchitis and lung cancer risk by study center, smoking status, and histologic type. Models adjusted for age, sex, and pack-years of smoking. P values are from a test for heterogeneity across studies or across subgroups. (CI, confidence interval; CREST, CREST (Cancer of the Respiratory Tract) Biorepository; Danish, Danish Diet, Cancer, and Health Study; HMGU, Helmholtz Center Munich; KCI, Karmanos Cancer Institute; MSKCC, Memorial Sloan-Kettering Cancer Center; NELCS, New England Lung Cancer Study; New York, New York Multicenter Study; NCI, National Cancer Institute; OR, odds ratio; RR, relative risk; SCC, squamous cell carcinoma; SCLC, small cell lung cancer; Toronto, Samuel Lunenfeld Research Institute; UCLA, University of California, Los Angeles; WSU, Wayne State University; WSU/KCI-1, Family Health Study; WSU/KCI-2, study of women’s lung cancer epidemiology).

(RR = 1.50, 95% CI: 1.03, 2.19; I² = 23.64%). There was evidence of heterogeneity across studies (P = 0.005); however, when we examined the Galbraith plot, it appeared that the heterogeneity was due to only 1 outlying study (40) (Web Figure 3). When this study was removed, a slight elevation in the pooled effect estimate was observed (RR = 1.56, 95% CI: 1.24, 1.96; I² = 34.95%). After adjustment for other previous lung diseases, the relative risk for tuberculosis was 1.31 (95% CI: 1.03, 1.56; I² = 50.99%) (data not shown).

In those studies where multiple diseases were investigated, we examined the risk associated with having multiple lung diseases. There was a dose-response relation with increasing number of previous lung diseases (P-trend < 0.001). The relative risk associated with having 1 disease was 1.71 (95% CI: 1.61, 1.82); with having 2 diseases, it was 2.00 (95% CI: 1.80, 2.21); with having 3 diseases, 2.23 (95% CI: 1.76, 2.82); and with having all 4 diseases, 2.44 (95% CI: 0.92, 6.48) (only 8 controls and 15 cases had had all 4 diseases).

We examined the effects of all 4 lung diseases separately among males and females and observed no differential effects by sex. (Full subgroup analyses are shown in Web Table 1.)

Population attributable fraction estimates for the diseases investigated ranged within the combined population from 0.9% for tuberculosis to 8.3% for pneumonia, with study-specific estimates varying according to population disease prevalence (tuberculosis, 0.29%–9.76%; chronic bronchitis, 3.63%–30.28%; emphysema, 1.20%–17.90%; pneumonia, 0.51%–44.11%). Among never smokers as a combined group, having had any of the previous lung diseases of interest conferred an attributable fraction of 5.91% (Web Table 2).

DISCUSSION

In this investigation into the effects of previous lung diseases on lung cancer risk, we found associations with increased cancer risk for each of the diseases of interest. Comparisons among all histologic subgroups were consistent with increases in risk observed overall, with the exception of squamous cell carcinoma among persons with tuberculosis. Risk estimates were consistent across smoking subgroups; estimates were elevated in all subgroups, with the exception of chronic bronchitis. Our results among never smokers suggest an effect of previous lung diseases


Figure 3. Results from a pooled analysis of pneumonia as a risk factor for the development of lung cancer, International Lung Cancer Consortium, 1984–2011. The graph shows a forest plot of the association between pneumonia and lung cancer risk by study center, smoking status, and histologic type. Models adjusted for age, sex, and pack-years of smoking. P values are from a test for heterogeneity across studies or across subgroups. “With removals” represents removal of the Toronto, WSU/KCI-2, UCSF, Mayo, and Danish studies. (CI, confidence interval; CREST, CREST (Cancer of the Respiratory Tract) Biorepository; Danish, Danish Diet, Cancer, and Health Study; HMGU, Helmholtz Center Munich; KCI, Karmanos Cancer Institute; Liverpool, Liverpool Lung Project; NELCS, New England Lung Cancer Study; OR, odds ratio; RR, relative risk; SCC, squamous cell carcinoma; SCLC, small cell lung cancer; Toronto, Samuel Lunenfeld Research Institute; UCLA, University of California, Los Angeles; UCSF, University of California, San Francisco; WSU, Wayne State University; WSU/KCI-1, Family Health Study; WSU/KCI-2, study of women’s lung cancer epidemiology).
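The "With removals" estimates depend on outlying studies identified from a Galbraith plot. A minimal sketch of that idea is given below, assuming the conventional construction: each study's standardized estimate (log RR divided by its standard error) is plotted against its precision (1/SE), and studies lying more than 2 units from the fixed-effect line through the origin are treated as outliers. The function name, the threshold of 2, and the five study values are hypothetical and are not taken from the article.

```python
def galbraith_outliers(log_rr, se, band=2.0):
    """Return indices of studies outside the +/- band around the Galbraith line.

    The line passes through the origin with slope equal to the fixed-effect
    (inverse-variance) pooled log RR. The band of 2 is an assumed convention.
    """
    w = [1.0 / s ** 2 for s in se]
    slope = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum(w)
    flagged = []
    for i, (y, s) in enumerate(zip(log_rr, se)):
        precision = 1.0 / s            # x-axis of the Galbraith plot
        z = y / s                      # y-axis: standardized estimate
        if abs(z - slope * precision) > band:
            flagged.append(i)
    return flagged

# Hypothetical log relative risks and standard errors; NOT article data.
log_rr = [0.40, 0.45, 0.35, 0.40, 1.50]
se = [0.10, 0.10, 0.10, 0.10, 0.30]
print(galbraith_outliers(log_rr, se))  # indices of flagged studies
```

Pooling can then be repeated without the flagged studies to see how far the estimate moves, which is the comparison the "With removals" rows report.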

on lung cancer risk independent of tobacco smoking, probably acting through the inflammatory response and pathogenesis associated with the diseases.

The results of this pooled analysis corroborate the results of the previous meta-analysis suggesting that there was a large difference in the prevalence of COPD/emphysema among cases and controls (16). This difference in prevalence among cases and controls may explain/confound the differential effects observed in genetic epidemiologic studies of lung cancer in which inconsistent effects have been observed among populations of similar genetic ancestry (41) or may act as mediators in the associations between the variants and lung cancer risk (42). Although chronic bronchitis and emphysema are commonly grouped together as COPD, we calculated detailed results for each condition separately in order to allow for differential effects of these two conditions, which have different pathologies and etiologies. Because we observed independent effects of both of these diseases when adjusting for the other in a fixed-effects analysis, we felt this to be a beneficial approach.

Reverse causality and the issue of temporality are paramount to the consideration of causality for these associations. It is certainly possible that some of the conditions were early manifestations or symptoms of lung cancer that were misdiagnosed, particularly for emphysema and chronic bronchitis. For pneumonia and tuberculosis, infections may have been the result of a weakened immune system due to lung cancer. In addition, tumors may have been interpreted as lesions from infections prior to cancer diagnosis. To address these issues, we conducted a latency analysis which found that diagnoses of the previous lung diseases more than 5 years and more than 10 years prior to cancer diagnosis were positively associated with lung cancer incidence. This suggests that reverse causality is not likely to fully explain these associations. For example, when the analysis was restricted to the conditions diagnosed 10 years prior to lung cancer, chronic bronchitis remained associated with an increased risk of lung cancer (RR = 1.45, 95% CI: 1.08, 1.95). Complete results of latency analyses are available in Web Table 3. Note that in the cohort study included in this analysis (20), both lung disease and smoking status were ascertained at baseline and the average follow-up time to diagnosis/censoring was approximately 7 years.

The use of self-reports for measuring previous lung diseases may have introduced misclassification bias into the


Figure 4. Results from a pooled analysis of tuberculosis as a risk factor for the development of lung cancer, International Lung Cancer Consortium, 1984–2011. The graph shows a forest plot of the association between tuberculosis and lung cancer risk by study center, smoking status, and histologic type. Models adjusted for age, sex, and pack-years of smoking. P values are from a test for heterogeneity across studies or across subgroups. “With removal” represents removal of the HMGU study. (CI, confidence interval; CREST, CREST (Cancer of the Respiratory Tract) Biorepository; Danish, Danish Diet, Cancer, and Health Study; HMGU, Helmholtz Center Munich; KCI, Karmanos Cancer Institute; Liverpool, Liverpool Lung Project; MSKCC, Memorial Sloan-Kettering Cancer Center; NCI, National Cancer Institute; NELCS, New England Lung Cancer Study; New York, New York Multicenter Study; OR, odds ratio; RR, relative risk; SCC, squamous cell carcinoma; SCLC, small cell lung cancer; Toronto, Samuel Lunenfeld Research Institute; UCLA, University of California, Los Angeles; UCSF, University of California, San Francisco; WSU, Wayne State University; WSU/KCI-1, Family Health Study; WSU/KCI-2, study of women’s lung cancer epidemiology).

studies included in the pooled analysis. Quantitative techniques for each of the previous lung diseases are presently available for improved diagnostic accuracy and disease classification; however, these were not employed in any of the component studies of the analysis. When effect estimates obtained using quantitative diagnostic tools for COPD (forced expiratory volume in 1 second, quantitative computed tomography, or radiographic evidence), pneumonia (microimmunofluorescence), and tuberculosis (radiography) were pooled in the previous meta-analysis (16), the risk estimates derived using quantitative techniques were consistent with those derived using self-reported diagnoses. The similarity between effect estimates from the cohort study included in the analysis and the pooled case-control estimates (results not shown) suggests that potential bias due to misclassification of exposure, recall bias, and reverse causality may not explain the associations completely. Although none of the studies contained in this analysis validated self-reports with medical records, self-reported COPD has been shown to have a high level of agreement with spirometry results (43, 44). Despite the reports of these previous studies, misclassification of exposure may have produced underestimation of the burden due to the exposures, since several investigations have shown that COPD/emphysema is present in many lung cancer patients who do not report a history of COPD (45–47).

For pneumonia, the question of persistence of inflammation arising from a condition with clinical transience should be addressed. Because this investigation did not contain information on the number of infections or the length and/or intensity of infection, it is difficult to conceptually include pneumonia with the other diseases in terms of persistence of inflammation. However, murine models have suggested that infection from Mycoplasma pneumoniae can lead to long-term changes in peribronchial histopathology (48), pulmonary airflow resistance, and elevated inflammatory biomarkers long after active infection clears (49). This suggests that inflammation resulting from pneumonia may be

more long-term in nature than clinical symptoms may suggest.

It is also possible that our results, particularly among never smokers, may have been confounded from exposure to other agents such as secondhand smoke or other occupational exposures. Secondhand smoke has been associated with increased risk of lung cancer (50) and may be related to previous lung diseases (51). However, it is unlikely to fully explain the large effects associated with several of the previous lung diseases. When we adjusted for secondhand smoke in our analysis among never smokers, the results remained, with risk estimates changing only slightly. For example, the relative risk associated with pneumonia among never smokers changed marginally from 1.35 to 1.45. In addition, occupational exposures may have acted as confounders in the associations tested, as they have been associated with lung cancer (52, 53). We examined the inclusion of restricted cubic splines to check for nonlinearity in both age and smoking (pack-years) as covariates in the association models. As was previously observed (54), nonlinear components for age and smoking were significant in the models, suggesting a deviation from linear fit; however, the effect estimates for the lung diseases pooled across studies changed minimally. Therefore, we retained linear terms in the models to avoid overdispersion in small studies when examining the within-study effects.

For those instances where heterogeneity was observed in the overall estimates (emphysema, pneumonia, tuberculosis), removal of the outlying studies led to only slight differences in the pooled estimates. For emphysema and pneumonia, where more than one study was contributing to heterogeneity, meta-regression suggested that several sources, including continent, control type, and proportion of ever smokers, all accounted for some portion of the heterogeneity (results not shown). In subgroup analyses where more than 3 studies were included in a pooled estimate, the only major difference was seen for emphysema between Europe and North America (Web Table 1). For emphysema, differences by continent of study may be a product of different diagnostic practices across populations, since diagnostic criteria for COPD differ across continents. More specifically, the diagnostic guidelines of the British Thoracic Society and the American Thoracic Society lead to marked differences in the prevalence of COPD when applied to the same population (55). Diagnostic differences across study locations that are not discernable from questionnaires may also explain some portion of the heterogeneity. Although these differences in diagnostic practice should produce nondifferential variation in disease status classification across cases and controls, the potential of this to influence the results should not be precluded. Several studies included in this analysis displayed COPD (emphysema/chronic bronchitis) prevalence higher than that in the community at large (Web Table 2), where it is often largely undiagnosed (56). For emphysema, control source contributed significantly to heterogeneity, suggesting that the differences in diagnosis in population-based settings compared with hospital-based settings may affect the prevalence of disease reported and therefore the magnitude of estimates and associated population attribution measures.

Strengths of this investigation include the large sample size and the large number of exposed persons. The use of random-effect models, although it provides wider confidence intervals, reduces the likelihood of larger studies’ overly affecting pooled estimates when combining data across studies by estimating both within- and across-study variance. The inclusion of prospective data is also a strength of this pooled analysis, although the number of cases collected prospectively was comparatively smaller, whereby the biases associated with case-control studies could be comparatively evaluated.

In conclusion, we observed elevated lung cancer risks associated with previous diagnoses of emphysema, chronic bronchitis, pneumonia, and tuberculosis in this pooled analysis of primary data. The observation of relatively consistent associations between several of the previous lung diseases and lung cancer risk across smoking groups, histologic subtypes, and study designs supports a direct association with lung cancer, reducing the likelihood of confounding by tobacco exposure. The most likely explanation for the increased risk associated with these diseases is the effect of the inflammatory response within lung tissue. Recent evidence has suggested that inflammation plays a pivotal role in the development of lung cancer (12, 57, 58). Inflammation may increase the risk of cancer development as an initiator or promoter through 3 processes: increased genetic mutation, anti-apoptotic signaling (59), and angiogenesis (14).

Whether acting as promoters in the causal pathway or as causative agents, these diseases appear to be markers of risk for the development of lung cancer that are clinically relevant. Most importantly, when considered as a group, the lung diseases examined in this pooled analysis affect large numbers of persons. In the United States, the prevalence of emphysema is 18.5 per 1,000 persons, and the prevalence of chronic bronchitis is 43.0 per 1,000 (60). Although the incidence of pneumonia in the United States is unknown, there were approximately 1.4 million hospital discharges associated with pneumonia in 2005 (61). While the incidence of tuberculosis in North America is low (4.8 per 100,000 population per year) (62), in developing nations the disease affects millions. In Europe and Asia, these conditions collectively affect millions of persons, and thus the exposed population is large (63). Therefore, the positive associations between the lung diseases examined and lung cancer risk are of substantial public health importance, and the consistent associations suggest that a nontrivial proportion of all lung cancer cases are attributable to these other lung diseases or their underlying pathologies.

The previous lung diseases examined in this investigation are significant for both public health and clinical practice. The development of lung cancer risk prediction models (54, 64, 65) should continue to incorporate the lung diseases examined in this analysis for improved discriminatory ability among all patients, regardless of smoking history. The United Kingdom Lung Cancer Screening Trial, which uses computed tomography to screen for lung cancer, utilizes the lung cancer risk prediction model of the Liverpool Lung Project, which includes pneumonia as one of the factors (62) for selection of high-risk individuals for the trial (66). These diseases may be useful in determining

who to monitor by providing a further resolution of risk stratification, particularly as new-era screening evaluations and initiatives advance (67, 68).

ACKNOWLEDGMENTS

Author affiliations: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada (Darren R. Brenner, John R. McLaughlin, Rayjean J. Hung); Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada (Darren R. Brenner, John R. McLaughlin, Rayjean J. Hung); Institute for Translational Epidemiology, Mount Sinai School of Medicine, New York, New York (Paolo Boffetta); International Prevention Research Institute, Lyon, France (Paolo Boffetta); Unit of Nutrition, Environment and Cancer, Cancer Epidemiology Research Program, Catalan Institute of Oncology, Barcelona, Spain (Eric J. Duell); Department of Genetic Epidemiology, Medical School, Georg-August University of Göttingen, Göttingen, Germany (Heike Bickeböller, Albert Rosenberger); International Agency for Research on Cancer, Lyon, France (Valerie McCormack, Paul Brennan); Division of Epidemiology, Penn State Cancer Institute, Pennsylvania State University, State College, Pennsylvania (Joshua E. Muscat); Departments of Pharmacology and Health Evaluation Sciences, Penn State Cancer Institute, Pennsylvania State University, State College, Pennsylvania (Philip Lazarus); Epidemiology and Genetics of Lung Cancer Research Program, Mayo Clinic, Rochester, Minnesota (Ping Yang); Institute of Epidemiology, Helmholtz Center Munich, German Research Center for Environmental Health, Neuherberg, Germany (H.-Erich Wichmann, Irene Brueske-Hohlfeld); Department of Oncology, School of Medicine, Wayne State University, Detroit, Michigan (Ann G. Schwartz, Michele L. Cote); Karmanos Cancer Institute, Detroit, Michigan (Ann G. Schwartz, Michele L. Cote); Institute of Cancer Epidemiology, Danish Cancer Society, Copenhagen, Denmark (Anne Tjønneland, Søren Friis); University of Hawaii Cancer Center, Honolulu, Hawaii (Loic Le Marchand); Department of Epidemiology, School of Public Health, University of California, Los Angeles, Los Angeles, California (Zuo-Feng Zhang); Departments of Epidemiology and Environmental Health Sciences, School of Public Health, University of Michigan, Ann Arbor, Michigan (Hal Morgenstern); Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan (Hal Morgenstern); Department of Epidemiology, Institute of Occupational Medicine, Lodz, Poland (Neonila Szeszenia-Dabrowska); Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland (Jolanta Lissowska); Institute of Carcinogenesis, Cancer Research Centre, Moscow, Russia (David Zaridze); Fodor József National Center for Public Health, National Institute of Environmental Health, Budapest, Hungary (Peter Rudnai); Department of Occupational Health, Specialized State Health Institute, Banska Bystrica, Slovakia (Eleonora Fabianova); Department of Cancer Epidemiology and Genetics, Masaryk Memorial Cancer Institute, Brno, Czech Republic (Lenka Foretova); Department of Preventive Medicine, Faculty of Medicine, Palacky University, Olomouc, Czech Republic (Vladimir Janout); Institute of Hygiene and Epidemiology, First Faculty of Medicine, Charles University, Prague, Czech Republic (Vladimir Bencko, Miriam Schejbalova); University of Medicine and Pharmacy Carol Davila, Bucharest, Romania (Ioan N. Mates); Roy Castle Lung Cancer Research Programme, Department of Molecular and Clinical Cancer Medicine, University of Liverpool Cancer Research Centre, Liverpool, United Kingdom (John K. Field, Olaide Raji); Ontario Cancer Institute, Princess Margaret Hospital, Toronto, Ontario, Canada (Geoffrey Liu); Department of Epidemiology and Biostatistics, School of Medicine, University of California, San Francisco, San Francisco, California (John Wiencke); Unit of Clinical and Molecular Epidemiology, IRCCS (Istituto di Ricovero e Cura a Carattere Scientifico) San Raffaele Pisana, Rome, Italy (Monica Neri); Units of Epidemiology and Biostatistics, University of Genoa, Genoa, Italy (Donatella Ugolini); Units of Epidemiology, Biostatistics, and Clinical Trials, National Cancer Research Institute, Genoa, Italy (Donatella Ugolini); Unit of Biostatistics and Epidemiology, Department of Community and Family Medicine, Norris Cotton Cancer Center, Dartmouth Medical School, Lebanon, New Hampshire (Angeline S. Andrew); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland (Qing Lan, Wei Hu); Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, New York (Irene Orlow); and Department of Surgery, Memorial Sloan-Kettering Cancer Center, New York, New York (Bernard J. Park).

The Toronto study (18) was funded by the Canadian Cancer Society Research Institute (grant 020214). The New England Lung Cancer Study (25) was funded by the National Center for Research Resources, US National Institutes of Health (grant P20RR018787). The Liverpool Lung Project (35) was funded by the Roy Castle Lung Cancer Foundation. The study from Memorial Sloan-Kettering Cancer Center (33) was funded by Steps for Breath, the Labrecque Foundation, and the Society of Memorial Sloan-Kettering Cancer Center. The Central Europe study (23) was funded by the World Cancer Research Fund and the European Commission’s INCO-COPERNICUS Program (contract IC15-CT96-0313). The Warsaw portion of the Central Europe study was funded by the Polish State Committee for Scientific Research (grant SPUB-M-COPERNICUS/P-05/DZ-30/99/2000). The Family Health Study (22) and the study of women’s lung cancer epidemiology (30), conducted by Wayne State University and the Karmanos Cancer Institute, were funded by the US National Institutes of Health (grants R01CA060691, R01CA87895, N01-PC35145, and P30CA22453). The study at the University of California, San Francisco (29) was funded by the US National Institute of Environmental Health Sciences (grant ES06717) and the US National Cancer Institute (grant CA-113710 to S. S. O.). The Danish Diet, Cancer, and Health Study (20) is funded by the Danish Cancer Society. The Helmholtz Center Munich

lung cancer study (39, 40, 69, 70) was funded by the German Federal Ministry of Education, Science, Research and Technology, the State of Bavaria, the National Genome Research Network, the German Research Foundation (grants BI 576/2-1 and BI 576/2-2), the Helmholtz Association, and the German Federal Office for Radiation Protection (grant STSch4454).

R. J. H. holds a Cancer Care Ontario Chair in Population Studies. D. B. holds a Canadian Institutes of Health Research Canada Graduate Scholarship.

Conflict of interest: none declared.

REFERENCES

1. Jemal A, Bray F, Center MM, et al. Global cancer statistics. CA Cancer J Clin. 2011;61(2):69–90.
2. Canadian Cancer Society’s Steering Committee. Canadian Cancer Statistics 2010. Toronto, Ontario, Canada: Canadian Cancer Society; 2010.
3. Jemal A, Siegel R, Ward E, et al. Cancer statistics, 2009. CA Cancer J Clin. 2009;59(4):225–249.
4. Samet JM, Avila-Tang E, Boffetta P, et al. Lung cancer in never smokers: clinical epidemiology and environmental risk factors. Clin Cancer Res. 2009;15(18):5626–5645.
5. Peek RM Jr, Mohla S, DuBois RN. Inflammation in the genesis and perpetuation of cancer: summary and recommendations from a National Cancer Institute-sponsored meeting. Cancer Res. 2005;65(19):8583–8586.
6. Ballaz S, Mulshine JL. The potential contributions of chronic inflammation to lung carcinogenesis. Clin Lung Cancer. 2003;5(1):46–62.
7. Rakoff-Nahoum S, Medzhitov R. Toll-like receptors and cancer. Nat Rev Cancer. 2009;9(1):57–63.
8. Weitzman SA, Gordon LI. Inflammation and cancer: role of phagocyte-generated oxidants in carcinogenesis. Blood. 1990;76(4):655–663.
9. Rutgers SR, Postma DS, ten Hacken NH, et al. Ongoing airway inflammation in patients with COPD who do not currently smoke [abstract]. Chest. 2000;117(5 suppl 1):262S.
10. Moldoveanu B, Otmishi P, Jani P, et al. Inflammatory mechanisms in the lung. J Inflamm Res. 2009;(2):1–11.
11. Brody JS, Spira A. State of the art. Chronic obstructive pulmonary disease, inflammation, and lung cancer. Proc Am Thorac Soc. 2006;3(6):535–537.
12. Engels EA. Inflammation in the development of lung cancer: epidemiological evidence. Expert Rev Anticancer Ther. 2008;8(4):605–615.
13. Heikkilä K, Harris R, Lowe G, et al. Associations of circulating C-reactive protein and interleukin-6 with cancer risk: findings from two prospective cohorts and a meta-analysis. Cancer Causes Control. 2009;20(1):15–26.
14. Azad N, Rojanasakul Y, Vallyathan V. Inflammation and lung cancer: roles of reactive oxygen/nitrogen species. J Toxicol Environ Health B Crit Rev. 2008;11(1):1–15.
15. Punturieri A, Szabo E, Croxton TL, et al. Lung cancer and chronic obstructive pulmonary disease: needs and opportunities for integrated research. J Natl Cancer Inst. 2009;101(8):554–559.
16. Brenner DR, McLaughlin JR, Hung RJ. Previous lung diseases and lung cancer risk: a systematic review and meta-analysis. PLoS ONE. 2011;6(3):e17479. (doi:10.1371/journal.pone.0017479).
17. Hung RJ, Christiani DC, Risch A, et al. International Lung Cancer Consortium: pooled analysis of sequence variants in DNA repair and cell cycle pathways. Cancer Epidemiol Biomarkers Prev. 2008;17(11):3081–3089.
18. Brenner DR, Hung RJ, Tsao MS, et al. Lung cancer risk in never-smokers: a population-based case-control study of epidemiologic risk factors. BMC Cancer. 2010;10:285. (doi:10.1186/1471-2407-10-285).
19. Ugolini D, Neri M, Canessa PA, et al. The CREST biorepository: a tool for molecular epidemiology and translational studies on malignant mesothelioma, lung cancer, and other respiratory tract diseases. Cancer Epidemiol Biomarkers Prev. 2008;17(11):3013–3019.
20. Tjønneland A, Olsen A, Boll K, et al. Study design, exposure variables, and socioeconomic determinants of participation in Diet, Cancer and Health: a population-based prospective cohort study of 57,053 men and women in Denmark. Scand J Public Health. 2007;35(4):432–441.
21. Cui Y, Morgenstern H, Greenland S, et al. Dietary flavonoid intake and lung cancer—a population-based case-control study. Cancer. 2008;112(10):2241–2248.
22. Schwartz AG, Cote ML, Wenzlaff AS, et al. Racial differences in the association between SNPs on 15q25.1, smoking behavior, and risk of non-small cell lung cancer. J Thorac Oncol. 2009;4(10):1195–1201.
23. Brennan P, Crispo A, Zaridze D, et al. High cumulative risk of lung cancer death among smokers and nonsmokers in Central and Eastern Europe. Am J Epidemiol. 2006;164(12):1233–1241.
24. Gallagher CJ, Muscat JE, Hicks AN, et al. The UDP-glucuronosyltransferase 2B17 gene deletion polymorphism: sex-specific association with urinary 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol glucuronidation phenotype and risk for lung cancer. Cancer Epidemiol Biomarkers Prev. 2007;16(4):823–828.
25. Heck JE, Andrew AS, Onega T, et al. Lung cancer in a U.S. population with low to moderate arsenic exposure. Environ Health Perspect. 2009;117(11):1718–1723.
26. Muscat JE, Stellman SD, Wynder EL. Insulation, asbestos, smoking habits, and lung cancer cell types. Am J Ind Med. 1995;27(2):257–269.
27. Yang P, Bamlet WR, Ebbert JO, et al. Glutathione pathway genes and lung cancer risk in young and old populations. Carcinogenesis. 2004;25(10):1935–1944.
28. Le Marchand L, Murphy SP, Hankin JH, et al. Intake of flavonoids and lung cancer. J Natl Cancer Inst. 2000;92(2):154–160.
29. Wrensch MR, Miike R, Sison JD, et al. CYP1A1 variants and smoking-related lung cancer in San Francisco Bay Area Latinos and African Americans. Int J Cancer. 2005;113(1):141–147.
30. Schwartz AG, Cote ML, Wenzlaff AS, et al. Chronic obstructive lung diseases and risk of non-small cell lung cancer in women. J Thorac Oncol. 2009;4(3):291–299.
31. Gallagher CJ, Ahn K, Knipe AL, et al. Association between haplotypes of manganese superoxide dismutase (SOD2), smoking, and lung cancer risk. Free Radic Biol Med. 2009;46(1):20–24.
32. Field RW, Smith BJ, Platz CE, et al. Lung cancer histologic type in the Surveillance, Epidemiology, and End Results registry versus independent review. J Natl Cancer Inst. 2004;96(14):1105–1107.
33. Orlow I, Park BJ, Mujumdar U, et al. DNA damage and repair capacity in patients with lung cancer: prediction of

Am J Epidemiol. 2012;176(7):573–585 Previous Lung Diseases and Lung Cancer Risk 585

multiple primary tumors. J Clin Oncol. 2008;26(21): 52. Moulin JJ. A meta-analysis of epidemiologic studies of lung 3560–3566. cancer in welders. Scand J Work Environ Health. 1997;23(2): 34. Lan Q, He X, Costa DJ, et al. Indoor coal combustion 104–113. emissions, GSTM1 and GSTT1 genotypes, and lung cancer 53. Cassidy A, ‘t Mannetje A, van Tongeren M, et al. risk: a case-control study in Xuan Wei, China. Cancer Occupational exposure to crystalline silica and risk of lung Epidemiol Biomarkers Prev. 2000;9(6):605–608. cancer: a multicenter case-control study in Europe. 35. Field JK, Smith DL, Duffy S, et al. The Liverpool Lung Epidemiology. 2007;18(1):36–43. Project research protocol. Int J Oncol. 2005;27(6): 54. Tammemagi CM, Pinsky PF, Caporaso NE, et al. Lung 1633–1645. cancer risk prediction: Prostate, Lung, Colorectal and Ovarian 36. Greenland S, Robins JM. Conceptual problems in the Cancer Screening Trial models and validation. J Natl Cancer definition and interpretation of attributable fractions. Am J Inst. 2011;103(13):1058–1068. Epidemiol. 1988;128(6):1185–1197. 55. Lindberg A, Jonsson AC, Rönmark E, et al. Prevalence of 37. Higgins JP, Thompson SG, Deeks JJ, et al. Measuring chronic obstructive pulmonary disease according to BTS, inconsistency in meta-analyses. BMJ. 2003;327(7414): ERS, GOLD and ATS criteria in relation to doctor’s 557–560. diagnosis, symptoms, age, gender, and smoking habits. 38. Galbraith R. Graphical display of estimates having different Respiration. 2005;72(5):471–479. standard errors. Technometrics. 1988;30(3):271–281. 56. Young RP, Hopkins RJ, Hay BA, et al. Lung cancer Downloaded from 39. Sauter W, Rosenberger A, Beckmann L, et al. Matrix susceptibility model based on age, family history and genetic metalloproteinase 1 (MMP1) is associated with early-onset variants. PLoS ONE. 2009;4(4):e5302. (doi:10.1371/journal. lung cancer. LUCY-Consortium. Cancer Epidemiol pone.0005302). Biomarkers Prev. 2008;17(5):1127–1135. 57. 
Coussens LM, Werb Z. Inflammation and cancer. Nature. 40. Kreuzer M, Heinrich J, Wölke G, et al. Residential radon and 2002;420(6917):860–867. risk of lung cancer in eastern Germany. Epidemiology. 58. Fitzpatrick FA. Inflammation, carcinogenesis and cancer. Int http://aje.oxfordjournals.org/ 2003;14(5):559–568. Immunopharmacol. 2001;1(9-10):1651–1667. 41. Yang P, Li Y, Jiang R, et al. A rigorous and comprehensive 59. Lin WW, Karin M. A cytokine-mediated link between innate validation: common genetic variations and lung cancer. immunity, inflammation, and cancer. J Clin Invest. Cancer Epidemiol Biomarkers Prev. 2010;19(1):240–244. 2007;117(5):1175–1183. 42. Wang J, Spitz MR, Amos CI, et al. Mediating effects of 60. Epidemiology and Statistics Unit, American Lung smoking and chronic obstructive pulmonary disease on the Association. Trends in COPD (Chronic Bronchitis and relation between the CHRNA5-A3 genetic locus and lung Emphysema): Morbidity and Mortality. Washington, DC: cancer risk. Cancer. 2010;116(14):3458–3462. American Lung Association; 2007.

43. Radeos MS, Cydulka RK, Rowe BH, et al. Validation of self- 61. DeFrances CJ, Hall MJ. 2005 National Hospital Discharge at University of Toronto Library on October 4, 2012 reported chronic obstructive pulmonary disease among Survey. Adv Data. 2007;(385):1–19. patients in the ED. Am J Emerg Med. 2009;27(2):191–196. 62. Epidemiology and Statistics Unit, American Lung 44. Barr RG, Herbstman J, Speizer FE, et al. Validation of self- Association. Trends in Tuberculosis: Morbidity and reported chronic obstructive pulmonary disease in a cohort Mortality. Washington, DC: American Lung Association; study of nurses. Am J Epidemiol. 2002;155(10):965–971. 2007. 45. Young RP, Hopkins RJ, Christmas T, et al. COPD prevalence 63. World Health Organization. Global Burden of Disease. is increased in lung cancer, independent of age, sex and Geneva, Switzerland: World Health Organization; 2008. smoking history. Eur Respir J. 2009;34(2):380–386. 64. Spitz MR, Etzel CJ, Dong Q, et al. An expanded risk 46. Wilson DO, Weissfeld JL, Balkan A, et al. Association of prediction model for lung cancer. Cancer Prev Res (Phila). radiographic emphysema and airflow obstruction with lung 2008;1(4):250–254. cancer. Am J Respir Crit Care Med. 2008;178(7):738–744. 65. Cassidy A, Myles JP, van Tongeren M, et al. The LLP risk 47. de Torres JP, Bastarrika G, Wisnivesky JP, et al. Assessing model: an individual risk prediction model for lung cancer. the relationship between lung cancer risk and emphysema Br J Cancer. 2008;98(2):270–276. detected on low-dose CT of the chest. Chest. 2007;132(6): 66. Baldwin DR, Duffy SW, Wald NJ, et al. UK Lung Screen 1932–1938. (UKLS) nodule management protocol: modelling of a single 48. Hardy RD, Jafri HS, Olsen K, et al. Mycoplasma pneumoniae screen randomised controlled trial of low-dose CT screening induces chronic respiratory infection, airway hyperreactivity, for lung cancer. Thorax. 2011;66(4):308–313. and pulmonary inflammation: a murine model of infection- 67. 
Aberle DR, Berg CD, Black WC, et al. The National Lung associated chronic reactive airway disease. Infect Immun. Screening Trial: overview and study design. Radiology. 2002;70(2):649–654. 2011;258(1):243–253. 49. Hardy RD, Jafri HS, Olsen K, et al. Elevated cytokine and 68. Prorok PC, Andriole GL, Bresalier RS, et al. Design of the chemokine levels and prolonged pulmonary airflow resistance in Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer a murine Mycoplasma pneumoniae pneumonia model: a Screening Trial. Control Clin Trials. 2000;21(6 suppl): microbiologic, histologic, immunologic, and respiratory 273S–309S. plethysmographic profile. Infect Immun. 2001;69(6):3869–3876. 69. Kreienbrock L, Kreuzer M, Gerken M, et al. Case-control 50. Stayner L, Bena J, Sasco AJ, et al. Lung cancer risk and study on lung cancer and residential radon in western workplace exposure to environmental tobacco smoke. Am J Germany. Am J Epidemiol. 2001;153(1):42–52. Public Health. 2007;97(3):545–551. 70. Holle R, Happich M, Löwel H, et al. KORA—a research 51. Reardon JZ. Environmental tobacco smoke: respiratory and platform for population based health research. other health effects. Clin Chest Med. 2007;28(3):559–573, vi. Gesundheitswesen. 2005;67(suppl 1):S19–S25.

Genetic Epidemiology 36:320–332 (2012)

Two-Phase Stratified Sampling Designs for Regional Sequencing

Zhijian Chen,1 Radu V. Craiu,2 and Shelley B. Bull1,3,∗
1 Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto ON, Canada
2 Department of Statistics, University of Toronto, Toronto ON, Canada
3 Dalla Lana School of Public Health, University of Toronto, Toronto ON, Canada

By systematic examination of common tag single-nucleotide polymorphisms (SNPs) across the genome, the genome-wide association study (GWAS) has proven to be a successful approach to identify genetic variants that are associated with complex diseases and traits. Although the cost of sequencing has dropped dramatically with the advent of next-generation technologies, financial constraints may still make it feasible to obtain DNA sequence data for only a portion of the available study subjects. Two-phase sampling designs have been used frequently in large-scale surveys and epidemiological studies where certain variables are too costly to be measured on all subjects. We consider two-phase stratified sampling designs for genetic association, in which tag SNPs for candidate genes or regions are genotyped on all subjects in phase 1, and a proportion of subjects are selected into phase 2 based on genotypes at one or more tag SNPs. Deep sequencing in the region is then applied to genotype phase 2 subjects at sequence SNPs. We investigate alternative sampling designs for selection of phase 2 subjects within strata defined by tag SNP genotypes and develop methods of inference for sequence SNP variant associations using data from both phases. In comparison to methods that use data from phase 2 alone, the combined analysis improves efficiency. Genet. Epidemiol. 36:320–332, 2012. © 2012 Wiley Periodicals, Inc.

Key words: fine-mapping; genetic association studies; two-phase design; optimal allocation; quantitative trait

Supporting Information is available in the online issue at wileyonlinelibrary.com. Contract grant sponsor: Canadian Institutes of Health Research (CIHR). ∗ Correspondence to: Shelley B. Bull, Samuel Lunenfeld Research Institute of Mount Sinai Hospital, 60 Murray Street, Box No. 18, Toronto, ON M5T 3L9, Canada. E-mail: [email protected] Received 29 April 2011; Revised 29 July 2011; Accepted 17 January 2012 Published online 28 March 2012 in Wiley Online Library (wileyonlinelibrary.com/journal/gepi). DOI: 10.1002/gepi.21624

INTRODUCTION

The population-based genetic association study is now a well-established approach to identify genetic variants that are detrimental or protective for human disease. The genome-wide association study (GWAS) attempts to comprehensively survey common variants in the entire human genome based on up to a million typed genetic markers in each individual in the sample, with imputation of another 3 million single-nucleotide polymorphisms (SNPs) based on a reference panel such as HapMap3, done without regard to any phenotypic information. This form of imputation has the advantage that different phenotypes can be tested for association without the need to redo the imputation [Li et al., 2009]. For many GWAS, single-marker association analysis of typed and imputed SNPs is the first step to identify promising regions, and associations are confirmed by more powerful and focused analysis based on replication and fine-mapping studies [Zheng et al., 2007].

In focused studies following up reasonable GWAS hits, investigators may choose to comprehensively sequence a whole region of interest using next-generation sequencing (NGS) technology, or selectively sequence the region using customized technology to genotype additional SNPs, for example, SNPs that are not imputable or are known from dbSNP [Liu and Leal, 2010]. Although most GWAS studies can impute over 3 million SNPs, the directly typed or imputed SNPs detected are not necessarily the functional variants [Fridley et al., 2010; Ioannidis et al., 2009]. Imputation coverage or accuracy may be low in the region of interest [e.g., Pei et al., 2010], particularly when the trait is influenced by multiple low-frequency/rare variants in the region rather than solely by common variants. Investigators may also use sequence data to test for association between the trait and a common variant or a gene-based summary score that incorporates information on multiple rare variants in a region.

Thus, one purpose of regional sequencing may be to discover novel, potentially functional variants in a particular region that has been detected in genome-wide association analysis or chosen as a candidate region. Due to financial constraints, however, investigators may be able to afford sequencing only a portion of the available subjects. When a covariate, such as a sequence SNP (seq SNP), is difficult or costly to measure, a two-phase stratified sampling design can dramatically reduce the cost of data collection [Breslow and Wellner, 2007]. At phase 1, measurements of the phenotype and an easily measured auxiliary variable, such as a GWAS tag SNP, are obtained for all available subjects. At phase 2, measurements on the expensive target covariate (i.e., the seq SNP) are made for a subsample drawn randomly, without replacement, from strata defined by the auxiliary variable. Loss of efficiency due to incomplete


observation will be modest when the target covariate is highly correlated with the auxiliary variable. By sequencing an informative portion of the available phase 1 subjects at phase 2, sequence variants associated with one or more phenotypes can be detected efficiently. Thereafter, selected variants together with the promising GWAS tag SNPs can be examined jointly in additional larger studies.

We investigate the use of a two-phase sampling design to obtain sequence data for genetic association analysis of a quantitative trait. At phase 1, we assume that all available subjects are fully phenotyped and genotyped at an associated tag SNP within a promising region of the genome. Strata are then formed according to the genotypes of the tag SNP, and a fraction of subjects from each stratum is randomly selected for sequencing in the region of interest at phase 2. The total phase 2 sample size can be predetermined based on study budget, for example, 10% or 50% of the phase 1 sample, but the fraction of subjects selected in each stratum can differ across strata. We propose a method for the joint analysis of data from both phases and investigate strategies for allocating the phase 2 sample size to each stratum. The method is particularly useful in situations in which the tag SNP is in high linkage disequilibrium (LD) with a common seq SNP or with a rare variant score constructed by aggregation of multiple low-frequency/rare variants. When imputation accuracy is high within a region identified by tag SNP association with a quantitative trait, the two-phase stratified design strategy we propose for a tag SNP can be similarly applied using imputed SNP data to define the sampling strata.

In most GWAS and fine-mapping studies, it is difficult to distinguish two SNPs that are in strong LD on statistical grounds without incorporating biological or other additional information. Depending on the minor allele frequencies and the strength of the LD correlation, the sample size required to conduct such fine-scale mapping and successfully distinguish two SNPs is typically one to four times larger than that required to detect the initial association [Udler et al., 2010]. We see the goal of the two-phase strategy as (1) to select a set of highly correlated polymorphisms for further evaluation, and/or (2) to identify other associated variants in the region that can be analyzed subsequently for functional consequences [Ioannidis et al., 2009]. The joint analysis method we propose here aims to efficiently detect potential association signals at SNPs that are not typed in phase 1, rather than to distinguish a causal SNP, genotyped by sequencing, from the tag SNP.

In the following sections, we develop an approach for joint analysis of phases 1 and 2 and compare it to methods of inference limited to sequenced SNP data available only in phase 2. For ease of exposition, we assume an additive model for the genetic association analysis, one that is used widely to capture the average change of the quantitative trait with each additional copy of the minor allele of an associated SNP, or with a unit increase in a rare variant summary score. In simulation studies, we quantify the relative design efficiencies across a range of possible sample allocations, considering both joint analysis and phase 2 only methods for analysis of a common variant or a specific rare variant score, and assess robustness to misspecification of the model used for analysis of the phase 2 data. We close with discussion of implications for studies involving multiple seq SNPs, multiple tag SNPs, or multiple traits.

METHODS

TWO-PHASE STRATIFIED SAMPLING

The basic idea of the two-phase design is to use auxiliary information available on all subjects to draw a subsample for additional, more expensive, measurements of a target variable. In genetic association analysis, the auxiliary information typically available consists of genotype data for a tag SNP, or an imputed SNP, within a candidate region. The SNP genotypes are available in all individuals in phase 1 of the study. The target covariate refers to a potentially functional seq SNP that is collected in the phase 2 subjects only.

Suppose we have $N$ subjects that constitute a population sample, indexed by $i = 1, \ldots, N$. Let $Y_i$ denote the quantitative trait for the $i$th subject, and denote the major and minor alleles of the seq SNP by $D$ and $d$, respectively. Let $P_d$ be the minor allele frequency (MAF) at the seq SNP in the population. If Hardy-Weinberg Equilibrium (HWE) holds, the population frequencies of genotypes $DD$, $Dd$, and $dd$ are given by $(1 - P_d)^2$, $2(1 - P_d)P_d$, and $P_d^2$, respectively. A linear regression model under an additive genetic effect is given by

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \tag{1}$$

where $X_i$, the number of copies of allele $d$, would be potentially available on each subject if all subjects were completely observed. Let $\beta = (\beta_0, \beta_1)^T$ be a vector of the regression parameters. The error term $\epsilon_i$ is commonly assumed to be normally distributed with mean 0 and variance $\sigma^2$.

For simplicity, we consider stratification that is based on a single tag or imputed SNP available in phase 1. Denote the major and minor alleles of the tag SNP by $A$ and $a$, respectively. Correspondingly, let $P_a$ be the MAF at this tag SNP in the population, and $Z_i$ be the number of copies of allele $a$ at the tag SNP observed on subject $i$. Again, if HWE holds for the tag SNP, the genotype frequencies $\mathrm{pr}(Z_i = j)$ are given by $(1 - P_a)^2$, $2(1 - P_a)P_a$, and $P_a^2$ for $j = 0, 1, 2$, respectively.

The phase 1 sample is divided into three strata according to the observed value of $Z_i$. Let $N_j$ denote the number of subjects observed in stratum $AA$, $Aa$, and $aa$ for $j = 0, 1, 2$, respectively. Under the assumption that the phase 1 sample represents the source population, $E(N_j/N)$ equals the corresponding population genotype frequency. Let $\xi_i$ be a binary indicator such that $\xi_i = 1$ if subject $i$ is sampled at phase 2 and $\xi_i = 0$ otherwise. Then $\pi_i = \mathrm{pr}(\xi_i = 1 \mid Z_i)$ is the probability of such sampling.

Breslow and Wellner [2007] describe two probability models for the indicators $\xi_i$ in two-phase stratified sampling. In the first one, known as Bernoulli sampling, $\xi_i$ for each phase 1 subject is independently generated with probability $\pi_i = \pi_0(Z_i)$, where $\pi_0$ is a known sampling function. Under a missing at random mechanism, $\pi_0$ does not depend on the unobserved values of missing data from phase 1. This sampling scheme results in random phase 2 stratum-specific samples of size $n_j \le N_j$ ($j = 0, 1, 2$). For $i \ne i'$, let $\pi_{i,i'} = \mathrm{pr}(\xi_i = \xi_{i'} = 1)$ be the joint inclusion probability. Under Bernoulli sampling, $\pi_{i,i'} = \pi_i \pi_{i'}$. In contrast, under a second sampling model, known as finite population stratified sampling, the phase 2 sample size in each stratum is fixed. To be specific, at the second phase of sampling, $n_j$ subjects are sampled at random without replacement


from stratum $j$, with sampling for different strata conducted independently. This sampling method is of particular interest to survey statisticians who aim to derive variances of estimates of population quantities, such as population means, totals, or quantiles, and it leads to finite population joint inclusion probabilities that are then involved in the variance formula. In our development here we consider only Bernoulli sampling for phase 2, which is relatively simpler for the problem we are investigating.

For each subject selected into the phase 2 sample according to a promising tag SNP, a region containing the tag SNP is sequenced to identify additional, potentially functional seq SNPs. In the remainder, we assume that a functional SNP is indeed in the region containing the tag SNP and is sequenced for all phase 2 subjects. We quantify the association between the tag SNP and the seq SNP by the conditional probabilities $\alpha_{jk} = \mathrm{pr}(X_i = k \mid Z_i = j)$, $j, k = 0, 1, 2$. Since $\alpha_{j0} = 1 - \alpha_{j1} - \alpha_{j2}$, $j = 0, 1, 2$, we let $\alpha = \{(\alpha_{j1}, \alpha_{j2}),\ j = 0, 1, 2\}^T$ be a vector of LD-related parameters that are to be estimated. The joint distribution of the two SNP genotypes is given by $\mathrm{pr}(X_i = k, Z_i = j) = \mathrm{pr}(X_i = k \mid Z_i = j)\,\mathrm{pr}(Z_i = j) = \alpha_{jk}\,\mathrm{pr}(Z_i = j)$, with the correlation between $X_i$ and $Z_i$ defined as Pearson's correlation coefficient. The phase 1 data consist of $(Y_i, Z_i)$, $i = 1, 2, \ldots, N$, and the phase 2 data consist of $(Y_i, Z_i, X_i)$ for $i$ included in the phase 2 sample.

ALLOCATION OF PHASE 2 SAMPLE SIZE AND NAIVE ANALYSIS

In this section, we discuss allocation of the phase 2 sample size to each of the phase 1 strata, in the context of fitting a standard additive linear regression model (A1) to the phase 2 data.

Let $m_k$ be the number of phase 2 subjects carrying $k$ copies of allele $d$ at the seq SNP, $k = 0, 1, 2$, with a total phase 2 sample size of $n$. Let $\hat\beta_{1,\mathrm{nai}}$ be the naive estimator for $\beta_1$ obtained from fitting model (A1) to the phase 2 data. Given variance $\mathrm{var}(\hat\beta_{1,\mathrm{nai}}) = \sigma^2 \{\sum_{i \in s_2} (X_i - \bar X) X_i\}^{-1}$, and $\bar X = \sum_{i \in s_2} X_i / n = (m_1 + 2m_2)/n$, it follows that

$$\mathrm{var}(\hat\beta_{1,\mathrm{nai}}) = \sigma^2 n \{m_1(n - m_1 - 2m_2) + 2m_2(2n - m_1 - 2m_2)\}^{-1} = \sigma^2 n \{n(n - m_1) - (n - m_1 - 2m_2)^2\}^{-1},$$

and minimum variance would be achieved when $m_1 = 0$ and $m_0 = m_2 = n/2$. In the context of genetic association studies, $\hat\beta_{1,\mathrm{nai}}$ is most efficient when half of the phase 2 subjects have zero copies of $d$ and the other half have two copies of $d$, provided that the true genetic model is additive. Since at phase 1 we observe information only on tag SNPs, a natural choice is to select subjects from the three strata defined by the tag SNP genotypes such that $E(m_1)$ is minimized while $E(m_0)$ and $E(m_2)$ are approximately equal. Because the tag SNP genotype is serving as a surrogate for the unobserved target SNP and discordance between them will reduce the chance of selecting informative subjects, the underlying correlation between the tag and the seq SNPs is important for the success of the phase 2 sample size allocation strategy.

Under Bernoulli sampling, all subjects within a stratum are sampled independently with the same probability. Let $\rho_j$ ($j = 0, 1, 2$) be the inclusion probability for the subjects in stratum $j$, that is, the stratum-specific sampling fraction. The individual sampling probability is therefore $\pi_i = \rho_j$ for all subjects in stratum $j$. The expected number of phase 2 observations is $E(n_j) = \rho_j N_j$, and the overall sampling fraction $\rho = \sum_{j=0}^{2} \rho_j N_j / N$ may be predetermined according to financial constraints. Therefore, the expected number of subjects carrying $k$ copies of allele $d$ in the phase 2 sample is $E(m_k \mid n_0, n_1, n_2) = \sum_{j=0}^{2} n_j \alpha_{jk}$, $k = 0, 1, 2$.

An illustration of phase 2 sampling is given in Figure 1, where the MAF at the tag SNP is 0.3, and the overall sampling fraction is 10%. Table I is an example of one realization of phase 2 sampling, where the MAF at the seq SNP is 0.2, and the correlation between the tag and seq SNPs is 0.75. The joint distribution of the SNPs can be estimated from these counts. One can see that if the two SNPs are highly correlated, the rare homozygote $dd$ at the seq SNP will appear most frequently in the stratum with the rare homozygote $aa$ at the tag SNP.

Fig. 1. An illustration of a two-phase stratified sampling ($P_a = 0.3$ under HWE, sampling fraction $\rho = 10\%$).

TABLE I. An example of realized genotype counts at the seq SNP under the sampling scheme shown in Figure 1 ($P_a = 0.3$, $P_d = 0.2$, $r = 0.75$, and sampling fraction $\rho = 10\%$. The A/a alleles are for the tag SNP and D/d are for the seq SNP)

         DD   Dd   dd   Total
AA       29    1    0      30
Aa        2    8    0      10
aa        9   23   28      60
Total    40   32   28     100

PHASE 2 INVERSE PROBABILITY WEIGHTED (IPW) ANALYSIS

The unweighted naive analysis just described does not take account of the sampling probabilities. In this section, we review a weighted estimation method that incorporates the stratum-specific sampling fractions. A typical method of estimation, as used by survey statisticians, is to maximize the IPW sum of log-likelihood contributions from the phase 2 observations, or equivalently, to solve an IPW version of the score equations [Manski and Lerman, 1977]. IPW is a standard approach to inference about finite population


parameters (i.e., those of the entire phase 1 data), when the probability of being sampled (i.e., being included in phase 2) varies across individuals. It is usually applied to ensure that inferences are representative of the complete data, and it can limit the effects of model misspecification [Godambe and Thompson, 1986], for example, if an additive model is assumed incorrectly. We begin with the case of complete data: $(Y_i, Z_i, X_i)$ for all $i = 1, 2, \ldots, N$, and then consider estimation with phase 2 data alone.

The likelihood contribution of subject $i$ to estimation of $(\beta, \sigma^2)$ is $L_i(\beta, \sigma^2) = f(Y_i \mid X_i)$, where $f(Y_i \mid X_i) = (1/\sqrt{2\pi\sigma^2}) \exp[-\{Y_i - (\beta_0 + \beta_1 X_i)\}^2/(2\sigma^2)]$ is the probability density function of $Y_i$. If seq SNP genotypes were observed on all subjects, the likelihood of the data would be

$$L(\beta, \sigma^2) = \prod_{i=1}^{N} L_i(\beta, \sigma^2).$$

The log-likelihood contribution of subject $i$ is $\ell_i(\beta, \sigma^2) = \log L_i(\beta, \sigma^2)$, and maximum likelihood estimation of $\beta$ is equivalent to solving

$$0 = \sum_{i=1}^{N} U_i(\beta), \tag{2}$$

where $U_i(\beta) = \partial \ell_i(\beta, \sigma^2)/\partial\beta$. An estimate of the variance parameter $\sigma^2$ can be obtained from the residuals via the method of moments. In addition, estimates of the conditional probabilities in the vector $\alpha$ can be obtained by solving

$$0 = \sum_{i=1}^{N} Q_i(\alpha), \tag{3}$$

where $Q_i(\alpha) = ([I(Z_i = j)\{I(X_i = k) - \alpha_{jk}\},\ k = 1, 2],\ j = 0, 1, 2)^T$ and $I(\cdot)$ is an indicator function.

In phase 2 data, however, Equations (2) and (3) cannot be used directly, due to missing seq SNP data for subjects with $\xi_i = 0$. Let $s_2$ and $\bar s_2$ denote the phase 2 sample and its complement, respectively. One commonly employed IPW estimation method for $\beta$, using data from phase 2 alone, applies weighted estimating equations [Skinner et al., 1989, section 3.4] which are given by

$$0 = \sum_{i \in s_2} (1/\pi_i)\, U_i(\beta) = \sum_{i=1}^{N} (\xi_i/\pi_i)\, U_i(\beta), \tag{4}$$

with $1/\pi_i$ called the sampling design weight for subject $i$. Under the two-phase stratified design, we have

$$E_p\left\{\sum_{i \in s_2} (1/\pi_i)\, U_i(\beta)\right\} = \sum_{i=1}^{N} U_i(\beta),$$

where $E_p$ denotes expectation under the sampling scheme. Because the constructed estimating functions are unbiased for the phase 1 complete data estimating functions under expectation with respect to the sampling design, the estimator obtained by solving (4) is called a design-consistent estimator for the phase 1 sample parameter, that is, the solution to (2). Design consistency is a desirable property in randomization approaches to finite population sampling [Godambe and Thompson, 1986].

Both naive and IPW analyses of phase 2 data ignore the phenotype and tag SNP genotype data available in the phase 1 participants not included in phase 2, which leads to efficiency loss. Although the IPW estimator has attractive properties such as consistency and asymptotic normality, when the additive model (1) is correctly specified the naive estimates are not biased due to sampling, and incorporating IPW sampling weights can induce greater variation when some of the weights are large. As is evident in the simulation studies we report, the result is that the IPW estimator can be less precise than the naive estimator that ignores the sampling design.

JOINT ANALYSIS OF PHASE 1 AND PHASE 2

Although analysis of phase 2 data alone can give an estimate of $\beta_1$ at a seq SNP that is similar to the estimate that would be obtained if seq SNP data were available for all phase 1 subjects, the naive and IPW approaches are generally not powerful. In this section, we describe an alternative estimating equations method that can achieve greater power by jointly analyzing data from both phases. This approach constructs mean score functions for subjects that are not selected into phase 2. We show that the mean score function is a weighted sum of three score functions, each of which corresponds to one of the three possible seq SNP genotypes. The weight measures the likelihood of the missing seq SNP genotype given the observed trait and the tag SNP genotype.

For subject $i \in \bar s_2$, that is, not selected into phase 2 sample, let

$$\phi_{ik}(\alpha, \beta) = \frac{f(Y_i \mid X_i = k; \beta)\,\mathrm{pr}(X_i = k \mid Z_i; \alpha)}{\sum_{k'=0}^{2} f(Y_i \mid X_i = k'; \beta)\,\mathrm{pr}(X_i = k' \mid Z_i; \alpha)},$$

$k = 0, 1, 2$. Under the assumption that the distribution of the phenotype in the population is a mixture of three normal distributions with constant variance, $\phi_{ik}(\alpha, \beta)$ can be viewed as a weight for the possible seq SNP genotype with $k$ copies of the minor allele. We construct estimating functions $U_i^*(\alpha, \beta) = \sum_{k=0}^{2} \phi_{ik}(\alpha, \beta)\, U_i(\beta; Y_i, X_i = k)$, and let $\tilde U_i(\alpha, \beta) = \xi_i U_i(\beta) + (1 - \xi_i) U_i^*(\alpha, \beta)$. Therefore, in parallel to Equations (3) and (4) above, the proposed estimating equations for $\beta$ are given by

$$0 = \sum_{i=1}^{N} \tilde U_i(\alpha, \beta), \tag{5}$$

and we construct weighted estimating equations for $\alpha$ given by

$$0 = \sum_{i \in s_2} (1/\pi_i)\, Q_i(\alpha) = \sum_{i=1}^{N} (\xi_i/\pi_i)\, Q_i(\alpha), \tag{6}$$

with solution $\hat\alpha$. Under the two-phase stratified design, we have

$$E_p\Big\{ \sum_{i \in s_2} (1/\pi_i)\, Q_i(\alpha) \Big\} = \sum_{i=1}^{N} Q_i(\alpha).$$

One can obtain a consistent estimator for $\beta$ by simultaneously solving (5) and (6) using a two-stage estimation procedure that is equivalent to an iterative Fisher scoring algorithm. Specifically, in the first step, we solve (6) for $\alpha$. Let $\tilde M(\alpha, \beta) = \sum_{i=1}^{N} \tilde M_i(\alpha, \beta)$, where $\tilde M_i(\alpha, \beta) = \partial \tilde U_i(\alpha, \beta)/\partial \beta^T$. In the second step, we replace $\alpha$ with $\hat\alpha$ and solve (5) for $\beta$ via the Fisher scoring algorithm

$$\beta^{(t+1)} = \beta^{(t)} - \tilde M^{-1}(\hat\alpha, \beta^{(t)}) \sum_{i=1}^{N} \tilde U_i(\hat\alpha, \beta^{(t)}), \qquad t = 0, 1, \ldots,$$

until convergence. Let $\hat\beta = (\hat\beta_0, \hat\beta_1)^T$ denote the resulting limit.

Let $\theta = (\alpha^T, \beta^T)^T$, with estimate $\hat\theta = (\hat\alpha^T, \hat\beta^T)^T$. Large sample theory yields asymptotic properties for $\hat\theta$. In the Appendix, we outline the proof that $N^{1/2}(\hat\theta - \theta)$ is asymptotically normal with mean 0 and asymptotic covariance matrix given by $\Gamma^{-1} \Sigma\, (\Gamma^{-1})^T$, where $\Psi_i(\theta) = \{(\xi_i/\pi_i)\, Q_i^T(\alpha),\ \tilde U_i^T(\alpha, \beta)\}^T$, $i = 1, \ldots, N$, $\Gamma = E\{\partial \Psi_i(\theta)/\partial \theta^T\}$, and $\Sigma = E\{\Psi_i(\theta)\, \Psi_i^T(\theta)\}$. As $N \to \infty$, $\Gamma$ and $\Sigma$ can be consistently estimated by their empirical counterparts. In the Appendix, we give the components of the approximate covariance matrix.

In concluding this section, we note the influence of sample size allocation on the efficiency of the proposed method. We wish to optimize the power to detect the association of the seq SNP with the quantitative trait by choosing a design that minimizes the variance of $\hat\beta_1$ obtained using both phase 1 and phase 2 data. Because the variance of $\hat\beta_1$ depends on the estimated LD-related conditional probabilities, $\alpha$, reducing the uncertainty in $\alpha$ translates into reducing variability in $\hat\beta_1$. Therefore, allocations that improve precision of the conditional probabilities also improve precision of the genetic association estimate, although a generally valid approach is yet to be found.

There are several reasons why we do not consider a design that samples only from the two tag SNP homozygote categories. First, robustness to departures from the underlying genetic model can depend on having observations from the heterozygote category. Second, the joint analysis method utilizes the correlation between the tag and the seq SNPs. If the tag SNP heterozygote stratum is not sampled at all, the phase 1 subjects in this category (usually a large proportion) will not be used in the joint analysis, which will greatly decrease the design efficiency. Third, if the MAF is low at the tag SNP but is relatively high at the seq SNP (e.g., $P_a = 0.1$, $P_d = 0.3$, positive correlation), then lack of sampling from the tag SNP heterozygote Aa stratum may decrease the number of phase 2 subjects carrying rare homozygote dd at the seq SNP, even when all subjects in the tag SNP rare homozygote aa stratum are sampled.

Although formulated for common seq SNPs, the proposed method can be generalized to analysis of rare variants, that is, variants with MAF < 1%. In this case, the objective of sequencing is to investigate the potential association between the phenotype and multiple rare variants within a gene or a specific genomic region. The target variable $X_i$ in the linear model (1) now becomes the genetic score of rare variants [e.g., Morris and Zeggini, 2010]. For example, for $R_{il}$ defined as the number of copies of the minor allele at the $l$th rare variant, $l = 1, \ldots, K$, where $K$ is the total number of rare variants in that region, we have $X_i = \sum_{l=1}^{K} R_{il}$. Although $X_i$ is a count variable from 0 to $2K$, it takes far fewer values in practice due to the low MAFs of the rare variants.

SIMULATION STUDIES

SIMULATION DESIGN

We conducted simulation studies to investigate the relative efficiency of the proposed joint analysis method. Here, relative efficiency of the joint analysis estimator is defined as the ratio between the empirical variance of the estimator obtained when complete phase 1 data were available and the empirical variance of the joint analysis estimator. By calculating this quantity, we can investigate whether there is benefit from jointly analyzing both phase 1 and phase 2 data. In our simulation comparisons, we also included a naive method, which fits a standard linear model to phase 2 data ignoring the sampling design, and the IPW approach, which fits a linear model to phase 2 data weighted by the inverse of the inclusion probability.

TABLE II. Table of parameters for simulation design (N = 1,000)

  Minor allele freq. (MAF)       Scenario (i): $P_a = P_d = 0.4$
                                 Scenario (ii): $P_a = 0.3$, $P_d = 0.2$
  tag SNP-seq SNP correlation    Scenario (i): $r$ = 0.95, 0.50, 0.05, −0.50
                                 Scenario (ii): $r$ = 0.75, 0.25, 0.05, −0.30
  Overall sampling fraction      $\rho$ = 0.10, 0.25, 0.50 ($n$ = 100, 250, 500 in phase 2)
  seq SNP genetic model          Additive, dominant, recessive
  Genotype-specific variance     $\sigma_0^2 = \sigma_1^2 = \sigma_2^2 = 0.50$ (for all three genetic models);
                                 $\sigma_0^2 = 0.50$, $\sigma_1^2 = 0.75$, $\sigma_2^2 = 1.00$ (for additive model only)

Throughout our simulation studies, the sample size of phase 1 was fixed at N = 1,000. Table II summarizes the simulation design. For each combination of parameters and a specific sampling scheme, we generated 1,000 data sets. We used analysis of complete data for phase 1 as the ideal. We considered various scenarios for the genotypes of the tag SNP and the seq SNP, varying the MAFs. Two sets of MAF values were specified as (i) $P_a = P_d = 0.4$, and (ii) $P_a = 0.3$, $P_d = 0.2$. We quantified LD between the two SNPs by the correlation coefficient $r$, and considered a range of correlations from highly positive to moderately negative. For scenario (i) where MAF values are equal, $r$ = 0.95, 0.50, 0.05, and −0.50. For scenario (ii) where MAF values differ, complete correlation between the two SNPs was not possible, and $r$ was bounded by some value that is smaller than 1. Therefore, for scenario (ii) we set $r$ = 0.75, 0.25, 0.05, and −0.30, where the highest correlation 0.75 was very close to the upper bound of possible correlation. The joint distribution of
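The bound on $r$ and the construction of correlated genotypes described for the simulation design can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names are hypothetical, $r$ is taken to be the allelic correlation $D / \sqrt{P_a(1-P_a)P_d(1-P_d)}$, and each subject is assumed to carry two independently drawn haplotypes (HWE).

```python
import math
import random

def haplotype_freqs(pa, pd, r):
    """Frequencies of haplotypes AD, Ad, aD, ad for tag-SNP MAF pa,
    seq-SNP MAF pd and allelic correlation r (a, d = minor alleles)."""
    d = r * math.sqrt(pa * (1 - pa) * pd * (1 - pd))  # LD coefficient D
    freqs = {
        "AD": (1 - pa) * (1 - pd) + d,
        "Ad": (1 - pa) * pd - d,
        "aD": pa * (1 - pd) - d,
        "ad": pa * pd + d,
    }
    if min(freqs.values()) < 0:
        raise ValueError("r is outside the range allowed by these MAFs")
    return freqs

def r_bounds(pa, pd):
    """Smallest and largest r that keep all four haplotype frequencies >= 0."""
    scale = math.sqrt(pa * (1 - pa) * pd * (1 - pd))
    d_max = min(pa * (1 - pd), (1 - pa) * pd)
    d_min = -min(pa * pd, (1 - pa) * (1 - pd))
    return d_min / scale, d_max / scale

def simulate_genotypes(pa, pd, r, n, rng):
    """Draw two haplotypes per subject; return minor-allele counts (tag, seq)."""
    freqs = haplotype_freqs(pa, pd, r)
    haps = list(freqs)
    weights = [freqs[h] for h in haps]
    tag, seq = [], []
    for _ in range(n):
        pair = rng.choices(haps, weights=weights, k=2)
        tag.append(sum(h[0] == "a" for h in pair))
        seq.append(sum(h[1] == "d" for h in pair))
    return tag, seq
```

Under these assumptions, `r_bounds(0.3, 0.2)` gives an upper bound of about 0.76 and a lower bound of about −0.33, which is in line with the choice of 0.75 and −0.30 as the extreme correlations for scenario (ii); `r_bounds(0.4, 0.4)` gives an upper bound of 1.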

Genet. Epidemiol. Two-Phase Designs for Sequencing 325

the tag SNP and the seq SNP can be inferred from their marginals and their correlation. For a given combination of MAF values and correlation, we first simulated the two haplotypes for each subject to achieve the desired frequencies of the genotypes and correlation. We then simulated a quantitative trait under an additive model given by (A1), where parameters were specified as $\beta_0 = 0.5$, $\beta_1 = 0.25$, and $\sigma^2 = 0.5$.

We considered three different values of the phase 2 sampling proportion $\rho$ (Table II), corresponding to samples of size 100, 250, and 500. For each $\rho$, we investigated the influence of sample size allocation on the efficiencies of the various estimators for $\beta_1$. The sampling fractions $\rho_1$ and $\rho_2$ for the heterozygote and the rare homozygote strata were specified to reflect the extent to which the minor alleles are oversampled. The sampling proportion $\rho_0$ in stratum AA can be calculated from $P_a$, $\rho_1$, $\rho_2$, and $\rho$. We considered a range of phase 2 sample sizes allocated to the heterozygote stratum. For each given heterozygote count, we then varied the size allocated to the rare homozygote stratum. To avoid variance inflation in the estimated $\alpha$ due to sparse counts, we required a minimum of 10 for the expected number of subjects to be selected from each stratum. Therefore, there is not much freedom to allocate sample size to the two homozygote strata when most of the sample size is allocated to the heterozygote stratum. We also included two special allocations. The first one is equivalent to simple random sampling within each stratum with the same sampling fraction, and the second one allocates an equal sample size to each stratum, that is, $\rho_j N_j = \rho N/3$, $j = 0, 1, 2$.

To investigate the robustness of the methods to model misspecification, we also simulated data under various forms of departures from the additive model with constant variance specified by model (1), but then analyzed the data assuming a dosage effect and constant variance. We first considered dominant and recessive genetic effect models, both of which followed a similar form of model (A1) for the seq SNP. In the third case, we considered a true model with heteroscedastic variances across the seq SNP genotype categories. Heteroscedasticity may be encountered in situations where the phenotype is more variable in subjects with two copies of the rare allele at the seq SNP than in subjects with one or zero copies. We set $\sigma_0^2 = 0.5$, $\sigma_1^2 = 0.75$, and $\sigma_2^2 = 1$ for $X$ = 0, 1, and 2, respectively. In the fourth case, we simulated genotypes at an additional seq SNP with low MAF but strong association with the phenotype, and moderate LD with the seq SNP being tested. We set the MAF at this additional SNP to be 0.01 and the regression parameter to be 0.5. This SNP was ignored in the analysis, however, yielding model misspecification. In the last case, we generated observations from an additive model in which the residuals follow a skewed distribution, a feature not uncommon in practice, but analyzed the data as if they were normally distributed. Because estimators for $\beta_1$ obtained under model misspecification are generally biased, we used mean squared error (MSE), as opposed to empirical variance, in the calculation of relative efficiency for all cases with model misspecification. That is, we calculated the ratio between two MSEs, with one from fitting model (A1) to complete phase 1 data (as if seq SNP data were available for all) and the other from applying the estimation method under the two-phase design with seq SNP data only for the phase 2 sample. When the model used for analysis is misspecified, MSE is a better measure in the comparison of efficiency, since it incorporates both the bias and variability of an estimator.

In addition to studies of common seq SNPs, we conducted simulations for scenarios involving rare variants. We assumed that a rare variant score had been obtained by counting the number of rare variants across 20 loci, each of which had minor allele frequency (MAF) generated from a uniform distribution between 0.005 and 0.01, yielding a score of 0, 1, 2, or rarely 3. We then randomly selected three rare variants to be associated with the phenotype. All three causal rare variant effects are additive on the quantitative trait, with each copy of the minor allele increasing the mean trait value by 0.5, 0.75, and 1, respectively. Because most of the loci were noncausal, the overall association of the score with the trait was weak. The MAF for the tag SNP is specified as 0.3, and the correlation $r$ with the rare variant score was around 0.5.

The primary focus of the evaluations was design efficiency based on the precision of parameter estimation. Design efficiency translates directly into power for hypothesis testing, provided the variance estimate used to construct the corresponding test statistic is accurate, and the test statistic is valid under the null hypothesis of no association. We therefore also examined the distribution of the test statistic for the proposed method under the null of no seq SNP association with the phenotype. Genotype data were simulated with the same configurations described above, and a quantitative trait was simulated with $\beta_1 = 0$. We used $Z = \hat\beta_1/\mathrm{se}(\hat\beta_1)$ as the test statistic and calculated the P-value under an asymptotic standard normal distribution. For illustration purposes we chose 5% as the threshold for type I error assessment.

RESULTS

Overview. Under HWE and correct model specification, all methods yield consistent estimators for the intercept and the additive seq SNP effect. Asymptotic normality for the estimator from the proposed method is confirmed by examination of the distribution of $\beta_1$ estimates (see Supplementary Fig. 1). We focus on the relative efficiencies of the methods under different sampling designs compared to the ideal situation where we have sequence data for all subjects in the cohort. In general, the empirical standard deviations of the naive estimate $\hat\beta_{1,\mathrm{nai}}$ are smaller than those of the IPW estimate. For a quantitative trait, naively fitting a linear model still leads to a consistent estimate of the additive effect even though the marginal distribution of $X_i$ in the phase 2 sample is not the same as that in the population. When the effect size under an additive model in the phase 1 sample is of interest, the incorporation of weighting in IPW helps to guard against bias from model misspecification. The proposed joint analysis method, however, yields more precise estimates than the other two approaches, and illustrates improved efficiency in detecting a functional SNP within the same region as the tag SNP when we can only afford to sequence a portion of the subjects in phase 1.

Scenario (i) (MAF $P_a = P_d = 0.4$) with additive effect. Table III displays a subset of the phase 2 sample allocations we evaluated for scenario (i) with correlation $r = 0.50$ and sampling fraction $\rho$ = 10%, as well as the resulting averages of estimates and empirical standard deviations. Here, $E(n_0, n_1, n_2)$ are the expected counts in the common homozygote, the heterozygote, and the rare homozygote strata in the phase 2 sample. We also include results for two special cases of interest: one with approximately equal phase 2 sample size in each stratum, and the other with equal sampling fraction in each stratum. The relative bias is within 2% for all methods. The SD of the proposed estimate is roughly 2–2.5 times larger than the SD for the complete data, corresponding to a relative efficiency of 40–50%.

TABLE III. An example of estimation efficiencies under various phase 2 sample allocations for scenario (i) with MAFs $P_a = P_d = 0.4$, correlation $r = 0.50$, effect size $\beta_1 = 0.25$, and overall sampling fraction $\rho$ = 10% (AVE = average of the $\beta_1$ estimates over 1,000 replicates, SD = standard deviation of the $\beta_1$ estimates multiplied by 100. For the complete data, N = 1,000, and for the phase 2 sample, n = 100. Minimum and maximum standard deviations for each method are marked with *)

                                  Complete data    Naive            IPW              Proposed
  $\rho$  $r$   $E(n_0, n_1, n_2)$  AVE     SD       AVE     SD       AVE     SD       AVE     SD
  10%   0.50  (80, 10, 10)        0.250   3.27     0.248   9.62     0.246   14.19    0.247   6.15
              (45, 10, 45)                         0.251   9.43*    0.249   13.52    0.248   6.25
              (10, 10, 80)                         0.254   10.01    0.252   16.02    0.251   6.86
              (60, 30, 10)                         0.251   9.73     0.251   10.66    0.248   6.13
              (50, 30, 20)                         0.252   9.85     0.250   10.43    0.252   6.04*
              (35, 30, 35)                         0.254   9.78     0.254   10.71    0.249   6.29
              (20, 30, 50)                         0.250   10.83    0.247   12.47    0.249   6.97
              (10, 30, 60)                         0.252   12.06*   0.252   16.41*   0.248   7.44
              (33, 34, 33)a                        0.250   9.62     0.251   10.54    0.247   6.39
              (36, 48, 16)b                        0.255   10.18    0.255   10.18*   0.249   6.46
              (40, 50, 10)                         0.248   10.06    0.248   10.29    0.247   6.32
              (25, 50, 25)                         0.254   10.65    0.255   11.08    0.253   6.32
              (10, 50, 40)                         0.252   10.54    0.254   14.03    0.252   6.80
              (20, 70, 10)                         0.249   11.20    0.249   12.37    0.248   7.02
              (10, 70, 20)                         0.257   11.18    0.259   14.46    0.248   7.67*
              (10, 80, 10)                         0.253   11.21    0.250   14.40    0.247   7.43

  a Approximately equal phase 2 sample size in each stratum.
  b Equal sampling fraction in each stratum.

Figure 2 shows the relative efficiencies of the three methods under various sample size allocations for scenario (i), in which the overall sampling fraction is 10% or 50% (for cases with $\rho$ = 25% see Supporting Information). The relative efficiency depends on the strength of the correlation between the tag and the seq SNPs. When there is high LD (e.g., $r = 0.95$), the efficiency of the proposed method approaches that of fitting model (A1) to the complete data if they were available. When there is no or low LD between the two SNPs (e.g., $r = 0.05$), the phase 1 sample tag SNP does not provide much useful information for inferring the genotypes at the seq SNP for phase 1 subjects not included in phase 2. Thus, the proposed method performs no better than the naive approach. For cases where the tag and the seq SNPs are negatively correlated, the proposed method can still improve efficiency. As the overall sampling fraction $\rho$ increases, all methods produce more precise estimates.

The phase 2 sample size allocation plays an important role in the relative efficiency of the naive method. As mentioned above, under an additive genetic association model, $\hat\beta_{1,\mathrm{nai}}$ achieves maximum efficiency when the seq SNP genotype is DD for half of the phase 2 sample and is dd for the other half. This strategy works well when the seq SNP and the tag SNP have similar minor allele frequencies and are highly correlated or in perfect LD. As shown in the panels with $r = 0.95$ in Figure 2, compared to other allocations, the relative efficiency for the naive method is higher when stratum Aa is sparsely sampled and the two homozygote strata AA and aa are equally heavily sampled. Compared to the other approaches, the proposed method is relatively more consistent across different allocations for cases with positive correlation as long as the sampling scheme is not extreme. For negative correlation, however, the proposed method performs better when the rare homozygote stratum is sparsely sampled. This result is as expected, since the minor allele d at the seq SNP appears less frequently with the minor allele a at the tag SNP.

Scenario (ii) (MAF $P_a = 0.3$, $P_d = 0.2$) with additive effect. Similar results are obtained for scenario (ii) with additive effect. Figure 3 shows selected results for cases with sampling fraction $\rho$ = 10% and 50%. For cases where the correlation between the tag and the seq SNPs is positive (e.g., $r = 0.75$), both the naive method and the proposed method can achieve higher relative efficiency when the rare homozygote stratum is heavily sampled. In contrast, for cases with negative correlation (e.g., $r = -0.30$), the methods achieve higher relative efficiency when the rare homozygote stratum is sparsely sampled. Again, the proposed method performs no better than the naive method when there is weak or no correlation between the tag and the seq SNPs (e.g., $r = 0.05$) but does not do worse. See Supporting Information for more comprehensive results.

Robustness assessment. When the regression model for the genetic association of the quantitative trait with the seq SNP is misspecified, all methods produce biased estimates. Here, we focus on reporting results under scenario (i) (MAF $P_a = P_d = 0.4$) for cases where correlation between the tag SNP and the seq SNP is moderate and the


Fig. 2. Relative efficiencies of the naive (dashed line), IPW (dotted line), and proposed (solid line) methods under scenario (i) with MAF values of $P_a = P_d = 0.4$. The rows correspond to decreasing values of the tag-seq SNP correlation $r$. The first column corresponds to overall sampling fraction $\rho$ = 10% with sample size = 30 allocated to stratum Aa. The second and third columns correspond to overall sampling fraction $\rho$ = 50%, with sample size of 100 and 300 allocated to stratum Aa, respectively. Within each panel, the horizontal axis indicates different tag SNP strata allocations (AA, Aa, aa) for fixed heterozygote (Aa) count. At 100% sampling, the expected counts of AA, Aa, aa are 360, 480, 160.

sampling fraction is $\rho$ = 25% (Fig. 4). With the dominant model being the true model, the naive estimator $\hat\beta_{1,\mathrm{nai}}$ can have relative efficiency greater than 1 for some allocations. The explanation is that the complete data analysis yields biased estimates if one incorrectly fits an additive model, while the naive method can be less biased if the number of subjects with rare homozygote dd at the seq SNP in phase 2 is small. This may be achieved by oversampling from strata AA and Aa, provided that the correlation is high (see Supplementary Figs. 9–23 for the other cases of $r$ and $\rho$). For cases with negative $r$, the naive method performs best when stratum AA is sparsely sampled. This is due to the fact that under a dominant model, fewer cases with genotype dd in the sample lead to smaller bias in $\hat\beta_{1,\mathrm{nai}}$. In general, however, the proposed method performs better than the naive method for various allocations. Under a recessive model, the opposite phenomenon is observed. The relative efficiency of $\hat\beta_{1,\mathrm{nai}}$ increases as the expected number $E(m_1)$ of genotype DD in the phase 2 sample increases. Comparison of row 1 with row 4 in Figure 4 suggests that the proposed method is reasonably robust to violation of the assumption of constant variance across genotype classes. Results for the cases with an additional causal seq SNP ignored in analysis or with a skewed phenotype distribution are very similar to the case with heteroscedasticity. Similar results are also observed for scenario (ii) (MAF $P_a = 0.3$, $P_d = 0.2$) that specifies different minor allele frequencies (see Supplementary Figs. 24–38).

Rare variant analysis. Because only a few rare variants are specified to be associated with the trait, the estimate of the overall effect size $\beta$ for the aggregation score is only about 0.11. When the tag SNP and the rare variant score are moderately correlated (e.g., $r = 0.5$), the relative efficiency is consistently higher when stratum aa is oversampled than when stratum aa is undersampled. The proposed joint analysis method performed consistently better than the two methods that use phase 2 data alone (Fig. 5). Unlike common seq SNP analysis, however, undersampling of the tag SNP heterozygote stratum did not appear to


Fig. 3. Relative efficiencies of the naive (dashed line), IPW (dotted line), and proposed (solid line) methods under scenario (ii) with MAF values $P_a = 0.3$ and $P_d = 0.2$. The rows correspond to decreasing values of the tag-seq SNP correlation $r$; for the given MAF values, $r$ is constrained to be less than 0.76. The first column corresponds to overall sampling fraction $\rho$ = 10% with sample size = 10 allocated to stratum Aa. The second and third columns correspond to overall sampling fraction $\rho$ = 50%, with sample size of 10 and 400 allocated to stratum Aa, respectively. Within each panel, the horizontal axis indicates different tag SNP strata allocations (AA, Aa, aa) for fixed heterozygote (Aa) count. At 100% sampling, the expected counts of AA, Aa, aa are 490, 420, 90.

contribute to efficiency gain, most likely due to the fact that the rare variant score distribution concentrates its mass on zero. Under the assumption that $K$ independent rare variants are included in the rare variant score, the sum of the correlations between the tag SNP and each of the $K$ rare variants is approximately equal to $K^{1/2} r$. For $r = 0.5$ and $K = 20$, the average correlation between each of the $K$ rare variants and the tag SNP would be approximately 0.112.

Simulations under the null hypothesis of no association. When data were generated under the situation of no association between the seq SNP and the phenotype, all methods consistently estimated the genetic effect at 0 for all sampling fractions and all sampling schemes. For scenario (i) with MAF $P_a = P_d = 0.4$ and sampling fraction $\rho$ = 10%, there is a slight inflation of the type I error rate for the proposed method under some sample size allocations. The type I error rate is between 5% and 10% when $r = 0.05$ (see Supplementary Fig. 39). However, type I error does not necessarily reflect accurately the statistical design efficiency. This is because the test statistic depends on the accuracy of standard error estimation (typically based on asymptotic distributions) whereas efficiency depends on the empirical precision of the estimates. This inflation may be due to the use of a small phase 2 sample. For sampling fraction $\rho$ = 25% or 50%, the type I error rate is close to the nominal 5% (see Supplementary Figs. 40 and 41). Similar patterns are observed for scenario (ii) where minor allele frequencies differed between the tag and seq SNPs (see Supplementary Figs. 42–44).

Summary concerning allocation. It is evident that the phase 2 sample size allocation has a major impact on methods that use only phase 2 data, which is relevant when analysis is limited to use of a linear model in phase 2 data. Although a universally optimal allocation design does not appear to exist even when the assumption of additive effect is correctly made, our simulation studies provide investigators with some guidelines for planning regional sequencing on a subset of phase 1 subjects. As illustrated by our simulations, efficiency of the naive method using only the phase 2 sample depends on the MAFs at the tag SNP and the seq SNP as well as their correlation $r$. When there are reasonable grounds to expect positive correlation, we recommend that the heterozygote stratum Aa be undersampled while the two homozygote strata AA and aa be oversampled up to
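The $K^{1/2} r$ approximation for the rare variant score can be checked with a short derivation. It is a sketch that assumes, beyond what the text states, that the $K$ rare variants are mutually independent with a common variance $v$; $T$ denotes the tag SNP genotype, $R_l$ the $l$th rare variant, $X = \sum_l R_l$ the score, and $r_l = \mathrm{corr}(T, R_l)$.

```latex
\begin{align*}
r = \mathrm{corr}(T, X)
  &= \frac{\sum_{l=1}^{K} \mathrm{cov}(T, R_l)}
          {\sqrt{\mathrm{var}(T)}\,\sqrt{K v}}
   = \frac{1}{\sqrt{K}} \sum_{l=1}^{K} r_l
  \quad\Longrightarrow\quad
  \sum_{l=1}^{K} r_l = K^{1/2}\, r, \\
\bar r = \frac{1}{K} \sum_{l=1}^{K} r_l
  &= \frac{r}{K^{1/2}}
   = \frac{0.5}{\sqrt{20}} \approx 0.112 .
\end{align*}
```

With MAFs drawn from a narrow range (0.005 to 0.01) the variances are only roughly equal, which is consistent with the relation being stated as approximate.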


maximum available counts in aa (Figs. 2 and 3). Although similar trends of relative efficiency are observed for the proposed joint analysis of phase 1 and phase 2 data, it is less dependent on the sample allocation and is generally more efficient and more robust than the naive method.

Fig. 4. Relative MSE of the naive (dashed line), IPW (dotted line), and proposed (solid line) methods under model misspecification with MAF values $P_a = P_d = 0.4$. The first row corresponds to a correctly specified additive (ADD) model. The second, third, and fourth rows correspond to cases where the true model is dominant (DOM), recessive (REC), and heteroscedastic (HET), respectively. The overall sampling fraction is $\rho$ = 25%, and the tag-seq SNP correlation is $r = 0.5$. The columns correspond to cases with sample sizes of 10, 50, and 100 allocated to stratum Aa, respectively. Within each panel, the horizontal axis indicates different tag SNP strata allocations (AA, Aa, aa) for fixed heterozygote (Aa) count. At 100% sampling, the expected counts of AA, Aa, aa are 360, 480, 160. Results for cases with an additional causal seq SNP ignored as well as for cases with skewed phenotype distribution are similar to HET, and hence are not shown here (see Supplementary Figs. 9–23).

DISCUSSION

The two-phase study design, widely used in epidemiological studies, focuses on collecting more detailed but expensive covariate data in a subset of the study sample based on information in auxiliary variables available in the entire sample. Although the cost of high-density genotypic data has dropped dramatically, costs of NGS are still considered high for large-scale studies that involve tens of thousands of participants. Efficient study designs are likely to remain necessary in the near future for cost-efficiency in large studies of the genetic basis of complex diseases and traits. The two-phase design is important in the sense that information from low-cost tag SNPs and imputed SNPs available in a phase 1 sample can lead to better decisions on how to select a subset of the sample to be sequenced for discovery and assessment of additional variant SNPs in phase 2. As the depth and breadth of available sequence data accumulate, for example, through initiatives such as the 1000 Genomes Project, and lower frequency SNPs are added to GWAS SNP arrays, the number of unknown or unimputable variants within a sequenced region in a particular study may decrease. It remains to be seen, however, whether imputation accuracy will improve sufficiently for follow-up and fine-mapping studies, especially for very low frequency variants. Moreover, sequencing a portion of study participants may serve to create a very well-matched reference panel useful for imputation in the entire study [Fridley et al., 2010; Zeggini, 2011].

Two-phase stratified design and optimal sample allocation for GWAS follow-up studies have received little attention [Thomas et al., 2004, 2009]. In contrast, the well-known multistage design improves cost-efficiency in the GWAS setting by genotyping a full set of known SNP markers in a subset of available subjects in stage 1, and then genotyping a selected subset of the SNPs in the remaining subjects in


stage 2. By excluding markers that show little evidence of association in stage 1, genotyping requirements, and hence cost, can be substantially reduced while preserving much of the power of the corresponding single-stage design in which all subjects are genotyped on all markers [Skol et al., 2006]. An association detected in both stages, however, may be arising indirectly through a common variant that is in LD with a functional variant, thus motivating the need for subsequent regional fine-mapping and sequencing studies to identify additional variants.

Fig. 5. Relative efficiency of the naive (dashed line), IPW (dotted line), and proposed (solid line) methods for rare variant analysis involving 20 rare variants with MAF values generated from Unif(0.005, 0.01). The overall sampling fraction is $\rho$ = 25%, and the correlation between the tag SNP and the rare variants sum score is $r = 0.5$. The panels correspond to cases with sample size of 10, 50, and 150 allocated to stratum Aa, respectively. Within each panel, the horizontal axis indicates different tag SNP strata allocations (AA, Aa, aa) for fixed heterozygote (Aa) count. At 100% sampling, the expected counts of AA, Aa, aa are 490, 420, 90.

In this report, we consider estimation of an additive genetic effect in a two-phase stratified design. As a starting point, we examine the case of one tag SNP and one seq SNP. We propose an estimating equations approach using all available data from phases 1 and 2 and study the efficiency gain compared to using only phase 2 data. We also investigate the sensitivity of estimation efficiency to allocation of the phase 2 sample size under the additive model. The main idea of the sampling design is to select fewer heterozygotes with the sequence variant while selecting more of each homozygote type. If one expects positive correlation between the tag SNP and a functional seq variant, then it is more efficient to oversample from the tag SNP rare homozygote stratum. This strategy provides no useful information, however, if the tag SNP is not in LD with the seq SNP, which may be the case when the seq SNP is too distant from the tag SNP. Through simulation studies, we show that the correlation between a tag SNP used for stratification and the seq SNP plays an important role in estimation efficiency for the proposed joint analysis method. As the magnitude of the correlation coefficient decreases to zero, the efficiency decreases relative to complete sequencing. When the seq SNP and the tag SNP are independent, the phase 2 sample can be regarded as a random subset of the entire study sample, and stratification by the tag SNP genotype does not contribute to improved estimation of the genetic effect.

Our findings concerning sample size allocation are most directly applicable to the design of an independent replication study in which a specific region of interest has been prespecified, and typed or imputed SNP data for the region are readily available. When, however, a promising region of interest has been identified by GWAS using genome-wide significance criteria, effect estimates for tag SNPs so identified will be subject to selection bias known as the "winner's curse" [Faye et al., 2011; Sun et al., 2011]. In fine-mapping conducted in the same sample, effect estimates for seq SNPs in LD with the tag SNP will also be affected indirectly by this phenomenon in a complicated manner [Faye and Bull, 2011], and will similarly be subject to bias. However, regardless of such complications in interpreting the results, a tag SNP in high LD with a sequenced SNP is nevertheless expected to serve well in selecting individuals enriched for informative seq SNP genotypes within the region.

For complex diseases and traits, the underlying genetic models are unknown. Thus, there is no uniformly most powerful test across all possible alternative genetic models and no single optimal phase 2 sample size allocation for all situations. We have limited consideration to a linear model for a quantitative trait with an additive effect of a genetic variant, in which the number of copies of the minor allele is treated as dosage. Our method presumes that a potential functional variant is within the same region as the tag SNP used for stratification. In practice, with sequencing of multiple SNPs in the same region as the tag SNP, analysis of the sequence genotypes would proceed by association testing of each of the seq SNPs with the quantitative trait of interest. For any of the seq SNPs correlated with the tag SNP, a stratified sampling design will be more powerful than a simple random sample of the same size. On the other hand, for a seq SNP in the region that is uncorrelated with the tag SNP, stratified sampling will not perform worse than simple random sampling.

The development of sample allocation and joint analysis methods is based on the assumption that only a single tag SNP is available within a region. In practice, multiple SNP markers in phase 1 may be used as tag SNPs for a region or there may be tag SNPs in multiple regions of interest. When the number of genotype combinations is large, stratification using multiple tag SNPs can be problematic. One possible modification is to combine genotype categories based on the total count of minor alleles at the tag SNPs. Modeling the associations between the seq SNPs and the collapsed strata


as well as designating a robust allocation, however, may not be straightforward. As a result, scope for the application of sample allocation design principles across multiple regions may be limited for purposes of cost-efficiency. For other types of analysis such as haplotype- or gene-based inference, optimal allocation depends on the specific statistical methods to be used and is very likely to employ different optimization criteria. In principle, the sample size allocation methods can be extended to situations in which haplotypes are used for stratification and/or rare variant counts are used to summarize sequence data [Price et al., 2010]. In addition, because environmental factors also play important roles in the etiology of complex traits/diseases, they can be included in the joint analysis to better explain other sources of the variation of the trait [e.g., Paterson et al., 2010].

Subject-selection strategies that depend on the quantitative trait, and associated methods of analysis, have been evaluated by a number of authors [e.g., Bacanu et al., 2011; Guey et al., 2011; Huang and Lin, 2007; Lin and Tang, 2011; Tang, 2010; Van Gestel et al., 2000; Yilmaz and Bull, 2011], and in principle could also be applied within genotype classes, as suggested by a reviewer. In cross-sectional and longitudinal study designs, however, multiple traits are often of interest, and sampling based on one trait may not improve efficiency for another. If functional variants for multiple traits are harbored in the region and are correlated with the tag SNP, then sampling based on that tag SNP can be beneficial for all traits. To the extent that the tag SNP genotype is correlated with a quantitative trait, over-sampling on the high- and low-risk genotype strata will indirectly enrich for associated extreme trait values in the selected individuals. Further work to formally evaluate additional improvements in efficiency associated with trait-dependent sampling, for example, by defining multiple strata according to genotype and phenotype, is warranted.

ACKNOWLEDGMENTS

This research was supported by funding from the Canadian Institutes of Health Research: CIHR Operating Grant MOP-84287 (R.C., S.B.B.), CIHR Training Grant GET-101831 (Z.C.). Z.C. is a CIHR Fellow in Genetic Epidemiology and Statistical Genetics with CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology), a CIHR Training Grant in Genetic Epidemiology and Statistical Genetics. Computations were performed on the GPC supercomputer at the SciNet HPC Consortium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto. The authors thank the referees for constructive comments that significantly improved the paper.

REFERENCES

Bacanu SA, Nelson MR, Whittaker JC. 2011. Comparison of methods and sampling designs to test for association between rare variants and quantitative traits. Genet Epidemiol 35: 226–235.
Breslow NE, Wellner JA. 2007. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand J Stat 34: 86–102.
Faye L, Bull SB. 2011. Two-stage study designs combining GWAS tag SNPs and exome sequencing: accuracy of genetic effect estimates. BMC Proceedings 5(Suppl. 9): S64.
Faye L, Sun L, Dimitromanolakis A, Bull SB. 2011. A flexible genome-wide bootstrap method that accounts for ranking- and threshold-selection bias in GWAS interpretation and replication study design. Stat Med 30: 1898–1912.
Fridley BL, Jenkins G, Deyo-Svendsen ME, Hebbring S, Freimuth R. 2010. Utilizing genotype imputation for the augmentation of sequence data. PLoS ONE 5: e11018.
Godambe VP, Thompson ME. 1986. Parameters of superpopulation and survey population: their relationships and estimation. Int Stat Rev 54: 127–138.
Guey LT, Kravic J, Melander O, Burtt NP, Laramie JM, Lyssenko V, Jonsson A, Lindholm E, Tuomi T, Isomaa B, Nilsson P, Almgren P, Kathiresan S, Groop L, Seymour AB, Altshuler D, Voight BF. 2011. Power in the phenotypic extremes: a simulation study of power in discovery and replication of rare variants. Genet Epidemiol 35: 236–246.
Huang BE, Lin DY. 2007. Efficient association mapping of quantitative trait loci with selective genotyping. Am J Hum Genet 80: 567–572.
Ioannidis JP, Thomas G, Daly MJ. 2009. Validating, augmenting and refining genome-wide association signals. Nat Rev Genet 10: 318–329.
Li Y, Willer C, Sanna S, Abecasis G. 2009. Genotype imputation. Annu Rev Genomics Hum Genet 10: 387–406.
Lin DY, Tang ZZ. 2011. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 89: 354–367.
Liu DJ, Leal SM. 2010. Replication strategies for rare variant complex trait association studies via next-generation sequencing. Am J Hum Genet 87: 790–801.
Manski CF, Lerman SR. 1977. The estimation of choice probabilities from choice based samples. Econometrica 45: 1977–1988.
Morris AP, Zeggini E. 2010. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34: 188–193.
Newey W, McFadden D. 1994. Large sample estimation and hypothesis testing. In: Engle R, McFadden D, eds, Handbook of Econometrics, vol. 4. Elsevier Science, Amsterdam.
Paterson AD, Waggott D, Boright AP, Hosseini SM, Shen E, Sylvestre M-P, Wong I, Bharaj B, Cleary PA, Lachin JM, Below JE, Nicolae D, Cox NJ, Canty AJ, Sun L, Bull SB, Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Research Group. 2010. A genome-wide association study identifies a novel major locus for glycemic control in type 1 diabetes, as measured by both A1C and glucose. Diabetes 59: 539–549.
Pei YF, Zhang L, Li J, Deng HW. 2010. Analyses and comparison of imputation-based association methods. PLoS ONE 5: e10827.
Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. 2010. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86: 832–838.
Skinner CJ, Holt D, Smith TMF, editors. 1989. Analysis of complex surveys. Chichester, UK: Wiley.
Skol AD, Scott LJ, Abecasis GR, Boehnke M. 2006. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38: 209–213.
Sun L, Dimitromanolakis A, Faye L, Paterson A, Waggott D, Bull S. 2011. BR-squared: a practical solution to the winner’s curse in genome-wide scans. Hum Genet 129: 545–552.
Tang Y. 2010. Equivalence of three score tests for association mapping of quantitative trait loci under selective genotyping. Genet Epidemiol 34: 522–527.
Thomas D, Xie RR, Gebregziabher M. 2004. Two-stage sampling designs for gene association studies. Genet Epidemiol 27: 401–414.
Thomas DC, Casey G, Conti DV, Haile RW, Lewinger JP, Stram DO. 2009. Methodological issues in multistage genome-wide association studies. Stat Sci 24: 414–429.


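Udler SM, Tyrer J, Easton DF. 2010. Evaluating the power to discriminate between highly correlated SNPs in genetic association studies. Genet Epidemiol 34: 463–468.
Van Gestel S, Houwing-Duistermaat JJ, Adolfsson R, van Duijn CM, Van Broeckhoven C. 2000. Power of selective genotyping in genetic association analyses of quantitative traits. Behav Genet 30: 141–146.
Yilmaz YE, Bull SB. 2011. Are quantitative trait-dependent sampling designs cost effective for analysis of rare and common variants? BMC Proceedings 5(Suppl. 9): S111.
Zeggini E. 2011. Next-generation association studies for complex traits. Nat Genet 43: 287–288.
Zheng G, Song K, Elston RC. 2007. Adaptive two-stage analysis of genetic association in case-control designs. Hum Hered 63: 175–186.

APPENDIX: CONSISTENCY AND ASYMPTOTIC DISTRIBUTION OF $\hat{\theta}$

The asymptotic behavior of $\hat{\theta}$ can be derived based on standard estimating equations theory. By Theorem 3.4 of Newey and McFadden [1994], under regularity conditions, we have that with probability approaching 1, there is a unique solution to $\sum_{i=1}^{N} \Psi_i(\theta) = 0$, denoted by $\hat{\theta}$, that satisfies

\[
0 = N^{-1/2} \sum_{i=1}^{N} \Psi_i(\theta) + \left\{ N^{-1} \sum_{i=1}^{N} \partial \Psi_i(\theta)/\partial \theta^{T} \right\} N^{1/2} (\hat{\theta} - \theta) + o_p(1).
\]

This is equivalent to

\[
N^{1/2} (\hat{\theta} - \theta) = - \left[ E\{ \partial \Psi_i(\theta)/\partial \theta^{T} \} \right]^{-1} \times N^{-1/2} \sum_{i=1}^{N} \Psi_i(\theta) + o_p(1) \tag{A1}
\]

as under regularity conditions, $E\{\partial \Psi_i(\theta)/\partial \theta^{T}\}$ exists and is invertible and $\mathrm{var}\{\Psi_i(\theta)\}$ is finite and positive definite. The Law of Large Numbers leads to $N^{-1} \sum_{i=1}^{N} \Psi_i(\theta) \to_p E\{\Psi_i(\theta)\} = 0$, as $N \to \infty$, and the consistency of $\hat{\theta}$ is immediate by the Slutsky theorem. By applying the Central Limit Theorem to (A1), the asymptotic distribution of $N^{1/2}(\hat{\theta} - \theta)$ can be established.

Let

\[
W_i(\alpha) = (\xi_i/\pi_i)\, \partial Q_i(\alpha)/\partial \alpha^{T},
\]
\[
G_i(\alpha, \beta) = (1 - \xi_i)\, \partial U_i^{*}(\alpha, \beta)/\partial \alpha^{T}.
\]

Note that

\[
\tilde{M}_i(\alpha, \beta) = \xi_i\, \partial U_i(\beta)/\partial \beta^{T} + (1 - \xi_i)\, \partial U_i^{*}(\alpha, \beta)/\partial \beta^{T}.
\]

Therefore,

\[
\frac{\partial \Psi_i(\theta)}{\partial \theta^{T}} =
\begin{pmatrix}
W_i(\alpha) & 0 \\
G_i(\alpha, \beta) & \tilde{M}_i(\alpha, \beta)
\end{pmatrix}.
\]

Thus,

\[
\Gamma = \sum_{j=0}^{2} \mathrm{pr}(Z_i = j)
\begin{pmatrix}
\Gamma_{j11} & 0 \\
\Gamma_{j21} & \Gamma_{j22}
\end{pmatrix},
\]

where

\[
\Gamma_{j11} = E\{ \partial Q_i(\alpha)/\partial \alpha^{T} \mid Z_i = j \},
\]
\[
\Gamma_{j21} = (1 - \rho_j)\, E\{ \partial U_i^{*}(\alpha, \beta)/\partial \alpha^{T} \mid Z_i = j \},
\]
\[
\Gamma_{j22} = \rho_j\, E\{ \partial U_i(\beta)/\partial \beta^{T} \mid Z_i = j \} + (1 - \rho_j)\, E\{ \partial U_i^{*}(\alpha, \beta)/\partial \beta^{T} \mid Z_i = j \}.
\]

As $N \to \infty$, $E\{\partial U_i^{*}(\alpha, \beta)/\partial \beta^{T} \mid Z_i = j\}$ and $E\{\partial U_i^{*}(\alpha, \beta)/\partial \alpha^{T} \mid Z_i = j\}$ can be consistently estimated by, respectively,

\[
\hat{E}\left\{ \frac{\partial U_i^{*}(\alpha, \beta)}{\partial \beta^{T}} \,\Big|\, Z_i = j \right\}
= \frac{1}{N_j - n_j} \sum_{i \in \{\bar{s}_2 \cup j\}} \Bigg[
\sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, U_i(\hat{\beta}; Y_i, X_i = k)\, U_i^{T}(\hat{\beta}; Y_i, X_i = k)
\]
\[
\qquad - \left\{ \sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, U_i(\hat{\beta}; Y_i, X_i = k) \right\}
\left\{ \sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, U_i(\hat{\beta}; Y_i, X_i = k) \right\}^{T}
+ \sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, \partial U_i(\hat{\beta}; Y_i, X_i = k)/\partial \beta^{T} \Bigg],
\]

and

\[
\hat{E}\left\{ \frac{\partial U_i^{*}(\alpha, \beta)}{\partial \alpha^{T}} \,\Big|\, Z_i = j \right\}
= \frac{1}{N_j - n_j} \sum_{i \in \{\bar{s}_2 \cup j\}} \Bigg[
\sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, U_i(\hat{\beta}; Y_i, X_i = k)\, Q_i^{T}(\hat{\alpha}; G_k, Z_i)
\]
\[
\qquad - \left\{ \sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, U_i(\hat{\beta}; Y_i, X_i = k) \right\}
\left\{ \sum_{k=0}^{2} \phi_{ik}(\hat{\alpha}, \hat{\beta})\, Q_i(\hat{\alpha}; G_k, Z_i) \right\}^{T} \Bigg].
\]

Similarly, $E\{\partial U_i(\beta)/\partial \beta^{T} \mid Z_i = j\}$ can be consistently estimated by

\[
\hat{E}\left\{ \frac{\partial U_i(\beta)}{\partial \beta^{T}} \,\Big|\, Z_i = j \right\}
= \frac{1}{n_j} \sum_{i \in \{s_2 \cup j\}} \frac{\partial U_i(\hat{\beta})}{\partial \beta^{T}}.
\]

The middle term $\Sigma$ in the sandwich variance matrix can be consistently estimated by its empirical counterpart

\[
\hat{\Sigma} = \sum_{i=1}^{N}
\begin{pmatrix}
(\xi_i/\pi_i)^{2}\, Q_i(\hat{\alpha})\, Q_i^{T}(\hat{\alpha}) & (\xi_i/\pi_i)\, Q_i(\hat{\alpha})\, \tilde{U}_i^{T}(\hat{\alpha}, \hat{\beta}) \\
(\xi_i/\pi_i)\, \tilde{U}_i(\hat{\alpha}, \hat{\beta})\, Q_i^{T}(\hat{\alpha}) & \tilde{U}_i(\hat{\alpha}, \hat{\beta})\, \tilde{U}_i^{T}(\hat{\alpha}, \hat{\beta})
\end{pmatrix}.
\]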
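The appendix states that applying the Central Limit Theorem to (A1) establishes the limiting distribution, without writing it out. As a sketch only, and assuming the shorthand $\Gamma = E\{\partial \Psi_i(\theta)/\partial \theta^{T}\}$ and $\Sigma = \mathrm{var}\{\Psi_i(\theta)\}$ (these symbol names are an assumption made here for illustration), the usual sandwich form would follow from (A1) in two steps:

```latex
% Step 1: the Central Limit Theorem applied to the normalized estimating function
\[
N^{-1/2} \sum_{i=1}^{N} \Psi_i(\theta) \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]
% Step 2: premultiplying by -\Gamma^{-1}, as in (A1), gives the sandwich variance
\[
N^{1/2} (\hat{\theta} - \theta) \xrightarrow{d}
\mathcal{N}\left( 0,\; \Gamma^{-1}\, \Sigma\, (\Gamma^{-1})^{T} \right).
\]
```

Under this reading, $\Sigma$ is the "middle term" of the sandwich and $\Gamma^{-1}$ supplies the two outer factors.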


2012 119: 2392-2400 Prepublished online January 17, 2012; doi:10.1182/blood-2011-10-383448 The endothelial protein C receptor (PROCR) Ser219Gly variant and risk of common thrombotic disorders: a HuGE review and meta-analysis of evidence from observational studies

Jessica Dennis, Candice Y. Johnson, Adeniyi Samuel Adediran, Mariza de Andrade, John A. Heit, Pierre-Emmanuel Morange, David-Alexandre Trégouët and France Gagnon

Updated information and services can be found at: http://bloodjournal.hematologylibrary.org/content/119/10/2392.full.html


Blood (print ISSN 0006-4971, online ISSN 1528-0020), is published weekly by the American Society of Hematology, 2021 L St, NW, Suite 900, Washington DC 20036. Copyright 2011 by The American Society of Hematology; all rights reserved.

THROMBOSIS AND HEMOSTASIS

The endothelial protein C receptor (PROCR) Ser219Gly variant and risk of common thrombotic disorders: a HuGE review and meta-analysis of evidence from observational studies

Jessica Dennis,1 Candice Y. Johnson,2 Adeniyi Samuel Adediran,1 Mariza de Andrade,3 John A. Heit,4 Pierre-Emmanuel Morange,5,6 David-Alexandre Trégouët,7-9 and France Gagnon1

1Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON; 2Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA; 3Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, and 4Division of Cardiovascular Diseases, Department of Internal Medicine, Mayo Clinic, Rochester, MN; 5Inserm, Unité Mixte de Recherche (UMR) en Santé, Marseille, France; 6Faculty of Medicine, University of the Mediterranean, Marseille, France; 7Inserm, UMR en Santé 937, Paris, France; 8L’Institut hospitalo-universitaire de cardiologie-metabolisme et nutrition, Paris, France; and 9Université Pierre et Marie Curie, Paris, France

The endothelial protein C receptor (EPCR) limits thrombus formation by enhancing activation of the protein C anticoagulant pathway, and therefore may play a role in the etiology of thrombotic disorders. The rs867186 single-nucleotide polymorphism in the PROCR gene (g.6936A>G, c.4600A>G), resulting in a serine-to-glycine substitution at codon 219, has been associated with reduced activation of the protein C pathway, although its association with thrombosis risk remains unclear. The present study is a highly comprehensive systematic review and meta-analysis, including unpublished genome-wide association study results, conducted to evaluate the evidence for an association between rs867186 and 2 common thrombotic outcomes, venous thromboembolism (VTE) and myocardial infarction (MI), which are hypothesized to share some etiologic pathways. MEDLINE, EMBASE, and HuGE Navigator were searched through July 2011 to identify relevant epidemiologic studies, and data were summarized using random-effects meta-analysis. Twelve candidate gene studies and 13 genome-wide association studies were analyzed (11 VTE and 14 MI, including 37 415 cases and 84 406 noncases). Under the additive genetic model, the odds of VTE increased by a factor of 1.22 (95% confidence interval, 1.11-1.33, P < .001) for every additional copy of the G allele. No evidence for association with MI was observed. (Blood. 2012;119(10):2392-2400)

Introduction

Protein C (PC) is a major component of the coagulation/fibrinolysis cascade. Circulating in plasma as an inactive zymogen, PC is activated at the endothelial surface by the membrane-bound thrombin-thrombomodulin complex [1]. When activated PC (APC) is bound to its cofactor, protein S, it inactivates the procoagulant factors FVa and FVIIIa, limiting the coagulation cascade and fibrin formation [1,2]. PC activation is enhanced approximately 20-fold when PC binds to the endothelial PC receptor (EPCR) [3], a type I transmembrane protein. EPCR is primarily localized on the endothelial cells of large blood vessels (ie, the arteries and veins) and is very sparse or absent in the microvascular endothelium of most tissues [4]. EPCR-bound APC triggers protease-activated receptor-1 (PAR-1) cleavage, resulting in anti-inflammatory and cytoprotective (eg, antiapoptotic) effects [2,5]. In addition to its APC-mediated effects, EPCR also works to limit thrombus formation by binding procoagulant FVII/FVIIa, facilitating the clearance of FVIIa and limiting downstream activation of the tissue factor (extrinsic) coagulation pathway [6,7]. These findings strongly favor an important role for EPCR in thrombosis and inflammation [1].

A soluble form of EPCR (sEPCR) also circulates in the plasma. sEPCR binds PC/APC with the same affinity as membrane-bound EPCR, but does not enhance PC activation by the thrombin-thrombomodulin complex [8]. Furthermore, sEPCR-bound APC is incapable of inactivating FVa [8] and may also impede PAR-1 cleavage [2]. By limiting APC generation and function, elevated levels of sEPCR may exert procoagulant and proinflammatory effects; in 2 case-control studies [9,10], elevated levels of sEPCR were associated with increased risk of VTE. Likewise, a small family study found a higher occurrence of VTE in those with above-normal values of sEPCR compared with those with normal levels [11].

The PROCR gene is located on 20q11.2, spans 6 kilobases, and possesses 4 exons [12]. The mature protein comprises 221 amino acids, including an extracellular domain, a 25-amino acid transmembrane domain, and a 3-amino acid intracytoplasmic sequence. Animal experiments have demonstrated the importance of PROCR in normal embryonic development; in PROCR knockout mice, fibrin deposition in trophoblast giant cells results in thrombosis at the maternal-embryonic interface [13]. Death occurs by embryonic day 10.5.

Gene variants and frequency

Mutations in the PROCR gene that influence protein expression, function, and/or the concentration of sEPCR may be functionally relevant. Rare point mutations in the gene [14] and its promoter region [15] have been described, but effects on thrombosis and gene expression remain unknown [16].

Submitted October 4, 2011; accepted December 22, 2011. Prepublished online as Blood First Edition paper, January 17, 2012; DOI 10.1182/blood-2011-10-383448.

The online version of this article contains a data supplement.

The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked "advertisement" in accordance with 18 USC section 1734.

© 2012 by The American Society of Hematology


The rs867186 diallelic single nucleotide polymorphism in the PROCR gene (g.6936A>G, c.4600A>G), resulting in a serine-to-glycine substitution at codon 219 in the membrane-spanning domain of EPCR, explains between 56% and 87% of the variations in sEPCR levels [10,17-19]. The G allele tags the A3 haplotype of PROCR (4 haplotypes have been identified in whites) and is associated with increased shedding of EPCR from the endothelial membrane, both by rendering the receptor more sensitive to cleavage [20] and by leading to a truncated mRNA through alternative splicing [21]. The overall frequency of the G allele is 0.074 among individuals included to date in the 1000 Genomes Project [22]; however, there are large variations across the population (eg, 0.53 among Papuan New Guineans and 0.0 among South-American Amerindians from the Human Genome Diversity Cell Line Panel) [23]. In a genome-wide association study (GWAS) of more than 23 000 cohort participants of European ancestry, the G-allele frequency was 0.101 [24].

Disease

Venous thromboembolism (VTE) results from an obstruction of blood in the venous system [25] by a RBC-rich thrombus composed of platelets and fibrin at sites with low blood flow and shear rate and where the vein wall is normal [26]. In contrast, arterial thrombosis (ischemic stroke and coronary artery disease) results from platelet-rich thrombi induced by the rupture of an atherosclerotic plaque at arterial sites where shear rates are high. Although arterial and venous thromboses have traditionally been viewed as distinct, recent studies suggest that the 2 conditions share common etiologic pathways. Evidence in support of this is 3-fold (for review, see Prandoni [27] or Lowe [26]): (1) several observational studies have demonstrated increased risk of subsequent arterial vascular disease among patients with VTE; (2) drugs affecting coagulation and hemostasis have some effect in preventing and treating various atherosclerotic disorders; and (3) VTE and myocardial infarction (MI) share common risk factors (albeit of different magnitudes), including genetic determinants such as the Factor V Leiden (F5L) and prothrombin G20210A gene mutations [28,29] and variants in the ABO gene [30,31], as well as age, obesity, hypertension, diabetes mellitus, smoking, and hypercholesterolemia [32]. All of these risk factors increase susceptibility to thrombus formation by modifying the coagulation and fibrinolytic systems [33].

The PROCR rs867186 variant has emerged as a candidate risk factor for both arterial and venous thrombotic disease because of the involvement of EPCR in APC- and FVII/FVIIa-mediated clotting and inflammation. However, evidence for an association between the PROCR rs867186 variant and arterial and venous thrombosis has been conflicting. We therefore undertook a systematic review and meta-analysis of observational studies to evaluate the evidence for an association between the PROCR rs867186 variant and 2 common thrombotic outcomes, VTE and MI, which are hypothesized to share etiologic pathways. Although ischemic stroke is also a common thrombotic outcome, we chose not to include it in our review because the outcome was highly variable (ie, ischemic and hemorrhagic stroke cases were often indistinguishable) and because the number of studies of stroke was low.

Methods

Search strategy

MEDLINE, EMBASE, and the HuGE Navigator Genopedia database were searched through July 2011 using a combination of keywords, including "thromboembolism," "thrombosis," "pulmonary embolism," "myocardial infarction," "EPCR," and "PROCR" (supplemental Table 1, available on the Blood Web site; see the Supplemental Materials link at the top of the online article). Titles and abstracts of identified articles were screened by 2 reviewers, and those that discussed any PROCR variant in the context of VTE or MI proceeded to a full-text screening. The full text of articles was screened by 2 reviewers and case-control, cohort, or cross-sectional studies examining an association between the PROCR rs867186 polymorphism and VTE or MI were selected for inclusion, with disagreements resolved by consensus. Reference lists of included studies and review articles were searched for additional articles. Genetic linkage studies, review articles, animal studies, and conference proceedings were excluded.

GWAS of VTE or MI in which the PROCR rs867186 variant, or variants in close linkage disequilibrium with rs867186, might potentially have been included on the genotyping platform were located using keyword searches of the HuGE Navigator GWAS Integrator database. Unpublished GWAS were identified by keyword searches of the National Center for Biotechnology Information database of genotypes and phenotypes (dbGaP). Authors of relevant GWAS were contacted to request counts of events and nonevents by PROCR rs867186 genotype (ultimately, all identified GWAS had genotyped this single-nucleotide polymorphism).

Information on study design, location, demographics, ascertainment of subjects, case definition, and genotype and allele frequencies was extracted from included studies by 2 independent reviewers using a standardized data abstraction form. When this information was not available from the article, it was sought from other publications reporting on the same study population, if available. Crude odds ratios (ORs) and their 95% confidence intervals (95% CIs) were calculated from genotype frequencies presented in the published article; when separate genotype counts were not presented, they were requested from the study authors.

Meta-analysis

Inverse variance-weighted random effects meta-analysis was used to estimate the summary effect and 95% CI for VTE, MI, and VTE and MI combined, using Stata Version 11 software. The G allele was considered the at-risk allele, and the per-allele effect estimate was calculated as the OR per unit score (0, 1, or 2 copies of the G allele) using logistic regression. The per-allele OR is the multiplicative change in the odds of disease per one-allele increase, and the P value of the OR tests the hypothesis of zero slope for a line that best fits the 3 genotypic risk estimates [34]. The per-allele model is powerful for detecting additive genetic effects [34] and the additive model is in accordance with the observed variation in sEPCR values according to the number of copies of the G allele [18,19,35]. In secondary analyses, 4 additional genotype contrasts were tested for each outcome: AG versus AA, GG versus AA, GG + AG versus AA (dominant model), and GG versus AG + AA (recessive model). The intensity and significance of between-study heterogeneity were assessed with, respectively, the I2 statistic with its 95% uncertainty interval, and the Cochran Q statistic. I2 values of 25%, 50%, and 75% indicate low, medium, and high between-study heterogeneity, respectively, and P > .05 for the Cochran Q statistic suggests no statistically significant heterogeneity [36]. Publication bias was assessed graphically using funnel plots and the Egger test quantified the asymmetry of the plot. The latter tests the null hypothesis that small studies give the same results as large studies [37]; P < .05 was deemed statistically significant.

Four subgroup analyses were subsequently carried out to explore possible explanations for heterogeneity. In the first, GWAS were excluded from the meta-analysis; in the second, candidate gene studies were excluded from the meta-analysis; in the third, analyses were restricted to studies of white populations; and in the fourth, studies with deviation from Hardy-Weinberg equilibrium (HWE), a potential indicator of poor genotyping quality or ascertainment issues, were removed. HWE was investigated in controls of case-control studies and in the entire samples of cohort studies using the standard χ2 goodness of fit test as well as the relative excess heterozygosity (REH) [38]. The χ2 goodness-of-fit statistic tests the null hypothesis that the data are consistent with HWE. In contrast, the REH approach quantifies Hardy-Weinberg disequilibrium by measuring the ratio of the actual proportion of heterozygotes to the proportion of heterozygotes expected in a population that conforms to HWE [38]. A significant

deviation of the ratio from unity, as measured by the 2-sided 95% CI, is evidence of Hardy-Weinberg disequilibrium. Among studies of MI, a fifth subgroup analysis was carried out in which analysis was restricted to studies that had used a history of MI as the case definition.

The Venice criteria for assessing the strength of cumulative evidence in genetic association studies were used [39]. Briefly, this semiquantitative index classifies the credibility of cumulative epidemiologic evidence into 3 categories, "weak," "moderate," and "strong," taking into consideration criteria such as amount of evidence (eg, sample size, power, and false-discovery rate), replication (eg, attention to phenotype definition, models, and I2), and protection from bias (eg, population stratification and measurement errors). This review was prepared following the guidance of the HuGE Review Handbook Version 1.0 [40].

Results

The search of MEDLINE, EMBASE, and HuGE Navigator Genopedia returned 150 unique candidate gene articles, of which 12 (9 VTE and 3 MI) were eligible for inclusion (Figure 1). The study populations overlapped in 3 studies of VTE [9,17,41], so de-duplicated counts were obtained from one of the authors and included. The population in one study of MI [19] overlapped with that in the gene-centric meta-analysis [42] included in the systematic review (see next paragraph), so this article was excluded from further analysis. A second candidate gene article [43], as well as a gene-centric study of VTE identified from the reference list of a review paper [44], was excluded because genotype counts were not available separately for each of the 3 genotype groups.

The search of HuGE Navigator GWAS Integrator returned 46 GWAS, of which 18 (1 VTE and 17 MI) were eligible for inclusion (Figure 1). Contact with GWAS authors resulted in the identification of an additional gene-centric (> 2000 genes) study not yet published when our search was executed [42]. This study, as well as the most recent GWAS of MI [31], were meta-analyses that included 21 studies. These 2 studies encompassed almost all of the previously published GWAS results; therefore, no other GWAS of MI beyond these 2 meta-analyses were included in this systematic review. Study-specific genotype counts were obtained from the GWAS meta-analysis [31] for use in the present meta-analysis. In the gene-centric meta-analysis [42], because the per-allele OR was presented but genotype counts were not, the per-allele OR was used in the present meta-analysis. One additional unpublished GWAS of VTE was identified from dbGaP (J.A.H. and M.d.A., unpublished data, March 21, 2011; referred to hereafter as "Heit 2011") for a total of 2 included GWAS of VTE. The published GWAS of VTE [30] also included a candidate gene study that had not been identified in previous searches.

Figure 1. Articles and studies identified from a systematic search of the literature on the PROCR rs867186 polymorphism and venous and arterial thrombosis.

Overall, the systematic review included 25 studies (11 VTE and 14 MI) for a total of 37 415 cases (4821 with VTE and 32 594 with MI) and 84 406 noncases (6070 VTE noncases and 78 336 MI noncases).

Characteristics of studies included in the meta-analysis are presented in supplemental Table 2. All 11 VTE studies used a case-control design, of which 2 [9,41] were family-based (controls were first-degree relatives of cases), one [45] was nested within a cohort study, and 2 (Tregouet et al [30] and Heit 2011) were GWAS designs. Six studies [9,17,30,35,41] included white subjects only, 3 studies (Yamagishi et al [45], Pecheniuk et al [46], and Heit 2011) included white and nonwhite subjects, one study [47] included Chinese subjects, and one study [10] did not report the race/ethnicity of subjects. Cases and controls were matched by age and gender in 6 studies (Uitte de Willige et al [10], Saposnik et al [35], Yamagishi et al [45], Pecheniuk et al [46], Chen et al [47], and Heit 2011). In all but 2 non-matched studies, age

Table 1. Distribution of PROCR rs867186 genotypes among cases and non-cases in studies of VTE and MI

Included study* | AA cases, n (%) | AA non-cases, n (%) | AG cases, n (%) | AG non-cases, n (%) | GG cases, n (%) | GG non-cases, n (%) | MAF† | HWE χ2 P value | REH (95% CI)‡
VTE
Medina 2004 [17] | 291 (82) | 327 (82) | 62 (17) | 74 (18) | 2 (1) | 0 (0) | 0.09 | .04 |
Saposnik 2004 [35] | 249 (74) | 278 (82) | 85 (25) | 58 (17) | 4 (1) | 2 (1) | 0.09 | .58 | 1.23 (0.59-2.58)
Uitte de Willige 2004 [10] | 345 (73) | 361 (77) | 116 (25) | 100 (21) | 10 (2) | 10 (2) | 0.13 | .33 | 0.83 (0.58-1.21)
Medina 2005 [41] | 77 (81) | 145 (80) | 17 (18) | 35 (19) | 1 (1) | 1 (1) | 0.10 | .47 | 1.45 (0.52-4.10)
Navarro 2008 [9] | 58 (69) | 128 (86) | 24 (29) | 21 (14) | 2 (2) | 0 (0) | 0.07 | .35 |
Pecheniuk 2008 [46] | 82 (72) | 87 (76) | 27 (24) | 24 (21) | 5 (4) | 3 (3) | 0.13 | .40 | 0.74 (0.37-1.50)
Trégouët 2009–GWAS [30] | 309 (75) | 1003 (82) | 92 (22) | 216 (18) | 9 (2) | 8 (1) | 0.09 | .32 | 1.21 (0.83-1.75)
Trégouët 2009–MARTHA [30] | 885 (79) | 654 (82) | 222 (20) | 141 (18) | 16 (1) | 5 (1) | 0.09 | .38 | 1.23 (0.77-1.97)
Yamagishi 2009 [45] | 417 (84) | 844 (83) | 72 (15) | 158 (16) | 7 (1) | 14 (1) | 0.09 | .04 | 0.73 (0.53-0.99)
Chen 2011 [47] | 49 (75) | 63 (89) | 15 (23) | 7 (10) | 1 (2) | 1 (1) | 0.06 | .15 | 0.44 (0.13-1.52)
Heit 2011 | 978 (77) | 1029 (79) | 264 (21) | 257 (20) | 28 (2) | 16 (1) | 0.11 | .99 | 1.00 (0.76-1.32)
MI
Ireland 2005–EDSC-EW [18] | 33 (65) | 231 (85) | 17 (33) | 37 (14) | 1 (2) | 4 (1) | 0.08 | .09 | 0.61 (0.34-1.10)
Ireland 2005–EDSC-IA [18] | 63 (55) | 222 (64) | 48 (42) | 108 (31) | 3 (3) | 19 (5) | 0.21 | .23 | 0.83 (0.62-1.12)
Ireland 2005–HIFMECH [18] | 440 (84) | 455 (81) | 83 (16) | 101 (18) | 3 (1) | 7 (1) | 0.10 | .60 | 0.89 (0.59-1.36)
Ireland 2005–NPHSII [18] | 166 (85) | 1883 (82) | 26 (13) | 387 (17) | 4 (2) | 14 (1) | 0.09 | .57 | 1.08 (0.84-1.38)
Medina 2008 [48] | 606 (88) | 554 (79) | 80 (12) | 142 (20) | 3 (0) | 1 (0) | 0.10 | .01 | 3.02 (1.12-8.16)
Schunkert 2011–ADVANCE [31] | 221 (79) | 262 (84) | 55 (20) | 49 (16) | 2 (1) | 1 (0) | 0.08 | .41 | 1.51 (0.55-4.20)
Schunkert 2011–CADomics [31] | 1642 (79) | 2324 (79) | 409 (20) | 589 (20) | 25 (1) | 37 (1) | 0.11 | .96 | 1.00 (0.84-1.20)
Schunkert 2011–deCODE CAD [31] | 5509 (83) | 22 711 (82) | 1078 (16) | 4640 (17) | 51 (1) | 244 (1) | 0.09 | .68 | 0.99 (0.92-1.06)
Schunkert 2011–GerMIFS I [31] | 670 (77) | 1225 (77) | 184 (21) | 344 (22) | 11 (1) | 23 (1) | 0.12 | .84 | 1.02 (0.81-1.29)
Schunkert 2011–GerMIFS II [31] | 981 (80) | 1007 (78) | 227 (19) | 265 (21) | 14 (1) | 15 (1) | 0.11 | .60 | 1.08 (0.81-1.43)
Schunkert 2011–GerMIFS III [31] | 922 (80) | 1359 (78) | 222 (19) | 360 (21) | 13 (1) | 29 (2) | 0.12 | .36 | 0.91 (0.73-1.12)
Schunkert 2011–MedStar [31] | 369 (83) | 721 (82) | 76 (17) | 148 (17) | 2 (0) | 6 (1) | 0.09 | .59 | 1.13 (0.73-1.73)
Schunkert 2011–MIGen [31] | 1035 (81) | 1151 (82) | 224 (18) | 240 (17) | 12 (1) | 14 (1) | 0.10 | .71 | 0.95 (0.71-1.27)
Schunkert 2011–OHGS1 [31] | 1207 (82) | 1141 (81) | 257 (18) | 263 (19) | 4 (0) | 11 (1) | 0.10 | .33 | 1.17 (0.85-1.62)
IBC 50K CAD Consortium (European) [42]§ | | | | | | | 0.10 | |
IBC 50K CAD Consortium (South Asian) [42]§ | | | | | | | 0.19 | |

*Studies are identified by the lead author’s last name, year of publication, study name (if a publication included more than 1 study), and reference number.
†The minor allele (G) frequency (MAF) was calculated in controls or in the underlying cohort.
‡The REH cannot be calculated in studies with 0 cases or controls with the GG genotype.
§Genotype frequencies were not reported in the published article; however, the study authors state that the single-nucleotide polymorphism did not violate HWE in the controls (P > .0001).
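The two HWE checks reported in Table 1 can be illustrated with a minimal Python sketch. It assumes the usual 1-degree-of-freedom χ2 goodness-of-fit test and takes the REH as n_AG / (2 × sqrt(n_AA × n_GG)) with a delta-method confidence interval on the log scale. The function names are hypothetical and the variance formula is an assumption made here; the article only cites its reference for the REH method.

```python
import math


def hwe_chi_square(n_aa, n_ag, n_gg):
    """Return (G-allele frequency, chi-square statistic, P value) for an HWE test."""
    n = n_aa + n_ag + n_gg
    q = (2 * n_gg + n_ag) / (2 * n)  # frequency of the minor (G) allele
    p = 1.0 - q
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ag, n_gg)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return q, chi2, p_value


def relative_excess_heterozygosity(n_aa, n_ag, n_gg, z=1.96):
    """Return (REH, lower, upper), or None when any genotype count is zero."""
    if min(n_aa, n_ag, n_gg) == 0:
        return None  # REH is undefined without all three genotypes (Table 1 footnote)
    reh = n_ag / (2.0 * math.sqrt(n_aa * n_gg))
    # Assumed delta-method standard error of log(REH).
    se_log = math.sqrt(1.0 / n_ag + 1.0 / (4.0 * n_aa) + 1.0 / (4.0 * n_gg))
    return reh, reh * math.exp(-z * se_log), reh * math.exp(z * se_log)


# Non-case genotype counts (AA, AG, GG) for Yamagishi 2009, taken from Table 1.
yamagishi_noncases = (844, 158, 14)
maf, chi2, p_value = hwe_chi_square(*yamagishi_noncases)
reh, lower, upper = relative_excess_heterozygosity(*yamagishi_noncases)
print(f"MAF={maf:.2f} chi2={chi2:.2f} P={p_value:.2f} REH={reh:.2f} ({lower:.2f}-{upper:.2f})")
```

For this row the sketch gives values that round to the entries reported in Table 1 (MAF 0.09, P = .04, REH 0.73, 95% CI 0.53-0.99), which suggests the assumed formulas match those behind the table, although the article itself does not spell them out.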

and gender distributions were similar in cases and controls [30]. Two studies were of first-event VTE patients [10,30], all cases of VTE were objectively diagnosed, and the case definition included deep vein thrombosis and pulmonary embolism, except in 2 studies [10,47] that were limited to deep vein thrombosis only. Cases were recruited from thrombosis clinics in 8 studies, from a venous thrombosis registry in one study [46], and from a community-based cohort in another study [45]. Controls were selected from hospital personnel [17], general medical examination clinics (Tregouet et al [30], Saposnik et al [35], and Heit 2011), partners and acquaintances of cases [10], first-degree relatives of cases [9,41], the community [30,46], the underlying cohort [45], or from a clinical trial of antioxidant supplementation [30]. In one study [47], the source of cases and controls was not stated.

The 14 MI studies included 1 cohort study, 3 case-control studies, 9 GWAS case-control studies, and 1 case-control gene-centric meta-analysis (supplemental Table 2). Twelve studies included white subjects only; the remaining 2 studies presented results for white subjects and South-Asian subjects separately [18,42]. This stratification by race/ethnicity was maintained in the present meta-analysis for a total of 16 MI studies analyzed. Cases and controls were matched by age and gender in 3 studies [18,31,48], whereas the age and/or gender distribution of cases and controls was either not reported or differed considerably in all but 1 [42] of the other studies. One study was of first-event MI patients [18] and the case definition was a history of MI in 6 studies, whereas a composite case definition including coronary intervention procedures, angina, > 50% stenosis of coronary vessels, and MI was used in the other 11 studies. Cases and noncases were population, hospital, or clinic based.

Genotype and minor allele frequency distributions were similar between studies (Table 1), with the exception of the studies of South Asians. Whereas the frequency of the G allele among controls or in the underlying cohort was 13% or less in white populations, the frequency was 19%-21% among South Asian controls. There was evidence of deviation from HWE in 3 studies

Figure 2. Results from a random-effects meta-analysis of the association between the PROCR rs867186 polymorphism and VTE (per-allele model). Individual studies are identified by the lead author’s last name, year of publication, study name (if a publication included more than 1 study), and reference number. The sample sizes are represented by the size of the squares. Bars indicate 95% CI.

(2 of VTE17,45 and one of MI48), with both methods of HWE assessment yielding similar conclusions.

VTE

The meta-analysis supported an association between the PROCR rs867186 variant and VTE (Figure 2 and Table 2), with the odds of VTE increasing by a factor of 1.22 (95% CI, 1.11-1.33) for every additional copy of the G allele (P < .001). A statistically significant association was also observed for the dominant model, the recessive model, and for the GG-versus-AA contrast, whereas the AG-versus-AA contrast approached statistical significance. Between-study heterogeneity was low under the per-allele model (20%, 95% uncertainty interval: 0%-52%), but was moderate for the AG-versus-AA contrast and for the dominant model. Although between-study heterogeneity was low under the recessive model and for the GG-versus-AA contrast, this was likely the result of poor precision due to the small number of GG subjects and should not be interpreted as a lack of true between-study heterogeneity.36 Under the per-allele model, the funnel plot (supplemental Figure 1) and the Egger test (P = .23) revealed no evidence of publication bias. Subgroup analyses yielded results similar to those obtained from analysis of all subjects (Table 3).

Of the individual studies included in the meta-analysis, the largest association between the PROCR rs867186 variant and VTE was observed in a study of prothrombin G20210A mutation carriers9 (per-allele OR = 2.63; 95% CI, 1.41-4.87). Although no other included study was conducted exclusively among these mutation carriers, 7 other studies reported the prevalence of G20210A mutation carriers; the prevalence was less than 25% and no obvious gradient in the estimated ORs according to G20210A mutation prevalence was observed (supplemental Figure 2 and

Table 2. Associations between PROCR rs867186 genotypes and VTE or MI in a random-effects meta-analysis

                                    Included    Total     Total                              95% uncertainty  Cochran Q
Genotype contrast                   studies, n  cases, n  non-cases, n  OR    95% CI     I2  interval         P value
VTE
  Per-allele model                  11          4821      6070          1.22  1.11-1.33  20  0-52             .197
  AG versus AA                      11          4736      6010          1.21  1.05-1.40  43  0-72             .063
  GG versus AA                      11          3825      4979          1.81  1.29-2.56  0   0-46             .694
  AG + GG versus AA (dominant)      11          4821      6070          1.25  1.08-1.44  47  0-74             .041
  GG versus AG + AA (recessive)     11          4821      6070          1.76  1.24-2.48  0   0-42             .739
MI
  Per-allele model                  16          32 594    78 336        0.94  0.88-1.00  65  40-79            < .001
  AG versus AA                      14          16 850    42 919        0.96  0.86-1.06  67  43-81            < .001
  GG versus AA                      14          14 012    35 671        0.87  0.72-1.06  0   0-53             .502
  AG + GG versus AA (dominant)      14          16 998    43 344        0.95  0.86-1.05  65  37-80            < .001
  GG versus AG + AA (recessive)     14          16 998    43 344        0.88  0.72-1.06  0   0-55             .462
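The pooled ORs, 95% CIs, and I2 values in Table 2 come from a random-effects meta-analysis. The sketch below uses the DerSimonian-Laird estimator of between-study variance, which is an assumption: it is a widely used choice, but the text here does not name the estimator. The function name is hypothetical, and the inputs are per-study log ORs with their standard errors.

```python
import math


def random_effects_meta(log_ors, std_errs):
    """Pool per-study log odds ratios with DerSimonian-Laird random effects."""
    w = [1.0 / se ** 2 for se in std_errs]  # inverse-variance (fixed-effect) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_ors)) / sum_w
    # Cochran Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_ors))
    df = len(log_ors) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at 0
    # I2: percentage of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errs]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_ors)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    return {
        "or": math.exp(pooled),
        "ci": (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled)),
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical per-study log ORs and standard errors (not data from the review).
result = random_effects_meta([0.10, 0.25, 0.30, 0.15], [0.08, 0.12, 0.15, 0.10])
print(f"OR = {result['or']:.2f}, 95% CI = {result['ci'][0]:.2f}-{result['ci'][1]:.2f}, "
      f"I2 = {result['i2']:.0f}%")
```

Because I2 is truncated at zero, a value of 0 from a handful of imprecise studies does not by itself show that heterogeneity is absent.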

Table 3. Associations within subgroups between PROCR rs867186 genotypes and VTE or MI under the per-allele model in a random-effects meta-analysis

                                                 Included    Total     Total                              95% uncertainty  Cochran Q
Subgroup                                         studies, n  cases, n  non-cases, n  OR    95% CI     I2  interval         P value
VTE
  Candidate gene studies                         9           2428      3968          1.19  1.07-1.34  15  0-51             .271
  GWAS                                           2           1680      2529          1.28  1.02-1.61  29  0-74             .240
  Studies of white subjects only                 6           2405      3096          1.35  1.14-1.59  12  0-53             .321
  Studies in HWE                                 9           3970      4653          1.28  1.16-1.41  10  0-46             .333
MI
  Candidate gene studies                         5           1576      4165          1.03  0.68-1.56  85  66-93            < .001
  GWAS                                           11          31 018    74 171        0.93  0.89-0.98  36  0-68             .113
  Studies of white subjects only                 14          28 086    73 728        0.93  0.86-1.00  66  40-81            < .001
  Studies in HWE                                 15          31 905    77 639        0.95  0.90-1.01  54  16-74            .007
  Studies of subjects with a history of MI only  6           5730      7292          0.87  0.75-1.00  64  14-85            .016

supplemental Table 3). One included study was conducted exclusively among F5L mutation carriers.41 Results from this study were null (per-allele OR = 0.98; 95% CI, 0.54-1.77). A large positive association (per-allele OR = 1.49; 95% CI, 1.17-1.90) was observed in the study that included only idiopathic VTE subjects (ie, VTE in the absence of acquired risk factors: surgery, hospitalization, pregnancy, puerperium, oral contraception, cancer, and autoimmune disease; and in the absence of strong known genetic risk factors: antithrombin, protein S, or PC deficiencies, and homozygosity for F5L or prothrombin G20210A).30 In the 7 other studies in which the etiology of VTE was reported, the proportion of idiopathic VTE subjects ranged from 25%-65% and there was no obvious trend in the estimated ORs according to the frequency of idiopathic subjects.

MI

The meta-analysis did not support an association between the PROCR rs867186 variant and MI (Figure 3 and Table 2). None of the investigated genotype contrasts reached statistical significance. Heterogeneity between MI studies under all genetic models was moderate to high (again, the apparent low heterogeneity under the recessive model and for the GG-versus-AA contrast is likely the result of wide CIs in analyses of GG subjects). No publication bias under the per-allele model was observed (supplemental Figure 1), with an Egger test P = .25. When the analysis was restricted to GWAS, under the per-allele model, the I2 was reduced to 36% (95% uncertainty interval, 0%-68%) and the OR for MI was 0.93 (95% CI, 0.89-0.98) for every additional copy of the G allele (P = .005; Table 3). In all other subgroup analyses, heterogeneity remained

Figure 3. Results from a random-effects meta-analysis of the association between the PROCR rs867186 polymorphism and MI (per-allele model). Individual studies are identified by the lead author’s last name, year of publication, study name (if a publication included more than 1 study), and reference number. The sample sizes are represented by the size of the squares. Bars indicate 95% CI. Allele frequencies were not reported in the IBC 50K CAD Consortium article.

moderate to high and the effect estimate was similar to that obtained from analysis of all subjects.

Of the 16 included studies of MI, a statistically significant association between increasing copies of the PROCR rs867186 G allele and MI was found in 3; in 1 of these, the association was positive, and in the other 2, there was an inverse association. The statistically significant positive association was observed in the study18 of white diabetics (per-allele OR = 2.57; 95% CI, 1.41-4.69). Among South-Asian diabetics, the association was also positive, although not statistically significant (per-allele OR = 1.25; 95% CI, 0.85-1.83). The largest inverse association was found in a Spanish study48 of subjects with a history of MI (per-allele OR = 0.55; 95% CI, 0.41-0.74). However, HWE was violated in this study and the results may not be comparable to other included studies. The third statistically significant result was observed in the gene-centric meta-analysis of European subjects42 comprising 10 studies (per-allele OR = 0.85; 95% CI, 0.80-0.90).

Although initially intended, results of VTE and MI studies were not pooled together in a combined analysis because of differences in the direction of the association with the PROCR rs867186 G allele in VTE and MI studies.

Discussion

This meta-analysis of 4821 VTE cases and 6070 controls found a significant association between the PROCR rs867186 variant and VTE. Under an additive genetic model, the odds of VTE increased by 22% for every additional copy of the G allele. In contrast, the meta-analysis found no association between the PROCR rs867186 variant and MI. With nearly 17 000 MI cases and more than 43 000 noncases, and based on our current knowledge of trait and variant characteristics (ie, disease prevalence of approximately 4%,49 population G allele frequency of approximately 10%,24 and an additive disease model), the meta-analysis had > 99% power to detect a per-allele effect as small as 1.1 at P = .05 (supplemental Table 4),50 if such an association existed. However, substantial between-study heterogeneity and other population-specific characteristics may have reduced this theoretical power. Given the present results, the hypothesis of some common pathways underlying venous and arterial thrombosis may not apply to the PROCR gene and the variant may be a genetic risk factor for VTE only.

VTE

The association with VTE is in accordance with the proposed functional implications of the PROCR rs867186 variant, namely, that the G allele causes EPCR shedding from the endothelial membrane, reduced PC activation, and higher FVII/FVIIa levels, eventually leading to thrombosis. Of the 8 studies included in the meta-analysis that examined levels of sEPCR relative to the PROCR rs867186 variant, all observed a significant increase associated with the G allele.9,10,17,18,35,45,47,48 Four studies reported levels of PC or APC by genotype. Of these, two19,45 observed a statistically significant increase in PC associated with the G allele, one9 observed a statistically significant decrease in APC among G allele carriers, and one17 found no association between genotype and APC levels. In a GWAS of plasma levels of PC including more than 8000 participants, the strongest association was observed at the PROCR locus, and the rs867186 polymorphism explained an estimated 10.4% of the variance in PC.51 In 2 studies included in the systematic review that measured prothrombin (fragment 1 and 2) levels relative to genotype, a significant increase was seen for those with the G allele.17,18 The G allele was also associated with elevated levels of FVII and FVIIa in a study of approximately 2000 healthy middle-aged men.6

VTE is multicausal, in that multiple genetic and environmental factors contribute to its etiology.52 Two genetic risk factors for VTE with relatively high population prevalence are well established: the F5L and the prothrombin G20210A mutations. The largest association between the PROCR rs867186 variant and VTE identified in our review was observed in a study of prothrombin G20210A mutation carriers.9 However, further investigation of the modifying role of this mutation was hampered by the low prevalence (or incomplete reporting of the prevalence) of prothrombin G20210A mutation carriers in the other included studies. No association between the PROCR rs867186 variant and VTE was observed in 2 studies of F5L mutation carriers (only 1 of which was included in the review).41,43 In a study (not included in the review) of 2 families with inherited thrombophilia, a stronger association with VTE was observed in those with both the PROCR G allele and a dysfunctional PC gene variant relative to those with a dysfunctional PC gene variant alone.11 Although limited by the small number of studies, these findings suggest that the PROCR rs867186 polymorphism may act in concert with some known genetic risk factors, but not others, to increase VTE risk.

All but 1 study included in the meta-analysis30 used a pooled definition of idiopathic and non-idiopathic VTE cases; only 1 study restricted the case definition to idiopathic VTE patients. In pooling idiopathic and non-idiopathic VTE cases, heterogeneity in the definition of the outcome may have weakened or even masked the effect of the PROCR rs867186 variant on VTE. Providing some evidence for this is the fact that the study restricted to idiopathic VTE patients reported a large positive association with the G allele of PROCR rs867186.

MI

Between-study heterogeneity in MI studies was moderate to high, as indicated by both the I2 and Cochran Q statistics, and is likely attributable to differences in study design, participant selection, and participant characteristics. The meta-analysis pooled results of cohort, case-control, and GWAS designs; in 10 studies, the case definition included MI grouped with other coronary heart diseases, and the severity of MI (fatal versus nonfatal) varied or was unclear across studies. Two studies were restricted to men, some studies included younger subjects (< 60 years of age), whereas the age and gender of subjects was not reported in the study of diabetics. The I2 was less among GWAS, but even among the 9 GWAS in which information on subject characteristics was available, the frequency of cases with a history of MI ranged from 48%-100%, the age and gender distribution of subjects varied considerably, and 2 studies were restricted to cases with a family history of coronary artery disease.

Although no overall association with MI was found, when analysis was restricted to GWAS under the per-allele model, a statistically significant inverse association with the PROCR rs867186 G allele emerged. This finding may be because of reduced heterogeneity in GWAS as a result of harmonization of the case definition, quality control measures, and analysis methods that occurs in consortiums, of which all of the GWAS were part. Our GWAS findings are consistent with a recently published (ie, after our search was updated for the last time) gene-centric case-control study of Italian early-onset MI cases53; however, they are nonetheless difficult to explain. The biologic mechanism underlying an inverse association between rs867186 and MI is unclear. It may be that the PROCR rs867186 G allele is in linkage disequilibrium with the true causal variant.

Strengths and limitations

Strengths of the systematic review include the thorough search for articles, the investigation of both venous and arterial thrombotic disease, the large number of included subjects, and the use of a genetic model that is based on biologic evidence. The limitations of the meta-analysis include the possible influences of selection bias and genotype and phenotype misclassification on the results of individual studies and on the results of the meta-analysis, which are difficult to predict. Bias could also have arisen when calculating the crude ORs from genotype frequencies, because this could not be taken into consideration in the analysis when cases and controls were matched. Nonetheless, unmatched effect estimates were almost identical to those presented in the published article, either because matched and unmatched analyses produced similar results or because the investigators had not performed a matched analysis. When estimates adjusted for potential confounders were presented in the published article, these too approximated the crude estimates calculated from genotype frequencies. A major source of confounding in genetic association studies, population stratification, was assessed in the meta-analysis by restricting analysis to white subjects only. Although confounding was not detected, the “white” designation may comprise individuals of various races/ethnicities, and hidden population stratification cannot be completely ruled out.

Whereas selective reporting must be considered in all meta-analyses, because of ready sample availability and low genotyping cost, it is especially a concern in studies of gene-disease associations, in which authors may test many polymorphisms but only report the most interesting or statistically significant findings. Even the means by which publication bias is measured, funnel plots and statistical tests of funnel plot asymmetry, have limited power.40 Although no statistically significant publication bias was detected, it is nonetheless possible that studies with null findings were not published and are missing from the systematic review. In an attempt to overcome this limitation, we included in our search both published and unpublished GWAS data, and authors of all eligible studies were contacted to request genotype counts.

Conclusions

Applying the Venice criteria,39 the strength of cumulative evidence for an association between the PROCR rs867186 variant and VTE is “moderate.” The meta-analysis had > 80% power to detect a per-allele relative risk of 1.15 (supplemental Table 4), there existed some between-study heterogeneity, and the effect of potential biases could not be completely assessed. This corresponds to a score of “B” for amount of evidence, replication, and protection from bias. Using the same criteria, the strength of evidence for an association with MI is “weak.” Despite a considerable sample size indicating large-scale evidence (an “A” rating), there existed substantial between-study inconsistency in MI phenotype definition, study populations, and study findings, and a null overall effect was found (a “C” rating for replication).

In light of the ever-increasing number of gene-disease association studies, especially in the post-GWAS era, HuGE reviews are a useful tool to summarize these data. Overall, there is moderate evidence for an association between the PROCR rs867186 polymorphism and VTE, whereas there is weak evidence for an association with MI. This suggests that a common mechanism underlying venous and arterial thrombosis is not mediated by the PROCR rs867186 polymorphism.

Acknowledgments

The authors thank Dr Inke König and Dr Jeanette Erdmann for providing genotype counts for the PROCR rs867186 variant from the studies included in the CARDIoGRAM consortium; Dr Martin Farrall for correspondence regarding the CARDIoGRAMplusC4D data; and Dr Francisco España for clarifying the number of non-overlapping cases to be used from each of his 3 studies included in the meta-analysis.

This project was partially funded by the Canadian Institutes of Health Research (grant MOP 86466) and by the Heart and Stroke Foundation of Canada (grant T6484). J.D. holds a Vanier Canada Graduate Scholarship and F.G. holds a Canada Research Chair. The Mayo VTE GWAS was funded by the National Institutes of Health (grant HG 04735).

Authorship

Contribution: F.G., P.E.M., and D.-A.T. conceived the research; J.D., F.G., and C.Y.J. designed the study; J.D. led the implementation, analysis, and interpretation of the study; A.S.A. screened articles for eligibility and extracted the data; M.d.A. and J.A.H. contributed to data acquisition; J.D. drafted the manuscript; and all authors critically reviewed the manuscript for intellectual content.

Conflict-of-interest disclosure: The authors declare no competing financial interests.

Correspondence: France Gagnon, Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, 155 College Street, Suite 662, Toronto, ON, Canada M5T 3M7; e-mail: france.[email protected].

References

1. Geiger M. Soluble protein C receptor: why? Blood. 2008;111(7):3301-3302.
2. Gandrille S. Endothelial cell protein C receptor and the risk of venous thrombosis. Haematologica. 2008;93(6):812-816.
3. Stearns-Kurosawa DJ, Kurosawa S, Mollica JS, Ferrell GL, Esmon CT. The endothelial cell protein C receptor augments protein C activation by the thrombin-thrombomodulin complex. Proc Natl Acad Sci U S A. 1996;93(19):10212-10216.
4. Laszik Z, Mitro A, Taylor FB, Jr Ferrell G, Esmon CT. Human protein C receptor is present primarily on endothelium of large blood vessels: implications for the control of the protein C pathway. Circulation. 1997;96(10):3633-3640.
5. Mosnier LO, Zlokovic BV, Griffin JH. The cytoprotective protein C pathway. Blood. 2007;109(8):3161-3172.
6. Ireland HA, Cooper JA, Drenos F, et al. FVII, FVIIa, and downstream markers of extrinsic pathway activation differ by EPCR Ser219Gly variant in healthy men. Arterioscler Thromb Vasc Biol. 2009;29(11):1968-1974.
7. Ghosh S, Pendurthi UR, Steinoe A, Esmon CT, Mohan Rao LV. Endothelial cell protein C receptor acts as a cellular receptor for factor VIIa on endothelium. J Biol Chem. 2007;282(16):11849-11857.
8. Liaw PC, Neuenschwander PF, Smirnov MD, Esmon CT. Mechanisms by which soluble endothelial cell protein C receptor modulates protein C and activated protein C function. J Biol Chem. 2000;275(8):5447-5452.
9. Navarro S, Medina P, Mira Y, et al. Haplotypes of the EPCR gene, prothrombin levels, and the risk of venous thrombosis in carriers of the prothrombin G20210A mutation. Haematologica. 2008;93(6):885-891.
10. Uitte de Willige S, Van Marion V, Rosendaal FR, Vos HL, De Visser MCH, Bertina RM. Haplotypes of the EPCR gene, plasma sEPCR levels and the risk of deep venous thrombosis. J Thromb Haemost. 2004;2(8):1305-1310.
11. Simioni P, Morboeuf O, Tognin G, et al. Soluble endothelial protein C receptor (sEPCR) levels and venous thromboembolism in carriers of two dysfunctional protein C variants. Thromb Res. 2006;117(5):523-528.
12. Simmonds RE, Lane DA. Structural and functional implications of the intron/exon organization of the human endothelial cell protein C/activated protein C receptor (EPCR) gene: Comparison with the structure of CD1/major histocompatibility complex alpha1 and alpha2 domains. Blood. 1999;94(2):632-641.
13. Gu JM, Crawley JT, Ferrell G, et al. Disruption of the endothelial cell protein C receptor gene in mice causes placental thrombosis and early embryonic lethality. J Biol Chem. 2002;277(45):43335-43343.
14. Biguzzi E, Merati G, Liaw PCY, et al. A 23bp insertion in the endothelial protein C receptor (EPCR) gene impairs EPCR function. Thromb Haemost. 2001;86(4):945-948.
15. Biguzzi E, Gu J-, Merati G, Esmon NL, Esmon CT. Point mutations in the endothelial protein C receptor (EPCR) promoter. Thromb Haemost. 2002;87(6):1085-1086.
16. Medina P, Navarro S, Estelles A, Espana F. Polymorphisms in the endothelial protein C receptor gene and thrombophilia. Thromb Haemost. 2007;98(3):564-569.
17. Medina P, Navarro S, Estelles A, et al. Contribution of polymorphisms in the endothelial protein C receptor gene to soluble endothelial protein C receptor and circulating activated protein C levels, and thrombotic risk. Thromb Haemost. 2004;91(5):905-911.
18. Ireland H, Konstantoulas CJ, Cooper JA, et al. EPCR Ser219Gly: elevated sEPCR, prothrombin F1+2, risk for coronary heart disease, and increased sEPCR shedding in vitro. Atherosclerosis. 2005;183(2):283-292.
19. Reiner AP, Carty CL, Jenny NS, et al. PROC, PROCR and PROS1 polymorphisms, plasma anticoagulant phenotypes, and risk of cardiovascular disease and mortality in older adults: The Cardiovascular Health Study. J Thromb Haemost. 2008;6(10):1625-1632.
20. Qu D, Wang Y, Song Y, Esmon NL, Esmon CT. The Ser219->Gly dimorphism of the endothelial protein C receptor contributes to the higher soluble protein levels observed in individuals with the A3 haplotype. J Thromb Haemost. 2006;4(1):229-235.
21. Saposnik B, Lesteven E, Lokajczyk A, Esmon CT, Aiach M, Gandrille S. Alternative mRNA is favored by the A3 haplotype of the EPCR gene PROCR and generates a novel soluble form of EPCR in plasma. Blood. 2008;111(7):3442-3451.
22. National Center for Biotechnology Information. dbSNP Short Genetic Variations Cluster Report: rs867186. Available from: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs867186. Accessed May 28, 2011.
23. Rajeevan H, Cheung KH, Gadagkar R, et al. ALFRED: an allele frequency database for microevolutionary studies. Evol Bioinform Online. 2005;1:1-10.
24. Smith NL, Chen MH, Dehghan A, et al. Novel associations of multiple genetic loci with plasma levels of factor VII, factor VIII, and von Willebrand factor: The CHARGE (Cohorts for Heart and Aging Research in Genome Epidemiology) Consortium. Circulation. 2010;121(12):1382-1392.
25. Reitsma PH, Rosendaal FR. Past and future of genetic research in thrombosis. J Thromb Haemost. 2007;5(suppl 1):264-269.
26. Lowe GD. Common risk factors for both arterial and venous thrombosis. Br J Haematol. 2008;140(5):488-495.
27. Prandoni P. Venous thromboembolism and atherosclerosis: is there a link? J Thromb Haemost. 2007;5(suppl 1):270-275.
28. Ye Z, Liu EH, Higgins JP, et al. Seven haemostatic gene polymorphisms in coronary disease: meta-analysis of 66,155 cases and 91,307 controls. Lancet. 2006;367(9511):651-658.
29. Kim RJ, Becker RC. Association between factor V Leiden, prothrombin G20210A, and methylenetetrahydrofolate reductase C677T mutations and events of the arterial circulatory system: a meta-analysis of published studies. Am Heart J. 2003;146(6):948-957.
30. Trégouët DA, Heath S, Saut N, et al. Common susceptibility alleles are unlikely to contribute as strongly as the FV and ABO loci to VTE risk: results from a GWAS approach. Blood. 2009;113(21):5298-5303.
31. Schunkert H, Konig IR, Kathiresan S, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat Genet. 2011;43(4):333-338.
32. Ageno W, Becattini C, Brighton T, Selby R, Kamphuisen PW. Cardiovascular risk factors and venous thromboembolism: a meta-analysis. Circulation. 2008;117(1):93-102.
33. Undas A, Szułdrzyński K, Brummel-Ziedins KE, Tracz W, Zmudka K, Mann KG. Systemic blood coagulation activation in acute coronary syndromes. Blood. 2009;113(9):2070-2078.
34. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7(10):781-791.
35. Saposnik B, Reny J-, Gaussem P, Emmerich J, Aiach M, Gandrille S. A haplotype of the EPCR gene is associated with increased plasma levels of sEPCR and is a candidate risk factor for thrombosis. Blood. 2004;103(4):1311-1318.
36. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR, eds. Introduction to Meta-Analysis. Chichester, United Kingdom: John Wiley & Sons Ltd; 2009.
37. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629-634.
38. Ziegler A, Van Steen K, Wellek S. Investigating Hardy-Weinberg equilibrium in case-control or cohort studies or meta-analysis. Breast Cancer Res Treat. 2011;128(1):197-201.
39. Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol. 2008;37(1):120-132.
40. Little J, Higgins JPT. The HuGENet HuGE Review Handbook. Version 1.0. Available from: http://www.hugenet.org.uk/resources/handbook.php. Accessed February 28, 2006.
41. Medina P, Navarro S, Estelles A, Vaya A, Bertina RM, Espana F. Influence of the 4600A/G and 4678G/C polymorphisms in the endothelial protein C receptor (EPCR) gene on the risk of venous thromboembolism in carriers of factors V Leiden. Thromb Haemost. 2005;94(2):389-394.
42. Butterworth AS, Braund PS, Farrall M, et al; IBC 50K CAD Consortium. Large-scale gene-centric analysis identifies novel variants for coronary artery disease. PLoS Genet. 2011;7(9):e1002260.
43. Galanaud JP, Cochery-Nouvellon E, Alonso S, et al. Paternal endothelial protein C receptor 219Gly variant as a mild and limited risk factor for deep vein thrombosis during pregnancy. J Thromb Haemost. 2010;8(4):707-713.
44. Bezemer ID, Bare LA, Doggen CJ, et al. Gene variants associated with deep vein thrombosis. JAMA. 2008;299(11):1306-14.
45. Yamagishi K, Cushman M, Heckbert SR, Tsai MY, Folsom AR. Lack of association of soluble endothelial protein C receptor and PROCR 6936A/G polymorphism with the risk of venous thromboembolism in a prospective study. Br J Haematol. 2009;145(2):221-226.
46. Pecheniuk NM, Elias DJ, Xu X, Griffin JH. Failure to validate association of gene polymorphisms in EPCR, PAR-1, FSAP and protein S Tokushima with venous thromboembolism among Californians of European ancestry. Thromb Haemost. 2008;99(2):453-455.
47. Chen XD, Tian L, Li M, Jin W, Zhang HK, Zheng CF. Relationship between endothelial cell protein c receptor gene 6936a/g polymorphisms and deep venous thrombosis. Chin Med J (Engl). 2011;124(1):72-75.
48. Medina P, Navarro S, Corral J, et al. Endothelial protein C receptor polymorphisms and risk of myocardial infarction. Haematologica. 2008;93(9):1358-1363.
49. National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention. Featured Data & Statistics: Prevalence of Heart Disease–United States (2005). Available from: http://www.cdc.gov/DataStatistics/archive/heartdisease.html. Accessed April 18, 2011.
50. Purcell S, Cherny SS, Sham PC. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics. 2003;19(1):149-150.
51. Tang W, Basu S, Kong X, et al. Genome-wide association study identifies novel loci for plasma levels of protein C: the ARIC study. Blood. 2010;116(23):5032-5036.
52. Rosendaal FR. Venous thrombosis: the role of genes, environment, and behavior. Hematology Am Soc Hematol Educ Program. 2005;2005:1-12.
53. Guella I, Duga S, Ardissino D, et al. Common variants in the haemostatic gene pathway contribute to risk of early-onset myocardial infarction in the Italian population. Thromb Haemost. 2011;106(4):655-664.

Genetic Epidemiology 00 : 1–12 (2012)

Robust and Powerful Tests for Rare Variants Using Fisher’s Method to Combine Evidence of Association From Two or More Complementary Tests

Andriy Derkach,1 Jerry F. Lawless,2,3 and Lei Sun1,3∗
1Department of Statistics, University of Toronto, Toronto, Ontario, Canada
2Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
3Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada

Many association tests have been proposed for rare variants, but the choice of a powerful test is uncertain when there is limited information on the underlying genetic model. Proposed methods use either linear statistics, which are powerful when most variants are causal and have the same direction of effect, or quadratic statistics, which are more powerful in other scenarios. To achieve robustness, it is natural to combine the evidence of association from two or more complementary tests. To this end, we consider the minimum-p and Fisher’s methods of combining P-values from linear and quadratic statistics. Extensive simulation studies show that both methods are robust across models with varying proportions of causal, deleterious, and protective rare variants, allele frequencies, and effect sizes. When the majority (>75%) of the causal effects are in the same direction (deleterious or protective), Fisher’s method consistently outperforms the minimum-p and the individual linear and quadratic tests, as well as the optimal sequence kernel association test, SKAT-O. When the individual test has moderate power, Fisher’s test has improved power for 90% of the ∼5000 models considered, with >20% relative efficiency gain for 40% of the models. The maximum absolute power loss is 8% for the remaining 10% of the models. An application to the GAW17 quantitative trait Q2 data based on sequence data of the 1000 Genomes Project shows that, compared with linear and quadratic tests, Fisher’s test has comparable power for all 13 functional genes and provides the best power for more than half of them. Genet. Epidemiol. 00:1–12, 2012. © 2012 Wiley Periodicals, Inc.

Key words: robust methods; Fisher’s method; rare variants; complex traits; next-generation sequencing; 1000 genome project
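The two combination rules named in the abstract can be sketched briefly. The code below assumes the P-values being combined are independent under the null hypothesis; that assumption is taken as given for the sketch and is not established by the abstract. Under it, Fisher's statistic -2 Σ log p follows a chi-square distribution with 2k degrees of freedom, whose tail probability has a closed form, and the minimum-p rule is adjusted as 1 - (1 - min p)^k. The function names and example P-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combine k independent P-values into one P-value."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # half of -2 * sum(log p)
    # Chi-square survival function with 2k degrees of freedom (closed form):
    # exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


def min_p_combined(p_values):
    """Minimum-p method: adjusted P-value for the smallest of k independent P-values."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


# Hypothetical P-values from a linear and a quadratic rare-variant test.
p_linear, p_quadratic = 0.04, 0.20
print(f"Fisher: {fisher_combined_p([p_linear, p_quadratic]):.4f}")
print(f"min-p:  {min_p_combined([p_linear, p_quadratic]):.4f}")
```

Fisher's rule rewards moderate evidence from both tests, whereas the minimum-p rule depends only on the stronger of the two, which is one way to read the robustness comparison in the abstract.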

Supporting Information is available in the online issue at wileyonlinelibrary.com. ∗ Correspondence to: Lei Sun, Dalla Lana School of Public Health, 155 College Street, University of Toronto, Toronto, ON M5T 3M7, Canada. E-mail: [email protected] Received 5 June 2012; Revised 23 August 2012; Accepted 7 September 2012 Published online in Wiley Online Library (wileyonlinelibrary.com/journal/gepi). DOI: 10.1002/gepi.21689

INTRODUCTION

Rare variants play an important role in studies of complex human diseases and traits, and next-generation sequencing technology provides rich data for analysis [Cirulli and Goldstein, 2010]. This recent focus on rare variants has produced numerous genotype-phenotype association testing strategies based on aggregating information across multiple SNPs. They include, among others, proposals by Morgenthaler and Thilly [2007], Li and Leal [2008], Madsen and Browning [2009], Bansal et al. [2010], Han and Pan [2010], Hoffmann et al. [2010], Morris and Zeggini [2010], Price et al. [2010], Yi and Zhi [2011], Neale et al. [2011], Wu et al. [2011], Lin and Tang [2011], and Lee et al. [2012a]. Basu and Pan [2011] and Derkach et al. [2012] review the many test statistics that have been proposed. Their work shows that the tests can be considered within a unified framework, with methods divided into two classes: tests based on linear composite statistics, which are powerful against very specific alternative hypotheses [e.g., Li and Leal, 2008; Madsen and Browning, 2009; Morgenthaler and Thilly, 2007; Morris and Zeggini, 2010; Price et al., 2010] and tests based on quadratic statistics, which are designed to have reasonable power across a wide range of alternatives [e.g., Lee et al., 2012a; Neale et al., 2011; Wu et al., 2011]. Several papers have also considered using adaptive weighting of the rare variants under study, based on the observed phenotype and genotype data. For linear statistics [e.g., Han and Pan, 2010; Hoffmann et al., 2010; Lin and Tang, 2011; Yi and Zhi, 2011], it can be shown analytically that these adaptive methods are operationally similar to using quadratic statistics, unless (correct) prior information on SNP effect is available [Derkach et al., 2012]. We note as well that score statistics obtained from random effect regression models lead to quadratic statistics [Basu and Pan, 2011; Goeman et al., 2006].

Tests within the linear class or the quadratic class perform rather similarly, but there are substantial differences in power between linear and quadratic tests [e.g., Basu and Pan, 2011; Derkach et al., 2012; Han and Pan, 2010; Lin and Tang, 2011]. More specifically, linear tests can outperform quadratic tests if all or almost all SNPs under consideration are causal variants and their effects are in the same

C 2012 Wiley Periodicals, Inc. 2 Derkach et al.

direction. However, tests based on linear statistics can perform poorly when there are both protective and deleterious SNPs, and more generally, when a substantial portion of the SNPs is neutral. It is becoming evident that “the power of recently proposed statistical methods depend strongly on the underlying hypotheses concerning the relationship of phenotypes with each of these three factors [proportions of causal variants, and direction of the associations (deleterious, protective, or both)]. No method demonstrates consistently acceptable power despite this large sample size, and the performance of each method depends upon the underlying assumption of the relationship between rare variants and complex traits,” as concluded by Ladouceur et al. [2012]. Robustness, therefore, is critical and consequential when our knowledge about the genetic architecture of rare variants is still incomplete [Cirulli and Goldstein, 2010].

Basu and Pan [2011] recommended that both linear and quadratic statistics be used in settings where prior information is limited. In this report, we propose hybrid association test statistics that borrow strength from each class of tests by combining them via Fisher’s method or the minimum-p approach. We show that both hybrid statistics are robust across genetic models with respect to power, and in some situations Fisher’s statistic can outperform linear and quadratic statistics with a relative efficiency gain of more than 100%. The advantages of the proposed methods are demonstrated through extensive simulation studies of over 10,000 different models with varying proportions of causal, deleterious, and protective rare variants; variant frequencies; effect sizes; and the relationships between variant frequencies and effect sizes, for studies of both binary and quantitative traits, as well as an application to the Genetic Analysis Workshop 17 (GAW17) simulated quantitative trait Q2 data based on the “mini-exome” sequence data that were provided by the 1000 Genomes Project [Almasy et al., 2011; 1000 Genomes Project Consortium, 2010]. We also compare the two tests with SKAT-O [Lee et al., 2012a], the optimal sequence kernel association test that uses the minimal P-value of a family of tests that are based on weighted averages of a linear statistic and the original SKAT quadratic statistic [Wu et al., 2011], and we show that the proposed hybrid statistics have better power than SKAT-O.

MATERIALS AND METHODS

METHODS

To formulate the testing problem, we assume that a group of $J$ SNPs labeled $j = 1, \ldots, J$ and a (quantitative or binary) phenotype $Y$ for $n$ subjects are under consideration. Let $Y_i$ be the phenotype value, and $X_{ij}$ be the genotype value of the $i$th subject representing the number of copies of the rare allele for the $j$th SNP. In practice, $X_{ij} = 0$ or 1 because of the low frequency of the rare allele. Most statistics [e.g., Lin and Tang, 2011; Neale et al., 2011; Wu et al., 2011] for testing association between $Y_i$ and $X_{ij}$ in the absence of other factors can be written in terms of

$$S_j = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y}) X_{ij}}{\sqrt{\left(\sum_{i=1}^{n} X_{ij}\right)\left(1 - \sum_{i=1}^{n} X_{ij}/n\right)}} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y}) X_{ij}}{\sqrt{m_j (1 - m_j/n)}}, \quad j = 1, \ldots, J, \qquad (1)$$

where $m_j = \sum_{i=1}^{n} X_{ij}$ is the total number of copies of the rare allele of SNP $j$, approximately equal to the number of subjects carrying the rare allele of SNP $j$. The statistic $S = (S_1, \ldots, S_J)'$ is also the scaled score statistic from linear or logistic regression models relating $Y_i$ and $X_i = (X_{i1}, \ldots, X_{iJ})'$ [e.g., Basu and Pan, 2011; Lin and Tang, 2011; Wu et al., 2011], and it has an expectation of 0 under the null hypothesis of no association, $H_0$: $Y_i$ is independent of $X_i$.

Linear statistics that have been proposed take the general form

$$W_L = \sum_{j=1}^{J} w_j S_j = w'S, \qquad (2)$$

where $w_j$ is a prespecified weight for SNP $j$ and $w = (w_1, \ldots, w_J)'$. Quadratic statistics take the form

$$W_Q = S'AS, \qquad (3)$$

where $A$ is a positive definite (or semidefinite) symmetric matrix [Basu and Pan, 2011; Derkach et al., 2012; Lin and Tang, 2011; Wu et al., 2011].

Let $p_L$ be the two-sided P-value obtained from $W_L$ to allow for either positive or negative association statistic under the alternative hypothesis. Let $p_Q$ be the P-value obtained from $W_Q$ as $\mathrm{Prob}(W_Q \ge \text{observed value})$ under $H_0$, because for quadratic statistics only large values of $W_Q$ provide evidence against $H_0$. To combine information from the linear and quadratic statistics, we propose to use Fisher’s method to combine P-values from $W_L$ and $W_Q$, using

$$W_F = -2\log(p_L) - 2\log(p_Q) \qquad (4)$$

as the association test statistic. Large values of $W_F$ correspond to small values of $p_L$ and/or $p_Q$ and indicate evidence against the null hypothesis of no association. If $p_L$ and $p_Q$ are independent under $H_0$, then $W_F$ is distributed as $\chi^2_4$. However, $W_L$ and $W_Q$ (thus $p_L$ and $p_Q$) are not independent except asymptotically when $J \to \infty$. Simulations described below and Supporting Information Tables S1 and S2 show that the $\chi^2_4$ approximation is inadequate for realistic settings. Thus we assess statistical significance in finite samples using a novel permutation distribution approach described in the Appendix.

Another way to combine evidence from two or more tests is through the minimum-p approach. Here we consider

$$W_M = \min(p_L, p_Q). \qquad (5)$$

The minimum-p principle has been proposed by many authors [e.g., Lee et al., 2012a; Lin and Tang, 2011], each considering a different set of tests. For example, the recent SKAT-O statistic [Lee et al., 2012a] is the minimal P-value of a family of tests that are based on weighted averages of a linear statistic and the quadratic SKAT statistic, with weights ranging from 0 (using the quadratic statistic only) to 1 (using the linear statistic only). EREC [Lin and Tang, 2011] can be considered as a special case of SKAT-O. We focus on the simple minimum-p statistic of (5) and the Fisher’s statistic, but we compare them with SKAT-O in the Discussion section below and in the Supporting Information.

For symmetric case-control studies or studies of normally distributed traits, the statistic $W_M$ is asymptotically

Genet. Epidemiol. Robust Tests for Rare Variants 3

distributed as $J \to \infty$ under $H_0$ as $\min\{U_1, U_2\}$, where $U_1$ and $U_2$ are two independent $\mathrm{Unif}(0, 1)$ variables. As with $W_F$, asymptotic approximations may not provide satisfactory P-values in many practical situations, and once again we rely on permutation-based P-values.

The general problem of combining information from test statistics has been studied by several authors [e.g., Loughin, 2004; Owen, 2009; Stouffer, 1949]. They show that no single approach is best (most powerful) under all circumstances. Owen [2009] gave a careful comparison of (4) and (5), and found that if one of the original statistics has low power and the other high power, $W_M$ is a better hybrid statistic. On the other hand, if both tests have reasonable power, Fisher’s statistic $W_F$ is a better choice.

SIMULATION MODELS

We conducted extensive simulation studies to examine the finite sample performance of linear, quadratic, minimum-p, and Fisher’s test statistics. We considered association studies for both quantitative and binary traits. Here, we focus on quantitative traits for which simulations can be conducted efficiently to study a large number (over 10,000) of genetic models. Results from case-control studies are provided as Supporting Information and are discussed below in the final Discussion section.

To provide numerical comparisons of power, we consider for simplicity the case where $X_j$ indicates the presence (1) or absence (0) of the minor allele, and let

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_J X_{iJ} + e_i, \quad \text{for } i = 1, \ldots, n, \qquad (6)$$

with $e_i \sim N(0, \sigma^2)$, so the hypothesis of no association $H_0$ becomes $\beta = (\beta_1, \ldots, \beta_J)' = 0$. We first assume that the $X_{ij}$s are mutually independent Bernoulli variables with $P(X_{ij} = 1) = p_j$, approximately twice the minor allele frequency (MAF) of SNP $j$, for $j = 1, \ldots, J$. We later consider $X_{ij}$ obtained from the sequence data of the 1000 Genomes Project [1000 Genomes Project Consortium, 2010], which are not mutually independent.

The specific linear and quadratic statistics considered here are $W_L$ in (2) with $w = 1$,

$$W_L = 1'S = \sum_{j=1}^{J} S_j, \qquad (7)$$

and $W_Q$ in (3) with $A = I$,

$$W_Q = S'S = \sum_{j=1}^{J} S_j^2, \qquad (8)$$

where $S_j$ is defined in (1). Because $S_j$ is an MAF-scaled score statistic, the linear statistic (7) is the same as the MAF-weighted linear statistic proposed by Madsen and Browning [2009] and the quadratic statistic (8) is equivalent to Hotelling’s statistic [Derkach et al., 2012]. Although other linear and quadratic statistics could be considered, we focus on $W_L$ and $W_Q$ because within-class difference in power is substantially smaller than the between-class difference [Basu and Pan, 2011; Derkach et al., 2012; Lin and Tang, 2011], and the latter is the subject of interest here. Moreover, extensive simulation results in Basu and Pan [2011] show that $W_L$ and $W_Q$ have good power among the classes of linear and quadratic statistics, respectively.

Under the normal model (6), the work of Derkach et al. [2012] shows that, when $m_j$ in (1) equals its expected value $np_j$,

$$S_j \sim N(\mu_j, \sigma^2), \qquad \mu_j = \sqrt{n}\,\beta_j \sqrt{p_j (1 - p_j)},$$

and the variance explained by each SNP $j$ is approximately

$$EV_j = \frac{\left(\beta_j \sqrt{p_j (1 - p_j)}\right)^2}{\sigma^2}.$$

Moreover, $W_Q$ has a noncentral $\chi^2_{J, nc_Q}$ distribution under an alternative $H_1$ for which $\beta \ne 0$ and the power of $W_Q$ is a function of the noncentrality parameter, which is

$$nc_Q = \frac{\sum_j \mu_j^2}{\sigma^2} = n \sum_j EV_j.$$

$W_L^2$ has a noncentral $\chi^2_{1, nc_L}$ distribution under $H_1$ and

$$nc_L = \frac{\left(\sum_j \mu_j\right)^2}{J \sigma^2} = \frac{n \left(\sum_j \mathrm{sign}(\beta_j) \sqrt{EV_j}\right)^2}{J}.$$

Therefore, the power of both statistics depends only on sample size $n$, the number of SNPs $J$, variance explained by each SNP $EV_j$, its direction of effect $\mathrm{sign}(\beta_j)$ (for linear statistics only), and type 1 error $\alpha$. It is important to note that effect size $\beta_j$, MAF $p_j/2$, and the total phenotypic variance $\sigma^2$ do not directly affect power once $EV_j$ is known or specified. However, also note that $EV_j = (\beta_j \sqrt{p_j (1 - p_j)})^2 / \sigma^2$, and therefore the models considered here implicitly assume that, for a specified $EV_j$ level, variants with smaller MAFs tend to have bigger effect sizes (i.e., the “MAF-effect-dependent” model assumption). Later we discuss additional simulation studies assuming that MAFs and effect sizes are mutually independent (i.e., the “MAF-effect-independent” model assumption).

To evaluate the accuracy of the asymptotic null distributions of the proposed hybrid statistics, $W_F$ and $W_M$, in finite samples in terms of $J$, we considered $J$ = 10, 20, 30, 40, 50, or 100 and $EV_j \equiv 0$, for all $j = 1, \ldots, J$. (Sample size $n$ is not a factor under the normal assumption and under $H_0$.) For each combination, we generated $S = (S_1, \ldots, S_J)'$ from $S \sim N(0, \sigma^2 I)$ independently $10^6$ times. Without loss of generality, we assumed $\sigma^2 = 1$ because $\sigma^2$ does not affect the size or power of the tests once $EV_j$ is specified. For each simulated replicate, we calculated $p_L$ for $W_L$ based on $N(0, 1)$ and $p_Q$ for $W_Q$ based on $\chi^2_J$, then combined the two P-values to obtain $W_F$ and $W_M$. Finally, we calculated $p_F$ for $W_F$ based on $\chi^2_4$ and $p_M$ for $W_M$ based on $\min(U_1, U_2)$, with $U_1$ and $U_2$ assumed independent. Note that the $\chi^2_4$ approximation for $W_F$ and $\min(U_1, U_2)$ for $W_M$ are only used here for assessing the accuracy of the asymptotic null distributions. For all power comparisons below, we used empirical critical values that are based on the results from this study for quantitative traits or on the permutation method (Appendix) for case-control and application studies.

To evaluate power of the statistics under a broad range of scenarios, we independently generated 10,000 different


TABLE I. Parameters and parameter values of simulated models for studies of quantitative or binary traits. The MAF-effect-dependent model assumes that variants with smaller MAFs tend to have bigger effect sizes; the MAF-effect-independent model assumes that MAF and effect sizes are mutually independent

| Parameters | Parameter values |
|---|---|
| $n$, sample size ($n_{case} = n_{control} = n/2$ for binary traits) | 500, 1000, or 2000 |
| $J$, total number of SNPs | Unif{10, 20, 30, 40, 50} |
| $p_C$, proportion of the causal SNPs | Unif(0.1, 1) |
| $J_C$, number of the causal SNPs | an integer closest to $J \cdot p_C$ |
| $p_D$, proportion of the deleterious SNPs among the causal ones | Unif(0.75, 1) |
| $J_D$, number of the deleterious SNPs | an integer closest to $J_C \cdot p_D$ |
| $p_P$, proportion of the protective SNPs among the causal ones | $1 - p_D$ |
| $J_P$, number of the protective SNPs | $J_C - J_D$ |
| $p_N$, proportion of the neutral SNPs | $1 - p_C$ |
| $J_N$, number of the neutral SNPs | $J - J_D - J_P$ |
| Quantitative traits under the MAF-effect-dependent assumption; 10,000 independently simulated models | |
| $EV_j$, the variance explained by SNP $j$ ($EV_j = \beta_j^2 p_j (1 - p_j)$) | for neutral SNPs: 0; for causal SNPs: Unif(0.001, 0.0025) |
| Quantitative traits under the MAF-effect-independent assumption; 10,000 independently simulated models | |
| $p_j$, approximately twice the MAF of SNP $j$ | Unif(0.005, 0.02) |
| $\beta_j$, regression coefficient in (6) of SNP $j$ | for neutral SNPs: 0; for causal SNPs: Unif(0.45, 0.5) or Unif(-0.5, -0.45) (the resulting $EV_j$s are in the range 0.001 to 0.0049) |
| Binary traits under the MAF-effect-dependent assumption; 500 independently simulated models | |
| $p_j$, approximately twice the MAF of SNP $j$ | Unif(0.005, 0.02) |
| $e^{\beta_j}$, OR of SNP $j$ | for neutral SNPs: 1; for causal SNPs: $C/\sqrt{p_j (1 - p_j)}$, $C = 4\sqrt{0.005(1 - 0.005)}$ (the resulting ORs are in the range 2 (or 1/2) to 4 (or 1/4)) |
| Binary traits under the MAF-effect-independent assumption; 500 independently simulated models | |
| $p_j$, approximately twice the MAF of SNP $j$ | Unif(0.005, 0.02) |
| $e^{\beta_j}$, OR of SNP $j$ | for neutral SNPs: 1; for causal SNPs: Unif(2, 4) or Unif(1/2, 1/4) |

models as described in Table I for studies of quantitative traits under the MAF-effect-dependent assumption. For each combination of parameter values (i.e., one of the 10,000 models), we generated $S_j$ from $N(0, \sigma^2)$ for $J_N$ neutral variants, from $N(\sqrt{n\,EV_j}, \sigma^2)$ for $J_D$ deleterious variants, and from $N(-\sqrt{n\,EV_j}, \sigma^2)$ for $J_P$ protective variants, independently 10,000 times (i.e., 10,000 data replicates for each of the 10,000 models). We calculated $p_L$ for $W_L$ based on $N(0, 1)$, $p_Q$ for $W_Q$ based on $\chi^2_J$, then combined the two P-values to obtain $W_F$ and $W_M$. We then determined if $p_F$ for $W_F$ and $p_M$ for $W_M$ were less than a given $\alpha$ value by comparing $W_F$ and $W_M$ with the empirical critical values from Supporting Information Tables S2 and S4, respectively. This ensures that the tests have the correct type 1 error $\alpha$. Finally, we estimated power by the proportion of the 10,000 data replicates that had $p_L$, $p_Q$, $p_M$, and $p_F$ less than $\alpha$, respectively, for $W_L$, $W_Q$, $W_M$, and $W_F$.

APPLICATION DATA

Similar to many earlier studies of rare variants, the simulating models considered so far assumed that genotypes of a group of rare variants $X_j$, $j = 1, \ldots, J$, are mutually independent, although the tests themselves do not require this. One rationale is that rare variants act independently. The independence assumption also allows for evaluation of a large number of different models, as well as systematic presentation and understanding of the results. However, the general conclusions made so far are not affected by the independence assumption. As a proof of principle, we also analyzed the GAW17 data for which multiple phenotypes were simulated based on the “mini-exome” sequence data provided by the 1000 Genomes Project [Almasy et al., 2011; 1000 Genomes Project Consortium, 2010].

We analyzed quantitative trait Q2 that is influenced by 72 SNPs in 13 genes but not by other covariates [Almasy et al., 2011]. We used data from the $n = 321$ unrelated Asian subjects (Han Chinese, Denver Chinese, and Japanese). Because we excluded SNPs that had MAF > 5% or were monomorphic within the Asian sample, VNN1 had no causal rare variant but it was kept in the analysis to serve as a negative control. The choice to focus on SNPs with MAF ≤ 5% was made because this thresholding (almost) does not affect the number of causal SNPs (70 of the 72 causal SNPs have MAF ≤ 5% in the range of (0.16–1.4%)), but it reduces the number of neutral SNPs in a gene, so that the proportion of the causal variants is high enough to have meaningful power comparisons for at least some of the 13 genes (Table II).
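The Monte Carlo power calculation described under Simulation Models for quantitative traits can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors’ code: the function names (`chi2_sf_even`, `four_tests`, `estimate_power`) are hypothetical, $\sigma^2 = 1$ as in the text, $W_L$ is standardized by $\sqrt{J}$ before being referred to $N(0, 1)$ (our reading of “based on $N(0, 1)$”), $J$ is taken to be even so that the $\chi^2_J$ tail has a closed form, and the empirical critical values for $W_F$ and $W_M$ are taken from a small null simulation rather than from Supporting Information Tables S2 and S4.

```python
import math
import random


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df % 2:
        raise ValueError("this sketch only handles even degrees of freedom")
    half, term, total = x / 2.0, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def four_tests(S):
    """Return (p_L, p_Q, W_F, W_M) for one vector S of scaled scores."""
    J = len(S)
    z = sum(S) / math.sqrt(J)                  # W_L in (7), standardized (assumption)
    p_L = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal P-value
    p_Q = chi2_sf_even(sum(s * s for s in S), J)  # W_Q in (8) referred to chi2_J
    p_L, p_Q = max(p_L, 1e-300), max(p_Q, 1e-300)  # guard against log(0)
    W_F = -2.0 * math.log(p_L) - 2.0 * math.log(p_Q)  # Fisher's statistic (4)
    return p_L, p_Q, W_F, min(p_L, p_Q)        # last entry is W_M in (5)


def estimate_power(n, ev, signs, alpha=0.05, reps=4000, seed=1):
    """Estimate power of the four tests for one genetic model.

    ev[j] is EV_j (0 for neutral SNPs); signs[j] is +1 (deleterious) or -1 (protective).
    """
    rng = random.Random(seed)
    J = len(ev)
    mu = [s * math.sqrt(n * e) for s, e in zip(signs, ev)]

    # Null replicates give empirical critical values for the hybrid statistics.
    null = [four_tests([rng.gauss(0.0, 1.0) for _ in range(J)]) for _ in range(reps)]
    k = int(alpha * reps)
    crit_F = sorted(r[2] for r in null)[reps - k - 1]  # reject when W_F > crit_F
    crit_M = sorted(r[3] for r in null)[k]             # reject when W_M < crit_M

    # Replicates under the alternative: S_j ~ N(+/- sqrt(n * EV_j), 1).
    alt = [four_tests([rng.gauss(m, 1.0) for m in mu]) for _ in range(reps)]
    return {
        "linear": sum(r[0] < alpha for r in alt) / reps,
        "quadratic": sum(r[1] < alpha for r in alt) / reps,
        "min_p": sum(r[3] < crit_M for r in alt) / reps,
        "fisher": sum(r[2] > crit_F for r in alt) / reps,
    }


if __name__ == "__main__":
    # Illustrative model: J = 20, 16 causal SNPs (12 deleterious, 4 protective), 4 neutral.
    ev = [0.002] * 16 + [0.0] * 4
    signs = [1] * 12 + [-1] * 4 + [1] * 4
    print(estimate_power(n=1000, ev=ev, signs=signs))
```

The number of replicates here is far smaller than the 10,000 used in the study, so the resulting power estimates are correspondingly noisier; the illustrative model at the bottom is not one of the models reported in the paper.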


TABLE II. Power of the four test statistics applied to the GAW17 sequence data provided by the 1000 Genomes Project. The four statistics are linear WL in (7), quadratic WQ in (8), minimum-p WM in (5), and Fisher’s WF in (4). The 13 genes presented here are all the causal genes for simulated quantitative trait Q2. VNN1 does not have causal variants because one of the two causal variants has MAF 26% and the other is not polymorphic within the Asian sample (n = 321). VNN1 is kept in the analysis to serve as a negative control. All causal variants were designed by GAW17 to have the same direction of effects (minor alleles were associated with higher Q2 values). The average genetic effect is the average of regression coefficient ␤ values of the causal variants used to simulate Q2 (effects are independent of populations by the GAW17 design). Genes are ordered according to the maximum power of Fisher’s test. Powers shown vary considerably due to inherent factors and estimation based only on 200 replicates, and the 13 genes are separated into different groups

| Gene | SNP distribution ($J_C$, $J_N$) | Average MAF of $J_C$, $J_N$ | Average effect of $J_C$ | Power: Linear | Power: Quadratic | Power: Minimum-p | Power: Fisher’s |
|---|---|---|---|---|---|---|---|
| Eight genes for which the maximum power is 10% or more | | | | | | | |
| SIRT1 | 4, 7 | 0.27%, 0.22% | 0.71 | 0.44 | 0.39 | 0.43 | 0.50 |
| BCHE | 5, 10 | 0.22%, 0.19% | 0.72 | 0.29 | 0.39 | 0.39 | 0.45 |
| PDGFD | 3, 6 | 0.78%, 0.65% | 0.74 | 0.29 | 0.35 | 0.38 | 0.43 |
| SREBF1 | 4, 5 | 0.39%, 0.40% | 0.52 | 0.29 | 0.15 | 0.24 | 0.26 |
| GCKR | 1, 0 | 1.21%, NA | 0.38 | 0.25 | 0.25 | 0.25 | 0.25 |
| VLDLR | 4, 6 | 0.19%, 1.64% | 0.75 | 0.12 | 0.09 | 0.12 | 0.13 |
| PLAT | 4, 7 | 0.39%, 0.49% | 0.68 | 0.13 | 0.13 | 0.11 | 0.13 |
| RARB | 1, 5 | 0.78%, 0.90% | 0.64 | 0.06 | 0.14 | 0.12 | 0.11 |
| Four genes for which the maximum power is 10% or less | | | | | | | |
| INSIG1 | 3, 1 | 0.16%, 3.42% | 0.20 | 0.06 | 0.03 | 0.03 | 0.05 |
| VNN3 | 2, 2 | 0.16%, 2.57% | 0.37 | 0.03 | 0.04 | 0.04 | 0.04 |
| LPL | 1, 4 | 0.16%, 0.23% | 0.73 | 0.02 | 0.05 | 0.03 | 0.03 |
| VWF | 1, 3 | 0.16%, 1.90% | 0.34 | 0.02 | 0.01 | 0.01 | 0.02 |
| One gene for which there are no polymorphic rare causal variants in the Asian sample | | | | | | | |
| VNN1 | 0, 3 | NA, 0.31% | NA | 0.02 | 0.05 | 0.04 | 0.05 |
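The Table II caption notes that the power estimates vary considerably because they rest on only 200 replicates. A rough sense of that variability comes from the binomial standard error of a proportion, $\sqrt{p(1 - p)/200}$. The short Python sketch below (the helper `power_se` is a hypothetical name introduced here) applies this formula to a few of the Fisher’s test entries; it assumes the 200 replicates are independent, and it describes the uncertainty of each single estimate, not of the difference between two tests evaluated on the same replicates.

```python
import math


def power_se(power, replicates=200):
    """Binomial standard error of a power estimate based on independent replicates."""
    return math.sqrt(power * (1.0 - power) / replicates)


if __name__ == "__main__":
    # Fisher's test power taken from Table II for three genes.
    for gene, p in [("SIRT1", 0.50), ("BCHE", 0.45), ("VLDLR", 0.13)]:
        se = power_se(p)
        print(f"{gene}: power {p:.2f}, standard error {se:.3f}")
```

For an estimated power near 50% the standard error is about 3.5 percentage points, which is why small differences between tests in Table II should be read with caution.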

The GAW17 data include 200 replicates (same genotype data but different phenotype data independently simulated based on the true genotype-phenotype association model), and for each, we calculated permutation-based P-values for the four tests using the method described in the Appendix, based on $10^4$ permutations. We estimated power for $\alpha = 0.05$ by the proportion of the 200 replicates for which the empirical P-values are ≤ 0.05 for each test. The choice of the liberal type 1 error level 0.05 was based on the overall low power of detecting these genes due to small sample size, small genetic effect, extremely small MAF, or the low proportion of the causal variants within a gene. Because power estimated from 200 replicates is highly variable, comparisons should be focused on the first group of eight genes for which the maximum power is 10% or more.

SIMULATION RESULTS

The empirical type 1 error rates for $W_L$ and $W_Q$, as expected, were very close to the nominal level because of the assumption of normality (results not shown). Therefore, $p_L$ and $p_Q$ used to obtain the $W_F$ and $W_M$ statistics were “honest” P-values. However, $W_F$ has a slight inflation of type 1 error around 0.06 for $\alpha = 0.05$, and a large inflation of $5 \cdot 10^{-4}$ for $\alpha = 10^{-4}$, worse for smaller $J$, and better for bigger $J$ as expected (Supporting Information Table S1). This reflects the nonindependency of $W_L$ and $W_Q$ under $H_0$ with small $J$. The empirical type 1 error rates for $W_M$ were consistent with the nominal levels considered (Supporting Information Table S3). However, this does not hold in general, e.g., for non-normally distributed traits with small sample sizes or case-control studies. Therefore, in practice P-values should be obtained empirically. The Appendix provides an efficient permutation scheme that provides correct P-values for $W_L$, $W_Q$, $W_F$, and $W_M$ simultaneously, which is used for our simulation and application studies.

Figure 1 shows the empirical power of the four test statistics compared to the maximum power. Sample size is 1000 and type 1 error is $\alpha = 10^{-4}$. (Results for $n$ = 500 and 2000 at $\alpha = 10^{-4}$ are in Supporting Information Figures S2 and S3, respectively and are characteristically similar; results for other $\alpha$ levels, 0.05, $10^{-2}$, and $10^{-3}$ are also similar and not shown.) Several observations can be made from Figure 1.

• The maximum power is often achieved by the Fisher’s test: this occurs in 75% of the 10,000 simulated models.
• Both linear and quadratic tests have large variability in power reflecting the wide variation in the simulated models. Power of the linear test is more than 5% below the maximum power for 52% of the 10,000 models; when the maximum power is around 60%, power of the linear test can be as low as 15%. Power of the quadratic test is more than 5% below the maximum power for 53% of the 10,000 models; when the maximum power is around 60%, power of the quadratic test can be as low as 20%.
• Both Fisher’s and minimum-p test statistics are robust in terms of power. However, Fisher’s test consistently outperforms the minimum-p test, and it can have substantially better power than the individual linear and quadratic tests.
• When either the linear or quadratic test has moderate power, Fisher’s test has improved power. For example, among the 10,000 models simulated, the power of linear or quadratic test is at least 20% for 4,903 models, and


Fig. 1. Empirical power of the four test statistics compared to the maximum power for 10,000 independently generated models for studies of quantitative traits under the MAF-effect-dependent assumption as described in Table I. The four statistics are linear $W_L$ in (7), quadratic $W_Q$ in (8), minimum-p $W_M$ in (5), and Fisher’s $W_F$ in (4). For each genetic model, the maximum power among the four statistics and the statistic that provides the maximum power are recorded (black triangle for $W_L$, black diamond for $W_Q$, blue square for $W_M$, and red circle for $W_F$), and it is compared with power of individual statistics. Sample size $n = 1000$ and type 1 error $\alpha = 10^{-4}$. Results for $n$ = 500 and 2000 are in Supporting Information Figures S2 and S3, respectively.


Fisher’s test has improved power for 90% of the 4,903 models. The relative efficiency gain is at least 20% for 40% of the 4,903 models (and at least 50% for 10% of the 4,903 models). Among the 380 of the 4,903 models for which Fisher’s test has less power, the maximum absolute power loss is 8%.

To better understand the impact of the various parameters, Figure 2 presents the same results from a different perspective showing the individual power as a function of the number of causal variants $J_C$ (large scale of the X-axis) and the number of deleterious variants $J_D$ (small scale of the X-axis), when the total number of rare variants is $J = 30$ (see Supporting Information Figures S1a–S1d for $J$ = 10, 20, 40, and 50). It is clear that the power of all tests depends strongly on the percentage of causal SNPs in the group of SNPs investigated. For example, among the 10,000 models simulated, to achieve power of 50% or greater, the average proportion of causal SNPs is 81% (SE = 13%, min = 42%) for the linear test, 81% (SE = 12%, min = 50%) for the quadratic test, 80% (SE = 14%, min = 42%) for the minimum-p test, and 77% (SE = 15%, min = 36%) for Fisher’s test. For the 2005 models with $J = 30$ shown in Figure 2, a proportion of 80% being causal means $J_C = J \cdot p_C = 24$ on the large scale of the X-axis.

To further demonstrate the differential consequences of effect directions on different tests, Figure 3 shows the individual power for the 75 models that have $J_C = 24$ causal variants of the $J = 30$ total variants. The X-axis in Figure 3 shows the number of deleterious variants out of the 24 causal ones, ranging from $J_D = J_C \cdot p_D = 24 \cdot 75\% = 18$ to $J_D = 24 \cdot 100\% = 24$. Although the linear test can outperform the quadratic test by a large margin for some models, it is highly sensitive to the direction of effects. For example, for models in Figure 3 where all 24 causal SNPs are deleterious ($J_D = 24$, $J_P = 0$), power of the linear test is over 90% compared to ∼60% for the quadratic test (power of the minimum-p and Fisher’s tests are also over 90%). However, if 4 of the 24 causal SNPs are protective ($J_D = 20$, $J_P = 4$), power for the linear test drops to ∼40% while power of the quadratic test remains at ∼60% (power of the minimum-p is ∼60% and power of Fisher’s test is ∼70%). Both the minimum-p and Fisher’s tests are robust because they combine information from the complementary linear and quadratic tests, and the relationship between their power and the various parameters is similar. One noticeable difference is that power of the minimum-p test is constrained between power of the linear and quadratic tests, while Fisher’s test does not have such a restriction.

APPLICATION RESULTS

Results in Table II are consistent with our previous conclusions. (1). Performance of individual linear and quadratic tests can be highly variable depending on the model. For example, power of the linear and quadratic tests for SIRT1 are 44% and 39%, respectively, while power of the two tests for BCHE are 29% and 39%, respectively. The example of BCHE also shows that even when all causal variants have the same direction of effect, quadratic tests can outperform linear tests if the proportion of the causal variants is low (5 out of 15 for BCHE). (2). The minimum-p and Fisher’s hybrid statistics are both robust, and Fisher’s test consistently outperforms the minimum-p test. (3). Fisher’s test not only provides comparable power for all genes analyzed but often it is the most powerful test, with appreciable power; e.g., the power of Fisher’s test is 50% for SIRT1 and 45% for BCHE.

Some authors have reported problems related to the GAW17 data and some of the published analyses. For example, Tintle et al. [2011] noted “Two main causes emerged: population stratification and long-range correlation (gametic phase disequilibrium) between rare variants.” These issues however do not affect our analyses, because we used the samples from the Asian population only and we assessed the statistical significance of the tests using a permutation-based method as described in the Appendix.

DISCUSSION

As discussed above, the genetic models considered in Table I and Figure 1 directly specify $EV_j$, the variance explained by a rare variant, which implicitly assumes that rarer variants have bigger genetic effects. We also investigated models where MAFs and genetic effects are independent of each other (Table I; Quantitative traits under the MAF-effect-independent assumption). Results are presented in Supporting Information Figure S4 and are very similar to those in Figure 1.

We also conducted extensive simulation studies of case-control studies. Briefly, the distribution of $Y_i$ given $X_i$ is Bernoulli with

$$\mathrm{Prob}(Y_i = 1 \mid X_i) = \frac{e^{\beta_0 + \sum_j \beta_j X_{ij}}}{1 + e^{\beta_0 + \sum_j \beta_j X_{ij}}},$$

where $X_{ij}$ are mutually independent Bernoulli variables as in the quantitative setting. Without loss of generality, $\beta_0 = -2.1922$ so that $\mathrm{Prob}(Y_i = 1 \mid X_i = 0) = 0.1$. Other parameters are described in Table I, separately under the MAF-effect-dependent or -independent assumption. For each combination of parameter values, a sample was obtained by first generating genotype $X_i$ for each subject. The case ($Y = 1$) and control ($Y = 0$) status were assigned based on the probabilities from the logistic model, $\mathrm{Prob}(Y_i = 1 \mid X_i)$, allowing for the case-control design. This was done independently 1000 times to estimate power of the four tests for each of the 500 models independently generated. Due to non-normality, $p_L$ and $p_Q$, P-values of the linear and quadratic statistics, were also obtained empirically via $10^6$ permutations (see the Appendix). Results are in Supporting Information Figure S5 (MAF-effect dependent) and Supporting Information Figure S6 (MAF-effect independent), and they are similar to each other and similar to those in Figure 1 for quantitative traits.

The models considered so far do not restrict all causal variants to have the same direction of effect, but do assume the majority of the causal variants have the same direction (without loss of generality, deleterious) with $p_D = J_D/J_C \sim \mathrm{Unif}(0.75, 1)$. Although this is more plausible than the scenario when deleterious and protective variants are equally likely among the causal ones, we also investigated models for which $p_D \sim \mathrm{Unif}(0.5, 0.75)$; all other parameters were generated as described in Table I. For such models, the linear test has little power in most cases while the quadratic test has much better power, as expected. Results of quantitative traits under the MAF-effect-dependent assumption are in Figure 4 (results of other types of studies as described in Table I are characteristically similar and not shown). The maximum power was achieved by the quadratic test in 94%

Genet. Epidemiol. 8 Derkach et al.

Fig. 2. Empirical individual power of the four test statistics for the 2005 of the 10,000 models in Figure 1 with J = 30 total number of rare variants. The large scale of the X-axis shows the number of causal variants in the range of $J_C = J \cdot p_C = 30 \cdot 10\% = 3$ to $J_C = 30 \cdot 100\% = 30$. The small scale of the X-axis shows the number of deleterious variants $J_D$ out of the total of $J_C$ causal variants in the range of $J_D = J_C \cdot p_D = J_C \cdot 75\%$ to $J_D = J_C \cdot 100\%$, depending on the actual number of causal variants in a model. The 2005 models are a subset of the 10,000 models generated as described in Table I and Figure 1. Results of other J values are in Supporting Information Figures S1a–S1d.

of 10,000 simulated models; for the remaining 6% of the models the maximum power was achieved by Fisher’s test. Consequently, although both minimum-p and Fisher’s tests are reasonably robust, the minimum-p statistic is close to the best statistic for each model and is a better hybrid statistic, consistent with the findings of Owen [2009].

In some settings, a test of no association may be based on a regression model with several environmental or population stratification covariates [e.g., Lin and Tang, 2011]. Because adjusting for covariates is performed at the individual linear and quadratic test statistic level, the calculation of the proposed hybrid statistics remains the same as in (5) for the minimum-p statistic and in (4) for Fisher’s statistic. However, covariate adjustment could affect the computation of P-values for the hybrid tests. Simple permutation procedures are not valid unless SNP genotypes are independent of both the response Y and covariates. Several authors [e.g., Lee et al., 2012a; Lin and Tang, 2011] have proposed parametric bootstrap to obtain P-values in the presence of covariates. The parametric bootstrap approach combined with the joint resampling methodology as discussed in the Appendix can be used for our hybrid test statistics. For large sample size, an alternative for obtaining valid P-values is numerical calculation. In that case, vector S is (approximately) distributed as multivariate normal and hence can be generated. Further research is needed on robust ways to obtain P-values in the presence of covariates.

The goal of this study is to show that combining evidence of association from complementary linear and quadratic tests can lead to robust and more powerful tests. As a proof of principle, we used the MAF-weighted linear statistic $W_L$ in (7) and Hotelling’s statistic $W_Q$ in (8), and we assumed that there are no other influencing factors. However, the concept can be extended to any two or more complementary tests. The power of such hybrid statistics depends on the power of the original individual tests and the dependency between the tests under the null hypothesis of no association $H_0$. An interesting question is whether one could further improve robustness by combining the P-values from the minimum-p and Fisher’s tests; however, this is beyond the scope of this work.

Recently, Lee et al. [2012a] developed an optimal sequence kernel association test, SKAT-O, extending the work of


Fig. 3. Empirical individual power of the four test statistics for the 75 of the 2005 models in Figure 2 with $J_C = 24$ causal variants out of total J = 30 rare variants. The X-axis shows the number of deleterious variants $J_D$ out of the total of $J_C = 24$ causal variants in the range of $J_D = J_C \cdot p_D = 24 \cdot 75\% = 18$ to $J_D = 24 \cdot 100\% = 24$. Other details are as in Figures 1 and 2 and Table I.

Wu et al. [2011] that proposed the quadratic SKAT test. The SKAT-O uses the minimum-p principle by considering a family of tests based on weighted averages of a linear statistic and SKAT. We evaluated the empirical power of SKAT-O using the existing R-package [Lee et al., 2012b] under the various scenarios as outlined in Table I. In the case of quantitative traits, we observed that Fisher’s statistic performs better than SKAT-O when the proportion of deleterious SNPs is from Unif (0.75, 1) (Supporting Information Figure S7, $n = 1000$, $\alpha = 10^{-4}$). Similar results were also seen in the case-control study where we used SKAT-O with suggested adjustment for small sample size (Supporting Information Figure S9, $n_{\mathrm{cases}} = n_{\mathrm{controls}} = 500$, $\alpha = 10^{-4}$). However, when the proportion of deleterious SNPs is from Unif (0.5, 0.75), the proposed minimum-p statistic outperforms SKAT-O and Fisher’s statistic because the latter two statistics lose power when one of the tests has little to no power (Supporting Information Figure S8 for quantitative traits and Supporting Information Figure S10 for binary traits). We also observed that SKAT-O has a slightly inflated empirical type 1 error, therefore results for SKAT-O presented here are a little too optimistic. Nevertheless, the general conclusions hold.

In summary, the proposed minimum-p and Fisher’s hybrid test statistics provide much needed robustness in terms of power for association tests of rare variants, by combining information from the complementary linear and quadratic test statistics. Statistical significance of the hybrid statistics can be obtained efficiently using the same permutation-based method often required for the existing linear and quadratic statistics, without the need for additional permutations. The minimum-p statistic is attractive if one believes that causal rare variants are equally likely to be deleterious and protective. However, for the plausible scenario when the majority of the causal variants have the same direction of effect (either deleterious or protective), Fisher’s test consistently outperforms methods that use the minimum-p principle [e.g., the simple minimum-p test considered here and SKAT-O, Lee et al., 2012a], and it often provides considerably better power than the individual linear and quadratic tests. The general concept of using Fisher’s method to combine information from two or more existing but complementary methods applied to the same data, beyond the traditional setting of meta-analysis of multiple data resources, can be readily extended and is useful in many other scientific studies.
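The contrast between the two combination rules can be illustrated with a small numerical sketch. It deliberately assumes the two component P-values are independent, which is NOT the setting of the paper (the linear and quadratic statistics are computed on the same data, which is why permutation is used there); under that simplifying assumption both combined P-values have closed forms. The function names are hypothetical and chosen only for this illustration.

```python
import math


def fisher_pvalue_independent(p1, p2):
    """Combined P-value of Fisher's statistic -2*log(p1) - 2*log(p2).

    Assumption of this sketch: p1 and p2 are independent, so the statistic
    is chi-square with 4 degrees of freedom, whose upper tail at x is
    exp(-x/2) * (1 + x/2). With x = -2*log(p1*p2) this simplifies to
    p1*p2 * (1 - log(p1*p2)).
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))


def minp_pvalue_independent(p1, p2):
    """Combined P-value of the minimum-p statistic min(p1, p2).

    Assumption of this sketch: p1 and p2 are independent and uniform under
    the null, so P(min <= m) = 1 - (1 - m)^2.
    """
    m = min(p1, p2)
    return 1.0 - (1.0 - m) ** 2


if __name__ == "__main__":
    # Both tests carry moderate evidence, then only one test carries evidence.
    for p_l, p_q in [(0.04, 0.03), (0.6, 0.001)]:
        print(p_l, p_q,
              fisher_pvalue_independent(p_l, p_q),
              minp_pvalue_independent(p_l, p_q))
```

In this toy setting Fisher's rule gives the smaller combined P-value when both components carry moderate evidence, while the minimum-p rule gives the smaller one when only a single component carries evidence, which mirrors the qualitative pattern described above.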


Fig. 4. Empirical power of the four test statistics compared to the maximum power for 10,000 independently generated models for which the proportion of deleterious SNPs among the causal ones is generated from Unif (0.50, 0.75). All other parameters are described in Table I for studies of quantitative traits under the MAF-effect-dependent assumption, and they are the same as the ones used for Figure 1.


ACKNOWLEDGMENTS

The authors would like to thank the Genetic Analysis Workshop 17 (GAW17) committee and the 1000 Genomes Project for providing the GAW17 application data, and Dr. Andrew Paterson for insightful discussions. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC; 250053-2008) and the Canadian Institutes of Health Research (CIHR; MOP 84287) grants to L.S., NSERC to J.F.L., the Ontario Graduate Scholarship (OGS) and the CIHR Strategic Training for Advanced Genetic Epidemiology (STAGE) fellowship to A.D., University of Toronto. The authors have no conflict of interest to declare.

REFERENCES

1000 Genomes Project Consortium 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.
Almasy L, Dyer T, Peralta J, Kent J, Charlesworth J, Curran J, Blangero J. 2011. Genetic analysis workshop 17 mini-exome simulation. BMC Proceedings 5:S2.
Bansal V, Libiger O, Torkamani A, Schork NJ. 2010. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet 11:773–785.
Basu S, Pan W. 2011. Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 35:606–619.
Cirulli ET, Goldstein DB. 2010. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11:415–425.
Derkach A, Lawless JF, Sun L. 2012. Assessment of pooled association tests for rare genetic variants within a unified framework. arXiv:1205.4079 [stat.ME]; submitted for publication.
Genest C, Rivest LP. 1993. Statistical inference procedures for bivariate Archimedean Copulas. J Am Stat Assoc 88:1034–1043.
Goeman JJ, van de Geer SA, van Houwelingen HC. 2006. Testing against a high dimensional alternative. J R Stat Soc: B (Statistical Methodology) 68:477–493.
Han F, Pan W. 2010. A data-adaptive sum test for disease association with multiple common or rare variants. Hum Hered 70:42–54.
Hoffmann TJ, Marini NJ, Witte JS. 2010. Comprehensive approach to analyzing rare genetic variants. PLoS ONE 5:e13584.
Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. 2012. The empirical power of rare variant association methods: results from sanger sequencing in 1,998 individuals. PLoS Genet 8:e1002496.
Lee S, Lin X, Wu MC. 2012a. Optimal tests for rare variant effects in sequencing association studies. Biostatistics.
Lee S, Miropolsky L, Wu M. 2012b. SKAT: SNP-set (Sequence) Kernel Association Test. R package version 0.76.
Li B, Leal SM. 2008. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83:311–321.
Lin D-Y, Tang Z-Z. 2011. A general framework for detecting disease associations with rare variants in sequencing studies. Am J Hum Genet 89:354–367.
Loughin TM. 2004. A systematic comparison of methods for combining p-values from independent tests. Comput Stat Data Analysis 47:467–485.
Madsen BE, Browning SR. 2009. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics 5:e1000384.
Morgenthaler S, Thilly WG. 2007. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mut Res 615:28–56.
Morris AP, Zeggini E. 2010. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34:188–193.
Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell SM, Roeder K, Daly MJ. 2011. Testing for an unusual distribution of rare variants. PLoS Genet 7:e1001322.
Owen AB. 2009. Karl Pearson’s meta-analysis revisited. Ann Statistics 37:3867–3892.
Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei L-J, Sunyaev SR. 2010. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86:832–838.
Stouffer S. 1949. The American Soldier: Adjustment during Army life. The American Soldier. Princeton, NJ: Princeton University Press.
Tintle N, Aschard H, Hu I, Nock N, Wang H, Pugh E. 2011. Inflated type I error rates when using aggregation methods to analyze rare variants in the 1000 Genomes Project exon sequencing data in unrelated individuals: summary results from Group 7 at Genetic Analysis Workshop 17. Genet Epidemiol 35:S56–S60.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. 2011. Rare-variant association testing for sequencing data with the Sequence Kernel Association Test. Am J Hum Genet 89:82–93.
Yi N, Zhi D. 2011. Bayesian analysis of rare variants in genetic association studies. Genet Epidemiol 35:57–69.

APPENDIX

An efficient permutation-based method that provides empirical P-values, simultaneously, for the linear, quadratic, minimum-p, and Fisher’s tests

Here, we describe an efficient permutation-based method that provides empirical P-values, $p_L$, $p_Q$, $p_M$, and $p_F$, simultaneously, for tests based on $W_L$, $W_Q$, $W_M$, and $W_F$, respectively. This approach is novel, but similar in spirit to methods used for nonparametric estimation of copula functions [Genest and Rivest, 1993].

For a given dataset, let $Y = (Y_1, \ldots, Y_n)'$ be the (binary or quantitative) phenotype values for n subjects, and $X_j = (X_{1j}, \ldots, X_{nj})'$, $j = 1, \ldots, J$, be the corresponding genotype values for a group of J SNPs under study. Let $W_{L,\mathrm{obs}}$ be the observed linear statistic calculated, e.g., as in Equation (7), and $W_{Q,\mathrm{obs}}$ be the observed quadratic statistic calculated, e.g., as in Equation (8). Due to sparsity or non-normality, asymptotic approximations often do not provide satisfactory P-values in practical settings, so resampling-based methods are recommended by many authors [e.g., Basu and Pan, 2011; Lin and Tang, 2011; Neale et al., 2011]. To preserve the possible dependence present in the observed genotypes between SNPs, a permuted dataset under the null of no association is obtained by permuting the phenotype. Let $Y^k$, $k = 1, \ldots, K$, be the K independently permuted phenotype vectors and $W_L^k$ and $W_Q^k$ be the corresponding linear and quadratic statistics for the kth permuted dataset. “Honest” P-values for the linear and quadratic tests using the observed data are obtained, respectively, as

$$p_L = \sum_k I\left((W_L^k)^2 \ge W_{L,\mathrm{obs}}^2\right)/K,$$

$$p_Q = \sum_k I\left(W_Q^k \ge W_{Q,\mathrm{obs}}\right)/K,$$

where $I(\cdot)$ indicates if the statistic from the kth permuted sample is greater than or equal to the observed statistic.


The observed test statistics of $W_M$ and $W_F$ are, respectively,

$$W_{M,\mathrm{obs}} = \min(p_L, p_Q),$$

$$W_{F,\mathrm{obs}} = -2\log(p_L) - 2\log(p_Q).$$

To empirically assess the statistical significance of $W_{M,\mathrm{obs}}$ and $W_{F,\mathrm{obs}}$ without additional permutations, let

$$p_L^k = \mathrm{Rank}(|W_L^k|)/K, \quad k = 1, \ldots, K,$$

be the empirical P-value of the linear test using the kth permuted sample, where $\mathrm{Rank}(|W_L^k|)$ is the rank of $|W_L^k|$ among all K linear statistics calculated based on the K permuted samples. (Other choices are possible, e.g., $p_L^k = (\mathrm{Rank}(|W_L^k|) - 0.5)/K$, but results are not practically different; P-values for linear statistics are two-sided to allow for either positive or negative association statistic under the alternative hypothesis.) Similarly, we calculate

$$p_Q^k = \mathrm{Rank}(W_Q^k)/K, \quad k = 1, \ldots, K.$$

The $W_M$ and $W_F$ statistics using the K permuted datasets are, respectively,

$$W_M^k = \min(p_L^k, p_Q^k), \quad k = 1, \ldots, K,$$

$$W_F^k = -2\log(p_L^k) - 2\log(p_Q^k), \quad k = 1, \ldots, K.$$

Finally, “honest” P-values of the minimum-p and Fisher tests for the observed data are obtained, respectively, as

$$p_M = \sum_k I\left(W_M^k \le W_{M,\mathrm{obs}}\right)/K,$$

$$p_F = \sum_k I\left(W_F^k \ge W_{F,\mathrm{obs}}\right)/K.$$

The size of K depends on the particular application. For the GAW17 data, $K = 10^4$ because P-values were large and power was assessed at $\alpha = 0.05$. For the simulation studies, $K = 10^6$ because power was assessed at $\alpha$ as low as $10^{-4}$. For a more stringent type 1 error control (e.g., $10^{-6} = 0.05/50{,}000$ genes or bins/groups of rare variants) suitable for whole-genome analysis of rare variants, computational burden can be an issue, common to all genome-wide analyses that require permutations to assess statistical significance. For extremely sparse data, permutation-based methods are also known to be conservative. For example, in the extreme case of a case-control study of one single SNP, if there was only one copy of the rare allele present in the sample, there would be only two distinct test statistics among all possible permuted datasets, resulting in permutation-based P-values being 0, 0.5, or 1. Randomized P-values are often recommended to circumvent the problem, but additional research is needed.
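The procedure above reuses one set of K permuted statistics for all four tests, and a minimal sketch may make the bookkeeping easier to follow. The sketch below is not the authors' code: it takes the observed and permuted linear and quadratic statistics as given inputs, the function names are hypothetical, and it makes two assumptions that the text does not spell out. First, ranks are taken in descending order (rank 1 for the largest statistic, ties receiving the larger rank), so that large statistics map to small P-values. Second, an observed P-value of zero is floored at 1/K before taking logarithms.

```python
import math
from bisect import bisect_left


def descending_rank_pvalues(stats):
    """Rank(stat)/K for each permuted statistic, rank 1 = largest (assumed)."""
    k = len(stats)
    ordered = sorted(stats)
    # The number of statistics >= s equals K minus the count strictly below s.
    return [(k - bisect_left(ordered, s)) / k for s in stats]


def hybrid_pvalues(w_l_obs, w_q_obs, w_l_perm, w_q_perm):
    """Empirical P-values p_L, p_Q, p_M, p_F from one set of K permutations."""
    k = len(w_l_perm)

    # Observed P-values: two-sided for the linear test, one-sided for the quadratic.
    p_l = sum(w * w >= w_l_obs * w_l_obs for w in w_l_perm) / k
    p_q = sum(w >= w_q_obs for w in w_q_perm) / k

    # Observed hybrid statistics; the 1/K floor is an assumption of this sketch
    # that keeps log() finite when no permuted statistic reaches the observed one.
    floor = 1.0 / k
    w_m_obs = min(p_l, p_q)
    w_f_obs = -2 * math.log(max(p_l, floor)) - 2 * math.log(max(p_q, floor))

    # P-values of each permuted dataset, ranked within the same K permutations.
    p_l_k = descending_rank_pvalues([abs(w) for w in w_l_perm])
    p_q_k = descending_rank_pvalues(w_q_perm)

    # Hybrid statistics of each permuted dataset.
    w_m_k = [min(a, b) for a, b in zip(p_l_k, p_q_k)]
    w_f_k = [-2 * math.log(a) - 2 * math.log(b) for a, b in zip(p_l_k, p_q_k)]

    # Small minimum-p values and large Fisher values count as extreme.
    p_m = sum(w <= w_m_obs for w in w_m_k) / k
    p_f = sum(w >= w_f_obs for w in w_f_k) / k
    return {"p_L": p_l, "p_Q": p_q, "p_M": p_m, "p_F": p_f}
```

Sorting once and using a binary search keeps the ranking step at O(K log K), which matters when K is as large as the $10^6$ permutations used in the simulation studies.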

Comparison of Pathway Analysis Approaches Using Lung Cancer GWAS Data Sets

Gordon Fehringer1, Geoffrey Liu2, Laurent Briollais1, Paul Brennan3, Christopher I. Amos4, Margaret R. Spitz4, Heike Bickeböller5, H. Erich Wichmann6, Angela Risch7, Rayjean J. Hung1*

1 Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
2 Department of Medicine and Medical Biophysics, Ontario Cancer Institute/Princess Margaret Hospital, Toronto, Ontario, Canada
3 International Agency for Research on Cancer (IARC), Lyon, France
4 Department of Epidemiology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, United States of America
5 Department of Genetic Epidemiology, University Medical Center, University of Goettingen, Goettingen, Germany
6 Institute of Epidemiology I, Helmholtz Center Munich, Neuherberg, Germany
7 Division of Epigenomics and Cancer Risk Factors, German Cancer Research Center, Heidelberg, Germany

Abstract

Pathway analysis has been proposed as a complement to single SNP analyses in GWAS. This study compared pathway analysis methods using two lung cancer GWAS data sets based on four studies: one a combined data set from Central Europe and Toronto (CETO); the other a combined data set from Germany and MD Anderson (GRMD). We searched the literature for pathway analysis methods that were widely used, representative of other methods, and had available software for performing analysis. We selected the programs EASE, which uses a modified Fishers Exact calculation to test for pathway associations, GenGen (a version of Gene Set Enrichment Analysis (GSEA)), which uses a Kolmogorov-Smirnov-like running sum statistic as the test statistic, and SLAT, which uses a p-value combination approach. We also included a modified version of the SUMSTAT method (mSUMSTAT), which tests for association by averaging χ² statistics from genotype association tests. There were nearly 18,000 genes available for analysis, following mapping of more than 300,000 SNPs from each data set. These were mapped to 421 GO level 4 gene sets for pathway analysis. Among the methods designed to be robust to biases related to gene size and pathway SNP correlation (GenGen, mSUMSTAT and SLAT), the mSUMSTAT approach identified the most significant pathways (8 in CETO and 1 in GRMD). This included a highly plausible association for the acetylcholine receptor activity pathway in both CETO (FDR ≤ 0.001) and GRMD (FDR = 0.009), although two strong association signals at a single gene cluster (CHRNA3-CHRNA5-CHRNB4) drive this result, complicating its interpretation. Few other replicated associations were found using any of these methods. Difficulty in replicating associations hindered our comparison, but results suggest mSUMSTAT has advantages over the other approaches, and may be a useful pathway analysis tool to use alongside other methods such as the commonly used GSEA (GenGen) approach.

Citation: Fehringer G, Liu G, Briollais L, Brennan P, Amos CI, et al. (2012) Comparison of Pathway Analysis Approaches Using Lung Cancer GWAS Data Sets. PLoS ONE 7(2): e31816. doi:10.1371/journal.pone.0031816

Editor: Zhongming Zhao, Vanderbilt University Medical Center, United States of America

Received July 27, 2011; Accepted January 13, 2012; Published February 21, 2012

Copyright: © 2012 Fehringer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study is supported by Canadian Cancer Society (grant no. 020214), the CCO Chair in Population Studies, CCO Chair in Experimental Therapeutics, the Alan Brown Chair in Molecular Genomics, and National Institute of Health (U19 CA148127-01). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

Introduction

Genome wide association studies (GWAS) examine the association of hundreds of thousands of genetic variants with disease or other phenotypes. These studies have successfully identified associations between genetic variants and outcome, such as associations between SNPs at the 15q25 and 5p region and lung cancer risk [1,2,3,4,5,6]. GWAS of lung cancer and other diseases generally identify only a few SNPs that are associated with disease and these usually have small effect sizes. For instance, the per allele odds ratio for variants which implicate acetylcholine receptor genes at 15q25 with lung cancer risk is about 1.3 [1,2,5]. SNPs with weaker effects could be missed given the stringent requirements needed for adjustment for multiple comparisons.

Pathway analysis has been proposed as a complementary approach to single SNP analyses in GWAS. Pathway analysis groups genes that are related biologically and tests whether these gene groups are associated with outcome. Although outcome associated with variation at many genes may be too small to detect in GWAS using single SNP analysis, associations may be detected from the joint effect of many weaker signals at genes grouped into a pathway based on shared biological function. Other benefits of this approach are the substantial reduction of the multiple testing burden once genes are grouped into pathways for association testing [7] and the incorporation of biological knowledge into the analysis, which is not accounted for in GWAS.

The number of methods developed for pathway analysis continues to increase. Many on-line programs offer a simple gene set enrichment approach that uses some form of Fisher’s Exact test to determine over-representation of genes within a pathway. Generally, a gene is assigned a P-value (usually obtained from the SNP most strongly associated with outcome at a gene) and an arbitrary cut-off (e.g., P ≤ 0.05) is used to separate genes strongly associated with outcome from other genes. A Fishers Exact calculation is then used to test for within pathway enrichment of genes strongly associated with outcome. This approach does not

account for linkage disequilibrium patterns among SNPs at different genes in the pathway. As well, it may over-estimate the significance of pathways with large genes (i.e., many SNPs), since selecting the most significant SNP when there are many SNPs at a single gene is more likely to find a strong association between gene and outcome by chance [8,9].

The popular GSEA approach generally uses the SNP most strongly associated with outcome at each gene to represent gene-outcome associations. Some implementations take into account linkage disequilibrium among SNPs and gene size bias by performing phenotype (case-control status) permutations and using normalization routines. Genes are first ranked by size of their test statistic for association with outcome. A Kolmogorov-Smirnov-like running sum statistic is then used to test for enrichment of highly ranked genes within pathways, by comparing the pathway test statistic to its null distribution as determined by the phenotype permutations [9,10]. Other approaches, for example the SUMSTAT approach which uses the sum of χ² statistics assigned to genes as a pathway test statistic [11], can be adapted to use phenotype permutations and normalization methods. Alternatives to these gene set enrichment approaches, such as methods of combining P-values (similar to meta-analyses), have also been proposed for pathway analysis. Some of these incorporate methodology that accounts for potential bias related to gene size or correlation among SNPs [12,13].

We compare four pathway analysis methods. These include a simple gene enrichment approach in EASE, which calculates a modified Fishers Exact probability [14], GSEA (using the GenGen program) [9,10], a modified SUMSTAT approach, and SLAT, a P-value combination approach [12]. The first method is representative of early simpler approaches which use the Fishers Exact test, while the others, as outlined above, are more sophisticated and designed to address biases related to gene size and linkage disequilibrium among SNPs. We compare and contrast the results from analyses using these methods in two lung cancer GWAS data sets.

Materials and Methods

Samples

Data were used from case-control GWAS of lung cancer risk. These included lung cancer cases and controls from Central Europe [2], Toronto [2] and Germany (HGF study) [15,16] and non-small cell lung cancer cases and controls from Texas (MD Anderson Cancer Centre) [1]. Genotyping was performed using either the Illumina HumanHap300 or HumanHap550 chips. Data from the four studies were combined into two data sets: 1) Central Europe and Toronto (CETO); and 2) Germany and Texas (GRMD), in order to reach adequate sample size and statistical power to detect associations in the pathway analyses. The choice of which data sets to combine was predominantly made to ensure similar sample sizes in the two independent analyses. Table 1 provides further details related to these studies.

Table 1. Comparison of study designs, selected epidemiologic variables, genotyping platforms and results.

Stage | Study | Type | Case/control | Sex (Males) | Median Age (range) | Ever smokers | CHIP/SNP number | Combined Data Sets: SNPs mapping to genes | # of genes after SNP mapping | # of GO Pathways
1st Stage | Central Europe | Cases: hospital; Controls: hospital | 1926/2522 | 75% | 61 (25–89) | Cases: 93%; Controls: 65% | Illumina HumanHap300 (317,139) | | |
1st Stage | Toronto | Cases: hospital; Controls: clinic and population | 333/506 | 42% | 58 (20–85) | Cases: 71%; Controls: 56% | Illumina HumanHap300 (317,139) | | |
1st Stage | Central Europe/Toronto combined† | Cases: hospital; Controls: various | 2258/3027 | 70% | 60 (20–89) | Cases: 89%; Controls: 63% | Illumina HumanHap300 305,326 | 161,435 | 17811 | 421
2nd Stage | Germany (DKFZ/LUCY/KORA) | Cases: hospital; Controls: population | 504/484 | 57% | 46 (27–51) | Cases: 93%; Controls: 54% | Illumina HumanHap550 561,466 | | |
2nd Stage | Texas (MD Anderson) | Cases: hospital; Controls: clinic | 1154/1137 | 57% | 62 (31–92) | Cases: 100%; Controls: 100% | Illumina HumanHap300 317,498 | | |
2nd Stage | Texas/Germany combined† | Cases: hospital; Controls: various | 1639/1618 | 57% | 57 (27–92) | Cases: 98%; Controls: 86% | Illumina HumanHap300 302,334 | 160,726 | 17805 | 420

† After implementing data quality measures described in methods.
doi:10.1371/journal.pone.0031816.t001

Selection of pathway analysis methods

Pathway analysis methods were identified through literature review. Methods implemented in the programs EASE [14], GenGen (developed from GSEA) [9,10], and SLAT [12] were chosen because they were widely used and/or representative of other pathway analysis approaches. We chose the SUMSTAT method based on a report indicating it had superior power to detect pathway associations than GSEA or Fishers Exact methods [11]. For this method an in-house SAS program was developed. The methods are described here briefly, with details provided in the original publications.
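The simple enrichment approach mentioned in the Introduction (genes split at a P-value cut-off, followed by a Fisher's Exact calculation for each pathway) can be sketched as a hypergeometric upper-tail probability. This is only an illustration of the plain one-sided test; it is not the EASE score, which is described in the source as a modified (jackknife) version, and the function and argument names are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genes, n_sig, pathway_size, pathway_sig):
    """One-sided Fisher's Exact (hypergeometric upper-tail) P-value.

    n_genes:      total number of genes analysed
    n_sig:        genes passing the cut-off (e.g., gene-level P <= 0.05)
    pathway_size: number of genes in the pathway
    pathway_sig:  number of pathway genes passing the cut-off

    Returns the probability of seeing pathway_sig or more significant genes
    in a pathway of this size if significant genes were spread at random.
    """
    upper = min(n_sig, pathway_size)
    total = comb(n_genes, pathway_size)
    tail = sum(
        comb(n_sig, k) * comb(n_genes - n_sig, pathway_size - k)
        for k in range(pathway_sig, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers chosen for illustration only, not taken from the study.
    print(enrichment_pvalue(n_genes=10, n_sig=3, pathway_size=4, pathway_sig=2))
```

Because the calculation treats genes as exchangeable, it inherits the limitations noted above: it ignores linkage disequilibrium between genes and the tendency of large genes to pass the cut-off by chance.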


Description of gene set analysis methods

With the exception of SLAT, pathway analysis methods described here require assignment of a test statistic (or P-value) to each gene representing its association with outcome. We used the common practice of assigning each gene the most significant test statistic from all SNP associations tests for the gene [8,9].

Input for EASE requires that genes significantly associated with outcome are distinguished from all other genes, using a pre-specified cut-off (e.g., P ≤ 0.05). Enrichment for significant genes in each pathway is then tested using the EASE score, a modified Fishers Exact probability representing the upper bound of jackknife Fisher exact probabilities. Global FDRs are calculated to account for multiple comparisons [14].

GenGen is adapted from Gene Set Enrichment Analysis (GSEA), used originally for microarray analysis [17]. Genes are ranked in descending order according to size of the initial association statistic. A weighted Kolmogorov-Smirnov-like running sum statistic is then calculated that reflects over representation of higher ranked genes in a pathway in the gene list. The weight takes on the values of SNP test statistics representing genes in the list. A normalized enrichment statistic (NER) is calculated for observed data, followed by phenotype permutations which give permuted NER values, creating the null distribution from which pathway association P-values are determined. FDRs are used to account for multiple comparisons [9].

The modified SUMSTAT (mSUMSTAT) approach, that we developed, is adapted from Tintle et al. [11]. The approach is similar to that used in GenGen but the pathway test statistic is calculated by averaging χ² test statistics within each pathway. The equation below shows the calculation of the normalized mean value of the observed χ² statistic, where S refers to a specific gene set and p denotes the permutation. The normalized permuted statistic is calculated the same way.

$$\frac{\chi^2_S/n - \mathrm{mean}\left(\chi^2_{S,p}/n\right)}{\mathrm{SD}\left(\chi^2_{S,p}/n\right)}$$

The p-value is determined by comparing the normalized mean value of the χ² statistic to the normalized permuted mean χ² statistics [18] and a FDR is calculated according to Wang et al. [9]. This method contrasts to that of Tintle et al. [11] through the calculation of a normalized test statistic, and use of phenotype permutations instead of randomly selected gene sets to determine the null distribution.

The SLAT program calculates P-values for association of SNPs with outcome for a defined pathway (as in this study), gene, or region. P-values reaching a specific threshold are combined into a test statistic. The statistic is calculated for observed and phenotype permuted data which permits determination of a pathway P-value [12]. No particular method for adjusting for multiple comparisons is provided by the authors. (We used the Benjamini-Hochberg correction to calculate FDRs for this method).

Analysis details

SNPs were excluded when the P-value for HWE in controls was ≤ 0.001 (consistent with previous pathway analysis studies [9,11]), the minor allele frequency was <1%, and genotype was missing in >5% of individuals. In addition, SNPs from the HumanHap550 chip that were used in the German GWAS were excluded if there was no corresponding SNP from MD Anderson (the study with which German GWAS data was combined). Subjects with sex discrepancies (based on heterozygosity rate at chromosome X) and those with >10% missing SNPs were excluded.

Unconditional logistic regression, using PLINK 1.05 [19], generated allelic χ² values for SNPs for each data set, CETO and GRMD, for use in the programs EASE, GenGen and mSUMSTAT. Permuted SNP association results were generated for GenGen and mSUMSTAT using 1000 logistic regression runs with case-control status randomly shuffled for each run. Logistic regression analyses were adjusted for sex, age and country of origin. The SLAT program performed its own SNP association tests for its pathway analysis, which does not include adjustment for covariates.

SNPs were assigned to a gene if they were within 20 kb of the gene. A SNP to gene linking file and GO level 4 pathway database file, both obtained from the GenGen web site, were used to link SNPs, genes and pathways. Only pathways with 15 to 200 genes were included to avoid testing overly large or small GO pathways [6]. The χ² of the most significant SNP at a gene was assigned to that gene. This χ² statistic was used to assign the cut-off value of P ≤ 0.05 to identify strongly associated genes for analysis with EASE. The same χ² statistic was used in the calculation of the pathway test statistics for GenGen and mSUMSTAT. All SNPs at each gene were used as input for the calculation of pathway P-values for SLAT.

The influence of gene size on pathway ranking of the four pathway analysis methods was investigated using linear regression analysis (SAS 9.2: SAS Institute Inc., Cary, North Carolina). Median gene size (median number of SNPs per gene) was calculated for each top pathway and included as the outcome variable in a model with pathway analysis method (treated as a categorical variable and coded into four dummy variables) as the main effect and number of genes per pathway included as a potential confounder.

Results

Table 2 shows the number of significant pathways identified by the four pathway analysis methods in CETO and GRMD using a FDR of ≤ 0.05 as the criterion to determine statistical significance. EASE identified 10 pathways as associated with lung cancer risk in the two data sets, 7 in CETO, 5 in GRMD, with two significant pathways common to both data sets. The mSUMSTAT method identified 8 pathways as significant, 8 in CETO, 1 in GRMD with one being common to both data sets. SLAT identified five pathways as significant, three in GRMD and two in CETO.

Table 2. Number of significant pathway associations (using FDR ≤ 0.05) for Central Europe-Toronto (CETO) and Germany-MD Anderson (GRMD) by pathway analysis method.

Data set | EASE | GenGen | mSUMSTAT | SLAT
CETO | 7 | 0 | 8 | 2
GRMD | 5 | 0 | 1 | 3
Both CETO and GRMD | 2 | 0 | 1 | 0
Total | 10 | 0 | 8 | 5

doi:10.1371/journal.pone.0031816.t002

Since EASE identified 10 significant pathways, more than the other methods, Table 3 shows the top 10 pathways identified in CETO and GRMD by all pathway analysis methods (taken from lists comprising results from both data sets). An FDR of ≤ 0.05 in both data sets was used as the criterion for a replicated result. Transmission of nerve impulse and the Ras guanyl nucleotide exchange factors pathways were identified by EASE as associated with lung cancer in CETO and GRMD (Table 3). The

PLoS ONE | www.plosone.org 3 February 2012 | Volume 7 | Issue 2 | e31816 Pathway Analysis Comparison with Lung Cancer GWAS acetylcholine receptor activity pathway was identified as associated FDR#0.001) but FDRs fell well below the 0.05 significance level with lung cancer in CETO and GRMD by mSUMSTAT. This in GRMD (CHRNA5: P = 0.002, FDR = 0.37). Removing CHRNA5 pathway contains the CHRNA3-CHRNA5-CHRNB4 gene cluster at from the GenGen analysis resulted in reduced strength of association 15q25, where GWAS have identified several SNPs associated with in CETO (P = 0.003, FDR, = 0.48) but virtually no change in lung cancer risk [1,2,5]. This pathway was the highest ranked GRMD (P = 0.01, FDR, = 0.41). However, removal of the entire pathway in CETO using the GenGen method (FDR = 0.19) gene cluster resulted in marked reduction of the FDR and loss of (Table 3). In GRMD, this pathway was ranked 16th among all significance in the two data sets for both pathway analysis methods pathways (not shown) by GenGen. The FDR was 0.43, but it was (mSUMSTAT without CHRNA3-CHRNA5-CHRNB4: CETO: accompanied by a nominally significant P-value (P = 0.004). Other P = 0.19, FDR = 0.56 GRMD: P = 0.71, FDR = 0.82; GenGen significant pathway associations in CETO had corresponding without CHRNA3-CHRNA5-CHRNB4 CETO: P = 0.11, FDR = nominally significant P-values in GRMD, specifically: heme 1.00 GRMD: P = 0.32, FDR = 0.76). metabolic process, porphyrin metabolic process, pigment biosyn- We further explored the association of this pathway with risk by thetic process, and 4 iron, 4 sulfur cluster binding using graphing odds ratios and 95% confidence limits for acetylcholine mSUMSTAT; and low-density lipoprotein binding using EASE. receptor pathway SNPs and genes produced by unconditional SLAT identified regulation of cell migration as significantly logistic regression analyses. 
Figure 1A shows odds ratios for associated with lung cancer in GRMD, with a corresponding specific SNPs assigned to genes (i.e., the most significant SNP for nominally significant P-value in CETO (Table 3). each gene) for the CETO analysis and for comparison, odds ratios Other than the acetylcholine receptor activity pathway, which for these same SNPs for GRMD. In addition to SNPs in the was identified by both mSUMSTAT and GenGen as a top CHRNA3-CHRNA5-CHRNB4 gene cluster, a SNP at CHRNA2 pathway, there were few top pathways identified by more than one showed a nominally significant association with risk in both data method. Chloride ion binding was associated with risk in CETO sets (CETO: P = 0.012; GRMD: P = 0.022). Figure 1B shows odds according to EASE and GenGen. Complement activation-classical ratios for the most significant SNP assigned to each gene in either pathway was associated with lung cancer risk in CETO according data set (i.e., the actual SNPs used in pathway analyses in the two to GenGen, mSUMSTAT and SLAT. Heme metabolic process data sets). Additional nominally significant associations were found was identified as associated with risk in CETO by GenGen and for CHRM3 (CETO: P = 0.003; GRMD: P = 0.028), CHRNA7 mSUMSTAT. Chromatin assembly was associated with lung (CETO: P = 0.016; GRMD: P = 0.009), and CHRNA4 (CETO: cancer risk in CETO according to mSUMSTAT and SLAT. P = 0.012; GRMD: P = 0.038) in both data sets. In total, 6 of 8 Interleukin-2 biosynthetic process was identified as associated with genes associated with risk in CETO were associated with risk in risk by EASE and GenGen in GRMD. Regulation of cell GRMD, a result greater than expected by chance given the migration was associated with risk for GRMD according to EASE number of SNPs at each gene. and SLAT (Table 3). 
Anion transport was identified as a top pathway by mSUMSTAT but 35 of 102 genes in this pathway Discussion were included in the chloride ion binding pathway (64 genes), identified as a top pathway by EASE and GenGEN (gene number Four pathway analysis methods were compared by using each to in pathways calculated following SNP mapping). Likewise, 16 of test association of GO level 4 pathways with lung cancer risk in 18 genes in the interleukin 2 pathway (EASE) are included among two lung cancer GWAS data sets. Methods compared included the 65 genes in the cytokine metabolic pathway (GenGen). Other four gene set enrichment approaches, EASE, GenGen, mSUM- top pathways identified by different methods shared genes but the STAT and a p-value combination approach, SLAT. After overlap was 12% or less based on shared genes for the larger of the adjustment for multiple comparisons using an FDR of less than two pathways (e.g., 20 of 50 positive regulation of phosphorous or equal to 0.05 as the criterion for a significant association, EASE pathway genes (GenGen) are included in the growth factor and mSUMSTAT identified more pathways associated with lung metabolism pathway (SLAT), which has 165 genes). cancer risk across the two datasets (10 and 8 respectively) than did The EASE method selected pathways with greater gene size GenGen (no pathways), or SLAT (5 pathways). EASE and (defined using the median number of SNPs per gene) than the mSUMSTAT also identified pathways that were significantly other methods. The average gene size for the top EASE pathways associated with risk in both data sets: transmission of nerve impulse shown in Table 3 was 12.2 SNPs per gene, whereas average top and Ras guanyl nucleotide exchange factor by EASE; and the pathway gene size was 8.4 for GenGen, 7.4 for mSUMSTAT, and acetylcholine receptor activity pathway by mSUMSTAT. There 8.7 for SLAT. 
Regression analysis, where pathway analysis was limited agreement among the different methods in the method was coded into four dummy variables, produced a identification of top ranked pathways. Comparing genes among statistically significant association between the EASE method and top pathways chosen by each method showed only a modest gene size (P = 0.02). degree of overlap. As two methods identified acetylcholine receptor activity as a top In comparing pathway analysis methods, we examined whether pathway we examined this association in more detail. SNPs near the the number of SNPs per gene in pathways influenced the selection CHRNA3-CHRNA5-CHRNB4 gene cluster showing strong associa- of top pathways. The results indicated EASE, identified top tions with lung cancer risk, are in strong LD, and there is overlap pathways with a significantly greater median number of SNPs per among SNP test statistics assigned to these genes (i.e., the test statistic gene than the other methods. This result is not unexpected. For all for the same SNP was assigned to both CHRNA5 and CHRNA3). gene set enrichment methods we used the common approach of These pathway characteristics may bias pathway association signals assigning the most significant SNP to represent each gene. Genes [20,21] To evaluate whether the pathway analysis was driven by a with more SNPs, generally large genes, are more likely to be single associated gene or the gene cluster, we examined the effect of assigned a SNP with a high association statistic, which can lead to removing the CHRNA5 gene (where the putative causal variant is over estimation of significance of pathways with large genes (gene located) and the entire gene cluster from analyses using mSUM- size bias) [8,9]. We acknowledge that large genes might be more STAT and GenGen. 
Removing CHRNA5 had no influence on likely to harbour multiple variants which are truly associated with mSUMSTAT results in CETO (CHRNA5:P, = 0.001, outcome, but our comments focus on statistical properties of the


Table 3. Comparison of FDRs and P-values (in brackets) for Central Europe-Toronto (CETO) and Germany-MD Anderson (GRMD) for top lung cancer risk associated pathways identified by different analysis methods using GO level 4 pathways.

EASE
GO pathway | CETO FDR (P-value) | GRMD FDR (P-value)
nerve impulse | <0.001 (<0.001) | <0.001 (0.005)
Ras-GEF† | <0.001 (0.007) | <0.001 (<0.001)
LDL binding† | <0.001 (0.005) | 0.400 (0.021)
chloride ion binding | <0.001 (0.003) | 1.00 (0.065)
interleukin-2 biosynthetic | 1.00 (0.419) | <0.001 (<0.001)
carboxylic acid transport | <0.001 (0.005) | 0.955 (0.498)
regulation of cell migration | 1.00 (0.456) | <0.001 (0.006)
sensory organ development | 1.00 (0.487) | <0.001 (0.002)
phospholipid transporter | <0.001 (<0.001) | 1.00 (0.280)
muscle contraction | <0.001 (<0.001) | 0.394 (0.085)

GenGen
GO pathway | CETO FDR (P-value) | GRMD FDR (P-value)
acetylcholine receptor† | 0.194 (0.001) | 0.430 (0.004)
immune response† | 1.00 (0.735) | 0.271 (0.002)
chloride ion binding | 0.316 (0.002) | 0.707 (0.213)
interleukin-2 biosynthetic | 1.00 (0.446) | 0.294 (0.002)
cytokine metabolic | 1.00 (0.278) | 0.314 (0.003)
heme metabolic | 0.381 (0.001) | 0.506 (0.056)
complement activation† | 0.361 (0.004) | 0.819 (0.567)
somatic recombination† | 1.00 (0.308) | 0.358 (0.002)
peptide receptor† | 1.00 (0.453) | 0.370 (0.009)
positive reg phosphorous† | 1.00 (0.639) | 0.380 (0.018)

mSUMSTAT
GO pathway | CETO FDR (P-value) | GRMD FDR (P-value)
acetylcholine receptor† | <0.001 (<0.001) | 0.009 (<0.001)
heme metabolic | 0.003 (<0.001) | 0.245 (0.005)
porphyrin metabolic | 0.010 (<0.001) | 0.329 (0.029)
pigment biosynthetic | 0.033 (<0.001) | 0.290 (0.043)
4 iron, 4 sulfur cluster binding | 0.042 (0.001) | 0.334 (0.011)
chromatin assembly | 0.036 (0.001) | 0.813 (0.669)
complement activation† | 0.034 (0.003) | 0.737 (0.536)
antigen processing† | 0.032 (0.003) | 0.570 (0.280)
mRNA binding | 0.052 (0.001) | 0.316 (0.051)
anion transport | 0.055 (<0.001) | 0.735 (0.557)

SLAT
GO pathway | CETO P-value | GRMD P-value
regulation of cell migration | 0.0048 | 0.0002‡
growth factor activity | 0.2727 | <0.0001‡
gland development | 0.6215 | 0.0003‡
glycoprotein metabolic | <0.0001‡ | 0.1223
chromatin assembly | 0.0002‡ | 0.7325
complement activation† | 0.0015 | 0.2004
response to steroid hormone | 0.0014 | 0.0418
regulation of axonogenesis | 0.0383 | 0.0014
retrograde transport, GER† | 0.0007 | 0.4841
fatty acid regulation | 0.0009 | 0.4935

Bold: significant after adjustment for multiple comparisons (FDR ≤ 0.05) in both CETO and GRMD. Bold with italics: significant after adjustment for multiple comparisons in one data set (FDR ≤ 0.05), nominal significance in other (P ≤ 0.05). Underline: top pathways identified by more than one pathway analysis method within a data set.
†Abbreviated GO category name. Full category names as follows: Ras-GEF: Ras guanyl-nucleotide exchange factor; LDL binding: low-density lipoprotein binding; acetylcholine receptor: acetylcholine receptor activity; immune response: adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains; complement activation: complement activation, classical pathway; somatic recombination: somatic recombination of immunoglobulin gene segments; peptide receptor: G-protein coupled peptide receptor activity; positive reg phosphorous: positive regulation of phosphorus metabolic process; antigen processing: antigen processing and presentation of peptide antigen via MHC Class I; retrograde transport, GER: retrograde vesicle mediated transport, Golgi to endoplasmic reticulum.
‡Significant based on Benjamini-Hochberg FDR calculation.
doi:10.1371/journal.pone.0031816.t003
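The Table 3 footnote refers to a Benjamini-Hochberg FDR calculation for the SLAT P-values. The following is a minimal Python sketch of the standard Benjamini-Hochberg adjustment, not the authors' own code; the function name `benjamini_hochberg` and the example P-values are illustrative assumptions. In a real analysis the adjustment would be applied to the P-values of all tested pathways, not only the top ten shown in the table.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (FDRs) in the input order.

    Illustrative helper (hypothetical name); implements the standard
    step-up adjustment: p_(k) * m / k, made monotone from the largest
    P-value downwards.
    """
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value (rank m) to the smallest (rank 1),
    # keeping a running minimum so adjusted values never increase
    # as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up P-values for four hypothetical pathways.
    example = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(example, benjamini_hochberg(example)):
        print(f"P = {p:.3f}  FDR = {q:.3f}  significant = {q <= 0.05}")
```

A pathway would then be called significant when its adjusted value is at or below the 0.05 criterion used throughout the tables.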

Figure 1. Comparison of odds ratios for the acetylcholine receptor pathway showing: A) the most significant SNP for each gene used in the Central Europe-Toronto analysis and odds ratios for the same SNPs for Germany-MD Anderson; B) the most significant SNP assigned to each gene in either data set (i.e., the actual SNPs used in pathway analyses in the two data sets). Chromosome number (Chr) and genes for both graphs are shown on left. (Central Europe-Toronto SNPs: solid fill; Germany-MD Anderson matching SNPs: no fill; Germany-MD Anderson top SNP (differing from Central Europe-Toronto): grey fill.) A) Reference allele same in both Central Europe-Toronto and Germany-MD Anderson but chosen to show positive association for Central Europe-Toronto. B) Reference allele always chosen to show positive association. CHRNA5 is excluded as SNPs are identical to those representing CHRNA3. Odds ratios adjusted for age, sex and country of study. doi:10.1371/journal.pone.0031816.g001

methods, specifically the potential for false positives resulting from gene size bias. EASE, which uses a relatively simple approach based on Fisher's exact test, is susceptible to this bias. Normalization routines and phenotype permutations incorporated into GenGen and mSUMSTAT protect against this bias [6,22]. SLAT is also protected against this bias as it uses all SNPs in a pathway for analysis and incorporates a phenotype shuffling routine [12]. The more robust design of GenGen, mSUMSTAT and SLAT provides an additional benefit, as these methods account for correlation among SNPs within pathways.

A critical aspect of this comparison was the use of replication of top pathways across CETO and GRMD to help evaluate the relative performance of these methods. However, based on an FDR of ≤0.05, few replicated associations were found. Lack of study power may in part account for the small number of replicated associations. In particular, GRMD (cases = 1639, controls = 1618) may have had insufficient sample size to detect associations found in CETO (cases = 2258, controls = 3027). Heterogeneity between data sets might also have contributed to the small number of replicated associations, as the German sample was restricted to subjects under age 50, and the MD Anderson GWAS included only ever smokers. Therefore, GRMD subjects were younger and had a higher proportion of ever smokers compared to CETO subjects.

Among the three methods (GenGen, mSUMSTAT and SLAT) that are robust against gene size bias, only mSUMSTAT identified a replicated association. This was for the acetylcholine receptor activity pathway. The association of this pathway with risk is not unexpected, as several SNPs at or near the CHRNA3-CHRNA5-CHRNB4 gene cluster are associated with both lung cancer risk [1,2,5] and nicotine addiction [5,23,24]. It is of interest that the GenGen method also identified acetylcholine receptor activity as the top ranked pathway in CETO and one of the most highly ranked pathways in GRMD, although the result was not significant in either data set after correcting for multiple comparisons using the FDR. We note that the associations found for this pathway were driven by the CHRNA3-CHRNA5-CHRNB4 gene cluster, as demonstrated by the dramatic reduction of strength of association (according to the FDR) found for both the mSUMSTAT and GenGen methods when data were reanalyzed with these three genes removed from the pathway. This may complicate the interpretation of the observed association as, ideally, significant pathways should not be identified from a signal that might ultimately represent a single gene or variant [20,21]. We point out, however, that there are two independent risk associated loci in this region [25] and it is currently not clear which genes in the region are causally related to disease risk. It is preferable then that pathways such as these are identified to be associated with outcome by the analysis method, and the researcher can then follow up with additional exploratory analyses. Further investigation of this pathway did suggest that allowing the same SNP to represent both CHRNA5 and CHRNA3 in the analysis overestimated significance in the GRMD data set for mSUMSTAT and the CETO data set for GenGen. Results from analyses that excluded CHRNA5 are likely the most appropriate for this pathway.

For the purpose of further comparing pathway associations across data sets, we used a less restrictive criterion for a replicated pathway association (a significant FDR in one data set and a nominally significant association (P ≤ 0.05) in the second). This permitted additional associations to be identified, although with less confidence than those identified using the original criterion. The mSUMSTAT method found four potential risk associated pathways with a significant FDR in CETO and nominally significant P-values in GRMD: heme metabolic process, porphyrin metabolic process, pigment biosynthesis and 4 iron, 4 sulfur cluster binding. The heme metabolic and porphyrin metabolic pathways show a high degree of overlap. All four of these pathways include IREB2, which is in the same region of strong LD that includes the CHRNA3-CHRNA5-CHRNB4 cluster. SLAT identified one pathway, regulation of cell migration, using this same criterion.

Overall, our results (along with insights from other comparisons discussed below) suggest mSUMSTAT should be considered when choosing a method for pathway analysis. Lack of strong replication of pathway associations makes it difficult to evaluate GenGen and SLAT against one another. However, the GenGen approach appears to have some advantages. GenGen results provided some support for an association of the acetylcholine receptor pathway with risk, and like mSUMSTAT this method allows for the incorporation of covariates, whereas the SLAT program does not have this capability. Finally, GenGen is commonly used and has provided other plausible associations in pathway analyses of GWAS data sets [10]. On the other hand, the utility of SLAT is difficult to assess given our results and further evaluation of this method is needed. The rest of the discussion focuses on mSUMSTAT and GenGen.

Our mSUMSTAT method contrasts with that of Tintle et al. [11] through calculation of a normalized test statistic and use of phenotype permutations instead of randomly selected gene sets to determine the null distribution. These changes were introduced to address gene size bias and maintain the correlation structure among SNPs in a pathway.

Some simulation results suggest that approaches that use the sum or average of the χ² as a pathway test statistic will be more powerful than those that use the weighted Kolmogorov-Smirnov-like running sum statistic incorporated into GenGen and related GSEA approaches. Tintle et al. found that the original SUMSTAT test statistic was more powerful than a GSEA approach in a comparison where random gene sets were used to construct the null distribution for both methods [11]. Efron and Tibshirani found generally lower P-values using mean test statistics when compared to GSEA in simulated gene expression analyses [18]. Their analysis used a t-test instead of a χ² statistic, allowing for gene expression comparisons of two groups. Permutation and normalization approaches were the same as used here, except normalization for GSEA also incorporated means and standard deviations calculated from permutations with random gene sets. Our results are consistent with these studies in that mSUMSTAT identified several significant associations in CETO and GRMD (with one of these replicated in both data sets), while GenGen did not, suggesting that mSUMSTAT may have greater power to detect associations.

Since the strongest association found by GenGen and mSUMSTAT was for the acetylcholine receptor pathway, we graphed odds ratios and confidence limits to further explore the pathway association. Despite weak association signals found for these regions when the CHRNA3-CHRNA5-CHRNB4 cluster was removed from analyses, the graphical presentation of results suggests that SNPs outside of this gene cluster may contribute to the association, as suggested by replicated associations across the two data sets. This association appeared more convincing when comparing the most significant SNPs representing each gene across the two data sets (gene based comparison) as opposed to comparing the most significant SNPs at each gene in CETO to the same SNPs in GRMD (variant based comparison). Better evidence for replication could result from a gene based approach versus a SNP based approach if multiple SNPs capture the causal variant(s) more completely than single SNPs for some pathway genes. This can be advantageous to pathway analysis approaches, which can rely on gene based association signals to better replicate pathway associations.

In summary, this study compared several different pathway analysis approaches in two lung cancer GWAS data sets comprising four studies. Difficulties in replicating associations across studies hindered our comparison and we cannot clearly establish one pathway analysis method as superior to the others. However, the mSUMSTAT approach did demonstrate several strengths, such as a highly plausible association with the acetylcholine receptor pathway and several additional suggestive associations, while accounting for correlation among SNPs and gene size bias. Since different pathway analysis methods can produce different results using the same data set (as was seen here), it is best to use more than one method when examining pathway associations with disease risk [26]. We suggest that the mSUMSTAT method could be used in combination with other methods, such as the better known GenGen approach, in pathway analysis investigations.

Acknowledgments

We would like to acknowledge study PIs from the central Europe study: D. Zaridze, Cancer Research Centre, Moscow, Russia; N. Szeszenia-Dabrowska, Institute of Occupational Medicine, Lodz, Poland; J. Lissowska, M. Sklodowska-Curie Memorial Cancer Center and Institute of Oncology, Warsaw, Poland; P. Rudnai, National Institute of Environmental Health, Budapest, Hungary; E. Fabianova, Specialized Institute of Hygiene and Epidemiology, B. Bystrica, Slovakia; D. Mates, Institute of Public Health, Bucharest, Romania; V. Bencko, Charles University in Prague, Czech Republic; L. Foretova, Masaryk Memorial Cancer Institute, Brno, Czech Republic; V. Janout, Palacky University, Olomouc, Czech Republic.

Author Contributions

Conceived and designed the experiments: GF RJH GL LB. Analyzed the data: GF. Contributed reagents/materials/analysis tools: PB CIA MRS HB HEW AR RJH. Wrote the paper: GF.
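The mSUMSTAT normalization discussed in the paper (a pathway's mean χ² statistic, standardized by the mean and standard deviation of the same quantity over phenotype permutations) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample standard deviation is assumed, the P-value is computed against the pathway's own permutations, and the "+1" correction in the permutation P-value is a common convention that the paper does not specify.

```python
import statistics


def pathway_mean(gene_stats, pathway):
    """Mean chi-square statistic over the genes in one pathway.

    gene_stats: dict mapping gene name -> chi-square of its representative SNP.
    pathway: iterable of gene names (hypothetical input format).
    """
    return statistics.fmean(gene_stats[g] for g in pathway)


def normalized_statistics(observed, permuted):
    """Standardize observed and permuted pathway means.

    observed: pathway mean from the real phenotype labels.
    permuted: list of pathway means, one per phenotype permutation.
    Both are centred and scaled by the mean and SD of the permuted values.
    """
    mu = statistics.fmean(permuted)
    sd = statistics.stdev(permuted)  # sample SD; an assumption of this sketch
    norm_obs = (observed - mu) / sd
    norm_perm = [(x - mu) / sd for x in permuted]
    return norm_obs, norm_perm


def permutation_p_value(norm_obs, norm_perm):
    """One-sided permutation P-value with a +1 correction (assumed convention)."""
    exceed = sum(1 for x in norm_perm if x >= norm_obs)
    return (exceed + 1) / (len(norm_perm) + 1)


if __name__ == "__main__":
    # Made-up numbers for a single hypothetical pathway.
    obs = pathway_mean({"GENE_A": 8.0, "GENE_B": 4.0}, ["GENE_A", "GENE_B"])
    perms = [1.0, 2.0, 3.0, 4.0, 5.0]
    z, z_perm = normalized_statistics(obs, perms)
    print(z, permutation_p_value(z, z_perm))
```

Because phenotype labels rather than gene sets are permuted, each permuted pathway mean is computed on the same genes and SNPs as the observed one, which is how this design keeps the correlation structure among SNPs intact.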

PLoS ONE | www.plosone.org 7 February 2012 | Volume 7 | Issue 2 | e31816 Pathway Analysis Comparison with Lung Cancer GWAS

References
1. Amos CI, Wu X, Broderick P, Gorlov IP, Gu J, et al. (2008) Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat Genet 40: 616–622.
2. Hung RJ, McKay JD, Gaborieau V, Boffetta P, Hashibe M, et al. (2008) A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature 452: 633–637.
3. McKay JD, Hung RJ, Gaborieau V, Boffetta P, Chabrier A, et al. (2008) Lung cancer susceptibility locus at 5p15.33. Nat Genet 40: 1404–1406.
4. Rafnar T, Sulem P, Stacey SN, Geller F, Gudmundsson J, et al. (2009) Sequence variants at the TERT-CLPTM1L locus associate with many cancer types. Nat Genet 41: 221–227.
5. Thorgeirsson TE, Geller F, Sulem P, Rafnar T, Wiste A, et al. (2008) A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature 452: 638–642.
6. Wang Y, Broderick P, Webb E, Wu X, Vijayakrishnan J, et al. (2008) Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat Genet 40: 1407–1409.
7. Zhao J, Gupta S, Seielstad M, Liu J, Thalamuthu A (2011) Pathway-based analysis using reduced gene subsets in genome-wide association studies. BMC Bioinformatics 12: 17.
8. Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet 86: 6–22.
9. Wang K, Li M, Bucan M (2007) Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet 81: 1278–1283.
10. Wang K, Zhang H, Kugathasan S, Annese V, Bradfield JP, et al. (2009) Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease. Am J Hum Genet 84: 399–405.
11. Tintle NL, Borchers B, Brown M, Bekmetjev A (2009) Comparing gene set analysis methods on single-nucleotide polymorphism data from Genetic Analysis Workshop 16. BMC Proc 3 Suppl 7: S96.
12. De la Cruz O, Wen X, Ke B, Song M, Nicolae DL (2010) Gene, region and pathway level analyses in whole-genome studies. Genet Epidemiol 34: 222–231.
13. Luo L, Peng G, Zhu Y, Dong H, Amos CI, et al. (2010) Genome-wide gene and pathway analysis. Eur J Hum Genet 18: 1045–1053.
14. Hosack DA, Dennis G, Jr., Sherman BT, Lane HC, Lempicki RA (2003) Identifying biological themes within lists of genes with EASE. Genome Biol 4: R70.
15. Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, et al. (2009) A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet 85: 679–691.
16. Sauter W, Rosenberger A, Beckmann L, Kropp S, Mittelstrass K, et al. (2008) Matrix metalloproteinase 1 (MMP1) is associated with early-onset lung cancer. Cancer Epidemiol Biomarkers Prev 17: 1127–1135.
17. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545–15550.
18. Efron B, Tibshirani R (2007) On testing the significance of sets of genes. Ann Appl Stat 1: 107–129.
19. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575.
20. Holmans P (2010) Statistical methods for pathway analysis of genome-wide data for association with complex genetic traits. Adv Genet 72: 141–179.
21. Wang L, Jia P, Wolfinger RD, Chen X, Zhao Z (2011) Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics 98: 1–8.
22. Holmans P, Green EK, Pahwa JS, Ferreira MA, Purcell SM, et al. (2009) Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet 85: 13–24.
23. Berrettini W, Yuan X, Tozzi F, Song K, Francks C, et al. (2008) Alpha-5/alpha-3 nicotinic receptor subunit alleles increase risk for heavy smoking. Mol Psychiatry 13: 368–373.
24. Saccone SF, Hinrichs AL, Saccone NL, Chase GA, Konvicka K, et al. (2007) Cholinergic nicotinic receptor genes implicated in a nicotine dependence association study targeting 348 candidate genes with 3713 SNPs. Hum Mol Genet 16: 36–49.
25. Broderick P, Wang Y, Vijayakrishnan J, Matakidou A, Spitz MR, et al. (2009) Deciphering the impact of common genetic variation on lung cancer risk: a genome-wide association study. Cancer Res 69: 6633–6641.
26. Wang K, Li M, Hakonarson H (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11: 843–854.

APPENDIX D LIST OF MENTORS


Roles as per Grant Specific contributions to STAGE Committee Membership Team members by Affiliation(s) Changes discipline Attends ISSS Attends Journal Steering (Co-) Attends meetings / Club and GAW18 Admissions Co-PI Co-App Collaborator Mentors ISSS dinner Seminar Participation Teaching Full Member Ad hoc

Total Participation 5 5 26 19 28 22 13 13 17 10 2 9 10

1. Genetic and Molecular Epidemiology

1.1 Genetic & Molecular Epidemiology

Gagnon, France (NOMINATED PI) • Dalla Lana School of Public Health (DLSPH) (Div. of Epidemiology), UofT. X X X X X X X X
Hu, Howard • DLSPH, UofT. Changes: Dec 2012 joins STAGE mentor team. X
Hung, Rayjean J. • Samuel Lunenfeld Research Institute of Mount Sinai Hospital; DLSPH (Div. of Epidemiology), UofT. X X X X X X X X
Narod, Steven • DLSPH (Div. of Epidemiology), UofT; Women’s College Research Institute. X X X X X
Palmer, Lyle • Ontario Health Study, Ontario Institute for Cancer Research; The Canadian Partnership for Tomorrow Project; DLSPH (Divs. of Epidemiology and Biostatistics), UofT; Samuel Lunenfeld Research Institute, Mount Sinai Hospital; Cancer Care Ontario. X X
Parra, Esteban J. • Dept. of Anthropology, UofT. Changes: On leave until Feb 2013. X X X X X
Paré, Guillaume • Dept. of Pathology & Molecular Medicine, McMaster University; Genetic & Molecular Epidemiology Laboratory, McMaster University. X X X
Paterson, Andrew D. • Genetics and Genome Biology Program, SickKids; DLSPH (Div. of Biostatistics), UofT; Institute of Medical Science, Faculty of Medicine, UofT; Neuroscience Program, Psychiatry, UofT. X X X X X X X X

1.2 Epidemiology

Cotterchio, Michelle • Population Studies & Surveillance, Cancer Care Ontario; DLSPH (Div. of Epidemiology), UofT. X
Knight, Julia • Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute, Mount Sinai Hospital; DLSPH (Div. of Epidemiology), UofT. X X X X X
McLaughlin, John • Prosserman Centre for Health Research, Samuel Lunenfeld Research Institute; Public Health Ontario; DLSPH (Div. of Epidemiology), UofT. X X X X X

2. Statistical Genetics and Genomics

2.1 Statistical Genetics and Genomics

Beyene, Joseph • Dept. of Clinical Epidemiology and Biostatistics, McMaster University; Population Genomics Program, McMaster University. X X X X
Briollais, Laurent • Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto; DLSPH (Div. of Biostatistics), UofT. X X X X X X X
Bull, Shelley B. • Samuel Lunenfeld Research Institute of Mount Sinai Hospital; DLSPH (Div. of Biostatistics), UofT; Dept. of Statistics, UofT. X X X X X X X X X
Greenwood, Celia • McGill University; Jewish General Hospital, Lady Davis Institute for Medical Research. X X
Knight, Jo • Centre for Addiction and Mental Health, Dept. of Psychiatry, UofT. Changes: Apr 2012 joins STAGE mentor team. X X X X
Lemire, Mathieu • Ontario Institute for Cancer Research; DLSPH (Div. of Biostatistics), UofT. Changes: Apr 2012 accepts to replace E. Parra until Feb 2013. X X X X X X X X
Strug, Lisa J. • Child Health Evaluative Sciences, Research Institute, SickKids; DLSPH (Div. of Biostatistics), UofT. X X X X X X X X X
Sun, Lei • DLSPH (Div. of Biostatistics), UofT; Dept. of Statistics, UofT; Genetics and Genomic Biology, SickKids. X X X X X X X X X
Xu, Wei • DLSPH (Div. of Biostatistics), UofT; Princess Margaret Hospital, University Health Network. X X X X X

2.2 Bioinformatics

Jurisica, Igor • Ontario Cancer Institute; Division of Signaling Biology, University Health Network; School of Computing, Queen's University; College of Stomatology, Shanghai Jiao Tong University; IBM Centre for Advance Studies; Depts. of Computer Science and Medical Biophysics, UofT. X

2.3 Statistics

Brown, Patrick • Population Studies & Surveillance, Cancer Care Ontario, Toronto; DLSPH (Div. of Biostatistics), UofT. X
Craiu, Radu • Dept. of Statistics, UofT. X X X X X X
Lawless, Jerry F. • Dept. of Statistics and Actuarial Science, University of Waterloo. X X X X X

3. Bio-Medical Genetics

3.1 Bio-Medical Genetics

Akbari, Mohammad • Scientist, Women’s College Research Institute; Assistant Professor, Dalla Lana School of Public Health, University of Toronto (Changes: Fall 2012, teaches course for F. Gagnon while Gagnon on sabbatical; Dec 2012, joins STAGE mentor team) X

Andrulis, Irene L. • Dept. of Molecular Genetics, UofT; Dept. of Laboratory Medicine & Pathobiology, UofT; Fred A. Litwin Centre for Cancer Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital; Ontario Cancer Genetics Network, Cancer Care Ontario X X X

Barr, Cathy • Neurosciences & Mental Health, SickKids; Dept. of Psychiatry, UofT; Institute of Medical Science, Faculty of Medicine, UofT; Cell and Molecular Biology Division, Toronto Western Research Institute X X X

Bassett, Anne S. • Clinical Genetics Research Program, Clinical Research Dept., Centre for Addiction and Mental Health; Dept. of Psychiatry, UofT (Changes: Dec 2012, joins STAGE mentor team) X

Danska, Jayne • Genetics & Genome Biology, SickKids; Depts. of Immunology & Medical Biophysics, Faculty of Medicine, UofT; Institute of Medical Sciences, UofT (Changes: Dec 2012, temp. unavail. for Dec 2012 mtng) X X X X X

McPherson, John • Ontario Institute for Cancer Research X X

Liu, Geoffrey • DLSPH (Div. of Epidemiology), UofT; Dept. of Medical Biophysics, Div. of Applied Molecular Oncology & Dept of Medicine, Div. of Hematology-Oncology, UofT; Dept. of Env Health, Harvard School of Public Health X X X X

Lye, Stephen • Samuel Lunenfeld Research Institute, Mount Sinai Hospital (Changes: Aug 2012, ad hoc mentor; Dec 2012, formally invited to join STAGE mentor team) X

Pausova, Zdenka • Physiology & Experimental Medicine, Sick Kids Research Institute (Changes: Dec 2012, accepts to attend for J Danska at Dec 2012 mtng) X X X X X

Petronis, Arturas • The Krembil Family Epigenetics Laboratory, Centre for Addiction and Mental Health; Depts. of Psychiatry & Pharmacology, Institute of Medical Sciences, UofT X X X

Poussier, Philippe • Dept. of Medicine and Immunology, UofT; Molecular & Cell Biology, Sunnybrook Health Sciences Centre Research Institute X X

Rommens, Joanna • Genetics & Genome Biology, SickKids; Dept. of Molecular Genetics, UofT X X X X

Scherer, Stephen W. • Genetics & Genomic Biology, SickKids; Dept. of Molecular & Medical Genetics, UofT X X X X X

Wilson, Michael • Genetics & Genome Biology, SickKids (Changes: Dec 2012, joins STAGE mentor team) X X

3.2 Medical

Gladman, Dafna • Toronto Western Research Institute; Division of Rheumatology, UofT; University Health Network / Mount Sinai Hospital X

Inman, Robert D. • Dept. Medicine/Immunology, UofT; University Health Network X

Kennedy, James L. • Dept. of Neuroscience Research & Psychiatric Neurogenetics Section, Centre for Addiction & Mental Health; Dept. of Psychiatry & Inst of Medical Science, UofT X X X

Logan, Alexander G. • Depts. of Medicine, Health Policy, Management & Evaluation; Mount Sinai Hospital and University Health Network; Heart and Stroke/Richard Lewar Centre of Excellence, UofT X

Wither, Joan E. • Genetic & Development Division, Toronto Western Research Institute, University Health Network, Toronto; Depts. of Medicine and Immunology, UofT; Rheumatology Division, Toronto Western Hospital X

APPENDIX E MENTORS-MENTEE AGREEMENT

Appendix E

Strategic Training for Advanced Genetic Epidemiology MENTOR/MENTEE AGREEMENT

WE,

(Check box to identify the Primary Mentor)

1. Genetic and Molecular Epidemiology Mentor

Last name: Given Name(s): ☐

2. Statistical Genetics and Genomics Mentor

Last name: Given Name(s): ☐

3. Bio-Medical Genetics Mentor

Last name: Given Name(s): ☐

AND

4. CIHR STAGE Mentee (Trainee)

Last name: Given Name(s):

commit ourselves to successfully fulfill our roles as per the following mentoring relationship guidelines as set forth by the CIHR STAGE Training Program in Advanced Genetic Epidemiology:

1. To maintain this mentoring relationship for the agreed upon period of time:

Start date (mentoring will start on): Day Month Year

Expected/planned finish date (mentoring will end on): Day Month Year

2. To endeavor to maintain a consistent meeting schedule, as noted below, with the understanding that the frequency and length of these meetings may be adjusted throughout the mentorship to accommodate the needs and goals of the project and to adapt to the individual circumstances of the mentors and mentee (e.g. greater mentee independence as a result of professional growth, scheduling conflicts resulting from work, study, or personal commitments, etc.).

The trainee and the CIHR STAGE Primary Mentor will meet face-to-face ☐ Weekly ☐ Bi-weekly ☐ Monthly

for approximately hours

The trainee and all three CIHR STAGE Mentors will meet face-to-face ☐ Weekly ☐ Bi-weekly ☐ Monthly ☐ Bi-monthly

for approximately hours

3. To work toward achieving the specific aims set forth in the trainee’s CIHR STAGE Proposed Research Project Module form. These specific aims include, but are not limited to, the following:

1.

2.

3.

4.

5.

4. The Mentee agrees to abide by the following mentee-specific guidelines:
a. Take responsibility for own learning experience and keep mentors updated on progress.
b. Prepare a prioritized agenda of topics for discussion for each mentorship meeting (e.g. questions, results, issues, concerns, challenges, possible solutions, etc.).
c. Utilize mentors for feedback when appropriate.
d. Be on time for scheduled meetings, or contact the other parties beforehand if unable to attend a meeting.

5. To create an atmosphere of trust conducive to a positive mentor/mentee relationship

6. To keep all information provided during mentoring sessions confidential except as may cause others harm

7. To inform the CIHR STAGE co-directors or program coordinator of any difficulties or areas of concern that may arise in the relationship

We fully understand the guidelines for mentoring relationships included in this agreement. By signing below we agree to respect each other’s role in upholding these guidelines.

1. Genetic and Molecular Epidemiology Mentor

Signature Date

2. Statistical Genetics and Genomics Mentor

Signature Date

3. Bio-Medical Genetics Mentor

Signature Date

4. CIHR STAGE Mentee (Trainee)

Signature Date

APPENDIX F Syllabi-Integrative Courses

ANT3440H: Molecular anthropology, theory and practice

Course outline.

This course will introduce graduate students to theoretical and experimental methods in Molecular Anthropology. This field of Anthropology uses genetic information for addressing questions about the origin and evolution of our species. We will review important aspects of genome organization and describe the different types of genetic markers used in anthropological studies. A variety of experimental techniques to analyze genome variation will be reviewed in detail, and the students will have the opportunity to apply some of these methods in the laboratory. We will also discuss the application of statistical methods in human evolutionary studies. Diverse topics regarding the extent, pattern and meaning of genetic variation within and between human populations will be discussed. The course will familiarize students with how genetic data can be a powerful tool to explore longstanding questions in the field of anthropology, such as the origin of anatomically modern humans, how humans have adapted to a wide range of climatic and ecological conditions, and the relationship of human variation and disease, among others.

Syllabus.

Week 1. The human genome: structure and organization

Week 2. DNA polymorphisms
- Single Nucleotide Polymorphisms (SNPs)
- Insertion/Deletion Polymorphisms (Indels)
- Alu insertions
- Microsatellites (STRs)
- Copy number variation

Week 3. Application of molecular biology methods in human evolutionary studies

Week 4. Review of population genetics
- Mutation
- Genetic drift
- Natural selection
- Gene flow

Week 5. Exploring population history using genetic markers

Week 6. Laboratory practice. Review of DNA databases and Primer design.

Week 7. Laboratory practice. PCR

Week 8. Laboratory practice. Indel and SNP genotyping.

Week 9. Molecular Evolution

Week 10. Molecular Phylogenetics

Weeks 11-13. Student presentations on selected topics.

Evaluation.

Student evaluation will be based on a term project. The graded components of the project are: written proposal (20%), oral presentation (30%) and written report (50%).

Statistical Genetics (CHL 5224, Winter 2012)
http://fisher.utstat.toronto.edu/sun/Teaching/chl5224_index.html

First Lecture: Monday January 9, 2012

General Information

Time: Mondays, 10am - noon (please arrive on time so we can end on time, to give enough between-class time to students who are taking Math Stat. II).

Location: HSB 790. HSB: Health Sciences Building, 155 College Street (South on College and West of University).

Instructors

Lei Sun ([email protected]) Wei Xu ([email protected])

Office Hours Mondays 11:50-12:30pm (other personal on-demand time slots can also be arranged, particularly for students who are taking the Math Stat. II). There is no TA for the course.

Prerequisites and Enrollment

This is a graduate course with the following prerequisites.
• Statistics at the graduate level or consent of instructor.
• There will be one or more simulation studies (no tutorial) and use of GENEHUNTER and PLINK software packages (with tutorial).
• All participating students must register! This is a graduate school policy.

January 23: The final date to enroll in the course.
February 27: The final date to withdraw from the course.

Course Information

Teaching objectives

This course is for students of biostatistics, epidemiology and statistics with little genetics background but with some knowledge of probability and statistics. This course covers the fundamental statistical problems in genetics, with an emphasis on human genetics. The aim of the course is to provide students with the necessary background and prepare them for advanced study and research in the area of statistical genetics.

Format of instruction Lectures and computing labs

Evaluation Student evaluation will be based on homework problem sets and lab projects and overall participation in the classes.

Recommended books (no formal textbook, but lecture notes in .pdf are provided)

For basic genetics background
• Gonick L, Wheelis M (1991). Cartoon guide to genetics. Revised edition. HarperCollins.
• Virtually any genetics textbook.

For statistical genetics
• Sham P (1998). Statistics in Human Genetics. Arnold, London.
• Ziegler A, Koenig I (2006). A Statistical Approach to Genetic Epidemiology: Concepts and Applications. Wiley-VCH.
• Thomas DC (2004). Statistical Methods in Genetic Epidemiology. Oxford University Press.
• Lange K (2002). Mathematical and Statistical Methods for Genetic Analysis. 2nd edition. Springer-Verlag, New York.
• Ott J (1999). Analysis of Human Genetic Linkage. 3rd edition. Johns Hopkins University Press, Baltimore.

Course Outline (tentative, to be modified) Notes and Homework Assignments in PDF file

January 9: session 1 (Wei Xu). Introduction and overview, and administrative work.

January 16: session 2 (Lei Sun). Basic genetic terms and principles of population genetics.

January 23: session 3 (Wei Xu). Familial aggregation, single locus inheritance, and segregation analysis.

January 30: session 4 (Lei Sun). Multiple locus inheritance, map, and linkage.

February 6: session 5 (Lei Sun). Linkage mapping I (full-parametric method and allele-sharing method).

February 13: session 6 (Lei Sun). Computing lab I - application using GENEHUNTER.

February 20: University reading week, no class.

February 27: session 7 (Wei Xu). Association analysis I (LD, case-control study and TDT).

March 5: session 8 (Wei Xu). Association analysis II (Genome wide association study, haplotype analysis).

March 12: session 9 (Wei Xu). Computing lab II - application using PLINK.

March 19: session 10 (Laura Faye and Lei Sun). Principal component analysis

March 26: session 11 (Lei Sun). Multiple hypothesis testing.

April 02: session 12 (Lei Sun; Last One!). GWAS and related statistical issues.

CHL5430: Fundamentals of Genetic Epidemiology

CHL5430-FUNDAMENTALS OF GENETIC EPIDEMIOLOGY SYLLABUS

When: Fall 2012; Thursdays 16:00-18:30. Lectures start Sept. 13.

Where: 155 College St., Room HSB614

Course instructors:

Steven Narod [email protected] http://www.womensresearch.ca/researchers/core-faculty/steven-narod-md-frcpc

Mohammad Akbari [email protected] http://www.womensresearch.ca/researchers/core-faculty/mohammad-akbari,-md,-phd

Invited lecturers and guest speakers:

Lisa Strug (The Hospital for Sick Children and DLSPH Div. of Biostatistics)
Lucia Mirea (Maternal-Infant Care Research Centre (MiCare))
Sanaa Choufani (The Hospital for Sick Children)

Course description

This introductory course provides an overview of central concepts and topical issues in genetic epidemiology, providing an overall framework for investigating the role of genetic factors in the etiology of common complex disorders. This course integrates human genetics, biostatistics and epidemiology. The main course objective is to provide the common terminology and fundamental concepts underlying the design and conduct of genetic epidemiologic studies. Advanced and novel genetic epidemiology study designs and methods will not generally be discussed in depth as this goes beyond the scope of the course. The students are expected to be active participants in their learning experience. Critical appraisal and presentation of selected scientific articles in the field will be a major component of the course, as well as in-class exercises.

Target audience & Prerequisites

The course targets students with a strong interest in learning the basic elements of genetic epidemiologic studies but with minimal or no formal training in the field. This is a required course for trainees of the CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) program. The course has been designed for a group of 8 students. Introduction to Epidemiology (CHL 5401) and Biostatistics (CHL 5201) or equivalent courses are prerequisites for this course. Students who have not taken these courses must discuss their eligibility with the course coordinators.

Specific goals

1. The students will have basic understanding of the fundamental principles and concepts underlying the main study designs and methods used in genetic epidemiologic research, and their specific objectives.

2. The students will have some level of critical appraisal skills for the interpretation of scientific articles in the field of genetic epidemiology.

Required readings: Articles listed for each block. Note that some weeks will require more reading than others, so be prepared. Additional and/or more advanced reading materials will be recommended based on the specific individual interests of each student.

A few useful, but not required, textbooks
• Khoury MJ, Beaty TH, Cohen BH (1993) Fundamentals of Genetic Epidemiology. Oxford University Press
• Ziegler A and König IR (2006) A Statistical Approach to Genetic Epidemiology: Wiley-VCH
• Speicher MR, Antonarakis SE, Motulsky AG (2010) Vogel and Motulsky’s Human Genetics: Problems and approaches, 4th edition: Springer

Other relevant resources:
• CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) training program: http://www.stage.utoronto.ca/
• STAGE International Speaker Seminar Series (ISSS): http://www.stage.utoronto.ca/home/isss
• Statistical Methods for Genetics & Genomics (SMGG) Research Seminar and Journal Club: http://research.lunenfeld.ca/MITACS/DEFAULT.ASP?page=Winter%20and%20Spring%202011
• Website with a comprehensive list of pedagogical references in genetic epidemiology (M. Tevfik Dorak website): http://www.dorak.info/epi/genetepi.html
• Watch out for the 2012 Human Genetics Special Issue on “Study Designs and Methods Post-GWAS” with Guest Editors Andreas Ziegler and Yan V. Sun!

For your curiosity:
1) Careers and recruitment opportunities in genetic epidemiology: http://www.stage.utoronto.ca/about/jobs
2) International Genetic Epidemiology Society (IGES): http://geneticepi.org/content/about/what-iges

STATISTICAL METHODS FOR GENETICS & GENOMICS - RESEARCH SEMINAR AND JOURNAL CLUB
September 2012 – June 2013

TIME: 10am – 12 noon, Friday
Seminar: 10-11am
Small Group Discussion for CHL7001H: 11am – 12 noon

LOCATION: Prosserman Centre of the Samuel Lunenfeld Research Institute, 60 Murray Street, 5th floor, Room 5-102

DETAILED LECTURE & CHL7001 EVALUATION SCHEDULE (subject to revision)

September 14 10am – First Class for CHL Course Participants *** alternate location, room 5-1019 ***

September 14 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. Marjo-Riitta Jarvelin, Professor and Chair, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London, UK

Title: Genetics of Early Growth

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

*** September 20 *** deadline to submit request for CHL7001 Reading Course

September 21 10am – Organizational Meeting for Journal Club

September 28 10am – Journal Club – Shelley Bull, SLRI

Topic: “Haplotype phasing: existing methods & new developments”, Browning & Browning (2011). Nature Reviews Genetics 12:703-714.

October 5 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. Sharon Browning, Associate Professor, Department of Biostatistics, University of Washington, Seattle

Title: Identity by Descent in “Unrelated” Individuals

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

October 12 10am – Preview: International Genetic Epidemiology Society Meeting

Talk 1: Laura Faye, PhD Student, Biostatistics, DLSPH
Title: Re-ranking Next Generation Sequencing Variants for Accurate Causal Variant Identification

Reading: Li Y, Sidore C, Kang HM, Boehnke M, Abecasis G (2011). Low-coverage sequencing: Implications for the design of complex trait association studies. Genome Res. 21:940-951. More technical: Udler MS, Tyrer J, Easton DF (2010). Evaluating the power to discriminate between highly correlated SNPs in genetic association studies. Genet. Epidemiol. 34(5):463-8.

Talk 2: Andriy Derkach, PhD Student, Statistics
Title: Combining p-values from Linear and Quadratic Tests for Rare Variants provides Robust Test Statistics across Genetic Models

Reading: Early paper: Basu S, Pan W (2011). Comparison of statistical tests for disease association with rare variants. Genet Epidemiol 35:606–619. Advanced: Lee S, Wu MC, Lin X (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13(4):762-75.

October 19 No seminar – IGES Meeting

October 26 10am – What did we learn at GAW18 ?

** Please note change in location for this week: MaRS East Tower, College St, large conference room on the 12th floor, Room 12-710 **

November 2 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. Duncan Thomas, Professor and Director, Biostatistics Division, Department of Preventive Medicine, University of Southern California

Title: Two-phase Family-based Designs for Next Generation Sequencing

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

November 9 10am – No Seminar - ASHG Meeting

November 16 10am – Seminar – ASHG Highlights by Jo Knight, Vanessa Oliveira, Melissa Miller & Dave Soave, Lei Sun

November 23 10am – Journal Club – Charlie Chen, SLRI

Topic: Study Designs for Identification of Rare Disease Variants in Complex Diseases: The Utility of Family-Based Designs

Reading: Ionita-Laza & Ottman (2011) Genetics 189, 1061–1068. Risch (1990) Linkage strategies for genetically complex traits. I. Multilocus models. AJHG 46: 222–228.

November 30 10am – Seminar – Jessica Dennis, DLSPH

Title: "Becoming a Genetic Epidemiologist: From Toronto to Paris and back again"

Background Reading: Epigenome-wide association studies for common human diseases, Rakyan et al (2011), Nature Reviews Genetics, 12: 529-541 (August).

December 7 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. Howard Hu, Professor and Director, Dalla Lana School of Public Health, University of Toronto

Title: Looking behind the curtain: Lead Toxicity as a Case Study of Methodologic Challenges in Gene-Environment Interactions Research

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

*** Last week of classes *** Final date for scheduling student presentations

*******************

January 11 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. David Tregouet, Head, Institute for Cardio-Metabolism and Nutrition, Genomics Department, Pierre and Marie Curie University Campus; Research Director, Mixed Research Unit - Genomics of Venous Thrombosis, INSERM (French National Institute of Health and Medical Research)

Title: Haplotypes and Imputation, Two Complementary Tools: A Case Study on Genome-Wide Expression Studies

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

January 18 Student Presentations- Journal Club

10am – Sarah Gagliano, IMS, CAMH Topic: The ENCODE Project and its Applications

Reading: Dunham et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 7414, 57-74. News & Views article: Ecker, J.R. (2012) FORUM: Genomics ENCODE explained. Nature, 489, 7414, 52-53.

11am – Kuan Rui Tan, Biostats, DLSPH

January 25 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. Florence Demenais, Director, Mixed Research, Genetic Variation and Human Diseases, INSERM (French National Institute of Health and Medical Research)

Title: TBA

Location: The Hospital for Sick Children CDIU Multimedia Theatre Room 4132, 4th Floor, Burton Wing, 555 University Avenue

February 1 No Seminar

February 8 Student Presentations - Journal Club
10am – Stefan Konigorski, Biostats, DLSPH
11am – Naim Panjwani, Biostats, DLSPH

February 15 10 am – Research Seminar – YanYan Wu, SLRI

February 22 No seminar – Reading Week

March 1 12 noon ** please note change in time and location ** STAGE International Speaker Seminar -TBA

March 8 10am – Research Seminar – Vanessa Goncalves, CAMH

March 15 Student Presentations - Journal Club
10am – Dave Soave, Biostats, DLSPH
11am – Paul Popadiuk, Biostats, DLSPH

March 22 10am – Seminar/Journal Club – Marc Woodbury-Smith, McMaster

March 29 Good Friday Holiday

April 5 12 noon ** please note change in time and location ** STAGE International Speaker Seminar

April 12 10am – Research Seminar – Cindy Yue Jiang, Sick Kids

*** April 8-12 Last week of classes ***

April 19 (TBC) Due date for final student papers

April 19 10am – Seminar/Journal Club - TBA

April 26 No Seminar – Canadian Human & Statistical Genetics Meeting

May 3 12 noon ** please note change in time and location **

STAGE International Speaker Seminar
Dr. George Davey-Smith, Professor of Clinical Epidemiology, University of Bristol; Scientific Director of The Avon Longitudinal Study of Parents and Children (ALSPAC)

Syllabus for STA 4315: Computational Methods for Statistical Genetics

Radu V. Craiu and Lei Sun

Prerequisites and Enrollment: This is an advanced graduate course with the following prerequisites.

• CHL 5224 (Statistical Genetics) and statistics at the graduate level, or consent of instructor.
• Knowledge of the following statistical and statistical genetics concepts is essential: genotype, haplotype, linkage analysis, association studies, likelihood, estimation, hypothesis testing, multiple comparisons.
• Working knowledge of UNIX platform is necessary. Programming using R, Splus, C or C++ is required, and analysis using some of the standard computational tools for statistical genetics is also expected.

Course Information and Reading Materials: • Evaluation: Student evaluation will be based on 3-5 homework problem sets and lab projects and overall participation in classes.

Some recommended reading (no formal text book):

• Sham P (1998). Statistics in Human Genetics. Arnold, London.

• Lange K (2002). Mathematical and Statistical Methods for Genetic Analysis. 2nd edition. Springer-Verlag, New York.

• Thomas DC (2004). Statistical Methods in Genetic Epidemiology. Oxford University Press.

• Liu JS (2001). Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.

Course Outline

Session 1 Topic: Introductions. Reading: R. C. Elston and J. Stewart (1971). A General Model for the Genetic Analysis of Pedigree Data. Human Heredity, 21, pages 523-542.

Session 2 Topic: Markov, Inheritance and Identity-By-Descent (IBD) processes. Reading: Any textbook on discrete state and continuous time Markov chains.

Session 3 Topic: HMM and applications. Reading: L. R. Rabiner (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77, pages 257-286.

Session 4 Topic: HMM and applications. Reading: E. S. Lander and P. Green (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences USA, 84, pages 2363-2367. L. Kruglyak, M. J. Daly, M. P. Reeve-Daly and E. S. Lander (1996). Parametric and non-parametric linkage analysis: a unified multipoint approach. American Journal of Human Genetics, 58, pages 1347-1363.

Session 5 Topic: EM and applications. Reading: A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, pages 1-38. T. A. Louis (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B, 44, pages 226-233.

Session 6 Topic: EM and applications. Reading: X. L. Meng and D. B. Rubin (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86, pages 899-909. S. Ghosh and P. P. Majumder (2000). Mapping a quantitative trait locus via the EM algorithm and Bayesian classification. Genetic Epidemiology, 19, pages 97-126.

Session 7 Topic: Bayesian inference and MCMC. Reading: J. Liu (2001). Monte Carlo Strategies in Scientific Computing. Chapters 2, 5, 6.

Session 8 Topic: MCMC and applications

Session 9 Topic: Resampling techniques and applications. Reading: Goring H, Terwilliger JD, Blangero J. (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. American Journal of Human Genetics, 69, pages 1357-1369. Sun L, Bull SB (2005). Reduction of selection bias in genomewide genetic studies by resampling. Genetic Epidemiology, in press.

Session 10 Topic: Resampling techniques and applications.

Session 11 Topic: Genetics of diabetes complications and related traits. Reading: The International HapMap Project. Nature, 426, 2003.

Session 12 Topic: Haplotype Inference. Reading: Qin ZS, Niu T, Liu JS (2002). Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics, 71, pages 1434-1445. Niu T, Qin ZS, Liu JS (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70, pages 1242-1247. Stephens M, Smith NJ, Donnelly P (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, pages 978-989. Stephens M and Donnelly P (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 71, pages 1162-1169.

Session 13 Topic: Multiple comparisons. Reading: Benjamini, Y and Hochberg, Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS-B, 57, pages 289–300. Yekutieli, D and Benjamini, Y (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J. Statist. Plann. Inference, 82, pages 171–196. Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A., 100, pages 9440–9445.

May 10 10am – Seminar/Journal Club

May 17 No seminar – Victoria Day Weekend

May 24 10am – Seminar/Journal Club

May 31 10am – Seminar/Journal Club

June 7 12 noon ** please note change in time and location ** STAGE International Speaker Seminar

June 14 10am – Seminar/Journal Club

June 21 10am – Seminar/Journal Club

Co-LEADERS:

Shelley Bull, Professor, DLSPH
Samuel Lunenfeld Research Institute, 60 Murray Street, Room 5-226
Email: [email protected]
Phone: 416-586-8245

Andrew Paterson, Associate Professor, DLSPH
Toronto Medical Discovery Tower, 101 College St., Room 15-707
Email: [email protected]
Phone: 416-813-6994

1/4/2013 Syllabus for STA 4315: Computational Methods for Statistical Genetics

Radu V. Craiu and Lei Sun

Prerequisites and Enrollment: This is an advanced graduate course with the following prerequisites.

• CHL 5224 (Statistical Genetics) and statistics at the graduate level, or consent of instructor. • Knowledge of the following statistical and statistical genetics concepts is essential: genotype, haplotype, linkage analysis, association studies, likeli- hood, estimation, hypothesis testing, multiple comparisons. • Working knowledge of UNIX platform is necessary. Programing using R, Splus, C or C++ is required, and analysis using some of the standard computational tools for statistical genetics is also expected.

Course Information and Reading Materials: • Evaluation: Student evaluation will be based on 3-5 homework problem sets and lab projects and overall participation in classes.

Some recommended reading (no formal text book):

• Sham P (1998). Statistics in Human Genetics. Arnold, London.

• Lange K (2002). Mathematical and Statistical Methods for Genetic Analysis. 2nd edition. Springer-Verlag, New York.

• Thomas DC (2004). Statistical Mthods in Genetic Epidemiology. Ox- ford University Press.

• Liu JS (2001). Monte Carlo Strategies in Scientific Computing. Springer- Velag New York.

1 Course Outline

Session 1 Topic: Introductions. Reading: R. C. Elston and J. Stewart (1971). A General Model for the Genetic Analysis of Pedigree Data. Human Heredity, 21, pages 523-542.

Session 2 Topic: Markov, Inheritance and Identity-By-Descent (IBD) processes. Reading: Any textbook on discrete state and continuous time Markov chains.

Session 3 Topic: HMM and applications. Reading: L. R. Rabiner (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77, pages 257-286.

Session 4 Topic: HMM and applications. Reading: E. S. Lander and P. Green (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences USA, 84, pages 2363-2367. L. Kruglyak, M. J. Daly, M. P. Reeve-Daly and E. S. Lander (1996). Parametric and non-parametric linkage analysis: a unified multipoint approach. American Journal of Human Genetics, 58, pages 1347-1363.

Session 5 Topic: EM and applications. Reading: A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, pages 1-38. T. A. Louis (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B, 44, pages 226-233.

Session 6 Topic: EM and applications. Reading: X. L. Meng and D. B. Rubin (1991). Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. Journal of the American Statistical Association, 86, pages 899-909. S. Ghosh and P. P. Majumder (2000). Mapping a quantitative trait locus via the EM algorithm and Bayesian classification. Genetic Epidemiology, 19, pages 97-126.

Session 7 Topic: Bayesian inference and MCMC. Reading: J. Liu (2001). Monte Carlo Strategies in Scientific Computing. Chapters 2, 5, 6.

Session 8 Topic: MCMC and applications

Session 9 Topic: Resampling techniques and applications. Reading: Goring H, Terwilliger JD, Blangero J. (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. American Journal of Human Genetics, 69, pages 1357-1369. Sun L, Bull SB (2005). Reduction of selection bias in genomewide genetic studies by resampling. Genetic Epidemiology, in press.

Session 10 Topic: Resampling techniques and applications.

Session 11 Topic: Genetics of diabetes complications and related traits. Reading: The International HapMap Project. Nature, 426, 2003.

Session 12 Topic: Haplotype Inference. Reading: Qin ZS, Niu T, Liu JS (2002). Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics, 71, pages 1434-1445. Niu T, Qin ZS, Liu JS (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70, pages 1242-1247. Stephens M, Smith NJ, Donnelly P (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, pages 978-989. Stephens M and Donnelly P (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 71, pages 1162-1169.

Session 13 Topic: Multiple comparisons. Reading: Benjamini, Y and Hochberg, Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. JRSS-B, 57, pages 289-300. Yekutieli, D and Benjamini, Y (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J. Statist. Plann. Inference, 82, pages 171-196. Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A., 100, pages 9440-9445.
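
As a concrete illustration of the Session 13 topic, the step-up procedure from the Benjamini and Hochberg (1995) reading can be sketched in a few lines. This is a minimal sketch rather than course material: the function name `benjamini_hochberg`, the default level `q = 0.05`, and the example p-values are assumptions made here for illustration, and Python is used for brevity although the course itself expects R, Splus, C or C++.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis, in the original input order.

    Illustrative helper (name and defaults are assumptions, not from the syllabus).
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis whose p-value is among the k_max smallest.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values chosen to show the step-up behaviour.
    print(benjamini_hochberg([0.02, 0.03, 0.035, 0.04], q=0.05))
```

In this hypothetical example the two smallest p-values exceed their own thresholds (0.0125 and 0.025), yet all four hypotheses are rejected because the third and fourth meet theirs (0.0375 and 0.05). That is the feature that distinguishes a step-up procedure from a test applied to each p-value in isolation.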


APPENDIX G Steering Committee Meeting Minutes October 2012

CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) 3rd Steering Committee Meeting - Minutes
October 10, 2012 – 3:00 p.m. to 5:00 p.m. EDT

Meeting Location: University of Toronto Dalla Lana School of Public Health 155 College Street, 6th Floor, Rm. 681, Toronto, ON

Agenda
1. Review of STAGE progress to date, including timelines, trainee recruitment, and internships
2. Review Program Advisory Committee (PAC) report. Next PAC meeting on January 24-25, 2013
3. Review of STAGE materials for CIHR annual report (due on Nov 1, 2012)
4. Evaluate new mentor nominations (H. Hu, M. Wilson, M. Akbari, S. Lye and A. Bassett)
5. STAGE sustainability

Meeting Materials
• Draft of STAGE annual report, minus representative trainee papers and finances
• Summary of STAGE progress to date
• PAC report based on January 27, 2012, meeting and site visit
• CVs for Drs. A. Bassett, M. Akbari, H. Hu, S. Lye, and M. Wilson
• CIHR STAGE International Internship Program

In attendance
Steering Committee (SC): France Gagnon (Chair), Geoffrey Liu, Andrew Paterson, Shelley Bull, John McLaughlin, Lisa Strug, Rayjean Hung, Steven Narod, Lei Sun
STAGE Staff: Esther Berzunza
Regrets: Esteban Parra, Stephen Scherer

F. Gagnon
- called the meeting to order
- welcomed all present
- announced CIHR had confirmed that STAGE’s 2011-2012 report would be an annual rather than a midterm report
- invited additional topics; none brought forward
- briefly reviewed agenda items and meeting materials

Agenda Item 1. Review of STAGE progress to date

Topic: General details
Discussion:
• $1.8 M over 6 years from CIHR.
• STAGE two and a half years into operation.
• On target with spending projections.
• Small surplus of ~$30K projected for fiscal year end (March 31, 2013).
• 37 of 38 original invited mentors still on faculty.
Action: Info
Action by: --

Topic: Admissions criteria and records
Discussion:
• Graduate students registered in Epidemiology, Biostatistics and Epidemiology programs at UofT.
• PDFs from related disciplines with/committed to acquiring strong quantitative backgrounds.
• Visiting faculty transitioning into genetic epidemiology or statistical genetics.
Action: Info
Action by: --


Agenda Item 1. Review of STAGE progress to date (continued)

Topic: Admissions to date
Discussion:
• Four trainee competitions.
• Qualified trainees admitted, even if over admission targets.
• External funding secured by trainees has reduced STAGE funding commitments to them.
• 16 excellent trainees accepted: one MSc, five PhD, nine PDFs and a visiting scholar. Two of these trainees are now alumni.
• G. Liu asked if STAGE minimizes Master’s students.
• F. Gagnon explained: STAGE accepts Master’s students exceptionally, e.g. Dr. Kung, an MD involved in research (clinical fellow), wanting formal training in genetic epidemiology, study design, and statistical analysis.
Action: Info
Action by: --

Topic: Mentorship
Discussion: STAGE features supervision by mentors from three disciplines (epidemiology, biostatistics, biomedical sciences). Master’s students’ supervisory team configured to fit a course-based program with four to eight-month research practica.
Action: Info
Action by: --

Topic: Curriculum
Discussion:
• Graduate students first focus on core course requirements.
• Several courses developed for and added to STAGE.
• Some student practica include hands-on experience managing and analyzing real data, as with biostatistics and statistics trainees at Genetic Analysis Workshop 18.
• Monthly integrative sessions replaced by campus professional development courses, other sessions developed by STAGE and other CIHR-funded training programs at UofT.
• Trainee participation has been moderate.
• NIH online course on ethics a STAGE requirement. E. Berzunza tracks registration and course completion.
• Comprehensive professional development course by R. Reithmeier, Chair, Dept. of Biochem., available to trainees.
• G. Liu offered to include STAGE trainees in his PMH mock grant review sessions; four sessions/year; eight to 10 trainees/session.
• R. Hung asked STAGE annual report to recognize her course in Epidemiology of Non-Communicable Diseases instead of Cancer Epidemiology as originally listed.
Actions:
• Info (Action by: EB)
• EB to coordinate with GL (Action by: EB/GL)
• EB to change record for RH (Action by: EB)

Topic: International internships
Discussion: Three STAGE-funded international internships to date:
• D. Brenner with P. Brennan (R. Hung’s collaborator) at International Agency for Research on Cancer (IARC) in Lyon, France. D. Brenner beginning PDF training at IARC.
• J. Dennis with D. Tregouet (F. Gagnon’s collaborator) at Institut National de la Santé et de la Recherche Médicale (INSERM) in Paris, France.
• L. Faye with P. Kraft at Harvard School of Public Health, Boston, U.S.A. Internship developed from first interactions at 2010 Genetic Epidemiology and Statistical Genetics Meeting.
Action: Info
Action by: --


Agenda Item 2. Review PAC report

Topic: Introduction
Discussion: F. Gagnon asked SC to prepare to assess position on PAC report recommendations. F. Gagnon summarized key PAC report points.
Action: Info
Action by: --

Topic: Integrate biomedical mentors
Discussion: PAC recommended that, in order to integrate biomedical mentors (less involved because graduate students in these areas are unlikely to apply to STAGE), STAGE open its doors to biomedical trainees with evidenced quantitative skills. STAGE is open to students and PDFs with strengths in their core disciplines who are also willing to develop quantitative skills with a focus on Genetic Epidemiology. STAGE has admitted PDFs from biomedical sciences.
Action: Info
Action by: --

Topic: Graduate students from other departments
Discussion: PAC recommended that STAGE start accepting PhD students with demonstrated quantitative skills from other departments. A. Paterson suggested that STAGE respond by explaining that UofT’s structure does not lend itself to the admission of graduate students from other departments into STAGE. Unlike other universities, there is no department of human genetics at UofT and many department leads are not receptive to human genetics. Many people who work in human genetics do not secure appointments at UofT’s Molecular Genetics dept. This is a UofT bias issue.
Action: Info
Action by: --

Topic: Increase trainee-trainee interactions
Discussion: At the request of trainees, the PAC recommended that STAGE facilitate trainee-trainee interactions by developing an online platform (like a Wiki page). Trainees’ responses were poor at first attempt to implement. E. Berzunza to call on trainees for leadership in this project and, in particular, to contact trainees who met with PAC.
Action: EB to follow up with trainees and Wiki page for trainees
Action by: EB/Trainees

E. Berzunza invites mentors and trainees to meet ISSS speakers individually or in groups. STAGE seeks speakers with stellar research programs who are committed to training.

Topic: Stipend and awards
Discussion: F. Gagnon noted: PAC reported that trainees seek transparent stipend and award calculations. Info now posted on STAGE website, including procedures for travel award and international internships.
Action: Info
Action by: --

Topic: Infectious diseases (research area addition)
Discussion: F. Gagnon noted: PAC suggested STAGE add infectious diseases. SC agreed on the importance of research area. A. Paterson suggested: Dat Tran, SickKids clinician scientist in endemic influenza, be added to mentoring team. May prompt applications from trainees interested in area.
Action: EB to request CV from Dat Tran for circulation and approval by SC
Action by: EB


Agenda Item 3. Review of materials for CIHR annual report

Topic: Training on ethics
Discussion: A. Paterson suggested: STAGE shift from using NIH ethics course to TCPS 2 Research Ethics Tutorial Course (CORE). http://www.pre.ethics.gc.ca/eng/education/tutorial-didacticiel/
Action: EB to make TCPS 2 required course, change website
Action by: EB

Topic: IGES meeting presentation nominations
Discussion: A. Paterson noted: trainees nominated for presentations at IGES meeting to be recorded in Appendix B of STAGE annual report.
Action: EB to add nominations to annual report
Action by: EB

Topic: External awards by trainees
Discussion: A. Paterson recommended: narrative portion of STAGE annual report to include examples of prominent external trainee awards. F. Gagnon explained: CIHR didn’t call for award details, but these will be included in the PAC report. Trainee funding accomplishments are included in Appendix B.
Action: Info
Action by: --

Topic: Mentor time spent with trainees - Table 3 - Q. 10
Discussion: E. Berzunza asked SC to comment on CIHR request re. amount of primary mentor time spent mentoring trainees. SC suggested 10% of mentor time. F. Gagnon asked E. Berzunza to clarify reporting instructions regarding “mentor time spent with trainees” with CIHR.
Action: EB to follow up with CIHR
Action by: EB

Agenda Item 4. Evaluate new mentor nominations

Topic: Strategy for maintaining an active mentor pool
Discussion: F. Gagnon explained: merely expanding the STAGE mentor pool could create a larger pool with a lower percentage of active participants. The alternative of replacing inactive mentors with mentors engaged in STAGE activities would yield a higher ratio of active mentors in a smaller pool.
Action: Info
Action by: --

Topic: New mentors
Discussion: R. Hung asked about process for nominating new mentors. SC to develop nomination protocols for new STAGE mentors to be considered or invited when:
• a researcher in statistical genetics or genetic epidemiology is newly recruited to Toronto, or
• trainees express a specific and justified interest in supervision by a non-STAGE researcher, or
• potential mentors (whose research is aligned with STAGE research) wish to participate in STAGE and have excellent research and training records, or trainee projects can be undertaken with mentors’ existing collaborators.
Action: EB to draft nomination protocols; FG and SB to review
Action by: FG, SB, and EB

Topic: Mechanism to admit new mentors
Discussion: New mentor requests or nominations to be reviewed by SC on ad hoc basis by circulating applicant’s letter of interest and CV. Aspiring mentors to have STAGE nominators or supporters.
Action: Ad hoc
Action by: All

A. Paterson suggested using Statistics, Epidemiology, or Biostatistics faculty appointments as the primary mentor pool. It was noted that some current primary mentors of PDF trainees are affiliated with other departments, e.g. J. Danska and J. Kennedy.

S. Bull noted that trainee-nominated primary mentors could be less effective than current mentors, all of whom were approved by SC.


Agenda Item 4. Evaluate new mentor nominations (continued)

Discussion: New mentors to be informed of STAGE expectations. Preference given to those interested in training and in the field. Biomedical mentors posed a challenge; more mentors in an already largely less active group might be counterproductive.
Action: EB to develop
Action by: EB

Topic: New mentor nominations
Discussion: SC reviewed and approved new mentor nominations for Akbari, Bassett, Hu, Lye, and Wilson.
Action: Info
Action by: --

Topic: Mentor cycle
Discussion: F. Gagnon: letter to gauge mentor interest in continuing as STAGE mentors and reiterating STAGE expectations. G. Liu suggested: replace “expectations” with “opportunities”. SC agreed to review letter, intended to be sent to group of mentors, to canvass their interest in reaffirming STAGE participation or withdrawal.
Action: EB to circulate draft letter to SC
Action by: EB

G. Liu suggested that withdrawing mentors retain website visibility and be listed under a separate “Affiliated” category.

If future trainees express an interest in working with a previous mentor, he/she is to be invited back.

Agenda Item 5. Sustainability

Topic: Introduction
Discussion: F. Gagnon suggested E. Berzunza document relevant faculty positions across US and Canada to include in future funding proposals in support of STAGE. L. Strug added that list would be useful for STAGE trainees planning future educational/professional placements.
Action: EB to compile list
Action by: EB

It is hoped that ISSS sponsoring institutions will renew their support to STAGE past original commitment.

Plan to use existing infrastructure, such as the Bioinformatics Workshop that could be useful for the transfer of skills and also for potential revenue.

New items

Topic: Increased collaborations with national research centres
Discussion: F. Gagnon asked for ideas to establish/increase collaboration with other national research centres. S. Bull suggested asking if they would accommodate STAGE trainees in practica.
Action: SB & FG to coordinate approach
Action by: SB/FG

Topic: SC meetings
Discussion: Annual frequency of SC meetings was determined to meet STAGE needs. It was suggested that future SC meeting notices include lists of required and optional reading materials.
Actions:
• All (Action by: --)
• EB to prepare (Action by: EB)

Topic: December 2012 dinner
Discussion: F. Gagnon suggested that STAGE representatives attend dinner with Dr. Howard Hu, new Director of the DLSPH and STAGE December guest speaker.
Action: EB to circulate invitation
Action by: EB


New items (continued)

Topic: Statistical trainees and training in basic genetics
Discussion: L. Strug noted: statistical trainees need genetics training. Fundamentals of Human Genetics is not an ideal introduction. F. Gagnon added: training should be provided by trainees’ co-mentors by suggesting independent readings, such as the Special Issue on Genetic Epidemiology: Study Designs and Methods Post-GWAS.
Action: Mentors to prepare list of complementary, cross-discipline trainee readings
Action by: All

Topic: COMBIEL (Outcomes, Medicine, Biostatistics, Informatics, Epidemiology, Laboratory Medicine)
Discussion: G. Liu offered trainees access to COMBIEL (http://www.uhnresearch.ca/programs/combiel/index.html). Bi-monthly sessions for trainees to present research projects on study design, statistical issues, etc. Encourages statisticians to learn to communicate across disciplines to better understand applied work. STAGE trainees would benefit from attending COMBIEL. Downside: time lost for their own research.
Action: EB to follow up with GL to coordinate STAGE involvement
Action by: EB/GL

Topic: STAGE Annual Research Day
Discussion: S. Narod suggested an annual STAGE research day/retreat be organized to strengthen the sense of community between STAGE mentors and trainees. It would also serve to introduce trainees’ research projects amongst mentors and trainees. F. Gagnon: initiative to be implemented with surplus funds.
Action: EB to implement
Action by: EB

Topic: Interactions with other programs
Discussion: F. Gagnon noted: long-term STAGE goals involve networking with other universities/sites in Canada and external initiatives that lead to training in areas of interest to STAGE. S. Bull noted: helpful to assemble and forward list of external training opportunities for trainees and mentors to jointly evaluate potential training needs and participation.
Action: EB to assemble list of available external training opportunities
Action by: EB

F. Gagnon thanked everyone for attending, and adjourned the meeting.

APPENDIX H STAGE International Internship and Travel Award Programs

International Internship Program

CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology)

Overview
The CIHR STAGE International Internship Program provides unique research opportunities for trainees to work on publishable research projects that complement their existing research in genetic epidemiology and statistical genetics. This program provides trainees with access to superlative international research environments and collaborative opportunities. International placements are intended to complement and enhance trainees’ abilities and development in a manner consistent with the CIHR STAGE mission, values, and objectives.

Eligibility and Requirements
All STAGE trainees are welcome to apply. With the endorsement of his/her STAGE mentors, each trainee may undertake one international placement (typically lasting two to eight months) at organizations affiliated with STAGE or its mentors (through existing partnerships or collaborations) or at other trainee-identified, mentor-approved sites. Although there is no deadline for submissions, applicants are encouraged to apply as early as possible as funds are limited. Applications will be reviewed on a first-come, first-served basis.

Possible Training Locations
To help identify suitable placement locations, STAGE has established relations with multiple, highly reputable international organizations, including:
• The World Health Organization International Agency for Research on Cancer (IARC), Genetic Epidemiology and Nutrition and Metabolism Units
• Inserm (French National Institute of Health and Medical Research) UMRS 937 - Génomique Cardiovasculaire Unit
• The Brazilian Agency for Graduate and Post-Graduate Education (CAPES)
• The Consulate General of France
• The University of Western Australia Centre for Genetic Epidemiology and Biostatistics

Travel Subsidies
Subsidies may be available to offset travel costs necessary for trainees to interact on site with host organizations. STAGE travel support is limited to
• round-trip economy fare,
• accommodation costs, and
• visa fees.
Lowest rates must be used whenever possible for all costs. STAGE will not be responsible for expenses resulting from any costs paid for by the trainee that are outside the program’s eligible expenses, expenditures made without the approval of the Co-Directors of STAGE, or expenditures above the approved budget total.

Trainee Expectations
• Be in good academic and/or administrative standing with the STAGE program and, if applicable, with the University of Toronto.
• Commit to an internship position with a minimum of 37.5 hours per week (or local full-time equivalent) at the host organization for the full duration of the placement.
• Ensure confidentiality of host organization information/data.
• Provide periodic updates on the status of the placement to STAGE as required.
• Prepare a summary report at the conclusion of the placement which details the research performed, its results, and any recommendations.
• Complete an exit survey within one month of the completion of the placement.

Application Materials
Applications must be submitted at least 45 days in advance of the proposed internship start date. Submit application materials electronically via email to [email protected] as a single PDF (preferred) or Word/Excel documents whenever possible. Applications must include:

1. Statement of Purpose
Two-page maximum, 1” margins, 12-point type size. The statement of purpose should explain:
   1. The need for funding and the international placement.
   2. The overall goal, specific aims, and rationale of the trainee’s project or research question.
   3. The educational purpose and expected learning outcome(s) of the placement experience, including how the internship will
      • help meet the trainee’s future academic/career goals,
      • fit the trainee’s past and intended future professional experiences, and
      • advance, complement, and extend the trainee’s cross-disciplinary training.
   4. The trainee’s understanding of the host organization/lab infrastructure.
   5. The specifics of the proposed placement abroad, including the name(s) of the trainee’s supervisor(s), the name of the host organization, the location, and the start and end dates of the internship.

2. Budget
Itemized breakdown in CAD, including:
   1. External sources of funding.
   2. Accommodations and transportation costs (airfare, train, bus, mileage).
   3. Travel and health insurance costs and required visa fees.
It is the trainee’s responsibility to abide by University of Toronto (“UofT”) Policies and Guidelines for Travel and Other Reimbursable Expenses.

3. Letter of Support
Letter from the host organization documenting its approval/support of the applicant as an intern, including:
   1. Confirmation that the host organization possesses and will provide the infrastructure, resources, guidance, and expertise necessary to support the trainee’s placement.
   2. The start and end dates of the internship plus the number of work/research hours per week.
   3. A brief description of the trainee’s responsibilities and the tasks he/she will be performing while acknowledging an understanding of his/her research or learning goals.
   4. Information on whether the placement is unpaid or paid and, if paid, the stipend that the trainee will receive.

4. Signed Statement
I agree that if selected I will sign the documents “Consent Form and Release from Liability”, “Terms for Participation” and “Acknowledgment of Responsibilities.” I also agree that I will comply with any and all behavioral and/or travel requirements set forth by STAGE and the University of Toronto’s Safety Abroad Office.

Applicant’s signature: Date:


University of Toronto - Safety Abroad
All trainees participating in out-of-country CIHR STAGE-sponsored activities are required to abide by UofT’s
• Safety Abroad Program (http://www.utoronto.ca/safety.abroad/)
• Safety Abroad Guidelines (http://www.cie.utoronto.ca/safety-abroad/PDFs/SafetyAbroadGuide.html)
• Safety Abroad Manual (http://www.utoronto.ca/safety.abroad/program_sponsor.html)

Evaluation
Applications will be reviewed and participants selected by CIHR STAGE Co-directors, Drs. Shelley B. Bull and France Gagnon. Applicants and their corresponding primary mentor will be notified by email of the decision. The names of successful participants will be subsequently announced to the public on the STAGE website.

Selection Criteria
Participants will be selected using the following criteria:
• Application’s relevance to the mission, values, and objectives of CIHR STAGE (http://www.stage.utoronto.ca/about/mission).
• Internship’s potential for a stimulating collaborative, interdisciplinary, or cross-disciplinary research experience.
• Trainee’s preparedness to undertake the proposed internship in a new field or one in which he/she has worked, based on past experiences, academic history, research, or faculty mentoring.
• Trainee’s understanding of matters involved in the proposed placement.
• Trainee’s enthusiasm to serve as a representative of CIHR STAGE.

Resources
The CIHR STAGE Program Coordinator will provide assistance in obtaining necessary forms and information, compile the application materials for submission, and check materials for accuracy and route them through the approval process. Proposed internships will be reviewed to ensure that they provide a proper learning situation for the trainee, plus a reasonable return for the efforts and resources invested by the host organization and by STAGE. Prior to submitting any materials, it is strongly recommended that the applicant contact the STAGE Program Coordinator to discuss placement plans. All internship placements must be approved prior to the start of the internship.

Intern Report
Within two weeks of the completion of a CIHR STAGE International Internship, every trainee is required to submit an Intern Report. The Report is intended to provide a concise evaluation of the internship from the intern’s point of view. For reporting requirements, click here.

Contact
Esther Berzunza, Program Coordinator
CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology)
University of Toronto Dalla Lana School of Public Health
Health Sciences Building
155 College Street, Suite 734
Toronto ON M5T 3M7
Tel. 416-946-7244
[email protected]


STRATEGIC TRAINING FOR ADVANCED GENETIC EPIDEMIOLOGY

Travel Awards

The CIHR STAGE training program has a dedicated pool of funds for trainee travel support. Travel funds can be used to cover more than one trip, but only to a maximum of $1,500 CDN per fiscal year, per trainee (fiscal year end March 31). Funds may vary from year to year depending upon the applicant pool and availability of funds. We encourage trainees to actively seek financial support from other sources to supplement CIHR STAGE travel support.

Purpose of the Award

The purpose of this travel fund is to promote research by supporting the travel of STAGE trainees to scientific conferences, meetings, workshops, and other events that facilitate professional development in research and communicate research discoveries.

Eligible Expenses

Travel support is limited to the following:

1. the cost of a round-trip economy fare,
2. conference or workshop registration fees, and
3. accommodation expenses.

Eligibility

Events may be held within or outside of Canada, but they must be aligned with the program’s research themes. Applicants must meet the following criteria for consideration:

1. He/she has been selected to present a talk or poster at the event.
2. He/she is an active trainee of the CIHR STAGE Training Program at the time of application as well as at the time of the event.

How to Apply

1. Travel support can be requested at any time of the year. Requests should preferably be submitted 30 days or more before the date of travel.
2. Email your request to Esther Berzunza at [email protected], and cc all of your CIHR STAGE mentors. Requests must include:
   a. the complete name of the meeting and its location and travel dates,
   b. a clear budget justification for the amount of funding being requested, and
   c. proof of presentation acceptance (e.g. confirmation of participation from event organizers).

NOTE: Original airline/train tickets and boarding passes are required to claim these types of expenses.

Retroactive Requests Requests for retroactive travel funds are discouraged but will be considered on a case-by-case basis.

Travel Support Evaluation and Ratings Travel support requests are reviewed by the co-directors of the CIHR STAGE training program. Funding is competitive and limited and co-directors may recommend full, partial, or no funding.

Requests will be rated from high to low priority based on the following criteria:
• oral presentations,
• trainees who have not received prior travel support, and
• the quality and significance of the presentation or conference for which travel support is sought.

Notifications
Upon receipt of your request, a notification of receipt will be emailed to the applicant.

Acknowledgement
Trainees are required to acknowledge the training program in their presentations, publications arising from the supported research, and in any subsequent conference, meeting, or congress materials. Acknowledgements should be made to “CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology).”

Procedures for Submitting Reimbursement Requests
Reimbursement requests for conference-related expenses should be submitted within one month of returning from the event. Forward (1) original receipts for travel-related expenses and (2) a summary of your oral presentation or your poster abstract to:

Esther Berzunza, Program Coordinator CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) University of Toronto Dalla Lana School of Public Health Health Sciences Building 155 College Street, 5th Floor Toronto ON M5T 3M7

Contact Esther Berzunza, Program Coordinator CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology) CIHR Training Grant in Genetic Epidemiology Tel. 416-946-7244 Email: [email protected]

You will be notified by email of the outcome of your request.

This award will be in the form of reimbursement to the Supervisor’s research fund or to the individual that originally paid for the travel.

Submit applications to:

Esther Berzunza CIHR STAGE


© 2013, CIHR STAGE 155 College Street, Suite 734, Toronto ON M5T 3M7

APPENDIX I Syllabi and Agendas for Professional Development Courses and Workshops

BCH 2024 Graduate Professional Development 2012-13 (Guest panelists are tentative)

Class One – Sep 14, 2012

How to Cultivate Essential Skills outside Benchwork How to Get the Most out of Grad School (Reinhart Reithmeier – 20 min, Nana Lee – 20 min) Problem-solving, project management, communications, leadership, multi-tasking, collaborations, perseverance, analytics, initiative.

Present an overview, 1 minute student introductions, provide resource materials and links (ie. Science careers, Nature jobs)

Questions: 1. Why are you in Graduate School? 2. What do you hope to achieve during your training? 3. Which skills would you like to develop? 4. How do you think you can do that during graduate school? 5. What are your career aspirations? 6. Set up a LinkedIn profile. 7. What services might you use offered by U of T? LSCDS 8. What services would you like to see more of or initiate?

Guest Panel
U of Toronto Career Centre Workshop Orientation – Elena Pizzamiglio
U of Toronto Graduate Professional Skills Program Orientation – Karen McCrank
U of Toronto Leadership Skills Development and Student Life Orientation – Tara Bunting
Dr. Zayna Khayat, Principal Consultant at Secor Group

Class Two – Sept 28, 2012

How to Obtain and Succeed in an Academic Position (Reinhart Reithmeier) Topics to discuss are navigating through competitive academia, lab management, collaborations, grants, teaching, administration, academic mentorship, leadership.

Questions: 1. Given your own research, which collaborations do you envision forming in the future and why? 2. How would you optimize your chances of achieving a successful academic position? 3. Research and list some of the grants available to you after graduation.

Guest Panel: Five Successful Academics

Class Three – October 12, 2012

Importance of Mentorship Academic – Reinhart Reithmeier Nonacademic – Nana Lee and Pamela Plant Topics to discuss are the importance of the PI/student or postdoc relationship, mentorship, how to find a mentor, and training the future mentor/PI.

Questions: 1. If you were a PI/mentor, develop a feedback form for your student and a career plan if they wanted to pursue a) academia or b) science writing. 2. How would you find a mentor outside your department?

Guest Panel Mentorship program Coordinator, Founder of own biotech company, Writer/Editor, Clinical Trials Coordinator

Special Symposium: October 19, 2012 at 3-5 pm
Open to All Students and Faculty
Speaker: Chioma Ekpo, Assistant Director, University of Toronto Graduate Enterprise Internship Program (STEM)
Room: MSB 2172

Class Four – October 26, 2012

Postdoc Choices and Succeeding in a Nonacademic Career
Academic Postdocs – Reinhart Reithmeier
Nonacademic Postdocs and Careers – Nana Lee
Topics to discuss are the nonacademic pathways available, how to find the hidden job market, how to land the job and how to succeed as a nonacademic scientist.

Questions: 1. In which labs would you pursue postdoctoral studies towards an academic career? Why? If your goals change during the process, how would you change the direction of your postdoc? 2. What are your career objectives? This question should be asked every year during your career. 3. What is the hidden job market? 4. Which nonacademic options are you interested in at this time? How would you find out more about these jobs?

Guest Panel Political Advisor, Creative Company Researcher, Big Pharma, Writer, Sales and Marketing

November 2, 2012 Oral Presentations for All Students: Mock Press Conferences Q & A Panel: Nana Lee, Reinhart Reithmeier, Globe and Mail Health Editor, Lawyer, Business, Lay People

Class Five – November 9, 2012

Career Transitions and Development Throughout Life (Nana Lee)

Topics to discuss are the skills to develop for career transitions throughout life’s changes such as the effects of relationships, finances, marriage, children, aging parents, re-entry after childrearing, company restructuring, grant losses, and retirement.

Questions: 1. If money were not an issue, what would you research and why? 2. What career development resources are available to you throughout school and life? 3. Research and find re-entry grants after childrearing or other family responsibilities. 4. If you decided to be a full-time caregiver, how would you stay connected to science? 5. What are some of the career issues facing postdocs and scientists today?

Guest Panel: MaRS Innovation, Fisher Scientific Sales, Hosp Diagnostics Director, Dean of Students, Big Pharma

Class Six – November 23, 2012

The Big Picture: Global Concerns and Science (Nana Lee)

Topics to discuss are thinking outside the box, TED talks, relating to innovate, market trends in the biotech industry. Finding your passion.

Questions: 1. If you had a scientific breakthrough with your research, what are the issues and concerns getting it to the public/market? 2. List some of the global scientific concerns facing us today. 3. List some biotech companies that interest you and state why. 4. Which causes funded by the Bill and Melinda Gates Foundation might interest you?

Research Ethics Guest Speaker: Dr. David Bazett-Jones

Topics to be discussed include data handling and storage, preparation of tables and figures, collaborations, publishing your work, author order, animal/human research, copyright, intellectual property, patents

Questions: 1. What ethical issues do you see arising from any part of your research? 2. List other researchers involved with the R&D of any topic related to your research. 3. What topics around research ethics did you learn today?

Guest Panel: UHN Admin, Sanofi-Pasteur, MaRS Innovation, Patent Lawyer

Evaluation Written Assignments (50%) Oral Presentations (30%) Class Participation (20%)

HCTP Professional Development Workshop Series

WORKSHOP Getting the Scoop on the Academic Hiring Process February 8, 2012 Health Sciences Building room 208 10am‐12pm

Abstract: This HCTP professional development workshop is designed to enhance the transparency of the academic hiring process. A panel consisting of recent hires and faculty who have sat on search committees will discuss issues central to this process. Panellists will also share their observations from seeing others undertake this process and from their own experience as interviewees/interviewers.

Topics include: the relative importance of various elements of the hiring process from application to interview follow‐up; how to engage in successful negotiations; how to prepare for the interview component; and important considerations when preparing for the job talk.

Panelists: Gavin Andrews, Professor & Previous Chair (Health, Aging & Society) McMaster University; Ahmed Bayoumi, Director (Clinical Epidemiology & Health Care Research at HPME) University of Toronto; Tamara Daly, Assistant Professor (Health Policy and Management) York University; Josephine Wong, Associate Professor (Nursing) Ryerson University.

Register by February 1, 2012

The event is open to all students but space is limited. Register online: www.hctp.utoronto.ca

Health Care, Technology and Place Program, CIHR Strategic Training and Research Program
Institute of Health Policy, Management & Evaluation, University of Toronto
Health Sciences Building Suite 425, 155 College Street, Toronto ON M5T 3M6
www.hctp.utoronto.ca | [email protected] | 416.978.2067

2nd Annual Cross-STIHR Research Day
CIHR – Strategic Training Initiative in Health Research (STIHR)
University of Toronto Strategic Training Programs

AGENDA

Google, Facebook, Zotero and Skype, Oh my! Developing essential social networking skills for research Presented by Dr. Alex Jadad

Monday, February 14, 2011 University of Toronto Health Sciences Building 155 College Street, Room 208 Toronto ON M5T 3M6

Social media are ushering in a second wave in the evolution of the Web. Free resources such as Google, Wikipedia, Facebook, YouTube and Twitter are rapidly becoming available through mobile devices, re-shaping how humans communicate, learn and live. In this workshop, participants will have an opportunity to learn about generic social media applications and gain basic skills with which to improve their capacity to formulate research questions, design studies, write proposals, analyze data and disseminate their findings.

Requirements: 1. Please read this document prior to attending the session: www.cdc.gov/healthcommunication/ToolsTemplates/SocialMediaToolkit_BM.pdf 2. Remember to bring a laptop to use during this session

1:30 pm to 1:45 pm Welcome and overview
1:45 pm to 2:00 pm Social media 101
2:00 pm to 2:15 pm Building an eToolkit for researchers
2:15 pm to 2:30 pm Break
2:30 pm to 4:30 pm Hands-on activities (30 minutes each):
• Working with Google Calendar and Documents
• Creating and studying Facebook Groups
• Managing references through Zotero
• Communicating with Skype
4:30 pm to 4:40 pm Closing remarks

This 3-hour workshop will be coordinated and supported by Dr. Alex Jadad and other members of the People, Health equity and Innovation (PHI) Group at the University of Toronto. Dr. Alex Jadad is the PHI Group Convener; Chief Innovator and Founder of the Centre for Global eHealth Innovation; Canada Research Chair in eHealth Innovation; Rose Family Chair in Supportive Care and Professor in the Departments of Health Policy, Management and Evaluation; Anesthesia; and the Dalla Lana School of Public Health at the University Health Network and the University of Toronto.

The University of Toronto is home to 15 CIHR Strategic Training programs spanning biomedical, clinical, population, and health policy research, which are aimed at developing outstanding new research investigators. U of T has the largest number of such programs among Canadian universities. The purpose of the Annual Research Day is to promote interaction and the dissemination of knowledge across disciplines and to demonstrate the added value of interdisciplinary and interfaculty graduate and postdoctoral research training that these strategic research training initiatives bring to our university.

4th Annual Cross-STIHR Research Day
CIHR – Strategic Training Initiative in Health Research (STIHR)

University of Toronto Strategic Training Programs

AGENDA Effective Communication of Research Hosted by the University of Toronto Cross-STIHR Community

Tuesday, April 23, 2013, 09:00 am-05:00 pm EST University of Toronto, Health Sciences Building, 155 College Street, Room 208, Toronto ON M5T 3M6

The University of Toronto Cross-STIHR community of 15 CIHR Strategic Training Programs is hosting its fourth Annual Research Day. The purpose of the annual event is to promote interaction amongst the STIHR trainees as well as provide an additional opportunity for professional development in launching scientific research careers in Canada.

This year’s training event aspires to offer a breadth of material focusing on Effective Communication of Research.

Topics include: Academic report writing, manuscript preparation for peer reviewed journals, poster presentations, grant writing, responding to the media, privacy in research, and funding your research.

AGENDA
8:30 – 9:00 am Registration & Coffee
9:00 – 9:20 am Opening Remarks (20 min)
Dr. Kwame McKenzie
Journalist, Advocate, Broadcaster
Professor of Psychiatry, University of Toronto
Medical Director, Centre for Addiction and Mental Health

9:20 – 10:40 am Session 1: Translating Your KNOWLEDGE to the Academic, Stakeholder and Public Communities
Speakers – 30 min each; Facilitated Discussion – 10 min each; Total 80 min
• Dr. Onil Bhattacharyya, Clinical Scientist, Li Ka Shing Knowledge Institute, St. Michael’s Hospital; Professor, Department of Family and Community Medicine, University of Toronto
o Session Facilitator (STIHR Trainee), to be confirmed
• Mr. Rafael Eskenazi, Director, Freedom of Information and Protection of Privacy Office, University of Toronto
o Session Facilitator (STIHR Trainee), to be confirmed

The University of Toronto is home to 15 CIHR Strategic Training programs spanning biomedical, clinical, population, and health policy research, which are aimed at developing outstanding new research investigators. The University of Toronto has the largest number of such programs among Canadian universities and starting in 2009 we have come together as the “UofT Cross-STIHR Community”. The purpose of the Annual Research Day is to promote interaction and the dissemination of knowledge across disciplines and to demonstrate the added value of interdisciplinary and interfaculty graduate and postdoctoral research training that these strategic research training initiatives bring to our university.

10:40 – 11:00 am Break 20 min

11:00 – 12:20 pm Session 2: Translating Your RESEARCH to the Academic, Stakeholder, Public Communities
Speakers 30 min each; Facilitated Discussion 20 min; Total 80 min
• Dr. Rachael Cayley, Office of English Language and Writing, University of Toronto
• Dr. Jocalyn Clark, Senior Magazine Editor, PLOS Medicine
o Session Facilitator (STIHR Trainee), to be confirmed

12:20 – 1:20 pm Lunch (60 min)
1:20 – 2:40 pm Session 3: FUNDING Your Research
Speakers 30 min each; Facilitated Discussion 20 min; Total 80 min
• Dr. Jane Freeman, Director, Office of English Language and Writing, University of Toronto
• Dr. Peter Szatmari – Talk Title: “Dealing with rejection”; Chedoke Health Corporation Chair in Child Psychiatry; Professor, Department of Psychiatry and Behavioural Neurosciences, McMaster University; Director, Offord Centre for Child Studies; Head, Division of Child Psychiatry
o Session Facilitator (STIHR Trainee), to be confirmed

2:40 – 3:00 pm Break (20 min)
3:00 – 4:20 pm Session 4: Your Research is in the MEDIA, What Do You Do? Public Relations & the Media
Speakers 30 min each; Facilitated Discussion 20 min; Total 80 min
• Laurie Stephens, Director, News and Media Relations, University of Toronto
• Donna Lee Sound Journalist, Aboriginal People’s Television Network (APTN)
o Session Facilitator (STIHR Trainee), to be confirmed

4:20 – 5:00 pm Closing Remarks (Total 40 min for wrap-up and to account for any over-time sessions)
Dr. Kwame McKenzie
Journalist, Advocate, Broadcaster
Professor of Psychiatry, University of Toronto
Medical Director, Centre for Addiction and Mental Health


APPENDIX J Trainee Annual Progress Report and Exit Survey

Trainee Progress Report

Page One

What follows is a short questionnaire that we are asking you to complete in order to help us fulfill our reporting obligations to our funding agency, the Canadian Institutes of Health Research.

Please complete the questionnaire before Wednesday, September 26, 2012.

If you have any questions, please feel free to contact me, Esther Berzunza, at 416-946-7244, [email protected].

New Page

1. Personal Information First Name * Last Name *

Title *

Organization *

Street Address *

Apt/Suite/Office *

City * State/Province * Postal/Zip Code *

Country *

Email Address * Phone Number *

Fax Number *

Mobile Phone *

Website *

2. Identify your trainee level: *

Master's PhD Postdoctoral Fellow Visiting Scientist/Scholar

3. Are you among the top 10% of your class? Or, if you have recently graduated from a program, were you in the top 10% of your graduating class? *

Yes No Don't know

4. Did you complete your program (e.g., degree or postdoctoral studies) between the date you were accepted to STAGE and August 31, 2012? *

Yes No

5. When did you complete your program e.g. degree or postdoctoral studies? *

6. When do you expect to complete your program, e.g. degree or postdoctoral studies? *

7. What discipline or disciplines did you come from before joining CIHR STAGE? * Check as many as applicable.

Addiction research; Aging; Anaesthesiology; Anatomy; Biochemistry; Biomedical; Biophysics, bioengineering, medical instrumentation and devices; Biostatistics; Biotechnology; Blood, haematology; Bone and mineral metabolism; Cancer; Cardiovascular; Cell biology; Dental science, oral biology; Dermatology; Emergency medicine; Endocrinology; Environmental and occupational medicine; Epidemiology; Family medicine; Gastrointestinal and liver; Growth and development, including human genetics; Health care and economics; Health policy; History of medicine; Imaging (including nuclear medicine); Immunology; Medical education; Medical ethics; Microbiology, bacteriology, parasitology, virology; Molecular biology; Neonatology; Nephrology; Neurobiology; Nutrition and metabolism; Obstetrics & gynaecology; Occupational therapy; Orthopaedics; Otolaryngology; Pathology; Pharmacology; Physical therapy; Physiology; Preventive medicine; Psychiatry; Psycho-social medicine; Psychology; Public health; Rehabilitation; Respirology; Rheumatology; Sociology; Speech/language communications; Sports medicine; Surgery; Theoretical biology; Toxicology; Tropical medicine; Urology; Vision; Women's health issues; Other

New Page

8. External Funding
In addition to the funding you may have received from STAGE, did you receive funding from any other source? External funding may come from any one or a combination of sources, which may include:

Awards from external agencies such as the Canadian Institutes of Health Research (CIHR), the Natural Sciences and Engineering Research Council of Canada (NSERC), Social Sciences and Humanities Research Council (SSHRC), the Ontario Graduate Scholarship Program (OGS), etc.

University of Toronto awards such as departmental cohort funds (U of T Fellowships), Ontario Graduate Scholarships in Science and Technology (OGSST), Connaught Fund, Ontario Student Opportunity Trust Fund (OSOTF).

Government, international agencies, and other awards for the express purpose of education.

Teaching or research assistantships or stipends from supervisors or from other training grants (T4 or T4A income).

Yes

No

9. Please list the name of the organization, the nature of the funding, the term, and the amount.

Avoid acronyms if possible; if used, spell out before first use.

Example:

External Funding: Award Name or Title or Type *

Institution, Organization and Country * Effective Date (mm/yy) *

End Date (mm/yy) * Amount *

Add Another

New Page

10. Between the date you were accepted to STAGE and August 31, 2012, have you had any publications in which you were listed as an author? *

By "publications" we refer to any of the items listed below:

Published refereed papers (original articles published in journals with editorial review)
Accepted or in press refereed papers (attach acceptance letters)
Submitted refereed papers
Published books and monographs (as author or editor)
Accepted or in press books and monographs
Submitted books and monographs
Published contributions to a collective work and book chapters (including chapters written on invitation or collective works derived from conferences or symposiums)
Accepted or in press contributions to a collective work and book chapters (including chapters written on invitation or collective works derived from conferences or symposiums)
Presentations as guest speaker (including conferences, presentations, demonstrations, workshops intended for a non-academic audience, according to the type of audience)
Published abstracts/number of notes (including name of journal, title of article, and date submitted)
Accepted or in press abstracts/number of notes (including name of journal, title of article, and date submitted)
Submitted abstracts (including name of journal, title of article, and date submitted)
Works including individual or collective literary or artistic works (e.g. novels, short stories, poetry, film, video, visual arts work, booklet, record, sound creation, book of artists, collection, exhibition catalogue, etc.)
Research reports or reports produced for the government
Articles in professional or cultural journals without review committee (including popularized texts)

Yes No

11. Upload a Word document that includes all the publications in which you were listed as an author between the date you were admitted to STAGE and August 31, 2012. *

Use the template provided with our survey email.

Choose File No file selected Upload

12. What is the most significant publication during that period? * Briefly explain why.

13. Between the date you were accepted to STAGE and August 31, 2012, have you made any presentations (oral or poster)? *

Yes No

14. Upload a Word document that includes your presentations (oral or poster) that were made between the date you were admitted to STAGE and August 31, 2012. *

This listing is to be in reverse chronological order. Indicate whether oral presentation or poster and indicate whether invited, selected (a committee or person selects from those submitted for consideration), or contributed (all submittals are allowed, e.g., some poster sessions do not limit the number accepted).

Only include presentations made by you. Each entry should indicate participation as a presenter, co-presenter, panelist, organizer, president or moderator in parentheses after your name. Use the template provided with our survey email.

Choose File No file selected Upload

15. Besides the presentations listed above, between the date you were accepted to STAGE and August 31, 2012, how many other presentations, knowledge translations or outreach activities (if any) were you involved in? Please list and describe as appropriate. *

16. Besides those listed in questions 10 to 15 above, have you had any other research/career related successful outcomes? *

New Page

17. Please list any peer review activities (including internal grant review, journal review, etc.) you may have been exposed to between the date you were accepted to STAGE and August 31, 2012. * Please include as many details as possible, i.e., date, location, organizer, and full name of the event or workshop.

18. Please list any seminars or workshops that you may have attended between the date you were accepted to STAGE and August 31, 2012 on any of the following topics:

Writing grants and papers Ethics sessions Career advice

*

Please include the full name of the workshop or seminar and its date and location.

19. What have you most enjoyed about the STAGE program so far? *

20. What have you least enjoyed about the STAGE program so far? *

21. Do you have any suggestions for STAGE program improvement? *

Thank You!

Thanks [question("value"), id="4"]

We really appreciate your feedback!

Trainee Exit Questionnaire

Page One

This questionnaire is for trainees who have completed, or are about to complete, their training at CIHR STAGE. We want to learn from your experience at CIHR STAGE!

We value frank evaluations and comments and encourage you to be honest. Anonymous responses may be quoted in the program's promotional materials, in reports submitted to CIHR and/or to the program’s committees, or research papers addressing the CIHR STAGE program.

New Page

1. Personal information First Name Last Name

Title

Company Name

Street Address

Apt/Suite/Office

City State Zip

Country

Email Address Work Phone Number

Fax Number

Mobile Phone

URL

2. What was your most valuable training/learning experience at STAGE and why?

3. Please rate your overall satisfaction with your training at STAGE: *

Very Dissatisfied

Dissatisfied

Somewhat Dissatisfied

Somewhat Satisfied

Satisfied

Very Satisfied

4. What was your least valuable training/learning experience at STAGE and why?

5. Please rate your level of agreement with the following benefits derived from your participation in STAGE and provide examples. *
Please rate * Strongly Disagree Disagree Neutral Agree Strongly Agree
Please give examples for each instance *
STAGE helped to improve my understanding of how my research can improve the prevention and management of chronic diseases through genetic epidemiology research
STAGE helped to improve my understanding of interdisciplinary research, which, in future, will help me to bridge laboratory and population-based investigations of chronic diseases
STAGE promoted and encouraged
STAGE helped me to appreciate and address the challenges in applying study design, and analytic and molecular tools used in genetic epidemiology
STAGE has enabled me to integrate aspects of diverse research disciplines and methodological approaches to address or resolve health issues and scientific challenges
STAGE activities (e.g. ISSS, SMGG Journal Club) improved my ability to contribute to genetic epidemiology and statistical genetics research

New Page

6. Please rate your level of agreement with the following statements: *

Strongly Disagree Disagree Neutral Agree Strongly Agree
STAGE has increased my understanding of disciplines and underlying methodological approaches related to and supporting genetic epidemiology and statistical genetics research *
STAGE increased my opportunities to interact with peers and researchers from disciplines different from mine *
STAGE has enabled me to integrate aspects of diverse research disciplines and methodological approaches to address or resolve health issues and scientific challenges *
STAGE has engaged me in training and discussion on the ethical conduct of research and related ethical issues *
STAGE has helped me to incorporate effective research strategies to translate knowledge into practice. For example: work on clinically relevant STAGE research question, work on development of statistical methods with direct health research applications. *

7. Please rate your level of agreement with the following statements: STAGE has helped me to develop my skills in... *

Strongly Disagree Disagree Neutral Agree Strongly Agree
communications
teamwork
leadership
grant writing and peer review

8. Please rate your level of agreement with the following statement: STAGE mentors were able to give the support and feedback I required.

Strongly disagree Disagree Neutral Agree Strongly agree

9. Did you participate in the International Internship Program of CIHR STAGE?

Yes

No

10. Rate your overall internship experience:
Very Dissatisfied Dissatisfied Neutral Satisfied Very Satisfied
Overall experience

11. Please describe the overall goal(s) of your internship project or research question.

12. Please describe your primary learning goals for your internship.

13. Indicate your level of agreement with each of the statements regarding your internship placement.

Strongly disagree Disagree Neutral Agree Strongly agree
I achieved my learning goals
I achieved the overall goal(s) of my research project
I successfully completed my assigned responsibilities and duties
The host organization possessed and provided the infrastructure, resources, guidance, and expertise necessary to support my placement
Other lab members were collegial and friendly
My supervisor was supportive and provided timely feedback
Other lab members with whom I interacted were supportive and helpful

14. Did the internship affect your career goals?

Yes

No

15. Would you recommend the CIHR STAGE internship program to other trainees?

Yes

No

16. New Essay Question

17. I will pursue work or further studies as a *

Principal Investigator

Highly Qualified Personnel (e.g. research associate)

Other *

18. Please rate your level of agreement with the following statement:

Strongly disagree Disagree Neutral Agree Strongly agree
Participation in STAGE has improved the likelihood that I will pursue work or further studies in projects that have genetic epidemiology and/or statistical genetics as a focus.

New Page

19. Do you plan to continue to be part of the Canadian health research community?

Yes No

20. What area of the Canadian health research community do you plan to be part of?

Genetic and Molecular Epidemiology Statistical Genetics and Genomics

Bio-medical Genetics Other. Please specify: *

21. Why won't you continue to be part of the Canadian research community?

22. Did STAGE meet your professional development expectations? *

Yes No

23. List the ways in which STAGE contributed to your professional development.

24. List the ways in which STAGE did NOT contribute to your professional development.

25. Please list your most current professional position.

New Page

26. Did STAGE expand your network/academic contacts?

Yes No

27. How did STAGE expand your network/academic contacts and how do you plan to maintain these networks in the future?

28. List the ways in which STAGE did NOT contribute to your professional development.

29. How likely are you to recommend STAGE to other individuals interested in training in genetic epidemiology and statistical genetics?

Very unlikely Unlikely Likely Very likely

30. Do you have any suggestions for improving STAGE you would like to share? *

Thank You!

Thank you for taking our survey. Your response is very important to us.

APPENDIX K GAW18 Workshop Resulting Papers and Participating STAGE Mentors and Trainees


Genetic Analysis Workshop 18 (GAW18), October 13-17, 2012, Stevenson WA USA
Legend: bold = STAGE Trainee

MENTORS & TRAINEES BY PAPER | PAPER TITLE + AUTHORS

Andriy Derkach, Lisa Strug | Association analysis with sequence data using publicly available controls. Genetic Analysis Workshop 18. Stevenson WA, USA. Roslin, N, Derkach, A, Strug, L.
Joseph Beyene | Analysis of Rare and Common Genetic Variants to Determine Association with Blood Pressure. Liu XF, Beyene J.
Joseph Beyene | Testing the association of hypertension with SNP profiles obtained using sparse principal component analysis. Bonner A, Neupane B, Beyene J.
Shelley B. Bull, Yildiz Yilmaz | Bivariate genetic association analysis of systolic and diastolic blood pressure by copula models. Konigorski S, Yilmaz Y, Bull SB.
Joseph Beyene | Entropy-based method for assessing the influence of genetic and environmental factors on hypertension. Liu J, Beyene J.
Laurent Briollais, Yan Yan Wu | Mixed-effects models for joint modelling of sequence data in longitudinal studies. Wu YY, Briollais L.
Joseph Beyene | Pathway-based analysis of rare and common variants to test for association with blood pressure. Alsulami H, Liu XF, Beyene J.
Laurent Briollais, Yan Yan Wu | Accounting for treatment effect in genetic association of longitudinal systolic blood pressure: an illness-death model approach. Briollais L, Choi YH, Wu YY.
Radu V. Craiu, Andriy Derkach, Andrew Paterson, Lei Sun | Using a Bayesian Latent Variable Approach to Detect Pleiotropy in the GAW18 Data. Xu L, Craiu R, Derkach A, Paterson A, Sun L.
Shelley B. Bull, Zhijian Chen | Multi-phase analysis by linkage, quantitative transmission disequilibrium, and measured genotype: Systolic blood pressure in complex Mexican-American pedigrees. Chen Z, Tan KR, Bull SB.
Andriy Derkach, Jerry F. Lawless, Andrew Paterson, Lei Sun | Evaluation of association tests and gene annotations for analyzing rare variants using GAW18 data. Derkach A, Lawless J, Merico D, Paterson A, Sun L.
Shelley B. Bull, Zhijian Chen | exploration of heterogeneity in genetic analysis of complex Mexican-American pedigrees and the role of linkage versus association in WGS data. Bull SB, Chen Z, Tan KR, Taleban J.
Andriy Derkach, Steve Scherer, Andrew Paterson | GAW18 Single Nucleotide Variant Prioritization Based on Protein Impact, Sequence Conservation and Gene Annotation. Nalpathamkalam T, Ziman R, Derkach A, Scherer SW, Paterson AD, Merico D.
Jo Knight | Functional Annotation of Rare Variants in GAW18 Data. Gagliano S, Benke K, Knight J.
Geoffrey Liu, Wei Xu | Gene-based Haplotype Approach for Association Analysis on Hypertension. Shen X, Espin-Garcia O, Brhane Y, Qiu X, Liu G, and Xu W.
Andrew Paterson | Dynamic pathway analysis of genes associated with blood pressure using whole genome sequence data. Hu P, Paterson AD.
Geoffrey Liu, Wei Xu | Genetic Association Analysis using Weighted False Discovery Rate Approach on GAW18 Data. Qiu X, Shen X, Espin-Garcia O, Liu G and Xu W.
Lei Sun | PREST-plus identifies pedigree errors and cryptic relatedness in the GAW18 sample using genome-wide SNP data. Sun L, Dimitromanolakis A.
Geoffrey Liu, Wei Xu | Genetic Association Analysis for Common Variants in the GAW18 Data: a Dirichlet Regression Approach. Espin-Garcia O, Shen X, Qiu X, Brhane Y, Liu G, Xu W.
Joseph Beyene | Assessing association of genetic markers with longitudinally measured BP data. Hossain A, Beyene J.
Joseph Beyene | Linear mixed model analysis to test joint association of systolic and diastolic blood pressure with genetic factors. Neupane B, Beyene J.

APPENDIX L Mentoring Commitment Message

Tuesday, 15 January, 2013 3:30:41 PM Eastern Standard Time

Subject: Mentoring at STAGE (Strategic Training for Advanced Genetic Epidemiology) Program
Date: Thursday, 20 December, 2012 1:43:11 PM Eastern Standard Time
From: Esther Berzunza Lopez
To: Esther Berzunza Lopez
Priority: High

Dear members of the STAGE mentoring team,

Thank you for a resoundingly successful first two years at STAGE.

Over our first two years the local STAGE community has evolved. There have been mentor relocations, the arrival of new faculty members, changes in the research interests of our trainees, and direction from the Steering and Program Advisory Committees of STAGE.

Consequently, we invite each of you to consider your mentoring commitment to STAGE.

The STAGE training program depends on, and has benefitted from, our mentors' dedication to great science and their exemplary leadership and mentoring. Together they form a unique, enriched, and integrated community of genetic epidemiologists, statistical geneticists, and biomedical scientists with the common goal of providing innovative training in genetic epidemiology.

It is our hope that you can continue to support STAGE as we build further on our achievements to date.

We are committed to excellence in the training we provide and to fostering a new generation of scientists with solid quantitative and interdisciplinary skills. The success of STAGE depends upon co-mentoring provided by researchers to graduate and postgraduate trainees in each of the following (broadly defined) three fields:

Genetic & Molecular Epidemiology, Statistical Genetics and Genomics, and Bio-medical Genetics.

We ask that STAGE mentoring be consistent with the mentorship agreement (attached) created on recommendation of, and approved by, the Steering Committee of STAGE, which follows the principles listed below.

• To dedicate significant time to mentor trainees on research and career development skills.
• To encourage trainees to actively participate in STAGE activities and recommended scientific and professional/career development events and seminars.
• To participate regularly in STAGE-organized events (such as the International Speaker Seminar Series and the Statistical Methods for Genetics & Genomics journal club) and, when relevant, to contribute to teaching.

In addition to supporting trainees, mentorship offers many benefits to mentors as well. For example:

• While working with a STAGE mentee, the mentor also has the opportunity to exchange ideas and expertise with talented new colleagues from other disciplines, with whom the mentor may collaborate for years to come.
• Mentoring can enhance a mentor’s own personal and professional knowledge through interaction with mentees and other STAGE co-mentors.

If you would like to continue as part of the CIHR STAGE mentoring team, please let us know by return email.

We are also open to hearing your ideas on curriculum development, workshops, lectures, outreach programs, and so on. All suggestions are welcome!

If, however, you choose not to continue, please know that we are extremely grateful for the support and contributions of our inaugural team of mentors.

We look forward to your replies.

Sincerely,

Drs. Shelley B. Bull and France Gagnon, Co-Directors of CIHR STAGE

CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology)
CIHR Training Grant in Genetic Epidemiology
University of Toronto
Dalla Lana School of Public Health
155 College Street
Toronto ON M5T 3M7

--
Esther Berzunza, Program Coordinator
CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology)
CIHR Training Grant in Genetic Epidemiology
University of Toronto
Dalla Lana School of Public Health
155 College Street, Suite 734
Toronto ON M5T 3M7

Tel. 416-946-7244
Email: [email protected]
www.stage.utoronto.ca
