Introductory High-Throughput Genomic Data Analysis I: Data Mining and Applications

BIOST 2055: Introductory High-Throughput Genomic Data Analysis I: Data Mining and Applications. Spring 2013

Class location: Room A622 Crabtree Hall
Computer lab location: Room 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue
Class schedule: Wednesday, Friday 9:30-10:45AM
Course homepage: use Blackboard

Lecturer: George C. Tseng, Yan Lin and Wei Chen
Office hour: by appointment
Office: 303 Parran Hall (George Tseng)
Email address: [email protected]
Telephone number: 412-624-5318
Lecturer’s homepage: http://www.pitt.edu/~ctseng (George Tseng)

TA: Serena Liao
Office hour: TBD
Office: TBD
Email address:

Course Description: This course is designed for graduate students (both doctoral and master's students) and researchers from both quantitative fields (statistics, information and computer science) and qualitative biological fields who are interested in high-throughput genomic data analysis. The course introduces modern statistical and computational methods for high-throughput genomic data analysis. It focuses mainly on methods and applications; in-depth methodological and theoretical details are left to the second course in the fall (BIOST 2078 Introductory high-throughput genomic data analysis II: theories and algorithms). The first half of the course covers fundamental statistical and computational methods applicable to virtually all kinds of high-throughput genomic data. The second half covers selected new topics that are subject to change every year. Students are required to have basic statistical training (i.e., two elementary statistics courses, basic calculus and linear algebra) and basic programming proficiency (R programming is required for the homework and final project and can be learned during the class). The goals of the course are: (1) to motivate students from quantitative fields to pursue genomic research; (2) to give students from biological fields a deeper understanding of statistical methods; (3) to promote an atmosphere of interdisciplinary collaboration in class.

Tentative Schedule of Sessions and Assignments: The first 18 sessions (75 minutes each) introduce fundamental statistical methods used in genomic data analysis. An additional 8 sessions are devoted to selected special topics in the field. The last two sessions are for student presentations of their final projects.

Part I: Fundamental statistical methods
1/9 Introduction to the entire course; basic molecular biology and genetics. (Lin)
1/11 Introduction to microarray and next-generation sequencing (NGS) technology. (Lin)
1/16 Data preprocessing: data summarization, data transformation, data filtering and missing value imputation. (Lin)
1/18 Detecting differentially expressed (DE) genes: empirical Bayes; comparative analysis of two or more conditions; permutation methods; SAM; controlling the false discovery rate (FDR). (Lin)
1/23 (Lab1) Introduction to Bioconductor and the NCBI database: up-stream analysis on real Affymetrix and cDNA array data sets. Homework 1 distributed. (Lin)
1/25 Supervised learning (classification): basic concepts in machine learning; feature selection, overfitting and cross-validation, sensitivity and specificity. (Tseng)
1/30 Supervised learning (classification): Bayes classifier; popular machine learning methods: logistic regression, LDA/QDA/Fisher’s criterion, KNN, CART, bagging, boosting, random forest, SVM, ANN, nearest shrunken centroid. (Tseng)
2/1 Supervised learning (classification), cont’d. (Tseng)
2/6 (Lab2) DE analysis and classification: data analysis on detecting DE genes and the classification problem. Homework 2 distributed. (Tseng)
2/8 Dimension reduction: data visualization; principal component analysis (PCA); multidimensional scaling (MDS). (Lin)
2/13 Unsupervised learning (clustering): hierarchical clustering, K-means, self-organizing maps (SOM), model-based clustering; estimating the number of clusters. (Lin)
2/15 Unsupervised learning (clustering): tight clustering; penalized and weighted K-means; cluster stability and tightness; bi-clustering. (Lin)
2/20 (Lab3) Dimension reduction and clustering analysis. Homework 3 distributed. (Lin)
2/22 Pathway analysis: microarray and gene annotation databases (GO, KEGG and more); enrichment analysis; motif finding. (Tseng)
2/27 Genetic regulatory network: genomic regulatory network inference; Bayesian network, hidden Markov model and general network analysis. (Tseng)
3/1 Horizontal genomic meta-analysis: microarray meta-analysis (random effects model, Fisher’s method, maxP, rank-based methods, etc.). (Tseng)
3/6 Vertical integrative analysis. (Tseng)
3/8 (Lab4) Genomic meta-analysis, gene annotation and pathway analysis. Homework 4 distributed. (Tseng)
3/13 & 3/15 Spring break

Part II: Selected topics
3/20 Copy number variation (CNV) and loss of heterozygosity (LOH): array CGH, SNP array. (Chen)
3/22 Genome-wide association (GWAS). (Chen)
3/27 Next generation sequencing I: introduction to the technology. (Chen)
3/29 Next generation sequencing II: DNA-seq analysis. (Chen)
4/3 Next generation sequencing III: RNA-seq analysis. (Chen)
4/5 Next generation sequencing IV: ChIP-seq analysis; bisulfite sequencing and methylation array. (Chen)
4/10 Gene regulation and miRNA regulation. (guest lecturer: Dr. Takis Benos)
4/12 Gene regulation and miRNA regulation. (guest lecturer: Dr. Takis Benos)
4/17 Student final project presentation
4/19 Student final project presentation

Handout: Course information and handouts will be posted on Blackboard. Students are encouraged to print out the slides before each lecture.

Computer Lab: There will be four lab sessions for hands-on experience with programming and software usage during the first half of the course. R is the main language used, and the ability to program in R is a prerequisite (students who are not familiar with R programming before the semester begins are expected to catch up in the first few weeks). Four homework sets are distributed, one after each computer lab.

Final project: Final projects are carried out in groups of 3 students. We will encourage/enforce a mixture of quantitative (statisticians) and qualitative (biologists) students in each group. The lecturer will provide a list of topics/references at the beginning of the semester. The major goal is to apply statistical techniques learned in class to analyze real data sets and solve real-world problems. A presentation and a final report are expected from each group at the end of the semester.

Grade:
Homework 1-4: 52%
Final project: 48% (mid-term progress report due 3/17 for 8%; final presentation for 20%; final paper due on 4/21 for 20%)