Big Challenges for Statisticians

Hongtu Zhu, Ph.D.
Department of Biostatistics† and Biomedical Research Imaging Center‡
The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Thank NSF and SAMSI! Thank organizers! Thank you!

Science

Statistics

Part 1. Technical Challenges

Imaging Science

From Wikipedia, the free encyclopedia

Imaging Science is a multidisciplinary field concerned with the generation, collection, duplication, analysis, modification, and visualization of images.

As an evolving field, it includes research and researchers from

Physics, Mathematics, Statistics, Electrical Engineering, Computer Vision, Computer Science and Perceptual Psychology.

Three key components

• Image acquisition: studies the physical mechanisms, mathematical models, and algorithms by which imaging devices generate image observations.

• Image interpretation/application: aims to see, monitor, and interpret the targeted world or patterns being imaged.

• Image processing: applies any linear or nonlinear operator to the images to produce targeted patterns.

Level 1: Imaging Data

Overview
• Structural MRI
• Functional MRI
• Diffusion MRI
• Complementary techniques
- Variety of acquisitions
- Measurement basics
- Limitations & artefacts
- Analysis principles
- Acquisition tips

[Figure panels: Structural MRI, Functional MRI (task), Functional MRI (resting), Diffusion MRI]

PET, EEG/MEG, CT, Calcium

Image Processing

Image Acquisition: Signal Models & Noise Sources

Preprocessing: Image Representation, Segmentation, Registration

Data Analysis: Statistical Modeling, Interpretation & Inference

Disciplines: Mathematics & Statistics & Computer Science/Engineering

Individual Imaging Analysis

Imaging Construction

Multimodal Analysis

DTI FLAIR

Marc

Group Imaging Analysis

Registration Prediction

NC/Diseased

Group Differences Longitudinal/Family Brain Imaging Genetics

Hibar, Dinggang, Martin

FDA: Functional Data Analysis

$f \xrightarrow{\;T\;} \hat{F} = T[f]$

FDA: Functional Data Analysis

Registration Images

Estimation Prediction

Voxel-wise Statistical Models, Multiple Comparisons, Smoothing
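The voxel-wise modeling and multiple-comparisons steps named on this slide can be sketched as follows. This is a minimal illustration, not a method from the talk: the simulated data, the helper names (`voxelwise_t`, `two_sided_p_normal`, `benjamini_hochberg`), the normal approximation for p-values, and the choice of Benjamini-Hochberg FDR control are all assumptions made for the example.

```python
import math
import numpy as np

def voxelwise_t(group_a, group_b):
    """Welch two-sample t statistic at every voxel (arrays: subjects x voxels)."""
    mean_diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    se = np.sqrt(group_a.var(axis=0, ddof=1) / group_a.shape[0]
                 + group_b.var(axis=0, ddof=1) / group_b.shape[0])
    return mean_diff / se

def two_sided_p_normal(t):
    """Two-sided p-values from a normal approximation (an assumption made
    to stay dependency-free; a t distribution would be more accurate)."""
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in t])

def benjamini_hochberg(p, q=0.05):
    """Boolean mask of tests rejected by the BH step-up procedure at level q."""
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()       # largest rank meeting the bound
        reject[order[:k + 1]] = True
    return reject

# Simulated data (hypothetical): 40 subjects per group, 2000 voxels,
# with a true group difference in the first 100 voxels only.
rng = np.random.default_rng(1)
n_subj, n_vox, n_active = 40, 2000, 100
controls = rng.standard_normal((n_subj, n_vox))
patients = rng.standard_normal((n_subj, n_vox))
patients[:, :n_active] += 1.0

p = two_sided_p_normal(voxelwise_t(patients, controls))
uncorrected = p < 0.05
fdr = benjamini_hochberg(p, q=0.05)
print("false positives, uncorrected:", int(uncorrected[n_active:].sum()))
print("false positives, BH-FDR:     ", int(fdr[n_active:].sum()))
print("true positives,  BH-FDR:     ", int(fdr[:n_active].sum()))
```

Thresholding each voxel at 0.05 without correction flags many null voxels simply because thousands of tests are run; the step-up procedure trades a little sensitivity for control of the expected proportion of false discoveries.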

Ill-posed inverse problems

$f \xrightarrow{\;T\;} \hat{F} = T[f]$

$d(F, \hat{F}) \to 0$?
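The question of whether the reconstruction T[f] gets close to the true F can be illustrated numerically. The sketch below is an assumption-laden toy, not the setting of the talk: it takes a hypothetical 1-D Gaussian blur `A` as the acquisition model, observes f = A F + noise, and compares two choices of T, the naive inverse and a Tikhonov (ridge) regularized inverse, using relative L2 error as the distance d.

```python
import numpy as np

def blur_matrix(n, width):
    """Hypothetical forward operator A: row-normalized Gaussian blur."""
    idx = np.arange(n)
    A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / width) ** 2)
    return A / A.sum(axis=1, keepdims=True)

def naive_inverse(A, f):
    """T[f] = A^{-1} f: exact inversion, which amplifies noise."""
    return np.linalg.solve(A, f)

def tikhonov(A, f, lam):
    """T[f] = argmin ||A x - f||^2 + lam ||x||^2 (ridge-regularized inverse)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ f)

rng = np.random.default_rng(0)
n = 100
x = np.linspace(0.0, 1.0, n)
F = np.sin(2 * np.pi * x) + (x > 0.5)        # true signal F (assumed)
A = blur_matrix(n, width=1.5)
f = A @ F + 0.01 * rng.standard_normal(n)    # observed data f

def rel_error(F_hat):
    """d(F, F_hat): relative L2 distance to the truth."""
    return np.linalg.norm(F_hat - F) / np.linalg.norm(F)

err_naive = rel_error(naive_inverse(A, f))
err_ridge = rel_error(tikhonov(A, f, lam=1e-3))
print(f"cond(A) = {np.linalg.cond(A):.2e}")
print(f"d(F, F_hat), naive inverse: {err_naive:.3g}")
print(f"d(F, F_hat), Tikhonov:      {err_ridge:.3g}")
```

Even small measurement noise is magnified by the large condition number when A is inverted directly, whereas the regularized solution accepts a small bias in exchange for stability. The penalty weight `lam` is fixed by hand here; in practice it would need to be chosen from the data.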


Level 2: A Multiscale Physical System

Stimulus – activity – measurement chain

The van Essen diagram

Robinson

A Multi-modal Approach

• Different models at different scales.
• Ladder of overlapping models.
• Must be testable against multiple phenomena.

Level 3: Data Integration

Ritchie et al. (2015). Nature Reviews Genetics.

In this Review, we describe the principles of meta-dimensional analysis and multi-staged analysis, and provide an overview of some of the approaches that are used to predict a given quantitative or categorical outcome, the tools available to implement these analyses, and the various strengths and weaknesses of these strategies. In addition, we describe the analytical challenges that emerge with data sets of this magnitude, and provide our perspective on how such systems genomic analyses might develop in the future.

Why integrate data?

Data integration can have numerous meanings; however, in this Review, we use it to mean the process by which different types of omic data are combined as predictor variables to allow more thorough and comprehensive modelling of complex traits or phenotypes (which are likely to be the result of an elaborate interplay among biological variation at various levels of regulation) through the identification of more informative models. Data integration methods are now emerging that aim to bridge the gap between our ability to generate vast amounts of data and our understanding of biology, thus reflecting the complexity within biological systems. The primary motivation behind integrated data analysis is to identify key genomic factors, and importantly their interactions, that explain or predict disease risk or other biological outcomes. The success in understanding the genetic and genomic architecture of complex phenotypes has been modest, and this could be due to our limited exploration of the interactions among the [...], [...], metabolome and so on. Data integration may provide improved power to identify the important genomic factors and their interactions (BOX 1). In addition, modelling the complexity of, and the interactions between, variation in DNA, gene expression, methylation, metabolites and proteins may improve our understanding of the mechanism or causal relationships of complex-trait architecture. There are two main approaches to data integration: multi-staged analysis, which involves integrating information using a stepwise or hierarchical analysis approach; and meta-dimensional analysis, which refers to the concept of integrating multiple different data types to build a multivariate model associated with a given outcome [16–18].

Glossary
• Meta-dimensional analysis: An approach whereby all scales of data are combined simultaneously to produce complex models defined as multiple variables from multiple scales of data.
• Multi-staged analysis: A stepwise or hierarchical analysis method that reduces the search space through different stages of analysis.
• Systems genomics: An analysis approach that models the complex inter- and intra-individual variations of traits and diseases using data from next-generation omic data.
• Data integration: The incorporation of multi-omic information in a meaningful way to provide a more comprehensive analysis of a biological point of interest.

Data types by level (from Figure 1)
• Genome: SNP, CNV, LOH, genomic rearrangement, rare variant
• Epigenome: DNA methylation, histone modification, chromatin accessibility, TF binding, miRNA
• Transcriptome: gene expression, alternative splicing, long non-coding RNA, small RNA
• Proteome: protein expression, post-translational modification, cytokine array
• Metabolome: metabolite profiling in serum, plasma, urine, CSF, etc.
• Phenome: cancer, metabolic syndrome, psychiatric disease

Figure 1 | Biological systems multi-omics from the genome, epigenome, transcriptome, proteome and metabolome to the phenome. Heterogeneous genomic data exist within and between levels, for example, single-nucleotide polymorphism (SNP), copy number variation (CNV), loss of heterozygosity (LOH) and genomic rearrangement, such as translocation, at the genome level; DNA methylation, histone modification, chromatin accessibility, transcription factor (TF) binding and micro RNA (miRNA) at the epigenome level; gene expression and alternative splicing at the transcriptome level; protein expression and post-translational modification at the proteome level; and metabolite profiling at the metabolome level. Arrows indicate the flow of genetic information from the genome level to the metabolome level and, ultimately, to the phenome level. The red crosses indicate inactivation of transcription or translation. CSF, cerebrospinal fluid; Me, methylation; TFBS, transcription factor-binding site.

[Figure: flow chart]

Endophenotypes

Genes → Expression → Molecules → Cells → Brain → Symptoms (with feedback between stages)

Stage labels: RNA genes, protein-coding genes; RNA, proteins, metabolites; development, organelle; structure, circuits, physiology; behavioral tests, diagnosis, self-report

Fields: Genomics, Transcriptomics, Interactomics, Cell biology, Neuroscience, Imaging; Brain interactome

Environmental, social and psychological factors

Figure 1. A simplified flow chart for psychiatric disorders: from genes to symptoms

Zhao and Castellanos (2016) Discovery science strategies in studies of the pathophysiology of child and adolescent psychiatric disorders: promises and limitations

Big Data Integration in Health Informatics

[Diagram nodes: E, I, G, D, Selection]

E: environmental factors

I: imaging/device

G: genetic/genomics

D: disease

http://en.wikipedia.org/wiki/DNA_sequence

Part 2. Career Challenges

Career Development

Start with simple projects

Learn from others

Try hard to get involved in some large studies

Think about how to do it better, in what sense?

More papers.

Develop new tools and packages.

Write more grants

Training

SAMSI videos and slides for summer schools and lectures.

Short Courses in major conferences.

New Graduate Courses

Collaborations

Good Mentors: Theory and Applications.

Good Collaborators: Radiology, Neuroscience, Psychiatry, Psychology, Computer Science, …

Data Sets

Big Public Data Sets:

• Alzheimer’s Disease Neuroimaging Initiative (ADNI)

• NIH MRI Study of Normal Brain Development

• National Database for Autism Research

• Human Connectome Project

• The Cancer Genome Atlas (TCGA)

• UK Biobank

https://en.wikipedia.org/wiki/List_of_neuroscience_databases

UK Biobank Project

The Human Connectome Project

The HCP aims to elucidate the neural pathways that underlie brain function and behavior.

The Heavily Connected Brain
Peter Stern, “Connection, connection, connection…”, Science, Nov. 1 2013: Vol. 342 no. 6158 P.577

Resting-state fMRI (rfMRI) and dMRI provide information about brain connectivity. Task-evoked fMRI reveals much about brain function. Structural MRI captures the shape of the highly convoluted cerebral cortex. Behavioral data relate brain circuits to individual differences in cognition, perception, and personality. Magnetoencephalography (MEG) combined with electroencephalography (EEG) yields information about brain function on a millisecond time scale.

http://www.nitrc.org/

NITRC = The Source for Neuroimaging Tools and Resources

Statistical Parametric Mapping (SPM)
FMRIB Software Library (FSL)
Analysis of Functional NeuroImages (AFNI)
FreeSurfer
……

Conferences

Human Brain Mapping (HBM)
ISMRM conference
SNF conference

Information Processing in Medical Imaging (IPMI)
SIAM Conference on Imaging Science (IS)

Medical Image Computing and Computer Assisted Intervention (MICCAI)
International Symposium on Biomedical Imaging (ISBI)

IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Neural Information Processing Systems Foundation (NIPS)

Publications

NeuroImage
Medical Image Analysis
IEEE Transactions on Medical Imaging

Human Brain Mapping

IEEE Transactions on Signal Processing
IEEE Transactions on Image Processing
IEEE Signal Processing Magazine

SIAM Journal on Imaging Sciences
IEEE Transactions on Pattern Analysis and Machine Intelligence

Annals of Applied Statistics
Biometrics
Biostatistics
Journal of the American Statistical Association
ACS

Part 3. Software Challenges

Software Development

http://www.nitrc.org/

NITRC = The Source for Neuroimaging Tools and Resources

Our community lacks a good, widely used statistical software package for neuroimaging data analysis.

Software Development

Start a Neuroconduct project

• Share responsibilities and information

• Common input and output files compatible with major packages

• Build small Rcpp and Matlab packages

• Release them through your own websites, our neuroconduct website and http://www.nitrc.org/

• Focus on a few key tools and expand from them

• Encourage other groups to download and use them.

Software Development

1. Simulators for different imaging modalities
• Evaluate image processing tools
• Evaluate statistical methods (group analysis, reliability)
2. Standardize all image processing and analysis pipelines
• fMRI and resting fMRI
• EEG/MEG
• DTI
• CT
• Calcium
• PET
3. Develop new tools to do multi-modal analysis
4. Develop new tools to integrate imaging, genetic, and clinical data
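The first item, using a simulator to evaluate an image processing tool, can be sketched in a few lines. Everything here is hypothetical and chosen for illustration: the phantom (a disc plus Gaussian noise), the tool under test (a simple mean filter), and the function names `simulate_image`, `box_smooth`, and `evaluate`. The point is only that a known ground truth lets a tool be scored objectively.

```python
import numpy as np

def simulate_image(shape=(64, 64), noise_sd=0.5, rng=None):
    """Hypothetical simulator: a bright disc on a dark background plus
    Gaussian noise. Returns (noisy image, ground truth)."""
    rng = np.random.default_rng() if rng is None else rng
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    cy, cx = (shape[0] - 1) / 2, (shape[1] - 1) / 2
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= (min(shape) / 4) ** 2
    truth = inside.astype(float)
    return truth + noise_sd * rng.standard_normal(shape), truth

def box_smooth(image, radius):
    """Tool under evaluation: mean filter over a (2r+1) x (2r+1) window,
    with edge padding at the image border."""
    if radius == 0:
        return image.copy()
    padded = np.pad(image, radius, mode="edge")
    out = np.zeros_like(image, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / size ** 2

def evaluate(radii, n_rep=20, seed=0):
    """Mean squared error against the known truth, averaged over replicates."""
    rng = np.random.default_rng(seed)
    mse = {r: 0.0 for r in radii}
    for _ in range(n_rep):
        noisy, truth = simulate_image(rng=rng)
        for r in radii:
            mse[r] += np.mean((box_smooth(noisy, r) - truth) ** 2) / n_rep
    return mse

results = evaluate(radii=[0, 1, 2, 8])
for r, err in results.items():
    print(f"radius {r}: MSE = {err:.4f}")
```

Scoring several settings against the same simulated truth exposes the trade-off between leaving noise in (small windows) and blurring the object boundary (large windows). A realistic simulator would model the physics and noise of a specific modality rather than additive Gaussian noise.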
