Contents

Instruction ...... 1
Program ...... 2

Abstracts
On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex, Michael I. Jordan ...... 4
Value of Information Methods, Le Bao ...... 5
TBA, Jian Cao ...... 6
Optimal covariance matrix estimation for high-dimensional noise in high-frequency data, Jinyuan Chang ...... 7
Multivariate network meta-analysis made simple, Yong Chen ...... 8
Functional Canonical Correlation and Functional Prediction, Di-Rong Chen ...... 9
A Power One Test for Unit Roots Based on Sample Autocovariances, Guanghui Cheng ...... 10
Metric Learning via Cross-Validation, Linlin Dai ...... 11
Dynamic Change Detection with False Discovery Rate Control, Lilun Du ...... 12
Valuing commodity options and futures options with changing economic conditions, Kun Fan ...... 13
An extended Mallows model for ranked data aggregation, Xiaodan Fan ...... 14
Some challenges in analyzing big data in health and medical research, Bo Fu ...... 15
Estimating Truncated Functional Linear Models with a Nested Group Bridge Approach, Tianyu Guan ...... 16
Modeling Traffic Crash Risk, Feng Guo ...... 17
Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares, Xiao Guo ...... 18
Local Inference in Additive Models with Decorrelated Local Linear Estimator, Zijian Guo ...... 19
Nonlocal online RPCA for video denoising, Zhi Han ...... 20
Oracle P-value and Variable Screening, Ning Hao ...... 21
Inference in a mixture additive hazards cure model, Haijin He ...... 22
The Pearson Correlation Between Tree-Shaped Data Sets: Estimating, Graphical Representation and Hypothesis Testing, Jie Hu ...... 23
AI-Based Solution for Financial Risk Assessment and Fraud Detection, Ling Huang ...... 24
Causal mediation of semicompeting risks, Yen-Tsung Huang ...... 25
Multiple Imputation on enhanced model identification for nonignorable nonresponse, Jongho Im ...... 26
Generalized Four Moment Theorem and an Application to CLT for Spiked Eigenvalues of Large-dimensional Covariance Matrices, Dandan Jiang ...... 27
Functional-coefficient regression models with GARCH errors, Jiancheng Jiang ...... 28
Prediction of hospital readmission frailties with misspecified shared frailty models, Xuejun Jiang ...... 29
The Operating Principle of Regularized Spectral Clustering, Donggyu Kim ...... 30
Discrepancy between global and local principal component analysis on large-panel high-frequency data, Xin-Bing Kong ...... 31
Optimal Estimation of Wasserstein Distance on Trees with An Application to Microbiome Studies, Hongzhe Li ...... 32
A supplement to Jiang's asymptotic distribution of the largest entry of a sample correlation matrix, Deli Li ...... 33
High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition, Guodong Li ...... 34
Tensor Analysis and Neuroimaging Applications, Lexin Li ...... 35
Statistical Learning for Personalized Wealth Management, Yingying Li ...... 36
Mediation analysis for zero-inflated mediators, Zhigang Li ...... 37
Identifiability and Non-Convex Algorithm for Multi-Channel Blind Deconvolution, Song Li ...... 38
A non-randomized multiple testing procedure for large-scale heterogeneous discrete hypotheses based on randomized tests, Nan Lin ...... 39
Deep Neural Networks for Rotation-Invariance Approximation and Learning, Shao-Bo Lin ...... 40
A Quantile Association-based Variable Selection, Yuanyuan Lin ...... 41
Some Statistical Methods for Single-cell Genomics, Zhixiang Lin ...... 42
Weighted multiple-quantile classifiers for functional data with application in multiple sclerosis screening, Catherine Liu ...... 43
Optimal Covariance Matrix Estimation for High-dimensional Noise in High-frequency Data, Cheng Liu ...... 44
Data-adaptive Kernel Support Vector Machine, Xin Liu ...... 45
Testing of covariate effects under ridge regression for high-dimensional data, Xu Liu ...... 46
Towards Software-Defined Infrastructure for Decentralized Data Governance, Xuanzhe Liu ...... 47
Distributed learning from multiple EHR databases: Contextual embedding models for predicting medical events, Qi Long ...... 48
Wavelet Empirical Likelihood Estimator for Stationary and Locally Stationary Long Memory Processes, Zhiping Lu ...... 49
GMV Prediction Using Driver Preference, Shikai Luo ...... 50
A Nonparametric Bayesian Approach to Simultaneous Subject and Cell Heterogeneity Discovery for Single Cell RNA-Seq Data, Xiangyu Luo ...... 51
A Versatile Estimation Procedure without Estimating the Nonignorable Missingness Mechanism, Yanyuan Ma ...... 52
Matrix Completion under Low-Rank Missing Mechanism, Xiaojun Mao ...... 53
A mean field theory of two-layers neural networks, Song Mei ...... 54
A Dynamic Additive and Multiplicative Effects Model with Application to the United Nations Voting Behaviours, Xiaoyue Niu ...... 55
A Super Scalable Algorithm for Short Segment Detection, Yue Niu ...... 56
Improved doubly robust estimation in learning individualized treatment rules, Yinghao Pan ...... 57
Predicting terrorist events: opportunities and challenges, Andre Python ...... 58
On the ‘Off-Label’ Use of Data Normalization for Sample Classification and Prognostication, Li-Xuan Qin ...... 59
Adaptive Minimax Density Estimation for Huber’s Contamination Model under $L_p$ Losses, Zhao Ren ...... 60
Dynamic Spatial Panel Data Models with Endogeneity and Common Factors, Wei Shi ...... 61
Bridging the gap between noisy healthcare data and knowledge: automated translation of medical terminology, Xu Shi ...... 62
Estimating the sample mean and standard deviation from the five-number summary and their applications in evidence-based medicine, Tiejun Tong ...... 63
An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss, Cheng Wang ...... 64
MOSUM-based test and estimation method for multiple changes in panel data, Man Wang ...... 65
Structured tensor decomposition and its application, Miaoyan Wang ...... 66
Identification of the number of factors for factor modeling in high dimensional time series, Qinwen Wang ...... 67
Large Multiple Graphical Model Inference via Bootstrap, Shaoli Wang ...... 68
An adaptive independence test for microbiome community data, Tao Wang ...... 69
Integrated Quantile Rank Test (iQRAT) for heterogeneous joint effect of rare and common variants in sequencing studies, Tianying Wang ...... 70
Model Free Approach to Quantifying the Proportion of Treatment Effect Explained by a Surrogate Marker, Xuan Wang ...... 71
Scattering Transform and Stylometry Analysis in Arts, Yang Wang ...... 72
A Fast and Practical Randomized Method for Low-Rank Tensor Approximations, Yao Wang ...... 73
The traps that must be encountered in machine learning practices, Hu Wei ...... 74
Flexible Experimental Designs for Valid Single-cell RNA-sequencing Experiments Allowing Batch Effects Correction, Yingying Wei ...... 75
Recent developments in graph matching: statistical analysis, Yihong Wu ...... 76
Differential Markov Random Field Analysis, Yin Xia ...... 77
Building a Translational Research Program in Neuroinflammation: A Data Driven Approach to Advance Precision Medicine for Multiple Sclerosis, Zongqi Xia ...... 78
Realized volatility forecasting with HAR-GARCH type model: a Bayesian approach, Han Xiang ...... 79
TBA, Han Xiao ...... 80
Multiple Testing Embedded in an Aggregation Tree to Identify where Two Distributions Differ, Jichun Xie ...... 81
Pearson's statistics: approximation theory and beyond, Mengyu Xu ...... 82
Distribution and correlation free two-sample test of high-dimensional means, Kaijie Xue ...... 83
Robust Estimates and Generative Adversarial Networks, Yuan Yao ...... 84
A statistical and machine learning framework for new energy vehicle ride sharing system, Kaixian Yu ...... 85
Word Segmentation and Term Discovery in Chinese Electronic Medical Records Using Graph Theory and Deep Learning, Sheng Yu ...... 86
Statistical Inference Based on Sufficient Dimension Reduction, Zhou Yu ...... 87
Global Optimality of Stochastic Semi-definite Optimization with Application to Ordinal Embedding, Jinshan Zeng ...... 88
High-dimensional Tensor Regression Analysis, Anru Zhang ...... 89
Enhanced Pulmonary Nodule Detection Using Fully Automated Deep Learning: A Multifactor Investigation, Chi Zhang ...... 90
Heteroscedasticity test based on high-frequency data with jumps and microstructure noise, Chuanhai Zhang ...... 91
Factor Models for High-Dimensional Tensor Time Series, Cun-Hui Zhang ...... 92
Stochastic differential reinsurance games with capital injections, Nan Zhang ...... 93
Structured sparse logistic regression with application to lung cancer prediction using breath volatile biomarkers, Xiaochen Zhang ...... 94
Simulated Distribution Based Learning for Non-regular and Regular Statistical Inferences, Zhengjun Zhang ...... 95
Estimation and inference for the indirect effect in high-dimensional linear mediation models, Sihai Dave Zhao ...... 96
Factor Modeling for Volatility, Xinghua Zheng ...... 97
Sequential scaled sparse factor regression, Zemin Zheng ...... 98
Estimating Endogenous Treatment Effect Using High-Dimensional Instruments with an Application to the Olympic Effect, Wei Zhong ...... 99
Approximation Theory of Deep Convolutional Neural Networks, Ding-Xuan Zhou ...... 100
Global Convergence of EM, Harrison H. Zhou ...... 101
R package for new normality test, Maoyuan Zhou ...... 102
GD-RDA: A new regularized discriminant analysis for high dimensional data, Yan Zhou ...... 103
Matrix Completion for Network Analysis, Ji Zhu ...... 104
Quantile double autoregression, Qianqian Zhu ...... 105
A Boosting Algorithm for Estimating Generalized Propensity Scores with Continuous Treatments, Yeying Zhu ...... 106
Safe machine learning for safe genome editing, James Zou ...... 107

List of Participants ...... 108
List of Participants (ZJU) ...... 112
Maps ...... 115

2019 Hangzhou International Conference on Frontiers of Data Science

May 26-27, 2019, Jinxi Hotel, Hangzhou

Program Committee
Tony CAI, University of Pennsylvania (USA)
Tianxi CAI, Harvard University (USA)
Qi-Man SHAO, Southern University of Science and Technology (China)
Yazhen WANG, University of Wisconsin-Madison (USA, Chair)
Ming YUAN, Columbia University (USA)
Heping ZHANG, Yale University (USA)

Local Organizing Committee
Xi CHEN, Zhejiang University (Chair)
Zhonggen SU, Zhejiang University
Jianwei YIN, Zhejiang University
Lixin ZHANG, Zhejiang University
Rongmao ZHANG, Zhejiang University

Contact: Weina Su, [email protected], 0571-88208268

Organizer: Center for Data Science, Zhejiang University
Co-sponsors: Zhejiang Association of Internet Finance; International Business School of Zhejiang University; International Research Center for Data Analytics and Management, Zhejiang University

Official Website: http://cds.zju.edu.cn

Program

May 26

08:30~09:10  Opening Ceremony (Jinxi Hall). Chair: Xi Chen

09:10~10:00  Keynote Speech (Jinxi Hall). Chair: Tony Cai
Speaker: Michael I. Jordan. Title: On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex

10:00~10:20  Tea Break

10:20~12:00  Parallel Sessions
No.1 Hall: Recent advance in complex data analysis. Chair: Zijian Guo. Organizer: Zijian Guo. Speakers: Cun-Hui Zhang, Harrison Zhou, Han Xiao, Yihong Wu
No.2 Hall: Learning Theory and Compressed Sensing. Chair: Song Li. Organizer: Junhong Lin. Speakers: Yang Wang, Ding-Xuan Zhou, Di-Rong Chen, Song Li
No.3 Hall: The advancement in statistical methods for biological data analysis. Chair: Jie Hu. Organizer: Hangjin Jiang. Speakers: Xiaodan Fan, Yingying Wei, Xiangyu Luo, Jie Hu
Sightseeing Hall: Data Science in Healthcare. Chair: Wei Luo. Organizer: Tianxi Cai. Speakers: Xu Shi, Zongqi Xia, Sheng Yu, Xiaoyue Niu

Lunch

13:30~15:10  Parallel Sessions
No.1 Hall: Translational Biomedical Data Science. Chair: Xuan Wang. Organizer: Tianxi Cai. Speakers: Xuan Wang, Le Bao, Yanyuan Ma, Yen-Tsung Huang
No.2 Hall: Recent Developments in Applied Statistical Methods. Chair: Donggyu Kim. Organizer: Donggyu Kim. Speakers: Donggyu Kim, Hyo Young Choi, Jongho Im, Maoyuan Zhou
No.3 Hall: Recent development in complex data analysis. Chair: Zhixiang Lin. Organizer: Yuanyuan Lin. Speakers: Xu Liu, Zhixiang Lin, Linlin Dai, Yuanyuan Lin
Sightseeing Hall: Recent development on data analysis. Chair: Jinyuan Chang. Organizer: Jinyuan Chang. Speakers: Zemin Zheng, Guanghui Cheng, Jinyuan Chang

15:10~15:30  Tea Break

15:30~17:10  Parallel Sessions
No.1 Hall: Advanced Methods for Analysis of Modern Biomedical Data. Chair: Yin Xia. Organizer: Tianxi Cai. Speakers: Dave Zhao, Yin Xia, Zijian Guo, Ji Zhu
No.2 Hall: Real-world Application of AI & Service Computing. Chair: Renjun Xu. Organizer: Renjun Xu & Xiaoye Miao. Speakers: Xuanzhe Liu, Ling Huang, Hu Wei, Andre Python
No.3 Hall: Time series analysis. Chair: Rongmao Zhang. Organizer: Xinyu Song & Yazhen Wang. Speakers: Qianqian Zhu, Chuanhai Zhang, Xiao Guo, Zhiping Lu
Sightseeing Hall: Advances in Statistical Learning. Chair: Qi Long. Organizer: Qi Long. Speakers: Yinghao Pan, Yong Chen, Xiaochen Zhang, Qi Long

May 27

08:30~10:10  Parallel Sessions
No.1 Hall: Advances in High Dimensional Models: Scalar-on-vector Regressions. Chair: Junhong Lin. Organizer: Emma Jingfei Zhang. Speakers: Lexin Li, Jiguo Cao, Tao Wang, Anru Zhang
No.2 Hall: Complex model inference. Chair: Tianxiao Pang. Organizer: Yazhen Wang. Speakers: Zhengjun Zhang, Nan Lin, Bo Fu
No.3 Hall: Machine Learning in Integration of Large-Scale Genomics Data. Chair: Hongzhe Li. Organizer: Hongzhe Li. Speakers: Jichun Xie, James Zou, Hongzhe Li
Sightseeing Hall: Machine Learning Beyond Theory and Application. Chair: Harrison Zhou. Organizer: Harrison Zhou. Speakers: Song Mei, Zhao Ren, Yuan Yao

10:10~10:30  Tea Break

10:30~12:10  Parallel Sessions
No.1 Hall: Financial Econometrics. Chair: Hangjin Jiang. Organizer: Yazhen Wang. Speakers: Yingying Li, Xinghua Zheng, Xinbing Kong, Qinwen Wang
No.2 Hall: New methods and perspective for analysis of multiple data sources. Chair: Ning Hao. Organizer: Jiancheng Jiang. Speakers: Yue Niu, Ning Hao, Jiancheng Jiang, Dandan Jiang
No.3 Hall: Harness the Power of Complicated Data. Chair: Catherine Liu. Organizer: Catherine Liu. Speakers: Lilun Du, Guodong Li, Catherine Liu, Tiejun Tong
Sightseeing Hall: Data science in genetics. Chair: Wei Luo. Organizer: Ying Wei. Speakers: Lixuan Qin, Tianying Wang, Zhigang Li, Xin Liu

Lunch

13:30~15:10  Parallel Sessions
No.1 Hall: Learning high-dimensional data with complex structure. Chair: Weidong Liu. Organizer: Weidong Liu. Speakers: Xiaojun Mao, Zhou Yu, Wei Zhong
No.2 Hall: Modern statistics: theory and methods. Chair: Miaoyan Wang. Organizer: Miaoyan Wang. Speakers: Chi Zhang, Wei Shi, Mengyu Xu, Miaoyan Wang
No.3 Hall: Machine learning. Chair: Yao Wang. Organizer: Yao Wang. Speakers: Shaobo Lin, Jinshan Zeng, Zhi Han, Yao Wang
Sightseeing Hall: Financial Stochastics and Statistics. Chair: Zhonggen Su. Organizer: Yazhen Wang & Changliang Zou. Speakers: Kun Fan, Nan Zhang, Cheng Liu

15:10~15:30  Tea Break

15:30~17:10  Parallel Sessions
No.1 Hall: Recent developments in statistical and probability learnings. Chair: Man Wang. Organizer: Xinyuan Song. Speakers: Haijin He, Xuejun Jiang, Han Xiang
No.2 Hall: Statistical and Machine Learning Methods with Application in transportation. Chair: Renjun Xu. Organizer: Rui Song. Speakers: Shikai Luo, Kaixian Yu, Feng Guo
No.3 Hall: Complex Data Analysis. Chair: Wei Luo. Organizer: Wei Luo. Speakers: Shaoli Wang, Yeying Zhu, Deli Li
Sightseeing Hall: Recent Advances in Service Computing, AI & Big Data Analysis. Chair: Xiaoye Miao. Organizer: Xiaoye Miao. Speakers: Jian Cao, Kaijie Xue

On Gradient-Based Optimization: Accelerated, Stochastic and Nonconvex

Michael I. Jordan1

1 University of California at Berkeley, USA E-mail: [email protected]

Abstract: Optimization methods play a key enabling role in statistical inference, both frequentist and Bayesian. Moreover, as statistics begins to more fully embrace computation, what is often meant by "computation" is in fact "optimization". I will discuss some recent progress in high-dimensional, large-scale optimization, where new theory and algorithms have provided non-asymptotic rates, sharp dimension dependence, elegant ties to geometry and practical relevance. In particular, I discuss several recent results: (1) a new framework for understanding Nesterov acceleration, obtained by taking a continuous-time, Lagrangian/Hamiltonian/symplectic perspective, (2) a discussion of how to escape saddle points efficiently in nonconvex optimization, and (3) the acceleration of Langevin diffusion.
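The acceleration phenomenon in (1) is easy to see numerically. Below is a minimal, self-contained sketch, not code from the talk: the quadratic objective, step size and iteration budget are arbitrary illustrative choices, comparing plain gradient descent with Nesterov's accelerated scheme.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 x' A x with condition number 100.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
f = lambda x: 0.5 * x @ A @ x
L = 100.0                              # Lipschitz constant of the gradient
x0 = np.array([1.0, 1.0])

# Plain gradient descent: x_{k+1} = x_k - (1/L) grad(x_k)
x = x0.copy()
for _ in range(100):
    x = x - grad(x) / L
gd_loss = f(x)

# Nesterov acceleration: take the gradient step at an extrapolated point,
# with the classical momentum weight (k-1)/(k+2).
x, x_prev = x0.copy(), x0.copy()
for k in range(1, 101):
    y = x + (k - 1) / (k + 2) * (x - x_prev)
    x_prev = x
    x = y - grad(y) / L
nesterov_loss = f(x)
```

After the same 100 iterations, the accelerated iterate reaches a markedly lower objective value, matching the O(1/k^2) versus O(1/k) rates.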

Value of Information Methods

Jacob Parsons, Le Bao*

*Department of Statistics, The Pennsylvania State University, USA E-mail: [email protected]

Abstract: We develop new value of information methods to apply to the problems of outlier detection and influence analysis. The proposed method has a distinct advantage in flexibility and interpretability when compared to existing methods. We study the theoretical properties of three value of information quantities, establish the relationship between the proposed measures and classic measures, and illustrate our proposed approach in the case of a generalized linear mixed model for studying HIV epidemics.

Key Words: Influence; Outlier; Bayesian

TBA

Jian Cao1

1 Shanghai Jiaotong University, China E-mail: [email protected]

Abstract: TBA

Optimal covariance matrix estimation for high-dimensional noise in high-frequency data

Jinyuan Chang1

1 Southwestern University of Finance and Economics, China E-mail: [email protected]

Abstract: TBA

Multivariate network meta-analysis made simple

Yong Chen1

1 University of Pennsylvania, 423 Guardian Dr, Blockley 602, USA E-mail: [email protected]

Abstract: Due to patients' heterogeneous responses to treatment, there is growing interest in developing novel and efficient statistical methods for estimating individualized treatment rules (ITRs). The central idea is to recommend treatment according to patient characteristics, and the optimal ITR is the one that maximizes the expected clinical outcome if followed by the patient population. We propose an improved estimator of the optimal ITR that enjoys two key properties. First, it is doubly robust: the proposed estimator is consistent if either the propensity score model or the outcome model is correct. Second, it achieves the smallest variance among its class of doubly robust estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. Simulation studies show that the estimated optimal ITR obtained from our method yields better clinical outcomes than its main competitors. Data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study are analyzed as an illustrative example.

Key Words: Double robustness; Individualized treatment rule; Personalized medicine
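The double-robustness property can be conveyed with a toy simulation: with a correctly specified propensity score (here simply the known randomization probability), an augmented inverse-probability-weighted (AIPW) estimate of a treatment-arm mean stays consistent even when the outcome model is deliberately wrong. This is a generic illustration of the principle, not the estimator proposed in the talk; all simulation settings are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)
e = 0.5                                       # known propensity (randomized trial)
T = rng.binomial(1, e, size=n)
Y = 2.0 + X + 1.0 * T + rng.normal(size=n)    # true E[Y(1)] = 3.0

# Deliberately misspecified outcome model for E[Y | X, T=1]: identically zero.
m1_wrong = np.zeros(n)

# AIPW estimator of E[Y(1)]: the augmentation term has mean zero whenever
# the propensity score is correct, so the estimate is consistent anyway.
aipw = np.mean(T * Y / e - (T - e) / e * m1_wrong)

# Naive plug-in using only the wrong outcome model is badly biased.
naive = m1_wrong.mean()
```

Swapping in a correct outcome model and a wrong propensity model gives the mirror-image half of the double-robustness guarantee.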

Functional Canonical Correlation and Functional Prediction

Di-Rong Chen1

1 School of Mathematics and System Sciences, Beihang University, China E-mail: [email protected]

Abstract: Functional data analysis (FDA) is concerned with infinite-dimensional objects collected in the form of random curves. A reproducing kernel Hilbert space (RKHS) framework, based on the Loeve–Parzen congruence, has been developed in FDA to deal with the non-invertibility of certain compact operators, e.g., the covariance operator. In this talk, we consider functional canonical correlation analysis and functional prediction in the RKHS framework. Estimators are proposed and convergence rates are established.

A Power One Test for Unit Roots Based on Sample Autocovariances

Guanghui Cheng1, Jinyuan Chang, Qiwei Yao

1 Guangzhou University, China E-mail: [email protected]

Abstract: Testing for the existence of unit roots is a long-standing and challenging problem in time series analysis in econometrics. Most available methods suffer from a lack of size control and poor power. Perhaps more significantly, the settings for unit root tests typically postulate the existence of unit roots as the null hypothesis and therefore lead to innately indecisive inference, as statistical tests are incapable of accepting a null hypothesis. We propose a new test for the null hypothesis that the observed time series is weakly stationary (possibly with a linear deterministic trend) against the alternative that unit roots are present. When the null hypothesis is rejected, we can therefore conclude that the presence of unit roots is supported by significant evidence in the data.
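As a generic illustration of why sample autocovariances carry unit-root information (this is not the authors' test statistic; the models and sample size are illustrative), the lag-one sample autocorrelation behaves very differently for a random walk than for a stationary AR(1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
eps = rng.normal(size=n)

# Random walk (unit root): y_t = y_{t-1} + eps_t
rw = np.cumsum(eps)

# Stationary AR(1): y_t = 0.5 y_{t-1} + eps_t
ar = np.empty(n)
ar[0] = eps[0]
for t in range(1, n):
    ar[t] = 0.5 * ar[t - 1] + eps[t]

def lag1_autocorr(y):
    """Ratio of the lag-1 sample autocovariance to the sample variance."""
    y = y - y.mean()
    return np.dot(y[1:], y[:-1]) / np.dot(y, y)

rho_rw = lag1_autocorr(rw)   # close to 1 under a unit root
rho_ar = lag1_autocorr(ar)   # close to 0.5 for this stationary AR(1)
```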

Metric Learning via Cross-Validation

Linlin Dai1, Kani Chen, Gang Li, Yuanyuan Lin

1Southwestern University of Finance and Economics E-mail: [email protected]

Abstract: Many algorithms rely on a good measure of the distance between two inputs. Metric learning learns a proper metric from the data, and the learned metric can then be used to derive algorithms for problem solving. It is widely applied in machine learning, data mining and pattern recognition. This paper studies the multiple-index model, which has attracted much attention in statistics, within the framework of metric learning. The metric is incorporated into a nonparametric kernel smoothing function to approximate the link function. An optimization procedure over all positive semi-definite matrices is built on cross-validation and is called cross-validation metric learning. This procedure simultaneously estimates the link function and finds the projection matrix of the multiple-index model. The estimated link function and directions share the same bandwidth, thus avoiding the problem caused by the two separate estimation procedures used in some other approaches.

Various simulation studies and real data analyses demonstrate its advantages over other existing approaches.

Key Words: Cross-validation; Multiple index model; Kernel estimation
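In the single-index special case, the cross-validation idea can be sketched directly: candidate projection directions are compared by the leave-one-out error of a Nadaraya-Watson fit on the projected index. The sketch below (invented data, a fixed bandwidth, and a grid of directions in place of optimization over all positive semi-definite matrices) is only meant to convey the mechanism.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 2))
beta = np.array([1.0, 1.0]) / np.sqrt(2.0)   # true index direction
y = np.sin(X @ beta) + 0.1 * rng.normal(size=n)

def loo_cv_error(direction, h=0.3):
    """Leave-one-out CV error of a Nadaraya-Watson fit on the 1-d index."""
    t = X @ direction
    D = t[:, None] - t[None, :]
    W = np.exp(-0.5 * (D / h) ** 2)           # Gaussian kernel weights
    np.fill_diagonal(W, 0.0)                  # leave each point out of its own fit
    yhat = W @ y / W.sum(axis=1)
    return np.mean((y - yhat) ** 2)

# Compare candidate directions on an angle grid; pick the CV minimizer.
angles = np.linspace(0.0, np.pi, 60, endpoint=False)
errors = [loo_cv_error(np.array([np.cos(a), np.sin(a)])) for a in angles]
a_best = angles[int(np.argmin(errors))]
best = np.array([np.cos(a_best), np.sin(a_best)])
```

The CV minimizer lines up with the true index direction, because projecting onto any other direction mixes in variation the kernel smoother cannot explain.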

Dynamic Change Detection with False Discovery Rate Control

Lilun Du1, Changliang Zou2

1HKUST, HKSAR 2Nankai University, China E-mail: [email protected]

Abstract: In multiple data stream surveillance, the rapid and sequential identification of individuals whose behaviour deviates from the norm has become particularly important. In such applications, the state of a stream can alternate, possibly multiple times, between a null state and an alternative state. To balance the ability to detect two types of changes, that is, a change from the null to the alternative and back to the null, we develop a large-scale dynamic testing system in the framework of false discovery rate (FDR) control. By fully exploiting the sequential feature of data streams, we propose a new procedure based on a penalised version of the generalised likelihood ratio test statistics for change detection. The FDR at each time point is shown to be controlled under some mild conditions on the dependence structure of data streams. A data-driven approach is developed for selection of the penalisation parameter, which gives the new method an edge over existing methods in terms of FDR control and detection delay. Its advantage is demonstrated via simulation and a data example.

Key Words: Epidemic changes; High-dimensional data streams; Multiple change-points; Multiple testing; Penalised methods; Sequential detection
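The FDR-control ingredient can be illustrated with the classical Benjamini-Hochberg step-up rule, a static stand-in for the paper's sequential, penalised-GLR procedure; the stream counts and p-value distributions below are invented.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return indices rejected by the BH step-up rule at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m      # q * k / m for rank k
    below = p[order] <= thresh
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0])          # largest k with p_(k) <= q k / m
    return order[: k + 1]                     # reject all ranks up to k

# Toy snapshot of 1000 streams: 900 in the null state (uniform p-values)
# and 100 deviating streams with strongly significant p-values.
rng = np.random.default_rng(3)
p_null = rng.uniform(size=900)
p_sig = rng.uniform(size=100) * 1e-4
rejected = benjamini_hochberg(np.concatenate([p_null, p_sig]), q=0.1)
```

All 100 deviating streams are flagged, while the number of additional (false) flags stays near the nominal 10% of the rejections.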

Valuing commodity options and futures options with changing economic conditions

Kun Fan1, Yang Shen2,*, Tak Kuen Siu3, Rongming Wang1

1 School of Statistics, East China Normal University, Shanghai 200241, China 2 Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada, M3J 1P3 3 Department of Applied Finance and Actuarial Studies, Faculty of Business and Economics, Macquarie University, Sydney, NSW 2109, Australia E-mail: [email protected]

Abstract: A model for valuing a European-style commodity option and a futures option is discussed with a view to incorporating the impact of changing hidden economic conditions on commodity price dynamics. The proposed model may be thought of as an extension of the Gibson–Schwartz two-factor model, where the model parameters vary when the hidden state of the economy switches. A semi-analytical approach to valuing commodity options and futures options is adopted, where closed-form expressions for the characteristic functions of the logarithmic commodity price and futures price are derived. A fast Fourier transform (FFT) approach is then applied to provide a practical and efficient way to evaluate the option prices. Real data studies and numerical examples are used to illustrate the practical implementation of the model.

Key Words: Commodity options; Futures options; Regime-switching; Gibson–Schwartz two-factor model; Fast Fourier transform
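The characteristic-function pricing step can be sketched in the Carr-Madan style. The sketch below uses plain Black-Scholes dynamics as a stand-in for the regime-switching Gibson–Schwartz model of the talk (whose characteristic functions are likewise available in closed form), and evaluates the damped-call Fourier inversion by simple quadrature; an FFT would evaluate the same integral on a whole grid of log-strikes at once. All parameter values are illustrative.

```python
import numpy as np
from math import log, sqrt, exp, erf, pi

def bs_charfn(u, S0, r, sigma, T):
    """Characteristic function of ln S_T under Black-Scholes dynamics."""
    mu = log(S0) + (r - 0.5 * sigma ** 2) * T
    return np.exp(1j * u * mu - 0.5 * sigma ** 2 * T * u ** 2)

def call_via_charfn(K, S0, r, sigma, T, alpha=1.5, vmax=200.0, n=4096):
    """Carr-Madan damped-call inversion, evaluated by trapezoid quadrature."""
    k = log(K)
    v = np.linspace(1e-8, vmax, n)
    numer = exp(-r * T) * bs_charfn(v - 1j * (alpha + 1.0), S0, r, sigma, T)
    denom = alpha ** 2 + alpha - v ** 2 + 1j * (2.0 * alpha + 1.0) * v
    integrand = np.real(np.exp(-1j * v * k) * numer / denom)
    dv = v[1] - v[0]
    integral = dv * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return exp(-alpha * k) / pi * integral

def bs_call(K, S0, r, sigma, T):
    """Closed-form Black-Scholes call, used as the benchmark."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return S0 * N(d1) - K * exp(-r * T) * N(d2)
```

The Fourier price agrees with the closed-form benchmark to well under a cent for at-the-money settings, which is the consistency check one would run before substituting a richer characteristic function.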

An extended Mallows model for ranked data aggregation

Han Li1,2, Minxuan Xu2,3, Jun S. Liu4, Xiaodan Fan2,*

1College of Economics, Shenzhen University, Shenzhen, China 2Department of Statistics, The Chinese University of Hong Kong, Shatin, HK SAR, China 3Department of Statistics, University of California, Los Angeles, CA, USA 4Department of Statistics, Harvard University, Cambridge, MA, USA E-mail: [email protected]

Abstract: For the same disease, different studies may report differently ranked gene lists after measuring the genes' association with the disease. We aim to find a consensus ranking by aggregating multiple ranking lists. To address the problem probabilistically, we formulate an elaborate ranking model for full and partial rankings by generalizing the Mallows model. Our model assumes that the ranked data are generated through a multistage ranking process that is explicitly governed by parameters measuring the overall quality and stability of the process. The new model is quite flexible and has a closed-form expression. Under mild conditions, we derive several useful theoretical properties of the model. Furthermore, we propose an efficient statistic, called the rank coefficient, to detect over-correlated rankings, and a hierarchical ranking model to fit the data. Through extensive simulation studies and real applications, we evaluate the merits of our models and demonstrate that they outperform the state-of-the-art methods in diverse scenarios.

Key Words: rank aggregation; Mallows model; hierarchical ranking
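For a handful of items, the maximum-likelihood central ranking of a standard Mallows model coincides with the Kemeny consensus, the permutation minimizing the total Kendall distance to the observed rankings, and can be found by brute force. The toy rankings below are invented, and this is only the simplest special case of the aggregation problem treated in the talk.

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Number of item pairs ordered differently by the two rankings."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(list(r1), 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

def kemeny_consensus(rankings):
    """Brute-force consensus: the permutation with minimal total Kendall distance."""
    items = rankings[0]
    return min(
        permutations(items),
        key=lambda c: sum(kendall_tau(c, r) for r in rankings),
    )

# Toy aggregation: most lists agree with (A, B, C, D) up to single swaps.
rankings = [
    ("A", "B", "C", "D"),
    ("A", "B", "C", "D"),
    ("B", "A", "C", "D"),
    ("A", "B", "D", "C"),
    ("A", "C", "B", "D"),
]
consensus = kemeny_consensus(rankings)
```

Brute force is only viable for a few items; the paper's multistage model is what makes aggregation of long (and partial) gene lists tractable.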

Some challenges in analyzing big data in health and medical research

Bo Fu1

1 , China E-mail: [email protected]

Abstract: TBA

Estimating Truncated Functional Linear Models with a Nested Group Bridge Approach

Tianyu Guan1, Zhenhua Lin2, Jiguo Cao1,*

1 Department of Statistics and Actuarial Science, Simon Fraser University 2 Department of Statistics, University of California, Davis E-mail: [email protected]

Abstract: We study a scalar-on-function truncated linear regression model which assumes that the functional predictor does not influence the response after time passes a certain cutoff point. We approach this problem from the perspective of locally sparse modeling, where a function is locally sparse if it is zero on a substantial portion of its defining domain. In the truncated linear model, the slope function is exactly a locally sparse function that is zero beyond the cutoff time. A locally sparse estimate then gives rise to an estimate of the cutoff time. We propose a nested group bridge penalty that is able to specifically shrink the tail of a function. Combined with the B-spline basis expansion and penalized least squares, the nested group bridge approach can identify the cutoff time and produce a smooth estimate of the slope function simultaneously. The proposed nested group bridge estimator is shown to be consistent, and its numerical performance is illustrated by simulation studies. The method is demonstrated with an application determining the effect of past engine acceleration on the current particulate matter emission.

Key Words: B-spline basis functions; Functional data analysis; Functional linear regression; Group bridge approach; Locally sparse; Penalized B-splines

Modeling Traffic Crash Risk

Feng Guo1∗

1∗ Virginia Tech, Blacksburg, VA 24060, USA E-mail: [email protected]

Abstract: Accurate assessment of driver risk is challenging due to the rarity of crashes and the high variability among individual drivers. This talk focuses on evaluating factors affecting driving risk as well as predicting individual driving risk. The emerging connected and automated vehicle technology provides rich in-situ telematics driving data that can supply personalized driving behavior information for risk assessment. This talk will introduce the latest developments in using naturalistic driving studies for driver risk prediction, specifically the Second Strategic Highway Research Program naturalistic driving study (SHRP2 NDS). A decision-adjusted framework is introduced to develop an optimal model for individual driver risk prediction.

Key Words: Traffic Safety; Crash Risk; Naturalistic Driving Study; Telematics; Driver Risk Evaluation

Moderate-Dimensional Inferences on Quadratic Functionals in Ordinary Least Squares

Xiao Guo1,*, Guang Cheng2

1International Institute of Finance, School of Management, University of Science and Technology of China, Hefei, Anhui 230026, People's Republic of China. 2Department of Statistics, Purdue University, West Lafayette, IN 47906. E-mail: [email protected]

Abstract: Statistical inferences on quadratic functionals of linear regression parameters have found wide applications, including signal detection, one/two-sample global testing, inference on the fraction of variance explained and genetic co-heritability. Conventional theory based on the ordinary least squares estimator works perfectly in the fixed-dimensional regime, but fails when the parameter dimension p grows proportionally to the sample size n. In some cases, its performance is not satisfactory even when n > 5p.

The main contribution of this paper is to illustrate that the signal-to-noise ratio (SNR) plays a crucial role in moderate-dimensional inference, where p/n -> c with 0 < c < 1. In the case of weak SNR, which often occurs in the moderate-dimensional regime, both the bias and the variance in traditional inference procedures need to be corrected. The amount of correction mainly depends on the SNR and c, and can be fairly large as c -> 1. However, the classical fixed-dimensional results continue to hold if and only if the SNR is large enough, say when p diverges but more slowly than n. Our general theory holds, in particular, without Gaussian design/error or structural parameter assumptions, and applies to a broad class of quadratic functionals, covering all the aforementioned applications. The mathematical arguments are based on random matrix theory and the leave-one-out method. Extensive numerical results demonstrate the satisfactory performance of the proposed methodology even when p > 0.9n in some extreme cases.

Key Words: Co-heritability; fraction of variance explained; linear regression model; moderate dimension; quadratic functional; signal-to-noise ratio
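The bias phenomenon is easy to reproduce in the simplest setting: for OLS, the naive plug-in estimate of ||beta||^2 is inflated by sigma^2 * tr((X'X)^{-1}), a quantity of order p/n, and subtracting an estimate of that term removes the bias. The simulation below uses invented settings and illustrates only this simplest correction; the paper's theory covers far more general quadratic functionals.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 500, 200, 1.0          # moderate-dimensional: p/n = 0.4
reps = 50
naive, corrected = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    beta = np.ones(p) / np.sqrt(p)   # ||beta||^2 = 1 is the target
    y = X @ beta + sigma * rng.normal(size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)          # unbiased noise variance
    plug_in = beta_hat @ beta_hat                 # naive: biased upward
    naive.append(plug_in)
    corrected.append(plug_in - sigma2_hat * np.trace(XtX_inv))
```

Across replications the naive estimate overshoots 1 by roughly p/(n-p), while the corrected estimate centers on the truth.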

Local Inference in Additive Models with Decorrelated Local Linear Estimator

Zijian Guo1,∗, Cun-Hui Zhang1

1 Rutgers University, USA E-mail: [email protected]

Abstract: Additive models, as a natural generalization of linear regression, have played an important role in studying nonlinear relationships. Despite much recent progress on additive models, the statistical inference problem in additive models has been much less understood. Motivated by inference for the exposure effect, we tackle the statistical inference problem for f1'(x0) in additive models, where f1 denotes the univariate function of interest and f1'(x0) denotes its first-order derivative evaluated at a specific point x0. The main challenge for this local inference problem is the additional uncertainty from estimating other nuisance functions. To address this, we propose a decorrelated local linear estimator, which is particularly useful in reducing the effect of the estimation error of the nuisance functions on the estimation accuracy of f1'(x0). We establish the asymptotic limiting distribution of the proposed estimator and then construct confidence intervals and conduct hypothesis testing for the estimand f1'(x0). The variance level of the asymptotic limiting distribution is of the same order as that for nonparametric regression, while the bias of the proposed estimator is jointly determined by how well we can estimate the nuisance functions and by the relationship between the variable of interest and the nuisance variables. The method is developed for general additive models and is demonstrated in the high-dimensional sparse additive model.

Key Words: Additive model; Inference; Local linear; Decorrelated; Function derivative.

19 Nonlocal online RPCA for video denoising

Zhi Han1

1 Shenyang Institute of Automation, Chinese Academy of Sciences, China E-mail: [email protected]

Abstract: Online Robust Principal Component Analysis (RPCA) has been proposed to enhance the efficiency of (batch) RPCA for processing big or streaming data. However, for both online and batch RPCA, two drawbacks prevent their application to video denoising: 1) it is hard for them to extract accurate low-rank components when the scene or background of the video is non-static; 2) correlated information from regional patches or volumes, which could further enhance the reconstruction performance, is not taken into consideration. To solve these problems, the nonlocal technique has been successfully applied to batch RPCA, but due to the characteristics of online RPCA, it is hard to apply the nonlocal technique directly. In this work, we propose a novel nonlocal online RPCA model for video denoising. First, rather than grouping low-rank patches by applying methods like k-Nearest Neighbors (kNN) to every patch separately, we use a Gaussian Mixture Model (GMM) to cluster all the nonlocal patches, in order to build a set of updatable subspaces. Then we propose a weighted RPCA model and solve it via an Alternating Direction Method of Multipliers (ADMM) strategy. Finally, we propose a sequential multi-subspace update scheme, in which the subspaces and the corresponding Gaussian components are updated simultaneously by utilizing the subspace-projected newly arrived data.
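The weighted RPCA model and the multi-subspace updates are specific to the paper, but the core building block it extends, RPCA (principal component pursuit) solved by ADMM, can be sketched in a few lines. The following is an illustrative, unweighted NumPy sketch; the function names and parameter defaults are ours, not the authors'.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(X, tau):
    """Entrywise soft thresholding: proximal operator of tau * l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca_admm(M, lam=None, mu=None, n_iter=300):
    """Decompose M into low-rank L plus sparse S via ADMM
    (min ||L||_* + lam * ||S||_1 subject to L + S = M)."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)    # low-rank update
        S = soft(M - L + Y / mu, lam / mu)   # sparse update
        Y = Y + mu * (M - L - S)             # dual ascent on the constraint
    return L, S

# Toy check: rank-1 "background" plus sparse "foreground" spikes.
rng = np.random.default_rng(0)
bg = np.outer(rng.standard_normal(60), rng.standard_normal(50))
fg = np.zeros((60, 50)); fg[10, 10] = 5.0; fg[20, 30] = -4.0
L, S = rpca_admm(bg + fg)
```

The closed-form proximal steps (singular value thresholding for the low-rank part, soft thresholding for the sparse part) are what make the ADMM strategy attractive; the paper's weighted variant modifies the penalty while keeping this alternating structure.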

20 Oracle P-value and Variable Screening

Ning Hao1, Hao Helen Zhang1,*

1University of Arizona, Mathematics Department, Tucson, AZ, USA E-mail: [email protected]

Abstract: The P-value, first proposed by Fisher to measure the inconsistency of data with a specified null hypothesis, plays a central role in statistical inference. In classical linear regression analysis, it is standard procedure to calculate P-values for the regression coefficients based on the least squares estimator (LSE) to determine their significance. However, for high-dimensional data where the number of predictors exceeds the sample size, ordinary least squares is no longer proper and there is no valid definition of P-values based on the LSE. It is also challenging to define sensible P-values for other high-dimensional regression methods such as penalization and resampling methods. In this paper, we introduce a new concept called the oracle P-value to generalize traditional LSE-based P-values to high-dimensional sparse regression models. We then propose several estimation procedures to approximate oracle P-values for real data analysis. We show that the oracle P-value framework is useful for developing new tools in high-dimensional data analysis, including variable ranking, variable selection, and screening procedures with false discovery rate (FDR) control. Numerical examples are presented to demonstrate the performance of the proposed methods.

Key Words: False discovery rate; Inference; Linear model
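To fix ideas, the oracle viewpoint can be illustrated with a toy sketch: if an oracle revealed the true sparse support, classical LSE-based P-values could be computed on that support alone even when p > n. This is only an illustration of the concept, assuming a known support and a normal approximation to the t reference distribution; the paper's contribution is precisely to estimate such oracle quantities without the oracle. All function names here are ours.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Classical two-sided p-values for OLS coefficients
    (normal approximation to the t reference distribution)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                  # error variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    z = beta / se
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in z])

def oracle_pvalues(X, y, support):
    """Toy 'oracle' p-values: LSE p-values fitted on the oracle support;
    variables off the support are screened out (p-value set to 1)."""
    pv = np.ones(X.shape[1])
    pv[support] = ols_pvalues(X[:, support], y)
    return pv

rng = np.random.default_rng(1)
n, p = 200, 500                        # p > n: ordinary LSE p-values undefined
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)
pv = oracle_pvalues(X, y, support=[0, 3])
```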

21 Inference in a mixture additive hazards cure model

Haijin He1

1 Shenzhen University, China E-mail: [email protected]

Abstract: TBA

22 The Pearson Correlation Between Tree-Shaped Data Sets: Estimating, Graphical Representation and Hypothesis Testing

Shanjun Mao1, Xiaodan Fan1, Jie Hu2,∗  1The Chinese University of Hong Kong, Hong Kong SAR, China 2Xiamen University, Fujian, China E-mail: [email protected]

Abstract: Tree-shaped data sets are increasingly common in the real world, for example gene expression data measured on a cell lineage tree. Because of their complex structure, such as correlation that changes along the tree branches, the classical formula for the Pearson correlation coefficient cannot be applied directly to calculate the correlation between tree-shaped data sets, i.e., the tree correlation. In our study, a statistical model with a correlation-gradually-weakening mechanism for tree-shaped data is first proposed, and a Bayesian approach is applied to estimate the tree correlation; secondly, a simple and intuitive graphical representation is used to demonstrate the geometric meaning of the tree correlation; finally, a χ2 test is constructed to test hypotheses on tree correlations. Extensive simulations demonstrate the validity and compatibility of our model and algorithm. Furthermore, an application to a public dataset of gene expression measured on a cell lineage tree shows that our method captures the correlation between tree-shaped data sets well.

Key Words: Pearson correlation; Tree-shaped data; Bayesian estimation; Graphical representation; Hypothesis testing

23 AI-Based Solution for Financial Risk Assessment and Fraud Detection

Ling Huang1

1AHI Fintech, Beijing, China E-mail: [email protected]

Abstract: In this talk, I will present our work in applying a variety of Artificial Intelligence (AI) technologies to daily decision-making in business operations in the financial industry. Specifically, we propose to integrate the latest graph analytics with machine learning and develop a cutting-edge semi-supervised learning platform that provides risk assessment and fraud detection solutions for financial institutions. Our solutions seamlessly integrate expert knowledge with AI systems, automatically discover unknown risk patterns from massive data, build models from few labels to detect ever-changing fraudulent activities not seen before, and protect customers from the latest threats and attacks.

Key Words: Artificial Intelligence; Machine Learning; Risk Model; Fraud Detection

24 Causal mediation of semicompeting risks

Ying Chen1, Thorsten Koch2, Yen-Tsung Huang1

1 E-mail: [email protected]

Abstract: The semi-competing risks problem arises when one is interested in the effect of an exposure or treatment on both an intermediate event (e.g., having cancer) and a primary event (e.g., death), where the intermediate event may be censored by the primary event but not vice versa. Here we propose a nonparametric approach casting the semi-competing risks problem in the framework of causal mediation modeling. We set up a mediation model with the intermediate and primary events as the mediator and the outcome, respectively, and define the indirect effect (IE) as the effect of the exposure on the primary event mediated by the intermediate event, and the direct effect (DE) as that not mediated by the intermediate event. A Nelson-Aalen-type estimator with time-varying weights is proposed for the direct and indirect effects, where the counting process at time $t$ of the primary event, $N_{2n_1}(t)$, and its compensator, $A_{n_1}(t)$, are both defined conditionally on the status of the intermediate event right before $t$, $N_1(t^-)=n_1$. We show that $N_{2n_1}(t)-A_{n_1}(t)$ is a zero-mean martingale. Based on this, we further establish the asymptotic unbiasedness, consistency, and asymptotic normality of the proposed estimators. Numerical studies, including simulations and a data application, are presented to illustrate the finite-sample performance and utility of the proposed method.

Key Words: causal inference; causal mediation; martingale; Nelson-Aalen estimator; semi-competing risks

25 Multiple Imputation on enhanced model identification for nonignorable nonresponse

Jongho Im1,∗, Kosuke Morikawa2, Tomoyuki Nakagawa3

1∗ Presenter, Department of Applied Statistics, Yonsei University, Seoul, South Korea 2 Graduate School of Engineering Science, Osaka University, Osaka, Japan 3 Department of Information Sciences, Faculty of Science and Technology, Tokyo University of Science, Tokyo, Japan E-mail: [email protected]

Abstract: Multiple imputation is a popular technique for analyzing incomplete data. Although multiple imputation usually assumed an ignorable missing mechanism in its early days, many new methods have been proposed for nonignorable missing mechanisms in recent years. However, those methods still have limitations in their robustness to model specification and identification. In this study, we propose a new multiple imputation method with strengthened model identification. The outcome model is approximated by Gaussian mixtures, and the response mechanism is assumed to be a logit model. As the response model is parametric but cannot be correctly specified in practice, model identification is important for obtaining consistent estimates. We discuss when the response model can be safely identified in the presence of nonignorable nonresponse and then use the results to estimate the parameters of our assumed model. Results from a limited simulation study are presented to check the performance of the proposed multiple imputation method.

Key Words: data augmentation; Gaussian mixture; not missing at random; model identification

26 Generalized Four Moment Theorem and an Application to CLT for Spiked Eigenvalues of Large-dimensional Covariance Matrices

Dandan Jiang1

1 School of Mathematics & Statistics, Xi'an Jiaotong University, No.28 Xianning West Road, Xi'an, Shaanxi, 710049, China E-mail: [email protected]

Abstract: We consider a more generalized spiked covariance matrix $\Sigma$, a general non-negative definite matrix with the spiked eigenvalues scattered into a few bulks and the largest ones allowed to tend to infinity. By relaxing the matching of the 4th moment to a tail-probability decay condition, a Generalized Four Moment Theorem (G4MT) is proposed to show the universality of the asymptotic law for the local spectral statistics of generalized spiked covariance matrices, which implies that the limiting distribution of the spiked eigenvalues of the generalized spiked covariance matrix is independent of the actual distributions of the samples satisfying our relaxed assumptions. Moreover, by applying it to the Central Limit Theorem (CLT) for the spiked eigenvalues of the generalized spiked covariance matrix, we also extend the result of Bai and Yao (2012) to a general case in which the 4th moment is not necessarily required to exist and the population covariance matrix takes a general form without the diagonal-block independence assumption, thus meeting actual cases better.

27 Functional-coefficient regression models with GARCH errors

Jiancheng Jiang1,2*

1 School of Mathematical Sciences, Tianjin 300071, China 2 Department of Mathematics and Statistics, University of North Carolina at Charlotte, NC 28277, USA E-mail: [email protected]

Abstract: GARCH models are widely used to model heavy-tailed financial data with nonlinear and heteroscedastic structures. In this talk, we propose a functional-coefficient regression model with GARCH errors for such data. To deal with the influence of heteroscedasticity, we introduce a two-step approach to estimating the unknown coefficient functions and the volatility. With the estimates of the unknown parameters and functions, one can make simultaneous inference about the parameters and make predictions for the conditional mean and volatility. Asymptotic properties of the proposed estimators are established. Our results demonstrate that the functional coefficients can be estimated as if the volatility were known. Simulations and real data examples show that the proposed estimators significantly improve the estimation efficiency over the unweighted estimators when there are GARCH effects.

Key Words: functional coefficients; GARCH errors; local linear smoothing; QMLE

28 Prediction of hospital readmission frailties with misspecified shared frailty models

Xuejun Jiang1

1 Department of Mathematics, Southern University of Science and Technology, Shenzhen 518055, China E-mail: [email protected]

Abstract: In this paper we study the prediction of hospital readmission frailties using various misspecified shared frailty models. We review commonly used frailty distributions, including the gamma, positive stable, power variance function, lognormal, Weibull, and compound Poisson distributions. We employ the EM algorithm and the penalized likelihood for estimating the parameters. Based on these estimates, we construct the best prediction of the risk. We conduct simulations to evaluate the performance of different misspecified shared frailty models and to examine whether the best prediction is sensitive to the specification of the frailty. Finally, we predict the risk of hospital readmission using misspecified frailty models.

Key Words: EM algorithm; Frailty; Misspecification; Prediction; Penalized likelihood

29 The Operating Principle of Regularized Spectral Clustering

Juhee Cho1, Donggyu Kim2, Karl Rohe3, Song Wang4,*

1Microsoft, USA 2KAIST, South Korea 3University of Wisconsin-Madison, Wisconsin Madison, USA 4Amazon, USA E-mail: [email protected]

Abstract: Spectral Clustering is one of the most popular modern clustering algorithms. Its performance can be improved using regularization, and the resulting method is called Regularized Spectral Clustering. Although Regularized Spectral Clustering is widely used, its statistical operating principle is not well studied. In this paper, we investigate why Regularized Spectral Clustering works well under the degree-corrected stochastic block model consisting of dense cohesive cores and sparse, loosely connected peripheries. We show that the eigenvalues of the regularized row-and-column-normalized adjacency matrix are ordered in accordance with the expected degrees. Thus, the regularization makes the eigenvectors corresponding to the dense cohesive cores come first, which helps Regularized Spectral Clustering provide a more balanced cut than Spectral Clustering. A simulation study is conducted to check the properties that we find, and we also apply Regularized Spectral Clustering to several real network data sets.

Key Words: network; community detection; degree-corrected stochastic block model; core/periphery structure
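A minimal sketch of the regularized spectral step described above, assuming the common choice of regularizer tau equal to the average degree and a two-block planted partition. The helper names are ours; the paper's degree-corrected model with core/periphery structure is richer than this toy.

```python
import numpy as np

def regularized_spectral_embed(A, tau, k=2):
    """Leading-k eigenvectors of the regularized, row-and-column
    normalized adjacency matrix D_tau^{-1/2} (A + tau/n J) D_tau^{-1/2}."""
    n = A.shape[0]
    A_tau = A + tau / n                              # add tau/n to every entry
    d = A_tau.sum(axis=1)                            # regularized degrees (> 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_tau = D_inv_sqrt @ A_tau @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_tau)               # ascending eigenvalues
    return vecs[:, ::-1][:, :k]                      # top-k eigenvectors

# Planted two-block network: dense within blocks, sparse between.
rng = np.random.default_rng(2)
n = 60
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.6, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                      # symmetric, no self-loops

V = regularized_spectral_embed(A, tau=A.sum() / n)  # tau = average degree
cut = (V[:, 1] > 0).astype(int)                     # sign of 2nd eigenvector
```

With the tau/n perturbation, low-degree peripheral nodes no longer dominate the spectrum, which is the balanced-cut behavior the abstract analyzes.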

30 Discrepancy between global and local principal component analysis on large-panel high-frequency data

Xin-Bing Kong1, Jin-Guan Lin1, Cheng Liu2, Guang-Ying Liu1

1Nanjing Audit University, Nanjing, China 2Wuhan University, Wuhan, China E-mail: [email protected]

Abstract: Global principal component analysis (GPCA), i.e., PCA applied directly to the whole sample, is not reliable for reconstructing the common components of a large panel of high-frequency data when the factor loadings are time-varying, although it works when the factor loadings are constant. In contrast, the local principal component analysis (LPCA) presented in Kong (2017, 2018) yields consistent estimates of the common components even if the factor loading processes follow Ito semimartingales. The LPCA is also suited to online computation in a "big data" framework with restricted storage and memory. This motivates us to study the discrepancy between GPCA and LPCA in recovering the common components of large-panel high-frequency data. In this paper, we measure the discrepancy by the total sum of squared differences between the common components reconstructed from GPCA and LPCA. We provide the asymptotic distribution of the discrepancy measure when the factor loadings are constant. Conversely, when some factor loadings are time-varying, the discrepancy measure explodes at a rate higher than $\sqrt{pk_n}$ under some mild conditions on the time-variation magnitude of the factor loadings, where $k_n$ is the size of each subsample. We then apply the theory to testing the hypothesis that the factor loading matrix is a constant matrix. We show that the test performs well in controlling the type I error and in detecting time-varying loadings. Our real data analysis provides evidence that the factor loading matrices are always time-varying.
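The discrepancy measure itself is easy to mimic in a toy setting: reconstruct the common components once by PCA on the whole sample (GPCA) and once block by block (LPCA), and sum the squared differences. A NumPy sketch under assumed toy dynamics; rotating loadings stand in for the Ito-semimartingale loadings of the paper, and all names are ours.

```python
import numpy as np

def pca_reconstruct(Y, r=1):
    """Reconstruct the common component of Y (T x p) from its top-r
    principal components (rank-r SVD truncation)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r]

def discrepancy(Y, k, r=1):
    """Total sum of squared differences between the GPCA reconstruction
    and the LPCA reconstruction run on blocks of length k."""
    G = pca_reconstruct(Y, r)
    L = np.vstack([pca_reconstruct(Y[t:t + k], r) for t in range(0, len(Y), k)])
    return ((G - L) ** 2).sum()

rng = np.random.default_rng(7)
T, p, k = 200, 40, 50
f = rng.standard_normal(T)                           # one common factor
lam1 = rng.standard_normal(p)
lam2 = rng.standard_normal(p)
theta = np.linspace(0.0, np.pi / 2, T)               # slowly rotating loadings
load_t = np.cos(theta)[:, None] * lam1 + np.sin(theta)[:, None] * lam2
noise = 0.1 * rng.standard_normal((T, p))

Y_const = f[:, None] * lam1 + noise                  # constant loadings
Y_tv = f[:, None] * load_t + noise                   # time-varying loadings

d_const = discrepancy(Y_const, k)
d_tv = discrepancy(Y_tv, k)
```

Under constant loadings both reconstructions agree up to noise, so the discrepancy stays small; under time-varying loadings GPCA misses the rotation that LPCA tracks block by block, and the discrepancy blows up, which is the contrast the test statistic exploits.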

31 Optimal Estimation of Wasserstein Distance on Trees with An Application to Microbiome Studies

Hongzhe Li1

1University of Pennsylvania, Philadelphia, PA 19104, USA E-mail: [email protected]

Abstract: The weighted UniFrac distance, a plug-in estimator of the Wasserstein distance between read-count distributions on a tree, has been widely used to investigate microbial community differences in microbiome studies. Our investigation, however, shows that this plug-in estimator, although intuitive and commonly used in practice, suffers from potential bias. Motivated by this, we study the problem of estimating the Wasserstein distance between two distributions on a tree from sampled data in a high-dimensional but non-asymptotic setting. To overcome the bias problem, we introduce a new estimator, referred to as the moment-screening estimator on a tree (MET), which performs implicit polynomial approximation incorporating the tree structure. The new estimator is computationally efficient and is shown to be minimax rate-optimal. Applications to both simulated and real biological datasets demonstrate the practical merits of MET, including reduced bias and statistically more significant differences in the microbiome between inactive Crohn's disease patients and normal controls.

Key Words: Phylogenetic tree; UniFrac distance; 16S rRNA sequencing; polynomial approximation
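The plug-in (weighted-UniFrac-type) estimator that the abstract takes as its starting point has a simple closed form: the Wasserstein distance between two distributions on a tree is a sum over edges of branch length times the absolute subtree-mass difference. A toy sketch on a hypothetical four-edge tree; MET itself, the paper's debiased estimator, is not reproduced here.

```python
# Toy rooted tree, stored as child -> (parent, branch_length).
# Node and branch names are hypothetical, chosen for illustration.
TREE = {
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 1.0),
    "C": ("root", 3.0),
}

def tree_wasserstein(p, q, tree=TREE):
    """Plug-in Wasserstein distance between two leaf distributions on a
    rooted tree: sum over edges of branch length times the absolute
    difference of the mass in the subtree below that edge."""
    def subtree_mass(dist):
        mass = dict.fromkeys(tree, 0.0)
        for leaf, w in dist.items():
            node = leaf
            while node != "root":        # push the leaf's mass up to the root
                mass[node] += w
                node = tree[node][0]
        return mass
    mp, mq = subtree_mass(p), subtree_mass(q)
    return sum(tree[v][1] * abs(mp[v] - mq[v]) for v in tree)

# All mass on leaf A vs. all mass on leaf C: the distance equals the
# A-to-C path length 2.0 + 1.0 + 3.0 = 6.0.
d = tree_wasserstein({"A": 1.0}, {"C": 1.0})
```

The bias the abstract refers to arises when the leaf distributions are replaced by empirical read-count proportions inside this formula; MET replaces that naive plug-in with a minimax-optimal moment-screening estimate.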

32 A supplement to Jiang’s asymptotic distribution of the largest entry of a sample correlation matrix

Deli Li1

1Lakehead University, Canada E-mail: [email protected]

Abstract: Let  X, Xki, ; i 1, k 1 be an array of i.i.d. random variables and let

 pnn ;1  be a sequence of positive integers such that np/ n is bounded away from

 ()n

0 and  . The sample correlation matrix n p ij, is generated from  ppnn

 ()n   XXXX,, p  1,1,,n ,1  1,,, p  n , p  such that ij, is the Pearson correlation coefficient nn

  between  XX1,i , , n , i  and  XX1,j , , n , j  . In this talk, we provide a supplement to

 ()n Jiang‟s asymptotic distribution of the largest entry Lp m a x . We show that, n1 i  j  p n ij,

for given nondecreasing function h   : 0,   0,   with lim x  hx  , there exists an array  X, Xki, ; i 1, k 1 of symmetric i.i.d. random vatiables such that

hX    and, for some subsequence  nmm ;1  of 1; 2; 3;  ,

1 / 2  n m limL n  2 almost surely, m  lo g n m m 1 / 2 n lim L n does not exist almost surely, n  lo g n m 1 limn L2  a  t  e x p  e t / 2 ,    t    m n n   , mm m  8 and 2 n Lnn a does not convergence in distribution where an4log p n  loglog p n , n  2 .

33 High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition

Di Wang1, Heng Lian2, Wai Keung Li1, Guodong Li1,∗

1* Department of Statistics and Actuarial Science, University of Hong Kong, HKSAR 2 Department of Mathematics, City University of Hong Kong, HKSAR E-mail: [email protected]

Abstract: The classical vector autoregressive model is a fundamental tool for multivariate time series analysis. However, it involves too many parameters for high-dimensional time series and hence suffers from the curse of dimensionality. In this paper, we rearrange the parameter matrices of a vector autoregressive model into a tensor form and use a tensor decomposition to restrict the parameter space in three directions. Compared with the reduced-rank regression method, which can limit the parameter space in one direction only, the proposed method dramatically improves the capability of vector autoregressive models in handling high-dimensional time series. For this method, its asymptotic properties are studied and an alternating least squares algorithm is suggested. Moreover, for the case of much higher dimension, we further assume sparsity of the three loading matrices, and a regularization method is thus considered for estimation and variable selection. An ADMM-based algorithm is proposed for the regularized method, and oracle inequalities for the global minimizer are established.

Key Words: High-dimensional time series; Reduced-rank regression; Regularization; Tucker decomposition; Variable selection
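The rearrangement step is easy to visualize: the VAR(P) coefficient matrices A_1, ..., A_P are stacked into an N x N x P tensor whose Tucker ranks restrict the parameter space in all three directions. Below is a sketch of the decomposition side only, using a plain HOSVD in NumPy; the ALS estimator and the sparse ADMM variant from the abstract are not reproduced, and the function names are ours.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Higher-order SVD: per-mode factor matrices from the unfoldings,
    plus the projected core tensor."""
    U = [np.linalg.svd(unfold(T, k))[0][:, :r] for k, r in enumerate(ranks)]
    core = T
    for k in range(3):
        core = mode_mult(core, U[k].T, k)
    return core, U

def tucker_reconstruct(core, U):
    T = core
    for k in range(3):
        T = mode_mult(T, U[k], k)
    return T

# Stack VAR(P) coefficient matrices A_1, ..., A_P into an N x N x P tensor
# with low Tucker rank by construction.
rng = np.random.default_rng(3)
N, P, r = 8, 4, 2
G = rng.standard_normal((r, r, r))
Us = [rng.standard_normal((N, r)), rng.standard_normal((N, r)),
      rng.standard_normal((P, r))]
A_tensor = tucker_reconstruct(G, Us)

core, U = hosvd(A_tensor, (r, r, r))
A_hat = tucker_reconstruct(core, U)   # exact when the Tucker ranks are correct
```

When the assumed ranks match the true Tucker ranks, the HOSVD reconstruction is exact; the paper's ALS refines this initialization under the VAR likelihood.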

34 Tensor Analysis and Neuroimaging Applications

Lexin Li1

1 Department of Biostatistics and Epidemiology, University of California, Berkeley E-mail: [email protected]

Abstract: Data in the form of a multidimensional array, or tensor, are fast emerging in a wide variety of scientific and business applications. Simply turning an array into a vector would both induce extremely high dimensionality and destroy the inherent structure of the array. In this talk, we discuss two tensor analysis problems, one about regression with a tensor-valued response and the other about dynamic tensor clustering. We exploit the special structure of the tensor and introduce two low-dimensional structures, sparsity and low-rankness, which help bring the ultrahigh dimensionality to a manageable level. We develop fast estimation algorithms and derive the associated asymptotic properties. We illustrate with two applications in brain imaging analysis.

35 Statistical Learning for Personalized Wealth Management

Yi Ding1, Yingying Li2∗, Rui Song3

1 Hong Kong University of Science and Technology, HKSAR 2∗Hong Kong University of Science and Technology, HKSAR 3 North Carolina State University, USA E-mail: [email protected]

Abstract: We establish a statistical learning framework for personalized wealth management. A high-dimensional Q-learning methodology is proposed for continuous decision making. The proposed method is shown to enjoy desirable oracle properties and to facilitate valid statistical inference for optimal values. Empirically, the proposed statistical learning methodology is exercised on Health and Retirement Study data. The results show that the proposed personalized optimal strategy can improve an individual's financial well-being and surpasses benchmark strategies under a consumption-based utility framework.

Key Words: High-dimensionality; Statistical Learning; Wealth Management

36 Mediation analysis for zero-inflated mediators

Zhigang Li1

1 University of Florida, USA E-mail: [email protected]

Abstract: TBA

37 Identifiability and Non-Convex Algorithm for Multi-Channel Blind Deconvolution

Song Li1

1 School of Mathematical Sciences, Zhejiang University, China E-mail: [email protected]

Abstract: In this talk, we consider the multichannel blind deconvolution problem. We propose a deterministic subspace assumption, which is widely used in practice, and give some theoretical results. First of all, we derive a tight sufficient condition for identifiability of the signal and convolution kernels, which is violated only on a set of Lebesgue measure zero. Then, we present a non-convex regularization algorithm that uses a lifting method to approximate the rank-one constraint, and we show that the global minimizer of the proposed non-convex algorithm is a rank-one matrix under mild conditions on the parameters and noise level. A stability result is also shown under the assumption that the inputs lie in a compact set. Finally, we provide numerical experiments showing that our non-convex model outperforms convex relaxation models, such as nuclear norm minimization, and some non-convex methods (the alternating minimization method and the spectral method).

38 A non-randomized multiple testing procedure for large-scale heterogeneous discrete hypotheses based on randomized tests

Xiaoyu Dai1, Nan Lin1,*, Daofeng Li2, Ting Wang2

1Department of Mathematics and Statistics, Washington University in St. Louis, USA 2Department of Genetics, Washington University in St. Louis, USA E-mail: [email protected]

Abstract: High-throughput genomic studies often require performing genome-wide multiple discrete testing. However, most existing multiple testing procedures for controlling the false discovery rate (FDR) assume that test statistics are continuous and become conservative for discrete tests. To overcome this conservativeness, we propose a novel multiple testing procedure for better FDR control on heterogeneous discrete tests. Our procedure makes decisions based on the marginal critical function (MCF) of randomized tests, which enables a powerful and non-randomized multiple testing procedure. We provide upper bounds on the positive FDR (pFDR) for our procedure and show that the set of detections made by our method contains every detection made by a naive application of the widely used q-value method. We further demonstrate our method with simulations and a real example of differentially methylated region (DMR) detection using whole-genome bisulfite sequencing (WGBS) data.

Key Words: multiple testing; next-generation sequencing; randomized test
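The randomized-test ingredient behind the MCF can be illustrated on a single discrete test: for a Binomial null, the randomized p-value P(X > x) + U * P(X = x) is exactly Uniform(0,1) under the null, unlike the conservative ordinary discrete p-value P(X >= x). This sketch shows only that ingredient, not the paper's non-randomized MCF-based procedure; the function names are ours.

```python
import math
import numpy as np

def binom_pmf(k, n, p):
    """Binomial(n, p) probability mass at k."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def randomized_pvalue(x, n, p0, u):
    """One-sided randomized p-value for H0: p = p0 vs p > p0 in a
    Binomial(n, p) test: P(X > x) + u * P(X = x), with u ~ Uniform(0,1).
    Under H0 its distribution is exactly Uniform(0,1), removing the
    conservativeness of the ordinary discrete p-value P(X >= x)."""
    tail = sum(binom_pmf(k, n, p0) for k in range(x + 1, n + 1))
    return tail + u * binom_pmf(x, n, p0)

# Under H0 the randomized p-value is uniform; check its empirical mean.
rng = np.random.default_rng(4)
n, p0, B = 10, 0.5, 20000
xs = rng.binomial(n, p0, size=B)
us = rng.random(B)
pv = np.array([randomized_pvalue(x, n, p0, u) for x, u in zip(xs, us)])
```

Averaging the randomized decision over U yields the marginal critical function, which is how the paper recovers a non-randomized procedure from this exact-uniformity property.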

39 Deep Neural Networks for Rotation-Invariance Approximation and Learning

Shao-Bo Lin1

1Wenzhou University, 325035, China E-mail: [email protected]

Abstract: The objective of this talk is to design deep neural networks with two or more hidden layers (called deep nets) for the realization of radial functions, so as to enable rotational invariance for near-optimal function approximation in an arbitrarily high-dimensional Euclidean space. It is shown that deep nets have much better performance than shallow nets (with only one hidden layer) in terms of approximation accuracy and learning capabilities. In particular, for learning radial functions, it is shown that the near-optimal rate can be achieved by deep nets but not by shallow nets. Our results illustrate the necessity of depth in neural network design for the realization of rotation-invariant target functions.

Key Words: Deep nets; rotation-invariance; learning theory; radial-basis functions

40 A Quantile Association-based Variable Selection

Yuanyuan Lin1, Niansheng Tang2, Jinhan Xie2, Wenlu Tang1

1The Chinese University of Hong Kong, Hong Kong, China 2Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China. E-mail: [email protected]

Abstract: In modern scientific discovery, identifying important variables in high-dimensional data analysis is intrinsically challenging, especially when there are complex relationships among the predictors. In this paper, without specifying any regression model, we introduce a quantile association-based statistic to identify influential predictors, which is flexible enough to capture a wide range of dependence. The asymptotic null distribution of the proposed statistic is established under mild conditions. Based on the proposed statistic, a multiple testing procedure is advocated to simultaneously test the independence between each predictor and the response variable in high dimensions. The proposed procedure is able to detect relevant variables with pairwise or higher-order interactions. It is computationally efficient, as no optimization or resampling is involved. We prove its theoretical properties and justify that the proposal asymptotically controls the false discovery rate at a given significance level. Numerical studies, including simulation studies and real data analysis, provide supporting evidence that the proposal performs reasonably well in practical settings.

Key Words: Quantile association; high dimensionality

41 Some Statistical Methods for Single-cell Genomics

Zhixiang Lin1,*

1Department of Statistics, the Chinese University of Hong Kong, HKSAR E-mail: [email protected]

Abstract: Some recent work regarding the analysis of single-cell chromatin accessibility (scATAC-Seq) and single-cell gene expression (scRNA-Seq) data will be discussed. First, we will present scABC, a toolkit for analyzing scATAC-Seq data, and extensions that improve rare cell population detection. Second, a model-based framework for the joint analysis of scATAC-Seq and scRNA-Seq data will be presented. We show that the underlying cell types can be better characterized through the joint analysis.

Key Words: Single-cell genomics; Bayesian modeling; Clustering

42 Weighted multiple-quantile classifiers for functional data with application in multiple sclerosis screening

Haiqiang Ma2, Sheng Xu3, Catherine Liu1,∗, Kam Chuen Yuen4

1,3The Hong Kong Polytechnic University, Hung Hom, Kowloon, HKSAR 2Jiangxi University of Finance and Economics, Nanchang, China 4University of Hong Kong, Pok Fu Lam Road, HKSAR E-mail: [email protected]

Abstract: Multiple sclerosis (MS) is the most prevalent chronic neurological disease. It can be diagnosed from functional data generated by diffusion tensor imaging (DTI). Early recognition and treatment of MS are crucial in the treatment and management of MS patients. Existing functional classifiers seem to suffer from high false negative rates, high false positive rates, or both. To develop a classifier with low false negative and false positive rates, we define a generalized distance measure for functional data. Using this generalized distance, we show that existing classifiers can be derived by choosing appropriate loss functions. Furthermore, when we consider the quantile loss function, we are able to develop a weighted multiple-quantile (weMulQ) classifier that is robust, accurate, and computationally fast. We show that it is asymptotically consistent and enjoys near-perfection optimality. Numerically, we demonstrate that it outperforms the other methods when the data are from a generalized Gaussian noise process with mixed populations. Finally, we apply weMulQ to classify MS patients using a DTI data set collected from the Kennedy-Krieger Institute. Our classifier indeed has much lower false negative and false positive rates than the existing methods.

Key Words: almost perfection; classification; quantile

43 Optimal Covariance Matrix Estimation for High-dimensional Noise in High-frequency Data

Cheng Yong Tang1, Cheng Liu2, Jinyuan Chang3

1Temple University, 2Wuhan University, 3Southwestern University of Finance and Economics E-mail: [email protected]

Abstract: In this paper, we consider efficiently learning the structural information of the high-dimensional noise in high-frequency data by estimating its covariance matrix with optimality. The problem is uniquely challenging because of the latency of the targeted high-dimensional vector containing the noise, and the practical reality that the observed data can be highly asynchronous: not all components of the high-dimensional vector are observed at the same time points. To meet these challenges, we propose a new covariance matrix estimator with appropriate localization and thresholding. In the setting with latency and asynchronous observations, our theoretical analysis establishes the minimax optimal convergence rates associated with two commonly used loss functions for covariance matrix estimation. As a major theoretical development, we show that despite the latency of the signal in the high-frequency data, the optimal rates remain the same as if the targeted high-dimensional noise were directly observable. Our results also indicate that the optimal rates reflect the impact of the asynchronous observations, being slower than those with synchronous observations. Furthermore, we demonstrate that the proposed localized estimator with thresholding achieves the minimax optimal convergence rates. We also illustrate the empirical performance of the proposed estimator with extensive simulation studies and a real data analysis.

Key Words: High-dimensional covariance matrix; high-frequency data analysis; measurement error; minimax optimality; thresholding

44 Data-adaptive Kernel Support Vector Machine

X. Liu1,∗, W. He2

1 Shanghai University of Finance and Economics, 200433, China 2 The University of Western Ontario, N6A 3K7, Canada E-mail: [email protected]

Abstract: The support vector machine (SVM) is popular as a classifier in applications such as pattern recognition, text mining, and image retrieval owing to its flexibility and interpretability. However, its performance deteriorates when the response classes are imbalanced. To enhance the performance of the support vector machine classifier in imbalanced cases, we investigate a new two-stage method that adaptively scales the kernel function. Based on the information obtained from the standard SVM in the first stage, we conformally rescale the kernel function in a data-adaptive fashion in the second stage, so that the separation between the two classes can be effectively enlarged, especially when the observations are imbalanced. The proposed method takes into account the locations of the support vectors in the feature space and is therefore especially appealing when the response classes are imbalanced. At the same time, we consider how to select important features with data-adaptive kernels in SVMs, as well as any spatial association that may exist. The approach is further applied to multi-category classification problems.

Key Words: Classification; imbalanced image data; spatial association; support vector machine; feature selection.
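The second-stage rescaling described above is, in spirit, a conformal transformation of the kernel, K~(x, x') = c(x) c(x') K(x, x'), with the magnification factor c(·) concentrated near the support vectors. The sketch below is only an illustration of that idea: the Gaussian magnification function and its parameters are generic assumptions, not the authors' data-adaptive choice.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_factor(X, support_vectors, tau=1.0):
    # Magnification is larger for points close to the support vectors,
    # i.e., near the (estimated) decision boundary region
    d2 = ((X[:, None, :] - support_vectors[None, :, :]) ** 2).sum(-1)
    return np.exp(-tau * d2).sum(axis=1)

def rescaled_kernel(X, Y, support_vectors, gamma=1.0, tau=1.0):
    # Conformal transformation: K~(x, y) = c(x) * c(y) * K(x, y)
    cx = conformal_factor(X, support_vectors, tau)
    cy = conformal_factor(Y, support_vectors, tau)
    return cx[:, None] * cy[None, :] * rbf_kernel(X, Y, gamma)
```

In a two-stage scheme, the support vectors would come from a standard SVM fit in the first stage, and the rescaled kernel matrix would then be passed to a second SVM fit.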

45 Testing of covariate effects under ridge regression for high-dimensional data

Xu Liu1

1Shanghai University of Finance and Economics, No. 777, Guoding Lu, China E-mail: [email protected]

Abstract: In this paper, we revisit ridge regression under high-dimensional model settings. We propose a novel estimator of the error variance and establish its asymptotic normality based on random matrix theory as the dimension of the covariates diverges with the sample size; the estimator is promising compared with its competitors, including the refitted cross-validation method. An upper bound on the mean squared error (MSE) is given for the ridge regression estimator of the coefficients, and two new test statistics are provided for inference on the covariate effects. One is based on the distance between the estimated and true coefficients, motivated by the MSE bound; the other is motivated by the famous Wald-type test in the low-dimensional situation. Asymptotic properties are obtained under some regularity conditions. Numerical examples are used to assess the finite sample performance of the proposed methods.

Key Words: High-dimensional test; Random matrix; Ridge regression
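A baseline for comparison can be sketched as follows: a ridge fit together with the standard effective-degrees-of-freedom correction for the error variance. This is only an assumed textbook baseline, not the paper's random-matrix-theory estimator, whose exact form is not reproduced here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Ridge estimate beta = (X'X + lam I)^{-1} X'y plus a plug-in
    # error-variance estimate with the trace(H) degrees-of-freedom correction
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(G, X.T @ y)
    # Effective degrees of freedom of the ridge smoother H = X G^{-1} X'
    df = np.trace(X @ np.linalg.solve(G, X.T))
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - df)
    return beta, sigma2
```

The degrees-of-freedom correction behaves reasonably when n exceeds p; the regime where p diverges with n is precisely where the refined estimators discussed in the abstract are needed.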

46 Towards Software-Defined Infrastructure for Decentralized Data Governance

Xuanzhe Liu1

1Peking University, No.5, Yiheyuan Road, Haidian District, Beijing, China E-mail: [email protected]

Abstract: Exploring and mining the explosive burst of “big data” has already generated many innovative applications, especially the recent advances in AI, and has thus produced great value for human society. However, due to the centralized pattern of data governance activities, including creation, sharing, exchange, management, analytics, tracing, and accounting, the potential value of big data distributed on the Internet is far from adequately explored. The recent announcement of data protection policies/laws such as the GDPR makes the problem even more challenging. We are now at a moment of truth where the data governance infrastructure should be reconsidered and redesigned. In this paper, we propose a software-defined infrastructure design in a decentralized fashion: data owners are able to implement and deploy their own rules to the application systems where the data are produced for further governance activities. This fashion is quite similar to popular software-defined networking, where users are allowed to deploy rules on switches and customize their use. Our principled infrastructure design can radically reform current data governance activities into a decentralized topology. On the one hand, data can be separated from the application that generates them, and data owners have full rights to decide where their data should be stored and how the data can be shared. On the other hand, data users can search, discover, integrate, and analyze the data from various data sources according to their application requirements and scenarios. As a result, we argue that our infrastructure can establish a new generation of responsive decentralized data governance that promotes innovation in linking data, adapting better to the open environment and to diverse user requirements.
With this perspective, we briefly discuss some key insights and enumerate several related new technologies and open challenges.

Key Words: data governance; software-defined

47 Distributed learning from multiple EHR databases: Contextual embedding models for predicting medical events

Qi Long1

1University of Pennsylvania, USA E-mail: [email protected]

Abstract: The P-value, first proposed by Fisher to measure the inconsistency of data with a specified null hypothesis, plays a central role in statistical inference. For classical linear regression analysis, it is a standard procedure to calculate P-values for regression coefficients based on the least squares estimator (LSE) to determine their significance. However, for high dimensional data, when the number of predictors exceeds the sample size, ordinary least squares is no longer proper and there is no valid definition of P-values based on the LSE. It is also challenging to define sensible P-values for other high dimensional regression methods such as penalization and resampling methods. In this paper, we introduce a new concept called the oracle P-value to generalize traditional P-values based on the LSE to high dimensional sparse regression models. We then propose several estimation procedures to approximate oracle P-values for real data analysis. We show that the oracle P-value framework is useful for developing new tools in high dimensional data analysis, including variable ranking, variable selection, and screening procedures with false discovery rate (FDR) control. Numerical examples are presented to demonstrate the performance of the proposed methods.

Key Words: False discovery rate; Inference; Linear model

48 Wavelet Empirical Likelihood Estimator for Stationary and Locally Stationary Long Memory Processes

Zhiping Lu 1, Yulin Zhu

1East China Normal University, China E-mail: [email protected]

Abstract: This paper introduces a version of wavelet empirical likelihood based on the periodogram and spectral estimating equations. This formulation handles dependent data through a data transformation. The method yields likelihood ratios that can be used to build nonparametric, asymptotically correct confidence regions for stationary long memory processes and locally stationary long memory processes. Monte Carlo simulations are carried out to demonstrate the effectiveness of the method.

Key Words: Whittle estimator; Empirical likelihood; Wavelet transform; Locally stationary; Long memory processes
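The periodogram is the building block of the spectral estimating equations mentioned above. A minimal sketch of its computation (the wavelet transform and the empirical likelihood construction themselves are omitted):

```python
import numpy as np

def periodogram(x):
    # I(lambda_j) = |sum_t (x_t - xbar) exp(-i t lambda_j)|^2 / (2 pi n)
    # at the Fourier frequencies lambda_j = 2 pi j / n, j = 1, ..., n-1
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    dft = np.fft.fft(xc)
    I = np.abs(dft) ** 2 / (2.0 * np.pi * n)
    freqs = 2.0 * np.pi * np.arange(n) / n
    return freqs[1:], I[1:]  # drop the zero frequency
```

By Parseval's identity, the periodogram ordinates sum to n/(2*pi) times the sample variance, a quick sanity check on the normalization.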

49 GMV Prediction Using Driver Preference

Guojun Wu1, Yanhua Li2, Shikai Luo1,*

1Didi Chuxing, Beijing, China 2Worcester Polytechnic Institute, MA, US E-mail: [email protected]

Abstract: Drivers make a sequence of decisions while working; one of the most important is whether to stop working (log off) after finishing an order or after being idle for a while. We model drivers' sequential decision-making process as a Markov decision process (MDP). We extract two types of features, income-related and user-experience-related, to model drivers' decision space. The reward function represents the preference each driver has over the different decision-making features. We utilize inverse reinforcement learning to extract drivers' individual preferences from intra-day working cycles. We are interested in predicting each driver's future income, and we show that models with individual preferences can substantially improve prediction accuracy over models with only inter-day characteristics.

Key Words: data mining; inverse reinforcement learning

50 A Nonparametric Bayesian Approach to Simultaneous Subject and Cell Heterogeneity Discovery for Single Cell RNA-Seq Data

Qiuyu Wu1, Xiangyu Luo1,*

1Institute of Statistics and Big Data, Renmin University of China, Beijing, China E-mail: [email protected]

Abstract: The advent of the single cell sequencing era opens new avenues for personalized treatment. The first and important step is discovering subject heterogeneity at the single cell resolution. In this talk, we address the two-level clustering problem of simultaneous subject subgroup discovery (subject level) and cell type detection (cell level) based on single cell RNA sequencing (scRNA-seq) data from multiple subjects. Current approaches either cluster cells without considering the subject heterogeneity or group subjects without using the single cell information. We develop a nonparametric Bayesian model, SCSC (Subject and Cell clustering for Single-Cell data), to achieve subject and cell grouping at the same time without pre-specifying the number of subject subgroups or cell types. An efficient blocked Gibbs sampler is then proposed for the posterior inference. A simulation study and a real application demonstrate the good performance of our model.

Key Words: nonparametric Bayes; mixture model; nonignorable missing; MCMC; single cell genomics

51 A Versatile Estimation Procedure without Estimating the Nonignorable Missingness Mechanism

Yanyuan Ma1

1Penn State University, USA E-mail: [email protected]

Abstract: We consider the estimation problem in a regression setting where the outcome variable is subject to nonignorable missingness and identifiability is ensured by the shadow variable approach. We propose a versatile estimation procedure where modeling of the missingness mechanism is completely bypassed. We show that our estimator is easy to implement and we derive the asymptotic theory of the proposed estimator. We also investigate some alternative estimators under different scenarios. Comprehensive simulation studies are conducted to demonstrate the finite sample performance of the method. We apply the estimator to a children's mental health study to illustrate its usefulness.

52 Matrix Completion under Low-Rank Missing Mechanism

Xiaojun Mao1, Raymond Wong2, Song Xi Chen3,*

1School of Data Science, Fudan University, Shanghai 200433, China 2Department of Statistics, Texas A&M University, College Station, Texas 77843, U.S.A. 3Department of Business Statistics and Econometrics, Guanghua School of Management, and Center for Statistical Science, Peking University, Beijing 100651, China. E-mail: [email protected]

Abstract: Matrix completion is a modern missing data problem where both the missing structure and the underlying parameter are high dimensional. Although the missing structure is a key component of any missing data problem, existing matrix completion methods often assume a simple uniform missing mechanism. In this work, we study matrix completion from corrupted data under a novel low-rank missing mechanism. The probability matrix of observation is estimated via a high dimensional low-rank matrix estimation procedure and further used to complete the target matrix via inverse probability weighting. Due to both the high dimensional and extreme (i.e., very small) nature of the true probability matrix, the effect of inverse probability weighting requires careful study. We derive optimal asymptotic convergence rates of the proposed estimators for both the observation probabilities and the target matrix.

Key Words: Low-rank; Missing; Nuclear-norm; Regularization.
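The two-step structure, estimating the observation-probability matrix from the missingness mask and then completing the target by inverse probability weighting, can be sketched crudely as follows. Both steps below (a rank-1 fit to the mask and a truncated SVD of the weighted matrix) are assumed simplifications, not the paper's regularized estimators.

```python
import numpy as np

def ipw_complete(Y, mask, rank=1):
    # Step 1: rank-1 estimate of the probability matrix from the binary mask
    U, s, Vt = np.linalg.svd(mask.astype(float), full_matrices=False)
    P_hat = np.clip(s[0] * np.outer(U[:, 0], Vt[0]), 1e-3, 1.0)
    # Step 2: inverse-probability weighting of observed entries (zeros elsewhere),
    # so that the weighted matrix is unbiased for the target in expectation
    W = np.where(mask, Y / P_hat, 0.0)
    # Step 3: truncated SVD as a simple low-rank completion step
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```

The low-rank truncation denoises the weighted matrix, which is why the completion should beat the raw inverse-probability-weighted matrix itself.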

53 A mean field theory of two-layer neural networks

Song Mei1

1Stanford University, USA E-mail: [email protected].

Abstract: Multi-layer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a non-convex high-dimensional objective (risk function), a problem which is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the first case, does this happen because local minima are absent, or because SGD somehow avoids them? In the second, why do local minima reached by SGD have good generalization properties? In this talk we consider a simple case, namely two-layer neural networks, and prove that, in a suitable scaling limit, the SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples, and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows us to 'average out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.

Key Words: mean field theory; stochastic gradient descent; distributional dynamics

54 A Dynamic Additive and Multiplicative Effects Model with Application to the United Nations Voting Behaviours

Bomin Kim1, Xiaoyue Niu2,*, David Hunter2, Xun Cao2

1Freddie Mac, United States of America 2The Pennsylvania State University, United States of America E-mail: [email protected]

Abstract: In this paper, we introduce a statistical regression model for discrete-time networks that are correlated over time. Our model is a dynamic version of a Gaussian additive and multiplicative effects (DAME) model which extends the latent factor network model of Hoff (2009) and the additive and multiplicative effects model of Hoff et al. (2014), by incorporating the temporal correlation structure into the prior specifications of the parameters. The temporal evolution of the network is modeled through a Gaussian process (GP) as in Durante and Dunson (2014), where we estimate the unknown covariance structure from the dataset. We analyze the United Nations General Assembly voting data from 1983 to 2014 (Voeten (2013)) and show the effectiveness of our model at inferring the dyadic dependence structure among the international voting behaviors as well as allowing for a varying number of nodes over time. Overall, the DAME model shows significantly better fit to the dataset compared to alternative approaches. Moreover, after controlling for other dyadic covariates such as geographic distances and bilateral trade between countries, the model-estimated additive effects, multiplicative effects, and their movements reveal interesting and meaningful foreign policy positions and alliances of various countries.

Key Words: dynamic network; latent variable; additive and multiplicative effects

55 A Super Scalable Algorithm for Short Segment Detection

Feifei Xiao, Heping Zhang, Ning Hao, Yue Niu1

1University of Arizona, 617 N Santa Rita Ave, USA E-mail: [email protected]

Abstract: In many applications such as copy number variant (CNV) detection, the goal is to identify short segments on which the observations have different means or medians from the background. Those segments are usually short and hidden in a long sequence, and hence are very challenging to find. We study a super scalable short segment (4S) detection algorithm in this paper. This nonparametric method clusters the locations where the observations exceed a threshold for segment detection. It is computationally efficient and does not rely on Gaussian noise assumption. Moreover, we develop a framework to assign significance levels for detected segments. We demonstrate the advantages of our proposed method by theoretical, simulation, and real data studies.

Key Words: copy number variant; scalable; segmentation
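The thresholding-and-clustering idea behind the 4S algorithm can be sketched as follows; this is an assumed simplification (fixed threshold, simple gap-based merging) rather than the full procedure with its significance assignment.

```python
import numpy as np

def short_segments(x, threshold, gap=2, min_len=2):
    # Flag locations where the observations exceed the threshold in magnitude,
    # then merge flagged locations within `gap` of each other into segments
    idx = np.flatnonzero(np.abs(x) > threshold)
    if idx.size == 0:
        return []
    segments, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i - prev > gap:  # too far from the current cluster: close it
            if prev - start + 1 >= min_len:
                segments.append((int(start), int(prev)))
            start = i
        prev = i
    if prev - start + 1 >= min_len:
        segments.append((int(start), int(prev)))
    return segments
```

Because the method only clusters exceedance locations, it runs in a single pass over the flagged indices and does not rely on a Gaussian noise assumption.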

56 Improved doubly robust estimation in learning individualized treatment rules

Yinghao Pan1, Yingqi Zhao2

1University of North Carolina at Charlotte, U.S.A 2Fred Hutchinson Cancer Research Center, U.S.A E-mail: [email protected]

Abstract: Due to patients' heterogeneous responses to treatment, there is a growing interest in developing novel and efficient statistical methods for estimating individualized treatment rules (ITRs). The central idea is to recommend treatment according to patient characteristics, and the optimal ITR is the one that maximizes the expected clinical outcome if followed by the patient population. We propose an improved estimator of the optimal ITR that enjoys two key properties. First, it is doubly robust, meaning that the proposed estimator is consistent if either the propensity score model or the outcome model is correct. Second, it achieves the smallest variance among its class of doubly robust estimators when the propensity score model is correctly specified, regardless of the specification of the outcome model. Simulation studies show that the estimated optimal ITR obtained from our method yields better clinical outcomes than its main competitors. Data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study are analyzed as an illustrative example.

Key Words: Double robustness; Individualized treatment rule; Personalized medicine
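The doubly robust value of a candidate rule d is classically estimated by the augmented inverse-probability-weighted (AIPW) form; the sketch below shows that baseline only (the paper's contribution, minimizing the variance within this class, is not reproduced here).

```python
import numpy as np

def aipw_value(Y, A, X, d, prop, outcome):
    # AIPW estimate of the value of rule d:
    #   mean[ 1{A = d(X)} / P(A | X) * (Y - m(X, A)) + m(X, d(X)) ]
    # prop(X) -> P(A = 1 | X); outcome(X, a) -> model for E[Y | X, A = a]
    pi = prop(X)
    p_a = np.where(A == 1, pi, 1.0 - pi)  # probability of the received arm
    dX = d(X)
    follow = (A == dX).astype(float)
    return np.mean(follow / p_a * (Y - outcome(X, A)) + outcome(X, dX))
```

The estimator is consistent if either `prop` or `outcome` is correctly specified, which is the double robustness referred to in the abstract.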

57 Predicting terrorist events: opportunities and challenges

Andre Python1,*, Tim Lucas1, Penelope Hancock1, Andreas Bender1, Rohan Arambepola1, and Anita Nandi1

1Li Ka Shing Centre for Health Information and Discovery, Big Data Institute, Nuffield Department of Medicine, University of Oxford, UK E-mail: [email protected]

Abstract: In 2017, 18,700 people lost their lives to terrorist attacks. Iraq, Syria, Pakistan, and Afghanistan accounted for more than half of the attacks and of the total number of deaths attributed to terrorism worldwide. Determining the location and time of terrorist events is a key element in preventing terrorism. So far, predictive approaches have mainly focused on armed conflict and insurgency without considering terrorism. In this talk, I will first briefly review recent approaches that have been used to predict terrorism. Second, I will introduce eXtreme Gradient Boosting (XGBoost), a machine-learning algorithm used to predict, one week ahead, the locations of terrorist events at a fine spatial scale in Iraq, Iran, Afghanistan, and Pakistan. I will conclude the talk by comparing the predictive performance of the XGBoost approach with alternative machine learning approaches and spatio-temporal models.

Key Words: machine learning; terrorism; prediction; space-time

58 On the ‘Off-Label’ Use of Data Normalization for Sample Classification and Prognostication

Li-Xuan Qin1,*

1Memorial Sloan Kettering Cancer Center, New York, New York, USA E-mail: [email protected]

Abstract: Data normalization is an important preprocessing step for molecular data containing unwanted variation due to experimental handling. There has been a critical yet overlooked disconnection between the use of data normalization and the goals of subsequent analysis: on one hand, methods for data normalization have been developed mostly for the analysis goal of group comparison; on the other hand, these methods have encountered frequent 'off-label' use for other goals such as sample classification, neglecting potential 'side effects' of normalization such as over-compressed data variability. A bridge between the two is made possible by a unique pair of microRNA array datasets on the same set of tumor tissue samples that were collected at Memorial Sloan Kettering Cancer Center. In this talk, I will share our findings, through empirical analysis and resampling-based simulations using this dataset pair, on how data normalization impacts the development of tumor sample classifiers and survival outcome predictors.

Key Words: Genomics; Microarray; Normalization; Classification; Prediction; Personalized medicine

59 Adaptive Minimax Density Estimation for Huber’s Contamination Model under $L_p$ Losses

Zhao Ren1

1University of Pittsburgh, Pittsburgh, PA, 15260, USA E-mail: [email protected]

Abstract: Today's data pose unprecedented challenges as they may be incomplete, corrupted or exposed to some unknown source of contamination. In this talk, we address the problem of estimating a density function $f$ under $L_p$ losses ($1\leq p <\infty$) for Huber's contamination model, in which one observes an i.i.d. sample from $(1-\epsilon)f+\epsilon g$, where $g$ represents the unknown contamination distribution. We investigate the effects of the contamination proportion $\epsilon$, among other key quantities, on the corresponding minimax rates of convergence for both structured and unstructured contamination classes: for structured contamination, $\epsilon$ always appears linearly in the optimal rates, while for unstructured contamination, the leading term of the optimal rate involving $\epsilon$ also depends on the smoothness of the target density class and the specific loss function.

We further carefully study the corresponding adaptation theory in contamination models. Two different Goldenshluger-Lepski-type methods are proposed to select the bandwidth and achieve $L_p$ risk oracle inequalities for structured and unstructured contaminations, respectively. It is shown that the proposed procedures lead to minimax rate-adaptivity over a scale of anisotropic Nikol'skii classes in most scenarios, except that adaptation to both the contamination proportion $\epsilon$ and the smoothness of the density class under unstructured contamination is shown to be impossible. Our technical analysis of the adaptive procedures relies on some uniform bounds under the $L_p$ norm of empirical processes developed by Goldenshluger and Lepski.

Key Words: adaptivity; minimax rate; contamination; robust statistics; nonparametric density estimation

60 Dynamic Spatial Panel Data Models with Endogeneity and Common Factors

Wei Shi1

1 Institute for Economic and Social Research, Jinan University E-mail: [email protected]

Abstract: Spatial interactions and common factors are two popular approaches to modeling cross-sectional dependence, reflecting local and global dependence respectively. The recent literature has investigated models with both spatial interactions and common factors in a dynamic panel data setup with both n and T large. In many applications of these models, some explanatory variables may be endogenous, and the spatial weight matrices may be stochastic and based on variables that may also be endogenous. Using a control function approach to address the endogeneity, this paper proposes a QML estimator and provides conditions for its consistency and asymptotic normality. Due to the presence of predetermined terms and common factors, the estimator may have an asymptotic bias, and an analytical bias correction procedure is provided. We examine the finite sample behavior of the estimator in a set of Monte Carlo simulations and apply the model to study the effect of cigarette taxation on demand when cross-state shopping is present.

Key Words: Spatial panel data; endogenous spatial weighting matrix; multiplicative individual and time effects; QMLE

61 Bridging the gap between noisy healthcare data and knowledge: automated translation of medical terminology

Xu Shi1,*, Xiaoou Li2, Tianxi Cai1

1Department of Biostatistics, Harvard University, USA 2Department of Statistics, University of Minnesota, USA E-mail: [email protected]

Abstract: Routinely collected healthcare data present numerous opportunities for biomedical research but also come with unique challenges. In this talk, we detail the challenge of the inconsistent “languages” used by different healthcare systems and coding systems. In particular, different healthcare providers may use alternative medical codes to record the same diagnosis or procedure, limiting the transportability of phenotyping algorithms and statistical models across healthcare systems. We present an automated data quality control pipeline that aims to address this challenge and make the transition from data to knowledge. We formulate the idea of medical code translation as a statistical problem of inferring a mapping between two sets of multivariate, unit-length vectors learned from two healthcare systems, respectively. The statistical problem is particularly interesting because the training data are corrupted by a fraction of mismatches in the response-predictor pairs, whereas classical regression analysis tacitly assumes that the response and predictor are correctly linked. We propose a novel method for mapping recovery and establish theoretical guarantees for estimation and model selection consistency.

Key Words: electronic health records; mismatched data; ontology translation; spherical regression
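In the idealized case with no mismatched pairs, the best orthogonal map between two sets of unit-length embeddings is given by the classical orthogonal Procrustes solution, sketched below. This baseline is an assumed simplification: the paper's contribution is precisely recovering the mapping when a fraction of the training pairs are corrupted.

```python
import numpy as np

def procrustes_map(X, Y):
    # argmin over orthogonal W of ||X W - Y||_F, via the SVD of X'Y:
    # if X'Y = U S V', the minimizer is W = U V'
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

With clean paired data the true orthogonal map is recovered exactly; the statistical difficulty in the abstract comes from the unknown fraction of rows of Y that are linked to the wrong rows of X.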

62 Estimating the sample mean and standard deviation from the five-number summary and their applications in evidence-based medicine

Tiejun Tong1

1 Department of Mathematics, Hong Kong Baptist University, HKSAR E-mail: [email protected]

Abstract: Evidence-based medicine is attracting increasing attention as a way to improve decision making in medical practice by integrating evidence from well designed and conducted clinical research. Meta-analysis is a statistical technique widely used in evidence-based medicine for analytically combining the findings from independent clinical trials to provide an overall estimate of a treatment's effectiveness. The sample mean and standard deviation are two commonly used statistics in meta-analysis, but some trials report the median, the minimum and maximum values, or sometimes the first and third quartiles instead. Thus, to pool results in a consistent format, researchers need to transform this information back to the sample mean and standard deviation. In this talk, I will introduce our recent advances in the optimal estimation of the sample mean and standard deviation for meta-analysis from both theoretical and empirical perspectives. Specifically, we solve the problems by incorporating the sample size through a smoothly changing weight in the estimators to reach the optimal estimation. Our proposed estimators not only improve significantly on the existing ones but also retain the virtue of simplicity. The real data application indicates that our proposed estimators can serve as "rules of thumb" and will be widely applied in evidence-based medicine.

Key Words: Median; meta-analysis; mid-range; mid-quartile range; optimal weight; sample mean; sample size
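As a rough illustration of such sample-size-weighted conversions, the sketch below uses formulas of the kind published in this literature (a Luo-type weighted mean and a Wan-type standard deviation combining the range and the interquartile range). The constants are quoted from memory and are not the talk's optimal weights; treat them as an assumed illustration.

```python
import numpy as np
from scipy.stats import norm

def mean_sd_from_five(a, q1, m, q3, b, n):
    # Weighted mean estimate combining extremes, quartiles and the median;
    # the sample-size-dependent weights shrink the contribution of the extremes
    w1 = 2.2 / (2.2 + n ** 0.75)
    w2 = 0.7 - 0.72 / n ** 0.55
    mean = w1 * (a + b) / 2 + w2 * (q1 + q3) / 2 + (1 - w1 - w2) * m
    # SD estimate averaging a range-based and an IQR-based term,
    # each normalized by the expected width under normality
    xi = 2 * norm.ppf((n - 0.375) / (n + 0.25))
    eta = 2 * norm.ppf((0.75 * n - 0.125) / (n + 0.25))
    sd = ((b - a) / xi + (q3 - q1) / eta) / 2
    return mean, sd
```

The normalizing constants xi and eta are the approximate expected range and interquartile range of n standard normal observations, which is why the conversion is calibrated to normal data.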

63 An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

Cheng Wang 1,Binyan Jiang2,*

1 School of Mathematical Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, China. 2 Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong E-mail: [email protected]

Abstract: The estimation of high dimensional precision matrices has been a central topic in statistical learning. However, as the number of parameters scales quadratically with the dimension $p$, many state-of-the-art methods do not scale well to solve problems with a very large $p$. In this paper, we propose a very efficient algorithm for precision matrix estimation via penalized quadratic loss functions. Under the high dimension low sample size setting, the computation complexity of our algorithm is linear in both the sample size and the number of parameters. Such a computation complexity is in some sense optimal, as it is the same as the complexity needed for computing the sample covariance matrix. Numerical studies show that our algorithm is much more efficient than other state-of-the-art methods when the dimension $p$ is very large.

Key Words: High dimension; Penalized quadratic loss; Precision matrix
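For concreteness, here is a minimal ADMM sketch for one l1-penalized quadratic loss of the kind studied here, min over Omega of tr(Omega S Omega)/2 - tr(Omega) + lam * ||Omega||_1. The specific loss and update scheme are an assumed illustration, not necessarily the paper's algorithm.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def soft(x, t):
    # Elementwise soft-thresholding operator
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_precision(S, lam, rho=1.0, n_iter=100):
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    I = np.eye(p)
    for _ in range(n_iter):
        # Omega-step: stationarity gives the Sylvester equation
        #   (S/2 + rho I) Omega + Omega (S/2) = I + rho (Z - U)
        Omega = solve_sylvester(S / 2 + rho * I, S / 2, I + rho * (Z - U))
        Z = soft(Omega + U, lam / rho)  # sparsity-inducing step
        U = U + Omega - Z               # dual update
    return Z
```

Each iteration costs one Sylvester solve (an eigendecomposition of S plus matrix multiplications) and elementwise thresholding, which is what makes quadratic-loss formulations attractive at large p.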

64 MOSUM-based test and estimation method for multiple changes in panel data

Man Wang1, Rongmao Zhang2, Ngaihang Chan3,*

1Department of Finance, Donghua University, Shanghai, China 2School of Mathematics, Zhejiang University, Hangzhou, China 3 Department of Statistics, The Chinese University of Hong Kong, Shatin, HKSAR E-mail: [email protected]

Abstract: Detection of common changes in panel data is a vibrant topic in both econometrics and statistics. However, most existing testing methods suffer from power loss under certain configurations of multiple change points. To solve this problem, this paper proposes a moving sum (MOSUM) based test for detecting multiple changes in panel data. Under mild conditions, it is shown that the proposed test statistic converges to an extreme value distribution of a Gaussian process under the null hypothesis of no change, and diverges to infinity under the alternative hypothesis of multiple changes. Additionally, based on the MOSUM test, an estimation method that estimates the locations of all the change points simultaneously is given, and its consistency is established. To illustrate the performance of the proposed testing and estimation methods, a number of simulation studies have been conducted. The simulation results show that the proposed MOSUM-based test outperforms most existing cumulative sum (CUSUM) based procedures in the multiple-change setting, and that the estimation method performs satisfactorily. The test and estimation method are applied to US state-level personal income data. The testing results show that there exist common structural breaks in the growth rate of personal income across the 50 states, and the five estimated change points accord with the business cycle behavior of the US.

Key Words: MOSUM; change point; panel data
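The building block of a MOSUM procedure is the scaled difference of sums over two adjacent windows of bandwidth G, which peaks near a change point. A univariate sketch (the panel aggregation and the critical values from the extreme value limit are omitted, and this single-bandwidth version is an assumed simplification):

```python
import numpy as np

def mosum_stat(x, G):
    # stats[k] = (sum over (k, k+G] - sum over (k-G, k]) / sqrt(2G),
    # computed from cumulative sums so the whole sweep is O(n)
    n = x.size
    cs = np.concatenate(([0.0], np.cumsum(x)))
    stats = np.full(n, np.nan)
    for k in range(G, n - G):
        right = cs[k + G] - cs[k]
        left = cs[k] - cs[k - G]
        stats[k] = (right - left) / np.sqrt(2.0 * G)
    return stats
```

Because each window only sees G observations on either side, well-separated multiple change points each produce their own local peak, which is the source of the power advantage over global CUSUM statistics mentioned in the abstract.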

65 Structured tensor decomposition and its application

Miaoyan Wang1

1University of Wisconsin – Madison, USA E-mail: [email protected]

Abstract: Tensors of order 3 or greater, known as higher-order tensors, have recently attracted increased attention in many fields. Methods built on tensors provide powerful tools to capture complex structures in data that lower-order methods may fail to exploit. However, extending familiar matrix concepts to higher-order tensors is not straightforward, and indeed it has been shown that most computational problems for tensors are NP-hard. In this talk, I will present some statistical results on tensor decomposition. We focus on high-dimensional tensors with special structures, such as low-rankness, sparsity, multi-way blocks, and binary-valued tensors. Such problems arise in applications including collaborative filtering, compressed sensing, sensor network localization, and topic modeling. We propose proper loss functions and give performance bounds under generalized multilinear models. We demonstrate the power of our approach on the tasks of tensor completion and clustering, with improved performance over previous methods.

Key Words: High dimensionality; higher-order tensors; CP tensor decomposition

66 Identification of the number of factors for factor modeling in high dimensional time series

Qinwen Wang1

1Fudan University, Shanghai, China E-mail: [email protected]

Abstract: Identifying the number of factors in a high-dimensional factor model has attracted much attention in recent years, and a general solution to the problem is still lacking. A promising ratio estimator based on the singular values of the lagged autocovariance matrix has recently been proposed in the literature and has been shown to perform well under specific assumptions on the strength of the factors. Inspired by this ratio estimator, and as a first main contribution, we propose a complete theory of such sample singular values for both the factor part and the noise part under the large-dimensional scheme where the dimension and the sample size grow to infinity proportionally. In particular, we provide an exact description of the phase transition phenomenon that determines whether a factor is strong enough to be detected from the observed sample singular values. Based on these findings, we propose a new estimator of the number of factors which is strongly consistent for the detection of all significant factors (which are the only theoretically detectable ones). In particular, factors are only assumed to have strength above the phase transition boundary, which is of the order of a constant; they are thus not required to grow to infinity with the dimension (as assumed in most of the existing papers on high-dimensional factor models).

Key Words: high-dimensional factor model; autocovariance matrix; singular values
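The ratio estimator referred to above can be sketched in a few lines: form the lag-1 sample autocovariance matrix, take its singular values, and estimate the number of factors by the location of the smallest consecutive ratio. The lag choice and the cap kmax below are assumed defaults for illustration.

```python
import numpy as np

def estimate_num_factors(X, kmax=8):
    # X: n x p array of observations (rows are time points)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Sigma1 = Xc[1:].T @ Xc[:-1] / (n - 1)  # lag-1 sample autocovariance
    s = np.linalg.svd(Sigma1, compute_uv=False)
    ratios = s[1:kmax + 1] / s[:kmax]       # s_{k+1} / s_k for k = 1..kmax
    return int(np.argmin(ratios)) + 1
```

The lagged autocovariance is used because serially uncorrelated idiosyncratic noise contributes little to it, so the gap between factor and noise singular values, when the factors are above the phase transition boundary, shows up as a sharp drop in the ratios.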

67 Large Multiple Graphical Model Inference via Bootstrap

Shaoli Wang1, Yongli Zhang, Xiaotong Shen

1 Shanghai University of Finance and Economics, China E-mail: [email protected]

Abstract: Large economic and financial networks may experience stage-wise change as a result of external shocks. To detect and infer a structural change, we consider an inference problem in the framework of multiple Gaussian graphical models when the number of graphs and the dimension of graphs increase with the sample size. In this setting, two major challenges emerge as a result of the bias and uncertainty inherent in regularization required to treat such overparameterized models. To deal with these challenges, bootstrap is utilized to approximate the sampling distribution of a likelihood ratio test statistic. We show theoretically that the proposed method leads to a correct asymptotic inference in a high-dimensional setting regardless of the distribution of the test statistic. Simulations show that the proposed method compares favorably to its competitors such as the Likelihood Ratio Test. An application is given to analyze a network of 200 stocks.

68 An adaptive independence test for microbiome community data

Tao Wang1, Yaru Song1, Hongyu Zhao2,*

1Shanghai Jiao Tong University, Shanghai, China 2Yale University, New Haven, USA E-mail: [email protected]

Abstract: Advances in sequencing technologies and bioinformatics tools have vastly improved our ability to collect and analyze data from complex microbial communities. A major goal of microbiome studies is to correlate the overall microbiome composition with clinical or environmental variables. La Rosa et al. (2012) recently proposed a parametric test for comparing microbiome populations between two or more groups of subjects. However, this method is not applicable for testing the association between the community composition and a continuous outcome. Although multivariate non-parametric methods based on permutations are widely used in ecology studies, they lack interpretability and can be inefficient for analyzing microbiome data. We consider the problem of testing for independence between the microbial community composition and a continuous or many-valued variable. By partitioning the range of the variable into a few slices, we formulate the problem as a problem of comparing multiple groups of microbiome samples, with each group indexed by a slice. To model multivariate and over-dispersed count data, we use the Dirichlet-multinomial distribution. We propose an adaptive likelihood-ratio test by learning a good partition or slicing scheme from the data. A dynamic programming algorithm is developed for numerical optimization. We demonstrate the superiority of the proposed test by comparing it to that of La Rosa et al. (2012) and popular approaches on the same topic including PERMANOVA, the distance covariance test, and the microbiome regression-based kernel association test. We further apply it to test the association of gut microbiome with age in three geographically distinct populations, and show how the learned partition facilitates differential abundance analysis.

Key Words: Adaptive slicing; Community-level analysis; Differential abundance testing; Distance-based methods; Penalization
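The dynamic-programming step for choosing a slicing scheme can be illustrated generically. The sketch below uses a hypothetical per-slice score in place of the likelihood-ratio gain the talk optimizes, and finds the contiguous partition maximizing the total score:

```python
import math

def best_slicing(scores, max_slices):
    """Dynamic program over contiguous partitions of positions 0..n-1.
    scores[i][j] = score of a slice covering positions i..j (inclusive);
    a hypothetical stand-in for the likelihood-ratio gain in the talk."""
    n = len(scores)
    NEG = -math.inf
    # dp[k][j] = best total score covering positions 0..j with k slices
    dp = [[NEG] * n for _ in range(max_slices + 1)]
    for j in range(n):
        dp[1][j] = scores[0][j]
    for k in range(2, max_slices + 1):
        for j in range(k - 1, n):
            dp[k][j] = max(dp[k - 1][i - 1] + scores[i][j]
                           for i in range(k - 1, j + 1))
    return max(dp[k][n - 1] for k in range(1, max_slices + 1))

# Toy score that rewards long slices quadratically: one big slice wins
big = [[(j - i + 1) ** 2 if j >= i else 0 for j in range(3)] for i in range(3)]
best = best_slicing(big, max_slices=2)  # 9: keep all three positions together
```

The O(K n^2) table fill mirrors how an optimal partition can be searched exactly rather than greedily.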

69 Integrated Quantile Rank Test (iQRAT) for heterogeneous joint effect of rare and common variants in sequencing studies

Tianying Wang, Iuliana Ionita-Laza, Ying Wei

Department of Biostatistics, Columbia University, New York, NY E-mail: [email protected]

Abstract: Genetic association studies often evaluate the combined group-wise effects of rare and common genetic variants on phenotype at the gene level. Many approaches have been proposed for group-wise association tests, such as the widely used burden tests and sequence kernel association tests. Most of these approaches focus on identifying mean effects. As genetic associations are complex, we propose an efficient integrated rank test to investigate genetic effects across the entire distribution/quantile function of a phenotype. The resulting test complements mean-based analyses and improves efficiency and robustness. The proposed test integrates the rank score test statistics over quantile levels while incorporating the Cauchy combination test scheme and Fisher's method to maximize power; it generalizes the classical quantile-specific rank-score test. Using simulation studies and real Metabochip data on lipid traits, we investigate the performance of the new test in comparison with the burden tests and sequence kernel association tests in multiple scenarios.

Key Words: Quantile process; Association test; Sequencing analysis; Joint effects; Rare variants
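The Cauchy combination scheme mentioned in the abstract has a simple closed form. A sketch following the general construction of Liu and Xie (the weighting used within iQRAT itself may differ):

```python
import math

def cauchy_combination(pvalues, weights=None):
    """Cauchy combination of p-values. Under the null, each
    tan((0.5 - p) * pi) is standard Cauchy, and a convex combination of
    standard Cauchy variables is again standard Cauchy, giving a
    closed-form combined p-value that is robust to dependence."""
    m = len(pvalues)
    if weights is None:
        weights = [1.0 / m] * m
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi

combined = cauchy_combination([0.01, 0.5, 0.9])
```

Combining identical p-values returns that p-value, which is a quick sanity check on the transform.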

70 Model Free Approach to Quantifying the Proportion of Treatment Effect Explained by a Surrogate Marker

Xuan Wang1,∗, Layla Parast2, Lu Tian3, Tianxi Cai4

1,∗School of Mathematical Sciences, Zhejiang University, Hangzhou, Zhejiang, China 2 Statistics Group, RAND Corporation, Santa Monica, California 90401 U.S.A. 3 Department of Biomedical Data Science, Stanford University, Stanford, California 94305 U.S.A. 4 Department of Biostatistics, Harvard University, Boston, Massachusetts 02115 U.S.A. E-mail: [email protected]

Abstract: In randomized clinical trials, the primary outcome, Y, often requires long-term follow-up and/or is costly to measure. For such settings, it is desirable to use a surrogate marker, S, to infer the treatment effect on Y, ∆. Identifying such an S and quantifying the proportion of treatment effect on Y explained (PTE) by the effect on S are thus of great importance. Most existing methods for quantifying the PTE are model-based and may yield biased estimates under model mis-specification. Recently proposed non-parametric methods require strong assumptions to ensure that the PTE lies in [0, 1]. Additionally, optimal use of S to approximate ∆ is especially important when S relates to Y non-linearly. In this paper, we identify an optimal transformation of S, g_opt, such that the PTE can be inferred based on g_opt(S). In addition, we provide two novel model-free definitions of the PTE and simple conditions ensuring that the PTE lies in [0, 1]. We provide non-parametric estimation procedures and establish asymptotic properties of the proposed estimators. Simulation studies demonstrate that the proposed methods perform well in finite samples. We illustrate the proposed procedures using a randomized study of HIV patients.

Key Words: Non-parametric estimation; Proportion of treatment effect explained; Randomized clinical trial; Surrogate marker.

71 Scattering Transform and Stylometry Analysis in Arts

Yang Wang1

1 Department of Mathematics, Hong Kong University of Science and Technology E-mail: [email protected]

Abstract: With the rapid advancement in data analysis and machine learning, stylometry analysis in arts has gained considerable interest in recent years. A fundamental topic of research in stylometry is the detection of art forgery. But unlike many other machine learning applications, we typically face the challenge of not having enough data. In this talk I'll discuss how the scattering transform can be applied to stylometry analysis, and demonstrate its effectiveness on Van Gogh paintings as well as another data set.

72 A Fast and Practical Randomized Method for Low-Rank Tensor Approximations

Yao Wang1,∗

1,∗ Xi’an Jiaotong University, 710049, P.R. China E-mail: [email protected]

Abstract: Low-rank tensor approximations have gained much attention in real-world applications such as dynamic medical image processing and multi-channel video analysis, because of their efficiency in exploiting the intrinsic structure of the data with a limited number of parameters. Unfortunately, the popular tensor decompositions that yield efficient low-rank approximations, namely the Tucker decomposition and the tensor Singular Value Decomposition (t-SVD), require computing many SVDs and are in general prohibitively expensive, which obviously limits their use in "big data" environments. To remedy this issue, in this work we present a randomized URV decomposition for producing fast and efficient low-rank tensor approximations with theoretical guarantees, based on the Tucker decomposition and the t-SVD. More precisely, our method incorporates a strong rank-revealing QR decomposition that makes the computation of the Tucker decomposition and the t-SVD more stable. We then justify the effectiveness of the obtained low-rank tensor approximations through a series of synthetic data experiments and several real-world applications. The extensive experimental results demonstrate the superior performance of our procedures over existing methods in terms of both robustness and computational speed.

Key Words: Low-rank tensor approximations; Tucker decomposition; t-SVD; URV decomposition; randomized algorithms
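For intuition, the matrix analogue of such randomized decompositions is the randomized SVD: sketch the range of the matrix with a random test matrix, then decompose in the reduced subspace. The talk's method builds on a randomized URV with strong rank-revealing QR rather than the plain QR used here, so this is only a simplified illustration:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Basic randomized SVD (Halko-Martinsson-Tropp style). Applied to
    matricizations or slices, the same sketching idea accelerates
    Tucker and t-SVD computations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sketch the range of A with a Gaussian test matrix
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Project onto the small subspace and take an exact SVD there
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]

# On an exactly low-rank matrix the approximation is essentially exact
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
U, s, Vt = randomized_svd(A, rank=5)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

The cost is dominated by one multiplication of A with a thin matrix, instead of a full SVD of A.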

73 Traps commonly encountered in machine learning practice

Hu Wei 1

1 Yingying Group, Inc., China E-mail: [email protected]

Abstract: In machine learning practice, we often encounter the following difficulties: inconsistency between data samples and application scenarios, discrepancy between the model training and application environments, systematic missingness and variation of data, censoring of observed samples, and so on. These problems vary across businesses, but they are essentially the same in their key respects: data, features and modeling methods. Based on several case studies of cutting-edge machine learning applications in the Internet industry, this presentation helps the audience understand how to analyze data, select assumptions suitable for the data, and then design the modeling process of data preprocessing, algorithm selection and optimization, and model evaluation and deployment.

74 Flexible Experimental Designs for Valid Single-cell RNA-sequencing Experiments Allowing Batch Effects Correction

Yingying Wei1, Ga Ming Chan2, Fangda Song3

1 Department of Statistics, The Chinese University of Hong Kong, HKSAR E-mail: [email protected]

Abstract: Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs---the "reference panel" and the "chain-type" designs---true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.

Key Words: Batch effects; Experimental design; Single-cell RNA-seq experiments; Model-based clustering; Integrative analysis

75 Recent developments in graph matching: statistical analysis

Jian Ding1,∗, Zongming Ma1, Yihong Wu2, Jiaming Xu3,∗

1∗ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, USA 2 Department of Statistics and Data Science, Yale University, New Haven, USA 3 The Fuqua School of Business, Duke University, Durham, USA E-mail: [email protected]

Abstract: Random graph matching refers to recovering the underlying vertex correspondence between two random graphs with correlated edges; a prominent example is when the two random graphs are given by Erdős–Rényi graphs G(n, d/n). This can be viewed as an average-case and noisy version of the graph isomorphism problem. Under this model, the maximum likelihood estimator is equivalent to solving the intractable quadratic assignment problem. This work develops an Õ(nd^2 + n^2)-time algorithm which perfectly recovers the true vertex correspondence with high probability, provided that the average degree is at least d = Ω(log^2 n) and the two graphs differ by at most a δ = O(log^{-2} n) fraction of edges. For dense graphs and sparse graphs, this can be improved to δ = O(log^{-2/3} n) and δ = O(log^{-2} d) respectively, both in polynomial time. The methodology is based on appropriately chosen distance statistics of the degree profiles (empirical distributions of the degrees of neighbors). Before this work, the best known results achieved δ = O(1) and n^{o(1)} ≤ d ≤ n^c for some constant c with an n^{O(log n)}-time algorithm, and δ = Õ((d/n)^4) and d = Ω̃(n^{4/5}) with a polynomial-time algorithm.

Key Words: Graph matching, Degree profiles, Quadratic assignment problem, Random graph isomorphism, Erdős–Rényi graphs
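A toy version of matching by degree profiles can be sketched as follows. The distance statistic and the greedy assignment here are simplifications; the paper's algorithm uses carefully calibrated distances and comes with recovery guarantees, which greedy matching does not:

```python
import numpy as np

def degree_profiles(adj):
    """For each vertex, the sorted degrees of its neighbors."""
    deg = adj.sum(axis=1)
    return [np.sort(deg[adj[v] == 1]) for v in range(len(adj))]

def profile_distance(p, q, grid):
    """W1-type distance between two empirical neighbor-degree
    distributions, compared through their CDFs on a common grid."""
    cdf_p = np.searchsorted(p, grid, side="right") / max(len(p), 1)
    cdf_q = np.searchsorted(q, grid, side="right") / max(len(q), 1)
    return np.abs(cdf_p - cdf_q).sum()

def match_by_profiles(adj1, adj2):
    """Greedy vertex matching by nearest degree profile (toy version)."""
    n = len(adj1)
    grid = np.arange(n + 1)
    P, Q = degree_profiles(adj1), degree_profiles(adj2)
    D = np.array([[profile_distance(p, q, grid) for q in Q] for p in P])
    match, used = {}, set()
    for i in np.argsort(D.min(axis=1)):  # most confident vertices first
        j = min((j for j in range(n) if j not in used),
                key=lambda j: D[i, j])
        match[i] = j
        used.add(j)
    return match

# Star graph matched to itself: the hub has a distinctive profile
adj = np.zeros((5, 5), dtype=int)
adj[0, 1:] = 1
adj[1:, 0] = 1
match = match_by_profiles(adj, adj)
```

The key structural point survives the simplification: degree profiles carry far more identifying information about a vertex than its degree alone.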

76 Differential Markov Random Field Analysis

Yin Xia1, Tony Cai2, Hongzhe Li3, Jing Ma3

1Department of Statistics, Fudan University, China 2Department of Statistics, University of Pennsylvania, USA 3Department of Biostatistics, University of Pennsylvania, USA E-mail: [email protected]

Abstract: In this talk, we propose a flexible Markov random field model for learning the microbial community structure and introduce a testing framework for detecting the difference between networks, also known as differential network analysis. Our global test for differential networks is particularly powerful against sparse alternatives. In addition, we develop a multiple testing procedure with false discovery rate control to infer the structure of the differential network. The proposed method is applied to a gut microbiome study on UK twins to detect the microbial interactions associated with the age of the host.

Key Words: Differential network; High dimensional logistic regression; Microbiome; Multiple testing

77 Building a Translational Research Program in Neuroinflammation: A Data Driven Approach to Advance Precision Medicine for Multiple Sclerosis

Zongqi Xia1,2, Liang Liang3, Tianrun Cai4, Kumar Dahal5, Chen Lin5, Sean Finan5, Guergana Savova5, Tanuja Chitnis6, Howard Weiner6, Philip De Jager2,7, Tianxi Cai3

1Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA 2Cell Circuits Program, Broad Institute, Cambridge, MA, USA 3Department of Statistics, Harvard School of Public Health, Boston, MA, USA 4Department of Rheumatology, Brigham and Women’s Hospital, Boston, MA, USA 5Clinical Natural Language Processing Program, Boston Children’s Hospital, Boston, MA, USA 6Department of Neurology, Brigham and Women’s Hospital, Boston, MA, USA 7Center for Translational & Computational Neuroimmunology and the Columbia MS Center, Department of Neurology, Columbia University Medical Center, New York City, NY, USA E-mail: [email protected]

Abstract: Multiple sclerosis (MS) is a chronic neurological disease with a disproportionately high socioeconomic burden. Given the wide spectrum of disease trajectories among people with MS and their diverse responses to treatments, there is an unmet need to bring precision medicine to MS. For this presentation, I will primarily discuss our ongoing efforts in developing analytical approaches to ascertain disease activity and predict treatment response using electronic health records data. Tools that leverage real-life clinical data for outcome prediction in chronic neurological disorders have the potential for widespread dissemination at the point of care. I will additionally highlight complementary research strategies of leveraging prospective cohort studies to investigate MS onset and disease evolution as part of a broader program to advance precision medicine for MS.

Key Words: multiple sclerosis; precision medicine; electronic health records; disease activity; treatment response

78 Realized volatility forecasting with HAR-GARCH type model: a Bayesian approach

Yunxian Li 1,∗, Han Xiang 2

1, ∗ Yunnan University of Finance and Economics, 650221, China 2Yunnan University of Finance and Economics, 650221, China E-mail: [email protected]

Abstract: In this paper, HAR-GARCH-type models are proposed to analyze crude oil price volatilities. Bayesian approaches, including Bayesian model estimation and model comparison, are developed for HAR-GARCH-type models. MCMC methods are applied to obtain Bayesian estimates of the unknown parameters of the proposed model. A Bayesian criterion-based statistic, called the Lv measure, is proposed as a model comparison statistic for HAR-GARCH-type models. In addition, Bayesian forecasting is also discussed. Using the proposed models and methodologies, 5-minute price data on WTI crude oil futures are analyzed. Different models, including HAR-type and HAR-GARCH-type models, are considered and compared. According to the Lv measure, the HAR-GARCH-type model performs better than the HAR-type model; it is reasonable to consider HAR-GARCH-type models, since heterogeneity exists in the realized volatilities. The future realized volatilities are predicted based on the selected HAR-GARCH model.

Key Words: Oil price; Realized volatility; Model selection; Bayesian approach
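For reference, the plain HAR-RV regression underlying these models (Corsi's heterogeneous autoregression) can be sketched as below. The HAR-GARCH extension in the talk additionally models time-varying innovation variance and is estimated by MCMC rather than by the least squares shown here:

```python
import numpy as np

def har_design(rv):
    """Build the HAR design: tomorrow's realized volatility regressed
    on daily, weekly (5-day) and monthly (22-day) averages of past
    realized volatility."""
    rv = np.asarray(rv, dtype=float)
    rows, targets = [], []
    for t in range(21, len(rv) - 1):
        rows.append([1.0,                      # intercept
                     rv[t],                    # daily component
                     rv[t - 4:t + 1].mean(),   # weekly component
                     rv[t - 21:t + 1].mean()]) # monthly component
        targets.append(rv[t + 1])
    return np.array(rows), np.array(targets)

def fit_har(rv):
    """Least-squares fit of the HAR-RV coefficients."""
    X, y = har_design(rv)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # intercept, daily, weekly, monthly

rv = np.arange(100, dtype=float) + 1.0  # toy volatility series
X, y = har_design(rv)
beta = fit_har(rv)
```

The three horizons are what make the model "heterogeneous": traders reacting at daily, weekly and monthly frequencies each contribute a term.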

79 TBA

Han XIAO1

1 Rutgers University, USA E-mail: [email protected]

Abstract: TBA

80 Multiple Testing Embedded in an Aggregation Tree to Identify where Two Distributions Differ

Jichun Xie 1

1 Duke University, Duke Box 103854, Durham, NC E-mail: [email protected]

Abstract: A key goal of flow cytometry data analysis is to identify the subpopulation of cells whose attributes are responsive to the treatment. These cells are expected to be sparse within the entire cell population. To identify them, we propose a novel multiple TEsting on the Aggregation tree Method (TEAM) to locate where the treated and the reference distributions differ. TEAM has a bottom-up hierarchical framework: on the bottom layer, we search for short-range spiky distributional differences in small bins, while on the higher layers, we search for long-range weak distributional differences. The active testing sets and the rejection rule on a higher layer depend on the testing results of the lower layers. Under mild conditions, we prove that TEAM yields a false discovery proportion (FDP) converging to the desired level. Extensive simulations show that the proposed method is valid and has much better power than single-layer multiple testing methods and the multi-resolution scanning method. We then apply TEAM to a flow cytometry study, where we successfully identify the cell subpopulation that is responsive to the cytomegalovirus antigen.

Key Words: Hierarchical multiple testing; aggregation tree; equal-power binning; false discovery rate (FDR); distribution difference; flow cytometry (FCM)
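The calibration on each layer is in the spirit of classical FDR control. For orientation, a plain Benjamini-Hochberg step-up rule, which TEAM extends with layered, data-dependent active testing sets, looks like:

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest index with p_(k) <= alpha * k / m."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rej = bh_reject([0.001, 0.002, 0.9, 0.8], alpha=0.05)
```

In TEAM the analogue of this rule is applied per layer, with higher layers only testing aggregated bins that survived below.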

81 Pearson's statistics: approximation theory and beyond

Mengyu Xu1, Danna Zhang2*, Wei Biao Wu3,

1Department of Statistics, University of Central Florida, 32816, U.S.A. 2Department of Mathematics, University of California, San Diego, 92093, U.S.A. 3Department of Statistics, University of Chicago, 60637, U.S.A. E-mail: [email protected]

Abstract: We establish an approximation theory for Pearson's chi-squared statistics in situations where the number of cells is large, by using a high-dimensional central limit theorem for quadratic forms of random vectors. Our high-dimensional central limit theorem is proved under Lyapunov-type conditions that involve a delicate interplay between the dimension, the sample size and the moment conditions. We propose a modified chi-squared statistic and the concept of adjusted degrees of freedom. Our simulation study shows the modified statistic outperforms Pearson's chi-squared statistic in both size accuracy and power. Our procedure is applied to the construction of a goodness-of-fit test for Rutherford's alpha particle data.

Key Words: Adjusted degrees of freedom; goodness-of-fit test; invariance principle; large p small n; Pearson's chi-squared statistic
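The classical ingredients can be stated briefly: with many cells, a chi-squared variable with k degrees of freedom is approximately N(k, 2k), and the paper refines exactly this type of approximation with adjusted degrees of freedom. A naive sketch, without the adjustment:

```python
import math
import numpy as np

def pearson_chi2(counts, probs):
    """Pearson's chi-squared statistic for observed cell counts against
    hypothesized cell probabilities."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() * np.asarray(probs, dtype=float)
    return ((counts - expected) ** 2 / expected).sum()

def normal_approx_pvalue(x2, df):
    """One-sided p-value from the crude normal approximation
    chi2_df ~ N(df, 2*df), reasonable when df (number of cells) is large."""
    z = (x2 - df) / math.sqrt(2 * df)
    return 0.5 * math.erfc(z / math.sqrt(2))

x2 = pearson_chi2([25, 25, 25, 25], [0.25] * 4)
```

The paper's point is that when the number of cells grows with the sample size, this naive normal limit needs correction, hence the modified statistic and adjusted degrees of freedom.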

82 Distribution and correlation free two-sample test of high-dimensional means

Kaijie XUE1

1 University of Toronto, Canada E-mail: [email protected]

Abstract: We propose a two-sample test for high-dimensional means that requires neither distributional nor correlational assumptions, beyond some weak conditions on the moments and tail properties of the elements of the random vectors. This two-sample test, based on a nontrivial extension of the one-sample central limit theorem (Chernozhukov et al., 2017), provides a practically useful procedure with rigorous theoretical guarantees on its size and power. In particular, the proposed test is easy to compute and does not require the data to be independently and identically distributed: the observations are allowed to have different distributions and arbitrary correlation structures. Further desirable features include weaker moment and tail conditions than existing methods, allowance for highly unequal sample sizes, consistent power behavior under fairly general alternatives, and a data dimension that is allowed to be exponentially high under these general conditions. Simulated and real data examples are used to demonstrate the favorable numerical performance over existing methods.

83 84 A statistical and machine learning framework for new energy vehicle ride sharing system

Kaixian Yu3,*, Jinliang Deng1,*, Chengchun Shi2,*, Rui Song2, Qiang Yang1, Jieping Ye3, Hongtu Zhu3

1Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China 2Department of Statistics, North Carolina State University, Raleigh, NC, US 3AI Labs, Didi Chuxing, Beijing, China *These authors contributed equally E-mail: [email protected]

Abstract: Recently, the number of electric vehicles (EVs) serving on online ride-hailing platforms, such as Uber and Didi Chuxing, has increased rapidly. Unlike conventional fuel vehicles, EVs have some unique characteristics: they do not travel as far as fuel vehicles, and they take much longer to charge. Adapting to these characteristics in the dispatching systems of online ride-hailing companies is becoming increasingly important. In this talk, we will present our recent progress on two major components of an EV-friendly dispatching system. First, we will introduce a stochastic partial differential equation approach to model the power consumption of an EV. The power consumption model takes real-time vehicle and environment factors into account to estimate the state of charge. Second, we will introduce a deep multi-objective reinforcement learning approach to solve the order dispatching problem based on the estimated state of charge of the EVs. Some results on real data and a simulated system will be shown as well.

Key Words: Applied statistics; new energy vehicle; online ride-hailing platform; stochastic differential equation; deep multi-objective reinforcement learning

85 Word Segmentation and Term Discovery in Chinese Electronic Medical Records Using Graph Theory and Deep Learning

Zheng Yuan1#, Yuanhao Liu2#, Qiuyang Yin3, Boyao Li4, Sheng Yu1*

1Center for Statistical Science, Tsinghua University, Beijing, China; 2Department of Statistics, University of Michigan, Ann Arbor, MI, USA; 3Department of Automation, Tsinghua University, Beijing, China; 4Department of Physics, Tsinghua University, Beijing, China; # contributed equally E-mail: [email protected]

Abstract: Natural language processing (NLP) for electronic medical records (EMR) generally requires a comprehensive medical dictionary that covers standard terminology, term variations, and abbreviations. In this work, we present an automated method to discover medical terms from Chinese EMR notes, innovatively using both graph theory and deep learning.

The automatic Chinese medical term discovery pipeline has two steps. The first step is word segmentation. We propose a graph theory-based unsupervised word segmentation algorithm that considers the sentence as an undirected graph, whose nodes are the characters. One can use various techniques to compute the edge weights that measure the connection strength between characters. Spectral graph partition algorithms are used to group the characters and achieve word segmentation. Segmenting the EMR corpus will provide a list of candidate medical terms.

The word segmentation result may contain errors, i.e., word boundaries may be placed at the wrong positions. Thus, in the second step, we train a bi-directional LSTM neural network discriminator to remove wrong segmentation results. The model input includes the candidate term and the ±4 characters around the term in the text. The neural network uses an embedding layer for both the characters and the forward/backward positions. We created training data by simulation. Positive samples are terms identified using a dictionary-based segmenter loaded with a general domain dictionary and a Chinese-English medical dictionary. Negative samples are obtained by randomly adding/removing 1 or 2 characters to/from either end of a positive sample to imitate segmentation errors. The trained discriminator is applied to the segmentation result. Terms that appear repeatedly in the EMR corpus are classified repeatedly. Terms not in the general domain dictionary and accepted over 10 times by the model are kept as predicted medical terms.

Key Words: electronic medical records; word segmentation; term discovery; deep learning; graph partition
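The spectral partition step in the first stage can be sketched generically. The graph and weights below are hypothetical stand-ins; in the pipeline above the nodes would be the characters of a sentence and the edge weights would measure character cohesion:

```python
import numpy as np

def fiedler_partition(W):
    """Split the nodes of a weighted undirected graph in two using the
    sign of the Fiedler vector (the eigenvector of the graph Laplacian
    L = D - W belonging to the second-smallest eigenvalue)."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]  # eigenvalues are sorted ascending
    return fiedler >= 0

# Two tightly connected groups joined by one weak edge
W = np.array([[0, 5, 5, 1, 0, 0],
              [5, 0, 5, 0, 0, 0],
              [5, 5, 0, 0, 0, 0],
              [1, 0, 0, 0, 5, 5],
              [0, 0, 0, 5, 0, 5],
              [0, 0, 0, 5, 5, 0]], dtype=float)
labels = fiedler_partition(W)
```

Recursive application of such a split is one standard way to turn a character-cohesion graph into word boundaries.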

86 Statistical Inference Based on Sufficient Dimension Reduction

Zhou Yu1

1 East China Normal University, China E-mail: [email protected]

Abstract: TBA

87 Global Optimality of Stochastic Semi-definite Optimization with Application to Ordinal Embedding

Jinshan Zeng1,2∗, Ke Ma3, Yuan Yao 2

1 School of Computer Science, Jiangxi Normal University, Nanchang, China 2 Department of Mathematics, Hong Kong University of Science and Technology, HKSAR 3 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China E-mail: [email protected]

Abstract: Nonconvex reformulation via low-rank factorization of stochastic convex semi-definite optimization problems has attracted rising attention due to its empirical efficiency and scalability. However, despite their empirical success in applications, it remains an open challenge under what conditions nonconvex stochastic algorithms find the population minimizer within the optimal statistical precision. In this talk, we provide an answer: the stochastic gradient descent (SGD) method can be adapted to solve the nonconvex reformulation of the original convex problem with global linear convergence when using a fixed step size, i.e., converging exponentially fast to the population minimizer within the optimal statistical precision, given a proper initialization, in the restricted strongly convex case. If a diminishing step size is adopted, the adverse effect of the gradient variance on the optimization error can be eliminated, but the rate drops to sublinear. We then propose an accelerated stochastic algorithm, SVRG, and establish its global linear convergence. Finally, we apply the developed stochastic algorithms to the ordinal embedding problem and demonstrate their effectiveness.

Key Words: Stochastic gradient descent; SVRG; semidefinite optimization; low-rank factorization; ordinal embedding
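A deterministic skeleton of the factorized approach can be sketched for the symmetric PSD case: run gradient descent with a fixed step on the nonconvex objective f(U) = ||A - U U^T||_F^2. The talk's algorithms replace the full gradient with stochastic gradients and an SVRG variant, with the convergence guarantees stated above:

```python
import numpy as np

def factored_gd(A, rank, iters=3000, seed=0):
    """Fixed-step gradient descent on f(U) = ||A - U U^T||_F^2, the
    low-rank factorization of a PSD matrix problem. The step size is
    scaled to the problem; the choice 0.1/||A||_2 is a heuristic for
    this sketch, not the paper's prescription."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    step = 0.1 / np.linalg.norm(A, 2)
    U = 0.1 * rng.standard_normal((n, rank))  # small random init
    for _ in range(iters):
        U -= step * 4 * (U @ U.T - A) @ U  # gradient of f at U
    return U

# Recover an exactly rank-2 PSD matrix
B = np.random.default_rng(1).standard_normal((10, 2))
A = B @ B.T
U = factored_gd(A, rank=2)
rel_err = np.linalg.norm(U @ U.T - A) / np.linalg.norm(A)
```

The factorization U U^T enforces the PSD and rank constraints for free, which is exactly why the nonconvex reformulation scales better than the convex SDP.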

88 High-dimensional Tensor Regression Analysis

Anru Zhang1

1University of Wisconsin-Madison, Madison WI, USA E-mail: [email protected]

Abstract: The past decade has seen a large body of work on high-dimensional tensors or multiway arrays that arise in numerous applications and have been applied to many statistical problems. In many of these settings, the tensor of interest is high-dimensional in that the ambient dimension is substantially larger than the sample size. Oftentimes, however, the tensor comes with natural low-rank or sparsity structure. How to exploit such structure for tensors poses new statistical and computational challenges.

In this talk, we introduce a novel procedure for low-rank tensor regression, namely Importance Sketching Low-rank Estimation for Tensors (ISLET), which addresses these challenges. The central idea behind ISLET is what we call importance sketching, carefully designed structural sketches based on higher order orthogonal iteration (HOOI) and combining sketched estimated components using the recently developed Cross procedure. We show that our estimating method is sharply minimax optimal in terms of the mean-squared error under low-rank Tucker assumptions. In addition, if a tensor is low-rank with group sparsity, our procedure also achieves minimax optimality. Further, we show through numerical study that ISLET achieves comparable mean-squared error performance to existing state-of-the-art methods whilst having substantial storage and run-time advantages. In particular, our procedure performs reliable tensor estimation with tensors of dimension p = O (10^8) and is 1 or 2 orders of magnitude faster than baseline methods.

Key Words: Dimension reduction; high-order orthogonal iteration; minimax optimality; sketching; tensor regression

89 Enhanced Pulmonary Nodule Detection Using Fully Automated Deep Learning: A Multifactor Investigation

Chi Zhang1∗, Jiechao Ma2, Shiyuan Liu3

1∗ Presenter, Beijing Infervision Inc., 100085, China 2 Beijing Infervision Inc., 100085, China 3 Changzheng Hospital, Second Military Medical University, 200003, China E-mail: [email protected]

Abstract: Neural network based deep learning (DL) algorithms have been successfully used in the detection of lung nodules in CT scans, improving efficiency while reducing the burden on radiologists. We compared the detection sensitivity of a DL model with that of radiologists, and verified whether DL models could assist radiologists to enhance baseline screening detection. In this retrospective study, we compared the false discovery rate (FDR) and localization receiver operating characteristic (LROC) curves between the performance of radiologists alone and radiologists with DL model assistance. The results showed that for all cohorts the DL model had higher overall sensitivity than manual detection and was insensitive to radiation dose, patient age, and CT equipment. It was also shown to enhance manual screening, which may lead to better pulmonary nodule management.

Key Words: lung cancer screening; pulmonary nodule detection; deep learning; computer-aided detection

90 Heteroscedasticity test based on high-frequency data with jumps and microstructure noise

Qiang Liu1, Zhi Liu2, Chuanhai Zhang3

1 Department of Mathematics, National University of Singapore, Singapore 2 Department of Mathematics, University of Macau, Macau SAR, China 3 School of Finance, Zhongnan University of Economics and Law, Wuhan 430073, China E-mail: [email protected]

Abstract: In this paper, we are interested in testing whether the volatility process is constant during a given time span by using high-frequency data, considering the possible presence of jumps and microstructure noise. Based on estimators of the integrated volatility and the spot volatility, we propose a new nonparametric way to depict the discrepancy between local variation and global variation. We show that our test statistic converges to a standard normal distribution if the volatility is constant, and otherwise diverges to infinity. Simulation studies verify our theoretical results and show the good finite sample performance of our test procedure. We also apply our estimator to test for heteroscedasticity in real financial high-frequency data.

Key Words: High-frequency data; Market microstructure noise; jumps; Heteroscedasticity; Nonparametric test; Integrated volatility; Spot volatility
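The core idea, stripped of the jump and microstructure-noise corrections that the paper develops, is to compare local and global variation. A toy sketch with a crude discrepancy measure:

```python
import numpy as np

def local_vs_global_rv(returns, n_blocks):
    """Compare block-wise (local) realized variance per observation with
    the full-sample (global) one. Under constant volatility all local
    estimates should be close to the global estimate; large dispersion
    suggests heteroscedasticity. No jump or noise corrections here."""
    r = np.asarray(returns, dtype=float)
    blocks = np.array_split(r, n_blocks)
    local = np.array([np.sum(b ** 2) / len(b) for b in blocks])
    global_rv = np.sum(r ** 2) / len(r)
    return np.max(np.abs(local - global_rv)) / global_rv

rng = np.random.default_rng(0)
calm = rng.normal(0.0, 1.0, 10000)                     # constant volatility
shift = np.concatenate([rng.normal(0.0, 0.5, 5000),
                        rng.normal(0.0, 2.0, 5000)])   # volatility regime change
d_calm, d_shift = local_vs_global_rv(calm, 10), local_vs_global_rv(shift, 10)
```

The paper's contribution is to calibrate such a discrepancy so that it is asymptotically standard normal under constant volatility even with jumps and noise present.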

91 Factor Models for High-Dimensional Tensor Time Series

Cun-Hui Zhang1

1Rutgers University, USA E-mail: [email protected]

Abstract: Large tensor data are now routinely collected in a wide range of applications due to rapid development of information technologies and their broad implementation in our era. Often such observations are taken over time, forming tensor time series. In this paper we present a factor model approach for analyzing high-dimensional dynamic tensor time series and multi-category dynamic transport networks. Two estimation procedures are presented along with their theoretical properties and simulation results. Real applications are used to illustrate the model and its interpretations. This talk is based on joint work with Rong Chen and Dan Yang.

Key Words: tensor; time series; factor model; loading matrix; decomposition

92 Stochastic differential reinsurance games with capital injections

Nan Zhang1, Zhuo Jin 2, Linyi Qian 3, Kun Fan 4

1,3,4 East China Normal University, Shanghai 200062, China 2 The University of Melbourne, VIC 3010, Australia E-mail: [email protected]

Abstract: This paper investigates a class of reinsurance game problems between two insurance companies under the framework of non-zero-sum stochastic differential games. Both insurers can purchase proportional reinsurance contracts from reinsurance markets and have the option of determining the time and amount of capital injections, which is described by impulse controls. We assume the reinsurance premium is calculated under the generalized variance premium principle. The objective of each insurer is to maximize an expected value that synthesizes the discounted utility of its surplus relative to a reference point, the penalties caused by its capital injection interventions, and the gains brought by the capital injections of its competitor. We prove the verification theorem and derive explicit expressions for the Nash equilibrium strategy by solving the corresponding quasi-variational inequalities. Numerical examples are also conducted to illustrate our results.

Key Words: Stochastic differential game, Impulse control, Nash equilibrium, Quasi-variational inequality

93 Structured sparse logistic regression with application to lung cancer prediction using breath volatile biomarkers

Xiaochen Zhang1, Qingzhao Zhang1,2, Xiaofeng Wang3, Shuangge Ma4, Kuangnan Fang1,*

1Department of Statistics, School of Economics, Xiamen University, China 2The Wang Yanan Institute for Studies in Economics, Xiamen University, China 3Department of Quantitative Health Sciences/Biostatistics Section, Cleveland Clinic Lerner Research Institute, Cleveland, OH, USA 4Department of Biostatistics, Yale School of Public Health, USA E-mail: [email protected]

Abstract: This article is motivated by a study of lung cancer prediction using breath volatile organic compound (VOC) biomarkers, where the challenge is that the predictors include not only high-dimensional time-dependent (functional) VOC features but also time-independent clinical variables. We consider a high-dimensional logistic regression and propose two penalties, the group spline-penalty and the group smooth-penalty, to handle the group structures of the time-dependent variables in the model. Compared with existing methods such as the group lasso and group bridge approaches, the new methods are advantageous when the model coefficients are sparse but change smoothly within a group. Our methods are easy to implement since, after certain transformations, they can be converted into a group minimax concave penalty problem. We show that our fitting algorithm possesses the descent property and enjoys attractive convergence properties. Simulation studies and the lung cancer application demonstrate the accuracy and stability of the proposed approaches.

Key Words: group spline-penalty; group smooth-penalty; variable selection; time-dependent variables; high-dimensional data

94 Simulated Distribution Based Learning for Non-regular and Regular Statistical Inferences

Bingyan Wang1, Zhengjun Zhang2

1Peking University, China 2University of Wisconsin, USA E-mail: [email protected]

Abstract: Statistical research involves drawing inference about unknown quantities (e.g., parameters) in the presence of randomness, in which distribution assumptions on random variables (e.g., error terms in regression analysis) play a central role. However, the fundamental issue of preserving these distribution assumptions has been more or less ignored by many inference methods and applications. As a result, further inference on the studied problems, and related decisions based on the estimated parameter values, may be inferior. This paper proposes a continuous distribution preserving estimation approach for various kinds of non-regular and regular statistical studies. The paper establishes a fundamental theorem which guarantees that the transformed order statistics (to a given marginal) from the assumed distribution of a random variable (or an error term) are arbitrarily close to the order statistics of a simulated sequence with the same marginal distribution. Unlike the Kolmogorov-Smirnov test, which is based on absolute errors between the empirical distribution and the assumed distribution, the statistics proposed in the paper are based on relative errors of the transformed order statistics to the simulated ones. Using the constructed statistic (or the pivotal quantity in estimation) as a measure of the relative distance between two ordered samples, we estimate the parameters so that this distance is minimized. Unlike many existing methods, e.g., maximum likelihood estimation, which rely on regularity conditions and/or an explicit form of the probability density function, the new method only assumes the mild condition that the cumulative distribution function can be approximated to a satisfactory precision. Simulation examples illustrate the method's superior performance. In linear regression settings, the proposed estimation performs exceptionally well in keeping the error terms (i.e., the residuals) normally distributed, a fundamental assumption in linear regression theory and applications.

Key Words: estimation; extreme value theory; inverse approximation of distributions; relative errors; simulation.

95 Estimation and inference for the indirect effect in high-dimensional linear mediation models

Ruixuan Zhou1, Liewei Wang2, Sihai Dave Zhao1,*

1Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA 2Division of Clinical Pharmacology, Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA E-mail: [email protected]

Abstract: Mediation analysis is difficult when the number of potential mediators is larger than the sample size. We propose new inference procedures for the indirect effect in the presence of high-dimensional mediators for linear mediation models. We develop methods both for incomplete mediation, where a direct effect may exist, and for complete mediation, where the direct effect is known to be absent. We prove consistency and asymptotic normality of our indirect effect estimators. Under complete mediation, where the indirect effect is equivalent to the total effect, we further prove that our approach gives a more powerful test compared to directly testing for the total effect. We apply our method to an integrative analysis of gene expression and genotype data from a pharmacogenomic study of drug response. We present a novel analysis of gene sets to understand the molecular mechanisms of drug response, and also identify a genome-wide significant noncoding genetic variant that cannot be detected using standard analysis methods.

Key Words: High-dimensional Inference; Integrative Genomics; Mediation Analysis

96 Factor Modeling for Volatility

Xinghua Zheng1, Yingying Li2, Rob Engle3, Yi Ding4

1Hong Kong University of Science and Technology, HKSAR 3 New York University, USA E-mail: [email protected]

Abstract: This talk consists of two parts. In the first part, under a high-frequency and high-dimensional setup, we establish a framework to estimate the factor structure in idiosyncratic volatility. We show that the factor structure can be consistently estimated by conducting principal component analysis on the idiosyncratic realized volatilities. Empirically, we confirm and identify the factor structure in idiosyncratic volatilities of S&P 500 Index constituents. In the second part, motivated by strong empirical evidence, a single-factor volatility model is proposed. Empirical examination of the model reveals that the simple model explains the co-movement feature of volatilities well, and leads to substantial gains in volatility forecasting.

Key Words: Volatility; Factor model; High-frequency data; principal component analysis

97 Sequential scaled sparse factor regression

Zemin Zheng1

1 University of Science and Technology of China, China E-mail: [email protected]

Abstract: TBA

98 Estimating Endogenous Treatment Effect Using High-Dimensional Instruments with an Application to the Olympic Effect

Wei Zhong1,*, Wei Zhou2, Qingliang Fan2, Yang Gao2

1,2 Xiamen University, China E-mail: [email protected]

Abstract: Endogenous treatments are commonly encountered in program evaluations using observational data where the selection-on-observables assumption does not hold. In this paper, we develop a two-stage approach to estimate endogenous treatment effects using high-dimensional instrumental variables. In the first stage, instead of using a linear reduced form regression in the conventional two-stage least squares (TSLS) approach, we propose a new high-dimensional logistic reduced form model with the SCAD penalty to approximate the optimal instrument. In the second stage, we replace the original treatment variable by its estimated propensity score and run a least squares regression to obtain the penalized Logistic-regression Instrumental Variables Estimator (LIVE). We show that the proposed LIVE is root-n consistent for the true average treatment effect, asymptotically normal, and achieves the semiparametric efficiency bound. Monte Carlo simulations demonstrate that the LIVE outperforms the traditional TSLS estimator and the post-Lasso estimator for endogenous treatment effects. Moreover, in the empirical study, we investigate whether the Olympic Games could facilitate the host nation's economic growth using data from 163 countries. The proposed LIVE estimator shows a strong Olympic effect on the host nation's economic growth.
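The two-stage idea above can be illustrated on simulated data. The sketch below is illustrative only, not the authors' code: the simulated design, the ridge penalty (standing in for the SCAD penalty of the paper), and all variable names are assumptions, and the second stage is written in its instrumental-variable ratio form.

```python
import numpy as np

# Toy two-stage LIVE sketch: stage 1 fits a penalized logistic reduced form for
# the binary endogenous treatment; stage 2 uses the fitted propensity score as
# the (approximately optimal) instrument. Ridge stands in for SCAD here.
rng = np.random.default_rng(1)
n, p = 2000, 50
Z = rng.standard_normal((n, p))                  # high-dimensional instruments
u = rng.standard_normal(n)                       # unobserved confounder
d = (Z[:, 0] - Z[:, 1] + u + rng.standard_normal(n) > 0).astype(float)
beta = 1.5                                       # true treatment effect
y = beta * d + u + rng.standard_normal(n)        # u makes naive OLS biased

# Stage 1: ridge-penalized logistic regression of d on Z via Newton iterations
X = np.hstack([np.ones((n, 1)), Z])
w, lam = np.zeros(p + 1), 1.0
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (mu - d) + lam * w
    H = (X * (mu * (1 - mu))[:, None]).T @ X + lam * np.eye(p + 1)
    w -= np.linalg.solve(H, grad)
phat = 1.0 / (1.0 + np.exp(-(X @ w)))            # estimated propensity score

# Stage 2: use phat in place of / as instrument for the treatment d
beta_live = np.cov(phat, y)[0, 1] / np.cov(phat, d)[0, 1]
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)  # naive, biased benchmark
```

On this design the confounder u pushes the naive OLS coefficient well above the true effect, while beta_live, which depends on the treatment only through functions of Z, should land near 1.5.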

Key Words: Endogenous treatment effect; high dimensionality; instrumental variable; logistic regression; variable selection

99 Approximation Theory of Deep Convolutional Neural Networks

Ding-Xuan Zhou1

1 School of Data Science, City University of Hong Kong, HKSAR E-mail: [email protected]

Abstract: Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But a theoretical foundation is lacking for understanding the approximation or generalization ability of deep learning models with network architectures such as deep convolutional neural networks (CNNs). The convolutional architecture gives essential differences between deep CNNs and fully-connected deep neural networks, and the classical approximation theory of fully-connected networks developed around 30 years ago does not apply. This talk describes an approximation theory of deep CNNs. In particular, we show the universality of a deep CNN, meaning that it can approximate any continuous function to arbitrary accuracy when the depth of the network is large enough. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with high-dimensional data. Some related distributed learning algorithms will also be discussed.

100 Global Convergence of EM

Harrison H. Zhou1,*, Yihong Wu2

1Yale University, New Haven, CT, USA 2Yale University, New Haven, CT, USA E-mail: [email protected]

Abstract: For Gaussian mixtures with two symmetric components, we show the global convergence of EM for a random initialization without any separation condition.
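This setting is concrete enough to simulate. The sketch below is my own illustration, not the authors' code: for the mixture 0.5*N(theta, 1) + 0.5*N(-theta, 1), the EM iteration collapses to the closed-form update theta <- mean(x * tanh(theta * x)), and a random starting point is used as in the talk. The sample size and true parameter are arbitrary choices.

```python
import numpy as np

# EM for the two-component symmetric Gaussian mixture 0.5*N(t,1) + 0.5*N(-t,1).
# E-step: posterior weight of the "+" component is (1 + tanh(theta*x)) / 2.
# M-step: theta = mean(x * (2w - 1)), so the update is mean(x * tanh(theta*x)).
rng = np.random.default_rng(0)
theta_true, n = 2.0, 5000
x = rng.choice([-1.0, 1.0], size=n) * theta_true + rng.standard_normal(n)

theta = rng.standard_normal()        # random initialization
for _ in range(200):
    theta = np.mean(x * np.tanh(theta * x))
```

By the sign symmetry of the mixture, the iterates may converge to either +theta_true or -theta_true; both label the same distribution, so only |theta| is identified.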

Key Words: EM; Global Convergence; Gaussian Mixtures

101 R package for new normality test

Maoyuan Zhou1,*

1 Civil Aviation University of China, 300300, China E-mail: [email protected]

Abstract: This paper studies a new normality test proposed by Jin Zhang (2005), implements the method in R, and packages it as a practical, complete R package, NTest, whose three main functions can be used for normality testing. In comparison tests against the normality test functions commonly used in R, with results shown in figures and rank tables, the new test method exhibits visibly higher power than the common normality tests. Further, this research programs several R functions based on the above test statistics for two-sample and multi-sample homogeneity tests (two samples: hoza.test, hozc.test and hozk.test; multiple samples: khomoza.test, khomozc.test, khomozk.test). In addition, a comparative experiment benchmarked their power against the homogeneity test functions commonly used in R. In the two-sample case, the power of the new method is higher than that of the K-S test; in the multi-sample case, the new method has higher power for the global homogeneity test without assuming equal means or equal variances.

Key Words: tests for normality; R Language; nonparametric test; homogeneity test

102 GD-RDA: A new regularized discriminant analysis for high dimensional data

Yan Zhou1

1 Shenzhen University, China E-mail: [email protected]

Abstract: TBA

103 Matrix Completion for Network Analysis

Ji Zhu1,*

1,*Department of Statistics, University of Michigan, Ann Arbor, MI, 48105, USA E-mail: [email protected]

Abstract: Matrix completion is an active area of research in itself, and a natural tool to apply to network data, since many real networks are observed incompletely and/or with noise. However, developing matrix completion algorithms for networks requires taking into account the network structure. This talk will discuss two examples of matrix completion used for network tasks. First, we discuss the use of matrix completion for cross-validation or non-parametric bootstrap on network data, a longstanding problem in network analysis. The second example focuses on reconstructing incompletely observed networks, with structured missingness resulting from the egocentric sampling mechanism, where a set of nodes is selected first and then their connections to the entire network are observed. We show that matrix completion can generally be very helpful in solving network problems, as long as the network structure is taken into account. This talk is based on joint work with Elizaveta Levina, Tianxi Li and Yun-Jhong Wu.

Key Words: cross-validation; egocentric network; link prediction; matrix completion; network analysis

104 Quantile double autoregression

Qianqian Zhu1, Guodong Li2,*

1 Shanghai University of Finance & Economics, 777 Guoding Rd., Shanghai, 200433, P.R.China 2 University of Hong Kong, Pokfulam Road, Hong Kong, P.R.China E-mail: [email protected]

Abstract: Many financial time series have varying structures at different quantile levels and also exhibit conditional heteroscedasticity. Meanwhile, the literature still lacks a time series model that accommodates both features simultaneously. This paper fills the gap by proposing a novel conditional heteroscedastic model, the quantile double autoregression. The strict stationarity of the new model is derived, and a self-weighted conditional quantile estimation is suggested. Two promising properties of the original double autoregressive model are shown to be preserved. Based on the quantile autocorrelation function and the self-weighting concept, two portmanteau tests are constructed; they can be used in conjunction to check the adequacy of fitted conditional quantiles. The finite-sample performance of the proposed inference tools is examined by simulation studies, and the necessity of the new model is further demonstrated by analyzing the S&P 500 Index.

Key Words: Autoregressive time series model; Conditional heteroscedasticity; Portmanteau test; Quantile model; Strict stationarity.

105 A Boosting Algorithm for Estimating Generalized Propensity Scores with Continuous Treatments

Yeying Zhu1,*, Donna Coffman2, Debashis Ghosh3

1University of Waterloo, Waterloo, Canada 2Temple University, Philadelphia, USA 3Colorado School of Public Health, Aurora, USA E-mail: [email protected]

Abstract: In this article, we study the causal inference problem with a continuous treatment variable using propensity score-based methods. For a continuous treatment, the generalized propensity score is defined as the conditional density of the treatment-level given covariates (confounders). The dose-response function is then estimated by inverse probability weighting, where the weights are calculated from the estimated propensity scores. When the dimension of the covariates is large, the traditional nonparametric density estimation suffers from the curse of dimensionality. Some researchers have suggested a two-step estimation procedure by first modeling the mean function. In this study, we suggest a boosting algorithm to estimate the mean function of the treatment given covariates. In boosting, an important tuning parameter is the number of trees to be generated, which essentially determines the trade-off between bias and variance of the causal estimator. We propose a criterion called average absolute correlation coefficient (AACC) to determine the optimal number of trees. Simulation results show that the proposed approach performs better than a simple linear approximation or L2 boosting. The proposed methodology is also illustrated through the Early Dieting in Girls study, which examines the influence of mothers' overall weight concern on daughters' dieting behavior.
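The pipeline described above (boost the treatment mean, form a Gaussian generalized propensity score, then build inverse probability weights) can be sketched with plain NumPy. This is a simplified stand-in, not the authors' implementation: it uses hand-rolled L2 boosting with stumps, a fixed number of trees instead of the paper's AACC criterion, and a made-up simulated design.

```python
import numpy as np

# Sketch: boost the mean of a continuous treatment t given covariates X with
# depth-1 trees (stumps), then form a Gaussian generalized propensity score
# and stabilized inverse probability weights for dose-response estimation.
rng = np.random.default_rng(2)
n, p = 1500, 5
X = rng.standard_normal((n, p))
m_true = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2    # nonlinear treatment mean
t = m_true + rng.standard_normal(n)

def best_stump(x, r):
    """Best single-split fit (sse, cut, left mean, right mean) of residual r."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for s in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        left = x <= s
        cl, cr = r[left].mean(), r[~left].mean()
        sse = ((r[left] - cl) ** 2).sum() + ((r[~left] - cr) ** 2).sum()
        if sse < best[0]:
            best = (sse, s, cl, cr)
    return best

# L2 boosting; the number of trees is the bias-variance tuning parameter that
# the paper's AACC criterion would select (fixed here for simplicity).
pred, nu, n_trees = np.zeros(n), 0.1, 200
for _ in range(n_trees):
    r = t - pred
    (sse, s, cl, cr), j = min(((best_stump(X[:, j], r), j) for j in range(p)),
                              key=lambda c: c[0][0])
    pred += nu * np.where(X[:, j] <= s, cl, cr)

# Generalized propensity score: Gaussian density of t around the boosted mean
sig = (t - pred).std()
gps = np.exp(-0.5 * ((t - pred) / sig) ** 2) / (np.sqrt(2 * np.pi) * sig)
marg = np.exp(-0.5 * ((t - t.mean()) / t.std()) ** 2) / (np.sqrt(2 * np.pi) * t.std())
w_ipw = marg / gps                   # stabilized weights for dose-response IPW
```

The boosted fit should track the nonlinear mean far better than a constant (or simple linear) approximation, which is exactly what drives the quality of the resulting weights.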

Key Words: boosting; distance correlation; dose-response function; generalized propensity score

106 Safe machine learning for safe genome editing

James Zou1

1 Stanford University, USA E-mail: [email protected]

Abstract: We analyze CRISPR Cas9 repair outcomes in primary human cells to systematically evaluate DNA repair patterns and investigate on-target DNA damage. Leveraging a large new dataset, we develop a novel machine learning model, CRISPR Repair OUTcome (SPROUT), that accurately predicts the length, probability, and sequence of nucleotide insertions and deletions. SPROUT facilitates optimizing genome editing in therapeutically-important primary human cells.

107 List of Participants

Name | Affiliation | E-mail
Le BAO | Peking University | [email protected]
Tony CAI | University of Pennsylvania | [email protected]
Tianxi CAI | Harvard University | [email protected]
Jian CAO | Shanghai Jiaotong University | [email protected]
Jinyuan CHANG | Southwestern University of Finance and Economics | [email protected]
Yong CHEN | University of Pennsylvania | [email protected]
Xin CHEN | Southern University of Science and Technology | [email protected]
Di-Rong CHEN | Beihang University | [email protected]
Guanghui CHENG | Guangzhou University | [email protected]
Linlin DAI | Southwestern University of Finance and Economics | [email protected]
Lilun DU | Hong Kong University of Science and Technology | [email protected]
Kun FAN | East China Normal University | [email protected]
Xiaodan FAN | The Chinese University of Hong Kong | [email protected]
Bo FU | Fudan University | [email protected]
Tianyu GUAN | Simon Fraser University | [email protected]
Xiao GUO | University of Science and Technology of China | [email protected]
Feng GUO | Virginia Polytechnic Institute and State University | [email protected]
Zijian GUO | Rutgers University | [email protected]
Zhi HAN | Shenyang Institute of Automation, Chinese Academy of Sciences | [email protected]
Ning HAO | University of Arizona | [email protected]
Haijin HE | Shenzhen University | [email protected]
Jie HU | Xiamen University | [email protected]
Yen-Tsung HUANG | Academia Sinica | [email protected]
Ling HUANG | AHI Fintech | [email protected]
Fei HUANG | dtwave | [email protected]
Huilin HUANG | Wenzhou University | [email protected]
Jongho IM | Yonsei University | [email protected]
Jiancheng JIANG | Nankai University, University of North Carolina at Charlotte | [email protected]
Xuejun JIANG | Southern University of Science and Technology | [email protected]
Dandan JIANG | Xi'an Jiaotong University | [email protected]
Michael I. JORDAN | University of California at Berkeley | [email protected]
Donggyu KIM | KAIST | [email protected]

Xinbing KONG | Nanjing Audit University | [email protected]
Hongzhe LEE | University of Pennsylvania | [email protected]
Yingying LI | Hong Kong University of Science and Technology | [email protected]
Yunxian LI | Yunnan University of Finance and Economics | [email protected]
Lexin LI | University of California at Berkeley | [email protected]
Guodong LI | The University of Hong Kong | [email protected]
Zhigang LI | University of Florida | [email protected]
Deli LI | Lakehead University | [email protected]
Song LI | Zhejiang University | [email protected]
Tengyuan LIANG | University of Chicago | [email protected]
Nan LIN | Washington University in St. Louis | [email protected]
Yuanyuan LIN | The Chinese University of Hong Kong | [email protected]
Zhixiang LIN | The Chinese University of Hong Kong | [email protected]
Shaobo LIN | Wenzhou University | [email protected]
Chengxiu LING | Xi'an Jiaotong-Liverpool University | [email protected]
Cheng LIU | Wuhan University | [email protected]
Xin LIU | Shanghai University of Finance and Economics | [email protected]
Xu LIU | Shanghai University of Finance and Economics | [email protected]
Catherine LIU | The Hong Kong Polytechnic University | [email protected]
Xuanzhe LIU | Peking University | [email protected]
Baisen LIU | Southwestern University of Finance and Economics | [email protected]
Xi LIU | Synfuels China | [email protected]
Weidong LIU | Shanghai Jiao Tong University | [email protected]
Qi LONG | University of Pennsylvania | [email protected]
Zhiping LU | East China Normal University | [email protected]
Shikai LUO | Didi Chuxing | [email protected]
Xiangyu LUO | Renmin University of China | [email protected]
Yanyuan MA | Pennsylvania State University | [email protected]
Xiaojun MAO | Fudan University | [email protected]
Song MEI | Stanford University | [email protected]
Yue NIU | University of Arizona | [email protected]
Xiaoyue NIU | Pennsylvania State University | [email protected]
Yinghao PAN | University of North Carolina at Charlotte | [email protected]
Andre PYTHON | University of Oxford | [email protected]
Lixuan QIN | Memorial Sloan Kettering Cancer Center | [email protected]
Zhao REN | University of Pittsburgh | [email protected]

Qi-man SHAO | Southern University of Science and Technology | [email protected]
Wei SHI | Jinan University | [email protected]
Xu SHI | Harvard University | [email protected]
Zhijie SONG | Hangzhou Chenhao Company | [email protected]
Xinyuan SONG | Chinese University of Hong Kong | [email protected]
Will Wei SUN | University of Miami | [email protected]
Tiejun TONG | The Hong Kong Baptist University | [email protected]
Miaoyan WANG | University of Wisconsin-Madison | [email protected]
Man WANG | Donghua University | [email protected]
Cheng WANG | Shanghai Jiao Tong University | [email protected]
Gui WANG | Zhejiang University City College | [email protected]
Tao WANG | Shanghai Jiao Tong University | [email protected]
Tianying WANG | Columbia University | [email protected]
Xuan WANG | Zhejiang University | [email protected]
Yao WANG | Xi'an Jiaotong University | [email protected]
Shaoli WANG | Shanghai University of Finance and Economics | [email protected]
Yang WANG | Hong Kong University of Science and Technology | [email protected]
Qinwen WANG | Fudan University | [email protected]
Yazhen WANG | University of Wisconsin-Madison | [email protected]
Yingying WEI | The Chinese University of Hong Kong | [email protected]
Hu WEI | Yingying Group, Inc. | [email protected]
Honglei WEN | Wenzhou University | [email protected]
Yihong WU | Yale University | [email protected]
Zongqi XIA | University of Pittsburgh | [email protected]
Yin XIA | Fudan University | [email protected]
Han XIAO | Rutgers University | [email protected]
Jichun XIE | Duke University | [email protected]
Mengyu XU | University of Central Florida | [email protected]
Huaping XU | Zhejiang Chinese Medical University | [email protected]
Kaijie XUE | Nankai University | [email protected]
Yuan YAO | Hong Kong University of Science and Technology | [email protected]
Kaixian YU | Didi Chuxing | [email protected]
Zhou YU | East China Normal University | [email protected]
Sheng YU | Tsinghua University | [email protected]
Ming YUAN | Columbia University | [email protected]
Jinshan ZENG | Jiangxi Normal University | [email protected]
Nan ZHANG | East China Normal University | [email protected]

Anru ZHANG | University of Wisconsin-Madison | [email protected]
Zhengjun ZHANG | University of Wisconsin-Madison | [email protected]
Caiya ZHANG | Zhejiang University City College | [email protected]
Chi ZHANG | Infervision Advanced Research | [email protected]
Chuanhai ZHANG | Zhongnan University of Economics and Law | [email protected]
Cun-Hui ZHANG | Rutgers University | [email protected]
Xiaochen ZHANG | Xiamen University | [email protected]
Jia ZHANG | Southwestern University of Finance and Economics | [email protected]
Jun ZHANG | Alibaba Group | [email protected]
Feida ZHANG | University of Queensland | [email protected]
Heping ZHANG | Yale University | [email protected]
Sihai Dave ZHAO | University of Illinois at Urbana-Champaign | [email protected]
Xinghua ZHENG | Hong Kong University of Science and Technology | [email protected]
Zemin ZHENG | University of Science and Technology of China | [email protected]
Wei ZHONG | Xiamen University | [email protected]
Harrison ZHOU | Yale University | [email protected]
Maoyuan ZHOU | Xiamen University | [email protected]
Yan ZHOU | Shenzhen University | [email protected]
Ding-Xuan ZHOU | School of Data Science & Department of Mathematics, City University of Hong Kong | [email protected]
Qianqian ZHU | Shanghai University of Finance and Economics | [email protected]
Ji ZHU | University of Michigan | [email protected]
Yeying ZHU | University of Waterloo, Canada | [email protected]
James ZOU | Stanford University | [email protected]

111 List of Participants (ZJU)

Name | E-mail
Younes BOKTAYA | [email protected]
Xi CHEN | [email protected]
Lei CHEN | [email protected]
Silu CHEN | [email protected]
Yang CHEN | [email protected]
Lang CHENG | [email protected]
Shunjie DONG | [email protected]
Zhetong DONG | [email protected]
Lingjie DU | [email protected]
Xuansu FANG | [email protected]
Chao FENG | [email protected]
Yongchang FU | [email protected]
Mingyang GONG | [email protected]
Tao GONG | [email protected]
Ming GUO | [email protected]
Xu HE | [email protected]
Yongxing HE | [email protected]
Chuanfeng HU | [email protected]
Huan HUANG | [email protected]
Meixiang HUANG | [email protected]
Shenwei HUANG | [email protected]
Yi'an HUANG | [email protected]
Hangjin JIANG | [email protected]
Yuliang JIANG | [email protected]
Xiaoying JIANG | [email protected]
Meifang LAN | [email protected]
Yixin LI | [email protected]
Jiaqi LI | [email protected]
Shuangbo LI | [email protected]
Wei LI | [email protected]
Yangang LI | [email protected]

Yuning LI | [email protected]
Zejian LI | [email protected]
Huiping LI | [email protected]
Junhong LIN | [email protected]
Zhengyan LIN | [email protected]
Xin LIN | [email protected]
Peilin LIU | [email protected]
Rong LIU | [email protected]
Weiming LIU | [email protected]
Zhunzhun LIU | [email protected]
Wei LUO | [email protected]
Tianyu MA | [email protected]
Huiling MAO | [email protected]
Xiaoye MIAO | [email protected]
Jingjing PAN | [email protected]
Tianxiao PANG | [email protected]
Xinyue QIAN | [email protected]
Qinghua RAN | [email protected]
Jingwen REN | [email protected]
Wuyue SHANGGUAN | [email protected]
Kaili SONG | [email protected]
Zhonggen SU | [email protected]
Zuoqi TANG | [email protected]
Chen TIAN | [email protected]
Haochuan WANG | [email protected]
Tiantian WANG | [email protected]
Xiaoyu WANG | [email protected]
Jia WEI | [email protected]
Jiwei WEN | [email protected]
Jun WEN | [email protected]
Xiaoyu WU | [email protected]
Daiqing XI | [email protected]
Junpeng XIA | [email protected]

Yu XIA | [email protected]
Renjun XU | [email protected]
Hang XU | [email protected]
Chenkai XU | [email protected]
Jiapan XU | [email protected]
Guan'ao YAN | [email protected]
Qing YANG | [email protected]
Mengting YAO | [email protected]
Jiangsheng YI | [email protected]
Jianwei YIN | [email protected]
Mufang YING | [email protected]
Yan YU | [email protected]
Lixin ZHANG | [email protected]
Rongmao ZHANG | [email protected]
Guangyi ZHANG | [email protected]
Hao ZHANG | [email protected]
Hongxin ZHANG | [email protected]
Yu ZHANG | [email protected]
Jiaming ZHU | [email protected]
Luxi ZOU | [email protected]

114 Maps
