<<

Tilburg University

Multilevel modeling for data streams with dependent observations Ippel, L.

Publication date: 2017

Document Version Publisher's PDF, also known as Version of record

Link to publication in Tilburg University Research Portal

Citation for published version (APA): Ippel, L. (2017). Multilevel modeling for data streams with dependent observations. [s.n.].

General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 25. sep. 2021

Multilevel Modeling for Data Streams with Dependent Observations

Lianne Ippel Tilburg University Multilevel Modeling for Data Streams with Dependent Observations

PROEFSCHRIFT

ter verkrijging van de graad van doctor aan Tilburg University op gezag van de rector magnificus, prof.dr. E.H.L. Aarts, in het openbaar te verdedigen ten overstaan van een door het college voor promoties aangewezen commissie in de aula van de Universiteit

op vrijdag 13 oktober 2017 om 10.00 uur

door

c 2017 L. Ippel All Rights Reserved. Gijsberdina Janna Elisabeth Ippel ⃝ geboren te Werkendam Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without written per- mission of the author. Printing was financially supported by Tilburg University.

ISBN: 978-94-6295-757-2 Printed by: Proefschriftmaken || Vianen Cover design: Faboosh design & art Multilevel Modeling for Data Streams with Dependent Observations

PROEFSCHRIFT

ter verkrijging van de graad van doctor aan Tilburg University op gezag van de rector magnificus, prof.dr. E.H.L. Aarts, in het openbaar te verdedigen ten overstaan van een door het college voor promoties aangewezen commissie in de aula van de Universiteit

op vrijdag 13 oktober 2017 om 10.00 uur

door

c 2017 L. Ippel All Rights Reserved. Gijsberdina Janna Elisabeth Ippel ⃝ geboren te Werkendam Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without written per- mission of the author. Printing was financially supported by Tilburg University.

ISBN: 978-94-6295-757-2 Printed by: Proefschriftmaken || Vianen Cover design: Faboosh design & art Promotor: prof.dr. J. K. Vermunt Copromotor: prof.dr. M.C. Kaptein

Overige leden van de Promotiecommissie: prof.dr. G.J.P. van Breukelen Preface prof.dr. M.E. Timmerman dr. M. Postma dr. M.A. Croon One of my early-childhood memories comes from second grade at primary school. I am standing at the desk of my teacher, a five-year old and a bit too witty, asking my teacher when I would finally learn how to write and how to do math. Done with playing with blocks and dolls, I wanted to learn more! However, I had to wait one more year before I could start writing and calculating. The eagerness to broaden my skills and deepen my knowledge has never left me. Years later, while finishing my Bachelor’s degree in Sociology, I decided to develop myself even more and I applied for the research master at the faculty of Social and Behavioral Sciences. I think it was not more than a month in the program, when Guy Moors ap- proached me. He asked me which topic I wanted to study during my PhD project. Honored, and admittedly a little stressed out because I didn’t feel like I had proven myself to be worthy of this position yet, we discussed several topics. Later in the program, I got the opportunity to work with Maurits Kaptein on my Master’s The- sis. After the research master, he became my PhD supervisor in the following four years. The book you are holding right now is the result of four years work. When I started this project, I never thought I was able to write the code, do the math, or have the writing skills to do this. Obviously, I have not accomplished the work on my own, but you will read more about that at the end of this book (Dankwoord).

Lianne Ippel May, 2017 Promotor: prof.dr. J. K. Vermunt Copromotor: prof.dr. M.C. Kaptein

Overige leden van de Promotiecommissie: prof.dr. G.J.P. van Breukelen Preface prof.dr. M.E. Timmerman dr. M. Postma dr. M.A. Croon One of my early-childhood memories comes from second grade at primary school. I am standing at the desk of my teacher, a five-year old and a bit too witty, asking my teacher when I would finally learn how to write and how to do math. Done with playing with blocks and dolls, I wanted to learn more! However, I had to wait one more year before I could start writing and calculating. The eagerness to broaden my skills and deepen my knowledge has never left me. Years later, while finishing my Bachelor’s degree in Sociology, I decided to develop myself even more and I applied for the research master at the faculty of Social and Behavioral Sciences. I think it was not more than a month in the program, when Guy Moors ap- proached me. He asked me which topic I wanted to study during my PhD project. Honored, and admittedly a little stressed out because I didn’t feel like I had proven myself to be worthy of this position yet, we discussed several topics. Later in the program, I got the opportunity to work with Maurits Kaptein on my Master’s The- sis. After the research master, he became my PhD supervisor in the following four years. The book you are holding right now is the result of four years work. When I started this project, I never thought I was able to write the code, do the math, or have the writing skills to do this. Obviously, I have not accomplished the work on my own, but you will read more about that at the end of this book (Dankwoord).

Lianne Ippel May, 2017 vii

Contents

Preface v

1 Introduction 1 1.1 The era of data streams ...... 1 1.2 Outline ...... 2 1.3 Contributions to the literature ...... 7

2 Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial. 9 2.1 Introduction ...... 10 2.2 Dealing with Big Data: the options ...... 12 2.3 From Conventional Analysis to Online Analysis ...... 14 2.3.1 Sample mean ...... 14 2.3.2 Sample variance ...... 15 2.3.3 Sample covariance ...... 16 2.3.4 ...... 17 Computation time of linear regression ...... 18 2.3.5 η2 (ANOVA) ...... 18 2.4 Online Estimation using Stochastic Gradient Descent ...... 21 2.4.1 Offline Gradient Descent ...... 21 2.4.2 Online or Stochastic Gradient Descent ...... 23 2.4.3 : an Example of the Usage of SGD ...... 24 2.5 Online learning in practice: logistic regression in a data stream ..... 25 2.5.1 Switching to a safe well ...... 25 2.5.2 Results ...... 26 2.5.3 Learn rates ...... 26 2.5.4 Starting values ...... 27 2.6 Considerations analyzing Big Data and Data Streams ...... 28 2.7 Discussion ...... 30 Appendix 2.A Online Correlation ...... 31 Appendix 2.B Online linear regression ...... 32 Appendix 2.C Stochastic Gradient Decent – Logistic regression ...... 33 Appendix 2.D Wells data example ...... 34 vii

Contents

Preface v

1 Introduction 1 1.1 The era of data streams ...... 1 1.2 Outline ...... 2 1.3 Contributions to the literature ...... 7

2 Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial. 9 2.1 Introduction ...... 10 2.2 Dealing with Big Data: the options ...... 12 2.3 From Conventional Analysis to Online Analysis ...... 14 2.3.1 Sample mean ...... 14 2.3.2 Sample variance ...... 15 2.3.3 Sample covariance ...... 16 2.3.4 Linear regression ...... 17 Computation time of linear regression ...... 18 2.3.5 Effect size η2 (ANOVA) ...... 18 2.4 Online Estimation using Stochastic Gradient Descent ...... 21 2.4.1 Offline Gradient Descent ...... 21 2.4.2 Online or Stochastic Gradient Descent ...... 23 2.4.3 Logistic regression: an Example of the Usage of SGD ...... 24 2.5 Online learning in practice: logistic regression in a data stream ..... 25 2.5.1 Switching to a safe well ...... 25 2.5.2 Results ...... 26 2.5.3 Learn rates ...... 26 2.5.4 Starting values ...... 27 2.6 Considerations analyzing Big Data and Data Streams ...... 28 2.7 Discussion ...... 30 Appendix 2.A Online Correlation ...... 31 Appendix 2.B Online linear regression ...... 32 Appendix 2.C Stochastic Gradient Decent – Logistic regression ...... 33 Appendix 2.D Wells data example ...... 34 viii ix

3 Online Estimation of Individual-Level Effects using Streaming Shrinkage 5.3.1 The online E-step ...... 91 Factors 35 5.3.2 The online M-step ...... 94 3.1 Introduction ...... 36 5.4 Simulation study ...... 95 3.2 Estimation of shrinkage factors ...... 39 5.4.1 Design ...... 95 3.2.1 The James Stein estimator ...... 39 5.4.2 Results ...... 97 3.2.2 Approximate Maximum likelihood estimator ...... 42 5.5 SEMA in action: predicting weight fluctuations...... 99 3.2.3 The Beta Binomial estimator ...... 43 5.6 Discussion ...... 106 3.2.4 The Heuristic estimator ...... 45 3.3 Predicting individual-level effects: when is the right time? ...... 45 6 Discussion 109 3.4 Simulation Study ...... 47 6.1 Overview ...... 109 6.2 Related approaches to analyze data streams ...... 109 3.4.1 Design ...... 47 3.4.2 Results ...... 48 6.2.1 Sliding window approach ...... 110 3.5 LISS Panel Study: Predicting Attrition ...... 52 6.2.2 Parallelization ...... 111 3.5.1 Results ...... 54 6.2.3 Bayesian framework ...... 111 3.6 Conclusion and discussion ...... 56 6.3 Data stream challenges ...... 112 6.3.1 Convergence ...... 112 4 Estimating Random-Intercept Models on Data Streams 59 6.3.2 Models used for analyses ...... 113 4.1 Introduction ...... 60 6.3.3 Missingness ...... 114 4.2 From offline to online data analysis ...... 62 6.3.4 Attrition ...... 114 4.3 Online estimation of random-intercept models ...... 63 6.4 Null Hypothesis Significance Testing ...... 115 4.3.1 The random-intercept model and its standard offline estimation 63 6.5 Future research directions for SEMA ...... 115 4.3.2 Online estimation of the random-intercept model ...... 65 4.4 Performance of SEMA evaluated by simulation ...... 68 References 117 4.4.1 Simulation study I: Evaluation of the precision of estimated Summary 127 parameters ...... 68 Design ...... 68 Samenvatting 129 Results ...... 69 4.4.2 Simulation study II: Improving SEMA in low reliability cases . 72 Dankwoord 131 Design ...... 72 Results ...... 75 4.5 An application of SEMA to longitudinal happiness ratings ...... 78 4.6 SEMA characteristics ...... 79 4.6.1 Theoretical considerations ...... 79 4.6.2 Convergence ...... 79 4.7 Extending SEMA ...... 80 4.8 Discussion ...... 82

5 Estimating Multilevel Models on Data Streams 85 5.1 Introduction ...... 86 5.2 Offline estimation of multilevel models ...... 88 5.2.1 The offline E-step ...... 89 5.2.2 The offline M-step ...... 90 5.3 Online estimation of multilevel models ...... 91 viii ix

3 Online Estimation of Individual-Level Effects using Streaming Shrinkage 5.3.1 The online E-step ...... 91 Factors 35 5.3.2 The online M-step ...... 94 3.1 Introduction ...... 36 5.4 Simulation study ...... 95 3.2 Estimation of shrinkage factors ...... 39 5.4.1 Design ...... 95 3.2.1 The James Stein estimator ...... 39 5.4.2 Results ...... 97 3.2.2 Approximate Maximum likelihood estimator ...... 42 5.5 SEMA in action: predicting weight fluctuations...... 99 3.2.3 The Beta Binomial estimator ...... 43 5.6 Discussion ...... 106 3.2.4 The Heuristic estimator ...... 45 3.3 Predicting individual-level effects: when is the right time? ...... 45 6 Discussion 109 3.4 Simulation Study ...... 47 6.1 Overview ...... 109 6.2 Related approaches to analyze data streams ...... 109 3.4.1 Design ...... 47 3.4.2 Results ...... 48 6.2.1 Sliding window approach ...... 110 3.5 LISS Panel Study: Predicting Attrition ...... 52 6.2.2 Parallelization ...... 111 3.5.1 Results ...... 54 6.2.3 Bayesian framework ...... 111 3.6 Conclusion and discussion ...... 56 6.3 Data stream challenges ...... 112 6.3.1 Convergence ...... 112 4 Estimating Random-Intercept Models on Data Streams 59 6.3.2 Models used for analyses ...... 113 4.1 Introduction ...... 60 6.3.3 Missingness ...... 114 4.2 From offline to online data analysis ...... 62 6.3.4 Attrition ...... 114 4.3 Online estimation of random-intercept models ...... 63 6.4 Null Hypothesis Significance Testing ...... 115 4.3.1 The random-intercept model and its standard offline estimation 63 6.5 Future research directions for SEMA ...... 115 4.3.2 Online estimation of the random-intercept model ...... 65 4.4 Performance of SEMA evaluated by simulation ...... 68 References 117 4.4.1 Simulation study I: Evaluation of the precision of estimated Summary 127 parameters ...... 68 Design ...... 68 Samenvatting 129 Results ...... 69 4.4.2 Simulation study II: Improving SEMA in low reliability cases . 72 Dankwoord 131 Design ...... 72 Results ...... 75 4.5 An application of SEMA to longitudinal happiness ratings ...... 78 4.6 SEMA characteristics ...... 79 4.6.1 Theoretical considerations ...... 79 4.6.2 Convergence ...... 79 4.7 Extending SEMA ...... 80 4.8 Discussion ...... 82

5 Estimating Multilevel Models on Data Streams 85 5.1 Introduction ...... 86 5.2 Offline estimation of multilevel models ...... 88 5.2.1 The offline E-step ...... 89 5.2.2 The offline M-step ...... 90 5.3 Online estimation of multilevel models ...... 91 Chapter 1

Introduction

1.1 The era of data streams

In the last decade, technological developments have been rapidly changing our so- ciety. Instead of going out shopping in the city center we now often buy clothes in webshops, and instead of reading a newspaper once a day, we now continuously receive the headlines on our smartphones. While previously, it was often unknown who bought which products because it was difficult to trace individual customers, nowadays webpages can be designed to store all relevant digital transactions. As a result, these technological developments have led to an increase in digital informa- tion, which are collected on a large scale (Al-Jarrah, Yoo, Muhaidat, Karagiannidis, & Taha, 2015). Analyzing the collected digital information might be challenging, because stor- ing all the data requires a large computer memory. Additional to the memory bur- den, the fact that these observations keep streaming in complicates commonly used analyses even further, because the analyses often have to be redone when new ob- servations enter to remain up to date. Situations where new data points are continu- ously entering and thereby augmenting the current data set are commonly referred to as data streams (Gaber, 2012). When the data are arriving over time, it might be necessary to act upon the data while they enter: tailor the webpage to the currently browsing individual, warn patients to take their medication, or give people an extra nudge to respond to the questionnaire. Failing to act in real time might result in the potential customer leav- ing the webpage, because it did not appeal to him, the lack of medication could be deteriorating the patient’s health, or a respondent failing to answer the question- naire in time. These three examples clearly illustrate that in many situations failing to analyze the data in real time makes the analysis rather ineffective. Digital data collection has also influenced the social sciences. Until recently, data were often collected by inviting respondents to fill out paper-and-pencil question- naires. After a period of data collection, the resulting data set would be considered ’finished’ and analyzed. Using modern technological innovations, data are nowa- days commonly collected using web surveys or smartphone applications. Using Chapter 1

Introduction

1.1 The era of data streams

In the last decade, technological developments have been rapidly changing our so- ciety. Instead of going out shopping in the city center we now often buy clothes in webshops, and instead of reading a newspaper once a day, we now continuously receive the headlines on our smartphones. While previously, it was often unknown who bought which products because it was difficult to trace individual customers, nowadays webpages can be designed to store all relevant digital transactions. As a result, these technological developments have led to an increase in digital informa- tion, which are collected on a large scale (Al-Jarrah, Yoo, Muhaidat, Karagiannidis, & Taha, 2015). Analyzing the collected digital information might be challenging, because stor- ing all the data requires a large computer memory. Additional to the memory bur- den, the fact that these observations keep streaming in complicates commonly used analyses even further, because the analyses often have to be redone when new ob- servations enter to remain up to date. Situations where new data points are continu- ously entering and thereby augmenting the current data set are commonly referred to as data streams (Gaber, 2012). When the data are arriving over time, it might be necessary to act upon the data while they enter: tailor the webpage to the currently browsing individual, warn patients to take their medication, or give people an extra nudge to respond to the questionnaire. Failing to act in real time might result in the potential customer leav- ing the webpage, because it did not appeal to him, the lack of medication could be deteriorating the patient’s health, or a respondent failing to answer the question- naire in time. These three examples clearly illustrate that in many situations failing to analyze the data in real time makes the analysis rather ineffective. Digital data collection has also influenced the social sciences. Until recently, data were often collected by inviting respondents to fill out paper-and-pencil question- naires. After a period of data collection, the resulting data set would be considered ’finished’ and analyzed. Using modern technological innovations, data are nowa- days commonly collected using web surveys or smartphone applications. Using 3 ). 1.1 2

learning is an chapter chapter 3 chapter 4 chapter 5 offline model random intercept multilevel ), a more detailed introduction to data 1.1 based model factors shrinkage data data nested non-nested Figure 1.1: Graphical outline of this thesis (the first leaf of Fig. 2 Analyzing data streams Chapter All the methods presented in this chapter share an important assumption, namely In streams and tools tochapter analyze is these mainly on data online learning. streams Itsample are is mean discussed. shown but how also simple more The parameters complex such parameters focusmodel as such can the of as be the this estimated coefficients in of a a data logistic stream using online learning. that the data points are independent.lated in However, this the assumption context is of likelyobserved to repeatedly. data Two be streams, observations vio- of due the to samesimilar the than individual two are fact observations likely of that to two different the be individuals;nested more same hence, within the individuals individuals data and points are are, are as aViolating result of this that assumption nesting, no results longer in independent. chosen more which prediction do error take than the when dependencyfor methods into the account. are dependency However, models between that theThus account data far, most points online are learning muchtions methods more are do complex nested. not to take estimate. In intonesting this account in thesis, that the online data the are learning observa- developed methods and evaluated which (branch do 2: account nested for data, Fig. the data arrive is currently common practice in many social science applications. Chapter 1: Introduction one when a baseball player scoresestimation a procedure point. which uses On all the theobservations other observations when hand, new in data memory enter and to revisits updatecase these of the the result baseball of match an example, analysis. we Inmatch would have an again to extreme and go back count in points timethis over to rewatch example again, the seems every inefficient time and a perhaps new rather odd, point redoing is analyses scored. when new While 2 on- ) of 1.1 and Chapter 3 L. F. Barrett & Chapter 1: Introduction (see e.g., ). Using online learning, an 2013 , in memory, data points do not have to be ) and commonly uses a smartphone appli- ) are introduced. ) and, even though commonly not analyzed Experience sampling 1.1 2009 , 2017 , Witten, Frank, & Hall ; summary 2011a , are published as separate journal articles and Chapter Hamaker & Wichers Trull & Ebner-Priemer Cappé 4 ; ( presents an overview of the structure of this thesis. Note that, Chapter 2001 1.1 , A commonly used approach to analyze data streams is very intuitive. Let’s imag- In this thesis, approaches to analyze data streams in real time are studied and Analyzing data streams in real time is possible when fast prediction methods Besides collecting more data using less resources, these developments have also are submitted for publication. This might have led to some repetition and incon- methods only store some stored in memory. The sum scorethe is sum an of example the of points a scored, summary we statistic: can if update we this know sum score by incrementing it with ine we are at a baseball field,ball and we player want scores to a keep scores point, ofwith the we one. teams. simply When increment a This base- the typeline score learning of of updating the team of whoanalysis the is scored result done without of returning an to analysis previous data is points. referred Because to online learning as Figure and Chapter 5 sistencies in notation across the chapters.to Below, a analyze short data illustration streams of is the given, approach aftereach which of the the topics chapters (the (the ‘branches’ ‘leafs’ of of Fig. Fig. dent observations. These new methods facilitateencountered the in use the of social data sciences. stream applications 1.2 Outline memory capacity have grown substantiallydate over predictions the in last a data decades, stream obtaining istraditional up-to- still methods a challenge. which Due revisit to all thenew influx observations data of to have data entered points, update are the bound to predictions become when too slow to be usefulnew in methods a are data stream. developed for the analysis of data streams consisting of depen- social science ( as such, the method does give rise to a data stream. are available. Especially when datacomputational points power stream to in analyze rapidly, the thestore demand data for all in more the real time data and increases the continuously. memory capacity Even to though computational power and typical behavior or feelings, which respondentsrespondents would are have asked to at random recall intervals from to memory, rent fill out feelings. some questions about This their technique cur- Barrett is called cation that gives a signal atquestionnaire. random Experience sampling intervals has to become a alert common the method respondent to collect to data answer in the 2 these digital approaches, itfrom has many individuals become at easier, the same cheaper, time and and to faster monitor these to individuals collect over time. created data new opportunities to study individuals’ behavior. Instead of asking for their

Chapter 1 h u ftepit crd ecnudt hssmsoeb nrmnigi with it incrementing by score sum know this we update if can statistic: we summary scored, a points of the example of an sum is the score sum The memory. in as stored learning online to Because referred some points. is store data only previous analysis methods to an returning of without done result scored is the analysis who of team the updating of of learning score line type the base- This a increment When simply teams. one. we the with of point, scores keep a to scores want player we and ball field, baseball a at are we ine Fig. Fig. of of ‘leafs’ ‘branches’ (the (the chapters topics the the of which each after approach given, the is of streams illustration data short analyze a Below, to chapters. the across notation in sistencies 5 Chapter and Figure Outline 1.2 applications depen- stream sciences. of data social of consisting the use in streams the encountered facilitate data methods of new analysis These the observations. dent for developed stream. data are a methods in new useful be to slow too when become predictions to bound the are update points, entered data have to of data observations influx new the all to revisit Due which challenge. a methods still up-to- traditional is obtaining stream decades, data and a last power in the predictions computational over date though substantially to grown Even capacity have memory capacity continuously. the memory increases and data time real the more in all for data demand store the the rapidly, analyze in to stream power points computational data when Especially available. are stream. data a to rise give does method the such, as the in answer data to collect ( to science respondent method social the common alert a become to has intervals sampling Experience random questionnaire. at signal a gives that cation called is Barrett cur- technique their This about questions some feelings. out fill rent memory, to from intervals recall random at to asked have are would respondents respondents which their feelings, for or asking of behavior Instead typical behavior. individuals’ study to opportunities new data created time. over collect individuals to these monitor faster to and and time cheaper, same the easier, at become individuals many has from it approaches, digital these 2 r umte o ulcto.Ti ih aeldt oerptto n incon- and repetition some to led have might This publication. for submitted are nlzn aasrasi eltm spsil hnfs rdcinmethods prediction fast when possible is time real in streams data Analyzing also have developments these resources, less using data more collecting Besides omnyue praht nlz aasrasi eyitiie e’ imag- Let’s intuitive. very is streams data analyze to approach used commonly A and studied are time real in streams data analyze to approaches thesis, this In , 1.1 2001 rsnsa vriwo h tutr fti hss oeta,Chapter that, Note thesis. this of structure the of overview an presents ( ; 4 Cappé rl Ebner-Priemer & Trull aae Wichers & Hamaker r ulse ssprt ora rilsadChapter and articles journal separate as published are , 2011a umr statistics summary ; itn rn,&Hall & Frank, Witten, , 2017 , 2009 1.1 xeinesampling Experience n,ee huhcmol o analyzed not commonly though even and, ) r introduced. are ) n omnyue mrpoeappli- smartphone a uses commonly and ) nmmr,dt onsd o aet be to have not do points data memory, in , 2013 .Uigoln erig an learning, online Using ). see.g., (see hpe :Introduction 1: Chapter .F art & Barrett F. L. 3 n Chapter and 1.1 of ) on- 2 aaarv scretycmo rciei aysca cec applications. science social many in While new practice when common scored. analyses currently is redoing is point odd, arrive rather new data perhaps a and time inefficient every seems the again, example rewatch to over this time points in count back go and extreme to again an have would match In we analysis. example, an match of baseball result the the of these case update revisits to and enter memory data in new hand, when observations other observations the the all On uses which point. procedure a estimation scores player baseball a when one Introduction 1: Chapter in r etd nti hss nielann ehd hc oacutfrthe Fig. data, for nested account 2: do (branch which evaluated and methods developed observa- learning are the data online the that thesis, in account this nesting into In estimate. take to not nested. complex do are more methods tions much learning are online points most far, data account Thus the that between models However, dependency are account. the into methods for dependency when the than take error do prediction which more chosen independent. in longer results no nesting, assumption that this of result Violating a as are are, are points and data individuals individuals the within hence, same more nested individuals; be the different two to that of likely observations fact are two individual than the similar same to the due of vio- observations streams, be Two data repeatedly. to observed likely of is context assumption the this However, in lated independent. are points data the that learning. online using stream logistic data a a of in coefficients estimated this the be as of the can such as model focus parameters such complex parameters The more simple also how but shown discussed. mean is are sample It streams learning. online data on mainly these is analyze chapter to tools and streams In l h ehd rsne nti hpe hr nipratasmto,namely assumption, important an share chapter this in presented methods the All Chapter data streams Analyzing 2 tefis efo Fig. of leaf first (the iue11 rpia uln fti thesis this of outline Graphical 1.1: Figure non-nested nested data data shrinkage factors model based 1.1 ,amr ealditouto odata to introduction detailed more a ), multilevel intercept random model offline chapter 4 chapter 3 chapter chapter 5 erigi an is learning

2 1.1 ). 3

Chapter 1 5 1.0 ● ● ) came up ● 1961 ( 0.8 ● four shrinkage factors are ● ● ● 3 ● ● ● W. James and Stein 0.6 Chapter . ● ). In ● ● 2012 , ● ● ● shrunken estimates shrunken 0.4 shrinkage factor ● observed individual averages averages in estimating true abilities Morris & Lysy ● ● 0.2 Figure 1.2: Graphical display of the effect of including other observed ● What is often underrated however, and provides an intuition to the origin of How much the other players’ averages influence the estimate of a single bat- The fact that for dice-throwing it is seems intuitively feasible to look at the data Stein’s paradox, is that any measurement willextent. contain both When signal there and is noise to clearlyformance some lot’s of of an noise, individual we is intuitivelythe not grasp scores a that of good previous predictor, everyone per- andcess. else that involved we Oddly, to rather when want get weout to a move such use better to noise, and baseball grasp suddenly scores, of feelindividual-level many inclined the to scores; people underlying derive seem Stein’s predictions pro- to solely shrinkage based totallybetween estimators on the rule the individual-level provide “skill” a and the smoothnoise group weighting introduced scores, by to the correct “best” for batters some merely of being the lucky. ting ability is determined bywith a one of the first shrinkage factorsbeen and since developed, then which multiple differ shrinkage in factors how have erages much they towards shrink the the overall observed individual averageare av- and shrunken equally whether ( all observedstudied. individual averages For each of these four shrinkage factors an online approach is developed Chapter 1: Introduction aggregated analysis, on the other hand, leads3.5 us (which to predict was an the average average score in ofthis about our case. 1000 people sample) and seems more sensible in of others to predict individual performanceand can “noise”; be the understood signal, in ones “dice-throwing-skill” termsnoise, is of clearly the “signal” non-existent, sheer while “luck” the oflevel of throwing the two 28 sixes good throwers. inbe Most a corrected people for row, intuitively in understand is the this clearly case noise of driving should dice the throwing. skill , the Stein 1.2 shrunken analysis, i.e., . The top of this 1.2 Chapter 1: Introduction disaggregated analysis. However, the aggre- ) is also known as Stein’s paradox 1956 ( aggregated Stein ) that their next throw will be a six again. The th 6 / 1 ). 1977 , ) showed that if there are more than two units, e.g., baseball players, just using Now, Stein’s paradox manifests itself when we use the historical data (hence, the To illustrate Stein’s paradox, let us assume that we are studying people’s ability The concept of shrinkage estimation is illustrated in Figure Let us return to the example of the baseball match and assume that we are now Efron & Morris 1956 gated analysis would lead usately to object predict to: a the score 28 ’good of dice six,more throwers’ accurate, which were the just most probability lucky, and people is it immedi- is unlikely (or to be throwers, while those that repeatedly have lowsubsequently, scores invite are 1,000 “poor” people dice to throwers. throw We, aIn dice our twice, and sample, we we observe find their 28six scores. twice “good” in dice-throwers; a these row. people managed to throw a two previous throws), to predict the future data. In our jargon above, the disaggre- of the observed average.the Thus, ball, if then we we want should toThis also rather predict counterintuitive take player finding into A’s of probability account( to how hit well other players are doing. of throwing dice. We coin those who repeatedly have high score (sixes) “good” dice- abilities. The solid line isestimated the abilities overall are average. shrunken closer Asaverages. to can It each can be be other seen shown than that from thesetrue the Figure shrunken ability observed than estimates the predict individual individual more average; accurately i.e., the predicted on average ability is and the the difference true between the ability smaller if you use a shrunken estimate instead terms of our baseball example:predicting if individual we batting abilities, include we the arethe batting on observed behavior average individual of hitting more all proportions. accurate players than using in figure presents the observedshrunken individual estimates. proportions The dotted and lines connect the the observed bottom averages to presents the estimated the for each player separately, is straightforwardever, to this implement disaggregated in a analysis data is( stream. a How- naive approach toa solve baseball player’s this hitting problem. proportion does notof result this in the players most true accurate prediction battingestimates ability. yield more Instead, accurate predictions he than the proofed observed that individual the averages. so-called In gated analysis only gives uswhich one does estimate not of answer the ourappropriately hitting question to proportion who look for is at all the the players, bestanswer individual player. our batting question, behavior So, we of it could the would updateseparately players. the be when proportion more In they of order hits hit to online orwould for miss be each the player the ball best and player. the This approach, one referred with to the as highest a proportion interested in who is the bestproportion baseball over player. all We players could easily compute online thetotal by average counting number hitting the of total attempts; number of we hits call by this the an 4

Chapter 1 4 tl bett:te2 go ietrwr’wr utlcy n ti niey(rt be to (or unlikely is immedi- it is people and lucky, probability most just the were which accurate, throwers’ more six, dice of ’good 28 score the a disaggre- to: predict the object above, to ately jargon us our lead In would data. analysis future gated the predict to throws), previous two a throw to managed people row. these a dice-throwers; in “good” twice scores. six 28 their find observe we we sample, and twice, our dice In a We, throw throwers. to dice people “poor” 1,000 are invite scores subsequently, low dice- have “good” repeatedly (sixes) that score high those have while repeatedly throwers, who those coin We dice. throwing of doing. are players other well hit how to ( account probability of A’s into finding player take counterintuitive predict rather also This to should want we we then if instead ball, estimate Thus, the shrunken a average. use observed you if the smaller of ability the between true difference the the and is ability average on predicted the i.e., accurately average; more individual individual predict the estimates than observed ability shrunken Figure the true these from that than shown seen other be be can each It can to averages. As closer shrunken average. are overall abilities the estimated is line the estimated solid the The presents to averages bottom observed abilities. the the connect lines and dotted The proportions estimates. individual shrunken observed the presents figure in using than players accurate proportions. all more hitting of individual average behavior observed on batting the are the we include abilities, we batting individual if predicting example: baseball In so-called our averages. the individual of that observed terms proofed the than he predictions accurate Instead, more yield ability. estimates batting prediction accurate true most players the in this result of not does proportion problem. hitting this player’s baseball solve a to approach naive How- a stream. ( is data analysis a in disaggregated implement this to proportion ever, straightforward a highest is as the separately, to with player referred each one approach, for This the player. and best ball the player the each be miss for would or online to hit hits order of they In more proportion when be the players. separately update would the could it of we So, behavior question, batting our player. individual answer best players, the the all at is for look who proportion to question hitting appropriately our the answer of not estimate does one which us gives an only the this analysis by call hits gated we of number attempts; total of the hitting number counting average by total the online compute easily could players We all player. over baseball proportion best the is who in interested 1956 fo Morris & Efron e srtr oteeapeo h aealmthadasm htw r now are we that assume and match baseball the of example the to return us Let o,Sensprdxmnfssisl hnw s h itrcldt hne the (hence, data historical the use we when itself manifests paradox Stein’s Now, ability people’s studying are we that assume us let paradox, Stein’s illustrate To Figure in illustrated is estimation shrinkage of concept The hwdta fteeaemr hntouis .. aealpaes utusing just players, baseball e.g., units, two than more are there if that showed ) , 1977 ). 1 / 6 th htternx ho ilb i gi.The again. six a be will throw next their that ) Stein aggregated ( 1956 sas nw sSensparadox Stein’s as known also is ) nlss oee,teaggre- the However, analysis. disaggregated hpe :Introduction 1: Chapter 1.2 h o fthis of top The . nlss i.e., analysis, shrunken 1.2 Stein the , tde.Frec fteefu hikg atr noln prahi developed is approach online an factors shrinkage four these of each For averages individual studied. observed all ( whether equally shrunken and av- are average individual observed overall the the shrink towards they much erages have how factors in shrinkage differ multiple which then developed, since and been factors shrinkage first the of one a with by determined is ability ting lucky. the being of merely some batters for “best” correct the to by scores, introduced weighting group noise smooth the and a “skill” provide individual-level the rule the on estimators between totally based shrinkage solely to pro- predictions Stein’s seem derive underlying people scores; to the inclined many individual-level feel of scores, suddenly grasp baseball and noise, to better use such move a to out we get want when rather to Oddly, we involved that else cess. and per- everyone predictor, previous good of that a scores grasp not the intuitively is we individual noise, an of of lot’s some formance clearly to noise is and there signal When both contain extent. will measurement any that is paradox, Stein’s skill throwing. the dice should driving of noise case clearly this the is understand in intuitively row, for people corrected a Most be in throwers. good sixes 28 two the throwing of level of the “luck” while sheer non-existent, “signal” the clearly of is noise, terms “dice-throwing-skill” ones in signal, understood the be “noise”; can and performance individual predict to others of in sensible more seems and sample) people 1000 case. our about this of in score average average the an was predict to (which us 3.5 leads hand, other the on analysis, aggregated Introduction 1: Chapter o uhteohrpaes vrgsiflec h siaeo igebat- single a of estimate the influence averages players’ other the much How of origin the to intuition an provides and however, underrated often is What data the at look to feasible intuitively seems is it dice-throwing for that fact The ● iue12 rpia ipa fteefc ficuigohrobserved other including of effect the of display Graphical 1.2: Figure 0.2 ● ● ors&Lysy & Morris vrgsi siaigtu abilities true estimating in averages observed individualaveragesobserved ● hikg factor shrinkage 0.4 shrunken estimates ● ● ● , 2012 ● ● .In ). ● . Chapter 0.6 .JmsadStein and James W. ● ● ● 3 ● ● ● orsrnaefcosare factors shrinkage four ● 0.8 ( 1961 ● aeup came ) ● ● 1.0 5

Chapter 1 , 7 ), efficient ap- 2017 Cappé & Moulines , ; 2010 , Bottou Hamaker & Wichers ). While data streams are commonly used in computer 1998 , Neal & Hinton ; Secondly, the state-of-the-art methods currently used to analyze the data streams by social scientists. In this thesis, computationally-efficientmodeling approaches are to developed multilevel to account for the nestedstreams. structure commonly The found in content data ofsciences this and thesis of covers, the in computer part,than science. both on Admittedly, the the the literature focus latter: of isof the more this study on social in thesis the computer former introduces science datamethods - to streams to account - social for scientists which nested and, observations arethe in in social an sciences. data addition, active streams, develops a area novel problem common to Chapter 1: Introduction model is extended with both time-constantlikely predictors to (such as change gender, over which time) is andself un- esteem, time-varying which predictors is (such likely as to a1) vary player’s predictors over current time). can Additionally, have the time-varying differentslopes). (level effects In depending a on simulation study, the theoffline procedure, individual SEMA which (i.e., shows algorithm that is random SEMA compared can analyze withwell. these the simulated In data standard an streams empirical study about theequately fluctuations predicts in the individuals’ weights, weight SEMA of ad- the individuals in the data stream. 1.3 Contributions to the literature The contributions of this thesis totion the to literature are data twofold: streams providing for andata introduc- social streams. scientists, First, and efficient developingsocial approaches new scientists to methods implement are to commonly-used illustrated. analyze modelscoming While by more intensive popular longitudinal in dataproaches social collection to sciences is analyze ( be- the datatimal have use to of supplement the these data developmentsestimate to stream. well-known make models, By op- data introducing streams computationally-efficient becometists. methods more accessible to for social scien- often do not account for nested2009 observations (e.g., sciences and marketing research, we focus on methods and models commonly used ), ), 1.1 algo- 2002 , Rauden- to refer to , acronym for level 2 SEMA Chapter 1: Introduction Raudenbush & Bryk ; Expectation Maximization : Analyzing data streams with 2004 1.1 , ). The aim of Maximum Likelihood esti- ) or some Newton-type of algorithm (see, e.g., Demidenko 2003 ( , 1977 , to refer to the observations and belongs to the same ’model-based’ branch as the previ- Myung ( 1.1 , an extension of SEMA is presented. The random intercept level 1 5 multilevel models ). The SEMA algorithm fits a random intercept model while the ). However, because these algorithms pass over the data repeatedly , an alternative algorithm is developed, called 4 Chapter 2002 , 2004 , Dempster, Laird, & Rubin Maximum Likelihood Chapter Multilevel models are usually fitted to the data using an estimation framework In Next, we turn to the last ‘branch’ of Figure The last ‘leaf’ of Figure called mation is to find the parameterdata. values that However, unlike maximize parameters the such as likelihood the of mean,model the the cannot parameters observed of easily the be multilevel computed. Inone order has to to rely find on those some values iterative for procedure, the such parameters, the the grouping variable.at Using level our 1 and baseball the baseball example, playersadvantages are the at over batting level traditional 2. observations methods Multilevel are models ofmultilevel have analysis: a models number take e.g., of the unlike nested aggregated structuremodels of analyses, consist the of data less into parameters account, than andmultilevel the multilevel models disaggregated easier analyses, to which interpret. make the keep the parameters up to date.a data As stream a could result, become analyzing infeasible the when data data using keep this streaming in model rapidly. in rithm ( Demidenko to find the Maximum Likelihood solution, therevisited data in points each are iteration. stored In in memory addition, and data when used point in enters, a data the stream, iterative each time fitting a procedure new has to be repeated again in order sign. While some shrinkage factorspredictions perform of better the online than and others, the the offline accuracy estimated shrinkage of factor the are very similar. nested data using a model-basedten approach. analyzed using In social sciences,where nested we data use are the of- term 6 such that the shrinkage factors are suitable toa estimate data the stream. individual The abilities during standard offline approachin and a the simulation online approach study are and compared arespondents applied would fail to to an respond empirical to example a to questionnaire in predict a which repeated-measurements re- de- stream, while SEMA is much faster. ous chapter. In data are entering, and more importantly, itous does data so points, without which going back can tois then the compared be previ- with discarded the from memory. standardand offline The fitting in SEMA procedure an algorithm both empirical in studyto a on obtain simulation respondents parameter study wellbeing. estimates, The which SEMAthe are algorithm offline very is procedure, similar able both to in the the estimates obtained simulated by data stream and in the empirical data Streaming Expectation Maximization Approximation. In thisthe chapter focus (see, is Fig. on the simplest multilevelbush model: & the Bryk random intercept model (

Chapter 1 u hpe.In chapter. ous data empirical the in and faster. stream much data is by SEMA simulated obtained while estimates the stream, the in to both able similar procedure, is very offline algorithm are the SEMA which The estimates, wellbeing. study parameter respondents simulation obtain on a to study in empirical both algorithm an procedure SEMA in fitting The offline and standard memory. from the discarded with previ- be compared the then is to can back going which without points, so data does ous it importantly, more and entering, are data ( model intercept random Bryk the & model: bush multilevel simplest the on Fig. is (see, focus chapter the this In Approximation. Maximization Expectation Streaming epteprmtr pt ae sarsl,aayigtedt sn hsmdlin rapidly. model in streaming this keep using data data when order the infeasible analyzing in become result, could again a stream repeated As data be a date. to to has up new procedure parameters a the fitting time keep each iterative stream, the data a enters, in point used when data and addition, memory in In stored iteration. are each points in data revisited the solution, Likelihood Maximum the find to Demidenko ( rithm term of- the are use data we nested where sciences, social In using analyzed approach. ten model-based a using data nested similar. very are the factor of shrinkage estimated accuracy offline the the others, and than online the better of perform predictions factors de- shrinkage re- repeated-measurements some which a While predict in questionnaire to sign. a example to empirical respond an to to fail would applied spondents are compared and are study approach online simulation the a and in approach offline standard during abilities The individual stream. the data estimate a to suitable are factors shrinkage the that such 6 n a orl nsm trtv rcdr,sc the parameters, such the procedure, for iterative values some those on find rely to to has order one In computed. multilevel be the easily of observed parameters cannot the the model mean, of the likelihood as such the parameters maximize unlike However, that values data. parameter the find to is mation called the make interpret. which to analyses, easier disaggregated models multilevel the multilevel and than account, parameters into less data of the consist analyses, of models structure aggregated nested unlike the of e.g., take number models a analysis: have multilevel of models are Multilevel methods observations 2. traditional level batting over at the are advantages players example, baseball the baseball and 1 our level Using at variable. grouping the h at‘ef fFigure of ‘leaf’ last The et etr otels bac’o Figure of ‘branch’ last the to turn we Next, In utlvlmdl r sal te otedt sn netmto framework estimation an using data the to fitted usually are models Multilevel Chapter aiu Likelihood Maximum eptr ar,&Rubin & Laird, Dempster, , 2004 , 2002 Chapter 4 natraieagrtmi eeoe,called developed, is algorithm alternative an , .Hwvr eas hs loihsps vrtedt repeatedly data the over pass algorithms these because However, ). .TeSM loih t admitretmdlwiethe while model intercept random a fits algorithm SEMA The ). utlvlmodels multilevel 5 ee 1 level netnino EAi rsne.Terno intercept random The presented. is SEMA of extension an , 1.1 ( Myung eog otesm mdlbsd rnha h previ- the as branch ’model-based’ same the to belongs orfrt h bevtosand observations the to refer to , 1977 , ( 2003 Demidenko rsm etntp fagrtm(e,e.g., (see, algorithm of Newton-type some or ) .Teamo aiu ieiodesti- Likelihood Maximum of aim The ). , 1.1 2004 nlzn aasraswith streams data Analyzing : xetto Maximization Expectation ; adnuh&Bryk & Raudenbush hpe :Introduction 1: Chapter SEMA ee 2 level coy for acronym , orfrto refer to Rauden- , 2002 algo- 1.1 ), ), ehd oacutfrnse bevtosi aasras rbe omnto common problem novel area a develops streams, active addition, data sciences. an social in in the are observations and, nested which scientists for social - account to streams to - methods data science introduces former computer the thesis in social on study this more the of is of latter: focus literature the the the Admittedly, on both science. than part, computer in the covers, of thesis and this sciences of data content in found The commonly structure streams. nested the for account to multilevel developed to are approaches modeling computationally-efficient thesis, used this commonly In models scientists. and social methods by on focus we research, marketing and sciences (e.g., observations 2009 nested for account not do often scien- social for to accessible more methods tists. become computationally-efficient streams introducing data op- By models, make well-known stream. to estimate developments data these the supplement of to use have timal data the be- ( analyze is sciences to collection social proaches data in longitudinal popular intensive more by While coming models analyze illustrated. commonly-used to are implement methods to scientists new approaches social developing efficient and First, scientists, streams. social introduc- data an for providing streams twofold: data are literature to the tion to thesis this of contributions The literature the to Contributions 1.3 stream. data the in individuals the ad- of SEMA weight weights, individuals’ the in predicts fluctuations equately the about study empirical streams an standard data In simulated the these well. with analyze can compared SEMA random is that algorithm shows (i.e., which SEMA individual procedure, offline the the study, simulation on a depending In effects (level slopes). time-varying different the have Additionally, can time). current over predictors player’s vary 1) a to as likely (such is predictors which time-varying esteem, un- self and is time) which over gender, change as (such to predictors likely time-constant both with extended is model Introduction 1: Chapter eody h tt-fteatmtoscretyue oaayetedt streams data the analyze to used currently methods state-of-the-art the Secondly, ; el&Hinton & Neal , 1998 .Wiedt tem r omnyue ncomputer in used commonly are streams data While ). aae Wichers & Hamaker Bottou , 2010 ; , ap Moulines & Cappé 2017 ,efiin ap- efficient ), 7 ,

Chapter 1 Chapter 2

Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial.

Abstract

Novel technological advances allow distributed and automatic measurement of hu- man behavior. While these technologies provide exciting new research opportuni- ties, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously aug- mented. These data streams make the standard computation of well-known estima- tors inefficient as the computation has to be repeated each time a new data point enters. In this chapter, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or “row-by-row”, processing approach. We present several simple (and exact) exam- ples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this chapter with a discussion of the methodological challenges that remain.

This chapter is published as Ippel, L., Kaptein, M.C, & Vermunt, J.K. (2016) Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial. Methodology, 12(4), 124-138 Chapter 2

Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial.

Abstract

Novel technological advances allow distributed and automatic measurement of hu- man behavior. While these technologies provide exciting new research opportuni- ties, they also provide challenges: datasets collected using new technologies grow increasingly large, and in many applications the collected data are continuously aug- mented. These data streams make the standard computation of well-known estima- tors inefficient as the computation has to be repeated each time a new data point enters. In this chapter, we detail online learning, an analysis method that facilitates the efficient analysis of Big Data and continuous data streams. We illustrate how common analysis methods can be adapted for use with Big Data using an online, or “row-by-row”, processing approach. We present several simple (and exact) exam- ples of the online estimation and we discuss Stochastic Gradient Descent as a general (approximate) approach to estimate more complex models. We end this chapter with a discussion of the methodological challenges that remain.

This chapter is published as Ippel, L., Kaptein, M.C, & Vermunt, J.K. (2016) Dealing with Data Streams: an Online, Row-by-Row, Estimation Tutorial. Methodology, 12(4), 124-138 n 11 is a (2.1) θ . n x (or row-by-row estimation), , conceptual approaches for is a set of sufficient statistics does not include subscript θ 2.2 , their estimates when new data θ 2.1 , ) http://github.com/L-Ippel/ , n ) n update ,x 1 − θ, x n online learning ( ’, which indicates that the updated θ f ( := f := = θ n θ , the method further illustrated in the remainder of , we illustrate how often-used estimators such as sample and the most recent data point, θ older data points. Formally online learning can be denoted 2.3 online learning , provide an introduction of Stochastic Gradient Descent (SGD) as 2.4 . . The second equation for updating n x never revisit A large number of well-known conventional estimation methods used for the This chapter is organized as follows: In Section The aim of this chapter is to introduce Online estimation methods continuously arrive, and or equivalently and a shorthand analysis of regular (read "small") datasetsdata can streams, be without adapted losing such their straightforwardness that or they interpretation.a can We number provide handle of examples in thistic chapter. Gradient Furthermore, we Descent, will a also general introducetimation Stochas- method of that complex can models bethis used in chapter, for we data have the streams. made (approximate) [R] code es- ForMethodology available at all the examples introduced in the estimation of parameters in Big Datafocus and/or primarily data on streams are discussed, andthis we chapter. In Section means, variances, and covariances,the can benefits be of estimated online using learning methodscomparing online to the learning. deal computational with times Here data of streamsthen, are online in illustrated Section and by offline estimation methods.a general We (approximate) method to estimate more complex models in data streams. Chapter 2: Online Estimation Tutorial worse, amplifying the problems. Regardless of theare exact continuously scaling augmented however, if both the the data requiredeventually computation will become time infeasible and memory use as a way to deal with Bigdata Data or without data storing streams. all Online individual learning data methodsmean, points, analyze for the or instance a by sum computing a ofing sample squares methods have without a revisiting feasible olderanalysis) time data. and complexity they (i.e., Therefore, require the a online time feasible learn- data required amount streams to of or conduct Big computer the Data. memory when Init analyzing the were latter a case, data a stream very by largeavailable static iterating in dataset through memory. is the treated rows, as without if having all data points as follows: because we use the update operator ‘ which we will use throughout the(not chapter. In necessarily Eq. the actual parametersdata of point, interest), which is updated using a new function of the previous ; ): 2011 2013 , , ) and its 2001 , ). If datasets are 2012 , . Thus, the time increase Sagiroglu & Sinanc t/n ; is c 2013 ). Because more data are made , Carmona et al. L. F. Barrett & Barrett 2002 Chapter 2: Online Estimation Tutorial , ) or when data are mined from website Swendsen, Ben-Zeev, & Granholm time, where 2010 c , +2 t time (where the time unit required for the computation is de- t data points requires Killingsworth & Gilbert and is every increasing as the data stream grows. In less fortunate and +2 n n Demchenko, Grosso, De Laat, & Membrey Failing to use appropriate methods when analyzing Big Data or data streams Handling extremely large datasets represents a technical challenge in its own data points requires sum over could result in computer memory overflowIn or favorable computations cases, that the take time to alinearly compute lot with a of the statistic time. using amount standard of methods datan increases entering. For example, if computingpendent the on sum the over type of machine used, the algorithm used, etc.), then computing the existing methods need to be adaptedstreaming data. and/or To new be methods able are to required capitalizehave on to become the analyze available, vast amounts we of must (streaming)are develop data widely efficient that available methods. we Only will if bebehavior. these able methods to truly improve our understanding of human in happiness, logs (e.g., research into improving e-commerce, continuously augmented and estimates aretional needed analyses at often each have to pointprocess be in is repeated time, every highly conven- time inefficient a andcollection new frequently and data forces analyze point scholars a enters. (smaller) to static arbitrarily This dataset. stop In data- order to resolve this inefficiency, machinery or analysis methods. right, moreover, the challengeaugmented is (i.e., amplified new when rows large areA datasets added combination to are of these the continuously challenges dataset iscollected as encountered continuously new when data using — enter for smart-phone example over applications — time). (e.g., data are tracking fluctuations new challenges: If (longitudinal)then data we are collected may from obtain largethat extremely groups they large of cannot subjects, datasets. beware analyzed These packages. using datasets standard This might analysis isData” be methods exactly ( one and so of existing large the soft- definitionsdatasets used that for are the so buzz-term large “Big that they cannot be handled using standard computing These technological developments allowlarge researchers scales to and over study long human periods ofWhalen, behavior time Jamner, at ( Henker, Delfino, &available Lozano for research, these technologicalvance developments our have understanding the of potential humandynamics. behavior to ( ad- However, these novel data collection technologies also present us with 10 2.1 Introduction The ever-increasing availability of Internethas access, led smart to phones, many and novel social opportunities media for collecting behavioral and attitudinal data. is linear in more common cases, the increase in time complexity is not linear but quadratic, or

Chapter 2 oecmo ae,teices ntm opeiyi o ierbtqartc or quadratic, but linear not is complexity time in increase the cases, common more in linear is edn ntetp fmcieue,teagrtmue,ec) hncmuigthe computing then etc.), used, algorithm the over used, sum machine of type over the sum on the pendent computing if example, For entering. increases n data methods of standard amount using time. statistic the of a with lot compute linearly a to time take the that cases, computations favorable or In overflow memory computer in result could human of understanding our improve truly to methods able these behavior. be if will Only we methods. available that efficient widely data develop are (streaming) must of we amounts vast available, analyze the become to on have capitalize required to are able methods be new To and/or data. inefficiency, streaming adapted this resolve be to to order need data- In methods stop dataset. existing This arbitrarily static to (smaller) enters. a scholars point analyze forces data and frequently new collection and a inefficient time conven- highly every time, repeated is in be process point to have each often at analyses needed tional are estimates and augmented continuously e-commerce, improving into research (e.g., logs fluctuations tracking are data (e.g., happiness, time). — applications in over example smart-phone for enter — using data when new continuously encountered as collected is dataset challenges continuously the these of are to combination added datasets A are large rows when new amplified (i.e., is augmented challenge the moreover, right, computing standard using handled be cannot methods. they analysis that or “Big machinery large buzz-term so the are for that used datasets definitions soft- the large existing of so and one ( exactly methods be Data” is analysis might This standard datasets using packages. These analyzed ware be datasets. subjects, cannot of large they groups extremely that large obtain from may collected are we with data us then present (longitudinal) also If technologies challenges: collection new data novel these However, ad- ( to behavior dynamics. human potential of the understanding have our developments vance technological these research, for Lozano available & Delfino, Henker, ( at Jamner, time behavior Whalen, of periods human long study over and to data. scales attitudinal researchers large and allow behavioral developments collecting technological for media opportunities These social novel and many phones, to smart led access, has Internet of availability ever-increasing The Introduction 2.1 10 aapit requires points data aln oueaporaemtoswe nlzn i aao aastreams data or Data Big analyzing when methods appropriate use to Failing own its in challenge technical a represents datasets large extremely Handling ecek,Gos,D at Membrey & Laat, De Grosso, Demchenko, n n +2 n seeyicesn stedt temgos nls otnt and fortunate less In grows. stream data the as increasing every is and ilnsot Gilbert & Killingsworth aapit requires points data t ie(hr h ieui eurdfrtecmuaini de- is computation the for required unit time the (where time t +2 , c 2010 ie where time, wnsn e-ev Granholm & Ben-Zeev, Swendsen, rwe aaaemndfo website from mined are data when or ) , hpe :Oln siainTutorial Estimation Online 2: Chapter 2002 .F art Barrett & Barrett F. L. amn tal. et Carmona , .Bcuemr aaaemade are data more Because ). 2013 c is ; t/n aiol Sinanc & Sagiroglu hs h ieincrease time the Thus, . , 2012 .I aaesare datasets If ). , 2001 n its and ) , , 2013 2011 ): ; ucino h previous the of function hc ewl s hogottecatr nEq. In chapter. the throughout use will we which ntncsaiyteata aaeeso neet,wihi pae sn new a using updated is which interest), point, of data parameters actual the necessarily (not eas eueteudt prtr‘ operator update the use we because sfollows: as points data all having if without as rows, treated the is memory. through dataset in iterating static available large by very stream a data case, a latter were the analyzing it In when memory Data. the computer Big conduct or of to streams amount required data learn- feasible time online a the require Therefore, (i.e., they complexity and data. time analysis) older feasible revisiting a without have methods squares sample ing of a computing sum by a instance or the for analyze points, mean, methods data learning individual Online all streams. storing data without or Data data Big with deal to way a as use memory and infeasible time become will computation eventually required data the the both if however, augmented scaling continuously exact are the of Regardless problems. the amplifying worse, Tutorial Estimation Online 2: Chapter eea apoiae ehdt siaemr ope oesi aastreams. data in models complex more estimate to method (approximate) We general a methods. estimation offline by and Section illustrated in online are then, streams of data Here times with computational deal learning. the to online comparing methods learning using online estimated of be benefits can the covariances, and variances, means, Section In chapter. we this and discussed, are streams on data primarily and/or focus Data Big in parameters of estimation the in introduced examples the all at available Methodology For es- code [R] (approximate) made streams. the have data we for chapter, in used this be models can complex that of method Stochas- timation introduce general also a will Descent, we Furthermore, Gradient chapter. tic this in examples of handle provide number We can a interpretation. they or that straightforwardness such their losing adapted without be streams, can data datasets "small") (read regular of analysis reuvlnl n shorthand a and equivalently or rie and arrive, nieetmto ehd continuously methods estimation Online introduce to is chapter this of aim The hscatri raie sflos nSection In follows: as organized is chapter This the for used methods estimation conventional well-known of number large A ee revisit never x n h eodeuto o updating for equation second The . . 2.4 rvd nitouto fSohsi rdetDset(G)as (SGD) Descent Gradient Stochastic of introduction an provide , nielearning online 2.3 le aapit.Fral nielann a edenoted be can learning online Formally points. data older θ n h otrcn aapoint, data recent most the and eilsrt o fe-sdetmtr uha sample as such estimators often-used how illustrate we , h ehdfrhrilsrtdi h eane of remainder the in illustrated further method the , θ n θ = := f := ( f θ ,wihidctsta h updated the that indicates which ’, ( nielearning online n x θ, − 1 ,x update n ) n , http://github.com/L-Ippel/ ) , 2.1 θ hi siae hnnwdata new when estimates their , 2.2 θ osnticuesubscript include not does sasto ufiin statistics sufficient of set a is ocpulapoce for approaches conceptual , o o-yrwestimation), row-by-row (or x n . θ (2.1) sa is 11 n

Chapter 2 , , θ θ 13 Bottou . Subsequently, . See for a discus- θ Thiesson, Meek, and ), or Atallah, Cole, & Goodrich ), but here we discuss it as 2003 batch learning ( 2005 , and the new data point; historical θ is the total number of observations, n ). This effectively solves the problem of ). The summaries required to estimate the Gaber et al. ; 2010 1998 , , 2006 , Wilson and Martinez Opper ). As introduced in the previous section, online learning data points, where Turaga et al. 1 Chu et al. 2011 ; , − n ) about choosing block (or batch) sizes for the EM algorithm. 2006 , 2001 ( data points are summarized into a limited set of sufficient statistics of the 1 Chu et al. Shalev-Shwartz − ; ; n . In the next section, we further detail online learning and how to choose starting Note that in this chapter, we focus on the situation where parameters are up- The two characteristics of online learning – including all the data in the estimate The third option, using parallel computing, is an often-used method to analyze In this chapter, we will focus on a fourth method: online learning (e.g., θ data points are not revisited. dated using a single (mostrather recent) uses data a point. ‘batch’ of data Theresion points. are of This also batch is situations known learning where as in one SGD, Heckerman and not revisiting the historicalapproach data to – analyze jointly data streams. make However online twodows, learning downfalls online remain, a learning like very sliding also suitable win- requires domainshould knowledge be to judge gathered which beforehand; information and the their researcher update needs functions to up chooseonline front. learning, the the Second, researcher elements often although of needs this toof choose issue starting is values for not the unique elements for values by providing the online adaptation of a number of conventional statistics. is updated using some function of the previous Chapter 2: Online Estimation Tutorial static Big Data. Using parallelsuch computing, that the multiple independent researcher machines splits each thethe analyze data results a of in chunk the chunks, of different chunks data, are after combined1989 which (see, e.g., memory burden by allocating thecomputation data to time multiple of memory static units,been datasets, and done reduces since ‘sequentially’ the analyses are which conductedvery ‘parallel’. otherwise effective when would However, the have parallelization datasetquired is for is the not continuously analyses, augmented: computation powerfor since has as to all long eventually data as grow the are without datasetbining bound re- is the augmented results obtained with on new different data. chunks of Also, data the might operation itself of be com- a1999 challenge. uses all available information, but withoutpoints. storing Online learning or methods revisiting can the be individual usedtion, data in (for combination instance, with see parallel computa- a unique method that has largecan potential for be use thought in of the as socialpart using sciences. consisting a This of method very extreme splitand of only 1 the data data; point the onthe data the is other split hand. into Additionally, in a estimates online of learning the methods, parameters of interest, whichous take data all points relevant into information account of ( previ- parameters of interest (often the sufficient statistics) are stored in ). ). 2006 2005 , , Efraimidis & Spirakis Chapter 2: Online Estimation Tutorial Gaber, Zaslavsky, & Krishnaswamy ; 2002 , most recent data points, while the second part contains older data m ). For instance, when studying a rare event, the window should be describes an example of an application of SGD in the social sciences. In m we detail some of the limitations of the online learning approach. Finally, 2.5 2.6 sample from the data to reduce the size of theuse dataset, a sliding window approach, parallelize the computation, or resort to online learning. Option two, using a sliding window, also solves the issue of needing increasingly The first option, to sample from the data, solves the problem of having to deal 4. 1. 2. 3. Datar, Gionis, Indyk, & Motwani much larger than in the case of ato frequent decide event. how It large is this up window to ought thewindow to researcher’s be. approach discretion might Also, not when analyzing be trends, appropriate a since sliding historical data are ignored. where the subsample only consistwindow of shifts new to data include points. newthe old data When data. (i.e. new Although data a aand enter, sliding (partially) the amount window new approach of is subsample) memory feasible and needed,that in ignore it computation the requires time sliding domain window knowledge to approachdetermine determine has a the proper downside size of the window (e.g., sliding window approach the analysis is restricted to( the most recent part of the data Thus, the data are againwhich split is into not a used for part the which analysis.consists is The of used analysis the part for (i.e., the also analyses coinedwhich “the and is window”) a discarded. part One could see a sliding window as a special case of option 1, in the case of datapoints streams, or a let them researcher “pass can by” to decideHowever, reduce to when memory burden a randomly ( lot include of new data data arewe available, could it potentially might use. be a waste not to use all the data more computation power by reducing the amount of data that is analyzed. In a with a large volume of data simply byis reducing too its size. large Effectively, to when the process dataset ata part once, which one is could used “randomly” for split the the analyses data and into a two part parts: of the data that is discarded. Even be identified: 2.2 Dealing with Big Data:In the the recent options years, data streams and thetion resulting of large many datasets scholars. have Diverse received methods atten- haveamounts been of developed data. to deal Conceptually, four with overarching these approaches vast to handle Big Data can 12 Section Section in the last section, we discuss theBig directions Data. for further research on data streams and

Chapter 2 idwapoc ih o eaporaesnehsoia aaaeignored. are data historical sliding since a appropriate trends, be analyzing when not Also, might discretion approach be. researcher’s to window the ought to window up this is large It how event. decide frequent to a of case (e.g., the in window than the larger much of size downside proper the a has determine determine approach to knowledge window domain sliding time requires the computation it ignore in that needed, and feasible memory subsample) is of approach new window amount the (partially) sliding enter, and a a data Although new (i.e. data. 1, When data old option the new of points. include case data special to new a shifts of as window consist window only sliding subsample a the see where could One part discarded. a window”) is and “the which coined analyses also the (i.e., for part the analysis used of The is consists analysis. which the part for used a not into is split which again are a data In the Thus, data the analyzed. of part is recent most that the data ( to restricted of is analysis amount the approach the window reducing sliding by power computation more data the all use to not waste a be use. might potentially it could available, we are data data new of include lot ( Even randomly a burden memory when discarded. to reduce However, is decide to that by” can “pass data researcher them the let a or of streams, parts: points part data two a of into and data case analyses the the the split in for “randomly” used could is one which once, part a at dataset process the when to Effectively, large size. its too reducing is by simply data of volume large a with can Data Big handle identified: to be vast approaches these overarching with four Conceptually, deal to data. developed of been amounts have atten- methods received Diverse have scholars. datasets many large of resulting tion the and streams data years, options recent the the In Data: Big with Dealing 2.2 and streams data on research further for Data. directions Big the discuss we section, last the in Section Section 12 aa,Goi,Idk Motwani & Indyk, Gionis, Datar, 1. 4. 3. 2. pinto sn ldn idw losle h su fneigincreasingly needing of issue the solves also window, sliding a using two, Option deal to having of problem the solves data, the from sample to option, first The apefo h aat euetesz ftedataset, the of size the reduce to data the from sample rrsr ooln learning. online to resort or computation, the parallelize approach, window sliding a use 2.6 2.5 edti oeo h iiain fteoln erigapoc.Finally, approach. learning online the of limitations the of some detail we m ecie neapeo napiaino G ntesca cecs In sciences. social the in SGD of application an of example an describes .Frisac,we tdigarr vn,tewno hudbe should window the event, rare a studying when instance, For ). m otrcn aapit,wietescn atcnan le data older contains part second the while points, data recent most , 2002 ; ae,Zsasy Krishnaswamy & Zaslavsky, Gaber, hpe :Oln siainTutorial Estimation Online 2: Chapter famds&Spirakis & Efraimidis , , 2005 2006 ). ). sudtduigsm ucino h previous the of function some in stored using are updated statistics) is sufficient the (often interest of parameters previ- ( of account information into relevant points all data take ous which interest, of parameters methods, the learning of online estimates a in Additionally, into hand. split other is the data the on the point data; data the 1 only of and split extreme very method of This a consisting sciences. using part social as the of in thought use for be potential can large has that method unique a computa- parallel see with instance, combination (for in data tion, used individual be the can revisiting methods or learning Online storing points. without but information, available all uses challenge. 1999 a com- be of itself operation might the data Also, of chunks data. different new on with obtained results augmented the is re- bound bining dataset without are the grow as data eventually long all to as has since for power computation augmented: analyses, continuously not the is for is quired dataset parallelization have the However, would when effective otherwise ‘parallel’. very conducted which are analyses the ‘sequentially’ since reduces done and datasets, been units, static memory of multiple time to data computation the allocating by burden memory e.g., (see, which 1989 combined after are data, chunks different of chunks, the chunk in of a results data analyze the the each splits machines researcher independent multiple the that computing, such parallel Using Data. Big static hpe :Oln siainTutorial Estimation Online 2: Chapter ausb rvdn h nieaatto fanme fcnetoa statistics. conventional of number a of adaptation online the providing by values for elements unique the not for values is starting issue choose of to this needs of although often elements researcher Second, the the learning, front. online choose up to functions needs update researcher their the and information beforehand; which gathered judge to be knowledge should domain requires win- suitable also sliding very like learning a remain, online downfalls learning dows, two online However make streams. data jointly analyze – to data approach historical the revisiting not and Heckerman SGD, one in as where learning known situations is batch also This of are points. sion There data of ‘batch’ point. a data uses recent) rather (most single a using dated revisited. not are points data θ nti hpe,w ilfcso orhmto:oln erig(e.g., learning online method: fourth a on focus will we chapter, this In analyze to method often-used an is computing, parallel using option, third The h w hrceitc foln erig–icuigaltedt nteestimate the in data the all including – learning online of characteristics two The up- are parameters where situation the on focus we chapter, this in that Note ntenx eto,w ute ealoln erigadhwt hoestarting choose to how and learning online detail further we section, next the In . n ; ; − Shalev-Shwartz h tal. et Chu 1 aapit r umrzdit iie e fsfcetsaitc fthe of statistics sufficient of set limited a into summarized are points data ( 2001 , 2006 bu hoigbok(rbth ie o h Malgorithm. EM the for sizes batch) (or block choosing about ) n − , ; 2011 h tal. et Chu 1 uaae al. et Turaga aapit,where points, data .A nrdcdi h rvosscin nielearning online section, previous the in introduced As ). Opper isnadMartinez and Wilson , 2006 , , 1998 2010 ; ae tal. et Gaber .Tesmaisrqie oetmt the estimate to required summaries The ). .Ti fetvl ovstepolmof problem the solves effectively This ). n stettlnme fobservations, of number total the is θ n h e aapit historical point; data new the and , 2005 ( ac learning batch 2003 ,bthr edsusi as it discuss we here but ), tla,Cl,&Goodrich & Cole, Atallah, ,or ), hesn ek and Meek, Thiesson, θ e o discus- a for See . Subsequently, . Bottou 13 θ θ , ,

Chapter 2 15 (2.4) , where n x + Sx := grow without bound, Sx Sx and +1 , 2 n ) ¯ x ) more is to be gained when mov- := http://github.com/L-Ippel/ 2 − s , and b) lets n i θ x ( ; in this case, the sample mean could be as this starting point does not impact the . This implementation is a ready-to-use n θ =1 i ! , 1 1 =0 1 − − ¯ x SS n n . This latter method however a) does not actually n = = Sx and 2 ˆ s = ¯ =0 x n , which can be found at , as the elements of x is the sum of squares. Here, the first pass is used to estimate the sample mean_online() , while the second pass is used to compute the sum of squares. SS ¯ x is the sum over We implemented an example of the online formulation of the sample mean in where mean ing from offline to online computationsample as variance requires the two conventional passes method through of the dataset: computing a 2.3.2 Sample variance In case of the sample variance (often denoted Chapter 2: Online Estimation Tutorial This also holds for all the otherstraightforwardly examples choose provided. In the case of the mean, one can Sx store the sought for statistic as an element of computed at runtime using > for(i in 1:n) +{ ++} res <- mean_online(input = x[i], theta> = res) which might lead to numerical instabilities. [R] code, Methodology/Streaming_functions update of the sample mean. Below we presentof [R] the code, use which of gives a the demonstration online implementation‘#’ of denotes the a sample comment. mean. In the [R] language, the > # create some data: > # number of data points> = # 1000, mean of the data is> 5 N and <- standard 1000 deviation is 2: > x <- rnorm(n = N, mean> = # 5, create sd an = object 2) for> the res results: <- NULL > # the res object is needed> such # that the you updates can into the feed function> back # are created within the function, at the first call final result – this, regretfully, will not generally hold.mean could Also note also that be an computed online by sample maintaining θ (2.3) (2.2) (the sample mean). need to be chosen. ¯ x θ ) estimators. We discuss the , 1 . ) is computationally not very in- − . ) ¯ n x ¯ x offline (a count) and ¯ x n . i n n − Chapter 2: Online Estimation Tutorial − x ’ and start by stating the elements of n n x n =1 x , ( := i ! , } 1 , n +1 + 1 n } 1 1 +1 + ¯ x, n − − = { n x n n ¯ x, n ¯ x x n { = := :=¯ =¯ = = θ ¯ x n θ n n ¯ x n (in an ANOVA framework). 2 η the sample variance, the sample covariance, linear regression models, and the effect size the sample mean, 4. 5. 1. 2. 3. or equivalently, ples, small working exampleshttp://github.com/L-Ippel/Methodology as well as ready-to-use functions are available on The online formulations we discuss inoffline this counterparts: section are exact the reformulationsuses results of their an of offline the or analysis online are estimation the method. exact same Note whether that one for each of these exam- tensive since it only requires a single pass through the dataset, 2.3.1 Sample mean The conventional estimation of a sample mean ( online estimation of the following parameters: 14 2.3 From Conventional Analysis toIn Online this Analysis section, we discuss onlineline analysis computation by providing of several standard examples (often of the computed on- where we again use the updatethat need operator to ‘ be updated: inNote this that, case these appropriate are starting value(s) for all the elements However, even in this case, online computationof can a be sample beneficial. mean The is online computed update as follows:

Chapter 2 oeta,aporaesatn au()fralteelements the all for value(s) starting are appropriate these case that, this Note in updated: be ‘ to operator need that update the use again we where follows: as update computed online is The mean beneficial. sample be a can of computation online case, this in even However, esv ic tol eursasnl astruhtedataset, the through pass single a requires ( only it mean since sample tensive a of estimation conventional The mean Sample 2.3.1 parameters: following the on- of computed the estimation of online (often examples standard several of providing by computation analysis line online discuss we section, Analysis this Online In to Analysis Conventional From 2.3 14 ls ml okn xmlsa ela ed-ouefntosaeaalbeon exam- available these are of functions each ready-to-use for as one that well whether Note as same http://github.com/L-Ippel/Methodology examples exact method. working the estimation small are online ples, analysis or the offline of an their of results uses reformulations the exact are section counterparts: this offline in discuss we formulations online The requivalently, or 5. 4. 3. 2. 1. h fetsize effect the and models, regression linear covariance, sample the variance, sample the mean, sample the η 2 i nAOAframework). ANOVA an (in n x ¯ n n θ n x ¯ θ = =¯ = :=¯ := = { x n x ¯ ,n x, ¯ n n x n { = − − ,n x, ¯ + +1 1 1 } n 1 + +1 n , 1 } , ! i := ( , x =1 n x n n n tr ysaigteeeet of elements the stating by start and ’ x − hpe :Oln siainTutorial Estimation Online 2: Chapter − n n i . n x ¯ acut and count) (a offline x ¯ x n ¯ ) . − scmuainlyntvr in- very not computationally is ) . 1 , siaos edsusthe discuss We estimators. ) θ x ¯ edt echosen. be to need tesml mean). sample (the (2.2) (2.3) θ encudas ecmue ymaintaining sample by online computed an be that also note Also could mean hold. generally not will regretfully, this, – result final e -ma_nieipt=xi,tea=res) = > theta x[i], = mean_online(input <- res +} + +{ call first 1:n) the in at for(i function, > the within created are back # function> feed the into can updates you the that # such > needed is object res the # > NULL <- results:res the > 2) for = object sd an 5, create = # mean > N, = rnorm(n <- x 2: > is deviation 1000 standard <- and N 5 > is data the of mean 1000,# = > points data of number # > data: some create # > the language, [R] the In mean. comment. sample a the denotes of ‘#’ implementation online demonstration the a gives of which use code, the [R] of present we Below mean. sample the of update Methodology/Streaming_functions code, [R] instabilities. numerical to lead might which optda utm using runtime at computed mean where tr h ogtfrsaitca neeetof element an as statistic for sought the store Sx ncs ftesml aine(fe denoted (often variance sample the of case In variance Sample 2.3.2 can one mean, the of case the In provided. choose examples straightforwardly other the all for holds also This Tutorial Estimation Online 2: Chapter n rmofln ooln optto stecnetoa ehdo optn a computing dataset: the of through method passes conventional two the requires variance as sample computation online to offline from ing eipeetda xml fteoln omlto ftesml enin mean sample the of formulation online the of example an implemented We stesmover sum the is x ¯ SS hl h eodps sue ocmuetesmo squares. of sum the compute to used is pass second the while , mean_online() stesmo qae.Hr,tefis asi sdt siaetesample the estimate to used is pass first the Here, squares. of sum the is x steeeet of elements the as , hc a efudat found be can which , n x =0 ¯ = s ˆ 2 and Sx = = n hslte ehdhwvra osntactually not does a) however method latter This . n n SS x ¯ − − 1 =0 1 1 , ! i =1 θ n hsipeetto saready-to-use a is implementation This . sti trigpitde o matthe impact not does point starting this as nti ae h apema ol be could mean sample the case, this in ; ( x θ i n n )lets b) and , s − 2 http://github.com/L-Ippel/ := oei ob andwe mov- when gained be to is more ) x ¯ ) n 2 , +1 and Sx Sx rwwtotbound, without grow := Sx + x n where , (2.4) 15

Chapter 2 ). 17 and (2.9) , the . The ) (2.11) (2.10) 2008 ¯ y 2 X , ′ ,σ X is a vector and (0 = ¯ x β Pebay A ∼N . This method is ϵ B 1 − A . In such cases the compu- = y ˆ β , , y . ′ ′ n n x ϵ, y , X n n 1 } + x x − B ) , β + + X ′ A X , one computes B { A X = β . Note that, unlike the previous statistics, these = . := := independent variables (including an intercept), y θ =( q X B A ˆ β on an dependent variable x cor_online ) with observed data, including a column of 1’s for the . q 1) are conventionally estimated as follows: × β − , and n we present [R] code to compute covariances and correlations n denotes the error or noise. When assuming ( ϵ 2.A SC/ = xy denotes the transpose of ˆ and compute the update as follows: s is the vector containing the data of the dependent variable, ′ y is the matrix ( y ′ X X X Computing this row-by-row works as follows: We can define In Appendix = regression coefficients where Chapter 2: Online Estimation Tutorial Note that, contrary toneed the auxiliary online variables in computation this of casechoice since the of we sample which can of alternate variance, the updating we two of Similar sample do means to not is the updated first, case is ofcompute arbitrary the ( sample variance, to compute the sample covariance, we 2.3.4 Linear regression In applied research, often the aimcertain independent is variable to estimate group differencestation or of the a effect sample of mean a ofthe a research question. sample One variance often-used will approach notthe to necessarily answer relationship suffice research questions to between about answer onevariable, or is more fitting a independent linear regression variables model: and one dependent To obtain the regression coefficients, B well known in the parallel computing literature and is, for example, described in where online. Since computing a correlationances, entails and the a estimation covariance of all sample of means,ers these vari- wanting are also to included compute in the thistion code sum during snippet. their of analysis For cross we read- have products, implemented[R], the the and online covariance, these estimation can procedures or be in the found in correla- SSxy_online, the cov_online Streaming_functions file on github asfunctions respectively require 2 inputs, one for each variable. of the regression coefficients of the and intercept. Finally (2.5) (2.8) (2.7) (2.6) (which [1] ’s method . The func- x 1) = − uses the online ¯ x n ( Welford SS/ and a = . ) 2 var_online n ˆ s =1 ¯ x , , ) n ) ¯ − x and ¯ x n − . x − ) , i ) ¯ x )( n x ¯ x 1 x , which is often computed using: )( − , − . y ¯ ) , y )( − } are provided as default, in case the n ) n Chapter 2: Online Estimation Tutorial , ¯ ¯ } x y ¯ ¯ y x n x } − SS_online [1] ( x i − − − − ( x d and ¯ , y x, n n 1 n ( n n x y + x y = x − ( ( n =1 +1 , + i n ¯ ¯ x, SS, n ! x 1 1 n n , +( ¯ , y, SC, n { x n x SS +( 1 1 which is used since the online update of the sum +1 + 1 + = = ¯ x, := :=¯ := 1 − − − d SC n { x y θ d =1 SC n ¯ x n n n = := := :=¯ :=¯ n SS = = θ SS ¯ ¯ y x n = xy SC ˆ s function can be used. Note that in order to compute the n SS ’s method (1962), which, to keep notation consistent, we denote as: sd_online ). The values { 1 − is the sum of cross products. Again, making use of n Welford SC A numerically feasible online method to compute a sample variance in a data stream is vided by ( tion to compute the sum of squares is coined is the first data point), are required due to the fact that the sum of squares are di- Note the use of auxiliary variable of squares uses both the deviation fromprevious the sample current mean: sample mean as well as from the 16 2.3.3 Sample covariance Next we turn to the estimation ofinstance quantities which the depend sample on covariance multiple between variables, for where (1962), we can estimate the sample covariance online: In order to obtain the actual sample variance, we compute sum of squares function to compute theviation variance. In directly, order to obtain the standardvariance de- and the standard deviation, starting values of user does not provide starting values.

Chapter 2 srde o rvd trigvalues. starting provide not does user 16) ecnetmt h apecvrac online: covariance sample the estimate can we (1962), where fsursue ohtedvainfo h urn apema swl sfo the from as well as mean sample mean: current sample the previous from deviation the both uses squares of variable auxiliary of use the Note aineadtesadr eito,satn ausof values starting deviation, standard the and de- variance standard the obtain to order directly, In variance. viation the compute to function squares of sum nodrt banteata apevrac,w compute we variance, sample actual the obtain to order In etw unt h siaino uniiswihdpn nmlil aibe,for variables, between multiple covariance on sample depend the which quantities instance of estimation the to turn we Next covariance Sample 2.3.3 16 stefis aapit,aerqie u otefc httesmo qae r di- are squares of sum the that fact the to due ( required by are vided point), data first the is coined is squares of sum the compute to tion temis stream ueial esbeoln ehdt opt apevrac nadata a in variance sample a compute to method online feasible numerically A SC Welford n stesmo rs rdcs gi,mkn s of use making Again, products. cross of sum the is − 1 .Tevle { values The ). sd_online smto 16) hc,t epntto osset ednt as: denote we consistent, notation keep to which, (1962), method ’s SS n ucincnb sd oeta nodrt opt the compute to order in that Note used. be can function s ˆ SC xy = n x y ¯ ¯ SS θ = = SS n :=¯ :=¯ := := = n n n x ¯ n SC =1 d θ x { y n SC d − − − 1 := :=¯ := x, ¯ = = + 1 + +1 hc sue ic h nieudt ftesum the of update online the since used is which 1 1 +( SS x n x { ,S,n SC, y, , ¯ +( , n n 1 1 x ! ,S,n SS, x, ¯ ¯ n i + , +1 =1 n ( ( − x = y x y + x n n ( n 1 n n x, y , ¯ and d x ( − − − − i x ( [1] SS_online − } x n x y ¯ ¯ y x } ¯ ¯ , hpe :Oln siainTutorial Estimation Online 2: Chapter n ) n r rvdda eal,i aethe case in default, as provided are } − )( y , ) ¯ y . − , − )( hc sotncmue using: computed often is which , x 1 x ¯ x n )( x ¯ ) i , ) − x . − n x ¯ and x − ¯ ) n ) , , x ¯ =1 s ˆ n var_online 2 ) . = n a and SS/ Welford ( n x ¯ ssteonline the uses − = 1) x h func- The . smethod ’s [1] (which (2.7) (2.5) (2.6) (2.8) necp.Finally intercept. and the of coefficients regression the of ucin eur nus n o ahvariable. each for one inputs, 2 require respectively functions as github on file Streaming_functions cov_online the SSxy_online, correla- in found the in be or procedures can estimation these covariance, online and the the [R], implemented products, have read- we cross For analysis of their snippet. during sum code tion this the in compute included to also are wanting vari- these ers means, of sample all of covariance estimation a the and entails ances, correlation a computing Since online. where elkoni h aallcmuigltrtr n s o xml,dsrbdin described example, for is, and literature computing parallel the in known well B ooti h ersincoefficients, regression the obtain To where dependent one and model: variables regression linear independent a fitting more is or variable, one answer about between to questions research suffice relationship answer necessarily to the not approach will often-used variance One sample question. research a the of a mean of sample effect a the of or tation differences group estimate to variable is independent certain aim the often research, applied In regression Linear 2.3.4 we covariance, sample the compute to variance, sample ( the arbitrary compute of is first, case updated the is not to means do sample Similar of two we updating the variance, alternate of can which sample we of the since choice case of this computation in variables online auxiliary the need to contrary that, Note Tutorial Estimation Online 2: Chapter ersincoefficients regression = nAppendix In optn hsrwb-o ok sflos ecndefine can We follows: as works row-by-row this Computing X X X ′ y stemti ( matrix the is y ′ stevco otiigtedt ftedpnetvariable, dependent the of data the containing vector the is s n opt h paea follows: as update the compute and ˆ eoe h rnps of transpose the denotes xy = SC/ 2.A ϵ ( eoe h ro rnie hnassuming When noise. or error the denotes n epeet[]cd ocmuecvracsadcorrelations and covariances compute to code [R] present we n and , − β × r ovninlyetmtda follows: as estimated conventionally are 1) q . ihosre aa nldn ouno ’ o the for 1’s of column a including data, observed with ) cor_online x na eedn variable dependent an on β ˆ A B X q =( θ y needn aibe icuiga intercept), an (including variables independent := := . = oeta,ulk h rvossaitc,these statistics, previous the unlike that, Note . β = X A { B n computes one , X A ′ X + + β , ) B − x x + } 1 n n X , y ϵ, x n n ′ ′ . y , , β ˆ y = nsc ae h compu- the cases such In . A − 1 B ϵ hsmto is method This . ∼N A Pebay β x ¯ = (0 and savector a is X ,σ ′ , X 2 y ¯ 2008 (2.10) (2.11) ) The . the , (2.9) and 17 ).

Chapter 2 19 (2.13) (2.14) in a stream t SS is the sum of the and w w SS 100 SS online A_inv 80 groups, is the total sum of squares. The . t k k , ¯ x SS w k the − , 60 n groups. This requires computing the t w SS b k,n k + SS x , SS SS online t b b + − between k SS SS SS 40 x = = =1 sample size, x100 sample size, groups, and 2 k := ¯ η which is given by: k shows that computing both ¯ 2 x 20 η offline : inverted matrix (dashed line). k 2.13 0 each of the

, dotted line), or online estimation by online updating the elapsed time elapsed already presented how sums of squares can be computed in data within inv A 2.6 equals the sum of squares Figure 2.1: Computation timeestimation of (solid regression line), coefficients online using estimationline offline by inverting the matrix (on- b SS Equation sample mean within group last expression of Equation sums of squares suffices to compute the desired effect size. streams. The only complexity introduced in the ANOVAof example is the the computation sums of squares within each of the Chapter 2: Online Estimation Tutorial are carried out usingcial modern media interactive platforms, sample technologies, sizes such can asoften grow very on analyze quickly. so- the Traditionally, researchers data fromBetween-subjects such ANOVAs group can comparisons be using computed ancomputation fully of ANOVA the approach. online. effect size Here, we focus on the where ; k 1993 (2.12) , can be presents A 2.1 . The function . In Appendix B inv A = Escobar & Moser lm_online ˆ β , n inv x A inv ′ n x A n ′ n is large, this itself can be a compu- x x Chapter 2: Online Estimation Tutorial q , inv , } n 1+ A ): y B n , − x inv 1950 + . In practice, one would use a small part of the inv , A A { A B = := := , will grow quite quickly (quadratic) for the offline θ n is computed, a matrix inversion is required. Especially B inv to be non-singular. Obtaining the regression coefficients ˆ β A A in a data stream between the three estimation methods dis- , invert this matrix, after which the original matrix ˆ β (ANOVA) for A 2 -axis denotes the number of data points seen so far. While the only requires a matrix multiplication: η x ) and the type of computing system used, the qualitative results +1 q 2.12 ). Sherman & Morrison -axis (time) will heavily depend on the size of the model (the num- ; n>q y is the inverted matrix , at each value of 2006 ( β 1950 inv , A is faster than inverting the matrix at each time-point. This latter difference , directly online, using the Sherman–Morrison formula ( Although fairly simple, computing the regression coefficients this way has a dis- , we implement online linear regression in [R] using Sherman-Morrison formula. nv inv i Plackett however only affects the slope of the linear computation time. 18 Chu et al. advantage: Every time that In many studies in sociology anddistinct , groups it differ is of from interest one topants another, received examine for a whether treatment instance while because the other one group group of participants of did partici- not. When such 2.3.5 Effect size scale of the ber of parameters presented here will hold in general:timate the computational of time needed to obtain an es- Computation time of linear regression To illustrate the difference between onlinea and comparison offline methods, of Figure the computationalregression coefficients time required to computecussed the above. estimates The of the 2.B At the github page mentioned before, therequires function two is separate named inputs, one forindependent the variables. dependent The variable, latter and can one obviously input be for a vector the of multiple variables. data to create matrix where tationally intensive operation. We can addressA this by updating the inverse matrix, when the number of independent variables clearly illustrates the computational benefitsods. of online It methods can over alsoA offline be meth- seen that the direct online computation of the inverted matrix method, while it grows only slowly for the online methods (linear). This result at least have discarded from computer memory. The “small” part of the data that is used should using Equation

Chapter 2 icre rmcmue eoy h sal ato h aata sue should Equation used using is that data the have of least part at “small” The memory. computer from discarded d.I a lob enta h ietoln optto fteivre matrix inverted the of computation online direct the that result seen This meth- be offline A also over (linear). can methods methods It online online of ods. the benefits computational for the slowly illustrates only clearly grows it while method, ainlyitnieoeain ecnadesti yudtn h nes matrix, inverse the updating by this A address can We operation. variables intensive tationally independent of number the when where aat raematrix create to data rsne eewl odi eea:tecmuainltm eddt bana es- an obtain to needed time of computational the timate general: in hold will here presented parameters of ber the of scale the of The estimates above. the cussed compute to required time coefficients regression computational the Figure of methods, offline comparison and a online between difference the illustrate To regression linear of time Computation variables. multiple of the vector a for be input obviously one can and latter variable, The dependent variables. the independent for one inputs, named separate is two function requires the before, mentioned page github the At 2.B at eevdatetetwieteohrgopo atcpnsddnt hnsuch When not. partici- did of participants of group group one other the because while instance treatment whether a for examine received another, pants to one interest from of is differ it groups psychology, distinct and sociology in studies many In size Effect 2.3.5 that time Every advantage: al. et Chu 18 oee nyafcstesoeo h iercmuaintime. computation linear the of slope the affects only however Plackett i inv nv eipeetoln ierrgeso n[]uigSemnMrio formula. Sherman-Morrison using [R] in regression linear online implement we , lhuhfil ipe optn h ersincefiinsti a a dis- a has way this coefficients regression the computing simple, fairly Although ietyoln,uigteSemnMrio oml ( formula Sherman–Morrison the using online, directly , sfse hnivrigtemti tec iepit hslte difference latter This time-point. each at matrix the inverting than faster is A , inv 1950 β ( 2006 tec au of value each at , steivre matrix inverted the is y n>q ; ai tm)wl evl eedo h ieo h oe tenum- (the model the of size the on depend heavily will (time) -axis hra Morrison & Sherman ). 2.12 q +1 n h yeo optn ytmue,teqaiaieresults qualitative the used, system computing of type the and ) x η nyrqie arxmultiplication: matrix a requires only ai eoe h ubro aapit ens a.Wiethe While far. so seen points data of number the denotes -axis 2 A for (ANOVA) β ˆ netti arx fe hc h rgnlmatrix original the which after matrix, this invert , nadt tembtentetreetmto ehd dis- methods estimation three the between stream data a in A A β ˆ ob o-iglr bann h ersincoefficients regression the Obtaining non-singular. be to inv B scmue,amti neso srqie.Especially required. is inversion matrix a computed, is n θ ilgo ut ucl qartc o h offline the for (quadratic) quickly quite grow will , := := = B A { A A , inv npatc,oewudueasalpr fthe of part small a use would one practice, In . + 1950 inv x − , n B y ): A 1+ n } , inv , q hpe :Oln siainTutorial Estimation Online 2: Chapter x x slre hsisl a eacompu- a be can itself this large, is n ′ n A x n ′ inv A x inv n , β ˆ lm_online soa Moser & Escobar = A inv B nAppendix In . h function The . 2.1 A presents a be can , (2.12) 1993 k ; where ewe-ujcsAOA a ecmue ul nie ee efcso the on focus we Here, size effect online. approach. the ANOVA of fully computation an computed using be comparisons can group ANOVAs such Between-subjects from data researchers Traditionally, the so- quickly. analyze on very grow often as can such sizes technologies, sample platforms, interactive media modern cial using out carried are experiments Tutorial Estimation Online 2: Chapter ftesm fsurswti aho the of each within squares of sums computation the the is example of ANOVA the in introduced complexity only The streams. size. effect desired the compute to suffices uso squares of sums atepeso fEquation of expression last apema ihngroup within mean sample Equation SS b siain(oi ie,oln siainb netn h arx(on- matrix the inverting by offline line estimation using online coefficients line), regression (solid of estimation time Computation 2.1: Figure qastesmo squares of sum the equals 2.6 A inv within led rsne o uso qae a ecmue ndata in computed be can squares of sums how presented already elapsed time otdln) roln siainb nieudtn the updating online by estimation online or line), dotted , aho the of each 0 2.13 k netdmti dse line). (dashed matrix inverted : offline η 20 x 2 ¯ hw htcmuigboth computing that shows k hc sgvnby: given is which η =¯ := k 2 rus and groups, sample size, x100 =1 = = x 40 SS SS SS k between − + b b t online SS SS , x SS + k k,n b SS w t rus hsrqie optn the computing requires This groups. n 60 , − the k w SS x ¯ , k k t . stettlsmo qae.The squares. of sum total the is groups, 80 online A_inv SS 100 SS w w and stesmo the of sum the is SS t nastream a in (2.14) (2.13) 19

Chapter 2 21 (2.17) Arvas & ), using Max- denote the obser- 2014 , n given the observed ζ Keith ,...,x 1 x , ) ζ | i x ( f n =1 i ! )= n ). In the maximum likelihood framework, 2003 is a probability density function (PDF) (or proba- , states that the likelihood of ,...,x 1 () x | f ζ 2.17 ( L Myung . ). In the social sciences we often use the maximum likelihood framework 2.C is a set of parameters, ζ 2012 , To explain SGD, we will first discuss Gradient Descent (GD), an optimization vations. In words, Equation data is a product of thetice, it individual is probabilities often of (much) each simpler to of obtain the the maximum data likelihood points. estimates by In taking prac- There are multiple ways to obtaininstance estimates for using parameters a of least statistical squares models, approachimum for (see, Likelihood for estimation example, (ML), or usingSevgi the method of moments (e.g., (see, for an introduction, we want to obtain the parameter valuesAssuming which independent maximize observations, the the probability likelihood of function for thethe many data. models following takes form: Chapter 2: Online Estimation Tutorial 2.4 Online Estimation using StochasticIn Gradient the Descent previous section, we haveof discussed statistics how and to models estimate, that fullychological are online, data. often a We used have number also in illustrated the thetimation analysis computational for advantages of very of large sociological online datasets and es- and psy- datadiscussed streams. above However, for we each could of derive the exact methods it was summation possible methods; to using transform simple standard algebra, estimationtunately, this methods is to not online always variants. the Unfor- case.require Many multiple estimation iterations methods, especially through those a that staticonline, dataset, in part cannot because exactly even be when using implemented is conventional offline approximate. analysis the Examples estimation arethis logistic does regression not or mean that multilevel wemultitude models. can of only methods However, estimate available very for simple theIn models this online online: section, estimation there we of is focus more a on complextimation Stochastic method models. Gradient that can Descent be (SGD), used a for general estimation online of es- more complex statistical models. method that is oftenstarting used point in for conventional SGD. offlinequently We analysis provide provide and a the provides technical general a details. intuitionof logical to fitting Lastly, a GD we logistic will / regression provide SGD,Appendix model an and using applied subse- SGD, example for which [R] code is provided in 2.4.1 Offline Gradient Descent where bility mass function in the discrete case), and as before . w SS John, (2.15) (2.16) when ; 2 η ) in a data 1969 2 , η ). Effectively . Subsequently, -value is found, k ), due to inflation to compute p 2014 , k ¯ x 2011 level (and hence need , with ¯ x , group . Armstrong , 401 1) . " − k ) w k k =0 ,n ( k / Chapter 2: Online Estimation Tutorial − 10 ,SS ) ¯ . This function will compute x originates from group t n w ( 05) , substituting / . k,n SS Armitage, McPherson, & Rowe k ) and the bottom row indicates the global 0 w x k − − SS t ¯ x, n, SS (1 ! SS ( etasq_online − := elements. We have implemented the online compu- = -statistic online, one can use the information that is θ =1 F Simmons, Nelson, & Simonsohn F ; ’ as a function of the number of tests to prevent an in- +4 α α k 2 indicates the parameters at the 2012 , θ contains θ is used within each group in a function named 2 2.6 η We will continue our discussion of online learning methods with Stochastic Gra- In order to compute the dient Descent (SGD). SGD ismore complex an models optimization when analytical method solution which are not is available. useful to estimate instead of the 5% she startedflated with. Type I Perhaps error the is most known commonthis as correction Bonferroni correction of correction decreases ( this ‘ in- crease of Type I error. 20 which is only updated once a data point parameters, which are only singleThus, parameters in which total need to betation stored of in memory. where the top row of to be kept in memory for each group two or more groups are available.the data Note point that and this function to also whichbe included requires group during two this the inputs: data stream, point without belongs. the data New analyst groups interfering in can the easily analysis. The computation of the effect sizestream or thus proportion requires of the variance following explained parameters: ( Equation already available: sidered a questionable research practice ( Loewenstein, & Prelec of Type 1 error. For instance,on when whether a the researcher ANOVA decides is to significantnew collect data 10 more are times entering, data with the based actual significance Type level 1 error of equals 5% while It is important to notewhich that might repeated be testing attractive until if a results certain are small available for each new data point, is con-

Chapter 2 nwehrteAOAi infiat1 ie ihsgicnelvlo %while 5% equals of error 1 level Type significance actual based the with data entering, times are more 10 data collect new significant to is decides ANOVA researcher the a whether when on instance, For error. con- 1 Type is of point, data new Prelec each & Loewenstein, for ( available practice small are research questionable certain results a a sidered if until attractive testing be repeated might that which note to important is It led available: already Equation h optto fteefc ieo rprino aineepand( parameters: explained following variance the of requires proportion thus or stream size effect the of computation The eicue uigtesra,wtottedt nls nefrn nteanalysis. easily the can in interfering groups analyst New data the belongs. without point stream, data inputs: the this two during group requires included be which also to function this and that point Note data the available. are groups more or two aaees hc r nysnl aaeeswihne ob trdi memory. in of stored tation be to need total which in parameters Thus, single group only each are for which parameters, memory in kept be to of row top the where hc sol pae neadt point data a once updated only is which 20 in ecn SD.SDi notmzto ehdwihi sflt estimate to useful available. is not are which solution method analytical when optimization models an complex more is SGD (SGD). Descent dient error. I Type of crease in- ‘ this ( decreases correction of correction Bonferroni correction as this common known most is the error Perhaps I Type with. flated started she 5% the of instead nodrt opt the compute to order In ewl otneordsuso foln erigmtoswt tcatcGra- Stochastic with methods learning online of discussion our continue will We η 2.6 2 nafnto named function a in sue ihnec group each within used is θ contains θ , 2012 niae h aaeesa the at parameters the indicates 2 k α α +4 safnto ftenme ftsst rvn nin- an prevent to tests of number the of function a as ’ ; F imn,Nlo,&Simonsohn & Nelson, Simmons, F =1 θ saitcoln,oecnueteifrainta is that information the use can one online, -statistic = lmns ehv mlmne h niecompu- online the implemented have We elements. := − etasq_online ( SS ! (1 ,n SS n, x, ¯ t SS − − k x w 0 n h otmrwidctsteglobal the indicates row bottom the and ) k riae chro,&Rowe & McPherson, Armitage, SS k,n . / substituting , 05) ( w n t rgntsfo group from originates x hsfnto ilcompute will function This . ¯ ) ,SS 10 − hpe :Oln siainTutorial Estimation Online 2: Chapter / k ( ,n =0 k k w ) k − " . 1) 401 , Armstrong . group , x ¯ with , ee adhneneed hence (and level 2011 x ¯ k , 2014 p ocompute to ,det inflation to due ), k vlei found, is -value Subsequently, . .Effectively ). η , 2 1969 nadata a in ) η 2 ; when (2.16) (2.15) John, SS w . iiyms ucini h iceecs) n sbefore as and case), discrete the in function mass bility where h olwn form: takes following models data. many the the for function of likelihood probability the the observations, maximize independent which Assuming values parameter the obtain to want we introduction, an for (see, (e.g., moments of method the Sevgi using or (ML), example, estimation for Likelihood (see, for imum approach models, squares statistical least of a parameters using for estimates instance obtain to ways multiple are There Descent Gradient Offline 2.4.1 in provided is code [R] which for example SGD, subse- applied using and an model Appendix SGD, provide regression / will logistic we GD a Lastly, fitting to logical of intuition details. a general technical provides the a and provide provide analysis We quently offline SGD. conventional for in point used starting often is that method models. statistical complex more es- of online estimation general for a used (SGD), be Descent can that Gradient models. method Stochastic timation complex on a more focus is of we there estimation section, online: online this models In the simple for very available estimate However, methods only of can models. multitude we multilevel that mean or not regression does logistic this are estimation Examples the analysis approximate. offline conventional is implemented using when be even exactly because cannot part in dataset, online, static that a those through especially methods, iterations estimation multiple Many require case. Unfor- the variants. always online not to is methods this tunately, estimation algebra, standard simple transform using to methods; possible summation was it methods exact the derive of could each we for However, above streams. discussed data psy- and es- and datasets online sociological large of very of advantages for computational analysis timation the the illustrated in also number have used We a often data. online, are chological fully that estimate, models to and how statistics discussed of have we section, previous Descent the Gradient In Stochastic using Estimation Online 2.4 Tutorial Estimation Online 2: Chapter aai rdc fteidvda rbblte fec ftedt ons nprac- taking In by estimates points. likelihood data maximum the the obtain of to simpler each (much) of often probabilities is individual it tice, the of product a is data Equation words, In vations. oepanSD ewl rtdsusGain ecn G) noptimization an (GD), Descent Gradient discuss first will we SGD, explain To , 2012 ζ sasto parameters, of set a is 2.C .I h oilsine eotnuetemxmmlklho framework likelihood maximum the use often we sciences social the In ). . Myung L ( 2.17 ζ f | x () 1 ,...,x ttsta h ieiodof likelihood the that states , sapoaiiydniyfnto PF o proba- (or (PDF) function density probability a is 2003 .I h aiu ieiodframework, likelihood maximum the In ). n )= ! i =1 n f ( x i | ζ ) , x 1 ,...,x Keith ζ ie h observed the given n , 2014 eoeteobser- the denote ,uigMax- using ), ra & Arvas (2.17) 21

Chapter 2 λ 23 (2.20) each iteration, the al- ζ , ) n x ). | ζ version of GD is as follows: instead of ( n ℓ 2010 ( ∇ λ online + ζ Bottou := ζ ) and contains two dotted lines. These dotted lines illustrate 2003 2.2 ( ). This means that the process that generates the data, does 2010 , learn rate or a learn rate which is adaptive, for instance one could to denote that we are evaluating the log-likelihood function. Hence, , although it is not without difficulties. For example, the parameter n ζ ℓ fixed Bottou Gradient Descent can be a very effective method of finding the maximum likeli- Note that in the case that the dataset is no longer augmented, SGD can still be a Wilson and Martinez choose to let the learn rate decreasediscussion with on the choosing number the of appropriate iterations. learn Ain rate more for extensive complex models can be found 2.4.2 Online or Stochastic Gradient Descent Gradient Descent provides an iterative, approximate methodlihood to estimates. find maximum Deriving like- an effective iterating over the full dataset multiple times and updating Chapter 2: Online Estimation Tutorial three evaluations of the first orderthe derivative, three including points the (solid tangent black lines lines).likelihood at each increases When of by the increasing derivative has the ahas parameter a positive value. negative number, value, the the Opposite, slope if is negative, thethe and the derivative parameter likelihood increases value. by decreasing Obviously, thederivative aim is is equal to to find zero,of the in the parameter order derivative value to in where find Fig. the how the the maximum. next The evaluation is secondthe chosen. evaluation result GD of increases the the evaluationThe value is of magnitude positive the of and parameter the when visa derivativethe versa together parameter when value with is the the changed. result learn is rate negative. influence how much hood value of controls the size of the steps andlarge has can to be be problematic chosen since carefully: the alikelihood algorithm learn solution. could rate A which make learn is jumps rate too over which is thesmall too maximum steps, small and causes the thus algorithm many to iterations takehood will very estimate. be needed It to depends obtain onof the the the maximum model likelihood likeli- (e.g., function, complexity etc.) of what thefor learn model, either rate complexity a will be appropriate. One can choose instead of updating thedata, estimates we of update the based parameters onased each based estimate arriving on of data the iterations point. parameters using as SGDis all long will random as converge ( the to order an in unbi- whichnot the change data over points the arrive period in which the data are arriving. useful tool: Analyzing static Big Dataneeds using SGD to circumvents be that available the in entire memory. dataset By simulating that the data enter a point at a time where we use gorithm takes a small step toenters: a more likely parameter value every time a data point , is () l ζ ∇ then (2.18) (2.19) 2 presents -axis is the 2.2 dimensions: y q Gelman & Hill dimensional space q analytical maximum are , ) , ζ | ) i n x ( and we can make a step towards f ζ ln ,...,x Chapter 2: Online Estimation Tutorial 1 n =1 i x ! | ζ the error. ( ℓ )= is a vector, GD takes a step in n ∇ λ ζ + is negative, we need to step in the opposite direc- ζ ζ minimize ,...,x 1 x := | ζ ζ ( ℓ methods, which are also frequently applied in offline anal- -axis are possible parameter values and on the x (vector of first order derivatives) of the log-likelihood function. . If the slope at ζ provides an illustration of a gradient in a single dimension (e.g., the approximate gradient . Gradient Descent is the name most often used in the machine learning because we use it in the context of a likelihood function which we want 2.2 is a learn rate (also known as step size) chosen by the researcher and λ ). Ascent The GD algorithm can be stated as follows: However, exact analytical solutions are not always available. In such cases, one Figure Here, we are assuming the log-likelihood function to be well-behaved. 2 maximize ysis. One such approximate algorithm is calledent Gradient Descent, or actually Gradi- likelihood estimates given the appropriate models (see2007 for example, can resort to Intuitively, this algorithm states thateter one and chooses evaluates starting the values gradient for using each these param- values. In the simple case where to literature and classically used to has already been demonstrated: the estimation ofgression the models sample as mean, discussed and in of previous linear sections, re- actually which effectively replaces the producttion term for with the a maximum since sum, theels, and logarithm obtaining gives is a the a maximum monotonic same likelihood function. solu- is estimate For defined analytically, some after is mod- the straightforward: log-likelihood, wezero, take and solve the for derivative the of parameters the to log-likelihood, obtain set the required it estimates. to Effectively, this 22 the logarithm of the likelihood: scalar, and the gradient simplifiestion to regarding the the derivative, slope this of evaluation the log-likelihood gives function: informa- if the slope is positive where denotes the likelihood function. In the case that the maximum can be foundhigher at values higher of values of tion: we need to choose athrough lower value. the Using dataset this multiple intuition, times GD – iteratively – takes passing steps towards the maximum of the log- likelihood. The dashed curve ispoint the on likelihood of the a curve given we parameter can value. evaluate At the each first order derivative. Figure for each parameter (i.e., dimension) GD determinestive whether is the positive slope or of the negative, deriva- accordinglywhich GD causes takes the steepest a ascent step towards in the the maximum of the likelihood function. derivative). On the

Chapter 2 on ntecrew a vlaetefis re eiaie Figure derivative. order first each the At evaluate value. can parameter we given curve a the of likelihood on the point is curve dashed The likelihood. the On derivative). function. likelihood the of maximum the the in towards step ascent a steepest the takes causes GD which accordingly deriva- negative, the of or slope positive the is whether tive determines GD dimension) (i.e., parameter each for hog h aae utpetms–tkssestwrstemxmmo h log- the of that maximum case the the towards In steps passing takes – function. iteratively – likelihood GD times intuition, multiple this dataset Using the value. lower through a choose to need we tion: of values of higher values at higher found be can maximum the inrgrigtesoeo h o-ieiodfnto:i h lp spositive is slope the if informa- function: gives log-likelihood the evaluation of this slope derivative, the the regarding to tion simplifies gradient the and scalar, the denotes where h oaih ftelikelihood: the of logarithm the 22 rsinmdl sdsusdi rvosscin,actually re- sections, linear previous of in and discussed mean, as sample this models the Effectively, gression of to estimation estimates. it the required the set demonstrated: been obtain log-likelihood, already to the has parameters of the derivative for the solve and take zero, we log-likelihood, straightforward: the mod- is after some analytically, defined For estimate is solu- function. likelihood same monotonic maximum a the a is gives obtaining logarithm and els, the sum, since maximum a the with for term tion product the replaces effectively which ieaueadcasclyue to used classically and literature to tradeautstegain sn hs aus ntesml aewhere case simple the In values. param- these each using for gradient values the starting evaluates chooses and one eter that states algorithm this Intuitively, ss n uhapoiaeagrtmi aldGain ecn,o culyGradi- actually or Descent, Gradient ent called is algorithm approximate such One ysis. to resort can example, for 2007 (see models appropriate the given estimates likelihood maximize 2 ee eaeasmn h o-ieiodfnto ob well-behaved. be to function log-likelihood the assuming are we Here, Figure h Dagrtmcnb ttda follows: as stated be can algorithm GD The one cases, such In available. always not are solutions analytical exact However, Ascent ). λ salanrt as nw sse ie hsnb h eerhrand researcher the by chosen size) step as known (also rate learn a is 2.2 eas euei ntecneto ieiodfnto hc ewant we which function likelihood a of context the in it use we because rdetDseti h aems fe sdi h ahn learning machine the in used often most name the is Descent Gradient . gradient approximate rvdsa lutaino rdeti igedmnin(.. the (e.g., dimension single a in gradient a of illustration an provides ζ ftesoeat slope the If . vco ffis re eiaie)o h o-ieiodfunction. log-likelihood the of derivatives) order first of (vector x ai r osbeprmtrvle n nthe on and values parameter possible are -axis ehd,wihaeas rqetyapidi fieanal- offline in applied frequently also are which methods, ℓ ( ζ ζ | := x 1 ,...,x minimize ζ ζ sngtv,w edt tpi h poiedirec- opposite the in step to need we negative, is + ζ λ ∇ n savco,G ae tpin step a takes GD vector, a is )= ℓ ( h error. the ζ | ! x i =1 n 1 hpe :Oln siainTutorial Estimation Online 2: Chapter ,...,x ln ζ f n ecnmk tptowards step a make can we and ( x n i ) | ζ , ) , are nltclmaximum analytical q iesoa space dimensional emn&Hill & Gelman q y dimensions: 2.2 ai sthe is -axis presents 2 (2.19) (2.18) then ∇ ζ l () is , oih ae ml tpt oelkl aaee au vr ieadt point data a time every value parameter likely more a enters: to step small a takes gorithm hr euse we where ed ob vial nmmr.B iuaigta h aaetrapita time a at point a enter data the that simulating By dataset memory. entire in the available that be circumvents to SGD using needs Data Big static Analyzing tool: useful arriving. are data the which in period arrive the points over data change the not which unbi- in an order to the ( converge as random will long all is SGD as using parameters point. iterations the data of on arriving estimate based each ased on parameters based the update of we estimates data, the updating of instead trtn vrtefl aae utpetmsadupdating and times multiple dataset full the over iterating effective an like- Deriving maximum find estimates. to lihood method approximate iterative, an provides Descent Gradient Descent Gradient Stochastic or Online 2.4.2 choose can One appropriate. be will a complexity rate either model, learn for the what of etc.) complexity function, (e.g., likeli- likelihood model maximum the the the of on obtain depends to It needed be estimate. very will hood take iterations to many algorithm thus the causes and small steps, maximum too small the is which over too rate jumps is learn make which A rate could solution. learn algorithm likelihood a the carefully: since chosen problematic be be to can has large and steps the of size the controls of value hood much how influence negative. rate is learn result changed. the the is with value when parameter together versa the derivative visa when the parameter and of the positive magnitude of is value The evaluation the the increases of GD result evaluation chosen. the second is evaluation The next maximum. the the how the Fig. find where in to value derivative order parameter the in the of zero, find to to equal is is aim derivative the Obviously, decreasing by value. increases likelihood parameter derivative the and the the negative, is if slope Opposite, the the value, number, negative value. positive a parameter has a the has derivative increasing the by of When increases each at likelihood lines). lines black tangent (solid the points including three derivative, the order first the of evaluations three Tutorial Estimation Online 2: Chapter icsino hoigteaporaelanrt o ope oescnb found be can models complex extensive for more rate in A learn iterations. appropriate of the number choosing the on with discussion decrease rate learn the let to choose isnadMartinez and Wilson oeta ntecs httedtsti olne umne,SDcnsilb a be still can SGD augmented, longer no is dataset the that case the in that Note rdetDsetcnb eyefciemto ffidn h aiu likeli- maximum the finding of method effective very a be can Descent Gradient Bottou fixed ℓ ζ n lhuhi sntwtotdfclis o xml,teparameter the example, For difficulties. without not is it although , odnt htw r vlaigtelglklho ucin Hence, function. log-likelihood the evaluating are we that denote to er aeo er aewihi dpie o ntneoecould one instance for adaptive, is which rate learn a or rate learn , 2010 .Ti en httepoesta eeae h aa does data, the generates that process the that means This ). ( 2.2 2003 otistodte ie.Teedte ie illustrate lines dotted These lines. dotted two contains and ) ζ := Bottou ζ + online λ ∇ ( 2010 ℓ n ( eso fG sa olw:isedof instead follows: as is GD of version ζ | ). x n ) , ζ ahieain h al- the iteration, each (2.20) 23 λ

Chapter 2 25 ): 2.C and (2.22) (2.24) (2.23) 2002 , sgd_log Agresti . )) ). See Appendix i x ( p 2010 , − will not be not be necessary and is modeled as a function λ (1 . ) , n i X x | x log Bottou )) ) )) i i n =1 y x x ( ( y , , and a large enough data stream, SGD − p p ( ) λ n − − i , Pr n y y y λ, x ( , ( ( )+(1 } i λ f n =1 . The inclusion of x i ! ( ’s, the following online algorithm is obtained: θ + + ˆ p β,λ β = ˆ { λ β using a maximum likelihood approach, and hence ). The dataset contains information regarding house- = β := := log ∂ℓ ∂β θ i ˆ λ β y 2007 ( n =1 i ! )= , but it highlights that the learn rate could be a function of the β λ ( , the learn rate, in ℓ λ Gelman and Hill is the probability to score a 1 on p stream 3020 households. Among other variables, the dataset includes the distance in . Unlike linear regression, logistic regression does not have a closed-form solu- = X of tion to estimate the parameters even for offline analysis approximate methodsonline are can used. be Estimating done the using parameters SGD as follows. First, we specify the log likelihood: Second, we compute the gradient (see, for more details for instance 2.5 Online learning in practice: logistic regression in a2.5.1 data Switching to a safe well To illustrate a logistic regressionscribed in in a data stream, weholds use in an Bangladesh example and dataset, whether orThe de- not wells they were switch labeled to safe athe safe if labeling well of the to the arsenic wells, collect level researchers water. had collected was switched data from low to their enough. study own unsafe how wellwell Five many to households was years another dependent after safe well. on Switching whethersafe to owners well another of and whether a the safe householdsmake well that some were extra did effort willing have to an to go unsafe to shareN well the their were safe well. willing to The relatively small dataset consists of Chapter 2: Online Estimation Tutorial where Here we include which in the case ofWhen offline we estimation use would SGD be to estimate evaluated the for all the data at once. implemented SGD for logistic regressioncan in be used [R], to the estimate logistic function regression is in a called stream. for a fixed value of data stream. Given an appropriate choice of will correctly estimate the parametersfor of interest an ( implementation of SGD for the estimation of logistic regression in [R]. We (2.21) ). In the case of a 1998 , , ) β ) ), or whether or not people β X ( X ( 2000 exp , exp 1+ Chapter 2: Online Estimation Tutorial ● )= X ( Anderson p parameter estimates parameter )= X ● | maximum of the function. =1 y ( Pr ● Emmons, Wechsler, Dowdall, & Abraham

Figure 2.2: Graphicalwhere the display direct of maximization the ofan algorithm the likelihood such likelihood as function. function GD isthe can difficult, slope be In used of to cases the find tangent the maximum. and GD a uses learn rate to make steps towards the Likelihood dent variables are often binary,to examples vote include (democratic whether versus and republican, smoke how cigarettes people ( intent binary dependent variable, often athe logistic relationship between regression a model binary is dependentvariables: chosen variable and to continuous describe independent still estimating the parameters without seeing all data at once. 2.4.3 Logistic regression: an ExampleWe of present the an Usage example of to SGD illustrateindependent SGD variables in on which a we binary are dependent interested variable. in In the applied effect research, of depen- and letting the data stream in repeatedly, SGD can obtain unbiased estimates while 24

Chapter 2 h eainhpbtenabnr eedn aibeadcniuu independent describe continuous to and variable chosen variables: dependent is binary model a regression between relationship logistic the a often variable, dependent binary intent ( people cigarettes how smoke republican, and versus whether (democratic include vote examples to binary, depen- of research, often effect applied are the In in variables variable. interested dent dependent are binary we a which on in variables SGD independent illustrate SGD to of example Usage an the present of We Example an regression: Logistic 2.4.3 once. at data while all estimates seeing unbiased without obtain parameters can the SGD estimating repeatedly, still in stream data the letting and 24

Likelihood h lp ftetnetadalanrt omk tp oad the towards steps make to rate learn uses a GD and maximum. the tangent find the cases to of used In be slope difficult, can the is GD function function. as likelihood such likelihood the algorithm an of the maximization of direct display the where Graphical 2.2: Figure mos ehlr odl,&Abraham & Dowdall, Wechsler, Emmons, ● Pr ( y =1 aiu ftefunction. the of maximum | ● X )= parameter estimates p Anderson ( X )= ● hpe :Oln siainTutorial Estimation Online 2: Chapter 1+ exp , exp 2000 ( X ( X β ,o hte rntpeople not or whether or ), ) β ) , , 1998 .I h aeo a of case the In ). (2.21) a eue oetmt oitcrgeso nastream. called a We in is regression [R]. function logistic in estimate the to regression [R], used logistic be in can of regression estimation logistic the for SGD for implemented SGD of implementation ( an interest of for parameters the estimate correctly will of choice appropriate an Given stream. data of value fixed a for hc ntecs fofln siainwudb vlae o l h aaa once. at data the all for the evaluated estimate to be SGD would use estimation we offline When of case the in which eew include we Here aesm xr fott ot h aewl.Terltvl ml aae osssof consists dataset small relatively The to willing well. safe were their the well N share to unsafe go to an to have willing effort did extra were some that well make households safe the a whether and of another well owners to safe whether Switching on well. safe after dependent another years was households to many Five well well how unsafe own study enough. their to low from data switched was collected had water. researchers level collect wells, arsenic the to the of well labeling if safe the a safe to labeled switch were they wells not de- The or whether dataset, and example Bangladesh an in use holds we stream, data a in in scribed regression logistic a illustrate To well safe a to Switching data 2.5.1 a in regression logistic practice: in learning Online 2.5 where Tutorial Estimation Online 2: Chapter eod ecmuetegain se o oedtisfrinstance for details more for (see, gradient the compute we Second, vnfrofln nlssapoiaemtosaeue.Etmtn h parameters the Estimating used. are methods approximate analysis offline for even parameters the estimate to tion of niecnb oeuigSDa olw.Frt eseiytelglikelihood: log the specify we First, follows. as SGD using done be can online X = nielna ersin oitcrgeso osnthv lsdfr solu- closed-form a have not does regression logistic regression, linear Unlike . 00hueod.Aogohrvrals h aae nldstedsac in distance the includes dataset the variables, other Among households. 3020 stream p stepoaiiyt cr on 1 a score to probability the is emnadHill and Gelman λ ℓ h er ae in rate, learn the , ( λ β u thglgt httelanrt ol eafnto fthe of function a be could rate learn the that highlights it but , )= ! i =1 n ( 2007 y β λ ˆ i θ ∂β ∂ℓ log := := β = .Tedtstcnan nomto eadn house- regarding information contains dataset The ). sn aiu ieiodapoc,adhence and approach, likelihood maximum a using β λ { ˆ = β λ β, p ˆ + + θ s h olwn nieagrtmi obtained: is algorithm online following the ’s, ( ! i x h nlso of inclusion The . =1 n f λ i } )+(1 ( ( , ( x λ, y y y n Pr , i − − n λ ) ( p p − n ag nuhdt tem SGD stream, data enough large a and , , y ( ( x x y =1 n i i )) ) )) Bottou log x | x X i n , ) . (1 λ n smdlda function a as modeled is and ilntb o enecessary be not be not will − , 2010 p ( x i .SeAppendix See ). )) . Agresti sgd_log , 2002 (2.23) (2.24) (2.22) and 2.C ): 25

Chapter 2 27 3000 2500 -axis is the estimated param- y online 2000 . Here we present the influence of the n √ data points / offline 1 1500 1000 cients of logistic regression as more data enter. -axis is the data stream presented,

x

1.0 0.0 1.0 1.0 0.0 1.0 − 1.0 0.0 1.0 − 1.0 0.0 1.0 −

− Figure 2.3: Online (dotted) and offline (solid) estimated beta coeffi- . On the

b0 b1 b2 b3 2.5 For larger datasets and for continuous streams, which is what we primarily fo- clear example of learn rates thathardly change. are too low. In such cases the estimates do not,2.5.4 or Starting values Lastly, we present the resultsFigure of start the analysis witheter value, different and starting the lines values are in the movinghad average starting of values 100 {-2, estimates. -1, While 1, theto intercept 2}, the zero remaining and coefficients had the startingstarting learn values values equal rate on remained the finalbetween parameter the estimate. four Although lines, there allThis is four illustrates some of that difference SGD them does result not in reallythe depend very starting (given similar values an and appropriate parameter learn that estimates. rate) the on data dominate the results quickly. cused on in this chapter, the performance of SGD is often accurate. Chapter 2: Online Estimation Tutorial ) ) are 3 2010 (2.25) b , , ) ) ars X ars X dist X dist -axes present the data 3 x X b ) for a dataset this size. 3 b + ) require some more data. 018 1 + . b . The ars . The starting values for all Yang, Xu, King, & Lyu n 3 1 -axis the estimated parameter X ars b √ 2 =0 y X b ) and the term ( . 2 = 2 b + b ). For the current example, we sim- and 1 λ . We present the results of the four 2.3 3020 + , √ 2 Chapter 2: Online Estimation Tutorial dist /n 1 ,b 2014 X dist 1 , 1 X b ,b 1 0 b + b 0 + b ( 0 . b ( exp 2.D exp Tarrès & Yao 1+ ) from the household, and arsenic level that is present in )= dist X ars ,X presents the moving average of 100 estimates during the stream. -axes present the estimated parameter values. During the stream we dist ). y 2.4 X | , we present the results of fitting a logistic regression in a data stream -axis presents the data stream and the ars x X 2.3 =1 y ( ’s are zero, see Appendix β Pr The estimates of the effect of the arsenic level ( In practice, choosing which variables to include from often a large set of vari- ) and the effect of the distance to the next safe well ( 0 b These moving averages are presented in Figure stream and the monitored the estimated parameter values using a moving average of 100 estimates. four The dashed line is fluctuating,the even fact towards that the the end learn of rate is the still dataset: quite this large is ( due to the water ( tion in a data stream areor for based instance on the Ridge online regression Lasso ( ( ulate that the data enterboth a offline point and at online implementations a to time predictto by whether a the analyzing safe household well the switched (coded data 1) row-by-row ortwo using did independent not variables switch and (coded an 0). interaction The term: model we estimate contains ables, could be a challenging task on its own. Methods to deal with variable selec- A smaller learn rate (oralgorithm one more, that but increases decreases the more risk rapidly) of introducing would more stabilize bias. the SGD 26 meters to a safe well ( the online estimation of theintercept. intercept All remains fluctuation close has to gone the for offline the estimation two of smallest learn the rates. These two are a results of 4 different rates:learn .1, rates .01, for .001, the and intercept,Again, though the the learn ratesvalue. were Figure equal for allClearly, coefficients. the curve with theof largest learn this rate fluctuation shows is much already more gone fluctuation. when Much we lower the learn rate to .01, although 2.5.3 Learn rates To gain some insight in the sensitivity of SGD to its learn rates we also present the 2.5.2 Results In Figure with four coefficients and an adaptive learn rate, very accurate from the beginning of( the data stream. The estimates of the intercept We thus estimate the four coefficients

Chapter 2 etu siaetefu coefficients four the estimate thus We eyacrt rmtebgnigo h aasra.Teetmtso h intercept the of estimates The stream. data the ( of beginning the from accurate very ihfu ofcet n naatv er rate, learn adaptive an and coefficients four with Figure In Results 2.5.2 necp.Alflcuto a oefrtetosals er ae.Teetoaea are two These rates. the learn smallest although of two estimation .01, the offline to for the gone rate to has learn close fluctuation remains the All intercept lower intercept. the we of Much when estimation fluctuation. gone online more already the much is shows fluctuation rate this learn largest of the with curve the coefficients. Clearly, all for equal Figure were value. rates learn the the though the Again, intercept, present and also the we .001, for rates .01, rates learn .1, learn its rates: to different SGD 4 of of sensitivity results the in insight some gain To rates Learn 2.5.3 ( well safe a to meters 26 mle er ae(roeta erae oerpdy ol tblz h SGD the bias. stabilize more would introducing of rapidly) risk more the decreases increases but that more, one algorithm (or rate learn smaller A oasf el(oe )o i o wth(oe ) h oe eetmt contains estimate we model term: The interaction 0). an (coded and switch variables not independent did using two or row-by-row 1) data (coded switched the well household safe analyzing the a whether by to predict time to a implementations online at and point offline a both enter data the that ulate selec- ( ( Lasso variable regression online Ridge with the on deal instance based to for or Methods are stream own. data its a on in tion task challenging a be could ables, h ahdln sflcutn,ee oad h n ftedtst hsi u to due ( is large this quite dataset: still the is rate of learn end the the that towards fact even the fluctuating, is line dashed The h ae ( water the four hs oigaeae r rsne nFigure in estimates. presented 100 are of averages average moving moving These a using values parameter estimated the monitored the and stream b 0 n h feto h itnet h etsf el( well safe next the to distance the of effect the and ) npatc,cosn hc aibe oicuefo fe ag e fvari- of set large a often from include to variables which choosing practice, In h siae fteefc fteasnclvl( level arsenic the of effect the of estimates The Pr β saezr,seAppendix see zero, are ’s ( y =1 2.3 X x ars ai rsnstedt temadthe and stream data the presents -axis epeetterslso tigalgsi ersini aastream data a in regression logistic a fitting of results the present we , | X 2.4 y ). dist ae rsn h siae aaee aus uigtesra we stream the During values. parameter estimated the present -axes rsnstemvn vrg f10etmtsdrn h stream. the during estimates 100 of average moving the presents ,X ars X dist )= rmtehueod n rei ee hti rsn in present is that level arsenic and household, the from ) 1+ ars&Yao & Tarrès exp 2.D exp ( b . 0 ( b + 0 b + b 0 1 ,b b X 1 , 1 dist X 2014 ,b 1 /n dist hpe :Oln siainTutorial Estimation Online 2: Chapter 2 √ , + 3020 2.3 epeetterslso h four the of results the present We . λ 1 and .Frtecreteape esim- we example, current the For ). b + b 2 = 2 . n h neato em( term interaction the and ) b X y =0 2 √ b ars X ai h siae parameter estimated the -axis 1 3 n ag u ig Lyu & King, Xu, Yang, h trigvle o all for values starting The . ars The . b . + 1 018 eur oemr data. more some require ) + b 3 o aae hssize. this dataset a for ) b X x 3 ae rsn h data the present -axes dist X dist X ars X ars ) ) , , b (2.25) 2010 3 are ) ) ozr n h er aeremained rate equal values learn starting the had coefficients and remaining zero the 2}, intercept to the 1, While -1, estimates. {-2, 100 values of starting average had moving the in are values lines the starting and different value, eter with analysis the start of Figure results the present we Lastly, values Starting or 2.5.4 not, do estimates the cases such In low. too are change. hardly that rates learn of example clear Tutorial Estimation Online 2: Chapter ue ni hscatr h efrac fSDi fe accurate. often is SGD of performance the chapter, this in on cused quickly. results the dominate data on the rate) estimates. that learn parameter appropriate and an values similar (given starting very depend the really in not result does them SGD difference that of some illustrates four is This all there lines, Although four estimate. the parameter between final the on values starting o agrdtst n o otnossras hc swa epiaiyfo- primarily we what is which streams, continuous for and datasets larger For

2.5 b3 b2 b1 b0 nthe On .

iue23 nie(otd n fie(oi)etmtdbt coeffi- beta estimated (solid) offline and (dotted) Online 2.3: Figure −1.0 0.0 1.0 −1.0 0.0 1.0 −1.0 0.0 1.0 −1.0 0.0 1.0 x ai stedt tempresented, stream data the is -axis inso oitcrgeso smr aaenter. data more as regression logistic of cients 1000 1500 1 offline / data points √ n eew rsn h nuneo the of influence the present we Here . 2000 online y ai steetmtdparam- estimated the is -axis 2500 3000 27

Chapter 2 - λ 29 ● 3000 offline Wilkinson & Task ● 2500 on SV 2 ● ). It is considered a questionable ). When adopting an online learning 2000 2006 , ● 2012 on SV 1 , data points Strube more data enter. ): the data stream is operated on online (for those 1500 ), as opposed to null hypothesis testing. ). ● John et al. ( 2013 1999 on SV −1 2009 , , , 05 ● p<. 1000 on SV −2 ● Marz & Warren ●

Figure 2.5: Online (dotted) andstarting offline (solid) values: estimated intercepts -2, for -1, 1, and 2 coefficients of logistic regression as

500 Cappé & Moulines 1.0 0.0 1.0 − Finally, it is not always feasible to translate all analyses from the offline frame- From a conceptual point of view, we do explicitly mention that we are not pro- nal level (i.e., too many false positives, research practice to repeatedly testonce for the a effect yields significant a effect andapproach, we stop encourage data researchers gathering tosize focus of on the obtaining effects precise of interest, estimatesForce in of on adherence Statistical the to Inference the APA guidelines ( work to the online framework. Fordata instance, that the are analysis nested of within binary units,logistic dependent which multilevel data, are (or in random the effects-) offline case models,synonym. often have Therefore, analyzed not future using yet research found should aels, be proper aimed such online as at logistic translating multilevel complex models, mod- active to research work the is online carried learning out framework. indescribing this Note online field, that with approximations for of example the recent well-knowngorithm publications ( Expectation-Maximization al- Chapter 2: Online Estimation Tutorial practical solution to this in thearchitecture computer ( science literature is to adoptcomputations a that so-called where specified in advance), butlyzed also offline at stored a and later can point thus in bethe time ana- (often size using of parallelization the methods dataset). to deal with moting repeated null hypothesis significance testingavoided. in data When streams; this a should researcher be decidessignificant result to of stop the the hypothesis data test, collection the once Type she I error obtains rate a increases above nomi- ), ), 2014 2014 , , ● 3000 offline Kaptein ( ● RStorm Toshniwal et al. 2500 ( M. Hofmann & Klinken- ), ( Bifet, Holmes, Kirkby, & ,( on 1/n 2015 ● , MOA 2000 RapidMiner Chapter 2: Online Estimation Tutorial Apache Storm ), ● ), and the analysis. Any information that is not on LR .001 coefficients of logistic regression as 2010 data points , 2009 /n , 1 before more data enter. 1500 ● throughout this text – to store the relevant information on LR .01 θ ● Berthold et al. 1000 ( ) often require extensive programming knowledge and focus mainly Karau, Konwinski, Wendell, & Zaharia on LR .1 ● ( KNIME 2010

, ), Figure 2.4: Online (dotted) andlearn offline (solid) rates: estimated intercepts .1, for .01, .001, and ●

500 1.0 0.0 1.0 − 2013 , Neumeyer, Robbins, Nair, & Kesari A solution to this latter issue could be to run simultaneously different analyses Second, we have to stress that for the application of online methods the analyst Practically, it has to be noted that at this moment not many off-the-shelf statisti- ( and/or models, such that at a laterysis point or in model time to a use. decision can This,available be of made to course, which does store anal- require the that sufficient enough statistics computer of memory is multiple models. A frequently adopted research question. Online learning methods make usereferred of to a the limited elements set of of quantitiesand – to subsequently estimate model parameters.know what This information means is that it required isstored is important forgotten to and is impossible to retrieve if the data themselves are not stored. on the infrastructure of analyzing largethe datasets. methods and There software is developed still byused a computer by large scientists, social gap and scientist between those to analyze thatcustomed their can to. data be streams using models that they are ac- has to know beforehand what type of analysis and model is required to answer the cal packages are available to actuallysoftware, analyze for data streams. instance The (and currentlyApache not available Spark exhaustive) S4 berg Pfahringer 2.6 Considerations analyzing Big DataIn and this Data chapter, Streams we haveHowever, discussed some online important learning issues as a remain.conceptual way issues Here, to related we deal to analyzing discuss with Big two Big Data. Data. practical and two 28

Chapter 2 S4 exhaustive) Spark available not Apache currently (and The instance streams. data for analyze software, actually to available are packages cal two and practical Data. Data. Big two Big with discuss analyzing to deal we related to Here, issues way conceptual remain. a as issues learning important online some discussed However, have we Streams chapter, Data this and In Data Big analyzing Considerations 2.6 28 vial osoetesfcetsaitc fmlil oes rqetyadopted frequently A models. multiple is memory of computer statistics enough sufficient that the require anal- store does which course, to made of be available This, can decision use. a to time model in or point ysis later a at that such models, stored. and/or not are themselves data the if retrieve to impossible is and to forgotten important is stored is required it that is means information This what know parameters. model estimate subsequently to – and quantities of of set the elements limited answer the a to to of referred required use is make methods model learning and Online analysis question. of research type what beforehand know to has ac- are they that models using streams be data to. can their customed that analyze to those between scientist and gap social scientists, large by computer a used by still developed is software There and methods datasets. the large analyzing of infrastructure the on Pfahringer berg ( rcial,i a ob oe hta hsmmn o ayoftesefstatisti- off-the-shelf many not moment this at that noted be to has it Practically, ouint hslte su ol et u iutnosydfeetanalyses different simultaneously run to be could issue latter this to solution A analyst the methods online of application the for that stress to have we Second, emyr obn,Ni,&Kesari & Nair, Robbins, Neumeyer, , 2013 −1.0 0.0 1.0 500 ● iue24 nie(otd n fie(oi)etmtditret for intercepts estimated (solid) offline and (dotted) Online 2.4: Figure er ae:.,.1 01 and .001, .01, .1, rates: learn ), , 2010 KNIME ( ● on LR.1 aa,Knisi edl,&Zaharia & Wendell, Konwinski, Karau, fe eur xesv rgamn nweg n ou mainly focus and knowledge programming extensive require often ) ( 1000 etode al. et Berthold ● θ on LR.01 hogotti et–t tr h eeatinformation relevant the store to – text this throughout ● 1500 oedt enter. data more before 1 , /n 2009 , data points 2010 ofcet flgsi ersinas regression logistic of coefficients on LR.001 h nlss n nomto hti not is that information Any analysis. the ,and ), ● ), pceStorm Apache hpe :Oln siainTutorial Estimation Online 2: Chapter RapidMiner 2000 MOA , ● 2015 on 1/n ,( ie,Hle,Krb,& Kirkby, Holmes, Bifet, ( ), .Hfan&Klinken- & Hofmann M. ( 2500 ohia tal. et Toshniwal RStorm ● ( Kaptein offline 3000 ● , , 2014 2014 ), ), eciigoln prxmtoso h elkonEpcainMxmzto al- Expectation-Maximization ( publications gorithm well-known recent the example of for approximations with that field, online Note this describing in framework. out learning carried online is the work research to active mod- models, complex multilevel translating logistic at as online such aimed proper be els, a should found research yet using future not analyzed Therefore, have often synonym. models, case offline effects-) the random in (or are data, multilevel which dependent logistic units, binary within of nested analysis are the that instance, data For framework. online the to work ( guidelines APA the Inference to the Statistical adherence on of in Force estimates interest, of precise effects obtaining the on of focus size to gathering researchers data encourage stop we approach, and effect a significant yields effect a the for once test repeatedly nomi- to above practice increases research positives, a rate false obtains error many I she too Type (i.e., once the level collection nal test, data hypothesis the the stop of to result significant decides be researcher should a this streams; When data in avoided. testing significance hypothesis null repeated moting with deal to dataset). methods the parallelization of using size (often ana- time the be in thus point can later and a stored at offline also lyzed but advance), in specified where so-called that a computations adopt to is literature science ( computer architecture the in this to solution practical Tutorial Estimation Online 2: Chapter ial,i sntawy esbet rnlt l nlssfo h fieframe- offline the from analyses all translate to feasible always not is it Finally, pro- not are we that mention explicitly do we view, of point conceptual a From

−1.0 0.0 1.0 ap Moulines & Cappé 500 trigvle:-,-,1 n ofcet flgsi ersinas regression logistic of coefficients 2 and 1, -1, for -2, intercepts estimated values: (solid) offline starting and (dotted) Online 2.5: Figure ● az&Warren & Marz ● on SV−2 1000 p<. ● 05 , , , 2009 on SV−1 1999 2013 ( one al. et John ● ). ,a poe onl yohsstesting. hypothesis null to opposed as ), 1500 :tedt temi prtdo nie(o those (for online on operated is stream data the ): oedt enter. data more Strube data points , on SV1 2012 ● , 2006 2000 .We dpiga nielearning online an adopting When ). .I scniee questionable a considered is It ). ● on SV2 2500 ● ikno Task & Wilkinson offline 3000 ● 29 λ -

Chapter 2 31 #number of observations #generate data #deviance x #deviance y #update number of observations #compute covariance #estimate std.dev. x #estimate std.dev. y #estimate correlation (y[i]-ybar)#update sum of cross products * sy) (x[i]-xbar) #update sum of(y[i]-ybar) squares for x * #update sum of squares for y * * x+rnorm(N) * Chapter 2: Online Estimation Tutorial 2.A Online Correlation > N <-> 1000 x <-> rnorm(N, y 5,2) <-> 1.5 # because> a # correlation at> requires least n 2 = points> 1; we for xbar (i start+{ = in with x[1]; n=1 2:N) ybar+ = y[1];+ SC = dx 0;+ SSx dy = <-+ 0; (x[i]-xbar) SSy n <- =+ (y[i]-ybar) 0; xbar <-+ <- xbar+(x[i]-xbar)/n n+1 SSx+ <- SC SSx+dx + Sxy <-+ SC+(x[i]-xbar) <- #update ybar SC/(n-1) mean <-+ x ybar+(y[i]-ybar)/n SSy+ <- sx SSy+dy + sy <-+} sqrt(SSx/(n-1)) #update rxy <- mean> sqrt(SSy/(n-1)) y <- Sxy/(sx , , ). A possible ). 2016b Cappé & Moulines , 2015 , ), and this work should be Robert 2008 , Gelman, Carlin, Stern, & Rubin Chapter 2: Online Estimation Tutorial Ippel, Kaptein, & Vermunt ), to collect social and psychological data have 2001 , Wolfe, Haghighi, & Klein ; ). Suggestions for parallel computations and more efficient 1998 , 1977 , L. F. Barrett & Barrett Neal & Hinton ; ). In the case of conjugate priors, the posterior can be updated relatively easily. We hope that this chapter motivates applied researchers to explore new research Despite the versatility of online methods as displayed in this tutorial, many chal- Dempster et al. many new insights in human behavior andhuman can emotions provide and new attitudes. research areas to study procedures for the EM algorithm have2009 already been proposed ( extended to make the EM algorithm applicable for streaming data. areas that are opened upin by a the data technological stream. opportunity to We monitor believe individuals that data streams can provide social scientists with lenges remain: common methods suchor multilevel as models (latent) are not factor easily analysis, estimatedapproaches online mixture for (See models, for multilevel a models, discussion andway online to deal with these( types of analyses is to alter for instance the EM algorithm which are used to analyze datato analyze streams; data there streams, are for instance many the morebe Bayesian techniques used framework available to can in update some the cases estimated2004 also parameters (e.g., However, in the situation whereparticle the filtering prior are is required to not update conjugate, the other posterior ( methods such as made data streams more apparent andter, more we prevalent in discussed recent how years. social In scientistsregular this estimation can chap- methods can deal be with applied in these the largeWe context hope of datasets, to continuous and data have streams. how contributed to openingresearch up questions the as possibilities to well answer as bothor new existing continuous types data of streams. research Note questions that using we large have datasets only touched upon a few methods 30 2.7 Discussion Using new data collection methodspling and technologies, ( for instance experience sam-

Chapter 2 nadt tem ebleeta aasrascnpoiesca cetsswith scientists study to social areas provide research attitudes. new can and provide emotions streams can human and data behavior human that individuals in believe insights monitor We new to many opportunity stream. technological data the a by in up opened are that areas data. streaming for applicable algorithm EM the make to extended ( proposed algorithm been EM already 2009 have the algorithm instance EM for the alter for to procedures is analyses of types ( these with deal to online way and discussion models, a multilevel for models, (See for mixture online approaches estimated analysis, easily factor not are (latent) models as multilevel or such methods common remain: lenges as such methods ( posterior other the conjugate, update not to required is are prior filtering the particle where situation the in However, (e.g., parameters also 2004 estimated cases the some update in can to available framework used techniques Bayesian be more the many instance methods for are few streams, there a data streams; analyze upon to data touched analyze only to datasets have used large we are using that which questions Note research streams. of data types continuous existing new or both as answer well to possibilities as the questions up research opening to contributed how streams. have data and continuous to datasets, of hope context We large the these in applied with be deal can methods chap- can estimation this regular scientists In social years. how recent discussed in prevalent we more ter, and apparent more streams data sam- made experience instance for ( technologies, and pling methods collection data new Using Discussion 2.7 30 eptre al. et Dempster ehp htti hpe oiae ple eerhr oepoenwresearch new explore to researchers applied motivates chapter this that hope We chal- many tutorial, this in displayed as methods online of versatility the Despite ; easily. relatively updated be can posterior the priors, conjugate of case the In ). el&Hinton & Neal .F art Barrett & Barrett F. L. , 1977 , 1998 .Sgetosfrprle opttosadmr efficient more and computations parallel for Suggestions ). ; of,Hgih,&Klein & Haghighi, Wolfe, , 2001 ,t olc oiladpyhlgcldt have data psychological and social collect to ), pe,Kpen Vermunt & Kaptein, Ippel, hpe :Oln siainTutorial Estimation Online 2: Chapter emn aln tr,&Rubin & Stern, Carlin, Gelman, , 2008 Robert ,adti oksol be should work this and ), , 2015 , ap Moulines & Cappé 2016b ). .Apossible A ). , , x -Sxy/(sx <- y sqrt(SSy/(n-1)) > mean <- rxy #update sqrt(SSx/(n-1)) +} <- sy + SSy+dy sx <- + SSy ybar+(y[i]-ybar)/n x <- + mean SC/(n-1) ybar #update <- SC+(x[i]-xbar) + <- Sxy + SSx+dx SC <- + n+1SSx xbar+(x[i]-xbar)/n <- <- + xbar 0; (y[i]-ybar) + = <- n SSy (x[i]-xbar) 0; + <- = dy SSx + 0; dx = SC + y[1]; = + ybar 2:N) n=1 x[1]; with in = +{ start (i xbar we for 1; points > = 2 n least requires > at correlation # a > because # 1.5 > <- 5,2) y rnorm(N, > <- x 1000 > <- N > Correlation Online 2.A Tutorial Estimation Online 2: Chapter * x+rnorm(N) * * yi-br udt u fsursfry for squares of sum #update * x for squares of (y[i]-ybar) sum #update (x[i]-xbar) sy) * yi-br#paesmo rs products cross of sum (y[i]-ybar)#update udt ubro observations of number #update y #deviance x #deviance data #generate observations of #number etmt correlation #estimate y std.dev. #estimate x std.dev. #estimate covariance #compute 31

Chapter 2 x[i])) * 33 x+e)))) * %c(1,x[i]) * x+e)/(1+exp(-2+1.5 x[i])/(1+exp(beta[1]+beta[2] * * (y[i]- p) % * Chapter 2: Online Estimation Tutorial 2.C Stochastic Gradient Decent –> Logistic regression N> x> <-3000 e> <-rnorm(N,1,1) y <-rnorm(N) > beta <-> <-rbinom(N,1, c(0,0) (exp(-2+1.5 for(i in+{ 1:N) ++ p+} beta <-beta> <- + exp(beta[1]+beta[2] lambda %A_inv)/C) * %x[i,])) * %x[i,]% * %y[i] * %A_inv% * Chapter 2: Online Estimation Tutorial %x[i,]% * %t(x[i,]) * %B #compute coefficients * #C is a scalar #fit linear regression: #update A as long as it#update is B not invertible #invert A when n>p x[,2]+rnorm(N) * +++ C+} A_inv <-+} A_inv beta <- - as.numeric((1+x[i,]% ((A_inv%> <- A_inv% ++{ if(i==3) ++} + A_inv <-+{ solve(A) if(i>=3) #update inverted matrix A_inv +{ ++{ if(i<3) ++} + A B <- A+x[i,]% <- B + as.matrix(x[i,])% > x1 <-> rnorm(N, x 5,2) > <- y matrix(c(x0,x1),nrow=N) > <- A 3+1.5 > <- #the matrix(0,nrow=2,ncol=2); as.matrix B> and <- for as.numeric c(0,0) (i are in required 1:N) to get [r] running 32 2.B Online linear regression > N> <- x0 1000 <- rep(1, N) # generate data

Chapter 2 ea< A_inv% <- ((A_inv%> as.numeric((1+x[i,]% - <- beta A_inv +} <- A_inv +} C A_inv + matrix inverted + #update + if(i>=3) solve(A) +{ <- as.matrix(x[i,])% A_inv + + B <- +} + A+x[i,]% if(i==3) <- +{ B + A + +} + if(i<3) running+{ [r] get + to 1:N) required in are +{ (i c(0,0) as.numeric for <- and > B as.matrix matrix(0,nrow=2,ncol=2); #the <- > 3+1.5 A <- > matrix(c(x0,x1),nrow=N) y <- data > generate 5,2) x # rnorm(N, > <- N) x1 rep(1, > <- 1000 x0 <- > N > regression linear Online 2.B 32 * x[,2]+rnorm(N) udt sln si sntinvertible not is it as long as A #update regression: linear #fit Ci scalar a is #C n>p when A #invert B #update * B#opt coefficients #compute %B * %t(x[i,]) * %x[i,]% hpe :Oln siainTutorial Estimation Online 2: Chapter * %A_inv% * %y[i] * %x[i,]% * %x[i,])) * %A_inv)/C) ea <-beta beta +} p + + 1:N) +{ in (exp(-2+1.5 for(i c(0,0) <-rbinom(N,1, > <- beta > <-rnorm(N) y <-rnorm(N,1,1)> e <-3000> x > N regression Logistic > – Decent Gradient Stochastic 2.C Tutorial Estimation Online 2: Chapter * yi-p % p) (y[i]- * * x[i])/(1+exp(beta[1]+beta[2] x+e)/(1+exp(-2+1.5 * %c(1,x[i]) * x+e)))) 33 * x[i]))

Chapter 2 34 Chapter 2: Online Estimation Tutorial

2.D Wells data example

> beta <- c(0,0,0,0) > for(i in 1:nrow(wells.dat)) Chapter 3 +{ + n <- n+1 + x <- c(1,wells.dat[i,c(dist, ars, dist*ars)]) + y <- wells.dat$switch[i] Online Estimation of + p <- exp(sum(beta*x))/(1+exp(sum(beta*x))) + beta <-beta + 1/sqrt{n}*(y - p) %*%x Individual-Level Effects using +} > Streaming Shrinkage Factors

Abstract

In the last few years, it has become increasingly easy to collect data from individ- uals over long periods of time. Examples include smart-phone applications used to track movements with GPS, web-log data tracking individuals’ browsing behav- ior, and longitudinal (cohort) studies where many individuals are monitored over an extensive period of time. All these datasets cover a large number of individuals and collect data on the same individuals repeatedly, causing a nested structure in the data. Moreover, the data collection is never ‘finished’ as new data keep stream- ing in. It is well known that predictions that use the data of the individual whose individual-level effect is predicted in combination with the data of all the other in- dividuals, are better in terms of squared error than those that just use the individual mean. However, when data are both nested and streaming, and the outcome vari- able is binary, computing these individual-level predictions can be computationally challenging. In this chapter, we develop and evaluate four computationally-efficient estimation methods which do not revise “old” data but do account for the nested data structure. The methods that we develop are based on four existing shrinkage factors. A shrinkage factor is used to predict an individual-level effect (i.e., the prob- ability to score a 1), by weighing the individual mean and the mean over all data points. In a simulation study, we compare the performance of existing and newly developed shrinkage factors. We find that the existing methods differ in their pre- diction accuracy, but the differences in accuracy between our novel shrinkage factors and the existing methods are small. Our novel methods are however computation- ally feasible in the context of streaming data.

This chapter is submitted as Ippel, L., Kaptein, M.C, & Vermunt, J.K. Online Estimation of Individual-Level Effects using Streaming Shrinkage Factors 34 Chapter 2: Online Estimation Tutorial

2.D Wells data example

> beta <- c(0,0,0,0) > for(i in 1:nrow(wells.dat)) Chapter 3 +{ + n <- n+1 + x <- c(1,wells.dat[i,c(dist, ars, dist*ars)]) + y <- wells.dat$switch[i] Online Estimation of + p <- exp(sum(beta*x))/(1+exp(sum(beta*x))) + beta <-beta + 1/sqrt{n}*(y - p) %*%x Individual-Level Effects using +} > Streaming Shrinkage Factors

Abstract

In the last few years, it has become increasingly easy to collect data from individ- uals over long periods of time. Examples include smart-phone applications used to track movements with GPS, web-log data tracking individuals’ browsing behav- ior, and longitudinal (cohort) studies where many individuals are monitored over an extensive period of time. All these datasets cover a large number of individuals and collect data on the same individuals repeatedly, causing a nested structure in the data. Moreover, the data collection is never ‘finished’ as new data keep stream- ing in. It is well known that predictions that use the data of the individual whose individual-level effect is predicted in combination with the data of all the other in- dividuals, are better in terms of squared error than those that just use the individual mean. However, when data are both nested and streaming, and the outcome vari- able is binary, computing these individual-level predictions can be computationally challenging. In this chapter, we develop and evaluate four computationally-efficient estimation methods which do not revise “old” data but do account for the nested data structure. The methods that we develop are based on four existing shrinkage factors. A shrinkage factor is used to predict an individual-level effect (i.e., the prob- ability to score a 1), by weighing the individual mean and the mean over all data points. In a simulation study, we compare the performance of existing and newly developed shrinkage factors. We find that the existing methods differ in their pre- diction accuracy, but the differences in accuracy between our novel shrinkage factors and the existing methods are small. Our novel methods are however computation- ally feasible in the context of streaming data.

This chapter is submitted as Ippel, L., Kaptein, M.C, & Vermunt, J.K. Online Estimation of Individual-Level Effects using Streaming Shrinkage Factors 37 data online Ippel, ) or 2012 ( ). An illustrative example the individual-level effects, 2016a Pébay, Terriberry, Kolla, & Ben- Morris and Lysy , . Estimating a sample mean in a t x ). While this method solves the prob- =1 re-estimate n t ). Note that our current focus is solely ! 2006 , 1 n 2011b , rather than Cappé )), we restrict our attention solely to random-intercept , “computing estimates of model parameters on-the-fly, update Ippel, Kaptein, & Vermunt ; 2016b ( , for an introduction on many more data-stream techniques, in- 1999 Efraimidis & Spirakis , 2007 , online learning Bottou ). . In a data stream, the data collection is never “finished”, for instance in click- 2016 Aggarwal , The aim of this chapter is to develop and evaluate different shrinkage factors A possible approach to efficiently obtain estimates in a situation where the data In general, various methods are available to deal with data streams. For instance Chapter 3: Online Shrinkage factors which can be used towhere new efficiently data estimate present the themselves individual-levelstream over effect time. in a We refer situation behavior to data this on situation a as website. a predictions of In the the individual-level case effectsstream, of are methods required real-time that at prediction, can each where moment up-to-date during the without storing the data andservations by become continuously available” updating ( theon estimates estimating as the more individual-level ob- effectsaccounting in for the the grouping context present ofexplanatory in variables nested the (for data data. instance to and While take hence intoseen the account in inclusion when the of data an stream, individual additional previous was purchases, last when etc.) estimating in shrinkage the factors (see, prediction for model instance isKaptein, possible and Vermunt models with binary outcomes. come streaming in, is to estimate theestimated individual-level effects shrinkage in factors. real time using Onlineparameter estimation (e.g., a (or mean, online or regression learning) coefficient)batch is implies of) updated that data using point a a single and (or some small data sufficient points, statistics (e.g., a summation ofis the previous the computation of the sample mean greatly improve the speed of the estimationnett process ( one could subsample from the datapoints stream in (i.e., the at analysis random while include excluding others), some andobtain of analyze predictions the the ( data subsample in order to lem of a growing dataset, it inherently limitsto the information include and data risks not of being specific able that individuals deals who are well of with future asliding interest. window data is Another also stream a method is subsample of apoints. the The sliding-window data, advantages existing approach. of of this only method the Effectively are mostin that the recent memory which data burden the is fixed data-generating and, in processobservations cases most is heavily not influence stationary the overthe resulting size time, of predictions. the the window However, most often choosing requires recent any domain events knowledge: meaningful, too too large small a might window not(see, might catch computationally be too expensive cluding sliding windows). In this chapter, wedata focus streams: on another method to deal with (3.1) ) com- , a way to 2007 ( ) monitored . 2010 , results in better ). Following this ( ¯ p 1977 , shrinkage factor group mean ) studied ‘click’ behavior in Linares et al. ) studied the effect of schools on 2010 as the ¯ ( p, ˆ ¯ p β Chapter 3: Online Shrinkage factors + 2010 ( i Efron & Morris p , and ) ˆ β − . Because we focus on binary outcomes, the ) showed that predicting the individual-level data where the outcome variable of interest is Quintelier =(1 1956 ( i ) using only the data of this particular individual, ˆ i µ µ grouped individual mean Stein in 1961 introduced the idea of a Cheng and Cantú-Paz shrinkage factor Murnaghan, Sihvonen, Leatherdale, and Kekki as the i p . The resulting weighted combination can be denoted as follows: i µ to index the grouping factor which, in this particular case of multiple , which is the estimated probability that an individual will click. Note i i µ , with the estimated sample mean over all data points, is the so-called i p β W. James and Stein . For example, In a now classical paper, To settle for an unambiguous terminology throughout, we adopt the terms of the in our case denotes the proportion of (for instance) clicks. In the remainder of this , accurately and computationally efficiently. i i result, weigh the estimated mean of ana individual prediction and of the mean over data points to obtain out-of-sample predictions (see, for instance, thus without taking the othersquared individuals prediction into error than account, when results theseHe in other demonstrated a individuals larger that are average taken combining into thedenote account. estimated mean of an individual, which we effects of one individual (i.e., click-through rate is being estimated. However, theter methods do discussed in not this restrict chap- themselvesindividuals to but the could also nesting be of used observations foror that groupings schools such are within as districts. nested individuals within within Our schools interestµ lies in estimating the individual-level effect that we use observations nested within the individual, denotes the individual customer whose chapter we refer to arrive continuously and datasets are rapidly augmented. latter e-commerce example. Here, a researcher couldlevel be effect interested in the individual- e-commerce, i.e., whether an individual clicksthis on latter an case, advertisement the on repeatedly a observed website.individual. click-through In In behavior each is and nested every within instance the good above estimates researchers are of interested the in obtaining probabilityual, of while an respecting event the occurring nested atefficient structure the methods in of level the obtaining of such data. the estimates in individ- In a this situation chapter, where the we collected examine data pared the smoking behavior (smokinggrouped within versus different none schools. smoking) ofstudents students voting behavior that (vote are versus nochildren over vote), a and long period of timeto (repeated measurements investigate nested within the children) effectsymptoms. of air Furthermore, pollution on the presence (or absence) respiratory 36 3.1 Introduction Researchers often encounter binary where p

Chapter 3 p where atrecmec xml.Hr,arsace ol eitrse nteindividual- the in interested effect be level could researcher a Here, example. e-commerce latter augmented. data examine rapidly collected are we the datasets where chapter, and situation this continuously a In individ- arrive in estimates the data. such of obtaining the level of in the methods structure efficient at nested occurring the event respecting an while of ual, probability obtaining in the interested of are researchers estimates above good the instance within every nested and is each behavior In In click-through individual. website. observed a repeatedly on the advertisement case, respiratory an latter on absence) this clicks (or individual an presence whether the i.e., e-commerce, on pollution Furthermore, air of symptoms. effect children) the within nested investigate measurements (repeated to time of period long and a vote), over children no versus are (vote that behavior voting students students of smoking) schools. none different versus within grouped (smoking behavior smoking the pared binary encounter often Researchers Introduction 3.1 36 hpe erfrto refer we chapter rshoswti itit.Oritrs isi siaigteidvda-ee effect individual-level the estimating in lies µ interest schools Our within within individuals nested districts. as within are such schools groupings that or for observations used of be nesting also could the but to individuals themselves chap- restrict this not in discussed do methods ter the whose However, customer estimated. being individual is the rate denotes click-through individual, the within nested observations use we that fet foeidvda (i.e., individual one of effects edmntae htcmiigteetmtdma fa niiul hc we which individual, an of mean estimated account. denote the into combining taken average are that larger individuals a demonstrated other in He these results when account, than error into prediction individuals squared other the taking without thus eg h siae eno nidvda n h enoe aapit oobtain to points data over mean the of and prediction individual a an of mean estimated the weigh result, instance, for (see, predictions out-of-sample i i cuaeyadcmuainlyefficiently. computationally and accurately , norcs eoe h rprino frisac)cik.I h eane fthis of remainder the In clicks. instance) (for of proportion the denotes case our in ostl o nuabgostriooytruhu,w dp h em fthe of terms the adopt we throughout, terminology unambiguous an for settle To nanwcasclpaper, classical now a In o example, For . .JmsadStein and James W. β p i steso-called the is ihteetmtdsml enoe l aapoints, data all over mean sample estimated the with , µ i i hc steetmtdpoaiiyta nidvda ilcik Note click. will individual an that probability estimated the is which , oidxtegopn atrwih nti atclrcs fmultiple of case particular this in which, factor grouping the index to µ i h eutn egtdcmiaincnb eoe sfollows: as denoted be can combination weighted resulting The . p i sthe as unga,Shoe,Lahrae n Kekki and Leatherdale, Sihvonen, Murnaghan, hikg factor shrinkage hn n Cantú-Paz and Cheng n16 nrdcdteie fa of idea the introduced 1961 in Stein niiulmean individual grouped µ µ i ˆ sn nytedt fti atclrindividual, particular this of data the only using ) i ( 1956 =(1 Quintelier aaweeteotoevral fitrs is interest of variable outcome the where data hwdta rdcigteindividual-level the predicting that showed ) eas efcso iayotoe,the outcomes, binary on focus we Because . − β ˆ ) and , p fo Morris & Efron i ( 2010 + hpe :Oln hikg factors Shrinkage Online 3: Chapter β p ¯ ˆ p, ( ¯ sthe as 2010 tde h feto col on schools of effect the studied ) iae tal. et Linares tde cik eairin behavior ‘click’ studied ) ru mean group hikg factor shrinkage , 1977 p ¯ ( .Floigthis Following ). eut nbetter in results , 2010 . monitored ) ( 2007 a to way a , com- ) (3.1) stecmuaino h apemean sample the of computation the previous the is of summation a (e.g., statistics points, sufficient data small some (or and single a a point using data that updated of) implies is batch coefficient) learning) regression or online mean, (or a (e.g., estimation parameter Online using time real factors. in shrinkage effects individual-level estimated the estimate to is in, streaming come outcomes. binary with models Vermunt and possible Kaptein, is instance model for prediction (see, factors the shrinkage in estimating etc.) when last purchases, was previous additional individual stream, an data of the when inclusion in account the seen into hence take While and to instance data. data (for the nested variables in explanatory of present context grouping the the for in accounting effects ob- individual-level more the as estimating estimates on the ( updating available” continuously become by servations and data the storing with without deal to method another on streams: focus data we chapter, this In windows). sliding cluding expensive too be computationally catch might (see, not window might a small large too too meaningful, knowledge: events domain any recent requires choosing often most However, window the the predictions. of time, size resulting the over the stationary influence not heavily is most cases observations process in and, data-generating fixed is the burden data which memory recent the that in most are Effectively the method only this of of approach. existing advantages data, sliding-window The the points. a of subsample is method a stream also Another is data window interest. sliding a future with of well are who deals individuals that able specific being of not risks data and include information the to limits inherently it dataset, growing a of lem to order in subsample data ( the the predictions analyze of obtain and some others), excluding include while random analysis at the (i.e., in stream points data the from subsample could one ( process nett estimation the of speed the improve greatly rdcin fteidvda-ee fet r eurda ahmmn uigthe during up-to-date moment where each can prediction, at that real-time required methods are of stream, effects case individual-level the the In of predictions a website. as a situation on this data to behavior situation refer We a in time. effect over stream individual-level themselves the present estimate data efficiently new where to used be can which hpe :Oln hikg factors Shrinkage Online 3: Chapter ngnrl aiu ehd r vial oda ihdt tem.Frinstance For streams. data with deal to available are methods various general, In osbeapoc oefiinl banetmtsi iuto hr h data the where situation a in estimates obtain efficiently to approach possible A h i fti hpe st eeo n vlaedfeetsrnaefactors shrinkage different evaluate and develop to is chapter this of aim The , Aggarwal 2016 nadt tem h aacleto snvr“nse” o ntnei click- in instance for “finished”, never is collection data the stream, data a In . ). Bottou nielearning online , 2007 , famds&Spirakis & Efraimidis 1999 o nitouto nmn oedt-temtcnqe,in- techniques, data-stream more many on introduction an for , ( 2016b ; pe,Kpen Vermunt & Kaptein, Ippel, update cmuigetmtso oe aaeeson-the-fly, parameters model of estimates “computing , ) ersrc u teto oeyt random-intercept to solely attention our restrict we )), Cappé ahrthan rather , 2011b n 1 , 2006 ! .Nt htorcretfcsi solely is focus current our that Note ). t n re-estimate =1 .Wieti ehdsle h prob- the solves method this While ). x t siaigasml eni a in mean sample a Estimating . , orsadLysy and Morris éa,Trier,Kla Ben- & Kolla, Terriberry, Pébay, 2016a h niiullvleffects, individual-level the .A lutaieexample illustrative An ). ( 2012 or ) Ippel, online data 37

Chapter 3 i n 39 (3.3) × Efron i n ), the nor- is a I 2012 ( by the number of obser- observations, i n ). Morris and Lysy ). In this section, we discuss four influenced ) with 2012 , I ( ) i 2 i ) is as follows: there is information 1956 2 , ,σ is the population average, which below i 3.1 µ µ, τ µ ( ( Stein N N ; ∼ ∼ i i 1961 µ y , Morris and Lysy ). ). This shrinkage factor assumes normally distributed 2008 , 2012 ( the residual variance, 2 i σ W. James & Stein ; 1977 is the response vector of individual , i y We estimate the BB using theYoung-Xu method & of Chan moments estimator (see, for instance Heuristic estimator, (HN): Unlike the previous three shrinkagedoes factors, not the HN rely on anyad-hoc distributional estimator assumptions. which solely This depends shrinkageindividual. on factor the is number an of observations of an James Stein estimator, (JS):Morris Here, and we Lysy use theindividual-level effects. formulation This assumption as is clearly introduced violateding by for the binary data data; transformation, us- also suggested by mal distribution is approximated. Furthermore,across this all shrinkage individuals. factor is equal Approximate Maximum Likelihood estimator, (ML):individual The specific. ML The is level of unlike shrinkage the is vations JS of an individual. This shrinkage factor alsolevel assumes effects that the are individual- normally distributed. Hence,formation suggested here by we also use the data trans- Beta Binomial estimator, (BB): This shrinkage factordistribution, does not instead assume the a normal individual-leveldistribution. effects Similar are to ML, assumed the to level of havelevel shrinkage of is a shrinkage individual is specific Beta influenced and by the the number of observations of an individual. • • • • 3.2.1 The James Stein estimator The JS is historically important sincebe it is considered among in one the of literature. the Thisindividual-level first effects. shrinkage shrinkage Thus, factor factors the to assumes assumed normally data-generating distributed model is: identity matrix, Chapter 3: Online Shrinkage factors 3.2 Estimation of shrinkage factors The intuition of aavailable shrinkage both model on (Eq. theindividual-level group effect level towards as the onfrom group the the mean, individual neighbors”, level, thereby the reducing so estimator the& by “borrows average Morris squared shrinking strength prediction the error ( shrinkage factors and develop their online implementations: where and (3.2) n as op- 3.2 , we discuss 3.6 Ippel, Kaptein, & Ver- , ) t describes four existing shrink- ( ¯ p 3.2 − , +1) t Chapter 3: Online Shrinkage factors ¯ p ( n +1) n − t ( estimation procedure only stores x x ). However, not every offline estimation +1 + +1 + ) ) n p t t 2014 ( ( online , n p := := ¯ ¯ p = =¯ ). Hence, we often have to resort to approximate n +1) +1) t t 1998 ( ( Kaptein , ¯ p n estimation procedure stores all the observations and for each we apply the developed online shrinkage factors to analyze 3.5 , we discuss when the individual-level effect should be estimated, offline 3.3 Neal & Hinton . This results in a time-constant update. Attractively, using online esti- ; presents a simulation study where we compare the online and offline 3.2 is the total number of observations and ‘:=’ is an assignment operator, mean- 3.4 n 2016a , Note that the This chapter is organized as follows. Section in memory, and, when a new data point enters, these are updated according to ¯ 38 data stream using online learning can be done as follows: or equivalently, where ing that the left-hand sideThroughout is this updated chapter using we the will expressionposed use on to the using the notation explicit right-hand superscripts. presented side. in Equation sponse request to achieve higherthe response rates. limitations of Finally, in the Section shrinkagesetting. factors and their possible extensions to a broader individual-level effects. Here we explicitlyanisms. explore different In data-generating Section mech- a real dataset. The datasetdropouts contains in data panel coming data from is a anon-response large serious per panel threat, repeatedly study. we observed Because individual. focus on These predictionsthe predicting could the choice facilitate probability of of which respondents to invite for the next wave, or personalize the re- age factors and develops thetors. In online Section implementation of eachan of issue the which shrinkage arises fac- dueSection to the fact thatimplementations new of data the present shrinkage themselves factors over in time. terms of the accuracy of the estimated be discarded from memory ( procedure can be used exactly for onlinemunt estimation (see, e.g., solutions. In thisa chapter, number we of evaluate shrinkage the factors.tremely accuracy large static of Note datasets that online can although be approximations analyzed we of using focus the same on methods. data streams, ex- a data stream thus takes increasinglyto more be time processed. because On more the andp contrary, more the data need Equation mation methods, there is no need to revisit previous data points, which can therefore new estimate revisits the older data points. Updating the sample mean offline in

Chapter 3 h iiain ftesrnaefcosadterpsil xesost broader a to extensions possible their and factors setting. re- shrinkage Section the the in personalize Finally, of or limitations rates. wave, response the next higher the achieve for to invite request to sponse respondents which of of probability facilitate choice the could predicting the predictions These on focus individual. Because observed we study. repeatedly threat, panel per serious large non-response a a is from data coming panel data in contains dropouts dataset The dataset. real a mech- Section estimated data-generating In the different of explore anisms. accuracy explicitly the we Here of terms time. effects. in individual-level over factors themselves shrinkage present the data of new implementations that fact the to Section due fac- arises shrinkage which the issue of an each of implementation Section online In tors. the develops and factors ex- age streams, data methods. on same the focus using of we analyzed approximations be although can online that datasets Note of static large accuracy tremely factors. the shrinkage evaluate of we number chapter, a this In solutions. e.g., (see, estimation munt online for exactly used be can procedure ( therefore can memory which from points, data discarded previous be revisit to need no is there methods, mation Equation need data the more contrary, p in and the offline more On mean because processed. time sample be more the to increasingly Updating takes thus points. stream data data a older the revisits estimate new hogotti hpe ewl s h oainpeetdi Equation in side. presented right-hand notation the the on use expression will the we using chapter updated this is Throughout side left-hand the that ing where equivalently, or oe ouigepii superscripts. explicit using to posed aasra sn nielann a edn sfollows: as done be can learning online using stream data 38 ¯ nmmr,ad hnanwdt on nes hs r pae codn to according updated are these enters, point data new a when and, memory, in hscatri raie sflos Section follows. as organized is chapter This oeta the that Note , 2016a n 3.4 stettlnme fosrain n :’i nasgmn prtr mean- operator, assignment an is ‘:=’ and observations of number total the is 3.2 rsnsasmlto td hr ecmaeteoln n offline and online the compare we where study simulation a presents ; hsrslsi iecntn pae trciey sn nieesti- online using Attractively, update. time-constant a in results This . el&Hinton & Neal 3.3 offline edsuswe h niiullvlefc hudb estimated, be should effect individual-level the when discuss we , 3.5 eapytedvlpdoln hikg atr oanalyze to factors shrinkage online developed the apply we siainpoeuesoe l h bevtosadfreach for and observations the all stores procedure estimation n p ¯ , Kaptein ( ( 1998 t t +1) +1) n .Hne eotnhv orsr oapproximate to resort to have often we Hence, ). =¯ = p ¯ =¯ := := p n , online ( ( 2014 t t p n ) ) + +1 + +1 .Hwvr o vr fieestimation offline every not However, ). x x siainpoeueol stores only procedure estimation ( t − n +1) n ( p ¯ hpe :Oln hikg factors Shrinkage Online 3: Chapter t +1) , − 3.2 p ¯ ( ecie oreitn shrink- existing four describes t ) , pe,Kpen Ver- & Kaptein, Ippel, 3.6 ediscuss we , 3.2 sop- as n (3.2) and where dniymatrix, identity is: model distributed data-generating normally assumed assumes to the factors factor Thus, shrinkage shrinkage effects. first individual-level This the literature. of the one in among is considered it be since important historically is JS The estimator Stein James The 3.2.1 implementations: online their develop and factors shrinkage ( error the prediction strength shrinking squared Morris average “borrows by & the estimator so reducing the thereby level, neighbors”, individual mean, the the group from on the as towards level effect group individual-level the (Eq. on model both shrinkage available a of intuition The factors shrinkage of Estimation 3.2 factors Shrinkage Online 3: Chapter • • • • dhcetmtrwihsll eed ntenme fosrain fan of observations of an number is the factor on individual. shrinkage depends This solely which assumptions. estimator distributional ad-hoc any on rely HN the not factors, does shrinkage three previous the Unlike (HN): estimator, Heuristic instance for (see, estimator moments Chan of & method Young-Xu the using individual. BB an the of observations estimate of We number the the by and influenced Beta specific is individual shrinkage a is of shrinkage level have of level to the assumed ML, to are Similar effects distribution. individual-level normal a the assume instead not does distribution, factor shrinkage This (BB): estimator, Binomial Beta trans- data the use also we by here suggested formation Hence, distributed. normally individual- are the that effects assumes level also factor shrinkage This individual. an of JS vations is the shrinkage unlike of level is The ML specific. The individual (ML): estimator, Likelihood Maximum Approximate equal is factor individuals. shrinkage all this across Furthermore, approximated. is distribution mal by suggested also us- transformation, data; data binary the for by ing violated introduced clearly is as assumption This formulation effects. individual-level the use Lysy we and Here, Morris (JS): estimator, Stein James y i , stersos etro individual of vector response the is 1977 ; .Jms&Stein & James W. σ i 2 h eiulvariance, residual the ( 2012 , 2008 .Ti hikg atrasmsnral distributed normally assumes factor shrinkage This ). ). orsadLysy and Morris , y µ 1961 i i ∼ ∼ ; N N Stein ( ( µ ,τ µ, µ 3.1 i stepplto vrg,wihbelow which average, population the is ,σ , 2 1956 sa olw:teei information is there follows: as is ) i 2 i ) ( I , 2012 with ) influenced .I hsscin edsusfour discuss we section, this In ). orsadLysy and Morris ). n i observations, ytenme fobser- of number the by ( 2012 I sa is ,tenor- the ), n i Efron × (3.3) 39 n i

Chapter 3 is 41 i can ˆ (3.9) w , is a . We, (3.10) js ¯ , w ¯ i β w ˆ µ , ) are con- : ), then, the N ′ i 3.10 ¯ w ). The w . ). Similar to the 3.2 n , and 3.3 ¯ same result using p , , , is less trivial than . Thus, the vector of is the between indi- n, ¯ N N w N js / ∈ ∈ exact ˆ β , , . Using Eq. t t i i js N N if if ). Lastly, to obtain / ∈ ∈ SS t t 2 i i 3.5 : ) ¯ ¯ if if i w w the transformed individual means − (Eq. ¯ i n/n /N all i w ) ( ˆ 2 i w ) ), not over data points ( w ¯ i + w N ′ /N + i ) − . ′ i i i js ¯ i n/n ˆ ) is added to µ w w i w individuals arrive in the data stream, but does ( SS w + − to ). In order to obtain the t i ¯ ¯ + − i w w ˆ w new ), we check whether the individual belonging to the js js N N ( ( ), which is also a summation over individuals. For the 3.8 SS SS individual arrives (again) in the data stream. To check ! js is new or not, the set of unique identifiers of individuals the current estimate of the transformed individual mean. ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ t := i SS i , Eq. w ¯ := w N js observed to transform we make use of a similar update regime as used for SS js 3.7 by subtracting the old contribution (i.e, the previous SS ¯ denotes the previous contribution to the transformed individual means change when new data enter, not only w ′ i all is the previous transformed individual mean from the last time this indi- js ) entered and t ′ i i w SS The online update of the transformed individual means, The remaining parameter needed for the estimation of Note that, due to the influx of new data, group parameters ( needs to be available. N count observations or the onlinesample update mean of averaged over the individuals sample ( meancount (Eq. of individuals ( new data point is observedon before. whether Different or update not functions an areto used individual a depending is known observed individual, before. therefirst, is When correct already the a data contribution point of belongs this individual to the individual that just entered ( new contribution (i.e., the updated the offline and online estimationshould procedure, be updated every timeindividual a means data is point inefficient enters. and becomes Updating infeasibleuals all when grows rapidly. the these Hence, number transformed we of approximate individ- thecurrent offline individual. version We by discuss updating this only issue the in more detail in Section Chapter 3: Online Shrinkage factors unique identifiers of all knownis individuals labeled observed with up an to identifiernew now. individual such is that Each observed we a individual new can elementunique track is identifiers added the grows to individual when the set overnot time. grow when If an a whether the individual viduals sum of squares ( estimation of vidual ( be estimated, with which we can estimate imputed in Eq. stantly changing. The dataplies transformation that uses these group parameters. This im- where where ) N the (3.7) (3.6) (3.5) (3.4) (3.8) ¯ p is set of N ). While the , the JS shrink- 3.6 js ˆ Morris and Lysy β using the JS shrink- , and i . , µ 2 3.5 )) − p , / 2¯ 1) 3.4 the individual mean, , the total number of obser- − i indicates that we only focus − setting, we now turn to deriv- p is the sum of squares between , , , t )) js n/N js N N p ˆ β 2¯ , / ∈ ∈ = w SS 2 t t Chapter 3: Online Shrinkage factors offline ) − arcsin(1 (see, Eq. ¯ i n i i ¯ w 2 − , )+ ¯ − js if if is incremented when a new data point i ¯ n/n ) − 2 js i w js ( ˆ in an SS N β p are straightforward, for instance the com- − to merely count the observations. We do N 2 js ), where − , . Using this data transformation to estimate js SS N =1 +1 ! i i N ˆ β − β and (1 = = N N i +1 ¯ ) + arcsin(1 w 2012 n/n w " js ( ¯ n , n ˆ β / ¯ n =¯ = i := , i 2 i w ¯ := p the variance of the individual-level effects. Since we ˆ ˆ w σ (arcsin(1 , N 2 n ¯ n τ N √ , n = , =(sin((ˆ i i i , and n using, w ¯ ˆ p µ , i i p n Morris and Lysy , i or w n formulation. is the transformed individual mean, the average across the transformed individual means and is the index of an individual and subscript i ¯ i w online w ) suggested the following data transformation: Parts of the online computation of 2012 ing an vations divided by the total number of individuals, age factor are: age factor, is given by above formulas detail how to estimate the individual-level effects results in the following shrinkage model: focus on grouped binary data, the individualand means therefore, (i.e., proportions) not are nearly bounded, normally( distributed. To address this, putations of 40 we estimated using Thus, the quantities or parameters that are needed to estimate requires some additional thought: before arrives, we need to checkobserved whether individual or this from new a data newthe point individual. counter: originates Only in from the an latter already case we increment not detail these further. However, counting the number of unique individuals ( individuals. To obtain the estimated individual-levelone effect computes in terms of probabilities to be approximately equal to sample mean over all data points. The transformation stabilizes the within-variance as formulated by where where where on the individual belonging to the most recent data point. Furthermore,

Chapter 3 nteidvda eogn otems eetdt on.Furthermore, point. data recent most the to belonging individual the on where where where sfruae by formulated as ob prxmtl qa to within-variance equal the approximately stabilizes be transformation The to points. data all over mean sample niiul.T banteetmtdidvda-ee feti em fprobabilities of terms in computes effect one individual-level estimated the obtain To individuals. bevdidvda rfo e niiul nyi h atrcs eincrement we case already latter an the from in Only originates counter: individual. point the new data a new from this or individual whether observed check to need ( we individuals arrives, unique before of thought: number additional the some counting requires However, further. these detail not hs h uniiso aaeesta r eddt estimate to needed are that parameters or quantities the Thus, eetmtdusing estimated we 40 uain of putations n hrfr,ntnal omlydsrbtd oadesthis, address To distributed. ( normally bounded, nearly are not proportions) (i.e., therefore, means and individual the data, binary grouped on focus bv omlsdti o oestimate to how detail formulas above model: shrinkage following the in results effects individual-level the g atr sgvnby given is factor, age g atrare: factor age ain iie ytettlnme findividuals, of number total the by divided vations n an ing 2012 at fteoln optto of computation online the of Parts ugse h olwn aatransformation: data following the suggested ) w online w i ¯ i steidxo nidvda n subscript and individual an of index the is h vrg costetasomdidvda en and means individual transformed the across average the stetasomdidvda mean, individual transformed the is formulation. n w or i , orsadLysy and Morris n p i i , µ p ˆ ¯ w using, n and , i i i =(sin((ˆ , = n , √ N τ n ¯ n 2 N , (arcsin(1 σ w ˆ ˆ h aineo h niiullvlefcs ic we Since effects. individual-level the of variance the p := ¯ w i 2 i , := i = =¯ n ¯ / β ˆ n , n ¯ ( js " w n/n 2012 w arcsin(1 + ) ¯ +1 i N N = = (1 and β − β ˆ N i i ! +1 =1 N SS js sn hsdt rnfraint estimate to transformation data this Using . , − ,where ), js 2 N omrl on h bevtos edo We observations. the count merely to − r tagtowr,frisac h com- the instance for straightforward, are p β N SS nan in ˆ ( js w i js 2 − ) n/n ¯ i siceetdwe e aapoint data new a when incremented is if if js − +¯ )+ , − 2 w ¯ i i n i ¯ se Eq. (see, arcsin(1 − ) offline hpe :Oln hikg factors Shrinkage Online 3: Chapter t t 2 SS w = ∈ ∈ / , 2¯ β ˆ p N N js n/N js )) t , , , stesmo qae between squares of sum the is p etn,w o unt deriv- to turn now we setting, − niae htw nyfocus only we that indicates i − h oa ubro obser- of number total the , h niiulmean, individual the 3.4 1) 2¯ / , p − )) 3.5 2 µ , . i and , sn h Sshrink- JS the using β orsadLysy and Morris ˆ js 3.6 h Sshrink- JS the , .Wiethe While ). N sstof set is p ¯ (3.8) (3.5) (3.4) (3.6) (3.7) the N ) where where tnl hnig h aatasomto ssteegopprmtr.Ti im- This parameters. group these uses that transformation plies data The changing. stantly mue nEq. in imputed eetmtd ihwihw a estimate can we which with estimated, be iul( vidual siainof estimation iul u fsurs( squares of sum viduals individual the whether a an If when grow time. not over set the when individual to grows the added identifiers is track unique element can new individual a we observed Each that is such individual now. new identifier to an up with observed labeled individuals is known all of identifiers unique factors Shrinkage Online 3: Chapter urn niiul edsusti su nmr eali Section in detail more in the issue only this updating discuss by We version individual. offline current the individ- approximate of we transformed number Hence, these the rapidly. grows when all uals infeasible Updating becomes and enters. inefficient point is data means a individual time every updated be procedure, should estimation online and offline the e otiuin(.. h updated the (i.e., contribution new h niiulta utetrd( entered just that individual the oakonidvda,teei led otiuino hsidvda to individual this belongs of point contribution data a the already correct When is first, there before. individual, observed known is depending a individual used to are an functions not update or Different whether before. on observed is point data new ( individuals of (Eq. count mean ( sample individuals the over averaged of mean update sample online the or observations count N ed ob available. be to needs oeta,det h nu fnwdt,gopprmtr ( parameters group data, new of influx the to due that, Note h eann aaee eddfrteetmto of estimation the for needed parameter remaining The h nieudt ftetasomdidvda means, individual transformed the of update online The SS w i i ′ t nee and entered ) js stepeiu rnfre niiulma rmtels ieti indi- this time last the from mean individual transformed previous the is all i ′ w rnfre niiulmascag hnnwdt ne,ntonly not enter, data new when change means individual transformed eoe h rvoscnrbto othe to contribution previous the denotes ¯ SS ysbrcigteodcnrbto ie h previous the (i.e, contribution old the subtracting by 3.7 js SS emk s fasmlrudt eiea sdfor used as regime update similar a of use make we otransform to observed js N w := ¯ w Eq. , i SS i := t ⎩ ⎪ ⎪ ⎨ ⎪ ⎪ ⎧ h urn siaeo h rnfre niiulmean. individual transformed the of estimate current the snwo o,testo nqeietfir findividuals of identifiers unique of set the not, or new is js ! niiularvs(gi)i h aasra.T check To stream. data the in (again) arrives individual SS SS 3.8 ,wihi loasmainoe niiul.Frthe For individuals. over summation a also is which ), ( ( N N js js ,w hc hte h niiulblnigt the to belonging individual the whether check we ), new w ˆ w w i − + ¯ ¯ i t .I re ooti the obtain to order In ). to − + w SS ( niiul riei h aasra,btdoes but stream, data the in arrive individuals w i w w µ saddto added is ) ˆ n/n i ¯ js i i i ′ . − ) i + /N ′ N w + i ¯ w ,ntoe aapit ( points data over not ), ) w i 2 ˆ ( ) w i all /N n/n i ¯ (Eq. − h rnfre niiulmeans individual transformed the w w i if if ¯ ¯ ) : 3.5 i i 2 t t SS ∈ ∈ / .Lsl,t obtain to Lastly, ). if if N N js i i t t sn Eq. Using . , , β ˆ exact ∈ ∈ / js N w N N ¯ n, stebtenindi- between the is hs h etrof vector the Thus, . sls rva than trivial less is , , , p aersl using result same ¯ 3.3 and , n 3.2 .Smlrt the to Similar ). . w .The ). w ¯ 3.10 i ′ N ,te,the then, ), : r con- are ) , µ ˆ w β i ¯ w , ¯ js (3.10) We, . sa is , w (3.9) ˆ can i 41 is

Chapter 3 43 and 2 (3.14) (3.13) (3.15) τ . In this 3.1 , ) ) β β + )Γ( α α , , Γ( Γ( , the individual-level range. Thus, in order i ) _ N N , for a more extensive . When the learn rate, 1] β i , / ml ∈ ∈ to the gradient of . β + t t [0 i ) 2 i i 2011 k ˆ β τ if if , − + − 2 . 2 ) Xu i ) n α 2 2 ˆ σ ; τ τ 3.7 + ( ) )Γ( ) − i i +ˆ n ℓ α 2 ) 2013 2 i ,µ ∇ Γ( i + ¯ α,β σ , w γ ( n k ( 2(ˆ − Γ( i )+ ) ), SGD steps towards the maximum of the 2 using Eq. w 2 Bin Beta γ ( τ i τ ( ˆ ( ′ µ ∼ ∼ + 1) i i N =1 i i ℓ ℓ i ! k k µ ∇ ∇ − γ γ )= n + 1) − + 2 τ 2 2 n ( ˆ ˆ τ τ ℓ Γ( " ∇ + 1)Γ( k := 2 , the individual means do not have to be transformed to estimate Γ( Schaul, Zhang, & LeCun ˆ is transformed to τ : ij i ; y is a summation over individuals, we apply a similar update regime ˆ w )= is the previous contribution of individual i 3.10 =1 we can make use of the shrinkage model as defined in Eq. ) n j 2010 2 i 3.13 τ , # µ ( ′ i n, α, β = ℓ | i k ∇ k ( is: f Bottou the current contribution to that gradient of individual ) The compound distribution of the Beta Binomial distribution is: 2 ) after which i ℓ τ , is large, SGD can ‘move’ fast towards the maximum-likelihood value, however ( γ with the same pace itWhen can the also learn step rate overmany is the data small maximum of points it the have will likelihoode.g., take to function. many enter) evaluations before of thediscussion the maximum on learn derivative likelihood rates (i.e., for is SGD). reached After the (see, estimation of effect is estimated using the shrinkage model3.5 for transformed individual means (Eq. where 3.2.3 The Beta Binomial estimator When we assume that the data-generating model is a Beta Binomial distribution, Chapter 3: Online Shrinkage factors for a single data pointwhether and the based current on estimate thelikelihood of value value. the of parameter Using the is a derivativelikelihood above learn function. SGD rate or determines When ( below a theagain new and maximum- data updates the point parameter enters, estimateℓ SGD accordingly. The evaluates first-order the derivative of derivative where BB, because the Beta distribution naturally falls within the Because Eq. as in Equation ∇ case, we choose the method-of-moments estimationthis method method to has estimate a BB closed-form because solutionform to estimate expression the of shrinkage factor. the Theformulation closed- estimation of procedure BB to of an BB online formulation. makes it easier to rewrite the to estimate , . , i n one , i 3.2.1 ) is the (3.11) (3.12) p 2 2 τ ˆ τ ( ℓ ’s are nor- Goldstein i by evaluat- µ 2 uses the alter- τ , i 2 µ / , $% 2 2 / τ i $% + 2 i , τ ¯ n/n 2 2 i / + σ ¯ n/n % 2 i ) # i σ _ , # ), therefore we focus only on the ml i , β Chapter 3: Online Shrinkage factors + log 3.9 2 i i n/n σ 2 ) result in less shrinkage, and + log i 2 i τ n 2 +¯ +ˆ i ˆ ¯ n/n σ + log( τ ) as discussed previously in Section + 2 2 , and i 2 i is maximized by iterating over the dataset, i ˆ ˆ τ τ _ + ¯ n/n σ 3.4 2 i 3.8 ml = = ). SGD updates the estimate of ¯ n/n σ , β i 3.12 _ 2 2 2 ) ) ) 3.2 ml ¯ ¯ ¯ i w w w 2010 ˆ β 2 2 i i ) which includes the transformed individual means. , using the JS, the ML estimation of − − − σ σ i ¯ i i i n/n 3.5 µ using ML, the following parameters are needed: w w w . Most of these parameters have already been discussed ( ( ( i 2 " " " µ ˆ τ Bottou N N N =1 =1 =1 i i i ! ! ! , equation at the bottom of page 128): = = , and )= ¯ is not straightforward during the data stream since using an it- w is found by maximizing the following log-likelihood function (see, 2 , 2 τ 2 ¯ 2012 p ˆ ( τ τ , ℓ , ¯ n , N , n , 2 i ). Because the means of binary observations are not normally distributed, we ˆ σ Estimating Intuitively, SGD works as follows: The first-order derivative of the log-likelihood For the estimation of , . i 2 ˆ maximum-likelihood value of the variance oflikely the value individual-level of effects. The most an individual as well as by information of the other individuals: native shrinkage model (Eq. However, unlike the previous shrinkage factor,vidual: ML estimator the is level tailored of to shrinkage an is indi- influenced both by the number of observations of ing the gradient (in this case, a one-dimensional gradient or derivative) of erative maximization procedure is notGradient feasible. Descent (SGD, For this reason, we use Stochastic Morris & Lysy use the same data transformation (Eq. Similar to the estimation of 42 3.2.2 Approximate Maximum likelihood estimator The ML is an often usedmally shrinkage distributed factor and for the multilevel outcome variable models, is where 1986 continuous (among others, data point at a time. function is a summation over individuals. SGD evaluates this first-order derivative τ in the previous section (see,remaining Eq. parameter: the estimation of the variance of the individual-level effects, w where more observations of an individual ( In the case of offline estimation,using Eq. a numerical optimization method, for instance Newton Raphson.

Chapter 3 sn ueia piiainmto,frisac etnRaphson. Newton instance for method, optimization numerical a Eq. using estimation, offline of case the In hr oeosrain fa niiul( individual an of observations more where w eann aaee:teetmto ftevrac fteidvda-ee effects, individual-level the of variance the τ of estimation the parameter: Eq. remaining (see, section previous the in ucini umto vridvdas G vlae hsfis-re derivative first-order this evaluates SGD individuals. over summation a is function time. a at point data ors&Lysy & Morris of estimation the to Similar (Eq. transformation data same the use others, (among continuous 1986 where is models, variable outcome multilevel the for and factor distributed shrinkage mally used often an is ML The estimator likelihood Maximum Approximate 3.2.2 42 rtv aiiainpoeuei o esbe o hsrao,w s Stochastic use we reason, this For (SGD, Descent feasible. Gradient not is procedure maximization erative n h rdet(nti ae n-iesoa rdeto eiaie of derivative) or gradient one-dimensional a case, this (in gradient the ing iul h ee fsrnaei nune ohb h ubro bevtosof individuals: observations other of the number of information the by by as both well influenced as indi- individual is an an shrinkage to of tailored level is the estimator ML vidual: factor, shrinkage previous the unlike However, (Eq. model shrinkage native aiu-ieiodvleo h aineo h niiullvlefcs h most The effects. of individual-level value the likely of variance the of value maximum-likelihood ˆ 2 i . , o h siainof estimation the For nutvl,SDwrsa olw:Tefis-re eiaieo h log-likelihood the of derivative first-order The follows: as works SGD Intuitively, Estimating σ ˆ .Bcuetemaso iayosrain r o omlydsrbtd we distributed, normally not are observations binary of means the Because ). i 2 , n , N , n ¯ , ℓ , τ τ ( ˆ p 2012 ¯ 2 τ 2 , 2 sfudb aiiigtefloiglglklho ucin(see, function log-likelihood following the maximizing by found is w sntsrihfraddrn h aasra ic sn nit- an using since stream data the during straightforward not is ¯ )= and , = = qaina h otmo ae128): page of bottom the at equation , ! ! ! i i i =1 =1 =1 N N N Bottou τ ˆ µ " " " 2 i ( ( ( oto hs aaeeshv led endiscussed been already have parameters these of Most . w w w sn L h olwn aaeesaeneeded: are parameters following the ML, using µ 3.5 n/n i i i ¯ i σ σ − − − sn h S h Letmto of estimation ML the JS, the using , hc nldstetasomdidvda means. individual transformed the includes which ) i i 2 2 β ˆ 2010 w w w i ¯ ¯ ¯ ml 3.2 ) ) ) 2 2 2 _ 3.12 i β , σ n/n ¯ .SDudtsteetmt of estimate the updates SGD ). = = ml 3.8 i 2 3.4 σ n/n ¯ + _ τ τ ˆ ˆ i smxmzdb trtn vrtedataset, the over iterating by maximized is i 2 i and , 2 2 + sdsusdpeiul nSection in previously discussed as ) τ log( + σ n/n ¯ ˆ i +¯ +ˆ 2 n τ i 2 i log + euti essrnae and shrinkage, less in result ) 2 n/n σ i i 2 3.9 log + hpe :Oln hikg factors Shrinkage Online 3: Chapter β , i ml ,teeoew ou nyo the on only focus we therefore ), # , _ σ i # ) i 2 % n/n ¯ σ + / i 2 2 n/n ¯ τ , i 2 + $% i τ / 2 2 $% , / µ 2 i , τ sstealter- the uses 2 µ yevaluat- by i Goldstein saenor- are ’s ℓ ( τ ˆ τ 2 2 p (3.12) (3.11) sthe is ) 3.2.1 i , one n i , . , oestimate to omepeso fteetmto rcdr fB ae tese orwiethe rewrite to easier it makes formulation. online BB an of to BB procedure of estimation closed- formulation The the factor. shrinkage of the expression estimate to form solution because closed-form BB a estimate has to method method this estimation method-of-moments the choose we case, ∇ si Equation in as Eq. Because B eas h eadsrbto aual al ihnthe within falls naturally distribution Beta the because BB, where hnw sueta h aagnrtn oe saBt ioildistribution, Binomial Beta a is model data-generating the that assume we When estimator Binomial Beta The 3.2.3 derivative of derivative the first-order evaluates The accordingly. SGD ℓ estimate enters, parameter point the updates data maximum- and new again the a below ( When determines or rate SGD function. learn above likelihood derivative a is the Using parameter of the value. value of likelihood the estimate on current based the and whether point data single a for factors Shrinkage Online 3: Chapter where feti siae sn h hikg oe o rnfre niiulmas(Eq. means individual transformed for 3.5 model shrinkage the using estimated is effect icsino er ae o G) fe h siainof estimation (see, the After reached SGD). is for (i.e., rates likelihood derivative learn on maximum the discussion the of before evaluations enter) many function. take to e.g., likelihood will have the it points of maximum small data the is many over rate step learn also the can When it pace same the with γ ( slre G a mv’fs oad h aiu-ieiodvle however value, maximum-likelihood the towards fast ‘move’ can SGD large, is , τ ℓ i fe which after ) 2 h opuddsrbto fteBt ioildsrbto is: distribution Binomial Beta the of distribution compound The ) h urn otiuint htgain findividual of gradient that to contribution current the Bottou f is: ( k ∇ k i | ℓ = β , α n, i ′ ( µ # , τ 3.13 i 2 2010 j n ) ecnmk s ftesrnaemdla endi Eq. in defined as model shrinkage the of use make can we =1 3.10 i stepeiu otiuino individual of contribution previous the is )= w ˆ sasmainoe niiul,w pl iia paeregime update similar a apply we individuals, over summation a is y ; i ij : τ stasomdto transformed is ˆ cal hn,&LeCun & Zhang, Schaul, Γ( h niiulmasd o aet etasomdt estimate to transformed be to have not do means individual the , 2 := k Γ( 1) + ∇ " Γ( ℓ τ τ ˆ ˆ ( n 2 2 τ 2 + − 1) + n )= γ γ − ∇ ∇ µ k k ! i ℓ ℓ i i =1 N i i 1) + ∼ ∼ µ ′ ( ˆ ( τ i τ ( γ Beta Bin 2 w sn Eq. using 2 ,SDsestwrstemxmmo the of maximum the towards steps SGD ), ) )+ i Γ( − 2(ˆ ( k n ( γ w , σ β α, ¯ + i Γ( ∇ ,µ i 2 2013 ) 2 α ℓ n +ˆ i i − ) Γ( ) ) ( + 3.7 τ τ ; σ ˆ 2 2 α n ) i Xu ) 2 . 2 − + − , if if τ β ˆ k 2011 i i 2 ) i [0 t t + β . otegain of gradient the to ∈ ∈ ml / , i β 1] hntelanrate, learn the When . o oeextensive more a for , N N _ ) i ag.Tu,i order in Thus, range. h individual-level the , Γ( Γ( , , α α Γ( ) + β β ) ) , 3.1 nthis In . τ (3.15) (3.13) (3.14) 2 and 43

Chapter 3 i – – n n , , 45 i N p , ¯ p i ˆ N bb ) both have a M nono no no BB HN , , Beta c n SS , , , and the amount 3.17 , 2 i p ˆ N s n , , , =¯ ¯ p i bb i p ˆ µ SS ), have been discussed in n 2 2 i N ˆ τ ˆ σ no , yes , , ML and BB, Eq. 2 n ¯ , and i w ¯ ˆ , τ p , , , ¯ n i i N , 3.11 , n w i ¯ p , , i n i 1 p n √ , i p = i _ i js hn JS N js ˆ β , yes yes their complexity n SS SS using HN ( , , , i ¯ i w N Normal Normal µ , w , ¯ ¯ , n p i n , i p i µ : When an individual only has 1 observation, gives an overview of the online shrinkage factors that are used in the . 3.1 Table 3.1: Overview of the characteristics of the shrinkage factors and i 3.1 ˆ µ 3.2.1 Table group or individualequal variance transformation update group individual individual individual distribution 3.3 Predicting individual-level effects: when isWhen the right analyzing time? static data,individual-level effects, the does exact not come moment to question.diction at is It only which naturally follows made one that once: the should afternot pre- the the predict shrinkage case the factor when is data estimated.researcher are is This entering faced is, over with however, time. a choice In when this to estimate section, the we individual-level explain effect. why the Chapter 3: Online Shrinkage factors 3.2.4 The Heuristic estimator The previous two shrinkage factorssimilar type (ML, of Eq. intuition: individual-level effects aremean moved when more towards little the group is knowntions) about compared the to individual when there (i.e.,shrinkage is a factor more small has information number the about of sametional an observa- assumptions intuition, individual. or however, sophisticated we The formulas. do The last last so shrinkage without factor any distribu- Section simulation study. The characteristics of eachThe of last the three shrinkage lines factors of are presented. theshould table be give updated an to indication estimate howwhen the many an shrinkage parameters additional factor estimates data andindividual the point parameters, individual-level enters second effect the line dataset. areare group the level First parameters count that of require parameters, the more and computations three to last lines update. line are the ters used for the estimation of of shrinkage decreases as more observations of an individual enter. All the parame- shrinks individual-level effects only based onvations the of (square an root of) individual. number Likeusing of obser- Eq. BB, the HN also shrinks the individual-level effects is i µ , the . The (3.16) (3.17) is de- js bb ˆ β bb ˆ β )) SS µ − (1 . While the first bb , , M ) N N + SS / M k ∈ ∈ . Similar to the . t t i i + − 2 bb ˆ s n n if if , , , and SS N N Γ( , 2 )Γ( , which is computed as follows: . Because all parameters except ) / 2 ∈ ∈ N,n , and c ¯ β p ) i . t t ). The individual-level effect , ¯ p ˆ i i 2 n i + Mµ 1 M c , − . ˆ s n ) i , − ¯ if if α 1 i − p Chapter 3: Online Shrinkage factors ¯ i p and + ¯ p =1 n p i n − i N − , ( = p =1 k N i i ) N i c 2 ˆ ( + (1 bb ¯ M ) ! i p n , ¯ p and Γ( ,N, ¯ p /n n M ! ˆ i 2 (i.e. N M − 1) + # ˆ s p − − , c ′ i 1) k , n =1 i 2 i +1 − N i i n ¯ NSS (1 = p, ˆ s _ " p bb ′ ¯ , p − /n i i ( i , and bb ! _ N i c ¯ )) p ˆ ( bb β N SS n bb = µ ( N +1 − ˆ β ˆ − + SS c c − = = M , i bb bb 2 $ ) (1 we need p ˆ s , , using i SS SS ˆ is defined as M := M requires the following: M n 2 c . Alternatively, the Beta Binomial distribution can also be $ 2 3.1 Γ( ˆ s ˆ s )Γ( ij ): y := and i Mµ bb =1 3.10 n j i Γ( SS ! /n 1 (Eq. , which estimated by =1 )= N i β which fluctuates throughout the data stream. denotes the previous contribution to the =1 js using BB are: α , the latter is again a sum over individuals, which requires an update + N i is the between-individual sum of squares using the individual means. ¯ p ′ i ! i α µ SS ! bb bb is the previous contribution to = = ′ : 3.2.1 i = c n, M, µ SS SS k c µ | c k ( estimated online is slightly different compared to the offline estimated The computation of For the computation of f bb ˆ difference between the two estimation procedures is due to the fact that Here, where 44 where parameterized as, pendent on The shrinkage factor of BB is: two parameters are easy to updateSection during the data stream and alreadysimilar discussed to in where where then estimated using Eq. the last parameter are already discussedtion of previously, we only present the computa- mation of Similar to the ML, thetions BB per is individual also influences individual the specific level where of the number shrinkage. of The observa- parameters for the esti- where β

Chapter 3 where β where in e niiuliflecstelvlo hikg.Teprmtr o h esti- the for parameters observa- The of shrinkage. number the of of where level mation specific the individual influences also individual is per BB tions the ML, the to Similar h atprmtraearaydsusdpeiul,w nypeettecomputa- the present only we previously, of tion discussed already are parameter last the hnetmtduigEq. using estimated then where w aaeesaees oudt uigtedt temadaraydsusdin to discussed similar already and stream data the during Section update to easy are parameters two is: BB of factor shrinkage The edn on pendent aaeeie as, parameterized where 44 where Here, ifrnebtentetoetmto rcdrsi u otefc that fact the to due is procedures estimation two the between difference ˆ bb f o h optto of computation the For h optto of computation The siae niei lgtydfeetcmae oteofln estimated offline the to compared different slightly is online estimated ( k c | µ c k SS SS ,M µ M, n, c = i 3.2.1 : ′ = = stepeiu otiuinto contribution previous the is bb bb ! SS µ α i ! i ′ p ¯ stebtenidvda u fsursuigteidvda means. individual the using squares of sum between-individual the is i N + h atri gi u vridvdas hc eursa update an requires which individuals, over sum a again is latter the , α sn Bare: BB using js =1 eoe h rvoscnrbto othe to contribution previous the denotes hc utae hogottedt stream. data the throughout fluctuates which β i N )= =1 hc siae by estimated which , (Eq. 1 /n ! SS Γ( i j n 3.10 =1 bb Mµ i and := y ): ij Γ( ) s ˆ s ˆ Γ( 3.1 2 $ lentvl,teBt ioildsrbto a lobe also can distribution Binomial Beta the Alternatively, . c 2 n M eurstefollowing: the requires M := M sdfie as defined is ˆ SS SS i using , , s ˆ p eneed we (1 ) $ 2 bb bb i , M = = − c c SS + − ˆ β ˆ − +1 N ( µ = bb n SS N β bb ( ˆ p )) ¯ c i N _ ! bb and , i ( i i /n − p , ¯ ′ bb p " _ s ˆ p, = (1 NSS ¯ n i i N − +1 i 2 i =1 n , k 1) i ′ c , − − p s ˆ # + 1) − M N (i.e. 2 i ˆ ! M n /n p ¯ ,N, Γ( and p ¯ , n p i ! ) M ¯ bb (1 + ( ˆ 2 c i N ) i i N k =1 p = ( , − i N − n i p n =1 p ¯ + and p i ¯ hpe :Oln hikg factors Shrinkage Online 3: Chapter p − i 1 α if if ¯ − , i ) n s ˆ . − , c M 1 Mµ + i n 2 i i ˆ p ¯ , .Teidvda-ee effect individual-level The ). t t . i ) p β ¯ c and , N,n ∈ ∈ 2 / ) eas l aaeesexcept parameters all Because . hc scmue sfollows: as computed is which , Γ( ) 2 , Γ( N N SS and , , , if if n n s ˆ bb 2 − + i i t t . iia othe to Similar . ∈ ∈ k M / SS + N N ) M , , bb hl h first the While . (1 − µ SS )) β ˆ bb β ˆ bb js sde- is (3.16) (3.17) The . the , µ i is ain fa niiul ieB,teH losrnsteidvda-ee effects individual-level the shrinks also HN the BB, Eq. obser- of using Like number individual. of) root an (square of the vations on based only effects individual-level shrinks fsrnaedcessa oeosrain fa niiuletr l h parame- of the estimation All the enter. for individual used an ters of observations more as decreases shrinkage of hna diinldt on nestedtst is ftetrelnsaethe are line update. lines last to three computations and more the parameters, require of that count parameters First level the group are are dataset. line the effect second enters individual-level parameters, point the individual and data estimates factor additional parameters shrinkage an many the when how estimate indication to an updated give be table should the presented. are of factors lines shrinkage three the last of The each of characteristics The study. simulation Section o h aewe aaaeetrn vrtm.I hsscin eepanwythe why effect. explain individual-level we the section, estimate to this when In choice a time. however, with over is, faced entering This is are researcher estimated. data is when factor the case shrinkage predict the the pre- not after should the once: that one made follows naturally which only It is at diction question. to moment come exact not does the effects, individual-level data, static time? analyzing right the When is when effects: individual-level Predicting 3.3 distribu- any factor without shrinkage so last last The do formulas. The we sophisticated however, or individual. intuition, assumptions observa- an tional same of about the number information has small more factor a is shrinkage (i.e., there when individual to the compared about tions) known is group the little towards more when moved mean are effects individual-level intuition: Eq. of (ML, type similar factors shrinkage two previous The estimator Heuristic The 3.2.4 factors Shrinkage Online 3: Chapter distribution ru ridvda ru niiulidvda individual individual individual group update transformation variance equal individual or group Table 3.2.1 µ ˆ 3.1 i al .:Oeve ftecaatrsiso h hikg atr and factors shrinkage the of characteristics the of Overview 3.1: Table 3.1 . ie noeve fteoln hikg atr htaeue nthe in used are that factors shrinkage online the of overview an gives hna niiulol a observation, 1 has only individual an When : µ i p i , n i p n , ¯ ¯ , w , µ omlNormal Normal N w i ¯ i , , , sn N( HN using SS SS n hi complexity their yes yes , β ˆ js N JS hn js i _ i = p i , √ n p 1 i n i , , p ¯ i w n , 3.11 , N i i n ¯ , , , p τ , ˆ ¯ w i and , ¯ n 2 n B Eq. BB, and ML , , yes , no σ ˆ τ ˆ N i 2 2 n ,hv endsusdin discussed been have ), SS µ ˆ p i bb i p ¯ =¯ , , , n s N ˆ p i 2 , 3.17 n h amount the and , , , SS n c Beta , , BHN BB ono no no no M ohhv a have both ) bb N ˆ i p ¯ , p N i 45 , , n n – – i

Chapter 3 , , 47 , right skewed, . We set the av- 7) 1) , , Breslow & Clayton ) and all conditions ; (7 (6 p B B 1981 , = 500 and N 6) , it is computationally complex , i µ (1 i i ). Since two of the four shrinkage ^ p µ B t = 2 Rabe-Hesketh, Skrondal, & Pickles ● ● ; /N , does change over time. remains the same between the 2 ¯ p ) i i Bock & Aitkin p right after the data point enters, µ 2003 ; i , µ − i 2000 ), especially in a data stream. The generated µ , (ˆ (which results in . While time in data stream ! 2004 =2 000 , t , t = 1 ) estimates at ● ● i i i ^ p µ µ =1 = 10 individual-level effects is centered t n true two time points, the group mean Figure 3.1: An illustrationfect. of when Option to 1 shrink ( the individual-level ef- option 2 estimates , or a mixture of two Beta distributions: number of observations equal to 20. As a benchmark we use a multilevel Skrondal & Rabe-Hesketh Agresti, Booth, Hobert, & Caffo Moerbeek, Van Breukelen, & Berger 2 12) ; ; , Because we sample the individuals at random after which we generate an observation, the number 2 . In the following simulation study, we have chosen this second implementation (2 i ˆ erage is violated in the simulationthe individual-level study. effects, which To increasingly do depart so,tion from normality: we underlying the generated the distribu- three distributionsB of factors assume a normal distribution, we specifically examine the case when this model with logit link‘lme4’ function, package as (in implemented [R]) with inmodel is 20 the known adaptive to GLMR provide Gaussian very function Quadrature good from estimates points. of the While this µ of observations is not equal across individuals. Chapter 3: Online Shrinkage factors 3.4 Simulation Study 3.4.1 Design To evaluate whether the onlineequally implementations well as of their original the offline implementations, shrinkage weIn conduct factors a this simulation perform simulation study. study we compareaverage the squared two prediction estimation error procedures ( in terms of the to fit ( 1993 2002 data streams consist of of predicting the individual-level effect. have 1,000 replications. ) and based i ) needs i ˆ µ ˆ =2 µ one would t if in this case). As =2 t ) whose data point en- t i , a prediction ( i p ) do not result in the same =2 ) intends to pay our website—as t t i Chapter 3: Online Shrinkage factors vs. . The black dot denotes an individual 3.1 =1 t the individual-level effect is estimated influences the when , or one could wait until this individual returns ( =1 t . Because the group mean and the estimated shrinkage factor change i µ one could obtain a shrinkage estimateserved the and moment store an the resulting individual’s prediction, data is ob- or, one could obtain a prediction atre-enter the the moment dataset; that hence, the the individual moment is a about prediction to might be needed. The first option leads to the following procedure: when a data point enters, the The individual-level effect is a combination of the individual mean, the group For the second option, imagine an individual ( The two options are illustrated in Figure 1. 2. group-level parameters, the parameters of the individual ( re-estimate all the individual-level effectsestimates every change time with every a additional new datainfeasible data point. for point the Such full enters, re-estimation set the is, of individualsone however, at would each obtain time point, an and estimate inshrinkage many only of applications for the the individual-level mean individualtion to concerned. for the a group-level In specific mean any individual to can case, obtain be the done a at predic- two distinct moments: mean, and a shrinkageone factor. person changes. Every time However, a dueindividual-level to effects new at this data that new moment point in data enters, time point, change the the slightly. That estimates record is, of of all the anymore. The other downsideentering. is These that new while data we points store affectthese the the changes shrinkage prediction, are factor new and not global data taken statistics.sidered are into All fixed. account because Therefore, the the prediction storedmost is prediction recent stored information. does and not con- optimally make use of the to be stored, which is potentially never used if we do not observe this individual tered, and shrinkage factor are updated ordiction computed. of With these the parameters, individual-level a pre- has effect two is downsides. made and The stored first in downside memory. is that This besides option discussed in the introduction—a visit again.access Her the browser website. will At send that out point, acan we request retrieve know to this who individual’s is record. about Now, to we enter can our predict website, this so individual’s we 46 over time, these two options (shrink at can be seen from the plot, estimate of does not waste memory on storing predictionsand that it we incorporates might the end most up recent not changes using at to all the group-level parameters. mean. One could choosepoint to enters predict at the individual-level effectshrink right towards the after group this mean at data that point in time (which is on all the information we know sothe far. website The visit allows data us generated to by update this bothestimates. individual the during group This and second individual-level option parameters thus deals with both downsides of the first option: it

Chapter 3 siaeof estimate is plot, (which the time from in seen be point can that data at mean this group after the towards right shrink effect individual-level the at predict enters to point choose could One mean. parameters. group-level the all to at using changes not recent up most end it the might incorporates option: we it first that and predictions the storing of on downsides memory both waste with not does deals thus parameters option individual-level second and This group during the individual estimates. both this update by to generated us data allows visit The website far. the so know we information the all on vrtm,teetootos(hikat (shrink options two these time, over 46 cestewbie tta on,w nwwoi bu oetrorwbie owe individual’s so this website, predict our can enter we to Now, about record. is individual’s who this to know retrieve request we can a point, out that send At will website. browser the Her access again. visit introduction—a the in discussed ito fteidvda-ee feti aeadsoe nmmr.Ti option besides This that is memory. downside in first stored The and made downsides. is two effect has pre- a individual-level parameters, the these With of computed. diction or updated are factor shrinkage and tered, iee xd hrfr,tesoe rdcinde o pial aeueo the of use make optimally con- not and does information. stored recent prediction is most stored prediction the the Therefore, because account fixed. All into are sidered statistics. taken data global not and new factor are prediction, shrinkage individual changes the the this these store affect observe points we data not while new do that These we is entering. if downside used other never The potentially anymore. is which stored, be to infraseicidvda a edn ttodsic moments: distinct two predic- at a done the be obtain case, can to individual any mean specific In group-level a the for concerned. to tion individual mean individual-level the the for applications of only many shrinkage in estimate and an point, time obtain each would at however,one individuals of is, the set re-estimation enters, full Such the point for point. data infeasible data new additional a every with time change every the estimates effects all individual-level of of is, the record estimates That slightly. all the the change re-estimate point, time enters, data in point moment new that data this at new effects to individual-level due a However, time Every changes. person factor. one shrinkage a and mean, ru-ee aaees h aaeeso h niiul( individual the of parameters the parameters, group-level 2. 1. h w pin r lutae nFigure in illustrated are options two The o h eodoto,iaiea niiul( individual an imagine option, second the For h niiullvlefc sacmiaino h niiulma,tegroup the mean, individual the of combination a is effect individual-level The h rtoto ed otefloigpoeue hnadt on nes the enters, point data a when procedure: following the to leads option first The n ol banasrnaeetmt h oeta niiulsdt sob- is data prediction, individual’s resulting the an store moment and the served estimate shrinkage a obtain could one eetrtedtst ec,temmn rdcinmgtb needed. be might to prediction about a is moment individual the the hence, that dataset; moment the the re-enter at prediction a obtain could one or, µ i eas h ru enadteetmtdsrnaefco change factor shrinkage estimated the and mean group the Because . t =1 roecudwi ni hsidvda eun ( returns individual this until wait could one or , when h niiullvlefc setmtdiflecsthe influences estimated is effect individual-level the t =1 3.1 h lc o eoe nindividual an denotes dot black The . vs. hpe :Oln hikg factors Shrinkage Online 3: Chapter i t t ned opyorwebsite—as our pay to intends ) =2 ontrsl ntesame the in result not do ) p i rdcin( prediction a , i t hs aapiten- point data whose ) t =2 nti ae.As case). this in if t n would one µ =2 ˆ µ ˆ i needs ) i based and ) ae100replications. 1,000 have fpeitn h niiullvleffect. individual-level the predicting of aasrascnitof consist streams data 2002 1993 ( fit to nti iuainsuyw opr h w siainpoeue ntrso the of terms in ( procedures error estimation prediction two squared the average compare we study study. simulation perform simulation this a factors conduct In we shrinkage implementations, offline the original their of as well implementations equally online the whether evaluate To Design 3.4.1 Study Simulation 3.4 factors Shrinkage Online 3: Chapter fosrain snteulars individuals. across equal not is observations of µ le’pcae(n[] ih2 dpieGusa udauepit.Wiethis While the of points. estimates from good Quadrature function very Gaussian provide GLMR to adaptive known the 20 is model in with [R]) implemented (in as package function, ‘lme4’ link logit with model erage of B distributions three distribu- the generated the underlying we normality: from tion so, depart this do increasingly To which when effects, case study. individual-level the the simulation examine the specifically in we violated is distribution, normal a assume factors ˆ i (2 ntefloigsmlto td,w aecoe hsscn implementation second this chosen have we study, simulation following the In . 2 eas esml h niiul trno fe hc egnrt nosrain h number the observation, an generate we which after random at individuals the sample we Because , ; ; 12) 2 orek a ruee,&Berger & Breukelen, Van Moerbeek, get,Boh oet Caffo & Hobert, Booth, Agresti, kodl&Rabe-Hesketh & Skrondal ubro bevtoseult 0 sabnhakw s multilevel a use we benchmark a As 20. to equal observations of number ramxueo w eadistributions: Beta two of mixture a or , pin2estimates 2 option iue31 nilsrto fwe osrn h niiullvlef- individual-level the ( shrink 1 to Option when of fect. illustration An 3.1: Figure w iepit,tegopmean group the points, time two true n t niiullvlefcsi centered is effects individual-level 10 = =1 µ µ p ^ i i i ● ● at estimates ) t =1 , t , 000 =2 2004 ! time indatastream While . wihrslsin results (which (ˆ , µ ,epcal nadt tem h generated The stream. data a in especially ), 2000 i − µ , i ; 2003 µ ih fe h aapitenters, point data the after right p ok&Aitkin & Bock i i ) p ¯ 2 ean h aebtenthe between same the remains oscag vrtime. over change does , /N ; ● ● aeHseh kodl Pickles & Skrondal, Rabe-Hesketh, t =2 B µ p ^ .Snetoo h orshrinkage four the of two Since ). i i (1 µ i , ti opttoal complex computationally is it , 6) N and 500 = , 1981 B B p (6 (7 ; n l conditions all and ) rso Clayton & Breslow , , 1) 7) estteav- the set We . ih skewed, right , 47 , ,

Chapter 3 is 49 i and µ 7) , ) shows (7 1) , B (6 ,B differ across the -axes present the 6) y , 3.3 (1 -axes present the data B x and the /N 2 ) i µ b show that in the beginning of the data − 3.3 i µ (ˆ ! are organized as follows: The 1,000, 5,000, and 10,000). In the columns are the c) presents a different pattern of shrinkage factors. = 3.3 implementations of the shrinkage factors. 3.2 n a and Figure 3.3 -axes of the three subfigures of Figure online y and is organized as follows. In the rows are the three conditions (cen- offline 3.2 ) are comparable, however, the mixture distribution ( 12) , Table The last subfigure (Fig. The subfigures of Figure (2 transformation. The GLMR function performs ‘best’ indifference both between scenarios. the However, shrinkage the factors and thethe GLMR data function is stream. minimal later More in between importantly the for our purpose, there is almost no difference tered, right skewed anddata mixture), stream within are each presented ( conditiondifferent shrinkage three factors points with within theaverage the offline squared prediction and error online of implementations. eachdeviations are of Both presented. the the In shrinkage the factors centered andthe scenario, the shrinkage the standard GLMR factors function (offline outperforms andthe online). difference between However, the as shrinkage factors thedard and data GLMR deviations stream becomes across smaller. continues, the The stan- shrinkagesmall. factors The and second during scenario, the the streamerage right-skewed are squared distribution, stable prediction has error. and an This even is smaller due av- to the fact that the distribution of stream. Note that the three scenarios. In additionsents to the the GLMR function. already The introduced results lines, of the the two unimodal dotted distributions line ( repre- different results. Figure stream, the two shrinkagemore factors error that (JS, ML) make than use the of two shrinkage the factors data (BB, HN) transformation that have do not rely on the Chapter 3: Online Shrinkage factors out than in the previous condition,all while four the shrinkage online factors and have offline similar implementations levels of of shrinkage. Even at the end of thetors. data The stream, cluster there of are shrinkage twothe factors distinct with online clusters the and of highest offline level shrinkage implementations of fac- online shrinkage of implementation consists the of of ML. heuristic The shrinkageBB remaining factors, and shrinkage and JS, factors and the (online the anddata-generating offline offline distribution ML) of hardly the shrink individual-level atheuristic effects all. is estimator bimodal. This (online is Because and due the tions, offline) to the it does fact cannot not that take have the decrease into any more account distributional in this that assump- condition there than in areor the longer two other data modes. two stream conditions. would A The allowa larger the similar online learn online level rate ML of ML shrinkage does to as decrease theaccount even offline that more ML. and The the reach offline distribution ML, BBvery of and little individual JS accordingly. do level take effects into is not normal, andaverage shrink squared prediction error: narrowly distributed around the group mean making the mean over all data a good B . 7) , 3.3 (7 B -axis the y and Figure 3.2 b. For the two shrinkage factors 3.2 Chapter 3: Online Shrinkage factors ) distribution, and one for the mixture further details the average squared pre- 12) , 3.2 (2 B ). Table 1) , presents the length of the data stream and the (6 B 3.2 presents the average squared prediction error over the sim- and 3.3 6) , 0.01) might have been slightly too small. Towards the end of the a the centered distribution is presented. The BB (online and offline) (1 = 3.2 B γ presents the average of the estimated shrinkage factors over the simula- -axis of Figure x 3.2 The average estimated shrinkage factors in the right-skewed distribution of the The In Figure ulation runs. Both figures consist of three subfigures: one for the centered ( 48 3.4.2 Results The main results of the simulationFigure study are presented in Figure tion runs. Figure distribution, one for the right skewed ( distribution ( some more than the offlineTowards JS the end but of behaves the qualitatively data the stream, the same different as shrinkage the factors are offline more JS. spread individual-level effects are presented in Figure that do not use theML data and transformation JS show the differences with results the are previousof condition. quite the The similar. online JS implementation However, shrinks the more overshrinks more a in longer the time, beginning of also the the data offline stream. implementation The offline of ML the shrinks JS on average generated data streams three of the foursame, shrinkage only factors the shrink approximately heuristic the shrinkageless factor than the (online other and factors. offline) shrinks substantially diction error at three points inover the the data different simulation stream runs. and includes the standard deviation age, while the online implementationstream. of The the ML JS online quicklylearn implementation decreases rate only during ( changes the very data gradually. The chosen shrinks the individual-level effectseven most, more than and the the offline implementation. online(over The implementation 2,000) BB data (online does points and before so it offline) can needs bemoments many estimated. estimator, This which is an returns artifact negative of hyperparameters thetion method when for of the the data Beta does not distribu- (yet)Both comply the to offline the versions beta of distribution the (under JS dispersion). and the ML have a relatively stable level of shrink- symbols. The black lines areand also ML marked is with marked symbols: withsmall JS triangles difference is (facing between denoted down). the with In circles offlinefactors. all and In three online general, subfigures, the implementations there online of implementations isthe the tend offline a to implementations. shrinkage shrink somewhat more than average shrinkage factor. Thethe solid shrinkage lines factors. represent The dashed the lines onlineshrinkage represent implementations factors. the offline of implementations The of four the (solid) gray shrinkage lines factors indicate that do the not offlinetriangle require (dashed) the symbols data and (facing transformation. the up) The online to BB differentiate carries the BB from HN which carries square

Chapter 3 tem h Loln mlmnainol hne eygauly h chosen The gradually. data very the changes ( during only rate decreases implementation learn quickly online JS ML the shrink- The of of level stream. implementation stable relatively online a have the ML while the and age, dispersion). JS (under the distribution of beta versions the offline to the comply Both (yet) distribu- not does Beta data the the of for when method tion the hyperparameters of negative artifact returns an is which This estimator, estimated. many moments be needs can offline) it so before and points does (online data BB 2,000) implementation The (over online implementation. offline the the and than more most, even effects individual-level the shrinks than more somewhat shrink shrinkage implementations. to a offline tend the the is implementations of online there implementations the subfigures, general, online three In and all factors. offline circles In with the down). denoted between (facing is difference triangles JS small with symbols: marked square with is carries marked ML also which and are HN lines from black BB The the carries differentiate symbols. BB to online The up) the transformation. (facing and data symbols the (dashed) require triangle offline not the do that indicate factors lines shrinkage gray (solid) the four of The implementations of offline the factors. implementations represent shrinkage online lines the dashed The represent factors. lines shrinkage solid the The factor. shrinkage average ito ro ttrepit ntedt temadicue h tnaddeviation standard the includes and runs. stream simulation different data the the over in points three at error diction oad h n ftedt tem h ifrn hikg atr r oespread JS. more offline are factors the shrinkage as different same the stream, the data qualitatively the behaves of but end average the JS Towards on JS offline shrinks the ML the of offline than The implementation more stream. offline data some the the also of beginning time, the longer in a more shrinks over more the shrinks However, implementation JS online similar. The the quite condition. of previous are the results with differences the show JS transformation and data ML the use not do that Figure in presented are effects individual-level substantially shrinks offline) factors. and other (online the than factor less shrinkage the heuristic approximately shrink the factors only shrinkage same, four the of three streams data generated itiuin( distribution ( skewed right the for one distribution, lto us ohfiue oss ftresbgrs n o h etrd( centered the for one subfigures: three of consist figures Both runs. ulation Figure runs. tion Figure in presented are study Figure simulation the of results main The Results 3.4.2 48 nFigure In h vrg siae hikg atr ntergtsee itiuino the of distribution right-skewed the in factors shrinkage estimated average The The 3.2 x ai fFigure of -axis rsnsteaeaeo h siae hikg atr vrtesimula- the over factors shrinkage estimated the of average the presents γ B 3.2 = (1 h etrddsrbto speetd h B(nieadoffline) and (online BB The presented. is distribution centered the a .1 ih aebe lgtytosal oad h n fthe of end the Towards small. too slightly been have might 0.01) , 6) 3.3 and rsnsteaeaesurdpeito ro vrtesim- the over error prediction squared average the presents 3.2 B (6 rsnstelnt ftedt temadthe and stream data the of length the presents , 1) .Table ). B (2 3.2 , 12) ute eal h vrg qae pre- squared average the details further itiuin n n o h mixture the for one and distribution, ) hpe :Oln hikg factors Shrinkage Online 3: Chapter 3.2 .Frtetosrnaefactors shrinkage two the For b. 3.2 n Figure and y ai the -axis B (7 3.3 , 7) . rg qae rdcinerr hsi u otefc httedsrbto of distribution the that fact the to av- due smaller is even This an and error. has prediction stable distribution, squared are right-skewed erage stream the the scenario, during second and The factors small. shrinkage stan- The the continues, smaller. across becomes stream deviations GLMR data and dard the factors shrinkage as the However, between difference online). the and outperforms (offline function factors GLMR standard the shrinkage the scenario, the and centered factors the shrinkage In the the presented. Both of are deviations each implementations. of online error and prediction squared offline the average the within with points factors three shrinkage different condition ( presented each are within stream mixture), data and skewed right tered, difference no almost is there purpose, our for the importantly between in More later minimal stream. is function data GLMR the the the and on factors the rely shrinkage However, not the scenarios. between do both have difference that in transformation ‘best’ HN) performs (BB, data function factors GLMR the shrinkage The two of transformation. the use than make ML) (JS, that error factors more shrinkage two the stream, Figure results. different B arwydsrbtdaon h ru enmkn h enoe l aaagood a data all over mean the making mean group the around distributed narrowly vrg qae rdcinerror: prediction squared shrink average and normal, not is into effects take level do accordingly. JS individual little and of very BB ML, distribution offline reach the The and ML. more that offline even account the decrease as to does shrinkage ML of ML rate level online learn online similar the larger a allow The A would conditions. stream two modes. data other two longer the or are in than there condition assump- that this in distributional account more any into decrease the have take that not cannot fact does it the to offline) tions, the due and Because is (online This bimodal. estimator is all. effects heuristic at individual-level shrink the hardly of ML) distribution offline offline data-generating and the (online the and factors JS, and shrinkage and factors, remaining BB shrinkage The heuristic ML. of of the consists implementation of shrinkage online fac- of implementations shrinkage level offline highest of and the clusters online with distinct factors the two shrinkage are of there cluster stream, The data tors. the of end the at Even shrinkage. of of levels implementations similar offline have and factors online shrinkage the four while all condition, previous the in than out factors Shrinkage Online 3: Chapter he cnro.I diint h led nrdcdlns h otdln repre- ( line distributions dotted unimodal two the the of lines, results introduced The already function. GLMR the the to sents addition In scenarios. three the that Note stream. (2 Table h ufiue fFigure of subfigures The (Fig. subfigure last The , 12) r oprbe oee,temxuedsrbto ( distribution mixture the however, comparable, are ) 3.2 offline sognzda olw.I h osaetetrecniin (cen- conditions three the are rows the In follows. as organized is and y online ae ftetresbgrso Figure of subfigures three the of -axes 3.3 n Figure and a n 3.2 mlmnain ftesrnaefactors. shrinkage the of implementations 3.3 = )peet ifrn atr fsrnaefactors. shrinkage of pattern different a presents c) ,0,500 n 000.I h oun r the are columns the In 10,000). and 5,000, 1,000, r raie sflos The follows: as organized are ! (ˆ µ i 3.3 − hwta ntebgnigo h data the of beginning the in that show b µ i ) 2 /N n the and x B ae rsn h data the present -axes (1 3.3 , y 6) ae rsn the present -axes ifrars the across differ ,B (6 B , 1) (7 shows ) , 7) µ and i 49 is

Chapter 3 51 ● ● ● ● ● ● 100 100 100 ● ● ● ● ● ● GLM GLM GLM ● ● ● ● ● ● 80 80 80 ● ● ● ● ● ● 1) , HN off HN on HN off HN on HN off HN on (6 ● ● ● ● ● ● 60 60 60 7) B 12) , , ● ● ● ● ● ● & (7 (2 BB off BB on BB off BB on BB off BB on B 6) B , ● ● ● ● ● ● 40 40 40 (1 (a) (b) B ● ● ● ● ● ● ML off ML on ML off ML on ML off ML on data points entering in a stream x 100 data points entering in a stream x 100 data points entering in a stream x 100 (c) ● ● ● ● ● ● 20 20 20 ● ● ● ● ● ● JS off JS on JS off JS on JS off JS on ● ● ● ● ● ● 0 0 0

averaged over the 1,000 replications.

0.20 0.15 0.10 0.05 0.00 0.15 0.10 0.05 0.00 0.015 0.010 0.005 0.000

average squared prediction error prediction squared average average squared prediction error prediction squared average error prediction squared average Figure 3.3: Average squared prediction error for the three scenarios, Chapter 3: Online Shrinkage factors ● ● ● ● ● ● 100 100 100 ● ● ● ● ● ● ● ● ● ● ● ● 80 80 80 ● ● ● ● ● ● 1) Chapter 3: Online Shrinkage factors , HN off HN on HN off HN on HN off HN on (6 ● ● ● ● ● ● 60 60 60 7) B 12) , , ● ● ● ● ● ● & (7 (2 BB off BB on BB off BB on BB off BB on B 6) B , ● ● ● ● ● ● 40 40 40 (1 (a) (b) B ● ● ● ● ● ● ML off ML on ML off ML on ML off ML on data points entering in a stream x 100 data points entering in a stream x 100 data points entering in a stream x 100 (c) ● ● ● ● ● ● 20 20 20 the 1,000 replications ● ● ● ● ● ● JS off JS on JS off JS on JS off JS on ● ● ● ● ● ●

0 0 0

0.8 0.6 0.4 0.2 0.0 0.6 0.4 0.2 0.0 1.0 0.2 0.0 1.0 0.8 1.0 0.8 0.6 0.4

average estimated shrinkage factor shrinkage estimated average average estimated shrinkage factor shrinkage estimated average average estimated shrinkage factor shrinkage estimated average Figure 3.2: The average estimated shrinkage factors, averaged over 50

Chapter 3 50 iue32 h vrg siae hikg atr,aeae over averaged factors, shrinkage estimated average The 3.2: Figure

average estimated shrinkage factor average estimated shrinkage factor average estimated shrinkage factor

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0 0 0 ● ● ● ● ● ● JS on JS off JS on JS off JS on JS off ● ● ● ● ● ● h ,0 replications 1,000 the 20 20 20 ● ● ● ● ● ● (c) data points entering inastreamx100 data pointsentering inastreamx100 data pointsentering data points entering inastreamx100 data pointsentering ML on ML off ML on ML off ML on ML off ● ● ● ● ● ● B (b) (a) (1 40 40 40 ● ● ● ● ● ● , B 6) B BB on BB off BB on BB off BB on BB off (2 (7 & ● ● ● ● ● ● , , 12) B 7) 60 60 60 ● ● ● ● ● ● (6 HN on HN off HN on HN off HN on HN off , hpe :Oln hikg factors Shrinkage Online 3: Chapter 1) ● ● ● ● ● ● 80 80 80 ● ● ● ● ● ● ● ● ● ● ● ● 100 100 100 ● ● ● ● ● ● hpe :Oln hikg factors Shrinkage Online 3: Chapter iue33 vrg qae rdcinerrfrtetrescenarios, three the for error prediction squared Average 3.3: Figure

average squared prediction error average squared prediction error average squared prediction error

0.00 0.05 0.10 0.15 0.20 0.000 0.005 0.010 0.015 0.00 0.05 0.10 0.15 vrgdoe h ,0 replications. 1,000 the over averaged 0 0 0 ● ● ● ● ● ● JS on JS off JS on JS off JS on JS off ● ● ● ● ● ● 20 20 20 ● ● ● ● ● ● (c) data points entering inastreamx100 data pointsentering inastreamx100 data pointsentering inastreamx100 data pointsentering ML on ML off ML on ML off ML on ML off ● ● ● ● ● ● B (b) (a) (1 40 40 40 ● ● ● ● ● ● , B 6) B BB on BB off BB on BB off BB on BB off (2 (7 & ● ● ● ● ● ● , , 12) B 7) 60 60 60 ● ● ● ● ● ● (6 HN on HN off HN on HN off HN on HN off , 1) ● ● ● ● ● ● 80 80 80 ● ● ● ● ● ● GLM GLM GLM ● ● ● ● ● ● 100 100 100 ● ● ● ● ● ● 51

Chapter 3

53

000.7 .0).7 .0).7 .0).8 .0).0 .0).0 .0).1 .0).1 .0).0 (.000) .005 (.001) .011 (.001) .011 (.001) .006 (.000) .005 (.002) .086 (.002) .076 (.002) .074 (.002) .075 10,000

,0 06(01 03(02 06(01 08(02 01(01 03(01 01(01 04(01 01(.001) .011 (.001) .024 (.001) .021 (.001) .013 (.001) .011 (.002) .068 (.001) .056 (.002) .053 (.001) .056 5,000 1) , (6 ,B 6) , (1 B

,0 04(05 06(06 02(04 08(05 05(05 14(03 06(03 15(03 02(.004) .042 (.003) .115 (.003) .086 (.003) .144 (.005) .045 (.005) .088 (.004) .042 (.006) .096 (.005) .044 1,000

000.0 .0).0 .0).0 .0).0 .0).0 .0).0 .0).0 .0).0 .0).0 (.000) .004 (.000) .004 (.000) .004 (.000) .004 (.000) .003 (.000) .006 (.000) .006 (.000) .006 (.000) .006 10,000

,0 06(00 06(00 06(00 06(00 05(00 06(01 06(00 07(01 05(.000) .005 (.001) .007 (.000) .006 (.001) .006 (.000) .005 (.000) .006 (.000) .006 (.000) .006 (.000) .006 5,000 12) , (2 B

,0 09(01 05(01 08(01 04(01 08(01 09(01 09(01 09(01 08(.001) .008 (.001) .009 (.001) .009 (.001) .009 (.001) .008 (.001) .014 (.001) .008 (.001) .015 (.001) .009 1,000

000.1 .0).1 .0).1 .0).1 .0).0 .0).0 .0).0 .0).0 .0).0 (.000) .007 (.001) .008 (.001) .008 (.000) .007 (.000) .007 (.001) .013 (.001) .012 (.001) .012 (.001) .012 10,000

,0 02(01 02(01 02(01 02(01 00(01 01(01 03(01 04(01 00(.001) .010 (.001) .014 (.001) .013 (.001) .011 (.001) .010 (.001) .012 (.001) .012 (.001) .012 (.001) .012 5,000 7) , (7 B

,0 04(04 01(04 03(04 02(04 07(01 08(01 08(01 08(01 05(.001) .015 (.001) .018 (.001) .018 (.001) .018 (.001) .017 (.004) .042 (.004) .043 (.004) .051 (.004) .054 1,000

(sd) (sd) (sd) (sd) (sd) (sd) (sd) (sd) (sd) distribution

e e e e e e e e e n

¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ ¯ 2 2 2 2 2 2 2 2 2

fieoln fieoln fieoln fieoln offline online offline online offline online offline online offline

JS GML HN BB ML al .:Teaeaesurdpeito ro.I h aetee r h dsoe ,0 replications. 1,000 over sd’s the are parentheses the In error. prediction squared average The 3.2: Table Chapter 3: Online Shrinkage factors c), 3.3 ), and multi- Goodman & 2012 , c), the data transforma- 3.2 Manzo & Burke ; for both JS and ML. For the other Chapter 3: Online Shrinkage factors 2007 , increases ’s causes the individual-level effects to be moved ). Knowing which respondents are likely to drop i ˆ µ 2005 , ’s (2012) data transformation, this is not the case. This Leeuw Curtin, Singer, & Presser Morris and Lysy ). Much effort is spent on the prevention of non-response, for instance, re- 1996 , The mixture scenario provides quite different results. While the average squared From the simulation study, we can thus conclude that a) for a long enough data mode data collection ( out of the panel, could facilitatethe the probability prevention for of a the given dropout. respondent to For answer instance, to when the questionnaire drops below vant is a panel study,over where a new longer questionnaires period of are time. sentof Panel to the studies the are major used issues same to of respondents analyze acause ongoing panel trends. it study One can is affect attrition the (i.e., generalizabilityBlum respondents of who the drop results out) to be- theminders, population rewards ( ( 3.5 LISS Panel Study: PredictingAn Attrition application where data are entering over time and real-time prediction is rele- true individual-level effects are either closegroup mean to a zero poor predictor or of the close individual-level tomeans effects. one. Because are these far individual This from makes the the absolute group values. mean, the In transformed an individualgroup means absolute mean have sense, than large larger values values thatdicted are are individual-level closer moved effects to to more the towards group the mean. Transforming the pre- after about 2,000 data pointstwo shrinkage the factors error and the benchmark GLMR,do i.e., not these estimation use methods that pattern appears for both the onlinethe and mixture offline of estimated the shrinkage two factors. distributionsto the Due zero, individual to means or are close either to clustered close one. While the mean of these two distributions is 0.5, all the 52 predictor of the individual-level effects. Thisdiction results in error a and small even average smaller squared standard pre- deviations. prediction error decreases rapidly in the beginning of the data stream (see Fig. example we show that it islong-running possible panel study to will accurately respond predict to whether a respondents monthly of questionnaire. a stream all online shrinkage factors perform asb) well the as their BB offline counterparts, seems and Hence, to for have the the analysis mostrecommend of robust using large, the performance our nested, over online the binary versionexamined of three outcome shrinkage the conditions. data factors BB. streams In are the we further following would evaluated section, all in the a real-data example. In this even more towards the group mean.use the Thus, data even transformation though are the intion shrinkage fact pushes small factors the (see that individual-level Figure effectstional even push closer towards the to group the mean group causeserror the than mean. JS HN and and This ML BB. addi- to have larger prediction

Chapter 3 xml eso hti spsil oacrtl rdc hte epnet fa questionnaire. of monthly respondents this a whether to In predict respond accurately will to study example. panel possible real-data long-running is a it the that in all show section, evaluated would we following further example we the are In streams BB. factors data conditions. the shrinkage outcome three of examined version binary the online over nested, our performance the large, using robust of recommend most analysis the the have for to Hence, and seems counterparts, offline BB their as the well b) as perform factors shrinkage online all stream prediction larger have to addi- BB. ML This and and HN JS mean. than the error causes group mean the group to the towards closer push even tional effects Figure individual-level that (see the factors small pushes fact shrinkage tion in the are though transformation even data Thus, the use mean. group the towards more even h rbblt o ie epnett nwrt h usinar rp below drops questionnaire the when to instance, answer For to respondent dropout. given the a of for prevention probability the the facilitate could panel, the of out ( collection data mode ( ( rewards population minders, the be- to out) results drop the who of respondents Blum generalizability (i.e., the attrition affect is can One study it trends. panel ongoing cause a analyze respondents of to same issues used major are the studies the to Panel of sent time. are of rele- period questionnaires is longer new prediction a where over real-time study, and panel time a over is entering vant are data where application Attrition An Predicting Study: Panel LISS 3.5 pre- the Transforming mean. the group towards the more to to effects moved closer individual-level are are dicted that values values larger large than sense, have mean absolute means group individual an transformed In the mean, values. the group absolute all the the 0.5, makes from This individual is far these are Because distributions one. effects. means to two individual-level close the these of or predictor of poor zero a mean to mean group close the either While are effects one. close individual-level clustered to true either close are or means to individual zero, Due the to distributions factors. two shrinkage the estimated of offline mixture and the online the both for appears pattern that methods use estimation these not i.e., do GLMR, benchmark the and error factors the shrinkage two points Fig. data (see stream 2,000 data about the of after beginning the in rapidly decreases error prediction deviations. pre- standard squared smaller average even small a and error in results diction This effects. individual-level the of predictor 52 rmtesmlto td,w a hscnld hta o ogeog data enough long a for a) that conclude thus can we study, simulation the From h itr cnropoie ut ifrn eut.Wieteaeaesquared average the While results. different quite provides scenario mixture The , 1996 .Mc foti pn ntepeeto fnnrsos,frisac,re- instance, for non-response, of prevention the on spent is effort Much ). orsadLysy and Morris utn igr Presser & Singer, Curtin, Leeuw s(02 aatasomto,ti snttecs.This case. the not is this transformation, data (2012) ’s , 2005 µ ˆ i .Koigwihrsodnsaelkl odrop to likely are respondents which Knowing ). scue h niiullvlefcst emoved be to effects individual-level the causes ’s increases , 2007 hpe :Oln hikg factors Shrinkage Online 3: Chapter o ohJ n L o h other the For ML. and JS both for ; az Burke & Manzo 3.2 ) h aatransforma- data the c), , 2012 oda & Goodman ,admulti- and ), 3.3 c), hpe :Oln hikg factors Shrinkage Online 3: Chapter

Table 3.2: The average squared prediction error. In the parentheses are the sd’s over 1,000 replications.

JS ML BB HN GML offline online offline online offline online offline online offline distribution n e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) e¯2 (sd) 1,000 .054 (.004) .051 (.004) .043 (.004) .042 (.004) .017 (.001) .018 (.001) .018 (.001) .018 (.001) .015 (.001) B(7, 7) 5,000 .012 (.001) .012 (.001) .012 (.001) .012 (.001) .010 (.001) .011 (.001) .013 (.001) .014 (.001) .010 (.001) 10,000 .012 (.001) .012 (.001) .012 (.001) .013 (.001) .007 (.000) .007 (.000) .008 (.001) .008 (.001) .007 (.000) 1,000 .009 (.001) .015 (.001) .008 (.001) .014 (.001) .008 (.001) .009 (.001) .009 (.001) .009 (.001) .008 (.001) B(2, 12) 5,000 .006 (.000) .006 (.000) .006 (.000) .006 (.000) .005 (.000) .006 (.001) .006 (.000) .007 (.001) .005 (.000) 10,000 .006 (.000) .006 (.000) .006 (.000) .006 (.000) .003 (.000) .004 (.000) .004 (.000) .004 (.000) .004 (.000) 1,000 .044 (.005) .096 (.006) .042 (.004) .088 (.005) .045 (.005) .144 (.003) .086 (.003) .115 (.003) .042 (.004) B(1, 6),B(6, 1) 5,000 .056 (.001) .053 (.002) .056 (.001) .068 (.002) .011 (.001) .013 (.001) .021 (.001) .024 (.001) .011 (.001) 10,000 .075 (.002) .074 (.002) .076 (.002) .086 (.002) .005 (.000) .006 (.001) .011 (.001) .011 (.001) .005 (.000) 53

Chapter 3 55 ● ● 40 GLM 1.0 ● 0.8 ● HN off HN on 30 0.6 using the GLMR function. BB off BB on ● i ● 0.4 µ 20 with not many respondents in the middle, just 0.2 3.5 questionnaires as a data stream ML off ML on level effect GLMR estimated individual−level ● ●

0.0

10

4000 3000 2000 1000 0 Frequency JS off JS on ● ● ● ● 0

Figure 3.5: Estimated

Figure 3.4: Percentage correctly classified responses

100 80 60 40 20 0 percentage correctly classified correctly percentage like the mixture of distributionsare of less the computationally simulation complex study. than GLMR, Evenare the though equally predictions BB accurate. made and by BB HN and HN Chapter 3: Online Shrinkage factors offline) are less able to make accuratethe predictions HN with and regard BB to perform non-response while equally wella as surprise, the benchmark as (GLMR). the This is distribution(the not of much MAP the of estimates) individual-level shows effects estimated thateither with the end GLMR of majority the of interval, see the Fig. individual-level effects are on and 3 and the total number of 924 : the JS and ML (online and , 3.4 = 12 Chapter 3: Online Shrinkage factors N -axis is the replay of the questionnaires as these are send out x 397.647. For the analysis of the LISS panel data, we had to drop -axis presents the percentage of correctly classified respondents. A = y n presents the results of the replay of the data stream of the LISS panel ques- 3.4 We analyze the data by replaying the data as if it is a data stream. To do so, As expected from the simulation study, the differences between the offline and In this application, we predict whether a respondent of the panel study is going An extra variable in the data set gives (when applicable) the date the responded stopped being a 3 results by the GLMR function, likeof in the the methods simulation can study, classify in whethertake terms a how into respondent well is account each or when is a not going respondent to stopped respond. being We a member of the panel questionnaire. we kept the ordering ofwithin the a questionnaires questionnaire. but We randomly randomly selected orderedbecause the the we responses respondents within do a not questionnaire haveWe data compare about the the results order of in the which the four data shrinkage entered factors originally. (online and offline) with the observations, one questionnaire (July, 2011) because none of the respondents had answered this with the mixture of distribution appears in Figure the questionnaire. The over time. online estimation procedures are negligible.offline The BB classification and GLMR performances are of exactly the theFurthermore, same, the and same therefore, clustering impossible of to shrinkage disentangle. factors as in the simulation condition 3.5.1 Results Figure tionnaires. The respondent is correctly classified if the shrinkagea model response predicted greater the probability than of .5or and when the the respondent predicted indeed probability was answered below the .5 questionnaire, and the responded failed to answer scored a 1 if theythe participated analysis, in we that selected only wave these ornaire, respondents a that who 0 received had at if an least they one identificationnumber failed question- number of to and individuals participate. started used for before For the December analysis 2011. is Total informative or helpful. Therefore, predicting non-response in aexample panel of study a is a case good where real-time prediction is useful. to participate in the next wavenal as Internet well. Study for Data Social are sciences)between coming panel November from study, the 2007 consisting LISS and of (Longitudi- 50 December monthly 2011. waves For each wave, respondents either 54 a threshold, an additional incentive (aetc.) letter could of be the sent importance when ofcrease inviting the the the panel, probability respondent money that to the answer respondent thea will questionnaire respondent reply was to to unlikely in- to the respond questionnaire. to the Knowing questionnaire, after the facts, is not very member of the panel. After which the respondent is no longer invited for the next waves. correct the group statistics accordingly.

Chapter 3 orc h ru ttsisaccordingly. statistics group the correct ebro h ae.Atrwihtersodn sn ogrivtdfrtenx waves. next the for invited longer no is respondent the which After panel. the of member ihtemxueo itiuinapasi Figure in appears distribution condition of simulation mixture the in the as with factors disentangle. shrinkage to of impossible clustering therefore, same and the same, Furthermore, the the exactly of are performances GLMR and classification BB The offline negligible. are procedures estimation online answer time. to over failed responded the and The questionnaire, .5 questionnaire. the below the answered was probability indeed predicted respondent the the when and or .5 of than probability the greater predicted response model a shrinkage the if classified correctly is respondent The tionnaires. Figure Results 3.5.1 Total is 2011. analysis December the For before for used started participate. individuals and to of number question- failed number identification one they least an if at had received 0 who that a respondents naire, or these wave only either selected that respondents we in analysis, wave, participated the each they if For 1 waves a 2011. monthly scored December 50 (Longitudi- of and LISS consisting 2007 the study, from November panel coming between sciences) are Social Data for Study well. Internet as nal wave next the in participate to useful. is prediction real-time where good case a is a study of panel very example a not in is non-response facts, predicting Therefore, the after helpful. questionnaire, or Knowing informative the to questionnaire. respond the to in- unlikely to to was reply respondent questionnaire will a the respondent answer the to money that respondent probability panel, the the the inviting crease of when importance sent the be of could letter etc.) (a incentive additional an threshold, a 54 aeit con hnarsodn tpe en ebro h panel the of member a We being respond. stopped to respondent going not a is when or each account is well respondent into how a terms take whether the in classify study, with can simulation offline) methods the and the in (online of like originally. factors function, entered GLMR shrinkage data the four the by which the results in of order results the the about compare data We have questionnaire not a do within respondents responses we the the because ordered selected randomly randomly We but questionnaire. questionnaires a the within of ordering the kept we this answered had respondents the of none questionnaire. because 2011) (July, questionnaire one observations, 3 netavral ntedt e ie we plcbe h aetersoddsopdbiga being stopped responded the date the applicable) (when gives set data the in variable extra An nti plcto,w rdc hte epneto h ae td sgoing is study panel the of respondent a whether predict we application, this In sepce rmtesmlto td,tedfeecsbtenteofln and offline the between differences the study, simulation the from expected As eaayetedt yrpaigtedt si ti aasra.T oso, do To stream. data a is it if as data the replaying by data the analyze We 3.4 rsnsterslso h elyo h aasra fteLS ae ques- panel LISS the of stream data the of replay the of results the presents n y = ai rsnstepretg fcretycasfidrsodns A respondents. classified correctly of percentage the presents -axis 9.4.Frteaayi fteLS ae aa ehdt drop to had we data, panel LISS the of analysis the For 397.647. x ai sterpa fteqetonie steeaesn out send are these as questionnaires the of replay the is -axis N hpe :Oln hikg factors Shrinkage Online 3: Chapter 12 = 3.4 , h SadM oln and (online ML and JS the : 924 n h oa ubrof number total the and 3 and r escmuainlycmlxta LR h rdcin aeb BadHN and HN BB by and made accurate. BB predictions equally though the are Even GLMR, than study. complex simulation computationally the less of are distributions on of are mixture effects the individual-level like Fig. the see interval, of the majority of GLMR end the with either that estimated effects shows individual-level estimates) of the MAP much of not (the distribution is This the (GLMR). as benchmark the surprise, as a well equally while non-response perform to BB regard and with HN predictions the accurate make to able less are offline) factors Shrinkage Online 3: Chapter

percentage correctly classified

0 20 40 60 80 100 iue34 ecnaecretycasfidresponses classified correctly Percentage 3.4: Figure iue35 Estimated 3.5: Figure 0 ● ● ● ● JS on JS off Frequency 0 1000 2000 3000 4000 10 0.0 ● ● GLMR estimatedindividual−level effect ML on ML off questionnaires asadatastream 3.5 0.2 ihntmn epnet ntemdl,just middle, the in respondents many not with 20 µ 0.4 ● i ● BB on BB off sn h LRfunction. GLMR the using 0.6 30 HN on HN off ● 0.8 ● 1.0 GLM 40 ● ● 55

Chapter 3 57 Ippel, Kaptein, ) at the level of the group. Including fixed effects at the level of the 2016b , Making real-time predictions without revisiting older data has great potential. Chapter 3: Online Shrinkage factors can easily include fixed effectsIncluding more to random improve effects the in predictioncan, this of however, case facilitate might the more be random individual less effects effect. & as straightforward. Vermunt well The as ML fixed effects ( observations seems challenging for both the JSdata as transformation well aggregates as the the ML information since to the the suggested level of individuals. These real-time predictions areor not to only encourage beneficial respondents intionnaire. that the have context Other a of cases low include e-commerce, sus probability classifying fraudulence to credit-card transactions), respond transactions monitoring to (legitimate patients’regimen the ver- compliance (medication ques- was with taken their or medical not),cational or career tracking (passing students’ exams progress in orestimating their not), individual-level edu- to effects name in a data few. streamsaccount allow the The the dependence presented researcher among methods to the for take observationsefficiency into of without the losing methods the that computational do not take this dependency into account. Acknowledgement We would like to thankpanel prof.dr. dataset. Additionally Peter we would Lugtig likeuous for to help thank preparing and dr. and feedback Marcel on sharing Croon this for the chapter. his LISS contin- ¯ p ), in the ). The JS 2012 ( 2012 , instead we would 1 1000 (which is easy to see since 1 n Morris and Lysy Morris & Lysy Chapter 3: Online Shrinkage factors we effectively use a learn rate of 3.2 )). By choosing a fixed learn rate of, e.g., ¯ p − x ( 1 n = A possible advantage of the JS and ML could be that these methods are easier In the current study, we assumed the data-generating process to be stationary; There are differences between the shrinkage factors in how well they are able to ¯ p n − x (smoothly) decrease the value of older data points in the resulting estimate. to extend to deal with covariates (see, for instance, puting an estimate. Alleasily the adapted to online create such shrinkage a factors movingof window presented approach the in by procedure changing this to the learn chapter a rate are using fixed value: Equation for example, when updating the sample mean the possible effect of thea time result, within the the individual-level data effects streampoints are is equally, estimated not irrespective of using explicitly their the modeled. history. information In As ever, practice, of this not all assumption data hold. might, how- Ifweigh the the stationarity recent assumption data does points not more hold, heavily one than might the older prefer to data points when com- factors. More importantly, the shrinkage factors implementedall offline individual (making data use points) of or online (incrementallyperform and not similarly. revisiting previous However, data) whendeviates the from distribution the of normal theproximate distribution, individual-level Maximum the Likelihood effects estimator James perform less Stein welland (JS) than heuristic the estimator estimator. Beta and Binomial the ap- studied situations does not worktions well is in limited and situations the where distribution thenormal of distribution. number the individual-level of effects observa- deviates from the predict the individual-level effects. Whento the normally true distributed individual-level effects the are prediction close accuracy is very similar across all shrinkage Maximum Likelihood estimator, the Betaestimator. Binomial In estimator a and simulation study, lastly we showed aage that heuristic make all these good shrinkage predictions factors of on the aver- mixture individual-level of effects. distributions However of when the there individual-level isrely effects, a shrinkage on factors the that normal do not distributionIt of appears the that individual-level the data effects transformation do suggested noticeably by better. have a nested structure, when the datain enter real over time. time, and While predictions the are multilevelanalyzing required model nested with data logit with link a functionity is binary of the outcome, that standard model, due for analyzing to dataTo the overcome streams computational this of problem, complex- nested we binary studied data onlinesions becomes of – four infeasible. and different shrinkage computationally factors: efficient the – James ver- Stein estimator, the (approximate) 56 3.6 Conclusion and discussion The most important conclusion wetions can of draw the is that individual-level we effects can when make the accurate outcome predic- variable is binary, the data

Chapter 3 ftepoeuet xdvle o xml,we paigtesml mean sample the updating when example, for Equation value: fixed using are rate a chapter learn the to this changing procedure by in the approach com- presented window of when moving factors a points shrinkage such create data to online to prefer older adapted the easily the might All than one estimate. heavily hold, an more not puting points does data assumption recent stationarity the the weigh If how- might, hold. data assumption all not this of practice, ever, As In information history. modeled. the their explicitly using of irrespective not estimated equally, is are points stream effects data individual-level the the within result, time a the of effect possible the ap- the Binomial and Beta estimator. estimator the heuristic than (JS) and well Stein less perform James estimator effects Likelihood the Maximum individual-level distribution, proximate the normal of the distribution from the deviates when data) However, previous revisiting similarly. not and perform (incrementally online or of points) use data shrinkage (making individual offline all all implemented across factors similar shrinkage the very importantly, More is factors. accuracy close prediction are the effects individual-level distributed true normally the to When effects. individual-level the predict the from deviates observa- effects of individual-level the number distribution. of normal the distribution where the situations and limited in is well tions work not does better. by situations noticeably suggested studied do transformation effects data the individual-level that the appears of It distribution not do normal that the factors on shrinkage a effects, rely is individual-level there the when of However distributions effects. of individual-level mixture aver- the on of factors predictions shrinkage good these all make heuristic that age a showed we lastly study, and simulation a estimator In Binomial estimator. Beta the (approximate) estimator, the estimator, Likelihood Stein ver- Maximum James – the efficient factors: computationally shrinkage different and infeasible. four – of becomes sions online data studied binary we nested complex- problem, of this computational streams overcome the To data to analyzing for due model, standard that outcome, the of binary is ity function a link with logit data with nested model requiredanalyzing multilevel are the data predictions While and the time, time. binary, over real is enter in data variable the predic- outcome when accurate structure, the make nested when a can have effects we individual-level that is the draw of can tions we conclusion important most The discussion and Conclusion 3.6 56 oetn oda ihcvrae se o instance, for (see, covariates with deal to extend to estimate. resulting the in points data older of value the decrease (smoothly) x − n p ¯ ntecretsuy easmdtedt-eeaigpoest estationary; be to process data-generating the assumed we study, current the In to able are they well how in factors shrinkage the between differences are There osbeavnaeo h SadM ol eta hs ehd r easier are methods these that be could ML and JS the of advantage possible A = n 1 ( x − p ¯ ) ycosn xdlanrt f e.g., of, rate learn fixed a choosing By )). 3.2 eefcieyuealanrt of rate learn a use effectively we hpe :Oln hikg factors Shrinkage Online 3: Chapter ors&Lysy & Morris orsadLysy and Morris n 1 wihi ayt e since see to easy is (which 1000 1 nta ewould we instead , 2012 ( 2012 .TeJS The ). ,i the in ), p ¯ ae aae.Adtoal ewudlk otakd.Mre ro o i contin- LISS his chapter. the for this Croon sharing on Marcel feedback and dr. and preparing thank help to for uous like Lugtig would we Peter Additionally dataset. prof.dr. panel thank to like would We Acknowledgement account. into dependency this take not do computational that the methods losing the without of into efficiency observations take for the to methods among researcher presented dependence the The the allow account streams few. data a in name effects to edu- individual-level not), their estimating or in progress exams students’ (passing tracking career or cational not), medical or their taken with was ques- (medication compliance ver- the regimen patients’ (legitimate to monitoring transactions respond transactions), credit-card to fraudulence classifying probability sus e-commerce, include low cases of a Other context have the that tionnaire. in respondents beneficial encourage only to not or are predictions real-time These individuals. of level suggested the the to since information ML the the as aggregates well transformation as data JS the both for challenging seems observations ( effects fixed ML as The well Vermunt straightforward. as & effect. effects less individual random be more the might facilitate case however, of this can, prediction in the effects improve random to more Including effects fixed include easily can factors Shrinkage Online 3: Chapter aigra-iepeitoswtotrvstn le aahsgetpotential. great has data older revisiting without predictions real-time Making , 2016b ttelvlo h ru.Icuigfie fet ttelvlo the of level the at effects fixed Including group. the of level the at ) pe,Kaptein, Ippel, 57

Chapter 3 Chapter 4

Estimating Random-Intercept Models on Data Streams

Abstract

Multilevel models are often used for the analysis of grouped data. Grouped data occur for instance when estimating the performance of pupils nested within schools or analyzing multiple observations nested within individuals. Currently, multilevel models are mostly fit to static datasets. However, recent technological advances in the measurement of social phenomena have led to data arriving in a continuous fashion (i.e., data streams). In these situations the data collection is never “finished”. Traditional methods of fitting multilevel models are ill-suited for the analysis of data streams because of their computational complexity. A novel algorithm for esti- mating random-intercept models is introduced. The Streaming EM Approximation (SEMA) algorithm is a fully-online (row-by-row) method enabling computationally- efficient estimation of random-intercept models. SEMA is tested in two simulation studies, and applied to longitudinal data regarding individuals’ happiness collected continuously using smart phones. SEMA shows competitive statistical performance to existing static approaches, but with large computational benefits. The introduc- tion of this method allows researchers to broaden the scope of their research, by using data streams.

This chapter is published as Ippel, L., Kaptein, M.C, & Vermunt, J.K. (2016) Estimating Random- Intercept Models on Data Streams. Computational Statistics and Data Analysis, 104, 169–182 Chapter 4

Estimating Random-Intercept Models on Data Streams

Abstract

Multilevel models are often used for the analysis of grouped data. Grouped data occur for instance when estimating the performance of pupils nested within schools or analyzing multiple observations nested within individuals. Currently, multilevel models are mostly fit to static datasets. However, recent technological advances in the measurement of social phenomena have led to data arriving in a continuous fashion (i.e., data streams). In these situations the data collection is never “finished”. Traditional methods of fitting multilevel models are ill-suited for the analysis of data streams because of their computational complexity. A novel algorithm for esti- mating random-intercept models is introduced. The Streaming EM Approximation (SEMA) algorithm is a fully-online (row-by-row) method enabling computationally- efficient estimation of random-intercept models. SEMA is tested in two simulation studies, and applied to longitudinal data regarding individuals’ happiness collected continuously using smart phones. SEMA shows competitive statistical performance to existing static approaches, but with large computational benefits. The introduc- tion of this method allows researchers to broaden the scope of their research, by using data streams.

This chapter is published as Ippel, L., Kaptein, M.C, & Vermunt, J.K. (2016) Estimating Random- Intercept Models on Data Streams. Computational Statistics and Data Analysis, 104, 169–182 , ). ). 61 Op- 2009 2005 Wolfe, , , Thiesson, ; ) discussed ) proposed a 2003 2012 , ( 2007 ( Gaber et al. McLachlan and Peel Cappé & Moulines , 2006; Ng & McLachlan Berlinet and Roland observations. The method we propose Steiner and Hudec , ch. 12) described various possible adaptations dependent 2000 ( ) presented an (offline) parallel version of the EM algo- ). We propose an adaption of the EM algorithm for the es- 2008 2001 ( , , 2008) and for latent variable models ( McLachlan and Peel Liu, Almhana, Choulakian, and McGorman ). SEMA is an approximate EM method, because unlike the EM algorithm Wolfe et al. 1998 The remainder of this chapter is organized as follows. In the next section, we The resulting Streaming EM Approximation algorithm (henceforth referred to as Related methods for speeding up the EM algorithm have been proposed for deal- , stores information on the leveland of updates individuals, the instead estimates in of a thedata single level streams pass and of over extremely the observations, large data, datasets. making it suitable for both illustrate the computational advantages ofexample streaming of estimation the using estimation the ofrandom-intercept simple a models sample using mean. the EM Next, algorithm, we and discuss show the how estimation this of algorithm rithm and of EM methods for large datasets.for different Various applications online have adaptations also been of proposed,(see, the for EM e.g., example, algorithm for mixture models 2000; Instead of speeding upmethod the to EM scale algorithm, down theexisting literature data by prior proposing to an EM usingbased approximation on the for data streams EM the consisting estimation algorithm. of of models We add to this Chapter 4: Introduction of SEMA individual-level effects, the estimatesshould of be updated the as parameters data pointsparameters of come should the in, and be model the used of updateddata, for estimates interest these of prediction traditional the purposes. methods model havepoints, When to each repeatedly applied cycle time to through a streaming all newter available data estimates. data point Additionally, arrives, even inand if order (extremely) the to large, dataset obtain it is is up-to-date noin often parame- smaller longer computationally batches, augmented, preferable or to but even analyze a static Meek, data the & point dataset Heckerman at a time ( timation of random-intercept models, to resolvedata the in problem a of data stream analyzing or grouped when the dataset is extremely large. SEMA) falls within the frameworkA of key online learning feature methods of ( onlinemary learning statistics is which contain that all theper relevant data information are of summarized previous data into pointswhich a ( uses few all sum- the datause to a update single the data estimates point,previous of estimates some the of model summary the parameters, statistics model we onparameters. parameters, only the Because to SEMA individual update does level, the not and estimates requiremore all appropriate the of to the the deal data with model to data be streams in than memory, the SEMA conventional is EM algorithm. ing with large (static) datasets,methods for to speed example, up the convergence rateHaghighi, of the and conventional Klein EM algorithm. ; 2002 , Demp- ), or vot- Lee, Pod- ), multiple 2010 Cortes, Fisher, 2013 , , ), telephone communi- ). Such data are typically 2011 Raudenbush & Bryk , ). 2007 , Chapter 4: Introduction of SEMA 2012 , ). However, each of these methods 2010 Killingsworth & Gilbert Gelman , Patidar & Sharma Morris & Lysy ). In order to obtain up-to-date predictions of the ), and consumer behavior tracking in e-commerce, ). 2001 P. Barrett, Zhang, Moffat, & Kobbacy 2000 2002 , , , ) to maximize the likelihood. Alternatively, but not consid- Browne & Goldstein 1977 , contain fewer parameters than group-specific models, allow for generalization of results to a wider populationallow of groups, information and to be sharedSteenbergen & between Jones groups ( Recent technological developments have, however, led to the increased availabil- Current (maximum-likelihood) methods for fitting multilevel models use itera- The latter property in particular makes multilevel analysis interesting when the Multilevel models have various advantages over more traditional methods of 2. 3. 1. Pregibon, Rogers, & Smith where purchased items or visited web pageslaseck, are nested Schonberg, within & customers Hoch ( ity of these so-called data streams:with i.e., new datasets data which are points. continuously augmented ture. Such Examples data include streams fraud often detection usingactions have are credit nested a card within grouped transactions, credit cards where (or ( trans- nested)cation analysis, struc- where calls are nested within telephone registrations ( more details see, e.g., require multiple passes through the full datasetthough to fitting obtain a parameter multilevel estimates. model Even once, onrequire a excessive moderately computation sized dataset time, does such often ways not come of infeasible when fitting a multilevel dataset models is can extremelycollection be- large, is or never in “finished” the because situation more where data the present data themselves over time. out-of-sample prediction error than predictions derived froma either group-specific an analysis aggregate (see or e.g., tive algorithms such as Newton-Raphson or Expectationster, Maximization Laird, (EM, & Rubin ered in this chapter, one could use a Bayesian framework with MCMC sampling (for focus is on obtaining group-level predictions, since multilevel modeling yields smaller analysis, such as aggregatednored, analysis, or in group-specific which analysis, the inignored. That within-group which is, information structure they about is the ig- other groups is observations grouped within individualsers ( grouped within geographical regionsanalyzed ( using multilevel (or hierarchical) modelsstatistics in are which treated batches as of randomly group-level chapter, drawn we from will an use underlying thealthough distribution. formulation the method of we In “observations present this does nested not within restrict itself individuals”, to this type of nesting. 60 4.1 Introduction In the social sciences,within we school often classes encounter (e.g., grouped data, such as pupils grouped

Chapter 4 oe,o ru-pcfi nlss nwihifrainaotteohrgop is groups other ig- the is about they structure information is, which within-group That ignored. in the analysis, which group-specific in or analysis, nored, aggregated as such nesting. analysis, of type this to individuals”, itself restrict within not nested does this present “observations In we of method the formulation distribution. although the underlying use an will from we drawn chapter, group-level randomly of as batches treated which are in statistics models hierarchical) (or multilevel using ( analyzed regions geographical within grouped grouped ( ers pupils individuals as within such grouped data, observations grouped (e.g., encounter classes often school we within sciences, social the In Introduction 4.1 60 hr ucae tm rvstdwbpgsaense ihncsoes( Hoch customers & within Schonberg, nested are laseck, pages web visited or items purchased where Smith & ( Rogers, registrations telephone Pregibon, within nested are calls where struc- analysis, cation nested) trans- ( (or where cards credit transactions, grouped within card a nested credit are have actions using detection often fraud streams include data Examples Such ture. augmented continuously points. are which data datasets new i.e., with streams: data so-called these of ity time. over themselves data present the data where more situation because the “finished” in never or is large, be- collection extremely can is models dataset multilevel a fitting when infeasible of come not ways often such does time, dataset sized computation moderately excessive a require on once, Even model estimates. multilevel parameter a obtain fitting to though dataset full the through passes multiple require (for sampling MCMC e.g., with see, framework details Bayesian more a use could one chapter, this in ered Rubin & (EM, Laird, Maximization ster, Expectation or Newton-Raphson as such algorithms tive e.g., or (see aggregate analysis an group-specific either a from derived predictions than error smaller prediction yields out-of-sample modeling multilevel since predictions, group-level obtaining on is focus 3. 2. 1. utlvlmdl aevrosavnae vrmr rdtoa ehd of methods traditional more over advantages various have models Multilevel eettcnlgcldvlpet ae oee,ldt h nrae availabil- increased the to led however, have, developments technological Recent itera- use models multilevel fitting for methods (maximum-likelihood) Current the when interesting analysis multilevel makes particular in property latter The lo nomto ob hrdbtengop ( groups Jones between & Steenbergen shared be to and information groups, of allow population wider a to results of generalization for allow models, group-specific than parameters fewer contain , 1977 rwe&Goldstein & Browne omxmz h ieiod lentvl,btntconsid- not but Alternatively, likelihood. the maximize to ) , , , 2002 2000 .Bret hn,Mfa,&Kobbacy & Moffat, Zhang, Barrett, P. 2001 ). ,adcnue eairtakn ne-commerce, in tracking behavior consumer and ), .I re ooti pt-aepeitoso the of predictions up-to-date obtain to order In ). ors&Lysy & Morris aia Sharma & Patidar , Gelman ilnsot Gilbert & Killingsworth 2010 .Hwvr aho hs methods these of each However, ). , 2012 hpe :Itouto fSEMA of Introduction 4: Chapter , 2007 ). , adnuh&Bryk & Raudenbush 2011 .Sc aaaetypically are data Such ). ,tlpoecommuni- telephone ), , , 2013 ots Fisher, Cortes, 2010 ,multiple ), e,Pod- Lee, ,o vot- or ), Demp- , 2002 ; ehdt cl ontedt ro ouigteE loih.W d othis to add We models of of algorithm. estimation consisting the EM streams data for the on approximation based using EM an to proposing prior by data literature existing the down algorithm, scale EM to the method up speeding of Instead 2000; models mixture for algorithm example, e.g., EM for the (see, proposed, of been also adaptations have online applications Various different for datasets. large for methods EM of and rithm algorithm. EM Klein conventional and the of Haghighi, rate convergence the up example, speed to for methods datasets, (static) large with ing algorithm. EM is conventional SEMA the memory, than in streams be data to model with data deal the the to of the appropriate all more require estimates and not the level, does update individual SEMA to Because the only parameters, parameters. on we model statistics parameters, the summary model of the some estimates of previous point, estimates data the single update a to use data the sum- all few uses ( a which points into data previous summarized of are information data relevant per the all that contain which is statistics learning mary online ( of methods feature learning online key of A framework the within falls SEMA) large. extremely is dataset the when grouped or analyzing stream data of a problem in the data resolve to models, random-intercept of timation ( time a at Heckerman dataset point & the data Meek, static a analyze even but to or preferable augmented, batches, computationally longer smaller parame- often in no up-to-date is is it obtain dataset large, to the (extremely) order if and in even arrives, Additionally, point data estimates. data available ter new all streaming a through to time cycle applied repeatedly each to When points, have model methods purposes. the traditional prediction of these interest estimates for data, updated of used the model be and in, the should come of parameters points data parameters as the updated be of should estimates the effects, individual-level SEMA of Introduction 4: Chapter admitretmdl sn h Magrtm n hwhwti algorithm of this estimation how the show discuss and we algorithm, Next, EM the mean. using sample models a simple random-intercept of the estimation using the estimation of streaming example of advantages computational the illustrate both for suitable it making datasets. data, large observations, the extremely over of and pass streams level single data the a of in estimates instead the individuals, updates of and level the on information stores , eae ehd o peigu h Magrtmhv enpooe o deal- for proposed been have algorithm EM the up speeding for methods Related as to referred (henceforth algorithm Approximation EM Streaming resulting The h eane fti hpe sognzda olw.I h etscin we section, next the In follows. as organized is chapter this of remainder The 1998 of tal. et Wolfe .SM sa prxmt Mmto,bcueulk h Malgorithm EM the unlike because method, EM approximate an is SEMA ). i,Amaa huain n McGorman and Choulakian, Almhana, Liu, caha n Peel and McLachlan 08 n o aetvral oes( models variable latent for and 2008) , , ( 2001 2008 .W rps naato fteE loih o h es- the for algorithm EM the of adaption an propose We ). rsne n(fie aallvrino h Malgo- EM the of version parallel (offline) an presented ) ( 2000 dependent h 2 ecie aiu osbeadaptations possible various described 12) ch. , tie n Hudec and Steiner bevtos h ehdw propose we method The observations. elntadRoland and Berlinet g&McLachlan & Ng 2006; , ap Moulines & Cappé caha n Peel and McLachlan ae tal. et Gaber ( 2007 ( , 2012 2003 rpsda proposed ) discussed ) ; Thiesson, , , Wolfe, 2005 2009 Op- 61 ). ). ,

Chapter 4 63 (4.2) . n ) are kept in memory. n ¯ x , . , and ) . + 1) . n ). In the next section, we detail ) ( n 2 n n +1 ( ¯ +1 x n n n, O ( n x n 1950 − 1 2 = O +1 , + = = +1 n n +1= , n x i n +1 x + ··· + n +1 Plackett +1 n n =1 ! i n ··· ; n x x =¯ = =¯ . More specifically, 1993 n +1 , n ¯ x shows two key features of online learning: first, when 1+1+1+ 1+2+3+ ,...,x 4.2 1 x grows, progresses as follows: n , while online the number grows linearly as a function of n ) or for estimating the coefficients of a linear regression model using Escobar & Moser data points 1962 n , Second, only certain summary statistics (here Welford This makes online learning bothSimilar computationally algorithms fast can as be well as used,( memory amongst efficient. others, for updatingleast of squares ( higher moments the transition from offline estimationmodel. to online estimation of the random-intercept In comparison, the online update of the sampleof mean, computations requires the following number 4.3 Online estimation of random-intercept4.3.1 models The random-intercept model and itsIn this standard section, offline we estimation describe thewhich random-intercept we model focus with on continuous throughout outcomes, thisof chapter. the Next, EM we algorithm, give as a conceptual wellmodel. description as the These technical technical details of detailsfrom fitting offline are the estimation subsequently random-intercept of needed the tothis random-intercept model. explain model the to transition the online estimation of Chapter 4: Introduction of SEMA mean for The last line of Equation a new observation enters, we updatehistorical the data. current estimate This without reduces revisiting thesample all computational mean. the complexity Note required to thatthe update the sample the number mean of as (offline) computations needed to compute This simple analysis shows that thequadratically computations in to update the mean offline grow , 4.5 and (4.1) n is not too large n that data enter in and of the estimated sample the total number of obser- n analysis of data streams. This update Chapter 4: Introduction of SEMA online , i x th unit and n i n =1 ! i . This is feasible as long as i = x n ¯ x grows larger and larger. n , we detail some theoretical characteristics of SEMA, and 4.6 , we extend the random-intercept model to include additional becomes infeasible when it needs to be performed in the face of 4.7 data points can be expressed as an analysis of static datasets to the 4.1 +1 n denotes the measurement for the offline i x Suppose now that we want to compute the sample mean The online computation of a sample mean can be done by noting that the sample . The standard offline computation proceeds as follows: n ¯ vations. x 4.2 From offline to onlineBefore data introducing analysis SEMA, we first explainfrom the the key changes involved whenconceptual moving shift is easily illustrated by examining the computation of a sample mean in a stream. In Section we discuss a convergence diagnostic to evaluateSEMA. the In estimated Section model parameters of fixed covariates. The last section discusses theand main presents results directions of for the future simulation work. studies to improve the estimates botheffects. of the The model parameters first and alternativevalues, of uses the the second a individual-level implementation small cycles part through alland of individuals the the at last given data intervals, implementation to is obtainwe a illustrate better the combination use starting of of SEMA the inpiness, an previous in application two. using which real nested data In on data, Section respondents’ collected hap- using a smart-phone application, “arrived” 62 can be modified into a streaminguate version, SEMA leading in to two SEMA. simulation Subsequently, studies. weaccuracy eval- In of the the first simulation estimates study, of wefects. evaluate the the In model the parameters, second and study, we of evaluate the three individual-level alternative implementations ef- of SEMA where a stream. The naive application ofeach the time a above new offline data formula point would enters one then has imply to that count the number of observations as in Equation rapidly entering data points, as mean for or when the update is only required rarely. However, even a simple computation compute the sum of all measurements

Chapter 4 opt h u falmeasurements all of sum the compute enfor mean computation as points, simple data a entering even rapidly However, rarely. Equation required in only as is update the when or ahtm e aapitetr n a ocuttenme fobservations of number the count that to imply has then one enters would point formula data offline new above a time the each of application naive The stream. a where ocpulsiti aiyilsrtdb xmnn h optto fasml mean sample a of computation x the examining by illustrated easily is shift moving conceptual when involved changes key the the from explain first we SEMA, analysis introducing data Before online to offline From 4.2 studies work. simulation future the for of directions results presents main and the discusses section last The covariates. fixed of parameters model Section estimated In the SEMA. evaluate to diagnostic convergence a discuss “arrived” we application, Section smart-phone In a stream. using a hap- collected in respondents’ Section data, on In data nested real which using two. application in previous an piness, in the SEMA of of starting use combination the better illustrate a we obtain is to implementation intervals, data given last at the the individuals of and all through part cycles small implementation individual-level a second the the uses of values, alternative and first parameters model The the SEMA of effects. of both ef- implementations estimates alternative individual-level the three the improve evaluate of to we study, and second parameters, the model In the the evaluate fects. we of study, estimates simulation first the the of In eval- accuracy we studies. Subsequently, simulation SEMA. two to in leading SEMA version, uate streaming a into modified be can 62 vations. ¯ n h tnadofln optto rcesa follows: as proceeds computation offline standard The . h niecmuaino apema a edn yntn httesample the that noting by done be can mean sample a of computation online The ups o htw att opt h apemean sample the compute to want we that now Suppose x i offline eoe h esrmn o the for measurement the denotes n +1 4.1 nlsso ttcdtst othe to datasets static of analysis aapit a eepesda an as expressed be can points data 4.7 eoe nesbewe tnest epromdi h aeof face the in performed be to needs it when infeasible becomes eetn h admitretmdlt nld additional include to model random-intercept the extend we , 4.6 edti oetertclcaatrsiso EA and SEMA, of characteristics theoretical some detail we , n rw agradlarger. and larger grows x ¯ n x = i hsi esbea ogas long as feasible is This . i ! =1 n i n hui and unit th x i , online hpe :Itouto fSEMA of Introduction 4: Chapter update nlsso aasras This streams. data of analysis n h oa ubro obser- of number total the fteetmtdsample estimated the of and htdt ne in enter data that n snttolarge too not is n (4.1) and 4.5 , hssml nlsssosta h opttost paetema fiegrow offline mean the update to in computations quadratically the that shows analysis simple This apema.Nt httenme f(fie opttosnee ocompute to needed computations (offline) as of mean number the sample the update the that to required Note complexity the mean. computational all sample the revisiting reduces without This estimate current data. the historical update we enters, observation new a Equation of line last The rmofln siaino h admitretmdlt h nieetmto of estimation online the transition to the model explain model. random-intercept this to the needed of random-intercept subsequently estimation the are offline fitting from details of details technical technical These the as description model. well conceptual a as give algorithm, we EM Next, the chapter. of this outcomes, throughout continuous on with focus model we random-intercept which the describe estimation we offline section, standard this In its and model random-intercept The models 4.3.1 random-intercept of estimation Online 4.3 for mean SEMA of Introduction 4: Chapter ncmaio,teoln paeo h apema,rqie h olwn number following the requires computations mean, of sample the of update online the comparison, In h rniinfo fieetmto ooln siaino h random-intercept the of estimation online to model. estimation offline from transition the moments higher ( squares of least updating for others, efficient. amongst memory ( as used, well be as can fast algorithms computationally Similar both learning online makes This Welford eod nycransmaysaitc (here statistics summary certain only Second, , n 1962 aapoints data soa Moser & Escobar rfretmtn h ofcet falna ersinmdlusing model regression linear a of coefficients the estimating for or ) n hl nietenme rw ierya ucinof function a as linearly grows number the online while , n rw,porse sfollows: as progresses grows, x 1 4.2 ,...,x 1+2+3+ 1+1+1+ hw w e etrso nielann:fis,when first, learning: online of features key two shows x ¯ n , +1 n 1993 oespecifically, More . =¯ = =¯ x x n ; ··· n i ! =1 n n +1 Plackett +1 n + ··· + x +1 n i x n , +1= n n +1 = = + , +1 O = 1 2 − 1950 n x n ( O n, n n x +1 ¯ ( +1 n n 2 n ( ) .I h etscin edetail we section, next the In ). n . 1) + . ) and , . , x ¯ n r eti memory. in kept are ) n . (4.2) 63

Chapter 4 , , . , ) j 65 k n 2 ( , is j ˆ σ (4.8) (4.6) (4.9) (4.7) ˆ µ (4.13) (4.11) (4.12) (4.10) Stein ; , and ) 2012 k 2 ( , ˆ τ , shrinkage factor ) k ( ˆ µ , the reliability goes 2 . σ # ) k ( j Morris & Lysy ν . +ˆ j , ) 2 , ) ) ) /n k ) k ( 2 ( k equals 1) j , ( ˆ µ ˆ ) j ρ j − ) ˆ ˆ k ρ µ 1) k 2 ( − ( − . , − σ j ) ) ) − k 2 ν k k k ( (1 ˆ +ˆ τ ij n J J 1( 2( 3( , can be interpreted as a 1) +ˆ y ) T T T 1) , ( k ) − ) ( " k k − j 2 k ( ( k ( ˆ 2 j 2 j ρ τ ( j , (see for instance, =1 = = = ˆ n τ µ i ˆ µ ! ˆ µ (ˆ =ˆ = ) ) ) ) J J J =1 =1 =1 k k k k ) 2 j j j 2 ! ! ! ( ( ( ( k ˆ j τ ˆ ˆ σ µ ( for the individual for which a new data point enters. ˆ ν j ˆ ρ = = = only ) ) ) k k k 1( 2( 3( T T T is large compared to the residual variance, 2 τ is the individual average, and where j ¯ y ). When The CDSS are then computed as follows: In the M step, these CDSS are used to obtain new estimates After updating the estimatesfollowed of by an the M model step. parameters This process a is repeated new until E convergence. step is executed, moved towards the overall mean, 1956 4.3.2 Online estimation of the random-interceptFor streaming model estimation of the random-intercept model, andoes algorithm is not needed require that storing all the data inat memory, or each cycling iteration through all cycle. the For data points thisthe purpose, EM we algorithm propose a described modificationcontribution previously. of to the This the E CDSS modification step of involves updating the Chapter 4: Introduction of SEMA where up. The reliability also goes up when the number of observations per individual, That is, increases. Lastly, we compute Note that one minus the reliability, which determines the extent to which the estimated individual-level effect, which can be interpreted as a measure of uncertainty of the individual-level effect. ∼ j (4.3) (4.5) (4.4) µ , and 2 T , 1 T , 2 by 2 ) ) j 2 µ µ σ 2 2 ) is assumed to be − τ σ − j j µ ij . The three unknown µ j is the individual-level and is also assumed to y ( ( , and . These individual-level µ j 2 ij 1 ), given the CDSS of the ϵ j µ J τ , =1 =1 j n ! ...J, , − i ! 1) 4.4 . µ 1 2 k − 2 is not observed. In order to k J =1 σ ( =1 j − ! j µ 2 µ )ˆ 1 2 Chapter 4: Introduction of SEMA ) τ ,j k j ( − , and j ln 2 2 ˆ ρ 2 τ J σ is the total number of observations of , − ...n j ln µ − n ) 2 , n =1 π j +(1 and independent of − j ) ) ¯ y 2 ln(2 ) π k ,i ( 2 ,σ J , which represent the individual-level effect, the j ij ) ρ ϵ (0 k − ln(2 ( j + 2 =ˆ n ˆ ν j ) − k ∼N µ ). The EM algorithm uses the complete-data log-likelihood ( of individual j . ij ˆ = µ i ϵ k )= using , and y ij ) 1977 ) | k y , k 2 ( ( j j ˆ . ρ ,σ ˆ µ j 2 , is the total number of individuals, and n ) k ( J is not observed, we have to impute values for this variable in order to j µ, τ =1 ( J j ˆ j µ ℓ µ " is observation = . The random error per observation is denoted by ij ) y Dempster et al. n 2 We can obtain, Maximum-likelihood estimates of the parameters of the random-intercept model The model of interest can be formulated as follows: Because The CDSS computed in the E step are a function of three individual-level statis- µ, τ , respectively. In order to compute the CDSS, the algorithm imputes values for ( and the estimates of the model parameter at iteration 3 statistics are cannot be computed directly due to the fact that an individual, reliability of this individual-level effect, andrespectively, at variance iteration of this individual-level effect obtain maximum-likelihood estimates, we userithm the ( Expectation Maximization algo- function, a likelihood function for which the latent variable ( be normally distributed random intercept. These intercepts are assumed to be normally distributed as tics. These individual-level statistics are aj function of the observations of individual the estimates of the model parametersthe of CDSS the are previous computed. iteration Subsequently, (orstep these starting maximizes CDSS values) are the used complete-data in log-likelihoodprevious the E (Eq. M step. step. The M the latent variable in the E step. Using these imputed values in combination with model parameters to be estimated are thus 64 T N known. The complete-data log-likelihood function is as follows: compute the Complete Data Sufficient Statisticsfor (CDSS). each There model are parameter. three CDSS, We one denote these CDSS for where where

Chapter 4 where where o ahmdlprmtr ednt hs DSfor CDSS these denote one We CDSS, three parameter. are model There each (CDSS). for Statistics Sufficient Data Complete the compute nw.Tecmlt-aalglklho ucini sfollows: as is function log-likelihood complete-data The known. N T 64 oe aaeest eetmtdaethus are estimated be to parameters model is hs niiullvlsaitc r ucino h bevtoso individual of observations the of function j a are statistics individual-level These tics. M The step. step. M (Eq. E the previous log-likelihood in complete-data used the are values) CDSS maximizes starting these step with (or Subsequently, iteration combination computed. previous in are the values CDSS of imputed the parameters these model Using the of step. estimates E the the in variable latent the admitret hs necpsaeasmdt enral itiue as distributed normally be to assumed are intercepts These intercept. random enral distributed normally be banmxmmlklho siae,w s h xetto aiiainalgo- Maximization Expectation the use we estimates, maximum-likelihood obtain ucin ieiodfnto o hc h aetvral ( variable latent the which for function likelihood a function, ( rithm eiblt fti niiullvlefc,advrac fti niiullvleffect individual-level this of iteration variance at respectively, and effect, individual-level this of reliability nindividual, an antb optddrcl u otefc that fact the to due directly computed be cannot ttsisare statistics 3 n h siae ftemdlprmtra iteration at parameter model the of estimates the and ( epciey nodrt opt h DS h loih mue ausfor values imputes algorithm the CDSS, the compute to order In respectively. , ,τ µ, h DScmue nteEse r ucino he niiullvlstatis- individual-level three of function a are step E the in computed CDSS The Because aiu-ieiodetmtso h aaeeso h admitretmodel random-intercept the of parameters the of estimates Maximum-likelihood h oe fitrs a efruae sfollows: as formulated be can interest of model The ecnobtain, can We 2 n eptre al. et Dempster y ) ij h admerrprosraini eoe by denoted is observation per error random The . = sobservation is " µ ℓ µ j ˆ j J ( =1 ,τ µ, j sntosre,w aet muevle o hsvral nodrto order in variable this for values impute to have we observed, not is J ( k ) n stettlnme fidvdas and individuals, of number total the is , 2 j µ ˆ ,σ ρ . ˆ j j ( ( 2 k , y k | ) 1977 ) ij y and , using )= k ϵ i µ = ˆ ij . j findividual of ( .TeE loih sstecmlt-aalog-likelihood complete-data the uses algorithm EM The ). µ ∼N k − ) j ν ˆ n =ˆ 2 + j ( ln(2 − k (0 ϵ ρ ) ij j hc ersn h niiullvlefc,the effect, individual-level the represent which , J ,σ 2 ( ,i k π ) ln(2 2 y ¯ ) ) j − n needn of independent and +(1 j π =1 n , 2 ) n − µ ln j ...n − , stettlnme fosrain of observations of number total the is σ J τ 2 ρ ˆ 2 2 ln j and , − ( j k ,j τ ) hpe :Itouto fSEMA of Introduction 4: Chapter 2 1 )ˆ µ 2 µ j ! − j =1 ( σ =1 J k sntosre.I re to order In observed. not is 2 − k 2 1 µ . 4.4 1) ! i − , ...J, ! n j =1 =1 , τ J µ j ϵ ,gvnteCS fthe of CDSS the given ), 1 ij 2 j µ hs individual-level These . and , ( ( y n sas sue to assumed also is and steindividual-level the is j µ h he unknown three The . ij µ j j − σ τ − sasmdt be to assumed is ) 2 2 σ µ µ 2 j ) ) 2 by 2 , T 1 , T 2 and , µ (4.4) (4.5) (4.3) j ∼ hc a eitrrtda esr fucranyo h niiullvleffect. individual-level the of uncertainty of measure a as interpreted be can which hc eemnsteetn owihteetmtdidvda-ee effect, individual-level estimated the which to extent the determines which oeta n iu h reliability, the minus one that Note nrae.Lsl,w compute we Lastly, increases. htis, That p h eiblt loge pwe h ubro bevtosprindividual, per observations of number the when up goes also reliability The up. h Magrtmdsrbdpeiul.Ti oicto novsudtn the updating involves of step modification CDSS E the This the to of previously. contribution modification described a propose algorithm we EM purpose, the this points data For the cycle. all through iteration cycling each or memory, at in data the all storing that require needed not is algorithm does an model, random-intercept the of estimation model streaming For random-intercept the of estimation Online 4.3.2 where SEMA of Introduction 4: Chapter 1956 mean, overall the towards moved fe paigteetmtso h oe aaeesanwEse sexecuted, is step convergence. E until new repeated is a process This parameters step. model M the an by of followed estimates the updating After nteMse,teeCS r sdt bannwestimates new obtain to used are CDSS these step, M the In h DSaete optda follows: as computed then are CDSS The .When ). y ¯ j steidvda vrg,adwhere and average, individual the is τ 2 slrecmae otersda variance, residual the to compared large is T T T 3( 2( 1( k k k ) ) ) only = = = ρ ˆ j ν ˆ o h niiulfrwihanwdt on enters. point data new a which for individual the for ( µ σ ˆ ˆ τ j ˆ k ( ( ( ( ! ! ! 2 j j j 2 ) k k k k =1 =1 =1 J J J ) ) ) ) = =ˆ (ˆ µ ˆ ! µ ˆ i µ τ n ˆ = = = =1 sefrinstance, for (see , j ( τ ρ j 2 j 2 ˆ ( k ( ( k 2 j − k k " ( ) − ) k ( , 1) T T T ) y +ˆ 1) a eitrrtda a as interpreted be can , 3( 2( 1( J J n ij τ +ˆ ˆ (1 ( k k k ν 2 k − ) ) ) j σ − , . − ( − ( 2 k 1) µ ρ k ˆ ˆ ) − j ρ j ) ˆ µ ˆ ( , j 1) equals k ( 2 ( k ) k /n ) ) ) , 2 ) , j +ˆ . ν ors&Lysy & Morris j ( k ) # σ . 2 h eiblt goes reliability the , µ ˆ ( k ) hikg factor shrinkage , τ ˆ , ( 2 k 2012 ) and , ; Stein (4.10) (4.12) (4.11) (4.13) µ ˆ (4.7) (4.9) (4.6) (4.8) σ ˆ j is , ( 2 n k 65 j ) , , . ,

Chapter 4 . j 67 n (4.16) (4.17) and 2 j ¯ y , j ¯ y ) differs from the presents a summa- 4.10 , j 4.10 n ) . ) . This analysis shows that t ) , j ( t , Eq. ( j # based on the CDSS from the 3 t ) ν j t ) T , ( 2 t ) 2 ( j t +ˆ T ( ν ˆ σ t ) j t + ( 3 +ˆ 2 j T 2 1) µ ) , − ) ) + , and t t t +ˆ ) ( ( ( t t 2 j ) j ( 1) , j t ˆ ) ˆ τ ν 2 µ ( − t for individual t j the SEMA algorithm proceeds as follows: , ( T ( ) , j ˆ +ˆ µ − t t t ) ij ˆ j ν ( j t ) − 3 y ( t ˆ ij y µ j ( T y 2¯ 2 j 2 , 1) ( ) ˆ µ T − t " − − t ( , and j j 2 j 2( ) 3 J J 1) =1 =1 ¯ t =1 n y j j T ! ! ( i ! T online, − ( j t j ˆ ρ = = = 3( J J J =1 =1 =1 , n j j j ) T ! ! ! ) t t ( 2( j = = = = ˆ T µ ) t , and 2 j 3( , ¯ t y by 1 when it concerns an observation of a new individual and T individuals. However, this is relatively straightforward: j , j J , J ¯ t y = j = n observations, while here we present the computation more efficiently as previous E step. subtract the current contribution from the CDSS, update compute new add the new contribution to the CDSS, increase set compute the new estimates in a similar way: n is the average of the squared 2 2 j 1. 2. 3. 4. 1. 2. T ¯ y E step: for M step In the summary presented above, it can be seen why SEMA (Streaming EM Ap- The update of the CDSS of the residual variance ( To summarize, at entry of data point • • proximation) is called an approximate EME algorithm: step and SEMA a performs, M like step. EM, However,E unlike an step, the because EM it algorithm, only it updates only the does contribution a single to partial the CDSS for a single individual. previous two equations. This is due totion the over fact that Equation in order to perform the E step we do not need all data points, but only Chapter 4: Introduction of SEMA rewrite where a summation over , , , j 1 ) j 2 t ˆ µ ˆ ρ ( . T − j w , , t j 1 T ). We (4.15) (4.14) ˆ before µ T E step, t j 1998 , ; that is, the t ) are already partial j 2 τ ̸= j data points, t for 1) observations, we merely -th data point. Note that , − ) t t . t n ( Neal & Hinton ) is that in the former all ( t , CDSS for ; is computed using the most t ( 2 wj t j j T wj T 1 ˆ ( µ T individuals. T , 4.15 2000 = t J j + , Chapter 4: Introduction of SEMA ) + 4.9 t ( 1) , indexing the data point which is , that is, the individual for which a 1) t t − wj − j t to those based on t ( T ( t t } = j 3 wj 1 , j T T , 2 needs to be updated. This can be expressed ) , , t : t ) − − ( t 1 j j ( and Equation j 1 1) 1) = ˆ µ T − − 4.14 ∈{ t t j ( 4.8 . Note that 1( ) and Equation J J =1 =1 w t w j j T ! ! µ McLachlan & Peel T , 1) = = = = − ) ) t t t ( denote the contribution to CDSS for individual ( 1( w ) w , is now replaced by t T individuals instead of data points. Therefore they are easily T ( T k t J wj T , CDSS for 1 T ( and 1) 4.8 − t ( t denote the individual which corresponds to the t wj can be computed at the level of individuals instead of observations within j are computed only for individual T 3 j ˆ T ν the contribution of individual Let Equation The key feature used by our proposed SEMA algorithm is that the CDSS as follows: contribution does not change if the new data point does not concern individual The difference between Equation written as a sums over rewritten in the format of Equation are estimated with the estimatesIn of the the latter model formulation, however, parameters only for from person the latest iteration. 66 The M step remains thethe same data since, points. given the CDSS, the M step is independent of individuals. Therefore, it is no longer requiredstore to a store small all number of summaries for each of the recent estimates of the model parameters.only Therefore, for SEMA 1 applies individual a (see also, being processed. The key element of the proposed SEMA algorithm is that this data point can befrom a either new from individual. an For individual thepreviously discussion who denoted of is SEMA, by already the in iteration index, the which sample, was or new data point arrives. This impliesdata that points, when denoted going by from the CDSS based on and after the entry of data point and only and where

Chapter 4 where only and and n fe h nr fdt point data of entry the after and e aapitarvs hsipista hngigfo h DSbsdon based CDSS the from going when that implies This arrives. point data new aapit,dntdby denoted points, data en rcse.Tekyeeeto h rpsdSM loih sthat is algorithm SEMA proposed the of element key or The was sample, which processed. the index, being iteration in the already by SEMA, is of denoted who discussion previously the individual For an individual. from new either a from be can point data this nyfr1idvda sealso, (see a individual applies 1 SEMA for Therefore, only parameters. model the of estimates recent r siae ihteetmtso h oe aaeesfo h aetiteration. latest the person from for only parameters however, formulation, model latter the the of In estimates the with estimated are the of each for summaries of number all small store a to store required longer no is it Therefore, individuals. of independent is step M the CDSS, the given points. since, data same the the remains step M The 66 erte ntefra fEquation of format the in rewritten over sums a as written h ifrnebtenEquation between difference The otiuinde o hnei h e aapitde o ocr individual concern not does point data new the if change not does contribution sfollows: as h e etr sdb u rpsdSM loih sta h CDSS the that is algorithm SEMA proposed our by used feature key The Equation Let h otiuino individual of contribution the ν T ˆ j 3 T r optdol o individual for only computed are j a ecmue ttelvlo niiul nta fosrain within observations of instead individuals of level the at computed be can wj t eoeteidvda hc orsod othe to corresponds which individual the denote t ( t − 4.8 1) and ( T 1 DSfor CDSS , T wj J t k T ( T niiul nta fdt ons hrfr hyaeeasily are they Therefore points. data of instead individuals T t snwrpae by replaced now is , w ) w 1( ( eoetecnrbto oCS o individual for CDSS to contribution the denote ( t t t ) ) − = = = = 1) , T caha Peel & McLachlan µ ! ! T j j w t w =1 =1 J J n Equation and ) 1( oethat Note . 4.8 ( j t t ∈{ 4.14 − − T µ ˆ = 1) 1) 1 j n Equation and ( j j 1 t ( − − ) t : t , , ) ed ob pae.Ti a eexpressed be can This updated. be to needs 2 , T T j , 1 wj 3 j = } t t ( T ( t otoebsdon based those to t j − wj − t t 1) hti,teidvda o hc a which for individual the is, that , neigtedt on hc is which point data the indexing , 1) ( t 4.9 + ) hpe :Itouto fSEMA of Introduction 4: Chapter , + j J t = 2000 4.15 , T individuals. T µ ( ˆ 1 T wj T j j t wj 2 ( t scmue sn h most the using computed is ; DSfor CDSS , t ( sta ntefre all former the in that is ) el&Hinton & Neal ( n t . t t ) − , t aapit oethat Note point. data -th bevtos emerely we observations, 1) for t aapoints, data j ̸= τ 2 j partial r already are ) t hti,the is, that ; , 1998 j t step, E T µ before ˆ (4.14) (4.15) .We ). T 1 j t , , w j − T . ( ρ ˆ µ ˆ t 2 j ) 1 j , , , umto over summation a where rewrite SEMA of Introduction 4: Chapter nodrt efr h tpw ontne l aapit,btonly but points, data all need not do we step E the perform to order in rvostoeutos hsi u otefc htEquation that fact over the tion to due is This equations. two previous tp eas tol pae h otiuint h DSfrasnl individual. single a for CDSS the partial to single a contribution does the only updates it only algorithm, it EM because the step, an unlike E However, EM, step. like M performs, a SEMA and step algorithm: E EM approximate an called is proximation) • • osmaie tetyo aapoint data of entry at summarize, To h paeo h DSo h eiulvrac ( variance residual the of CDSS the of update The ntesmaypeetdaoe tcnb enwySM SraigE Ap- EM (Streaming SEMA why seen be can it above, presented summary the In step M tp for step: E y ¯ T 2. 1. 4. 3. 2. 1. j 2 2 steaeaeo h squared the of average the is n nasmlrway: similar a in opt h e estimates new the compute set increase CDSS, the to contribution new the add opt new compute update CDSS, the from contribution current the subtract rvosEstep. E previous bevtos hl eew rsn h optto oeefiinl as efficiently more computation the present we here while observations, n = j = y t ¯ J , J j , j niiul.Hwvr hsi eaieystraightforward: relatively is this However, individuals. T y1we tcnen nosraino e niiuland individual new a of observation an concerns it when 1 by y t ¯ , 3( j 2 and , t ) µ T ˆ = = = = j 2( ( t t ) ! ! ! T ) j j j n , =1 =1 =1 J J J 3( = = = ρ ˆ j t j ( − online, T ! i ( ! ! T j j y n =1 t ¯ =1 =1 1) J J 3 ) 2( j 2 j j and , ( t − − " t − T µ ˆ ) ( 1) , 2 j 2 2¯ y T ( j µ y ij ˆ t ( y 3 − ) t j ( ν j ˆ ij ) t t t − µ +ˆ ˆ j , ) ( T ( , h EAagrtmpoed sfollows: as proceeds algorithm SEMA the j t o individual for t − ( µ 2 ν τ ˆ ) ˆ t j , 1) ( j ) j 2 t t ( ( ( ) +ˆ t t t and , + ) ) − , ) µ 1) 2 T j 2 +ˆ 3 ( + t j ) t σ ˆ ν ( T +ˆ t j ( 2 ) t 2 ( , T ) t j ν ) t 3 ae nteCS rmthe from CDSS the on based # j ( Eq. , t ( j , ) t hsaayi hw that shows analysis This . ) . ) n 4.10 j , 4.10 rsnsasumma- a presents ifr rmthe from differs ) y ¯ j , y ¯ j 2 and (4.17) (4.16) n 67 j .

Chapter 4 . j 2 ), 1 = ˆ τ 69 n 100 . In n 4.1 > = 10 = 2 across 4.3 µ τ j , and 2 n ) 2 j ˆ σ µ , J − ˆ µ j µ (ˆ , see Table } = 100 ). However, the 100 and =1 2 J j 25 5,000, SEMA starts = σ , 4.2 still seems overesti- ! 2 2 = 10 τ are close to the popu- τ { = 10, which is the lowest n 100). SEMA somewhat µ 2 = ¯ = e = j j n n n , are presented in Table 2 σ with (presented in the columns) and 1 and , at the end of the data stream ( 2 ¯ ρ τ . In the five lowest reliability condi- = 2 2 = 10 2 τ τ 2 τ τ . However, when 2 in the simulation study for τ ¯ ρ and 1 10 25 100 } 100 , j 25 , 1025 .091 .500 .200 .714 .714 .909 .862 .962 n is low, both EM and SEMA underestimate the residual 100 .500 .909 .962 .991 2 10 { τ ). present the mean and standard deviation (SD) of = j presents the results for 4.5 n 2013 halfway through the data stream for the conditions where , 4.4 2 100), causing the data-generating model to fluctuate more across the σ . The two factors that varied are = 1 with is estimated equally well by SEMA and EM. Table 4.1: Average reliability J through 2 = σ compared to the population values that generated the data. = 100 2 4.2 ), τ 2 10 σ R Core Team Across all conditions, the SEMA and EM estimates of Next, Table The results for the estimated residual variance, 000 , lation value even with asSEMA estimates little are as clearly 100 much moreof observations variable the (see than data Table those stream, of that EMthis at is, difference the when has beginning SEMA disappeared did by notlarger 10,000 SD’s have observations. at the the end Both chance of EM to the data and converge. stream SEMA in But the have condition with In all conditions, the variabilitybeginning of of both the the SEMA data andstream. stream, EM but Irrespective estimate of decreases is the steadily large average in10 towards reliability, the the end of the data the condition in which variance in the beginningoverestimates of the data stream (when Chapter 4: Introduction of SEMA of the data stream, and[R]( the estimation using SEMA and EM, were implemented in Results Tables respectively, and the averaged squared prediction error: (presented in the rows). Inthan bold are the parameter estimates that differed by more tions ( mated at 10,000 observations occurs with 1,000 replications at 100, 1,000,standard 5,000, (offline) and EM. For 10,000 each observations simulationand for run, both the SEMA population and values were SEMA seems to slightly overestimate than in the other conditions. Thiscondition is due ( to the smaller number of individuals in this approaching the EM estimates. The only situation in which different runs. . ¯ ρ = (0) ˆ µ ); that is, , constant. 2 4.6 σ 1,000, 250, or = in the simulation J ¯ ρ . Both the simulation individual-level effects . The residual variance =1 } J (see also Eq. 2 (0) 100 ¯ , were drawn from a normal ρ ˆ σ j , yielded average reliabilities µ 25 2 , this is the main factor varied in , ¯ τ ρ Chapter 4: Introduction of SEMA , and 10 , 1 =1 observations. The average number of and { j 2 (0) n = http://github.com/L-Ippel/SEMA ˆ τ 000 , 2 τ = 10 presents the different levels of n 4.1 ) was 10, 25, or 100, which results in j n , and by varying the amount of variance of the random j n and variance in all conditions. First, we generated = 10 . Next, the observations were generated by randomly drawing an ) µ 2 = 100 µ, τ , we provide some additional justification for SEMA. In the next sec- ( . In this simulation study, we keep the residual variance, 2 2 4.6 σ τ ∼N is large, the EM algorithm will converge after a few iterations, but when the eters j ¯ ρ µ (first observation of the data stream), We generated data streams of Each of these 12 conditions was run 1,000 times. The starting values were ’s get closer to zero convergence will become slower. Since it can be expected that =1 j t from We evaluate SEMA and EMindividual-level by effects monitoring during the the data parameter stream. estimates and predicted observations per individual ( intercept, was set to individual, and generating a datalevel point based effect. on this The individual’s 12 true different individual- settings for distribution with rithm for multilevel modelswhen is the average reliability ρ Design In this simulation study, we comparerithm the performance with of the the standard proposed SEMA EMestimates. algo- algorithm, An in important terms factor of affecting the the accuracy speed of of convergence the of parameter the EM algo- tion, we will test the accuracy of SEMA in two simulation studies. 4.4 Performance of SEMA evaluated4.4.1 by simulation Simulation study I: Evaluation of the precision of estimated param- of the updating the information ofcomputationally all less individuals. intensive, The which benefit makes oflarge it SEMA static suitable is data for that and dealing it data with is the both streams, number very because of individuals the instead required of memorySEMA the algorithm only number in [R] grows of code with data is points. available at In An Section example of the 68 Doing only an E step foring one the E individual step is for computationally all individuals. less(i.e., It expensive also take than means more that do- partial SEMA will E converge - morethe M slowly EM steps) algorithm, to because the it (local) only updates maximum-likelihood the estimate information than of one individual instead y convergence of SEMA will be strongly affectedthis by simulation study. We do so inservations two per different ways: individual, by varying the number of ob- ranging from .091 to .990. Table 100 individuals in total. The individual-level effects, study.

Chapter 4 study. 0 niiul nttl h niiullvleffects, individual-level The total. in individuals 100 agn rm.9 o.9.Table .990. to .091 from ranging hssmlto td.W os ntodfeetwy:b ayn h ubro ob- of number the varying by individual, ways: different per two servations in so do We study. simulation by this affected strongly be will SEMA of convergence y ρ reliability algo- average EM the the parameter is when of the convergence models of of multilevel speed accuracy for the the rithm affecting of factor terms important in An algorithm, algo- estimates. EM SEMA proposed standard the the of with performance the rithm compare we study, simulation this In Design param- estimated of precision the of Evaluation I: study Simulation simulation by 4.4.1 evaluated SEMA of Performance 4.4 studies. simulation two in SEMA of accuracy the test will we tion, the of example Section An In at available points. is data with code of grows [R] in number only algorithm the SEMA memory of required instead the individuals of because very number streams, both the is with data it dealing and that for data is suitable static SEMA it large of makes benefit which The intensive, individuals. less all computationally of instead information individual the one updating of than information the estimate the of maximum-likelihood updates only (local) it the because to algorithm, steps) EM slowly M the more - converge E will SEMA partial do- that more means than take also expensive It (i.e., less individuals. all computationally for is step individual E the one ing for step E an only Doing 68 itiuinwith distribution ee fet h 2dfeetstig for settings individual- different true 12 individual’s The this on effect. based point level data a generating and individual, a e to set was intercept, bevtospridvda ( individual per observations predicted and estimates stream. parameter data the the during monitoring effects by individual-level EM and SEMA evaluate We from t j =1 sgtcoe ozr ovrec ilbcm lwr ic tcnb xetdthat expected be can it Since slower. become will convergence zero to closer get ’s aho hs 2cniin a u ,0 ie.Tesatn auswere values starting The times. 1,000 run was conditions 12 these of Each egnrtddt tem of streams data generated We fis bevto ftedt stream), data the of observation (first µ ρ ¯ j eters slre h Magrtmwl ovreatrafwieain,btwe the when but iterations, few a after converge will algorithm EM the large, is ∼N τ σ 4.6 2 2 nti iuainsuy eke h eiulvariance, residual the keep we study, simulation this In . ( epoiesm diinljsicto o EA ntenx sec- next the In SEMA. for justification additional some provide we , ,τ µ, 100 = 2 µ ) et h bevtoswr eeae yrnol rwn an drawing randomly by generated were observations the Next, . 10 = nalcniin.Frt egenerated we First, conditions. all in n variance and n j n yvrigteaon fvrac fterandom the of variance of amount the varying by and , n j a 0 5 r10 hc eut in results which 100, or 25, 10, was ) 4.1 n rsnstedfeetlvl of levels different the presents 10 = τ 2 , 000 τ ˆ http://github.com/L-Ippel/SEMA = n (0) 2 j { and bevtos h vrg ubrof number average The observations. =1 1 , 10 and , hpe :Itouto fSEMA of Introduction 4: Chapter ρ τ ¯ , hsi h anfco aidin varied factor main the is this , 2 25 µ ile vrg reliabilities average yielded , j σ ˆ ρ eedanfo normal a from drawn were , ¯ 100 (0) 2 seas Eq. also (see J } =1 h eiulvariance residual The . niiullvleffects individual-level ohtesimulation the Both . ρ ¯ J ntesimulation the in = ,0,20 or 250, 1,000, σ 4.6 2 constant. , ;ta is, that ); µ ˆ (0) = ρ ¯ . ifrn runs. different prahn h Metmts h nystaini which in situation only The estimates. EM the approaching hni h te odtos hsi u otesalrnme fidvdasi this in individuals of number smaller the to ( due is condition This conditions. other the in than EAsest lgtyoverestimate slightly to seems SEMA tnad(fie M o ahsmlto u,tepplto auswere values and population SEMA the both run, for and simulation observations each 10,000 For EM. and (offline) 5,000, standard 1,000, 100, at replications 1,000 ae t1,0 bevtosocr with occurs observations 10,000 at mated in ( tions peetdi h os.I odaeteprmtretmtsta ifrdb more by differed that estimates parameter the are bold than In rows). the in (presented error: prediction squared averaged the and respectively, Tables Results in implemented were EM, and SEMA using estimation the [R]( and stream, data the of SEMA of Introduction 4: Chapter ainei h einn ftedt tem(when stream data the of overestimates beginning the in variance h odto nwhich in condition the einn ftedt tem u erae taiytwrsteedo h data the of end the the reliability, towards 10 in average large steadily the is decreases of estimate Irrespective but EM stream, stream. and data SEMA the the both of of beginning variability the conditions, all In agrS’ tteedo h aasra ntecniinwith condition have the But in SEMA stream converge. and data the to EM of chance Both end the the at observations. have SD’s 10,000 larger not by did disappeared SEMA beginning has when the difference is, at this EM that of stream, those Table data than (see the variable observations of more much 100 clearly as are little estimates as SEMA with even value lation , 000 h eut o h siae eiulvariance, residual estimated the for results The et Table Next, cosalcniin,teSM n Metmtsof estimates EM and SEMA the conditions, all Across oeTeam Core R σ 10 2 τ ), 4.2 2 100 = oprdt h ouainvle htgnrtdtedata. the generated that values population the to compared σ = 2 through J al .:Aeaereliability Average 4.1: Table setmtdeulywl ySM n EM. and SEMA by well equally estimated is with 1 = h w atr htvre are varied that factors two The . σ 0) asn h aagnrtn oe oflcut oears the across more fluctuate to model data-generating the causing 100), 2 4.4 , afa hog h aasra o h odtoswhere conditions the for stream data the through halfway 2013 n 4.5 rsnsterslsfor results the presents j = rsn h enadsadr eito S)of (SD) deviation standard and mean the present ). τ { 10 2 0 50.0 92.991 .962 .909 .500 100 slw ohE n EAudrsiaeteresidual the underestimate SEMA and EM both low, is n 5.0 74.6 .962 .862 .909 .714 .714 .200 .500 .091 25 10 , 25 j , 100 } 02 100 25 10 1 and ρ ¯ τ ntesmlto td for study simulation the in 2 oee,when However, . τ τ 2 τ τ 2 10 = 2 2 = ntefielws eiblt condi- reliability lowest five the In . τ ρ ¯ 2 tteedo h aasra ( stream data the of end the at , and 1 peetdi h oun)and columns) the in (presented with σ 2 r rsne nTable in presented are , n n n j j = e = ¯ = 2 µ 0) EAsomewhat SEMA 100). n 0 hc stelowest the is which 10, = { τ r ls otepopu- the to close are τ 10 = 2 2 ! tl em overesti- seems still 4.2 , σ = ,0,SM starts SEMA 5,000, 25 j J 2 =1 0 and 100 .Hwvr the However, ). 100 = } e Table see , (ˆ µ j µ ˆ − J , µ σ ˆ j 2 ) n 2 and , j τ µ 4.3 across 2 = 10 = > 4.1 n In . 100 n 69 τ ˆ = 1 ), 2 j .

Chapter 4

71

0009.6(.5 99 14)9.6(.5 99 14)9.6(.5 99 14)9.6(.5 99 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 10,000

,0 99 20)9.9(.6 0.0(.6 0.0(.6 0.0(.6 0.0(.6 0.0(.6 0.0(2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 99.99 (2.05) 99.95 5,000

100

,0 87 1.6 96 46)107 2.8 0.0(.0 0.0(55)100 48)133 4.1 0.0(4.81) 100.00 (42.81) 103.33 (4.80) 100.00 (25.59) 101.70 (4.80) 100.00 (20.88) 100.70 (4.62) 99.65 (16.66) 98.71 1,000

173.49 116.66 0 84 2.4 27 1.2 0.1(49)9.8(17.65) 95.48 (34.93) 105.21 (15.72) 92.77 (29.94) 98.40 100 7.3 90 (22.93) 99.00 (72.73) (20.03) 97.68 (42.86)

0009.1(.1 99 14)9.4(.4 99 14)9.4(.4 99 14)9.4(.4 99 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.93 (1.41) 99.51 10,000

,0 84 20)9.2(.8 98 21)9.6(.0 0.0(.1 99 21)9.7(.0 99 (2.10) 99.96 (2.10) 99.97 (2.10) 99.96 (2.11) 100.00 (2.10) 99.96 (2.10) 99.84 (2.08) 99.92 (2.03) 98.47 5,000

25

139.70 ,0 63 1.0 87 48)110 1.5 98 55)179 1.0 99 (5.67) 99.90 (17.70) 107.98 (5.57) 99.89 (14.15) 101.03 (4.87) 98.75 (11.60) 96.33 1,000 4.7 98 (5.77) 99.87 (40.37)

85.61 193.94 121.39 88.58 2.9 0.0(36.36) 107.50 (21.79) (33.41) 98.85 100 7.8 95 (40.14) 99.56 (72.38) (28.93) 92.33 (40.72) (24.74)

0009.8(.8 0.0(.8 97 15)100 15)102 15)100 15)101 15)100 (1.50) 100.02 (1.51) 100.13 (1.50) 100.03 (1.57) 100.23 (1.50) 100.03 (1.56) 99.79 (1.48) 100.00 (1.48) 98.08 10,000

,0 66 34)9.5(.6 96 45)100 22)130 65)100 22)194 1.4 0.8(2.28) 100.08 (15.04) 109.48 (2.27) 100.08 (6.51) 103.01 (2.27) 100.08 (4.54) 99.60 (2.16) 99.85 (3.42) 96.61 5,000

10

170.38 114.29 ,0 64 2.6 79 53)131 2.5 0.0(6.60) 100.00 (22.55) 103.18 (5.31) 97.91 (20.26) 96.47 1,000 4.2 0.7(7.46) 100.27 (49.62) (7.13) 100.27 (26.49)

74.87 198.37 82.52 122.24 78.11 3.2 0.0(38.39) 107.40 (30.82) (35.04) 98.23 100 7.7 51 (55.96) 95.12 (73.47) (38.26) (44.24) (33.73)

enS enS enS enS enS enS enS enSD mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean j n ¯ n

EAEM SEMA EM SEMA EAEM SEMA EM SEMA

=1 τ =100 τ =25 τ =10 τ

2 2 2 2

ouainvle of values population τ

2

n nbl hs auswihaemr hn1 rmtetu value. true the from 10 than more are which values those bold in and

vrgdoe ,0 elctos ( replications, 1,000 over averaged of Estimates 4.3: Table 0) nteprnhssaeteS’ vr100replications 1,000 over SD’s the are parentheses the In 100).

σ = σ , 10 = µ

2 2 Chapter 4: Introduction of SEMA

Table 4.2: estimates of µ (population value µ = 10) averaged over 1,000 replications (σ2 = 100). In the parentheses are the SD’s over 1,000 replications.

population values of τ 2 τ 2 =1 τ 2 =10 τ 2 =25 τ 2 = 100

Chapter 4: Introduction of SEMA SEMA EM SEMA EM SEMA EM SEMA EM n¯j n mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean SD 100 10.10 (3.90) 9.99 (1.05) 10.08 (4.13) 9.99 (1.09) 10.10 (4.46) 9.99 (1.16) 10.10 (5.92) 9.99 (1.46) 1,000 10.07 (2.67) 10.00 (0.32) 10.06 (2.81) 10.00 (0.34) 10.08 (3.04) 10.00 (0.38) 10.12 (4.09) 10.00 (0.52) 10 5,000 10.01 (0.89) 10.00 (0.15) 10.01 (0.92) 10.00 (0.17) 10.01 (0.97) 10.00 (0.21) 10.03 (1.07) 10.00 (0.34) 10,000 10.00 (0.22) 10.00 (0.10) 10.00 (0.19) 10.00 (0.14) 10.00 (0.19) 10.00 (0.18) 10.00 (0.32) 10.00 (0.32) 100 9.82 (3.95) 10.06 (1.05) 9.84 (4.14) 10.07 (1.06) 9.85 (4.42) 10.07 (1.13) 9.91 (5.81) 10.10 (1.43) 1,000 9.91 (2.01) 10.00 (0.32) 9.93 (2.11) 10.01 (0.37) 9.93 (2.22) 10.01 (0.43) 9.96 (2.98) 10.02 (0.63) 25 5,000 10.00 (0.17) 10.00 (0.15) 10.00 (0.22) 10.01 (0.22) 10.01 (0.29) 10.01 (0.29) 10.01 (0.52) 10.01 (0.52) 10,000 10.00 (0.12) 10.00 (0.12) 10.01 (0.19) 10.01 (0.19) 10.01 (0.27) 10.01 (0.27) 10.01 (0.51) 10.01 (0.51) 100 9.93 (3.56) 9.98 (1.03) 9.95 (3.73) 10.01 (1.12) 9.99 (4.03) 10.03 (1.25) 10.09 (5.23) 10.09 (1.70) 1,000 10.01 (0.93) 10.01 (0.34) 10.04 (1.05) 10.03 (0.47) 10.08 (1.24) 10.05 (0.62) 10.15 (1.76) 10.10 (1.09) 100 5,000 10.01 (0.18) 10.01 (0.18) 10.03 (0.35) 10.03 (0.35) 10.04 (0.53) 10.04 (0.53) 10.09 (1.04) 10.09 (1.04) 10,000 10.01 (0.15) 10.01 (0.15) 10.03 (0.34) 10.03 (0.34) 10.05 (0.53) 10.05 (0.53) 10.10 (1.01) 10.10 (1.04) 70

Chapter 4

70

0001.1(.5 00 01)1.3(.4 00 03)1.5(.3 00 05)1.0(.1 01 (1.04) 10.10 (1.01) 10.10 (0.53) 10.05 (0.53) 10.05 (0.34) 10.03 (0.34) 10.03 (0.15) 10.01 (0.15) 10.01 10,000

,0 00 01)1.1(.8 00 03)1.3(.5 00 05)1.4(.3 00 10)1.9(1.04) 10.09 (1.04) 10.09 (0.53) 10.04 (0.53) 10.04 (0.35) 10.03 (0.35) 10.03 (0.18) 10.01 (0.18) 10.01 5,000

100

,0 00 09)1.1(.4 00 10)1.3(.7 00 12)1.5(.2 01 17)1.0(1.09) 10.10 (1.76) 10.15 (0.62) 10.05 (1.24) 10.08 (0.47) 10.03 (1.05) 10.04 (0.34) 10.01 (0.93) 10.01 1,000

0 .3(.6 .8(.3 .5(.3 00 11)99 40)1.3(.5 00 52)1.9(1.70) 10.09 (5.23) 10.09 (1.25) 10.03 (4.03) 9.99 (1.12) 10.01 (3.73) 9.95 (1.03) 9.98 (3.56) 9.93 100

0001.0(.2 00 01)1.1(.9 00 01)1.1(.7 00 02)1.1(.1 00 (0.51) 10.01 (0.51) 10.01 (0.27) 10.01 (0.27) 10.01 (0.19) 10.01 (0.19) 10.01 (0.12) 10.00 (0.12) 10.00 10,000

,0 00 01)1.0(.5 00 02)1.1(.2 00 02)1.1(.9 00 05)1.1(0.52) 10.01 (0.52) 10.01 (0.29) 10.01 (0.29) 10.01 (0.22) 10.01 (0.22) 10.00 (0.15) 10.00 (0.17) 10.00 5,000

25

,0 .1(.1 00 03)99 21)1.1(.7 .3(.2 00 04)99 29)1.2(0.63) 10.02 (2.98) 9.96 (0.43) 10.01 (2.22) 9.93 (0.37) 10.01 (2.11) 9.93 (0.32) 10.00 (2.01) 9.91 1,000

0 .2(.5 00 10)98 41)1.7(.6 .5(.2 00 11)99 58)1.0(1.43) 10.10 (5.81) 9.91 (1.13) 10.07 (4.42) 9.85 (1.06) 10.07 (4.14) 9.84 (1.05) 10.06 (3.95) 9.82 100

0001.0(.2 00 01)1.0(.9 00 01)1.0(.9 00 01)1.0(.2 00 (0.32) 10.00 (0.32) 10.00 (0.18) 10.00 (0.19) 10.00 (0.14) 10.00 (0.19) 10.00 (0.10) 10.00 (0.22) 10.00 10,000

,0 00 08)1.0(.5 00 09)1.0(.7 00 09)1.0(.1 00 10)1.0(0.34) 10.00 (1.07) 10.03 (0.21) 10.00 (0.97) 10.01 (0.17) 10.00 (0.92) 10.01 (0.15) 10.00 (0.89) 10.01 5,000

10

,0 00 26)1.0(.2 00 28)1.0(.4 00 30)1.0(.8 01 40)1.0(0.52) 10.00 (4.09) 10.12 (0.38) 10.00 (3.04) 10.08 (0.34) 10.00 (2.81) 10.06 (0.32) 10.00 (2.67) 10.07 1,000

0 01 39)99 10)1.8(.3 .9(.9 01 44)99 11)1.0(.2 .9(1.46) 9.99 (5.92) 10.10 (1.16) 9.99 (4.46) 10.10 (1.09) 9.99 (4.13) 10.08 (1.05) 9.99 (3.90) 10.10 100

enS enS enS enS enS enS enS enSD mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean j n ¯ n

EAE EAE EAE EAEM SEMA EM SEMA EM SEMA EM SEMA SEMA of Introduction 4: Chapter

100 = τ =25 τ =10 τ =1 τ

2 2 2 2

ouainvle of values population τ

2

,0 replications. 1,000

.I h aetee r h Dsover SD’s the are parentheses the In ). ( replications 1,000 over averaged ) value (population of estimates 4.2: Table 100 = σ 10 = µ µ 2 hpe :Itouto fSEMA of Introduction 4: Chapter

Table 4.3: Estimates of σ2 averaged over 1,000 replications, (µ = 10,σ2 = 100). In the parentheses are the SD’s over 1,000 replications and in bold those values which are more than 10 from the true value.

population values of τ 2 τ 2 =1 τ 2 =10 τ 2 =25 τ 2 =100 SEMA EM SEMA EM SEMA EM SEMA EM n¯j n mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean SD 100 98.23 (35.04) 74.87 (30.82) 107.40 (38.39) 78.11 (33.73) 122.24 (44.24) 82.52 (38.26) 198.37 (73.47) 95.12 (55.96) 1,000 96.47 (20.26) 97.91 (5.31) 103.18 (22.55) 100.00 (6.60) 114.29 (26.49) 100.27 (7.13) 170.38 (49.62) 100.27 (7.46) 10 5,000 96.61 (3.42) 99.85 (2.16) 99.60 (4.54) 100.08 (2.27) 103.01 (6.51) 100.08 (2.27) 109.48 (15.04) 100.08 (2.28) 10,000 98.08 (1.48) 100.00 (1.48) 99.79 (1.56) 100.03 (1.50) 100.23 (1.57) 100.03 (1.50) 100.13 (1.51) 100.02 (1.50) 100 98.85 (33.41) 85.61 (21.79) 107.50 (36.36) 88.58 (24.74) 121.39 (40.72) 92.33 (28.93) 193.94 (72.38) 99.56 (40.14) 1,000 96.33 (11.60) 98.75 (4.87) 101.03 (14.15) 99.89 (5.57) 107.98 (17.70) 99.90 (5.67) 139.70 (40.37) 99.87 (5.77) 25 5,000 98.47 (2.03) 99.92 (2.08) 99.84 (2.10) 99.96 (2.10) 100.00 (2.11) 99.96 (2.10) 99.97 (2.10) 99.96 (2.10) 10,000 99.51 (1.41) 99.93 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 99.94 (1.44) 100 98.40 (29.94) 92.77 (15.72) 105.21 (34.93) 95.48 (17.65) 116.66 (42.86) 97.68 (20.03) 173.49 (72.73) 99.00 (22.93) 1,000 98.71 (16.66) 99.65 (4.62) 100.70 (20.88) 100.00 (4.80) 101.70 (25.59) 100.00 (4.80) 103.33 (42.81) 100.00 (4.81) 100 5,000 99.95 (2.05) 99.99 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 100.00 (2.06) 10,000 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 99.96 (1.45) 71

Chapter 4

73

00009 02)09 02)98 15)98 15)2.1(.0 47 37)9.6(43)9.6(14.34) 98.86 (14.34) 98.86 (3.70) 24.71 (3.70) 24.71 (1.57) 9.88 (1.57) 9.88 (0.29) 0.98 (0.29) 0.98 10,000

,0 .5(.7 .8(.2 .8(.2 .7(.2 47 38)2.1(.6 88 1.2 88 (14.52) 98.85 (14.52) 98.86 (3.86) 24.71 (3.86) 24.71 (1.72) 9.87 (1.72) 9.88 (0.42) 0.98 (0.37) 1.05 5,000

100

,0 .3(.5 .0(.3 08 29)97 29)2.1(.5 45 50)9.3(88)9.7(15.63) 98.67 (18.83) 96.73 (5.05) 24.54 (5.85) 23.61 (2.93) 9.72 (2.95) 10.80 (1.23) 1.20 (1.35) 5.43 1,000

17.61 40.22 1.4 .7(.7 95 1.8 32 1.2 29 1.9 57 (18.85) 25.70 (15.19) 22.96 (13.42) 13.23 (12.68) 19.50 (9.47) 7.17 (11.34) 100 2.4 80 (34.68) 98.00 (28.04)

00018 02)10 03)99 10)99 10)2.8(.4 48 21)9.1(.4 95 (7.64) 99.51 (7.64) 99.51 (2.14) 24.88 (2.14) 24.88 (1.03) 9.95 (1.03) 9.95 (0.35) 1.00 (0.22) 1.80 10,000

,0 .0(.7 .0(.8 04 13)99 12)2.8(.7 48 23)9.2(.0 95 (7.88) 99.51 (7.90) 99.42 (2.39) 24.85 (2.47) 24.68 (1.29) 9.92 (1.37) 10.40 (0.58) 1.00 (0.57) 4.20 5,000

25

15.13 51.44 68)20 18)1.9(.2 .2(.9 26 1.1 47 (5.13) 24.75 (10.81) 22.68 (3.79) 9.82 (8.32) 17.79 (1.83) 2.00 (6.87) 1,000 2.1 94 (11.28) 99.46 (27.11)

21.45 14.39 19.62 38.22 20.49 (14.46) (18.10) (12.77) 100 2.4 86 (47.36) 98.68 (27.44) (27.77) 31.67 (16.73) 24.42 (21.92)

00054 08)10 04)1.0(.3 00 09)2.2(.4 50 15)9.2(.9 0.5(5.01) 100.05 (5.09) 99.42 (1.58) 25.02 (2.24) 24.32 (0.90) 10.01 (1.53) 11.00 (0.47) 1.04 (0.87) 5.47 10,000

78.85 ,0 04 32)12 06)1.9(.1 .8(.7 20 69)2.8(2.08) 24.98 (6.90) 22.08 (1.37) 9.98 (4.41) 14.19 (0.68) 1.21 (3.22) 10.47 5,000 2.0 0.1(5.53) 100.01 (20.20)

10

17.07 39.02 88)34 26)1.9(00)1.0(.0 23 1.7 48 (6.56) 24.86 (12.17) 22.35 (5.10) 10.20 (10.02) 18.99 (2.66) 3.40 (8.88) 1,000 2.8 96 (10.78) 99.62 (23.78)

20.33 26.11 18.76 36.24 42.12 31.75 (12.51) (28.80) (11.55) 100 2.2 0.0(59.83) 103.40 (24.22) (37.63) (14.46) 23.13 (32.27)

enS enS enS enS enS enS enS enSD mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean j n ¯ n

EAEM SEMA EM SEMA EAEM SEMA EM SEMA

=1 τ =100 τ =25 τ =10 τ

2 2 2 2

ouainvle of values population τ

2

n nbl hs auswihaemr hn1 rmtetu value. true the from 10 than more are which values those bold in and

vrgdoe ,0 elctos ( replications, 1,000 over averaged of Estimates 4.4: Table .I h aetee r h Dsoe ,0 replications 1,000 over SD’s the are parentheses the In ).

τ 100 = σ , 10 = µ

2 2 Chapter 4: Introduction of SEMA ) and for 04 . =1 2 ˆ τ 100.00) and moreover, ) is larger compared to . That is, , EM: = 92 . 47 shows that irrespective of . 2 the performance of EM and ˆ σ update =1 4.5 =5 2 2 ¯ 000 e ˆ τ , Chapter 4: Introduction of SEMA . In general, the higher the reliability 1,000 observations of the data stream . Table = 10 ) in which SEMA performs somewhat 2 ¯ 98.08, EM: e n 091 . first = : SEMA: 091 . 2 , and the U to , of SEMA ( =0 ˆ σ 2 000 ¯ ¯ e ρ =0 , ¯ ρ = 10 training n : SEMA: converges to its true value. 000 2 , τ = 10 n and its variability across replications are very large in beginning 1, SEMA performs less well than EM. That is, the average estimates 2 are too high (at ¯ = e : While SEMA is used to obtain estimates of the individual-level effect 2 ). 2 τ τ 94 . =0 2 SEMA-T and the model parameters, when the have entered, the EM algorithmobtain (which better iterates estimates for until the convergence) model parameters, is whichfor used are SEMA. to subsequently used ¯ 10 and e These two explanations suggest two possible adaptations of SEMA: a) an adap- One possible explanation for the fact that SEMA has some difficulties in the low The last result we present from this simulation study is the average squared pre- are too low (at = 1. decrease rapidly during the data stream; that is, when the model parameter es- 2 j 2 ¯ tation yielding better starting valuespass and over b) all an adaptation individuals in issible which performed. variants more For of than this the one purpose, SEMA we algorithm,SEMA-TU, which investigate where the we three T refer pos- refers to to as SEMA-T, SEMA-U, and is that this low reliability conditionis, is in a a rather situation difficult condition wherehundreds even the for of average EM. reliability iterations That equals (and .091, passessurprising the through that EM the the algorithm full SEMA needs dataset) algorithm,has to which not converge. yet passes reached through the It the peak is of dataset not the only likelihood once, function. reliability condition is that it isaverage sensitive reliability to is the very starting low. values, Ourfar especially rather off when crude to starting the guarantee values convergence may have within been 10,000 too data points. A second explanation diction error of the individual-level effects, EM ( Design Our first simulation study showed thatn in the lowest reliability condition, i.e., when 4.4.2 Simulation study II: Improving SEMA in low reliability cases 72 reliability condition corresponding to worse. When the reliability increases, as expected,both the SEMA prediction and error EM. decreases For for a stream of length timates improve and the amount ofThe information prediction available quality of per SEMA individual is increases. for similar the to lowest that reliability of condition EM ( at 5,000 data points, except the condition, the average squared prediction error, σ of the data stream fore both SEMA and EM, but both the size and the variability of of SEMA for the faster the estimate of SEMA is identical.

Chapter 4 EAi identical. is SEMA h atrteetmt of estimate the faster the fSM for SEMA of ftedt temfrbt EAadE,btbt h ieadtevraiiyof variability the and size the both but EM, and SEMA both e for stream data the of h vrg qae rdcinerror, prediction squared average the σ h condition, the ohSM n M o temo length of stream a for For decreases EM. error and prediction SEMA the both expected, as except increases, points, reliability the data When 5,000 worse. at ( EM condition of reliability that lowest to the similar for increases. is individual SEMA per of quality available prediction information The of amount the and improve timates u rtsmlto td hwdta ntelws eiblt odto,ie,when i.e., condition, reliability lowest the in n that showed study simulation first Our cases reliability low in Design SEMA Improving II: study Simulation 4.4.2 to corresponding condition reliability 72 M( EM ito ro fteidvda-ee effects, individual-level the of error diction il ainso h EAagrtm hc erfrt sSM-,SM-,and SEMA-U, SEMA-T, as to to refers pos- refer T three we the where investigate which SEMA-TU, algorithm, we SEMA purpose, one the this than of For more variants performed. which sible is in individuals adaptation an all b) over and pass values starting better yielding tation function. once, likelihood only the not dataset of is peak the It the through reached passes yet converge. not which to has algorithm, dataset) needs SEMA full algorithm the the EM that through the surprising passes .091, (and equals That iterations reliability EM. average of for the even hundreds where condition difficult situation explanation rather a second a in A is is, condition points. reliability data low too this 10,000 been that within have is may convergence values guarantee the starting to crude when off rather especially far Our values, low. starting very the is to reliability sensitive average is it that is condition reliability ¯ 2 j 2 eraerpdydrn h aasra;ta s hntemdlprmtres- parameter model the when is, that stream; data the during rapidly decrease 1. = r o o (at low too are h atrsl epeetfo hssmlto td steaeaesurdpre- squared average the is study simulation this from present we result last The hs w xlntossgettopsil dpain fSM:a nadap- an a) SEMA: of adaptations possible two suggest explanations two These low the in difficulties some has SEMA that fact the for explanation possible One e 0and 10 ¯ banbte siae o h oe aaees hc r usqetyused subsequently to SEMA. are used for which is parameters, model convergence) the until for estimates iterates better (which obtain algorithm EM the entered, have the when parameters, model the and SEMA-T 2 =0 . 94 τ τ 2 ). 2 hl EAi sdt banetmtso h niiullvleffect individual-level the of estimates obtain to used is SEMA While : e = ¯ r o ih(at high too are 2 ,SM efrsls elta M hti,teaeaeestimates average the is, That EM. than well less performs SEMA 1, n t aiblt cosrpiain r eylrei beginning in large very are replications across variability its and n 10 = τ , 2 000 ovre oistu value. true its to converges SEMA: : n training 10 = ρ ¯ , =0 ρ e ¯ ¯ 000 2 σ ˆ =0 fSM ( SEMA of , n h to U the and , 2 . 091 SEMA: : = first . 091 n e 80,EM: 98.08, ¯ 2 nwihSM efrssomewhat performs SEMA which in ) 10 = Table . ,0 bevtoso h aastream data the of observations 1,000 ngnrl h ihrtereliability the higher the general, In . hpe :Itouto fSEMA of Introduction 4: Chapter , τ ˆ e 000 ¯ 2 2 =5 4.5 =1 update σ ˆ h efrac fE and EM of performance the 2 . hw htirsetv of irrespective that shows 47 . 92 = EM: , htis, That . slre oprdto compared larger is ) 0.0 n moreover, and 100.00) τ ˆ 2 =1 . 04 n for and ) hpe :Itouto fSEMA of Introduction 4: Chapter

Table 4.4: Estimates of τ 2 averaged over 1,000 replications, (µ = 10,σ2 = 100). In the parentheses are the SD’s over 1,000 replications and in bold those values which are more than 10 from the true value.

population values of τ 2 τ 2 =1 τ 2 =10 τ 2 =25 τ 2 =100 SEMA EM SEMA EM SEMA EM SEMA EM n¯j n mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean SD 100 18.76 (11.55) 26.11 (28.80) 20.33 (12.51) 31.75 (32.27) 23.13 (14.46) 42.12 (37.63) 36.24 (24.22) 103.40 (59.83) 1,000 17.07 (8.88) 3.40 (2.66) 18.99 (10.02) 10.20 (5.10) 22.35 (12.17) 24.86 (6.56) 39.02 (23.78) 99.62 (10.78) 10 5,000 10.47 (3.22) 1.21 (0.68) 14.19 (4.41) 9.98 (1.37) 22.08 (6.90) 24.98 (2.08) 78.85 (20.20) 100.01 (5.53) 10,000 5.47 (0.87) 1.04 (0.47) 11.00 (1.53) 10.01 (0.90) 24.32 (2.24) 25.02 (1.58) 99.42 (5.09) 100.05 (5.01) 100 19.62 (12.77) 14.39 (18.10) 21.45 (14.46) 20.49 (21.92) 24.42 (16.73) 31.67 (27.77) 38.22 (27.44) 98.68 (47.36) 1,000 15.13 (6.87) 2.00 (1.83) 17.79 (8.32) 9.82 (3.79) 22.68 (10.81) 24.75 (5.13) 51.44 (27.11) 99.46 (11.28) 25 5,000 4.20 (0.57) 1.00 (0.58) 10.40 (1.37) 9.92 (1.29) 24.68 (2.47) 24.85 (2.39) 99.42 (7.90) 99.51 (7.88) 10,000 1.80 (0.22) 1.00 (0.35) 9.95 (1.03) 9.95 (1.03) 24.88 (2.14) 24.88 (2.14) 99.51 (7.64) 99.51 (7.64) 100 17.61 (11.34) 7.17 (9.47) 19.50 (12.68) 13.23 (13.42) 22.96 (15.19) 25.70 (18.85) 40.22 (28.04) 98.00 (34.68) 1,000 5.43 (1.35) 1.20 (1.23) 10.80 (2.95) 9.72 (2.93) 23.61 (5.85) 24.54 (5.05) 96.73 (18.83) 98.67 (15.63) 100 5,000 1.05 (0.37) 0.98 (0.42) 9.88 (1.72) 9.87 (1.72) 24.71 (3.86) 24.71 (3.86) 98.86 (14.52) 98.85 (14.52) 10,000 0.98 (0.29) 0.98 (0.29) 9.88 (1.57) 9.88 (1.57) 24.71 (3.70) 24.71 (3.70) 98.86 (14.34) 98.86 (14.34) 73

Chapter 4 = 75 n 1,000 , while µ each , the aver- µ . A similar pattern can 2 1 condition but now we τ = 900 observations, but clear 2 = τ n 10 and = though the effect is smaller because the j , n 2 σ , the SEMA variants have a large impact on both the point esti- 2 τ : combines both features. : A single EM iteration over all available individuals is used to update presents the results obtained with the different variants of SEMA at 4.6 SEMA-U all the estimated individual-level effects and modeldata parameters points. after SEMA-TU The average squared prediction error is more affected by using a training set or However, for 3. 2. be observed for the residual variance Chapter 4: Introduction of SEMA The training set could provideconvergence SEMA to with a better local starting maximum. values,is The especially speeding second useful up variant when the observations of of SEMA, anthe using individual contributions EM enter in updates to a the block. CDSS Inwhich will that are case, be not based yet on converged, estimatesto and of the more CDSS the importantly are model not these parameters corrected, erroneous becausean contributions this additional full individual E is step no will longer help returning.second in Doing simulation correcting the study, contributions we to the repeat CDSS. the In this SEMA estimate was already closeupdates to yields estimates the closer true to value. thoseupdates of seem the Using to EM only a algorithm, have training though a the set minimal additional influence and on EM the estimate and itsadditional SD. EM updates. This effectticeable of halfway the through training the data set stream. and Theoutperforms updates variant the is with variant especially only with the no- only training the dataset the updates. difference Towards the between end standard SEMA of and the its data variants stream, becomes much smaller. having lower SDs than those withoutference a between training EM set. and SEMA-T AtSEMA-T and 5,000 and SEMA-TU observations, SEMA-TU the are clearly dif- minimal. improvesthe the The additional precision training update of set only the in marginally estimates improves the of precision. also apply these three variants oftational SEMA. time Additionally required we by keep each track of of thevariants the different of algorithms: compu- SEMA. EM, SEMA, and the three Results Table 900, 1,000, 5,000, and 10,000are observations. still identical, but At at 900 1,000 observations observations, largeants differences all appear using SEMA between those the versions observations vari- as a trainingage set estimates and were those already that close do to not.improvements the For are true value visible at in the SDs, with the variants of SEMA with a training set mate and their SD. Allowing for a singleU) EM updates already every 1,000 yields data a points (SEMA- solutionlarger that improvement is is much shown closer bydates to yields SEMA-T. the another Using slight full both improvement EM of a the solution. training estimate of set An and even EM up-

J ¯2 2 Table 4.5: The average squared error (e = (ˆµj µj) /J) averaged over 1,000 replications. In the parentheses are the SD’s over 1,000 j=1 − ! replications.

population values of τ 2 τ 2 =1 τ 2 =10 τ 2 =25 τ 2 =100 Chapter 4: Introduction of SEMA SEMA EM SEMA EM SEMA EM SEMA EM n¯j n mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean SD 100 20.43 (27.19) 16.43 (23.88) 28.06 (29.93) 23.44 (22.19) 40.80 (34.65) 33.62 (19.48) 106.75 (63.51) 65.08 (16.99) 1,000 11.73 (16.96) 1.26 (0.40) 18.26 (18.89) 9.05 (0.67) 29.18 (22.12) 18.66 (1.23) 82.87 (43.91) 42.16 (2.62) 10 5,000 4.16 (2.62) 0.99 (0.06) 8.35 (3.45) 6.84 (0.32) 13.70 (5.02) 11.80 (0.57) 26.10 (13.32) 19.33 (0.98) 10,000 1.92 (0.37) 0.94 (0.05) 5.24 (0.29) 5.15 (0.24) 7.63 (0.40) 7.56 (0.35) 10.08 (0.53) 10.01 (0.48) 100 21.12 (26.19) 7.16 (11.35) 28.49 (28.55) 15.14 (10.81) 40.55 (32.72) 26.30 (10.09) 103.92 (63.21) 56.48 (13.67) 1,000 8.58 (9.76) 1.17 (0.28) 13.96 (11.79) 8.20 (0.68) 22.06 (14.71) 15.69 (1.20) 57.81 (36.65) 31.05 (2.36) 25 5,000 1.53 (0.21) 0.93 (0.08) 4.59 (0.34) 4.57 (0.33) 6.36 (0.47) 6.35 (0.47) 7.98 (0.60) 7.98 (0.60) 10,000 0.89 (0.08) 0.82 (0.06) 2.92 (0.21) 2.92 (0.21) 3.56 (0.26) 3.56 (0.26) 4.00 (0.29) 4.00 (0.29) 100 17.89 (26.17) 3.57 (3.79) 24.63 (30.97) 11.46 (3.99) 36.01 (38.36) 21.70 (5.02) 90.48 (66.36) 45.24 (9.68) 1,000 2.74 (15.29) 1.10 (0.27) 6.29 (19.35) 5.30 (0.80) 9.13 (23.92) 7.63 (1.13) 13.17 (40.94) 9.98 (1.50) 100 5,000 0.70 (0.11) 0.71 (0.11) 1.70 (0.25) 1.70 (0.25) 1.89 (0.27) 1.89 (0.27) 2.01 (0.29) 2.01 (0.29) 10,000 0.52 (0.08) 0.52 (0.08) 0.92 (0.13) 0.92 (0.13) 0.98 (0.14) 0.98 (0.14) 1.01 (0.15) 1.01 (0.15) 74

Chapter 4

74

00005 00)05 00)09 01)09 01)09 01)09 01)10 01)10 (0.15) 1.01 (0.15) 1.01 (0.14) 0.98 (0.14) 0.98 (0.13) 0.92 (0.13) 0.92 (0.08) 0.52 (0.08) 0.52 10,000

,0 .0(.1 .1(.1 .0(.5 .0(.5 .9(.7 .9(.7 .1(.9 .1(0.29) 2.01 (0.29) 2.01 (0.27) 1.89 (0.27) 1.89 (0.25) 1.70 (0.25) 1.70 (0.11) 0.71 (0.11) 0.70 5,000

100

,0 .4(52)11 02)62 1.5 .0(.0 .3(39)76 11)1.7(09)99 (1.50) 9.98 (40.94) 13.17 (1.13) 7.63 (23.92) 9.13 (0.80) 5.30 (19.35) 6.29 (0.27) 1.10 (15.29) 2.74 1,000

0 78 2.7 .7(.9 46 3.7 14 39)3.1(83)2.0(.2 04 6.6 52 (9.68) 45.24 (66.36) 90.48 (5.02) 21.70 (38.36) 36.01 (3.99) 11.46 (30.97) 24.63 (3.79) 3.57 (26.17) 17.89 100

00008 00)08 00)29 02)29 02)35 02)35 02)40 02)40 (0.29) 4.00 (0.29) 4.00 (0.26) 3.56 (0.26) 3.56 (0.21) 2.92 (0.21) 2.92 (0.06) 0.82 (0.08) 0.89 10,000

,0 .3(.1 .3(.8 .9(.4 .7(.3 .6(.7 .5(.7 .8(.0 .8(0.60) 7.98 (0.60) 7.98 (0.47) 6.35 (0.47) 6.36 (0.33) 4.57 (0.34) 4.59 (0.08) 0.93 (0.21) 1.53 5,000

25

,0 .8(.6 .7(.8 39 1.9 .0(.8 20 1.1 56 12)5.1(66)3.5(2.36) 31.05 (36.65) 57.81 (1.20) 15.69 (14.71) 22.06 (0.68) 8.20 (11.79) 13.96 (0.28) 1.17 (9.76) 8.58 1,000

0 11 2.9 .6(13)2.9(85)1.4(08)4.5(27)2.0(00)139 6.1 64 (13.67) 56.48 (63.21) 103.92 (10.09) 26.30 (32.72) 40.55 (10.81) 15.14 (28.55) 28.49 (11.35) 7.16 (26.19) 21.12 100

00019 03)09 00)52 02)51 02)76 04)75 03)1.8(.3 00 (0.48) 10.01 (0.53) 10.08 (0.35) 7.56 (0.40) 7.63 (0.24) 5.15 (0.29) 5.24 (0.05) 0.94 (0.37) 1.92 10,000

,0 .6(.2 .9(.6 .5(.5 .4(.2 37 50)1.0(.7 61 1.2 93 (0.98) 19.33 (13.32) 26.10 (0.57) 11.80 (5.02) 13.70 (0.32) 6.84 (3.45) 8.35 (0.06) 0.99 (2.62) 4.16 5,000

10

,0 17 1.6 .6(.0 82 1.9 .5(.7 91 2.2 86 12)8.7(39)4.6(2.62) 42.16 (43.91) 82.87 (1.23) 18.66 (22.12) 29.18 (0.67) 9.05 (18.89) 18.26 (0.40) 1.26 (16.96) 11.73 1,000

0 04 2.9 64 2.8 80 2.3 34 2.9 08 3.5 36 1.8 0.5(35)6.8(16.99) 65.08 (63.51) 106.75 (19.48) 33.62 (34.65) 40.80 (22.19) 23.44 (29.93) 28.06 (23.88) 16.43 (27.19) 20.43 100

enS enS enS enS enS enS enS enSD mean SD mean SD mean SD mean SD mean SD mean SD mean SD mean j n ¯ n

EAEM SEMA EM SEMA EM SEMA EAEM SEMA

hpe :Itouto fSEMA of Introduction 4: Chapter

=100 τ =25 τ =10 τ =1 τ

2 2 2 2

ouainvle of values population τ

2

replications. !

=1 j

j j vrgdoe ,0 elctos nteprnhssaeteS’ vr1,000 over SD’s the are parentheses the In replications. 1,000 over averaged ) ( error squared average The 4.5: Table

/J ) µ µ (ˆ = e

¯ 2 2 J agripoeeti hw ySM-.Uigbt riigstadE up- EM even and An set of estimate training solution. the a of EM improvement both full slight Using another the SEMA-T. yields to dates by closer shown much is is improvement that larger solution (SEMA- points a data yields 1,000 every already updates EM U) single a for Allowing SD. their and mate h diinludt nymrial mrvsteprecision. of the improves estimates marginally in the only set of update training precision additional The the the improves minimal. dif- clearly are the SEMA-TU observations, SEMA-TU and 5,000 and SEMA-T At SEMA-T and set set. training EM training a between a with ference without SEMA those of than variants SDs the lower with having SDs, the in at visible value true are For the improvements not. to do close that already those were and estimates set age training a as vari- observations versions the those between SEMA using appear all differences ants large observations, observations 1,000 900 at At but identical, still observations. are 10,000 and 5,000, 1,000, 900, Table Results three the and SEMA, EM, SEMA. compu- algorithms: of different the variants the of of track each keep by we required Additionally time SEMA. tational of variants three these apply also h ifrnebtensadr EAadisvrat eoe uhsmaller. much becomes stream, variants data its the and of SEMA standard end between the Towards difference updates. the dataset the training only no- the with only especially variant with is the variant updates outperforms The and stream. set data the training through the halfway of ticeable effect This updates. EM SD. additional its and estimate the EM on and influence additional minimal set the a though training have algorithm, a only EM to Using the seem of updates those value. to true closer the estimates yields to updates close already was estimate SEMA hpe :Itouto fSEMA of Introduction 4: Chapter nadtoa ulEse ilhl ncretn h otiuin oteCS.I this In the CDSS. repeat the to we contributions study, the correcting simulation Doing in second returning. help longer will no step is E individual full additional this contributions an because erroneous corrected, parameters these not model are importantly the CDSS more the of and to estimates converged, on yet based not be case, are that will which In CDSS block. the a to updates in enter EM contributions individual using the an SEMA, of of observations the when variant up useful second speeding especially The is values, maximum. starting local better a with to SEMA convergence provide could set training The eosre o h eiulvariance residual the for observed be 3. 2. oee,for However, h vrg qae rdcinerri oeafce yuigatann e or set training a using by affected more is error prediction squared average The SEMA-TU after points. parameters data model and effects individual-level estimated the all SEMA-U 4.6 rsnsterslsotie ihtedfeetvrat fSM at SEMA of variants different the with obtained results the presents igeE trto vralaalbeidvdasi sdt update to used is individuals available all over iteration EM single A : obnsbt features. both combines : τ 2 h EAvrat aealreipc nbt h on esti- point the both on impact large a have variants SEMA the , σ 2 n , j huhteefc ssalrbcuethe because smaller is effect the though = 0and 10 n τ = 2 0 bevtos u clear but observations, 900 = τ odto u o we now but condition 1 2 iia atr can pattern similar A . µ h aver- the , each µ while , 1,000 n 75 =

Chapter 4 77 10000 EM 1 EM 10 EM 100 EM 1000 9000 8000 ● 7000 6000 5000 data points ● 4000 3000 2000 SEMA SEMA T SEMA U SEMA TU EM indicated number of data points (10, 100, or 1,000). 1000

● ●

0 60 40 20

Figure 4.1: Relative cumulative computationcreasing time number as of a data function points, oframeters when in- are the updated estimates after of each the data model point pa- (SEMA, EM 1, or after the relative cumulative computational time computational cumulative relative Chapter 4: Introduction of SEMA n 1, = 2 τ 10, (9.09) 3.64 (2.89) = µ 17.24 Chapter 4: Introduction of SEMA (7.92) 3.42 (2.71) 3.40 (2.66) (9.09) . =1 2 16.18 17.24 τ 500. There is no visible difference between = shows four variants of the EM algorithm. The n value: (9.09) 4.1 17.24 presents the difference in cumulative computation time when (8.88) 3.42 (2.71) (9.09) 100. In the parentheses are the SD’s over 1,000 replications, 4.1 = SEMA SEMA T SEMA U SEMA T+U EM 2 σ 17.07 17.24 mean SD mean SD mean SD mean SD mean SD n Table 4.6: Results of SEMA variants in the condition and and in bold those values which are more than 10 from the population 900 10.07 (2.74) 10.07 (2.74) 10.07 (2.74) 10.07 (2.74) 10.00 (0.33) 900 900 96.47 (21.03) 96.47 (21.03)900 96.47 12.12 (21.03) (17.52) 96.47 12.12 (21.03) (17.52) 97.61 12.12 (5.72) (17.52) 12.12 (17.52) 1.30 (0.46) 5,000 10.47 (3.22) 3.10 (2.10) 7.74 (1.72) 2.92 (1.80) 1.21 (0.68) 1,000 1,000 10.075,000 (2.67) 10.01 (0.89) 10.00 (0.32) 10.00 (0.24) 10.06 (2.32) 10.00 (0.41) 10.00 (0.32) 10.00 (0.21) 10.00 (0.32) 10.00 (0.15) 1,000 96.475,000 (20.26) 96.61 97.77 (3.42) (5.33) 98.641,000 95.17 11.73 (2.36)5,000 (17.60) (16.96) 96.45 4.16 97.77 1.27 (2.32) (2.62) (5.33) (0.41) 98.63 1.28 97.91 (2.31) (5.31) 9.42 (0.41) (14.05) 99.85 (2.16) 2.45 1.27 (0.69) (0.41) 1.22 1.26 (0.31) (0.40) 0.99 (0.06) 10,000 5.47 (0.87) 2.50 (1.25) 3 .67 (0.41) 2.16 (0.87) 1.04 (0.47) 10,000 10.00 (0.22) 10.00 (0.16) 10.00 (0.13) 10.00 (0.12) 10.00 (0.10) 10,000 98.08 (1.48) 99.16 (1.54)10,000 98.42 1.92 (1.45) (0.37) 99.19 1.14 (1.50) (0.24) 100.00 (1.48) 1.33 (0.17) 1.05 (0.14) 0.94 (0.05) 2 2 2 ˆ µ ¯ e ˆ τ ˆ σ To conclude, both the model-parameter estimates and the prediction errors can Finally, Figure concerned, other combinations of SEMA and EM could be used. be improved by using better startingperforming values a obtained single from EM a iteration training after dataset.estimates every and Also, 1,000 lowered prediction data errors. points Experimentation improved withmethod parameter variants showed of that the even latter larger improvementsing can multiple EM be iterations, obtained or by performing the either singleother perform- EM words, iteration depending more on frequently. In whether this is feasible in the streaming data application over time. More importantly thelinear curves and curve of more the upwards EM asof algorithm larger tend the datasets to are EM analyzed. deviate algorithm from These illustratewhen curved that data lines points analyzing enter nested over data time becomesparameters using infeasible, will the as require the increasingly EM estimation more algorithm time. of the model the different variants of SEMA,grows which with all a grow factor linearly of by 20).estimates factor Figure of of the about model 10 parameters (as aredata updated points. using EM All every: four 1, variantsthan 10, of SEMA 100, when the or it EM 1,000 has algorithm to grow produce up-to-date with parameter a estimates much when larger data enter factor the algorithms have todata produce point up-to-date arrives parameter (or after estimates therequired each indicated to number time update of the a data estimates new points).required of to We the estimate scale model the the model parameters time when proportional to the time 76

Chapter 4 76 ocre,ohrcmiain fSM n Mcudb used. be could EM and application SEMA data of streaming combinations the in other feasible concerned, is this whether In frequently. on more depending iteration words, EM perform- other single either the performing by or obtained iterations, be EM multiple can ing improvements larger latter even the that of showed variants parameter method with improved Experimentation points errors. data prediction lowered 1,000 Also, and every estimates dataset. after training iteration a EM from single obtained a values performing starting better using by improved be model the of time. algorithm more estimation EM increasingly the require as the will infeasible, using parameters becomes time data over nested enter analyzing points lines data that curved when illustrate These from algorithm deviate analyzed. EM are to datasets the tend larger algorithm of as EM upwards the more of curve and curves linear the factor enter data importantly larger when More much estimates a parameter time. with up-to-date over produce grow to algorithm has 1,000 EM it or the when 100, SEMA of 10, than variants 1, four every: All EM using points. updated data are (as parameters 10 model about the of of Figure factor estimates 20). by of linearly factor grow a all with which grows SEMA, of time variants the different to the proportional when time parameters model the the model scale estimate the We to of required points). new estimates data a the of update time number to indicated each required the estimates after (or parameter arrives up-to-date point produce data to have algorithms the ocnld,bt h oe-aaee siae n h rdcinerr can errors prediction the and estimates model-parameter the both conclude, To Figure Finally, σ τ ˆ ˆ e ¯ µ ˆ 2 2 2 0009.8(.8 91 15)9.2(.5 91 15)100 (1.48) 100.00 (1.50) 99.19 (1.45) 98.42 (1.54) 99.16 (1.48) 98.08 10,000 00019 03)11 02)13 01)10 01)09 (0.05) 0.94 (0.14) 1.05 (0.17) 1.33 (0.24) 1.14 (0.37) 1.92 10,000 0001.0(.2 00 01)1.0(.3 00 01)1.0(0.10) 10.00 (0.12) 10.00 (0.13) 10.00 (0.16) 10.00 (0.22) 10.00 10,000 00054 08)25 12)3.7(.1 .6(.7 .4(0.47) 1.04 (0.87) 2.16 (0.41) .67 3 (1.25) 2.50 (0.87) 5.47 10,000 ,0 .6(.2 .8(.1 .5(.9 .2(.1 .9(0.06) 0.99 (0.40) (0.31) 1.26 1.22 (0.41) (0.69) 1.27 2.45 (2.16) 99.85 (14.05) (0.41) 9.42 (5.31) (2.31) 97.91 1.28 98.63 (0.41) (5.33) (2.62) (2.32) 1.27 97.77 4.16 96.45 (16.96) (17.60) 5,000 (2.36) 11.73 95.17 1,000 98.64 (5.33) (3.42) 97.77 96.61 (20.26) 5,000 96.47 1,000 ,0 00 08)1.0(.4 00 04)1.0(.1 00 (0.15) 10.00 (0.32) 10.00 (0.21) 10.00 (0.32) 10.00 (0.41) 10.00 (2.32) 10.06 (0.24) 10.00 (0.32) 10.00 (0.89) 10.01 (2.67) 5,000 10.07 1,000 1,000 ,0 04 32)31 21)77 17)29 18)12 (0.68) 1.21 (1.80) 2.92 (1.72) 7.74 (2.10) 3.10 (3.22) 10.47 5,000 0 64 2.3 64 2.3 64 2.3 64 2.3 76 (5.72) 97.61 (21.03) 96.47 (21.03) 96.47 (21.03) 96.47 (21.03) 96.47 900 900 0 21 1.2 21 1.2 21 1.2 21 1.2 .0(0.46) 1.30 (17.52) 12.12 (17.52) 12.12 (17.52) 12.12 (17.52) 12.12 900 0 00 27)1.7(.4 00 27)1.7(.4 00 (0.33) 10.00 (2.74) 10.07 (2.74) 10.07 (2.74) 10.07 (2.74) 10.07 900 n nbl hs auswihaemr hn1 rmtepopulation the from 10 than more are which values those bold in and and al .:Rslso EAvrat ntecondition the in variants SEMA of Results 4.6: Table n enS enS enS enS enSD mean SD mean SD mean SD mean SD mean 17.24 17.07 σ 2 EASM EAUSM + EM T+U SEMA U SEMA T SEMA SEMA = 4.1 0.I h aetee r h Dsoe ,0 replications, 1,000 over SD’s the are parentheses the In 100. (9.09) 88)34 (2.71) 3.42 (8.88) rsnstedfeec ncmltv optto iewhen time computation cumulative in difference the presents 17.24 4.1 (9.09) value: n hw orvrat fteE loih.The algorithm. EM the of variants four shows = 0.Teei ovsbedfeec between difference visible no is There 500. τ 17.24 16.18 2 =1 . (9.09) 79)34 27)34 (2.66) 3.40 (2.71) 3.42 (7.92) hpe :Itouto fSEMA of Introduction 4: Chapter 17.24 µ = 90)36 (2.89) 3.64 (9.09) 10, τ 2 = 1, n hpe :Itouto fSEMA of Introduction 4: Chapter

relative cumulative computational time aeesaeudtdatrec aapit(EA M1 ratrthe after or 1, EM (SEMA, pa- point model data the each of after estimates updated the are in- when rameters of points, function data a of as number time creasing computation cumulative Relative 4.1: Figure 0 20 40 60 ● ● 1000 niae ubro aapit 1,10 r1,000). or 100, (10, points data of number indicated EM SEMA TU SEMA U SEMA T SEMA 2000 3000 4000 ● data points 5000 6000 7000 ● 8000 9000 EM 1000 EM 100 EM 10 EM 1 10000 77

Chapter 4 , ) t j 79 n 1998 (4.18) ( ) call in- 2009 ( individuals. Sec- is one of the model J data points in mem- θ n Neal and Hinton ). – which can be maintained θ ¯ δ 2009 Liang and Klein , /C, | increases, the average will cover a for each of the +1 t i j ˆ 3 θ T − i ˆ θ | ) , and C j 1) 2 − t − T t ! ( , =( in the case of incremental EM, while using more i j Cappé & Moulines 1 n T = , θ 2 j and then iterates over the dataset, subtracting previous ¯ δ ¯ y , j for a single individual only, and subsequently updates the ¯ y t , j j 3 n T , instead of J data points , and t j 2 T , t j is the size of the window of the moving average and 1 T C , t 2 j ¯ y SEMA is conceptually positioned between what , t j ¯ where parameters. As new data points enternew and interval thus of parameter differences. This measure ond, SEMA decreases the numberEM algorithm of when computations analyzing compared a data to stream. the The conventional SEMA algorithm updates the SEMA scales with information than stepwise EM. CDSS and the model parameters in a single pass. cremental EM and stepwise EM ( provide a proof for the largeEM. sample Incremental convergence EM estimates of the both parameters stepwise by storing andtions the of incremental CDSS each and of the the contribu- contributions of the data pointtribution to to the the CDSS, CDSS thereby of correctingpoints previous the in time. erroneous memory. con- As In such, contrast,not incremental stepwise EM correct EM requires for does all previous not data store contributionscontribution all to of the the each data, data CDSS: but point stepwise it tochoose EM the does a CDSS. adds weight To a given use to weighted stepwise new EM, datatwo the points. earlier analyst methods, SEMA, has does which to not conceptually store combines thethe the observed level data; of it stores the only individuals, contributions instead at of the data points themselves. This means that 4.6.2 Convergence Fitting multilevel models on data streamsdard adds offline methods: an additional it complication is not tovations), immediately stan- the clear parameter when estimates (e.g., can after be howsubstantively interpreted. said many obser- However, to options have are “converged” available and tocould, address thus for this can issue. example be One choose, duringof the the absolute data difference stream, between to twotime compute estimates points: of a the moving same average parameter at adjacent Chapter 4: Introduction of SEMA 4.6 SEMA characteristics 4.6.1 Theoretical considerations The proposed SEMA algorithmtional yields EM two algorithm. improvements First, it compared isory, to no leading longer the to required to a tradi- store decrease all inthe memory current required. values What of needs to be stored are merely y during the stream – can be used to quantify convergence (where given some cut-off ). We will fit a , as well as the 2 367.16. We used ˆ σ 2 ¯ e 2010 = ( 2 and ˆ σ 2,248 individuals. The , 2 ˆ = τ J ˆ µ, 2 ˆ . σ Chapter 4: Introduction of SEMA 93.58, and =1 = is somewhat overestimated by SEMA 2 0 2 σ 2 ˆ τ , is somewhat underestimated. The av- ˆ τ 2 ˆ σ , and 2 64.85, ˆ τ Killingsworth and Gilbert =1 = 2 0 τ ˆ µ 17,742 observations for = n mates and average squared error ˆ µ SEMA EM SEMA EM SEMA EM SEMA EM n is estimated properly, while 100 68.15 67.01 125.60 161.52 235.87 229.08 31.40 27.15 1,0005,000 65.60 65.49 63.96 63.80 129.24 121.72 103.33 349.30 93.91 353.06 336.33 337.67 24.64 22.82 19.58 19.06 ˆ µ reports the values of the parameter estimates and the average squared 10,00017,742 64.34 64.39 64.72 64.85 103.61 97.16 100.22 93.58 357.36 359.90 365.28 367.16 14.14 13.68 0.30 – Table 4.7: Longitudinal happiness ratings: model parameter esti- (the first observation), 4.7 =1 ings t y Table During the data stream, we obtained parameter estimates from the SEMA and = 0 prediction error for both SEMA and EM.lation Similar study, to the results obtained in the simu- compared to EM. The residualerage variance, squared prediction error of SEMAerror of is the EM close algorithm, to even though the atfavored average the due end squared to of prediction the its data own stream use EM in is obviously the operationalization of the “gold standard”. estimated individual-level effects. The starting valuesµ for SEMA were, respectively, daily measurements of the participants’ happiness ondataset a contains continuous rating a scale. total The of To illustrate the use of SEMAdinal in study a of real-life happiness application, ratings werandom-intercept by use model data to from the a data longitu- dents’ to obtain happiness. individual-level estimates Data of were respon- collected using a smart-phone application, yielding 4.5 An application of SEMA to longitudinal happiness rat- 78 during the data stream. From these “true”abilities estimates, range we from find 0.20 that to the 0.91 estimatedduring with reli- the an data average of stream 0.67. we As monitored in the the estimates simulation of study, the individual-level effects estimated using allthe data “true” (i.e., individual-level end effect, of to the compute data the stream) average as squared prediction error from the smart-phone application, by replaying the data collection over time. EM algorithms from only themates data seen using so the far, entire and dataset: compared these to the EM esti- average number of observation per persontion is (254 7.89, individuals) with and a maximum minimum of ofauthors 39 one analyzed observations observa- the (one dataset individual). after the While data the tered collection as stopped, a in data reality, the stream. data We fit en- a random-intercept model on the data stream resulting

Chapter 4 Magrtm rmol h aase ofr n oprdteet h Mesti- EM the to these compared dataset: and entire far, the so using seen data mates the only from algorithms EM time. over collection data resulting the stream replaying data by the application, on model smart-phone random-intercept the a from en- fit We data stream. the reality, data in a stopped, as collection tered the data While the after individual). dataset (one the observa- observations analyzed one 39 authors of of minimum maximum a and with individuals) 7.89, (254 is tion person per observation of number average blte ag rm02 o09 iha vrg f06.A ntesmlto study, of simulation estimates the the in monitored As we 0.67. stream of average data an the reli- with during estimated 0.91 the to that 0.20 find from we range estimates, error abilities “true” prediction these From squared as stream. average stream) data the data the compute during the to of effect, end individual-level (i.e., “true” data the all using estimated effects individual-level the aae otisattlof The total scale. yielding a rating application, continuous contains a dataset smart-phone on happiness a participants’ using the of collected measurements respon- daily were of Data estimates individual-level happiness. obtain to dents’ longitu- data a the from to data model use by random-intercept we ratings application, happiness real-life of a study in dinal SEMA of use the illustrate rat- To happiness longitudinal to SEMA of application An 4.5 78 siae niiullvlefcs h trigvle o EAwr,respectively, were, SEMA for µ values starting The effects. individual-level estimated aoe u oisonuei h prtoaiaino h gl standard”. “gold the of operationalization the obviously is in EM use stream own data its the prediction of to squared end due the average favored at the though even to algorithm, close EM the is of error SEMA of error prediction squared erage oprdt M h eiulvariance, residual The EM. to compared rdcinerrfrbt EAadE.Smlrt h eut bandi h simu- the in obtained results the to study, Similar lation EM. and SEMA both for error prediction 0 = uigtedt tem eotie aaee siae rmteSM and SEMA the from estimates parameter obtained we stream, data the During Table y t ings =1 4.7 tefis observation), first (the al .:Lniuia apns aig:mdlprmtresti- parameter model ratings: happiness Longitudinal 4.7: Table 7726.26.5102 35 6.8371 .0– 0.30 13.68 14.14 367.16 365.28 359.90 357.36 93.58 100.22 97.16 103.61 64.85 64.72 64.39 64.34 17,742 10,000 eot h auso h aaee siae n h vrg squared average the and estimates parameter the of values the reports µ ˆ ,0 39 38 0.39.1363 3.71.819.06 19.58 22.82 24.64 337.67 336.33 353.06 93.91 349.30 103.33 121.72 129.24 63.80 63.96 65.49 65.60 5,000 1,000 0 81 70 2.0115 3.7290 14 27.15 31.40 229.08 235.87 161.52 125.60 67.01 68.15 100 setmtdpoel,while properly, estimated is n EAE EAE EAE EAEM SEMA EM SEMA EM SEMA EM SEMA µ ˆ ae n vrg qae error squared average and mates n = 772osrain for observations 17,742 µ ˆ τ 0 2 = =1 ilnsot n Gilbert and Killingsworth τ ˆ 64.85, 2 and , σ ˆ 2 τ ˆ ssmwa neetmtd h av- The underestimated. somewhat is , τ ˆ 2 σ 2 0 2 ssmwa vrsiae ySEMA by overestimated somewhat is = =1 35,and 93.58, hpe :Itouto fSEMA of Introduction 4: Chapter σ . ˆ 2 µ, ˆ J τ = ˆ 2 , ,4 niiul.The individuals. 2,248 σ ˆ and 2 ( = 2010 e ¯ 2 σ ˆ 6.6 eused We 367.16. 2 swl sthe as well as , .W ilfi a fit will We ). uigtesra a eue oqatf ovrec weegvnsm cut-off some given (where convergence quantify to used be can – stream the during y fteaslt ifrnebtentoetmtso h aeprmtra adjacent at parameter average same moving the a of points: estimates compute time two to between stream, difference data absolute the the of during choose, One be example issue. can this for thus address could, to and available “converged” are have options to However, obser- many said interpreted. substantively how be after can (e.g., estimates when parameter clear the stan- immediately vations), to not is complication it additional an methods: offline adds dard streams data on models multilevel Fitting Convergence 4.6.2 merely are stored be to needs of What values required. current memory the in all decrease store tradi- a to required to the longer leading no to ory, is compared it First, improvements algorithm. two EM yields tional algorithm SEMA proposed The considerations Theoretical 4.6.1 characteristics SEMA 4.6 SEMA of Introduction 4: Chapter h ee fteidvdas nta ftedt onstesle.Ti en that EM. means stepwise This than information themselves. points data with the scales of at SEMA instead contributions individuals, only the stores it of data; level observed the the the combines store conceptually not to which does has SEMA, methods, analyst earlier points. the two data EM, new stepwise weighted to use given a To weight adds CDSS. a does the EM choose to it stepwise point but CDSS: data data, each the the of to all contribution contributions store data not previous all does for requires EM correct EM stepwise incremental not contrast, such, In As con- memory. erroneous time. in the previous points correcting of thereby CDSS CDSS, the the to to tribution point data the of contributions contribu- the the of and each CDSS incremental of the tions and storing by stepwise parameters both the of estimates EM convergence Incremental sample EM. large the for proof a provide ( EM stepwise and EM cremental pass. single a in parameters model the and CDSS Magrtmwe nlzn aasra.TeSM loih pae the updates algorithm SEMA conventional The the stream. to data a compared analyzing computations when of algorithm EM number the decreases SEMA ond, e nevlo aaee ifrne.Ti measure This differences. parameter of thus interval and new enter points data new As parameters. where ¯ j t , EAi ocpulypstoe ewe what between positioned conceptually is SEMA y ¯ j 2 t , C T 1 stesz ftewno ftemvn vrg and average moving the of window the of size the is j t , T 2 j t and , aapoints data J nta of instead , T n 3 j j , t y ¯ o igeidvda ny n usqetyudtsthe updates subsequently and only, individual single a for j , y ¯ δ ¯ n hnieae vrtedtst utatn previous subtracting dataset, the over iterates then and j 2 θ , = T n 1 ap Moulines & Cappé j i ntecs ficeetlE,wieuigmore using while EM, incremental of case the in =( , ( ! t T − t − 2 1) j C and , ) | θ ˆ i − T θ 3 ˆ j i t +1 o aho the of each for nrae,teaeaewl oe a cover will average the increases, | /C, , in n Klein and Liang 2009 δ ¯ θ hc a emaintained be can which – ). eladHinton and Neal n θ aapit nmem- in points data J soeo h model the of one is niiul.Sec- individuals. ( 2009 alin- call ) ( (4.18) 1998 n 79 j t ) ,

Chapter 4 1 in 81 A β (4.25) (4.24) (4.27) (4.29) (4.26) (4.23) (4.28) is now a depends j 1 T µ , if , the fixed effect j 4.3 n , where , 3 ) , T j , j ˆ 1) µ , t t ˆ µ − j , ij t ′ j − j n x x 1( ] . n , and ij j t ) ) 1) 2 j t A y n j ( t ν ( − t ˆ t µ ,T ′ ij ′ ij − 1 +ˆ wj 1( x x − j t T T 2 A ¯ y j ij ) t j ′ j t y =1 n + x j ij ′ i . (¯ x " . t ˆ ′ j j µ x 1) ) 1) , is computed as follows: ij t x n − 2 J =1 − t are retrieved from memory. Because y − is dependent on the number of observa- t j J ˆ 2( " τ t ( t 1( ˆ J J 1+ j t =1 =1 T 1 β j ′ ij j j " " t µ , A − x x j , wj t 1 1 t = # j x j T − − ) + − ˆ ) µ ν t t # # 2 ij − ( − ′ j ) ) 1) 1) ˆ τ +ˆ x t , j j x j ’s (1993) updating method for the regression co- − − . In our simulation studies the individual-level ) 1) t t ′ ij t t x x y 2 j j 1 ′ j ′ j − x 2( 1( t n µ T /n ( ( x x 1 ( ( A A w j − j j T =1 n T i " n n 2 = =ˆ =[(¯ = = = A = are again referred to as . For the model introduced in Equation J J J ( =1 =1 =1 t t t ) j j j j " " " 1 2 j j j t ˆ µ ) ) ( 1 2 3 n t t σ ! ! A ! in this formulation. This means that every time a new data w T T T 2( 1( = = = = T t A A ij and ˆ β x j , and observations, we do not have to correct for previous contributions. can both be computed online: ˆ 2 µ 2 n j Escobar and Moser n A β,τ = and t j 1 1 only consist of observed data (there are no model parameters involved) and T A 2 , j A Next, the variance of the random effect, Moreover, using the notation including the summation over In the M step, the CDSS can be used to obtain new estimates for the model pa- n effects were not dependent onusing the either of number the of two formulations observations. will effectively Therefore, be the the same. results where, as previously, vector, instead of a scalar. Thecan contributions then to be the computed CDSS as for follows: a single individual are and Chapter 4: Introduction of SEMA The CDSS of point arrives the values of the covariates is weighted according to the numberaccount which of individuals observations have of more observations an results individual. in better Taking estimates into of tions of that individual, on where it are sums over rameters using efficients, as follows: Note that, we use the case where the individual-level effect j 2 µ τ ¯ δ µ ¯ δ (4.22) (4.20) (4.21) (4.19) for the sim- . Because j θ ¯ δ x = ij x that in all streams , we also find that 2 µ σ are normally distributed j µ , 2 and 1 ˆ σ 2 , " τ j , j Chapter 4: Introduction of SEMA n n ij ) ϵ / . ˆ β 2 ) j 2 ˆ σ + 2 ˆ x τ j , the decrease is quite a bit slower (i.e., . + ,τ 2 µ j − -dimensional vector with fixed-effect re- effects to SEMA. This model can be writ- 2 σ p (0 n j ˆ τ + y will be problem dependent and might dif- (¯ + β , , in many of the simulations, indicating that − 1 j 1 θ 2 2 is a fixed µ ¯ − ∼N δ 1 ˆ x τ ˆ j σ seems a good candidate to use for convergence; β / ! ˆ j / V 2 2 j = θ µ ˆ σ for τ ν ¯ δ = ij , we find for the parameter ζ j y =ˆ =ˆ = ˆ µ 1 4.4 − j , the closer the parameter estimates are to their offline ˆ V θ ¯ δ decreases. Hence, -dimensional row-vector of covariates at the individual level with first µ p ¯ δ equals obtained using our online method, and those obtained offline (where we 1 would imply convergence). If we examine the behavior of is a µ − decrease during the stream, and that these decreases correspond to more j j ˆ V 2 x σ <ζ ¯ δ θ needs more observations) than for ¯ δ 2 , ten as: ulations presented in Section equivalent. The actual cut-offs element equal to 1 for thegression intercept, coefficients, and the individual-level intercepts is now centered around zero, theIn computation the E of step the the parameters following is individual-level altered statistics slightly. are computed: In practice, one might want tomore extend parameters. the random-intercept One model could, tomates for a of example, model the with include model covariates parameters and towe the improve discuss predictions the the resulting inclusion from esti- of the additional model. Here, 4.7 Extending SEMA and 80 ζ monotonically approaches zero, and that themates difference between of the parameter esti- We assume the covariates are constant within each individual: σ the smaller the value of as: where some parameters might be said to have converged sooner than others. fer for the different parameters. For the parameter and more precise estimates. However, for where determined convergence by no change in parameterdecreases estimates as to the fourth decimal)

Chapter 4 eemndcnegneb ocag nprmtretmtst h orhdecimal) fourth the to as estimates decreases parameter in change no by convergence determined where n oepeieetmts oee,for However, estimates. precise more and e o h ifrn aaees o h parameter the For parameters. different the for fer oeprmtr ih esi ohv ovre onrta others. than sooner converged have to said be might parameters some where as: σ of value the smaller the easm h oaitsaecntn ihnec individual: each within constant are covariates the assume We nteEse h olwn niiullvlsaitc r computed: are slightly. statistics altered individual-level is following parameters the the step of E the computation In the zero, around centered now is Here, model. additional the of esti- from inclusion resulting the the predictions discuss improve the we to and covariates parameters model include with the model example, of a for mates to could, model One random-intercept the parameters. extend more to want might one practice, In SEMA Extending 4.7 and esti- parameter the of between difference mates the that and zero, approaches monotonically ζ 80 rsincefiins n h niiullvlintercepts individual-level the and coefficients, intercept, gression the for 1 to equal element qiaet h culcut-offs actual The equivalent. ltospeetdi Section in presented ulations e as: ten , 2 δ ¯ ed oeosrain)ta for than observations) more needs θ δ ¯ <ζ σ x 2 V ˆ j j eraedrn h tem n htteedcesscrepn omore to correspond decreases these that and stream, the during decrease − µ sa is ol ml ovrec) fw xmn h eairof behavior the examine we If convergence). imply would 1 banduigoroln ehd n hs bandofln weewe (where offline obtained those and method, online our using obtained equals δ ¯ p µ dmninlrwvco fcvrae tteidvda ee ihfirst with level individual the at covariates of row-vector -dimensional erae.Hence, decreases. δ ¯ θ V ˆ h lsrteprmtretmtsaet hi offline their to are estimates parameter the closer the , j − 4.4 1 µ ˆ = =ˆ =ˆ y j ζ efidfrteparameter the for find we , ij = δ ¯ ν τ for σ ˆ µ θ = j 2 2 V / j ˆ ! / β em odcniaet s o convergence; for use to candidate good a seems σ j ˆ τ x ˆ 1 δ ∼N − ¯ µ fixed sa is 2 2 θ 1 j 1 − nmn ftesmltos niaigthat indicating simulations, the of many in , , β + (¯ ilb rbe eedn n ih dif- might and dependent problem be will y + τ ˆ j n (0 p σ 2 fet oSM.Ti oe a ewrit- be can model This SEMA. to effects dmninlvco ihfie-fetre- fixed-effect with vector -dimensional − j µ 2 ,τ + . h eraei ut i lwr(i.e., slower bit a quite is decrease the , j τ x ˆ 2 + σ ˆ 2 j ) 2 β ˆ . / ϵ ) ij n n hpe :Itouto fSEMA of Introduction 4: Chapter j , j τ " , 2 σ ˆ 1 and 2 , µ j r omlydistributed normally are σ µ 2 eas n that find also we , hti l streams all in that x ij = x δ ¯ θ j Because . o h sim- the for (4.19) (4.21) (4.20) (4.22) δ ¯ µ δ ¯ τ µ 2 j h aeweeteidvda-ee effect individual-level the where case the oeta,w use we that, Note fcet,a follows: as efficients, using rameters taesm over sums are it where on con hc niiul aemr bevtosrslsi etretmtsof into estimates Taking better in individual. results an observations more of have observations individuals of which account number the to according weighted is in fta individual, that of tions on rie h auso h covariates the of values the arrives point and of CDSS The SEMA of Introduction 4: Chapter etr nta fasaa.Tecnrbtost h DSfrasnl niiulare individual single a follows: for as CDSS computed the be to then contributions can The scalar. a of instead vector, hr,a previously, as where, fet eentdpneto h ubro bevtos hrfr,teresults same. the the be Therefore, effectively will observations. formulations two of the number of either the using on dependent not were effects n nteMse,teCS a eue ooti e siae o h oe pa- model the for estimates new obtain to used be can CDSS the step, M the In oevr sn h oainicuigtesmainover summation the including notation the using Moreover, et h aineo h admeffect, random the of variance the Next, A j , 2 A T nycnito bevddt teeaen oe aaeesivle)and involved) parameters model no are (there data observed of consist only 1 1 j t and = τ β, A n soa n Moser and Escobar j n 2 µ 2 ˆ a ohb optdonline: computed be both can bevtos ed o aet orc o rvoscontributions. previous for correct to have not do we observations, and , j x β ˆ and ij A A t T = = = = 2( 1( T T T w nti omlto.Ti en hteeytm e data new a time every that means This formulation. this in ! A ! ! σ t t n 3 2 1 ( ) ) µ ˆ t j j j 2 1 " " " j j j j ) t t t =1 =1 =1 ( J J J o h oe nrdcdi Equation in introduced model the For . r gi eerdt as to referred again are = A = = = =[(¯ =ˆ = 2 n n " i T n =1 T j j − j w A A ( ( 1 x x ( ( /n T µ n t 2( 1( x − j ′ j ′ 1 j j 2 y x x t t ij ′ t t 1) ) norsmlto tde h individual-level the studies simulation our In . − − s(93 paigmto o h ersinco- regression the for method updating (1993) ’s j x j j , t x +ˆ τ ˆ 1) 1) ) ) j ′ − ( − ij 2 # # t t ν µ ) ˆ + − ) − − T j x j # = t 1 1 t wj , j x x − A , µ t " " j j ij ′ j β 1 T =1 =1 t j 1+ J J ˆ 1( t ( t τ " 2( ˆ J j t sdpneto h ubro observa- of number the on dependent is − y r eree rmmmr.Because memory. from retrieved are t − =1 J 2 − n x t ij scmue sfollows: as computed is , 1) ) 1) x µ j j ′ ˆ t . " x (¯ . i ij ′ j x + n =1 y t j ′ j t ) ij j y ¯ A 2 T T t j − x x 1( wj +ˆ 1 − ij ′ ij ′ ,T µ t ˆ t − ( ν t ( j n y A t j 2 1) ) ) t j ij and , n . ] 1( x x n j − j ′ t ij , j − µ ˆ t t , µ 1) ˆ j , j T , ) 3 , where , n 4.3 j h xdeffect fixed the , if , µ T 1 j depends snwa now is (4.28) (4.23) (4.29) (4.26) (4.27) (4.24) (4.25) β A 81 in 1

Chapter 4 , 83 4.3 ). For 1982 ( Rubin and Thayer . The (offline) EM algorithm to fit this 1) , (0 ∼N j z , where ij ϵ . Additionally, in the case that all the data for an individual enter + j /J 1 τz + µ With the introduction of SEMA, we provide a novel method to fit multilevel It is to be noted that the random-intercept model, as presented in Equation Another issue to be noted is that ordering of the data points in the data stream as a block (i.e., allbased on at model once), parameters which the are not individual-level yetfunction. close effect to This for the maximum could this of result the individual in likelihood couldincorrect, contributions and be to because the the data CDSS entered of in a anCDSS block, individual the are incorrect which not contributions are corrected. to these Even thoughdecrease the eventually, as effect new of data these points incorrect (and contributionsreason individuals) will to enter, this do is a an full additional EMstream. iteration, using all individuals, occasionally during the data models row-by-row. This allows for the analysis ofdata data sets, without streams revisiting and the extremely previous large data. Becauseis SEMA not is necessary an online to method, store it all theless data computational points in power memory. Additionally, than SEMA requires to the data EM streams. algorithm These when two fitting advantages multilevel make SEMA models attractive both in terms of the Chapter 4: Introduction of SEMA individual-level prediction is thethe main estimation focus of of the individual-level the effects,the analysis, moment for that one example, they by could are recomputing needed fine theseThe using tune at the proposed most alterations recent model to parameter SEMA,direction. estimates. SEMA-T and SEMA-U, provide a step in this which provided the basis forently: our one could SEMA also algorithm, interpret the can currentour also model “observations as be within a individuals” formulated factor correspond analysis differ- to model multipleuals, in items to which within which individ- one fits aified single-factor model. as: The current model could then be spec- this case convergence will also be slow.of As extreme the values data will stream lessen, progresses, since therate their influence of contribution at to least the CDSS will decrease at a model, and its generalizations, is specified inour current detail model, in the covariances between theone items to are also constrained, which in allows this formulationto derive the same a update online steps version as presented of here.general the However, case: this EM seems algorithm when not leading to the be covariances trueficient are in statistics the unconstrained, in the a computation data ofof stream the observations seems suf- within cumbersome, individuals due during to theon the stream. the differing current Still, numbers problem the might, factor-analytic in view more future complex work, models. inspire online EM approximations of is important for the rate of convergencedata stream, of if SEMA. the Especially data in points the are beginning verythe extreme, of maximum-likelihood SEMA the will estimates require for more data the tosimilar model find to parameters. using offline This EM is with conceptually poorly-chosen starting values of the parameters: in 2 ˆ σ (4.30) , 2014; , since we ; the differ- µ , 2013) there 4.12 ) in which only the 2000 , ) data points of the stream as 000 Chapter 4: Introduction of SEMA , . =1 ) t n n 3( T McLachlan & Peel = ) t 2 ( ˆ σ Pedro, Z., Baker, Bowers, and Heffernan is now distributed around 0 instead of j W. Hofmann, Adriaanse, Vohs, and Baumeister µ 2010; (where we choose n Other interesting extensions concern the inclusion of fixed and random effects Our first simulation study showed that SEMA recovers both the fixed effect and the derivation of the updating formulasthe for parameters. the In expected future sufficient research statistics we and will for look into these extensions. for level-1 predictors and thethose generalization models, SEMA to versions can more also than be formulated, 2 which levels as shown of here nesting. involves For separated the fixed effects from the random effects. Lastly, the residual variance 82 This is slightly different from theence previous is formulation in due Equation to the fact that is the same as it was previously, one could occasionally preform a single EM iterationfar. using Using all this individuals extra entered information so for thein estimation parameters of which the approached model the parameters EM resulted estimatesone of could the use parameters faster. the Second, first the individual-level (random) effects well, asAlso, encountered the in variance grouped components data are streams. well estimated,low although reliability in (i.e., conditions when with the very residualthe variance random is intercept), large compared a to largewhich the are number variance close of of to data population values. pointstwo In ways are to the needed improve second the to simulation estimates obtained study, get we by SEMA examined estimates early in the data stream. First, approximation (SEMA). SEMA is obtainedrithm; by that adapting is, the by E using step acontributions of to partial the the E sufficient EM step statistics of ( algo- thetion individual are providing updated. the new observa- social science research (e.g., Killingsworth and Gilbert is a need for computationally feasible methods topresents a analyze data novel streams. method for This estimating chapter multilevelof models in dependent data observations. streams consisting Becausetionally the infeasible regular as EM the size algorithm of becomes the computa- data stream grows, we propose a streaming EM 4.8 Discussion Since data streams are becoming more common in both real-life applications and estimate the model parameters. The combinationeven of larger the improvement. two Finally, approaches in showed our an is implementation, updated an when individual-level a effect new data point enters for the person concerned. However, when a training set. These first dataapplying points EM can until be convergence, used after to which obtain the better stream starting is values, continued by using SEMA to

Chapter 4 sudtdwe e aapitetr o h esncnend oee,when However, concerned. person the for enters point data new effect a individual-level when an updated implementation, is an our showed in approaches Finally, two improvement. the larger of even to combination SEMA The using by parameters. continued values, model is starting the stream better estimate the obtain which to after used convergence, be until can EM points applying data first These set. training a n ol s h first Second, the faster. parameters use the could of one estimates resulted EM parameters the model approached the which of parameters estimation in the for so information entered extra individuals this all Using using far. First, iteration EM stream. single data a the preform in occasionally early estimates could examined SEMA one by we get study, obtained estimates simulation to the second improve needed the to are ways In two points values. population data to of of close variance number are the which large to a compared large intercept), is random variance the residual very the with when conditions (i.e., in reliability although low estimated, well streams. are data components grouped variance in the encountered Also, as well, effects (random) individual-level the observa- new the updated. providing are individual tion the algo- ( of statistics step EM sufficient E the the partial to of contributions a step using E by the EM is, streaming adapting that a by rithm; propose obtained we is grows, stream SEMA data (SEMA). computa- the approximation becomes of algorithm size the EM as regular infeasible the tionally Because consisting streams observations. data dependent in models of multilevel chapter estimating This for method streams. novel data analyze a presents to methods feasible computationally for need a is and applications Gilbert real-life and Killingsworth both in (e.g., research common science more social becoming are streams data Since Discussion 4.8 previously, was it as same the is that fact the to Equation due in formulation is previous ence the from different slightly is This 82 eaae h xdefcsfo h admefcs aty h eiulvariance residual the Lastly, effects. random the from effects fixed the separated h aaees nftr eerhw illo noteeextensions. these into look for will and we statistics research sufficient future expected In the parameters. For for the formulas involves nesting. updating here the of shown of as levels which derivation 2 formulated, the be than also more can versions to SEMA models, generalization those the and predictors level-1 for u rtsmlto td hwdta EArcvr ohtefie fetand effect fixed the both recovers SEMA that showed study simulation first Our te neetn xesoscnenteicuino xdadrno effects random and fixed of inclusion the concern extensions interesting Other n weew choose we (where 2010; µ .Hfan dias,Vh,adBaumeister and Vohs, Adriaanse, Hofmann, W. j snwdsrbtdaon nta of instead 0 around distributed now is er,Z,Bkr oes n Heffernan and Bowers, Baker, Z., Pedro, σ ˆ ( 2 t ) = caha Peel & McLachlan T 3( n n t ) =1 . , hpe :Itouto fSEMA of Introduction 4: Chapter 000 aapit ftesra as stream the of points data ) , 2000 nwihol the only which in ) 4.12 03 there 2013) , µ h differ- the ; ic we since , 2014; , (4.30) σ ˆ 2 feteevle illse,snetercnrbto oteCS ildces ta at decrease will CDSS the least to at contribution of influence their rate the since progresses, lessen, stream will data values the extreme As in of slow. parameters: be the also of will values convergence starting case conceptually poorly-chosen this with is EM This offline using parameters. to find model similar to the data more for require estimates will the SEMA maximum-likelihood of extreme, the very beginning are the points in data Especially the SEMA. if of stream, data convergence of rate the for important is of approximations EM online inspire models. work, complex future more view in factor-analytic might, the problem numbers Still, current differing the stream. the on the to during due individuals cumbersome, within suf- seems observations the stream of of data computation a the in unconstrained, the statistics in are ficient true covariances be the to leading not when algorithm seems EM this case: However, the general here. of presented version as steps online update a same the derive to formulation this allows in which constrained, also are to items one the between covariances the in model, detail current our in specified is generalizations, its and model, as owihoefisasnl-atrmdl h urn oe ol hnb spec- be then could model current The as: model. single-factor ified a fits one individ- which within which to items in uals, multiple model to differ- analysis correspond factor formulated individuals” a within be as “observations model also our current can the interpret algorithm, also SEMA could one our ently: for basis the provided which this in step a provide SEMA-U, and SEMA-T estimates. direction. SEMA, parameter to model recent alterations most proposed the at tune using The these fine needed recomputing are could by they example, one that for moment analysis, the effects, the individual-level the of of focus estimation main the the is prediction individual-level SEMA of Introduction 4: Chapter odt tem.Teetoavnae aeSM trciebt ntrso the of terms in both attractive models SEMA make multilevel advantages fitting two when These algorithm streams. EM data the to requires SEMA than Additionally, memory. power in points computational data less the all it store method, to online an necessary is not SEMA is Because data. large previous extremely the and revisiting streams without sets, data data of analysis the for allows This row-by-row. models data the during occasionally individuals, all using iteration, stream. EM additional full an a is do this enter, to will individuals) reason contributions (and incorrect points these data of new effect as eventually, the decrease though Even these to corrected. are contributions not which incorrect are the individual block, CDSS an a in of entered CDSS data the the because to be and contributions incorrect, could likelihood in individual the result of this could maximum the for This to effect close function. yet individual-level not are the which parameters once), model at on based all (i.e., block a as nte su ob oe sta reigo h aapit ntedt stream data the in points data the of ordering that is noted be to issue Another ti ob oe htterno-necp oe,a rsne nEquation in presented as model, random-intercept the that noted be to is It ihteitouto fSM,w rvd oe ehdt tmultilevel fit to method novel a provide we SEMA, of introduction the With µ + τz 1 /J j + diinly ntecs htaltedt o nidvda enter individual an for data the all that case the in Additionally, . ϵ ij where , z j ∼N (0 , 1) h ofln)E loih ofi this fit to algorithm EM (offline) The . ui n Thayer and Rubin ( 1982 .For ). 4.3 83 ,

Chapter 4 84 Chapter 4: Introduction of SEMA number of computations and in terms of the memory requirements.

Acknowledgement Chapter 5 We would like to thank dr. M.A. Killingsworth and prof.dr. D.T. Gilbert for shar- ing their data. Furthermore we would like to thank the editor and the anonymous reviewers for the great contribution to the chapter. Finally, we would like to thank Estimating Multilevel Models on James Mason, Sophia Rabe-Hesketh, and Anders Skrondal for their feedback during the writing process. Data Streams

Abstract

Social scientists are often faced with data that have a nested structure: for exam- ple, pupils are nested within schools, employees are nested within companies, or repeated measurements are nested within individuals. Data sets that have such nested structures are typically analyzed using multilevel models. However, when data sets are extremely large or when new data continuously augment the data set, estimating multilevel models can be challenging: the algorithms used to fit multi- level models repeatedly revisit all data points and end up consuming a lot of time and computer memory. This is especially troublesome when predictions are needed in real time and observations keep streaming in. We address this problem by intro- ducing the Streaming Expectation-Maximization Approximation (SEMA) algorithm for fitting multilevel models online (or “row-by-row”). In a simulation study, we demonstrate the performance of SEMA compared to traditional methods of fitting multilevel models. Next, the algorithm is used to analyze an empirical data set that was originally recorded as a data stream. We show that the prediction accuracy is SEMA is both competitive and orders of magnitude faster than traditional methods.

This chapter is submitted as Ippel, L., Kaptein, M.C, & Vermunt, J.K. Estimating Multilevel Models on Data Streams 84 Chapter 4: Introduction of SEMA number of computations and in terms of the memory requirements.

Acknowledgement Chapter 5 We would like to thank dr. M.A. Killingsworth and prof.dr. D.T. Gilbert for shar- ing their data. Furthermore we would like to thank the editor and the anonymous reviewers for the great contribution to the chapter. Finally, we would like to thank Estimating Multilevel Models on James Mason, Sophia Rabe-Hesketh, and Anders Skrondal for their feedback during the writing process. Data Streams

Abstract

Social scientists are often faced with data that have a nested structure: for exam- ple, pupils are nested within schools, employees are nested within companies, or repeated measurements are nested within individuals. Data sets that have such nested structures are typically analyzed using multilevel models. However, when data sets are extremely large or when new data continuously augment the data set, estimating multilevel models can be challenging: the algorithms used to fit multi- level models repeatedly revisit all data points and end up consuming a lot of time and computer memory. This is especially troublesome when predictions are needed in real time and observations keep streaming in. We address this problem by intro- ducing the Streaming Expectation-Maximization Approximation (SEMA) algorithm for fitting multilevel models online (or “row-by-row”). In a simulation study, we demonstrate the performance of SEMA compared to traditional methods of fitting multilevel models. Next, the algorithm is used to analyze an empirical data set that was originally recorded as a data stream. We show that the prediction accuracy is SEMA is both competitive and orders of magnitude faster than traditional methods.

This chapter is submitted as Ippel, L., Kaptein, M.C, & Vermunt, J.K. Estimating Multilevel Models on Data Streams 87 Ip- ; (5.1) 2009 , ’ is the as- Ippel et al. := by one, and we when a new data n updated -learning algorithm. On- Cappé & Moulines ; online . ). 2011a ¯ x , − , n computation of a sample mean solves t t 2016a is the sample mean, and ‘ ( x x ¯ x n =1 Cappé +1 + t ! online n x 1 n := := ¯ Ippel et al. ¯ x n ). A simple illustration of online learning can be , and these are efficiently ). The current chapter adds to this existing literature ¯ x 2011a , 2006 and , n Cappé Liu et al. ; : 2 2016b , is total number of observations, ). Their so-called SEMA algorithm (Streaming Expectation Maximization Ap- n , computation of a sample mean using, ) discuss how to parallelize the EM algorithm to deal with extremely large data The SEMA algorithm can be categorized as an In this chapter, we present a fully online method for estimating multilevel mod- See, for an online-estimation tutorial, 2 2008 2016b is inefficient since when aredo new our data computation point by enters, revisitinghave all we our to increment stored be data available points. intime As a computer a new memory, result, observation and all is thethese data added. computation issues. time An When computing grows the each samplethe mean sufficient online, statistics, it is only necessary to store Chapter 5: SEMA extended ( sets; a method that isa less number well-suited of for specific statistical dealing models, withalgorithm computationally streaming have efficient data. versions recently been of Finally, proposed the for ( EM pel et al. by presenting a computationally efficient algorithmmodels for – the or estimation linear of mixed multilevel models – in data streams. line learning refers to “computing estimates of modelstoring parameters the on-the-fly, without data and bybecome continuously available” updating the ( estimates asprovided more by observations carefully inspecting the computationoffline of a sample mean. The standard, Here, point enters signment operator, indicating that theright-hand left side. hand Note side that we is will replaced use by this what operator throughout is the on chapter. els the by extending the( online EM algorithmproximation) introduced dealt previously only by with random-interceptto models. extend this The method aim to of allowlevel-2 for this fixed fitting chapter multilevel effects, is models as that well contain asprevious level-1 random and work intercepts and to slopes. a Hence, muchchapter, we broader we extend the class will of use linear thewithin mixed terminology individuals. models. of However, repeated multilevel Throughout observations modelsrestricted this nested to and this or the type grouped SEMA of grouping. algorithm are not Fi- ’s Gaber ; Wolfe et al. L. F. Barrett 2012 , ). ) which, due to 2001 Gaber ( ( 2002 , McLachlan and Peel Chapter 5: SEMA extended data stream ). When we monitor, for exam- Thiesson et al. 2012 , Steenbergen & Jones ; ), the computation time increases exponentially. 1988 , 1977 ) detail a number of possible adaptations to the gen- , Buskirk & Andrus ; ). Nested structures are often dealt with using multilevel 1998 ( 2015 1986 , , book (2000, ch. 12) and by Beck ; Dempster et al. 2001 ). Due to the continuous influx of new observations, data streams quickly , Goldstein & McDonald Neal and Hinton Kenny & Judd 2005 , In the literature, several adaptations of the EM algorithm that are computation- When continuously monitoring the attitudes and behaviors of individuals, data eral EM algorithm to deal with large and/orThese growing adaptations data sets are using further batches of explained data. andnite extended Mixture in Models stream efficient computational methods designed toquired. deal with data streams, are re- ally more efficient than thestance, traditional EM algorithm have been proposed. For in- models ( their complexity, only exaggerate the computation time problemsdealing encountered when with streaming data. Since the likelihoodto function be of a maximized multilevel iteratively model (using, has gorithm, for EM, example, the Expectation-Maximization al- Thus, when real-time predictions of individuals’ scores are needed during a data time the data set is augmented oftenacceptable. increases non-linearly and In quickly addition, becomes un- the aforementionedthe examples collected describe data situations in have which aamong the nested observations, and structure. these dependencies in Thismany turn violate nesting statistical a models introduces key assumption that dependencies of assumedent that ( observations are (conditionally) indepen- in which new data enter continuously, iset known al. as a result in (extremely) large datamemory. sets Even when – the possibly storage largerobtaining of up-to-date than all predictions of would using these fit observations all intionally is available infeasible: computer technically information the feasible, computational is time to often re-estimate computa- the model parameters each ried out in real time whenchapter, up-to-date we predictions present are a continuously computationally-efficient available. algorithm for In generatingof this predictions individuals’ traits in situations in which data are continuously collected. collection is effectively never ‘finished’: newtinue costumers to visit see websites, their patients doctors, con- and students enter and leave universities. This situation, & Barrett ple, the behavior of customers onical a regimen, webpage, or patients’ students’ compliance performances, with wetraits their are of med- likely individuals. interested Based in on thetions individual-level behavior or estimates treatments; or of e.g., traits, we could we recommendpreferences can certain as tailor books displayed ac- tailored by to their individuals’ browsing behavior. Such tailoring can only be car- 86 5.1 Introduction Novel technological advances – such astions the – widespread facilitate use monitoring of individuals smartphone over applica- extensive periods of time (

Chapter 5 ieMxueModels in Mixture extended nite and data. explained of batches further using are sets data adaptations growing These and/or large in- with deal For to algorithm proposed. EM eral been have algorithm EM traditional stance, the than efficient more ally re- are streams, data data a with during deal quired. to needed designed are methods scores computational individuals’ efficient stream of predictions real-time when Thus, al- Expectation-Maximization the example, EM, for gorithm, has (using, model iteratively multilevel maximized a of be function to likelihood the Since data. streaming with when encountered dealing problems time computation the exaggerate only complexity, their ( models indepen- (conditionally) are observations ( that dent assume of dependencies that assumption key introduces models a statistical nesting violate turn many This in dependencies these structure. and observations, nested the among a which have in situations data describe collected examples the aforementioned the un- becomes addition, quickly In and non-linearly increases acceptable. often augmented is each set parameters data model the the computa- time re-estimate often to time is computational feasible, the information technically computer infeasible: available is tionally in all observations fit these using would of predictions all than up-to-date of obtaining larger storage possibly the – when Even sets memory. data large (extremely) in result a situation, as This al. known et universities. is continuously, leave enter and data enter new which students in and con- doctors, patients their websites, see visit to costumers tinue new ‘finished’: never effectively is collection collected. continuously are data which in situations in traits individuals’ predictions this of generating In for algorithm available. computationally-efficient continuously a are present predictions we up-to-date chapter, car- when be time only real can in tailoring out Such ried behavior. browsing individuals’ their to by tailored ac- displayed books tailor as certain can preferences recommend we could traits, we e.g., of or treatments; estimates or behavior individual-level tions the on in Based interested individuals. likely med- of are their traits we with performances, compliance students’ patients’ or webpage, regimen, a ical on customers of behavior the ple, ( time of periods Barrett extensive & applica- over smartphone individuals of monitoring use facilitate widespread – the tions as such – advances technological Novel Introduction 5.1 86 nteltrtr,svrlaattoso h Magrtmta r computation- are that algorithm EM the of adaptations several literature, the In data individuals, of behaviors and attitudes the monitoring continuously When , 2005 en Judd & Kenny eladHinton and Neal odti McDonald & Goldstein , .Det h otnosiflxo e bevtos aasrasquickly streams data observations, new of influx continuous the to Due ). 2001 eptre al. et Dempster ; Beck ok(00 h 2 n by and 12) ch. (2000, book , , 1986 2015 ( 1998 .Nse tutrsaeotndatwt sn multilevel using with dealt often are structures Nested ). ; ukr Andrus & Buskirk , ealanme fpsil dpain otegen- the to adaptations possible of number a detail ) 1977 , 1988 ,tecmuaintm nrae exponentially. increases time computation the ), ; tebre Jones & Steenbergen , 2012 hesne al. et Thiesson .We emntr o exam- for monitor, we When ). aastream data hpe :SM extended SEMA 5: Chapter caha n Peel and McLachlan , 2002 ( ( Gaber 2001 hc,deto due which, ) ). , 2012 .F Barrett F. L. of tal. et Wolfe ; Gaber ’s Fi- ihnidvdas oee,mliee oesadteSM loih r not are algorithm grouping. of SEMA grouped type the or this and to nested this restricted models observations Throughout multilevel repeated However, of models. individuals. terminology mixed within the linear use of will class the extend we broader we chapter, much Hence, a slopes. to and intercepts work and random level-1 previous as contain well that as models is multilevel effects, chapter fitting fixed this for level-2 allow of to aim method The this extend models. to random-intercept with by only previously dealt introduced proximation) algorithm EM online ( the extending by the els chapter. on the is throughout operator what this by use replaced will is we that side Note hand side. left right-hand the that indicating operator, signment on enters point Here, rvddb aeul npcigtecmuaino apema.Testandard, The mean. sample a of offline computation the inspecting carefully observations by more provided as estimates ( the updating available” continuously become by and data without on-the-fly, the parameters storing model of estimates “computing to refers learning line streams. data in – models multilevel mixed of linear estimation or the – for models algorithm efficient computationally a presenting by al. et pel EM ( for the proposed Finally, of been recently versions data. efficient have streaming computationally algorithm with models, dealing statistical specific for of well-suited number less a is that method a sets; ( extended SEMA 5: Chapter hs sus hncmuigtesml enoln,i sol eesr ostore to necessary only is it statistics, online, sufficient mean the sample each the grows computing When An time issues. computation added. data these the is all and observation result, memory, new a computer a As time in points. available data be stored increment to our we all have revisiting enters, by point computation data our new redo a when since inefficient is 2016b 2008 2 e,fra nieetmto tutorial, online-estimation an for See, nti hpe,w rsn ul niemto o siaigmliee mod- multilevel estimating for method online fully a present we chapter, this In h EAagrtmcnb aeoie san as categorized be can algorithm SEMA The ics o oprleieteE loih oda iheteeylredata large extremely with deal to algorithm EM the parallelize to how discuss ) optto fasml enusing, mean sample a of computation , n .Ters-aldSM loih SraigEpcainMxmzto Ap- Maximization Expectation (Streaming algorithm SEMA so-called Their ). sttlnme fobservations, of number total is , 2016b 2 : ; i tal. et Liu Cappé n , and 2006 , 2011a x ¯ .Tecretcatrad oti xsigliterature existing this to adds chapter current The ). n hs r efficiently are these and , .Asml lutaino nielann a be can learning online of illustration simple A ). n x ¯ pe tal. et Ippel =¯ := := n 1 x n online ! t + +1 Cappé =1 n x ¯ x x ( stesml en n ‘ and mean, sample the is 2016a t t optto fasml ensolves mean sample a of computation n , − , x ¯ 2011a ). . online ; ap Moulines & Cappé lann loih.On- algorithm. -learning updated n hnanwdata new a when yoe n we and one, by := pe tal. et Ippel steas- the is ’ , 2009 (5.1) ; Ip- 87

Chapter 5 ) r k 89 1( × t (5.4) (5.5) (5.3) r indicates 1 j 2 ’s given the ). b j # − 2 1 b ) σ k j − b itself is an Φ ′ ij ) ′ j k ) would have been z b and ( j j , b − there are closed-form J =1 C ’s, and Φ j ! j j β , σ 3 b 1 2 ′ ij β . x ) |− 1) , − ) (CDSS), which are computed − Φ k ij | 1) ), Ch. 14 ( y , − ˆ ) ( β ln k k indexes the current iteration, j ( " ( 2 2002 ˆ J j k β ( X j ˆ b j ′ j − =1 j n i ! X Z Z ′ j J =1 − − j ! (for respectively X j j matrix, 1 2 3 y y t ′ ( j J r =1 ′ j j ! − Z × Z ( 2 ) ) j = and σ 1 1 k k n ) ( ( , − − j j k 2 ln is in the E-step and subsequently maximizing the log- 1( C C T j 2 t n , 2 = = 1 Z Raudenbush and Bryk σ − ’s, are replaced by their expected values given the cur- t ) j k is given by: ( b ) j k )= ˆ b ( j j , and complete-data sufficient statistics b ˆ b matrix, Φ y, , | p is defined as follows: 2 β × β j ,σ n Φ , β ( quantifies the uncertainty of the imputations of vector, and ℓ is an ) 1 j k ( j of the previous iteration is used in the computation. × X C p ˆ β Often, the maximum likelihood framework is used to estimate the model pa- The CDSS for For more details and proof, see 3 model parameters observed, optimizing the log-likelihoodlog-likelihood function function has would the be following distribution: straightforward. The likelihood function given these expectationsatively in finds the the M-step, parameter thedetails EM values regarding algorithm that the iter- computation maximize in the theputations likelihood. E-step of are the presented, M-step Below, after are first which presented. the the com- where is an Chapter 5: SEMA extended rameters of the above multilevel model. If the random effects ( 5.2.1 The offline E-step When the missing values, based on a number of as part of the E-step.the Each three necessary of CDSS the as model parameters has its own CDSS. We refer to rent parameter estimates and the available data ofexpressions individual to compute the model parameters. These closed-form expressions are Here However, since these randomtent) effects we are are not confronted directly withthis observed a missing-data problem missing-data (i.e., is problem. using these the EM are Onefunction. algorithm la- approach to By maximize to imputing the deal log-likelihood these with missing values with the expectations of that (5.2) , are β be total j n =1 J j ! = presents a simulation n 5.4 Chapter 5: SEMA extended , ij ϵ ) + ). For example, including a random ) Φ j j 2 , , might differ across individuals. Fur- b b j (0 ,σ ′ ij n z (0 individuals. The model fitted to the data + observations and let , J j ) algorithm. Multilevel models can contain β j ∼N ′ ij ∼MVN ij x effects data, ϵ j 1977 = b effect data, , ,...,n ij y =1 fixed random i of individual i . j b have j vector of ⊥ vector of vector of random effects coefficients, matrix with (co)variances of the random effects, vector of fixed-effect coefficients, 1 Dempster et al. ϵ 1 1 r 1 × × × × × p r r r p is a is a is the response is the error term for each observation, is the variance of the error term is a is a is a ij 2 ij j ij ij y β b z σ Φ ϵ x In the next section, the offline estimation of multilevel models using the EM algo- Let individual • • • • • • • • is: assumed to have the samevary effect between across individuals individuals. are random effects Effects ( which are assumed to diction error. This sectionconsisting is of followed repeated by measurements. andiscussed and Finally, empirical directions the example for results future of research of are a highlighted. both data evaluations stream are rithm is explained in detail. Subsequently,of we multilevel illustrate the models online using fitting the procedure study SEMA examining algorithm. the performance Section of SEMA in terms of estimation accuracy and pre- though the covariates still affect alladds individuals a equally. distribution A of random effects slope of effectively uals a differently. covariate, such that a covariate can affect individ- intercept formalizes the assumption that individuals have different starting points, thermore, the variance of the randombe effects independent: and the error variance are assumed to number of observations collected from where The number of observations per individual, Here, we discuss the estimationimization of (EM, multilevel models usingboth the fixed Expectation effects Max- and random effects. Fixed effects, which we denote using 5.2 Offline estimation of multilevel models 88

Chapter 5 ohfie fet n admefcs ie fet,wihw eoeusing denote we which effects, Fixed effects. random Max-and effects Expectation fixed the both using models multilevel (EM, of imization estimation the discuss we models Here, multilevel of estimation Offline 5.2 88 h ubro bevtosprindividual, per observations of number The where ubro bevtoscletdfrom collected observations of number hroe h aineo h admefcsadteerrvrac r sue to assumed are variance error the and independent: effects be random the of variance the thermore, dsadsrbto fefcso oait,sc htacvraecnafc individ- affect can covariate a that such covariate, differently. a uals effectively of slope effects random of A distribution equally. a individuals adds all points, affect still starting different covariates the have though individuals that assumption the formalizes intercept sue ohv h aeefc cosidvdas fet hc r sue to assumed are which ( Effects effects random are individuals. individuals across between effect vary same the have to assumed are stream evaluations data both highlighted. a are of research of future results for example the directions empirical Finally, and discussed an measurements. by repeated followed of is consisting section pre- This and accuracy estimation error. of terms diction in SEMA of Section performance the algorithm. examining SEMA study procedure the fitting using online models the illustrate multilevel we of Subsequently, detail. in explained is rithm is: • • • • • • • • e individual Let ntenx eto,teofln siaino utlvlmdl sn h Malgo- EM the using models multilevel of estimation offline the section, next the In x ϵ Φ σ z b β y ij ij j ij 2 ij sa is sa is sa is stevrac fteerrterm error the of variance the is steerrtr o ahobservation, each for term error the is steresponse the is sa is sa is p r r r p × × × × × 1 r 1 1 ϵ eptre al. et Dempster 1 etro xdefc coefficients, fixed-effect of vector arxwt c)aine fterno effects, random the of (co)variances with matrix etro admefcscoefficients, effects random of vector etrof vector ⊥ etrof vector j have b j . i findividual of i random fixed =1 y ij ,...,n , fetdata, effect b = 1977 j ϵ fet data, effects x ij ∼MVN ij ′ ∼N j β loih.Mliee oescncontain can models Multilevel algorithm. ) j J , bevtosadlet and observations + niiul.Temdlfitdt h data the to fitted model The individuals. (0 z n ij ′ ,σ (0 j b b ih ifrars niiul.Fur- individuals. across differ might , , 2 j j Φ ) .Freape nldn random a including example, For ). + ) ϵ ij , hpe :SM extended SEMA 5: Chapter 5.4 n rsnsasimulation a presents = ! j J =1 n j etotal be β are , (5.2) that aeeso h bv utlvlmdl fterno fet ( effects random the If model. multilevel above the of rameters ucin yiptn hs isn auswt h xettosof expectations the with values missing with these log-likelihood deal the imputing to maximize By to approach la- algorithm function. One are EM the these using problem. is (i.e., missing-data problem missing-data a observed this with directly confronted not are are we effects tent) random these since However, Here spr fteEse.Ec ftemdlprmtr a t w DS erfrto refer We CDSS. own its has parameters model as the CDSS of necessary three Each the E-step. the of part are as expressions of closed-form number These a on parameters. based model the compute to individual expressions of data available the and estimates parameter rent san is values, missing the When E-step offline The 5.2.1 extended SEMA 5: Chapter where eal eadn h optto nteEse r rsne,atrwihtecom- the the presented. which first are after Below, M-step presented, the are of E-step likelihood. putations the the in maximize computation iter- the that algorithm regarding values EM details the parameter M-step, the the finds in atively expectations these given function likelihood bevd piiigtelglklho ucinwudb tagtowr.The straightforward. distribution: following be the would has function function log-likelihood log-likelihood the optimizing observed, oe parameters model 3 o oedtisadpof see proof, and details more For fe,temxmmlklho rmwr sue oetmt h oe pa- model the estimate to used is framework likelihood maximum the Often, h DSfor CDSS The β ˆ p C X × ftepeiu trto sue ntecomputation. the in used is iteration previous the of j ( k j 1 ) san is ℓ etr and vector, unie h netit fteipttosof imputations the of uncertainty the quantifies ( β , Φ n ,σ j β × β 2 sdfie sfollows: as defined is p | , y, Φ matrix, b ˆ b opeedt ufiin statistics sufficient complete-data and , j j ( b ˆ )= k j ) b ( sgvnby: given is k j ) t s r elcdb hi xetdvle ie h cur- the given values expected their by replaced are ’s, − σ adnuhadBryk and Raudenbush Z 1 = = 2 , n t 2 j T C C 1( nteEse n usqetymxmzn h log- the maximizing subsequently and E-step the in is ln 2 k j j − − , ( ( ) n k k 1 1 σ and = j ) ) 2 ( Z × Z − ! j j ′ =1 r J j ( ′ t y y 3 2 1 matrix, j j X frrespectively (for ! j − − =1 J j ′ Z Z X ! i n j =1 − j ′ j b ˆ j X ( β k j J ˆ 2002 2 ( " ( j nee h urn iteration, current the indexes k k ln β ( ) ˆ − , y ( ,C.14 Ch. ), 1) | ij k Φ − CS) hc r computed are which (CDSS), ) − , 1) |− ) x . β ij ′ 2 1 b 3 σ , β j j ! j Φ s and ’s, C =1 J hr r closed-form are there − b , j j ( and b z ol aebeen have would ) k j ′ ) ij ′ Φ tefi an is itself b − j k σ ) b 1 2 − # j b ). sgvnthe given ’s 2 j 1 indicates r (5.3) (5.5) (5.4) t × 1( 89 k r )

Chapter 5 91 ) is 1 (5.10) (5.11) by the ) k 2( T , all observations 2 σ . , j ) 1 k Z ( ′ − j j Z C ) 1 k ( =1 − j J j C ! is presented. , changes accordingly and needs ) =1 1) ˆ J j k β ( − j k ! 2 ( ˆ b σ ′ ij ) is computed by dividing tr J n ) z +ˆ k 1) ( ) − − k k Φ ( 2 ) ( ′ j k σ ( ˆ b ) ˆ β k +ˆ ( ′ ij j u ′ x ˆ b ) is computed as follows: . u to indicate the iteration cycles of the EM algorithm. ) ) − k =1 k 2 ( k J j ) =1 J ij σ 2( J j k y n ! T 3( t ! = = = = ) k ) ( k 2 ( ˆ Φ ˆ σ The variance of the random effects ( In this section, we drop this subscript toSEMA emphasize algorithm that only unlike updates the the EM CDSS algorithm, using the aprevious single data data points. point, without Note revisiting that, data pointindividual, refers the to covariates a with vector with fixed an effects identifierof and for the random an dependent effects variable. and the When observation aan data E-step point only enters, for the the SEMA individual algorithm thatAfter performs belongs the to the E-step data for point that this recentlyM-step. individual, entered. all Because three of model this parameters updatingwhen are a scheme, updated new SEMA data in updates point the the enters,First, instead parameter the of estimates online fitting implementation the of multilevel the model CDSS all for over again. thus have to be stored inparameters memory. Furthermore, of the the residual depends previous on iteration.each the iteration, model Because the residual, the model parameters change with by the M-step. Atdescribed. the This latter end overview of illustrates the thiswhich sequence section, elements of are computations the stored and in full details memory. algorithm (see Algorithm 5.3.1 The online E-step Previously, we used subscript Chapter 5: SEMA extended 5.3 Online estimation of multilevelIn models this section, we introduce thetion (SEMA) Streaming algorithm. Expectation The Maximization approximation Approxima- of the E-step is presented first, followed number of individuals: to be re-computed. Lastly, the residual variance ( The latter equation again illustratesis that computationally fitting intensive. multilevel models The in residualresidual a variance for data is each stream computed observation. by Using estimating this the offline formulation of is ) k (5.6) (5.8) (5.7) (5.9) 3( is not t ) , k ) ( k j 2 ( b σ , ) , k ) . ( , ) j j ) k ˆ b 1 k ( Z ( j j ′ . − j j ˆ b ) Z Chapter 5: SEMA extended j . k Z ′ C j ) 1) 1( Z 1 k ( X t J − =1 − 1 j j k − ). ! − ( − − C j β 1) ˆ j j Φ y − is the sum of the squared random- ( y y J =1 1) ′ ′ ′ k j j j j ! 2 ( ) − matrix inversion are computationally k σ k X X X 2 tr ( 2( p σ +ˆ 1) T J J J =1 =1 =1 ) × − j j j ! ! ! +ˆ k k ( 1 1 1 2 ( p j ′ j − − − σ ˆ b Z ) # # # ′ j k j j j +ˆ ( Z , is the standard residual. j u ) X X X ′ ˆ b ′ ′ ′ k j j j = ( u j ) X X X J =1 ˆ b k j ! J =1 j ( j ! j J J J =1 =1 =1 Z = j j j ! ! ! C = ) " " " − k ) k = = = matrix. In words, 2( 1) 3( ) T − r t k k ( ( ˆ ), and the resulting , is given by: × ˆ β β j ) j k r X , the coefficients of the fixed effects are computed using the normal 2( ′ X j k T − X has multiple elements which are difficult to compute in a data stream. j is an =1 y ) J j k 5.9 = 2( $ T u Lastly, the complete-data sufficient statistic of the residual variance, The estimation of the complete-data sufficient statistic for the variance of the In iteration random effect, observed. effects coefficients plus the additional uncertainty due to the fact that 90 matrix: when operating on a data stream.parameters We in discuss turn, the starting computation with of the each fixed of effects the ( model equations: 5.2.2 The offline M-step In the M-step, the log-likelihoodWhile function presenting is the maximized, computations, we given also the indicate CDSS which parts of present E-step. difficulties given by: expensive when there are many covariates. where where Equation For instance, the matrix multiplication and summationvidual of ( the matrices for each indi-

Chapter 5 o ntne h arxmlilcto n umto ftemtie o ahindi- each for matrices the ( of vidual summation and multiplication matrix the instance, For Equation where where xesv hnteeaemn covariates. many are there when expensive ie by: given equations: model ( the effects of fixed each the of with computation starting the turn, discuss in We parameters stream. data difficulties a E-step. present on of parts operating which CDSS when indicate the also given we computations, maximized, the is presenting function While log-likelihood the M-step, the In M-step offline The 5.2.2 matrix: 90 fet ofcet lsteadtoa netit u otefc that fact the to due uncertainty additional the plus coefficients effects observed. admeffect, random niteration In h siaino h opeedt ufiin ttsi o h aineo the of variance the for statistic sufficient complete-data the of estimation The aty h opeedt ufiin ttsi ftersda variance, residual the of statistic sufficient complete-data the Lastly, u T $ 2( = 5.9 k j J ) y =1 san is j a utpeeeet hc r ifiutt opt nadt stream. data a in compute to difficult are which elements multiple has X − T k j X ′ 2( h ofcet ftefie fet r optduigtenormal the using computed are effects fixed the of coefficients the , X r k j ) j β β ˆ × sgvnby: given is , ,adteresulting the and ), ˆ ( ( k k t r − T ) 3( 1) 2( arx nwords, In matrix. = = = k ) k − " " " ) = C ! ! ! j j j = Z =1 =1 =1 J J J j ! j ( j =1 J ! j k b ˆ =1 J X X X ) j u ( = j j j k ′ ′ ′ b ˆ ′ X X X ) u j stesadr residual. standard the is , Z ( +ˆ j j j k j ′ # # # ) Z b ˆ σ − − − j ′ j p ( 2 1 1 1 ( k k +ˆ ! ! ! j j j − × ) =1 =1 =1 J J J T 1) +ˆ σ p 2( ( tr 2 X X X k σ k arxivrinaecomputationally are inversion matrix − ) ( 2 ! j j j j k ′ ′ ′ 1) =1 J y ( y stesmo h qae random- squared the of sum the is − y Φ j j ˆ 1) β j C − − ( − ! ). − k j j 1 − =1 − J t X ( k 1 Z 1( 1) ) j C ′ Z k . j hpe :SM extended SEMA 5: Chapter Z ) b ˆ j j − . ′ j j ( Z ( k 1 b ˆ k ) j j ) , ( . ) k , ) , σ b ( 2 j k ( ) k , ) t snot is 3( (5.9) (5.7) (5.8) (5.6) k ) is eiulfrec bevto.Uigti fiefruainof formulation offline the this estimating Using by observation. computed stream each is data for variance a residual residual in The models multilevel intensive. fitting computationally that is illustrates again equation latter The aty h eiulvrac ( variance residual the Lastly, ob re-computed. be to ubro individuals: of number rvosy eue subscript used we Previously, E-step online The 5.3.1 Algorithm (see algorithm memory. details full in and stored the computations are of elements section, sequence which this the illustrates of overview end latter This the described. At followed first, M-step. presented is the E-step by the of Approxima- approximation Maximization The Expectation algorithm. Streaming (SEMA) tion the introduce we section, this models In multilevel of estimation Online 5.3 extended SEMA 5: Chapter aaeeso h rvosieain eas h oe aaeescag with change parameters model the residual, the Because model iteration, the each iteration. on previous depends residual the the of Furthermore, memory. parameters in stored be to have thus hnanwdt on nes nta ffitn h utlvlmdlaloe again. over for all CDSS model the multilevel of the implementation fitting online estimates of the parameter instead First, enters, the the point updates in data SEMA new updated scheme, a are when updating parameters this model of three Because all entered. individual, M-step. recently this that point for data E-step the to the belongs performs After that algorithm individual SEMA the the for enters, only point E-step data an a observation When the and variable. effects dependent an random the for and of identifier effects an fixed with vector with a covariates to the refers individual, point data that, revisiting Note without point, points. data data single previous a the using algorithm, CDSS EM the the updates unlike only that algorithm emphasize SEMA to subscript this drop we section, this In h aineo h admefcs( effects random the of variance The σ ˆ Φ ˆ ( 2 k ( ) k ) = = = = ! t 3( T ! n y k j J 2( σ ij J =1 ) j J k ( 2 k =1 k − ) ) oidct h trto ylso h Malgorithm. EM the of cycles iteration the indicate to u . scmue sfollows: as computed is ) b ˆ x ′ u j ij ′ ( +ˆ k β ˆ ) b ˆ ( σ k j ′ ( ) 2 ( Φ k k − − ) ( 1) k +ˆ z ) n J tr scmue ydividing by computed is ) ij ′ σ b ˆ ( 2 ! k j − ( β k j J ˆ 1) =1 ) hne codnl n needs and accordingly changes , spresented. is ! C j J j − =1 ( k 1 ) C Z j j − ′ ( Z k 1 ) j , . σ 2 l observations all , T 2( k ) ythe by (5.11) (5.10) 1 is ) 91

Chapter 5 ) ). 93 ′ ij 5.5 z data (5.17) (5.21) (5.20) (5.18) (5.15) (5.16) (5.19) ij n z 1 is given ) is com- − j j ˆ b C 5.6 (see, Eq. j =1 j n i (Eq. X ′ ! j j j Z C =1 y J j j z ′ j ! ˆ b , tr 2 ) 2 t ( − σ t . The online computation of j j ) 2 + y ˆ . β is also a summation over indi- T j ′ 1 2 ) , 5.14 2 x ) − j j ′ ˜ . + , j , ij T ˆ 1 1 C β ˆ b Z y ′ , ij ′ 1) − 2 1 − j ′ z ij ij ). Therefore, a similar update regime ˜ − t ˆ J =1 1 t z Φ z X ij C − ( j " 2 ( − t in Eq. z j j 2 2 j tr + σ − j 2 − 2 C σ + j Z ˆ T +ˆ β Z σ j y ′ j 2 j ′ j j y +ˆ =1 ˆ b 2 ′ ij j − j J j +ˆ j z Z ′ j x X z Z can be updated as follows: ˆ b ′ j ˆ uses a matrix product of the data used for the b : 1) ( matrix needs to be stored per individual. The ! j ˆ b 3 2 j 1 j := − := − ˜ t = + ˆ b 2 2 ˆ j t , the next step needed for computing β − j Z j 2 C j 2 j j σ j j ij 2( Z y = C ˜ y j Z C Z C T +ˆ ( X j ′ j ′ z ′ j 2 = ˆ j ˆ b β =1 T X := j online, the previous contribution of this individual is j n i ˆ β ˆ ) b ˆ b . The 2 t ! ˜ is given by: + +2 T 2( =1 ˜ j 2 j J j T =1 5.14 y is unlike the previous two CDSS, a summation over J j C ! vector. Note that the matrix multiplication 3 = t ! 1 = j 3 × 2 = t r T 3 : t : 5.7 is an . When new data enter 5.8 2 j j y Z j z as j Lastly, we illustrate the online computation of the CDSS for the residual vari- Next, we present the online computation of the CDSS for the variance of the Z is thus given by: ′ j j ˆ b by: is equal to the transpose of the matrix online computation of estimation of the random effects.Z We define the result of the matrix multiplication again subtracted before the new contribution is computed and added. ance (Eq. where Thus, in order to update Chapter 5: SEMA extended complex due to the inversionupdated. of two However, when matrices the each number timesion of the is random computationally model effects not parameters is too are expensive. small,puted the We first online. matrix explain The inver- how computation of latent variables. Similar to the computation of which is similar to Eq. viduals (Eq. Using the online formulation of The computation of is used for this CDSS: where points. Therefore, we firstcontribution rewrite of the an individual contribution to of the each single data point, as a ) ) ˆ β and j (5.13) (5.12) (5.14) ij X ′ j x Z to account − are the new of this indi- 1 j t j is the current ′ ij y , which is the ˆ b ′ j t after which the z are challenging ) t j t ( Z j t j ( j ˆ b 1 a single individual 1 j is merely added to is computationally t and − j 1 Z j ′ ˜ ij t ′ j ij ˆ b C z . Next, x 1 X ij and t of this individual can be = x to indicate that the CDSS 1 j 1 , =1 t t ˜ t ) J j ˆ b Chapter 5: SEMA extended t ( : t consists of a summation over ! j , 1 ˆ t 5.5 ′ β ij = is given by z 1 + j ij t 1 , as t 1) : x j − ˆ ˆ b t β ( j + 5.4 t j j Z 1 ′ j t Z ′ . The contribution to j X 1 − ˜ t X = , none of the data points themselves ( 1) j − := t 1 t j 1( 5.14 ˜ t Z ′ j need to be recomputed when new data enter. The := 1 X t ) t 1( ˜ t , which depends on the model parameters. Because these j ˆ b . So, defining the CDSS for uses j . Only for the CDSS, we use subscript 1 1 matrix can be update online: 1 t t ˜ t j represents the previous contribution of individual is exact. Using Eq. is the result of the matrix multiplication which is only updated for Z ′ j 1) j − t X ( Z t 5.14 ′ j j 1 X t , Eq. As highlighted above, two aspects of Eq. The solution we chose is as follows: when a new data point enters, the contribu- Next, the coefficients of the random effects (Eq. ) need to be stored since only the results of the matrix multiplication is stored. individuals. If the (weighted) contribution of a new data point would simply be ij for the fact that this individual has already contributed to model parameters are updated whensame a result new using data either point the onlineply enters, or that obtaining offline the all computation exact contributions of to this CDSS, would im- comes increasingly precise as the number of observations per individual grows. tion of the individual belonging to this data point is subtracted from latter, however, is not feasible, especiallyTherefore, when we the resort number to of an individuals approximate is solution. large. Note that this approximation be- individual belonging to the most recent data point, to values of fixed effects and5.12 random effects covariates of this individual. Unlike Eq. in the context of a data stream. First, the CDSS for the current result of the matrix multiplication. is obtained by subtracting the previous contribution of individual offline CDSS, we refer toof the the online coefficients of implemented the complete-data fixed sufficient effects as statistic added. Because the online implementation of the CDSS is not exactly the same as the When new data present themselves, the outer product of 92 where the new contribution is added. The computation of is referred to as where contribution to where can similarly be approximated online. The computation of vidual is recomputed, such that the new contribution to z the individual belonging to the most recent data point, and the equation of J added, then this would result in counting the same individual repeatedly. Second,

Chapter 5 de,te hswudrsl ncutn h aeidvda eetdy Second, repeatedly. individual same the of counting equation the in result would this then added, J h niiulblnigt h otrcn aapit and point, data recent most the to belonging individual the z iuli eoptd uhta h e otiuinto contribution new the that such recomputed, is vidual a iial eapoiae nie h optto of computation The online. approximated be similarly can where otiuinto contribution where hr the where of computation The added. is contribution new as to referred is hnnwdt rsn hmevs h ue rdc of product outer the themselves, present data new When 92 fieCS,w ee oteoln mlmne opeedt ufiin statistic as effects sufficient fixed complete-data the implemented of coefficients online the the the of as to same refer the exactly we not CDSS, is CDSS offline the of implementation online the Because added. h urn euto h arxmultiplication. matrix the of result current the individual of contribution previous the subtracting by obtained is ntecneto aasra.Frt h DSfor CDSS the First, stream. data a of context the in auso xdefcsadrno fet oaitso hsidvda.Ulk Eq. Unlike individual. this of covariates effects random 5.12 and effects fixed of values niiulblnigt h otrcn aapit to point, data recent most the to belonging individual ino h niiulblnigt hsdt on ssbrce from subtracted is point data this to belonging individual the of tion grows. individual be- per approximation observations of this number that the Note as large. precise solution. is increasingly approximate comes individuals an of to number resort the we when Therefore, especially feasible, not is however, latter, aersl sn ihrteoln rofln optto fti DS ol im- would CDSS, this to of contributions exact computation all the offline obtaining that or enters, ply online the point either data using new result a same when updated are parameters model o h atta hsidvda a led otiue to contributed already has individual this that fact the for ij niiul.I h wihe)cnrbto fanwdt on ol ipybe simply would point data new a of contribution (weighted) the If individuals. edt esoe ic nyterslso h arxmlilcto sstored. is multiplication matrix the of results the only since stored be to need ) et h ofcet fterno fet (Eq. effects random the of coefficients the Next, h ouinw hs sa olw:we e aapitetr,tecontribu- the enters, point data new a when follows: as is chose we solution The shglgtdaoe w set fEq. of aspects two above, highlighted As Eq. , t X 1 j j ′ 5.14 t Z ( X t − j 1) j ′ Z stersl ftemti utpiainwihi nyudtdfor updated only is which multiplication matrix the of result the is seat sn Eq. Using exact. is ersnstepeiu otiuino individual of contribution previous the represents j t ˜ t t 1 arxcnb paeonline: update be can matrix 1 1 nyfrteCS,w s subscript use we CDSS, the for Only . j uses o enn h DSfor CDSS the defining So, . b ˆ j hc eed ntemdlprmtr.Bcuethese Because parameters. model the on depends which , t ˜ 1( t ) t X 1 := edt ercmue hnnwdt ne.The enter. data new when recomputed be to need j ′ Z t ˜ 5.14 1( j t 1 t := − j 1) oeo h aapit hmevs( themselves points data the of none , = X t ˜ − 1 X j h otiuinto contribution The . ′ Z t j ′ 1 Z j j t 5.4 + j ( β t b ˆ ˆ − j x : 1) t as , 1 ij t j + 1 z sgvnby given is = ij β ′ 5.5 t ˆ 1 , j ! ossso umto over summation a of consists t : ( t hpe :SM extended SEMA 5: Chapter b ˆ j J ) t ˜ t t =1 , 1 j 1 oidct htteCDSS the that indicate to x = fti niiulcnbe can individual this of t and ij X 1 x Next, . z C b ˆ ij j ′ t ij ˜ ′ j Z 1 j − and t scomputationally is smrl de to added merely is j 1 igeindividual single a 1 b ˆ j ( j t j Z ( t j t ) r challenging are z fe hc the which after t j ′ b ˆ hc sthe is which , y ij ′ stecurrent the is j t j 1 fti indi- this of r h new the are − oaccount to Z x j ′ X ij (5.14) (5.12) (5.13) j and β ˆ ) ) ons hrfr,w rtrwietecnrbto fec igedt on,a a as point, data single each the of to contribution individual an the of rewrite contribution first we Therefore, points. where sue o hsCDSS: this for used is h optto of computation The sn h niefruainof formulation online the Using hc ssmlrt Eq. to similar is which iul (Eq. viduals aetvrals iia otecmuainof computation the to Similar variables. latent ue nie h optto of computation The online. puted ini opttoal o o xesv.W rtepanhow inver- explain matrix first We the small, expensive. are too is parameters not effects model computationally random is the of sion time number each the matrices when However, two of updated. inversion the to due complex extended SEMA 5: Chapter hs nodrt update to order in Thus, where ne(Eq. ance added. and computed is contribution new the before subtracted again siaino h admefcs edfietersl ftemti multiplication matrix the of result the define We Z effects. random the of estimation niecmuainof computation online seult h rnps ftematrix the of transpose the to equal is by: b ˆ j j ′ stu ie by: given thus is Z et epeetteoln optto fteCS o h aineo the of variance the for CDSS the of computation online the present we Next, aty eilsrt h niecmuaino h DSfrtersda vari- residual the for CDSS the of computation online the illustrate we Lastly, j as z j Z y j j 2 5.8 hnnwdt enter data new When . san is 5.7 : t : 3 T r t = 2 × 3 j = 1 ! t = 3 etr oeta h arxmultiplication matrix the that Note vector. ! C j J sulk h rvostoCS,asmainover summation a CDSS, two previous the unlike is y 5.14 =1 T j J j 2 j ˜ =1 2( T +2 + sgvnby: given is ˜ ! t 2 The . b ˆ b ) ˆ β ˆ i n j nie h rvoscnrbto fti niiulis individual this of contribution previous the online, j := X T =1 β b ˆ j ˆ = 2 j ′ z ′ j ′ j X ( +ˆ T C Z C Z j y ˜ C = y Z 2( ij j j σ j j 2 j C 2 j Z j − β h etse eddfrcomputing for needed step next the , t j ˆ 2 2 b ˆ + = t ˜ − := − := j 1 j 2 3 b ˆ j ! arxnest esoe e niiul The individual. per stored be to needs matrix ( 1) : b ssamti rdc ftedt sdfrthe for used data the of product matrix a uses ˆ j ′ b ˆ a eudtda follows: as updated be can Z z X x j ′ Z z j +ˆ j J j − j ij ′ 2 b ˆ =1 +ˆ y j j ′ j 2 j ′ y j σ Z β +ˆ T ˆ Z j + σ C 2 − 2 j − σ + tr j 2 2 j j z nEq. in t − ( 2 " j ( − C ij X z Φ z t 1 =1 J ˆ t − ˜ .Teeoe iia paeregime update similar a Therefore, ). ij ij z ′ j − 1 2 − 1) ′ ij , ′ y Z b ˆ β C 1 1 ˆ T ij , j , + . ˜ ′ j j − ) x 2 5.14 , ) 2 1 ′ j T sas umto vrindi- over summation a also is β . ˆ y + 2 ) j j h niecmuainof computation online The . t σ − ( t 2 ) 2 tr , b ˆ ! j ′ z j j J y =1 C Z j j j ! ′ X (Eq. i n j =1 j se Eq. (see, 5.6 C b ˆ j j − scom- is ) sgiven is 1 z n ij (5.19) (5.16) (5.15) (5.18) (5.20) (5.21) (5.17) data z 5.5 ij ′ 93 ). )

Chapter 5 , is J ) 95 2 X 5.17 (Eq. j R Core Team y ; j z ) ), ) can be found at 2015 , 5.11 5.14 2016 , is the length of vector is vector with identifiers , and (Eq. j J 3 ◃J j ◃ ,t Z j 5.10 ′ j . 2 , T X ,) , j ), 1 5.26 t 5.21 ) , R Core Team j 5.15 y j (Eq. 5.22 z 2 , , and ˆ σ 3 (Eq. j ) ˜ t , y 2 , j j ˆ 2 Φ 5.20 , and Z x , ˆ 5.18 σ , , ˆ , β ), 2 j 2 ) ’s do not have to be available while updating the ˜ 5.19 t T . The first line indicates which elements the algo- j X 5.13 j , and , 1 θ , Bates, Mächler, Bolker, & Walker 5.25 j ˆ Φ 5.24 , C ). This algorithm finds the best fitting parameter values 1 5.12 , (Eq. 5.16 ˜ t j (Eq. , j and j 3 Z ˆ j β ′ 2009 j then y do (Eq. ,t , j (Eq. j y, X 3 2 x j θ , ,θ } , ˜ t 5.25 x 2 are all elements which should be available at the global level, t b T 2 , j , ij ˆ ), σ , , 2 2 θ ij y , ,j j j Z ˜ y T , +1 2 1 inv , ˆ J Φ , C t Powell j (Eq. ij 1 +1 , 5.23 J + b X ˜ z t ˆ j , , β 2 j +1 , 2 SEMA: Notation and equations can be found in the second and third 2 j inv n y ←{ ← J ij n contains all the elements which should be stored for each individual. is unknown (Eq. ,y X x J J create new record for t j j ← ← , 2 : j j ← θ for the individual that belongs to the most recent data point is used in n in data stream y j 2 j n, J, t compute compute update compute model parameters return if end if n x n y X j = θ = E-step M-step update global parameters update individual parameters j ), which is coined “bobyqa” (acronym for Bound Optimization BY Quadratic ◃ ◃ ◃ end for input θ θ for ◃ We present a schematic overview of the SEMA algorithm, assuming that 4: 5: 6: 7: 8: 9: 1: 2: 3: 16: 17: 18: 19: 20: 15: 10: 11: 12: 13: 14: Only 5.4 Simulation study 5.4.1 Design In this section, SEMA ismultilevel compared models with using an often simulations. usedoptimizer of offline As the procedure Lmer a function for ( comparison fitting point,2016 we use the default Approximation, Chapter 5: SEMA extended global parameters or the elementstered of data the point. individual The belonginghttp://github.com/L-Ippel/SEMA_extended SEMA to function the in recently en- [R]( Algorithm 1 section of this chapter. the update step, the remaining already inverted, in Algorithm rithm uses, where whereas ; 1993 (5.24) (5.26) (5.22) (5.23) (5.25) , and is de- 2 j Z : 5.14 Escobar & Moser . , is the multiplication of the ) Chapter 5: SEMA extended can be updated similarly to t ij ( 2 inv is invertible. Directly updat- 3 t x j 5.9 ˜ t 2 3 , X , which we detail in this section. t : 2 inv 1 ′ ij , ˆ X ˜ β t ) , is similar to that of + x ′ ij 1 X ij matrix. This matrix can be updated 5.21 ˜ − 2 t j ij x . However, in order to subsequently 1) y ′ ij j j 2 x ij − ij x − y X t ′ X j ( x X x ′ y t j 2 , is the same as the offline implementa- inv j X x + 3 3 n + 1+ ˜ t X ( t X ): 2 y =1 − 2 J j =1 inv = − x X J j 1) 2 X 1950 ! 2 ˆ ! − := σ inv , := 1 t = 2 y − 3( X " ˜ t ˆ x β j X is defined as := is the X := is needed. Computing the inverse of a matrix can be a ′ j is more efficient than the offline estimation procedure, X ) only has to be inverted once, after which the inverted ′ 2 t 2 inv X 5.9 2 3( 2 X X inv ˜ t X , X =1 X J j ! 5.15 vector. Inserting the online computed components of Equation matrix every time new data present themselves in order to obtain 1 =( ˆ × β X ′ Sherman & Morrison p . Note that the computation of ; X 2 ij y , the inverse of is a ˆ is computed as the sum of the squared observations of the dependent vari- β 1950 j =1 y n i , 2 j , and the residual variance, x y 2 J ˜ ! T The first element of Eq. into the equation results in the computation of = ˆ scribed in detail below in the M-step. Using Eq. compute ing the inverted matrix online using the same update regime as already discussed in Eq. costly procedure if the number of covariatesthe is inverted large. A matrix solution using is the toPlackett Sherman–Morrison directly update formula ( able: matrix is directly updated withto the wait new until data. enough data In have practice, entered, this such means that that one has 5.3.2 The online M-step The online implementation of the M-step ofΦ both the variance of the random effects, 94 where Using this formulation, to invert the because the offline estimation procedure stores all observations in memory and has tion of the M-step we discussedcomputation above. of This, however, does not hold for the online the other CDSS: The online implementation of thedata E-step points and makes only it store possible summariesrithm to of 1 the ignore below). data the Next, points the historical (see online for implementation exact of details the M-step Algo- is presented. up-to-date model parameters. The second partcovariates of with Eq. the dependent variable. This can be updated online as follows: where, similar to Eq. 5.9 where

Chapter 5 5.9 where hr,smlrt Eq. to similar where, oaitswt h eedn aibe hscnb pae niea follows: as online updated be can This variable. dependent the Eq. with of covariates part second The parameters. model up-to-date ih eo) et h nieipeetto fteMse spresented. is Algo- M-step the details of exact implementation for online (see historical the points Next, the data below). ignore the 1 of to rithm summaries possible store it only makes and points E-step data the of implementation online The h te CDSS: other the ino h -tpw icse bv.Ti,hwvr osnthl o h online the for hold not does however, This, of above. computation discussed we M-step the of tion eas h fieetmto rcdr trsalosrain nmmr n has and memory in the observations invert all to stores procedure estimation offline the because sn hsformulation, this Using h nieipeetto fteMse fbt h aineo h admeffects, random the of variance the both Φ of M-step the of implementation online The M-step online The 5.3.2 where 94 arxi ietyudtdwt h e aa npatc,ti en htoehas one that that means such this entered, practice, have In data enough data. until new wait the to with updated directly is matrix able: h netdmti sn h hra–orsnfrua( formula update directly Sherman–Morrison Plackett to the is using solution matrix A large. inverted is the covariates of number the if procedure costly nieuigtesm paergm saraydsusdi Eq. in discussed already as regime update same the using online n h netdmatrix inverted the ing compute cie ndti eo nteMse.UigEq. Using M-step. the in below detail in scribed ˆ = noteeuto eut ntecmuainof computation the in results equation the into h rteeeto Eq. of element first The T ! ˜ J 2 y x n h eiulvariance, residual the and , j 2 , i n y =1 j 1950 β scmue stesmo h qae bevtoso h eedn vari- dependent the of observations squared the of sum the as computed is ˆ sa is h nes of inverse the , y ij 2 X ; oeta h optto of computation the that Note . p hra Morrison & Sherman ′ X β × ˆ =( 1 arxeeytm e aapeettesle nodrt obtain to order in themselves present data new time every matrix etr netn h niecmue opnnso Equation of components computed online the Inserting vector. 5.15 ! j J X =1 X , X t ˜ inv X X 2 3( 2 5.9 X inv 2 t 2 ′ nyhst eivre ne fe hc h inverted the which after once, inverted be to has only ) X smr fcetta h fieetmto procedure, estimation offline the than efficient more is j ′ snee.Cmuigteivreo arxcnb a be can matrix a of inverse the Computing needed. is := X sthe is := sdfie as defined is X j β x ˆ t ˜ " X 3( − y 2 = t 1 := , inv σ := − ! ˆ 2 ! 1950 X 2 1) j J X x − = inv =1 j J 2 − =1 y 2 ): X t ( X t ˜ 1+ + n 3 3 + x X j inv stesm steofln implementa- offline the as same the is , 2 j t y ′ x X x ( j X ′ t X y − x ij − ij x 2 j j ij ′ y 1) oee,i re osubsequently to order in However, . x ij j t 2 − ˜ 5.21 arx hsmti a eupdated be can matrix This matrix. ij X 1 ij ′ x + ssmlrt htof that to similar is , ) t β ˜ X ˆ , ij ′ 1 inv 2 : t hc edti nti section. this in detail we which , X , 3 2 t ˜ 5.9 j x t 3 sivril.Drcl updat- Directly invertible. is inv 2 ( ij t a eudtdsmlryto similarly updated be can hpe :SM extended SEMA 5: Chapter ) stemlilcto fthe of multiplication the is , . soa Moser & Escobar 5.14 : Z j 2 n sde- is and , (5.25) (5.23) (5.22) (5.26) (5.24) 1993 ; whereas h paese,teremaining the step, update the ih ss where uses, rithm Algorithm in inverted, already ee aapit h EAfnto n[R]( en- recently in the function to SEMA http://github.com/L-Ippel/SEMA_extended belonging The individual point. the data of tered elements the or parameters global Approximation, default the use we 2016 point, fitting comparison ( for function a Lmer procedure the As offline of optimizer used simulations. often an using with models compared multilevel is SEMA section, this In Design 5.4.1 study Simulation 5.4 chapter. this of section 1 Algorithm extended SEMA 5: Chapter Only 11: 10: 15: 14: 13: 12: 20: 19: 18: 17: 16: 3: 2: 1: 9: 8: 7: 6: 5: 4: epeetashmtcoeve fteSM loih,asmn that assuming algorithm, SEMA the of overview schematic a present We ◃ n for end for θ θ input ◃ ◃ ◃ ,wihi ond“oya arnmfrBudOtmzto YQuadratic BY Optimization Bound for (acronym “bobyqa” coined is which ), j paeidvda parameters individual update paegoa parameters global update M-step E-step = θ = j return parameters model compute update compute compute X y n x n if end if t ,J, n, j 2 j y ndt stream data in n o h niiulta eog otems eetdt on sue in used is point data recent most the to belongs that individual the for θ ← j j : 2 , ← ← j j t raenwrcr for record new create J J x X ,y (Eq. sunknown is otisalteeeet hc hudb trdfrec individual. each for stored be should which elements the all contains n ij J ← ←{ y n inv j 2 EA oainadeutoscnb on ntescn n third and second the in found be can equations and Notation SEMA: 2 , +1 j 2 β , , j ˆ t z ˜ X b + J 5.23 , +1 1 ij (Eq. j Powell t C , Φ J ˆ , inv 1 2 +1 , T y ˜ Z j j ,j , y ij θ 2 2 , , σ ), ˆ ij , j , 2 T b t r l lmnswihsol eaalbea h lbllevel, global the at available be should which elements all are 2 x 5.25 t ˜ , } ,θ ,θ j x 2 3 X y, j (Eq. j , ,t (Eq. do y then j 2009 ′ β j ˆ Z 3 j and j , (Eq. j t ˜ 5.16 (Eq. , 5.12 1 .Ti loih nstebs tigprmtrvalues parameter fitting best the finds algorithm This ). C , 5.24 Φ ˆ j 5.25 ae,Mclr okr Walker & Bolker, Mächler, Bates, , θ 1 , and , j 5.13 X j h rtln niae hc lmnstealgo- the elements which indicates line first The . T t 5.19 ˜ sd o aet eaalbewieudtn the updating while available be to have not do ’s ) 2 j 2 ), β , ˆ , , σ 5.18 ˆ , x Z and , 5.20 Φ 2 ˆ j j , 2 y , t ˜ ) j (Eq. 3 σ ˆ and , , 2 z 5.22 (Eq. j y 5.15 j oeTeam Core R , ) 5.21 t 5.26 1 ), j , ,) X T , 2 . j ′ 5.10 j Z ,t ◃ j ◃J 3 J j (Eq. and , svco ihidentifiers with vector is stelnt fvector of length the is , 2016 5.14 5.11 , 2015 a efudat found be can ) ), ) z j ; y oeTeam Core R j (Eq. 5.17 X 2 95 ) J is ,

Chapter 5 , , 97 and =2 =7 , and , r 5 . ,r 5 , = 10 5 . j 4 = 10 -axes the pa- n − y j , n 5 . 3 . − , 5 5 000 . . , 2 5 − − , =1 5 . J 5 and − , or , 5 . 5 . 4 4 000 − , , , 5 . 5 . =5 . The gray lines represent the parameter =3 =3 J β =8 β 1 8: 3: . The three rows are from top to bottom: = = conditions shows that SEMA rapidly approaches , lvl data points instead of after each data point. Each times. = 50 =7 j 000 = 10 , r n j = 100 n =1 m n ; and =8 and the 1 data points, resulting in , lvl and 2.5; and 000 , = 50 , 5 j . =2 presents the average estimated variance terms of both the residual vari- n r -axes the length of the data stream is presented, and on the = 25 ; =1 = 50 x 5 5.1 2 . Number of level 1 predictors 6 σ n Number of level 1 predictors β =3 The bottom two figures deviate from the figures above, because these conditions • • • • • 1 Additionally, the residual variance, the numberof of the level data stream 2 were variables, fixed and across the the conditions length to lvl Chapter 5: SEMA extended 5.4.2 Results Figure ance and the variance ofOn the the random intercept and sloperameter over estimates the are 100 presented. replications. The three figures on the left all have The data stream was generated byresulting randomly in sample unequal individuals with number replacement, oftional observations complexity per of individual. the offline Duedata fitting to stream procedure, the only the computa- every Lmer functioncondition is was fitted replicated to the the figures on the right have Lmer’s parameter estimates, especially when the number ofdividual observations is for large. each in- Furthermore, SEMA provides estimatesto even converge. when There Lmer is is hardly unable any difference8 between level including 3 1 level predictors: 1 predictors the or given top the two number figures of and observations the per individual, two very figures similar. in the middle roware, even are, for the Lmer algorithm, veryis only difficult able to to fit. fit the In model the whenLmer bottom at left least is 34,000 figure, able data Lmer points to are available. fitobtain Even the convergence. when model, While Lmer Lmer is cycled able throughright to panel the fit than data the in model thousands the using lower of less leftof times data panel, times to still in to Lmer the revisited fit lower the the same modelthe data and thousands results hence of took SEMA (very) in longin the times the panel to other in compute. panels, the Comparing it lower is left clear panel that with this the lower results left of panel SEMA (condition: estimates obtained using Lmer, theThe black Lmer lines the function parameters was estimates notstream, of in always SEMA. these able cases to the gray convergetween lines the in are the omitted beginning from the of figure. the A data comparison be- ) ), 2 r ) ij ), we y r − ij y (ˆ ! 1 n = 2 ¯ e Chapter 5: SEMA extended ), the number of random effects ( j n : 4 ). 1 is an important contributor to the reliability of j n 8, 8, and 8. 3, 3, 8, ======1 1 1 1 1 1 , and lvl , and lvl , and lvl , and lvl , and lvl , and lvl : more observations per individual results in less uncertainty. j ), we have similar expectations for the number of random effects: 4 and 9, or 4, 9, 16, 2.25, 6.25, 12.25, and 20.25. b 1 =2 =7 =7 =2 =2 =2 = = 2 2 ,r ,r ,r ,r ,r ,r φ φ : : = 50 = 10 = 50 = 10 = 50 = 10 =7 =2 j j j j j j r n n n r n n n The number of observations • • • • • • • • The number of random effects includes the random intercept. 4 The data stream is generated with variance terms of the random effects equal to: the estimates of are compared in terms of average squared prediction error ( 96 by iteratively approximating the likelihood function using quadratici.e., approximation, this algorithm does notcomparison use specifically since first it or is secondestimation an of order often-used linear derivatives. mixed and models. robust We SEMA implementation choose and of this this state-of-the-art the offline procedure and will result in more error.let The the three factors number are of all covariates crossed, withhence however, a we we random have will effect the not following exceed six the conditions number of covariates; more fixed effects will lead to a slower retrieval of the data-generating parameters, and parameter estimation precision. Wethe explicitly number study of the observations effect per of individual three factors: ( The fixed effects are generated with the following parameter values: and the number of level 1 covariates (lvl expect that more random effects will resultvalues, in i.e., a slower SEMA rate has of to findingof the take random parameter more effects steps is than small. inexpect the that Also, condition SEMA where when will the produce the more number the number error. first of level Lastly, for random (lvl the effects number is of high, covariates we on Therefore, we expect SEMA will learna the lower true parameter number values of in observations conditionsthan more with in slowly (i.e., conditions more where data individualsexpect points that are have when observed individuals to more are enter) observed often.tion more error often, For the will average the be squared error, lower predic- number we of than observations. in For the the second case factor, the where number the of random individuals effects only ( have a small

Chapter 5 xetta EAwl rdc oeerr aty o h ubro oaitson we covariates high, of is number effects the (lvl random for Lastly, level of first error. number the number more the produce the will when where SEMA condition Also, that the expect in small. than is steps effects more parameter random take the of finding to of has rate small SEMA slower a a i.e., in have values, result ( only will effects effects individuals random random of the more number that where the expect factor, case second the the For in observations. than of we number predic- lower error, squared be the average will the For often, error more tion often. observed enter) are more to individuals observed when have are that points expect individuals data where more conditions (i.e., slowly in with more than conditions observations in of values number parameter true lower the a learn will SEMA expect we Therefore, (lvl covariates 1 level of number the and h xdefcsaegnrtdwt h olwn aaee values: parameter following the with generated are effects fixed The h ubro bevtospridvda ( factors: three individual of per effect observations the of study number explicitly the We precision. estimation parameter and e h ubro oaitswt admefc xedtenme fcovariates; of number conditions the six exceed following not the effect will have random we we a however, hence with crossed, covariates all of are number factors parameters, three the data-generating The let the error. more of in retrieval result slower will a and to lead will effects fixed more r oprdi em faeaesurdpeito ro ( error prediction squared average of procedure terms offline in the state-of-the-art this compared this of and choose are implementation SEMA We robust models. and mixed derivatives. linear often-used order of an estimation second is or it first since specifically use comparison not does algorithm this approximation, i.e., quadratic using function likelihood the approximating iteratively by 96 h siae of estimates the h aasra sgnrtdwt ainetrso h admefcseulto: equal effects random the of terms variance with generated is stream data The 4 h ubro admefcsicue h admintercept. random the includes effects random of number The • • • • • • • • h ubro observations of number The n r n n n n n r j j j j j j =2 =7 50 = 10 = 50 = 10 = 50 = 10 = : : φ φ ,r ,r ,r ,r ,r ,r 2 2 = = =7 =7 =2 =2 =2 =2 1 b ,9 6 .5 .5 22,ad20.25. and 12.25, 6.25, 2.25, 16, 9, 4, or 9, and 4 ,w aesmlrepcain o h ubro admeffects: random of number the for expectations similar have we ), j oeosrain e niiulrslsi esuncertainty. less in results individual per observations more : n lvl and , lvl and , n lvl and , lvl and , lvl and , lvl and , 1 1 1 1 1 1 ======8. and 8, 8, 8, 3, 3, n j sa motn otiuo oterlaiiyof reliability the to contributor important an is 1 ). 4 : n j ,tenme frno fet ( effects random of number the ), hpe :SM extended SEMA 5: Chapter e ¯ 2 = n 1 ! (ˆ y ij − r y ,we ), ij ) r 2 ), ) inlcmlxt fteofln tigpoeue h mrfnto sfitdt the to replicated fitted was is condition function Lmer every computa- the only the procedure, stream to fitting data Due offline the individual. of per complexity observations tional of replacement, number with individuals unequal sample in randomly resulting by generated was stream data The tem nteecsstega ie r mte rmtefiue oprsnbe- comparison data A the figure. of the from beginning omitted the are in the lines tween converge gray the to cases able these SEMA. always in of stream, not estimates was parameters function the lines Lmer black The the Lmer, using obtained estimates nteohrpnl,i scerta hslwrlf ae (condition: SEMA panel of left results lower the this with that panel clear left is lower it Comparing the panels, compute. in other to panel the times the in long in (very) SEMA took of hence results thousands and data the model same the the lower fit revisited the Lmer to in still to times panel, data times of left less of lower using the thousands model in the data than fit the panel to right through able cycled is Lmer Lmer While model, when convergence. the Even obtain available. fit are to points Lmer data able figure, 34,000 is least left at bottom Lmer when the model In the fit fit. to to able difficult only is very algorithm, Lmer the for are, even are, row middle the in similar. figures very two individual, per the observations and of figures number two the top given or the predictors 1 predictors: level 1 3 including level between 8 difference any unable hardly is is Lmer There when converge. even to estimates provides SEMA Furthermore, in- each large. for is observations dividual of number the when especially estimates, parameter Lmer’s h grso h ih have right the on figures the aee siae r rsne.Tetrefiue ntelf l have all left the on figures three The replications. presented. 100 are the estimates over rameter slope and intercept random the the On of variance the and ance Figure Results 5.4.2 extended SEMA 5: Chapter lvl ftedt temwr xdars h odtosto length conditions the the across and fixed variables, were 2 stream data level the of of number the variance, residual the Additionally, 1 • • • • • h otmtofiue eit rmtefiue bv,bcueteeconditions these because above, figures the from deviate figures two bottom The =3 n β ubro ee predictors 1 level of Number 6 predictors 1 level of Number σ . 2 5.1 5 x 50 = =1 ; 25 = ae h egho h aasra speetd n nthe on and presented, is stream data the of length the -axes r n rsnsteaeaeetmtdvrac em fbt h eiulvari- residual the both of terms variance estimated average the presents =2 . j 5 , 50 = , 000 n .;and 2.5; and lvl , aapit,rsligin resulting points, data 1 n the and =8 and ; n m =1 n 100 = j n r , 10 = 000 j =7 50 = times. aapit nta fatrec aapit Each point. data each after of instead points data lvl , odtossosta EArpdyapproaches rapidly SEMA that shows conditions = = h he osaefo o obottom: to top from are rows three The . 3: 8: 1 β =8 β J =3 =3 h rylnsrpeetteparameter the represent lines gray The . =5 . 5 . 5 , , , − 000 4 4 . 5 . 5 , or , − and 5 J . 5 =1 , − − 5 2 , . . 000 5 5 , − . 3 . 5 n , j y − n ae h pa- the -axes 10 = 4 j . 5 10 = , 5 ,r . 5 r , and , =7 =2 and 97 , ,

Chapter 5 99 ) was . Note ¯ ). The y , when 5.2 =8 = 25 ). The study 1 2 , although the σ 2 σ 2014 , lvl ( . In all conditions ¯ y ). Across conditions, =2 5 . 5 ,r − = 10 and j 5 . n , have a large SD for the estimates ˆ φ =1 β was predicted using ij y remains difficult for SEMA even when 50,000 . An additional inspection of the replications Kooreman and Scherpenzeel 9 ). Clearly, these variance terms are more difficult 9 that the fixed effects are estimated well by both =8 1 5.1 , and , and . Second, while the parameter estimates of SEMA are , lvl ij y =4 =4 ) is not always positioned approximately in the middle of 2 =7 2 ˆ 2 φ φ φ ,r presents the estimates of the residual variance ( 5.3 = 10 presents the average squared prediction error for each of our two j n presents the estimates of the variance of the random intercept and one , both for 5.2 5.2 000 , . Overall, SEMA and Lmer produce very similar estimates of 2 Table Figure Lastly, Table ˆ σ =1 we can conclude fromLmer Table and SEMA, with the mere(however, difference SEMA that is the magnitudes SD’s faster). of SEMA are slightly larger of the random slopes ( methods. For clarity reasons, only three conditionsthat are for presented in both Figure fitting procedures,vidual the is error entering is for the implemented first such time, that this when an indi- data points have entered. condition of lets us conclude thathave there entered, were the a effect of handful theand of outliers the fades extreme empirical quickly: outliers. interval the becomes SD smaller. Whenresults becomes Unfortunately, of much more we smaller SEMA data cannot with compare Lmer the Lmer in was the unable runs to in converge. thesedifficult. We contend extreme that cases these since conditions in are these generally very cases same conditions which showed aof large SD for 5.5 SEMA in action: predictingIn weight this fluctuations. section, thenating SEMA from algorithm an is done applied by to an empirical data stream origi- Chapter 5: SEMA extended empirical interval is quite widevalue in with some conditions, which are the already data approximating were the generated ( to estimate for SEMA thanto the generate fixed the effects. data While asobtain Lmer soon good retrieves as estimates of the it the values is variance terms. used ableof Notice to that the fit the variance average the terms estimated model, value ( SEMA uses more data to the empirical interval, see for instance condition (also the three not presented),two Lmer reasons. had First, a Lmer larger wasstream. error often When unable than Lmer to SEMA. did converge not This in converge, is the theused average due beginning of to of to the predict dependent the variable the data ( next updated every data point, Lmer was onlythe called parameter once estimates every of 1,000 Lmer data were points.data used Then, points. for 1,000 It data was points computationally toeach infeasible predict time to the re-estimate a next the new model data usingtype point Lmer prediction entered. would be Likely, chosen in if any online practical methods application, are unavailable. this batch n , 8 3 8 ● ● ● 50 50 50 = = = 000 1 1 1 , ● ● ● 1 , lvl , lvl , lvl 7 2 2 = = = r r r ≤ ● ● ● , , , 40 40 40 n 50 50 50 = = = j j j ● ● ● n n n ● ● ● 30 30 30 ● ● ● ● ● ● 20 20 20 data stream, x1000 ● ● ● Chapter 5: SEMA extended Lmer: slope var Sema: slope var ● ● ● 10 10 10 . ), SEMA performs much better ● ● ● = 10 ● ● ● 0 0 0 j = 50

n 0 40 30 20 10 0 40 30 20 10 40 30 20 10 0 j n Lmer: var intercept Sema: intercept var 8 3 8 ● ● ● 50 50 50 = = = 1 1 1 ● ● ● , lvl , lvl , lvl 2 7 2 = = = r r r ● ● ● , , , 40 40 40 10 10 10 = = = j j j ● ● ● n n n ● ● ● 30 30 30 Lmer: var residual Sema: residual var ance terms are included in the graph. ● ● ● ● ● ● ● 20 20 20 data stream, x1000 ● ● ● ● ● ● 10 10 10 Figure 5.1: Estimatedrandom residual slope. variance Note and that for random two intercept graphs and on the bottom, not all vari- ● ● ● ) is, as expected, a difficult condition. Especially in the extremely challenging ● ● ● 0 0 0

=8 40 30 20 10 0 30 20 10 0 10 0 40 40 30

20 Next, we present three tables with the parameter estimates averaged over the presents the estimates of two of the fixed effects estimated by SEMA and Lmer at

1 variance variance variance fixed effects, we choose to present only these two. First note that when replications, the standard deviation over the replicationson and the the 95% empirical interval distribution based of5.1 results of the simulation studythree (percentiles). points in Table the data stream. Since the (qualitative) behavior is similar across all lvl 98 (lower right panel) than in the condition with condition where only 10 observations per individual arenumber available of to fixed estimate and a random large effects, ithave is clear not that yet the converged, parameter even estimates at ofmore the SEMA end information of available the per data individual stream. ( However, when there is Lmer was never able to converge, while the estimates of SEMA, even though the

Chapter 5 mrwsnvral ocneg,wieteetmtso EA vntog the though even SEMA, of estimates the while converge, to able never was Lmer aentytcnegd vna h n ftedt tem oee,we hr is there when However, ( stream. individual data per the available of information end SEMA the more of at estimates even parameter converged, the yet that not clear is have it effects, large random a and estimate fixed to of available are number individual per observations 10 only where condition lwrrgtpnl hni h odto with condition the in than panel) right (lower lvl 98 xdefcs ecos opeetol hs w.Frtnt htwhen that note First two. all these across only similar present is to behavior (qualitative) choose the we Since effects, fixed stream. data the Table in points (percentiles). three study simulation the of results 5.1 of based distribution interval empirical 95% the the and on replications the over deviation standard the replications, variance variance variance 1 rsnsteetmtso w ftefie fet siae ySM n mrat Lmer and SEMA by estimated effects fixed the of two of estimates the presents et epeettretbe ihteprmtretmtsaeae vrthe over averaged estimates parameter the with tables three present we Next, 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 =8 0 0 0 ● ● ● s sepce,adfcl odto.Epcal nteeteeychallenging extremely the in Especially condition. difficult a expected, as is, ) ● ● ● admsoe oeta o w rpso h otm o l vari- all not bottom, the on and graphs intercept two random for that and Note variance slope. residual random Estimated 5.1: Figure 10 10 10 ● ● ● ● ● ● data stream,x1000 20 20 20 ● ● ● ● ● ● ● netrsaeicue ntegraph. the in included are terms ance Sema: residualvar residualvar Lmer: 30 30 30 ● ● ● n n n ● ● ● j j j = = = 10 10 10 40 40 40 , , , ● ● ● r r r = = = 7 2 2 , lvl , lvl , lvl ● ● ● 1 1 1 = = = 50 50 50 ● ● ● 8 8 3 Sema: interceptvar interceptvar Lmer: n

j 0 10 20 30 40 0 10 20 30 40 0 10 20 30 40 n 50 = j 0 0 0 ● ● ● 10 = ● ● ● ,SM efrsmc better much performs SEMA ), . 10 10 10 ● ● ● Sema: slopevar slopevar Lmer: hpe :SM extended SEMA 5: Chapter ● ● ● data stream,x1000 20 20 20 ● ● ● ● ● ● 30 30 30 ● ● ● n n n ● ● ● j j j = = = 50 50 50 n 40 40 40 , , , ● ● ● ≤ r r r = = = 7 2 2 , lvl , lvl , lvl 1 ● ● ● , 1 1 1 000 = = = 50 50 50 ● ● ● 8 8 3 , n ahtm e aapitetrd iey naypatclapiain hsbatch this unavailable. are application, methods practical online any if in chosen Likely, be would entered. prediction Lmer point type using data model new the next a re-estimate the to time predict infeasible each to computationally points was data It 1,000 for points. Then, used data points. were data Lmer 1,000 of every estimates once parameter called the only was Lmer point, data every updated sdt rdc h next ( data the variable the dependent predict the to of to of beginning due average used the the is converge, in This not converge did SEMA. to Lmer than unable When often error stream. was larger Lmer a First, had reasons. Lmer two presented), not three the (also h miia nevl e o ntnecondition instance for see interval, empirical the ognrt h aaa ona ti bet ttemdl EAue oedt to data more uses SEMA ( value model, estimated terms the average variance the fit that the to Notice of able used terms. variance is values the it the of estimates as retrieves good soon Lmer obtain as While data effects. the fixed generate the to than SEMA for estimate to nti eto,teSM loih sapidt neprcldt temorigi- stream data empirical an to by applied done experiment is an algorithm from SEMA nating the section, fluctuations. this weight In predicting action: in SEMA 5.5 ( generated the were approximating data already the are which conditions, some with in value wide quite is interval empirical extended SEMA 5: Chapter aecniin hc hwdalreS for SD large of a showed which conditions same eut fSM ihLe ntern nteeeteecsssnei hs cases very generally these are in conditions since these cases that extreme contend We difficult. these converge. in to runs unable the was in Lmer the Lmer compare with cannot data SEMA smaller we more much of Unfortunately, becomes results When smaller. SD becomes the interval outliers. quickly: empirical extreme fades the outliers of and the handful of effect a the were entered, there have that conclude us lets odto of condition htfrbt tigpoeue,teerri mlmne uhta hna indi- an when this that time, such first implemented the for is entering error is the vidual procedures, fitting Figure both in presented for are that conditions three only reasons, clarity For methods. entered. have points data mradSM,wt h eedfeec htteS’ fSM r lgtylarger slightly are SEMA of SD’s the that difference mere the with SEMA, and TableLmer from conclude can we fterno lps( slopes random the of faster). magnitudes is SEMA (however, =1 σ ˆ aty Table Lastly, Figure Table 2 vrl,SM n mrpouevr iia siae of estimates similar very produce Lmer and SEMA Overall, . , 000 5.2 5.2 ohfor both , rsnsteetmtso h aineo h admitretadone and intercept random the of variance the of estimates the presents n j rsnsteaeaesurdpeito ro o aho u two our of each for error prediction squared average the presents 10 = 5.3 rsnsteetmtso h eiulvrac ( variance residual the of estimates the presents ,r φ φ φ 2 ˆ 2 =7 2 sntawy oiindapoiaeyi h ideof middle the in approximately positioned always not is ) =4 =4 y ij lvl , eod hl h aaee siae fSM are SEMA of estimates parameter the while Second, . and , and , 5.1 1 =8 httefie fet r siae elb both by well estimated are effects fixed the that 9 .Cery hs ainetrsaemr difficult more are terms variance these Clearly, ). 9 ormnadScherpenzeel and Kooreman nadtoa npcino h replications the of inspection additional An . ean ifiutfrSM vnwe 50,000 when even SEMA for difficult remains y ij a rdce using predicted was β =1 φ ˆ aealreS o h estimates the for SD large a have , n . 5 j and 10 = − ,r 5 . 5 =2 .Ars conditions, Across ). y ¯ nalconditions all In . ( lvl , 2014 σ 2 σ lhuhthe although , 2 1 .Testudy The ). 25 = =8 5.2 when , y .The ). ¯ Note . was ) 99

Chapter 5 101 3.124 24.306 15.921 48.044 Sema SD 2.5% 97.5% 842 441 . . 23 20 2 ˆ τ , across replications of φ Lmer SD 2.5% 97.5% Lmer and SEMA 2 ˆ φ 1,000 –1,000 – –1,000 – – –1,000 – – – 13.512 –1,000 6.422 – – – – 17.499 4.646 25.029 5.998 – – – 8.853 9.320 29.737 – – 6.968 15.896 2.201 24.281 5.963 – 11.417 7.317 31.372 12.801 2.597 35.083 1,000 –1,000 – –1,000 – – –1,000 – – – – 15.7181,000 4.337 – – – – 7.964 9.5541,000 22.034 – – 6.051 – – 13.8311,000 1.891 23.114 2.989 – – – – 7.856 8.696 19.895 – – 5.356 – 25.898 2.111 18.319 – – 14.465 – 19.502 4.000 13.710 27.796 25,00050,000 8.921 8.928 0.47425,000 0.432 7.91350,000 8.101 9.762 4.000 – 9.64625,000 0.156 8.92250,000 8.928 3.714 – 8.971 0.475 – 4.28825,000 0.432 0.279 7.91750,000 3.971 8.465 4.449 – 8.100 – 3.959 9.773 0.236 9.45325,000 9.647 0.451 0.197 3.54650,000 8.999 6.855 3.566 – – 3.667 4.436 8.999 0.441 4.325 7.870 5.326 1.205 0.417 8.165 3.965 8.310 3.960 – 4.656 9.857 1.398 0.238 9.745 6.358 9.175 0.197 8.996 5.723 3.536 8.999 10.464 3.567 2.529 0.442 4.439 0.417 4.323 2.434 8.164 12.214 8.309 9.873 9.746 25,00050,000 4.000 4.001 0.21125,000 0.137 3.57550,000 9.007 3.738 4.335 9.018 0.374 4.25825,000 0.257 8.399 6.46850,000 4.024 8.560 4.396 9.807 4.032 0.706 0.220 9.58025,000 0.215 0.205 8.057 3.636 4.64950,000 8.987 8.860 3.684 3.989 4.466 8.990 7.379 1.979 0.481 4.43925,000 4.664 0.403 0.444 8.220 4.022 3.96350,000 4.018 8.294 4.032 10.148 7.875 11.041 3.999 0.219 0.201 9.930 8.98725,000 9.422 0.205 0.140 3.588 3.63150,000 8.981 8.990 3.774 3.687 0.480 4.451 8.994 4.464 0.315 4.28325,000 4.441 0.443 0.224 8.230 8.391 7.19650,000 3.982 8.586 4.507 10.146 8.296 9.568 3.975 0.828 0.251 9.388 9.930 0.167 0.223 9.257 3.448 6.521 9.013 3.562 4.276 4.454 8.209 1.775 4.356 4.794 0.306 3.982 5.356 3.976 8.486 11.121 0.250 9.528 0.223 3.441 3.560 4.452 4.355 9 4 9 4 9 4 9 4 9 4 9 4 φn 1 Table 5.2: The variability of the estimates of lvl r j 50 2 8 10 7 8 50 7 8 10 2 3 50 2 3 10 2 8 n Chapter 5: SEMA extended . 000 , =50 SEMA n SD 2.5% 97.5% ˆ β Chapter 5: SEMA extended , across replications of β Lmer SD 2.5% 97.5% Lmer and SEMA ˆ β 1,000 –1,000 – – – – – – 1.355 0.695 – -0.244 -3.679 2.502 1.077 -5.418 -1.414 1,000 –1,000 – –1,000 – – –1,000 – – – – -5.0981,000 0.328 – – -5.723 – – -4.394 1.464 0.341 – – – 0.839 -5.246 0.280 2.077 – – -5.812 -4.699 1.269 0.848 – -0.187 -3.532 2.772 1.025 -5.454 -1.431 1,000 –1,000 – –1,000 – – –1,000 – – – –1,000 1.528 0.370 – – – 0.827 – -5.194 0.321 2.098 – – -5.807 – -4.595 1.392 0.279 – – 0.859 -5.292 0.235 1.947 – -5.783 -4.869 1.495 0.604 0.682 1.877 25,00050,000 1.489 1.492 0.07125,000 0.073 1.37150,000 -5.490 1.374 -5.487 0.083 1.632 0.080 1.633 -5.653 1.489 -5.644 -5.337 1.491 0.070 -5.331 -5.475 0.073 1.374 -5.486 0.088 1.367 1.626 0.080 -5.630 1.626 -5.648 -5.309 -5.325 25,00050,000 1.500 1.502 0.04125,000 0.035 1.41550,000 -5.497 1.432 -5.496 0.038 1.57025,000 0.026 1.568 -5.566 1.51150,000 -5.544 -5.424 1.501 1.506 0.169 -5.444 1.500 -5.496 0.073 0.061 1.40525,000 -5.496 0.071 0.038 1.361 1.42650,000 -5.495 1.583 0.026 -5.564 1.374 -5.497 0.035 1.641 1.568 -5.544 -5.420 25,000 0.023 1.632 -5.554 1.500 -5.445 50,000 -5.542 -5.430 1.501 0.073 -5.459 1.502 -5.495 0.071 – 1.35825,000 -5.497 0.039 0.035 1.37750,000 1.639 0.023 -5.554 1.419 -5.498 – 1.637 -5.542 -5.430 – 0.044 1.576 -5.459 -5.574 1.494 – – -5.411 0.080 -5.401 1.344 0.127 – – 1.657 -5.558 1.469 -5.093 0.238 – 0.962 -5.175 0.337 1.972 -5.563 -4.305 25,00050,000 1.508 1.505 0.04725,000 0.037 1.41550,000 -5.500 1.438 -5.499 0.034 1.60225,000 0.024 1.567 -5.555 1.51150,000 -5.542 -5.447 1.490 1.507 0.066 -5.452 1.491 -5.499 0.069 0.039 1.41125,000 -5.499 0.065 0.035 1.331 1.44050,000 -5.497 1.634 0.024 -5.559 1.339 -5.497 0.032 1.604 1.591 -5.542 -5.445 0.021 1.591 -5.554 1.490 -5.452 -5.535 -5.437 1.491 0.067 -5.461 -5.497 0.065 1.341 -5.497 0.032 1.343 1.601 0.021 -5.554 1.587 -5.535 -5.437 -5.461 βn 1.5 1.5 1.5 1.5 1.5 1.5 -5.5 -5.5 -5.5 -5.5 -5.5 -5.5 1 Table 5.1: The variability of the estimates of lvl indicates the condition under which the data streams were generated. This means that j n r j 10 7 8 50 7 8 50 2 8 50 2 3 10 2 8 10 2 3 the average number of observations per individual is not equal to 10 (50), until Note: n 100

Chapter 5 100 n h vrg ubro bevtospridvda snteult 0(0,until (50), 10 to equal not is individual per observations of number average the Note: 023 2 10 078 7 50 8 7 10 8 2 50 8 2 10 3 2 50 j r n j niae h odto ne hc h aasraswr eeae.Ti en that means This generated. were streams data the which under condition the indicates lvl al .:Tevraiiyo h siae of estimates the of variability The 5.1: Table 1 -5.5 -5.5 -5.5 -5.5 -5.5 -5.5 1.5 1.5 1.5 1.5 1.5 1.5 βn 0001550071481571570091401.591 1.634 1.440 1.411 0.039 0.066 1.507 1.511 1.567 1.602 1.438 1.415 0.037 0.047 1.505 1.508 50,000 25,000 000-.8 .8 564-.3 546000-.4 -5.325 -4.305 -5.309 -5.648 1.626 -5.563 -5.630 0.080 1.626 1.972 0.337 1.367 0.088 -5.486 1.374 -5.175 0.962 0.073 -5.475 -5.331 – 0.070 1.491 -5.337 0.238 -5.644 -5.093 1.489 -5.653 1.469 1.633 0.080 -5.558 1.657 1.632 0.083 -5.487 – – 1.374 0.127 -5.490 50,000 1.344 1.371 0.073 -5.401 25,000 0.080 0.071 1.492 -5.411 – – 1.494 1.489 -5.574 50,000 -5.459 1.576 0.044 25,000 – -5.430 -5.542 1.637 – -5.498 1.419 -5.554 0.023 1.639 50,000 1.377 0.035 0.039 -5.497 25,000 1.358 – 0.071 -5.495 1.502 -5.459 0.073 1.501 -5.430 -5.542 50,000 -5.445 1.500 -5.554 1.632 0.023 25,000 -5.420 -5.544 1.568 1.641 0.035 -5.497 1.374 -5.564 0.026 1.583 -5.495 50,000 1.426 1.361 0.038 0.071 -5.496 25,000 1.405 0.061 0.073 -5.496 1.500 -5.444 0.169 1.506 1.501 -5.424 -5.544 50,000 -5.461 1.511 -5.566 1.568 0.026 25,000 -5.437 -5.535 1.587 1.570 0.038 -5.496 1.432 -5.554 0.021 1.601 -5.497 50,000 1.343 1.415 0.032 0.035 -5.497 25,000 1.341 0.065 0.041 -5.497 1.502 -5.461 0.067 1.491 1.500 -5.437 -5.535 50,000 -5.452 1.490 -5.554 1.591 0.021 25,000 -5.445 -5.542 1.604 0.032 -5.497 1.339 -5.559 0.024 -5.497 50,000 1.331 0.035 0.065 -5.499 25,000 0.069 -5.499 1.491 -5.452 1.490 -5.447 -5.542 50,000 -5.555 0.024 25,000 0.034 -5.499 -5.500 50,000 25,000 ,0 514031-.0 -4.595 -5.807 2.098 0.321 -5.194 0.827 – 0.370 1.528 – – – – – – 1,000 – 1,000 ,0 369107-.1 -1.414 -5.418 1.077 2.502 -3.679 -0.244 – 0.695 1.355 -1.431 -5.454 – – 1.025 2.772 -3.532 -0.187 – – – 0.848 1.269 -4.699 – – -5.812 – – 2.077 0.280 1,000 -5.246 – 0.839 – – – 0.341 1.464 -4.394 1,000 – – -5.723 – – 1.877 0.328 1,000 -5.098 – 0.682 – – – 0.604 1.495 -4.869 1,000 – – -5.783 – – 1.947 0.235 1,000 -5.292 – 0.859 – – – 0.279 1.392 1,000 – – – – 1,000 – – – 1,000 – – 1,000 – 1,000 β ˆ mradSEMA and Lmer D25 97.5% 2.5% SD Lmer β cosrpiain of replications across , hpe :SM extended SEMA 5: Chapter β ˆ D25 97.5% 2.5% SD n SEMA =50 , 000 . hpe :SM extended SEMA 5: Chapter n 078 7 50 8 7 10 8 2 50 8 2 10 3 2 50 3 2 10 j r lvl al .:Tevraiiyo h siae of estimates the of variability The 5.2: Table 1 φn 9 4 9 4 9 4 9 4 9 4 9 4 500––––63825924412.214 2.434 2.529 10.464 5.723 6.358 1.398 – 5.326 7.870 3.667 – – 0.451 9.647 9.773 – 8.100 – 4.449 7.917 0.432 4.355 4.288 – 0.475 4.452 – 3.560 3.714 8.928 3.441 8.922 0.156 0.223 9.528 25,000 9.646 – 0.250 4.000 9.762 11.121 8.486 3.976 8.101 50,000 5.356 3.982 7.913 0.432 0.306 4.794 25,000 4.356 0.474 1.775 8.209 8.928 4.454 4.276 3.562 9.013 8.921 50,000 6.521 3.448 9.257 0.223 0.167 9.930 25,000 9.388 0.251 0.828 3.975 9.568 8.296 10.146 4.507 8.586 3.982 50,000 7.196 8.391 8.230 0.224 0.443 4.441 25,000 4.283 0.315 4.464 8.994 4.451 0.480 3.687 3.774 8.990 8.981 50,000 3.631 3.588 0.140 0.205 9.422 25,000 8.987 9.930 0.201 0.219 3.999 11.041 7.875 10.148 4.032 8.294 4.018 50,000 3.963 4.022 8.220 0.444 0.403 4.664 25,000 4.439 0.481 1.979 7.379 8.990 4.466 3.989 3.684 8.860 8.987 50,000 4.649 3.636 8.057 0.205 0.215 25,000 9.580 0.220 0.706 4.032 9.807 4.396 8.560 4.024 50,000 6.468 8.399 0.257 25,000 4.258 0.374 9.018 4.335 3.738 9.007 50,000 3.575 0.137 25,000 0.211 4.001 4.000 50,000 25,000 0008990478309758990478399.746 9.873 8.309 8.164 0.417 4.323 0.442 4.439 3.567 8.999 3.536 8.996 0.197 9.175 9.745 0.238 9.857 4.656 3.960 8.310 3.965 8.165 0.417 1.205 4.325 0.441 8.999 4.436 3.566 6.855 8.999 50,000 3.546 0.197 25,000 9.453 0.236 3.959 8.465 3.971 50,000 0.279 25,000 8.971 50,000 ,0 .2 .6 .0 24.281 2.201 6.968 29.737 9.320 8.853 – 5.998 25.029 4.646 17.499 – – 6.422 27.796 13.512 13.710 – – – 4.000 19.502 – – – – 1,000 – 14.465 – – – 18.319 2.111 1,000 25.898 – – 5.356 – – 19.895 1,000 8.696 7.856 – – – – 2.989 23.114 1.891 1,000 13.831 – – 6.051 – – 22.034 1,000 9.554 7.964 – – – – 4.337 1,000 15.718 – – – – 1,000 – – – 1,000 – – 1,000 – 1,000 ,0 1471.0 .9 35.083 2.597 12.801 31.372 7.317 11.417 – 5.963 15.896 – – – – – – 1,000 – 1,000 φ ˆ 2 mradSEMA and Lmer D25 97.5% 2.5% SD Lmer φ cosrpiain of replications across , τ ˆ 2 20 23 . . 441 842 D25 97.5% 2.5% SD Sema 59148.044 15.921 .2 24.306 3.124 101

Chapter 5 103 78,021, = n 20.013 47.730 respondents. Ta- SD 2.5% 97.5% 269 752 . , SEMA 21 =1 2 J ˆ σ across replications of 2 σ Lmer Lmer and SEMA SD 2.5% 97.5% observations from a total of 2 ˆ σ 521 , n were handed out. These smart weighting scales were equipped = 288 1,000 –1,000 – –1,000 – – –1,000 – – – –1,000 21.632 3.318 – – – –1,000 16.553 22.856 28.633 2.624 – – – – 18.312 28.568 28.203 – – – 25.095 3.211 – – 20.387 33.778 33.003 10.041 – 19.710 38.022 56.259 10.165 20.874 57.999 n 25,00050,000 24.936 24.985 0.25425,000 0.178 24.54850,000 25.068 24.627 25.551 25.012 0.206 25.308 24.65925,000 0.167 24.727 24.92150,000 25.019 24.679 0.911 25.534 24.985 0.288 0.195 25.348 23.657 25.06925,000 0.179 24.470 24.568 25.012 26.977 50,000 25.020 24.619 0.206 25.531 25.314 25.006 0.239 0.167 25.350 24.726 24.20225,000 0.155 24.566 24.679 24.865 25.528 50,000 24.726 0.524 25.449 25.349 25.001 0.175 25.293 23.429 25.021 –25,000 0.217 24.494 25.006 25.512 50,000 25.027 24.635 0.240 25.269 25.016 0.238 0.155 25.397 24.565 – 0.159 24.622 24.725 29.656 25.452 24.737 25.498 25.293 2.173 – 25.345 25.030 26.032 25.016 0.236 34.261 – 0.159 24.630 39.729 24.739 25.496 25.345 5.599 30.297 50.673 1 Table 5.3: The variability of the estimates of lvl ) were combined with the data of the remaining years. The first experimen- presents an overview of the model fitted to the data stream by indicating the r smart scales 5.4 We analyzed the data stream again offline using the Lmer function and online j = 883 n 10 2 3 50 2 3 10 2 8 50 2 8 10 7 8 50 7 8 scales were handed out inuntil the February beginning 2014. of 2011 While and theauthors the data used set data the data contains collection of the continued 2011 data only. Becausestamps, from the roughly we data of 3 were the years, able smart to the scales includes replayin time the this data evaluation stream from of 2011 SEMA,N till the February data 2014. of Thus, Kooremantal and factor was Scherpenzeel the ( (instructed) frequency of thenot scale specified. usage: The every second day, factor every was week, the or feedbackand respondents the received: their norm weight what they shouldtheir weigh, weight. their Both weight experimental and factors theirFinally, we were goal removed a crossed, weight, number resulting or of in only outliers (0.1% nineated of conditions. with the data), more for than which 5 weight kg fluctu- set within consisted a of day for a single respondent. The remaining data Chapter 5: SEMA extended concerned the fluctuations in individuals’in weight—over repeated a measurements— longitudinal study usingfor respondents Social from Sciences the (LISS) Longitudinal panel.1,000 Internet Among Studies the respondents ofwith an the Internet LISS connection. panel, Respondents about weresuch instructed that to use it the could scale measure,tissue, barefoot, and among percentage other of variables, fatserver, weight, tissue. where percentage The the of smart data muscle scale were sent combined the with data respondents’ to survey a central data. LISS The smart ble variables included as fixed or random, asof well each as the of number the of variables. levels (or categories) ● 50 ● Sema Sema Sema ● 40 ● Chapter 5: SEMA extended was used to predict ● ¯ y ● 30 ● Lmer Lmer Lmer ● 20 data stream, x1000 : : : 3 8 8 ● = = = 1 1 1 , lvl , lvl , lvl ● 2 2 7 10 = = = r r r , , , 50 10 10 ● was fitted again to update the model parameters. = = = j j j n n n 0 . When Lmer returned model parameters, these model parameters

ij

100 50 0 300 250 200 150 Figure 5.2: Average squared predictiontions. error of When three Lmer selected failedy condi- to fit the model, the were used to predict the next 1,000 data points, after which the model average squared error squared average 102

Chapter 5 102

average squared error eeue opeittenx ,0 aapit,atrwihtemodel the which after points, data 1,000 next the predict to used were in.We mrfie ofi h oe,the model, the fit to condi- y failed selected Lmer three When of error tions. prediction squared Average 5.2: Figure 0 50 100 150 200 250 300 ij hnLe eundmdlprmtr,teemdlparameters model these parameters, model returned Lmer When . 0 n n n j j j = = = a te gi oudt h oe parameters. model the update to again fitted was ● 10 10 50 , , , r r r = = = 10 7 2 2 ● , lvl , lvl , lvl 1 1 1 = = = ● 8 8 3 : : : data stream,x1000 20 ● Lmer Lmer Lmer ● 30 ● y ¯ ● a sdt predict to used was hpe :SM extended SEMA 5: Chapter ● 40 ● Sema Sema Sema ● 50 ● aibe nldda xdo adm swl stenme flvl o categories) (or levels variables. of the number of the as each well of as random, or fixed as included variables ble tdwt oeta gwti a o igersodn.Termiigdata remaining The respondent. single a for day of a consisted within set fluctu- kg weight 5 which than for more data), the with conditions. of ated nine (0.1% outliers only in of or resulting number weight, crossed, a removed goal were we Finally, their factors and experimental weight Both their weight. weigh, their should they what weight norm their received: the respondents and feedback or the week, was every factor day, second every The usage: specified. scale not the of frequency (instructed) ( the Scherpenzeel was factor and tal Kooreman Thus, of 2014. data February the till N SEMA, 2011 of from stream evaluation data this the time in replay includes scales the to smart able years, the were 3 of data we roughly the from stamps, Because only. data 2011 continued the of collection contains data the data set used data the authors the and smart While 2011 The of 2014. LISS beginning February data. central the until a survey in to out respondents’ data handed with the were combined scales sent were scale muscle data smart of the The percentage where tissue. weight, server, fat variables, of other percentage among and barefoot,tissue, measure, scale could the it use to that instructed such were about Respondents panel, connection. LISS Internet the an with of respondents the Studies Among Internet 1,000 panel. Longitudinal (LISS) the Sciences from Social respondents for using study longitudinal measurements— a repeated weight—over in individuals’ in fluctuations the concerned extended SEMA 5: Chapter 078 7 50 8 7 10 8 2 50 8 2 10 3 2 50 3 2 10 n 883 = j eaaye h aasra gi fieuigteLe ucinadonline and function Lmer the using offline again stream data the analyzed We 5.4 mr scales smart r rsnsa vriwo h oe te otedt temb niaigthe indicating by stream data the to fitted model the of overview an presents eecmie ihtedt ftermiigyas h rtexperimen- first The years. remaining the of data the with combined were ) lvl al .:Tevraiiyo h siae of estimates the of variability The 5.3: Table 1 500––––3.2 .9 02750.673 30.297 5.599 25.345 25.496 24.739 39.729 24.630 0.159 – 34.261 0.236 25.016 26.032 25.030 25.345 – 2.173 25.293 25.498 24.737 25.452 29.656 24.725 24.622 0.159 – 24.565 25.397 0.155 0.238 25.016 25.269 0.240 24.635 25.027 50,000 25.512 25.006 24.494 0.217 25,000 – 25.021 23.429 25.293 0.175 25.001 25.349 25.449 0.524 24.726 50,000 25.528 24.865 24.679 24.566 0.155 25,000 24.202 24.726 25.350 0.167 0.239 25.006 25.314 25.531 0.206 24.619 25.020 50,000 26.977 25.012 24.568 24.470 0.179 25,000 25.069 23.657 25.348 0.195 0.288 24.985 25.534 0.911 24.679 25.019 50,000 24.921 24.727 0.167 25,000 24.659 25.308 0.206 25.012 25.551 24.627 25.068 50,000 24.548 0.178 25,000 0.254 24.985 24.936 50,000 25,000 n ,0 8021.6 08457.999 20.874 10.165 56.259 38.022 19.710 – 10.041 33.003 33.778 20.387 – – 3.211 25.095 – – – 28.203 28.568 18.312 – – – – 2.624 28.633 22.856 16.553 1,000 – – – – 3.318 21.632 1,000 – – – – 1,000 – – – 1,000 – – 1,000 – 1,000 288 = eehne u.Teesatwihigsae eeequipped were scales weighting smart These out. handed were n , 521 σ ˆ 2 bevtosfo oa of total a from observations D25 97.5% 2.5% SD mradSEMA and Lmer Lmer σ 2 cosrpiain of replications across σ ˆ J 2 =1 21 SEMA , . 752 269 D25 97.5% 2.5% SD epnet.Ta- respondents. 00347.730 20.013 n = 78,021, 103

Chapter 5 5.3 105 ● ● ● ● 250 250 ● ● ● ● ● ● Sema training and update Sema training 200 200 ● ● -axes present the length of the data x ● ● 150 150 ● ● ● ● data stream x1000 ● ● 100 100 Sema training set Sema training Sema updates ● ● ● ● ● 50 50 ● ● Lmer Sema ● ● , (Lmer), 1.269 (SEMA), 0.164 (SEMA Training), 0.090 (SEMA 0 0

095 85 80 75 3.0 2.0 1.0 0.0 70 .

-axes the estimated parameter values. In gray are the parameter es-

estimated intercept estimated estimated effect of monday of effect estimated y =0 illustrates the estimated variances of the random intercept and the ˆ φ 5.4 Figure 5.3: Estimated intercept and the effect of Monday, fixed effects Figure Since the authors of the original analysis focused on the “effect of Monday”, shows that the estimates resulting from SEMA are accurate while being magni- stream and the timates obtained using the offline procedure. These arethe clearly SEMA more fluctuating procedure(s), than which isconvergence. what SEMA, can without be extra EM expected runs giventhe or that intercept using and Lmer a overestimates training the is effect set, run of underestimated Mondayues. till due When to a the selected training starting set val- is5.3 used – or when additionaltudes E-steps faster are than included the – traditional Figure procedure. variance of the effect ofvariance Monday. due to In the selected both starting graphs,very values, similar: while SEMA the again other overestimated fitting procedures the Update), were and 0.071 (SEMA Training and Update).pretty Interestingly, much all methods, all during of the data stream, estimate the variance of the effect of Monday on Fridays, we also focusedpresents on the the estimated estimation fixed of intercept thisage of effect “Monday” the of model effect. Monday in in Figure the the bottom top figure. figure, The and the aver- Chapter 5: SEMA extended using the additional updates, thethe SEMA CDSS algorithm in can a correct computationally the efficientprocedures contributions manner. are to presented. Below the results of all five fitting which implies that on average individuals where 0.2 kg heavier on Mondays than to update the morning not specified only weight 174cm (centered) 1970 (centered) male Friday 000 , data points as a train- =1 n ), previous contributions 000 , Chapter 5: SEMA extended 000 , =5 n =1 n 4 3 3 – – 2 7 ) for the random-intercept model: ! ! ! ! ! ! Fixed Random!! number of categories Reference 2016b ( , weight = 100 , , Table 5.4: Fitted model to the smart-scale data stream = 84 = 14 Ippel et al. good starting values, SEMA Update: this implementation includesall extra individuals’ contributions full to E-steps the to CDSS at recompute given intervals, SEMA Training and Update: this implementationabove. is a combination of the two SEMA Training: this implementation includes a training-data set to obtain • • • Time of Measurement Frequency Feedback Length Year of birth Gender Variables Day of the week The dependent variable is using the SEMA algorithm. Lmer was again called every 104 fixed effect Gender variance random intercept remaining fixed-effects parameters started at 0, the variance of the random effects at 1 Starting values: fixed effect intercept this data stream we doBy not evaluating know all individuals if at the given individual intervals steps ( on the smart scale again. individual. The last implementation, SEMAof Training and the Update, two is previous a implementations combination tional by full using E-steps. In both a a practical trainingusing setting, the when set beginning starting as of values the well are data difficult stream as to ashas addi- choose, a to training take set to limits obtain number of good stepsrithm estimates SEMA corrects of the the previous parameters. contributions In of addition, an the individual SEMA to algo- the CDSS. However, in offline EM algorithm). The SEMA algorithmthese as starting presented values in to this chapter continueplementation, then SEMA the used Update, analysis is of similar the tochapter, though the data in SEMA addition stream. algorithm to the as E-steps The presented perthe in individual, second CDSS SEMA this im- Update at recomputed given intervals by performinglowed a by single an full M-step, as E-step opposed for to all computing individuals the fol- E-step only for the newly arriving The SEMA Training implementation used theing first set to obtain good guesses for the parameter starting values (using the traditional evaluated by parameter estimates. In addition tothis the chapter, implementation we of included SEMA three as extra introduced in implementations of SEMA which have been to the CDSS can be updated, even though an individual has not returned (yet). So,

Chapter 5 oteCS a eudtd vntog nidvda a o eund(e) So, (yet). returned not has individual an again. though scale even smart updated, be the can on CDSS ( steps the intervals individual to given the at if individuals all know evaluating not By in do However, we CDSS. the algo- stream to SEMA data individual the this an addition, of In contributions parameters. previous the the of corrects SEMA estimates rithm steps good of number obtain limits to set take training to a choose, addi- has as to as stream difficult data are well the values of as starting beginning set when the setting, using training practical a a both In E-steps. using full by tional combination implementations a previous is two Update, the and arriving Training of newly SEMA the implementation, for last only E-step The fol- the individuals computing individual. all to for opposed E-step as M-step, full an single by a lowed performing by intervals given recomputed at Update im- this SEMA CDSS second individual, in the per presented The E-steps as the to algorithm stream. addition SEMA in data the though chapter, to the similar of is analysis Update, used the SEMA then plementation, continue chapter this to in values presented starting as these algorithm SEMA traditional The the (using algorithm). values starting EM parameter offline the for guesses good obtain to set first ing the used implementation Training SEMA The been have which SEMA of implementations in by introduced extra evaluated as three SEMA included of we implementation chapter, the this to addition In estimates. parameter eann xdefcsprmtr tre t0 h aineo h admefcsa 1 at effects random the of variance the 0, at started parameters fixed-effects remaining intercept random variance Gender effect fixed intercept effect fixed values: Starting sn h EAagrtm mrwsaancle every called again was Lmer algorithm. SEMA the using 104 h eedn aibeis variable dependent The a fteweek the of Day Variables Gender ero birth of Year Length Feedback Frequency ieo Measurement of Time • • • EATann n pae hsipeetto sacmiaino h two the of combination a is above. implementation this Update: and Training SEMA intervals, given recompute at CDSS to the E-steps to full contributions extra individuals’ all includes implementation this Update: obtain SEMA to set training-data a values, includes starting good implementation this Training: SEMA pe tal. et Ippel 14 = 84 = al .:Fte oe otesatsaedt stream data smart-scale the to model Fitted 5.4: Table , , 100 = weight , ( 2016b ie admnme fctgre Reference categories of number !! Random Fixed ! ! ! ! ! ! o h admitretmodel: random-intercept the for ) 7 2 – – 3 3 4 n =1 n =5 , 000 hpe :SM extended SEMA 5: Chapter , 000 ,peiu contributions previous ), n =1 aapit satrain- a as points data , 000 Friday male 90(centered) 1970 7c (centered) 174cm nyweight only o specified not morning oudt the update to rtymc l ftedt tem siaetevrac fteefc fMonday of effect the of variance the estimate stream, data the of during all methods, all much Interestingly, pretty Update). and Training (SEMA 0.071 and were Update), the procedures fitting overestimated other again the SEMA while similar: values, very graphs, starting both selected the In to due Monday. variance of effect the of variance procedure. Figure traditional – the included than are faster E-steps tudes additional when or – used 5.3 is val- set starting training selected the a to When due till ues. Monday underestimated of run set, effect is the training overestimates a Lmer and using intercept that or the given runs expected EM extra be without can SEMA, what convergence. is which than procedure(s), fluctuating more SEMA clearly the are These procedure. offline the using obtained timates aver- the and The the figure, figure. and top bottom stream the the Figure in in Monday effect. model of the “Monday” effect of age this intercept of fixed estimation than estimated the Mondays the on on presents heavier focused kg also 0.2 we where Fridays, individuals on average on that implies which fitting five all of results the Below presented. to are manner. contributions procedures efficient the computationally correct a can in algorithm CDSS SEMA the the updates, additional the using extended SEMA 5: Chapter hw htteetmtsrsligfo EAaeacrt hl en magni- being while accurate are SEMA from resulting estimates the that shows Figure Monday”, of “effect the on focused analysis original the of authors the Since iue53 siae necp n h feto ody xdeffects fixed Monday, of effect the and intercept Estimated 5.3: Figure 5.4 φ ˆ lutae h siae aine fterno necp n the and intercept random the of variances estimated the illustrates =0 y estimated effect of monday estimated intercept ae h siae aaee aus nga r h aaee es- parameter the are gray In values. parameter estimated the -axes

. 0.0 1.0 2.0 3.0 70 75 80 85 095 0 0 Le) .6 SM) .6 SM riig,000(SEMA 0.090 Training), (SEMA 0.164 (SEMA), 1.269 (Lmer), , ● ● Sema Lmer ● ● 50 50 ● ● ● ● ● Sema updates Sema training set 100 100 ● ● data streamx1000 ● ● ● ● 150 150 ● ● x ae rsn h egho h data the of length the present -axes ● ● 200 200 Sema training andupdate ● ● ● ● ● ● 250 250 ● ● ● ● 105 5.3

Chapter 5 107 ● ● 250 ● Sema training and update Sema training ● 200 ● 150 000 ● , =6 data stream x1000 ● n 100 Sema training set Sema training Sema updates ● ● 50 ● Lmer Sema ●

0

30 25 20 15 10 5 0 average squared prediction error prediction squared average Figure 5.5: Average squared prediction error of weight, starting from Common procedures to fit multilevel models (i.e., EM algorithm or Newton Raphson) repeatedly pass over thenew data data enter, to these procedures estimate revisit the all dataters. model points Especially to parameters. when update the the number When model of parame- randomthe data effects increases are required and to many obtain passes stable over tional estimates fitting of procedures the quickly model parameters, become these infeasible tradi- data for streams. large data SEMA, sets on or the continuous is other hand, discarded. only SEMA uses a thus datarameters learns point more the once, efficiently maximum after in likelihood which a it valuesmodel data parameter of stream estimates the are than model updated the with pa- common eachfore, newly procedures, SEMA entered since facilitates data the the point. analysis There- structure of that data streams is while often accounting observedproblems for in of the data storing nested streams. extremely largeOur SEMA data algorithm sets effectively enables and deals very researchers with lengthy to the time. fitting use In a procedures. multilevel simulation models study, we for showedper that prediction individual even is in when small real the and number the ofwere estimated number observations quite of accurately. parameters Furthermore, we is showed large, thatmance parameter the of predictive estimates SEMA perfor- was competitive, if not superior, to current state-of-the-art methods. Chapter 5: SEMA extended measurements), level 2 (i.e., individuals), and randomwhen intercepts data and sets random are slopes large and/or continuously augmented.more SEMA efficient is because computationally this algorithm never revisits older data points. ● ● ● ● ● ● 250 250 ). This extension enables ● ● Chapter 5: SEMA extended ● ● 2016b Sema training and update Sema training 200 200 , ● ● ● ● ● ● 150 150 ● ● Ippel et al. fects data stream x1000 ● ● 100 100 Sema training set Sema training Sema updates ● ● onwards, because the offline procedure could es- . We disregarded the beginning of the stream for ● ● ● 000 000 ● , 50 50 , ● 5 Lmer Sema ● ● =6 ≈ n ● n

0 0

presents the average squared prediction error of all five fitting

8 6 4 2 0 220 200 180 160

5.5

variance monday variance variance random intercept random variance Figure 5.4: Estimated intercept and the effect of Monday, random ef- Finally, Figure In this chapter, we developed an extensiontion of Approximation the (SEMA) Streaming algorithm Expectation ( Maximiza- researchers to fit multilevel models that include fixed effects at level 1 (i.e. repeated 5.6 Discussion individual which entered for theerror first is time. computed from The average of the squared prediction effect” might be less systematic thanfor emphasized quite a in large the number original of study: individuals,to the positive. apparently, effect of Monday is negative as opposed procedures. The average squared predictionsimulation error study, was where implemented the similar average to weight the was used to predict the weight of an to be high compared to its average effect. To us, this indicates that the “Monday 106 error than SEMA. Hence, it seems thateffects for in the data purpose of streams, predicting SEMA individual-level is very well suited. timate the model from all fitting procedures and comparedproduce the parameter methods estimates. only While from computing averged the favors point error the at offline only procedure, which after still they Lmer Lmer produces has on con- average more prediction

Chapter 5 fet ndt tem,SM svr elsuited. well very is individual-level SEMA predicting streams, of purpose data the in for effects that seems prediction it more Hence, average SEMA. con- on than has produces error Lmer Lmer they still after which procedure, only offline at the error point favors the verged a computing from While only estimates. methods parameter the produce compared and procedures fitting all from model the timate eerhr ofi utlvlmdl hticuefie fet tlvl1(..repeated (i.e. 1 level at effects fixed include that models multilevel fit to researchers Maximiza- ( Expectation algorithm Streaming (SEMA) the Approximation of tion extension an developed we chapter, this In Discussion 5.6 prediction squared the of an average of The from weight computed time. the is first predict error the to for used entered was which the weight individual to average similar the implemented where was study, error simulation prediction squared average The procedures. opposed as negative is Monday of “Monday effect apparently, the positive. the to individuals, that study: of original indicates number the large this in a us, quite emphasized To for than systematic effect. less average be might its effect” to compared high be to 106 ial,Figure Finally, iue54 siae necp n h feto ody admef- random Monday, of effect the and intercept Estimated 5.4: Figure

variance monday variance random intercept

5.5 0 2 4 6 8 160 180 200 220 rsnsteaeaesurdpeito ro falfiefitting five all of error prediction squared average the presents 0 0 n ● n ≈ =6 ● ● Sema Lmer 5 ● , 50 50 , ● 000 000 ● ● ● edseaddtebgnigo h temfor stream the of beginning the disregarded We . nad,bcueteofln rcdr ol es- could procedure offline the because onwards, ● ● Sema updates Sema training set 100 100 ● ● data streamx1000 fects pe tal. et Ippel ● ● 150 150 ● ● ● ● ● ● , 200 200 Sema training andupdate 2016b ● ● hpe :SM extended SEMA 5: Chapter ● ● .Ti xeso enables extension This ). 250 250 ● ● ● ● ● ● ac fSM a opttv,i o ueir ocretsaeo-h-r methods. state-of-the-art current to superior, not if competitive, was perfor- SEMA estimates predictive of the parameter mance that large, showed is we Furthermore, parameters accurately. of quite observations number estimated were of the number and the real small when in is even individual prediction that per showed for we study, models simulation multilevel procedures. a In use fitting time. the to lengthy with researchers very deals and enables effectively sets algorithm data SEMA Our large extremely streams. nested storing data the of in for problems observed accounting often while is streams data that of structure There- analysis point. the the data facilitates since entered SEMA procedures, newly fore, each common pa- with the updated model than are the estimates stream of parameter data model values it a which likelihood in after maximum efficiently once, the more point learns rameters data thus a uses SEMA only discarded. hand, other is continuous the or on sets SEMA, data large streams. for data tradi- infeasible these become parameters, model quickly the procedures of fitting estimates tional over stable passes obtain many to and required are increases effects data the random parame- of model When number the the update when parameters. to Especially points model ters. data all the revisit estimate procedures these to enter, data data new the over pass repeatedly Raphson) points. data older revisits never algorithm this computationally because is efficient SEMA more augmented. continuously and/or large slopes are random sets and data intercepts when random and individuals), (i.e., 2 level measurements), extended SEMA 5: Chapter omnpoeue ofi utlvlmdl ie,E loih rNewton or algorithm EM (i.e., models multilevel fit to procedures Common iue55 vrg qae rdcinerro egt trigfrom starting weight, of error prediction squared Average 5.5: Figure

average squared prediction error

0 5 10 15 20 25 30 0 ● Sema Lmer ● 50 ● ● Sema updates Sema training set 100 n ● data streamx1000 =6 , ● 000 150 ● 200 ● Sema training andupdate ● 250 ● ● 107

Chapter 5 108 Chapter 5: SEMA extended

While the current extension of SEMA algorithm allows for fitting multilevel mod- els with fixed and random effects in data streams, extensions are possible and need further development. First, in Kooreman and Scherpenzeel (2014)—our empirical example—the authors actually used a multilevel model with more predictors than Chapter 6 the model we used in this chapter. The original model also contained fixed effects for the calendar months. Fitting this model requires observations in (almost) each month, such that the X′X matrix becomes invertible (i.e., at least semi positive def- Discussion inite). Consequently, fitting a model including the effects of months cannot be fitted to the data before the data stream has run for almost a year. Further research should focus on extending the model during the data stream, such that these effects can be included dynamically once enough data has been collected. 6.1 Overview Second, the current version of SEMA basically assumes that the true data gener- ating process is stationary and that, over the course of the data stream, we converge In this thesis, computationally-efficient procedures for analyzing data streams are to the “correct” parameter estimates. However, when monitoring individuals over developed and illustrated in order to make data streams more accessible to ap- time, it is likely that the data-generating process itself changes over time. Moving plied researchers. In the case of independent observations, Chapter 2 gives a short window approaches, in which only the latest data points are included in the analysis, overview of approaches to analyze streaming data. A large part of this chapter is are often used in such cases. However, when using a moving window approach one dedicated to the introduction of online learning methods. These methods are illus- would still refit the model to all the data points in the window every time the win- trated with examples written in R code to promote the use of these online learning dow changes. Alternatively, however, we could introduce a fixed learn rate when methods. dealing with data streams. In Eq. 5.1 it is easily seen that the “learn-rate” for com- Data streams, however, often consist of repeated measurements of the same indi- 1 viduals, which creates a nested structure in the data. Chapter 3 presents four online puting an online sample mean is n . Thus, as the stream becomes longer (and n grows larger) the learn rate decreases and the computed mean stabilizes. If instead learning methods to deal with dependent observations in data streams where the xt x¯ outcome variable is binary. These online learning methods are based on shrinkage we would alter the update rule of x¯ to read x¯ :=x ¯ + min(−n,α) for some fixed value of α of say 10, 000, we effectively create a moving window in the sense that older data factors which are well-known in the literature. Additionally, this chapter introduces points are smoothly discarded—though without revisiting older data points. This an online learning method to deal with the normality assumption based on Morris can, with some effort, similarly be implemented in SEMA. and Lysy’s (2012) data transformation for binary outcomes. To conclude, our extended SEMA algorithm is a computationally-efficient algo- In Chapter 4, a novel online algorithm is developed, called the Streaming Expec- rithm to analyze data that contain a nested structure and arrive in a continuous fash- tation Maximization Approximation algorithm (SEMA). This algorithm fits random ion. Hence, multilevel models with numerous fixed and random effects can now be intercepts models on streaming data. Based on the EM algorithm (Dempster et al., fit to continuous data streams (or extremely large static data sets), in a computation- 1977), SEMA fits the simplest multilevel model by updating the parameter estimates ally efficient fashion. online. This online learning approach to multilevel modeling facilitates social scien- tists to study social phenomena in a longitudinal design using data streams. Lastly, in Chapter 5 an extension of SEMA is provided. This extension can be Acknowledgements used to estimate more complex multilevel models, which include fixed effects and random intercepts and slopes. The development of these methods allows researchers We would like to express our thanks to prof. Peter Kooreman for sharing their data. to broaden the scope of their research using data streams. Furthermore, we thank Daniel Ivan Pineda for his feedback on the writing process.

6.2 Related approaches to analyze data streams

In this thesis, I have predominantly explored the online learning approach as a method to efficiently analyze data streams. In addition to online learning, two re- lated approaches have been discussed in Chapter 2: a sliding window approach and 108 Chapter 5: SEMA extended

While the current extension of SEMA algorithm allows for fitting multilevel mod- els with fixed and random effects in data streams, extensions are possible and need further development. First, in Kooreman and Scherpenzeel (2014)—our empirical example—the authors actually used a multilevel model with more predictors than Chapter 6 the model we used in this chapter. The original model also contained fixed effects for the calendar months. Fitting this model requires observations in (almost) each month, such that the X′X matrix becomes invertible (i.e., at least semi positive def- Discussion inite). Consequently, fitting a model including the effects of months cannot be fitted to the data before the data stream has run for almost a year. Further research should focus on extending the model during the data stream, such that these effects can be included dynamically once enough data has been collected. 6.1 Overview Second, the current version of SEMA basically assumes that the true data gener- ating process is stationary and that, over the course of the data stream, we converge In this thesis, computationally-efficient procedures for analyzing data streams are to the “correct” parameter estimates. However, when monitoring individuals over developed and illustrated in order to make data streams more accessible to ap- time, it is likely that the data-generating process itself changes over time. Moving plied researchers. In the case of independent observations, Chapter 2 gives a short window approaches, in which only the latest data points are included in the analysis, overview of approaches to analyze streaming data. A large part of this chapter is are often used in such cases. However, when using a moving window approach one dedicated to the introduction of online learning methods. These methods are illus- would still refit the model to all the data points in the window every time the win- trated with examples written in R code to promote the use of these online learning dow changes. Alternatively, however, we could introduce a fixed learn rate when methods. dealing with data streams. In Eq. 5.1 it is easily seen that the “learn-rate” for com- Data streams, however, often consist of repeated measurements of the same indi- 1 viduals, which creates a nested structure in the data. Chapter 3 presents four online puting an online sample mean is n . Thus, as the stream becomes longer (and n grows larger) the learn rate decreases and the computed mean stabilizes. If instead learning methods to deal with dependent observations in data streams where the xt x¯ outcome variable is binary. These online learning methods are based on shrinkage we would alter the update rule of x¯ to read x¯ :=x ¯ + min(−n,α) for some fixed value of α of say 10, 000, we effectively create a moving window in the sense that older data factors which are well-known in the literature. Additionally, this chapter introduces points are smoothly discarded—though without revisiting older data points. This an online learning method to deal with the normality assumption based on Morris can, with some effort, similarly be implemented in SEMA. and Lysy’s (2012) data transformation for binary outcomes. To conclude, our extended SEMA algorithm is a computationally-efficient algo- In Chapter 4, a novel online algorithm is developed, called the Streaming Expec- rithm to analyze data that contain a nested structure and arrive in a continuous fash- tation Maximization Approximation algorithm (SEMA). This algorithm fits random ion. Hence, multilevel models with numerous fixed and random effects can now be intercepts models on streaming data. Based on the EM algorithm (Dempster et al., fit to continuous data streams (or extremely large static data sets), in a computation- 1977), SEMA fits the simplest multilevel model by updating the parameter estimates ally efficient fashion. online. This online learning approach to multilevel modeling facilitates social scien- tists to study social phenomena in a longitudinal design using data streams. Lastly, in Chapter 5 an extension of SEMA is provided. This extension can be Acknowledgements used to estimate more complex multilevel models, which include fixed effects and random intercepts and slopes. The development of these methods allows researchers We would like to express our thanks to prof. Peter Kooreman for sharing their data. to broaden the scope of their research using data streams. Furthermore, we thank Daniel Ivan Pineda for his feedback on the writing process.

6.2 Related approaches to analyze data streams

In this thesis, I have predominantly explored the online learning approach as a method to efficiently analyze data streams. In addition to online learning, two re- lated approaches have been discussed in Chapter 2: a sliding window approach and ). ∝ ) 111 D . | 2014 ) θ , t ( p D | is propor- θ ( +1 p t ) ). Böse, Andrze- y θ | +1 ) and Bayes Theo- t 2011 ) y , θ ( | p D ( ∝ p ) van de Schoot et al. +1 ; t D | θ 2007 ( p , Broderick, Boyd, Wibisono, Wilson, ): Lynch 1998 ; , 2004 , Opper Hsu, Karampatziakis, Langford, & Smola Gelman et al. ; ). Next, using the likelihood of the data ( 2010 ) , θ ( p (see, e.g., ) θ ( p ) θ | The computations of the online shrinkage factors and the SEMA algorithm can While the Bayesian approach fits conceptually well with data streams, the real- Interesting developments in Bayesian updating are variational methods and se- D ( tionally equal to the likelihood of thebased new on observation the multiplied previous with data the points posterior ( rem, one’s prior beliefs can be updated, resultingp in the so-called posterior: Theoretically, this scheme of updating one’s belief lendsdata itself streams very because well the to analyzing posterior after observing a new data point Chapter 6: Discussion 6.2.2 Parallelization Parallel computing is useful in casessplit where up the in computations separate of blocks. anmachine To analysis come could can back compute be to the the sample example meanall of these of (sub)means the the are sample data combined mean: stored to compute on each tion the that of mean machine, over parallelization next, all and data. online The learningjak, combina- is & known Högqvist in the literature ( easily be distributed over multipledata. machines For by instance, making let userepeatedly us observed of over assume the time. there nesting Four are in machinesstatistics five could the of machines be all used and the to individuals. 100 store The individualsas the fifth the summary are machine mean stores over the all global data parameters,Now, and such when a list new of data which enter, individual thethat is fifth stored just machine on entered retrieves which from the machine. one record ofthe of the individual the four summary individual machines, statistics, updates which the can globaltributing be these parameters returned records reduce and to the the memory burden right of machine. the introduced Dis- online methods. 6.2.3 Bayesian framework In this thesis, I have focusedtimate primarily on parameters. the Maximum An Likelihood approach alternative,obtained to Bayesian, es- by approach quantifying to one’s parameter priorusing estimation beliefs a is regarding probability the distribution values (or of densitydistribution the in ( the parameters continuous case), called the prior ity is, however, less straightforward.many Even Bayesian when models the are computationally data complex do becausealways not the have enter posterior a in does known not a distribution. stream, Insuch practice, as researchers Markov often Chain rely on Monterior. techniques Carlo In sampling such (MCMC) cases, tocomputationally updating approximate challenging. the the posterior poste- in the context of data streams couldquential be MCMC (sMCMC) sampling.process Variational by methods replacing speed the up posterior the whichdistribution updating has with an a unknown known distributional distributional form form ( by a , Aggarwal Chapter 6: Discussion ). Some strong aspects , when many data points 5 2006 , G. James, Witten, Hastie, & Tibshi- Chu et al. becomes large. If, however, the learn rate is fixed to a , independent of the length of the data stream. n m ). Parallelization, on the other hand, decreases the computa- is the total number of observations and when the data grows n , a smooth ‘sliding window’ is created. In this smooth sliding 1 n 2005 , , where 1 n ). In this trade off, bias is related the robustness of the parameter estimate Gaber et al. 2013 ; , On the other hand, computing the sample mean online as more data enter, the Fixing learn rates is also an interesting future direction for the online learning Usually, it is undesirable to have highly fluctuating parameter estimates because ) becomes smaller when more data enter. As a result, the value of the mean hardly 1 n value larger than weight of the new datation points equals decreases gradually. The weight of a new observa- changes anymore when should be found to obtain accurate predictions. these are difficult tofectly interpret. stable, adding new However, data when becomesbias-variance the rather trade useless. parameter off This estimates balance (see is are for relatedrani an per- to introduction the against random fluctuations in the data.pick up The real variance changes is in the related data. to In the practice, sensitivity some balance to between bias and variance methods discussed in this thesis. For instance,stream in, in Chapter the SEMA algorithm didtions. not learn Including much a anymore fixed from learn the newvations. rate observa- would As a make result, SEMA the more parameter sensitive estimates to become new more obser- variable. window, the older data points aretions not becomes ignored, smaller though compared the to weight the of new these observations. observa- point is deleted from memory to includerecomputed the over new this data ‘new’ point set and of the observations.number sample of This mean observations, set is always consists of the same As mentioned above, the slidingof window observations. approach An only often usesthat unintended the when practical changes most advantage occur recent of over set because time, a this the sliding approach older window can data is easily pointsthe follow do these sample changes, not mean influence as the an analyses example. anymore. When Let a us new use data point enters the oldest data tional burden by distributing theis analyses over commonly multiple used machines. for large Thisof static technique both data techniques could sets also ( be implementedoped in in the this online thesis. learning methods devel- 6.2.1 Sliding window approach 110 a parallelization approach. The slidingbecause window it approach only reduces stores memory the burden most recentous set data of points data are points. discarded from Using memory this2007 when technique, new previ- data points enter ( larger, this weight becomes smaller.( In this case, the learn rate of the online mean

Chapter 6 agr hswih eoe mle.I hscs,telanrt fteoln mean online the of rate learn the case, this In ( smaller. becomes weight this larger, eoptdoe hs‘e’sto bevtos hsstawy ossso h same the data of consists oldest always the is set observations, mean This of enters sample number observations. the point of and set data point ‘new’ data use new this new over us the a recomputed include Let When to memory from anymore. example. deleted analyses is an point the as influence mean not changes, sample these do follow the points easily is data can window older approach sliding the this a time, because set over of recent occur advantage most changes practical when the unintended that uses often only An approach observations. window of sliding the above, mentioned As approach window Sliding 6.2.1 devel- methods learning thesis. online this the in in oped implemented be ( also sets could techniques data both technique static of This large for machines. used multiple commonly over analyses is the distributing by burden tional ( enter points data previ- new technique, when 2007 this memory Using from discarded points. are data points of data set ous recent most burden the memory stores reduces only approach it window because sliding The approach. parallelization a 110 hudb on ooti cuaepredictions. variance accurate and obtain bias between to to balance found some sensitivity be practice, the should In to data. related the in is changes variance real The up pick data. the in fluctuations random against the introduction to per- an rani related for are is (see balance estimates This off parameter useless. trade rather the bias-variance becomes when data However, new adding stable, interpret. fectly to difficult are these variable. obser- more new become to estimates sensitive parameter more the SEMA result, make a As would observa- rate vations. new the learn from fixed anymore a much Including learn not tions. did algorithm SEMA the Chapter in in, stream instance, For thesis. this in discussed methods observa- observations. these new of the weight to the compared though smaller ignored, becomes not tions are points data older the window, hne nmr when anymore changes egto h e aapit erae rdal.Tewih fanwobserva- new a of weight The gradually. decreases equals points tion data new the of weight au agrthan larger value n 1 eoe mle hnmr aaetr sarsl,tevleo h enhardly mean the of value the result, a As enter. data more when smaller becomes ) sal,i sudsrbet aehgl utaigprmtretmtsbecause estimates parameter fluctuating highly have to undesirable is it Usually, learning online the for direction future interesting an also is rates learn Fixing nteohrhn,cmuigtesml enoln smr aaetr the enter, data more as online mean sample the computing hand, other the On , ; 2013 ae tal. et Gaber .I hstaeof isi eae h outeso h aaee estimate parameter the of robustness the related is bias off, trade this In ). n 1 where , , 2005 n 1 moh‘ldn idw scetd nti mohsliding smooth this In created. is window’ ‘sliding smooth a , n stettlnme fosrain n hntedt grows data the when and observations of number total the is .Prleiain nteohrhn,dcesstecomputa- the decreases hand, other the on Parallelization, ). m n needn ftelnt ftedt stream. data the of length the of independent , eoe ag.I,hwvr h er aei xdt a to fixed is rate learn the however, If, large. becomes h tal. et Chu .Jms itn ate Tibshi- & Hastie, Witten, James, G. , 2006 5 hnmn aapoints data many when , .Sm togaspects strong Some ). hpe :Discussion 6: Chapter Aggarwal , rcs yrpaigtepseirwihhsa nnw itiuinlfr ya by ( form form distributional distributional known unknown a an with has updating distribution which the posterior up the speed replacing methods by Variational process sampling. (sMCMC) MCMC be quential could streams data of context the in poste- posterior the the challenging. approximate updating computationally to cases, (MCMC) such sampling In Carlo techniques rior. Monte on rely Chain often Markov researchers as practice, such In stream, distribution. a not known does in a posterior enter have the not always because do complex data computationally are the models when Bayesian Even many straightforward. less however, is, ity aasrasbcuetepseiratrosriganwdt point data new a observing after posterior analyzing to the well because very streams itself data lends belief one’s updating of scheme this Theoretically, posterior: so-called the in p resulting updated, be can beliefs prior prior one’s the rem, called case), continuous parameters the ( in the distribution density of (or values distribution the probability regarding is a beliefs estimation using prior parameter one’s to quantifying approach by es- Bayesian, to obtained alternative, approach Likelihood An Maximum the parameters. on primarily timate focused have I thesis, this In framework Bayesian 6.2.3 methods. online Dis- introduced the machine. of right burden memory the the to and reduce records returned parameters these be tributing global can the which updates statistics, machines, individual summary four the individual the of the of record one machine. the from which retrieves entered on machine just stored fifth is that the individual enter, which data of list new a when such and Now, parameters, data global all the over stores mean machine are summary the fifth the as individuals The store 100 individuals. to the and used all be machines of the could five statistics machines in are Four nesting there time. the assume over of observed us repeatedly use let making instance, by For machines data. multiple over distributed be easily ( literature the in Högqvist known & is combina- jak, learning The online data. and all next, parallelization over machine, mean of that the tion each on compute to stored mean: combined data sample are the the (sub)means of these of all mean example sample the the to be compute back can could come analysis To machine an blocks. of separate computations in the up where split cases in useful is computing Parallel Parallelization 6.2.2 Discussion 6: Chapter ae ntepeiu aapit ( posterior points the data with previous multiplied the observation on new based the of likelihood the to equal tionally ( D neetn eeomnsi aeinudtn r aitoa ehd n se- and methods variational are updating Bayesian in developments Interesting real- the streams, data with well conceptually fits approach Bayesian the While h opttoso h niesrnaefcosadteSM loih can algorithm SEMA the and factors shrinkage online the of computations The | θ ) p ( θ ) se e.g., (see, p ( θ , ) 2010 .Nx,uigtelklho ftedt ( data the of likelihood the using Next, ). ; emne al. et Gelman s,Krmazai,Lnfr,&Smola & Langford, Karampatziakis, Hsu, Opper , 2004 , ; 1998 Lynch ): rdrc,By,Wbsn,Wilson, Wibisono, Boyd, Broderick, , p ( 2007 θ | D t ; +1 a eSho tal. et Schoot de van ) p ∝ ( D p | ( θ , y ) 2011 t n ae Theo- Bayes and ) +1 | θ y öe Andrze- Böse, ). ) t p +1 ( θ spropor- is | D p ( t , θ ) 2014 | . D 111 ) ∝ ).

Chapter 6 113 provides a 5 An interesting direction for future research to compare competing models in data In the presented simulation studies, the models which generated the data were model is best, is lessAIC straightforward and BIC, in commonly a used to data compare stream differentlihood, models, than which are based in is on computed static the using log data like- the sets.the current parameter parameter The estimates. estimates In are a continuously dataof updated stream, the and data especially stream probably in far the fromthe beginning the differences Maximum Likelihood in solution. AIC As a orartifact result, of BIC poorly-chosen values starting between values, which thestream makes using competing AIC comparing models and models BIC in could complex. a be data an streams could be inspired bySEMA the updates SEMA the algorithm: contribution of when the a individualcomplete belonging new data to data sufficient this statistics. point data enters, point These toover sufficient the individuals. statistics Since are the in log fact likelihoodtributions, summations is these a contributions sum of could individual also log-likelihoodin be con- the updated data when stream. the Suchonline individuals an approximation return updating of common scheme goodness-of-fit could measures make in it data streams. possible to employ an Chapter 6: Discussion whether the parameter estimates becomegence more is stable an over interesting direction time. for future Studying research. conver- 6.3.2 Models used for analyses Choosing a model for the analysis is alsoinstead more complex of when analyzing static data data streams sets.that For the instance, fitted one model might has discover tocluding during be other the altered; variables. data e.g., Firstly, stream excluding adapting current modelsif covariates during the and/or the required in- stream information is is only stored: possible retrievable. information Secondly, which even is if the not information stored,during is might the available, stream not might including be complicated. more The variables applicationgood study example in Chapter of such a situation.itored In this over application, three individuals’ years. weight was Theof mon- original which authors month the of data thiswould were study collected. require controlled observations for Fitting in the that each effect sameobserved month. model is in However, usually not waiting a an for data option. all stream, while For months the such to data cases, are an be entering approach to should include be covariates developed. also fitted to the data stream. Inof this methods, thesis, when I a did not different model explicitly was explore thebe fit robustness adapted to during the the data data or stream. how When thebe analyzing fitted static fitted model data, could to different the models data can and comparedsures, which e.g., model AIC fits or best using BIC. goodness-of-fitin An mea- parallel option and for decide data later streams which could be model to is fit preferred. multiple However, models evaluating which , ). The 2013 , Neal & Hinton ; ). The other develop- 2009 Chapter 6: Discussion , 2016 , Yang & Dunson a heuristic procedure is used to monitor 4 Cappé & Moulines Kabisa (Tchumtchoua), Dunson, & Morris , could decrease this influence in the parameter estimates. ; 6.2.1 2013 , ), though currently, I have not extensively explored a formal procedure to check Even though parameter values are likely to change in a data stream, it is of In the context of static data, one hopes to find stable parameter estimates, as an A multi-start procedure is also possible in a data stream, by running the same sions of EM algorithm have been1998 studied ( whether SEMA has converged. In Chapter previous data generating model. Introducingin a Section forgetting factor, such as described interest to see whetherLikelihood the values current of parameter the data values seen are so close far. to Theoretical the convergence Maximum of online ver- stable parameter estimates are not necessarilygenerating desirable; model, if due data to enter a thatmates, shift parameter of do estimates the not should data change fit accordingly. The wellods developed do online with not meth- the ’forget’ older current data and parameter itble. esti- is As assumed a that result, the when data the generating data modelor generating is direction model sta- changes of over an time, effect i.e., the changes, strength the parameter estimates will be influenced by the optimum. Additionally, due todata the given the influx current of parameter estimates new andchange. data, the These parameter both two estimates reasons the themselves of likelihood why thevergence of likelihood in the changes, a complicate data studying stream. con- indication that the algorithm converged. However, when analyzing data streams, a given threshold, the algorithmalgorithm might converged. converge to a When local optimum the instead.help function A to multi-start is procedure find might not the global convex, maximum. an analysis started at different points in parallel. However,it in the is context of no data streams, longer feasible to repeatedly go over the data to find an (global or local) 6.3.1 Convergence First, the exact meaning ofcompared convergence to is static less data. clear in Inmight repeatedly practice, the go context when over analyzing of the data static data tovalues data streams find are sets, the compared optimum over an of consecutive algorithm iterations a and function. if The these parameter values change less than SEMA approach presented in thisduring a thesis, data which stream, might involves beation updating relevant of to the the this likelihood likelihood field is of necessary research if as one well wishes since to update evalu- the posterior. 6.3 Data stream challenges 112 & Jordan ment, sMCMC, provides an appealing extension tois popular computationally MCMC attractive methods since which the generatedposed MCMC to draws sampled are anew updated when as op- additional data enters (

Chapter 6 hte EAhscnegd nChapter In converged. has ver- SEMA whether online of Maximum convergence ( the Theoretical studied 1998 to been far. have close algorithm so EM are seen of values data sions the parameter of current values the Likelihood whether see to interest described as such the factor, by forgetting Section influenced a in be Introducing will model. estimates generating parameter data the strength previous changes, the i.e., effect time, an over of changes sta- model direction is generating or model data generating the data when the result, that a assumed As is esti- ble. it parameter and data current older ’forget’ the meth- not with online do developed ods well The accordingly. fit change data not should the estimates do of parameter shift mates, that a enter to data due if model, streams, desirable; generating data necessarily not analyzing are when estimates However, parameter stable converged. algorithm the that indication con- stream. studying data complicate a changes, the in likelihood of local) vergence the why likelihood or of themselves the (global reasons estimates two both parameter an These the data, find change. and new estimates to parameter of data current influx the the given the over data to go due repeatedly Additionally, to feasible optimum. longer streams, data no of context is the in it However, parallel. in points different at started analysis an maximum. convex, global the not might find procedure is multi-start to A function help instead. the optimum local When a to converge converged. might algorithm algorithm than the less change threshold, values parameter given these The if a function. and a iterations algorithm consecutive of an over optimum compared the sets, are find streams data values to data static data the of analyzing over when context go the practice, repeatedly might In in clear data. less static is to convergence compared of meaning exact the First, Convergence 6.3.1 challenges stream Data 6.3 posterior. the evalu- update to since wishes well one as if research necessary of is field likelihood likelihood this the the to of relevant updating ation be involves might stream, which data thesis, a during this ( in enters presented data approach additional op- SEMA as when updated anew are sampled draws to MCMC posed generated the which since methods attractive MCMC computationally popular is to extension appealing an provides sMCMC, ment, Jordan & 112 vntog aaee ausaelkl ocag nadt tem ti of is it stream, data a in change to likely are values parameter though Even an as estimates, parameter stable find to hopes one data, static of context the In same the running by stream, data a in possible also is procedure multi-start A ,tog urnl,Ihv o xesvl xlrdafra rcdr ocheck to procedure formal a explored extensively not have I currently, though ), , 2013 6.2.1 ; ol eraeti nunei h aaee estimates. parameter the in influence this decrease could , aia(cuthu) usn Morris & Dunson, (Tchumtchoua), Kabisa ap Moulines & Cappé 4 ersi rcdr sue omonitor to used is procedure heuristic a ag&Dunson & Yang , 2016 , hpe :Discussion 6: Chapter 2009 .Teohrdevelop- other The ). ; el&Hinton & Neal , 2013 .The ). , ntedt tem uha paigshm ol aei osbet mlyan employ to possible streams. data it in make measures could goodness-of-fit scheme common of updating return approximation an individuals online Such the stream. when data updated the con- be in log-likelihood also individual could of sum contributions a these is summations tributions, likelihood fact log in the are Since statistics individuals. the sufficient over to These point enters, data point statistics. this sufficient data to data new belonging complete individual a the when of contribution algorithm: the SEMA updates the SEMA by inspired be could streams an data be a complex. could in BIC models and models comparing AIC competing using makes stream the which values, between starting values poorly-chosen BIC of result, artifact or a As AIC solution. in Likelihood Maximum differences the beginning the from the far in probably stream especially data and the stream, updated of data continuously a are In estimates estimates. The parameter parameter current the sets. the like- data log using the static computed on is in based are which than models, lihood, different stream compare data to used a commonly which in BIC, evaluating and models straightforward However, AIC multiple less preferred. fit is is to best, model be is could which model streams later data decide for and option parallel mea- An in goodness-of-fit BIC. using best or fits AIC model e.g., which sures, compared and can data models the different to could data, model fitted static fitted analyzing be the When how stream. or data data the the during to adapted robustness fit be the explore was explicitly model different not did a I when thesis, methods, this of In stream. data the to fitted also developed. covariates be include to should approach entering be an are cases, data to such the months For while stream, all option. data for an a waiting not usually However, in is model month. observed same effect each that the in Fitting for observations controlled require collected. study were would this data of the month authors which original mon- of The was weight years. individuals’ three application, over this In itored situation. a such of Chapter in example study good application variables The more complicated. be including might not stream available, the might is during stored, information not the if is even which Secondly, information retrievable. possible stored: only is is information stream in- required the and/or the during covariates if models current adapting excluding stream Firstly, e.g., data variables. altered; the other be during cluding to discover has might model one fitted instance, the For that sets. streams data data static analyzing when of complex more instead also is analysis the for model a Choosing analyses for used Models 6.3.2 conver- research. Studying future for time. direction interesting over an stable is more gence become estimates parameter the whether Discussion 6: Chapter nitrsigdrcinfrftr eerht opr optn oesi data in models competing compare to research future for direction interesting An were data the generated which models the studies, simulation presented the In 5 rvdsa provides 113

Chapter 6 , . It 115 2 ). To es- ). ). When the 2007 , Neal & Hinton 1999 , 2011 , -value is related to the p intended to sequentially not Simmons et al. ). An online bootstrap procedure McLachlan & Krishnan 2012 , -value becomes smaller, given that the sample p -value also becomes small, even though the effect challenges of analyzing data streams and potential p 6.3 Owen & Eckles ). -value is small enough ( p 2001 Wilkinson & Task Force on Statistical Inference , -value is questionable. The size of the p , it was illustrated how other commonly-used techniques for the anal- -values ( 6.2 p Thiesson et al. ; statistical hypotheses. This issue has also been touched upon in Chapter Besides the inflated Type I error issue, given the research context is one of either Even when the focus is on effect sizes, it is of interest to get some insight in the -value smaller than the chosen Type I error rate without the effect being present in 6.5 Future research directions forIn SEMA this chapter, various directions for future researchIn in data Section streams were presented. ysis of data streams could beoped implemented in and this improve thesis. the In online Section directions methods for devel- solutions were discussed. Inspecifically this for last SEMA section, are several presented. future directions Chapter 6: Discussion who dropped out or do notreturn in return the that data often, stream when (which individuals1998 is similar related to to them partial do EM algorithm, 6.4 Null Hypothesis Significance Testing Whereas the methods discussed in this thesistially can while be used the to data make are decisions sequen- entering, these techniques are test is considered a Questionable Research Practicebased when on the whether data collection the isdata stopped collection is stopped based onconclusions the could outcome be of drawn the based on nullp the hypothesis test test, results: invalid the probabilitythe of population obtaining could a be severely inflated. extremely large data sets orusefulness data of streams the which often resulteffect in size large (e.g., data the sets, observedWhen the difference the between effect two size groups) increases, andsize the the remains sample the size. same. Oneffect the size other remains hand, the same, when the thesize sample is size practically increases not and meaningful. the instead It of is, therefore, preferable to focus on effect sizes variance of the estimate ofthe the effect uncertainty size. of an A commonly-used estimatesamples, approach is standard to a errors estimate bootstrapping can procedure. betimate estimated standard Using errors ( bootstrapped in datadures streams, are computationally-efficient available bootstrap (e.g., proce- supplementing the developed methods provides more insightobtained of effect the sizes. certainty of the 4 , the contri- 3 data points and ...n Chapter 6: Discussion =1 i van der Palm, van der Ark, & ; 2006 , covariates are observed at once. Dealing with covari- covariates. In the case of data streams, in addition to . In these chapters, the SEMA algorithm did additional k 5 k and ) and the parameters of the multilevel models in Chapter 4 3 covariates for each data point, some covariates might not enter k ), dealing with missingness in a data stream complicates the issue of , the data of all i 2016 , are updated in two steps: first the previous contributions of this individual is 5 While solutions to attrition in data streams have not been extensively studied in Donders, van der Heijden, Stijnen, & Moons runs over all the individualscontributions at of all given individuals were intervals. updated, including those Duringin who these the had not additional data returned stream. runs, the Alternatively, one could also update the contributions of those this thesis, two approaches have beenbutions implemented. of those First, respondents in which Chapter hadstudy, indicated were to stop removed participating from in the thewhen panel a summary respondent stops statistics. participating is not However,is always available. this applied in The information Chapter second approach of is added to the summary statistics.new data The point. parameter However, estimates these are individual contributions updated toare with these only each summary updated statistics when an individual returns.entered, So, all except the for remaining the contributions individual who areestimates. just still based As on a the result, ’outdated’outdated when parameter contributions an are not individual updated does with the not new return parameter in estimates. the data stream, its become biased. The methods developed in this thesistion are due even to more affected the by update attri- schemeage of factors the (Chapter parameter estimates. Theand estimates of shrink- subtracted of the summary statistics of the parameters, second the new contribution 6.3.4 Attrition Lastly, a challenge common for longitudinalspondents research quit in participating general in is the attrition,the study. i.e., generalizability Attrition re- of is the a sample. threatent If because respondents, a it drop subgroup can out of of affect respondents, the e.g., study, the the less parameter afflu- estimates of the model could ( Vermunt having incomplete observations even further.observation In this thesis, it is assumed that for not observing all all at the same time.trieved Some from information of memory, for the covariates instance couldinformation a be might level potentially be 2 be missing covariate re- or like onlyrespondent gender. later drop in in However, the later, other data e.g., stream. learning While the missingness gender is a of research the area on its own 114 6.3.3 Missingness Especially when a model containsates many are covariates, observed it for is eacheach likely data data that point point. not consists all of Let covari- us assume ates that do not enter at thearea, same though time or not do studied not in enter this at thesis. all, is an interesting research

Chapter 6 tsta ontetra h aetm rd o ne tal sa neetn research interesting an is all, thesis. at this enter in not studied do not or time though same area, the at enter not do that ates ntedt tem lentvl,oecudas paetecnrbtoso those of contributions the update also could one Alternatively, the runs, stream. returned data additional not had the these who in During those including updated, intervals. were individuals given all of at contributions individuals the all over runs of approach second Chapter information The in applied this available. always is However, not is participating statistics. stops respondent summary a panel when the the in from participating removed stop to were indicated study, had Chapter which in respondents First, those of implemented. butions been have approaches two its thesis, stream, this data the estimates. in parameter return new not the with does updated individual not are an contributions parameter when outdated ’outdated’ result, the a on As based still just estimates. are who individual contributions the remaining for the except all So, entered, returns. individual an when statistics updated summary each only these with are to updated contributions individual are these estimates However, parameter point. The data new statistics. contribution summary new the the to second added parameters, is the of statistics summary the of subtracted shrink- of estimates and The estimates. parameter (Chapter the factors of age scheme attri- update by the affected more to could even due model are tion thesis the this of in developed estimates afflu- methods parameter The less the biased. the study, become e.g., the respondents, affect of of out can subgroup drop it a respondents, because If ent threat sample. a the is of re- Attrition generalizability i.e., study. the attrition, the is in general participating in quit research spondents longitudinal for common challenge a Lastly, Attrition 6.3.4 for that assumed is it thesis, this In observation further. even observations incomplete having own its on Vermunt area the research of a is gender missingness the While ( learning stream. e.g., data other later, the However, in in drop later gender. respondent only like or re- covariate missing be 2 be potentially level might be a information could instance covariates the for memory, of information from Some trieved time. same the at all all assume observing us covari- not Let of all consists not point. point that data data likely each each is for it observed covariates, are many ates contains model a when Especially Missingness 6.3.3 114 odr,vndrHidn tje,&Moons & Stijnen, Heijden, der van Donders, hl ouin oatiini aasrashv o enetnieysuidin studied extensively been not have streams data in attrition to solutions While 5 r pae ntoses rttepeiu otiuin fti niiulis individual this of contributions previous the first steps: two in updated are , 2016 i h aao all of data the , ,daigwt isnns nadt temcmlctsteiseof issue the complicates stream data a in missingness with dealing ), k oaitsfrec aapit oecvrae ih o enter not might covariates some point, data each for covariates 3 4 n h aaeeso h utlvlmdl nChapter in models multilevel the of parameters the and ) and k 5 k nteecatr,teSM loih i additional did algorithm SEMA the chapters, these In . oaits ntecs fdt tem,i diinto addition in streams, data of case the In covariates. oaitsaeosre toc.Daigwt covari- with Dealing once. at observed are covariates , 2006 ; a e am a e r,& Ark, der van Palm, der van i =1 hpe :Discussion 6: Chapter ...n aapit and points data 3 h contri- the , 4 upeetn h eeoe ehd rvdsmr nih ftecranyo the of certainty sizes. the effect of obtained insight more provides methods developed the supplementing proce- (e.g., bootstrap available computationally-efficient are streams, dures data in bootstrapped ( errors Using standard estimated timate be procedure. can bootstrapping estimate errors a to standard is approach samples, estimate commonly-used A an of size. uncertainty effect the the of estimate the of variance iei rcial o ennfl ti,teeoe rfrbet ou nefc sizes effect on focus to preferable therefore, is, of It instead the meaningful. and not increases practically size is sample size the the when same, the hand, remains other size the effect On same. size. the sample remains the the size and increases, groups) size two effect between the difference the When observed sets, the data (e.g., large size in effect result often which the streams of data usefulness or sets data large extremely inflated. severely be a could obtaining population of the probability the invalid results: test, test hypothesis the p null on based the drawn of be outcome could the conclusions on based stopped is collection stopped data is the collection data whether the on when based Practice Research Questionable a considered is test ietosfrsltoswr icse.I hsls eto,svrlftr directions future presented. several are section, SEMA last for this specifically In discussed. were solutions devel- for methods directions Section online In the thesis. improve this and in implemented oped be could streams data of ysis presented. were streams Section data in In research future for directions various chapter, this SEMA In for directions research Future 6.5 are techniques these entering, sequen- decisions are make data to the used be while can tially thesis this in discussed methods the Whereas Testing Significance Hypothesis Null 6.4 algorithm, EM do partial them to to related similar is 1998 individuals (which when stream often, data that the return in return not do or out dropped who Discussion 6: Chapter vlesalrta h hsnTp ro aewtotteefc en rsn in present being effect the without rate error I Type chosen the than smaller -value vnwe h ou so fetszs ti fitrs ogtsm nih nthe in insight some get to interest of is it sizes, effect on is focus the when Even eie h nae yeIerrise ie h eerhcneti n feither of one is context research the given issue, error I Type inflated the Besides ttsia yohss hsisehsas entuhduo nChapter in upon touched been also has issue This hypotheses. statistical ; hesne al. et Thiesson p 6.2 vle ( -values twsilsrtdhwohrcmol-sdtcnqe o h anal- the for techniques commonly-used other how illustrated was it , p vlei usinbe h ieo the of size The questionable. is -value , ikno akFreo ttsia Inference Statistical on Force Task & Wilkinson 2001 p vlei ml nuh( enough small is -value ). wn&Eckles & Owen 6.3 p hlegso nlzn aasrasadpotential and streams data analyzing of challenges vleas eoe ml,ee huhteeffect the though even small, becomes also -value p vlebcmssalr ie httesample the that given smaller, becomes -value , 2012 caha Krishnan & McLachlan .A niebosrpprocedure bootstrap online An ). imn tal. et Simmons not neddt sequentially to intended p vlei eae othe to related is -value , 2011 , 1999 el&Hinton & Neal , 2007 .We the When ). ). .T es- To ). 2 115 It . ,

Chapter 6 116 Chapter 6: Discussion

While SEMA is extended in Chapter 5 to fit linear multilevel models with fixed and random effects, other model extensions yet have to be developed. For example, SEMA as described in this thesis, cannot fit a model with more than two levels. As an illustration let us return to the baseball example from Chapter 1: a two-level model could be the batting observations nested within the baseball players; the third level References in this example are the teams in which the baseball players are nested. Another ex- tension of the SEMA algorithm could be a crossed . This model assumes that the observations are nested within more than one grouping, whereby Aggarwal, C. C. (Ed.). (2007). Data streams: Models and algorithms. Springer US. doi: the groupings themselves are not nested: observations nested within customers and 10.1007/978-0-387-47534-9 the same observations are also nested within different webpages. Agresti, A. (2002). Catagorical Data Analysis (2nd ed.). Wiley series in probability and Finally, the SEMA algorithm currently fits linear multilevel models. The linear statistics. doi: 10.1002/0471249688 multilevel models are part of a larger framework of multilevel models, namely the Agresti, A., Booth, J. G., Hobert, J. P., & Caffo, B. (2000). Random-effects modeling of Generalized Linear Mixed Models (GLMM, Skrondal & Rabe-Hesketh, 2004). The categorical response data. Sociological Methodology, 30(1), 27–80. doi: 10.1111/ GLMM framework also contains multilevel models for variables which are not con- 0081-1750.t01-1-00075 tinuous. However, if the outcome variable is categorical, model fitting could be Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis, G. K., & Taha, K. (2015). severely complicated because the likelihood function can often not be straightfor- Efficient machine learning for big data: A review. Big Data Research, 2(3), 87 wardly maximized (Bock & Aitkin, 1981; Breslow & Clayton, 1993), which is true – 93. (Big Data, Analytics, and High-Performance Computing) doi: 10.1016/ even in static data sets. Chapter 3 presents some heuristic online methods to deal j.bdr.2015.04.001 with binary outcomes in the context of data streams with nested observations and it Anderson, C. J. (2000). Economic voting and political context: a comparative per- might be interesting to explore whether these can be converted into full multilevel spective. Electoral Studies, 19(2-3), 151–170. doi: 10.1016/s0261-3794(99)00045 models. -1 Armitage, P., McPherson, C., & Rowe, B. (1969). Repeated Significance Tests on Accumulating Data. Journal of the Royal Statistical Society. Series A (General), 132(2), 235–244. doi: 10.2307/2343787 Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic & physiological optics, 34(5), 502–508. doi: 10.1111/opo.12131 Arvas, E., & Sevgi, L. (2012). A Tutorial on the Method of Moments. Antennas and Propagation Magazine, IEEE, 54(3), 260–275. doi: 10.1109/MAP.2012.6294003 Atallah, M. J., Cole, R., & Goodrich, M. T. (1989). Cascading Divide-and-Conquer: A Technique for Designing Parallel Algorithms. SIAM Journal on Computing, 18(3), 499–532. doi: 10.1137/0218035 Barrett, L. F., & Barrett, D. J. (2001). An Introduction to Computerized Experience Sampling in Psychology. Social Science Computer Review, 19(2), 175–185. doi: 10.1177/089443930101900204 Barrett, P., Zhang, Y., Moffat, J., & Kobbacy, K. (2013). A holistic, multi-level analysis identifying the impact of classroom design on pupils’ learning. Building and Environment, 59, 678–689. doi: 10.1016/j.buildenv.2012.09.016 Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/ jss.v067.i01 Beck, E. N. (2015). The Invisible Digital Identity: Assemblages in Digital Networks. Computers and Composition, 35, 125–140. doi: 10.1016/j.compcom.2015.01.005 116 Chapter 6: Discussion

While SEMA is extended in Chapter 5 to fit linear multilevel models with fixed and random effects, other model extensions yet have to be developed. For example, SEMA as described in this thesis, cannot fit a model with more than two levels. As an illustration let us return to the baseball example from Chapter 1: a two-level model could be the batting observations nested within the baseball players; the third level References in this example are the teams in which the baseball players are nested. Another ex- tension of the SEMA algorithm could be a crossed random effects model. This model assumes that the observations are nested within more than one grouping, whereby Aggarwal, C. C. (Ed.). (2007). Data streams: Models and algorithms. Springer US. doi: the groupings themselves are not nested: observations nested within customers and 10.1007/978-0-387-47534-9 the same observations are also nested within different webpages. Agresti, A. (2002). Catagorical Data Analysis (2nd ed.). Wiley series in probability and Finally, the SEMA algorithm currently fits linear multilevel models. The linear statistics. doi: 10.1002/0471249688 multilevel models are part of a larger framework of multilevel models, namely the Agresti, A., Booth, J. G., Hobert, J. P., & Caffo, B. (2000). Random-effects modeling of Generalized Linear Mixed Models (GLMM, Skrondal & Rabe-Hesketh, 2004). The categorical response data. Sociological Methodology, 30(1), 27–80. doi: 10.1111/ GLMM framework also contains multilevel models for variables which are not con- 0081-1750.t01-1-00075 tinuous. However, if the outcome variable is categorical, model fitting could be Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis, G. K., & Taha, K. (2015). severely complicated because the likelihood function can often not be straightfor- Efficient machine learning for big data: A review. Big Data Research, 2(3), 87 wardly maximized (Bock & Aitkin, 1981; Breslow & Clayton, 1993), which is true – 93. (Big Data, Analytics, and High-Performance Computing) doi: 10.1016/ even in static data sets. Chapter 3 presents some heuristic online methods to deal j.bdr.2015.04.001 with binary outcomes in the context of data streams with nested observations and it Anderson, C. J. (2000). Economic voting and political context: a comparative per- might be interesting to explore whether these can be converted into full multilevel spective. Electoral Studies, 19(2-3), 151–170. doi: 10.1016/s0261-3794(99)00045 models. -1 Armitage, P., McPherson, C., & Rowe, B. (1969). Repeated Significance Tests on Accumulating Data. Journal of the Royal Statistical Society. Series A (General), 132(2), 235–244. doi: 10.2307/2343787 Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic & physiological optics, 34(5), 502–508. doi: 10.1111/opo.12131 Arvas, E., & Sevgi, L. (2012). A Tutorial on the Method of Moments. Antennas and Propagation Magazine, IEEE, 54(3), 260–275. doi: 10.1109/MAP.2012.6294003 Atallah, M. J., Cole, R., & Goodrich, M. T. (1989). Cascading Divide-and-Conquer: A Technique for Designing Parallel Algorithms. SIAM Journal on Computing, 18(3), 499–532. doi: 10.1137/0218035 Barrett, L. F., & Barrett, D. J. (2001). An Introduction to Computerized Experience Sampling in Psychology. Social Science Computer Review, 19(2), 175–185. doi: 10.1177/089443930101900204 Barrett, P., Zhang, Y., Moffat, J., & Kobbacy, K. (2013). A holistic, multi-level analysis identifying the impact of classroom design on pupils’ learning. Building and Environment, 59, 678–689. doi: 10.1016/j.buildenv.2012.09.016 Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. doi: 10.18637/ jss.v067.i01 Beck, E. N. (2015). The Invisible Digital Identity: Assemblages in Digital Networks. Computers and Composition, 35, 125–140. doi: 10.1016/j.compcom.2015.01.005 118 References References 119

Berlinet, A. F., & Roland, C. (2012). Acceleration of the EM algorithm: P-EM versus Carmona, C. J., Ramírez-Gallego, S., Torres, F., Bernal, E., Del Jesus, M. J., & García, epsilon algorithm. Computational Statistics & Data Analysis, 56(12), 4122–4137. S. (2012). Web usage mining to improve the design of an e-commerce website: doi: 10.1016/j.csda.2012.03.005 OrOliveSur.com. Expert Systems with Applications, 39(12), 11243–11249. doi: Berthold, M. R., Cebron, N., Dill, F., Gabriel, T. R., Kötter, T., Meinl, T., . . . Wiswedel, 10.1016/j.eswa.2012.03.046 B. (2009). KNIME – The Konstanz Information Miner. ACM SIGKDD Explo- Cheng, H., & Cantú-Paz, E. (2010). Personalized click prediction in sponsored rations Newsletter, 11(1), 26–31. doi: 10.1145/1656274.1656280 search. In Proceedings of the third acm international conference on web search Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). MOA: Massive Online and data mining - wsdm ’10 (pp. 351–360). New York, USA: ACM. doi: Analysis. The Journal of Machine Learning Research, 11, 1601–1604. doi: 10.1.1 10.1145/1718487.1718531 .180.8004 Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. Bock, R. D., & Aitkin, M. (1981). EM Solution of the Marginal Likelihood Equations. (2006). Map-reduce for machine learning on multicore. In Proceedings of the Psychometrika, 46(4), 443–459. doi: 10.1007/BF02293801 19th international conference on neural information processing systems (pp. 281– Böse, J.-H., Andrzejak, A., & Högqvist, M. (2010). Beyond online aggregation: Par- 288). Cambridge, MA, USA: MIT Press. allel and incremental data mining with online map-reduce. In Proceedings of Cortes, C., Fisher, K., Pregibon, D., Rogers, A., & Smith, F. (2000). Hancock: A the 2010 workshop on massive data analytics on the cloud – mdac ’10 (pp. 3:1–3:6). Language for Extracting Signatures from Data Streams. In Proc. of the sixth acm New York, USA: ACM. doi: 10.1145/1779599.1779602 international conference on knowledge discovery and data mining (sigkdd) (pp. 9–17). Bottou, L. (1999). On-line learning and stochastic approximations. In D. Saad (Ed.), Boston. On-line learning in neural networks (pp. 9–42). Cambridge University Press. doi: Curtin, R., Singer, E., & Presser, S. (2007). Incentives in Random Digit Dial Telephone 10.1017/cbo9780511569920.003 Surveys: A and Extension. Journal of Official Statistics, 23(1), 91– Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. 105. In Proceedings of the 19th international conference on computational statistics (comp- Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining Stream Statistics stat’2010) (pp. 177–187). doi: 10.1007/978-3-7908-2604-3_16 over Sliding Windows. SIAM Journal on Computing, 31(6), 1794–1813. doi: Breslow, N. E., & Clayton, D. G. (1993). Approximate Inference in Generalized Linear 10.1137/S0097539701398363 Mixed Models. Journal of the American Statistical Association, 88(421), 9–25. doi: Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big 10.2307/2290687 data issues in Scientific Data Infrastructure. Proceedings of the 2013 Interna- Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., & Jordan, M. I. (2013). Streaming tional Conference on Collaboration Technologies and Systems, CTS 2013, 48–55. doi: variational bayes. In Advances in neural information processing systems 26: 27th 10.1109/CTS.2013.6567203 annual conference on neural information processing systems 2013. proceedings of a Demidenko, E. (2004). Mixed models: Theory and applications. Wiley Series in Proba- meeting held december 5-8, 2013, lake tahoe, nevada, united states. (pp. 1727–1735). bility and Statistics. doi: 10.1002/0471728438 Browne, W., & Goldstein, H. (2010). MCMC sampling for a multilevel model with Dempster, A. P., Laird, N. M., & Rubin, D. (1977). Maximum Likelihood from Incom- nonindependent residuals within and between cluster units. Journal of Educa- plete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B tional and Behavioral Statistics, 35(4), 453–473. doi: 10.3102/1076998609359788 (Methodological), 39(1), 1–38. Buskirk, T. D., & Andrus, C. (2012). Smart Surveys for Smart Phones: Exploring Donders, A. R. T., van der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). Re- Various Approaches for Conducting Online Mobile Surveys via Smartphones. view: A gentle introduction to imputation of missing values. Journal of Clinical Survey Practice, 5(1), 1–11. Epidemiology, 59(10), 1087–1091. doi: 10.1016/j.jclinepi.2006.01.014 Cappé, O. (2011a). Online EM Algorithm for Hidden Markov Models. Journal of Efraimidis, P. S., & Spirakis, P. G. (2006). Weighted random sampling with a reser- Computational and Graphical Statistics, 20(3), 1–20. doi: 10.1198/jcgs.2011.09109 voir. Information Processing Letters, 97(5), 181–185. doi: 10.1016/j.ipl.2005.11 Cappé, O. (2011b). Online Expectation-Maximisation. In K. Mengersen, M. Titter- .003 ington, C. Robert, & P. Robert (Eds.), Mixtures: Estimation and applications (pp. Efron, B., & Morris, C. (1977). Stein’s Paradox in Statistics. Scientific American, 236, 31–53). John Wiley & Sons, Ltd. doi: 10.1002/9781119995678.ch2 119–127. doi: 10.1038/scientificamerican0577-119 Cappé, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for Emmons, K. M., Wechsler, H., Dowdall, G., & Abraham, M. (1998). Predictors of latent data models. Journal of the Royal Statistical Society. Series B: Statistical smoking among us college students. American journal of public health, 88(1), Methodology, 71(3), 593–613. doi: 10.1111/j.1467-9868.2009.00698.x 104–7. doi: 10.2105/AJPH.88.1.104 118 References References 119

Berlinet, A. F., & Roland, C. (2012). Acceleration of the EM algorithm: P-EM versus Carmona, C. J., Ramírez-Gallego, S., Torres, F., Bernal, E., Del Jesus, M. J., & García, epsilon algorithm. Computational Statistics & Data Analysis, 56(12), 4122–4137. S. (2012). Web usage mining to improve the design of an e-commerce website: doi: 10.1016/j.csda.2012.03.005 OrOliveSur.com. Expert Systems with Applications, 39(12), 11243–11249. doi: Berthold, M. R., Cebron, N., Dill, F., Gabriel, T. R., Kötter, T., Meinl, T., . . . Wiswedel, 10.1016/j.eswa.2012.03.046 B. (2009). KNIME – The Konstanz Information Miner. ACM SIGKDD Explo- Cheng, H., & Cantú-Paz, E. (2010). Personalized click prediction in sponsored rations Newsletter, 11(1), 26–31. doi: 10.1145/1656274.1656280 search. In Proceedings of the third acm international conference on web search Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). MOA: Massive Online and data mining - wsdm ’10 (pp. 351–360). New York, USA: ACM. doi: Analysis. The Journal of Machine Learning Research, 11, 1601–1604. doi: 10.1.1 10.1145/1718487.1718531 .180.8004 Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. Bock, R. D., & Aitkin, M. (1981). EM Solution of the Marginal Likelihood Equations. (2006). Map-reduce for machine learning on multicore. In Proceedings of the Psychometrika, 46(4), 443–459. doi: 10.1007/BF02293801 19th international conference on neural information processing systems (pp. 281– Böse, J.-H., Andrzejak, A., & Högqvist, M. (2010). Beyond online aggregation: Par- 288). Cambridge, MA, USA: MIT Press. allel and incremental data mining with online map-reduce. In Proceedings of Cortes, C., Fisher, K., Pregibon, D., Rogers, A., & Smith, F. (2000). Hancock: A the 2010 workshop on massive data analytics on the cloud – mdac ’10 (pp. 3:1–3:6). Language for Extracting Signatures from Data Streams. In Proc. of the sixth acm New York, USA: ACM. doi: 10.1145/1779599.1779602 international conference on knowledge discovery and data mining (sigkdd) (pp. 9–17). Bottou, L. (1999). On-line learning and stochastic approximations. In D. Saad (Ed.), Boston. On-line learning in neural networks (pp. 9–42). Cambridge University Press. doi: Curtin, R., Singer, E., & Presser, S. (2007). Incentives in Random Digit Dial Telephone 10.1017/cbo9780511569920.003 Surveys: A Replication and Extension. Journal of Official Statistics, 23(1), 91– Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. 105. In Proceedings of the 19th international conference on computational statistics (comp- Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining Stream Statistics stat’2010) (pp. 177–187). doi: 10.1007/978-3-7908-2604-3_16 over Sliding Windows. SIAM Journal on Computing, 31(6), 1794–1813. doi: Breslow, N. E., & Clayton, D. G. (1993). Approximate Inference in Generalized Linear 10.1137/S0097539701398363 Mixed Models. Journal of the American Statistical Association, 88(421), 9–25. doi: Demchenko, Y., Grosso, P., De Laat, C., & Membrey, P. (2013). Addressing big 10.2307/2290687 data issues in Scientific Data Infrastructure. Proceedings of the 2013 Interna- Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., & Jordan, M. I. (2013). Streaming tional Conference on Collaboration Technologies and Systems, CTS 2013, 48–55. doi: variational bayes. In Advances in neural information processing systems 26: 27th 10.1109/CTS.2013.6567203 annual conference on neural information processing systems 2013. proceedings of a Demidenko, E. (2004). Mixed models: Theory and applications. Wiley Series in Proba- meeting held december 5-8, 2013, lake tahoe, nevada, united states. (pp. 1727–1735). bility and Statistics. doi: 10.1002/0471728438 Browne, W., & Goldstein, H. (2010). MCMC sampling for a multilevel model with Dempster, A. P., Laird, N. M., & Rubin, D. (1977). Maximum Likelihood from Incom- nonindependent residuals within and between cluster units. Journal of Educa- plete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B tional and Behavioral Statistics, 35(4), 453–473. doi: 10.3102/1076998609359788 (Methodological), 39(1), 1–38. Buskirk, T. D., & Andrus, C. (2012). Smart Surveys for Smart Phones: Exploring Donders, A. R. T., van der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). Re- Various Approaches for Conducting Online Mobile Surveys via Smartphones. view: A gentle introduction to imputation of missing values. Journal of Clinical Survey Practice, 5(1), 1–11. Epidemiology, 59(10), 1087–1091. doi: 10.1016/j.jclinepi.2006.01.014 Cappé, O. (2011a). Online EM Algorithm for Hidden Markov Models. Journal of Efraimidis, P. S., & Spirakis, P. G. (2006). Weighted random sampling with a reser- Computational and Graphical Statistics, 20(3), 1–20. doi: 10.1198/jcgs.2011.09109 voir. Information Processing Letters, 97(5), 181–185. doi: 10.1016/j.ipl.2005.11 Cappé, O. (2011b). Online Expectation-Maximisation. In K. Mengersen, M. Titter- .003 ington, C. Robert, & P. Robert (Eds.), Mixtures: Estimation and applications (pp. Efron, B., & Morris, C. (1977). Stein’s Paradox in Statistics. Scientific American, 236, 31–53). John Wiley & Sons, Ltd. doi: 10.1002/9781119995678.ch2 119–127. doi: 10.1038/scientificamerican0577-119 Cappé, O., & Moulines, E. (2009). On-line expectation-maximization algorithm for Emmons, K. M., Wechsler, H., Dowdall, G., & Abraham, M. (1998). Predictors of latent data models. Journal of the Royal Statistical Society. Series B: Statistical smoking among us college students. American journal of public health, 88(1), Methodology, 71(3), 593–613. doi: 10.1111/j.1467-9868.2009.00698.x 104–7. doi: 10.2105/AJPH.88.1.104 120 References References 121

Escobar, L., & Moser, E. (1993). A Note on the Updating of Regression Estimates. Proceedings of the fourth berkeley symposium on mathematical statistics and probabil- The American Statistician, 47(3), 192–194. doi: 10.2307/2684974 ity, volume 1: Contributions to the theory of statistics (Vol. 1, pp. 361–379). Berke- Gaber, M. M. (2012). Advances in data stream mining. Wiley Interdisciplinary Reviews: ley, Calif: University of California Press. doi: 10.1007/978-1-4612-0919-5\_30 Data Mining and Knowledge Discovery, 2(1), 79–85. doi: 10.1002/widm.52 John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Ques- Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams: A tionable Research Practices With Incentives for Truth Telling. Psychological Sci- review. SIGMOD, 34(2), 18–26. doi: 10.1145/1083784.1083789 ence, 23(5), 524–532. doi: 10.1177/0956797611430953 Gelman, A. (2007). Rich State, Poor State, Red State, Blue State: What’s the Matter Kabisa (Tchumtchoua), S., Dunson, D. B., & Morris, J. S. (2016). Online variational with Connecticut? Quarterly Journal of Political Science, 2(4), 345–367. doi: bayes inference for high-dimensional correlated data. Journal of Computational 10.1561/100.00006026 and Graphical Statistics, 25(2), 426-444. doi: 10.1080/10618600.2014.998336 Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data Analysis Kaptein, M. (2014). {RStorm}: Developing and Testing Streaming Algorithms in {R}. (2nd ed.). Chapman and Hall/CRC. The R Journal, 6(1), 123–132. Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Ligtning- models. Cambridge University Press. doi: 10.2277/0521867061 Fast Big Data Analysis. O’Reilly Media. Goldstein, H. (1986). Multilevel mixed analysis using iterative gener- Keith, T. (2014). Multiple regression and beyond (2nd ed.). Routledge. doi: 10.4324/ alized . Biometrika, 73(1), 43–56. doi: 10.1093/biomet/73.1.43 9781315749099 Goldstein, H., & McDonald, R. P. (1988). A general model for the analysis of multi- Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence level data. Psychometrika, 53(4), 455–467. doi: 10.1007/BF02294400 assumption in . Psychological Bulletin, 99(3), 422–431. doi: Goodman, J., & Blum, T. (1996). Assessing the non-random sampling effects of 10.1037/0033-2909.99.3.422 subject attrition in longitudinal research. Journal of Management, 22(4), 627–652. Killingsworth, M. A., & Gilbert, D. T. (2010). A Wandering Mind Is an Unhappy doi: 10.1016/S0149-2063(96)90027-6 Mind. Science, 330(6006), 932. doi: 10.1126/science.1192439 Hamaker, E. L., & Wichers, M. (2017). No time like the present. Current Directions in Kooreman, P., & Scherpenzeel, A. (2014). High frequency body mass measurement, Psychological Science, 26(1), 10-15. doi: 10.1177/0963721416666518 feedback, and health behaviors. Economics & Human Biology, 14, 141–153. doi: Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data Mining Use Cases and Busi- 10.1016/j.ehb.2013.12.003 ness Analytics Applications. Boca Raton, Florida, USA: Chapman & Hall/CRC. Lee, J., Podlaseck, M., Schonberg, E., & Hoch, R. (2001). Visualization and Analysis Hofmann, W., Adriaanse, M., Vohs, K. D., & Baumeister, R. F. (2014). Dieting and of Clickstream Data of Online Stores for Understanding Web Merchandising. the self-control of eating in everyday environments: An experience sampling Data Mining and Knowledge Discovery, 5, 59–84. doi: 10.1023/a:1009843912662 study. British Journal of Health Psychology, 19(3), 523–539. doi: 10.1111/bjhp Leeuw, E. D. D. (2005). To Mix or Not to Mix Data Collection Modes in Surveys. .12053 Journal of Official Statistics, 21(2), 233–255. Hsu, D. J., Karampatziakis, N., Langford, J., & Smola, A. J. (2011). Parallel online Liang, P., & Klein, D. (2009). Online EM for unsupervised models. Proceedings of learning. CoRR, abs/1103.4204. Retrieved from http://arxiv.org/abs/ Human Language Technologies The 2009 Annual Conference of the North American 1103.4204 Chapter of the Association for Computational Linguistics on NAACL 09, 611. doi: Ippel, L., Kaptein, M. C., & Vermunt, J. K. (2016a). Dealing with Data Streams: 10.3115/1620754.1620843 an Online, Row-by-Row, Estimation Tutorial. Methodology, 12, 124—138. doi: Linares, B., Guizar, J. M., Amador, N., Garcia, A., Miranda, V., Perez, J. R., & 10.1027/1614-2241/a000116 Chapela, R. (2010). Impact of air pollution on pulmonary function and res- Ippel, L., Kaptein, M. C., & Vermunt, J. K. (2016b). Estimating random-intercept piratory symptoms in children. Longitudinal repeated-measures study. BMC models on data streams. Computational Statistics & Data Analysis, 104, 169–182. Pulmonary Medicine, 10(1), 62. doi: 10.1186/1471-2466-10-62 doi: 10.1016/j.csda.2016.06.008 Liu, Z., Almhana, J., Choulakian, V., & McGorman, R. (2006). Online EM algorithm James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical for mixture with application to internet traffic modeling. Computational Statis- learning. Springer New York. doi: 10.1007/978-1-4614-7138-7 tics & Data Analysis, 50(4), 1052–1071. doi: 10.1016/j.csda.2004.11.002 James, W., & Stein, C. (1961). Estimation with Quadratic Loss. In J. Neyman (Ed.), Lynch, S. M. (2007). Introduction to applied bayesian statistics and estimation for social scientists. Springer New York. doi: 10.1007/978-0-387-71265-9 Manzo, A. N., & Burke, J. M. (2012). Increasing response rate in web-based/internet 120 References References 121

Escobar, L., & Moser, E. (1993). A Note on the Updating of Regression Estimates. Proceedings of the fourth berkeley symposium on mathematical statistics and probabil- The American Statistician, 47(3), 192–194. doi: 10.2307/2684974 ity, volume 1: Contributions to the theory of statistics (Vol. 1, pp. 361–379). Berke- Gaber, M. M. (2012). Advances in data stream mining. Wiley Interdisciplinary Reviews: ley, Calif: University of California Press. doi: 10.1007/978-1-4612-0919-5\_30 Data Mining and Knowledge Discovery, 2(1), 79–85. doi: 10.1002/widm.52 John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Ques- Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams: A tionable Research Practices With Incentives for Truth Telling. Psychological Sci- review. SIGMOD, 34(2), 18–26. doi: 10.1145/1083784.1083789 ence, 23(5), 524–532. doi: 10.1177/0956797611430953 Gelman, A. (2007). Rich State, Poor State, Red State, Blue State: What’s the Matter Kabisa (Tchumtchoua), S., Dunson, D. B., & Morris, J. S. (2016). Online variational with Connecticut? Quarterly Journal of Political Science, 2(4), 345–367. doi: bayes inference for high-dimensional correlated data. Journal of Computational 10.1561/100.00006026 and Graphical Statistics, 25(2), 426-444. doi: 10.1080/10618600.2014.998336 Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data Analysis Kaptein, M. (2014). {RStorm}: Developing and Testing Streaming Algorithms in {R}. (2nd ed.). Chapman and Hall/CRC. The R Journal, 6(1), 123–132. Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Ligtning- models. Cambridge University Press. doi: 10.2277/0521867061 Fast Big Data Analysis. O’Reilly Media. Goldstein, H. (1986). Multilevel mixed linear model analysis using iterative gener- Keith, T. (2014). Multiple regression and beyond (2nd ed.). Routledge. doi: 10.4324/ alized least squares. Biometrika, 73(1), 43–56. doi: 10.1093/biomet/73.1.43 9781315749099 Goldstein, H., & McDonald, R. P. (1988). A general model for the analysis of multi- Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence level data. Psychometrika, 53(4), 455–467. doi: 10.1007/BF02294400 assumption in analysis of variance. Psychological Bulletin, 99(3), 422–431. doi: Goodman, J., & Blum, T. (1996). Assessing the non-random sampling effects of 10.1037/0033-2909.99.3.422 subject attrition in longitudinal research. Journal of Management, 22(4), 627–652. Killingsworth, M. A., & Gilbert, D. T. (2010). A Wandering Mind Is an Unhappy doi: 10.1016/S0149-2063(96)90027-6 Mind. Science, 330(6006), 932. doi: 10.1126/science.1192439 Hamaker, E. L., & Wichers, M. (2017). No time like the present. Current Directions in Kooreman, P., & Scherpenzeel, A. (2014). High frequency body mass measurement, Psychological Science, 26(1), 10-15. doi: 10.1177/0963721416666518 feedback, and health behaviors. Economics & Human Biology, 14, 141–153. doi: Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data Mining Use Cases and Busi- 10.1016/j.ehb.2013.12.003 ness Analytics Applications. Boca Raton, Florida, USA: Chapman & Hall/CRC. Lee, J., Podlaseck, M., Schonberg, E., & Hoch, R. (2001). Visualization and Analysis Hofmann, W., Adriaanse, M., Vohs, K. D., & Baumeister, R. F. (2014). Dieting and of Clickstream Data of Online Stores for Understanding Web Merchandising. the self-control of eating in everyday environments: An experience sampling Data Mining and Knowledge Discovery, 5, 59–84. doi: 10.1023/a:1009843912662 study. British Journal of Health Psychology, 19(3), 523–539. doi: 10.1111/bjhp Leeuw, E. D. D. (2005). To Mix or Not to Mix Data Collection Modes in Surveys. .12053 Journal of Official Statistics, 21(2), 233–255. Hsu, D. J., Karampatziakis, N., Langford, J., & Smola, A. J. (2011). Parallel online Liang, P., & Klein, D. (2009). Online EM for unsupervised models. Proceedings of learning. CoRR, abs/1103.4204. Retrieved from http://arxiv.org/abs/ Human Language Technologies The 2009 Annual Conference of the North American 1103.4204 Chapter of the Association for Computational Linguistics on NAACL 09, 611. doi: Ippel, L., Kaptein, M. C., & Vermunt, J. K. (2016a). Dealing with Data Streams: 10.3115/1620754.1620843 an Online, Row-by-Row, Estimation Tutorial. Methodology, 12, 124—138. doi: Linares, B., Guizar, J. M., Amador, N., Garcia, A., Miranda, V., Perez, J. R., & 10.1027/1614-2241/a000116 Chapela, R. (2010). Impact of air pollution on pulmonary function and res- Ippel, L., Kaptein, M. C., & Vermunt, J. K. (2016b). Estimating random-intercept piratory symptoms in children. Longitudinal repeated-measures study. BMC models on data streams. Computational Statistics & Data Analysis, 104, 169–182. Pulmonary Medicine, 10(1), 62. doi: 10.1186/1471-2466-10-62 doi: 10.1016/j.csda.2016.06.008 Liu, Z., Almhana, J., Choulakian, V., & McGorman, R. (2006). Online EM algorithm James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical for mixture with application to internet traffic modeling. Computational Statis- learning. Springer New York. doi: 10.1007/978-1-4614-7138-7 tics & Data Analysis, 50(4), 1052–1071. doi: 10.1016/j.csda.2004.11.002 James, W., & Stein, C. (1961). Estimation with Quadratic Loss. In J. Neyman (Ed.), Lynch, S. M. (2007). Introduction to applied bayesian statistics and estimation for social scientists. Springer New York. doi: 10.1007/978-0-387-71265-9 Manzo, A. N., & Burke, J. M. (2012). Increasing response rate in web-based/internet 122 References References 123

surveys. In L. Gideon (Ed.), Handbook of survey methodology for the social sciences scalable formulas for parallel and online computation of higher-order multi- (pp. 327–343). New York, NY: Springer New York. doi: 10.1007/978-1-4614 variate central moments with arbitrary weights. Computational Statistics. doi: -3876-2\_19 10.1007/s00180-015-0637-z Marz, N., & Warren, J. (2013). Big Data: Principles and best practices of scalable realtime Pedro, S., Z., M. O., Baker, R., Bowers, A. J., & Heffernan, N. T. (2013). Predicting data systems. Greenwich, CT, USA: Manning Publications. college enrollment from student interaction with an intelligent tutoring system McLachlan, G., & Krishnan, T. (2007). Extensions of the EM algorithm. In Wiley in middle school. Proceedings of the 6th International Conference on Educational series in probability and statistics (2nd ed., pp. 159–218). John Wiley & Sons, Inc. Data Mining, 177–184. doi: 10.1002/9780470191613.ch5 Plackett, R. (1950). Some Theorems in Least Squares. Biometrika, 37(1/2), 149–157. McLachlan, G., & Peel, D. (2000). Finite Mixture Models. New York, USA: Wiley doi: 10.2307/2332158 series in probability and statistics. doi: 10.1002/0471721182 Powell, M. J. (2009). The bobyqa algorithm for bound constrained optimization Moerbeek, M., Van Breukelen, G., & Berger, M. (2003). A comparison of Estimation without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Methods for Multilevel Logistic Models. Computational Statistics, 18, 19–37. Cambridge. doi: 10.1007/s001800300130 Quintelier, E. (2010). The effect of schools on political participation: a multilevel Morris, C., & Lysy, M. (2012). Shrinkage Estimation in Multilevel Normal Models. logistic analysis. Research Papers in Education, 25(2), 137–154. doi: 10.1080/ Statistical Science, 27(1), 115–134. doi: 10.1214/11-STS363 02671520802524810 Murnaghan, D. A., Sihvonen, M., Leatherdale, S. T., & Kekki, P. (2007). The rela- R Core Team. (2013). R: A language and environment for statistical computing tionship between school-based smoking policies and prevention programs on [Computer software manual]. Vienna, Austria. smoking behavior among grade 12 students in Prince Edward Island: A mul- R Core Team. (2016). R: A language and environment for statistical computing tilevel analysis. Preventive Medicine, 44(4), 317–322. doi: 10.1016/j.ypmed.2007 [Computer software manual]. Vienna, Austria. .01.003 Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of gen- Myung, I. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical eralized linear mixed models using adaptive quadrature. Stata Journal, 2(1), Psychology, 47, 90–100. doi: 10.1016/S0022-2496(02)00028-7 1–21. Neal, R., & Hinton, G. E. (1998). A View Of The Em Algorithm That Justifies In- Raudenbush, S., & Bryk, A. (2002). Hierarchical Linear Models: applications and Data cremental, Sparse, And Other Variants. In Learning in graphical models (pp. Analysis Methods (2nd ed.; J. de Leeuw, Ed.). Thousand Oaks, California, USA: 355–368). doi: 10.1007/978-94-011-5014-9_12 Sage Pulication. Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed stream Robert, C. P. (2015). The Metropolis-Hastings algorithm. ArXiv e-prints: 1504.01896. computing platform. Proceedings - IEEE International Conference on Data Mining, Rubin, D., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychome- ICDM, 170–177. doi: 10.1109/ICDMW.2010.172 trika, 47(1), 69–76. doi: 10.1007/BF02293851 Ng, S. K., & McLachlan, G. (2003). On the choice of the number of blocks with Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. International Conference the incremental EM algorithm for the fitting of normal mixtures. Statistics and on Collaboration Technologies and Systems (CTS), 42–47. doi: 10.1109/CTS.2013 Computing, 13(1), 45–55. doi: 10.1023/A:1021987710829 .6567202 Opper, M. (1998). A Bayesian Approach to Online Learning. In D. Saad (Ed.), On-line Schaul, T., Zhang, S., & LeCun, Y. (2013). No More Pesky Learning Rates. Journal of learning in neural networks (pp. 363–378). Cambridge: Cambridge University Machine Learning Research, 28(3), 343–351. Press. doi: 10.1017/cbo9780511569920.017 Shalev-Shwartz, S. (2011). Online Learning and Online Convex Optimization. Owen, A. B., & Eckles, D. (2012). Bootstrapping data arrays of arbitrary order. The Foundations and Trends R in Machine Learning, 4(2), 107–194. doi: 10.1561/ ⃝ Annals of Applied Statistics, 6(3), 895–927. doi: 10.1214/12-aoas547 2200000018 Patidar, R., & Sharma, L. (2011). Credit Card Fraud Detection Using Neural Net- Sherman, J., & Morrison, W. J. (1950). Adjustment of an Inverse Matrix Correspond- work. International Journal of Soft Computing and Engineering, 1(June), 32–38. ing to a Change in One Element of a Given Matrix. The Annals of Mathematical Pebay, P. P. (2008). Formulas for robust, one-pass parallel computation of covariances and Statistics, 21(1), 124–127. doi: 10.1214/aoms/1177729893 arbitrary-order statistical moments. (Tech. Rep.). doi: 10.2172/1028931 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychol- Pébay, P. P., Terriberry, T. B., Kolla, H., & Bennett, J. (2016). Numerically stable, ogy: Undisclosed Flexibility in Data Collection and Analysis Allows Present- ing Anything as Significant. Psychological Science, 22(11), 1359–1366. doi: 122 References References 123 surveys. In L. Gideon (Ed.), Handbook of survey methodology for the social sciences scalable formulas for parallel and online computation of higher-order multi- (pp. 327–343). New York, NY: Springer New York. doi: 10.1007/978-1-4614 variate central moments with arbitrary weights. Computational Statistics. doi: -3876-2\_19 10.1007/s00180-015-0637-z Marz, N., & Warren, J. (2013). Big Data: Principles and best practices of scalable realtime Pedro, S., Z., M. O., Baker, R., Bowers, A. J., & Heffernan, N. T. (2013). Predicting data systems. Greenwich, CT, USA: Manning Publications. college enrollment from student interaction with an intelligent tutoring system McLachlan, G., & Krishnan, T. (2007). Extensions of the EM algorithm. In Wiley in middle school. Proceedings of the 6th International Conference on Educational series in probability and statistics (2nd ed., pp. 159–218). John Wiley & Sons, Inc. Data Mining, 177–184. doi: 10.1002/9780470191613.ch5 Plackett, R. (1950). Some Theorems in Least Squares. Biometrika, 37(1/2), 149–157. McLachlan, G., & Peel, D. (2000). Finite Mixture Models. New York, USA: Wiley doi: 10.2307/2332158 series in probability and statistics. doi: 10.1002/0471721182 Powell, M. J. (2009). The bobyqa algorithm for bound constrained optimization Moerbeek, M., Van Breukelen, G., & Berger, M. (2003). A comparison of Estimation without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Methods for Multilevel Logistic Models. Computational Statistics, 18, 19–37. Cambridge. doi: 10.1007/s001800300130 Quintelier, E. (2010). The effect of schools on political participation: a multilevel Morris, C., & Lysy, M. (2012). Shrinkage Estimation in Multilevel Normal Models. logistic analysis. Research Papers in Education, 25(2), 137–154. doi: 10.1080/ Statistical Science, 27(1), 115–134. doi: 10.1214/11-STS363 02671520802524810 Murnaghan, D. A., Sihvonen, M., Leatherdale, S. T., & Kekki, P. (2007). The rela- R Core Team. (2013). R: A language and environment for statistical computing tionship between school-based smoking policies and prevention programs on [Computer software manual]. Vienna, Austria. smoking behavior among grade 12 students in Prince Edward Island: A mul- R Core Team. (2016). R: A language and environment for statistical computing tilevel analysis. Preventive Medicine, 44(4), 317–322. doi: 10.1016/j.ypmed.2007 [Computer software manual]. Vienna, Austria. .01.003 Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2002). Reliable estimation of gen- Myung, I. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical eralized linear mixed models using adaptive quadrature. Stata Journal, 2(1), Psychology, 47, 90–100. doi: 10.1016/S0022-2496(02)00028-7 1–21. Neal, R., & Hinton, G. E. (1998). A View Of The Em Algorithm That Justifies In- Raudenbush, S., & Bryk, A. (2002). Hierarchical Linear Models: applications and Data cremental, Sparse, And Other Variants. In Learning in graphical models (pp. Analysis Methods (2nd ed.; J. de Leeuw, Ed.). Thousand Oaks, California, USA: 355–368). doi: 10.1007/978-94-011-5014-9_12 Sage Pulication. Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed stream Robert, C. P. (2015). The Metropolis-Hastings algorithm. ArXiv e-prints: 1504.01896. computing platform. Proceedings - IEEE International Conference on Data Mining, Rubin, D., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychome- ICDM, 170–177. doi: 10.1109/ICDMW.2010.172 trika, 47(1), 69–76. doi: 10.1007/BF02293851 Ng, S. K., & McLachlan, G. (2003). On the choice of the number of blocks with Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. International Conference the incremental EM algorithm for the fitting of normal mixtures. Statistics and on Collaboration Technologies and Systems (CTS), 42–47. doi: 10.1109/CTS.2013 Computing, 13(1), 45–55. doi: 10.1023/A:1021987710829 .6567202 Opper, M. (1998). A Bayesian Approach to Online Learning. In D. Saad (Ed.), On-line Schaul, T., Zhang, S., & LeCun, Y. (2013). No More Pesky Learning Rates. Journal of learning in neural networks (pp. 363–378). Cambridge: Cambridge University Machine Learning Research, 28(3), 343–351. Press. doi: 10.1017/cbo9780511569920.017 Shalev-Shwartz, S. (2011). Online Learning and Online Convex Optimization. Owen, A. B., & Eckles, D. (2012). Bootstrapping data arrays of arbitrary order. The Foundations and Trends R in Machine Learning, 4(2), 107–194. doi: 10.1561/ ⃝ Annals of Applied Statistics, 6(3), 895–927. doi: 10.1214/12-aoas547 2200000018 Patidar, R., & Sharma, L. (2011). Credit Card Fraud Detection Using Neural Net- Sherman, J., & Morrison, W. J. (1950). Adjustment of an Inverse Matrix Correspond- work. International Journal of Soft Computing and Engineering, 1(June), 32–38. ing to a Change in One Element of a Given Matrix. The Annals of Mathematical Pebay, P. P. (2008). Formulas for robust, one-pass parallel computation of covariances and Statistics, 21(1), 124–127. doi: 10.1214/aoms/1177729893 arbitrary-order statistical moments. (Tech. Rep.). doi: 10.2172/1028931 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychol- Pébay, P. P., Terriberry, T. B., Kolla, H., & Bennett, J. (2016). Numerically stable, ogy: Undisclosed Flexibility in Data Collection and Analysis Allows Present- ing Anything as Significant. Psychological Science, 22(11), 1359–1366. doi: 124 References References 125

10.1177/0956797611417632 and Products. Technometrics, 4(3), 419–420. doi: 10.2307/1266577 Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable models: multilevel, Whalen, C. K., Jamner, L. D., Henker, B., Delfino, R. J., & Lozano, J. M. (2002). longitudinal, and structural equation models (Vol. 17). Chapman and Hall/CRC. The ADHD spectrum and everyday life: experience sampling of adolescent doi: 10.1201/9780203489437 moods, activities, smoking, and drinking. Child development, 73(1), 209–27. doi: Steenbergen, M., & Jones, B. (2002). Modeling Multilevel Data Structures. American 10.1111/1467-8624.00401 Journal of Political Science, 46(1), 218–237. doi: 10.2307/3088424 Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods Stein, C. (1956). Inadmissibility of the Usual Estimator for the Mean of a Multi- in psychology journals: Guidelines and explanations. American Psychologist, Variate Normal Distribution. In Proc. third berkeley symp. on math. statist. and 54(8), 594–604. doi: 10.1037/0003-066x.54.8.594 prob (Vol. 1, pp. 197–206). Wilson, D., & Martinez, T. R. (2003). The general inefficiency of batch training for Steiner, P. M., & Hudec, M. (2007). Classification of large data sets with mixture gradient descent learning. Neural Networks, 16(10), 1429–1451. doi: 10.1016/ models via sufficient EM. Computational Statistics and Data Analysis, 51(11), S0893-6080(03)00138-2 5416–5428. doi: 10.1016/j.csda.2006.09.014 Witten, I. H., Frank, E., & Hall, M. A. (2013). Data Mining: Practical Machine Learning Strube, M. J. (2006). SNOOP: a program for demonstrating the consequences of pre- Tools and Techniques (3rd ed.). Morgan Kaufmann Publishers. doi: 10.1016/ mature and repeated null hypothesis testing. Behavior research methods, 38(1), c2009-0-19715-5 24–27. doi: 10.3758/BF03192746 Wolfe, J., Haghighi, A., & Klein, D. (2008). Fully distributed EM for very large Swendsen, J., Ben-Zeev, D., & Granholm, E. (2011). Real-time electronic ambulatory datasets. Proceedings of the 25th international conference on Machine learning - monitoring of substance use and symptom expression in schizophrenia. Amer- ICML ’08, 1184–1191. doi: 10.1145/1390156.1390305 ican Journal of Psychiatry, 168(2), 202–209. doi: 10.1176/appi.ajp.2010.10030463 Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochas- Tarrès, P., & Yao, Y. (2014). Online Learning as Stochastic Approximation of Regu- tic gradient descent. CoRR, abs/1107.2490. larization Paths: Optimality and Almost-Sure Convergence. IEEE Transactions Yang, H., Xu, Z., King, I., & Lyu, M. R. (2010). Online learning for group lasso. on Information Theory, 60(9), 5716 – 5735. doi: 10.1109/TIT.2014.2332531 In Proceedings of the 27th international conference on machine learning (pp. 1191– Thiesson, B., Meek, C., & Heckerman, D. (2001). Accelerating EM for large 1198). Haifa, Israel. databases. Machine Learning, 45(3), 279–299. doi: 10.1023/A:1017986506241 Yang, Y., & Dunson, D. B. (2013). Sequential Markov Chain Monte Carlo. ArXiv Toshniwal, A., Donham, J., Bhagat, N., Mittal, S., Ryaboy, D., Taneja, S., . . . Fu, e-prints. M. (2014). Storm@twitter. Proceedings of the 2014 ACM SIGMOD interna- Young-Xu, Y., & Chan, K. A. (2008). Pooling overdispersed binomial data to estimate tional conference on Management of data - SIGMOD ’14, 147–156. doi: 10.1145/ event rate. BMC medical research methodology, 8, 58. doi: 10.1186/1471-2288-8 2588555.2595641 -58 Trull, T. J., & Ebner-Priemer, U. W. (2009). Using Experience Sampling Meth- ods/Ecological Momentary Assessment (ESM/EMA) in Clinical Assessment and Clinical Research: Introduction to the Special Section. Psychological Assess- ment, 21(4), 457–462. doi: 10.1037/a0017653 Turaga, D., Andrade, H., Gedik, B., Venkatramani, C., Verscheure, O., Harris, J. D., . . . Jones, P. (2010). Design principles for developing stream processing applica- tions. Software: Practice and Experience, 40(12), 1073–1104. doi: 10.1002/spe.993 van der Palm, D. W., van der Ark, L. A., & Vermunt, J. K. (2016). A comparison of incomplete-data methods for categorical data. Statistical Methods in Medical Research, 25(2), 754–774. doi: 10.1177/0962280212465502 van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & van Aken, M. A. (2014). A gentle introduction to bayesian analysis: Applications to developmental research. Child Development, 85(3), 842–860. doi: 10.1111/ cdev.12169 Welford, B. (1962). Note on a Method for Calculating Corrected Sums of Squares 124 References References 125

10.1177/0956797611417632 and Products. Technometrics, 4(3), 419–420. doi: 10.2307/1266577 Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable models: multilevel, Whalen, C. K., Jamner, L. D., Henker, B., Delfino, R. J., & Lozano, J. M. (2002). longitudinal, and structural equation models (Vol. 17). Chapman and Hall/CRC. The ADHD spectrum and everyday life: experience sampling of adolescent doi: 10.1201/9780203489437 moods, activities, smoking, and drinking. Child development, 73(1), 209–27. doi: Steenbergen, M., & Jones, B. (2002). Modeling Multilevel Data Structures. American 10.1111/1467-8624.00401 Journal of Political Science, 46(1), 218–237. doi: 10.2307/3088424 Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods Stein, C. (1956). Inadmissibility of the Usual Estimator for the Mean of a Multi- in psychology journals: Guidelines and explanations. American Psychologist, Variate Normal Distribution. In Proc. third berkeley symp. on math. statist. and 54(8), 594–604. doi: 10.1037/0003-066x.54.8.594 prob (Vol. 1, pp. 197–206). Wilson, D., & Martinez, T. R. (2003). The general inefficiency of batch training for Steiner, P. M., & Hudec, M. (2007). Classification of large data sets with mixture gradient descent learning. Neural Networks, 16(10), 1429–1451. doi: 10.1016/ models via sufficient EM. Computational Statistics and Data Analysis, 51(11), S0893-6080(03)00138-2 5416–5428. doi: 10.1016/j.csda.2006.09.014 Witten, I. H., Frank, E., & Hall, M. A. (2013). Data Mining: Practical Machine Learning Strube, M. J. (2006). SNOOP: a program for demonstrating the consequences of pre- Tools and Techniques (3rd ed.). Morgan Kaufmann Publishers. doi: 10.1016/ mature and repeated null hypothesis testing. Behavior research methods, 38(1), c2009-0-19715-5 24–27. doi: 10.3758/BF03192746 Wolfe, J., Haghighi, A., & Klein, D. (2008). Fully distributed EM for very large Swendsen, J., Ben-Zeev, D., & Granholm, E. (2011). Real-time electronic ambulatory datasets. Proceedings of the 25th international conference on Machine learning - monitoring of substance use and symptom expression in schizophrenia. Amer- ICML ’08, 1184–1191. doi: 10.1145/1390156.1390305 ican Journal of Psychiatry, 168(2), 202–209. doi: 10.1176/appi.ajp.2010.10030463 Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochas- Tarrès, P., & Yao, Y. (2014). Online Learning as Stochastic Approximation of Regu- tic gradient descent. CoRR, abs/1107.2490. larization Paths: Optimality and Almost-Sure Convergence. IEEE Transactions Yang, H., Xu, Z., King, I., & Lyu, M. R. (2010). Online learning for group lasso. on Information Theory, 60(9), 5716 – 5735. doi: 10.1109/TIT.2014.2332531 In Proceedings of the 27th international conference on machine learning (pp. 1191– Thiesson, B., Meek, C., & Heckerman, D. (2001). Accelerating EM for large 1198). Haifa, Israel. databases. Machine Learning, 45(3), 279–299. doi: 10.1023/A:1017986506241 Yang, Y., & Dunson, D. B. (2013). Sequential Markov Chain Monte Carlo. ArXiv Toshniwal, A., Donham, J., Bhagat, N., Mittal, S., Ryaboy, D., Taneja, S., . . . Fu, e-prints. M. (2014). Storm@twitter. Proceedings of the 2014 ACM SIGMOD interna- Young-Xu, Y., & Chan, K. A. (2008). Pooling overdispersed binomial data to estimate tional conference on Management of data - SIGMOD ’14, 147–156. doi: 10.1145/ event rate. BMC medical research methodology, 8, 58. doi: 10.1186/1471-2288-8 2588555.2595641 -58 Trull, T. J., & Ebner-Priemer, U. W. (2009). Using Experience Sampling Meth- ods/Ecological Momentary Assessment (ESM/EMA) in Clinical Assessment and Clinical Research: Introduction to the Special Section. Psychological Assess- ment, 21(4), 457–462. doi: 10.1037/a0017653 Turaga, D., Andrade, H., Gedik, B., Venkatramani, C., Verscheure, O., Harris, J. D., . . . Jones, P. (2010). Design principles for developing stream processing applica- tions. Software: Practice and Experience, 40(12), 1073–1104. doi: 10.1002/spe.993 van der Palm, D. W., van der Ark, L. A., & Vermunt, J. K. (2016). A comparison of incomplete-data methods for categorical data. Statistical Methods in Medical Research, 25(2), 754–774. doi: 10.1177/0962280212465502 van de Schoot, R., Kaplan, D., Denissen, J., Asendorpf, J. B., Neyer, F. J., & van Aken, M. A. (2014). A gentle introduction to bayesian analysis: Applications to developmental research. Child Development, 85(3), 842–860. doi: 10.1111/ cdev.12169 Welford, B. (1962). Note on a Method for Calculating Corrected Sums of Squares Summary

The technological developments of the last decades, e.g., the introduction of the smartphone, have created opportunities to efficiently collect data of many individ- uals over an extensive period of time. While these technologies allow for intensive longitudinal measurements, they also come with new challenges: data sets collected using these technologies could become extremely large, and in many applications the data collection is never truly ‘finished’. As a result, the data keep streaming in and analyzing data streams using the standard computation of well-known mod- els becomes inefficient as the computation has to be repeated each time a new data point enters to remain up to date. In this thesis, methods to analyze data streams are developed. The introduction of these methods allows researchers to broaden the scope of their research, by using data streams. In Chapter 2, multiple approaches for analyzing data streams are discussed, though the main focus of this chapter is on online learning. Online learning means that the parameter estimates are estimated while the data enter, without going back to older data points to update the parameter estimates. In this chapter, the standard computations of several common models for independent observations are adapted such that these models could be computed online in data streams. These online computations are illustrated with R code, e.g., to compute correlation and linear re- gression online. For more complex models that do not have simple (closed-form) computations, Stochastic Gradient Decent is introduced. This method approximates the solution (e.g., the Maximum Likelihood solution), a data point at a time. This optimization method is illustrated by fitting a logistic model to a data stream. Chapter 2 focuses on data streams consisting of independent observations. How- ever, data streams often consist of observing the same individuals repeatedly. Ob- serving the same individual multiple times creates a nesting in the data. Many statis- tical models, however, assume that the observations are not nested, or independent. In Chapter 3, four online methods for the analysis of nested observations in data streams are developed. These four methods combine the observations of an indi- vidual with the data of all the other individuals, to obtain more accurate predictions than when using only the individual’s observations. However, fitting a model that accounts for both nested observations and binary outcomes in a data stream can be computationally challenging. The four methods that are presented in this chapter are based on existing shrinkage factors. The prediction accuracy of the offline and online shrinkage factors is compared in a simulation study. While the existing meth- ods differ in their prediction accuracy, the differences in accuracy between the online Summary

The technological developments of the last decades, e.g., the introduction of the smartphone, have created opportunities to efficiently collect data of many individ- uals over an extensive period of time. While these technologies allow for intensive longitudinal measurements, they also come with new challenges: data sets collected using these technologies could become extremely large, and in many applications the data collection is never truly ‘finished’. As a result, the data keep streaming in and analyzing data streams using the standard computation of well-known mod- els becomes inefficient as the computation has to be repeated each time a new data point enters to remain up to date. In this thesis, methods to analyze data streams are developed. The introduction of these methods allows researchers to broaden the scope of their research, by using data streams. In Chapter 2, multiple approaches for analyzing data streams are discussed, though the main focus of this chapter is on online learning. Online learning means that the parameter estimates are estimated while the data enter, without going back to older data points to update the parameter estimates. In this chapter, the standard computations of several common models for independent observations are adapted such that these models could be computed online in data streams. These online computations are illustrated with R code, e.g., to compute correlation and linear re- gression online. For more complex models that do not have simple (closed-form) computations, Stochastic Gradient Decent is introduced. This method approximates the solution (e.g., the Maximum Likelihood solution), a data point at a time. This optimization method is illustrated by fitting a logistic model to a data stream. Chapter 2 focuses on data streams consisting of independent observations. How- ever, data streams often consist of observing the same individuals repeatedly. Ob- serving the same individual multiple times creates a nesting in the data. Many statis- tical models, however, assume that the observations are not nested, or independent. In Chapter 3, four online methods for the analysis of nested observations in data streams are developed. These four methods combine the observations of an indi- vidual with the data of all the other individuals, to obtain more accurate predictions than when using only the individual’s observations. However, fitting a model that accounts for both nested observations and binary outcomes in a data stream can be computationally challenging. The four methods that are presented in this chapter are based on existing shrinkage factors. The prediction accuracy of the offline and online shrinkage factors is compared in a simulation study. While the existing meth- ods differ in their prediction accuracy, the differences in accuracy between the online 128 Summary and the offline shrinkage factors are small. A model-based approach to analyze data streams with dependent observations is discussed in Chapter 4. Data sets with nested structures are typically analyzed using multilevel models. However, in the context of data streams, estimating multilevel models can be challenging: the algorithms used to fit multilevel models repeatedly Samenvatting revisit all data points and, in the case that new data enter, have to redo this proce- dure to remain up to date. Chapter 4 presents a solution for this problem with the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting In de afgelopen decennia hebben er veel technologische ontwikkelingen, bijvoor- random intercept models online. The performance of SEMA is compared to tradi- beeld de opkomst van de smartphone, plaatsgevonden. Deze technologische ont- tional methods of fitting random intercept models in a simulation study and in an wikkelingen hebben nieuwe mogelijkheden gecreëerd om over lange periodes data empirical example. The prediction accuracy of SEMA is both competitive and orders van veel mensen tegelijkertijd te verzamelen. Doordat de dataverzameling nu op of magnitude faster than traditional methods. een grotere schaal kan plaatsvinden, ontstaan er nieuwe uitdagingen: de databe- Chapter 5 provides an extension of the SEMA algorithm to allow online multi- standen, die verzameld worden met deze methodes, kunnen zeer groot worden. level modeling with fixed and random effects. In a simulation study, models with Om deze databestanden te kunnen verwerken en analyseren moet er zowel genoeg random intercepts and slopes and fixed effects on both the level of the individual opslag- als rekencapaciteit zijn. Een bijkomend probleem is dat het niet altijd duide- and the level of observations are included. The SEMA algorithm is able to accu- lijk is wanneer de dataverzameling afgelopen is, aangezien nieuwe data continu het rately estimate the parameter values. The performance of SEMA is also illustrated computergeheugen binnen blijven stromen. De standaardmethodes om data te ana- using an empirical example, where individuals’ weight is predicted in a data stream. lyseren zijn veelal niet geschikt om deze datastromen te analyseren: deze methodes In this example, the prediction accuracy of SEMA and the traditional methods are veronderstellen namelijk dat alle data tegelijkertijd beschikbaar zijn in het compu- very similar. tergeheugen. Hierdoor moet een analyse steeds opnieuw uitgevoerd worden om ge- Finally, Chapter 6 discusses the contributions, such as estimating multilevel mod- bruik te kunnen maken van de nieuw binnengekomen data. In deze thesis worden els efficiently in data streams, and limitations, such as the small scale study of con- methodes besproken en ontwikkeld, die 1) het resultaat van een analyse eenvoudig vergence, of the work presented in this thesis. Directions for further research are aanpassen aan de hand van nieuwe data, zonder de analyse opnieuw uit te voeren; provided, such as how SEMA could be extended to fit other models as well and how en 2) het overbodig maken om observaties op te slaan. related fields could make use of the work presented in this thesis. Eerst worden de bestaande methoden besproken en geïllustreerd in hoofdstuk 2. De focus van dit hoofdstuk ligt op de online learning methode. Deze methode past eenvoudig de uitkomst van een analyse aan wanneer nieuwe data binnen ko- men, zonder gebruik te maken van de oude data. De data punten hoeven daarom niet opgeslagen te worden. Voor een aantal veelgebruikte modellen wordt (met R code) de aangepaste (online) manier van schatten gepresenteerd (bijvoorbeeld een gemiddelde of een lineaire regressie). De berekening van sommige modellen kan echter niet eenvoudig aangepast worden zodat ze online berekend kunnen worden. Een voorbeeld van zo’n model is het logistische model, dat gebruikt wordt om een binaire uitkomst te voorspellen. Om dit soort modellen toch te kunnen gebruiken in een datastroom zijn meer complexe technieken nodig. Met R code laten we zien hoe Stochastic Gradient Descent, een online benaderingsmethode, een logistisch model schat. In de sociale wetenschappen bestaan datastromen vaak uit herhaalde metingen van dezelfde personen, bijvoorbeeld of iemand wel of niet op een website op adver- tenties klikt. Het herhaaldelijk observeren van dezelfde persoon creëert een samen- hang (of afhankelijkheid) tussen de observaties die van dezelfde persoon afkomstig zijn: twee observaties van dezelfde persoon lijken waarschijnlijk meer op elkaar dan 128 Summary and the offline shrinkage factors are small. A model-based approach to analyze data streams with dependent observations is discussed in Chapter 4. Data sets with nested structures are typically analyzed using multilevel models. However, in the context of data streams, estimating multilevel models can be challenging: the algorithms used to fit multilevel models repeatedly Samenvatting revisit all data points and, in the case that new data enter, have to redo this proce- dure to remain up to date. Chapter 4 presents a solution for this problem with the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting In de afgelopen decennia hebben er veel technologische ontwikkelingen, bijvoor- random intercept models online. The performance of SEMA is compared to tradi- beeld de opkomst van de smartphone, plaatsgevonden. Deze technologische ont- tional methods of fitting random intercept models in a simulation study and in an wikkelingen hebben nieuwe mogelijkheden gecreëerd om over lange periodes data empirical example. The prediction accuracy of SEMA is both competitive and orders van veel mensen tegelijkertijd te verzamelen. Doordat de dataverzameling nu op of magnitude faster than traditional methods. een grotere schaal kan plaatsvinden, ontstaan er nieuwe uitdagingen: de databe- Chapter 5 provides an extension of the SEMA algorithm to allow online multi- standen, die verzameld worden met deze methodes, kunnen zeer groot worden. level modeling with fixed and random effects. In a simulation study, models with Om deze databestanden te kunnen verwerken en analyseren moet er zowel genoeg random intercepts and slopes and fixed effects on both the level of the individual opslag- als rekencapaciteit zijn. Een bijkomend probleem is dat het niet altijd duide- and the level of observations are included. The SEMA algorithm is able to accu- lijk is wanneer de dataverzameling afgelopen is, aangezien nieuwe data continu het rately estimate the parameter values. The performance of SEMA is also illustrated computergeheugen binnen blijven stromen. De standaardmethodes om data te ana- using an empirical example, where individuals’ weight is predicted in a data stream. lyseren zijn veelal niet geschikt om deze datastromen te analyseren: deze methodes In this example, the prediction accuracy of SEMA and the traditional methods are veronderstellen namelijk dat alle data tegelijkertijd beschikbaar zijn in het compu- very similar. tergeheugen. Hierdoor moet een analyse steeds opnieuw uitgevoerd worden om ge- Finally, Chapter 6 discusses the contributions, such as estimating multilevel mod- bruik te kunnen maken van de nieuw binnengekomen data. In deze thesis worden els efficiently in data streams, and limitations, such as the small scale study of con- methodes besproken en ontwikkeld, die 1) het resultaat van een analyse eenvoudig vergence, of the work presented in this thesis. Directions for further research are aanpassen aan de hand van nieuwe data, zonder de analyse opnieuw uit te voeren; provided, such as how SEMA could be extended to fit other models as well and how en 2) het overbodig maken om observaties op te slaan. related fields could make use of the work presented in this thesis. Eerst worden de bestaande methoden besproken en geïllustreerd in hoofdstuk 2. De focus van dit hoofdstuk ligt op de online learning methode. Deze methode past eenvoudig de uitkomst van een analyse aan wanneer nieuwe data binnen ko- men, zonder gebruik te maken van de oude data. De data punten hoeven daarom niet opgeslagen te worden. Voor een aantal veelgebruikte modellen wordt (met R code) de aangepaste (online) manier van schatten gepresenteerd (bijvoorbeeld een gemiddelde of een lineaire regressie). De berekening van sommige modellen kan echter niet eenvoudig aangepast worden zodat ze online berekend kunnen worden. Een voorbeeld van zo’n model is het logistische model, dat gebruikt wordt om een binaire uitkomst te voorspellen. Om dit soort modellen toch te kunnen gebruiken in een datastroom zijn meer complexe technieken nodig. Met R code laten we zien hoe Stochastic Gradient Descent, een online benaderingsmethode, een logistisch model schat. In de sociale wetenschappen bestaan datastromen vaak uit herhaalde metingen van dezelfde personen, bijvoorbeeld of iemand wel of niet op een website op adver- tenties klikt. Het herhaaldelijk observeren van dezelfde persoon creëert een samen- hang (of afhankelijkheid) tussen de observaties die van dezelfde persoon afkomstig zijn: twee observaties van dezelfde persoon lijken waarschijnlijk meer op elkaar dan 130 Samenvatting twee willekeurige observaties die niet van dezelfde persoon afkomstig zijn. Echter, voor observaties die afhankelijk zijn, is het schatten van de daarvoor geschikte model- len complex, met name als de uitkomst binair is. In hoofdstuk 3 worden vier online methodes ontwikkeld, die rekening houden met de samenhang tussen de observa- ties van dezelfde persoon en met een binaire uitkomst. De nieuwe online methodes Dankwoord zijn gebaseerd op reeds bestaande shrinkage methodes. Shrinkage methodes com- bineren de observaties van een persoon met de data van alle andere personen. Op deze manier worden er accuratere voorspellingen gemaakt dan voorspellingen die In deze laatste paar pagina’s van mijn boekje wil ik graag iedereen bedanken die alleen gebaseerd zijn op de observaties van iedere persoon afzonderlijk. Uit een si- bijgedragen heeft aan dit proefschrift. mulatiestudie blijkt dat er nauwelijks verschillen zijn tussen de online manier versus Vier jaar geleden heb ik van Jeroen Vermunt de kans gekregen om aan dit project de standaard manier van schatten van de shrinkage methodes. te beginnen. Hij stelde voor om eens met Maurits Kaptein te praten en Jeroen bood In het volgende hoofdstuk wordt een methode besproken die gebaseerd is op me een promotieplek binnen het Methoden en Technieken departement. Jeroen, ik een veelgebruikt model wanneer de observaties gegroepeerd zijn. Het model heet ben je dankbaar dat je me deze kans gegeven hebt. Ik heb je feedback altijd erg het multilevel model: het lagere level (level 1) duidt de observaties aan en het ho- gewaardeerd en daar waar ik er zelf niet uitkwam of de zaken toch net niet helemaal gere level (level 2) geeft de personen weer. Het schatten van zo’n multilevel model scherp had, kon ik altijd rekenen op je uitleg. is lastig in een datastroom, omdat de gebruikelijke methodes alle data in het com- Maurits, eigenlijk ben je de afgelopen vijf jaar mijn begeleider geweest, want ik puter geheugen nodig hebben en herhaaldelijk deze data gebruiken om het model mocht ook mijn master thesis bij jou schrijven. In deze tijd heb ik ontzettend veel te kunnen schatten. In hoofdstuk 4 wordt een nieuw algoritme ontwikkeld dat het van je mogen leren en je stond altijd voor me klaar. Daar waar mogelijk, probeerde multilevel model kan schatten zonder gebruik te maken van oude observaties. Het je bij al mijn presentaties te zijn. Zelfs even twee uur naar Kerkrade rijden, was algoritme heet SEMA: Streaming Expectation Maximization Approximation en het blijkbaar geen probleem. Hoewel ik zo vlak voor een presentatie het wellicht niet is gebaseerd op een bestaand algoritme om het multilevel model te schatten. In een altijd heb laten blijken, heb ik het enorm gewaardeerd dat je erbij zat. En in de tijden simulatie studie en met bestaande data wordt het random intercept model, een een- dat het me allemaal even niet mee zat, wist ik ook op persoonlijk vlak dat je er voor voudig multilevel model, geschat. De standaardmethode en het SEMA algoritme me was. Je hebt het veilig voor me gemaakt om fouten te maken en iedereen die mij presteren vergelijkbaar, terwijl SEMA vele malen sneller is dan de standaardme- ook maar een beetje kent, weet dat dat een van de grootste complimenten is die ik je thode. kan geven. Hoofdstuk 5 is een uitbreiding van het SEMA algoritme zodat niet alleen het I also want to thank my two hosts at Berkeley, Sophia Rabe-Hesketh and Anders random intercept model geschat kan worden, maar ook complexere multilevel mo- Skrondal. I am grateful to have gotten the opportunity to learn from you and from dellen. Met deze uitbreiding kan SEMA modellen met meerdere fixed effecten en the QME students you supervise. A special thanks to James Mason and Joonho Lee random effecten schatten in een data stroom. Fixed effecten zijn effecten die voor with whom I shared many dinners and ideas! alle personen even groot zijn. Random effecten daarentegen kunnen verschillen per Buiten mijn twee vaste begeleiders, kon ik altijd rekenen op de hulp en uitleg persoon. We laten aan de hand van een simulatiestudie en de analyse van bestaande van de andere collega’s binnen het departement. Marcel van Assen, bedankt voor data zien dat SEMA multilevel modellen met fixed effecten op beide levels, random de ‘spar’-momentjes wanneer we het eigenlijk alle twee niet echt wisten maar je me intercepts en random slopes kan schatten. In beide gevallen is SEMA goed in staat toch weer op weg wist te helpen. Marcel Croon, bedankt dat u plaats wilt nemen in om accurate voorspellingen te maken terwijl de data binnen komt. mijn commissie. Uw deur stond altijd voor me open en uw wiskundige kennis heeft Tot slot, in het laatste hoofdstuk worden de bijdrage, zoals multilevel modellen me vele malen vooruit geholpen. Katrijn, ik kon altijd bij je aankloppen of ik nu een efficiënt kunnen schatten in een data stroom, en limitaties, zoals een beperkte ana- presentatie wilde oefenen of ergens vastliep in mijn papers, bedankt! lyse van convergentie, van deze thesis besproken. Verder worden ook richtingen Verder wil ik graag mijn collega’s bedanken voor alle gezelligheid op kantoor, de voor vervolgonderzoek besproken zoals bijvoorbeeld hoe het SEMA algoritme nog borrels, tijdens de IOPS cursussen en conferenties. Mijn kantoorgenootjes Robbie, verder uitgebreid kan worden en hoe gerelateerde velden gebruik kunnen maken voor de vele doorgestuurde linkjes naar alle financiele documenten die ik nooit kon van het werk besproken in deze thesis. vinden, en de R support waar ik altijd kon rekenen, Chris en Niek voor alle discus- sies en vele koppen koffie, mede namens mijn koffieverslaving, bedankt! Mede-VICI 130 Samenvatting twee willekeurige observaties die niet van dezelfde persoon afkomstig zijn. Echter, voor observaties die afhankelijk zijn, is het schatten van de daarvoor geschikte model- len complex, met name als de uitkomst binair is. In hoofdstuk 3 worden vier online methodes ontwikkeld, die rekening houden met de samenhang tussen de observa- ties van dezelfde persoon en met een binaire uitkomst. De nieuwe online methodes Dankwoord zijn gebaseerd op reeds bestaande shrinkage methodes. Shrinkage methodes com- bineren de observaties van een persoon met de data van alle andere personen. Op deze manier worden er accuratere voorspellingen gemaakt dan voorspellingen die In deze laatste paar pagina’s van mijn boekje wil ik graag iedereen bedanken die alleen gebaseerd zijn op de observaties van iedere persoon afzonderlijk. Uit een si- bijgedragen heeft aan dit proefschrift. mulatiestudie blijkt dat er nauwelijks verschillen zijn tussen de online manier versus Vier jaar geleden heb ik van Jeroen Vermunt de kans gekregen om aan dit project de standaard manier van schatten van de shrinkage methodes. te beginnen. Hij stelde voor om eens met Maurits Kaptein te praten en Jeroen bood In het volgende hoofdstuk wordt een methode besproken die gebaseerd is op me een promotieplek binnen het Methoden en Technieken departement. Jeroen, ik een veelgebruikt model wanneer de observaties gegroepeerd zijn. Het model heet ben je dankbaar dat je me deze kans gegeven hebt. Ik heb je feedback altijd erg het multilevel model: het lagere level (level 1) duidt de observaties aan en het ho- gewaardeerd en daar waar ik er zelf niet uitkwam of de zaken toch net niet helemaal gere level (level 2) geeft de personen weer. Het schatten van zo’n multilevel model scherp had, kon ik altijd rekenen op je uitleg. is lastig in een datastroom, omdat de gebruikelijke methodes alle data in het com- Maurits, eigenlijk ben je de afgelopen vijf jaar mijn begeleider geweest, want ik puter geheugen nodig hebben en herhaaldelijk deze data gebruiken om het model mocht ook mijn master thesis bij jou schrijven. In deze tijd heb ik ontzettend veel te kunnen schatten. In hoofdstuk 4 wordt een nieuw algoritme ontwikkeld dat het van je mogen leren en je stond altijd voor me klaar. Daar waar mogelijk, probeerde multilevel model kan schatten zonder gebruik te maken van oude observaties. Het je bij al mijn presentaties te zijn. Zelfs even twee uur naar Kerkrade rijden, was algoritme heet SEMA: Streaming Expectation Maximization Approximation en het blijkbaar geen probleem. Hoewel ik zo vlak voor een presentatie het wellicht niet is gebaseerd op een bestaand algoritme om het multilevel model te schatten. In een altijd heb laten blijken, heb ik het enorm gewaardeerd dat je erbij zat. En in de tijden simulatie studie en met bestaande data wordt het random intercept model, een een- dat het me allemaal even niet mee zat, wist ik ook op persoonlijk vlak dat je er voor voudig multilevel model, geschat. De standaardmethode en het SEMA algoritme me was. Je hebt het veilig voor me gemaakt om fouten te maken en iedereen die mij presteren vergelijkbaar, terwijl SEMA vele malen sneller is dan de standaardme- ook maar een beetje kent, weet dat dat een van de grootste complimenten is die ik je thode. kan geven. Hoofdstuk 5 is een uitbreiding van het SEMA algoritme zodat niet alleen het I also want to thank my two hosts at Berkeley, Sophia Rabe-Hesketh and Anders random intercept model geschat kan worden, maar ook complexere multilevel mo- Skrondal. I am grateful to have gotten the opportunity to learn from you and from dellen. Met deze uitbreiding kan SEMA modellen met meerdere fixed effecten en the QME students you supervise. A special thanks to James Mason and Joonho Lee random effecten schatten in een data stroom. Fixed effecten zijn effecten die voor with whom I shared many dinners and ideas! alle personen even groot zijn. Random effecten daarentegen kunnen verschillen per Buiten mijn twee vaste begeleiders, kon ik altijd rekenen op de hulp en uitleg persoon. We laten aan de hand van een simulatiestudie en de analyse van bestaande van de andere collega’s binnen het departement. Marcel van Assen, bedankt voor data zien dat SEMA multilevel modellen met fixed effecten op beide levels, random de ‘spar’-momentjes wanneer we het eigenlijk alle twee niet echt wisten maar je me intercepts en random slopes kan schatten. In beide gevallen is SEMA goed in staat toch weer op weg wist te helpen. Marcel Croon, bedankt dat u plaats wilt nemen in om accurate voorspellingen te maken terwijl de data binnen komt. mijn commissie. Uw deur stond altijd voor me open en uw wiskundige kennis heeft Tot slot, in het laatste hoofdstuk worden de bijdrage, zoals multilevel modellen me vele malen vooruit geholpen. Katrijn, ik kon altijd bij je aankloppen of ik nu een efficiënt kunnen schatten in een data stroom, en limitaties, zoals een beperkte ana- presentatie wilde oefenen of ergens vastliep in mijn papers, bedankt! lyse van convergentie, van deze thesis besproken. Verder worden ook richtingen Verder wil ik graag mijn collega’s bedanken voor alle gezelligheid op kantoor, de voor vervolgonderzoek besproken zoals bijvoorbeeld hoe het SEMA algoritme nog borrels, tijdens de IOPS cursussen en conferenties. Mijn kantoorgenootjes Robbie, verder uitgebreid kan worden en hoe gerelateerde velden gebruik kunnen maken voor de vele doorgestuurde linkjes naar alle financiele documenten die ik nooit kon van het werk besproken in deze thesis. vinden, en de R support waar ik altijd kon rekenen, Chris en Niek voor alle discus- sies en vele koppen koffie, mede namens mijn koffieverslaving, bedankt! Mede-VICI 132 Dankwoord leden, zowel de huidige als de voormalige: Margot, Zsuzsa, Daniel O., Jeroen, Ka- trijn, Kim, Laura, Davide, Mattis, Erwin, Niek, Geert, Leonie, Pia, Reza, ik heb veel van jullie papers en jullie feedback mogen leren. Uiteraard vergeet ik de Bayes club met Jeroen, Jesper, Joris, Florian, Davide, Geert, Dino, en Sara ook niet! Onderwijs geven is een van die taken waar ik elk jaar weer naar uitkeek, tot on- geveer halverwege het blok dan “was ik er wel weer klaar mee” en wilde ik “weer gewoon m’n werk doen”. Leoni, Josine, Jules, Reza, Eva, Hannah, Chris, Inga, Ka- trijn, John en Luc, ik heb erg genoten om met jullie het onderwijs te verzorgen. Wilco, bedankt voor de organisatie van het onderwijs en dat ik bij je terecht kon met alle vragen, opmerkingen en frustraties. Guy, als groentje ben ik bij jouw vak MTO-A- MAW begonnen 7 jaar geleden, bedankt voor het vertrouwen en de support ook in de vele jaren daarna. Verder wil ik nog de secretaresses van MTO bedanken, Marieke, Liesbeth, en Anne-Marie. Bedankt voor alle support en al het geregel! Zonder jullie was ik nu waarschijnlijk nog steeds aan het zoeken naar de juiste formulieren. Naast mijn collega’s wil ik ook mijn vrienden bedanken. Allereerst mijn para- nymfen Erwin en Hilde, bedankt dat jullie straks achter mij willen gaan staan. Be- dankt dat jullie steeds opnieuw voor me klaar staan. Ook al is het straks wellicht wat meer uit het oog, zeker niet uit het hart. Tom, we zijn het nooit met elkaar eens, toch heb ik veel van je geleerd, en je schrijven is iets waar ik altijd jaloers op zal zijn. Daniel Pineda, Drew, Matt, and Colin a big thank you for welcoming me into your house. Thanks for giving me the full American experience by inviting me to the Super Bowl party and the ‘lovely’ election season. Daniel, thank you for showing me around. Continuing in English, Adam thank you for sharing your stories and always offering a listening ear. Greta, I am already looking forward to our next city trip. Aan mijn tijd bij Rataplan heb ik veel vrienden overgehouden. Janneke, Niek, Liselot, Letje, Peppie, Sanne, Fons, Hilde en Marius. Wat heb ik met jullie geweldige tijden beleefd, op het terras, in de langeboom, op de vloer, kanoën, NSK, etc. etc. etc.! Door jullie weet ik dat er meer bestaat dan alleen mijn proefschrift. Sanne, met je mooie gezinnetje, dankjewel voor je vriendschap, je bent me dierbaar. Hilde en Marius, bedankt voor de klustherapie om alle phd en gerelateerde stress van me af te zetten, jullie zijn fantastisch! Ik wil mijn familie bedanken, pa, ma, Marco, Agnita en Lucas. Bedankt voor jullie steun in de afgelopen jaren. Jeroen, bedankt voor je geduld en samen met jou in Mestreech, kump alles good!