A Compound Dirichlet-Multinomial Model for Provincial Level Covid-19 Predictions in South Africa
medRxiv preprint doi: https://doi.org/10.1101/2020.06.15.20131433; this version posted June 17, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Alta de Waal*1,2, Daan de Waal1,3

1 Department of Statistics, University of Pretoria, Pretoria, South Africa
2 Center for Artificial Intelligence (CAIR), Pretoria, South Africa
3 Mathematical Statistics and Actuarial Science, University of the Free State, Bloemfontein, South Africa

* [email protected]

June 15, 2020

Abstract

Accurate prediction of COVID-19 related indicators such as confirmed cases, deaths and recoveries plays an important role in understanding the spread and impact of the virus, as well as in resource planning and allocation. In this study, we approach the prediction problem from a statistical perspective and predict confirmed cases and deaths on a provincial level. We propose the compound Dirichlet Multinomial distribution to estimate the proportion parameter of each province as mutually exclusive outcomes. Furthermore, we make an assumption of exponential growth of the total cumulative counts in order to predict future total counts. The outcome of this approach is not only prediction. The variation of the proportion parameter is characterised by the Dirichlet distribution, which provides insight into the movement of the pandemic across provinces over time.

Introduction

The global COVID-19 (C19) pandemic has pushed governments worldwide to rapidly implement measures to reduce the number of infected cases and reduce pressure on the health care system. Strategic decisions on measures such as quarantine, economic lockdown and social distancing depend on projections of the number of infections, the number of people requiring urgent medical attention, and mortality. Early in the pandemic it became clear that data on confirmed cases are plagued with uncertainty, because of the high number of asymptomatic cases, varying test protocols per country [1], as well as different symptoms in different regions [2]. One mathematical approach to understanding the spread of infections is the use of compartmental models such as Susceptible Exposed Infectious Recovered (SEIR) and Susceptible Infectious Recovered (SIR) models [3-5].

The approach of this paper is to predict the cumulative count of C19 confirmed cases and deaths across provinces in South Africa. We provide the following arguments for this approach:

• The spread of the pandemic in a country is not uniform, but characterized by regional hotspots such as Northern Italy [6] and New York in the USA.
• Although the test protocols and definition of infection may vary across countries [1], they are more likely to be consistent across regions within a single country. Therefore, it makes sense to model the dependencies between provinces.
• For the purpose of resource planning and mitigation strategies, it is important to understand how the pandemic spreads and moves across provincial or district borders and how the infection proportions change over time. Our modelling approach shows the change in counts as proportions across provinces.

A simple approach to such a model would be the Multinomial distribution, a multivariate discrete distribution which models multiple mutually exclusive outcomes. The simplest application of the Multinomial distribution is determining whether a six-sided die is biased. This is done by throwing the die multiple times and calculating the probability of throwing each of the numbers from 1 to 6. Uneven probabilities for one of the numbers might indicate an unfair die. There is, however, uncertainty associated with these calculated probabilities, which can be accounted for by following a Bayesian approach to the problem and fitting a prior to the Multinomial probability parameter. The conjugate prior for the Multinomial distribution is the Dirichlet distribution, which allows us to calculate a posterior distribution over the Multinomial proportions. This approach forms the basis of our methodology, which we describe in more detail in the next section. The methodology is followed by a description of the South African data and then the application. We conclude the paper with a discussion of the results and future work.

Methodology

The structure of the methodology section is as follows: We first describe the nature of categorical count data before introducing the compound Dirichlet Multinomial distribution and describing the methodology in detail according to these steps:

1. Update Dirichlet parameter.
2. Estimate categorical counts.
3. Validate the estimations with Q-Q plots.
4. Predict future total counts.
5. Validate predictions.
6. Predict future categorical counts.

Categorical data

Consider a sequence of $N$ observations $x_1, \ldots, x_N$ where each observation $x_i$ is a vector of length $K$, the number of categories, denoting numbers from $1, \ldots, M$ such that $\sum_{k=1}^{K} x_{ik} = M$.

Illustration 1: Let's say $K = 4$ and $M = 123$. Then $x_i = [13, 12, 42, 56]$ can be an observation. This is an outcome of the Multinomial distribution with parameters $M$ and $Y$. $N$ such outcomes are observed and denoted by the matrix $X$ with $N$ rows and $K$ columns. The objective is to estimate a future outcome $x_{N+1}$ given $M$. The unknown parameter $Y = y_1, \ldots, y_K$ (where $\sum_{k=1}^{K} y_k = 1$), however, varies over time, and this variability is characterized by the Dirichlet distribution [7].

The Dirichlet distribution of $Y$ has parameters $\alpha = (\alpha_1, \ldots, \alpha_K)$, which are positive real numbers, and its density function is given by [8]:

\[ p(Y \mid \alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k y_k^{\alpha_k - 1}. \tag{1} \]

Conceptually, we are making $N$ independent draws from a categorical distribution with $K$ categories. Let us represent the $N$ draws as random draws on the $K$ variables and denote the number of times a particular category $k$ has been seen among the $K$ categories as $n_k$, with $\sum_{i=1}^{N} x_{ik} = n_k$.

Illustration 2: Moving closer to the application of this study and following up on Illustration 1 with $K = 4$ and $M = 270$, let $N = 17$, where $N$ is the number of days in the recorded dataset used to obtain $n_k$. The variable of interest is the number of C19 deaths per day.
After 17 days of daily observations of the number of C19 deaths, suppose the totals for the 4 categories (provinces) are $n_1 = 34$, $n_2 = 27$, $n_3 = 56$, $n_4 = 153$, with a total of $M = 270$ deaths. Daily observations are not important, only the totals after the $N = 17$ days. This is because the likelihood function (discussed next) in the predictive distribution depends only on the numbers $n_k$ from the last observation. In the next section we introduce the compound Dirichlet Multinomial distribution.

The compound Dirichlet Multinomial distribution

Assume $Y$ is distributed $\mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$ and draw $Y = y_1, \ldots, y_K$ from this distribution. Then $X \mid Y \sim \mathrm{MN}(M, Y)$, and the marginal distribution of $X$ is referred to as a compound Dirichlet-Multinomial (CDM) distribution with parameters $M$ and $(\alpha_1, \ldots, \alpha_K)$ [9]. Let $\sum_{k=1}^{K} \alpha_k = \alpha_0$. The density function is obtained by integrating $Y$ out over the simplex (the multinomial coefficient, which does not depend on $Y$, is omitted):

\begin{align}
f(X \mid M, \alpha_1, \ldots, \alpha_K) &= \int p(X \mid M, Y)\, p(Y \mid \alpha_1, \ldots, \alpha_K)\, dY \\
&= \int \prod_k y_k^{n_k} \times \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \prod_k y_k^{\alpha_k - 1}\, dY \\
&= \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \int \prod_k y_k^{n_k + \alpha_k - 1}\, dY \\
&= \frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)} \, \frac{\prod_k \Gamma(\alpha_k + n_k)}{\Gamma\left(\alpha_0 + \sum_k n_k\right)}. \tag{2}
\end{align}

The conditions $\sum_{k=1}^{K} n_k = M$, $x_{ik} \geq 0$, $n_k > 0$, $\alpha_k > 0$ must hold.

In a Bayesian framework, the predictive density function of a future $X$, say $Z$ [10], where $Z \mid M, Y \sim \mathrm{MN}(M, Y)$ and $Y \mid \alpha_1, \ldots, \alpha_K \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$, is similar to the CDM density function: given the observed category totals $n_1, \ldots, n_K$, the predictive distribution is $Z \mid n_1, \ldots, n_K \sim \mathrm{CDM}(M; (\alpha_1 + n_1, \ldots, \alpha_K + n_K))$.
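
The conjugate update and the CDM density can be illustrated with a minimal Python sketch. Only the four province totals are taken from Illustration 2. The uniform prior (every alpha_k set to 1), the function names, and the future total of 300 deaths are assumptions made purely for illustration and are not the authors' choices. The helper log_cdm_kernel follows Eq. (2) as written, while log_cdm_pmf additionally includes the multinomial coefficient so that the probabilities sum to one.

```python
from math import lgamma

def posterior_alpha(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior plus category totals n_k
    gives a Dirichlet(alpha_k + n_k) posterior."""
    return [a + n for a, n in zip(alpha, counts)]

def log_cdm_kernel(counts, alpha):
    """Log of Eq. (2): Gamma(a0) / Gamma(a0 + M) * prod_k Gamma(a_k + n_k) / Gamma(a_k)."""
    a0 = sum(alpha)
    m = sum(counts)
    out = lgamma(a0) - lgamma(a0 + m)
    for a, n in zip(alpha, counts):
        out += lgamma(a + n) - lgamma(a)
    return out

def log_cdm_pmf(counts, alpha):
    """Eq. (2) multiplied by the multinomial coefficient M! / prod_k n_k!,
    which turns the kernel into a normalised probability mass function."""
    m = sum(counts)
    log_coef = lgamma(m + 1) - sum(lgamma(n + 1) for n in counts)
    return log_coef + log_cdm_kernel(counts, alpha)

def predictive_mean(alpha, m_future):
    """Expected category counts of a CDM(m_future, alpha) outcome."""
    a0 = sum(alpha)
    return [m_future * a / a0 for a in alpha]

# Province totals from Illustration 2 (M = 270 deaths over N = 17 days).
counts = [34, 27, 56, 153]

# ASSUMPTION: uniform Dirichlet prior; the paper does not specify this here.
prior = [1.0, 1.0, 1.0, 1.0]

# Posterior Dirichlet parameters alpha_k + n_k.
post = posterior_alpha(prior, counts)

# ASSUMPTION: hypothetical future total of 300 deaths, split across provinces
# according to the posterior mean proportions.
expected = predictive_mean(post, 300)

print("posterior alpha:", post)
print("expected future counts:", [round(e, 1) for e in expected])
```

Working with log-gamma values rather than gamma values avoids overflow once the counts reach the hundreds, which is the reason the sketch returns log densities.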