A New Hybrid Model Using an Autoregressive Integrated Moving
Total Page:16
File Type:pdf, Size:1020Kb
Am. J. Trop. Med. Hyg., 97(3), 2017, pp. 799–805 doi:10.4269/ajtmh.16-0648 Copyright © 2017 by The American Society of Tropical Medicine and Hygiene A New Hybrid Model Using an Autoregressive Integrated Moving Average and a Generalized Regression Neural Network for the Incidence of Tuberculosis in Heng County, China Wudi Wei,1† Junjun Jiang,1† Lian Gao,2 Bingyu Liang,1 Jiegang Huang,1 Ning Zang,1,3 Chuanyi Ning,1,3 Yanyan Liao,1,3 Jingzhen Lai,1 Jun Yu,1 Fengxiang Qin,1 Hui Chen,4 Jinming Su,1 Li Ye,1* and Hao Liang1,3* 1Guangxi Key Laboratory of AIDS Prevention and Treatment and Guangxi Universities Key Laboratory of Prevention, Guangxi Medical University, 22 Shuangyong Road, Nanning, Guangxi, China; 2Department of Infectious Diseases, Heng County Centers for Disease Control and Prevention, 16 Gongyuan Road, Heng County, China; 3Life Sciences Institute, Guangxi Medical University, 22 Shuangyong Road, Nanning, China; 4Geriatrics Digestion Department of Internal Medicine, The First Affiliated Hospital of Guangxi Medical University, 6 Shuangyong Road, Nanning, China Abstract. It is a daunting task to eradicate tuberculosis completely in Heng County due to a large transient population, human immunodeficiency virus/tuberculosis coinfection, and latent infection. Thus, a high-precision forecasting model can be used for the prevention and control of tuberculosis. In this study, four models including a basic autoregressive integrated moving average (ARIMA) model, a traditional ARIMA–generalized regression neural network (GRNN) model, a basic GRNN model, and a new ARIMA–GRNN hybrid model were used to fit and predict the incidence of tuberculosis. Parameters including mean absolute error (MAE), mean absolute percentage error (MAPE), and mean square error (MSE) were used to evaluate and compare the performance of these models for fitting historical and prospective data. The new ARIMA–GRNN model had superior fit relative to both the traditional ARIMA–GRNN model and basic ARIMA model when applied to historical data and when used as a predictive model for forecasting incidence during the subsequent 6 months. Our results suggest that the new ARIMA–GRNN model may be more suitable for forecasting the tuberculosis incidence in Heng County than traditional models. INTRODUCTION which is generally used to analyze the nonlinear data sets, could also be used.16–19 Combination models using the basic Guangxi Zhuang Autonomous Region has the second 1 assumptions of both ARIMA and GRNN have been used for largest burden of tuberculosis cases in China. Tuberculosis predicting disease incidence rates. The previous studies have fi has signi cant economic impact in Guangxi and may be an also shown that a combined model may provide the advan- ’ 2 obstacle to Guangxi s development. The tuberculosis epi- tages from both earlier models and increase forecasting demic of Heng County is particularly problematic in Guangxi. accuracy.20–23 However, Wu and others24 found that the fi The morbidity of tuberculosis was signi cant, and ranked predictions of the ARIMA–GRNN model were not as good as second in Nationally Notifiable Infectious Diseases of the 3,4 those of the ARIMA model for the incidence of hemorrhagic county. Heng County is a key area in the Guangxi Beibu Gulf fever with renal syndrome. Therefore, traditional ARIMA– Economic Zone, with a large transient population that pro- 5,6 GRNN models may not be useful for predicting the incidence motes the spread of tuberculosis. Currently, the prevalence of all infectious diseases, and the traditional hybrid models fi of human immunode ciency virus (HIV)-tuberculosis coin- may be improved under some circumstances. fections represent a serious new challenge to the public health To our knowledge there is no research with respect to system. In 2011, about 1.1 million (13%) of the 8.7 million forecasting the incidence of tuberculosis using these statis- people infected with tuberculosis worldwide were HIV- 7–9 tical models. In this study, we developed ARIMA models, positive. In addition, it is estimated that about one-third of – – 10 traditional ARIMA GRNN models, and new ARIMA GRNN people who are infected with tuberculosis are latent carriers. models to develop an optimal model for tuberculosis in- Furthermore, multidrug-resistant tuberculosis is prevalent, 11,12 cidence forecasting in Heng County. We expect that the especially in rural areas. To cope with the challenges to model will play an important role in prevention and control of the public health system posed by the spread of tuberculosis, tuberculosis. a series of preventative measures should be used in locations such as Heng County. Among these measures, a highly ac- curate epidemiologic model of the disease that can predict the MATERIALS AND METHODS trends in incidence that can help the public health authorities Materials source. We obtained all the necessary data in- control the epidemic would be useful. cluding the incidence of tuberculosis from January 2005 to At this time, the existing mathematical models, specifically December 2013 and population information from the Heng the autoregressive integrated moving average (ARIMA) model County Center for Disease Control and Prevention (CDC) and that is widely used to predict incidence of infectious – the Heng County Statistics Bureau. All tuberculosis cases diseases,13 15 are based on linear assumptions. Moreover, must be reported to the Heng County CDC through the net- the generalized regression neural network (GRNN) model, work system within 12 hours of diagnosis. Thus, the tuber- culosis incidence data for Heng County are known to be reliable and complete because there is compulsory reporting. * Address correspondence to Hao Liang, Life Sciences Institute, The original data can be gained from the corresponding Guangxi Medical University, Guangxi, China, E-mail: lianghao@ author. gxmu.edu.cn or Li Ye, Guangxi Key Laboratory of AIDS Prevention and Treatment, School of Public Health, Guangxi Medical University, We divided the data into two parts so that we could compare Guangxi, China, E-mail: [email protected]. the modeling and predictive performances of the two com- † These authors contributed equally to this paper. bined models and the ARIMA model. The incidence rates from 799 800 WEI, JIANG AND OTHERS FIGURE 1. Monthly incidence of tuberculosis from January 2005 to June 2013. January 2005 to June 2013 were used to fit the models, regulation parameter, the smoothing factor. The equivalent whereas the rest were used to validate the four models. nonlinear regression formula shows how the GRNN model Any ethical considerations in this study were limited. No works: informed consent was required or was feasible because all the ! data used for the modeling came from publicly accessible n D2 + Y exp À i secondary data sources, and no personal identifiers were i s2 ^ i¼1 2 entailed. YðXÞ¼ ! ARIMA model construction. n D2 The ARIMA (p,d,q) (P,D,Q)s + exp À i s2 model is a frequently used time series model that contains i¼1 2 seven different parameters: 1) the order of autoregression (p), h i 2 ¼ ð À ÞT ð À Þ 2) the degree of difference (d), 3) the order of the moving av- Di X Xi X Xi erage (q), 4) the seasonal autoregression lags (P), 5) the degree ... of seasonal difference (D), 6) the seasonal moving average where X means the input vector (X1, X2, , Xi), which consists ^ lags (Q), and 7) the length of any cyclical pattern(s). We used of n predictor variables; Y denotes the output values predicted ^ ^ the augmented Dickey-Fuller (ADF) unit root test to first de- by the GRNN; YðXÞ is the expected value of the output Y given 26 termine the stationary value of the data. Second, log trans- an input vector X; and s is the smoothing factor. Thus, the formation and differencing were used to maintain the stationary smoothing factor is very significant contributor in the GRNN time series if the sequence is not stationary.23 Finally, we used model. the Box–Jenkins strategy to construct the ARIMA models in- In this study, the forecasted incidence of the ARIMA model cluding three steps: identification, estimation, and diagnostic and the actual incidence were used as the input and output checking.25 The R2, the Schwarz–Bayesian information crite- variables, respectively, to construct the traditional ARIMA– rion, and the Akaike information criterion were used to select GRNN model. The smoothing factor is the only parameter of the most appropriate model.24 the model that directly affects the precision of the predictions. Construction of the traditional ARIMA–GRNN model. Specht18 proposed a method to find the optimal smoothing The GRNN model is one of the various artificial neural network factor. We randomly selected two samples (each sample is models that were developed by Specht18; it contains four likely to be selected, and the probability is the same) as the layers, input layer, pattern layer, summation layer, and output testing data to develop the GRNN model. Subsequently, the layer. Compared with other networks, the GRNN model has model was examined by a series of variables to find the opti- strong ability for nonlinear mapping and good learning ability mal smoothing factor for which the minimum mean square skills. The forecasting performance of the network is good and error (MSE) of the model was the lowest. Finally, the stable even if the sample size is small. Moreover, the con- struction of the GRNN model is straightforward and only has a TABLE 2 Estimate parameters of the ARIMA (0,1,2)(2,1,2)12 model Variable Coefficient t statistic P value TABLE 1 The AIC and SBC values of the five ARIMA models MA(1) −0.3060 −2.3990 0.0200 Model AIC SBC R2 MA(2) 0.2760 2.1760 0.0340 SAR(12) −0.4250 −3.2490 0.0020 ARIMA(1,1,0)(2,1,1)12 0.0811 0.2161 0.4734 SAR(24) −0.6060 −5.2310 0.0000 ARIMA(0,1,0)(2,1,1)12 0.1353 0.2357 0.4274 SMA(12) 0.4430 3.2100 0.0020 ARIMA(0,1,2)(2,1,2) −0.0080 0.1927 0.5476 12 MA(1) = moving average, lag1; MA(2) = moving average, lag2; SAR(12) = seasonal moving AIC = Akaike information criterion; ARIMA = autoregressive integrated moving average; average, lag12; SAR(24) = seasonal moving average, lag24; SMA(12) = season moving SBC = Schwarz Bayesian information criterion.