

Article

Prediction Skill of Extended Range 2-m Maximum Air Temperature Probabilistic Forecasts Using Machine Learning Post-Processing Methods

Ting Peng 1,2, Xiefei Zhi 1,2,*, Yan Ji 1,2, Luying Ji 1,2 and Ye Tian 1

1 Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters (CIC-FEMD)/Key Laboratory of Meteorological Disasters, Ministry of Education (KLME), Nanjing University of Information Science & Technology, Nanjing 210044, China; [email protected] (T.P.); [email protected] (Y.J.); [email protected] (L.J.); [email protected] (Y.T.)
2 WeatherOnline Institute of Meteorological Applications, Wuxi 214000, China
* Correspondence: [email protected]; Tel.: +86-186-5181-4986

Received: 9 June 2020; Accepted: 28 July 2020; Published: 4 August 2020

Abstract: Extended range temperature prediction is of great importance for public health, energy and agriculture. Two machine learning methods, namely neural networks and natural gradient boosting (NGBoost), are applied to improve the prediction skill of the 2-m maximum air temperature with lead times of 1–35 days over East Asia, based on the Environmental Modeling Center, Global Ensemble Forecast System (EMC-GEFS), under the Subseasonal Experiment (SubX) of the National Centers for Environmental Prediction (NCEP). The ensemble model output statistics (EMOS) method is used as the benchmark for comparison. The results show that all the post-processing methods efficiently reduce the prediction biases and uncertainties, especially in lead weeks 1–2. The two machine learning methods outperform EMOS by approximately 0.2 overall in terms of the continuous ranked probability score (CRPS). The neural networks and NGBoost behave as the best models in more than 90% of the study area over the validation period. In our study, the CRPS, which is not a common loss function in machine learning, is introduced to make probabilistic forecasting possible for traditional neural networks. Moreover, we extend the NGBoost model to probabilistic temperature forecasting in the atmospheric sciences, where it obtains a satisfactory performance.

Keywords: machine learning; probabilistic temperature forecast; extended range; neural networks; natural gradient boosting

1. Introduction

Subseasonal or extended range forecasts are used to predict heat waves, extreme cold events, droughts and floods as far as four weeks ahead. Subseasonal forecasts can deliver relevant weather information, such as the timing of the onset of a rainy season and the risk of extreme rainfall events or heat waves. However, there is a well-known gap in current numerical prediction systems for the subseasonal timescale of 10 days to one month. This gap falls between medium-range weather forecasts (up to 10 days) and seasonal climate predictions (longer than one month). Medium-range weather forecasts are influenced by the initial conditions of the atmosphere, whereas predictions of the seasonal climate are more affected by slowly evolving surface boundary conditions, such as the sea surface temperature and soil moisture content [1–6]. Predictions on the subseasonal timescale have made progress in some regions and seasons [1,7,8], although the full potential of their predictability requires further exploration.


The Subseasonal Experiment (SubX) [9] is a multi-model ensemble experiment for subseasonal prediction. It is a research-to-operations project including global climate prediction models from both operational and research centers. It was developed to provide guidance for real-time subseasonal prediction and to improve forecast skills. SubX covers seven global models from US and Canadian modeling groups and has produced 17 years of historical retrospective (re)forecasts. It is an open research database (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/) designed to operational standards [9]. The reforecasts from all seven global models are required to include, but are not limited to, the period from 1999 to 2015, with at least three ensemble members, a minimum of weekly initialization and forecasts extending at least 32 days ahead. In addition, some SubX models have also provided more than 18 months of real-time forecasts since 2016.

The systematic bias of the raw ensemble forecasts can be corrected using statistical post-processing methods. These methods reduce bias by learning a function that relates predictors to the variable of interest, which can be viewed as a supervised machine learning task. Bayesian model averaging [10] and ensemble model output statistics (EMOS) [11] are two state-of-the-art methods used in probabilistic forecasting. However, both of these methods rely on parametric forecast distributions, which means that a predictive distribution has to be specified in advance and its parameters estimated. We use the EMOS frame as a benchmark method because of its superior performance [11] and low computational cost. Two alternative approaches based on machine learning, neural networks and natural gradient boosting (NGBoost), which learn nonlinear mappings from abundant predictors in a data-driven way, are also considered here.

In contrast to traditional multi-model superensemble forecasts, which may perform better than the best individual model forecasts and the multi-model ensemble mean forecasts [12–16], a neural network is a flexible machine learning algorithm that can deal with complex problems by approximating arbitrary nonlinear functions [17]. A neural network consists of interconnected nodes organized in layers and regulated by an activation function. Neural networks are used in a number of different fields, including computer vision and natural language processing [18], as well as in biology, physics and chemistry [19,20]. In the atmospheric sciences, neural networks have been applied to precipitation nowcasting [21,22] as well as short- and medium-range weather forecasts [23]. Rasp and Lerch [24] trained neural networks without a hidden layer and with a single hidden layer for post-processing ensemble temperature forecasts, establishing a nonlinear mapping between the outputs of numerical models and the corresponding observations to reduce the bias of raw ensemble prediction systems.

NGBoost is an algorithm that allows gradient boosting to make probabilistic forecasts in a generic way [25]. Gradient boosting is a supervised learning method that combines several weak learners into an additive ensemble [26]. The gradient boosting method has been widely applied in prediction tasks, although it has rarely been applied to probabilistic forecasting. NGBoost combines the natural gradient with a multi-parameter boosting algorithm to estimate intuitively how the distribution parameters vary with the observed features.
Experiments on several regression datasets have shown that NGBoost is more flexible, modular and faster than existing methods for probabilistic forecasting [25,27]. We use two post-processing methods in a machine learning framework (neural networks and NGBoost), together with the EMOS benchmark post-processing method, for extended range 2-m maximum air temperature probabilistic forecasts.

This paper is organized as follows. Section 2 describes the datasets used, followed by the introduction of the three post-processing methods and scoring rules in Section 3. Section 4 presents our main results and a summary is provided in Section 5. The two machine learning methods are discussed in Section 6.

2. Data

We obtain raw ensemble forecasts from the Environmental Modeling Center, Global Ensemble Forecast System (EMC-GEFS) model under the SubX project (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/) [9]. The base model of the EMC-GEFS is a numerical weather prediction atmosphere–land model forced with prescribed sea surface temperatures. This base model contributes 11 ensemble members to the SubX reforecasts. The group provides reforecasts for the 1999–2016 period with weekly initialization and a lead time of 1–35 days. The anomaly correlation coefficients of precipitation and the 2-m temperature for week three, the anomaly correlation coefficients of the Real-time Multivariate Madden–Julian Oscillation indices and the North Atlantic Oscillation index all prove the superior prediction skills of the EMC-GEFS model [9].

We focus on the 2-m maximum air temperature forecasts over East Asia (15–60° N, 70–140° E) with a resolution of 1° × 1° for a lead time of 1–35 days. To ensure that our verification procedure mimics operational conditions, we set aside the data for the year 2016 as a validation set. The observations used for verification are obtained from the National Oceanic and Atmospheric Administration Climate Prediction Center (ftp://ftp.cdc.noaa.gov/Datasets/cpc_global_temp/) [9]. For the 2-m maximum air temperature over the land, the Climate Prediction Center provides a maximum daily temperature (Tmax) dataset with a horizontal resolution of 0.5° × 0.5° [28]. The verification data are re-gridded to the coarser EMC-GEFS model resolution of 1° × 1°.
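As a concrete illustration of this preprocessing step, the sketch below regrids the 0.5° CPC Tmax analyses to the 1° model grid by block-averaging. It is a minimal example assuming the NetCDF layout of the cpc_global_temp files; the file and variable names are illustrative.

```python
import xarray as xr

# Open one year of CPC 0.5-degree daily Tmax analyses (file name illustrative).
obs = xr.open_dataset("tmax.2016.nc")["tmax"]

# CPC grids may store latitude in descending order; sort so slicing works.
obs = obs.sortby("lat")

# Subset East Asia (15-60 N, 70-140 E) and average each 2x2 block of
# 0.5-degree cells onto the coarser 1-degree EMC-GEFS grid.
obs = obs.sel(lat=slice(15, 60), lon=slice(70, 140))
obs_1deg = obs.coarsen(lat=2, lon=2, boundary="trim").mean()
```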

3. Post-Processing and Verification Methods

3.1. Ensemble Model Output Statistics

EMOS is a variant of the model output statistics method and regression techniques designed for probabilistic forecasting [29,30]. In the simple frame of model output statistics, which only uses the ensemble member forecasts x = (x_1, ..., x_K) as predictors and multiple linear regression as the transfer function, the predictand y can be written as:

y = a + b_1 x_1 + \cdots + b_K x_K    (1)

where a and b_1, ..., b_K (or denoted by b) are the regression coefficients and K is the number of ensemble members. There has been little research into the application of regression techniques to probabilistic forecasting [31]. Following Gneiting et al. [11], the conditional distribution of the predictand y given the ensemble member forecast x can be modeled by a single parametric distribution P_\theta with parameters \theta \in \mathbb{R}^d:

y | x \sim P_\theta(x)    (2)

When the distribution of the weather variable of interest y (e.g., temperature) is Gaussian, Equation (2) can be written as:

y | x \sim \mathcal{N}(\mu, \sigma^2)    (3)

where \mu is the mean and \sigma is the standard deviation. By applying regression theory to probabilistic forecasting, the EMOS method attempts to reduce the bias between the predictive mean/variance and the regression estimate by using a bias-corrected weighted average. Hence we use a linear function of the ensemble mean and spread to fit the predictive mean and variance:

\mu = a + b_1 x_1 + \cdots + b_K x_K, \quad \sigma^2 = c + d S^2    (4)

where S^2 is the ensemble spread and c and d are nonnegative coefficients. Combining Equations (3) and (4), the Gaussian predictive distribution can be written as:

y | x \sim \mathcal{N}(a + b_1 x_1 + \cdots + b_K x_K, \; c + d S^2)    (5)

The EMOS coefficients \theta = (a, b, c, d) are estimated by minimizing a proper scoring rule (e.g., the continuous ranked probability score (CRPS) or the maximum likelihood estimation (MLE)) during the training period. Our EMOS experiments are implemented in R using the scoringRules package [32].
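The following Python sketch illustrates this minimum-CRPS estimation, assuming the Gaussian predictive distribution of Equation (5). The paper's EMOS is fitted in R with the scoringRules package; this re-implementation is our own, and squaring c and d to enforce nonnegativity is an implementation choice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y (Eq. (12))."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def fit_emos(X, y):
    """Estimate the EMOS coefficients (a, b_1..b_K, c, d) of Eqs. (4)-(5).

    X: (n, K) array of raw ensemble forecasts; y: (n,) observations.
    """
    n, K = X.shape
    s2 = X.var(axis=1)                      # ensemble spread S^2 per case

    def mean_crps(params):
        a, b = params[0], params[1:K + 1]
        c, d = params[K + 1], params[K + 2]
        mu = a + X @ b
        sigma = np.sqrt(c**2 + d**2 * s2)   # squares keep sigma^2 nonnegative
        return crps_gaussian(mu, sigma, y).mean()

    x0 = np.concatenate([[0.0], np.full(K, 1.0 / K), [1.0, 1.0]])
    return minimize(mean_crps, x0, method="BFGS").x
```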

3.2. Neural Networks

As in the flowchart of neural networks shown in Figure 1, the neural networks consist of nodes organized in layers. The first (input) layer contains the input features, whereas the last (output) layer represents the output targets. The layers between the input and output layers are referred to as hidden layers. Apart from the input features, each node in the network can be computed as:

z_j^l = f(a_j^l)    (6)

where f is an activation function, z_j^l is the value of the jth node in the lth layer and a_j^l is the weighted average of the outputs z_i^{l-1} in the previous layer. Additionally, a_j^l can be written as a_j^l = \sum_{i=1}^{m_l} w_{ji}^l z_i^{l-1} + b, where w_{ji}^l is the weight between the ith node in the (l-1)th layer and the jth node in the lth layer and b is a bias term, usually constant. The activation function f is usually a nonlinear function, which allows the network to be more complex and robust; the rectified linear unit (ReLU) is used here in all but the output layer. The weights and biases are optimized to reduce the loss function using the Adam optimization method [33]. The topological structure of the neural networks is of great importance for their performance, and the number of nodes in each hidden layer is usually optimized via cross-validation. In our study, we take the 11 raw ensemble forecasts as the inputs and the predictive parameters \mu and \sigma as the targets of the neural networks. We then use an analogous linear-search method to find the optimal configuration of the neural networks, building several models with different numbers of nodes (i.e., 8, 16, 32, ..., 256, 512, 1024) in the hidden layers. After these assessments (not shown), we finally build a neural network model consisting of two hidden layers containing 64 and 256 nodes, respectively.

Figure 1. The flowchart of neural networks. F or f_1, f_2, ..., f_11, the raw ensemble forecasts as the inputs; W or W_1, W_2, ..., W_m, the weights of the nodes; a or a_1, a_2, ..., a_m, the weighted average of the outputs in the previous layer; z or z_1, z_2, ..., z_m, the outputs with the rectified linear unit (ReLU) applied to a; \mu and \sigma, the two target parameters: mean and standard deviation; O, the verification or the observation; CRPS, the continuous ranked probability score. Here, m is the number of nodes.

Neural networks can be applied to a range of problems but have rarely been applied to probabilistic forecasting [24]. The difficulty lies in building the correct loss function for probabilistic forecasting. Here, we use a closed-form expression of the CRPS (see Section 3.4) with a Gaussian distribution for temperature probabilistic forecasting. The network models the conditional distribution of the observation y given the ensemble member forecast x as the input.
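A minimal Keras sketch of this setup is given below. The 64- and 256-node hidden layers follow the configuration selected above; predicting log σ rather than σ (to guarantee positivity) and the commented-out training hyperparameters are our own illustrative choices.

```python
import numpy as np
import tensorflow as tf

def crps_gaussian_loss(y_true, y_pred):
    """Mean closed-form Gaussian CRPS (Eq. (12)) used as the training loss."""
    mu, sigma = y_pred[:, 0:1], tf.exp(y_pred[:, 1:2])   # sigma > 0 via exp
    z = (y_true - mu) / sigma
    cdf = 0.5 * (1.0 + tf.math.erf(z / np.sqrt(2.0)))    # standard normal CDF
    pdf = tf.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)     # standard normal PDF
    crps = sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / np.sqrt(np.pi))
    return tf.reduce_mean(crps)

# Map the 11 raw ensemble members to the two distribution parameters.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(11,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2),                 # [mu, log_sigma], linear output
])
model.compile(optimizer="adam", loss=crps_gaussian_loss)
# model.fit(X_train, y_train, epochs=50, batch_size=1024)  # X: (n, 11), y: (n, 1)
```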

3.3. Natural Gradient Boosting

Figure 2 presents the flowchart of NGBoost. NGBoost is a supervised learning method for probabilistic forecasting. The approach uses the natural gradient to address the technical challenges that make generic probabilistic forecasting difficult with existing gradient boosting methods. The origins of the natural gradient can be traced to the field of information geometry [34], where it was initially defined for the statistical manifold with the distance measured using the Kullback–Leibler divergence [35]. The generalized natural gradient is the direction of steepest ascent in Riemannian space and is formed as:

\tilde{\nabla} S(\theta, y) \propto \mathcal{I}(\theta)^{-1} \nabla S(\theta, y)    (7)

where \mathcal{I}(\theta) is the Riemannian metric and \nabla S(\theta, y) is the ordinary gradient of a scoring rule S. The natural gradient is invariant to parametrization, which distinguishes it from the ordinary gradient and helps to reflect how the space of distributions moves when updating. The algorithm has three main components: (1) the proper scoring rule (e.g., MLE or CRPS); (2) the parametric probability distribution (e.g., normal or Laplace); (3) the base learner (e.g., decision tree or linear). Different configurations are required for different kinds of problems.

Figure 2. The flowchart of natural gradient boosting (NGBoost). F or f_1, f_2, ..., f_11, the raw ensemble forecasts as the inputs; L or L_1, L_2, ..., L_m, the base learners; \mu and \sigma, the two target parameters: mean and standard deviation; O, the verification or the observation; CRPS, the continuous ranked probability score. Here, m is the number of base learners.

By assigning a score to a forecast probability density f at the verifying observation y, the scoring rule S(f, y) represents the bias between the forecast and the true distribution [36]. Proper scoring rules used as loss functions prompt the model to obtain more calibrated probabilities during training. The most commonly used proper scoring rules are MLE and CRPS, although CRPS is generally considered more robust than MLE [37]. Therefore, we chose CRPS as the target scoring rule in our experiments (see Section 3.4).

We focus on temperature probabilistic forecasting. The input x contains the raw ensemble forecasts and the target y is the corresponding observation. Assuming that the variable temperature follows a normal distribution, \theta = (\mu, \log \sigma) are appointed as the predicted parameters and linear learners are used here as the base learner l to speed up the calculation. The total number M of training linear learners is set to 100.

To obtain the predicted parameters \theta for the input x, each base learner l^{(k)} (k = 1, ..., m) first makes an independent prediction based on the same x and then the results are integrated. Note that there are two base learners l^{(k)} = (l_\mu^{(k)}, l_{\log \sigma}^{(k)}) per stage for a normal distribution with parameters \mu and \log \sigma.

We use a stage-specific scaling factor \rho^{(k)} and a common learning rate \eta to scale the predicted outputs in the ith iteration:

\theta^{(i)} = \theta^{(i-1)} - \eta \sum_{k=1}^{m} \rho^{(k)} \cdot l^{(k)}(x)    (8)

The predicted parameters are updated to \theta^{(i)} by adding to each \theta^{(i-1)} the projected natural gradient \rho^{(k)} \cdot l^{(k)}(x), scaled by a small learning rate \eta. The learning algorithm starts with a random common \theta^{(0)} and minimizes the sum of the scoring rule S over all the training samples. More details are available in Duan et al. [25].
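A sketch using the open-source ngboost package is given below. The class and option names (NGBRegressor, Normal, CRPScore, pred_dist) follow the package API at the time of writing and should be treated as illustrative; Ridge with zero penalty stands in for the linear base learner, and X_train/y_train/X_valid denote the data arrays.

```python
from ngboost import NGBRegressor
from ngboost.distns import Normal
from ngboost.scores import CRPScore
from sklearn.linear_model import Ridge

# Normal predictive distribution, CRPS scoring rule and linear base learners,
# with M = 100 boosting stages, as described in the text.
ngb = NGBRegressor(
    Dist=Normal,
    Score=CRPScore,
    Base=Ridge(alpha=0.0),
    n_estimators=100,
)
ngb.fit(X_train, y_train)        # X_train: (n, 11) raw members; y_train: (n,)

dist = ngb.pred_dist(X_valid)    # predictive N(mu, sigma^2) for each case
mu, sigma = dist.params["loc"], dist.params["scale"]
```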

3.4. Verification Methods

Probabilistic forecasting aims to maximize the sharpness of the predictive distributions subject to calibration [38]. For calibration, the forecast probability density functions (PDFs) and verifications are expected to be consistent with each other; for sharpness, the prediction intervals are expected to be narrowed by the post-processing methods to obtain a more concentrated and sharper forecast PDF. The Talagrand diagrams, also known as verification rank histograms [39–42], and the probability integral transform (PIT) histograms are often adopted to evaluate the calibration of ensemble forecasts. They are analogous, but the Talagrand diagrams tend to assess the spread, whereas the PIT histograms assess the PDF forecasts. An ideal ensemble forecast behaves as a uniform distribution; in most cases, however, under-dispersive forecasts produce U-shaped PIT histograms. Here, PIT histograms have been selected to discuss the calibration of the post-processing methods. In addition to showing PIT histograms, we computed the coverage and average width of the 88.33% central prediction interval, which corresponds to the range of an 11-member ensemble. The coverage represents the accuracy, while the average width is used to assess the sharpness.

To verify the deterministic forecasts, we computed the mean absolute error (MAE) and the root-mean-square error (RMSE). Denoting \mu_n as a deterministic forecast and y_n as the corresponding observation, the MAE is defined as:

MAE = \frac{1}{N} \sum_{n=1}^{N} |y_n - \mu_n|    (9)

and the RMSE can be written as:

RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_n)^2}    (10)

where N is the number of cases in the training set. We also computed the CRPS for the assessment of predictive PDFs to address calibration and sharpness simultaneously. The CRPS is the integral of the Brier scores at all possible threshold values for a continuous predictand [43]. Denoting F_\theta as the predictive cumulative distribution function (CDF) with parameters \theta and y as the observation, the CRPS is defined as:

crps(F_\theta, y) = \int_{-\infty}^{y} F_\theta(z)^2 \, dz + \int_{y}^{\infty} (1 - F_\theta(z))^2 \, dz    (11)

When F_\theta is the CDF of a normal distribution with \theta = (\mu, \sigma^2), Equation (11) can be written as:

crps(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ \frac{y - \mu}{\sigma} \left( 2\Phi\left(\frac{y - \mu}{\sigma}\right) - 1 \right) + 2\varphi\left(\frac{y - \mu}{\sigma}\right) - \frac{1}{\sqrt{\pi}} \right]    (12)

where \varphi and \Phi denote the PDF and the CDF of the standard normal distribution, respectively. The CRPS is negatively oriented in a similar manner to the MAE: a smaller value is better.

The continuous ranked probability skill score (CRPSS) is used to measure the probabilistic skill relative to a reference predictive distribution F_{ref}:

CRPSS(F, y) = 1 - \frac{CRPS(F, y)}{CRPS(F_{ref}, y)}    (13)

The CRPSS is positively oriented, which means that positive values indicate an improvement on the reference forecast. We use the raw ensemble forecasts as the reference forecasts. Here, we assess the application of EMOS and the two machine learning post-processing methods to 1–35 day forecasts of the 2-m maximum air temperature for all land grid points over East Asia (15–60° N, 70–140° E) using the raw ensemble forecast from EMC-GEFS described by Zhu [44]. The calendar year 2016 is used as the validation period. Note that the EMC-GEFS model initializes once a week, so the actual validation period consists of 53 initialization days.
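The sketch below gathers these verification measures for a set of Gaussian predictive distributions. It is a minimal re-implementation under the paper's Gaussian assumption, with crps_ref denoting the per-case CRPS of the reference (raw ensemble) forecasts.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) at observation y (Eq. (12))."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def verify(mu, sigma, y, crps_ref, level=0.8833):
    """Mean CRPS, CRPSS (Eq. (13)), PIT values and central-interval diagnostics."""
    crps = crps_gaussian(mu, sigma, y)
    crpss = 1.0 - crps.mean() / crps_ref.mean()
    pit = norm.cdf(y, loc=mu, scale=sigma)       # uniform when calibrated
    lo = norm.ppf((1 - level) / 2, loc=mu, scale=sigma)
    hi = norm.ppf((1 + level) / 2, loc=mu, scale=sigma)
    coverage = np.mean((y >= lo) & (y <= hi))    # accuracy of the interval
    width = np.mean(hi - lo)                     # sharpness of the interval
    return crps.mean(), crpss, pit, coverage, width
```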

4. Results

4.1. Overall Performance of EMOS, the Neural Network and NGBoost

Table 1 provides the summary measures of the deterministic-style forecast accuracy. For better visualization and understanding, we average the MAE and RMSE over lead weeks instead of individual lead days; the MAE in the first lead week (week 1), for example, is the average MAE of each grid point in the study area during the validation period for lead times of 1–7 days. The deterministic neural network forecast clearly gives the best performance, with a mean MAE 24% and 9% lower than that of the raw ensemble and the EMOS forecasts, respectively, for lead times of 1–35 days. The NGBoost method shows similar results, whereas the neural network method performs slightly better at longer lead times.

Table 1. Comparison of deterministic forecasts of the 2-m maximum air temperature over East Asia for the year 2016 at different lead times. The values representing the best performance are marked in bold. ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural networks; MAE, mean absolute error; RMSE, root-mean-square-error.

       Week 1       Week 2       Week 3       Week 4       Week 5
       MAE   RMSE   MAE   RMSE   MAE   RMSE   MAE   RMSE   MAE   RMSE
ENS    2.91  3.52   3.12  3.80   3.31  4.05   3.49  4.30   3.64  4.50
EMOS   2.20  2.84   2.49  3.21   2.78  3.57   3.05  3.90   3.26  4.17
NGB    2.05  2.63   2.30  2.93   2.54  3.24   2.77  3.53   2.97  3.77
NN     2.05  2.64   2.28  2.92   2.52  3.22   2.75  3.50   2.93  3.73

The comparison of the CRPS between these three post-processing methods and the raw ensemble is presented in Figure 3. It shows that the raw ensemble forecasts carry great uncertainty and that all the post-processing methods, especially the two machine learning methods, are able to reduce the CRPS relative to the raw ensemble at all lead times. The improvement in performance can be divided into two parts: weeks 1–2 and weeks 3–5. At a lead time of weeks 1–2, the prediction skill of all the post-processing methods and the raw ensemble decreases with increasing lead time in terms of the CRPS. All the post-processing methods reduce the forecast bias sharply. The neural network and NGBoost methods perform best and extend the available lead time by about 10 days. EMOS achieves similar results to the two machine learning methods in week 1 but performs relatively poorly after week 2. The prediction skill of the post-processing methods and the raw ensemble tends to be stable with little variance as the lead time increases beyond week 3.

Figure 3. Mean continuous ranked probability score of the different post-processing methods for the 2-m maximum air temperature at all land grid points over East Asia in 2016 with lead times of 1–35 days. CRPS, continuous ranked probability score; ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural networks.

Table 2 presents the CRPS and the coverage and average width of the 88.33% central prediction intervals to assess the accuracy and sharpness, respectively. NGBoost reduces the CRPS by 39% and 7.5% compared with the raw ensemble and EMOS, respectively. The neural network achieves similar results in terms of the CRPS, whereas the NGBoost prediction intervals show better coverage. The raw ensemble prediction intervals are narrow and their accuracy and coverage are unsatisfactory. By contrast, the NGBoost prediction intervals are not much wider than those of the raw ensemble, presenting better CRPS and coverage scores.

Table 2. Comparison of probabilistic forecasts of the 2-m maximum air temperature over East Asia for the whole year of 2016 at different lead times. The values representing the best performance are marked in bold. CRPS, continuous ranked probability score; ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural network; MAE, mean absolute error; RMSE, root-mean-square-error.

                  Week 1   Week 2   Week 3   Week 4   Week 5
CRPS
ENS               2.40     2.51     2.61     2.71     2.80
EMOS              1.60     1.81     2.02     2.21     2.36
NGB               1.48     1.65     1.82     1.98     2.12
NN                1.49     1.66     1.83     1.98     2.11
Coverage at 88.33% Prediction Interval
ENS               41.73    47.73    52.45    55.99    58.28
EMOS              67.88    67.76    67.91    67.99    67.94
NGB               80.30    80.06    79.98    79.94    79.71
NN                76.43    76.92    77.48    78.51    78.78
Average Width at 88.33% Prediction Interval
ENS               2.16     2.75     3.29     3.77     4.12
EMOS              2.87     3.27     3.68     4.06     4.34
NGB               3.52     3.94     4.38     4.78     5.10
NN                3.35     3.76     4.19     4.64     4.95


CRPSS is useful in probabilistic forecasts of multi-category events. Positive values of the CRPSS indicate that the forecasts are better than the reference prediction. Figure 4 shows the mean CRPSS of different post-processed forecasts compared with the raw ensemble in different lead weeks. The two machine learning post-processing methods achieve the best and similar results, with mean CRPSS values >0.18 at each lead week. EMOS has a poorer performance, although its mean CRPSS values are >0.08. These results suggest that all the post-processing methods improve the raw ensemble probabilistic forecasts and that the two machine learning methods show the best performance.

Figure 4. Boxplots of the mean continuous ranked probability skill score of all the post-processing models at different lead times using the raw ensemble as the reference forecast. CRPSS, continuous ranked probability skill score; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural network.

The PIT histograms of the raw and post-processed forecasts used to assess the calibration are shown in Figure 5. The raw ensemble forecasts are under-dispersed, as indicated by the U-shaped verification rank histogram at almost all the selected lead days. This means that the observations often fall outside the range of the raw ensemble. By contrast, the two machine learning post-processed forecast distributions are better calibrated and the corresponding PIT histograms show much smaller deviations from uniformity at each selected lead day. From this perspective, by calibrating the under-dispersive raw ensemble forecasts with nonlinear functions in the neural network and NGBoost frames, we obtain more idealized ensemble forecasts which reduce the uncertainty and improve the accuracy. The poor performance of EMOS in the PIT histograms with increasing lead days may reflect the disadvantages of the traditional method based on multi-linear regression techniques.


Figure 5. Probability integral transform (PIT) histograms of extended range 2-m maximum air temperature probabilistic forecasts for all land grid points over East Asia using different post-processing methods at different lead times. ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural networks.

4.2. Spatiotemporal Characteristics

Figure 6 presents the daily CRPS of the raw ensemble and post-processed forecasts in the validation period at different lead days. The method presented by Vigaud et al. (2017) is applied in this study. This evaluates the weekly forecasts initialized in the year 2016, which contains 52 validation periods. Figure 6 shows that the raw ensemble has the poorest performance at each selected lead day. The post-processed forecasts are consistent with the raw ensemble, but perform much better in reducing the CRPS of the raw ensemble, particularly at shorter lead times. The two machine learning methods achieve the best and similar results on most days during the validation periods. The improvements are stable even at long lead times.



Figure 6. Mean continuous ranked probability score of the 2-m maximum air temperature probabilistic forecasts for all land grid points over East Asia on each calendar day during the validation period for the raw ensemble and using different post-processing methods at different lead times. CRPS, continuous ranked probability score; ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural networks.

Figure 7 shows the spatial distributions of the CRPSS over East Asia for the 2-m maximum air temperature using different post-processing methods, taking the raw ensemble as the reference forecast. Great improvements are achieved by all the post-processing methods in week 1. Most of the grid points in the study area obtain a positive CRPSS, especially those on the Tibetan Plateau. The two machine learning methods reduce the negative CRPSS area more than the EMOS forecast, which indicates that the NGBoost and neural network methods are practicable in broader regions. In week 2 and later, the EMOS forecast skill decreases rapidly and performs poorly in most of Russia, Mongolia and mid-eastern China. However, the two machine learning methods still perform well in those regions. Although the improvements decrease, the post-processed forecasts of the two machine learning methods give a positive CRPSS for most grid points in the study area. The neural network forecasts perform better than the NGBoost outputs in the northwest of China and most of Russia. These results indicate that the two machine learning methods can make improvements over complex terrain (such as mountainous areas and deserts), where prediction skill is often poor. We consider that the improvements reflect, to some extent, the nonlinearity that thermodynamic and dynamic processes often have.


Figure 7. Continuous ranked probability skill score distribution of the 2-m maximum air temperature probabilistic forecasts for all land grid points over East Asia verified against the raw ensemble using different post-processing methods at different lead times. EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural networks.

The performance of the neural network is similar to that of NGBoost and thus the best-model distribution is introduced to provide a simple comparison. The maps of the best-performing models, in terms of the mean CRPS for each land grid point at different lead weeks, are shown in Figure 8. The two machine learning methods provide the best predictions for most of the grid points. Regardless of the lead week, the two machine learning methods perform better at >90% of the grid points, with clearer advantages at longer lead times. The NGBoost method performs better in week 1, whereas the neural network method is better in week 2. The two machine learning methods are comparable at longer lead times. For most of the areas, such as the Tibetan Plateau, the Yunnan–Guizhou Plateau, the Loess Plateau, the Inner Mongolia Plateau, the central Siberian Plateau and the East Siberian mountainous area, the neural networks behave as the best model, despite the overwhelming performance of NGBoost in week 1. There are also some differences over the Tibetan Plateau; for instance, the best model over its west tends to change from the neural network forecasts to those of NGBoost. In the meantime, over areas including the northwest of China, where the terrain comprises mountains, deserts and basins (in the Xinjiang Province), the neural networks model behaves better, while the NGBoost method dominates in the basins, such as the Qaidam Basin in Qinghai Province and the Sichuan Basin in the southeast center of China. For the most populated areas, such as the plains and hilly areas in China, neural networks and NGBoost perform competitively. In all, these two machine learning methods have their own benefits in different areas, but both of them are good at dealing with complex terrain and adjusting the ensemble spread to provide more reasonable forecasts.

Figure 8. Percentage of the best-performing post-processing methods over the land surface in terms of the mean continuous ranked probability score over the validation period at different lead times. The percentage of the best-performing forecast of each method is listed under the mosaic image. ENS, raw ensemble; EMOS, ensemble model output statistics; NGB, natural gradient boosting; NN, neural network.

5. Conclusions

We compare the performance of the EMOS method and two methods based on machine learning (a neural network and NGBoost) in extended range temperature probabilistic forecasts over East Asia. The EMOS method has been widely used and is already able to reduce systematic bias; it is therefore used as the benchmark for the machine learning methods. Both the neural network and NGBoost improve the forecast skills compared with EMOS over the entire study domain and for each forecast lead time. For convenience, the lead time is separated into five groups from the first to the fifth lead week and the scoring rules are averaged over the lead time intervals.

From the perspective of deterministic forecasts, the MAE and RMSE are used to assess the accuracy of forecasts from the raw ensemble, EMOS, NGBoost and the neural network. The neural network and NGBoost outperform both the raw ensemble and EMOS outputs, and the neural network is slightly better than NGBoost. In terms of convenience and reliability, the neural network is more appropriate for extended range temperature forecasts. However, NGBoost is more flexible and is designed to develop probabilistic forecasts.

From the perspective of probabilistic forecasts, both the neural network and NGBoost are superior to the raw ensemble outputs and the EMOS results in terms of the CRPS and the coverage of the 88.33% prediction interval. NGBoost shows advantages over the neural network from the first to the third lead week in the CRPS and over all lead weeks in terms of coverage. The raw ensemble forecasts give sharper but under-dispersive spreads and poorer scores. After calibration, EMOS, the neural network and NGBoost expand the spread into a more correct status.

The spatiotemporal distributions are investigated over multiple lead times. The raw ensemble forecasts perform the worst, leaving room for EMOS, the neural network and NGBoost to make progress. The gaps among the three methods decrease with increasing lead times from 7 to 28 days; a greater improvement is seen at shorter lead times. The CRPSSs are characterized by spatial variance but still indicate practical improvements. All three methods show remarkable forecast skill over most areas, especially over the Tibetan Plateau. Both the neural network and NGBoost outputs perform better than the EMOS output over East Asia. With increasing forecast lead times, the forecast skill of EMOS decreases remarkably from the first to the second lead week, whereas the two machine learning methods sustain better forecast skills.

The NGBoost and neural network methods give an outstanding performance in extended range probabilistic forecasts over East Asia for the variable of interest, which can be fitted by a normal distribution. The neural network and NGBoost are well matched: it is difficult to distinguish which of the two methods is better and we therefore introduced the best-model distribution to provide a simple comparison. The neural network and NGBoost account for almost 90% of the area and perform better in different lead weeks. However, further investigation of machine learning applications is still required for variables with non-normal distributions (e.g., precipitation and wind speed).

6. Discussion This paper demonstrates the application of two machine learning methods, the neural networks and NGBoost, to the extended range 2-m maximum air temperature probabilistic forecasts. Following the EMOS frame proposed by Gneiting et al. [11], the neural networks and NGBoost are two parametric methods which directly calibrate the individual raw ensemble members by optimizing a proper scoring rule. However, neural networks and NGBoost show advantages in model robustness and the searching for optimized parameters. As shown in the flowchart of neural networks (see Figure1), a neural network consists of two main parts, the forward and backward propagation. For a given forecast time and station, the inputs, the 11 raw ensemble forecasts, are multiplied by m (the node number in the hidden layer) randomly initialized weights W and then activated by a nonlinear function ReLU in the forward propagation to predict the mean µ and standard deviation σ. Over the training dataset, the mean CRPS is Atmosphere 2020, 11, 823 15 of 17 computed as the scoring rule by the predictive µ and σ with corresponding observations. In our study, the Adam optimization is applied to gradient descent in the backward propagation for its learning rate varies with training which helps accelerate the convergence. It is notable that the CRPS is not a common lose function in machine learning. Since the traditional neural networks are incapable for probabilistic forecasting, here we introduce a close form of the CRPS for a Gaussian distribution which helps to extend the neural networks to probabilistic temperature forecast. Furthermore, the inputs of neural networks are more arbitrary, which can add auxiliary predictors to improve prediction skills [24]. This study tackles probabilistic temperature forecasting in atmospheric sciences using NGBoost. It helps make up the gap of gradient boosting method (GBM) in generic probability prediction. The outstanding performance of NGBoost in our study confirms its ability to solve the practical problems. It is a promising machine learning method for probabilistic forecasting for its flexibility and modularization. Different to neural networks in the forward propagation, NGBoost introduces a weak base learner (for instance, the tree model) to replace the activation function of the neural network frame. By integrating the base learners, we obtain the parameters µ and σ of the predictive PDFs. Another difference lies in the parameter optimization where the natural gradient is introduced in NGBoost to overcome the difficulty of simultaneous boosting the two predictive parameters (µ and σ) from the base learners, which is a challenge to the existing GBMs. Furthermore, the natural gradient helps to reflect how the space of distribution moves when updating [25]. For the probabilistic temperature forecast here, the results of bias correction using neural networks and NGBoost are remarkably superior to the traditional EMOS. Neural networks and NGBoost could represent a high nonlinear relationship which is deficiently described in the traditional linear models. Maybe this is the reason why machine learning methods applied in this study perform better. Plus, the machine learning methods are more flexible and the objective function could be adjusted according to the problems to be solved. Compared with the neural networks, the NGBoost is a more integrated and modular method which is especially advantaged with a small training dataset. 
However, NGBoost is limited for skewed probability distributions, for instance, precipitation and wind speed. The neural networks are more suitable for flexible cases with more arbitrary predictors and strong nonlinearity.
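For comparison, a minimal usage sketch with the open-source ngboost package is shown below. The stand-in data, the CRPS-based fitting criterion (exposed as CRPScore in recent versions of the package) and the hyperparameter values are assumptions for illustration; the exact configuration used in this study is not reproduced here.

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal
from ngboost.scores import CRPScore  # CRPS-based fitting criterion

# Stand-in data: 11 raw ensemble forecasts as features, observed
# 2-m maximum temperature as the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))
y = X.mean(axis=1) + rng.normal(scale=0.5, size=500)

# The default base learners are shallow regression trees; the natural
# gradient allows mu and sigma to be boosted jointly.
ngb = NGBRegressor(Dist=Normal, Score=CRPScore,
                   n_estimators=200, learning_rate=0.03)
ngb.fit(X, y)

# Predictive distributions: each test case gets its own mu and sigma.
dist = ngb.pred_dist(X[:5])
print(dist.params["loc"], dist.params["scale"])
```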

Author Contributions: Conceptualization, X.Z.; Formal analysis, T.P. and Y.J.; Funding acquisition, X.Z.; Investigation, T.P. and Y.J.; Validation, Y.J. and L.J.; Writing—original draft, T.P. and Y.J.; Writing—review and editing, T.P., Y.J. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding: This study was jointly supported by the National Key R&D Program of China (Grant No. 2017YFC1502000) and the National Natural Science Foundation of China (Grant No. 41575104).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. DelSole, T.; Trenary, L.; Tippett, M.K.; Pegion, K. Predictability of week-3–4 average temperature and precipitation over the contiguous United States. J. Clim. 2017, 30, 3499–3512. [CrossRef]
2. Black, J.; Johnson, N.C.; Baxter, S.; Feldstein, S.B.; Harnos, D.S.; L'Heureux, M.L. The predictors and forecast skill of Northern Hemisphere teleconnection patterns for lead times of 3–4 weeks. Mon. Weather Rev. 2017, 145, 2855–2877. [CrossRef]
3. National Research Council. Assessment of Intraseasonal to Interannual Climate Prediction and Predictability; The National Academies Press: Washington, DC, USA, 2010; ISBN 978-0-309-15183-2.
4. Brunet, G.; Shapiro, M.; Hoskins, B.; Moncrieff, M.; Dole, R.; Kiladis, G.N.; Kirtman, B.; Lorenc, A.; Mills, B.; Morss, R.; et al. Collaboration of the Weather and Climate Communities to Advance Subseasonal-to-Seasonal Prediction. Bull. Amer. Meteorol. Soc. 2010, 91, 1397–1406. [CrossRef]
5. The National Academies of Sciences, Engineering, Medicine. Next Generation Earth System Prediction: Strategies for Subseasonal to Seasonal Forecasts; The National Academies Press: Washington, DC, USA, 2016; ISBN 978-0-309-38880-1.
6. Mariotti, A.; Ruti, P.M.; Rixen, M. Progress in subseasonal to seasonal prediction through a joint weather and climate community effort. npj Clim. Atmos. Sci. 2018, 1, 4. [CrossRef]
7. Pegion, K.; Sardeshmukh, P.D. Prospects for improving subseasonal predictions. Mon. Weather Rev. 2011, 139, 3648–3666. [CrossRef]
8. Li, S.; Robertson, A.W. Evaluation of Submonthly Precipitation Forecast Skill from Global Ensemble Prediction Systems. Mon. Weather Rev. 2015, 143, 2871–2889. [CrossRef]
9. Pegion, K.; Kirtman, B.P.; Becker, E.; Collins, D.C.; Lajoie, E.; Burgman, R.; Bell, R.; Delsole, T.; Min, D.; Zhu, Y.; et al. The Subseasonal Experiment (SubX). Bull. Amer. Meteorol. Soc. 2019, 100, 2043–2060. [CrossRef]
10. Raftery, A.E.; Gneiting, T.; Balabdaoui, F.; Polakowski, M. Using Bayesian model averaging to calibrate forecast ensembles. Mon. Weather Rev. 2005, 133, 1155–1174. [CrossRef]
11. Gneiting, T.; Raftery, A.E.; Westveld, A.H.; Goldman, T. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Weather Rev. 2005, 133, 1098–1118. [CrossRef]
12. Krishnamurti, T.N.; Kishtawal, C.M.; LaRow, T.E.; Bachiochi, D.R.; Zhang, Z.; Williford, C.E.; Gadgil, S.; Surendran, S. Improved weather and seasonal climate forecasts from multimodel superensemble. Science 1999, 285, 1548–1550. [CrossRef]
13. Zhi, X.; Qi, H.; Bai, Y.; Lin, C. A comparison of three kinds of multimodel ensemble forecast techniques based on the TIGGE data. Acta Meteorol. Sin. 2012, 26, 41–51. [CrossRef]
14. He, C.; Zhi, X.; You, Q.; Song, B.; Fraedrich, K. Multi-model ensemble forecasts of tropical cyclones in 2010 and 2011 based on the Kalman Filter method. Meteorol. Atmos. Phys. 2015, 127, 467–479. [CrossRef]
15. Ji, L.; Zhi, X.; Simmer, C.; Zhu, S.; Ji, Y. Multimodel Ensemble Forecasts of Precipitation Based on an Object-Based Diagnostic Evaluation. Mon. Weather Rev. 2020, 148, 2591–2606. [CrossRef]
16. Ji, L.; Zhi, X.; Zhu, S.; Fraedrich, K. Probabilistic precipitation forecasting over East Asia using Bayesian model averaging. Weather Forecast. 2019, 34, 377–392. [CrossRef]
17. Neapolitan, R.E. Neural Networks and Deep Learning. Artif. Intell. 2018, 389–411. [CrossRef]
18. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
19. Angermueller, C.; Pärnamaa, T.; Parts, L.; Stegle, O. Deep learning for computational biology. Mol. Syst. Biol. 2016, 12, 878. [CrossRef]
20. Goh, G.B.; Hodas, N.O.; Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 2017, 38, 1291–1307. [CrossRef]
21. Chao, Z.; Pu, F.; Yin, Y.; Han, B.; Chen, X. Research on real-time local rainfall prediction based on MEMS sensors. J. Sens. 2018, 2018, 1–9. [CrossRef]
22. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. In Proceedings of the Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5617–5627.
23. Akbari Asanjan, A.; Yang, T.; Hsu, K.; Sorooshian, S.; Lin, J.; Peng, Q. Short-Term Precipitation Forecast Based on the PERSIANN System and LSTM Recurrent Neural Networks. J. Geophys. Res. Atmos. 2018, 123, 12543–12563. [CrossRef]
24. Rasp, S.; Lerch, S. Neural networks for postprocessing ensemble weather forecasts. Mon. Weather Rev. 2018, 146, 3885–3900. [CrossRef]
25. Duan, T.; Avati, A.; Ding, D.Y.; Thai, K.K.; Basu, S.; Ng, A.Y.; Schuler, A. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. Available online: https://arxiv.org/abs/1910.03225 (accessed on 9 June 2020).
26. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
27. Ren, L.; Sun, G.; Wu, J. RoNGBa: A Robustly Optimized Natural Gradient Boosting Training Approach with Leaf Number Clipping. Available online: https://arxiv.org/abs/1912.02338 (accessed on 9 June 2020).
28. Fan, Y.; van den Dool, H. A global monthly land surface air temperature analysis for 1948–present. J. Geophys. Res. Atmos. 2008, 113, D01103. [CrossRef]
29. Glahn, H.R.; Lowry, D.A. The Use of Model Output Statistics (MOS) in Objective Weather Forecasting. J. Appl. Meteorol. 1972, 11, 1203–1211. [CrossRef]
30. Hamill, T.M.; Wilks, D.S. A probabilistic forecast contest and the difficulty in assessing short-range forecast uncertainty. Weather Forecast. 1995, 10, 620–631. [CrossRef]
31. Stefanova, L.; Krishnamurti, T.N. Interpretation of seasonal climate forecast using Brier skill score, the Florida State University Superensemble, and the AMIP-I dataset. J. Clim. 2002, 15, 537–544. [CrossRef]
32. Jordan, A.; Krüger, F.; Lerch, S. Evaluating probabilistic forecasts with scoringRules. J. Stat. Softw. 2019, 90, 1–37. [CrossRef]
33. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
34. Amari, S. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276. [CrossRef]
35. Martens, J. New insights and perspectives on the natural gradient method. Available online: https://arxiv.org/abs/1412.1193 (accessed on 9 June 2020).
36. Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [CrossRef]
37. Gebetsberger, M.; Messner, J.W.; Mayr, G.J.; Zeileis, A. Estimation Methods for Nonhomogeneous Regression Models: Minimum Continuous Ranked Probability Score versus Maximum Likelihood. Mon. Weather Rev. 2018, 146, 4323–4338. [CrossRef]
38. Gneiting, T.; Balabdaoui, F.; Raftery, A.E. Probabilistic forecasts, calibration and sharpness. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2007, 69, 243–268. [CrossRef]
39. Anderson, J.L. A Method for Producing and Evaluating Probabilistic Forecasts from Ensemble Model Integrations. J. Clim. 1996, 9, 1518–1530. [CrossRef]
40. Hamill, T.M.; Colucci, S.J. Verification of Eta–RSM Short-Range Ensemble Forecasts. Mon. Weather Rev. 1997, 125, 1312–1327. [CrossRef]
41. Talagrand, O.; Vautard, R.; Strauss, B. Evaluation of probabilistic prediction systems. In Proceedings of the Workshop on Predictability, Reading, UK, 20–22 October 1997; p. 12555.
42. Hamill, T.M. Interpretation of Rank Histograms for Verifying Ensemble Forecasts. Mon. Weather Rev. 2001, 129, 550–560. [CrossRef]
43. Hersbach, H. Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems. Weather Forecast. 2000, 15, 559–570. [CrossRef]
44. Zhu, Y.; Zhou, X.; Li, W.; Hou, D.; Melhauser, C.; Sinsky, E.; Peña, M.; Fu, B.; Guan, H.; Kolczynski, W.; et al. Toward the Improvement of Subseasonal Prediction in the National Centers for Environmental Prediction Global Ensemble Forecast System. J. Geophys. Res. Atmos. 2018, 123, 6732–6745. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).