Fenton-Wilkinson Order Statistics and German Tanks: A Case Study of an Orienteering Relay Race

Joonas Pääkkönen, Ph.D.

Abstract

Ordinal regression falls between discrete-valued classification and continuous-valued regression. Ordinal target variables can be associated with ranked random variables. These random variables are known as order statistics, and they are closely related to ordinal regression. However, the challenge of using order statistics for ordinal regression prediction is finding a suitable parent distribution. In this work, we provide a case study of a real-world orienteering relay race by viewing it as a random process. For this process, we show that accurate order statistical ordinal regression predictions of final team rankings, or places, can be obtained by assuming a lognormal distribution of individual leg times. Moreover, we apply Fenton-Wilkinson approximations to intermediate changeover times alongside an estimator for the total number of teams, as in the notorious German tank problem. The purpose of this work is, in part, to spark interest in studying the applicability of order statistics in ordinal regression problems.

arXiv:1912.05034v1 [stat.ML] 10 Dec 2019

1 INTRODUCTION

Typical applications of ordinal classification include age estimation with an integer-valued target, advertising systems, recommender systems, and movie ratings on a scale from 1 to 5. For further insight into recent developments in machine learning, including ordinal regression, the reader is kindly directed to (Gutierrez et al., 2016; Cao et al., 2019; Vargas et al., 2019; Wang and Zhu, 2019; Gambella et al., 2019) and the references therein.

In this work, we perform a case study of ordinal regression on the ranks of a sorted sum of random variables corresponding to the duration of an orienteering relay viewed as a random process with a large number of subsamples of changeover times. We compare three widely-used regression schemes to our original method, which is based on manipulations of certain properties of ordered random variables, otherwise known as order statistics. We study whether expectations of order statistics in conjunction with statistical inference can forecast final team places when certain intricate, educated guesses are made about the underlying random process. Specifically, we assume lognormality of both individual leg times and, more importantly, team changeover times.

Generally, in competitive sports, the final place is the hard, quantitative result that both individuals and teams want to minimize. For our case study, both plots and numerical prediction error values show that ordinal regression based on lognormal order statistics of time duration can provide an accurate fit to real-world relay race data.

Machine learning includes various classification, regression, clustering and dimensionality reduction methods. In classification models, the target is a discrete class, while in regression, the target is typically a continuous variable. Ordinal regression, also known as ordinal classification, is regression with a target that is discrete and ordered. Ordinal classification can thus be considered a hybrid of classification and regression.

12th December 2019. Email: [email protected].

2 MODELS

2.1 Relay Race System Model

Consider an orienteering relay race. Let n denote the number of finishing teams, as we ignore disqualified and retired teams. Each team has m runners and each runner runs one leg. Note, in particular, that m is given, whereas n is estimated, as will be discussed later.

We say that leg time results correspond to random variables {Z_i} with i ∈ {1, 2, ..., m}. We define the changeover time T^(l) after leg l ∈ {1, 2, ..., m} as

    T^(l) := \sum_{i=1}^{l} Z_i.    (1)

The order statistics T^(l)_{r:n} of the sum T^(l) satisfy T^(l)_{1:n} ≤ T^(l)_{2:n} ≤ ... ≤ T^(l)_{r:n} ≤ ... ≤ T^(l)_{n:n}.¹ The final team result list of the relay race is a length-n sample of T^(m) sorted in ascending order.

Let F_{T^(m)}(·) denote the cdf of T^(m), and let T^(m)_{r_m:n} denote the r_m-th order statistic of a length-n sample of T^(m). These order statistics satisfy T^(m)_{1:n} ≤ T^(m)_{2:n} ≤ ... ≤ T^(m)_{r_m:n} ≤ ... ≤ T^(m)_{n:n}. We refer to r_m as the (final) place of the corresponding team, where r_m = 1 corresponds to the winning team.

Let c denote the number of training observations. We are given a changeover time training vector x_l = [x_1^(l), ..., x_c^(l)] and a final place training vector y = [y_1, ..., y_c]. Hence, x_l consists of realizations of T^(l), and y includes the corresponding observed places. We wish to find regression functions h^(l)(·) that satisfy

    y ≈ h^(l)(x_l)    (2)

as accurately as possible. In (2), we use vector notation to emphasize that we find the parameters of h^(l)(·) that minimize a loss function over a set of observation pairs rather than over a single observation pair.

¹ The order statistics of addends of ordered sums are latent order statistics, examples of which include factor distributions (Pääkkönen et al., 2019). In this work, though, we do not need to derive any sum or addend distributions.

2.2 The Four Regression Models

1) Linear regression refers here to Ordinary Least Squares (OLS) regression rounded to the nearest integer. OLS finds the intercept and slope that minimize the residual sum of squares between the observed targets and the targets predicted by the linear approximation.

2) Gaussian Process (GP) regression is a nonparametric model that can manage exact regression up to a million data points on commodity hardware (Wang et al., 2019). For a pair of training vectors (x_l, y), a GP is defined by its kernel function k(·, ·), a c×c kernel matrix K_{x_l x_l} with covariance values for all training point pairs, and a c-dimensional vector k_{x_l x} with evaluations of the kernel function between the training points x_l and a given point x.

A GP predicts an unknown function g(·). For the kernel matrix K̂_{x_l x_l} = K_{x_l x_l} + σ_0² I with additive zero-mean Gaussian noise of variance σ_0², the expected value of the zero-mean GP predictive posterior distribution with a Gaussian likelihood is (Rasmussen and Williams, 2006)

    E(g(x) | x_l, y) = k^T_{x_l x} K̂^{-1}_{x_l x_l} y.    (3)

We use (3) rounded to the nearest integer as the GP place predictor. The GP regression function thus becomes

    h^(l)_GP(x) := [k^T_{x_l x} K̂^{-1}_{x_l x_l} y],    (4)

where [·] denotes rounding to the nearest integer. For a practical, numerical implementation of exact GP regression, we utilize the readily available GPyTorch library with a radial basis function (RBF) kernel, as in the "GPyTorch Regression Tutorial" (GPyTorch, 2019).

3) Ordinal regression refers here to the rounded-to-the-nearest-integer regression-based model from the readily available Python mord package for ordered ordinal ridge regression. This model overrides the ridge regression function from the scikit-learn library and uses the (minus) absolute error as its score function (Mord, 2019; Pedregosa-Izquierdo, 2015).

4) Fenton-Wilkinson Order Statistics (FWOS) regression is our original regression model. For this model, we make the following two well-educated assumptions.

Assumption 1: Individual leg time Z_i is lognormal.

Assumption 2: Changeover time T^(l) is lognormal.

The lognormal distribution often appears in the sciences (Limpert et al., 2001). Assumption 1 is based on the lognormality of vehicle travel time (Chen et al., 2018). Assumption 2 paraphrases what is known in the literature as the Fenton-Wilkinson approximation (Wilkinson, 1934; Fenton, 1960; Cobb, 2012): the distribution of a sum of lognormal random variables is approximated with another lognormal distribution.

Note that Z_i and T^(l) are both lognormal and independent but not identically distributed. With this in mind, we can now derive the FWOS regression prediction function h^(l)_OS(·). For this purpose, we use the following two well-known preliminary tools from probability theory.

Tool 1: Let W follow the standard uniform distribution U(0, 1) and let T^(l) follow distribution F. Let W_{r:n} denote the r-th order statistic of a length-n sample of W. Then the r-th order statistic of a length-n sample of T^(l) has the same distribution as the inverse cdf of F at W_{r:n}.

Tool 2: The r-th standard uniform order statistic follows Beta(r, n − r + 1). Therefore, E(W_{r:n}) = r/(n + 1).
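Tool 2 can be checked numerically: the empirical mean of the r-th order statistic of a standard uniform sample concentrates around r/(n + 1). A minimal, stdlib-only Python sketch (the sample sizes and seed are arbitrary choices of ours, not from the paper):

```python
import random

def mean_uniform_order_statistic(r, n, trials, seed=0):
    """Monte Carlo estimate of E(W_{r:n}) for W ~ U(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[r - 1]  # r-th order statistic (1-indexed)
    return total / trials

# Tool 2: E(W_{r:n}) = r / (n + 1), since W_{r:n} ~ Beta(r, n - r + 1)
r, n = 5, 20
estimate = mean_uniform_order_statistic(r, n, trials=20000)
print(estimate, r / (n + 1))  # the two values should agree to roughly two decimals
```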
The inverse cdf of F is known as the quantile function Q_F(·). Tool 1 can therefore be expressed as

    T^(l)_{r:n} =^d Q_F(W_{r:n}),    (5)

where "=^d" reads "has the same distribution as". Applying Tool 2 to (5) hence yields

    E(T^(l)_{r:n}) = Q_F(r/(n + 1)).    (6)

Let F be the lognormal distribution with cdf F_{T^(l)}(·). Now (6) directly implies F_{T^(l)}(E(T^(l)_{r:n})) = r/(n + 1) and, further, most interestingly for our purposes, that

    F_{T^(l)}(E(T^(l)_{r:n})) (n + 1) = r,    (7)

which resembles (2), as r is associated with y.

For large n, as in our case study, it is reasonable to assume that for every x ∈ R_+ there exists an r such that

    E(T^(l)_{r:n}) ≈ x.    (8)

The well-known lognormal cdf is given by

    F_{T^(l)}(x; μ_l, σ_l) = Φ((log x − μ_l)/σ_l),    (9)

where Φ(·) is the standard normal cdf, μ_l and σ_l are the lognormal parameters, and log(·) is the natural logarithm. We combine (7), (8) and (9), and define the FWOS place prediction function for leg l as

    h^(l)_OS(x) := [Φ((log x − μ_l)/σ_l) (n + 1)],    (10)

where [·] again denotes rounding to the nearest integer.

Maximum likelihood estimation (MLE) for the normal distribution yields lognormal estimators for μ_l and σ_l when the x_i^(l) are replaced by q_i^(l) := log x_i^(l). Hence,

    (μ̂_l, σ̂_l²) = ( (1/c) \sum_{i=1}^{c} q_i^(l), (1/c) \sum_{i=1}^{c} (q_i^(l) − μ̂_l)² ).    (11)

What remains to be done is finding an estimate for the total number of teams n. We assume that there are no ties, which is equivalent to stating that the elements in y are unique. Thus, y is a sample, without replacement, from the discrete uniform distribution U[1, n]. Let D denote a random variable that follows this distribution, and recall that y is a random c-subset of {1, 2, ..., n}.

Estimating the parameter n of U[1, n] from a sample drawn without replacement is known in the literature as the German tank problem (Ruggles and Brodie, 1947), a solution to which is the uniformly minimum-variance unbiased estimator (UMVUE) (Goodman, 1952)

    n̂ = (1 + 1/c) d^(c) − 1,    (12)

where

    d^(c) := max_{i ∈ {1, 2, ..., c}} y_i

is the realization of the c-th order statistic (maximum) of a length-c sample of D.

We replace n and (μ_l, σ_l) in (10) with the n̂ of (12) and (μ̂_l, σ̂_l), respectively, and note that numerical values for (μ̂_l, σ̂_l) are obtained through (11). We obtain the following proposition.

Proposition 1. For a training dataset of c observations, lognormal parameter estimates (μ̂_l, σ̂_l) obtained through the maximum likelihood estimators of the changeover time observations x_l, and d^(c), the maximum of the final place observations y, the FWOS ordinal regression function

    h^(l)_OS(x) = [Φ((log x − μ̂_l)/σ̂_l) (1 + 1/c) d^(c)]

predicts the final place with changeover time x ∈ R_+ at changeover l.

Loosely speaking, Proposition 1 states that the FWOS method approximates the final place with the expected changeover place. We anticipate that this approximation holds to a satisfactory degree and that it improves with l. We note that c, the dimension of the training vectors, is constant for all changeovers l, as is n̂, the estimate of n, while the pair (μ_l, σ_l) is estimated separately for each changeover l with the changeover time training vector x_l. Further note that FWOS regression only requires an easily acquirable pair (c, d^(c)) of dimension 2, as opposed to the other three regression methods, which require the whole place training vector y of dimension c.

3 NUMERICAL RESULTS

Real-world data are acquired from the results of Jukola 2019 (Jukola, 2019), with n = 1653 teams and m = 7 runners per team. Training of the regression models is conducted for two cases: c = 1322 (c/n ≈ 80%) and c = 82 (c/n ≈ 5%) training observations.
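Proposition 1 is straightforward to implement. The following stdlib-only Python sketch combines the MLE of (11), the German tank scaling of (12), and the rounding of (10); the function and variable names are ours, not from the paper:

```python
import math

def fwos_predictor(times, places):
    """Fit the FWOS regression function of Proposition 1 from changeover
    time observations `times` (x_l) and final place observations `places` (y)."""
    c = len(times)
    logs = [math.log(x) for x in times]
    mu = sum(logs) / c                                       # (11): MLE of mu_l
    sigma = math.sqrt(sum((q - mu) ** 2 for q in logs) / c)  # (11): MLE of sigma_l
    d_c = max(places)                                        # maximum observed place
    scale = (1 + 1 / c) * d_c                                # (12): n_hat + 1

    def predict(x):
        # standard normal cdf via erf, then scale and round as in (10)
        phi = 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))
        return round(phi * scale)

    return predict
```

Since Φ is nondecreasing, the fitted predictor is nondecreasing in x and approaches n̂ for large changeover times, matching the sigmoidal place-versus-time behavior described later in the numerical results.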
The rest of the data are used for testing the models.

Figure 1: Place predictions (orange) at changeover l = 4 with training set size 80% and test set (blue) size 20%.

Figure 2: Place predictions (orange) at changeover l = 4 with training set size 5% and test set (blue) size 95%.

Figure 3: Place prediction root-mean-square errors (RMSEs) after each leg. (a) Training set size 80%. (b) Training set size 5%.
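The relay race model of Section 2.1 can be emulated on synthetic data to reproduce experiments of this kind. A stdlib-only sketch, assuming lognormal leg times (Assumption 1) with arbitrary parameters of our own choosing:

```python
import math
import random

def simulate_relay(n_teams, m_legs, mu=4.0, sigma=0.15, seed=0):
    """Simulate lognormal leg times; return per-team changeover times
    T^(1), ..., T^(m) and final places (rank 1 = winning team)."""
    rng = random.Random(seed)
    changeovers = []  # changeovers[t][l-1] = T^(l) for team t
    for _ in range(n_teams):
        total, row = 0.0, []
        for _ in range(m_legs):
            total += math.exp(rng.gauss(mu, sigma))  # lognormal leg time Z_i
            row.append(total)
        changeovers.append(row)
    # rank teams by final time T^(m) to obtain places
    final_times = [row[-1] for row in changeovers]
    order = sorted(range(n_teams), key=lambda t: final_times[t])
    places = [0] * n_teams
    for rank, t in enumerate(order, start=1):
        places[t] = rank
    return changeovers, places

changeovers, places = simulate_relay(n_teams=100, m_legs=7)
# final places form a permutation of 1..n (no ties, as assumed in Section 2)
assert sorted(places) == list(range(1, 101))
```

Splitting the simulated teams into training and test subsets then mirrors the c/n ≈ 80% and c/n ≈ 5% setups used above.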

3.1 Regression Curves

Figure 1 plots place against team changeover time after leg l = 4 for the predictions (orange) and the test set (blue) for all four regression models. The size of the training set is 80% and the size of the test set is 20%. We see that linear regression and GP regression provide accurate fits for a large portion of the test set points but behave poorly when time is large. For average teams, with approximately 350 to 550 minutes of elapsed time after four legs, place seems to grow linearly with time, though overall linearity is clearly an oversimplification.

We further notice that ordinal regression captures the effect where place saturates for large values of time but fails to provide a smooth transition. FWOS regression, unlike the other models, does indeed exhibit the smooth sigmoidal behavior of the data.

In the setting of Fig. 2, there are significantly fewer training data compared to the setting of Fig. 1, namely, 5% compared to 80%. Interestingly, linear regression, ordinal regression and FWOS regression maintain high performance, whereas GP regression suffers greatly from the lack of training data.

3.2 Root-Mean-Square Errors

Here we illustrate the error between the place prediction h^(l)(x_i) and the corresponding true value y_i with label i ∈ {1, 2, ..., v}, where v is the size of the test set and n = c + v. We use the root-mean-square error

    RMSE = \sqrt{ (1/v) \sum_{i=1}^{v} (y_i − h^(l)(x_i))² }.

Fig. 3a plots the RMSEs after each of the m = 7 changeovers for a random test set of v = 331 points (v/n ≈ 0.20, as in Fig. 1), while Fig. 3b plots the RMSEs for a random test set of v = 1571 points (v/n ≈ 0.95, as in Fig. 2).

When comparing Fig. 3a with Fig. 3b, we notice similar RMSEs for both training set sizes, except for GP regression. It is clear that GP regression requires more training data than the other regression models to achieve comparable prediction error performance. In every case, FWOS exhibits the best RMSE performance.
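The RMSE above translates directly into code; a small Python helper matching the formula (names are ours):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between true places and predicted places."""
    v = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / v)

print(rmse([1, 2, 3], [1, 2, 5]))  # sqrt(4/3) ≈ 1.1547
```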
However, the RMSEs are not zero even for the 7th changeover, i.e., after the anchor leg when the team finishes. This is due to the imperfections of the regression models and the random fluctuations of the data.

Overall, we make the following two key remarks.

Remark 1: The "place against changeover time" curve resembles a scaled lognormal cumulative distribution function. This further suggests that elite teams "pull away" from the rest of the teams, while extremely slow teams fall far behind the rest.

Remark 2: A lognormal cumulative distribution function, alongside a German tank estimator, effectively predicts places from changeover times even when the number of training observations is small.

4 DISCUSSION

For our case study of an orienteering relay race, we have shown that the rankings of ordered sums, here referred to as places, can be predicted relatively accurately from intermediate sums by first taking an educated guess that leg times are lognormal, and then by scaling the cumulative distribution function of another lognormal distribution corresponding to the observed changeover times. The latter distribution can be approximated by the Fenton-Wilkinson method, while the scaling factor can be estimated with a well-known solution to the German tank problem.

The use of order statistics for ordinal regression provides a powerful approximation for our case study. However, intricate prior insight into the underlying random process is assumed, which is unnecessary for nonparametric models such as the Gaussian process. Similarly, linear regression and standard ordinal regression require no knowledge of the underlying distributions.

An event that may complicate leg time distribution fitting and erode our prediction performance is the restart, where the runners that have not started before a specific cut-off time start simultaneously. Restarts may skew the leg time distributions, as may the first-leg runners, who often run in packs. Running in packs may decrease our prediction accuracy.

Fenton-Wilkinson order statistics, alongside German tank estimators, could be applied whenever a longer random process can be modeled as a sum of shorter random processes with random durations and the outcomes of the longer process are uniquely ordered by the corresponding total time duration. Such processes are plentiful in sports, a rich source of inspiration for order statistics research in ordinal classification.

References

Cao, W., Mirjalili, V., and Raschka, S. (2019). Rank-consistent ordinal regression for neural networks. [online] Available: arXiv:1901.07884.

Chen, P., Tong, R., Lu, G., and Wang, Y. (2018). Exploring travel time distribution and variability patterns using probe vehicle data: Case study in Beijing. J. Adv. Transp.

Cobb, B. R. (2012). Approximating the distribution of a sum of log-normal random variables. In Proc. 6th Eur. Workshop Probab. Graph. Models.

Fenton, L. F. (1960). The sum of log-normal probability distributions in scattered transmission systems. IRE Trans. Commun. Syst., 8:57–67.

Gambella, C., Ghaddar, B., and Naoum-Sawaya, J. (2019). Optimization models for machine learning: A survey. [online] Available: arXiv:1901.05331.

Goodman, L. A. (1952). Serial number analysis. J. Amer. Statist. Assoc., 47(270):622–634.

GPyTorch (2019). [online] https://gpytorch.ai.

Gutierrez, P. A., Perez-Ortiz, M., Sanchez-Monedero, J., Fernandez-Navarro, F., and Hervas-Martinez, C. (2016). Ordinal regression methods: survey and experimental study. IEEE Trans. Knowl. and Data Eng., 28(1):127–146.

Jukola (2019). [online] https://www.jukola.com/2019/.

Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log-normal distributions across the sciences: Keys and clues. Bioscience, 51:341–352.

Mord (2019). [online] https://pythonhosted.org/mord/.

Pääkkönen, J., Dharmawansa, P., Freij-Hollanti, R., Hollanti, C., and Tirkkonen, O. (2019). File size distributions and caching for offloading. In Proc. IEEE Signal Process. Advances in Wireless Commun. (SPAWC).

Pedregosa-Izquierdo, F. (2015). Feature extraction and supervised learning on fMRI: from practice to theory. PhD thesis, Université Pierre et Marie Curie.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Ruggles, R. and Brodie, H. (1947). An empirical approach to economic intelligence in World War II. J. Amer. Statist. Assoc., 42(237):72–91.

Vargas, V.-M., Gutiérrez, P.-A., and Hervás-Martínez, C. (2019). Cumulative link models for deep ordinal classification. [online] Available: arXiv:1905.13392.

Wang, K. A., Pleiss, G., Gardner, J. R., Tyree, S., Weinberger, K. Q., and Wilson, A. G. (2019). Exact Gaussian processes on a million data points. [online] Available: arXiv:1903.08114.

Wang, L. and Zhu, D. (2019). Tackling multiple ordinal regression problems: Sparse and deep multi-task learning approaches. [online] Available: arXiv:1907.12508.

Wilkinson, R. I. (1934). Unpublished, cited in 1967. Bell Telephone Labs.