
Université catholique de Louvain

Institute of Information and Communication Technologies, Electronics and Applied Mathematics

Hybrid Models to Predict Recreational Runners’ Performance

Dimitri de Smet d’Olbecke

Thesis submitted in partial fulfillment of the requirements for the Ph.D. Degree in Engineering Sciences and Technology

Thesis Committee
Advisor: Michel Verleysen (UCLouvain)
Jury: Olivier Brüls (ULiège), Marc Francaux (UCLouvain), Bernadette Govaerts (UCLouvain), Romain Hérault (INSA, Rouen), John Lee (UCLouvain)
Chairman: Jean-Pierre Raskin (UCLouvain)

Louvain-la-Neuve, October, 2019.

Abstract

When long-distance runners prepare for a race, they can train more efficiently if they are able to predict their expected performance. Accurate race time prediction also allows them to pick the right pace from the beginning of the race, which is known to impact the race outcome significantly. Usually, expected performance is estimated using fitness and endurance metrics provided by analyzing standardized exercise protocols in specialized laboratories. Unfortunately, most runners (especially recreational runners) cannot afford access to the required equipment and dedicated staff. In recent years, some companies have started to offer digital coaching for runners through sports watches and smartphone applications. One of the challenges that these companies face is to track runner fitness levels so that their workout planning can evolve according to their progress and so that race paces can be recommended.

This thesis addresses the problem of predicting runners’ performance based on data that is cheap and easy to collect, even for recreational runners: previous race times and workout session recordings (most often timestamps, heart rates, and positioning). The modeling of performance is said to be hybrid because it combines blind machine learning methods applied to large sets of data with knowledge from the domain literature.

Runner performance can be modeled as a function that describes how race times evolve in relation to race length (in meters). This function can be adjusted to fit previous race times obtained on given race lengths so that performances can be extrapolated to other race lengths. However, the regression is made difficult when the number of race records is limited or when the past performances present a high variability. This issue is addressed by using a probabilistic setting that includes probability distributions inferred from a large collection of race times.

Races cannot be strictly summarized by their length: gradient of ascent, weather conditions, altitude, vegetation, uneven ground and ground firmness affect runner speeds. Nevertheless, performance modeling is still possible using the notion of equivalent distances. This considers races characterized by only one parameter, all race factors being summarized by the equivalent distance. Equivalent distances are assigned to races based on race times; but, as races are not run by the same set of runners, equivalent distance estimation must address the problem of disparities between participants. This is done by evaluating races and runners simultaneously using collaborative filtering techniques. Subsequently, the relationship between these equivalent distances and race elevation profiles is formalized.

Many runners record their runs using a smartphone or a sports watch. If their device is paired with a heart rate sensor, their relative intensity of exertion is continuously monitored. Runners’ fitness levels can thus be assessed by relating their heart rate to an estimation of their activity level. The latter is derived from their recorded geolocations associated with timestamps.

Acknowledgements

Let me take the opportunity at the outset of this text to thank the people who made this thesis possible, the people who made it better, and the people who have made the last six years as enjoyable as they have been.

First of all, I would like to thank Michel Verleysen for the courses he gave, which sparked my interest in the field of machine learning and made me want to undertake a thesis. His experience and vision have allowed me to grow as a scientist and I am thankful for his careful supervision of my thesis. I am also very grateful to Marc Francaux and Laurent Baijot for raising the central questions addressed in this thesis, as well as for their respective competencies in the fields of sports sciences and coaching, which I did not initially share.

This manuscript has been made more rigorous and precise thanks to valuable feedback from John Lee, Bernadette Govaerts, Romain Hérault and Olivier Brüls. During the final phase, the language was also significantly improved with the help of Mary Munroe, whom I warmly thank here.

The years I spent at university would not have been nearly as pleasant without my colleagues. I thank them for sharing their scientific and technical expertise but also and especially for their excellent company!

Of course, this thesis was made possible thanks to the people who surround me. That includes my colleagues, but firstly my parents, family and friends. I sincerely thank all of them, especially my wife Aline De Broux, to whom I dedicate this thesis for the support she has offered thus far and which she continues to offer to me and our three wonderful children.

Finally, thank you, reader, for opening this thesis for whatever reason, even if this is the only page you intended to read.

Contents

1 Introduction
    1.1 Motivation and Scope
    1.2 Modeling and Validation
    1.3 List of Publications
    1.4 Summary of the Contributions
    1.5 Organization of the thesis
    1.6 Conventions
        1.6.1 Units
        1.6.2 Illustrations
        1.6.3 Error Boxplots
        1.6.4 Notations
        1.6.5 Acronyms

2 Data Description and Pre-Processing
    2.1 Introduction
    2.2 Race Results
        2.2.1 Data Description
        2.2.2 Filtering the Data
            Minimal Races per Runner
            Isolated Races
        2.2.3 Race Features
    2.3 Runner Recordings
        2.3.1 Heart Rate
        2.3.2 Geo-Localized Positions
        2.3.3 Speeds
        2.3.4 Elevation Data
        2.3.5 Gradient of Ascent
        2.3.6 Instant Power During Exercise
        2.3.7 Smoothing Speeds and Slopes
        2.3.8 Merging the Two Sources of Data
        2.3.9 Race Tracks Cropping
    2.4 Conclusion

3 Runner Performances Modeling
    3.1 Existing Models
        3.1.1 Power Law
        3.1.2 Hyperbolic Two-Parameter Model
        3.1.3 Hyperbolic Three-Parameter Model
        3.1.4 Exponential Model
        3.1.5 Logarithmic Endurance Model (Péronnet)
        3.1.6 VDOT Model
    3.2 Proposed Models
        3.2.1 Two-Threshold Power Law
        3.2.2 Polynomial-Logarithmic Model
    3.3 Fitting of the Models
        3.3.1 Power-Law
        3.3.2 Hyperbolic Two-Parameter Model
        3.3.3 Hyperbolic Three-Parameter Model
        3.3.4 Exponential Model
        3.3.5 Logarithmic Endurance Model
        3.3.6 VDOT Model
        3.3.7 Two-Threshold Power Law
        3.3.8 Polynomial-Logarithmic Model
    3.4 Model Comparison
        3.4.1 World Records Fitting
        3.4.2 Race Performances Prediction
        3.4.3 Results
            World Records Fitting
            Race Performances Predictions
    3.5 Discussion

4 Races and Athletes Characterization from Race Results
    4.1 Introduction
    4.2 Low-Rank Approximation of the Race Results Matrix
    4.3 Cost Function
    4.4 Solution
    4.5 Data Requirements
        4.5.1 Communities of Runners
        4.5.2 Choosing Rank k
    4.6 Validation Process
    4.7 Results
        4.7.1 Data
        4.7.2 Optimization
        4.7.3 Model Selection
    4.8 Discussion
        4.8.1 Race Distances and Rank-One Approximation
        4.8.2 More on Communities
    4.9 Conclusion

5 Race Equivalent Distances and Race Features
    5.1 Introduction
    5.2 Equivalent Distances From Race Results
        5.2.1 Equivalent Distance
        5.2.2 Obtaining the Equivalent Distances
        5.2.3 Data Requirements
    5.3 Equivalent Distance From Elevation Profile
        5.3.1 Models
        5.3.2 Model Fitting
    5.4 Validation Process
    5.5 Results
    5.6 Discussion

6 Probabilistic Race Time Prediction
    6.1 Introduction
    6.2 Probabilistic Setting
    6.3 The Bayesian Approach
    6.4 ML and MAP Solutions
    6.5 Distribution Parameters Tuning
        6.5.1 Adjusting Distribution Parameters on Race Times
        6.5.2 Tuning Distribution Parameters Through Validation
    6.6 Equivalent Distances
    6.7 Validation Process
    6.8 Results
        6.8.1 Probability Distribution Parameters
        6.8.2 Model Accuracy
    6.9 Discussion

7 Cardiac Parameter Identification as Fitness Assessment
    7.1 Introduction
    7.2 Materials and Methods
        7.2.1 Heart Rate Model
            Instant Power
            Steady State
            Transient Response
            Cardiovascular Drift
        7.2.2 Performance Prediction
        7.2.3 Adjusting the Cardiac Parameters on an Activity
        7.2.4 Other Means to Estimate Cardiac Parameters
        7.2.5 Validation Process
    7.3 Results
    7.4 Discussion

8 General Conclusion

A Races Communities
    A.1 Lion Races
    A.2 Rooster Races
    A.3 Races Equivalent Distances

Bibliography

Chapter 1

Introduction

This thesis is intended to provide methodologies to predict the race performances of recreational runners. Such predictions are in high demand, both during race preparation and for choosing an optimal pacing strategy on race day. This work takes place in the scope of automated coaching and was partly conducted in collaboration with the company FORMYFIT, which is active in this domain. The performance prediction of recreational runners differs from that of elite runners in two main respects. First, in terms of available data, recreational runners have few race records and no access to specialized laboratory testing. Second, in terms of variability, their performances are much less stable and thus more difficult to predict.

The thesis focuses on the race distances that are most common among recreational runners: endurance races ranging from 8 to 42.2 kilometers (the marathon distance); longer races (up to hundreds of kilometers) are only briefly discussed.

1.1 Motivation and Scope

Running is one of the most popular and widely practiced sports worldwide [58], and the practice is still growing almost everywhere. Marathon attendance has grown by 49% over the last decade [4]. Running is certainly one of the most accessible sports; it only requires a pair of shoes. Running, like all forms of regular exercise, has numerous health benefits (among others, for cardiovascular health, weight loss and mental health [12, 36]). The undisputed health benefits of running, combined with its accessibility, push different kinds of public entities to promote it. In the private sector, companies have started to promote its practice as a way to improve the productivity of their employees, and insurers are adopting incentive programs directed at their health or life insurance customers.

Although running is accessible to all, health- or performance-related improvements depend on following specific instructions. Moreover, keeping up regular practice is easier if it is part of an established training plan: runners are more likely to skip or postpone their runs if they do not see them as part of a long-term plan. The large number of recreational runners who seek improvement cannot all afford the guidance of a private coach. Group sessions are cheaper and have the benefits of a social activity, but they are not tailored to individuals.

Automated coaching for runners, which can take the form of a smartphone application providing instructions and feedback, can replace or complement coaching sessions at a low cost.

A training plan that targets a race event or health benefits is much more efficient if runner performances can be evaluated. Moreover, these evaluations allow quantified feedback to be given to the runner, which is a key factor in motivation.

This thesis is intended to provide a means of fitness assessment for recreational runners. It is based on data that is cheap and easy to collect: race times and activity measurements. Most recreational runners in industrialized countries possess a smartphone and can pair it at low cost with a heart rate monitor. We found that the measurements that are most commonly available are the positions (from which we can derive speeds and elevations) and the heart rates. Other measurements that are available on smartphones or sports watches could have been used (such as accelerometer and pulse oximeter readings), but they are still far less widespread and less standardized (units, sampling rate, smoothing).

Some athlete variables that are known to impact performance (such as age, sex, weight, and height) could be used. In this thesis, we assume that their effect is already reflected in the race times or in the relationship between heart rate and workload.

Two kinds of methodology are developed. The first is based on past performances obtained in other races, and the second is based on the observation of runners' heart rates during their workout sessions. Both approaches combine statistical analyses of collected data with physiological knowledge from the literature.

1.2 Modeling and Validation

Modeling athlete performance involves several tasks: one must select a general parametric model, fit its parameters to the data of a runner, use the fitted model to predict performances, and validate the model accuracy.

The first task, called “model selection”, consists in setting an equation that expresses how performances are related to input variables; the equation contains some athlete-specific parameters. The notion of “model selection” that we use also includes the fitting procedure, that is, how athlete parameters are computed. The fitting procedure consists of an optimization criterion (for instance, the least squares fit) and an algorithm (for instance, gradient descent). The general equation, the optimization criterion and the algorithm can all have parameters that are not specific to a runner. These parameters are called model hyperparameters (or often meta-parameters). Model selection includes the tuning of the hyperparameters.

The second task is referred to as “athlete fitting”. It consists of applying the chosen algorithm to fit the model to the data of a specific runner. The third task uses the obtained parameters to predict performances with the model equation.

To make the concepts clearer, we give a simple example. Let us model the race times of a runner with the equation

\[ t = t^{10K} \left( \frac{d}{10^4} \right)^{\beta}, \tag{1.1} \]

with $d$ being the race distance (in meters) and $t^{10K}$, the only athlete parameter, corresponding to the runner's expected time on a 10-kilometer race. If we assume that $\beta$ is common to all runners, it can be seen as a hyperparameter of the model: it is established before the fitting of individual runners. If a runner has some race records $(t_1, t_2, \dots, t_N)^\top$ for given distances $(d_1, d_2, \dots, d_N)^\top$, we can fit $t^{10K}$ so that it minimizes the sum of squared deviations from the model equation:

\[ t^{10K} = \arg\min_{t^{10K}} \sum_{i=1}^{N} \left( t_i - t^{10K} \left( \frac{d_i}{10^4} \right)^{\beta} \right)^2, \tag{1.2} \]

which admits the closed-form solution

\[ t^{10K} = \frac{\sum_{i=1}^{N} t_i \left( \frac{d_i}{10^4} \right)^{\beta}}{\sum_{i=1}^{N} \left( \frac{d_i}{10^4} \right)^{2\beta}}. \tag{1.3} \]
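As a minimal sketch of this example, the least-squares fit of the single athlete parameter can be computed directly: setting the derivative of the sum of squared deviations with respect to t10K to zero gives the ratio evaluated below. The function names and the value 1.06 used for β are illustrative assumptions, not values taken from this thesis.

```python
def fit_t10k(distances_m, times_s, beta):
    """Least-squares estimate of t10K for a fixed exponent beta.

    Minimizes sum_i (t_i - t10k * x_i)^2 with x_i = (d_i / 1e4) ** beta,
    whose minimizer is sum_i(t_i * x_i) / sum_i(x_i ** 2).
    """
    x = [(d / 1e4) ** beta for d in distances_m]
    numerator = sum(t * xi for t, xi in zip(times_s, x))
    denominator = sum(xi * xi for xi in x)
    return numerator / denominator


def predict_time(t10k, distance_m, beta):
    """Race time predicted by the power-law equation for a given distance."""
    return t10k * (distance_m / 1e4) ** beta


# Illustrative data: a runner whose times follow the model exactly.
BETA = 1.06  # assumed value, for illustration only
distances = [5000.0, 10000.0, 21097.5]
times = [predict_time(3000.0, d, BETA) for d in distances]
t10k_hat = fit_t10k(distances, times, BETA)
```

With noise-free data such as the above, the fit recovers the 10-kilometer time used to generate the records; with real race records, it returns the value that best compromises between them.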

In this example, model selection consists in:

1. Selecting a general equation: Equation (1.1).
2. Selecting an optimization criterion: the sum of squared errors (1.2).
3. Selecting an algorithm to compute the solution $t^{10K}$: in this case, the algorithm consists only of evaluating Equation (1.3) and is not a matter of debate, because it is the exact and unique solution to the optimization problem.
4. Selecting a value for $\beta$, as discussed below.

Once the model is selected (these four points are settled), we can fit $t^{10K}$ to a specific runner using their race records.

Of course, models can have more parameters (and hyperparameters). If the number of parameters increases, the model gets more complex and possibly more accurate, but the risk of overfitting rises: the model gets closer to the recorded points but loses prediction accuracy. Overfitting is related to the complexity of the model, the variability of the model outputs (in our case, the stability of runner performances) and the quantity of available data. This is summarized in the following table.

Chances of overfitting increase when
------------------------------------
Quantity of data            decreases
Complexity of the model     increases
Variability of the target   increases

To verify that the athlete parameters do not overfit the data, the prediction accuracy must be evaluated on race records that were not used for the fitting. In the following chapters, we evaluate the prediction accuracy and perform the model selection using variants of the following procedure.

1. For a large number of athletes:
   (a) For each of their races:
       i. fit the model using all races but the selected one;
       ii. compute a prediction for the selected race;
       iii. store the prediction error.
2. The accuracy of the fitting procedure is assessed from statistics on the stored errors.

For the model selection, we simply run this procedure with different models and with different hyperparameters. We can then select the model that gives the lowest mean squared error. Theoretically, model hyperparameters could overfit too. This means that the selected model could be “accidentally” good for the tested runners but less good for others. Considering the amount of athlete data that is available, we assume that this risk is very low. This is verified by testing the model on a different set of runners after the model selection. This “short-cut” is taken because it saves a significant amount of computation time.
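The leave-one-race-out procedure and the hyperparameter selection described above can be sketched as follows for the one-parameter power-law example. This is only an illustration under stated assumptions: the data layout (a list of (distance in meters, time in seconds) pairs per athlete), the function names and the candidate values of β are hypothetical, and errors are taken as relative errors (target minus prediction, divided by target).

```python
def fit_t10k(races, beta):
    """Closed-form least-squares t10K for races given as (distance_m, time_s)."""
    x = [(d / 1e4) ** beta for d, _ in races]
    numerator = sum(t * xi for (_, t), xi in zip(races, x))
    denominator = sum(xi * xi for xi in x)
    return numerator / denominator


def leave_one_out_errors(athletes, beta):
    """Relative prediction errors, each race being predicted from the others."""
    errors = []
    for races in athletes:
        if len(races) < 2:
            continue  # at least one race is needed for fitting, one for testing
        for k, (d_k, t_k) in enumerate(races):
            training = races[:k] + races[k + 1:]
            t10k = fit_t10k(training, beta)
            prediction = t10k * (d_k / 1e4) ** beta
            errors.append((t_k - prediction) / t_k)
    return errors


def select_beta(athletes, candidates):
    """Pick the candidate beta with the lowest mean squared validation error."""
    def mean_squared_error(beta):
        errors = leave_one_out_errors(athletes, beta)
        return sum(e * e for e in errors) / len(errors)
    return min(candidates, key=mean_squared_error)
```

In this sketch the same athletes serve both for selecting β and for measuring accuracy; checking the selected model on a separate set of runners afterwards, as described above, guards against an accidentally favorable choice.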

1.3 List of Publications

Here is the list of my published papers. The next section discusses their contributions.

[55] de Smet, D., Francaux, M., Hendrickx, J.M., Verleysen, M.: Heart rate modelling as a potential physical fitness assessment for runners and cyclists. In: Proceedings of the Machine Learning and Data Mining for Sports Analytics Workshop at ECML/PKDD. Riva del Garda, Italy (2016)

[56] de Smet, D., Verleysen, M., Francaux, M.: Running race times prediction and runner performances comparison using a matrix factorization approach. In: Proceedings of the 5th International Congress on Sport Sciences Research and Technology Support - Volume 1: icSPORTS, pp. 96–101 (2017)

[57] de Smet, D., Verleysen, M., Francaux, M., Baijot, L.: Long-distance running routes’ flat equivalent distances from race results and elevation profiles. In: Proceedings of the 6th International Congress on Sport Sciences Research and Technology Support - Volume 1: icSPORTS, pp. 56–62. INSTICC, SciTePress (2018). doi: 10.5220/0006937000560062

[54] de Smet, D., Francaux, M., Baijot, L., Verleysen, M.: Map best performances prediction for endurance runners. In: Proceedings of the 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium (2019)

1.4 Summary of the Contributions

1. In [56, 57] I developed the notion of race equivalent distance, which summarizes all aspects that affect the race average speed. I also described a methodology to evaluate it for races based on the recorded times.
2. In [57] I proposed a methodology to estimate race equivalent distances from the route elevation profile. This is useful for routes for which race results are not yet known.
3. In [54] I showed that athlete fitting presents difficulties when athletes have few race records. I proposed a solution that includes prior assumptions that we can make on human abilities or on specific age groups.
4. In [55] I showed a solution to estimate runners' cardiac parameters based on heart rate measurements taken during their free activities. I extend the idea here to an estimation of their fitness level.
5. In addition to the published papers, I review and test here the different models found in the literature that express race times as a function of race distances.

This thesis is an applied work, derived from a business need, that was treated as rationally as can be expected from a scientific work. Although a significant number of technical solutions have been found along the way (such as gathering and cleaning data, GPS track de-noising or cropping), some of them will not be detailed in this dissertation because, although they represent a tremendous amount of work, they are more relevant to engineering than to science.

1.5 Organization of the thesis

This thesis is organized as follows.

• Chapter 2 - Data Description and Pre-Processing describes the data, how it was obtained and how it was pre-processed.
• Chapter 3 - Runner Performances Modeling reviews existing models that relate race distances to race times. The models are tested on the data that was collected as described in Chapter 2.
• Chapter 4 - Races and Athletes Characterization from Race Results describes how race times can be used to characterize athletes and races in a way that allows us to predict race times for every runner in every race. The chapter ends with the notion of equivalent distance.
• Chapter 5 - Race Equivalent Distances and Race Features combines the two previous chapters. Races are explicitly characterized by their equivalent distances and athletes by a parametric model that is described in Chapter 3. We show that race equivalent distances and athlete parameters can be obtained from the race results. The chapter ends by relating the obtained equivalent distances to route elevation profiles.
• Chapter 7 - Cardiac Parameter Identification as Fitness Assessment presents a simple model of the heart rate as a function of the activity workload. We show that some parameters of the model reveal the fitness level of a runner and that they can be obtained from their activity recordings. This allows us to monitor their fitness level each time they run.
• Chapter 8 - Conclusion summarizes the thesis content and discusses the opportunities that it opens.

1.6 Conventions

1.6.1 Units

Most expressions in this thesis use the International System of Units. For convenience, a different unit is used for heart rates: beats per minute [bpm] instead of Hz. Also for convenience, common units are used in figures because they are presumably more familiar to the reader: distances are in kilometers, times in hours, minutes and seconds, and speeds in kilometers per hour.

1.6.2 Illustrations

Although the models that will be discussed later are most often based on race times, they are illustrated by their average speed. Indeed, as Figure 1.1 shows, the time vs distance relationship seems close to linear, which makes different models hard to tell apart. Some aspects are easier to understand if we plot the average speed vs distance.

Figure 1.1: Illustration of time vs distance and speed vs distance. The blue curve corresponds to a model that is entirely possible for a recreational runner and the orange curve is nonsensical. This appears much clearer when the model is shown in terms of speed rather than times.

1.6.3 Error Boxplots

Model accuracies are illustrated with boxplots of the relative errors on validation data. This representation allows a quick comparison of models in terms of bias and variance. For instance, model 1 in Figure 1.2 has a higher bias than model 2: model 1 underestimates race times (the sign convention is that errors are computed as target minus prediction). Model 1 has a lower variance.

The box represents the first and third quartiles, and the band inside the box is the second quartile (the median). The box thus contains half of the errors. The ends of the whiskers can represent several possible alternative values; we take the convention to place the whiskers at the 2nd percentile and the 98th percentile.

As an example, the third boxplot represents 1000 samples of a normal distribution that is centered with unit standard deviation. On average, the box extends 0.674σ above and below the mean, and the whiskers are 2.05σ distant from the mean.

Figure 1.2: Typical representation of the errors. Each error is computed as the target minus the prediction divided by the target.
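The boxplot conventions above can be checked with a short sketch using only the Python standard library. It computes relative errors with the stated sign convention, extracts the quartiles and the 2nd and 98th percentiles from a list of errors, and evaluates where the box edges and whiskers of a standard normal distribution are expected to lie. The function names and the choice of the "inclusive" percentile interpolation are assumptions made for illustration.

```python
from statistics import NormalDist, quantiles


def relative_errors(targets, predictions):
    """Relative errors: (target - prediction) / target."""
    return [(t - p) / t for t, p in zip(targets, predictions)]


def boxplot_summary(errors):
    """Quartiles for the box, 2nd and 98th percentiles for the whiskers."""
    # quantiles(..., n=100) returns the 99 cut points: index i is percentile i + 1.
    pct = quantiles(errors, n=100, method="inclusive")
    return {
        "whisker_low": pct[1],
        "q1": pct[24],
        "median": pct[49],
        "q3": pct[74],
        "whisker_high": pct[97],
    }


# Expected positions for a standard normal distribution (sigma = 1).
std_normal = NormalDist()
box_half_width = std_normal.inv_cdf(0.75)    # distance of the quartiles from the mean
whisker_distance = std_normal.inv_cdf(0.98)  # distance of the whiskers from the mean
```

The two distances evaluate to roughly 0.674 and 2.05 standard deviations, matching the values quoted for the third boxplot.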

1.6.4 Notations

The following notations are used throughout this thesis.

• $\mathbf{U}$: an uppercase bold symbol refers to a matrix $(a_{i,j}) \in \mathbb{R}^{m \times n}$
• $\mathbf{u}$: a lowercase bold symbol refers to a vector
• $a$: a lowercase symbol in normal font refers to a scalar
• $\mathbf{X}^\top$: transpose of matrix $\mathbf{X}$
• $\mathbf{X}^{-1}$: inverse of matrix $\mathbf{X}$
• $\ln()$ refers to the natural logarithm, that is, the logarithm of base $e$, $e$ being the irrational mathematical constant $\approx 2.7183$: $\ln(x) \triangleq \log_e(x)$.

1.6.5 Acronyms

The following acronyms are used throughout the text of this thesis.

• ML: Maximum Likelihood estimate
• MAP: Maximum a posteriori probability estimate
• OLS: Ordinary Least Squares
• ALS: Alternating Least Squares

Chapter 2

Data Description and Pre-Processing

We mentioned in the introduction that two kinds of data are required: race times, and activity recordings that include heart rate measurements and from which the instant power can be evaluated.

This chapter briefly describes the outcome of a long-term, rather technical effort of collecting and pre-processing the data that is used in the following chapters. Readers who are more interested in scientific content than in engineering aspects may skip this chapter. If they choose to do so, they can simply assume that race times are available, as well as some activities provided with continuous heart rate measurements together with a continuous estimate of the exerted power.

2.1 Introduction

Two sources of data are used to build and validate the predictive models that are described in the next chapters: official results obtained from organized races and activity recordings performed by runners with their smartphone or sports watch. Both were collected from freely accessible web pages using automated web content parsing. In addition, the company FORMYFIT, which sells running training plans on smartphones, gave us access (with users’ explicit consent) to recordings of fitness tests at gradual intensities.

2.2 Race Results

The company “Chronorace” provides electronic timing for a large number of races, mostly in Belgium but also in other countries. After each race, the company makes race times available on a website. Race results were collected from this website for 334 races that took place during the years 2014 and 2015, totaling 164451 race times of 124418 runners (most of them having only one record). Race distances range from 5 to 159 km.

2.2.1 Data Description

Runners in the collected database present various levels; this can be observed from the histograms of the 10-kilometer and marathon times in Figure 2.1.

Figure 2.1: Histograms of the 10K and marathon times in our database.

2.2.2 Filtering the Data

In the following chapters, some races and some athletes are discarded for different reasons that are mentioned here but which will become clearer when describing the methods.

Minimal Races per Runner

When building predictive models for race times, a certain number of races is required for adjusting model parameters to runners, and at least one race is required to assess the validity of the model. Table 2.1 shows the quantity of data that remains if we require athletes to have a minimal number of race records. Sparsity is discussed in Chapter 4. It corresponds to the proportion of unobserved (empty) entries if race times were placed in a matrix with athletes as rows and races as columns.

Min. races    Number of   Number of   Number of    Sparsity
per athlete   athletes    races       race times   [%]
1             124418      334         164451       99.6
2             25579       329         65612        99.2
3             7906        328         30270        98.8
4             3249        318         16299        98.4
5             1382        307         8831         97.9
6             732         299         5581         97.5
7             415         282         3680         96.9
8             247         262         2504         96.1
9             162         250         1824         95.5
10            115         236         1401         94.8
11            85          218         1101         94.1
12            51          191         727          92.5
13            28          172         451          90.6
14            17          147         308          87.7
15            16          141         294          87.0

Table 2.1: Data quantities when athletes with too few races are discarded.
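As a quick check of how the last column relates to the others, the sparsity can be recomputed from the three counts of each row: it is the share of empty cells in an athletes-by-races matrix. The function name below is an illustrative assumption.

```python
def sparsity_percent(n_athletes, n_races, n_race_times):
    """Percentage of empty entries in the athletes-by-races matrix."""
    n_cells = n_athletes * n_races
    return 100.0 * (1.0 - n_race_times / n_cells)


# First and last rows of Table 2.1.
first_row = round(sparsity_percent(124418, 334, 164451), 1)
last_row = round(sparsity_percent(16, 141, 294), 1)
```

Both values agree with the table, which confirms that the matrix becomes denser as athletes with few races are discarded, although it remains mostly empty.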

Isolated Races

For reasons that will become clearer in Chapter 4, in some cases we restrict our analysis to races that are connected to each other by athletes. To ensure that our data do not contain isolated communities, we can view races and their connections (athletes in common) as a graph described by an adjacency matrix. In such a graph, isolated communities are called connected components. These components are straightforward to compute by looping through the races. Fortunately, as our data contain races that all took place in Belgium (which is rather small), very few races have to be discarded (see Figure 2.2). Table 2.2 presents the quantity of data that is left if isolated races are discarded. From five races per athlete onwards, no race had to be removed.
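The search for isolated communities can be sketched as a breadth-first traversal of the race graph, where two races are neighbors if at least one athlete ran both. The input format (pairs of athlete and race identifiers) and the function name are assumptions made for illustration; this is not necessarily the implementation used to produce Table 2.2.

```python
from collections import defaultdict, deque


def race_components(results):
    """Group races into connected components.

    `results` is an iterable of (athlete_id, race_id) pairs. Two races are
    connected if at least one athlete appears in both.
    """
    races_of = defaultdict(set)     # athlete -> races they ran
    athletes_of = defaultdict(set)  # race -> athletes who ran it
    for athlete, race in results:
        races_of[athlete].add(race)
        athletes_of[race].add(athlete)

    seen = set()
    components = []
    for start in athletes_of:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            race = queue.popleft()
            component.add(race)
            # Every other race run by a participant of this race is a neighbor.
            for athlete in athletes_of[race]:
                for other in races_of[athlete]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        components.append(component)
    # Largest component first; the smaller ones are the isolated races.
    return sorted(components, key=len, reverse=True)
```

Keeping only the races of the first returned component corresponds to discarding the isolated races.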

Figure 2.2: Races are represented as dots. Two races are connected if at least one athlete ran in both. Two isolated subgraphs are present: one large and one with a few races (on the left).

Min. races    Number of   Number of   Number of    Sparsity
per athlete   athletes    races       race times   [%]
1             119944      322         157426       99.6
2             23546       322         61028        99.2
3             7628        321         29196        98.8
4             3009        314         15339        98.4
5             1382        307         8831         97.9
6             732         299         5581         97.5
7             415         282         3680         96.9
8             247         262         2504         96.1
9             162         250         1824         95.5
10            115         236         1401         94.8
11            85          218         1101         94.1
12            51          191         727          92.5
13            28          172         451          90.6
14            17          147         308          87.7
15            16          141         294          87.0

Table 2.2: Data quantities when athletes with too few races and isolated races are discarded.

2.2.3 Race Features

Numerous race features could be used to build predictive models of the race times. In this research, only the altitude gradients along the routes are considered. They are computed from the runner recordings at our disposal (see Section 2.3). The matching between races and recorded activities is discussed in Section 2.3.8.

2.3 Runner Recordings

Activities were downloaded from the “Garmin Connect” web platform. This platform stores all activities that are recorded using Garmin devices. Activities of users who set their profile to “public” can be freely downloaded. The web interface allows anyone to search for all activities around a given location. In addition, Garmin provides APIs (web interfaces that allow queries to be made on their database) that can be used to download all activities of a given user with a public profile. Data collection was automated as follows:

1. Search for activities at the place and time of the start of a crowded running event (the Brussels 20km race, around 40000 attendees). This yields the IDs of all users who attended the race with a Garmin device and a public profile: around 250 runners.
2. Use Garmin public APIs to download all their activities.
3. In all activities that are obtained, search for other races: places where more than 10 users started an activity at the same place and around the same time.
4. For each of the presumed races, query all other users who started an activity at that time and place. Repeat (a few times) from step 2 with the newly collected user IDs.

The process can be repeated until the activities of all users with a public profile who attended a race anywhere in the world are downloaded. In the scope of this research, the process was stopped after four iterations, having downloaded 312669 activities of various kinds belonging to 723 users. Of these, 241014 activities are runs, of which 3005 are races; 177156 activities are provided with heart rates, and 1615 are cycling activities with a direct measure of the instant power (the bikes are equipped with a power meter).

The next subsections briefly describe the information that is used. All measures that are discussed (heart rate, positions, speeds, elevation and power) are sampled at the same instants, and a timestamp is provided for each sample.

2.3.1 Heart Rate

Heart rates, when recorded, are provided at regular intervals of one to several seconds depending on the device settings. They provide an estimate of the intensity of exertion relative to the user’s maximum. As the heart rate recordings are given with a resolution of 1 bpm (see Figure 2.3), they need to be smoothed to approximate the true value, mostly because the derivative is used in Chapter 7. Smoothing is performed with a moving average filter of 10 seconds.
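The 10-second moving average can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the function name `smooth_heart_rate` and the choice of a window centered on each sample are assumptions, and timestamps are used explicitly so that irregular sampling intervals are handled.

```python
def smooth_heart_rate(times_s, bpm, window_s=10.0):
    """Centered moving average over a time window (assumed layout).

    times_s: sample timestamps in seconds; bpm: heart rates (1 bpm resolution).
    Each output value averages all samples within window_s / 2 of the sample.
    """
    half = window_s / 2.0
    smoothed = []
    for t in times_s:
        inside = [b for ts, b in zip(times_s, bpm) if abs(ts - t) <= half]
        smoothed.append(sum(inside) / len(inside))
    return smoothed
```

The quadratic cost of this sketch is acceptable for illustration; a running sum would be preferable for long recordings.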

Figure 2.3: 40-second zoom on heart rate recording and smoothing.

2.3.2 Geo-Localized Positions

Most devices record geographic coordinates obtained from the GPS system (most recent devices are more accurate because they also use the GLONASS and Galileo systems at the same time). Satellite-based positioning systems provide positions with an accuracy of a few meters. In some cases, when running in cities, some erratic points can be recorded, especially for recordings made with low-end or outdated smartphones. In these cases, Kalman filtering is applied; this is a very standard approach to filter positions. Without going into much detail, at each time step, it estimates the current position by weighting the measured position and the expected one according to their respective uncertainties.

Figure 2.4: Example of location points recorded during an activity.

2.3.3 Speeds

Most satellite-based positioning systems also provide an estimation of the speed. Recent devices also take into account recorded accelerations to improve the speed estimate. If the speed is not provided, it can be computed from the geographic positions and timestamps but, in general, the resulting speeds are much noisier and need to be filtered. A moving average filter is used with a smoothing time window whose width needs to be selected, as discussed in Section 2.3.7.

2.3.4 Elevation Data

Some activity recordings contain altitude data. Consumer-grade satellite-based elevations (obtained from systems such as the GPS) have poor accuracy [5] (see Figure 2.5). Route elevations can be gathered by querying publicly available topography data such as SRTM data (Shuttle Radar Topography Mission) or Google Maps APIs. They are both based on radar topography surveys conducted from space. It is observed that, in such databases, the altitude of the treetops is assigned to route parts that are covered by trees: this causes artificially high elevation gradients on routes that pass under trees.

Fortunately, some high-end sports watches include a barometric altimeter which has a good relative accuracy: the altitude is known with an additive bias that would need to be calibrated. The relative accuracy is what is of primary interest, the absolute altitude accuracy being irrelevant to our purpose as our analysis is based on elevation gradient only. Routes recorded with such devices could be found only for 74 of the 334 races for which we have the race results.

Figure 2.5: Altitudes obtained by different means: barometric measurement on a sports watch (Garmin Fenix 3), database queries (Google and SRTM), and GPS measurement.

2.3.5 Gradient of Ascent

Elevation profiles as they are recorded, even by high-end devices, are still noisy signals that need to be filtered, especially because our application requires the gradient of ascent: the derivative of a noisy signal can take artificially high amplitudes. A simple way to compute the gradient and smooth it at the same time is to take the average altitude over a distance of n meters ahead minus the average altitude over the same distance behind. The chosen distance then acts as a smoothing factor. More formally, if the elevation profile e(x) is re-sampled every meter, its gradient g(x) at distance x is given by

$$ g(x) = \frac{\sum_{i=x+1}^{x+n} e(i) - \sum_{i=x-n}^{x-1} e(i)}{n^2}, \tag{2.1} $$

with n being the smoothing distance in meters. Smoothing reduces the measurement noise but it also attenuates fast-changing details of the real gradient. Therefore, as for the speed filtering, choosing the smoothing distance n results in a trade-off that is illustrated in Figure 2.6, where the elevations on the same route were recorded by different devices. The selection of the smoothing factor is discussed in Section 2.3.7.
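A minimal sketch of Equation 2.1, assuming the elevation profile has already been re-sampled so that `elevation_m[i]` is the altitude at distance i meters (the function name and the boundary handling are illustrative choices, not taken from the thesis):

```python
def ascent_gradient(elevation_m, x, n):
    """Smoothed gradient at distance x (meters), following Equation 2.1.

    elevation_m: altitudes re-sampled every meter (index = distance in meters).
    n: smoothing distance in meters.
    """
    if x - n < 0 or x + n >= len(elevation_m):
        raise ValueError("not enough samples around x for this smoothing distance")
    ahead = sum(elevation_m[x + 1 : x + n + 1])   # sum of e(i), i = x+1 .. x+n
    behind = sum(elevation_m[x - n : x])          # sum of e(i), i = x-n .. x-1
    return (ahead - behind) / n**2
```

On a constant slope the two window centers are n + 1 meters apart, so the formula returns the true slope multiplied by (n + 1)/n, which is negligible for the window sizes considered here.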

Figure 2.6: Elevation gradient with different smoothing distances recorded by six different sports watches featuring a barometric altimeter on the same route portion of one kilometer. On the first plot, the elevation gradient is under-smoothed: it shows some details that do not correlate among multiple measurements. At the opposite end, on the last plot, the elevation gradient is over-smoothed: some details present in the 6 measurements vanish.

2.3.6 Instant Power During Exercise

The estimate of the instant power, together with the heart rate measurements, is actually the only data required by our analysis. In the remainder of the thesis, speed and slope recordings are not used. They are only useful in estimating the instant power.

Minetti et al. [42] study the oxygen consumption that is required to run a given distance as a function of the slope g. Oxygen consumption represents the metabolic energy expenditure, which is generally expressed relative to the runner’s weight in [W/kg]. We use two of their results. First, they show that the energy cost of running EC(g) does not depend on the speed but only on the distance covered: the energy required to run a given distance does not depend on how fast it was run, as long as the speed is in a reasonable range for endurance runs. In contrast, the rate at which the energy is used depends on the speed: the power p(t) is obtained by multiplying the energy cost by the speed s(t):

p(t) = EC(g).s(t).

Second, they established an expression of EC(g) as a fifth-order polynomial that can be expressed relative to the energy cost required to run on flat ground:

$$ \frac{EC(g)}{EC(0)} = 1 + 5.42\,g + 12.86\,g^2 - 12.03\,g^3 - 8.44\,g^4 + 43.18\,g^5. \tag{2.2} $$

This relation is illustrated in Figure 2.7.
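As an illustration, Equation 2.2 and the power relation p(t) = EC(g).s(t) can be evaluated as below. The function names are hypothetical, the slope g is assumed to be a fraction (0.1 for 10%), and the returned power is relative to the flat-ground energy cost EC(0), which is not given in the text.

```python
def relative_energy_cost(g):
    """EC(g) / EC(0) from Equation 2.2; g is the slope as a fraction."""
    return (1 + 5.42 * g + 12.86 * g**2 - 12.03 * g**3
            - 8.44 * g**4 + 43.18 * g**5)


def relative_power(g, speed_m_s):
    """Power in units of EC(0): energy cost ratio multiplied by the speed."""
    return relative_energy_cost(g) * speed_m_s
```

With this normalization, the relative power on flat ground is numerically equal to the speed, so running uphill at a given speed can be read as an equivalent flat-ground speed.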

Figure 2.7: Energy cost of running as a function of the slope; given relative to the energy cost on flat ground. We can see that the energy cost increases with uphill slopes. When running downhill, the energy cost decreases as long as the slope is not too steep: beyond 20% some energy is spent on braking.

For cycling activities that are used in Chapter 7, we have a direct measure of the mechanical power exerted on the pedals. In this case, we use the power output measured by a power meter.

2.3.7 Smoothing Speeds and Slopes

We expressed the need for filtering the speeds and the slopes but we did not select which smoothing factors to use. As we know that the heart rate should be proportional to the power, and we estimate the power from speed and slope, we can select the two smoothing factors such that the correlation between the heart rate measurements and the computed power is maximized. We selected 30 running activities that were provided with barometric elevation and heart rate and that present varying intensities. We found that the correlation is maximized when the slope is smoothed with a 120-meter window and the speed with a 2-minute window. The idea is illustrated in Figures 2.8 to 2.10. The elevation smoothing window is rather wide. Indeed, in 120 meters the slope could have changed several times. This means that we lose rapidly changing details of the route, but that is the price we pay to get rid of the measurement noise. The same reasoning applies when smoothing the speeds. A more elaborate model of the heart rate response to exerted power is developed in Chapter 7. At this stage, the correlation is computed after delaying the heart rates by 30 seconds so that the heart rate has time to adapt to the intensity of the exercise.
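The selection principle can be sketched as follows. This is a simplified, one-dimensional illustration under stated assumptions: signals are sampled at 1 Hz, a single window (in samples) is searched instead of the two windows selected jointly in the text, and all function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def moving_average(values, window):
    """Centered moving average; the window shrinks near both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def delayed_correlation(power, heart_rate, delay=30):
    """Correlate power[i] with heart_rate[i + delay] (delay in samples)."""
    n = len(power) - delay
    return pearson(power[:n], heart_rate[delay:])


def best_window(raw_power, heart_rate, candidate_windows, delay=30):
    """Return the candidate window maximizing the delayed correlation."""
    scores = {
        w: delayed_correlation(moving_average(raw_power, w), heart_rate, delay)
        for w in candidate_windows
    }
    return max(scores, key=scores.get), scores
```

Extending the search to a grid over both the speed window (in seconds) and the slope window (in meters) only requires recomputing the power for each pair before calling `delayed_correlation`.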

2.3.8 Merging the Two Sources of Data

In Chapters 5 and 7, we will need to establish a relationship between race measurements and race results. It is therefore desirable to connect the races and users found in the recorded activities with those in the race results database. This was done by checking the correspondence of race times on races that took place on the same day. In this way, 142 runners and 183 races were identified in both databases.

2.3.9 Race Tracks Cropping

It happens regularly that users record an activity during a race and do not stop recording immediately on the finish line. In most cases, the activities of the same race stop a few meters apart, but sometimes runners continue to record on their way back home. When this happens, we can compute neither race features nor race times. In order to cut activities at the finish line, we need to estimate where it was. The mean and the median locations are obviously after the finish line because few runners stop the activity before the end of the race. In order to find the region where most runners stop their activity, we must estimate the end point density and locate its mode. This is done with a method known as kernel density estimation (KDE): we sum 2D Gaussian kernels with a 15-meter standard deviation centered at the location of each end of activity. The sum of all kernels then forms a density surface whose maximum can be selected as the mode of the distribution, and we assume that it is close to the true finish line. Figure 2.11 shows an example for one race.

Figure 2.8: Appropriate smoothing of the speeds and the slopes: power and heart rates look very similar (after warming-up). The correlation is 0.89.

Figure 2.9: Slope and speeds not smoothed enough: the power signal presents more details than the heart rates. The correlation is 0.63.

Figure 2.10: Slope and speeds are smoothed too much: the power signal presents fewer details than the heart rates. The correlation is 0.71.

Figure 2.11: The dots represent locations where a recording was stopped. The estimation of the end point density is represented by the luminance of the blue dots and the estimated mode is in white.
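The mode search for the finish line can be sketched as below. This is only an illustration: end points are assumed to be already projected to local metric coordinates (meters east, meters north), the density is evaluated on a regular grid whose 5-meter step is an arbitrary choice, and the kernel normalization constant is dropped because it does not affect the location of the maximum.

```python
import math


def kernel_sum(x, y, end_points, sigma):
    """Unnormalized sum of 2D Gaussian kernels centered on the end points."""
    return sum(
        math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma**2))
        for px, py in end_points
    )


def estimate_finish(end_points, sigma=15.0, step=5.0):
    """Grid point with the highest end-point density (estimated mode)."""
    xs = [p[0] for p in end_points]
    ys = [p[1] for p in end_points]
    nx = int((max(xs) - min(xs)) / step) + 1
    ny = int((max(ys) - min(ys)) / step) + 1
    best, best_density = None, -1.0
    for ix in range(nx):
        for iy in range(ny):
            x, y = min(xs) + ix * step, min(ys) + iy * step
            density = kernel_sum(x, y, end_points, sigma)
            if density > best_density:
                best, best_density = (x, y), density
    return best
```

Unlike the mean, the estimated mode is not pulled away by the few recordings that stop far from the finish area.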

2.4 Conclusion

In this chapter, we described the main techniques that were used to gather 124418 race results obtained from 334 races by 164451 runners. 183 of the races are provided with barometric elevation profiles and 142 of the runners are provided with activity recordings containing measurements of their heart rates associated with an estimation of the continuous power that they exert. The race results are used in Chapters 3 to 6 to build race time predictive models that are based on previous race times. The activity recordings are used in Chapter 7 to show how they can serve in fitness assessment and in Chapter 5 to establish a relationship between equivalent distance computed from race results and route elevation profile.

Chapter 3

Runner Performances Modeling

The most trivial piece of information that can be used to predict the future race performances of a runner is how they performed in the past. A first attempt at predicting the performances of a runner can be achieved by considering only their previous race times: from a set of known results on given distances, one can build a model that interpolates or extrapolates performances to any other distance. More specifically, the literature provides some models that express the race time as a function of the distance. These models have parameters that are adjusted to the runner being considered based on their race records.

Part of the research effort has been focused on modeling world athletic records [35, 46, 49, 59], describing how the world best times evolve with competition lengths. The same kind of work has been performed for individual performances [62, 63]. The models that are used in both cases have the same mathematical expression; only the parameters differ. The emphasis of this chapter is on models that are suited to predicting the performances of recreational runners in endurance races. Although endurance races range from eight to hundreds of kilometers, most race results that are available for the analysis presented in this chapter are between 8 km and the marathon distance (42.2 km). This range represents the distances that are run by most recreational runners.

In this chapter, models found in the sports sciences literature are presented and two others that could accommodate longer distances are proposed. The fitting of the models (tuning of the parameters to the data of a specific runner) is discussed before the models are compared according to several criteria.

3.1 Existing Models

For more than a century, several laws have been proposed to model the relationship between race times and race length or duration. Most of them are actually expressed in terms of an average speed that can be maintained for a given distance or duration. Figure 3.1 illustrates that some models can be adjusted to race performances that correspond to either world best performances or individual records. From models that are inferred from individual race records, prediction can be made by interpolating or, more challengingly, extrapolating to new race distances.

Figure 3.1: Race performance modeling examples. The equations that the blue curve can follow to fit the observations are the subject of this chapter.

3.1.1 Power Law

The first law that was proposed dates back to a paper published in 1906 [35]. The general expression is claimed to describe the fastest speeds that can be achieved in different disciplines (running, skating and cycling, among others). The same study even generalizes the expression to animals such as horses. Race speeds are modeled using a linear relationship with the race length in the log domain: an athlete is expected to run a race with an average speed s that depends on the total distance of the race d and two runner-specific parameters, α and β:

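$$ \ln(s) = \alpha + (1 - \beta)\,\ln(d), \tag{3.1} $$

which can be, equivalently, expressed as

$$ s = k\,d^{\,1-\beta}, \tag{3.2} $$

with $k = e^{\alpha}$. Taking $d_{ref}$ and $s_{ref}$ as reference distance and speed, we can write

$$ s_{ref} = k\,d_{ref}^{\,1-\beta} \tag{3.3} $$

and thus (by substituting k),

$$ s = s_{ref} \left(\frac{d}{d_{ref}}\right)^{1-\beta}, \tag{3.4} $$

and, taking a 10 km distance as reference,

$$ s = s_{10K} \left(\frac{d}{10^4}\right)^{1-\beta}. \tag{3.5} $$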

The last equation shows two runner parameters ($s_{10K}$ and β), which are more interpretable. Indeed, $s_{10K}$ corresponds to the expected speed on a 10-kilometer race and β can be interpreted as an endurance parameter. β is strictly greater than one; otherwise, the average speed would increase with the race distance. A value of β close to one corresponds to a good level of endurance: the speed decreases only slightly as the distance increases. Substituting s = d/t, the same relation can be expressed in terms of race times, with $t_{10K}$ being the expected time on a 10-kilometer race, as

$$ t = t_{10K} \left(\frac{d}{10^4}\right)^{\beta}, \tag{3.6} $$

where it can be seen that β equal to one makes the time directly proportional to the distance (without considering the fatigue, meaning a perfect endurance). It is worth noting that most endurance runners tend to express their performances in terms of pace, usually expressed in [min/km] or in [min/mile]. For future reference, the power law equation can easily be turned into an expression of the pace:

$$ p = \frac{1}{s} = p_{10K} \left(\frac{d}{10^4}\right)^{\beta-1}. \tag{3.7} $$

Figure 3.2: World record fitting with two different power laws (log-log axes).

Most studies agree that the power law has a cut-off point separating short and long races [44, 49, 51]. The power law thus has different coefficients below and above the breakpoint, as illustrated, for world records, in Figure 3.2. This breakpoint is attributed to the transition between anaerobic effort (high-intensity exercise that can be maintained for a short period of time during which the main energy source does not directly depend on oxygen consumption) and aerobic effort (light or moderate exercise that can be maintained for a long time in which the energy expenditure is almost proportional to the rate of oxygen intake) [41]. Strictly speaking, the two energy sources co-exist at all intensities, but one of them predominates when one moves away from the transition point. Although the exact position of the cut-off point depends on the study, it is generally found in the range of a few minutes (typically 2 to 6) or, equivalently, in the range of a few kilometers (typically 1 to 3). In the present thesis, the distances of interest (> 8 kilometers) are sufficiently beyond the breakpoint to consider the endurance coefficient as constant for each athlete. For the considered distance range, the literature reports endurance coefficients from 1.051 to 1.23 [35, 49, 63].

Variants of this model are either simpler or more complex. A more complex model is presented in [44]. It requires more parameters: the slope in the aerobic zone, the slope in the anaerobic zone and the cut-off point between them (βaerobic, βanaerobic, and (dth, sth)). At the opposite end, the model can be made simpler by assuming that the endurance parameter is independent of the runner. A popular endurance coefficient that is used in coaching is β = 1.08. In this case, race time prediction can be computed with only one reference race using Equation 3.6. In the following, the power law with a fixed β coefficient is referred to as Riegel’s formula.

The limitations of the power-law model have been pointed out in [18]. The authors acknowledge that the accuracy is good up to the half-marathon distance (21.1 kilometers) but deteriorates for longer distances.

3.1.2 Hyperbolic Two-Parameter Model

The second performance model was proposed in [23]. It comes from the assumption that a given muscular power p can be maintained during a time period t as

$$ p = \frac{awc}{t} + r \tag{3.8} $$

or,

$$ t = \frac{awc}{p - r}, \tag{3.9} $$

in which awc refers to the anaerobic work capacity, expressed in Joules, and r to the rate of energy released from the aerobic metabolism. As the energy expenditure p in running exercises is nearly proportional to the speed [48], Equation 3.9 can be re-written as

$$ t = \frac{adc}{s - s_\infty}, \tag{3.10} $$

or,

$$ s = \frac{adc}{t} + s_\infty, \tag{3.11} $$

where adc stands for anaerobic distance capacity, expressed in meters, and s∞ refers to a speed that could be, according to the model, maintained indefinitely, as can be seen in Figure 3.3. Substituting t = d/s in Equation 3.11, the same relationship can be expressed for speeds and distances:

$$ \begin{aligned} s &= \frac{adc \cdot s}{d} + s_\infty \\ s \left(1 - \frac{adc}{d}\right) &= s_\infty \\ s &= s_\infty \cdot \frac{1}{1 - \frac{adc}{d}}. \end{aligned} \tag{3.12} $$
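A small sketch of Equation 3.12, with a hypothetical function name and an explicit guard for the vertical asymptote of the model (speeds in m/s and distances in meters are assumed):

```python
def hyp2_speed(d, adc, s_inf):
    """Speed sustained over distance d for the hyperbolic two-parameter model."""
    if d <= adc:
        raise ValueError("the model is undefined for distances up to adc")
    return s_inf / (1.0 - adc / d)
```

For any valid distance, the returned speed s and the duration t = d/s also satisfy the speed-duration form of the model, s = adc/t + s∞.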

Figure 3.3: Hyperbolic 2-Parameter model for which a speed of 10 km/h can be maintained indefinitely.

3.1.3 Hyperbolic Three-parameter model

The hyperbolic two-parameter model was criticized in [26]. The author objected that, when fitting the model to a runner, the speed s∞ is always overestimated. Furthermore, the model has a vertical asymptote at d = adc, which makes it impossible to use for distances close to adc. To overcome these limitations, he proposed a new model that includes the notion of maximal power pmax and p∞, the maximal power that could be maintained indefinitely:

$$ t = \frac{adc}{p - p_\infty} + \frac{adc}{p_\infty - p_{max}}, \tag{3.13} $$

which can also be expressed in terms of speeds as

$$ t = \frac{adc}{s - s_\infty} + \frac{adc}{s_\infty - s_{max}}. \tag{3.14} $$

To obtain the speed as a function of the race distance, t can be replaced with d/s, which gives

$$ s = \frac{adc}{d/s - k} + s_\infty, $$

where $k = \frac{adc}{s_\infty - s_{max}}$. The expression can be re-arranged as

$$ -k\,s^2 + (d - adc + s_\infty k)\,s - d\,s_\infty = 0, $$

which is a second-order polynomial that accepts only one positive solution for the speed at a given distance. Since k is negative ($s_\infty < s_{max}$), this solution is

$$ s = \frac{d - adc + s_\infty k - \sqrt{(d - adc + s_\infty k)^2 - 4\,k\,s_\infty\,d}}{2k}. \tag{3.15} $$

This formulation describes the speed vs. distance relationship but is far less intuitive than Equation 3.14. Figure 3.4 shows that the model still presents a horizontal asymptote at s = s∞ but now presents a finite maximal speed at d = 0.
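The speed-distance relation of the three-parameter model can be sketched as follows. The function name is hypothetical; the code solves the quadratic −k.s² + (d − adc + s∞.k).s − d.s∞ = 0 and keeps the root that is positive, which for k < 0 (s∞ < smax) is the one with a minus sign in front of the square root.

```python
import math


def hyp3_speed(d, adc, s_inf, s_max):
    """Positive root of the quadratic obtained from Equation 3.14 with t = d/s."""
    k = adc / (s_inf - s_max)            # negative because s_inf < s_max
    b = d - adc + s_inf * k
    disc = b * b - 4.0 * k * s_inf * d   # always >= b*b since k < 0
    return (b - math.sqrt(disc)) / (2.0 * k)
```

Two sanity checks follow from the model itself: the function returns smax at d = 0 and tends to s∞ for very long distances.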

Figure 3.4: Hyperbolic 3-Parameter model for which a speed of 10 km/h can be maintained indefinitely, with a maximal speed of 18 km/h.

3.1.4 Exponential Model

The exponential model presented in [25] describes the speed as an exponential decay of the exercise duration that goes from a maximal speed smax down to a base speed s∞ that could be maintained indefinitely. The exponential model can be expressed as

$$ s = s_\infty + (s_{max} - s_\infty)\,e^{-t/\tau}, \tag{3.16} $$

where τ is the time constant of the decay toward the base speed.

Equation 3.16 cannot be easily re-formulated to express the speed as a function of the distance. Instead, knowing the three model parameters (s∞, smax and τ), the speed can be computed by successive iterations: starting with an initial guess for the speed, the total duration can be computed as t = d/s; then Equation 3.16 gives a better approximation of the speed which, in turn, can be used to better approximate the duration t. Iterations are repeated until the speed no longer changes (see Algorithm 1). In practice, convergence is observed after 3 to 5 iterations.

Figure 3.5: The exponential model for given s∞, smax, and τ

The speed-distance relationship obtained from this procedure is shown in Figure 3.5. The model is very intuitive (in the sense that the three parameters are easy to interpret), but, as will be seen later, such a base speed does not seem to exist. Methods that use s∞ overestimate the power that can be maintained indefinitely: in practice, it cannot be maintained for more than 60 minutes [8, 47].

Algorithm 1 Compute speed from distance fd(d) knowing the speed-from-duration function ft(t)

function fd(d, ft)
    s ← 3                        ▷ Guess an initial speed: 3 m/s (= 10.8 km/h).
    t ← d/s                      ▷ Compute duration based on approximated speed.
    repeat                       ▷ Typ. 3-5 times.
        s ← ft(t)                ▷ Re-compute speed based on duration t.
        t ← d/s                  ▷ Re-compute duration.
    until |s − ft(t)| < 10^−6    ▷ Convergence criterion.
    return s
end function
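A runnable sketch of this fixed-point iteration, applied to the exponential model of Equation 3.16, is given below. The function names, the tolerance argument and the iteration cap are illustrative additions; convergence is tested on two successive speed estimates, a slight variation of the criterion of Algorithm 1.

```python
import math


def exponential_speed(t, s_inf, s_max, tau):
    """Equation 3.16: speed sustained for a duration t (seconds)."""
    return s_inf + (s_max - s_inf) * math.exp(-t / tau)


def speed_from_distance(d, speed_from_duration, s0=3.0, tol=1e-6, max_iter=100):
    """Iterate s <- f_t(d / s) until the speed estimate stabilizes."""
    s = s0                                # initial guess: 3 m/s (10.8 km/h)
    for _ in range(max_iter):
        t = d / s                         # duration implied by the current speed
        s_new = speed_from_duration(t)    # speed sustainable for that duration
        if abs(s_new - s) < tol:
            return s_new
        s = s_new
    raise RuntimeError("fixed-point iteration did not converge")
```

Any speed-duration model can be plugged in, for instance `speed_from_distance(10000.0, lambda t: exponential_speed(t, 2.78, 5.0, 600.0))`, where the parameter values are arbitrary examples.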

3.1.5 Logarithmic Endurance Model (Péronnet)

A model proposed in [46] expresses that the maximal speed s one can maintain decreases with the logarithm of the total time of exertion t:

$$ s = mas - x \ln\left(\frac{t}{420}\right), \tag{3.17} $$

where x is an endurance index and mas is the maximal aerobic speed, which corresponds, in the paper, to the maximal speed that can be maintained during an all-out effort of 7 minutes (420 seconds). Again, the model provides a speed vs. time relationship that cannot be expressed in closed form as a function of the distance. The desired relation, illustrated in Figure 3.6, is obtained through Algorithm 1.

Figure 3.6: Péronnet model for given mas and x

3.1.6 VDOT Model

The last model, presented in [13], is not discussed much in the literature but is widely used by commercial coaching services such as [1]. VDOT stands for V̇ (“V dot”), which is short for V̇O2, the volumetric rate of oxygen consumption that is at the center of the model. Indeed, running at a given speed requires a volume of oxygen intake per unit of time and per unit of body mass: V̇O2, expressed in [ml/min/kg]. According to the author, it is related to the runner speed as

$$ \dot{V}O_2(s) = f_{vo2}(s) = 10.93548\,s + 0.3744\,s^2 - 4.60. \tag{3.18} $$

The maximum oxygen uptake V̇O2max is an important physiological characteristic of a runner. It expresses the maximal volume of oxygen they are able to consume per unit of time. It is known to be a useful fitness metric because it reflects respiratory, circulatory, and muscle capacities [24].

The second formula in the model reflects the intensity that can be maintained over a given time t. The intensity is expressed in terms of oxygen intake V̇O2 relative to the runner maximum V̇O2max as

$$ \frac{\dot{V}O_2}{\dot{V}O_{2max}} = f_{s.i.}(t) = 0.2990\,e^{-0.003221\,t} + 0.1894\,e^{-0.0002130\,t} + 0.8. \tag{3.19} $$

Figure 3.7 illustrates the two formulas; on the right, the part of the curve that is located above V̇O2/V̇O2max = 1 is not displayed because it is not used (we deal with races that last more than a few minutes) and because it makes no sense (runners cannot use more oxygen than their V̇O2max). The only unknown athlete parameter that rules the VDOT model is the V̇O2max. Knowing the latter, the speed can be expressed as

$$ s = f_{vdot,t}(t) = f_{vo2}^{-1}\left(f_{s.i.}(t) \cdot \dot{V}O_{2max}\right), \tag{3.20} $$

where $f_{vo2}^{-1}$ can be approximated by a polynomial that fits the values that are illustrated in the left part of Figure 3.7. Again, speed as a function of the race length can be obtained by iterating with the time-expression, using Algorithm 1. The shape of the obtained speed is illustrated in Figure 3.8 for a runner having a V̇O2max of 50.
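The VDOT relations can be sketched as below. Several points are assumptions rather than statements from the text: speeds are taken in m/s and durations in seconds (which is what the coefficient values suggest), the inverse of Equation 3.18 is computed exactly as the positive root of the quadratic instead of the polynomial approximation mentioned above, and all function names are hypothetical.

```python
import math


def vo2_from_speed(s):
    """Equation 3.18: oxygen uptake [ml/min/kg] required to run at speed s."""
    return 10.93548 * s + 0.3744 * s**2 - 4.60


def speed_from_vo2(vo2):
    """Exact inverse of Equation 3.18 (positive root of the quadratic)."""
    a, b, c = 0.3744, 10.93548, -4.60 - vo2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def sustainable_fraction(t):
    """Equation 3.19: fraction of VO2max sustainable for a duration t."""
    return 0.2990 * math.exp(-0.003221 * t) + 0.1894 * math.exp(-0.0002130 * t) + 0.8


def vo2max_from_race(d, t):
    """VO2max implied by one race of distance d run in time t."""
    return vo2_from_speed(d / t) / sustainable_fraction(t)


def vdot_speed_from_distance(d, vo2max, s0=3.0, tol=1e-6, max_iter=100):
    """Equation 3.20 combined with the fixed-point iteration on t = d / s."""
    s = s0
    for _ in range(max_iter):
        s_new = speed_from_vo2(sustainable_fraction(d / s) * vo2max)
        if abs(s_new - s) < tol:
            return s_new
        s = s_new
    raise RuntimeError("fixed-point iteration did not converge")
```

The helper `vo2max_from_race` simply inverts the two formulas for a single known race, which is the quantity needed when the model is adjusted to a runner.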

Figure 3.7: Illustration of the two expressions that are used in the VDOT model

Figure 3.8: VDOT model for a V̇O2max of 50.

3.2 Proposed Models

According to [62], all the models presented above overestimate performances for race distances longer than the half-marathon (21.1 kilometers). The power law overestimates marathon performances but to a lesser extent than the other models, still according to the same source. The increasing popularity of longer distances (up to hundreds of kilometers) [16] calls for models that can deal with distances beyond the marathon. Two models are proposed here, following the rationale of the world records that are ratified by the IAAF (International Association of Athletics Federations) [27] and the IAU (International Association of Ultrarunners) [28].

3.2.1 Two-Threshold Power Law

The breakpoint between the anaerobic and aerobic performance curves (see Section 3.1.1) corresponds to a change in the primary source of energy. In the aerobic part, muscles draw most of their energy from two mechanisms, both requiring oxygen: the fat metabolism and the breakdown of glycogen into glucose. Glycogen burns rapidly to provide quick energy. Runners can store, in the liver and muscles, enough glycogen for about 30 km of running. Beyond this limit, they can only rely on fat metabolism (which is slower) or on a re-fill of glycogen by eating during the race. The negative impact of glycogen depletion, documented in [21, 22], is associated with a significant reduction of power output for the same level of oxygen intake. The sudden and drastic energy loss reported by many runners after about 30 km is, in popular culture, referred to as “hitting the wall”. Figure 3.9 shows that this second change in energy source coincides with a second breakpoint in the performance alignment. A more complete model of endurance performance could add two parameters to the power law: the “wall” distance dwall and the slope “beyond the wall” β2. Assuming that the race is longer than a few minutes (thus beyond the first breakpoint) and taking the second breakpoint (dwall, swall) as reference, the speed model can be expressed as

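$$ s(d) \,\hat{=}\, \begin{cases} s_{wall} \left(\dfrac{d}{d_{wall}}\right)^{\beta_1}, & \text{if } d \le d_{wall}, \\[2ex] s_{wall} \left(\dfrac{d}{d_{wall}}\right)^{\beta_2}, & \text{otherwise.} \end{cases} \tag{3.21} $$

The runner’s performance is then fully characterized by four parameters (dwall, swall, β1 and β2).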
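Equation 3.21 translates directly into a piecewise function. In this sketch the function name is hypothetical, and β1 and β2 are taken as the exponents of the speed law exactly as written in Equation 3.21, so they are negative when the speed decreases with the distance.

```python
def two_threshold_speed(d, d_wall, s_wall, beta1, beta2):
    """Equation 3.21: power law with a different exponent beyond the wall."""
    exponent = beta1 if d <= d_wall else beta2
    return s_wall * (d / d_wall) ** exponent
```

Both branches return swall at d = dwall, so the modeled speed is continuous at the breakpoint.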

3.2.2 Polynomial-Logarithmic Model

Another way of adding flexibility to the model is to take a polynomial of order P instead of a linear relationship:

$$ s(d) \,\hat{=}\, \sum_{p=0}^{P} \omega_p \left(\log(d)\right)^p, \tag{3.22} $$

where the $\omega_p$ are the P + 1 runner parameters. The speed-distance relationship fitted on world records is illustrated in Figure 3.10.

Figure 3.9: A 2-threshold model is proposed to account for the second slope beyond 30 to 40 km

Figure 3.10: Polynomial-logarithmic model adjusted to world records

3.3 Fitting of the Models

In order to use the models that are presented above to predict race times, they need to be adjusted to runners based on their race results. The way to achieve this operation depends on the model that is being considered. The procedure is described in the next sections for each model.

3.3.1 Power-Law

In the case of the power law, the relationship between the logarithm of the time t and the logarithm of the race distance d is linear: ln(t) = α + β ln(d). Runner coefficients α and β are obtained with a simple ordinary least squares regression in the log domain. This means that the coefficients are set such that the sum of squared differences between the observed logarithms of the race times and those predicted by the linear function is minimized. More formally, if the error term for each race i is denoted $\varepsilon_i$, the observed race time $t_i$ can be expressed as

$$ \ln(t_i) = \alpha + \beta \ln(d_i) + \varepsilon_i, $$

and thus,

$$ \varepsilon_i = \ln(t_i) - \alpha - \beta \ln(d_i). $$

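The runner parameters that minimize the quadratic error for N races,

$$ \arg\min_{\alpha,\beta} \sum_{i=1}^{N} \varepsilon_i^2, $$

are given from the sample variance var(ln(d)), the sample covariance cov(ln(d), ln(t)) and the means $\overline{\ln(t)}$ and $\overline{\ln(d)}$ [31]:

$$ \hat{\alpha} = \overline{\ln(t)} - \hat{\beta}\,\overline{\ln(d)} \tag{3.23} $$

and

$$ \hat{\beta} = \frac{cov(\ln(d), \ln(t))}{var(\ln(d))}. \tag{3.24} $$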

Note that minimizing the quadratic error in the log domain is not strictly equivalent to minimizing the quadratic error of the speed prediction. The interpretation of this specific optimization is presented in Chapter 6.
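The closed-form estimates of Equations 3.23 and 3.24 can be sketched as follows, with hypothetical function names and distances in meters and times in seconds. The covariance and variance are left unnormalized because their common factor cancels in the ratio.

```python
import math


def fit_power_law(distances_m, times_s):
    """Ordinary least squares fit of ln(t) = alpha + beta * ln(d)."""
    x = [math.log(d) for d in distances_m]
    y = [math.log(t) for t in times_s]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    beta = cov / var                    # Equation 3.24
    alpha = mean_y - beta * mean_x      # Equation 3.23
    return alpha, beta


def predict_time(d, alpha, beta):
    """Race time predicted by the fitted power law for a distance d."""
    return math.exp(alpha + beta * math.log(d))
```

At least two races of different lengths are needed; with identical distances the variance is zero and the slope is undefined.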

In the particular case of the model with a fixed endurance parameter (Riegel’s formula), race time prediction can be computed with a single reference race using Equation 3.4 expressed in terms of race times:

$$ t = t_{ref} \left(\frac{d}{d_{ref}}\right)^{\beta}, \tag{3.25} $$

with $d_{ref}$ and $t_{ref}$ the distance and time of the reference race. If several race records are available, either the best performance or the average performance can be chosen.
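Equation 3.25 amounts to a one-line computation. In this sketch the function name is hypothetical and the default exponent is the value β = 1.08 quoted earlier in the chapter; any other fixed value can be passed instead.

```python
def riegel_time(d, d_ref, t_ref, beta=1.08):
    """Equation 3.25: time on distance d from a single reference race."""
    return t_ref * (d / d_ref) ** beta
```

For example, `riegel_time(42195.0, 21097.5, 6300.0)` extrapolates a 1 h 45 min half-marathon to the marathon distance, the predicted time being the reference time multiplied by 2 raised to the power β.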

3.3.2 Hyperbolic Two-Parameter Model

The model has a known expression $f_{hyp2,d}(d, s_{crit}, adc)$ that describes the relationship between speed and effort distance with two parameters that are specific to the runner ($s_{crit}$, denoted s∞ in Section 3.1.2, and adc). The problem of curve fitting is to find the parameters that minimize the quadratic error of prediction for observed race performances ($s_i$) of length $d_i$:

$$ \arg\min_{s_{crit},\,adc} \sum_{i=1}^{N} \left(s_i - f_{hyp2,d}(d_i, s_{crit}, adc)\right)^2. \tag{3.26} $$

A solution to this problem can be computed efficiently with a trust-region method, which is typical for constrained least-squares curve fitting. The method description can be found in [11]. Once the two parameters are obtained, performance prediction for any distance d is computed with Equation 3.12. The same method is applied for fitting the models every time no closed-form solution can be derived.

3.3.3 Hyperbolic Three-Parameter Model

The same procedure can be applied to the hyperbolic 3-parameter model. In this case, a trust-region optimization that solves

$$ \arg\min_{s_{crit},\,s_{max},\,adc} \sum_{i=1}^{N} \left(s_i - f_{hyp3,d}(d_i, s_{crit}, s_{max}, adc)\right)^2 \tag{3.27} $$

outputs three parameters $\hat{s}_{crit}$, $\hat{s}_{max}$ and $\widehat{adc}$ that can be used to make predictions with Equation 3.15.

3.3.4 Exponential Model

The exponential model can be adjusted to runners using exactly the same procedure. However, as the speed-distance relationship $f_{exp,d}(d)$ cannot be expressed in closed form, each known race performance of index i is described as an average speed $s_i$ maintained over a duration $t_i$. The same optimization method is used to obtain the runner parameters (scrit, smax and τ) with the speed-duration relation $f_{exp,t}$. Once the runner parameters are obtained, predictions can be computed for any distance through Equation 3.16 and Algorithm 1.

3.3.5 Logarithmic Endurance Model

Similarly to what is done for the power law fitting, a linear relationship is to be found between the speed s and the logarithm of the normalized exercise duration ln(t/420): s = mas − x ln(t/420). The ordinary least squares regression gives the values of the two runner parameters mas and x.

3.3.6 VDOT Model

The VDOT model fitting requires only one runner parameter, V̇O2max. Each known race performance can be expressed as a race speed si maintained over a duration ti. From Equation 3.18, the speeds give the rate of oxygen consumption V̇O2,i during the whole race i. Knowing the race duration and the rate of oxygen consumption, Equation 3.19 can be used to compute the runner’s V̇O2max,i. As for the simplified power-law model, the runner parameter can be selected with either his best race performance or his average race performance. Once the V̇O2max of the runner is estimated, predictions for any distance can be computed using Algorithm 1 and Equation 3.20.

3.3.7 Two-Threshold Power Law

The fitting of this model depends on the assumptions that are made. For instance, the breakpoint distance dwall can be assumed to be 30 km for all runners (this is an approximation). In this case, all race results for races shorter than 30 km can be used to compute swall and β1 with the same linear regression as for the 1-threshold power law. Then, another linear regression can be performed with races above 30 km. In the latter case, only the slope β2 is computed with Equation 3.24 because swall is already computed using the first linear regression.

If dwall is assumed to be runner-dependent, model parameters can be estimated using a segmented linear regression, also known as piecewise linear regression. An efficient and simple method to solve this problem is proposed in [43], where the breakpoints are estimated iteratively.

3.3.8 Polynomial-Logarithmic Model

A polynomial model of the variable x is actually a linear model, except that its entries are x raised to powers 0 to P ($[1, x, x^2, \ldots, x^P]$). With N race results $\{s_i, d_i\}$ and $\varepsilon_i$ the error term, we can write the system of linear equations:

\[
\begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_N \end{pmatrix}
=
\begin{pmatrix}
1 & \ln(d_1) & (\ln(d_1))^2 & \cdots & (\ln(d_1))^P \\
1 & \ln(d_2) & (\ln(d_2))^2 & \cdots & (\ln(d_2))^P \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \ln(d_N) & (\ln(d_N))^2 & \cdots & (\ln(d_N))^P
\end{pmatrix}
\begin{pmatrix} \omega_0 \\ \omega_1 \\ \vdots \\ \omega_P \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{pmatrix},
\]

or, using matrix notation with X being the matrix of the variables $x_{i,j} = (\ln(d_i))^j$,

\[
s = X\omega + \varepsilon. \tag{3.28}
\]

The vector of coefficients, computed with the ordinary least squares estimation, is

\[
\hat{\omega} = (X^T X)^{-1} X^T s. \tag{3.29}
\]
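The normal equations of Equation 3.29 can be illustrated with a short, dependency-free sketch. The helper names `solve_linear` and `fit_poly_log` are hypothetical, the coefficients of the synthetic runner are arbitrary, and a numerical library would normally replace the hand-written solver.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                    # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_poly_log(distances_m, speeds, degree):
    """Least squares coefficients of s = sum_j omega_j * ln(d)^j."""
    rows = [[math.log(d) ** j for j in range(degree + 1)] for d in distances_m]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xts = [sum(r[i] * s for r, s in zip(rows, speeds)) for i in range(k)]
    return solve_linear(xtx, xts)                     # (X^T X)^{-1} X^T s

# Synthetic runner with arbitrary coefficients (illustration only).
omega_true = [9.0, -0.6, 0.02]
distances_m = [5000.0, 10000.0, 15000.0, 21097.5, 42195.0]
speeds = [sum(w * math.log(d) ** j for j, w in enumerate(omega_true))
          for d in distances_m]
omega_hat = fit_poly_log(distances_m, speeds, degree=2)
```

Note that a degree-P fit needs at least P + 1 race results at distinct distances, otherwise the matrix X^T X is singular.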

3.4 Model Comparison

The models that are presented above are compared using different criteria: their ability to fit world records to date, to predict the performance of a runner in any race given his other race records and, lastly, to predict the marathon time of a runner given his performance on shorter races (half-marathon and shorter). The latter is of particular interest because, in practice, many runners want to prepare for a marathon without testing themselves on the full distance, as this would increase their fatigue level and have a negative impact on their marathon time. The latter regression is also a more difficult task because it requires that runner performances be extrapolated beyond the range of observed values.

3.4.1 World Records Fitting

The ability to fit world records is tested using the procedures that were described in Section 3.3 (“Fitting of the Models”) on two distance ranges: the most popular distances (5 to 42.2 km) and the full range of endurance races (5 to ∼1000 km). Data is provided by the IAAF (International Association of Athletics Federations) [27] and the IAU (International Association of Ultrarunners) [28].

3.4.2 Race Performances Prediction

The ability to predict race times was tested on real individual data that was collected as discussed in Section 2. The database contains 335 races from the years 2014 and 2015, totaling 164456 race results from 124418 runners (most of them having only one race time). For the purpose of this chapter, this initial database was reduced to keep only runners with at least one marathon time and at least three short races (half-marathon and below), reducing the number of runners to 255.
For estimating the accuracy of the race time prediction, the following validation scheme is used: each race result is predicted with each model adjusted with the runner’s other results; the prediction error is then computed as the difference with the actual race result. For each model, the overall prediction error is given as the median absolute error on all race results. The median is preferred to the mean because the different regressions occasionally give extremely high errors (errors of thousands of km/h, as for Runner 2 in Figure 3.11). These extremely high errors impact the mean a lot more than the median.
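The validation scheme above can be sketched as follows. The sketch assumes the power law is fitted by a linear regression between the logarithms of speed and distance; the function names and the synthetic runner are hypothetical, and the last lines only illustrate why the median is less sensitive than the mean to one aberrant prediction.

```python
import math
from statistics import mean, median

def fit_power_law(distances_m, speeds):
    """Log-log ordinary least squares: ln(s) = a + b * ln(d)."""
    x = [math.log(d) for d in distances_m]
    y = [math.log(s) for s in speeds]
    x_mean, y_mean = mean(x), mean(y)
    b = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
    return y_mean - b * x_mean, b

def leave_one_out_errors(distances_m, speeds):
    """Relative absolute error [%] on each race, predicted from the others."""
    errors = []
    for i, (d, s) in enumerate(zip(distances_m, speeds)):
        a, b = fit_power_law(distances_m[:i] + distances_m[i + 1:],
                             speeds[:i] + speeds[i + 1:])
        predicted = math.exp(a + b * math.log(d))
        errors.append(100.0 * abs(predicted - s) / s)
    return errors

# Synthetic runner that follows a power law exactly (illustration only).
distances_m = [5000.0, 10000.0, 21097.5, 42195.0]
speeds = [12.0 * d ** -0.1 for d in distances_m]
errors = leave_one_out_errors(distances_m, speeds)

# One aberrant error dominates the mean but barely moves the median.
sample_errors = [1.0, 2.0, 3.0, 1000.0]
robust, fragile = median(sample_errors), mean(sample_errors)
```

Here `robust` is 2.5 whereas `fragile` is 251.5, which is the behavior that motivates reporting the median.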

3.4.3 Results

World Records Fitting

As can be seen in Figure 3.12, all models can be adjusted to fit the most common race distances (5 to 42.2 km) reasonably well but, except for the hyperbolic three-parameter law, this is not the case for longer endurance races. The quality of the fit is expressed in Table 3.1.

Figure 3.11: Two examples of marathon prediction with the different models.

Race Performances Predictions

Tested with raw race results, race performances cannot be accurately predicted with the two-parameter hyperbolic model: the bias (systematic underestimation) and variance are both very high (see Figure 3.13). Models with only one parameter (Riegel and VDOT models) present a larger variance than other models. When it comes to predicting marathon performances, the power law and the one-parameter models have a lower variance but they still present a bias. For the Riegel model (a power law with a fixed β), the bias can be compensated if β is well selected (see “Riegel -0.15” in Figure 3.13).

Table 3.1: World Records Fitting Test synthesis. Model names are abbreviated. Median relative absolute (|.|) errors [%].

| Model | Equation | Parameter names | 5-42.2km error [%] | Parameters | 8-1000km error [%] | Parameters |
| Power Law | 3.5 | s10K, β | 1.061 | 6.27, 1.0702 | 8.725 | 7.12, 1.246 |
| Riegel β = 1.08 | 3.5 | s10K | 1.195 | 6.30 | 26.349 | 5.00 |
| Hyp. 2 p. | 3.12 | adc, scrit | 1.899 | 496, 5.79 | 92.447 | 4816, 3.85 |
| Hyp. 3 p. | 3.14 | adc, scrit, smax | 1.044 | 5982, 5.05, 7.011 | 4.166 | 154252, 1.81, 6.44 |
| Péronnet | 3.17 | mas, e | 1.049 | 5.79, −0.758 | 4.199 | 7.48, −0.411 |
| Exp. | 3.16 | scrit, smax, τ | 0.761 | 5.70, 7.07, 1950 | 7.263 | 6.22, 2.16, 49707 |
| Vdot | 3.20 | V̇O2max | 1.376 | 83.3 | 37.677 | 62.7 |

Table 3.2: Median relative absolute (|.|) errors in speed prediction [%].

| Model | Race Pred. Error [%] | Marathon Pred. Error [%] |
| Power Law | 7.030 | 19.824 |
| Riegel β = 1.08 | 7.506 | 13.710 |
| Riegel β = 1.15 | 7.190 | 11.981 |
| Riegel β = 1.20 | 7.892 | 12.094 |
| Hyp. 2 p. | 32.505 | 39.351 |
| Hyp. 3 p. | 7.368 | 25.380 |
| Péronnet | 6.530 | 18.620 |
| Exp. | 7.384 | 21.613 |
| Vdot | 7.184 | 14.783 |

Figure 3.12: Fitting of the models on world records.

Figure 3.13: Race performance prediction errors for the different models. The Riegel formula is the power law with β value fixed to −0.06, −0.15 and −0.20.

3.5 Discussion

Based on the comparison of the models, the power law seems to be the most appropriate model for race performance prediction. It can reasonably fit the world records in both ranges. More importantly, it gives errors that present a low variance. Moreover, the bias on those errors can be compensated if parameter β is properly selected. Models that use the notion of a critical speed that could be maintained forever (exponential, hyperbolic and VDOT) can be discarded because the existence of such a speed is supported neither by world records nor by the physiology of exercise.
The Riegel and VDOT models have the advantage that they are not affected by a bad data configuration, as illustrated by Runner 2 in Figure 3.11. This suggests that the ideal model would stand between the power law (for its ability to fit any runner) and the Riegel formula (which avoids aberrant coefficients). Such a model is described in Chapter 6.
Models with more parameters can better describe runner performances but they require more data to be adjusted: the number of model parameters should be smaller than the number of known race results, unless some constraints or dependencies between parameters are considered. Furthermore, the data should be well spread over the distance ranges affected by each of the parameters. This is the reason why, at this stage, the models that were proposed in this chapter could not be tested. Indeed, the two-threshold power law model contains four parameters and the polynomial model contains at least three; in the database that is used, very few runners present the required data. Theoretically, the perfect model should include many parameters that account for the diversity of existing runners; but, in practice, only simple models can be adjusted to data that is generally available. This is true unless constraints or dependencies between parameters can be expressed and taken into account for the fitting of the model.
The prediction error might seem very high in the context of race pacing. This is due to the high variability observed in recreational runner performances: for example, they can race in a group, race unprepared, or even get hurt and finish the race walking. First, this variability makes performance inherently difficult to predict. Second, the high variability leads to aberrant modeling such as the example of Runner 2 in Figure 3.11.
To conclude this chapter, we can say that simple least squares fitting of the model parameters on raw race results is not sufficient within the scope of runner pacing. Therefore, accuracy should be improved by taking account of other pieces of information readily available on:

1. races: part of the performance variability can be attributed to varying race features (other than the distance alone).
2. human populations: the ranges of the model parameters can be constrained based on assumptions that can be expressed about human populations. For instance, the β parameter of the power law should be higher than one for everyone or the t10K parameter in a given population (age, sex) cannot be smaller than the 10-km world record.
3. individuals: additional runner features could be computed based on what a runner records with their smartphone or sports watch during their workout sessions.

The use of these other sources of information is the focus of the following chapters.

Chapter 4

Races and Athletes Characterization from Race Results

In the previous chapter, we reviewed models that can be used to characterize athlete performances based on their race times on given distances. The idea of this chapter is to show how we can discover some athlete and race underlying variables that can explain race results. Starting from a large set of competition times, the problem is to find latent variables that characterize runners and races in such a way that we can predict performances without considering knowledge on races or on runners. The content of this chapter largely overlaps with the ideas already published in [56].

4.1 Introduction

Performance models that are solely based on competition distances cannot be optimally accurate because races cannot be strictly summarized by their length. Other parameters could be used: gradient of ascent, weather conditions, altitude, vegetation, uneven ground, and ground firmness affect athlete speeds. Race times reflect all these aspects, and therefore can be used to characterize races.
Race times cannot be trivially used to compare competitions because they are not run by the same set of athletes. There may be attendee level differences depending on the popularity of the race. For instance, some local races may fail to attract world elite runners; some very popular races may attract world elite runners but also a crowd of recreational runners. This implies that race results (best time or any statistic computed on race times) do not necessarily reflect the objective runnability of the race.

Figure 4.1: Collaborative Filtering: the similarity between recommender systems and race time prediction.

The idea of this chapter is to blindly characterize athletes and races at the same time so that the attendee level disparities are taken into account. Moreover, the process could potentially uncover more than one variable per race, each associated with an athlete variable. Indeed, some race characteristics exist that can be quantified (for instance, length, ascent, heat) and which affect the times of athletes depending on their ability to cope with these characteristics (for instance, endurance, ability, resistance to heat). The intuition is that the blind characterization that we propose could reveal race and athlete components that can be mapped to such characteristics or abilities.

The technique that is used to identify race and athlete variables is inspired by collaborative filtering, which is discussed extensively in the literature. Collaborative filtering is more frequently used in recommender systems, where the objective is to predict user ratings that have not yet been expressed based on past preferences (and hence, recommend products that users may buy). Both problems can be viewed as a large matrix containing known entries (expressed ratings or race results) that are used to characterize users (or athletes) and products (or races); then, using this characterization, predictions can be made for unknown entries (see Figure 4.1).

This chapter presents how race and athlete characterization can be computed from the race results matrix using a low-rank matrix factorization. The method is validated by quantifying the accuracy of the predictions that can be made based on athlete and race components.

4.2 Low-Rank Approximation of the Race Results Matrix

With Na athletes and Nr races, one can look at the collection of race times as a large matrix $T = (t_{a,r}) \in \mathbb{R}^{N_a \times N_r}$ with columns corresponding to races (indexed r ∈ [1, ..., Nr]) and rows corresponding to athletes (indexed a ∈ [1, ..., Na]). This matrix is almost empty because most athletes ran only a few of the many races. In the available race results (described in Section 2.2), more than 95 % of the matrix entries are generally missing.

If we assume that race times can be approximated by a linear dependency on both race vectors vr and athlete vectors ua, race times ta,r can be expressed as a sum of k product terms

\[
t_{a,r} = \sum_{i=1}^{k} u_{a,i} \cdot v_{r,i} + \varepsilon_{a,r}, \tag{4.1}
\]

or using the vector notation,

\[
t_{a,r} = u_a^T \cdot v_r + \varepsilon_{a,r}, \tag{4.2}
\]

where εa,r refers to additional noise that reflects the fact that our relationship is not deterministic and only holds true on average. If the rank k is set to 3 and some matrix entries are missing, we can write the matrix equation

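\[
\text{athletes}\left\{\,
\overbrace{\begin{pmatrix}
t_{1,1} & \cdot & \cdot & t_{1,4} & \cdots & t_{1,N_r} \\
\cdot & t_{2,2} & \cdot & t_{2,4} & \cdots & t_{2,N_r} \\
\cdot & \cdot & t_{3,3} & \cdot & \cdots & t_{3,N_r} \\
t_{4,1} & \cdot & \cdot & t_{4,4} & \cdots & \cdot \\
t_{5,1} & t_{5,2} & t_{5,3} & \cdot & \cdots & \cdot \\
\cdot & \cdot & t_{6,3} & t_{6,4} & \cdots & t_{6,N_r} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{N_a,1} & \cdot & t_{N_a,3} & \cdot & \cdots & t_{N_a,N_r}
\end{pmatrix}}^{\text{races}}
\right.
\mathrel{\hat{=}}
\begin{pmatrix}
u_{1,1} & u_{1,2} & u_{1,3} \\
u_{2,1} & u_{2,2} & u_{2,3} \\
u_{3,1} & u_{3,2} & u_{3,3} \\
u_{4,1} & u_{4,2} & u_{4,3} \\
u_{5,1} & u_{5,2} & u_{5,3} \\
u_{6,1} & u_{6,2} & u_{6,3} \\
\vdots & \vdots & \vdots \\
u_{N_a,1} & u_{N_a,2} & u_{N_a,3}
\end{pmatrix}
\cdot
\begin{pmatrix}
v_{1,1} & v_{2,1} & v_{3,1} & v_{4,1} & \cdots & v_{N_r,1} \\
v_{1,2} & v_{2,2} & v_{3,2} & v_{4,2} & \cdots & v_{N_r,2} \\
v_{1,3} & v_{2,3} & v_{3,3} & v_{4,3} & \cdots & v_{N_r,3}
\end{pmatrix}.
\tag{4.3}
\]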

All missing entries (represented by a dot in the equation) can be predicted by a matrix product. With matrix $U = (u_{a,i}) \in \mathbb{R}^{N_a \times k}$ being the concatenation of all athlete vectors and matrix $V = (v_{r,i}) \in \mathbb{R}^{N_r \times k}$ being the matrix of race vectors, the equation becomes

\[
\hat{T} = U V^\top. \tag{4.4}
\]

This factorization of matrix T of size Na × Nr into two smaller matrices of size Na × k and Nr × k is called a low-rank approximation, k being the rank. The idea is to discover the k latent variables for each athlete and k latent variables for each race that best explain the known entries. Each athlete a is thus characterized by a k-element vector ua and, in a similar way, each race r is characterized by a k-element vector vr. The intuition is that this characterization describes athletes and races in such a way that a race time prediction can be made for any athlete on any race.
The same reasoning can be made if we assume the same kind of linear relationship with average race speeds (sa,r) or average paces (pa,r): sa,r = dr/ta,r, dr being the distance of race r, and pa,r = 1/sa,r. Indeed, the linearity that is assumed in Equation 4.1 could be better verified in these other scales. The matrix reconstructions then become

\[
\hat{S} = U_s V_s^\top, \tag{4.5}
\]

or,

\[
\hat{P} = U_p V_p^\top. \tag{4.6}
\]

The results obtained with these two other models are also presented at the end of this chapter.
Traditionally, collaborative filtering techniques use methods that impose non-negativity of all the elements in both matrices U and V. This is known as non-negative matrix factorization (NMF). We will not use NMF methods here because, as will be seen in Chapter 5, negative elements may have a physical meaning.

4.3 Cost Function

The problem to be solved is to select the athlete and race vectors that best reproduce observed race times: (Na + Nr) · k coefficients are to be found.

Let Ω be the set of observed entries (a, r) of the matrix (ta,r). Using the least squares error criterion, athlete and race latent vectors are selected by solving the optimization problem

\[
U, V = \underset{U,V}{\arg\min} \sum_{(a,r) \in \Omega} \left( t_{a,r} - u_a^\top \cdot v_r \right)^2. \tag{4.7}
\]
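As a small illustration of this criterion, the sketch below evaluates the sum of squared errors over the observed entries only. The dictionary representation of Ω (keys are (athlete, race) index pairs) and the function names are assumptions made for the example.

```python
def predict(u_a, v_r):
    """Predicted time: dot product of an athlete vector and a race vector."""
    return sum(ui * vi for ui, vi in zip(u_a, v_r))

def cost(U, V, observed):
    """Sum of squared errors over the observed entries (a, r) only."""
    return sum((t - predict(U[a], V[r])) ** 2 for (a, r), t in observed.items())

# Rank-one toy example: two athletes, two races, three known times (minutes).
U = [[1.0], [1.2]]
V = [[30.0], [60.0]]
observed = {(0, 0): 30.0, (0, 1): 61.0, (1, 0): 36.0}
total = cost(U, V, observed)
```

Only the entry (0, 1) is mispredicted here (60 instead of 61), so the cost is about 1; the missing entry (1, 1) does not contribute.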

4.4 Solution

The optimization expressed by Equation 4.7 is a non-convex problem that can be solved using heuristics that have proved to converge well in practice. Successful experiments are conducted using stochastic gradient descent or alternating least squares algorithms.
The stochastic gradient descent loops over the observed entries Ω, slightly modifying ua and vr in the direction of steepest descent of the error, reducing the error criterion at each step [20]. Noting $e_{a,r} = t_{a,r} - u_a^\top \cdot v_r$ the error on ta,r, the gradient of the squared error can be computed with respect to the athlete vectors,

\[
\frac{\partial e_{a,r}^2}{\partial u_a} = -2\, e_{a,r}\, v_r \tag{4.8}
\]

or with respect to the race vectors,

\[
\frac{\partial e_{a,r}^2}{\partial v_r} = -2\, e_{a,r}\, u_a. \tag{4.9}
\]

The procedure corresponds to Algorithm 2. The algorithm has a parameter h, the update rate: athlete and race vectors are updated by a step of h times the gradient of the error, taken in the descent direction. The value of h is optimized with a validation scheme that is described in Section 4.6.

Algorithm 2 Stochastic Gradient Descent

function sgd(Uinit, Vinit, h)
    U ← Uinit
    V ← Vinit                                   ▷ Initialization
    repeat
        for all [a, r] ∈ Ω do                   ▷ Updates
            e_{a,r} ← t_{a,r} − u_a^⊤ · v_r
            u_a ← u_a + h · 2 · e_{a,r} · v_r
            v_r ← v_r + h · 2 · e_{a,r} · u_a
        end for
        Permute Ω
    until h < 10^{−5}
    return U, V
end function
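The following is a minimal, runnable sketch of the update loop of Algorithm 2. It departs from the listing in two labeled ways: the vectors are initialized at random inside the function, and a fixed number of epochs replaces the stopping rule on h. The function names, the dictionary representation of Ω and the toy data are assumptions.

```python
import random

def sgd(observed, n_athletes, n_races, k=1, h=0.05, epochs=2000, seed=0):
    """Stochastic gradient descent on the squared error of observed entries."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n_athletes)]
    V = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n_races)]
    entries = list(observed.items())
    for _ in range(epochs):                     # assumption: fixed epoch budget
        rng.shuffle(entries)                    # "Permute Omega"
        for (a, r), t in entries:
            e = t - sum(ua * vr for ua, vr in zip(U[a], V[r]))
            U[a] = [ua + h * 2 * e * vr for ua, vr in zip(U[a], V[r])]
            # As in the listing, the race update uses the refreshed athlete vector.
            V[r] = [vr + h * 2 * e * ua for ua, vr in zip(U[a], V[r])]
    return U, V

def rmse(U, V, observed):
    """Root mean squared error on a set of entries."""
    sq = [(t - sum(ua * vr for ua, vr in zip(U[a], V[r]))) ** 2
          for (a, r), t in observed.items()]
    return (sum(sq) / len(sq)) ** 0.5

# Exact rank-one toy data (times in hours); entry (0, 2) is held out.
true_u = [1.0, 1.2, 0.9, 1.1]
true_v = [0.5, 1.0, 2.0]
observed = {(a, r): true_u[a] * true_v[r]
            for a in range(4) for r in range(3) if (a, r) != (0, 2)}
U, V = sgd(observed, n_athletes=4, n_races=3)
held_out = U[0][0] * V[2][0]
```

Times are expressed in hours here so that the update rate stays in a stable range; with times in seconds, h would have to be much smaller.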

The alternating least squares algorithm alternates between two convex problems: optimization of the vr vectors while holding the ua vectors constant and optimization of the ua vectors while holding the vr vectors constant [30]. This corresponds to several iterations of Nr linear regressions (one per race) followed by Na linear regressions (one per athlete). The procedure corresponds to Algorithm 3. The latter is preferred over the stochastic gradient descent because the parameters that need to be initialized are reduced to race parameters only and no hyperparameter is required.

Algorithm 3 Alternating Least Squares

function als(Vinit)
    V ← Vinit                                                    ▷ Initialization (race only)
    repeat
        for all a ∈ [1, Na] do                                   ▷ For each athlete
            X = v_{r,i} with i ∈ [1..k] and r s.t. (a, r) ∈ Ω    ▷ Their races
            y = t_{a,r} with (a, r) ∈ Ω                          ▷ Their times
            u_a ← (X^T X)^{−1} X^T y                             ▷ Updates athlete coef.
        end for
        for all r ∈ [1, Nr] do                                   ▷ For each race
            X = u_{a,i} with i ∈ [1..k] and a s.t. (a, r) ∈ Ω    ▷ Athlete coef.
            y = t_{a,r} with (a, r) ∈ Ω                          ▷ Race times
            v_r ← (X^T X)^{−1} X^T y                             ▷ Updates race coef.
        end for
        permute Ω
    until convergence
    return U, V
end function

If no additional constraint is added, the solution to Equation 4.7 is not unique. Indeed, a factor x in parameters $u_{a,i}\ \forall a$ can be compensated by a factor 1/x in parameters vr,i (see Equation 4.1) without affecting the cost function. This is not a problem as long as the factor x is the same for all races and all athletes (see Section 4.5.1); otherwise, we cannot ensure that a prediction for any missing entry of the matrix T is valid. The scaling of the parameters is also important in case we would like to interpret the values (see Chapter 5).
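For the rank-one case, each regression of Algorithm 3 reduces to a scalar least squares estimate, which gives a compact sketch. This is an illustration rather than the implementation used here: the function name, the fixed iteration count (in place of a convergence test) and the toy data are assumptions.

```python
def als_rank_one(observed, n_athletes, n_races, v_init, iterations=50):
    """Rank-one alternating least squares on the observed entries only."""
    v = list(v_init)                 # only race coefficients are initialized
    u = [0.0] * n_athletes
    for _ in range(iterations):
        for a in range(n_athletes):  # one scalar regression per athlete
            pairs = [(v[r], t) for (a2, r), t in observed.items() if a2 == a]
            u[a] = sum(vr * t for vr, t in pairs) / sum(vr * vr for vr, _t in pairs)
        for r in range(n_races):     # one scalar regression per race
            pairs = [(u[a], t) for (a, r2), t in observed.items() if r2 == r]
            v[r] = sum(ua * t for ua, t in pairs) / sum(ua * ua for ua, _t in pairs)
    return u, v

# Exact rank-one toy data (times in minutes) with three missing entries.
true_u = [1.0, 1.2, 0.9, 1.1]
true_v = [30.0, 60.0, 130.0]
missing = {(0, 2), (1, 0), (3, 1)}
observed = {(a, r): true_u[a] * true_v[r]
            for a in range(4) for r in range(3) if (a, r) not in missing}
u, v = als_rank_one(observed, 4, 3, v_init=[1.0, 1.0, 1.0])
predictions = {(a, r): u[a] * v[r] for (a, r) in missing}
```

The recovered u and v differ from `true_u` and `true_v` by a common factor and its inverse, which is the scale ambiguity discussed above; the products, and hence the predictions for the missing entries, are unaffected.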

4.5 Data Requirements

Obviously, the matrix factorization cannot produce sound vectors if the number of elements to estimate is larger than the number of known entries. Actually, the requirement is stronger than that. Indeed, each athlete must have at least k race results and each race must have at least k athlete results to estimate their k vector elements; the latter is much easier to satisfy. Races and athletes that do not fulfill these conditions are removed from the database. In practice, as race times account for many variables (such as the athlete’s fitness on the day or weather conditions), k vector elements require more than k race results so that the noise on the data can be averaged out.

4.5.1 Communities of Runners

As mentioned earlier, we must ensure that the way race and athlete parameters are scaled is the same for all races and all athletes. This is not guaranteed unless races have athletes in common. Indeed, it is possible that there exist some isolated communities of runners that do not share races. This situation is illustrated in Figure 4.2: races are represented with dots and an edge connects two races if at least one athlete participated in both.

Figure 4.2: Races are represented as dots. Two races are connected if at least one athlete went to both. Two isolated subgraphs are present.
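Isolated communities correspond to the connected components of the race graph of Figure 4.2. One possible way to find them, sketched below with a union-find structure, is to merge all races run by the same athlete. The function name and the toy data are hypothetical.

```python
def race_communities(races_by_athlete):
    """Group races into communities connected through shared athletes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path compression
            x = parent[x]
        return x

    for races in races_by_athlete.values():
        for race in races:
            find(race)                       # register every race
        for race in races[1:]:
            parent[find(races[0])] = find(race)   # same athlete: same community

    groups = {}
    for race in list(parent):
        groups.setdefault(find(race), set()).add(race)
    return sorted(groups.values(), key=lambda g: sorted(g))

races_by_athlete = {
    "ann": ["A", "B"],
    "bob": ["B", "C"],
    "eve": ["X", "Y"],
    "dan": ["Y"],
}
communities = race_communities(races_by_athlete)
```

In this toy example, races A, B and C form one community and races X and Y another, because no athlete ran in both groups.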

4.5.2 Choosing Rank k

Parameter k is related to the complexity of the model. A small k value corresponds to a simple model that is more likely to generalize well but that might not capture the entire information included in the race results matrix. A larger value of k requires more data. Choosing k = 1 would mean that the model assumes that a race time only depends on one parameter per race (that can be interpreted as its length, reflecting a combination of its distance and difficulty) and one parameter per athlete (that would represent his fitness level). Choosing k > 1 allows a multivariate representation of what makes races faster or slower and a multivariate representation of athlete abilities. For instance, if vector elements can be mapped to route characteristics, the model could express the race difficulty and athlete abilities in terms of endurance, ascent or ground surface type (smaller values corresponding to better runners because the athlete parameter acts as a multiplying factor to compute race times).

4.6 Validation Process

To quantify the ability to determine the U and V vectors that approximate the known race results, 1 percent of the known matrix entries are removed from the initial set (prior to matrix factorization) and then compared to the same entries in the approximated time matrix $\hat{T} = U \cdot V^\top$. The process is repeated 100 times to reduce the variance of the estimated accuracy.
The data requirements that are discussed above must apply to the matrix that is used to train the models. Therefore, the requirements need to be evaluated after removing the test samples. The number of races per athlete being the strongest requirement, the test samples are all taken from different athletes.
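One way to draw such a test set is sketched below: test entries come from distinct athletes, and only from athletes who keep enough races for training. The function name, its parameters and the toy data are assumptions; the requirement on the number of athletes per race is not checked in this sketch.

```python
import random

def split_holdout(observed, fraction, min_train, seed=0):
    """Hold out about `fraction` of the entries, at most one per athlete."""
    rng = random.Random(seed)
    by_athlete = {}
    for (a, r) in observed:
        by_athlete.setdefault(a, []).append(r)
    # Only athletes who keep at least `min_train` races after the removal.
    eligible = sorted(a for a, races in by_athlete.items() if len(races) > min_train)
    n_test = min(max(1, round(fraction * len(observed))), len(eligible))
    test = {}
    for a in rng.sample(eligible, n_test):
        r = rng.choice(sorted(by_athlete[a]))
        test[(a, r)] = observed[(a, r)]
    train = {key: t for key, t in observed.items() if key not in test}
    return train, test

# Toy data: 50 athletes who each ran the same 5 races (times in minutes).
observed = {(a, r): 60.0 + a + r for a in range(50) for r in range(5)}
train, test = split_holdout(observed, fraction=0.04, min_train=4)
```

Repeating the split with different seeds and averaging the resulting test errors corresponds to the repetition described above.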

4.7 Results

The data that is described in Section 2.2 is used to quantify the accuracy of predictions made from the race and athlete vectors. In the next sub-sections, we first look at how much of the data can be used considering the requirements expressed in Section 4.5. Then, we quantify the prediction accuracy for several settings, varying the rank k and the initial transformation of the race results: the low-rank approximation is performed on the race results matrix that is expressed as times, average speeds or average paces.

4.7.1 Data

Fitting the k parameters of each athlete requires that they each have more than k races. In the same way, races must have more than k athletes. In addition, it is required that, at most, one race performance per athlete is kept aside for testing. Results are presented with a minimum number of races equal to 5 (thus at least 4 for fitting athlete models). All athletes presenting fewer than 5 races are simply removed from the database. Together with the community constraints, the original data is thus reduced to 1382 runners and 307 races. The results that are presented below are very similar to those obtained with more or with fewer races per athlete, as long as we keep at least k + 2 races per athlete.
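The filtering step can be sketched as follows. Because removing an athlete can leave a race with too few athletes (and vice versa), this sketch repeats the filter until nothing changes; that iteration is an assumption of the example, as are the function name and the toy data.

```python
def filter_results(observed, min_races, min_athletes):
    """Drop athletes and races with too few results until the set is stable."""
    kept = dict(observed)
    while True:
        races_per_athlete, athletes_per_race = {}, {}
        for (a, r) in kept:
            races_per_athlete[a] = races_per_athlete.get(a, 0) + 1
            athletes_per_race[r] = athletes_per_race.get(r, 0) + 1
        filtered = {(a, r): t for (a, r), t in kept.items()
                    if races_per_athlete[a] >= min_races
                    and athletes_per_race[r] >= min_athletes}
        if len(filtered) == len(kept):
            return filtered
        kept = filtered

# Race "C" has a single athlete; removing it leaves athlete 2 with one race.
observed = {(0, "A"): 50.0, (0, "B"): 100.0, (1, "A"): 55.0,
            (1, "B"): 111.0, (2, "B"): 120.0, (2, "C"): 30.0}
kept = filter_results(observed, min_races=2, min_athletes=2)
```

In this toy example the removal cascades: race C is dropped first, then athlete 2, leaving the four results of athletes 0 and 1.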

4.7.2 Optimization

Using the alternating least squares algorithm, race and athlete parameters are updated iteratively; each iteration improves the fit. Each iteration is characterized by the training error (the error on the samples that are used during the optimization) and the test error (the error on the race performances that were kept aside). Note that the average training error is almost always smaller than the average testing error. Most of the time, they both decrease until they reach a plateau but, in some cases, the test error comes to a minimum and then rises (see Figure 4.3). The latter results from overtraining (similar to overfitting discussed in Section 1.2): the model gets closer to the observed points but the generalization gets worse. As the rise was found to be very limited and rare, no early stopping strategy was implemented.

Figure 4.3: Possible evolution of the error with iterations of the alternating least squares algorithm.

4.7.3 Model Selection

Although the fitting accuracy improves with higher rank on the training samples, the generalization error that is computed on test times is the best for rank equal to one. The relative errors obtained on test samples are illustrated in Figure 4.4. The median of the absolute error on the race times is around 3.5% for rank equal to one, slightly better than the results obtained by simple fitting of the two-parameter power law (presented in the previous chapter), which is around 7.5%.

Figure 4.4: Prediction errors with different ranks. The power law is given for reference.

4.8 Discussion

4.8.1 Race Distances and Rank-One Approximation

The simple model of the rank-one approximation assumes a race time can be predicted by multiplying the athlete parameter ua,1 by the race parameter vr,1: $t_{a,r} \mathrel{\hat{=}} u_{a,1} \cdot v_{r,1}$. This means that the race parameter is proportional to its recorded times. Intuition tells us that the race parameter should be related to the race distance. To illustrate this idea, we can plot the race parameter versus the actual length of the race, which is easy to obtain (provided by the organizers or computed from runner activity recordings). This is done in Figure 4.5.
We can see that the race parameters initialized at random tend to correlate with the race distance after several ALS iterations, although the trend tends to bend upward as races get longer. This supports the power law model ($t \propto d^\beta$) with β > 1 (see Section 3.1.1). The orange curve is drawn with β = 1.13 such that it passes through the fastest races (the lowest points) of the validity domain of the power law (see Section 3.2.1). If the race distance were the only parameter that affects race times, all blue dots would be aligned because the ALS algorithm compensates for disparities in attendance levels. The reason why some races seem slower (higher in the figure) than what their distance suggests is that other race parameters are actually playing a role in their runnability.

Figure 4.5: Race coefficients obtained with the rank-one approximation computed with the ALS algorithm. After several iterations, race coefficients tend to correlate with race distances. The orange curve $t = c \cdot d^{1.15}$ that passes through the fastest races supports the power law.

4.8.2 More on Communities

The problem of isolated communities can be illustrated in the light of the observation that race coefficients tend to align as $c \cdot d_r^\beta$. The intuition is that if races are not connected through shared athletes, there is no reason for the coefficient c to be the same (see Section 4.5.1). We can show this with three experiments conducted on two communities that were found in our database. Hereafter, these two communities will be referred to as the roosters and the lions. The first experiment shows that isolated communities do not share the c coefficient. The second shows that if communities are connected, they tend to have the same coefficient c. The last experiment shows that few athletes are required for bridging the communities.

Experiment 1: Isolated Communities

1. Remove the race results of some athletes in such a way that some isolated communities appear (see top of Figure 4.6).
2. Initialize race coefficients differently in the communities.
3. Run the alternating least squares algorithm to optimize the race coefficients.
4. Observe the relationship between race coefficients and race distances (see bottom of Figure 4.6).

Figure 4.6: Experiment 1. Top: Two isolated communities. Bottom-left: Race coefficients initialized differently in the two communities. Three lion races are left in the rooster races to show that the algorithm automatically aligns them with their community. Bottom-right: Race coefficients after ALS optimization are aligned within communities.
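The effect observed in Experiment 1 can be reproduced on synthetic data. In the sketch below, all races follow the same law (time = level · d^β with β = 1.1), but the two communities share no athlete and their race coefficients are initialized differently. The race and athlete names, the levels and the initial values are made up for the illustration.

```python
def als_rank_one(observed, v_init, iterations=20):
    """Rank-one ALS with athletes and races identified by name."""
    v = dict(v_init)
    u = {}
    for _ in range(iterations):
        for a in {a for a, _r in observed}:
            pairs = [(v[r], t) for (a2, r), t in observed.items() if a2 == a]
            u[a] = sum(vr * t for vr, t in pairs) / sum(vr * vr for vr, _t in pairs)
        for r in v:
            pairs = [(u[a], t) for (a, r2), t in observed.items() if r2 == r]
            v[r] = sum(ua * t for ua, t in pairs) / sum(ua * ua for ua, _t in pairs)
    return u, v

beta = 1.1
rooster_races = {"rooster_5k": 5000.0, "rooster_10k": 10000.0, "rooster_half": 21097.5}
lion_races = {"lion_10k": 10000.0, "lion_marathon": 42195.0}
rooster_levels = {"r1": 0.30, "r2": 0.36}
lion_levels = {"l1": 0.28, "l2": 0.40}

# Each athlete runs every race of their own community and none of the other.
observed = {}
for levels, races in ((rooster_levels, rooster_races), (lion_levels, lion_races)):
    for a, level in levels.items():
        for r, d in races.items():
            observed[(a, r)] = level * d ** beta

v_init = {r: 1.0 for r in rooster_races}
v_init.update({r: 5.0 for r in lion_races})      # different initialization
u, v = als_rank_one(observed, v_init)

distances = {**rooster_races, **lion_races}
c = {r: v[r] / distances[r] ** beta for r in v}  # coefficient c of each race
```

After optimization, the coefficient c is the same for all races of a community but differs between the two communities, even though the data were generated with a single law.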

Experiment 2: Connected Communities

1. Take the same races but, this time, leave one athlete (at random) who attended races in both communities (see top of Figure 4.7).
2. Initialize race coefficients differently in the communities.
3. Run the alternating least squares algorithm to optimize the race coefficients.
4. Observe the relationship between race coefficients and race distances (see bottom of Figure 4.7).

Figure 4.7: Experiment 2. Top: The two communities are now connected thanks to one athlete that went to four lion races and one rooster race. Bottom-left: Race coefficients initialized differently in the two communities. Bottom-right: Race coefficients after ALS optimization are all aligned.

The first two experiments show that race coefficients tend to align within communities of connected races but not between isolated communities. Theoretically, one bridging athlete could be sufficient: the optimization criterion expressed by Equation (4.7) forces the coefficients to reflect all observed race results. In practice, the bridging athlete could perform better in one of the communities by chance, introducing a bias between the communities. If more athletes connect the communities, we assume that the differences are averaged out. The third experiment shows how the communities converge with the number of bridging athletes.

Experiment 3: Connections Between Races

1. Take the same races but leave Nb athletes (at random) who attended races in both communities.
2. Initialize race coefficients differently in the communities.
3. Run the alternating least squares algorithm to optimize the race coefficients.
4. Store the ratio between c coefficients in the two communities: $c_{rooster} / c_{lion}$.
5. Repeat 10 times. Each sample of Nb bridging athletes is taken without replacement (so that we have 10 independent samples).
6. Repeat from 1 for several values of Nb (here 1, 2, 5, 10, 20, 50, 100, 200, 500).
7. Observe the variability of $c_{rooster} / c_{lion}$ with respect to Nb.

Figure 4.8: Experiment 3: Convergence of the race coefficient between the communities with respect to the number of bridging athletes.

The third experiment shows that the race coefficients tend to scale in the same way as the number of bridging athletes increases. For more than 20 bridging athletes, the scalings differ by less than 10 % between communities in 96% of the cases (see Figure 4.8). In practice, the data that is used does not present communities that share fewer than 100 athletes. Note that the figure shows results for more than 50 bridging athletes. In those cases, the 10 samples of Nb bridging athletes are sampled with replacement because there were at most 1000 bridging athletes.
In our example, we see that the lion races have a slightly lower coefficient c than the rooster races (5%). This means that the lion races tend to be run faster. The difference can be explained by the fact that the lion races are all very flat compared to most of the rooster races. This is made clear in Appendix A, where race locations are given with distances and cumulative elevation gains (the sum of every gain in elevation throughout the entire race). Other reasons could come into play, like the casual atmosphere of certain types of races (typically runs that are only present among the rooster races). Anyway, if a difference between the two communities is observed (we see that it converges to about 105%), we can assume it is relevant for race time prediction.

4.9 Conclusion

This chapter shows that races and athletes can be characterized by parameters computed on a large set of race results. The characterization that offers the best prediction accuracy is the one that assigns one parameter per athlete and one parameter per race. The athlete parameters then reveal the athletes' levels, as they express their relative race times. The race parameter is related to the average race duration and thus to the race length and other factors.

Athlete and race parameters can only be used within the same community of connected races. In the case of isolated communities, parameters obtained for athletes of one community cannot be used to predict race times with the race coefficients computed in another community, unless we can ensure that the parameters were scaled in the same way.

For ranks higher than one, the model starts to overfit the training examples and the generalization error deteriorates. A possible explanation is that the variance in observed race results is too high; the model would require more races per athlete to generalize well.

We can compare the results obtained in this chapter with the simple fitting of a power law discussed in Chapter 3. At first sight, the power law seems to be a more appropriate model because it has two parameters related to two important physiological characteristics of an athlete (base level and endurance), but the simple fitting of its parameters leads to less accurate predictions than the rank 1 decomposition (see Figure 4.4). This can be attributed to two main factors. The first is that the power law bases its prediction on the race distance, which is not the only race parameter that impacts race results. In contrast, the low-rank approximation selects one parameter per race that can account for any factor that affects race times. The next chapter (Chapter 5) comes in response to this: it proposes a method that uses the power law but allows race distances to be adapted according to observed race results through the notion of equivalent distance. The chapter then relates the equivalent distances to objective race properties. The second factor is that, as for the decompositions of rank higher than one, the fitting of a power law might be too flexible for the high variability of the observed race results. Guiding the fitting of the power law to account for what is expected a priori is the focus of Chapter 6.

Chapter 5

Race Equivalent Distances and Race Features

The previous chapter demonstrated that race results can be characterized efficiently using one parameter per race. In this chapter, we slightly modify the low-rank approximation in such a way that the race parameter can be interpreted as an equivalent distance of the race. We then show how this equivalent distance can be related to other race features and quantify their quality as predictors of race times. The content of this chapter largely overlaps with the ideas already published in [57].

5.1 Introduction

Race distances alone are not sufficient to evaluate athlete race times. For instance, an athlete who runs a marathon distance in 3h30 could be a well-prepared recreational runner if the race is the Berlin Marathon (known to be particularly fast), or a world-class champion if it is the Pikes Peak Marathon (known to be one of the world's toughest marathons). Elevation gradient, weather conditions, altitude, vegetation, uneven ground and ground firmness all affect athlete speed. A flat equivalent distance that reflects all race characteristics is useful in many ways: athletes can prepare for races considering realistic race lengths, and athletes can be ranked even if they did not attend the same races.

This chapter describes a methodology that assigns flat equivalent distances to routes based, first, on race times (similarly to the previous chapter) and then on race elevation profiles. The flat equivalent distance is defined as the distance that would be run, on average, in the same time if the ground was flat, all other above-mentioned conditions being ideal.

The first part of this chapter (Section 5.2) builds upon the low-rank approximation presented in Chapter 4 and the performance model that was selected in Chapter 3: the low-rank approximation model is slightly modified to correspond to the power law expressed by Equation 3.6. The low-rank approximation then outputs one parameter per race, corresponding to its length (in terms of flat equivalent distance), and two parameters per athlete, corresponding to a reference time ($t_a^{10K}$) and an endurance coefficient ($\beta_a$).

The second part of the chapter (Section 5.3) describes how the obtained equivalent distances can be related to race features. We show this with race elevation profiles. The equivalent distance of a race can then be approximated knowing its elevation profile.

5.2 Equivalent Distances From Race Results

5.2.1 Equivalent Distance

In the previous chapter, we demonstrated that race times could be approximated using only one parameter per race. In this chapter, we change the model ($t_{a,r} = r_r \cdot a_a$) in favor of a physiologically sound model in which the race parameter is the distance: the power law that was already discussed in Section 3.1.1. As we would like the race parameter to adapt to race results, we use the notion of equivalent distance $d_r^{eq}$ instead of the actual distance. The power law can be reformulated using the notion of equivalent distance as

$$t_{a,r} \approx t_a^{10K} \left(\frac{d_r^{eq}}{10^4}\right)^{\beta_a}, \tag{5.1}$$

or equivalently,

$$\ln(t_{a,r}) \approx \ln(t_a^{10K}) + \beta_a \ln\left(\frac{d_r^{eq}}{10^4}\right), \tag{5.2}$$

with $t_a^{10K}$ and $\beta_a$ being two athlete parameters. The equivalent distance reflects all parameters that could affect race times. We see in the next subsection that equivalent distances are scaled in such a way that the equivalent distance corresponds to the actual distance for the fastest flat races.

Definition. The equivalent distance $d^{eq}$ of a race is the distance that would be run in the same time if the ground was flat and all conditions ideal.
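As a minimal sketch of how Equation (5.1) turns an equivalent distance into a predicted race time, the function below evaluates the power law directly. The function name, the athlete values and the equivalent distance of the hilly race are hypothetical and only serve the illustration.

```python
def predict_race_time(t_10k, beta, d_eq):
    """Equation (5.1): t = t_10k * (d_eq / 1e4) ** beta.

    t_10k : reference time on a flat 10 km (seconds)
    beta  : endurance coefficient of the athlete
    d_eq  : flat equivalent distance of the race (meters)
    """
    return t_10k * (d_eq / 1e4) ** beta

# Hypothetical athlete: 50 min on 10 km, beta = 1.08.
t_10k, beta = 50 * 60.0, 1.08
flat = predict_race_time(t_10k, beta, 21097.5)
# Same actual length, but an assumed equivalent distance of 23 km.
hilly = predict_race_time(t_10k, beta, 23000.0)
print(round(flat), round(hilly))
```

With the same athlete parameters, a larger equivalent distance yields a longer predicted time, which is how the race parameter absorbs the difficulty of the route.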

5.2.2 Obtaining the Equivalent Distances

Athlete parameters ($t_a^{10K}$, $\beta_a$) and race equivalent distances ($d^{eq}$) are unknown, but race times must reflect them. As the number of race results is high, athlete parameters and race equivalent distances can be set so that the average squared deviation from Equation (5.1) is minimized on observed race results, just as was done in the previous chapter with a linear model. Note that the method is very similar to that of the previous chapter because the model is still linear in the log domain (see Equation 5.2). Noting $LT$ the matrix of the log times $\ln(t_{a,r})$, with one row per athlete $a$ and one column per race $r$, the model can be written as

$$\widehat{LT} = U V^\top \approx LT, \tag{5.3}$$

with $U$ the matrix whose rows are the athlete vectors $u_a^\top = (\ln(t_a^{10K}), \beta_a)$ and $V$ the matrix whose rows are the race vectors $v_r^\top$. The parameters are obtained by solving

$$\arg\min_{U,V} \sum_{(a,r)\in\Omega} \left(\ln(t_{a,r}) - u_a^\top v_r\right)^2 \tag{5.4}$$

with $u_a = [\ln(t_a^{10K}), \beta_a]^\top$ and $v_r = [1, \ln\frac{d_r^{eq}}{10^4}]^\top$. This problem is solved using the same alternating least squares algorithm (see Algorithm 3) as for the low-rank approximation, except that the initial matrix is $(\ln t_{a,r})$ instead of $(t_{a,r})$ and that the rank is two, with the first element of the race vectors fixed to 1.

As discussed for the low-rank approximation, the optimal solution has one degree of freedom because athlete parameters $t_a^{10K}$ are allowed to freely compensate race parameters $d_r^{eq}$. Therefore, without an additional constraint, the race parameters $d_r^{eq}$ may not have the same scale as actual distances.

The extra constraint can be expressed such that the fastest flat races are assigned an equivalent distance that is close to their actual distance:

$$d_r^{eq} \approx d_r. \tag{5.5}$$

For a set $\Omega_f$ of selected fast races (such as the Berlin Marathon, which is known to be one of the fastest marathons), the constraint is met, on average, if:

$$\sum_{r\in\Omega_f} d_r^{eq} = \sum_{r\in\Omega_f} d_r. \tag{5.6}$$

This constraint is used to scale the equivalent distances obtained by solving the optimization problem so that they can be expressed in meters, like the actual distances. We choose to work in two steps: first solving the optimization problem described by Equation 5.4; second, scaling the U and V matrices according to Equation 5.6. We could also constrain the optimization problem such that the solutions are directly provided with the right scaling. Such a constraint could be, for instance, that all equivalent distances are greater than or equal to the actual distances. This was not done here because our races are not all provided with trusted actual distances.
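The second step can be sketched as follows. This is only one possible way to apply Equation (5.6), under the assumption that all equivalent distances are multiplied by a common factor k and that each athlete's log reference time absorbs the shift so that predictions are unchanged. All names and numerical values are hypothetical.

```python
import math

def log_time(ln_t10k, beta, d_eq):
    """Equation (5.2): predicted log race time."""
    return ln_t10k + beta * math.log(d_eq / 1e4)

def rescale(d_eq, ln_t10k, beta, d_actual, fast_races):
    """Rescale so that sum(d_eq) == sum(d_actual) over the fast races (Eq. 5.6).

    d_eq, d_actual : dict race -> distance (meters)
    ln_t10k, beta  : dict athlete -> parameter
    fast_races     : race keys forming the set Omega_f
    """
    fast_races = list(fast_races)
    k = sum(d_actual[r] for r in fast_races) / sum(d_eq[r] for r in fast_races)
    d_scaled = {r: k * d for r, d in d_eq.items()}
    # Multiplying every distance by k adds beta * ln(k) to the predicted
    # log-time, so ln(t10K) is shifted by the opposite amount.
    ln_t10k_scaled = {a: ln_t10k[a] - beta[a] * math.log(k) for a in ln_t10k}
    return d_scaled, ln_t10k_scaled

# Hypothetical unscaled solution of the optimization problem.
d_eq = {"berlin": 40000.0, "trail": 30000.0}
d_actual = {"berlin": 42195.0}
ln_t10k = {"ann": math.log(3000.0)}
beta = {"ann": 1.08}

d_scaled, ln_t10k_scaled = rescale(d_eq, ln_t10k, beta, d_actual, ["berlin"])
before = log_time(ln_t10k["ann"], beta["ann"], d_eq["trail"])
after = log_time(ln_t10k_scaled["ann"], beta["ann"], d_scaled["trail"])
print(d_scaled["berlin"], abs(before - after))
```

The predicted log-times are identical before and after rescaling, which is the degree of freedom discussed above; only the units of the race parameter change.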

5.2.3 Data Requirements

The minimum requirements on the data used to solve the optimization problem of Equation 5.4 are the same as for the low-rank approximation of the previous chapter (see Section 4.5). We thus work with the same set of races.

5.3 Equivalent Distance From Elevation Profile

The previous section showed how we can obtain equivalent distances for a set of races. Equivalent distances are computed for races that belong to a large set of connected races (races that share athletes). One may also want to compute the equivalent distance of a new route directly from its features. This is useful to prepare for a new race for which the equivalent distance was not computed, or to evaluate the performances of an athlete on any route.

In this section, we reuse the equivalent distances obtained from race results and relate them to race features. In the scope of this thesis, the equivalent distances are only put in relation with elevation profiles and actual distances.

The problem of establishing race equivalent distances from the elevation profile is already discussed in the literature. Two approaches have been taken so far to achieve this goal: through metabolic measurements or through statistics on race results. In the first approach, flat equivalent distances are computed as the distance that would lead to the same energy expenditure on flat ground. The relationship between energy expenditure and gradient of ascent is established by [42] by measuring athletes' oxygen uptake on an inclined treadmill. This first approach is well suited to assessing workload or planning a weight reduction program, but might not be accurate for race time prediction because it does not target it specifically. The second approach infers a relationship between race average speeds and their elevation profile [34, 52, 53]. This chapter follows the second approach and addresses two problems that previous works do not take into account.

The first is that when comparing race results, one needs to compensate for the attendee level disparities between competitions. This is done by assigning an equivalent distance using the low-rank approximation discussed in the previous section.

The second problem is that previous attempts describe race elevation profiles by one of two global metrics: either the cumulative elevation gain (the sum of all positive vertical displacements along the route) [53] or the average elevation gradient of non-loop races that present a relatively constant gradient [34]. Such a constant gradient is rarely observed in practice. In this chapter, the full elevation profile is considered; this allows a more realistic relationship to be extracted.

Having equivalent distance approximations for a set of known races, we can look at how the equivalent distances relate to the elevation profiles of the races. Instantaneous elevation gradients for each point on the race route are computed from the elevation profile as described in Section 2.3.5. Let the function f(g) be the distance correction that needs to be applied to a given distance d presenting an elevation gradient g expressed in [m/m]:

$$\hat{d}^{eq} = d \cdot f(g). \tag{5.7}$$

If the considered route presents a varying gradient, it can be split into 1-meter sub-sections x, each presenting a gradient g(x). The equivalent distance of the whole route is then the sum of the contributions of each meter:

$$\hat{d}^{eq} = \sum_{x=1}^{d} f(g(x)). \tag{5.8}$$

Function f(g) can take various forms that can be approximated by a polynomial or a piecewise linear model. Section 5.3.1 presents different models found in the literature. From the optimization problem expressed by Equation (5.4) in the previous section (Section 5.2), we obtain estimates of the equivalent distance $\hat{d}^{eq}$ for 322 races. For 169 of them, we also have barometric altitude measurements that we can resample and differentiate to obtain the gradient of ascent g(x). Section 5.3.2 shows how the coefficients of function f(g) can be adjusted to reproduce the obtained equivalent distances $\hat{d}^{eq}$ from the slopes g(x).
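The per-meter summation of Equation (5.8) can be sketched in a few lines. The correction function used here is a hypothetical placeholder with made-up coefficients, not one of the fitted models of this chapter; the route is assumed to be already resampled to one gradient value per meter.

```python
def equivalent_distance(gradients, f):
    """Equation (5.8): sum the correction f(g(x)) over 1-meter sub-sections.

    gradients : sequence with one gradient [m/m] per meter of route
    f         : distance correction function, with f(0) == 1
    """
    return sum(f(g) for g in gradients)

def placeholder_correction(g):
    """Hypothetical correction 1 + 3 g + 20 g^2 (illustrative values only)."""
    return 1.0 + 3.0 * g + 20.0 * g * g

# 1 km flat, then 500 m at +4% and 500 m at -4%.
profile = [0.0] * 1000 + [0.04] * 500 + [-0.04] * 500
print(equivalent_distance(profile, placeholder_correction))
```

Any of the models of Section 5.3.1 can be passed as `f`; on a flat route the function returns the actual distance, as required by f(0) = 1.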

5.3.1 Models

Naismith-like Model

The first rule of thumb, called Naismith's rule, dates from 1892 and is relayed, among others, by [53]. Naismith's rule, formulated in terms of flat equivalent distances in the sense of our definition, can be expressed as

$$\frac{d^{eq}}{d} \mathrel{\hat{=}} f(g) = \begin{cases} 1, & \text{if } g < 0 \\ 1 + f_0 \cdot g, & \text{otherwise} \end{cases} \tag{5.9}$$

with the $f_0$ constant being evaluated to 7.92 [m/m] in [53]. For the complete route, the rule can be formulated using the cumulative elevation gain $d^+$ (the sum of all positive vertical displacements):

$$d^{eq} \mathrel{\hat{=}} d + f_0 \cdot d^+ \tag{5.10}$$

where

$$d^+ = \sum_{x=1}^{d} g^+(x) \tag{5.11}$$

and where

$$g^+(x) = \begin{cases} g(x), & \text{if } g(x) > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{5.12}$$

In other words, a total climb of 100 m during a route would have the same effect on the race time as an additional distance of about 800 m. In this formulation, negative gradients do not affect race times. This is obviously not true but, as the races that we use to re-compute $f_0$ always end where they start, the $f_0$ coefficient is assumed to also account for the speed increases on negative slopes. Therefore, the $f_0$ coefficient estimated on our race results does not generalize to races that present an elevation difference between the starting point and the arrival.
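The arithmetic behind this rule of thumb can be checked with a short sketch of Equations (5.10) to (5.12). The route below is hypothetical, the function names are illustrative, and the default coefficient is the literature value 7.92 quoted above.

```python
def cumulative_gain(gradients):
    """Equations (5.11)-(5.12): sum of the positive per-meter gradients."""
    return sum(g for g in gradients if g > 0)

def naismith_equivalent_distance(gradients, f0=7.92):
    """Equation (5.10): d_eq = d + f0 * d_plus, with one gradient per meter."""
    return len(gradients) + f0 * cumulative_gain(gradients)

# Hypothetical 10 km loop: 2 km at +5%, 2 km at -5%, 6 km flat.
profile = [0.05] * 2000 + [-0.05] * 2000 + [0.0] * 6000
print(cumulative_gain(profile), naismith_equivalent_distance(profile))
```

The 100 m of cumulative climb add roughly 792 m to the 10 km route, which matches the "about 800 m" stated in the text.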

Polynomial Models

Other papers present 4th- or 5th-order polynomial expressions that take the increased speed on negative gradients into account. [42] gives a relation that expresses the metabolic energy cost of running per unit of distance as a 5th-order polynomial of the elevation gradient. The corresponding definition of equivalent distance would be the distance that leads to the same energy expenditure on flat ground, which is not exactly the one that we use. Publication [34] gives a 4th-order polynomial that can be expressed with a definition of flat equivalent distance that matches ours.

5.3.2 Model Fitting

As stated earlier, model fitting sets the parameters of the unknown function f(.) in Equation (5.8) so that it best reproduces the equivalent distances established by solving the optimization problem (5.4) discussed in the previous section.

Naismith-like Model

For the Naismith-like model, fitting Equation (5.10) requires setting the parameter $f_0$. For each race $r$, we have the estimate $\hat{d}_r^{eq}$ of the equivalent distance and we can compute the cumulative elevation gain $d_r^+$ from $g_r(x)$ using Equation (5.11). $f_0$ can thus be chosen to minimize the squared error:

$$\arg\min_{f_0} \sum_r \left(\hat{d}_r^{eq} - (d_r + f_0 \cdot d_r^+)\right)^2 \tag{5.13}$$

which is a standard OLS regression.
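Because Equation (5.13) has a single unknown and no intercept, its minimizer has a simple closed form, sketched below. The race values are synthetic and hypothetical (generated with a known coefficient so that the fit can be checked); the function name is illustrative.

```python
def fit_f0(d_eq_hat, d, d_plus):
    """Least-squares f0 for Equation (5.13).

    With y_r = d_eq_hat_r - d_r and x_r = d_plus_r, the objective is
    sum_r (y_r - f0 * x_r)**2, minimized by f0 = sum(x*y) / sum(x*x).
    """
    num = sum(x * (e - a) for e, a, x in zip(d_eq_hat, d, d_plus))
    den = sum(x * x for x in d_plus)
    return num / den

# Synthetic races (hypothetical numbers) generated with f0 = 6.5.
d = [10000.0, 21097.5, 42195.0]
d_plus = [120.0, 300.0, 80.0]
d_eq_hat = [di + 6.5 * gi for di, gi in zip(d, d_plus)]
print(fit_f0(d_eq_hat, d, d_plus))
```

On noise-free synthetic data the function recovers the generating coefficient; on real equivalent distances it returns the least-squares compromise over all races.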

Polynomial Models

In previous works, function f was fitted using global route features, either by assuming a constant gradient or by taking the cumulative elevation gain. In our case, races present varying gradients. The general polynomial form can be expressed as

$$\frac{d^{eq}}{d} \mathrel{\hat{=}} f(g) = 1 + \sum_{i=1}^{P} f_i \cdot g^i \tag{5.14}$$

where P is the polynomial order. The independent term is set to 1 because f(g) = 1 for g = 0: the equivalent distance of a route on flat ground is the distance itself. The total equivalent distance (5.8) can then be written as the sum for each meter x along the route as

$$d^{eq} \mathrel{\hat{=}} \sum_{x=1}^{d} \left[1 + \sum_{i=1}^{P} f_i \cdot g^i(x)\right], \tag{5.15}$$

which can then be rearranged as

$$d^{eq} \mathrel{\hat{=}} d + \sum_{i=1}^{P} \left[f_i \sum_{x=1}^{d} g(x)^i\right], \tag{5.16}$$

which allows the inner sum to be pre-computed for each race route. The P model parameters $f_i$ can then be computed using a multiple linear regression.
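The rearrangement from Equation (5.15) to Equation (5.16) can be verified numerically with the sketch below. It uses the 4th-order coefficients listed in Table 5.1 as example values and a hypothetical two-slope route; the function names are illustrative.

```python
def power_sums(gradients, order):
    """Inner sums of Equation (5.16): S_i = sum_x g(x)**i for i = 1..order."""
    return [sum(g ** i for g in gradients) for i in range(1, order + 1)]

def d_eq_from_sums(d, sums, coeffs):
    """Equation (5.16): d_eq = d + sum_i f_i * S_i."""
    return d + sum(f * s for f, s in zip(coeffs, sums))

def d_eq_per_meter(gradients, coeffs):
    """Equation (5.15): sum over meters of 1 + sum_i f_i * g(x)**i."""
    return sum(
        1.0 + sum(f * g ** i for i, f in enumerate(coeffs, start=1))
        for g in gradients
    )

coeffs = [3.64, 17.8, -3.10, -23.8]        # 4th order values from Table 5.1
profile = [0.03] * 400 + [-0.02] * 600     # hypothetical route, 1 value per meter
sums = power_sums(profile, len(coeffs))
via_sums = d_eq_from_sums(len(profile), sums, coeffs)
via_meters = d_eq_per_meter(profile, coeffs)
print(via_sums, via_meters)
```

Stacking one row of power sums per race gives the design matrix of the multiple linear regression, with the difference between the equivalent and actual distances as the target.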

5.4 Validation Process

We saw several methodologies to assign equivalent distances to races; they are compared here as race time predictors. Indeed, the power law can be fitted to athletes with their times recorded on races for which we can compute equivalent distances. The prediction accuracy depends on how the equivalent distance was computed. The following equivalent distances are evaluated:

1. the actual distance of the race

2. the equivalent distance computed from race results (see Section 5.2, denoted “collaborative filtering”)

3. the Naismith-like formula with the coefficient fitted on our equivalent distances (denoted “Present Paper (Naismith-like)”)

4. the polynomial established by measuring the oxygen consumption on an inclined treadmill (denoted “Minetti 2002”)

5. the polynomial computed based on times obtained on nearly constant slope races (denoted “Kay 2012”)

6. the polynomial fitted on our equivalent distances (denoted “Present Paper (Polynomial)”)

In order to avoid bias, the prediction error is computed on test data that is used neither for athlete parameter fitting nor for computing the equivalent distances. The validation process is as follows:

1. Split the observed data into two sets: 99% for training the model Ωtrain and 1% for testing Ωtest (paying attention not to take two test samples on the same athlete).

2. Compute equivalent distances with the training set Ωtrain.

3. Compute athlete parameters for each athlete using Ωtrain with actual distances and the different equivalent distances.

4. Compute prediction errors on the test set Ωtest with actual and equivalent distances.

The process is repeated 100 times to observe the variability of the error estimate.
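Step 1 of the validation process can be sketched as follows. This is a minimal illustration that assumes race results are stored as (athlete, race, time) tuples; the function name, the record layout and the synthetic data are hypothetical.

```python
import random

def split_results(results, test_fraction=0.01, seed=0):
    """Split (athlete, race, time) records into train and test sets.

    At most one test record is drawn per athlete, as required by step 1.
    """
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * len(results)))
    order = list(range(len(results)))
    rng.shuffle(order)
    test_idx, seen = set(), set()
    for i in order:
        athlete = results[i][0]
        if athlete not in seen:
            seen.add(athlete)
            test_idx.add(i)
            if len(test_idx) == n_test:
                break
    train = [r for i, r in enumerate(results) if i not in test_idx]
    test = [r for i, r in enumerate(results) if i in test_idx]
    return train, test

# Hypothetical records: 50 athletes with 4 race results each.
results = [(a, r, 3000.0 + a + r) for a in range(50) for r in range(4)]
train, test = split_results(results)
print(len(train), len(test))
```

Repeating the call with 100 different seeds would give the 100 train/test splits over which the variability of the error estimate is observed.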

5.5 Results

The boxplot presented in Figure 5.1 shows the relative errors obtained when predicting athletes' race times with different equivalent distances computed from the elevation profile. The models expressed by Equations (5.9) and (5.14) are evaluated, both with literature coefficients and with coefficients fitted on our computed equivalent distances. The last boxplot (collaborative filtering) represents the errors obtained when directly using the equivalent distances computed from the race results. Table 5.1 shows the mean relative errors (MRE) and model parameters. Figure 5.2 shows Naismith's piecewise linear model with our coefficient, two polynomial models with their original coefficients and our best polynomial fit (5th order). We can note their similarity in the range that is displayed. As our elevation profiles rarely present gradients outside the range [−20%; 20%], this is also the range of validity of our model.

Figure 5.1: Boxplot representing relative error while predicting race times with different equivalent distances. Computing equivalent distances from race results through collaborative filtering gives the best results.

Figure 5.2: Distance correction based on the elevation gradient.

| Model | Equation # | Coefficients | MRE |
|---|---|---|---|
| Naismith-like (present paper) | (5.9) | f0 = 7.92 | 6.4 % |
| 5th order polynomial (Minetti 2002) | (5.14), P = 5 | {f1..5} = {5.42, 12.9, −12.0, −8.44, 43.2} | 6.3 % |
| Naismith-like (present paper) | (5.9) | f0 = 6.5 | 6.3 % |
| 4th order polynomial (Kay 2012) | (5.14), P = 4 | {f1..4} = {3.64, 17.8, −3.10, −23.8} | 6.1 % |
| 5th order polynomial (present paper) | (5.14), P = 5 | {f1..5} = {3.90, 220, 23.6, 6.36, −5.34} | 5.8 % |

Table 5.1: Mean relative errors when predicting race times with different equivalent distances. Actual distances give 7.5%. Collaborative filtering equivalent distances give 4%.

The equivalent distances computed from race results are provided in Appendix A.3.

5.6 Discussion

The power law that was selected in Chapter 3 describes runner performance as a function of the race length, although many other parameters impact race times. This chapter addresses this problem by defining the notion of equivalent distance as the distance that would be run in the same time if the ground was flat and all other conditions ideal. We showed that equivalent distances can be assigned to races based on the race times.

We obtained equivalent distances by applying a collaborative filtering technique. The technique was adapted to take advantage of the physiologically sound power law in order to evaluate race equivalent distances and athlete levels at the same time. By doing so, we addressed the problem of athlete level disparity between races. The obtained flat equivalent distances are used to build models that take race elevation profile data as inputs. Unlike previous works, we apply our distance correction model to routes with varying gradient.

Results show that the computed equivalent distances are more relevant than the actual distances as race time predictors. The best distance correction is the one that is computed on race results because it captures all parameters that affect race times (elevation gradient, weather conditions, ground firmness, etc.). Equivalent distances based on the route elevation profiles all give similar improvements. The polynomial expression that is obtained for distance adjustment can serve as-is to correct instantaneous speed on a race route. This could be used, for instance, to optimize race management by providing a gradient-adjusted target speed along the route.

Races considered in this thesis are all loop races: they end at the same place as they start. Therefore, flat equivalent formulas cannot be safely generalized to routes for which this is not the case. This is obvious for Naismith-like formulas: in appearance, the model only considers decreased speed in uphill sections; but actually, as it depends on race results, the model coefficient $f_0$ accounts for the fact that there are necessarily downhill sections where the speed is at least slightly increased. A route with only uphill sections would be slower than what is predicted by the Naismith-like formulas that are considered here.

The model of the equivalent distance is restricted here to elevation data. Further work could build more advanced models that predict race equivalent distances from additional race features such as weather conditions (heat, humidity, wind speed and direction, ...) or ground type obtained from online maps (OpenStreetMap provides APIs that can be used to gather track information that might be useful to evaluate runnability [40]).

The power law, even with equivalent distances, might still be too flexible for the high variability observed in race performances. In some cases, this leads to unrealistic models. For example, a runner could run a 10.1 km race with an average speed higher than their previous record on 10.0 km. This situation is not unlikely in reality, but the simple fitting of a power law would then give a β parameter smaller than one. In this situation, the produced model is very unlikely to generalize well to other distances. The next chapter focuses on restraining the span of possible athlete parameters to avoid this kind of problem.

Chapter 6

Race Time Prediction in a Probabilistic Setting

The fitting of the power law that describes athlete performance with respect to race length is made difficult by two factors: the high variability intrinsic to athlete performance and the lack of data on each athlete. The purpose of this chapter is to make use of the fact that, although we do not have much data on individuals, we have many individuals. This can help to restrain the space of possible models and thus make individual fitting easier. To do this, we formalize the problem of adjusting athlete parameters in a probabilistic setting. The estimated athlete parameters are maximum a posteriori probability (MAP) estimates. The content of this chapter extends ideas already published in [54].

6.1 Introduction

It is common, especially for recreational runners, to experience high variability in performances. For instance, they may attend races in a group, with minimum preparation, or even get hurt. The high variability in race outcomes, combined with the fact that few race times are available, can make the solution to a simple regression physiologically unrealistic; for instance, average speeds that increase with the total race distance, as for the blue curve in Figure 6.1 (b).

We illustrate in this chapter that regression can take into account some a priori assumptions on athlete parameters that represent our belief about the different possible values (and thus force the solution to look like the orange curve). These assumptions take the form of a probability distribution, called the prior, that is inferred from a large set of race results.

Figure 6.1: Problem illustration with two runner examples.

We formalize the problem of adjusting athlete parameters as a probability maximization: we select the athlete parameters that equal the mode of their posterior probability density function. The posterior distribution combines the prior and the observed race results: it is the conditional probability distribution of the athlete parameters given the observed race results. Athlete parameters that are adjusted using such a probability maximization are called maximum a posteriori probability (MAP) estimates.

In the following sections, the methodology is presented as follows: the problem of fitting the model to a specific athlete is expressed in a probabilistic setting (Section 6.2) for which closed form solutions (Section 6.4) are given. Probabilistic assumptions made on athletes are expressed as probability distributions whose parameters require tuning (Section 6.5). The concept of equivalent distance (Section 6.6), as a way to improve prediction accuracy, is then discussed. Finally, the validation procedure (Section 6.7) is described to quantify the accuracy of the methodology.

6.2 Probabilistic Setting

The problem that we would like to address in a probabilistic setting is again to adjust athlete parameters $u_a = (\ln t_a^{10K}, \beta_a)^\top$ in the performance model:

$$\ln t_{a,r} = \ln t_a^{10K} + \beta_a \ln\frac{d_r}{10^4} + \varepsilon_{a,r} \tag{6.1}$$

or in vector form,

$$\ln t_{a,r} = u_a^\top v_r + \varepsilon_{a,r} \tag{6.2}$$

where the race parameter is $v_r = (1, \ln\frac{d_r}{10^4})^\top$. Athlete performances are summarized for each race event $r$ by the race time $t_{a,r}$ and the race distance $d_r$. In the following, $\Omega_a$ is the set of race indexes for which the performance of athlete $a$ was observed; $data_a$ denotes the collection of race times of athlete $a$ associated with the corresponding race distances. The term $p(data_a)$ is a simplified notation¹ that must be understood as the probability density of observing what was observed: $p(data_a) = p(t_a|d_a)$, with $t_a$ and $d_a$ being, respectively, the times and the distances available for athlete $a$. Athlete parameters $u_a$ can be chosen so that they maximize $p(data_a|u_a)$. This is known as maximum likelihood (ML) estimation:

$$\begin{aligned} \hat{u}_a^{ML} &= \arg\max_{u_a} p(data_a|u_a) \\ &= \arg\max_{u_a} \prod_{r\in\Omega_a} p(t_{a,r}|u_a, d_r). \end{aligned} \tag{6.3}$$

All the randomness associated with Equation (6.3) is encapsulated in $\varepsilon_{a,r}$. The latter can be associated with a probability density function $f_\varepsilon$. It reflects how athletes deviate from $u_a^\top v_r$. For given parameters $u_a$, the model error can be computed as the difference between the observed log-time and the model output: $\varepsilon_{a,r} = \ln t_{a,r} - u_a^\top v_r$. Using the probability density function of the model errors $f_\varepsilon$, the equation gives:

$$\begin{aligned} \hat{u}_a^{ML} &= \arg\max_{u_a} \prod_{r\in\Omega_a} f_\varepsilon(\varepsilon_{a,r}) \\ &= \arg\max_{u_a} \prod_{r\in\Omega_a} f_\varepsilon(\ln(t_{a,r}) - u_a^\top v_r). \end{aligned} \tag{6.4}$$

If a normal distribution is assumed for $f_\varepsilon$, solving (6.4) is equivalent to solving a linear regression with the least squares error criterion [61] and is actually what we did in the previous chapters when fitting the power law to athletes.

¹ We also use the short notation of probability density function $p(x)$ referring to the probability density function of the random variable $X$ associated with the occurrence $x$, evaluated in $x$: $p(x) \triangleq f_X(x)$.
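The equivalence between ML estimation under normal errors and least squares can be made explicit in a few lines. The sketch below assumes independent errors with $f_\varepsilon = \mathcal{N}(0, \sigma_\varepsilon^2)$ and writes $n_a$ for the number of races in $\Omega_a$, two notations that the text introduces in Section 6.4. Taking the logarithm of the product in Equation (6.4) does not change the maximizer:

```latex
\begin{aligned}
\hat{u}_a^{ML}
 &= \arg\max_{u_a} \prod_{r \in \Omega_a}
    \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}
    \exp\!\left(-\frac{(\ln t_{a,r} - u_a^\top v_r)^2}{2\sigma_\varepsilon^2}\right) \\
 &= \arg\max_{u_a} \left[ -\frac{n_a}{2}\ln(2\pi\sigma_\varepsilon^2)
    - \frac{1}{2\sigma_\varepsilon^2}
      \sum_{r \in \Omega_a} (\ln t_{a,r} - u_a^\top v_r)^2 \right] \\
 &= \arg\min_{u_a} \sum_{r \in \Omega_a} (\ln t_{a,r} - u_a^\top v_r)^2 .
\end{aligned}
```

The first term of the second line and the factor $1/(2\sigma_\varepsilon^2)$ do not depend on $u_a$, so the ML estimate does not require knowing $\sigma_\varepsilon$.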

It is worth noting that the additive error term εa,r in Equation (6.1) becomes a multiplicative error term with a transformed probability density function in the time-domain:

$$t_{a,r} = t_a^{10K} \left(\frac{d_r}{10^4}\right)^{\beta_a} e^{\varepsilon_{a,r}}. \tag{6.5}$$

The significance of a normal distribution for ε in the log-domain is not easy to grasp. In the time-domain, it corresponds to a multiplicative error term that follows a log-normal distribution (see Figure 6.2). The log-normal distribution has the desired property of being skewed. For instance, it does not allow race times to drop below zero.

As mentioned in the introduction and illustrated in Figure 6.1(b), observed performances can make the standard regression (ML regression) nonsensical. This issue is addressed by including a prior assumption on athletes, taking the form of a probability distribution $f_u(u_a)$ on athlete parameters $u_a$. This probability distribution is intended to express our belief about the different possible values that the parameters can take. For instance, athlete average race speed must decrease with the total race distance (β > 1), and most athletes are unlikely to outperform world records.

Bayes' theorem can be used to reformulate our regression problem so as to integrate the prior probability distribution $f_u(u)$ on athlete parameters. Athlete parameters $u_a$ can then be selected so that their probability is maximized given what is observed. This is known as maximum a posteriori (MAP) estimation:

$$\begin{aligned} \hat{u}_a^{MAP} &= \arg\max_{u_a} p(u_a|data_a) \\ &= \arg\max_{u_a} \frac{p(data_a|u_a) \cdot f_u(u_a)}{p(data_a)} \\ &= \arg\max_{u_a} f_u(u_a) \prod_{r\in\Omega_a} f_\varepsilon(\ln t_{a,r} - u_a^\top v_r) \end{aligned} \tag{6.6}$$

that is, the maximization of the product of the prior distribution and the likelihood (the denominator $p(data_a)$ does not depend on $u_a$). Equation (6.6) can be solved for each athlete to obtain their parameters provided that $f_u$ and $f_\varepsilon$ are given. In practice, these distributions are selected using a methodology that is described in Section 6.5. Assuming that both can be modeled as normal distributions, a closed form solution derived in [61] can be expressed (see Section 6.4).

Figure 6.2: Shape example of a log-normal distribution.

6.3 The Bayesian Approach

The MAP solution provides athlete parameters $\hat{u}_a$ that equal the mode of their posterior distribution $p(u_a|data_a)$. The athlete parameters can then be used to predict race times for any given distance. Although the Bayesian approach gives the same prediction as the MAP approach, it goes further because it quantifies the uncertainty about the variable of interest: for a new race of distance $d_{new}$ (with $v_{new} = (1, \ln\frac{d_{new}}{10^4})^\top$), the probability density function of the race time $t_{a,new}$ is

$$p(t_{a,new}|data_a, v_{new}) = \int p(t_{a,new}|u_a, v_{new})\, p(u_a|data_a)\, du_a, \tag{6.7}$$

with

$$p(t_{a,new}|u_a, v_{new}) = f_\varepsilon(\ln(t_{a,new}) - u_a^\top v_{new}), \tag{6.8}$$

and

$$\begin{aligned} p(u_a|data_a) &= \frac{p(data_a|u_a) \cdot f_u(u_a)}{p(data_a)} \\ &= \frac{\prod_{r\in\Omega_a} f_\varepsilon(\ln t_{a,r} - u_a^\top v_r) \cdot f_u(u_a)}{p(data_a)} \end{aligned} \tag{6.9}$$

and finally,

$$p(data_a) = \int p(data_a|u_a)\, p(u_a)\, du_a. \tag{6.10}$$

The difficulty of the Bayesian approach resides in the fact that, in general, the evaluation of the integral cannot be performed analytically. Therefore, we need to rely on approximate methods such as Monte Carlo techniques, which are computationally intensive and require a large volume of data on individual athletes. In this work, we have limited ourselves to expressing solutions without addressing the associated uncertainty.

6.4 ML and MAP Solutions

This section provides the closed form solutions for Equation (6.4) and Equa- tion (6.6) which are, respectively, the ML estimate and the MAP estimate. The ML solution is given here for comparison although the equation is used in previous chapters: it corresponds to the standard OLS solution. Solutions are expressed using the following definitions:

• $n_a$ is the number of race times available to fit the model for a given athlete $a$: $n_a$ is equal to the size of $\Omega_a$.

• $lt_a$ is a vector of size $[n_a \times 1]$ containing the logarithm of the race times.

• $V_a$ is the matrix of size $[n_a \times 2]$ containing the race vectors in the same order as the corresponding race times in $lt_a$:

$$V_a = \begin{pmatrix} 1 & \ln d_1 \\ 1 & \ln d_2 \\ \vdots & \vdots \\ 1 & \ln d_{n_a} \end{pmatrix}$$

• The probability distribution function $f_\varepsilon$ of the error term is assumed to be centered and normally distributed: $f_\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$.

• The covariance matrix of the model errors, assuming independence of the errors, is given by $\Sigma_\varepsilon = \sigma_\varepsilon^2 I_{n_a}$.

• $f_u$ is the prior distribution on athlete parameters, assumed to be a multivariate normal distribution, $f_u \sim \mathcal{N}(\mu_u, \Sigma_u)$, with

  - $\mu_u$, the a priori expected values of the two athlete parameters:

$$\mu_u = \begin{pmatrix} \mu_{lt10K} \\ \mu_\beta \end{pmatrix}$$

  - $\Sigma_u$, the covariance matrix of the prior:

$$\Sigma_u = \begin{pmatrix} \sigma_{lt10K}^2 & \rho_u \sigma_{lt10K} \sigma_\beta \\ \rho_u \sigma_{lt10K} \sigma_\beta & \sigma_\beta^2 \end{pmatrix}$$

    with $\rho_u$ being the correlation between the two athlete parameters.

The ML estimate

$$\hat{u}_a^{ML} = (V_a^{\top} V_a)^{-1} V_a^{\top} lt_a \tag{6.11}$$

requires no distribution parameters and is shown here for comparison with the MAP estimate, which is given by (see [61])

$$\hat{u}_a^{MAP} = \mu_u + (\Sigma_u^{-1} + V_a^{\top} \Sigma_\varepsilon^{-1} V_a)^{-1} V_a^{\top} \Sigma_\varepsilon^{-1} (lt_a - V_a \mu_u) \tag{6.12}$$

and requires six distribution parameters ($\sigma_\varepsilon$, $\rho_u$, $\sigma_{lt10K}$, $\sigma_\beta$, $\mu_{lt10K}$ and $\mu_\beta$) that reflect probabilistic assumptions on athlete performances. Note that, if athlete parameters were centered ($\mu_u = (0, 0)^{\top}$), independent and with identical variance $\sigma_u^2$ ($\Sigma_u = \begin{pmatrix} \sigma_u^2 & 0 \\ 0 & \sigma_u^2 \end{pmatrix}$), the MAP solution would be the same as for a ridge regression:

$$\hat{u}_a^{Ridge} = (\lambda I_2 + V_a^{\top} V_a)^{-1} V_a^{\top} lt_a \tag{6.13}$$

with $\lambda = \sigma_\varepsilon^2 / \sigma_u^2$. The ridge regression is also the solution of a linear regression for which the cost function includes a term that penalizes the amplitude of athlete parameters:

$$\hat{u}_a^{Ridge} = \arg\min_{u_a} \frac{1}{n_a} \sum_{r \in \Omega_a} (\ln t_{a,r} - u_a^{\top} v_r)^2 + \lambda (u_a^{\top} u_a) \tag{6.14}$$

with $\lambda$ being the regularization parameter that weights the two parts of the cost function. Obviously, our two athlete parameters ($u_a$) cannot be assumed to be centered on zero and to share the same variance, but a change of variable is enough to make it the case. We can perform the following change of variable:

$$\begin{cases} y_{a,r} = \dfrac{\ln t_{a,r} - \mu_{lt10K} - \mu_\beta \ln \frac{d_r}{10^4}}{\sigma_{lt10K}} \\[2ex] x_r = \dfrac{\sigma_\beta}{\sigma_{lt10K}} \ln \dfrac{d_r}{10^4}, \end{cases} \tag{6.15}$$

in which $y_{a,r}$ is the new performance metric and $x_r$ is the new race feature. The performance model then becomes (with two new athlete parameters ($u^*_{a,0}$, $u^*_{a,1}$) and a new residual term $\varepsilon^*_{a,r}$):

$$y_{a,r} = u^*_{a,0} + u^*_{a,1} x_r + \varepsilon^*_{a,r}. \tag{6.16}$$

The two new athlete parameters are:

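$$\begin{cases} u^*_{a,0} = \dfrac{\ln t_a^{10K} - \mu_{lt10K}}{\sigma_{lt10K}} \\[2ex] u^*_{a,1} = \dfrac{\beta_a - \mu_\beta}{\sigma_\beta} \end{cases} \tag{6.17}$$

being centered with unit variance (assuming that the correlation between the two athlete parameters is null):

$$\begin{cases} \mu_{u^*} = (0, 0)^{\top} \\ \Sigma_{u^*} = I_2 \\ \lambda = \sigma_{\varepsilon^*}^2 \end{cases} \tag{6.18}$$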

This shows that, except that our MAP estimate also considers a possible correlation between the athlete parameters, a simple change of variable makes the solution that we propose (6.12) identical to that of a ridge regression (6.13). Note that solving the ridge regression also requires all the distribution parameters (except the correlation) because they are contained in the value of $\lambda$ as well as in the change of variable. The change of variable can be seen as a pre-processing of the inputs. In the case of a ridge regression with pre-processed inputs, the pre-processing and the value of $\lambda$ are commonly treated as model hyperparameters: they are chosen to maximize the prediction accuracy on test samples (see Section 1.2). The equivalence shown above between the ridge regression and the MAP solution is intended to justify treating the distribution parameters as simple hyperparameters of the model, as proposed in Section 6.5.2.
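The two estimates of Equations (6.11) and (6.12) can be sketched in a few lines. The code below is only an illustration, not the implementation used in this work: the function names and the example athlete are hypothetical, race vectors are taken as $v_r = (1, \ln(d_r/10^4))$, errors are assumed independent ($\Sigma_\varepsilon = \sigma_\varepsilon^2 I$), and the prior values are those listed in Table 6.1.

```python
import math

def _solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("singular system: races must cover at least two distances")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def _design(races):
    """Race vectors v_r = (1, ln(d_r / 10^4)) and log race times."""
    rows = [(1.0, math.log(d / 1e4)) for d, _ in races]
    log_times = [math.log(t) for _, t in races]
    return rows, log_times

def ml_estimate(races):
    """Equation (6.11): ordinary least squares on (distance_m, time_s) pairs."""
    rows, lt = _design(races)
    vtv = [[sum(r[i] * r[j] for r in rows) for j in range(2)] for i in range(2)]
    vty = [sum(r[i] * y for r, y in zip(rows, lt)) for i in range(2)]
    return _solve_2x2(vtv, vty)

def map_estimate(races, mu, sigma_u, sigma_eps):
    """Equation (6.12), assuming Sigma_eps = sigma_eps^2 * I."""
    rows, lt = _design(races)
    # Residuals with respect to the prior mean curve: lt_a - V_a mu_u.
    resid = [y - (mu[0] * r[0] + mu[1] * r[1]) for r, y in zip(rows, lt)]
    # Prior precision Sigma_u^{-1} (explicit 2x2 inverse).
    det_u = sigma_u[0][0] * sigma_u[1][1] - sigma_u[0][1] * sigma_u[1][0]
    prec = [[sigma_u[1][1] / det_u, -sigma_u[0][1] / det_u],
            [-sigma_u[1][0] / det_u, sigma_u[0][0] / det_u]]
    w = 1.0 / sigma_eps ** 2
    a = [[prec[i][j] + w * sum(r[i] * r[j] for r in rows) for j in range(2)]
         for i in range(2)]
    b = [w * sum(r[i] * e for r, e in zip(rows, resid)) for i in range(2)]
    delta = _solve_2x2(a, b)
    return [mu[0] + delta[0], mu[1] + delta[1]]

# Prior values from Table 6.1 (rho_u = 0, so Sigma_u is diagonal).
MU = [8.0, 1.15]
SIGMA_U = [[0.8 ** 2, 0.0], [0.0, 0.03 ** 2]]
single_race = [(10000.0, 3000.0)]  # hypothetical athlete: one 10 km race in 50 min
print(map_estimate(single_race, MU, SIGMA_U, sigma_eps=0.02))
```

With a single race, `ml_estimate` raises an error because the system is singular, whereas `map_estimate` still returns a value: the endurance parameter stays at the prior mean and only the 10-kilometer level moves towards the observation.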

6.5 Distribution Parameters Tuning

The race time for any distance $d_r$ can be predicted from athlete parameters ($t_a^{10K}$ and $\beta_a$) using the power law that is reproduced here:

$$\ln t_{a,r} \mathrel{\hat{=}} \ln t_a^{10K} + \beta_a \ln \frac{d_r}{10^4} \tag{6.19}$$

The MAP estimation of athlete parameters based on their past race records is given by Equation (6.12). The latter contains six distribution parameters that reflect the assumptions that "pull" the solution towards the most probable region. The purpose of this section is to show how distribution parameters can be chosen using the large database of race results that is at our disposal.

Selecting the six distribution parameters can be performed following two distinct approaches: by computing statistics on race times or by validation (choosing the distribution parameters that lead to the best prediction accuracy on test samples). The two approaches are described in Sections 6.5.1 and 6.5.2.
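As a small illustration of the power law of Equation (6.19), the sketch below predicts race times at several distances. The function name is hypothetical; the athlete is the population-average runner of Table 6.1 ($\ln t^{10K} = 8.0$, $\beta = 1.15$), and distances are in meters.

```python
import math

def predict_time(t10k_s, beta, distance_m):
    """Race time (s) from Eq. (6.19): ln t = ln t10K + beta * ln(d / 10^4)."""
    return math.exp(math.log(t10k_s) + beta * math.log(distance_m / 1e4))

# Population-average runner: ln t10K = 8.0 and beta = 1.15 (Table 6.1).
t10k = math.exp(8.0)
for name, d in [("5 km", 5000.0), ("half marathon", 21097.5), ("marathon", 42195.0)]:
    t = predict_time(t10k, 1.15, d)
    print(f"{name}: {t / 60:.1f} min")
```

Because $\beta > 1$, the predicted time grows faster than the distance: doubling the distance more than doubles the time.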

6.5.1 Adjusting Distribution Parameters on Race Times

Knowing race results for a large population, distribution parameters can be adjusted by computing statistics on the data. The most obvious parameter to estimate is $\mu_{lt10K}$. It corresponds to the expected value of the logarithm of time on a 10-kilometer race. Selecting the set $\Omega_{10K}$ of 10-kilometer race times in our database, we have a collection of times $(t_{a,r})_{(a,r) \in \Omega_{10K}}$ from which we can take the average of the logarithm and write:

$$\mu_{lt10K} \mathrel{\hat{=}} \overline{(\ln t_{a,r})}_{(a,r) \in \Omega_{10K}}. \tag{6.20}$$

From the same set of races, we can also calculate the sample variance $s_{lt10K}^2$. The observation of race times depends on two random variables: one reflecting the variability between athletes ($\ln t^{10K}$) and the other one reflecting how athletes deviate from their model ($\varepsilon$). The observed variance can thus be considered as an estimate of the sum of the two variances:

$$s_{lt10K}^2 \mathrel{\hat{=}} \sigma_{lt10K}^2 + \sigma_\varepsilon^2 \tag{6.21}$$

From this sample variance, it is not possible to isolate the inter-athlete variance $\sigma_{lt10K}^2$ from the intra-athlete variance $\sigma_\varepsilon^2$.

In the initial database, we can also extract data on athletes who ran both a 10-kilometer race and a marathon. From Equation (6.19), for a given runner $a$, the observation of their two times ($t_{a,mar}$ and $t_{a,10K}$) can be turned into observations $b_a$ of the random variable $\beta$ as

$$b_a = \frac{\ln t_{a,mar} + \varepsilon_{a,mar} - \ln t_{a,10K} - \varepsilon_{a,10K}}{\ln 4.2} \tag{6.22}$$

with $\varepsilon_{a,mar}$ and $\varepsilon_{a,10K}$ resulting from the intra-athlete variability. The average $\bar{b}$ of the observations is an estimate of the expected runner endurance parameter:

$$\mu_\beta \mathrel{\hat{=}} \bar{b}. \tag{6.23}$$

The variance $s_\beta^2$ of these observations of $\beta$ values is affected by the variability between athletes ($\sigma_\beta^2$) and by the variability that is associated with the 10-kilometer time and the marathon time ($\sigma_\varepsilon^2$):

$$s_\beta^2 \mathrel{\hat{=}} \sigma_\beta^2 + \frac{2 \sigma_\varepsilon^2}{\ln 4.2} \tag{6.24}$$

Again, without knowing the value of $\sigma_\varepsilon^2$, this is not sufficient to evaluate the variability of the endurance parameter $\beta$ among runners. The same athletes that are used to compute the difference between 10-kilometer times and marathon times provide observations of pairs of athlete parameters ($\ln t_a^{10K}$, $\beta_a$). If we assume that the noise on these observations is independent, we can write that the covariance between the two athlete parameters is estimated by the sample covariance computed on the observed pairs:

$$\rho_u \sigma_{lt10K} \sigma_\beta \mathrel{\hat{=}} \mathrm{cov}(\ln t_a^{10K}, \beta_a) \tag{6.25}$$
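The statistics of Equations (6.20) to (6.25) can be computed directly from race times. The sketch below is illustrative only: the function name, the input layout (a list of 10-kilometer times and a list of per-athlete 10-kilometer and marathon times) and the returned dictionary are assumptions. The returned variances are the raw sample statistics; separating them from $\sigma_\varepsilon^2$ still requires the steps discussed in the text.

```python
import math
from statistics import mean, variance

def prior_statistics(times_10k, paired_times):
    """Sample statistics behind Equations (6.20) to (6.25).

    times_10k: 10 km race times in seconds (at least two values).
    paired_times: (t_10k, t_marathon) in seconds, one pair per athlete.
    """
    log_t = [math.log(t) for t in times_10k]
    mu_lt10k = mean(log_t)       # Eq. (6.20)
    s2_lt10k = variance(log_t)   # Eq. (6.21): inter- plus intra-athlete variance

    # Eq. (6.22): one endurance observation per athlete (marathon = 4.2 x 10 km).
    ratio = math.log(4.2)
    b = [(math.log(t_mar) - math.log(t_10)) / ratio for t_10, t_mar in paired_times]
    mu_beta = mean(b)            # Eq. (6.23)
    s2_beta = variance(b)        # inflated by race-to-race noise, see Eq. (6.24)

    # Eq. (6.25): sample covariance between ln t10K and the endurance observations.
    lt_pairs = [math.log(t_10) for t_10, _ in paired_times]
    m_lt = mean(lt_pairs)
    cov = sum((x - m_lt) * (y - mu_beta) for x, y in zip(lt_pairs, b)) / (len(b) - 1)

    return {"mu_lt10k": mu_lt10k, "s2_lt10k": s2_lt10k,
            "mu_beta": mu_beta, "s2_beta": s2_beta, "cov": cov}
```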

Using Equations (6.20) to (6.25), we can then estimate the distribution parameters if the race-to-race variability $\sigma_\varepsilon$ of the athletes is known. The latter corresponds to the residuals that can be observed after the fitting of each athlete. However, the fitting depends on our estimate of $\sigma_\varepsilon$ in the first place. One of the possibilities to estimate all distribution parameters is to proceed iteratively as follows:

1. Compute parameters $u_a$ for each athlete using the ML Equation (6.11).
2. Compute the global variance of the residuals $s_\varepsilon^2$ to approximate $\sigma_\varepsilon^2$.
3. Compute prior distribution parameters with Equations (6.20) to (6.25).
4. Compute athlete parameters $u_a$ using the MAP Equation (6.12).
5. Repeat from step 2 until convergence is reached.

Another possibility is to take $\sigma_\varepsilon$ as a parameter that can be tuned by validation:

1. For several values of $\sigma_\varepsilon$, repeat steps 2 to 4.
2. Compute distribution parameters with Equations (6.20) to (6.25).
3. Compute athlete parameters using the MAP Equation (6.12).
4. Evaluate the generalization error on test race times that were kept aside before the fitting.
5. Keep the value of $\sigma_\varepsilon$ that gives the best prediction accuracy.

6.5.2 Tuning Distribution Parameters Through Validation

Estimates of distribution parameters obtained from Equations (6.20) to (6.25) are valid under the following assumptions.

• Residuals are normally distributed and centered on zero in the log domain: $\varepsilon_{a,r} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$.

• Residuals are independent of athlete parameters, and consequently: $\mathrm{cov}(\varepsilon_{a,r}, \beta_a) = \mathrm{cov}(\varepsilon_{a,r}, \ln t_a^{10K}) = 0$.

• Athlete parameters are jointly normally distributed.

• Residuals are independent for a specific athlete: $\Sigma_\varepsilon = \sigma_\varepsilon^2 I_N$.

• We have enough observations: the sample mean, sample variance and sample covariance are close to the population statistics.

These assumptions seem rather strong. As we do not require a full Bayesian treatment that would allow us to generate posterior distributions, we do not necessarily require that our distributions match the real distributions. Therefore, it makes sense to see the problem as a regularized linear regression: the solution will lie in between an a priori curve and the best fit of the data points. The cursor between the two is set by the three variances: $\sigma_\varepsilon^2$ (how confident we are about the best fit), $\sigma_\beta^2$ (how confident we are about the fact that $\mu_\beta$ applies to all athletes) and $\sigma_{lt10K}^2$ (how confident we are about the fact that $\mu_{lt10K}$ applies to all athletes). In other words, $\mu_\beta$ and $\mu_{lt10K}$ must be regarded as the mean values that are expected for all athletes; the variances $\sigma_\beta^2$ and $\sigma_{lt10K}^2$ must be regarded as the inter-athlete variability.

If the problem is seen as a regularized linear regression, our six parameters can be adjusted by simple validation: models are fitted for a large set of combinations of distribution parameters and we keep the combination that gives the best accuracy on test race times that were not used for the fitting.

In practice, we can take an approach that combines statistics on race times and validation: it is relatively safe to assume that the estimates of $\mu_{lt10K}$, $\mu_\beta$ and $\mathrm{cov}(\ln t_a^{10K}, \beta_a)$ are unbiased. The three variance parameters ($\sigma_\varepsilon^2$, $\sigma_\beta^2$ and $\sigma_{lt10K}^2$) that are difficult to estimate can be adjusted by validation (see Section 6.7).

6.6 Equivalent Distances

Up to this point, only the race distance was considered in the performance model, despite the fact that other factors impact athlete race times, as discussed in previous chapters. In the scope of this chapter, the accuracy of the prediction would most likely also benefit from the notion of equivalent distance introduced in the previous chapter. We could proceed similarly with an alternating least squares algorithm: optimizing the athlete parameters and the equivalent distances of the races alternately until convergence is reached. However, this would become computationally intensive because updates in equivalent distances might require updates in the distribution parameters, which in turn might require updating the equivalent distances.

Although this chapter improves the accuracy of athlete fitting by including the prior distribution, we can rely on the equivalent distances that were computed in the previous chapter using Equation (5.4) because there are many attendees per race. The equivalent distances that are used are provided in Appendix A.3.

6.7 Validation Process

As in previous chapters, the accuracy of the methodology is evaluated by validation. The same process is used to adjust our distribution parameters. 1000 athletes are randomly sampled from the set of athletes having at least 5 race results. For each of them, we keep one race time aside at random for validation. Then we adjust the two athlete parameters using the ML and MAP formulas on N of their other race times to predict the validation time. We make N vary from 1 to 4 to evaluate the evolution of the prediction accuracy with the number of observed race times.
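The hold-out procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and data layout are hypothetical, the error metric is taken to be the mean of the absolute relative errors, and `fixed_beta_fit` is only a placeholder standing in for the ML or MAP fit of Section 6.4.

```python
import math
import random

def mean_relative_error(athletes, fit, n_fit, seed=0):
    """Hold one race out per athlete, fit on n_fit others, predict the held-out time.

    athletes: dict mapping an athlete id to a list of (distance_m, time_s),
              each list holding more than n_fit races.
    fit: callable taking a list of races and returning (ln_t10k, beta).
    """
    rng = random.Random(seed)
    errors = []
    for races in athletes.values():
        races = list(races)
        rng.shuffle(races)  # random choice of the validation race
        held_out, training = races[0], races[1:1 + n_fit]
        ln_t10k, beta = fit(training)
        d, t = held_out
        predicted = math.exp(ln_t10k + beta * math.log(d / 1e4))
        errors.append(abs(predicted - t) / t)
    return sum(errors) / len(errors)

def fixed_beta_fit(races, beta=1.15):
    """Placeholder fit: keep beta fixed and pass the curve through the first race."""
    d, t = races[0]
    return math.log(t) - beta * math.log(d / 1e4), beta
```

Calling `mean_relative_error` for `n_fit` from 1 to 4, once with each fitting function, gives the kind of comparison that is reported in Section 6.8.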

6.8 Results

6.8.1 Probability Distribution Parameters

Parameters of the distribution that give the lowest prediction error are presented in Table 6.1. These distribution parameters are used hereafter to compute the athlete performance model with Equation (6.12). The two means and the correlation were obtained with the statistics on race times and the three variances by validation. We then checked whether finer tuning of the means and correlation could improve the prediction accuracy; it did not.

Parameter Name        Value    Units
$\mu_{lt10K}$         8.0      [ln s]
$\mu_\beta$           1.15     [ln s / ln m]
$\sigma_{lt10K}$      0.8      [ln s]
$\sigma_\beta$        0.03     [ln s / ln m]
$\sigma_\varepsilon$  0.02     [ln s]
$\rho_u$              0.0

Table 6.1: Values of the distribution parameters that give the best accuracy using the MAP fitting. The expected value of log-time on a 10km race is 8. It corresponds to about 50 minutes or 12 km/h.

6.8.2 Model Accuracy

Figure 6.3: Relative errors at predicting race times with model fitted on 1 to 4 races using ML and MAP expressions.

Figure 6.3 shows the boxplot of the errors obtained with the ML and MAP fitting and for several numbers of races per athlete. Note that the ML approach presents outliers that are far outside the displayed range of errors. Indeed, the ML regression occasionally gives extremely high errors (thousands of %) in cases such as the one illustrated in Figure 6.4. Compared to standard ML regression, the MAP approach and the use of equivalent distances both provide significant improvements in the accuracy of the race time predictions. Using the equivalent distances computed in the previous chapter improves the accuracy by about 2%, which gives an error of about 5.5% (MRE).

Figure 6.4: Model fitting with ML and MAP estimates. ML estimate shows an extremely high error on the test sample.

The superiority of the MAP approach is clearer when the number of races is small (from 1 to 4). The MAP advantage over the ML approach vanishes when more than 4 race results are available. The MAP approach allows reliable predictions to be built even when only one race result is available. This is not the case for the ML approach, which needs at least two race results.

6.9 Discussion

Race results of recreational runners are so scarce and unstable that a simple regression is likely to overfit unless we provide some information that conditions the solution. In this chapter, we show how the knowledge that we can obtain on many individuals can compensate for the lack of data on each of them. Indeed, a large set of race results is used to infer parameters of probability distributions that represent what is more likely to be observed (10-kilometer race times and endurance parameters). This knowledge can then serve to ease the fitting of runner parameters.

The prediction errors that are reported might seem high (5.5%) but they should be considered in context: most of the race times belong to recreational runners with performances that are highly variable. It is impossible to obtain a variance of error below the intrinsic variability of athlete performances. The latter variability could be reduced by discarding the worst performances of each runner, but then it would be difficult to give a fair evaluation of the prediction accuracy: the criterion that is used to remove data must encompass the notion of what the "worst performances" are (and thus contain a model similar to the model under evaluation). Hence, the more data we remove, the better the prediction accuracy would be.

The main findings of this chapter are that the MAP approach is superior to the standard regression approach and that the notion of equivalent distance significantly helps to predict race times. The MAP approach opens perspectives for building predictions based on race results and some athlete data (age, sex, weight, weekly kilometers, experience, etc.). Indeed, as the prior assumption on athletes takes the form of a probability distribution, the latter can be conditioned on athlete variables. These results are of interest especially for web companies that provide running guidance inferred from massive volumes of runner data. Such companies own the necessary data to estimate distribution parameters conditioned on athlete variables.

We do not quantify the uncertainty around the predictions. This could be done in further work by taking the full Bayesian approach.

Chapter 7

Cardiac Parameter Identification as Fitness Assessment

This chapter presents a simple cardiac model whose parameters can be adjusted on free activities, revealing runners' aerobic fitness level. This provides a means to monitor a runner's level at each of their runs. The content of this chapter extends ideas already published in [55].

7.1 Introduction

The regular evaluation of the fitness level provides key information for workout planning but also for evaluating training efficiency. The standard measure of the endurance fitness level is the $\dot{V}O_2max$. It corresponds to the maximal oxygen consumption during exercise and reflects cardio-respiratory fitness and endurance capacity in sports such as running, cycling and swimming. Currently, it is evaluated by exercise protocols performed in specialized laboratories [15]. The heart rate is known to be directly linked to the exercise intensity; the intuition behind this chapter is that heart rates recorded at a well-characterized sub-maximal level of exertion provide sufficient information to estimate maximal performances.


The validity of the proposed heart rate model will be assessed by simulation with cyclists’ data for which we have instant power output and heart rate measurements. The same procedure is used with runners’ activities. For the latter, the exerted power needs to be estimated as it is not directly measured.

7.2 Materials and Methods

We propose a heart rate model that describes the relationship between the exerted power $p(t)$ and the heart rate $hr(t)$ of an athlete. As illustrated in Figure 7.1, the model contains parameters that are athlete-specific. The following subsections are organized as follows. The model is described (7.2.1). The parameters of the model are specific to each athlete and we discuss how they are related to expected race performances (7.2.2). Then we propose some methods to obtain the cardiac parameters of an athlete (7.2.3) by analyzing their regular workout sessions.

Figure 7.1: Parametric heart rate model

7.2.1 Heart Rate Model

Instant Power

In the following, the power $p(t)$ refers to the metabolic power (the power that corresponds to an estimated oxygen intake) for runners or the power output (the mechanical power exerted on a bike transmission) for cyclists, as discussed in Section 2.3.6.

Steady state

The steady-state heart rate $hr_{ss}$ refers to the heart rate that is reached after stabilization at constant power $p(t)$. The relationship between steady-state heart rate and power is athlete-dependent and is known to be very close to linear as long as the heart rate is below its maximum value, called the maximum heart rate $hr_{max}$ [2]. Higher power is achievable for short periods of time but the heart rate will not exceed $hr_{max}$. The three athlete-specific parameters describing the steady-state relationship are the resting heart rate $hr_{min}$ in beats/min [bpm], the maximum heart rate $hr_{max}$ in [bpm] and the slope coefficient $m_p$ in [bpm/watt], following the equation

$$hr_{ss}(p) = \begin{cases} hr_{min} + m_p \, p, & \text{if } hr_{min} + m_p \, p < hr_{max} \\ hr_{max}, & \text{otherwise.} \end{cases} \tag{7.1}$$

In the case of running exercises, the metabolic power is proportional to the speed [42]. The same relation thus remains valid with the speed instead of the power, except that the coefficient $m_p$ is replaced by $m_s$. The linear approximation is supported by the experiment that is illustrated in Figures 7.2 and 7.3: runners are instructed to run at constant speed for six minutes. Then, the speed is increased three times to reach 80% of the maximum heart rate. The runner is then asked to rest for another six-minute period. The test ends with a six-minute all-out effort during which the runner is simply asked to run as far as they can in six minutes (and manage their effort to keep the pace until the end). For each step, the average heart rate after stabilization is recorded. The test results (Figure 7.3) show that the steps at constant speed are aligned until the athlete reaches the maximum heart rate. From then on, they can get faster for a short period of time but their heart rate does not increase anymore.
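The steady-state relationship of Equation (7.1) amounts to a linear function clipped at $hr_{max}$. A minimal sketch follows; the function name and the athlete values in the example are hypothetical and only serve to illustrate the saturation.

```python
def steady_state_hr(power, hr_min, hr_max, m_p):
    """Equation (7.1): linear in power, saturating at hr_max."""
    return min(hr_min + m_p * power, hr_max)

# Hypothetical cyclist: hr_min = 60 bpm, hr_max = 190 bpm, m_p = 0.4 bpm/W.
for p in (0.0, 200.0, 400.0):
    print(p, steady_state_hr(p, hr_min=60.0, hr_max=190.0, m_p=0.4))
```

For running, the same function applies with the speed in place of the power and $m_s$ in place of $m_p$.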

Transient Response

It appears from the same test session that a step upward in speed leads to a new steady-state heart rate that is reached after a few seconds. The heart rate curve shows an exponential-looking response to power steps, which can be observed at 18:00 when the runner stops and at 24:00 when they restart. This suggests that the phenomenon can be roughly described by a first-order differential equation:

$$\tau \frac{d\,hr(t)}{dt} + hr(t) = hr_{ss}(p(t)) \tag{7.2}$$

with $\tau$ being an athlete-specific time constant. As shown in the results section, this modelling allows for accurate simulations of the heart rate.

Figure 7.2: Five-step test: recording

Figure 7.3: Five-step test. The dots represent the test stages. 1, 2, 3: increasing speed instructions, 5: all-out effort. The lines represent the theoretical model.

In a time frame $[t_0, t]$ where the power output $p$ is constant, the solution of this equation is given by

$$hr(t) = hr(t_0) + \left(hr_{ss}(p) - hr(t_0)\right) \left(1 - e^{-\frac{t - t_0}{\tau}}\right). \tag{7.3}$$

In the discrete time domain, if the time-step is small enough, an iterative form given by

$$hr(t+1) = hr(t) + \frac{1}{\tau} \left(hr_{ss}(p(t)) - hr(t)\right) \tag{7.4}$$

can be used. In practice, the time constant $\tau$ being a few tens of seconds, this equation can be used with a time-step of one second. As there is no reason to assume equality between rise time and fall time, we can introduce two different notations: $\tau_+$ (rise time) and $\tau_-$ (fall time). The equation is thus allowed to differ for increasing and decreasing heart rates and becomes

$$hr(t+1) = \begin{cases} hr(t) + \dfrac{1}{\tau_+} \left(hr_{ss}(p(t)) - hr(t)\right), & \text{if } hr_{ss}(p(t)) \geq hr(t) \\[2ex] hr(t) + \dfrac{1}{\tau_-} \left(hr_{ss}(p(t)) - hr(t)\right), & \text{if } hr_{ss}(p(t)) < hr(t). \end{cases} \tag{7.5}$$

The heart rate transient response is thus captured by two athlete-specific parameters which are τ+ and τ−.
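The discrete-time update of Equation (7.5) can be sketched as a short simulation loop with a one-second time-step. The function name and the athlete values are hypothetical; the time constants of 24 s and 30 s are the average values reported in Section 7.3, and cardiovascular drift is not included here.

```python
def simulate_heart_rate(power, hr0, hr_min, hr_max, m_p, tau_up, tau_down):
    """Simulate hr(t) second by second with Equation (7.5).

    power: one power sample per second.
    Returns len(power) + 1 heart rate samples, starting with hr0.
    """
    hr = [hr0]
    for p in power:
        target = min(hr_min + m_p * p, hr_max)  # steady state, Equation (7.1)
        tau = tau_up if target >= hr[-1] else tau_down
        hr.append(hr[-1] + (target - hr[-1]) / tau)
    return hr

# Hypothetical session: 10 min at 200 W followed by 10 min of rest.
trace = simulate_heart_rate([200.0] * 600 + [0.0] * 600, hr0=60.0,
                            hr_min=60.0, hr_max=190.0, m_p=0.4,
                            tau_up=24.0, tau_down=30.0)
print(round(trace[60], 1), round(trace[600], 1), round(trace[-1], 1))
```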

Cardiovascular Drift

Intra-session workload results in fatigue that induces increased heart rate for the same power output [2,7]. The increase is assumed to be proportional to the energy expenditure from the beginning of the activity. The power can then be replaced in the above equations by

$$p_{hr}(t) = p(t) + k_f \int_{t_0}^{t} p(t') \, dt' \tag{7.6}$$

with $k_f$ being the athlete's sensitivity to fatigue and $t_0$ being the start time of the exercise. In discrete time, the power is replaced by

$$p_{hr}(t) = p(t) + k_f \sum_{t'=t_0}^{t} p(t') \, \Delta t. \tag{7.7}$$

The kf parameter reflects the endurance of the runner that includes the effect of their training and their running economy (related to their running technique but also to anthropometric measures such as size and weight). This intuitive formulation might not be completely accurate but proves to help the heart rate model to better fit activity measurements.
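The drift term of Equation (7.7) is a running sum of the energy spent since the start of the activity. A minimal sketch, with a hypothetical function name and a $k_f$ value chosen inside the range reported in Section 7.3:

```python
from itertools import accumulate

def drift_adjusted_power(power, k_f, dt=1.0):
    """Equation (7.7): p_hr(t) = p(t) + k_f * sum of p(t') * dt from t0 to t."""
    energy = accumulate(p * dt for p in power)  # cumulative energy in joules
    return [p + k_f * e for p, e in zip(power, energy)]

# One hour at a constant 200 W with k_f = 3e-5 W/J (hypothetical athlete).
p_hr = drift_adjusted_power([200.0] * 3600, k_f=3e-5)
print(p_hr[0], p_hr[-1])
```

After one hour at 200 W the cumulative energy is 720 000 J, so the drift term adds $3 \cdot 10^{-5} \times 720\,000 = 21.6$ W to the power seen by the heart rate model.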

7.2.2 Performance Prediction

Our heart rate model contains six parameters hrmin, hrmax, m, kf , τ− and τ+. Although hrmin, hrmax and τ− are associated with cardiac risk [3, 10, 32], taken separately, none of these parameters is currently believed to be associated with race performance.

Conversely, if we combine $hr_{min}$, $hr_{max}$ and $m_p$ (or $m_s$), Expression (7.1) can be used to obtain the power $p_{max}$ at $hr_{max}$ (or the maximum speed $s_{max}$ at $hr_{max}$) by

$$p_{max} = (hr_{max} - hr_{min}) / m_p \tag{7.8}$$

or,

$$s_{max} = (hr_{max} - hr_{min}) / m_s. \tag{7.9}$$

As the $\dot{V}O_2$ reserve (defined as $\frac{\dot{V}O_2 - \dot{V}O_{2min}}{\dot{V}O_{2max} - \dot{V}O_{2min}}$) coincides with the heart rate reserve (defined as $\frac{hr - hr_{min}}{hr_{max} - hr_{min}}$) [38], the speed at $hr_{max}$ is equivalent to the speed at maximal oxygen uptake. This value reflects the short to middle distance ability of a runner. Moreover, the speed at maximal oxygen uptake directly reflects the runner's maximal oxygen uptake $\dot{V}O_{2max}$ [37], which is a widely used indicator of cardio-respiratory fitness strongly related to life expectancy [50].
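As a worked illustration of Equation (7.9), consider a hypothetical runner (these are not measurements from this study) with $hr_{min} = 60$ bpm, $hr_{max} = 190$ bpm and a slope $m_s = 10$ bpm/(km/h), a value that lies within the range of slopes reported in Section 7.3:

$$s_{max} = \frac{hr_{max} - hr_{min}}{m_s} = \frac{190 - 60}{10} = 13 \text{ km/h}$$

Under the steady-state model, this runner's heart rate would saturate at a speed of about 13 km/h. With the same heart rates, a fitter runner with a lower slope of $m_s = 8$ bpm/(km/h) would reach $130 / 8 = 16.25$ km/h.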

7.2.3 Adjusting the Cardiac Parameters on an Activity

It is very common for runners to record all their runs with their smartphone or sports watch. Most of the time, their activities are automatically uploaded to a web platform (such as Nike+, Runkeeper, Strava, Garmin Connect, Endomondo, TrainingPeaks, Formyfit, ...). These uploads might contain GPS positions, from which we can derive an estimation of the exerted power $p(t)$ (see Section 2.3.6), and heart rate recordings. This data can be used to identify cardiac parameters. Indeed, activity tracks can be re-sampled with a time-step of one second. From Equations (7.4) and (7.1), we can express the difference between two subsequent heart rates $hr(t)$ as

$$hr(t+1) - hr(t) = \frac{hr_{min}}{\tau} + \frac{m_p}{\tau} \, p_{hr}(t) - \frac{1}{\tau} \, hr(t) \tag{7.10}$$

which provides a linear equation system that can be solved as an ordinary least squares regression. As the rising time constant $\tau_+$ is not equal to the falling time constant $\tau_-$, we have two equation systems, one for increasing heart rate ($hr(t+1) - hr(t) \geq 0$) and one for decreasing heart rate ($hr(t+1) - hr(t) < 0$). The two systems must be constrained to force the same solutions for $m_p$, $hr_{min}$ and $k_f$.

In practice, obtaining all parameters at once induces some reproducibility problems: identified parameters of the same runner vary from one activity to the next. Moreover, the $k_f$ parameter that is hidden inside $p_{hr}(t)$ (Equation (7.7)) cannot be isolated from the $m_p$ parameter. We propose to improve the stability of the parameter estimation with a step-by-step procedure where parameters are estimated by order of importance. We suppose that $hr_{min}$ is known (this is a reasonable assumption, as will be seen in the next subsection).

1. At first, we take the approximation that $\tau_+ = \tau_- = \tau$ and neglect $k_f$: if $hr_{min}$ is known, Equation (7.10) can be written as

$$hr(t+1) - hr(t) = \frac{m_p}{\tau} \, p(t) + \frac{1}{\tau} \left(hr_{min} - hr(t)\right) \tag{7.11}$$

which can be solved to obtain $m_p$ and $\tau$.

2. Knowing $m_p$, $\tau$ and $hr_{min}$, we can rewrite Equation (7.11) with the $k_f$ coefficient as

$$hr(t+1) - hr(t) - \frac{m_p}{\tau} \, p(t) = k_f \, \frac{m_p}{\tau} \sum_{t'=t_0}^{t} p(t') + \frac{1}{\tau} \left(hr_{min} - hr(t)\right) \tag{7.12}$$

and solve it to obtain $k_f$.

3. Knowing $k_f$, we can replace $p(t)$ with $p_{hr}(t)$ and write two equations (one for rising hr and one for falling hr),

$$hr(t+1) - hr(t) = \frac{1}{\tau_+} \left(m_p \, p_{hr}(t) + hr_{min} - hr(t)\right) \tag{7.13}$$

$$hr(t+1) - hr(t) = \frac{1}{\tau_-} \left(m_p \, p_{hr}(t) + hr_{min} - hr(t)\right) \tag{7.14}$$

that give two systems of equations whose solutions give $\tau_+$ and $\tau_-$.

The procedure does not yield $hr_{max}$, which requires a special testing session or a large set of activities (see next section).
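The first step of the procedure (Equation (7.11)) can be sketched as a two-variable least squares fit. The code below is an illustration under stated assumptions, not the implementation used in this work: the function name is hypothetical, samples are one second apart, $hr_{min}$ is known, and the activity stays below $hr_{max}$ so that the linear model applies.

```python
def fit_slope_and_tau(power, hr, hr_min):
    """Least squares fit of Eq. (7.11): hr(t+1) - hr(t) = a*p(t) + b*(hr_min - hr(t)).

    a = m_p / tau and b = 1 / tau; power and hr are per-second samples of equal length.
    Returns (m_p, tau).
    """
    y = [hr[i + 1] - hr[i] for i in range(len(hr) - 1)]
    x1 = power[:len(y)]
    x2 = [hr_min - h for h in hr[:len(y)]]
    # Normal equations of the two-regressor problem (no intercept).
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    r1 = sum(u * w for u, w in zip(x1, y))
    r2 = sum(v * w for v, w in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("power and heart rate carry no independent information")
    a = (r1 * s22 - s12 * r2) / det
    b = (s11 * r2 - s12 * r1) / det
    tau = 1.0 / b
    return a * tau, tau
```

The fit relies on the transients: at steady state the two regressors are proportional to each other, so an activity made of a single constant effort carries little information about $\tau$.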

7.2.4 Other Means to Estimate Cardiac Parameters

The easiest parameter to gather is $hr_{min}$: all that is required from the runner is to lie down and relax for one or two minutes; $hr_{min}$ is the lowest heart rate observed during that period. If the runner uses their device to monitor their sleeping pattern, as is more and more the case, we can obtain their $hr_{min}$ at night even more easily, without requiring any action from the runner. Most training sessions include several constant-speed runs of fixed length or duration. These sessions can be used to obtain the two time constants $\tau_+$ and $\tau_-$ as well as the slope $m$ and the $k_f$ parameter.

The last cardiac parameter, $hr_{max}$, is the only one that requires a short all-out exercise, which might seem stressful and should generally be performed following safety guidelines. For adults who are beginning to run, it is even recommended to perform their first $hr_{max}$ test only in the presence of medical staff due to the risks associated with high heart rates. For regular runners for whom hundreds of activities are available, $hr_{max}$ can be estimated as the maximal heart rate ever recorded. Precautions need to be taken because some activities can present extremely high heart rate values at the beginning of an exercise (see Figure 7.4) or brief spikes that are not held for more than a few seconds. The maximal heart rate is thus the maximal value ever recorded after discarding these two cases: heart rates during the first 15 minutes and heart rates that are not maintained for more than 30 seconds.
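The two filtering rules above can be sketched as follows. This is one possible reading, labeled as an assumption: a heart rate is considered "maintained" when the recording stays at or above it for 30 consecutive one-second samples, which is computed as the largest minimum over all 30-second windows. The function names and default arguments are hypothetical.

```python
def sustained_max_hr(hr, skip_s=900, hold_s=30):
    """Highest heart rate held for hold_s consecutive seconds after the first skip_s.

    hr: per-second heart rate samples of one activity.
    Returns None when the activity is too short to provide an estimate.
    """
    usable = hr[skip_s:]  # discard the first 15 minutes
    if len(usable) < hold_s:
        return None
    # A level is sustained if every sample of some window reaches it.
    return max(min(usable[i:i + hold_s]) for i in range(len(usable) - hold_s + 1))

def estimate_hr_max(activities, skip_s=900, hold_s=30):
    """Maximum of the sustained heart rates over all recorded activities."""
    candidates = [sustained_max_hr(a, skip_s, hold_s) for a in activities]
    candidates = [c for c in candidates if c is not None]
    return max(candidates) if candidates else None
```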

Knowing the $hr_{max}$ parameter is of high importance because many coaching instructions are based on intensities expressed as a percentage of $hr_{max}$ ($100 \cdot hr / hr_{max}$) or as a relative percentage ($100 \cdot (hr - hr_{min}) / (hr_{max} - hr_{min})$) [33]. Typical training zones are shown in Figure 7.5.

Some empirical age-related formulas exist (the most famous being $hr_{max} = 220 - age$) [14, 29, 45, 60, 64] but they diverge and are inaccurate because maximum heart rates vary significantly between individuals [17]. Healthy runners can present maximum heart rates that vary by up to 30 bpm for the same age [60]. With such an estimation error, many runners actually train at intensities that are not the ones they target. Their training is thus sub-optimal or presents under-estimated cardiac risks. The widespread usage of training intensities calls for an easy formula that everybody can compute (preferably mentally). This might be the reason why an inaccurate but quick estimation can survive over the years although it is well established that such a relation does not exist with usable accuracy.

Fortunately, the most difficult parameter to obtain does not vary much over the course of training: $hr_{max}$ decreases by approximately 0.7 bpm per year [19]. An all-out effort is thus required less than once a year. Actually, $hr_{min}$ and $hr_{max}$, which are used to calibrate workout intensities, are quite stable parameters [39]. The other parameters are, in contrast, subject to changes with the runner's state or fitness level. The most important parameter to monitor regularly is the $m$ coefficient.

Figure 7.4: Extreme heart rates at the beginning of an exercise.

Figure 7.5: Exercise zones, Fox and Haskell formula between 20 and 70 years old.

7.2.5 Validation Process

The fitting of cardiac parameters was performed on running and cycling activities provided with heart rate measurements. The fitting of the parameters was performed using the procedure described in Section 7.2.3.

For the cycling activities, the power output was provided thanks to power meters mounted on the pedals. For the running activities provided with geo-localized points and barometric elevation measurements, the instant power of exertion was estimated based on computed slopes and speeds (as discussed in Section 2.3.6).

The cardiac parameters obtained are used to simulate heart rates that are compared with actual measurements. Then, for 50 runners, we computed the maximal speed $s_{max}$ using Equation (7.9) and compared it to race performances for which we already had equivalent distances from the previous chapters.

7.3 Results

In most activities, the identified heart rate rising time constant τ+ was found to be smaller than heart rate falling time constant τ−; the respective average values are 24 and 30 seconds. The sensitivity to fatigue was modeled with an additional power output in the range [1; 6].10−5 [w/J]. The steady state heart rate parameters were more subject to intra- and inter-athlete variability. The resting heart rate hrmin was found to be in the range [50; 80] [bpm]. The slope coefficient ms was found in the range [3; 14] [bpm/km.h]. Cardiac parameters that are identified on activities with the described methodology enabled accurate heart rate simulation on the same activity taking solely the instant power measurements or its estimation as input. Heart rate simulations differ from heart rate measurements with an aver- age root mean square error of 4 bpm for the 72 cycling activities recorded with the use of a power meter. On average, a root mean square error of 6 bpm was observed for the 234 running activities. Figures 7.6 and 7.7 show respectively cycling and running activities simulation examples. The predictions that could be made from the obtained cardiac parameters is far less accurate that the one obtained in previous chapters. This result comes with no surprise, as race times are easier to predict based on other race times than on heart rate measurements. As race performances are ob- tained on different distances, they can not be compared with the estimated maximal speed that is computed from cardiac parameters using Equation (7.9). Therefore all race performances are expressed as equivalent speeds on a 10K race with the following formula:

\[ s_{eq,10K} = \frac{t_{a,r}}{\left( d_r^{eq} / 10^4 \right)^{\beta}} \tag{7.15} \]

with β = 1.15 being the average observed endurance coefficient.
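To make the normalization concrete, the sketch below evaluates the right-hand side of Equation (7.15) for a race result. The conversion of the rescaled 10K time into a speed in km/h is an assumption added here for illustration, as are the function names; β = 1.15 is the value quoted above.

```python
BETA = 1.15  # average observed endurance coefficient quoted in the text


def equivalent_10k_time(race_time_s, equivalent_distance_m, beta=BETA):
    """Right-hand side of Equation (7.15): race time rescaled to 10,000 m."""
    return race_time_s / (equivalent_distance_m / 1e4) ** beta


def equivalent_10k_speed_kmh(race_time_s, equivalent_distance_m, beta=BETA):
    """Assumption: equivalent speed = 10 km divided by the rescaled time."""
    t10k_hours = equivalent_10k_time(race_time_s, equivalent_distance_m, beta) / 3600.0
    return 10.0 / t10k_hours
```

A race whose equivalent distance is exactly 10,000 m is left unchanged by the rescaling, which gives a quick sanity check of the implementation.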

Figure 7.8 shows race performances seq,10K versus the runners' smax computed from their heart rate models. Race performances correlate with the heart rate models: the correlation is slightly above 0.5. The best race performances seem to be bounded by a linear function of the runner's smax, but there are many under-performances.

If race times are to be predicted, we need to estimate the fraction of smax that runners can achieve as a function of the race distance. Figure 7.9 shows that, again, a relation seems to exist, but it defines maximal performances, with many under-performances. For instance, for a 12-kilometer race, the maximal speed that can be achieved is about 88% of smax; for a marathon, it goes down to 75%.
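The two fractions quoted above give a simple best-case bound on race time once smax is known. The sketch below is only a worked example: it assumes smax is expressed in km/h, takes a hypothetical runner, and hard-codes the two fractions read from Figure 7.9. Since many runners under-perform this bound, the result is a lower bound on the race time rather than a prediction.

```python
# Fractions of s_max reported in the text for two race distances (meters).
MAX_FRACTION = {12_000: 0.88, 42_195: 0.75}


def best_case_race_time(s_max_kmh, distance_m):
    """Lower bound on race time (seconds) for one of the tabulated distances."""
    fraction = MAX_FRACTION[distance_m]
    speed_ms = fraction * s_max_kmh / 3.6  # km/h converted to m/s
    return distance_m / speed_ms


# Hypothetical runner with s_max = 16 km/h.
for distance in MAX_FRACTION:
    minutes = best_case_race_time(16.0, distance) / 60.0
    print(f"{distance} m: at best {minutes:.1f} min")
```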

Figure 7.6: Cycling activity. Top: Power measurement. Bottom: Simulated vs recorded heart rate

Figure 7.7: Running activity. Plots 1 and 2: Elevation and speed measurements. Plot 3: Estimated power output computed from speeds and slopes. Plot 4: Simulated vs recorded heart rate

Figure 7.8: Race equivalent speed versus the runner estimated maximal speed.

Figure 7.9: Fraction of the smax that runners can achieve depending on the race distance.

7.4 Discussion

We highlighted several problems associated with fitness assessments made from race results. First, recreational runners may have little or no race experience. Second, race results are subject to high variability and can quickly become outdated in the case of beginners. Third, runners may attend races very rarely. This chapter proposed a way to evaluate runners at each of their runs, provided that they record their activities with a geo-positioning system (such as GPS) paired with a heart rate monitor.

We propose a simple model that describes the steady states and the dynamics of the heart rate. The model contains runner-specific parameters, and we show a way to identify these parameters on free running or cycling activities. The athlete parameters can then be turned into a fitness metric: the speed at maximal heart rate, smax.

The benefit of the method is that it allows smax to be re-evaluated on each activity. This provides a means to quantify runner progress regularly and can therefore serve to adapt training sessions accordingly. Moreover, the value of smax is strongly related to the value of the runner’s V̇O2max [37, 38], which is the gold standard of aerobic fitness and also a widely accepted indicator of life expectancy.

The accuracy of the prediction that can be made from the obtained cardiac parameters is not as high as the one that can be obtained from past race results. This might be due to the fact that the nature of the measurement is slightly different, but also to the fact that our computation of the speed at hrmax (denoted smax) relies on hrmin and hrmax estimated as, respectively, the minimum and maximum heart rates ever recorded. The estimation of smax might improve if hrmin were measured on the runner at rest and if hrmax were evaluated during a maximal effort. Note that those two values are relatively stable and do not need to be measured more than once a year.
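The sensitivity of smax to the estimated heart rate bounds can be illustrated with a small worked example. The sketch assumes a linear steady-state relation between heart rate and speed, hr = hrmin + ms · s, so that smax = (hrmax − hrmin) / ms. Equation (7.9) is not reproduced in this section, so this form, the function name, and the numerical values are assumptions, with hrmin and ms chosen within the ranges reported in Section 7.3.

```python
def speed_at_max_heart_rate(hr_min, hr_max, slope):
    """Assumed form: speed (km/h) at which the steady-state heart rate reaches hr_max.

    hr_min and hr_max are in bpm; slope is in bpm per km/h.
    """
    return (hr_max - hr_min) / slope


# Hypothetical runner: resting 60 bpm, true maximum 190 bpm, slope 8 bpm per km/h.
reference = speed_at_max_heart_rate(60.0, 190.0, 8.0)

# If the highest heart rate ever recorded is only 185 bpm, s_max is underestimated.
underestimated = speed_at_max_heart_rate(60.0, 185.0, 8.0)

print(reference, underestimated, reference - underestimated)
```

Under these assumptions, an error of 5 bpm on hrmax shifts smax by 5 / ms km/h, which is why measuring hrmin at rest and hrmax during a maximal effort could improve the estimate.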

The value of smax represents the performance one can expect on an all-out effort of a few minutes; it corresponds to a race between 1000 and 3000 m, depending on the runner's level. The performance on longer distances also depends on the runner's endurance, to which the kf parameter is intuitively related. Unfortunately, this relation has not been verified yet; verifying it is an interesting short-term continuation of this thesis.

Improvement in sport performance depends on physiological adaptations. These training adaptations are of particular importance for endurance athletes such as cyclists and runners. Training optimization requires knowledge about how humans adapt to workout sessions. This knowledge can be seen as a model linking training workout characteristics to fitness level. Such system modeling approaches have been refined for at least four decades [9]. Although these models are well described and increasingly used instead of relying solely on the empirical experience of coaches [6], they are unable to provide physiologically relevant parameters.

The recordings of all the workout sessions of a runner can be turned into quantitative workload metrics (such as heart rate zones, mileage, speeds, powers, ...) and, with the estimation of cardiac parameters presented in this chapter, into fitness level metrics. Runner activities thus constitute a collection of ground-truth input-output pairs of the adaptation process. Therefore, this chapter offers opportunities to refine the models, and possibly tune them to specific runners or specific populations of runners, to optimize their training.

Chapter 8

General Conclusion

Recreational sports practice is promoted for its various benefits, both individual (health, well-being) and societal (health care costs, productivity, absenteeism). In this context, running is often chosen for its accessibility to all. Although uninstructed free running sessions offer benefits, a well-planned running practice offers faster improvement and, as a consequence, plays a role in motivation.

This thesis provides methodologies to predict runner performance levels. This is useful for workout session planning and race pacing (choosing the right speed for a race). This work focuses on the preparation of recreational runners, which differs from that of elites, especially in terms of the quality and quantity of information that can be used. We built our model focusing on information that is cheap and easy to gather: race results or activity recordings that include heart rates. Other measurements that are available on smartphones or sports watches could have been used (such as accelerometers and pulse oximeters), but they are still a lot less widespread and less standardized (units, sampling rate, smoothing).

In Chapter 3, we reviewed the models from the literature that relate race performances to race distances. It turns out that the power law, which expresses the logarithm of the race time as a linear function of the logarithm of the race length, can fit observed race times in the most common range of distances (8 km to the marathon distance), whether for individual performances or world records.

In Chapters 4 and 5, we showed that the notion of equivalent distance is more appropriate than actual distance as a race time predictor. The equivalent distance, which captures all aspects that affect race times, can be computed from race results with a collaborative filtering technique: evaluating race distances, runner levels, and runner endurance at the same time, by minimizing the deviation from the power law that was selected in Chapter 3.

The proposed methodology for computing the equivalent distance of a race requires that the competition has already taken place, that the results are known, and that these results belong to athletes who attended other races for which we also have race times. Moreover, we require that the races belong to a large connected graph and that some races are known to be particularly fast, so that they can serve as references for which we assume the equivalent distance to be equal to the actual distance. If these conditions are not met, we showed in Chapter 5 that the equivalent distance can be approximated using the elevation profile of the race. We pointed out that other race features could be gathered and used to build a more complete model of the equivalent distance: ground type, temperature, altitude, humidity, pressure, precipitation, wind speed and wind direction. This does not pose major technical challenges beyond the collection of data from a large set of races for which equivalent distances can be computed from race results.

The variability in runner performances combined with their scarcity means that athlete parameters can only be fitted reliably with a simplistic model that contains a single parameter (their level). If we add the endurance parameter, the fitted models often suffer from overfitting. Chapter 6 addresses this issue by computing runner parameters under prior assumptions that describe what is most likely to be observed. Including these assumptions, inferred from a large set of race results, proves to provide more accurate race time predictions.
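The role of the prior assumptions can be sketched with a toy maximum a posteriori fit of the power law log t = log t10K + β log(d / 10^4). This is not Equation (6.12): it assumes Gaussian noise on log race times, a Gaussian prior on the endurance coefficient β centered on 1.15 (the average value quoted in Chapter 7), a flat prior on the level, and hypothetical standard deviations. Distances may be actual or equivalent distances.

```python
import math


def fit_runner(races, beta_prior=1.15, prior_sd=0.05, noise_sd=0.05):
    """MAP fit of (log of 10K time, endurance coefficient beta).

    races: list of (distance_m, time_s) pairs.
    prior_sd and noise_sd are hypothetical values, not those of Table 6.1.
    """
    xs = [math.log(d / 1e4) for d, _ in races]
    ys = [math.log(t) for _, t in races]
    n = len(races)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Weight of the prior relative to the data (ratio of variances).
    lam = (noise_sd / prior_sd) ** 2
    # Closed-form minimizer of the penalized least-squares objective.
    beta = (sxy + lam * beta_prior) / (sxx + lam)
    log_t10k = y_mean - beta * x_mean
    return log_t10k, beta


def predict_time(params, distance_m):
    """Predicted race time (seconds) on distance_m from fitted parameters."""
    log_t10k, beta = params
    return math.exp(log_t10k + beta * math.log(distance_m / 1e4))
```

With a single race record, the data carry no information on β, so the estimate falls back to the prior mean while the level is still fitted; with more races, β moves from the prior toward the least-squares slope.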
To conclude on race time prediction from past race results, Chapters 3 to 6 present models that gradually improve: Equation (6.12), with the distribution parameters given in Table 6.1, provides our best estimation of runner parameters and incorporates all previous discussions (physiological model, equivalent distances and probabilistic assumptions). It can be used to predict the race time on any common endurance distance (or equivalent distance if computed). The solution can model any runner for whom we have at least one race record. Therefore, the solution can be integrated, as is, in a coaching application and serve to optimize workout planning and race pacing.

The model can be further improved by using more information on runners or races. First, prior assumptions on runners could be tuned to individuals based on their age, sex or weight. Second, equivalent distances could be modeled with more race features beyond those computed from the elevation profiles. In the current model, the range of distances is limited to the most common ones (8 km to marathon); further research could extend the range of validity by adapting the power law.

With the first chapters, our intention was to extract as much as we could from the race data at our disposal: race results and elevation data. We highlighted several problems that would affect any possible methodology based on race results due to their scarcity and their variability. The content of Chapter 7 responds to this difficulty by providing a way to evaluate athlete fitness much more often and with less variability, at each of their runs, thanks to the relatively new and growing trend of runners recording all their sports activities with a positioning system and a heart rate monitor. We showed that the methodology can provide an estimation of at least one important fitness metric: the speed at maximal heart rate, which is strongly related to V̇O2max, itself a widely accepted indicator of life expectancy.
Computation of this metric can easily be integrated into a coaching application and serve as a reference to derive optimized workout paces. It also provides another way to predict runners' potential performances, with the advantage that it can be updated at each run.

The frequent monitoring of athlete levels, together with recordings of their training sessions, is an opportunity to improve our knowledge of how athletes adapt to their workout sessions and possibly to tune the adaptation models to individual runners. Evaluating the runner at each individual run also makes the quantity of data much larger and makes it possible to relate their level to more variables, such as weather conditions or the gradient of ascent, as already mentioned.

Although the race preparation of recreational runners was the primary motivation for this work, the findings discussed here can extend beyond this area. As they provide health-related metrics that are easy to obtain, they could serve in exercise therapy, health promotion or research in health sciences.

Appendix A

Races Communities

A.1 Lion Races

| Date | Race Name | Distance (m) | Total Ascent (m) |
|---|---|---|---|
| 2014-03-02 | 10 miles Ostende-Bruges 2014 | 16071 | 23 |
| 2014-03-16 | halve marathon Sluis 2014 | 21130 | 97 |
| 2014-04-19 | 10 km ’t Hazegras 2014 | 10031 | 35 |
| 2014-04-27 | St Jans loop 2014 | 9521 | 53 |
| 2014-05-30 | keignaertloop 2014 | 8817 | 16 |
| 2014-06-13 | vuurtorenloop 2014 | 11258 | 31 |
| 2014-06-20 | Marathon Nacht van Vlaanderen 2014 | 42067 | 100 |
| 2014-06-27 | Konterdamloop 2014 | 10596 | 28 |
| 2014-08-08 | Dwars Door Mariakerke 2014 | 10435 | 26 |
| 2014-08-22 | Oostende Strandloop 2014 | 9290 | 81 |
| 2014-09-28 | Halve marathon Oostende 2014 | 21269 | 75 |
| 2014-10-05 | Steense Hermesloop 2014 | 11894 | 24 |
| 2014-10-18 | Oostendse De Olifant 2014 | 10001 | 24 |
| 2014-11-11 | Halve marathon Deinze - Bellem 2014 | 21128 | 65 |
| 2014-12-12 | Kerstloop Brugge 2014 | 9986 | 74 |
| 2015-02-01 | Cadzand - Halve Marathon 2015 | 21152 | 121 |
| 2015-03-01 | 22e Oostende-Brugge Ten Miles 2015 | 15863 | 21 |
| 2015-06-19 | Nacht Van Vlaanderen 2015 | 42030 | 93 |
| 2015-11-11 | Deinze - Bellem (Halve marathon) 2015 | 21078 | 37 |
| 2015-12-11 | Kerstloop Brugge 2015 | 9988 | 38 |


A.2 Rooster Races

| Date | Race Name | Distance (m) | Total Ascent (m) |
|---|---|---|---|
| 2014-01-25 | Challenge BW - Nivelles 2014 | 10442 | 157 |
| 2014-02-02 | Hivernales Boitsfort 2014 | 19839 | 256 |
| 2014-02-02 | Les hivernales Boisfort 2014 | 9798 | 165 |
| 2014-02-08 | Challenge BW - La Hulpe 2014 | 12421 | 128 |
| 2014-02-15 | La Printanière Erpent 2014 | 15631 | 305 |
| 2014-02-15 | Trail des Bosses 2014 | 25000 | 437 |
| 2014-02-22 | Challenge BW - Lillois 2014 | 12659 | 134 |
| 2014-02-23 | COURSE ARJ a jambes 2014 | 12812 | 188 |
| 2014-03-08 | Challenge BW - Chaumont-Gistoux 2014 | 11891 | 170 |
| 2014-03-15 | Challenge BW - Waterloo 2014 | 11067 | 208 |
| 2014-03-16 | Antwerp Urban Trail 2014 | 11646 | 33 |
| 2014-03-19 | 10 miles de Louvain-La-Neuve 2014 | 15978 | 292 |
| 2014-03-22 | Challenge BW - Vieusart 2014 | 11876 | 158 |
| 2014-03-29 | Crêtes de Spa 2014 | 21000 | 504 |
| 2014-04-19 | Jogging de la principauté de Chimay 2014 | 13941 | 255 |
| 2014-04-27 | Ten Miles Antwerpen 2014 | 16089 | 52 |
| 2014-05-10 | Challenge BW - Hèze 2014 | 13175 | 220 |
| 2014-05-11 | Les 15km de Woluwé-Saint-Lambert 2014 | 13981 | 203 |
| 2014-05-17 | Jogging des trois frontieres Kelmis 2014 | 17137 | 397 |
| 2014-05-18 | 20km de bruxelles 2014 | 20002 | 157 |
| 2014-05-29 | Abdijentocht Averbode - Tongerlo 2014 | 15610 | 48 |
| 2014-05-29 | Challenge du BW - Jogging du buchet (Bierges) 2014 | 11207 | 159 |
| 2014-06-08 | Jogging LA CHATELETTAINE 2014 | 14228 | 197 |
| 2014-06-14 | Challenge BW - Ottignies 2014 | 11808 | 189 |
| 2014-06-21 | Trail des Forges de la Forêt d’Allier 2014 | 18497 | 281 |
| 2014-08-23 | Challenge bw nil-saint-vincent 2014 | 10891 | 69 |
| 2014-08-31 | la descente de la Lesse 2014 | 21825 | 312 |
| 2014-09-21 | Nivelles semi marathon 2014 | 21094 | 218 |
| 2014-09-27 | Jogging des carottes Mellet 2014 | 10671 | 67 |
| 2014-09-28 | Dwars Door Mechelen 2014 | 10210 | 12 |
| 2014-10-05 | Brussels 2014 | 21463 | 143 |
| 2014-10-12 | HBvL Dwars door Hasselt 2014 | 9986 | 33 |
| 2014-10-19 | Brugge Urban Trail 2014 | 10287 | 33 |
| 2014-10-19 | Marathon 2014 | 42580 | 254 |
| 2014-11-02 | Namur Urban Trail 2014 | 11386 | 132 |
| 2014-11-09 | Les 4 Cimes du Pays de Herve 2014 | 33080 | 586 |
| 2014-11-16 | Marathon Kasterlee 2014 | 41988 | 371 |
| 2014-12-06 | Mechelen Urban Trial 2014 | 9774 | 29 |

| Date | Race Name | Distance (m) | Total Ascent (m) |
|---|---|---|---|
| 2014-12-21 | Gaston Roelants - Brussel 2014 | 9478 | 136 |
| 2015-01-17 | VTM Course à pied 2015 | 8201 | 132 |
| 2015-01-24 | Challenge BW - Nivelles 2015 | 10464 | 158 |
| 2015-02-01 | Les hivernales de Boisfort 2015 | 20435 | 277 |
| 2015-02-07 | Challenge BW - La Hulpe 2015 | 12834 | 193 |
| 2015-02-14 | La Printanière Erpent 2015 | 15700 | 361 |
| 2015-02-21 | Challenge BW - Lillois 2015 | 12738 | 135 |
| 2015-03-01 | Cross de Bousval 2015 | 14032 | 221 |
| 2015-03-07 | Challange BW - Chaumont-Gistoux 2015 | 12098 | 180 |
| 2015-03-07 | 1/2 Marathon Spa Francorchamps 2015 | 20819 | 390 |
| 2015-03-14 | Challange BW - Waterloo 2015 | 12680 | 138 |
| 2015-03-15 | CittA Antwerp Urban Trail 2015 | 11785 | 42 |
| 2015-03-21 | Challenge BW - Vieusart 2015 | 12270 | 167 |
| 2015-03-28 | Les Crêtes de Spa 2015 | 21000 | 535 |
| 2015-04-18 | Jogging de la principauté de Chimay 2015 | 14167 | 199 |
| 2015-04-19 | 10 km de Bruxelles 2015 | 9996 | 129 |
| 2015-04-26 | Antwerp 10 miles 2015 | 15865 | 54 |
| 2015-04-26 | 10 Km de l’ULB 2015 | 10034 | 124 |
| 2015-05-02 | Impala Run 2015 | 15936 | 64 |
| 2015-05-10 | 15km de Woluwe-Saint-Lambert 2015 | 14479 | 145 |
| 2015-05-14 | Westerlo Tongerlo Hardlopen 2015 | 15962 | 63 |
| 2015-05-14 | Challenge du BW - Jogging du buchet (Bierges) 2015 | 11367 | 174 |
| 2015-05-17 | Semi-Marathon de l’Ourse 2015 | 21071 | 112 |
| 2015-05-23 | Challenge BW - Céroux 2015 | 13394 | 185 |
| 2015-05-31 | 20 km de Bruxelles 2015 2015 | 20000 | 152 |
| 2015-06-20 | Les Forges de la forêt d’Anlier 2015 | 18832 | 270 |
| 2015-08-29 | Dwars door Turnhout 2015 | 10785 | 26 |
| 2015-09-06 | tilburg ten miles 2015 | 16129 | 23 |
| 2015-09-13 | Jogging Ville de Namur 2015 | 11684 | 172 |
| 2015-09-20 | Nivelles semi marathon 2015 | 21020 | 197 |
| 2015-09-26 | Ecotrail de Bruxelles 2015 | 18000 | 198 |
| 2015-10-04 | Brussels Half Marathon 2015 | 21324 | 145 |
| 2015-10-11 | Marathon Eindhoven 2015 | 42495 | 129 |
| 2015-10-18 | Semi marathon Amsterdam 2015 | 21295 | 153 |
| 2015-10-25 | Les 20km des Ardennes 2015 | 19860 | 306 |
| 2015-12-18 | Trail nocturne Maredsous 2015 | 15000 | 282 |
| 2015-12-19 | La Corrida de Gerpinnes 2015 | 10035 | 120 |
| 2015-12-27 | Eindejaarscorrida Leuven 2015 | 11904 | 59 |

A.3 Races Equivalent Distances

Here is the list of the races with their equivalent distance computed using Equation (5.4) of Chapter 5.

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2015-12-11 | Kerstloop Brugge 2015 | 9988 | 14119 |
| 2015-12-18 | Trail nocturne Maredsous 2015 | 15000 | 22170 |
| 2015-12-19 | La Corrida de Gerpinnes 2015 | 10035 | 13815 |
| 2015-12-26 | Happy New Year Trophy 2015 | 10735 | 14457 |
| 2015-12-26 | Ciney Corrida 2015 | 9936 | 12584 |
| 2015-12-27 | Argenta Kerstcorrida 2015 | 9736 | 13098 |
| 2015-12-27 | Eindejaarscorrida Leuven 2015 | 11904 | 17139 |

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2014-04-19 | 10 km ’t Hazegras 2014 | 10031 | 14002 |
| 2014-04-25 | Challenge Condrusien - Pailhe 2014 | 10286 | 14648 |
| 2014-04-27 | La Brabançonne 2014 | 8990 | 12084 |
| 2014-04-27 | Ten Miles Antwerpen 2014 | 16089 | 19213 |
| 2014-04-27 | Antwerp Marathon 2014 | 42300 | 45813 |
| 2014-04-27 | Havenloop Gent 2014 | 20340 | 23026 |
| 2014-04-27 | A L Etall 2014 | 10660 | 13520 |
| 2014-04-27 | St Jans loop 2014 | 9521 | 12907 |
| 2014-04-27 | Semi-Marathon de Charleroi 2014 | 19604 | 23474 |
| 2014-05-01 | 10km van Knokke 2014 | 10054 | 13407 |
| 2014-05-03 | La Bouillonnante 2014 | 24000 | 43057 |
| 2014-05-03 | La Bouillonnante 2014 | 52000 | 78933 |
| 2014-05-04 | Genk Loopt 2014 | 9987 | 13674 |
| 2014-05-04 | 10km de Uccle 2014 | 9899 | 14966 |
| 2014-05-04 | 15 Km de Liège 2014 | 15395 | 18692 |
| 2014-05-09 | Challenge Condrusien - A Pid po l’Bwès 2014 | 11983 | 16676 |
| 2014-05-10 | Challenge BW - Hèze 2014 | 13175 | 17937 |
| 2014-05-11 | Dwars door Brugge 2014 | 14479 | 18175 |
| 2014-05-11 | Les 15km de Woluwé-Saint-Lambert 2014 | 13981 | 17604 |
| 2014-05-11 | Run Marathon Vise/Maastricht 2014 | 42370 | 44463 |
| 2014-05-16 | Challenge Condrusien - Jogging de Modave 2014 | 11983 | 17159 |
| 2014-05-17 | Jogging des trois frontieres Kelmis 2014 | 17137 | 21496 |
| 2014-05-18 | 20km de bruxelles 2014 | 20002 | 24191 |
| 2014-05-24 | ACRHO L’enfer des collines 2014 | 22915 | 28985 |
| 2014-05-25 | Stadsloop De Gentenaar 2014 | 9836 | 13260 |
| 2014-05-25 | jogging de Bioul 2014 | 13003 | 15741 |
| 2014-05-29 | Abdijentocht Averbode - Tongerlo 2014 | 15610 | 19772 |
| 2014-05-29 | Challenge du BW - Jogging du buchet (Bierges) 2014 | 11207 | 15471 |
| 2014-05-30 | Challenge Condrusien - Jogging de Vyle 2014 | 16577 | 21601 |
| 2014-05-30 | keignaertloop 2014 | 8817 | 11764 |
| 2014-05-31 | Semi Marathon de Luxembourg 2014 | 21222 | 25023 |
| 2014-05-31 | Acrho La Kainoise 2014 | 21262 | 23567 |
| 2014-06-07 | Les Boucles Ardennaises 2014 | 35000 | 59132 |
| 2014-06-07 | Les Boucles Ardennaises 2014 | 10540 | 17480 |
| 2014-06-07 | Les Boucles Ardennaises/Semi Marathon 2014 | 20915 | 29684 |
| 2014-06-08 | Bellevaux Running Tour 2014 | 8835 | 13645 |
| 2014-06-08 | Jogging LA CHATELETTAINE 2014 | 14228 | 18474 |
| 2014-06-09 | Trail de Namur 2014 | 30000 | 41496 |
| 2014-06-09 | Kortrijk Loopt 2014 | 11154 | 15480 |
| 2014-06-13 | vuurtorenloop 2014 | 11258 | 14531 |
| 2014-06-14 | Challenge BW - Ottignies 2014 | 11808 | 15997 |
| 2014-06-20 | Marathon Nacht van Vlaanderen 2014 | 42067 | 45566 |
| 2014-06-20 | Nacht van Vlaanderen 2014 | 9852 | 13762 |
| 2014-06-20 | Nacht van Vlaanderen 2014 | 20995 | 24765 |
| 2014-06-21 | Midzomernachtrun Gent 2014 | 14919 | 18707 |
| 2014-06-21 | Trail des Forges de la Forêt d’Allier 2014 | 18497 | 22772 |
| 2014-06-27 | Konterdamloop 2014 | 10596 | 13944 |
| 2014-06-28 | Battle of the Ardennes 2014 | 9991 | 18733 |
| 2014-06-28 | Battle of the Ardennes 2014 | 19979 | 32364 |
| 2014-07-07 | Tessenderlo The classic 2014 | 9764 | 12694 |

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2014-07-12 | Challenge Condrusien - Les Sentiers du Val 2014 | 11085 | 16301 |
| 2014-07-27 | Bosloop Tessenderlo 2014 | 13423 | 17508 |
| 2014-08-08 | Strandloop Heist 2014 | 9986 | 13926 |
| 2014-08-08 | Dwars Door Mariakerke 2014 | 10435 | 13607 |
| 2014-08-09 | La Course des Sources 2014 | 13185 | 21177 |
| 2014-08-16 | 5e Trail des Fantômes 2014 | 13000 | 29151 |
| 2014-08-16 | 5e Trail des Fantômes 2014 | 25928 | 39135 |
| 2014-08-16 | 5e Trail des Fantômes 2014 | 49966 | 81058 |
| 2014-08-16 | 5e Trail des Fantômes 2014 | 99892 | 135278 |
| 2014-08-22 | Oostende Strandloop 2014 | 9290 | 13510 |
| 2014-08-22 | Hollewegjogging Lichtaart 2014 | 14555 | 18075 |
| 2014-08-23 | Challenge bw nil-saint-vincent 2014 | 10891 | 14540 |
| 2014-08-30 | ACRHO Semi Hérinnes 2014 | 21227 | 22959 |
| 2014-08-30 | Dwars door Turnhout 2014 | 10888 | 13961 |
| 2014-08-31 | Les 10km de Lasne 2014 | 9886 | 13855 |
| 2014-08-31 | la descente de la Lesse 2014 | 21825 | 30243 |
| 2014-09-07 | Fortloop 2014 | 9984 | 13767 |
| 2014-09-07 | Fortloop 2014 | 14983 | 19567 |
| 2014-09-07 | Sint-Niklaas Hardlopen Ballonloop 2014 | 10152 | 12031 |
| 2014-09-14 | Jogging Ville De Namur (Citadelloop) 2014 | 11636 | 15281 |
| 2014-09-19 | Energizer Night Run 2014 | 8883 | 14654 |
| 2014-09-21 | Race Against Nature 2014 | 9992 | 19977 |
| 2014-09-21 | Les 20km de Bastogne 2014 | 19968 | 25078 |
| 2014-09-21 | Nivelles semi marathon 2014 | 21094 | 25223 |
| 2014-09-21 | Dam tot Damloop Amsterdam 2014 | 16233 | 20005 |
| 2014-09-27 | Jogging des carottes Mellet 2014 | 10671 | 13496 |
| 2014-09-28 | Dwars Door Mechelen 2014 | 10210 | 13735 |
| 2014-09-28 | Halve marathon Oostende 2014 | 21269 | 24708 |
| 2014-09-28 | BMW Berlin Marathon 2014 | 42551 | 42200 |
| 2014-10-04 | ACRHO : forêt de beloeil 2014 | 19912 | 22579 |
| 2014-10-05 | Brussels Half Marathon 2014 | 21463 | 24635 |
| 2014-10-05 | Brussels Marathon 2014 | 42502 | 44908 |
| 2014-10-05 | Steense Hermesloop 2014 | 11894 | 15298 |
| 2014-10-12 | HBvL Dwars door Hasselt 2014 | 9986 | 13615 |
| 2014-10-12 | MuddyRun 2014 | 12992 | 33809 |
| 2014-10-12 | Mons Urban Trail 2014 | 10288 | 15224 |
| 2014-10-12 | HBvL Dwars door Hasselt 2014 | 14977 | 18479 |
| 2014-10-12 | Marathon Eindhoven 2014 2014 | 42544 | 43383 |
| 2014-10-12 | Halve Marathon Eindhoven 2014 | 21208 | 24229 |
| 2014-10-18 | Acerta Brussels 2014 | 10010 | 18703 |
| 2014-10-18 | Acerta Brussels Ekiden 2014 | 10011 | 18429 |
| 2014-10-18 | Acerta Brussels Ekiden 2014 | 10036 | 18751 |
| 2014-10-18 | Oostendse De Olifant 2014 | 10001 | 13630 |
| 2014-10-18 | vredesloop Ieper 2014 | 16143 | 19671 |
| 2014-10-19 | Brugge Urban Trail 2014 | 10287 | 14414 |

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2014-10-19 | Marathon Amsterdam 2014 | 42580 | 46320 |
| 2014-10-26 | La Belle Neupréenne 2014 | 8991 | 13093 |
| 2014-10-26 | Omega Pharma Half Marathon 2014 | 21096 | 24684 |
| 2014-11-02 | Namur Urban Trail 2014 | 11386 | 15907 |
| 2014-11-09 | Les 4 Cimes du Pays de Herve 2014 | 33080 | 38828 |
| 2014-11-09 | ACRHO Kain (Trinité) 2014 | 12594 | 15459 |
| 2014-11-11 | Halve marathon Deinze - Bellem 2014 | 21128 | 21100 |
| 2014-11-16 | Marathon Kasterlee 2014 | 20866 | 26200 |
| 2014-11-16 | Marathon Kasterlee 2014 | 41988 | 47298 |
| 2014-11-16 | Nijmegen Hardlopen 2014 | 15067 | 18427 |
| 2014-11-22 | Challenge Condrusien - Jogging de la St Nicolas 2014 | 11085 | 15740 |
| 2014-11-23 | Womanrace - Uccle 2014 | 8990 | 12968 |
| 2014-12-06 | Mechelen Urban Trial 2014 | 9774 | 13529 |
| 2014-12-07 | Bruggenloop Rotterdam 2014 | 15160 | 18746 |
| 2014-12-07 | Houffalize Trail 2014 | 25023 | 35728 |
| 2014-12-12 | Kerstloop Brugge 2014 | 9986 | 14094 |
| 2014-12-20 | La Corrida de Gerpinnes 2014 | 10026 | 13869 |
| 2014-12-21 | La Belle Hivernoise 2014 | 23000 | 29006 |
| 2014-12-21 | La Belle Hivernoise 2014 | 12984 | 19325 |
| 2014-12-21 | Gaston Roelants - Brussel 2014 | 9478 | 13001 |
| 2014-12-21 | Midwinternachtrun gent 2014 | 9756 | 13872 |
| 2014-12-26 | Tussen 2 Feestenloop 2014 | 9986 | 13685 |
| 2014-12-27 | Argenta Kerstcorrida 2014 | 9866 | 13571 |
| 2014-12-27 | Happy New Year Trophy 2014 | 10885 | 14923 |
| 2015-01-11 | Egmond halve Marathon 2015 | 20888 | 24623 |
| 2015-01-17 | VTM Course à pied 2015 | 8201 | 12406 |
| 2015-01-24 | Challenge BW - Nivelles 2015 | 10464 | 14625 |
| 2015-02-01 | Cadzand - Halve Marathon 2015 | 21152 | 25307 |
| 2015-02-01 | Les hivernales de Boisfort 2015 | 20435 | 25938 |
| 2015-02-07 | Challenge BW - La Hulpe 2015 | 12834 | 17194 |
| 2015-02-14 | La Printanière Erpent 2015 | 15700 | 20585 |
| 2015-02-21 | Challenge BW - Lillois 2015 | 12738 | 17180 |
| 2015-02-28 | La Portelette 2015 | 13100 | 17691 |
| 2015-03-01 | 22e Oostende-Brugge Ten Miles 2015 | 15863 | 19911 |
| 2015-03-01 | La Liégeoise, le jogging des femmes 2015 | 10189 | 14245 |
| 2015-03-01 | Cross de Bousval 2015 | 14032 | 18933 |
| 2015-03-07 | Challange BW - Chaumont-Gistoux 2015 | 12098 | 16476 |
| 2015-03-07 | 1/2 Marathon Spa Francorchamps 2015 | 20819 | 25386 |
| 2015-03-08 | Semi-Marathon Paris 2015 | 21286 | 24901 |
| 2015-03-14 | Challange BW - Waterloo 2015 | 12680 | 16406 |
| 2015-03-15 | CittA Antwerp Urban Trail 2015 | 11785 | 16604 |
| 2015-03-15 | Lotto Crosscup Finale - BK Veldlopen 2015 | 8984 | 14539 |
| 2015-03-15 | Challenge Condrusien - La Neupreenne 2015 | 10983 | 15296 |
| 2015-03-21 | Challenge BW - Vieusart 2015 | 12270 | 16141 |
| 2015-03-22 | Dwars door Dendermonde 2015 | 9916 | 13094 |

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2015-03-22 | Venloop Halve Marathon 2015 | 21261 | 24078 |
| 2015-03-28 | Les Crêtes de Spa 2015 | 21000 | 27807 |
| 2015-03-29 | Gent urban trail 2015 | 9983 | 15377 |
| 2015-04-12 | Marathon Rotterdam 2015 | 42463 | 44179 |
| 2015-04-12 | Marathon de Paris 2015 | 42512 | 46548 |
| 2015-04-18 | Jogging de la principauté de Chimay 2015 | 14167 | 19185 |
| 2015-04-19 | 10 km de Bruxelles 2015 | 9996 | 14480 |
| 2015-04-26 | 15km de Charleroi 2015 | 20168 | 24265 |
| 2015-04-26 | 15km de Charleroi 2015 | 14080 | 18073 |
| 2015-04-26 | Antwerp 10 miles 2015 | 15865 | 19022 |
| 2015-04-26 | Antwerp Marathon 2015 | 42317 | 44539 |
| 2015-04-26 | 10 Km de l’ULB 2015 | 10034 | 13448 |
| 2015-05-01 | 10km Knokke 2015 | 10033 | 13121 |
| 2015-05-02 | Impala Run 2015 | 15936 | 19714 |
| 2015-05-03 | 15 Km de Liège 2015 | 15188 | 18377 |
| 2015-05-10 | Dwars door brugge 2015 | 14814 | 18001 |
| 2015-05-10 | Trail - La Grimace 2015 | 80000 | 97083 |
| 2015-05-10 | Trail - La Grimace 2015 | 55000 | 71427 |
| 2015-05-10 | Trail - La Grimace 2015 | 28000 | 37949 |
| 2015-05-10 | Trail - La Grimace 2015 | 17000 | 23904 |
| 2015-05-10 | La Belle Dolhaintoise, le jogging des Femmes 2015 | 8985 | 12137 |
| 2015-05-10 | 15km de Woluwe-Saint-Lambert 2015 | 14479 | 18052 |
| 2015-05-13 | La Corrida des Remparts 2015 | 9829 | 14729 |
| 2015-05-14 | Westerlo Tongerlo Hardlopen 2015 | 15962 | 19765 |
| 2015-05-14 | Challenge du BW - Jogging du buchet (Bierges) 2015 | 11367 | 15273 |
| 2015-05-15 | Challenge Condrusien - Jogging des 2 Provinces 2015 | 11484 | 15723 |
| 2015-05-17 | Semi-Marathon de l’Ourse 2015 | 10487 | 14195 |
| 2015-05-17 | Semi-Marathon de l’Ourse 2015 | 21071 | 26076 |
| 2015-05-17 | Stadsloop De Gentenaar 2015 | 9430 | 12563 |
| 2015-05-23 | Trail de Malmedy 2015 | 10989 | 21518 |
| 2015-05-23 | Fisherman’s Friend Strongman Run 2015 | 16981 | 25114 |
| 2015-05-23 | Challenge BW - Céroux 2015 | 13394 | 17778 |
| 2015-05-24 | Bellevaux Running Tour 2015 | 8828 | 13024 |
| 2015-05-30 | Luxembourg Half Marathon 2015 | 21193 | 24343 |
| 2015-05-31 | 20 km de Bruxelles 2015 2015 | 20000 | 23118 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 52945 | 67758 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 21974 | 30531 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 29974 | 43466 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 10986 | 16710 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 24965 | 32433 |
| 2015-06-05 | Asics Xtrails Netherlands 2015 | 13985 | 22587 |
| 2015-06-06 | Via Belgica Marathon 2015 | 44030 | 46596 |
| 2015-06-07 | La Carolorégienne, le jogging des Femmes 2015 | 8988 | 12858 |
| 2015-06-07 | Race Against Nature 2015 | 9992 | 16730 |
| 2015-06-12 | Challenge Condrusien - La Condrusienne 2015 | 11484 | 16681 |

| Date | Race Name | Actual Distance (m) | Equivalent Distance (m) |
|---|---|---|---|
| 2015-06-19 | Nacht Van Vlaanderen 2015 | 9808 | 13614 |
| 2015-06-19 | Nacht Van Vlaanderen 2015 | 20940 | 24618 |
| 2015-06-19 | Nacht Van Vlaanderen 2015 | 42030 | 45598 |
| 2015-06-20 | Les Forges de la forêt d’Anlier 2015 | 18832 | 24413 |
| 2015-06-20 | Trail de la Vallée des Lacs 2015 | 14985 | 22983 |
| 2015-06-20 | Trail de la Vallée des Lacs 2015 | 26936 | 55720 |
| 2015-06-20 | Trail de la Vallée des Lacs 2015 | 12968 | 25373 |
| 2015-06-20 | Trail de la Vallée des Lacs 2015 | 86894 | 132234 |
| 2015-06-20 | Trail de la Vallée des Lacs 2015 | 54903 | 86599 |
| 2015-06-20 | Midzomernachtrun Gent 2015 | 9828 | 13577 |
| 2015-06-20 | Midzomernachtrun Gent 2015 | 14908 | 19061 |
| 2015-06-21 | Bilzen Run 2015 | 9986 | 14032 |
| 2015-06-21 | Bilzen Run 2015 | 14975 | 19265 |
| 2015-06-21 | Jogging des Coucous de Somme 2015 | 9987 | 14724 |
| 2015-06-21 | jogging verviers 2015 | 12972 | 16730 |
| 2015-06-27 | Battle of the Ardennes 2015 | 22878 | 38736 |
| 2015-06-27 | Battle of the Ardennes 2015 | 11491 | 22325 |
| 2015-06-27 | Ren & Run 2015 | 9992 | 16450 |
| 2015-07-04 | GhostRace 2015 | 13986 | 25029 |
| 2015-07-06 | TESSENDERLO - THE CLASSIC 2015 | 9810 | 13045 |
| 2015-07-08 | Midzomerrun Brugge 2015 | 12982 | 17200 |
| 2015-07-17 | Strandloop Knokke 2015 | 9987 | 13977 |
| 2015-07-18 | Les O’nzes kms d’Obernai 2015 | 10984 | 16382 |
| 2015-07-25 | L’Ardennaise 2015 | 8471 | 13725 |
| 2015-07-25 | L’Ardennaise 2015 | 34927 | 44765 |
| 2015-07-25 | L’Ardennaise 2015 | 20891 | 27167 |
| 2015-08-08 | La Course des Sources 2015 | 13187 | 21784 |
| 2015-08-16 | Trail des Fantômes 2015 | 53000 | 79140 |
| 2015-08-16 | Trail des Fantômes 2015 | 75000 | 108357 |
| 2015-08-16 | Trail des Fantômes 2015 | 20000 | 32011 |
| 2015-08-16 | Trail des Fantômes 2015 | 31000 | 48957 |
| 2015-08-16 | Trail des Fantômes 2015 | 12944 | 21801 |
| 2015-08-26 | Runway Run Koksijde 2015 | 10634 | 13868 |
| 2015-08-29 | Dwars door Turnhout 2015 | 10785 | 14117 |
| 2015-08-30 | 10km de Lasne 2015 | 9988 | 15214 |
| 2015-09-06 | Fortloop 2015 | 14976 | 18139 |
| 2015-09-06 | Fortloop 2015 | 9988 | 13625 |
| 2015-09-06 | tilburg ten miles 2015 | 16129 | 19728 |
| 2015-09-11 | Trail de la Côte d’Opale 2015 | 62000 | 77367 |
| 2015-09-11 | Trail de la Côte d’Opale 2015 | 46000 | 58400 |
| 2015-09-11 | Trail de la Côte d’Opale 2015 | 31000 | 41850 |
| 2015-09-11 | Trail de la Côte d’Opale 2015 | 13985 | 21851 |
| 2015-09-11 | Trail de la Côte d’Opale 2015 | 20976 | 29523 |
| 2015-09-12 | Havenrun Antwerpen 2015 | 14979 | 18800 |
| 2015-09-13 | Jogging Ville de Namur 2015 | 11684 | 15359 |

Actual Equivalent Date Race Name Distance Distance 2015-09-11 Trail de la Cˆoted’Opale 2015 31000 41850 2015-09-11 Trail de la Cˆoted’Opale 2015 13985 21851 2015-09-11 Trail de la Cˆoted’Opale 2015 20976 29523 2015-09-12 Havenrun Antwerpen 2015 14979 18800 2015-09-13 Jogging Ville de Namur 2015 11684 15359 2015-09-20 La Course du Chˆateau2015 9950 13648 2015-09-20 La Forme du Cœur 2015 9583 12854 2015-09-20 DH Tournai Urban Trail 2015 9988 14784 2015-09-20 Nivelles semi marathon 2015 21020 23900 2015-09-26 Ecotrail de Bruxelles 2015 18000 23380 2015-09-26 Ecotrail de Bruxelles 2015 80000 94214 2015-09-26 Ecotrail de Bruxelles 2015 8792 13824 2015-09-26 Challenge Condrusien - Les 20 kms de Fraiture 2015 9187 13328 2015-09-26 Challenge Condrusien - Les 20 kms de Fraiture 2015 19973 25735 2015-09-27 GvA Dwars Door Mechelen 2015 10255 13739 2015-09-27 Berlin Marathon 2015 42674 44659 2015-10-03 Kust Trail Run 2015 13973 19202 2015-10-03 Kust Trail Run 2015 27958 33643 2015-10-04 Brussels Half Marathon 2015 21324 25325 2015-10-04 Trail du Barrage 2015 22000 36823 2015-10-04 Trail du Barrage 2015 31000 48441 2015-10-04 Trail du Barrage 2015 13000 23724 2015-10-04 Race Against Nature 2015 9992 19601 2015-10-04 Li`ege10KM 2015 9986 13406 2015-10-11 HBvL Dwars door Hasselt 2015 14977 18273 2015-10-11 HBvL Dwars door Hasselt 2015 9986 13363 2015-10-11 Marathon Eindhoven 2015 42495 42988 2015-10-11 Halve marathon Eindhoven 2015 21263 24404 2015-10-18 La Belle Neupr´eenne,le jogging des Femmes 2015 8991 12715 2015-10-18 Brugge Urban Trail 2015 10289 15623 2015-10-18 5eme jogging des p’tits ouhes 2015 9989 16098 2015-10-18 Marathon Amsterdam 2015 42676 47773 2015-10-18 Semi marathon Amsterdam 2015 21295 24557 2015-10-25 Les 20km des Ardennes 2015 11859 16853 2015-10-25 Les 20km des Ardennes 2015 19860 25569 2015-11-07 Brussels Canal Run 2015 11983 15354 2015-11-08 Les 4 Cimes du Pays de Herve 2015 33020 38517 2015-11-11 Deinze - Bellem (Halve marathon) 2015 21078 23879 
2015-11-14  Rebecq Night Jogging 10K 2015                       9987                 13402
2015-11-15  Neptunus Run 2015                                   9990                 18217
2015-11-15  Nijmegen Hardlopen 2015                             15037                18308
2015-11-15  Halve marathon Kasterlee 2015                       21180                25641
2015-11-15  Marathon Valencia 2015                              42778                44185
2015-11-20  4ème corrida du Beaujolais nouveau 2015             9988                 13964
2015-12-06  Houffalize Hardlopen 2015                           25000                36233
2015-12-11  Kerstloop Brugge 2015                               9988                 14119
2015-12-18  Trail nocturne Maredsous 2015                       15000                22170
2015-12-19  La Corrida de Gerpinnes 2015                        10035                13815
2015-12-26  Happy New Year Trophy 2015                          10735                14457
2015-12-26  Ciney Corrida 2015                                  9936                 12584

Date        Race Name                                           Actual Distance (m)  Equivalent Distance (m)
2014-01-05  Nieuwjaarsrun Sint Anneke 2014                      9962                 12917
2014-01-10  TNT 2014                                            16621                25167
2014-01-11  1/2 marathon Lier 2014                              21387                24674
2014-01-12  Egmond Halve Marathon 2014                          20975                24964
2014-01-12  Dirk Martens Corrida Aalst 2014                     11660                14840
2014-01-19  Trail trefle a quatre feuilles 2014                 31305                42380
2014-01-25  Challenge BW - Nivelles 2014                        10442                14079
2014-02-02  Hivernales Boitsfort 2014                           19839                24152
2014-02-02  Les hivernales Boisfort 2014                        9798                 13962
2014-02-02  halve marathon Cadzand 2014                         21198                25224
2014-02-08  Challenge BW - La Hulpe 2014                        12421                16495
2014-02-15  La Printanière Erpent 2014                          15631                20375
2014-02-15  Trail des Bosses 2014                               25000                35094
2014-02-15  Trail des Bosses 2014                               42000                57741
2014-02-15  Trail des Bosses 2014                               65000                76914
2014-02-16  ACRHO Dottignies 2014                               10034                13110
2014-02-22  Challenge Condrusien - Nandrin 2014                 11285                15806
2014-02-22  Achro Bury 2014                                     10469                13833
2014-02-22  Challenge BW - Lillois 2014                         12659                16768
2014-02-22  LIER - NATUURLOPEN 25KM 2014                        24950                28467
2014-02-23  COURSE ARJ a jambes 2014                            12812                16986
2014-02-23  Midwinterjogging Hasselt 2014                       9831                 13910
2014-02-23  kortemark loopt 2014                                19941                23127
2014-03-02  10 miles Ostende-Bruges 2014                        16071                20001
2014-03-02  Semi marathon de Paris 2014                         21355                23715
2014-03-08  Challenge BW - Chaumont-Gistoux 2014                11891                16541
2014-03-09  trail de chimay 2014                                24000                32632
2014-03-09  La Liègeoise 2014                                   8991                 13873
2014-03-09  CPC - halve marathon Den Haag 2014                  21152                24027
2014-03-15  Challenge BW - Waterloo 2014                        11067                15065
2014-03-16  Challenge Condrusien Neupré 2014                    11085                15665
2014-03-16  Antwerp Urbain Trail 2014                           11182                15762
2014-03-16  halve marathon Sluis 2014                           21130                25102
2014-03-16  Antwerp Urban Trail 2014                            11646                15937
2014-03-19  10 miles de Louvain-La-Neuve 2014                   15978                20094
2014-03-22  Challenge BW - Vieusart 2014                        11876                16064
2014-03-23  Hemaco Dwars door Dendermonde 2014                  9985                 13280
2014-03-29  Crêtes de Spa 2014                                  21000                27911
2014-03-30  Venloop Halve Marathon 2014                         21260                25003
2014-03-30  Halve Marathon Lier 2014                            21160                25136
2014-04-06  Les Foulées du Bruaysis 2014                        9985                 13005
2014-04-06  Marathon de Paris 2014                              42540                45360
2014-04-13  Marathon Rotterdam 2014                             42470                43855
2014-04-18  Corrida Dinant 2014                                 9970                 12380
2014-04-19  Jogging de la principauté de Chimay 2014            13941                18006
2014-04-19  10 km ’t Hazegras 2014                              10031                14002
2014-04-25  Challenge Condrusien - Pailhe 2014                  10286                14648
2014-04-27  La Brabançonne 2014                                 8990                 12084
2014-04-27  Ten Miles Antwerpen 2014                            16089                19213
2014-04-27  Antwerp Marathon 2014                               42300                45813

Bibliography

[1] Example of VDOT commercial service. https://vdoto2.com, [Online; accessed 17-October-2019]

[2] Achten, J., Jeukendrup, A.E.: Heart rate monitoring. Sports medicine 33(7), 517–538 (2003)

[3] Aladin, A.I., Whelton, S.P., Al-Mallah, M.H., Blaha, M.J., Keteyian, S.J., Juraschek, S.P., Rubin, J., Brawner, C.A., Michos, E.D.: Relation of resting heart rate to risk for all-cause mortality by gender after considering exercise capacity (the Henry Ford Exercise Testing Project). The American journal of cardiology 114(11), 1701–1706 (2014)

[4] Andersen, J.J.: Marathon statistics 2019 worldwide. https://runrepeat.com/research-marathon-performance-across-nations (2018), [Online; accessed 8-June-2019]

[5] Bauer, C.: On the (in-) accuracy of gps measures of smartphones: a study of running tracking applications. In: Proceedings of International Conference on Advances in Mobile Computing & Multimedia. p. 335. ACM (2013)

[6] Borresen, J., Lambert, M.I.: The quantification of training load, the training response and the effect on performance. Sports medicine 39(9), 779–795 (2009)

[7] Brueckner, J., Atchou, G., Capelli, C., Duvallet, A., Barrault, D., Jousselin, E., Rieu, M., Di Prampero, P.: The energy cost of running increases with the distance covered. European journal of applied physiology and occupational physiology 62(6), 385–389 (1991)

[8] Bull, A.J., Housh, T.J., Johnson, G.O., Perry, S.R.: Effect of mathematical modeling on the estimation of critical power. Medicine and science in sports and exercise 32(2), 526–530 (2000)

[9] Calvert, T.W., Banister, E.W., Savage, M.V., Bach, T.: A systems model of the effects of training on physical performance. IEEE Transactions on systems, man, and cybernetics (2), 94–102 (1976)

[10] Cole, C.R., Blackstone, E.H., Pashkow, F.J., Snader, C.E., Lauer, M.S.: Heart-rate recovery immediately after exercise as a predictor of mortality. New England journal of medicine 341(18), 1351–1357 (1999)

[11] Conn, A.R., Gould, N.I., Toint, P.L.: Trust region methods, vol. 1. SIAM (2000)

[12] Cotman, C.W., Berchtold, N.C.: Exercise: a behavioral intervention to enhance brain health and plasticity. Trends in neurosciences 25(6), 295–301 (2002)

[13] Daniels, J.: Daniels’ Running Formula. Human Kinetics (2013)

[14] Farazdaghi, G.R., Wohlfart, B.: Reference values for the physical work capacity on a bicycle ergometer for women between 20 and 80 years of age. Clinical Physiology 21(6), 682–687 (2001)

[15] Faria, E.W., Parker, D.L., Faria, I.E.: The science of cycling. Sports medicine 35(4), 285–312 (2005)

[16] Finn, A.: When 26.2 miles just isn’t enough – the phenomenal rise of the ultramarathon. https://www.theguardian.com/lifeandstyle/2018/apr/02/ultrarunner-ultramarathon-racing-100-miles (2019), [Online; accessed 8-June-2019]

[17] Froelicher, V., Myers, J.: Exercise and the heart (fifth ed.). chap. 12, p. 108. Elsevier (2006)

[18] García-Manso, J., Martín-González, J., Vaamonde, D., Da Silva-Grigoletto, M.: The limitations of scaling laws in the prediction of performance in endurance events. Journal of theoretical biology 300, 324–329 (2012)

[19] Gellish, R.L., Goslin, B.R., Olson, R.E., McDonald, A., Russi, G.D., Moudgil, V.K.: Longitudinal modeling of the relationship between age and maximal heart rate. Medicine and science in sports and exercise 39(5), 822–829 (2007)

[20] Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: KDD (2011)

[21] Hawley, J.A., Schabort, E.J., Noakes, T.D., Dennis, S.C.: Carbohydrate-loading and exercise performance. Sports medicine 24(2), 73–81 (1997)

[22] Heigenhauser, G., Sutton, J.R., Jones, N.L.: Effect of glycogen depletion on the ventilatory response to exercise. Journal of Applied Physiology 54(2), 470–474 (1983)

[23] Hill, A.V., et al.: Muscular movement in man: The factors governing speed and recovery from fatigue. Muscular Movement in Man: the Factors governing Speed and Recovery from Fatigue. (1927)

[24] Hill, A., Lupton, H.: Muscular exercise, lactic acid, and the supply and utilization of oxygen. QJM: An International Journal of Medicine (62), 135–171 (1923)

[25] Hopkins, W., Edmond, I., Hamilton, B., Macfarlane, D., Ross, B.: Relation between power and endurance for treadmill running of short duration. Ergonomics 32(12), 1565–1571 (1989)

[26] Hugh Morton, R.: A 3-parameter critical power model. Ergonomics 39(4), 611–619 (1996)

[27] IAAF: World records. https://www.iaaf.org/records/by-category/world-records (2019), [Online; accessed 8-June-2019]

[28] IAU: World records table. www.iau-ultramarathon.org/images/file/Records/2017_2020_RecordsTable20190329.pdf (2019), [Online; accessed 8-June-2019]

[29] Inbar, O., Oren, A., Scheinowitz, M., Rotstein, A., Dlin, R., Casaburi, R.: Normal cardiopulmonary responses during incremental exercise in 20- to 70-yr-old men. Medicine and science in sports and exercise 26, 538–538 (1994)

[30] Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the forty-fifth annual ACM symposium on Theory of computing. pp. 665–674. ACM (2013)

[31] James, G., Witten, D., Hastie, T., Tibshirani, R.: An introduction to statistical learning. vol. 417, chap. 3, pp. 61–71. Springer (2013)

[32] Jouven, X., Empana, J.P., Schwartz, P.J., Desnos, M., Courbon, D., Ducimetière, P.: Heart-rate profile during exercise as a predictor of sudden death. New England Journal of Medicine 352(19), 1951–1958 (2005)

[33] Karvonen, J., Vuorimaa, T.: Heart rate and exercise intensity during sports activities. Sports Medicine 5(5), 303–311 (1988)

[34] Kay, A.: Pace and critical gradient for hill runners: an analysis of race records. Journal of Quantitative Analysis in Sports 8(4) (2012)

[35] Kennelly, A.E.: An approximate law of fatigue in the speeds of racing animals. In: Proceedings of the American Academy of Arts and Sciences. vol. 42, pp. 275–331. JSTOR (1906)

[36] Lee, D.C., Pate, R.R., Lavie, C.J., Sui, X., Church, T.S., Blair, S.N.: Leisure-time running reduces all-cause and cardiovascular mortality risk. Journal of the American College of Cardiology 64(5), 472–481 (2014)

[37] Léger, L., Mercier, D.: Gross energy cost of horizontal treadmill and track running. Sports medicine 1(4), 270–277 (1984)

[38] Lounana, J., Campion, F., Noakes, T.D., Medelli, J.: Relationship between %HRmax, %HR reserve, %VO2max, and %VO2 reserve in elite cyclists. Medicine and science in sports and exercise 39(2), 350–357 (2007)

[39] Lucía, A., Hoyos, J., Pérez, M., Chicharro, J.L.: Heart rate and performance parameters in elite cyclists: a longitudinal study. Medicine and science in sports and exercise 32(10), 1777–1782 (2000)

[40] OpenStreetMap: highway. https://wiki.openstreetmap.org/wiki/Key:highway (2019), [Online; accessed 8-June-2019]

[41] McArdle, W.D., Katch, F.I., Katch, V.L.: Essentials of exercise physiology, p. 204. Lippincott Williams & Wilkins (2006)

[42] Minetti, A.E., Moia, C., Roi, G.S., Susta, D., Ferretti, G.: Energy cost of walking and running at extreme uphill and downhill slopes. Journal of applied physiology 93(3), 1039–1046 (2002)

[43] Muggeo, V.M.: Estimating regression models with unknown breakpoints. Statistics in medicine 22(19), 3055–3071 (2003)

[44] Mulligan, M., Adam, G., Emig, T.: A minimal power model for human running performance. PloS one 13(11), e0206645 (2018)

[45] Nes, B., Janszky, I., Wisløff, U., Støylen, A., Karlsen, T.: Age-predicted maximal heart rate in healthy subjects: The HUNT Fitness Study. Scandinavian journal of medicine & science in sports 23(6), 697–704 (2013)

[46] Péronnet, F., Thibault, G.: Mathematical analysis of running performance and world running records. Journal of Applied Physiology 67(1), 453–465 (1989)

[47] Poole, D.C., Burnley, M., Vanhatalo, A., Rossiter, H.B., Jones, A.M.: Critical power: An important fatigue threshold in exercise physiology. Medicine and science in sports and exercise 48(11), 2320–2334 (2016)

[48] di Prampero, P.E., Osgnach, C.: Energy cost of human locomotion on land and in water. In: Muscle and Exercise Physiology, pp. 183–213. Elsevier (2019)

[49] Riegel, P.S.: Athletic records and human endurance: A time-vs.-distance equation describing world-record performances may be used to compare the relative endurance capabilities of various groups of people. American Scientist 69(3), 285–290 (1981)

[50] Ross, R., Blair, S.N., Arena, R., Church, T.S., Després, J.P., Franklin, B.A., Haskell, W.L., Kaminsky, L.A., Levine, B.D., Lavie, C.J., et al.: Importance of assessing cardiorespiratory fitness in clinical practice: a case for fitness as a clinical vital sign: a scientific statement from the American Heart Association. Circulation 134(24), e653–e699 (2016)

[51] Savaglio, S., Carbone, V.: Human performance: Scaling in athletic world records. Nature 404(6775), 244 (2000)

[52] Scarf, P.: An empirical basis for Naismith’s rule. Mathematics Today-Bulletin of the Institute of Mathematics and its Applications 34(5), 149–152 (1998)

[53] Scarf, P.: Route choice in mountain navigation, Naismith’s rule, and the equivalence of distance and climb. Journal of Sports Sciences 25(6), 719–726 (2007)

[54] de Smet, D., Francaux, M., Baijot, L., Verleysen, M.: Map best performances prediction for endurance runners. In: Proceedings of the 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium (2019)

[55] de Smet, D., Francaux, M., Hendrickx, J.M., Verleysen, M.: Heart rate modelling as a potential physical fitness assessment for runners and cyclists. In: Proceedings of the Machine Learning and Data Mining for Sports Analytics Workshop at ECML/PKDD. Riva del Garda, Italy (2016)

[56] de Smet, D., Verleysen, M., Francaux, M.: Running race times prediction and runner performances comparison using a matrix factorization approach. In: Proceedings of the 5th International Congress on Sport Sciences Research and Technology Support - Volume 1: icSPORTS, pp. 96–101 (2017)

[57] de Smet, D., Verleysen, M., Francaux, M., Baijot, L.: Long-distance running routes’ flat equivalent distances from race results and elevation profiles. In: Proceedings of the 6th International Congress on Sport Sciences Research and Technology Support - Volume 1: icSPORTS, pp. 56–62. INSTICC, SciTePress (2018). doi: 10.5220/0006937000560062

[58] statista: Running and jogging - statistics and facts. https://www.statista.com/topics/1743/running-and-jogging/ (2018), [Online; accessed 8-June-2019]

[59] Sylvan Katz, J., Katz, L.: Power laws and athletic performance. Journal of Sports Sciences 17(6), 467–476 (1999)

[60] Tanaka, H., Monahan, K.D., Seals, D.R.: Age-predicted maximal heart rate revisited. Journal of the American College of Cardiology 37(1), 153–156 (2001)

[61] Theodoridis, S.: Machine learning: a Bayesian and optimization perspective, chap. Bayesian Learning: Inference and the EM Algorithm, pp. 586–589. Academic Press (2015)

[62] Vandewalle, H.: Modelling of running performances: Comparisons of power-law, hyperbolic, logarithmic, and exponential models in elite endurance runners. BioMed research international 2018 (2018)

[63] Vickers, A.J., Vertosick, E.A.: An empirical study of race times in recreational endurance runners. BMC Sports Science, Medicine and Rehabilitation 8(1), 26 (2016)

[64] Wohlfart, B., Farazdaghi, G.R.: Reference values for the physical work capacity on a bicycle ergometer for men–a comparison with a previous study on women. Clinical physiology and functional imaging 23(3), 166–170 (2003)