Correlation Between Average and Age In


Table Of Contents

● Hypothesis
● Statement Of Intent
● Data
● Graphs (1)
● Trend Line and Formula
● Cricket Background
● Second Claim
● Graphs (2)
● Interpolation
● Extrapolation
● Validity
● Areas Of Improvement
● Conclusion
● Bibliography


Hypothesis

If I compare the data for 7 different batsmen over the course of 12 years, I expect to find a correlation between age and batting average in cricket. My reasoning is that as sportsmen age, they become more mature and more aware of what their role requires. A young sportsman is also more likely to be shuffled around the batting order, which stops him from fully expressing himself. I therefore predict that as a batsman ages, his average will rise: he understands his role better and has played at the highest level for a long time, which gives him experience of different conditions and oppositions. Essentially, I am expecting a positive correlation.

Statement Of Intent

In this exploration, my aim is to discover whether age plays a role in performance (batting average) in cricket. Many ‘pundits’ suggest that age is just a number in cricket, but I would like to find out for myself whether that is true or whether the idea is being kept alive by a few anomalies.

In order to ensure variety in my results, I have chosen to compare 7 batsmen from across the cricket-playing countries. The batsmen I have selected are:

1. Sachin Tendulkar (India)
2. Jacques Kallis (South Africa)
3. Ricky Ponting (Australia)
4. Virender Sehwag (India)
5. Sanath Jayasuriya (Sri Lanka)
6. Mahela Jayawardene (Sri Lanka)
7. Kumar Sangakkara (Sri Lanka)

To ensure a fair test, I have selected batsmen who have played 40+ ODI (50-over) matches over the course of 12 years. For each batsman I have recorded his average (ages 22-34) and his strike rate (runs per 100 balls). The data comes from the largest cricket database, ESPNcricinfo.

To discover whether there is a relationship between age and average, I will create 4 different types of graphs:

1. Line graph
2. Scatter plot
3. Trend line graph
4. Combination graph

To measure the correlation, I will calculate the correlation coefficient and add a regression equation where applicable. Using the regression equation, I will make a trend line graph showing the line of best fit, which will help me predict and analyse my data. With the data from the graphs, I will test my hypothesis and see whether age is truly just a number or whether it has an impact on performance. I will then look for similarities in the data and try to explain any discrepancies. To learn more about the correlation, or lack of it, I will also compare each batsman’s strike rate with his average to see whether it coincides with rises or falls in the average.

3

Data

Batsmen Average:

Batsmen Strike Rate

Information In Regards To Age

Information In Regards To Batsmen

Data:
● There were no outliers in the information sorted by age. There was one outlier in the information sorted by batsman (13.2, Kumar Sangakkara), but it was a minor outlier that had no major impact on the data. Check: 1.5 x 12.955 = 19.4325, and 33.46 - 19.4325 = 14.0275. 13.2 falls below the 14.0275 boundary.
● Mean = sum of all data values / number of data values
● Median = the value in position (n+1)/2 of the ordered data, where n = number of values
● Mode = most common value
● IQR = Q3 - Q1
● Max Value = highest value in data
● Min Value = lowest value in data
● R = Correlation Coefficient
● R^2 = Coefficient of Determination = how much of the variation in the dependent variable can be explained by the independent variable
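The outlier check above can be sketched in a few lines of Python. This is only an illustration: it assumes that 33.46 is the lower quartile (Q1) and 12.955 is the IQR, which is how the arithmetic reads but is not stated explicitly, and the function names `lower_fence` and `is_low_outlier` are hypothetical.

```python
def lower_fence(q1, iqr, k=1.5):
    """Lower outlier boundary: Q1 minus k times the interquartile range."""
    return q1 - k * iqr


def is_low_outlier(value, q1, iqr, k=1.5):
    """True when a value falls below the lower boundary."""
    return value < lower_fence(q1, iqr, k)


# Assumption: 33.46 is Q1 and 12.955 is the IQR of the per-batsman data.
Q1 = 33.46
IQR = 12.955

boundary = lower_fence(Q1, IQR)
print(f"Lower boundary: {boundary:.4f}")
print("13.2 is an outlier:", is_low_outlier(13.2, Q1, IQR))
```

The same helper with `q3 + k * iqr` would give the upper boundary, which no value in the data appears to exceed.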


Graphs

Scatter Plot: This will allow me to see any visual correlation between the data sets.

This scatter plot depicts the batting average of all the batsmen in relation to their age. Looking at the graph, there does not seem to be a convincing relationship. The averages fluctuate regardless of age, and there is no single age at which all batsmen are at a consistent high or a consistent low.

Line graph: Unlike the scatter plot, the line graph helps the viewer follow the path of each cricketer’s average. With a line graph, ups and downs become more apparent, and it gives a better idea of where cricketers were at certain points in their careers. The line between points also shows the slope, which makes local changes easier to see and emphasises rises in averages.


The line graph gives us a fairly good representation of the relationship between age and average. As you can see, the beginning of a cricketer’s career is extremely volatile. Some individuals, such as Jayasuriya and Sehwag, average extremely low to begin with, but as time passes their averages increase. As the graph depicts, there is a cluster around an average of 40 at the age of 34, with only Ricky Ponting, Sanath Jayasuriya and Virender Sehwag falling below the mark. At first glance, it may seem that a correlation, although extremely weak, is evident: all the batsmen start out volatile but average around 40 at the age of 34. To test the true strength of the correlation, I will work out the correlation coefficient.

Trend Line and Formula

As I aim to find the correlation between batting averages and age, I will be calculating the correlation coefficient for all 7 batsmen. This is the formula that I will be using:

r = (NΣxy - (Σx)(Σy)) / √([NΣx^2 - (Σx)^2][NΣy^2 - (Σy)^2])

1. N = number of pairs of scores
2. Σxy = sum of the products of paired scores
3. Σx = sum of x scores
4. Σy = sum of y scores
5. Σx^2 = sum of squared x scores
6. Σy^2 = sum of squared y scores
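The six quantities listed above are the inputs to the usual computational form of Pearson's correlation coefficient. The sketch below assumes that form, r = (NΣxy - ΣxΣy) / √([NΣx² - (Σx)²][NΣy² - (Σy)²]), and uses made-up ages and averages purely for illustration; they are not taken from the data tables.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw sums N, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when one variable never changes")
    return numerator / denominator


# Hypothetical example data (not from the report): age against batting average.
ages = [22, 23, 24, 25, 26]
averages = [30.0, 35.0, 33.0, 40.0, 42.0]
r = pearson_r(ages, averages)
print(f"r = {r:.4f}")
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none.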

The results were obtained and checked using a variety of methods: an online calculator, Google Sheets, a graphing calculator and calculation by hand. Here are my r values:

1. Sachin Tendulkar: r = -0.0768, r^2 = 0.0059
2. Jacques Kallis: r = 0.1744, r^2 = 0.0304
3. Ricky Ponting: r = -0.1269, r^2 = 0.0161
4. Virender Sehwag: r = 0.4411, r^2 = 0.1946
5. Sanath Jayasuriya: r = 0.7448, r^2 = 0.5547
6. Mahela Jayawardene: r = 0.3632, r^2 = 0.1319
7. Kumar Sangakkara: r = 0.4801, r^2 = 0.2305
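As a quick consistency check, each r^2 value should simply be the square of the corresponding r. The sketch below recomputes r^2 from the r values listed above and also expresses it as a percentage of the variation in average explained by age; the dictionary names are only illustrative.

```python
# r values for age against batting average, as listed in the report.
r_values = {
    "Sachin Tendulkar": -0.0768,
    "Jacques Kallis": 0.1744,
    "Ricky Ponting": -0.1269,
    "Virender Sehwag": 0.4411,
    "Sanath Jayasuriya": 0.7448,
    "Mahela Jayawardene": 0.3632,
    "Kumar Sangakkara": 0.4801,
}

# Square each r and round to 4 decimal places, matching the report's precision.
r_squared = {name: round(r * r, 4) for name, r in r_values.items()}

for name, r2 in r_squared.items():
    # r^2 as a percentage: share of variation in average explained by age.
    print(f"{name}: r^2 = {r2:.4f} ({r2 * 100:.2f}% explained by age)")
```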


To exhibit these values, I have created a trend line graph. It shows the ‘line of best fit’ for each batsman, which indicates the overall trend in a batsman’s average as the years go on. The trend line also helps us estimate interpolated and extrapolated values.
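A line of best fit is normally found by the least-squares method. The report does not say how its trend lines were produced, so the sketch below assumes least squares, and the ages and averages in it are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line of best fit; returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x about its mean, and the co-variation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example data (not from the report).
ages = [22, 23, 24, 25, 26]
averages = [30.0, 35.0, 33.0, 40.0, 42.0]
slope, intercept = fit_line(ages, averages)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```

The sign of the slope always matches the sign of r, so a rising trend line corresponds to a positive correlation and a falling one to a negative correlation.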

This trend line graph shows us the ‘trend’ for all batsmen. The graph highlights the fact that all the batsmen start out very volatile but end up consistently around 40. What’s important to note is that all batsmen have positive correlations, except for Ricky Ponting and Sachin Tendulkar. A positive correlation means that larger values on one scale are associated with larger values on the other. For the batsmen, bar Tendulkar and Ponting, this means that they had relatively lower averages at a younger age and their averages increased as they got older. A negative correlation means that larger values on one scale are associated with smaller values on the other. For Tendulkar and Ponting, this means that they started off with a relatively high average, and as they got older their average dropped. This is quite interesting, as I predicted that performance would improve as an individual got older; I was not expecting two cricketers to lose steam with age.

In this scenario, weak correlations are expected, as a near-perfect correlation is almost impossible. In the cricket world, many variables have an impact on average, so the chances of a correlation of 1 or -1 are extremely low. But as you can see, some individuals have a stronger correlation than others. Sanath Jayasuriya has a strong positive correlation, Virender Sehwag, Mahela Jayawardene and Kumar Sangakkara have moderate positive correlations, and Kallis has a weak positive correlation. Tendulkar and Ponting have weak negative correlations. From this, we can conclude that a cricketer does not necessarily improve as he gets older.

What intrigues me, though, is the r^2 value. The r^2 value can be presented as a percentage, and it shows how much of the variation in the dependent variable (average) can be explained by the independent variable (age). The low r^2 values tell me that there is more to cricket averages than just age, a lot more. Thus the next step in my investigation will be to compare cricket averages and cricket strike rate.

Cricket Background

In cricket, there is a statistic known as the strike rate. The strike rate is the average number of runs a batsman scores per 100 balls faced. The formula for strike rate is (total runs / total balls faced) x 100. To further explain the negative correlations of some batsmen, I'd like to compare the strike rate with the average to see if there is any relation. Over the past few years, cricket has become a faster game, meaning that more runs are being scored per 100 balls. So essentially, I'd like to further analyse the positive and negative correlations through the analysis of strike rate.
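The strike rate formula is simple enough to express directly. The sketch below is illustrative only; the function name and the sample innings are hypothetical.

```python
def strike_rate(runs, balls_faced):
    """Runs scored per 100 balls faced."""
    if balls_faced <= 0:
        raise ValueError("strike rate needs at least one ball faced")
    return runs * 100 / balls_faced


# Hypothetical innings: 45 runs from 50 balls, and 50 runs from 40 balls.
print(strike_rate(45, 50))
print(strike_rate(50, 40))
```

A strike rate above 100 means the batsman scores more than a run a ball, while a strike rate below 100 means less than a run a ball.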

Second Claim

As mentioned in my statement of intent, I am also comparing strike rate with average. I predict that as a cricketer’s strike rate increases, so will his average, and that as a cricketer’s strike rate decreases, so will his average. Mathematically, this means two positive correlations or two negative correlations. From a cricket perspective, this is because the imbalance between bat and ball has grown as the years have passed. Over the past decade, cricket has become more batting-friendly, making run-scoring easier, which in turn increases strike rates. So my claim is backed by the thought that as a cricketer’s rate of scoring increased, so did his average, and vice versa. Furthermore, in ‘modern cricket’ a batsman must score at a fast pace (high strike rate) in order to keep his position within the team. Those who play at a slower tempo are often moved around to fit team combinations, giving them fewer chances to prove their worth.

Graphs

In order to show the relation between average, strike rate and age, I have created a combination chart for each batsman. Each combination chart consists of a line graph to show strike rate and a bar graph to show average. I have also included a line of best fit for each data set to show the general trend of the data.

In the above graph, you can clearly see two negative correlations. What this means is that as Sachin Tendulkar got older, his average and strike rate both fell. This falls in line with my second claim, which suggests that as an individual's average falls, so will his strike rate. As we already know, Tendulkar's average has a weak negative correlation with age, which means he averaged more when he was younger than when he was older. What’s interesting to note is that his strike rate has a moderate to strong negative correlation with age, -0.5821, which means that his strike rate fell as he aged. As a whole this supports my claim: both sets of data have a negative correlation (very weak for average, moderate to strong for strike rate), which means that both his strike rate and his average fell as he aged.

In the above graph, you can clearly see two positive correlations. What this means is that as Kallis aged, his average and strike rate both increased. This satisfies my second claim that as the average rises, so does the strike rate. As mentioned earlier, Kallis has a weak positive correlation between age and average, which means he averaged higher as he aged. After working out the correlation coefficient for his strike rate (r = 0.5694), we can see that the positive correlation between his strike rate and age is stronger than the one for his average. Ultimately, this data corresponds with my claim that as the average increases, so will the strike rate.

The graph above depicts one positive correlation and one negative. What this means is that as he aged his average decreased, yet his strike rate increased. This challenges my claim, as there is a decrease in average and an increase in strike rate, something that I had predicted would not happen. As mentioned in ‘Trend Line and Formula’, Ricky Ponting has a weak negative correlation between age and average: larger averages are associated with younger ages and smaller averages with older ages. But thanks to the correlation coefficient formula, we can now see a strong positive correlation of 0.5459 between his strike rate and age, meaning that lower strike rates are associated with younger ages and higher strike rates with older ages. This was an unexpected result. Opposing correlations appear to be rare: Ricky Ponting is the only one of my seven batsmen whose average decreased while his strike rate kept rising.

The graph titled ‘Virender Sehwag Comparison’ analyzes the correlation between average and strike rate. As the trend lines show, there are two positive correlations. What this means is that as the player aged, his results got better (higher) in terms of both average and strike rate. We are already aware of his moderate positive correlation between age and average. By working out the r value for strike rate, 0.6168, we discover that the correlation between strike rate and age is strong: he had a lower strike rate at a younger age, and his strike rate improved as he aged. The results derived from the graph support my second claim, as I predicted an increase in both strike rate and average.

The graph above illustrates the correlation between strike rate, average and age. As shown visually, there are two positive correlations. This means that as the player aged, his performances improved. This supports my claim, as both variables increase with age. Using the correlation coefficient, the r value for strike rate against age works out to 0.3802. This positive correlation pairs with the previously mentioned weak positive correlation between age and average. To sum up, this graph backs my claim that both the strike rate and average will increase as time passes.

The above graph shows strike rate, average and age. Breaking down the graph, we can see two positive correlations. The r value of 0.4512 between strike rate and age indicates a moderately strong relationship. Pairing this positive r value with the one mentioned earlier for age and average, we can conclude that this data conforms to my second claim. I predicted that average and strike rate would increase or decrease together, and that is exactly what has happened here.

The graph above shows 3 factors: strike rate, average and age. From the previously given information and the graph, we can conclude that there are two positive correlations. What this means is that as age increases, so do strike rate and average. We can confirm the positive correlation between strike rate and age by working out the correlation coefficient. The answer we get is 0.7122, which indicates a strong positive correlation. This is evidence that the graph agrees with my claim, as both correlations shown on the graph are positive.
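The second claim amounts to saying that the two correlations (age against average, and age against strike rate) have the same sign. A small sketch of that check follows, using only the r values quoted in the text for the four batsmen whose comparison graphs are identified by name; the helper `same_direction` is a hypothetical name.

```python
# (r for age vs average, r for age vs strike rate), as quoted in the report.
correlations = {
    "Sachin Tendulkar": (-0.0768, -0.5821),
    "Jacques Kallis": (0.1744, 0.5694),
    "Ricky Ponting": (-0.1269, 0.5459),
    "Virender Sehwag": (0.4411, 0.6168),
}


def same_direction(r_average, r_strike_rate):
    """True when both correlations share a sign (both rise or both fall with age)."""
    return r_average * r_strike_rate > 0


for name, (r_avg, r_sr) in correlations.items():
    verdict = "supports" if same_direction(r_avg, r_sr) else "contradicts"
    print(f"{name}: {verdict} the second claim")
```

A sign check says nothing about strength: Tendulkar's pair agrees in sign even though his age-average correlation is very close to zero.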

Interpolation

To find the percentage error (the difference between expected and recorded results, as a percentage of the recorded result), I will use interpolation. The formula is already given in the “Batsmen average line graph”, so I only need to plug in an x value, 23. My formula becomes y = -0.245*23 + 48.5, where 48.5 is the y-intercept, 23 is the x value (age) and -0.245 is the slope. Working this out gives y = 42.865. This value is close to the actual value, 40.36. The results would have had to be identical for there to be no error, so I quantify the gap: error = estimated value - true value = 42.865 - 40.36 = 2.505. Substituting this into the percentage conversion equation (percentage error = (error / actual value) x 100) gives 2.505 / 40.36 x 100 = 6.21%. This shows that the expected and actual results agree fairly well, but as there is a percentage error of 6.21%, the agreement is not perfect. There are some discrepancies within the data, but they are small.
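The interpolation step can be reproduced with a short sketch. It uses the trend line y = -0.245x + 48.5 and the actual value 40.36 from the text, and it assumes the standard definition of percentage error, |estimate - actual| / actual x 100; the function names are hypothetical.

```python
SLOPE = -0.245      # slope of the trend line quoted in the report
INTERCEPT = 48.5    # y-intercept of the trend line quoted in the report


def predicted_average(age):
    """Average predicted by the trend line at a given age."""
    return SLOPE * age + INTERCEPT


def percentage_error(estimate, actual):
    """Standard percentage error: absolute gap as a share of the actual value."""
    if actual == 0:
        raise ValueError("percentage error is undefined for an actual value of 0")
    return abs(estimate - actual) / abs(actual) * 100


estimate = predicted_average(23)
print(f"Predicted average at 23: {estimate:.3f}")
print(f"Percentage error vs 40.36: {percentage_error(estimate, 40.36):.2f}%")
```

The same `predicted_average` function can be evaluated at an age outside the recorded range, which is what extrapolation does.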

Extrapolation

I’m curious to know how my data would look if it continued for 5 more years, which is why I am using extrapolation. Extrapolation allows me to look at the expected results 5 years beyond the graph, to see whether the trend is followed and to what extent. As mentioned earlier, the formula for the line is y = -0.245*x + 48.5. I now have to substitute an x value (age) that lies outside the range of my x values, either larger or smaller. I chose larger, as I am interested in how the player will perform in 5 years. Therefore I substitute 39 for x and get y = -0.245*39 + 48.5 = 38.945. According to the trend line, Sachin Tendulkar’s average drops at a constant rate of 0.245 runs per year of age. His real average at the age of 39 was 31.50, so the trend line overestimates it by about 7.4 runs, and the formula is only moderately close to the real thing. The predicted 38.945 is also not the lowest value recorded for him.

Validity

To ensure the validity of my data, I took careful steps throughout the entire process. Firstly, my data was carefully selected from the largest cricket database, ESPNcricinfo. After collecting the data, I made multiple graphs to check that my results were consistent across all forms: trend line graphs, line graphs, scatter plots and combination graphs. To test for the correlation between variables, I used the correlation coefficient, which gave me a reliable numerical measure of the relation between two variables. Furthermore, I took extreme care in the way data was transferred from the database to the graphs. I am confident about my mathematical procedure. All of the calculations were checked using a variety of methods, including calculators, online calculators, Google Sheets and working by hand, which helped limit mistakes. In addition, to support my opinions and conclusions, I drew on all of the data that I had collected rather than picking out a single data set.

Areas Of Improvement

To ensure a better investigation is carried out next time, I need to improve on a few aspects:

1. Firstly, if I had more time, I would expand the number of cricketers selected and the number of years covered for each individual. A larger sample size would give a wider spread of results and a better idea of the underlying trend.
2. Secondly, in this investigation I only chose batsmen. Next time I would like to include bowlers, so that I could determine whether a cricketer’s performance increases or decreases with age, rather than just a batsman’s. This would also help me understand how bowlers’ performance changes as they age.
3. Next time, I would also like to do more sets of extrapolation and compare them with the real data. This would help show the validity of the formula and the accuracy of the trend line.

Such improvements would take my exploration to the next level. I would be covering a wider range of cricketers (batsmen and bowlers), and a larger data set would bring my results closer to reality, rather than depending on 7 batsmen to explain the correlation between age and average for all cricketers.

Conclusion

In conclusion, through the use of a wide variety of graphs and formulas, I have found a very weak positive correlation between age and average. Out of the 7 batsmen tested, only 2 had a negative correlation, whereas the other 5 had an increase in average as they aged. The negative correlation for the 2 batsmen was somewhat explained through the analysis of the correlation between strike rate and age, meaning that their negative correlation may be the product of other factors. Although there is still some debate over whether age is irrelevant, I conclude that for now, it is truly just a number.

Bibliography

1. Espncricinfo.com
2. http://www.shortell.org/book/chap18.html
3. http://www.dmstat1.com/res/TheCorrelationCoefficientDefined.html
4. https://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r/
5. https://www.youtube.com/watch?v=JvS2triCgOY
6. https://www.youtube.com/watch?v=n7BhVpBSUjE
7. https://www.youtube.com/watch?v=B1HEzNTGeZ4
8. https://www.youtube.com/watch?v=6DYtC7lrVuY
9. https://www.youtube.com/watch?v=STSP8gTSdT8
10. https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/correlation-coefficient-formula/
11. https://www.youtube.com/watch?v=9aDHbRb4Bf8
