University of Groningen

The use of self-tracking technology for health
Kooiman, Theresia Johanna Maria

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below. Document Version Publisher's PDF, also known as Version of record

Publication date: 2018


Citation for published version (APA): Kooiman, T. J. M. (2018). The use of self-tracking technology for health: Validity, adoption, and effectiveness. Rijksuniversiteit Groningen.


Chapter 3 | Reliability and validity of ten consumer activity trackers depend on walking speed

Tryntsje Fokkema, Thea J.M. Kooiman, Wim P. Krijnen, Cees P. van der Schans, Martijn de Groot

Medicine and Science in Sports and Exercise (2017) 49(4):793-800


Abstract

Purpose
To examine the test-retest reliability and validity of ten activity trackers for step counting at three different walking speeds.

Methods
Thirty-one healthy participants walked twice on a treadmill for 30 minutes while wearing ten activity trackers (Polar Loop, Garmin Vivosmart, Fitbit Charge HR, Apple Watch Sport, Pebble Smartwatch, Samsung Gear S, Misfit Flash, Jawbone Up Move, Flyfit, and Moves). Participants walked at three speeds for ten minutes each: slow (3.2 km·h⁻¹), average (4.8 km·h⁻¹), and vigorous (6.4 km·h⁻¹). To measure test-retest reliability, intraclass correlations (ICCs) were determined between the first and second treadmill test. Validity was determined by comparing the trackers with the gold standard (hand counting), using mean differences, mean absolute percentage errors, and ICCs. Statistical differences were calculated by paired-sample t-tests, Wilcoxon signed-rank tests, and by constructing Bland-Altman plots.

Results
Test-retest reliability varied with ICCs ranging from -0.02 to 0.97. Validity varied between trackers and different walking speeds with mean differences between the gold standard and activity trackers ranging from 0.0 to 26.4%. Most trackers showed relatively low ICCs and broad limits of agreement of the Bland-Altman plots at the different speeds. For the slow walking speed, the Garmin Vivosmart and Fitbit Charge HR showed the most accurate results. The Garmin Vivosmart and Apple Watch Sport demonstrated the best accuracy at an average walking speed. For vigorous walking, the Apple Watch Sport, Pebble Smartwatch, and Samsung Gear S exhibited the most accurate results.

Conclusion
Test-retest reliability and validity of activity trackers depend on walking speed. In general, consumer activity trackers perform better at an average and vigorous walking speed than at a slower walking speed.

Introduction

Consumer activity trackers are an inexpensive and feasible method for estimating daily physical activity. As the availability of these devices has increased, so has their use in daily life, health care, and medical science. Two commonly used physical activity guidelines are 30 minutes of moderate to vigorous activity (MVPA) per day for at least five days a week,1 and the 10,000 steps/day norm.2 Research into a healthy amount of physical activity per day shows that taking at least 8000 to 11000 steps a day is related to many health benefits, such as better physical fitness, body composition, and glycemic control.2,3 When 3000 steps are taken at moderate to vigorous intensity, both guidelines correspond with each other.4 For physically inactive people (e.g. people who take on average less than 5000 steps/day), an increment of 2000 steps per day already relates to health improvements such as a better body composition and a decrease in BMI.5 Therefore, activity trackers have a large value in objectifying one's physical activity pattern and demonstrating changes in one's activity behavior. Activity trackers should therefore be reliable and valid.

Many trackers demonstrate acceptable validity and reliability of step counting; however, other activity trackers perform relatively inadequately.6,7 The accuracy of activity trackers that were recently released into the market is currently unknown. A common challenge of activity trackers is their validity for tracking activities at different walking speeds, including a slower walking speed.8,9 The latter could be an issue when self-tracking is used for the assessment of daily physical activity of patients with limited physical abilities or the elderly population.10,11 Validation of activity trackers at different speeds is thus important. This certainly applies to wearables that have recently entered the market. The aim of this study is therefore to examine the test-retest reliability and validity of ten relatively new activity trackers when walking at three different speeds.

Methods

Research design
A prospective study was conducted in a laboratory setting. Healthy adult volunteers were invited to walk two times for 30 minutes on a treadmill on different days (with approximately one week between the first and the second measurement). Each participant wore ten activity trackers. During the measurement phase, participants walked for half an hour at three different speeds (ten minutes each). First, they walked at a slow walking speed (3.2 km·h⁻¹), next at a speed that is usually experienced as a comfortable walking speed (4.8 km·h⁻¹), and finally at a vigorous walking speed (6.4 km·h⁻¹).12 Participants were instructed to walk in a natural way with a normal intuitive arm swing. During the measurements, the number of steps was counted with a manual hand counter by one observer; the number subsequently functioned as the gold standard. The measurements were also recorded with a


video camera as a backup. The three ten-minute time slots of the treadmill test were measured with software from the Optogait system (OPTOGait, Microgate S.r.l., Italy, 2010). Before and after each time slot the number of steps as recorded by the trackers was manually entered in a dedicated research form. During registration participants were asked to stand still with their hands on the handrails of the treadmill. The number of steps was read either directly from the tracker's display or from the corresponding application, which was installed on an iPod touch (2014, Model A1509, Apple Inc., Cupertino, CA, USA). This was the case for the Misfit Flash, Jawbone Up Move and the Flyfit. Observers typically waited for one or two minutes before registration to allow the trackers to make a Bluetooth or Wi-Fi connection with the iPod for synchronization. The registration phase usually took no more than five minutes.

Participants
Thirty-one healthy adults volunteered to participate in this study (16 males and 15 females; mean ± SD age 32 ± 12 years; mean ± SD BMI 22.6 ± 2.4 kg·m⁻²). Participants were recruited by flyer advertisement and by word of mouth within the Hanze University of Applied Sciences, Groningen, the Netherlands. Participants were informed about the test procedures and signed an informed consent form prior to the study. The research was performed in accordance with the Declaration of Helsinki and an exemption for a comprehensive application was obtained from the Medical Ethical Committee of the University Medical Center of Groningen.

Activity trackers
In this study nine activity trackers and one smartphone application were examined. A manual hand counter (Voltcraft, Conrad Electronic SE, Hirschau, Germany) was used as the gold standard. On their right wrist participants wore the Garmin Vivosmart (2014, Garmin International Inc., Olathe, KS, USA) at the distal side, the Fitbit Charge HR (2014, Fitbit Inc., San Francisco, CA, USA) in the middle and the Polar Loop (2013, Polar Electro Oy, Kempele, Finland) at the proximal side. Three smartwatches were placed on their left wrist: the Apple Watch Sport (2015, Apple Inc., Cupertino, CA, USA) at the distal side, the Pebble Smartwatch (2014, Pebble Technology Corp., Redwood City, CA, USA) in the middle, and the Samsung Gear S (2014, Samsung Electronics Co, Ltd., Seoul, South Korea) at the proximal side. The Misfit Flash (2014, Misfit Wearables, Burlingame, CA, USA) and Jawbone Up Move (2014, Jawbone Inc., Beverly Hills, CA, USA) were attached at their right hip to the belt of their trousers. The Flyfit (2014, Flyfit Inc., San Francisco, CA, USA) was worn on their right ankle. Finally, a smartphone (Samsung S5 Active, Samsung Electronics Co, Ltd., Seoul, South Korea) on which the Moves application (ProtoGeo, Helsinki, Finland) was installed was placed in the front or back pocket of their trousers.

Statistical analyses
Descriptive statistics and their corresponding 95% confidence intervals were determined for all variables.

Test-retest reliability was determined by calculating the ICCs between Session 1 and Session 2 (two-way random, absolute agreement, single measures) with 95% confidence intervals. An ICC > 0.90 was considered as excellent, 0.75 - 0.90 as good, 0.60 - 0.75 as moderate, and < 0.60 as low.13 Because negative values of the ICC theoretically do not exist (7,23), negative values were set to zero. Additionally, test-retest reliability was assessed by calculating the mean differences and the mean absolute percentage errors (MAPE) between the sessions. Significant mean differences were investigated by paired-sample t-tests and Wilcoxon signed-rank tests.

Validity was assessed by the mean difference and the mean absolute percentage errors (MAPE) between the gold standard and the activity trackers. According to Feito et al,14,15 a MAPE exceeding 5% can be considered as a practically relevant difference. Therefore, a 5% cut-off criterion was utilized for the MAPE. To determine the agreement between the gold standard and activity trackers, Bland-Altman plots with the associated limits of agreement were constructed. In addition, the agreement between the gold standard and the activity trackers was determined by calculating intraclass correlation coefficients (ICC) (two-way random, absolute agreement, single measures with 95% confidence intervals).

All statistical analyses were performed using SPSS 23 (SPSS Inc., Chicago, IL, USA), with a significance level of 5%. To correct for multiple testing, the significance level for the paired-sample t-tests and Wilcoxon tests was adjusted by using the post hoc correction method of Bonferroni.16 This resulted in an alpha of 0.0045 for the test-retest analyses, and an alpha of 0.005 for the validity analyses.

Results

On average participants walked 947 ± 54 steps at 3.2 km·h⁻¹, 1112 ± 45 steps at 4.8 km·h⁻¹ and 1254 ± 53 steps at 6.4 km·h⁻¹, as measured with the gold standard. The mean number of steps measured by the activity trackers and their 95% confidence intervals during both sessions are depicted in Figures 1, 2, and 3. No sex differences were found in the results. The number of participants in the different conditions varied (from n=31 to n=21). This variation was mainly due to a number of occasions in which there was a delay in synchronization, leading to underestimation of the number of steps in the preceding time slot and occasionally an overestimation in the next. Every observed delay was recorded in a diary and the metrics involved were excluded from data analysis. This resulted in a lower number of observations for some trackers (especially Flyfit, Misfit Flash, Jawbone Up Move).
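The agreement statistics named under Statistical analyses can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study. The step counts are hypothetical, the MAPE follows the conventional definition (Table 1 reports some negative values, so the authors' exact formula may differ), and the Bonferroni divisors of 11 and 10 comparisons are an assumption that happens to reproduce the reported alphas of 0.0045 and 0.005; the source does not state them.

```python
from statistics import mean, stdev


def mape(reference, measured):
    """Mean absolute percentage error of `measured` against `reference`."""
    return 100.0 * mean(abs(m - r) / r for r, m in zip(reference, measured))


def limits_of_agreement(reference, measured):
    """Bland-Altman bias and 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [m - r for r, m in zip(reference, measured)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample SD of the paired differences
    return bias, bias - spread, bias + spread


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical step counts for five participants (not data from the study).
hand_count = [950, 1000, 900, 980, 1020]
tracker = [940, 1010, 850, 985, 1000]

error_pct = mape(hand_count, tracker)
bias, lower, upper = limits_of_agreement(hand_count, tracker)
exceeds_cutoff = error_pct > 5.0  # 5% criterion described in the text

# Assumed divisors: 11 comparisons (test-retest) and 10 (validity).
alpha_retest = bonferroni_alpha(0.05, 11)
alpha_validity = bonferroni_alpha(0.05, 10)
```

The width of the limits of agreement (`upper - lower`) is presumably the quantity reported in steps in the Validity section, where narrower limits indicate closer agreement with the hand count.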


Figure 1. Mean number of steps and 95% confidence interval (95% CI) of the activity trackers at 3.2 km·h⁻¹ during session 1 (a) and session 2 (b). The horizontal lines represent the mean number of steps of the gold standard (953 ± 46 and 940 ± 61 steps respectively).

Figure 2. Mean number of steps and 95% confidence interval (95% CI) of the activity trackers at 4.8 km·h⁻¹ during session 1 (a) and session 2 (b). The horizontal lines represent the mean number of steps of the gold standard (1117 ± 44 and 1108 ± 46 steps respectively).

Figure 3. Mean number of steps and 95% confidence interval (95% CI) of the activity trackers at 6.4 km·h⁻¹ during session 1 (a) and session 2 (b). The horizontal lines represent the mean number of steps of the gold standard (1259 ± 53 and 1251 ± 54 steps respectively).

Test-retest reliability
The outcome measures of test-retest reliability are shown in Table 1. At 3.2 km·h⁻¹, the mean differences between Sessions 1 and 2 varied from seven steps (MAPE 0.7%, Apple) to 75 steps (MAPE 9.0%, Flyfit). At 4.8 km·h⁻¹, the mean differences varied from three steps (MAPE -0.3%, Moves) to 93 steps (MAPE 8.6%, Polar Loop) and differed significantly for the Apple Watch Sport. At 6.4 km·h⁻¹, the mean differences varied from zero steps (MAPE 0.0%, Pebble Smartwatch) to 40 steps (MAPE 3.5%, Garmin Vivosmart).

The ICCs of the gold standard were good at slow and average walking speeds (0.76 and 0.87 respectively) and excellent at a vigorous walking speed (0.93). At 3.2 km·h⁻¹, the ICCs of the trackers ranged from -0.02 (Moves) to 0.97 (Samsung Gear S). At this slowest walking speed, most of the trackers demonstrated low ICCs. The Moves showed a very low ICC, while the Polar Loop and Fitbit Charge HR showed moderate ICCs, the Garmin Vivosmart exhibited a good ICC, and the Samsung Gear S showed an excellent ICC. At 4.8 km·h⁻¹, the ICCs of the trackers ranged from 0.00 (Jawbone) to 0.86 (Samsung Gear S). The Fitbit Charge HR demonstrated a moderate ICC and Samsung Gear S showed a good ICC. All of the other trackers showed low ICCs at this average walking speed. At 6.4 km·h⁻¹, the ICCs of the trackers ranged from 0.14 (Misfit) to 0.93 (Samsung Gear S). Here the Polar Loop, Misfit Flash, and Flyfit showed low ICCs, while the Garmin Vivosmart, Fitbit Charge HR, Jawbone Up Move, and Moves showed moderate ICCs. There were two trackers (Apple Watch Sport and Pebble Smartwatch) that indicated good ICCs and one tracker (Samsung Gear S) that showed an excellent ICC at the vigorous walking speed.
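The verbal labels used above follow the cut-offs given under Statistical analyses, and the mapping can be written as a small Python sketch. The function name is hypothetical, and the handling of the exact boundary values (0.90 counted as good, 0.75 as good, 0.60 as moderate) is an assumption, since the source gives overlapping ranges. The ICC values are the test-retest results at 3.2 km·h⁻¹ from Table 1.

```python
def classify_icc(icc):
    """Label an ICC using the cut-offs given under Statistical analyses."""
    value = max(icc, 0.0)  # negative ICCs are set to zero, as in the text
    if value > 0.90:
        return "excellent"
    if value >= 0.75:  # boundary handling is an assumption
        return "good"
    if value >= 0.60:  # boundary handling is an assumption
        return "moderate"
    return "low"


# Test-retest ICCs at 3.2 km/h as reported in Table 1.
slow_speed_iccs = {
    "Polar Loop": 0.74,
    "Garmin Vivosmart": 0.79,
    "Fitbit Charge HR": 0.73,
    "Apple Watch Sport": 0.38,
    "Pebble Smartwatch": 0.56,
    "Samsung Gear S": 0.97,
    "Misfit Flash": 0.48,
    "Jawbone Up Move": 0.07,
    "Flyfit": 0.15,
    "Moves": -0.02,
}

labels = {name: classify_icc(icc) for name, icc in slow_speed_iccs.items()}
```

With these inputs the labels agree with the text: moderate for the Polar Loop and Fitbit Charge HR, good for the Garmin Vivosmart, excellent for the Samsung Gear S, and low for the remaining trackers.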


Table 1. Validity Test-retest reliability measures of session 1 versus session 2: mean differences (session 1 - session 2) ± The outcome measures of the validity tests are shown in Tables 2 and 3. Each column standard error (SE), mean absolute percentage error (MAPE), scores on the paired samples t-test and Wilcoxon signed-rank test, intraclass correlation coefficient (ICC), and the corresponding 95% confidence intervals (95% contains 30 measurements (ten trackers at three different speeds). A total number of 12 out CI). of 30 measurements showed a significant difference of the mean number of steps compared to the gold standard in Session 1, and 12 measurements showed a significant difference in Speed N Mean difference ± SE MAPE (%) t-valuea/ ICC 95% CIc (km·h -1) Z-valueb Session 2 assessed by either the paired samples t-test or the Wilcoxon signed-rank test. With increasing speed, the MAPE decreased for the Polar Loop, Pebble Smartwatch, Samsung Hand counter 3.2 31 13 ± 6 1.3 1.98a 0.76** 0.56 - 0.88 Gear S, Misfit Flash, Jawbone UP Move, Flyfit, and Moves. It was fairly constant for the Apple 4.8 31 10 ± 4 0.9 2.59a 0.87** 0.72 - 0.94 Watch Sport at all three speeds. The MAPE increased for the Garmin Vivosmart and the a 6.4 30 4 ± 4 0.3 1.01 0.93** 0.86 - 0.97 Fitbit Charge HR with accelerating speed. At a walking speed of 3.2 km·h-1, the Polar Loop, Polar Loop 3.2 31 9 ± 46 1.3 -0.01b 0.74** 0.52 - 0.87 Misfit Flash, Jawbone Up Move, Flyfit, and Moves had a MAPE exceeding 5%. The MAPE of 4.8 30 93 ± 41 8.6 -2.66b 0.15 -0.17 - 0.46 the Pebble Smartwatch and Samsung Gear S was higher than 5% during Session 1, but under 6.4 29 -2 ± 19 -0.1 -0.13b 0.49** 0.15 - 0.72 5% during Session 2. All other trackers had a MAPE less than 5% at the slowest walking Garmin Vivosmart 3.2 31 12 ± 7 1.2 1.73a 0.79** 0.60 - 0.89 speed. 
At 4.8 km·h-1, the MAPE of the Misfit Flash, Jawbone Up Move, and Flyfit was more 4.8 31 16 ± 7 1.4 2.16a 0.51** 0.20 - 0.72 than 5% during both sessions. The Jawbone Up Move had a MAPE over 5% only during 6.4 30 40 ± 22 3.5 -1.58b 0.72** 0.49 - 0.86 Session 1 while the Polar Loop obtained this only during Session 2. Finally, at 6.4 km·h-1, Fitbit Charge HR 3.2 31 12 ± 10 1.2 -1.28b 0.73** 0.51 - 0.86

4.8 31 7 ± 10 0.6 -1.47b 0.70** 0.46 - 0.84 most of the trackers had a MAPE under 5% except for the Garmin Vivosmart, Fitbit Charge

HR, and Misfit Flash. The Flyfit had a MAPE of less than 5% during Session 1, however, it exceeded 5% during Session 2.

The limits of agreement of the Bland-Altman plots are presented in Tables 2 and 3. At 3.2 km·h-1, the Garmin Vivosmart had the narrowest limits of agreement (49 steps, Session 1), while the Polar Loop exhibited the broadest limits of agreement (1298 steps, Session 1). At 4.8 km·h-1, the Garmin Vivosmart had the narrowest limits of agreement (35 steps, Session 2) and the Misfit Flash had the broadest limits of agreement (1104 steps, Session 2). At 6.4 km·h-1, the Samsung Gear S had the narrowest limits of agreement (73 steps, Session 2) and the Misfit Flash had the broadest limits of agreement (1029 steps, Session 2).

The ICCs (Tables 2 and 3) at 3.2 km·h-1 ranged from 0 (Samsung Gear S, Session 2 and Moves, Session 2) to 0.95 (Garmin Vivosmart, Sessions 1 and 2) while, at 4.8 km·h-1, ICCs ranged from 0 (Moves, Sessions 1 and 2, Polar Loop, Session 2) to 0.98 (Garmin Vivosmart, Session 2). ICCs at 6.4 km·h-1 ranged from 0 (Garmin Vivosmart, Session 2) to 0.92 (Samsung Gear S, Session 2). Generally, ICCs were higher at vigorous walking speed compared to the slow walking speed except for Garmin Vivosmart and Fitbit Charge HR which showed the highest ICCs at the slow walking speed.

| Activity tracker | Speed (km·h-1) | N | Mean difference ± SE | MAPE (%) | t-value a / Z-value b | ICC | 95% CI c |
|---|---|---|---|---|---|---|---|
| Fitbit Charge HR | 6.4 | 30 | 33 ± 14 | 2.8 | -2.37b | 0.65** | 0.38 - 0.82 |
| Apple Watch Sport | 3.2 | 30 | 7 ± 16 | 0.7 | -0.73b | 0.38* | 0.02 - 0.65 |
| Apple Watch Sport | 4.8 | 28 | 41 ± 13 | 3.7 | -3.22b # | 0.48** | 0.12 - 0.73 |
| Apple Watch Sport | 6.4 | 28 | -2 ± 8 | -0.1 | -0.18a | 0.80** | 0.61 - 0.90 |
| Pebble Smartwatch | 3.2 | 31 | -16 ± 17 | -1.8 | -0.12b | 0.56** | 0.26 - 0.76 |
| Pebble Smartwatch | 4.8 | 31 | -7 ± 20 | -0.6 | -1.70b | 0.33* | -0.03 - 0.61 |
| Pebble Smartwatch | 6.4 | 30 | 0 ± 5 | 0.0 | 0.06a | 0.89** | 0.79 - 0.95 |
| Samsung Gear S | 3.2 | 29 | 10 ± 9 | 1.1 | -0.98b | 0.97** | 0.93 - 0.98 |
| Samsung Gear S | 4.8 | 30 | 4 ± 10 | 0.3 | -1.88b | 0.86** | 0.73 - 0.93 |
| Samsung Gear S | 6.4 | 30 | 3 ± 3 | 0.2 | 0.73a | 0.93** | 0.86 - 0.97 |
| Misfit Flash | 3.2 | 22 | 74 ± 56 | 9.1 | -0.92b | 0.48** | 0.10 - 0.74 |
| Misfit Flash | 4.8 | 23 | 25 ± 60 | 2.4 | -0.42b | 0.03 | -0.40 - 0.44 |
| Misfit Flash | 6.4 | 22 | -2 ± 62 | -0.1 | -1.25b | 0.14 | -0.31 - 0.53 |
| Jawbone Up Move | 3.2 | 28 | 37 ± 38 | 4.2 | -0.47b | 0.07 | -0.31 - 0.42 |
| Jawbone Up Move | 4.8 | 30 | -29 ± 35 | -2.7 | -0.82a | 0.00 | -0.36 - 0.36 |
| Jawbone Up Move | 6.4 | 29 | 10 ± 10 | 0.8 | -1.10b | 0.65** | 0.38 - 0.82 |
| Flyfit | 3.2 | 23 | 75 ± 62 | 9.0 | -1.25b | 0.15 | -0.26 - 0.52 |
| Flyfit | 4.8 | 22 | 32 ± 32 | 3.0 | -1.67b | 0.58** | 0.23 - 0.80 |
| Flyfit | 6.4 | 18 | 11 ± 42 | 0.9 | -0.63b | 0.46* | -0.01 - 0.76 |
| Moves | 3.2 | 25 | -57 ± 75 | -6.9 | -0.76a | -0.02 | -0.42 - 0.37 |
| Moves | 4.8 | 26 | -3 ± 30 | -0.3 | -0.11a | 0.49** | 0.13 - 0.74 |
| Moves | 6.4 | 28 | -8 ± 22 | -0.7 | -0.38a | 0.66** | 0.38 - 0.83 |

# p<0.0045; * p<0.05; ** p<0.01; a paired samples t-test; b Wilcoxon signed-rank test in case of a non-normal distribution; c 95% CI of the ICC.
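As a rough illustration of the validity measures reported in this section, the sketch below computes a mean absolute percentage error and Bland-Altman limits of agreement for paired step counts. It assumes the conventional definitions (MAPE as the mean of |hand count - tracker count| / hand count × 100, and limits of agreement as the mean difference ± 1.96 standard deviations of the differences), which may differ in detail from the calculations behind the tables. The function names and step counts are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def mape(criterion, tracker):
    """Mean absolute percentage error of tracker counts relative to the criterion."""
    return mean(abs(c - t) / c * 100 for c, t in zip(criterion, tracker))


def limits_of_agreement(criterion, tracker):
    """Bland-Altman limits: mean difference +/- 1.96 * SD of the differences."""
    diffs = [c - t for c, t in zip(criterion, tracker)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias - spread, bias + spread


# Hypothetical step counts for five participants (invented, not study data).
hand_counts = [1000, 1020, 980, 1010, 990]
tracker_counts = [990, 1000, 985, 960, 990]

error = mape(hand_counts, tracker_counts)
lower, upper = limits_of_agreement(hand_counts, tracker_counts)
print(f"MAPE: {error:.2f}%")
print(f"Limits of agreement: {lower:.1f} to {upper:.1f} steps (width {upper - lower:.1f})")
```

In this toy example a single participant with a large undercount widens the limits of agreement considerably while the MAPE stays low, which is the pattern of low average error with broad individual variation that the chapter describes for several trackers.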


Table 2. Validity measures of session 1: mean differences (hand counter – activity tracker) ± standard error (SE), mean absolute percentage error (MAPE), scores on the paired samples t-test and Wilcoxon signed-rank test, limits of agreement of the Bland-Altman plots, intraclass correlation coefficient (ICC), and the corresponding 95% confidence intervals (95% CI).

| Activity tracker | Speed (km·h-1) | N | Mean difference ± SE | MAPE (%) | t-value a / Z-value b | Limits of agreement, lower | Limits of agreement, upper | ICC | 95% CI c |
|---|---|---|---|---|---|---|---|---|---|
| Polar Loop | 3.2 | 31 | 252 ± 59 | 26.4 | -3.86b # | -397 | 901 | 0.08 | -0.15 - 0.35 |
| Polar Loop | 4.8 | 31 | 34 ± 15 | 3.0 | -2.06b | -127 | 195 | 0.26 | -0.06 - 0.54 |
| Polar Loop | 6.4 | 31 | 45 ± 20 | 3.6 | -3.08b # | -174 | 264 | 0.24 | -0.09 - 0.53 |
| Garmin Vivosmart | 3.2 | 31 | 10 ± 2 | 1.0 | 4.36a # | -15 | 34 | 0.95** | 0.78 - 0.98 |
| Garmin Vivosmart | 4.8 | 31 | -2 ± 7 | -0.2 | -0.32a | -81 | 77 | 0.57** | 0.27 - 0.77 |
| Garmin Vivosmart | 6.4 | 31 | 114 ± 27 | 9.0 | -4.44b # | -177 | 404 | 0.10 | -0.14 - 0.36 |
| Fitbit Charge HR | 3.2 | 31 | -7 ± 9 | -0.7 | -0.81 | -101 | 87 | 0.62** | 0.35 - 0.80 |
| Fitbit Charge HR | 4.8 | 31 | 22 ± 13 | 2.0 | 1.70a | -118 | 162 | 0.20 | -0.14 - 0.50 |
| Fitbit Charge HR | 6.4 | 31 | 65 ± 14 | 5.2 | -4.61b # | -83 | 214 | 0.31** | -0.05 - 0.60 |
| Apple Watch Sport | 3.2 | 30 | 18 ± 9 | 1.9 | 2.14a | -74 | 111 | 0.57** | 0.27 - 0.77 |
| Apple Watch Sport | 4.8 | 29 | 0 ± 3 | 0.0 | -0.09a | -36 | 35 | 0.93** | 0.86 - 0.97 |
| Apple Watch Sport | 6.4 | 30 | 6 ± 5 | 0.5 | 1.24a | -45 | 56 | 0.91** | 0.82 - 0.95 |
| Pebble Smartwatch | 3.2 | 31 | 57 ± 19 | 6.0 | -4.29b # | -154 | 269 | 0.28* | -0.03 - 0.56 |
| Pebble Smartwatch | 4.8 | 31 | 32 ± 19 | 2.9 | -4.08b # | -179 | 244 | 0.34* | 0.00 - 0.61 |
| Pebble Smartwatch | 6.4 | 31 | 16 ± 5 | 1.3 | 3.37a # | -36 | 67 | 0.86** | 0.65 - 0.94 |
| Samsung Gear S | 3.2 | 31 | 53 ± 32 | 5.6 | -2.01b | -300 | 406 | 0.04 | -0.30 - 0.37 |
| Samsung Gear S | 4.8 | 31 | 45 ± 22 | 4.0 | 2.05a | -195 | 285 | 0.02 | -0.29 - 0.34 |
| Samsung Gear S | 6.4 | 31 | 14 ± 5 | 1.1 | 3.15a # | -36 | 65 | 0.85** | 0.63 - 0.93 |
| Misfit Flash | 3.2 | 25 | 144 ± 45 | 15.2 | -3.84b # | -298 | 586 | 0.06 | -0.21 - 0.38 |
| Misfit Flash | 4.8 | 25 | 60 ± 31 | 5.4 | -1.71b | -241 | 362 | 0.26 | -0.10 - 0.58 |
| Misfit Flash | 6.4 | 28 | 75 ± 47 | 6.0 | 1.62a | -409 | 560 | 0.11 | -0.24 - 0.45 |
| Jawbone Up Move | 3.2 | 29 | 83 ± 26 | 8.7 | 3.17a # | -193 | 358 | 0.12 | -0.16 - 0.42 |
| Jawbone Up Move | 4.8 | 31 | 65 ± 30 | 5.9 | 2.16a | -265 | 396 | 0.09 | -0.22 - 0.41 |
| Jawbone Up Move | 6.4 | 30 | 15 ± 8 | 1.2 | 1.78a | -75 | 105 | 0.71** | 0.47 - 0.85 |
| Flyfit | 3.2 | 28 | 154 ± 30 | 16.1 | 5.13a # | -157 | 464 | 0.18 | -0.10 - 0.47 |
| Flyfit | 4.8 | 26 | 60 ± 25 | 5.3 | -2.25b | -192 | 312 | 0.31* | -0.04 - 0.60 |
| Flyfit | 6.4 | 21 | 29 ± 39 | 2.3 | 0.75a | -317 | 375 | 0.27 | -0.18 - 0.62 |
| Moves | 3.2 | 29 | 133 ± 51 | 14.0 | -2.01b | -406 | 671 | 0.15 | -0.16 - 0.46 |
| Moves | 4.8 | 29 | -29 ± 23 | -2.6 | -1.30a | -268 | 209 | 0 | -0.52 - 0.17 |
| Moves | 6.4 | 29 | 4 ± 24 | 0.3 | 0.15a | -253 | 261 | 0.25 | -0.14 - 0.56 |

# p<0.005; * p<0.05; ** p<0.01; a paired samples t-test; b Wilcoxon signed-rank test in case of a non-normal distribution; c 95% CI of the ICC.

Table 3. Validity measures of session 2: mean differences (hand counter – activity tracker) ± standard error (SE), mean absolute percentage error (MAPE), scores on the paired samples t-test and Wilcoxon signed-rank test, limits of agreement of the Bland-Altman plots, intraclass correlation coefficient (ICC), and the corresponding 95% confidence intervals (95% CI).

| Activity tracker | Speed (km·h-1) | N | Mean difference ± SE | MAPE (%) | t-value a / Z-value b | Limits of agreement, lower | Limits of agreement, upper | ICC | 95% CI c |
|---|---|---|---|---|---|---|---|---|---|
| Polar Loop | 3.2 | 31 | 248 ± 58 | 26.3 | -3.34b # | -387 | 882 | 0.09 | -0.14 - 0.36 |
| Polar Loop | 4.8 | 30 | 119 ± 44 | 10.7 | -3.53b # | -350 | 588 | 0 | -0.28 - 0.30 |
| Polar Loop | 6.4 | 29 | 38 ± 12 | 3.0 | -2.74b | -92 | 168 | 0.42** | 0.07 - 0.68 |
| Garmin Vivosmart | 3.2 | 31 | 9 ± 3 | 0.9 | 2.51a | -29 | 46 | 0.95** | 0.88 - 0.98 |
| Garmin Vivosmart | 4.8 | 31 | 4 ± 2 | 0.3 | 2.41a | -14 | 21 | 0.98** | 0.95 - 0.99 |
| Garmin Vivosmart | 6.4 | 30 | 149 ± 34 | 11.9 | -4.17b # | -220 | 518 | 0 | -0.26 - 0.20 |
| Fitbit Charge HR | 3.2 | 31 | -8 ± 9 | -0.9 | -0.13b | -108 | 92 | 0.74** | 0.54 - 0.87 |
| Fitbit Charge HR | 4.8 | 31 | 19 ± 13 | 1.7 | -1.93b | -122 | 160 | 0.27 | -0.07 - 0.56 |
| Fitbit Charge HR | 6.4 | 30 | 96 ± 20 | 7.7 | -4.26b # | -116 | 308 | 0.15 | -0.10 - 0.42 |
| Apple Watch Sport | 3.2 | 31 | 13 ± 10 | 1.4 | -0.83b | -98 | 124 | 0.73** | 0.52 - 0.86 |
| Apple Watch Sport | 4.8 | 30 | 29 ± 12 | 2.6 | -2.20b | -101 | 159 | 0.52** | 0.21 - 0.74 |
| Apple Watch Sport | 6.4 | 29 | 1 ± 6 | 0.1 | 0.15a | -65 | 67 | 0.86** | 0.72 - 0.93 |
| Pebble Smartwatch | 3.2 | 31 | 29 ± 6 | 3.0 | 4.67a # | -38 | 96 | 0.78** | 0.33 - 0.92 |
| Pebble Smartwatch | 4.8 | 31 | 16 ± 6 | 1.4 | 2.65a | -48 | 80 | 0.77** | 0.54 - 0.89 |
| Pebble Smartwatch | 6.4 | 30 | 13 ± 4 | 1.0 | 3.51a # | -27 | 52 | 0.91** | 0.72 - 0.96 |
| Samsung Gear S | 3.2 | 29 | 45 ± 41 | 4.8 | -0.16b | -387 | 477 | 0 | -0.49 - 0.22 |
| Samsung Gear S | 4.8 | 30 | 38 ± 17 | 3.5 | -4.08b # | -149 | 225 | 0.17 | -0.15 - 0.48 |
| Samsung Gear S | 6.4 | 30 | 9 ± 3 | 0.8 | 2.74a | -27 | 46 | 0.92** | 0.81 - 0.97 |
| Misfit Flash | 3.2 | 27 | 170 ± 47 | 18.1 | -4.16b # | -312 | 652 | 0.15 | -0.13 - 0.45 |
| Misfit Flash | 4.8 | 27 | 94 ± 54 | 8.5 | -0.86b | -458 | 646 | 0.07 | -0.27 - 0.42 |
| Misfit Flash | 6.4 | 25 | 93 ± 53 | 7.5 | -2.29b | -421 | 608 | 0.08 | -0.28 - 0.44 |
| Jawbone Up Move | 3.2 | 29 | 110 ± 26 | 11.7 | -4.12b # | -160 | 381 | 0.17 | -0.11 - 0.45 |
| Jawbone Up Move | 4.8 | 30 | 29 ± 11 | 2.6 | 2.71a | -85 | 143 | 0.56** | 0.23 - 0.77 |
| Jawbone Up Move | 6.4 | 30 | 21 ± 8 | 1.6 | -3.40b # | -60 | 101 | 0.72** | 0.45 - 0.86 |
| Flyfit | 3.2 | 26 | 185 ± 47 | 19.5 | -4.20b # | -281 | 651 | 0.17 | -0.11 - 0.47 |
| Flyfit | 4.8 | 27 | 77 ± 26 | 7.0 | -3.36b # | -185 | 339 | 0.32* | -0.02 - 0.61 |
| Flyfit | 6.4 | 27 | 85 ± 46 | 6.8 | -2.45b | -386 | 556 | 0.08 | -0.26 - 0.43 |
| Moves | 3.2 | 27 | 119 ± 63 | 12.6 | -1.57b | -524 | 762 | 0 | -0.41 - 0.27 |
| Moves | 4.8 | 28 | -9 ± 43 | -0.8 | -1.32b | -460 | 442 | 0 | -0.49 - 0.26 |
| Moves | 6.4 | 30 | -3 ± 20 | -0.2 | -1.51b | -220 | 215 | 0.37* | 0.06 - 0.64 |

# p<0.005; * p<0.05; ** p<0.01; a paired samples t-test; b Wilcoxon signed-rank test in case of a non-normal distribution; c 95% CI of the ICC.
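The widths of the limits of agreement quoted in the results (for example 49 and 1298 steps at 3.2 km·h-1 in Session 1) appear to be the upper limit minus the lower limit listed in Tables 2 and 3. The short sketch below checks this for the six trackers named in the text; the limits are copied from the tables, while the dictionary layout and the helper name `loa_width` are illustrative assumptions.

```python
# (lower, upper) limits of agreement in steps, copied from Tables 2 and 3.
limits = {
    ("Garmin Vivosmart", 3.2, "Session 1"): (-15, 34),
    ("Polar Loop", 3.2, "Session 1"): (-397, 901),
    ("Garmin Vivosmart", 4.8, "Session 2"): (-14, 21),
    ("Misfit Flash", 4.8, "Session 2"): (-458, 646),
    ("Samsung Gear S", 6.4, "Session 2"): (-27, 46),
    ("Misfit Flash", 6.4, "Session 2"): (-421, 608),
}


def loa_width(lower, upper):
    """Width of the limits of agreement in steps (hypothetical helper)."""
    return upper - lower


widths = {key: loa_width(*bounds) for key, bounds in limits.items()}
for (tracker, speed, session), width in widths.items():
    print(f"{tracker} at {speed} km/h, {session}: {width} steps")
```

The resulting widths can be compared directly with the narrowest and broadest values reported per walking speed in the text.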


Discussion

The aim of this study was to examine the test-retest reliability and validity of ten relatively new activity trackers at three different walking speeds. In general, the results showed that validity and reliability are strongly influenced by walking speed. Though most trackers showed acceptable validity scores at an average walking speed, none of the trackers had valid step counting measures at all three walking speeds. Most trackers seem to underestimate the number of steps at each walking speed (with a few exceptions: the Fitbit at the slowest speed and the Moves app at the average and fastest speeds). A systematic underestimation at slow walking speed has been described before in the literature.17 Also, mobile applications have been shown to be associated with high variation when tracking walking on a treadmill.7,17 The general underestimation here may be the result of a systematic bias that affected all devices. This is probably due to the onset and offset of the treadmill. Each step is recorded by the gold standard, but the first and last steps are associated with limited acceleration, which was probably not enough to be registered by accelerometry and thus led to a small systematic underestimation.

In this study, the slowest walking speed was 3.2 km·h-1. At this speed, it is difficult for the activity trackers to detect accelerations.8,9 Not only is it recognized as problematic for activity trackers to count steps validly at a slow walking speed; our study also showed that the participants had difficulty maintaining a constant pace at 3.2 km·h-1. The gold standard had an ICC of only 0.76, which indicates that there were possibly actual differences between Sessions 1 and 2 in the number of steps taken by the participants. This can plausibly be explained by the fact that a speed of 3.2 km·h-1 was too slow for many of the healthy participants, which made it difficult for them to walk in a natural way. This was, for example, visible in a very small step length or only minimal arm swing during walking. The participants probably selected slightly different strategies to compensate for the slow speed between Sessions 1 and 2, which resulted in a different number of steps between the sessions even though the speed and distance covered were equal. The Samsung Gear S showed the highest ICC (ICC=0.96) at 3.2 km·h-1. However, this differs from the gold standard, which suggests that the actual test-retest reliability of the Samsung Gear S at 3.2 km·h-1 may not be that good. The ICCs of the Polar Loop, Garmin Vivosmart, and Fitbit Charge HR (ICCs of 0.74, 0.79 and 0.73, respectively) are closer to the ICC of the gold standard, indicating that these trackers had the best test-retest reliability at 3.2 km·h-1. The validity was probably also influenced by the unnatural walking pattern at 3.2 km·h-1. The slow walking speed and the unnatural walking pattern at 3.2 km·h-1 resulted in low validity of most trackers. Only three trackers had a MAPE smaller than 5%. The limits of agreement of the Bland-Altman plots were broad, with an average of 65.8% of the total number of steps of the gold standard. The Polar Loop in particular was not able to count steps validly, as illustrated by an exceptionally high MAPE of 26% during both sessions. This may be because this activity tracker was developed for sports activities rather than slow walking. Only a small number of trackers demonstrated good validity at 3.2 km·h-1. When examining both the test-retest reliability and validity, the Garmin Vivosmart showed the best results. The Fitbit Charge HR also indicated good results. The other trackers had inadequate results on the test-retest reliability and/or the validity.

Most participants confirmed that the average walking speed of 4.8 km·h-1 was normal for them, as described previously.12 At this speed, accelerations of the body are higher than at 3.2 km·h-1 and, therefore, better results on test-retest reliability and validity were expected. For the test-retest reliability, this was the case for five out of ten trackers. Only three trackers showed a better test-retest reliability at 4.8 km·h-1 than at 3.2 km·h-1 (Apple Watch Sport, Flyfit and Moves) and two remained approximately the same (Fitbit Charge HR and Jawbone Up Move). The other five trackers had lower test-retest reliability at 4.8 km·h-1. Most trackers did exhibit a marked improvement in validity at 4.8 km·h-1 compared with 3.2 km·h-1. Eight trackers had a MAPE smaller than 5% during one of the sessions, and six trackers remained below this cut-off point during both sessions. All other trackers that did not meet the 5% criterion still had a MAPE lower than 6%. Only the Polar Loop showed an unexpectedly high MAPE (10.7%) during the second session. In general, it can be concluded that, at this average speed, most trackers show acceptable validity results. Notably, these validity results are accompanied by relatively broad limits of agreement of the Bland-Altman plots (an average of 39.4% of the total number of steps of the gold standard). The low MAPE in combination with broad limits of agreement indicates that, although these trackers, on average, have acceptable validity, this performance varies between individual participants. Only the Garmin Vivosmart, Apple Watch Sport and Pebble Smartwatch had a MAPE less than 5% and narrow limits of agreement (within 20% of the average number of steps of the gold standard). However, all of these trackers had a low test-retest reliability, although that of the Garmin Vivosmart and Apple Watch Sport was generally acceptable.

Most trackers showed the best results at 6.4 km·h-1. For the test-retest reliability, there were only three trackers that had low ICCs (Polar Loop, Misfit Flash and Flyfit). The three smartwatches (Apple Watch Sport, Pebble Smartwatch and Samsung Gear S) had the best test-retest reliability. The validity was also generally better than at the slower speeds. However, there were two trackers (Garmin Vivosmart and Fitbit Charge HR) that had the poorest validity at the highest walking speed. The explanation for this finding is unclear; however, these two trackers most likely just perform best at slow walking speeds. This is especially remarkable for the Garmin Vivosmart since Garmin is a sports brand and, therefore, better results at higher speeds were expected. The other trackers had a lower and, in a number of cases, a similar MAPE when compared to 3.2 and 4.8 km·h-1. However, for most trackers the Bland-Altman plots showed relatively broad limits of agreement (on average, 32.9% of the average number of steps of the gold standard). This indicates that there were many individual differences in the results of the trackers also at 6.4 km·h-1, and


most trackers cannot be used interchangeably with the gold standard. Only the Apple Watch Sport, Pebble Smartwatch, Samsung Gear S and Jawbone Up Move had narrow limits of agreement (within 20% of the average number of steps of the gold standard) and a MAPE smaller than 5%. Since the Apple Watch Sport, Pebble Smartwatch and Samsung Gear S also had the best test-retest reliability, it can be concluded that the three smartwatches exhibited the best results at 6.4 km·h-1.

In our study, the placement of the trackers was always at the same position on the hip, wrist or ankle for each participant during each session. Wrist trackers were also consistently placed in the same order from distal to proximal. It should be noted that there is an open debate about possible differences in accelerations between the dominant and non-dominant wrist. One study showed that trackers at the non-dominant wrist have a higher validity than trackers at the dominant wrist.18 Another study showed no differences in validity between the dominant and non-dominant wrist.19 These contradicting results indicate that in a (semi) free-living setting, a counterbalanced placement would have been more appropriate to exclude possible different acceleration values between the dominant and non-dominant wrist. In our study, participants were tested in a controlled lab setting. Walking on a treadmill is typically characterized by symmetrical movement of the dominant and non-dominant arm. Thus, the absence of counterbalanced placement in our study design is unlikely to have affected our results. Furthermore, it should be noted that two trackers we tested (Misfit Flash and Jawbone Up Move) can be worn on both the hip and the wrist. However, the results from hip and wrist are not interchangeable.20,21 In this study we chose to place them on the hip because research showed that this positioning is associated with higher validity compared to the wrist.7,18 Therefore, the results of the Misfit Flash and Jawbone Up Move in this study apply only when wearing them on the hip.

Some caution should be taken regarding the generalizability of these results. First, the test-retest reliability and validity were only examined during treadmill walking at a constant speed. This is not a natural situation, in which there is variation in walking speed and activities other than walking are also performed. Second, to translate these results into clinical practice, it should be taken into account that only healthy volunteers participated in this study. The walking pattern of elderly people or those with limited physical abilities may be different. A similar study with the specific target group should be performed to provide actual proof. Despite these limitations, this study demonstrates that most activity trackers perform better as the walking speed increases. In addition, the trackers in this study were relatively new and, to our knowledge, most of them have not been studied before. Therefore, this study provides an initial insight into the test-retest reliability and validity of the activity trackers, which can be useful when selecting an activity tracker for a certain target group with a specific level of daily physical activity. For people with a low walking speed, the Garmin Vivosmart or the Fitbit Charge HR seem most suitable. For people with a higher walking speed, the smartwatches (especially the Apple Watch Sport and the Pebble Smartwatch) seem more appropriate, not only because these trackers were most valid at the highest walking speed, but also because they seem to meet the current trend in technological developments. However, smartwatches are usually more expensive than most activity trackers, which can be a serious drawback. Fortunately, most trackers showed a reasonably good test-retest reliability, and validity increased with increasing walking speed. So, for individual users who want to measure changes in their physical activity pattern, most trackers are suitable.

In conclusion, there is variation in test-retest reliability and validity of consumer activity trackers which largely depends on walking speed. At a slower walking speed, the Garmin Vivosmart and Fitbit Charge HR showed the most favorable results. For average walking speed, the Garmin Vivosmart and Apple Watch Sport indicated the best results. For vigorous walking, the Apple Watch Sport, Pebble Smartwatch, and Samsung Gear S demonstrated the most favorable results. In general, it can be concluded that the activity trackers are more valid at higher walking speeds, and smartwatches showed slightly better results than the wearables that are solely developed for activity tracking.

52 Reliability and validity of ten consumer activity trackers depend on walking speed

most trackers cannot be used interchangeably with the gold standard. Only the Apple Watch the highest walking speed, but also because these trackers seem to meet the current trend Sport, Pebble Smartwatch, Samsung Gear S and Jawbone Up Move had narrow limits of in technological developments. However, smartwatches are usually more expensive than agreement (within 20% of the average number of steps of the gold standard) and a MAPE most activity trackers, which can be a serious drawback. Fortunately, most trackers showed smaller than 5%. Since the Apple Watch Sport, Pebble Smartwatch and Samsung Gear S also a reasonably good test-retest reliability, and validity was increasing with increasing walking had the best test-retest reliability, it can be concluded that the three smartwatches speed. So, for individual users who want to measure changes in their physical activity exhibited the best results at 6.4 km·h-1. pattern, most trackers are suitable.
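As an illustration of the agreement criteria cited in this discussion (a MAPE smaller than 5% and limits of agreement within 20% of the average gold-standard step count), the following minimal Python sketch shows one way such a check could be computed. It is not the analysis code of the study. The function names and the step counts are hypothetical, the limits of agreement are assumed to be the usual Bland-Altman interval (mean difference ± 1.96 SD), and "within 20%" is read here as the width of that interval being at most 20% of the mean criterion count, which is only one possible interpretation.

```python
from statistics import mean, stdev


def mape(tracker, criterion):
    """Mean absolute percentage error of tracker counts against the criterion."""
    return 100 * mean(abs(t - c) / c for t, c in zip(tracker, criterion))


def limits_of_agreement(tracker, criterion):
    """Bland-Altman 95% limits of agreement: mean difference +/- 1.96 SD."""
    diffs = [t - c for t, c in zip(tracker, criterion)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias - 1.96 * sd, bias + 1.96 * sd


def meets_criteria(tracker, criterion, mape_max=5.0, loa_fraction=0.20):
    """Assumed reading of the criteria: MAPE below mape_max and a
    limits-of-agreement width of at most loa_fraction of the mean criterion."""
    lower, upper = limits_of_agreement(tracker, criterion)
    narrow = (upper - lower) <= loa_fraction * mean(criterion)
    return mape(tracker, criterion) < mape_max and narrow


# Hypothetical step counts for four participants (not data from the study).
hand_counted = [1000, 1100, 1200, 1050]
tracker_steps = [990, 1110, 1180, 1060]

print(round(mape(tracker_steps, hand_counted), 2))
print(limits_of_agreement(tracker_steps, hand_counted))
print(meets_criteria(tracker_steps, hand_counted))
```

Under this reading, a tracker passes only when both its average percentage error and the spread of its individual errors are small, which matches the idea that a low mean difference alone does not make a tracker interchangeable with hand counting.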


References

1. Abellan Van Kan G, Rolland Y, Andrieu S, et al. Gait speed at usual pace as a predictor of adverse outcomes in community-dwelling older people An International Academy on Nutrition and Aging (IANA) Task Force. J Nutr Health Aging. 2009;13(10):881-9.
2. Case MA, Burwick HA, Volpp KG, Patel MS. Accuracy of smartphone applications and wearable devices for tracking physical activity data. JAMA. 2015;313(6):625-6.
3. Dieu O, Mikulovic J, Fardy PS, Bui-Xuan G, Béghin L, Vanhelst J. Physical activity using wrist-worn accelerometers: comparison of dominant and non-dominant wrist. Clin Physiol Funct Imaging. 2016.
4. Evenson KR, Goto MM, Furberg RD. Systematic review of the validity and reliability of consumer-wearable activity trackers. Int J Behav Nutr Phys Act. 2015;12(1):159.
5. Feito Y, Bassett DR, Thompson DL. Evaluation of activity monitors in controlled and free-living environments. Med Sci Sports Exerc. 2012;44(4):733-41.
6. Feito Y, Garner HR, Bassett DR. Evaluation of ActiGraph's low-frequency filter in laboratory and free-living environments. Med Sci Sports Exerc. 2015;47(1):221-7.
7. Giraudeau B. Negative values of the intraclass correlation coefficient are not theoretically possible. J Clin Epidemiol. 1996;49(10):1205.
8. Gjoreski M, Gjoreski H, Luštrek M, Gams M. How accurately can your wrist device recognize daily activities and detect falls? Sensors. 2016;16(6):800.
9. Haskell WL, Lee IM, Pate RR, et al. Physical activity and public health: updated recommendation for adults from the American College of Sports Medicine and the American Heart Association. Med Sci Sports Exerc. 2007;39(8):1423-34.
10. Hasson RE, Haller J, Pober DM, Staudenmayer J, Freedson PS. Validity of the Omron HJ-112 Pedometer during Treadmill Walking. Med Sci Sports Exerc. 2009;41(4):805-9.
11. Hildebrand M, Van Hees VT, Hansen BH, Ekelund U. Age-group comparability of raw accelerometer output from wrist- and hip-worn monitors. Med Sci Sports Exerc. 2014;46(9):1816-24.
12. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist. 1979;6(2):65-70.
13. Kooiman TJM, Dontje ML, Sprenger SR, Krijnen WP, van der Schans CP, de Groot M. Reliability and validity of ten consumer activity trackers. BMC Sports Sci Med Rehabil. 2015;7:24.
14. Middleton A, Fritz SL, Lusardi M. Walking speed: the functional vital sign. J Aging Phys Act. 2015;23(2):314-22.
15. Musto A, Jacobs K, Nash M, DelRossi G, Perry A. The effects of an incremental approach to 10,000 steps/day on metabolic syndrome components in sedentary overweight women. J Phys Act Health. 2010;7(6):737.
16. Portney L, Watkins M. Foundations of clinical research: applications to practice. Upper Saddle River, NJ: Pearson/Prentice Hall; 2009; Chapter 5.
17. Ryan CG, Grant PM, Tigbe WW, Granat MH. The validity and reliability of a novel activity monitor as a measure of walking. Br J Sports Med. 2006;40(9):779-84.
18. Takacs J, Pollock CL, Guenther JR, Bahar M, Napier C, Hunt MA. Validation of the Fitbit One activity monitor device during treadmill walking. J Sci Med Sport. 2014;17:496-500.
19. Tudor-Locke C, Ainsworth BE, Whitt MC, Thompson RW, Addy CL, Jones DA. The relationship between pedometer-determined ambulatory activity and body composition variables. Int J Obes Relat Metab Disord. 2001;25(11):1571-8.
20. Tudor-Locke C, Craig CL, Brown WJ, et al. How many steps/day are enough? For adults. Int J Behav Nutr Phys Act. 2011;8:79.
21. Tudor-Locke C, Barreira TV, Schuna Jr JM. Comparison of step outputs for waist and wrist accelerometer attachment sites. Med Sci Sports Exerc. 2015;47(4):839-42.
22. Tudor-Locke C, Craig CL, Thyfault JP, Spence JC. A step-defined sedentary lifestyle index: <5000 steps/day. Appl Physiol Nutr Metab. 2012;38(2):100-14.
23. Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J Strength Cond Res. 2005;19(1):231-40.


Chapter 4 | Behavioral determinants for the adoption of self-tracking devices by adults – a longitudinal study

Thea J.M. Kooiman Arie Dijkstra Justin Timmer Wim P. Krijnen Adriaan Kooy Cees P. van der Schans Martijn de Groot

Submitted