Validity and Reliability
When it comes to fitness testing, it is imperative that fitness professionals know and understand the dual concepts of validity and reliability. A reliable measure measures something consistently, while a valid measure measures what it is supposed to measure.
What is validity?
A valid fitness test is a test that measures exactly what it is supposed to measure. For a test to be valid it must ‘hit the bull’s-eye.’
For instance, if I wanted to measure aerobic running performance then a measure of someone’s fifty metre swim time would have poor validity, whereas a measure of the time it took for them to run five kilometres would be much more valid.
Test measures can be direct or indirect. Direct measures are considered the “gold standard” when testing a certain component of fitness. This means the test can measure the component directly, without using any assumptions or estimates.
For aerobic fitness the gold standard test is considered to be a direct measurement of a client’s maximal oxygen consumption (VO2 max). This test is performed in a lab, where the actual amounts of oxygen and carbon dioxide breathed in and out during maximal exercise can be measured as the client breathes into a bag or tank. Because this test directly measures the maximal amount of oxygen used, it is considered a very valid test of aerobic capacity.
An indirect measure is a test that measures factors other than the direct measure (oxygen) and then, based on certain assumptions, gives an estimate of the specific component of fitness.
An example of an indirect measure is the Astrand-Ryhming step test. Here VO2 max is predicted (estimated) by an equation that takes into account factors such as heart rate after the test, weight and client age.
For some tests the equation doesn’t account for all factors. If your client is different from the sample population the indirect test was devised from, the results might be less accurate (less valid). For example, if the test doesn’t account for age and was originally carried out on 20-year-old men, and you are using the test on a 55-year-old woman, the results won’t be as valid.
When reading about fitness tests you might see terms such as validity coefficient. When indirect tests (such as the Astrand-Ryhming test) are created, the testers measure people on both the direct test (VO2 max) and the indirect test (Astrand-Ryhming) to see how closely the results of the two methods agree. The more similar the results, the higher the validity coefficient of the indirect test. The coefficient cannot be higher than 1.0, and a test is considered to have high validity when its coefficient is above 0.8.
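To make the idea concrete, here is a minimal sketch of how a validity coefficient could be calculated. It assumes the coefficient is a Pearson correlation between paired direct and indirect results, which is a common choice but is not stated above. The function name `validity_coefficient` and the VO2 max figures are hypothetical and invented purely for illustration.

```python
from math import sqrt


def validity_coefficient(direct, indirect):
    """Pearson correlation between paired direct and indirect test results."""
    if len(direct) != len(indirect) or len(direct) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_d = sum(direct) / len(direct)
    mean_i = sum(indirect) / len(indirect)
    # How the two sets of results vary together, and on their own.
    cov = sum((d - mean_d) * (i - mean_i) for d, i in zip(direct, indirect))
    var_d = sum((d - mean_d) ** 2 for d in direct)
    var_i = sum((i - mean_i) ** 2 for i in indirect)
    return cov / sqrt(var_d * var_i)


# Hypothetical VO2 max results (ml/min/kg) for five clients, not real data.
direct_vo2 = [38.0, 43.0, 47.5, 52.0, 56.5]    # lab (direct) measurement
indirect_vo2 = [41.0, 40.5, 49.0, 50.0, 59.0]  # predicted by an indirect test

r = validity_coefficient(direct_vo2, indirect_vo2)
print(f"validity coefficient: {r:.2f}")
print("high validity" if r > 0.8 else "validity below the 0.8 guideline")
```

With these invented numbers the indirect results track the direct results closely, so the coefficient comes out above the 0.8 guideline.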
The standard error of estimate (SEE) is also a factor in fitness testing. When a component of fitness is measured with both a direct method and an indirect method (one that relies on assumptions built into an equation), the indirect result will differ from the direct result by some amount, and the SEE describes the typical size of that difference. For example, if VO2 max measured directly is 43 ml/min/kg, the SEE for a certain indirect method might be +/- 7 ml/min/kg, giving an indirect result anywhere from 36 to 50 ml/min/kg. As you can see, the result could be a long way from the actual number. A test with high validity should have a low SEE.
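The arithmetic in the example can be sketched in a few lines. The function name `see_range` is hypothetical, and the sketch simply adds and subtracts one SEE from the directly measured value, as the example does.

```python
def see_range(measured_value, see):
    """Return the (low, high) band one SEE either side of a measured value."""
    return measured_value - see, measured_value + see


direct_vo2 = 43  # ml/min/kg, measured directly
see = 7          # ml/min/kg, SEE of the indirect method in the example

low, high = see_range(direct_vo2, see)
band_percent = see / direct_vo2 * 100  # SEE as a share of the measured value

print(f"indirect result could fall between {low} and {high} ml/min/kg")
print(f"that is roughly +/- {band_percent:.0f}% of the measured value")
```

Expressing the SEE as a percentage of the measured value makes it easier to judge whether a test is precise enough to show the size of change you expect from training.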
Fitness tests with low validity should always be avoided as they will not give you relevant information to set training targets with and won’t measure change in the fitness component you are trying to affect.
For example, if I did the swim test every six weeks to measure running improvement, I may not see much change, and I wouldn’t be able to work out whether the training approach I had used over each six-week block was working well or hardly working at all. The test simply isn’t valid enough to help inform the training approach or measure true improvement.
Our objective, then, should be to use highly valid tests with low SEE – that is, to make sure that the tests we use measure, as closely as possible, what we want to keep an eye on.
What is reliability?
A reliable fitness test is a test that you can rely on to measure something consistently.
For instance, you may measure your body weight on your bathroom scales every day for a year. The scales may or may not be very valid (the weight they show might not be accurate), but if they are consistent they will show change in your weight very reliably.
Fitness tests with low reliability should be avoided as they will not show what progress or lack of progress is actually occurring. You will also end up setting training targets that are either too low or too high depending on the error in the last test.
For example, suppose I did a running test on a treadmill that, when set to 15 km/h, was actually running closer to 16 km/h because the belt was worn and the drums that drive the belt around were new. My test would show that I wasn’t that fit. If I then re-tested six weeks later on a different treadmill that had a new belt and old drums, and for some reason was actually running at 14 km/h when it was supposed to be at 15 km/h, the test would show that I was much fitter.
Put simply, the poor reliability of the test in this case could have me registering for the 1500m race at the Olympics when in reality I should be doing fun runs with the local running club. The test is producing unreliable results, and any training I prescribe from those results will be problematic as well.
Our objective should be to use highly reliable tests – that is, to make sure that the tests we use are repeatable and will accurately show change in what we are measuring when it occurs.
As well as only using highly reliable tests, we must always make sure the tests we use are highly valid – that is, they measure only and exactly what we want to measure. There is no benefit in using a test unless it is both valid and reliable.
Improving test reliability
Reliability is affected by several factors, some of which are manageable and some of which can’t be changed. To improve reliability we need to focus on eliminating as much error from the tests as we possibly can.
Random error
Random error is an error that occurs sometimes. An example would be using body weight scales that weren’t calibrated one time, and then were calibrated the next. The error is manageable but we haven’t addressed it by being consistent.
Another example would be testing a client’s flexibility once when they were cold, and another time when they had warmed up.
To minimize random error as a fitness professional you must:
- Learn the test protocols very well and stick to them – this means each time you test you do exactly the same thing, in the same way.
- Make sure the environment and your client are in the same condition – this means recording information about the circumstances you test in each time, and giving your client preparation information to ensure they are in the same state. Imagine a client being tested after a day’s work, four coffees and a traffic jam, versus a Sunday morning after a nice relaxing sleep-in…
- Perfect the use and calibration of your measurement tools – this means practising what you do in a test until you can’t get it wrong, and knowing how to make sure the equipment you will use is ‘calibrated’ each time. How often are you the same weight on two different sets of scales on the same day? How well can you use a tape measure on a client’s waist? How good are you at accurately finding and marking skinfold sites when you are going to complete skinfold testing? How well practised are you at recording heart rates whilst increasing treadmill speed, whilst keeping an eye on the time?
Standard error
A standard error occurs every time. That means, the error is built in to our test, and it will be consistent.
An example would be having a watch that only times down to the second and trying to time forty metre sprints. The person may finish at 5s or 6s only because we can’t measure the ‘5.3s sprint’. The tool we are using means there is quite a lot of standard error in the test.
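The effect of a coarse measuring tool can be sketched as follows. This assumes the watch rounds to the nearest whole second (a watch that simply drops the fraction would cause the same kind of problem), and the function names and sprint times are hypothetical.

```python
def stopwatch_reading(true_time_s):
    """Time shown by a watch that displays whole seconds only (assumed to round)."""
    return round(true_time_s)


def shown_improvement(before_s, after_s):
    """Improvement in seconds as it would appear on the whole-second watch."""
    return stopwatch_reading(before_s) - stopwatch_reading(after_s)


# Hypothetical forty metre sprint times in seconds.
large_real_change = shown_improvement(5.4, 4.6)  # really 0.8 s faster
small_real_change = shown_improvement(5.6, 5.4)  # really 0.2 s faster

print(large_real_change, small_real_change)  # prints: 0 1
```

A real improvement of 0.8 s shows up as no change, while a real improvement of 0.2 s shows up as a full second, which is why the resolution of the tool matters so much for short sprints.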
To minimize standard error as a fitness professional you must:
- Choose tests with the lowest standard error – for example, if you are going to do a skinfold test, just total the sites rather than use an equation to calculate body fat percentage, because the equation ‘predicts’ body fat percentage and so introduces further standard error.
- Reduce measurement error by increasing the sensitivity of the tool being used – an example would be using a tape measure that has millimetre divisions on it rather than just centimetres, as this allows more accurate measurement. Similarly, stopwatches that go to the thousandth of a second are better for measuring short sprint times.
Improving the different ‘measures’ of reliability
Reliability can be measured to give you an understanding of how good your test or testing is. There are several different measures of reliability, as follows:
Intra-reliability – This tells you how accurate you are at completing the test repeatedly on the same day. For example, if you did a thigh girth test on the same client in the morning and the afternoon and got exactly the same result, your testing would show high intra-reliability.
Inter-reliability – This tells you how accurate you and someone else are when testing the same person on the same day. For example, if you and a fellow trainer took the same client’s thigh girth measurements on the same day and recorded exactly the same results, then your combined testing would show high inter-reliability.
Test-retest reliability – This tells you how reliable the test is between two test times. If the change in test results between two test times can be attributed solely to a change in the variable being measured (e.g. thigh girth), then the test will have high test-retest reliability. If the difference between test results could be due to factors other than the variable being measured (e.g. not sticking to the exact same test protocol), then the test will have low test-retest reliability.
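One simple way to put a number on intra- and inter-reliability is to compare two sets of measurements taken on the same clients. The sketch below uses the mean absolute difference, which is an illustrative choice and not a measure prescribed here. The function name and the thigh girth figures are hypothetical.

```python
def mean_abs_difference(first, second):
    """Average size of the gap between paired measurements (same units as input)."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)


# Hypothetical thigh girths in millimetres for four clients, same day.
you_morning = [552, 601, 578, 530]
you_afternoon = [553, 600, 578, 531]
fellow_trainer = [560, 595, 585, 522]

intra = mean_abs_difference(you_morning, you_afternoon)   # you versus yourself
inter = mean_abs_difference(you_morning, fellow_trainer)  # you versus a colleague

print(f"intra-reliability gap: {intra} mm")
print(f"inter-reliability gap: {inter} mm")
```

The smaller the gap, the higher the reliability. In this invented data your own repeat measurements agree to within about a millimetre, while you and the fellow trainer differ by several millimetres, which would suggest the two of you need to agree on exactly how the measurement is taken.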
Once you have completed a test there are three outcomes possible:
- The results are as close as possible to accurate
- The results are a ‘false-negative’. This means the results are below what they should be – for example the fitness component measured has improved by five percent when actually an accurate test, done well, would show a ten percent improvement.
- The results are a ‘false-positive’. This means the results are above what they should be – for example the fitness component measured has improved by fifteen percent when actually an accurate test, done well, would show a ten percent improvement.
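The three outcomes can be expressed as a small sketch. In practice the true change is not known, so this only illustrates the definitions. The function name and the `tolerance` of one percentage point are assumptions made for the example.

```python
def classify_result(measured_change, true_change, tolerance=1.0):
    """Label a test result by comparing measured and true change (both in percent)."""
    difference = measured_change - true_change
    if abs(difference) <= tolerance:
        return "accurate"
    # Measured below the true change is a false-negative, above is a false-positive.
    return "false-negative" if difference < 0 else "false-positive"


print(classify_result(10, 10))  # accurate
print(classify_result(5, 10))   # false-negative
print(classify_result(15, 10))  # false-positive
```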
To understand how reliability and validity work together let’s look at a scenario.
Sue is working toward running a ten kilometre run later in the year. Sue has come in for her second fitness test after six weeks of training. It’s a bike test at the gym in the evening and the protocol is that she cycles for three minutes, then the load is increased, then she cycles for three minutes, the load is increased again, then she cycles for a final three minutes. Heart rates are taken throughout the test.
The numbers are then punched in to an equation and Sue’s aerobic fitness is ‘predicted’ by the results.
The room is hot, as it’s summer and she was last tested in spring. She comments she’s had a huge day as after work she tried to get the last of her Xmas shopping done and “man is it busy out there”. She also points out that her lunch was a bit rushed too as she wanted to train and get that last bit of fitness under her belt. The last thing she drank was a coffee which she had whilst going around the shops after work.
You complete the test with Sue. Her first test was completed by another trainer, as you were on holiday at the time and had asked them to test Sue for you. You know how to do the test, but you’ve had a little trouble with the bike at times as it seems to be harder on some days than others. You also lose heart rate readings occasionally as the heart rate monitor goes blank.
You have seen Sue in the club a lot, completing her personal training sessions with you and her prescribed ‘independent’ training on the rower and treadmill, and you expect to see a dramatic improvement in her fitness, as she has been making steady progress during these sessions.
The results come back and it shows that after six weeks of training three hours per week doing aerobic exercise Sue’s fitness hasn’t changed at all!
Alarmed, you tell Sue that you think there is an error in the test and that you’d like her to repeat the test next week. Sue begrudgingly agrees.
Let’s take a look at what could have gone wrong to make this ‘false negative’ occur.
Potential error | Reliability or validity issue | Ways to fix this for next time |
A bike test, when her training is rowing and running, won’t be as sensitive to changes in her fitness | Validity – the test isn’t measuring the right thing | Choose a test that represents what you want to measure – e.g. running aerobic fitness |
The equation used to calculate aerobic fitness introduces error due to predictions made | Standard error affecting validity | Choose a test that is performance based and doesn’t use ‘normative’ data – we need to know what Sue’s fitness was when she started and how it’s progressing, not how she stacks up against American college students or the like |
A hot room can easily increase heart rate as the body pumps blood to the skin to help with cooling at the same time as it needs to pump blood to the working muscles | Random error affecting test-retest reliability | Take the ambient temperature in the room and if it varies by more than a few degrees, cool the room or plan testing at a better time. |
High levels of stress (rushing and shopping) before the test can cause higher heart rates and blood pressure | Random error affecting test-retest reliability | Give Sue pre-test information which will include what to do the day leading into a test, what to drink, what to eat, how much rest to have, and that testing should always be booked in at the same time of day. |
Caffeine prior to exercise can increase heart rates and blood pressure | Random error affecting test-retest reliability | As above |
Training at lunch time will mean Sue is still fatigued and as a result her fitness will appear worse than it would be if she was rested | Random error affecting test-retest reliability | As above |
It appears Sue hasn’t eaten since lunch time and there is no mention of water intake so it’s possible she has low blood sugar and is slightly dehydrated – both increasing heart rate and fatigue | Random error affecting test-retest reliability | As above |
You set Sue up slightly differently than the first instructor who tested her – the bike seat is lower than the height used by the other trainer | Random error affecting inter-reliability and test-retest reliability | Practise your protocol and agree with all trainers how to gauge what height the seat on the bike should be at, plus record that each time you test |
The loss of heart rate readings at key times means you record slightly higher values than you should because Sue’s heart rate continues to increase throughout the test and you can only write it down when it’s there! | Random error affecting intra, inter and test-retest reliability | Replace the heart rate monitor with one that works and check with other trainers to see if it’s affecting them too. Also check other factors such as cell phones and underwire bras, which can give heart rate monitors trouble. |
You don’t calibrate the bike which is why it’s harder on some days than others – the harder it is, the higher the heart rates | Random error affecting intra, inter and test-retest reliability | Read and practise the calibration protocol for the bike and get everybody to do it every time. |
The above scenario does happen, which is why, if you are going to test, you want to select good, valid tests and do them well. Again, you must remember the reason for testing is safety, training focus, and motivation. You lose all of these benefits if the testing is not completed well.
And one last time for good luck…make sure that when you fitness test your personal training clients you only use tests that are valid and measure exactly what you want to measure, and make each test as reliable as possible so that you can trust the results.