Reliability in research

Reliability, like validity, is a way of assessing the quality of the measurement procedure used to collect data in a dissertation. In order for the results from a study to be considered valid, the measurement procedure must first be reliable. In this article, we: (a) explain what reliability is, providing examples; (b) highlight some of the more common threats to reliability in research; (c) briefly discuss each of the main types of reliability you may use in your dissertation, and the situations where they are appropriate; and (d) point to the various articles on the Lærd Dissertation website where we discuss each of these types of reliability in more detail, including articles explaining how to run these reliability tests in the statistics package, SPSS, as well as interpret and write up the results of such tests.

What is reliability?

When we examine a construct in a study, we choose one of a number of possible ways to measure that construct [see the section on Constructs in quantitative research, if you are unsure what constructs are, or the difference between constructs and variables]. For example, we may choose to use questionnaire items, interview questions, and so forth. These questionnaire items or interview questions are part of the measurement procedure. This measurement procedure should provide an accurate representation of the construct it is measuring if it is to be considered valid. For example, if we want to measure the construct, intelligence, we need to have a measurement procedure that accurately measures a person's intelligence. Since there are many ways of thinking about intelligence (e.g., IQ, emotional intelligence, etc.), this can make it difficult to come up with a measurement procedure that has strong validity [see the article: Construct validity].

In quantitative research, the measurement procedure consists of variables, whether a single variable or a number of variables that may make up a construct [see the section on Constructs in quantitative research]. When we think about the reliability of these variables, we want to know how stable or constant they are. This assumption, that the variable you are measuring is stable or constant, is central to the concept of reliability. In principle, a measurement procedure that is stable or constant should produce the same (or nearly the same) results if the same individuals and conditions are used. So what do we mean when we say that a measurement procedure is constant or stable?

Some variables are more stable (constant) than others; that is, some change significantly, whilst others are reasonably constant. However, the measurement procedure that is used to measure a variable introduces some degree of error, whether small or large. Therefore, the score measured (e.g., 0-100 in an exam) for a given variable consists of the true score plus error. The true score is the actual score that would reliably reflect the measurement (e.g., for a person) on a given construct (e.g., a score of 76 out of 100 in an IQ test actually reflects the intelligence of the person taking the test; if that person took another IQ test the next day, we would expect them to get 76 out of 100 again, assuming that we are only seeing that person's true score and not any error). The error reflects conditions that cause the score we measure to deviate from the true score (e.g., a person whose true score on an IQ test should be 76 out of 100 gets 74 one day, but 79 the next, with the difference in the scores between the two days reflecting the error component). This error component within a measurement procedure will vary from one measurement to the next, increasing and decreasing the score for the variable. It is assumed that this happens randomly, with the error averaging zero over time; that is, the increases and decreases over a number of measurements even themselves out, so that the average approaches the true score (e.g., if the person whose true score should be 76 out of 100 took the IQ test 20 times, we would expect the average score to be close to 76, despite the fact that the scores obtained were sometimes higher than 76 and sometimes lower). However, not all measurement procedures have the same degree of error (i.e., some measurement procedures are prone to greater error than others).
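The idea that a measured score is the true score plus random error can be illustrated with a short simulation. The following is only a minimal sketch in Python: the function names, the choice of normally distributed error, and the error size of 2 points are assumptions made for illustration, not part of any established procedure. It takes the true score of 76 from the example above, adds a random error with a mean of zero to each measurement, and averages 20 measurements.

```python
import random
from statistics import mean


def observed_score(true_score, error_sd, rng):
    """One measurement: the true score plus a random error with mean zero."""
    # Assumption: error is drawn from a normal distribution (illustrative only).
    return true_score + rng.gauss(0, error_sd)


def repeated_measurements(true_score, error_sd, n, seed=0):
    """Take n measurements of the same person under the same conditions."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [observed_score(true_score, error_sd, rng) for _ in range(n)]


# Hypothetical values: a true score of 76 and an error of about 2 points.
scores = repeated_measurements(true_score=76, error_sd=2.0, n=20)
print([round(s) for s in scores])   # individual scores scatter around 76
print(round(mean(scores), 1))       # the average sits close to the true score
```

Individual scores land above and below 76, but because the error is assumed to average zero, the mean of the 20 scores should fall close to the true score. With an error of zero, every measurement would simply equal the true score.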

Provided that the error component within a measurement procedure is relatively small, the scores that are attained over a number of measurements will be relatively consistent; that is, there will be small differences in the scores between measurements. As such, we can say that the measurement procedure is reliable. Take the following example:

EXAMPLE #1

Error component: Small
Measurement of: Intelligence using IQ
True score: Actual level of intelligence
Error: Caused by factors including current mood, level of fatigue, general health, luck in guessing answers to questions you don't know
Impact of error on scores: We would expect measurements of IQ to fall within a few points of your actual IQ, rather than swinging between, say, 105 and 135 points (i.e., a small error component)

NOTE: You can learn more about reliability, error and intelligence/IQ by reading Schuerger and Witt (1989) and Bartholomew (2004).

By comparison, where the error component within a measurement procedure is relatively large, the scores that are obtained over a number of measurements will be relatively inconsistent; that is, there will be large differences in the scores between measurements. As such, we can say that the measurement procedure is not reliable. Take the following example:

EXAMPLE #2

Error component: Large
Measurement of: Reaction time by measuring the speed of pressing a button when a light bulb goes on (i.e. difference between light appearing and the time when the button was pressed)
True score: Actual reaction speed of person
Error: Caused by factors including level of alertness (i.e., focus or distraction), fatigue of the hand/finger, and guessing behaviour
Impact of error on scores: Potential for the time to differ significantly from one measurement to the next (e.g., 50% longer, or possibly 100% longer)
Solution: Take multiple measurements rather than a single measurement, and then average the scores.

NOTE: You can learn more about reliability, error and reaction times by reading Yellott (1971), Ratcliff (1993), and Salthouse and Hedden (2002).

All measurement procedures involve error. However, it is the amount/degree of error that indicates how reliable a measurement is. When the amount of error is low, the reliability of the measurement is high. Conversely, when the amount of error is large, the reliability of the measurement is low. However, there are solutions to help improve measurement procedures that may be prone to large error components. For example, multiple measurements can be taken instead of a single measurement, with the scores from the multiple measurements being averaged. This will increase the consistency/stability of the measurement procedure.
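The effect of averaging multiple measurements can also be sketched in Python. This is an illustration under stated assumptions rather than a prescribed method: the true reaction time of 300 ms, the error of 60 ms, the normally distributed error, and all function names are hypothetical. The sketch compares how much a score varies when it is based on a single measurement with how much it varies when it is the average of 10 measurements.

```python
import random
from statistics import mean, stdev


def measure(true_score, error_sd, rng):
    """One measurement: true score plus random, zero-mean error (assumed normal)."""
    return true_score + rng.gauss(0, error_sd)


def averaged_measure(true_score, error_sd, k, rng):
    """Score obtained by averaging k separate measurements."""
    return mean([measure(true_score, error_sd, rng) for _ in range(k)])


def spread(true_score, error_sd, k, trials=2000, seed=0):
    """Standard deviation of the averaged score across many repeated trials."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = [averaged_measure(true_score, error_sd, k, rng) for _ in range(trials)]
    return stdev(results)


# Hypothetical reaction-time task: true score 300 ms, error of about 60 ms.
single = spread(true_score=300, error_sd=60, k=1)
averaged = spread(true_score=300, error_sd=60, k=10)
print(round(single, 1), round(averaged, 1))  # the averaged score varies far less
```

Under these assumptions, the spread of the averaged score shrinks roughly with the square root of the number of measurements, so averaging 10 measurements cuts the variation to about a third of that for a single measurement. The scores become more consistent from one occasion to the next, which is what a more reliable measurement procedure looks like.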
