Threats to reliability are those factors that cause (or are sources of) error. After all, the instability or inconsistency in the measurement you are using comes from such error. Sources of error in your dissertation may include: researcher (or observer) error, environmental changes and participant changes.
There are many situations during the dissertation process where you are responsible for taking measurements. During this measurement process, as the researcher, you can introduce error when carrying out measurements. This is known as researcher (or observer) error. Even when a measurement process is considered to be precise (e.g., a stopwatch), your judgement will often be involved in the use of the measurement (e.g., when to start and stop the stopwatch). Human error (or human differences) is also a factor (e.g., the reaction time to start the watch). This becomes a greater problem as the number of researchers (observers) increases and/or the number of measurements increases (e.g., 10 people using stopwatches, making 100 time measurements).
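To see how such researcher error accumulates, consider the following illustrative sketch in Python. The scenario and all of the figures (the "true" time, the per-observer bias and the trial-to-trial reaction noise) are hypothetical assumptions chosen for illustration, not taken from any real study:

```python
# Illustrative sketch (hypothetical figures): simulating how researcher
# (observer) error adds noise to stopwatch measurements of the same event.
import numpy as np

rng = np.random.default_rng(seed=42)

true_time = 12.50                 # the "real" duration of the event, in seconds
n_observers, n_trials = 10, 100   # e.g., 10 people making 100 timings each

# Assume each observer's start/stop error is roughly normal, with a small
# systematic per-observer bias plus random trial-to-trial variability.
observer_bias = rng.normal(0.0, 0.05, size=n_observers)            # per observer
trial_noise = rng.normal(0.0, 0.10, size=(n_observers, n_trials))  # per trial

recorded = true_time + observer_bias[:, None] + trial_noise

print(f"True time:     {true_time:.2f} s")
print(f"Mean recorded: {recorded.mean():.2f} s")
print(f"Spread (SD):   {recorded.std():.3f} s")  # the error that reliability tests detect
```

Even though every simulated observer times the same 12.50-second event, the recorded times spread around it; it is this spread that the tests of reliability discussed below are designed to detect.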
During the time between measurements (e.g., recording time on a stopwatch), there may be small environmental changes that influence the measurements being taken, creating error. These changes in the environment make it impossible to ensure that the same individual is measured in the same way (i.e., under identical conditions). For example, even two closely timed measurements may be affected by environmental conditions/variables (e.g., light, time of day, temperature, etc.). However, it should be noted that measuring individuals in the same way each time (i.e., under the same/identical environmental conditions), without any environmental change, is an ideal rather than something that can be fully achieved in practice.
Between measurements, it is also possible for research participants to change in some way. Whilst this potential for change is generally reduced if the time between measurements is short, this is not necessarily the case. It depends on the nature of the measurement (e.g., focus/attention affects reaction times, hunger/tiredness leads to reduced physical/mental performance, etc.). These participant changes can create error that reduces the reliability (i.e., consistency or stability) of measurements.
The type of reliability test that you should apply in your dissertation will vary depending on the research methods you select. In the sections below, we look at (a) successive measurements, (b) simultaneous measurements by more than one researcher, and (c) a single measurement point.
It is common in quantitative research for successive measurements to be taken. After all, in experimental research and quasi-experimental research, researchers often conduct a pre-test, followed by a post-test [see the articles: Experimental research designs and Quasi-experimental research designs]. In such cases, we want to make sure that the measurement procedures that are used (e.g., a questionnaire, survey) produce measurements that are reliable, both for the pre-test and the post-test. Sometimes the measurement procedures are the same for the pre-test and the post-test, whilst on other occasions a different measurement procedure is used in the post-test. In both cases, we need to make sure that the measurement procedures that are used are reliable. However, we use different tests of reliability to achieve this: (a) test-retest reliability on separate days; and (b) parallel-forms reliability. Each of these tests of reliability is discussed in turn:
Test-retest reliability on separate days
Test-retest reliability on separate days assesses the stability of a measurement procedure (i.e., reliability as stability). We emphasize the fact that we are interested in test-retest reliability on separate days because test-retest reliability can also be assessed on the same day, where it has a different purpose (i.e., it assesses reliability as internal consistency rather than reliability as stability).
A test (i.e., measurement procedure) is carried out on day one, and then repeated on day two or later. The scores between these two tests are compared by calculating the correlation coefficient between the two sets of scores. The same version of the measurement procedure (e.g., a survey) is used for both tests. The samples (i.e., people being tested) for each test should be the same (or very similar); that is, the characteristics of the samples should be closely matched (e.g., on age, gender, etc.). If there is a strong relationship between the two sets of scores, highlighting consistency between the two tests, the measurement procedure is considered to be reliable (i.e., stable). Where the measurement procedure is reliable in this way, we would expect to see identical (or very similar) results from a similar sample under similar conditions when this measurement procedure was used in future.
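As a minimal sketch of what this looks like in practice, the hypothetical Python example below calculates the correlation coefficient between two sets of scores. The scores, sample size and choice of Pearson's correlation are illustrative assumptions; the appropriate coefficient depends on your data:

```python
# Minimal sketch: estimating test-retest reliability (reliability as
# stability) with a Pearson correlation. The scores are hypothetical.
from scipy.stats import pearsonr

day1 = [24, 31, 28, 35, 22, 30, 27, 33]  # first administration of the instrument
day2 = [26, 30, 29, 36, 21, 31, 28, 32]  # same instrument, same people, a later day

r, p = pearsonr(day1, day2)
print(f"Test-retest correlation: r = {r:.2f} (p = {p:.3f})")
```

A high positive coefficient (conventionally, around .80 or above is often taken to indicate good test-retest reliability, although acceptable thresholds vary by field) suggests the measurement procedure is stable over time.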
Test-retest reliability on separate days is particularly appropriate for studies of physical performance, but it can also be used with written tests/survey methods. However, in such cases, there is greater potential for learning effects [see the section, Testing effects and internal validity, in the article: Internal validity] to result in spuriously high correlations (i.e., the reliability estimate is exaggerated because the method cannot control for learning effects; it simply compares the two sets of scores).
The interval between the test and retest (i.e., between measurement procedures) will be determined by a number of factors. In physical performance tests, for example, you may need to assess the amount of rest participants require, especially if the test is physically demanding. In written tests/survey methods, greater time between the test and retest will likely increase the threat from learning effects. Therefore, you will need to assess what is the appropriate interval between the test and retest: too short and there is the potential for memory effects from the first test; too long and there is the potential for extraneous/confounding effects [see the article: Extraneous and confounding variables]. Ultimately, you should avoid any length of interval over which maturation, learning effects, changes in ability, outside influences/situational factors, changes in participant interest, and so on, could affect the retest [see the article, Internal validity, if you are unsure what some of these threats to research are].
Parallel-forms reliability
Parallel-forms reliability (also known as the parallel-forms method, alternate-forms method or equivalence method) is used to assess the reliability of a measurement procedure when different (alternate/modified) versions of the measurement procedure are used for the test and retest. The same group of participants is used for both test and retest. The measurement procedures, whilst different, should address the same construct (e.g., intelligence, depression, motivation, etc.).
Whereas the test-retest reliability method is more appropriate for physical performance measures, the parallel-forms reliability method is more frequently used in written/standardised tests. It is seldom appropriate for physical performance tests because designing two physical measurement procedures that measure the same thing is more challenging than designing two sets of standardised test questions.
The reliability of the measurement procedure is determined by the similarity/consistency of the results between the two versions of the measurement instrument (i.e., reliability as equivalence). Such reliability is assessed by comparing the two sets of scores (i.e., those from the two versions of the measurement instrument) using a t-test, the similarity of their means and standard deviations, and the correlation coefficient between them, which should be high.
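A minimal sketch of these three checks is shown below, using hypothetical scores in Python. The data, and the choice of a paired t-test (appropriate here because the same participants complete both forms), are illustrative assumptions:

```python
# Minimal sketch: checking parallel-forms reliability (reliability as
# equivalence). Hypothetical scores: the same participants complete
# version A and version B of an instrument measuring the same construct.
import numpy as np
from scipy.stats import ttest_rel, pearsonr

form_a = np.array([52, 61, 47, 58, 66, 55, 49, 63])
form_b = np.array([54, 59, 48, 60, 64, 57, 50, 61])

# 1. Similar means and standard deviations across the two forms
print(f"Form A: mean = {form_a.mean():.1f}, SD = {form_a.std(ddof=1):.1f}")
print(f"Form B: mean = {form_b.mean():.1f}, SD = {form_b.std(ddof=1):.1f}")

# 2. A paired t-test: a non-significant difference is consistent with equivalence
t, p = ttest_rel(form_a, form_b)
print(f"Paired t-test: t = {t:.2f}, p = {p:.3f}")

# 3. A high correlation between the two forms
r, _ = pearsonr(form_a, form_b)
print(f"Correlation between forms: r = {r:.2f}")
```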
In quantitative research, sometimes more than one researcher is required when collecting measurements, which makes it important to assess the reliability of the simultaneous measurements that are taken. There are two common reasons for this: (a) experimenter bias and instrumental bias; and (b) experimental demands. Let's look at each in turn:
Experimenter bias and instrumental bias
Sometimes, we can think of the researcher collecting the data as the measurement device, since it is the researcher that makes the assessment of the measurement. This is more likely to occur in qualitative research designs than quantitative research because qualitative research generally involves less structured and less standardised measurement procedures, such as unstructured and semi-structured interviews and observations. However, quantitative research also involves research methods where the score on the dependent variable for a particular measurement procedure is determined by the researcher.
In such cases, you want to avoid the potential for experimenter bias and instrumental bias, which are threats to internal validity and reliability [see the sections, Experimenter effects and internal validity and Instrumental bias and internal validity, in the article: Internal validity]. For example, let's imagine that a researcher is using structured participant observation to assess social awkwardness (i.e., the dependent variable) in two different types of profession (i.e., the independent variable). For simplicity, let's imagine that two researchers monitor these two different groups of employees, and score their level of social awkwardness on a scale of 1-10 (e.g., 10 = extremely socially awkward).
The way that a researcher scores may change during the course of an experiment for two reasons: First, the researcher may gain experience (i.e., become more proficient) or become fatigued during the course of the experiment, which affects the way that observations are recorded. This can happen across groups, but also within a single group (e.g., between the pre-test and the post-test). Second, a different researcher may be used for the pre-test and post-test measurements. In quantitative research using structured participant observation, it is important to consider the ability/experience of the researchers, and how this, or other factors relating to the researcher's scoring, may change over time. However, this will only lead to instrumental bias if the way that the researcher scores differs between the groups that are being measured (e.g., the control group versus the treatment group).
One of the goals of reliability as equivalence is to assess such experimenter bias and instrumental bias by comparing the similarity/consistency of the simultaneous measurements that are being taken.
Experimental demands
Sometimes there are too many measurements to be taken by one researcher (e.g., lots of participants), or the measurements are geographically dispersed (e.g., measurements have to be taken at different locations). This may also result in simultaneous measurements being taken.
Since the judgement of researchers is not perfect, we cannot assume that different researchers will record a measurement of something in the same way (e.g., measure the social awkwardness of a person on a scale of 1-10 simply by observing them). In order to assess how reliable such simultaneous measurements are, we can use inter-rater reliability. Such inter-rater reliability is a measure of the correlation between the scores provided by the two observers, which indicates the extent of the agreement between them (i.e., reliability as equivalence). To learn more about inter-rater reliability, how to calculate it using the statistics software SPSS, interpret the findings and write them up, see the Data Analysis section of Lærd Dissertation.
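As a minimal sketch of the underlying calculation (here in Python rather than SPSS, and with hypothetical ratings), inter-rater reliability can be estimated as follows:

```python
# Minimal sketch: inter-rater reliability for two observers scoring the
# same participants on a 1-10 scale. The ratings are hypothetical.
from scipy.stats import pearsonr

rater_1 = [3, 7, 5, 8, 2, 6, 4, 9, 5, 7]
rater_2 = [4, 7, 6, 8, 3, 5, 4, 8, 6, 7]

r, p = pearsonr(rater_1, rater_2)
print(f"Inter-rater correlation: r = {r:.2f} (p = {p:.3f})")
# The closer r is to 1, the more closely the two observers agree
# (reliability as equivalence).
```

Note that other statistics, such as Cohen's kappa or the intraclass correlation coefficient, are also commonly used to assess inter-rater reliability, depending on the level of measurement of your data.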