National Center for Technology Innovation

Validity and Reliability

Internal and external validity

All studies attempt to maximize both their internal and external validity.

Internal validity addresses how justified it is to make causal inferences about the intervention in the study. The most common threats to internal validity are selection bias, history, maturation, testing effects, differential attrition, and regression toward the mean. The main design elements that address internal validity are the use of a treatment group and a control group, combined with random assignment. By randomly assigning participants, you can be confident that any pre-existing differences between the treatment and control groups are due to chance alone, not selection bias.
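Random assignment itself is mechanically simple. The sketch below is a hypothetical Python helper (not drawn from any cited study): it shuffles the participant pool before splitting it, so that which group a participant lands in is determined by chance alone.

```python
import random

def random_assignment(participants, seed=None):
    """Randomly split participants into treatment and control groups.

    Shuffling before splitting means any pre-existing difference
    between the two groups is due to chance, not selection bias.
    """
    rng = random.Random(seed)  # seed only to make a pilot run repeatable
    pool = list(participants)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]  # (treatment, control)

treatment, control = random_assignment(range(100), seed=42)
```

In a real study the participant list would come from your recruitment records, and group sizes might be balanced on additional criteria, but the core idea is the same: chance, not the researcher, decides group membership.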

External validity addresses how well the study’s inferences generalize to the broader population. A study’s external validity is both dependent on, and in tension with, its internal validity. A study with little or no internal validity cannot claim a causal effect of an intervention and thus cannot be generalized. However, to strengthen internal validity, studies tend to focus on narrow populations (for example, the effect of an intervention on seventh-grade reading scores among students with learning disabilities in New York City). While limiting a study’s scope allows greater control over the characteristics of the treatment and control groups, as well as some control over history and maturation, the results are less likely to generalize. What works for public school students in New York City may not work for students in rural Alabama or for private school students in Los Angeles, because the populations may be meaningfully different.

Measurement validity and reliability

Case studies, single-subject, quasi-experimental, and experimental research all involve using one or more instruments to measure outcomes. The measurement validity and reliability of your testing instruments are critical factors to consider when designing your study.

Measurement validity addresses how accurately the instrument measures the outcome or construct your intervention is attempting to affect. In this context, an instrument is valid if it actually measures what you intend it to measure. Items such as commercial rulers or scales are straightforward examples of instruments with strong measurement validity. However, the validity of a tool that attempts to measure growth in cognitive ability or changes in behavior (such as increased mobility) is far less clear-cut.

Measurement reliability addresses the consistency of your instrument’s measurement. That is, would your testing instrument generate the same result in similar circumstances? Again, think of a typical measuring instrument, such as a ruler. Using a ruler – the same ruler – to measure something over and over again will give you a very reliable value. Yet standardized tests, such as the SAT or GRE, may generate very different results from the same individual at different points in time or under different conditions (e.g., paper and pencil vs. computer).

Note that an instrument can be valid but not reliable, or reliable but not valid. Using the examples above, the SAT may be a valid indicator of cognitive ability, yet it may not measure that ability reliably. Conversely, almost any ruler is a reliable measurement tool – though a warped ruler would not be valid.

Tips for boosting measurement validity and reliability:


  • Always consider pilot testing your instrument with the target population of your research.
  • Have experts in the area of your research check or provide guidance on your data collection tools.

Skill-based tests:

  • It is imperative that the test you select will collect data on the types of skills your research is targeting. For example, if you are teaching math to children with cognitive impairments, you will need a test that will be sensitive enough to detect growth in their learning within your timeframe. This is a question best determined in consultation with a professional researcher in your area of study.

Surveys, interviews and focus groups:

  • You need to check your questions to determine if they are prompting the types of responses you expect. Run a pilot test with a small set of people from your target population. Note that these must be people who will not otherwise be involved in the study.

Observations:

  • The observation protocol or record-keeping sheet is critical to getting credible data. Spend time piloting the protocol with your observers. Try it out in a shared observation (a videotaped classroom works well for this) and then discuss the ratings. Did all the raters mark events in the same manner? If not, why not? It is critical to work this out in advance.
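Whether raters “mark events in the same manner” can be quantified. One common index is Cohen’s kappa, which corrects raw percent agreement for the agreement expected by chance. The sketch below uses hypothetical codes from two observers rating the same ten classroom events; it is an illustration, not a prescribed protocol.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given how often each rater uses each label.
    expected = sum((ca[label] / n) * (cb[label] / n)
                   for label in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

# Hypothetical on-task/off-task codes from two observers of ten events.
rater_a = ["on", "off", "on", "on", "off", "on", "on", "off", "on", "on"]
rater_b = ["on", "off", "on", "off", "off", "on", "on", "off", "on", "on"]
kappa = cohens_kappa(rater_a, rater_b)
```

A kappa near 1.0 indicates strong agreement beyond chance; low values during piloting signal that the protocol’s categories need clearer definitions before real data collection begins.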

For more information, see: