Friday, February 8, 2019

Increasing testing severity in multi-group designs

Note: I developed these ideas with Dr. Markus Brauer.

In 1967, Paul Meehl, the psychologist and grandfather of the modern meta-science movement, identified an apparent paradox in the theory-testing strategies deployed in physics versus psychology (Meehl, 1967).  In physics, improvements in precision make it harder to corroborate theory, whereas in psychology, such improvements make corroboration easier.  Meehl identified as a root cause of this paradox a difference in the purpose to which improvements in precision are put: in physics, they are used to more precisely detect deviations from a theory's predictions, whereas in psychology, they are used to more precisely detect deviations from a null hypothesis of no relationship.

Furthermore, Meehl argued that because of so-called "crud", small but non-random relationships that permeate human subjects data, the null hypothesis of no relationship is almost always false in psychology.  A null that is almost always false, even if only in a trivial way, is trivially easy to reject, letting psychological scientists claim "corroboration" for theories that are completely arbitrary.  In later work (Meehl, 1978), Meehl argued that this flabby theory-testing strategy contributes to slow progress in psychological science by ensuring that psychological theories are neither refuted (by making precise predictions that are disconfirmed by precise measurement) nor corroborated (by making precise predictions that are borne out by precise measurement).  Instead, they simply become unfashionable and are forgotten.

Forty years later, Meehl's indictment of psychological science still rings true.  Yet the replication crisis has brought with it winds of change: psychology researchers now have an appetite for practices that can bring our theories closer to truth.  What is needed to address Meehl's criticisms is the popularization of methods and practices that can increase the severity of our theory-testing.


Severe tests in multi-group designs


Multi-group designs are the workhorse of scientific psychology.  They apply to any grouping of people, within-person states, situations, or stimuli; interest typically centers on either the means or the conditional means within each group.  However, theory-testing using these means typically proceeds in just the way that Meehl criticized: scientists attempt to reject a null hypothesis of no group mean differences (which, due to Meehl's "crud", may be trivially false a priori), then follow up the rejected null hypothesis with tests intended to diagnose the pattern that produced the rejection.

Fortunately, we have the tools to do better, at least in research situations where we have a clear a priori hypothesis.  These tools come in the form of the well-known technique of Planned Orthogonal Contrasts and a mostly forgotten paper by Abelson and Prentice (1997).

Planned Orthogonal Contrasts carve up the differences among g group means into g - 1 non-overlapping patterns.  These patterns are represented in a statistical model using mechanical variables.  As long as one of these mechanical variables can be fairly said to represent the research hypothesis (the "focal contrast"), the remaining mechanical variables (the "residual contrasts") capture deviations from the research hypothesis.  That is, if we can reject the hypothesis that the parameter estimate for a residual contrast is zero, we can fairly say that we have refuted the research hypothesis: there are patterns in our means that are not captured by our focal contrast.  If, on the other hand, we cannot reject the null hypothesis that our residual contrasts are zero, our research hypothesis is corroborated; it has passed a severe test.  Meehl would be pleased.
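For concreteness, here is a minimal sketch in R of such a set of contrasts for g = 3 groups (the specific weights are illustrative and match the dosage example below):

    # Two orthogonal contrasts for g = 3 groups.
    c1 <- c(-1/2, 0, 1/2)    # focal contrast (here, a linear trend)
    c2 <- c(1/3, -2/3, 1/3)  # residual contrast (deviations from the trend)
    sum(c1 * c2)             # 0: the two contrasts are orthogonal
    sum(c1); sum(c2)         # each weight set sums to 0, as a contrast must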


How does this work?


Let's say that I do a study where people are assigned to take three different dosages of a drug: no dose, a moderate dose, or a high dose.  I believe that the drug will produce a reduction in depression that increases linearly with dosage.  This hypothesis can be represented by the following (unit-weighted) linear contrast: [no dose = -1/2; moderate dose = 0; high dose = 1/2].

Let's say my study produces the following pattern of means:

The best-fitting parameter estimate for a linear contrast used to represent these means produces these predictions (blue).

However, this linear contrast leaves some variability in the means unexplained.  This unexplained variability represents a departure from linearity.  If we are sufficiently certain that this unexplained variability is non-zero, the presence of this contrary evidence can cause us to reject our initial hypothesis of linearity.

If we subtract the predictions of our linear contrast from the group means, we will see that the departure from linearity is captured by a (unit-weighted) quadratic contrast [no dose = 1/3; moderate dose = -2/3; high dose = 1/3].  The quadratic contrast thus represents the residual contrast for a focal linear contrast.
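Here is a small numerical sketch of that arithmetic in R (the group means are hypothetical values chosen for illustration, not the values from the figure):

    # Hypothetical group means: depression decreases with dose,
    # but not perfectly linearly.
    m    <- c(none = 10, moderate = 7, high = 2)
    lin  <- c(-1/2, 0, 1/2)
    quad <- c(1/3, -2/3, 1/3)

    # Best-fitting linear-only predictions: the grand mean plus the
    # projection of the means onto the linear contrast.
    b_lin <- sum(lin * m) / sum(lin^2)
    pred  <- mean(m) + b_lin * lin
    m - pred  # equals -1/3, 2/3, -1/3: exactly -1 times the quadratic weights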


This is all shown in the animation below.


Severe testing in the presence of "crud"


If we accept Meehl's argument that, due to the "crud" that permeates human subjects data, the null hypothesis in psychology is almost always false, we may not want to treat minor deviations from our focal contrast as seriously threatening our research hypothesis.  We may want to make some allowances for small deviations that are essentially "crud": patterns that are truly non-random but not large enough to cause us to amend our theory.  What we need is a threshold for our residual contrast, below which we don't care, but above which we commit to amending our theory.  What we need are equivalence tests (Lakens, Scheel, & Isager, 2018).

Equivalence tests are a method of testing whether a given effect lies within a boundary that the researcher has defined in advance as "small".  One sets the smallest effect that is substantively interesting for a particular application and uses it to define regions that are "uninteresting" (anything smaller than this smallest effect) and "interesting" (anything larger).  For example, if the smallest effect of substantive interest is .3, the region defined as "uninteresting" is [-.3, .3] and the regions defined as "interesting" are (-∞, -.3] and [.3, ∞); see the figure below.  With these regions defined, one can test whether a given effect (e.g., the red horizontal line in the figure) lies in the "uninteresting" region by conducting two one-sided tests: one to determine whether the observed parameter estimate lies below the upper boundary of the "uninteresting" region (i.e., below .3) and a second to determine whether it lies above the lower boundary (i.e., above -.3).  If the answer to both questions is yes, the effect lies within the "uninteresting" region and is thus too small to be worth worrying about.
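In base R, the two one-sided tests can be sketched as follows (the function and its inputs are illustrative; the Lakens, Scheel, and Isager tutorial covers ready-made implementations such as the TOSTER package):

    # A minimal two one-sided tests (TOST) sketch for a parameter
    # estimate whose sampling distribution is t; bound = .3 as above.
    tost <- function(estimate, se, df, bound = .3) {
      p_upper <- pt((estimate - bound) / se, df)  # H0: effect >= bound
      p_lower <- pt((estimate + bound) / se, df,
                    lower.tail = FALSE)           # H0: effect <= -bound
      max(p_upper, p_lower)  # both tests must reject to conclude equivalence
    }
    tost(estimate = .05, se = .10, df = 100)  # well below .05: inside the bounds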


Equivalence tests provide a solution to the problem of "crud" producing trivial differences that should not be considered as threatening to the focal hypothesis.  The researcher defines for the residual contrast a threshold representing a "substantial" deviation (i.e., a deviation representing something more than "crud") from the pattern established by the focal contrast.  The researcher then performs an equivalence test to see whether the residual contrast lies within this threshold.


Computational examples


These examples are posted as images here due to Blogger interpreting characters as html code; for ready-made code that you can copy and paste, see this Github gist.


Three group example


The first step in the three group case is to create a set of two Planned Orthogonal Contrasts.  For this example, I am assuming that my focal contrast and residual contrasts are the following:
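(The original table is posted as an image.  Consistent with the dosage example above, the sketches below assume a linear focal contrast, c1 = [-1/2, 0, 1/2], and a quadratic residual contrast, c2 = [1/3, -2/3, 1/3].)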


To create these contrasts in R (using data that I simulate), I might do the following (see the Github gist for the full context):
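Since the code itself is posted as an image, here is a sketch of what this step might look like (the simulated means, sample size, and seed are illustrative choices, not necessarily those in the gist):

    # Simulate balanced data from three groups whose true means are
    # exactly linear in dose (illustrative values).
    set.seed(1)
    n <- 100  # per group
    dose <- factor(rep(c("none", "moderate", "high"), each = n),
                   levels = c("none", "moderate", "high"))
    y <- rnorm(3 * n, mean = rep(c(10, 6, 2), each = n), sd = 1)

    # Assign the focal (c1) and residual (c2) contrasts to the factor.
    contrasts(dose) <- cbind(c1 = c(-1/2, 0, 1/2),
                             c2 = c(1/3, -2/3, 1/3))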



The next step is to fit a model with these contrasts, like the one below. We are looking for whether the test of the focal contrast (c1) allows us to reject the hypothesis of no difference and whether the test of the residual contrast (c2) does not allow us to reject the hypothesis of no difference.
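Continuing the sketch above:

    fit <- lm(y ~ dose)
    summary(fit)
    # In the output, the row labeled dosec1 tests the focal contrast
    # (we want to reject zero) and the row labeled dosec2 tests the
    # residual contrast (we hope not to reject zero).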

Of course, crud may lead us to reject the hypothesis that the residual contrast is zero.  If we want to test whether c2 is small in a way that allows for such crud, we could do an equivalence test using pre-specified boundaries for what we consider the smallest effect of substantive interest (an arbitrarily chosen .4 in this case).  Below, we were able to reject the hypothesis that c2 lies outside the boundaries for what we consider a meaningfully large difference.
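Continuing the sketch, the equivalence test can be run by hand as two one-sided t tests (a package like TOSTER would also work; the bound of .4 is the arbitrary value from the text):

    est <- coef(summary(fit))["dosec2", "Estimate"]
    se  <- coef(summary(fit))["dosec2", "Std. Error"]
    df  <- fit$df.residual
    p_upper <- pt((est - .4) / se, df)                      # H0: c2 >= .4
    p_lower <- pt((est + .4) / se, df, lower.tail = FALSE)  # H0: c2 <= -.4
    max(p_upper, p_lower)  # if below .05, c2 is significantly inside (-.4, .4)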

This is a case where there is no significant deviation from the focal contrast, and we can also say that the deviation from the focal contrast is significantly smaller than the smallest effect of substantive interest.  Our research hypothesis, as represented by the focal contrast, has passed a severe test and therefore gains verisimilitude.


Four group example


The first step in the four group case is to create a set of three Planned Orthogonal Contrasts.  For this example, I am assuming that my focal contrast and residual contrasts are the following:
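(The original table is posted as an image.  The sketches below assume one plausible orthogonal set: a linear focal contrast, c1 = [-3, -1, 1, 3], with quadratic and cubic residual contrasts, c2 = [1, -1, -1, 1] and c3 = [-1, 3, -3, 1].)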


To create these (using data that I simulate), I might do the following (see the Github gist for the full context):
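A sketch of this step (the simulated means follow a linear trend plus a small quadratic bump standing in for "crud"; all values are illustrative):

    set.seed(2)
    n <- 100  # per group
    group <- factor(rep(paste0("g", 1:4), each = n),
                    levels = paste0("g", 1:4))
    # True means: a linear trend plus a small quadratic deviation.
    y <- rnorm(4 * n,
               mean = rep(c(2, 4, 6, 8) + .15 * c(1, -1, -1, 1), each = n),
               sd = 1)
    contrasts(group) <- cbind(c1 = c(-3, -1, 1, 3),
                              c2 = c(1, -1, -1, 1),
                              c3 = c(-1, 3, -3, 1))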

The next step is to fit a model with these contrasts, like the one below.  We are looking for whether the test of the focal contrast (c1) allows us to reject the hypothesis of no difference.  Because we now have two residual contrasts instead of one, we probably want to conduct a joint test of whether any of these differ from zero rather than two individual tests, like so.
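One way to sketch the joint test in base R is to compare the full model against a model that keeps only the focal contrast:

    fit4 <- lm(y ~ group)
    summary(fit4)  # the row labeled groupc1 tests the focal contrast

    # Joint F test that c2 = c3 = 0: compare the full model against
    # one whose only predictor is the focal contrast.
    X <- model.matrix(fit4)
    fit_focal <- lm(y ~ X[, "groupc1"])
    anova(fit_focal, fit4)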



As before, crud may lead us to reject the hypothesis that the residual contrasts are zero.  If we want to test whether the residual contrasts are small in a way that allows for such crud, we could do an equivalence test using pre-specified boundaries for what we consider the smallest effect of substantive interest (an arbitrarily chosen .4 in this case).  We probably want to do joint tests of equivalence, just as we did a joint test that the residual contrasts are zero.  I demonstrate that process below.
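The original code is posted as an image; one reasonable construction (an intersection-union approach) runs a TOST on each residual contrast and takes the largest p value as the joint p value, so that equivalence is concluded only if every residual contrast passes.  A sketch:

    tost_term <- function(fit, term, bound = .4) {
      est <- coef(summary(fit))[term, "Estimate"]
      se  <- coef(summary(fit))[term, "Std. Error"]
      df  <- fit$df.residual
      max(pt((est - bound) / se, df),                      # H0: effect >= bound
          pt((est + bound) / se, df, lower.tail = FALSE))  # H0: effect <= -bound
    }
    # Joint (intersection-union) equivalence p value for c2 and c3.
    max(tost_term(fit4, "groupc2"), tost_term(fit4, "groupc3"))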

In this case, although the deviation from the focal contrast is significantly different from zero, we can say that the amount of deviation is smaller than our threshold for what constitutes a meaningful effect.  If we think that any deviation from the focal contrast is reason for concern, we might think about amending our theory.  If, on the other hand, we believe that the null hypothesis of no relationship is basically always false and that many of these non-zero relationships are "crud", we might advance the claim that our theory has been corroborated.  In either case, our testing procedure is more severe than a mere test of the null hypothesis of no differences between group means.


Conclusion: Our existing tools can make our tests more severe


In psychology we are in a period of intense interest in changing our methods and practices to improve the efficiency of our science.  Some forty years after Meehl wrote about the benefits to science of risky tests, perhaps we are now ready to implement some of his ideas by making our theory-testing procedures more severe.  The tools to do so already exist, and I believe the following procedure, which draws on methodological tools already in the toolbelts of psychological scientists, is one example of how to do so.
  1. Define a focal contrast that fully represents one's predictions about the group means.
  2. Define residual contrasts that are orthogonal to the focal contrast and to each other.  If you have g groups, you should now have g - 1 total contrasts.
  3. Test whether the parameter estimate for the focal contrast is different from 0.
  4. Either test whether the parameter estimates for the residual contrasts are different from 0 OR test whether they are smaller than the smallest effect of substantive interest.
  5. If the focal contrast is different from 0 AND the residual contrasts are either small or not different from 0, your theory, as represented by the focal contrast, has passed a risky test and is corroborated in a meaningful way.  If one of these conditions is not satisfied, the theory may need amendment.


References

Abelson, R. P., & Prentice, D. A. (1997). Contrast tests of interaction hypotheses. Psychological Methods, 2, 315–328.

Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1, 259–269. https://doi.org/10.1177/2515245918770963

Meehl, P. E. (1967). Theory-Testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science, 34, 103–115. https://doi.org/10.1086/288135

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. https://doi.org/10.1037/0022-006X.46.4.806
