Wednesday, February 17, 2021

Three Faces of Validity: Internal, Construct, and External

I have a new draft paper in circulation, "The Necessity of Construct and External Validity for Generalized Causal Claims", co-written with two social scientists, Kevin Esterling and David Brady.  Here's a distillation of the core ideas.


-----------------------------------------


Consider a simple causal claim: "α causes β in γ".  One type of event (say, caffeine after dinner) tends to cause another type of event (disrupted sleep) in a certain range of conditions (among typical North American college students).

Now consider a formal study you could run to test this.  You design an intervention: 20 ounces of Peet's Dark Roast in a white cup, served at 7 p.m.  You design a control condition: 20 ounces of Peet's decaf, served at the same time.  You recruit a population: 400 willing undergrads from Bigfoot Dorm, delighted to have free coffee.  Finally, you design a measure of disrupted sleep: wearable motion sensors that normally go quiet when a person is sleeping soundly.

You do everything right.  Assignment is random and double blind, everyone drinks all and only what's in their cup, etc., and you find a big, statistically significant treatment effect: The motion sensors are 20% more active between 2 and 4 a.m. for the coffee drinkers than the decaf drinkers.  You have what social scientists call internal validity.  The randomness, excellent execution, and large sample size ensure that there are no systematic differences between the treatment and control groups other than the contents of their cups (well...), so you know that your intervention had a causal effect on sleep patterns as measured by the motion sensors.  Yay!

You write it up for the campus newspaper: "Caffeine After Dinner Interferes with Sleep among College Students".

But do you know that?

Of course it's plausible.  And you have excellent internal validity.  But to get to a general claim of that sort, from your observation of 400 undergrads, requires further assumptions that we ought to be careful about.  What we know, based on considerations of internal validity alone, is that this particular intervention (20 oz. of Peet's Dark Roast) caused this particular outcome (more motion from 2 to 4 a.m.) the day and place the experiment was performed (Bigfoot Dorm, February 16, 2021).  In fact, even calling the intervention "20 oz. of Peet's Dark Roast" hides some assumptions -- for of course, the roast was from a particular batch, brewed in a particular way by a particular person, etc.  All you really know based on the methodology, if you're going to be super conservative, is this: Whatever it is that you did that differed between treatment and control had an effect on whatever it was you measured.

Call whatever it was you did in the treatment condition "A" and whatever it was you did differently in the control condition "-A".  Call whatever it was you measured "B".  And call the conditions, including both the environment and everything that was the same or balanced between treatment and control, "C" (that it was among Bigfoot Dorm students, using white cups, brewed at an average temperature of 195°F, etc.).

What we know then is that the probability, p, of B (whatever outcome you measured) was greater given A (whatever you did in the treatment condition) than given -A (whatever you did in the control condition), in C (the exact conditions in which the experiment was performed).  In other words:

p(B|A&C) > p(B|-A&C).  [Read this as "The probability of B given A and C is greater than the probability of B given not-A and C."]
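To make the inequality concrete, here is a minimal simulation sketch of a randomized two-arm trial.  Everything in it is an assumption for illustration: the outcome B is simplified to a yes/no event (say, sensor activity above some threshold), the probabilities 0.30 and 0.45 are invented, and the function names `simulate_trial` and `estimate` are hypothetical.  It shows only what internal validity delivers: an estimate of p(B|A&C) and p(B|-A&C) and a test of whether they differ.

```python
import math
import random


def simulate_trial(n=400, p_control=0.30, p_treatment=0.45, seed=0):
    """Randomly assign n participants to two equal arms and record a binary B.

    The two probabilities are made-up values, not estimates from any study.
    """
    rng = random.Random(seed)
    ids = list(range(n))
    rng.shuffle(ids)                      # random assignment to arms
    treated = set(ids[: n // 2])          # first half gets A, the rest get -A
    outcomes = {
        i: rng.random() < (p_treatment if i in treated else p_control)
        for i in ids
    }
    return treated, outcomes


def estimate(treated, outcomes):
    """Return estimated p(B|A&C), p(B|-A&C), and a two-proportion z statistic."""
    t = [b for i, b in outcomes.items() if i in treated]
    c = [b for i, b in outcomes.items() if i not in treated]
    p_t = sum(t) / len(t)
    p_c = sum(c) / len(c)
    pooled = (sum(t) + sum(c)) / (len(t) + len(c))
    se = math.sqrt(pooled * (1 - pooled) * (1 / len(t) + 1 / len(c)))
    return p_t, p_c, (p_t - p_c) / se


if __name__ == "__main__":
    treated, outcomes = simulate_trial()
    p_t, p_c, z = estimate(treated, outcomes)
    print(f"p(B|A&C) ~ {p_t:.3f}, p(B|-A&C) ~ {p_c:.3f}, z = {z:.2f}")
```

Notice that nothing in the code says what A, B, or C are.  The arms are just "treated" and "not treated", and the outcome is just "B happened".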

But remember, what you claimed was both more specific and more general than that.  You claimed "caffeine after dinner interferes with sleep among college students".  To put it in the Greek-letter format with which we began, you claimed that α (caffeine after dinner) causes β (poor sleep) in γ (among college students, presumably in normal college dining and sleeping contexts in North America, though this was not clearly specified).

In other words, what you think is true is not merely the vague whatever-whatever sentence

p(B|A&C) > p(B|-A&C)

but rather the more ambitious and specific sentence

p(β|α&γ) > p(β|-α&γ).[1]

In order to get from one to the other, you need to do what Esterling, Brady, and I call causal specification.

You need to establish, or at least show plausible, that α is what mattered about A.  You need to establish that it was the caffeine that had the observed effect on B, rather than something else that differed between treatment and control, like tannin levels (which differed slightly between the dark roast and decaf).  The internally valid study tells you that the intervention had causal power, but nothing inside the study could possibly tell you what aspect of the intervention had the causal power.  It may seem likely, based on your prior knowledge, that it would be the caffeine rather than the tannins or any of the other things that differ between treatment and control (if you're creative, the list could be endless).  But that judgment rests on your prior knowledge, not on the experiment itself.

One way to represent this is to say that alongside α (the caffeine) are some presumably inert elements, θ (the tannins, etc.), that also differ between treatment and control.  The intervention A is really a bundle of α and θ: A = α&θ.  Now substituting α&θ for A, what the internally valid experiment established was

p(B|(α&θ)&C) > p(B|-(α&θ)&C).

If θ is causally inert, with no influence on the measured outcome B, you can drop the θ, thus inferring from the sentence above to

p(B|α&C) > p(B|-α&C).

In this case, you have what Esterling, Brady, and I call construct validity of the cause.  You have correctly specified the element that is doing the causal work.  It's not just A as a whole, but α in particular, the caffeine.  Of course, you can't just assert this.  You ought to establish it somehow.  That's the process of establishing construct validity of the cause.
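One way to see why the two-arm study alone can't settle this is to compare two hypothetical worlds: one in which the caffeine (α) does all the work, and one in which the tannins (θ) do.  The sketch below is an illustration under loud assumptions: the additive probability model, the effect sizes, and the function names are all invented, and the control arm is idealized as containing neither α nor θ.  It also shows one possible way of establishing construct validity of the cause, a 2x2 factorial design that varies α and θ separately.  This is just one option, not the method the paper prescribes.

```python
def p_outcome(alpha, theta, effect_alpha, effect_theta, base=0.30):
    """Hypothetical probability of B given the presence (1) or absence (0)
    of the caffeine (alpha) and the tannins (theta).  Additive by assumption."""
    return base + effect_alpha * alpha + effect_theta * theta


def two_arm_difference(effect_alpha, effect_theta):
    """Treatment is the bundle alpha&theta; the idealized control has neither."""
    treated = p_outcome(1, 1, effect_alpha, effect_theta)
    control = p_outcome(0, 0, effect_alpha, effect_theta)
    return treated - control


def factorial_differences(effect_alpha, effect_theta):
    """A 2x2 factorial design varies alpha and theta separately."""
    baseline = p_outcome(0, 0, effect_alpha, effect_theta)
    alpha_only = p_outcome(1, 0, effect_alpha, effect_theta) - baseline
    theta_only = p_outcome(0, 1, effect_alpha, effect_theta) - baseline
    return alpha_only, theta_only


# Two invented worlds, given as (effect_alpha, effect_theta).
caffeine_world = (0.15, 0.0)   # caffeine does the causal work
tannin_world = (0.0, 0.15)     # tannins do the causal work

if __name__ == "__main__":
    for name, world in [("caffeine", caffeine_world), ("tannin", tannin_world)]:
        print(name, two_arm_difference(*world), factorial_differences(*world))
```

In both worlds the two-arm difference is the same, so the original study can't tell them apart.  Only the design that separates α from θ distinguishes them.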

Analogous reasoning applies to the relationship between B (measured motion-sensor outputs) and β (disrupted sleep).  If you can establish the right kind of relationship between B and β you can move from a claim about B to a conclusion about β, thus moving from 

p(B|α&C) > p(B|-α&C)

to

p(β|α&C) > p(β|-α&C).

If this can be established, you have correctly specified the outcome and have achieved construct validity of the outcome.  You're really measuring disrupted sleep, as you claim to be, rather than something else (like non-disruptive limb movement during sleep).

And finally, if you can establish that the right kind of relationship holds between the actual testing conditions and the conditions to which you generalize (college students in typical North American eating and sleeping environments) -- then you can move from C to γ.  This will be so if your actual population is representative and the situation isn't strange.  More specifically, since what is "representative" and "strange" depends on what causes what, the specification of γ requires knowing what background conditions are required for α to have its effect on β.  If you know that, you can generalize to populations beyond your sample where the relevant conditions γ are present (and refrain from generalizing to cases where the relevant conditions are absent).  You can thus substitute γ for C, generating the causal generalization that you had been hoping for from the beginning:

p(β|α&γ) > p(β|-α&γ).
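A small arithmetic sketch shows how the step from C to γ can fail.  Suppose, purely hypothetically, that caffeine disrupts sleep only among students who are already excitable (say, after a recent earthquake), and that such students are common in your sample but rare in the population you want to generalize to.  The shares, the effect sizes, and the function name `average_effect` are all assumptions made up for illustration.

```python
def average_effect(share_excitable, effect_excitable=0.20, effect_calm=0.0):
    """Average effect of alpha on beta in a population that mixes
    excitable and calm students (all numbers invented)."""
    return (share_excitable * effect_excitable
            + (1 - share_excitable) * effect_calm)


# Conditions C: the sampled dorm, where most students are excitable.
sample_effect = average_effect(share_excitable=0.9)

# Conditions gamma: the wider population, where few students are.
target_effect = average_effect(share_excitable=0.1)

if __name__ == "__main__":
    print(f"effect in C: {sample_effect:.2f}, effect in gamma: {target_effect:.2f}")
```

The same causal process yields a large effect in C and a small one in γ.  Whether a sample counts as "representative" depends on knowing which background conditions matter, which is exactly what the study alone doesn't tell you.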

In this way, internal, construct, and external validity fit together.  Moving from finite, historically particular data to a general causal claim requires all three: internal validity, construct validity of both the cause and the outcome, and external validity.  Otherwise, you don't have the well-supported generalization you think you have.
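For readers who prefer to see the whole chain at once, the four sentences can be lined up as follows.  This is only a restatement in LaTeX of the inequalities already given, using the post's own "&" and "-" notation and assuming the `amsmath` package for the `align*` environment.  The labels on the right name what each line presupposes.

```latex
\begin{align*}
p(B \mid A \,\&\, C) &> p(B \mid -A \,\&\, C)
  && \text{internal validity} \\
p(B \mid \alpha \,\&\, C) &> p(B \mid -\alpha \,\&\, C)
  && \text{construct validity of the cause } (A = \alpha \,\&\, \theta,\ \theta \text{ inert}) \\
p(\beta \mid \alpha \,\&\, C) &> p(\beta \mid -\alpha \,\&\, C)
  && \text{construct validity of the outcome } (B \text{ tracks } \beta) \\
p(\beta \mid \alpha \,\&\, \gamma) &> p(\beta \mid -\alpha \,\&\, \gamma)
  && \text{external validity } (C \text{ generalizes to } \gamma)
\end{align*}
```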

Although internal validity is often privileged in social scientists' discussions of causal inference, with internal validity alone, you know only that the particular intervention you made (whatever it was) had the specific effect you measured (whatever that effect amounts to) among the specific population you sampled at the time you ran the study.  You know only that something caused something.  You don't know what causes what.

-----------------------------------------

Here's another way to think about it.  If you claim that "α causes β in γ", there are four ways you could go wrong:

(1.) Something might cause β in γ, but that something might not be α.  (The tannin rather than the caffeine might disrupt sleep.)

(2.) α might cause something in γ, but it might not cause β.  (The caffeine might cause more movement at night without actually disrupting sleep.)

(3.) α might cause β in some set of conditions, but not γ.  (Caffeine might disrupt sleep only in unusual circumstances particular to your school.  Maybe students are excitable because of a recent earthquake and wouldn't normally be bothered.)

(4.) α might have some relationship to β in γ, but it might not be a causal relationship of the sort claimed.  (Maybe, through an error in assignment procedures, only students on the noisy floors got the caffeine.)

Practices that ensure internal validity protect only against errors of Type 4.  To protect against errors of Type 1-3, you need proper causal specification, with both construct and external validity.

-----------------------------------------

Note 1: Throughout the post, I assume that causes monotonically increase the probability of their effects, including in the presence of other causes.

-----------------------------------------


5 comments:

Arnold said...

Very exciting, 'four practices that ensure validity of psychology' are like the four "fundamental interactions" or forces or universals or natures...of physics...

Ideas themselves could always begin with a minimum of four practices to be considered in valid philosophy...thanks

Philosopher Eric said...

Professor, this paper will clearly be aimed at countering the reproducibility crisis associated with our still soft mental and behavioral sciences. Well done! But how will it be received? It’s obviously far more convenient for the foxes to guard the hen house without all sorts of hen protection protocol in place. In a sense what we have now is that foxes have come to realize that they must keep other foxes honest by checking a given study with new ones. Independent verification takes time and money however, and who’s to say that similar mistakes (if they exist), won’t happen again? Or that a second study might be flawed in an opposing direction? If modern scientists are beginning to take the softness of their fields seriously however, then your proposal might have some legs.

Arnold said...

..."Note 1: Throughout the post, I assume that causes monotonically increase the probability of their effects, including the presence of other causes"...

Is the thrust of your post to stay with a proposed construct...as in intention and intension...or as in uncertainties of pretension....

David Duffy said...

I think a problem has been that some people have looked for general methods for assessing studies and their results - something like the PRISMA guidelines for assessing studies to combine into a meta-analysis. (In passing, obviously meta-analysis has much to say about external validity).

Assessment requires specialist knowledge about the domain of study, precisely because of the need to diagnose threats to causal inference. Take the example of selecting an appropriate control treatment for a trial. In trials of surgical treatments, the "gold standard" control is a sham procedure including a general anaesthetic and an incision! (Look at the trials for arthroscopic procedures.) My favourite example of this was garlic tablets for hypercholesterolemia, where the spouse could always work out if a participant was taking the active one rather than the placebo, destroying blinding.

In the coffee example, it would require a pharmacologist to ensure the non-caffeine components (the tannins) actually were at the same concentrations in decaffeinated and caffeinated coffee, and to review experimental evidence for effects of compounds other than caffeine.

In the GSL example, it is obvious to me that a simple two-arm RCT would be inappropriate (even without the details of the story), though these are now commonly preferred in the medical domain for many reasons (but mainly practicality of running large multicentre studies). Domain knowledge should have been used to list the other plausible interventions to test in a factorial design (as beloved of psychologists). The juror turnout problem strikes me as having many similarities to the response rate problem generally - there is a huge literature where they manipulate introduction letters, form itself, reminders by mail or phone and so on. In my own case, I randomized the number of postage stamps on a provided questionnaire's return envelope - the more stamps, the more likely it was to be returned :).

But here the constraints, of course, are cost.

Arnold said...

So I googled causal claim and learned a lot, but was stopped at...
..."Internal Validity" (it) is the approximate truth about inferences regarding cause-effect or causal relationships. Thus, internal validity is only relevant in studies that try to establish a causal relationship. It's not relevant in most observational or descriptive studies, for instance.'...So is there only external validity?

This upsets some of us who try to use contradiction to try to understand all and anything.

A Socratic question might be then now...Can philosophy and social science propose a causal model of contradiction causality as the way the world works?