Suppose I'm running a randomized study: Treatment group A gets the medicine; control group B gets a placebo; later, I test both groups for disease X. I've randomized perfectly, it's double blind, there's perfect compliance, my disease measure is flawless, and no one drops out. After the intervention, 40% of the treatment group have disease X and 80% of the control group do. Statistics confirm that the difference is very unlikely to be chance (p < .001). Yay! Time for FDA approval!

There's an assumption behind the optimistic inference that I want to highlight. I will call it the *Causal Sparseness* assumption. This assumption is required for us to be justified in concluding that randomization has achieved what we want randomization to achieve.

So, what *is* randomization supposed to achieve?

Dice roll, please....

Randomization is supposed to achieve this: a *balancing* of other causal influences that might bear on the outcome. Suppose that the treatment works only for women, but we the researchers don't know that. Randomization helps ensure that approximately as many women are in treatment as in control. Suppose that the treatment works twice as well for participants with genetic type ABCD. Randomization should also balance that difference (even if we the researchers do no genetic testing and are completely oblivious to this influence). Maybe the treatment works better if the medicine is taken after a meal. Randomization (and blinding) should balance that too.

But here's the thing: Randomization only balances such influences *in expectation*. Of course, it could end up, randomly, that substantially more women are in treatment than control. It's just unlikely if the number of participants N is large enough. If we had an N of 200 in each group, the odds are excellent that the number of women will be similar between the groups, though of course there remains a minuscule chance (6 x 10^-61 assuming 50% women) that 200 women are randomly assigned to treatment and none to control.
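That parenthetical figure can be checked with a line of arithmetic. The sketch below assumes exactly 200 women, each sent to treatment or control by an independent fair coin flip; that is one reading of the figure, not something spelled out above. For comparison it also computes the chance under a different assumption, namely that 400 participants (200 of them women) are split into two groups of exactly 200.

```python
from math import comb

# Assumption: 200 women, each independently assigned to treatment
# with probability 1/2 (simple coin-flip randomization).
p_coin_flip = 0.5 ** 200

# Alternative assumption: 400 participants (200 women, 200 men) split
# into two groups of exactly 200. Only one of the C(400, 200) equally
# likely splits puts every woman in the treatment group.
p_fixed_groups = 1 / comb(400, 200)

print(f"coin-flip assignment: {p_coin_flip:.1e}")
print(f"fixed group sizes:    {p_fixed_groups:.1e}")
```

Either way the chance is negligible; the fixed-group-size version is smaller still.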

And here's the other thing: People (or any other experimental unit) have infinitely many properties. For example: hair length (cf. Rubin 1974), dryness of skin, last name of their kindergarten teacher, days since they've eaten a burrito, nearness of Mars on their 4th birthday....

Combine these two things and this follows: For any finite N, there will be infinitely many properties that are *not* balanced between the groups after randomization -- just by chance. If any of these properties are properties that need to be balanced for us to be warranted in concluding that the treatment had an effect, then we cannot be warranted in concluding that the treatment had an effect.

Let me restate in a less infinitary way: In order for randomization to warrant the conclusion that the intervention had an effect, N must be large enough to ensure balance of all other non-ignorable causes or moderators that might have a non-trivial influence on the outcome. If there are 200 possible causes or moderators to be balanced, for example, then we need sufficient N to balance all 200.

Treating all other possible and actual causes as "noise" is one way to deal with this. This is just to take everything that's unmeasured and make one giant variable out of it. Suppose that there are 200 unmeasured causal influences that actually do have an effect. Unless N is huge, some will be unbalanced after randomization. But it might not matter, since we ought to expect them to be unbalanced in a balanced way! A, B, and C are unbalanced in a way that favors a larger effect in the treatment condition; D, E, and F are unbalanced in a way that favors a larger effect in the control condition. Overall it just becomes approximately balanced noise. It would be unusual if all of the unbalanced factors A-F happened to favor a larger effect in the treatment condition.

That helps the situation, for sure. But it doesn't eliminate the problem. To see why, consider an outcome with many plausible causes, a treatment that's unlikely to actually have an effect, and a low-N study that barely passes the significance threshold.

Here's my study: I'm interested in whether silently thinking "vote" while reading through a list of registered voters increases the likelihood that the targets will vote. It's easy to randomize! One hundred get the think-vote treatment and another one hundred are in a control condition in which I instead silently think "float". I preregister the study as a one-tailed two-proportion test in which that's the only hypothesis: no p-hacking, no multiple comparisons. Come election day, in the think-vote condition 60 people vote and in the control condition only 48 vote (p = .04)! That's a pretty sizable effect for such a small intervention. Let's hire a bunch of volunteers?
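As a check on the arithmetic, here is a minimal sketch of that test. It assumes the pooled-variance normal approximation with no continuity correction, since the study description doesn't say which variant is used, and the function name is hypothetical.

```python
from math import erfc, sqrt

def one_tailed_two_proportion_p(x1, n1, x2, n2):
    """Pooled two-proportion z-test, one-tailed for H1: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of a standard normal: 0.5 * erfc(z / sqrt(2)).
    return z, 0.5 * erfc(z / sqrt(2))

# 60 of 100 voted in the think-vote condition, 48 of 100 in control.
z, p = one_tailed_two_proportion_p(60, 100, 48, 100)
print(f"z = {z:.2f}, one-tailed p = {p:.3f}")
```

Under these assumptions the p-value comes out a little above .04, in line with the figure reported above.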

Suppose also that there are at least 40 variables that plausibly influence voting rate: age, gender, income, political party, past voting history.... The odds are good that at least one of these variables will be unequally distributed after randomization in a way that favors higher voting rates in the treatment condition. And -- as the example is designed to suggest -- it's surely more plausible, despite the preregistration, to think that that unequally distributed factor better explains the different voting rates between the groups than the treatment does. (This point obviously lends itself to Bayesian analysis.)
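To put a rough number on "the odds are good", here is a back-of-the-envelope sketch. It rests on two simplifying assumptions that are not part of the study: the 40 variables are independent of one another, and a variable counts as "unequally distributed" when its chance imbalance favors the treatment group strongly enough that it would occur only 5% of the time under randomization.

```python
# Assumptions (illustrative only): 40 independent covariates, each with a
# 5% chance of showing a treatment-favoring imbalance this large.
n_covariates = 40
per_covariate_chance = 0.05

# Complement rule: 1 minus the chance that none of them is imbalanced.
p_at_least_one = 1 - (1 - per_covariate_chance) ** n_covariates
print(f"P(at least one such imbalance) = {p_at_least_one:.2f}")
```

With these numbers the chance is about 0.87. Correlated covariates or a different threshold would change the figure, but not the qualitative point.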

We can now generalize back, if we like, to the infinite case: If there are infinitely many possible causal factors that we ought to be confident are balanced before accepting the experimental conclusion, then no finite N will suffice. No finite N can ensure that they are all balanced after randomization.

We need an assumption here, which I'm calling *Causal Sparseness*. (Others might have given this assumption a different name. I welcome pointers.) It can be thought of as either a knowability assumption or a simplicity assumption: We can know, before running our study, that there are few enough potentially unbalanced causes of the outcome that, if our treatment gives a significant result, the effectiveness of the treatment is a better explanation than one of those unbalanced causes. The world is not dense with plausible alternative causes.

As the think-vote example shows, the plausibility of the Causal Sparseness assumption varies with the plausibility of the treatment and the plausibility that there are many other important causal factors that might be unbalanced. Assessing this plausibility is a matter of theoretical argument and verbal justification.

Making the Causal Sparseness assumption more plausible is one important reason we normally try to make the treatment and control conditions as similar as possible. (Otherwise, why not just trust randomness and leave the rest to a single representation of "noise"?) The plausibility of Causal Sparseness cannot be assessed purely mechanically through formal methods. It requires a theory-grounded assessment in every randomized experiment.

## 16 comments:

Very cool! Sounds like a specific version of the general idea that -- to put it in terms of Larry's Phil 7 -- our confidence in the conclusion of an inductive argument depends not only on our confidence that the conclusion is better than the rivals we've thought of, but also on our confidence that we've thought of all the rivals that are worth thinking of in the context.

There's a sort of parallel in philosophy of religion, too. Some arguments from evil seem to depend on the idea that if God did have a justifying reason for allowing evil, then we would know what it is. So-called skeptical theists question this premise.

Yes, Neal, I like that way of thinking of it. In Bayesian terms, there's the hypothesis under test, the rival hypotheses, and also (sometimes omitted) a catch-all hypothesis for unthought-of rivals.

I don't think that the number of causal factors involved is important.

One way to think about your voting study is that, prior to taking part in your study, each person has a personal propensity to vote which is some number between 0% and 100%. If your intervention increases people's propensity to vote by 10 percentage points on average, then it will cause an additional 10% of people to vote. The problem is that you have no idea what anyone's propensity to vote is, so you randomize the people into two groups to try to roughly balance that out. But it won't balance it out perfectly, which is why you need statistics to think through how imbalanced the two groups could be just based on people coming in with different propensities to vote and then getting randomized.

None of this depends on what causal factors are behind people's propensity to vote. It could be 1 thing or it could be 10,000 things. It doesn't matter because their propensity to vote sums all that up. And you can look at all the limitations of randomization in terms of the distribution of propensity to vote without worrying about where each person's propensity comes from.

(Well, it can matter if you are able to do something more complicated than just a simple randomized experiment. If gender has a large effect on voting rates, you could design the experiment to balance gender ratios between the two groups rather than using pure randomization, or attempt to statistically adjust for gender in the results, which reduces the amount of noise. And if you know about lots of factors which are related to a person's propensity to vote, you could estimate a "propensity to vote" score for each person in the study and balance that between the groups (when you split people up or with after-the-fact statistical adjustments). Pure randomization is a trick you can use when you know nothing about anyone's voting propensity or the factors that influence their voting.)

I agree with Dan here; the computation of a p-value effectively sweeps this all under the hood.

The nice thing about randomization is that it lets us *ignore* all of this complexity: after the causes are all tallied up and their subtle interactions worked out in full detail, there will be a final probability that a randomly-selected person from your global population votes/gets the disease/etc., and that probability is all we have to care about. We can ask, for each probability, how strange the results we're seeing look in a world where that is the underlying probability.

Another framing: out of the thousands of possible causes behind the disease, some of them will likely be unequally distributed between A and B. But assign prior probabilities to the strength of effect each of these factors has, and compute the odds that this much more complex model results in an 80% - 40% distribution (or more extreme) in the case where the medicine has no effect, and you'll find the same p-value you got originally (provided one vs two-tailed computations are matched, the frequentist model you're using to compute the p-value is the same, and so on).

The odds that, say, Factor A is unevenly distributed in your population (and also has a significant effect) in a way that leads to the results observed, plus the odds that Factor B is unevenly distributed, minus the odds that they both are in ways that cancel out, plus... will all combine to give you the same p-value that modeling things as a much simpler Bernoulli random variable would give you. (You might want to use a more complicated model like this for trying to explain the observed results, or getting a better understanding of the dynamics at play here - it's not a bad model to have! - but it doesn't render significance tests any worse than they are in a world with much simpler underlying dynamics.)

It might be useful to actually run this computation in a simple case. Suppose you know that in a given year, exactly half of Americans will stub their toes. You collect 10 randomly-chosen subjects in the US, give them your patent-pending course on avoiding clumsiness, and find that only 2 stubbed their toes after a year. Under the null hypothesis (where the course has no effect), we would expect a result at least this strong (2 or fewer people stub their toes) 5.46875% of the time, so we get a p-value a touch over 0.05.

Now consider a more complex model: men are more clumsy than women, and stub their toes in 90% of years compared to women's 10% (we assume there are equally many men and women in the US). This is consistent with our prior knowledge that exactly half of Americans stub their toes in a given year, as 0.9*0.5+0.1*0.5 = 0.5.

What are the odds of seeing a result this extreme in our more complex model? Well, we can calculate for each distribution of men and women (0/10, 1/9, ...) the probability that we sampled such a distribution, and then the conditional probability that we would see a result this large in the event of such a distribution (so, for instance, this conditional probability will be high if our sample was all women, and very low if it was all men). Multiplying these and adding them together, we'll get the total we want.

Each line below lists the probability of a given gender distribution, the conditional probability of at most 2 stubbed toes given that distribution, and the overall probability of the observed outcome by multiplying the two probabilities.

0 men, 10 women: 0.0010 * 0.9298 = 0.0009080

1 men, 9 women: 0.0098 * 0.7921 = 0.0077350

2 men, 8 women: 0.0439 * 0.5047 = 0.0221773

3 men, 7 women: 0.1172 * 0.1402 = 0.0164248

4 men, 6 women: 0.2051 * 0.0291 = 0.0059709

5 men, 5 women: 0.2461 * 0.0052 = 0.0012812

6 men, 4 women: 0.2051 * 0.0008 = 0.0001742

7 men, 3 women: 0.1172 * 0.0001 = 0.0000153

8 men, 2 women: 0.0439 * 0.0000 = 0.0000008

9 men, 1 women: 0.0098 * 0.0000 = 0.0000000

10 men, 0 women: 0.0010 * 0.0000 = 0.0000000

Adding these all up, we get 0.0546875.
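The table above can be reproduced in a few lines. This is only a sketch of the calculation as described, with hypothetical helper names (`binom_pmf`, `p_at_most_two_stubs`) and the same assumptions as in the text: 10 subjects, each equally likely to be a man or a woman, with yearly toe-stubbing rates of 90% and 10%.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def p_at_most_two_stubs(men, women, p_man=0.9, p_woman=0.1):
    """P(total stubbed toes <= 2) for a sample with this gender split."""
    total = 0.0
    for from_men in range(min(men, 2) + 1):
        for from_women in range(min(women, 2 - from_men) + 1):
            total += (binom_pmf(from_men, men, p_man)
                      * binom_pmf(from_women, women, p_woman))
    return total

n = 10

# Simple model: every subject stubs a toe with probability 0.5.
simple = sum(binom_pmf(k, n, 0.5) for k in range(3))

# Complex model: weight each gender split by its probability.
mixture = 0.0
for men in range(n + 1):
    p_split = binom_pmf(men, n, 0.5)
    p_cond = p_at_most_two_stubs(men, n - men)
    mixture += p_split * p_cond
    print(f"{men:2d} men, {n - men:2d} women: "
          f"{p_split:.4f} * {p_cond:.4f} = {p_split * p_cond:.7f}")

print(f"simple model:  {simple:.7f}")
print(f"complex model: {mixture:.7f}")
```

The two totals agree because, once gender is averaged out, each randomly chosen subject still stubs a toe with probability 0.5, independently of the others.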

This seems to be another way of seeing why we should probably pay more attention to effect sizes and less to whether we can reject a null hypothesis at the .05 level.

It will very often be the case that tiny but statistically significant effects could be explained by some causal factors that are unbalanced after randomization, rather than by the treatment. Giant effects are much less likely to be explained this way.

I think this is a good way to explain to laypeople why you need a large sample size, and why controlling by observables is usually not enough.

As explained in the other comments, the actual number of factors is irrelevant to statistical inference. For that, we only care about how noisy the outcomes are. You might have infinite factors, but if they all sort of cancel out, resulting in some specific finite noise, you're fine.

For binary outcomes, like voting or not voting, the noise of the outcome is always finite, so it's easy to figure out what sample size you need to have some good confidence in the results. Your p-value will tell you how likely it is that you have some terribly unbalanced sample generating the result.

But for unbounded outcomes, like money, or the amount of human suffering in a year, maybe we need to be more careful. In those cases, we need to assume that the variance of the outcome is finite. If it's not finite, there can also be an important unbalanced factor, a monstrous crisis lurking in the shadows, and the usual statistical inference is not valid. If you have infinite variance, no matter your sample size or how many times you run your giant experiment, you will never find the true effect.

Challenging these finite-variance assumptions is a bit out of fashion, at least in the social sciences. I am not sure why.

RA Fisher wrote:

"Let the Devil choose the yields of the plots to his liking...If now I assign treatments to plots on any system which allows any two plots which may be treated alike an equal chance of being treated differently...then it can be shown both that the experiment is unbiased by the Devil’s machinations, and that my test of significance is valid."

I tried to make the argument as a student about the interpretation of the P-value in the presence of chance imbalance of a measured known-to-be-important covariate, claiming that the "as-planned" simple analysis gives a "correct result" - say the intervention is "significant" - when an analysis conditioning on the distribution of the covariate is "not significant". Of course, we aren't supposed to interpret frequentist P-values* in quite this fashion anyway ("the p-value fallacy")**. The simplest summary of my conundrum is that the simple result is correct on average***, but we do better with extra information about causation among the variables being used in appropriate conditionalization in our single actual experiment. The trouble with conditioning is that the Devil could bias the outcome of that type of analysis.

* We won't get into Bayesian (prior or posterior) P-values or Bayesian interpretation of frequentist P-values (eg Sellke et al 2000 on calibration of P-values).

** In my area where we analyze genome-wide association experiments (with literally millions of simultaneous t-tests), we unashamedly think of (multiple-testing-corrected) P-values as a kind of measure of strength of association. I have lost the reference, it may be Berger, who provides a confidence interval for P-values - it is roughly +/- 1 order of magnitude.

Some people have tried to rehabilitate posterior power calculations as "reproducibility P-values" - if your P-value is 0.05 and the true effect size was that estimated in your analysis, then there is only a 50% chance of a successful replication (at alpha 0.05).

*** I have always enjoyed the idea that senior researchers should correct their current P-values for the number of papers they have already published. Perhaps this should also be applied retrospectively.

...sparse-ness could lead to balance-ness in the validity of real P-values...

This week's news/test headline...25% of North American children are hungry...

...of the hungry...feed 12 and half %...don't feed 12 and half %...then feed ALL 100%

What does P stand for...posted, possible, probable, proportional, parsed?

I think statistician Andrew Gelman would endorse Causal Sparseness, since he criticizes a lot of research as operating under the assumption that small interventions lead to big effects. If this were true, any study would be swamped by the noise of all the causes impacting your dependent variable. He calls this the "piranha problem."

Thanks for all the continuing comments, folks!

Sean -- thanks especially for that link on the piranha problem. I think the assumption of lack of too many piranhas is similar to the causal sparseness assumption. You have to trust that there aren't too many other similar-or-larger sized influences that are as plausible as the tested intervention.

Dan, Drake, Deivis, and David: These responses all seem to me to be versions of the strategy of loading everything into the noise variable -- yes? I guess the simplest response to that is to go Bayesian. From a forward-looking perspective, of course it's unlikely that treatment and control would differ unless the intervention had an effect, but let's take a backwards-looking view. The experiment is over. The groups differ. How likely is that difference to be because of the intervention? Well, that depends on the prior plausibility of the intervention compared to alternative explanations. Right? Now go back and think in a forward-looking way again. If my initial credence is .0001 before the experiment, even a successful experiment that crosses the significance threshold probably shouldn't raise my credence very high. Normally I wouldn't have such a low credence in a hypothesis I test. But why not? What justifies my modest initial (say p = .4) credence in my hypothesis? Part of it is causal sparseness: That there aren't a huge number of other equally-or-more-important causal factors that will provide competing similarly-plausible explanations of the result in the event that the groups differ.
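A rough sketch of that Bayesian update, for the two initial credences mentioned (.0001 and .4). The likelihoods are illustrative assumptions rather than anything from the study: a real effect produces a significant result 80% of the time (the power), and no effect produces one 5% of the time (the significance threshold). The function name is hypothetical.

```python
def posterior_given_significant(prior, power=0.8, alpha=0.05):
    """Bayes' rule: P(effect is real | significant result).

    power and alpha are assumed values, chosen only for illustration.
    """
    true_positive = prior * power
    false_positive = (1 - prior) * alpha
    return true_positive / (true_positive + false_positive)

for prior in (0.0001, 0.4):
    posterior = posterior_given_significant(prior)
    print(f"prior {prior:<6} -> posterior {posterior:.4f}")
```

Under these assumptions a prior of .0001 rises only to about .0016, while a prior of .4 rises to about .91, so how much a significant result should move us depends heavily on where we start.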

Daniel: Yes, this is part of the reason we need to pay attention to effect size!

Hi Eric

"the noise variable": no, the randomization is actually the intervention. It doesn't matter how many other causes there are or their effect sizes, aside from distributional assumptions for the chosen test, as it is the Y value being assigned into the different groups. The distributional assumptions are bypassed (to a large extent) in non-parametric testing. So it is simply a power question. Fisher's idea of randomization as being applicable against a deceptive opponent gives a particularly complex potential confounder.

David: Yes, but even if we go non-parametric, there are some background assumptions about the state of the universe hidden beneath inferring the truth of the hypothesis from p < .05. If the universe is dense enough with other equally plausible causes that the prior probability of the hypothesis is .0001, then even the most stripped down non-parametric test won't save you, right?

"dense enough with other equally plausible causes":

No. The effect of a genuine randomization has to have a simple distribution (binomial, Poisson etc) as it breaks the correlation between intervention and confounders, regardless of the number and distribution of confounders. The best way to see this is in a sequential analysis (a la Wald's Sequential Probability Ratio Test). Everything we deal with always has a squillion other causes. Returning to trials against an adversary, doctors regularly sabotaged trials where they believed a new treatment was dangerous by withholding their healthier patients, but that won't give false positives under randomization. The power and generalizability problems of randomization arise from effect size and interactions rather than confounding. For example, your intervention might only work on a small fraction of the population who have a particular unmeasured factor, so the effect may be hopelessly diluted in a simple randomized design.

https://www.tandfonline.com/doi/pdf/10.1080/07474946.2011.539924

David, I'm not sure why you're not engaging with my point about prior probabilities, nor do I understand why you mention effect size and interactions as though they were side issues.

"If there are 200 possible causes or moderators to be balanced, for example, then we need sufficient N to balance all 200". This is not the case. You skip past "noise" and traditional Neyman-Pearson in your discussion, but these assume and work around causal denseness. If one runs a large number of null randomizations, that is repeatedly collects two random samples and measures the difference between them, one observes an empirical distribution of the between-sample differences. In theory, we can explain every extreme difference in that set in terms of the "infinite" number of causative factors acting in that sample, but we don't. We just compare the likelihood of seeing our single realization under this empirical null distribution to that assuming our single intervention has an effect. We have mathematical reasons to think this does not cause problems, and methods to avoid having to perform a real set of null experiments (eg empirical likelihood based on the actual trial).

If you want to condition on a large number of potential covariates, eg to improve statistical power, you generate a multiple testing problem that you didn't previously have. In the normal run of things, we have a list of known causes of large enough effect size to be problematic, and this empirically will be short ("sparse"). Alternatively, you use the various techniques we have for dealing with high dimensional and ultra-high dimensional data.

As to power and interactions, these do not affect Type 1 error rates.
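The "null randomizations" idea described earlier in this comment can be sketched as a permutation test on the think-vote numbers from the post (60 of 100 voting versus 48 of 100). This is a minimal illustration, not the empirical-likelihood method mentioned; the function name, the number of re-randomizations, and the seed are all arbitrary choices.

```python
import random

def permutation_p_value(treated, control, n_perm=5000, seed=0):
    """One-sided permutation p-value for 'treated votes more than control'.

    Re-deals the pooled outcomes into two groups of the original sizes and
    counts how often the re-dealt treated group has at least as many voters
    as the one actually observed. With fixed group sizes this is equivalent
    to comparing the difference in voting rates.
    """
    rng = random.Random(seed)
    observed_votes = sum(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if sum(pooled[:n_treated]) >= observed_votes:
            hits += 1
    return hits / n_perm

# 1 = voted, 0 = did not vote.
treated = [1] * 60 + [0] * 40
control = [1] * 48 + [0] * 52

p = permutation_p_value(treated, control)
print(f"one-sided permutation p-value: {p:.3f}")
```

Nothing in the procedure refers to why any individual voted. The null distribution comes entirely from re-dealing the observed outcomes, whatever mix of causes produced them. The result should land in the same neighborhood as a normal-approximation p-value, though not exactly on it, since the permutation distribution is discrete.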

“In the normal run of things...” — exactly my point!
