I wanted to replicate a purported link between ambidexterity and authoritarianism. I had a dataset with ambidexterity and people’s answers to various political questions. I chose four questions that I thought were related to authoritarianism, and got these results:
1. p = 0.049
2. p = 0.008
3. p = 0.48
4. p = 0.052
I judged this as basically presenting evidence in favor of the hypothesis - after all, two of the four tests were “significant”, and one was very close.
In the comments, Ashley Yakeley asked whether I had corrected for multiple comparisons; Ian Crandell agreed, saying that I should divide my significance threshold by four, since I did four tests. If we start with the traditional significance threshold of 0.05, that would mean a new threshold of 0.0125, which only result (2) clears; everything else fails.
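(Concretely, the naive Bonferroni version of that correction looks something like this; I'm just plugging in my four p-values and the usual 0.05 threshold.)

```python
# Naive Bonferroni correction: divide the 0.05 threshold by the number of
# tests, so each of the four tests has to clear 0.05 / 4 = 0.0125.
p_values = [0.049, 0.008, 0.48, 0.052]
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # 0.0125

for test, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= corrected_alpha else "not significant"
    print(test, p, verdict)
# Only test 2 (p = 0.008) survives the corrected threshold.
```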
I agree multiple hypothesis testing is generally important, but I was skeptical of this. Here's my argument.

Suppose I want to test some hypothesis. I try one experiment, and get p = 0.04. By traditional standards of significance, it passes.

But suppose I want to be extra sure, so I try a hundred different ways to test it, and all one hundred come back p = 0.04. Common-sensically, this ought to be stronger evidence than the single experiment alone; I've done a hundred different tests and they all support my hypothesis. But with very naive multiple hypothesis testing, I have to divide my significance threshold by one hundred - down to 0.0005 - and now all hundred experiments fail. By replicating a true result, I've turned it into a false one.
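(Spelled out with the same naive correction as above:)

```python
# The thought experiment: a hundred tests, every one of them p = 0.04.
p_values = [0.04] * 100
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # 0.0005

print(corrected_alpha)                              # 0.0005
print(any(p <= corrected_alpha for p in p_values))  # False - all hundred now "fail"
```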
Metacelsus mentions the Holm-Bonferroni method. If I'm understanding it correctly, though, it doesn't rescue the hundred-times-replicated experiment above: the step-down procedure starts by comparing the smallest p-value to 0.05/100 = 0.0005, so a hundred results of p = 0.04 still all come back non-significant, just as with the naive correction. A common-sensically overwhelming pile of evidence still gets thrown out.
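(For concreteness, here's the step-down procedure as I understand it, run on my four p-values; treat it as a sketch rather than a reference implementation. On this data it agrees with plain Bonferroni: only test 2 survives.)

```python
# Holm-Bonferroni step-down procedure: sort the p-values, compare the k-th
# smallest to alpha / (number of hypotheses not yet rejected), and stop at
# the first failure.
def holm_bonferroni(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # this p-value and every larger one fail
    return reject

print(holm_bonferroni([0.049, 0.008, 0.48, 0.052]))
# [False, True, False, False] - only test 2 is significant
```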
What if you thought in Bayesian terms? I'm really weak in Bayesian statistics, but my impression is that you treat each test as giving a separate Bayes factor. So you start with (say) a prior of 1:19 against, and then the four tests give you the following approximate Bayes factors - here I'm just naively converting each p-value into odds:
1. 19:1 in favor
2. 100:1 in favor
3. 1:1 either way
4. 19:1 in favor
Multiply it all out and you end up with odds of 1900:1 in favor - pretty convincing. But I’m not happy with this; I got to just ignore the totally negative finding in test 3. Shouldn’t getting a result of “no difference” increase your probability that the reality is “no difference” compared to “some large difference”? But here the best it can do is nothing.
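(The arithmetic, spelled out:)

```python
from fractions import Fraction

# Prior odds of 1:19 against, multiplied by the four approximate Bayes factors.
prior_odds = Fraction(1, 19)
bayes_factors = [19, 100, 1, 19]   # tests 1-4

posterior_odds = prior_odds
for bf in bayes_factors:
    posterior_odds *= bf

print(posterior_odds)  # 1900, i.e. 1900:1 in favor
```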
In fact, this is a big problem. The story of the past few years has been “early small study finds p = 0.00001, later big study finds p = 0.5, we shrug and say the early excitement was misplaced and there’s no real effect”. But the way I’m doing things here, the first study should make us believe there’s very likely an effect, and the second study shouldn’t move us in either direction, leaving us at “very likely an effect”. That’s clearly not how we think.
Maybe I need a real alternative hypothesis, like “there will be a difference of 5%”, and then compare how that vs. the null does on each test? But now we’re getting a lot more complicated than the “call your likelihood ratio a Bayes factor, it’ll be fine!” approach I was promised.
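(Here's a rough sketch of what I imagine that would look like. For each study you'd compute the likelihood of its observed effect estimate under “the difference is 5 percentage points” versus under “the difference is zero”, assuming the estimate is roughly normal with a known standard error. The effect sizes and standard errors below are made up purely for illustration - a small noisy early study and a big precise later study that found nothing.)

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def bayes_factor(observed_effect, standard_error, alt_effect=0.05):
    # Likelihood of the estimate under "difference of 5%" divided by its
    # likelihood under "no difference".
    return (normal_pdf(observed_effect, alt_effect, standard_error)
            / normal_pdf(observed_effect, 0.0, standard_error))

# Small, noisy early study that found a positive effect:
print(bayes_factor(observed_effect=0.06, standard_error=0.03))    # about 7:1 for the effect

# Big, precise later study that found nothing:
print(bayes_factor(observed_effect=0.00, standard_error=0.005))   # astronomically in favor of the null
```

With a concrete alternative, the big null study finally does what intuition says it should: it pushes hard toward “no effect” instead of just sitting at 1:1.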
I think I’ve reached the limits of my statistics knowledge here, which surprises me for such an easy question. Interested in hearing what other people know.