7 Comments

Only partway through, but Tom's early comments mean I immediately need an episode on whether taking part in Dry January is preventative, or indeed predictive, of having a wider drinking problem...

author

we definitely ought to do an alcohol-related episode.

Jan 10·edited Jan 10

Fascinating show. One of you referred to a 'D20' - do we have a (maybe former) role-player in our midst?

Isn't there an underlying issue in that at least some of the fields of inquiry that give rise to dodgy results and p-hacking are in fact pseudoscientific, given that the only way of assessing the truth of a theoretical proposition is to observe statistical epiphenomena rather than any direct material evidence (whether chemical, physiological or whatever)?


Another thought about the difference between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

is the effect of multiple trials. I would expect to get 'statistically significant' results on about 1 trial in 20, even if the null hypothesis is true. This is consistent with the correct formulation a) above. It would, however, be absurd to suggest that the actual probability of the null hypothesis being true varies depending on the results I get on each trial, and that 'if you do enough trials, then by chance the probability of the (non-null) hypothesis will on some trials be as high as 95%'.
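To make that concrete, here's a minimal simulation sketch (my own illustration, not anything from the episode): if you run many trials in which the null hypothesis really is true, roughly 1 in 20 will still come out 'significant' at p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_group = 10_000, 50
false_positives = 0

for _ in range(n_trials):
    # Both groups drawn from the same distribution: the null hypothesis is true.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of trials are 'significant' even though there is no real effect.
print(false_positives / n_trials)  # ~0.05
```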

Jan 9·edited Jan 9

I enjoyed this, but Tom's analogy with the 'is the pope human' example caused me a problem. There is a difference, as you point out, between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

but it seems a more technical one than the difference between (as Tom seemed to state)

a) "the probability that a random human is the pope" and

b) "the probability that the pope is human"

I can see that the errors are similar because they both are examples of

'Probability of A given B is not the same as probability of B given A'.

But they are different errors. If I assume the null hypothesis is 'the pope is human', I would not test random humans to see whether they were the pope, so case a) is irrelevant. In real life, the probability of a null hypothesis being 'true' is significantly reduced if a p-value of 0.05 is found (legitimately), because the test results are relevant to the hypothesis, whereas the probability of the pope being human is not reduced by finding non-papal humans. It is the wrong test.

On another point: as a psychology student in the 1970s, I was taught that we should not use tests that assume our data is normally distributed when we haven't established that. So we used what I think were called non-parametric statistics, as opposed to the default of Student's t-test, for significance. I mention it because Stuart said 'p is easy to calculate', but it's worth mentioning that there are pitfalls there too about which test is relevant given the nature of the data.
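As a rough sketch of the kind of choice I mean (my own illustration, not from the episode): with skewed data, the parametric and rank-based tests can give noticeably different p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Heavily skewed data, so the normality assumption behind Student's t-test is dubious.
group_a = rng.exponential(scale=1.0, size=30)
group_b = rng.exponential(scale=1.5, size=30)

t_stat, p_t = stats.ttest_ind(group_a, group_b)     # parametric: assumes normality
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)  # non-parametric: rank-based

print(f"t-test p = {p_t:.3f}, Mann-Whitney p = {p_u:.3f}")
```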

And finally (I know, I am a sad person) - the example of the probability of a coin flip being heads 60 times or more out of 100 is 3% (or thereabouts). It's worth using the example to point out why hypotheses need to be pre-registered. If my hypothesis is just 'the coin is biased', then the probability of either 60 or more heads OR 60 or more tails would actually be a lot higher (sorry, I haven't calculated it). It is only if I have explicitly said my hypothesis is that the bias is towards heads that this is 'significant'.
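For the curious, here is a quick sketch of that calculation using the exact binomial distribution (my own working, not the commenter's or the episode's figures):

```python
from scipy.stats import binom

n, p0 = 100, 0.5

# One-tailed: probability of 60 or more heads if the coin is fair.
p_one_tailed = binom.sf(59, n, p0)

# Two-tailed: 60 or more heads OR 60 or more tails (i.e. 40 or fewer heads).
p_two_tailed = binom.sf(59, n, p0) + binom.cdf(40, n, p0)

print(f"one-tailed: {p_one_tailed:.4f}")  # ~0.028
print(f"two-tailed: {p_two_tailed:.4f}")  # ~0.057
```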

author

hey there! Thanks for your comment.

I'm going to be a complete coward and say that the "the pope is human" example is a classic: it's from a letter to Nature in 1996 https://www.nature.com/articles/382490a0.pdf

But it is analogous, I promise! P(Pope|Human) = 1/8bn, just as P(data|hypothesis)<0.05 in a statistically significant trial. When people assume that P(data|hypothesis)=P(hypothesis|data) they are making the same mistake as if they said P(Pope|Human) = P(Human|Pope). What they're failing to do is take into account other hypotheses and how likely THEY are, such as "how likely is it that a randomly chosen individual is an alien," which is even less likely than a human being the Pope.

You're right that it's *obviously wrong* with the pope/human thing. But it is equally, just less obviously, wrong with the data|hypothesis thing.
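To put rough numbers on why the two quantities can be so different, here's a minimal Bayes' theorem sketch (illustrative numbers of my own, not figures from the episode): even with p < 0.05, the probability that the hypothesis is true depends heavily on the prior.

```python
# Suppose only 10% of the hypotheses we test are actually true, the test has 80% power,
# and alpha = 0.05. What fraction of 'significant' results reflect a true effect?

prior_true = 0.10   # P(hypothesis true) before seeing the data
power = 0.80        # P(significant result | hypothesis true)
alpha = 0.05        # P(significant result | null true)

p_significant = power * prior_true + alpha * (1 - prior_true)
posterior_true = power * prior_true / p_significant

print(f"P(hypothesis true | significant result) = {posterior_true:.2f}")  # ~0.64
```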

Re heads vs tails: I hope I said early in the episode that we were worried it was biased towards heads, which would allow us to reasonably carry out a one-tailed test. You're right: if we would be equally shocked to see it biased towards tails, then we ought to do a two-tailed test. That would mean, if I recall correctly, that you ought to lower your alpha level to 0.025 at either end (i.e. you'd need twice as strong evidence).


Thanks Tom. I take your point that both are examples of the fallacy that P(A|B) is the same as P(B|A), and the pope/human example is an extreme example of that. I guess what throws me is that I was trying to work out what the 'hypothesis' is in the analogy (that the pope is an alien), in which case the fact that most humans are not the pope is irrelevant to it. But that doesn't matter.

I'd be interested in your Bayesian take on the Post Office / Horizon scandal. It seems to me that the 'priors' of P(fraudulent sub-postmaster) and P(bug in the software) were very wrong!
