Only partway through but Tom’s early comments mean I immediately need an episode on whether taking part in Dry January is preventative, or indeed predictive, of having a wider drinking problem...

we definitely ought to do an alcohol-related episode.

Fascinating show. One of you referred to a 'D20' - do we have a (maybe former) role-player in our midst?

Isn't there an underlying issue in that at least some of the fields of inquiry that give rise to dodgy results and p-hacking are in fact pseudoscientific, given that the only way of assessing the truth of a theoretical proposition in them is to observe statistical epiphenomena rather than any direct material evidence (whether chemical, physiological or whatever)?

Another thought about the difference between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

is the effect of multiple trials. I would expect to get 'statistically significant' results on about 1 trial in 20, even if the null hypothesis is true. This is consistent with the correct formulation a) above. It would however be absurd to suggest that the actual probability of the null hypothesis being true varies depending on the results I get on each trial and that 'if you do enough trials, then by chance the probability of the (non null) hypothesis will on some trials be as high as 95%'.
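The "1 trial in 20" point can be checked with a small simulation. This is only a sketch under assumptions that are not from the episode: every trial draws 30 values from a standard normal distribution (so the null hypothesis "the mean is zero" is true by construction) and is analysed with a two-sided z-test at the 0.05 level. The function name, sample size and trial count are all illustrative.

```python
import random
from statistics import NormalDist, fmean


def false_positive_rate(trials=20_000, n=30, alpha=0.05, seed=0):
    """Share of trials that come out 'significant' although the null is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        # The null is true by construction: the real mean is 0, the real sd is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z-statistic with known sd = 1
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"'significant' trials with a true null: {rate:.3f}")
```

The printed share should sit close to 0.05. The null hypothesis is equally true in every one of those trials, which is why a p-value describes formulation a) and cannot be read as formulation b).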

I enjoyed this, but Tom's analogy with the 'is the pope human' example caused me a problem. There is a difference as you point out between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

but it seems a more technical one than the difference between (as Tom seemed to state)

a) "the probability that a random human is the pope" and

b) "the probability that the pope is human"

I can see that the errors are similar because they both are examples of

'Probability of A given B is not the same as probability of B given A'.

But they are different errors. If I assume the null hypothesis is 'the pope is human' I would not test random humans to see if they were the pope to test that hypothesis so case a) is irrelevant. In real life, the probability of a null hypothesis being 'true' is significantly reduced if a p value of 0.05 is found (legitimately) because the test results are relevant to the hypothesis, whereas the probability of the pope being human is not reduced by finding non-papal humans. It is the wrong test.

On another point. As a Psychology student I was taught in the 1970s that we should not use tests that assume our data is normally distributed when we haven't established that. So we used what I think were called non-parametric statistics as opposed to the default of Student's t-test for significance. I mention it because Stuart said 'p is easy to calculate' but it's worth mentioning that there are pitfalls there too about which test is relevant given the nature of the data.

And finally (I know, I am a sad person) - the example given was that the probability of a coin coming up heads 60 times or more out of 100 is 3% (or thereabouts). It's worth using the example to point out why hypotheses need to be pre-registered. If my hypothesis is 'the coin is biased' then the probability of either 60 or more heads OR 60 or more tails would actually be a lot higher (sorry I haven't calculated it). It is only if I have explicitly said my hypothesis is that the bias is towards heads that this is 'significant'.
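For anyone who wants the uncalculated number, here is a minimal sketch using the exact binomial distribution, assuming a fair coin and 100 independent flips. The helper name `upper_tail` is just an illustration.

```python
from math import comb


def upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) when X counts heads in n independent flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


one_sided = upper_tail(100, 60)  # 60 or more heads
# "60+ heads" and "60+ tails" cannot both happen in 100 flips, and a fair coin
# is symmetric, so the either-direction probability is simply double.
two_sided = 2 * one_sided

print(f"60+ heads:          {one_sided:.4f}")
print(f"60+ heads or tails: {two_sided:.4f}")
```

The one-sided figure comes out at roughly 2.8% and the either-direction figure at roughly 5.7%, so the same 60 heads falls just the wrong side of 0.05 unless the direction was specified in advance.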

hey there! Thanks for your comment.

I'm going to be a complete coward and say that the "the pope is human" example is a classic: it's from a letter to Nature in 1996 https://www.nature.com/articles/382490a0.pdf

But it is analogous, I promise! P(Pope|Human) = 1/8bn, just as P(data|hypothesis)<0.05 in a statistically significant trial. When people assume that P(data|hypothesis)=P(hypothesis|data) they are making the same mistake as if they said P(Pope|Human) = P(Human|Pope). What they're failing to do is take into account other hypotheses and how likely THEY are, such as "how likely is it that a randomly chosen individual is an alien," which is even less likely than a human being the Pope.
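To make the "other hypotheses and how likely THEY are" point concrete, here is a rough sketch of Bayes' theorem with only two hypotheses, the null and a single alternative. Every number in it is an assumption chosen for illustration: a 5% chance of a significant result if the null is true, an 80% chance if the alternative is true, and three different priors for the null.

```python
def posterior_null(prior_null: float,
                   p_data_given_null: float,
                   p_data_given_alt: float) -> float:
    """P(null | data) via Bayes' theorem with one alternative hypothesis."""
    null_part = p_data_given_null * prior_null
    alt_part = p_data_given_alt * (1 - prior_null)
    return null_part / (null_part + alt_part)


# Same evidence each time: P(data | null) = 0.05 and P(data | alternative) = 0.8.
# Only the prior probability of the null changes (all values are illustrative).
for prior in (0.5, 0.9, 0.99):
    post = posterior_null(prior, 0.05, 0.8)
    print(f"prior P(null) = {prior:.2f} -> P(null | data) = {post:.3f}")
```

With these made-up inputs, P(data | null) stays at 0.05 throughout, while P(null | data) ranges from about 0.06 to about 0.86 depending on the prior, so the two quantities cannot be swapped for each other.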

You're right that it's *obviously wrong* with the pope/human thing. But it is equally, just less obviously, wrong with the data|hypothesis thing.

Re heads vs tails: I hope I said early in the thing that we were worried it was biased towards heads, which would allow us to reasonably carry out a one-tailed test. You're right, if we would be equally shocked to see it biased towards tails, then we ought to do a two-tailed test: it would mean, if I recall correctly, that you ought to lower your alpha level to 0.025 at either end (i.e. you'd need twice as strong evidence).
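As a hedged illustration of what that change of alpha does in the 100-flip example, the sketch below finds the smallest number of heads whose upper-tail probability is at or below a given level, assuming a fair coin under the null. The function names are hypothetical.

```python
from math import comb


def upper_tail(n: int, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5), computed exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


def critical_heads(n: int, alpha: float) -> int:
    """Smallest head count k such that P(X >= k) <= alpha under a fair coin."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    raise ValueError("no head count reaches this alpha")


one_tailed = critical_heads(100, 0.05)   # all of alpha in the heads tail
two_tailed = critical_heads(100, 0.025)  # alpha split evenly across both tails

print(f"one-tailed: need at least {one_tailed} heads")
print(f"two-tailed: need at least {two_tailed} heads (or as many tails)")
```

Under these assumptions the bar moves from 59 heads to 61, so 60 heads clears the one-tailed test but not the two-tailed one.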

Thanks Tom. I take your point that both are examples of the fallacy that P(A|B) is the same as P(B|A) and the pope/human example is an extreme example of that. I guess what throws me is that I was trying to work out what the 'hypothesis' is in the analogy (the pope is an alien) in which case the fact that most humans are not the pope is irrelevant to it. But that doesn't matter.

I'd be interested in your Bayesian take on the Post Office / Horizon scandal. It seems to me that the 'priors' of P(fraudulent sub postmaster) and P(bug in the software) were very wrong!