Only partway through, but Tom’s early comments mean I immediately need an episode on whether taking part in Dry January is preventative, or indeed predictive, of having a wider drinking problem...

Fascinating show. One of you referred to a 'D20' - do we have a (maybe former) role-player in our midst?

Isn't there an underlying issue in that at least some of the fields of inquiry that give rise to dodgy results and p-hacking are in fact pseudoscientific, given that the only way of assessing the truth of a theoretical proposition is to observe statistical epiphenomena rather than any direct material evidence (whether chemical, physiological or whatever)?

Another thought about the difference between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

is the effect of multiple trials. I would expect to get 'statistically significant' results on about 1 trial in 20, even if the null hypothesis is true. This is consistent with the correct formulation a) above. It would however be absurd to suggest that the actual probability of the null hypothesis being true varies depending on the results I get on each trial and that 'if you do enough trials, then by chance the probability of the (non null) hypothesis will on some trials be as high as 95%'.
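That 1-in-20 point can be checked directly by simulation: run many experiments in which the null hypothesis is true by construction, and count how often a test comes out 'significant' at the 5% level. A minimal sketch in Python (using a two-sided z-test on normal data with known variance; the sample size and trial count are illustrative choices of mine, not anything from the show):

```python
import math
import random

random.seed(0)

N_EXPERIMENTS = 10_000   # independent trials, all run under a true null
SAMPLE_SIZE = 100        # observations per trial
Z_CRIT = 1.96            # two-sided 5% critical value for a z-test

significant = 0
for _ in range(N_EXPERIMENTS):
    # The null hypothesis is true by construction: the mean really is 0.
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    # Sample mean of n iid N(0,1) values has sd 1/sqrt(n), so this is ~N(0,1).
    z = (sum(sample) / SAMPLE_SIZE) * math.sqrt(SAMPLE_SIZE)
    if abs(z) > Z_CRIT:
        significant += 1

rate = significant / N_EXPERIMENTS
print(f"'Significant' results under a true null: {rate:.1%}")  # close to 5%
```

About 1 trial in 20 crosses the line, exactly as formulation a) predicts, even though nothing real is being detected in any of them.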

I enjoyed this, but Tom's analogy with the 'is the pope human' example caused me a problem. There is a difference, as you point out, between

a) "The probability that these results could occur if the null hypothesis is true" and

b) "the probability that the null hypothesis is true, given these results"

but it seems a more technical one than the difference between (as Tom seemed to state)

a) "the probability that a random human is the pope" and

b) "the probability that the pope is human"

I can see that the errors are similar, because both are examples of

'Probability of A given B is not the same as probability of B given A'.

But they are different errors. If my null hypothesis were 'the pope is human', I would not test random humans to see whether they were the pope, so case a) is irrelevant. In real life, the probability of a null hypothesis being 'true' is significantly reduced if a p-value of 0.05 is found (legitimately), because the test results are relevant to the hypothesis; whereas the probability of the pope being human is not reduced by finding non-papal humans. It is the wrong test.
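How much a significant result actually reduces the probability of the null can be made concrete with Bayes' theorem. A sketch in Python, with deliberately made-up numbers (the 50% prior on the null and the 80% power are my assumptions for illustration only):

```python
# P(H0 | significant) via Bayes' theorem, with illustrative numbers.
prior_h0 = 0.5         # assumed prior probability that the null is true
p_sig_given_h0 = 0.05  # the significance level: formulation a) in the text
power = 0.8            # assumed P(significant | H1), i.e. the test's power

# Formulation b) in the text: probability the null is true given the result.
posterior_h0 = (p_sig_given_h0 * prior_h0) / (
    p_sig_given_h0 * prior_h0 + power * (1 - prior_h0)
)

print(f"a) P(significant | H0) = {p_sig_given_h0}")
print(f"b) P(H0 | significant) = {posterior_h0:.3f}")  # ~0.059, not 0.05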

On another point: as a Psychology student in the 1970s I was taught that we should not use tests that assume our data is normally distributed when we haven't established that. So we used what I think were called non-parametric statistics, as opposed to the default of Student's t-test for significance. I mention it because Stuart said 'p is easy to calculate', but it's worth mentioning that there are pitfalls there too, about which test is relevant given the nature of the data.
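For what it's worth, one of the simplest non-parametric procedures, the sign test, makes no normality assumption at all: it only asks how often paired differences come out positive, and gets an exact p-value from the binomial distribution. A minimal sketch (the before/after measurements are invented purely for illustration):

```python
from math import comb

# Paired before/after measurements (made-up numbers for illustration).
before = [3.1, 2.8, 4.0, 3.5, 2.9, 3.3, 4.2, 3.0, 3.6, 2.7]
after  = [3.4, 3.1, 4.1, 3.9, 3.0, 3.2, 4.6, 3.4, 3.9, 3.1]

diffs = [a - b for a, b in zip(after, before) if a != b]  # ties are dropped
n = len(diffs)
k = sum(d > 0 for d in diffs)  # number of positive differences

# Under the null (no systematic change), each sign is a fair coin flip,
# so P(>= k positives out of n) is an exact binomial tail probability.
p_one_sided = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
print(f"{k} of {n} differences positive, one-sided p = {p_one_sided:.4f}")
```

The point is not that the sign test is the right tool for any particular data set, just that 'calculating p' already involves choosing a test whose assumptions fit the data.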

And finally (I know, I am a sad person): the example of the probability of a coin flip coming up heads 60 times or more out of 100 is 3% (or thereabouts). It's worth using the example to point out why hypotheses need to be pre-registered. If my hypothesis is 'the coin is biased', then the probability of either 60 or more heads OR 60 or more tails would actually be a lot higher (sorry, I haven't calculated it). It is only if I have explicitly said my hypothesis is that the bias is towards heads that this is 'significant'.
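Both tails are cheap to compute exactly from the binomial distribution, if anyone is curious. A quick check in Python:

```python
from math import comb

N = 100  # coin flips

def tail_at_least(k: int) -> float:
    """P(X >= k) for X ~ Binomial(N, 0.5), computed exactly."""
    return sum(comb(N, i) for i in range(k, N + 1)) / 2**N

p_heads = tail_at_least(60)  # pre-registered: 'biased towards heads'
p_either = 2 * p_heads       # by symmetry: >= 60 heads OR >= 60 tails

print(f"P(>= 60 heads)          = {p_heads:.4f}")   # about 0.028
print(f"P(>= 60 heads or tails) = {p_either:.4f}")  # about 0.057
```

The 'either direction' probability is exactly twice the one-sided figure, which is precisely why the direction of the hypothesis has to be fixed in advance.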
