You Can’t Trust Your PhD Supervisor:)

Before I continue, let me be clear that during my PhD Graham taught me everything I consider worth knowing about the process of science, theory, psychopathology and academic life. So, I tend to take what he says very seriously (unless it’s about marriage or statistics). The two points I want to focus on are:

2.  Do an experiment but make up or severely massage the data to fit your hypothesis. This is an obvious one, but is something that has surfaced in psychological research a good deal recently (http://bit.ly/QqF3cZ; http://nyti.ms/P4w43q).

I’d really like to do a study looking at more experienced researchers’ basic understanding of assumptions (a follow-up to Hoekstra’s study on a more experienced sample, and with more probing questions) just to see whether my suspicions are correct. Maybe I should email Hoekstra and see if they’re interested because, left to my own devices, I’ll probably never get around to it.

Anyway, my point is that I think it’s not just deliberate fraud that creates false knowledge; there is also a problem of well-intentioned and honest folk simply not understanding what to do, or when to do it.

3.  Convince yourself that a significant effect at p=.055 is real. How many times have psychologists tested a prediction only to find that the critical comparison just misses the crucial p=.05 value? How many times have psychologists then had another look at the data to see if it might just be possible that with a few outliers removed this predicted effect might be significant? Strangely enough, many published psychology papers are just creeping past the p=.05 value – and many more than would be expected by chance! Just how many false psychology facts has that created?

This is a massive over-simplification because an effect of p = .055 is ‘real’ and might very well be ‘meaningful’. Conversely, an effect with p < .001 might very well be meaningless. To my mind it probably matters very little if a p is .055 or .049. I’m not suggesting I approve of massaging your data, but really this point illustrates how wedded psychologists are to the idea of effects somehow magically becoming ‘real’ or ‘meaningful’ once p drifts below .05. There are a few points to make here:

First, all effects are ‘real’. There should never be a decision being made by anyone about whether an effect is real or not real. They’re all real. It’s just that some are large and some are small. There is a decision about whether an effect is meaningful, and that decision should be made within the context of the research question.

Second, I think an equally valid way to create ‘false knowledge’ is to publish studies based on huge samples reporting loads of small and meaningless effects that are highly significant. Imagine you look at the relationship between statistical knowledge and eating curry. You test 1 million people and find that there is a statistically significant negative relationship, r = -.002, p < .05. You conclude that eating curry is a 'real' effect – it is meaningfully related to poorer statistical knowledge. There are two issues here: (1) in a sample of 1 million people the effect size estimate will be very precise, and the confidence interval very narrow. So we know the true effect in the population is going to be very close indeed to -.002. In other words, there is basically no effect in the population – eating curry and statistical knowledge have such a weak relationship that you may as well forget about it. (2) anyone trying to replicate this effect in a sample substantially smaller than 1 million is highly unlikely to get a significant result. You've basically published an effect that is 'real' if you use p < .05 to define your reality, but is utterly meaningless and won't replicate (in terms of p) in small samples.
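To put rough numbers on the two issues in the curry example, here is a minimal sketch. It assumes the usual Fisher z approximation for correlations, which is my choice for illustration and not anything prescribed above. The function names `fisher_ci` and `approx_power` are made up for this example, and the replication sample sizes are arbitrary.

```python
import math
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                  # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    crit = _norm.inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

def approx_power(rho, n, alpha=0.05):
    """Approximate two-sided power to detect a true correlation rho with n people."""
    crit = _norm.inv_cdf(1 - alpha / 2)
    shift = abs(math.atanh(rho)) * math.sqrt(n - 3)
    return _norm.cdf(shift - crit) + _norm.cdf(-shift - crit)

# Values taken from the curry example: r = -.002 in a sample of 1 million
r, n = -0.002, 1_000_000
low, high = fisher_ci(r, n)
print(f"95% CI for r: [{low:.5f}, {high:.5f}]")

# Arbitrary (assumed) sample sizes for a would-be replication
for m in (100, 10_000, 100_000):
    print(f"n = {m:>7}: approximate power = {approx_power(r, m):.3f}")
```

Under these assumptions the interval spans only about .002 either side of the estimate, and the chance of a significant replication in samples of a few thousand sits barely above the 5% you would get if there were no effect at all.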

Third, there is a wider problem than people massaging their ps. You have to ask why people massage their ps. The answer to that is because psychology is so hung up on p-values. More than 10 years after the APA published its report on statistical reporting (Wilkinson, 1999) there has been no change in the practice of applying the all-or-nothing thinking of accepting results as ‘real’ if p < .05. It's true that Wilkinson's report has had a massive impact on the frequency with which effect sizes and confidence intervals are reported, but (in my experience, which is perhaps not representative) these effect sizes are rarely interpreted with any substance and it is still the p-value that drives decisions made by reviewers and editors.

This whole problem would go away if ‘meaning’ and ‘substance’ of effects were treated not as a dichotomous decision, but as a point along a continuum. You quantify your effect, you construct a confidence interval around it, and you interpret it within the context of the precision that your sample size allows. This way, studies with large samples could no longer focus on meaningless but significant effects; instead, the researcher could say (given the high level of precision they have) that the effects in the population (the true effects if you like) are likely to be about the size that they observed and interpret accordingly. In small studies, rather than throwing out the baby with the bathwater, large effects could be given some credibility but with the caveat that the estimates in the study lack precision. This is where replication is useful. No need to massage data – researchers just give it to the reader as it is, interpret it and apply the appropriate caveats etc. One consequence of this might be that rather than publishing a single small study with massaged data to get p < .05, researchers might be encouraged to replicate their own study a few times and report them all in a more substantial paper. Doing so would mean that across a few studies you could show (regardless of p) the likely size of the effect in the population.
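As a rough illustration of how a few small replications add up to a more precise estimate, here is a minimal sketch. It pools correlations with a simple fixed-effect average on the Fisher z scale, which is an assumption on my part (a random-effects model might well be preferable in practice). The three studies are entirely hypothetical numbers invented for the example, and `pooled_correlation` is a made-up name.

```python
import math
from statistics import NormalDist

def pooled_correlation(studies, level=0.95):
    """Fixed-effect pooled r and confidence interval from (r, n) pairs.

    Each study is weighted by n - 3, the inverse variance of its Fisher z.
    """
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    weights = [n - 3 for _, n in studies]
    total = sum(weights)
    z_bar = sum(w * math.atanh(r) for w, (r, _) in zip(weights, studies)) / total
    se = 1.0 / math.sqrt(total)
    return math.tanh(z_bar), math.tanh(z_bar - crit * se), math.tanh(z_bar + crit * se)

# Hypothetical data: (observed r, sample size) for three small replications
studies = [(0.35, 30), (0.22, 40), (0.41, 25)]

# Each study on its own: a large-ish effect, but an imprecise estimate
for r, n in studies:
    est, lower, upper = pooled_correlation([(r, n)])
    print(f"n = {n}: r = {est:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")

# All three together: same story about size, told with more precision
pooled, pooled_lower, pooled_upper = pooled_correlation(studies)
print(f"pooled: r = {pooled:.2f}, 95% CI [{pooled_lower:.2f}, {pooled_upper:.2f}]")
```

No p-value is computed anywhere in this sketch: each study reports its estimate with an interval that is honest about its imprecision, and the pooled interval is narrower than any of the individual ones.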

That turned into a bigger rant than I was intending ….

References

Haller, H., & Krauss, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? MPR-Online, 7(1), 1-20.
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594-604.