One-Tailed Tests

I’ve been thinking about writing a blog on one-tailed tests for a while. The reason is that one of the changes I’m making in my re-write of DSUS4 is to alter the way I talk about one-tailed tests. You might wonder why I would want to alter something like that – surely if it was good enough for the third edition then it’s good enough for the fourth? Textbook writing is quite an interesting process because when I wrote the first edition I was very much younger, and to some extent the content was driven by what I saw in other textbooks. As the book has evolved over subsequent editions, the publishers have gathered feedback from lecturers who use the book, I get emails from readers, and so, again, the content gets driven a bit by what people who use the book want and expect to see. People expect to learn about one-tailed tests in an introductory statistics book and I haven’t wanted to disappoint them. However, as you get older you also get more confident about having an opinion on things. So, although I have happily entertained one-tailed tests in the past, in more recent years I have come to feel that they are one of the worst aspects of hypothesis testing and should probably be discouraged.
Yesterday I got the following question landing in my inbox, which was the perfect motivator to write this blog and explain why I’m trying to deal with one-tailed tests very differently in the new edition of DSUS:
Question: “I need some advice and thought you may be able to help. I have a one-tailed hypothesis, ego depletion will increase response times on a Stroop task. The data is parametric and I am using a related T-Test.
Before depletion the Stroop performance mean is 70.66 (12.36)
After depletion the Stroop performance mean is 61.95 (10.36)
The t-test is, t (138) = 2.07, p = .02 (one-tailed)
Although the t-test comes out significant, it goes against what I have hypothesised. That Stroop performance decreased rather than increased after depletion. So it goes in the other direction. How do I acknowledge this in a report?
I have done this so far. Is it correct?
Although the graph suggests there was a decrease in Stroop performance times after ego-depletion. Before ego-depletion (M=70.66, SD=12.36) after ego-depletion (M= 61.95, SD=10.36), a t-test showed there was a significance between Stroop performance phase one and two t (138) = 10.94, p <.001 (one-tailed).”
This question illustrates perfectly the confusion people have about one-tailed tests. The author quite rightly wants to acknowledge that the effect was in the opposite direction, but quite wrongly still wants to report the effect … and why not? Effects in the opposite direction are interesting and intriguing, and any good scientist wants to explain interesting findings.
The trouble is that my answer to the question of what to do when you get a significant one-tailed p-value but the effect is in the opposite direction to what you predicted is (and I quote my re-written chapter 2 here): “if you do a one-tailed test and the results turn out to be in the opposite direction to what you predicted you must ignore them, resist all temptation to interpret them, and accept (no matter how much it pains you) the null hypothesis. If you don’t do this, then you have done a two-tailed test using a different level of significance from the one you set out to use”.
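To make that concrete, here’s a minimal sketch in Python (simulated data, not the questioner’s: the sample size, means and spread below are all made up) of what a one-tailed p-value does when the effect lands in the opposite direction to the prediction. Tested in the predicted direction, p is close to 1; the ‘significant’ value only appears if you quietly switch tails after seeing the data, which is really a two-tailed test at double the alpha you set out to use.

```python
# A minimal sketch (simulated data, NOT the questioner's): a paired design where the
# observed effect runs opposite to the prediction "response times increase after depletion".
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 70                                      # hypothetical number of participants
before = rng.normal(70, 12, n)              # made-up Stroop times before depletion
after = before - rng.normal(8, 10, n)       # times actually DECREASE, against the prediction

# One-tailed test in the PREDICTED direction (after > before): p is close to 1,
# so the prediction is not supported and the null hypothesis is retained.
t_pred, p_pred = stats.ttest_rel(after, before, alternative='greater')

# The 'significant' value only appears if you test the OPPOSITE tail after seeing
# the data -- which amounts to a two-tailed test at twice the alpha you planned to use.
t_opp, p_opp = stats.ttest_rel(after, before, alternative='less')
t_two, p_two = stats.ttest_rel(after, before)     # two-tailed, for comparison

print(f"predicted direction: p = {p_pred:.3f}")   # ~1, not significant
print(f"opposite direction:  p = {p_opp:.3f}")    # small and tempting, but off limits
print(f"two-tailed:          p = {p_two:.3f}")    # = 2 x the smaller one-tailed p
```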
[Quoting some edited highlights of the new section I wrote on one-tailed tests]:
One-tailed tests are problematic for three reasons:
  1. As the question I was sent illustrates, when scientists see interesting and unexpected findings their natural instinct is to want to explain them. Therefore, one-tailed tests are dangerous because like a nice piece of chocolate cake when you’re on a diet, they waft the smell of temptation under your nose. You know you shouldn’t eat the cake, but it smells so nice, and looks so tasty that you shovel it down your throat. Many a scientist’s throat has a one-tailed effect in the opposite direction to that predicted wedged in it, turning their face red (with embarrassment).
  2. One-tailed tests are appropriate only if a result in the opposite direction to the expected direction would result in exactly the same action as a non-significant result (Lombardi & Hurlbert, 2009; Ruxton & Neuhaeuser, 2010). This can happen if a result in the opposite direction would be theoretically meaningless or impossible to explain even if you wanted to (Kimmel, 1957). Another situation would be if, for example, you’re testing a new drug to treat depression. You predict it will be better than existing drugs. If it is not better than existing drugs (non-significant p) you would not approve the drug; however, if it was significantly worse than existing drugs (significant p but in the opposite direction) you would also not approve the drug. In both situations, the drug is not approved.
  3. One-tailed tests encourage cheating. If you do a two-tailed test and find that your p is .06, then you would conclude that your results were not significant (because .06 is bigger than the conventional criterion of .05). Had you done this test one-tailed, however, the p-value you would get would be half of the two-tailed value (.03). This one-tailed value would be significant at the conventional level (there’s a small code sketch after this list that shows the arithmetic). Therefore, if a scientist finds a two-tailed p that is just non-significant, they might be tempted to pretend that they’d always intended to do a one-tailed test, halve the p-value to make it significant, and report that significant value. Partly this problem exists because of journals’ obsession with p-values, which rewards significance. This reward might be enough of a temptation for some people to halve their p-value just to get a significant effect. This practice is cheating (for reasons explained in one of the Jane Superbrain boxes in Chapter 2 of my SPSS/SAS/R books). Of course, I’d never suggest that scientists would halve their p-values just so that they become significant, but it is interesting that two recent surveys of practice in ecology journals concluded that “all uses of one-tailed tests in the journals surveyed seemed invalid” (Lombardi & Hurlbert, 2009), and that only 1 in 17 papers using one-tailed tests was justified in doing so (Ruxton & Neuhaeuser, 2010).
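Here’s the minimal sketch of the halving in point 3. The t-value and degrees of freedom are hypothetical, chosen only so that the two-tailed p lands just above .05:

```python
# A minimal sketch of the halving in point 3. The t-value and degrees of freedom are
# hypothetical, chosen only so the two-tailed p lands just above .05.
from scipy import stats

t_value, df = 1.92, 58
p_two = 2 * stats.t.sf(abs(t_value), df)   # two-tailed p (~.06): 'not significant'
p_one = stats.t.sf(t_value, df)            # one-tailed p in the predicted direction (~.03)

print(f"two-tailed:  p = {p_two:.3f}")
print(f"one-tailed:  p = {p_one:.3f}")     # exactly half -- hence the temptation
```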
For these reasons, DSUS4 is going to discourage the use of one-tailed tests unless there’s a very good reason to use one (e.g., 2 above). 
PS Thanks to Shane Lindsay who, a while back now, sent me the Lombardi and Ruxton papers.

References

  • Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54(4), 351-353. doi: 10.1037/h0046737
  • Lombardi, C. M., & Hurlbert, S. H. (2009). Misprescription and misuse of one-tailed tests. Austral Ecology, 34(4), 447-468. doi: 10.1111/j.1442-9993.2009.01946.x
  • Ruxton, G. D., & Neuhaeuser, M. (2010). When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2), 114-117. doi: 10.1111/j.2041-210X.2010.00014.x

TwitterPanic, NHST, Wizards, and the cult of significance again

****Warning, some bad language used: don’t read if you’re offended by that sort of thing****
I haven’t done a blog in a while, so I figured I ought to. Having joined Twitter a while back, I now find myself suffering from TwitterPanic™, an anxiety disorder (which I fully anticipate will be part of DSM-V) characterised by a profound fear that people will unfollow you unless you keep posting things to remind them of why it’s great to follow you. In the past few weeks I have posted a video of a bat fellating himself and a video of my cat stopping me writing my textbook. These might keep the animal ecologists happy, but most people probably follow me because they think I’m going to write interesting things about statistics, and not because they wanted to see a fellating bat. Perhaps I’m wrong, and if so please tell me, because I find it much easier to get ideas for things to put online that rhyme with stats (like bats and cats) than I do about stats itself.
Anyway, I need to get over my TwitterPanic, so I’m writing a blog that’s actually about stats. A few blogs back I discussed whether I should buy the book ‘The Cult of Statistical Significance’. I did buy it, and read it. Well, when I say I read it, I started reading it, but if I’m honest I got a bit bored and stopped before the end. I’m the last person in the world who could ever criticise anyone for labouring a point, but I felt they did. To be fair to the authors, I think the problem was more that they were essentially discussing things that I already knew, and it’s always difficult to keep focus when you’re not having ‘wow, I didn’t know that’ moments. I think if you’re a newbie to this debate then it’s an excellent book and easy to follow.
[Photo: The Fields on honeymoon]
In the book, the authors argue the case for abandoning null hypothesis significance testing, NHST (and I agree with most of what they say – see this), but they frame the whole debate a bit like a war between them (and people like them) and ‘the sizeless scientists’ (that’s the people who practice NHST). The ‘sizeless scientists’ are depicted (possibly not intentionally) as a bunch of stubborn, self-important, bearded, cape-wearing, fuckwitted wizards who sit around in their wizardy rooms atop the tallest ivory tower in the kingdom of elephant tusks, hanging onto notions of significance testing for the sole purpose of annoying the authors with their fuckwizardry. I suspect the authors have had their research papers reviewed by these fuckwizards. I can empathise with the seeds of bile that experience might have sown in the authors’ bellies; however, I wonder whether writing things like ‘perhaps they [the sizeless scientists] don’t know what a confidence interval is’ is the first step towards thinking that the blue material with stars on that you’ve just seen would look quite fetching as a hat.
I don’t believe that people who have PhDs and do research are anything other than very clever people, and I think the vast majority want to do the right thing when it comes to stats and data analysis (am I naïve here?). The tone of most of the emails I get suggests that people are very keen indeed not to mess up their stats. So, why is NHST so pervasive? I think we can look at a few sources:
  1. Scientists in most disciplines are expected to be international experts in their discipline, which includes being theoretical leaders, research experts, and drivers of policy and practice. On top of this they’re also expected to have a PhD in applied statistics. This situation is crazy, really. So, people tend to think (not unreasonably) that what they were taught in university about statistics is probably still true. They don’t have too much time to update their knowledge. NHST is appealing because it’s a very recipe-book approach to things, and recipes are easy to follow.
  2. Some of the people above will be given the task of teaching research methods/statistics to undergraduates/postgraduates. Your natural instinct is to teach what you know. If you were taught NHST, then that’s what you’ll teach. You might also be teaching a course that forms part of a wider curriculum, and that will affect what you teach. For example, I teach second-year statistics, and by the time I get these students they have had a year of NHST, so it seems to me that it would be enormously confusing for them if I suddenly said ‘oh, all that stuff you were taught last year, well, I think it’s bollocks, learn this instead’. Instead, I weave in some arguments against NHST, but in a fairly low-key way so that I don’t send half of the year into mass confusion and panic. Statistics is confusing enough for them without me undermining a year of their hard work.
  3. Even if you wanted to remove NHST from your curriculum, you might be doing your students a great disservice, because reviewers of research will likely be familiar with NHST and expect to see it. It might not be ideal that this is the case, but that is the world as we currently know it. When I write up research papers I would often love to abandon p-values, but I know that if I do then I am effectively hand-carving a beautiful but knobbly stick, attaching it to my manuscript, and asking the editor if he or she would be so kind as to send the aforementioned stick to the reviewers so that they can beat my manuscript with it. If your students don’t know anything about NHST, are you making their research careers trickier to negotiate?
  4. Textbooks. As I might have mentioned a few million times, I’m updating Discovering Statistics Using SPSS (DSUS as I like to call it). This book is centred around NHST, not because I’m particularly a fan of it, but because it’s what teachers and people who adopt the book expect to see in it. If they don’t see it, they will probably use a different book. I’m aware that this might come across as me completely whoring my principles to sell my book, and perhaps I am, but I also feel that you have to appreciate where other people are coming from. If you were taught NHST, if that’s what you’ve done for 10 or 20 years, and if that’s what you teach because that’s what you genuinely believe is the right way to do things, then the last thing you need is a pompous little arse from Brighton telling you to change everything. It’s much better to have that pompous little arse try to stealth-brainwash you into change: with each edition I feel that I can do a bit more to promote approaches other than NHST. Subvert from within and all that.
So, I think the cult of significance will change, but it will take time, and rather than seeing it as a war between rival factions, perhaps we should pretend it’s Christmas Day, get out of the trenches, play a nice game of football/soccer, compliment each other on our pointy hats, and walk away with a better understanding of each other. It’d be nice if we didn’t go back to shooting each other on Boxing Day, though.
The APA guidelines of over 10 years ago and the increased use of meta-analysis have, I think, had a positive impact on practice. However, we’re still in a sort of hybrid wilderness where everyone does significance tests and, if you’re lucky, people report effect sizes too. I think perhaps one day NHST will be abandoned completely, but it will take time, and by the time it has we’ll probably have found a reason why confidence intervals and effect sizes are as comedic as sticking a leech on your testicles to cure a headache.
I’ve completely lost track of what the point of this blog was now. It started off that I was going to have a rant about one-tailed tests (I’ll save that for another day) because I thought that might ease my TwitterPanic. However, I got sidetracked by thinking about the cult of significance book. I now feel a bit bad, because I might have been a bit critical of it, and I don’t like it when people criticise my books so I probably shouldn’t criticise other people’s. I stuck a sweet wizard-hat-related honeymoon picture in to hopefully soften the authors’ attitude towards me in the unlikely event that they ever read this and decide to despise me. I then took some therapy for dealing with worrying too much about what other people think. It didn’t work. Once I’d thought about that book I remembered that I’d wanted to tell anyone who might be interested that I thought the authors had been a bit harsh on people who use NHST. I think that side track was driven by a subconscious desire to use the word ‘fuckwizardry’, because it made me laugh when I thought of it and Sage will never let me put that in DSUS4. The end result is a blog about nothing, and that’s making my TwitterPanic worse …

Definitions

  • Fuckwizard: someone who does some complicated/impressive task in a fuckwitted manner but with absolute confidence that they are doing it correctly.
  • Fuckwizardry: doing a complicated or impressive task in a fuckwitted manner but with absolute confidence that you are doing it correctly.

Should I Buy This Book?

I’m thinking of buying this book.

On the face of it, it seems like the kind of book I’ll enjoy. Admittedly I’d enjoy a nice biography about Metallica or some other heavy rock band more, but I need to maintain the façade of loving statistics. It seems to be an historical account of why we use significance testing and why it’s a terrible idea. I’m fairly confident that I already know most of what it will say, but the synopsis promises some nice discipline-specific examples of the ‘train wreck’ that is hypothesis testing. I probably won’t know these, and it seems like there’s potential for entertainment. However, the two reviews of this book (which are fairly positive) say the following:

Reviewer 1: “In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.”

Reviewer 2: “A relationship between two variables is statistically significant if there is a low probability (usually less than five per cent) of it happening by chance.”

Both of which are wrong. A result is statistically significant if the observed effect/relationship/whatever is unlikely to have occurred GIVEN THAT THERE IS NO EFFECT/RELATIONSHIP/WHATEVER IN THE POPULATION.
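If it helps to see what that distinction means in practice, here’s a minimal simulation sketch (the sample size and number of replications are arbitrary choices of mine): when there really is no effect in the population, a t-test still hands out ‘significant’ results about 5% of the time, because significance is a statement about how unlikely the data are given no effect, not about how likely it is that an effect exists.

```python
# A minimal simulation sketch: with NO effect in the population, how often does a
# t-test come out 'significant' anyway? (Sample size and replications are arbitrary.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 30, 10_000, 0.05
p_values = np.empty(reps)
for i in range(reps):
    a = rng.normal(0, 1, n)     # both groups drawn from the SAME population,
    b = rng.normal(0, 1, n)     # i.e. the null hypothesis is true by construction
    p_values[i] = stats.ttest_ind(a, b).pvalue

# Roughly 5% of results are 'significant' even though there is no effect at all:
# the p-value is conditioned on there being no effect, it doesn't tell you the
# probability that an effect exists.
print(f"proportion of p < {alpha}: {np.mean(p_values < alpha):.3f}")
```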

So, although I probably will buy this book because it looks interesting, I offer up my own free version here in this blog. I know it’s insanely generous of me to give you a whole book for free, but I’m a caring kind of guy. So here it is:

The Cult of Statistical Significance and Why it Will Fry Your Brain
By
Andy P. Field

If people can read an entire book about a concept and still not understand what it is, then that concept is probably unnecessarily confusing, poorly conceived and should be buried in a lead box, chained with particularly curmudgeonly rattlesnakes, guarded by rabid hounds, and placed in Satan’s toilet bowl. Only someone with the guile of Indiana Jones should be able to retrieve it, and should they ever manage to, this person should be dipped in sugar and set upon by locusts. It turns out that significance testing is such a concept (although with N = 2 I probably didn’t have enough power to significance test my hypothesis).

The End

The Joy of Confidence Intervals

In my last blog I mentioned that Null Hypothesis Significance Testing (NHST) was a bad idea (despite most of us having been taught it, using it, and possibly teaching it to future generations). I also said that confidence intervals are poorly understood. Coincidentally, a colleague of mine, knowing that I was of the ‘burn NHST at the stake’ brigade, recommended this book by Geoff Cumming. It turns out that within the first 5 pages it gives the most beautiful example of why confidence intervals tell us more than NHST. I’m going to steal Geoff’s argument blatantly, but with the proviso that anyone reading this blog buy his book, preferably two copies.
OK, imagine you’ve read Chapter 8 of my SPSS/SAS or R book in which I suggest that rather than cast rash judgments on a man for placing an eel up his anus to cure constipation, we use science to evaluate the efficacy of the man’s preferred intervention. You randomly allocate people with constipation to a treatment as usual group (TAU) or to placing an eel up their anus (intervention). You then find a good lawyer.
Imagine there were 10 studies (you can assume they are of a suitably high quality with no systematic differences between them) that had reported such scientific endeavours. They have a measure of constipation as their outcome (let’s assume it’s a continuous measure). A positive difference between means indicates that the intervention was better than the control (TAU) at reducing constipation.
Here are the results:
Study        Difference between means       t          p
Study 1               4.193                3.229      0.002*
Study 2               2.082                1.743      0.086
Study 3               1.546                1.336      0.187
Study 4               1.509                0.890      0.384
Study 5               3.991                2.894      0.006*
Study 6               4.141                3.551      0.001*
Study 7               4.323                3.745      0.000*
Study 8               2.035                1.479      0.155
Study 9               6.246                4.889      0.000*
Study 10              0.863                0.565      0.577
* p < .05
OK, here’s a quiz. Which of these statements best reflects your interpretation of these data:
  •  A. The evidence is equivocal, we need more research.
  •  B. All of the mean differences show a positive effect of the intervention, therefore, we have consistent evidence that the treatment works.
  •  C. Five of the studies show a significant result (p < .05), but the other 5 do not. Therefore, the studies are inconclusive: some suggest that the intervention is better than TAU, but others suggest there's no difference. The fact that half of the studies showed no significant effect means that the treatment is not (on balance) more successful in reducing symptoms than the control.
  •  D. I want to go for C, but I have a feeling it’s a trick question.

Some of you, or at least those of you brought up to worship at the shrine of NHST, probably went for C. If you didn’t, then good for you. If you did, then don’t feel bad, because if you believe in NHST then that’s exactly the answer you should give.
Now let’s look at the 95% confidence intervals for the mean differences in each study:
Note the mean differences correspond to those we have already seen (I haven’t been cunning and changed the data). Thinking about what confidence intervals show us, which of the statements A to D above best fits your view?
Hopefully, many of you who thought C before now think B. If you still think C, then I will explain why you should go for B:
A 95% confidence interval is constructed so that, in the long run, 95 out of every 100 such intervals contain the population value. In other words, they reflect the likely true population value: 5 out of 100 will miss it, but 95 out of 100 contain the actual population value. Looking at our 10 studies, only 3 of the 10 contain zero (studies 3, 8 and 10), and for two of them (studies 3 and 10) they only just contain zero. Therefore, in 7 of the 10 studies the evidence suggests that the population difference between group means is NOT zero. In other words, there is an effect in the population (zero would mean no difference between the groups). So, 7 out of 10 studies suggest that the population value, the actual real difference between groups, is NOT ZERO. What’s more, even the 3 that do contain zero show a positive difference, and only a relatively small portion of the tail of the CI is below zero. So, even in the three studies that have confidence intervals crossing zero, it is more likely than not that the population value is greater than zero. As such, across all 10 studies there is strong and consistent evidence that the population difference between means is greater than zero, reflecting a positive effect of the intervention compared to TAU.
The main point that Cumming makes (he talks about meta-analysis too, but I’m bored of typing now) is that the dichotomous significant/non-significant thinking fostered by NHST can lead you to radically different conclusions from those you would reach if you simply looked at the data with a nice, informative confidence interval. In short, confidence intervals rule, and NHST sucks.
More important, it should not be the case that the way we picture the data/results completely alters our conclusions. Given we’re stuck with NHST at least for now, we could do worse than use CIs as the necessary pinch of salt required when interpreting significance tests.
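If you want to see this for yourself, here’s a minimal simulation sketch in the spirit of Cumming’s argument (the numbers are mine, not the ten constipation studies above; the true effect, SD and per-group sample size are arbitrary assumptions): ten replications of the same true positive effect produce p-values that hop either side of .05, while the confidence intervals tell a much more consistent story.

```python
# A minimal sketch in the spirit of Cumming's argument: ten simulated replications of
# the SAME true effect. The population difference (2.5), SD (5) and per-group n (32)
# are arbitrary assumptions, not taken from the studies above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_diff, sd, n = 2.5, 5.0, 32

for study in range(1, 11):
    tau = rng.normal(0, sd, n)                    # treatment-as-usual group
    intervention = rng.normal(true_diff, sd, n)   # intervention group, same true effect each time
    diff = intervention.mean() - tau.mean()
    t, p = stats.ttest_ind(intervention, tau)     # the NHST verdict flips from study to study
    se = np.sqrt(intervention.var(ddof=1) / n + tau.var(ddof=1) / n)
    crit = stats.t.ppf(0.975, 2 * n - 2)          # approximate 95% CI (equal n, similar variances)
    print(f"Study {study:2d}: diff = {diff:5.2f}, p = {p:.3f}, "
          f"95% CI [{diff - crit*se:5.2f}, {diff + crit*se:5.2f}]")
```

Rerun it with a different seed and the pattern of significant/non-significant verdicts reshuffles, but the intervals keep pointing at much the same range of plausible effect sizes.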
Hopefully, that explains some of the comments in my previous blog. I’m off to buy a second copy of Geoff’s book …