Perhaps my oxytocin was low when I read this paper.

I’m writing a new textbook on introductory statistics, and I decided to include an example based on Paul Zak’s intriguing work on the role of the hormone oxytocin in trust between strangers. In particular, I started looking at his 2005 paper in Nature. Yes, Nature, one of the top science journals. A journal with an impact factor of about 38, and a rejection rate probably fairly close to 100%. It’s a journal you’d expect to publish pretty decent stuff and to subject articles to fairly rigorous scrutiny.

Before I begin, I have no axe to grind with anyone here. I just want to comment on some things I found and leave everyone else to do what they like with that information. I have no doubt Dr. Zak has done a lot of other work on which he bases his theory, and I’m not trying to undermine that work; I mainly want to make a point about the state of the top journals in science. All the code here is for R.

The data I was looking at relate to an experiment in which participants were asked to invest money in a trustee (a stranger). If they invested, then the total funds for the investor and trustee increased. If the trustee shared the proceeds then both players ended up better off, but if the trustee did not repay the investor’s trust by splitting the funds then the trustee ended up better off and the investor worse off. The question is, will investors trust the trustees to split the funds? If they do then they will invest; if not, they won’t. Critically, one group of investors was exposed to exogenous oxytocin (N = 29), whereas a placebo group was not (N = 29). The oxytocin group invested significantly more than the placebo control group, suggesting that oxytocin had causally influenced their levels of trust in the trustee. This is the sort of finding that the media loves.

The paper reports a few experiments; I want to focus on the data specifically related to trust, shown in Figure 2a (reproduced below):

The good thing about this figure is that it shows relative frequencies, which means that with the aid of a ruler and a spreadsheet we can reconstruct the data. Based on the figure, the raw data are as follows:

placebo <- c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 11, 11, 11, 12, 12, 12, 12, 12, 12)

oxy <- c(3, 4, 4, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)
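Incidentally, if you want to see how that reconstruction works in R, the idea is just to turn the frequencies read off the bar chart into repeated values with rep(). A minimal sketch (the counts are my own readings from the figure, so treat them as approximate):

mu <- 0:12                                                   # possible monetary units (MU) invested
placebo.counts <- c(0, 0, 0, 2, 3, 1, 5, 2, 2, 2, 0, 3, 6)   # counts implied by my reading of the relative frequencies in Figure 2a
placebo.check <- rep(mu, placebo.counts)                     # expand the counts into raw scores
all(placebo.check == placebo)                                # TRUE: same vector as above (the oxytocin group is built the same way)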

The problem is that this gives us only N = 26 in the placebo group. Bollocks!

OK, well, perhaps they reported the wrong N. Here’s their table of descriptives:

Let’s have a look at the descriptives I get (you’ll need the pastecs package):

> library(pastecs)
> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
         7.5          7.9          0.6          1.3         10.2          3.2          0.4 
> round(stat.desc(oxy, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
        10.0          9.6          0.5          1.1          8.1          2.8          0.3 
For the oxytocin group the mean, median and SD match, but for the placebo group they don’t. Hmmm. So, there must be missing cases. Based on where the median is, I guessed that the three missing cases might be values of 10. In other words, Figure 2a in the paper ought to look like this:
So, the placebo group data now looks like this:

placebo <- c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 12, 12)

Let’s look at the descriptives now:
> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
         8.0          8.1          0.6          1.2          9.5          3.1          0.4 
They now match the table in the paper. So, this gives me confidence that I have probably correctly reconstructed the raw data, despite Figure 2’s best efforts to throw me off the scent. Of course, I could be wrong. Let’s see if my figures match their analysis. They did a Mann-Whitney U test that yielded a p-value of 0.05887, which they halved to report a one-tailed p of .029. Notwithstanding the fact that you probably shouldn’t do one-tailed tests, ever, they conclude a significant between-group difference. I replicate their p-value, which again convinces me that my reconstructed data match their raw data.
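(As an aside, rather than eyeballing the median you could brute-force the guess about the missing cases: add three copies of each candidate value to the 26 reconstructed scores and see which candidate reproduces the published descriptives. A quick sketch of that check, entirely my own and not part of the paper’s analysis:)

placebo26 <- c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 11, 11, 11, 12, 12, 12, 12, 12, 12)
check <- t(sapply(0:12, function(x) {
  full <- c(placebo26, rep(x, 3))                      # the 26 scores plus three copies of the candidate value
  c(candidate = x, mean = mean(full), median = median(full), sd = sd(full))
}))
round(check, 1)                                        # only candidate 10 gives mean 8.1, median 8, SD 3.1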
Put the data into a data frame and do a Wilcoxon rank-sum test (which is equivalent to the Mann-Whitney test):
zak <- data.frame(gl(2, 29), c(placebo, oxy))   # gl(2, 29) creates a factor with two levels of 29 cases each
names(zak) <- c("Group", "MU")
zak$Group <- factor(zak$Group, labels = c("Placebo", "Oxytocin"))

> wilcox.test(MU~Group, zak)

Wilcoxon rank sum test with continuity correction

data:  MU by Group
W = 301, p-value = 0.05887
alternative hypothesis: true location shift is not equal to 0
So, all well and good. There are better ways to look at this than a Mann-Whitney test, though, so let’s use some of Rand Wilcox’s robust tests (these functions aren’t in base R: you can source Wilcox’s own WRS code, or use the WRS2 package, which has similar versions with a formula interface). First off, a trimmed mean and bootstrap:

> yuenbt(placebo, oxy, tr=.2, alpha=.05, nboot = 2000, side = T)
$ci
[1] -3.84344384  0.05397015

$test.stat
[1] -1.773126

$p.value
[1] 0.056

$est.1
[1] 8.315789

$est.2
[1] 10.21053

$est.dif
[1] -1.894737

$n1
[1] 29

$n2
[1] 29
This gives us a similar p-value to the Mann-Whitney test and a confidence interval for the difference that contains zero. However, the data are very skewed indeed:
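You can see the skew for yourself with a couple of quick histograms (a minimal base-R sketch; the exact binning is my choice):

par(mfrow = c(1, 2))                                   # two panels side by side
hist(placebo, breaks = 0:12, main = "Placebo", xlab = "MU invested")
hist(oxy, breaks = 0:12, main = "Oxytocin", xlab = "MU invested")
par(mfrow = c(1, 1))                                   # reset the plotting layout

Both distributions pile up against the 12 MU ceiling, which is where the strong negative skew comes from.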
Perhaps we’re better off comparing the medians of the two groups (medians are nowhere near as biased by skew as means):

> pb2gen(placebo, oxy, alpha=.05, nboot=2000, est=median)
[1] "Taking bootstrap samples. Please wait."
$est.1
[1] 8

$est.2
[1] 10

$est.dif
[1] -2

$ci
[1] -6  1

$p.value
[1] 0.179

$sq.se
[1] 2.688902

$n1
[1] 29

$n2
[1] 29
The p-value is now a rather larger .179 (which, even if you decide to halve it, won’t get you below the not-very-magic .05 threshold for significance). So, the means might be significantly different depending on your view of one-tailed tests, but the medians certainly are not. Of course, null hypothesis significance testing is bollocks anyway (see my book, or this blog), so maybe we should just look at the effect size. Or maybe we shouldn’t, because the group means and SDs will be biased by the skew (the SDs especially, because the initial bias gets squared when you compute them). Cohen’s d is based on means and SDs, so if these statistics are biased then d will be too. I did it anyway though, just for a laugh:
> d <- (mean(oxy) - mean(placebo))/sd(placebo)
> d
[1] 0.4591715
I couldn’t be bothered to do the pooled estimate variant, but because there is a control group in this study it makes sense to use the control group SD to standardise the mean difference (because the SD in this condition shouldn’t be affected by the experimental manipulation). The result? About half a standard deviation difference (notwithstanding the bias). That’s not too shabby, although it’d be even more un-shabby if the estimate weren’t biased. Finally (notwithstanding various objections to using uninformed priors in Bayesian analysis), we can use the BayesFactor package to work out a Bayes factor:
> library(BayesFactor)
> ttestBF(oxy, y = placebo)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 1.037284 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
Bayes factor type: BFindepSample, JZS
The Bayes factor is about 1.04, which means that there is almost exactly equal evidence for the null hypothesis as for the alternative hypothesis. In other words, there is no support for the hypothesis that oxytocin affected trust.
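(If you want the Bayes factor as a plain number rather than the printed summary, the BayesFactor package’s extractBF() function will pull it out; a minimal sketch:)

library(BayesFactor)                                   # if not already loaded
bf <- ttestBF(oxy, y = placebo)                        # same analysis as above
extractBF(bf)$bf                                       # the Bayes factor as a number (about 1.04)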
So, Nature, one of the top journals in science, published a paper in which the evidence for oxytocin affecting trust was debatable at best. Let’s not even get into the fact that this was based on N = 58. Like I said, I’m sure Dr. Zak has done many other studies, and I have no interest in trying to undermine what he or his colleagues do; I just want to comment on this isolated piece of research and offer an alternative interpretation of the data. There’s lots of stuff I’ve done myself that, with the benefit of experience, I’d rather forget about, so, hey, I’m not standing on any kind of scientific high ground here. What I would say is that in this particular study, based on the statistics least affected by the shape of the data (the medians) and an (admittedly uninformed-prior) Bayesian analysis, there is not really a lot of evidence that oxytocin affected trust.


One-Tailed Tests

I’ve been thinking about writing a blog on one-tailed tests for a while. The reason is that one of the changes I’m making in my re-write of DSUS4 is to alter the way I talk about one-tailed tests. You might wonder why I would want to alter something like that – surely if it was good enough for the third edition then it’s good enough for the fourth? Textbook writing is quite an interesting process because when I wrote the first edition I was very much younger, and to some extent the content was driven by what I saw in other textbooks. Even as the book has evolved over successive editions, the publishers get feedback from lecturers who use the book, I get emails from people who use it, and so, again, the content gets driven a bit by what people who use the book want and expect to see. People expect to learn about one-tailed tests in an introductory statistics book and I haven’t wanted to disappoint them. However, as you get older you also get more confident about having an opinion on things. So, although I have happily entertained one-tailed tests in the past, in more recent years I have come to feel that they are one of the worst aspects of hypothesis testing and should probably be discouraged.
Yesterday I got the following question landing in my inbox, which was the perfect motivator to write this blog and explain why I’m trying to deal with one-tailed tests very differently in the new edition of DSUS:
Question: “I need some advice and thought you may be able to help. I have a one-tailed hypothesis, ego depletion will increase response times on a Stroop task. The data is parametric and I am using a related T-Test.
Before depletion the Stroop performance mean is 70.66 (12.36)
After depletion the Stroop performance mean is 61.95 (10.36)
The t-test is, t (138) = 2.07, p = .02 (one-tailed)
Although the t-test comes out significant, it goes against what I have hypothesised. That Stroop performance decreased rather than increased after depletion. So it goes in the other direction. How do I acknowledge this in a report?
I have done this so far. Is it correct?
Although the graph suggests there was a decrease in Stroop performance times after ego-depletion. Before ego-depletion (M=70.66, SD=12.36) after ego-depletion (M= 61.95, SD=10.36), a t-test showed there was a significance between Stroop performance phase one and two t (138) = 10.94, p <.001 (one-tailed).”
This question illustrates perfectly the confusion people have about one-tailed tests. The author quite rightly wants to acknowledge that the effect was in the opposite direction, but quite wrongly still wants to report the effect … and why not? Effects in the opposite direction are interesting and intriguing, and any good scientist wants to explain interesting findings.
The trouble is that my answer to the question of what to do when you get a significant one-tailed p-value but the effect is in the opposite direction to what you predicted is (and I quote my re-written Chapter 2 here): “if you do a one-tailed test and the results turn out to be in the opposite direction to what you predicted you must ignore them, resist all temptation to interpret them, and accept (no matter how much it pains you) the null hypothesis. If you don’t do this, then you have done a two-tailed test using a different level of significance from the one you set out to use”. In other words, if you set out to test at α = .05 one-tailed but are prepared to interpret an effect in either direction, you have really run a two-tailed test at α = .10.
[Quoting some edited highlights of the new section I wrote on one-tailed tests]:
One-tailed tests are problematic for three reasons:
  1. As the question I was sent illustrates, when scientists see interesting and unexpected findings their natural instinct is to want to explain them. Therefore, one-tailed tests are dangerous because like a nice piece of chocolate cake when you’re on a diet, they waft the smell of temptation under your nose. You know you shouldn’t eat the cake, but it smells so nice, and looks so tasty that you shovel it down your throat. Many a scientist’s throat has a one-tailed effect in the opposite direction to that predicted wedged in it, turning their face red (with embarrassment).
  2. One-tailed tests are appropriate only if a result in the opposite direction to the expected direction would result in exactly the same action as a non-significant result (Lombardi & Hurlbert, 2009; Ruxton & Neuhaeuser, 2010). This can happen, for example, if a result in the opposite direction would be theoretically meaningless or impossible to explain even if you wanted to (Kimmel, 1957). Another situation would be if, for example, you’re testing a new drug to treat depression. You predict it will be better than existing drugs. If it is not better than existing drugs (non-significant p) you would not approve the drug; however, if it was significantly worse than existing drugs (significant p but in the opposite direction) you would also not approve the drug. In both situations, the drug is not approved.
  3. One-tailed tests encourage cheating. If you do a two-tailed test and find that your p is .06, then you would conclude that your results were not significant (because .06 is bigger than the critical value of .05). Had you done this test one-tailed, however, the p you would get would be half of the two-tailed value (.03; there’s a little demonstration after this list). This one-tailed value would be significant at the conventional level. Therefore, if a scientist finds a two-tailed p that is just non-significant, they might be tempted to pretend that they’d always intended to do a one-tailed test, halve the p-value to make it significant, and report that significant value. Partly this problem exists because of journals’ obsession with p-values, which rewards significance. This reward might be enough of a temptation for some people to halve their p-value just to get a significant effect. This practice is cheating (for reasons explained in one of the Jane Superbrain boxes in Chapter 2 of my SPSS/SAS/R books). Of course, I’d never suggest that scientists would halve their p-values just so that they become significant, but it is interesting that two recent surveys of practice in ecology journals concluded that “all uses of one-tailed tests in the journals surveyed seemed invalid” (Lombardi & Hurlbert, 2009), and that only 1 in 17 papers using one-tailed tests were justified in doing so (Ruxton & Neuhaeuser, 2010).
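To see the halving in action, here’s a tiny simulated demonstration (the data are made up purely for illustration):

set.seed(42)                                           # made-up data, for illustration only
x <- rnorm(30, mean = 0.4)                             # "experimental" group
y <- rnorm(30, mean = 0)                               # "control" group
t.test(x, y, alternative = "two.sided")$p.value        # two-tailed p
t.test(x, y, alternative = "greater")$p.value          # one-tailed p: exactly half the two-tailed value (when the effect is in the predicted direction)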
For these reasons, DSUS4 is going to discourage the use of one-tailed tests unless there’s a very good reason to use one (e.g., 2 above). 
PS Thanks to Shane Lindsay who, a while back now, sent me the Lombardi and Ruxton papers.

References

  • Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54(4), 351-353. doi: 10.1037/h0046737
  • Lombardi, C. M., & Hurlbert, S. H. (2009). Misprescription and misuse of one-tailed tests. Austral Ecology, 34(4), 447-468. doi: 10.1111/j.1442-9993.2009.01946.x
  • Ruxton, G. D., & Neuhaeuser, M. (2010). When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2), 114-117. doi: 10.1111/j.2041-210X.2010.00014.x