Perhaps my oxytocin was low when I read this paper.

I’m writing a new textbook on introductory statistics, and I decided to include an example based on Paul Zak’s intriguing work on the role of the hormone oxytocin in trust between strangers. In particular, I started looking at his 2005 paper in Nature. Yes, Nature, one of the top science journals. A journal with an impact factor of about 38, and a rejection rate probably fairly close to 100%. It’s the sort of journal you’d expect to publish pretty decent stuff and to subject articles to fairly rigorous scrutiny.

Before I begin, I have no axe to grind here with anyone. I just want to comment on some stuff I found and leave everyone else to do what they like with that information. I have no doubt Dr. Zak has done a lot of other work on which he bases his theory, and I’m not trying to undermine that work; I mainly want to make a point about the state of the top journals in science. All my code here is for R.

The data I was looking at relate to an experiment in which participants were asked to invest money in a trustee (a stranger). If they invest, the total funds for the investor and trustee increase. If the trustee shares the proceeds then both players end up better off, but if the trustee does not repay the investor’s trust by splitting the fund then the trustee ends up better off and the investor worse off. The question is: will investors trust the trustees to split the funds? If they do, they will invest; if not, they won’t. Critically, one group of investors was exposed to exogenous oxytocin (N = 29), whereas a placebo group was not (N = 29). The oxytocin group invested significantly more than the placebo control group, suggesting that oxytocin had causally influenced their levels of trust in the trustee. This is the sort of finding that the media loves.

The paper reports a few experiments; I want to focus on the data specifically related to trust shown in Figure 2a (reproduced below):

The good thing about this figure is that it shows relative frequencies, which means that, with the aid of a ruler and a spreadsheet, we can reconstruct the data. Based on the figure, the raw data are as follows:

placebo<-c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 11, 11, 11, 12, 12, 12, 12, 12, 12)

oxy<-c(3, 4, 4, 6, 6, 7, 8, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)

The problem is that this gives us only N = 26 in the placebo group. Bollocks!
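A quick sanity check on the counts confirms it:

length(placebo)  # 26: three cases short of the reported N
length(oxy)      # 29, as reported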

Ok, well perhaps they reported the wrong N. Here’s their table of descriptives:

Let’s have a look at the descriptives I get (you’ll need the pastecs package):

> library(pastecs)
> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
         7.5          7.9          0.6          1.3         10.2          3.2          0.4 
> round(stat.desc(oxy, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
        10.0          9.6          0.5          1.1          8.1          2.8          0.3 
For the oxytocin group the mean, median and SD match, but for the placebo group they don’t. Hmmm. So, there must be missing cases. Based on where the median is, I guessed that the three missing cases might be values of 10. In other words, Figure 2a in the paper ought to look like this:
So, the placebo group data now looks like this:

placebo<-c(3, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 12, 12)

Let’s look at the descriptives now:
> round(stat.desc(placebo, basic = F), 1)
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
         8.0          8.1          0.6          1.2          9.5          3.1          0.4 
They now match the table in the paper. So, this gives me confidence that I have probably correctly reconstructed the raw data despite Figure 2a’s best efforts to throw me off the scent. Of course, I could be wrong. Let’s see if my figures match their analysis. They did a Mann-Whitney U test that yielded a two-tailed p-value of 0.05887, which they halved to report a one-tailed p of .029. Notwithstanding the fact that you probably shouldn’t do one-tailed tests, ever, they conclude a significant between-group difference. I replicate their p-value, again convincing me that my reconstructed data match their raw data.
Put the data into a data frame and run a Wilcoxon rank-sum test (which is equivalent to the Mann-Whitney test):
zak<-data.frame(gl(2, 29), c(placebo, oxy))
names(zak)<-c("Group", "MU")
zak$Group<-factor(zak$Group, labels = c("Placebo", "Oxytocin"))

> wilcox.test(MU~Group, zak)

Wilcoxon rank sum test with continuity correction

data:  MU by Group
W = 301, p-value = 0.05887
alternative hypothesis: true location shift is not equal to 0
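Incidentally, if you wanted their one-tailed p directly rather than halving by hand, asking for the directional alternative should give roughly the .029 they report:

wilcox.test(MU ~ Group, zak, alternative = "less")  # is Placebo shifted below Oxytocin?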
So, all well and good. There are better ways to look at these data than a Mann-Whitney test, though, so let’s use some of Rand Wilcox’s robust tests. First off, a trimmed mean and bootstrap.
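These functions don’t live in a CRAN package; you need to source Wilcox’s file of functions first (it’s the same file sourced later in this post):

source("http://dornsife.usc.edu/assets/sites/239/docs/Rallfun-v20")

With that loaded: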

> yuenbt(placebo, oxy, tr=.2, alpha=.05, nboot = 2000, side = T)
$ci
[1] -3.84344384  0.05397015

$test.stat
[1] -1.773126

$p.value
[1] 0.056

$est.1
[1] 8.315789

$est.2
[1] 10.21053

$est.dif
[1] -1.894737

$n1
[1] 29

$n2
[1] 29
This gives us a similar p-value to the Mann-Whitney test and a confidence interval for the difference that contains zero. However, the data are very skewed indeed:
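(If you want to see that skew for yourself, here’s a quick sketch using the zak data frame created above; stat.desc with norm = TRUE will also give you skewness statistics if you prefer numbers to pictures:)

library(ggplot2)
# Histograms of investment by group; note the pile-up at the maximum of 12
ggplot(zak, aes(MU)) + geom_histogram(binwidth = 1) + facet_wrap(~Group)
round(stat.desc(placebo, basic = F, norm = T), 2)  # includes skewness and skew.2SE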
Perhaps we’re better off comparing the medians of the two groups (medians are nowhere near as biased by skew as means):

> pb2gen(placebo, oxy, alpha=.05, nboot=2000, est=median)
[1] "Taking bootstrap samples. Please wait."
$est.1
[1] 8

$est.2
[1] 10

$est.dif
[1] -2

$ci
[1] -6  1

$p.value
[1] 0.179

$sq.se
[1] 2.688902

$n1
[1] 29

$n2
[1] 29
The p-value is now a larger .179 (which, even if you decide to halve it, won’t get you below the not-very-magic .05 threshold for significance). So, the means might be significantly different depending on your view of one-tailed tests, but the medians certainly are not. Of course, null hypothesis significance testing is bollocks anyway (see my book, or this blog), so maybe we should just look at the effect size. Or maybe we shouldn’t, because the group means and SDs will be biased by the skew (the SDs especially, because computing them involves squaring deviations, which inflates the bias). Cohen’s d is based on means and SDs, so if these statistics are biased then d will be too. I did it anyway though, just for a laugh:
> d<-(mean(oxy)-mean(placebo))/sd(placebo)
> d
[1] 0.4591715
I couldn’t be bothered to do the pooled estimate variant (though a quick sketch is below), and in any case, because there is a control group in this study it makes sense to use the control group SD to standardise the mean difference (because the SD in this condition shouldn’t be affected by the experimental manipulation). The result? About half a standard deviation difference (notwithstanding the bias). That’s not too shabby, although it’d be even more un-shabby if the estimate weren’t biased.
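For completeness, here’s a minimal sketch of that pooled variant (my addition, not a calculation from the paper; with equal group sizes the pooled variance is just the average of the two group variances):

# Pooled-SD Cohen's d: the groups are the same size, so pool by averaging the variances
sd.pooled<-sqrt((var(placebo) + var(oxy))/2)
d.pooled<-(mean(oxy) - mean(placebo))/sd.pooled
d.pooled

Finally (and notwithstanding various objections to using uninformative priors in Bayesian analysis), we can use the BayesFactor package to work out a Bayes factor: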
> library(BayesFactor)
> ttestBF(oxy, y = placebo)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 1.037284 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
Bayes factor type: BFindepSample, JZS
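(If you want the numeric value programmatically rather than reading it off the printout, extractBF() should pull it out, assuming I’m remembering the BayesFactor API correctly:)

bf<-ttestBF(oxy, y = placebo)
extractBF(bf)$bf  # about 1.04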
The Bayes factor is 1.04, which means that there is almost exactly equal evidence for the null hypothesis as for the alternative hypothesis. In other words, there is no support for the hypothesis that oxytocin affected trust.
So, Nature, one of the top journals in science, published a paper in which the evidence for oxytocin affecting trust was debatable at best. Let’s not even get into the fact that this was based on N = 58. Like I said, I’m sure Dr. Zak has done many other studies and I have no interest in trying to undermine what he or his colleagues do; I just want to comment on this isolated piece of research and offer an alternative interpretation of the data. There’s lots of stuff I’ve done myself that, with the benefit of experience, I’d rather forget about, so, hey, I’m not standing on any kind of scientific high ground here. What I would say is that in this particular study, based on the things least affected by the shape of the data (medians) and an (admittedly uninformative-prior) Bayesian analysis, there is not really a lot of evidence that oxytocin affected trust.


Bat Cunnilingus and Spurious Results

Thanks to my brother I became aware of this article about bat cunnilingus today. My brother, knowing how entertained I had been by a bat fellating itself at Disney on my honeymoon, decided that this would be just the sort of article that I’d like. Knowing the kind of examples I put in my textbooks, I suspect his email is the first of many. Anyway, this article suggests that male bats engage in cunnilingus with females. Female bats, that is. They have videos to prove it. Male bats like cunnilingus so much that they don’t just use it as foreplay, but get stuck in afterwards as well. The article has two main findings:
  1. The more that males (bats) engage in cunnilingus, the longer the duration of intercourse, r = .91 (see their Figure 1). They tested this with a Pearson r based on 8 observations (although each observation was, I think, the average of several data points; it’s not entirely clear how the authors came about these averages). They conclude that this behaviour may be due to sperm competition, in that males sort of ‘clean out’ the area before they procreate. They argue that the longer intercourse duration supports this argument because longer duration = greater chance of paternity.
  2. The more that males engage in pre-intercourse cunnilingus, the less they engage in post-intercourse cunnilingus, r = -.79 (see their Figure 2). They again tested this with a Pearson r based on 8 observations. They conclude “… the negative relationship between the duration of pre- and post-copulatory cunnilingus suggests the possibility of sperm competition. Thus, the male removes the sperm of competitors but apparently not its own – lick for longer before but less after copulation, if paternity uncertainty is high.”

There are several leaps of faith in their conclusions. Does longer duration necessarily imply a greater chance of paternity? Also, if sperm competition is at the root of all of this then I’m not sure post-coital cunnilingus is really ever an adaptive strategy: consuming your own sperm hardly seems like the best method of impregnating your mate. Yes, I’m still talking about bats here; what you choose to do in your own free time is none of my business.

Enough about Cunnilingus, what about Statistics?

Anyway, I want to make a statistical point, not an evolutionary one. It struck me when looking at this article that finding 2 is probably spurious. Figure 2 looks very much like two outliers are making a relationship out of nothing, and with only 8 observations they are likely to have a large impact. With 8 observations it is, of course, hard to say what’s an outlier and what isn’t; however, the figure was like a bat tongue to my brain: my juices were flowing. Fortunately, with so few observations we can approximate the data from the two figures. I emphasise that these are quick estimates of values based on the graphs and a ruler. We have three variables estimated from the paper: pre (pre-intercourse cunnilingus), post (post-intercourse cunnilingus) and duration (duration of intercourse).
We’re going to use R; if you don’t know how to use it then some imbecile called Andy Field has written a book on it that you could look at. We can input these approximate scores as:
pre<-c(46, 50, 51, 52, 53, 53.5, 55, 59)
post<-c(162, 140, 150, 140, 142, 138, 152, 122)
duration<-c(14.4, 14.4, 14.9, 15.5, 15.4, 15.2, 15.3, 16.4)
bat<-data.frame(pre, post, duration)
Ok, so that’s the three variables entered, and I’ve made a data frame from them called bat.
Let’s load some packages that will be useful (if you don’t have them installed then install them) and source Rand Wilcox’s very helpful robust functions:
library(ggplot2)
library(Hmisc)
library(boot)
source("http://dornsife.usc.edu/assets/sites/239/docs/Rallfun-v20")
Let’s draw some rough replicas of Figures 1 and 2 from the paper:
qplot(pre, post, data = bat, ylim = c(85, 200), xlim = c(44, 60)) + stat_smooth(method = "lm")
qplot(pre, duration, data = bat, ylim = c(13, 18), xlim = c(44, 60)) + stat_smooth(method = "lm")

Figure 1: Quick replica of Maruthupandian & Marimuthu (2013) Figure 1
Figure 2: Quick replica of Maruthupandian & Marimuthu (2013) Figure 2
As you can see, the figures more or less replicate those in the paper. OK, let’s reproduce their correlation coefficient estimates by executing:
rcorr(as.matrix(bat), type = "pearson")
This gives us:
           pre  post duration
pre       1.00 -0.78     0.91
post     -0.78  1.00    -0.75
duration  0.91 -0.75     1.00
n= 8

P
         pre    post   duration
pre             0.0225 0.0018 
post     0.0225        0.0308 
duration 0.0018 0.0308
So, we replicate the pre-intercourse cunnilingus (pre)-duration correlation of .91 exactly (p = .0018), and we more or less get the right value for the pre-post correlation: we get r = -.78 (p = .0225) instead of -.79, but we’re within rounding error. So, it looks like our estimates of the raw data from the figures in the article aren’t far off the real values at all. Just goes to show how useful graphs in papers can be for extracting the raw data.
If I’m right and the data in Figure 2 are affected by the first and last case, we’d expect something different to happen if we calculate the correlation based on ranked scores (i.e. Spearman’s rho). Let’s try it:
rcorr(as.matrix(bat), type = "spearman")
           pre  post duration
pre       1.00 -0.50     0.78
post     -0.50  1.00    -0.51
duration  0.78 -0.51     1.00
n= 8
P
         pre    post   duration
pre             0.2039 0.0229 
post     0.2039        0.1945 
duration 0.0229 0.1945 
The pre-duration correlation has dropped from .91 to .78 (p = .0229), but it’s still very strong. However, the pre-post correlation has changed fairly dramatically from -.79 to r = -.50 (p = .204) and is now not significant (although I wouldn’t want to read too much into that with an n of 8).
What if we try a robust correlation, such as the percentile bend correlation described by Wilcox (2012)? Having sourced Wilcox’s functions, we can do this easily enough by executing:
pbcor(bat$pre, bat$post)
pbcor(bat$pre, bat$duration)
For the pre-post correlation we get:
$cor
[1] -0.3686576
$test
[1] -0.9714467
$siglevel
[1] 0.368843
$n
[1] 8
It has changed even more dramatically, from -.79 to r = -.37 (p = .369), and is now not significant (again, I wouldn’t want to read too much into that with an n of 8, but it is now a much smaller effect than the one reported in the paper).
The pre-duration correlation fares much better:
$cor
[1] 0.852195
$test
[1] 3.989575
$siglevel
[1] 0.007204076
$n
[1] 8
This is still very strong and significant, r = .85 (p = .007). We could also try to estimate the population value of these two relationships by computing robust confidence intervals using bootstrapping. We’ll stick with Pearson r, seeing as that’s the estimate they used in the paper, but you could try this with the other measures if you like:
bootr<-function(data, i){
# Pearson correlation between the first two columns of the bootstrap sample
cor(data[i, 1], data[i, 2])
}
bootprepost<-boot(bat[1:2], bootr, 2000)
boot.ci(bootprepost)
bootpreduration<-boot(bat[c(1, 3)], bootr, 2000)
boot.ci(bootpreduration)
If we look at the BCa intervals, we get this for the pre-post correlation:
Level     Percentile            BCa         
95%   (-0.9903,  0.5832 )   (-0.9909,  0.5670 )
and this for the pre-duration correlation:
Level     Percentile            BCa         
95%   ( 0.6042,  0.9843 )   ( 0.5022,  0.9773 )
In essence, then, the true relationship between pre- and post-intercourse cunnilingus duration lies somewhere between about -1 and .6. In other words, it could be anything really, including zero. For the relationship between pre-intercourse bat cunnilingus and intercourse duration, the true relationship is somewhere between r = .5 and .98. In other words, it’s likely to be, at the very worst, a strong positive association.
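Incidentally, if you wanted to bootstrap one of the other coefficients instead, the tweak to the bootr function above is minimal; for example, for Spearman’s rho (bootrs is just my name for it):

bootrs<-function(data, i){
# Spearman correlation between the first two columns of the bootstrap sample
cor(data[i, 1], data[i, 2], method = "spearman")
}
bootrsprepost<-boot(bat[1:2], bootrs, 2000)
boot.ci(bootrsprepost)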

Conclusions

What does this tell us? Well, I’m primarily interested in using this as a demonstration of why it’s important to look at your data and to employ robust tests. I’m not trying to knock the research. Of course, in the process I am, inevitably, sort of knocking it. So, I think there are three messages here:
  1. Their first finding, that the more that males (bats) engage in cunnilingus the longer the duration of intercourse, is very solid. The population value could be less than the estimate they report, but it could very well be that large too. At the very worst it is still a strong association.
  2. The less good news is that finding 2, that the more males engage in pre-intercourse cunnilingus the less they engage in post-intercourse cunnilingus, is probably wrong. The best-case scenario is that the finding is real but the authors have overestimated it. The worst-case scenario is that the population effect is in the opposite direction to the one the authors report. And there is, of course, a middle ground: there may simply be no effect at all in the population (or one so small as to be not remotely important).
  3. It’s also worth pointing out that the two extreme cases that created this spurious result may in themselves be highly interesting observations. There may be important reasons why individual bats engage in unusually long or short bouts of cunnilingus (either pre- or post-intercourse), and finding out what factors affect this would be an interesting question for the authors to follow up.

References

Kosfeld, M., Heinrichs, M., Zak, P. J., Fischbacher, U., & Fehr, E. (2005). Oxytocin increases trust in humans. Nature, 435, 673–676.

Maruthupandian, J., & Marimuthu, G. (2013). Cunnilingus apparently increases duration of copulation in the Indian flying fox, Pteropus giganteus. PLoS ONE, 8(3), e59743.

Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing (3rd ed.). Burlington, MA: Elsevier.