Max or No Max?

There’s been a recent spat between the heavy metal bands Sepultura and Soulfly. For those unaware of the history, 50% of Sepulture used to be the Cavalera brothers (Max and Igor) until Max (the frontman and guitarist) left the band in 1996 and formed Soulfly. The full story is here. There’s a lot of bad blood even 20 years later, and according to a recent story on metal sucks, Soulfly’s manager (and Max’s wife) Gloria Cavalier recently posted a fairly pointed post on her Facebook page. This got picked up by my favourite podcast (the metal sucks podcast). What has this got to do with me, or statistics? Well, one of the presenters of the metal sucks podcasts asked me this over Twitter:

After a very brief comment about needing to operationalise ‘better’, I decided that rather than reading book proofs I’d do a tongue in cheek analysis of what is better max or no max and here it is.

First we need to operationalise ‘better’. I have done this by accepting subjective opinion as determining ‘better’ and specifically ratings of albums on (although I am English metal sucks is US based, so I thought I’d pander to them and take ratings from the US site). Our questions then becomes ‘is max or no max rated higher by the sorts of people who leave reviews on Amazon’. We have operationalised our questions and turned it into a scientific statement, which we can test with data. [There are all sorts of problems with using these ratings, not least of which is that they tend to be positively biased, and they likely reflect a certain type of person who reviews, often reviews reflect things other than the music (e.g., arrived quickly 5*), and so on … but fuck it, this is not serious science, just a bit of a laugh.]

Post Sepultura: Max or No Max

The first question is whether post-max Sepultura or Soulfly are rated higher. Figure 1 shows that the data are hideously skewed with people tending to give positive reviews and 4-5* ratings. Figure 2 shows the mean ratings by year post Max’s departure (note they released albums in different years so the dots are out of synch, but it’s a useful timeline). Figure 2 seems to suggest that after the first couple of albums, both bands are rated fairly similarly: the Soulfly line is higher but error bars overlap a lot for all but the first albums.

Figure 1: Histograms of all ratings for Soulfly and (Non-Max Era) Sepultura

Figure 2: Mean ratings of Soulfly and (Non-Max Era) Sepultura by year of album release

There are a lot of ways you could look at these data. The first thing is the skew. That messes up estimates of confidence intervals and significance tests … but our sample is likely big enough that we can rely on the central limit theorem to do its magic and let us assume that the sampling distribution is normal (beautifully explained in my new book!)

I’m going to fit three models. The first is an intercept only model (a baseline with no predictors), the second allows intercepts to vary across albums (which allows ratings to vary by album, which seems like a sensible thing to do because albums will vary in quality) the third predicts ratings from the band (Sepultura vs Soulfly).

  maxModel2a<-gls(Rating ~ 1, data = sepvssoul, method = "ML")
  maxModel2b<-lme(Rating ~ 1, random = ~1|Album, data = sepvssoul, method = "ML")
  maxModel2c<-update(maxModel2b, .~. + Band)
  anova(maxModel2a, maxModel2b, maxModel2c)

By comparing models we can see:

           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
maxModel2a     1  2 2889.412 2899.013 -1442.706                        
maxModel2b     2  3 2853.747 2868.148 -1423.873 1 vs 2 37.66536  <.0001
maxModel2c     3  4 2854.309 2873.510 -1423.155 2 vs 3  1.43806  0.2305

That album ratings varied very significantly (not surprising), the p-value is < .0001, but that band did not significantly predict ratings overall (p = .231). If you like you can look at the summary of the model by executing:

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: sepvssoul 
       AIC     BIC    logLik
  2854.309 2873.51 -1423.154

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:   0.2705842 1.166457

Fixed effects: Rating ~ Band 
               Value Std.Error  DF   t-value p-value
(Intercept) 4.078740 0.1311196 882 31.107015  0.0000
BandSoulfly 0.204237 0.1650047  14  1.237765  0.2362
BandSoulfly -0.795

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.9717684 -0.3367275  0.4998698  0.6230082  1.2686186 

Number of Observations: 898
Number of Groups: 16 

The difference in ratings between Sepultura and Soulfly was b = 0.20. Ratings for soulfully were higher, but not significantly so (if we allow ratings to vary over albums, if you take that random effect out you’ll get a very different picture because that variability will go into the fixed effect of ‘band’.).

Max or No Max

Just because this isn’t fun enough, we could also just look at whether either Sepultura (post 1996) or Soulfly can compete with the Max-era-Sepultura heyday.
I’m going to fit three models but this time including the early Sepultura albums (with max). The models are the same as before except that the fixed effect of band now has three levels: Sepultura Max, Sepultura no-Max and Soulfly:

  maxModela<-gls(Rating ~ 1, data = maxvsnomax, method = "ML")
  maxModelb<-lme(Rating ~ 1, random = ~1|Album, data = maxvsnomax, method = "ML")
  maxModelc<-update(maxModelb, .~. + Band)
  anova(maxModela, maxModelb, maxModelc)

By comparing models we can see:

          Model df      AIC      BIC    logLik   Test   L.Ratio p-value
maxModela     1  2 4686.930 4697.601 -2341.465                         
maxModelb     2  3 4583.966 4599.973 -2288.983 1 vs 2 104.96454  <.0001
maxModelc     3  5 4581.436 4608.114 -2285.718 2 vs 3   6.52947  0.0382

That album ratings varied very significantly (not surprising), the p-value is < .0001, and the band did significantly predict ratings overall (p = .038). If you like you can look at the summary of the model by executing:

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: maxvsnomax 
       AIC      BIC    logLik
  4581.436 4608.114 -2285.718

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:     0.25458 1.062036

Fixed effects: Rating ~ Band 
                         Value Std.Error   DF  t-value p-value
(Intercept)           4.545918 0.1136968 1512 39.98281  0.0000
BandSepultura No Max -0.465626 0.1669412   19 -2.78916  0.0117
BandSoulfly          -0.262609 0.1471749   19 -1.78433  0.0903
                     (Intr) BndSNM
BandSepultura No Max -0.681       
BandSoulfly          -0.773  0.526

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.3954974 -0.3147123  0.3708523  0.6268751  1.3987442 

Number of Observations: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 and significant at p = .012 (ratings for post-max Sepultura are significantly worse than for the Max-era Sepultura). The difference in ratings between Soulfly compared two Max-era Sepultura was b = -0.26 and not significant (p = .09) (ratings for Soulfly are not significantly worse than for the Max-era Sepultura). A couple of points here, p-values are silly, so don’t read too much into them, but the parameter (the bs) which quantifies the effect is a bit smaller for Soulfly.

Confidence Intervals

Interestingly if you write yourself a little bootstrap routine to get some robust confidence intervals around the parameters:

    boot.lme <- function(data, indices){
    data <- data[indices,] # select obs. in bootstrap sample
    model <- lme(Rating ~ Band, random = ~1|Album, data = data, method = "ML")
    fixef(model) # return coefficient vector

maxModel.boot<-boot(maxvsnomax, boot.lme, 1000)
  maxModel.boot, index = 1, type = “perc”), index = 2, type = “perc”), index = 3, type = “perc”)

Then you find these confidence intervals for the three betas (intercept, Post-Max Sepultura vs. Max Era-Sepultura, Soulfly vs. Max-Era-Sepultura):

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 1)

Intervals : 
Level     Percentile     
95%   ( 4.468,  4.620 )  
Calculations and Intervals on Original Scale

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 2)

Intervals : 
Level     Percentile     
95%   (-0.6153, -0.3100 )  
Calculations and Intervals on Original Scale

Based on 1000 bootstrap replicates

CALL : = maxModel.boot, type = “perc”, index = 3)

Intervals : 
Level     Percentile     
95%   (-0.3861, -0.1503 )  
Calculations and Intervals on Original Scalens: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 [-0.62, -0.31]. The difference in ratings between Soulfly compared to Max-era Sepultura was b = -0.26 [-0.39, -0.15]. This suggests  that both soulfully and post-Max Sepultura yield negative parameters that reflect (to the degree that you believe that a confidence interval tells you about the population parameter ….) a negative effect in the population. In other words, both bands are rated worse than Max-era Sepultura.


Look, this is just a bit of fun and an excuse to show you how to use a bootstrap on a multilevel model, and how you can use data to try to answer pointless questions thrown at you on Twitter. Based on this hastily thrown together analysis that makes a lot of assumptions about a lot of things, my 120 character twitter response will be: Sepultura Max better than everything, but post 1996 Max is no better than No Max;-)

R Code


You can generate the data using this R code (data correct as of today):
morbid<-c(rep(1,2), rep(2, 4), rep(3, 3), rep(4, 8), rep(5, 36))
  Schizo<-c(rep(1,1), rep(2, 2), rep(3, 4), rep(4, 10), rep(5, 33))
  remains<-c(2, rep(3, 5), rep(4, 9), rep(5, 104))
  Arise<-c(rep(2, 2), rep(4, 16), rep(5, 89))
  Chaos<-c(rep(1,4), rep(2, 2), rep(3, 9), rep(4, 20), rep(5, 120))
  Roots<-c(rep(1,9), rep(2, 8), rep(3, 17), rep(4, 24), rep(5, 94))
  Against<-c(rep(1,16), rep(2, 14), rep(3, 11), rep(4, 20), rep(5, 32))
  Nation<-c(rep(1,3), rep(2, 7), rep(3, 6), rep(4, 22), rep(5, 19))
  Roorback<-c(rep(1,6), rep(2, 6), rep(3, 5), rep(4, 13), rep(5, 20))
  Dante<-c(rep(1,1), rep(2, 3), rep(3, 4), rep(4, 8), rep(5, 30))
  Alex<-c(rep(1,1), rep(2, 1), rep(3, 3), rep(4, 6), rep(5, 18))
  Kairos<-c(rep(1,3), rep(2, 2), rep(3, 2), rep(4, 6), rep(5, 33))
  Mediator<- c(rep(1,0), rep(2, 3), rep(3, 4), rep(4, 6), rep(5, 21))
  morbid<-data.frame(rep("Morbid", length(morbid)), rep(1986, length(morbid)), morbid)
  Schizo<-data.frame(rep("Schizo", length(Schizo)), rep(1987, length(Schizo)), Schizo)
  Remains<-data.frame(rep("remains", length(remains)), rep(1989, length(remains)), remains)
  Arise<-data.frame(rep("Arise", length(Arise)), rep(1991, length(Arise)), Arise)
  Chaos<-data.frame(rep("Chaos", length(Chaos)), rep(1993, length(Chaos)), Chaos)
  Roots<-data.frame(rep("Roots", length(Roots)), rep(1996, length(Roots)), Roots)
  Against<-data.frame(rep("Against", length(Against)), rep(1998, length(Against)), Against)
  Nation<-data.frame(rep("Nation", length(Nation)), rep(2001, length(Nation)), Nation)
  Roorback<-data.frame(rep("Roorback", length(Roorback)), rep(2003, length(Roorback)), Roorback)
  Dante<-data.frame(rep("Dante", length(Dante)), rep(2006, length(Dante)), Dante)
  Alex<-data.frame(rep("Alex", length(Alex)), rep(2009, length(Alex)), Alex)
  Kairos<-data.frame(rep("Kairos", length(Kairos)), rep(2011, length(Kairos)), Kairos)
  Mediator<-data.frame(rep("Mediator", length(Mediator)), rep(2013, length(Mediator)), Mediator)
  names(morbid)<-c("Album", "Year", "Rating")
  names(Schizo)<-c("Album", "Year", "Rating")
  names(Remains)<-c("Album", "Year", "Rating")
  names(Arise)<-c("Album", "Year", "Rating")
  names(Chaos)<-c("Album", "Year", "Rating")
  names(Roots)<-c("Album", "Year", "Rating")
  names(Against)<-c("Album", "Year", "Rating")
  names(Nation)<-c("Album", "Year", "Rating")
  names(Dante)<-c("Album", "Year", "Rating")
  names(Alex)<-c("Album", "Year", "Rating")
  names(Kairos)<-c("Album", "Year", "Rating")
  names(Mediator)<-c("Album", "Year", "Rating")

  SepMax<-rbind(morbid, Schizo, Remains, Arise, Chaos, Roots)
  SepMax$Band<-"Sepultura Max"
  SepNoMax<-rbind(Against, Nation, Dante, Alex, Kairos, Mediator)
  SepNoMax$Band<-"Sepultura No Max"
  SepNoMax$Max<-"No Max"
  soulfly<-c(rep(1,8), rep(2, 9), rep(3, 4), rep(4, 16), rep(5, 89))
  primitive<-c(rep(1,11), rep(2, 5), rep(3, 5), rep(4, 19), rep(5, 53))
  three<-c(rep(1,1), rep(2, 10), rep(3, 12), rep(4, 7), rep(5, 19))
  prophecy<-c(rep(1,2), rep(2, 5), rep(3, 5), rep(4, 25), rep(5, 42))
  darkages<-c(rep(1,1), rep(2, 1), rep(3, 5), rep(4, 18), rep(5, 36))
  conquer<-c(rep(1,1), rep(2, 0), rep(3, 5), rep(4, 5), rep(5, 31))
  omen<-c(rep(1,0), rep(2, 2), rep(3, 1), rep(4, 6), rep(5, 17))
  enslaved<-c(rep(1,1), rep(2,1), rep(3, 4), rep(4, 2), rep(5, 30))
  savages<-c(rep(1,0), rep(2, 2), rep(3, 3), rep(4, 10), rep(5, 27))
  archangel<-c(rep(1,3), rep(2, 2), rep(3, 4), rep(4, 7), rep(5, 21))

  soulfly<-data.frame(rep("Soulfly", length(soulfly)), rep(1998, length(soulfly)), soulfly)
  primitive<-data.frame(rep("Primitive", length(primitive)), rep(2000, length(primitive)), primitive)
  three<-data.frame(rep("Three", length(three)), rep(2002, length(three)), three)
  prophecy<-data.frame(rep("Prophecy", length(prophecy)), rep(2004, length(prophecy)), prophecy)
  darkages<-data.frame(rep("Darkages", length(darkages)), rep(2005, length(darkages)), darkages)
  conquer<-data.frame(rep("Conquer", length(conquer)), rep(2008, length(conquer)), conquer)
  omen<-data.frame(rep("Omen", length(omen)), rep(2010, length(omen)), omen)
  enslaved<-data.frame(rep("Enslaved", length(enslaved)), rep(2012, length(enslaved)), enslaved)
  savages<-data.frame(rep("Savages", length(savages)), rep(2013, length(savages)), savages)
  archangel<-data.frame(rep("Archangel", length(archangel)), rep(2015, length(archangel)), archangel)
  names(soulfly)<-c("Album", "Year", "Rating")
  names(primitive)<-c("Album", "Year", "Rating")
  names(three)<-c("Album", "Year", "Rating")
  names(prophecy)<-c("Album", "Year", "Rating")
  names(darkages)<-c("Album", "Year", "Rating")
  names(conquer)<-c("Album", "Year", "Rating")
  names(omen)<-c("Album", "Year", "Rating")
  names(enslaved)<-c("Album", "Year", "Rating")
  names(savages)<-c("Album", "Year", "Rating")
  names(archangel)<-c("Album", "Year", "Rating")

  Soulfly<-rbind(soulfly, primitive, three, prophecy, darkages, conquer, omen, enslaved, savages, archangel)
  maxvsnomax<-rbind(SepMax, SepNoMax, Soulfly)
  sepvssoul<-subset(maxvsnomax, Band != "Sepultura Max")

Graphs in this post


m <- ggplot(sepvssoul, aes(x = Rating, fill = Band, colour = Band)) 
  m + geom_histogram(binwidth = 1, alpha = .2, position=”identity”) + coord_cartesian(xlim = c(0, 6)) + scale_y_continuous(breaks = seq(0, 400, 50)) + labs(x = “Rating”, y = “Frequency”, colour = “Band”, fill = “Band”) + theme_bw()
  line <- ggplot(sepvssoul,  aes(Year, Rating, colour = Band))
  line + stat_summary(fun.y = mean, geom = “point”) + stat_summary( = mean_cl_boot, geom = “errorbar”, width = 0.2) + labs(x = “Year”, y = “Mean Rating on”) + coord_cartesian(ylim = c(1, 5)) + scale_x_continuous(breaks=seq(1998, 2015, 1))+ theme_bw() + stat_summary(fun.y = mean, geom = “line”, aes(group=Band))

Bat Cunnilingus and Spurious Results

Thanks to my brother I became aware of this article about bat cunnilingus today. My brother, knowing how entertained I had been by a bat felating itself at Disney on my honeymoon, decided that this would be just the sort of article that I’d like. Knowing the kind of examples I put in my textbooks I suspect his email is the first of many. Anyway, this article suggests that male bats engage in cunnilingus with females. Female bats, that is. They have videos to prove it. Male bats like cunnilingus so much that they don’t just use it as foreplay, but they get stuck in afterwards as well. The article has two main findings:
  1. The more that males (bats) engage in cunnilingus, the longer the duration of intercourse, r = .91 (see their Figure1). They tested this with a Pearson r, based on 8 observations (although each observation was the average of several data points I think – it’s not entirely clear how the authors came about these averages). They conclude that this behaviour may be due to sperm competition in that males sort of ‘clean out’ the area before they procreate. They argue that the longer intercourse duration supports this argument because longer duration = greater chance of paternity.
  2.  The more that males engage in pre-intercourse cunnilingus, the less they engage in post-intercourse cunnilingus, r = .79 (see their Figure2). They again tested this with a Pearson r, based on 8 observations. They conclude “… the negative relationship between the duration of pre- and post-copulatory cunnilingus suggests the possibility of sperm competition. Thus, the male removes the sperm of competitors but apparently not its own – lick for longer before but less after copulation, if paternity uncertainty is high.”

There are several leaps of faith in their conclusions. Does longer duration necessarily imply greater chance of paternity? Also, if sperm competition is at the root of all of this then I’m not sure post coital cunilingus is really ever an adaptive strategy: consuming your own sperm hardly seems like the best method of impregnating your mate. Yes, I’m still talking about bats here, what you choose to do in your own free time is none of my business.

Enough about Cunnilingus, what about Statistics?

Anyway, I want to make a statistical point, not an evolutionary one. It struck me when looking at this article that finding 2 is probably spurious. Figure 2 looks very much like two outliers are making a relationship out of nothing. With only 8 observations they are likely to have a large impact. With 8 observations it is, of course, hard to say what’s an outlier and what isn’t; however, the figure was like a bat tongue to my brain: my juices were flowing. Fortunately, with so few observations we can approximate the data from the two figures. I emphasises that these are quick estimates of values based on the graphs and a ruler. We have three variables estimated from the paper: pre (pre-intercourse cunnilingus), post (post-intercourse cunnilingus), duration (duration of intercourse).
We’re going to use R, if you don’t know how to use it then some imbecile called AndyField has written a book on it that you could look at. We can input these approximate scores as:
pre<-c(46, 50, 51, 52, 53, 53.5, 55, 59)
post<-c(162, 140, 150, 140, 142, 138, 152, 122)
duration<-c(14.4, 14.4, 14.9, 15.5, 15.4, 15.2, 15.3, 16.4)
bat<-data.frame(pre, post, duration)
Ok, so that’s the three variables entered, and I’ve made a data frame from them called bat.
Let’s load some packages that will be useful (if you don’t have them installed then install them) and source Rand Wilcox’s very helpful robust functions:
Let’s draw some rough replicas of Figures 1 and 2 from the paper:
qplot(pre, post, date = bat, ylim = c(85, 200), xlim = c(44, 60)) + stat_smooth(method = “lm”)
qplot(pre, duration, date = bat, ylim = c(13, 18), xlim = c(44, 60))+ stat_smooth(method = “lm”)

Figure 1: Quick replica of Maruthupandian & Marimuthu (2013) Figure 1
Figure 1: Quick replica of Maruthupandian & Marimuthu (2013) Figure 2
As you can see the figures – more or less – replicate their data. OK, let’s replicate their correlation coefficient estimates by executing:
rcorr(as.matrix(bat), type = “pearson”)
gives us
           pre  post duration
pre       1.00 -0.78     0.91
post     -0.78  1.00    -0.75
duration  0.91 -0.75     1.00
n= 8

         pre    post   duration
pre             0.0225 0.0018 
post     0.0225        0.0308 
duration 0.0018 0.0308
So, we replicate the pre-intercourse cunnilingus (Pre) – duration correlation of .91 exactly (p= .0018), and we more or less get the right value for the pre-post correlation: we get r = .78 (p = .0225) instead of -.79, but we’re within rounding error. So, it looks like our estimates of the raw data from the Figures in the article aren’t far off the real values at all. Just goes to show how useful graphs in papers can be for extracting the raw data.
If I’m right and the data in Figure 2 are affected by the first and last case, we’d expect something different to happen if we calculate the correlation based on ranked scores (i.e. Spearman’s rho). Let’s try it:
rcorr(as.matrix(bat), type = “spearman”)
           pre  post duration
pre       1.00 -0.50     0.78
post     -0.50  1.00    -0.51
duration  0.78 -0.51     1.00
n= 8
         pre    post   duration
pre             0.2039 0.0229 
post     0.2039        0.1945 
duration 0.0229 0.1945 
The Pre-duration correlation has dropped from .91 to .78 (p = .0229) but it’s still very strong. However, the pre-post correlation has changed fairly dramatically from -.79 to r = .50 (p = .204) and is now not significant (although I wouldn’t want to read to much into that with an n = 8).
What about if we try a robust correlation such as the percentile bend correlation described by <!–[if supportFields]> ADDIN EN.CITE Wilcox2012684Wilcox (2012)6846846Wilcox, R. R.Introduction to robust estimation and hypothesis testing3rd2012Burlington, MAElsevier<![endif]–>Wilcox (2012)<!–[if supportFields]><![endif]–>? Having sourced Wilcox’s functions we can do this easily enough by executing:
pbcor(bat$pre, bat$post)
pbcor(bat$pre, bat$duration)
For the pre post correlation we get:
[1] -0.3686576
[1] -0.9714467
[1] 0.368843
[1] 8
It has changed even more dramatically from -.79 to r = .37 (p = .369) and is now not significant (again I wouldn’t want to read to much into that with an n = 8, but it is now a much smaller effect than they’re reporting in the paper).
The pre-duration correlation fares much better:
[1] 0.852195
[1] 3.989575
[1] 0.007204076
[1] 8
This is still very strong and significant, r = .85 (p = .007). We could also try to estimate the population value of these two relationships by computing a robust confidence interval using bootstrapping. We’ll stick with Pearson r seeing as that’s the estimate that they used in the paper but you could try this with the other measures if you like:
bootr<-function(data, i){
cor(data[i, 1],data[i, 2])
bootprepost<-boot(bat[1:2], bootr, 2000)
bootpreduration<-boot(bat[c(1, 3)], bootr, 2000)
If we look at the BCa intervals, we get this for the pre-post correlation:
Level     Percentile            BCa         
95%   (-0.9903,  0.5832 )   (-0.9909,  0.5670 )
and this for the pre-duration correlation:
Level     Percentile            BCa         
95%   ( 0.6042,  0.9843 )   ( 0.5022,  0.9773 )
In essence then, the true relationship between pre- and post-intercourse cunnilingus duration lies somewhere between about –1 and 0.6. In other words, it could be anything really, including zero. For the relationship between pre-intercourse bat cunnilingus and intercourse duration the true relationship is somewhere between r= .5 and .98. In other words, it’s likely to be – at the very worst – a strong and positive association.


What does this tell us? Well, I’m primarily interested in using this as a demonstration of why it’s important to look at your data and to employ robust tests. I’m not trying to knock the research. Of course, in the process I am, inevitably, sort of knocking it. So, I think there are three messages here:
  1. Their first finding that the more that males (bats) engage in cunnilingus, the longer the duration of intercourse is very solid. The population value could be less than the estimate they report, but it could very well be that large too. At the very worst it is still a strong association.
  2. The less good news is that finding 2 (that the more that males engage in pre-intercourse cunnilingus, the less they engage in post-intercourse cunnilingus) is probably wrong. The best case scenario is that the finding is real but the authors have overestimated it. The worst case scenario is that the population effect is in the opposite direction to what the authors report, and there is, of course, a mid ground that there is simply no effect at all in the population (or one so small as to be not remotely important).
  3. It’s also worth pointing out that the two extreme cases that have created this spurious result may in themselves be highly interesting observations. There may be important reasons why individual bats engage in unusually long or short bouts of cunnilingus (either pre or post intercourse) and finding out what factors affect this will be an interesting thing that the authors should follow up.


Bias in End of Year Album Polls?

So, in rolls 2012 and out rolls another year. I like new year: it’s a time to fill yourself with optimism about the exciting things that you will do. Will this be the year that I write something more interesting than a statistics book, for example? It’s also a time of year to reflect upon all of the things that you thought you’d do last year but didn’t. That’s a bit depressing, but luckily 2011 was yesterday and today is a new year and a new set of hopes that have yet to fail to come to fruition.
It’s also the time of year that magazines publish their end-of-year polls. Two magazines that I regularly read are Metal Hammer and Classic Rock (because, in case isn’t obvious from my metal-themed website and podcasts, I’m pretty fond of heavy metal and classic rock). The ‘album of the year’ polls in these magazines are an end of year treat for me: it’s an opportunity to see what highly rated albums I overlooked, to wonder at how an album that I hate has turned up in the top 5, or to pull a bemused expression at how my favourite album hasn’t made it into the top 20. At my age, it’s good to get annoyed about pointless things.
Anyway, for years I have had the feeling that these end of year polls are biased. I don’t mean in any nefarious way, but simply that reviewers who contribute to these polls tend to rate recently released albums more highly than ones released earlier in the year. Primacy and recency effects are well established in cognitive psychology: if you ask people to remember a list of things, they find it easier to recall items at the start or end of the list. Music journalists are (mostly) human so it’s only reasonable that reviewers will succumb to these effects, isn’t it?
I decided to actually take some time off this winter Solstice, and what happens when you take time off? In my case, you get bored and start to wonder whether you can test your theory that end of year polls are biased. The next thing you know, you’re creating a spreadsheet with Metal Hammer and Classic Rock’s end of year polls in it, then you’re on Wikipedia looking up other useful information about these albums, then, when you should be watching the annual re-run of Mary Poppins you find that you’re getting R to do a nonparametric bootstrap of a mediation analysis. The festive period has never been so much fun.
Anyway, I took the lists of top 20 albums from both Metal Hammer and Classic Rock magazine. I noted down their position in the poll (1 = best album of the year, 20 = 20th best album of the year), I also found out what month each album was released. From this information I could calculate how many months before the poll the album came out (0 = album came out the same month as the poll, 12 = the album came out 12 months before the poll). I called this variable Time.since.release.
My theory implies that an album’s position in the end of year list (position) will be predicted from how long before the poll the album was released. A recency effect would mean that albums released close to the end of the year (i.e. low score onTime.since.release) will be higher up the end of year poll (remember that the lower the score, the higher up the poll the album is). So, we predict a positive relationship between position and Time.since.release.
Let’s look at the scatterplot:
Both magazines show a positive relationship: albums higher up the poll (i.e. low score on position) tend to have been released more recently (i.e., low score on the number of months ago that the album was released). This effect is much more pronounced though for Metal Hammer than for Classic Rock.
To quantify the relationship between position in the poll and time since the album was released we can look at a simple correlation coefficient. Our position data are a rank, not interval/ratio, so it makes sense to use Spearman’s rho, and we have a fairly small sample so it makes sense to bootstrap the confidence interval. For Metal Hammer we get (note that because of the bootstrapping you’ll get different results if you try to replicate this) rho = .428 (bias = –0.02, SE = 0.19) with a 95% bias corrected and accelerated confidence interval that does not cross zero (BCa 95% = .0092, .7466). The confidence interval is pretty wide, but tells us that the correlation in the population is unlikely to be zero (in other words, the true relationship between position in the poll and time since release is likely to be more than no relationship at all). Also, because rho is an effect size we can interpret its size directly, and .428 is a fairly large effect. In other words, Metal Hammer reviewers tend to rank recent albums higher than albums released a long time before the poll. They show a relatively large recency effect.
What about Classic Rock? rho = .038 (bias = –0.001, SE = 0.236) with a BCa 95% CI = –.3921, .5129). The confidence interval is again wide, but this time crosses zero (in fact, zero is more or less in the middle of it). This CI tells us that the true relationship between position in the poll and time since release is could be zero, in other words no relationship at all. We can again interpret the rho directly, and .038 is a very small (it’s close to zero). In other words, Classic Rock reviewers do not tend to rank recent albums higher than albums released a long time before the poll. They show virtually no recency effect. This difference is interesting (especially given there is overlap between contributors to the two magazines!).
It then occurred to me, because I spend far too much time thinking about this sort of thing, that perhaps it’s simply the case that better albums come out later in the year and this explains why Metal Hammer reviewers rate them higher. ‘Better’ is far too subjective a variable to quantify, however, it might be reasonable to assume that bands that have been around for a long time will produce good material (not always true, as Metallica’s recent turd floating in the loo-loo demonstrates). Indeed, Metal Hammer’s top 10 contained Megadeth, Anthrax, Opeth, and Machine Head: all bands who have been around for 15-20 years or more. So, in the interests of fairness to Metal Hammer reviewers, let’s see whether ‘experience’ mediates the relationship between position in the poll and time since the album’s release. Another 30 minutes on the internet and I had collated the number of studio albums produced by each band in the top 20. The number of studio albums seems like a reasonable proxy for experience (and better than years as a band, because some bands produce an album every 8 years and others every 2). So, I did a mediation analysis with a nonparametric bootstrap (thanks to the mediation package in R). The indirect effect was 0.069 with a 95% CI = –0.275, 0.537. The proportion of the effect explained by mediation was about 1%. In other words, the recency bias in the Metal Hammer end of year poll could not be explained by the number of albums that bands in the poll had produced in the past (i.e. experience). Basically, despite my best efforts to give them the benefit of the doubt, Metal Hammer critics are actually biased towards giving high ratings to more recently released albums.
These results might imply many things:
  • Classic Rock reviewers are more objective when creating their end of year polls (because they over-ride the natural tendency to remember more recent things, like albums).
  • Classic Rock reviewers are not human because they don’t show the recency effects that you expect to find in normal human beings. (An interesting possibility, but we need more data to test it …)
  • Metal Hammer should have a ‘let’s listen to albums released before June’ party each November to remind their critics of albums released earlier in the year. (Can I come please and have dome free beer?)
  • Metal Hammer should inversely weight reviewer’s ranks by the time since release so that albums released earlier in the year get weighted more heavily than recently released albums. (Obviously, I’m offering my services here …)
  • I should get a life.

Ok, that’s it. I’m sure this is of no interest to anyone other than me, but it does at least show how you can use statistics to answer pointless questions. A bit like what most of us scientists do for a large portion of our careers. Oh, and if I don’t get a ‘Spirit of the Hammer’ award for 15 years worth of infecting students of statistics with my heavy metal musings then there is no justice ion the world. British Psychological Society awards and National Teaching Fellowships are all very well, but I need a spirit of the hammer on my CV (or at least Defender of the Faith).

Have a rockin’ 2012
P.P.S. My top 11 (just to be different) albums of 2011 (the exact order is a bit rushed):
  1. Opeth: Heritage
  2. Wolves in the Throne Room: Celestial Lineage
  3. Anthrax: Worship Music
  4. Von Hertzen Brothers: Stars Aligned
  5. Liturgy: Split LP (although the Oval side is shit)
  6. Ancient Ascendent: The Grim Awakening
  7. Graveyard: Hisingen Blues
  8. Foo Fighters: Wasting Light
  9. Status Quo: Quid Pro Quo
  10. Manowar: Battle Hymns MMXI
  11. Mastodon: The Hunter