Max or No Max?

There’s been a recent spat between the heavy metal bands Sepultura and Soulfly. For those unaware of the history, 50% of Sepultura used to be the Cavalera brothers (Max and Igor) until Max (the frontman and guitarist) left the band in 1996 and formed Soulfly. The full story is here. There’s a lot of bad blood even 20 years later, and according to a recent story on metal sucks, Soulfly’s manager (and Max’s wife) Gloria Cavalera recently posted a fairly pointed post on her Facebook page. This got picked up by my favourite podcast (the metal sucks podcast). What has this got to do with me, or statistics? Well, one of the presenters of the metal sucks podcast asked me this over Twitter:

After a very brief comment about needing to operationalise ‘better’, I decided that rather than reading book proofs I’d do a tongue-in-cheek analysis of what is better, Max or no Max, and here it is.

First we need to operationalise ‘better’. I have done this by accepting subjective opinion as determining ‘better’, and specifically ratings of albums on amazon.com (although I am English, metal sucks is US based, so I thought I’d pander to them and take ratings from the US site). Our question then becomes ‘is Max or no Max rated higher by the sorts of people who leave reviews on Amazon?’. We have operationalised our question and turned it into a scientific statement, which we can test with data. [There are all sorts of problems with using these ratings, not least of which is that they tend to be positively biased, they likely reflect a certain type of person who reviews, and reviews often reflect things other than the music (e.g., arrived quickly 5*), and so on … but fuck it, this is not serious science, just a bit of a laugh.]

Post Sepultura: Max or No Max

The first question is whether post-max Sepultura or Soulfly are rated higher. Figure 1 shows that the data are hideously skewed with people tending to give positive reviews and 4-5* ratings. Figure 2 shows the mean ratings by year post Max’s departure (note they released albums in different years so the dots are out of synch, but it’s a useful timeline). Figure 2 seems to suggest that after the first couple of albums, both bands are rated fairly similarly: the Soulfly line is higher but error bars overlap a lot for all but the first albums.

Figure 1: Histograms of all ratings for Soulfly and (Non-Max Era) Sepultura

Figure 2: Mean ratings of Soulfly and (Non-Max Era) Sepultura by year of album release

There are a lot of ways you could look at these data. The first thing is the skew. That messes up estimates of confidence intervals and significance tests … but our sample is likely big enough that we can rely on the central limit theorem to do its magic and let us assume that the sampling distribution is normal (beautifully explained in my new book!)
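
If you want to put a number on that skew rather than just eyeball the histograms, a rough check in base R looks like this (skewness() here is a quick function I’ve defined for the purpose, not something from a package):

  # Rough skewness of the ratings for each band (negative = ratings piled up at the top end)
  skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
  tapply(sepvssoul$Rating, sepvssoul$Band, skewness)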

I’m going to fit three models. The first is an intercept-only model (a baseline with no predictors), the second allows intercepts to vary across albums (which allows ratings to vary by album, which seems like a sensible thing to do because albums will vary in quality), and the third predicts ratings from the band (Sepultura vs. Soulfly).

  library(nlme)

  maxModel2a <- gls(Rating ~ 1, data = sepvssoul, method = "ML")                     # intercept-only baseline
  maxModel2b <- lme(Rating ~ 1, random = ~1|Album, data = sepvssoul, method = "ML")  # + random intercepts over albums
  maxModel2c <- update(maxModel2b, .~. + Band)                                       # + fixed effect of band
  anova(maxModel2a, maxModel2b, maxModel2c)

By comparing models we can see:

           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
maxModel2a     1  2 2889.412 2899.013 -1442.706                        
maxModel2b     2  3 2853.747 2868.148 -1423.873 1 vs 2 37.66536  <.0001
maxModel2c     3  4 2854.309 2873.510 -1423.155 2 vs 3  1.43806  0.2305

That album ratings varied very significantly (not surprising), the p-value is < .0001, but that band did not significantly predict ratings overall (p = .231). If you like you can look at the summary of the model by executing:
  
  summary(maxModel2c)

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: sepvssoul 
       AIC     BIC    logLik
  2854.309 2873.51 -1423.154

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:   0.2705842 1.166457

Fixed effects: Rating ~ Band 
               Value Std.Error  DF   t-value p-value
(Intercept) 4.078740 0.1311196 882 31.107015  0.0000
BandSoulfly 0.204237 0.1650047  14  1.237765  0.2362
 Correlation: 
            (Intr)
BandSoulfly -0.795

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.9717684 -0.3367275  0.4998698  0.6230082  1.2686186 

Number of Observations: 898
Number of Groups: 16 

The difference in ratings between Sepultura and Soulfly was b = 0.20. Ratings for Soulfly were higher, but not significantly so (if we allow ratings to vary over albums; if you take that random effect out you’ll get a very different picture because that variability will go into the fixed effect of ‘band’).
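
If you want to see that for yourself, here’s a quick naive model with no random effect at all (naiveModel is just a name I’ve made up, and because it ignores the dependency of ratings within albums it’s for illustration only):

  # Naive model ignoring albums: album-to-album variability gets absorbed into 'band'
  naiveModel <- gls(Rating ~ Band, data = sepvssoul, method = "ML")
  summary(naiveModel)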

Max or No Max

Just because this isn’t fun enough, we could also just look at whether either Sepultura (post 1996) or Soulfly can compete with the Max-era-Sepultura heyday.
I’m going to fit three models but this time including the early Sepultura albums (with Max). The models are the same as before except that the fixed effect of band now has three levels: Sepultura Max, Sepultura No Max and Soulfly:

  maxModela<-gls(Rating ~ 1, data = maxvsnomax, method = "ML")
  maxModelb<-lme(Rating ~ 1, random = ~1|Album, data = maxvsnomax, method = "ML")
  maxModelc<-update(maxModelb, .~. + Band)
  anova(maxModela, maxModelb, maxModelc)

By comparing models we can see:

          Model df      AIC      BIC    logLik   Test   L.Ratio p-value
maxModela     1  2 4686.930 4697.601 -2341.465                         
maxModelb     2  3 4583.966 4599.973 -2288.983 1 vs 2 104.96454  <.0001
maxModelc     3  5 4581.436 4608.114 -2285.718 2 vs 3   6.52947  0.0382

That album ratings varied very significantly (not surprising), the p-value is < .0001, and the band did significantly predict ratings overall (p = .038). If you like you can look at the summary of the model by executing:
  
  summary(maxModelc)

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: maxvsnomax 
       AIC      BIC    logLik
  4581.436 4608.114 -2285.718

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:     0.25458 1.062036

Fixed effects: Rating ~ Band 
                         Value Std.Error   DF  t-value p-value
(Intercept)           4.545918 0.1136968 1512 39.98281  0.0000
BandSepultura No Max -0.465626 0.1669412   19 -2.78916  0.0117
BandSoulfly          -0.262609 0.1471749   19 -1.78433  0.0903
 Correlation: 
                     (Intr) BndSNM
BandSepultura No Max -0.681       
BandSoulfly          -0.773  0.526

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.3954974 -0.3147123  0.3708523  0.6268751  1.3987442 

Number of Observations: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 and significant at p = .012 (ratings for post-Max Sepultura are significantly worse than for Max-era Sepultura). The difference in ratings between Soulfly and Max-era Sepultura was b = -0.26 and not significant (p = .09) (ratings for Soulfly are not significantly worse than for Max-era Sepultura). A couple of points here: p-values are silly, so don’t read too much into them, but the parameter (the b), which quantifies the effect, is a bit smaller for Soulfly.
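
If you want confidence intervals for those bs straight from the model (normal-theory ones, so bear in mind the skew mentioned earlier), nlme’s intervals() function will give you them before we do anything more robust in the next section:

  # 95% confidence intervals for the fixed effects of the Max vs no-Max model
  intervals(maxModelc, which = "fixed")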

Confidence Intervals

Interestingly, if you write yourself a little bootstrap routine to get some robust confidence intervals around the parameters:

  library(boot)

  boot.lme <- function(data, indices){
    data <- data[indices,]  # select observations in the bootstrap sample
    model <- lme(Rating ~ Band, random = ~1|Album, data = data, method = "ML")
    fixef(model)            # return the vector of fixed-effect coefficients
  }

  maxModel.boot <- boot(maxvsnomax, boot.lme, 1000)
  maxModel.boot
  boot.ci(maxModel.boot, index = 1, type = "perc")
  boot.ci(maxModel.boot, index = 2, type = "perc")
  boot.ci(maxModel.boot, index = 3, type = "perc")

Then you find these confidence intervals for the three betas (intercept, Post-Max Sepultura vs. Max Era-Sepultura, Soulfly vs. Max-Era-Sepultura):

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 4.468,  4.620 )  
Calculations and Intervals on Original Scale

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   (-0.6153, -0.3100 )  
Calculations and Intervals on Original Scale

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 3)

Intervals : 
Level     Percentile     
95%   (-0.3861, -0.1503 )  
Calculations and Intervals on Original Scale

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 [-0.62, -0.31]. The difference in ratings between Soulfly compared to Max-era Sepultura was b = -0.26 [-0.39, -0.15]. This suggests that both Soulfly and post-Max Sepultura yield negative parameters that reflect (to the degree that you believe that a confidence interval tells you about the population parameter ….) a negative effect in the population. In other words, both bands are rated worse than Max-era Sepultura.

Summary

Look, this is just a bit of fun and an excuse to show you how to use a bootstrap on a multilevel model, and how you can use data to try to answer pointless questions thrown at you on Twitter. Based on this hastily thrown together analysis that makes a lot of assumptions about a lot of things, my 120-character Twitter response will be: Sepultura Max better than everything, but post-1996 Max is no better than No Max ;-)

R Code

Data

You can generate the data using this R code (data correct as of today):
morbid<-c(rep(1,2), rep(2, 4), rep(3, 3), rep(4, 8), rep(5, 36))
  Schizo<-c(rep(1,1), rep(2, 2), rep(3, 4), rep(4, 10), rep(5, 33))
  remains<-c(2, rep(3, 5), rep(4, 9), rep(5, 104))
  Arise<-c(rep(2, 2), rep(4, 16), rep(5, 89))
  Chaos<-c(rep(1,4), rep(2, 2), rep(3, 9), rep(4, 20), rep(5, 120))
  Roots<-c(rep(1,9), rep(2, 8), rep(3, 17), rep(4, 24), rep(5, 94))
  Against<-c(rep(1,16), rep(2, 14), rep(3, 11), rep(4, 20), rep(5, 32))
  Nation<-c(rep(1,3), rep(2, 7), rep(3, 6), rep(4, 22), rep(5, 19))
  Roorback<-c(rep(1,6), rep(2, 6), rep(3, 5), rep(4, 13), rep(5, 20))
  Dante<-c(rep(1,1), rep(2, 3), rep(3, 4), rep(4, 8), rep(5, 30))
  Alex<-c(rep(1,1), rep(2, 1), rep(3, 3), rep(4, 6), rep(5, 18))
  Kairos<-c(rep(1,3), rep(2, 2), rep(3, 2), rep(4, 6), rep(5, 33))
  Mediator<- c(rep(1,0), rep(2, 3), rep(3, 4), rep(4, 6), rep(5, 21))
  
  
  morbid<-data.frame(rep("Morbid", length(morbid)), rep(1986, length(morbid)), morbid)
  Schizo<-data.frame(rep("Schizo", length(Schizo)), rep(1987, length(Schizo)), Schizo)
  Remains<-data.frame(rep("remains", length(remains)), rep(1989, length(remains)), remains)
  Arise<-data.frame(rep("Arise", length(Arise)), rep(1991, length(Arise)), Arise)
  Chaos<-data.frame(rep("Chaos", length(Chaos)), rep(1993, length(Chaos)), Chaos)
  Roots<-data.frame(rep("Roots", length(Roots)), rep(1996, length(Roots)), Roots)
  Against<-data.frame(rep("Against", length(Against)), rep(1998, length(Against)), Against)
  Nation<-data.frame(rep("Nation", length(Nation)), rep(2001, length(Nation)), Nation)
  Roorback<-data.frame(rep("Roorback", length(Roorback)), rep(2003, length(Roorback)), Roorback)
  Dante<-data.frame(rep("Dante", length(Dante)), rep(2006, length(Dante)), Dante)
  Alex<-data.frame(rep("Alex", length(Alex)), rep(2009, length(Alex)), Alex)
  Kairos<-data.frame(rep("Kairos", length(Kairos)), rep(2011, length(Kairos)), Kairos)
  Mediator<-data.frame(rep("Mediator", length(Mediator)), rep(2013, length(Mediator)), Mediator)
  
  
  names(morbid)<-c("Album", "Year", "Rating")
  names(Schizo)<-c("Album", "Year", "Rating")
  names(Remains)<-c("Album", "Year", "Rating")
  names(Arise)<-c("Album", "Year", "Rating")
  names(Chaos)<-c("Album", "Year", "Rating")
  names(Roots)<-c("Album", "Year", "Rating")
  names(Against)<-c("Album", "Year", "Rating")
  names(Nation)<-c("Album", "Year", "Rating")
  names(Dante)<-c("Album", "Year", "Rating")
  names(Alex)<-c("Album", "Year", "Rating")
  names(Kairos)<-c("Album", "Year", "Rating")
  names(Mediator)<-c("Album", "Year", "Rating")

  SepMax<-rbind(morbid, Schizo, Remains, Arise, Chaos, Roots)
  SepMax$Band<-"Sepultura Max"
  SepMax$Max<-"Max"
  
  SepNoMax<-rbind(Against, Nation, Dante, Alex, Kairos, Mediator)
  SepNoMax$Band<-"Sepultura No Max"
  SepNoMax$Max<-"No Max"
  
  
  
  soulfly<-c(rep(1,8), rep(2, 9), rep(3, 4), rep(4, 16), rep(5, 89))
  primitive<-c(rep(1,11), rep(2, 5), rep(3, 5), rep(4, 19), rep(5, 53))
  three<-c(rep(1,1), rep(2, 10), rep(3, 12), rep(4, 7), rep(5, 19))
  prophecy<-c(rep(1,2), rep(2, 5), rep(3, 5), rep(4, 25), rep(5, 42))
  darkages<-c(rep(1,1), rep(2, 1), rep(3, 5), rep(4, 18), rep(5, 36))
  conquer<-c(rep(1,1), rep(2, 0), rep(3, 5), rep(4, 5), rep(5, 31))
  omen<-c(rep(1,0), rep(2, 2), rep(3, 1), rep(4, 6), rep(5, 17))
  enslaved<-c(rep(1,1), rep(2,1), rep(3, 4), rep(4, 2), rep(5, 30))
  savages<-c(rep(1,0), rep(2, 2), rep(3, 3), rep(4, 10), rep(5, 27))
  archangel<-c(rep(1,3), rep(2, 2), rep(3, 4), rep(4, 7), rep(5, 21))

  
  soulfly<-data.frame(rep("Soulfly", length(soulfly)), rep(1998, length(soulfly)), soulfly)
  primitive<-data.frame(rep("Primitive", length(primitive)), rep(2000, length(primitive)), primitive)
  three<-data.frame(rep("Three", length(three)), rep(2002, length(three)), three)
  prophecy<-data.frame(rep("Prophecy", length(prophecy)), rep(2004, length(prophecy)), prophecy)
  darkages<-data.frame(rep("Darkages", length(darkages)), rep(2005, length(darkages)), darkages)
  conquer<-data.frame(rep("Conquer", length(conquer)), rep(2008, length(conquer)), conquer)
  omen<-data.frame(rep("Omen", length(omen)), rep(2010, length(omen)), omen)
  enslaved<-data.frame(rep("Enslaved", length(enslaved)), rep(2012, length(enslaved)), enslaved)
  savages<-data.frame(rep("Savages", length(savages)), rep(2013, length(savages)), savages)
  archangel<-data.frame(rep("Archangel", length(archangel)), rep(2015, length(archangel)), archangel)
  
  names(soulfly)<-c("Album", "Year", "Rating")
  names(primitive)<-c("Album", "Year", "Rating")
  names(three)<-c("Album", "Year", "Rating")
  names(prophecy)<-c("Album", "Year", "Rating")
  names(darkages)<-c("Album", "Year", "Rating")
  names(conquer)<-c("Album", "Year", "Rating")
  names(omen)<-c("Album", "Year", "Rating")
  names(enslaved)<-c("Album", "Year", "Rating")
  names(savages)<-c("Album", "Year", "Rating")
  names(archangel)<-c("Album", "Year", "Rating")

  
  
  Soulfly<-rbind(soulfly, primitive, three, prophecy, darkages, conquer, omen, enslaved, savages, archangel)
  Soulfly$Band<-"Soulfly"
  Soulfly$Max<-"Max"
  
  
  maxvsnomax<-rbind(SepMax, SepNoMax, Soulfly)
  maxvsnomax$Band<-factor(maxvsnomax$Band)
  maxvsnomax$Max<-factor(maxvsnomax$Max)
  maxvsnomax$Album<-factor(maxvsnomax$Album)
  
  sepvssoul<-subset(maxvsnomax, Band != "Sepultura Max")
  sepvssoul$Band<-factor(sepvssoul$Band)
  sepvssoul$Album<-factor(sepvssoul$Album)
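
If you rebuild the data yourself, a couple of quick checks confirm that it matches the numbers in the model outputs above (1534 ratings across 22 albums for maxvsnomax, 898 across 16 for sepvssoul):

  # Sanity checks on the assembled data frames
  nrow(maxvsnomax); nlevels(maxvsnomax$Album)
  nrow(sepvssoul); nlevels(sepvssoul$Album)
  xtabs(~Band, maxvsnomax)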

Graphs in this post

library(ggplot2)

  m <- ggplot(sepvssoul, aes(x = Rating, fill = Band, colour = Band))
  m + geom_histogram(binwidth = 1, alpha = .2, position = "identity") + coord_cartesian(xlim = c(0, 6)) + scale_y_continuous(breaks = seq(0, 400, 50)) + labs(x = "Rating", y = "Frequency", colour = "Band", fill = "Band") + theme_bw()


  # Note: mean_cl_boot needs the Hmisc package installed
  line <- ggplot(sepvssoul, aes(Year, Rating, colour = Band))
  line + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + labs(x = "Year", y = "Mean Rating on Amazon.com") + coord_cartesian(ylim = c(1, 5)) + scale_x_continuous(breaks = seq(1998, 2015, 1)) + theme_bw() + stat_summary(fun.y = mean, geom = "line", aes(group = Band))

The referee’s a …..

I’m a bit late on this particular bandwagon, but it’s a busy time what with the start of term and finishing writing a textbook and all that. The textbook is also my excuse for having not written a blog for a year; well, that and the fact I rarely have anything interesting to say.

Anyway, those of you who follow football (or soccer for our US friends) will know the flak that referees get. I am a regular at the stadium of the team I’ve supported* since the age of 3, and like at most football stadiums I regularly hear the chant of ‘The referee’s a wanker’ echoing around the ground. Referees have a tough job: the game is fast, often faster than they are, players can be a bit cheaty, and we have the benefit of TV replays which they (for some utterly inexplicable reason) do not. Nevertheless it’s annoying when they get stuff wrong.

The more specific referee-related thing I often hear resonating around the particular stadium that I attend is ‘Oh dear me, Mike Dean is refereeing … he hates us quite passionately’, or something similar with a lot more swearing. This particular piece of folk wisdom reached dizzy new heights the weekend before last when he managed to ignore Diego Costa trying to superglue his hands to Laurent Koscielny’s face shortly before chest bumping him to the floor, and then sent Gabriel off for … well, I’m not really sure what. Indeed, the FA weren’t really sure what for either, because they rescinded the red card and banned Costa for 3 matches. You can read the details here.

So, does Mike Dean really hate Arsenal? In another blog 7amkickoff tried to answer this question using data collated about Mike Dean. You can read it here. This blog has inspired me to go one step further and try to answer this question as a scientist would.

How does a scientist decide if Mike Dean really is a wanker?

Scientists try to work with scientific statements. ‘Mike Dean is a wanker’ may be the folklore at the Emirates, but it is not a scientific statement because we probably can’t agree on a working definition of what a wanker is (at least not in the refereeing sense). However, we can turn it into a scientific statement by changing the question to something like ‘Do Arsenal lose more games when Mike Dean (MD) referees?’ This is what 7amkickoff tried to do. 7amkickoff presented data from 2009-2015 and concluded that ‘Arsenal’s win% in the Premier League since 2009 is 57% so a 24% win rate is quite anomalous’ (the 24% is the win rate under Dean). There are several problems with this simplistic view, a couple of which are:
  1. MD tends to referee Arsenal’s tougher games (opponents such as Chelsea, Manchester United and Manchester City), so we need to compare Arsenal’s win rate under MD to our win rate against only the same opponents as in the MD games. (In case you’re interested the full list of clubs is Birmingham, Blackburn, Burnley, C Palace, Charlton, Chelsea, Fulham, Man City, Man Utd, Newcastle, QPR, Stoke City, Tottenham, Watford, West Brom, West Ham, Wigan). If we compare against a broader set of opponents then it isn’t a fair comparison to MD: we are not comparing like for like.
  2.  How do we know that a win rate of 24% under MD isn’t statistically comparable to a win rate of 57%? They seem different, but couldn’t such a difference simply happen because of the usual random fluctuations in results? Or indeed because Arsenal are just bad at winning against ‘bigger’ teams and MD happens to officiate those games (see the point above)?
I’m going to use the same database as 7amkickoff and extend this analysis. I collected data from football-lineups.com for the past 10 years (2005-6 up to the Chelsea fixture on 19th Sept 2015). I am only looking at English Premier League (EPL) games. I have filtered the data to include only the opponents mentioned above, so we’ll get a fair comparison of Arsenal’s results under MD against their results against those same clubs when MD is not officiating. The data are here as a CSV file.
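(As an aside, if you were starting from an unfiltered fixture list rather than the CSV above, the filtering might look something like the sketch below. This assumes you have already read the full list into a data frame called arse with columns Opponent and MD, as in the code that follows; the linked CSV is already filtered, so you don’t need to run this.)
# Hypothetical filtering: keep only fixtures against opponents who appear in MD's games
mdOpponents <- unique(arse$Opponent[arse$MD == "MD"])
arse <- subset(arse, Opponent %in% mdOpponents)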
A scientist essentially sets up two competing hypotheses. The first is that there is no effect of MD, that is Arsenal’s results under MD are exactly the same as under other referees. This is known as the null hypothesis. The second is the opposite: that there is an effect of MD, that is Arsenal’s results under MD are different compared to other referees. This is known as the alternative hypothesis. We can use statistical tests to compare these hypotheses.

A frequentist approach

To grab the file into R run:
arse<-read.csv(file.choose(), header = T)
This creates a data frame called arse. 
If we look at the results of all EPL arsenal fixtures against the aforementioned teams since 2005, we get this, by running the following code in R:
MD.tab <- xtabs(~MD+result, data=arse)
MD.tab
   result
       Draw Lose Win
Other   33   43  115
MD      14   11   9
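If you want those as percentages (the win/draw/loss rates I quote below), prop.table() on the rows of that table gives them directly:
# Row percentages: result profile under other referees vs under MD
round(100 * prop.table(MD.tab, 1), 2)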
One way to test whether the profile of results is different for MD is to see whether there is a significant relationship between the variable ‘MD vs Other’ and ‘result’. This asks the question of whether the profile of draws, losses and wins is statistically different for MD than for ‘other’. If we assume that games are independent events (which is a simplifying assumption because they won’t be) we can use a chi-square test.
You can fit this model in R by running:
library(gmodels)
CrossTable(MD.tab, fisher = TRUE, chisq = TRUE, sresid = TRUE, format = "SPSS")

Total Observations in Table:  225 

             | result 
          MD |     Draw  |     Lose  |      Win  | Row Total | 
-------------|-----------|-----------|-----------|-----------|
           0 |       33  |       43  |      115  |      191  | 
             |    1.193  |    0.176  |    0.901  |           | 
             |   17.277% |   22.513% |   60.209% |   84.889% | 
             |   70.213% |   79.630% |   92.742% |           | 
             |   14.667% |   19.111% |   51.111% |           | 
             |   -1.092  |   -0.419  |    0.949  |           | 
-------------|-----------|-----------|-----------|-----------|
           1 |       14  |       11  |        9  |       34  | 
             |    6.699  |    0.988  |    5.061  |           | 
             |   41.176% |   32.353% |   26.471% |   15.111% | 
             |   29.787% |   20.370% |    7.258% |           | 
             |    6.222% |    4.889% |    4.000% |           | 
             |    2.588  |    0.994  |   -2.250  |           | 
-------------|-----------|-----------|-----------|-----------|
Column Total |       47  |       54  |      124  |      225  | 
             |   20.889% |   24.000% |   55.111% |           | 
-------------|-----------|-----------|-----------|-----------|

Statistics for All Table Factors

Pearson’s Chi-squared test 
------------------------------------------------------------
Chi^2 =  15.01757     d.f. =  2     p =  0.0005482477 

Fisher’s Exact Test for Count Data
------------------------------------------------------------
Alternative hypothesis: two.sided
p =  0.0005303779 


       Minimum expected frequency: 7.102222 

The win rate under MD is 26.47%, with 41.18% draws and 32.35% losses; under other referees the figures are 60.21% wins, 17.28% draws and 22.51% losses. These are the comparable values to 7amkickoff’s, but looking at 10 years and including only opponents that feature in MD games. The critical part of the output is the p = .0005. This is the probability that we would get the chi-square value we have IF the null hypothesis were true. In other words, if MD had no effect whatsoever on Arsenal’s results, the probability of getting a test statistic of 15.02 would be 0.000548. In other words, very small indeed. Scientists do this sort of thing all of the time, and generally accept that if p is less than .05 then this supports the alternative hypothesis. That is, we can assume it’s true. (In actual fact, although scientists do this they shouldn’t, because this probability value tells us nothing about either hypothesis, but that’s another issue …) Therefore, by conventional scientific method, we would accept that the profile of results under MD is significantly different than under other referees. Looking at the values in the cells of the table, we can actually see that the profile is significantly worse under MD than under other referees in comparable games. Statistically speaking, MD is a wanker (if you happen to support Arsenal).
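
Incidentally, if you don’t want the gmodels dependency, base R will give you the same two tests on that table:

chisq.test(MD.tab)   # Pearson's chi-square, as reported above
fisher.test(MD.tab)  # Fisher's exact test, as reported above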

As I said though, we made simplifying assumptions. Let’s look at the raw results, and this time factor in the fact that results with the same opponents will be more similar than results involving different opponents. That is, we can assume that Arsenal’s results against Chelsea will be more similar to each other than they will be to results against a different club like West Ham. What we do here is nest results within opponents. In doing so, we statistically model the fact that results against the same opposition will be similar to each other. This is known as a logistic multilevel model. What I am doing is predicting the outcome of winning the game (1 = win, 0 = not win) from whether MD refereed (1 = MD, 0 = other). I have fitted a random intercepts model, which says overall results will vary across different opponents, but also a random slopes model which entertains the possibility that the effect of MD might vary by opponent. The R code is this:

library(lme4)
win.ri<-glmer(win ~ 1 + (1|Opponent), data = arse, family = binomial, na.action = na.exclude)
win.md<-update(win.ri, .~. + MD)
win.rs<-update(win.ri, .~ MD + (MD|Opponent))
anova(win.ri, win.md, win.rs)

The summary of models shows that the one with the best fit is random intercepts and MD as a predictor. I can tell this because the model with MD as a predictor significantly improves the fit of the model (the p of .0024 is very small) but adding in the random slopes component barely improves the fit of the model at all (the p of .992 is very close to 1).

Data: arse
Models:
win.ri: win ~ 1 + (1 | Opponent)
win.md: win ~ (1 | Opponent) + MD
win.rs: win ~ MD + (MD | Opponent)
       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
win.ri  2 302.41 309.24 -149.20   298.41                            
win.md  3 295.22 305.47 -144.61   289.22 9.1901      1   0.002433 **
win.rs  5 299.20 316.28 -144.60   289.20 0.0153      2   0.992396  

If we look at the best fitting model (win.md) by running:

summary(win.md)

We get this:

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [‘glmerMod’]
 Family: binomial  ( logit )
Formula: win ~ (1 | Opponent) + MD
   Data: arse

     AIC      BIC   logLik deviance df.resid 
   295.2    305.5   -144.6    289.2      222 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6052 -0.7922  0.6046  0.7399  2.4786 

Random effects:
 Groups   Name        Variance Std.Dev.
 Opponent (Intercept) 0.4614   0.6792  
Number of obs: 225, groups:  Opponent, 17

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   0.5871     0.2512   2.337  0.01943 * 
MDMD         -1.2896     0.4427  -2.913  0.00358 **

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr)
MDMD -0.225

The important thing is whether the variable MD (which represents MD vs. other referees) is a ‘significant predictor’. Looking at the p-value of .00358, we can assess the probability that we would get an estimate as large as -1.2896 if the null hypothesis were true (i.e. if MD had no effect at all on Arsenal’s results). Again, because this value is less than .05, scientists would typically conclude that MD is a significant predictor of whether Arsenal win or not, even factoring in dependency between results when the opponent is the same. The value of the estimate (-1.2896) tells us about the strength and direction of the relationship, and because our predictor variable is MD = 1, other = 0, our outcome is win = 1 and no win = 0, and the value is negative, it means that as MD increases (in other words as we move from having ‘other’ as a referee to having MD as a referee) the outcome decreases (that is, the probability of winning decreases). So, this analysis shows that MD significantly decreases the probability of Arsenal winning. The model is more complex but the conclusion is the same as the previous, simpler, model: statistically speaking, MD is a wanker (if you happen to support Arsenal).
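
To put that estimate on a more interpretable scale, you can exponentiate it to get an odds ratio, or turn the fixed effects into predicted win probabilities for a ‘typical’ opponent (one whose random intercept is zero). This is just back-of-the-envelope arithmetic on the estimates above:

exp(-1.2896)              # odds ratio of roughly 0.28: the odds of a win under MD are about a quarter of those under other referees
plogis(0.5871)            # predicted probability of a win under other referees, roughly .64
plogis(0.5871 - 1.2896)   # predicted probability of a win under MD, roughly .33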

A Bayesian Approach

The frequentist approach above isn’t universally accepted because these probability values that I keep mentioning don’t allow us to test directly the plausibility of the null and alternative hypotheses. A different approach is to use a Bayes Factor. Bayes Factors quantify the ratio of the probability of the data under the alternative hypothesis to the probability of the data under the null. A value of 1, therefore, means that the observed data are equally probable under the null and alternative hypotheses; values below 1 suggest that the data are more probable under the null hypothesis relative to the alternative. Bayes Factors provide information about the probability of the data under the null hypothesis (relative to the alternative), whereas significance tests (as above) provide no evidence at all about the status of the null hypothesis. Bayes Factors are pretty easy to compute using Richard Morey’s excellent BayesFactor package in R (Morey & Rouder, 2014). There’s a lot more technical stuff to consider, which I won’t get into … for example, to do this properly we really need to set some prior belief in our hypothesis. However, Morey’s package uses some default priors that represent relatively open-minded beliefs (about MD in this case!).
To get a Bayes Factor for our original contingency table you’d run:
library(BayesFactor)
bayesMD = contingencyTableBF(MD.tab, sampleType = "indepMulti", fixedMargin = "cols")
bayesMD
and get this output:
Bayes factor analysis
--------------
[1] Non-indep. (a=1) : 34.04903 ±0%

Against denominator:
  Null, independence, a = 1 
Bayes factor type: BFcontingencyTable, independent multinomial


The resulting Bayes Factor of 34.05 is a lot bigger than 1. The fact it is bigger than 1 suggests that the probability of the data under the alternative hypothesis is 34 times greater than the probability of the data under the null. In other words, we should shift our prior beliefs towards the idea that MD is a wanker by a factor of about 34. Yet again, statistically speaking, MD is a wanker (if you happen to support Arsenal).
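
To make ‘shift our prior beliefs … by a factor of about 34’ concrete: a Bayes Factor multiplies whatever prior odds you started with to give posterior odds. A toy example (the prior odds here are made up purely for illustration):

priorOdds <- 1        # suppose you started out 50:50 on whether MD makes any difference
priorOdds * 34.05     # posterior odds in favour of the MD effect, given these data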

Conclusion

The aim of this blog is to take a tongue-in-cheek look at how we can apply statistics to a real-world question that vexes a great many people. (OK, the specific Mike Dean question only vexes Arsenal fans, but football fans worldwide are vexed by referees on a regular basis.) With some publicly available data you can test these hypotheses, and reach conclusions based on data rather than opinion. That’s the beauty of statistical models.
I’ll admit that to the casual reader this blog is going to be a lot more tedious than 7amkickoff’s. However, what I lack in readability I gain in applying some proper stats to the data. Whichever way you look at it, Mike Dean, statistically speaking, does predict poorer outcomes for Arsenal. There could be any number of other explanations for these data, but you’d think that the FA might want to have a look at that. Arsenal’s results under Mike Dean are significantly different to those under other referees, and these models show that this is more than simply the random fluctuations in results that happen with any sports team.

On a practical note, next time I’m at the Emirates and someone says ‘Oh shit, it’s Mike Dean, he hates us …’ I will be able to tell him or her confidently that, statistically speaking, they have a point … well, at least in the sense that Arsenal are significantly more likely to lose when he referees!

————-
*Incidentally, I’m very much of the opinion that people are perfectly entitled to support a different team to me, and that the football team you support isn’t a valid basis for any kind of negativity. If we all supported the same team it would be dull, and if everyone supported ‘my’ team, it’d be even harder for me to get tickets to games. So, you know, let’s be accepting of our differences and all that ….