I’m a bit late on this particular bandwagon, but it’s a busy time what with the start of term and finishing writing a textbook and all that. The textbook is also my excuse for having not written a blog for a year, well, that and the fact I rarely have anything interesting to say.
Anyway, those of you who follow football (or soccer for our US friends) will know the flak that referees get. I am a regular at the stadium of the team I’ve supported* since the age of 3 and, as at most football stadiums, I regularly hear the chant of ‘The referee’s a wanker’ echoing around the ground. Referees have a tough job: the game is fast, often faster than they are, players can be a bit cheaty, and we have the benefit of TV replays which they (for some utterly inexplicable reason) do not. Nevertheless it’s annoying when they get stuff wrong.
The more specific referee-related thing I often hear resonating around the particular stadium that I attend is ‘Oh dear me, Mike Dean is refereeing … he hates us quite passionately’ or something similar with a lot more swearing. This particular piece of folk wisdom reached dizzy new heights the weekend before last when he managed to ignore Diego Costa trying to superglue his hands to Laurent Koscielny’s face shortly before chest bumping him to the floor, and then sent Gabriel off for … well, I’m not really sure what. Indeed, the FA weren’t really sure what for either because they rescinded the red card and banned Costa for 3 matches. You can read the details here.
So, does Mike Dean really hate Arsenal? In another blog 7amkickoff tried to answer this question using data collated about Mike Dean. You can read it here. This blog has inspired me to go one step further and try to answer this question as a scientist would.
How does a scientist decide if Mike Dean really is a wanker?
- MD tends to referee Arsenal’s tougher games (opponents such as Chelsea, Manchester United and Manchester City), so we need to compare Arsenal’s win rate under MD to our win rate against only the same opponents as in the MD games. (In case you’re interested the full list of clubs is Birmingham, Blackburn, Burnley, C Palace, Charlton, Chelsea, Fulham, Man City, Man Utd, Newcastle, QPR, Stoke City, Tottenham, Watford, West Brom, West Ham, Wigan). If we compare against a broader set of opponents then it isn’t a fair comparison to MD: we are not comparing like for like.
- How do we know that a win rate of 24% under MD isn’t statistically comparable to a win rate of 57%? They seem different, but couldn’t such a difference simply happen because of the usual random fluctuations in results? Or indeed because Arsenal are just bad at winning against ‘bigger’ teams and MD happens to officiate those games (see the point above)?
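The like-for-like comparison described in the first point can be sketched in R. This is only an illustration: the data frame `results` and its handful of rows are invented to show the filtering step, and the column names (`Opponent`, `MD`, `win`) are assumptions chosen to mirror the variables used in the models later on.

```r
# Toy data, invented purely to illustrate the filtering step (NOT the real results)
results <- data.frame(
  Opponent = c("Chelsea", "Chelsea", "Stoke City", "Stoke City", "Everton", "Everton"),
  MD       = c(1, 0, 1, 0, 0, 0),   # 1 = Mike Dean refereed, 0 = another referee
  win      = c(0, 1, 0, 1, 1, 1)    # 1 = Arsenal won, 0 = draw or loss
)

# Opponents that appear in at least one MD game
md_opponents <- unique(results$Opponent[results$MD == 1])

# Keep only games against those opponents, whoever refereed them
comparable <- results[results$Opponent %in% md_opponents, ]

# Win rate by referee within the like-for-like subset
win_rates <- tapply(comparable$win, comparable$MD, mean)
print(win_rates)
```

Games against opponents MD has never refereed (Everton in this toy example) drop out, so the comparison group faces the same mix of opposition as the MD games.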
A frequentist approach
Total Observations in Table: 225
MD | Cell statistic | Draw | Lose | Win | Row Total |
0 | N | 33 | 43 | 115 | 191 |
| Chi-square contribution | 1.193 | 0.176 | 0.901 | |
| N / Row Total | 17.277% | 22.513% | 60.209% | 84.889% |
| N / Col Total | 70.213% | 79.630% | 92.742% | |
| N / Table Total | 14.667% | 19.111% | 51.111% | |
| Std Residual | -1.092 | -0.419 | 0.949 | |
1 | N | 14 | 11 | 9 | 34 |
| Chi-square contribution | 6.699 | 0.988 | 5.061 | |
| N / Row Total | 41.176% | 32.353% | 26.471% | 15.111% |
| N / Col Total | 29.787% | 20.370% | 7.258% | |
| N / Table Total | 6.222% | 4.889% | 4.000% | |
| Std Residual | 2.588 | 0.994 | -2.250 | |
Column Total | N | 47 | 54 | 124 | 225 |
| N / Table Total | 20.889% | 24.000% | 55.111% | |
Statistics for All Table Factors
Pearson’s Chi-squared test
Chi^2 = 15.01757 d.f. = 2 p = 0.0005482477
Fisher’s Exact Test for Count Data
Alternative hypothesis: two.sided
p = 0.0005303779
Minimum expected frequency: 7.102222
The win rate under MD is 26.47%, compared to 41.18% draws and 32.35% losses; under other referees the figures are 60.21% wins, 17.28% draws and 22.51% losses. These are the comparable values to 7amkickoff’s, but looking at 10 years and including only opponents that feature in MD games. The critical part of the output is the p = .0005. This is the probability that we would get a chi-square value at least as large as the one we have IF the null hypothesis were true. In other words, if MD had no effect whatsoever on Arsenal’s results, the probability of getting a test result of 15.02 or more would be 0.000548. That is very small indeed. Scientists do this sort of thing all of the time, and generally accept that if p is less than .05 then this supports the alternative hypothesis; that is, we can assume it’s true. (In actual fact, although scientists do this they shouldn’t because this probability value tells us nothing about either hypothesis, but that’s another issue …) Therefore, by conventional scientific method, we would accept that the profile of results under MD is significantly different from the profile under other referees. Looking at the values in the cells of the table, we can actually see that the profile is significantly worse under MD than other referees in comparable games: the residuals in the MD row show many fewer wins (-2.250) and many more draws (2.588) than we would expect if the referee made no difference. Statistically speaking, MD is a wanker (if you happen to support Arsenal).
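For anyone who wants to check the numbers, here is a minimal sketch that rebuilds the test from the six counts in the table above using only base R. The object names (`results_table`, `chi`, `fisher`) are mine, and this is not necessarily how the original output was produced; it just reproduces the same statistics.

```r
# Counts taken from the cross-tabulation above
# Rows: other referees (MD = 0) and Mike Dean (MD = 1); columns: Draw, Lose, Win
results_table <- matrix(
  c(33, 43, 115,
    14, 11,   9),
  nrow = 2, byrow = TRUE,
  dimnames = list(MD = c("0", "1"), Result = c("Draw", "Lose", "Win"))
)

chi    <- chisq.test(results_table)   # Pearson's chi-squared test
fisher <- fisher.test(results_table)  # Fisher's exact test (two-sided)

chi$statistic              # chi-square value
chi$parameter              # degrees of freedom
chi$p.value                # p-value for the chi-square test
min(chi$expected)          # minimum expected frequency
round(chi$residuals, 3)    # Pearson residuals: (observed - expected) / sqrt(expected)
round(100 * prop.table(results_table, margin = 1), 2)  # row percentages
fisher$p.value
```

Residuals beyond roughly ±1.96 point to the cells that drive the overall result, which here are the draws and wins in the MD row.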
As I said though, we made simplifying assumptions. Let’s look at the raw results, and this time factor in the fact that results with the same opponents will be more similar than results involving different opponents. That is, we can assume that Arsenal’s results against Chelsea will be more similar to each other than they will be to results against a different club like West Ham. What we do here is nest results within opponents. In doing so, we statistically model the fact that results against the same opposition will be similar to each other. This is known as a logistic multilevel model. What I am doing is predicting the outcome of winning the game (1 = win, 0 = not win) from whether MD refereed (1 = MD, 0 = other). I have fitted a random intercepts model, which says overall results will vary across different opponents, but also a random slopes model which entertains the possibility that the effect of MD might vary by opponent. The R code is this:
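library(lme4)  # provides glmer()
# Baseline: a random intercept for each opponent, no predictors
win.ri <- glmer(win ~ 1 + (1|Opponent), data = arse, family = binomial, na.action = na.exclude)
# Add MD (Mike Dean vs. other referees) as a fixed effect
win.md <- update(win.ri, .~. + MD)
# Also let the effect of MD vary by opponent (random slopes), then compare all three
win.rs <- update(win.ri, .~ MD + (MD|Opponent))
anova(win.ri, win.md, win.rs)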
The summary of models shows that the one with the best fit is random intercepts and MD as a predictor. I can tell this because the model with MD as a predictor significantly improves the fit of the model (the p of .0024 is very small) but adding in the random slopes component hardly improves the fit of the model at all (the p of .992 is very close to 1).
win.ri: win ~ 1 + (1 | Opponent)
win.md: win ~ (1 | Opponent) + MD
win.rs: win ~ MD + (MD | Opponent)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
win.ri 2 302.41 309.24 -149.20 298.41
win.md 3 295.22 305.47 -144.61 289.22 9.1901 1 0.002433 **
win.rs 5 299.20 316.28 -144.60 289.20 0.0153 2 0.992396
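The comparisons in the table above are likelihood ratio tests: the Chisq column is the drop in deviance between successive models, referred to a chi-square distribution with degrees of freedom equal to the number of extra parameters. A rough check using the rounded values printed above (so the results will differ slightly in the last decimal places):

```r
# Values copied from the anova() output above
deviance_ri <- 298.41; df_ri <- 2   # random intercepts only
deviance_md <- 289.22; df_md <- 3   # random intercepts + MD

# Likelihood ratio test for adding MD
lr_stat <- deviance_ri - deviance_md   # drop in deviance
lr_df   <- df_md - df_ri               # one extra parameter
p_md    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Random slopes step, using the Chisq value printed in the table (2 extra parameters)
p_rs <- pchisq(0.0153, df = 2, lower.tail = FALSE)

# AIC is the deviance plus twice the number of parameters
aic_md <- deviance_md + 2 * df_md

round(c(lr_stat = lr_stat, p_md = p_md, p_rs = p_rs, aic_md = aic_md), 4)
```

The model with MD also has the lowest AIC of the three, which tells the same story as the p-values.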
If we look at the best fitting model (win.md) by running:
summary(win.md)
We get this:
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [‘glmerMod’]
Family: binomial ( logit )
Formula: win ~ (1 | Opponent) + MD
AIC BIC logLik deviance df.resid
295.2 305.5 -144.6 289.2 222
Min 1Q Median 3Q Max
-1.6052 -0.7922 0.6046 0.7399 2.4786
Groups Name Variance Std.Dev.
Opponent (Intercept) 0.4614 0.6792
Number of obs: 225, groups: Opponent, 17
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5871 0.2512 2.337 0.01943 *
MDMD -1.2896 0.4427 -2.913 0.00358 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
The important thing is whether the variable MD (which represents MD vs. other referees) is a ‘significant predictor’. The p-value of .00358 is the probability that we would get an estimate at least as far from zero as -1.2896 if the null hypothesis were true (i.e. if MD had no effect at all on Arsenal’s results). Again, because this value is less than .05, scientists would typically conclude that MD is a significant predictor of whether Arsenal win or not, even factoring in dependency between results when the opponent is the same. The value of the estimate (-1.2896) tells us about the strength and direction of the relationship. Because our predictor variable is MD = 1, other = 0, our outcome is win = 1 and no win = 0, and the value is negative, it means that as MD increases (in other words as we move from having ‘other’ as a referee to having MD as a referee) the outcome decreases (that is, the probability of winning decreases). So, this analysis shows that MD significantly decreases the probability of Arsenal winning. The model is more complex but the conclusion is the same as the previous, simpler, model: statistically speaking, MD is a wanker (if you happen to support Arsenal).
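The estimate is on the log-odds scale, which is not very intuitive, so here is a small sketch that converts the coefficients printed above into an odds ratio and into win probabilities. The probabilities assume a ‘typical’ opponent (one whose random intercept is zero), and the confidence interval is a simple Wald interval that I have added for illustration; neither appears in the original output.

```r
# Fixed-effect estimates copied from the summary output above
b0    <- 0.5871    # intercept: log-odds of a win under other referees
b_md  <- -1.2896   # change in log-odds when MD referees
se_md <- 0.4427    # standard error of the MD estimate

# Odds ratio: how the odds of winning change when MD referees
odds_ratio <- exp(b_md)

# Win probabilities for a typical opponent (random intercept = 0, an assumption)
p_other <- plogis(b0)          # other referees
p_md    <- plogis(b0 + b_md)   # Mike Dean

# Wald z statistic and two-sided p-value, as reported in the output
z       <- b_md / se_md
p_value <- 2 * pnorm(-abs(z))

# Approximate 95% Wald confidence interval for the odds ratio
ci <- exp(b_md + c(-1, 1) * qnorm(0.975) * se_md)

round(c(odds_ratio = odds_ratio, p_other = p_other, p_md = p_md,
        z = z, p_value = p_value, ci_low = ci[1], ci_high = ci[2]), 3)
```

An odds ratio well below 1, with an interval that stays below 1, is another way of saying that the odds of an Arsenal win are lower when MD referees.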
A Bayesian Approach
The resulting Bayes Factor of 34.05 is a lot bigger than 1. The fact it is bigger than 1 suggests that the probability of the data under the alternative hypothesis is 34 times greater than the probability of the data under the null. In other words, we should shift our prior beliefs towards the idea that MD is a wanker by a factor of about 34. Yet again, statistically speaking, MD is a wanker (if you happen to support Arsenal).
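To make ‘shift our prior beliefs by a factor of about 34’ concrete: posterior odds are the prior odds multiplied by the Bayes factor. The sketch below applies the reported Bayes factor to three starting points. The prior odds are hypothetical values I have picked for illustration, not anything estimated from the data.

```r
bf_10 <- 34.05   # Bayes factor reported above (alternative vs. null)

# Hypothetical prior odds that MD affects Arsenal's results (assumptions, not data)
prior_odds <- c(sceptic = 1/10, neutral = 1, believer = 3)

# Bayes' rule on the odds scale: posterior odds = Bayes factor * prior odds
posterior_odds <- bf_10 * prior_odds

# Convert odds back to probabilities
posterior_prob <- posterior_odds / (1 + posterior_odds)

round(posterior_odds, 2)
round(posterior_prob, 3)
```

Even someone who started out thinking an MD effect was ten times less likely than no effect would end up with posterior odds in its favour.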
On a practical note, next time I’m at the Emirates and someone says ‘Oh shit, it’s Mike Dean, he hates us …’ I will be able to tell him or her confidently that statistically speaking they have a point … well, at least in the sense that Arsenal are significantly less likely to win when he referees!
*Incidentally, I’m very much of the opinion that people are perfectly entitled to support a different team to me, and that the football team you support isn’t a valid basis for any kind of negativity. If we all supported the same team it would be dull, and if everyone supported ‘my’ team, it’d be even harder for me to get tickets to games. So, you know, let’s be accepting of our differences and all that ….